Spot ML
SpotML Config File
SpotML needs a spotml.yaml configuration file in the root folder of the project. An example configuration file is shown below:
project:
  name: mnist
  maxIdleMinutes: 15
  syncFilters:
    - exclude:
        - .git/*
        - .idea/*
        - '*/__pycache__/*'
containers:
  - &DEFAULT_CONTAINER
    projectDir: /workspace/project
    # file: docker/Dockerfile
    image: tensorflow/tensorflow:latest-py3
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    env:
      PYTHONPATH: /workspace/project
    ports:
      # tensorboard
      - containerPort: 6006
        hostPort: 6006
      # jupyter
      - containerPort: 8888
        hostPort: 8888
instances:
  - name: aws-1
    provider: aws
    parameters:
      region: us-east-1
      instanceType: t2.large
      spotStrategy: on-demand
      ports: [6006, 6007, 8888]
      rootVolumeSize: 125
      volumes:
        - name: workspace
          parameters:
            size: 50
scripts:
  train: |
    python train.py
  tensorboard: |
    tensorboard --port 6006 --logdir results/
  jupyter: |
    CUDA_VISIBLE_DEVICES="" jupyter notebook --allow-root --ip 0.0.0.0

Project Parameters

name

Name of the project. This name is used as a prefix for all the AWS resources created.
project:
  name: mnist

maxIdleMinutes

Maximum idle time, in minutes, after which the instance is automatically shut down. Set this to 0 to turn off idle-time checking.
project:
  name: mnist
  maxIdleMinutes: 15
SpotML periodically (every 5 minutes) checks instances for idle time. It tracks this by checking whether the Docker container has any actively running commands and whether there was any tty (keyboard) activity. If it finds no activity and no running commands for more than maxIdleMinutes, it terminates the instance.
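The idle rule described above can be sketched in a few lines of Python. This is only an illustration of the decision, not SpotML's actual implementation: the ActivitySnapshot class, its fields, and the should_terminate function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class ActivitySnapshot:
    """Hypothetical summary of what one periodic check observed."""
    running_commands: int  # commands still executing in the container
    idle_minutes: float    # minutes since the last command or tty activity


def should_terminate(snapshot: ActivitySnapshot, max_idle_minutes: int) -> bool:
    """Return True when the instance has been idle for too long."""
    if max_idle_minutes == 0:
        # A value of 0 turns idle-time checking off.
        return False
    if snapshot.running_commands > 0:
        # A running command means the instance is still in use.
        return False
    return snapshot.idle_minutes > max_idle_minutes
```

With maxIdleMinutes: 15, an instance with no running commands and 20 idle minutes would be terminated under this sketch, while one that is still running train.py would not.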

syncFilters (optional)

By default, SpotML syncs all the files and folders in the project directory. Use syncFilters to exclude files you don't want to sync to the instance.
project:
  name: mnist
  syncFilters:
    - exclude:
        - .git/*
        - .idea/*
        - '*/__pycache__/*'
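To get a feel for which files the exclude patterns above would skip, here is a rough Python sketch that uses the standard library's fnmatch module. It assumes shell-style glob matching against paths relative to the project directory; SpotML's real matching rules may differ, and is_excluded and files_to_sync are hypothetical helpers.

```python
from fnmatch import fnmatch

# The exclude patterns from the example configuration.
EXCLUDE_PATTERNS = [".git/*", ".idea/*", "*/__pycache__/*"]


def is_excluded(relative_path, patterns=EXCLUDE_PATTERNS):
    """Return True if the path matches any exclude pattern (assumed glob semantics)."""
    return any(fnmatch(relative_path, pattern) for pattern in patterns)


def files_to_sync(paths):
    """Keep only the paths that no exclude pattern matches."""
    return [path for path in paths if not is_excluded(path)]


if __name__ == "__main__":
    candidates = [
        "train.py",
        ".git/config",
        "src/__pycache__/model.cpython-38.pyc",
        "src/model.py",
    ]
    print(files_to_sync(candidates))
```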

Container Parameters

image

Specify the docker image to use to launch the container. This works for simple cases where you don't need a custom Dockerfile with anything else installed.
containers:
  - &DEFAULT_CONTAINER
    projectDir: /workspace/project
    image: tensorflow/tensorflow:latest-py3

file (if above image is not specified)

When you need a custom Dockerfile, specify the path to it here instead of setting image.
containers:
  - &DEFAULT_CONTAINER
    projectDir: /workspace/project
    file: docker/Dockerfile
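The relationship between image and file can be sketched as a small Python helper. The precedence shown (image first, file only when image is not specified) is inferred from the heading above and is an assumption about SpotML's behavior; container_source is a hypothetical function name.

```python
def container_source(container):
    """Decide how a container would be built from its config entry.

    Returns a ("image", name) or ("file", path) tuple.
    Assumption: image takes precedence, file is the fallback.
    """
    image = container.get("image")
    dockerfile = container.get("file")
    if image:
        return ("image", image)
    if dockerfile:
        return ("file", dockerfile)
    raise ValueError("container needs either 'image' or 'file'")


if __name__ == "__main__":
    print(container_source({"projectDir": "/workspace/project",
                            "file": "docker/Dockerfile"}))
```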

env

Environment variables available in the container.
containers:
  - &DEFAULT_CONTAINER
    env:
      PYTHONPATH: /workspace/project

ports

Ports that should be exposed in the container and the host instance so that you can access apps like jupyter notebook from your browser.
containers:
  - &DEFAULT_CONTAINER
    ports:
      # tensorboard
      - containerPort: 6006
        hostPort: 6006
      # jupyter
      - containerPort: 8888
        hostPort: 8888
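Each ports entry pairs a port inside the container with a port on the host. As an illustration, the Python sketch below turns such entries into the host:container form that docker run accepts with its -p option. Whether SpotML invokes Docker exactly this way is an assumption, and publish_flags is a hypothetical helper.

```python
def publish_flags(ports):
    """Build docker-style '-p host:container' arguments from port mappings."""
    flags = []
    for mapping in ports:
        flags += ["-p", f"{mapping['hostPort']}:{mapping['containerPort']}"]
    return flags


if __name__ == "__main__":
    example_ports = [
        {"containerPort": 6006, "hostPort": 6006},  # tensorboard
        {"containerPort": 8888, "hostPort": 8888},  # jupyter
    ]
    print(" ".join(publish_flags(example_ports)))
```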

Instance Parameters

name

An identifier for the AWS resources created for this instance. This name is used as a prefix in all the AWS resources created.
instances:
  - name: aws-1
provider

Currently, aws is the only supported provider.
instances:
  - name: aws-1
    provider: aws

parameters

instances:
  - name: aws-1
    provider: aws
    parameters:
      region: us-east-1
      instanceType: t2.large
      spotStrategy: on-demand
      ports: [6006, 6007, 8888]
      rootVolumeSize: 125
region - The AWS region in which to create resources.
instanceType - The AWS instance type to launch.
spotStrategy - Either on-demand or spot.
  • on-demand - Launch an AWS on-demand instance.
  • spot - Launch a spot instance only.
ports - Ports to be exposed on the AWS instance.
rootVolumeSize - The size (in GB) of the root EBS volume attached to the instance. This volume is destroyed after the instance terminates.
volumes - The persistent EBS volumes to attach to the instance. These are not destroyed after the instance terminates; they are re-attached the next time the instance starts, so the data is preserved.
instances:
  - name: aws-1
    provider: aws
    parameters:
      volumes:
        - name: workspace
          parameters:
            size: 50
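A quick sanity check of the instance parameters before launching can catch simple mistakes, such as a misspelled spotStrategy. The Python sketch below is a hypothetical pre-flight check, not part of SpotML; the function name and the exact rules (only the two documented strategy values, positive integer volume sizes) are assumptions drawn from the descriptions above.

```python
ALLOWED_SPOT_STRATEGIES = {"on-demand", "spot"}


def check_instance_parameters(params):
    """Return a list of human-readable problems found in the parameters."""
    problems = []

    strategy = params.get("spotStrategy")
    if strategy is not None and strategy not in ALLOWED_SPOT_STRATEGIES:
        problems.append(f"spotStrategy must be 'on-demand' or 'spot', got {strategy!r}")

    # Each persistent volume is assumed to need a positive integer size.
    for volume in params.get("volumes", []):
        size = volume.get("parameters", {}).get("size")
        if not isinstance(size, int) or size <= 0:
            problems.append(f"volume {volume.get('name')!r} needs a positive size")

    return problems


if __name__ == "__main__":
    example = {
        "region": "us-east-1",
        "instanceType": "t2.large",
        "spotStrategy": "on-demand",
        "volumes": [{"name": "workspace", "parameters": {"size": 50}}],
    }
    print(check_instance_parameters(example))
```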

Script Parameters

This section defines the scripts that are used in SpotML managed runs.
scripts:
  train: |
    python train.py
  tensorboard: |
    tensorboard --port 6006 --logdir results/