Typical workflow
SpotML uses your AWS credentials to manage the runs for you. You can verify that your credentials are set up by checking that `~/.aws/credentials` contains something like the following:

```ini
[my-aws-profile]
aws_access_key_id = your_aws_access_key_id
aws_secret_access_key = your_aws_secret_access_key
```
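If you want to double-check that these credentials are actually picked up, a small boto3 sketch like the one below confirms which AWS account and IAM identity they resolve to. This is just an optional sanity check, not part of SpotML; boto3 and the `my-aws-profile` profile name are assumptions taken from the placeholder above.

```python
# Minimal sketch: confirm which AWS identity the local credentials resolve to.
# Assumes boto3 is installed (`pip install boto3`); the profile name below is
# the placeholder from the credentials file above -- substitute your own.
import boto3

session = boto3.Session(profile_name="my-aws-profile")
identity = session.client("sts").get_caller_identity()

print("Account:", identity["Account"])
print("IAM ARN:", identity["Arn"])
```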
Secondly, the IAM user associated with the above access key needs to have the permissions to create all of the AWS resources.

Install (or upgrade) SpotML with pip:

```bash
pip install spotml --upgrade
```
1. First, copy-paste the file below into your code's root folder as `spotml.yaml`:
```yaml
project:
  name: mnist
  syncFilters:
    - exclude:
        - .git/*
        - .idea/*
        - '*/__pycache__/*'

containers:
  - &DEFAULT_CONTAINER
    projectDir: /workspace/project
    image: tensorflow/tensorflow:latest-py3
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    env:
      PYTHONPATH: /workspace/project
    ports:
      # tensorboard
      - containerPort: 6006
        hostPort: 6006
      # jupyter
      - containerPort: 8888
        hostPort: 8888

instances:
  - name: aws-1
    provider: aws
    parameters:
      region: us-east-1
      instanceType: t2.large
      spotStrategy: on-demand
      ports: [6006, 6007, 8888]
      rootVolumeSize: 125
      volumes:
        - name: workspace
          parameters:
            size: 50

scripts:
  train: |
    python train.py
  tensorboard: |
    tensorboard --port 6006 --logdir results/
  jupyter: |
    CUDA_VISIBLE_DEVICES="" jupyter notebook --allow-root --ip 0.0.0.0
```
2. Change `instanceType` to the AWS instance type you want to run the code on.

```yaml
instanceType: t2.large
spotStrategy: on-demand
```
In the above `spotml.yaml`, notice that we've used a Docker image directly:

```yaml
image: tensorflow/tensorflow:latest-py3
```

But if you want to configure a custom Docker container, you can write your own `Dockerfile`. See an example setup here.

Then, run the below command:

```bash
spotml start
```
This will automatically do the following for you:
- Syncs your source code folder to S3
- Launches your instance (an on-demand or spot instance)
- Creates a persistent EBS volume and attaches it to the instance
- Sets up your Docker environment inside the AWS instance
- Pulls the code/data from S3 to the EBS volume attached to the instance
The resources created in your AWS account (EBS volume, EC2 instance, S3 bucket, security groups, etc.) will all have the prefix `spotml`. You should see an output like below once the instance is launched.
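Separately, if you want to verify on your own which resources were created in your account, a rough boto3 sketch along the following lines can list them. It assumes the EC2 instances and volumes carry a `Name` tag starting with `spotml` and that the S3 bucket name starts with `spotml`; SpotML's actual naming and tagging scheme may differ.

```python
# Rough sketch: list AWS resources whose names start with "spotml".
# Assumptions: boto3 is installed, EC2 instances/volumes carry a Name tag
# beginning with "spotml", and the S3 bucket name begins with "spotml".
import boto3

session = boto3.Session(profile_name="my-aws-profile", region_name="us-east-1")
ec2 = session.client("ec2")
s3 = session.client("s3")

name_filter = [{"Name": "tag:Name", "Values": ["spotml*"]}]

for reservation in ec2.describe_instances(Filters=name_filter)["Reservations"]:
    for instance in reservation["Instances"]:
        print("EC2 instance:", instance["InstanceId"], instance["State"]["Name"])

for volume in ec2.describe_volumes(Filters=name_filter)["Volumes"]:
    print("EBS volume:", volume["VolumeId"], volume["Size"], "GiB")

for bucket in s3.list_buckets()["Buckets"]:
    if bucket["Name"].startswith("spotml"):
        print("S3 bucket:", bucket["Name"])
```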

To SSH into the instance, run:

```bash
spotml sh
```
You should see a screen like below. Notice that your source code is already synced into the instance.

You can now run your training command here to kick-start a manual training run.
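As a reference point, the `train` script in the config above simply runs `python train.py`. A minimal hypothetical `train.py` for this MNIST example might look like the sketch below; it writes TensorBoard logs to `results/` so the `tensorboard` script in `spotml.yaml` (`--logdir results/`) can pick them up. This is only an illustration and not part of SpotML.

```python
# Hypothetical train.py sketch for the MNIST example above.
# Assumes TensorFlow with tf.keras (as in the tensorflow/tensorflow:latest-py3
# image) and writes TensorBoard logs to results/.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train,
          epochs=5,
          validation_data=(x_test, y_test),
          callbacks=[tf.keras.callbacks.TensorBoard(log_dir="results/")])

model.save("results/mnist_model.h5")
```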
If you have long-running training jobs that you only want to run on spot instances, it's a pain to manage them manually.
1. Configure the `scripts` section in your `spotml.yaml` file with the command you want to run:

```yaml
scripts:
  train: |
    python train.py
```
2. Change the `spotStrategy` to `on-demand`:

```yaml
spotStrategy: on-demand
```
3. Then run the below command to let SpotML automatically manage the run:

```bash
spotml run train
```
If the instance is not already running, SpotML tries to spawn a new spot instance and runs the above script once the instance is ready.

Note that if a spot instance is not available, the SpotML backend service keeps retrying every 15 minutes until it can spawn the instance. So you can turn off your laptop and do other things while SpotML tries to schedule the run.
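For intuition only, the retry behaviour described above is roughly analogous to a loop like the sketch below, which asks EC2 for a spot instance and backs off for 15 minutes when capacity isn't available. This is not SpotML's actual implementation (the real retries happen in SpotML's backend service, not on your machine), and the AMI ID is a placeholder.

```python
# Illustration only: the kind of retry loop the SpotML backend performs on
# your behalf. Not SpotML's real code; the AMI ID below is a placeholder.
import time
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.Session(profile_name="my-aws-profile",
                    region_name="us-east-1").client("ec2")

while True:
    try:
        ec2.run_instances(
            ImageId="ami-xxxxxxxx",          # placeholder AMI
            InstanceType="t2.large",
            MinCount=1,
            MaxCount=1,
            InstanceMarketOptions={"MarketType": "spot"},
        )
        print("Spot instance requested successfully")
        break
    except ClientError as err:
        # Typical transient errors: InsufficientInstanceCapacity, SpotMaxPriceTooLow
        print("No spot capacity yet:", err.response["Error"]["Code"])
        time.sleep(15 * 60)  # wait 15 minutes and try again
```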
If you intend to cancel the scheduled run, type:

```bash
spotml manage stop
```
You can check the status of the instance and the run with the below command:

```bash
spotml status
```

To also check any logs generated when starting the instance, type:

```bash
spotml status --logs
```

Once you see the run status as `RUNNING` from the status command, you can SSH into the actual run session by typing:

```bash
spotml sh run
```

This opens the tmux session where SpotML ran the `train` command. You can also just SSH into a separate, normal SSH session by typing `spotml sh` as before for an interactive session.