Spot ML
Typical workflow

Step 0: Set up AWS credentials (one time)

SpotML uses your AWS credentials to manage runs for you. To verify that your credentials are set up, check that the contents of ~/.aws/credentials look something like this:
    [my-aws-profile]
    aws_access_key_id = your_aws_access_key_id
    aws_secret_access_key = your_aws_secret_access_key
The IAM user that owns this access key also needs permission to create the AWS resources that SpotML manages (EC2 instances, EBS volumes, S3 buckets, security groups, etc.).
If you are new to AWS, configure your AWS CLI by going through this setup guide.
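As a quick sanity check, the credentials file can be inspected with a few lines of Python. This is a minimal sketch and not part of SpotML: the helper name profile_has_keys is hypothetical, and it only checks that both keys are present in the profile, not that AWS accepts them or that the IAM user has the required permissions.

```python
import configparser
from pathlib import Path

# Keys that a profile in ~/.aws/credentials is expected to define.
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def profile_has_keys(credentials_text: str, profile: str) -> bool:
    """Return True if `profile` exists and defines both required keys.

    Hypothetical helper for illustration; it does not contact AWS.
    """
    # Disable interpolation so a '%' in a value is never treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return False
    return all(
        parser.get(profile, key, fallback="").strip() for key in REQUIRED_KEYS
    )


if __name__ == "__main__":
    path = Path.home() / ".aws" / "credentials"
    text = path.read_text() if path.exists() else ""
    # "my-aws-profile" is the placeholder profile name used above.
    print(profile_has_keys(text, "my-aws-profile"))
```

Replace "my-aws-profile" with the profile name you actually use before running the check.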

Step 1: Install the spotml CLI (one time)

    pip install spotml --upgrade

Step 2: Configure spotml.yaml

1. First, copy the file below into the root folder of your code as spotml.yaml.
    project:
      name: mnist
      syncFilters:
        - exclude:
            - .git/*
            - .idea/*
            - '*/__pycache__/*'

    containers:
      - &DEFAULT_CONTAINER
        projectDir: /workspace/project
        image: tensorflow/tensorflow:latest-py3
        volumeMounts:
          - name: workspace
            mountPath: /workspace
        env:
          PYTHONPATH: /workspace/project
        ports:
          # tensorboard
          - containerPort: 6006
            hostPort: 6006
          # jupyter
          - containerPort: 8888
            hostPort: 8888

    instances:
      - name: aws-1
        provider: aws
        parameters:
          region: us-east-1
          instanceType: t2.large
          spotStrategy: on-demand
          ports: [6006, 6007, 8888]
          rootVolumeSize: 125
          volumes:
            - name: workspace
              parameters:
                size: 50

    scripts:
      train: |
        python train.py

      tensorboard: |
        tensorboard --port 6006 --logdir results/

      jupyter: |
        CUDA_VISIBLE_DEVICES="" jupyter notebook --allow-root --ip 0.0.0.0
2. Change instanceType to the AWS instance type you want to run your code on.
    instanceType: t2.large
    spotStrategy: on-demand
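The exclude entries under syncFilters look like glob-style patterns. The sketch below uses Python's fnmatch module to give an intuition for which relative paths such patterns would match. It is an approximation only: whether SpotML's matching behaves exactly like fnmatch is an assumption, and the helper name is_excluded is hypothetical.

```python
from fnmatch import fnmatchcase

# Exclude patterns copied from the syncFilters section of spotml.yaml.
EXCLUDE_PATTERNS = [".git/*", ".idea/*", "*/__pycache__/*"]


def is_excluded(relative_path: str, patterns=EXCLUDE_PATTERNS) -> bool:
    """Return True if the path matches any exclude pattern.

    Assumption: patterns are interpreted with fnmatch-style glob rules,
    where '*' also matches across '/' separators.
    """
    return any(fnmatchcase(relative_path, pattern) for pattern in patterns)


# Example paths (hypothetical project layout) and whether they would be skipped.
for path in [".git/config", "src/__pycache__/train.cpython-38.pyc", "train.py"]:
    print(path, is_excluded(path))
```

Under these assumptions, version-control metadata and bytecode caches are skipped while source files such as train.py are synced.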

Step 3: Create a Dockerfile (optional)

Notice that the spotml.yaml above uses a Docker image directly:
    image: tensorflow/tensorflow:latest-py3
If you want to configure a custom Docker container instead, you can write your own Dockerfile. See example setup here.

Step 4: Start an instance

    spotml start
This automatically does the following for you:
  • Syncs your source code folder to S3
  • Launches your instance (on-demand or spot)
  • Creates a persistent EBS volume and attaches it to that instance
  • Sets up your Docker environment inside the AWS instance
  • Pulls the code and data from S3 to the EBS volume attached to the instance
The resources created in your AWS account (EBS volume, EC2 instance, S3 bucket, security groups, etc.) all have the prefix spotml.
You should see output like the below once the instance is launched.

Step 5: SSH into the instance

    spotml sh
You should see a screen like the one below. Notice that your source code is already synced to the instance.
You can now run your training command here to start a training run manually.

Step 6: Schedule a managed run (optional)

If you have long-running training jobs that you only want to run on spot instances, managing them manually is a pain.
1. Set the scripts section in your spotml.yaml file to the command you want to run.
    scripts:
      train: |
        python train.py
2. Change the spotStrategy to on-demand
    spotStrategy: on-demand
3. Then run the command below to let SpotML manage the run automatically.
    spotml run train
If the instance is not already running, SpotML tries to spawn a new spot instance and runs the above script once the instance is ready.
Note that if a spot instance is not available, the SpotML backend service retries every 15 minutes until it can spawn the instance. You can therefore turn off your laptop and do other things while SpotML tries to schedule the run.
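The retry behaviour described above can be pictured as a simple polling loop. The following is a conceptual sketch only, not SpotML's backend code: try_launch is a hypothetical callable standing in for a spot instance request, launch_with_retry and max_attempts are illustrative names, and the 15-minute interval is taken from the text.

```python
import time
from typing import Callable, Optional

RETRY_INTERVAL_SECONDS = 15 * 60  # "every 15 minutes", per the text above


def launch_with_retry(
    try_launch: Callable[[], Optional[str]],
    interval: float = RETRY_INTERVAL_SECONDS,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `try_launch` repeatedly until it returns an instance id.

    `try_launch` is a hypothetical stand-in that returns None while no
    spot capacity is available. With max_attempts=None the loop keeps
    trying indefinitely; otherwise it returns None once the attempts
    are exhausted.
    """
    attempt = 0
    while True:
        attempt += 1
        instance_id = try_launch()
        if instance_id is not None:
            return instance_id
        if max_attempts is not None and attempt >= max_attempts:
            return None
        sleep(interval)  # wait before asking for capacity again
```

Because the loop runs in a backend service rather than on your machine, the run can still be scheduled after your laptop is switched off.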
To cancel the scheduled run, type:
    spotml manage stop

Step 7: Check run status (optional)

You can check the status of the instance and of the run with the command below.
    spotml status
To also see any logs generated while the instance was starting, type:
    spotml status --logs
Once the status command shows the run status as RUNNING, you can SSH into the actual run session by typing:
    spotml sh run
This opens the tmux session in which SpotML ran the train command.
For a separate, interactive SSH session, type spotml sh as before.
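If you would rather not re-run the status command by hand, the check can be scripted. The sketch below is one possible approach, built on assumptions: it assumes the output of spotml status contains the literal word RUNNING once the run has started (the exact output format is not documented here), and the names read_status and wait_until_running, as well as the polling defaults, are hypothetical.

```python
import subprocess
import time
from typing import Callable


def read_status() -> str:
    """Return the text printed by `spotml status` (requires the spotml CLI)."""
    result = subprocess.run(
        ["spotml", "status"], capture_output=True, text=True, check=False
    )
    return result.stdout


def wait_until_running(
    fetch: Callable[[], str] = read_status,
    poll_seconds: float = 60,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the status text until it mentions RUNNING.

    Returns True as soon as RUNNING is seen, or False after `max_polls`
    unsuccessful checks.
    """
    for poll in range(max_polls):
        # Assumption: the status output contains the literal word RUNNING.
        if "RUNNING" in fetch():
            return True
        if poll < max_polls - 1:
            sleep(poll_seconds)  # pause between checks
    return False
```

When wait_until_running() returns True, spotml sh run can be used to attach to the run session.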