Getting started (5 mins)

Step 0: Set up AWS credentials (one time)

SpotML uses your AWS credentials to manage runs for you. To verify that your credentials are set up, check that ~/.aws/credentials contains something like the following.

[my-aws-profile]
aws_access_key_id = your_aws_access_key_id
aws_secret_access_key = your_aws_secret_access_key

The IAM user that owns this access key also needs permission to create the AWS resources that SpotML uses.

If you are new to AWS, configure your AWS CLI by going through this setup guide.
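
To check the credentials file from a terminal instead of opening it, a small shell helper along these lines can list the profiles it defines. This is a minimal sketch: list_aws_profiles is a hypothetical name, not part of SpotML or the AWS CLI, and it assumes the default ~/.aws/credentials location unless a path is given.

```shell
# Hypothetical helper (not part of SpotML or the AWS CLI).
# Lists the profile names found in an AWS shared credentials file.
# Defaults to ~/.aws/credentials; pass another path as the first argument.
list_aws_profiles() {
    creds_file="${1:-$HOME/.aws/credentials}"
    if [ ! -f "$creds_file" ]; then
        echo "no credentials file at $creds_file" >&2
        return 1
    fi
    # Profile headers look like "[my-aws-profile]"; print the name without brackets.
    sed -n 's/^\[\(.*\)\][[:space:]]*$/\1/p' "$creds_file"
}
```

Running list_aws_profiles with no arguments should print my-aws-profile for the example file above. If the AWS CLI is installed, aws sts get-caller-identity --profile my-aws-profile additionally confirms that AWS accepts the keys, though it does not check which resources the IAM user may create.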

Step 1: Install the spotml CLI (one time)

pip install spotml --upgrade
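
To confirm that the install worked, you can ask pip which version it installed. The sketch below wraps pip show in a hypothetical helper, installed_version, which is not part of SpotML. It assumes the pip on your PATH is the same one used for the install.

```shell
# Hypothetical helper (not part of SpotML).
# Prints the installed version of a pip package.
# Returns a non-zero status if pip reports nothing for that package.
installed_version() {
    version="$(pip show "$1" 2>/dev/null | sed -n 's/^Version: //p')"
    [ -n "$version" ] && echo "$version"
}
```

installed_version spotml prints the version number, and returns a non-zero status if the package is not installed.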

Step 2: Go through the simple workflow:

How to start an AWS instance to train the MNIST code.
How to SSH into the instance to check progress.
How to download the results to your local machine.

1. Clone the repo

git clone https://github.com/SpotML/spotml-examples.git 
cd spotml-examples/mnist

2. Start the instance

spotml start

Wait for the instance to start. Once it is ready, you will see output like the one below.

By default, SpotML tracks how long the instance has been idle. If the instance is idle for more than 30 minutes, it is terminated automatically.

3. SSH into the instance

spotml sh

SpotML uses tmux sessions, so to leave the SSH session press Ctrl + b and then d. This detaches you from the tmux session and leaves it running on the instance.

4. Download the generated model file

Make sure you have disconnected from the SSH session above. Then, from your local terminal, run the following command to download the generated model file.

spotml download -i 'my_model.h5'

Step 3: Go through the managed run workflow (optional)

How to let SpotML automatically manage a long-running training job on spot instances.
SpotML automatically restarts interrupted instances and resumes training.
SpotML automatically turns off idle instances.

1. Update config file

Open the spotml.yaml file and find the line that says

   spotStrategy: on-demand

change it to

   spotStrategy: spot
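
If you prefer to make this change from the command line, a sed substitution can do it. The sketch below assumes the line appears in spotml.yaml exactly as shown above (optional indentation, then spotStrategy: on-demand). use_spot_strategy is a hypothetical helper name, not a SpotML command.

```shell
# Hypothetical helper (not a SpotML command).
# Switches spotStrategy from on-demand to spot in a SpotML config file.
# Defaults to spotml.yaml in the current directory.
# The original file is kept next to it with a .bak suffix.
use_spot_strategy() {
    config_file="${1:-spotml.yaml}"
    # Only the spotStrategy line is rewritten; its indentation is preserved.
    sed -i.bak 's/^\([[:space:]]*spotStrategy:[[:space:]]*\)on-demand[[:space:]]*$/\1spot/' "$config_file"
}
```

Run use_spot_strategy from the spotml-examples/mnist directory, then open spotml.yaml to confirm the change. If anything looks wrong, spotml.yaml.bak holds the previous version.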

Also note the scripts section of the config file, shown below.

This section lets you map keywords such as train to your custom training command.

2. Run the script

spotml run train

This should produce output like the one below. If the instance is not already running, SpotML tries to spawn a new spot instance and runs the script once the instance is ready.

Note that if no spot instance is available, the SpotML backend service retries every 15 minutes until it can spawn one. You can therefore turn off your laptop and do other things while SpotML tries to schedule the run.

If you intend to cancel the scheduled run, type