Spot ML
Getting started (5 mins)

Step 0: Setup AWS credentials (one time)

SpotML uses your AWS credentials to manage the runs for you. To verify that your credentials are set up, check that the contents of ~/.aws/credentials look something like this:
[my-aws-profile]
aws_access_key_id = your_aws_access_key_id
aws_secret_access_key = your_aws_secret_access_key
The IAM user that owns this access key also needs permission to create the AWS resources that SpotML needs.
If you are new to AWS, configure your AWS CLI by going through this setup guide.
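The credentials check described above can be scripted. The following is a minimal sketch, not part of SpotML or the AWS CLI: `has_aws_profile` is a hypothetical helper name, and it only confirms that the named profile section contains both key fields, not that the keys are valid.

```shell
# Hypothetical helper (not part of SpotML or the AWS CLI): succeeds if the
# given credentials file has a [profile] section containing both key fields.
has_aws_profile() {
  creds_file="$1"
  profile="$2"
  [ -f "$creds_file" ] || return 1
  awk -v want="[$profile]" '
    /^\[/ { in_section = ($0 == want) }            # a new section header starts here
    in_section && /^aws_access_key_id[ \t]*=/     { have_id = 1 }
    in_section && /^aws_secret_access_key[ \t]*=/ { have_secret = 1 }
    END { exit !(have_id && have_secret) }         # exit 0 only if both were seen
  ' "$creds_file"
}

# Example: check the profile name used in this guide.
if has_aws_profile "$HOME/.aws/credentials" "my-aws-profile"; then
  echo "credentials look complete"
else
  echo "profile missing or incomplete"
fi
```

If the AWS CLI is installed, running `aws sts get-caller-identity --profile my-aws-profile` goes one step further and confirms that AWS accepts the keys.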

Step 1: Install the SpotML CLI (one time)

pip install spotml --upgrade

Step 2: Go through the simple workflow

  • How to start an AWS instance to train the MNIST code.
  • How to SSH into the instance to check progress.
  • How to download results to your local machine.
1. Clone the repo
git clone https://github.com/SpotML/spotml-examples.git
cd spotml-examples/mnist
2. Start the instance
spotml start
Wait for the instance to start. SpotML prints output to the terminal once startup is complete.
By default, SpotML tracks the instance's idle time. If the instance is idle for more than 30 minutes, it is terminated automatically.
3. SSH into the instance
spotml sh
SpotML uses tmux sessions, so to leave the SSH session press Ctrl + b and then d to detach from the session.
4. Download the generated model file
Make sure you have disconnected from the SSH session above. Then, from your local terminal, run the command below to download the generated model file.
spotml download -i 'my_model.h5'

Step 3: Go through the managed run workflow (optional)

  • How to let SpotML automatically manage a long-running training job on spot instances.
  • SpotML automatically restarts interrupted instances and resumes training.
  • SpotML automatically turns off idle instances.
1. Update config file
Open the spotml.yaml file and find the line that says
spotStrategy: on-demand
Change it to
spotStrategy: spot
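The same edit can be made from the command line. The sketch below assumes the line appears in spotml.yaml exactly as shown above (optionally indented); `use_spot_strategy` is a hypothetical helper name, not a SpotML command. It keeps a .bak copy of the original file so the change is easy to undo.

```shell
# Hypothetical helper (not a SpotML command): rewrites
# "spotStrategy: on-demand" to "spotStrategy: spot" in the given file and
# keeps the original as <file>.bak.
use_spot_strategy() {
  config="$1"
  cp "$config" "$config.bak" || return 1
  # Preserve any leading indentation; only the value is replaced.
  sed 's/^\([[:space:]]*spotStrategy:[[:space:]]*\)on-demand[[:space:]]*$/\1spot/' \
    "$config.bak" > "$config"
}

# Example: apply it to the config in the current directory, if present.
if [ -f spotml.yaml ]; then
  use_spot_strategy spotml.yaml
  grep 'spotStrategy' spotml.yaml
fi
```

Writing through a backup copy rather than using `sed -i` avoids the differences between GNU and BSD sed's in-place flag.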
Also note the scripts section of the config file.
It lets you define keywords such as train that run your custom training command.
2. Run the script
spotml run train
This should produce output in your terminal. If the instance is not already running, SpotML tries to spawn a new spot instance and runs the above script once the instance is ready.
Note that if a spot instance is not available, the SpotML backend service retries every 15 minutes until it can spawn the instance. You can therefore turn off your laptop and do other things while SpotML tries to schedule the run.
To cancel the scheduled run, type
spotml run stop
3. Check status
spotml status
The above command shows the status of both the instance and the run. Once the instance is running, the output reports it.
To also see any logs generated while the instance was starting, type
spotml status --logs
Once the status command shows the run status as RUNNING, you can SSH into the actual run session by typing
spotml sh run
This opens the tmux session in which SpotML ran the train command.
For a separate interactive session, you can also open a normal SSH session as before by typing
spotml sh
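To avoid re-running spotml status by hand while waiting for a scheduled run to start, a small polling loop can help. This is a sketch under assumptions: `wait_for_running` is a hypothetical helper, and it assumes the status output contains the literal word RUNNING only once the run is active, as the text above suggests. Adjust the pattern if the actual output differs.

```shell
# Hypothetical helper (not a SpotML command): re-runs a status command until
# its output contains RUNNING, or gives up after a number of attempts.
# Usage: wait_for_running <max_attempts> <sleep_seconds> <command> [args...]
wait_for_running() {
  max_attempts="$1"
  sleep_seconds="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Assumption: the word RUNNING in the output means the run has started.
    if "$@" 2>/dev/null | grep -q 'RUNNING'; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$sleep_seconds"
  done
  return 1
}

# Example (commented out so that sourcing this file has no side effects):
# poll every 60 seconds for up to an hour, then attach to the run session.
# wait_for_running 60 60 spotml status && spotml sh run
```

Because the command to poll is passed in as arguments, the same loop works with spotml status --logs or with any other command whose output you want to watch.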