What is SpotML?
SpotML is a command line tool that automatically manages ML training on AWS spot instances (3X cheaper). It handles spot interruptions for you by resuming training from the latest checkpoint.
If no spot instance is available, the SpotML backend service keeps trying until it can spawn one.
It then runs the command you configured on this instance.
If a spot interruption occurs before training completes, it detects the interruption and launches a new spot instance to resume training.
It monitors the instance for idle time and shuts down the machine once the job is complete.
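The behaviour described above can be pictured as a retry loop. The following is a minimal sketch only, not SpotML's actual implementation: `SimulatedCloud`, `manage_run`, and the event names are hypothetical, and the cloud provider is replaced by a scripted list of events so the example runs locally.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SimulatedCloud:
    """Hypothetical stand-in for a cloud provider; replays scripted events."""
    # Each event is one of: "no_capacity", "interrupted", "completed".
    events: Iterator[str]
    log: List[str] = field(default_factory=list)

    def next_event(self) -> str:
        return next(self.events)


def manage_run(cloud: SimulatedCloud, max_attempts: int = 10) -> bool:
    """Keep launching instances until the job completes or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        event = cloud.next_event()
        if event == "no_capacity":
            # No spot instance available: try again (a real service would wait first).
            cloud.log.append(f"attempt {attempt}: no capacity, retrying")
        elif event == "interrupted":
            # The instance was reclaimed mid-training: launch another one to resume.
            cloud.log.append(f"attempt {attempt}: interrupted, relaunching")
        elif event == "completed":
            # The configured command finished: shut the machine down.
            cloud.log.append(f"attempt {attempt}: job completed, shutting down")
            return True
        else:
            raise ValueError(f"unknown event: {event}")
    return False
```

For example, `manage_run(SimulatedCloud(iter(["no_capacity", "interrupted", "completed"])))` returns `True` after three attempts, with one log entry per attempt.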
All you need is a simple config file, spotml.yaml, placed in the root directory of your codebase. No changes to your code are needed.
How it works
Let's say you have a local script to train a neural network that you usually run like this:
By default, SpotML looks for a spotml.yaml configuration file in the root folder. An example configuration file looks like the one below:
1. First, SpotML finds the Docker image or Dockerfile. It uses this to run the training job on the AWS instance.
2. Then it finds the instance type that you specify for the run. In the example below, I've specified a t2.large AWS instance.
3. Finally, it finds the training command configured in the scripts section.
With the above config in place, you can then type the command below to schedule the run in the cloud.
When the above command is run, SpotML performs the following steps in order:
Syncs your source code folder to S3.
Launches your instance (on-demand or spot).
Mounts an EBS volume that persists data through spot interruptions.
Pulls the code and data from S3 to the instance.
Sets up your Docker environment.
Runs the "train" command configured in your spotml.yaml config until the process completes.
If there is a spot interruption, respawns a different instance and resumes training.
After the job completes, saves the results and logs back to S3.
Terminates the instance if it detects idle time.
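Resuming after an interruption presumes that the training command can pick up from the latest checkpoint stored on the persisted volume. The following is a minimal sketch of such a script, assuming a JSON checkpoint file. The function names, the `interrupt_at` parameter (used here only to simulate an interruption), and the placeholder loss value are hypothetical and not part of SpotML.

```python
import json
import os
import tempfile
from typing import Optional


def load_checkpoint(path: str) -> dict:
    """Return the saved training state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state to a temp file, then rename it over the old checkpoint,
    so an interruption cannot leave a half-written file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(path: str, total_epochs: int, interrupt_at: Optional[int] = None) -> dict:
    """Train from the last completed epoch; optionally stop early to mimic an interruption."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if interrupt_at is not None and epoch >= interrupt_at:
            break  # simulated spot interruption
        # Placeholder for one real epoch of training.
        state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}
        save_checkpoint(path, state)
    return state
```

Calling `train(path, 5, interrupt_at=2)` and then `train(path, 5)` with the same path completes epochs 3 to 5 in the second call instead of starting over, which is the behaviour a relaunched instance relies on.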
Credit
SpotML is built on top of the existing open source library Spotty, and we would like to give credit to it. The goal of SpotML is not to replace the library but to further enhance the usefulness of its open source components by automating run management and providing the most cost-effective training platform.