What is SpotML?
SpotML is a command-line tool that automatically manages ML training on AWS spot instances (3x cheaper). It handles spot interruptions by resuming training from the latest checkpoint.
- If no spot instance is available, the SpotML backend service keeps trying until it can spawn one.
- It then runs the command you configured on this instance.
- If a spot interruption occurs before training completes, it detects the interruption and launches a new spot instance to resume training.
- It monitors the instance for idle time and shuts down the machine once the job is complete.
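The lifecycle described in the bullets above can be pictured as a retry loop. The following is a minimal Python sketch of that idea, not SpotML's actual implementation: `NoCapacity`, `SpotInterrupted`, `run_managed_job`, and `simulate` are hypothetical names, and the cloud operations are replaced by caller-supplied stubs.

```python
import time


class NoCapacity(Exception):
    """Hypothetical: raised when no spot instance is available."""


class SpotInterrupted(Exception):
    """Hypothetical: raised when the spot instance is reclaimed mid-run."""


def run_managed_job(launch, run, terminate, retry_delay=0.0):
    """Keep launching instances until `run` finishes without interruption.

    `launch`, `run`, and `terminate` are stand-ins for real cloud calls
    (assumptions for illustration). Returns the number of launch attempts.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            instance = launch()          # may raise NoCapacity
        except NoCapacity:
            time.sleep(retry_delay)      # wait, then try for capacity again
            continue
        try:
            run(instance)                # may raise SpotInterrupted
            return attempts              # job finished
        except SpotInterrupted:
            time.sleep(retry_delay)      # instance lost; loop to relaunch
        finally:
            terminate(instance)          # never leave a machine running


def simulate(outcomes):
    """Drive the loop with scripted outcomes: 'none', 'interrupt', or 'ok'."""
    script = iter(outcomes)
    terminated = []

    def launch():
        outcome = next(script)
        if outcome == "none":
            raise NoCapacity()
        return outcome                   # the outcome doubles as a fake instance id

    def run(instance):
        if instance == "interrupt":
            raise SpotInterrupted()

    attempts = run_managed_job(launch, run, terminated.append)
    return attempts, terminated
```

Calling `simulate(["none", "interrupt", "ok"])` walks through one failed launch, one interrupted run, and one successful run. The `finally` clause reflects the shutdown step: whatever happens to the job, the instance is released.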
All you need is a simple config file, spotml.yaml, placed in the root directory of your codebase. No changes to your code are needed.
Let's say you have a local script to train a neural network that you usually run like this:
By default, SpotML looks for a spotml.yaml configuration file in the root folder. An example configuration file looks like this:
1. First, SpotML finds the Docker image/file and uses it to run the training job on the AWS instance.
3. Third, it finds the training command configured in the scripts section.
With the above config in place, run the command below to schedule the run in the cloud:
spotml run train
When the above command is run, SpotML performs the following steps in order:
- Syncs your source code folder to S3.
- Launches your instance (on-demand or spot).
- Mounts an EBS volume that persists data through spot interruptions.
- Pulls the code/data from S3 to the instance.
- Sets up your Docker environment.
- Runs the "train" command configured in your spotml.yaml config until the process completes.
- If there is a spot interruption, respawns a different instance and resumes training.
- After the job completes, saves the results/logs back to S3.
- Terminates the instance if it detects idle time.
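Resuming after an interruption only helps if the training command can pick up from saved state on the persistent volume. Below is a minimal sketch of that pattern, assuming your script already writes checkpoints to a directory that survives the interruption. The file naming (`step_N.json`), the `train` function, and the `stop_at` parameter that simulates an interruption are all hypothetical and are not taken from SpotML.

```python
import json
import os
import tempfile


def latest_checkpoint(ckpt_dir):
    """Return (step, path) of the newest checkpoint, or (0, None) if none exist."""
    best_step, best_path = 0, None
    for name in os.listdir(ckpt_dir):
        if name.startswith("step_") and name.endswith(".json"):
            step = int(name[len("step_"):-len(".json")])
            if step > best_step:
                best_step, best_path = step, os.path.join(ckpt_dir, name)
    return best_step, best_path


def save_checkpoint(ckpt_dir, step, state):
    """Write to a temp file, then rename, so a reader never sees a partial file."""
    final_path = os.path.join(ckpt_dir, f"step_{step}.json")
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)


def train(ckpt_dir, total_steps, save_every=10, stop_at=None):
    """Toy training loop; `stop_at` simulates a spot interruption."""
    os.makedirs(ckpt_dir, exist_ok=True)
    step, path = latest_checkpoint(ckpt_dir)
    state = {"loss": 1.0}
    if path is not None:
        with open(path) as f:
            state = json.load(f)         # resume from the saved state
    while step < total_steps:
        if stop_at is not None and step >= stop_at:
            return step                  # pretend the instance was reclaimed
        step += 1
        state["loss"] *= 0.99            # stand-in for a real optimisation step
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
    return step
```

In this sketch, a run interrupted at step 25 leaves its newest checkpoint at step 20, so the next invocation of the same command repeats only steps 21 to 25 before continuing. A smaller `save_every` reduces repeated work at the cost of more frequent writes.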
SpotML is built on top of the existing open-source library Spotty, and we would like to give credit to it. The goal of SpotML is not to replace the library but to further enhance the usefulness of the open-source components by automating run management and providing the most cost-effective training platform.