Getting started (5 mins)
Step 0: Set up AWS credentials (one time)
SpotML uses your AWS credentials to manage the runs for you. To verify that your credentials are set up, check that the contents of ~/.aws/credentials look something like below.
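For reference, the AWS CLI keeps credentials in an INI-style file with one section per profile. The Python sketch below checks that a profile defines both `aws_access_key_id` and `aws_secret_access_key`. The helper name `has_aws_credentials` and the placeholder values are illustrative assumptions and are not part of SpotML.

```python
import configparser
from pathlib import Path

# Placeholder values only; a real file holds your own key pair.
SAMPLE_CREDENTIALS = """\
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def has_aws_credentials(text: str, profile: str = "default") -> bool:
    """Return True if `profile` defines both required keys with non-empty values."""
    # Interpolation is disabled so a literal '%' in a value is not misread.
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(text)
    except configparser.Error:
        # Not a valid INI file (for example, a key before any section header).
        return False
    if not parser.has_section(profile):
        return False
    return all(
        parser.get(profile, key, fallback="").strip() for key in REQUIRED_KEYS
    )


if __name__ == "__main__":
    path = Path.home() / ".aws" / "credentials"
    if path.is_file() and has_aws_credentials(path.read_text()):
        print(f"Found a [default] profile with both keys in {path}")
    else:
        print(f"No usable [default] profile found in {path}")
```

This only confirms that the file has the expected shape. It does not confirm that the keys are valid or that the IAM user has the permissions described next.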
Secondly, the IAM user that owns the above access key needs to have permissions to create all the required AWS resources.
If you are new to AWS, configure your AWS CLI by going through this setup guide.
Step 1: Install the SpotML CLI (one time)
Step 2: Go through the simple workflow:
How to start an AWS instance to train the MNIST code.
How to ssh into the instance to check progress.
How to download results to your local machine.
1. Clone the repo
2. Start the instance
Wait for the instance to start. Once it is ready, you will see output like below.
By default, SpotML tracks the instance for idle time. If the instance is idle for more than 30 minutes, it is automatically terminated.
3. SSH into the instance
SpotML uses tmux sessions. To exit the ssh session, press Ctrl + b, then press d to detach from the session.
4. Download the generated model file.
Make sure you have disconnected from the above ssh session. Then, from your local terminal, run the command below to download the generated model file.
Step 3: Go through the managed run workflow (optional)
How to let SpotML automatically manage a long-running training job on spot instances.
SpotML automatically restarts interrupted instances and resumes training.
SpotML automatically turns off idle instances.
1. Update config file
Open the spotml.yaml file and find the line that says
change it to
Also notice the scripts section of the config file, like below.
This allows you to configure keywords like train to run your custom training command.
2. Run the script
This should produce output like below. If the instance is not already running, SpotML tries to spawn a new spot instance and runs the above script once the instance is ready.
Note that if a spot instance is not available, the SpotML backend service keeps retrying every 15 minutes until it can spawn the instance. So you can turn off your laptop and do other things while SpotML tries to schedule the run.
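The retry behaviour described above can be pictured as a fixed-interval polling loop. The Python sketch below is only an illustration of that idea and is not SpotML's actual backend code. The names `schedule_until_spawned`, `try_spawn`, and `max_attempts` are hypothetical, and the only detail taken from the text is the 15-minute interval.

```python
import time
from typing import Callable, Optional

RETRY_INTERVAL_SECONDS = 15 * 60  # "every 15 mins", per the text above


def schedule_until_spawned(
    try_spawn: Callable[[], bool],
    interval_seconds: float = RETRY_INTERVAL_SECONDS,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `try_spawn` until it returns True and return the attempt count.

    `try_spawn` is a hypothetical callable standing in for "request a spot
    instance". With `max_attempts=None` the loop retries indefinitely;
    otherwise TimeoutError is raised once the attempts are used up.
    """
    attempts = 0
    while True:
        attempts += 1
        if try_spawn():
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise TimeoutError(f"no instance after {attempts} attempts")
        # Wait one full interval before asking for capacity again.
        sleep(interval_seconds)
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets the loop be exercised without waiting, for example `schedule_until_spawned(fake_spawn, sleep=lambda s: None)` with a hypothetical `fake_spawn`.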
To cancel the scheduled run, type
3. Check Status
You can check the status of the instance and the run with the above command. Once the instance is running, you should see output like below.
To also check any logs generated when starting the instance, type
Once you see the run status as RUNNING from the status command, you can ssh into the actual run session by typing
This opens the tmux session where SpotML ran the train command.
You can also open a separate, normal ssh session for interactive use by typing the command below, as before.