Never lose a training run again: a checkpoint-and-resume playbook for ephemeral GPUs
To prevent losing work due to ephemeral GPUs, save a resumable checkpoint every epoch that includes the model, optimizer, scheduler, epoch counter, best metric, and RNG state. This can be done with a function like make_checkpoint. If the machine dies while writing the checkpoint, it may still be lost. Use a robust checkpointing mechanism to ensure data integrity. This approach can save hours of work and make training on free GPUs more reliable.