Their team of scientists uses Cedana to automate spot instances, with no code modifications, configuration changes, or workload babysitting.
Cedana makes sure you never lose work. Long-running, stateful workloads are automatically resumed on a new instance after revocations or failures. Your workload doesn't lose progress, and you don't waste time babysitting jobs.
Live-migrate GPU workloads before failures happen. System-level checkpoint/restore ensures no work is lost, even during mid-epoch failures and even on large multi-node clusters.
Assign SLAs to individual jobs for reliability, cost, and other criteria. Per-job SLAs are required for sharing compute efficiently across users, groups, and use cases.
Checkpointing runs transparently and continuously with no impact on performance, and there are no checkpoints for you to manage.
Just add a few lines to your Helm chart and you're ready to go.