Their team of scientists uses Cedana to automate spot instances, with no code modifications, configuration changes, or workload babysitting.
Cedana makes sure you never lose work. Long-running, stateful workloads automatically resume on a new instance after revocations or failures. Your workload doesn't lose progress, and you don't waste time babysitting jobs.
Live-migrate GPU workloads before failures happen. System-level checkpoint/restore ensures no work is lost, even during mid-epoch failures and even on large multi-node clusters.
Assign SLAs to individual jobs for reliability, cost, and other criteria, which is required for sharing compute efficiently across users, groups, and use cases.
Checkpointing runs transparently and continuously, with no impact on performance. There are no checkpoints for you to manage.
Just add a few lines to your Helm chart and you're ready to go.
We’ve deployed a test cluster where you can interact and experiment with the system.
Learn more about how Cedana is transforming compute orchestration and how we can help your organization.
From deploying on your cluster, to market, to GPU checkpointing, learn how our system works and get started quickly.