Automatic, unbreakable spot instances

Run long-running workloads reliably on spot instances, without interruptions or lost work.

Caltech's Computational Biology group saved 70%, with 2x faster results.

Their team of scientists uses Cedana to automate spot instances, with no code modifications, configuration changes, or babysitting of workloads.

Unbreakable, stateful reliability

Cedana makes sure you never lose work. Long-running, stateful workloads automatically resume on a new instance after revocations or failures. Your workload doesn't lose progress, and you don't waste time babysitting jobs.

Automated

Live-migrate GPU workloads before failures happen. System-level checkpoint/restore ensures no work is lost, even during mid-epoch failures and even on large multi-node clusters.

Job-level SLAs

Assign individual jobs SLAs for reliability, cost, and other criteria. This is required for efficiently sharing compute across users, groups, and use cases.

No code modifications or config changes

Checkpointing is performed transparently and continuously, with no impact on performance. There are no checkpoints for you to manage.

Seamless, easy install

Just add a few lines to your Helm chart and you're ready to go.

Product benefits

Reduce costs by up to 80% by migrating long-running workloads to spot instances.

Faster insights through automation. Reduce delays and interruptions.

Automated stateful reliability. Never lose work.

Automation Use Cases

Reliability

Automatically resume workloads after catastrophic failures, without losing progress or restarting from scratch.

Productivity

Automatically migrate workloads to eliminate idle GPUs and increase throughput.

Operations

Perform maintenance without losing workload progress or manually resubmitting jobs.
