Automate your spot instances

Run long-running workloads on spot instances without interruptions or lost work.

Caltech's Computational Biology group saved 70%, with 2x faster results.

Their team of scientists uses Cedana to automate spot instances, with no code modifications, no configuration changes, and no babysitting of workloads.

Unbreakable, stateful reliability

Cedana makes sure you never lose work. When an instance is revoked or fails, long-running, stateful workloads automatically resume on a new instance. Your workload doesn't lose progress, and you don't waste time babysitting jobs.

Automated

Live-migrate GPU workloads before failures happen. System-level checkpoint/restore ensures no work is lost, even when a failure hits mid-epoch and even on large multi-node clusters.
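
To make the checkpoint/restore idea concrete, here is a minimal Python sketch of a job that is interrupted partway through and then resumes from its last saved state. It is an application-level illustration only and is not Cedana's API or mechanism: Cedana checkpoints at the system level, without code changes. The names `run_job`, `CHECKPOINT_PATH`, and `Interrupted`, and the simulated failure, are all hypothetical.

```python
import json
import os
import tempfile

# Hypothetical checkpoint location, used only for this illustration.
CHECKPOINT_PATH = os.path.join(tempfile.gettempdir(), "demo_checkpoint.json")


class Interrupted(Exception):
    """Stands in for a spot revocation or node failure."""


def save_checkpoint(state):
    # Write to a temp file, then rename, so a crash never leaves
    # a half-written checkpoint behind.
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_PATH)


def load_checkpoint():
    # Resume from the saved state if one exists, otherwise start fresh.
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run_job(num_steps, fail_at=None):
    """Sum the step indices, checkpointing after every completed step."""
    state = load_checkpoint()
    for step in range(state["step"], num_steps):
        if fail_at is not None and step == fail_at:
            raise Interrupted(f"instance lost at step {step}")
        state["total"] += step
        state["step"] = step + 1
        save_checkpoint(state)
    return state


if __name__ == "__main__":
    if os.path.exists(CHECKPOINT_PATH):
        os.remove(CHECKPOINT_PATH)
    try:
        run_job(10, fail_at=6)  # simulated revocation partway through
    except Interrupted as exc:
        print("interrupted:", exc)
    # The second run picks up at step 6 instead of starting over.
    print(run_job(10))
```

The second call does only the remaining steps, which is the property that makes interruptions cheap.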

Job-level SLAs

Assign SLAs to individual jobs for reliability, cost, and other criteria. Job-level SLAs are required for sharing compute efficiently across users, groups, and use cases.

No code modifications or config changes

Checkpointing runs transparently and continuously, with no impact on performance. You never need to manage checkpoints yourself.

Seamless, easy install

Just add a few lines to your Helm chart and you're ready to go.

Product benefits

Reduce costs by up to 80% when you move long-running workloads to spot instances.
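
As a rough illustration of where savings like this come from, the arithmetic below compares on-demand and spot cost for one long-running job. All prices, hours, and the `rerun_overhead` parameter are made-up placeholders, not Cedana or cloud-provider figures. Actual spot discounts vary by provider, region, and instance type.

```python
def savings_fraction(on_demand_hourly, spot_hourly, hours, rerun_overhead=0.0):
    """Return the fraction saved by running on spot instead of on-demand.

    rerun_overhead is the extra fraction of hours spent redoing lost work
    (0.0 if interruptions cost no progress).
    """
    on_demand_cost = on_demand_hourly * hours
    spot_cost = spot_hourly * hours * (1.0 + rerun_overhead)
    return 1.0 - spot_cost / on_demand_cost


if __name__ == "__main__":
    # Hypothetical prices: $4.00/h on-demand vs $1.00/h spot, 100-hour job.
    no_lost_work = savings_fraction(4.00, 1.00, 100)
    with_reruns = savings_fraction(4.00, 1.00, 100, rerun_overhead=0.5)
    print(f"{no_lost_work:.0%} saved with no lost work")
    print(f"{with_reruns:.1%} saved if half the work must be redone")
```

The gap between the two figures shows why avoiding lost work matters for actually realizing the spot discount.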

Faster insights through automation. Reduce delays and interruptions.

Automated stateful reliability. Never lose work.

Get started

Play in the sandbox

We’ve deployed a test cluster where you can interact and experiment with the system.

Sandbox

Get a demo

Learn more about how Cedana is transforming compute orchestration and how we can help your organization.

Connect

API Reference & Guides

From deploying on your cluster, to market, to GPU checkpointing: learn our system and get started quickly.

VIEW DOCS
Backers / Partners