At Cedana, our bread and butter is checkpoint/restore and live migration to optimize your GPU/CPU clusters. Check out our website (cedana.com) to learn more about what we’re building!
We at Cedana see scientific computing as the way research keeps up with the increasing complexity of the real world. Modeling multivariate, complex systems with physically simulated models (which are often already approximations) breaks down as you get more fine-grained and start investigating interdependent phenomena.

Bridging the gap between theory and the real world then depends on being able to simulate phenomena at scale, which quickly becomes a thorny multi-disciplinary problem. It combines systems engineering, software, and whichever core scientific discipline is involved (biology <> compBio, physics <> CAD, etc.); a skillset in rare supply.
The problems of running these systems at scale have been solved to varying degrees, but it remains a messy field. At Cedana, we're thinking about how to solve a few of the problems most pertinent to where the space is right now: preemption that kills long-running jobs, the burden of building checkpointing into every application, and expensive GPU clusters sitting underutilized.
All of these combine to slow down time-to-insight. For scientific compute, we’re fundamentally seeking to answer the question:
If I (as a PI or lab admin) have invested $200K of grant money into a DGX cluster, how soon can I see insights? How soon after can I get something into a clinical trial?
Part of our engineering philosophy in designing Cedana has been to keep our systems flexible and maintain separation of concerns. We've built some incredibly performant low-level tooling around GPU checkpoint/restore, live migration, network migration, and more - leaving room to quickly scaffold on top of it and integrate with existing systems.
One example is Kubernetes - and, specific to scientific computing, Kueue.
Kueue is a cloud-native job queueing system for batch, HPC, and AI/ML workloads in Kubernetes. It sets out to solve some of the pain points of bringing HPC over to cloud-native, and is seeing serious adoption at places like Apple and CERN.
Kueue introduces familiar concepts like job queues and resource quotas into Kubernetes, allowing you to manage shared resources across multiple tenants or research groups. It provides the primitives to smooth over the friction of bringing a SLURM-like workflow into a modern, container-orchestrated environment.
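To make those primitives concrete, here's a minimal sketch of a Kueue setup (names and quota numbers are illustrative, not taken from any particular deployment): a ResourceFlavor describing the nodes, a ClusterQueue holding the shared quota, and a namespaced LocalQueue that a research group actually submits jobs to.

```yaml
# ResourceFlavor: describes the class of nodes the quota applies to.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
# ClusterQueue: the shared pool of quota for one or more tenants.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: research-cq
spec:
  namespaceSelector: {}   # admit workloads from any namespace that has a LocalQueue
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: "128"
            - name: "memory"
              nominalQuota: "512Gi"
            - name: "nvidia.com/gpu"
              nominalQuota: "8"
---
# LocalQueue: the queue a research group submits to from its own namespace.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: md-sims
  namespace: structural-bio
spec:
  clusterQueue: research-cq
```

Jobs point at the LocalQueue, and Kueue admits them only when the ClusterQueue has quota to spare - very much the SLURM partition/account mental model, expressed as Kubernetes objects.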
In any multi-tenant system, though, priorities change. A critical job needs to run now, and that means something else has to be booted off the nodes. Kueue has primitives for this: it can issue a preemption notice, terminate a lower-priority pod, and allow a higher-priority one to take its place.
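In Kueue terms, that behavior is configured roughly like this (again a hedged sketch with illustrative names, extending the ClusterQueue from the previous example): a WorkloadPriorityClass marks urgent work, and the ClusterQueue's preemption policy lets it evict lower-priority workloads.

```yaml
# A priority class for urgent workloads; higher value wins.
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority
value: 10000
description: "Urgent jobs that may preempt long-running batch work"
---
# The same ClusterQueue as before, now with a preemption policy.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: research-cq
spec:
  namespaceSelector: {}
  preemption:
    withinClusterQueue: LowerPriority   # evict lower-priority workloads admitted by this queue
    reclaimWithinCohort: Any            # reclaim quota borrowed by other queues in the cohort
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: "128"
            - name: "memory"
              nominalQuota: "512Gi"
            - name: "nvidia.com/gpu"
              nominalQuota: "8"
```

A job submitted with the `kueue.x-k8s.io/priority-class: high-priority` label can now displace whatever lower-priority work is occupying the GPUs.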
The problem is that vanilla Kubernetes preemption is a disruptive operation. The application running in the pod receives a SIGTERM signal and is expected to shut down gracefully. For a stateless web server, that's generally OK, but for a multi-body simulation that’s been chugging away for 72 hours, a sudden SIGTERM means all that progress is lost unless the application itself has been meticulously engineered with its own checkpointing logic. This puts a heavy burden on the researchers and developers to build resilience into their applications, which is often not feasible for complex or legacy codebases.
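For a sense of what's at stake, here is the kind of Job that gets caught in the crossfire - a long-running GROMACS run submitted through the LocalQueue sketched above (the image tag, command, and resource requests are illustrative). An eviction here normally means starting the simulation over.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gromacs-npt-run
  namespace: structural-bio
  labels:
    kueue.x-k8s.io/queue-name: md-sims   # submit through the LocalQueue above
spec:
  suspend: true          # created suspended; Kueue un-suspends it once quota is admitted
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: gromacs
          image: gromacs/gromacs:2024.2        # illustrative image/tag
          command: ["gmx", "mdrun", "-deffnm", "npt"]
          resources:
            requests:
              cpu: "16"
              memory: "64Gi"
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
```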
This is where Cedana comes in. We work with Kueue to make its preemption capabilities truly non-disruptive. Instead of just sending a SIGTERM and hoping for the best, Cedana hooks into the preemption notice to trigger a system-level checkpoint.
Our technology safely freezes the entire state of the running workload—the process memory, CPU registers, and even the complex state of the GPU—and saves it to storage. The pod can then be safely evicted. When resources become available again, or when the workload is rescheduled on a different node (even in a different cluster!), Cedana restores that saved state, and the application resumes execution as if nothing ever happened.
The workload is completely unaware it was ever stopped and moved.
At Caltech, we're seeing this play out in a real-world scientific environment: a mixed-workload pipeline running GROMACS and Boltz-2 in tandem.

And so:
By combining the Kubernetes-native scheduling of Kueue with Cedana's transparent checkpoint/restore capabilities, you can build a truly robust and efficient platform for scientific computing. You give researchers the flexibility they need without forcing them to become distributed systems engineers, and you ensure that your $200K DGX cluster is always running at its maximum potential.
On top of all this, they ran the cluster on Spot instances - saving ~80% in costs.