Cedana schedules and migrates workloads based on price, performance, SLAs, and resource availability, matching them to user demand in real time. It self-heals at every level of the stack.
Works seamlessly with Kubernetes, Kueue (for HPC-style workloads), KServe (for inference), Kubeflow (for large-scale training), SLURM, Ray, and more.
Scale workloads and clusters up and down with higher performance, better utilization, and faster response times than previously possible. Preempt and save workloads quickly to downscale resources without losing progress or performance.
Continuous, transparent, system-level checkpointing automatically resumes workloads after catastrophic GPU or CPU failures, ensuring your agents, distributed training jobs, and inference jobs meet mission-critical SLAs.
We’ve deployed a test cluster where you can interact with and experiment with the system.
Learn more about how Cedana is transforming compute orchestration and how we can help your organization.
From deploying on your cluster, to the market, to GPU checkpointing, learn our system and get started quickly.