Command your

compute_


Autonomous live migration for GPU workloads.
Increase throughput and performance up to 5x while reducing costs and improving reliability.

Kubernetes & SLURM aware. Runs anywhere your compute does.

Automatically migrates workloads to improve throughput and performance 2-10x.

Cedana schedules and migrates workloads based on price, performance, SLAs, and resource availability, matching them in real-time to user demand. It self-heals at every level of the stack.
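To make the idea concrete, here is a minimal, purely illustrative sketch of price- and performance-aware placement. The node names, fields, and scoring weights are hypothetical assumptions, not Cedana's actual scheduling algorithm, which also weighs SLAs and real-time demand.

```python
# Hypothetical placement-scoring sketch; names and weights are illustrative,
# not Cedana's actual scheduler.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    price_per_hour: float  # $/GPU-hour
    perf_score: float      # relative throughput, higher is better
    free_gpus: int


def score(node: Node, gpus_needed: int, max_price: float) -> float:
    """Return a placement score; higher is better, -inf if infeasible."""
    if node.free_gpus < gpus_needed or node.price_per_hour > max_price:
        return float("-inf")
    # Favor throughput per dollar; a real scheduler would also weigh
    # SLA deadlines, data locality, and the cost of migrating.
    return node.perf_score / node.price_per_hour


def pick_node(nodes, gpus_needed, max_price):
    best = max(nodes, key=lambda n: score(n, gpus_needed, max_price))
    return best if score(best, gpus_needed, max_price) > float("-inf") else None


nodes = [
    Node("spot-a100", 1.10, 1.0, 8),
    Node("ondemand-h100", 3.50, 2.2, 4),
    Node("spot-h100", 1.80, 2.2, 0),  # no free capacity
]
choice = pick_node(nodes, gpus_needed=2, max_price=4.0)
```

In this toy example the cheaper spot A100 wins on throughput per dollar; as prices or availability shift, re-scoring would trigger a live migration to a better node.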

HOW IT WORKS

Magically extend your favorite orchestration platforms.

Works seamlessly with Kubernetes, Kueue (for HPC-style workloads), KServe (for inference), Kubeflow (for large-scale training), SLURM, Ray, and more.

PERFORMANCE

Real-time compute orchestration.

Scale workloads and clusters up and down with higher performance, utilization and faster response times than previously available. Preempt and save workloads quickly to downscale resources without losing progress or performance.

ORCHESTRATION

Increase reliability
and availability.

Continuous, transparent, system-level checkpoints automatically resume workloads through catastrophic GPU/CPU failures. Ensure your agents, distributed training, and inference jobs meet mission-critical SLAs.  
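The fault-tolerance pattern described above can be sketched as a periodic checkpoint loop with resume-on-failure. This is a simplified model under stated assumptions: the checkpoint and restore steps here are placeholders for a system-level mechanism, not Cedana's API, and the failure injection is simulated.

```python
# Hedged sketch of periodic checkpointing with resume-on-failure.
# Saving/restoring a step counter stands in for a system-level
# checkpoint of full process state; this is not Cedana's API.
import random


def run_with_checkpoints(total_steps, interval, fail_prob=0.0, seed=0):
    rng = random.Random(seed)
    saved = 0       # last checkpointed step
    step = saved
    restores = 0
    while step < total_steps:
        step += 1
        if fail_prob and rng.random() < fail_prob:
            step = saved      # "restore": roll back to last checkpoint
            restores += 1
            continue
        if step % interval == 0:
            saved = step      # "checkpoint": persist current progress
    return step, restores


# Even with injected failures, the job completes; at most one
# checkpoint interval of work is ever repeated per failure.
steps, restores = run_with_checkpoints(total_steps=100, interval=10, fail_prob=0.05)
```

The trade-off is the checkpoint interval: shorter intervals bound lost work more tightly but add checkpointing overhead.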

RELIABILITY


Use Cases

Maximize value and reliability with automated GPU orchestration.

  • 20%-80% increase in utilization with GPU live migrations
  • Automatic workload failover
  • Zero-downtime OS/HW upgrades
  • Dynamically resize workloads onto optimal instances without interruption

The fastest, highest-performance, lowest-cost inferencing.

  • 2-10x faster time-to-first token
  • Dynamically resize workloads to optimal instances
  • Automatically reduce idle inferencing time
  • Use spot instances without interruption
  • Faster model hotswapping

Increase the throughput, reliability, and speed of large-scale model training.

  • Real-time checkpoint/restore of multi-node systems
  • Automatic workload failover that preserves in-progress mini-batch work
  • Fully transparent, no code modifications
  • Fine-grained system-level checkpointing
  • High availability and reliability, swap in GPUs and nodes on failure

Orchestrate agent inferencing and training autonomously. Maximize utilization, reliability, and performance.

  • Increase GPU utilization with efficient hot swapping and bin-packing
  • Dynamic scaling for:
    • Larger models
    • Increasing task complexity, context windows, and agent counts
    • Variable workload demands
  • Persistent agent state

Improve the performance and reliability of your gaming infrastructure.

  • Reduce latency by migrating workloads to player geographies
  • Load balance workloads to eliminate resource bottlenecks
  • Automated workload failover
  • Zero-downtime OS/HW upgrades

Increase automation, throughput, and reliability of your HPC workloads.

  • Never lose work on long-running workloads in SLURM
  • Schedule, queue, and prioritize workloads across users and groups dynamically
  • 20-80% lower compute costs
  • Increase workload throughput
  • Automate workflows conditionally based on time and success criteria
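The last point, conditional workflow automation, can be illustrated with a small sketch: run each stage only while a time budget remains and the previous stage succeeded. The function and stage names are hypothetical, not SLURM or Cedana commands.

```python
# Illustrative conditional pipeline: advance only on success and
# within a time budget. Names are hypothetical, not real CLI commands.
import time


def run_pipeline(stages, deadline_s):
    start = time.monotonic()
    results = []
    for name, job in stages:
        if time.monotonic() - start > deadline_s:
            results.append((name, "skipped: out of time"))
            break
        ok = job()  # each stage reports success/failure
        results.append((name, "ok" if ok else "failed"))
        if not ok:
            break   # success criterion not met: stop the chain
    return results


report = run_pipeline(
    [
        ("preprocess", lambda: True),
        ("train", lambda: True),
        ("eval", lambda: False),   # simulated failure halts the chain
        ("publish", lambda: True),
    ],
    deadline_s=60,
)
```

A real deployment would attach these conditions to queued jobs (for example, SLURM job dependencies) rather than an in-process loop.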

Get started

Play in the sandbox

We’ve deployed a test cluster where you can interact with and experiment with the system.

Sandbox

Get a demo

Learn more about how Cedana is transforming compute orchestration and how we can help your organization.

Connect

API Reference & Guides

From deploying on your cluster, to the market, to GPU checkpointing, learn our system and get started quickly.

VIEW DOCS
Backers / Partners