Cedana is the

Automation Layer

for AI Factories

Automatically save, migrate, and resume live GPU workloads across your infrastructure, increasing AI productivity per GPU dollar. Works from a single node to entire AI factories. Start with one instance and scale seamlessly.

Seamlessly extend Kubernetes & SLURM. On-premise and Multi-cloud.

Unlock
Your Scheduler

Seamless integration with your existing infrastructure, designed for high-performance computing.

Work with What You Have

No rip-and-replace. No code changes. No disruption to your teams.

Kubernetes & SLURM

Built for HPC. Native support for SLURM workload manager job queues.

Your First Migration in Under 30 min

SCHEDULE A DEMO NOW

AI Workloads
Cannot Move Once Running

Expensive Failures

Failures and preemptions force workloads to restart from scratch. Up to 65% of compute wasted.

Over-provisioned GPUs

In high-scale AI infrastructure, capacity is routinely over-provisioned by 10-50% just to maintain reliability and hit SLAs.

Idle GPUs

Valuable compute remains stranded while critical work is delayed.

Rigid Infrastructure

Schedulers cannot dynamically adapt to failures, demands, or changing priorities.

Cedana Brings
Liquidity to AI Infrastructure

Automated Reliability

Workloads automatically migrate to healthy infrastructure and resume after failures with no lost progress.

Eliminate Overprovisioning

Automatic migration and recovery remove the need for large safety buffers to meet SLAs and QoS.

Adaptive Infrastructure

Kubernetes and SLURM adapt workloads in real time to failures and demand.

Maximize Throughput

Workloads shift to idle GPUs, reclaiming capacity and maximizing cluster throughput.

The Cedana
Difference

Without Migration

Expensive Failures

Up to 65% compute lost

Idle GPUs

Stranded Compute While Jobs Wait

Over-provisioned GPUs

10-50% Capacity Buffers

Rigid Infrastructure

Schedulers Cannot Adapt

With Migration

Automated Reliability

Workloads Resume Automatically

Maximize Throughput

Workloads Migrate to Idle GPUs

Adaptive Infrastructure

Workloads Adjust in Real Time

Eliminate Overprovisioning

SLAs and Reliability without Safety Buffers

Automation Use Cases

Reliability

Automatically continue workloads from catastrophic failures without losing progress or restarting.

Productivity

Automatically migrate workloads to eliminate idle GPUs and increase throughput.

Operations

Perform maintenance without losing workload progress or manual re-submission.

Built for
High-Performance AI and HPC

Native support for NCCL and MPI workloads. Achieve massive scale with node-aware scheduling and low-latency interconnect optimization.

Advanced Workloads

Supports distributed multi-node compute, including NCCL and MPI workloads, on both CPU and GPU.

Scalability

Works across on-premise clusters, hybrid environments, and cloud infrastructure. Scale from a single node to a cluster to an AI factory.

Backers / Partners