Cedana is the

Automation Layer

for AI Factories

Automatically save, migrate, and resume live GPU workloads across your infrastructure, increasing AI productivity per GPU dollar. Works from a single node to entire AI factories. Start with one instance and scale seamlessly.

Seamlessly extend Kubernetes & SLURM. On-premise and Multi-cloud.

Unlock
Your Scheduler

Seamless integration with your existing infrastructure, designed for high-performance computing.

Work with What You Have

No rip-and-replace. No code changes. No disruption to your teams.

Kubernetes & SLURM

Built for HPC. Native support for SLURM workload manager job queues.

Your First Migration in Under 30 min

SCHEDULE A DEMO NOW

AI Workloads
Cannot Move Once Running

Expensive Failures

Failures and preemptions force workloads to restart from scratch. Up to 65% of compute wasted.

Over-provisioned GPUs

In high-scale AI infrastructure, capacity is routinely over-provisioned by 10-50% just to maintain reliability and hit SLAs.

Idle GPUs

Valuable compute remains stranded while critical work is delayed.

Rigid Infrastructure

Schedulers cannot dynamically adapt to failures, demands, or changing priorities.

Cedana Brings
Liquidity to AI Infrastructure

Automated Reliability

Workloads automatically migrate to healthy infrastructure and resume after failures with no lost progress.

Eliminate Overprovisioning

Automatic migration and recovery remove the need for large safety buffers to meet SLAs and QoS.

Adaptive Infrastructure

Kubernetes and SLURM adapt workloads in real time to failures and demand.

Maximize Throughput

Workloads shift to idle GPUs, reclaiming capacity and maximizing cluster throughput.

The Cedana
Difference

Without Migration

Expensive Failures

Up to 65% compute lost

Idle GPUs

Stranded Compute While Jobs Wait

Over-provisioned GPUs

10-50% Capacity Buffers

Rigid Infrastructure

Schedulers Cannot Adapt

With Migration

Automated Reliability

Workloads Resume Automatically

Maximize Throughput

Workloads Migrate to Idle GPUs

Adaptive Infrastructure

Workloads Adjust in Real Time

Eliminate Overprovisioning

SLAs and Reliability without Safety Buffers

Automation Use Cases

Reliability

Automatically continue workloads from catastrophic failures without losing progress or restarting.

Productivity

Automatically migrate workloads to eliminate idle GPUs and increase throughput.

Operations

Perform maintenance without losing workload progress or manual re-submission.

Built for
High-Performance AI and HPC

Native support for NCCL and MPI workloads. Achieve massive scale with node-aware scheduling and low-latency interconnect optimization.

Advanced Workloads

Supports distributed multi-node compute, including NCCL and MPI workloads, on both CPU and GPU.

Scalability

Works across on-premise clusters, hybrid environments, and cloud infrastructure. Scale from a single node to a cluster to an AI factory.

Backers / Partners