GPU Interposing and Performance

11% higher inference throughput and 10x faster cold starts, with minimal overhead!

May 29, 2025

GPU Interposing

Checkpointing and restoring a process or container that uses an NVIDIA GPU is a complex task. It requires low-level access to the driver, which is challenging due to the proprietary nature of NVIDIA's drivers. To overcome this, Cedana interposes at the Driver API while the process is running. This approach enables Cedana to reliably, transparently, and deterministically capture and restore GPU state — even for multi-GPU workloads or workloads scattered across several machines — while also enabling live migration.
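
To make the idea of interposition concrete, here is a minimal sketch in Python. It is not Cedana's implementation and does not touch the CUDA Driver API: `ToyDriver`, `Interposer`, and `replay` are hypothetical names invented for illustration. The sketch only shows the general pattern of forwarding each call unchanged while logging it, so that the same state can later be rebuilt on a fresh target.

```python
from typing import Any, Callable


class ToyDriver:
    """Hypothetical stand-in for a GPU driver; NOT the CUDA Driver API."""

    def __init__(self) -> None:
        self.memory: dict[int, bytes] = {}
        self._next_handle = 1

    def alloc(self, size: int) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self.memory[handle] = bytes(size)  # zero-filled buffer
        return handle

    def write(self, handle: int, data: bytes) -> None:
        self.memory[handle] = data


class Interposer:
    """Forwards every call to the wrapped driver and logs it for replay."""

    def __init__(self, driver: ToyDriver) -> None:
        self._driver = driver
        self.log: list[tuple[str, tuple[Any, ...]]] = []

    def __getattr__(self, name: str) -> Callable[..., Any]:
        target = getattr(self._driver, name)

        def wrapper(*args: Any) -> Any:
            self.log.append((name, args))  # capture the call
            return target(*args)           # then pass it through unchanged

        return wrapper


def replay(log: list[tuple[str, tuple[Any, ...]]], driver: ToyDriver) -> ToyDriver:
    """Rebuild state on a fresh driver by re-issuing the logged calls in order."""
    for name, args in log:
        getattr(driver, name)(*args)
    return driver


app = Interposer(ToyDriver())
buf = app.alloc(4)
app.write(buf, b"abcd")
restored = replay(app.log, ToyDriver())  # same contents, rebuilt from the log
```

Replay works in this toy because handle allocation is deterministic. A real system would also have to deal with opaque handles, device memory contents, and in-flight work, which this sketch ignores.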

While this could theoretically introduce performance overhead, Cedana minimizes the cost with a set of low-level optimizations that effectively turn Cedana into a virtual machine for NVIDIA GPUs. By operating at the driver level (rather than at the higher-level CUDA runtime or the application layer), Cedana gains deep visibility into GPU operations. This enables advanced optimizations in real time, such as merging CUDA graph API calls and eliminating redundant operations, so that Cedana functions almost like a just-in-time (JIT) compiler for GPUs.
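
As an illustration of what eliminating redundant operations can mean for an intercepted call stream, the sketch below drops state-setting calls that would set a value that is already in effect. The call names (`set_device`, `set_stream`, `launch`) are hypothetical placeholders, not real driver entry points, and this is not a description of Cedana's actual optimization passes.

```python
from typing import NamedTuple


class Call(NamedTuple):
    name: str
    arg: int


# Hypothetical state-setting calls: re-setting the current value is a no-op.
STATE_SETTERS = {"set_device", "set_stream"}


def drop_redundant(calls: list[Call]) -> list[Call]:
    """Remove state-setting calls that do not change the tracked state."""
    current: dict[str, int] = {}
    kept: list[Call] = []
    for call in calls:
        if call.name in STATE_SETTERS:
            if current.get(call.name) == call.arg:
                continue  # value already in effect, skip the call
            current[call.name] = call.arg
        kept.append(call)
    return kept


stream = [
    Call("set_device", 0),
    Call("launch", 1),
    Call("set_device", 0),  # redundant: device 0 is already selected
    Call("launch", 2),
]
optimized = drop_redundant(stream)  # three calls remain
```

The pass is only safe because it tracks the state it is skipping. Calls with side effects, such as `launch` here, are always kept.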

Moreover, GPU workloads often involve the CPU idling while waiting for GPU tasks to complete. Because Cedana still runs on the CPU, its overhead is frequently hidden or amortized over time — often resulting in no measurable cost at all.
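
A deliberately simplified model shows why CPU-side cost can disappear. Assume CPU work fully overlaps with asynchronous GPU work, so one step takes as long as the slower of the two. Both the model and the numbers below are assumptions for illustration, not measurements.

```python
def step_time_ms(cpu_ms: float, gpu_ms: float, intercept_ms: float = 0.0) -> float:
    """Wall time of one step when CPU work overlaps asynchronous GPU work."""
    return max(cpu_ms + intercept_ms, gpu_ms)


def modeled_overhead_pct(cpu_ms: float, gpu_ms: float, intercept_ms: float) -> float:
    """Relative slowdown caused by the extra CPU-side interception time."""
    native = step_time_ms(cpu_ms, gpu_ms)
    return 100.0 * (step_time_ms(cpu_ms, gpu_ms, intercept_ms) - native) / native


# GPU-bound step: the CPU has 8 ms of slack, so 1 ms of interception is hidden.
hidden = modeled_overhead_pct(cpu_ms=2.0, gpu_ms=10.0, intercept_ms=1.0)

# Nearly CPU-bound step: the same 1 ms now extends the critical path.
visible = modeled_overhead_pct(cpu_ms=9.5, gpu_ms=10.0, intercept_ms=1.0)
```

In this model the interception cost is invisible as long as CPU time plus interception time stays below GPU time. It only shows up once the CPU becomes the bottleneck.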

In practice, the performance overhead of Cedana’s GPU interception is minimal, and in some cases Cedana even improves performance. The benchmarks below demonstrate this in more detail.

Performance is a moving target, and Cedana's goal is to get to native or better — so check back in for regular updates!

Benchmarks

The benchmarks listed below were run on a range of machines:

Intel Xeon Platinum 8480+ CPU with an NVIDIA H100 PCIe GPU
AMD EPYC 7R13 CPU with an NVIDIA L4 GPU
AMD EPYC 7J13 CPU with an NVIDIA A100 SXM4 GPU

Complete system specifications can be found at the top-right corner of each image.

Raw Performance

Benchmarks that measure raw GPU performance, such as on-device memory bandwidth and compute throughput, show a minimal overhead of 3% with Cedana.
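
The post does not spell out how the overhead percentages are computed. A common convention, assumed here, is the relative increase in run time over the native run. The helper names below are hypothetical, and the timing harness is a generic sketch rather than the benchmark code behind these results.

```python
import statistics
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], repeats: int = 5) -> float:
    """Time fn several times and return the median wall-clock duration."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_overhead_pct(native_s: float, intercepted_s: float) -> float:
    """Percentage by which the intercepted run is slower than the native run."""
    return 100.0 * (intercepted_s - native_s) / native_s


# Example with made-up timings: 1.00 s native vs 1.03 s intercepted.
example = relative_overhead_pct(1.00, 1.03)
```

Taking the median over several repeats reduces the influence of one-off outliers such as a cold cache on the first run. A negative result would mean the intercepted run was faster.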

These workloads rely almost entirely on the GPU, with minimal CPU involvement. Since Cedana’s interception introduces only a small CPU-side cost, it has minimal impact on these GPU-bound benchmarks, as expected. These were run on an AMD EPYC 7R13 CPU with an NVIDIA L4 GPU (full spec in top-right of the image).

The benchmark below measures the time taken to launch multiple kernels concurrently. Even under this concurrent load, the observed overhead was about 3%. This was run on an Intel Xeon Platinum 8480+ CPU with an NVIDIA H100 PCIe GPU (full specs shown in the top-right corner of the image).

Memory Bandwidth

When performing memory transfers to/from the GPU, interception could, in theory, introduce some overhead. However, Cedana is designed to preserve the performance of these operations.

Minimal degradation is observed for small memory transfers, where the per-call cost of interception accumulates because moving the same amount of data takes more CUDA calls. These were run on an Intel Xeon Platinum 8480+ CPU with an NVIDIA H100 PCIe GPU (full specs shown in the top-right corner of the image).
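
A back-of-the-envelope model illustrates why small transfers are the sensitive case: each call pays a fixed latency on top of the time spent moving bytes, so any extra per-call cost weighs more when the payload is small. The peak bandwidth and latency figures below are made-up assumptions, not values taken from the benchmarks.

```python
def effective_bandwidth_gbps(size_bytes: int, peak_gbps: float, per_call_us: float) -> float:
    """Achieved GB/s when each transfer pays a fixed per-call latency."""
    transfer_s = size_bytes / (peak_gbps * 1e9)
    total_s = transfer_s + per_call_us * 1e-6
    return size_bytes / total_s / 1e9


def bandwidth_loss_pct(size_bytes: int, peak_gbps: float, base_us: float, extra_us: float) -> float:
    """Bandwidth lost when extra_us of interception is added to every call."""
    native = effective_bandwidth_gbps(size_bytes, peak_gbps, base_us)
    intercepted = effective_bandwidth_gbps(size_bytes, peak_gbps, base_us + extra_us)
    return 100.0 * (native - intercepted) / native


# Hypothetical numbers: 25 GB/s link, 5 us native call latency, 1 us added per call.
small = bandwidth_loss_pct(4 * 1024, 25.0, 5.0, 1.0)  # 4 KiB transfer
large = bandwidth_loss_pct(1 << 30, 25.0, 5.0, 1.0)   # 1 GiB transfer
```

Under these assumptions the 4 KiB transfer loses a double-digit percentage of its bandwidth, while the 1 GiB transfer loses a small fraction of a percent. That matches the qualitative shape described above.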

Training Overhead

As mentioned earlier, for longer-running workloads the interception overhead is largely amortized. This is because most GPU workloads involve the CPU waiting idly for the GPU to complete its tasks, making the CPU-side cost negligible over time. First, we measured the overhead of each training iteration as we increased the model size. The per-iteration overhead appeared to increase with model size.

On the other hand, when we measure the overhead over time for a model with 120 million parameters and let training continue to run, we observe a total overhead of just 1.3%, likely due to amortization. With some of our planned work on the horizon, our goal is to get this below 0.1%.
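
One way to picture amortization, offered as an assumption rather than an explanation taken from the measurements, is to split the added cost into a one-time part (for example, setup during the first iterations) and a small recurring part. The cumulative overhead then shrinks toward the recurring share as training continues. All numbers below are hypothetical.

```python
def cumulative_overhead_pct(
    iterations: int,
    native_iter_s: float,
    one_time_s: float,
    per_iter_s: float,
) -> float:
    """Total added time as a percentage of total native time after N iterations."""
    native_total = iterations * native_iter_s
    added_total = one_time_s + iterations * per_iter_s
    return 100.0 * added_total / native_total


# Hypothetical costs: 100 ms native iteration, 2 s one-time cost, 1 ms per iteration.
early = cumulative_overhead_pct(10, 0.1, 2.0, 0.001)
late = cumulative_overhead_pct(100_000, 0.1, 2.0, 0.001)
```

With these inputs the overhead is dominated by the one-time cost after 10 iterations and falls to roughly the recurring 1% share after 100,000 iterations.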

These were run on an AMD EPYC 7R13 CPU with an NVIDIA L4 GPU (full specs shown in the top-right corner of the image).

Inference Overhead

Below are the results of an inference benchmark on a Llama 3.1 8B model, comparing runtime throughput with Cedana against native execution.

Runtime throughput with Cedana is about 11% higher than native. Any overhead has been mitigated by optimizations that are only possible due to the asynchronous design of our GPU virtual machine. This design allows the GPU controller to make decisions before actually executing CUDA driver API calls.
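
The sketch below illustrates the general idea of deciding before executing: calls are queued instead of being run immediately, and the queue is inspected as a batch when it is flushed. The specific decision shown, merging copies to adjacent destination ranges, is an assumption chosen for illustration. The post does not say which decisions Cedana's controller makes, and `Copy` and `DeferredQueue` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Copy:
    dst: int     # destination offset in bytes
    data: bytes  # payload to write at that offset


class DeferredQueue:
    """Buffers copy requests and coalesces adjacent ones before executing them."""

    def __init__(self, execute: Callable[[Copy], None]) -> None:
        self._execute = execute
        self._pending: list[Copy] = []

    def submit(self, copy: Copy) -> None:
        self._pending.append(copy)  # returns immediately; nothing runs yet

    def flush(self) -> int:
        """Merge back-to-back copies, run the result, return calls issued."""
        merged: list[Copy] = []
        for c in self._pending:
            last = merged[-1] if merged else None
            if last is not None and last.dst + len(last.data) == c.dst:
                merged[-1] = Copy(last.dst, last.data + c.data)
            else:
                merged.append(c)
        self._pending.clear()
        for c in merged:
            self._execute(c)
        return len(merged)


device = bytearray(12)  # stand-in for device memory


def write(c: Copy) -> None:
    device[c.dst:c.dst + len(c.data)] = c.data


queue = DeferredQueue(write)
queue.submit(Copy(0, b"abcd"))
queue.submit(Copy(4, b"efgh"))  # adjacent to the previous copy
queue.submit(Copy(10, b"zz"))   # not adjacent
issued = queue.flush()          # three submissions become two executed calls
```

Because the caller never waits on individual calls, the queue is free to rewrite the pending batch as long as the final effect is the same.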

These were run on an AMD EPYC 7J13 CPU with an NVIDIA A100 SXM4 GPU.

Cold Start Time

Inference cold starts can be greatly reduced when restoring from a checkpoint. Here we compare native cold starts with starts restored by NVIDIA’s CUDA C/R (light grey) and by Cedana (blue).

Cedana’s restored cold starts are an order of magnitude faster than native cold starts. For larger models (as shown in the second image), Cedana’s cold start times remain nearly constant regardless of model size. This is likely because the restore saturates the GPU’s memory bandwidth while a large fixed component dominates the total time. In contrast, restores using NVIDIA’s CUDA checkpoint/restore (C/R) show little to no improvement over native cold starts for most larger models. These were run on an Intel Xeon Platinum 8480+ CPU with an NVIDIA H100 PCIe GPU (full specs shown in the top-right corner of the image).

Conclusion

The overhead introduced by Cedana’s GPU interception is minimal — typically in the range of 1–3%. In some cases, it can even result in performance gains (see Inference Overhead). Overall, the benefits enabled by Cedana, such as live migration and significantly faster cold starts, often far outweigh the costs.
