Baseten brings AI video to life on Nebius

Video generation workloads are long-running, memory-intensive, and require stable scaling. Baseten and Nebius deliver the infrastructure they demand.

Nebius AI Cloud + Baseten Inference Stack
TL;DR

Baseten runs a custom text-to-video system on Nebius using the Baseten Inference Stack.

The Inference Runtime uses custom modality-specific kernels, kernel fusion, attention kernels, continuous batching, request prioritization, and topology-aware parallelism. The Inference-optimized Infrastructure provides intelligent request routing, geo-aware load balancing, SLA-aware autoscaling, and fast cold starts, plus active-active reliability and multi-cloud capacity management.

The outcome: predictable P90/P99 latency, high utilization, and stable uptime on Nebius capacity.

Text-to-video is not a simple extension of another modality. It's a complex workload that runs longer than other modalities, stresses memory, and creates scaling challenges when a single request can demand an entire node. Users notice even slight shifts in latency and quality. The only way to meet that bar in production is to treat performance, reliability, and cost as one system.

The Baseten Inference Stack provides the primitives to do this cleanly, then lets us place that system on Nebius, so capacity is where we need it when demand spikes.

Optimizing performance at the Model Runtime layer 

We start by looking at the Model Runtime. For video generation, we select modality-specific kernels that understand image and video execution patterns, along with custom kernels optimized for the memory access patterns video workloads demand. Kernel fusion collapses chains of small operations into fewer launches, and attention kernels are tuned to balance quantization and quality for video. Asynchronous compute keeps the device busy rather than waiting on each launch.
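To make that concrete, here is a minimal PyTorch sketch of what fusion, a fused attention kernel, and asynchronous compute look like. It is an illustration, not our production runtime: the denoising sub-step, shapes, and clamp bounds are hypothetical, and the model call is stubbed out.

```python
import torch
import torch.nn.functional as F

# Hypothetical denoising sub-step; torch.compile fuses the chain of small
# elementwise ops (scale, subtract, clamp) into fewer kernel launches,
# which matters when the step runs thousands of times per request.
@torch.compile
def denoise_substep(latents, noise_pred, sigma):
    scaled = noise_pred * sigma
    latents = latents - scaled
    return latents.clamp(-4.0, 4.0)

def attention_block(q, k, v):
    # Fused scaled-dot-product attention; PyTorch dispatches to an optimized
    # backend (e.g. FlashAttention) when shapes and dtypes allow.
    return F.scaled_dot_product_attention(q, k, v)

def run_steps(latents, sigmas):
    # Asynchronous compute: kernels are queued on a CUDA stream and the host
    # keeps issuing work instead of synchronizing after every launch.
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        for sigma in sigmas:
            noise_pred = torch.randn_like(latents)  # stand-in for the model call
            latents = denoise_substep(latents, noise_pred, sigma)
    stream.synchronize()  # block only once, at the end of the loop
    return latents
```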

We use request prioritization to prevent long-running generations from pushing up tail latency, allowing urgent requests to move ahead in the queue. When the model spans multiple devices, we apply topology-aware parallelism and scale to multi-node inference, while minimizing communication overhead. The net effect is steady step time and higher effective frames per GPU second. 
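As a rough sketch of the prioritization idea (the scheduler below and its names are hypothetical, and the real system also handles continuous batching and preemption): short, interactive requests jump the queue so they do not inherit the tail latency of long renders.

```python
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class QueuedRequest:
    priority: int                        # lower value = served sooner
    seq: int                             # tie-breaker keeps FIFO within a priority
    request_id: str = field(compare=False)
    est_frames: int = field(compare=False)

class PriorityScheduler:
    """Toy scheduler: interactive previews outrank long batch generations."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, request_id: str, est_frames: int, interactive: bool):
        # Short, interactive requests get priority 0; long renders get 1.
        priority = 0 if interactive or est_frames <= 48 else 1
        heapq.heappush(self._heap,
                       QueuedRequest(priority, next(self._counter),
                                     request_id, est_frames))

    def next_batch(self, frame_budget: int):
        # Fill the batch up to a frame budget, highest priority first, so one
        # large render cannot starve everything queued behind it.
        batch, used = [], 0
        while self._heap and used + self._heap[0].est_frames <= frame_budget:
            item = heapq.heappop(self._heap)
            batch.append(item.request_id)
            used += item.est_frames
        return batch
```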

Driving repeatable performance via Inference-optimized Infrastructure

Performance only matters if we can deliver it every hour of the day. The Inference-optimized Infrastructure is the other half of the stack, and it is where Nebius shows up. 

Intelligent request routing directs work to replicas that have already processed the user's workload and hold cached elements, cutting time-to-output and avoiding redundant computation. Geo-aware load balancing keeps traffic close to users and in Nebius regions with sufficient headroom. Autoscaling expands capacity when traffic rises and contracts when it falls. Policies are SLA-aware and paired with fast cold starts, so new replicas contribute within the SLO, not after the spike has passed. In multi-stage systems, we scale components independently so that no single step becomes the bottleneck.
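A toy version of that routing decision, assuming a simple replica record (the fields and scoring below are illustrative, not the production router):

```python
from dataclasses import dataclass

@dataclass
class Replica:
    replica_id: str
    region: str
    in_flight: int          # requests currently being served
    capacity: int           # max concurrent requests
    cached_keys: set        # artifacts already resident (e.g. weights, prompts)

def route(replicas, user_region, cache_key):
    """Prefer replicas that already hold the cached artifacts, then the user's
    region, then the least-loaded replica."""
    candidates = [r for r in replicas if r.in_flight < r.capacity]
    if not candidates:
        raise RuntimeError("no headroom; autoscaler should be adding replicas")

    def score(r):
        cache_hit = 0 if cache_key in r.cached_keys else 1   # cached first
        same_region = 0 if r.region == user_region else 1    # then geo-affinity
        load = r.in_flight / r.capacity                      # then load
        return (cache_hit, same_region, load)

    return min(candidates, key=score)
```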

Reliability is continuous rather than reactive. We operate with active-active reliability and cross-cloud orchestration, ensuring that a zone or provider event does not result in user impact. Multi-cloud Capacity Management treats Nebius as a first-class pool within a single control plane. That gives us elasticity for launches and a clean way to manage cost when traffic normalizes. 
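Conceptually, that capacity management reduces to choosing a pool under a placement policy. A minimal sketch, with hypothetical fields and a placeholder policy rather than our actual one:

```python
from dataclasses import dataclass

@dataclass
class CapacityPool:
    provider: str            # e.g. "nebius"
    region: str
    free_gpus: int
    cost_per_gpu_hour: float

def pick_pool(pools, gpus_needed, preferred_provider="nebius"):
    """Toy placement policy: prefer the primary provider when it has headroom,
    otherwise burst to the cheapest pool that can fit the request."""
    fits = [p for p in pools if p.free_gpus >= gpus_needed]
    if not fits:
        return None  # signal the capacity manager to queue or defer
    preferred = [p for p in fits if p.provider == preferred_provider]
    return min(preferred or fits, key=lambda p: p.cost_per_gpu_hour)
```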

Selecting protocols that fit the workload 

Text-to-video often benefits from interactive or streaming behavior. The stack supports protocol flexibility across HTTPS, WebSockets, and gRPC. That lets us stream progress and control long-running requests without bolting on a side channel. The same traffic paths are used in staging and production, which ensures consistent behavior across environments. 
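As an illustration of why that flexibility matters, here is a minimal, transport-agnostic sketch of streaming progress from a long-running generation. The function names and event schema are hypothetical; the point is that the generation loop yields events and the transport (WebSocket, gRPC stream, or chunked HTTPS) is just a `send` callable.

```python
import asyncio
import json

async def generate_video(prompt: str, steps: int = 30):
    """Toy long-running generation that yields progress events the caller can
    stream to the client instead of waiting for the final payload."""
    for step in range(1, steps + 1):
        await asyncio.sleep(0.05)          # stand-in for one denoising step
        yield {"type": "progress", "step": step, "total": steps}
    yield {"type": "complete", "video_url": "https://example.com/out.mp4"}

async def stream_to_client(send, prompt: str):
    # `send` is whatever the transport provides; the loop does not care which.
    async for event in generate_video(prompt):
        await send(json.dumps(event))

async def main():
    async def fake_send(message: str):     # stand-in transport for the example
        print(message)
    await stream_to_client(fake_send, "a timelapse of a city at night")

if __name__ == "__main__":
    asyncio.run(main())
```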

What we measure and why it matters 

Production success is measured, not assumed. We track total request latency and per-step denoising times against P90/P99 targets. We monitor utilization to avoid over-provisioning during quiet periods and under-provisioning during bursts, and alerting is aligned with these goals so actions are tied to impact. When a change in weights or parameters shifts behavior, the runtime and the control plane surface it quickly, and the autoscaler adapts.
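For intuition, a small sketch of the percentile math behind those targets; the thresholds below are placeholders, not our actual SLOs:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over a window of latency samples (seconds)."""
    if not samples:
        return float("nan")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def check_slo(latencies, p90_target=20.0, p99_target=45.0):
    """Toy SLO check: flag breaches so alerts are tied to user-visible impact."""
    p90, p99 = percentile(latencies, 90), percentile(latencies, 99)
    breaches = []
    if p90 > p90_target:
        breaches.append(f"P90 {p90:.1f}s exceeds {p90_target:.1f}s target")
    if p99 > p99_target:
        breaches.append(f"P99 {p99:.1f}s exceeds {p99_target:.1f}s target")
    return p90, p99, breaches
```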

How a request flows on Nebius 

Ingress receives the request. Intelligent request routing selects a Nebius region with geo-aware load balancing, keeping traffic close to the user and within available headroom. The request is directed to pre-warmed replicas running the modality-specific runtime. Continuous batching and request prioritization maintain the latency envelope while custom kernels and asynchronous compute execute the denoise loop.

Autoscaling evaluates live signals against SLA-aware policies and adjusts Nebius capacity using fast cold starts. If additional headroom is required, our Multi-cloud Capacity Management system scales capacity across providers without application changes.
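A simplified picture of an SLA-aware scaling decision (the policy and thresholds here are illustrative only, not the production autoscaler):

```python
def desired_replicas(current, p99_s, p99_target_s, queue_depth,
                     min_replicas=1, max_replicas=64):
    """Toy SLA-aware policy: scale out when tail latency or queueing breaches
    the target, scale in gently when there is comfortable headroom."""
    if p99_s > p99_target_s or queue_depth > 2 * current:
        proposed = current + max(1, current // 2)     # grow ~50% on a breach
    elif p99_s < 0.5 * p99_target_s and queue_depth == 0:
        proposed = current - 1                        # shrink one at a time
    else:
        proposed = current
    return max(min_replicas, min(max_replicas, proposed))
```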

Why this pairing works

Nebius provides large GPU pools and low-friction capacity growth. The Baseten Inference Stack converts that capacity into consistent latency and reliable throughput. The runtime raises raw performance through custom kernels, kernel fusion, and modality-specific runtimes. The infrastructure layer turns that performance into a property of the service through intelligent request routing, geo-aware load balancing, autoscaling, fast cold starts, and active-active reliability. The result is a text-to-video system that delivers the same experience at noon on a Tuesday and during a creator surge on a Saturday night. 

Closing thoughts 

Shipping video generation is a systems challenge. The Baseten Inference Stack provides the runtime and control plane to solve it. Nebius gives us the capacity to scale it. Together, we achieve predictable P90/P99 latency, strong utilization, and uptime that remains steady as traffic fluctuates. That is what turns a demo into a product.

Experience Baseten Inference for yourself and apply to receive up to $10k in credits!
