
NVIDIA Cosmos 3: Robots finally take over

NVIDIA's Cosmos 3 Might Be the Most Important AI Model of the Year. Almost Nobody Will See Its Output.

TL;DR

NVIDIA Cosmos 3 is a foundation model built for things that move in the physical world. Unlike traditional generative video models, Cosmos 3 is designed specifically to help developers build, train, and evaluate robots, autonomous systems, and vision AI agents.

Deploy it on Baseten.

The door problem: Why robotics is harder than it looks

A humanoid robot walks up to a door. It reaches for the handle, fumbles, then turns it and pushes through. The clip racks up four million views on Twitter. Everyone says "cool" and moves on with their lives.

Robotics engineers see something very different: months of data collection, simulation, training, and validation required to make that behavior work reliably across thousands of different doors and environments.

Door opening is the canonical "easy task that's secretly hard." A three-year-old does it without thinking. A robot has to detect the handle, decide whether the door pushes or pulls, decide whether the handle is a lever, knob, or bar, approach at the right angle, grip without crushing or slipping, apply just enough torque, swing through the arc, and walk through before the door closes. Multiply by every door style on Earth: glass, push bar, hospital bumper, badge reader, mortise lock, weird European pull. A robot trained in lab A fails the moment you put it in office B with different doors.

This is the generalization gap in robotics. It's why generalist robots don't exist yet as products, and it's exactly the challenge Cosmos 3 was built to address.

What Cosmos 3 actually is

Cosmos 3 is a world foundation model. The key word is "world," not "video." While most generative video models are designed to produce visually compelling content, Cosmos 3 is designed to understand and model how the physical world behaves. Rather than focusing solely on pixels, it reasons about the objects, actions, motion, and cause-and-effect relationships that govern real-world environments. Built as a single unified omni-model, Cosmos 3 can reason and generate across text, images, video, audio, and actions. The same model can interpret observations, generate future scenarios, recover actions from video, or produce action sequences for agents and robots.

The Cosmos 3 family includes Cosmos 3 Nano, optimized for efficient deployment and real-time workloads, and Cosmos 3 Super, designed for maximum reasoning capability and generation quality.

The question everyone asks first: What does Cosmos 3 actually do?

The answer depends on how you use it.

Cosmos 3 supports six capabilities through a single omni-model architecture:

  • text2image - generate images from text prompts

  • text2video - generate videos from text prompts

  • image2video - extend images into videos

  • forward_dynamics - predict what happens next in a scene

  • inverse_dynamics - recover actions from observations or video

  • policy - generate action sequences for agents and robots

For robots and Physical AI, the last three are often the most interesting.
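One way to keep the six modes straight in client code is a small lookup table. The sketch below is only an illustration: the mode strings follow the list above (assuming the API expects the lowercase inverse_dynamics spelling used in the sample requests later in this post), while MODES, ACTION_MODES, and describe_mode are hypothetical names, not part of any Cosmos or Baseten SDK.

```python
# Hypothetical client-side lookup; these names are illustrative, not an official SDK.
MODES = {
    "text2image": "generate images from text prompts",
    "text2video": "generate videos from text prompts",
    "image2video": "extend images into videos",
    "forward_dynamics": "predict what happens next in a scene",
    "inverse_dynamics": "recover actions from observations or video",
    "policy": "generate action sequences for agents and robots",
}

# The last three modes are the ones singled out above for robotics work.
ACTION_MODES = ("forward_dynamics", "inverse_dynamics", "policy")


def describe_mode(mode: str) -> str:
    """Return the description for a mode, rejecting unknown or miscased names."""
    if mode not in MODES:
        raise ValueError(f"unknown model_mode {mode!r}; expected one of {sorted(MODES)}")
    return MODES[mode]


if __name__ == "__main__":
    for name in ACTION_MODES:
        print(f"{name}: {describe_mode(name)}")
```

Validating the mode name on the client side catches typos before a request ever leaves your machine.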

Two ways to use Cosmos 3 in robotics

Path A: Cosmos 3 in the cockpit

In this approach, a robot sends observations directly to Cosmos 3. The model analyzes what it sees and returns the actions the robot should take next.

For example:

  • Robot observes a door

  • Cosmos determines how to approach it

  • Cosmos generates an action sequence

  • Robot executes the actions

This works and is valuable for research, experimentation, and rapid prototyping. However, most production robots cannot rely on a large foundation model running remotely. They need fast, deterministic control loops that run locally with minimal latency. A commercial butler robot, for example, should be able to open doors without an internet connection.
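To make the cockpit loop concrete, here is a minimal sketch of the observe, plan, execute cycle. Everything in it is an assumption for illustration: run_cockpit_loop, get_observation, plan, and execute are hypothetical names, and the planner is passed in as a plain callable so that a wrapper around the Cosmos 3 policy mode could be swapped for the stub used here.

```python
from typing import Callable, List, Sequence

Action = Sequence[float]                      # one low-level command vector
Planner = Callable[[str, str], List[Action]]  # (video_url, prompt) -> action chunk


def run_cockpit_loop(
    get_observation: Callable[[], str],
    plan: Planner,
    execute: Callable[[Action], None],
    prompt: str,
    max_chunks: int = 3,
) -> int:
    """Observe, ask the planner for an action chunk, execute it, repeat.

    Returns the total number of actions executed.
    """
    executed = 0
    for _ in range(max_chunks):
        clip_url = get_observation()    # robot observes the door
        chunk = plan(clip_url, prompt)  # remote model proposes an action sequence
        if not chunk:                   # empty chunk: stop the loop
            break
        for action in chunk:            # robot executes the actions
            execute(action)
            executed += 1
    return executed


if __name__ == "__main__":
    # Stand-ins so the sketch runs without a robot or a network connection.
    log: List[Action] = []
    fake_plan: Planner = lambda url, prompt: [[0.0] * 10 for _ in range(16)]
    n = run_cockpit_loop(
        lambda: "https://example.com/clip.mp4", fake_plan, log.append, "Open the door."
    )
    print(n, "actions executed")
```

Note that the call to plan sits inside the control loop. With a remote model, every iteration pays a network round trip plus inference time, which is exactly the latency problem described above.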

Path B: Cosmos 3 as a physical AI data factory

Instead of controlling the robot directly, Cosmos helps create the training data used to teach smaller, specialized robot policies.

A typical workflow looks like this:

  1. Collect real-world videos from robots, vehicles, factories, or public datasets

  2. Use inverse_dynamics to recover action labels

  3. Use text2video and image2video to generate synthetic variations

  4. Create large-scale training datasets consisting of observations and correct actions

  5. Train a smaller robot policy that runs efficiently onboard

The production robot never calls Cosmos 3. Instead it runs a compact model that learned from data generated by Cosmos.
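Steps 1 through 4 of the workflow can be sketched as a small dataset builder. This is a rough outline under stated assumptions, not a reference pipeline: Example, build_dataset, label_actions, and make_variants are hypothetical names, the two callables stand in for wrappers around inverse_dynamics and the video generation modes, and relabeling each synthetic variant with the same labeling function is one possible design choice that the workflow above does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence

ActionChunk = List[Sequence[float]]


@dataclass
class Example:
    video_url: str        # the observation
    actions: ActionChunk  # action labels for that observation
    synthetic: bool       # True if the clip was generated rather than recorded


def build_dataset(
    real_clips: Iterable[str],
    label_actions: Callable[[str], ActionChunk],  # e.g. wraps an inverse_dynamics call
    make_variants: Callable[[str], List[str]],    # e.g. wraps text2video / image2video
) -> List[Example]:
    """Label each real clip, then label the synthetic variations derived from it."""
    dataset: List[Example] = []
    for clip in real_clips:
        dataset.append(Example(clip, label_actions(clip), synthetic=False))
        for variant in make_variants(clip):
            # Assumption: variants are labeled the same way as real clips.
            dataset.append(Example(variant, label_actions(variant), synthetic=True))
    return dataset


if __name__ == "__main__":
    # Stubs so the sketch runs offline; swap in real API wrappers as needed.
    fake_label = lambda url: [[0.0] * 9]
    fake_variants = lambda url: [f"{url}?variant={i}" for i in range(2)]
    data = build_dataset(["door_a.mp4", "door_b.mp4"], fake_label, fake_variants)
    print(len(data), "training examples")
```

The resulting list of observation and action pairs is what step 5 would consume to train the compact onboard policy.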

[Figure: Training pipeline]

This is the value proposition. The model that drives your robot in production was distilled from data Cosmos manufactured. Cosmos is the factory; your custom policy is the product.

The key insight

Cosmos 3 is not the robot. It's the factory that helps build the robot. The large model produces the data, the small model ships, and the customer never sees the big model. That is why Cosmos 3 represents such an important shift for Physical AI. Rather than simply generating content, it helps generate the data needed to train the next generation of robots, autonomous systems, and intelligent agents.

The six modes, framed by what they do for our door-opener

Cosmos exposes six interaction modes through a single model_mode field on the API. Each one plays a distinct role in the data factory.

Sample request: policy for door opening

curl -X POST "$BASETEN_URL/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_mode": "policy",
    "vision_path": "https://yourcdn.com/robot_at_door_clip.mp4",
    "prompt": "Open the door and walk through.",
    "image_size": 480,
    "fps": 5,
    "action_chunk_size": 16,
    "raw_action_dim": 10,
    "domain_name": "bridge_orig_lerobot",
    "num_steps": 30,
    "guidance": 1.0,
    "shift": 5.0,
    "seed": 0
  }'

The response contains a 16x10 action tensor (action_chunk_size rows by raw_action_dim columns) under sample_outputs.outputs[0].content.action. Pair it with the input video for one training example.
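If you would rather script this than use curl, the same policy request can be sketched in Python with only the standard library. The payload fields and the response path are taken from the request and description above. The helper names policy_payload, extract_actions, and predict are hypothetical, and the layout of the action field (a nested list with one row per step) is an assumption, so check it against a real response before relying on the shape check.

```python
import json
import urllib.request
from typing import Any, Dict, List


def policy_payload(video_url: str, prompt: str) -> Dict[str, Any]:
    """Request body mirroring the sample policy request."""
    return {
        "model_mode": "policy",
        "vision_path": video_url,
        "prompt": prompt,
        "image_size": 480,
        "fps": 5,
        "action_chunk_size": 16,
        "raw_action_dim": 10,
        "domain_name": "bridge_orig_lerobot",
        "num_steps": 30,
        "guidance": 1.0,
        "shift": 5.0,
        "seed": 0,
    }


def extract_actions(response: Dict[str, Any], payload: Dict[str, Any]) -> List[List[float]]:
    """Pull the action tensor out of a response and check its shape.

    Assumes the tensor arrives as a nested list: one row per action step.
    """
    actions = response["sample_outputs"]["outputs"][0]["content"]["action"]
    rows, cols = payload["action_chunk_size"], payload["raw_action_dim"]
    if len(actions) != rows or any(len(row) != cols for row in actions):
        raise ValueError(f"expected a {rows}x{cols} action tensor")
    return actions


def predict(base_url: str, api_key: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST the payload to <base_url>/predict with the same headers as the curl call."""
    request = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Api-Key {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

A typical call would read BASETEN_URL and BASETEN_API_KEY from the environment, pass policy_payload(...) to predict, and store the output of extract_actions next to the input video as one training example.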

Sample request: inverse_dynamics for action labels

curl -X POST "$BASETEN_URL/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_mode": "inverse_dynamics",
    "vision_path": "https://yourcdn.com/human_opening_door.mp4",
    "prompt": "A first-person view of a person reaching for and turning a door handle.",
    "image_size": 480,
    "fps": 10,
    "raw_action_dim": 9,
    "action_chunk_size": 60,
    "domain_name": "av",
    "num_steps": 30,
    "guidance": 1.0,
    "shift": 5.0,
    "seed": 0
  }'

The same call shape works on YouTube footage, dashcam corpora, and factory camera feeds: anywhere there's video of an agent acting.
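Because only the video and the prompt change from clip to clip, request bodies for a whole corpus can be stamped out from one template. The sketch below copies its fixed fields from the inverse_dynamics request above. INVERSE_DYNAMICS_TEMPLATE and labeling_requests are hypothetical names, and reusing the same domain_name, raw_action_dim, and fps for every video source is an assumption; different footage may well need different settings.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Fixed fields copied from the sample inverse_dynamics request.
# Assumption: these values carry over unchanged to other video sources.
INVERSE_DYNAMICS_TEMPLATE: Dict[str, Any] = {
    "model_mode": "inverse_dynamics",
    "image_size": 480,
    "fps": 10,
    "raw_action_dim": 9,
    "action_chunk_size": 60,
    "domain_name": "av",
    "num_steps": 30,
    "guidance": 1.0,
    "shift": 5.0,
    "seed": 0,
}


def labeling_requests(clips: Iterable[Tuple[str, str]]) -> List[Dict[str, Any]]:
    """Build one request body per (video_url, prompt) pair; only those two fields vary."""
    return [
        {**INVERSE_DYNAMICS_TEMPLATE, "vision_path": url, "prompt": prompt}
        for url, prompt in clips
    ]


if __name__ == "__main__":
    corpus = [
        ("https://example.com/dashcam_001.mp4", "A forward-facing dashcam view of a car turning left."),
        ("https://example.com/factory_017.mp4", "An overhead view of an arm picking a part from a bin."),
    ]
    for body in labeling_requests(corpus):
        print(body["vision_path"], "->", body["model_mode"])
```

Each body can then be sent to the same /predict endpoint used in the curl examples.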

What makes this particularly interesting for robotics is that the same model can support multiple stages of the development lifecycle. Some modes help create synthetic training data. Others label existing data. Others predict future outcomes or generate actions directly. Together, they form a toolkit for building Physical AI systems that can learn faster, generalize better, and require significantly less expensive real-world data collection.

Video generation: Cosmos 3 vs. other video generation models

Watch a typical creative video output from Sora 2 or Veo 3 for ten seconds and you'll see physics violations: objects pop in and out, water flows backward, hands have six fingers, gravity does whatever the model felt like. Those models are optimized to look right for two seconds. Cosmos 3 is optimized to produce output that respects physical constraints: object permanence, mass continuity, contact dynamics, friction. For ad creative, nobody cares. For training a robot, the "looks right" lie poisons the dataset.

Why this matters: robotics doesn't have a data flywheel yet

Every other modern ML domain has compounding free data. Robotics doesn't. That's the entire reason Cosmos exists.

Robot demonstration data is the only ML training input in 2026 that still requires a person with a $300k teleoperation rig physically moving a robot arm, paid $50-150 per hour, producing 50-200 demos per hour. A single generalist robot policy wants millions of demos across thousands of tasks. This is the bottleneck. Money is hard. Wall-clock time is harder.
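The arithmetic behind that bottleneck is worth running once. The sketch below uses only the figures quoted above (an operator paid $50-150 per hour producing 50-200 demos per hour) and takes one million demos as the low end of "millions". The function names are made up for this example, and the result covers operator labor only, not the rig.

```python
from typing import Tuple


def demo_cost_range(
    rate_low: float, rate_high: float, demos_low: float, demos_high: float
) -> Tuple[float, float]:
    """Labor cost per demo in dollars: (best case, worst case)."""
    cheapest = rate_low / demos_high   # lowest wage, fastest operator
    priciest = rate_high / demos_low   # highest wage, slowest operator
    return cheapest, priciest


def operator_hours(target_demos: float, demos_low: float, demos_high: float) -> Tuple[float, float]:
    """Operator-hours needed to collect target_demos: (best case, worst case)."""
    return target_demos / demos_high, target_demos / demos_low


if __name__ == "__main__":
    low_cost, high_cost = demo_cost_range(50, 150, 50, 200)
    low_hours, high_hours = operator_hours(1_000_000, 50, 200)
    print(f"cost per demo: ${low_cost:.2f} to ${high_cost:.2f}")
    print(f"hours for 1M demos: {low_hours:,.0f} to {high_hours:,.0f}")
```

That works out to $0.25 to $3.00 of labor per demo, and 5,000 to 20,000 operator-hours for the first million demos. On a single rig, that is years of wall-clock time, which is why time is the harder constraint.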

Cosmos attacks every leg of the bottleneck. Use inverse_dynamics over unlabeled video and you turn YouTube into labeled training data. Use text2video to augment a thin demo corpus with synthetic variants. Use forward_dynamics as a learned simulator instead of spending an engineer-quarter building one in Isaac Sim. Use policy as a zero-shot teacher to distill into a small custom student.

The door-opening economics, end to end

Let's price our door-opening robot two ways.

The numbers are purely illustrative, but you get the idea.

When Cosmos 3 may not be the right tool

Honest section. Cosmos 3 might not excel at:

  • Pure creative video. TikToks, ads, animation.

  • Game engine content. Cosmos 3 was trained on real-world video, not game footage. Wrong distribution.

  • Anything with text inside the video. Generative video models generally struggle with legible text.

  • Scientific simulation (fluid dynamics, weather, molecular). Use a real PDE solver.

Try it

Cosmos 3 Nano runs on Baseten on a single H100. Cold-start under two minutes, warm text2video in roughly four minutes for a 720p clip, action modes in under thirty seconds. One click from the Baseten model library.

Contact us to deploy Cosmos 3 Super.
