NVIDIA Cosmos 3: Robots finally take over

The door problem: Why robotics is harder than it looks

A humanoid robot walks up to a door. It reaches for the handle, fumbles, then turns it and pushes through. The clip racks up four million views on Twitter. Everyone says "cool" and moves on with their lives.

Robotics engineers see something very different: months of data collection, simulation, training, and validation required to make that behavior work reliably across thousands of different doors and environments.

Door opening is the canonical "easy task that's secretly hard." A three-year-old does it without thinking. A robot has to detect the handle, decide if it pushes or pulls, decide if the handle is a lever or knob or bar, approach at the right angle, grip without crushing or slipping, apply just enough torque, swing through the arc, and walk through before the door closes. Multiply by every door style on Earth: glass, push bar, hospital bumper, badge reader, mortise lock, weird European pull. The robot trained at lab A fails the moment you put it in office B with different doors.

This is the generalization gap in robotics. It's why generalist robots don't exist yet as products, and it's exactly the challenge Cosmos 3 was built to address.

What is Cosmos 3?

Cosmos 3 is a world foundation model. The keyword is "world," not "video." While most generative video models are designed to produce visually compelling content, Cosmos 3 is designed to understand and model how the physical world behaves. Rather than focusing solely on pixels, it reasons about the objects, actions, motion, and cause-and-effect relationships that govern real-world environments. Built as a single unified omni-model, Cosmos 3 can reason and generate across text, images, video, audio, and actions. The same model can interpret observations, generate future scenarios, recover actions from video, or produce action sequences for agents and robots.

The Cosmos 3 family includes Cosmos 3 Nano, optimized for efficient deployment and real-time workloads, and Cosmos 3 Super, designed for maximum reasoning capability and generation quality.

The question everyone asks first: What does Cosmos 3 actually do?

The answer depends on how you use it.

Cosmos 3 supports six capabilities through a single omni-model architecture:

text2image - generate images from text prompts
text2video - generate videos from text prompts
image2video - extend images into videos
forward_dynamics - predict what happens next in a scene
inverse_dynamics - recover actions from observations or video
policy - generate action sequences for agents and robots

For robots and physical AI, the last three are often the most interesting.

Two ways to use Cosmos 3 in robotics

Path A: Cosmos 3 in the cockpit

In this approach, a robot sends observations directly to Cosmos 3. The model analyzes what it sees and returns the actions the robot should take next.

For example:

Robot observes a door
Cosmos determines how to approach it
Cosmos generates an action sequence
Robot executes the actions
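
The loop above might be sketched roughly as follows. This is a minimal illustration, not a real robot interface: `LoggingRobot` and `fake_query_policy` are hypothetical stand-ins, and the stub returns a zero-filled 16x10 chunk so the example runs offline instead of calling a Cosmos 3 endpoint.

```python
from typing import Callable, List

ActionChunk = List[List[float]]  # shape: (chunk_size, action_dim)


def fake_query_policy(observation: str, prompt: str) -> ActionChunk:
    """Hypothetical stand-in for a policy-mode request: 16 zero actions of dim 10."""
    return [[0.0] * 10 for _ in range(16)]


class LoggingRobot:
    """Hypothetical robot interface that records what it was asked to execute."""

    def __init__(self) -> None:
        self.executed: List[List[float]] = []

    def observe(self) -> str:
        # A real robot would return a camera clip or a URL pointing to one.
        return "robot_at_door_clip.mp4"

    def execute(self, action: List[float]) -> None:
        self.executed.append(action)


def control_loop(
    robot: LoggingRobot,
    query_policy: Callable[[str, str], ActionChunk],
    prompt: str,
    max_chunks: int,
) -> int:
    """Observe, ask the model for an action chunk, execute it, repeat."""
    steps = 0
    for _ in range(max_chunks):
        observation = robot.observe()
        chunk = query_policy(observation, prompt)  # one model round trip per chunk
        for action in chunk:
            robot.execute(action)
            steps += 1
    return steps


robot = LoggingRobot()
total_steps = control_loop(
    robot, fake_query_policy, "Open the door and walk through.", max_chunks=3
)
print(total_steps)  # 48
```

Every pass through the outer loop waits on the model, which is where the latency concern comes from.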

This works and is valuable for research, experimentation, and rapid prototyping. However, most production robots cannot rely on a large foundation model running remotely. They need fast, deterministic control loops running locally with minimal latency. A commercial butler robot, for example, should be able to open doors without an internet connection.

Path B: Cosmos 3 as a physical AI data factory

Instead of controlling the robot directly, Cosmos helps create the training data used to teach smaller, specialized robot policies.

A typical workflow looks like this:

Collect real-world videos from robots, vehicles, factories, or public datasets
Use inverse_dynamics to recover action labels
Use text2video and image2video to generate synthetic variations
Create large-scale training datasets consisting of observations and correct actions
Train a smaller robot policy that runs efficiently onboard
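
As a rough sketch, the data-building steps of that workflow could be wired together like this. All names are hypothetical: `stub_label_actions` stands in for an inverse_dynamics call, and `stub_generate_variants` stands in for text2video or image2video. Labeling the synthetic clips with the same labeling function is an assumption made here for simplicity, not something the workflow above specifies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    clip: str                   # path or URL of the observation video
    actions: List[List[float]]  # action labels recovered for that clip


def build_dataset(
    real_clips: List[str],
    label_actions: Callable[[str], List[List[float]]],
    generate_variants: Callable[[str], List[str]],
) -> List[Example]:
    """Pair every real clip and each of its synthetic variants with action labels."""
    dataset: List[Example] = []
    for clip in real_clips:
        for candidate in [clip] + generate_variants(clip):
            dataset.append(Example(candidate, label_actions(candidate)))
    return dataset


# Offline stubs so the sketch runs without a model endpoint.
def stub_label_actions(clip: str) -> List[List[float]]:
    return [[0.0] * 9 for _ in range(4)]


def stub_generate_variants(clip: str) -> List[str]:
    return [f"{clip}#variant{i}" for i in range(2)]


dataset = build_dataset(
    ["door_a.mp4", "door_b.mp4"], stub_label_actions, stub_generate_variants
)
print(len(dataset))  # 6
```

The resulting list of observation and action pairs is what a smaller onboard policy would then be trained on; that training step is left out here.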

The production robot never calls Cosmos 3. Instead, it runs a compact model that learned from data generated by Cosmos.

Training pipeline

This is the value proposition. The model that drives your robot in production was distilled from data Cosmos manufactured. Cosmos is the factory; your custom policy is the product.

The key insight

Cosmos 3 is the factory that helps build the robot. The large model produces the data, the small model ships, and the customer never sees the big model. That is why Cosmos 3 represents such an important shift for Physical AI. Rather than simply generating content, it helps generate the data needed to train the next generation of robots, autonomous systems, and intelligent agents.

The six modes, framed by what they do for our door-opener

Cosmos exposes six interaction modes through a single model_mode field on the API. Each one plays a distinct role in the data factory.

Sample request: policy for door opening

curl -X POST "$BASETEN_URL/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_mode": "policy",
    "vision_path": "https://yourcdn.com/robot_at_door_clip.mp4",
    "prompt": "Open the door and walk through.",
    "image_size": 480,
    "fps": 5,
    "action_chunk_size": 16,
    "raw_action_dim": 10,
    "domain_name": "bridge_orig_lerobot",
    "num_steps": 30,
    "guidance": 1.0,
    "shift": 5.0,
    "seed": 0
  }'

Returns a 16x10 action tensor under sample_outputs.outputs[0].content.action. Pair it with the input video for one training example.
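
A small sketch of turning that response into a training example follows. It assumes the response is JSON whose nesting mirrors the path sample_outputs.outputs[0].content.action, with the tensor encoded as a list of lists. `extract_action_chunk` and `fake_response` are illustrative names, and the fake response is hand-built rather than captured from a real call.

```python
from typing import Any, Dict, List


def extract_action_chunk(
    response: Dict[str, Any], chunk_size: int = 16, action_dim: int = 10
) -> List[List[float]]:
    """Pull the action tensor out of a policy response and check its shape."""
    action = response["sample_outputs"]["outputs"][0]["content"]["action"]
    if len(action) != chunk_size or any(len(step) != action_dim for step in action):
        raise ValueError(f"expected a {chunk_size}x{action_dim} action tensor")
    return action


# Hand-built stand-in for a real response; only the documented path is modeled.
fake_response = {
    "sample_outputs": {
        "outputs": [{"content": {"action": [[0.0] * 10 for _ in range(16)]}}]
    }
}

# One training example: the input video and prompt paired with the returned actions.
training_example = {
    "video": "https://yourcdn.com/robot_at_door_clip.mp4",
    "prompt": "Open the door and walk through.",
    "actions": extract_action_chunk(fake_response),
}
print(len(training_example["actions"]), len(training_example["actions"][0]))  # 16 10
```

Checking the shape before storing the example keeps a malformed response out of the dataset.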

Sample request: inverse_dynamics for action labels

curl -X POST "$BASETEN_URL/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_mode": "inverse_dynamics",
    "vision_path": "https://yourcdn.com/human_opening_door.mp4",
    "prompt": "A first-person view of a person reaching for and turning a door handle.",
    "image_size": 480,
    "fps": 10,
    "raw_action_dim": 9,
    "action_chunk_size": 60,
    "domain_name": "av",
    "num_steps": 30,
    "guidance": 1.0,
    "shift": 5.0,
    "seed": 0
  }'

The same call shape works on YouTube footage, dashcam corpora, and factory camera feeds: anywhere there's video of an agent acting.
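
To label a batch of clips, the same request body can be built in a loop. The sketch below only constructs the requests and does not send them. The field values are copied from the sample request and may need adjusting per data source. The second video URL, the helper name `build_inverse_dynamics_request`, and the placeholder default for `BASETEN_URL` are assumptions for illustration.

```python
import json
import os
import urllib.request


def build_inverse_dynamics_request(video_url: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an inverse_dynamics request for one clip."""
    payload = {
        "model_mode": "inverse_dynamics",
        "vision_path": video_url,
        "prompt": prompt,
        "image_size": 480,
        "fps": 10,
        "raw_action_dim": 9,
        "action_chunk_size": 60,
        "domain_name": "av",
        "num_steps": 30,
        "guidance": 1.0,
        "shift": 5.0,
        "seed": 0,
    }
    # Placeholder defaults keep the sketch runnable when the variables are unset.
    base_url = os.environ.get("BASETEN_URL", "https://example.invalid")
    api_key = os.environ.get("BASETEN_API_KEY", "")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


videos = [
    "https://yourcdn.com/human_opening_door.mp4",
    "https://yourcdn.com/another_door_clip.mp4",  # hypothetical second clip
]
prompt = "A first-person view of a person reaching for and turning a door handle."
batch = [build_inverse_dynamics_request(url, prompt) for url in videos]
print(len(batch), batch[0].get_method())  # 2 POST
```

Each request could then be sent with `urllib.request.urlopen` once a real endpoint and API key are configured.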

What makes this particularly interesting for robotics is that the same model can support multiple stages of the development lifecycle. Some modes help create synthetic training data. Others help label existing data. Others predict future outcomes or generate actions directly. Together, they form a toolkit for building Physical AI systems that can learn faster, generalize better, and require significantly less expensive real-world data collection.

Video generation: Cosmos 3 vs other video gen models

Watch ten seconds of typical creative output from Sora 2 or Veo 3 and you'll often see physics violations: objects pop in and out, water flows backward, hands have six fingers, gravity does whatever the model felt like. Those models are optimized to look right for two seconds. Cosmos 3 is optimized to produce output that respects physical constraints: object permanence, mass continuity, contact dynamics, friction. For ad creative, nobody cares. For training a robot, the "looks right" lie poisons the dataset.

Why this matters: robotics doesn't have a data flywheel yet

Every other modern ML domain has compounding free data. Robotics doesn't. That's the entire reason Cosmos exists.

Robot demonstration data is the only ML training input in 2026 that still requires a person with a $300k teleoperation rig physically moving a robot arm, paid $50-150 per hour, producing 50-200 demos per hour. A single generalist robot policy wants millions of demos across thousands of tasks. This is the bottleneck. Money is hard. Wall-clock time is harder.
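
To put rough numbers on that bottleneck, here is a back-of-the-envelope calculation using only the rates quoted above. The one-million-demo target is an illustrative stand-in for "millions of demos," and the result covers operator labor only, not the teleoperation rig.

```python
from typing import Tuple


def collection_cost(
    num_demos: int, demos_per_hour: float, hourly_rate: float
) -> Tuple[float, float]:
    """Return (operator hours, operator labor cost in dollars) for a demo target."""
    hours = num_demos / demos_per_hour
    return hours, hours * hourly_rate


TARGET_DEMOS = 1_000_000  # illustrative stand-in for "millions of demos"

# Best case: fastest collection rate at the lowest hourly pay.
best_hours, best_cost = collection_cost(TARGET_DEMOS, demos_per_hour=200, hourly_rate=50)
# Worst case: slowest collection rate at the highest hourly pay.
worst_hours, worst_cost = collection_cost(TARGET_DEMOS, demos_per_hour=50, hourly_rate=150)

print(f"best:  {best_hours:,.0f} hours, ${best_cost:,.0f}")    # best:  5,000 hours, $250,000
print(f"worst: {worst_hours:,.0f} hours, ${worst_cost:,.0f}")  # worst: 20,000 hours, $3,000,000
```

Even in the best case, that is 5,000 operator-hours on a single rig for one million demos, which is why wall-clock time is the harder constraint.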

Cosmos attacks every leg of the bottleneck. Use inverse_dynamics over unlabeled video and you turn YouTube into labeled training data. Use text2video to augment a thin demo corpus with synthetic variants. Use forward_dynamics as a learned simulator instead of spending an engineer-quarter building one in Isaac Sim. Use policy as a zero-shot teacher to distill into a small custom student.

The door-opening economics, end-to-end

Let's price our door-opening robot two ways.

The numbers are purely illustrative, but you get the idea.

When Cosmos 3 may not be the right tool

Cosmos 3 might not excel at:

Pure creative video. TikToks, ads, animation.
Game engine content. Cosmos 3 was trained on real-world video, not game footage. Wrong distribution.
Anything with text inside the video. Diffusion models generally struggle with this.
Scientific simulation (fluid dynamics, weather, molecular). Use a real PDE solver.

Try it

Cosmos 3 Nano runs on Baseten on a single H100. Cold starts take under two minutes, a warm text2video request takes roughly four minutes for a 720p clip, and action modes return in under thirty seconds. It's one click from the Baseten model library.
