
NVIDIA Nemotron 3 Nano Omni: Build Multimodal Agents on Baseten

TL;DR

NVIDIA Nemotron 3 Nano Omni is an open multimodal foundation model that unifies audio, images, video, and text into a single context. Built on the Nemotron 3 Nano backbone, Nemotron 3 Nano Omni powers sub-agents in agentic workflows with leading efficiency and accuracy.

Today, Baseten is launching support for NVIDIA Nemotron 3 Nano Omni, an open unified multimodal model built for production agents.

What is NVIDIA Nemotron 3 Nano Omni?

NVIDIA Nemotron 3 Nano Omni is an open, efficient multimodal foundation model built to power sub-agents that understand and reason across video, audio, images, documents, and text in enterprise agent systems.

Most agent systems today rely on separate models for speech, vision, and language. In agentic workflows, this creates problems: separate models mean repeated inference passes, which increases latency; orchestration and error handling become more complex; and context fragments across modalities, which reduces accuracy.

NVIDIA Nemotron 3 Nano Omni takes a different approach. It is a single multimodal reasoning model that enables agents to reason and perceive across modalities within a unified loop with complete deployment control and efficient performance. 

Nemotron 3 Nano Omni combines audio and vision encoders into a unified multimodal architecture, which eliminates the need for separate perception models. This enables agents to complete tasks faster at scale, and simplifies agent development. 

There are three architectural choices in particular that make Omni efficient:

  • A latent MoE design that improves memory and compute efficiency

  • 3D convolutional layers let the model extract spatial and temporal features together, so it knows how visuals change over time

  • Efficient video sampling selectively processes the most dynamic parts of long videos instead of scanning every frame

Nemotron 3 Nano Omni extends the efficiency and accuracy of Nemotron 3 Nano across different modalities.
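NVIDIA hasn't published the implementation details of the video sampling step, but the core idea can be sketched with simple frame differencing: score each frame by how much it changes from its predecessor, and keep only the highest-motion frames. This is a minimal illustration, not the model's actual sampler; the function name and grayscale-frame assumption are ours.

```python
import numpy as np

def select_dynamic_frames(frames: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest-motion frames, in time order.

    frames: (T, H, W) array of grayscale frames.
    """
    # Score each frame by its mean absolute difference from the previous frame.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    scores = np.concatenate([[0.0], diffs])  # frame 0 has no predecessor
    # Keep the k highest-scoring frames, then restore temporal order.
    return np.sort(np.argsort(scores)[-k:])

# A 6-frame "video" that is static except for a sudden change at frame 3.
video = np.zeros((6, 4, 4))
video[3:] = 1.0
print(select_dynamic_frames(video, 1))  # → [3]
```

Sampling this way lets long videos fit a bounded token budget: only the frames where something actually happens are handed to the vision encoder.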

When to use Nemotron 3 Nano Omni

Nemotron 3 Nano Omni’s open, lightweight 30B-A3B architecture supports deployments in local environments, such as NVIDIA DGX systems, as well as datacenters and cloud environments. It’s designed for computer use, complex document intelligence, and audio and video reasoning.

Context matters in customer service, research, and monitoring workflows, and Nemotron 3 Nano Omni preserves unified multimodal context across audio, video, and documents within a single reasoning loop.

Scaling enterprise AI with Nemotron 3 Nano Omni on Baseten

Baseten is an AI infrastructure platform purpose-built for ultra-fast inference, and we're providing day-zero support for Nemotron 3 Nano Omni.

Our platform accelerates enterprise AI initiatives through the Baseten Inference Stack, which under the hood uses NVFP4, components of TensorRT-LLM, Dynamo, and the Baseten Speculation Engine, all running on NVIDIA Blackwell GPUs.

Building with NVIDIA Nemotron 3 Nano Omni in production

If you’re building agents that need to see, hear, and reason across workflows such as customer service, computer use, or document intelligence, Nemotron 3 Nano Omni provides a production-ready, open foundation to do all of this with a single model.

The model accepts multimodal inputs, including audio recordings, video, images, and documents, and performs unified reasoning across them in a single pass.
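To make the single-pass idea concrete, here is a sketch of a multimodal request in the OpenAI-compatible chat format that Baseten model endpoints commonly expose. The model id, media URLs, and the exact content-part types are placeholders and may differ for your deployment; check your endpoint's docs before using them.

```python
# Illustrative only: the model id, URLs, and audio content-part type below
# are placeholders, not confirmed values for this deployment.
payload = {
    "model": "nvidia/nemotron-3-nano-omni",  # placeholder model id
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarize the call recording and the attached slide."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/slide.png"}},
            {"type": "input_audio",
             "input_audio": {"data": "<base64-encoded-audio>",
                             "format": "wav"}},
        ],
    }],
}
```

Because perception and reasoning live in one model, a single request like this replaces what would otherwise be separate transcription, vision, and language calls.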

You can deploy Nemotron 3 Nano Omni on Baseten for scalable multimodal inference, or get in touch with our engineers to learn more about the performance, scale, security, and flexibility we offer enterprises, including our self-hosting capabilities.
