Large language model

NVIDIA Nemotron 3 Ultra

550B hybrid Mamba-Transformer MoE with 55B active params, latent MoE routing, multi-token prediction, and 1M token context


Example usage

OpenAI-compatible chat completion. NVIDIA recommends temperature=1.0 and top_p=0.95; toggle reasoning via chat_template_kwargs.enable_thinking.

Input
from openai import OpenAI
import os

model_url = ""  # Copy in from the API pane in your Baseten model dashboard

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url=model_url,
)

# NVIDIA recommends temperature=1.0 and top_p=0.95 for all tasks.
# Toggle reasoning on/off via chat_template_kwargs.enable_thinking.
response = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra",
    messages=[
        {"role": "user", "content": "Tell me a fun fact about hummingbirds."}
    ],
    temperature=1.0,
    top_p=0.95,
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response)
JSON output
{
    "id": "143",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "[Model output here]",
                "role": "assistant",
                "audio": null,
                "function_call": null,
                "tool_calls": null
            }
        }
    ],
    "created": 1741224586,
    "model": "",
    "object": "chat.completion",
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "completion_tokens": 145,
        "prompt_tokens": 38,
        "total_tokens": 183,
        "completion_tokens_details": null,
        "prompt_tokens_details": null
    }
}
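The request and response shapes above can be wrapped in two small helpers. This is a minimal sketch, not part of the Baseten or NVIDIA documentation: the function names `chat_kwargs` and `summarize` are hypothetical, and the sample payload is a trimmed copy of the JSON output shown above. The first helper builds the keyword arguments with the recommended sampling settings and a switch for `enable_thinking`; the second pulls the reply text, finish reason, and token count out of a response dictionary.

```python
import json


def chat_kwargs(prompt: str, enable_thinking: bool = True, max_tokens: int = 256) -> dict:
    """Build keyword arguments for client.chat.completions.create() (hypothetical helper)."""
    return {
        "model": "nvidia/nemotron-3-ultra",
        "messages": [{"role": "user", "content": prompt}],
        # Sampling settings recommended in the example above.
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": max_tokens,
        # Reasoning is toggled through the chat template, not a top-level field.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    }


def summarize(response: dict) -> dict:
    """Extract the fields most callers need from a chat.completion payload (hypothetical helper)."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }


# Trimmed copy of the JSON output shown above, used here as sample data.
sample = json.loads("""
{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {"content": "[Model output here]", "role": "assistant"}
        }
    ],
    "usage": {"completion_tokens": 145, "prompt_tokens": 38, "total_tokens": 183}
}
""")

print(summarize(sample))
print(chat_kwargs("Tell me a fun fact about hummingbirds.", enable_thinking=False)["extra_body"])
```

With a live client, the same helpers would be used as `response = client.chat.completions.create(**chat_kwargs(prompt))` followed by `summarize(response.model_dump())`, since the OpenAI Python SDK returns a Pydantic object rather than a plain dictionary. If the reply is cut short, `finish_reason` is the first field to check: a value of `"length"` means `max_tokens` was reached.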
