large language
NVIDIA Nemotron 3 Ultra
550B hybrid Mamba-Transformer MoE with 55B active params, latent MoE routing, multi-token prediction, and 1M token context
Model details
Example usage
OpenAI-compatible chat completion. NVIDIA recommends temperature=1.0 and top_p=0.95; toggle reasoning via chat_template_kwargs.enable_thinking.
Input
from openai import OpenAI
import os

model_url = ""  # Copy in from the API pane in your Baseten model dashboard

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url=model_url,
)

# NVIDIA recommends temperature=1.0 and top_p=0.95 for all tasks.
# Toggle reasoning on/off via chat_template_kwargs.enable_thinking.
response = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra",
    messages=[
        {"role": "user", "content": "Tell me a fun fact about hummingbirds."}
    ],
    temperature=1.0,
    top_p=0.95,
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response)
JSON output
{
  "id": "143",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "[Model output here]",
        "role": "assistant",
        "audio": null,
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1741224586,
  "model": "",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 145,
    "prompt_tokens": 38,
    "total_tokens": 183,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  }
}
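The request settings and the response shape above can be wrapped in two small helpers. The following is a minimal sketch, not part of the OpenAI SDK or of Baseten: build_request and summarize are hypothetical names introduced here for illustration. It assumes the response has the same fields as the sample output above, and it runs offline against a trimmed copy of that sample.

```python
import json

MODEL = "nvidia/nemotron-3-ultra"  # model name taken from the example above


def build_request(prompt, enable_thinking=True, max_tokens=256):
    """Return keyword arguments for client.chat.completions.create().

    Hypothetical helper: it only packages the recommended sampling
    settings (temperature=1.0, top_p=0.95) and the reasoning toggle.
    """
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": max_tokens,
        # extra_body forwards non-standard fields to the server unchanged.
        "extra_body": {
            "chat_template_kwargs": {"enable_thinking": enable_thinking}
        },
    }


def summarize(completion):
    """Pull the reply text, finish reason, and token counts out of a
    completion dict shaped like the sample output."""
    choice = completion["choices"][0]
    usage = completion["usage"]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }


# Trimmed copy of the sample output, used so the sketch runs without a network call.
SAMPLE = json.loads("""
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {"content": "[Model output here]", "role": "assistant"}
    }
  ],
  "usage": {"completion_tokens": 145, "prompt_tokens": 38, "total_tokens": 183}
}
""")

if __name__ == "__main__":
    request = build_request(
        "Tell me a fun fact about hummingbirds.", enable_thinking=False
    )
    print(request["extra_body"])
    print(summarize(SAMPLE))
```

With a live client, the same helpers would be used as client.chat.completions.create(**build_request(...)), and the returned object can be converted to a plain dict with response.model_dump() before passing it to summarize. A finish_reason of "length" rather than "stop" indicates the reply was cut off at max_tokens, which is worth checking when reasoning is enabled and the limit is low.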