Qwen3.5 4B Latency

A fast 4B-parameter model from the Qwen3.5 family, optimized for low latency.

Model details

Developed by: Qwen
Model family: Qwen
Use case: large language
Version: V1
Variant: Latency
Size: 4B
Optimization: vLLM
Hardware: H100
API: openai
License: qwen license

Example usage

OpenAI-compatible chat completion

Input

from openai import OpenAI
import os

model_url = ""  # Copy in from API pane in Baseten model dashboard

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url=model_url
)

# Chat completion
response_chat = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": "Tell me a fun fact about cats."}
    ],
    temperature=0.6,
    max_tokens=100,
)
print(response_chat)
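
Printing the whole response object is rarely what an application needs. Below is a minimal sketch of pulling out just the reply text and checking whether it was cut off by max_tokens. The helper name reply_text and the stand-in fake_response object are hypothetical, introduced here only so the snippet runs without a network call; with a real deployment you would pass response_chat instead.

```python
from types import SimpleNamespace


def reply_text(response):
    """Return (text, truncated) for a chat completion response.

    Hypothetical helper, not part of the openai package. `truncated` is
    True when the first choice stopped because max_tokens was reached
    (finish_reason == "length").
    """
    choice = response.choices[0]
    truncated = choice.finish_reason == "length"
    return choice.message.content, truncated


# Stand-in object with the same attribute layout as a chat completion,
# so the helper can be exercised offline. Its values are made up.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(role="assistant", content="Cats can"),
        )
    ]
)

text, truncated = reply_text(fake_response)
print(text)
if truncated:
    print("(reply was cut off at max_tokens)")
```

If truncated comes back True, raising max_tokens or asking for a shorter answer are the usual remedies.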

JSON output

{
    "id": "143",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "[Model output here]",
                "role": "assistant",
                "audio": null,
                "function_call": null,
                "tool_calls": null
            }
        }
    ],
    "created": 1741224586,
    "model": "",
    "object": "chat.completion",
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "completion_tokens": 145,
        "prompt_tokens": 38,
        "total_tokens": 183,
        "completion_tokens_details": null,
        "prompt_tokens_details": null
    }
}
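
If you are working with the raw JSON rather than the SDK object (for example, from a logged response), the fields of interest can be read as in the following sketch. The sample string is a trimmed copy of the output shown here, not a fresh model response, and the variable names are illustrative.

```python
import json

# Trimmed copy of the example JSON output; null fields omitted for brevity.
sample = """
{
    "id": "143",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "[Model output here]",
                "role": "assistant"
            }
        }
    ],
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 145,
        "prompt_tokens": 38,
        "total_tokens": 183
    }
}
"""

data = json.loads(sample)

# The reply lives in the first (and here only) entry of "choices".
message = data["choices"][0]["message"]
usage = data["usage"]

print(f'{message["role"]}: {message["content"]}')
print(f'tokens: {usage["prompt_tokens"]} prompt + '
      f'{usage["completion_tokens"]} completion = {usage["total_tokens"]} total')
```

The usage block is the part to log if you want to track token consumption per request.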