GLM-4.6V
A frontier vision language model by Z AI with native multimodal function calling and interleaved image-text content generation
Model details
Example usage
GLM-4.6V scales its context window to 128K tokens during training and achieves state-of-the-art performance in visual understanding among models of similar parameter scale. Crucially, it integrates native Function Calling capabilities for the first time, effectively bridging the gap between visual perception and executable action and providing a unified technical foundation for multimodal agents in real-world business scenarios. You can deploy GLM-4.6V on NVIDIA H100 GPUs with Baseten today.
GLM-4.6V benchmarks

Deployments of GLM-4.6V are OpenAI-compatible.
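Because deployments are OpenAI-compatible, the model's native function calling can be exercised through the standard tools and tool_choice parameters. The snippet below is a minimal sketch rather than a definitive integration: the get_product_price tool, its schema, and the image URL are hypothetical placeholders, and the image/text content format simply mirrors the chat example further down.

from openai import OpenAI
import os

model_url = "" # Copy in from API pane in Baseten model dashboard

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url=model_url
)

# Hypothetical tool the model can decide to call after inspecting the image
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_product_price",
            "description": "Look up the current price of a product by name.",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_name": {
                        "type": "string",
                        "description": "Name of the product shown in the image."
                    }
                },
                "required": ["product_name"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="zai-org/GLM-4.6V",
    messages=[
        {"role": "system", "content": "You are a helpful vision-language assistant."},
        {"role": "user", "content": [
            {"url": "https://example.com/product-photo.png", "type": "image"},
            {"text": "What does this product cost right now?", "type": "text"}
        ]}
    ],
    tools=tools,
    tool_choice="auto"
)

# If the model chose to call the tool, the call appears on the message
print(response.choices[0].message.tool_calls)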
Input
from openai import OpenAI
import os

model_url = "" # Copy in from API pane in Baseten model dashboard

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url=model_url
)

# Chat completion with an image and a text prompt
response_chat = client.chat.completions.create(
    model="zai-org/GLM-4.6V",
    messages=[
        {"role": "system", "content": "You are a helpful vision-language assistant."},
        {"role": "user", "content": [
            {"url": "https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png", "type": "image"},
            {"text": "Describe this image in detail.", "type": "text"}
        ]}
    ],
    max_tokens=1024,
    temperature=0.7
)
print(response_chat)

JSON output
{
  "id": "143",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "[Model output here]",
        "role": "assistant",
        "audio": null,
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1741224586,
  "model": "",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 145,
    "prompt_tokens": 38,
    "total_tokens": 183,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  }
}
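For interactive applications, the same endpoint can also stream tokens as they are generated. The snippet below is a minimal sketch, assuming the deployment supports the standard OpenAI streaming interface (stream=True); the prompt is illustrative.

from openai import OpenAI
import os

model_url = "" # Copy in from API pane in Baseten model dashboard

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url=model_url
)

# Stream the completion and print tokens as they arrive
stream = client.chat.completions.create(
    model="zai-org/GLM-4.6V",
    stream=True,
    messages=[
        {"role": "user", "content": "Describe GLM-4.6V in one sentence."}
    ],
    max_tokens=256
)

for chunk in stream:
    # Each chunk carries an incremental delta of the assistant message
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)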