Model details
Example usage
Qwen-3-embeddings is a text-embedding model: given an input text, it produces a one-dimensional embedding vector. It is frequently used with vector databases and for downstream tasks such as clustering and retrieval.
This deployment is quantized to FP8, which is supported by NVIDIA's recent GPUs, e.g. H100, H100_40GB, B200, or L4. Quantization is optional, but it improves efficiency.
The client library, baseten-performance-client, can be installed via pip; its source is available at:
https://github.com/basetenlabs/truss/tree/main/baseten-performance-client
Alternatively, you may also use the OpenAI embeddings client.
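A minimal sketch of that alternative, using the official openai package pointed at the deployment's endpoint. The /v1 suffix on the base URL is an assumption about the OpenAI-compatible route; the model name "my_model" matches the placeholder used below.

```python
import os

model_id = "yqv0rjjw"  # same deployment id as in the example below
# Assumption: the OpenAI-compatible route lives under /sync/v1.
base_url = f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"

def embed_with_openai_client(texts):
    """Embed `texts` via the OpenAI SDK against the Baseten endpoint."""
    from openai import OpenAI  # lazy import keeps the sketch importable

    client = OpenAI(api_key=os.environ["BASETEN_API_KEY"], base_url=base_url)
    resp = client.embeddings.create(model="my_model", input=texts)
    # One embedding vector per input, in submission order.
    return [d.embedding for d in resp.data]
```

The lazy import and environment-variable lookup keep credentials out of the source; calling `embed_with_openai_client(["Explain gravity"])` returns a list of float vectors.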
Input
import os
from baseten_performance_client import PerformanceClient, OpenAIEmbeddingsResponse

api_key = os.environ.get("BASETEN_API_KEY")
model_id = "yqv0rjjw"
base_url = f"https://model-{model_id}.api.baseten.co/environments/production/sync"

client = PerformanceClient(base_url=base_url, api_key=api_key)

def get_detailed_instruct(task_description: str, query: str) -> str:
    # Qwen-3-embedding style query formatting: queries carry the task
    # instruction, while documents are embedded as-is.
    return f'Instruct: {task_description}\nQuery:{query}'

task = 'Given a web search query, retrieve relevant passages that answer the query'
texts = [
    get_detailed_instruct(task, 'Explain gravity'),
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]

response: OpenAIEmbeddingsResponse = client.embed(
    input=texts,
    model="my_model",
    batch_size=16,
    max_concurrent_requests=32,
)
array = response.numpy()
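Once the embeddings are in a NumPy array, query–document relevance is typically scored with cosine similarity. A minimal sketch, with toy vectors standing in for array[0] (the query) and array[1] (the document):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two 1D embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors; in practice these would be rows of the `array` returned above.
query_vec = np.array([0.1, 0.3, 0.6])
doc_vec = np.array([0.2, 0.2, 0.7])
score = cosine_similarity(query_vec, doc_vec)
```

For large corpora the same scoring is usually delegated to a vector database rather than computed pairwise in Python.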
JSON output
{
  "data": [
    {
      "embedding": [
        0
      ],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "thenlper/gte-base",
  "object": "list",
  "usage": {
    "prompt_tokens": 512,
    "total_tokens": 512
  }
}
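Because the response follows the OpenAI embeddings schema, it can be parsed with the standard library alone. A small sketch using the sample payload above:

```python
import json

raw = """
{
  "data": [
    {"embedding": [0], "index": 0, "object": "embedding"}
  ],
  "model": "thenlper/gte-base",
  "object": "list",
  "usage": {"prompt_tokens": 512, "total_tokens": 512}
}
"""
payload = json.loads(raw)
# Sort by "index" so vectors line up with the order of the submitted inputs.
embeddings = [
    item["embedding"]
    for item in sorted(payload["data"], key=lambda d: d["index"])
]
total_tokens = payload["usage"]["total_tokens"]
```

The "usage" block is useful for tracking token consumption across batched requests.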