Multi-model inference built for ultra-low latency at scale
Use Chains to orchestrate inference workflows across multiple models, with a framework designed for performance.
Multiple models. Multiple machines. One framework.
Simplify orchestration of multiple ML models, business logic services, and their underlying resources in pure Python using Chains.
import truss_chains as chains
from truss import truss_config

MISTRAL_HF_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
MISTRAL_CACHE = truss_config.ModelRepo(
    repo_id=MISTRAL_HF_MODEL, allow_patterns=["*.json", "*.safetensors", ".model"]
)
HF_ACCESS_TOKEN_NAME = "hf_access_token"

class MistralLLM(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            pip_requirements=[
                "transformers==4.38.1",
                "torch==2.0.1",
            ]
        ),
        compute=chains.Compute(cpu_count=2, gpu="A10G"),
        assets=chains.Assets(cached=[MISTRAL_CACHE], secret_keys=[HF_ACCESS_TOKEN_NAME]),
    )

    def __init__(
        self,
        # Adding `context` to the init arguments gives us access to the
        # Hugging Face access token stored as a Baseten secret.
        context: chains.DeploymentContext = chains.depends_context(),
    ) -> None:
        # Note that imports of the *specific* Python requirements are pushed
        # down to here. This code only runs on the remotely deployed Chainlet,
        # not in the local environment, so these packages don't need to be
        # installed in the local dev environment.
        import torch
        import transformers

        self._model = transformers.AutoModelForCausalLM.from_pretrained(
            MISTRAL_HF_MODEL,
            torch_dtype=torch.float16,
            device_map="auto",
            token=context.secrets[HF_ACCESS_TOKEN_NAME],
        )

        self._tokenizer = transformers.AutoTokenizer.from_pretrained(
            MISTRAL_HF_MODEL,
            token=context.secrets[HF_ACCESS_TOKEN_NAME],
        )

        self._generate_args = {
            "max_new_tokens": 512,
            "temperature": 1.0,
            "top_p": 0.95,
            "top_k": 50,
            "repetition_penalty": 1.0,
            "no_repeat_ngram_size": 0,
            "use_cache": True,
            "do_sample": True,
            "eos_token_id": self._tokenizer.eos_token_id,
            "pad_token_id": self._tokenizer.pad_token_id,
        }

    def run_remote(self, prompt: str) -> str:
        import torch

        formatted_prompt = f"[INST] {prompt} [/INST]"
        input_ids = self._tokenizer(
            formatted_prompt, return_tensors="pt"
        ).input_ids.cuda()
        with torch.no_grad():
            output = self._model.generate(inputs=input_ids, **self._generate_args)
        result = self._tokenizer.decode(output[0])
        return result

@chains.mark_entrypoint
class PoemGenerator(chains.ChainletBase):
    def __init__(
        self, mistral_llm: MistralLLM = chains.depends(MistralLLM)
    ) -> None:
        self._mistral_llm = mistral_llm

    def run_remote(self, words: list[str]) -> list[str]:
        results = []
        for word in words:
            prompt = f"Write a poem about {word}"
            poem = self._mistral_llm.run_remote(prompt)
            results.append(poem)
        return results
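To iterate before deploying, the whole Chain can be run in a single local process. The sketch below assumes the MistralLLM and PoemGenerator Chainlets above; it relies on the chains.run_local() debugging helper and its secrets argument, the HF_TOKEN environment variable and poems.py file name are placeholders, and the exact deploy subcommand may differ between Truss versions.

import os

if __name__ == "__main__":
    # Run every Chainlet in this process for quick debugging. Loading Mistral
    # locally requires a GPU; the Hugging Face token is read from an
    # environment variable here purely for illustration.
    with chains.run_local(secrets={HF_ACCESS_TOKEN_NAME: os.environ["HF_TOKEN"]}):
        poem_generator = PoemGenerator()
        poems = poem_generator.run_remote(words=["bird", "plane", "superman"])
        print(poems)

# Deploy the Chain from the CLI, e.g.:
#   truss chains push poems.py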
Get to market faster with products that perform better
Reduce latency
Minimize network hops to deliver the lowest latency possible by calling each model directly. Automatically scale GPU and CPU resources with demand to avoid bottlenecks and outages.
Lower GPU costs at scale
Stop wasting GPU resources with monolithic deployments. Chains allow you to optimize costs by selecting the right hardware for each component (Chainlet) in your workflow.
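For example, a Chain can pair a CPU-only Chainlet for lightweight business logic with a GPU Chainlet for model inference, so only the model step is billed for an accelerator. The sketch below uses only the RemoteConfig, Compute, and DockerImage options shown in the example above; the class names, the sentence-transformers dependency, and the embedding model are illustrative choices, not a prescribed setup.

import truss_chains as chains

class TextCleaner(chains.ChainletBase):
    # Plain business logic: a small CPU instance is enough.
    remote_config = chains.RemoteConfig(compute=chains.Compute(cpu_count=1))

    def run_remote(self, text: str) -> str:
        return " ".join(text.split())

class Embedder(chains.ChainletBase):
    # Only this Chainlet pays for a GPU, and it scales independently of TextCleaner.
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            pip_requirements=["sentence-transformers==2.7.0"]
        ),
        compute=chains.Compute(cpu_count=2, gpu="T4"),
    )

    def __init__(self) -> None:
        from sentence_transformers import SentenceTransformer

        self._model = SentenceTransformer("all-MiniLM-L6-v2")

    def run_remote(self, texts: list[str]) -> list[list[float]]:
        return self._model.encode(texts).tolist()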
Save development hours
Stop wasting valuable developer time building and maintaining inference infrastructure. Chains enable high-performance multi-model workflows at scale from day 1.
Created for engineers. Loved by enterprises.
Support for every model
Integrate any model architecture seamlessly into your workflows. Combine your own fine-tuned or bespoke models with the latest open-source and third-party models.
Delightful dev experience
Our SDK abstracts away infrastructure complexity, making simple tasks easy to automate while providing robust tools for more intricate operations.
Composable and extensible
Create components once, and use them universally. Chainlets allow you to easily integrate new and existing AI technologies into a fully cohesive product experience.
Expert support on-demand
Our team of AI experts accelerates your project from concept to production. We optimize each part of your deployment to deliver the best possible performance at scale.
Volume-based GPU discounts
Get the best possible ROI on your GPU spend with our volume-based discounts. Reduce your incremental cost as you scale to realize the best possible unit economics.
Enterprise-grade security
Rime’s state-of-the-art p99 latency and 100% uptime over 2024 are driven by our shared laser focus on fundamentals, and we’re excited to push the frontier even further with Baseten.
Guides and examples
Build RAG workflows
Connect to vector databases and augment LLM results with additional context without introducing overhead to the model inference.
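One way to structure this as a Chain is a retrieval Chainlet plus an answering Chainlet composed with chains.depends(). A minimal sketch: the in-memory document lookup stands in for a real vector database client, and it assumes an LLM Chainlet exposing a run_remote(prompt) method like the MistralLLM example above.

import truss_chains as chains

class ContextRetriever(chains.ChainletBase):
    # In production this would query a vector database over its Python client;
    # an in-memory dict keeps the sketch self-contained.
    _DOCS = {
        "chains": "Chains orchestrate multiple Chainlets as one deployment.",
        "chainlet": "A Chainlet is a single step with its own resources.",
    }

    def run_remote(self, query: str) -> list[str]:
        return [text for key, text in self._DOCS.items() if key in query.lower()]

class RAGAnswerer(chains.ChainletBase):
    def __init__(
        self,
        retriever: ContextRetriever = chains.depends(ContextRetriever),
        llm: MistralLLM = chains.depends(MistralLLM),
    ) -> None:
        self._retriever = retriever
        self._llm = llm

    def run_remote(self, query: str) -> str:
        # Retrieval runs as its own Chainlet; the LLM is called directly,
        # so the extra context adds no gateway round trip to inference.
        context = "\n".join(self._retriever.run_remote(query))
        prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
        return self._llm.run_remote(prompt)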
Chunked audio transcription
Transcribe large audio files by splitting them into smaller chunks and processing them in parallel — process 10-hour files in minutes.
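The pattern behind this guide is a fan-out: one Chainlet transcribes a single chunk, and an orchestrator dispatches all chunks concurrently with asyncio. A rough sketch only; the Chainlet names are illustrative, and the model loading and audio splitting are elided.

import asyncio
import truss_chains as chains

class ChunkTranscriber(chains.ChainletBase):
    # One GPU worker per chunk; loading the ASR model (e.g. Whisper) is elided.
    remote_config = chains.RemoteConfig(compute=chains.Compute(cpu_count=2, gpu="T4"))

    async def run_remote(self, audio_chunk_url: str) -> str:
        # Real code would download the chunk and run the ASR model on it.
        return f"<transcript of {audio_chunk_url}>"

class TranscriptionOrchestrator(chains.ChainletBase):
    def __init__(
        self, transcriber: ChunkTranscriber = chains.depends(ChunkTranscriber)
    ) -> None:
        self._transcriber = transcriber

    async def run_remote(self, chunk_urls: list[str]) -> str:
        # Dispatch every chunk concurrently; each call lands on an
        # autoscaling ChunkTranscriber replica.
        transcripts = await asyncio.gather(
            *(self._transcriber.run_remote(url) for url in chunk_urls)
        )
        return " ".join(transcripts)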
Multi-model pipelines
Build powerful compound AI systems and experiences like AI phone calling, multi-step image generation, and multimodal chat.
You guys have literally enabled us to hit insane revenue numbers without ever thinking about GPUs and scaling. I know I ask for a lot so I just wanted to let you guys know that I am so blown away by everything Baseten.
Lily Clifford,
Co-founder and CEO, Rime