Benchmarking fast Mistral 7B inference

Baseten has achieved best in class performance for key latency and throughput metrics as measured by independent researchers at Artificial Analysis.

Mistral 7B throughput and latency as measured March 11, 2024

The Artificial Analysis benchmark measures essential metrics for model performance:

  • Time to first token (TTFT): The time from when a request is sent to the model to when the first token (or chunk) of output is received.

  • Tokens per second (TPS): The average number of tokens per second received during the entire response.

  • Total response time: The total time taken to generate 100 tokens.

Baseten benchmarks at a 130-millisecond time to first token with 170 tokens per second and a total response time of 700 milliseconds for Mistral 7B, solidly in the most attractive quadrant for these metrics.
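
To make these metrics concrete, here's a minimal sketch of how TTFT, TPS, and total response time can be measured against a streaming HTTP endpoint. The URL, payload shape, and fixed output length are placeholder assumptions, not Baseten's API; a real benchmark would count tokens with a tokenizer rather than assume a fixed output length.

```python
import time

import requests


def measure_streaming_metrics(url: str, payload: dict, num_output_tokens: int = 100) -> dict:
    """Measure TTFT, TPS, and total response time for one streaming request.

    The URL and payload shape are placeholders for your own endpoint, and the
    fixed output length is an assumption used only to illustrate the math.
    """
    start = time.perf_counter()
    first_chunk_at = None

    with requests.post(url, json=payload, stream=True) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=None):
            if chunk and first_chunk_at is None:
                first_chunk_at = time.perf_counter()  # first token (or chunk) received

    end = time.perf_counter()
    ttft = first_chunk_at - start      # time to first token
    total = end - start                # total response time
    tps = num_output_tokens / total    # average tokens per second over the whole response
    return {"ttft_s": ttft, "total_s": total, "tps": tps}
```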

This point-in-time result is the culmination of months of work, but it’s only the beginning of our commitment to excellence in the rapidly evolving field of model performance. Recently, we’ve experimented with a variety of model performance techniques.

As part of our focus on model performance, we’ve built tools and rigorous methodologies for benchmarking LLMs to ensure accurate metrics that are relevant to real-world situations. We’re pleased to have validated our approach via Artificial Analysis’ independent benchmarking process. In this article, we’ll dive deeper into our benchmarking methods and explore more performance metrics for Mistral 7B.

Benchmarking factors for dedicated model deployments

Other providers benchmarked by Artificial Analysis offer shared inference endpoints, where a single endpoint serves traffic from many customers. In contrast, Baseten provides dedicated model deployments, where users deploy their own instance of a model with its own endpoint. This approach offers substantial benefits — privacy, security, reliability — and exposes more levers for developers to adjust, adding nuance to benchmark results.

Depending on which aspects of LLM performance you’re optimizing for, you can trade off between latency, throughput, and cost with a dedicated deployment. In our benchmarks, we measure a range of possible settings to represent multiple realistic production scenarios.

Mistral 7B performance across batch sizes

With a dedicated deployment of Mistral 7B, you can set your own batch size to trade off between latency and throughput. A smaller batch size provides lower-latency responses, while a larger batch size sacrifices some latency for total throughput.

When using a dedicated model deployment, you pay per minute of GPU usage rather than per input and output token of an individual request. A model deployment with a larger batch size is able to handle more simultaneous queries and produce more throughput, reducing your effective cost per million tokens. At scale, large batch sizes on dedicated deployments can offer substantial cost savings.
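
To illustrate the arithmetic, here's a small sketch of the effective cost calculation. The GPU price and throughput figures are hypothetical placeholders, not quoted rates.

```python
def cost_per_million_tokens(gpu_cost_per_minute: float, total_tokens_per_second: float) -> float:
    """Effective cost per million tokens for a dedicated deployment billed
    per minute of GPU time and kept busy at the given total throughput."""
    tokens_per_minute = total_tokens_per_second * 60
    return gpu_cost_per_minute / tokens_per_minute * 1_000_000


# Hypothetical numbers for illustration only: at $0.10 per GPU-minute, a
# deployment sustaining 1,500 total tokens/second works out to ~$1.11 per
# million tokens, while one sustaining 300 tokens/second costs ~$5.56.
print(cost_per_million_tokens(0.10, 1500))  # ~1.11
print(cost_per_million_tokens(0.10, 300))   # ~5.56
```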

Given the same sequence lengths as the Artificial Analysis benchmarks (80x100), here are the TTFT and TPS benchmarks observed for batch sizes from 1 to 128:

Mistral 7B TTFT and TPS across batch sizes

We can see that as batch size goes up, both TTFT and per-request TPS get worse: latency increases and each request receives fewer tokens per second.

Because we’re working with a dedicated deployment, as batch size increases, so does the total tokens per second generated by the model, reducing the cost per token. Here’s a look at how larger batch sizes increase the total throughput of the deployed model:

Mistral 7B total throughput across batch sizes

When you deploy Mistral 7B, or any model, you can decide where you want to end up on the cost-performance curve and get there by adjusting batch sizes. With continuous batching, requests are slotted in as they are received up to the maximum batch size specified to the TensorRT-LLM serving engine. As more slots are used, up to the maximum batch size, the iteration speed of the engine gradually decreases.
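
Conceptually, the scheduling loop looks something like the sketch below. This is a simplified illustration of continuous batching, not TensorRT-LLM's actual scheduler; decode_step and is_finished stand in for the engine's forward pass and stop-condition logic.

```python
from collections import deque

MAX_BATCH_SIZE = 32  # maximum number of slots configured on the serving engine


def continuous_batching_loop(pending: deque, decode_step, is_finished):
    """Simplified continuous batching: admit waiting requests into free slots
    each iteration, run one decode step for all active requests, and release
    slots as requests finish."""
    active = []
    while pending or active:
        # Fill free slots with newly arrived requests, up to the max batch size.
        while pending and len(active) < MAX_BATCH_SIZE:
            active.append(pending.popleft())

        # One iteration produces one new token per active request. With more
        # slots occupied, each iteration takes a little longer, which is why
        # per-request tokens per second drops as batch size grows.
        decode_step(active)

        # Finished requests release their slots immediately, without waiting
        # for the rest of the batch.
        active = [request for request in active if not is_finished(request)]
```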

Mistral 7B performance across sequence lengths

In a benchmark, you use fixed sequence lengths — the number of input tokens the model receives and the number of output tokens the model produces. In production, there will be variance in these lengths, but you can estimate reasonable values to run a benchmark with based on use case (for example, summarization tasks will have longer input sequences than output sequences).

Artificial Analysis uses 80 input tokens and 100 output tokens for their benchmark, which lines up with short chat use cases like a customer service chatbot. In addition to this 80x100 sequence, we wanted to see how Mistral 7B performed on longer sequences: 100x1000, 250x1000, 500x1000, and 1000x1000. Here’s a visualization of the results (all run with a fixed batch size of 32):

Mistral 7B performance across sequence shapes
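
For reference, the sweep itself can be expressed as a small config; benchmark_one_shape is a hypothetical stand-in for whatever request loop you use (for example, the streaming measurement sketched earlier).

```python
# Sequence shapes as (input_tokens, output_tokens), all at a fixed batch size of 32.
SEQUENCE_SHAPES = [(80, 100), (100, 1000), (250, 1000), (500, 1000), (1000, 1000)]
BATCH_SIZE = 32


def run_sequence_sweep(benchmark_one_shape):
    """Run the benchmark once per sequence shape and collect the results.
    benchmark_one_shape(input_tokens, output_tokens, batch_size) is a
    hypothetical placeholder for your own request loop."""
    results = {}
    for input_tokens, output_tokens in SEQUENCE_SHAPES:
        results[(input_tokens, output_tokens)] = benchmark_one_shape(
            input_tokens, output_tokens, BATCH_SIZE
        )
    return results
```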

We observe that as sequence length increases, time to first token rises and tokens per second falls. The degradation is mostly driven by the longer input sequence, which requires more time to process in the compute-bound prefill step of LLM inference.

There’s also an interaction between sequence lengths and batch size. With a long enough input and a large enough batch (say 1000 input tokens with a batch size of 96), you can saturate the compute slots available for prefill. This makes the model’s time to first token go vertical: from a few hundred milliseconds at a batch size of 72 to many seconds with a batch size of 96 for the thousand-input-token example. Benchmarking across a range of sequence lengths and batch sizes identifies these sharp cliffs to help you find the sweet spot for operating your model serving engine.
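
A back-of-the-envelope calculation shows why the cliff appears: the prefill work queued ahead of the first token scales with both input length and batch size. The numbers below are illustrative, not a model of any particular engine.

```python
def prefill_tokens_in_flight(input_tokens: int, batch_size: int) -> int:
    """Total prompt tokens the engine must prefill when a full batch of
    similar requests arrives at once. Prefill is compute-bound, so once the
    GPU saturates, TTFT grows roughly with this number."""
    return input_tokens * batch_size


# 80-token prompts at batch size 32 queue ~2,560 prefill tokens, while
# 1,000-token prompts at batch size 96 queue ~96,000, which is why TTFT
# can jump from hundreds of milliseconds to many seconds.
print(prefill_tokens_in_flight(80, 32))    # 2560
print(prefill_tokens_in_flight(1000, 96))  # 96000
```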

Session reuse for lower time to first token

Time to first token includes not only the time the model needs to generate a token of output, but also the time it takes to send the request and response to and from the model. That means it’s an infrastructure and application architecture problem alongside a model performance problem.

One free win for lowering TTFT in production is reusing connections in a session. There’s a variable overhead — typically 20 to 30 milliseconds — in establishing a TCP connection through a network. If you can skip that overhead by reusing an existing connection, your TTFT improves.

With a batch size of 32 and an 80-token input sequence, time to first token is under 60 milliseconds with an active TCP connection. If it takes 30 milliseconds to establish a new connection, your TTFT goes up 50%, a huge increase in latency.
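
In Python, this can be as simple as holding one requests.Session open across calls rather than opening a fresh connection per request; the endpoint URL and payload below are placeholders.

```python
import requests

# Reusing one Session keeps the underlying TCP (and TLS) connection alive
# across requests, so follow-up calls skip the connection-setup overhead
# that would otherwise be added to time to first token.
session = requests.Session()


def generate(prompt: str, url: str = "https://example.com/v1/generate") -> str:
    """Send a generation request over the shared, kept-alive connection.
    The URL and payload shape are placeholders for your own endpoint."""
    response = session.post(url, json={"prompt": prompt}, timeout=30)
    response.raise_for_status()
    return response.text
```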

Accurate tokenization for measuring TPS

Throughput for an LLM is measured in tokens per second. We all know what a second is, but what exactly is a token? As it turns out, the definition of token can vary. Major LLMs, including Mistral 7B, use “subword tokenization,” where a token can be a short word, a piece of a longer word, or a special character.

The exact tokenizer used can have a slight impact on the tokens per second calculated by a benchmark. Artificial Analysis uses tiktoken, a general purpose tokenizer created by OpenAI, for all models to make model-to-model comparisons more standardized. You may get slightly different TPS measurements with the Mistral-specific tokenizer that our Mistral 7B implementation and benchmarking tool use.
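
To see how the choice of tokenizer shifts the count, you can tokenize the same text both ways. The cl100k_base encoding and the Hugging Face repository name below are assumptions, and some Mistral repositories require authentication or license acceptance.

```python
import tiktoken
from transformers import AutoTokenizer

text = "Baseten benchmarks Mistral 7B at 170 tokens per second."

# General-purpose tiktoken encoding; the specific choice of cl100k_base here
# is an assumption, not a statement about Artificial Analysis' configuration.
tiktoken_count = len(tiktoken.get_encoding("cl100k_base").encode(text))

# Mistral's own tokenizer from Hugging Face; the repository name is an
# assumption and may require authentication or license acceptance.
mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
mistral_count = len(mistral_tokenizer.encode(text, add_special_tokens=False))

# The two counts can differ slightly, which shifts any tokens-per-second
# figure computed from them.
print(tiktoken_count, mistral_count)
```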

To report metrics in terms of tokens, you don’t just need an accurate and shared definition of “token,” you also need to ensure that the tokens generated are valuable output. Users don’t exactly care about tokens per second; they care about how quickly the application can generate the output they want.

If the type of output generated by a benchmarking tool doesn’t align with the expected usage of the model in production, metrics like TPS can get skewed. Tokens produced during a benchmark should have similar distribution statistics to real-world natural language: similar numbers of simple and complex words, special characters, newlines, and so forth.
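
One way to sanity-check this is to compute simple distribution statistics over the decoded tokens a benchmark produces and compare them against tokens from real traffic. The specific measures in the sketch below are arbitrary illustrations, not a standard methodology.

```python
from collections import Counter


def token_stats(tokens: list[str]) -> dict:
    """Simple distribution statistics for a list of decoded token strings,
    useful for comparing benchmark output against real-world traffic."""
    count = max(len(tokens), 1)
    lengths = [len(token) for token in tokens]
    return {
        "num_tokens": len(tokens),
        "avg_token_length": sum(lengths) / count,
        "share_newlines": sum(token == "\n" for token in tokens) / count,
        "share_short_pieces": sum(len(token) <= 2 for token in tokens) / count,
        "most_common": Counter(tokens).most_common(5),
    }
```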

High-performance LLM inference in production

Model performance is essential. Users demand the lowest latency and highest speeds, so we’ve invested in creating best in class performance for models like Mistral 7B. But using an LLM in production is about more than just tokens per second. You need robust and reliable infrastructure to power your production workloads.

Baseten is not a shared endpoint inference provider. Instead, we offer dedicated deployments of Mistral 7B — and any open source or custom model — for unmatched privacy, security, and reliability. On Baseten’s infrastructure, you get autoscaling with scale to zero, SOC 2 and HIPAA compliance, and dedicated model deployments with no noisy neighbors.

Shared endpoint model providers set their prices as a cost per million tokens. With autoscaling dedicated model deployments billed per minute of GPU usage, your actual cost per million tokens varies based on traffic patterns, batch sizes, and sequence lengths (how many input and output tokens your average request contains) — all of these factor into total tokens processed per second. Our optimized inference engines give you levers to make tradeoffs around latency, throughput, and cost, helping you get the maximum value from your hardware and achieve a lower cost at scale than shared endpoint providers.

You can deploy our optimized implementation of Mistral 7B on H100 GPUs today. It uses TensorRT-LLM as an optimization library, meaning that you’ll need an engine built specifically for your traffic patterns with the right batch size and sequence lengths. You can deploy TensorRT-LLM optimized models directly from the Baseten model library or let us know about your use case and we can help configure an optimized model serving engine.

The model performance landscape is constantly changing, and we’re changing with it. As new optimization techniques are discovered and more advanced model serving technologies are released, we look forward to baking them into a production-ready platform capable of securely and reliably serving ML models with best in class performance.