Technical Writer
A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.
To compare tokens per second fairly across different large language models, we need to adjust for tokenizer efficiency.
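A minimal sketch of that adjustment, with made-up numbers and a placeholder `tokenizer_encode` callable rather than any specific tokenizer: because tokenizers split the same text into different numbers of tokens, raw tokens per second is rescaled to a reference tokenizer's granularity before comparing models.

```python
# Illustrative sketch only: names and figures are hypothetical, not benchmarks.

def tokens_per_char(tokenizer_encode, sample_text: str) -> float:
    """Average number of tokens a tokenizer emits per character of input."""
    return len(tokenizer_encode(sample_text)) / len(sample_text)

def normalized_throughput(tokens_per_second: float,
                          model_tokens_per_char: float,
                          reference_tokens_per_char: float) -> float:
    """Rescale raw tokens/second to a reference tokenizer's granularity.

    A model whose tokenizer emits more tokens per character looks faster in
    raw tokens/second for the same amount of generated text, so that
    advantage is divided back out.
    """
    return tokens_per_second * (reference_tokens_per_char / model_tokens_per_char)

# Made-up example: model B reports 110 tok/s but uses 0.36 tok/char, while the
# reference model uses 0.30 tok/char; normalized, model B delivers ~91.7 tok/s.
print(normalized_throughput(110.0, 0.36, 0.30))
```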
In this article, we outline a continuous integration and continuous deployment (CI/CD) pipeline for using AI models in production.
In this tutorial, we'll build a streaming endpoint for the XTTS V2 text-to-speech model with real-time narration and 200 ms time to first chunk.
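As a rough sketch of the client side, the snippet below consumes a hypothetical streaming endpoint over HTTP chunked transfer and measures time to first chunk; the URL and request payload are placeholders, not the actual XTTS V2 deployment.

```python
# Hedged sketch: a streaming client that times the first received audio chunk.
import time
import requests

ENDPOINT = "https://example.com/xtts-v2/stream"  # placeholder URL

def stream_tts(text: str):
    start = time.time()
    first_chunk_seen = False
    with requests.post(ENDPOINT, json={"text": text}, stream=True) as resp:
        resp.raise_for_status()
        # Yield audio bytes as they arrive instead of waiting for the full file.
        for chunk in resp.iter_content(chunk_size=None):
            if not first_chunk_seen:
                first_chunk_seen = True
                print(f"time to first chunk: {(time.time() - start) * 1000:.0f} ms")
            yield chunk

if __name__ == "__main__":
    audio = b"".join(stream_tts("Hello from a streaming text-to-speech endpoint."))
    print(f"received {len(audio)} bytes of audio")
```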
Learn how continuous and dynamic batching increase throughput during model inference with minimal impact on latency.
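A minimal sketch of request-level dynamic batching, assuming a placeholder `run_model` that processes a list of inputs in one call: requests are collected until a batch-size or wait-time limit is hit, then run together. Continuous batching, as implemented in inference servers, additionally admits new requests at every decoding step; this sketch only shows the request-level case.

```python
import asyncio

MAX_BATCH_SIZE = 8
MAX_WAIT_MS = 10

def run_model(batch):
    # Placeholder for a real batched forward pass.
    return [f"output for {item}" for item in batch]

async def batcher(queue: asyncio.Queue):
    while True:
        batch = [await queue.get()]  # block until at least one request arrives
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        prompts, futures = zip(*batch)
        for fut, out in zip(futures, run_model(list(prompts))):
            fut.set_result(out)

async def infer(queue: asyncio.Queue, prompt: str) -> str:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    task = asyncio.create_task(batcher(queue))
    results = await asyncio.gather(*(infer(queue, f"prompt {i}") for i in range(20)))
    print(results[:3])
    task.cancel()

asyncio.run(main())
```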
NVIDIA Multi-Instance GPU (MIG) enables splitting a single H100 GPU into two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.
Quantizing Mistral 7B to FP8 resulted in a near-zero increase in perplexity while yielding material performance improvements across latency, throughput, and cost.
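For reference, the sketch below shows one common way to compute the perplexity metric used in such comparisons, with Hugging Face Transformers and a placeholder evaluation text; it does not perform FP8 inference itself, which requires an engine such as TensorRT-LLM, and the model ID and sample are assumptions for illustration.

```python
# Hedged sketch: perplexity of a causal LM on a small text sample.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

sample = "The quick brown fox jumps over the lazy dog."
print(perplexity("mistralai/Mistral-7B-v0.1", sample))
```

Running the same measurement over a held-out corpus for the original and quantized deployments is what supports a "near-zero increase in perplexity" claim.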