Using open source ML models for your generative AI applications gives you a level of control, customization, reliability, and security not provided by proprietary model APIs. Paying directly for the hardware used for inference also gives you more control over your spend and can save money at scale through optimized usage. While there are many open source models to pick from, we selected a few whose capabilities line up well with popular closed source models, such as replacing GPT with Mistral.
Building on top of open source models gives you access to a wide range of capabilities that you would otherwise lack from a black box endpoint provider.
Think of ML models like different types of engines, each suited for a particular use case. For example, a sports car’s engine is optimized for speed and acceleration, while a tractor’s engine prioritizes torque and fuel efficiency.
Similarly, with open source models, developers can choose the model architecture that is best suited for their needs, rather than being limited to a one-size-fits-all proprietary model.
Every benefit of using open source models on dedicated hardware is downstream of the full control you get over your model’s inputs, outputs, and environment. You can fully customize your model and model server for specific use cases.
Using open source models directly also protects you from “model shift,” where endpoint providers change the underlying model, sometimes with little or no notice. Updated models often behave differently than their predecessors, breaking prompt engineering and output parsing.
When you own your model, you decide if and when it changes and exactly how its API endpoint behaves.
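One way to enforce that control is to fingerprint your model weights at deploy time and verify them at startup. This is a minimal sketch, not any platform's built-in feature; the `*.safetensors` glob is an assumption about your checkpoint format.

```python
import hashlib
from pathlib import Path

def weights_fingerprint(model_dir: str) -> str:
    """Hash every weight file in a model directory into one stable fingerprint.

    Adjust the glob if your checkpoint uses a different file format.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(model_dir).rglob("*.safetensors")):
        digest.update(path.name.encode())  # include file names so renames are detected
        digest.update(path.read_bytes())
    return digest.hexdigest()

# Record the fingerprint when you deploy; on startup, refuse to serve
# if it no longer matches -- your model can never change out from under you.
```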
Optimization is about tradeoffs. When you control your own model and infrastructure, you can make those tradeoffs based on your use case rather than accepting one vendor’s best approximation of everyone’s needs.
If you need to reduce latency, you can try a specialized inference server like vLLM, techniques like KV caching and speculative decoding, or simply a more powerful GPU. On the other hand, if your use case is latency-tolerant, you can batch multiple requests together to improve throughput and make the most of your compute resources. And if cost is a concern, quantizing your model reduces its memory footprint, letting you run bigger batches or use smaller, less expensive GPUs.
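To make the quantization tradeoff concrete, here is a back-of-the-envelope sketch of how precision affects the GPU memory needed just to hold a model's weights (it deliberately ignores KV cache and activations, which add real overhead in practice):

```python
def model_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory to hold the weights alone, in GB."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# A 7B model needs roughly:
#   14 GB at fp16, 7 GB at int8, 3.5 GB at int4
fp16 = model_memory_gb(7, 16)
int4 = model_memory_gb(7, 4)
```

At int4, a 7B model fits comfortably on a single 24 GB consumer GPU with room left over for batching.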
These optimizations are informed by doing the inference math for your GPU and model, so you know exactly where the bottlenecks are during inference.
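As one example of that inference math: single-stream autoregressive decoding is typically memory-bandwidth bound, because generating each token requires streaming every weight from GPU memory. This rough calculation (illustrative numbers, assuming a bandwidth-bound regime) gives a theoretical ceiling on decoding speed:

```python
def decode_tokens_per_second(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decoding speed: each generated token
    must read all weights from GPU memory once."""
    return bandwidth_gb_s / weights_gb

# Mistral 7B at fp16 (~14 GB of weights) on a GPU with ~2,000 GB/s of
# memory bandwidth: a ceiling of roughly 140 tokens per second.
ceiling = decode_tokens_per_second(14, 2000)
```

If your measured throughput is far below this ceiling, the bottleneck is somewhere else, such as scheduling or an under-tuned server.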
Switching to a dedicated deployment of an open source model insulates you from a class of “noisy neighbors” problems that pop up when using shared resources.
In shared-endpoint environments, multiple customers use the same hardware resources. When a “noisy neighbor” customer overuses resources, it can degrade performance for your workload.
On the other hand, when you host your own models on dedicated hardware, you replace variability with stability. You get more consistent SLAs, more control over your model status, and insight into performance metrics.
Crucially, dedicated infrastructure also shrinks your application’s attack surface and reduces the chance of data leakage, as there are fewer parties who can access the data. This enables better regulatory compliance.
With open source models, developers pay for the hardware resources directly. This provides more predictable costs compared to proprietary API services that charge per API request or per token.
When paying per API request or per token, costs can fluctuate wildly with usage, and billing can spike under sudden load. With open source models, you can cap your costs without losing the ability to scale up in response to user demand.
Autoscaling infrastructure for ML models lets you estimate expected request volume and provision the hardware accordingly, then automatically scale up and down within the configured limits in response to traffic. This limits your costs while ensuring reliability.
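The core of that scaling logic can be sketched in a few lines. This is a simplified illustration, not any particular platform's API; the parameter names are hypothetical.

```python
import math

def target_replicas(queue_depth: int, per_replica_capacity: int,
                    min_replicas: int, max_replicas: int) -> int:
    """Pick a replica count proportional to queued traffic, clamped to a
    configured floor (for reliability) and ceiling (to cap costs)."""
    needed = math.ceil(queue_depth / per_replica_capacity) if queue_depth else 0
    return max(min_replicas, min(max_replicas, needed))
```

The ceiling is what caps your spend; the floor is what keeps the first request after a quiet period from waiting on a cold start.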
The granular pricing of API services encourages limiting usage to control costs. With open source models, the focus shifts to fully utilizing available hardware resources and configuring autoscaling to match traffic. The break-even point depends on factors like request volume, batch size, resource utilization, and the API's pricing model. But for many realistic usage scenarios, open source models become cheaper at scale.
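A break-even comparison can be sketched directly. The prices below are purely illustrative placeholders, not current quotes from any provider; plug in your own numbers.

```python
def monthly_cost_api(tokens_per_month: float, price_per_1k_tokens: float) -> float:
    """Cost of a per-token API at a given monthly volume."""
    return tokens_per_month / 1000 * price_per_1k_tokens

def monthly_cost_dedicated(gpu_price_per_hour: float, hours: float = 730) -> float:
    """Cost of one GPU billed around the clock for a month.
    Autoscaling down during quiet periods would lower this further."""
    return gpu_price_per_hour * hours

# Illustrative: 500M tokens/month at $0.002 per 1K tokens vs. one $2/hr GPU.
api = monthly_cost_api(500e6, 0.002)   # ~$1,000
gpu = monthly_cost_dedicated(2.0)      # ~$1,460
# Break-even at these rates sits around 730M tokens/month; past that,
# the dedicated GPU wins -- and keeps winning as volume grows.
```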
The biggest benefit of open source, and sometimes its biggest drawback, is the sheer number of models that are publicly available. New models are released every day, and sometimes evaluations seem to be mostly based on ~vibes~. From our experience, here are some of the best open source models that you can deploy right now for common use cases.
If you’re using models like GPT or Claude, there are great open source alternatives. Large language models are one of the most actively developed categories of open source models, and there are multiple model families to choose from.
Two of the highest-quality open source LLMs are Mistral 7B and Llama 2, which comes in three sizes: 7B, 13B, and 70B. Larger models offer higher output quality but require larger, more expensive GPUs for inference. Still, 7-billion-parameter models like Mistral 7B offer strong performance, rivaling Llama 2 13B on many benchmarks.
Open source LLMs come in a few different variants. There are base models, which are designed for evaluation and fine-tuning, and instruct or chat-tuned models, which provide a conversational output similar to ChatGPT. You’ll most likely want instruct/chat-tuned models for building user-facing products.
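The base/instruct distinction matters because instruct-tuned models expect prompts in a specific chat format. As a sketch, Mistral 7B Instruct wraps user messages in `[INST]` tags (check the model card for the exact template before relying on this):

```python
def mistral_instruct_prompt(user_message: str) -> str:
    """Wrap a user message in Mistral 7B Instruct's [INST] tag format.

    Base models take raw text instead; mixing up the two formats
    noticeably degrades output quality."""
    return f"<s>[INST] {user_message} [/INST]"

prompt = mistral_instruct_prompt("Summarize this article in one sentence.")
```

In practice, prefer the tokenizer's built-in chat template over hand-rolled strings, since each model family formats conversations differently.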
Where open source LLMs also shine is the variety of models designed for specific tasks, like CodeLlama, a family of models optimized for code generation.
Text embedding models are an often-overlooked category, but they're essential for many production LLM use cases, especially giving LLMs access to information that isn't in their training data.
When working with the GPT APIs, OpenAI's text-embedding-ada-002 model fills that role. For open source, you have a number of models to choose from, but the jina-embeddings-v2 model matches ada-002 in both context window size and benchmark scores.
One important note: text embeddings are not compatible from model to model. If you switch to a new model, you’ll need to re-generate embeddings for the corpus you query against.
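The reason is that each model maps text into its own vector space, and retrieval works by comparing vectors within one space, typically with cosine similarity. A minimal sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard similarity metric for comparing text embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Comparing a query vector from model A against corpus vectors from model B
# is meaningless even when the dimensions happen to match: the axes of the
# two spaces encode different things. After switching embedding models,
# re-embed the entire corpus before querying it.
```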
Whisper is one of OpenAI's own open source models, so you can use the very same model for audio transcription while getting the performance, cost, and security benefits of running it on dedicated hardware.
Deploy Whisper from the model library.
There are a number of open source alternatives to the Audio API, like Coqui XTTS 2 and Bark. Text to speech models vary in their capabilities across languages, voices, and background sounds, but experimenting with these models is a lot of fun!
Open source ML models can improve the security, reliability, performance, and cost of your generative AI applications.