A text embedding model transforms text into a vector of numbers that represents the text’s semantic meaning. There are a number of high-quality open source text embedding models for different use cases across search, recommendation, classification, and retrieval-augmented generation with LLMs.
Text embedding models aren’t flashy like large language models, but they’re a foundational piece of the natural language processing field and a key component for building production-ready applications on LLMs.
At face value, turning nice human-readable text into a long list of numbers might seem pointless. One text embedding can’t be used for much. But creating embeddings from a corpus of text—say every post on your blog or every paragraph in your documentation—enables use cases like:
Search: given a query, create an embedding of the query, compare it against the embeddings of your dataset, and return the most relevant content.
Retrieval-augmented generation (RAG): use embedding search to grab chunks of content to use as context for text generation with LLMs.
Recommendations: surface related content like similar blog posts or podcast episodes.
Classification and clustering: categorize text by similarity.
As each of these use cases relies on creating a set of embeddings, it’s important to use the same embeddings model for both the initial dataset and any subsequent embeddings (such as search queries).
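To make the search use case concrete, here's a minimal sketch using the sentence-transformers library; the corpus, the query, and the model choice are placeholder examples, and the same model encodes both the corpus and the query:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder corpus; in practice this would be your blog posts, docs, etc.
corpus = [
    "How to deploy a machine learning model server",
    "Choosing the right GPU for inference",
    "Writing good developer documentation",
]

# Use the same model for the corpus and for every query
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("deploying models to production", convert_to_tensor=True)

# Return the corpus entries most similar to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```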
A text embedding encodes a chunk of text as a vector (a list of floating-point numbers). This vector represents the text’s meaning in an n-dimensional space.
This is difficult to visualize at the scale of real text embedding models, which have hundreds of dimensions, but here’s a simple example in two dimensions:
Here, we have a simplified two-dimensional vector space for sentences. The sentences are clustered by similarity, with the same change in a sentence (e.g. yellow -> red) resulting in the same direction and magnitude of shift in sentence location.
Of course, what’s happening in the model is far more complex than this example, but the basic intuition remains the same. Text embedding models encode the meaning of chunks of text into vectors, which can then be compared and grouped.
Along with this general intuition, it’s worth understanding four key aspects of text embedding models: their tokenizer, context window, dimensionality, and similarity function.
Like large language models, text embedding models use a tokenizer to split up the input text into chunks called “tokens” to be encoded. This happens behind the scenes in the encoding function.
Every embedding model we’ll talk about uses “subword tokenization,” which is also standard for LLMs. This form of tokenization strikes a balance between limiting the number of possible tokens and making each token meaningful.
Subword tokenization gives short, common words their own token, while splitting up larger and more complex words by their roots, prefixes, suffixes, and other components.
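As a quick illustration, here's a sketch of inspecting subword tokenization with the Hugging Face transformers tokenizer that ships with all-MiniLM-L6-v2 (the exact splits depend on the model's vocabulary):

```python
from transformers import AutoTokenizer

# Tokenizer that ships with the all-MiniLM-L6-v2 embedding model
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Short, common words typically map to a single token,
# while rarer words are split into subword pieces
print(tokenizer.tokenize("The cat sat on the mat"))
print(tokenizer.tokenize("Tokenization handles uncommonly long words"))
```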
Like LLMs, text embedding models have a context window: the number of tokens of input they can process at once. If you give a text embedding model a string that’s too long, it will only encode the meaning of the first N tokens of the string, where N is the number of tokens in its context window.
A larger context window allows for embedding more substantial pieces of text, which expands the use cases for the text embedding model. A context window of 256 tokens (~200 words) lets you create embeddings of a book a page at a time, while an 8,192-token (~6,000-word) context window lets you process whole chapters at a time.
One trick to using text embedding effectively is finding the right chunk size when embedding a corpus of text. For your use case, do you get the most value from retrieving sentences, paragraphs, or pages? If you need to embed longer chunks of text for the project to work, you’ll be limited to picking text embedding models with larger context windows.
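One simple way to experiment with chunk size is to split a document by token count before embedding it. Here's a sketch assuming a 256-token context window; the chunk size and overlap values are placeholders to tune for your use case:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text: str, max_tokens: int = 256, overlap: int = 32) -> list[str]:
    """Split text into overlapping chunks that fit the model's context window."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start : start + max_tokens]
        chunks.append(tokenizer.decode(window))
    return chunks
```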
One cool property of text embedding models is that no matter how short or long the input string is, the output will be exactly the same length.
That’s because a text embedding model has a fixed dimensionality, or output vector length. Remember, the output of a text embedding model is a vector, or list of floating-point numbers. Having every output the same length is essential for using embeddings later on.
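For example, with all-MiniLM-L6-v2, a two-word input and a full paragraph both come back as vectors of the same length (384 dimensions for this model):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

short_vector = model.encode("Hello world")
long_vector = model.encode(
    "A much longer input made up of several sentences still produces "
    "a vector with exactly the same number of dimensions."
)

# Both print the model's fixed dimensionality
print(len(short_vector), len(long_vector))
```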
Text embedding models differ in dimensionality, and higher-dimensional embeddings come with tradeoffs:

| Pros of high-dimensionality models | Cons of high-dimensionality models |
| --- | --- |
| They can provide more accurate results. | They require a larger dataset to make the most of extra dimensions. |
| They can encode more semantic meaning, especially from longer chunks of text. | They make search and indexing much more costly. |
| They are likely to encode the meaning of rare words, which can be essential to the text. | Their longer outputs mean an increased cost of storage. |
Every use case for text embedding models involves comparing vectors. Every vector produced by the model will be the same length, and linear algebra gives us three popular comparison methods:
Euclidean distance, which measures the linear distance between the endpoints of two vectors.
Cosine similarity, which measures the angle between two vectors. This is the only similarity function that does not consider magnitude.
Dot product similarity, which sums the products of the two vectors’ corresponding components and reflects both angle and magnitude.
While each method has its advantages and disadvantages, what’s most important when building with text embedding models is always using the same similarity function to create consistency between comparisons.
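Here's a small sketch of the three comparison methods using NumPy; the example vectors are made up for illustration:

```python
import numpy as np

# Made-up example vectors
a = np.array([0.2, 0.8, 0.4])
b = np.array([0.1, 0.9, 0.3])

# Euclidean distance: straight-line distance between endpoints (lower = more similar)
euclidean = np.linalg.norm(a - b)

# Cosine similarity: angle between the vectors, ignoring magnitude (higher = more similar)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product: sum of the products of corresponding components; sensitive to magnitude
dot = np.dot(a, b)

print(euclidean, cosine, dot)
```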
New and updated open source text embedding models are released every week. You’ll find thousands on Hugging Face. Which model to pick depends on your use case and compute resources.
When building with text embedding models, it’s essential to pick a model that meets all of your needs. If you decide to switch models, you’ll need to regenerate embeddings for your entire database with the new model; you can’t meaningfully compare embeddings generated with different models.
Here’s a rundown of five popular open source text embedding models.
One of the most popular embedding models is all-MiniLM-L6-v2. Model stats:
Context window: 256 tokens
Use all-MiniLM-L6-v2 for embedding sentences and paragraphs of English text.
Another great all-purpose text embedding model is all-mpnet-base-v2. Model stats:
Context window: 384 tokens
Use all-mpnet-base-v2 for embedding sentences and paragraphs of English text.
For embedding pages and chapters instead of sentences and paragraphs, jina-embeddings-v2-base-en is a highly capable model that matches OpenAI’s ada-002 on benchmarks. Model stats:
Context window: 8,192 tokens
Take advantage of jina-embeddings-v2-base-en’s context window, which is 32 times longer than all-MiniLM-L6-v2’s context window, when you need to create embeddings of entire blog posts, book chapters, and other long-form content.
While many text embedding models are English-only, LEALLA-base supports 109 languages. Model stats:
Context window: 512 tokens
Dimensionality: 128 (small model), 192 (base model), 256 (large model)
Choose LEALLA, which comes in small, base, and large sizes, for multilingual text embedding projects.
Instruction tuning with instructor-xl lets you provide a task-specific instruction alongside the text chunk to be embedded, which can improve performance in specific domains. Model stats:
Context window: 512 tokens
Use instructor-xl when you want to prompt the text embedding model with specific instructions.
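Here's a minimal sketch following the usage pattern from the instructor-xl model card (via the InstructorEmbedding package); the instruction and text are made-up examples:

```python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")

# Each input pairs a task-specific instruction with the text to embed
embeddings = model.encode([
    ["Represent the financial news headline for retrieval:", "Markets rally after rate decision"],
])
print(embeddings.shape)
```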
Many text embedding models are available using the sentence-transformers library, which handles steps like tokenization and normalization for you. If a model isn’t supported by sentence-transformers, you can still use it with the standard transformers library; it’ll just take a couple of extra lines of code. Follow the code examples in the model’s documentation for model-specific guidance, such as this usage guide for LEALLA-base.
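For models without sentence-transformers support, a common pattern from model documentation is to run the model with transformers and mean-pool the token embeddings. Here's a hedged sketch; the model name is a placeholder:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder; swap in your model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["this is a sentence", "this is another sentence"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    output = model(**encoded)

# Mean-pool the token embeddings, ignoring padding tokens
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```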
Here’s a demonstration of packaging all-MiniLM-L6-v2, which is supported in sentence-transformers, as a Truss.
We start by installing the truss package from PyPI and initializing an empty Truss:
pip install --upgrade truss
truss init embedding-model
Enter a model name like all-MiniLM-L6-v2 when prompted.
In the Truss’ model/model.py file, we can implement the model server.
from sentence_transformers import SentenceTransformer

class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Load the embedding model when the model server starts up
        self._model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    def predict(self, model_input):
        # model_input is a string or list of strings; encode() returns the embedding(s)
        return self._model.encode(model_input)
In the Truss’ config.yaml file, we add the Python requirements for running the model. In this case, we just need sentence-transformers:
requirements:
  - sentence-transformers==2.2.2
Finally, we’ll set the necessary model resources. Running this text embedding model does not require a GPU, so we’ll pick a midsize CPU instance type to balance cost and performance. A 4-core instance with 16 GiB of RAM will give us solid performance for experimenting with the model.
Also in config.yaml, we can specify those resources:
resources:
  accelerator: null
  cpu: '4'
  memory: 16Gi
  use_gpu: false
With the model packaged as a Truss, we can deploy it to Baseten. In your terminal, run:
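```sh
# Assuming a recent version of the Truss CLI, truss push deploys the packaged model
truss push
```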
Enter your Baseten API key if prompted, and the model will be deployed to your account.
Once the model is deployed, we can run inference. The text embedding model we’re using, all-MiniLM-L6-v2, takes a list of strings as input and returns a list of vectors as output.
truss predict -d '["this is a sentence", "this is another sentence"]'
This would be inconvenient for making more than a couple of embeddings. There are various methods for using files as model input; here, we’ll again use the Truss CLI:
truss predict -f input.json > output.json
If input.json contains a JSON-serializable list of strings, this will create an embedding for each string in the file and save those embeddings to output.json.
If this primer got you excited about text embedding models, there are a ton of resources to explore:
For inspiration on projects, check out embeds.ai, a text embedding battleground where you can compare the performance of popular open and closed source text embedding models.