Deployment and inference for open source text embedding models

TL;DR

A text embedding model transforms text into a vector of numbers that represents the text’s semantic meaning. There are a number of high-quality open source text embedding models for different use cases across search, recommendation, classification, and retrieval-augmented generation with LLMs.

Text embedding models aren’t flashy like large language models, but they’re a foundational piece of the natural language processing field and a key component for building production-ready applications on LLMs.

Why create text embeddings?

At face value, turning nice human-readable text into a long list of numbers might seem pointless. One text embedding can’t be used for much. But creating embeddings from a corpus of text—say every post on your blog or every paragraph in your documentation—enables use cases like:

  • Search: given a query, create an embedding of that query, compare it against the embeddings from your dataset, and return the most relevant content (see the code sketch below).

  • Retrieval-augmented generation (RAG): use embedding search to grab chunks of content to use as context for text generation with LLMs.

  • Recommendations: surface related content like similar blog posts or podcast episodes.

  • Classification and clustering: categorize text by similarity.

As each of these use cases relies on creating a set of embeddings, it’s important to use the same embeddings model for both the initial dataset and any subsequent embeddings (such as search queries).
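
To make the search use case concrete, here's a minimal sketch that embeds a handful of documents and ranks them against a query by cosine similarity. It uses all-MiniLM-L6-v2 via the sentence-transformers library (both covered later in this post); the documents and query are invented for illustration.

import numpy as np
from sentence_transformers import SentenceTransformer

# Use the same model for the corpus and for every query.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

documents = [
    "How to deploy a model with Truss",
    "Choosing an instance type for inference",
    "Our favorite banana bread recipe",
]
doc_vectors = model.encode(documents)            # shape: (3, 384)
query_vector = model.encode("deploying models")  # shape: (384,)

# Rank documents by cosine similarity to the query (highest score first).
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")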

What is a text embedding?

A text embedding encodes a chunk of text as a vector (a list of floating-point numbers). This vector represents the text’s meaning in an n-dimensional space.

This is difficult to visualize at the scale of real text embedding models, which have hundreds of dimensions, but here’s a simple example in two dimensions:

Here, we have a simplified two-dimensional vector space for sentences. The sentences are clustered by similarity, with the same change in a sentence (e.g. yellow -> red) resulting in the same direction and magnitude of shift in sentence location.

Of course, what’s happening in the model is far more complex than this example, but the basic intuition remains the same. Text embedding models encode the meaning of chunks of text into vectors, which can then be compared and grouped.

Along with this general intuition, it’s worth understanding four key aspects of text embedding models: their tokenizer, context window, dimensionality, and similarity function.

Tokenizer

Like large language models, text embedding models use a tokenizer to split up the input text into chunks called “tokens” to be encoded. This happens behind the scenes in the encoding function.

Every embedding model we’ll talk about uses “subword tokenization,” which is also standard for LLMs. This form of tokenization strikes a balance between limiting the number of possible tokens and making each token meaningful.

Subword tokenization gives short, common words their own token, while splitting up larger and more complex words by their roots, prefixes, suffixes, and other components.
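
You can see subword tokenization in action by loading a model's tokenizer directly. Here's a quick sketch using the tokenizer behind all-MiniLM-L6-v2 (via Hugging Face transformers); the exact token pieces depend on the model's vocabulary.

from transformers import AutoTokenizer

# Load the tokenizer that all-MiniLM-L6-v2 uses under the hood.
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Short, common words get their own token; longer words are split into subword pieces.
print(tokenizer.tokenize("Embeddings are surprisingly useful"))
# e.g. ['em', '##bed', '##ding', '##s', 'are', 'surprisingly', 'useful']
# (the actual pieces depend on the model's vocabulary)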

Context window

Like LLMs, text embedding models have a context window: the number of tokens of input they can process at once. If you give a text embedding model a string that’s too long, it will only encode the meaning of the first N tokens of the string, where N is the number of tokens in its context window.

A larger context window allows for embedding more substantial pieces of text, which expands the use cases for the text embedding model. A context window of 256 tokens (~200 words) lets you create embeddings of a book a page at a time, while an 8,192-token (~6,000-word) context window lets you process whole chapters at once.

One trick to using text embedding effectively is finding the right chunk size when embedding a corpus of text. For your use case, do you get the most value from retrieving sentences, paragraphs, or pages? If you need to embed longer chunks of text for the project to work, you’ll be limited to picking text embedding models with larger context windows.
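
One simple way to respect a context window while chunking is to count tokens as you go. Here's a rough sketch that greedily packs paragraphs into chunks that fit all-MiniLM-L6-v2's 256-token limit; the chunking strategy is just an example, not the only approach.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
MAX_TOKENS = 256  # all-MiniLM-L6-v2's context window

def chunk_paragraphs(paragraphs, max_tokens=MAX_TOKENS):
    # Greedily pack consecutive paragraphs into chunks that fit the context window.
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}".strip()
        if len(tokenizer.encode(candidate)) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph  # assumes a single paragraph fits on its own
    if current:
        chunks.append(current)
    return chunks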

Dimensionality

One cool property of text embedding models is that no matter how short or long the input string is, the output will be exactly the same length.

That’s because a text embedding model has a fixed dimensionality: the length of its output vector. Remember, the output of a text embedding model is a vector, or list of floating-point numbers. Having every output be the same length is essential for comparing embeddings later on.
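
You can check this fixed output length directly. Here's a tiny sketch with all-MiniLM-L6-v2, whose dimensionality is 384:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

short = model.encode("Hi")
longer = model.encode("A much longer sentence about text embedding models and their many uses.")

print(len(short), len(longer))  # 384 384 (same length regardless of input length)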

Similarity function

Every use case for text embedding models involves comparing vectors. Every vector produced by the model will be the same length, and linear algebra gives us three popular comparison methods:

  • Euclidean distance, which measures the linear distance between the endpoints of two vectors.

  • Cosine similarity, which measures the angle between two vectors. This is the only similarity function that does not consider magnitude.

  • Dot product similarity, which sums the products of the vectors’ corresponding components and reflects both angle and magnitude.

While each method has its advantages and disadvantages, what’s most important when building with text embedding models is always using the same similarity function to create consistency between comparisons. 
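
Here's a minimal sketch of all three comparison methods in NumPy. Whichever one you pick, apply it consistently across all of your comparisons.

import numpy as np

def euclidean_distance(a, b):
    # Linear distance between the endpoints of the two vectors (lower means more similar).
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    # Angle between the two vectors; ignores magnitude (higher means more similar).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product_similarity(a, b):
    # Sum of the products of corresponding components; reflects both angle and magnitude.
    return float(np.dot(a, b))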

Selecting an open source text embedding model

New and updated open source text embedding models are released every week. You’ll find thousands on Hugging Face. Which model to pick depends on your use case and compute resources.

When building with text embedding models, it’s essential to pick a model that meets all of your needs. If you decide to switch models, you’ll need to regenerate embeddings for your entire database with the new model; you can’t meaningfully compare embeddings generated with different models.

Here’s a rundown of five popular open source text embedding models.

For general use (lower dimensionality): all-MiniLM-L6-v2

One of the most popular embedding models is all-MiniLM-L6-v2. Model stats:

  • Context window: 256 tokens

  • Dimensionality: 384

Use all-MiniLM-L6-v2 for embedding sentences and paragraphs of English text. Deploy all-MiniLM-L6-v2 instantly from our model library.

For general use (higher dimensionality): all-mpnet-base-v2

Another great all-purpose text embedding model is all-mpnet-base-v2. Model stats:

  • Context window: 384 tokens

  • Dimensionality: 768

Use all-mpnet-base-v2 for embedding sentences and paragraphs of English text. Deploy all-mpnet-base-v2 instantly from our model library.

For long text chunks: jina-embeddings-v2-base-en

For embedding pages and chapters instead of sentences and paragraphs, jina-embeddings-v2-base-en is a highly capable model that matches OpenAI’s ada-002 on benchmarks. Model stats:

  • Context window: 8,192 tokens

  • Dimensionality: 768

Take advantage of jina-embeddings-v2-base-en’s context window, which is 32 times longer than all-MiniLM-L6-v2’s context window, when you need to create embeddings of entire blog posts, book chapters, and other long-form content.

You can learn more about jina-embeddings-v2-base-en, or deploy it in two clicks from the model library. 

For multiple languages: LEALLA-base

While many text embedding models are English-only, LEALLA-base supports 109 languages. Model stats:

  • Context window: 512 tokens (from BertModel)

  • Dimensionality: 128 (small model), 192 (base model), 256 (large model)

Choose LEALLA, which comes in small, base, and large sizes, for multilingual text embedding projects.

For instruction tuning: instructor-xl

Instruction tuning with instructor-xl lets you provide a task-specific instruction alongside the text chunk to be embedded, which can improve performance in specific domains. Model stats:

  • Context window: 512 tokens

  • Dimensionality: 768

Use instructor-xl when you want to prompt the text embedding model with specific instructions.
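
As a sketch of what that looks like in practice (based on the model's documented usage pattern, and assuming the InstructorEmbedding package is installed), each input pairs an instruction with the text to embed:

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-xl')

# Each input is an [instruction, text] pair; the instruction steers the embedding
# toward a specific task and domain.
embeddings = model.encode([
    ["Represent the science title for retrieval:", "Gravitational wave detection with LIGO"],
    ["Represent the science title for retrieval:", "CRISPR-based gene editing in crops"],
])
print(embeddings.shape)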

Packaging text embedding models with Truss

Many text embedding models are available using the sentence-transformers library, which handles steps like tokenization and normalization for you. If a model isn’t supported by sentence-transformers, you can still use it with the standard transformers library; it just takes a couple of extra lines of code. Follow the code examples in the model’s documentation for model-specific guidance, such as this usage guide for LEALLA-base.
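
When a model isn't covered by sentence-transformers, the usual pattern with plain transformers is to tokenize, run the encoder, and pool the token embeddings yourself. Here's a rough sketch of that pattern using mean pooling (shown with all-MiniLM-L6-v2 for familiarity; check each model's own card, since some models pool or normalize differently).

import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'sentence-transformers/all-MiniLM-L6-v2'  # any BERT-style encoder works similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["this is a sentence", "this is another sentence"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, tokens, hidden)

# Mean-pool over real tokens only (mask out padding), then normalize.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # (2, 384)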

Here’s a demonstration of packaging all-MiniLM-L6-v2, which is supported in sentence-transformers, as a Truss.

Creating a Truss

We start by installing the truss package from PyPI and initializing an empty Truss:

pip install --upgrade truss
truss init embedding-model

Enter a model name like all-MiniLM-L6-v2 when prompted.

Implementing the model server

In the Truss’ model/model.py file, we can implement the model server.

from sentence_transformers import SentenceTransformer


class Model:
    def __init__(self, **kwargs):
        # The model is loaded lazily in load(), not at construction time.
        self._model = None

    def load(self):
        # Download and initialize the embedding model when the model server starts up.
        self._model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    def predict(self, model_input):
        # model_input is a string or list of strings; returns one embedding per input.
        return self._model.encode(model_input)

Bundling model requirements

Then, in config.yaml, we add the Python requirements for running the model. In this case, we just need sentence-transformers.

requirements:
  - sentence-transformers==2.2.2

Setting model resources

Finally, we’ll set the necessary model resources. Running this text embedding model does not require a GPU, so we’ll pick a midsize CPU instance type to balance cost and performance. A 4-core instance with 16 GiB of RAM will give us solid performance for experimenting with the model.

In config.yaml, we can specify those resources.

resources:
  accelerator: null
  cpu: '4'
  memory: 16Gi
  use_gpu: false

Deploy text embedding models to Baseten

With the model packaged as a Truss, we can deploy it to Baseten. In your terminal, run:

truss push

Enter your Baseten API key if prompted, and the model will be deployed to your account.

Running inference on text embedding models

Once the model is deployed, we can run inference. The text embedding model we’re using, all-MiniLM-L6-v2, takes a list of strings as input and returns a list of vectors as output.

Encoding a few sentences

You can call your deployed model using Python, cURL, JavaScript, and more, but for simple testing we’ll stick with the Truss CLI:

truss predict -d '["this is a sentence", "this is another sentence"]'

Creating text embeddings from a corpus of text

Passing strings inline would be inconvenient for making more than a couple of embeddings. There are various methods for using files as model input; here, we’ll again use the Truss CLI:

truss predict -f input.json > output.json

Given that input.json contains a JSON-serializable list of strings, this will create an embedding for each string in the file and save those embeddings to output.json.
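
How you produce input.json is up to you. Here's a rough sketch that splits a folder of Markdown posts into paragraphs and writes them out as a JSON list of strings; the posts/ directory and paragraph-level chunking are just example choices.

import json
from pathlib import Path

# One embedding per non-empty paragraph of every Markdown post in ./posts
chunks = []
for post in sorted(Path("posts").glob("*.md")):
    paragraphs = post.read_text().split("\n\n")
    chunks.extend(p.strip() for p in paragraphs if p.strip())

with open("input.json", "w") as f:
    json.dump(chunks, f)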

More on text embedding models

If this primer got you excited about text embedding models, there are a ton of resources to explore:

For inspiration on projects, check out embeds.ai, a text embedding battleground where you can compare the performance of popular open and closed source text embedding models.