BentoML and Truss are two open-source frameworks for serving and deploying machine learning models. For data scientists on small teams or at startups, Truss offers similar capabilities to BentoML and works seamlessly with Baseten’s model deployment platform. This article presents a technical comparison of the two open-source frameworks to help you decide which one to use to serve your model.
Ultimately, the choice between BentoML and Truss comes down to your own priorities. If you’re a data scientist who is comfortable with Docker and Kubernetes or has dedicated MLOps resources, BentoML offers a lot of flexibility and features. But if you’re looking to deploy your own models with minimal effort, Truss offers a straightforward path to production while providing the power to run even large and complex models like Stable Diffusion. And you can deploy a Truss with one line of code if you’re deploying to Baseten.
Serving simple models: a photo finish
Serving a simple model, like an sklearn classifier on the classic iris dataset, highlights the similarities between the two frameworks.
The following code sample is adapted from the BentoML quickstart docs.
The following code sample is adapted from the Truss sklearn docs.
In both cases, a line of Python takes the model as an in-memory object and converts it into a servable artifact. But an sklearn classifier on the iris dataset is far simpler than most production models. Let’s take a look at both packages’ capabilities for dealing with real-world use cases.
Adding processing: Truss gives opinionated, clean code
Running inference on a string or a short list of integers is all well and good, but sometimes ML models have larger and more complex inputs, like an image. And sometimes a model’s output, full of confidence scores and other metrics, needs to be formatted or simplified for the end user. Pre- and post-processing code lets you bundle these tasks in with the model.
BentoML users are used to adding these functions to a service file. In a Truss, the model.py file specifies preprocess and postprocess functions. These functions run before and after each model invocation, respectively, and can parse inputs and outputs, save results to a database, or perform business logic based on the model’s prediction. Whether you run your Truss locally or in production, every model inference request runs through these functions.
Here’s an example of a Model class in a Truss that implements these functions:
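As a minimal sketch of that pattern (not code taken from the Truss docs), assume a Model class whose preprocess, predict, and postprocess methods each take and return a dictionary. The threshold rule standing in for a trained model, the `inputs` and `predictions` keys, the label names, and the `run` helper are all hypothetical, chosen so the example runs without any dependencies.

```python
class Model:
    """Illustrative stand-in for the Model class in a Truss's model.py."""

    LABELS = ["setosa", "versicolor"]  # hypothetical label names

    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # A real Truss would deserialize a trained model here. This
        # hypothetical rule labels a flower by petal length alone.
        self._model = lambda row: 0 if row[2] < 2.5 else 1

    def preprocess(self, request):
        # Parse raw input: comma-separated strings become lists of floats.
        rows = [[float(v) for v in line.split(",")] for line in request["inputs"]]
        return {"inputs": rows}

    def predict(self, request):
        return {"predictions": [self._model(row) for row in request["inputs"]]}

    def postprocess(self, request):
        # Simplify output for the end user: class index -> readable label.
        return {"predictions": [self.LABELS[i] for i in request["predictions"]]}


def run(model, request):
    # Same order as described above: preprocess, predict, postprocess.
    return model.postprocess(model.predict(model.preprocess(request)))


model = Model()
model.load()
result = run(model, {"inputs": ["5.1,3.5,1.4,0.2", "6.2,2.9,4.3,1.3"]})
print(result)
```

Keeping parsing and formatting in their own methods means the prediction step only ever sees clean numeric input, which is the separation the opinionated layout is meant to encourage.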
Deploying to production: skip the headaches with Truss
Serving a model locally is great for development, testing, and iteration. But putting models into production is how you deliver business value with ML. We’ll review BentoML’s DIY tools, then demonstrate how using Truss with Baseten achieves the same results without the infrastructure work.
Deploying a BentoML model
BentoML offers two options for deployment: Bentoctl and Yatai. Bentoctl is a CLI tool for deploying BentoML models to various platforms. Yatai is a self-hosted platform for deploying ML models on Kubernetes clusters.
The core difference between Bentoctl and Yatai is that Yatai is platform agnostic. You can run it anywhere you have a Kubernetes cluster, even locally with minikube. Meanwhile, Bentoctl relies on a series of templates that are specific to different deployment targets like AWS EC2, AWS Lambda, and Google Cloud Run. Yatai requires more infrastructure work to install than Bentoctl, and in both cases you have to ensure your infrastructure is scalable, stable, and secure.
Get started with:
- Bentoctl documentation: https://github.com/bentoml/bentoctl
- Yatai documentation: https://github.com/bentoml/Yatai
Deploying a Truss model
With Truss, you always have the option to do your own deployment. Deploying a Truss as a Docker image is like using Yatai in that it is platform agnostic — you can run a Truss in production on anything that runs Docker, like AWS ECS and Google Cloud Run.
However, this presents the same challenges in setting up infrastructure. So Truss users can let Baseten handle that complexity by using our model deployment platform.
Going from Truss to Baseten is seamless. Truss powers the Baseten Python client. Your model in its Truss can be deployed to Baseten with a single line of code and will be hosted for you on robust, scalable, secure infrastructure. Plus, you can take advantage of the entire Baseten platform and use your model in full-stack applications.
Get started with:
- Truss to Baseten documentation: https://truss.baseten.co/deploy/baseten
- Truss to AWS documentation: https://truss.baseten.co/deploy/aws
From here, we’ll break down a few key aspects of using a model in production and compare how BentoML and Truss handle each.
Secrets (winner: tie)
Models need access to databases, APIs, S3 buckets, and other secured resources that use authentication tokens and other secrets. BentoML and Truss both enable you to reference these secrets in your models and pre- and post-processing code.
With BentoML, you can use environment variables or mount secrets to your container. In either case, the secrets can be accessed as args in the model service file.
Truss also supports both environment variables and mounted secrets. Secrets are named in the Truss’s configuration file, and can be set locally in the ~/.truss directory on your computer for development and testing. And if you’re using Baseten to deploy your Truss, you don’t need to figure out how to set environment variables or mount secrets: Baseten has built-in secret management.
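To make the two mechanisms concrete, here is a minimal sketch of how model code might resolve a secret from an environment variable first and a mounted file second. It is not either library's API: the `read_secret` helper, the `/secrets` mount directory, and the `DEMO_DB_TOKEN` name are assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path


def read_secret(name, mount_dir="/secrets"):
    """Return a secret from the environment, else from a mounted file."""
    value = os.environ.get(name)
    if value is not None:
        return value
    path = Path(mount_dir) / name  # hypothetical mount location
    if path.is_file():
        return path.read_text().strip()
    raise KeyError(f"secret {name!r} not found in environment or {mount_dir}")


# Usage: simulate a mounted secret with a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "DEMO_DB_TOKEN").write_text("from-file\n")
    os.environ.pop("DEMO_DB_TOKEN", None)
    from_file = read_secret("DEMO_DB_TOKEN", mount_dir=tmp)

    # An environment variable, when present, takes precedence over the file.
    os.environ["DEMO_DB_TOKEN"] = "from-env"
    from_env = read_secret("DEMO_DB_TOKEN", mount_dir=tmp)
    del os.environ["DEMO_DB_TOKEN"]

print(from_file, from_env)
```

Failing loudly when the secret is missing keeps a misconfigured deployment from silently running with an empty credential.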
Get started with:
- BentoML secrets example: https://docs.bentoml.org/en/latest/guides/containerization.html#access-aws-credentials-during-image-build
- Truss secrets documentation: https://truss.baseten.co/develop/secrets
GPU access (winner: Truss + Baseten)
Both Truss and BentoML support running model inference on GPUs. However, GPUs are tricky to work with. Both packages manage that complexity, but neither entirely removes it. Inference still has to run on a GPU whose hardware and drivers support the model.
With BentoML, GPU arguments must be passed to the Docker container that the model runs in. From there, the model code must take advantage of those resources.
Truss specifies GPU requirements in the config file. This tells Docker to run on the GPU if available, but the model code should perform checks that GPU resources exist before attempting invocation.
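A check like that can be quite small. The sketch below assumes the model uses PyTorch, which the text above does not specify, and falls back to the CPU when PyTorch is missing or no GPU is visible. The `pick_device` helper is a hypothetical name.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency; assumed only for this sketch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"running inference on: {device}")
```

Calling this once at load time, and moving the model to the returned device, lets the same code run on a laptop and on a GPU-backed container.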
Get started with:
- BentoML documentation for GPUs: https://docs.bentoml.org/en/latest/guides/gpu.html
- Truss documentation for GPUs: https://truss.baseten.co/develop/gpu
API endpoints (winner: Truss + Baseten)
BentoML’s service model lends itself to creating API endpoints. Hosting and securing those endpoints depends on the production environment. For example, an AWS Lambda deployment via Bentoctl would use Amazon API Gateway. API keys would need to be implemented separately.
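To give a sense of what implementing API keys separately can involve, here is a minimal sketch of a key check that a self-hosted service might run before invoking the model. The `X-API-Key` header name, the `is_authorized` helper, and the demo key are assumptions for illustration and are not part of BentoML or Amazon API Gateway.

```python
import hmac


def is_authorized(headers, expected_key):
    """Check a request's API key using a constant-time comparison."""
    supplied = headers.get("X-API-Key", "")  # hypothetical header name
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())


EXPECTED = "demo-key-123"  # in practice, load this from a secret store

ok = is_authorized({"X-API-Key": "demo-key-123"}, EXPECTED)
wrong = is_authorized({"X-API-Key": "guess"}, EXPECTED)
missing = is_authorized({}, EXPECTED)
print(ok, wrong, missing)
```

Key issuance, rotation, and rate limiting would still need to be built around a check like this, which is the extra work a managed endpoint takes off your hands.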
When you deploy your Truss to Baseten, it is automatically served behind an API endpoint. What’s more, that endpoint is secured with an API key, which you can set and manage through the Baseten interface. And with the rest of the Baseten platform, you can expand that API for more complex tasks with worklets, databases, and data connections.
Get started with:
- BentoML documentation for services: https://docs.bentoml.org/en/latest/tutorial.html#creating-a-service
- Baseten model deployment: https://docs.baseten.co/models/deploying-models
Choosing BentoML or Truss
For data scientists at startups, Truss is the model serving package built for your needs.
If you are familiar enough with Docker and Kubernetes to manage your own production infrastructure and need the flexibility that self-hosting provides, BentoML provides a great ecosystem to plug into. However, many data scientists don’t have the time or resources to manage models in production. For them, Truss offers the same model serving features as BentoML, backed by one-line deployment to Baseten’s model deployment platform.