Serving and deploying models: BentoML vs Truss

BentoML and Truss are two open-source frameworks for serving and deploying machine learning models.

Ultimately, the choice between BentoML and Truss comes down to your own priorities. If you’re a data scientist who is comfortable with Docker and Kubernetes or has dedicated MLOps resources, BentoML offers a lot of flexibility and features. But if you’re looking to deploy your own models with minimal effort, Truss offers a straightforward path to production while providing the power to run even large and complex models like Stable Diffusion. And you can deploy a Truss with one line of code if you’re deploying to Baseten.

| | BentoML | Truss |
|---|---|---|
| Who it's for | MLOps engineers and data scientists comfortable with Docker & Kubernetes | Data scientists and ML researchers at startups |
| Why it's great | Provides a complete toolkit for MLOps practitioners to serve models | Provides an opinionated bridge from model development to model deployment |
| What it needs | Deployment options that don't require infrastructure and maintenance | Further development to support more customization for advanced users |
| Supported frameworks | Hugging Face Transformers, LightGBM, PyTorch, Scikit-learn, TensorFlow, XGBoost, and 12+ more frameworks | Hugging Face Transformers, LightGBM, PyTorch, Scikit-learn, TensorFlow, XGBoost, and manual options |
| Deployment targets | Most cloud hosts via Bentoctl; Kubernetes cluster via Yatai | Baseten (one-line deployment); any cloud host with Docker |
| Secret management | ✔️ | ✔️ |
| Pre-processing code | ✔️ | ✔️ |
| API endpoint for model | ✔️ | ✔️ |

Serving simple models: a photo finish

Serving a simple model, like an sklearn classifier on the classic iris data set, highlights the similarities between the two frameworks.

With BentoML

The following code sample is adapted from the BentoML quickstart docs.

from sklearn import svm, datasets
import bentoml

# Load training data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Model training
clf = svm.SVC()
clf.fit(X, y)

# Save the trained model to the local BentoML model store
bentoml.sklearn.save_model("iris_clf", clf)

With Truss

The following code sample is adapted from the Truss sklearn docs.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from truss import mk_truss

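# Load the iris data set
iris = load_iris()
data_x = iris['data']
data_y = iris['target']

# Train the classifier
model = RandomForestClassifier()
model.fit(data_x, data_y)

# Package the in-memory model as a Truss in the target directory
tr = mk_truss(model, target_directory="sklearn_truss")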

In both cases, a line of Python takes the model as an in-memory object and converts it into a servable artifact. But an sklearn classifier on the iris data set is far simpler than most production models. Let’s take a look at both packages’ capabilities for dealing with real-world use cases.
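As a rough illustration of what "converting an in-memory object into a servable artifact" means, the sketch below writes an object and a small config file into a directory and loads it back. It is not how BentoML or Truss store models internally: `package_model`, `load_packaged_model`, and the file names are hypothetical, and a plain dictionary stands in for a trained model so the example needs only the standard library.

```python
import json
import pickle
import tempfile
from pathlib import Path
from typing import Any


def package_model(model: Any, target_directory: str) -> Path:
    """Write a serialized model and a small config file into one directory."""
    target = Path(target_directory)
    target.mkdir(parents=True, exist_ok=True)
    (target / "model.pkl").write_bytes(pickle.dumps(model))
    # The config records what a server would need to find and load the model
    config = {"model_type": type(model).__name__, "model_file": "model.pkl"}
    (target / "config.json").write_text(json.dumps(config))
    return target


def load_packaged_model(target_directory: str) -> Any:
    """Read the config, then deserialize the model it points to."""
    target = Path(target_directory)
    config = json.loads((target / "config.json").read_text())
    return pickle.loads((target / config["model_file"]).read_bytes())


# A dictionary of parameters stands in for a trained model (hypothetical)
toy_model = {"weights": [0.5, -0.25], "bias": 0.1}
with tempfile.TemporaryDirectory() as tmp:
    package_model(toy_model, tmp)
    restored = load_packaged_model(tmp)
```

The point of the directory layout is that the artifact carries its own description, so a serving process can load it without access to the training script.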

Adding processing: Truss gives opinionated, clean code

Running inference on a string or a short list of integers is all well and good, but sometimes ML models have larger and more complex inputs, like an image. And sometimes a model’s output, full of confidence scores and other metrics, needs to be formatted or simplified for the end user. Pre- and post-processing code lets you bundle these tasks in with the model.

BentoML users are used to adding these functions to a service file. In a Truss, the model file, which defines the Model class, specifies a preprocess and a postprocess function. These functions run before and after each model invocation, respectively, and can parse inputs and outputs, save results to a database, or perform business logic based on the model’s prediction. Whether you run your Truss locally or in production, every model inference request runs through these functions.

Here’s an example of a Model class in a Truss that implements these functions:

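import tempfile
from typing import Dict, List

import numpy as np
import requests
import tensorflow as tf
from scipy.special import softmax
from tensorflow import keras

# URL of the class labels file; the original URL is elided in the source
LABELS_URL = "..."


class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        config = kwargs["config"]
        model_metadata = config["model_metadata"]
        self._model_binary_dir = model_metadata["model_binary_dir"]
        self._model = None

    def load(self):
        # Load the serialized Keras model from the Truss data directory
        self._model = keras.models.load_model(
            str(self._data_dir / self._model_binary_dir)
        )

    def preprocess(self, request: Dict) -> Dict:
        """Preprocess step for ResNet"""
        # Assumes the request carries the image location under a "url" key
        image_response = requests.get(request["url"])
        with tempfile.NamedTemporaryFile() as f:
            # Write the downloaded bytes to disk so TensorFlow can read them
            f.write(image_response.content)
            f.flush()
            input_image = tf.image.decode_png(tf.io.read_file(f.name))
        preprocessed_image = tf.keras.applications.resnet_v2.preprocess_input(
            tf.image.resize([input_image], (224, 224))
        )
        return {"inputs": np.array(preprocessed_image)}

    def postprocess(self, response: Dict, k: int = 5) -> Dict:
        """Post process step for ResNet"""
        class_predictions = response["predictions"][0]
        # Assumes the labels file lists one class name per line
        LABELS = requests.get(LABELS_URL).text.splitlines()
        class_probabilities = softmax(class_predictions)
        # Indices of the k most probable classes, highest first
        top_probability_indices = class_probabilities.argsort()[::-1][:k].tolist()
        return {
            LABELS[index]: 100 * class_probabilities[index].round(3)
            for index in top_probability_indices
        }

    def predict(self, request: Dict) -> Dict[str, List]:
        response = {}
        inputs = request["inputs"]
        inputs = np.array(inputs)
        result = self._model.predict(inputs).tolist()
        response["predictions"] = result
        return response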
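To make the request flow concrete, here is a minimal sketch of how a serving layer might chain the three methods. It is not Truss's actual server code: `ToyModel`, `handle_request`, and the label names are hypothetical, and a trivial scoring rule stands in for a real network so the example runs with only the standard library.

```python
import math
from typing import Dict, List

LABELS = ["cat", "dog", "bird"]  # hypothetical class names


class ToyModel:
    """Stand-in for a Truss-style Model class (illustrative only)."""

    def preprocess(self, request: Dict) -> Dict:
        # Normalize raw inputs before they reach the model
        return {"inputs": [float(x) for x in request["inputs"]]}

    def predict(self, request: Dict) -> Dict[str, List[float]]:
        # Fake "inference": scale each input to produce a raw score
        return {"predictions": [2.0 * x for x in request["inputs"]]}

    def postprocess(self, response: Dict, k: int = 2) -> Dict[str, float]:
        # Softmax the raw scores, then keep the k most likely labels
        scores = response["predictions"]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        return {LABELS[i]: round(100 * probs[i], 1) for i in top}


def handle_request(model: ToyModel, request: Dict) -> Dict[str, float]:
    # Every request passes through all three stages, in this order
    return model.postprocess(model.predict(model.preprocess(request)))


result = handle_request(ToyModel(), {"inputs": ["1", "3", "2"]})
```

Because the caller only ever sees the output of the last stage, input parsing and output formatting can change without touching the prediction code.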

Deploying to production: skip the headaches with Truss

Serving a model locally is great for development, testing, and iteration. But putting models into production is how you deliver business value with ML. We’ll review BentoML’s DIY tools, then demonstrate how using Truss with Baseten achieves the same results without the infrastructure work.

Deploying a BentoML model

BentoML offers two options for deployment: Bentoctl and Yatai. Bentoctl is a CLI tool for deploying BentoML models to various platforms. Yatai is a self-hosted platform for deploying ML models on Kubernetes clusters.

The core difference between Bentoctl and Yatai is that Yatai is platform agnostic. You can run it anywhere you have a Kubernetes cluster, even locally with minikube. Meanwhile, Bentoctl relies on a series of templates that are specific to different deployment targets like AWS EC2, AWS Lambda, and Google Cloud Run. Yatai requires more infrastructure work to install than Bentoctl, and in both cases you have to ensure your infrastructure is scalable, stable, and secure.

Deploying a Truss model

With Truss, you always have the option to do your own deployment. Deploying a Truss as a Docker image is like using Yatai in that it is platform agnostic — you can run a Truss in production on anything that runs Docker, like AWS ECS and Google Cloud Run.

However, this presents the same challenges in setting up infrastructure. So Truss users can let Baseten handle that complexity by using our model deployment platform.

Going from Truss to Baseten is seamless. Truss powers the Baseten Python client. Your model in its Truss can be deployed to Baseten with a single line of code and will be hosted for you on robust, scalable, secure infrastructure. Plus, you can take advantage of the entire Baseten platform and use your model in full-stack applications.

From here, we’ll break down a few key aspects of using a model in production and compare how BentoML and Truss handle each.

Secrets (winner: tie)

Models need access to databases, APIs, S3 buckets, and other secured resources that use authentication tokens and other secrets. BentoML and Truss both enable you to reference these secrets in your models and pre- and post-processing code.

With BentoML, you can use environment variables or mount secrets to your container. In either case, the secrets can be accessed as args in the model service file.

Truss also supports both environment variables and mounted secrets. Secrets are named in the Truss’s configuration file, and can be set locally in the ~/.truss directory on your computer for development and testing. And if you’re using Baseten to deploy your Truss, you don’t need to figure out how to set environment variables or mount secrets: Baseten has built-in secret management.
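The two mechanisms, environment variables and mounted secrets, can be combined in a single lookup. The following is a minimal sketch of that pattern, not the API of either framework: `resolve_secret` and the `/secrets` mount path are hypothetical, and it assumes each mounted secret is a file named after the secret.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_secret(name: str, mount_dir: str = "/secrets") -> Optional[str]:
    """Return a secret from the environment, else from a mounted file."""
    # 1. Environment variable with the same name as the secret
    value = os.environ.get(name)
    if value is not None:
        return value
    # 2. Mounted secret: one file per secret, named after the secret (assumed layout)
    secret_file = Path(mount_dir) / name
    if secret_file.is_file():
        return secret_file.read_text().strip()
    # Nothing found: let the caller decide whether that is an error
    return None
```

Returning `None` rather than raising keeps the helper usable for optional credentials; code that requires the secret can check the result and fail early at load time.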

GPU access (winner: Truss + Baseten)

Both Truss and BentoML support running model inference on GPUs. However, GPUs are tricky to work with. Both packages manage that complexity, but neither entirely removes it: inference still has to run on a GPU whose hardware and drivers support the model.

With BentoML, GPU arguments must be passed to the Docker container that runs the model. From there, the model code must take advantage of those resources.

Truss specifies GPU requirements in the config file. This tells Docker to run on the GPU if available, but the model code should perform checks that GPU resources exist before attempting invocation.
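A check of this kind might look like the sketch below. It assumes the model uses PyTorch, which is only one of the supported frameworks, and `pick_device` is a hypothetical helper name; other frameworks expose their own equivalents.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a usable GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: the model is a PyTorch model
    except ImportError:
        # Without PyTorch installed there is nothing to run on a GPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Decide once at load time, then place the model and its inputs on this device
device = pick_device()
```

Falling back to the CPU keeps the same code runnable on a laptop during development and on a GPU instance in production.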

API endpoints (winner: Truss + Baseten)

BentoML’s service model lends itself to creating API endpoints. Hosting and securing those endpoints depends on the production environment. For example, an AWS Lambda deployment via Bentoctl would use Amazon API Gateway. API keys would need to be implemented separately.

When you deploy your Truss to Baseten, it is automatically served behind an API endpoint. What’s more, that endpoint is secured with an API key, which you can set and manage through the Baseten interface. And with the rest of the Baseten platform, you can expand that API for more complex tasks with worklets, databases, and data connections.
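From the client's side, calling a key-secured endpoint comes down to sending the key in a request header. The sketch below builds such a request with the standard library. The URL, the `Api-Key` authorization scheme, and the `{"inputs": ...}` body shape are placeholders and assumptions for illustration; check your deployment's documentation for the real values.

```python
import json
import urllib.request
from typing import Any


def build_inference_request(url: str, api_key: str, inputs: Any) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            # Assumed header scheme; the key itself should come from a secret store
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_inference_request(
    "https://example.com/models/MODEL_ID/predict",  # placeholder URL
    "YOUR_API_KEY",  # placeholder key
    [[5.1, 3.5, 1.4, 0.2]],
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline
```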

Choosing BentoML or Truss

For data scientists at startups, Truss is the model serving package built for your needs.

If you are familiar enough with Docker and Kubernetes to manage your own production infrastructure and need the flexibility that self-hosting provides, BentoML provides a great ecosystem to plug into. However, many data scientists don’t have the time or resources to manage models in production. So Truss offers the same model serving features as BentoML, backed by one-line deployment to Baseten’s model deployment platform.