How Pipe’s data science team keeps up with rapid growth

Background

Pipe, a trading platform for recurring revenues, connects companies with recurring revenue streams to institutional investors so they can access up-front capital without diluting ownership or taking out loans. Last valued at $2 billion, Pipe is growing rapidly, and so are its machine learning needs.

Leading the charge is Faaez Ul Haq, Pipe’s Head of Data Science, and his team of four data scientists. The team uses a combination of analytics, modeling, and optimization to power several key areas of Pipe’s platform.

The Challenge

While some components of the models Ul Haq and his team develop are built in Python, Pipe’s backend infrastructure is in Go. To get models into production, the Data Science team would need to spin up entirely new infrastructure around Python. That means configuring Docker on VMs or Kubernetes, along with managing the day-to-day maintenance and DevOps that come with self-hosting.

As a small team needing to move quickly, Ul Haq wanted to avoid this at all costs. “I want my team to focus on business outcomes,” said Ul Haq. “We are always looking for ways to minimize doing work that is not to our comparative advantage.” 

The Solution

Seeking an alternative, Ul Haq stumbled upon Baseten. With Baseten, the team could serve its core models with just a few lines of Python. Baseten owns the containerization and deployment, and its Kubernetes-based architecture meant the team didn’t need to worry about model performance even with increased traffic and more demanding models.

In addition, the team found Baseten’s flexibility and thoughtful details in developer ergonomics, like being able to store API keys securely via the UI, to be a step above the competition. “Baseten feels like a modern tool that is designed with the data scientist in mind,” said Ul Haq.

Using Baseten’s Custom Models, the Pipe team added logic to integrate their model into Pipe’s Go stack. When called, the Go service hits Baseten’s API and gets a prediction back right away. Predictions are also stored in a Postgres database so the team can debug and improve the model. It took only a few days to move from an offline model in a Jupyter notebook to one deployed in production.
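The call-and-log pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Pipe's implementation: Pipe's service is written in Go, while this sketch uses Python, with the standard library's sqlite3 standing in for Postgres. The `predict_fn` callable, the `predictions` table, and the feature names are all hypothetical, and nothing here reflects Baseten's actual API.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Any, Callable, Mapping


def create_log_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table that keeps one row per prediction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "requested_at TEXT NOT NULL, "
        "features TEXT NOT NULL, "
        "prediction TEXT NOT NULL)"
    )
    conn.commit()


def predict_and_log(
    conn: sqlite3.Connection,
    features: Mapping[str, Any],
    predict_fn: Callable[[Mapping[str, Any]], Any],
) -> Any:
    """Get a prediction, store the inputs and output, and return the output.

    predict_fn is a stand-in for the HTTP call to the hosted model.
    """
    prediction = predict_fn(features)
    conn.execute(
        "INSERT INTO predictions (requested_at, features, prediction) "
        "VALUES (?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            json.dumps(dict(features), sort_keys=True),
            json.dumps(prediction),
        ),
    )
    conn.commit()
    return prediction


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_log_table(conn)

    def fake_model(features: Mapping[str, Any]) -> dict:
        # Placeholder for the remote model; always returns the same score.
        return {"score": 0.5}

    print(predict_and_log(conn, {"monthly_revenue": 1000}, fake_model))
```

Keeping the inputs next to each output is what makes the later debugging possible: any stored prediction can be replayed against a newer model version and compared.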

“Baseten is the ideal solution,” said Ul Haq. “It provides an easy way for us to host our models, iterate on them, and experiment without worrying about any of the DevOps involved.” 

The Results

With their first Baseten model in production, Pipe’s data science team is already looking to add value in additional areas.

For Ul Haq, this is just the beginning. From “the point that a customer plugs in their data sources all the way to matching them with investors on the buy side,” there are tons of modeling and optimization problems he believes his team can solve. 

“Data scientists feel empowered when they own the end-to-end lifecycle of their models,” said Ul Haq. “And with Baseten, my team can self-serve their models into production very, very quickly.”

Background

Founded in 2013, Patreon is a membership platform that enables creators to earn revenue from fan subscriptions in exchange for exclusive access, extra content, and more. Today, Patreon has over 250 thousand creators and 8 million monthly active patrons. In 2021 alone, creators earned over $1 billion from their memberships.

Being creator-first, Patreon wants its users to have as much creative control as possible, while still ensuring that the platform is safe. Creators can upload content to Patreon in any format—including text, image, audio, and video—so long as it follows the Community Guidelines.

To help keep up with the millions of assets uploaded to the platform each day, Patreon’s Data Science team developed a set of multi-label image classifiers. These classifiers identify content that may be in violation of the Community Guidelines, which enables Trust and Safety agents to prioritize one-on-one conversations with creators to bring their content within the guidelines.

The Problem

With binary classification, images are labeled as either in violation or not, an approach that often struggles to capture the nuances of the Community Guidelines. While content that contains hate speech or illegal activities is strictly prohibited, mature themes like nudity fall in a grayer area and require context. This makes it difficult for creators and Trust & Safety team members alike to clearly diagnose why content is flagged.

Instead, Nikhil Harithas, Senior Machine Learning Engineer, wanted to classify images based on three selection areas, on a five-point scale. To preserve the thoughtfulness of the Community Guidelines, the team knew they wanted to build this in-house rather than with third-party providers.

“Companies like Labelbox try to do this, but are overly complex,” said Harithas. “And we wanted to be able to get more granular to Patreon’s specific content and policies.”

To collect training data, Harithas and the team would ask Patreon’s Trust & Safety experts to manually classify 100 images per day using the new labels. But asking their colleagues to go from the relatively simple task of labeling images with a 0 or 1 to a more robust classification system, every day, was a big ask.

Harithas wanted to build a webapp to make the task as easy as possible for the Trust & Safety team. But “I don’t know how to write frontend, and we couldn’t get engineering resources,” said Harithas.

“All that together meant we were ready and willing to improve our image classification systems in service of all of our stakeholders, but we didn’t have the means to do it ourselves,” said Harithas.

The Solution

Searching for solutions, Harithas stumbled upon Baseten’s Data Labeling demo app. Instantly, he saw that all he would need to do is change the image source URL to have a working tool. Believing he had found the answer, Harithas introduced Andre Bach, Staff Data Scientist at Patreon, to Baseten.

Using Baseten’s Views, Bach quickly implemented the new five-point scale and selection areas on his own. Within days, he and Harithas shipped their live, custom image labeling webapp to the Trust & Safety team.

Each day, Trust & Safety experts use the webapp to classify a random sampling of 100 images. Those labels are stored in Baseten’s Postgres database, which then feeds into Patreon’s Databricks pipelines to train the model on a 24-hour schedule.
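The daily labeling loop described above can be sketched as follows. This is a minimal illustration, not Patreon's code: the area names in `AREAS` are placeholders (the three selection areas are not named here), the scale is assumed to run from 1 to 5, and `daily_sample` and `ImageLabel` are hypothetical helpers.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence

# Placeholder names: the three selection areas are not named in the text.
AREAS = ("area_a", "area_b", "area_c")
# Assumption: the five-point scale runs from 1 to 5 inclusive.
SCALE = range(1, 6)


@dataclass(frozen=True)
class ImageLabel:
    """One reviewer's label: a score on the five-point scale for each area."""

    image_id: str
    scores: Dict[str, int]

    def __post_init__(self) -> None:
        if set(self.scores) != set(AREAS):
            raise ValueError(f"expected scores for exactly {AREAS}")
        for area, score in self.scores.items():
            if score not in SCALE:
                raise ValueError(f"{area}: score {score} is outside 1-5")


def daily_sample(
    image_ids: Sequence[str], k: int = 100, seed: Optional[int] = None
) -> List[str]:
    """Pick up to k distinct images uniformly at random for today's review."""
    rng = random.Random(seed)
    return rng.sample(list(image_ids), min(k, len(image_ids)))


if __name__ == "__main__":
    todays_images = daily_sample([f"img-{i}" for i in range(1000)], seed=0)
    print(len(todays_images))
    label = ImageLabel(todays_images[0], {"area_a": 1, "area_b": 3, "area_c": 5})
    print(label.scores)
```

Validating each label when it is created keeps malformed rows out of the database that feeds the training pipeline.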

“I’ve encountered web apps but not in a hardcore way, and I write Python but in an offline way... So the ecosystem Baseten has built, which combines the ability to write Python objects to connect my database to draggable UI components without worrying about JavaScript, hits the sweet spot for me as a generalist Data Scientist.”

— Andre Bach, Staff Data Scientist at Patreon

The Result

Today, the team has labeled more than 10,000 images with Baseten, and it takes the Trust & Safety team less than 10 seconds to classify each image.

“Without Baseten, I would’ve asked the Trust & Safety team to fill out a Google Sheet,” said Bach. “Not only is that unsustainable, but it would’ve increased frustration with the process. I’d have the Director of the Trust & Safety team asking me if we really needed to do this.”

With their nudity image classification in a good place, Bach and Harithas are exploring additional themes, as well as other content types, like animated images, audio, and video.

“Baseten gets the process of tool-building out of the way, so we can focus on our key skills: modeling, measurement, and problem solving,” said Harithas.