Upgrading Patreon’s custom image classification system

Background

Founded in 2013, Patreon is a membership platform that enables creators to earn revenue from fan subscriptions in exchange for exclusive access, extra content, and more. Today, Patreon has over 250,000 creators and 8 million monthly active patrons. In 2021 alone, creators earned over $1 billion from their memberships.

Being creator-first, Patreon wants its users to have as much creative control as possible, while still ensuring that the platform is safe. Creators can upload content to Patreon in any format (including text, image, audio, and video) so long as it follows the Community Guidelines.

To help keep up with the millions of assets uploaded to the platform each day, Patreon’s Data Science team developed a set of multi-label image classifiers. These classifiers identify content that may be in violation of the Community Guidelines, which enables Trust & Safety agents to prioritize one-on-one conversations with creators to bring their content within the guidelines.

The Problem

Binary classification labels each image as either in violation or not, so it often struggles to capture the nuances of the Community Guidelines. While content that contains hate speech or illegal activities is strictly prohibited, mature themes like nudity fall in a grayer area and require context. A single yes-or-no flag makes it difficult for creators and Trust & Safety team members alike to diagnose why content was flagged.

Instead, Nikhil Harithas, Senior Machine Learning Engineer, wanted to classify images based on three selection areas, each on a five-point scale. To preserve the thoughtfulness of the Community Guidelines, the team knew they wanted to build this in-house rather than with third-party providers.

“Companies like Labelbox try to do this, but are overly complex,” said Harithas. “And we wanted to be able to get more granular to Patreon’s specific content and policies.” 

To collect training data, Harithas and the team would ask Patreon’s Trust & Safety experts to manually classify 100 images per day using the new labels. But moving their colleagues from the relatively simple task of labeling images with a 0 or 1 to a more robust classification system, every day, was a big ask.

Harithas wanted to build a webapp to make the task as easy as possible for the Trust & Safety team. But “I don’t know how to write frontend, and we couldn’t get engineering resources,” said Harithas.

“All that together meant we were ready and willing to improve our image classification systems in service of all of our stakeholders, but we didn’t have the means to do it ourselves,” said Harithas.

The Solution

Searching for solutions, Harithas stumbled upon Baseten’s Data Labeling demo app. Instantly, he saw that all he would need to do was change the image source URL to have a working tool. Believing he had found the answer, Harithas introduced Andre Bach, Staff Data Scientist at Patreon, to Baseten.

Using Baseten’s Views, Bach quickly implemented the new five-point scale and selection areas on his own. Within days, he and Harithas shipped their live, custom image labeling webapp to the Trust & Safety team. 

Each day, Trust & Safety experts use the webapp to classify a random sampling of 100 images. Those labels are stored in Baseten’s Postgres database, which then feeds into Patreon’s Databricks pipelines to train the model on a 24-hour schedule. 
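The daily workflow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Patreon’s or Baseten’s actual code: the three area names are hypothetical (the article does not name them), the 1-to-5 numbering of the scale is assumed, and an in-memory SQLite database stands in for the Postgres database so the sketch runs on its own.

```python
import random
import sqlite3
from dataclasses import dataclass

# Hypothetical names: the article does not name the three selection areas.
AREAS = ("area_a", "area_b", "area_c")
SCALE = range(1, 6)  # five-point scale, numbered 1 through 5 (an assumption)
DAILY_SAMPLE_SIZE = 100


@dataclass(frozen=True)
class Label:
    image_id: str
    scores: dict  # maps each area name to a score on the five-point scale

    def __post_init__(self):
        # Reject labels that skip an area or use an off-scale score.
        if set(self.scores) != set(AREAS):
            raise ValueError("a label needs exactly one score per area")
        if any(score not in SCALE for score in self.scores.values()):
            raise ValueError("scores must be on the five-point scale")


def sample_daily_batch(image_ids, seed=None):
    """Pick a random sample of up to 100 images for one labeling session."""
    rng = random.Random(seed)
    ids = list(image_ids)
    return rng.sample(ids, min(DAILY_SAMPLE_SIZE, len(ids)))


def make_store():
    """Create the label table; SQLite stands in for Postgres in this sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE labels ("
        "image_id TEXT, area TEXT, score INTEGER, "
        "PRIMARY KEY (image_id, area))"
    )
    return conn


def save_label(conn, label):
    """Store one row per (image, area) so a scheduled training job can read it."""
    conn.executemany(
        "INSERT INTO labels (image_id, area, score) VALUES (?, ?, ?)",
        [(label.image_id, area, score) for area, score in label.scores.items()],
    )
    conn.commit()
```

Storing one row per image and area, rather than a single 0 or 1 per image, is what lets a downstream training job treat each area as its own target.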

“I’ve encountered webapps but not in a hardcore way, and I write Python but in an offline way,” said Bach. “So the ecosystem Baseten has built, which combines the ability to write Python objects to connect my database to draggable UI components without worrying about JavaScript, hits the sweet spot for me as a generalist Data Scientist.”

The Result

Today, the team has labeled more than 10,000 images with Baseten, and the Trust & Safety team needs less than 10 seconds to classify each image. At that pace, the daily batch of 100 images takes under 17 minutes.

“Without Baseten, I would’ve asked the Trust & Safety team to fill out a Google Sheet,” said Bach. “Not only is that unsustainable, but it would’ve increased frustration with the process. I’d have the Director of the Trust & Safety team asking me if we really needed to do this.” 

With their nudity image classification in a good place, Bach and Harithas are exploring additional themes, as well as other content types, like animated images, audio, and video. 

“Baseten gets the process of tool-building out of the way, so we can focus on our key skills: modeling, measurement, and problem solving,” said Harithas.