Welcome to Baseten’s new series, StartupML, where we invite ML experts at start-ups for an “Ask Me Anything”-style interview to share best practices, frameworks, and insights on building ML teams from the ground up.
After finishing his undergraduate degree in Computer Science at Georgia Tech, Nikhil joined Patreon as a Software Engineer on their Risk team. Now a little more than six years into his tenure, Nikhil has worked on everything from product to fighting fraud to content moderation.
Through his work on Risk, Nikhil started to spend more and more time in the Machine Learning space, specifically helping to build the infrastructure needed to productionize Patreon’s ML models. About three years ago, Nikhil transitioned full-time into ML Engineering, where he’s currently helping to build out Patreon’s ML platform.
To learn more about his work as an early ML team member, keep reading this AMA to hear what Nikhil had to say about:
Moving from software engineer to ML engineer
Balancing speed vs. scalability with your ML platform
Knowing when to build internally vs. buy/open-source
Considerations for your ML team’s 2nd and 3rd hires
At the time we had just productionized our first machine learning system, which was a fraud-fighting algorithm. It was pretty bootstrapped. It didn’t really create an extensible system for other ML models to build on. When the time came for us to build our second ML system, which was image classification for content moderation, it became clear there was room for a role: someone to think about systems more long-term and be more opinionated about how we train and deploy models, instead of building hacky one-offs.
Today, there are four of us in total. We have one pure data scientist who works on models, another who is a jack of all trades: data scientist, ML engineer, software engineer. And then myself and another person who are software engineers-turned-ML engineers.
To have any opinions worth having, first you should just try to do whatever it takes to get the first version out. That does a couple of things. First, it proves that it's actually something worth doing, that it’s valuable for the product and for the business. And secondly, in that process, you’re gonna learn a lot of things, like: Did we frame the problem correctly? Did we handle the data the right way? And with the actual operationalization, did we roll it out too fast, too slow?
One example of us doing this is we had a new model that was retraining on some new data, and it wasn't performing the same way. It turned out to be a really hairy bug in how we do feature engineering at inference time, and that prompted us to add more layers of unit tests. So that’s immediately how we diagnosed what problem to invest in so we could iterate faster. There's no substitute for experience. What you get from V1 is proving the value, and figuring out what you need to improve for V2.
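One common way to catch this class of bug, and a minimal sketch of the kind of unit test Nikhil describes (the function and feature names here are hypothetical, not Patreon's actual code), is to make training and inference share a single feature-engineering function and pin its output against a fixture from the training data:

```python
# Hypothetical sketch: guard against training/serving skew by having ONE
# feature-engineering function, imported by both the training job and the
# inference service, and unit-testing it against a known fixture.

def build_features(raw: dict) -> list[float]:
    """Single source of truth for feature engineering.
    If inference re-implemented this separately, the two copies
    could silently drift apart (the 'hairy bug' scenario)."""
    amount = float(raw.get("amount", 0.0))          # raw field may arrive as a string
    country_risk = 1.0 if raw.get("country") in {"XX", "YY"} else 0.0
    account_age_days = max(raw.get("account_age_days", 0), 0)
    return [amount, country_risk, account_age_days / 365.0]


def test_feature_parity():
    # Fixture captured from the training dataset; the inference path must
    # reproduce these exact values for the same raw record.
    raw = {"amount": "19.99", "country": "XX", "account_age_days": 730}
    assert build_features(raw) == [19.99, 1.0, 2.0]


test_feature_parity()
```

The design point is that the test encodes what the training pipeline produced, so any change in the inference-time code path fails loudly instead of degrading model performance silently.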
At startups, the most important thing to consider is you don't have the luxury of time. You need to build with urgency. And if you go away for three months and ship something that maybe someone will like, you just wasted three months of runway. Being hyper-focused on providing value early on, as fast as possible, should be the goal. The first version of what you build shouldn't be very clean or full of craft; if it is, that probably means you've gone too slow.
This is the eternal tension in software engineering organizations. Do you build scrappy and fast, or do you build long-term systems?
I think the earlier you are in your startup’s or team’s journey, the better it is to stay scrappy. And then as you prove out value and earn yourself more time, then you can start to prioritize building long-term systems. One spicy thing I’ll say is that tech debt isn't necessarily bad. Tech debt is only bad if you never intend to pay it back. In startups, you're flying by the seat of your pants. And if you're taking several months to build out something that's unproven, it's just a big business risk.
Again, it comes back to what provides value fastest. And it doesn't always have to be value to the product or the business. It can also be things that help you iterate faster. An example of that is reducing the time it takes to retrain your model from weeks to days, or days to hours, hours to minutes. Another important value-creating dimension to consider early on is how to measure problems and models better. Maybe you’ve got some data science and product-based metrics, but maybe you haven't converted that to dollars for the business, or you could’ve sampled your data better.
So the way I would stack rank what you prioritize is:
Can I add more value for the business or the product?
Can I make iteration on this process faster?
Can I remove manual work?
And can I measure this better?
As an example, when I first transitioned into my ML Engineer role at Patreon, we had built this home-grown serialization logic that very quickly became unmaintainable. We couldn’t use the latest versions of libraries because we needed to rewrite our serialization logic every time. So the first thing I thought was: how can I remove this manual work?
And that’s when I decided to scrap our in-house solution and replace it with MLflow. We went from 500+ lines of code to literally one line of code. That meant we could just upgrade MLflow every time we needed to use the latest model, and it allowed us to keep pace with library improvements. It really came down to this decision to embrace open source, and get a bigger bang for our buck with a small team. And that’s really, really important: knowing when to build or buy.
When you're met with the decision to build or buy/go open-source, the question you should be asking yourself is: is the problem I'm trying to solve unique to us or not? If it’s not unique to you (model training, for example, is probably not unique to you), you should not be building a model training platform. It's just not something that makes sense for most companies. I really regret building our initial model serialization in-house, because it’s not a problem unique to us; open source has already solved it. Focusing on the problems that are unique to you, and trying to either buy or open-source everything else, is the way to iterate and get to value fastest.
For us, being opinionated about our data is our differentiator. We are a two-sided marketplace that also has user-generated content. So we probably can't use an off-the-shelf image classification model that's not trained on our environment. It just wouldn’t work as well. Our spectrum of images differs from the median image available on the internet. It doesn't matter where we train the model or how we serialize it, but the problem framing is uniquely ours. Similarly, other companies might choose a different moment in time to trigger their image classification model for content moderation. But for us, we wanted to do it when the creator creates a post, because that’s what we feel is the lightest-touch, most proactive way to let them know that an image might be problematic before they even post it.
When you’re operationalizing a machine learning system, you’re actually doing way more than one thing. What does it really mean to deploy a model? It means you've figured out data collection, training, deployment, monitoring, and retraining. And each of those things is non-trivial. But which of those things is going to add the most value for you? Which ones are going to be the most unique to you? Those are the questions I would ask.
Early on in any machine learning engineering team’s journey, I think it’s valuable to prioritize people who have a horizontal skill set. I think there’s a day for hiring specialists, but it’s not early on, because you want someone who can unblock themselves at whatever part of the problem they’re tackling at a given time.
For us at Patreon, where there’s just a couple of us, we own the entire problem end-to-end. Each of us starts with the problem, figures out how to get the data, trains the model, serves the model, monitors it, and is on-call for it. We’ll even do the analysis on the results. So you have to have people who are comfortable working on more than one part of the stack: someone who's open to talking to other stakeholders, and who's willing to get their hands dirty with whatever it takes.
There are a couple of things in this. First, are we ready as an industry to do machine learning at small companies? I think the answer to that is a definitive yes. Assuming you have the problem or the data that justifies it, and you have the right mentality and the right people, I think it's absolutely possible to do a lot with very little.
The tooling gold rush for machine learning is great news. It means that there are a lot of companies working very hard to get a lot of your problems out of your way. And if you can pick them correctly, you can be opinionated about which part of the problem you wanna solve, and let SaaS or open source products get the other things out of the way for you.
Coming back to people thinking about moving from big to small companies in machine learning, I think the opportunity for increased ownership is profound. Maybe at a big company, the model already exists, and it's just about gaining inches. When you get to an ML problem at a start-up, it’s about gaining miles. It's gonna be many years before it's time to gain inches. And I think that's just way more exciting. You get to say: I did all these things that directly impacted the performance of this model, which impacted the business in such-and-such way. And that's just not a thing you often get to say at big companies.