Meet Daniel, Data Scientist and founding data science team member at SIL.
Welcome to Baseten’s new series, StartupML, where we invite ML experts at startups for an “Ask Me Anything”-style interview to share best practices, frameworks, and insights on building ML teams from the ground up.
Meet Daniel, Data Scientist at SIL 👋
When Daniel completed his PhD in Computational Physics at Purdue University, he decided he wanted to pursue a career outside of academia. From his experience running simulations and computations during his PhD, he found data science to be a good fit, and enrolled in an online bootcamp to familiarize himself with the jargon.
Since then, Daniel has held data science positions at startups like Telnyx and Pachyderm working on fraud detection and pricing optimization, among other things. He also worked as a consultant, including for The New York Times, where he helped analyze comments on article posts. In 2018 he joined SIL, a nonprofit focused on helping people flourish in their own language. As their first data scientist, Daniel applies novel AI and NLP techniques to SIL’s language-related efforts.
Daniel also co-hosts Practical AI, a podcast focused on making artificial intelligence practical, productive, and accessible to everyone.
Keep reading this AMA to hear what Daniel had to say about his efforts as the first Data Scientist at SIL, including:
- Gaining buy-in for data science across the organization
- Proving value early with creativity and scrappiness
- Being a producer, not a consumer, with engineering teams
- Hiring owners over theorists
- Overcoming failure with earlier feedback loops
Joining As The First Data Scientist
When I got to SIL, I was the data science organization. I’ve since built a team here that operates like a startup within a larger organization—we look across the business and find new opportunities to apply data science and add value. If all goes well, we’ll soon be up to five people on my team. There’s also another team that’s focused completely on NLP who we work very closely with.
Two-Pronged Approach to Gaining Buy-In
When I was hired, there wasn’t complete buy-in across the organization that NLP and AI were something we needed to put intentional effort into. So the approach I took was two-fold: on one side, I did what I term “data monkey work,” which I actually enjoy. It’s going to a team and saying “hey, I’m this new person here, what sort of data problems are you having?” And just the fact that I could join this data with that data, or help with automating something, meant a lot. It wasn’t machine learning, and it wasn’t very hard for me, but it would be a mind-blowing thing for the people I was helping. I think this is really crucial to establishing trust and showing you can provide value quickly.
Then on the side, I was building rough prototypes that helped show what AI/ML could do for SIL. These were projects that nobody necessarily gave me an explicit mandate to work on. But, from building relationships with stakeholders, I started to identify problems and coded up solutions that I could tangibly show.
Doing both of those things was a really strategic way for me to get buy-in because it’s a way to get in front of people. And that's where I think the momentum really got going. Because then leadership would see these prototypes and say: “we really want to do more of that.”
Proving Value Early With Creativity and Scrappiness
When I came on, there was certainly no GPU or MLOps infrastructure built out. So for me, initially it was about being creative with the tooling that I leveraged. I think today there are tools like Baseten and HuggingFace that provide, out of the box, a lot of really good tooling for things that were a lot harder for me back then.
When I started, I was like: how far can I get using Google Colab? And I would run a bunch of notebooks, however many sessions I was allowed, and label them Worker 1-4, in order to run experiments. I would say that creativity is really crucial because you want to use that scrappy, dynamic infrastructure before you convince the company to buy something like an on-prem server. Again, it’s about showing that you can solve mission-critical problems, that you can benefit other teams’ lives, and establish that before you make a big infrastructure ask.
Being a Producer, Not a Consumer, with Engineering
Because we’re a nonprofit, we get restricted funding for specific projects, so our engineers’ time is allocated based on what projects are funded. Asking them to help out on something outside of that is a big ask. And I think this applies to startups as well, because human resources are very limited, and it’s important that they’re spent on the parts of the company that are proven to produce value.
So I think if you’re able to take your data science work and produce a prototype or deploy a model yourself, then rather than coming to the engineering team and saying “hey here’s a notebook” and they don’t know what to do with that, you can instead say “hey I have this API endpoint or this single-page application, check out what you can do.”
Of course, you will need some engineering resources at some point. But if you can at least get to the point of an API or an application, then it helps both leadership and engineering teams see how you can add value. They can say, “hey, we wanna build in this awesome new feature, check out this demo, we should build this.” And it gives them an opportunity to build up their resourcing and time allocation. And you get to come off as actually helping the engineers build capacity and build creative new things, versus being a consumer that needs help.
Experiment Cross-Functionally to Identify Highest Impact
For us, we were always really conscious from the start that in order to build up AI and NLP at SIL, we needed to show the value of this type of technology in the various areas where SIL is already working. So we started on the translation side, where we built a system to estimate the quality of translations, and that immediately showed a lot of value. And when people started to see that, they had additional ideas for other translation projects. The danger in that is, even though translation is a value-add, is it where we can add the most value at SIL?
When we look at projects now, we try to prioritize new opportunities where we’ve not yet applied AI and NLP, while still supporting what we’ve started on the translation side. We didn’t want to be just AI for translation work alone. That way, we can start to understand what the relative value is for applying ML within SIL’s respective areas.
I would recommend that new data science teams establishing themselves within an organization try out projects within marketing, operations, sales, core product. Start exploring different areas and dip your toe in so you can learn where your return on investment is the highest.
Hire Owners Over Theorists
When it comes to hiring, I’ve always had a more practical bent. I look for people who can show practical project experience versus people who have theoretical knowledge. I want to see people owning a project from conception to actual delivery, where they walk me through their model, data, and code, all the way through to how it was deployed and scaled. That to me is a lot more valuable than seeing a paper about a new model or a GitHub repo that people can’t run because the code is indecipherable.
The more theoretical skill set certainly has its place; it’s good for research and development. But I think especially in the beginning, it’s important to look for people that are able to own something and take it forward. You have your own work to do as the team leader, your own capacity problems to deal with and stakeholders to manage, so you need every one of your early hires to be able to carry their own weight.
Overcoming Failure, Loop In End Users Early
When we first started working on our translation quality estimation system, we built it using the jargon that’s common in academia like adequacy and fluency—terms the industry uses. We underestimated the importance of looping in the actual end users earlier and more often.
We invested a lot in building an MVP, and when we showed these metrics to users at SIL, who were typically translators or consultants, the feedback we got was that it didn’t mean anything to them. They weren’t sure how to make use of the scores. We learned a lot from that, and now we try to carefully strike the balance between speed and bringing in end users earlier. Building technology is not the hard part—I would argue that’s the easiest part. But building technology that actually helps people and provides value is really difficult. And you can become really disconnected from that when you don’t involve your end users.
For Startups, Agility Is Your Comparative Advantage
I think generally as an industry we’ve started to shift away from this idea that you need to always train models from scratch. Of course, there are really large technology and science institutions that invest in that, but earlier-stage companies often don’t need that. Instead, I think the momentum has shifted so data scientists really think about how to leverage what’s already been trained, and adapt it to their specific use case. And that fits very well into a startup environment. You can fine-tune a large language model in a couple of hours.
I also think that there are some unique advantages smaller teams have. I remember when I first started talking to the team at HuggingFace, they were fewer than ten people at the time. And they’ve since made huge waves throughout the industry. Why is that? It’s because they were able to move very quickly. So with that sort of agility, combined with a willingness to fine-tune general-purpose models, I think you can really achieve a lot with very little.