SIL expands NLP to support the world’s 7,100+ languages


SIL International is a global nonprofit that partners with local communities to develop language solutions for the more than 7,100 languages spoken in the world today. One of SIL’s focus areas is extending rapidly evolving natural language processing (NLP) techniques, historically restricted to the world’s larger languages, to local communities.

SIL is no stranger to cutting-edge technology. In 1976, the organization developed the first portable computer specifically designed for linguistic fieldwork.

Building on SIL’s rich history of marrying technology with linguistics is a team led by Daniel Whitenack, a Data Scientist at SIL. For Whitenack’s team, extending recent NLP advances in large language models such as BERT, T0, and GPT-3 to more local languages, for use in search, dialogue systems, speech technology, and mobile apps, is one way to help local language communities flourish in the digital sphere.

Whitenack and his core team of six are part of an innovation organization within SIL. Tasked with ushering in new ideas and projects, the team is a mix of data scientists and computational linguists well-suited to developing novel NLP solutions.

The Challenge

In a recent project, Whitenack and his team were building a diagnostic suite to assess the quality of text translations. Because poor translations can have social or liability consequences, especially in industries like media or healthcare, many organizations choose not to support multiple languages. To address this, the team developed a zoo of techniques, essentially “a bunch of different models that assess different qualities of a translation like readability, comprehensibility, and similarity,” said Whitenack.

But serving such a computationally demanding tool proved unwieldy. “We would need to run a huge Kubernetes cluster with a bunch of REST APIs in Flask or FastAPI and maintain it all ourselves,” said Whitenack.

Whitenack wanted an alternative. “I don’t want my team’s time to be sucked up debugging some infrastructure issue,” he said. “I want them focused on method development, because that's where their strengths are.” 

The team could have engaged other DevOps and infrastructure teams within SIL for help, but they didn’t want to become dependent on another team and lose control of their operational velocity. Ideally, the team could self-serve their models into production quickly, without needing to worry about infrastructure and maintenance. 

The Solution

That’s when Whitenack discovered Baseten. With Baseten, the team could instantly deploy their models with just a few lines of Python. Suddenly, SIL’s data science team could go from an offline notebook to a fully deployed, production-ready model within a matter of minutes. “We’re happy to write a Python class,” said Whitenack, “but we don’t want to be writing YAML config all day.” 

And with Baseten’s Worklets, the team could use draggable nodes to configure inputs, outputs, and business logic, giving other services a simple way to call the functionality provided by one or more deployed models.

Whitenack had found the solution that would ensure his team could get models in front of their stakeholders quickly. Baseten had all of the appeal and control of self-serving model deployment, without “any of the annoying config, infra, and health checks.”

“I’m done with getting pings from PagerDuty in the middle of the night,” joked Whitenack. 

The team’s diagnostic suite is now in staging, with several translation projects already working with the technology. Whitenack hopes to launch to general release sometime in mid-2022. 

The Result

Looking ahead, Whitenack is excited about how much faster his team will be able to move, and how that will raise the quality of their work. “Now that our model serving and infrastructure are automated through Baseten,” he said, “we can focus more on experimentation and optimization.”

With the ability to go quickly from model development to an interface, whether an API or an interactive view, Whitenack’s team can work more iteratively. “Because we can move through a full development cycle more quickly, our quality of work is better,” said Whitenack. “We can try more things without worrying about breaking something, or needing to figure out which model version to revert back to.”

Today, the team spends more time prototyping and getting ideas in front of their non-technical partners, knowing that they’ll be able to easily make adjustments and build as they go. 

“I’m really happy with the caliber of work we’re doing with NLP,” said Whitenack, “and by accelerating our development with Baseten, our team has the freedom to take on even more.”