In a post on its AI research blog, Microsoft today detailed a new language system, Speller100, that the company claims is one of the most comprehensive ever made in terms of language coverage and accuracy. Comprising a number of machine learning models that collectively correct spelling in over 100 languages, Speller100 now powers spelling correction on Bing.
As Microsoft notes, for a language with very little web presence, it’s challenging to collect an adequate amount of data to train a model. Moreover, models can’t rely solely on training data to learn the spelling of a language. At its core, spelling correction is about building both an error model and a language model, and not all errors are the same. For example, non-word errors occur when a word isn’t in the vocabulary for a given language, while real-word errors occur when the word exists but doesn’t fit the surrounding context.
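To make the distinction between the two error types concrete, here is a minimal Python sketch. It is not Microsoft's method: the tiny vocabulary, the `CONFUSABLE` table, and the bigram counts are all invented for illustration, and a production system would use learned error and language models rather than hand-written lookups.

```python
# Toy illustration only: every word list and count below is made up.
VOCAB = {"a", "piece", "peace", "of", "cake"}

# Hypothetical confusion sets: valid words that are easily mistaken for each other.
CONFUSABLE = {"peace": ["piece"], "piece": ["peace"]}

# Hypothetical bigram counts standing in for a language model.
BIGRAM_COUNTS = {
    ("a", "piece"): 30,
    ("piece", "of"): 40,
    ("a", "peace"): 1,
    ("peace", "of"): 2,
}


def context_score(tokens, i, candidate):
    """Score how well `candidate` fits between its left and right neighbors."""
    left = BIGRAM_COUNTS.get((tokens[i - 1], candidate), 0) if i > 0 else 0
    right = (
        BIGRAM_COUNTS.get((candidate, tokens[i + 1]), 0)
        if i + 1 < len(tokens)
        else 0
    )
    return left + right


def classify_token(tokens, i):
    """Label the token at position i as a non-word error, a possible real-word error, or ok."""
    word = tokens[i]
    if word not in VOCAB:
        # The word does not exist in the vocabulary at all.
        return "non-word error"
    own_score = context_score(tokens, i, word)
    for alternative in CONFUSABLE.get(word, []):
        # The word exists, but a confusable word fits the context better.
        if context_score(tokens, i, alternative) > own_score:
            return "possible real-word error"
    return "ok"
```

In this sketch, "peice" in "a peice of cake" is flagged as a non-word error because it is missing from the vocabulary, while "peace" in "a peace of cake" is a valid word that is flagged only because "piece" fits the neighboring words better.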
Speller100 is built around the concept of language families, or larger groups of languages based on similarities that multiple languages share. It also leverages zero-shot learning, a technique that allows a model to accurately learn and correct spelling without any additional language-specific labeled training data.
To scale Speller100 to over 100 languages, Microsoft says it developed a spelling correction pretraining approach that relies on noise functions, which take text extracted from web pages and generate errors like deletion, addition, rotation, and replacement. This eliminated the need for a large dataset of misspelled searches, enabling Speller100 to reach 50% correction recall for top candidates in languages for which zero training data existed. Deployed as-is on Bing, where about 15% of searches are misspelled, it would have reduced the number of misspellings by 7.5% (half of that 15%).
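The idea of generating misspellings from clean text can be sketched in a few lines of Python. This is an illustrative assumption, not Microsoft's implementation: the function names, the Latin-only `ALPHABET`, and the reading of "rotation" as swapping two adjacent characters are all choices made here for the example.

```python
import random

# Assumption: a lowercase Latin alphabet; a real system would need
# per-language character sets.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def delete_char(word, rng):
    """Deletion: drop one character."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def add_char(word, rng):
    """Addition: insert one random character."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(ALPHABET) + word[i:]


def rotate_chars(word, rng):
    """Rotation, interpreted here as swapping two adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def replace_char(word, rng):
    """Replacement: overwrite one character with a random one."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


NOISE_FUNCTIONS = [delete_char, add_char, rotate_chars, replace_char]


def make_training_pair(word, rng):
    """Return a (noisy, clean) pair; the noisy side may occasionally equal the clean word."""
    noise = rng.choice(NOISE_FUNCTIONS)
    return noise(word, rng), word


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for clean in ["spelling", "language", "correction"]:
        print(make_training_pair(clean, rng))
```

Each pair couples a synthetically corrupted word with its clean source, which is the kind of data a correction model could be pretrained on without collecting real misspelled queries.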
To improve performance even further, Microsoft leveraged the orthographic, morphological, and semantic similarity between languages in the same group. The company built a dozen or so language family–based models to maximize the zero-shot benefit and keep each model compact enough for runtime, enabling Speller100 to power spelling correction for languages with relatively little training data, like Afrikaans and Luxembourgish.
Microsoft says that to date on Bing, Speller100 has reduced the number of pages with no results by up to 30% and the number of times users had to manually reformulate their query by 5%. It has also increased the share of times users clicked on Bing’s spelling suggestion from 8% to 67%.
Microsoft says it plans to implement Speller100 in more of its products going forward.