Mozilla today released the latest version of Common Voice, its open source collection of transcribed voice data for startups, researchers, and hobbyists to build voice-enabled apps, services, and devices. Common Voice now contains 7,226 total hours of contributed voice data in 54 different languages, up from 1,400 hours across 18 languages in February 2019.
Common Voice consists not only of voice snippets, but also of voluntarily contributed metadata useful for training speech engines, like speakers’ ages, sex, and accents. It’s designed to be integrated with DeepSpeech, a suite of open source speech-to-text and text-to-speech engines and trained models maintained by Mozilla’s Machine Learning Group.
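To give a sense of how that optional metadata might be used, here is a minimal Python sketch that tallies clips by self-reported age and gender. It assumes a tab-separated metadata table with columns named `path`, `sentence`, `age`, `gender`, and `accent`; those column names, the sample rows, and the helper functions `load_clips` and `count_by` are illustrative assumptions, not a confirmed description of the files Mozilla ships.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in an assumed tab-separated layout. The column
# names and values are made up for illustration. Because the metadata is
# contributed voluntarily, some fields are left empty.
SAMPLE_TSV = (
    "path\tsentence\tage\tgender\taccent\n"
    "clip_0001.mp3\tHello world\ttwenties\tfemale\tus\n"
    "clip_0002.mp3\tGood morning\t\t\t\n"
    "clip_0003.mp3\tOpen the door\tthirties\tmale\tengland\n"
    "clip_0004.mp3\tSee you soon\ttwenties\tmale\t\n"
)


def load_clips(tsv_text):
    """Parse tab-separated metadata into a list of dicts, one per clip."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def count_by(clips, field):
    """Count clips per value of `field`, skipping clips that left it blank."""
    return Counter(row[field] for row in clips if row[field])


clips = load_clips(SAMPLE_TSV)
age_counts = count_by(clips, "age")
gender_counts = count_by(clips, "gender")
accent_counts = count_by(clips, "accent")

print("clips:", len(clips))
print("by age:", dict(age_counts))
print("by gender:", dict(gender_counts))
print("by accent:", dict(accent_counts))
```

Skipping blank fields rather than counting them as a category reflects the voluntary nature of the metadata: a tally like this only describes the speakers who chose to share those details.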
Collecting the more than 5.5 million clips in Common Voice required a lot of legwork, mainly because the prompts on the Common Voice website had to be translated into each language. Still, 5,591 of the 7,226 hours (about 77%) have been confirmed valid by the project’s contributors so far. And according to Mozilla, five languages in Common Voice (English, German, French, Italian, and Spanish) now have over 5,000 unique speakers, while seven languages (English, German, French, Kabyle, Catalan, Spanish, and Kinyarwanda) have over 500 recorded hours.
Today also saw the release of Mozilla’s first-ever data set target segment, which aims to collect voice data for specific purposes and use cases. This segment includes the digits “zero” through “nine” as well as the words “yes,” “no,” “hey,” and “Firefox,” spoken by 11,000 people for 120 hours collectively across 18 languages. Previously, Common Voice product lead Megan Branson said it’d be used partly for “Hey Firefox” wakeword testing.
“This segment data will help Mozilla benchmark the accuracy of our open source voice recognition engine, DeepSpeech, in multiple languages for a similar task and will enable more detailed feedback on how to continue improving the dataset,” wrote Branson in a blog post. “With contributions from all over the globe, [our contributors] are helping us follow through on our goal to create a voice dataset that is publicly available to anyone and represents the world we live in.”
The Common Voice refresh follows a significant update to DeepSpeech that incorporated one of the fastest open source speech recognition models to date. The latest version added support for TensorFlow Lite, a distribution of Google’s TensorFlow machine learning framework that’s optimized for compute-constrained mobile and embedded devices. It also cut DeepSpeech’s memory consumption 22-fold and made startup more than 500 times faster.
Both Common Voice and DeepSpeech inform work on Mozilla projects like Firefox Voice, a browser extension that adds voice recognition support to Firefox. Currently, Firefox Voice can understand commands like “What is the weather” and “Find the Gmail tab,” but the goal is to facilitate “meaningful interactions” with websites using voice alone.