MLCommons Launches Two Giant Open-Source Speech Datasets
AI non-profit consortium MLCommons has released two large speech datasets as open-source resources to improve speech recognition and voice technology. The People’s Speech Dataset offers more than 30,000 hours of supervised conversational data, while the Multilingual Spoken Words Corpus (MSWC) contains more than 23.4 million examples of 340,000 keywords in 50 languages.
The People’s Speech Dataset provides a collection of organized speech data under a Creative Commons license, contributed by companies and researchers including Harvard University, Nvidia, Intel, and Baidu. MLCommons claims the dataset dramatically increases the amount of speech data openly available, reshaping the research landscape. The Multilingual Spoken Words Corpus draws from a similarly diverse mix of contributors to cover a broader range of languages and words than is usually publicly accessible. Both datasets aim to open speech tech research to developers and companies that cannot access the proprietary data owned by Amazon, Google, and other large companies.
MLCommons began as a working group in 2018 with the goal of organizing the most common languages into a single dataset useful to any speech researcher. The People’s Speech Dataset and the MSWC are the results, shared in whitepapers at this year’s Conference on Neural Information Processing Systems. The main difference between them is that the People’s Speech Dataset is built for speech recognition, while the MSWC focuses on keyword identification.
“Speech technology can empower billions of people across the planet, but there’s a real need for large, open, and diverse datasets to catalyze innovation,” MLCommons Association co-founder and executive director David Kanter said. “The People’s Speech is a large scale dataset in English, while MSWC offers a tremendous breadth of languages. I’m excited for these datasets to improve everyday experiences like voice-enabled consumer devices and speech recognition.”
Demand for open-source databases of cataloged and transcribed voice recordings has fueled projects similar to MLCommons’. The Mozilla Foundation’s Common Voice project launched a few years ago with the same goal of supporting voice tech developers who lack access to proprietary data. The Common Voice database boasts more than 9,000 hours of recordings across 60 languages and claims to be the world’s largest public-domain voice dataset. Nvidia made a $1.5 million investment in Mozilla Common Voice in April and began working with Mozilla on voice AI and speech recognition.
The project’s global focus led to a $3.4 million investment this spring to build such a resource for Kiswahili, also known as Swahili, a language spoken by roughly 100 million people yet currently served by no voice assistants. Common Voice is partnering with African companies and research groups to turn the database into voice tech for the region, especially for financial and agricultural services. That investment also marked Common Voice’s transition to operating entirely under the Mozilla Foundation umbrella.
Meta (formerly Facebook) released two giant conversational AI datasets of its own this summer, sharing them publicly to encourage research that advances artificial intelligence and improves how well virtual assistants understand and interact with users. One dataset should help train an AI with only a tenth of the usual amount of raw data, while the other should help streamline the development of multilingual voice assistants.