Amazon Unveils Speech Datasets for Alexa Skill Development in 51 Languages
Amazon has introduced MASSIVE, an aptly named 51-language dataset, to encourage multilingual development for Alexa and other voice AI. The open-source dataset arrives in tandem with the launch of Amazon’s new Massively Multilingual NLU 2022 (MMNLU-22) competition, which encourages global experiments in building voice apps for less commonly spoken languages.
The collection of voice AI data is built on the idea of a single model understanding input in a wide range of languages. The underlying massively multilingual natural-language understanding (MMNLU) concept is that a machine learning model trained on languages with abundant data can transfer that knowledge to languages with larger gaps in the training data, teaching the AI to operate in those tongues as well. Amazon brought in professional translators to build the non-English portions of the Multilingual Amazon SLURP (SLU resource package) from the same English utterances.
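The parallel structure described above, where the same utterance and intent appear across many locales, is what makes cross-lingual transfer experiments straightforward. A minimal sketch of working with records in this style is below; note that the exact field names (`locale`, `utt`, `intent`) and the sample translations are assumptions for illustration, not the dataset's documented schema.

```python
import json
from collections import Counter

# Hypothetical MASSIVE-style JSONL: one JSON object per line, pairing a
# localized utterance with the intent label of its English source.
# Field names and translations here are illustrative assumptions.
sample_jsonl = """\
{"id": "0", "locale": "en-US", "utt": "wake me up at nine am on friday", "intent": "alarm_set"}
{"id": "0", "locale": "de-DE", "utt": "weck mich am freitag um neun uhr", "intent": "alarm_set"}
{"id": "0", "locale": "sw-KE", "utt": "niamshe saa tatu asubuhi ijumaa", "intent": "alarm_set"}
"""

def intents_by_locale(jsonl_text):
    """Count intent labels per locale to check cross-lingual coverage."""
    counts = {}
    for line in jsonl_text.splitlines():
        record = json.loads(line)
        counts.setdefault(record["locale"], Counter())[record["intent"]] += 1
    return counts

counts = intents_by_locale(sample_jsonl)
print(counts["de-DE"]["alarm_set"])  # → 1: each locale carries the same label set
```

Because every locale shares one label inventory, a model can be trained on the data-rich locales and evaluated on the sparser ones without any label mapping.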
“We are very excited to share this large multilingual dataset with the worldwide language research community,” said Prem Natarajan, vice president of Alexa AI Natural Understanding. “We hope that this dataset will enable researchers across the world to drive new advances in multilingual language understanding that expand the availability and reach of conversational-AI technologies.”
The MMNLU-22 competition will be held in Abu Dhabi and online, combining academic presentations with company announcements, including a workshop cohosted with Amazon to showcase the award submissions. Efforts like these can boost awareness of and interest in voice AI and speech recognition well beyond research circles. That’s why companies have been racing to build, and occasionally share, such datasets. The AI non-profit consortium MLCommons, for instance, rolled out both The People’s Speech Dataset, with more than 30,000 hours of supervised conversational data, and the Multilingual Spoken Words Corpus (MSWC). Last summer, Meta released two giant conversational AI datasets of its own to encourage research and development on AI and virtual assistants: one should help train an AI with only a tenth of the usual amount of raw data, while the other should streamline the development of multilingual voice assistants.
The quest very much includes languages that most voice AIs cannot speak. The Mozilla Foundation’s Common Voice project launched a few years ago with the goal of supporting voice tech developers who lack access to proprietary data. The Common Voice database now boasts more than 9,000 hours of speech across 60 languages and claims to be the world’s largest public-domain voice dataset. In April, Nvidia made a $1.5 million investment in Mozilla Common Voice and began working with Mozilla on voice AI and speech recognition. The project’s global focus also led to a $3.4 million investment this spring to create such a resource for Kiswahili, known elsewhere as Swahili.