Facebook Builds Speech Recognition Engine Combining 51 Languages in One Model
Data scientists at Facebook have developed an automatic speech recognition (ASR) model capable of understanding 51 languages, according to a new research paper. The model, trained on more than 16,000 hours of voice recordings, is reportedly the largest of its kind ever built.
ASR engines usually understand only a single language, so a voice assistant needs multiple such models to communicate in more than one tongue. Facebook's design puts all of the languages into a single system using what the developers call a joint sequence-to-sequence model. Essentially, it uses the hours of voice data, collected from public, anonymized videos on Facebook, to parse not only what someone is saying but also which language they are speaking. The languages were grouped into subcategories so the system can identify the language being spoken and determine how to respond.
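One common way to realize this kind of joint model (a sketch under an illustrative assumption, not a description of Facebook's actual architecture) is to give the decoder special language tokens in its output vocabulary, so a single sequence-to-sequence model emits the language tag first and the transcript after it. The token set and helper below are hypothetical:

```python
# Illustrative sketch: a joint seq2seq decoder whose vocabulary includes
# language tokens, so one model identifies the language and transcribes
# speech in a single output sequence. Token names are made up for this
# example; the paper's exact design may differ.

LANGUAGE_TOKENS = {"<en>": "English", "<fr>": "French", "<hi>": "Hindi"}

def split_language_tag(decoded: str):
    """Separate the leading language token from the rest of the transcript."""
    tag, _, transcript = decoded.partition(" ")
    if tag not in LANGUAGE_TOKENS:
        raise ValueError(f"unknown language token: {tag}")
    return LANGUAGE_TOKENS[tag], transcript
```

With this convention, a decoder output such as `"<fr> bonjour tout le monde"` tells the system both which language was spoken and what was said, without running a separate language-identification model first.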
“A single model capable of recognizing multiple languages has been a long-term goal in the field of automatic speech recognition,” the paper’s authors wrote. “In general, multi- and cross-lingual speech processing has been an active area of research for decades.”
The model contains approximately a billion parameters, which Facebook says makes its speech recognition better than conventional single-language models; the paper cites a 28.8% performance improvement. Languages with fewer hours of recordings to work with actually saw larger percentage improvements in word error rate, because those low-resource languages are poorly served by standard monolingual designs.
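Word error rate, the metric behind that improvement figure, is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the reference length. A minimal implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words
    # and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" gives one substitution and one deletion against six reference words, so a WER of 2/6 ≈ 0.33. A lower WER is better, which is why a 28.8% improvement is reported as a reduction in this rate.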
“To the best of our knowledge, this work is the first one to study multilingual systems at a massive scale,” the authors explain in the paper. “We demonstrated that it is possible to train a massive single ASR architecture for 51 various languages, which we found in practice considerably less time-consuming to tune than 51 different monolingual baselines.”
Facebook’s interest in a single model that can understand and communicate in many languages is more than academic. The company has been investing heavily in conversational AI on several fronts. Most recently, it debuted a new open-source chatbot called Blender. Supposedly more advanced than any rival, including Google’s new Meena chatbot, Blender is designed to hold a conversation on any subject and show empathy with users. Facebook also wants to keep collecting voice data for training speech recognition engines, even paying a small fee to people who submit audio through its Viewpoints market research app.

Those projects, plus the new experiment, may lay the groundwork for a voice assistant in the rumored Facebook operating system. A multilingual setup will be necessary if Facebook wants to compete on the global stage. Alexa and Google Assistant already speak many languages, but their multilingual modes are limited. Alexa can identify and reply appropriately to speakers of English plus either Spanish, French, or Hindi, depending on the speaker’s location. Google Assistant, meanwhile, can be bilingual in English and any other language the assistant already speaks.