Facebook Shares 100-Language Translation Model, First Without English Reliance
Facebook this week unveiled and open-sourced a language translation model capable of translating between any two of 100 languages. The M2M-100 was built without relying on English as an intermediary, making it unique among machine learning models for language, according to Facebook. At the same time, the social media giant claims the M2M-100 can outperform the more standard, English-derived approach.
M2M-100 can claim to perform so well partly because of the sheer number and variety of language translations it trained on during development. Facebook used 2,200 pairs of languages to create the new model, a collection of 7.5 billion sentences encompassing most major languages and several that are not as widely spoken. Usually, translation systems are built around a separate model for each language, with English acting as a middle ground between them. That tends to make the translation less accurate, as anyone who has used an online translator to send a sentence through several languages and then back to its original can attest. Facebook instead went for a multilingual machine translation (MMT) model, one that processes the languages and translates directly.
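To make the difference between the two approaches concrete, here is a minimal Python sketch. It is an illustration only, not Facebook's code: the function names and the four-language list are hypothetical. It counts how many translation directions exist among a set of languages, how many of them an English-centric system covers directly, and which hops such a system would take for a request like Chinese to French.

```python
from itertools import permutations


def direct_directions(languages):
    """Every ordered (source, target) pair, i.e. every possible direction."""
    return list(permutations(languages, 2))


def english_centric_directions(languages, pivot="en"):
    """Only the directions that have the pivot language on one side."""
    return [(s, t) for s, t in permutations(languages, 2) if pivot in (s, t)]


def pivot_route(source, target, pivot="en"):
    """Hops a pivot-based system takes to serve one translation request."""
    if pivot in (source, target):
        return [(source, target)]
    # Two hops: any error made in the first is carried into the second.
    return [(source, pivot), (pivot, target)]


# Hypothetical toy language list, chosen only for illustration.
langs = ["en", "zh", "fr", "hi"]
print(len(direct_directions(langs)))           # 12
print(len(english_centric_directions(langs)))  # 6
print(pivot_route("zh", "fr"))                 # [('zh', 'en'), ('en', 'fr')]
```

By the same arithmetic, 100 languages give 100 × 99 = 9,900 possible directions, of which only 198 involve English.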
“When translating, say, Chinese to French, most English-centric multilingual models train on Chinese to English and English to French, because English training data is the most widely available,” Facebook AI research assistant Angela Fan explained in a blog post. “Our model directly trains on Chinese to French data to better preserve meaning. It outperforms English-centric systems by 10 points on the widely used BLEU metric for evaluating machine translations.”
Facebook is pitching the M2M-100 model as a useful translator in many contexts, especially for languages that aren’t as widely spoken. Making the model open source could enhance those translations even more, Facebook said. The social media platform performs 20 billion translations on an average day, two-thirds of which don’t involve English. To make sure languages spoken by fewer people still had accurate translations, Facebook divided the languages into 14 families and designated bridge languages from the best-known of those groups, with Hindi, Bengali, and Tamil, for example, serving as a connection to the Indo-Aryan languages.
The new model complements Facebook’s release in July of an automatic speech recognition (ASR) model capable of understanding 51 languages, built on more than 16,000 hours of voice recordings. The goal is to make it possible for voice assistants to grasp both what someone is saying and what language they are speaking. Combined with 100 languages for translation, Facebook could get a step ahead of voice AI rivals like Amazon and Google. The translations might also be a boon for another open-source project, the Blender chatbot, which is supposed to be able to hold a conversation on any subject and show empathy with users.
“For years, AI researchers have been working toward building a single universal model that can understand all languages across different tasks,” Fan wrote. “A single model that supports all languages, dialects, and modalities will help us better serve more people, keep translations up to date, and create new experiences for billions of people equally. This work brings us closer to this goal.”