Google’s Translatotron 2 Improves Linguistic Shifts Without the Deepfake Potential
Google researchers have created a new version of the Translatotron AI translation model that recreates a speaker’s voice in a different language. Translatotron 2 performs better as a translator and voice mimicker but deliberately cuts out the potential for synthesizing someone else’s voice as a convincing deepfake, which was raised as a concern after the 2019 release of the first Translatotron. The researchers published details of Translatotron 2 in a paper this month.
Translatotron and its successor are designed to listen to someone speaking in one language, translate what they are saying into a second tongue, then broadcast the translated speech as though the original speaker were now fluent in another language. The system encodes the source speech, picks out the right word sounds, known as phonemes, and synthesizes the decoded results into whatever language the user chooses. The result sounds more natural and friendlier than a pure text or artificially voiced translation. Translatotron 2 is a better translator than its predecessor, and it processes and recites speech faster and with fewer errors.
“Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pause,” the researchers wrote.
The update also resolves an issue in the first Translatotron, where people could exploit the technology to speak as themselves in one language and have the translation sound like an entirely different person, even using just samples played from standard TVs and radios. By skipping the speaker identification step used by the previous Translatotron, the new model ignores attempts to translate someone’s words into a different voice.
“The trained model is restricted to retain the source speaker’s voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker’s voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artifacts,” the researchers wrote. “The performance of voice conversion has progressed rapidly in the recent years, and is reaching a quality that is hard for automatic speaker verification systems to detect. Such progress poses concerns on related techniques being misused for creating spoofing artifacts, so we designed Translatotron 2 with the motivation of avoiding such potential misuse.”
Google has been keen to promote its translation services, regularly adding new features and expanding availability. Last year, Google Translate added a real-time transcription feature not long after incorporating an instant translation feature for Google Assistant on Android. It’s also a service that other communication platforms want to have, which prompted Zoom’s acquisition of Kites in June. Google’s need to address the possibility of deepfakes is more than just technical, though. The risk of voice spoofing is one of the biggest reasons people say they are hesitant to use voice as an identification tool.