Facebook’s New AI Model Can Distinguish Five Voices Speaking Simultaneously
Facebook researchers have created an AI model that can distinguish between five voices speaking at the same time into a single microphone, outperforming any existing system. The new method could improve audio technology that has to operate in noisy spaces, including hearing aids and voice assistants.
Voice in the Crowd
The researchers describe their model in a paper titled “Voice Separation with an Unknown Number of Multiple Speakers,” which they will present at the 2020 International Conference on Machine Learning. The system uses a new variant of recurrent neural networks, which simulate memory, to analyze the audio and determine how many people are speaking before an encoder network separates each voice into its own channel. Models were trained on recordings of between two and five simultaneous speakers, all captured with just one microphone.
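The paper does not spell out its training objective here, but supervised separation systems of this kind are typically scored with a scale-invariant signal-to-noise ratio (SI-SNR), evaluated over every possible speaker ordering because a separator returns its output channels in arbitrary order. The sketch below illustrates that standard idea in plain NumPy; the function names and the brute-force permutation search are this illustration's own choices, not details taken from Facebook's model.

```python
import itertools
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB between one estimated
    and one reference waveform (1-D arrays), a common separation metric."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference so loudness differences
    # don't affect the score; what's left over counts as error.
    target = (est @ ref) / (ref @ ref + eps) * ref
    noise = est - target
    return 10 * np.log10((target @ target) / (noise @ noise + eps) + eps)

def permutation_invariant_si_snr(estimates, references):
    """Separated channels come back in arbitrary order, so score every
    pairing of estimates to reference speakers and keep the best mean."""
    n = len(references)
    best = -np.inf
    for perm in itertools.permutations(range(n)):
        score = np.mean([si_snr(estimates[p], references[i])
                         for i, p in enumerate(perm)])
        best = max(best, score)
    return best
```

For example, feeding the function two clean reference tones with their channels deliberately swapped still yields a near-perfect score, because the permutation search finds the correct speaker assignment. The factorial search is fine for the two-to-five speakers discussed in the article, though larger speaker counts would call for an assignment-problem solver instead.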
“The ability to separate a single voice from the multiple conversations occurring concurrently forms a challenging perceptual task,” the researchers explain in the paper. “The ability of humans to do so has inspired many computational attempts, with much of the earlier work focusing on multiple microphones and unsupervised learning, e.g., the independent component analysis approach. In this work, we focus on the problem of supervised voice separation from a single microphone, which has seen a great leap in performance following the advent of deep neural networks.”
Facebook’s researchers point to a couple of areas where their model could enhance existing audio technology, such as hearing aids. While hearing aids today have advanced beyond simply making sound louder, people using them can still struggle to hear the person they are speaking with in noisy surroundings. Isolating individual voices and removing extraneous sound would be ideal for someone wearing a hearing aid at a party or in the wind. The same technique could also provide a foundation for major upgrades to voice assistants. Once the AI can analyze each speaker’s voice on its own, it could detect its wake words and interpret the speaker’s request with far greater accuracy than current models.
Right now, extra noise, or even just two voices speaking at once, can confuse a voice assistant on a smart speaker, which is why several companies are pursuing similar goals. For instance, Google has spent a long time developing a “de-noiser” that filters out irrelevant noises during Google Meet calls. Building software for noisy and complex audio environments has earned startups like AudioTelligence millions of dollars from venture capitalists for software that can pick out a human speaker amid noise. The TalkTo noise filtering software created by DSP Concepts, meanwhile, was recently qualified by Amazon for Alexa built-in devices. Facebook’s researchers are now working on applying the new model to real-world situations, presumably for eventual commercial use by Facebook, perhaps by integrating it into the voice assistant the company is currently developing.