Meta Unveils Generative AI Synthetic Speech Tool Voicebox Along With Deepfake Audio Detector

Meta has introduced a generative AI-powered text-to-speech (TTS) tool called Voicebox that the tech giant claims can produce a synthetic voice 20 times faster than the current state of the art and with only two seconds of recording. Deepfake voices produced with Voicebox are so good, according to Meta, that the company won’t release all of the code and has even come up with a method for detecting AI-generated audio.

Voicebox AI

TTS models usually require curated, relatively small, labeled data sets for training, as audio quality can degrade as the data set grows. Voicebox overcomes that limit by using what Meta described as an architecture that can handle the “in-filling” of audio information. The large, unlabeled databases used to hone Voicebox’s speech synthesis capabilities are similar to those used by ChatGPT and other large language models. That training gives Voicebox the ability to mimic a speaker’s voice to read a text, including in multiple languages, and even to replace audio where there’s too much noise with a synthetic version.

Voicebox’s ability to deepfake individuals is good enough that Meta said it is concerned about the “potential risks of misuse.” As a result, Meta also created a deepfake audio detector for its own creation, a way to identify when Voicebox’s synthetic speech is used. The approach is not too dissimilar from the audio watermark developed by Resemble AI to identify recordings as real and not synthetic without lowering their quality. The same worries have led Meta to hold back Voicebox’s model and the code behind it, restricting public access to some samples and the accompanying research paper, a decision Meta says helps “strike the right balance between openness with responsibility.” Presumably, Meta has upped its internal security since its LLaMA generative AI model leaked.

“As the first versatile, efficient model that successfully performs task generalization, we believe Voicebox could usher in a new era of generative AI for speech. As with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm,” Meta wrote in its announcement. “In our paper, we detail how we built a highly effective classifier that can distinguish between authentic speech and audio generated with Voicebox to mitigate these possible future risks.”

Generative Synthetic Media

Synthetic speech has a lot of uses, and generative AI has only expanded those possibilities. Startups like ElevenLabs and Play.ht are continually coming up with better and cheaper synthetic voice platforms. The hotbed of experimentation has demonstrated synthetic speech’s capacity to produce AI girlfriends, synthetic songs and singers, fake podcast episodes, parody commercials, and more. Big brands aren’t ignoring the potential, either. Spotify created an AI DJ with a synthetic voice and wants to use synthetic voices to enhance its podcast ads, while streaming service Deezer is working on how to spot and remove AI-generated songs. Meta has a lot of ideas for how to use Voicebox and is working on embedding the technology in future products.

“In the future, multipurpose generative AI models like Voicebox could give natural-sounding voices to virtual assistants and non-player-characters in the metaverse. They could allow visually impaired people to hear written messages from friends read by AI in their voices, give creators new tools to easily create and edit audio tracks for videos, and much more,” Meta wrote. “Voicebox is an important step forward in our generative AI research, and we look forward to continuing our exploration in the audio space and seeing how other researchers build on our work.”

