Synthetic Media Startup Play.ht’s New Model Deepfakes Voice Clones From Just Seconds of Audio
Synthetic voice developer Play.ht has introduced a new voice cloning model called Parrot, capable of making a deepfake voice from a seconds-long recording of a person’s speech. Parrot is available as a beta for Play.ht’s text-to-speech synthesis platform, aimed at content creators looking for voice clone solutions.
Play.ht garnered a lot of attention when it published an imaginary conversation between late Apple co-founder Steve Jobs and Joe Rogan using its AI-generated voice clones and a script composed by OpenAI’s GPT-3 large language model. That kicked off its now-defunct Podcast.ai show, which demonstrated the synthetic speech services offered by Play.ht, including episodes where Zach Galifianakis interviewed Quentin Tarantino and Oprah shared stress-relief tips.
Those voices rely on Play.ht’s Peregrine model, which has now been surpassed by the new Parrot model, continuing the bird theme for Play.ht. Parrot was trained on a larger data set and used what the developers learned from Peregrine to update how the training was handled. The synthetic voices are then used to render text as audio. The company claims Parrot handles a wide range of accents well, though it can only speak English. That said, Parrot can use the voice clone models of non-English speakers so that they appear to speak English, even keeping their original accent intact. Play.ht emphasizes that its models are more than just voice clones reading out a text. The company boasts that the AI understands the emotion that should be present in a voice based on the context of the whole text and adjusts the speech accordingly.
The zero-shot approach of Parrot only requires a short recording, but Play.ht also has a high-fidelity voice clone method that uses about 20 minutes of audio for more comprehensive and nuanced cloning. The audio can be created on Play.ht’s website or generated through an API integrated into a customer’s product.
“Content creators of all kinds (gaming, media production, elearning) spend a lot of time and effort recording and editing high-quality audio. We solve that and make it as simple as writing and editing text. Our users range from individual creators looking to voice their videos, podcasts, etc., to teams at various companies creating dynamic audio content,” Play.ht’s founders explained in a post on Y Combinator. “There are many robotic TTS services out there, but ours allows people to generate truly human-level expressive speech and allows anyone to clone voices instantly with strong resemblance. We initially used existing TTS models and APIs but when we started talking to our customers in gaming, media production, and others, people didn’t like the monotone robotic TTS style. So we doubled down in training a new model based on the new emerging architectures using transformers and self-supervised learning.”