Synthetic Speech Startup ElevenLabs Raises $2M for AI Voices With Context-Relevant Emotion
Speech AI startup ElevenLabs has released a beta version of its text-to-speech platform for English and Polish and raised $2 million in a pre-seed funding round led by Credo Ventures. The platform is designed to turn long-form text into audio using both clones of existing voices and entirely synthetic models of human speech. The AI mimics human emotion when reading text, using context clues to determine the mood and adjusting tone and inflection to match.
ElevenLabs developed proprietary deep learning models to generate its synthetic speech. The startup's voices employ natural language understanding to grasp the context of the text. The AI might spot adjectives describing someone's speech as cheerful or sad, or note a setting such as a wedding or a traffic jam, and adjust the delivery accordingly. It can even understand humor and sarcasm well enough to laugh when something is sarcastically funny (or at least written to imply it should be). ElevenLabs' platform lets users pick from the startup's stable of artificial voices or quickly generate a clone of a human voice.
“What we do differently is we take text and the context of what you write to generate the tonality of voices. It understands the text and can know how to speak the [emotions] correctly,” ElevenLabs co-founder and CEO Mati Staniszewski told Voicebot in an interview. “It works exceptionally well on longer-form texts because it can preserve that context. No others are taking that kind of context into consideration. We also stand out in how we approach how to replicate or clone a voice. We developed a cloning module that doesn’t require training, only a few seconds of recording, though ideally a full minute.”
That’s a speed comparable to Microsoft’s new VALL-E tool for voice cloning. You can see an example of ElevenLabs’ voice cloning below. ElevenLabs recreated the voice of Steve Jobs and used it to read a short text about the company generated with OpenAI’s ChatGPT generative AI chatbot.
ElevenLabs is building a system for users to design a new artificial voice with AI and has begun expanding its existing voice stable with the voices of actors, who will get a cut of the proceeds when a user employs their voice clone. There are English-speaking voice models and Polish-speaking voice models, though a voice can't yet cross over between the two languages. Staniszewski and co-founder Piotr Dabkowski are both native Polish speakers, but that didn't necessarily simplify the process of designing text-to-speech tools in Polish.
“[Text-to-speech models] require a huge amount of data. For English, that’s easy; for Polish, that’s harder both for text and speech,” Staniszewski said. “On the positive side, if you fix the Polish data part, our model is so good the only comparison is professional [actors].”
The beta version of ElevenLabs doesn't allow users to edit the audio when the AI misses an emotional cue, but that hasn't been an issue yet. The company already has around 500 users, with another 5,000 or so on the waiting list. Staniszewski said almost all of them could be classified as content creators: YouTube video producers, newsletter writers, independent book authors, and news agencies. He pointed out that popular audiobook platforms like Audible and Spotify don't allow AI narrators at the moment. Still, that hasn't stopped a popular author from employing ElevenLabs to narrate one of his books and submitting it to those platforms, where it has so far awaited publication without rejection.