OpenAI Releases Open-Source ‘Whisper’ Transcription and Translation AI
OpenAI has introduced a new automatic speech recognition (ASR) system called Whisper as an open-source software kit on GitHub. Whisper’s AI can transcribe speech in multiple languages and translate it into English, and the GPT-3 developer claims Whisper’s training makes it better at distinguishing voices in loud environments and at parsing heavy accents and technical language.
OpenAI trained Whisper’s ASR model on 680,000 hours of “multilingual and multitask” data pulled from the web. The idea is that this broad approach to data collection exposes Whisper to a wider range of accents, environmental noise, and subject matter, improving its ability to understand speech. The AI can understand and transcribe many languages and translate any of them into English. You can see an example in the Korean song translated and transcribed below.
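For readers who want to try this themselves, the open-source package exposes a small Python API. The sketch below shows the two tasks described above, same-language transcription and translation into English; the model size, audio filename, and helper function are illustrative assumptions, not part of Whisper itself.

```python
# Hedged sketch of calling the open-source Whisper package
# (pip install -U openai-whisper; also requires ffmpeg on the system).

def whisper_options(translate_to_english: bool) -> dict:
    """Build the task argument Whisper's transcribe() accepts.

    task="transcribe" keeps the source language;
    task="translate" emits English text instead.
    """
    return {"task": "translate" if translate_to_english else "transcribe"}

def run(audio_path: str, translate: bool = False) -> str:
    import whisper  # imported here so the helper above works without the package

    model = whisper.load_model("base")  # downloads weights on first use
    result = model.transcribe(audio_path, **whisper_options(translate))
    return result["text"]

# Example (the filename is hypothetical):
# print(run("korean_song.mp3", translate=True))
```

Passing `task="translate"` is how the Korean-song example above would be reproduced: one call transcribes the original lyrics, a second call with translation enabled returns English.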
— Andrew Mayne (@AndrewMayne) September 21, 2022
While impressive, OpenAI’s research paper suggests that the ASR is really only that successful in about 10 languages, a limitation likely stemming from the fact that two-thirds of the training data is in English. And while OpenAI admits Whisper’s accuracy doesn’t always measure up to other models, the “robust” nature of its training puts it ahead in other ways. That same robustness lets Whisper discern and transcribe speech through background noise and accent variations, but it also creates new problems.
“Our studies show that, over many existing ASR systems, the models exhibit improved robustness to accents, background noise, technical language, as well as zero shot translation from multiple languages into English; and that accuracy on speech recognition and translation is near the state-of-the-art level,” OpenAI’s researchers explained on GitHub. “However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.”
OpenAI is often in the news for GPT-3 and related products like text-to-image generator DALL-E, and Whisper offers a glimpse at how the company’s AI research extends into other arenas. Whisper itself is open-source, but the commercial value of neural network speech recognition for consumers and businesses is well established at this point, and Whisper could be a starting point for OpenAI to join in, as its researchers have already speculated. The release of Whisper just as OpenAI’s GPT-3 playground added a microphone for speech-to-text interactions is probably not a coincidence.
“We anticipate that Whisper models’ transcription capabilities may be used for improving accessibility tools. While Whisper models cannot be used for real-time transcription out of the box – their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation. The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications.”
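The “near-real-time” idea in that quote usually means buffering incoming audio into fixed-length windows and transcribing each window as it fills, rather than waiting for the whole recording. The sketch below shows only the buffering step; the chunk length matches the 30-second windows Whisper processes internally, but the helper name and the streaming setup are assumptions, not part of Whisper’s API.

```python
# Hedged sketch: split a live audio stream into windows that can each be
# handed to Whisper's transcribe() as soon as they fill.

CHUNK_SECONDS = 30  # Whisper processes audio in 30-second windows internally

def chunk_samples(samples: list, sample_rate: int,
                  chunk_seconds: int = CHUNK_SECONDS) -> list:
    """Split a stream of PCM samples into transcription-sized windows.

    The final chunk may be shorter; Whisper pads short audio itself.
    """
    size = sample_rate * chunk_seconds
    return [samples[i:i + size] for i in range(0, len(samples), size)]

# In a real application each chunk would be written to a temporary file or
# array and passed to model.transcribe() while the next chunk is recorded,
# which is the "near-real-time" pattern the researchers describe.
```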