OpenAI Turns ChatGPT into a Voice Assistant That Can See and Understand Images and Speech
ChatGPT is no longer just a chatbot now that OpenAI has introduced visual and audio interaction features this week. The new multimodal options allow for verbal conversations and enable the AI to process images in addition to text. The features are limited to ChatGPT Plus and ChatGPT Enterprise subscribers for now but will likely roll out to free users and developers in the near future.
The most notable change to ChatGPT is its new ability to understand speech and respond in kind. OpenAI’s speech recognition system Whisper transcribes users’ spoken words, and a new text-to-speech model, which can mimic a human voice after hearing just seconds of sample audio, lets users hear ChatGPT’s ‘voice’ respond to their input. The conversation, as seen above, essentially turns ChatGPT into a voice assistant like Alexa or Google Assistant, albeit one with the benefits and limits of the generative AI chatbot. ChatGPT can converse using any of five available voices, each modeled on recordings of professional voice actors, like the one heard in the video.
“You can now use voice to engage in a back-and-forth conversation with your assistant. Speak with it on the go, request a bedtime story for your family, or settle a dinner table debate,” OpenAI wrote in a blog post. “The new voice technology — capable of crafting realistic synthetic voices from just a few seconds of real speech — opens doors to many creative and accessibility-focused applications. However, these capabilities also present new risks, such as the potential for malicious actors to impersonate public figures or commit fraud.”
As part of the synthetic voice feature launch, OpenAI announced that it is working with Spotify on a new podcast translation feature. Podcast hosts will be able to create their own synthetic voice models and have them perform a translated transcript of their show, so they can republish their podcast in other languages (Spanish, French, and German for now) while keeping their own voice. At launch, the option is limited to a handful of podcasters: Dax Shepard, Monica Padman, Lex Fridman, Bill Simmons, and Steven Bartlett. That goes well beyond what Simmons had previously hinted was coming for Spotify in terms of using a synthetic voice for podcast ads.
“By matching the creator’s own voice, Voice Translation gives listeners around the world the power to discover and be inspired by new podcasters in a more authentic way than ever before,” Spotify vice president of personalization Ziad Sultan said in a statement. “We believe that a thoughtful approach to AI can help build deeper connections between listeners and creators, a key component of Spotify’s mission to unlock the potential of human creativity.”
The other big expansion to ChatGPT’s senses gives it ‘eyes’ that can understand photos and other images well enough to describe them and analyze their content. Users can show the chatbot an image, and it will provide relevant responses based on the visual input. OpenAI sees it as a way to add features like image-based question-answering or visual storytelling to ChatGPT. This comes immediately after OpenAI revealed plans to incorporate DALL-E 3, a new version of its text-to-image engine, into ChatGPT.
OpenAI suggested the image tool could be used for things like asking for recipes based on a picture of a meal’s ingredient list, troubleshooting an appliance from a photo, or settling debates by presenting visual evidence to ChatGPT. Users can tap the photo button to take a picture or select one or more existing images. They can also use the built-in drawing tool in the ChatGPT mobile app to circle the parts of an image that the AI should focus on. As a safety measure, OpenAI said it has put technical limits in place to restrict ChatGPT’s ability to violate an individual’s privacy.
“Image understanding is powered by multimodal GPT-3.5 and GPT-4. These models apply their language reasoning skills to a wide range of images, such as photographs, screenshots, and documents containing both text and images,” OpenAI explained. “Vision-based models also present new challenges, ranging from hallucinations about people to relying on the model’s interpretation of images in high-stakes domains. Prior to broader deployment, we tested the model with red teamers for risk in domains such as extremism and scientific proficiency, and a diverse set of alpha testers. Our research enabled us to align on a few key details for responsible usage.”