Amazon Alexa Skills to Support Multiple Voices
Amazon announced this morning that it will soon support multiple voices within Alexa skills. Currently, developers have only two options. They can use the native Alexa synthetic voice within their skill to transform text into speech, or they can record the content as .mp3 files that play at the appropriate times in user interactions. That will soon change. Developers can now apply to participate in a preview that will offer eight new voice options from Amazon’s Polly synthetic speech service.
Polly currently lists eight US-English voices: five female and three male. When you include Alexa’s voice, that makes six female voices to three male, a two-to-one imbalance. Here are examples of two of the voices:
Joanna from Amazon Polly
Matthew from Amazon Polly
Differentiating 1P from 3P
There is a concept of first-party (1P) skills, which are native to Alexa. These include fetching information to answer questions, converting volumes of liquids, performing mathematical calculations, and reciting a weather forecast for the area. Then there are third-party (3P) skills that are built for the Alexa platform by independent developers. These include popular games, weather services, and branded skills like Jeopardy, Big Sky, and Capital One, respectively. Nearly all of the 33,000 Alexa skills in the U.S. employ Alexa’s voice. This creates a continuity of user experience, but not much differentiation. It also doesn’t let the user know whether they are in a 1P or 3P skill. The introduction of new voices for 3P skills can change the dynamic.
Google already does this. From the beginning, Google Assistant reserved its own voice, and later its additional voices, for 1P activities. Developers building Google Actions (i.e., the voice app equivalent of an Alexa skill) for Assistant still select from four different voices in the U.S. reserved for 3P Actions. Mark Webster of Sayspring refers to this as the “operator model.” The 1P Google Assistant voice is the operator that directs you to other voices that represent something that is not produced by Google. This intuitively makes it easier for users to differentiate between what is Google and what is third-party.
It is also worth noting that Google Assistant’s six new voice options announced at the I/O developer conference are for 1P use. These are not additional options for 3P use. That change is strictly about users customizing their 1P experience.
Will Amazon Adopt the Google Assistant Model for Alexa?
Amazon stresses in its announcement that the new voices offer developers an opportunity to enrich the user experience:
This new capability can help you enrich your skill’s experience, making it more engaging for customers. For example, you can give a different voice to each character in adventure stories and games.
The idea that you can enhance your Alexa skill by adding a different voice or multiple voices is perfectly logical. An obvious follow-up question is whether this creates a path for Amazon to adopt the “operator model” and eventually remove Alexa’s voice from 3P skills. This would create “brand protection” for Alexa as a persona, since Amazon would have full control over what is said in the Alexa voice. If there is content that Amazon would prefer not be attributed to Alexa’s voice because it could subtly shift perception of the Alexa persona, then reserved 3P voices would avoid issues that are certainly arising with increasing frequency.
In the future, even if Amazon doesn’t prohibit use of Alexa’s voice in 3P skills, it is logical to assume that content review might become a more important part of the certification process. The standard would likely be different for using Alexa’s voice versus a Polly voice, essentially positioning use of Alexa’s voice as a premium differentiator for a 3P skill.
Alexa to Allow Multiple Voices Within Skills
It is tempting to look at Amazon’s announcement and conclude that Amazon is looking to match Google Assistant’s option for a differentiated voice in 3P apps. However, there is a significant difference buried in the program description. Google Action developers must still choose a single voice for use within their Assistant app. It is different from the Google Assistant voice, but it is only one voice for the entire user experience. Amazon’s announcement indicates that multiple voices can be used within a single 3P Alexa skill. From a feature standpoint, this puts Alexa slightly ahead of Google Assistant, with more voice options for developers and the ability to use multiple voices in their skills.
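For developers wondering what multiple voices in one response might look like in practice, here is a minimal sketch. It assumes the Polly voices are selected with an SSML voice tag carrying a name attribute, which the announcement quoted above does not spell out, so check the preview documentation before relying on it. The script lines and the build_ssml helper are hypothetical, written only for illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical script: (voice name, line) pairs.
# None means "use the skill's default Alexa voice".
SCRIPT = [
    (None, "Welcome back to the adventure."),
    ("Joanna", "Who goes there?"),
    ("Matthew", "Only a traveler & friend."),
]


def build_ssml(lines):
    """Wrap each line in a voice tag (assumed SSML syntax) and join them."""
    parts = []
    for voice, text in lines:
        safe = escape(text)  # escape &, <, > so the markup stays valid
        if voice is None:
            parts.append(safe)
        else:
            parts.append(f'<voice name="{voice}">{safe}</voice>')
    return "<speak>" + " ".join(parts) + "</speak>"


ssml = build_ssml(SCRIPT)
print(ssml)
```

The resulting string would be returned as the SSML output speech of the skill response. Because each character's line is just text, changing the dialogue means editing the script, not re-recording audio.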
Why Use Synthetic Voices at All
A final point is worth noting for readers who have not built Alexa skills or Google Actions. Recorded audio is great for enhancing user experience, but it is time-consuming, can be costly, and is inflexible. Synthetic voices operating on text-to-speech (TTS) are the opposite. They are quick, free, and flexible. Flexibility may be the most important point. There are some skills where recorded audio using a voice actor is worth the time and cost because it provides a much richer experience. However, if there is any variability in the user interaction or if you need to update the content, you must re-record the voiceover. With TTS, changes and curation are greatly simplified because they can be made by simply typing in new words that the synthetic voice will read. Changes are essentially instantaneous. As a result, you should expect to see more use of synthetic voices over time.
Amazon is pushing developers to make their Alexa skills more engaging and interesting for users. The addition of new voices shows that Amazon is committed to providing additional tools to help developers in those efforts.