Guest Post: What the Infinity Gauntlet Teaches Us About Voice-First Experiences

In the Marvel Studios hit “Avengers: Infinity War,” Thanos seeks to collect all six Infinity Stones, place them in a gauntlet, and complete his mission of rebalancing the universe. The Avengers must stop him on his quest for the Infinity Stones, which would give the bearer increased power and control over space, time, reality, thoughts, and souls. These stones can also provide a framework for discussing voice-first technology and how multiple facets must come together to unleash its full potential for businesses, consumers, and all users.

Each of the Infinity Stones is independently powerful, but it’s not until all the stones are together that their full abilities are realized. The same can be said of their counterparts in the voice-first world. Here are the stones with their powers described:

Convenience (Power)

The power of voice is that it is our natural form of communication and the quickest way to complete some tasks. Pair this with ubiquity and you can get something done in the moment that it occurs to you. While cooking breakfast, you use the last egg and add eggs to your shopping list. You remember to schedule your kid’s dentist appointment as you’re driving, and you do it without looking up phone numbers or waiting on hold. At VoiceXP, we call that Think. Say. Done. The Speed of Voice.™

Voice User Interface (VUI) design has a huge impact on our interactions with a voice-first device and can be convenient or clunky. Quality design makes our conversations with these devices flow and feel like conversations we would have with anyone. The goal is for these computers to finally conform to our language instead of us to theirs.

Discoverability (Space)

Whether you are talking about skills for Alexa or Cortana or actions for Google, one of the biggest challenges is how users can find these custom voice apps. With an app-store model, users need to know the name of your voice app, so you post a link to the app store page on social media or share a specific phrase that enables the app by voice. This is called explicit invocation. Other examples of this type of discoverability are promotion in voice app stores, on company websites, or on in-store displays.

Amazon is addressing this problem with what it calls arbitration, in which a question to Alexa can result in locating, auto-enabling, and starting a relevant skill. Other times, she recommends skills to try. Tell Alexa “order a pizza” and she will ask if you want to try Amazon Restaurants, Domino’s, or Pizza Hut.

Google calls this functionality implicit invocation and gives developers best practices to follow that could help an Action be selected by the recommendation algorithm. But there are no guarantees.

What is guaranteed is that discoverability will continue to be an important stone in the voice-first gauntlet.

Monetization (Reality)

In reality, voice apps need to be financially viable. Voice app makers need to be compensated for the hours spent in design and development, either by the client purchasing the voice app or by the consumers ultimately using it. Today that can happen through in-app purchases or monthly subscriptions. It may expand to include pay-to-enable and in-app advertising.

We can already donate to charities with our voice and (on Alexa) donors can do that within the voice app experience. Safeguards will need to be in place so that users can trust that the app belongs to their chosen organization.

Voice commerce is projected to grow to $40B by 2022, and according to a recent PwC report, 36% of U.S. survey participants prefer shopping via voice over visiting a physical store. That is due in part to strong voice-commerce platforms from Amazon and Google. Consumers can now purchase products or services within a voice app without the additional requirement of account linking.

Not all apps need to directly make money as they may be designed to increase customer satisfaction, differentiate a company from its competitors, or raise awareness. The point is, voice apps need to provide provable value. That means tracing usage to value.

Personalization (Soul)

Ask your voice assistant “who inspires you?” or “what’s your favorite food?” and you will get personality, not truth. After a short time using Alexa, you might find yourself using the words “she” and “her” to describe her. Amazon designed Alexa to exhibit personality traits such as smart, approachable, humble, enthusiastic, helpful, and friendly. You are much more likely to have a friend named Alexa than Cortana, Siri, or Google. The personalization of these devices can help form a bond between human and computer. These relationships make using the device more fun but can also help those who are lonely or have a disability. Whether it’s a verbal interaction in a quiet room, setting a timer to help an autistic child transition to another activity, helping the blind be more connected, or practice to improve pronunciation, voice devices can be used in ways to make our lives better.

Voice designers can personalize the experience by picking from a selection of text-to-speech voices, opting for voice actors, including audio, crafting phrases, and refining the conversation flow. Developers can choose data values to save within and across app sessions and can remove hints or alter the conversation based on the number of times the user has interacted with the app or feature.
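As a rough illustration of that last point, here is a minimal Python sketch of a greeting that drops its usage hint once a user has visited a few times. Everything in it is an assumption made for illustration: the `UserState` record, the in-memory `_store` standing in for a real database, the `HINT_VISITS` threshold, and the app name "Daily Calm" are all hypothetical and are not part of any Alexa or Google SDK.

```python
from dataclasses import dataclass


@dataclass
class UserState:
    """Hypothetical per-user record that survives across app sessions."""
    visits: int = 0


# Stand-in for a real database, keyed by a hypothetical user ID.
_store = {}

# Assumed threshold: include the hint on the first three visits only.
HINT_VISITS = 3


def welcome(user_id):
    """Return a greeting that becomes shorter as the user gains experience."""
    state = _store.setdefault(user_id, UserState())
    state.visits += 1

    # First-time users get an introduction; returning users get a short hello.
    if state.visits == 1:
        greeting = "Welcome to Daily Calm."
    else:
        greeting = "Welcome back."

    # Drop the hint once the user has had a few chances to hear it.
    if state.visits <= HINT_VISITS:
        greeting += " You can say 'start a session' or 'help'."
    return greeting
```

Calling `welcome("user-1")` four times in a row yields the full introduction once, the shorter greeting with the hint twice, and then just "Welcome back." A real voice app would persist the counter in whatever storage its platform provides rather than in a module-level dictionary.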

The more we can make these devices personal to us, the more they will become ingrained in our daily lives. For example, Personal Skill Blueprints allow non-developers to customize Alexa to answer questions, play games, or tell family stories. With over 20 blueprints available today, families can customize the Alexa devices in their home in minutes using fill-in-the-blank templates.

Personalizing the experience for children is the focus of the Echo Dot Kids Edition and FreeTime Unlimited bundle. Parents get the ability to set daily limits, review activity, filter explicit songs on Amazon Music, and pause Alexa on the device. Kids get access to hundreds of hours of age-appropriate content including 300 Audible books, ad-free radio, and premium skills. Google has announced its own kid-friendly apps and will likely release a competing offering.

Smart home is all about controlling your personal domain by adding lights, sensors, cameras, and more. By integrating your smart home with a voice assistant, you have the power of convenience paired with personalization.

Retention (Time)

Once your voice app has been discovered, will it be engaging enough to use more than once? That is what retention is all about. With good design, the conversation will not annoy and will have appropriate personalization and convenience. But even then, not all voice apps will be used repeatedly. It’s all about picking the scenarios that are right for voice.

Let’s say you created an Alexa Skill to tell you if the groundhog saw its shadow on February 2nd. Best case is that a user would invoke the skill at least once and then remember it again a year later. Worst case is that the user wouldn’t see enough value to even enable it.

Voice apps like mortgage calculators would be useful only when you are looking for a new home or refinancing, but a user would likely stick with your app over another if you saved their data. A daily meditation app has a much higher likelihood of becoming part of your daily routine.

Notifications are a feature that lets a voice app deliver content when a trigger occurs, which gives users a reason to return. The trigger could be a breaking news story, a weather warning, or an alert associated with a business KPI.

Context & Memory (Mind)

It is interesting that the mind stone is larger and has prominent placement on the gauntlet. In the voice-first realm, this stone represents context and memory. Without the ability to remember things you have already told your voice assistant, each time you use it is like the first time.

Google Actions allow both input and output contexts that have a lifespan of a few requests or a few minutes. For example, you can say “tell me about Phoenix” and follow up with “what’s the weather there?” The second utterance is handled with the knowledge that the context is Phoenix, so the assistant gives the expected answer.
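To make the idea of a short-lived context concrete, here is a toy Python sketch of the Phoenix exchange. It is not Google's API: the `ContextStore` class, its turn-based `lifespan` parameter, and the `handle` function are hypothetical names invented for this example, and the utterance matching is deliberately simplistic.

```python
class ContextStore:
    """Toy conversation context whose entries expire after a few turns."""

    def __init__(self):
        self._contexts = {}  # name -> [value, turns_remaining]

    def set(self, name, value, lifespan=2):
        # The lifespan counts turns, including the turn that sets the context.
        self._contexts[name] = [value, lifespan]

    def get(self, name):
        entry = self._contexts.get(name)
        return entry[0] if entry else None

    def end_turn(self):
        """Age every context by one turn and drop the expired ones."""
        for name in list(self._contexts):
            self._contexts[name][1] -= 1
            if self._contexts[name][1] <= 0:
                del self._contexts[name]


def handle(utterance, ctx):
    """Hypothetical handler covering only the two utterances in the text."""
    text = utterance.lower()
    prefix = "tell me about "
    if text.startswith(prefix):
        city = utterance[len(prefix):].strip()
        ctx.set("city", city, lifespan=2)
        reply = f"Here is some information about {city}."
    elif "weather there" in text:
        city = ctx.get("city")
        if city:
            reply = f"Looking up the weather in {city}."
        else:
            reply = "Which city do you mean?"
    else:
        reply = "Sorry, I didn't catch that."
    ctx.end_turn()
    return reply
```

With a fresh `ContextStore`, asking "tell me about Phoenix" and then "what's the weather there?" resolves "there" to Phoenix. Asking about the weather a second time falls back to "Which city do you mean?" because the context has expired by then, which is the trade-off a lifespan introduces.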

At the recent World Wide Web Conference, Amazon’s Ruhi Sarikaya talked about context carryover and memory capabilities that will soon roll out in the U.S. In the past, Alexa’s context supported two-turn interactions with explicit pronouns, but it is now expanding to support context across domains without requiring pronouns. The memory feature allows you to ask Alexa to remember things so that you can recall them later.

Voice-app teams need to determine what information to save for later that will be useful to the user or improve the overall experience. When an app remembers answers, users are spared having to provide that information again. Transparency about what information is stored, along with a way for the user to clear that saved context, is vital.
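One way to picture those requirements is a small memory object that can both report and erase what it holds. The Python sketch below is only an assumption about how a team might structure this; the `UserMemory` class and its method names are hypothetical, and the stored values are made-up examples.

```python
class UserMemory:
    """Hypothetical store of remembered answers for a single user."""

    def __init__(self):
        self._answers = {}

    def remember(self, key, value):
        """Save an answer so the user does not have to repeat it."""
        self._answers[key] = value

    def recall(self, key):
        """Return a saved answer, or None if nothing was saved."""
        return self._answers.get(key)

    def describe(self):
        """Summarize everything stored, so the user can hear what is kept."""
        if not self._answers:
            return "I have not saved anything about you."
        items = ", ".join(
            f"{key}: {value}" for key, value in sorted(self._answers.items())
        )
        return f"Here is what I have saved: {items}."

    def clear(self):
        """Forget everything at the user's request."""
        self._answers.clear()
```

A mortgage calculator, for instance, might call `remember("loan amount", 250000)` after the first session, answer "what do you know about me?" with `describe()`, and map "forget my information" to `clear()`.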

There is other context that custom voice apps should be able to access. This can include the time zone of the device, the user’s address, a secure payment system, or the voice profile of the person speaking. Over time this context could include more about us such as our preferences and interests.


Each stone has its own special capabilities, but when combined, they are unchallenged in their power. Voice teams will be able to create the very best experiences when wielding the stones of convenience, discoverability, monetization, personalization, retention, and context & memory. It won’t be as easy as a snap of the fingers, but nothing worthwhile ever is.

About Mark Tucker

Mark Tucker is a Partner & Principal Architect at VoiceXP, where the team empowers business through quality Voice Experiences™. He is an Alexa Champion and the organizer of the Phoenix Alexa Meetup and the Phoenix Chapter of the Ubiquitous Voice Society. You can find him on Twitter @marktucker or LinkedIn.