
Understanding the Secret Sauce of Conversational Voice Experience

Editor’s Note: This is a guest post by Jason Fields, Chief Strategy Officer of Voicify, a conversation software management platform.

Humans are built to communicate and converse with each other. But the method we use to converse can change everything about an interaction.

In-person conversation is always the most powerful kind. We get more done, faster, when we can speak with someone at a company face-to-face rather than over the phone. B2B conversations tend to be less error-prone when meetings happen in person. But conversations can still happen over email, social channels, video chats, carrier pigeon, or even skywriting; where meaning can be conveyed, a conversation can be had.

The reason in-person conversations are so powerful is that humans use what academics call “modalities.”  We use our hands, tone, facial expression, objects, and language to communicate our message all at once; each is a single modality. The use of more than one modality, multimodality, is second nature to us.

When voice assistants arrived on the scene, the primary device was a speaker. Voice is an aural mechanism, so a speaker fits nicely. Then came native apps on phones, smart displays, embedded televisions, appliances, and cars. Our sense of hearing was augmented by vision and touch. These physical senses offer different modalities of communication, and the opportunity and responsibility for brand and experience managers became more exciting and more daunting all at once.

We’ve all heard the phrase: right content, right person, right time. It has been a marketing and customer experience axiom for years. Voice assistants bring this axiom to life because the right person (the person asking the question) is asking at the right time (now) about the right content (what is important to them). Multimodality gives us a lot of options to sift through in deciding how to provide them with the right content.

At Voicify, our customers have libraries of content spread across many digital systems. While we integrate with nearly all of those systems to respect the system of record, we are often asked: how do we make sense of this? How can we leverage what we have, or determine what we need, to support this vibrant experiential channel?

What we have found through our research and through conversations with academic and customer experience professionals is that communicating with humans requires attention to emotions as well as intellect. Even more important is the intention behind the communication. So we have taken a collection of modalities and mapped them to the common intentions of a conversation. The grid below visualizes that matrix.

The concept is straightforward once you recognize the categories of modality and the general intentions of a conversation. Can this matrix be expanded? Yes. Can it be customized by a brand? Of course. It is a framework brands can leverage to make management and decision-making more efficient.

Take the use case of our fictional traveler, Liza, who is getting ready for a flight. Over the hours leading up to the flight, Liza will want to know its status. The devices she uses are one of the indicators of where she is in that conversation with the airline. Each device allows for a different set of modality combinations, and those combinations are indicators of intent that the brand can use to match the modality best suited to respond to Liza.
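To make the framework concrete, here is a minimal sketch of how such a matrix might be represented in code. The intent and modality categories are illustrative placeholders, not Voicify's actual taxonomy or the exact rows and columns of the grid above.

```typescript
// Illustrative categories only; a real matrix would use a brand's own taxonomy.
type Intent = "statusCheck" | "navigation" | "purchase" | "support";
type Modality = "speech" | "text" | "image" | "map" | "icon";

// The matrix maps each conversational intent to the modalities best suited to it.
const modalityMatrix: Record<Intent, Modality[]> = {
  statusCheck: ["speech", "icon", "text"],
  navigation: ["map", "speech"],
  purchase: ["image", "text", "speech"],
  support: ["speech", "text"],
};

// Look up which modalities a response should lean on for a given intent.
function modalitiesFor(intent: Intent): Modality[] {
  return modalityMatrix[intent];
}
```

Expanding or customizing the matrix means adding rows and columns; the lookup that sits on top of it does not change, which is what makes it a workable management tool.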

It’s important to note that, just as on the web the browser and device information arrives as part of the request to the application, so it does with voice. For most brand managers, custom code is mind-numbing. With Voicify, we optimize for each device based on either (a) what you have stored in our system or referenced to Voicify from other systems, or (b) the custom layout you choose to assign to the device, also within Voicify.
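As a rough illustration of what optimizing per device can look like, the sketch below (building on the types from the previous example) filters an intent's preferred modalities down to what the requesting device can handle. The request shape and capability flags are assumptions made for this example, not any assistant platform's or Voicify's actual payload.

```typescript
// Assumed request shape for illustration; real voice platforms report device
// capabilities in their own payload formats.
interface VoiceRequest {
  intent: Intent;
  device: {
    hasScreen: boolean;
    screenSize: "small" | "medium" | "large" | "none";
    isHandsFree: boolean; // e.g. a car or a far-field speaker
  };
}

// Narrow the intent's preferred modalities to what the device can actually render.
function selectModalities(req: VoiceRequest): Modality[] {
  return modalitiesFor(req.intent).filter((m) => {
    if (m === "speech") return true;                // every device can speak
    if (req.device.isHandsFree) return false;       // audio only when attention is elsewhere
    if (!req.device.hasScreen) return false;        // no screen, no visuals
    if (m === "map" || m === "image") return req.device.screenSize !== "small";
    return true;                                    // text and icons fit any screen
  });
}
```

A platform hides this branching behind stored content and layout assignments, but the decision being made is the same one the matrix describes.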

When Liza is engaging through her voice-assistant-enabled TV, the answer can leverage more real estate and more information. People's TVs are typically in their homes, in the living room or bedroom, so the modalities used can assume Liza has time to process and react. Maps, instructions, the flight path, images of the terminal: these are all viable modalities to use.

In a car, we know Liza is on the move. We assume she is driving and needs to focus on that task, so we minimize the modalities to lessen the burden on her to process information and use a simple readout of the data. If Liza is the passenger in the car, the experience remains appropriate, with no downside for Liza.

Finally, Liza uses the same voice assistant on her watch. Again, we can safely assume she is moving or needs an answer quickly (otherwise, she might use her phone). Our modality choice is primarily data communication: get to the point and move on, which is where badging and iconography are best used.

Whether you agree with the specific modality choices in this use case or not, the point remains that decisions had to be made. With this modality framework, any brand can align the team managing the channel so that consistency of experience and conversation is achieved. Taken one step further, the framework can become a taxonomy applied to assets across the board. That simplifies asset research, assignment, and use, and when the process of managing a voice experience is simplified, the experience itself becomes richer and the audience more satisfied.
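One way to read that last idea, a taxonomy applied to assets, is to tag each piece of content with its modality and the devices it suits, then let a simple filter do the assignment. The asset records and field names below are invented purely to show the shape of that tagging, reusing the Modality type from the first sketch.

```typescript
// Hypothetical asset records tagged with the modality taxonomy; the fields and
// content are invented for this flight-status example.
interface Asset {
  id: string;
  modality: Modality;
  devices: Array<"tv" | "car" | "watch" | "phone" | "speaker">;
  content: string;
}

const flightStatusAssets: Asset[] = [
  { id: "status-spoken", modality: "speech", devices: ["tv", "car", "watch", "phone", "speaker"],
    content: "Flight 212 is on time, departing at 4:05 PM from gate B7." },
  { id: "terminal-map",  modality: "map",    devices: ["tv", "phone"],   content: "terminal-b-map.png" },
  { id: "status-badge",  modality: "icon",   devices: ["watch", "phone"], content: "on-time-badge.png" },
];

// Pick the assets appropriate for the device Liza is using right now.
function assetsForDevice(device: Asset["devices"][number]): Asset[] {
  return flightStatusAssets.filter((a) => a.devices.includes(device));
}

// assetsForDevice("car")   -> just the spoken status
// assetsForDevice("tv")    -> the spoken status plus the terminal map
// assetsForDevice("watch") -> the spoken status plus the on-time badge
```

However a brand encodes it, the value is the same: once assets carry modality metadata, choosing what to send to Liza's TV, car, or watch stops being a one-off judgment call and becomes a repeatable rule.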