Crowdsourced UX Testing for Alexa Skills and Google Assistant Actions – Interview with Charlie Ungashick from Applause
An aspect of voice app development that is starting to gain more attention is testing. Voicebot was first to report last week that Bespoken closed a funding round of $2.4 million. Bespoken started out with development tools and a monitoring solution, but has since built out a robust suite of testing tools for developers and QA professionals that is generating a lot of buzz.
Beyond tooling for developers, another element of mature testing protocols is user testing. Applause is a leader in crowdsourced testing for mobile, desktop and other devices and has recently started offering a similar service for Amazon Alexa skills. Voicebot caught up with Charlie Ungashick, Chief Marketing Officer of Applause, to better understand the new offering.
How did Applause get started and what service do you provide?
Charlie Ungashick: Applause pioneered the crowdsourced testing concept in 2008 – testing with real people, real devices, in real customer use cases. Today, our crowdtesting community is the largest in the world, with more than 300,000 testers and 2.4 million devices spread out across 200+ countries and territories. We provide a full suite of testing and user feedback solutions, including functional, security, omnichannel, payments, accessibility, automation, and usability. For the past 10 years we’ve tested websites, mobile apps, IoT devices, and connected in-store experiences for the world’s leading brands – helping them drastically improve application quality and time-to-market.
This year, we announced Applause for Amazon Alexa, and we are the largest of only four testing vendors listed on Amazon Alexa’s developer site.
What is your offering specific to Amazon Alexa skill testing?
We found that voice introduced three main challenges that brands were struggling to solve:
- Tools: Traditional testing and QA approaches are text-based and not equipped for voice.
- Scale: Voice input creates infinitely complex customer variables and test cases.
- Friction: Functional bugs, ASR/NLU (recognition) errors, and confusing prompts are difficult to account for.
Without a GUI for guidance, voice experiences like Alexa skills must be highly flexible and forgiving for users. Consumers expect to be able to launch the skill, speak naturally, and accomplish whatever they set out to do. The problem is that conversations are highly variable, and there are thousands of ways to say the same thing – especially when you consider different languages and dialects. We’ve worked with one company to test its Alexa-enabled device in the UK and Ireland to ensure it understands and can respond to twelve different dialects. Another client, a leading auto manufacturer, works with us for Alexa Voice Service (AVS) testing. Our testers discovered that Alexa really struggled to understand US drivers with specific southern and midwestern accents – which, for this client, represented an important demographic.
On top of that, individual users may expect different responses to the same utterance. Crowdtesting, where we’re using real end-users from across the globe, is a natural fit to uncover and account for this variability between people, geographies, and devices.
How does it work?
Our goal is to replicate the end-user experience as closely as possible during testing. All of our testing is done with real people, voices, and devices – under the same use cases and in the exact environments where they will be used by end-users. Testers could be testing at home or on the move – wherever they would typically be using Alexa. In the case of Alexa skills, our community tests by speaking to their personally owned Echo devices and other AVS-enabled devices like vehicles and smart speakers.
Testing follows this general flow:
- Customer defines test requirements – whether it be structured utterance flows or exploratory testing and feedback (a hypothetical sketch of a structured test case follows this list).
- Testers are selected by location, dialect, device, interests, and any other demographic requirement.
- Testers execute testing – using real devices, they’ll speak and interact with Alexa per the test requirements and then respond to survey questions and log bugs.
- Video/audio recordings, survey results, test case results, and bug reports are available in real time in our SaaS platform.
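Applause’s internal test-definition format isn’t public, but to make “structured utterance flows” concrete, here is a minimal, hypothetical Python sketch of what one test case might capture: the phrasings a tester should speak, the intent they should resolve to, and what the reply must contain. All of the names here are invented for illustration, not Applause’s actual schema.

```python
# Hypothetical sketch of a structured utterance test case. This is NOT
# Applause's actual schema; it only illustrates what such a definition
# might capture for a crowdtester working on a real device.
from dataclasses import dataclass, field

@dataclass
class UtteranceTestCase:
    skill: str                    # skill under test
    utterances: list[str]         # phrasings the tester should speak
    expected_intent: str          # intent the phrasings should resolve to
    expected_reply: str           # substring the spoken response must contain
    locales: list[str] = field(default_factory=lambda: ["en-US"])

weather_case = UtteranceTestCase(
    skill="Daily Weather",  # invented skill name
    utterances=[
        "Alexa, tell me the weather",
        "Alexa, what is the weather in my area",
        "Alexa, can you give me today's forecast",
    ],
    expected_intent="GetWeatherIntent",
    expected_reply="forecast",
    locales=["en-US", "en-GB"],
)
```

A tester then works through each utterance on their own device and logs whether the skill launched, which intent fired, and whether the response made sense, feeding the recordings, survey results, and bug reports described above.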
There are three pillars that make up Applause for Amazon Alexa. The first is dialog verification, which ensures the phrases customers use are understood and correctly mapped to the appropriate responses. Exploratory testing and feedback, the second pillar, is where testers provide insight into qualitative and subjective aspects of the voice experience. Finally, skill localization determines how different languages and cultures impact the user experience.
How many Alexa skills have you tested to date?
We’ve already tested skills with dozens of clients spanning numerous industries and use cases like media, healthcare services, consumer goods, and automotive. We’re also doing a good deal of testing on AVS-enabled devices. These are really interesting use cases because they bring an additional hardware variable to the customer experience. It always amazes me what our community can do – we have people testing Alexa experiences with their personally owned vehicles and have also shipped pre-production hardware to testers before our clients launch. I can’t think of any other approach that provides such a realistic customer perspective during testing.
One global company we worked with needed to test its Alexa-enabled device in France. Something very interesting happened when our testers tried to ask this device particular phrases. In certain cases, the device mistook French phrases as names. For example, when a tester asked the device to message her friend that she would see her soon, the device thought à bientôt (French for “see you soon”) was her friend’s last name. Upon finding that no one’s last name was à bientôt, the device reported it couldn’t complete the call because that contact didn’t exist.
Do you support Google Assistant and other voice assistants as well? What are your plans in terms of device and manufacturer support?
We currently support testing for Google Assistant and other voice assistants. The nature of crowdtesting allows us to test any platform or device – including voice assistants – in any locale.
What are some common types of questions your testing is answering for Alexa skill publishers?
In general, our clients want to make sure their skills are as intuitive, efficient, and reliable as possible. Clients go into testing with some things in mind (known flows and utterances they are looking to verify), but there are always the “wow, we didn’t even think about that” issues that only surface once the testing results come back. Because there’s not always a clear path to where the user needs to go, there is a huge focus on navigation and UX. Unlike mobile apps and the web, the user has much less of an idea of what’s next in the flow.
From the skill publisher perspective:
- Have we accounted for all of the phrases/utterances our customers will use and are they correctly mapped to the appropriate intents/responses? Can every user speak in a natural way to get to what they want? (A naive coverage check along these lines is sketched after this list.)
- Is the quality of the response up to par? Does it provide the user with the info they wanted or give them enough instruction to continue along the flow?
- How do different customer languages and dialects impact the experience? If Alexa misinterprets a word or phrase, can we account for it to avoid an issue?
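One simple artifact of this kind of review is a coverage check: compare the phrasings testers actually used against the sample utterances the skill declares, and flag the gaps. The sketch below is hypothetical and uses naive exact matching; real NLU generalizes beyond declared samples, so mismatches are only candidates for review, not guaranteed failures.

```python
# Hypothetical coverage check: which phrasings heard from testers are not
# declared as sample utterances? Exact matching only; Alexa's NLU
# generalizes beyond declared samples, so gaps are review candidates.
declared_samples = {
    "tell me the weather",
    "what is the weather in my area",
    "can you give me today's forecast",
}

tester_phrasings = [
    "tell me the weather",
    "what's it like outside",
    "do i need an umbrella today",
]

gaps = [p for p in tester_phrasings if p not in declared_samples]
print(gaps)  # -> ["what's it like outside", "do i need an umbrella today"]
```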
We are helping our clients make sure their customers never have to think:
- “Why can’t I get what I need from the skill using my natural dialog?”
- “Why doesn’t the skill understand me and the phrases I use?”
- “Why does it take so many steps to get to what I want to do?”
- “I don’t understand the skill’s response, and have no idea what I can do next.”
What have you learned are common issues that Alexa skills face?
Stickiness is an issue, and it can largely be attributed to skills being treated as novelties that don’t solve a real problem for the user. If a skill isn’t intuitive and doesn’t provide users with what they want, they will move on.
We’ve seen a lot of Alexa skills struggle with utterances and localization. Having the correct utterances is essential to the customer experience because it helps users find and launch the skill they are looking for. Brands need to plan for the different ways people ask for things (and how people in different cultures ask for things). While I might say, “Alexa, tell me the weather,” you might ask, “Alexa, what is the weather in my area?” or “Alexa, can you give me today’s forecast?” These questions all lead to the same answer, but the weather skill first needs to recognize that it is being called upon. Having the right utterances is the key to being discovered, used, and retained as a voice app.
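In Alexa terms, each of those phrasings is a sample utterance mapped to one intent in the skill’s interaction model. Below is a minimal fragment for the weather example, written as a Python dict mirroring the JSON the Alexa Skills Kit expects; the invocation name and intent name are invented. Note that samples omit the wake word and invocation, so a user would actually say something like “Alexa, ask daily weather what is the weather in my area.”

```python
# Minimal interaction-model fragment for the weather example, as a Python
# dict mirroring the Alexa Skills Kit JSON. Names are invented; samples
# omit the wake word ("Alexa") and the invocation name.
interaction_model = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "daily weather",
            "intents": [
                {
                    "name": "GetWeatherIntent",
                    "slots": [],
                    "samples": [
                        "tell me the weather",
                        "what is the weather in my area",
                        "can you give me today's forecast",
                    ],
                }
            ],
        }
    }
}
```

Every common phrasing missing from the samples (and outside what the NLU can generalize to) is a potential “Alexa doesn’t understand” moment, which is exactly the kind of gap crowdtesters surface.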
Localization brings up a similar problem. Brands have to think global when developing their Alexa skills to account for users speaking different languages, with different dialects and accents. What’s more, brands have to think about the cultural norms and adjust their skills accordingly.
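Concretely, an Alexa skill ships one interaction model per supported locale, so localization means more than translating strings: the samples for each locale have to reflect how its speakers naturally phrase the request. A hypothetical sketch for the same weather intent (the phrasings are ordinary ways to ask about the weather, not drawn from any specific skill):

```python
# Hypothetical per-locale sample utterances for the same weather intent.
# Each supported locale gets its own interaction model, with phrasings
# natural to that locale rather than literal translations.
samples_by_locale = {
    "en-US": ["tell me the weather", "give me today's forecast"],
    "en-GB": ["what's the weather like", "will it rain today"],
    "fr-FR": ["quel temps fait il", "donne moi la météo du jour"],
}
```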
Given your experience testing technology products more generally, what are one or two common errors that application developers should avoid?
Not putting enough emphasis on exploratory testing to ensure that users can log in, sync devices, and use multimodal features. Test automation is a critical aspect for every organization, but at the end of the day, automation tests for expected outcomes. Exploratory testing, particularly in real-world environments like crowdtesting, goes beyond just meeting internal requirements. It recognizes that just because something is designed to be used one way doesn’t mean consumers will always use it in that manner.
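A concrete way to see the limitation: an automated check like the hypothetical one below passes as long as the scripted utterance yields the scripted reply, and says nothing about the unscripted paths real users take, such as logging in out of order, switching devices mid-session, or phrasing a request nobody anticipated. The simulate_utterance stub stands in for whatever test harness is in use.

```python
# Automation verifies expected outcomes. The stub below stands in for a
# real skill-simulation harness; the test only exercises one scripted path.
def simulate_utterance(text: str) -> str:
    # Hypothetical stub: a real harness would send `text` to the skill
    # and return the assistant's spoken reply.
    return "Today's forecast is sunny with a high of 72."

def test_forecast_happy_path():
    reply = simulate_utterance("ask daily weather for today's forecast")
    assert "forecast" in reply.lower()
```

Exploratory testers, by contrast, are free to wander off this script, which is where the “we didn’t even think about that” findings come from.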