Microsoft Hits New Conversational Speech Recognition Milestone
Microsoft announced last year that it had reached transcription parity with human transcribers, a milestone with a 5.9% word error rate. The company announced in a blog post yesterday that it had passed another milestone, bringing the error rate down to 5.1%. This matches the error rate of human transcribers when they are allowed to go back, review the audio recordings, and revise their errors. The post said that Microsoft has been working toward “accuracy on par with humans” for 25 years. Microsoft products that use the underlying technology include Cortana, Presentation Translator and Cognitive Services.
The evaluation was performed on the 2000 Switchboard evaluation set. The Switchboard corpus contains about 2,400 telephone conversations on 52 topics ranging from drug testing and music to woodworking and gun control. The audio is all from native English speakers. To achieve the 12% relative improvement over 2016, Microsoft also increased the system vocabulary from 30,500 words to 165,000 words. This involved adding the Broadcast News corpus and the Conversational Web corpus, and it reduced out-of-vocabulary instances from 0.29% to just 0.06%.
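The error rates quoted above are word error rates (WER): the number of word-level substitutions, insertions, and deletions needed to turn the system's transcript into the reference, divided by the number of reference words. A minimal sketch of that calculation, using a standard Levenshtein edit-distance dynamic program (the function name and inputs here are illustrative, not Microsoft's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a five-word reference ("hello word how are you" against "hello world how are you") yields a WER of 0.2, i.e. 20%; Microsoft's 5.1% figure means roughly one word error per twenty reference words.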
ASR and Not NLU
It is important to note that this is an automatic speech recognition (ASR) milestone, not a natural language understanding (NLU) one. There was no attempt to extract meaning from the dialogues. This was simply an exercise in improving speech-to-text (STT) transcription. STT is a critical element of all voice assistants today because these systems first transcribe audio utterances into text before applying NLU to determine user intent. More accurate STT should improve the inputs that NLU operates on.
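To see why STT accuracy matters downstream, consider a toy sketch of the second stage of that pipeline. The keyword-based intent matcher below is purely illustrative (real assistants use trained NLU models, and the intents and keywords here are invented), but it shows how a single transcription error can derail intent detection:

```python
# Toy intent detector standing in for the NLU stage of a voice assistant.
# Intents and keywords are hypothetical examples, not any real product's.
INTENT_KEYWORDS = {
    "weather": ["weather", "forecast", "rain"],
    "timer": ["timer", "alarm", "remind"],
}

def detect_intent(transcript: str) -> str:
    """Map an STT transcript to an intent via simple keyword matching."""
    words = transcript.lower().split()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in words for k in keywords):
            return intent
    return "unknown"

# A correct transcript maps to the right intent...
print(detect_intent("set a timer for ten minutes"))  # timer
# ...while one misrecognized word ("time" for "timer") loses the intent entirely.
print(detect_intent("set a time for ten minutes"))   # unknown
```

Even a small reduction in word error rate reduces how often the NLU stage receives a transcript like the second one.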
Speech Recognition is Quickly Getting Much Better
The key takeaway for readers is that speech recognition is quickly getting much more accurate. Anyone with an Amazon Alexa-based device, Google Assistant or Microsoft Cortana can experience this firsthand. However, as we start to rely on voice assistants more and grant them agency, reducing speech recognition error rates will be critical to building user confidence in our new helper bots.
You can read the full Microsoft report here.