Evaluating the potential of S2R
When a traditional ASR system converts audio into a single text string, it may lose contextual cues that could help disambiguate the meaning (i.e., information loss). If the system misinterprets the audio early on, that error is passed along to the search engine, which typically lacks the ability to correct it (i.e., error propagation). As a result, the final search result may not reflect the user’s intent.
To investigate this relationship, we conducted an experiment designed to simulate ideal ASR performance. We began by collecting a representative set of test queries reflecting typical voice search traffic. Crucially, these queries were then manually transcribed by human annotators, effectively creating a “perfect ASR” scenario in which the transcription is treated as the ground truth.
We then established two distinct search systems for comparison (see chart below):
- Cascade ASR represents a typical real-world setup, where speech is converted to text by an automatic speech recognition (ASR) system, and that text is then fed to a retrieval system.
- Cascade groundtruth simulates a “perfect” cascade model by sending the flawless ground-truth text directly to the same retrieval system.
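As a rough illustration of the two setups, the sketch below sends either an ASR transcript or a human transcript through the same retriever. Everything in it is a hypothetical stand-in chosen for illustration: the two-document corpus, the word-overlap `toy_retriever`, the `cascade` helper, and the example transcripts are assumptions, not the systems or data used in the experiment.

```python
from typing import Callable, Dict, List

# Hypothetical toy corpus: document id -> text.
CORPUS: Dict[str, str] = {
    "doc_scream": "the scream painting by edvard munch",
    "doc_screen": "screen painting techniques for windows",
}


def toy_retriever(query: str) -> List[str]:
    """Rank documents by the number of words shared with the query.

    A stand-in for a real retrieval system, used only to make the
    comparison concrete.
    """
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(text.split())), doc_id)
        for doc_id, text in CORPUS.items()
    ]
    # Highest overlap first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def cascade(utterance_id: str, to_text: Callable[[str], str]) -> List[str]:
    """Generic cascade: utterance -> text -> retrieval."""
    return toy_retriever(to_text(utterance_id))


# Hypothetical transcripts for a single utterance.
ASR_OUTPUT = {"utt1": "screen painting"}    # contains a recognition error
GROUND_TRUTH = {"utt1": "scream painting"}  # human transcription

# Both systems share the retriever; only the text source differs.
cascade_asr = cascade("utt1", ASR_OUTPUT.__getitem__)
cascade_groundtruth = cascade("utt1", GROUND_TRUTH.__getitem__)
```

In this toy case a single misrecognized word is enough to change the top-ranked document, which is the error-propagation effect the comparison is meant to isolate.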
The retrieved documents from both systems (cascade ASR and cascade groundtruth) were then presented to human evaluators, or “raters”, alongside the original true query. The evaluators were tasked with comparing the search results from both systems, providing a subjective assessment of their respective quality.
We use word error rate (WER) to measure ASR quality and mean reciprocal rank (MRR) to measure search performance. MRR is a statistical metric for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness; it is calculated as the average, across all queries, of the reciprocal of the rank of the first correct answer. The difference in MRR and WER between the real-world system and the groundtruth system reveals the potential performance gains across some of the most commonly used voice search languages in the SVQ dataset (shown below).
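For readers who want the two metrics in concrete form, here is a minimal sketch. It assumes whitespace tokenization for WER and assumes a reciprocal rank of 0 for queries with no correct result, which is a common convention but not one stated above. The function names and example inputs are illustrative, not taken from the evaluation pipeline.

```python
from typing import Optional, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_reciprocal_rank(first_correct_ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank of the first correct answer across queries.

    Ranks are 1-based. None means no correct answer was returned and is
    scored as 0 (an assumption of this sketch).
    """
    if not first_correct_ranks:
        raise ValueError("need at least one query")
    reciprocal_ranks = [
        0.0 if rank is None else 1.0 / rank for rank in first_correct_ranks
    ]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# One substituted word out of three reference words.
example_wer = word_error_rate("the scream painting", "the screen painting")

# Four queries: first correct answer at ranks 1, 2, none, and 4.
example_mrr = mean_reciprocal_rank([1, 2, None, 4])
```

Lower WER is better and higher MRR is better, so the gap between the two systems appears as a WER reduction and an MRR increase when moving from ASR output to ground-truth text.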