Careless Whisper - Speech-to-Text Hallucination Harms
Koenecke, Allison, Anna Seo Gyeong Choi, Katelyn X. Mei, Hilke Schellmann, and Mona Sloane. 2024. “Careless Whisper: Speech-to-Text Hallucination Harms.” The 2024 ACM Conference on Fairness, Accountability, and Transparency, June 3, 1672–81. https://doi.org/10.1145/3630106.3658996.
Notes
- roughly 1% of audio transcribed by Whisper contained hallucinations
- 38% of the hallucinations include explicit harms, such as perpetuating violence, making inaccurate associations, or implying false authority
- hallucinations disproportionately occur for individuals whose speech has a larger share of non-vocal duration, a common symptom of aphasia
- Potential RQ: Is Whisper more likely to hallucinate for non-native English speakers and non-English languages?
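The "share of non-vocal duration" measure the paper correlates with hallucination rates can be computed from any voice-activity-detection (VAD) output. A minimal sketch (not the authors' code), assuming a hypothetical list of `(start, end)` speech segments in seconds:

```python
def nonvocal_share(speech_segments, total_duration):
    """Fraction of an audio clip with no detected speech.

    speech_segments: list of (start, end) times, in seconds, where a
    VAD tool detected speech (assumed non-overlapping).
    total_duration: total clip length in seconds.
    """
    voiced = sum(end - start for start, end in speech_segments)
    return 1.0 - voiced / total_duration

# e.g. 4s of speech in a 10s clip -> non-vocal share of 0.6
nonvocal_share([(0.0, 2.0), (5.0, 7.0)], 10.0)
```

Comparing this share across speaker groups (e.g. people with vs. without aphasia) is one way to probe the paper's disparity finding on new data.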
In-text annotations
"While many of Whisper’s transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio." (Page 1672)
"We evaluate Whisper’s transcription performance on the axis of “hallucinations,” defined as undesirable generated text “that is nonsensical, or unfaithful to the provided source input”" (Page 1672)
"we provide experimental quantification of Whisper hallucinations, finding that nearly 40% of the hallucinations are harmful or concerning in some way" (Page 1672)
"Our key insight (at the time of analysis) is that hallucinations are often non-deterministic, yielding different random text on each run of the API" (Page 1674)
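The non-determinism noted above suggests a simple detection heuristic: transcribe the same audio several times and flag runs that diverge from the rest, since hallucinated text tends to vary run-to-run while faithful text is stable. A minimal sketch of that idea (my own illustration, not the authors' method), using `difflib` similarity as a stand-in for a proper transcript-alignment step:

```python
from difflib import SequenceMatcher

def flag_unstable_transcripts(transcripts, threshold=0.7):
    """Flag transcripts of the SAME audio that diverge from the rest.

    transcripts: list of strings from repeated transcription runs.
    Returns (index, mean_similarity) for each run whose average
    character-level similarity to the other runs falls below
    `threshold` -- a crude proxy for "likely contains hallucination".
    """
    flagged = []
    for i, t in enumerate(transcripts):
        others = [s for j, s in enumerate(transcripts) if j != i]
        mean_sim = sum(
            SequenceMatcher(None, t, s).ratio() for s in others
        ) / len(others)
        if mean_sim < threshold:
            flagged.append((i, round(mean_sim, 3)))
    return flagged
```

For example, if two runs agree and a third appends an invented phrase, only the third run is flagged. The `threshold` value is an arbitrary illustration and would need tuning against labeled data.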