Svarah - Evaluating English ASR Systems on Indian Accents

Javed, Tahir, Sakshi Joshi, Vignesh Nagarajan, et al. 2023. “Svarah: Evaluating English ASR Systems on Indian Accents.” arXiv:2305.15760. Preprint, arXiv, May 25. https://doi.org/10.48550/arXiv.2305.15760.

Notes

Dataset for benchmarking:
- 19 Indian languages (native languages of the bilingual speakers)
- 117 speakers (54 men, 63 women)
- 9.6 hours of English speech data
- Data types
  - read speech (predefined multi-domain text)
    - 1k Wikipedia sentences from 9 domains, collected so that the text being read covered diverse vocabulary
      - health, entertainment, culture, geography, history, business, news, sports, and tourism
  - extempore speech - answers to simple questions that any participant could answer
    - 28 topics of interest (e.g. painting, cooking, gardening, knitting and stitching, travelling), plus the same 9 domains

In-text annotations

"For each of the 19 languages, we recruited 3-5 bilingual speakers who spoke English and one of the constitutionally recognized Indian languages resulting in a total of 117 speakers. Of these, 54 were men and 63 were women. We also ensured that we had a roughly equal number of speakers belonging to the following age groups 18-30, 30-45, 45-60, and 60+. The speakers also came from different educational backgrounds (arts, commerce, science) with different levels of education (graduates, post-graduates, PhDs). The task was clearly explained to the speakers, and they were informed that the data was being collected to build and evaluate speech models. Their voice samples were recorded only if the speakers willingly agreed to participate in the task and signed a consent form to this effect." (Page 2)

"A part of the data included read speech which required participants to read a piece of text shown to them. To ensure that the text being read covered diverse vocabulary, we collected text from Wikipedia belonging to multiple domains. These domains were identified using the “Category” information available for Wikipedia articles. In total, we collected 1k sentences from 9 domains, viz. health, entertainment, culture, geography, history, business, news, sports, and tourism. Each participant was asked to read 5 sentences randomly chosen from this collection while ensuring that no two participants got the same sentence." (Page 2)

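The assignment described above (5 sentences per participant, no sentence shared between participants) amounts to sampling without replacement from the sentence pool. Below is a minimal Python sketch, assuming a pool of 1,000 distinct sentences and the 117 speakers reported in the paper. The function name `assign_sentences`, the placeholder sentence strings, and the fixed seed are illustrative assumptions, not the authors' tooling.

```python
import random


def assign_sentences(sentences, participant_ids, per_participant=5, seed=0):
    """Give each participant `per_participant` sentences, with no sentence reused.

    Assumes `sentences` contains no duplicates (hypothetical helper, not from the paper).
    """
    needed = per_participant * len(participant_ids)
    if needed > len(sentences):
        raise ValueError("sentence pool too small for a disjoint assignment")

    rng = random.Random(seed)
    # One draw without replacement guarantees that no sentence is picked twice.
    drawn = rng.sample(sentences, needed)

    # Slice the draw into consecutive, non-overlapping chunks, one per participant.
    return {
        pid: drawn[i * per_participant:(i + 1) * per_participant]
        for i, pid in enumerate(participant_ids)
    }


# Placeholder data sized like the paper's setup: 1k sentences, 117 speakers.
pool = [f"sentence_{i}" for i in range(1000)]
speakers = [f"speaker_{i}" for i in range(117)]
assignment = assign_sentences(pool, speakers)
```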
"The participants were asked to fill out a form, where, in addition to meta-data such as age, district of residence, native language, etc. they were also asked to select (i) topics of interest and (ii) specific domains about which they could talk. We considered 28 topics of interest such as painting, cooking, gardening, knitting and stitching, travelling, etc., and the same 9 domains listed above. For each topic of interest, we created a few simple questions that any participant could answer, such as, “What inspired you to take up drawing?”, “What are the dishes you like to cook and tell us the recipe of your favourite dish?”. Similarly for each domain, we created simple questions which anyone interested in that domain could answer. For example, someone interested in Entertainment should be able to answer the following question “What is your favorite movie or TV serial and why do you like it?”. Each participant was shown four such questions for each of the selected topics of interest and domains and was required to answer these questions using extempore speech." (Page 2)