Vistaar - Diverse Benchmarks and Training Sets for Indian Language ASR
Bhogale, Kaushal Santosh, Sai Sundaresan, Abhigyan Raman, Tahir Javed, Mitesh M. Khapra, and Pratyush Kumar. 2023. “Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR.” arXiv:2305.15386. Preprint, arXiv, August 2. https://doi.org/10.48550/arXiv.2305.15386.
Notes
- “curate all publicly available training sets for 12 Indian languages amounting over 10,700 hours of audio” (Bhogale et al., 2023, p. 1)
- Vistaar Benchmark Set
  - Kathbath
    - read speech data across 12 languages
  - Kathbath-hard
    - Kathbath with background noise added
  - FLEURS
    - read speech of translated Wikipedia content, with 3 recordings by different speakers per sentence and manual validation; collected by researchers at Meta and their collaborators - 11 languages
  - CommonVoice
    - crowd-sourced read speech - 8 languages
  - IndicTTS
    - studio-quality read speech by professional speakers - 9 languages
  - MUCS
    - read speech - 6 languages
  - GramVaani
    - telephone-quality speech data
    - regional/dialectal variations of Hindi
- Kathbath