Language Technologies for Low Resource Languages: Sociolinguistic and Multilingual Insights
Doğruöz, A. Seza, and Sunayana Sitaram. 2022. “Language Technologies for Low Resource Languages: Sociolinguistic and Multilingual Insights.” In Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, edited by Maite Melero, Sakriani Sakti, and Claudia Soria, 92–97. Marseille, France: European Language Resources Association. https://aclanthology.org/2022.sigul-1.12/.
Notes
In-text annotations
"Joshi et al. (2020) categorize the languages of the world into six categories based on the resources available in terms of labeled and unlabeled data. More than 88% of the world’s languages belong to the lowest resource class, with only 25 languages belonging to the two high resource classes. In other words, a majority of the world’s languages count as LRLs even when they have large numbers of speakers (e.g. Gondi (Mehta et al., 2020) and Odia (Parida et al., 2020) spoken in India)." (Page 92)
"Therefore, aiming to create monolingual data sets even for comparisons or benchmarking purposes is not a meaningful effort for LRLs which inherently contain many borrowed words in highly multilingual areas" (Page 93)