Journal articles

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi
Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and in a significant fraction fewer than 50% of the sentences are of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
Document type: Journal article

https://hal.inria.fr/hal-03177623
Contributor: Benoît Sagot
Submitted on: Sunday, February 13, 2022 - 7:17:14 PM
Last modification on: Wednesday, April 6, 2022 - 3:48:39 PM
Long-term archiving on: Saturday, May 14, 2022 - 6:18:22 PM

File

tacl_a_00447.pdf
Files produced by the author(s)

Licence


Distributed under a Creative Commons Attribution 4.0 International License

Citation

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, et al. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, The MIT Press, 2022, 10, pp.50-72. ⟨10.1162/tacl_a_00447⟩. ⟨hal-03177623⟩
