
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Isaac Caswell (1), Julia Kreutzer (1, 2), Lisa Wang (1), Ahsan Wahab (3), Daan van Esch (1), Nasanbayar Ulzii-Orshikh (4), Allahsera Tapo (2, 5), Nishant Subramani (2, 6), Artem Sokolov (1), Claytone Sikasote (2, 7), Monang Setyawan (1), Supheakmungkol Sarin (3), Sokhar Samb (8), Benoît Sagot (9), Clara Rivera (1), Annette Rios (10), Isabel Papadimitriou (11), Salomey Osei (2, 12), Pedro Javier Ortiz Suárez (9, 13), Iroro Orife (2), Kelechi Ogueji (2, 14), Rubungo Andre Niyongabo (2, 15), Toan Q. Nguyen (16), Mathias Müller (10), André Müller (10), Shamsuddeen Hassan Muhammad (2, 17), Nanda Muhammad (1), Ayanda Mnyakeni (1), Jamshidbek Mirzakhalov (3, 18), Tapiwanashe Matangira (1), Colin Leong (2), Nze Lawson (1), Sneha Kudugunta (1), Yacine Jernite (2, 19), Mathias Jenny (10), Orhan Firat (3, 1), Bonaventure F. P. Dossou (2, 20), Sakhile Dlamini (1), Nisansa de Silva (21), Sakine Çabuk Ballı (1), Stella Biderman (22), Alessia Battisti (10), Ahmed Baruwa (2, 23), Ankur Bapna (1), Pallavi Baljekar (1), Israel Abebe Azime (2, 8), Ayodele Awokoya (2, 24), Duygu Ataman (3, 10), Orevaoghene Ahia (2, 25), Oghenefego Ahia (3), Sweta Agrawal (26), Mofetoluwa Adeyemi (2, 27)
Abstract : With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or of whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and in a significant fraction fewer than 50% of the sentences are of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard or ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
Document type :
Preprints, Working Papers, ...

https://hal.inria.fr/hal-03177623
Contributor : Benoît Sagot
Submitted on : Tuesday, March 23, 2021 - 12:07:38 PM
Last modification on : Thursday, April 1, 2021 - 9:45:45 PM

Identifiers

  • HAL Id : hal-03177623, version 1
  • ARXIV : 2103.12028

Citation

Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch, et al. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. 2021. ⟨hal-03177623⟩