
Statistical Machine Translation: Application to low resourced languages

Abstract : This work is dedicated to statistical machine translation for poorly resourced languages. We are interested in Arabic dialects, which are the everyday language of all Arab peoples. These dialects differ from one Arab country to another, and even within a single country several dialect variants coexist. Because they are primarily oral and non-standardized, these dialects are a challenge for NLP. In machine translation, they are difficult to translate because of the lack of resources of all kinds, in particular the monolingual and especially the parallel corpora needed for training. In this thesis, we address this issue with particular attention to the Algerian dialect, and more precisely to the dialect of Algiers. We created PADIC (Parallel Arabic Dialect Corpus), a parallel multi-dialect corpus. It is an important textual resource that so far includes six Arabic dialects in addition to Modern Standard Arabic. We carried out an analytical study of this corpus to highlight the relationships among the dialects and between the dialects and Standard Arabic. Using PADIC, we tackled the problem of statistical machine translation between the different dialects and Standard Arabic. We obtained several results, all of which point to the difficulty of translating dialects. In addition, several tools dedicated to the Algiers dialect were produced as part of this thesis. The problem of code-switching was also addressed, and an identification tool was implemented using machine learning techniques.

Contributor : Kamel Smaïli
Submitted on : Wednesday, March 31, 2021 - 3:20:43 PM
Last modification on : Wednesday, November 3, 2021 - 7:57:49 AM
Long-term archiving on : Thursday, July 1, 2021 - 6:46:20 PM


Files produced by the author(s)


  • HAL Id : tel-03186940, version 1


Salima Harrat. Statistical Machine Translation: Application to low resourced languages. Computation and Language [cs.CL]. École Supérieure d’Informatique, 2018. English. ⟨tel-03186940⟩


