Normalisation orthographique de corpus bruités (Spelling normalisation of noisy corpora)

Abstract: The information contained in messages posted on the Internet (forums, social networks, review sites...) is of strategic importance for many companies. However, few tools have been designed for analysing such messages, whose spelling, typography and syntax are often noisy. This industrial PhD thesis was carried out within the company viavoo with the aim of improving the results of a lemma-based information retrieval tool. We have developed a processing pipeline for the normalisation of noisy texts. Its aim is to ensure that each word is assigned the standard spelling corresponding to one of its lemma’s inflected forms. First, among all tokens of the corpus that are unknown to a reference lexicon, we automatically determine which ones result from alterations, and should therefore be normalised, and which do not (neologisms, loanwords...). Normalisation candidates are then generated for the altered tokens using weighted rules obtained by analogy-based machine learning techniques. Next, we identify tokens that are known to the reference lexicon but are nevertheless the result of an alteration (grammatical errors), and generate normalisation candidates for each of them. Finally, language models allow us to perform a context-sensitive disambiguation of the normalisation candidates generated for all types of alterations. Numerous experiments and evaluations are carried out on French data for each module and for the overall pipeline. Special attention has been paid to keeping all modules as language-independent as possible, which paves the way for future adaptations of our pipeline to other European languages.
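
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The lexicon, the rewrite rules and their weights, the bigram scores and the function names (`candidates`, `sequence_score`, `normalise`) are all hypothetical toy values chosen for illustration, and the rules are written by hand here, whereas the abstract states that the real ones are learned by analogy. The sketch covers only tokens unknown to the lexicon, and it leaves an unknown token unchanged when no rule yields a known word, as a crude stand-in for the neologism and loanword case.

```python
import math
from itertools import product

# Hypothetical toy resources; none of these values come from the thesis.
LEXICON = {"je", "veux", "mange", "manger", "mangé"}
# (pattern, replacement, weight): hand-written stand-ins for learned rules.
RULES = [("je", "ge", 0.5), ("je", "ger", 0.3), ("je", "gé", 0.2)]
# Toy bigram language model, as log-probabilities.
BIGRAM_LOGPROB = {
    ("je", "veux"): -1.0,
    ("veux", "manger"): -1.0,
    ("je", "mange"): -1.0,
}
UNSEEN_LOGPROB = -8.0  # assumed penalty for bigrams missing from the toy model


def candidates(token):
    """Return {candidate: weight} for one token."""
    if token in LEXICON:
        return {token: 1.0}  # known tokens are kept as they are
    found = {}
    for pattern, replacement, weight in RULES:
        if pattern in token:
            rewritten = token.replace(pattern, replacement, 1)
            if rewritten in LEXICON:
                found[rewritten] = max(weight, found.get(rewritten, 0.0))
    # No rule yields a known word: leave the token untouched.
    return found or {token: 1.0}


def sequence_score(words, weights):
    """Sum of log rule weights and bigram log-probabilities."""
    score = sum(math.log(w) for w in weights)
    for prev, cur in zip(words, words[1:]):
        score += BIGRAM_LOGPROB.get((prev, cur), UNSEEN_LOGPROB)
    return score


def normalise(tokens):
    """Pick the best-scoring candidate sequence by exhaustive search."""
    options = [list(candidates(t).items()) for t in tokens]
    best = max(
        product(*options),
        key=lambda combo: sequence_score(
            [word for word, _ in combo], [weight for _, weight in combo]
        ),
    )
    return [word for word, _ in best]


print(normalise(["je", "veux", "manje"]))  # ['je', 'veux', 'manger']
print(normalise(["je", "manje"]))          # ['je', 'mange']
```

In this toy setup the same noisy token `manje` is normalised differently depending on its left context, which is the role the abstract assigns to the language models. The exhaustive search over candidate combinations is only practical for very short inputs; a dynamic-programming search such as Viterbi would be the usual replacement.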

https://hal.inria.fr/tel-01226159
Contributor: Marion Baranes
Submitted on: Sunday, November 8, 2015 - 9:49:31 PM
Last modification on: Friday, January 4, 2019 - 5:33:24 PM
Long-term archiving on: Tuesday, February 9, 2016 - 10:58:39 AM

Identifiers

  • HAL Id: tel-01226159, version 1

Citation

Marion Baranes. Normalisation orthographique de corpus bruités. Linguistique. Université Paris-Diderot - Paris VII, 2015. Français. ⟨tel-01226159⟩
