Optional realization of the French negative particule (ne) on Twitter: Can big data reveal new sociolinguistic patterns?

Abstract : From the outset, sociolinguistics has taken the question of data seriously (Labov, 1975). It is thus not surprising that the field has recently joined the computational social science movement (Lazer et al., 2009), which stems from the ability to collect and model vast digital datasets on the behavior of individuals in collective contexts. The emerging field of computational sociolinguistics (Nguyen et al., 2016) works on data obtained from sensors (proximity sensors, wearable recorders) or from digital communication, which permits automatic, ongoing and unsupervised recording through the collection of traces on the web, social media or mobile devices. This paper aims to illustrate how large datasets that include both language and social links reveal sociolinguistic patterns that could remain invisible in smaller samples. More precisely, the dataset includes 100 million tweets authored by 1 million users, combined with the follower links between them. The tweets are written in French, and the sample represents 10% of the production in the GMT+1 time zone between June 2014 and July 2016. We examine (ne), a sociolinguistic variable of French: the optional realization of the first morpheme of negation (Je fume pas vs. Je ne fume pas, 'I do not smoke'). We chose it for three reasons: (ne) is a well-documented sociolinguistic marker of spoken French (Armstrong and Smith, 2002, inter alia); both the realization and the omission of (ne) are visible in written tweets; and (ne) is always realized in standard writing, which allows an assessment of users' adherence to the written norm. We will present the empirical procedures for extracting the tweets that include a negative construction and for constructing a social network based on reciprocal mentions between users. We will then focus on three results: (1) the overall rate of (ne) realization and its regional variation in France (approx. 16% in the North and 28% in the South); (2) a previously unreported pattern showing a very regular variation of (ne) realization according to the time of day, every day of the week (increase in the morning, decrease during the night); (3) the observation that users with high scores interact frequently with each other. The discussion focuses on the sociolinguistic meaning of the results, including a close examination of the risk of bias. Finally, we will argue that thick data should be combined with big data in order to explain such patterns (Wang, 2013).
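The abstract does not spell out the extraction or network-building procedures, so the following is only a minimal illustrative sketch of the kind of processing it mentions, not the authors' method. The word list `NEG_WORDS`, the regular expressions, and the function names `classify`, `ne_rate` and `reciprocal_edges` are all assumptions introduced here; a real pipeline would need to handle, for instance, non-negative uses of "plus" or "personne".

```python
import re
from collections import Counter

# Assumption: a simplified list of French second negative elements.
NEG_WORDS = r"(?:pas|plus|jamais|rien|personne)"

# "ne" or elided "n'" followed, within one or two words, by a negative word.
WITH_NE = re.compile(
    r"\b(?:ne\s+|n['’])\w+(?:\s+\w+)?\s+" + NEG_WORDS + r"\b",
    re.IGNORECASE,
)
# Any negative word, with or without a preceding "ne".
ANY_NEG = re.compile(r"\b" + NEG_WORDS + r"\b", re.IGNORECASE)


def classify(tweet):
    """Return 'ne' if (ne) is realized, 'omitted' if a negative word
    appears without it, and None if no negation is detected."""
    if WITH_NE.search(tweet):
        return "ne"
    if ANY_NEG.search(tweet):
        return "omitted"
    return None


def ne_rate(tweets):
    """Share of detected negative constructions that realize (ne)."""
    counts = Counter(label for label in map(classify, tweets) if label)
    total = counts["ne"] + counts["omitted"]
    return counts["ne"] / total if total else None


def reciprocal_edges(mentions):
    """Keep only pairs of users who mention each other.

    `mentions` is an iterable of (author, mentioned_user) pairs.
    Each reciprocal pair is returned once, as a sorted tuple.
    """
    directed = set(mentions)
    return {
        tuple(sorted((a, b)))
        for a, b in directed
        if a != b and (b, a) in directed
    }
```

Under these assumptions, `ne_rate` could be applied per region or per hour of day to obtain scores comparable in kind to those reported, and `reciprocal_edges` yields an undirected graph on which users' scores can be compared with those of their contacts.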

Contributor : Jean-Pierre Chevrot
Submitted on : Saturday, February 4, 2017 - 5:22:02 PM
Last modification on : Sunday, April 14, 2019 - 7:20:11 PM


  • HAL Id : hal-01456302, version 1


Paul Mangold, Yannick Léo, Jean-Pierre Chevrot, Eric Fleury, Márton Karsai, et al. Optional realization of the French negative particule (ne) on Twitter: Can big data reveal new sociolinguistic patterns? ICLAVE 9 2017 - International Conference on Language Variation in Europe, Jun 2017, Malaga, Spain. ⟨hal-01456302⟩


