Conference paper, Year: 2020

Sensitive Keyword Extraction Based on Cyber Keywords and LDA in Twitter to Avoid Regrets

R. Geetha
  • Role: Author
  • PersonId : 1117219
S. Karthika
  • Role: Author
  • PersonId : 1117220

Abstract

Twitter is among the most popular social platforms on which ordinary people share their personal, political and business views, indirectly building an active online repository. The data that users post on social networking sites often include sensitive or private information that is highly exposed to cyber threats. The most frequently disclosed sensitive private data are analyzed by collecting real-time tweets that match benchmarked cyber-keywords in the personal, professional and health categories. This work aims to build a Topic Keyword Extractor by adapting the Automatic Acronym - Abbreviation Replacer, which was developed specifically for short social media texts. The feature space is modeled with Latent Dirichlet Allocation (LDA) to discover topics for each cyber-keyword. Replacing internet jargon and abbreviations with their expanded forms preserves the user's context and intent. The originality of this work lies in using the novel Topic Keyword Extractor to identify sensitive keywords that reveal a tweeter's Personally Identifiable Information (PII). For each benchmarked cyber-keyword, the proposed qualitative topic-wise keyword distribution approach uncovers the potentially sensitive topics in which social media users frequently reveal personal information and make unintended disclosures. The experiment analyzed the cyber-keywords together with the identified sensitive topic keywords as bi-grams to predict the most common sensitive information leaks on Twitter. The results showed that the most frequently discussed sensitive topic was 'weight loss', associated with the cyber-keyword 'weight' in the health tweet category.
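As a rough illustration of two steps the abstract describes (expanding abbreviations, then pairing a cyber-keyword with neighbouring words as bi-grams), a minimal Python sketch follows. It is not the authors' Topic Keyword Extractor and it leaves out the LDA step entirely. The abbreviation table, the stopword list, the function names and the sample tweets are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical abbreviation table; the paper's actual replacer is not shown in the abstract.
ABBREVIATIONS = {"wt": "weight", "appt": "appointment", "bday": "birthday"}

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "after", "and", "i", "is", "my", "of", "the", "to"}


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Lowercase the text, split it into word tokens, and expand known abbreviations."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [table.get(token, token) for token in tokens]


def keyword_bigrams(tweets, cyber_keyword):
    """Count bi-grams that contain the cyber-keyword.

    Stopwords are dropped before pairing, so a bi-gram may join two words
    that were separated by a stopword in the original tweet.
    """
    counts = Counter()
    for tweet in tweets:
        tokens = [t for t in expand_abbreviations(tweet) if t not in STOPWORDS]
        for left, right in zip(tokens, tokens[1:]):
            if cyber_keyword in (left, right):
                counts[(left, right)] += 1
    return counts


# Invented sample tweets, for demonstration only.
tweets = [
    "My wt loss journey starts today",
    "Weight loss is hard",
    "lost weight after my appt",
]
counts = keyword_bigrams(tweets, "weight")
print(counts.most_common(3))
```

In a fuller pipeline, the expanded tokens would typically feed a topic model such as LDA, and the bi-gram counts would be restricted to the topic keywords that the model surfaces for each cyber-keyword.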
Main file
507484_1_En_5_Chapter.pdf (275.04 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03434777, version 1 (18-11-2021)

License

Attribution

Cite

R. Geetha, S. Karthika. Sensitive Keyword Extraction Based on Cyber Keywords and LDA in Twitter to Avoid Regrets. 3rd International Conference on Computational Intelligence in Data Science (ICCIDS), Feb 2020, Chennai, India. pp.59-70, ⟨10.1007/978-3-030-63467-4_5⟩. ⟨hal-03434777⟩