Conference paper, Year: 2018

Obtaining Precision-Recall Trade-Offs in Fuzzy Searches of Large Email Corpora

Abstract

Fuzzy search is often used in digital forensic investigations to find words that are stringologically similar to a chosen keyword. However, a common complaint is the high rate of false positives in big data environments. This chapter describes the design and implementation of cedas, a novel constrained edit distance approximate string matching algorithm that provides complete control over the types and numbers of elementary edit operations considered in approximate matches. The unique flexibility of cedas facilitates fine-tuned control of precision-recall trade-offs. Specifically, searches can be constrained to the union of matches resulting from any exact edit combination of insertion, deletion and substitution operations performed on the search term. The flexibility is leveraged in experiments involving fuzzy searches of an inverted index of the Enron corpus, a large English email dataset, which reveal the specific edit operation constraints that should be applied to achieve valuable precision-recall trade-offs. The constraints that produce relatively high combinations of precision and recall are identified, along with the combinations of edit operations that cause precision to drop sharply and the combination of edit operation constraints that maximize recall without sacrificing precision substantially. These edit operation constraints are potentially valuable during the middle stages of a digital forensic investigation because precision has greater value in the early stages of an investigation while recall becomes more valuable in the later stages.
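
The constraint idea in the abstract can be illustrated with a minimal sketch. This is not the cedas algorithm itself, whose internals the abstract does not describe; it is a brute-force dynamic program that only mirrors the stated behaviour: a word matches when it can be aligned to the search term using exactly a permitted combination of insertions, deletions and substitutions. The names `matches_exact`, `constrained_search` and `precision_recall`, the toy vocabulary, and the convention that a substitution must replace a character with a different one are all assumptions made for illustration.

```python
from functools import lru_cache


def matches_exact(keyword, word, ins, dele, sub):
    """Return True if `word` can be aligned to `keyword` using exactly
    `ins` insertions, `dele` deletions and `sub` substitutions.
    Assumption: a substitution always replaces a character with a different one."""
    # Necessary condition: the length difference must equal insertions minus deletions.
    if len(word) - len(keyword) != ins - dele:
        return False

    @lru_cache(maxsize=None)
    def go(a, b, i, d, s):
        # a, b: positions in keyword and word; i, d, s: operations still to spend.
        if a == len(keyword) and b == len(word):
            return i == 0 and d == 0 and s == 0
        if a < len(keyword) and b < len(word):
            if keyword[a] == word[b]:
                if go(a + 1, b + 1, i, d, s):  # free match
                    return True
            elif s > 0 and go(a + 1, b + 1, i, d, s - 1):  # substitution
                return True
        if i > 0 and b < len(word) and go(a, b + 1, i - 1, d, s):  # insertion
            return True
        if d > 0 and a < len(keyword) and go(a + 1, b, i, d - 1, s):  # deletion
            return True
        return False

    return go(0, 0, ins, dele, sub)


def constrained_search(keyword, vocabulary, allowed):
    """Union of matches over every permitted (ins, del, sub) combination."""
    return {
        w for w in vocabulary
        if any(matches_exact(keyword, w, i, d, s) for (i, d, s) in allowed)
    }


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a ground-truth set."""
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical index terms and ground truth, for illustration only.
vocabulary = {"secret", "secrets", "secrete", "sacret", "secrt", "market"}
relevant = {"secret", "secrets", "sacret"}

# Permit an exact match or exactly one insertion; no deletions or substitutions.
hits = constrained_search("secret", vocabulary, {(0, 0, 0), (1, 0, 0)})
print(sorted(hits), precision_recall(hits, relevant))
```

Adding `(0, 0, 1)` to the allowed set would also admit single-substitution variants such as "sacret", which raises recall at the risk of admitting more unrelated terms. Trade-offs of that kind are what the experiments on the Enron corpus set out to measure.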
Main file
472401_1_En_5_Chapter.pdf (264.66 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01988842, version 1 (22-01-2019)

License

Attribution

Cite

Kyle Porter, Slobodan Petrovic. Obtaining Precision-Recall Trade-Offs in Fuzzy Searches of Large Email Corpora. 14th IFIP International Conference on Digital Forensics (DigitalForensics), Jan 2018, New Delhi, India. pp.67-85, ⟨10.1007/978-3-319-99277-8_5⟩. ⟨hal-01988842⟩