Obtaining Precision-Recall Trade-Offs in Fuzzy Searches of Large Email Corpora

Abstract: Fuzzy search is often used in digital forensic investigations to find words that are stringologically similar to a chosen keyword. However, a common complaint is the high rate of false positives in big data environments. This chapter describes the design and implementation of cedas, a novel constrained edit distance approximate string matching algorithm that provides complete control over the types and numbers of elementary edit operations considered in approximate matches. The unique flexibility of cedas facilitates fine-tuned control of precision-recall trade-offs. Specifically, searches can be constrained to the union of matches resulting from any exact edit combination of insertion, deletion and substitution operations performed on the search term. The flexibility is leveraged in experiments involving fuzzy searches of an inverted index of the Enron corpus, a large English email dataset, which reveal the specific edit operation constraints that should be applied to achieve valuable precision-recall trade-offs. The constraints that produce relatively high combinations of precision and recall are identified, along with the combinations of edit operations that cause precision to drop sharply and the combination of edit operation constraints that maximize recall without sacrificing precision substantially. These edit operation constraints are potentially valuable during the middle stages of a digital forensic investigation because precision has greater value in the early stages of an investigation, while recall becomes more valuable in the later stages.
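
To make the idea of constraining a search to exact edit combinations concrete, the following is a minimal Python sketch. It is not the cedas algorithm described in the paper; it is a plain memoized alignment check written for illustration. The function names `matches_exact` and `constrained_search`, the alignment semantics (a substitution is counted only where characters differ) and the toy vocabulary are all assumptions introduced here.

```python
from functools import lru_cache


def matches_exact(keyword, word, ins, dele, sub):
    """Return True if `word` aligns with `keyword` using exactly
    `ins` insertions, `dele` deletions and `sub` substitutions.
    (Illustrative semantics; an assumption, not the paper's definition.)"""
    # Each insertion lengthens the result by one, each deletion shortens it.
    if len(word) != len(keyword) + ins - dele:
        return False

    @lru_cache(maxsize=None)
    def walk(i, j, ins_left, del_left, sub_left):
        # Both strings consumed: every edit in the budget must be used up.
        if i == len(keyword) and j == len(word):
            return ins_left == 0 and del_left == 0 and sub_left == 0
        if i < len(keyword) and j < len(word):
            if keyword[i] == word[j]:
                # Free match: advance in both strings.
                if walk(i + 1, j + 1, ins_left, del_left, sub_left):
                    return True
            elif sub_left > 0 and walk(i + 1, j + 1, ins_left, del_left, sub_left - 1):
                return True
        # Insertion: consume one character of `word` only.
        if j < len(word) and ins_left > 0 and walk(i, j + 1, ins_left - 1, del_left, sub_left):
            return True
        # Deletion: consume one character of `keyword` only.
        if i < len(keyword) and del_left > 0 and walk(i + 1, j, ins_left, del_left - 1, sub_left):
            return True
        return False

    return walk(0, 0, ins, dele, sub)


def constrained_search(keyword, vocabulary, allowed):
    """Return the terms matching ANY (ins, del, sub) triple in `allowed`,
    i.e. the union of the matches of each exact edit combination."""
    return [w for w in vocabulary
            if any(matches_exact(keyword, w, i, d, s) for (i, d, s) in allowed)]


# Hypothetical index terms, chosen only for illustration.
vocabulary = ["enron", "enrom", "enro", "energy", "enrons"]
# Allow exactly one substitution OR exactly one insertion.
hits = constrained_search("enron", vocabulary, {(0, 0, 1), (1, 0, 0)})
print(hits)
```

In this sketch, narrowing the `allowed` set shrinks the match set, which would tend to favor precision, while widening it admits more variants and would tend to favor recall; which constraints work well on a real corpus is what the paper's experiments measure.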
Document type: Conference papers

Cited literature: 26 references

https://hal.inria.fr/hal-01988842
Contributor: Hal Ifip
Submitted on: Tuesday, January 22, 2019 - 9:44:41 AM
Last modification on: Tuesday, February 23, 2021 - 7:22:03 PM
Long-term archiving on: Tuesday, April 23, 2019 - 2:07:11 PM

File

472401_1_En_5_Chapter.pdf
Files produced by the author(s)

Licence


Distributed under a Creative Commons Attribution 4.0 International License

Citation

Kyle Porter, Slobodan Petrovic. Obtaining Precision-Recall Trade-Offs in Fuzzy Searches of Large Email Corpora. 14th IFIP International Conference on Digital Forensics (DigitalForensics), Jan 2018, New Delhi, India. pp.67-85, ⟨10.1007/978-3-319-99277-8_5⟩. ⟨hal-01988842⟩
