Abstract: This paper extends ideas from deduplication-based data compression to bioinformatics. We demonstrate the approach on two problems: clustering a large set of DNA strings and searching for approximate matches of long substrings. Both rely on the design of what we call an approximate hashing function. The new procedure yields clustering and search results very similar to those obtained by accurate tools, while requiring much less time and memory.
https://hal.inria.fr/hal-03219482
Contributor: Pierre Peterlongo
Submitted on: Thursday, May 6, 2021 - 2:21:27 PM
Last modification on: Monday, April 4, 2022 - 9:28:27 AM
Long-term archiving on: Saturday, August 7, 2021 - 7:02:12 PM
Guy Arbitman, Shmuel Klein, Pierre Peterlongo, Dana Shapira. Approximate Hashing for Bioinformatics. CIAA 2021 - 25th International Conference on Implementation and Application of Automata, Jul 2021, Bremen, Germany. pp.1-12. ⟨hal-03219482⟩