Computational methods for de novo assembly of next-generation genome sequencing data

Rayan Chikhi 1
1 GenScale - Scalable, Optimized and Parallel Algorithms for Genomics
Inria Rennes – Bretagne Atlantique , IRISA-D7 - GESTION DES DONNÉES ET DE LA CONNAISSANCE
Abstract : In this thesis, we discuss computational methods (theoretical models and algorithms) to perform the reconstruction (de novo assembly) of DNA sequences produced by high-throughput sequencers. This problem is challenging, both theoretically and practically. The theoretical difficulty arises from the complex structure of genomes. The assembly process has to deal with reconstruction ambiguities. The output of sequencing predicts up to an exponential number of reconstructions, yet only one is correct. To deal with this problem, only a fragmented approximation of the genome is returned. The practical difficulty stems from the huge volume of data produced by sequencers, with high redundancy. Significant computing power is required to process it. As larger genomes and meta-genomes are being sequenced, the need for efficient computational methods for de novo assembly is increasing rapidly. This thesis introduces novel contributions to genome assembly, both in terms of incorporating more information to improve the quality of results, and efficiently processing data to reduce the computation complexity. Specifically, we propose a novel algorithm to quantify the maximum theoretical genome coverage achievable by sequencing data (paired reads), and apply this algorithm to several model genomes. We formulate a set of computational problems that take into account pairing information in assembly, and study their complexity. Then, two novel concepts that cover practical aspects of assembly are proposed: localized assembly and memory-efficient reads indexing. Localized assembly consists in constructing and traversing a partial assembly graph. These ingredients are implemented in a complete de novo assembly software package, the Monument assembler. Monument is compared with other state of the art assembly methods. Finally, we conclude with a series of smaller projects, exploring concepts beyond classical de novo assembly.
Document type :
Theses
Liste complète des métadonnées

Cited literature [54 references]  Display  Hide  Download

https://tel.archives-ouvertes.fr/tel-00752033
Contributor : Abes Star <>
Submitted on : Wednesday, November 14, 2012 - 4:47:33 PM
Last modification on : Friday, November 16, 2018 - 1:40:37 AM
Document(s) archivé(s) le : Saturday, December 17, 2016 - 10:15:02 AM

File

Chikhi2012.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-00752033, version 1

Citation

Rayan Chikhi. Computational methods for de novo assembly of next-generation genome sequencing data. Other [cs.OH]. École normale supérieure de Cachan - ENS Cachan, 2012. English. ⟨NNT : 2012DENS0033⟩. ⟨tel-00752033⟩

Share

Metrics

Record views

1263

Files downloads

433