Global Optimization for Scaffolding and Completing Genome Assemblies

Sebastien François 1 Rumen Andonov 1 Dominique Lavenier 1 Hristo Djidjev 2
1 GenScale - Scalable, Optimized and Parallel Algorithms for Genomics
Inria Rennes – Bretagne Atlantique , IRISA-D7 - GESTION DES DONNÉES ET DE LA CONNAISSANCE
Abstract : This work focuses simultaneously on both the scaffolding and gap filling phases of de nouveau genome assembly. Given a set of contigs and their relationships--overlaps and/or remoteness in terms of distances between them--we propose an optimization-based approach for finding the genome sequence as a longest sequence that is consistent with the given contig and linkage information. Specifically, we define a graph, which we call a contig graph, that encodes information about contigs and overlaps and mate-pair distances between them, and reduce the scaffolding problem to the problem of finding a longest simple path in that graph such that as many as possible mate-pairs distances are satisfied. Since both conditions cannot generally be simultaneously satisfied, our objective function is a linear combination of the path length and penalties for distance mismatches. ​​Unlike the shortest path problem with non-negative weights, for which efficient polynomial-time algorithms exist, the longest weighted path problem is NP-hard. We solve this problem by reformulating it as a mixed integer linear program (MILP) and develop a method that exactly solves the resulting program on genomes of up to 165 contigs and up to 6682 binary variables. An advantage of our approach is that the modeling of scaffolding as a longest path problem allows one to solve simultaneously several subtasks specific for this problem like: contig orientation and ordering, repeats, gap filling, and scaffold extension, which in other approaches are targeted as separate problems. We are not aware of previous approaches on scaffolding based on the longest path problem reduction. A drawback of the typically used strategy of constructing a set of disjoint paths, rather than a single path, is that it would require additional steps of gap filling and scaffold extension, involving additional work. Moreover, it would make impossible to find a provably optimal final solution, since, even if each separate problem is implemented optimally, their combination may not be optimal. We tested this model on a set of chloroplast and bacteria genome data and showed that it allows to assemble the complete genome as a single scaffold. Compared to the publicly available scaffolding tools that we have tested, our solution produces assemblies of significantly higher quality.
Document type :
Conference papers
Complete list of metadatas

Cited literature [14 references]  Display  Hide  Download

https://hal.inria.fr/hal-01499859
Contributor : Rumen Andonov <>
Submitted on : Saturday, April 1, 2017 - 8:31:29 AM
Last modification on : Friday, September 13, 2019 - 9:49:21 AM
Long-term archiving on : Sunday, July 2, 2017 - 12:34:54 PM

File

RR-9050 (1).pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01499859, version 1

Relations

Citation

Sebastien François, Rumen Andonov, Dominique Lavenier, Hristo Djidjev. Global Optimization for Scaffolding and Completing Genome Assemblies. International Network Optimization Conference , European Network Optimization Group (ENOG), Feb 2017, Lisboa, Portugal. pp.15. ⟨hal-01499859⟩

Share

Metrics

Record views

421

Files downloads

225