Global Optimization for Scaffolding and Completing Genome Assemblies - Inria - Institut national de recherche en sciences et technologies du numérique Accéder directement au contenu
Communication Dans Un Congrès Année : 2017

Global Optimization for Scaffolding and Completing Genome Assemblies

Optimisation globale appliquée à l’assemblage génomique.

Résumé

This work focuses simultaneously on both the scaffolding and gap filling phases of de nouveau genome assembly. Given a set of contigs and their relationships--overlaps and/or remoteness in terms of distances between them--we propose an optimization-based approach for finding the genome sequence as a longest sequence that is consistent with the given contig and linkage information. Specifically, we define a graph, which we call a contig graph, that encodes information about contigs and overlaps and mate-pair distances between them, and reduce the scaffolding problem to the problem of finding a longest simple path in that graph such that as many as possible mate-pairs distances are satisfied. Since both conditions cannot generally be simultaneously satisfied, our objective function is a linear combination of the path length and penalties for distance mismatches. ​​Unlike the shortest path problem with non-negative weights, for which efficient polynomial-time algorithms exist, the longest weighted path problem is NP-hard. We solve this problem by reformulating it as a mixed integer linear program (MILP) and develop a method that exactly solves the resulting program on genomes of up to 165 contigs and up to 6682 binary variables. An advantage of our approach is that the modeling of scaffolding as a longest path problem allows one to solve simultaneously several subtasks specific for this problem like: contig orientation and ordering, repeats, gap filling, and scaffold extension, which in other approaches are targeted as separate problems. We are not aware of previous approaches on scaffolding based on the longest path problem reduction. A drawback of the typically used strategy of constructing a set of disjoint paths, rather than a single path, is that it would require additional steps of gap filling and scaffold extension, involving additional work. Moreover, it would make impossible to find a provably optimal final solution, since, even if each separate problem is implemented optimally, their combination may not be optimal. We tested this model on a set of chloroplast and bacteria genome data and showed that it allows to assemble the complete genome as a single scaffold. Compared to the publicly available scaffolding tools that we have tested, our solution produces assemblies of significantly higher quality.
Ce travail porte particulièrement sur les étapes de scaffolding et de gap-filling lors de l’assemblage de novo d’un génome. Etant donné un ensemble de contigs et leurs relations (chevauchement et/ou éloignement), nous proposons une approche basée sur l’optimisation, qui consiste à trouver le plus long assemblage possible consistant avec les distances attendues. Nous résolvons ce problème en le formulant sous la forme d’une programme mixte en nombres entiers (MILP). Notre méthode permet de résoudre exactement des instances comptant jusqu’à 6682 variables binaires pour 165 contigs. L’un des avantages de notre modélisation est qu’elle permet de résoudre simultanément différentes sous taches spécifiques à l’assemblage complet d’un génome : déterminer l’orientation de chaque contig, ordonnancer les répétitions, compléter les scaffolds et les étendre en un assemblage unique etc. . . Ce qui constitue d’ordinaire un ensemble de problèmes à traiter séquentiellement. (Nécessitant alors des solutions spécifiques pour chacun d’entre eux). A notre connaissance, aucune autre approche ne propose de réduire le problème de l’assemblage génomique à celui de la recherche du plus long chemin.
Fichier principal
Vignette du fichier
RR-9050 (1).pdf (703.83 Ko) Télécharger le fichier
Origine : Fichiers produits par l'(les) auteur(s)
Loading...

Dates et versions

hal-01499859 , version 1 (01-04-2017)

Identifiants

  • HAL Id : hal-01499859 , version 1

Citer

Sebastien François, Rumen Andonov, Dominique Lavenier, Hristo Djidjev. Global Optimization for Scaffolding and Completing Genome Assemblies. International Network Optimization Conference , European Network Optimization Group (ENOG), Feb 2017, Lisboa, Portugal. pp.15. ⟨hal-01499859⟩
280 Consultations
257 Téléchargements

Partager

Gmail Facebook X LinkedIn More