Improvement of the assembly of heterozygous genomes of non-model organisms

Anaïs Gouin 1 Anthony Bretaudeau 2, 3 Emmanuelle D'Alençon 4 Claire Lemaitre 1 Fabrice Legeai 5, 1
1 GenScale - Scalable, Optimized and Parallel Algorithms for Genomics
IRISA-D7 - GESTION DES DONNÉES ET DE LA CONNAISSANCE, Inria Rennes – Bretagne Atlantique
2 Plateforme bioinformatique GenOuest [Rennes]
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, UR1 - Université de Rennes 1, Plateforme Génomique Santé Biogenouest®, Inria Rennes – Bretagne Atlantique
Abstract: Although the number of non-model organisms being sequenced has increased drastically, the extraction of biological information from such data is hampered by the low quality of the draft assemblies. In particular, the combination of a high level of heterozygosity and short-read sequencing leads to fragmented assemblies and to overestimation of the gene content and of the genome size. Recently, new assemblers have been developed to better handle heterozygous data. However, the complete re-assembly of a genome entails automatic and manual re-annotation tasks that are very costly. We therefore present a novel method to detect and correct false duplications due to heterozygosity (two alleles instead of one consensus sequence) in diploid draft assemblies. In addition, the method is able to relocate and merge supernumerary gene annotations. The method is based on a whole-genome self-alignment (Lastz + AxtChain) that allows the detection of highly similar regions. These can have two origins: allelic regions or duplicated regions. To distinguish between them, three criteria are used: 1/ their location inside scaffolds: unlike duplications, unmerged haplotypes come from the same locus and must share the same genomic context, 2/ their cumulative read depth (close to the expected one) and 3/ their level of redundancy in the whole assembly. Next, detected pairs of allelic regions need to be merged into a single sequence in the assembly, either by the complete deletion of the redundant scaffolds or by the construction of meta-scaffolds (scaffolds joined together), keeping only the allele present in the longest scaffold of the pair. Genes located on the merged alleles need to be correctly re-annotated. This is performed using Exonerate and Augustus. The former is used to identify the location of the deleted genes on the remaining allele. The latter is used to predict new genes or consensus ones.
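As an illustration of how the three criteria could be combined, here is a minimal Python sketch. It is not the authors' implementation: the `RegionPair` record, the `is_allelic` function, the `context_compatible` flag (assumed to be computed upstream from the scaffold positions) and the thresholds `depth_tolerance` and `max_copies` are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RegionPair:
    """One pair of highly similar regions from the self-alignment (hypothetical record)."""
    context_compatible: bool   # criterion 1, assumed to be computed upstream
    depth_a: float             # mean read depth on the first region
    depth_b: float             # mean read depth on the second region
    copies_in_assembly: int    # how many times the region occurs in the assembly


def is_allelic(pair, expected_depth, depth_tolerance=0.25, max_copies=2):
    """Return True when all three criteria point to unmerged alleles.

    depth_tolerance and max_copies are illustrative values, not the authors'.
    """
    # Criterion 2: the two alleles together should carry about the expected depth.
    cumulative = pair.depth_a + pair.depth_b
    depth_ok = abs(cumulative - expected_depth) <= depth_tolerance * expected_depth
    # Criterion 3: alleles appear as one pair, repeats tend to appear more often.
    redundancy_ok = pair.copies_in_assembly <= max_copies
    return pair.context_compatible and depth_ok and redundancy_ok


allelic = RegionPair(True, 22.0, 19.0, 2)      # two half-depth copies
duplicated = RegionPair(True, 41.0, 39.0, 2)   # each copy already at full depth
repeat = RegionPair(True, 21.0, 20.0, 5)       # present many times in the assembly
print([is_allelic(p, expected_depth=40.0) for p in (allelic, duplicated, repeat)])
```

In this toy example only the first pair passes all three tests. A true duplication fails the depth test because each copy already carries the full expected depth on its own.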
We applied this method to a heterozygous wild-type insect genome assembly. This led to a drastic reduction of the genome assembly size (consistent with the expected size estimated by flow cytometry) and to an increase of the N50. Most of the new meta-scaffolds were confirmed by several additional resources: mate pairs, BAC-end sequence mapping and synteny analysis. Moreover, about 80% of gene predictions located in removed fragments have been either relocated or merged with their complementary allele.
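The effect on the N50 can be made concrete with a short Python sketch. The `n50` function follows the usual definition (the length of the scaffold at which the cumulative length, taken from the longest scaffold down, reaches half of the assembly size). The scaffold lengths are invented for illustration and are not taken from the poster.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("empty assembly")


# Made-up draft assembly: scaffold lengths in kb.
before = [100, 80, 60, 40, 20]
# Hypothetical correction: the redundant 40 kb scaffold is deleted and the
# 80 kb and 60 kb scaffolds are joined into one 140 kb meta-scaffold.
after = [140, 100, 20]

print(sum(before), n50(before))  # 300 80
print(sum(after), n50(after))    # 260 140
```

In this made-up case the assembly shrinks from 300 kb to 260 kb while the N50 rises from 80 kb to 140 kb, the same direction of change as the result reported above.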
Document type:
Poster
Genome Informatics, Oct 2015, Cold Spring Harbor Laboratory, United States. 2015

https://hal.inria.fr/hal-01231793
Contributor: Anaïs Gouin
Submitted on: Friday, 20 November 2015 - 16:49:07
Last modified on: Tuesday, 16 January 2018 - 15:54:20
Document(s) archived on: Friday, 28 April 2017 - 16:46:35

File

genome_informatics_gouin.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01231793, version 1

Citation

Anaïs Gouin, Anthony Bretaudeau, Emmanuelle D'Alençon, Claire Lemaitre, Fabrice Legeai. Improvement of the assembly of heterozygous genomes of non-model organisms. Genome Informatics, Oct 2015, Cold Spring Harbor Laboratory, United States. 2015. 〈hal-01231793〉
