Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study

Abhishek Roy; Yanlei Diao; Uday Evani; Avinash Abhyankar; Clinton Howarth; Rémi Le Priol; Toby Bloom

doi:10.1145/3035918.3064048

Conference paper, Year: 2017



Abhishek Roy

Role: Author

University of Massachusetts [Amherst]

Yanlei Diao

Role: Author

Rich Data Analytics at Cloud Scale

University of Massachusetts [Amherst]

Uday Evani

Role: Author

New York Genome Center [New York]

Avinash Abhyankar

Role: Author

New York Genome Center [New York]

Clinton Howarth

Role: Author

New York Genome Center [New York]

Rémi Le Priol

Role: Author

University of Massachusetts [Amherst]

École polytechnique

Toby Bloom

Role: Author

New York Genome Center [New York]

Abstract

This paper presents a joint effort between a group of computer scientists and bioinformaticians to take an important step towards a general big data platform for genome analysis pipelines. The key goals of this study are to develop a thorough understanding of the strengths and limitations of big data technology for genomic data analysis, and to identify the key questions that the research community could address to realize the vision of personalized genomic medicine. Our platform, called Gesall, is based on the new "Wrapper Technology" that supports existing genomic data analysis programs in their native forms, without having to rewrite them. To do so, our system provides several layers of software, including a new Genome Data Parallel Toolkit (GDPT), which can be used to "wrap" existing data analysis programs. This platform offers a concrete context for evaluating big data technology for genomics: we report on super-linear speedup and sublinear speedup for various tasks, as well as the reasons why a parallel program could produce different results from those of a serial program. These results lead to key research questions that require a synergy between genomics scientists and computer scientists to find solutions.
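The abstract's terms "super-linear speedup" and "sublinear speedup" follow the standard definition of speedup as serial running time divided by parallel running time, compared against the number of parallel workers. The sketch below only illustrates that definition. The function names and the timings in the usage note are hypothetical, and none of it is taken from the paper or from Gesall.

```python
import math


def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Return the speedup S = T_serial / T_parallel."""
    if serial_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("running times must be positive")
    return serial_seconds / parallel_seconds


def classify_speedup(serial_seconds: float, parallel_seconds: float, workers: int) -> str:
    """Classify a speedup relative to the number of parallel workers.

    super-linear: S > workers, linear: S == workers, sublinear: S < workers.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    s = speedup(serial_seconds, parallel_seconds)
    # Treat values within floating-point tolerance of the worker count as linear.
    if math.isclose(s, workers):
        return "linear"
    return "super-linear" if s > workers else "sublinear"
```

With made-up timings, `classify_speedup(100.0, 25.0, 8)` gives a speedup of 4 on 8 workers, which is sublinear, while `classify_speedup(100.0, 10.0, 8)` gives a speedup of 12.5 on 8 workers, which is super-linear.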

Keywords

Big data; Bioinformatics; Data management

Domains

Computer science [cs]; Parallel, distributed, and shared computing [cs.DC]; Bioinformatics [q-bio.QM]; Databases [cs.DB]

Contributor: Félix Raimundo

https://inria.hal.science/hal-01683398

Submitted on: Saturday, January 13, 2018, 15:18:54

Last modified on: Friday, March 24, 2023, 14:53:06

Dates and versions

hal-01683398, version 1 (13-01-2018)

Identifiers

HAL Id: hal-01683398, version 1
DOI: 10.1145/3035918.3064048

Cite

Abhishek Roy, Yanlei Diao, Uday Evani, Avinash Abhyankar, Clinton Howarth, et al. Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study. SIGMOD '17 - ACM International Conference on Management of Data, SIGMOD ACM Special Interest Group on Management of Data, May 2017, Chicago, Illinois, United States. pp. 187-202, ⟨10.1145/3035918.3064048⟩. ⟨hal-01683398⟩


Collections

X CNRS INRIA LIX X-LIX X-DEP-INFO INRIA2 UNIV-PARIS-SACLAY GS-COMPUTER-SCIENCE

