Rethinking global analyses and algorithms for comparative genomics in a functional MapReduce style

Natalia Golenetskaya; David James Sherman

Communication Dans Un Congrès Année : 2011

Rethinking global analyses and algorithms for comparative genomics in a functional MapReduce style

(1, 2) , (1, 2)

1
2

Natalia Golenetskaya

Fonction : Auteur

Laboratoire Bordelais de Recherche en Informatique

Models and Algorithms for the Genome

David James Sherman

Fonction : Auteur
PersonId : 1356
IdHAL : david-james-sherman
ORCID : 0000-0002-2316-1005
IdRef : 09147101X

Laboratoire Bordelais de Recherche en Informatique

Models and Algorithms for the Genome

Résumé

The formalisms for describing algorithms have traditionally followed technologies. The classic separation between online and offline algorithms was inspired by the hardware of the time, taped drives and small-capacity CPUs; the RAM model was inspired by machines with relatively large random-access memories; message-passing styles (MPI) were inspired by high connected networks of parallel processors. Today we have hierarchies of storage and computation that are just as severe, going from GPU coprocessors to heterogeneous computing grids. Of the algorithms in bioinformatics, those that are used in comparative genomics are the most stressed by the increase in volume of sequence data, since they involve all- against-all global comparisons that scale geometrically with the size of the problem instance. Consequently these analyses are the most in need for new technologies for large scale deployment on fault-tolerant computing infrastructures, and it is important to choose appropriate formalisms that best match these infrastructures. The emerging paradigms of MapReduce and NoSQL provide a good starting point for reasoning about algorithms for comparative genomics. MapReduce decomposes an algorithm into two phases, a map phase, where the same function is applied in parallel to every element of the input data set, and a reduce phase, where another function is applied in parallel to sets of results from the map phase. If an algorithm can be expressed using this decomposition, then it can be executed on large (1000-node) clusters with guarantees of robustness and fault tolerance. However, not all algorithms used in bioinformatics can be usefully decomposed in this way. NoSQL systems such as Apache Cassandra distribute data across a redundant ring of computers that automatically rebalance their storage as individual machines fail or are added to the ring. Cassandra uses a balanced, distributed hash table (Merkle tree) to implement nested, key-value data schemas that are designed around a predefined set of query patterns. NoSQL storage and MapReduce algorithms can usefully cooperate for distributed, global analysis of very large datasets. We undertook a redefinition of some of our existing comparative genomics algorithms, using a tightly-coupled MapReduce and NoSQL framework. We will report on new data schemas for NoSQL storage of genomics data, and MapReduce definitions of two algorithms, one for consensus clustering of high dimension data sets, and other for inference of gene fusion and fission events across large phylogenetic groups. These definitions make it possible to perform comparative analyses on large sets of related genomes, such as those obtained using increasingly common paraphyletic sequencing strategies.

Mots clés

bioinformatics mapreduce distributed computing databases

Domaines

Bio-informatique [q-bio.QM] Bio-Informatique, Biologie Systémique [q-bio.QM]

David James Sherman : Connectez-vous pour contacter le contributeur

https://inria.hal.science/hal-00654797

Soumis le : jeudi 22 décembre 2011-23:56:22

Dernière modification le : jeudi 15 février 2024-03:31:05

Dates et versions

hal-00654797 , version 1 (22-12-2011)

Identifiants

HAL Id : hal-00654797 , version 1

Citer

Natalia Golenetskaya, David James Sherman. Rethinking global analyses and algorithms for comparative genomics in a functional MapReduce style. SeqBio 2011, Dec 2011, Lille, France. ⟨hal-00654797⟩

Exporter

BibTeX XML-TEI Dublin Core DC Terms EndNote DataCite

Collections

UNIV-RENNES1 CNRS INRIA IRISA LABRI LABRI-MABIOVIS INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM

198 Consultations

0 Téléchargements

Rethinking global analyses and algorithms for comparative genomics in a functional MapReduce style

Résumé

Mots clés

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Partager