Rethinking global analyses and algorithms for comparative genomics in a functional MapReduce style

Natalia Golenetskaya 1, 2 David James Sherman 1, 2
2 MAGNOME - Models and Algorithms for the Genome
Inria Bordeaux - Sud-Ouest, UB - Université de Bordeaux, CNRS - Centre National de la Recherche Scientifique : UMR5800
Abstract : The formalisms for describing algorithms have traditionally followed technologies. The classic separation between online and offline algorithms was inspired by the hardware of the time, taped drives and small-capacity CPUs; the RAM model was inspired by machines with relatively large random-access memories; message-passing styles (MPI) were inspired by high connected networks of parallel processors. Today we have hierarchies of storage and computation that are just as severe, going from GPU coprocessors to heterogeneous computing grids. Of the algorithms in bioinformatics, those that are used in comparative genomics are the most stressed by the increase in volume of sequence data, since they involve all- against-all global comparisons that scale geometrically with the size of the problem instance. Consequently these analyses are the most in need for new technologies for large scale deployment on fault-tolerant computing infrastructures, and it is important to choose appropriate formalisms that best match these infrastructures. The emerging paradigms of MapReduce and NoSQL provide a good starting point for reasoning about algorithms for comparative genomics. MapReduce decomposes an algorithm into two phases, a map phase, where the same function is applied in parallel to every element of the input data set, and a reduce phase, where another function is applied in parallel to sets of results from the map phase. If an algorithm can be expressed using this decomposition, then it can be executed on large (1000-node) clusters with guarantees of robustness and fault tolerance. However, not all algorithms used in bioinformatics can be usefully decomposed in this way. NoSQL systems such as Apache Cassandra distribute data across a redundant ring of computers that automatically rebalance their storage as individual machines fail or are added to the ring. Cassandra uses a balanced, distributed hash table (Merkle tree) to implement nested, key-value data schemas that are designed around a predefined set of query patterns. NoSQL storage and MapReduce algorithms can usefully cooperate for distributed, global analysis of very large datasets. We undertook a redefinition of some of our existing comparative genomics algorithms, using a tightly-coupled MapReduce and NoSQL framework. We will report on new data schemas for NoSQL storage of genomics data, and MapReduce definitions of two algorithms, one for consensus clustering of high dimension data sets, and other for inference of gene fusion and fission events across large phylogenetic groups. These definitions make it possible to perform comparative analyses on large sets of related genomes, such as those obtained using increasingly common paraphyletic sequencing strategies.
Type de document :
Communication dans un congrès
SeqBio 2011, Dec 2011, Lille, France. 2011
Liste complète des métadonnées
Contributeur : David James Sherman <>
Soumis le : jeudi 22 décembre 2011 - 23:56:22
Dernière modification le : jeudi 11 janvier 2018 - 06:22:12


  • HAL Id : hal-00654797, version 1


Natalia Golenetskaya, David James Sherman. Rethinking global analyses and algorithms for comparative genomics in a functional MapReduce style. SeqBio 2011, Dec 2011, Lille, France. 2011. 〈hal-00654797〉



Consultations de la notice