Poster communications

Reactive pipelines for integrated structural bioinformatics resources

Abstract : Integration of structural bioinformatics resources is a tremendous challenge, for the following reasons:
1. Structural bioinformatics is much more fragmented than sequence-based bioinformatics. Structure prediction tools use a plethora of formats to describe protein models: rotation matrices (for docking), normal mode amplitudes, and discretized backbone and sidechain rotamers (for folding). When converting to simple atomic coordinates, much information is lost. To achieve interoperability, it is not sufficient to have a semantic ontology for protein structure. Multiple syntactic ontologies (and their pairwise conversions) for different protein model formats are necessary.
2. Structure prediction tools are typically full-stack protocols: sampling, scoring and refinement all occur within the same tool. Tools are optimized to return a handful of models, to be used directly by biologists. This makes tool integration extremely difficult. Instead, tools should be decomposed into their constituent stages. At each stage, large numbers (up to millions) of models should be kept, to be re-ranked or filtered by downstream tools (e.g. using additional experimental information). This allows labs to focus on single-purpose tools that work well, rather than building competing and incompatible protocol stacks.
3. Structural biology databases are full of implicit dependencies, including time. For example, PDB codes are not stable URIs: PDB code 1XYZ may point to different coordinates over time, changing when the PDB entry gets updated. This means that computations are not reproducible from input parameters with PDB codes. To make computations reproducible, their inputs should be defined by the values of the input data, not their database URIs. These values should be stored as checksums, and the database should resolve any requested checksum to its value.
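The idea in point 3, identifying inputs by the checksum of their value rather than by a database code, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the abstract does not say which hash function is used (SHA-256 is chosen arbitrarily), `ChecksumStore` is a hypothetical in-memory stand-in for a database, and the two coordinate lines are made-up data representing two revisions of the same entry.

```python
import hashlib


def checksum(value: bytes) -> str:
    """Hex digest of a value. SHA-256 is an assumption, not the authors' stated choice."""
    return hashlib.sha256(value).hexdigest()


class ChecksumStore:
    """Hypothetical in-memory database that resolves a checksum to its value."""

    def __init__(self):
        self._values = {}

    def put(self, value: bytes) -> str:
        # The key is derived from the value itself, so it can never drift.
        key = checksum(value)
        self._values[key] = value
        return key

    def resolve(self, key: str) -> bytes:
        value = self._values[key]
        # Re-verify on read: the returned value must still match its checksum.
        if checksum(value) != key:
            raise ValueError("stored value does not match its checksum")
        return value


store = ChecksumStore()

# Two made-up revisions of "the same" entry, as could happen when an entry is updated.
revision_1 = b"ATOM      1  N   MET A   1      11.104  13.207   2.100\n"
revision_2 = b"ATOM      1  N   MET A   1      11.110  13.195   2.098\n"

key_1 = store.put(revision_1)
key_2 = store.put(revision_2)

# A computation that records key_1 as its input always refers to revision_1,
# whatever the entry's code points to later.
print(key_1 != key_2)
print(store.resolve(key_1) == revision_1)
```

Because the key is computed from the content, an updated entry necessarily gets a new key, whereas a database code can stay the same while the coordinates behind it change.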
At the RPBS, we are developing technologies to deal with the above problems:
- Syntactic ontologies (using a superset of JSON schema) to describe the input and output data formats of structural biology tools.
- Tracking the dependencies of a computation (including code dependencies) into a computation tree of checksums. This computation tree uniquely and deterministically defines the result of the computation.
- Checksum servers listing URI resources that serve the value of the requested checksum.
- A server to map the checksum of a computation result to its computation tree. As the inputs are often computation results themselves, this allows a computation to be tracked all the way down to the original experimental data.
- Reactive pipelines to re-evaluate computation trees as they change. This allows the automatic re-computation of a structure prediction if any of its inputs change (e.g. because of new experimental data, or if the tool itself is improved).
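As a rough illustration of how a computation tree of checksums and reactive re-evaluation could fit together, here is a toy sketch. It is not the authors' implementation: the tree layout (`"code"` plus `"inputs"`), the dictionaries `values`, `results` and `provenance`, the function `filter_models`, and the use of a version string as a stand-in for the tool's source code are all hypothetical, and SHA-256 over canonical JSON is an assumed encoding.

```python
import hashlib
import json

values = {}      # checksum -> value (stands in for a checksum server)
results = {}     # computation-tree checksum -> result checksum
provenance = {}  # result checksum -> computation tree that produced it
tools = {}       # code checksum -> callable
run_count = 0    # how many times a tool was actually executed


def store(value: bytes) -> str:
    key = hashlib.sha256(value).hexdigest()
    values[key] = value
    return key


def tree_checksum(tree: dict) -> str:
    # Canonical JSON so that the same tree always yields the same checksum.
    canonical = json.dumps(tree, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def register(func, source_stand_in: bytes) -> str:
    # A real system would checksum the tool's code; a version string stands in here.
    key = store(source_stand_in)
    tools[key] = func
    return key


def filter_models(models: bytes, allowed: bytes) -> bytes:
    """Toy downstream tool: keep only the models listed in `allowed`."""
    keep = set(allowed.split())
    return b"\n".join(m for m in models.split() if m in keep)


def evaluate(tree: dict) -> str:
    """Return the result checksum, running the tool only if the tree is new."""
    global run_count
    key = tree_checksum(tree)
    if key not in results:
        func = tools[tree["code"]]
        args = {name: values[c] for name, c in tree["inputs"].items()}
        result_key = store(func(**args))
        results[key] = result_key
        provenance[result_key] = tree
        run_count += 1
    return results[key]


code = register(filter_models, b"filter_models, version 1")
models = store(b"model1 model2 model3")

tree = {"code": code, "inputs": {"models": models, "allowed": store(b"model1 model3")}}
first = evaluate(tree)
again = evaluate(tree)  # unchanged tree: served from the cache, tool not re-run

# New experimental data changes one input checksum, hence the tree checksum.
tree_2 = {"code": code, "inputs": {"models": models, "allowed": store(b"model2")}}
second = evaluate(tree_2)
```

In this sketch, a change to any input or to the tool's code checksum produces a different tree checksum, which is what triggers re-computation; an unchanged tree is answered from `results` without running the tool. The `provenance` map plays the role of the server that maps a result checksum back to its computation tree.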

https://hal.inria.fr/hal-01925064
Contributor : Isaure Chauvot de Beauchene
Submitted on : Friday, November 16, 2018 - 2:31:30 PM
Last modification on : Friday, March 27, 2020 - 3:06:19 AM
Document(s) archived on : Sunday, February 17, 2019 - 2:14:48 PM

File

reactive_pipelines_final.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01925064, version 1

Citation

Isaure Chauvot de Beauchêne, Sjoerd de Vries. Reactive pipelines for integrated structural bioinformatics resources. 3D-BioInfo: Launch Meeting for a proposed ELIXIR Community in Structural Bioinformatics, Oct 2018, Basel, Switzerland. ⟨hal-01925064⟩
