Integration and query of biological datasets with Semantic Web technologies: AskOmics

Aurélie Evrard 1 Charles Bettembourg 2 Mélanie Jubault 1 Olivier Dameron 2, 3 Olivier Filangi 1, 4 Anthony Bretaudeau 1, 4 Fabrice Legeai 1, 5
2 Dyliss - Dynamics, Logics and Inference for biological Systems and Sequences
Inria Rennes – Bretagne Atlantique , IRISA-D7 - GESTION DES DONNÉES ET DE LA CONNAISSANCE
4 Plateforme bioinformatique GenOuest [Rennes]
Inria Rennes – Bretagne Atlantique , Plateforme Génomique Santé Biogenouest®, UR1 - Université de Rennes 1, IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires
5 GenScale - Scalable, Optimized and Parallel Algorithms for Genomics
Inria Rennes – Bretagne Atlantique , IRISA-D7 - GESTION DES DONNÉES ET DE LA CONNAISSANCE
Abstract: Over the past few years, research programs involving genetic, genomic and post-genomic sequencing of various living organisms have become fast-growing areas of biology. Once the computational challenges of processing datasets have been dealt with, large and complex biological data still remain in the hands of biologists for interpretation. Projects such as BioMart and InterMine have been developed for the international community to facilitate the exchange and comparison of complex biological data. However, for non-model organisms, large heterogeneous biological datasets can be difficult to associate in order to obtain a comprehensive view. Overall, access and interrogation remain time-consuming for biologists, and integrating publicly available data is still an open challenge. Linked data and Semantic Web technologies can benefit biologists. Using RDF (Resource Description Framework), biological data can be described using triples that associate an entity (called the subject), a relation (called the property) and a value for the relation (called the object). Data from different datasets can be integrated, and the SPARQL query language supports their analysis. Nevertheless, understanding and learning the query language can be a daunting task for biologists. Here we present AskOmics, a tool supporting both intuitive data integration and querying while shielding the user from most of the technical difficulties underlying RDF and SPARQL. The virtualization-based deployment of AskOmics makes the tool easy to manage, reliable and simple to install. For data integration, users load their data as tab-separated files structured according to simple principles. This structure allows AskOmics to automatically generate the corresponding RDF triples and to store them in a triplestore such as Fuseki or Virtuoso. At this point, the user’s data are available just as in any SPARQL endpoint.
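The conversion from a tab-separated file to RDF triples described above can be illustrated with a minimal sketch. The abstract does not spell out the "simple principles" for structuring the files, so the layout here is an assumption: the header of the first column names the entity type, the first column holds entity identifiers, and every other column is an attribute. The namespace, the function name `tsv_to_triples` and the sample row are hypothetical and are not taken from AskOmics itself.

```python
import csv
import io

# Hypothetical namespace, used only for this illustration.
BASE = "http://example.org/askomics-sketch#"


def tsv_to_triples(tsv_text, base=BASE):
    """Turn a tab-separated table into (subject, property, object) triples.

    Assumed layout (not confirmed by the abstract): the header of the
    first column is the entity type, the first column holds entity
    identifiers, and each remaining column is an attribute.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    entity_type, attributes = header[0], header[1:]
    triples = []
    for row in reader:
        if not row:
            continue  # skip empty lines
        subject = f"<{base}{row[0]}>"
        # One triple states the type of the entity.
        triples.append((subject, "rdf:type", f"<{base}{entity_type}>"))
        # One triple per attribute column, stored here as a plain literal.
        for name, value in zip(attributes, row[1:]):
            triples.append((subject, f"<{base}{name}>", f'"{value}"'))
    return triples


# Made-up sample data: one gene with a chromosome, a start and an end.
example = "Gene\tchromosome\tstart\tend\ngeneA\tchr1\t100\t500\n"
triples = tsv_to_triples(example)
for triple in triples:
    print(" ".join(triple), ".")
```

Each data row yields one typing triple plus one triple per attribute column. A real loader would also need to assign datatypes to numerical values and send the triples to the triplestore, which this sketch leaves out.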
AskOmics automatically generates an abstract representation of the dataset based on the types of the subject and object of its triples. For data querying, AskOmics provides a visually intuitive interface compatible with any SPARQL endpoint (that is, one generated by the AskOmics data generation function, or any regular triplestore). The user can then select a sequence of nodes in this simplified view, and AskOmics generates the corresponding SPARQL query, which can be executed on the original dataset. For example, it can be difficult for biologists to identify features such as genes underlying localised genomic regions delimited by genetic markers, as this requires combining different files. Tab-separated files containing genes and genetic markers can be uploaded to AskOmics under the following conventions: genetic markers and genes are identified as entities, and each entity is related to a chromosome and to start and end positions given as numerical values. The AskOmics interface allows the user, without knowledge of SPARQL, either to select genomic regions bounded by distinct markers or simply to provide numerical values as the lower and upper positions. The intersection with additional features can then be computed to produce lists of features, such as genes, underlying specific genomic regions. The result can be downloaded as a tab-separated file. AskOmics will also support the integration of external databases to compare or complete new findings; this feature is currently under development. The AskOmics principle is generic. It has been applied successfully to the analysis of large-scale datasets including genetic, epigenomic and transcriptomic profiles and orthology relationships, to identify genomic regions involved in the variability of the response of Brassicaceae (Arabidopsis, cabbage, turnip and oilseed rape) to clubroot disease.
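The marker-bounded region example can be sketched in a few lines of plain Python. This is only an illustration of the selection logic, not of how AskOmics implements it (AskOmics generates a SPARQL query instead). The `Feature` type, both function names and all coordinates are hypothetical, and the sketch assumes that a gene "underlies" a region when it lies entirely between the lower and upper positions; an overlap-based rule would be an equally plausible reading.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """A gene or genetic marker located on a chromosome (hypothetical type)."""
    name: str
    chromosome: str
    start: int
    end: int


def region_between(marker_a, marker_b):
    """Return (chromosome, lower, upper) for the region bounded by two markers."""
    if marker_a.chromosome != marker_b.chromosome:
        raise ValueError("markers must be on the same chromosome")
    lower = min(marker_a.start, marker_b.start)
    upper = max(marker_a.end, marker_b.end)
    return marker_a.chromosome, lower, upper


def genes_in_region(genes, chromosome, lower, upper):
    """Keep genes on `chromosome` lying entirely within [lower, upper].

    Full containment is an assumption made for this sketch.
    """
    return [
        g for g in genes
        if g.chromosome == chromosome and g.start >= lower and g.end <= upper
    ]


# Made-up coordinates, for illustration only.
markers = [Feature("m1", "chr1", 1000, 1000), Feature("m2", "chr1", 5000, 5000)]
genes = [
    Feature("g1", "chr1", 1200, 1800),  # inside the region
    Feature("g2", "chr1", 4800, 5200),  # extends past the upper marker
    Feature("g3", "chr2", 1500, 2000),  # on another chromosome
]

chromosome, lower, upper = region_between(markers[0], markers[1])
selected = genes_in_region(genes, chromosome, lower, upper)
print([g.name for g in selected])
```

The same `genes_in_region` call also covers the second case described above, where the user supplies the lower and upper positions directly as numbers instead of selecting markers.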
About 2.6 million triples were stored from 370,000 uploaded entities corresponding to the genomic positions of genes in the four species of the Brassicaceae family, as well as relationship data (orthology and transcriptomics). The fast queries made it possible to identify lists of genes with specific expression profiles and their corresponding orthologs in the three other species.
Document type:
Poster
Journées Ouvertes Biologie, Informatique et Mathématiques (JOBIM 2016), Jun 2016, Lyon, France
https://hal.inria.fr/hal-01391087
Contributor: Olivier Dameron
Submitted on: Thursday, 3 November 2016 - 17:47:30
Last modified on: Tuesday, 16 January 2018 - 15:54:20

Identifiers

  • HAL Id : hal-01391087, version 1

Citation

Aurélie Evrard, Charles Bettembourg, Mélanie Jubault, Olivier Dameron, Olivier Filangi, et al. Integration and query of biological datasets with Semantic Web technologies: AskOmics. Journées Ouvertes Biologie, Informatique et Mathématiques (JOBIM 2016), Jun 2016, Lyon, France. 〈hal-01391087〉
