An application suite based on the IFB Container as a Service platform

Francois Moreews 1, 2, Olivier Sallou 3, Olivier Collin 3
1 Dyliss - Dynamics, Logics and Inference for biological Systems and Sequences
Inria Rennes – Bretagne Atlantique , IRISA-D7 - GESTION DES DONNÉES ET DE LA CONNAISSANCE
3 Plateforme bioinformatique GenOuest [Rennes]
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, UR1 - Université de Rennes 1, Plateforme Génomique Santé Biogenouest®, Inria Rennes – Bretagne Atlantique
Abstract: IFB, the French Elixir node, is a national service infrastructure that provides bioinformatics services and resources [1]. IFB's goal is to offer scientific users and developers a scalable, flexible and user-friendly computation facility coupled with a large storage capacity, as required for current life science data processing. To analyze heterogeneous biological data, bioinformaticians need hundreds of specialized software tools, ranging from well-established tools to research prototypes. These tools are used alone or in workflows, from a GUI or the command line, for production, testing or development. Providing a complete and up-to-date set of tools therefore requires substantial resources. To serve this diversity of usages efficiently, we propose a software architecture and a cloud model that provide solutions for tool packaging, rapid deployment and multi-channel software distribution. We describe here the set of technical components that we built to enable a Container as a Service (CaaS) model adapted to an academic bioinformatics cloud facility.

BioShaDock

BioShaDock [2] is the community-based container registry for bioinformatics of the French Institute of Bioinformatics. It focuses on the reproducibility of bioinformatics tools and pipelines using Docker containers. Containers are built automatically in the background, with security scans and metadata extraction. Metadata can include general information as well as ontology terms. The BioShaDock registry already provides a large catalog of tools contributed directly by users or drawn from projects such as Bioconda or Debian. The registry is open source, can be used by anyone, and is accessible with any Docker or rkt client. With a dedicated, domain-centric Docker registry, computer scientists and bioinformaticians can more easily disseminate their programs and find potential users.
There is a wide range of possible uses for container registries in bioinformatics: community-managed repositories of tools embedded in containers allow users to exchange and replicate data analyses.

GO-Docker

GO-Docker [3] is a batch computing and cluster management tool that uses Docker as its execution and isolation system. It is dedicated to containers and has both a command line client and a web front end. It uses Docker Swarm and Apache Mesos and is compatible with Google Kubernetes. A common concern about container solutions for cloud or HPC relates to potential security issues. First, Docker relies on the Linux kernel cgroups feature, which can be used to isolate the resource usage of each user. Furthermore, we implemented SSL certificates and LDAP authentication in the GO-Docker REST API, so that users must authenticate before accessing the job scheduler that manages the nodes where containers run. In addition, depending on the audience and exposure of the facility, an even safer setup can be obtained by using virtualized computation nodes. Developers accustomed to the command line can use the GO-Docker CLI, which emulates classical scheduler commands. GO-Docker has a rich REST API used by its clients. These clients (Python or Java) can be used in scripts or in SaaS front ends.

Galaxy to Docker

Galaxy is a widely adopted, user-friendly web front end for biological data processing. It provides powerful functionalities that improve the accessibility and reproducibility of data analysis. It is well suited to integrating existing command line tools and offers a large collection of bioinformatics software. However, integrating each tool requires the manual, offline creation of an XML descriptor and sometimes additional wrappers, which remains a technical and time-consuming task. We propose to bypass this limitation by enabling the direct execution of a command line within any ad hoc container from a trusted repository such as BioShaDock, using the GO-Docker Python API.
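The idea of running a user-supplied command line inside a container taken from a trusted repository can be illustrated with a minimal sketch. It does not use the GO-Docker Python API, whose calls are not described in this abstract; it only assumes the standard `docker run --rm` command of the Docker CLI. The registry host `registry.example.org`, the image name and the helper functions are hypothetical placeholders chosen for illustration.

```python
from __future__ import annotations

import shlex

# Hypothetical allowlist: a placeholder host, not the actual BioShaDock address.
TRUSTED_REGISTRIES = {"registry.example.org"}


def registry_of(image: str) -> str:
    """Return the registry host of an image reference such as 'host/ns/name:tag'."""
    first, sep, _ = image.partition("/")
    # Docker treats the first path component as a registry host only when it
    # contains '.' or ':' or equals 'localhost'; otherwise Docker Hub is implied.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def build_docker_command(image: str, command_line: str) -> list[str]:
    """Build a 'docker run' argument vector for a user-supplied command line."""
    if registry_of(image) not in TRUSTED_REGISTRIES:
        raise ValueError(f"untrusted registry for image {image!r}")
    # shlex.split tokenizes the command line without invoking a shell.
    return ["docker", "run", "--rm", image, *shlex.split(command_line)]


if __name__ == "__main__":
    # Hypothetical image and command, for illustration only.
    argv = build_docker_command(
        "registry.example.org/bioinfo/samtools:1.3",
        "samtools view -h input.bam",
    )
    print(argv)
```

In a real deployment the resulting command would be submitted to the scheduler rather than executed locally; the sketch only shows the trust check and the assembly of the command.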
The Galaxy to Docker component makes it possible to create and use new “on demand tools” in a Galaxy instance without being an administrator and without any coding. Advanced users can thus easily and quickly include custom developments in their data analysis pipelines. This results in a more flexible Galaxy environment.

D4WP

The D4 workflow portal (D4WP) [4] is an advanced, developer-oriented SaaS environment for rapid tool and workflow design. It supports online graphical authoring of workflows and components. Any command line tool or script can be quickly captured and integrated using a full WYSIWYG approach. All workflow component dependencies can be defined as containers using a URI syntax. In this way, a re-executable and self-contained workflow specification can be produced. D4WP integrates a GO-Docker scheduler API. From a single specification, code generation can target different languages to maximize potential workflow usage and dissemination. Current developments focus on Galaxy tool generation and Common Workflow Language export.

The software components presented here allow the creation of reproducible and flexible data analysis environments for different audiences (end users and developers) and multiple purposes (production data analysis, benchmarking, workflow, tool and method development, dissemination, article publishing…). All tools embedded in containers, made available in BioShaDock and scheduled with GO-Docker are directly usable in Galaxy, in D4WP and from the command line. We think that such an architecture limits deployment overhead and software integration cost, and therefore accelerates the transfer of bioinformatics research output to production computation facilities. In a context of massive biological data production, the CaaS model offers interesting prospects. For instance, when data movement is limited by network capacity, deploying the whole CaaS environment on data production nodes may be a pragmatic solution.
Furthermore, the suite of software components presented here is developed to fit the long-term objective of creating a federation of interoperable clouds. Future work will include dissemination-related features as well as compatibility and standardization efforts.

References

1. IFB cloud: The academic cloud of the French Institute of Bioinformatics. http://www.france-bioinformatique.fr/
2. Moreews F, Sallou O, Ménager H et al. BioShaDock: a community driven bioinformatics shared Docker-based tools registry. F1000Research. 2015.
3. Sallou O, Monjeaud C. GO-Docker: Batch scheduling with containers. IEEE Cluster 2015. 2015.
4. Moreews F. Design and share data analysis workflows. Application to bioinformatics intensive treatments. Thesis, Université de Rennes 1. 2015. http://workflow.genouest.org
Document type:
Conference paper
ECCB 2016 (Elixir Talks) , Sep 2016, Den Haag, Netherlands. 2016, 〈http://www.eccb2016.org〉

https://hal.inria.fr/hal-01394295
Contributor: Francois Moreews
Submitted on: Wednesday, 9 November 2016 - 10:29:21
Last modified on: Wednesday, 16 May 2018 - 11:23:53

Identifiers

  • HAL Id : hal-01394295, version 1

Citation

Francois Moreews, Olivier Sallou, Olivier Collin. An application suite based on the IFB Container as a Service platform. ECCB 2016 (Elixir Talks) , Sep 2016, Den Haag, Netherlands. 2016, 〈http://www.eccb2016.org〉. 〈hal-01394295〉
