ESA Datalabs: an e-Science Platform for Data Exploitation and Preservation at ESAC
2020-11-11, 07:00–07:15, Times in UTC

Since e-science emerged as a tangible approach (a data-intensive approach to science, geared towards discovery), astronomy has arguably been its most successful example: it is a perfect fit because most of its data is public and carries neither privacy concerns nor commercial value. More recently, we have entered what could be called the golden age of surveys, with several large-scale projects spanning decades, some finished, some ongoing, and others planned. ESA is responsible for, or is a major partner in, several of these initiatives.

This change is profound, and data has become the major technological challenge. Increases of multiple orders of magnitude in dataset size mean that transferring data to a scientist is often infeasible. But size is only one aspect of a data-intensive domain. There are layers of ingestion, curation and analysis happening in parallel and across many communities. Preservation is vital and remains, in general, a largely unsolved problem, both on the technology side and in public policy. Finally, curation and analysis also create new challenges that intersect with the push for open science.

We present the current status of the development of the ESA Datalabs platform. This system allows users to bring their code to ESA’s infrastructure and have direct access to ESA’s archives. Datalabs are full computational environments, and our catalogue of Datalabs ranges from new tools that have become de facto standards for analysis to complex legacy systems repackaged to run via a web browser. The underlying architecture of ESA Datalabs is domain-agnostic; it fosters research and innovation by integrating transversal access to big data, containerised applications, notebook technologies and domain-specific software. For example, customised JupyterLab environments are readily available for astronomers, scientists in Earth observation-related fields, and researchers in global navigation. Moreover, ESA Datalabs’ support for development environments such as Octave and for reference tools such as TOPCAT in astronomy enables the reuse of existing codebases.

We will discuss the challenges faced in developing a multi-domain exploitation platform capable of fulfilling user requirements that range from the execution of simple notebooks to machine learning algorithms and science pipelines. Finally, we will show the functionality already available to users, as well as the plan for the platform’s future evolution.

Theme – Science Platforms and Data Lakes, Cloud Computing at Different Scales