Using Docker in a radio-astronomy environment
2020-11-11, 12:15–12:30 (times in UTC)

Docker is widely used by the astronomy community and is arguably one of the most common choices for containerisation, despite being aimed primarily at the web, IoT and cloud industries. Containers are already an important part of the research ecosystem: used properly, they can shorten development and deployment times and ensure a consistent processing software environment across multiple machines. There is, however, a significant difference between deploying a Node.js web application and running a real-time pipeline that must process tens of gigabytes of raw data per second. Used improperly, Docker can misuse and underutilise resources that are often limited compared to those at the disposal of industry. In the worst-case scenario, a badly implemented pipeline can corrupt the data and return incorrect results.

In this talk I will present lessons learned from developing and maintaining the Docker-based pipeline used in the MeerTRAP single-pulse search. The pipeline runs a time-domain search for radio transients, such as pulsars and Fast Radio Bursts. Distributed across 65 compute nodes, it consists of multiple stages with overlapping CPU and GPU processing: the initial processing relies heavily on C++, while the post-processing, including machine-learning candidate classification, is written in Python. I will present the techniques we have refined over almost two years of building and maintaining the Docker images for the MeerTRAP project: writing clean and concise Dockerfiles that produce fast, lean images for production deployments; reducing the overall size of the Docker ecosystem; and maintaining a reliable private repository.
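
To give a flavour of one such technique, below is a minimal multi-stage Dockerfile sketch of the kind that keeps production images lean: the code is compiled in a throwaway build stage with the full toolchain, and only the resulting binary is copied into a slim runtime image. This is an illustrative sketch rather than the actual MeerTRAP Dockerfile; the CUDA image tags, the source paths and the single_pulse_search binary name are assumed placeholders.

    # Build stage: full CUDA toolchain for compiling the C++/GPU code.
    # Image tags, paths and the binary name are illustrative placeholders.
    FROM nvidia/cuda:11.0-devel-ubuntu20.04 AS build
    RUN apt-get update && apt-get install -y --no-install-recommends \
            build-essential cmake \
        && rm -rf /var/lib/apt/lists/*
    COPY . /src
    RUN cmake -S /src -B /src/build && cmake --build /src/build

    # Runtime stage: only the runtime CUDA libraries and the compiled
    # binary make it into the final, much smaller production image.
    FROM nvidia/cuda:11.0-runtime-ubuntu20.04
    COPY --from=build /src/build/single_pulse_search /usr/local/bin/
    ENTRYPOINT ["single_pulse_search"]

A multi-stage build like this can shrink a deployed image from the several gigabytes of a development image down to little more than the runtime libraries and the binary itself.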


Theme – Data Processing Pipelines and Science-Ready Data, Data Interoperability