Software Engineer at The University of Manchester.
Using Docker in a radio-astronomy environment
Docker is widely used in the astronomy community and is arguably the most common choice for containerisation, despite being aimed primarily at the web, IoT and cloud industries. Already an important part of the research ecosystem, containers, when used properly, can shorten development and deployment times and ensure a consistent processing software environment across multiple machines. There is, however, a significant difference between deploying a web application with Node.js and running a real-time pipeline that has to process tens of gigabytes of raw data per second. Used improperly, Docker can misuse and underutilise resources, which are often limited compared with those at the disposal of 'the industry'. In the worst case, a badly implemented pipeline can corrupt the data and return incorrect results.
In this talk I will present lessons learned from developing and maintaining a Docker-based pipeline used as part of the MeerTRAP single-pulse search efforts. The pipeline runs a time-domain search for transients, such as pulsars and Fast Radio Bursts, in the radio part of the electromagnetic spectrum. Distributed across 65 compute nodes, it consists of multiple stages with overlapping CPU and GPU processing: the initial processing makes heavy use of C++, while the post-processing, including machine-learning candidate classification, is written in Python. I will present the techniques we have developed over almost two years of building and maintaining the Docker images for the MeerTRAP project: writing clean and concise Dockerfiles that produce fast, lean images for production deployments; reducing the overall size of the Docker ecosystem; and maintaining a reliable private repository.
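One technique of this kind is a multi-stage build, where the compiler toolchain lives only in a throwaway build stage and the final image carries just the binaries. The sketch below is illustrative only, not the actual MeerTRAP build: the base image tags, source paths and binary name are assumptions.

```dockerfile
# Stage 1: build the C++ processing code with the full CUDA toolchain.
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
        cmake g++ && \
    rm -rf /var/lib/apt/lists/*
COPY src/ /src/
RUN cmake -S /src -B /build && cmake --build /build

# Stage 2: the runtime image keeps only the binary, not the
# toolchain, so the production image stays small.
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
COPY --from=build /build/pipeline /usr/local/bin/pipeline
ENTRYPOINT ["/usr/local/bin/pipeline"]
```

Combining related `RUN` commands and cleaning the apt cache in the same layer, as above, also keeps individual layers from carrying unnecessary files into the final image.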
It worked on my laptop! - how to approach reproducibility in astronomy?
With large observatories that provide data to thousands of astronomers around the world already online, or still in design and construction, it is now more important than ever to approach the problem of reproducibility in astronomy. The last few years have seen wide adoption of solutions that address some of the reproducibility concerns, such as containers and Jupyter Notebooks. These help to provide a consistent processing environment by, for example, locking users to a single version of Python. They can, however, give a false sense of security: at the lowest level they do not account for possible hardware differences, and at a higher level a lack of clear software and data-format documentation can lead to easily avoidable mistakes. This is especially important in the new era of multi-wavelength astronomy, where teams from different backgrounds, using different tools and file formats, come together to solve the same problem.
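The version-locking mentioned above is only as strict as the image definition makes it. A minimal sketch (the tag and package names are illustrative, not from any particular project) of the difference between a loose and a pinned environment:

```dockerfile
# Loose: "python:3" resolves to whatever the latest 3.x image is
# at build time, so two builds months apart can differ silently.
# FROM python:3

# Pinned: an exact interpreter version and exact package versions
# give every rebuild the same environment.
FROM python:3.10.12-slim
COPY requirements.txt .
# requirements.txt pins versions, e.g. "numpy==1.24.3"
RUN pip install --no-cache-dir -r requirements.txt
```

Even with everything pinned, the resulting container still runs on the host's CPU and GPU, which is exactly the hardware-level gap that such pinning cannot close.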
Considering all of the above, what do we expect from reproducibility? What are we willing to sacrifice to achieve it, and do we have to sacrifice anything at all? Can we, as a wider community, come together and develop a clear set of guidelines and standards that ensure the maximum possible reproducibility? And if 100% reproducibility is not achievable, how do we make sure that all the relevant parties are aware of the shortcomings and can include them in their analysis?