Distributed Streaming Radio Astronomy Reduction with Dask
2020-11-12, 11:45–12:00, Times in UTC

From 2010, the biyearly doubling in processor transistor counts predicted by Moore’s Law has slowed, along with the associated increases in single processor speeds. In a bid to offer an edge over competitors, processor manufacturers began offering multiple processors on a single die, leading to the current multi-core processor era. Today, multi-core CPUs are ubiquitous, while the massive performance offered by GPUs is predicated on multi-core programming models and architecture.

While greatly contributing to the ability to process large data volumes, simply adding more cores has proven insufficient to process the sheer quantity of contemporary data, often referred to as “Big Data”. Horizontal scaling, or the use of multiple compute nodes within either HPC or Cloud Computing environments, is routinely used along with strategies such as MapReduce, and cluster computing frameworks such as Spark.

Additionally, the software required to operate in such environments has necessarily changed: While the 1990s and early 2000s advocated Object-Orientated Programming, Big Data leans towards a streaming, chunked, functional programming style with minimal shared state. Consequently, individual tasks handling chunks of data can be flexibly scheduled on multiple cores and nodes. Legacy radio astronomy codes do not readily adapt to this paradigm. To process the quantities of data produced by contemporary radio telescopes such as MeerKAT, and future telescopes such as the SKA using the aforementioned paradigms, radio astronomy codes will need to adapt appropriately.

In this talk we will cover the principles and practices of developing HPC code with Dask, a lightweight Python parallelisation and distribution framework that seamlessly integrates with the PyData ecosystem to address the above challenges. We have found that the intersection of these technologies provides a rich ecosystem for contemporary Radio Astronomy software development. It has already led to a number of diverse packages (covered elsewhere at this conference) at various stages of development and/or release. These include tricolour, xova, shadeMS and QuartiCal.


Theme – Data Processing Pipelines and Science-Ready Data