Scientific Software Developer at the South African Radio Astronomy Observatory (SARAO)
Distributed Streaming Radio Astronomy Reduction with Dask
Since around 2010, the doubling of processor transistor counts every two years predicted by Moore's Law has slowed, along with the associated increases in single-processor speeds. In a bid to offer an edge over competitors, processor manufacturers began placing multiple processor cores on a single die, leading to the current multi-core processor era. Today, multi-core CPUs are ubiquitous, while the massive performance offered by GPUs is predicated on many-core programming models and architectures.
While additional cores greatly improve the ability to process large data volumes, they alone have proven insufficient for the sheer quantity of contemporary data, often referred to as "Big Data". Horizontal scaling, the use of multiple compute nodes within either HPC or Cloud Computing environments, is routinely employed, along with strategies such as MapReduce and cluster computing frameworks such as Spark.
Additionally, the software required to operate in such environments has necessarily changed: while the 1990s and early 2000s advocated Object-Oriented Programming, Big Data leans towards a streaming, chunked, functional programming style with minimal shared state. Consequently, individual tasks handling chunks of data can be flexibly scheduled across multiple cores and nodes. Legacy radio astronomy codes do not readily adapt to this paradigm. To process the data volumes produced by contemporary radio telescopes such as MeerKAT, and future telescopes such as the SKA, using the aforementioned paradigms, radio astronomy codes will need to adapt appropriately.
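The chunked, functional style described above can be sketched with Dask's delayed interface. This is a minimal illustration, not code from the talk: the `load_chunk`, `process` and `combine` functions are hypothetical stand-ins for pipeline stages, each a pure function with no shared state, so the scheduler is free to place each task on any core or node.

```python
import dask
from dask import delayed

# Hypothetical pipeline stages: each is a pure function over one
# chunk of data, so tasks can be scheduled independently.
@delayed
def load_chunk(i):
    # Stand-in for reading chunk i from disk.
    return list(range(i * 4, (i + 1) * 4))

@delayed
def process(chunk):
    # Stand-in for per-chunk computation.
    return sum(x * x for x in chunk)

@delayed
def combine(partials):
    # Reduce the per-chunk results.
    return sum(partials)

# Build the compute graph lazily, then execute it; Dask schedules
# the independent process(load_chunk(i)) tasks in parallel.
graph = combine([process(load_chunk(i)) for i in range(3)])
result = graph.compute()
```

Nothing runs until `compute()` is called; until then `graph` is only a description of the work, which is what allows flexible placement of tasks.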
In this talk we will cover the principles and practices of developing HPC code with Dask, a lightweight Python parallelisation and distribution framework that seamlessly integrates with the PyData ecosystem to address the above challenges. We have found that the intersection of these technologies provides a rich ecosystem for contemporary radio astronomy software development. It has already led to a number of diverse packages (covered elsewhere at this conference) at various stages of development and release, including tricolour, xova, shadeMS and QuartiCal.
Parallel Radio Astronomy Application Development with Dask and Numba
Dask is a lightweight Python parallelisation and distribution framework that seamlessly integrates with the PyData ecosystem to provide a rich framework for developing Parallel and Distributed Radio Astronomy Applications.
In this demo, we will show how to create a multi-core radio astronomy application using a combination of Dask and Numba.
Time permitting, we aim to cover the following topics:
- General compute graph concepts.
- The Dask Array abstraction and its relation to NumPy arrays.
- The effect of Dask Array chunk sizes on application memory usage and performance.
- Exposing CASA Tables as xarray Datasets and Table Columns as Dask Arrays.
- Choosing appropriate chunking strategies.
- Updating and writing CASA Tables from Dask Arrays.
- JIT-compiling Python code with Numba to speeds comparable to equivalent C/Fortran code.
- The relationship between performance, input size and releasing the Python Global Interpreter Lock.
- Wrapping accelerated Numba code in Dask for multi-core performance.
- Scaling a Dask application up to a High Performance Computing cluster.
- Managing the cluster with dask_jobqueue.
- Annotating and manually scheduling specific tasks for optimal placement and performance.