shadeMS: rapid plotting of Big radio interferometry Data

Radio interferometry data was big well before “Big Data” was a glint in a proto-data-scientist’s eye. The raw outputs of a radio interferometer, i.e. the complex visibility data and all associated metadata, while of little interest to the end-user astronomer per se, contain a wealth of information about the functioning of the instrument and software pipelines, and can provide vital diagnostics during the entire data reduction process. It is therefore important to be able to visualize them in all sorts of ways. However, the sheer size of these datasets (e.g. upwards of a billion measurements for even a short MeerKAT observation) calls for fairly sophisticated plotting techniques that can represent both dense data and outliers, and do it in a reasonable timeframe. This is well beyond the capabilities of our trusted workhorse Matplotlib.

Two recently developed technologies make a solution possible. The Datashader suite (https://datashader.org), driven by Big Data developments in multiple fields, provides functionality for rendering huge datasets onto two-dimensional canvases, using a variety of aggregation and categorization options. The Dask-MS library (https://dask-ms.readthedocs.io; see also Perkins, this conference) provides a native mapping from the Measurement Set, the standard radio astronomy data format, to Dask arrays, which facilitate massively parallel computation (and are natively supported by Datashader).
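
To make the idea of rendering onto a canvas with aggregation and categorization concrete, the following is a minimal pure-Python sketch of per-pixel, per-category counting. It does not use Datashader's API; the function name rasterize, the toy data, and the category labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def rasterize(points, x_range, y_range, width, height):
    """Bin (x, y, category) points onto a width x height canvas.

    Returns a dict mapping (row, col) -> Counter of categories, i.e. a
    per-pixel, per-category count. Points outside the ranges are dropped.
    """
    x0, x1 = x_range
    y0, y1 = y_range
    canvas = {}
    for x, y, cat in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue
        # Map data coordinates to pixel indices; clamp the upper edge.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        canvas.setdefault((row, col), Counter())[cat] += 1
    return canvas

# Hypothetical toy data: (x, y, category) triples.
points = [
    (0.1, 0.2, "XX"),
    (0.1, 0.2, "YY"),
    (0.15, 0.22, "XX"),
    (0.9, 0.95, "XX"),  # an isolated point still occupies its own pixel
    (2.0, 0.5, "YY"),   # outside x_range, so it is dropped
]
canvas = rasterize(points, (0.0, 1.0), (0.0, 1.0), width=4, height=4)
```

However many points are fed in, the result has at most width x height entries, which is why dense regions and isolated outliers can both be represented on the same fixed-size image.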

The shadeMS tool (https://github.com/ratt-ru/shadeMS) brings these two technologies together to allow for the rapid plotting of radio interferometry data. The premise of shadeMS is to support the plotting of anything versus anything, aggregated by anything and coloured (i.e. categorized) by anything, via a straightforward command-line or Python interface. The use of Dask means that a large number of cores can be efficiently exploited, making the plotting process I/O-limited in many cases. This allows data processing pipelines to produce a rich variety of diagnostic plots with relatively little overhead.
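
The reason this kind of plotting parallelizes well can be sketched as follows: a count aggregation can be computed independently on each chunk of data and the partial canvases summed afterwards. The sketch below uses Python's standard thread pool purely as a stand-in for Dask and is not taken from shadeMS; the function names and toy chunks are hypothetical, and threads running pure Python code will not give a real speed-up, so only the structure is meant to carry over.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def aggregate_chunk(chunk, width, height):
    """Count points per pixel for one chunk of (x, y) pairs in [0, 1)."""
    counts = Counter()
    for x, y in chunk:
        counts[(int(y * height), int(x * width))] += 1
    return counts

def aggregate_parallel(chunks, width, height, workers=4):
    """Aggregate each chunk independently, then sum the partial canvases."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda c: aggregate_chunk(c, width, height), chunks)
        for partial in partials:
            total.update(partial)  # Counter.update adds counts together
    return total

# Hypothetical toy chunks, standing in for blocks of a chunked array.
chunks = [
    [(0.1, 0.1), (0.6, 0.6)],
    [(0.2, 0.2), (0.7, 0.9)],
    [(0.05, 0.4)],
]
total = aggregate_parallel(chunks, width=2, height=2)
```

Because the merge step is a simple sum, the result is the same as aggregating all points in one pass, and the per-chunk work can be spread over as many workers as there are chunks.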


Theme – Data Processing Pipelines and Science-Ready Data