Oleg Smirnov is the SKA Research Chair at the Rhodes University Centre for Radio Astronomy Techniques & Technologies (RATT), as well as head of the Radio Astronomy Research Group at the South African Radio Astronomy Observatory (SARAO). His research interests are calibration and imaging algorithms, software, and pipelines for radio interferometry.
Radiopadre: remote, interactive, zero-admin visualization of data pipeline products
Modern [not only radio] astronomers are coming to terms with being separated from their data and pipelines. Sheer data size alone dictates that data reductions are hardly ever “local” in any sense; rather, they have to run on a big node or cluster somewhere remote, with SSH gateways and network latency in between. The new work patterns of the COVID-19 pandemic only exacerbate this separation. At the same time, the complexity of new telescopes and pipelines results in a far greater volume and variety of intermediate diagnostics and final data products. The following scenario is becoming familiar: my pipeline run has finished (or crashed), having produced 300 log files, 200 intermediate plots, 50 FITS images, and a dozen HTML reports -- on a remote cluster node which doesn’t even have a basic image viewer installed (and which, given the network lag, would have been painful to use in any case). How do I make sense of all this without first transferring gigabytes of products to my laptop or local workstation?
Radiopadre (Python Astronomy Data Reductions Examiner, https://github.com/ratt-ru/radiopadre) provides (at least part of) the answer. It is a combination of a client-side script, Docker or Singularity images, a Jupyter Notebook framework, and integrated browser-based FITS viewers (CARTA and JS9) that allows for quick visualization of remote data products. Radiopadre is virtually zero-admin, in the sense that it requires nothing more than a web browser on the client side, an SSH connection, and Docker or Singularity support on the remote end. It supports both interactive (exploratory) visualization via a Jupyter Notebook and the development of rich, extensive report-style notebooks tuned to the outputs of a particular pipeline.
The demo will showcase the interactive visualization capabilities of Radiopadre, using the output of various MeerKAT imaging pipelines as a working example.
shadeMS: rapid plotting of Big radio interferometry Data
Radio interferometry data was big well before “Big Data” was a glint in a proto-data-scientist’s eye. The raw outputs of a radio interferometer, i.e. the complex visibility data and all associated metadata, while of little interest to the end-user astronomer per se, contain a wealth of information about the functioning of the instrument and software pipelines, and can provide vital diagnostics during the entire data reduction process. It is therefore important to be able to visualize them in all sorts of ways. However, the sheer size of these datasets (e.g. upwards of a billion measurements for even a short MeerKAT observation) calls for fairly sophisticated plotting techniques that can represent both dense data and outliers, and do so within a reasonable timeframe. This is well beyond the capabilities of our trusted workhorse Matplotlib.
Two recently developed technologies make a solution possible. The Datashader suite (https://datashader.org), driven by Big Data developments in multiple fields, provides functionality for rendering huge datasets onto two-dimensional canvases, using a variety of aggregation and categorization options. The Dask-MS library (https://dask-ms.readthedocs.io, see also Perkins, this conference) provides a native mapping from the Measurement Set, the standard radio astronomy data format, to Dask arrays, which facilitate massively parallel computation (and are natively supported by Datashader).
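The core idea behind this style of rendering -- aggregating every sample onto a fixed-size pixel canvas rather than overplotting markers -- can be sketched in miniature with NumPy. This is an illustrative toy, not Datashader's actual API; the `rasterize` function and canvas dimensions below are invented for the sketch, with counting standing in for the richer aggregation options Datashader offers:

```python
import numpy as np

def rasterize(x, y, width=200, height=100):
    """Aggregate (x, y) points onto a width x height canvas by counting.
    Every sample lands in exactly one pixel, so dense regions and lone
    outliers are both represented, no matter how many points there are."""
    counts, _, _ = np.histogram2d(
        y, x, bins=(height, width),
        range=((y.min(), y.max()), (x.min(), x.max())))
    return counts

# a few million synthetic "amplitude vs. time"-style points
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, 2_000_000)
y = rng.normal(0.0, 1.0, 2_000_000)

canvas = rasterize(x, y)
print(canvas.shape)  # (100, 200) -- size is fixed regardless of data volume
```

The output canvas can then be colour-mapped for display; its memory footprint depends only on the plot resolution, not on the number of input points, which is what makes billion-row datasets tractable.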
The shadeMS tool (https://github.com/ratt-ru/shadeMS) brings these two technologies together to allow for the rapid plotting of radio interferometry data. The premise of shadeMS is to support the plotting of anything versus anything, aggregated by anything and coloured (i.e. categorized) by anything, via a straightforward command-line or Python interface. The use of Dask means that a large number of cores can be efficiently exploited, making the plotting process I/O-limited in many cases. This allows data processing pipelines to produce a rich variety of diagnostic plots with relatively little overhead.
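Why Dask parallelism pays off here can be sketched with standard-library tools: a count aggregation is associative, so each chunk of data can be rasterized onto its own canvas independently and the per-chunk canvases simply summed. This toy (not shadeMS's actual code; all names and sizes are invented for illustration) uses threads where shadeMS would use Dask workers over Measurement Set chunks:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 200, 100
XRANGE, YRANGE = (0.0, 1.0), (-5.0, 5.0)  # fixed ranges shared by every chunk

def rasterize_chunk(seed):
    """Rasterize one chunk of synthetic points onto its own count canvas."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(*XRANGE, 250_000)
    y = rng.normal(0.0, 1.0, 250_000)
    counts, _, _ = np.histogram2d(y, x, bins=(HEIGHT, WIDTH),
                                  range=(YRANGE, XRANGE))
    return counts

# chunks are independent and count canvases simply add, so the work
# spreads over cores cleanly -- Dask performs the same map-and-reduce
# across worker processes until disk I/O becomes the bottleneck
with ThreadPoolExecutor() as pool:
    canvas = sum(pool.map(rasterize_chunk, range(4)))

print(canvas.shape)
```

Because the reduction is just an elementwise sum of small arrays, adding cores scales the map stage almost linearly, which is why the plotting process ends up I/O-limited in practice.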
ADASS 2021 will be in...