Interactive exploration framework for big data sets
2020-11-11, 06:45–07:00, Times in UTC

Astronomy is a discipline that routinely faces big data challenges. Large volumes of observational data must be processed in order to find interesting phenomena, and automatic or semi-automatic approaches are welcome aids for this cumbersome task. In the past, several machine learning methods have been proposed to organize, classify, or condense big data sets. However, this is not the end of the road: in most cases, researchers still need to analyse the automatically preprocessed data by hand to draw meaningful conclusions.

To streamline the data analysis pipeline, we propose a generic front-end framework that allows the user not only to process the data automatically, but also to interactively explore and investigate the results of machine learning procedures. A compact visualization gives an initial overview and can be adjusted to highlight the parts of interest. Generic interaction functions such as zooming, scrolling, filtering, and labeling allow crucial data fragments to be found and marked in an intuitive way.

We present the prototype UltraPINK to demonstrate the idea of such an explorative visualization interface. UltraPINK is a web application to train, store, load, and browse self-organizing Kohonen maps. Using the Parallelized rotation and flipping INvariant Kohonen maps framework (PINK) as a back-end, it generates and displays a clear representation of common shapes in the data set. The close-up view of a map shows the generated prototypes and offers different methods to analyse them. Single prototypes can be selected and used to find the data points that resemble them the most. Furthermore, prototypes can be labeled, and the given label can be transferred to all similar data points. It is also possible to view and label outliers that do not resemble any of the prototypes. All annotations made to the original data set can be downloaded in various formats.
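
The label transfer and outlier handling described above can be illustrated with a minimal sketch. It is not UltraPINK's or PINK's actual code: the function names, the plain Euclidean distance (PINK itself uses a rotation- and flipping-invariant comparison), the example labels, and the `outlier_threshold` parameter are all assumptions made for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors.

    Assumption: a stand-in for the rotation- and flipping-invariant
    similarity that a real back-end would provide.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_matching_prototype(point, prototypes):
    """Return (index, distance) of the prototype closest to the point."""
    dists = [distance(point, p) for p in prototypes]
    idx = min(range(len(dists)), key=dists.__getitem__)
    return idx, dists[idx]

def transfer_labels(points, prototypes, prototype_labels, outlier_threshold):
    """Give each point the label of its closest prototype.

    Points farther than outlier_threshold (hypothetical parameter) from
    every prototype are marked "outlier"; points whose closest prototype
    has no label yet receive None.
    """
    labels = []
    for point in points:
        idx, dist = best_matching_prototype(point, prototypes)
        if dist > outlier_threshold:
            labels.append("outlier")
        else:
            labels.append(prototype_labels.get(idx))
    return labels

# Toy data: two prototypes, labeled by the user in the map view.
prototypes = [[0.0, 0.0], [10.0, 10.0]]
prototype_labels = {0: "compact", 1: "extended"}
points = [[0.5, -0.5], [9.0, 11.0], [50.0, 50.0]]

result = transfer_labels(points, prototypes, prototype_labels, outlier_threshold=5.0)
print(result)  # ['compact', 'extended', 'outlier']
```

In this toy example the third point lies far from both prototypes and is therefore flagged as an outlier instead of inheriting a label.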

While our prototype is currently tailored specifically to the PINK back-end, the investigation and labeling functions follow a generic pattern that can easily be adapted to various kinds of input data and machine learning algorithms. The ultimate goal is an abstract framework that accepts arbitrary data types and algorithms.
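
One way to picture such a generic pattern is a small back-end contract that the front-end programs against. The sketch below is purely hypothetical: the `Backend` protocol, the `ManhattanBackend` toy class, and the `most_similar` helper are invented here for illustration and do not describe the actual UltraPINK design.

```python
from typing import List, Protocol, Sequence

Vector = Sequence[float]

class Backend(Protocol):
    """Hypothetical contract a machine learning back-end would fulfil."""

    def prototypes(self) -> List[Vector]: ...

    def distance(self, point: Vector, prototype: Vector) -> float: ...

class ManhattanBackend:
    """Toy back-end with fixed prototypes, standing in for any algorithm."""

    def __init__(self, prototypes: List[Vector]) -> None:
        self._prototypes = prototypes

    def prototypes(self) -> List[Vector]:
        return self._prototypes

    def distance(self, point: Vector, prototype: Vector) -> float:
        # Sum of absolute coordinate differences (L1 distance).
        return sum(abs(x - y) for x, y in zip(point, prototype))

def most_similar(backend: Backend, data: List[Vector],
                 prototype_index: int, k: int) -> List[int]:
    """Indices of the k data points closest to the selected prototype."""
    proto = backend.prototypes()[prototype_index]
    order = sorted(range(len(data)),
                   key=lambda i: backend.distance(data[i], proto))
    return order[:k]

backend = ManhattanBackend([[0.0, 0.0], [5.0, 5.0]])
data = [[4.0, 6.0], [0.0, 1.0], [5.0, 5.0], [9.0, 9.0]]

nearest = most_similar(backend, data, prototype_index=1, k=2)
print(nearest)  # [2, 0]
```

Because `most_similar` only relies on the two methods of the contract, a different algorithm or data type could be plugged in without changing the exploration code.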


Theme – Science Platforms and Data Lakes, Machine Learning, Statistics, and Algorithms