Claudia Comito

After a PhD and 15 years "post-doc" in high-mass star formation between Bonn and Cologne, in 2018 I moved to the Jülich Supercomputing Center, where I help design and write scientific HPC applications. I am lucky to be one of the HeAT devs (https://github.com/helmholtz-analytics/heat) and help researchers from assorted fields of science speed up their number crunching.


Affiliation – Jülich Supercomputing Center
Position – Research Scientist
GitHub ID – ClaudiaComito
Homepage – www.linkedin.com/in/claudia-comito-jsc

Talks

Analyzing Scientific Big Data with the Helmholtz Analytics Toolkit (HeAT)

The exponential increase in data size over recent years means researchers are scrambling to port their previously cluster-bound data analysis to full-blown high-performance computing (HPC) applications. In the 2020s, astrophysicists might get used to applying for supercomputing time to calibrate, image, and analyze their data, just as naturally as they apply for telescope time.

Python is the standard programming language within the scientific community, with the SciPy stack the clear reference for data analysis. Parallelizing SciPy code can be relatively straightforward if the algorithm itself is "embarrassingly parallel" (as in: chunk up the data, ship the chunks to the available compute nodes, run the calculations in single-node mode on each chunk). For more complex problems, which require ad-hoc communication among CPUs/GPUs and generally sound HPC training, data scientists today are still mostly on their own.
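To make the "embarrassingly parallel" pattern concrete, here is a minimal sketch using only the Python standard library. It is an illustration of the chunk/ship/compute idea, not HeAT code: the function names are hypothetical, and a thread pool on a single machine stands in for the compute nodes of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    base, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional element each.
        stop = start + base + (1 if i < extra else 0)
        chunks.append(data[start:stop])
        start = stop
    return chunks


def sum_of_squares(chunk):
    """The 'single-node' calculation: it only ever sees its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    """Chunk the data, hand each chunk to a worker, combine partial results.

    Assumption: a thread pool stands in for separate compute nodes here.
    """
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    # The only "communication" step is this final reduction.
    return sum(partial_results)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), n_workers=3))
```

The workers never need to talk to each other, which is what makes this case easy. As soon as a worker needs data held by another one (a distributed matrix product, a clustering step, a global sort), this simple pattern no longer suffices.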

The Helmholtz Analytics Toolkit (HeAT) is meant to bridge this gap. HeAT is an open-source Python tensor library for scientific parallel computing and machine learning. Under the hood, low-level operations and high-level algorithms are optimized to exploit the available resources, be it a dual-core laptop or a supercomputer. At the same time, HeAT's NumPy-like API makes it straightforward for SciPy users to implement HPC applications, or to parallelize their existing ones. HeAT relies on PyTorch for its data objects, which provides fast on-process operations and GPU support. Our recent benchmarks show that the current early-phase HeAT can achieve a speed-up of up to two orders of magnitude compared to similar Python frameworks.

In this talk, I will show you HeAT's inner workings and what to keep in mind when you import heat as np to parallelize your astrophysical data analysis.