June 3, 2021 – As part of a series to share best practices in preparing applications for the Aurora supercomputer, ALCF highlights researchers’ efforts to optimize code to run efficiently on graphics processors.
The ATLAS experiment
To prepare the ATLAS experiment for the exascale era of computing, developers are readying the code that will allow the experiment to perform its simulation and data analysis tasks on a range of next-generation architectures, including the upcoming Intel-HPE Aurora system to be housed at the Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy Office of Science user facility at Argonne National Laboratory.
The ATLAS experiment – located at CERN’s Large Hadron Collider (LHC), the world’s largest particle accelerator – uses a 25-meter-high, 44-meter-long cylinder equipped with magnets and other instruments woven around a central beam pipe to measure and record phenomena related to the subatomic particles scattered by proton-proton collisions. The signals generated by the ATLAS instruments provide important information about the physics of each collision and its effects through the computational reconstruction of the events.
An ATLAS physics analysis consists of three steps. First, in event generation, the researchers use the physics they know to model the kinds of particle collisions that take place in the LHC. In the next step, simulation, they generate the measurements that the ATLAS detector would subsequently make. Finally, reconstruction algorithms are run on both simulated and real data, and the outputs are compared to identify differences between theoretical prediction and measurement.
Such measurements led to the discovery of the Higgs boson, but hundreds of petabytes of data remain to be analyzed, and the computational needs of the experiment will increase by an order of magnitude or more over the next decade. This need is compounded by the imminent completion of the high-luminosity LHC upgrade project.
In addition, ATLAS requires an immense simulation effort for Standard Model and background modeling, as well as for general detector and upgrade studies.
To meet these demands, the developers are writing code that can run on a variety of architectures.
FastCaloSim, a code for fast parameterized calorimeter simulation, has been written in CUDA, SYCL, and Kokkos and executed on the ALCF’s Aurora testbeds – that is, on Intel, NVIDIA, and AMD GPU-based systems – in addition to non-accelerated setups.
The developers are now also implementing a Kokkos version of MadGraph, an event generator for LHC experiments that performs the particle physics calculations needed to produce the interactions expected in the LHC detectors.
As a framework, MadGraph aims to provide complete Standard Model and beyond-Standard-Model phenomenology, including such elements as cross-section computations and event manipulation and analysis.
The developers started with a CUDA implementation of the MadGraph algorithm and then ported it to Kokkos. The Kokkos version ran the algorithm on Intel CPU systems using OpenMP as the backend for parallel threading, on NVIDIA GPU-based setups using CUDA as the backend, and on the Intel GPU testbeds housed at Argonne’s Joint Laboratory for System Evaluation (JLSE).
Kokkos is preferred in the case of MadGraph because it is a third-party programming library written in C++ that allows developers to write their code in a single framework: when Kokkos-structured code is compiled, it is compiled for each architecture on which it runs. This is beneficial for researchers in high energy physics, as they only need to write the complex algorithms at the center of their work once, instead of rewriting those algorithms multiple times to create specific variants compatible with each chipmaker’s software.
Due to its preponderance of complex calculations, MadGraph requires a great deal of computing time – and will require even more in the future, particularly once the high-luminosity upgrade of the LHC is completed in the middle of the decade. After the upgrade, data throughput will increase by a factor of 10 to 20, but this only sets a minimum baseline for the amount of simulation to be generated: achieving optimal performance would require a further factor of 10 – that is, a 100- to 200-fold growth in simulation.
Since one of the developers’ main goals is to understand the limits of performance portability across different architectures, the performance of each code implementation was compared against the others. The Kokkos version of MadGraph was able to perform on a par with the “vanilla” CUDA implementation, with metrics falling within 10 percent of each other.
The main hurdle in implementing a Kokkos version of FastCaloSim was that building shared libraries requires every symbol in a kernel to be visible to, and resolvable by, a single compilation unit. As a result, all kernels had to be packaged into a single file, which incorporates the remaining kernel files through a series of #include directives.
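As a hypothetical illustration of this pattern (the file names below are invented, not those of the actual FastCaloSim source), the single compilation unit reduces to a wrapper file of includes:

```cpp
// KokkosWrapper.cc -- the one translation unit the shared library builds.
// Pulling the kernel implementation files in via #include makes every
// kernel symbol visible to, and resolvable by, this single unit.
#include "SimKernels_kk.cc"   // hypothetical kernel file
#include "HistLoad_kk.cc"     // hypothetical kernel file
#include "RandomHits_kk.cc"   // hypothetical kernel file
```

Only this wrapper is handed to the device compiler, so symbols defined in one kernel file can be referenced from another without cross-unit device linking.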
The developers redesigned a large number of functions and files to minimize code duplication while maximizing the number of identical code paths between the CUDA and Kokkos implementations.
One advantage of using Kokkos’s CUDA backend is its full interoperability with “pure” CUDA, i.e., CUDA functions can be called from Kokkos kernels. This interoperability enabled an incremental porting process and had the added benefit of simplifying validation.
In keeping with their focus on architectural diversity, the developers used a wide range of Kokkos backends for FastCaloSim, including various NVIDIA GPUs, multiple AMD GPUs, Intel integrated and discrete GPUs, and the parallel host pThreads and OpenMP backends.
The developers concluded that FastCaloSim was severely underutilizing the GPU. While this was mitigated somewhat by batching data from multiple events, more complex programs may require significant code refactoring. However, the underutilization suggests that a single GPU could be shared by multiple CPU processes, reducing hardware costs.
Comparing the Kokkos and CUDA variants of FastCaloSim, the developers found that the concepts of the portability layer generally translate well, even if the explicit syntax differs – though certain elements (such as views and buffers) do entail additional overhead.
Ultimately, after considerable effort and amid a rapidly evolving compiler landscape, FastCaloSim ran successfully on every “flavor” of GPU attempted: Intel integrated GPUs and Xe-HP GPUs with DPC++, NVIDIA GPUs with SYCL using a CUDA backend, and AMD GPUs with hipSYCL (an implementation of SYCL on top of NVIDIA CUDA/AMD HIP).
Source: Nils Heinonen, ALCF