March 10, 2022 – A team funded by the Exascale Computing Project (ECP) has ported and scaled WarpX, a particle-in-cell (PIC) code for solving the motion of relativistic, charged particles in the presence of electromagnetic fields, to GPU-based supercomputers like Summit and the upcoming machines Aurora and Frontier. Their work addresses earlier limitations of the code, including the use of Fortran kernels for PIC operations such as current deposition, field gathering, particle pushers, and field solvers. In addition, manual optimization for specific architectures had resulted in a lack of portability. The researchers ported the PIC kernels from Fortran to C++ and developed a framework in which CUDA, HIP, or DPC++ is used to offload kernels, depending on the computing platform (e.g., NVIDIA, AMD, Intel). This advance eliminates the need for mixed-language programming, which adds significant complications and often inhibits compiler optimizations, and provides a relatively consistent programming model across platforms. The researchers’ work was published in the September 2021 ECP special issue of Parallel Computing.
WarpX can be used for a variety of applications in plasma physics. The ultimate goal is for WarpX to help design smaller, less expensive particle accelerators based on wakefield acceleration. This would impact the US Department of Energy’s (DOE) Discovery Science Mission while also resulting in a range of societal benefits including industrial, environmental and medical applications. With all DOE supercomputers now GPU-based, porting WarpX to GPU platforms was critical and required refactoring, re-implementing, and rethinking many core algorithms.
The researchers achieved a more than 100x improvement over their pre-ECP baseline on CPU machines. Noting the importance of an optimized memory footprint, minimized kernel launch latency, and proper use of the memory hierarchy, the team expects its optimizations to carry over to other accelerator architectures as well as CPU-based machines. They also share the lessons they learned with others who are porting CPU-based codes to exascale computers. Methods developed during this work are all likely to be useful for other simulation codes, both inside and outside the PIC community. These include particle sorting to improve cache reuse in particle-mesh operations, memory pools to reduce the overhead associated with device memory allocation, and kernel fusion to minimize the overhead associated with GPU kernel launch latency. Current and future work includes further optimization of the current deposition and parallel communication routines, as well as improvements for non-NVIDIA GPUs.
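To make the particle-sorting idea concrete, here is a minimal one-dimensional sketch in Python. It is an illustration only and is not taken from WarpX: the function name `sort_particles_by_cell`, the uniform cell width `dx`, and the assumption that every position lies in `[0, n_cells * dx)` are all hypothetical choices made for this example. It uses a counting sort to build a permutation that groups particles by the mesh cell they occupy, so that particles touching the same mesh data sit next to each other in memory.

```python
from typing import List, Tuple


def sort_particles_by_cell(
    positions: List[float], dx: float, n_cells: int
) -> Tuple[List[int], List[int]]:
    """Group particles by cell with a counting sort (illustrative sketch).

    Assumes a 1D uniform mesh and positions in [0, n_cells * dx).
    Returns (perm, offsets): perm lists particle indices in cell order, and
    particles of cell c occupy perm[offsets[c]:offsets[c + 1]].
    """
    # Cell index of each particle, clamped to the last cell for safety.
    cells = [min(int(x / dx), n_cells - 1) for x in positions]

    # Count particles per cell, shifted by one slot for the prefix sum.
    counts = [0] * (n_cells + 1)
    for c in cells:
        counts[c + 1] += 1

    # Inclusive prefix sum turns counts into start offsets per cell.
    for i in range(n_cells):
        counts[i + 1] += counts[i]
    offsets = counts[:]

    # Scatter particle indices into their cell's slice, preserving order.
    cursor = counts[:-1]
    perm = [0] * len(positions)
    for p, c in enumerate(cells):
        perm[cursor[c]] = p
        cursor[c] += 1
    return perm, offsets


if __name__ == "__main__":
    perm, offsets = sort_particles_by_cell([0.9, 0.1, 0.5, 0.15, 0.95], 0.25, 4)
    print(perm, offsets)
```

After sorting, a loop over `perm` visits all particles of one cell before moving to the next, which is the access pattern that tends to improve cache reuse when depositing onto or gathering from the mesh. A production GPU code would typically perform the same counting and prefix-sum steps in parallel on the device.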
Authors and Citation
A. Myers, A. Almgren, L. D. Amorim, J. Bell, L. Fedeli, L. Ge, K. Gott, D. P. Grote, M. Hogan, A. Huebl, R. Jambunathan, R. Lehe, C. Ng, M. Rowan, O. Shapoval, M. Thévenet, J.-L. Vay, H. Vincenti, E. Yang, N. Zaïm, W. Zhang, Y. Zhao, and E. Zoni. “Porting WarpX to GPU-Accelerated Platforms.” 2021. Parallel Computing (September).