HipBone, GPU-aware asynchronous tasks, autotuning and more


In this regular feature, HPCwire highlights newly published research in the high-performance computing community and related fields. From parallel programming to exascale to quantum computing, the details are here.

A 2D array of third-order spectral elements split between two MPI processes. Photo credit: Chalmers et al.

HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark

Using three HPC systems at Oak Ridge National Laboratory (the Summit supercomputer and the Frontier early-access clusters Spock and Crusher), the research team, which includes two authors from AMD, demonstrated the performance of hipBone, an open-source proxy application for the Nek5000 computational fluid dynamics applications. HipBone “is a fully GPU-accelerated C++ implementation of the original NekBone CPU proxy application with several novel algorithm and implementation improvements that optimize its performance on modern fine-grain parallel GPU accelerators.” The tests demonstrate the “portability of hipBone across different clusters and very good scaling efficiency, especially for large problems.”

Authors: Noel Chalmers, Abhishek Mishra, Damon McDougall and Tim Warburton

A case for intra-rack resource disaggregation in HPC

A cross-institutional research team used Cori, a high-performance computing system at the National Energy Research Scientific Computing Center, to “analyze resource disaggregation to enable finer-grained mapping of hardware resources to applications.” In their article, the authors also profile a “variety of deep learning applications that represent an emerging workload.” The researchers showed that, for a rack configuration and applications similar to those of Cori, a central processing unit with intra-rack disaggregation “has a 99.5 percent chance of finding all the resources it needs in its rack.”

Authors: George Michelogiannakis, Benjamin Klenk, Brandon Cook, Min Yee Teh, Madeleine Glick, Larry Dennison, Keren Bergman and John Shalf

MPI 3D Jacobi example (Jacobi3D) with a manual overlap option. Photo credit: Choi et al.

Improving scalability with GPU-aware asynchronous tasks

Computer scientists from the University of Illinois at Urbana-Champaign and Lawrence Livermore National Laboratory demonstrated improved scalability by using GPU-aware asynchronous tasks to hide communication behind computation. According to the authors, hiding communication behind computation can be very effective in weak-scaling scenarios, but performance begins to suffer at smaller problem sizes or under strong scaling “due to fine-grained overhead and reduced scope for overlap.” The authors integrated GPU-aware communication into asynchronous tasks, in addition to overlapping computation with communication, aiming to reduce the time spent on communication and further increase GPU utilization. They demonstrated the performance impact of their approach using “a proxy application that runs the Jacobi iteration method on GPUs, Jacobi3D.” In their article, the authors also delve into “techniques like kernel fusion and CUDA graphs to combat fine-grained overheads at scale.”

Authors: Jaemin Choi, David F. Richards and Laxmikant V. Kale
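
For readers unfamiliar with the overlap pattern described in this item, the following is a minimal sketch of the general idea, not the authors' Jacobi3D code. It makes several simplifying assumptions: the grid is 1D rather than 3D, the domain is split into just two blocks, and a Python thread stands in for GPU-aware communication. All function names are hypothetical. Each step starts the halo exchange, updates the cells that need no neighbor data while the exchange is in flight, and then finishes the two cells at the block boundary.

```python
from concurrent.futures import ThreadPoolExecutor


def jacobi_reference(u, steps):
    """Plain 1D Jacobi sweeps on one undivided domain (end values stay fixed)."""
    u = list(u)
    for _ in range(steps):
        new = u[:]
        for i in range(1, len(u) - 1):
            new[i] = 0.5 * (u[i - 1] + u[i + 1])
        u = new
    return u


def exchange_halos(left, right):
    """Stand-in for communication (assumption: a real code would use MPI or
    GPU-aware messaging). Returns the neighbor value each block needs."""
    ghost_for_left = right[0]    # value just past the left block's last cell
    ghost_for_right = left[-1]   # value just before the right block's first cell
    return ghost_for_left, ghost_for_right


def jacobi_overlapped(u, steps):
    """Same sweeps on two blocks, overlapping halo exchange with interior work.
    Requires at least two cells per block."""
    mid = len(u) // 2
    left, right = list(u[:mid]), list(u[mid:])
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(steps):
            # 1. Start the halo exchange asynchronously.
            pending = pool.submit(exchange_halos, left, right)

            # 2. Meanwhile, update cells that depend only on local data.
            new_left, new_right = left[:], right[:]
            for i in range(1, len(left) - 1):
                new_left[i] = 0.5 * (left[i - 1] + left[i + 1])
            for i in range(1, len(right) - 1):
                new_right[i] = 0.5 * (right[i - 1] + right[i + 1])

            # 3. Wait for the halo values to arrive.
            ghost_for_left, ghost_for_right = pending.result()

            # 4. Finish the two cells on the block boundary.
            new_left[-1] = 0.5 * (left[-2] + ghost_for_left)
            new_right[0] = 0.5 * (ghost_for_right + right[1])

            left, right = new_left, new_right
    return left + right


if __name__ == "__main__":
    start = [1.0] + [0.0] * 7   # hot left edge, cold everywhere else
    print(jacobi_overlapped(start, 25))
```

The split version performs the same arithmetic as the undivided one, so the two should agree. In this toy there is very little interior work to hide the exchange behind, which loosely mirrors the shrinking room for overlap the authors describe at small problem sizes.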

A convolutional neural network based approach to computational fluid dynamics

To overcome the cost, time, and memory disadvantages of computational fluid dynamics (CFD) simulations, this Indian research team proposed “using a convolutional neural network-based model to predict non-uniform flows in 2D.” They define CFD as “the visualization of how a fluid moves and interacts with things as it flows by using applied mathematics, physics and computer software.” The authors’ approach “aims to support the behavior of fluid particles in a given system and to support the evolution of the system based on the fluid particles that traverse it. In the early stages of design, this technique can provide rapid, real-time feedback for design revisions.”

Authors: Satyadhyan Chickerur and P Ashish

A single block of the variational wave function in the form of parameterized quantum circuits. Photo credit: Rinaldi et al.

Matrix model simulations using quantum computing, deep learning, and lattice Monte Carlo

This international research team conducted “the first systematic survey for quantum computing and deep learning approaches to matrix quantum mechanics.” While “Euclidean lattice Monte Carlo simulations are the de facto numerical tool for understanding the spectrum of large matrix models and have been used to test holographic duality,” the authors write, “they are not tailored to extract dynamical properties or even the quantum wave function of the ground state of matrix models.” The authors compare the deep learning approaches with lattice Monte Carlo simulations and provide basic benchmarks. The research used RIKEN’s HOKUSAI “BigWaterfall” supercomputer.

Authors: Enrico Rinaldi, Xizhi Han, Mohammad Hassan, Yuan Feng, Franco Nori, Michael McGuigan and Masanori Hanada

GPTuneBand: Multi-task and multi-fidelity autotuning for large-scale, high-performance computing applications

A group of researchers from Cornell University and Lawrence Berkeley National Laboratory propose a multi-task and multi-fidelity autotuning framework called GPTuneBand for optimizing high-performance computing applications. “GPTuneBand combines a Bayesian multi-task optimization algorithm with a multi-armed bandit strategy well suited for optimizing expensive HPC applications such as numerical libraries, scientific simulation codes, and machine learning models, especially with a very limited tuning budget,” the authors write. Compared with its predecessor, GPTuneBand showed a maximum speedup of 1.2x, and it outperformed the single-task, multi-fidelity tuner BOHB on 72.5 percent of tasks.

Authors: Xinran Zhu, Yang Liu, Pieter Ghysels, David Bindel and Xiaoye S. Li

High performance computing architecture for sample processing in the smart grid

In this open-access article, a group of researchers from the University of the Basque Country, Spain, present a high-quality interface solution for application designers that addresses the challenges of current smart grid technologies. Arguing that FPGAs offer superior performance and reliability over CPUs, the authors present a “solution to accelerate computation of hundreds of streams, combining a custom silicon-based IP and a new-generation field-programmable gate array-based accelerator card.” The researchers use Xilinx’s FPGAs and adaptive computing framework.

Authors: Le Sun, Leire Muguira, Jaime Jiménez, Armando Astarloa and Jesús Lázaro

Do you know of any research that should be included in next month’s list? If so, email us at [email protected]. We look forward to hearing from you.

