Exascale Computing Project: RAJA Portability Suite enables performance-portable HPC codes for CPUs and GPUs


By Rob Farber on behalf of the Exascale Computing Project

A growing number of HPC applications are required to deliver high performance on CPU and GPU hardware platforms. One software tool that is now available and showing tremendous promise for the exascale era is the open source RAJA Portability Suite. RAJA is part of the NNSA software portfolio of the Exascale Computing Project (ECP) and is also supported by the ECP Programming Models and Runtimes division.

ECP and NNSA production application developers are finding that RAJA is easy to integrate into their applications so that they can run on new hardware, be it CPU or GPU based, while maintaining high performance on existing computing platforms. [i] The RAJA Portability Suite is the core of the LLNL ASC application GPU porting strategy according to the HPC Best Practices Webinar of March 10, 2021, “An Overview of the RAJA Portability Suite”, which also states: “The RAJA Portability Suite is well on the way to being ready for the next generation of platforms, including Exascale.”

RAJA provides a kernel API designed to build and transform complex parallel kernels without changing the kernel source code. Implemented using C++ templates, this API helps isolate the application source code from the hardware and underlying programming model details so that subject matter experts (SMEs) can express the parallelism of their calculations while they focus on writing correct code. Expressing that parallelism, as discussed below, requires a basic understanding of the underlying hardware platform and verification that application kernels run correctly in parallel.
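
The idea of selecting the execution back-end through a C++ template parameter can be illustrated with a minimal, self-contained sketch. The names below (seq_policy, reverse_policy, forall, daxpy) are hypothetical stand-ins written for this illustration and are not RAJA's API; in RAJA itself, the analogous entry point is RAJA::forall, which takes an execution policy as a template parameter. The kernel body is written once, and only the policy type changes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical execution policies for this sketch; these are NOT RAJA types.
struct seq_policy {};      // run iterations in increasing order
struct reverse_policy {};  // run iterations in decreasing order, standing in
                           // for "some other back-end"

// Toy forall: the policy tag selects an implementation at compile time.
template <typename Body>
void forall(seq_policy, std::size_t begin, std::size_t end, Body body) {
  for (std::size_t i = begin; i < end; ++i) body(i);
}

template <typename Body>
void forall(reverse_policy, std::size_t begin, std::size_t end, Body body) {
  for (std::size_t i = end; i > begin; --i) body(i - 1);
}

// The kernel (y = a * x + y) is written once and templated on the policy.
template <typename Policy>
std::vector<double> daxpy(double a, const std::vector<double>& x,
                          std::vector<double> y) {
  assert(x.size() == y.size());
  // The loop body never mentions how or where it is executed.
  forall(Policy{}, 0, y.size(), [&](std::size_t i) { y[i] += a * x[i]; });
  return y;
}
```

Calling daxpy<seq_policy>(2.0, x, y) and daxpy<reverse_policy>(2.0, x, y) produces the same result because the loop iterations are independent of one another. That independence is the property a developer has to verify before handing a kernel to a parallel back-end.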

Rich Hornung

The RAJA API enables separation of concerns, which allows developers with specialized expertise in performance analysis to optimize application performance for specific hardware platforms without disrupting application source code. RAJA developers also work to optimize the suite's internal implementations so that multiple application teams can benefit from their efforts. RAJA development draws on the expertise of over 38 contributors and 8 members of the core project team, as well as interactions with vendors to support new hardware from IBM, NVIDIA, AMD, Intel and Cray. Rich Hornung, RAJA Project Leader and member of the High Performance Computing Group at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory, states, “We are seeing good results from our benchmark suite on all ECP target platforms. We also work with vendors to further improve performance.”

A production-proven performance abstraction

Most LLNL ASC applications, as well as a number of ECP applications and non-ASC LLNL applications, rely on the RAJA Portability Suite to run on a variety of platforms. LLNL’s institutional RADIUSS project promotes and funds the integration of the RAJA Portability Suite in non-ASC applications.

Figure 2: Core LLNL ASC application usage. (Source: An Overview of the RAJA Portability Suite)

In the RAJA Portability Suite, RAJA provides the kernel execution API. Other tools in the suite offer portable memory management capabilities. In particular, the CHAI library implements a “managed array” abstraction that automatically moves data, when needed, into the memory space used by the device running a kernel. CHAI relies on the Umpire package in the suite, which provides a unified portable memory management API for CPUs and GPUs. An example of how these libraries interact when running a GPU code is shown in Figure 3 below.

Figure 3: An example of a code excerpt and a data flow that shows the interaction of the components of the RAJA Portability Suite. (Source: An Overview of the RAJA Portability Suite)
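
The on-demand data movement described above can be sketched with a toy model. This is an illustration only: ToyManagedArray, Space, and run_kernel are hypothetical names, both "spaces" live in ordinary host memory, and the class does not reproduce CHAI's or Umpire's actual interfaces. It assumes, for simplicity, that every kernel may write to the array, so the space that ran the most recent kernel always holds the valid copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical memory spaces; in this toy both are plain host buffers.
enum class Space { Host, Device };

// Toy "managed array": one buffer per space, copied only when the requested
// space does not hold the most recent data. Not CHAI's API.
class ToyManagedArray {
 public:
  explicit ToyManagedArray(std::size_t n) : host_(n, 0.0), device_(n, 0.0) {
    assert(n > 0);  // the toy assumes a non-empty array
  }

  // Return a pointer usable in `space`, migrating the data first if that
  // space is stale. Every access is treated as a potential write.
  double* data(Space space) {
    if (space != valid_) {
      if (space == Space::Device) {
        device_ = host_;
      } else {
        host_ = device_;
      }
      valid_ = space;
      ++copies_;
    }
    return space == Space::Device ? device_.data() : host_.data();
  }

  std::size_t size() const { return host_.size(); }
  int copies() const { return copies_; }  // number of migrations so far

 private:
  std::vector<double> host_, device_;
  Space valid_ = Space::Host;
  int copies_ = 0;
};

// Toy kernel launcher: requests the data in the space where the kernel runs,
// so the caller never issues an explicit copy.
template <typename Body>
void run_kernel(Space where, ToyManagedArray& a, Body body) {
  double* p = a.data(where);
  for (std::size_t i = 0; i < a.size(); ++i) body(p, i);
}
```

In this sketch, two consecutive kernels in the same space trigger no copy, while switching spaces triggers exactly one. That is the behavior the managed-array abstraction is meant to hide from the application programmer.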

The end result is a set of APIs that application programmers can use to run across a diverse hardware ecosystem. The RAJA, CHAI, and Umpire APIs are each exposed to the application programmer, as shown in Figure 4 (below); Figure 3 (above) shows a sample set of calls. The RAJA Portability Suite also uses a low-level collection of macros and metaprogramming facilities in the Concepts And Meta-Programming Library (CAMP). An important goal of the CAMP project is to achieve broad compiler compatibility across HPC-oriented systems and thereby to help ensure portability and longevity. As shown in Figure 4, CAMP is not directly exposed to application programmers, although applications can use it if necessary.

Figure 4: A suite of libraries offers portability from a single source. (Source: An Overview of the RAJA Portability Suite)

The proof is in the performance

As mentioned at the beginning of this article, the RAJA Portability Suite is already being used in production. The ECP examples shown in Figure 5 below show the variety of hardware platforms and application domains that already benefit from the RAJA Portability Suite.

Figure 5: The RAJA Portability Suite supports a variety of applications and production hardware platforms. (Source: An Overview of the RAJA Portability Suite)

The RAJA team reports that each of these ECP applications shows impressive performance gains on pre-exascale platforms: [ii]

  • LLNL ATDM application (high-order ALE hydrodynamics simulations using RAJA and Umpire)
    • Node-to-node speedup:
      • 15x: Sierra (2 P9 + 4 V100) vs. CTS-1 Intel Cascade Lake (48-core CPUs)
      • 30x: Sierra vs. Astra (Cavium ThunderX2 28-core CPUs)
  • SW4 application (high-resolution earthquake simulations using RAJA and Umpire)
    • Node-to-node speedup:
      • 16x: Sierra vs. CTS-1 Intel Cascade Lake
      • 32x: Sierra vs. CTS-1 Intel Broadwell
  • GEOSX application (subsurface solid mechanics simulations using RAJA, Umpire, and CHAI)
    • Node-to-node speedup:
      • 14x: Lassen (Sierra architecture) vs. CTS-1 Intel Cascade Lake
  • ExaSGD application (power grid optimization using RAJA and Umpire)
    • Adopted RAJA and Umpire about 8 months before March 2021
    • Portions of the code run on Tulip (Frontier EA system) with good performance

A useful co-design tool

The RAJA team reports that the RAJA Performance Suite is an essential co-design tool to work with hardware and compiler vendors to improve the RAJA Portability Suite’s performance on new architectures. The Performance Suite is a collection of various numerical kernels that exercise a wide range of RAJA functions in the way that they are used in applications. Each kernel in the Performance Suite is implemented in several RAJA and non-RAJA (baseline) flavors for each supported programming model back-end (e.g., OpenMP, CUDA, HIP, etc.). Figure 6 (below) compares the performance of RAJA and baseline variants of different kernels in the RAJA Performance Suite for CUDA and HIP running on NVIDIA or AMD GPUs.

Figure 6: Speedup of selected RAJA variants of Performance Suite kernels relative to baseline variants for CUDA and HIP. A speedup of one indicates equivalent performance. Values greater / less than one indicate that the RAJA variant is faster / slower than the baseline. (Source: LLNL)
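
The ratio plotted in Figure 6 can be made concrete with a short sketch. It assumes that the ratio is defined as baseline run time divided by variant run time, which is consistent with the caption but is not spelled out in the source. The helper names time_seconds and speedup are hypothetical and are not part of the RAJA Performance Suite.

```cpp
#include <cassert>
#include <chrono>

// Wall-clock time of a single call to `work`, in seconds.
template <typename Work>
double time_seconds(Work work) {
  auto start = std::chrono::steady_clock::now();
  work();
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}

// Ratio of baseline time to variant time (assumed definition):
//   1.0 -> equivalent performance
//  >1.0 -> the variant is faster than the baseline
//  <1.0 -> the variant is slower than the baseline
double speedup(double baseline_seconds, double variant_seconds) {
  assert(baseline_seconds > 0.0 && variant_seconds > 0.0);
  return baseline_seconds / variant_seconds;
}
```

A benchmark harness would typically time many repetitions of each kernel variant and compare averages, since a single short measurement is noisy.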

Figure 7 shows the performance of the recently added SYCL-based kernel variants. Together with CUDA and HIP, the new RAJA SYCL support enables RAJA applications to run on GPU systems from all GPU manufacturers (NVIDIA, AMD and Intel) whose hardware is used in production systems at US DOE laboratories.

Figure 7: Speedup of selected RAJA variants of Performance Suite kernels relative to baseline variants for CUDA, HIP and SYCL. (Source: LLNL)

Application users report that the RAJA Portability Suite:

  • Makes it easy to use features and optimizations that were developed for other applications
  • Is easy to understand for all application developers
  • Is easy to integrate into existing applications
  • Is easy to adopt incrementally

More information

Further information on the RAJA Portability Suite can be found in the following sources:

  • RAJA: RAJA main project website containing links to user guides, team communication, and related software projects.
  • RAJA Performance Suite: Collection of kernels for evaluating compilers and RAJA performance. Used by the RAJA team, vendors, DOE platform procurements, and others.
  • Umpire: Main Umpire project page that contains links to user guides and team communication.
  • CHAI: Main page of the CHAI project containing links to user guides and team communication.
  • CAMP: Main location of the CAMP project.

Rob Farber is a global technology consultant and author with an extensive background in HPC and machine learning technology development for use in national laboratories and commercial organizations. He can be reached at [email protected]

[i] http://ideas-productivity.org/wordpress/wp-content/uploads/2021/03/webinar050-raja.pdf

[ii] Ibid
