How high-bandwidth memory overcomes performance bottlenecks


Intel recently announced that High-Bandwidth Memory (HBM) will be available on select “Sapphire Rapids” Xeon SP processors and will provide the CPU backbone for the “Aurora” exascale supercomputer at Argonne National Laboratory.

Paired with Intel’s Xe HPC GPUs (code-named “Ponte Vecchio”) running in a unified CPU/GPU memory environment, these processors will deliver more than an exaflop/s of double-precision performance in Aurora. Realizing or exceeding an exaflop/s performance metric on 64-bit data operands means programmers need not take shortcuts or compromise on precision by falling back to reduced-precision arithmetic. It also means the memory system has to deliver data much faster than for previous generations of processors. Along with HBM for AI and data-intensive applications, the Sapphire Rapids Xeon SPs also implement the Advanced Matrix Extensions (AMX), which augment the 64-bit programming paradigm with accelerated tile operations and give programmers the ability to perform reduced-precision matrix operations for convolutional neural networks and other applications.

Maintaining sufficient bandwidth to support 64-bit exascale supercomputing in an accelerated computing environment with unified memory is a significant achievement that is generating serious excitement and anticipation in both the enterprise and HPC communities. For Argonne, the unified memory environment means that programming techniques already in use on current systems will apply directly to Aurora. In addition, institutional, enterprise, and cloud data centers will be able to run highly optimized systems built on next-generation Intel Xeon SPs for simulation, machine learning, and high-performance data analytics workloads (HPC-AI-HPDA for short) with applications written to run on existing systems.

Rick Stevens, Associate Laboratory Director for Computing, Environment and Life Sciences at Argonne National Laboratory, underscored both the achievement and the necessity of HBM when he wrote: “Achieving exascale results requires the rapid access and processing of massive amounts of data. The integration of high-bandwidth memory into Intel Xeon Scalable processors will significantly increase Aurora’s memory bandwidth and allow us to leverage the power of artificial intelligence and data analytics to run advanced simulations and 3D modeling.”

Why HBM is important

It has been known for several years that the ability of modern processors and GPUs to deliver flops has rapidly outpaced the ability of memory systems to deliver bytes/sec. John McCalpin, the author of the popular STREAM benchmark, noted in his SC16 invited talk, Memory Bandwidth and System Balance in HPC Systems, that peak flops per socket have increased 50 to 60 percent per year, while memory bandwidth has increased only about 23 percent per year. He illustrated this trend with the following graph, which plots the flops-to-memory-bandwidth balance since 1990 of commercially successful systems with good memory performance against their competitors. Computer manufacturers are aware of the memory bandwidth issue and have responded by adding more memory channels and using faster memory DIMMs.

Comparison of memory bandwidth with floating-point capability for commercially successful platforms since 1990. (Source: John McCalpin, https://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/)
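To make the memory side of this balance concrete, the following is a minimal, single-threaded C++ sketch in the spirit of McCalpin’s STREAM triad kernel (a[i] = b[i] + scalar*c[i]). It is not the official STREAM benchmark, just an illustration of how delivered bytes/sec can be measured:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Minimal STREAM-style triad: a[i] = b[i] + scalar * c[i].
// Bandwidth counts 3 doubles moved per element (2 reads + 1 write).
int main() {
    const size_t n = 1 << 25;                 // 32M doubles per array (256 MB each)
    const double scalar = 3.0;
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);

    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];
    auto t1 = std::chrono::steady_clock::now();

    double sec   = std::chrono::duration<double>(t1 - t0).count();
    double bytes = 3.0 * n * sizeof(double);  // bytes read + written
    std::printf("Triad bandwidth: %.1f GB/s\n", bytes / sec / 1e9);
    return a[0] == 7.0 ? 0 : 1;               // use the result so it isn't optimized away
}
```

The official benchmark adds multi-threading, repeated trials, and validation; even this toy version, though, will report numbers far below a CPU’s peak flop rate would suggest, which is exactly McCalpin’s point.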

HBM devices reflect an alternative approach that uses 3D manufacturing technology to create stacks of DRAM chips connected through a very wide bus interface. An HBM2e device, for example, connects the DRAM stack to the processor via a 1,024-bit bus. This wide data interface, plus the associated command and address signals, requires that the DRAM be mounted on a silicon interposer, which essentially “wires up” the roughly 1,700 connections required for HBM read/write transactions. The silicon interposer is necessary because it is impractical to route such a large number of connections with printed circuit board (PCB) technology.

Schematic of a 2.5D HBM memory system with a single DRAM stack. (Source: https://semiengineering.com/hbm-issues-in-ai-systems/)
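The peak bandwidth of one HBM stack follows directly from the 1,024-bit bus width and the per-pin data rate. The quick sanity check below uses representative HBM2/HBM2e pin rates (assumed values; actual rates vary by device and vendor):

```cpp
#include <cstdio>

// Back-of-the-envelope peak bandwidth for one HBM stack:
// peak bytes/sec = bus width (bits) x per-pin data rate (bits/sec) / 8.
int main() {
    const double bus_bits = 1024.0;  // HBM interface width per stack
    const double pin_rates[] = {2.0e9, 2.4e9, 3.2e9};  // representative pin rates

    for (double rate : pin_rates)
        std::printf("%.1f Gb/s per pin -> %.1f GB/s per stack\n",
                    rate / 1e9, bus_bits * rate / 8.0 / 1e9);
    return 0;
}
// Prints 256.0, 307.2, and 409.6 GB/s, consistent with the
// 256-410 GB/sec range for HBM2 devices quoted later in this article.
```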

The result is a huge leap in memory bandwidth along with considerable energy savings compared to DDR memory systems. EEWeb notes that “a single HBM2e device uses almost half as much power as a GDDR6 solution,” and concludes: “HBM2e gives you the same or higher bandwidth than GDDR6 and similar capacity, but power consumption is nearly halved while TOPS/W is doubled.” TOPS, or tera operations per second, is a measure of the maximum achievable throughput for a given memory bandwidth. It is used to evaluate the best throughput for the money for applications such as neural networks and other data-intensive AI workloads.

The past is prologue to the future: memory bandwidth benchmarks tell the story

Benchmarks show the effects of increased memory bandwidth on HPC applications quite well. Intel recently published an apples-to-apples comparison between a dual-socket Intel Xeon-AP system with two Intel “Cascade Lake” Xeon Platinum 9282 processors and a dual-socket AMD “Rome” EPYC 7742 system. As can be seen below, the Intel Xeon SP-9200 series system with twelve memory channels per socket (24 channels in the two-socket configuration) outperformed the AMD system with eight memory channels per socket (sixteen in total) by a geomean of 29 percent across a wide range of real-world HPC workloads.

Impact of twelve memory channels versus eight memory channels on a variety of HPC benchmarks. (Source: memory-bound results only, from https://www.datasciencecentral.com/profiles/blogs/cpu-vendors-compete-over-memory-bandwidth-to-achieve-leadership)
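For reference, the geomean figure is the geometric mean of the individual per-benchmark speedup ratios, which keeps any single outlier benchmark from dominating the average. A minimal sketch, using made-up speedup values rather than Intel’s published per-benchmark numbers:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Geometric mean of per-benchmark speedup ratios: exp(mean(log(x))).
// The values below are hypothetical placeholders, not Intel's data.
int main() {
    std::vector<double> speedups = {1.10, 1.45, 1.22, 1.35, 1.31};
    double log_sum = 0.0;
    for (double s : speedups) log_sum += std::log(s);
    double geomean = std::exp(log_sum / speedups.size());
    std::printf("geomean speedup: %.2fx (%.0f%% faster)\n",
                geomean, (geomean - 1.0) * 100.0);
    return 0;
}
```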

The reason is that these benchmarks are dominated by memory bandwidth, while others are compute-bound, as shown below:

Sensitivities of various HPC workloads to memory and compute limitations. (Source: https://medium.com/performance-at-intel/hpc-leadership-where-it-matters-real-world-performance-b16c47b11a01)

oneAPI heterogeneous programming enables next-generation capabilities

The compute-bound versus memory-bandwidth-bound dichotomy shown in the graphic above illustrates how the combined efforts of the oneAPI initiative can help address a variety of compute and memory bottlenecks simultaneously in an environment that combines CPUs, GPUs, and other accelerators. In short, high memory bandwidth is fundamental to keeping multiple devices in a system, and the processing units in each device, supplied with data. Once enough bandwidth is available to prevent data starvation, programmers can get to work overcoming the compute bottlenecks through changes to the software.

oneAPI’s heterogeneous programming approach helps enable these purpose-built, state-of-the-art capabilities:

  • HBM: High computational performance simply cannot be achieved when compute cores and vector units are starved for data. As the name suggests, and as discussed in this article, HBM provides high memory bandwidth.
  • Unified memory environment: CPUs and accelerators such as the Intel Xe compute GPUs share a unified memory space that gives all devices easy access to data. This means users can add an Intel GPU based on the Xe architecture, or on the Xe HPC microarchitecture, to accelerate computational problems that are beyond the capabilities of the CPU cores. The additional bandwidth of the HBM memory system helps keep multiple devices busy and supplied with data.
  • Intel AMX instructions: Intel added the AMX instructions to accelerate some of the most heavily used computational operations in AI and certain other workloads. The core of the AMX extensions is a new matrix register file with eight two-dimensional tensor (matrix) registers, known as tiles. The programmer can configure the number of rows and the bytes per row of each tile via a tile control register (TILECFG), which makes it possible to tailor the tiles to fit the algorithm and computation more naturally. The Sapphire Rapids Xeon SPs support the full AMX specification, including the AMX-TILE, AMX-INT8, and AMX-BF16 operations. A minimal intrinsics sketch follows this list.
  • Cross-architecture programming with oneAPI: oneAPI’s open, unified, cross-architecture programming model enables users to run a single software abstraction on heterogeneous hardware platforms containing CPUs, GPUs, and other accelerators from multiple vendors. At the heart of oneAPI is the Data Parallel C++ (DPC++) project, which brings Khronos SYCL to LLVM to support data parallelism and heterogeneous programming within a single-source application. SYCL is a royalty-free, cross-platform abstraction layer based entirely on ISO C++ that eliminates concerns about binding applications to proprietary systems and software. DPC++ enables code reuse across hardware targets such as CPUs, GPUs, and FPGAs, either individually or by orchestrating all the devices in a system into a powerful combined heterogeneous compute engine that performs calculations simultaneously on the various devices; the SYCL sketch after this list shows the single-source idea. A growing list of companies, universities, and institutions are reporting on the advantages of oneAPI and its growing software ecosystem.
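To give a flavor of the AMX tile workflow described above, here is a hedged sketch using the AMX compiler intrinsics from immintrin.h: configure the tiles, load data, issue an int8 tile multiply-accumulate, and store the result. The tiny all-ones matrices are illustrative only, and the code assumes a Sapphire Rapids-class CPU plus a compiler supporting -mamx-tile and -mamx-int8:

```cpp
#include <immintrin.h>   // AMX intrinsics; compile with -mamx-tile -mamx-int8
#include <cstdint>
#include <cstdio>
#include <cstring>
#ifdef __linux__
#include <sys/syscall.h>
#include <unistd.h>
#endif

// 64-byte AMX tile configuration, loaded into TILECFG by _tile_loadconfig.
struct alignas(64) TileConfig {
    uint8_t  palette_id = 1;      // palette 1: eight tiles, up to 16 rows x 64 bytes
    uint8_t  start_row  = 0;
    uint8_t  reserved[14] = {};
    uint16_t colsb[16] = {};      // bytes per row for each tile
    uint8_t  rows[16]  = {};      // rows for each tile
};

int main() {
#ifdef __linux__
    // Recent Linux kernels require the process to request AMX tile state first.
    syscall(SYS_arch_prctl, 0x1023 /* ARCH_REQ_XCOMP_PERM */, 18 /* XTILEDATA */);
#endif
    // Tile 0 holds C (16x16 int32); tiles 1 and 2 hold A and B (16x64 int8).
    TileConfig cfg;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;
    cfg.rows[1] = 16; cfg.colsb[1] = 64;
    cfg.rows[2] = 16; cfg.colsb[2] = 64;
    _tile_loadconfig(&cfg);

    alignas(64) int8_t  a[16][64], b[16][64];   // toy all-ones operands
    alignas(64) int32_t c[16][16] = {};
    std::memset(a, 1, sizeof a);
    std::memset(b, 1, sizeof b);

    _tile_loadd(1, a, 64);   // load A into tile 1 (64-byte row stride)
    _tile_loadd(2, b, 64);   // load B (pre-packed layout) into tile 2
    _tile_loadd(0, c, 64);   // load the int32 accumulator tile
    _tile_dpbssd(0, 1, 2);   // C += A * B: int8 multiplies, int32 accumulation
    _tile_stored(0, c, 64);  // write C back to memory
    _tile_release();         // release the tile state

    std::printf("c[0][0] = %d (expect 64: sixty-four 1*1 products)\n", c[0][0]);
    return 0;
}
```

In practice most programmers will reach AMX through libraries and frameworks rather than raw intrinsics, but the sketch shows the configure/load/compute/store pattern the TILECFG description above refers to.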
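And to illustrate the single-source model that DPC++/SYCL provides, the sketch below runs the same vector-add kernel on whatever device the runtime selects, CPU or GPU. It is a minimal SYCL 2020 example (compile with, for example, icpx -fsycl), not Aurora production code:

```cpp
#include <sycl/sycl.hpp>   // DPC++ / SYCL 2020
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    // The runtime picks a device: a GPU if one is present, otherwise the CPU.
    sycl::queue q{sycl::default_selector_v};
    std::printf("Running on: %s\n",
                q.get_device().get_info<sycl::info::device::name>().c_str());
    {
        sycl::buffer<float> ba{a.data(), sycl::range{n}};
        sycl::buffer<float> bb{b.data(), sycl::range{n}};
        sycl::buffer<float> bc{c.data(), sycl::range{n}};

        q.submit([&](sycl::handler& h) {
            sycl::accessor xa{ba, h, sycl::read_only};
            sycl::accessor xb{bb, h, sycl::read_only};
            sycl::accessor xc{bc, h, sycl::write_only};
            // The same single-source kernel runs unchanged on any device.
            h.parallel_for(sycl::range{n}, [=](sycl::id<1> i) {
                xc[i] = xa[i] + xb[i];
            });
        });
    }   // buffer destructors synchronize and copy results back to the host

    std::printf("c[0] = %.1f (expect 3.0)\n", c[0]);
    return 0;
}
```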

Look to the future

Of course, everyone wants to know how much memory bandwidth the new Intel Xeon Scalable HBM memory system will provide. That information has yet to be announced. Mark Kachmarek, Product Manager for Xeon SP HBM at Intel, offers only this: “The new high-bandwidth memory system for Intel Xeon processors will offer more bandwidth and capacity than was available with the Intel Xeon Phi product family.” That alone is exciting.

The actual bandwidth of the Sapphire Rapids HBM memory system will be determined by the number of memory channels and the performance of the HBM devices on each channel. Current HBM2 devices deliver between 256 GB/sec and 410 GB/sec, which gives an idea of the performance potential of a modern stacked-HBM2 memory channel. The number of memory channels supported by the HBM-enabled Sapphire Rapids Xeon SPs has not yet been announced.
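Whatever the channel count turns out to be, aggregate bandwidth is simply the per-channel figure multiplied by the number of channels. The sketch below runs that arithmetic for a few hypothetical channel counts; Intel has announced none of these numbers:

```cpp
#include <cstdio>

// Aggregate HBM bandwidth scales with the number of stacks/channels.
// The channel counts below are hypothetical illustrations only.
int main() {
    const double low_gbs = 256.0, high_gbs = 410.0;  // HBM2 range from the text
    for (int channels : {2, 4, 8})
        std::printf("%d channels: %.0f - %.0f GB/s aggregate\n",
                    channels, channels * low_gbs, channels * high_gbs);
    return 0;
}
```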

Rob Farber is a global technology consultant and author with a broad background in developing HPC and machine learning technologies for use in national laboratories and commercial organizations. Rob can be reached at [email protected]
