Sponsored Feature. There are many things that the world’s HPC centers and hyperscalers have in common, and one of them is their attitude towards software. They like to control as much of their system software as possible because they want to get as much performance out of their systems as possible. However, the time, money, and expertise required to create custom operating systems, middleware, and runtime environments are too burdensome for most other organizations that could benefit from HPC in its many guises.
With a rapidly growing number and variety of compute engines in the data center, and a growing set of HPC applications – encompassing traditional simulation and modeling as well as data analytics and machine learning, and increasingly a hodgepodge of these techniques stacked up in workflows that constitute a new type of application – building and maintaining a comprehensive HPC software stack is a serious challenge.
What if this could be more of a group effort? What if there was a way to create a complete HPC software stack that could still be optimized for very specific use cases? Wouldn’t this be an advantage for the larger HPC community and especially for those academic, government and corporate centers that don’t have the resources to build and maintain their own HPC stack?
It’s hard to argue against customization and optimization in the HPC realm, so don’t think that is what we are doing here. On the contrary: we are thinking about some sort of organized mass customization that benefits more HPC adopters and more diverse architectures, because system architectures are becoming more heterogeneous over time, not less.
Each CPU, GPU, or FPGA accelerator manufacturer, not to mention the custom ASIC vendors, creates its own compilers and often its own application development and runtime environments in the never-ending task of squeezing more performance out of the expensive HPC clusters that organizations build from their compute engines and networks. (After all, it is difficult to separate compute from networking in a clustered system, which is one of the reasons Nvidia paid $6.9 billion for Mellanox.)
The list of important HPC compilers and runtimes is not long, but it is varied.
Intel had its historic Parallel Studio XE stack, which included C++ and Fortran compilers and a Python interpreter, as well as the Math Kernel Library, the Data Analytics Acceleration Library, the Integrated Performance Primitives (for accelerating algorithms in specific domains), Threading Building Blocks (for shared-memory parallel programming), an MPI library to implement message passing across scale-out clusters, and optimizations for the TensorFlow and PyTorch machine learning frameworks, all of which are now included in Intel’s oneAPI toolkits.
Nvidia developed its Compute Unified Device Architecture (CUDA) to make it easier to offload computational tasks from CPUs to GPUs without having to resort to OpenGL. Over time, the CUDA development and runtime environment has added support for the OpenMP, OpenACC, and OpenCL programming models. In 2013, Nvidia bought the venerable PGI C, C++, and Fortran compilers, which trace their lineage back to mini-supercomputer maker Floating Point Systems decades ago, and for more than a year now the PGI compilers have been distributed as part of the Nvidia HPC SDK stack.
AMD has the Radeon Open Compute Platform, or ROCm for short, which makes heavy use of the Heterogeneous System Architecture runtime and has a compiler front end that can generate hybrid code that runs on both CPUs and GPU accelerators; importantly, the tools that make up the ROCm environment are open source. ROCm supports both the OpenMP and OpenCL programming models, and it adds the Heterogeneous-Compute Interface for Portability (HIP), a C++ kernel language and GPU offload runtime that can generate code that runs on either AMD or Nvidia GPUs and can also convert code written for Nvidia’s CUDA environment to run on HIP, thereby offering a measure of portability.
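As an aside, the CUDA-to-HIP conversion that ROCm’s hipify tools perform is, at its core, a systematic source-to-source renaming of API identifiers. The toy Python sketch below illustrates the idea using a handful of real CUDA and HIP names; it is our own illustration, and the actual hipify-perl and hipify-clang tools handle far more constructs, including kernel launch syntax.

```python
# Toy subset of the CUDA-to-HIP renames that hipify-style tools apply.
# (Illustrative only; real hipify tools cover vastly more of the API.)
CUDA_TO_HIP = {
    "cuda_runtime.h":         "hip/hip_runtime.h",
    "cudaMalloc":             "hipMalloc",
    "cudaMemcpy":             "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree":               "hipFree",
    "cudaDeviceSynchronize":  "hipDeviceSynchronize",
}

def toy_hipify(source: str) -> str:
    """Rename CUDA identifiers to their HIP equivalents, longest names
    first so that cudaMemcpyHostToDevice is not clobbered by the
    shorter cudaMemcpy substitution."""
    for cuda_name in sorted(CUDA_TO_HIP, key=len, reverse=True):
        source = source.replace(cuda_name, CUDA_TO_HIP[cuda_name])
    return source

cuda_snippet = (
    "#include <cuda_runtime.h>\n"
    "cudaMalloc(&buf, n); cudaMemcpy(buf, h, n, cudaMemcpyHostToDevice);"
)
print(toy_hipify(cuda_snippet))
```

Running this prints the same snippet with `hipMalloc`, `hipMemcpy`, and the `hip/hip_runtime.h` header in place of their CUDA counterparts, which is the essence of what makes HIP a portability layer rather than a full rewrite.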
The Cray Linux Environment and compiler set, now sold by Hewlett Packard Enterprise as the Cray Programming Environment suite, comes immediately to mind, and it can run on HPE’s own Cray EX systems with Intel or AMD CPUs and Nvidia, AMD, or Intel GPUs (by integrating those vendors’ tools) as well as on the Apollo 80 machines based on Fujitsu’s heavily vectorized A64FX Arm server processor. Arm has its Allinea compiler set, which matters for the A64FX processors as well as for the Neoverse Arm processor designs that will come out with vector extensions in the years ahead. Fujitsu also has its own C++ and Fortran compilers that run on the A64FX chip, and of course there is the open source GCC compiler set, too.
There are other major HPC compiler and runtime stacks with acceleration libraries for all kinds of algorithms that matter in various fields of simulation, modeling, financial services, and analytics. The more the merrier. But here is the important lesson illustrated by the launch of the Apollo 80 system with the A64FX processor from HPE: not every compiler is good at compiling every kind of code. This is something that all academic and government supercomputing centers, particularly those that change architectures frequently, know full well. Diverse compute engines mean diverse compilation.
And that’s why it is best to have a lot of different compilers and libraries in the toolbox to choose from. What the HPC market really needs is a hyper-compiler that can look at code and figure out which compiler should be used on a wide and potentially diverse mix of compute engines to get the best performance. We do not believe the HPC industry needs many different complete HPC SDKs, each optimized by its vendor advocate, but rather compilers and libraries from many different experts, all integrated into a single, broad, and complete SDK framework that can serve any HPC workload.
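To make the hyper-compiler idea concrete, here is a minimal Python sketch of what its dispatch table might look like. The compiler driver names are real (ifx, frt, nvc++, hipcc, g++), but the table, the target labels, and the pick_compiler function are entirely our own hypothetical illustration, not any shipping product.

```python
# Hypothetical dispatch table for a "hyper-compiler": given a source
# language and a target compute engine, pick the best-known toolchain.
# The pairings below are illustrative assumptions, not benchmarks.
COMPILER_MATRIX = {
    ("fortran", "x86-cpu"):    "ifx",    # Intel oneAPI Fortran
    ("fortran", "a64fx"):      "frt",    # Fujitsu Fortran for A64FX
    ("c++",     "nvidia-gpu"): "nvc++",  # Nvidia HPC SDK
    ("c++",     "amd-gpu"):    "hipcc",  # AMD ROCm / HIP
    ("c++",     "x86-cpu"):    "g++",    # GCC
}

def pick_compiler(language: str, target: str) -> str:
    """Return the tuned compiler for a (language, target) pair, falling
    back to the generic GCC driver when no tuned toolchain is known."""
    return COMPILER_MATRIX.get((language, target), "gcc")

print(pick_compiler("c++", "amd-gpu"))   # hipcc
print(pick_compiler("fortran", "a64fx")) # frt
```

A real hyper-compiler would of course have to inspect the code itself, not just a language label, but even this toy table captures the point: the best toolchain depends on the compute engine, so the framework, not the vendor, should make the choice.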
Going up the HPC software stack, and further complicating the situation, every HPC system maker has its own Linux environment, whether it is one of its own design, an anointed distribution from IBM’s Red Hat unit or from SUSE Linux, a community effort such as Scientific Linux, or one cobbled together by the HPC center itself.
In an HPC world where both security and efficiency are paramount, we need a stack of operating systems, middleware, compilers, and libraries that is designed as a whole, with components that can be slid into and out of the stack as needed, while still offering the widest optionality. This software does not have to be open source, but it does have to be integratable end to end through APIs. As inspiration for this HPC stack, we look to the OpenHPC effort that Intel spearheaded six years ago and to the Tri-Lab Operating System Stack (TOSS) platform created by the US Department of Energy – specifically by Lawrence Livermore National Laboratory, Sandia National Laboratories, and Los Alamos National Laboratory. The TOSS platform is used on the commodity clusters shared by these HPC centers.
The OpenHPC effort appeared to be gaining some traction a year after it launched, but a few more years came and went, and by then nobody was talking much about OpenHPC anymore. Instead, Red Hat tuned up its own Linux distribution for running traditional HPC simulation and modeling programs, and the two most powerful supercomputers in the world at the time, “Summit” at Oak Ridge National Laboratory and “Sierra” at Lawrence Livermore, ran Red Hat Enterprise Linux 7. The OpenHPC effort was a little too Intel-centric for many, but that focus was understandable to a degree given that AMD CPUs and GPUs were scarce and Arm CPUs were absent from the HPC hunt at the time. But the mix-and-match nature of the stack was right.
Our thought experiment about an HPC stack goes further than just letting anything plug into OpenHPC. What we want is something designed more like TOSS, which was profiled at SC17 four years ago. With TOSS, the labs created a derivative of Red Hat Enterprise Linux that used consistent source code across the X86, Power, and Arm architectures, plus a build system that sliced out the parts of RHEL that are irrelevant to HPC clusters and added other software that was needed.
In a conversation about exascale systems in 2019, Livermore Computing CTO Bronis de Supinski said that Lawrence Livermore pulled 4,345 packages from the more than 40,000 packages in Red Hat Enterprise Linux, patched and repackaged 37 of them, and then added another 253 packages that the Tri-Lab systems required, yielding a TOSS platform with 4,598 packages. The surface area of the software is obviously vastly reduced, while still supporting different CPUs and GPUs for compute, different networks, different kinds of middleware abstractions, and the Lustre parallel file system.
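The package arithmetic above works out as follows – the counts come from the article itself, while the variable framing is our own illustration. The 37 patched packages are a subset of the 4,345 pulled from RHEL, so they do not add to the total; only the Tri-Lab additions do.

```python
# Package counts for the TOSS platform, as cited in the article.
# The 37 patched-and-repackaged packages are drawn from the 4,345
# already pulled from RHEL, so they do not increase the total.
rhel_packages_pulled = 4345   # selected from RHEL's 40,000+ packages
patched_and_repackaged = 37   # reworked in place, a subset of the above
tri_lab_additions = 253       # HPC-specific packages layered on top

toss_total = rhel_packages_pulled + tri_lab_additions
print(toss_total)  # 4598, the TOSS platform package count
```

Roughly nine out of every ten RHEL packages are simply left out, which is where the reduction in attack surface and maintenance burden comes from.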
What is also interesting about the TOSS platform is that it has an add-on development environment – compilers, libraries, and the like – that rides on top of it, called the Tri-Lab Compute Environment.
If three of the major HPC labs in the United States can create an HPC Linux variant and development tool stack that provides consistency across architectures, allows for some level of application portability, and lowers the total cost of ownership of the commodity clusters they use, how much more of an impact could a unified HPC stack, embracing all of the current compiler, library, and middleware vendors, have on the HPC industry as a whole? Imagine a build system shared by the entire community that could cast off all but the components required for a specific set of HPC use cases, thereby limiting the security exposure of the overall stack. Imagine if math libraries and other algorithmic speedups were more portable across architectures. (That’s a topic for another day.)
It is good that each HPC compute engine or operating system vendor has its own complete and highly tuned stack. We applaud this, and for many customers this will be sufficient to adequately design, develop, and maintain HPC applications. But it will most likely not be enough to support a wide range of applications on a wide range of hardware. Ultimately, you want a consistent framework for compilers and libraries across all vendors, allowing any math library to be used with any compiler, all atop a customizable Linux platform.
Sponsored by Intel.