This article is part of the Technology Insight series made possible by funds from Intel.
As data spreads from the network core to the intelligent edge, increasingly diverse compute resources follow, balancing power, performance, and response time. Historically, graphics processing units (GPUs) were the preferred offload target for compute. Today, field-programmable gate arrays (FPGAs), vision processing units (VPUs), and application-specific integrated circuits (ASICs) also bring unique strengths to the table. Intel refers to these accelerators (and anything else a CPU can send processing tasks to) as XPUs.
The challenge for software developers is figuring out which XPU is best for their workload; finding an answer often takes a lot of trial and error. Faced with a growing list of architecture-specific programming tools to support, Intel spearheaded a standards-based programming model called oneAPI to unify code across XPU types. Simplifying software development for XPUs cannot come soon enough. Ultimately, the move to heterogeneous computing – processing on the best XPU for a given application – seems inevitable given evolving use cases and the many devices vying to address them.
- Intel considers heterogeneous computing (where a host device sends computing tasks to different accelerators) inevitable.
- An XPU can be any offload target commanded by the CPU and built on any architecture from any hardware vendor.
- The oneAPI initiative is an open, standards-based programming model that enables developers to address multiple XPUs with a single code base.
Intel’s strategy faces headwinds from NVIDIA’s established CUDA platform, which assumes you are using only NVIDIA graphics processors. That walled garden may not be as impenetrable as it once was. Intel already has a design win with its upcoming Xe-HPC GPU, code-named Ponte Vecchio: Argonne National Laboratory’s Aurora supercomputer will have more than 9,000 nodes, each with six Xe-HPC GPUs, delivering a total of more than 1 exaFLOP/s of sustained double-precision performance.
Time will tell whether Intel can keep its promise to streamline heterogeneous programming with oneAPI and lower the entry barrier for hardware vendors and software developers alike. A convincing XPU roadmap certainly gives the industry reason to take a closer look.
Heterogeneous computing is the future, but it’s not easy
According to the Seagate Rethink Data Survey, the total volume of data spread across in-house data centers, cloud repositories, third-party data centers, and remote locations is projected to grow by more than 42% from 2020 to 2022. The value of that information depends on what you do with it, where, and when. Some data can be captured, classified, and stored to fuel machine-learning breakthroughs. Other applications demand a real-time response.
The compute resources needed to fulfill these use cases could hardly be more different. GPUs optimized for server platforms consume hundreds of watts each, while VPUs in the one-watt range might power smart cameras or computer-vision-based AI appliances. In either case, a developer must choose the best XPU to process data as efficiently as possible. This is not a new phenomenon; rather, it is an evolution of a decades-long trend toward heterogeneity, in which applications run control, data, and compute tasks on the hardware architecture best suited to each specific workload.
“The transition to heterogeneity is inevitable for the same reasons we moved from single-core to multicore CPUs,” said James Reinders, a parallel computing engineer at Intel. “It makes our computers more powerful and able to solve more problems and do things that they couldn’t in the past – but within the limits of the hardware we can design and build.”
As with the advent of multicore processing, which forced developers to think about their algorithms in terms of parallelism, the biggest obstacle to heterogeneous computing today is the complexity of programming it.
In the past, developers programmed close to the hardware in low-level languages that offered very little abstraction. The code was often fast and efficient, but not portable. Today, higher-level languages extend compatibility across a broader range of hardware while hiding many needless details. Compilers, runtimes, and libraries beneath the code make the hardware do what you want. It makes sense, then, that we are seeing more specialized architectures exposing new functionality through abstracted languages.
oneAPI aims to simplify software development for XPUs
Even now, each new accelerator requires its own software stack, devouring the hardware vendor’s time and money. From there, developers must invest their own time in learning new tools before they can determine the best architecture for their application.
Instead of spending that time rewriting and recompiling code against different libraries and SDKs, imagine an open, cross-architecture model that lets you migrate between architectures without leaving performance on the table. That is what Intel proposes with its oneAPI initiative.
oneAPI comprises a high-level language (Data Parallel C++, or DPC++), a set of APIs and libraries, and a hardware abstraction layer for low-level XPU access. Alongside the open specification, Intel offers its own suite of toolkits for various development tasks. For example, the Base Toolkit includes the DPC++ compiler, a handful of libraries, a compatibility tool for migrating NVIDIA CUDA code to DPC++, the optimization-oriented VTune profiler, and the Advisor analysis tool, which helps identify the best kernels to offload. Other toolkits target more specific segments such as HPC, AI and machine-learning acceleration, IoT, rendering, and deep-learning inference.
“When we talk about oneAPI at Intel, it’s a pretty simple concept,” says Intel’s Reinders. “I want as much as possible to stay the same. It’s not that there is one API for everything. Rather, if I want to perform fast Fourier transforms, I’d rather learn the interface for an FFT library once, then use that same interface for all of my XPUs.”
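Reinders’s point can be illustrated with a plain-C++ sketch. To be clear, this is not oneAPI code – the `Backend` interface and device names below are hypothetical stand-ins – but it shows the pattern he describes: application code is written once against a common interface, while each XPU supplies its own implementation behind it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical common interface: every XPU backend exposes the same
// entry point, so calling code never changes per device.
struct Backend {
    virtual ~Backend() = default;
    virtual std::string name() const = 0;
    virtual std::vector<float> scale(const std::vector<float>& in,
                                     float factor) const = 0;
};

// A CPU reference implementation.
struct CpuBackend : Backend {
    std::string name() const override { return "cpu"; }
    std::vector<float> scale(const std::vector<float>& in,
                             float factor) const override {
        std::vector<float> out;
        out.reserve(in.size());
        for (float v : in) out.push_back(v * factor);
        return out;
    }
};

// A stand-in for an accelerator backend; a real one would enqueue the
// work on a GPU, FPGA, or VPU instead of looping on the host.
struct FakeAcceleratorBackend : Backend {
    std::string name() const override { return "xpu"; }
    std::vector<float> scale(const std::vector<float>& in,
                             float factor) const override {
        std::vector<float> out(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * factor;
        return out;
    }
};

// Application code: written once, oblivious to which XPU runs it.
std::vector<float> run(const Backend& dev, const std::vector<float>& data) {
    return dev.scale(data, 2.0f);
}
```

In oneAPI, DPC++ and the domain libraries play the role of that common interface, and the hardware abstraction layer fills in the per-device implementations.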
Intel is not pursuing oneAPI for purely selfless reasons. The company already has a broad portfolio of XPUs that stand to benefit from a unified programming model (alongside the host processors that orchestrate them). If each XPU were treated as an island, the industry would end up back where it was before oneAPI: with independent software ecosystems, marketing resources, and training for each architecture. By making as much as possible common, developers can spend more time innovating and less time reinventing the wheel.
What will it take for the industry to heed Intel’s message?
GPUs produce an enormous number of FLOP/s, or floating-point operations per second. NVIDIA’s CUDA is the dominant platform for general-purpose GPU computing, and it requires you to use NVIDIA hardware. Because CUDA is the established technology, developers are reluctant to change software that already works, even if they would prefer a broader choice of hardware.
If Intel wants the community to look beyond proprietary lock-in, it has to build a better mousetrap than its competition, and that starts with compelling GPU hardware. At its recent Architecture Day 2021, Intel disclosed that a pre-production implementation of its Xe-HPC architecture already delivers more than 45 TFLOPS of FP32 throughput, more than 5 TB/s of fabric bandwidth, and more than 2 TB/s of memory bandwidth. At least on paper, that is higher single-precision performance than NVIDIA’s fastest data center processor.
The world of XPUs consists of more than just GPUs, however, which can be exciting or terrifying depending on who you ask. Backed by an open, standards-based programming model, a multitude of architectures can enable time-to-market advantages, dramatically lower power consumption, or workload-specific optimizations. But without oneAPI (or something like it), developers must learn new tools for every accelerator, stifling innovation and overwhelming programmers.
Fortunately, we’re seeing signs of life beyond NVIDIA’s closed platform. For example, the team behind RIKEN’s Fugaku supercomputer recently used Intel’s oneAPI Deep Neural Network Library (oneDNN) as a reference to develop its own deep-learning library. Fugaku uses Fujitsu A64FX CPUs based on Armv8-A with the Scalable Vector Extension (SVE) instruction set, which until then had no DL library. Optimizing Intel’s code for Armv8-A processors yielded a speed-up of as much as 400x compared to simply recompiling oneDNN without modification. By merging those changes into the library’s main branch, the team’s work is now available to other developers.
Intel’s Reinders admits that this all sounds a lot like open source. But the XPU philosophy goes a step further, influencing how code is written so that it is ready to run on different types of accelerators. “I’m not worried this is a fad,” he says. “This is one of the next big steps in computing. It is not a question of whether an idea like oneAPI will happen, but when.”