Moore’s Law needs a hug. The days of cramming transistors onto small silicon computer chips are over, and their life rafts – hardware accelerators – come at a price.
Programming an accelerator, a process in which applications offload certain tasks to the system hardware to speed them up, requires building entirely new software support. Hardware accelerators can perform certain tasks orders of magnitude faster than CPUs, but they cannot be used out of the box. Software must use accelerators’ instructions efficiently and remain compatible with the entire application system. This means a lot of engineering work, which then has to be maintained for every new chip you compile code for, whatever programming language you use.
Now scientists at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a new programming language called “Exo” for writing high-performance code on hardware accelerators. Exo helps low-level performance engineers turn very simple programs that specify what they want to compute into very complex programs that do the same thing as the specification, but much faster, by using these special accelerator chips. For example, engineers can use Exo to turn a simple matrix multiplication into a more complex program that runs orders of magnitude faster on such an accelerator.
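To make the matrix multiplication example concrete, here is a minimal sketch in plain Python. It is not Exo code: the function names and the tile size are illustrative assumptions. The first function is the kind of simple specification the article describes; the second is a restructured ("tiled") version that computes the same result block by block, the sort of reshaping that lets real hardware reuse data it has already loaded. In interpreted Python the tiled version will not run faster; the point is only the shape of the transformation.

```python
def matmul_spec(A, B):
    """Specification: the obvious triple loop for C = A @ B."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Same computation, restructured to work on tile x tile blocks.

    `tile` is a hypothetical tuning parameter chosen for illustration.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # block row of C
        for j0 in range(0, m, tile):          # block column of C
            for p0 in range(0, k, tile):      # block of the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]

# The optimized version must agree with the specification.
assert matmul_tiled(A, B) == matmul_spec(A, B)
```

The second function is already noticeably harder to read than the first, and real accelerator code goes much further, which is why keeping the two provably equivalent matters.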
Unlike other programming languages and compilers, Exo is based on a concept called “exocompilation.” “Traditionally, much of the research has focused on automating the optimization process for the specific hardware,” says Yuka Ikarashi, an electrical engineering and computer science PhD student and CSAIL member who is the lead author of a new paper on Exo. “That’s great for most programmers, but for performance engineers, the compiler gets in the way as often as it helps. Because the compiler’s optimizations are automatic, if it’s doing the wrong thing and giving you 45 percent efficiency instead of 90 percent, there’s no good way to fix it.”
With exocompilation, the performance engineer is back in the driver’s seat. Responsibility for choosing which optimizations to apply, when, and in what order is passed from the compiler back to the performance engineer. This way they don’t have to waste time fighting the compiler on the one hand, or doing everything manually on the other. At the same time, Exo takes responsibility for ensuring that all of these optimizations are correct. As a result, the performance engineer can spend their time improving performance instead of debugging the complex, optimized code.
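The division of labor described above, where the engineer picks the rewrites and the system checks them, can be sketched with a toy loop-nest model in Python. This is an illustrative assumption, not Exo's actual API or its actual checks: `Loop`, `split`, `reorder`, and `RewriteError` are hypothetical names, and the only property verified here is that the rewritten nest visits exactly the same iterations as the original.

```python
from dataclasses import dataclass
from itertools import product


class RewriteError(Exception):
    """Raised when a requested rewrite cannot be shown to be safe."""


@dataclass(frozen=True)
class Loop:
    var: str     # loop variable name
    extent: int  # the loop runs over range(extent)
    origin: str  # original index this loop contributes to
    stride: int  # contribution to that index is stride * value


def initial_nest(**extents):
    """Build the unoptimized nest: one loop per original index."""
    return [Loop(name, n, name, 1) for name, n in extents.items()]


def split(nest, var, factor):
    """Split `var` into outer/inner loops; refuse uneven splits."""
    out = []
    for loop in nest:
        if loop.var != var:
            out.append(loop)
            continue
        if loop.extent % factor != 0:
            raise RewriteError(f"{factor} does not divide {var}'s extent")
        out.append(Loop(var + "o", loop.extent // factor,
                        loop.origin, loop.stride * factor))
        out.append(Loop(var + "i", factor, loop.origin, loop.stride))
    if len(out) == len(nest):
        raise RewriteError(f"no loop named {var}")
    return out


def reorder(nest, order):
    """Permute the loops; refuse anything that is not a permutation."""
    by_name = {loop.var: loop for loop in nest}
    if sorted(order) != sorted(by_name):
        raise RewriteError("order must name every loop exactly once")
    return [by_name[v] for v in order]


def iteration_points(nest):
    """List the original index tuples in the order the nest visits them."""
    origins = sorted({loop.origin for loop in nest})
    points = []
    for values in product(*(range(loop.extent) for loop in nest)):
        idx = dict.fromkeys(origins, 0)
        for loop, value in zip(nest, values):
            idx[loop.origin] += loop.stride * value
        points.append(tuple(idx[o] for o in origins))
    return points


# The engineer decides what to apply and in which order.
nest = initial_nest(i=4, j=4)
tiled = reorder(split(split(nest, "i", 2), "j", 2), ["io", "jo", "ii", "ji"])

# The system confirms the same work is done, just in a different order.
assert sorted(iteration_points(tiled)) == sorted(iteration_points(nest))
assert iteration_points(tiled) != iteration_points(nest)
```

Asking for an unsafe rewrite, such as `split(nest, "i", 3)` on a loop of extent 4, raises `RewriteError` instead of silently producing wrong code. A real system also has to reason about data dependencies between iterations, which this sketch ignores.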
“The Exo language is a compiler that is parameterized by the hardware it targets; the same compiler can adapt to many different hardware accelerators,” says Adrian Sampson, an assistant professor in the Department of Computer Science at Cornell University. “Rather than writing a bunch of messy C++ code to compile for a new accelerator, Exo gives you an abstract, uniform way of writing down the ‘shape’ of the hardware you want to target. Then you can reuse the existing Exo compiler to adapt to this new description instead of writing something completely new from scratch. The potential impact of such work is huge: if hardware innovators can stop worrying about the cost of developing new compilers for each new hardware idea, they can try out and ship more ideas. The industry could break its dependence on legacy hardware that succeeds only because of ecosystem lock-in and despite its inefficiency.”
The most powerful computer chips made today, such as Google’s TPU, Apple’s Neural Engine, or NVIDIA’s Tensor Cores, power scientific computing and machine learning applications by accelerating what are known as “key subprograms,” kernels, or high-performance computing (HPC) subroutines.
Cumbersome jargon aside, the programs are essential. For example, Basic Linear Algebra Subprograms (BLAS) is a “library,” or collection, of such subroutines dedicated to linear algebra computations. It enables many machine learning tasks such as neural networks, as well as weather forecasting, cloud computing, and drug discovery. (BLAS is so important that it earned Jack Dongarra the 2021 Turing Award.) However, these new chips – which require hundreds of engineers to develop – are only as good as these HPC software libraries allow them to be.
Currently, though, this type of performance tuning is still done by hand to ensure every last compute cycle is used on these chips. HPC subroutines routinely run at over 90 percent of theoretical peak efficiency, and hardware engineers go to great lengths to increase these theoretical peak speeds by another five or ten percent. So unless the software is aggressively optimized, all that hard work goes to waste – and that’s what Exo helps avoid.
Another important aspect of exocompilation is that performance engineers can describe the new chips they want to optimize for without having to change the compiler. Traditionally, the definition of the hardware interface is maintained by compiler developers, but on most of these new accelerator chips, the hardware interface is proprietary. Companies must maintain their own copy (fork) of an entire traditional compiler, modified to support their particular chip. This requires hiring teams of compiler developers in addition to performance engineers.
“In Exo, we instead externalize the definition of hardware-specific backends from the exocompiler. This gives us a cleaner separation between Exo – an open source project – and hardware-specific code – which is often proprietary. We’ve shown that we can use Exo to quickly write code that is as performant as Intel’s hand-optimized Math Kernel Library. We are actively collaborating with engineers and researchers from multiple companies,” says Gilbert Bernstein, a postdoc at the University of California, Berkeley.
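One way to picture a hardware description that lives outside the compiler is the following Python sketch. It is a simplified illustration under stated assumptions, not Exo's real interface: the `Instr` record, the `lower_vector_add` function, and the instruction names `accel_vadd4` and `other_vadd8` are all hypothetical. The "compiler" function contains no knowledge of any particular chip; everything chip-specific arrives as data supplied by the user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """User-supplied description of one accelerator instruction."""
    name: str
    width: int        # number of elements one call processes
    c_template: str   # text to emit for one call in the generated code


def lower_vector_add(n, instr):
    """Generic lowering of c[0:n] = a[0:n] + b[0:n] onto `instr`.

    Nothing here is specific to a chip; the instruction is a parameter.
    """
    if n % instr.width != 0:
        raise ValueError(f"{instr.name} needs n to be a multiple of {instr.width}")
    return [
        instr.c_template.format(dst=f"c + {i}", a=f"a + {i}", b=f"b + {i}")
        for i in range(0, n, instr.width)
    ]


# Two hypothetical chips, described entirely outside the lowering code.
VADD4 = Instr("vadd4", 4, "accel_vadd4({dst}, {a}, {b});")
VADD8 = Instr("vadd8", 8, "other_vadd8({dst}, {a}, {b});")

print("\n".join(lower_vector_add(8, VADD4)))
print("\n".join(lower_vector_add(8, VADD8)))
```

Because the descriptions are ordinary user-level data, a vendor could keep a file like the last few lines private while the shared lowering code stays open source and unforked.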
The future of Exo involves exploring a more productive scheduling metalanguage and extending its semantics to support parallel programming models, so that it can be applied to even more accelerators, including GPUs.
Ikarashi and Bernstein co-authored the paper with Alex Reinking and Hasan Genc, both UC Berkeley graduate students, and MIT assistant professor Jonathan Ragan-Kelley.
This work was supported in part by the Applications Driving Architectures Center, one of six centers of JUMP, a Semiconductor Research Corporation program co-sponsored by the Defense Advanced Research Projects Agency. Ikarashi was supported by the Funai Overseas Scholarship, the Masason Foundation, and the Great Educators Fellowship. The team presented the work at the 2022 ACM SIGPLAN Conference on Programming Language Design and Implementation.