After more than a decade of planning, the United States’ first exascale computer, Frontier, is due to arrive at Oak Ridge National Laboratory (ORNL) later this year. Crossing this 1,000-fold performance horizon required surmounting four major challenges: power demand, reliability, extreme parallelism and data movement.
Al Geist launched ORNL’s Advanced Technologies Section (ATS) webinar series last month by recounting the march toward exascale. His description of how the Frontier supercomputer addresses the four primary exascale challenges revealed key details about the anticipated first U.S. exascale computer.
Most importantly, Frontier is poised to meet the 20 MW power target set by DARPA in 2008 by delivering more than 1.5 peak exaflops within a 29 MW power envelope. Although the target was originally set for 2015, until recently it was not clear whether the first crop of exascale supercomputers – due to hit the market in 2021-2023 – would make the cut. In fact, it’s unclear if they all will, but it looks like Frontier, which uses HPE and AMD technologies, will.
Geist is a Corporate Fellow and CTO of the Oak Ridge Leadership Computing Facility (OLCF) and CTO of the Exascale Computing Project. He is also one of the original developers of the Parallel Virtual Machine (PVM) software, a de facto standard for heterogeneous distributed computing.
Geist began his talk by looking back at the four major challenges that were formulated in the 2008-2009 period when exascale planning was ramped up in the Department of Energy and its affiliated organizations.
“The four challenges were also there during the Petascale regime, but in 2009 we felt that there was a serious problem where we may not even be able to build an exascale system,” said Geist. “It wasn’t just expensive or difficult to program – it just might be impossible.”
The first challenge was power consumption.
“Research reports from 2008 predicted that an exaflop system would draw between 150 and 500 megawatts of power. And the vendors were given the ambitious goal of reducing this to 20 megawatts, which still seems like a lot,” said Geist.
There was also reliability: “The fear, based on the calculations at the time, was that failures would happen faster than you could checkpoint a job,” said Geist.
It was also believed that billion-way parallelism would be required.
“The question was, could there be more than a handful of applications, if any, that could use that much parallelism?” Geist recalled. “In 2009, large-scale parallelism was typically less than 10,000 nodes. And the largest application we had on record used only about 100,000 nodes.”
The final challenge was data movement.
“We saw the whole problem with the memory wall: basically, the time it took to move data from memory to the processors and from the processors back to memory was actually the biggest bottleneck for computation; the compute time itself was insignificant,” said Geist. “The time it takes to move a byte is orders of magnitude longer than a floating point operation.”
Geist recalled the DARPA Exascale Computing Report published in 2008 (under the direction of Peter Kogge). It included an in-depth analysis of what would be required to implement a 1 exaflops peak system.
With the technologies of the time, a system built from off-the-shelf components would have required 1,000 MW, but extrapolating the flops-per-watt trends of the day, a highly optimized architecture could reach exascale at about 155 MW. A bare-bones configuration, in which the strawman system’s memory was cut to only 16 gigabytes per node, yielded a floor of 69 to 70 MW.
But even the aggressive 70 MW figure was out of range. A machine that power-hungry was unlikely to win approval.
“You may be wondering where that [20 MW number] came from,” Geist said. “Actually, it came from a completely non-technical assessment of what was possible. What was possible said: it will take 150 MW. What we said is: we need it to be 20 [MW]. And why did we say that? [We asked] the DOE, ‘How much are you willing to pay for power over a system’s lifetime?’ and the number that came back from the head of the Office of Science at the time was that over five years they weren’t willing to pay more than $100 million, so it’s easy math [based on an average cost of $1 million per megawatt per year]. The 20 megawatts had nothing to do with what was possible; it was just this stake that we drove into the ground.”
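Geist’s “easy math” is simple to reproduce. A minimal sketch, using the figures cited above (a $100 million five-year power budget and an average cost of $1 million per megawatt per year):

```python
# Back-of-the-envelope math behind the 20 MW target, per Geist's account.
budget_dollars = 100_000_000   # DOE's stated five-year power budget
lifetime_years = 5             # assumed system lifetime
cost_per_mw_year = 1_000_000   # average cost of one megawatt for one year

max_power_mw = budget_dollars / (lifetime_years * cost_per_mw_year)
print(max_power_mw)  # -> 20.0
```

As Geist notes, the 20 MW figure fell out of the budget, not out of any technology projection.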
Moving forward in the presentation (which can be viewed via the link at the end of this article), Geist traced the evolution of the machines at Oak Ridge: Titan to Summit to Frontier. The extreme parallelism challenge is addressed by Frontier’s fat-node approach, in which the GPUs hide the parallelism in their pipelines.
“The node count hasn’t exploded – it didn’t take a million nodes to get to Frontier,” Geist said. “In fact, the number of nodes is really quite small.”
While Titan used a GPU to CPU ratio of one to one, Summit implemented a ratio of three to one. Frontier’s design takes it one step further with a GPU-to-CPU ratio of four to one.
“In the end, we found that exascale didn’t need the exotic technology described in the 2008 report,” said Geist. “We didn’t need any special architectures; we didn’t even need new programming paradigms. It turned out to be very incremental steps, not the giant leap we thought it would take to get to Frontier.”
In terms of performance, Frontier is expected to exceed 1.5 peak exaflops while consuming no more than 29 megawatts. “That’s actually a bit better than the 20 megawatts per exaflop that we simply drove into the ground as a stake, as opposed to what the technology could do,” said Geist. “But in fact, the vendors who worked on and developed Frontier did an amazing job to make this happen.”
Geist also attributes the energy efficiency improvements to the DOE’s investments in the exascale development programs FastForward, DesignForward, and PathForward.
“It was [largely] because of this 10-year DOE investment that [participating] vendors were actually able to reduce the amount of energy their chips and memory need to do an exaflop of calculations to just 20 megawatts of power,” said Geist.
Geist’s energy efficiency math is based on peak flops (double precision), not Linpack. A conservatively estimated computing efficiency of 70 percent (Rmax/Rpeak) yields 1,050 Linpack petaflops at 29 megawatts, or 36.2 gigaflops per watt. At 80 percent computing efficiency, the energy efficiency rises to 41.4 gigaflops per watt. (The greenest supercomputers today are approaching 30 gigaflops per watt.) Perlmutter, the new #5 system installed at Berkeley Lab (which combines HPE, AMD and Nvidia technology and also uses a four-to-one GPU-to-CPU ratio), achieves 25.50 gigaflops per watt. Also note that ORNL has said that Frontier will be “more than” 1.5 exaflops.
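These efficiency figures follow from straightforward arithmetic. A small sketch, using the 1.5-exaflop and 29 MW numbers cited above (the computing-efficiency ratios are the article’s assumed values, not measured results):

```python
# Energy-efficiency estimates for Frontier from the figures in this article.
peak_exaflops = 1.5  # expected peak performance (double precision)
power_mw = 29.0      # expected power draw

def gigaflops_per_watt(computing_efficiency):
    """Estimated Linpack gigaflops per watt at a given Rmax/Rpeak ratio."""
    linpack_gflops = peak_exaflops * computing_efficiency * 1e9  # exaflops -> gigaflops
    watts = power_mw * 1e6                                       # megawatts -> watts
    return linpack_gflops / watts

print(round(gigaflops_per_watt(0.70), 1))  # -> 36.2
print(round(gigaflops_per_watt(0.80), 1))  # -> 41.4
```

The same function applied to a measured Rmax would give an actual Green500-style figure once the machine is benchmarked.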
Geist also highlighted the reliability improvements due to on-node flash memory, along with work by the vendors to make their networks and system software much more adaptive. (Failing and restarting gracefully is key.)
With Frontier, the memory wall problem was alleviated by using HBM on the GPUs. “Frontier has high-bandwidth memory (HBM) soldered directly onto the GPU,” said Geist. “That increases the bandwidth by an order of magnitude, so it goes a long way toward addressing this problem. And one of the things that comes with the high bandwidth is that the latency can be quite high, but the GPUs are actually very good at hiding that latency with their pipelines.”
There is much more interesting material in Geist’s presentation, such as the cosmic ray problem, lessons from Summit and Sierra, and a question-and-answer session. Watch the full talk here: https://vimeo.com/562917879