When done right, the high-performance data centers in academic and government institutions around the world should be on the cutting edge of all new technologies that enhance the performance of simulation, modeling, analytics, and artificial intelligence. Not the bleeding edge where the hyperscalers and national labs live, but something back from the riskiest part of the blade.
And so, with a nod of approval, we see disaggregated and composable infrastructure taking hold in HPC, with aspiring composable fabric maker Liqid once again landing a big deal to test the ideas embodied in its Matrix fabric and Liqid Command Center controller.
In this case, the Liqid disaggregation and composability software and its PCI-Express switching fabric form the heart of a prototype system called Accelerating Computing for Emerging Sciences, or ACES for short, which is funded by the National Science Foundation. The ACES machine is being developed and used by researchers from the University of Illinois, Texas A&M University, and the University of Texas, and is installed at Texas A&M alongside the Dell-built, 6.2 petaflops Grace supercomputer, which consists of 800 all-CPU compute nodes with Intel “Cascade Lake” Xeon SP processors plus 100 hybrid CPU-GPU nodes with the same Xeon SPs plus a pair of Nvidia “Ampere” A100 GPU accelerators. (The system also has eight fat-memory nodes, with 3 TB instead of 384 GB of memory per node, and eight inference nodes with Nvidia T4 GPU accelerators.)
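Those node counts make the 6.2 petaflops rating easy to sanity-check. Here is a minimal back-of-the-envelope sketch; the per-device FP64 figures are our assumptions pulled from public spec sheets (roughly 3.8 teraflops for a dual-socket Cascade Lake node, and 9.7 or 19.5 teraflops per A100 depending on whether the FP64 tensor cores are counted), not numbers from the NSF award documents:

```python
# Rough peak FP64 tally for the Grace cluster; all per-device numbers are
# assumptions from public spec sheets, not from the NSF award documents.
CPU_NODE_TF = 3.8        # assumed dual-socket "Cascade Lake" node, FP64 teraflops
A100_TF = 9.7            # A100 peak FP64 teraflops (FMA units)
A100_TENSOR_TF = 19.5    # A100 peak FP64 teraflops (tensor cores)

cpu_pf = 800 * CPU_NODE_TF / 1000                       # all-CPU partition
gpu_pf_low = 100 * (CPU_NODE_TF + 2 * A100_TF) / 1000   # hybrid nodes, FMA rate
gpu_pf_high = 100 * (CPU_NODE_TF + 2 * A100_TENSOR_TF) / 1000

total_low = cpu_pf + gpu_pf_low
total_high = cpu_pf + gpu_pf_high
print(f"peak FP64: {total_low:.1f} to {total_high:.1f} petaflops")
```

Depending on which A100 rate is counted, the tally brackets the quoted 6.2 petaflops, with the fat-memory and T4 inference nodes adding a little more on top.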
As you might expect with two Texas universities involved, there was a pretty good chance that Dell would be the prime contractor for the ACES machine, and indeed it is. The exact configuration of the hardware in ACES has not yet been determined, but it is known that the host processors are Intel “Sapphire Rapids” Xeon SP processors, and the NSF award documents indicate that they are the variants with HBM2 memory on the package, which we have discussed here, and which obviously have PCI-Express 5.0 controllers that link into a PCI-Express 5.0 fabric. The compute engines in the ACES system also include Intel Agilex FPGAs and “Ponte Vecchio” Xe HPC GPU accelerators.
And with heterogeneity and experimentation being a key mission for the ACES prototype, the machine will also include Aurora vector engines from NEC, IPU engines from Graphcore, and custom compute ASICs (about which very little is known) from NextSilicon, an Israeli chip startup that has raised $200 million and is now valued at $1.5 billion – not bad for a company almost no one knows about.
The compute elements and the storage in the racks are linked over PCI-Express switched fabrics and managed by Matrix, and the nodes are also connected to each other and to external storage through 400 Gb/sec Quantum 2 NDR InfiniBand links. The plan is to have Optane memory banks in the racks and to hook an external parallel Lustre file system to the cluster.
The ACES machine is managed under the auspices of the NSF’s Office of Advanced Cyberinfrastructure and has $5 million per year for hardware and $1 million per year for operations, power, and cooling between 2022 and 2026 inclusive. ACES is expected to be operational by September 2022.
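Those annual figures work out to a simple total over the award period; a trivial sketch of the arithmetic:

```python
# Total ACES funding implied by the annual NSF figures above.
hardware_per_year = 5_000_000    # hardware budget, dollars
operations_per_year = 1_000_000  # operations, power, and cooling, dollars
years = 2026 - 2022 + 1          # 2022 through 2026 inclusive

total = years * (hardware_per_year + operations_per_year)
print(f"{years} years at $6 million per year is ${total:,} all told")
# → 5 years at $6 million per year is $30,000,000 all told
```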
At the heart of the system is the Liqid Matrix stack, and as we have discussed before, a supercomputer’s malleability could turn out to be more important than its raw feeds and speeds in the years to come, and the ACES prototype is testing this idea in the field. For those of you unfamiliar with Liqid, we profiled the company when it came out of stealth in June 2017, talked about its three big system wins with the U.S. Army last fall, and then laid out a kind of mission statement on disaggregation and composability as 2021 kicked off, which included thoughts on the second wave of composability, championed mainly by Liqid and GigaIO.