Grace, Hopper, NVSwitch detailed at Hot Chips


In four sessions over two days, senior NVIDIA engineers will describe innovations in accelerated computing for today’s data center and edge systems.

At a virtual Hot Chips event, an annual gathering of processor and system architects, they will announce performance numbers and other technical details for NVIDIA’s first server CPU, the Hopper GPU, the latest version of the NVSwitch interconnect chip, and the NVIDIA Jetson Orin System on Module (SoM).

The presentations offer new insights into how the NVIDIA platform will achieve new levels of performance, efficiency, scalability, and security.

In particular, the presentations will demonstrate a design philosophy of innovating across the entire stack of chips, systems and software, with GPUs, CPUs and DPUs acting as peer processors. Together they create a platform that is already running AI, data analytics and high-performance computing jobs at cloud service providers, supercomputing centers, enterprise data centers and in autonomous systems.

Inside NVIDIA’s first server CPU

Data centers require flexible clusters of CPUs, GPUs, and other accelerators that share vast pools of memory to deliver the energy-efficient performance today’s workloads demand.

To address this need, Jonathon Evans, a respected engineer and 15-year veteran at NVIDIA, will describe NVIDIA NVLink-C2C. It connects CPUs and GPUs at 900 gigabytes per second with 5x the energy efficiency of the existing PCIe Gen 5 standard, thanks to data transfers that consume just 1.3 picojoules per bit.
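The two figures quoted above can be combined into a rough power estimate for the link. The sketch below is only back-of-the-envelope arithmetic: it assumes decimal gigabytes (10^9 bytes), a link running at its full quoted bandwidth, and that the 5x efficiency claim applies directly to energy per bit. The function name and constants are illustrative and do not come from NVIDIA.

```python
# Figures quoted in the article.
BANDWIDTH_GB_PER_S = 900   # NVLink-C2C bandwidth, gigabytes per second
ENERGY_PJ_PER_BIT = 1.3    # energy per transferred bit, picojoules
EFFICIENCY_VS_PCIE5 = 5    # "5x the energy efficiency of PCIe Gen 5"


def link_power_watts(bandwidth_gb_per_s: float, energy_pj_per_bit: float) -> float:
    """Power spent moving data at full bandwidth (assumes 1 GB = 1e9 bytes)."""
    bits_per_second = bandwidth_gb_per_s * 1e9 * 8
    joules_per_bit = energy_pj_per_bit * 1e-12
    return bits_per_second * joules_per_bit


nvlink_c2c_watts = link_power_watts(BANDWIDTH_GB_PER_S, ENERGY_PJ_PER_BIT)

# Assumption: the 5x claim implies roughly 5 * 1.3 pJ/bit for PCIe Gen 5.
implied_pcie5_pj_per_bit = ENERGY_PJ_PER_BIT * EFFICIENCY_VS_PCIE5

print(f"NVLink-C2C at full rate: {nvlink_c2c_watts:.2f} W")
print(f"Implied PCIe Gen 5 energy: {implied_pcie5_pj_per_bit:.1f} pJ/bit")
```

Under these assumptions, moving data at the full 900 gigabytes per second costs on the order of 10 watts, which is small next to the power budget of the processors the link connects.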

NVLink-C2C combines two CPU chips to create the NVIDIA Grace CPU with 144 Arm Neoverse cores. It’s a processor built to solve the world’s biggest computing problems.

For maximum efficiency, the Grace CPU uses LPDDR5X memory. It enables one terabyte per second of memory bandwidth while keeping power consumption at 500 watts for the entire complex.

One link, many uses

NVLink-C2C also connects Grace CPU and Hopper GPU chips as memory-sharing peers in the NVIDIA Grace Hopper Superchip, providing maximum acceleration for performance-hungry tasks like AI training.

Anyone can build custom chiplets using NVLink-C2C to connect coherently to NVIDIA GPUs, CPUs, DPUs and SoCs, expanding this new class of integrated products. The interconnect supports the AMBA CHI and CXL protocols used by Arm and x86 processors, respectively.

First memory benchmarks for Grace and Grace Hopper.

To scale at the system level, the new NVIDIA NVSwitch connects multiple servers into one AI supercomputer. It uses NVLink interconnects running at 900 gigabytes per second, more than 7 times the bandwidth of PCIe Gen 5.
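The "more than 7 times" comparison can be sanity-checked with simple arithmetic. The article does not say which PCIe configuration it uses as a baseline; the sketch below assumes a x16 PCIe Gen 5 link at a nominal 64 gigabytes per second in each direction (128 gigabytes per second bidirectional, before protocol overhead), and the constant names are illustrative.

```python
# Figure quoted in the article.
NVLINK_GB_PER_S = 900

# Assumption (not stated in the article): a x16 PCIe Gen 5 link,
# nominally 64 GB/s per direction, counted in both directions.
PCIE5_X16_GB_PER_S_PER_DIRECTION = 64
PCIE5_X16_GB_PER_S_BIDIRECTIONAL = 2 * PCIE5_X16_GB_PER_S_PER_DIRECTION

bandwidth_ratio = NVLINK_GB_PER_S / PCIE5_X16_GB_PER_S_BIDIRECTIONAL
print(f"NVLink vs. PCIe Gen 5 x16: {bandwidth_ratio:.2f}x")
```

With that baseline the ratio comes out just above 7, consistent with the claim in the text.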

With NVSwitch, users can connect 32 NVIDIA DGX H100 systems into an AI supercomputer that delivers an exaflop of AI performance.

Alexander Ishii and Ryan Wells, both veteran NVIDIA engineers, will describe how the switch enables users to build systems with up to 256 GPUs to handle demanding workloads like training AI models with more than 1 trillion parameters.
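The 32-system and 256-GPU figures fit together once the GPU count per server is taken into account. The sketch below assumes eight H100 GPUs per DGX H100 system and treats an exaflop as 10^18 operations per second. The per-GPU number it derives is an implied average, not a published specification, and the article does not state which numeric precision the exaflop figure refers to.

```python
# Figures quoted in the article.
DGX_SYSTEMS = 32
TOTAL_AI_OPS_PER_S = 1e18  # "an exaflop"

# Assumption: each DGX H100 system contains eight H100 GPUs.
GPUS_PER_DGX_H100 = 8

total_gpus = DGX_SYSTEMS * GPUS_PER_DGX_H100

# Implied average per-GPU throughput, in petaflops (1e15 operations per second).
per_gpu_petaflops = TOTAL_AI_OPS_PER_S / total_gpus / 1e15

print(f"GPUs in the cluster: {total_gpus}")
print(f"Implied throughput per GPU: {per_gpu_petaflops:.2f} petaflops")
```

Under these assumptions the cluster holds 256 GPUs, matching the maximum system size described above, with each GPU contributing roughly 4 petaflops.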

The switch contains engines that accelerate data transfers using the NVIDIA Scalable Hierarchical Aggregation Reduction Protocol. SHARP is an in-network computing feature introduced in NVIDIA Quantum InfiniBand networks. It can double data throughput for communication-intensive AI applications.
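One way to see why in-network reduction can approach a 2x gain is to compare how much data each GPU must send during an all-reduce. The sketch below is a simplified traffic model, not a description of how NVSwitch or SHARP is implemented: it compares a textbook ring all-reduce, where each GPU sends 2(N-1)/N times the message size, with an idealized scheme in which each GPU sends its data once and the switch performs the reduction. The function names are hypothetical.

```python
def ring_allreduce_bytes_sent(n_gpus: int, message_bytes: float) -> float:
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (n_gpus - 1) / n_gpus * message_bytes


def in_network_bytes_sent(message_bytes: float) -> float:
    """Bytes each GPU sends when the switch does the reduction (idealized)."""
    return message_bytes


def traffic_ratio(n_gpus: int) -> float:
    """How much more data a GPU sends with a ring than with in-network reduction."""
    return ring_allreduce_bytes_sent(n_gpus, 1.0) / in_network_bytes_sent(1.0)


for n in (8, 256):
    print(f"{n} GPUs: ring sends {traffic_ratio(n):.3f}x the data per GPU")
```

In this model the ratio is 1.75 for 8 GPUs and approaches 2 as the GPU count grows, which is in line with the "can double data throughput" claim for communication-bound workloads.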

NVSwitch systems enable exaflop-class AI supercomputers.

Jack Choquette, a veteran engineer with 14 years of experience at the company, will provide a detailed tour of the NVIDIA H100 Tensor Core GPU, also known as Hopper.

In addition to using the new interconnects to scale to unprecedented heights, it offers many advanced features that increase the accelerator’s performance, efficiency and security.

Hopper’s new Transformer Engine and improved Tensor Cores deliver a 30x speedup over the previous generation in AI inference with the world’s largest neural network models. And it uses the world’s first HBM3 memory system, delivering a whopping 3 terabytes per second of memory bandwidth, NVIDIA’s largest generational increase ever.

New features include:

Choquette, one of the Nintendo 64 console’s leading chip designers early in his career, will also describe parallel computing techniques that underlie some of Hopper’s advances.

Michael Ditty, Orin’s chief architect and a 17-year veteran of the company, will provide new performance specifications for NVIDIA Jetson AGX Orin, an engine for edge AI, robotics, and advanced autonomous machines.

It integrates 12 Arm Cortex-A78 cores and an NVIDIA Ampere architecture GPU to deliver up to 275 trillion operations per second in AI inference jobs. That is up to 8 times more performance with 2.3 times more energy efficiency than the previous generation.

The latest production module offers up to 32 gigabytes of memory and is part of a compatible family that scales down to pocket-sized 5W Jetson Nano developer kits.

Performance benchmarks for NVIDIA Orin

All new chips support the NVIDIA software stack, which accelerates more than 700 applications and is used by 2.5 million developers.

Based on the CUDA programming model, it includes dozens of NVIDIA SDKs for vertical markets such as automotive (DRIVE) and healthcare (Clara), as well as technologies such as recommender systems (Merlin) and conversational AI (Riva).

The NVIDIA AI platform is available from all major cloud service providers and system manufacturers.

