Intel has launched a new generation of AI server chips that delivers a major leap in compute, memory, and networking capabilities, positioning them against NVIDIA’s GPUs for deep-learning training in data centers.
The company is wooing every major cloud-computing giant with the new Gaudi2, the second generation of the server chip that debuted last year in Amazon Web Services (AWS) cloud instances for AI model training. Intel also uses the chips, designed by its Habana Labs unit, in its own data centers to push the boundaries of autonomous driving and other areas.
Intel’s Mobileye unit uses Habana’s first-generation Gaudi accelerators to train the AI at the heart of its self-driving vehicles to sense and understand their surroundings. Gaby Hayon, Mobileye’s executive vice president of R&D, said that because training such models is time-consuming and expensive, Mobileye runs Gaudi both in the AWS cloud and on-premises in its own data centers, achieving “significant cost savings” compared to GPUs.
Hayon said Habana’s Gaudi accelerator cards enable “better time-to-market for existing models, or training much larger and more complex models aimed at taking advantage of the Gaudi architecture.”
Intel has also deployed more than a thousand eight-card Gaudi2 servers in its data centers to support research and development of the Gaudi2 software stack and to drive further advances in its next-generation Gaudi3.
AI cost savings
Gaudi2 is based on the same heterogeneous architecture as its predecessor. But Habana has upgraded to the 7nm node to pack more computing engines, on-chip and on-package memory, and networking into the chip.
According to Intel, Gaudi2 can run AI workloads faster and more efficiently than its predecessor while bringing big leaps in performance over NVIDIA’s A100 GPU. The key selling point, however, according to the company, is the reduction in total cost of ownership (TCO).
Last year AWS introduced a cloud computing service based on the first-generation Gaudi, claiming that customers would get up to 40% better performance per dollar than instances running on NVIDIA’s GPUs.
The Gaudi2 integrates 24 Ethernet ports right on the chip, each running up to 100 Gb/s of RoCE (RDMA over Converged Ethernet), versus 10 ports of 100GbE in the first generation. This eliminates the need for a separate NIC in each server, reducing system costs. Integrating the RoCE ports into the processor itself also gives customers the ability to scale to thousands of Gaudi2s over standard Ethernet.
“Reducing the number of components in the system reduces the TCO for the end customer,” said Eitan Medina, COO of Habana. Using Ethernet also allows customers to avoid lock-in with proprietary interfaces like NVIDIA’s NVLink GPU-to-GPU connection.
Most of the Ethernet ports are used to communicate with the other Gaudi2 processors in the server. The rest deliver 2.4 Tb/s of network throughput to other Gaudi2 servers in the data center or cluster.
Capturing market share from NVIDIA has been a challenge for Intel and other players in the AI chip landscape. The graphics chip giant has invested aggressively in its AI software tools, including its CUDA development kit, to run AI workloads on its GPUs.
Sandra Rivera, executive vice president and general manager of Intel’s data center and AI unit, said the AI chip market is expected to grow at about 25% per year over the next five years, reaching $50 billion.
In addition to building a range of server chips, including Habana’s AI accelerators and its Arc GPUs, Intel is trying to lure customers by giving open-source software development a higher priority.
Habana’s SynapseAI Software Development Kit (SDK) is open and freely available. Customers can use Habana’s software to translate workloads from PyTorch and TensorFlow to tap into the computing power of Gaudi2 and its 24 Tensor Processor Cores (TPCs), which are based on a VLIW (Very Long Instruction Word) architecture. The SDK includes Habana’s compiler, runtime, libraries, firmware, drivers, and other tools.
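To illustrate the idea, the sketch below shows how an existing PyTorch workload might be pointed at a Gaudi device. It is a minimal, hedged example: the `habana_frameworks` package name and the `"hpu"` device string follow Habana's public documentation, but the script is written to fall back to CPU so it still runs on machines without Gaudi hardware.

```python
import torch

# Attempt to load Habana's PyTorch bridge (shipped with the SynapseAI SDK).
# If it is not installed -- e.g. on a machine without Gaudi accelerators --
# fall back to the CPU so the same script remains runnable anywhere.
try:
    import habana_frameworks.torch.core as htcore  # noqa: F401
    device = torch.device("hpu")
except ImportError:
    device = torch.device("cpu")

# An ordinary PyTorch model and batch; only the device placement changes.
model = torch.nn.Linear(16, 4).to(device)
x = torch.randn(8, 16, device=device)
y = model(x)

print(device, y.shape)  # e.g. "cpu torch.Size([8, 4])" without Gaudi hardware
```

The point of this pattern is that the SDK aims to keep model code unchanged: the same `nn.Module` runs on CPU, GPU, or Gaudi, with the backend selected by the device string.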
Medina of Habana said the Gaudi2 aims to make AI training more accessible. “The software’s job is to hide the complexity of the underlying hardware and support customers where they are,” he added.