Training base models for machine learning, sometimes with billions of parameters, requires significant computational power. For example, the largest version of GPT-3, OpenAI's famous large language model, has 175 billion parameters and demands exceptionally powerful hardware. The model was trained on an AI supercomputer that Microsoft built specifically for OpenAI, containing more than 285,000 CPU cores, 10,000 GPUs, and 400 Gb/s of InfiniBand connectivity for each GPU server.
These bespoke, high-performance computing systems are expensive and often inaccessible to people outside of a data center or research facility. Researchers from IBM and PyTorch want to change that.
IBM announced that it is working with a distributed team within PyTorch, the open source ML platform powered by the Linux Foundation, to enable training of large AI models on affordable network hardware such as Ethernet. Additionally, the company has built an open source operator to optimize PyTorch deployments on Red Hat OpenShift on IBM Cloud.
Using PyTorch's FSDP (Fully Sharded Data Parallel), an API for sharded data-parallel training, the team successfully trained models with 11 billion parameters across a multi-node, multi-GPU cluster using standard Ethernet networking on the IBM Cloud. According to IBM, this method can train models with roughly 12 billion or fewer parameters at about 90% scaling efficiency, without the expense of HPC networking systems.
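Scaling efficiency here means the fraction of ideal linear speedup actually achieved: the throughput measured on the full cluster divided by the single-GPU throughput times the GPU count. A quick illustration with hypothetical throughput numbers (not IBM's measured figures):

```python
# Scaling efficiency: achieved multi-GPU throughput divided by ideal
# (linear) scaling from single-GPU throughput.
# The token-per-second numbers below are hypothetical, for illustration only.

def scaling_efficiency(single_gpu_tps: float, n_gpus: int, measured_tps: float) -> float:
    """Fraction of the ideal linear speedup actually achieved."""
    ideal_tps = single_gpu_tps * n_gpus
    return measured_tps / ideal_tps

# Example: one GPU processes 1,000 tokens/s; 64 GPUs together reach 57,600 tokens/s.
eff = scaling_efficiency(1_000, 64, 57_600)
print(f"{eff:.0%}")  # prints "90%" -- 90% of ideal linear scaling
```

At 90% efficiency, adding hardware still buys nearly proportional speedup, which is the basis for the claim that commodity Ethernet can stand in for HPC interconnects at this model scale.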
"Our approach achieves training models of this scale on par with HPC network systems, making HPC network infrastructure virtually redundant for small and medium-sized AI models," said IBM's Mike Murphy in a company blog post.
Murphy describes the infrastructure used for this work as "essentially off-the-shelf hardware" running on the IBM Cloud and consisting of 200 nodes, each with eight Nvidia A100 80GB GPUs, 96 vCPUs, and 1.2TB of CPU RAM. The GPU cards within individual nodes are connected via NVLink with a card-to-card bandwidth of 600 GB/s, and the nodes are connected via two 100 Gb/s Ethernet links using an SR-IOV-based TCP/IP stack, which Murphy says has a usable bandwidth of 120 Gb/s (although he notes that researchers observed peak network bandwidth utilization of only 32 Gb/s for the 11B model).
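Murphy's figures imply the Ethernet fabric still had considerable headroom: the observed peak of 32 Gb/s is just over a quarter of the 120 Gb/s of usable bandwidth. A back-of-envelope check:

```python
# Back-of-envelope check on the reported network figures:
# two 100 Gb/s Ethernet links, ~120 Gb/s usable, 32 Gb/s peak observed.
usable_gbps = 120
peak_observed_gbps = 32

utilization = peak_observed_gbps / usable_gbps
print(f"peak utilization: {utilization:.1%}")  # prints "peak utilization: 26.7%"
```

That roughly 27% peak utilization suggests the 11B-parameter workload was not network-bound on this fabric, consistent with the claim that standard Ethernet suffices at this model scale.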
This GPU system configured with OpenShift has been running since May. The research team is currently building a production-ready software stack for end-to-end training, tuning and inference of large AI models.
Although this research was conducted using an 11 billion parameter model rather than a GPT-3 sized model, IBM hopes to scale this technology to larger models.
“We believe this approach is the first in the industry to achieve scaling efficiencies for models with up to 11 billion parameters using Kubernetes and PyTorch’s FSDP APIs with standard Ethernet,” said Murphy. “This will allow researchers and organizations to train massive models on any cloud in a far more cost-effective and sustainable way. In 2023, the goal of the combined team is to further scale this technology to handle even larger models.”