What are GPUs doing in your data center?


A graphics processing unit performs fast math calculations to render graphics, and businesses often deploy GPUs to speed up workloads, particularly AI and machine learning. Heavy processing tasks of this nature often require multiple GPU chips, each with hundreds or thousands of cores, to do their job.

A GPU handles heavier mathematical calculations than a CPU by using parallel processing, in which many processing cores each take on a different part of the same task. It also has its own dedicated RAM for storing the data it processes.
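The idea of splitting one task across many processors can be sketched on the CPU side with Python's standard multiprocessing module; a real GPU applies the same pattern across thousands of lightweight cores. The function and chunk sizes below are illustrative, not any GPU API:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker squares and sums its own slice of the data.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    # Split the task into equal parts, one per processor, mirroring
    # how parallel hardware assigns data elements to cores.
    size = len(data) // workers
    chunks = [data[i * size:(i + 1) * size] for i in range(workers)]
    chunks[-1].extend(data[workers * size:])  # remainder goes to the last chunk
    with Pool(workers) as pool:
        # Workers run concurrently; the partial results are combined at the end.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

Each worker computes independently on its own slice, which is why this style of workload scales so well on a GPU's much larger core count.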

Implementing a GPU in your data center

You can implement a GPU in two ways. With the aftermarket approach, the smaller-scale of the two, you install the GPU subsystem as an upgrade on one or more existing servers in your data center. This approach is most popular with organizations early in the adoption cycle and those still experimenting with GPUs. However, this type of implementation places significant additional demands on the host servers, particularly their power and cooling capacity.

The other approach to implementation is to include GPUs as part of your server refresh cycles. This allows you to buy or lease new servers with pre-installed GPUs and matching power supplies. Organizations aiming for broader, durable GPU deployments often prefer this approach.

Your implementation approach depends on how you want your infrastructure to consume GPU resources and what workloads you want to run.

GPUs vs. CPUs

GPUs differ from other processing chips in a few ways.

The CPU – the predecessor of the GPU – uses logic circuitry to process commands sent to it by the operating system and performs arithmetic, logic, and I/O operations. It feeds data to specialized hardware and sends instructions to various parts of the computing system. It acts as the “brain” of the computer. The GPU was originally designed to complement the CPU, offloading more complex graphics rendering tasks and leaving the CPU free to handle other workloads. A key difference between the GPU and the CPU is that the GPU performs operations in parallel rather than serially.

The GPU and the CPU are both crucial for data analysis. The GPU’s ability to handle heavy math tasks like matrix multiplication can speed up deep learning and data analysis, but GPUs are poorly suited to smaller tasks like querying data in a database.

Instead of choosing between GPU and CPU chips for your data center, consider using them together. The chips were designed to complement each other, and the combination lets you significantly accelerate particularly heavy workloads, such as training machine learning models or running specialized analysis tasks.

Virtualization of GPUs

Virtualizing a GPU is not quite the same as virtualizing CPU or RAM. Each GPU is unique, which means designing, licensing, and deploying vGPUs requires a different approach.

For example, you can run different models of GPU cards in the same VMware cluster. However, each host in the cluster must run identical GPU cards internally: your hosts can differ from one another in GPU model, but only one model can be installed within a single host. You also need a license that allows drivers to access remote GPU features, and that license determines the features of each vGPU. If you have multiple types of GPUs in your cluster, you need an additional license to tie everything together.

Also, consider issues such as security, hardware host platforms, power requirements, and cooling requirements before deploying vGPUs in infrastructure.

Compare the top GPU offerings

Nvidia and AMD offer the two most popular GPU products on the market. Nvidia’s GPUs handle a range of tasks in data centers, including training and running machine learning models. Organizations also use Nvidia GPUs to speed up calculations in supercomputing simulations. Nvidia has also worked with its partner OmniSci to develop a platform that combines a GPU-accelerated database, rendering engine and visualization system for faster analytics results.

AMD’s GPUs, meanwhile, are aimed primarily at scientific workloads. Its GPU portfolio targets two separate markets: data center performance models and gaming-focused models. AMD offers an open software development platform called ROCm, which lets developers write and compile code for a variety of environments and supports popular machine learning frameworks.

AMD has a slight raw-performance advantage over Nvidia. However, Nvidia handles AI workloads better, builds more memory into its GPUs and offers a more mature programming framework.

When it comes to AMD and Nvidia’s vGPU offerings, the vendors take different deployment approaches. Nvidia’s vGPUs require installing host drivers within the hypervisor and assigning vGPUs to guest VMs. AMD takes a fully hardware-based approach, directly allocating a portion of the GPU’s cores to each virtual machine.
