One of the reasons why Intel paid $16.7 billion to acquire FPGA maker Altera six years ago is that companies would like to shift this work to network cards with their own much cheaper and much more energy-efficient processing.
We used to call these SmartNICs, which meant offloading and accelerating certain functions with a custom ASIC on the network card. We are now increasingly calling them DPUs, short for Data Processing Units, as these devices take a hybrid approach to their compute and acceleration, mixing CPUs, GPUs and FPGAs on the same device. Because it has to be different, Intel calls its offload devices, which are much more advanced SmartNICs, Infrastructure Processing Units, or IPUs. To avoid confusion, we’ll stick with the DPU name for everyone.
In any case, Intel unveiled three of its upcoming DPUs at its recent Architecture Day extravaganza, and its Data Platforms Group executives showed that they have in fact been on the road to Damascus over the past few years: they have not just stopped persecuting DPUs, but have embraced them fully. Well, it wasn’t so much a conversion as an injection of new people who brought new thinking, and that includes Guido Appenzeller, who is now Chief Technology Officer at what was formerly the Data Center Group. Appenzeller led the Clean Slate Lab at Stanford University, which gave rise to the OpenFlow software-defined networking control plane standard, and was the co-founder and CEO of Big Switch Networks (now part of Arista Networks). Appenzeller was for a while Chief Technology Strategy Officer in the Networking and Security division at VMware and was behind the open source network operating system project OpenSwitch, which was launched by Hewlett Packard Enterprise a few years ago.
Intel hasn’t talked much about offloading work from the CPU, because that’s heresy, even if it is happening and even if there are very good economic and security reasons for doing so. The metaphor for DPUs that Appenzeller came up with and discussed at Architecture Day is clever. It’s more about sharing resources and multi-tenancy than about getting better value for money across a cluster of systems, which in our opinion is the real driver behind the DPU. (This is hair-splitting, we know. Offloading network and storage to the DPU helps reduce latency, improve throughput, reduce costs, and provide secure multi-tenancy.)
“If you want to think about an analogy, it’s a bit like a hotel versus a single-family house,” explains Appenzeller. “In my home, I want it to be easy to move from the living room to the kitchen to the dining table. It’s very different in a hotel. The guest rooms as well as the dining room and the kitchen are neatly separated. The areas in which the hotel staff work are different from the areas in which the hotel guests stay. And you get a bed, in some cases you may want to switch from one to the other. And essentially, this is the same trend that we see in cloud infrastructure today.”
In Intel’s conception of the DPU, the IPU is where the control plane of the cloud service provider (what we call hyperscalers and cloud builders) runs, while the hypervisor and tenant code run on the CPU cores within the server chassis into which the DPU is plugged. Many would argue against this approach, and Amazon Web Services, which has perfected the art of the DPU with its “Nitro” SmartNICs, would be the first to raise an objection. All of the network and storage virtualization code for all EC2 instances runs on the Nitro DPU and, more importantly, so does the server virtualization hypervisor, with the exception of the tiniest bit of paravirtualized code, which has almost no overhead. The CPU cores are only intended to run operating systems and perform compute tasks. Nothing more.
In a way, as we’ve been saying for a while, a CPU is really a serial compute accelerator for the DPU. And not too far into the future, the DPU will have all of the accelerators attached to it over a high-speed fabric that allows everything to be disaggregated and composed, with the DPU, not the CPU, at the heart of the architecture. This is going too far for Intel, we suspect. But it makes more sense, and it fulfills much of the four-decade-old vision that “the network is the computer,” as espoused by former Sun Microsystems technologist John Gage. There will be more and more processing on the network, in DPUs and in the switches themselves, as this is the natural place for collective operations to be performed. Maybe they should never have been put on the CPU in the first place.
To be fair, Appenzeller admitted later in his talk, as you can see in the graphic above, that there is CPU offload that allows customers to “maximize the income from CPUs”. Intel has certainly done this for the past decade, but that strategy no longer works. That is one of the reasons Appenzeller was brought in from outside Intel.
And the data below from Facebook, which Appenzeller cited, makes it clear why Intel has changed its thinking, especially after watching over the past few years how AWS and Microsoft have fully embraced DPUs and how other hyperscalers and cloud builders have followed suit with varying levels of deployment and success.
This is perhaps a generous dataset, especially if you don’t factor in the overhead of a server virtualization hypervisor, as many large enterprises have to do, even if the hyperscalers and cloud builders tend to run bare metal with containers on top of it.
At the moment, Intel is only talking about DPUs based on CPUs, FPGAs and custom ASICs because the oneAPI software stack is not fully developed and there is no ecosystem of software running on GPU-accelerated devices. But over time, we believe that GPUs, which excel at certain types of parallel processing and can be reprogrammed faster than FPGAs, will be part of the DPU mix at Intel, as they are at Nvidia. It’s only a matter of time.
For now, two of the DPUs that Intel presented at Architecture Day are based on CPU and FPGA combinations: one called “Arrow Creek,” which is based on an FPGA/CPU SoC, and one called “Oak Springs Canyon,” which pairs an FPGA with an external Xeon D processor. The third is based on a custom ASIC code-named “Mount Evans” that Intel developed for a “top cloud provider” that remains unnamed.
Here are the Arrow Creek (left) and Oak Springs Canyon (right) cards that plug into PCI Express slots in servers:
And here’s a drill down on Arrow Creek’s features:
The Arrow Creek DPU has two 100 Gbps ports that use QSFP28 connectors and has an Agilex FPGA compute engine. The DPU has a two-port E810 Ethernet controller chip that hooks into eight lanes of PCI-Express 4.0 slot capacity, and the Agilex FPGA also has its own eight PCI-Express lanes; both run back into the server’s CPU complex over the PCI-Express bus. Arm cores embedded in the Agilex FPGA can perform modest compute tasks and have five memory channels (four plus one spare, it looks like) with a total capacity of 1 GB. The FPGA part of the Agilex device has four DDR4 memory channels with a combined capacity of 16 GB.
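To put those lane counts next to the port speeds, here is a back-of-the-envelope calculation. It is a sketch and not anything Intel has published: it assumes the standard PCI-Express 4.0 signaling rate of 16 GT/sec per lane with 128b/130b encoding, it ignores packet and protocol overhead (so real throughput will be lower), and the helper name `pcie4_gbps` is our own.

```python
# Rough PCI-Express 4.0 bandwidth check for the Arrow Creek lane counts.
# Assumption: 16 GT/s per lane and 128b/130b encoding; protocol overhead ignored.
PCIE4_GT_PER_LANE = 16.0
ENCODING_EFFICIENCY = 128 / 130


def pcie4_gbps(lanes: int) -> float:
    """Raw usable bandwidth per direction, in Gb/s, for a PCIe 4.0 link."""
    return lanes * PCIE4_GT_PER_LANE * ENCODING_EFFICIENCY


nic_link = pcie4_gbps(8)     # E810 Ethernet controller, eight lanes
fpga_link = pcie4_gbps(8)    # Agilex FPGA, eight lanes
port_total = 2 * 100.0       # two 100 Gbps QSFP28 ports

print(f"one x8 link:   {nic_link:.1f} Gb/s per direction")
print(f"both x8 links: {nic_link + fpga_link:.1f} Gb/s per direction")
print(f"network ports: {port_total:.1f} Gb/s per direction")
```

By this rough math, each x8 link on its own comes in under the 200 Gbps of aggregate port capacity, while the two links together exceed it.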
This Arrow Creek DPU is specifically designed for network acceleration workloads, including customizable packet processing done as a “bump in the wire,” as we have long called FPGA-accelerated SmartNICs. The device is programmable via the OFS and DPDK software development kits and comes with Open vSwitch and Juniper Contrail virtual switching, as well as SRv6 and vFW stacks, preconfigured in its FPGA logic gates. This is for workloads that change sometimes, but not very often, which is what we have said about FPGAs from the start.
Oak Springs Canyon is a little different as you can see:
The Xeon D processor’s feeds and speeds haven’t been revealed yet, but it probably has 16 cores, as many SmartNICs tend to these days. For all we know, the Xeon D CPU and Agilex FPGA could be on the same die (Intel has been working on this for years and promised such devices as part of the Altera acquisition back in 2015), or they could be integrated in a single socket with EMIB interconnects. The CPU and FPGA each have 16 GB of DDR4 memory over four channels, and the FPGA connects them to a pair of 100 Gbps QSFP28 ports.
The Oak Springs Canyon DPU can be programmed using the OFS, DPDK and SPDK toolkits and has integrated stacks for Open vSwitch virtual switching as well as the NVM-Express over Fabrics and RoCE RDMA protocols. Obviously, this DPU aims to accelerate networking and storage and to offload them from the CPU complex in the servers.
The third DPU, the Mount Evans device, is perhaps the most interesting, as it was jointly developed with this “top cloud provider” and incorporates a custom Arm processor complex and a custom network subsystem in the same package. Like this:
The network subsystem has four SerDes running at 56 Gbps, which deliver 200 Gbps full duplex and can be carved up and shared by four host servers. (The diagrams say they must be Xeons, but it seems unlikely that this is a requirement. Ethernet is Ethernet.) The network interface implements the RoCE v2 protocol to accelerate networking without involving the CPU (as is the case with RDMA implementations), and there is an NVM-Express offload engine so that the CPUs in the hosts do not have to deal with this overhead either. There is a custom programmable packet processing engine that is programmed in the P4 language, and we strongly suspect it is based on parts of the “Tofino” switch ASICs that came from Intel’s acquisition of Barefoot Networks more than two years ago. The network subsystem has a traffic shaping logic block to improve performance and reduce latency between the network and the hosts, and there is also a logic block that performs inline IPSec encryption and decryption at line rate.
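For readers who have not worked with P4, the programming model for such a packet engine boils down to a parser that extracts header fields followed by match-action tables that decide what to do with each packet. The sketch below mimics that model in plain Python. It is not P4 and says nothing about how Mount Evans is actually programmed; the class `MatchActionTable`, the actions `forward` and `drop`, and the helper `make_frame` are all hypothetical names of our own, and the parser assumes untagged Ethernet frames carrying IPv4.

```python
import ipaddress
import struct
from typing import Callable, Dict, Optional, Tuple

Verdict = Tuple[str, int]            # (action name, egress port or -1)
Action = Callable[[bytes], Verdict]


def forward(port: int) -> Action:
    """Action factory: send the frame out of the given port (hypothetical)."""
    return lambda frame: ("forward", port)


def drop(frame: bytes) -> Verdict:
    """Default action: discard the frame."""
    return ("drop", -1)


def parse_ipv4_dst(frame: bytes) -> Optional[ipaddress.IPv4Address]:
    """Parser stage: pull the IPv4 destination out of an untagged Ethernet frame."""
    if len(frame) < 34:                      # 14-byte Ethernet + 20-byte IPv4 header
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != 0x0800:                  # not IPv4
        return None
    return ipaddress.IPv4Address(frame[30:34])


class MatchActionTable:
    """Exact-match table keyed on IPv4 destination, with a default action."""

    def __init__(self, default_action: Action) -> None:
        self._entries: Dict[ipaddress.IPv4Address, Action] = {}
        self._default = default_action

    def add_entry(self, dst: str, action: Action) -> None:
        self._entries[ipaddress.IPv4Address(dst)] = action

    def apply(self, frame: bytes) -> Verdict:
        dst = parse_ipv4_dst(frame)
        # Unparseable frames yield None, which never matches and falls to default.
        return self._entries.get(dst, self._default)(frame)


def make_frame(dst_ip: str, ethertype: int = 0x0800) -> bytes:
    """Build a minimal test frame: zeroed MACs, ethertype, bare IPv4 header."""
    eth = bytes(12) + struct.pack("!H", ethertype)
    ip = bytearray(20)
    ip[0] = 0x45                             # version 4, header length 5 words
    ip[16:20] = ipaddress.IPv4Address(dst_ip).packed
    return eth + bytes(ip)


table = MatchActionTable(default_action=drop)
table.add_entry("10.0.0.2", forward(1))

print(table.apply(make_frame("10.0.0.2")))
print(table.apply(make_frame("10.0.0.9")))
```

In a real P4 target the table entries are installed by the control plane at runtime while the parser and actions are compiled into the pipeline, which is what lets the packet processing change without respinning the silicon.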
The compute complex on the Mount Evans device has 16 Neoverse N1 cores licensed from Arm Holdings, fronted by an undisclosed cache hierarchy and an unusual three DDR4 memory controllers (that’s not a very base-2 number). The compute complex also has a lookaside cryptography engine and a compression engine, which offload these two jobs from the host CPUs, as well as a management complex that allows the DPU to be managed out of band.
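The difference between a lookaside engine and the inline IPSec block in the network subsystem is worth a quick illustration. An inline engine transforms data as it flows past on the wire; a lookaside engine sits off to the side, and software hands it a buffer and collects the result later. The sketch below imitates the lookaside pattern in software, with a worker thread and Python’s `zlib` standing in for the hardware compression engine. The class name `LookasideCompressionEngine` is our own invention and has nothing to do with Intel’s actual driver interface.

```python
import zlib
from concurrent.futures import Future, ThreadPoolExecutor


class LookasideCompressionEngine:
    """Software stand-in for a lookaside engine: submit a buffer, collect it later."""

    def __init__(self) -> None:
        # One worker thread plays the role of the separate hardware block.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def submit(self, payload: bytes) -> Future:
        """Hand a buffer to the engine and return immediately with a handle."""
        return self._pool.submit(zlib.compress, payload)

    def close(self) -> None:
        self._pool.shutdown()


engine = LookasideCompressionEngine()
payload = b"telemetry " * 1000

pending = engine.submit(payload)     # the caller is free to do other work here
compressed = pending.result()        # collect the result when it is needed
engine.close()

print(f"{len(payload)} bytes in, {len(compressed)} bytes out")
```

The point of the pattern is that the cores submitting the work do not burn their own cycles on the transformation, which is the same argument being made for the DPU as a whole.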
It’s not clear what the workload is, but Intel says the programming environment will “leverage and extend” the DPDK and SPDK tools, presumably with P4. We strongly suspect that Mount Evans is used in Facebook microservers, but that’s just a guess. And we also strongly suspect that it will not be available to anyone other than the intended customer, which would be a shame. Hopefully we are wrong in this assumption.