Intel Rolls Out Gaudi 3 Accelerator Chip for Large AI Clusters

Intel is challenging NVIDIA’s crown in artificial-intelligence (AI) silicon with its latest AI accelerator for data centers: the Gaudi 3.

The technology firms on the front lines of the AI boom are lashing together tens of thousands of chips over sprawling high-bandwidth networks to train and run large language models (LLMs) being created by Google, Meta, OpenAI, and a growing crowd of AI startups. Intel said the next-gen Gaudi is expressly designed to be assembled into these vast AI clusters for training and inferencing AI models with up to trillions of parameters.

The Gaudi 3 ushers in improvements to everything from the transistors on out to the accelerator cores, the networking silicon, and the high-bandwidth memory (HBM) surrounding it all, significantly boosting performance. While it’s based on the same fundamental architecture as the Gaudi 2, Intel said the Gaudi 3 delivers 2X more performance when computing with the 8-bit floating-point format called FP8, and 4X more when using the higher-precision, 16-bit floating-point format called BF16.
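As a rough illustration of why narrower formats pay off, the sketch below compares bytes per value for the formats in play. The bit layouts are the standard conventions for these formats; the takeaway is simply that halving the width doubles how many values fit in a fixed memory or bandwidth budget (the 2X and 4X figures above are Intel's claims, not derived here).

```python
# Storage cost per value for the data formats discussed above.
# Bit layouts are the standard conventions for these formats; the
# point is that halving the width doubles how many values fit in
# any fixed memory or bandwidth budget.

formats = {
    "FP32": 4,  # 1 sign, 8 exponent, 23 mantissa bits
    "BF16": 2,  # 1 sign, 8 exponent, 7 mantissa bits
    "FP8":  1,  # e.g., E4M3: 1 sign, 4 exponent, 3 mantissa bits
}

budget_gb = 1  # any fixed budget; the relative ratios are what matter
for name, nbytes in formats.items():
    values = budget_gb * 1e9 / nbytes
    print(f"{name}: {nbytes} B/value -> {values:.2e} values per GB")
```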

Gaudi 3, which is assembled out of 10 separate slabs of silicon occupying the same package, also features 2X the networking bandwidth and 1.5X the HBM capacity of the Gaudi 2, which was released in 2022.

According to Intel, the new state-of-the-art AI accelerator stands out for its ability to scale flexibly from a single node to large-scale clusters connected over Ethernet. “Gaudi is a very unique accelerator in that it integrates not only the compute and memory, but also network interface ports that are used for both scaling up and scaling out,” said Eitan Medina of Habana Labs, the unit behind Intel’s Gaudi family of AI chips.

The next-gen Gaudi 3 integrates 24 200-Gb/s networking interfaces based on RDMA over Converged Ethernet (RoCEv2), doubling the bandwidth of the 24 100-Gb/s Ethernet ports in its predecessor and taking the place of the network interface cards (NICs) in the system. It uses industry-standard Ethernet to interact with other Gaudi accelerators in the same server, in the same rack, and even in other racks in the data center.
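A quick back-of-envelope, using only the port counts and speeds quoted above, shows what those links add up to per accelerator. These are raw directional link totals; usable throughput will be lower once RoCEv2 protocol overhead and network topology are accounted for.

```python
# Aggregate Ethernet bandwidth per accelerator, from the quoted
# port counts and speeds. Real-world throughput will be lower once
# protocol overhead and topology are accounted for.

def aggregate_tbps(ports: int, gbps_per_port: int) -> float:
    """Total link bandwidth in Tb/s, one direction."""
    return ports * gbps_per_port / 1000

gaudi2 = aggregate_tbps(24, 100)  # 2.4 Tb/s
gaudi3 = aggregate_tbps(24, 200)  # 4.8 Tb/s

print(f"Gaudi 2: {gaudi2:.1f} Tb/s (~{gaudi2 * 125:.0f} GB/s) per direction")
print(f"Gaudi 3: {gaudi3:.1f} Tb/s (~{gaudi3 * 125:.0f} GB/s) per direction")
print(f"Generational gain: {gaudi3 / gaudi2:.0f}X")
```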

Intel revealed Gaudi 3 at the company’s recent Vision event in Phoenix, Arizona.

Gaudi 3: More Cores, More Chiplets, More Performance

The Gaudi 3 comprises a pair of heterogeneous chiplets that together contain all of the functions of the high-performance SoC, including the AI accelerator cores, on-chip memory, networking, and connectivity to the HBM.

These slabs of silicon are based on a 5-nm process technology from TSMC, bringing a large generational leap in performance over the transistors in the second-generation Gaudi 2, which was built on the 7-nm process. Partitioning the processor into a pair of chiplets that are mirror images of each other, then packaging them to behave like a single chip, lets Intel pack in more total silicon, and thus more transistors, than a single monolithic die could hold.

The heterogeneous compute engine at the heart of the Gaudi 3 consists of 64 next-gen programmable Tensor processor cores (TPCs) devoted to AI, up from the 24 TPCs in the second generation. It’s also equipped with eight matrix multiplication engines (MMEs). Every MME is composed of a 256-by-256 grid of smaller compute elements, letting it execute roughly 64,000 multiply-accumulate (MAC) operations per cycle and giving it a high degree of computational efficiency when carrying out the matrix operations at the heart of machine learning.
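Putting those figures together, the short sketch below tallies the per-cycle math the quoted MME configuration implies. The grid size and engine count come from the article; counting one MAC as two floating-point operations is the usual convention.

```python
# Peak per-cycle math implied by the quoted MME configuration.
# Counting one multiply-accumulate (MAC) as two FLOPs is the
# standard convention; everything else comes from the figures above.

GRID = 256 * 256     # MACs per MME per cycle (65,536, i.e., ~64K)
MMES = 8             # matrix multiplication engines per chip
FLOPS_PER_MAC = 2    # one multiply plus one add

macs_per_cycle = GRID * MMES
flops_per_cycle = macs_per_cycle * FLOPS_PER_MAC

print(f"MACs/cycle:  {macs_per_cycle:,}")   # 524,288
print(f"FLOPs/cycle: {flops_per_cycle:,}")  # 1,048,576
```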

Though it lacks the throngs of accelerator cores in the latest data-center GPUs, Intel said the Gaudi 3 integrates a smaller number of larger matrix multiplication units so that it can feed them data faster and more efficiently.

The accelerator delivers up to 1,835 trillion floating-point operations per second (TFLOPS) of performance when it carries out AI operations at FP8, which is approximately 2X more than the Gaudi 2. These smaller data formats are faster and more energy-efficient to compute, and they require smaller amounts of memory. As a result, they’re favored for training transformers, a type of neural network that’s widely used for generative AI. NVIDIA can also run AI computations at FP8 in its Hopper H100 GPU—the current gold standard in AI silicon.
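Dividing that quoted peak by the per-cycle math from the previous sketch gives a rough sanity check. The implied clock is an inference from the published numbers, not a figure Intel has stated, and it assumes the FP8 rate maps one-to-one onto the MAC grid.

```python
# Rough sanity check: what clock would the quoted FP8 peak imply?
# This back-solves from public figures; the actual MME clock and the
# FP8 issue rate per grid cell are not stated in the article.

PEAK_FP8_TFLOPS = 1835                 # quoted FP8 peak
FLOPS_PER_CYCLE = 8 * 256 * 256 * 2    # 8 MMEs, 256x256 MACs, 2 FLOPs/MAC

implied_ghz = PEAK_FP8_TFLOPS * 1e12 / FLOPS_PER_CYCLE / 1e9
print(f"Implied MME clock: ~{implied_ghz:.2f} GHz")  # ~1.75 GHz
```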

The Gaudi 3 is bordered by eight 16-GB HBM chips on the same package, totaling 128 GB of enhanced HBM2E, up from 96 GB in its predecessor. Memory bandwidth clocks in at 3.7 TB/s, up from 2.4 TB/s. Co-packaging more memory with the accelerator chip itself means that larger, more advanced AI models—or larger portions of them—can be crammed into the chip, saving power and aiding performance.
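For a sense of scale, the arithmetic below checks how a large model's weights compare with that 128 GB of HBM. The 70-billion-parameter size is a hypothetical example chosen for illustration, not a model Intel benchmarked, and weights are only part of the footprint.

```python
# Does a large model's weight footprint fit in 128 GB of HBM?
# The 70B-parameter size is an illustrative assumption; activations,
# KV caches, and optimizer state would add to these totals.

HBM_GB = 128
params = 70e9  # hypothetical 70-billion-parameter model

for fmt, bytes_per_param in [("BF16", 2), ("FP8", 1)]:
    weights_gb = params * bytes_per_param / 1e9
    verdict = "fits" if weights_gb <= HBM_GB else "exceeds HBM"
    print(f"{fmt}: ~{weights_gb:.0f} GB of weights -> {verdict}")
```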

The chip also doubles the on-chip memory, to 96 MB of SRAM. Because on-chip capacity is inherently limited, the expanded HBM remains vital for reducing the latency and power of training and inference.

Ethernet: The Backbone of Intel’s Next-Gen Gaudi 3

While the AI-dedicated accelerator cores and high-bandwidth memory are the brains of the Gaudi 3, there’s more to the mix. Intel said its most distinctive feature is its massive, flexible on-chip networking capability.

The most advanced AI models are expanding by an order of magnitude with every generation. In that context, high-bandwidth, low-latency networking technologies that can ferry data between AI accelerators in the same server—also called “scale up” in the parlance of the semiconductor industry—and between the servers and racks that they’re assembled into—also called “scale out”—are becoming a bigger piece of the puzzle in AI.

NVIDIA uses its NVLink interconnect to tie together GPUs within the same server and the same rack. To link up larger clusters of tens of thousands of its AI chips, the company leverages its InfiniBand networking technology.
