NVIDIA’s Blackwell GPU: Driving the Future of AI

Using lower-precision numbers for the computations behind AI enables faster execution, particularly for inference. Reducing precision also makes it possible for larger, more complex AI models, measured by their number of parameters, to fit into the memory of a single GPU.
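
As a rough illustration of the memory math (a back-of-the-envelope sketch; the 70-billion-parameter figure is an arbitrary example, not an NVIDIA number):

```python
# Approximate memory footprint of model weights at different precisions.
# The parameter count is a hypothetical example; real deployments also
# need memory for activations, KV caches, and framework overhead.
PARAMS = 70e9  # a 70-billion-parameter model (hypothetical)

for fmt, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> gigabytes
    print(f"{fmt}: {gb:.0f} GB of weights")

# FP16: 140 GB -> spills across multiple GPUs
# FP8:   70 GB -> fits on one large-memory GPU
# FP4:   35 GB -> half that again
```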

The transformer engine in Blackwell pushes the envelope even further. The GPU uses what NVIDIA calls “micro-tensor scaling” to improve performance and accuracy to the point where it can run the computations at the heart of AI using floating-point numbers only 4 bits wide (FP4), which is a boon for AI inference. “That means we can deliver twice the amount of compute as before, we can double the effective bandwidth, and we can double the model size that can fit on an individual GPU,” said Buck.

The transformer engine inside the Blackwell GPU is also very fine-grained, said Buck. Rather than applying a single scale factor to an entire tensor, it can scale the numbers within much smaller building blocks inside the tensor itself.
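
NVIDIA hasn’t published the internals of micro-tensor scaling, but the general idea of per-block scaling can be sketched in a few lines (a minimal NumPy illustration; the block size and the signed-integer range are assumptions for the toy example, not Blackwell’s actual design):

```python
import numpy as np

def quantize_blockwise(x, block=32, max_q=7):
    """Toy block-wise quantization: one scale factor per small block of
    values rather than one scale for the whole tensor. Finer-grained
    scales let each block use the full range of a narrow format (here,
    a signed 4-bit-style range of -7..7)."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / max_q  # per-block scale
    q = np.clip(np.round(x / scales), -max_q, max_q)       # narrow integers
    return q, scales

def dequantize(q, scales):
    return q * scales

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64)).astype(np.float32)
q, s = quantize_blockwise(x)
err = np.abs(dequantize(q, s).reshape(x.shape) - x).mean()
print(f"mean absolute error with per-block scales: {err:.4f}")
```

The per-block scales are the point: one outlier value no longer forces the rest of the tensor into a tiny slice of the representable range.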

“Getting down to that level of fine granularity is a miracle in itself,” noted Buck. The B200 GPU can pump out up to 10,000 trillion operations per second (10 petaFLOPS) using the FP8 format. That translates to 2.5X faster inferencing than the Hopper GPUs the company released in 2022. Buck added that Blackwell-based GPUs can raise their performance to up to 20 petaFLOPS when using the more compact FP4 format for inference.
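
The arithmetic behind those figures is straightforward (a quick check; the ~4-petaFLOPS FP8 number for Hopper comes from NVIDIA’s published H100 specs, not from this article):

```python
# Where the 2.5X inference figure comes from (approximate).
B200_FP8_PFLOPS = 10   # cited above
H100_FP8_PFLOPS = 4    # assumption: NVIDIA's published H100 FP8 spec (~4 PFLOPS)

print(f"FP8 speedup vs. Hopper: {B200_FP8_PFLOPS / H100_FP8_PFLOPS:.1f}X")  # 2.5X

# Halving the precision again doubles the peak throughput:
print(f"FP4 throughput: {2 * B200_FP8_PFLOPS} petaFLOPS")  # 20
```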

NVIDIA said Blackwell adds “confidential computing” capabilities that keep AI models, and the data used to train them, secure. That is valuable in sectors such as healthcare and finance, where privacy matters.

One of the unfortunate tradeoffs with Blackwell is its power consumption. The B200 burns through 1000 W (1 kW), and the B100 fits into the same 700-W power envelope (TDP) as its predecessor. Designing a power delivery network (PDN) that can supply enough power to AI accelerators like those from NVIDIA is becoming a daunting task, as is dissipating the heat that the chips generate.
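
To put those power numbers in perspective at the system level (a back-of-the-envelope sketch; the eight-GPU node and the overhead factor are assumptions for illustration, not NVIDIA specs):

```python
# Back-of-the-envelope node power estimate (illustrative assumptions).
GPUS_PER_NODE = 8     # hypothetical eight-GPU server
GPU_TDP_W = 1000      # B200 TDP cited above
OVERHEAD = 1.4        # assumed factor for CPUs, NICs, DRAM, fans, VRMs

gpu_power_kw = GPUS_PER_NODE * GPU_TDP_W / 1000
node_power_kw = gpu_power_kw * OVERHEAD
print(f"GPU power alone: {gpu_power_kw:.1f} kW")
print(f"Estimated node power: {node_power_kw:.1f} kW")
# Nearly all of that ends up as heat the cooling system must remove.
```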

High-Bandwidth Connectivity: The Missing Link in AI

The B200 itself is only one piece of the puzzle. High-bandwidth, low-latency connectivity is a must-have for building the colossal systems that train and run the massive models behind the boom in generative AI.

NVIDIA uses the new generation of its chip-to-chip interconnect technology, NVLink, to lash together the two silicon dies in a package. The dies act as a single unified GPU to the software that runs on top of them, rather than as a pair of GPUs placed side by side. NVLink is also used to bridge longer distances in the data center, allowing high-speed communication among up to 576 GPUs.

Blackwell leverages the NVLink interconnect to transfer data at up to 1.8 TB/s bidirectionally between GPUs in the system, doubling the bandwidth of the NVLink in Hopper, according to the company.
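
What that bandwidth means in practice (a rough sketch; the 140-GB payload is an arbitrary example, roughly the FP16 weights of a 70-billion-parameter model):

```python
# Time to move a large payload between GPUs at Hopper vs. Blackwell rates.
PAYLOAD_GB = 140  # hypothetical payload (e.g., FP16 weights of a 70B model)

for gen, gb_per_s in [("Hopper NVLink (900 GB/s)", 900),
                      ("Blackwell NVLink (1.8 TB/s)", 1800)]:
    print(f"{gen}: {PAYLOAD_GB / gb_per_s * 1000:.0f} ms")
# Doubling the link bandwidth halves the transfer time, which matters
# when tensors are exchanged between GPUs on every training or
# inference step.
```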

Also revealed at GTC: the Blackwell GPU adds another engine that runs diagnostics and tests for potential reliability risks. The new engine performs AI-based preventative maintenance at the chip level, monitoring “every single gate and every bit of memory on the chip and connected to it” to identify any weak points. The objective is to increase uptime and improve resiliency for large-scale systems that tend to run uninterrupted for weeks or months at a time to train the largest ML models.

While it’s the dominant force in the world of AI silicon, NVIDIA is trying to stay a step ahead of rivals Intel (with its Gaudi 2 and future Gaudi 3 accelerators) and AMD (with its Instinct GPUs). It’s also contending with cloud giants such as AWS, Google, and Microsoft, which are rolling out in-house AI silicon for training and inferencing. On top of that, Cerebras Systems and many other startups are trying to close the gap with NVIDIA.

The company is out to fend off the competition with its new Blackwell GPUs. To make it happen, NVIDIA plans to package them with a broad range of connectivity and software technologies into supercomputer-class systems.

The Blackwell GPUs are also at the heart of the company’s new GB200 “superchips.” These are modules that connect its “Grace” Arm CPU and a pair of the Blackwell GPUs over a 900-GB/s NVLink interconnect.

The superchips are, in turn, arranged into larger systems, called the DGX GB200. Up to eight of these systems can be linked with NVLink to create a supercomputer-class system called a SuperPOD that contains 576 Blackwell GPUs.
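
The 576-GPU figure falls out of the arithmetic (a quick check; the 36-superchips-per-system count comes from NVIDIA’s DGX GB200 specs, not from this article):

```python
# How the SuperPOD's 576-GPU count is assembled from the building blocks.
GPUS_PER_SUPERCHIP = 2    # each GB200 pairs two Blackwell GPUs with one Grace CPU
SUPERCHIPS_PER_DGX = 36   # assumption: per NVIDIA's DGX GB200 spec
DGX_PER_SUPERPOD = 8      # cited above

print(GPUS_PER_SUPERCHIP * SUPERCHIPS_PER_DGX * DGX_PER_SUPERPOD)  # 576
```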