Ampere’s Jeff Wittich: ‘AI Inference At Scale Will Really Break Things’


SANTA CLARA, Calif.—AI inference at scale is going to be a much bigger problem for data centers, especially in terms of power consumption, than training ever was, Ampere chief product officer Jeff Wittich told EE Times.

There has been a lot of emphasis on AI training, especially LLM training, in the last year or so, Wittich said. But the proliferation of open-source foundation models is shifting focus to inference. So, as AI infrastructure is built out, the majority will be for inference—not training.

“The scale-out inferencing problem is the one that will really break things,” Wittich said, noting that inference is about 85% of AI compute cycles today.

Jeff Wittich (Source: Ampere)

“The problem statement is totally different,” he added. “Training is more or less a supercomputer problem, and it might take months to run, and it might make sense to have dedicated infrastructure for that. Inference is a totally different task, it’s bigger in terms of overall compute cycles, but instead of one gigantic job that’s consuming a huge number of compute cycles, it’s billions of small jobs, each consuming a reasonable amount of compute cycles, but it adds up.”


The solution, Wittich said, could be CPUs. While AI inference is a broad application that will require a number of silicon solutions, CPUs have a significant part to play.

“For the vast majority of use cases, GPU-free AI inferencing is the optimal solution,” he said. “It’s much easier to run these models on a CPU because people are used to the technology, but they are more power efficient by nature and a lot more flexible. When you buy a GPU for one task, it’s the only task you can run.”

In the short term, flexibility may be required—infrastructure may need to run diverse workloads, he said. General-purpose solutions like CPUs can provide that flexibility.

“AI inference isn’t run in isolation,” he said. “Those inference results are going somewhere, they’re being served up via an application, some sort of web server, there’s an application layer, there are caching layers, there are databases, and other stuff running alongside that inference…and the balance between AI inference and these other tasks can change.”

While research-phase models will keep getting bigger, models intended for deployment will likely shrink as techniques like sparsification, pruning and quantization mature. That strengthens the case for CPUs.
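To make those techniques concrete, the sketch below shows roughly what pruning and INT8 dynamic quantization look like in PyTorch. The toy model and the 50% pruning ratio are hypothetical choices for illustration only; this is generic framework code, not Ampere's optimization pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical toy model standing in for a deployment-bound network.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Pruning: zero out the 50% smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the sparsity permanent

# Quantization: convert Linear weights to INT8 for CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 512))
print(out.shape)  # torch.Size([1, 10])
```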

Power consumption

AI training has caused a spike in the power consumed by data centers and has required huge investment in specialized hardware, mostly GPUs. Making even a small difference in inference power efficiency will have more of an impact than for training, since the workload is so much bigger overall, Wittich argued.

“If we can solve the inferencing problem and deliver inferencing at scale in a more efficient way, we’ll alleviate the power consumption problem,” he added.

Part of the problem is siloed decision-making inside cloud providers, where the person making decisions about infrastructure is usually not the same person responsible for choosing what kind of compute gets bought.

Investment in compute is driven by a mixture of customer demand and cloud provider choices, a dynamic Wittich said can be challenging to navigate.

“When you’re the infrastructure provider, our value really shines clearly because you can see the power savings and cost savings, and we have amazing traction in that space, but there’s still a lot of work to be done in informing the end user about why they should choose that infrastructure for their job,” he said.

192-core design

Ampere offers Arm-based data center CPUs with up to 192 cores today, supporting a range of AI-friendly data formats (FP32, FP16, BF16, INT16, INT8). AI applications are supported by Ampere's AI Optimizer (AIO) software layer, which performs model optimization and hardware mapping, including data reorganization and optimal instruction selection. It works seamlessly with TensorFlow and PyTorch models, Wittich said.
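As a hedged illustration of what CPU inference in one of those lower-precision formats looks like from the framework side, the snippet below runs a forward pass under BF16 autocast in plain PyTorch. The model (torchvision's resnet50, randomly initialized) is a stand-in chosen for this example; the code shows generic framework behavior, not AIO itself.

```python
import torch
from torchvision.models import resnet50

# Generic PyTorch CPU inference in BF16 (one of the formats listed above).
# Framework-level code only; it does not demonstrate Ampere's AIO layer.
model = resnet50(weights=None).eval()   # stand-in model, randomly initialized
batch = torch.randn(8, 3, 224, 224)

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(batch)

print(logits.shape)  # torch.Size([8, 1000])
```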

While Ampere offers tooling for porting customer code that has been optimized and honed over years for other architectures, AI inference code is relatively easy to port between different CPUs (and different ISAs) and other hardware, since AI models are built to be portable, Wittich said.

“It’s not that difficult, the switch from deploying on GPUs to CPUs, but there’s a psychological barrier—people think it’s going to be a lot harder than it is,” he said. “Many of our customers have done this and it’s not hard. AI is one of the easiest things to move over…because when you build a model in TensorFlow, it’s meant to be really portable, because you’re expecting to run this model all over the place. The AIO helps, but there’s not a big barrier there.”
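Wittich's portability point can be sketched as follows. He mentions TensorFlow; the same idea is shown here with PyTorch/TorchScript and a hypothetical toy model, since either framework can export a self-contained artifact that runs unchanged on the host that serves it, whether that is a GPU box or an Arm CPU server.

```python
import torch
import torch.nn as nn

# Hypothetical toy model; the exported artifact carries graph and weights,
# so the machine that built it and the CPU server that serves it do not
# need to share an architecture.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

traced = torch.jit.trace(model, torch.randn(1, 64))
traced.save("/tmp/portable_model.pt")              # export once

served = torch.jit.load("/tmp/portable_model.pt",  # load on the target host,
                        map_location="cpu")        # here pinned to the CPU
print(served(torch.randn(1, 64)).shape)            # torch.Size([1, 10])
```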

AI performance

Ampere’s slide deck shows the 128-core Ampere Altra Max CPU delivering comparable or better performance than other leading data center CPUs on inferences per second (albeit at different precisions) for various AI workloads: DLRM, BERT-Large, Whisper and ResNet-50, all relatively small models compared with today’s giant LLMs.

Each of Ampere’s 128 cores has two 128-bit vector units which run efficiently at high clock speeds. Software, including the AIO, is a big factor in Ampere’s AI performance, Wittich said, but Ampere’s general approach to efficient scaling, which helps with all workloads, is also paying off.

“If you have a large number of compute elements and you can scale out really easily across your CPU and not get bottlenecked, you can feed in a whole lot of data in a really efficient way, so you’re going to have an optimal inferencing solution as well,” he said.

Communication between cores (and/or between chiplets) can be a bottleneck for other CPU architectures, he added.

“This is something we’re really good at, because we ran into this problem from day one,” he said. “It isn’t a matter of: we used to build a 12-core CPU and now we’re trying to figure out how to make it 64 cores. On day one we had an 80-core CPU, so we had to solve this problem on day one.”

Per Ampere’s figures, Ampere Altra CPU instances in Oracle Cloud also compared favorably with AWS Nvidia A10 GPU instances on inferences per second per dollar. This is down to Ampere’s lower power consumption combined with savings on non-CPU costs in servers. Cloud providers can save money this way, though whether they pass those savings on to customers is up to them, Wittich said.
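For readers unfamiliar with the metric, the arithmetic behind an inferences-per-second-per-dollar comparison is simple. The numbers below are made up for illustration and are not Ampere's or AWS's published figures.

```python
# Hypothetical numbers only (not published figures from any vendor):
# illustrate how an inferences-per-second-per-dollar comparison works.
instances = {
    # name: (inferences per second, on-demand $ per hour)
    "cpu_instance": (900.0, 1.50),
    "gpu_instance": (1200.0, 3.00),
}

for name, (ips, dollars_per_hour) in instances.items():
    ips_per_dollar_hour = ips / dollars_per_hour
    print(f"{name}: {ips_per_dollar_hour:.0f} inferences/s per $/hour")
```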

Wittich’s hope is that cloud customers truly are interested in the carbon footprint of their compute, since power efficiency is where Ampere really shines, he said.

“Five years ago people told me over and over again that this [power] problem doesn’t matter,” he said. “Creating awareness that power consumption is going up, and it isn’t free and it’s not unlimited, I think is really important…we can’t let up on that front because while people care, when push comes to shove, cost still ends up becoming the highest priority.”
