Roofline · arithmetic intensity · memory vs compute bound

HBM Bandwidth Console

Most AI kernels are limited by memory bandwidth, not FLOPS. Plot a workload on the roofline by its arithmetic intensity, find the ridge point, and see whether it's memory- or compute-bound — and how much of peak it can actually reach.

01 · Quick roofline

GPU & workload arithmetic intensity → bound & attainable.

H100: 990 TFLOPS · 3.35 TB/s HBM

Bound by: memory
Attainable: 1% of peak FLOPS

02 · Deep analysis

Roofline console

[Chart: roofline of attainable performance vs arithmetic intensity (FLOP/byte, log scale), split into memory-bound and compute-bound regions. Markers: ridge at 296 FLOP/byte, workload at 5 TF, peak at 990 TF.]
Ridge point: 296 FLOP/byte
Regime: memory
Attainable: 5 TF (1% of peak)
Peak / HBM BW: 990 TFLOPS / 3.35 TB/s
Memory-bound · 1% of peak

At 1.5 FLOP/byte — below the 296 ridge — this kernel reaches only 5 TFLOPS (1% of peak). It's starved for bandwidth; adding compute does nothing. Raise intensity (batching, fusion) or use more HBM bandwidth.

To become compute-bound at this intensity you'd need 660.0 TB/s of bandwidth — vs 3.35 available.
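
The readouts above follow directly from the two formulas the page uses (ridge = peak FLOPS ÷ peak bandwidth, attainable = min(peak FLOPS, intensity × bandwidth)). A minimal sketch in Python, assuming the H100 figures quoted by the console; the function name `roofline` and the dictionary keys are illustrative, not part of any tool:

```python
def roofline(peak_tflops: float, bandwidth_tbs: float, intensity: float) -> dict:
    """Evaluate the roofline model at one arithmetic intensity (FLOP/byte)."""
    # TFLOPS / (TB/s) = FLOP/byte, so the tera prefixes cancel.
    ridge = peak_tflops / bandwidth_tbs
    # Attainable performance is capped by whichever roof is lower.
    attainable = min(peak_tflops, intensity * bandwidth_tbs)
    return {
        "ridge_flop_per_byte": ridge,
        "regime": "memory" if intensity < ridge else "compute",
        "attainable_tflops": attainable,
        "fraction_of_peak": attainable / peak_tflops,
        # Bandwidth at which this intensity would sit exactly on the ridge.
        "bandwidth_to_be_compute_bound_tbs": peak_tflops / intensity,
    }


# Values shown by the console: 990 TFLOPS, 3.35 TB/s, 1.5 FLOP/byte.
result = roofline(peak_tflops=990.0, bandwidth_tbs=3.35, intensity=1.5)
for key, value in result.items():
    print(f"{key}: {value}")
```

With these inputs the sketch gives a ridge of about 296 FLOP/byte, about 5 TFLOPS attainable, and 660 TB/s required. The unrounded fraction of peak is close to 0.5%, so the console's 1% is presumably that value rounded to a whole percent.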

For LLM decode, batching lifts intensity — model the serving in the LLM Serving console; size HBM in HBM Cost.

Why it matters

Why bandwidth is the real bottleneck

The roofline has two regimes

Below the ridge point a kernel is memory-bound — limited by how fast data moves, not how fast the chip computes. Above it, compute-bound. The ridge = peak FLOPS ÷ peak bandwidth tells you which world you're in.

LLM decode is brutally memory-bound

Generating tokens one at a time has an arithmetic intensity near 1 — far below any GPU's ridge of hundreds — so decode runs at a tiny fraction of peak FLOPS. The bottleneck is HBM bandwidth, not compute.
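
To see where an intensity near 1 comes from, here is a rough sketch that counts FLOPs and bytes for one matrix-vector product, the core operation when decoding a single token. It assumes 16-bit (2-byte) weights and activations and that every weight is read from memory exactly once; the 8192 × 8192 layer size and the function name `gemv_intensity` are hypothetical:

```python
def gemv_intensity(rows: int, cols: int, bytes_per_value: float = 2.0) -> float:
    """FLOP/byte for y = W @ x when W is streamed from memory once."""
    flops = 2 * rows * cols                      # one multiply and one add per weight
    weight_bytes = rows * cols * bytes_per_value
    # The input and output vectors are moved too, but they are tiny next to W.
    vector_bytes = (rows + cols) * bytes_per_value
    return flops / (weight_bytes + vector_bytes)


# Hypothetical layer size, chosen only for illustration.
intensity_16bit = gemv_intensity(8192, 8192)
print(f"{intensity_16bit:.3f} FLOP/byte")
```

Two FLOPs per two-byte weight works out to just under 1 FLOP/byte regardless of layer size, which is why the figure applies to decode in general rather than to one model.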

More FLOPS without more bandwidth is wasted

For a memory-bound kernel, adding compute does nothing — only more bandwidth helps. This is why HBM generations (and their bandwidth) matter as much as the FLOPS headline for real AI workloads.

Batching moves you up the roofline

Raising arithmetic intensity — by batching, fusing kernels, or reusing data — pushes a memory-bound workload toward the ridge, recovering compute that was sitting idle. It's the main software lever.
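
The effect of batching can be sketched by extending the same FLOP and byte count to a batch of inputs that share one read of the weights. This is an illustration under simplifying assumptions: 2-byte values, a hypothetical 8192 × 8192 layer, the H100 figures from the console, and no KV-cache traffic (which in real decode grows per request and does not amortize the same way). The function names are made up for this example:

```python
PEAK_TFLOPS = 990.0                      # H100 figures quoted by the console
BANDWIDTH_TBS = 3.35
RIDGE = PEAK_TFLOPS / BANDWIDTH_TBS      # about 296 FLOP/byte


def batched_intensity(batch, rows, cols, bytes_per_value=2.0):
    """FLOP/byte for Y = W @ X with `batch` columns in X; W is read once."""
    flops = 2 * batch * rows * cols
    weight_bytes = rows * cols * bytes_per_value
    activation_bytes = batch * (rows + cols) * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


def min_batch_for_compute_bound(rows, cols, limit=4096):
    """Smallest batch whose intensity reaches the ridge, or None within `limit`."""
    for batch in range(1, limit + 1):
        if batched_intensity(batch, rows, cols) >= RIDGE:
            return batch
    return None


for batch in (1, 8, 64, 512):
    intensity = batched_intensity(batch, 8192, 8192)
    attainable = min(PEAK_TFLOPS, intensity * BANDWIDTH_TBS)
    print(f"batch={batch:4d}  intensity={intensity:7.1f}  attainable={attainable:6.1f} TFLOPS")

crossover = min_batch_for_compute_bound(8192, 8192)
print("first compute-bound batch size:", crossover)
```

Intensity grows almost linearly with batch size until activation traffic starts to matter, so under these assumptions the crossover lands a little above the ridge value itself.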

Field notes

The chip is usually waiting for memory

The headline number on an accelerator is its FLOPS, but for a great deal of real AI work that number is a fiction — the compute units sit idle, waiting for data to arrive from memory. The roofline model makes this concrete by plotting attainable performance against arithmetic intensity, the FLOPs a kernel does per byte it moves. The result has two regimes divided by a ridge point, and which side you're on determines everything about how to make it faster.

Below the ridge — which on a modern GPU sits at hundreds of FLOPs per byte — a kernel is memory-bound: its speed is the bandwidth times its intensity, and the expensive compute units are starved. Above the ridge it's compute-bound, finally saturating those units. The ridge itself is peak FLOPS divided by peak bandwidth, and because FLOPS have grown faster than bandwidth for years, that ridge keeps rising — pushing more and more kernels into the memory-bound regime where the bandwidth number, not the FLOPS number, is the performance number.
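
Written out as equations, the two statements in this paragraph look as follows. The symbols are notation introduced here for illustration: P_peak is peak FLOPS, B is peak bandwidth, I is arithmetic intensity, I* is the ridge point, and a and b are hypothetical growth factors for compute and bandwidth between two hardware generations.

```latex
% Roofline and ridge point
P(I) = \min\bigl(P_{\text{peak}},\; I \cdot B\bigr), \qquad
I^{*} = \frac{P_{\text{peak}}}{B}

% Fraction of peak reached in the memory-bound regime
\frac{P(I)}{P_{\text{peak}}} = \frac{I \cdot B}{P_{\text{peak}}} = \frac{I}{I^{*}}
\quad \text{for } I < I^{*}

% Ridge after compute grows by a and bandwidth by b
I^{*}_{\text{new}} = \frac{a \, P_{\text{peak}}}{b \, B} = \frac{a}{b} \, I^{*}
```

Whenever a > b the ridge moves right, and a kernel at fixed intensity reaches a smaller fraction of peak even though its absolute speed, I · B, still grows by b.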

The starkest example is LLM token generation. Generating one token reads the entire model's weights from HBM but does only a little arithmetic with each byte: an intensity near one, hundreds of times below the ridge. So decode runs at roughly 1% of the chip's peak FLOPS or less, bottlenecked on how fast weights stream out of memory. This is why two accelerators with very different FLOPS can serve tokens at nearly the same speed if their bandwidth is similar, and why HBM bandwidth gains matter as much as compute gains.
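
The claim about two accelerators can be checked with a back-of-envelope bound. The sketch below assumes single-stream decode (batch size 1), a hypothetical 7-billion-parameter model with 2-byte weights, roughly 2 FLOPs per parameter per token, and no KV-cache or other overheads. The second accelerator is invented for comparison: same bandwidth as the H100 figures above, twice the FLOPS.

```python
def decode_tokens_per_s(params, bytes_per_param, peak_tflops, bandwidth_tbs):
    """Upper bound on single-stream decode rate under the roofline model."""
    # Time to stream every weight from HBM once per token.
    memory_time = params * bytes_per_param / (bandwidth_tbs * 1e12)
    # Time to do the arithmetic: about 2 FLOPs (multiply + add) per weight.
    compute_time = 2 * params / (peak_tflops * 1e12)
    # The slower of the two sets the pace.
    return 1.0 / max(memory_time, compute_time)


PARAMS = 7e9  # hypothetical model size, for illustration only

chip_a = decode_tokens_per_s(PARAMS, 2, peak_tflops=990.0, bandwidth_tbs=3.35)
chip_b = decode_tokens_per_s(PARAMS, 2, peak_tflops=1980.0, bandwidth_tbs=3.35)  # 2x FLOPS, made up

print(f"chip A: {chip_a:.0f} tokens/s, chip B: {chip_b:.0f} tokens/s")
```

Both bounds come out identical because the memory term dominates in each case; doubling FLOPS only shrinks a term that was already negligible.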

The good news is that arithmetic intensity is a software lever. Batching (serving many requests per weight load), kernel fusion, tiling, and data reuse all raise the FLOPs per byte, shifting a workload rightward along the roofline toward the ridge and reclaiming idle compute. For LLM inference, batching is the dominant technique, which is exactly why serving throughput improves so much with concurrency. Model that in the LLM Serving console, and size the memory itself in the HBM Cost console.

Trusted by Kernel, Performance & Systems Teams

4.8
Based on 3,020 reviews

Ridge point, regime, and attainable percentage of peak in one screen is exactly the first analysis I do on any kernel. Seeing LLM decode at single-digit percent of peak FLOPS — purely bandwidth-bound — is the result that reframes where to optimize. Matches my profiler's roofline.

Dr. Hannah Lee
GPU kernel engineer
June 14, 2026

The 'more FLOPS is wasted on a memory-bound kernel' point is the one that changes hardware decisions — for our inference workload HBM bandwidth is the spec that matters, not TFLOPS. Batching to move up the roofline is the lever we pull, and this shows it. Pairs perfectly with the model-fit and serving tools.

Diego Fernández
ML performance
May 15, 2026

Clean roofline with GPU and op presets — decode vs GEMM vs elementwise is instantly clear. The required-bandwidth-to-be-compute-bound figure is a nice touch. Would love measured-intensity import, but as a bounds-and-direction tool it's exactly right.

Priya Menon
AI systems architect
March 25, 2026

Explaining to leadership why a faster-FLOPS GPU didn't speed up inference — because we're memory-bound — is a one-chart conversation here. The ridge point per GPU is the number. Fast, exact, and the regime call is always right.

Tom Wagner
Inference optimization
December 30, 2025

ridge = peak FLOPS ÷ peak bandwidth · attainable = min(peak FLOPS, intensity × bandwidth) · Last reviewed: 2026-06