Question 1

What is the roofline model?

Accepted Answer

The roofline model is a visual way to find a kernel's performance ceiling on a given processor. It plots attainable performance (FLOPS) against arithmetic intensity (FLOPs per byte of memory traffic). The 'roof' has two parts: a sloped line (peak bandwidth × arithmetic intensity) where performance is limited by memory, and a flat line (peak compute) where it's limited by the processor. Where they meet is the ridge point. A kernel's arithmetic intensity tells you which regime it's in and the maximum performance it can reach. This calculator computes the ridge, the regime, and the attainable performance for your workload and GPU.
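
As a minimal sketch of the two-part roof described above, assuming peak compute in FLOP/s and peak bandwidth in bytes/s (the function name and the processor figures are illustrative, not the calculator's own code):

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline ceiling in FLOP/s for a kernel with the given intensity (FLOP/byte)."""
    memory_roof = intensity * peak_bandwidth  # sloped part of the roof
    return min(peak_flops, memory_roof)       # flat part caps it at peak compute


# Hypothetical processor: 100 TFLOP/s peak compute, 1 TB/s peak bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

low = attainable_flops(2.0, PEAK_FLOPS, PEAK_BW)     # left of the ridge: memory-limited
high = attainable_flops(500.0, PEAK_FLOPS, PEAK_BW)  # right of the ridge: compute-limited
print(f"{low / 1e12:.1f} TFLOP/s, {high / 1e12:.1f} TFLOP/s")
```

The low-intensity kernel lands on the sloped line and the high-intensity kernel on the flat line.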

Question 2

What is arithmetic intensity?

Accepted Answer

Arithmetic intensity is the ratio of compute to memory traffic: floating-point operations performed per byte read from or written to memory (FLOP/byte). A kernel that does many operations on each byte (like a large matrix multiply reusing data) has high arithmetic intensity; one that touches each byte only a few times (like an elementwise operation or token-by-token LLM decode) has low intensity. It's the single number that determines whether a kernel is limited by compute or by memory bandwidth, which is why it's the x-axis of the roofline and the key input here.
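
To make the contrast concrete, here is a rough sketch that estimates intensity for a matrix multiply and for an elementwise add. It assumes ideal reuse (each matrix moves to or from memory exactly once) and 2-byte elements; real kernels usually move more data than this, so treat the GEMM figure as an upper estimate. The function names are illustrative.

```python
def gemm_intensity(m, n, k, bytes_per_element=2):
    """FLOP/byte for C[m,n] = A[m,k] @ B[k,n], assuming each matrix moves once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    traffic = bytes_per_element * (m * k + k * n + m * n)  # read A and B, write C
    return flops / traffic


def elementwise_add_intensity(bytes_per_element=2):
    """FLOP/byte for c[i] = a[i] + b[i]: 1 FLOP per two reads and one write."""
    return 1 / (3 * bytes_per_element)


big_gemm = gemm_intensity(4096, 4096, 4096)
add = elementwise_add_intensity()
print(f"GEMM: {big_gemm:.0f} FLOP/byte, elementwise add: {add:.2f} FLOP/byte")
```

Under these assumptions the square GEMM's intensity grows with its dimension (n/3 for 2-byte elements), while the elementwise add stays fixed no matter how large the arrays are.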

Question 3

What does memory-bound versus compute-bound mean?

Accepted Answer

A kernel is memory-bound when it can't get data from memory fast enough to keep the compute units busy: its performance is capped by bandwidth × arithmetic intensity, and adding more compute power wouldn't help. It's compute-bound when it has enough data reuse to saturate the arithmetic units: it is capped by peak FLOPS, and more bandwidth wouldn't help. The boundary is the ridge point (peak FLOPS ÷ peak bandwidth). Knowing which regime a workload is in tells you whether to optimize data movement or computation, which is usually the first decision in performance tuning.
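
The "more compute wouldn't help" and "more bandwidth wouldn't help" claims can be checked with a small sketch. The processor figures and kernel intensities below are hypothetical, chosen only so the ridge falls at 100 FLOP/byte.

```python
def attainable(intensity, peak_flops, bandwidth):
    """Roofline bound in FLOP/s."""
    return min(peak_flops, intensity * bandwidth)


PEAK, BW = 100e12, 1e12               # hypothetical chip, ridge = 100 FLOP/byte
MEM_KERNEL, COMP_KERNEL = 10.0, 400.0  # hypothetical intensities on either side

# Memory-bound kernel: only extra bandwidth moves the bound.
mem_base = attainable(MEM_KERNEL, PEAK, BW)
mem_more_flops = attainable(MEM_KERNEL, 2 * PEAK, BW)
mem_more_bw = attainable(MEM_KERNEL, PEAK, 2 * BW)

# Compute-bound kernel: only extra compute moves the bound.
comp_base = attainable(COMP_KERNEL, PEAK, BW)
comp_more_flops = attainable(COMP_KERNEL, 2 * PEAK, BW)
comp_more_bw = attainable(COMP_KERNEL, PEAK, 2 * BW)

print(mem_more_flops / mem_base, mem_more_bw / mem_base)
print(comp_more_flops / comp_base, comp_more_bw / comp_base)
```

Doubling the resource that is not the binding constraint leaves the bound unchanged in both cases.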

Question 4

Why is LLM token generation memory-bound?

Accepted Answer

Because generating one token at a time reuses very little data: each step reads the model weights and the KV cache from memory but does only a small amount of computation per byte. At batch size 1 with 16-bit weights, each weight contributes about 2 FLOPs (a multiply and an add) per 2 bytes read, giving an arithmetic intensity near 1 FLOP/byte. That's far below the ridge point of a modern GPU (hundreds of FLOP/byte), so decode runs at a tiny fraction of the chip's peak FLOPS and is limited almost entirely by how fast weights stream from HBM. This is why memory bandwidth, not compute, dominates per-token decode latency, and why techniques that raise intensity (batching, speculative decoding) matter so much.
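
A back-of-the-envelope version of that estimate, as a sketch. It assumes single-sequence decode, 2 FLOPs per weight per token, and that weight reads dominate traffic (KV cache and activations are ignored); the 300 FLOP/byte ridge is an illustrative round number.

```python
def decode_intensity(bytes_per_param):
    """Approximate FLOP/byte for batch-size-1 decode: 2 FLOPs per weight read."""
    return 2 / bytes_per_param


def fraction_of_peak(intensity, ridge):
    """Share of peak FLOPS the roofline allows at this intensity."""
    return min(1.0, intensity / ridge)


RIDGE = 300.0  # assumed ridge point in FLOP/byte, for illustration only

fp16_intensity = decode_intensity(bytes_per_param=2)
fp16_fraction = fraction_of_peak(fp16_intensity, RIDGE)
print(f"{fp16_intensity:.1f} FLOP/byte -> {fp16_fraction:.2%} of peak FLOPS")
```

Note that under this simplified count, smaller weight formats raise the intensity (for example 1-byte weights give about 2 FLOP/byte) but still leave decode far below the ridge.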

Question 5

How do I know if my workload needs more bandwidth or more compute?

Accepted Answer

Compare its arithmetic intensity to the ridge point (peak FLOPS ÷ peak bandwidth). If intensity is below the ridge, it's memory-bound: more bandwidth helps, more FLOPS don't. If it's above, it's compute-bound: more FLOPS help, more bandwidth doesn't. For example, on an H100 the ridge is around 300 FLOP/byte. LLM decode (intensity ~1.5) is far below it and badly memory-bound, capped at about 0.5% of peak FLOPS. A training GEMM with intensity ~200 is also below the ridge, so the model still classes it as memory-bound, but it can reach roughly two-thirds of peak; only GEMMs whose intensity exceeds ~300 are compute-bound. This calculator computes the ridge and your workload's position relative to it.
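
The comparison can be sketched as a small classifier. The H100 peak figures below are assumptions (approximate published dense BF16 compute and HBM3 bandwidth for the SXM variant); check the vendor datasheet for your exact part and precision. The function name and the two sample intensities are illustrative.

```python
def classify(intensity, peak_flops, peak_bandwidth):
    """Return (ridge, regime, attainable FLOP/s) for a kernel on a processor."""
    ridge = peak_flops / peak_bandwidth
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    attainable = min(peak_flops, intensity * peak_bandwidth)
    return ridge, regime, attainable


# Assumed approximate peaks, not authoritative values.
H100_PEAK_FLOPS = 989e12      # FLOP/s, dense BF16 (assumption)
H100_PEAK_BANDWIDTH = 3.35e12  # bytes/s, HBM3 (assumption)

for name, intensity in [("low-intensity kernel", 1.5), ("high-intensity kernel", 600.0)]:
    ridge, regime, bound = classify(intensity, H100_PEAK_FLOPS, H100_PEAK_BANDWIDTH)
    print(f"{name}: ridge {ridge:.0f} FLOP/byte, {regime}, bound {bound / 1e12:.1f} TFLOP/s")
```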

Question 6

Why does HBM bandwidth matter as much as FLOPS?

Accepted Answer

Because a large fraction of real AI work, especially inference and other memory-bound kernels, is limited by bandwidth, not compute. For those workloads, a GPU's headline FLOPS are largely irrelevant; what determines speed is how fast it streams data from HBM. This is why each HBM generation's bandwidth increase (HBM3 → HBM3E → HBM4) is as important as the FLOPS gains, and why accelerators advertise both. For memory-bound workloads, the bandwidth number is the performance number. This calculator shows the attainable performance from both, exposing which one limits you.
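
One way to see "the bandwidth number is the performance number" is a rough upper bound on decode speed. This sketch assumes batch size 1 on a single device, with every weight read once per token and KV cache traffic ignored; the model size and the bandwidth figures are hypothetical, not specifications of any HBM generation.

```python
def max_decode_tokens_per_s(param_count, bytes_per_param, bandwidth):
    """Upper bound on tokens/s when each token must stream all weights once."""
    weight_bytes = param_count * bytes_per_param
    return bandwidth / weight_bytes


PARAMS = 70e9         # hypothetical 70B-parameter model
BYTES_PER_PARAM = 2   # 16-bit weights

rates = {
    bw: max_decode_tokens_per_s(PARAMS, BYTES_PER_PARAM, bw)
    for bw in (2e12, 4e12, 8e12)  # hypothetical bandwidths in bytes/s
}
for bw, rate in rates.items():
    print(f"{bw / 1e12:.0f} TB/s -> at most {rate:.1f} tokens/s")
```

The bound scales linearly with bandwidth and does not depend on peak FLOPS at all.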

Question 7

How can I make a memory-bound kernel faster?

Accepted Answer

Raise its arithmetic intensity so it moves right along the roofline, toward (and eventually past) the ridge. The main levers: batching (process more inputs per weight load, amortizing memory traffic), kernel fusion (combine operations to avoid round-trips to memory), data reuse and tiling (keep data in fast on-chip SRAM), and, for LLM inference specifically, speculative or parallel decoding. Each increases compute per byte, recovering FLOPS that were sitting idle. The other route is simply more bandwidth (newer HBM), but software intensity gains are usually cheaper. This calculator lets you raise the intensity and watch the attainable performance climb.
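
Kernel fusion is the easiest of these levers to quantify. The sketch below models a chain of elementwise operations, assuming 1 FLOP per operation per element and one read plus one write per pass over the data; it is a simplified traffic count, not a measurement.

```python
def chain_intensity(num_ops, bytes_per_element=4, fused=False):
    """FLOP/byte for a chain of elementwise ops applied to the same array."""
    flops_per_element = num_ops            # assumed 1 FLOP per op
    passes = 1 if fused else num_ops       # unfused: every op round-trips to memory
    bytes_moved = passes * 2 * bytes_per_element  # one read + one write per pass
    return flops_per_element / bytes_moved


unfused = chain_intensity(4, fused=False)
fused = chain_intensity(4, fused=True)
print(f"unfused: {unfused} FLOP/byte, fused: {fused} FLOP/byte")
```

Fusing the four operations performs the same arithmetic with a quarter of the memory traffic, so intensity (and the memory-bound performance ceiling) rises fourfold in this model.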

Question 8

What is the ridge point?

Accepted Answer

The ridge point is the arithmetic intensity at which a kernel transitions from memory-bound to compute-bound — the corner of the roofline. It equals peak FLOPS divided by peak bandwidth (in FLOP/byte). Below it, performance scales with intensity (memory-bound); at and above it, performance is flat at peak compute. A higher ridge means the processor needs more arithmetic intensity to be compute-bound, so more kernels fall into the memory-bound regime. Modern GPUs have high ridge points (FLOPS grew faster than bandwidth), which is why memory-bound is the common case. This calculator computes the ridge for your GPU.
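
A short sketch of why the ridge climbs when compute grows faster than bandwidth. The three "generations" are made-up figures (compute quadrupling while bandwidth doubles), not data for real products.

```python
def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOP/byte) at the corner of the roofline."""
    return peak_flops / peak_bandwidth


# Hypothetical generations: (peak FLOP/s, peak bytes/s).
generations = {
    "gen A": (100e12, 1e12),
    "gen B": (400e12, 2e12),
    "gen C": (1600e12, 4e12),
}

ridges = {name: ridge_point(f, b) for name, (f, b) in generations.items()}
for name, ridge in ridges.items():
    print(f"{name}: ridge = {ridge:.0f} FLOP/byte")
```

A kernel with a fixed intensity of, say, 150 FLOP/byte would be compute-bound on the first of these chips and memory-bound on the other two.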

Question 9

How does batching affect the roofline position?

Accepted Answer

Batching increases arithmetic intensity for many kernels — especially LLM inference — because the same weight load from memory serves multiple inputs, so more compute happens per byte moved. This shifts the workload rightward on the roofline, from deep in the memory-bound region toward the ridge, recovering idle compute and improving throughput per byte of bandwidth. There are limits (memory capacity for the larger batch, latency for individual requests), but batching is the primary software technique to escape the memory-bound penalty of small-batch inference. This calculator lets you model the intensity increase batching provides.
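
The rightward shift can be approximated with a simple model. This sketch assumes weight reads dominate memory traffic, so it ignores KV cache and activation traffic (both of which grow with batch size and will lower the real figure); the 300 FLOP/byte ridge is an illustrative round number.

```python
import math


def batched_decode_intensity(batch_size, bytes_per_param=2):
    """Approximate FLOP/byte when one weight read serves batch_size sequences."""
    return 2 * batch_size / bytes_per_param


def batch_to_reach_ridge(ridge, bytes_per_param=2):
    """Smallest batch size whose modeled intensity reaches the ridge."""
    return math.ceil(ridge * bytes_per_param / 2)


for batch in (1, 8, 64):
    print(f"batch {batch}: ~{batched_decode_intensity(batch):.0f} FLOP/byte")

needed = batch_to_reach_ridge(ridge=300.0)
print(f"batch size to reach a 300 FLOP/byte ridge: {needed}")
```

In practice memory capacity or latency targets often cap the batch size well before this modeled crossover.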

Question 10

How accurate is this roofline analysis?

Accepted Answer

The roofline relationships (ridge = peak FLOPS ÷ peak bandwidth; attainable = min(peak FLOPS, intensity × peak bandwidth)) are exact by definition and are the standard model for performance bounds. Accuracy depends on using the right peak FLOPS (for your precision) and peak bandwidth for the GPU, and a realistic arithmetic intensity for your kernel (which depends on data types, reuse, and implementation). The model gives an upper bound: real performance is at or below the roofline because of overheads the model ignores, such as memory latency, imperfect overlap of compute and data movement, and kernel launch costs. Use it to identify the binding constraint and the optimization direction; profile for the exact achieved performance.
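
Once you have a profiled figure, comparing it with the bound shows how much headroom remains. A minimal sketch, with hypothetical hardware peaks and a hypothetical measurement:

```python
def roofline_efficiency(measured_flops, intensity, peak_flops, peak_bandwidth):
    """Measured performance as a fraction of the roofline bound."""
    bound = min(peak_flops, intensity * peak_bandwidth)
    return measured_flops / bound


# Hypothetical: kernel at 10 FLOP/byte on a 100 TFLOP/s, 1 TB/s chip,
# profiled at 6 TFLOP/s (the bound here is 10 TFLOP/s).
efficiency = roofline_efficiency(6e12, 10.0, 100e12, 1e12)
print(f"{efficiency:.0%} of the roofline bound")
```

A value well below 100% suggests overheads outside the model; a value above 100% usually means the assumed intensity or peak figures are wrong.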

Question 11

Does this tool send my data anywhere?

Accepted Answer

No. All roofline math runs entirely in your browser in JavaScript — nothing is uploaded and there's no telemetry.
