Question 1

How many GPUs do I need to serve an LLM?

Accepted Answer

Enough to satisfy two constraints simultaneously: compute (generate output tokens fast enough for your request rate) and memory (hold the model weights plus the KV cache for every in-flight request). The compute requirement is the required output tokens per second divided by per-GPU throughput; the memory requirement is total memory (weights plus KV caches) divided by per-GPU HBM. Round each up to a whole GPU and take the larger of the two. This calculator computes both and reports the GPU count and which constraint binds, so you know whether to add compute or reduce memory pressure.
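
As a minimal sketch of the sizing rule above (not the calculator's actual source), the two counts and the binding constraint could be computed like this. The function name, parameter names, and example numbers are hypothetical, and the memory side treats the weights as stored once across the fleet, as the formula above does.

```python
import math

def size_fleet(qps, output_tokens_per_request, per_gpu_tokens_per_s,
               weights_gb, kv_gb_per_request, latency_s, hbm_gb_per_gpu):
    """Return (gpu_count, binding_constraint) for a serving fleet."""
    # Compute side: fleet-wide output tokens/s over what one GPU delivers.
    required_tokens_per_s = qps * output_tokens_per_request
    compute_gpus = math.ceil(required_tokens_per_s / per_gpu_tokens_per_s)

    # Memory side: weights plus one KV cache per in-flight request.
    in_flight = qps * latency_s  # Little's law estimate of concurrency
    total_memory_gb = weights_gb + in_flight * kv_gb_per_request
    memory_gpus = math.ceil(total_memory_gb / hbm_gb_per_gpu)

    # On a tie this sketch reports "compute"; that choice is arbitrary.
    binding = "compute" if compute_gpus >= memory_gpus else "memory"
    return max(compute_gpus, memory_gpus), binding

# Assumed inputs: 50 QPS, 500 tokens each, 2,500 tokens/s per GPU,
# 140 GB of weights, 0.5 GB of KV cache per request, 10 s latency, 80 GB HBM.
print(size_fleet(50, 500, 2_500, 140, 0.5, 10, 80))  # (10, 'compute')
```

With these inputs compute asks for 10 GPUs and memory for 5, so adding memory would not shrink the fleet; raising per-GPU throughput would.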

Question 2

What is the difference between compute-bound and memory-bound serving?

Accepted Answer

Compute-bound serving means you have enough memory to hold the model and caches, but not enough throughput to generate tokens at the required rate — so you add GPUs or increase per-GPU throughput. Memory-bound serving means you can compute fast enough, but the weights plus the KV caches of all concurrent requests exceed your GPU memory — so memory, not speed, sets the GPU count. Long contexts and high concurrency push toward memory-bound; high request rates with short outputs push toward compute-bound. This calculator identifies which you are.

Question 3

How is the required throughput calculated?

Accepted Answer

Required output tokens per second = requests per second (QPS) × output tokens per request. For example, 50 requests per second each generating 500 tokens needs 25,000 output tokens per second from the fleet. Divide that by the per-GPU output throughput (tokens per second the model achieves on one GPU, which depends on the model size, hardware and batching) to get the number of GPUs needed for compute. This calculator uses your QPS, tokens-per-request and per-GPU throughput to compute it.
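
The worked example above can be carried through to a GPU count. This is a sketch only: the function name and the per-GPU figure of 2,000 tokens per second are assumptions for illustration, not measurements.

```python
import math

def compute_bound_gpus(qps, output_tokens_per_request, per_gpu_tokens_per_s):
    """Return (required tokens/s, GPUs needed for throughput alone)."""
    required_tokens_per_s = qps * output_tokens_per_request
    gpus = math.ceil(required_tokens_per_s / per_gpu_tokens_per_s)
    return required_tokens_per_s, gpus

# 50 QPS x 500 tokens = 25,000 tokens/s; at an assumed 2,000 tokens/s per GPU
# that is 12.5, which rounds up to 13 GPUs.
required, gpus = compute_bound_gpus(50, 500, 2_000)
print(required, gpus)  # 25000 13
```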

Question 4

Why does the KV cache drive memory in serving?

Accepted Answer

Every request being processed holds a KV cache proportional to its context length, and the number of concurrent requests is roughly QPS × latency (Little's law). So total memory is the model weights (fixed) plus the per-request KV cache times the number of in-flight requests. At high concurrency or long contexts, the summed KV caches can exceed the weights themselves, making the KV cache the dominant memory term and pushing serving into the memory-bound regime. KV quantization and shorter contexts shrink each cache, while paged attention cuts the memory lost to fragmentation and over-reservation. This calculator sizes the KV memory from concurrency and per-request cache.
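
To get a feel for the per-request cache size, here is a rough sketch using the standard per-token estimate for a transformer with standard or grouped-query attention (one key and one value vector per layer per KV head). The layer count, head count, head dimension, context length, and concurrency below describe a hypothetical 70B-class model and workload; substitute your own.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes that one token adds, summed over all layers."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def kv_cache_gb(in_flight_requests, context_tokens, per_token_bytes):
    """Total KV cache in GB for requests that each hold a full context."""
    return in_flight_requests * context_tokens * per_token_bytes / 1e9

# Hypothetical shape: 80 layers, 8 KV heads, head_dim 128, 16-bit values.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
per_request_gb = kv_cache_gb(1, 4096, per_token)
fleet_kv_gb = kv_cache_gb(500, 4096, per_token)  # 500 requests in flight
weights_gb = 140  # assumed 16-bit weights for the same model

print(per_token, round(per_request_gb, 2), round(fleet_kv_gb, 1))
print(fleet_kv_gb > weights_gb)  # True: the KV cache dominates here
```

Under these assumptions each request holds roughly 1.3 GB, so 500 concurrent requests need several times more memory than the weights.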

Question 5

What is continuous batching and how does it help?

Accepted Answer

Continuous (or in-flight) batching dynamically adds and removes requests from the running batch as they arrive and complete, rather than processing fixed batches. Because LLM token generation is memory-bandwidth-bound, processing more requests together amortizes the cost of loading the weights from HBM across more tokens, dramatically increasing throughput per GPU. This raises the per-GPU throughput input to this calculator (often several-fold over naive serving), reducing the compute-bound GPU count — which is why modern serving frameworks (vLLM, TensorRT-LLM) center on it.
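
A toy simulation can illustrate the scheduling difference. It is a sketch under strong assumptions, not how vLLM or TensorRT-LLM are implemented: all requests are queued at the start, each decode step emits one token per active request, and a step costs the same regardless of how many slots are filled (the bandwidth-bound case). The function names are made up for this example.

```python
def static_batch_steps(lengths, slots):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batch_steps(lengths, slots):
    """Continuous batching: a freed slot is refilled before the next step."""
    waiting = list(lengths)  # FIFO queue of output lengths still to serve
    active = []              # tokens remaining for each running request
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.pop(0))
        steps += 1
        # One step emits one token per active request; finished ones leave.
        active = [n - 1 for n in active if n > 1]
    return steps

# Mixed short and long outputs, two batch slots.
lengths = [10, 100, 10, 100, 10, 100, 10, 100]
total_tokens = sum(lengths)
for name, fn in (("static", static_batch_steps),
                 ("continuous", continuous_batch_steps)):
    steps = fn(lengths, slots=2)
    print(name, steps, round(total_tokens / steps, 2), "tokens per step")
```

In this toy workload the static scheduler leaves a slot idle while each short request waits for its long batch-mate, so it needs 400 steps against 230 for the continuous one to emit the same 440 tokens.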

Question 6

How does latency target affect GPU count?

Accepted Answer

Indirectly, through concurrency and memory. By Little's law, the number of requests in flight is approximately QPS times the per-request latency, so a longer latency (or higher QPS) means more concurrent sequences, each holding a KV cache — increasing memory pressure and potentially the memory-bound GPU count. A tighter latency target reduces concurrency and memory but may require more compute headroom to hit the deadline. This calculator uses your latency to estimate concurrency and the resulting KV memory, linking the latency target to the fleet size.
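
The link from latency to fleet size can be sketched by sweeping the latency and watching the memory-bound count move. All inputs here (50 QPS, 0.5 GB of KV cache per request, 140 GB of weights, 80 GB of HBM per GPU) are placeholder assumptions, and the function names are hypothetical.

```python
import math

def in_flight_requests(qps, latency_s):
    """Little's law: average concurrency = arrival rate x time in system."""
    return qps * latency_s

def memory_bound_gpus(qps, latency_s, kv_gb_per_request, weights_gb, hbm_gb_per_gpu):
    """GPUs needed to hold the weights plus the KV caches of in-flight requests."""
    kv_gb = in_flight_requests(qps, latency_s) * kv_gb_per_request
    return math.ceil((weights_gb + kv_gb) / hbm_gb_per_gpu)

for latency_s in (2, 5, 10, 20):
    gpus = memory_bound_gpus(50, latency_s, 0.5, 140, 80)
    print(f"latency {latency_s:>2} s -> {in_flight_requests(50, latency_s):>4} in flight, {gpus} GPUs")
```

Under these assumptions, going from 2 s to 20 s of per-request latency raises concurrency from 100 to 1,000 and the memory-bound count from 3 to 8 GPUs.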

Question 7

How do I increase per-GPU serving throughput?

Accepted Answer

Mainly by batching (continuous batching to amortize weight loads), since generation is memory-bandwidth-bound. Also: quantizing the model (int8/int4) so weights stream faster from HBM, speculative decoding (typically a small draft model proposes several tokens that the main model verifies in one forward pass), and optimized kernels (FlashAttention, fused ops). Each raises the tokens per second one GPU delivers, lowering the compute-bound GPU count. Tensor parallelism is different: it lets a very large model fit across several GPUs and cuts per-token latency, but it does not by itself raise throughput per GPU. This calculator takes per-GPU throughput as an input, so measure it for your model and serving stack; improving it directly shrinks the fleet.
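
Why batching and quantization help can be seen from a back-of-the-envelope ceiling: if every decode step must stream all weights from HBM once, steps per second cannot exceed bandwidth divided by weight bytes, and each step yields one token per batched request. The sketch below ignores KV cache reads and the compute limit that eventually caps large batches, so treat it as an optimistic upper bound and not a substitute for measurement. The 2,000 GB/s bandwidth and the function name are assumptions.

```python
def decode_ceiling_tokens_per_s(params_billion, bytes_per_param,
                                hbm_bandwidth_gb_s, batch_size=1):
    """Optimistic bandwidth-bound ceiling on decode tokens/s for one GPU."""
    weight_gb = params_billion * bytes_per_param      # GB streamed per step
    steps_per_s = hbm_bandwidth_gb_s / weight_gb      # one weight pass per step
    return steps_per_s * batch_size                   # one token per request per step

# Hypothetical 7B model on a GPU with an assumed 2,000 GB/s of HBM bandwidth.
for label, bytes_per_param in (("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)):
    single = decode_ceiling_tokens_per_s(7, bytes_per_param, 2_000)
    batched = decode_ceiling_tokens_per_s(7, bytes_per_param, 2_000, batch_size=32)
    print(f"{label}: {single:.0f} tokens/s at batch 1, {batched:.0f} at batch 32")
```

Halving the bytes per parameter doubles the ceiling, and batching multiplies it by the batch size until compute or KV traffic becomes the limit.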

Question 8

How does model size affect serving requirements?

Accepted Answer

Larger models need more memory (more weight bytes, and often larger per-request KV caches) and have lower per-GPU throughput (more compute and more weight bytes to read per token), so they need more GPUs on both axes. A 7B model at 16-bit precision has about 14 GB of weights and fits on a single 80 GB GPU with room left for KV cache; a 70B model at the same precision has about 140 GB, so it needs multiple 80 GB GPUs just to hold the weights and delivers fewer tokens per second on each. Quantization helps both. This calculator lets you set the weights, per-GPU throughput and KV size for your model so you see the serving cost of model size directly.
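
The weight footprint follows directly from parameter count and precision. As a small sketch (function names are hypothetical, 80 GB of HBM per GPU is an assumption, and KV cache and activation memory are deliberately left out):

```python
import math

def weights_gb(params_billion, bits_per_param):
    """Weight memory in GB: parameters x bytes per parameter."""
    return params_billion * bits_per_param / 8

def min_gpus_for_weights(params_billion, bits_per_param, hbm_gb_per_gpu):
    """GPUs needed just to hold the weights, before any KV cache."""
    return math.ceil(weights_gb(params_billion, bits_per_param) / hbm_gb_per_gpu)

for params, bits in ((7, 16), (70, 16), (70, 4)):
    print(f"{params}B @ {bits}-bit: {weights_gb(params, bits):.0f} GB, "
          f"{min_gpus_for_weights(params, bits, 80)} GPU(s) for weights alone")
```

The last row shows the quantization effect: at 4 bits the 70B weights take 35 GB and fit on one 80 GB GPU, though the KV cache still needs room on top.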

Question 9

How does this relate to inference cost?

Accepted Answer

The GPU count this calculator produces is the basis of serving cost: GPUs × the hourly cost (owned or rented) gives the infrastructure cost, which divided by the served tokens gives cost per token. So sizing the fleet here is the first step; the token-cost and inference-cost calculators turn the GPU count into a per-token or per-query price, and the accelerator-ROI calculator decides whether to own or rent those GPUs. Together they span capacity sizing and cost. This tool answers 'how many GPUs?'; the others answer 'at what cost?'.
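
The step from GPU count to cost per token is simple arithmetic. A sketch, with a made-up hourly price and a hypothetical function name:

```python
def cost_per_million_tokens(gpu_count, hourly_cost_per_gpu, served_tokens_per_s):
    """Infrastructure cost per one million served output tokens."""
    fleet_cost_per_hour = gpu_count * hourly_cost_per_gpu
    tokens_per_hour = served_tokens_per_s * 3600
    return fleet_cost_per_hour / tokens_per_hour * 1_000_000

# Assumed: 10 GPUs at 2.50 per GPU-hour, serving 25,000 output tokens/s.
print(round(cost_per_million_tokens(10, 2.50, 25_000), 3))  # 0.278
```

Note that the divisor is the tokens actually served, not the fleet's peak capacity, so an underutilized fleet has a higher cost per token.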

Question 10

How accurate is this serving estimate?

Accepted Answer

The structure (compute-bound from required versus per-GPU throughput, memory-bound from weights plus concurrency-scaled KV cache, take the larger) is a sound first-order sizing model, and the arithmetic is exact for your inputs. Accuracy hinges on a realistic per-GPU throughput (measure it with your serving framework and batching, not a theoretical peak) and a correct per-request KV size and latency. It does not separately model prefill and decode phases, variable sequence lengths, or scheduling efficiency, so add headroom (20–30%) for production. Use it for first-order fleet sizing; load-test for the final count.
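
Applying the suggested headroom is a one-liner; the sketch below uses integer arithmetic so that rounding up is not thrown off by floating-point error. The function name is hypothetical and the base count of 10 GPUs is only an example.

```python
def with_headroom(base_gpus, headroom_pct):
    """Scale a GPU count up by a percentage and round up to a whole GPU."""
    # Integer ceiling division of base_gpus * (100 + pct) by 100.
    return (base_gpus * (100 + headroom_pct) + 99) // 100

for pct in (20, 30):
    print(f"{pct}% headroom on 10 GPUs -> {with_headroom(10, pct)} GPUs")
```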

Question 11

Does this tool send my data anywhere?

Accepted Answer

No. All serving-sizing math runs entirely in your browser in JavaScript — nothing is uploaded and there's no telemetry.

