LLM Serving Console
Serving an LLM at scale needs enough GPUs to both compute the tokens and hold the caches. Size the fleet as the larger of the compute and memory requirements, and see which constraint binds — so you know whether to add GPUs or cut the KV cache.
QPS, tokens/req & per-GPU throughput → GPUs needed.
Fleet-sizing console
Fleet = max(compute, memory) = 10 GPUs. The taller bar is the binding constraint.
Serving 50 QPS × 500 tokens needs 25,000 output tok/s (10 GPUs at 2,500 tok/s per GPU) and 440 GB for weights plus 250 concurrent KV caches (6 GPUs). The fleet is the larger: 10 GPUs, compute-bound.
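The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the console's actual implementation. The source gives only the 440 GB total, the 250 concurrent requests, and the 6-GPU result; the split into 140 GB of weights and 1.2 GB of KV cache per request, the 5 s latency, and the 80 GB of HBM per GPU are assumptions chosen to reproduce those figures.

```python
import math

def fleet_size(qps, tokens_per_req, tok_per_s_per_gpu,
               weights_gb, latency_s, kv_gb_per_req, hbm_gb_per_gpu):
    """Return compute-bound and memory-bound GPU counts and the binding one."""
    # Compute side: output tokens per second the fleet must sustain.
    demand_tok_s = qps * tokens_per_req
    compute_gpus = math.ceil(demand_tok_s / tok_per_s_per_gpu)

    # Memory side: weights plus one KV cache per in-flight request.
    in_flight = qps * latency_s  # Little's law
    memory_gb = weights_gb + in_flight * kv_gb_per_req
    memory_gpus = math.ceil(memory_gb / hbm_gb_per_gpu)

    binding = "compute" if compute_gpus >= memory_gpus else "memory"
    return {
        "demand_tok_s": demand_tok_s,
        "compute_gpus": compute_gpus,
        "in_flight": in_flight,
        "memory_gb": memory_gb,
        "memory_gpus": memory_gpus,
        "fleet": max(compute_gpus, memory_gpus),
        "binding": binding,
    }

# Assumed inputs (see lead-in): 140 GB weights, 5 s latency,
# 1.2 GB KV cache per request, 80 GB HBM per GPU.
example = fleet_size(qps=50, tokens_per_req=500, tok_per_s_per_gpu=2500,
                     weights_gb=140, latency_s=5, kv_gb_per_req=1.2,
                     hbm_gb_per_gpu=80)
print(example)
```

Both counts are rounded up because a fractional GPU cannot be provisioned. Ties are reported as compute-bound here, which is an arbitrary choice.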
Batch harder or raise per-GPU throughput to cut the count. Add 20–30% headroom for production.
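As a rough sketch of the headroom step, the sized fleet can be scaled up and rounded to whole GPUs. The helper name `with_headroom` is hypothetical, and the 10-GPU input is the example fleet from above.

```python
import math

def with_headroom(gpus, headroom_pct):
    """Scale a GPU count by a percentage of headroom, rounding up."""
    # Integer percent keeps the arithmetic exact before rounding up.
    return math.ceil(gpus * (100 + headroom_pct) / 100)

low = with_headroom(10, 20)   # 20% headroom on the 10-GPU example
high = with_headroom(10, 30)  # 30% headroom on the 10-GPU example
print(low, high)
```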
Turn GPUs into cost per token in the Token Cost console; confirm fit in Model Fit.
Why serving has two ceilings
You need enough GPUs to both compute the tokens fast enough and hold the weights plus all in-flight KV caches. The fleet size is the larger of the two — and which binds tells you what to optimize.
If compute is the limit, the answer is more GPUs or higher per-GPU throughput (bigger batches, better kernels). Token generation is memory-bandwidth-bound, so batching is the main throughput lever.
Long contexts inflate each request's KV cache and high concurrency multiplies the number of caches, until memory, not compute, sets the GPU count. Shorter contexts, paged attention, and quantized KV all help.
By Little's law, requests in flight ≈ QPS × latency. A higher request rate or longer responses mean more concurrent sequences, each holding a KV cache, which is why latency targets ripple into memory sizing.
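To see how a latency target ripples into memory, the sketch below sweeps latency at a fixed request rate. The 50 QPS matches the example above; the 1.2 GB of KV cache per request and the latency values are illustrative assumptions, and the function names are hypothetical.

```python
def requests_in_flight(qps, latency_s):
    """Little's law: average concurrency = arrival rate x time in system."""
    return qps * latency_s

def kv_cache_total_gb(qps, latency_s, kv_gb_per_req):
    """Total KV cache held by all in-flight requests."""
    return requests_in_flight(qps, latency_s) * kv_gb_per_req

# Same request rate, three assumed end-to-end latencies.
for latency_s in (2, 5, 10):
    n = requests_in_flight(50, latency_s)
    gb = kv_cache_total_gb(50, latency_s, 1.2)
    print(f"latency {latency_s:>2} s -> {n} in flight, {gb:.0f} GB of KV cache")
```

Doubling latency at a fixed request rate doubles the KV cache the fleet must hold, even though the token demand on the compute side is unchanged.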
Compute and memory, whichever runs out first
Sizing a fleet to serve a large language model is a two-front problem, and the fleet you need is set by whichever front you lose first. On one side is compute: you must generate output tokens fast enough to keep up with the request rate, which takes a certain number of GPUs at a given per-GPU throughput. On the other is memory: the GPUs must collectively hold the model weights plus the key-value cache of every request currently in flight. The answer is the larger of the two counts, and knowing which one binds is what tells you how to make serving cheaper.
The compute side is governed by token throughput, and here the crucial fact is that generation is memory-bandwidth-bound: each decode step reads the full set of weights from HBM but does little arithmetic per byte read. That's why batching is the dominant lever: processing many requests together amortizes the weight load across more tokens, multiplying per-GPU throughput. Continuous batching in modern serving stacks is precisely this, and it can cut the compute-bound GPU count several-fold.
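The amortization argument can be illustrated with a first-order roofline sketch. It assumes each decode step reads the weights once regardless of batch size, costs about 2 FLOPs per parameter per token, and ignores KV cache traffic and kernel overheads. The model size, bandwidth, and peak FLOP/s below are hypothetical round numbers, not the specification of any real GPU.

```python
def decode_tok_per_s(batch, params, bytes_per_param, hbm_bw_bytes_s, peak_flops):
    """First-order decode throughput for one GPU at a given batch size."""
    # Memory time: weights are streamed once per step and shared by the batch.
    step_time_mem = params * bytes_per_param / hbm_bw_bytes_s
    # Compute time: roughly 2 FLOPs per parameter for each token in the batch.
    step_time_compute = batch * 2 * params / peak_flops
    # The slower of the two sets the step time; each step emits `batch` tokens.
    return batch / max(step_time_mem, step_time_compute)

# Hypothetical: 7e9 parameters in FP16, 2e12 B/s of HBM bandwidth, 4e14 FLOP/s.
PARAMS, BYTES, BW, FLOPS = 7e9, 2, 2e12, 4e14

for batch in (1, 8, 64, 1000):
    rate = decode_tok_per_s(batch, PARAMS, BYTES, BW, FLOPS)
    print(f"batch {batch:>4}: {rate:,.0f} tok/s")
```

Under these assumptions throughput grows linearly with batch size until the compute ceiling takes over. In a real system the KV cache reads and memory capacity limit the usable batch well before that point, which is why measured per-GPU throughput is the right input to the sizing formula.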
The memory side is governed by the KV cache, and it scales with concurrency. By Little's law, the requests in flight are roughly the request rate times the latency, and each one holds a cache proportional to its context length. For long-context, high-concurrency serving, the summed caches can dwarf the weights themselves, making memory — not speed — the constraint. The levers there are different: shorter contexts, paged attention, quantized KV caches, and accepting lower concurrency.
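For a standard transformer decoder, the per-request cache can be estimated from the model shape: one key and one value vector per layer, per KV head, per token. The sketch below uses that formula; the model shape (80 layers, 8 KV heads, head dimension 128) and the 4,096-token context are hypothetical values for illustration, not inputs taken from this console.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_elem, context_tokens):
    """KV cache size for one request: keys and values for every layer and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * context_tokens

# Hypothetical model shape and context length.
fp16 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                      bytes_per_elem=2, context_tokens=4096)
int8 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                      bytes_per_elem=1, context_tokens=4096)
print(f"FP16 KV cache: {fp16 / 1e9:.2f} GB per request")
print(f"INT8 KV cache: {int8 / 1e9:.2f} GB per request")
```

The cache is linear in both context length and bytes per element, so halving the context or quantizing the cache from 16-bit to 8-bit each halves the per-request footprint.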
So the binding constraint is also the to-do list. Compute-bound? Batch harder, quantize, add GPUs. Memory-bound? Attack the KV cache. This console computes both counts, names the binding one, and sizes the fleet — then the GPU count flows into cost per token in the Token Cost console, the fit check in Model Fit, and the bandwidth roofline in the HBM Bandwidth console.
Trusted by Inference Platform & Capacity Teams
“Sizing by max(compute, memory) and naming the binding constraint is exactly how we plan a serving fleet. The long-context preset going memory-bound on KV cache is the trap teams hit — surfacing it here saves a load-test. Concurrency from Little's law is the right model.”
“The compute-vs-memory split tells me whether to batch harder or cut context, which is the actual decision. Per-GPU throughput as an input (not a fantasy peak) keeps it honest. Pairs perfectly with the HBM-bandwidth and model-fit tools.”
“Clean fleet sizing from QPS, tokens and per-GPU throughput. The KV-cache-driven memory bound matches our vLLM deployments. Would love prefill/decode separation, but as a first-order capacity tool it's exactly right and fast.”
“We size production fleets off this, then add headroom. The bottleneck call — compute on our high-QPS API, memory on our long-context one — directs optimization correctly. Feeds straight into the token-cost calculator. Excellent.”
Related tools
Similar Calculators
More tools in the same category
Inference Cost Calculator
Estimate deployment costs for AI models across cloud, edge, and hybrid infrastructures with per-query, per-token, and per-hour pricing models. Integrates GPU/ASIC rental rates, network egress, storage, and scaling overhead for accurate inference TCO analysis.
Training Cost Calculator
Calculate AI model training expenses including GPU cluster rental, data transfer, checkpoint storage, and engineering time with distributed-training overhead modeling. Supports LLM, vision, and multimodal training with FLOPs-to-cost mapping and carbon-footprint estimation.
GPU Cluster Sizing
Determine optimal GPU cluster configurations for training and inference workloads with interconnect topology modeling, memory-bandwidth balancing, and fault-tolerance planning. Supports NVIDIA, AMD, and custom accelerator clusters with InfiniBand and NVLink network analysis.
Model Fit Checker
Verify whether AI models fit within hardware constraints including GPU HBM capacity, on-chip SRAM, and interconnect bandwidth with layer-wise memory profiling. Supports model parallelism, pipeline parallelism, and ZeRO optimization recommendations for large-model deployment.
HBM Bandwidth Calculator
Estimate memory bandwidth requirements for AI workloads with operation-type analysis, data-movement profiling, and roofline model integration. Calculates HBM generation selection, channel count, and clock-speed requirements to eliminate memory-bound bottlenecks.
AI Chip Comparator
Compare AI accelerators across performance, cost, power, and software-ecosystem metrics with normalized benchmarking for training and inference workloads. Supports NVIDIA, AMD, Intel, Google TPU, Amazon Trainium, and custom ASICs with TCO-per-FLOP analysis.
GPUs = max(QPS×tokens ÷ per-GPU tok/s, (weights + QPS×latency×KV) ÷ HBM) · Last reviewed: 2026-06