Cost per inference · per 1k · monthly · compute + overhead

Inference Cost Console

A model is trained once but serves queries forever — so inference cost dominates. Compute the cost per inference (hardware ÷ throughput, plus egress overhead), per 1,000 inferences, and the monthly bill, in any currency.

01 · Quick estimate

Throughput, hardware cost & utilization → cost per inference.

Cost / 1k inf: $0.1035
Monthly: $5.18K

02 · Deep analysis

Inference unit-economics console

Per-inference cost breakdown
Compute (hardware + power): $0.0000035
Overhead (egress/storage): $0.0001

Compute is 3% of the per-inference cost. The hardware serves 0.58M inferences/hour at 80% utilization.

Per inference: $0.0001
Per 1k inf: $0.1035
Inferences/hr: 0.58M (80% util)
Monthly cost: $5.18K (50M inf)

Read-out

At 200 inf/s and 80% utilization, the hardware ($2.04/hr) serves 0.58M inferences/hour, so each costs $0.0001 ($0.1035/1k). At 50M queries/month that's $5,177.
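The read-out's arithmetic can be reproduced in a few lines. This is a minimal sketch, not the console's own code: it assumes the $2.04/hr figure is the all-in hourly cost (hardware plus power) and that overhead is a flat $0.0001 per inference, as the breakdown shows. The function and parameter names are illustrative.

```python
def cost_per_inference(hourly_cost, throughput_per_s, utilization, overhead_per_inf):
    """Return (compute, total) cost per inference in the currency of the inputs."""
    # Inferences actually served per hour at the given utilization.
    served_per_hour = throughput_per_s * 3600 * utilization
    compute = hourly_cost / served_per_hour
    return compute, compute + overhead_per_inf


# Values taken from the read-out; overhead is assumed flat per inference.
compute, total = cost_per_inference(
    hourly_cost=2.04, throughput_per_s=200, utilization=0.80, overhead_per_inf=0.0001
)
per_1k = total * 1000
monthly = total * 50_000_000

print(f"compute/inf: ${compute:.7f}")
print(f"per 1k:      ${per_1k:.4f}")
print(f"monthly:     ${monthly:,.0f}")
```

With these inputs the sketch gives $0.1035 per 1k and $5,177 per month, matching the read-out.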

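Doubling throughput (or utilization, where there is headroom below 100%) halves the compute part of the per-inference cost, since the hourly cost is fixed. The overhead part does not change, so the total falls only in proportion to compute's share.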
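How much of that reduction reaches the total depends on compute's share of the cost. The sketch below compares the example's baseline with doubled throughput, assuming a flat $0.0001 overhead per inference; the constants and function name are illustrative.

```python
HOURLY_COST = 2.04         # all-in hardware cost per hour (from the read-out)
UTILIZATION = 0.80
OVERHEAD_PER_INF = 0.0001  # assumed flat per inference


def split_cost(throughput_per_s):
    """Return (compute, total) cost per inference at a given throughput."""
    compute = HOURLY_COST / (throughput_per_s * 3600 * UTILIZATION)
    return compute, compute + OVERHEAD_PER_INF


base_compute, base_total = split_cost(200)
fast_compute, fast_total = split_cost(400)

# Relative reduction in each figure when throughput doubles.
compute_drop = 1 - fast_compute / base_compute
total_drop = 1 - fast_total / base_total

print(f"compute cost falls by {compute_drop:.0%}")
print(f"total cost falls by   {total_drop:.1%}")
```

In this example compute is only about 3% of the total, so halving it moves the total by under 2%. With a compute-dominated workload, the same change would come close to halving the total.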

For LLM token-based pricing use the Token Cost console; size the fleet in LLM Serving.

Currency conversion uses indicative rates — verify against a live source for contracts.

Why it matters

Why inference is the real bill

Inference is paid per query, forever

A model is trained once but serves queries for its whole life, so the cumulative inference cost dwarfs training. Cost per inference, multiplied by billions of queries, is the real spend.

Utilization sets the per-query cost

Per-inference compute cost is the hardware's hourly cost divided by the queries it serves in that hour, so an under-loaded GPU makes every query more expensive. Throughput and utilization are the levers.
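A utilization sweep makes the effect concrete. This is a rough sketch using the console's example inputs ($2.04/hr, 200 inf/s); the utilization levels and the function name are chosen only for illustration.

```python
def compute_cost_per_inference(hourly_cost, peak_throughput_per_s, utilization):
    """Compute-only cost per inference: hourly cost over inferences served per hour."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_cost / (peak_throughput_per_s * 3600 * utilization)


# Illustrative inputs: $2.04/hr and 200 inf/s, as in the console's example.
sweep = {
    u: compute_cost_per_inference(2.04, 200, u) * 1000  # cost per 1k inferences
    for u in (0.10, 0.20, 0.40, 0.80)
}

for u, per_1k in sweep.items():
    print(f"{u:>4.0%} utilization -> ${per_1k:.4f} per 1k (compute only)")
```

Each halving of utilization doubles the compute cost per query, because the same hourly cost is spread over half as many inferences.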

Overhead beyond compute adds up

Network egress, storage, and load balancing add a per-query cost on top of compute. It is tiny on any single call but real across billions of them, and it is the part a naive compute-only estimate misses.

Cost per 1k inferences is the planning unit

Pricing, budgeting and unit economics for inference run in cost per thousand (or million) inferences. Computing it from throughput and hardware cost is the basis of every inference business case.
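For budgeting, the same unit cost can be run in reverse: given a monthly budget, how many inferences does it cover? The sketch below assumes the unit cost stays constant as volume grows (that is, capacity scales with demand at the same utilization). The $10,000 budget is a hypothetical figure, and the helper names are illustrative.

```python
def per_unit_costs(cost_per_inference):
    """Scale a per-inference cost to the usual planning units."""
    return {
        "per_1k": cost_per_inference * 1_000,
        "per_1m": cost_per_inference * 1_000_000,
    }


def inferences_within_budget(monthly_budget, cost_per_inference):
    """Whole number of inferences a monthly budget covers at a fixed unit cost."""
    return int(monthly_budget // cost_per_inference)


# Per-inference cost implied by the console's example ($0.1035 per 1k).
unit_cost = 0.1035 / 1000
units = per_unit_costs(unit_cost)
volume = inferences_within_budget(10_000, unit_cost)  # hypothetical $10,000 budget

print(f"per 1k: ${units['per_1k']:.4f}, per 1M: ${units['per_1m']:.2f}")
print(f"inferences covered by the budget: {volume:,}")
```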

Field notes

The bill that never stops

Training a model is a dramatic, one-time expense; serving it is a quiet bill that arrives with every single query, forever. For any successful AI product the cumulative inference cost overtakes training quickly and then keeps growing with usage — which is why the operating metric that matters most isn't the training run, it's the cost per inference, multiplied by the billions of queries a deployed model handles.

That cost has a simple core: the all-in hourly cost of the serving hardware divided by the number of inferences it produces in that hour. Because the hourly cost is essentially fixed, the denominator is everything — throughput and utilization. A well-batched, fully-loaded accelerator spreads its cost over enormous query volume and drives the per-inference cost down; an under-utilized one pays the same hourly cost for far fewer queries, and each one costs more. Keeping inference hardware busy is the heart of cheap serving.
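Written out, the core relation looks roughly like this. The symbols are introduced here only for illustration: c is the cost per inference, H the all-in hourly hardware cost, T the peak throughput in inferences per second, u the utilization as a fraction, and o the per-inference overhead. The second line plugs in the console's example values.

```latex
c = \frac{H}{3600 \cdot T \cdot u} + o
\qquad
c_{\text{example}} = \frac{2.04}{3600 \cdot 200 \cdot 0.8} + 0.0001 \approx 0.0001035
```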

The part a compute-only estimate misses is overhead. Network egress, storage, load balancing — each is negligible on a single query but real across billions, and ignoring it understates the true cost. A complete per-inference figure adds that overhead on top of compute, which is why this console separates the two and shows compute's share: when overhead becomes a meaningful slice, it's a signal to optimize data movement, not just the model.
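Compute's share is a one-line ratio. The sketch below uses the figures implied by the console's example; the 50% threshold for flagging overhead is a hypothetical rule of thumb, not something the console defines.

```python
def compute_share(compute_per_inf, overhead_per_inf):
    """Fraction of the per-inference cost that is compute (0.0 to 1.0)."""
    total = compute_per_inf + overhead_per_inf
    if total <= 0:
        raise ValueError("total cost must be positive")
    return compute_per_inf / total


# Figures implied by the console's example: $2.04/hr over 576,000 inferences/hr,
# plus an overhead of $0.0001 per inference.
compute = 2.04 / 576_000
share = compute_share(compute, 0.0001)

# Hypothetical rule of thumb: flag data movement once overhead exceeds half the cost.
overhead_dominates = (1 - share) > 0.5

print(f"compute share: {share:.0%}, overhead dominates: {overhead_dominates}")
```

For the example inputs the share comes out at about 3%, in line with the console's breakdown, so the flag is set.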

Expressed per thousand or per million inferences, this is the unit every inference business case runs on — pricing, budgeting, and margin all derive from it. For generative LLMs where output length varies, the natural unit is the token instead — use the Token Cost console — and size the serving fleet that sets your throughput and utilization in the LLM Serving console.

Trusted by Inference Economics & Product Teams

4.8
Based on 3,120 reviews

Cost per 1k inferences from hardware cost ÷ throughput, with utilization as the hinge, is exactly the operating number our pricing rests on. Including egress overhead beyond compute is the part naive estimates miss. Seeing it in euros and dollars settles cross-region unit economics.

Dr. Elise Fontaine
ML serving economics
June 14, 2026

The inference-dwarfs-training framing is the truth that justifies our serving-optimization roadmap. Per-query cost falling with batching/utilization is the lever, and this quantifies it. Pairs perfectly with the token-cost and accelerator-ROI tools for the full cost picture.

Sanjay Verma
Inference platform PM
May 22, 2026

Clean per-inference and monthly cost with the compute-vs-overhead split. The utilization sensitivity is the reality check for our autoscaling. Would love cold-start and demand-variability modeling, but as a unit-economics tool it's exactly right.

Maria Santos
Cost optimization
March 31, 2026

Cost per thousand inferences is the unit we budget and price on, and this computes it honestly with overhead. Multi-currency matters for our global product. The vision-model preset matches our measured cost closely. Excellent.

Tom Reilly
AI product finance
December 30, 2025

Connected instruments

Related tools

Similar Calculators

More tools in the same category

Training Cost Calculator

Calculate AI model training expenses including GPU cluster rental, data transfer, checkpoint storage, and engineering time with distributed-training overhead modeling. Supports LLM, vision, and multimodal training with FLOPs-to-cost mapping and carbon-footprint estimation.

GPU Cluster Sizing

Determine optimal GPU cluster configurations for training and inference workloads with interconnect topology modeling, memory-bandwidth balancing, and fault-tolerance planning. Supports NVIDIA, AMD, and custom accelerator clusters with InfiniBand and NVLink network analysis.

Model Fit Checker

Verify whether AI models fit within hardware constraints including GPU HBM capacity, on-chip SRAM, and interconnect bandwidth with layer-wise memory profiling. Supports model parallelism, pipeline parallelism, and ZeRO optimization recommendations for large-model deployment.

HBM Bandwidth Calculator

Estimate memory bandwidth requirements for AI workloads with operation-type analysis, data-movement profiling, and roofline model integration. Calculates HBM generation selection, channel count, and clock-speed requirements to eliminate memory-bound bottlenecks.

AI Chip Comparator

Compare AI accelerators across performance, cost, power, and software-ecosystem metrics with normalized benchmarking for training and inference workloads. Supports NVIDIA, AMD, Intel, Google TPU, Amazon Trainium, and custom ASICs with TCO-per-FLOP analysis.

Token Cost Estimator

Calculate infrastructure costs per token generated for LLM serving with batch-size optimization, KV-cache management, and speculative decoding impact. Models pricing for API providers and self-hosted deployments with demand-spike handling and multi-model routing.

cost/inference = (GPU $/hr + power) ÷ (throughput × 3600 × util) + overhead · cost per 1k = cost/inference × 1000 · Last reviewed: 2026-06