Self-host vs API · cost per 1M tokens · break-even

Token Cost Console

Build or buy your tokens? Compute the self-hosted cost per million tokens — GPU cost ÷ tokens produced — against an API's list price, find the break-even volume, and see why utilization decides it, in any currency.

01 · Quick estimate

Throughput, GPU cost & API price → cost per million tokens.

Self-host /1M: $0.43 (vs API $3.00)
02 · Deep analysis

Build-vs-buy console

Cost per million tokens
Self-hosted: $0.43
API list price: $3.00
Saves /1M: $2.57
Tokens / hour: 7.2M (80% util)
Break-even: 1943M tok/mo (above this, self beats API)
Self-hosting is cheaper per token

At 2,500 tok/s and 80% utilization, your GPU produces 7.2M tokens/hour for $3.07/hr: $0.43/1M vs the API's $3.00. Above roughly 1943M tokens/month (about 1.9 billion), the $2.57/1M saving covers fixed operating costs and self-hosting wins.

Self-hosted cost = hourly GPU cost ÷ tokens/hour, so utilization is the hinge — an idle GPU has a terrible cost per token.
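The formula can be sketched in a few lines of Python. This is a minimal illustration, not the console's actual implementation: the function names are hypothetical, and it assumes the $3.07/hr figure is the all-in hourly cost (GPU plus power).

```python
def tokens_per_hour(throughput_tok_s: float, utilization: float) -> float:
    """Tokens actually produced in one hour at the given utilization (0-1]."""
    if throughput_tok_s <= 0:
        raise ValueError("throughput must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return throughput_tok_s * 3600 * utilization


def self_host_cost_per_million(
    hourly_cost: float, throughput_tok_s: float, utilization: float
) -> float:
    """All-in hourly cost divided by tokens per hour, scaled to 1M tokens."""
    return hourly_cost / tokens_per_hour(throughput_tok_s, utilization) * 1_000_000


# Inputs taken from the example above: $3.07/hr, 2,500 tok/s, 80% utilization.
hourly = tokens_per_hour(2500, 0.80)
cost = self_host_cost_per_million(3.07, 2500, 0.80)

print(f"{hourly / 1e6:.1f}M tokens/hour")   # 7.2M tokens/hour
print(f"${cost:.2f} per 1M tokens")         # $0.43 per 1M tokens
```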

Raise throughput via batching (see LLM Serving); decide own-vs-rent in Accelerator ROI.

Currency conversion uses indicative rates — verify against a live source for contracts.

Why it matters

Why volume decides build vs buy

Self-hosting wins on cost at scale

At high utilization, the per-token cost of running your own (or rented) GPUs is a fraction of API list prices — because the API price bundles the provider's margin, overhead and convenience. Volume is what unlocks the saving.

API prices carry a large markup

Token APIs price for convenience: no infrastructure, instant scale, no ops. That premium is genuinely worth it below a break-even volume — but above it, self-hosting's raw cost dominates.

Utilization is the hinge

Self-hosted cost per token is the GPU's hourly cost divided by the tokens it produces — so an idle GPU has a terrible cost-per-token. Only steady, high-throughput serving makes owning cheaper than the API.

Cost per million tokens is the unit of the market

Token APIs quote prices per million tokens, while GPU infrastructure is billed per hour. Converting your self-hosted cost to that same per-million-token axis is the only honest way to compare it against the API menu.

Field notes

The price of a million tokens

Every LLM deployment eventually faces the same question: keep paying an API per token, or stand up your own serving? The answer lives on a single axis — cost per million tokens — and the honest comparison puts your self-hosted figure next to the API's menu price on that exact axis. Self-hosted cost is disarmingly simple: the all-in hourly cost of the serving GPU divided by the tokens it actually produces in that hour.

That denominator is everything. A GPU's hourly cost is roughly fixed whether it's flat out or idle, so cost per token is inversely proportional to both utilization and throughput. Run it hard, batching many requests, and the cost per token drops to a fraction of any API price. Run it at twenty percent, and each token costs five times what it would at full utilization, which can end up worse than the API. This is why self-hosting is a high-volume game and the API is the right answer for everyone below the threshold.
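To see the inverse relationship concretely, the sketch below sweeps utilization while holding the hourly cost and peak throughput fixed. It reuses the page's example inputs ($3.07/hr, 2,500 tok/s, $3.00 API price); the helper name and the chosen utilization steps are illustrative assumptions, and the comparison is per token only, ignoring fixed operating costs.

```python
def cost_per_million(hourly_cost: float, throughput_tok_s: float, utilization: float) -> float:
    """Hourly cost divided by tokens produced per hour, scaled to 1M tokens."""
    return hourly_cost / (throughput_tok_s * 3600 * utilization) * 1_000_000


GPU_COST_PER_HOUR = 3.07   # example figure from the page
THROUGHPUT_TOK_S = 2500    # example figure from the page
API_PRICE_PER_M = 3.00     # example figure from the page

# Baseline: the same GPU running flat out.
full = cost_per_million(GPU_COST_PER_HOUR, THROUGHPUT_TOK_S, 1.0)

for util in (1.0, 0.8, 0.5, 0.2, 0.1):  # illustrative steps, not console presets
    cost = cost_per_million(GPU_COST_PER_HOUR, THROUGHPUT_TOK_S, util)
    verdict = "self-host cheaper" if cost < API_PRICE_PER_M else "API cheaper"
    print(f"{util:>4.0%}  ${cost:5.2f}/1M  {cost / full:4.1f}x full-load cost  {verdict}")
```

With these inputs the 20% row costs five times the full-load figure, and at 10% utilization the per-token cost rises above the $3.00 API price.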

The API's price is higher per token for good reasons: it bundles the provider's margin, their infrastructure and operations, and the genuine value of zero setup and instant elastic scale. Below a break-even volume, that bundle is a bargain, because you'd waste far more on underused hardware. Above it, you're paying the markup on billions of tokens, and the raw economics of owning the serving dominate. The break-even, which this console computes, is the monthly volume at which the per-token saving multiplied by that volume equals your fixed operating costs.
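The break-even can be sketched as fixed monthly costs divided by the per-million-token saving. The page does not state the fixed operating cost it uses; the $5,000/month below is an assumption, back-solved so that the result matches the displayed figure of roughly 1943M tokens/month. The function name is hypothetical.

```python
from typing import Optional


def break_even_million_tokens(
    fixed_ops_per_month: float,
    api_price_per_million: float,
    self_cost_per_million: float,
) -> Optional[float]:
    """Monthly volume (in millions of tokens) where the saving covers fixed costs.

    Returns None when self-hosting is not cheaper per token, because in that
    case no volume ever pays back the fixed costs.
    """
    gap = api_price_per_million - self_cost_per_million
    if gap <= 0:
        return None
    return fixed_ops_per_month / gap


# Self-hosted cost from the page's example: $3.07/hr, 2,500 tok/s, 80% utilization.
self_cost = 3.07 / (2500 * 3600 * 0.80) * 1_000_000

ASSUMED_FIXED_OPS = 5000.0  # assumption: not given on the page, inferred from the result

volume = break_even_million_tokens(ASSUMED_FIXED_OPS, 3.00, self_cost)
print(f"break-even: about {volume:.0f}M tokens/month")  # about 1943M tokens/month
```

Below that volume the fixed costs outweigh the per-token saving and the API is cheaper overall. If the self-hosted cost per million ever meets or exceeds the API price, the function returns None rather than a misleading number.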

So the decision is really about your steady volume and how hard you can drive utilization. Push throughput up with batching — modeled in the LLM Serving console — to lower self-hosted cost per token, and decide whether to own or rent the GPUs in the Accelerator ROI console.

Trusted by AI Product & Inference Cost Teams

4.8
Based on 3,070 reviews

Self-hosted cost per million tokens against the API list price, with the break-even volume, is exactly the build-vs-buy analysis I run. Seeing our 70B self-host at $0.45 vs $3 API — but only above the break-even — is the decision in one screen, in dollars and euros. Utilization as the hinge is correctly central.

Dr. Aaron Klein
AI product economics
June 14, 2026

The markup-on-the-API-bundles-convenience framing is the truth, and this quantifies exactly when that convenience stops being worth it. Batching raising throughput to cut cost-per-token is the lever we pull. Pairs perfectly with the serving and accelerator-ROI tools.

Mei Lin
Inference cost optimization
May 19, 2026

Clean cost-per-Mtok with the break-even monthly volume — that's the number finance approves on. Currency support saves a spreadsheet. Would love demand-variability modeling, but as a build-vs-buy decision tool it's exactly right.

Tomasz Nowak
ML platform finance
March 29, 2026

The low-volume preset correctly says stay on the API — don't buy idle GPUs. Knowing the token volume we'd need to justify self-hosting is the reality check. Honest about utilization, fast, multi-currency. Excellent.

Sara Okafor
AI startup CTO
December 30, 2025

Connected instruments

Related tools

Similar Calculators

More tools in the same category

Inference Cost Calculator

Estimate deployment costs for AI models across cloud, edge, and hybrid infrastructures with per-query, per-token, and per-hour pricing models. Integrates GPU/ASIC rental rates, network egress, storage, and scaling overhead for accurate inference TCO analysis.

Training Cost Calculator

Calculate AI model training expenses including GPU cluster rental, data transfer, checkpoint storage, and engineering time with distributed-training overhead modeling. Supports LLM, vision, and multimodal training with FLOPs-to-cost mapping and carbon-footprint estimation.

GPU Cluster Sizing

Determine optimal GPU cluster configurations for training and inference workloads with interconnect topology modeling, memory-bandwidth balancing, and fault-tolerance planning. Supports NVIDIA, AMD, and custom accelerator clusters with InfiniBand and NVLink network analysis.

Model Fit Checker

Verify whether AI models fit within hardware constraints including GPU HBM capacity, on-chip SRAM, and interconnect bandwidth with layer-wise memory profiling. Supports model parallelism, pipeline parallelism, and ZeRO optimization recommendations for large-model deployment.

HBM Bandwidth Calculator

Estimate memory bandwidth requirements for AI workloads with operation-type analysis, data-movement profiling, and roofline model integration. Calculates HBM generation selection, channel count, and clock-speed requirements to eliminate memory-bound bottlenecks.

AI Chip Comparator

Compare AI accelerators across performance, cost, power, and software-ecosystem metrics with normalized benchmarking for training and inference workloads. Supports NVIDIA, AMD, Intel, Google TPU, Amazon Trainium, and custom ASICs with TCO-per-FLOP analysis.

self-host /1M = (GPU $/hr + power $/hr) ÷ (throughput tok/s × 3600 × util) × 10⁶
break-even (M tok/mo) = fixed/mo ÷ (API − self per 1M)
Last reviewed: 2026-06