Question 1

How do you estimate the cost of training an AI model?

Accepted Answer

Start with the compute. The standard 6ND approximation, used in the Chinchilla scaling-law work among others, says training a dense transformer takes about 6 FLOPs per parameter per token, so total FLOPs ≈ 6 × parameters × training tokens. Divide that by your cluster's effective throughput (the number of GPUs times each GPU's peak FLOPS times the Model FLOPs Utilization, or MFU) to get the wall-clock time in seconds. GPU-hours is GPUs × hours, cost is GPU-hours × the hourly rate, and energy is GPUs × power per GPU × hours. This calculator does all of that, adds the carbon footprint, and shows the cost in your chosen currency.
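As a rough illustration, the chain of formulas above can be sketched in a few lines of Python. The function name, field names, and every input value below are assumptions made for the example; they are not the calculator's actual code or defaults.

```python
from dataclasses import dataclass


@dataclass
class TrainingEstimate:
    total_flops: float   # 6 * params * tokens
    hours: float         # wall-clock hours
    gpu_hours: float     # GPUs * hours
    cost: float          # GPU-hours * hourly rate (same currency as the rate)
    energy_kwh: float    # GPUs * power per GPU * hours, in kWh


def estimate_training(params, tokens, num_gpus, peak_flops_per_gpu,
                      mfu, rate_per_gpu_hour, watts_per_gpu):
    """Compute-only estimate following the 6ND rule (hypothetical helper)."""
    total_flops = 6 * params * tokens
    # Effective throughput: GPUs x peak FLOPS x MFU, in FLOPs per second.
    effective_flops_per_sec = num_gpus * peak_flops_per_gpu * mfu
    hours = total_flops / effective_flops_per_sec / 3600
    gpu_hours = num_gpus * hours
    cost = gpu_hours * rate_per_gpu_hour
    energy_kwh = num_gpus * watts_per_gpu * hours / 1000
    return TrainingEstimate(total_flops, hours, gpu_hours, cost, energy_kwh)


# Illustrative inputs only: a 7B model, 1T tokens, 256 GPUs with an assumed
# 1e15 peak FLOPS each, 40% MFU, 2.00 per GPU-hour, 700 W per GPU.
est = estimate_training(params=7e9, tokens=1e12, num_gpus=256,
                        peak_flops_per_gpu=1e15, mfu=0.40,
                        rate_per_gpu_hour=2.00, watts_per_gpu=700)
print(f"{est.hours:.1f} h, {est.gpu_hours:,.0f} GPU-hours, "
      f"cost {est.cost:,.0f}, {est.energy_kwh:,.0f} kWh")
```

With these assumed inputs the run comes out at roughly 114 hours and about 29,000 GPU-hours. Energy here is GPU power only, with no datacenter overhead applied.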

Question 2

What is the 6ND rule for training FLOPs?

Accepted Answer

It's the standard estimate that training a dense transformer requires approximately 6 floating-point operations per parameter per token: total training FLOPs ≈ 6 × N × D, where N is the parameter count and D the number of training tokens. The 6 comes from roughly 2 FLOPs for the forward pass and 4 for the backward pass per parameter per token. It's an approximation (it ignores attention's quadratic term, which is small for typical context lengths relative to the dense matmuls), but it's remarkably accurate for budgeting and is the basis of this calculator's compute estimate.
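To make the 2 + 4 split concrete, here is a minimal sketch that breaks the 6ND total into its forward and backward parts. The function name is hypothetical, and the attention term is ignored, as in the rule itself.

```python
def training_flops_breakdown(n_params, n_tokens):
    """Split the 6ND estimate into forward and backward FLOPs."""
    forward = 2 * n_params * n_tokens    # ~2 FLOPs per parameter per token
    backward = 4 * n_params * n_tokens   # ~4 FLOPs per parameter per token
    return {"forward": forward, "backward": backward,
            "total": forward + backward}


# A GPT-3-class configuration: N = 175B parameters, D = 300B tokens.
flops = training_flops_breakdown(n_params=175e9, n_tokens=300e9)
print(f"forward {flops['forward']:.3e}, backward {flops['backward']:.3e}, "
      f"total {flops['total']:.3e}")
```

For this configuration the total is about 3.15 × 10^23 FLOPs, two thirds of which is the backward pass.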

Question 3

What is Model FLOPs Utilization (MFU)?

Accepted Answer

MFU is the fraction of a cluster's theoretical peak FLOPS that the training run actually achieves after losses to communication between GPUs, memory-bandwidth limits, pipeline bubbles, and non-matmul operations. Real large-scale training typically reaches 30–50% MFU; well-optimized runs on a good interconnect can exceed 50%. Time and cost scale with 1/MFU: at 40% MFU a run takes 2.5× as long as the theoretical peak suggests. Using a realistic MFU rather than assuming peak FLOPS is essential for an accurate estimate, which is why this calculator makes it a primary input.
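If you have a measured training throughput, one common way to back out MFU is to divide the model FLOPs per second implied by the 6ND rule by the cluster's peak. The sketch below assumes that definition. The helper names, the throughput figure, and the roughly 989 TFLOPS dense bf16 peak used for an H100 are assumptions for illustration.

```python
def achieved_mfu(n_params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """MFU = model FLOPs actually delivered per second / cluster peak FLOPS."""
    achieved_flops_per_sec = 6 * n_params * tokens_per_second
    cluster_peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / cluster_peak_flops


def slowdown_vs_peak(mfu):
    """How many times longer the run takes than at 100% of peak."""
    if not 0 < mfu <= 1:
        raise ValueError("mfu must be in (0, 1]")
    return 1 / mfu


# Hypothetical measurement: a 70B model processing 2.0M tokens/s on 2,048 GPUs.
mfu = achieved_mfu(n_params=70e9, tokens_per_second=2.0e6,
                   num_gpus=2048, peak_flops_per_gpu=989e12)
print(f"MFU = {mfu:.1%}, slowdown vs. peak = {slowdown_vs_peak(mfu):.2f}x")
print(f"At 40% MFU the slowdown is {slowdown_vs_peak(0.40):.1f}x")
```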

Question 4

How long does it take to train a large language model?

Accepted Answer

It depends on the compute and the cluster: time = (6 × params × tokens) ÷ (GPUs × peak FLOPS per GPU × MFU). A 70B model on 2 trillion tokens across 2,048 H100s (about 989 TFLOPS dense bf16 peak each) at 45% MFU takes on the order of 10–11 days; a GPT-3-class 175B model on 300B tokens across ~1,000 A100s takes roughly a month; a frontier model on trillions of tokens across tens of thousands of GPUs runs for months. This calculator computes the wall-clock days from your model size, token count, and cluster, so you can see how each lever changes the schedule.
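The 70B example can be checked directly, and the same function shows how the levers move the schedule. This is a sketch under stated assumptions: the H100 peak is taken as roughly 989 TFLOPS (dense bf16), and the function and variable names are made up for the example.

```python
SECONDS_PER_DAY = 86_400
H100_BF16_DENSE_PEAK = 989e12  # approximate peak FLOPS per GPU (assumption)


def training_days(params, tokens, num_gpus, peak_flops_per_gpu, mfu):
    """Wall-clock days = 6ND / (GPUs x peak FLOPS x MFU), converted to days."""
    seconds = 6 * params * tokens / (num_gpus * peak_flops_per_gpu * mfu)
    return seconds / SECONDS_PER_DAY


# Baseline from the text: 70B parameters, 2T tokens, 2,048 H100s, 45% MFU.
base = training_days(70e9, 2e12, 2048, H100_BF16_DENSE_PEAK, 0.45)

# Lever 1: double the cluster (assuming MFU holds, which it often does not).
more_gpus = training_days(70e9, 2e12, 4096, H100_BF16_DENSE_PEAK, 0.45)

# Lever 2: a less efficient run at 30% MFU on the original cluster.
lower_mfu = training_days(70e9, 2e12, 2048, H100_BF16_DENSE_PEAK, 0.30)

print(f"baseline {base:.1f} d, 2x GPUs {more_gpus:.1f} d, "
      f"30% MFU {lower_mfu:.1f} d")
```

The baseline lands between 10 and 11 days. Doubling the GPU count halves it only if MFU stays constant as the cluster grows.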

Question 5

Why are frontier training runs so expensive?

Accepted Answer

Because cost scales with the product of model size and data, both of which have grown enormously. A frontier model might have hundreds of billions of parameters trained on tens of trillions of tokens — that's 10^25 FLOPs or more, requiring tens of thousands of GPUs running for months. At a few dollars per GPU-hour, that's tens of millions of dollars in compute alone, before counting the data pipeline, the engineering team, the failed runs, and the inference cost afterward. This calculator gives the compute-cost figure, which is the dominant and most quotable component.

Question 6

Can I see training costs in different currencies?

Accepted Answer

Yes. Use the currency selector to enter the GPU hourly rate and see the total training cost in US dollars, euros, pounds, rupees, yen, yuan and other currencies, formatted for the locale. The FLOPs, time, GPU-hours, energy and carbon are currency-independent; only the money converts, using indicative exchange rates. Since training budgets are approved and reported in local currency, this makes the figure directly usable for planning and comparison across regions.

Question 7

How is the carbon footprint of training calculated?

Accepted Answer

Energy is the cluster power (GPUs × watts each, optionally scaled up by the PUE to cover datacenter overhead) times the training hours, giving kilowatt-hours; carbon is that energy times the grid's carbon intensity (kg CO₂ per kWh). A months-long run on tens of thousands of kilowatt-class GPUs consumes tens of gigawatt-hours of energy and, on a fossil-heavy grid, emits thousands of tonnes of CO₂. The figure varies by an order of magnitude or more with the grid: a run on renewable power emits far less than one on coal. This calculator reports the energy and carbon so you can account for the environmental cost alongside the dollar cost.
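A minimal sketch of the energy and carbon arithmetic follows. The function names are hypothetical, and the power draw, PUE, and grid intensities are illustrative assumptions rather than measured values for any real site.

```python
def training_energy_kwh(num_gpus, watts_per_gpu, hours, pue=1.0):
    """Energy in kWh: GPU power scaled by PUE for datacenter overhead."""
    cluster_kw = num_gpus * watts_per_gpu / 1000
    return cluster_kw * pue * hours


def training_carbon_kg(energy_kwh, grid_kg_co2_per_kwh):
    """Carbon in kg CO2: energy times the grid's carbon intensity."""
    return energy_kwh * grid_kg_co2_per_kwh


# Illustrative run: 1,024 GPUs at 700 W for 30 days, PUE 1.2.
energy = training_energy_kwh(num_gpus=1024, watts_per_gpu=700,
                             hours=30 * 24, pue=1.2)
carbon_tonnes = training_carbon_kg(energy, grid_kg_co2_per_kwh=0.4) / 1000
print(f"{energy:,.0f} kWh, {carbon_tonnes:,.0f} t CO2 at 0.4 kg/kWh")

# Same energy on two assumed grids, to show how much the intensity matters.
for label, intensity in [("low-carbon grid", 0.05), ("coal-heavy grid", 0.8)]:
    tonnes = training_carbon_kg(energy, intensity) / 1000
    print(f"{label}: {tonnes:,.0f} t CO2")
```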

Question 8

Does this include data, engineering, and failed-run costs?

Accepted Answer

No — this calculator computes the raw compute cost (GPU-hours × rate), which is the dominant and most directly estimable component. A full training budget also includes data acquisition and cleaning, the engineering team's salaries, storage for datasets and checkpoints, the inevitable failed and restarted runs (which can add 20–50% to compute), and the network and infrastructure overhead. Treat the compute cost here as the floor and add those components for a complete budget; the compute figure is what scales most dramatically with model size.
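One way to turn the compute floor into a planning range is to apply the 20–50% overhead for failed and restarted runs mentioned above. This sketch covers only that one component; data, salaries, storage, and infrastructure would still need to be added separately. The function name and the example figure are assumptions.

```python
def budget_with_rerun_overhead(compute_cost, overhead_low=0.20,
                               overhead_high=0.50):
    """Return a (low, high) compute budget including failed/restarted runs."""
    low = compute_cost * (1 + overhead_low)
    high = compute_cost * (1 + overhead_high)
    return low, high


# Hypothetical compute-only estimate of 1,000,000 in your chosen currency.
low, high = budget_with_rerun_overhead(1_000_000)
print(f"compute budget with reruns: {low:,.0f} to {high:,.0f}")
```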

Question 9

How does the choice of GPU affect training cost?

Accepted Answer

It changes both the time and the rate. A faster GPU (more peak FLOPS) finishes sooner, reducing GPU-hours, but usually costs more per hour and draws more power. The relevant metric is cost-efficiency: GPU-hours × rate for the same job. A newer accelerator with much higher FLOPS can be cheaper overall despite a higher hourly rate if its speed advantage outweighs it, and vice versa. This calculator lets you switch GPU types (with representative FLOPS, power, and rates) to compare total cost, time, and energy for the same training run.
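The comparison can be sketched as cost per job. For a fixed job and a fixed MFU, the GPU count cancels out of the total cost, so only peak FLOPS, MFU, and the hourly rate matter. The two accelerators below are hypothetical, with made-up specs and rates, and the example assumes both reach the same MFU, which real hardware may not.

```python
def job_cost(total_flops, peak_flops_per_gpu, mfu, rate_per_gpu_hour):
    """Compute cost of a job: GPU-hours needed times the hourly rate."""
    gpu_seconds = total_flops / (peak_flops_per_gpu * mfu)
    return gpu_seconds / 3600 * rate_per_gpu_hour


# Hypothetical accelerators: the newer one costs twice as much per hour
# but has more than three times the peak throughput.
gpus = {
    "older_gpu": {"peak_flops": 300e12, "rate": 1.50},
    "newer_gpu": {"peak_flops": 1000e12, "rate": 3.00},
}

job_flops = 1e23  # the same training job on either accelerator
costs = {name: job_cost(job_flops, spec["peak_flops"], 0.40, spec["rate"])
         for name, spec in gpus.items()}
cheapest = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: {cost:,.0f}")
print(f"cheapest for this job: {cheapest}")
```

Here the newer accelerator wins despite the higher rate because its speed advantage (about 3.3×) exceeds its price premium (2×).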

Question 10

How accurate is this training-cost estimate?

Accepted Answer

The 6ND FLOPs estimate and the time-from-MFU calculation are the standard, well-validated method used across the industry for training budgets, and the arithmetic is exact for your inputs. Accuracy depends on three things: a realistic MFU (the biggest variable; use measured values where possible, not peak), the right peak FLOPS for your numeric precision (dense bf16/fp16), and remembering that the result is compute-only. It doesn't model the quadratic attention term (small for typical configs), failed runs, or non-compute costs. Use it for first-order budgeting and comparison, and refine the MFU with profiling when you need a tighter number.

Question 11

Does this tool send my data anywhere?

Accepted Answer

No. All training-cost, energy and carbon math — and the currency conversion — runs entirely in your browser in JavaScript. Nothing is uploaded and there's no telemetry.
