Self-hosting LLMs is the decision every CTO runs into once API bills hit five figures per month: do you keep buying tokens or invest in GPU clusters and an ops team? The answer depends on three numbers you can measure in under a week: your monthly token volume, your latency 95th‑percentile target, and the fully loaded cost of one ML engineer (we use $220k/year).
Direct answer: For a stable workload that consumes ~3 billion inference tokens per month (roughly 3M 1k‑token calls), self-hosting an H100 cluster plus one ML engineer is cheaper after 18–30 months—total 3‑year TCO $2.1M vs hosted‑API $3.2M—if your 95th‑percentile latency target is below 200ms and you can accept a 2–4 hour single‑instance recovery SLA. If your token volume is under 500M/month or latency requirements are soft, an API remains cheaper and lower risk.
Setup: the stakes are material. OpenAI, Anthropic, and other hosted providers charge per‑token model access that scales linearly with usage; large production deployments commonly spend $50k–$300k/month on inference. A senior ML engineer in the U.S. costs roughly $220k/year fully loaded. Cloud GPU on‑demand H100 class instances run between $3/hour (spot) and $40/hour (on‑demand), and managed providers add margin. One unexpected throttling incident or a month of downtime can easily cost $100k+ in lost revenue and SLA credits.
Self-hosting LLMs: cost components and break-even math
Break down the decision into four buckets: model licensing or weights, inference infrastructure, engineering and ops, and residual vendor dependency. Model licensing varies: open weights like Llama or Mistral are free to use but have compliance and research‑use caveats; enterprise licenses for tuned weights (from Anthropic, Mistral, or commercial redistributors) can run $150k–$500k upfront or multi‑year subscriptions. Inference infrastructure is the most predictable line item: a run‑day H100 cluster sized for 300 tokens/second sustained throughput is $8k–$25k/month on spot capacity but $30k–$120k/month for on‑demand redundancy and multi‑AZ replicas when purchased through AWS, GCP, or CoreWeave.
To show the math, take a concrete example: you serve 3M 1k‑token calls per month (3B tokens). A hosted API priced at $0.03 per 1k tokens costs $90k/month or $3.24M over 36 months. Self‑hosting requires two production H100 nodes for redundancy (estimate $50k/month if you buy a managed cluster with reserved pricing), storage and networking $4k/month, and 1.0 FTE ML engineer plus 0.5 SRE (combined $350k/year fully loaded or ~$29k/month). That totals ~$83k/month or $2.99M over 36 months. If you can commit to spot/contracted GPUs and trim ops to 0.7 FTE, self‑hosting falls to ~$59k/month and the 3‑year TCO becomes $2.12M, beating the hosted API. Those are the numbers you put into a spreadsheet.
Latency and reliability change the calculus. Hosted APIs often give 95th‑percentile latencies of 150–400ms for LLM completions, depending on model size and prompt length. Self‑hosting with local H100s gives you control to push tail latency below 100ms for optimized models and batching, which matters for interactive UIs and synchronous flows. However, you trade single‑vendor SLA for operational risk: expect a realistic recovery SLA with self‑hosting of 2–4 hours for hardware failures unless you pay for cross‑region replication, which doubles infra cost.
Hidden costs are decisive. Model updates, quantization, custom kernel tuning, and prompt‑orchestration frameworks each consume engineering weeks. A productive ML engineer spends roughly 50% of their time on platform glue and reliability until the stack reaches maturity. That labor cost shows up as a fixed expense that hosted APIs externalize.
Self‑hosting saves money only when steady, high token volume meets strict latency or data‑sovereignty requirements—and your organization treats ops as a product, not a cost center.
What this means for CTOs and technical founders
First, measure precisely. You need three numbers before any commit: monthly 1k‑token equivalents, target 95th‑percentile latency, and acceptable recovery SLA. If your monthly token usage is under 500M tokens (about $15k/month at $0.03/1k), keep buying APIs. If usage is steady above ~1.5B tokens/month and latency or data residency are strict, build the financial model for self‑hosting.
Second, quantify engineering runway. If your team lacks an ML engineer with systems experience, budget $220k/year loaded for a senior hire plus another $150k/year for SRE coverage for the first 12 months. If hiring is impossible, plan for a managed cluster provider (CoreWeave, Lambda Labs, Paperspace) and expect a 15–30% price premium over pure cloud volumes—but with faster time to production and lower hiring risk.
Third, treat vendor risk as an economic line item. A dependency on OpenAI or Anthropic exposes you to price increases, throttling, or terms changes. Quantify that exposure: assume a 20–50% annual price creep scenario and run a sensitivity analysis. If a 30% price increase over 12 months pushes your hosted API spend above the self‑hosted 36‑month curve, you should accelerate a hybrid strategy.
Key takeaways
1. If your usage is under 500M tokens/month, continue with hosted APIs; total 3‑year cost rarely justifies the fixed infra and hiring overhead.
2. If usage is >1.5B tokens/month and you need sub‑200ms 95th‑percentile latency or strict data residency, model self‑hosting with reserved GPU capacity and one senior ML engineer; expect 18–30 month payback.
3. Include a vendor‑risk premium (assume 20–30% annual price increase) in your hosted‑API forecast to avoid being surprised by vendor hikes.
4. Build a hybrid runbook: start on hosted APIs, instrument token volume and latency, and plan a staged migration once 6 consecutive months exceed your break‑even threshold.
5. Never self‑host without a plan for model refresh and quantization; otherwise operational debt will outstrip infra savings.
Closing: the decision isn't ideological. Self‑hosting LLMs is a capital and operating tradeoff—lower marginal inference cost at the price of fixed infra, hiring, and operational complexity. The right move is the one that aligns your token volume, latency targets, and appetite for vendor risk with a measurable 18–36 month ROI and a staged migration plan that gives you an escape hatch back to hosted APIs if assumptions break.



