Production model selection: hosted APIs vs self-hosted

Production model selection begins with a simple, counterintuitive claim: cheaper model inference is not the same as lower total cost to serve. A $0.02 per 1k‑token API can still cost you 3× more than self-hosting once you add latency SLA penalties, developer productivity loss, and vendor switching fees.

A 5‑engineer product team is roughly $900k–$1.1M/yr fully loaded using a $180k/yr median engineer rate. A predictable hosted API line item of $12k/month is small next to that salary cost but large relative to feature-margin on a sub-$200/mo SaaS product. Latency matters: a 100ms P95 vs a 700ms P95 changes conversion and support load meaningfully in conversational products.

Direct answer: production model selection is a three-variable decision — latency budget, monthly inference spend, and operational tolerance — and it breaks down neatly with thresholds. If your inference spend is under $10k/month, use hosted APIs; if your P99 latency requirement is below 150ms, invest in edge or co‑located self-hosting; if you spend over $50k/month on tokens, run a 3‑year TCO to evaluate self-hosting or hybrid approaches.

Production model selection: hosted APIs, self-hosted, and hybrid tradeoffs

Hosted model APIs remove GPU capacity planning and most operational burden. Hosted APIs charge per token or per request. Typical hosted-api ranges are $5–$60 per 1M tokens depending on model size and quality. Hosted latency is commonly 50–300ms for simple completion calls, with P99 often in the 400–1200ms range depending on burst traffic and vendor queuing.

Self-hosted inference replaces per‑token bills with infrastructure and human costs. A single 80GB GPU instance suitable for medium‑large models rents for $5–$30/hr depending on cloud and spot vs on‑demand. A single A100‑class instance at $12/hr runs ~30–120k tokens/hour of heavy decoding workload; that translates to roughly $100–$900 per million tokens in raw GPU cost depending on utilization and model efficiency.

Hybrid architectures combine hosted APIs for tail cases and self-hosted models for steady baseline traffic. Hybrids reduce per‑token spend by 20–60% for high-volume applications while keeping operational risk manageable. A hybrid where 70% of traffic is routed to self-hosted inference and 30% to hosted APIs often achieves a 30–50% reduction in monthly bills versus hosted‑only at scale.

Choose the inference model that aligns with your SLOs: cost alone selects hosted APIs; strict latency or heavy volume selects self-host or hybrid.

Cost math and operational thresholds you can use today

Calculate a 3‑year break‑even for self‑hosting using three line items: GPU infra, storage/egress, and ops labor. A conservative baseline for a production pod is $2,500–$7,000/month for GPU capacity and $15k–$25k/month for one SRE and one ML engineer share. Put another way, a two‑node GPU pod plus ops runs roughly $200k–$400k/year.

Hosted API spend scales linearly with tokens. At $20 per 1M tokens, a product serving 50M tokens/month costs $1,000/month. A different calculation: 50M tokens/month at $20/1M costs $1,000/month; your annual hosted spend is $12k. Your 3‑year hosted bill is $36k. Comparing that to $240k/yr for a self‑hosted pod shows hosted wins until you add latency penalties and lost revenue from slower responses.

Latency penalties: if a 200ms P99 reduces conversion by 1.5 percentage points on a $100 ARR account base, the revenue impact exceeds a $10k/month API bill quickly. Quantify SLA penalties before choosing purely on inference price.

What this means for a CTO or technical founder

You should instrument three metrics before you design: (1) tokens/month baseline, (2) P95/P99 latency requirement, and (3) feature revenue per unit latency. These three numbers give you a simple decision surface.

If your tokens/month is below 10M, hosted APIs are cheaper and faster to iterate with. If your P99 latency requirement is below 150ms and you serve users from a single region, evaluate colocated self‑hosting on that region as the faster path to meeting SLOs. If you're spending more than $50k/month on inference, run a detailed 3‑year TCO including staff costs; self‑hosting will be cost‑effective in many such cases.

Operational advice: version your prompts and model routing in feature flags so you can switch routing between hosted and self‑hosted without a deploy. Charge a single SRE with 20–30% time budget to maintain the inference cluster and a single ML engineer 30–50% to maintain model artifacts; this staffing keeps ops predictable.

Key takeaways

1. If your inference spend is under $10k/month, use hosted APIs for speed and reduced ops. 2. If your P99 latency target is <150ms or you must run inference in a private network, self-host or colocate inference. 3. If you spend >$50k/month on tokens, run a 3‑year TCO that includes $200k–$400k/yr pod costs and $180k/yr per 1–2 engineers for ops. 4. Use a hybrid routing strategy to optimize cost and tail latency before a full migration. 5. Version routing and prompts behind feature flags so you can A/B vendor vs weights safely.

Production model selection is a governance problem as much as a systems problem: you must align SLA, product economics, and engineering headcount. The right answer for a startup launching a conversational feature with 5k daily active users will be different than a mid‑market SaaS running 50M tokens/month. Match the architecture to the three numbers that matter — tokens, latency, and ops tolerance — and the rest becomes arithmetic, not taste.

Production model selection: hosted APIs vs self-hosted models

Production model selection: hosted APIs, self-hosted, and hybrid tradeoffs

Cost math and operational thresholds you can use today

What this means for a CTO or technical founder

Key takeaways

More from Insights

RAG architecture: production tradeoffs and cost model

RAG evaluation framework: production metrics for retrieval

Fine-tuning vs prompting: when to train models