Model serving architecture: cost & latency tradeoffs

Model serving architecture is where product economics and SLOs collide — putting a heavyweight model behind a single endpoint is cheap for demos but bankrupting at scale. The right architecture routes by cost, latency, and business importance so invoices and user experience both stay predictable.

Direct answer: What is the correct model serving architecture? A hybrid routing architecture that combines lightweight on-edge or low-cost models for 60–80% of requests, mid-tier hosted models for 15–30%, and high-cost specialist models behind an explicit premium path will usually minimize cost while meeting a 200–500 ms P95 latency SLO and keeping model spend under 20% of product gross margin in year one. Expect TCO differences of 3×–10× versus a single-model design over three years.

Start-up stakes are concrete: a 10k DAU B2B product with an average of 0.5 inference calls per user per day will do ~150k inferences per month. If your high-capacity model costs $0.50 per 1k tokens and averages $0.10 per inference, that's roughly $15k/month. Route 70% of inferences to cheaper models and you cut that line to $4.5k/month — $126k/year vs. $180k/year, a 30% reduction in annual model spend for a single optimization.

Engineering cost matters too: a 5-engineer team in the US runs roughly $1.0M–$1.4M/year loaded. Buying or building routing and serving infrastructure that reduces model spend by $100k/year is only attractive if it costs you less than half the engineers’ time over 3 years — otherwise the vendor or simpler architecture is the cheaper path.

Model serving architecture: routing, placement, and SLOs

Model serving architecture is the combination of where models run (edge, regional, cloud), which model answers which request (routing), and how you measure success (latency and accuracy SLOs). Each axis maps to dollars: compute egress and inference costs, storage for embeddings, and engineering time to maintain routing logic. Decisions here determine monthly bills and the shape of growth.

Routing has three practical levers: static routing by product tier, dynamic routing based on request complexity, and staged routing where cheap models handle candidate generation and expensive models verify. Static routing is simplest but wasteful: companies that rely solely on a single high-capacity model report model-spend growth of 3–8× year-over-year. Dynamic routing will lower that growth if you can measure complexity reliably.

Placement choices carry predictable tradeoffs. Running embeddings or mid-sized models in managed GPUs in us-east-1 costs $2.50–$6.00/hr per GPU instance; inference for low-latency requires replicas, pushing baseline monthly infra to $3k–$12k depending on concurrency. Serving distilled models on Cloudflare Workers or on-device can reduce compute costs by 70% but increases engineering surface area and release cadence.

Vector storage and cache behavior also matter. Using a managed vector DB like Pinecone or Weaviate typically costs $0.03–$0.20 per 1k queries plus storage; a Redis-based approximate nearest neighbor cache can cut per-query cost by 40–70% at the expense of higher operational complexity and replica coordination. For a product with 150k monthly retrievals, that’s the difference between $45–$900/month and a $200–$600 engineering bill to operate and maintain consistency.

Latency budgets shape routing thresholds. If your product demands a 200 ms P95, you cannot route to a cross-region heavyweight model without paying for geo-replicated inference or getting lucky on network conditions. For 200 ms P95, favor edge or regional small-to-medium models and reserve large models for async or guaranteed-slo premium paths.

You must budget observability and evals into TCO. Instrumentation that attributes cost-per-inference, recall@k, and user success lift is not optional. Companies that skip this see blind model spend increases of 40–90% within 12 months because they can't measure the impact of routing rules or model upgrades.

Treat routing and placement as product features: they determine 3–10× of your model TCO and directly control whether inference spend scales with users or with revenue.

What this means for a CTO or technical founder

You should design your model serving architecture around three concrete constraints: a target latency SLO, a per-inference cost ceiling, and an operator cost budget. Pick numbers up front: for example, 300 ms P95, $0.03 average inference cost, and $150k/year operator budget. These numbers make decisions binary and stop optimistic trade-offs from drifting into runaway spend.

Start with a routing policy you can measure. Route 60–80% of traffic to a low-cost path (distilled model, cached responses, or heuristic fallback), 15–30% to a mid-tier hosted model, and 5–10% to the expensive specialist model. Implement attribution: every inference must record which model served it, tokens consumed, and a proxy success metric. If you can't get those three data points in the first 90 days, postpone any custom routing.

If you buy a vendor for serving, insist on three SLA commitments: (1) per-region P95 latency numbers, (2) transparent per-inference and storage pricing, and (3) an exportable telemetry stream with model attribution. If a vendor refuses, their black-box cost structure will be your problem at 10× scale. If you build, budget a 20–30% engineering tax for continuous tuning and regression testing of the routing model.

Quick actionable checklist

1) Define your SLOs and per-inference cost ceiling in dollars and milliseconds. 2) Instrument per-inference telemetry before you change any routing. 3) Deploy a three-tier routing plan (cheap/mid/expensive) and A/B test routing thresholds for 90 days. 4) Cache retrievals and use ANN caches for high-volume patterns. 5) Re-evaluate vector DB vs. in-memory cache at 100k monthly retrievals.

Three practical platform trade-offs to weigh now: use managed vector DBs to prioritize development speed at $0.03–$0.20/1k queries; invest in an ANN cache when monthly queries exceed 100k; and prefer regional model replicas for tight latency SLOs rather than a single central cluster with cross-region egress.

When deciding build vs. buy for serving, use a 3-year TCO model that includes model spend, infra, and 30–50% of an engineer's time for maintenance. A conservative rule: if internal cost to build is greater than 1.5× vendor cost over three years, buy; otherwise you should build only if you need IP-differentiated routing logic that vendors can't provide.

Key takeaways:

1. A hybrid routing architecture reduces model spend by 3×–10× versus single-model approaches over three years. 2. Define latency and cost SLOs in concrete numbers up front — 200–500 ms P95 and a per-inference dollar ceiling are typical. 3. Instrument per-inference telemetry before changing routing. 4. Use managed vector DBs for speed-to-market; switch to ANN caches at ~100k monthly retrievals to cut query costs. 5. Buy serving infrastructure if building costs exceed 1.5× vendor TCO over three years.

Designing model serving architecture deliberately is how you keep product margins intact as you scale. Route low-risk traffic to cheap models, keep heavy models behind explicit premium paths, and instrument every inference. That discipline turns model cost from an existential risk into a lever for product differentiation.

Model serving architecture: cost, latency, and routing tradeoffs

Model serving architecture: routing, placement, and SLOs

What this means for a CTO or technical founder

Quick actionable checklist

More from Insights

Questions to ask an AI development company

Agent orchestration architecture: planner/executor tradeoffs

Production model selection: hosted APIs vs self-hosted models