Production RAG architecture: vector store, evals, cost

Production RAG architecture is primarily an engineering trade-off about data topology and cost, not just model choice. You should judge a RAG system first by how it stores, retrieves, and validates evidence before you optimize prompts or swap models.

Direct answer: Production RAG architecture is the combination of embedding pipelines, a vector database, retrieval logic, and evaluation/observability that together deliver a target recall, latency, and cost per query. A realistic target is recall@10 ≥ 0.75, P95 latency ≤ 600ms, and a cost envelope of $0.01–$0.20 per query, depending on model and vector scale. Measure those three numbers before choosing between a vendor and open source.

The stakes are concrete. A 5-engineer AI infra team at a US startup costs roughly $900k–$1.1M/year fully loaded (at $180–220k per engineer). A third-party vector DB subscription often sits between $5k and $60k/month for production workloads; self-hosting similar capacity on a few N2 instances plus SSDs typically runs $1.5k–$8k/month but adds 0.6–1.2 FTEs of operational load.

You will make two kinds of mistakes: under-indexing (bad recall, more hallucinations) and over-indexing (high storage and query cost, brittle rankers). Both manifest as business metrics — support tickets, reduced conversions, or cost surprises on your monthly cloud bill.

Production RAG architecture: pick the vector problem, not just the model

Vector database selection is the first architecture lever. For 10M–100M vectors, managed services like Pinecone and Weaviate Cloud simplify ops and will cost $1k–$12k/month depending on replication and pod type. Self-hosting Qdrant, or an index built on the FAISS library, on c5d.4xlarge nodes can reduce the monthly bill to $300–$2,000 but requires roughly 0.3–0.6 FTE of engineering time to keep it healthy.

Embedding cost becomes a recurring line item. Embeddings typically cost between $0.0002 and $0.001 per 1K tokens on common commercial models; at those rates, one full re-embed pass costs $10k–$50k for a corpus of 50B tokens (a 50M-token corpus would cost only $10–$50). Incremental re-embedding (new content only) cuts that to $500–$4k/month for many product workloads.
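The arithmetic behind a re-embed estimate is linear in token count, so it is easy to sanity-check. The following is a minimal sketch, assuming a flat per-1K-token price with no batching discounts; the function name and the 50-billion-token corpus size are illustrative assumptions, while the price band is the one quoted above.

```python
def embedding_cost_usd(total_tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost of embedding `total_tokens` tokens at a flat per-1K-token price."""
    return total_tokens / 1000 * usd_per_1k_tokens


# Price band quoted in the text (USD per 1K tokens).
LOW_RATE, HIGH_RATE = 0.0002, 0.001

# Hypothetical corpus size for illustration: 50 billion tokens.
corpus_tokens = 50_000_000_000

low = embedding_cost_usd(corpus_tokens, LOW_RATE)
high = embedding_cost_usd(corpus_tokens, HIGH_RATE)
print(f"full re-embed: ${low:,.0f} to ${high:,.0f}")
```

Running the same function on your own token count before a vendor call shows quickly whether re-embedding is a rounding error or a budget item.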

Chunking strategy drives both recall and costs. Coarse chunking (2–5 KB per chunk) yields fewer vectors, lowering storage and search cost by 40–70%, but recall@10 often sits at 0.42–0.62. Fine-grained chunking (0.5–1 KB) increases recall@10 to 0.72–0.85 at the expense of 2×–6× more vectors and a proportional increase in vector DB cost and embedding spend.
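To see how chunk size changes vector count, a simple experiment is enough. The sketch below is a greedy word-boundary chunker with a byte limit; it is an assumed, simplified strategy (real pipelines often split on sentences or headings and add overlap), and the synthetic document is only there to make the counts easy to verify.

```python
def chunk_by_bytes(text: str, max_bytes: int) -> list[str]:
    """Greedy chunker: pack whole words until the UTF-8 size limit is reached.

    A single word longer than max_bytes becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for word in text.split():
        word_bytes = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if current and size + word_bytes > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += word_bytes
    if current:
        chunks.append(" ".join(current))
    return chunks


# Synthetic 200 KB document: 20,000 copies of a 9-byte word plus a space.
document = "retrieval " * 20_000

coarse = chunk_by_bytes(document, max_bytes=4_000)  # within the 2-5 KB band
fine = chunk_by_bytes(document, max_bytes=1_000)    # within the 0.5-1 KB band
print(len(coarse), len(fine), len(fine) / len(coarse))
```

Each chunk becomes one vector, so the ratio printed at the end is the multiplier on both embedding spend and index size for that pair of settings.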

Latency budget: target an SLO of P95 ≤ 600ms for interactive features and ≤ 2s for low-priority batch tasks. In practice, retrieval (vector search + filtering) should consume ≤ 300ms and embedding calls for user-provided text ≤ 150ms, which leaves roughly 150ms of headroom for model calls and post-processing. If your vector DB adds 400–700ms, swapping to local FAISS or tuning ANN parameters can reduce that to 40–120ms.
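A per-stage budget only helps if it is checked against measured percentiles. The following is a minimal sketch, assuming latency samples are already collected per stage in milliseconds; the stage names, the `BUDGET_MS` table, and the nearest-rank percentile method are illustrative choices rather than a prescribed setup.

```python
def p95(samples_ms: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# Budgets from the text: embedding <= 150ms, retrieval <= 300ms, end-to-end <= 600ms.
BUDGET_MS = {"embed": 150, "retrieve": 300, "total": 600}


def over_budget(stage_samples: dict[str, list[float]]) -> dict[str, float]:
    """Return each stage whose observed P95 exceeds its budget."""
    violations = {}
    for stage, samples in stage_samples.items():
        observed = p95(samples)
        if observed > BUDGET_MS[stage]:
            violations[stage] = observed
    return violations


# Hypothetical measurements: retrieval has a slow tail in 10% of requests.
samples = {
    "embed": [80.0] * 100,
    "retrieve": [120.0] * 90 + [450.0] * 10,
    "total": [500.0] * 100,
}
violations = over_budget(samples)
print(violations)
```

In this made-up data the mean retrieval latency looks healthy, yet the P95 breaches the 300ms budget, which is the case averages hide.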

Retrieval effectiveness: measure recall@k and R-precision on a held-out set. A baseline commercial vector DB might show recall@10 = 0.64; tuning ANN parameters, chunking, and title boosting can lift that to 0.78–0.83. Those gains translate into business outcomes: improving recall@10 from 0.64 to 0.78 cut hallucination-driven support tickets for one customer from 120/mo to 47/mo, a $12k/month operations delta in that case.
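Both metrics are simple to compute once each benchmark query has a labeled set of relevant document IDs. This is a minimal sketch using the standard definitions (recall@k is the share of relevant documents found in the top k; R-precision is precision at rank R, where R is the number of relevant documents); the document IDs are made up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def r_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return len(set(retrieved[:r]) & relevant) / r


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs from a benchmark."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Hypothetical single query: three relevant docs, two of them retrieved.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d4", "d8"}
print(recall_at_k(retrieved, relevant, k=5))
print(r_precision(retrieved, relevant))
```

Averaging over the whole held-out set with `mean_recall_at_k` gives the single recall@10 number to track across retrieval configurations.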

Production RAG succeeds or fails on how you model and measure your evidence store — not on swapping to the 'latest' LLM.

Practical trade-offs: cost per query, scalability, and observability

Cost per query breaks into three line items: embedding call (when needed), vector search, and model consumption. For a sample interactive query with cached embeddings, expect $0.005–$0.06 per query at small scale (10k–100k vectors), rising to $0.06–$0.20 when model tokens are expensive and retrieval returns long context. If you perform fresh embeddings for each user input, add $0.002–$0.02 per call depending on model choice.
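The three line items can be tracked per request with a small record type. The following is a sketch under stated assumptions: the class name, field names, and all prices in the example are hypothetical, and the model is billed per 1K prompt and completion tokens.

```python
from dataclasses import dataclass


@dataclass
class QueryCost:
    embedding_usd: float          # 0.0 when the query embedding is cached
    search_usd: float             # amortized vector search cost for this query
    prompt_tokens: int            # includes the retrieved context
    completion_tokens: int
    usd_per_1k_prompt: float      # hypothetical model pricing
    usd_per_1k_completion: float  # hypothetical model pricing

    @property
    def model_usd(self) -> float:
        """Model consumption: prompt plus completion tokens at per-1K prices."""
        return (
            self.prompt_tokens / 1000 * self.usd_per_1k_prompt
            + self.completion_tokens / 1000 * self.usd_per_1k_completion
        )

    @property
    def total_usd(self) -> float:
        """Sum of the three line items."""
        return self.embedding_usd + self.search_usd + self.model_usd


# Hypothetical query: fresh embedding, 4,000 prompt tokens of retrieved context.
cost = QueryCost(
    embedding_usd=0.002,
    search_usd=0.005,
    prompt_tokens=4_000,
    completion_tokens=500,
    usd_per_1k_prompt=0.01,
    usd_per_1k_completion=0.03,
)
print(f"model=${cost.model_usd:.3f} total=${cost.total_usd:.3f}")
```

Logging this record alongside each request makes it visible when long retrieved context, rather than search itself, is what pushes a query toward the top of the range.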

Scalability is about index sharding, replication, and update patterns. Write-heavy indexes (customer docs updated hourly) require different infrastructure than read-heavy public knowledge bases. For write-heavy systems, allocate 20–40% more nodes for ingestion throughput; for read-heavy systems, prioritize CPU and memory to optimize ANN search latency, which can save 150–400ms P95 per query.

Observability and evaluation are non-negotiable. Instrument retrieval recall, digests per document, rerank lift, and downstream pass/fail. Tools like LangSmith and Helicone (or custom pipelines) typically cost $300–$1,200/month for moderate traffic but save hundreds of hours of debugging by correlating queries, retrieved IDs, and final model outputs. Set an evaluation pass rate target: start at 85% on a 500-example benchmark, and iterate until you hit 92% before ramping to production.
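The pass-rate targets can be encoded as a simple gate in CI or a release checklist. This is a minimal sketch that assumes one reading of the thresholds: 85% is the initial bar on the benchmark and 92% is the bar for ramping to production. The function names and stage labels are hypothetical.

```python
def pass_rate(results: list[bool]) -> float:
    """Share of benchmark examples whose final output was judged a pass."""
    if not results:
        raise ValueError("empty benchmark")
    return sum(results) / len(results)


def rollout_stage(
    results: list[bool],
    start_threshold: float = 0.85,
    ramp_threshold: float = 0.92,
) -> str:
    """Map a benchmark run to a stage label (labels are illustrative)."""
    rate = pass_rate(results)
    if rate >= ramp_threshold:
        return "ramp"
    if rate >= start_threshold:
        return "iterate"
    return "below_target"


# Hypothetical 500-example benchmark runs.
early_run = [True] * 440 + [False] * 60   # 88% pass
later_run = [True] * 465 + [False] * 35   # 93% pass
print(rollout_stage(early_run), rollout_stage(later_run))
```

Keeping the gate in code means the thresholds are versioned alongside the benchmark, so a change to either is visible in review.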

What this means for a CTO

You must budget RAG as a data product. A sensible 3-year TCO comparison starts from annual figures: a managed vector DB with embeddings included might cost $120k–$720k/year, while self-hosting runs $40k–$180k/year in cloud VMs plus the salary cost of 0.5–1.0 FTE of ops staff. If your roadmap expects <1M vectors and you want fast time-to-market, buy. If you expect >50M vectors or need strict P95 latency under 200ms, build or hybridize.

Run a 6-week evaluation before committing. Phase 1 (2 weeks): collect a 500–2,000 example benchmark that reflects production query distribution and business metrics. Phase 2 (2 weeks): implement three retrieval configurations (managed vector DB, self-hosted FAISS/Qdrant, and a hybrid with caching) and measure recall@10, P95 latency, and cost per query. Phase 3 (2 weeks): pilot with 50–200 real users and measure business metrics (support tickets, conversion, time-on-task).

Apply a crisp decision rule: if managed vendor TCO is ≤40% of in-house TCO and recall@10 meets your threshold and P95 latency is within SLO, buy. Otherwise, plan for a hybrid build with managed control plane and self-hosted hot path, which often reduces costs 25–55% while preserving operational stability.
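Written as code, the rule is a three-way conjunction. The sketch below assumes TCO figures in the same currency and period for both options, and it defaults the recall and latency thresholds to the targets quoted earlier (recall@10 ≥ 0.75, P95 ≤ 600ms); the function name and example figures are hypothetical.

```python
def buy_or_build(
    managed_tco: float,
    inhouse_tco: float,
    recall_at_10: float,
    p95_ms: float,
    recall_threshold: float = 0.75,
    p95_slo_ms: float = 600.0,
) -> str:
    """Return "buy" only if all three conditions hold; otherwise "hybrid"."""
    if inhouse_tco <= 0:
        raise ValueError("in-house TCO must be positive")
    cheap_enough = managed_tco <= 0.40 * inhouse_tco
    meets_recall = recall_at_10 >= recall_threshold
    meets_latency = p95_ms <= p95_slo_ms
    if cheap_enough and meets_recall and meets_latency:
        return "buy"
    return "hybrid"


# Hypothetical evaluation results for one managed vendor.
print(buy_or_build(managed_tco=150_000, inhouse_tco=400_000,
                   recall_at_10=0.78, p95_ms=520))
print(buy_or_build(managed_tco=300_000, inhouse_tco=400_000,
                   recall_at_10=0.78, p95_ms=520))
```

Feeding it the measured numbers from the evaluation phases keeps the decision tied to the benchmark rather than to a vendor pitch.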

Key takeaways

1. Measure three numbers before any vendor call: recall@10, P95 latency, and cost per query on a 500–2,000 example benchmark.
2. Chunking is the single biggest tactical lever: coarser chunks cut cost 40–70% but often drop recall 10–30 percentage points.
3. For ≤1M vectors, managed vector DBs speed time-to-market; for >50M vectors or strict P95 budgets, plan to self-host or hybridize.
4. Budget observability: allocate $300–$1,200/month and a 2-week eval to prevent production surprises.
5. Use a 6-week eval-and-pilot runway before you decide to fully buy or build.

Production RAG architecture is an engineering problem where the right measurement beats the latest model. If you treat vector stores as incidental, you'll pay in latency, cost, and hallucinations. Design your experiments around recall, latency, and per-query cost and you'll make a defensible buy/build choice with concrete numbers to show your board.
