RAG evaluation framework: metrics to measure retrieval

A RAG evaluation framework should be the first engineering discipline you formalize when RAG carries revenue or supports SLA-bound workflows. Teams that evaluate only model output quality lose 20–60% of the benefit retrieval provides, because when search results are poor the relevant documents never reach the model.

A mid-market B2B product with 100k queries/day typically sees $30k–$120k/month in model-token costs and $3k–$25k/month in retrieval infrastructure and embedding spend. When retrieval drives 60% of end-user satisfaction, raising recall@10 from 70% to 90% can shift support-ticket rate by 15–30% and trial conversion by 5–12%.

Direct answer: A RAG evaluation framework is a set of offline and online measurements (recall@k, precision@1, end-to-end answer accuracy, latency P95, and cost-per-query) tied to business outcomes. Aim for recall@50 ≥ 85% for exploratory search, precision@1 ≥ 80% for single-answer flows, latency P95 ≤ 300 ms for interactive UIs, and a marginal cost-per-query under $0.005 to keep model spend manageable at 100k queries/day. These targets let you prioritize index strategy, embedding model, and vector-store topology.

RAG evaluation framework: key metrics and trade-offs

Start with objective, instrumented metrics. Recall@k measures whether the correct document is in the top-k results; precision@k measures what fraction of the returned documents are relevant. Both are cheap to compute offline from labeled queries and should be your gate for index or chunking changes. A jump in recall@10 from 72% to 88% cuts the share of queries where the correct document is missing from 28% to 12%, and the downstream hallucination surface shrinks roughly in proportion because the model sees better context most of the time.
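The two offline metrics above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the document IDs and the `labeled_runs` set are hypothetical, recall@k is computed as the fraction of labeled relevant documents found in the top k (which reduces to a hit/miss score when each query has one correct document), and precision@k divides by k.

```python
from typing import Callable, List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labeled relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled with a relevant document."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def mean_metric(
    metric: Callable[[Sequence[str], Set[str], int], float],
    runs: List[Tuple[List[str], Set[str]]],
    k: int,
) -> float:
    """Average a per-query metric over a labeled query set."""
    return sum(metric(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Hypothetical labeled set: (ranked retrieval output, relevant document IDs).
labeled_runs = [
    (["d1", "d7", "d3"], {"d1"}),
    (["d4", "d2", "d9"], {"d9", "d5"}),
]

print("recall@3:", mean_metric(recall_at_k, labeled_runs, 3))
print("precision@1:", mean_metric(precision_at_k, labeled_runs, 1))
```

Running the same functions before and after an index or chunking change gives the before/after comparison to gate on.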

Measure latency in three places: embedding latency, retrieval latency, and end-to-end model latency. Embedding generation is often 30–120 ms per request on 512-token inputs; vector search P95 ranges from 20–60 ms for in-memory ANN to 120–300 ms for disk-backed HNSW or kNN over SSD. In production you'll see the vector store represent 20–45% of end-to-end P95 latency for interactive flows.
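One way to get the three-stage breakdown is to record per-request timings for each stage and compute P95 per stage and for the total. The sketch below assumes a nearest-rank percentile and hypothetical field names (`embed_ms`, `retrieve_ms`, `generate_ms`); the sample timings are invented for illustration. Per-stage P95 values do not add up to the end-to-end P95, so the retrieval share it reports is only an approximation.

```python
import math
from typing import Dict, List

STAGES = ["embed_ms", "retrieve_ms", "generate_ms"]


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def stage_p95(traces: List[Dict[str, float]]) -> Dict[str, float]:
    """P95 for each stage plus P95 of the per-request end-to-end total."""
    report = {stage: percentile([t[stage] for t in traces], 95) for stage in STAGES}
    report["total_ms"] = percentile([sum(t[s] for s in STAGES) for t in traces], 95)
    return report


# Invented per-request timings in milliseconds.
traces = [
    {"embed_ms": 40, "retrieve_ms": 25, "generate_ms": 300},
    {"embed_ms": 50, "retrieve_ms": 30, "generate_ms": 320},
    {"embed_ms": 45, "retrieve_ms": 20, "generate_ms": 310},
    {"embed_ms": 60, "retrieve_ms": 35, "generate_ms": 400},
    {"embed_ms": 120, "retrieve_ms": 200, "generate_ms": 500},
]

report = stage_p95(traces)
retrieval_share = report["retrieve_ms"] / report["total_ms"]
print(report, f"retrieval share of P95: {retrieval_share:.0%}")
```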

Cost metrics must be explicit. Typical numbers in 2026: managed vector DB storage $0.20–$0.50/GB-month (Pinecone, Weaviate), self-hosted index storage on cloud block storage $0.08–$0.18/GB-month (Milvus, FAISS on EBS), and embedding API cost per 1k tokens ranges $0.0005–$0.003 depending on model family. Calculate cost-per-query as (embedding cost + retrieval cost + added model tokens * model token price). If cost-per-query exceeds $0.01 at scale, you must adopt caching or offline summarization.
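The cost-per-query formula above translates directly into code. In this sketch the per-query retrieval cost and the 50-token query length are hypothetical placeholders; the token prices are picked from the ranges quoted in the text and should be replaced with your own contract prices.

```python
def cost_per_query(
    embedding_tokens: int,
    embedding_price_per_1k: float,
    retrieval_cost: float,
    added_model_tokens: int,
    model_price_per_1k: float,
) -> float:
    """embedding cost + retrieval cost + added model tokens * model token price."""
    embedding_cost = embedding_tokens / 1000 * embedding_price_per_1k
    model_cost = added_model_tokens / 1000 * model_price_per_1k
    return embedding_cost + retrieval_cost + model_cost


# Assumed inputs: a 50-token query, $0.001 per 1k embedding tokens,
# a hypothetical $0.0002 amortized vector-store cost per query,
# and 300 added context tokens at $0.002 per 1k model tokens.
cost = cost_per_query(50, 0.001, 0.0002, 300, 0.002)

# The $0.01 ceiling comes from the text above.
needs_caching_or_summarization = cost > 0.01
print(f"cost per query: ${cost:.5f}", needs_caching_or_summarization)
```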

Offline metrics are necessary but not sufficient. A retrieval pipeline that achieves recall@50=92% in offline tests can still fail online because of query drift, document freshness, or embedding-model-domain mismatch. You need a two-track evaluation: deployable offline tests (unit-level recall/precision, index-A/B tests) and short-window online experiments (canary traffic A/B with business KPIs).

Tooling choices shape what you can measure. LangSmith and Helicone capture model calls and latencies; Pinecone, Qdrant, and Weaviate expose Prometheus-compatible metrics for query rates and P95. Self-hosted FAISS or Milvus gives you cheaper storage but increases variance in P95 under tail loads unless you overprovision CPU and RAM by 30–60% and add replica shards.

Measure retrieval like a product: tie recall, latency, and cost to the single business KPI you need to move, then optimize the smallest, cheapest lever that reliably changes that KPI.

RAG evaluation in production

Define acceptance criteria you can automate. For a support-assistant workflow your acceptance might be: precision@1 ≥ 80% on a 5k labeled holdout, end-to-end human-verified answer accuracy ≥ 88% on a 1k sample, and latency P95 ≤ 350 ms under 95th-percentile load. Automate nightly recalculation and fail pipelines that regress any criterion by more than 3 percentage points.
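An automated gate for these criteria might look like the following sketch. The metric names and dictionary layout are hypothetical. It also assumes the 3-percentage-point regression rule applies only to the ratio metrics, since percentage points have no direct meaning for a latency measured in milliseconds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Criterion:
    name: str
    threshold: float
    higher_is_better: bool = True


# Thresholds taken from the support-assistant example above.
CRITERIA = [
    Criterion("precision_at_1", 0.80),
    Criterion("answer_accuracy", 0.88),
    Criterion("latency_p95_ms", 350.0, higher_is_better=False),
]
MAX_REGRESSION_POINTS = 3.0


def evaluate_gate(current: Dict[str, float], baseline: Dict[str, float]) -> List[str]:
    """Return failure messages; an empty list means the pipeline may proceed."""
    failures = []
    for c in CRITERIA:
        value = current[c.name]
        meets = value >= c.threshold if c.higher_is_better else value <= c.threshold
        if not meets:
            failures.append(f"{c.name}={value} misses threshold {c.threshold}")
        if c.higher_is_better:
            # Regression check in percentage points against the last accepted run.
            drop_points = (baseline[c.name] - value) * 100
            if drop_points > MAX_REGRESSION_POINTS:
                failures.append(f"{c.name} regressed by {drop_points:.1f} points")
    return failures


baseline = {"precision_at_1": 0.86, "answer_accuracy": 0.90, "latency_p95_ms": 310.0}
current = {"precision_at_1": 0.82, "answer_accuracy": 0.895, "latency_p95_ms": 330.0}

failures = evaluate_gate(current, baseline)
print(failures or "gate passed")
```

In this example precision@1 still clears its 80% threshold but fails the gate because it dropped 4 points against the baseline. A nightly job could exit non-zero whenever the returned list is non-empty.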

Set thresholds tied to cost. If your product runs 100k queries/day and each query increases token consumption by 300 tokens on average, that is 900M added tokens per 30-day month, so model token spend at $0.002 per 1k tokens is $1.8k/month. Reducing model tokens by 25% via retrieval that supplies better context saves about $450/month, a 25% reduction in recurring cost from a single engineering investment.
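The spend estimate is easy to recompute whenever an input changes. A minimal sketch, assuming a 30-day month and using the inputs stated above (100k queries/day, 300 added tokens per query, $0.002 per 1k tokens, a 25% token reduction); the function name is illustrative.

```python
def monthly_token_spend(
    queries_per_day: int,
    tokens_per_query: float,
    price_per_1k_tokens: float,
    days_per_month: int = 30,
) -> float:
    """Monthly model-token spend in dollars for the added context tokens."""
    tokens_per_month = queries_per_day * tokens_per_query * days_per_month
    return tokens_per_month / 1000 * price_per_1k_tokens


baseline_spend = monthly_token_spend(100_000, 300, 0.002)
reduced_spend = monthly_token_spend(100_000, 300 * 0.75, 0.002)  # 25% fewer tokens
monthly_savings = baseline_spend - reduced_spend

print(f"baseline: ${baseline_spend:,.0f}/month, savings: ${monthly_savings:,.0f}/month")
```

Because spend is linear in every input, a tenfold change in token price or query volume scales both the baseline and the savings tenfold.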

Choose managed vs. self-hosted by predictable load and engineering bandwidth. Buy managed (Pinecone, Weaviate Cloud, Qdrant Cloud) when peak QPS is >5k or when your SRE team is under 3 people; the predictable SLA and metrics feed into experiments. Self-host (Milvus, FAISS, Redis Vector) when 3‑year TCO for storage and ops is at least 30% lower and you can accept 40–90 ms higher P95 tail latency.

3 actionable steps for CTOs

Label 5–10k representative queries and run offline recall@k and precision@1 baselines within 2 weeks. Use those baselines to decide chunk size and whether to include long-context retrieval or hierarchical retrieval.
Instrument end-to-end: capture query text, chosen documents, embedding model ID, vector-store score, model prompt tokens, model response tokens, latency P50/P95, and an automated post-filter score. Keep these traces for 30–90 days and sample for human review at a 0.5–2% rate.
Run a 4-week canary that ties retrieval changes to business KPIs. If a retrieval tweak improves recall@10 by 12% but increases P95 by 80 ms and reduces trial conversion by 3%, roll back. Prioritize changes that improve a revenue or retention KPI while keeping latency within your UX budget.
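For the instrumentation step, a trace record and a review sampler could be sketched as below. The field names mirror the list in the second step but are otherwise hypothetical, as are the model ID and sample values. P50/P95 are meant to be aggregated later from the per-request `latency_ms`, and the 1% rate is one point inside the 0.5–2% range given above.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievalTrace:
    query_text: str
    document_ids: List[str]
    embedding_model_id: str
    vector_scores: List[float]
    prompt_tokens: int
    response_tokens: int
    latency_ms: float          # per-request; aggregate to P50/P95 downstream
    post_filter_score: float   # output of the automated post-filter


def sample_for_review(
    traces: List[RetrievalTrace], rate: float, seed: int = 0
) -> List[RetrievalTrace]:
    """Pick roughly `rate` of the traces for human review, reproducibly."""
    rng = random.Random(seed)
    return [trace for trace in traces if rng.random() < rate]


# Invented traces for illustration only.
traces = [
    RetrievalTrace(
        query_text=f"example query {i}",
        document_ids=["d1", "d2"],
        embedding_model_id="embedding-model-v1",  # hypothetical ID
        vector_scores=[0.91, 0.84],
        prompt_tokens=420,
        response_tokens=130,
        latency_ms=280.0,
        post_filter_score=0.77,
    )
    for i in range(1000)
]

review_queue = sample_for_review(traces, rate=0.01)
print(f"{len(review_queue)} of {len(traces)} traces queued for human review")
```

Fixing the seed keeps the review sample reproducible across reruns; a production sampler would more likely hash a request ID so the decision is stable per request.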

Bonus: maintain a cost ledger. Track monthly model spend, embedding spend, vector-store bill, and ops time. A simple ledger showing model tokens = $X, embeddings = $Y, vector DB = $Z makes trade-offs clear when you consider caching, distilled embeddings, or semantic compression.

Key takeaways

A RAG evaluation framework is recall@k + precision@1 + latency P95 + cost-per-query, and these metrics must map to a single business KPI.
Aim for recall@50 ≥ 85% for exploration and precision@1 ≥ 80% for single-answer flows; keep latency P95 ≤ 300–350 ms for interactive experiences.
Buy managed vector services when you need SLA and predictable metrics at >5k QPS; self-host when 3‑year TCO and tail latency trade-offs favor you and you have ops headcount.
Instrument end-to-end traces and run short-window online experiments; offline metrics alone are insufficient to predict user-facing outcomes.
Track a cost ledger monthly; reducing model token use via better retrieval is often the highest-ROI lever and can cut model spend by 15–40% within months.

A RAG evaluation framework isn’t an analytics afterthought. When you formalize it through labeling, automated acceptance, instrumentation, and canarying, you turn retrieval from a noisy backend component into a control knob that reliably improves conversion, retention, and cost. The new twist: treat retrieval optimization as cheap product experimentation. Small offline wins should always be validated with short-window canaries tied to one KPI before you commit engineering cycles and hosting spend.
