questions to ask an AI development company: ask them for numbers, not promises. A vendor that answers with roadmaps and vague guarantees will cost you months and multiples of the initial contract when the model misbehaves or the team hands off an untestable mess.

Answer, up front: who owns the training data, the evals, and the production prompts? If you don't secure that ownership you pay again to reproduce the system later. A 5-engineer in-house team costs roughly $900k–$1.1M/year fully loaded; a $150k/year SaaS contract that gives you operational ownership is often cheaper than rebuilding the same capability from a failed vendor handoff.

Direct answer: when someone asks “questions to ask an AI development company,” push for six commitments: an explicit hallucination-rate SLA (e.g., ≤1% for factual QA), recall@5 and MRR-tested retrieval metrics (recall@5 ≥90%), a token-cost budget with hard ceilings, an ownership transfer plan including eval suites and CI hooks, latency SLOs (300–500ms p95), and a 90-day roadmap with measurable milestones. If the vendor can't commit to numbers, don't sign.

questions to ask an AI development company — technical checklist

First technical question: which model families and deployment patterns will you use, and why? Expect a vendor to name specific hosted providers or open-weight models, and to justify them with latency, cost, and privacy trade-offs. For example, cite a hosted large-model API with median latency 220ms and cost $0.02–$0.08 per 1k tokens, or a self-hosted 4x GPU pod with predictable 80ms inference for private data.

Ask about retrieval architecture: vector database choice, index freshness cadence, chunking strategy, and recall targets. A production retrieval pipeline should state recall@5 ≥90% on your eval set and a reindex cadence (nearline or streaming) with max staleness defined in minutes. If the vendor can't show a retrieval eval report, they're guessing.

Probe evaluation: demand an eval harness that you own. A credible vendor delivers a test suite with unit tests, synthetic adversarial queries, and a metric dashboard. Expect automated evals to run on commit, with baseline pass rates (e.g., unit tests 100%, functional evals ≥85%) and historical trends across releases.

  • Model selection: name provider or model, median p50 latency, and cost per 1k tokens.
  • Retrieval: recall@5 target and reindexing frequency.
  • Eval suite: test coverage percentage and CI integration.

Next, ask about hallucination measurement and mitigation. A vendor should provide a measurable target per use case — for knowledge-grounded Q&A you should see ≤1% factual error at p95 after RAG tuning; for creative generation a 5%–10% tolerance is acceptable if clearly documented. Demand pre-launch adversarial tests that include out-of-knowledge queries and edge cases derived from your domain documents.

Also ask for details on guardrails: deterministic grounding (citation linking), output filters, and escalation flows for high-risk outputs. A robust system will log the chain of evidence for every response and tag outputs that fall below the grounding threshold so operators can triage and improve the index or prompt.

how will you measure hallucination, recall, and latency?

Start by asking for three metrics: hallucination rate per intent, retrieval recall@k, and end-to-end latency p95. A vendor should give you a baseline and an improvement plan — for example, reduce hallucination from 6% to ≤1% on intent X by adding citation-aware prompts and a 2-level reranker within 8 weeks.

Request the methodology. How are hallucinations labeled? Who annotates — vendor annotators or your SMEs? If the vendor uses internal, unshared labels, you lose the ability to audit. Good vendors provide the annotation schema and a sample of labeled cases (10–50 examples per intent) as part of the contract.

Demand latency budgets by call type. For synchronous user-facing flows, expect p95 ≤500ms; for high-frequency API integrations set p95 ≤300ms. If a vendor's architecture depends on multiple remote calls (e.g., index fetch + rerank + model), ask for measured tail latencies and a plan for caching or batching.

handover, ownership, and post-launch SLAs

The handover is where most vendor engagements fail. Ask for a concrete transfer package: code, infra-as-code, CI pipelines, eval suites, prompts with version history, and an access map for secrets. Expect an acceptance checklist tied to objective metrics, not just 'knowledge transfer sessions.'

Define post-launch ownership: who handles model drift, dataset updates, and incident response? A vendor should offer a time‑boxed managed-support window (90 days minimum) and a clear SLA for severity 1 incidents (e.g., 30-minute paging, 6-hour remediation target). If they refuse SLAs, you'll be carrying operational risk.

Ask about cost guardrails. Require an agreed monthly token or inference budget and automatic throttles that default to a safe mode rather than fail-open with higher spend. A typical busy feature might consume $8k–$20k/month in inference; you should have hard spending thresholds and a mechanism to disable or shift to cheaper models automatically.

Hire for measurable commitments: hallucination SLAs, owned eval suites, and a handover package with CI hooks — anything less is a risk you pay to learn later.

what this means for a CTO or technical founder

When you evaluate vendors, treat the RFP like a systems integration test. Require sample deliverables before contract signature: a one-week spike delivering a reproducible eval, a CI pipeline that runs your test suite, and a cost projection tied to projected traffic. If a vendor balks at an early spike, they're selling discovery, not delivery.

Budget for people and processes. A vendor can reduce your first-year engineering load, but your team still needs an owner. Expect to assign one senior engineer or product manager internally who devotes 20%–50% of their time to the project during the first 6 months; without that role, integrations and data ownership slip.

Insist on contractual levers: acceptance tests tied to hallucination thresholds, a defined handover checklist, and a rollback plan. Pay vendors on milestone deliveries, not time. If you're paying a $300k engagement, structure payments against measurable objectives spaced over 60–120 days.

quick checklist you can use in vendor interviews

  • Show me an eval report for a comparable domain with recall@5 and hallucination rate.
  • Provide a handover package sample that includes CI-based evals and prompts with history.
  • State your latency p95 for synchronous flows and your cost per 1k tokens or per-inference.
  • Describe your incident SLA and the first 90 days of post-launch support.
  • Demonstrate ownership of annotations and offer export of labeled data.

key takeaways

  1. Require numeric commitments up front: hallucination SLA, recall@k, latency p95, and a token-cost ceiling.
  2. Own the evals: insist that labeled data, test harnesses, and CI hooks are transferred to you.
  3. Make acceptance contractual — tie vendor payments to objective metrics and a handover checklist.
  4. Budget internal ownership: assign a senior engineer or PM at 20%–50% capacity for the first 6 months.
  5. Prefer vendors who will run a short, paid spike delivering a reproducible eval before larger commitments.

You can run the vendor interview yourself with these questions, or you can delegate it to a senior partner who knows how to turn answers into pass/fail gate checks. If you want help building the evals, drafting the handover checklist, or running a paid spike that proves a vendor's claims, our production AI agent development teams deliver those artifacts and the acceptance tests you'll use in procurement.

As a final test, ask a vendor to sign one sentence in the contract: they will deliver an exportable eval harness and labeled dataset at handover. If they won't, assume the engagement ends with a monolith of opaque prompts and a monthly invoice — and plan accordingly.