Most retrieval-augmented generation demos collapse the moment real users show up. These are the boring things we check before a RAG system is allowed anywhere near production.
No eval set, no merge. Before writing a single embedding, we build a golden set of 200–500 Q&A pairs and have a subject-matter expert review it. Every change is graded against it.
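As a rough illustration of the "no eval set, no merge" rule, here is a minimal sketch of a merge gate. Everything in it is an assumption made for the example: the names `GoldenExample`, `grade` and `merge_allowed` are hypothetical, and substring matching stands in for whatever grader (rubric, SME review, model-based judge) a real team would use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenExample:
    """One SME-reviewed question/answer pair (hypothetical structure)."""
    question: str
    expected_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def grade(answer_fn: Callable[[str], str], golden: Sequence[GoldenExample]) -> float:
    """Return the fraction of golden questions answered correctly.

    Assumption: an answer counts as correct if it contains the expected
    answer as a substring. This is a stand-in for a real grader.
    """
    if not golden:
        raise ValueError("golden set is empty; nothing to grade against")
    hits = sum(
        1
        for ex in golden
        if normalize(ex.expected_answer) in normalize(answer_fn(ex.question))
    )
    return hits / len(golden)


def merge_allowed(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Allow the merge only if the candidate does not regress past the tolerance."""
    return candidate >= baseline - tolerance
```

In CI, `answer_fn` would wrap the candidate pipeline, and the baseline score would come from the last accepted run on the main branch. A failing `merge_allowed` check blocks the change.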
Every request passes through input filtering, output filtering, and a cost ceiling. Every call is observable: we log the prompt, retrieved chunks, model response, cost, and latency. We replay problem queries in CI.
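The guardrail and tracing items above could be wired together roughly as follows. This is a sketch under stated assumptions, not a reference design: `GuardedClient`, `CallTrace` and `CostCeilingExceeded` are hypothetical names, the `generate` callable is assumed to return a `(response, cost_usd)` pair, and the filters are plain predicates supplied by the caller.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable, List, Tuple


class CostCeilingExceeded(RuntimeError):
    """Raised when the accumulated spend has reached the ceiling."""


@dataclass
class CallTrace:
    """Everything needed to inspect or replay one call."""
    prompt: str
    retrieved_chunks: List[str]
    response: str          # raw model response, before output filtering
    cost_usd: float
    latency_s: float
    blocked: bool          # True if the output filter withheld the response


@dataclass
class GuardedClient:
    # Assumed signature: generate(prompt, chunks) -> (response, cost_usd)
    generate: Callable[[str, List[str]], Tuple[str, float]]
    input_ok: Callable[[str], bool]
    output_ok: Callable[[str], bool]
    ceiling_usd: float
    spent_usd: float = 0.0
    traces: List[CallTrace] = field(default_factory=list)

    def call(self, prompt: str, chunks: List[str]) -> str:
        # Input filter runs first, so rejected prompts cost nothing.
        if not self.input_ok(prompt):
            raise ValueError("prompt rejected by input filter")
        # Refuse new calls once the budget is used up.
        if self.spent_usd >= self.ceiling_usd:
            raise CostCeilingExceeded(f"spent {self.spent_usd:.2f} USD")
        start = time.perf_counter()
        response, cost = self.generate(prompt, chunks)
        latency = time.perf_counter() - start
        self.spent_usd += cost
        blocked = not self.output_ok(response)
        self.traces.append(CallTrace(prompt, chunks, response, cost, latency, blocked))
        return "[withheld by output filter]" if blocked else response

    def dump_traces(self) -> str:
        """Serialize traces as JSON Lines, one call per line."""
        return "\n".join(json.dumps(asdict(t)) for t in self.traces)
```

The JSON Lines output is one possible source of replay fixtures: a problem query's trace already contains the prompt and the retrieved chunks, so a CI job can feed them back through the pipeline and compare the new response with the recorded one.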
As a rule of thumb, 80% of RAG quality comes from the data pipeline, not the model.
Get the ingestion, normalization and eval loop right first. The rest is tuning.