The one-sentence answer
RAG (Retrieval-Augmented Generation) means the AI looks up your real data before answering, instead of relying on what it half-remembers from training. It's the difference between an open-book exam and a closed-book one.
Why this matters
Language models like Claude and GPT-4 are trained on huge slices of the public internet, but they don't know your company's order history, your latest pricing, your internal SOPs, or any document written after their training cutoff. Ask one a question about something it doesn't know and it often won't say "I don't know". Instead, it may confidently make something up. That's hallucination.
RAG fixes this by adding a step before the model answers:
- You ask a question.
- The system looks through your documents/database to find the most relevant snippets.
- It hands those snippets to the model, along with your question.
- The model answers using those snippets as the source of truth.
Result: the model can answer about your specific data, with citations, and is much less likely to invent facts.
The open-book exam analogy
Imagine a brilliant student who's read a lot of books but never seen your textbook. Ask them a question about your textbook and they'll guess plausibly — and sometimes wrongly.
Now give them the textbook and let them open it before answering. Same student, dramatically better answers. That's RAG.
The four pieces of a RAG system
1. Documents
Whatever you want the AI to be able to "read" — help-center articles, product specs, internal SOPs, customer history, contracts, anything. RAG starts with you having something worth retrieving from. Garbage in, garbage out applies harder than usual here.
2. Embeddings
Each chunk of your documents gets converted into a vector — a list of numbers — that captures its meaning. Similar meanings end up at similar coordinates in vector space. "Customer can't login" and "user has trouble signing in" end up near each other even though the words barely overlap.
Today (Q2 2026), most teams use OpenAI's text-embedding-3-large for English, or Voyage AI for multilingual / domain-specific use cases.
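To make "similar meanings end up at similar coordinates" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are hand-made for illustration and do not come from any real embedding model, which would return hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumption: toy vectors invented for this example, not real model output.
toy_embeddings = {
    "Customer can't login": [0.9, 0.1, 0.0],
    "User has trouble signing in": [0.8, 0.2, 0.1],
    "Invoice is overdue": [0.1, 0.0, 0.9],
}

query = toy_embeddings["Customer can't login"]
for text, vector in toy_embeddings.items():
    # Higher score means closer in meaning under this toy setup.
    print(f"{cosine_similarity(query, vector):.3f}  {text}")
```

With these toy numbers, the two login sentences score far closer to each other than either does to the invoice sentence, which is the behaviour a real embedding model is meant to produce.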
3. Vector database
Where the embeddings live. When a question comes in, you embed the question, then search the database for the closest matching document chunks. Common choices:
- pgvector — a Postgres extension. Free, fast for up to ~5M embeddings, lives where your data already lives. Our default.
- Pinecone — managed vector DB. Convenient, more expensive, right for multi-tenant SaaS at scale.
- Weaviate / Qdrant / LanceDB — alternatives with their own tradeoffs.
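Whichever database you pick, the core operation is the same: find the stored vectors closest to the query vector. The sketch below does that by brute force over an in-memory dict. The chunk IDs and vectors are invented for illustration, and real vector databases typically use approximate indexes rather than scanning every row.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=3):
    """Return the k (score, chunk_id) pairs most similar to the query, best first."""
    scored = (
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in index.items()
    )
    return heapq.nlargest(k, scored)


# Assumption: a hypothetical four-chunk index with made-up 3-d embeddings.
index = {
    "reset-password": [0.9, 0.1, 0.0],
    "sso-login": [0.7, 0.3, 0.1],
    "refund-policy": [0.0, 0.2, 0.9],
    "shipping-times": [0.1, 0.9, 0.2],
}

query_vector = [0.8, 0.2, 0.0]  # stands in for the embedded user question
results = top_k(query_vector, index, k=2)
for score, chunk_id in results:
    print(f"{score:.3f}  {chunk_id}")
```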
4. Generation
The retrieved snippets get inserted into the prompt sent to the model. The model then answers using those snippets as context. This is the "generation" part — the model is generating an answer, augmented by retrieval.
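As a rough illustration of "inserted into the prompt", the sketch below assembles retrieved snippets and the question into one prompt string. The function name, the instruction wording, and the sample sources are all assumptions made for this example. The result would then be sent to whichever model API you use.

```python
def build_prompt(question, snippets):
    """Assemble a grounded prompt from a list of (source_id, text) pairs."""
    context = "\n\n".join(
        f"[{number}] (source: {source_id})\n{text}"
        for number, (source_id, text) in enumerate(snippets, start=1)
    )
    return (
        "Answer the question using only the numbered snippets below. "
        "Cite snippet numbers like [1]. "
        "If the snippets do not contain the answer, say you don't know.\n\n"
        f"Snippets:\n{context}\n\n"
        f"Question: {question}"
    )


# Assumption: hypothetical retrieved snippets, invented for illustration.
snippets = [
    ("help/returns.md", "Items can be returned within 30 days of delivery."),
    ("help/refunds.md", "Refunds are issued to the original payment method."),
]
prompt = build_prompt("How long do I have to return an item?", snippets)
print(prompt)
```

Numbering the snippets and naming their sources is what makes citations in the answer possible, and the explicit "say you don't know" instruction gives the model a way out when retrieval comes back empty-handed.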
RAG vs fine-tuning vs prompt-only
These get conflated a lot. They're three different tools:
- Prompt-only: stuff everything in the prompt. Works if your data fits in the context window (~200k tokens for Sonnet 4) and changes rarely. Pricey at scale because you pay per token every call.
- RAG: retrieve relevant snippets per query. Works when your data is big, changes often, or differs per user. Most production "AI on your data" systems are RAG.
- Fine-tuning: train the model on your data. Best for teaching style (brand voice, code style) rather than facts. Expensive, slow to iterate, and won't pick up tomorrow's documents.
Most production systems use RAG for facts + light prompt-tuning for style. Fine-tuning is rarely the right first move.
When RAG is the right pattern
- Customer support automation — the AI needs to know your help articles, product changes, customer history. Case in point.

- Internal Q&A over policies, SOPs, runbooks.
- Legal / compliance research across your own document corpus.
- Sales enablement — AI that can pull case studies, pricing, ICP fit from your library.
- Personalised recommendations using your customer/product data as context.
When RAG is the wrong pattern
- The task doesn't need your data (general writing, summarisation of this conversation, translation). Skip RAG.
- You need the AI to do things, not know things. That's agentic tool use, not RAG.
- Your "documents" are tiny and static enough to just paste in the prompt. RAG adds latency and complexity for no benefit.
Five common RAG pitfalls (and what we do about them)
1. Bad chunking
Documents get split into chunks (usually 200–800 tokens). Cut them wrong and you split mid-sentence, mid-table, mid-thought. Retrieval recall plummets. Use overlap (10–20% of chunk size) and respect natural boundaries (headings, list items).
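A minimal sketch of heading-aware chunking with overlap might look like this. It assumes Markdown input where headings start with "#", and it approximates tokens with whitespace-separated words. A real pipeline would count tokens with the embedding model's own tokenizer.

```python
def split_sections(markdown_text):
    """Split a Markdown document into sections, each starting at a heading line."""
    sections, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def chunk_words(text, chunk_size=500, overlap=50):
    """Slide a window of chunk_size words, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


def chunk_document(markdown_text, chunk_size=500, overlap=50):
    """Chunk each section separately so no chunk straddles a heading."""
    chunks = []
    for section in split_sections(markdown_text):
        chunks.extend(chunk_words(section, chunk_size, overlap))
    return chunks


doc = "# Returns\nItems can be returned within 30 days.\n# Shipping\nOrders ship in 2 days."
for chunk in chunk_document(doc, chunk_size=6, overlap=1):
    print(repr(chunk))
```

Chunking per section is the simplest way to respect natural boundaries: overlap only happens inside a section, so a chunk never mixes the tail of one topic with the start of the next.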
2. Retrieving too much / too little
Retrieve 1 chunk and you miss context. Retrieve 50 and you bury the model in noise. We tune K (number of retrieved chunks) per workflow — usually 5–15.
3. No re-ranking
Vector similarity is good at "topic-relevant", not always at "answer-relevant". A re-ranker (Cohere, Voyage, or a small Claude pass) reorders the top-K and dramatically improves quality. Often the single biggest quality jump for the least work.
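The retrieve-then-re-rank shape can be sketched as below. The `overlap_score` function is only a placeholder written for this example so the code runs on its own. In practice it would be replaced by a call to a hosted re-ranker or a cross-encoder model, and word overlap is not a substitute for either.

```python
def rerank(question, candidates, score_fn, top_n=5):
    """Reorder retrieved chunks by score_fn(question, chunk), best first."""
    ranked = sorted(
        candidates,
        key=lambda chunk: score_fn(question, chunk),
        reverse=True,
    )
    return ranked[:top_n]


def overlap_score(question, chunk):
    """Placeholder scorer (assumption): fraction of question words found in the chunk."""
    question_words = set(question.lower().split())
    chunk_words = set(chunk.lower().split())
    if not question_words:
        return 0.0
    return len(question_words & chunk_words) / len(question_words)


# Candidates in the order a vector search might return them (made-up data).
candidates = [
    "Our password policy requires 12 characters",
    "To reset my password i open settings and choose reset",
    "Shipping takes 3 days",
]
best = rerank("how do i reset my password", candidates, overlap_score, top_n=2)
for chunk in best:
    print(chunk)
```

The point of the pattern is that the first stage is cheap and broad, while the second stage is slower but only has to look at a handful of candidates.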
4. No evals
"Did the answer use the right document?" is a checkable question. Build a test set of question-answer-source triples and measure retrieval accuracy independently from generation accuracy. Without this you're guessing at quality.
5. Static index, stale answers
Your help articles update; your RAG index doesn't notice. Build a sync pipeline from day one. Most teams skip this and discover three months later that half their answers cite outdated content.
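One simple way to notice changes is to store a content hash next to each indexed document and compare on every sync run. The sketch below plans which documents to re-embed and which to delete. The function names and document IDs are hypothetical, and a real pipeline would also need to handle chunk-level updates and failures.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(current_docs, indexed_hashes):
    """Compare live documents against the hashes stored at index time.

    Returns (to_upsert, to_delete): IDs that are new or changed, and IDs
    that are still in the index but no longer exist in the source.
    """
    to_upsert = [
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return sorted(to_upsert), sorted(to_delete)


# Assumption: made-up documents and index state for illustration.
indexed_hashes = {
    "returns.md": content_hash("Returns accepted within 14 days."),
    "shipping.md": content_hash("Orders ship in 2 days."),
    "old-promo.md": content_hash("Summer sale ends Friday."),
}
current_docs = {
    "returns.md": "Returns accepted within 30 days.",  # edited since indexing
    "shipping.md": "Orders ship in 2 days.",           # unchanged
    "pricing.md": "Plans start at the Basic tier.",    # new document
}
to_upsert, to_delete = plan_sync(current_docs, indexed_hashes)
print("re-embed:", to_upsert)
print("remove:", to_delete)
```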
The unglamorous secret of production RAG: 70% of the work is the eval suite and the data-quality pipeline, not the model. Teams that nail those two beat teams with fancier models.
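An eval suite can start very small. The sketch below measures retrieval on its own, separately from generation: for each question-answer-source triple, did the expected source appear in the top K retrieved results? The test cases and the `fake_retrieve` function are invented for illustration. In a real suite, `retrieve` would call your actual retrieval pipeline.

```python
def retrieval_hit_rate(test_set, retrieve, k=5):
    """Fraction of test questions whose expected source is in the top-k results."""
    if not test_set:
        raise ValueError("test_set must contain at least one case")
    hits = 0
    for case in test_set:
        retrieved_sources = retrieve(case["question"], k)
        if case["source"] in retrieved_sources:
            hits += 1
    return hits / len(test_set)


# Assumption: hypothetical question-answer-source triples.
test_set = [
    {"question": "How long is the return window?", "answer": "30 days", "source": "returns.md"},
    {"question": "When do orders ship?", "answer": "Within 2 days", "source": "shipping.md"},
]

# Assumption: a canned retriever standing in for the real pipeline.
canned_results = {
    "How long is the return window?": ["returns.md", "refunds.md"],
    "When do orders ship?": ["returns.md", "pricing.md"],
}


def fake_retrieve(question, k):
    return canned_results[question][:k]


rate = retrieval_hit_rate(test_set, fake_retrieve, k=2)
print(f"retrieval hit rate at K=2: {rate:.2f}")
```

Because the `answer` field is not used here, a low score points at retrieval specifically. Generation quality would be scored in a separate pass against the same triples.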
The 2026 RAG stack we use at Growvate
- Documents: Markdown / clean HTML, normalised by a preprocessing step we own
- Chunking: heading-aware, ~500 tokens with 10% overlap
- Embeddings: OpenAI text-embedding-3-large (English) or Voyage Multilingual (when needed)
- Storage: pgvector inside the client's existing Postgres (no new vendor)
- Retrieval: hybrid (vector + BM25 keyword), K=15
- Re-ranking: Cohere Rerank 3, with the top 5 forwarded to the model
- Generation: Claude Sonnet 4 by default, Haiku for high-volume / cost-sensitive flows
- Evals: Promptfoo for offline, Langfuse for production traces
The full stack lives in our default 2026 AI stack post.
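Hybrid retrieval needs some way to merge the vector ranking with the keyword ranking. The list above doesn't say which method is used, so the sketch below shows one common option, reciprocal rank fusion (RRF), as an assumption rather than a description of the stack. The document IDs are made up, and the constant 60 is the value often used with RRF.

```python
def reciprocal_rank_fusion(rankings, rrf_k=60, top_n=15):
    """Merge several ranked lists of IDs; each appearance adds 1 / (rrf_k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Sort by fused score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


# Assumption: hypothetical result lists from the two retrievers, best first.
vector_ranking = ["a", "b", "c"]
keyword_ranking = ["c", "a", "d"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

RRF only looks at rank positions, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.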
How long does it take to build a production RAG system?
For a focused use case (e.g., "AI support agent over our help center + order data"):
- Week 1: data audit, chunking pipeline, eval set creation
- Week 2: embedding + retrieval pipeline, first generation prompt
- Week 3: re-ranking, prompt iteration against evals, integration with your app
- Week 4: production hardening, observability, rollout
30 days is realistic. Most teams take longer because they skip the eval suite — which is the one thing they shouldn't.
Building something that needs to "know your data"? Book a 30-minute audit. We'll go through what you have and tell you whether RAG, fine-tuning, prompt-engineering, or just buying a SaaS is the right call for your specific use case.