What is RAG? Plain-English guide (2026)

The one-sentence answer

RAG (Retrieval-Augmented Generation) means the AI looks up your real data before answering, instead of relying on what it half-remembers from training. It's the difference between an open-book exam and a closed-book one.

Why this matters

Language models like Claude and GPT-4 are trained on huge slices of the public internet, but they don't know your company's order history, your latest pricing, your internal SOPs, or any document written after their training cutoff. Ask one about something it doesn't know and it will often skip "I don't know" and confidently make something up instead. That's hallucination.

RAG fixes this by adding a step before the model answers:

1. You ask a question.
2. The system looks through your documents/database to find the most relevant snippets.
3. It hands those snippets to the model, along with your question.
4. The model answers using those snippets as the source of truth.

Result: the model can answer about your specific data, with citations, and is much less likely to invent facts.
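
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: answer_with_rag, toy_retrieve, toy_generate and the two sample documents are all hypothetical names and data invented for this example. The retriever here only counts shared words, and the "model" is a stub so the sketch runs without any external service.

```python
from typing import Callable

def answer_with_rag(
    question: str,
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str], str],
    k: int = 5,
) -> str:
    """Step 1 is the incoming question; steps 2 to 4 happen here."""
    snippets = retrieve(question, k)                  # step 2: find relevant snippets
    context = "\n\n".join(snippets)
    prompt = (                                        # step 3: snippets + question
        f"Use only this context to answer.\n\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)                           # step 4: the model answers

# Hypothetical sample data and stand-ins (assumptions, not a real system).
DOCS = [
    "Refunds are issued within 14 days of a return being received.",
    "Standard shipping takes 3 to 5 business days.",
]

def toy_retrieve(question: str, k: int) -> list[str]:
    # Rank documents by how many words they share with the question.
    words = set(question.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def toy_generate(prompt: str) -> str:
    # A real system would call a language model here.
    return f"(model reply to a {len(prompt)}-character prompt)"

print(answer_with_rag("How long do refunds take?", toy_retrieve, toy_generate, k=1))
```

Passing retrieve and generate in as functions keeps the flow readable and makes each half easy to swap or test on its own.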

The open-book exam analogy

Imagine a brilliant student who's read a lot of books but never seen your textbook. Ask them a question about your textbook and they'll guess plausibly — and sometimes wrongly.

Now give them the textbook and let them open it before answering. Same student, dramatically better answers. That's RAG.

The four pieces of a RAG system

1. Documents

Whatever you want the AI to be able to "read" — help-center articles, product specs, internal SOPs, customer history, contracts, anything. RAG starts with you having something worth retrieving from. Garbage in, garbage out applies harder than usual here.

2. Embeddings

Each chunk of your documents gets converted into a vector — a list of numbers — that captures its meaning. Similar meanings end up at similar coordinates in vector space. "Customer can't login" and "user has trouble signing in" end up near each other even though the words barely overlap.
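
To make "similar coordinates" concrete, here is a small sketch of cosine similarity, a common way to compare two embedding vectors. The three vectors are made up by hand for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumptions, not real model output).
cant_login = [0.9, 0.1, 0.0]          # "Customer can't login"
trouble_signing_in = [0.8, 0.2, 0.1]  # "user has trouble signing in"
refund_policy = [0.0, 0.2, 0.9]       # an unrelated topic

print(cosine_similarity(cant_login, trouble_signing_in))
print(cosine_similarity(cant_login, refund_policy))
```

The first score comes out close to 1 and the second close to 0, which is the numeric version of "near each other" versus "far apart".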

Today (Q2 2026), common choices are OpenAI's text-embedding-3-large for English, or Voyage AI for multilingual / domain-specific use cases.

3. Vector database

Where the embeddings live. When a question comes in, you embed the question, then search the database for the closest matching document chunks. Common choices:

pgvector — a Postgres extension. Free, fast for up to ~5M embeddings, lives where your data already lives. Our default.
Pinecone — managed vector DB. Convenient, more expensive, right for multi-tenant SaaS at scale.
Weaviate / Qdrant / LanceDB — alternatives with their own tradeoffs.
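
Whichever database you pick, the core operation is the same: score every stored chunk against the question vector and keep the closest few. The sketch below does that by brute force in memory, which is only an illustration of the idea. Real vector databases use indexes to avoid scanning everything, and the chunk ids and vectors here are invented for the example.

```python
import heapq
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int):
    """Return the k best (score, chunk_id) pairs, highest score first."""
    scored = ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in index)
    return heapq.nlargest(k, scored)

# Hypothetical index: (chunk id, embedding) pairs with toy 3-d vectors.
index = [
    ("login-help", [0.9, 0.1, 0.0]),
    ("refunds", [0.0, 0.2, 0.9]),
    ("password-reset", [0.7, 0.3, 0.1]),
]
question_vec = [0.8, 0.2, 0.1]  # stands in for the embedded question

for score, chunk_id in top_k(question_vec, index, k=2):
    print(f"{chunk_id}: {score:.3f}")
```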

4. Generation

The retrieved snippets get inserted into the prompt sent to the model. The model then answers using those snippets as context. This is the "generation" part — the model is generating an answer, augmented by retrieval.
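
As a rough sketch of what "inserted into the prompt" can look like: the function below labels each snippet with a source id so the model can cite it, and stops adding snippets once a size budget is used up. The function name, the instruction wording, the source ids and the character budget are all assumptions made for illustration; a real system would budget in tokens and tune the wording.

```python
def build_prompt(question: str, snippets: list[tuple[str, str]], max_chars: int = 6000) -> str:
    """snippets are (source_id, text) pairs, best match first."""
    parts = []
    used = 0
    for source_id, text in snippets:
        block = f"[{source_id}]\n{text}"
        if used + len(block) > max_chars:
            break  # budget exhausted: drop the lower-ranked snippets
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the sources below. "
        "Cite the source id in square brackets after each claim. "
        "If the sources do not contain the answer, say you do not know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical snippets for illustration.
snippets = [
    ("help/refunds", "Refunds are issued within 14 days of a return being received."),
    ("help/shipping", "Standard shipping takes 3 to 5 business days."),
]
print(build_prompt("How long do refunds take?", snippets))
```

Asking the model to admit when the sources fall short is what turns retrieval into fewer invented answers, and the source ids are what make citations possible.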

RAG vs fine-tuning vs prompt-only

These get conflated a lot. They're three different tools:

Prompt-only: stuff everything in the prompt. Works if your data fits in the context window (~200k tokens for Sonnet 4) and changes rarely. Pricey at scale because you pay per token every call.
RAG: retrieve relevant snippets per query. Works when your data is big, changes often, or differs per user. Most production "AI on your data" systems are RAG.
Fine-tuning: train the model on your data. Best for teaching style (brand voice, code style) rather than facts. Expensive, slow to iterate, and won't pick up tomorrow's documents.
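
To see why prompt-only gets pricey at scale, a back-of-envelope calculation helps. Every number below is a placeholder chosen for illustration (the price per million tokens, the 150k-token document dump, the 5 chunks of 500 tokens, the monthly call volume); substitute your provider's real pricing and your own sizes.

```python
def cost_per_call(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of a single model call, in USD."""
    return input_tokens * usd_per_million_tokens / 1_000_000

# Placeholder assumptions, not quoted prices.
PRICE_PER_MILLION = 3.0          # USD per 1M input tokens (hypothetical)
PROMPT_ONLY_TOKENS = 150_000     # everything pasted into every call
RAG_TOKENS = 5 * 500             # 5 retrieved chunks of ~500 tokens each
CALLS_PER_MONTH = 100_000

prompt_only = cost_per_call(PROMPT_ONLY_TOKENS, PRICE_PER_MILLION)
rag = cost_per_call(RAG_TOKENS, PRICE_PER_MILLION)

print(f"prompt-only: ${prompt_only * CALLS_PER_MONTH:,.0f} per month")
print(f"RAG:         ${rag * CALLS_PER_MONTH:,.0f} per month")
```

With these placeholder sizes the prompt-only approach sends 60 times more input tokens per call, and that ratio carries straight through to the bill regardless of the actual price.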

Most production systems use RAG for facts + light prompt-tuning for style. Fine-tuning is rarely the right first move.

When RAG is the right pattern

Customer support automation: the AI needs to know your help articles, product changes, and customer history.
Internal Q&A over policies, SOPs, runbooks.
Legal / compliance research across your own document corpus.
Sales enablement — AI that can pull case studies, pricing, ICP fit from your library.
Personalised recommendations using your customer/product data as context.

When RAG is the wrong pattern

The task doesn't need your data (general writing, summarising the current conversation, translation). Skip RAG.
You need the AI to do things, not know things. That's agentic tool use, not RAG.
Your "documents" are tiny and static enough to just paste in the prompt. RAG adds latency and complexity for no benefit.

Five common RAG pitfalls (and what we do about them)

1. Bad chunking

Documents get split into chunks (usually 200–800 tokens). Cut them wrong and you split mid-sentence, mid-table, mid-thought. Retrieval recall plummets. Use overlap (10–20% of chunk size) and respect natural boundaries (headings, list items).
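
Here is a minimal sketch of chunking with overlap. It makes two simplifying assumptions: words stand in for tokens, and it ignores headings and list boundaries entirely, so it shows only the overlap mechanics. A real pipeline would count tokens with the model's tokenizer and split on natural boundaries first.

```python
def chunk_words(words: list[str], chunk_size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split a word list into windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Tiny example: 25 "words", windows of 10 with 2 words (20%) of overlap.
words = [str(n) for n in range(25)]
for chunk in chunk_words(words, chunk_size=10, overlap=2):
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} words)")
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing a little duplicate text.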

2. Retrieving too much / too little

Retrieve 1 chunk and you miss context. Retrieve 50 and you bury the model in noise. We tune K (number of retrieved chunks) per workflow — usually 5–15.

3. No re-ranking

Vector similarity is good at "topic-relevant", not always at "answer-relevant". A re-ranker (Cohere, Voyage, or a small Claude pass) reorders the top-K and dramatically improves quality. Often the single biggest quality jump for the least work.
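
The two-stage shape of re-ranking can be sketched like this: take the candidates that retrieval returned, score each one against the question with a second, more careful scorer, and keep only the best few. The scorer below is a deliberately crude word-overlap function standing in for a real re-ranking model; rerank, toy_score and the sample candidates are hypothetical and exist only to show the flow.

```python
import re
from typing import Callable

def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Reorder retrieved candidates by a second-stage score and keep the top few."""
    return sorted(candidates, key=lambda c: score(question, c), reverse=True)[:keep]

def toy_score(question: str, candidate: str) -> float:
    # Stand-in for a re-ranker model: fraction of question words found in the candidate.
    q_words = set(re.findall(r"\w+", question.lower()))
    c_words = set(re.findall(r"\w+", candidate.lower()))
    return len(q_words & c_words) / len(q_words)

# Hypothetical retrieval output: all on-topic-ish, not all answer the question.
candidates = [
    "Password rules: at least 12 characters.",
    "Billing runs on the first of each month.",
    "To reset my password, open Settings and choose Reset.",
]
print(rerank("How do I reset my password", candidates, toy_score, keep=2))
```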

4. No evals

"Did the answer use the right document?" is a checkable question. Build a test set of question-answer-source triples and measure retrieval accuracy independently from generation accuracy. Without this you're guessing at quality.

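A sketch of the retrieval half of such an eval: for each test case, check whether the expected source shows up in the top K retrieved ids, and report the fraction that do. The test cases, document ids and the canned retriever are invented for illustration; in practice retrieve would call your real retrieval pipeline, and the answer field would feed a separate generation eval.

```python
from typing import Callable

def retrieval_hit_rate(
    test_set: list[dict],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected source id is among the top-k results."""
    hits = 0
    for case in test_set:
        if case["source_id"] in retrieve(case["question"], k):
            hits += 1
    return hits / len(test_set)

# Hypothetical question-answer-source triples.
test_set = [
    {"question": "How do I reset my password?", "answer": "Use Settings > Reset.", "source_id": "kb/password-reset"},
    {"question": "What is the refund window?", "answer": "14 days.", "source_id": "kb/refunds"},
    {"question": "Do you ship to Canada?", "answer": "Yes.", "source_id": "kb/shipping"},
]

# Canned results standing in for a real retriever.
FAKE_RESULTS = {
    "How do I reset my password?": ["kb/password-reset", "kb/login"],
    "What is the refund window?": ["kb/shipping", "kb/refunds"],
    "Do you ship to Canada?": ["kb/returns", "kb/refunds"],
}

def fake_retrieve(question: str, k: int) -> list[str]:
    return FAKE_RESULTS[question][:k]

print(retrieval_hit_rate(test_set, fake_retrieve, k=2))
print(retrieval_hit_rate(test_set, fake_retrieve, k=1))
```

Tracking this number separately from answer quality tells you whether a bad answer came from bad retrieval or from bad generation.
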
5. Static index, stale answers

Your help articles update; your RAG index doesn't notice. Build a sync pipeline from day one. Most teams skip this and discover three months later that half their answers cite outdated content.
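
One simple way to notice changes, sketched below, is to store a content hash next to each indexed document and compare it against the current content on every sync run. This is an assumption about how a sync pipeline could work, not a description of any particular product; plan_sync and the sample document ids are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(current_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Compare live documents with what the index last saw.

    Returns (ids to re-chunk and re-embed, ids to delete from the index).
    """
    to_upsert = [
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_upsert, to_delete

# Hypothetical state: one article edited, one unchanged, one new, one removed.
current_docs = {"refunds": "Refunds take 10 days.", "shipping": "Ships in 3 days.", "returns": "Free returns."}
indexed_hashes = {
    "refunds": content_hash("Refunds take 14 days."),
    "shipping": content_hash("Ships in 3 days."),
    "warranty": content_hash("One year warranty."),
}
print(plan_sync(current_docs, indexed_hashes))
```

Only changed documents get re-embedded, which keeps the sync cheap enough to run often.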

The unglamorous secret of production RAG: 70% of the work is the eval suite and the data-quality pipeline, not the model. Teams that nail those two beat teams with fancier models.

The 2026 RAG stack we use at Growvate

Documents: Markdown / clean HTML, normalised by a preprocessing step we own
Chunking: heading-aware, ~500 tokens with 10% overlap
Embeddings: OpenAI text-embedding-3-large (English) or Voyage Multilingual (when needed)
Storage: pgvector inside the client's existing Postgres (no new vendor)
Retrieval: hybrid (vector + BM25 keyword), K=15
Re-ranking: Cohere Rerank 3, top 5 forwarded to the model
Generation: Claude Sonnet 4 by default, Haiku for high-volume / cost-sensitive flows
Evals: Promptfoo for offline, Langfuse for production traces
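
The list above does not say how the vector and BM25 results are merged in the hybrid retrieval step. One widely used option is reciprocal rank fusion, sketched here as an illustration rather than as the method used in this stack. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k=60 is a conventional default and the document ids are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from the two retrievers, best match first.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Documents that both retrievers agree on rise to the top, and because only ranks are used there is no need to put vector scores and BM25 scores on the same scale.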

The full stack lives in our default 2026 AI stack post.

How long does it take to build a production RAG system?

For a focused use case (e.g., "AI support agent over our help center + order data"):

Week 1: data audit, chunking pipeline, eval set creation
Week 2: embedding + retrieval pipeline, first generation prompt
Week 3: re-ranking, prompt iteration against evals, integration with your app
Week 4: production hardening, observability, rollout

30 days is realistic. Most teams take longer because they skip the eval suite — which is the one thing they shouldn't.

Building something that needs to "know your data"? Book a 30-minute audit. We'll go through what you have and tell you whether RAG, fine-tuning, prompt-engineering, or just buying a SaaS is the right call for your specific use case.
