RAG (Retrieval-Augmented Generation) is a simple trick. Before the AI answers, it first looks up the most relevant snippets from your documents and slips them into the prompt. The answer is grounded in your data, not just what the model picked up in general training.
The one analogy that makes it click: it's an open-book exam. The model is the smart student, but it has never read your employee handbook. RAG is the moment you hand it the exact right page, open to the right paragraph, just before it answers.
A few terms you'll meet, each in one line: a large language model (LLM) is the AI that writes the answer; training is the slow, expensive process that taught it general language; the prompt is the text you send it right now. RAG never touches training. It just makes the prompt smarter. Now let's bust the biggest myth about it.
Here is the most common misunderstanding. It scares people off, because retraining sounds slow, expensive, and risky, and it usually isn't what you need at all.
"To make AI use my documents, I have to fine-tune or retrain the model on them, and once I do, it can't make things up anymore."
You don't retrain anything. With RAG, your documents stay exactly where they are. At the moment of the question, the system retrieves the right snippets and pastes them into the prompt. Change a price in your handbook this morning, and the AI uses the new price this afternoon: no model update, no waiting, no retraining bill.
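The "pastes them into the prompt" step can be sketched in a few lines. This is a minimal illustration, not how any particular product does it: the function name `build_prompt`, the instruction wording, and the call-out fee are all made up for the example.

```python
def build_prompt(question, snippets):
    """Paste retrieved snippets above the question, numbered so the answer can cite them."""
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(snippets, start=1))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )

# Hypothetical handbook entry (invented for this sketch).
handbook = {"callout_fee": "The call-out fee is $80."}
morning = build_prompt("What is the call-out fee?", [handbook["callout_fee"]])

# Someone edits the handbook. Nothing about the model changes.
handbook["callout_fee"] = "The call-out fee is $95."
afternoon = build_prompt("What is the call-out fee?", [handbook["callout_fee"]])

print(afternoon)
```

The model is the same in the morning and the afternoon. Only the text handed to it differs, which is why a handbook edit shows up without any retraining.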
And RAG does not make hallucination impossible. "Hallucination" is when an AI states something false with total confidence. RAG reduces it by handing the model real, relevant text to lean on. But the model can still misread a snippet, blend two together, or answer anyway when the right passage wasn't retrieved. RAG grounds answers; it doesn't make them flawless. We'll come back to that honestly.
RAG isn't magic. It's a short assembly line. Three stages happen when you set it up, and again for any document that changes (chunk, embed, store), and two happen every time someone asks (retrieve, generate).
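Chunking, the first stage, is the easiest to picture in code. Here is one simple way to do it, assuming paragraphs are separated by blank lines and that a 50-word cap per passage is acceptable. Both are assumptions for the sketch; real systems pick chunk sizes to suit their documents.

```python
def chunk(text, max_words=50):
    """Split text into passages of at most max_words words, one paragraph at a time."""
    passages = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        # Long paragraphs are cut into consecutive windows of max_words words.
        for start in range(0, len(words), max_words):
            passages.append(" ".join(words[start:start + max_words]))
    return passages

# Hypothetical two-paragraph handbook (invented for this sketch).
handbook = (
    "Refunds are available within 30 days of purchase.\n\n"
    "Our office is open Monday to Friday, 9am to 5pm."
)

passages = chunk(handbook)
for passage in passages:
    print(passage)
```

Each passage that comes out is what gets embedded and stored, and later retrieved as a snippet.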
Two terms unpacked, plainly:
An embedding is a passage's meaning turned into a list of numbers. The clever part: text that means similar things lands on nearby numbers, so "money back" sits close to "refund" even though they share no words. That's how retrieval matches by meaning, not just keywords.
A vector database is just a search index built for those number-lists. Ask it "what's nearest to this question?" and it returns the closest passages, fast, even across thousands of documents.
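To make "nearby numbers" concrete, here is a toy version of both ideas. The three-number embeddings below are invented by hand so the example runs offline; a real embedding model produces lists of hundreds or thousands of numbers. The closeness measure used here is cosine similarity, one common choice, and the `store` dictionary stands in for a vector database.

```python
import math

def cosine_similarity(a, b):
    """Closeness of two number-lists: near 1.0 means same direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings (assumption): passage text mapped to its number-list.
store = {
    "Refunds are issued within 30 days.": [0.9, 0.1, 0.0],
    "We are open 9am to 5pm.": [0.0, 0.2, 0.9],
    "Parking is behind the building.": [0.1, 0.9, 0.2],
}

def nearest(query_vector, store, top_k=1):
    """Return the top_k passages whose embeddings are closest to the query."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:top_k]

# Pretend embedding of "Can I get my money back?" (also made up).
money_back = [0.8, 0.2, 0.1]
print(nearest(money_back, store))
```

The question shares no words with the refund passage, yet that passage comes back first because its numbers are closest. This sketch compares the question against every passage; vector databases typically use an index so they can skip most of those comparisons.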
💡 The cost of pasting text into a prompt is measured in tokens, the AI's unit of text. More retrieved text = more tokens = more cost per question. That's why RAG retrieves only the best few snippets instead of your whole library. See /learn/tokens.
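A quick back-of-the-envelope calculation shows the difference. Everything specific here is an assumption: the "about 4 characters per token" figure is only a rough rule of thumb for English and varies by model, the price is a placeholder rather than any vendor's real rate, and the library is pretend data.

```python
def estimate_tokens(text):
    """Very rough estimate for English text: about 4 characters per token."""
    return max(1, len(text) // 4)

def prompt_cost(texts, price_per_1k_tokens):
    """Return (total tokens, cost) for pasting these texts into one prompt."""
    tokens = sum(estimate_tokens(t) for t in texts)
    return tokens, tokens * price_per_1k_tokens / 1000

PRICE = 0.001  # hypothetical price per 1,000 input tokens, not a real rate

# Pretend library: 500 passages of 2,000 characters each.
library = ["x" * 2000] * 500
top_snippets = library[:3]

snippet_tokens, snippet_cost = prompt_cost(top_snippets, PRICE)
library_tokens, library_cost = prompt_cost(library, PRICE)

print(snippet_tokens)  # 1500 tokens for three snippets
print(library_tokens)  # 250000 tokens for the whole library
print(library_cost / snippet_cost)  # how many times more each question costs
```

Under these made-up numbers, pasting the whole library costs well over a hundred times more per question than pasting three snippets, and that gap repeats on every question asked.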
Below is a tiny pretend company handbook of 7 snippets. Pick a question (or type your own). The demo scores every snippet, lights up the closest 2 to 3, and writes an answer that points at exactly those sources. The retrieval step is the whole aha: watch which snippets light up.
Honest about the fake: a real system scores by meaning using embeddings. To keep this 100% offline, this demo fakes the scoring with simple word-overlap (it counts shared and related words), so you can see the shape of retrieval without any backend. The aha is the same: look up first, then answer from what you found.
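For the curious, word-overlap scoring of this kind can be sketched as follows. This is not the demo's actual code: the snippets, the related-word table, and the function names are all invented here to show the general shape.

```python
import re

# Hypothetical handbook snippets (invented for this sketch).
SNIPPETS = [
    "Refunds are issued within 30 days with a receipt.",
    "The office is open Monday to Friday, 9am to 5pm.",
    "Staff parking is behind the building.",
    "Holiday requests need two weeks notice.",
]

# Hand-written related words (assumption): each key is treated as its value.
RELATED = {
    "money": "refunds",
    "back": "refunds",
    "refund": "refunds",
    "hours": "open",
    "vacation": "holiday",
}

def words(text):
    """Lowercase set of words, with related words mapped onto one shared term."""
    found = re.findall(r"[a-z0-9]+", text.lower())
    return {RELATED.get(w, w) for w in found}

def retrieve(question, snippets, top_k=2):
    """Score each snippet by words shared with the question; keep the best matches."""
    q = words(question)
    scored = [(len(q & words(s)), s) for s in snippets]
    scored = [pair for pair in scored if pair[0] > 0]  # drop snippets with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_k]]

print(retrieve("Can I get my money back?", SNIPPETS))
```

Note the weakness: common words like "the" and "are" count as matches too, so a question such as "What are the office hours?" also pulls in the refund snippet as a weak second result. Scoring by meaning with embeddings is what real systems use to avoid that.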
You know it's an open-book exam, not retraining. You can name the pipeline: chunk, embed, store, retrieve, generate. And you watched a retrieval choose its sources. Two honest notes before you go.
1 · It reduces hallucination; it doesn't end it. RAG's answers are only as good as what it retrieved. Bad chunking, a thin document, or a near-miss match can still produce a confident wrong answer, and the model can misread a perfectly good snippet. So for anything that matters: show the sources and keep a human in the loop. More on that in /learn/ai-safety.
2 · RAG vs. the alternatives. Pick by what you're trying to change:
RAG: give the model knowledge at answer-time.
Fine-tuning: change the model's behavior, not its facts.
No retrieval: just paste it all in the prompt.
RAG doing real work: an AI receptionist that answers customer questions from your actual handbook, price list, and past jobs (and cites where each answer came from) is RAG, end to end. That's exactly the kind of thing we build for small businesses, safely.
Day 8 of 30 free, plain-English AI lessons for small business.