What is RAG? A Beginner's Guide to Retrieval-Augmented Generation
Not long ago, every large language model (LLM) had what we called a knowledge cutoff. It only knew information up to a certain date; ask about anything that happened after that, and it simply couldn't help you.
Today, that's changed, at least for cloud-based LLMs like ChatGPT. These models can now access real-time or recent information. So if you ask something like "Who won the 2025 Australian election?", ChatGPT can give you an up-to-date answer: the Australian Labor Party, led by Anthony Albanese, won the election.
But here's the catch: if you download and run an LLM like LLaMA locally on your own machine (using platforms like Hugging Face or Ollama) and ask the same question, the model won't know. Why? Because that model hasn't been updated with events that happened after its training date, and it doesn't have access to the web to look things up.
Real Example
To make this clearer, I asked the same question, "Who won the 2025 Australian election?", to both:
- ChatGPT
- LLaMA 4 (running via Groq)
Results:
- ChatGPT: Correctly answered that the Australian Labor Party won.
- LLaMA 4: Didn’t know — because it didn’t have that recent info.
But don't be mistaken: this doesn't mean LLaMA is "worse." It's just operating with different capabilities. ChatGPT, being cloud-based, can access live or frequently updated data sources in the background. LLaMA served as a bare model, whether locally or through an inference API like Groq, can't.
In another use case, consider this: If you ask something like “Who won the best employee award last week at Company X?” — and that information is only posted on the company’s internal notice board, not on any public website — even ChatGPT won’t be able to answer. That’s because the data isn't publicly accessible, and no LLM (cloud or local) can retrieve private or restricted information unless explicitly provided.
This brings us to a powerful concept in modern AI — RAG: Retrieval-Augmented Generation.
What is RAG?
If you're new to LLMs, here's a useful framing: think of ChatGPT as a chatbot, a polished user interface that uses GPT models behind the scenes. Similarly, you have open-source models like LLaMA, Mistral, or Qwen, and at their core they all work the same way: given a prompt or context, they generate the most likely next sequence of text.
But all these models have two big limitations:
- They don’t know anything that happened after their training data ended.
- They might "hallucinate" — that is, make up facts when they don’t know the answer. (Pro tip: Never rely solely on LLMs for research. They may cite fake papers or DOIs.)
The Fix: Enter RAG
Retrieval-Augmented Generation (RAG) addresses both issues. It combines two powerful tools:
- A retriever: finds relevant data from a source (e.g., PDFs, databases, websites).
- A generator: an LLM that uses this retrieved data to form a well-informed response.
A Simple Analogy: Think of RAG like a smart student. When asked a question, they don't just guess; they first go to the library, find the relevant books, read the sections they need, and only then give you an answer.
How RAG Works — Step-by-Step
Let’s walk through a real example:
Question: "Who won the 2025 Australian election?"
Embedding the Query
First, the user’s question is converted into a vector — a list of numbers that represent the meaning of the sentence. This process is called embedding. If you’re curious about embeddings, I’ve explained them in detail in a previous post.
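To make this concrete, here's a minimal sketch of the embedding step using the sentence-transformers library. The model name below is just a popular example, not a requirement; any embedding model works the same way.

```python
# Minimal query-embedding sketch using sentence-transformers.
# "all-MiniLM-L6-v2" is just a common example model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Who won the 2025 Australian election?"
query_vector = model.encode(query)

print(query_vector.shape)  # (384,) -- this model outputs 384-dimensional vectors
```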
Searching the Vector Database
Next, the system compares that query vector to a vector database. This database contains pre-embedded content from various documents (e.g., books, PDFs, websites).
Some popular vector databases:
- FAISS
- ChromaDB
- Pinecone
Let’s say you have a book or report — you split it into small chunks, embed those, and store them in the vector database. Now, when a query comes in, the system retrieves the most similar chunks using something like cosine similarity.
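Here's a small sketch of that store-and-search flow using FAISS, reusing the `model` from the embedding sketch above. The chunks are shortened stand-ins for text you'd split out of a real document.

```python
# Store embedded chunks in a FAISS index and retrieve the most similar ones.
import faiss

chunks = [
    "The Australian Labor Party is the major centre-left political party in Australia.",
    "Anthony Albanese and the Australian Labor Party won the 2025 federal election.",
    "The Liberal Party of Australia is the major centre-right political party in Australia.",
]

# Embed and L2-normalise the chunks so inner product equals cosine similarity.
chunk_vectors = model.encode(chunks).astype("float32")
faiss.normalize_L2(chunk_vectors)

index = faiss.IndexFlatIP(chunk_vectors.shape[1])  # flat inner-product index
index.add(chunk_vectors)

# Embed the query the same way, then fetch the top-2 most similar chunks.
query_vectors = model.encode(["Who won the 2025 Australian election?"]).astype("float32")
faiss.normalize_L2(query_vectors)

scores, ids = index.search(query_vectors, 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i]}")
```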
Example of retrieved content:
- The Australian Labor Party is the major centre-left political party in Australia and one of two major parties in Australian politics.
- Anthony Albanese and the Australian Labor Party won the 2025 federal election in a historic landslide, securing a second consecutive term in office.
- The Liberal Party of Australia is the major centre-right political party in Australia.
Reranking for Relevance (Optional)
Sometimes the initial search might return content that’s similar, but not useful. So some RAG systems use a reranker to reorder results based on relevance, not just semantic similarity.
For example, from the top 10 matches, the reranker might pick the 3 most contextually useful results.
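One common way to do this is with a cross-encoder, which scores each (query, chunk) pair directly rather than comparing precomputed vectors. A minimal sketch, again with sentence-transformers (the reranker model name is just a well-known example):

```python
# Rerank retrieved chunks with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Who won the 2025 Australian election?"
candidates = [
    "The Australian Labor Party is the major centre-left political party in Australia.",
    "Anthony Albanese and the Australian Labor Party won the 2025 federal election.",
    "The Liberal Party of Australia is the major centre-right political party in Australia.",
]

# Score every (query, chunk) pair, then keep the highest-scoring chunks.
scores = reranker.predict([(query, c) for c in candidates])
ranked = sorted(zip(scores, candidates), reverse=True)
top_chunks = [c for _, c in ranked[:3]]
```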
Constructing the Prompt
Now we build the prompt to send to the LLM. This includes:
- The original user query
- The retrieved content
Example prompt:
You are an expert assistant. Use the context below to answer the user's question.
Context:
1. The Australian Labor Party is the major centre-left political party in Australia and one of two major parties in Australian politics.
2. Anthony Albanese and the Australian Labor Party won the 2025 federal election in a historic landslide, securing a second consecutive term in office.
3. The Liberal Party of Australia is the major centre-right political party in Australia.
Question:
Who won the 2025 Australian election?
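In code, building this prompt is just string assembly. A small hypothetical helper might look like this:

```python
# Hypothetical helper that assembles retrieved chunks and the user's
# question into the prompt shown above.
def build_prompt(chunks: list[str], question: str) -> str:
    context = "\n".join(f"{i}. {c}" for i, c in enumerate(chunks, start=1))
    return (
        "You are an expert assistant. Use the context below to answer "
        "the user's question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )

prompt = build_prompt(top_chunks, "Who won the 2025 Australian election?")
```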
The LLM Generates the Answer
The LLM reads the context and the question, and generates an answer:
"The Australian Labor Party, led by Anthony Albanese, won the 2025 federal election in a historic landslide, securing a second consecutive term in office."
Because the model was provided with up-to-date, trusted context, it is far less likely to hallucinate and can return a correct, grounded answer.
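To close the loop, here's one way to send that prompt to a locally running model using the ollama Python package. This assumes Ollama is installed and running and that the model has already been pulled; any chat-capable LLM client would work just as well.

```python
# Send the assembled prompt to a local model via Ollama.
import ollama

response = ollama.chat(
    model="llama3",  # any locally pulled model name works here
    messages=[{"role": "user", "content": prompt}],
)
print(response["message"]["content"])
```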
So, summing up: it's called Retrieval-Augmented Generation because the language model's generation is augmented with additional information retrieved from a vector search, producing a more accurate and relevant answer.
Recap: Full RAG Workflow
Here's the full pipeline (a combined code sketch follows the list):
- Embed the Query
- Search Vector DB for relevant chunks
- (Optional) Rerank results
- Build a Prompt (context + question)
- LLM Generates the answer using the prompt
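Stitched together, and reusing the objects defined in the sketches above (`model`, `index`, `chunks`, `reranker`, `build_prompt`), the whole pipeline fits in one small function:

```python
# End-to-end RAG sketch, reusing `model`, `index`, `chunks`,
# `reranker`, and `build_prompt` from the earlier snippets.
import faiss
import ollama

def rag_answer(question: str, k: int = 10, top_n: int = 3) -> str:
    # 1. Embed the query.
    q = model.encode([question]).astype("float32")
    faiss.normalize_L2(q)
    # 2. Search the vector DB for the k most similar chunks.
    _, ids = index.search(q, min(k, len(chunks)))
    hits = [chunks[i] for i in ids[0]]
    # 3. (Optional) Rerank and keep the top_n chunks.
    scores = reranker.predict([(question, h) for h in hits])
    top = [h for _, h in sorted(zip(scores, hits), reverse=True)[:top_n]]
    # 4. Build the prompt (context + question).
    prompt = build_prompt(top, question)
    # 5. Generate the answer.
    reply = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

print(rag_answer("Who won the 2025 Australian election?"))
```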
What’s Next?
If you want to build your own RAG pipeline, Langformers provides all the basics you need.
RAG is a game-changer for anyone working with private, custom, or constantly evolving information.
That's it for this post. See you in the next one.