
Why Your RAG System Is Lying to You

  • Ashish Arora
  • Jul 9
  • 7 min read

Updated: Jul 21

The company's AI assistant, powered by their "state-of-the-art RAG system," had confidently told a potential client:

"According to our 2023 financial report (page 47), our enterprise revenue grew 340% year-over-year."

Impressive, right? Except their actual growth was 34%. No decimal point error in the source document - the AI had simply decided to add a zero.


Here's what happened: The retrieved document contained multiple percentages scattered across the page - "34% revenue growth," "340 basis points improvement in margins," and "3.4x increase in enterprise customers." The model saw these numbers clustered together and performed what researchers call "semantic smoothing" - it unconsciously blended similar-looking numbers into what seemed like a coherent fact.


The citation was real. Page 47 did exist and did discuss growth metrics. The model knew exactly where to point for credibility, even while fabricating the actual number. It's the AI equivalent of a student correctly citing Shakespeare while completely misquoting him.


If you've been in the AI space lately, you've probably heard claims similar to:

"But we use RAG! It can't hallucinate, it only uses our documents!"

RAG has become the go-to solution for the hallucination problem. The promise is simple: ground your AI in real documents, and it can't make things up. As you'd expect by now, this is a dangerous oversimplification. Let's dive into why.


LLMs with RAG

What Is RAG, Really?

Imagine you're writing an essay, but instead of relying on your memory, you can instantly search through a library and pull out relevant books. That's essentially what RAG does for AI - it gives language models access to external documents when answering questions.

It sounds foolproof. Like turning a closed-book exam into an open-book one.


Surely students can't get facts wrong if the textbook is right there? Anyone who's graded open-book exams knows better :P


RAG: Under the Hood

RAG systems have two main components working in tandem:

  • Retriever: A specialized semantic search engine that understands meaning, not just keywords

  • Generator: The language model that crafts responses from retrieved information


But here's where it gets interesting. These components perform an intricate four-stage dance, and each step is an opportunity for things to go beautifully right (or spectacularly wrong):


  • 1. Query Transformation: Your innocent question "What's your refund policy?" doesn't stay as text. It gets transformed into an embedding - a mathematical representation in high-dimensional space. Think of it as translating your question into a language that computers understand.

  • 2. The Hunt: Your document collection has been pre-processed into chunks, each with its own mathematical fingerprint. When your query embedding arrives, the system plays a massive game of "find the closest match" across millions of vectors. It's like looking for your keys in the dark with a flashlight that only illuminates things that feel similar to keys.

  • 3. Context Assembly: The system grabs 3-10 chunks and stitches them together. But these chunks might be from different documents, different time periods, or even different data sources.

  • 4. Generation: The language model doesn't just copy from the retrieved text - it interprets, synthesizes, and creates. It's processing everything through dozens of transformer layers, weighing retrieved content against its trained knowledge, deciding word by word what sounds right.


The key insight? These aren't independent systems mechanically passing information. They're engaged in a complex negotiation where each stage influences the others, creating emergent behaviours that can't be fully designed or anticipated in advance - any more than your users' queries can be.
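To make those four stages concrete, here's a minimal sketch of the whole loop in Python. It's purely illustrative: the embed function is a toy character-trigram hash standing in for a real trained encoder, the chunks are invented, and stage 4 is just the prompt you would hand to the language model.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: hash character trigrams into a fixed-size vector.
    A real system would use a trained sentence encoder."""
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Stage 2 prep: pre-chunked documents, each with its own mathematical fingerprint
chunks = [
    "Refunds are processed within 5-7 business days.",
    "Enterprise revenue grew 34% year-over-year.",
    "Margins improved by 340 basis points.",
]
chunk_vecs = np.stack([embed(c) for c in chunks])

def retrieve(query: str, k: int = 2) -> list:
    q = embed(query)                      # Stage 1: query -> embedding
    scores = chunk_vecs @ q               # Stage 2: cosine similarity (vectors are unit-norm)
    top = np.argsort(scores)[::-1][:k]    # the "find the closest match" hunt
    return [chunks[i] for i in top]

# Stage 3: context assembly; Stage 4: the generator takes it from here
question = "What's your refund policy?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

Notice that nothing in this pipeline forces the generator to copy the context verbatim - stage 4 is still free-form text generation, which is exactly where the trouble starts.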


Where RAG Shines (And Where It Falls Apart)

When everything aligns, RAG can feel like magic. I've seen customer support systems jump from 60% to 95% accuracy overnight. Ask about "refund policy" when you have a document titled "Refund Policy"? Perfect semantic match. Clear retrieval. Unambiguous context. Beautiful answer.


This is RAG at its best - accurate, grounded, and verifiable. It's these success stories that sell the technology, but they also create the notion that RAG magically "solves" hallucinations.


But each stage of the pipeline hides its own flavour of failure:

  • Query Understanding: The Dense Passage Retrieval paper (Karpukhin et al., 2020) says that even the best query encoders only capture about 70% of semantic meaning. Ask about "getting my money back" and the system might miss documents about "refund procedures" because the semantic gap could be too large.

  • Retrieval Roulette: The BEIR Benchmark Study (Thakur et al., 2021) tested 18 retrieval systems across diverse domains. The sobering result? The best systems still retrieved irrelevant documents 20-35% of the time (see the sketch after this list).

  • The Invisible Middle: Liu et al. (2023) discovered something they called the "Lost in the Middle" phenomenon. Give an AI model 10 pieces of information, and it will pay strong attention to the first and last ones while missing 70-80% of what's in the middle. It's like reading only the first and last paragraphs of a mystery novel and trying to solve the crime.

  • Generation Jazz: Even with perfect retrieval, the generation process adds its own creativity. I've seen models take "Processing takes 5-7 business days" and jazz it up to "Processing typically takes 5-7 business days, though during peak seasons like holidays it may extend to 10-14 days." Helpful? Maybe. True? Absolutely not.
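Retrieval noise is the easiest of these to measure on your own data. The sketch below is my own toy illustration (not the BEIR evaluation harness): label which chunks genuinely answer a query, then check what fraction of the retriever's top-k picks are actually relevant.

```python
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunk IDs that a human judged relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / max(len(top_k), 1)

# Hypothetical run: chunk IDs returned by the retriever for one query,
# plus a human-labelled set of chunks that actually answer it.
retrieved_ids = ["refund_policy_v2", "shipping_faq", "refund_policy_v1",
                 "press_release_2021", "returns_form"]
relevant_ids = {"refund_policy_v1", "refund_policy_v2", "returns_form"}

print(f"precision@5 = {precision_at_k(retrieved_ids, relevant_ids):.2f}")
# precision@5 = 0.60 -> 40% of the assembled context is noise
```

Run it across a representative set of labelled queries and you'll know whether your own system sits inside that 20-35% noise band.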


Common Misconceptions About RAG

Citation = Truth

One of the most common misconceptions is that citations equal truth. We're academically trained to respect footnotes, and RAG systems exploit that trust brilliantly. Citations and Trust in LLM Generated Answers shows that citations boost user trust, sometimes to a fault. RAG outputs can thus appear convincing while still being incorrect.


RAG can only use what's in the documents

OR

Better Search = Better Answers

The LLM in a RAG system will do more than just quote documents. As explained in the paper RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models, it will summarize, rephrase, and sometimes inadvertently fabricate or alter information. This means RAG-based LLM responses can still hallucinate (though less often than vanilla LLMs). The key takeaway is to treat generation as its own critical component - prompt it carefully, test it rigorously, and consider adding verification layers, which is where faithfulness checks are gaining a lot of traction.
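To give a flavour of what a verification layer can look like, here's a deliberately crude faithfulness check: flag any number in the generated answer that never appears in the retrieved context. Production checkers use NLI models or LLM judges, but the principle - compare the claim against the evidence - is the same. The strings below are invented.

```python
import re

def unsupported_numbers(answer: str, context: str) -> list:
    """Return numbers that appear in the answer but nowhere in the retrieved context.
    A crude proxy for faithfulness; real systems use NLI models or LLM judges."""
    number_pattern = r"\d+(?:\.\d+)?%?"
    answer_nums = set(re.findall(number_pattern, answer))
    context_nums = set(re.findall(number_pattern, context))
    return sorted(answer_nums - context_nums)

context = ("The 2023 report: enterprise revenue grew 34% year-over-year; "
           "margins improved by 340 basis points.")
answer = "According to the 2023 report, enterprise revenue grew 340% year-over-year."

print(unsupported_numbers(answer, context))  # ['340%'] -> the fabricated figure from the intro
```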


More Context = Better Answers

Quality of context beats quantity of context. There's ample evidence that giving an LLM a small set of the most relevant information leads to better answers than giving it a dump of everything remotely related. In practice, this means tuning your retriever to be precise, limiting the number of documents you stuff into the prompt, and using strategies to handle truly large contexts (like splitting the query or iterative retrieval). If the AI doesn't "know the answer" after a certain amount of information, throwing more data at it might not help and could hurt. Instead, ask more specific questions or break the problem down.
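In code, "tune your retriever to be precise and don't stuff everything into the prompt" can be as simple as a hard cap plus a similarity floor. The numbers below are placeholders - tune both against your own evaluation set rather than copying them.

```python
import numpy as np

def select_context(scores: np.ndarray, chunks: list, k: int = 3, min_score: float = 0.35) -> list:
    """Keep at most k chunks, and only those above a similarity floor."""
    order = np.argsort(scores)[::-1]                      # best matches first
    return [chunks[i] for i in order[:k] if scores[i] >= min_score]

scores = np.array([0.82, 0.44, 0.31, 0.29, 0.12])         # retriever similarity scores
chunks = ["Refund Policy", "Returns FAQ", "Shipping FAQ", "Press release", "Careers page"]

print(select_context(scores, chunks))
# ['Refund Policy', 'Returns FAQ'] -> a small, precise context beats a big, noisy one
```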



Why RAG Hallucinations Are More Dangerous

Traditional LLM hallucinations are often laughably wrong - such as claiming Python was invented in 1823 or that the Earth has three moons. They're the Nigerian prince emails of the AI world, so obviously false they serve as their own warning.


RAG hallucinations are different. They come dressed in the clothing of credibility:

  • Specific page numbers

  • Real document titles

  • Properly formatted citations

  • Confident technical language


They are not just false - they come with receipts.


Research: When Training Fights Retrieval, What Happens to the Weights and Biases?

What happens when RAG retrieves information that contradicts what the model learned during training? This isn't a rare edge case - it's a fundamental conflict that happens constantly.


Your LLM operates with two distinct memory systems:

  • Parametric Memory (The Permanent Resident): Everything baked into the model's weights during training. When the model learned that "coffee is a bean," it didn't just memorize this fact - it encoded it into the statistical relationships between billions of parameters. This knowledge is literally part of the model's architecture.

  • Non-Parametric Memory (The House Guest): The RAG-supplied context - external documents presented in the prompt. It's temporary, contextual, and competing for influence with everything the model "knows" from training.


These aren't equal partners. One is carved in silicon; the other is written in sand.


The Attention Battlefield: Split Personalities

When your model processes "According to recent research, coffee is botanically classified as a seed, not a bean," something remarkable happens inside the transformer.


The model's attention heads - specialized components that decide what to focus on - literally split their loyalty:

  • Some heads lock onto "seed" from the retrieved text

  • Others fire strongly on "coffee bean" patterns from training

  • Still others focus on what "sounds right" linguistically


The probability distributions reveal the internal struggle:


Without RAG:

  • Coffee is a "bean": 89%

  • Coffee is a "seed": 2%

With contradicting RAG:

  • Coffee is a "bean": 43% (still fighting!)

  • Coffee is a "seed": 41%


Even with explicit contradiction, training only partially yields.
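The percentages above are illustrative, but you can run a rough version of this probe yourself. The sketch below uses a small open model via Hugging Face transformers (my own setup, not from a paper; the exact numbers will differ by model) to compare the next-token probabilities for "bean" versus "seed" with and without a contradicting passage prepended.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_probs(prompt: str, candidates: list) -> dict:
    """Probability the model assigns to each candidate as the very next token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]              # logits for the next position
    probs = torch.softmax(logits, dim=-1)
    # take the first sub-token of " bean" / " seed"
    return {w: probs[tok.encode(" " + w)[0]].item() for w in candidates}

base = "Coffee is botanically classified as a"
contradiction = ("According to recent research, coffee is botanically "
                 "classified as a seed, not a bean.\n")

print("Without RAG:", next_token_probs(base, ["bean", "seed"]))
print("With RAG:   ", next_token_probs(contradiction + base, ["bean", "seed"]))
```

Whatever the exact numbers, the pattern is what matters: the contradicting context shifts probability toward "seed" without wiping out the trained preference for "bean."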


The Confidence Paradox: Wrong but Sure

Here's the truly perverse finding: models are often MORE confident when relying on parametric memory, even when it's wrong. The strength of neural pathways correlates with confidence scores, not accuracy.


This creates hybrid hallucinations where the model confidently states: "Coffee, which is a bean, has recently been reclassified as a seed for botanical purposes." It sounds authoritative, cites real documents, and is self-contradictory - a fabrication born from trying to reconcile irreconcilable information.


The Hierarchy of Stubbornness

Not all knowledge is equally entrenched. You can guess how immovable a fact is by asking how often it would have been repeated in the training data:


Neural Highways (Nearly Immovable)

  • Geographic facts: "Paris is the capital of France"

  • Common phrases: "peanut butter" (not "peanut jam")

  • Basic science: "water boils at 100°C"


Negotiable Territory

  • Recent events

  • Specific numbers

  • Technical specifications

  • Company-specific information


This hierarchy explains why RAG works well for updating quarterly earnings but can't convince a model that coffee isn't a bean.


The Bottom Line:

RAG remains one of our best tools for making language models more accurate, current, and verifiable. But we need to stop treating it as a magic solution.


RAG is like giving a brilliant but opinionated professor access to a library. Yes, they have better resources than someone working from memory alone. But they can still:

  • Misread sources through their own biases

  • Creatively combine unrelated facts

  • Cite books they've only skimmed

  • Blend what they know with what they find


So the next time someone tells you "RAG eliminates hallucinations," ask them:

"How do you detect when your system is confidently citing real documents to support false claims?"

 
