
What Do the World's Smartest AI Models Read Before They Graduate? (Spoiler: It's Not What You Think)

  • Ashish Arora
  • Jul 3
  • 6 min read

Picture this: The entire internet contains roughly 100 trillion words of text. Now guess how much of that makes it into training the world's leading LLMs? If you guessed "most of it" or even "half of it," you're off by a huge factor.


LLM Training Data

Less than 2% of available web data makes it through the journey from raw internet to refined training dataset. In fact, a well-refined dataset is one of the most important factors affecting a model's accuracy and range.


What does an LLM eat?

Data Mining Before the Training

Training an LLM isn't like filling a bucket with internet water. It's more like running the world's most sophisticated diamond mine where 98% of what you dig up gets discarded. Research from "The RefinedWeb Dataset" (Penedo et al., 2023) reveals this stunning filtering cascade:


  • Start: 250TB of raw web crawl (imagine 50 million movies worth of text)

  • After language detection: 100TB (goodbye to code dumps and gibberish)

  • After URL filtering: 50TB (spam sites and content farms eliminated)

  • After deduplication: 15TB (turns out the internet loves copy-paste)

  • After quality filtering: 5TB (only the coherent, valuable content survives)

  • Final training data: 600GB-2TB (the refined gold)


That's a 99.2% rejection rate. Harvard's acceptance rate looks generous by comparison :P
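
If you want to see that rejection rate fall out of the numbers, here's a quick back-of-the-envelope sketch in Python. The stage sizes are the approximate figures from the list above, taking the 2TB upper end of the final range:

```python
# Back-of-the-envelope: how much survives each stage of the filtering cascade.
# Sizes (in TB) are the approximate figures quoted above, not exact pipeline numbers.
stages = [
    ("raw web crawl", 250.0),
    ("language detection", 100.0),
    ("URL filtering", 50.0),
    ("deduplication", 15.0),
    ("quality filtering", 5.0),
    ("final training data", 2.0),  # upper end of the 600GB-2TB range
]

raw = stages[0][1]
for name, size in stages:
    print(f"{name:22s} {size:7.1f} TB  ({size / raw:6.2%} of raw)")

# Final training data comes out to ~0.8% of raw, i.e. a ~99.2% rejection rate.
```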


The Copy-Paste Catastrophe

According to research, nearly 30% of the internet is just... copies of itself. The same news article republished 1,000 times. The same product description on 500 e-commerce sites. The same Wikipedia paragraph quoted everywhere.


But here's where it gets fascinating. Research from "Deduplicating Training Data Makes Language Models Better" (Lee et al., 2022) found that training on duplicates doesn't just waste resources - it actively makes models worse. Removing near-duplicates improved model performance by 10% and reduced harmful memorization by 47%.


The latest breakthrough from D4 (Diversifying and Deduplicating) research in 2023 shows that proper deduplication combined with diversity can improve training efficiency by 20%. Think about it: If you read the same fact 1,000 times versus reading 1,000 different facts once, which makes you smarter?



How Do These Filters Actually Work?

Let's dig in and see how these filtering systems actually operate:


Stage 1: Language Detection - The First Gatekeeper

The process starts with sophisticated language identification. Models like Meta's fastText can identify 176 languages with over 95% accuracy in milliseconds. But they don't just check if text is English - they check confidence scores.


A page that's 60% English, 30% JavaScript code, and 10% Chinese gets filtered out because the confidence drops below the threshold. It's like having a bouncer who doesn't just check IDs but notices when something feels "off" about a guest.
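
For a concrete feel, here's a minimal sketch of that gatekeeper using fastText's public lid.176.bin language-ID model. The 0.65 confidence cutoff is an illustrative choice of mine, not any lab's published threshold:

```python
import fasttext  # pip install fasttext

# lid.176.bin is fastText's public 176-language identification model
model = fasttext.load_model("lid.176.bin")

def keep_document(text: str, lang: str = "en", min_confidence: float = 0.65) -> bool:
    """Keep a document only if the detector is confident it's in the target language."""
    # fastText's predict() doesn't accept newlines, so flatten the text first
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    predicted_lang = labels[0].replace("__label__", "")
    return predicted_lang == lang and probs[0] >= min_confidence

# A page that mixes English prose with JavaScript and Chinese tends to score
# well below the cutoff and gets dropped.
print(keep_document("The quick brown fox jumps over the lazy dog."))
```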


Stage 2: URL-Level Filtering - The Reputation System

Companies maintain massive blocklists of over 500,000 domains. These aren't just obvious spam sites - they include:

  • Content farms that rewrite the same article 1,000 ways

  • Sites caught serving different content to crawlers versus humans

  • Domains with over 50% machine-generated content

  • Known copyright infringement hubs


The filtering goes beyond simple blacklists. Pattern matching catches things like suspicious URL structures (ever notice how spam sites love random numbers in URLs?) and domains that pop up and disappear quickly.
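
Here's a toy version of what URL-level filtering might look like. The blocked domains and regex patterns below are made up for illustration; real blocklists are vastly larger and the heuristics far more nuanced:

```python
import re
from urllib.parse import urlparse

# Illustrative only: real pipelines use curated blocklists of hundreds of
# thousands of domains plus many more signals than shown here.
BLOCKED_DOMAINS = {"example-content-farm.com", "spammy-mirror.net"}

SUSPICIOUS_PATTERNS = [
    re.compile(r"\d{6,}"),                            # long runs of digits in the path
    re.compile(r"(buy|cheap|free)-.*-(now|online)"),  # keyword-stuffed slugs
]

def url_passes(url: str) -> bool:
    """Reject URLs from blocklisted domains or with spam-looking paths."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower().removeprefix("www.")
    if domain in BLOCKED_DOMAINS:
        return False
    return not any(p.search(parsed.path) for p in SUSPICIOUS_PATTERNS)

print(url_passes("https://en.wikipedia.org/wiki/Language_model"))  # True
print(url_passes("https://spammy-mirror.net/article/123456789"))   # False
```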


Stage 3: The Deduplication Mathematics

Here's where it gets clever. Instead of comparing every document to every other document (impossible at scale), they use a technique called MinHash that creates a "fingerprint" of 128 numbers for each document.


If 80% of these numbers match between two documents, they're likely duplicates. This brilliant approach can process billions of documents efficiently. That news story about a celebrity that got republished on 1,000 gossip sites? The algorithm keeps the highest quality version and discards 999 copies.
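
If you want to play with the idea, the datasketch library implements MinHash. The sketch below fingerprints two near-identical "celebrity" blurbs; the 3-word shingles and the 0.8 cutoff (mirroring the 80% figure above) are illustrative choices:

```python
from datasketch import MinHash  # pip install datasketch

def fingerprint(text: str, num_perm: int = 128) -> MinHash:
    """Build a 128-number MinHash fingerprint from a document's 3-word shingles."""
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(len(words) - 2):
        m.update(" ".join(words[i:i + 3]).encode("utf-8"))
    return m

doc_a = "Celebrity spotted at the airport wearing a red jacket on Tuesday morning"
doc_b = "Celebrity spotted at the airport wearing a red jacket on Tuesday evening"

similarity = fingerprint(doc_a).jaccard(fingerprint(doc_b))
print(f"estimated Jaccard similarity: {similarity:.2f}")
# A score near or above 0.8 flags the pair as near-duplicates; only one copy is kept.
```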


Stage 4: Quality Scoring - The AI Judge

This is perhaps the most innovative filter. They use a smaller, already-trained language model as a quality judge. Well-written text has predictable patterns that the model recognizes. Garbage text surprises the model at every word.
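
One simple way to approximate this judge at home is to score documents by perplexity under a small pretrained model. Here's a rough sketch using GPT-2 via Hugging Face transformers; production pipelines use purpose-built quality classifiers, and any cutoff value would be an illustrative assumption:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# A small, already-trained model acts as the judge: fluent text is "unsurprising"
# (low perplexity), garbage text surprises it at every word (high perplexity).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()

print(perplexity("The committee will publish its findings next month."))
print(perplexity("win prize click now best deal deal deal free free"))
# Documents whose perplexity exceeds some cutoff would be dropped or down-weighted.
```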


The latest innovation from Ultra-Fineweb (2025) takes this further - their model-based filtering pipeline has shown improvements on major benchmarks like MMLU, ARC, and C-Eval by better identifying high-quality educational content. It's not just filtering out bad content anymore; it's actively selecting for content that makes models smarter.


Stage 5: The AI-Assisted Revolution

Meta's breakthrough with LLaMA-3 introduced something revolutionary: using LLaMA-2 to filter training data for its successor. The AI can make nuanced judgments about content quality that simple rules can't capture. It detects:

  • Conspiracy theories presented as fact

  • Coherent-looking but meaningless text

  • Subtle spam that passes other filters


It's like having an experienced editor read every single page of the internet and mark the good stuff.
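
Here's roughly what an "LLM as judge" filter looks like in skeleton form. Both the rubric prompt and the generate() helper are placeholders of mine (Meta hasn't published LLaMA-2's exact filtering prompt), so treat this as a shape, not a recipe:

```python
# Sketch of an "LLM as data judge". generate() stands in for whatever
# instruction-tuned model you have available; it is a placeholder, not a real API.

JUDGE_PROMPT = """Rate the following web text for use as LLM training data.
Score 0-5 for: factual reliability, coherence, and information value.
Reply with a single integer (the minimum of the three scores).

TEXT:
{text}
"""

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your own model call here")

def passes_llm_judge(text: str, min_score: int = 3) -> bool:
    """Keep a document only if the judge model rates it at or above min_score."""
    reply = generate(JUDGE_PROMPT.format(text=text[:4000]))  # truncate long docs
    try:
        return int(reply.strip()) >= min_score
    except ValueError:
        return False  # unparseable judgment -> treat as a rejection
```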



The Secret Recipe: Optimizing Data Mixtures

Recent research from DoReMi (2023) reveals something crucial: it's not just about having good data - it's about mixing it in the right proportions. Think of it like baking: you need the right ratio of flour to sugar to eggs, not just high-quality ingredients.


The Data Mixing Laws (2024) have shown we can now predict model performance based on data composition before training even begins. This breakthrough means:

  • Less wasted compute on bad mixtures

  • Faster iteration to find optimal blends

  • Better performance with the same amount of data


The Universal Foundation Every Model Uses:

  • Filtered Common Crawl: 40-60% (general knowledge)

  • Wikipedia: 3-5% (but weighted 3-4x higher for quality)

  • Books: 10-15% (long-form coherence)

  • Code: 5-10% (logical reasoning boost)

  • Academic papers: 2-5% (technical depth)

  • Forums/Dialogue: 10-50% (conversational ability)


The exact proportions are each company's secret sauce, refined through thousands of experiments.
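
To make the "secret sauce" idea concrete, here's a toy sampler that draws each training document's source according to a fixed mixture. The proportions are illustrative midpoints of the ranges above, not any lab's actual recipe:

```python
import random

# Illustrative mixture (rough midpoints of the ranges above), summing to 1.0.
MIXTURE = {
    "filtered_common_crawl": 0.50,
    "wikipedia":             0.04,
    "books":                 0.12,
    "code":                  0.08,
    "academic_papers":       0.04,
    "forums_dialogue":       0.22,
}

def sample_source(rng: random.Random = random.Random(0)) -> str:
    """Pick which corpus the next training document comes from."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Each batch is assembled by repeatedly drawing a source, then a document from
# that source - so the blend itself is a dial researchers tune.
counts = {s: 0 for s in MIXTURE}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)
```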


Not All Data Is Treated Equal

Here's where things get really fascinating. Even after all this filtering, not all data is treated equally during training. The comprehensive Data Management Survey (2023) revealed sophisticated weighting strategies across the industry.


The Weight Assignment Process

When GPT-3 was trained, researchers made fascinating choices:

  • Wikipedia content, despite being only 3% of the total data, was given 3.4 times more influence per word

  • Random web pages made up 60% of the data but were given less than half the normal influence

  • Books comprised 16% but received nearly double weight


Why? Because a carefully researched Wikipedia article about physics is more valuable for learning than a thousand spam blog posts.
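
You can see why that re-weighting matters with a little arithmetic. The fractions and weights below echo the GPT-3-style numbers quoted above, plus an "other" bucket I've added so the shares sum to 100%, so treat the output as illustrative rather than the paper's exact table:

```python
# Illustrative math: how per-word weights change what the model actually "sees".
corpus = {
    "web_pages": {"fraction": 0.60, "weight": 0.4},   # big but down-weighted
    "books":     {"fraction": 0.16, "weight": 1.9},
    "wikipedia": {"fraction": 0.03, "weight": 3.4},   # small but up-weighted
    "other":     {"fraction": 0.21, "weight": 1.0},
}

total = sum(s["fraction"] * s["weight"] for s in corpus.values())
for name, s in corpus.items():
    effective_share = s["fraction"] * s["weight"] / total
    print(f"{name:10s} raw share {s['fraction']:5.0%} -> effective share {effective_share:5.0%}")

# Wikipedia's 3% of raw data ends up contributing roughly 12% of what the model
# actually trains on; web pages shrink from 60% to about 28%.
```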


The Factuality Crisis and Its Solutions

The FACTS Grounding benchmark (2024) exposed a critical problem: even the best models hallucinate information 15-30% of the time. This led to revolutionary changes in how data is selected and weighted.


The Turing Factuality Guide (2024) introduced strategies now used across the industry:

  • Source Authority Weighting: Academic papers and verified news get 5-10x weight over random web content

  • Contradiction Detection: When sources disagree, models learn to express uncertainty

  • Temporal Weighting: More recent information gets higher weight for current events


The Looming Data Crisis: When We Run Out of Internet

Here's a sobering thought from "Educating Silicon" (2024): We're approaching the limits of available human text. Their research estimates:

  • 2024: We've used ~10% of all quality human text ever written

  • 2026: We'll hit 50% at current scaling rates

  • 2028: Potential exhaustion of new, high-quality text


This isn't science fiction - it's math. With models growing from billions to trillions of tokens, we're literally running out of human writings to feed them.


The implications are staggering:

  • Future gains must come from better curation, not just more data

  • Synthetic data generation becomes crucial (but risks quality degradation)

  • The value of high-quality, unique text will skyrocket


Why Code Makes Everything Smarter

Here's an unexpected finding: Including programming code makes models better at everything - math, reasoning, even creative writing. Why?


Code teaches structured thinking. Every function and loop is a lesson in breaking complex problems into steps. Models trained with significant code (like GPT-4) show dramatic improvements in logical reasoning. It's no accident that top models all include substantial programming content.


The New Crisis: When AI Eats Its Own Tail

"The Curse of Recursion" (Shumailov et al., 2024) identified an emerging threat: As AI-generated content floods the internet, future models risk training on synthetic data, leading to degradation.


This makes curation even more critical. New filtering systems now detect and remove AI-generated content. The bar keeps rising.


What This Means for the Future

Understanding these curation pipelines reveals profound insights:

  1. Quality beats quantity: Smaller, refined datasets outperform larger, noisy ones

  2. Mixture matters: DoReMi shows optimal blending can improve performance by 30%

  3. Deduplication is crucial: D4 proves 20% efficiency gains from proper deduplication

  4. Factuality is paramount: New benchmarks drive focus on truthful content

  5. We're hitting limits: Data scarcity will reshape AI development strategies

  6. Perfect data doesn't exist: The goal is understanding and managing limitations


The Bottom Line That Changes Everything

The myth of "trained on all web data" isn't just wrong - it misses the entire point of modern AI. These models aren't indiscriminate data vacuums. They're the result of one of the most sophisticated curation operations in human history.


When you interact with an LLM, you're not talking to "the internet." You're talking to a carefully selected, weighted, and filtered synthesis of human knowledge - the 2% that survived the most rigorous selection process ever designed.


But here's the kicker: We're running out of that 2%. The future of AI isn't just about better algorithms - it's about squeezing every drop of value from the finite well of human knowledge.

