
The Hidden Science Behind LLM Token Limits (And How Million-Token Models Actually Work)

  • Ashish Arora
  • Jun 27
  • 7 min read

Updated: Jun 29

Introduction

"Why can't I just paste my entire company's documentation into ChatGPT?"

If I had a dollar for every time I've heard this question in my two years as a Generative AI consultant, I could probably afford the GPU cluster needed to answer it. (I wish!)


Last week was no different - in yet another meeting, we discussed whether context size was just "another controllable parameter." After all, if our laptops can open a 500MB text file instantly, why can't a powerful AI model handle it?


This misconception is so common that I've decided to pull back the curtain on what really happens when you challenge models with bulky prompts. (No context limit on the detail here! 😉)

[Image: an LLM handed a 500-page document in a single prompt :)]

The Myth: "Token Limits are Artificial Business Constraints"

The myth typically sounds something like this:

  • "OpenAI/Anthropic artificially restrict context length to charge more"

  • "If Google Docs can handle a 200-page document, why can't these powerful AI models?"

  • "They could easily allow unlimited tokens but choose not to"

  • "It's just like mobile data caps - purely for profit"


This sounds logical on the surface. After all, we're used to artificial limitations in tech products. But token limits are different - they're more like the speed of light than a speed limit.


Understanding Tokens: The DNA of Language Models

First, let's demystify tokens. Think of them as the "atoms" of language that models understand:

  • 1 token ≈ 4 characters in English

  • 1 token ≈ 0.75 words on average

  • "Hello, world!" = 4 tokens: ["Hello", ",", " world", "!"]


So when we talk about GPT-4o's 128K token limit, that's roughly:

  • 96,000 words

  • 192 pages of text

  • A short novel


Sounds generous? Here's where physics enters the chat.


The Reality: Quadratic Complexity and the Attention Mechanism

The Attention Mechanism: How LLMs "Read"

Large Language Models use something called "self-attention" to understand context. Unlike humans who read linearly, LLMs must consider how every token relates to every other token simultaneously.


Imagine you're reading this sentence:

"The bank by the river was the perfect spot for the bank executive's lunch."

A human reads left-to-right and understands "bank" means different things from context. An LLM must:

  1. Compare "bank #1" with every other word

  2. Compare "bank #2" with every other word

  3. Determine relationships between all word pairs

  4. Do this for EVERY token in the context
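
To make that concrete, here's a toy count of the ordered comparisons for just that one sentence (a crude word-level split for illustration, not a real tokenizer):

from itertools import product

sentence = "The bank by the river was the perfect spot for the bank executive's lunch."
words = sentence.split()                   # crude word-level "tokens"
pairs = list(product(words, repeat=2))     # every token compared with every token
print(len(words), len(pairs))              # 14 words -> 196 ordered pairs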


The Quadratic Explosion

Here's where it gets mathematical. The number of comparisons needed is:

Attention Operations = n² (where n = number of tokens)

Let's see this in action:

Context Length     Tokens      Attention Operations     Relative Compute
1 page             500         250,000                  1x
10 pages           5,000       25,000,000               100x
100 pages          50,000      2,500,000,000            10,000x
1,000 pages        500,000     250,000,000,000          1,000,000x
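
These figures are easy to reproduce - the snippet below just squares the token counts, assuming the table's rough ~500 tokens per page:

PAGE_TOKENS = 500                          # rough assumption used in the table above

for pages in (1, 10, 100, 1000):
    n = pages * PAGE_TOKENS
    ops = n ** 2                           # pairwise attention comparisons
    relative = ops // PAGE_TOKENS ** 2
    print(f"{pages:>5} pages: {n:>7,} tokens -> {ops:>15,} operations ({relative:,}x)")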

Research Deep Dive:

  • "Efficient Transformers: A Survey" (Tay et al., 2020): This comprehensive survey confirmed that standard self-attention has O(n²) time and memory complexity, becoming prohibitive beyond 10K tokens.

  • "Long Range Arena" (Tay et al., 2021): Models failed to complete training on sequences beyond 20K tokens even with optimization techniques on 32GB V100 GPUs.


Memory Requirements: The RAM Bottleneck

The attention mechanism doesn't just need compute - it needs memory:

Attention Memory ≈ batch_size × n_layers × n_heads × n_tokens² × bytes_per_value

For large models processing 128K tokens, consider that each attention operation must store scores for every token pair. At 128K tokens, that's over 16 billion relationships to track - per layer, per attention head.


While exact numbers for proprietary models aren't public, storing a single 128K × 128K attention matrix in FP16 precision requires approximately 32GB. Modern LLMs have dozens of layers and multiple attention heads, quickly pushing memory requirements into the hundreds of gigabytes range.
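
Here's that arithmetic spelled out as a back-of-the-envelope sketch (the layer and head counts are hypothetical, purely for illustration):

n_tokens = 128 * 1024                      # 128K-token context
bytes_per_value = 2                        # FP16
one_matrix = n_tokens ** 2 * bytes_per_value
print(f"{one_matrix / 2**30:.0f} GiB per 128K x 128K attention matrix")    # 32 GiB

n_layers, n_heads = 32, 8                  # hypothetical model dimensions
total = one_matrix * n_layers * n_heads
print(f"{total / 2**40:.0f} TiB if every score matrix were held at once")  # 8 TiB
# (Real serving stacks avoid materializing all of this at once, which is
#  exactly why the memory-management tricks listed below exist.)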


That's why even with enterprise GPUs (80GB A100s), serving long contexts requires:

  • Model parallelism across multiple GPUs

  • Sophisticated memory management (gradient checkpointing, memory-efficient attention)

  • Significant infrastructure with high-bandwidth interconnects


Real-world comparison:


import torch

# Your laptop opening a text file: a simple linear read, O(n)
with open('war_and_peace.txt', 'r') as f:
    text = f.read()

# An LLM "reading" text: pairwise attention scores, O(n²)
n_tokens, d_head = 4096, 64                  # toy sizes for illustration
Q = torch.randn(n_tokens, d_head)            # one query vector per token
K = torch.randn(n_tokens, d_head)            # one key vector per token
attention_scores = torch.matmul(Q, K.T)      # n_tokens × n_tokens matrix
# ...and this must happen for EVERY layer and EVERY attention head

The Quality Degradation: "Lost in the Middle" Problem

Even when technically possible, long contexts suffer quality degradation. The paper "Lost in the Middle" (Liu et al., 2023) found that every model tested showed a distinctive U-shaped performance curve:


[Chart: Position in Context vs. Retrieval Accuracy - accuracy is highest when the relevant information sits at the very start or very end of the context, and drops sharply when it sits in the middle.]


Why Does This Happen?

  1. Training Data Bias: Most training data consists of short texts - web pages (~3K tokens), news articles (~1K tokens), and social media posts. When 99% of training examples are under 8K tokens, models become specialists in short-form content. They learn that important information typically appears early (headlines, abstracts) and never develop the ability to maintain focus deep into long documents.

  2. Positional Encoding Limitations: Positional encodings help transformers understand word order, but they degrade over distance. At position 75,000, the model can't reliably distinguish if a token is at position 75,000 or 74,500. This "positional blur" means the model loses track of relationships between distant tokens - like trying to measure intercontinental distances with a ruler.

  3. Attention Dilution: Attention weights must sum to 1.0, creating a fundamental constraint:

    • 1K context: Each token gets ~0.1% average attention

    • 100K context: Each token gets ~0.001% average attention

    • 1M context: Each token gets ~0.0001% average attention

    When attention is spread too thin, the model compensates with "spotlight" behavior - intensely focusing on the beginning and end while leaving the middle in darkness.
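
You can see the dilution directly from the softmax itself. A toy sketch with uniform scores (so every token looks equally relevant) reproduces the averages above:

import torch

for n in (1_000, 100_000, 1_000_000):
    scores = torch.zeros(n)                     # every token equally "relevant"
    weights = torch.softmax(scores, dim=0)      # attention weights must sum to 1.0
    print(f"{n:>9,} tokens: average attention weight = {weights.mean().item():.7%}")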


Here are some interesting papers in this area and their findings:

  • "The Reversal Curse": Models trained on "A is B" facts often cannot infer "B is A", showing inherent directional biases. Trained on: "The Eiffel Tower is in Paris"

    Cannot reliably answer: "What famous tower is in Paris?". This directional bias compounds over distance. In a 100K token context, the model effectively creates one-way information highways. Information can flow forward but not backward, creating "orphaned" content that the model knows exists but cannot properly connect to related concepts elsewhere in the document.

  • "Scaling Laws": Doubling context from 32K to 64K yields minimal gains while quadrupling costs. Beyond ~32K tokens, you're essentially paying exponentially more for logarithmically smaller improvements. It's like adding more lanes to a highway that's already 50 lanes wide - traffic doesn't flow meaningfully faster


The Combined Effect: A Perfect Storm

These factors don't operate in isolation - they compound each other:

  1. Training bias means models expect important information early

  2. Positional degradation means they can't track where they are in long documents

  3. Attention dilution means they can't focus on everything important

  4. Reversal curse means they can't cross-reference effectively

  5. Scaling laws mean throwing more compute at the problem yields diminishing returns


The result? The "lost in the middle" phenomenon isn't a bug - it's the inevitable outcome of multiple fundamental limitations converging. It's why even the most advanced models still struggle with truly long contexts, despite impressive engineering workarounds.


How Million-Token Models Actually Work: The Architecture Revolution

But wait... OpenAI's GPT-4.1, Google's Gemini family, and Meta's Llama 4 all claim to handle 1 million tokens or more.


How is this possible given the quadratic complexity barrier?

The answer: they don't use traditional dense attention. Instead, each company has developed fundamentally different architectures.


Meta's Llama 4: Rethinking the Fundamentals

Meta's crown jewel is Interleaved Rotary Positional Encoding (iRoPE). Traditional transformers use positional encodings on every layer, but these break down beyond training lengths. Llama 4's solution: alternate between layers with and without positional embeddings.


Why does this work? By not reinforcing absolute position at every step, the model can generalize far beyond its training length. It's like teaching navigation by landmarks rather than GPS coordinates - more flexible in new territory.
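
Meta hasn't published Llama 4's internals, so purely as a hypothetical illustration of that interleaving pattern (layer count and key names invented for the sketch):

n_layers = 48                               # illustrative, not Llama 4's real depth
layer_plan = [
    {"layer": i, "positional_encoding": "rope" if i % 2 == 0 else "none"}
    for i in range(n_layers)
]
# Layers with "rope" anchor tokens to positions; the "none" layers attend
# without explicit position information, so nothing ties them to the
# sequence lengths seen during training.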


Google's Gemini: Distributed Computing Meets Smart Routing

Ring Attention: Creates a circular pipeline across multiple GPUs. Each GPU stores keys/values for only a portion of the sequence. During computation, GPUs pass their portion around like a relay race. No single GPU ever holds the entire matrix.
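
The production system isn't public, but the core trick - never holding the full score matrix in one place - can be imitated on a single machine by visiting the keys/values one chunk at a time and merging partial softmax results, just as each GPU in the ring would. A simplified single-process sketch, not Google's code:

import torch

def ring_style_attention(Q, K_chunks, V_chunks):
    """Attention over K/V split into chunks, merged with a running softmax
    so the full n x n score matrix never exists in memory at once."""
    n_q, d = Q.shape
    m = torch.full((n_q, 1), float('-inf'))    # running row-wise max
    l = torch.zeros(n_q, 1)                    # running softmax denominator
    o = torch.zeros(n_q, d)                    # running weighted sum of values
    for K_c, V_c in zip(K_chunks, V_chunks):   # one "GPU's" slice at a time
        s = Q @ K_c.T / d ** 0.5               # scores against this chunk only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        scale = torch.exp(m - m_new)           # rescale earlier partial results
        p = torch.exp(s - m_new)
        l = l * scale + p.sum(dim=-1, keepdim=True)
        o = o * scale + p @ V_c
        m = m_new
    return o / l

# Matches ordinary dense attention, without ever building the full matrix:
n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
dense = torch.softmax(Q @ K.T / d ** 0.5, dim=-1) @ V
ring = ring_style_attention(Q, K.chunk(4), V.chunk(4))
print(torch.allclose(dense, ring, atol=1e-4))  # True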


Mixture of Depths: Recognizes that some tokens need less processing. A lightweight router decides which tokens get full computational treatment versus coasting on residual connections. Can reduce computation by 40-60%.
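
And a toy sketch of the routing idea (module choices and shapes are my own, not Google's implementation):

import torch
import torch.nn as nn

class MixtureOfDepthsBlock(nn.Module):
    """A tiny router scores each token; only the top-k tokens get the
    expensive block, the rest ride the residual path untouched."""
    def __init__(self, d_model=512, n_heads=8, capacity=0.5):
        super().__init__()
        self.router = nn.Linear(d_model, 1)    # lightweight per-token scorer
        self.heavy = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.capacity = capacity               # fraction of tokens given full compute

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        k = max(1, int(n * self.capacity))
        scores = self.router(x).squeeze(-1)    # (batch, seq_len)
        chosen = scores.topk(k, dim=-1).indices
        out = x.clone()                        # unchosen tokens pass through as-is
        for i in range(b):                     # gather chosen tokens, process, scatter back
            idx = chosen[i].sort().values      # keep original token order
            out[i, idx] = self.heavy(x[i:i+1, idx])[0]
        return out

block = MixtureOfDepthsBlock()
y = block(torch.randn(2, 1024, 512))           # half the tokens skip the heavy layer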


OpenAI's GPT-4.1: Training Over Architecture

OpenAI focused on teaching their model to handle long contexts better through specialized training on:

  • Finding information buried in massive documents

  • Multi-hop reasoning across distant text

  • Maintaining coherence across millions of tokens


In their MRCR benchmark, GPT-4.1 can reliably find "the third poem about tapirs" even when surrounded by similar distractors across a million-token context.


The Physics Still Applies

Despite these innovations, none truly "solve" the quadratic attention problem:

  • Distributed Processing divides but doesn't eliminate the problem

  • Sparse Patterns sacrifice global understanding for efficiency

  • Smart Routing reduces average but not worst-case complexity

  • Training Tricks improve utilization but don't change fundamentals


The iron law remains: long context, high quality, or low cost - pick two.


The Hidden Trade-offs

1. The "Lost in the Middle" Problem Persists

Even advanced models struggle when information is buried in the middle of long contexts. Real-world retrieval is far more complex than simple needle-finding.


2. Inconsistent Performance Across Content Types

Technical documentation and code suffer more degradation than narrative text. Structured data shows the steepest decline.


3. Unpredictable Latency and New Failure Modes

  • Attention drift: Models lose track of the original query

  • Context poisoning: Early errors compound throughout processing

  • Extreme variance: a 2-minute median response time can hide a 15+ minute tail


Key Takeaways

  1. Token limits are physics, not business - quadratic complexity is fundamental

  2. Longer isn't always better - quality degradation is real and measurable

  3. Different architectures optimize for different goals

  4. Most applications work better with RAG or hierarchical processing

  5. Understanding limits enables better system design


Conclusion

The next time someone suggests token limits are arbitrary, explain it's like asking why we can't build a bridge to the moon - technically possible perhaps, but the physics makes it impractical.


Understanding these limitations isn't about accepting defeat - it's about designing better systems that work with the physics rather than against it. Million-token models show us what's possible with engineering ingenuity, but they also remind us that in computing, every breakthrough comes with trade-offs.


As we await the next breakthrough in efficient attention mechanisms, remember: constraints often drive the most creative solutions.


What's your experience with token limits? Have you found creative workarounds? Share your thoughts in the comments or connect with me on LinkedIn to discuss how we can build better AI systems within physical constraints.

