Context Windows Are a Lie: How LLMs Actually Use Long Context
Every major model vendor is in a context window arms race. Gemini advertises 2 million tokens. Claude offers 200K. GPT-5 pushes 400K. The marketing message is clear: throw your entire codebase in, and the model will understand it all.
Except it won’t. A growing body of research shows that LLMs don’t use long context uniformly. They attend strongly to the beginning and end of their input while neglecting the middle, a phenomenon researchers call lost in the middle. And the gap between “can retrieve a fact from 1M tokens” and “can reason over a 100K-token codebase” is vast enough to swallow your entire debugging session.
This matters because developers are making architectural decisions (RAG vs. context stuffing, chunking strategies, what to include in prompts) based on headline context window numbers that don’t reflect how models actually perform. Here’s what the research says and what it means for how you build with LLMs.
The Lost in the Middle Effect
In July 2023, Nelson F. Liu and colleagues at Stanford published “Lost in the Middle: How Language Models Use Long Contexts.”1 The paper tested models on two tasks: multi-document question answering and key-value retrieval. The researchers varied where the relevant information appeared in the input (beginning, middle, or end) and measured accuracy.
The results drew a U-shaped curve. Models performed best when the answer was near the start or end of the context. When the relevant information sat in the middle, accuracy dropped significantly, even for models explicitly designed for long contexts.
This wasn’t a subtle effect. On multi-document QA with 20 documents, some models lost 20+ percentage points of accuracy when the gold document moved from position 1 to position 10. The key-value retrieval task showed similar patterns: near-perfect accuracy for keys at the boundaries, significant degradation for keys in the middle.
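The paper's setup can be sketched in a few lines. This is a minimal illustration of the position sweep, not the authors' actual harness; `ask_model` is a hypothetical stand-in for an LLM client, and the helper names are invented for this example.

```python
# Minimal sketch of the "lost in the middle" position sweep.
# ask_model(prompt) -> str is a hypothetical LLM client, not a real API.

def build_context(gold_doc: str, distractors: list[str], gold_position: int) -> str:
    """Place the gold document at a chosen position among distractors."""
    docs = distractors[:gold_position] + [gold_doc] + distractors[gold_position:]
    return "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))

def position_sweep(gold_doc, distractors, question, answer, ask_model):
    """Check whether the model still finds the answer as the gold document moves."""
    results = {}
    for pos in range(len(distractors) + 1):
        prompt = build_context(gold_doc, distractors, pos) + f"\n\nQuestion: {question}"
        results[pos] = answer.lower() in ask_model(prompt).lower()
    return results
```

Plotting accuracy against `pos` is what produces the U-shaped curve: positions near `0` and near the end score well, the middle sags.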
The paper’s contribution wasn’t discovering that models struggle with long input; practitioners had noticed this informally. Its contribution was quantifying the pattern and showing that it held across multiple model families and context lengths. The U-shaped curve became one of the most cited findings in LLM research.
Why This Happens
The mechanism traces back to how transformer attention works. Self-attention assigns weights to every token based on relevance to the current prediction. In theory, this lets the model attend to any position equally. In practice, positional encodings and training data distributions create biases.
Models see enormous amounts of text during training where the most important information (instructions, key facts, conclusions) tends to appear at the beginning or end. Documents have introductions and conclusions. Conversations have openers and final answers. The model learns these patterns and develops a recency bias and a primacy bias that persist at inference time, even when you deliberately place critical information in the middle.
Think of it like reading a 200-page report under time pressure. You’ll probably read the executive summary closely, skim the middle sections, and pay attention to the conclusion. LLMs do something analogous, not because they’re lazy, but because their attention mechanisms developed similar statistical shortcuts during training.
Needle in a Haystack: The Benchmark That Flatters
The most popular test for long-context performance is Needle in a Haystack (NIAH), created by Greg Kamradt in late 2023.2 The test inserts a specific fact (the “needle”) into a large body of irrelevant text (the “haystack”) at various positions and depths, then asks the model to retrieve it.
Models score impressively on NIAH. Gemini 1.5 Pro achieves over 99.7% recall up to 1 million tokens. Claude 3 Opus performs strongly across its full 200K window. These results get featured in press releases and technical reports as evidence that long context is a solved problem.
It isn’t. NIAH tests retrieval — finding a specific, distinctive fact in a pile of unrelated text. This is the easiest possible long-context task. The needle is deliberately different from the haystack, making it relatively simple for attention mechanisms to locate. Real-world usage demands something harder: reasoning over large contexts where the relevant information isn’t distinct from the surrounding text.
When you feed a 50K-token codebase into a model and ask it to find a bug, the model isn’t looking for a needle in a haystack. It’s looking for a slightly bent needle in a pile of other needles. The relevant code looks like the irrelevant code. The bug might span multiple files. Understanding it requires synthesizing information from different positions in the context.
NIAH tells you that the model can find a sentence about pizza in the middle of Paul Graham essays. It doesn’t tell you whether the model can trace a race condition through three layers of async callback handlers buried 80K tokens into your codebase.
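For concreteness, the NIAH mechanic itself is simple. The sketch below, with a hypothetical `ask_model` client and invented helper names, shows why the benchmark is easy to saturate: the needle is inserted verbatim, so retrieval reduces to string-level pattern matching.

```python
# A minimal NIAH-style probe. ask_model(prompt) -> str is a hypothetical
# LLM client. The needle is inserted at fractional depths into filler
# text, and the model is asked to retrieve it.

def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(haystack) * depth)
    return haystack[:cut] + "\n" + needle + "\n" + haystack[cut:]

def niah_probe(haystack, needle, question, answer, ask_model,
               depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    return {
        depth: answer.lower()
        in ask_model(insert_needle(haystack, needle, depth) + "\n\n" + question).lower()
        for depth in depths
    }
```

Nothing in this loop requires synthesis across positions, which is exactly the criticism: a perfect NIAH score says little about multi-hop reasoning over coherent text.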
Context Rot: The Chroma Study
In 2025, the team at Chroma published “Context Rot,” a systematic study of how 18 state-of-the-art models (including Claude Opus 4, GPT-4.1, Gemini 2.5 Pro, and Qwen3) perform as input length increases.3 Their findings fill in the details of the degradation problem.
Performance Degrades on Even Simple Tasks
Chroma tested models on controlled tasks including retrieval with varying needle-question similarity, distractor handling, and text replication. Even on simple retrieval tasks, performance degraded as context length grew. The degradation was worse when the needle-question similarity was low, meaning the question didn’t obviously match the relevant text. This reflects realistic scenarios where exact keyword matches between questions and answers are rare.
Distractors Compound the Problem
When the researchers added distractor content (text somewhat related to the question but not containing the answer), performance dropped further. Even a single distractor reduced accuracy, and multiple distractors compounded the effect. This maps directly onto how developers use context windows: when you load an entire codebase, most of the code is a “distractor” relative to any specific question.
Coherent Text Is Harder Than Random Text
One counterintuitive finding: models performed worse when the haystack preserved logical flow compared to when it was shuffled randomly. Coherent surrounding text creates more opportunities for the model to latch onto plausible-but-wrong context. Random text is easier to ignore.
This has direct implications for code. Source code is highly structured and coherent. Variable names, function signatures, and import statements create dense webs of semantic relationships. When you dump an entire repository into a context window, the model faces a wall of interconnected, coherent text that makes it harder, not easier, to isolate the specific information it needs.
Model-Specific Patterns
Not all models degrade equally. The Chroma study found that Claude models exhibited the lowest hallucination rates when faced with distractors, tending to abstain rather than confabulate. GPT models showed the highest hallucination rates. Gemini models fell in between.
This tracks with my own experience. When I’m working with large codebases, Claude tends to say “I’d need to see the implementation of that function” rather than guessing. GPT models more often generate plausible-looking but incorrect answers about code they can’t actually resolve from context.
The Gap Between Retrieval and Reasoning
The core problem with context window marketing is that retrieval and reasoning are fundamentally different cognitive operations. Models that excel at one don’t necessarily excel at the other.
Retrieval is finding specific information: “What is the database connection string?” If it’s in the context, the model can usually locate it. This is what NIAH tests.
Reasoning is synthesizing information across positions: “Why does this API endpoint return 500 when the request body includes nested arrays?” Answering this might require correlating a schema definition at position 5K, a validation function at position 30K, a middleware layer at position 45K, and an error handler at position 80K.
Research suggests the effective context window for reasoning tasks is dramatically smaller than for retrieval. Stanford and UC Berkeley researchers found that model correctness starts dropping around 32,000 tokens for complex tasks, even for models advertising windows many times larger.4 For reasoning-heavy tasks, some studies suggest the practical sweet spot is smaller still: a few thousand tokens of highly relevant context outperforms 100K tokens of broadly relevant context.
This gap explains a common developer experience: you paste your entire project into Claude or GPT expecting it to understand the architecture, and instead get a response that fixates on the files near the beginning and end of your prompt while ignoring the critical middleware logic you pasted in the middle.
The Coding Context Problem
Coding tasks make this worse because code has unique properties that stress long-context attention:
- Cross-file dependencies. Understanding a bug often requires context from multiple files: type definitions, implementations, tests, and configuration. These rarely sit adjacent in a concatenated codebase.
- Implicit context. Code relies heavily on naming conventions, patterns, and framework-specific behavior that the model needs to infer, not just retrieve.
- Precision requirements. In natural language, approximate understanding often suffices. In code, a single wrong character (`==` vs `===`, `0` vs `O`) changes everything. Long-context degradation hits harder when the margin for error is zero.
````python
# What developers often do:
prompt = system_prompt + "\n\n"
for file in project_files:
    prompt += f"## {file.path}\n```\n{file.content}\n```\n\n"
prompt += f"\nQuestion: {user_question}"

# What this creates:
# - 80K+ tokens of code, mostly irrelevant to the question
# - Critical files buried in the middle
# - Model attention diluted across hundreds of functions
# - Worse performance than sending 5 relevant files
````
What Developers Should Actually Do
The context window arms race encourages a brute-force approach: just throw everything in and let the model figure it out. The research says this is almost always the wrong strategy.
Curate, Don’t Stuff
The single most impactful thing you can do is send less context, not more. Identify the specific files, functions, and types relevant to your question and include only those. Five well-chosen files will outperform fifty loosely related ones every time.
This is the core argument for RAG over context stuffing. A good retrieval system acts as a filter, surfacing the 2K tokens of relevant context from a 500K-token codebase. The model gets a focused window instead of drinking from a fire hose.
```typescript
// Instead of loading everything, load what matters
const relevantFiles = await findRelatedFiles(userQuestion, {
  // Search by semantic similarity to the question
  semanticSearch: true,
  // Include files that import/export from matched files
  followDependencies: true,
  // Cap at a reasonable context budget
  maxTokens: 8_000,
});

const prompt = buildPrompt(systemPrompt, relevantFiles, userQuestion);
```
Put Important Information at the Boundaries
If you must include a lot of context, structure matters. Place the most critical information at the beginning and end of your prompt. Put the question or instruction at the very end. Put key constraints or system instructions at the very beginning. Let the middle contain supporting context that’s useful but not critical.
This isn’t a hack. It’s working with the model’s attention patterns instead of against them. The U-shaped attention curve means boundary positions get the most reliable processing.
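As a sketch, a prompt builder that respects the boundary principle might look like this. The function and argument names are illustrative, not a real library API:

```python
# One way to work with the U-shaped attention curve rather than against
# it: constraints at the start, the question at the very end, and bulk
# context in the middle.

def build_boundary_prompt(system: str, constraints: list[str],
                          supporting_context: list[str], question: str) -> str:
    parts = [
        system,                              # primacy position: system rules
        "Key constraints:",
        *(f"- {c}" for c in constraints),
        "",
        "Supporting context:",
        *supporting_context,                 # middle: useful but not critical
        "",
        f"Question: {question}",             # recency position: the actual ask
    ]
    return "\n".join(parts)
```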
Use Structured Context, Not Raw Dumps
Instead of pasting raw files, provide structured summaries that help the model understand relationships:
```markdown
## Architecture Overview
- API layer: Express routes in src/routes/
- Business logic: Service classes in src/services/
- Data layer: Prisma models in prisma/schema.prisma

## Relevant Files for This Question
1. src/routes/users.ts (the endpoint returning 500)
2. src/services/UserService.ts (validation logic)
3. prisma/schema.prisma (User model definition)

## Full File Contents
[files here]

## Question
Why does POST /users return 500 when the request includes nested address objects?
```
This structure gives the model a map before dropping it into the territory. The overview at the top (primacy position) and the question at the bottom (recency position) both get strong attention, while the file contents in the middle are pre-contextualized by the architecture summary.
Chunk and Iterate
For complex tasks, break the work into steps rather than trying to solve everything in one massive prompt. Use a first pass to identify relevant code, a second pass to analyze the specific problem, and a third pass to generate the fix. Each pass uses a focused context window rather than a bloated one.
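A bare-bones version of the three-pass pattern, with a hypothetical `ask_model` client and invented names, could be sketched as:

```python
# Sketch of "chunk and iterate". ask_model(prompt) -> str is a
# hypothetical LLM client; in a real tool each pass would be a
# separate, focused request.

def solve_in_passes(repo_files: dict[str, str], question: str, ask_model):
    # Pass 1: identify relevant files from the listing alone (cheap).
    listing = "\n".join(repo_files)
    named = ask_model(f"Files:\n{listing}\n\nWhich files matter for: {question}")
    chosen = [path for path in repo_files if path in named]

    # Pass 2: analyze the problem using only the chosen files.
    context = "\n\n".join(f"## {path}\n{repo_files[path]}" for path in chosen)
    analysis = ask_model(f"{context}\n\nExplain the likely cause of: {question}")

    # Pass 3: generate the fix from the focused analysis.
    return ask_model(f"{analysis}\n\nWrite the minimal fix.")
```

Each prompt stays small: pass 1 sees only file paths, pass 2 only the chosen files, pass 3 only the analysis.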
Good AI coding tools already do this behind the scenes. When Claude Code processes a large repository, it doesn’t paste the entire codebase into a single prompt. It searches for relevant files, reads specific sections, and builds context incrementally. The tool does the context engineering that the raw API leaves to you.
Monitor Your Context Budget
> The LLM is a CPU, the context window is RAM, and your job is to be the operating system.
Andrej Karpathy offered that mental model,5 and it’s the right way to think about context budgets. An OS doesn’t load every file on disk into memory at once. It loads what’s needed for the current operation, manages cache coherence, and evicts what’s no longer relevant.
Think about your context window the same way. You have a budget. Every token you spend on irrelevant context is a token that dilutes attention on relevant context. The goal isn’t to maximize utilization. It’s to maximize the signal-to-noise ratio within your context budget.
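The budget framing can be made concrete as a relevance-ranked filter. This is a sketch under loose assumptions: real token counts should come from the model's tokenizer (e.g. tiktoken for OpenAI models), and the four-characters-per-token heuristic here is only a rough stand-in.

```python
# A rough context-budget filter. len(text) // 4 is a crude
# chars-per-token heuristic, used here to stay dependency-free.

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough average for English prose and code

def fit_to_budget(snippets: list[tuple[float, str]], budget: int) -> list[str]:
    """Keep the highest-relevance snippets that fit the token budget."""
    kept, used = [], 0
    for score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return kept
```

The point of the sort is the signal-to-noise argument: when the budget is tight, low-relevance snippets are dropped first rather than squeezed in.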
Why “Just Make It Bigger” Doesn’t Work
The vendor pitch is seductive: if models struggle at 128K, just train them on 1M. If 1M isn’t enough, push to 10M. Eventually, the window will be big enough that the limitations don’t matter.
This argument fails for three reasons.
Attention is zero-sum. Transformer attention distributes a fixed probability mass across all tokens. Making the window larger means each token gets a smaller share of attention on average. You can’t fix attention dilution by adding more tokens to dilute across.
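The dilution is visible in the math itself. In the degenerate case where every token is equally plausible, softmax hands each one exactly 1/N of the attention mass, so growing the window from 1K to 100K tokens cuts the average per-token weight by 100x:

```python
# Illustration of attention dilution: softmax over N uniform scores
# gives each token 1/N of the probability mass.

import math

def softmax(scores: list[float]) -> list[float]:
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

for n in (1_000, 10_000, 100_000):
    per_token = softmax([0.0] * n)[0]  # uniform relevance scores
    print(f"{n:>7} tokens -> {per_token:.6f} attention each")
```

Real attention heads are not uniform, of course; the point is that the mass they can concentrate on the relevant span has to be carved out of a total that never grows.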
Cost scales with context. Processing 1M tokens isn’t just 8x more expensive than 128K. It’s often worse due to the quadratic scaling of attention (or the constant overhead of linear attention approximations). Stuffing a million-token window costs $0.50 to $20 per request with current pricing, takes 60+ seconds, and delivers worse results than a focused 8K-token prompt for most tasks.
The fundamental problem is architectural. Lost-in-the-middle isn’t a bug that gets patched with more training data or clever position encodings. Researchers have proposed mitigations (the “Found in the Middle” paper6 showed that plug-and-play positional encoding modifications can help) but no architecture has fully eliminated the U-shaped attention curve. Models are getting better at long context, but the improvement curve is logarithmic, not linear.
The industry is starting to recognize this. The term context engineering, coined by Shopify CEO Tobi Lütke in June 2025, describes the emerging discipline of assembling exactly the right context for each task rather than brute-forcing with maximum context. The shift from “make the window bigger” to “make the context smarter” is the real path forward.
The RAG Rebuttal
Some developers argue that RAG is a temporary workaround: once context windows are large and reliable enough, we can skip retrieval entirely and just feed models everything. The research doesn’t support this.
Even with perfect long-context performance, stuffing everything into the prompt is wasteful. You’re paying to process tokens the model doesn’t need. You’re increasing latency for no benefit. You’re introducing noise that can only hurt, never help.
RAG isn’t a workaround for insufficient context windows. It’s an architecturally sound pattern for the same reason databases use indexes instead of table scans: selective access is faster, cheaper, and more reliable than brute-force traversal, regardless of how much hardware you throw at the brute-force approach.
The sophisticated position in 2026 isn’t “RAG vs. long context.” It’s both: use retrieval to assemble relevant context, then use long context windows to process that curated context with room for the model to reason. The context window isn’t a substitute for retrieval; it’s the workspace where retrieved context gets processed.
Limitations and Counterpoints
Long context windows aren’t useless. They’re oversold. For certain tasks, large windows genuinely help. Summarizing a 100-page document, answering factual questions over a long transcript, or extracting structured data from a lengthy report all benefit from fitting the full source material in context. The lost-in-the-middle effect is most damaging for reasoning-heavy tasks that require synthesizing scattered information. For retrieval-heavy tasks over a single coherent document, larger windows can deliver real gains.
Some of the research cited here also skews toward earlier model generations. The Stanford “Lost in the Middle” paper tested models available in mid-2023. Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o have all shown measurable improvements in long-context benchmarks since then. The Chroma “Context Rot” study is more current (2025) and still finds degradation, but the slope is gentler than what earlier work documented. Architectures are improving, just not as fast as the marketing suggests.
Newer benchmarks like RULER, InfiniteBench, and Chroma’s distractor-aware evaluations are giving a more realistic picture of long-context performance than NIAH alone. These tests are harder and more representative, and models are being trained against them. The gap between headline token counts and practical performance is narrowing, even if it remains significant for complex coding tasks.
Context windows and context engineering also aren’t an either-or proposition. Larger windows give RAG and chunking strategies more room to operate. A 200K window that receives 30K of carefully retrieved context has headroom for the model to reason, include chain-of-thought, and generate lengthy output. None of that is possible if the window barely fits the retrieved content. The argument isn’t against big context windows. It’s against treating them as a substitute for thoughtful context assembly.
Conclusion
Context window size is a marketing metric, not a capability metric. What matters isn’t how many tokens a model accepts but how effectively it uses them. The research consistently shows:

- Models attend most reliably to the beginning and end of their input and neglect the middle.
- Retrieval benchmarks like NIAH flatter long-context performance; reasoning over long context degrades far earlier.
- Distractors and coherent surrounding text compound the degradation.
- For complex tasks, the effective context window is often a fraction of the advertised one.
The next time a vendor announces a 10M-token context window, ask the question that actually matters: what’s the effective context window for the task you care about? For most coding tasks, the honest answer is much smaller than the number on the box.
Footnotes
1. Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics, 2024. arXiv:2307.03172
2. Kamradt, G. (2023). “Needle In A Haystack - Pressure Testing LLMs.” GitHub
3. Chroma Research. (2025). “Context Rot: How Increasing Input Tokens Impacts LLM Performance.” research.trychroma.com
4. Referenced in multiple studies including the Chroma report and Stanford/UC Berkeley long-context evaluations. See also: “The Maximum Effective Context Window for Real World Applications.” arXiv:2509.21361
5. Karpathy, A. (2025). Commentary on context engineering.
6. Zhu, Q., et al. (2024). “Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding.” arXiv:2403.04797
Written by
Evan Musick
Computer Science & Data Science student at Missouri State University. Building at the intersection of AI, software development, and human cognition.