Context Windows Are a Lie: How LLMs Actually Use Long Context
Every major model vendor is in a context window arms race. Gemini advertises 2 million tokens. Claude offers 200K. GPT-5 pushes 400K. The marketing message is clear: throw your entire codebase in, and the model will understand it all.
Except it won’t. A growing body of research shows that LLMs don’t use long context uniformly. They attend strongly to the beginning and end of their input while neglecting the middle, a phenomenon researchers call lost in the middle. And the gap between “can retrieve a fact from 1M tokens” and “can reason over a 100K-token codebase” is vast enough to swallow your entire debugging session.
This matters because developers are making architectural decisions (RAG vs. context stuffing, chunking strategies, what to include in prompts) based on headline context window numbers that don’t reflect how models actually perform. Here’s what the research says and what it means for how you build with LLMs.
The Lost in the Middle Effect
In July 2023, Nelson F. Liu and colleagues at Stanford published “Lost in the Middle: How Language Models Use Long Contexts.”1 The paper tested models on two tasks: multi-document question answering and key-value retrieval. The researchers varied where the relevant information appeared in the input (beginning, middle, or end) and measured accuracy.
The results drew a U-shaped curve. Models performed best when the answer was near the start or end of the context. When the relevant information sat in the middle, accuracy dropped significantly, even for models explicitly designed for long contexts.
This wasn’t a subtle effect. On multi-document QA with 20 documents, some models lost 20+ percentage points of accuracy when the gold document moved from position 1 to position 10. The key-value retrieval task showed similar patterns: near-perfect accuracy for keys at the boundaries, significant degradation for keys in the middle.
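The paper's setup can be sketched in a few lines. This is a minimal illustration of the position sweep, not the authors' actual harness; `ask_model` is a hypothetical stand-in for an LLM client, and the helper names are invented for this example.

```python
# Minimal sketch of the "lost in the middle" position sweep.
# ask_model(prompt) -> str is a hypothetical LLM client, not a real API.

def build_context(gold_doc: str, distractors: list[str], gold_position: int) -> str:
    """Place the gold document at a chosen position among distractors."""
    docs = distractors[:gold_position] + [gold_doc] + distractors[gold_position:]
    return "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))

def position_sweep(gold_doc, distractors, question, answer, ask_model):
    """Check whether the model still finds the answer as the gold document moves."""
    results = {}
    for pos in range(len(distractors) + 1):
        prompt = build_context(gold_doc, distractors, pos) + f"\n\nQuestion: {question}"
        results[pos] = answer.lower() in ask_model(prompt).lower()
    return results
```

Plotting accuracy against `pos` is what produces the U-shaped curve: positions near `0` and near the end score well, the middle sags.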
The paper’s contribution wasn’t discovering that models struggle with long input; practitioners had noticed this informally. Its contribution was quantifying the pattern and showing that it held across multiple model families and context lengths. The U-shaped curve became one of the most cited findings in LLM research.
Why This Happens
The mechanism traces back to how transformer attention works. Self-attention assigns weights to every token based on relevance to the current prediction. In theory, this lets the model attend to any position equally. In practice, positional encodings and training data distributions create biases.
Models see enormous amounts of text during training where the most important information (instructions, key facts, conclusions) tends to appear at the beginning or end. Documents have introductions and conclusions. Conversations have openers and final answers. The model learns these patterns and develops a recency bias and a primacy bias that persist at inference time, even when you deliberately place critical information in the middle.
Think of it like reading a 200-page report under time pressure. You’ll probably read the executive summary closely, skim the middle sections, and pay attention to the conclusion. LLMs do something analogous, not because they’re lazy, but because their attention mechanisms developed similar statistical shortcuts during training.
Needle in a Haystack: The Benchmark That Flatters
The most popular test for long-context performance is Needle in a Haystack (NIAH), created by Greg Kamradt in late 2023.2 The test inserts a specific fact (the “needle”) into a large body of irrelevant text (the “haystack”) at various positions and depths, then asks the model to retrieve it.
Models score impressively on NIAH. Gemini 1.5 Pro achieves over 99.7% recall up to 1 million tokens. Claude 3 Opus performs strongly across its full 200K window. These results get featured in press releases and technical reports as evidence that long context is a solved problem.
It isn’t. NIAH tests retrieval — finding a specific, distinctive fact in a pile of unrelated text. This is the easiest possible long-context task. The needle is deliberately different from the haystack, making it relatively simple for attention mechanisms to locate. Real-world usage demands something harder: reasoning over large contexts where the relevant information isn’t distinct from the surrounding text.
When you feed a 50K-token codebase into a model and ask it to find a bug, the model isn’t looking for a needle in a haystack. It’s looking for a slightly bent needle in a pile of other needles. The relevant code looks like the irrelevant code. The bug might span multiple files. Understanding it requires synthesizing information from different positions in the context.
NIAH tells you that the model can find a sentence about pizza in the middle of Paul Graham essays. It doesn’t tell you whether the model can trace a race condition through three layers of async callback handlers buried 80K tokens into your codebase.
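For concreteness, the NIAH mechanic itself is simple. The sketch below, with a hypothetical `ask_model` client and invented helper names, shows why the benchmark is easy to saturate: the needle is inserted verbatim, so retrieval reduces to string-level pattern matching.

```python
# A minimal NIAH-style probe. ask_model(prompt) -> str is a hypothetical
# LLM client. The needle is inserted at fractional depths into filler
# text, and the model is asked to retrieve it.

def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(haystack) * depth)
    return haystack[:cut] + "\n" + needle + "\n" + haystack[cut:]

def niah_probe(haystack, needle, question, answer, ask_model,
               depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    return {
        depth: answer.lower()
        in ask_model(insert_needle(haystack, needle, depth) + "\n\n" + question).lower()
        for depth in depths
    }
```

Nothing in this loop requires synthesis across positions, which is exactly the criticism: a perfect NIAH score says little about multi-hop reasoning over coherent text.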
Context Rot: The Chroma Study
In 2025, the team at Chroma published “Context Rot,” a systematic study of how 18 state-of-the-art models (including Claude Opus 4, GPT-4.1, Gemini 2.5 Pro, and Qwen3) perform as input length increases.3 Their findings fill in the details of the degradation problem.
Performance Degrades on Even Simple Tasks
Chroma tested models on controlled tasks including retrieval with varying needle-question similarity, distractor handling, and text replication. Even on simple retrieval tasks, performance degraded as context length grew. The degradation was worse when the needle-question similarity was low, meaning the question didn’t obviously match the relevant text. This reflects realistic scenarios where exact keyword matches between questions and answers are rare.
Distractors Compound the Problem
When the researchers added distractor content (text somewhat related to the question but not containing the answer), performance dropped further. Even a single distractor reduced accuracy, and multiple distractors compounded the effect. This maps directly onto how developers use context windows: when you load an entire codebase, most of the code is a “distractor” relative to any specific question.
Coherent Text Is Harder Than Random Text
One counterintuitive finding: models performed worse when the haystack preserved logical flow compared to when it was shuffled randomly. Coherent surrounding text creates more opportunities for the model to latch onto plausible-but-wrong context. Random text is easier to ignore.
This has direct implications for code. Source code is highly structured and coherent. Variable names, function signatures, and import statements create dense webs of semantic relationships. When you dump an entire repository into a context window, the model faces a wall of interconnected, coherent text that makes it harder, not easier, to isolate the specific information it needs.
Model-Specific Patterns
Not all models degrade equally. The Chroma study found that Claude models exhibited the lowest hallucination rates when faced with distractors, tending to abstain rather than confabulate. GPT models showed the highest hallucination rates. Gemini models fell in between.
This tracks with my own experience. When I’m working with large codebases, Claude tends to say “I’d need to see the implementation of that function” rather than guessing. GPT models more often generate plausible-looking but incorrect answers about code they can’t actually resolve from context.
The Gap Between Retrieval and Reasoning
The core problem with context window marketing is that retrieval and reasoning are fundamentally different cognitive operations. Models that excel at one don’t necessarily excel at the other.
Retrieval is finding specific information: “What is the database connection string?” If it’s in the context, the model can usually locate it. This is what NIAH tests.
Reasoning is synthesizing information across positions: “Why does this API endpoint return 500 when the request body includes nested arrays?” Answering this might require correlating a schema definition at position 5K, a validation function at position 30K, a middleware layer at position 45K, and an error handler at position 80K.
Research suggests the effective context window for reasoning tasks is dramatically smaller than for retrieval. Stanford and UC Berkeley researchers found that model correctness starts dropping around 32,000 tokens for complex tasks, even for models advertising windows many times larger.4 For reasoning-heavy tasks, some studies suggest the practical sweet spot is smaller still: a few thousand tokens of highly relevant context outperforms 100K tokens of broadly relevant context.
This gap explains a common developer experience: you paste your entire project into Claude or GPT expecting it to understand the architecture, and instead get a response that fixates on the files near the beginning and end of your prompt while ignoring the critical middleware logic you pasted in the middle.
The Coding Context Problem
Coding tasks make this worse because code has unique properties that stress long-context attention:
- Cross-file dependencies. Understanding a bug often requires context from multiple files: type definitions, implementations, tests, and configuration. These rarely sit adjacent in a concatenated codebase.
- Implicit context. Code relies heavily on naming conventions, patterns, and framework-specific behavior that the model needs to infer, not just retrieve.
- Precision requirements. In natural language, approximate understanding often suffices. In code, a single wrong character (`==` vs `===`, `0` vs `O`) changes everything. Long-context degradation hits harder when the margin for error is zero.
````python
# What developers often do:
prompt = system_prompt + "\n\n"
for file in project_files:
    prompt += f"## {file.path}\n```\n{file.content}\n```\n\n"
prompt += f"\nQuestion: {user_question}"

# What this creates:
# - 80K+ tokens of code, mostly irrelevant to the question
# - Critical files buried in the middle
# - Model attention diluted across hundreds of functions
# - Worse performance than sending 5 relevant files
````
What Developers Should Actually Do
The context window arms race encourages a brute-force approach: just throw everything in and let the model figure it out. The research says this is almost always the wrong strategy.
Curate, Don’t Stuff
The single most impactful thing you can do is send less context, not more. Identify the specific files, functions, and types relevant to your question and include only those. Five well-chosen files will outperform fifty loosely related ones every time.
This is the core argument for RAG over context stuffing. A good retrieval system acts as a filter, surfacing the 2K tokens of relevant context from a 500K-token codebase. The model gets a focused window instead of drinking from a fire hose.
```typescript
// Instead of loading everything, load what matters
const relevantFiles = await findRelatedFiles(userQuestion, {
  // Search by semantic similarity to the question
  semanticSearch: true,
  // Include files that import/export from matched files
  followDependencies: true,
  // Cap at a reasonable context budget
  maxTokens: 8_000,
});

const prompt = buildPrompt(systemPrompt, relevantFiles, userQuestion);
```
Put Important Information at the Boundaries
If you must include a lot of context, structure matters. Place the most critical information at the beginning and end of your prompt. Put the question or instruction at the very end. Put key constraints or system instructions at the very beginning. Let the middle contain supporting context that’s useful but not critical.
This isn’t a hack. It’s working with the model’s attention patterns instead of against them. The U-shaped attention curve means boundary positions get the most reliable processing.
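As a sketch, a prompt builder that respects the boundary principle might look like this. The function and argument names are illustrative, not a real library API:

```python
# One way to work with the U-shaped attention curve rather than against
# it: constraints at the start, the question at the very end, and bulk
# context in the middle.

def build_boundary_prompt(system: str, constraints: list[str],
                          supporting_context: list[str], question: str) -> str:
    parts = [
        system,                              # primacy position: system rules
        "Key constraints:",
        *(f"- {c}" for c in constraints),
        "",
        "Supporting context:",
        *supporting_context,                 # middle: useful but not critical
        "",
        f"Question: {question}",             # recency position: the actual ask
    ]
    return "\n".join(parts)
```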
Use Structured Context, Not Raw Dumps
Instead of pasting raw files, provide structured summaries that help the model understand relationships:
```markdown
## Architecture Overview
- API layer: Express routes in src/routes/
- Business logic: Service classes in src/services/
- Data layer: Prisma models in prisma/schema.prisma

## Relevant Files for This Question
1. src/routes/users.ts (the endpoint returning 500)
2. src/services/UserService.ts (validation logic)
3. prisma/schema.prisma (User model definition)

## Full File Contents
[files here]

## Question
Why does POST /users return 500 when the request includes nested address objects?
```
This structure gives the model a map before dropping it into the territory. The overview at the top (primacy position) and the question at the bottom (recency position) both get strong attention, while the file contents in the middle are pre-contextualized by the architecture summary.
Chunk and Iterate
For complex tasks, break the work into steps rather than trying to solve everything in one massive prompt. Use a first pass to identify relevant code, a second pass to analyze the specific problem, and a third pass to generate the fix. Each pass uses a focused context window rather than a bloated one.
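A bare-bones version of the three-pass pattern, with a hypothetical `ask_model` client and invented names, could be sketched as:

```python
# Sketch of "chunk and iterate". ask_model(prompt) -> str is a
# hypothetical LLM client; in a real tool each pass would be a
# separate, focused request.

def solve_in_passes(repo_files: dict[str, str], question: str, ask_model):
    # Pass 1: identify relevant files from the listing alone (cheap).
    listing = "\n".join(repo_files)
    named = ask_model(f"Files:\n{listing}\n\nWhich files matter for: {question}")
    chosen = [path for path in repo_files if path in named]

    # Pass 2: analyze the problem using only the chosen files.
    context = "\n\n".join(f"## {path}\n{repo_files[path]}" for path in chosen)
    analysis = ask_model(f"{context}\n\nExplain the likely cause of: {question}")

    # Pass 3: generate the fix from the focused analysis.
    return ask_model(f"{analysis}\n\nWrite the minimal fix.")
```

Each prompt stays small: pass 1 sees only file paths, pass 2 only the chosen files, pass 3 only the analysis.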
Good AI coding tools already do this behind the scenes. When Claude Code processes a large repository, it doesn’t paste the entire codebase into a single prompt. It searches for relevant files, reads specific sections, and builds context incrementally. The tool does the context engineering that the raw API leaves to you.
Monitor Your Context Budget
> The LLM is a CPU, the context window is RAM, and your job is to be the operating system.
Andrej Karpathy offered that mental model,5 and it’s the right way to think about context budgets. An OS doesn’t load every file on disk into memory at once. It loads what’s needed for the current operation, manages cache coherence, and evicts what’s no longer relevant.
Think about your context window the same way. You have a budget. Every token you spend on irrelevant context is a token that dilutes attention on relevant context. The goal isn’t to maximize utilization. It’s to maximize the signal-to-noise ratio within your context budget.
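The budget framing can be made concrete as a relevance-ranked filter. This is a sketch under loose assumptions: real token counts should come from the model's tokenizer (e.g. tiktoken for OpenAI models), and the four-characters-per-token heuristic here is only a rough stand-in.

```python
# A rough context-budget filter. len(text) // 4 is a crude
# chars-per-token heuristic, used here to stay dependency-free.

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough average for English prose and code

def fit_to_budget(snippets: list[tuple[float, str]], budget: int) -> list[str]:
    """Keep the highest-relevance snippets that fit the token budget."""
    kept, used = [], 0
    for score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return kept
```

The point of the sort is the signal-to-noise argument: when the budget is tight, low-relevance snippets are dropped first rather than squeezed in.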
Why “Just Make It Bigger” Doesn’t Work
The vendor pitch is seductive: if models struggle at 128K, just train them on 1M. If 1M isn’t enough, push to 10M. Eventually, the window will be big enough that the limitations don’t matter.
This argument fails for three reasons.
Attention is zero-sum. Transformer attention distributes a fixed probability mass across all tokens. Making the window larger means each token gets a smaller share of attention on average. You can’t fix attention dilution by adding more tokens to dilute across.
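The dilution is visible in the math itself. In the degenerate case where every token is equally plausible, softmax hands each one exactly 1/N of the attention mass, so growing the window from 1K to 100K tokens cuts the average per-token weight by 100x:

```python
# Illustration of attention dilution: softmax over N uniform scores
# gives each token 1/N of the probability mass.

import math

def softmax(scores: list[float]) -> list[float]:
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

for n in (1_000, 10_000, 100_000):
    per_token = softmax([0.0] * n)[0]  # uniform relevance scores
    print(f"{n:>7} tokens -> {per_token:.6f} attention each")
```

Real attention heads are not uniform, of course; the point is that the mass they can concentrate on the relevant span has to be carved out of a total that never grows.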
Cost scales with context. Processing 1M tokens isn’t just 8x more expensive than 128K. It’s often worse due to the quadratic scaling of attention (or the constant overhead of linear attention approximations). Stuffing a million-token window costs $0.50 to $20 per request with current pricing, takes 60+ seconds, and delivers worse results than a focused 8K-token prompt for most tasks.
The fundamental problem is architectural. Lost-in-the-middle isn’t a bug that gets patched with more training data or clever position encodings. Researchers have proposed mitigations (the “Found in the Middle” paper6 showed that plug-and-play positional encoding modifications can help) but no architecture has fully eliminated the U-shaped attention curve. Models are getting better at long context, but the improvement curve is logarithmic, not linear.
The industry is starting to recognize this. The term context engineering, coined by Shopify CEO Tobi Lütke in June 2025, describes the emerging discipline of assembling exactly the right context for each task rather than brute-forcing with maximum context. The shift from “make the window bigger” to “make the context smarter” is the real path forward.
The RAG Rebuttal
Some developers argue that RAG is a temporary workaround: once context windows are large and reliable enough, we can skip retrieval entirely and just feed models everything. The research doesn’t support this.
Even with perfect long-context performance, stuffing everything into the prompt is wasteful. You’re paying to process tokens the model doesn’t need. You’re increasing latency for no benefit. You’re introducing noise that can only hurt, never help.
RAG isn’t a workaround for insufficient context windows. It’s an architecturally sound pattern for the same reason databases use indexes instead of table scans: selective access is faster, cheaper, and more reliable than brute-force traversal, regardless of how much hardware you throw at the brute-force approach.
The sophisticated position in 2026 isn’t “RAG vs. long context.” It’s both: use retrieval to assemble relevant context, then use long context windows to process that curated context with room for the model to reason. The context window isn’t a substitute for retrieval; it’s the workspace where retrieved context gets processed.
Limitations and Counterpoints
Long context windows aren’t useless. They’re oversold. For certain tasks, large windows genuinely help. Summarizing a 100-page document, answering factual questions over a long transcript, or extracting structured data from a lengthy report all benefit from fitting the full source material in context. The lost-in-the-middle effect is most damaging for reasoning-heavy tasks that require synthesizing scattered information. For retrieval-heavy tasks over a single coherent document, larger windows can deliver real gains.
Some of the research cited here also skews toward earlier model generations. The Stanford “Lost in the Middle” paper tested models available in mid-2023. Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o have all shown measurable improvements in long-context benchmarks since then. The Chroma “Context Rot” study is more current (2025) and still finds degradation, but the slope is gentler than what earlier work documented. Architectures are improving, just not as fast as the marketing suggests.
Newer benchmarks like RULER, InfiniteBench, and Chroma’s distractor-aware evaluations are giving a more realistic picture of long-context performance than NIAH alone. These tests are harder and more representative, and models are being trained against them. The gap between headline token counts and practical performance is narrowing, even if it remains significant for complex coding tasks.
Context windows and context engineering also aren’t an either-or proposition. Larger windows give RAG and chunking strategies more room to operate. A 200K window that receives 30K of carefully retrieved context has headroom for the model to reason, include chain-of-thought, and generate lengthy output. None of that is possible if the window barely fits the retrieved content. The argument isn’t against big context windows. It’s against treating them as a substitute for thoughtful context assembly.
Conclusion
Context window size is a marketing metric, not a capability metric. What matters isn’t how many tokens a model accepts but how effectively it uses them. The research consistently shows:

- Models attend most reliably to the beginning and end of their input and neglect the middle.
- Retrieval benchmarks like NIAH flatter long-context performance; reasoning over long context degrades far earlier.
- Distractors and coherent surrounding text compound the degradation.
- For complex tasks, the effective context window is often a fraction of the advertised one.
The next time a vendor announces a 10M-token context window, ask the question that actually matters: what’s the effective context window for the task you care about? For most coding tasks, the honest answer is much smaller than the number on the box.
Footnotes
1. Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics, 2024. arXiv:2307.03172
2. Kamradt, G. (2023). “Needle In A Haystack - Pressure Testing LLMs.” GitHub
3. Chroma Research. (2025). “Context Rot: How Increasing Input Tokens Impacts LLM Performance.” research.trychroma.com
4. Referenced in multiple studies including the Chroma report and Stanford/UC Berkeley long-context evaluations. See also: “The Maximum Effective Context Window for Real World Applications.” arXiv:2509.21361
5. Karpathy, A. (2025). Commentary on context engineering.
6. Zhu, Q., et al. (2024). “Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding.” arXiv:2403.04797
Written by
Evan Musick
Computer Science & Data Science student at Missouri State University. Building at the intersection of AI, software development, and human cognition.