Deep Dive · Developer Psychology

The Homogenization Engine: How LLMs Are Shrinking Cognitive Diversity

February 26, 2026 · 12 min read

Every writer who uses ChatGPT to draft a blog post writes a better blog post. Every developer who uses Copilot to scaffold a function ships faster. Every team that pipes their brainstorm through Claude gets more polished ideas. The individual gains are real, measurable, and consistent across studies. And they are hiding something.

When 293 writers used GPT-4 to help write short stories, independent evaluators rated their work as more novel, more useful, and more enjoyable than stories written without AI assistance.1 But the stories were also 10.7% more similar to each other. The writers got better. The writing got flatter. Anil Doshi and Oliver Hauser, the researchers behind that study, called it a social dilemma: “writers are individually better off, but collectively a narrower scope of novel content is produced.”

That finding isn’t an edge case. It’s the central result of a converging line of research spanning cognitive science, computer science, and creativity studies. The pattern is consistent: LLMs boost individual output quality while compressing the diversity of ideas across groups. The mechanism isn’t mysterious. It’s baked into how language models work. And the implications extend well beyond writing, into how developers architect software, how teams solve problems, and how entire industries think.

Background: Why Next-Token Prediction Converges

To understand why LLMs homogenize output, start with what they are: probability machines. A language model predicts the next token based on statistical patterns learned from training data. The training objective (minimize prediction error across a massive corpus) inherently favors modal responses. The most likely continuation. The center of the distribution.

This is a feature, not a bug, when you want fluent, coherent text. But it means that when a thousand different users ask the same model to brainstorm solutions to a problem, the model gravitates toward the same high-probability regions of idea space. It doesn’t explore the tails. It doesn’t produce the weird, low-probability combination that turns out to be a breakthrough. It produces the statistical consensus of its training data.
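The pull toward the mode is easy to demonstrate. Here is a toy sketch with made-up logits (no real model involved): at low sampling temperature, a thousand independent "users" nearly all land on the same high-probability continuation, while a higher temperature keeps the tails alive.

```python
import numpy as np

# Toy next-token distribution over five candidate continuations.
# These logits are illustrative only, not from any real model.
tokens = ["the obvious idea", "a common variant", "a safe choice",
          "an unusual angle", "a weird breakthrough"]
logits = np.array([4.0, 3.0, 2.5, 0.5, -1.0])

def sample(logits, temperature, rng):
    """Sample one token index from the softmax at the given temperature."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)

# A thousand independent "users" asking the same question.
low_temp = [tokens[sample(logits, 0.2, rng)] for _ in range(1000)]
high_temp = [tokens[sample(logits, 1.5, rng)] for _ in range(1000)]

print(len(set(low_temp)))   # tiny: almost everyone gets the modal answer
print(len(set(high_temp)))  # larger: the tails get explored
```

Production chat models typically run at moderate temperatures with further narrowing from RLHF, so real deployments sit closer to the low-temperature end of this sketch.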

Three mechanisms compound the convergence. First, training corpora overrepresent certain cultures, languages, and reasoning styles, particularly those from WEIRD societies (Western, Educated, Industrialized, Rich, Democratic). Second, reinforcement learning from human feedback (RLHF) further narrows the output distribution by penalizing responses that human raters find unusual, offensive, or unhelpful. Third, next-token prediction itself acts as a smoothing function: high-probability continuations get reinforced, low-probability ones get suppressed. Every layer of the training pipeline pushes toward the center.

The result is a system that’s individually helpful and collectively flattening. A single user sees better output. A population of users sees less varied output. And because the flattening happens at the group level, no individual user has any reason to notice it.

Methodology: How Researchers Measure Homogenization

Measuring homogenization requires comparing the diversity of outputs across groups, not evaluating individual quality. The standard approach across recent studies uses sentence embeddings (vector representations of text meaning) and computes pairwise cosine similarity between all outputs in a group. Higher average similarity means less diversity. Lower means more.

The most common embedding model is all-MiniLM-L6-v2 from the Sentence-BERT family, used by Anderson et al. (2024)2 and Wenger and Kenett (2025)3 in their creativity studies. A complementary approach uses the open-source diversity Python package4, which implements nine metrics including compression ratio, Self-BLEU, and BERTScore homogenization. The researchers behind that package recommend reporting at minimum three metrics: compression ratio (fast, correlates well with expensive alternatives), Self-Repetition, and Self-BLEU, because they capture orthogonal dimensions of textual sameness.

One critical methodological point: text length confounds every diversity metric. Longer texts naturally score lower on diversity measures, so length must be controlled or reported alongside scores. Studies that ignore this confound overstate homogenization effects.
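The two cheapest metrics above can be sketched in a few lines. This is a minimal illustration with synthetic vectors standing in for sentence embeddings, not the diversity package's implementation:

```python
import itertools
import zlib

import numpy as np

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all pairs of output embeddings.
    Higher values mean a more homogeneous group of texts."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = [float(a @ b) for a, b in itertools.combinations(normed, 2)]
    return sum(sims) / len(sims)

def compression_ratio(texts):
    """Ratio of raw size to compressed size of the concatenated corpus.
    Repetitive (homogeneous) corpora compress better, so the ratio rises."""
    blob = "\n".join(texts).encode("utf-8")
    return len(blob) / len(zlib.compress(blob))

# Stand-in embeddings: in a real pipeline these would come from a sentence
# encoder such as all-MiniLM-L6-v2; random vectors are used for illustration.
rng = np.random.default_rng(1)
diverse = rng.normal(size=(10, 384))                      # unrelated outputs
clustered = rng.normal(size=(1, 384)) + 0.1 * rng.normal(size=(10, 384))

print(mean_pairwise_cosine(diverse))    # near 0: high diversity
print(mean_pairwise_cosine(clustered))  # near 1: near-duplicates
```

Per the length caveat above, compression ratios should only be compared across corpora of similar total length.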

With this toolkit in hand, the research paints a consistent picture.

Findings

Individual Creativity Goes Up, Consistently

The Doshi and Hauser experiment remains the most rigorous demonstration of the individual-collective paradox.1 Published in Science Advances in 2024, it used a preregistered design with 293 writers and 600 independent evaluators who provided 3,519 story evaluations in total.

Writers were randomly assigned to three conditions: write a short story alone, write with one GPT-4 story idea, or write with up to five GPT-4 story ideas. The task was eight-sentence microfiction on assigned topics.

The individual gains were unambiguous:

Metric              | 1 AI Idea       | 5 AI Ideas
Novelty             | +5.4% (P=0.021) | +8.1% (P<0.001)
Usefulness          | +3.7% (P=0.039) | +9.0% (P<0.001)
Better written      | +22.4%          | +37.2% (P<0.001)
More enjoyable      | +18.6%          | +37.5% (P<0.001)
Contains plot twist | +31.1%          | +46.8% (P<0.001)

Less creative writers, as measured by the Divergent Association Task (DAT), benefited most. Those in the bottom half of inherent creativity saw a 10.7% novelty increase and 11.5% usefulness increase with five AI ideas. More creative writers saw minimal improvement; they were already performing well without AI.

“Writers are individually better off, but collectively a narrower scope of novel content is produced.” — Doshi & Hauser, Science Advances (2024)

This equalization effect has an intuitive appeal: AI as a leveler, bringing everyone up to a baseline. But it conceals the population-level cost.

Collective Diversity Goes Down, Also Consistently

In the same experiment, Doshi and Hauser measured pairwise similarity between all stories within each condition. Stories written with one AI idea were 10.7% more similar to each other than stories written without AI. Stories with five AI ideas were 8.9% more similar. The AI-assisted stories were also 5.0-5.2% more similar to the original AI-generated prompts than human-only stories were, confirming that the model’s suggestions anchored writers toward overlapping regions of idea space.

Anderson, Shah, and Kreminski (2024) replicated the group-level finding with a different design.2 Their within-subjects study had 33 participants complete divergent thinking tasks using both ChatGPT and Oblique Strategies (a deck of deliberately lateral creative prompts created by Brian Eno and Peter Schmidt). Measured by cosine similarity of sentence embeddings:

  • Group-level semantic similarity: ChatGPT condition was significantly more homogeneous (d=0.47, P=0.038)
  • Individual-level semantic similarity: No significant difference (P=0.352)

This is the critical distinction. Individual users maintained similar diversity regardless of tool. The homogenization emerged at the group level: different users given similar AI suggestions produced similar outputs. The mechanism isn’t that AI makes you less creative. It’s that AI gives everyone the same creativity.

Anderson’s team also found that ChatGPT users reported significantly lower personal responsibility for their ideas (48.2% vs 63.6% with Oblique Strategies, P=0.003). They generated more ideas per session and covered more categories, but the ideas converged toward a narrower collective set. More fluent. Less distinct.

The Problem Isn’t One Model. It’s All of Them

If homogenization were a quirk of ChatGPT specifically, the fix would be simple: use a different model. Emily Wenger and Yoed Kenett (2025) tested whether this escape hatch exists.3 It doesn’t.

They administered three standardized divergent thinking tests (the Alternative Uses Task (AUT), Forward Flow, and the Divergent Association Task) to 22 LLMs from distinct model families and 102 human participants. The results were stark:

Test         | LLM Variability | Human Variability | Effect Size
AUT          | 0.459           | 0.738             | 2.2
Forward Flow | 0.534           | 0.835             | 2.0
DAT          | 0.665           | 0.819             | 1.4

Effect sizes above 0.8 are typically considered “large.” These are in the 1.4-2.2 range. The diversity gap between LLM populations and human populations is not subtle.
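The 0.8 "large" threshold is Cohen's convention for d, the standardized mean difference between two groups. A minimal computation, using made-up variability scores rather than the study's raw data:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d: difference in group means divided by the pooled
    standard deviation of the two samples."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (mb - ma) / pooled_sd

# Illustrative numbers only, not the study's data: per-model and
# per-human variability scores whose means sit roughly two pooled
# standard deviations apart.
llm_scores = [0.30, 0.55, 0.40, 0.62, 0.35, 0.50]
human_scores = [0.60, 0.95, 0.70, 0.85, 0.55, 0.75]
print(round(cohens_d(llm_scores, human_scores), 2))  # well into "large"
```

With gaps this wide, the two populations barely overlap: a randomly chosen human group is almost always more variable than a randomly chosen set of LLMs.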

Even when the researchers used creative-focused system prompts, instructing models to be “imaginative” and think “outside the box,” individual scores improved, but population-level diversity stayed far lower than humans. The gap is structural. Different models, trained on overlapping data with similar objectives, converge on overlapping output distributions.

Scale Amplifies the Effect

The 2025 PNAS study “Echoes in AI” confirmed what the smaller experiments implied: homogenization gets worse at scale.5 Analyzing LLM-generated short stories, the researchers found that AI-produced narratives contained repetitive combinations of plot elements. Human-written stories maintained higher uniqueness. The diversity gap widened with more outputs, too. The more stories generated, the more the AI-produced corpus converged.

This has implications beyond creative writing. By April 2025, 74.2% of newly created webpages contained some AI-generated text.6 AI-written pages in Google’s top-20 search results climbed from 11.11% to 19.56% between May 2024 and July 2025. As AI-generated content saturates the web, it becomes training data for next-generation models. Shumailov et al. (2024) demonstrated in Nature that this recursive loop causes model collapse: models trained on AI-generated data lose their grasp of rare events and drift toward “bland central tendencies.”6

The feedback loop looks like this: LLMs produce homogeneous content. That content enters the training data for future models. Those models produce even more homogeneous content. The tails of the distribution, where unusual ideas, minority perspectives, and creative breakthroughs live, erode with each generation.
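The loop can be caricatured in code. A toy simulation, assuming a crude modal bias (each generation keeps only its most typical samples before refitting), not the actual setup of the Nature paper:

```python
import random
import statistics

random.seed(42)

def next_generation(samples, keep_frac=0.8):
    """Train the 'next model' on the previous generation's output,
    but drop the most atypical samples first -- a crude stand-in for
    the modal bias of next-token prediction plus RLHF."""
    mean = statistics.fmean(samples)
    typical = sorted(samples, key=lambda x: abs(x - mean))
    kept = typical[: int(len(typical) * keep_frac)]
    mu = statistics.fmean(kept)
    sigma = statistics.stdev(kept)
    return [random.gauss(mu, sigma) for _ in range(len(samples))]

# Generation 0: "human" data with full variance.
population = [random.gauss(0.0, 1.0) for _ in range(2000)]
spreads = [statistics.stdev(population)]
for _ in range(10):
    population = next_generation(population)
    spreads.append(statistics.stdev(population))

# The standard deviation shrinks generation over generation:
# the tails of the distribution are the first thing lost.
print(f"{spreads[0]:.2f} -> {spreads[-1]:.2f}")
```

Even a mild per-generation bias compounds: each round trims a little of the tails, and after a handful of generations most of the original variance is gone.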

Beyond Language: Reasoning Itself Narrows

Sourati and Dehghani’s 2026 review in Trends in Cognitive Sciences extends the homogenization argument beyond style and into cognition.7 They argue that LLMs don’t just standardize how people write. They standardize how people think.

Their synthesis draws on evidence from linguistics, psychology, cognitive science, and computer science. The core claim: LLMs favor linear modes of reasoning, particularly chain-of-thought approaches that require step-by-step logical progression. This emphasis, embedded in both training data and RLHF objectives, reduces reliance on intuitive, associative, and abstract reasoning styles, which are sometimes more efficient than linear reasoning and which contribute to cognitive diversity across populations.

The paper also highlights a mechanism that operates beneath conscious awareness. Users increasingly accept model-generated suggestions as “good enough,” shifting creative control from the human to the algorithm through thousands of micro-decisions.8 Over time, the user’s sense of “what sounds right” becomes trained by AI interactions. Approval starts to feel like authorship, even when the generative work happened elsewhere.

“When these differences are mediated by the same LLMs, their distinct linguistic style, perspective, and reasoning strategies become homogenized.” — Zhivar Sourati, USC

Discussion

The Contradicting Evidence Matters

The homogenization finding is strong but not universal. One large-scale experiment complicates the narrative. Ashkinaze et al. (2024) ran a dynamic experiment with 844 participants across 48 countries using the Alternate Uses Task.9 Participants in high-AI-exposure conditions, where they passively saw AI-generated ideas before generating their own, actually produced more diverse ideas collectively. The effect was significant (delta=0.31, P=0.001).

The key difference: participants were passively exposed to AI ideas, not actively using AI as a generation tool. They saw the ideas and then came up with their own, rather than starting from or incorporating AI suggestions directly. The researchers concluded that AI served as an “accessible idea bank” that countered the natural convergence tendencies of groups.

This distinction matters. It suggests the vector of interaction determines the outcome. When AI outputs serve as a springboard (something to react to, diverge from, or be inspired by) they can diversify thinking. When AI outputs serve as a draft (something to accept, edit, or refine) they compress it. The difference between “here’s an idea, now come up with your own” and “here’s your idea, fix it if you want” is the difference between divergence and convergence.

A separate crossover study in higher education found that AI use “did not reduce fluency, flexibility, or originality, nor did it lead to thematic homogenisation.” Task complexity and domain expertise may moderate the effect. The homogenization finding holds up in the general case but probably has boundary conditions that researchers are still mapping.

What This Means for Developers

The research focuses on creative writing and divergent thinking, but the implications for software development are hard to ignore. When developers across teams use the same models to generate code, architect systems, and solve problems, the same convergence pressure applies. If Copilot suggests similar patterns to every developer working on a similar problem, the collective diversity of approaches across the industry narrows, even as individual developers ship faster and with fewer bugs.

Consider architecture decisions. If thousands of developers ask Claude or GPT how to structure a real-time data pipeline, they’ll get overlapping recommendations. The models have learned the most common patterns from their training data and will surface those patterns disproportionately. Unusual but potentially superior architectures, the ones that live in the tails of the distribution, get deprioritized by the same statistical convergence that makes LLMs fluent.

This doesn’t mean AI-assisted development is bad. It means the benefits are unevenly distributed across scales. Individual developers gain speed and quality. The industry potentially loses architectural diversity and the exploratory failures that sometimes produce innovations.

The Measurement Problem

What makes this research unsettling is how invisible the cost is. No individual user sees their output getting less diverse, because it isn’t, at the individual level. The homogenization only appears when you compare outputs across users. Most organizations don’t measure collective idea diversity. Most teams don’t track whether their solutions are converging over time. Population-level diversity, the metric that actually matters, is one that almost nobody monitors.

This creates a classic negative externality. Each user rationally adopts AI tools (the individual benefit is real and immediate) while the collective cost accumulates unnoticed. Without deliberate measurement and intervention, the default trajectory is toward less diverse thought, not more.

Limitations of the Current Research

The evidence has gaps. Most studies use short-form creative tasks: eight-sentence stories, divergent thinking prompts, brainstorming exercises. Whether the homogenization effect holds for longer, more complex creative work (novels, software architectures, research programs) remains an open question. The Doshi and Hauser study excluded expert writers. Anderson’s sample was 33 people. Wenger and Kenett’s preprint hasn’t been peer-reviewed yet.

The contradicting evidence from Ashkinaze et al. also introduces genuine ambiguity about when and how homogenization manifests. Passive versus active AI use appears to matter, but the boundary between those modes is blurry in real-world workflows where people alternate between generating with AI and generating independently.

And the research is almost entirely in English. Whether the homogenization effect is stronger or weaker in other languages, particularly those less represented in training data, is unknown.

Conclusion

The research converges on a finding that should reshape how we think about AI-assisted work:

  • LLMs reliably improve individual creative output, particularly for less skilled practitioners, while reducing the diversity of ideas produced across groups. This is a structural feature of next-token prediction, not a fixable bug.
  • The effect spans models. Twenty-two different LLMs produced far less diverse creative output than 102 humans, with effect sizes between 1.4 and 2.2. Switching models doesn’t solve it.
  • The effect scales. More AI-generated content means less diversity, and that content entering training data creates a recursive feedback loop toward increasingly homogeneous outputs.
  • The cost is invisible to individuals. No single user sees reduced diversity in their own work. The compression only appears at the population level, a dimension almost no one measures.
  • Interaction design matters. Passive exposure to AI ideas can increase diversity; active use of AI as a generation tool tends to decrease it. How we integrate AI into workflows determines whether it diversifies or compresses collective thought.

Open questions:

  • Does the homogenization effect hold for long-form, complex creative work and expert practitioners, or is it primarily a phenomenon of short-form tasks by non-experts?
  • How do different interaction patterns (AI as draft vs. AI as provocation vs. AI as editor) modulate the collective diversity outcome?
  • Can diversity-aware training objectives or decoding strategies counteract the convergence pressure without sacrificing output quality?
  • What does homogenization look like in code generation? Are developers converging on similar architectures and patterns at the industry level?

The irony is worth sitting with. We built tools that make each of us more creative, and the aggregate effect might be making all of us more alike. The question isn’t whether to use these tools; the individual benefits are too real for that. The question is whether we treat cognitive diversity as a resource worth measuring and protecting, or whether we let it erode because the loss is invisible to every person contributing to it.

Footnotes

  1. Doshi, A.R., & Hauser, O.P. (2024). “Generative AI enhances individual creativity but reduces the collective diversity of novel content.” Science Advances, 10(28), eadn5290.

  2. Anderson, B.R., Shah, J.H., & Kreminski, M. (2024). “Homogenization Effects of Large Language Models on Human Creative Ideation.” Proceedings of the 16th ACM Conference on Creativity & Cognition. arXiv:2402.01536.

  3. Wenger, E., & Kenett, Y. (2025). “We’re Different, We’re the Same: Creative Homogeneity Across LLMs.” Preprint, not peer-reviewed. arXiv:2501.19361.

  4. Meister, C., et al. (2024). “Standardizing the Measurement of Text Diversity: A Tool and Comparative Analysis.” arXiv:2403.00553.

  5. “Echoes in AI: Quantifying lack of plot diversity in LLM outputs.” (2025). Proceedings of the National Academy of Sciences.

  6. Shumailov, I., et al. (2024). “AI models collapse when trained on recursively generated data.” Nature.

  7. Sourati, Z., & Dehghani, M. (2026). “The homogenizing effect of large language models on human expression and thought.” Trends in Cognitive Sciences.

  8. “AI Is Quietly Colonizing How You Think.” (2026). Psychology Today.

  9. Ashkinaze, J., Mendelsohn, J., Qiwei, L., Budak, C., & Gilbert, E. (2024). “How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas.” arXiv:2401.13481.

Written by

Evan Musick

Computer Science & Data Science student at Missouri State University. Building at the intersection of AI, software development, and human cognition.
