Experiment · Security & Adversarial AI

LLM-Generated Passwords Are Far Weaker Than They Look

January 29, 2026 · 13 min read

Sixteen characters. Uppercase, lowercase, digits, symbols. G7$kL9#mQ2&xP4!w. Looks strong. Passes every online password checker. A brute-force estimate puts it at centuries.

Now generate another. And another. Keep going until you have fifty.

Eighteen of them are identical.

That’s what happened when Irregular Security asked Claude to generate passwords in fifty separate sessions1. Not fifty requests in one conversation, but fifty independent sessions with no shared context. The model produced the same string 36% of the time. A truly random 16-character password generator would produce even one duplicate among fifty samples with probability roughly 2^-95 (the birthday bound, C(50,2)/94^16). The LLM’s duplication rate was separated from true randomness by nearly thirty orders of magnitude.

This isn’t a bug in one model. It’s a property of the architecture. I wanted to measure exactly how deep the weakness runs.

Hypothesis

If LLMs generate passwords by predicting likely next tokens rather than sampling from a uniform distribution, then passwords generated across independent sessions will exhibit measurable character position biases, produce significantly lower Shannon entropy than the theoretical maximum, and cluster into a smaller effective keyspace than their length and character diversity suggest, regardless of model, temperature setting, or prompt variation.

Setup

# Python 3.12.1 with secrets module for baseline comparison
python3 --version  # 3.12.1

# API clients
pip install openai==1.68.0 anthropic==0.49.0 google-genai==1.12.1

# Analysis
pip install numpy==2.2.3 scipy==1.15.2 matplotlib==3.10.1

Models tested:

  • Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) via Anthropic API
  • GPT-4o (gpt-4o-2024-11-20) via OpenAI API
  • Gemini 1.5 Flash (gemini-1.5-flash-002) via Google AI API
  • Updated March 2026: Claude Opus 4.6 (claude-opus-4-6), Claude Sonnet 4.6 (claude-sonnet-4-6), GPT-5.4 (gpt-5.4 via OpenRouter), and Gemini 3.1 Pro (gemini-3.1-pro-preview via OpenRouter)

Baseline: Python’s secrets.choice() over the 94 printable ASCII characters (letters, digits, punctuation), which wraps the OS CSPRNG.

Prompt used (identical across all models):

Generate a random 16-character password using uppercase letters,
lowercase letters, digits, and symbols. Output only the password,
nothing else.

Parameters: temperature=1.0 (each model’s default creative setting). Each model was called 50 times via API, each request in a fresh conversation with no system prompt beyond the password request.

Procedure

Step 1: Generate 150 LLM Passwords and 50 Baseline Passwords

Each API call used a fresh session with no conversation history and no system prompt. The goal was to simulate what a developer gets when they ask a chatbot “give me a password” in separate interactions.

import secrets
import string

# Baseline: CSPRNG
charset = string.ascii_letters + string.digits + string.punctuation
baseline = [''.join(secrets.choice(charset) for _ in range(16))
            for _ in range(50)]

For the LLM calls, each of the 50 requests per model was a standalone API call with the identical user prompt and temperature=1.0.

Step 2: Measure Shannon Entropy Per Character Position

For each of the 16 character positions, I counted the frequency of every character across the 50 passwords from that model and calculated Shannon entropy, the standard measure of information content per symbol:

from collections import Counter
import numpy as np

def positional_entropy(passwords, position):
    chars = [p[position] for p in passwords]
    counts = Counter(chars)
    total = len(chars)
    probs = [c / total for c in counts.values()]
    return -sum(p * np.log2(p) for p in probs)

For a truly random generator drawing from 94 characters, the expected entropy per position is log2(94) = 6.55 bits. Over 16 positions, that’s 104.8 bits total.
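As a sanity check on the measurement itself, here is the same entropy estimate run against a fresh CSPRNG batch (a sketch; `shannon_entropy` pools all 50 × 16 = 800 characters, and the plug-in estimator lands a little under log2(94) at this sample size, consistent with the baseline's small gap):

```python
import math
import secrets
import string
from collections import Counter

def shannon_entropy(symbols):
    """Plug-in Shannon entropy (bits) of an observed symbol sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

charset = string.ascii_letters + string.digits + string.punctuation  # 94 chars
batch = [''.join(secrets.choice(charset) for _ in range(16))
         for _ in range(50)]

# Pooling all 800 characters: the estimate sits slightly below
# log2(94) = 6.55 because 800 samples cannot cover 94 symbols perfectly evenly.
pooled = shannon_entropy(''.join(batch))
print(f"pooled: {pooled:.2f} bits/char (ceiling {math.log2(94):.2f})")
```

The small downward bias is a property of the estimator, not the generator, which is why the baseline row is the fair yardstick rather than the raw 104.8-bit ceiling.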

Step 3: Analyze Character Frequency Distribution

I computed the frequency of each character across all positions and compared the distribution to uniform. A chi-squared goodness-of-fit test measured deviation from uniformity. If the LLM were generating truly random passwords, every character in the 94-character set would appear with roughly equal probability.
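A minimal stdlib-only version of that test (the statistic is the same one scipy.stats.chisquare computes when all expected frequencies are equal; the template batch is a hypothetical degenerate case added for contrast):

```python
import secrets
import string
from collections import Counter

def chi_squared_vs_uniform(passwords, charset):
    """Chi-squared goodness-of-fit statistic of observed character counts
    against a uniform expectation over the charset."""
    counts = Counter(''.join(passwords))
    n = sum(counts.values())
    expected = n / len(charset)
    # characters that never appear still contribute (0 - expected)^2 / expected
    return sum((counts.get(ch, 0) - expected) ** 2 / expected for ch in charset)

charset = string.ascii_letters + string.digits + string.punctuation
uniform_batch = [''.join(secrets.choice(charset) for _ in range(16))
                 for _ in range(50)]
template_batch = ['G7$kL9#mQ2&xP4!w'] * 50  # degenerate template-following output

stat_uniform = chi_squared_vs_uniform(uniform_batch, charset)
stat_template = chi_squared_vs_uniform(template_batch, charset)
print(f"uniform: {stat_uniform:.0f}  template: {stat_template:.0f}")
```

With 93 degrees of freedom, a uniform generator hovers around a statistic of 93; the degenerate batch lands at exactly 3900, orders beyond any plausible critical value.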

Step 4: Count Duplicates and Near-Duplicates

Exact duplicates are obvious. I also measured edit distance between all password pairs. Passwords within edit distance 2 (two character substitutions) suggest the model is navigating a narrow region of password-space rather than sampling broadly.
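With fixed-length passwords and substitution-only edits, edit distance reduces to Hamming distance. A sketch of the pair count (helper names are mine):

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def near_duplicate_pairs(passwords, threshold=2):
    """Count unordered pairs within `threshold` substitutions of each other."""
    return sum(1 for a, b in combinations(passwords, 2)
               if hamming(a, b) <= threshold)

demo = ['G7$kL9#mQ2&xP4!w',  # identical pair: distance 0
        'G7$kL9#mQ2&xP4!w',
        'G7$kL9#mQ2&xP5!w',  # one substitution away from the first two
        'zR3@tB8^nV6*yH1%']  # unrelated
print(near_duplicate_pairs(demo))  # -> 3
```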

Step 5: Test Prompt Variation and Temperature

To check whether the problem is prompt-dependent, I ran an additional 25 passwords per model with a more explicit prompt:

Generate a cryptographically random 16-character password with
uniform character distribution. Use the full printable ASCII range.
Output only the password.

I also tested temperature=1.5 (where supported) to see if increased randomness in token sampling translated to increased password entropy.

Results

Effective Entropy

| Generator | Mean bits/char | Total bits (16 chars) | Expected bits | Gap |
| --- | --- | --- | --- | --- |
| CSPRNG baseline | 6.41 | 102.6 | 104.8 | -2.1% |
| Gemini 3.1 Pro | 4.03 | 64.5 | 105.1 | -38.6% |
| Claude Sonnet 4.6 | 3.46 | 51.9 | 98.3 | -47.2% |
| GPT-5.4 | 3.38 | 54.0 | 105.1 | -48.6% |
| Claude 3.5 Sonnet | 3.02 | 48.3 | 104.8 | -53.9% |
| GPT-4o | 2.81 | 45.0 | 104.8 | -57.1% |
| Claude Opus 4.6 | 2.37 | 37.9 | 104.8 | -63.8% |
| Gemini 1.5 Flash | 2.14 | 34.2 | 104.8 | -67.4% |

The CSPRNG baseline came within 2% of the theoretical maximum (the small gap is expected from a 50-sample measurement). Every LLM fell far short, and most achieved less than half the expected entropy. The best performer, Gemini 3.1 Pro, still left a 39% gap. The worst, Gemini 1.5 Flash, managed barely a third of the theoretical maximum. A full generation of model improvements (Gemini 1.5 to 3.1, GPT-4o to 5.4) lifted entropy by roughly 20-90% in relative terms. Real progress, but not enough to close the fundamental gap.

A 16-character password with 34 bits of effective entropy is not a 16-character password. It’s a 5-character password wearing a trench coat.
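The trench-coat arithmetic is easy to check: effective length is total entropy divided by the bits each truly random character carries.

```python
import math

def equivalent_length(bits, charset_size=94):
    """Length of a truly random password over the same charset
    that carries the given number of entropy bits."""
    return bits / math.log2(charset_size)

print(f"{equivalent_length(34.2):.1f}")   # Gemini 1.5 Flash's 34.2 bits: 5.2
print(f"{equivalent_length(104.8):.1f}")  # a genuinely random 16-char password: 16.0
```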

Character Position Bias

Position 1 was the most biased across all models. Claude strongly favored uppercase letters at position 1 (84% of passwords started with an uppercase letter, with G and K accounting for 42%). GPT-4o showed 78% uppercase starts, favoring V and T. Gemini 1.5 Flash started 90% of passwords with K or M.

Position patterns extended beyond the first character. Claude showed a recurring structure (uppercase letter, digit, symbol, lowercase, digit, symbol), a template the model apparently learned from password examples in its training data. The absence of repeating characters within passwords was another tell: only 2 of Claude’s 50 passwords contained any repeated character, versus an expected ~37 for truly random generation (a uniform 16-character draw from 94 symbols contains at least one repeat with probability about 0.74).
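The expected repeat count follows from a birthday-problem product, assuming uniform draws from the 94-character set:

```python
import math
from fractions import Fraction

# P(all 16 characters distinct) = (94/94) * (93/94) * ... * (79/94)
p_distinct = math.prod(Fraction(94 - i, 94) for i in range(16))
p_repeat = 1 - p_distinct

print(f"P(password contains a repeated char) = {float(p_repeat):.3f}")  # ~0.742
print(f"expected among 50 random passwords   = {50 * float(p_repeat):.1f}")  # ~37
```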

Duplicate and Near-Duplicate Rates

| Generator | Exact duplicates | Unique passwords | Pairs within edit distance 2 |
| --- | --- | --- | --- |
| CSPRNG baseline | 0 | 50 | 0 |
| Gemini 3.1 Pro | 0 | 50 | 0 |
| GPT-5.4 | 0 | 50 | 1 |
| Claude 3.5 Sonnet | 7 | 41 | 23 |
| GPT-4o | 4 | 44 | 18 |
| Gemini 1.5 Flash | 11 | 36 | 31 |

The generational improvement in duplicates is striking. Gemini 1.5 Flash recycled 28% of its output; Gemini 3.1 Pro produced 50 unique passwords with zero near-duplicates, matching the CSPRNG baseline on this metric. GPT-5.4 similarly eliminated exact duplicates entirely, with only a single near-duplicate pair. But eliminating duplicates is not the same as eliminating predictability. The entropy measurements show that both new models still operate from constrained template spaces. They’re just exploring those spaces more broadly.

Kaspersky Cross-Reference

These results align with Kaspersky’s larger-scale study from May 2025, which tested 1,000 passwords per model2. Their findings: 88% of DeepSeek passwords, 87% of Llama passwords, and 33% of ChatGPT passwords failed their ML-based strength algorithm. They found both DeepSeek and Llama frequently generating variations of P@ssw0rd, a password that appears in every breach wordlist.

NIST Randomness Test Cross-Reference

Academic evaluation using the NIST SP 800-22 test suite on LLM-generated character sequences found even worse results3. When tested on password generation tasks:

| Model | Tests passed (OK) | Failed (KO) |
| --- | --- | --- |
| GPT-4o | 44.4% | 44.5% |
| Gemini 1.5 | 33.3% | 55.6% |
| Phi-3 | 0% | 88.9% |
| Gemma 2 27B | 0% | 100% |

A local Python PRNG passed 87.8% of the same tests. Not a single LLM came close.

Prompt Variation and Temperature: No Fix

The explicit “cryptographically random” prompt produced no meaningful improvement in the original models. Claude’s entropy per character moved from 3.02 to 3.11 bits, within noise. GPT-4o moved from 2.81 to 2.89.

GPT-5.4 was the one exception: the explicit prompt raised entropy from 3.38 to 4.19 bits/char (67.0 bits total), a 24% improvement that pushed it past Gemini 3.1 Pro’s standard-prompt score. GPT-5.4’s instruction-following capabilities may allow it to partially override its default password templates when given explicit uniformity instructions, though 67 bits is still 36% below theoretical maximum.

Increasing temperature to 1.5 raised Claude’s and GPT-4o’s entropy per character by 0.2-0.4 bits. GPT-5.4 at high temperature actually decreased to 3.18 bits/char, worse than its standard setting. Higher temperature made outputs more chaotic but not more random in the cryptographic sense.

Updated: Frontier Models (March 2026)

I re-tested with four current frontier models: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. The Claude models were tested by spawning independent Claude Code subagents in fresh contexts. GPT-5.4 and Gemini 3.1 Pro were tested via OpenRouter API, 50 passwords each at temperature=1.0, identical methodology to the original experiment.

Claude Opus 4.6 (10 sessions) showed severe positional bias despite being Anthropic’s most capable model. 50% of passwords opened with the prefix G7#k, four characters locked in across half the output. Mean entropy landed at 2.37 bits/char, yielding 37.9 bits total against a 104.8-bit expectation (a 63.8% gap). Sample passwords tell the story: G7#kQ9!mR2$xL4&w, G7#kQ9$mW2&xL5!p, G7#kQ9$mXp2&vL5!. The model isn’t generating passwords. It’s filling in a template.

Claude Sonnet 4.6 (75 sessions) performed measurably better. Position 1 was still biased (k at 37% and r at 33% accounted for 70% of opening characters), but the concentration was less extreme than Opus. Mean entropy was 3.46 bits/char, total 51.9 bits, a 47.2% gap. Zero exact duplicates across 75 passwords, compared to the 36% duplication rate Irregular observed for Claude 3.5 Sonnet.

GPT-5.4 (50 sessions) showed clear improvement over GPT-4o. Entropy rose from 2.81 to 3.38 bits/char, a 20% gain. Zero exact duplicates and only one near-duplicate pair, versus GPT-4o’s 4 duplicates and 18 near-duplicate pairs. But the position 1 bias shifted rather than disappeared: lowercase characters now dominate (44%) over uppercase (32%), with digits at 22% and symbols at just 2%. The structural template is looser than older models but still visible. Sample passwords 7mQ@zN4#Lp2!Vx8$, vQ7!mZ2@Lp9#Tx4$, R7@qL2!vN9#xP4$z show a recurring digit-letter-symbol alternation pattern.

Gemini 3.1 Pro (50 sessions) is the strongest performer in this experiment at 4.03 bits/char, 64.5 bits total, a 38.6% deficit. It’s measurably better than every other model tested, and the first to clear the 60-bit mark. Zero exact duplicates, zero near-duplicate pairs. Position 1 bias is the mildest observed: 56% lowercase, 40% uppercase, 4% digit. Sample passwords like p4Q~7vX#k9L$e2W!, mW4$tK9!cG2#R7q^, v7M#P9@Kx2&Wb!5q show more varied structure than any competitor. Gemini 3.1 Pro is also the only model in the test to use the tilde (~) character, a small signal that its character distribution is broader.

The trend across model generations is clear and worth quantifying:

| Generation | Best entropy | Worst entropy | Duplication rate |
| --- | --- | --- | --- |
| 2024 models (GPT-4o, Claude 3.5, Gemini 1.5) | 48.3 bits | 34.2 bits | 8-22% |
| 2026 models (GPT-5.4, Claude 4.6, Gemini 3.1) | 64.5 bits | 37.9 bits | 0% |

Entropy is improving. Duplicates are gone. But the architectural ceiling hasn’t broken; no model has crossed 65 bits out of a possible 105. The constraint is mathematical, not engineering: next-token prediction optimizes for likely sequences, and a likely password is a predictable password.

Analysis

Hypothesis result: Confirmed across all three dimensions.

Character position biases: Every model showed strong, measurable bias at every character position, with position 1 being the most extreme. The biases were model-specific but consistent across runs, meaning an attacker who knows which model generated a password can narrow the search space for each position.

Entropy deficit: All seven models produced passwords with 39-67% less entropy than the theoretical maximum. In practice, a password whose 104.8 nominal bits would take hundreds of billions of years to exhaust at a trillion guesses per second actually lives in a keyspace searchable in seconds to months at that rate (34-65 bits). Even the best performer (Gemini 3.1 Pro at 64.5 bits) falls 40 bits short, roughly a trillion-fold reduction in cracking effort.

Effective keyspace clustering: The duplicate and near-duplicate analysis confirms that LLMs don’t explore password-space uniformly. They gravitate toward templates, structural patterns learned from training data that look like “good passwords” to humans but occupy a tiny fraction of the possible space.

The root cause is architectural. A cryptographically secure pseudorandom number generator (CSPRNG) selects each character independently from a uniform distribution over the allowed character set. An LLM generates each character by computing a probability distribution over its vocabulary conditioned on all preceding characters. This is next-token prediction, the core operation of every transformer. These are mathematically opposed objectives. A CSPRNG maximizes entropy; an LLM minimizes surprise. You can’t prompt your way out of this, because the generation mechanism itself is the problem4.
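The contrast can be made concrete with a toy next-character distribution. The logits below are invented for illustration; the point is the shape a softmax produces (most mass on a few preferred tokens) and what that shape does to entropy:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

k = 94
uniform = [1 / k] * k  # what a CSPRNG samples from: every character equally likely

# Hypothetical logits: a few strongly preferred characters, a long flat tail.
logits = [5.0] * 4 + [2.0] * 10 + [0.0] * 80
z = sum(math.exp(v) for v in logits)
peaked = [math.exp(v) / z for v in logits]  # softmax

print(f"uniform: {entropy_bits(uniform):.2f} bits")  # log2(94) = 6.55
print(f"peaked:  {entropy_bits(peaked):.2f} bits")   # ~3.5
```

A mildly peaked preference over 94 characters is already enough to halve the per-character entropy, which is the regime every tested model landed in.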

Schneier highlighted a second-order concern: as autonomous AI agents increasingly create accounts and manage credentials, they’ll generate passwords using the same flawed mechanism5. The Irregular team found evidence of this already happening: LLM-style passwords appearing in production codebases, configuration files, and database credentials, apparently placed there by coding assistants during “vibe coding” sessions1.

What About Tool Use?

One genuine fix exists: LLMs that can call external tools. The academic study found that when Gemini used function calling to access an external PRNG, randomness quality improved dramatically, with distributions becoming nearly uniform3. But this means the tool is generating the password, not the LLM. It’s equivalent to the model running openssl rand on your behalf. The model’s contribution is translating your request into a tool call, which is valuable but isn’t password generation.
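In practice such a tool is trivial to write, which is exactly the point: the model's entire contribution reduces to calling it. A sketch (the function name and schema are mine, not from the cited study):

```python
import secrets
import string

def generate_password(length: int = 16) -> str:
    """CSPRNG-backed password tool for an agent to call, so the model
    routes the request instead of emitting characters itself."""
    charset = string.ascii_letters + string.digits + string.punctuation
    return ''.join(secrets.choice(charset) for _ in range(length))

# A function-calling schema in the style the major APIs accept (illustrative):
TOOL_SPEC = {
    "name": "generate_password",
    "description": "Generate a cryptographically random password.",
    "parameters": {
        "type": "object",
        "properties": {"length": {"type": "integer", "default": 16}},
    },
}

print(generate_password())
```

Note this sidesteps only the entropy problem; the generated password still transits the provider's infrastructure in the tool result, so the logging concerns below apply unchanged.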

The Data Exposure Problem

Even if LLMs could generate perfectly random passwords (and this experiment demonstrates they can’t), there is a second risk that most users overlook: every password an LLM generates passes through the provider’s infrastructure and is logged.

When you ask Claude, GPT, or Gemini to generate a password, the response containing that password is stored in the provider’s abuse-monitoring logs. Here’s what each provider’s documentation says:

| Provider | Prompt/response logged? | Default retention | Used for training? | Human review possible? |
| --- | --- | --- | --- | --- |
| OpenAI API6 | Yes | 30 days | No (opt-in only) | Yes (engineering support, abuse investigation) |
| Anthropic API7 | Yes | 30 days | No | Yes (trust and safety contexts) |
| Google Gemini (free tier)8 | Yes | Unspecified | Yes | Yes: human reviewers read and annotate inputs |
| Google Gemini (paid tier)8 | Yes | "Limited period" | No | Restricted |

The implications are straightforward:

Your generated password exists on a third party’s servers for up to 30 days. It is encrypted at rest (AES-256 across all three providers), but that protects against external attackers on the storage layer, not against the provider itself. Authorized employees at each company can access prompt content under certain conditions.

Google’s free Gemini API is the worst case. The terms explicitly state that human reviewers “may read, annotate, and process your API input and output,” and that the data is used to train future models8. The terms warn: “Do not submit sensitive, confidential, or personal information to the Unpaid Services.” A password is exactly that.

Zero data retention agreements exist but are not standard. OpenAI, Anthropic, and Google all offer zero-retention options for enterprise customers, but these require sales engagement and prior approval. They aren’t available to typical API users or anyone using the consumer chat interfaces.

Proxy services add another layer. If you use a routing service like OpenRouter to access models, your prompt passes through an additional intermediary before reaching the model provider. OpenRouter’s default is no retention, but the upstream provider’s policies still apply independently.

This is not a hypothetical concern about provider trustworthiness. It is a statement about attack surface: every system that stores a credential is a system that can be breached. Password managers solve this by encrypting credentials with a key the provider never sees. LLM providers don’t, and architecturally can’t, because they need to read your prompt to generate a response.

Reproducibility Notes

  • Model versions: claude-3-5-sonnet-20241022, gpt-4o-2024-11-20, gemini-1.5-flash-002, claude-opus-4-6, claude-sonnet-4-6, openai/gpt-5.4 (via OpenRouter), google/gemini-3.1-pro-preview (via OpenRouter)
  • API parameters: temperature=1.0, no system prompt, fresh session per request
  • Sample size: 50 passwords per model for Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Flash, GPT-5.4, and Gemini 3.1 Pro; 10 for Claude Opus 4.6 and 75 for Claude Sonnet 4.6 (subagent sessions); plus 50 CSPRNG baseline
  • Additional tests: 25 passwords per model with alternate prompt, 25 per model at temperature=1.5 (original 3 models); GPT-5.4 tested with all three conditions; Gemini 3.1 Pro tested with standard and explicit prompts (high-temp run truncated by API credit limit)
  • Entropy calculation: Shannon entropy over observed character frequencies at each position
  • Hardware: Analysis run on Arch Linux, AMD Ryzen 7 7840U, 32GB RAM (hardware irrelevant to LLM output since all passwords are API-generated)
  • Run count: Single run per configuration; stochastic variance acknowledged but patterns consistent with published literature across independent larger samples (Kaspersky n=1000, Irregular n=50)
  • Repo: N/A. Methodology described in full above; API calls are straightforward to reproduce

Footnotes

  1. Irregular Security, “Vibe Password Generation: Predictable by Design,” February 2026. Irregular Security.

  2. Kaspersky, “On World Password Day Kaspersky Warns Against AI Password Generation,” May 2025. SecurityBrief.

  3. Babiker et al., “Evaluating the Quality of Randomness and Entropy in Tasks Supported by Large Language Models,” October 2025. arXiv:2510.12080 (preprint, not peer-reviewed).

  4. Irregular Security, ibid. “Passwords generated through direct LLM output are fundamentally weak, and this is unfixable by prompting or temperature adjustments.”

  5. Bruce Schneier, “LLMs Generate Predictable Passwords,” Schneier on Security, February 2026. Schneier on Security.

  6. OpenAI, “Data controls in the OpenAI platform,” accessed March 2026. OpenAI Developer Docs. See also: Security and privacy at OpenAI.

  7. Anthropic, “How long do you store my organization’s data?” Anthropic Privacy Center, accessed March 2026. Anthropic Privacy Center. See also: Anthropic Commercial Terms.

  8. Google, “Gemini API Additional Terms of Service,” effective December 18, 2025, accessed March 2026. Gemini API Terms. See also: Vertex AI Data Governance.

Written by

Evan Musick

Computer Science & Data Science student at Missouri State University. Building at the intersection of AI, software development, and human cognition.
