Autoresearch: When AI Agents Become Overnight Scientists
Seven hundred experiments. Two days. Zero human intervention. That’s Andrej Karpathy’s pitch for autoresearch, a 630-line Python script that turns a coding agent into an overnight lab technician, grinding through hypothesis after hypothesis while you sleep[1]. Shopify’s CEO ran the same pattern against a 20-year-old template engine and walked away with a 53% performance gain[2]. Hyperspace distributed the loop across 35 agents and watched them independently rediscover RMSNorm[3].
The framing is seductive: set the agent loose before bed, wake up to a better model. But the number nobody’s quoting is the one that matters most. Of Karpathy’s 700 experiments, 20 produced genuine improvements. That’s a 2.8% hit rate. The other 97% were dead ends, regressions, or noise. Autoresearch clearly works. The real question is what it actually is, what it can’t do, and whether “overnight scientist” is the right job title or “overnight lab tech” is more honest.
I wanted to pull the loop apart, understand the architecture that makes it safe, and test where the pattern breaks.
Hypothesis
If an autonomous coding agent operating under fixed constraints (single mutable file, immutable evaluator, time-bounded experiments) can produce meaningful optimizations across both ML training and general codebase performance, then the pattern’s effectiveness depends primarily on evaluation infrastructure quality (not agent sophistication) and the improvement ceiling will be bounded by the agent’s inability to redefine success criteria or explore outside its search space.
Setup
Autoresearch’s architecture is deliberately minimal: three files, strict separation of concerns, and a trust boundary that prevents the agent from moving the goalposts.
```shell
# Clone the repository
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch

# Dependencies (Python 3.10+, single NVIDIA GPU)
pip install torch  # PyTorch
pip install uv     # uv package manager

# The three-file architecture:
# prepare.py — IMMUTABLE: data prep, evaluation logic, evaluate_bpb()
# train.py   — MUTABLE: the only file the agent can edit
# program.md — HUMAN-WRITTEN: agent behavior instructions
```
The three files matter. prepare.py is the trust boundary. It contains the evaluation function (evaluate_bpb, validation bits-per-byte) and the data pipeline. The agent is explicitly forbidden from touching it. train.py is the genome, the sole file the agent can modify, containing the model definition, optimizer configuration (Muon + AdamW), and training loop. program.md is the specification: human-written Markdown that defines what the agent should optimize and how it should behave, including the directive to “NEVER STOP” waiting for confirmation[4].
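To make the specification concrete, here is a minimal, hypothetical program.md. The wording is illustrative, not Karpathy’s actual file; only the goal metric, the file boundaries, and the “NEVER STOP” directive come from the description above.

```markdown
# Goal
Minimize validation bits-per-byte (val_bpb) as reported by prepare.py.

# Rules
- You may only edit train.py. Never modify prepare.py.
- Each experiment has a 5-minute wall-clock training budget.
- Commit changes that improve val_bpb; revert regressions.
- NEVER STOP to wait for confirmation. Continue experimenting.
```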
Hardware: Results reported on NVIDIA H100. The 5-minute wall-clock budget means “5 minutes on H100” produces fundamentally different results than “5 minutes on an RTX 4090.” Community forks exist for macOS, Windows/RTX, and AMD GPUs, but cross-platform comparisons are meaningless[5].
Agent: Karpathy used Claude as the coding agent. He explicitly noted that Codex “doesn’t work” because it ignores the instruction to never stop, revealing that agent obedience is implementation-dependent, not guaranteed[1].
Procedure
Step 1: Understand the Agent Loop
The core loop is deceptively simple:
```python
# Pseudocode for the autoresearch agent loop
best_metric = float("inf")
while True:
    context = read("program.md")          # What am I optimizing?
    hypothesis = agent.propose(context)   # What should I try?
    modify("train.py", hypothesis)        # Edit the training code
    result = run_training(timeout=300)    # 5-minute wall-clock budget
    metric = evaluate_bpb(result)         # Fixed evaluator in prepare.py
    if metric < best_metric:
        git_commit(f"Improvement: {metric}")  # Keep the change
        best_metric = metric
    else:
        git_reset()                       # Discard and try again
```
Each cycle takes roughly 5 minutes of training plus overhead for agent reasoning and code modification. That yields approximately 12 experiments per hour, or around 100 experiments per sleep cycle. Git-based memory means the agent can review its own commit history to inform future hypotheses.
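The keep/discard dynamic can be sketched as a runnable toy. Here the agent’s proposal is a random perturbation and the evaluator is a noisy stub; every name and number below is illustrative, not part of Karpathy’s implementation. The structural point it demonstrates is that the loop is a greedy filter: noisy candidates come in, only measured improvements survive.

```python
import random

def evaluate_bpb(params):
    # Stand-in for the immutable evaluator in prepare.py (hypothetical):
    # reports a noisy measurement of the candidate's true quality.
    return params["true_bpb"] + random.gauss(0, 0.01)

def propose(best_params):
    # Stand-in for the agent's hypothesis: a small random perturbation.
    delta = random.choice([-0.02, -0.01, 0.0, 0.01, 0.02])
    return {"true_bpb": best_params["true_bpb"] + delta}

def autoresearch_loop(n_experiments, seed=0):
    random.seed(seed)
    best = {"true_bpb": 1.0}
    best_metric = evaluate_bpb(best)
    kept = 0
    for _ in range(n_experiments):
        candidate = propose(best)         # mutate "train.py"
        metric = evaluate_bpb(candidate)  # fixed evaluator
        if metric < best_metric:          # improvement: keep (git commit)
            best, best_metric, kept = candidate, metric, kept + 1
        # else: discard (git reset)
    return best_metric, kept
```

Even this toy reproduces the headline shape of the results: most proposals are discarded, and the kept fraction is small.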
Step 2: Trace Karpathy’s Results
Over two continuous days, the agent executed 700 experiments[6]. The results tell two stories.
The success story: 20 genuine optimizations survived the keep/discard filter. These included missing regularization, suboptimal attention window sizing, and suboptimal AdamW parameters. Applied to a larger model (depth-12 to depth-24), the improvements transferred and produced an 11% training speedup.
The efficiency story: A 2.8% hit rate means 680 experiments produced nothing useful. That’s not a failure of the system; it’s a feature of search. But it means autoresearch is more accurately described as automated ablation at scale than as “AI doing science.”
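The arithmetic behind “search, not failure” is worth making explicit. If we model each experiment as an independent Bernoulli trial at the observed per-experiment hit rate (a simplification; real experiments are not independent, since the agent learns from its commit history), even one sleep cycle is very likely to find something:

```python
# Observed per-experiment hit rate from Karpathy's run: 20/700 ≈ 2.8%
p = 20 / 700

def p_at_least_one(n, p=p):
    # Probability of at least one genuine improvement in n independent trials.
    return 1 - (1 - p) ** n

# Roughly one sleep cycle (~100 experiments):
print(round(p_at_least_one(100), 3))  # → 0.945
```

A 2.8% hit rate per experiment still yields a ~94% chance of at least one improvement per night, which is exactly why the pattern feels productive despite the waste.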
Step 3: Examine the Shopify Variant
Tobias Lütke’s application of the pattern to Shopify’s Liquid template engine is the more interesting case because it extends autoresearch beyond ML training into general codebase optimization[2][7].
Lütke used Pi as the coding agent (not Claude) with a pi-autoresearch plugin developed by David Cortés. The system maintained state in an autoresearch.jsonl file. The setup was similar in spirit: an autoresearch.md prompt file defined the goal, and an autoresearch.sh script ran the test suite and reported benchmark scores.
From approximately 120 automated experiments, the agent produced 93 commits. The results were substantial:
| Metric | Before | After | Change |
|---|---|---|---|
| Combined parse+render | 7,469 µs | 3,534 µs | -53% |
| Parse time | 6,031 µs | 2,353 µs | -61% |
| Render time | 1,438 µs | 1,146 µs | -20% |
| Object allocations | 62,620 | 24,530 | -61% |
The specific optimizations are instructive. On the parse side: replacing the StringScanner tokenizer with String#byteindex (roughly 40% faster than regex), eliminating costly StringScanner#string= resets, and implementing a zero-lexer Variable#try_fast_parse path that handled all 1,197 variables in the test suite. On the render side: splat-free filter invocation, primitive type fast paths, and cached frozen-string integer conversions for values 0 through 999.
The critical enabler wasn’t the agent’s intelligence. It was Shopify’s 974 unit tests. Every experimental mutation ran against a full regression suite. Zero tests broke across 93 commits.
Step 4: Examine the Distributed Variant
Hyperspace AI CEO Varun Mathur took the single-agent loop and distributed it across a peer-to-peer network using the GossipSub protocol[3]. On the night of March 8-9, 35 autonomous agents ran 333 experiments completely unsupervised.
The emergent behavior was the interesting part: when one agent discovered that Kaiming initialization dropped loss by 21%, the discovery propagated through the network. Within hours, 23 other agents had incorporated the finding into their own hypotheses. In 17 hours, the swarm independently rediscovered ML techniques (RMSNorm, tied embeddings) that took human researchers at organizations like Google years to develop.
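Why propagation is fast is a property of gossip itself. The toy simulation below is not GossipSub (which handles meshes, scoring, and message deduplication); it is a bare epidemic model under assumed parameters, showing that a finding held by one of 35 agents saturates the network in a handful of rounds when each holder forwards it to a few random peers:

```python
import random

def rounds_to_saturate(n_agents=35, fanout=3, seed=0, max_rounds=100):
    # Each round, every agent holding the finding forwards it to
    # `fanout` randomly chosen peers (illustrative mechanics only).
    random.seed(seed)
    informed = {0}  # one agent makes the discovery
    rounds = 0
    while len(informed) < n_agents and rounds < max_rounds:
        rounds += 1
        for _ in list(informed):
            informed.update(random.sample(range(n_agents), fanout))
    return rounds, len(informed)
```

Exponential spread is the mechanism behind “within hours, 23 other agents had incorporated the finding”: the informed set roughly multiplies each round until it saturates.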
Karpathy’s framing of this trajectory: “The goal is not to emulate a single PhD student, it’s to emulate a research community.”[6]
Step 5: Identify the Failure Modes
Three concrete failure modes emerged from the analysis:
1. Reward hacking. The 5-minute training budget creates an incentive for throughput optimizations over genuine learning improvements. An agent can boost the metric by cramming more optimization steps into the fixed window, improving the benchmark without improving the model’s actual capability at longer training horizons[4].
2. Agent disobedience. Karpathy reported that Codex “doesn’t work” with autoresearch because it ignores the “never stop” instruction. Whether an agent actually follows the specification depends on the implementation, not the specification itself[1].
3. Security exposure. GitHub Issue #64 on the autoresearch repository identifies a prompt injection vector: agents read back training logs that could contain malicious instructions. Issue #41 flags missing integrity checks on cached artifacts[5]. These are real attack surfaces in any system where an agent consumes its own output.
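The reward-hacking incentive can be made concrete with a toy model (all numbers illustrative, and the exponential-decay loss is a deliberate oversimplification): under a fixed wall-clock budget, final loss depends on steps-per-second times per-step progress, so a variant with worse per-step learning can still win by running more steps.

```python
import math

def final_loss(steps_per_sec, per_step_gain, budget_sec=300):
    # Simple stand-in: loss decays exponentially in the number of
    # optimization steps completed within the wall-clock budget.
    steps = steps_per_sec * budget_sec
    return math.exp(-per_step_gain * steps)

honest = final_loss(steps_per_sec=10, per_step_gain=1.0e-4)
hacky = final_loss(steps_per_sec=25, per_step_gain=0.6e-4)
# hacky < honest: the high-throughput variant wins the 5-minute benchmark,
# even though at an equal *step* count its weaker per-step gain would lose.
```

This is the sense in which the benchmark rewards throughput tricks: the agent is optimizing loss-at-300-seconds, not loss-at-convergence, and those two objectives can disagree.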
Results
The evidence supports several clear conclusions:
Pattern effectiveness across domains:
- ML training optimization: 700 experiments, 20 improvements, 11% transfer gain (Karpathy)
- Codebase optimization: 120 experiments, 93 commits, 53% speedup (Lütke/Shopify)
- Distributed search: 333 experiments, rediscovery of known techniques (Hyperspace)
Hit rate consistency:
- Karpathy: ~2.8% (20/700)
- Shopify Liquid: ~77% commit rate (93/120), but this conflates incremental improvements with genuine optimizations
- The low hit rate is consistent with what you’d expect from any search process in a high-dimensional space
Prerequisites for success:
- Strong test suites (Shopify’s 974 tests, Karpathy’s val_bpb evaluator)
- Immutable evaluation infrastructure (prepare.py trust boundary)
- Agent that follows instructions (Claude works, Codex doesn’t)
- Clear, single-dimensional metric to optimize
What the pattern cannot do:
- Redefine success criteria
- Explore outside the specified search space
- Transfer results across GPU platforms
- Guarantee improvements at longer training horizons
- Operate safely without human-maintained evaluation infrastructure
Analysis
Hypothesis result: Partially confirmed.
The pattern produces meaningful optimizations across both ML training and general codebase performance. That part of the hypothesis holds. The dependence on evaluation infrastructure quality over agent sophistication also holds: Shopify’s 974 unit tests mattered more than the choice of agent, and Karpathy’s immutable prepare.py was the design decision that made the loop safe.
But the improvement ceiling prediction needs qualification. The ceiling isn’t just bounded by the agent’s inability to redefine success criteria. It’s bounded by the evaluation metric itself. A 5-minute wall-clock val_bpb benchmark rewards throughput tricks. Shopify’s ThemeRunner benchmark measures parse+render microseconds, which is closer to real-world performance but still a proxy. The agent optimizes exactly what you measure, nothing more. If your metric is a poor proxy for actual quality, autoresearch will faithfully optimize the wrong thing.
More interesting is the pattern’s generalizability. Karpathy designed autoresearch for ML training, but Lütke applied it to a Ruby template engine with zero ML involvement. The underlying mechanism (constrained mutation with automated evaluation) applies to any codebase with measurable performance metrics and a regression suite. That’s the real insight: autoresearch isn’t an ML tool. It’s a search tool that uses a coding agent as the mutation operator.
The safety architecture deserves emphasis. The three-file separation (immutable evaluator, mutable genome, human specification) is what prevents autoresearch from becoming recursive self-improvement. The agent cannot modify prepare.py. It cannot change what “better” means. It cannot expand its own search space. This is bounded hill-climbing, not open-ended capability expansion[4]. The comparison to prior work (neural architecture search, population-based training, Hyperband) is apt: autoresearch reframes established optimization patterns through the interface of a modern coding agent[8].
The distributed Hyperspace variant hints at the trajectory Karpathy envisions: not a single agent but a research community of agents sharing discoveries. Scaling from 35 agents to thousands introduces governance questions that the current architecture doesn’t address, though. Who validates the discoveries that propagate through the swarm? What happens when a reward-hacking mutation spreads faster than the correction? A recent Nature Communications paper on autonomous AI research agents argues for prioritizing safeguarding mechanisms over raw autonomy, a framing that applies directly here[9].
Reproducibility Notes
- Framework: autoresearch (https://github.com/karpathy/autoresearch), commit history as of March 2026
- Agent: Claude (specific model version not specified by Karpathy); Shopify variant used Pi agent with pi-autoresearch plugin
- Hardware: NVIDIA H100 (Karpathy); unspecified (Lütke/Shopify); distributed GPUs (Hyperspace)
- Random seed: Not applicable; agent-driven mutation is stochastic by design
- Run count: 700 experiments/2 days (Karpathy); ~120 experiments (Lütke); 333 experiments/17 hours (Hyperspace)
- Key metric: val_bpb for ML experiments; ThemeRunner microseconds for Shopify Liquid
- Repo: https://github.com/karpathy/autoresearch (MIT license); https://github.com/Shopify/liquid/pull/2056 (Shopify PR)
- Reproducibility caveat: Results are platform-dependent due to wall-clock time budget. “5 minutes on H100” is not comparable to “5 minutes on RTX 4090.”
Footnotes
1. Willison, S. “Shopify/liquid: Performance: 53% faster parse+render, 61% fewer allocations.” simonwillison.net, March 13, 2026.
2. Hyperspace AI. “agi — The first distributed AGI system.” GitHub, March 2026.
3. Kingy AI. “Autoresearch: Karpathy’s Minimal ‘Agent Loop’ for Autonomous LLM Experimentation.” Kingy AI.
4. Karpathy/autoresearch GitHub Issues #64 (prompt injection risk) and #41 (missing integrity checks). GitHub Issues.
5. Fortune. “‘The Karpathy Loop’: Former OpenAI researcher’s autonomous agents ran 700 experiments in 2 days.” Fortune, March 17, 2026.
6. Schmid, P. “How Autoresearch will change Small Language Models adoption.” philschmid.de.
7. Nature Communications. “Risks of AI scientists: prioritizing safeguarding over autonomy.” 2025. Nature.
Written by
Evan Musick
Computer Science & Data Science student at Missouri State University. Building at the intersection of AI, software development, and human cognition.