
The Algorithm That Rewrites Itself: How Autoresearch Makes AI Skills Self-Improving

Karpathy's autoresearch method lets AI agents improve their own prompts overnight. One skill went from 56% to 92% pass rate in 4 rounds. Here's how it works, what it costs, and how to run it yourself.

Augmi Team
autoresearch, karpathy, prompt-optimization, claude-code, ai-skills, self-improvement, augmi

The Algorithm That Rewrites Itself

How Andrej Karpathy’s autoresearch method became a practical toolkit for AI systems that improve on their own

You have a prompt that works okay. You tweak a sentence, test it, it gets worse. You change it back, add a clause, test again. After an hour you’ve improved it by maybe 10%, if you’re lucky. You’re not sure which of your three changes actually helped. This is how most people improve AI systems — and it’s a poor approach.

Autoresearch is different. You define what “good” looks like, point the agent at your prompt, and walk away. By morning it has run 50 improvement cycles on its own, kept what worked, discarded what didn’t, and left you a changelog explaining every decision. Andrej Karpathy built the original version to run ML experiments autonomously. Ole Lehmann adapted the pattern for something anyone can use: improving Claude skills from 56% to 92% pass rate with zero manual intervention between rounds.

This post covers how it works, what the evidence says about its limits, and why the human role in prompt optimization is shifting.

Where It Started: Karpathy’s 630-Line Experiment

In early 2025, Andrej Karpathy — formerly of Tesla and OpenAI — published a Python script on GitHub with a deceptively simple goal: let an AI agent run its own machine learning experiments. The full script is 630 lines. It reads the current training code, forms a hypothesis about what might improve it, makes one change, runs the experiment, checks the metric, and keeps or discards the change. Then repeats, indefinitely.

Over two days, this system ran roughly 700 experiments. About 20 represented genuine improvements. The aggregate result was an 11% reduction in the time required to reach GPT-2-quality language model performance — and along the way, the system caught oversights Karpathy himself had missed across 20 years of working with these architectures.

The repo collected 42,000 GitHub stars in a matter of weeks. Shopify’s CEO used a similar approach and reported 53% faster template rendering through 93 automated commits.

What makes it work? Four design decisions in Karpathy’s original:

Single mutable file. Only train.py gets modified. Everything else — the evaluation harness, the rules, the metric — stays fixed. This constrains the search space and prevents the agent from gaming the system by changing how success is measured.

One change per iteration. Each experiment touches exactly one thing. This mirrors basic experimental design: isolate the variable. If something works, you know what caused it.

A single, unambiguous metric. Karpathy used validation bits-per-byte (val_bpb), which is vocabulary-size-independent and hardware-normalized via a 5-minute time budget. You can’t argue with it. Either the number went down or it didn’t.

Full logging. Every hypothesis and outcome gets recorded. The history becomes usable knowledge.

These constraints aren’t incidental. They are the mechanism.
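To see how little machinery this takes, here's a minimal sketch of the keep-or-discard core in Python. The helper names are ours, not Karpathy's, and we're assuming for illustration that the evaluation harness prints val_bpb as its last line of output:

```python
import subprocess

def val_bpb() -> float:
    """Run the frozen evaluation harness and return the single metric.
    Hedged assumption: prepare.py prints val_bpb on its last stdout line."""
    out = subprocess.run(["python", "prepare.py"],
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip().splitlines()[-1])

def apply_one_change() -> None:
    """Placeholder for the agent step: read train.py and make exactly ONE edit."""
    raise NotImplementedError  # in practice, a coding agent performs this edit

baseline = val_bpb()
apply_one_change()
if val_bpb() < baseline:  # lower bits-per-byte is better
    subprocess.run(["git", "commit", "-am", "keep: val_bpb improved"], check=True)
else:
    subprocess.run(["git", "checkout", "--", "train.py"], check=True)  # discard
```

Everything interesting happens inside apply_one_change; everything trustworthy happens outside it.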

The Adaptation: From ML to Plain Text

Karpathy’s original requires a GPU, ML knowledge, and meaningful compute time. It’s designed for researchers. Ole Lehmann saw the pattern underneath it and asked a different question: what if the thing being optimized was just a text file?

Claude skills — or more broadly, any structured prompt that an AI agent follows — are pure text files. They have inputs and outputs. You can define what a good output looks like. You can write a script that runs the skill, checks the output, and scores it. The only missing piece is a set of evaluation criteria.

Lehmann built a loop: take the skill file, run it on test cases, score results against a binary checklist of 3-6 yes/no questions, ask the agent to identify what failed and why, make one change to the skill file, repeat. His landing page copy skill started at a 56% pass rate. After 4 rounds of overnight optimization, it reached 92%.

The article describing this process received 2.2 million views on X. Not because the technical idea was new — academics had been doing automated prompt optimization for years — but because for the first time, the barrier to entry was essentially zero. You need Claude Code or a similar coding agent, a text file to optimize, and a clear definition of “good.”

The Technical Pattern in Detail

Every successful autoresearch implementation shares the same core loop, regardless of what’s being optimized:

Read — the agent reads the current version of whatever is being improved.

Hypothesize — the agent analyzes recent failures and proposes a targeted change.

Change — one modification is made to the mutable file. Not two. Not a rewrite. One change.

Test — the modified version is run against a set of test cases.

Score — results are checked against the evaluation criteria. Pass or fail, numerically.

Keep or discard — if the score improved, the change persists. If not, the previous version is restored.

Log and repeat — every round’s score, hypothesis, and outcome gets recorded. Then the loop runs again.

The architecture separates into three components: a rules file that never changes (equivalent to Karpathy’s program.md), a fixed evaluation harness (prepare.py), and the mutable file the agent edits (train.py for ML, a skill file for prompt optimization).

Worth understanding before you start: this is hill-climbing, not random search. Each iteration builds on the best version so far. There is one lineage of accumulated improvements, not a population of competing candidates. Computationally cheap, but it can get stuck in local optima — a known limitation.
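Here's what that loop looks like as a Python sketch. It's illustrative, not Lehmann's actual script: run_skill, score, and propose_change are stand-ins for your agent invocation, your binary checklist, and the agent's hypothesis step:

```python
import json
import shutil

SKILL = "skill.md"                        # the one mutable file
TESTS = ["brief_01.txt", "brief_02.txt"]  # fixed test inputs (abbreviated)

def run_skill(skill_path: str, test_input: str) -> str:
    """Stand-in: invoke the agent with the current skill on one test case."""
    raise NotImplementedError

def score(output: str) -> bool:
    """Stand-in: the binary checklist; every criterion must be a yes."""
    raise NotImplementedError

def propose_change(skill_path: str, log: list) -> str:
    """Stand-in: the agent reads recent failures, makes ONE edit to the
    skill file, and returns a one-line description of its hypothesis."""
    raise NotImplementedError

def pass_rate(skill_path: str) -> float:
    results = [score(run_skill(skill_path, t)) for t in TESTS]
    return sum(results) / len(results)

best = pass_rate(SKILL)
log = []
for round_num in range(50):
    shutil.copy(SKILL, SKILL + ".bak")       # snapshot, so we can discard
    hypothesis = propose_change(SKILL, log)  # exactly one change
    candidate = pass_rate(SKILL)
    kept = candidate > best
    if kept:
        best = candidate                     # keep: the lineage advances
    else:
        shutil.copy(SKILL + ".bak", SKILL)   # discard: restore previous version
    log.append({"round": round_num, "hypothesis": hypothesis,
                "score": candidate, "kept": kept})
    with open("changelog.json", "w") as f:   # full logging, every round
        json.dump(log, f, indent=2)
```

Note the asymmetry: the agent only ever touches the skill file, while the test set, the scorer, and the loop itself stay frozen.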

The Binary Evaluation Unlock

If you survey every successful autoresearch implementation, they all converge on the same evaluation method: binary yes/no checklists. Not “rate this response from 1-10.” Not “is this good?” Yes or no, for each criterion.

The Builder’s Playbook framing is direct: “If you can’t explain how to score it in one sentence, rewrite it.” Lehmann’s sweet spot is 3-6 yes/no questions. MindStudio’s implementation guide recommends “deterministic yes/no checks” over subjective LLM scoring. Karpathy’s val_bpb metric is a single number that either improved or didn’t.

Why does this matter? Autoresearch doesn’t just require that you can score outputs — it requires that the agent can reliably determine what went wrong in a failing case and form a hypothesis about how to fix it. Vague criteria (“be more helpful”) produce vague failure analyses and vague changes. Binary criteria (“does the output include a specific call-to-action?”) produce clear failure cases and targeted improvements.
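Concretely, a landing-page-copy checklist might compile down to deterministic checks like these (the criteria here are invented for illustration):

```python
def has_call_to_action(output: str) -> bool:
    """Does the output include a specific call-to-action?"""
    return any(cta in output.lower()
               for cta in ("sign up", "book a demo", "start free"))

def headline_under_limit(output: str) -> bool:
    """Is the headline (first line) 12 words or fewer?"""
    lines = output.splitlines()
    return bool(lines) and len(lines[0].split()) <= 12

def names_the_audience(output: str) -> bool:
    """Does the copy say who the product is for?"""
    return any(w in output.lower() for w in ("founders", "freelancers", "teams"))

CHECKLIST = [has_call_to_action, headline_under_limit, names_the_audience]

def passes(output: str) -> bool:
    # A test case passes only if every binary criterion is a yes.
    return all(check(output) for check in CHECKLIST)
```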

There’s a second thing here worth noting. Forcing yourself to write binary evaluation criteria often clarifies what you actually want from the system. The checklist creation process does real intellectual work — not just enabling the loop, but surfacing ambiguity in the original goal. I’ve seen people realize mid-checklist that they were optimizing for the wrong thing entirely.

What the Evidence Shows: Real Results and the Plateau

Concrete outcomes reported across sources:

  • Lehmann’s landing page copy skill: 56% to 92% in 4 rounds
  • Karpathy’s training script: 11% improvement in time-to-quality metric over ~700 experiments
  • Shopify (Tobi Lutke’s team): 53% faster template rendering through 93 automated commits
  • MindStudio overnight run: $1.50-$4.50 in API costs for 50+ improvement cycles, typically ~18,000 tokens per cycle

The improvement trajectory follows a consistent pattern. Cycles 1-5 fix obvious failures: cases where the skill gives clearly wrong outputs. Cycles 6-12 make structural changes that unlock better performance. Cycles 13 and beyond fine-tune around the edges. Between the first and second phases there is often a plateau at a 60-70% pass rate where incremental wording changes stop helping.

This plateau matters. It’s not a sign the system isn’t working. It signals that the current structure of the prompt has hit its ceiling and needs a different kind of change — adding few-shot examples, restructuring the output format, splitting one task into two passes. If you stop at the plateau, you miss the breakthrough.

From Karpathy’s original data: of ~700 experiments run, only about 20 produced genuine improvements. Most changes are neutral or harmful. This is normal. The value is that rejection happens automatically, at low cost, without your attention.

The Agent Reliability Problem

There is a hidden variable in all of this that community experience has surfaced: not every AI model can reliably run the loop.

Autoresearch requires an agent that will follow “keep running” instructions without stopping, maintain state coherently across dozens of iterations, and resist making multiple changes at once or declaring victory prematurely. Latent Space’s coverage reported that Opus 4.6 maintained stability for 12+ hours of continuous operation, while GPT-5.4 failed on a “LOOP FOREVER” instruction. A separate technical analysis found that OpenAI Codex ignores “never stop” instructions entirely.

The community has largely converged on Claude Code as the default agent for autoresearch loops, not because it’s the most capable model, but because it follows multi-step autonomous instructions reliably. As Latent Space put it: “Harness fragility matters more than raw model capability.”

This is a real constraint on portability. The pattern is theoretically agent-agnostic, but in practice you need an agent you can trust to run unsupervised for hours without going off-script.

The Academic Context: DSPy and OPRO

Autoresearch didn’t appear out of nowhere. Stanford’s DSPy framework and Google’s OPRO optimizer represent the formal academic lineage of the same core idea: treat prompts as programs, optimize them automatically through measured iteration.

DSPy’s COPRO and MIPROv2 optimizers have demonstrated accuracy improvements of 18+ percentage points on complex reasoning tasks. OPRO uses an LLM to generate candidate prompt modifications, score them, and accumulate the best variants over many rounds. These are rigorous, well-studied approaches with significant academic literature behind them.

The difference is access. DSPy requires programming skills and framework integration. OPRO requires understanding how to configure optimization pipelines. Autoresearch as Lehmann adapted it requires a yes/no checklist and a text editor.
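For contrast, here's roughly what the DSPy route looks like, reusing the passes() helper from the checklist sketch above. We're assuming COPRO's interface from recent DSPy releases, so treat the details as illustrative rather than authoritative:

```python
import dspy
from dspy.teleprompt import COPRO

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported model

class LandingCopy(dspy.Signature):
    """Write landing page copy for the given product brief."""
    brief = dspy.InputField()
    copy = dspy.OutputField()

def metric(example, pred, trace=None):
    # The binary checklist collapsed into a 0/1 score, as in the sketches above.
    return float(passes(pred.copy))

trainset = [dspy.Example(brief=b).with_inputs("brief")
            for b in ("A CRM for dentists", "An invoicing app for freelancers")]

optimizer = COPRO(metric=metric, breadth=10, depth=3)
optimized = optimizer.compile(dspy.Predict(LandingCopy),
                              trainset=trainset, eval_kwargs={})
```

Powerful, but you're writing signatures, metrics, and optimizer configs. The autoresearch version of the same task is a text file and a checklist.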

The actual algorithm — hill-climbing, evolutionary search, automated hyperparameter tuning — has existed for decades. What changed is that natural language agents can now execute the optimization loop, which turns a software engineering problem into something closer to a product decision.

Beyond AI Skills: Where the Pattern Applies

Three conditions for autoresearch to work: the output is scorable, the evaluation is automated, and you can constrain changes to a single file per iteration. This generalizes further than it first appears.

Marketing copy. Cold email sequences that used to take 5 weeks of A/B testing cycles (limited by send volume and response lag) can be tested in 24-48 hours by running autoresearch against synthetic evaluation criteria. MindStudio’s marketing application guide covers cold email, ad creative, and landing page elements in detail.

System prompts. Any AI agent with a system prompt defining its behavior is a candidate. If you can specify what good responses look like in binary terms, the loop can optimize the system prompt directly.

Content templates. Editorial style guides, report templates, document formats — anything with consistent structure and measurable quality attributes.

Agent workflows. Multi-step agent pipelines where one file defines the coordination logic can be optimized if you can score the final output of the pipeline.

The constraint is always the evaluation criteria. Creative tasks, ambiguous goals, multi-objective optimization, and anything requiring aesthetic judgment resist binary scoring. Autoresearch excels at well-defined, measurable tasks.

The Compounding Knowledge Problem

One thing that tends to get overlooked: the changelog is often more valuable than the improved prompt.

Lehmann made this explicit: “That changelog is probably the most valuable piece… when smarter models come out, you hand them that changelog.” Every round of autoresearch generates a record of what was tried, what failed, and why. This accumulated knowledge about what works for a specific task is institutional memory.

When a better base model becomes available, you don’t start from scratch. You start from the final improved prompt plus the changelog documenting the decision history. The new model can read what was learned and skip past already-explored dead ends. The optimization history compounds in value over time.
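A changelog entry doesn't need to be elaborate. A hypothetical round might be recorded as simply as:

```
Round 14. Hypothesis: failing cases bury the call-to-action below the fold;
make the CTA instruction the first rule in the skill file.
Result: pass rate improved. KEPT.
```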

What This Means for Prompt Engineering as a Practice

The human role changes from author to judge. You no longer write the prompt directly — you define what a good prompt produces, and the system finds a prompt that produces it. The skill becomes evaluation design, not prompt writing.

This is why binary checklists matter so much. They are the product. The loop is infrastructure. Getting the evaluation criteria right is the work. Getting the prompt right is the loop’s job.

Latent Space calls this a shift from “vibe coding” — lazy automation — toward genuine autonomous research capability. Where vibe coding means asking an agent to do something and accepting whatever it produces, autoresearch means defining measurable success criteria and letting the agent iterate until those criteria are met.

The longer-term implication Karpathy gestured toward in follow-up posts: distributed collaborative autoresearch. A SETI@home model where many agents run improvement experiments in parallel against a shared optimization target, with discovered improvements merged into a common pool. The infrastructure for this is closer than it sounds.

Practical Steps to Get Started

If you want to run autoresearch on a prompt or skill you own, here’s the operational path:

Step 1: Pick something with a measurable goal. A prompt with specific, definable outputs works. A prompt for “generate creative ideas” does not (without additional constraints).

Step 2: Write your binary checklist. 3-6 yes/no questions. Each question should be answerable from the output alone, without context. Test your checklist on 5 existing outputs to make sure it distinguishes good from bad.

Step 3: Build your test cases. 10-20 diverse inputs that cover the range of situations your prompt handles. Include edge cases. The test set stays fixed throughout the optimization run.

Step 4: Establish your baseline. Run your current prompt against all test cases, score everything with your checklist, calculate the overall pass rate.

Step 5: Configure the loop. Point Claude Code or a comparable coding agent at clear instructions: read the skill file, identify which test cases failed and why, make one change, re-run, score, keep if improved, discard if not, log everything, repeat. (See the example instruction file after Step 7.)

Step 6: Let it run overnight. Budget $5-25 for 50-100 cycles. Don’t stop at the plateau around 60-70%. Structural breakthroughs typically emerge in cycles 6-12.

Step 7: Review the changelog. Read what the agent learned. This is the institutional knowledge about your specific use case.
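To make Step 5 concrete, here's the shape of an instruction file you might hand the agent. Treat it as a template to adapt, not a canonical format:

```
You are running an optimization loop on skill.md. Repeat until told to stop:

1. Run skill.md against every file in tests/. Score each output with
   checklist.md (every question must be answered yes to pass).
2. Report the pass rate. If it beats the best recorded rate, commit the
   current skill.md. Otherwise revert to the last committed version.
3. Read the failing cases. Form ONE hypothesis about why they fail.
4. Make exactly ONE change to skill.md that tests the hypothesis.
5. Append the round number, hypothesis, score, and keep/discard decision
   to changelog.md.

Rules: never edit the tests, the checklist, or this file. Never make two
changes in one round. Never stop because the score looks "good enough".
```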

Takeaways

The pattern is old: measure, change one thing, keep what improves, repeat. What’s new is that coding agents can execute this loop autonomously on plain text, at a cost of a few dollars, while you sleep.

The binary evaluation checklist is not a convenience feature. Defining what “good” means in binary terms is the hard part. The loop is the easy part.

Model reliability matters more than model capability for long-running loops. Not every model can run unsupervised for 12 hours without going off-script.

The changelog the system generates is worth keeping. It documents what was learned about a specific optimization problem and compounds in value as models improve.

Prompt engineering is shifting from a manual craft to an evaluation design discipline. The question stops being “how do I write a better prompt?” and starts being “how do I define what a good prompt produces?” That shift is already underway.
