The Night AI Started Doing Its Own Research
How Karpathy’s autoresearch turned a single GPU into an overnight research lab — and why Shopify’s CEO thinks the singularity has begun

Tobi Lütke, Shopify’s CEO, went to bed on March 8th with a simple instruction queued up for an AI agent: take this search model and make it better. When he woke up 8 hours later, the agent had run 37 experiments, improved the model’s quality score by 19%, and done it all on a model half the size of the one it replaced.
“I’m not a ML researcher of course,” Tobi wrote. “But its mesmerizing to just read it reasoning its way through the experiments. I learned more from that than months of following ml researchers.”
The tool that made this possible is called autoresearch, and it was released the day before by Andrej Karpathy — former Director of AI at Tesla, founding team member at OpenAI, and one of the most influential figures in machine learning.
630 Lines That Change the Game
Autoresearch is deceptively simple. The entire project fits in three files:
- prepare.py handles data preparation and is never modified.
- train.py contains the GPT model and training loop in about 630 lines of Python; this is the only file the AI agent modifies.
- program.md is where the human provides instructions, context, and research direction.
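The human's entire contribution lives in that last file. A minimal program.md might look something like this (an illustrative sketch only; the repo's actual file will differ):

```markdown
# Goal
Reduce validation bits-per-byte on the prepared dataset.

# Constraints
- Modify only train.py.
- Each experiment must finish within about 5 minutes on one GPU.
- Keep a change only if validation bpb improves; otherwise roll it back.

# Directions worth exploring
- Learning-rate schedule and optimizer settings.
- Architectural tweaks that shrink the model without hurting quality.
```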
The agent reads program.md, modifies train.py, runs a 5-minute training experiment, evaluates the result using validation bits-per-byte, and either commits the improvement to a git branch or rolls back. Then it does it again. And again.

Over a single night, the system runs roughly 100 experiments. The ratchet mechanism ensures the model never gets worse — it can only improve or stay the same. After Karpathy’s own 276-experiment marathon, 29 improvements stuck, and those gains transferred when he scaled to larger models.
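The commit-or-rollback ratchet is the whole trick, and it is small enough to sketch. The function names and the noisy stand-in "experiment" below are invented for illustration, not autoresearch's actual code:

```python
import random

def ratchet_loop(run_experiment, n_experiments, initial_score):
    """Greedy ratchet over a score where lower is better (e.g. validation
    bits-per-byte). A candidate change is kept only if it improves the
    score; otherwise the previous best stands, so quality never regresses.
    In autoresearch, keep vs. discard maps to git commit vs. rollback."""
    best, kept = initial_score, 0
    for _ in range(n_experiments):
        candidate = run_experiment(best)  # one short training run
        if candidate < best:              # improvement: "commit"
            best, kept = candidate, kept + 1
        # else: "rollback" -- discard the change, best is unchanged
    return best, kept

# Stand-in for a 5-minute experiment: most tweaks hurt or do nothing,
# but an occasional one shaves a little off the metric.
_rng = random.Random(0)

def noisy_trial(best):
    return best + _rng.uniform(-0.005, 0.02)

best, kept = ratchet_loop(noisy_trial, 100, initial_score=1.50)
print(f"best bpb {best:.3f} after keeping {kept} of 100 experiments")
```

Because failed experiments cost nothing but time, even a low hit rate compounds: Karpathy's 276-experiment run keeping 29 improvements is roughly a one-in-ten win rate, and the ratchet banks every win.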
The Tobi Effect
What made Tobi’s tweet go viral (3,749 likes, 541,000 views, 3,652 bookmarks) wasn’t just the 19% improvement. It was the proof of concept for a new paradigm: a non-expert using autonomous AI agents to achieve research-grade results while sleeping.
Tobi has been building qmd — a local search engine for documents and knowledge bases — as a personal project. He applied autoresearch to qmd’s query-expansion model and pointed it at his own training data. The agent didn’t just tweak hyperparameters; it reasoned through architectural decisions, optimizer configurations, and training strategies.
The result: a 0.8-billion-parameter model that outperformed his previous 1.6-billion-parameter version. Smaller, faster, and more accurate: the trifecta that ML researchers spend months chasing.
The Progression You Can’t Unsee

Garry’s List published an analysis tracing the acceleration of AI-assisted development:
- February 2025: “Vibe coding” emerges. Developers describe what they want in natural language; AI generates the code.
- February 2026: “Agentic engineering” takes hold. Developers orchestrate fleets of AI agents that handle coding tasks. Karpathy himself noted: “you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight.”
- March 2026: Autonomous research arrives. A human writes a markdown file. The AI runs 100 experiments. The human reviews the results.
Each transition removed another layer of human involvement. The bottleneck shifted from writing code, to writing instructions, to — perhaps soon — just defining the objective.
SETI@home, But for AI Research

Karpathy’s vision doesn’t stop at single-agent overnight runs. In a follow-up tweet, he outlined the next step: “asynchronously massively collaborative” agent research, modeled after SETI@home.
“The goal is not to emulate a single PhD student,” he wrote. “It’s to emulate a research community of them.”
The architecture: thousands of agents across different GPUs and platforms, each running autonomous experiments, sharing findings through GitHub Discussions, branching into divergent research directions. The repo becomes a living organism of parallel scientific inquiry.
The community is already building toward this. Within its first weekend, autoresearch accumulated 8,100 GitHub stars and 1,100 forks. Community members created ports for macOS (MLX), Windows (RTX), and resource-constrained environments. The infrastructure for distributed agent research is assembling itself.
The Risk That 80% of Experts Flagged

Not everyone is celebrating. A study published on arXiv interviewed 25 leading AI researchers from Google DeepMind, OpenAI, Anthropic, Meta, and top universities. Twenty of them — 80% — identified automating AI research as one of the most severe and urgent risks in the field.
They call it a “meta risk”: an accelerant that amplifies every other AI danger. If AI can improve its own training code, the feedback loop tightens. One researcher noted bluntly: “A model really good at coding is almost already very good at ML R&D.”
The study also revealed a sharp divide. Frontier lab employees regularly discuss recursive improvement as an achievable near-term goal, while academic researchers express systematic skepticism built on decades of unfulfilled promises. Both groups agree on one thing: the pace of progress has surprised them.
Perhaps most telling: 17 of 25 researchers expect that frontier labs will keep their most advanced AI research automation capabilities internal rather than open-sourcing them. Autoresearch democratizes incremental optimization. The tools that produce paradigm shifts may stay behind corporate walls.
Beyond ML: The Autonomous Loop Template
The enterprise implications extend far beyond machine learning. Leaplytics published an analysis arguing that autoresearch is significant not because of what it does with training code, but because of the pattern it demonstrates: fully autonomous iteration against a clear metric.
Any business workflow with measurable outputs is a candidate:
- A/B testing that runs hundreds of variants overnight
- Infrastructure optimization that tunes configurations continuously
- Content generation that iterates on engagement metrics
- Sales processes that experiment with outreach strategies
The prerequisite is always the same: an unambiguous success metric, clean data, and a consistent evaluation framework. The lesson isn’t “use autoresearch for your ML.” It’s “identify which of your loops can run autonomously.”
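That pattern generalizes directly. A minimal sketch of the autonomous-iteration template, with propose/evaluate/accept hooks you would supply for your own workflow (the toy "engagement" metric and config below are invented for illustration):

```python
import random

def autonomous_loop(propose, evaluate, accept, baseline, budget):
    """Generic autonomous iteration against one unambiguous metric.
    propose: generate a variant of the current best.
    evaluate: score a variant on the metric.
    accept: decide whether the new score beats the best (the ratchet).
    All three are placeholders supplied by the workflow owner."""
    best_variant, best_score = baseline, evaluate(baseline)
    for _ in range(budget):
        variant = propose(best_variant)
        score = evaluate(variant)
        if accept(score, best_score):
            best_variant, best_score = variant, score
    return best_variant, best_score

# Toy usage: "tune" a config for a fake engagement metric that
# peaks when the call-to-action is 7 words long.
rng = random.Random(42)

def propose(cfg):
    return {"cta_length": max(1, cfg["cta_length"] + rng.choice([-1, 1]))}

def evaluate(cfg):
    return -abs(cfg["cta_length"] - 7)

def accept(new_score, best_score):
    return new_score > best_score  # higher is better for this metric

best_cfg, best_score = autonomous_loop(
    propose, evaluate, accept, {"cta_length": 12}, budget=200)
```

Note that the accept hook absorbs the only thing that changed versus the ML case: there the metric (bits-per-byte) was minimized, here it is maximized. Everything else in the loop is identical.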
What It Means for AI Agent Platforms
Autoresearch validates a core thesis of the AI agent movement: that the value isn’t in the model itself, but in the orchestration layer that lets agents operate autonomously within constrained environments.
Every element of the system — persistent state on disk, git-based checkpointing, clear evaluation metrics, autonomous decision-making — maps to what agent platforms need to provide at scale. The fact that Tobi ran this from what he called “my pi” (likely a Raspberry Pi or personal compute setup) demonstrates that the compute requirements are modest. The intelligence is in the orchestration.
This is why agent platforms matter. Most people won’t set up autoresearch from scratch. They’ll want a managed service that handles the infrastructure, provides the compute, and lets them focus on what matters: defining the objective and reviewing the results.
The Singularity Question

The morning after his overnight experiment, Tobi posted a follow-up: “the singularity has begun. so many signs.”
Is that hyperbole? Maybe. But consider what just happened. A 630-line open-source script enabled a non-researcher to achieve ML improvements that would previously have required a dedicated team and months of work. The script ran on commodity hardware. The results were real and transferable. And the system is getting better — community contributions are pouring in, the SETI@home vision is taking shape, and the feedback loop between AI capability and AI research is tightening.
The Hacker News community pushed back on the singularity framing, noting that autoresearch currently operates within a narrow optimization space — more hyperparameter search than genuine discovery. That’s a fair critique. Karpathy acknowledged that current models feel “cagy and scared” when facing truly open-ended problems.
But the trajectory is what matters. The gap between “optimizes hyperparameters” and “discovers new architectures” is a capability gap, not a conceptual one. The framework is the same. The agent just needs to get better.
And if there’s one thing autoresearch demonstrates, it’s that agents are getting better fast.
Takeaways
- Autonomous AI research is no longer theoretical — it’s running on consumer hardware and producing real results
- The bottleneck has shifted from execution to instruction design — the quality of your program.md determines your outcomes
- Non-experts can now achieve research-grade results — the democratization of ML is accelerating
- Every measurable workflow is a candidate for autonomous optimization — ML was first because the metric was cleanest
- The recursive improvement loop has begun — AI agents are now improving AI systems, and the pace is increasing
Sources: Analysis synthesized from 10 sources including Karpathy’s autoresearch announcement, Tobi Lütke’s overnight results, GitHub repository, arXiv study on AI researcher perspectives, and community discussions on Hacker News.
