AI Development

Shipping at Inference Speed: What the Best AI-Assisted Developers Actually Do

How top developers like Peter Steinberger ship at inference speed — the workflows, tools, and mindset shifts behind AI-assisted development in 2026.

Augmi Team
ai-coding, agentic-engineering, developer-productivity, claude-code, ai-workflow

There is a specific type of developer emerging in 2026 who ships faster than the rest of their cohort — not because they have better taste or more experience, but because they have rebuilt their relationship with the act of writing code entirely. They do not write most of it. They orchestrate it.

This is not the same thing as “using Copilot to autocomplete lines.” The gap between that and what the best AI-assisted developers are doing now is roughly the gap between using spell-check and having a technical co-founder. One helps you write what you already know. The other writes entire systems while you hold the architecture in your head.

Understanding what separates the developers who have genuinely unlocked this from those still using AI as a fancy autocomplete requires looking at specific workflows, specific tools, and — perhaps most importantly — specific mindset shifts about what a developer’s job actually is.

The Steinberger Benchmark

Peter Steinberger is a useful reference point. He created OpenClaw, the open-source AI agent framework that has accumulated over 206,000 GitHub stars and become one of the more influential pieces of infrastructure in the agentic AI space. In February 2026 he joined OpenAI. He talks publicly about how he works, which makes him rare in a space where most elite workflows are kept private.

His core thesis is this: the bottleneck is no longer model capability. It is “inference time and hard thinking.” The question is not whether the model can write the code — GPT-5, Claude 4.6 Opus, and the current frontier models almost certainly can. The question is whether you have set up the conditions for inference to actually produce something coherent and trustworthy.

What does Steinberger’s actual workflow look like? He runs 3 to 8 projects concurrently. He commits to main, not feature branches, and rarely reverts. He credits GPT-5 as a breakthrough specifically for what he calls “factory-scale building” — the ability to generate and iterate on large amounts of code with enough reliability that the process feels industrial rather than artisanal. He trusts models enough that he rarely reads the code being generated. Instead, he monitors output streams and maintains architectural understanding without tracking every line.

This last point is the one that tends to unsettle developers who have not made the transition. Not reading the code sounds reckless. In Steinberger’s framing, it is a rational response to a new division of labor: if you understand the architecture and the requirements, and the model reliably translates those into working code, spending time reading that code is waste. The bottleneck is not comprehension of the output — it is the quality of the input.

He uses different models for different contexts. Codex gets 10 to 15 minutes of “silent reading” for large refactors, during which the model processes context without producing output before it starts editing. Claude Opus handles smaller, faster edits. He has built an “Oracle Tool” — a CLI research assistant — specifically for moments when agents get stuck. He runs a multi-machine setup, a MacBook Pro plus a Mac Studio connected via Jump Desktop, to keep multiple inference contexts warm simultaneously. And he uses an AGENTS.MD file — a system instruction document — to maintain consistency across sessions.

This is not a casual setup. It is a purpose-built workflow for a specific kind of work.

From Vibe Coding to Agentic Engineering

The evolution of terminology here is itself revealing. Andrej Karpathy coined the term “vibe coding” roughly a year ago to describe the experience of describing what you want in natural language and accepting whatever the model produces — trusting the vibe, iterating on intuition, not reading the diff. He later introduced “agentic engineering” to describe something more deliberate: you are not writing the code 99% of the time, you are orchestrating agents.

The distinction matters because it changes what skills are valuable. Vibe coding is a creativity exercise with a permissive constraint set. Agentic engineering is a systems design exercise where the execution layer happens to be language models rather than human developers or traditional compilers.

A related development gaining traction among teams shipping at high velocity is Spec-Driven Development. The pattern is to invest heavily in specifications before any code runs — detailed requirements documents, architectural decisions, edge case enumeration — and then hand that spec to an agent. Early practitioners report 10 to 50% faster development cycles compared to iterative prompting. The intuition is that the model’s ability to produce coherent code is proportional to the coherence of the problem statement it receives. Garbage in, garbage out is not a new idea, but it applies with particular force when the garbage can look convincingly like working code before the tests run.
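A spec in this style does not need to be elaborate to be useful. The sketch below is one possible shape, not a standard format; the project and section names are invented for illustration:

```markdown
# Spec: rate limiting for the public API (illustrative example)

## Requirements
- Return HTTP 429 when a client exceeds 100 requests/minute.
- Limit is per API key, not per IP.

## Architectural decisions
- Sliding-window counter in Redis; no new application dependencies.

## Edge cases
- Missing API key → 401, not 429.
- Redis unavailable → fail open, log a warning.

## Success criteria
- Existing test suite passes; new tests cover the 429 and fail-open paths.
```

The point is not the template itself but that edge cases and success criteria are enumerated before the agent runs, so the model is never inventing the acceptance bar as it goes.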

What the Data Actually Shows

Before going further, it is worth being honest about what the research says, because the discourse around AI-assisted development has a tendency toward hyperbole in both directions.

The optimistic case: 92% of developers report using AI coding tools, and approximately 41% of code in active repositories is now AI-generated. MIT researchers found a 26% increase in completed tasks among developers using AI assistance. A recent Anthropic analysis found that 27% of work now consists of tasks that would not have been done at all without AI — not faster execution of existing plans, but net new output. Solo founders consistently report 3 to 5x productivity gains on greenfield work.

The skeptical case: A METR study found that experienced developers were actually 19% slower on relevant tasks when using AI tools, while perceiving themselves to be 24% faster. That perception-reality gap is significant. Stack Overflow’s 2025 survey found 84% adoption but only 29% trust — down from 40% the previous year. Security researchers have found that AI-generated code has 2.74 times higher security vulnerability rates in pull requests, and approximately 45% of AI-generated code contains OWASP vulnerabilities. Legacy codebases see minimal gains from AI assistance; the productivity uplift concentrates in greenfield development.

The honest synthesis: AI-assisted development genuinely increases output for developers who have adapted their workflows to it, particularly on new projects. It does not uniformly accelerate all development. The METR result — slower but feeling faster — is probably the most important finding to hold onto, because it suggests that naive adoption can actively degrade performance while creating a false sense of progress.

The developers shipping at inference speed are not the ones who adopted AI tools. They are the ones who rebuilt their workflows around AI tools. The difference is not trivial.

The Core Skills That Actually Matter

Looking across the workflows of developers operating at the frontier — synthesizing from public discussions, published accounts, and the emerging literature on AI-assisted development — a set of consistent practices emerges.

Context Engineering Is the Job

If you had to name one skill that separates high-output AI-assisted developers from everyone else, it is context engineering. This is the craft of giving a model exactly the information it needs to produce a useful output, and no more. Too little context and the model hallucinates or produces something generic. Too much context and attention dilutes across irrelevant information.

Context engineering includes knowing what goes in the system prompt versus the user message, how to structure project documentation for model consumption, when to include full file contents versus summaries, and how to maintain context coherence across long sessions with many tool calls. It also includes knowing when to start a fresh context window rather than dragging a corrupted context forward.
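The full-contents-versus-summaries decision can be made mechanical. The sketch below is one way to do it under a fixed token budget; the `FileContext` shape, the 4-characters-per-token estimate, and the fall-back ordering are all assumptions for illustration, not any tool's actual behavior:

```typescript
// Sketch: decide which files enter the model's context in full and which
// only as summaries, under a fixed token budget. All names are illustrative.
type FileContext = { path: string; content: string; summary: string };

// Crude token estimate: roughly 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function assembleContext(files: FileContext[], budget: number): string {
  const parts: string[] = [];
  let used = 0;
  for (const f of files) {
    const full = `// ${f.path}\n${f.content}`;
    const fullCost = estimateTokens(full);
    if (used + fullCost <= budget) {
      parts.push(full);              // room for the whole file
      used += fullCost;
    } else {
      const brief = `// ${f.path} (summary)\n${f.summary}`;
      const briefCost = estimateTokens(brief);
      if (used + briefCost <= budget) {
        parts.push(brief);           // fall back to a summary
        used += briefCost;
      }
      // Otherwise: omit the file entirely rather than dilute attention.
    }
  }
  return parts.join("\n\n");
}
```

Note the last branch: dropping a file outright is often better than squeezing in a truncated fragment, which is the "too much context dilutes attention" problem in miniature.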

The AGENTS.MD file that Steinberger uses is a form of persistent context engineering — a document that establishes the model’s operating parameters for a given project so that those parameters do not have to be re-established in every session. Addy Osmani and others in the AI engineering space have described similar artifacts. The rough consensus on length is 60 to 300 lines: enough to establish meaningful constraints, short enough that the model actually attends to all of it.
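A minimal file in this spirit might look like the sketch below. The contents are invented for illustration — this is not Steinberger's actual file, and the stack and commands are assumptions:

```markdown
# AGENTS.MD — project operating rules (illustrative example)

## Stack
- TypeScript, strict mode; Node 22; pnpm.

## Conventions
- No new dependencies without asking first.
- Small, focused commits; commit after each passing test run.

## Verification
- A change is not done until `pnpm typecheck` and `pnpm test` both pass.
- Run the affected tests after every edit; fix failures before moving on.

## Boundaries
- Never modify `migrations/` or anything under `vendor/`.
```

The verification and boundary sections do the most work: they turn per-session reminders into standing policy the agent sees every time.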

Plan First, Then Execute

Addy Osmani, whose writing on AI-assisted engineering is widely read, describes the planning phase as “waterfall in 15 minutes.” The idea is that for non-trivial work, you spend a short but focused period producing a detailed plan — with the model’s help — before any code runs. This plan becomes the specification the agent executes against.

This is counterintuitive to developers who learned that iteration beats planning, that you should build something fast and fix it later. That heuristic made sense when iteration was cheap in terms of human time. When the execution layer is a model running at inference speed, iteration is cheap in a different way, but planning is even cheaper relative to the cost of a poorly specified run that produces plausible-looking but wrong output.

The practical implementation varies. Some developers write explicit task lists. Some produce architecture documents. Some use structured prompts that walk the model through requirements, constraints, and success criteria before asking it to generate code. The common thread is treating the planning artifact as a first-class deliverable, not a throwaway step.
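One way to make the planning artifact first-class is to keep it as a structured object and render the prompt from it, so the same plan can be versioned, reviewed, and re-run. The `Plan` shape below is one possible structure, not a standard:

```typescript
// Sketch: a structured plan rendered into the prompt an agent executes
// against. The field names are illustrative, not a standard format.
interface Plan {
  goal: string;
  requirements: string[];
  constraints: string[];
  successCriteria: string[];
}

function renderPlan(plan: Plan): string {
  const bullets = (items: string[]) => items.map((i) => `- ${i}`).join("\n");
  return [
    `Goal: ${plan.goal}`,
    `Requirements:\n${bullets(plan.requirements)}`,
    `Constraints:\n${bullets(plan.constraints)}`,
    `Success criteria:\n${bullets(plan.successCriteria)}`,
  ].join("\n\n");
}
```

Because the plan is data rather than free text, it can live in the repository next to the code it produced — which is what treating it as a deliverable means in practice.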

Close the Loop

One of the clearest markers of a mature AI-assisted workflow is whether the agent verifies its own output. Running compilation. Running tests. Checking that the endpoint returns the expected response. Confirming that the migration applied cleanly.

Agents that generate code and stop are producing unverified output. Agents that generate code, compile it, run the test suite, fix failures, and only surface to the human when they are stuck or when tests pass are producing verified output. The difference in the human’s time cost is enormous.

This is partly a tooling question — the ability to close the loop depends on having a working test suite, a clean build pipeline, and an environment where the agent can actually execute and observe results. It is also partly a workflow design question: setting up the agent’s instructions to include verification steps rather than assuming they will happen.

Commit Often

Version control as a safety net sounds obvious, but it has a specific implication in agentic workflows: you want small, frequent commits not because each commit represents a meaningful unit of work, but because you want cheap rollback points. When a model takes a large codebase in a direction that turns out to be wrong, the cost of recovery is proportional to how far you let it go before noticing.

Steinberger commits to main and rarely reverts, which suggests he has achieved enough model reliability and workflow oversight that he rarely needs to. Developers earlier in the transition should probably be more conservative — treating each successfully tested step as a commit point, giving themselves the ability to recover cleanly when an agentic run goes sideways.

Never Trust Blindly

Simon Willison, who has written extensively and carefully about AI-assisted development, has a stated rule: he will not commit code he cannot explain. This is worth taking seriously as a principle even if the exact implementation varies.

The risk in agentic development is not that models produce obviously broken code — that is easy to catch. The risk is that models produce subtly wrong code that passes tests, looks coherent, and introduces a logic error or security vulnerability that surfaces later. The developers who have caught the most problems of this type tend to be the ones who maintain enough architectural understanding to notice when something feels off, even if they are not reading every line.

This is also where the security statistics become relevant. A 2.74x higher security vulnerability rate in AI-generated pull requests is not an argument against AI-assisted development. It is an argument for maintaining security review practices — automated SAST, code review focused on security-sensitive paths, dependency audits — that you might be tempted to skip when the code is generating quickly.

Lean on Strong Typing

Static type systems are, among other things, a form of verification. A well-typed codebase surfaces a large category of model errors at compile time rather than at runtime or in production. TypeScript, Rust, Go, Haskell — languages where the type system is expressive enough to encode real constraints — make the close-the-loop pattern much more effective. The model can write the code, the compiler can reject the wrong parts, and the model can fix them.

This is a practical argument for not skimping on types in AI-assisted projects. Loose typing is a human-readable shortcut. It is a poor fit for agentic execution, where the feedback loop is automated.
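As a small illustration of the kind of constraint a type system can encode — the payment domain here is invented — a discriminated union with an exhaustiveness check means a model that forgets a case, or reads a field from the wrong variant, fails at compile time rather than in production:

```typescript
// Sketch: encoding a real constraint in types. A payment is either settled
// or failed; the compiler rejects code that forgets a case or reads a field
// belonging to the other variant. Domain names are illustrative.
type Payment =
  | { status: "settled"; amountCents: number }
  | { status: "failed"; reason: string };

function describe(p: Payment): string {
  switch (p.status) {
    case "settled":
      return `settled: ${p.amountCents} cents`; // `p.reason` here would not compile
    case "failed":
      return `failed: ${p.reason}`;
    default: {
      // Exhaustiveness check: adding a third variant without handling it
      // makes this assignment a compile-time error.
      const unreachable: never = p;
      return unreachable;
    }
  }
}
```

In an agentic loop, that compile error is exactly the fast, accurate feedback the model needs — no human review required to catch the missing case.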

The Tools Landscape in Early 2026

The tooling has matured significantly, and the landscape now has clear differentiation between the major players.

Claude Code operates as a CLI-first tool and is widely regarded as having the deepest reasoning capability among production-ready coding assistants. Claude 4.6 Opus currently leads the SWE-bench benchmark at 80.8% — SWE-bench being the standard evaluation for AI performance on real GitHub software engineering tasks. Kimi K2.5 from Moonshot AI sits at 76.8%. GPT-5.2 is at 69%. These numbers represent genuine progress from the benchmark scores of 18 months ago and give some empirical grounding to the sense that frontier models have become substantially more capable on real coding tasks.

Cursor has embedded itself deeply into the IDE workflow, with support for up to 8 parallel agents within a single project context. For developers who think inside their editor, it is the natural starting point.

Windsurf positions itself around autonomous multi-step execution — the model takes a goal and works toward it without requiring step-by-step guidance. LogRocket’s 2025 developer survey ranked it the top AI development tool.

Cline CLI 2.0 describes itself as an “AI agent control plane,” which reflects where the category is heading: not just AI assistance within a tool, but AI as the execution layer that tools connect to.

The inference market context is useful here. The market for AI inference — the compute that actually runs these models — is currently valued around $106 billion and projected to reach $255 billion by 2030. That trajectory reflects the underlying bet that inference becomes the dominant compute workload: not training runs happening a few times a year, but continuous inference running in developers’ terminals and CI pipelines around the clock. Multi-model workflows — using different models for different tasks within a single project — are now reported by 57% of organizations using AI coding tools.

What Separates the Leaders

Pulling back to the broader pattern: the developers who are genuinely shipping at inference speed have made a set of bets that are individually small but collectively significant.

They have invested in the infrastructure of agentic work. System instruction documents. Comprehensive test suites. Strong typing. Fast CI. These investments pay back in the quality of agentic runs, because a well-typed codebase with comprehensive tests is a codebase where the close-the-loop pattern works well and the model gets fast, accurate feedback on its output.

They have developed judgment about model selection. Different models have different strengths. Codex’s “silent reading” approach for large refactors reflects a specific insight about how that model benefits from extended context processing. Opus for small edits reflects a different set of tradeoffs. Building intuition for which model to reach for in which context is itself a skill.

They have accepted the new division of labor. The shift from “I write code” to “I orchestrate agents that write code” is not purely a workflow change. It requires a different relationship to the code itself — one where architectural understanding matters more than line-by-line familiarity, where the ability to specify precisely matters more than the ability to implement quickly, where noticing that something is wrong matters more than knowing exactly how to fix it.

They have maintained skepticism. The developers who have avoided the failure modes — security vulnerabilities in production, subtly wrong logic that passes tests, agentic runs that generate plausible-looking waste — tend to be the ones who kept the same critical eye they would apply to any code review, even as the velocity of production increased.

The Honest Picture

None of this is magic. The 26% MIT study result — not 10x, just 26% — is probably closer to the median experienced developer’s gain than the stories of solo founders shipping entire applications in a weekend. Those stories are real, but they concentrate in specific conditions: greenfield projects, experienced developers who have rebuilt their workflows, high-quality specs, modern tech stacks.

The METR result — 19% slower with AI tools, while feeling faster — is a cautionary flag that adoption does not automatically translate to productivity. The gap between adoption and genuine productivity gain is where most developers are currently sitting. Closing that gap requires the workflow changes described above, not just having access to a capable model.

What is genuinely new and significant is the category of work that AI makes tractable that was not tractable before. Anthropic’s finding that 27% of work represents tasks that would not otherwise have been done is the most interesting data point in this space. It is not just that existing work happens faster. It is that the set of things a developer or a team considers worth attempting has expanded. That is a different kind of impact than raw productivity, and it compounds differently over time.

Building for This World

The infrastructure around agentic development is still being built. The workflows that Steinberger and others have developed are personal; they do not yet have the tooling layer that would make them accessible to a broader population of developers. AGENTS.MD files are text documents maintained by hand. Multi-machine setups are hardware investments. Oracle Tools are custom-built CLIs.

What comes next is the productization of these workflows — platforms that handle the infrastructure of agentic development so that the developer can focus on architecture and specification rather than on keeping inference contexts warm and model outputs synced across machines.

That is exactly what we are building at Augmi.world. We deploy OpenClaw agents with one click — wallet auth, USDC payments, Telegram and Discord integration out of the box. If you are building toward the agentic engineering model and want the infrastructure layer handled, that is where to start.

The developers shipping at inference speed today built their own infrastructure. The ones doing it at scale in 2027 probably will not have to.
