Gorewood Logs

The Tightest Ship I've Ever Run

METR measures AI capability in "time horizons": the length of a task, expressed as how long a human expert would take to do it, that an agent can complete reliably. The longer the horizon, the more likely the agent fails, because per-step failure probability compounds.
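The compounding is easy to see in a back-of-envelope calculation. The 95% per-step success rate below is an illustrative assumption, not a measured number:

```python
# Back-of-envelope: if each step succeeds independently with probability p,
# a task that takes n steps succeeds with probability p ** n.
# p = 0.95 is an illustrative assumption, not a measured number.
p = 0.95
for n in (1, 10, 50):
    print(f"{n:>2} steps -> P(success) = {p ** n:.3f}")
# -> 0.950 at 1 step, 0.599 at 10 steps, 0.077 at 50 steps
```

At fifty steps, a per-step reliability that sounds excellent leaves the whole task failing more than nine times out of ten.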

The engineering response: don't let it compound. Break long tasks into short ones. Add checkpoints. Make every step verifiable. Strict compilers, linters, type checkers, tests—binary oracles that reset confidence at every hop. Task-tracking layers like Beads that externalize state so work survives crashes, context compaction, session boundaries.
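A checkpoint of this kind can be sketched as a loop over binary oracles. This is a minimal sketch, and the specific commands (mypy, ruff, pytest) are stand-ins for whatever toolchain you actually run:

```python
import subprocess

# Each entry is a binary oracle: exit code 0 means pass, anything else fails.
# The specific commands here are illustrative stand-ins for your toolchain.
GATES = {
    "types": ["mypy", "."],
    "lint":  ["ruff", "check", "."],
    "tests": ["pytest", "-q"],
}

def checkpoint() -> dict[str, bool]:
    """Run every oracle and report pass/fail for each."""
    return {
        name: subprocess.run(cmd, capture_output=True).returncode == 0
        for name, cmd in GATES.items()
    }
```

An orchestrator can refuse to hand the agent its next subtask until `checkpoint()` comes back all green, which is exactly the "reset confidence at every hop" behavior.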

This all works. Guardrails keep agents honest about what the code does.

But I've watched Claude pass every gate—types check, linter clean, tests green—and still produce a god object. A 400-line function doing seven things. A file that started focused and gradually became the junk drawer because that's where the related code was.

Agents prefer local changes. Fix the bug where the bug is. Add the feature next to the similar feature. Refactoring is expensive in tokens. Shoehorning is cheap. The toolchain doesn't catch this. Why would it? The code compiles.

So you add another layer. Cyclomatic complexity limits. Cognitive complexity caps. Function and file length thresholds. SOLID principles documented where the agent can see them. SRP especially: skills that remind the agent to ask "should this be two things?" before adding to something that already exists.
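Concretely, some of these budgets are a few lines of linter config. The thresholds below are illustrative, not recommendations; ruff's `C901` rule implements the McCabe cyclomatic complexity check, and `PLR0915` caps statements per function:

```toml
# pyproject.toml (illustrative thresholds; tune per codebase)
[tool.ruff.lint]
select = ["C901", "PLR0915"]   # cyclomatic complexity, statement count

[tool.ruff.lint.mccabe]
max-complexity = 10            # any function above this fails the gate

[tool.ruff.lint.pylint]
max-statements = 50            # crude proxy for function length
```

The point isn't the specific numbers. It's that the structural budget becomes a hard gate rather than a suggestion.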

Two layers. The first keeps agents honest about correctness. The second keeps them honest about structure.

And then you step back and look at what you've built.

Documented architectural principles. Automated quality gates. Explicit complexity budgets. Clear task breakdowns with dependency tracking. Persistent state that survives handoffs. Code review (by other agents, or by you). Regular checkpoints where work gets validated before continuing.

This is just... running an engineering team. Sprint planning. Tech debt management. Architecture reviews. CI/CD. All the stuff we've preached for decades about how humans should build software.

Except we're actually doing it now. More rigorously than most real-world teams ever operate. I've worked at companies where god objects lived for years because nobody had time to refactor. Where the linter config was "suggested." Where architectural principles existed in a wiki nobody read.

The agents don't get that slack. They can't. Without strict guardrails, they drift. So we build the guardrails. And suddenly we're running tighter ships than we ever ran for ourselves.

I wrote before about the eventual industrialization of software engineering. This is what it looks like from the inside. Not some distant future—now. The problem of "how do you reliably produce working software" is getting solved, methodically, by treating agents like a workforce that needs structure.

The industry is already changing. What I don't think most engineers see, if their orgs haven't fully embraced this yet (for understandable reasons), is how fast it's moving. By the time it feels urgent, the gap will be hard to close.

#ai-development #architecture #claude #vibe-coding