The Spec Problem, Part 1: When Tools Try to Do Too Much
"Assumption is the mother of all fuck ups." Penn snarled that in Under Siege 2 after a mercenary admitted he assumed Ryback was dead without seeing the body. It's also the best summary of why spec-driven development exists.
Agents align best with human goals when humans can clearly articulate what they actually want. Vague goals get interpreted in surprising ways. The more you deviate from well-understood patterns (a standard REST API versus a custom vector-art spaceship), the more precision you need. Thoughtworks calls specs "refined context": just enough information for the LLM to be effective without being overwhelmed.
So far so good. The problem is what happened next.
SDD tools proliferated. Spec-Kit, OpenSpec, Kiro, and dozens of others. Most of them share a fatal flaw: they don't stop at spec writing. They want to own the whole lifecycle, flowing from spec into implementation and treating the spec as the living source of truth as discovery happens and plans evolve.
For small to medium tasks, this works. By "small" I mean larger than a single agentic session but still bounded enough to hold in your head. You write the spec, the tool tasks it out, you execute. Clean.
For ambitious projects, it falls apart. The volume of specs you need to solidify before you can do meaningful discovery is enormous. And discovery creates churn: new requirements, changed assumptions, entire subsystems that need rethinking. Text becomes untenable as the medium of task tracking. You're managing a sprawling markdown graveyard instead of shipping software.
I ran into this wall with Beads. Beads excels at execution: claiming work, tracking progress, managing the discovery that emerges during implementation. But it assumes you already know what to build. I still needed an up-front spec process. So I tried blending Spec-Kit with Beads: use Spec-Kit for the planning phase, chop off its implementation phase, hand off to Beads.
Clunky. Token-inefficient. Spec-Kit wanted to task everything out in markdown, and then I spent more tokens converting those markdown tasks to Beads epics. I was paying twice for the same work.
The real issue? These tools conflate two distinct problems: articulation (figuring out what to build) and execution (tracking the work as you build it). Spec-Kit is good at articulation but insists on owning execution. Beads is good at execution but assumes articulation already happened. Nothing cleanly hands off from one to the other.
There's a deeper challenge too. You and the agent can use the same words, agree on a spec, and believe you're aligned, only for implementation to reveal you weren't. The spec looked clear. The details were grossly underspecified. A human might ask clarifying questions or "just understand" the intent. An agent builds exactly what you said, which wasn't what you meant.
Identifying underspecification is a skill of its own. A spec can read well and still hide landmines. As one GitHub discussion put it: "spec drift and consistency are nontrivial in an AI-driven flow." The spec you wrote and the spec the agent internalized can diverge silently.
I wanted a process that would surface these gaps before implementation, without the lifecycle bloat of full SDD tools. Something that could hand off cleanly to Beads when the spec was solid enough to execute.
For most tasks, I still use a single pass: superpowers' brainstorming and writing-plans skills. That's enough when scope is small or I have high confidence in the domain. But for tricky problems, where I suspect Claude might unintentionally gaslight me about scope or feasibility, I needed something more adversarial.
Steve Yegge mentioned in a Beads discussion that his colleague gets better planning results by making the LLM iterate on a plan five times before considering it done, then iterating five more times on the epics before handing off to agents. The models themselves, Yegge noted, "have validated that this approach matches their cognition." His conclusion: "you don't need the tools — just the iterations, and Beads." Karpathy's LLM Council does something adjacent: multiple models deliberate, review each other anonymously, and a chairman synthesizes. Research on iterative refinement confirms the pattern: multi-pass feedback loops help stronger models "unlock their full potential."
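Yegge's "just iterate" loop is simple enough to sketch. Here's a minimal version, assuming a hypothetical `ask_llm(prompt) -> str` wrapper around whatever model API you use; the function name and prompts are illustrative, not part of Beads or Spec-Kit.

```python
def refine(plan: str, ask_llm, passes: int = 5) -> str:
    """Run a plan through N critique-and-revise passes before calling it done."""
    for i in range(passes):
        # Ask for problems first, so the revision has something concrete to fix.
        critique = ask_llm(
            f"Review this plan for gaps, risks, and underspecified details "
            f"(pass {i + 1} of {passes}):\n\n{plan}"
        )
        # Then fold the critique back into the plan itself.
        plan = ask_llm(
            f"Revise the plan to address this critique. Return only the "
            f"revised plan.\n\nPlan:\n{plan}\n\nCritique:\n{critique}"
        )
    return plan
```

Each pass pairs one critique call with one revision call, so five passes means ten model calls; the plan text is the only state carried between them.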
The insight isn't that more passes are better. It's that a single reviewer tends toward either over-engineering or over-simplification. Opposing passes converge on correct scope.
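One way to picture opposing passes, with a hypothetical `ask_llm(prompt) -> str` wrapper; the critic prompts here are invented for illustration, not a prescription:

```python
# Two critics with opposite biases: one pushes toward cutting scope,
# the other toward hardening it.
CUT = ("You are a ruthless simplifier. List everything in this plan that is "
       "over-engineered, speculative, or out of scope.")
HARDEN = ("You are a rigorous skeptic. List everything in this plan that is "
          "underspecified, risky, or naively scoped.")

def opposing_passes(plan: str, ask_llm, rounds: int = 3) -> str:
    """Alternate a cut-focused critic and a harden-focused critic."""
    for i in range(rounds * 2):
        stance = CUT if i % 2 == 0 else HARDEN
        critique = ask_llm(f"{stance}\n\nPlan:\n{plan}")
        plan = ask_llm(
            f"Revise the plan to address the critique without overcorrecting. "
            f"Return only the plan.\n\nPlan:\n{plan}\n\nCritique:\n{critique}"
        )
    return plan
```

Because each critic's bias is corrected by the next pass, the plan oscillates toward a scope neither critic would have produced alone.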
That's where dialectical refinement comes in. Part 2.