Harness Engineering: The Fast Version
Three findings, seven layers, five patterns. No fluff.
Three Research Findings
1. SWE-agent: 64% from Interface Design
Same model, same task, different interface: a 64% relative performance improvement.
The winning interface:
- Capped search (max 50 results) → forces specificity
- Stateful viewer (100 lines, numbered) → removes cognitive load
- Linter at edit time → catches errors immediately
- Context compression → old turns summarized
Result: 3.97% → 12.47% issue resolution. Same GPT-4.
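The first two interface choices above can be sketched in a few lines. This is a minimal illustration, not SWE-agent's actual code: the names `capped_search` and `FileViewer` are hypothetical, and refusing an over-broad search (rather than truncating it) is one assumed way to enforce the cap.

```python
from dataclasses import dataclass

MAX_RESULTS = 50  # search cap taken from the list above
WINDOW = 100      # viewer window size taken from the list above


def capped_search(lines: list[str], term: str, limit: int = MAX_RESULTS) -> str:
    """Return numbered matches, or ask for a narrower query if there are too many."""
    hits = [f"{i}: {text}" for i, text in enumerate(lines, start=1) if term in text]
    if len(hits) > limit:
        # Assumed policy: refuse instead of flooding the context with matches.
        return f"{len(hits)} matches for {term!r}; narrow the search (limit {limit})."
    return "\n".join(hits) if hits else f"No matches for {term!r}."


@dataclass
class FileViewer:
    """Keeps the scroll position so the agent does not have to track it."""

    lines: list[str]
    top: int = 0  # zero-based index of the first visible line

    def render(self) -> str:
        window = self.lines[self.top:self.top + WINDOW]
        return "\n".join(f"{self.top + i + 1}: {text}" for i, text in enumerate(window))

    def scroll_down(self) -> str:
        # Clamp so the final window still ends on the last line of the file.
        self.top = min(self.top + WINDOW, max(len(self.lines) - WINDOW, 0))
        return self.render()
```

The viewer holds the state (current position) that the model would otherwise have to carry in its context, which is the sense in which it "removes cognitive load."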
2. Anthropic: Two-Agent Architecture
Problem: Most projects exceed any context window.
Solution:
- Initializer: Creates init.sh, feature_list.json, progress.txt, and an initial git commit
- Coding: One feature at a time, clean handoff, documented state
Key: The feature list is the ground truth. It holds 200+ specific, end-to-end features, all marked failing at the start, so agents cannot fake completion.
3. OpenAI: 1M Lines, Zero Manual Code
Aug 2025 - Feb 2026. Three engineers. 1,500 PRs. 3.5 PRs/engineer/day.
Architecture:
- Repository = system of record (no Slack, no Docs)
- Progressive disclosure (short entry → deep docs)
- Mechanical enforcement (linters, not human review)
- Application legibility (browser automation, observability)
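As one illustration of mechanical enforcement, an architectural rule can be checked by a script instead of a reviewer. The sketch below is an assumption, not OpenAI's tooling: the layer names `ui` and `db`, the `FORBIDDEN` table, and the function `import_violations` are all hypothetical.

```python
import ast

# Hypothetical invariant: code in the "ui" layer must not import the "db" package.
FORBIDDEN = {"ui": {"db"}}


def import_violations(source: str, layer: str) -> list[str]:
    """Return one message per import that breaks the layer's boundary rule."""
    banned = FORBIDDEN.get(layer, set())
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x".
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in banned:
                problems.append(f"line {node.lineno}: {layer} must not import {root}")
    return problems
```

Run in CI or at edit time, a check like this gives the agent an immediate, unambiguous error message, which is the point of enforcing invariants mechanically.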
Seven Layers (Top to Bottom)
| Layer | Role |
| --- | --- |
| 1. Human Oversight | Approval, review, steering |
| 2. Spec Tools | Structured requirements, task DAGs |
| 3. Lifecycle Platforms | End-to-end management |
| 4. Task Runners | Issue tracker → agent → PR |
| 5. Orchestrators | Git worktree isolation |
| 6. Frameworks/Runtimes | Primitives vs. persistent infra |
| 7. Coding Agents | Claude Code, Codex (commodity) |
Layers 1-6 determine layer 7's effectiveness.
Five Patterns
| Pattern | Principle |
| --- | --- |
| Progressive Disclosure | Minimal entry + pointers. Context is finite. |
| Worktree Isolation | One agent, one branch. Parallel without collision. |
| Spec First | Machine-readable requirements in repo. Feature lists prevent fake-done. |
| Mechanical Enforcement | Linters, not code review. Invariants, not implementations. |
| Tight Feedback | Close gap between action and consequence. Catch early. |
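Worktree isolation can be sketched as a function that builds one `git worktree add` command per task. The branch prefix `agent/`, the sibling-directory layout, and the function names are assumptions chosen for illustration; the commands are returned as lists rather than executed.

```python
from pathlib import Path


def worktree_command(repo: Path, task_id: str, base: str = "main") -> list[str]:
    """Build a command that gives one task its own branch and working directory."""
    branch = f"agent/{task_id}"                      # hypothetical naming scheme
    path = repo.parent / f"{repo.name}-{task_id}"    # sibling directory per task
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def plan_worktrees(repo: Path, task_ids: list[str]) -> list[list[str]]:
    """One command per task; duplicate task ids would collide, so reject them."""
    if len(set(task_ids)) != len(task_ids):
        raise ValueError("task ids must be unique")
    return [worktree_command(repo, task_id) for task_id in task_ids]
```

Each resulting command could be passed to `subprocess.run(cmd, check=True)`. Since every agent works in its own directory on its own branch, parallel agents do not overwrite each other's files.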
Minimal Effective Harness
You don't need OpenAI's observability stack. Start with:
- Progress file: Read at start, write at end. Prevents "declare victory early."
- Feature list: Specific, verifiable, end-to-end. All marked failing initially.
- Descriptive commits: Every session ends with git commit + progress update.
- Browser automation: If web app → agents must see what users see.
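The progress-file and commit habits above amount to bookends around each session. A minimal sketch follows, assuming a plain-text file with `DONE:` and `NEXT:` entries; that format and the function names are hypothetical.

```python
from pathlib import Path


def read_progress(path: Path) -> str:
    """Session start: load prior notes; an empty string means a fresh project."""
    return path.read_text(encoding="utf-8") if path.exists() else ""


def append_progress(path: Path, done: str, next_step: str) -> None:
    """Session end: record what was finished and what comes next."""
    entry = f"DONE: {done}\nNEXT: {next_step}\n\n"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(entry)


def end_of_session_commands(message: str) -> list[list[str]]:
    """Git commands to run after the progress update (returned, not executed)."""
    return [["git", "add", "-A"], ["git", "commit", "-m", message]]
```

Appending rather than overwriting keeps a history that a later session can read to see what was actually finished, which guards against declaring victory early.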
𩺠Diagnostic Questions
When systems underperform:
- What information does the agent need that it cannot access?
- What feedback loop is missing that would catch mistakes?
- Where is context getting polluted?
- What constraints need mechanical enforcement?
Each answer → specific harness improvement.
Core Finding
The harness is the durable advantage. Model capability is a commodity. Organizations investing in environment design (scaffolding, feedback loops, spec tooling, orchestration) outperform those focused on model selection.