šŸ”¬ Three Research Findings

1. SWE-Agent: 64% from Interface Design

Same model, same task, different interface = 64% performance improvement.

The winning interface:

Result: 3.97% → 12.47% issue resolution. Same GPT-4.

2. Anthropic: Two-Agent Architecture

Problem: Most projects exceed any context window.

Solution:

  1. Initializer: Creates init.sh, feature_list.json, progress.txt, git commit
  2. Coding: One feature at a time, clean handoff, documented state

Key: Feature list is ground truth. 200+ specific, end-to-end, all marked failing. Agents cannot fake completion.

3. OpenAI: 1M Lines, Zero Manual Code

Aug 2025 - Feb 2026. Three engineers. 1,500 PRs. 3.5 PRs/engineer/day.

Architecture:

šŸ„ž Seven Layers (Top to Bottom)

1. Human Oversight Approval, review, steering
2. Spec Tools Structured requirements, task DAGs
3. Lifecycle Platforms End-to-end management
4. Task Runners Issue tracker → agent → PR
5. Orchestrators Git worktree isolation
6. Frameworks/Runtimes Primitives vs. persistent infra
7. Coding Agents Claude Code, Codex (commodity)

Layers 1-6 determine layer 7's effectiveness.

šŸ” Five Patterns

Progressive Disclosure Minimal entry + pointers. Context is finite.
Worktree Isolation One agent, one branch. Parallel without collision.
Spec First Machine-readable requirements in repo. Feature lists prevent fake-done.
Mechanical Enforcement Linters, not code review. Invariants, not implementations.
Tight Feedback Close gap between action and consequence. Catch early.

⚔ Minimal Effective Harness

Don't need OpenAI's observability stack. Start with:

  1. Progress file: Read at start, write at end. Prevents "declare victory early."
  2. Feature list: Specific, verifiable, end-to-end. All marked failing initially.
  3. Descriptive commits: Every session ends with git commit + progress update.
  4. Browser automation: If web app — agents must see what users see.

🩺 Diagnostic Questions

When systems underperform:

Each answer → specific harness improvement.

šŸ’” Core Finding

The harness is the durable advantage. Model capability is a commodity. Organizations investing in environment design — scaffolding, feedback loops, spec tooling, orchestration — outperform those focused on model selection.