Harness Engineering March 18, 2026

Harness Engineering: The Research

Analysis of three major systems and what they reveal about agent performance.

The research from 2024-2026 converges on a single finding: environment design determines agent performance more than model capability. This article analyzes three major implementations and extracts operational patterns.

Finding 1: The 64% Interface Improvement

The SWE-agent paper (Princeton NLP, 2024) demonstrated that a model with a purpose-built interface resolved 64% more issues, in relative terms, than the same model with standard shell access.

Same model. Same task. Same compute. Different interface.

The interface changes that produced this result:

Capped search: Maximum 50 results. Forces specificity.
Stateful file viewer: 100 lines at a time with explicit line numbers. Removes the cognitive load of counting lines.
Linter integration: Immediate validation at edit time. Catches errors before propagation.
Context compression: Old observations collapsed to summaries. Preserves attention for current work.
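The first two items in the list above can be sketched in a few lines of Python. This is a minimal illustration, not SWE-agent's actual code: the function and class names are hypothetical, and refusing to print results over the cap is one assumed way of enforcing it.

```python
from dataclasses import dataclass

MAX_RESULTS = 50  # search cap from the list above
WINDOW = 100      # viewer window size from the list above


def capped_search(lines: list[str], term: str, limit: int = MAX_RESULTS) -> str:
    """Return numbered matches, or ask for a narrower query if there are too many."""
    hits = [f"{i}: {text}" for i, text in enumerate(lines, start=1) if term in text]
    if len(hits) > limit:
        # Assumed policy: do not flood the context; push toward a more specific query.
        return f"{len(hits)} matches for {term!r}; narrow the search (limit {limit})."
    return "\n".join(hits) if hits else f"No matches for {term!r}."


@dataclass
class FileViewer:
    """Stateful viewer: remembers its position and always prints line numbers."""
    lines: list[str]
    top: int = 0  # zero-based index of the first visible line

    def show(self) -> str:
        window = self.lines[self.top:self.top + WINDOW]
        return "\n".join(f"{self.top + i + 1}: {text}" for i, text in enumerate(window))

    def scroll_down(self) -> str:
        # Clamp so the window never starts past the last line.
        self.top = min(self.top + WINDOW, max(len(self.lines) - 1, 0))
        return self.show()
```

Because the viewer holds its own position, the agent issues scroll_down() instead of recomputing offsets, and every line it sees already carries the number it will need for an edit.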

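Benchmark result: 3.97% → 12.47% issue resolution. That pair is roughly a threefold increase, so it reflects a different baseline comparison than the 64% figure above. In both comparisons, the performance difference was cognitive load management, not model intelligence.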
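Context compression, the last interface change listed above, might look like the following sketch. The function name, the summary wording, and the default of keeping the five most recent observations are assumptions made for illustration.

```python
def compress_history(observations: list[str], keep_last: int = 5) -> list[str]:
    """Collapse all but the most recent observations to one-line summaries."""
    cutoff = max(len(observations) - keep_last, 0)
    compressed = []
    for index, obs in enumerate(observations):
        if index < cutoff:
            # Old output is replaced by a short placeholder that records its size.
            line_count = len(obs.splitlines())
            compressed.append(f"[step {index + 1}: output of {line_count} lines omitted]")
        else:
            compressed.append(obs)
    return compressed
```

The history keeps its length and ordering, so the agent can still see that a step happened; only the bulk of stale output is dropped.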

Finding 2: Spanning Context Windows

Anthropic's harness engineering work addressed a different problem: most production projects are too large to fit in any single context window, so work has to span many sessions. Their solution was architectural, not prompt-based.

The Two-Agent Pattern

Systems using this pattern separate initialization from execution:

Initializer: First session creates scaffolding — init.sh, feature_list.json, claude-progress.txt, initial git commit.
Coding: Subsequent sessions read the scaffold, execute one feature at a time, leave clean state.

The critical component is the feature list: 200+ specific, end-to-end verifiable features, all initially marked failing. Agents cannot declare completion from partial observation. The list provides ground truth.
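A feature list used as ground truth could be as simple as the sketch below. The field names (id, description, passes) and the helper functions are hypothetical; the source only specifies that features are specific, verifiable, and initially marked failing.

```python
import json
from typing import Optional

# Hypothetical contents of feature_list.json: every feature starts as failing.
EXAMPLE = """
[
  {"id": "login-01", "description": "User can log in with valid credentials", "passes": false},
  {"id": "login-02", "description": "Invalid password shows an error", "passes": false}
]
"""


def next_failing(features: list[dict]) -> Optional[dict]:
    """Return the first feature still marked failing, or None if all pass."""
    for feature in features:
        if not feature["passes"]:
            return feature
    return None


def is_complete(features: list[dict]) -> bool:
    """Completion is read from the list, never inferred from partial observation."""
    return all(feature["passes"] for feature in features)


def mark_passing(features: list[dict], feature_id: str) -> None:
    """Flip one feature to passing after it has been verified end to end."""
    for feature in features:
        if feature["id"] == feature_id:
            feature["passes"] = True
            return
    raise KeyError(feature_id)
```

A session would call next_failing(json.loads(...)) to pick its one feature, and only is_complete() returning True would count as done.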

The startup sequence becomes standardized: confirm directory → read progress → read features → init.sh → test → begin work. Each session starts oriented rather than excavating.
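The orientation half of that sequence can be sketched as a single function. The file names come from the scaffold described above; the function name, the returned dictionary, the passes field, and the caller-supplied test command are assumptions. The sketch only gathers state and lists the commands to run; it does not execute them.

```python
import json
from pathlib import Path

REQUIRED = ["init.sh", "feature_list.json", "claude-progress.txt"]


def orient(workdir: Path, test_command: list[str]) -> dict:
    """Confirm the directory, read progress and features, and plan the next steps."""
    # Step 1: confirm we are in a scaffolded project directory.
    missing = [name for name in REQUIRED if not (workdir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"not a scaffolded project, missing: {missing}")

    # Steps 2 and 3: read the progress notes and the feature list.
    progress = (workdir / "claude-progress.txt").read_text()
    features = json.loads((workdir / "feature_list.json").read_text())
    failing = [f for f in features if not f.get("passes", False)]

    # Steps 4 and 5: the caller runs init.sh, then the project's own test command.
    return {
        "progress": progress,
        "failing_count": len(failing),
        "next_feature": failing[0] if failing else None,
        "commands": [["bash", "init.sh"], test_command],
    }
```

Keeping orientation separate from execution makes the startup path cheap to test and identical in every session.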

Finding 3: Million Lines, Zero Manual Code

OpenAI's Codex team (Aug 2025 - Feb 2026) built approximately one million lines of production code with the constraint: no human-written code. Humans steered; agents executed.

Output: 1,500 pull requests, with three engineers each averaging 3.5 PRs per day.

Key Architectural Decisions

Repository as system of record: All knowledge machine-readable in the repo. No Google Docs, no Slack threads. If the agent cannot read it in context, it does not exist.

Progressive disclosure: Short AGENTS.md as map to deeper docs/ structure. Agents orient from minimal entry points rather than comprehensive dumps.

Mechanical enforcement: Custom linters enforce architectural invariants. Human code review does not scale to 3.5 PRs per engineer per day.

Application legibility: Browser automation (CDP), observability integration (LogQL/PromQL/TraceQL), per-worktree isolation. Agents verify end-to-end, not just code.

The Seven-Layer Taxonomy

The Awesome Agent Harness repository maps the ecosystem into distinct layers:

1. Human Oversight: Approval, review, prioritization. The steering layer.
2. Spec Tools: Structured requirements and task DAGs. AI proposes; humans verify before execution.
3. Lifecycle Platforms: End-to-end management from requirements to delivery.
4. Task Runners: Bridge issue trackers to agents. Spawn workspaces, deliver PRs.
5. Orchestrators: Parallel execution via git worktree isolation.
6. Frameworks/Runtimes: Composable primitives vs. persistent infrastructure.
7. Coding Agents: The execution layer (Claude Code, Codex). A commodity; effectiveness determined by layers 1-6.

Five Patterns That Repeat

Across all implementations, these patterns appear consistently:

1. Progressive Disclosure

Minimal entry points with pointers to deeper context. Context is finite; attention is not uniformly distributed.

2. Git Worktree Isolation

One agent, one worktree. Parallel execution without collision. Validation in isolation before merge.
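One agent per worktree maps directly onto the git worktree command. The sketch below only builds the argument lists; the branch naming scheme, the sibling-directory layout, and the default base branch of main are assumptions, and a caller would pass the lists to subprocess.run.

```python
from pathlib import Path


def worktree_path(repo_root: Path, agent_id: str) -> Path:
    """Hypothetical layout: each worktree is a sibling directory of the main checkout."""
    return repo_root.parent / f"{repo_root.name}-{agent_id}"


def worktree_add_command(repo_root: Path, agent_id: str, base: str = "main") -> list[str]:
    """Build `git worktree add` for a fresh branch dedicated to one agent."""
    branch = f"agent/{agent_id}"
    path = worktree_path(repo_root, agent_id)
    return ["git", "-C", str(repo_root), "worktree", "add", "-b", branch, str(path), base]


def worktree_remove_command(repo_root: Path, agent_id: str) -> list[str]:
    """Build `git worktree remove` for cleanup after the agent's branch is merged."""
    path = worktree_path(repo_root, agent_id)
    return ["git", "-C", str(repo_root), "worktree", "remove", str(path)]
```

Each agent then works and runs validation inside its own directory and branch, so parallel sessions never touch the same working files.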

3. Spec First, Repository as System of Record

Requirements machine-readable in repo. Agents blind to informal knowledge. Feature lists prevent partial-completion errors.

4. Mechanical Architecture Enforcement

Linters, structural tests, CI replace code review at scale. Enforce invariants, not implementations.
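As an illustration of enforcing an invariant rather than an implementation, here is a minimal import-boundary check built on Python's standard ast module. The layer names and the rule itself (domain code must not import from ui or infra) are hypothetical; none of the systems discussed here publish this exact check.

```python
import ast

# Hypothetical layering rule: code in the "domain" layer must not import "ui" or "infra".
FORBIDDEN = {"domain": {"ui", "infra"}}


def find_violations(source: str, layer: str) -> list[str]:
    """Report imports in `source` that cross a forbidden layer boundary."""
    banned = FORBIDDEN.get(layer, set())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for `from . import x`
        else:
            continue
        for name in names:
            top = name.split(".")[0]
            if top in banned:
                violations.append(f"line {node.lineno}: {layer} must not import {top}")
    return violations
```

Run in CI, a check like this fails the build with a message the agent can act on, without prescribing how the domain code is written.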

5. Integrated Feedback Loops

Tighten the gap between action and consequence. Syntax errors at edit time. Runtime errors via observability. UI errors via browser automation.
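The edit-time case is the easiest loop to sketch. The example below gates edits to Python source on the built-in compile() function as a stand-in for a real linter; the function name and messages are hypothetical, and a production harness would likely call an actual lint tool instead.

```python
def apply_edit(
    lines: list[str], start: int, end: int, replacement: list[str]
) -> tuple[list[str], str]:
    """Replace lines start..end (1-indexed, inclusive); reject edits that break syntax."""
    candidate = lines[:start - 1] + replacement + lines[end:]
    try:
        # compile() parses the candidate without executing it.
        compile("\n".join(candidate) + "\n", "<edit>", "exec")
    except SyntaxError as err:
        # The original file is returned untouched, with feedback the agent can act on.
        return lines, f"Edit rejected: {err.msg} (line {err.lineno}). File unchanged."
    return candidate, "Edit applied."
```

The agent learns about the mistake in the same step that caused it, instead of several steps later when a test run fails for a reason that is harder to trace.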

Operational Implications

When a system underperforms, ask these diagnostic questions:

What information does the agent need that it cannot access?
What feedback loop is missing that would catch mistakes before propagation?
Where is context getting polluted with irrelevant information?
What constraints need mechanical enforcement rather than agent judgment?

Each answer points to a specific harness improvement. Missing information → new tool or document. Missing feedback → new test or linter. Context pollution → new compression strategy. Unenforced constraints → mechanical check.

The virtuous cycle: failures signal what the environment needs; environment improvements reduce failure frequency across all future sessions.

Conclusion

The research is consistent across organizations, models, and domains. Environment design determines performance. The harness is the durable advantage. Model capability is a commodity.

Organizations investing in harness engineering — scaffolding, feedback loops, observability, spec tooling, orchestration — will outperform those focused on model selection or prompt optimization.

Published March 18, 2026 — Prompt Engines Lab