Harness Engineering: Implementation Guide
Technical patterns from SWE-agent, Anthropic, and OpenAI. Code-level details.
Implementation 1: The ACI Tool Suite
The SWE-agent paper reported that replacing raw bash commands with purpose-built tools (an agent-computer interface, or ACI) produced a 64% relative performance improvement over a shell-only agent. Here's the implementation pattern:
Capped Search Tool
def search_file(pattern: str, path: str = ".") -> str:
    """
    Search for pattern in files. Returns max 50 matches.
    If >50 matches, returns count + suggestion to refine.
    """
    # `grep` is an assumed helper returning objects with .file, .line, .text
    matches = grep(pattern, path)
    if len(matches) > 50:
        # Over the cap: return no matches at all, only a nudge to refine
        return (
            f"Found {len(matches)} matches. Too many to display.\n"
            "Suggestion: Use a more specific pattern or narrow path.\n"
            f"Example: search_file('def {pattern}', 'src/')"
        )
    return "\n".join(f"{m.file}:{m.line}: {m.text}" for m in matches)
Key design decision: A hard cap forces refinement. Agents cannot proceed by being vague, which creates a natural specificity loop.
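The `grep` helper and the match objects used by `search_file` are not defined in the snippet above. A minimal sketch, assuming a pure-Python regex scan and a hypothetical `Match` dataclass (this is an illustration, not SWE-agent's actual implementation), might look like this:

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Match:
    """Hypothetical record for one matching line."""
    file: str
    line: int  # 1-based line number
    text: str


def grep(pattern: str, path: str = ".") -> list[Match]:
    """Scan every regular file under `path` for lines matching the regex `pattern`."""
    regex = re.compile(pattern)
    root = Path(path)
    files = [root] if root.is_file() else sorted(p for p in root.rglob("*") if p.is_file())
    matches = []
    for file in files:
        try:
            lines = file.read_text(encoding="utf-8").splitlines()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, text in enumerate(lines, start=1):
            if regex.search(text):
                matches.append(Match(str(file), number, text.strip()))
    return matches
```

Because the cap is applied by the caller, this helper can stay simple: it collects every match and leaves the decision about what to show to `search_file`.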
Stateful File Viewer
class FileViewer:
    def __init__(self):
        self.positions = {}  # file -> first line of the current window

    def view(self, file: str, offset: int = 0) -> str:
        """
        Display 100 lines from file starting at current position + offset.
        Goldilocks number: 30 lines loses context, full file loses focus.
        """
        current = self.positions.get(file, 0)
        start = max(0, current + offset)
        end = start + 100
        # `read_lines` is an assumed helper returning the lines in [start, end)
        lines = read_lines(file, start, end)
        numbered = [f"{i:4d}| {line}" for i, line in enumerate(lines, start)]
        self.positions[file] = start  # remember the window for the next call
        return "\n".join(numbered)
Key design decisions:
- 100 lines: the window size that tested best for keeping context without overwhelming the agent
- Explicit line numbers: agents reference lines directly instead of counting them
- Stateful: the viewer remembers its position across interactions
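The viewer relies on a `read_lines` helper that the snippet leaves undefined. One possible sketch, assuming 0-based, half-open indices (which is what `end = start + 100` implies for a 100-line window), streams the file rather than loading it whole:

```python
from itertools import islice


def read_lines(file: str, start: int, end: int) -> list[str]:
    """Return the lines with 0-based indices in [start, end), without trailing newlines."""
    with open(file, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, start, end)]
```

With this convention, calling `view(file, offset=100)` scrolls down one full window, and a window that runs past the end of the file simply comes back shorter.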
Linter-Integrated Editor
def edit_file(file: str, start: int, end: int, replacement: str) -> str:
    """
    Replace lines [start, end] with replacement text.
    Runs linter immediately. Returns result or error with context.
    """
    # read_lines, apply_replacement, run_linter, and revert_to are assumed helpers
    original = read_lines(file, start, end)
    # Apply edit
    apply_replacement(file, start, end, replacement)
    # Immediate validation
    lint_result = run_linter(file)
    if lint_result.errors:
        # Revert and return actionable error
        revert_to(file, original)
        original_text = "\n".join(original)  # show the lines, not a list repr
        return (
            "Edit rejected: syntax error introduced\n"
            f"Error: {lint_result.errors[0]}\n"
            f"Original:\n{original_text}\n"
            f"Your edit:\n{replacement}\n"
            "Fix the syntax error and try again."
        )
    return f"Edit successful. Lines {start}-{end} modified."
Key design decision: Edit + validate as atomic operation. Errors caught at introduction, not three steps later when chasing ghosts.
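To make the edit-plus-validate idea concrete, here is a minimal self-contained sketch. It assumes the target is a Python file, uses `ast.parse` as a stand-in for a real linter, and treats line numbers as 1-based and inclusive. The function name `edit_python_file` is hypothetical. Unlike the snippet above, it validates the candidate text in memory and writes only when it parses, so no revert step is needed:

```python
import ast
from pathlib import Path


def edit_python_file(file: str, start: int, end: int, replacement: str) -> str:
    """Replace 1-based inclusive lines [start, end]; write only if the result parses."""
    path = Path(file)
    lines = path.read_text(encoding="utf-8").splitlines()
    new_lines = lines[:start - 1] + replacement.splitlines() + lines[end:]
    new_text = "\n".join(new_lines) + "\n"
    try:
        ast.parse(new_text, filename=file)  # stand-in for a real linter
    except SyntaxError as error:
        # The file on disk was never touched, so there is nothing to revert.
        return f"Edit rejected: {error.msg} (line {error.lineno}). Fix and try again."
    path.write_text(new_text, encoding="utf-8")
    return f"Edit successful. Lines {start}-{end} replaced."
```

A syntax check catches far less than a full linter, but the control flow is the same: the agent sees the rejection in the same turn as the edit that caused it.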
Implementation 2: Two-Agent Architecture
Anthropic's pattern for long-running work that spans multiple context windows. Implementation structure:
Initializer Agent
# initializer_system_prompt.md
Your role is environment setup. Do not write features.
Create the scaffolding that future coding agents will use.
Required outputs:
1. init.sh — script to reliably start dev environment
2. feature_list.json — specific, end-to-end feature descriptions
3. claude-progress.txt — initial empty progress log
4. Git commit with message "[init] Environment initialized"
Feature list requirements:
- 200+ specific features for a production web app
- Each feature: category, description, steps[], passes: false
- All initially marked failing
- No feature is "implement the app" — all are user-visible behaviors
Feature List Schema
{
  "features": [
    {
      "category": "authentication",
      "description": "User can sign up with email and password",
      "steps": [
        "Navigate to /signup",
        "Enter email and password",
        "Click submit",
        "Verify redirect to dashboard",
        "Verify user created in database"
      ],
      "passes": false
    }
  ]
}
Key design decision: Feature list as ground truth. Agents cannot infer completion from code. Must verify against explicit criteria.
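A short sketch of how an agent harness might consume this schema. It assumes that list order encodes priority, which the schema itself does not state, and the helper names `next_failing_feature` and `mark_passed` are hypothetical:

```python
import json
from typing import Optional


def next_failing_feature(data: dict) -> Optional[dict]:
    """Return the first feature whose `passes` flag is not true, or None."""
    for feature in data["features"]:
        if not feature.get("passes", False):
            return feature
    return None


def mark_passed(data: dict, description: str) -> bool:
    """Set `passes` to true for the matching feature. Returns False if it is absent."""
    for feature in data["features"]:
        if feature["description"] == description:
            feature["passes"] = True
            return True
    return False


example = json.loads("""
{"features": [
  {"category": "authentication",
   "description": "User can sign up with email and password",
   "steps": ["Navigate to /signup", "Click submit"],
   "passes": false}
]}
""")
```

Keeping the only mutation to a single boolean per feature makes it harder for an agent to rewrite or delete acceptance criteria while claiming progress.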
Coding Agent Startup Sequence
# Standardized startup — executed at beginning of every session
1. pwd — confirm working directory
2. read claude-progress.txt — understand recent work
3. git log --oneline -20 — see recent commits
4. read feature_list.json — identify highest-priority incomplete feature
5. run init.sh — start development environment
6. run startup_test.py — verify application in working state
7. IF startup_test FAILS: fix before touching anything new
8. BEGIN work on one feature at a time
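The shell-level steps of this sequence can be sketched as a small runner that stops at the first failure. The command lines are assumptions: the source names `init.sh` and `startup_test.py` but not how they are invoked, and reading the progress log and feature list is left to the agent itself.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Assumed invocations for steps 1, 3, 5, and 6 of the startup sequence.
STARTUP_COMMANDS: list[list[str]] = [
    ["pwd"],
    ["git", "log", "--oneline", "-20"],
    ["bash", "init.sh"],
    ["python", "startup_test.py"],
]


def run_command(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(argv), check=False).returncode


def run_startup(run: Callable[[Sequence[str]], int] = run_command) -> Optional[list[str]]:
    """Run each step in order; return the first failing command, or None if all pass."""
    for argv in STARTUP_COMMANDS:
        if run(argv) != 0:
            return argv
    return None
```

A non-`None` result corresponds to step 7: the agent should repair the failing step before starting any new feature.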
Session End Requirements
# Every session ends with:
1. Git commit with descriptive message
2. Update claude-progress.txt with:
   - What was worked on
   - What was completed
   - What state things were left in
3. Verify clean state (revert if needed)
4. Update feature_list.json if feature passes
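The progress-log update can be sketched as an append-only write. The entry layout below is a hypothetical format chosen for illustration; the source only specifies which three facts each entry must record.

```python
from datetime import datetime, timezone


def append_progress(path: str, worked_on: str, completed: str, state: str) -> str:
    """Append one timestamped entry to the progress log and return the entry text."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = (
        f"## Session {stamp}\n"
        f"Worked on: {worked_on}\n"
        f"Completed: {completed}\n"
        f"State left in: {state}\n\n"
    )
    # Append mode preserves earlier sessions, so the log reads as a history.
    with open(path, "a", encoding="utf-8") as log:
        log.write(entry)
    return entry
```

Appending rather than overwriting means the next session can read the full trail, not just the most recent summary.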
Implementation 3: Mechanical Enforcement
OpenAI's approach: custom linters with remediation instructions formatted for agent consumption.
Architecture Linter
from typing import Optional

def check_layer_violation(file_path: str, import_path: str) -> Optional[str]:
    """
    Check if import violates layer architecture.
    Returns remediation message or None.
    """
    # `get_layer` and `LAYER_RULES` are assumed to be defined elsewhere
    file_layer = get_layer(file_path)  # domain, service, api, etc.
    import_layer = get_layer(import_path)
    allowed = LAYER_RULES.get(file_layer, [])
    if import_layer not in allowed:
        return (
            f"Architecture violation: {file_path} ({file_layer})\n"
            f"imports {import_path} ({import_layer})\n"
            f"Allowed imports from {file_layer}: {allowed}\n"
            "Fix: Move code to appropriate layer or use dependency inversion."
        )
    return None
Error Message Format
# Linter error format designed for agent consumption:
{
  "rule": "architecture.layer_violation",
  "violated": "api/routes.py imports domain/models.py",
  "constraint": "api layer may only import service layer",
  "remediation": [
    "Option 1: Move the required function to the service layer",
    "Option 2: Create a service-layer facade that the api layer calls instead of importing domain directly",
    "Option 3: Use dependency injection to break the coupling"
  ]
}
Key design decision: Error messages include remediation. Agents receive constraint, violation, and fix options in single context.
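Here is one way the structured error could be produced. The layer map, the path-to-layer rule, and the function name `layer_violation_error` are all hypothetical, chosen only to make the sketch self-contained:

```python
from typing import Optional

# Hypothetical layer rules: each layer maps to the layers it may import.
LAYER_RULES = {
    "api": ["api", "service"],
    "service": ["service", "domain"],
    "domain": ["domain"],
}


def get_layer(path: str) -> str:
    """Assume the first path component names the layer: 'api/routes.py' -> 'api'."""
    return path.replace("\\", "/").split("/")[0]


def layer_violation_error(file_path: str, import_path: str) -> Optional[dict]:
    """Return a structured error for a disallowed import, or None if it is allowed."""
    file_layer = get_layer(file_path)
    import_layer = get_layer(import_path)
    allowed = LAYER_RULES.get(file_layer, [])
    if import_layer in allowed:
        return None
    allowed_text = ", ".join(allowed) or "nothing"
    return {
        "rule": "architecture.layer_violation",
        "violated": f"{file_path} imports {import_path}",
        "constraint": f"{file_layer} layer may only import: {allowed_text}",
        "remediation": [
            f"Move the required code into an allowed layer ({allowed_text})",
            "Use dependency injection to break the coupling",
        ],
    }
```

Returning a dictionary rather than a formatted string lets the harness serialize it with `json.dumps` and keeps the constraint and the fix options machine-readable.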
Implementation 4: Git Worktree Orchestration
Pattern for parallel agent execution without collision.
import os

class AgentWorkspace:
    def __init__(self, task_id: str, base_branch: str = "main"):
        self.task_id = task_id
        self.worktree_path = f"/worktrees/{task_id}"
        self.branch = f"agent/{task_id}"
        # Create isolated worktree on a new branch that starts from base_branch
        # (`run` and `Result` are assumed helpers defined elsewhere)
        run(f"git worktree add -b {self.branch} {self.worktree_path} {base_branch}")

    def execute(self, agent_fn) -> Result:
        """Run agent in isolated workspace."""
        # os.chdir is process-wide: run parallel agents in separate processes
        old_cwd = os.getcwd()
        try:
            os.chdir(self.worktree_path)
            agent_fn()
            # Validate before merge
            if self.all_checks_pass():
                return Result.success(self.commit_and_push())
            return Result.failure(self.get_errors())
        finally:
            os.chdir(old_cwd)

    def cleanup(self):
        """Remove worktree after merge or failure."""
        run(f"git worktree remove {self.worktree_path}")
        run(f"git branch -D {self.branch}")
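The class above interpolates `task_id` into shell strings. A hedged alternative, sketched below, builds argument lists for `subprocess.run` and validates the task ID first. The function names, the ID pattern, and the `/worktrees` root default are assumptions for illustration:

```python
import re
import subprocess

# Assumed policy: task IDs are simple slugs, so they cannot escape the worktree root.
TASK_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def _check(task_id: str) -> None:
    if not TASK_ID.fullmatch(task_id):
        raise ValueError(f"unsafe task id: {task_id!r}")


def add_command(task_id: str, base_branch: str = "main", root: str = "/worktrees") -> list[str]:
    """Build `git worktree add -b <branch> <path> <base>` as an argument list."""
    _check(task_id)
    return ["git", "worktree", "add", "-b", f"agent/{task_id}", f"{root}/{task_id}", base_branch]


def cleanup_commands(task_id: str, root: str = "/worktrees") -> list[list[str]]:
    """Build the commands that remove the worktree and delete its branch."""
    _check(task_id)
    return [
        ["git", "worktree", "remove", "--force", f"{root}/{task_id}"],
        ["git", "branch", "-D", f"agent/{task_id}"],
    ]


def run_git(argv: list[str], cwd: str) -> subprocess.CompletedProcess:
    """Run git with an explicit cwd instead of changing the process-wide directory."""
    return subprocess.run(argv, cwd=cwd, check=True, capture_output=True, text=True)
```

Passing `cwd` per call avoids `os.chdir`, which matters if several agents share one orchestrator process.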
Implementation 5: Application Legibility
OpenAI's investment in making the application observable to agents.
Browser Automation Integration
class BrowserTool:
    """CDP-based browser automation for end-to-end verification."""

    def navigate(self, url: str) -> str:
        """Navigate to URL, return DOM snapshot."""

    def click(self, selector: str) -> str:
        """Click element, return updated DOM."""

    def fill(self, selector: str, value: str) -> str:
        """Fill form field, return updated DOM."""

    def screenshot(self) -> bytes:
        """Capture screenshot for visual verification."""

    def assert_visible(self, text: str) -> bool:
        """Verify text is visible on page."""

# Agent usage:
# 1. Navigate to feature URL
# 2. Perform user actions (click, fill)
# 3. Assert expected outcome visible
# 4. Only mark feature passed if assertion succeeds
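The four-step usage pattern can be sketched against any object that exposes the interface above. The selectors, the test credentials, the "Dashboard" marker text, and the `FakeBrowser` stand-in are all hypothetical; a real harness would drive an actual browser.

```python
from typing import Protocol


class Browser(Protocol):
    def navigate(self, url: str) -> str: ...
    def fill(self, selector: str, value: str) -> str: ...
    def click(self, selector: str) -> str: ...
    def assert_visible(self, text: str) -> bool: ...


def verify_signup(browser: Browser, feature: dict, base_url: str) -> bool:
    """Drive the sign-up flow and set `passes` only from the final assertion."""
    browser.navigate(f"{base_url}/signup")
    browser.fill("#email", "test@example.com")
    browser.fill("#password", "correct-horse-battery")
    browser.click("button[type=submit]")
    feature["passes"] = browser.assert_visible("Dashboard")
    return feature["passes"]


class FakeBrowser:
    """In-memory stand-in so the flow can be exercised without a real browser."""

    def __init__(self, signup_works: bool):
        self.signup_works = signup_works
        self.page = ""

    def navigate(self, url: str) -> str:
        self.page = "Sign up"
        return self.page

    def fill(self, selector: str, value: str) -> str:
        return self.page

    def click(self, selector: str) -> str:
        self.page = "Dashboard" if self.signup_works else "Error: try again"
        return self.page

    def assert_visible(self, text: str) -> bool:
        return text in self.page
```

The point of the sketch is the last assignment: the `passes` flag is derived from what the page shows, never from the agent's belief that the code is correct.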
Observability Integration
from typing import List

# LogEntry, MetricPoint, and Trace are assumed result types defined elsewhere.
class ObservabilityTools:
    """Query logs, metrics, traces from isolated agent task."""

    def query_logs(self, query: str, since: str = "1h") -> List[LogEntry]:
        """LogQL query against task-specific logs."""

    def query_metrics(self, metric: str, range: str = "1h") -> List[MetricPoint]:
        """PromQL query against task-specific metrics."""

    def query_traces(self, trace_id: str) -> Trace:
        """TraceQL query for distributed trace details."""

# Each agent task runs on isolated app instance with own observability.
# Data torn down after task complete. Agents debug like human engineers.
Summary: The Implementation Stack
| Component | Pattern | Source |
|---|---|---|
| Capped search tools | Forces specificity | SWE-agent |
| Stateful file viewer (100 lines) | Removes cognitive load | SWE-agent |
| Linter at edit time | Catches errors immediately | SWE-agent |
| Two-agent split | Scaffolder + executor | Anthropic |
| Feature list as ground truth | Prevents fake-done | Anthropic |
| Startup sequence | Orient before work | Anthropic |
| Mechanical enforcement | Linters, not review | OpenAI |
| Git worktree isolation | Parallel execution | General |
| Browser + observability | End-to-end verification | OpenAI |
Core principle: Model capability is a commodity. Environment design determines performance. The harness is the implementation surface.