Anthropic's Multi-Agent Harness: Generator-Evaluator Architecture
Anthropic Labs has published detailed research on multi-agent harness design for long-running application development. The work, by Prithvi Rajasekaran, shows that spending roughly 22x more ($9 to $200) turned a broken single-agent build into a functional, feature-rich application. The architecture draws inspiration from Generative Adversarial Networks (GANs), applying generator-evaluator loops to both frontend design and full-stack coding tasks that run autonomously for multiple hours.
The Context Anxiety Problem
A key insight from the research is "context anxiety" — a behavior where models begin wrapping up work prematurely as they approach what they perceive as their context limit. This manifests as agents declaring tasks complete when they are not, or rushing to finish before the context window fills. Compaction, where earlier conversation parts are summarized in place, does not solve this. The solution is context resets: clearing the context window entirely and starting a fresh agent with a structured handoff.
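A structured handoff can be as simple as a small serializable record that the outgoing agent writes and the fresh agent reads. The sketch below is an illustration only, not the harness used in the research: the field names (`state`, `completedTasks`, `nextTasks`), the helper names, and the sample values are all assumptions made for this example.

```javascript
// Hypothetical handoff record; field names are illustrative assumptions.
function buildHandoff(state, completedTasks, nextTasks) {
  return { state, completedTasks: [...completedTasks], nextTasks: [...nextTasks] };
}

// Serialize so the record survives the context reset (for example, as a file on disk).
function serializeHandoff(handoff) {
  return JSON.stringify(handoff, null, 2);
}

// Turn the stored record into the opening prompt for a fresh agent.
function handoffToPrompt(serialized) {
  const h = JSON.parse(serialized);
  return [
    `Previous state: ${h.state}`,
    `Completed: ${h.completedTasks.join(", ") || "none"}`,
    `Next: ${h.nextTasks.join(", ") || "none"}`,
    "Continue from here.",
  ].join("\n");
}

// Sample values are placeholders, not data from the research.
const openingPrompt = handoffToPrompt(
  serializeHandoff(buildHandoff("Editor UI scaffolded", ["project setup"], ["tile fill tool"]))
);
console.log(openingPrompt);
```

Because the fresh agent sees only this record, anything the next session needs has to be written into it explicitly.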
Generator-Evaluator Pattern
The architecture separates generation from evaluation. A generator produces outputs. An evaluator grades them. This addresses a critical problem: agents evaluating their own work tend to respond with confident praise even when quality is mediocre. Tuning a standalone evaluator to be skeptical proves more tractable than making a generator critical of its own work.
Frontend Design Criteria
Four criteria were developed to make subjective quality gradable:
- Design Quality: Coherent whole vs collection of parts (high weight)
- Originality: Custom decisions vs templates, penalizes "AI slop" (high weight)
- Craft: Typography, spacing, color harmony (standard weight)
- Functionality: Usability, task completion (standard weight)
The evaluator uses the Playwright MCP to interact with live pages directly, navigating and studying implementations before scoring. Runs span 5-15 iterations over up to four hours.
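One way to combine the four criteria into a single number is a weighted mean. This is a minimal sketch under stated assumptions: the source says only "high" and "standard" weight, so the numeric weights (2 and 1), the key names, and the sample scores are hypothetical.

```javascript
// Assumed numeric weights: the source only distinguishes "high" from "standard".
const WEIGHTS = { designQuality: 2, originality: 2, craft: 1, functionality: 1 };

// Weighted mean of per-criterion scores (each on a 1-10 scale).
function weightedScore(scores) {
  let total = 0;
  let weightSum = 0;
  for (const [criterion, weight] of Object.entries(WEIGHTS)) {
    if (typeof scores[criterion] !== "number") {
      throw new Error(`Missing score for ${criterion}`);
    }
    total += scores[criterion] * weight;
    weightSum += weight;
  }
  return total / weightSum;
}

// A polished but generic page: strong craft, weak originality (made-up scores).
const polishedButGeneric = { designQuality: 5, originality: 3, craft: 9, functionality: 9 };
console.log(weightedScore(polishedButGeneric).toFixed(2)); // prints "5.67"
```

With equal weights the same page would average 6.5, so the weighting pulls well-executed but template-like output down, which is the stated intent of the high-weight criteria.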
Three-Agent Architecture
| Agent | Role | Function |
|---|---|---|
| Planner | Spec expansion | 1-4 sentences → full product spec with AI features |
| Generator | Implementation | Builds one feature at a time (React, Vite, FastAPI, SQLite) |
| Evaluator | QA/Grading | Playwright click-through, criteria scoring, hard thresholds |
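The division of labor in the table can be sketched with stub functions standing in for the three agents. Everything here is an assumption for illustration: real agents would be model calls, and the feature names and scores are placeholders. The one detail taken from the table is the hard threshold, modeled as "every criterion must clear the minimum".

```javascript
// Stub agents: plain functions standing in for model calls (assumption).
const planner = (brief) => ({ brief, features: ["level editor", "play mode"] });
const generator = (spec, feature) => ({ feature, built: true });
const evaluator = (artifact) => ({
  feature: artifact.feature,
  scores: { designQuality: 7, originality: 6, craft: 8, functionality: 7 },
});

// Hard threshold: every criterion must meet the minimum, not just the average.
function passes(scores, minimum) {
  return Object.values(scores).every((s) => s >= minimum);
}

// Planner expands the brief, then each feature is built and graded in turn.
function runPipeline(brief, minimum) {
  const spec = planner(brief);
  return spec.features.map((feature) => {
    const artifact = generator(spec, feature);
    const review = evaluator(artifact);
    return { feature, passed: passes(review.scores, minimum) };
  });
}

console.log(runPipeline("A 2D game maker", 6));
```

Raising the minimum from 6 to 7 makes both stub features fail on originality alone, which is the behavior a hard threshold is meant to have.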
Sprint Contracts
Before each sprint, the generator and evaluator negotiate a "sprint contract" — agreeing on what "done" looks like. This bridges the gap between high-level user stories and testable implementation. Communication is handled via files: one agent writes, another reads and responds.
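File-based communication of this kind can be sketched with Node's built-in `fs` module. The file names (`proposal.json`, `review.json`), the JSON fields, and the approval rule are hypothetical; the source states only that one agent writes a file and another reads it and responds.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical layout: one JSON file per message in a shared directory.
function writeMessage(dir, name, payload) {
  fs.writeFileSync(path.join(dir, name), JSON.stringify(payload, null, 2));
}
function readMessage(dir, name) {
  return JSON.parse(fs.readFileSync(path.join(dir, name), "utf8"));
}

const workdir = fs.mkdtempSync(path.join(os.tmpdir(), "sprint-"));

// Generator side: propose what "done" means for this sprint (placeholder content).
writeMessage(workdir, "proposal.json", {
  sprint: 1,
  deliverable: "rectangle fill tool",
  doneWhen: ["dragging fills every tile in the region"],
});

// Evaluator side: read the proposal and answer in a separate file.
// The approval rule is a stand-in for a real model judgment.
const proposal = readMessage(workdir, "proposal.json");
writeMessage(workdir, "review.json", {
  sprint: proposal.sprint,
  status: proposal.doneWhen.length > 0 ? "APPROVED" : "REVISE",
});

const review = readMessage(workdir, "review.json");
console.log(review.status); // prints "APPROVED"

// Clean up the temporary directory.
fs.rmSync(workdir, { recursive: true, force: true });
```

Keeping each message in its own file means neither agent needs the other's conversation history, only the agreed artifact.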
Results: Game Maker Test
| Harness | Duration | Cost | Outcome |
|---|---|---|---|
| Solo | 20 min | $9 | Broken — unplayable, broken wiring |
| Full Harness | 6 hr | $200 | Functional — 16 features, 10 sprints, AI integration |
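The ratios implied by the table are easy to check. The figures below are copied from the two rows above; only the variable names are new.

```javascript
// Figures taken from the Game Maker results table.
const solo = { minutes: 20, costUsd: 9 };
const fullHarness = { minutes: 6 * 60, costUsd: 200 };

const costRatio = fullHarness.costUsd / solo.costUsd;
const durationRatio = fullHarness.minutes / solo.minutes;

console.log(`cost: ${costRatio.toFixed(1)}x, duration: ${durationRatio}x`);
// prints "cost: 22.2x, duration: 18x"
```

The full harness costs about 22 times as much and runs 18 times as long as the solo run.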
The evaluator found specific issues: "Tool only places tiles at drag start/end points instead of filling the region" and "Delete key handler condition should be selection || (selectedEntityId && activeLayer === 'entity')."
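The second finding is specific enough to express as a predicate. The sketch below wraps the quoted condition in a function; the shape of the editor state object and the function name are assumptions, and only the boolean expression comes from the evaluator's note.

```javascript
// Hypothetical editor state shape; only the condition itself is from the evaluator's note.
function shouldHandleDelete({ selection, selectedEntityId, activeLayer }) {
  return Boolean(selection || (selectedEntityId && activeLayer === "entity"));
}

// An entity selected on the entity layer: Delete should act.
console.log(shouldHandleDelete({ selection: null, selectedEntityId: "e1", activeLayer: "entity" })); // true
// The same entity id while another layer is active: Delete should be ignored.
console.log(shouldHandleDelete({ selection: null, selectedEntityId: "e1", activeLayer: "tile" })); // false
```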
Opus 4.6 Simplification
With Claude Opus 4.6, sprint decomposition became unnecessary. The model could run coherently for over two hours without resets. The evaluator moved to end-of-run rather than per-sprint. Key principle: the evaluator is worth the cost when the task sits beyond what the current model does reliably solo. As models improve, this boundary moves outward.
DAW Test (V2 Harness)
A Digital Audio Workstation (DAW) built with the updated harness took 3 hr 50 min and cost $124.70 in total. The QA agent caught real gaps across three rounds: "Clips can't be dragged/moved on the timeline," "Audio recording is stub-only," "Clip resize by edge drag not implemented."
Key Principles
- Decompose complex tasks into tractable chunks with specialized agents
- Use structured handoffs rather than compaction for long sessions
- External evaluation drives improvement; self-evaluation is unreliable
- Stress test harness assumptions as models improve
- Explicit criteria make subjective quality gradable
Conclusion
Anthropic's research demonstrates that deliberate harness design can deliver significant quality improvements, though at substantially higher cost. The generator-evaluator pattern provides a concrete framework for building reliable long-running agentic systems. As the author notes: "the space of interesting harness combinations doesn't shrink as models improve. Instead, it moves."
Quick Facts
| Metric | Value |
|---|---|
| Author | Prithvi Rajasekaran (Anthropic Labs) |
| Architecture | Generator-Evaluator (GAN-inspired) |
| Agents | 3 (Planner, Generator, Evaluator) |
| Cost Ratio | ~22x ($9 → $200) |
| Quality Delta | Broken → Functional |
| Max Runtime | 6 hours |
| Key Models | Claude Opus 4.5, 4.6 |
Core Techniques
| Technique | Purpose |
|---|---|
| Context Resets | Clean slate instead of compaction, to counter context anxiety |
| Sprint Contracts | Align spec with implementation |
| File Communication | Agent handoffs without polluting each other's context |
| Few-shot Calibration | Tune evaluator skepticism |
| Playwright MCP | Live testing vs static analysis |
When Evaluators Add Value
| Scenario | Evaluator Value |
|---|---|
| Within reliable capability | Unnecessary overhead |
| At edge of capability | High — catches critical gaps |
| Subjective quality | Essential — self-grading poor |
Opus 4.6 Impact
- Removed need for sprint decomposition
- Single continuous session viable (2+ hours)
- Evaluator moved to end-of-run
- Automatic compaction sufficient
Cost Breakdown (DAW)
| Phase | Time | Cost |
|---|---|---|
| Planner | 4.7 min | $0.46 |
| Build R1 | 2 hr 7 min | $71.08 |
| QA R1 | 8.8 min | $3.24 |
| Build R2 | 1 hr 2 min | $36.89 |
| QA R2 | 6.8 min | $3.09 |
| Build R3 | 10.9 min | $5.88 |
| QA R3 | 9.6 min | $4.06 |
| Total | 3 hr 50 min | $124.70 |
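The table's totals can be verified, and the QA share derived, with a few lines of arithmetic. All figures are copied from the rows above; costs are held in cents to avoid floating-point drift, and the `kind` labels are added for grouping.

```javascript
// Phase figures copied from the DAW cost table; costs in cents.
const phases = [
  { name: "Planner", kind: "plan", minutes: 4.7, cents: 46 },
  { name: "Build R1", kind: "build", minutes: 127, cents: 7108 },
  { name: "QA R1", kind: "qa", minutes: 8.8, cents: 324 },
  { name: "Build R2", kind: "build", minutes: 62, cents: 3689 },
  { name: "QA R2", kind: "qa", minutes: 6.8, cents: 309 },
  { name: "Build R3", kind: "build", minutes: 10.9, cents: 588 },
  { name: "QA R3", kind: "qa", minutes: 9.6, cents: 406 },
];

const sumCents = (rows) => rows.reduce((acc, row) => acc + row.cents, 0);
const totalCents = sumCents(phases);
const qaCents = sumCents(phases.filter((p) => p.kind === "qa"));
const totalMinutes = phases.reduce((acc, p) => acc + p.minutes, 0);

console.log(`total: $${(totalCents / 100).toFixed(2)}`); // prints "total: $124.70"
console.log(`QA share: ${((qaCents / totalCents) * 100).toFixed(1)}%`); // prints "QA share: 8.3%"
console.log(`minutes: ${Math.round(totalMinutes)}`); // prints "minutes: 230"
```

The three QA rounds account for about 8% of the spend, and the 230 minutes match the stated 3 hr 50 min.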
Implementation Guide
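This section sketches how to build multi-agent harnesses with generator-evaluator loops. The snippets are JavaScript-style pseudocode: `generator`, `evaluator`, and `create_agent` stand in for whatever agent SDK is in use.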
Basic Loop Structure
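// Generator with criteria
generator.instructions = `
  Create frontends. Weight design quality and
  originality higher than craft/functionality.
  Decide after each eval: refine or pivot.
`;

// Evaluator with Playwright
evaluator.instructions = `
  Navigate live pages via Playwright MCP.
  Score 1-10 on each criterion.
  BE SKEPTICAL: err toward lower scores.
  Output: {scores, critique, specific_fixes}
`;

// Iteration loop. `initial_prompt`, `max_iterations`, and `threshold`
// are assumed to be supplied by the surrounding harness.
let prompt_or_feedback = initial_prompt;
for (let i = 0; i < max_iterations; i++) {
  const result = generator.run(prompt_or_feedback);
  // Named `evaluation` because `eval` cannot be rebound in strict mode
  const evaluation = evaluator.run(result);
  // Stop once every criterion clears the threshold
  if (Object.values(evaluation.scores).every((s) => s >= threshold)) break;
  // Tell the generator whether to refine or pivot
  generator.instruct(evaluation.trend === "improving" ? "refine" : "pivot");
  // Feed the critique into the next iteration
  prompt_or_feedback = evaluation.critique;
}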
Sprint Contract Pattern
function negotiate_contract(generator, evaluator, sprint) {
  // Generator proposes (sprint is an object, so serialize it for the prompt)
  let proposal = generator.run(`
    Propose implementation for: ${JSON.stringify(sprint)}
    Include: what you'll build, how success is verified
  `);

  // Evaluator reviews
  let review = evaluator.run(`
    Review proposal: ${proposal}
    Is this the right thing to build?
    Criteria: ${sprint.criteria}
  `);

  // Iterate until agreement (`let` above, since both are reassigned here)
  while (review.status !== "APPROVED") {
    proposal = generator.run(review.feedback);
    review = evaluator.run(proposal);
  }

  return proposal; // This is the contract
}

function run_sprint(generator, evaluator, contract) {
  // Build
  let implementation = generator.run(contract);

  // Test against the contract with hard thresholds
  let result = evaluator.run({
    artifact: implementation,
    contract: contract,
    mode: "strict"
  });

  // Iterate if failed, keeping the latest revision and the same strict mode
  while (result.status === "FAIL") {
    implementation = generator.run(result.feedback);
    result = evaluator.run({
      artifact: implementation,
      contract: contract,
      mode: "strict"
    });
  }

  return implementation; // The revision that passed, not the first attempt
}
Context Reset Pattern
// `create_agent` and `base_instructions` are assumed to be defined by the harness.
function context_reset(current_agent, handoff_artifact) {
  // Terminate current agent
  current_agent.end_session();

  // Create fresh agent
  const new_agent = create_agent({
    instructions: base_instructions,
    context: null // Clean slate
  });

  // Provide handoff
  new_agent.run(`
    Previous state: ${handoff_artifact.state}
    Completed: ${handoff_artifact.completed_tasks}
    Next: ${handoff_artifact.next_tasks}
    Continue from here.
  `);

  return new_agent;
}
When to Add Complexity
| Model | Sprint Decomp | Context Reset | Per-sprint QA |
|---|---|---|---|
| Sonnet 4.5 | Required | Required | Required |
| Opus 4.6 | Optional | Optional | Optional |
Principle: Start simple. Add components only when baseline fails. Re-evaluate with each model release.
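The table above can be captured as a small configuration lookup so that harness components are switched per model rather than hard-coded. This is a sketch under assumptions: the profile keys are informal labels (not API model identifiers), the object shape is hypothetical, and "Optional" is modeled as off by default with an explicit override.

```javascript
// Settings mirror the "When to Add Complexity" table; keys and shape are assumptions.
const HARNESS_PROFILES = {
  "sonnet-4.5": { sprintDecomposition: true, contextReset: true, perSprintQa: true },
  // "Optional" is modeled as disabled unless the caller opts in.
  "opus-4.6": { sprintDecomposition: false, contextReset: false, perSprintQa: false },
};

// Look up a profile and apply per-run overrides; unknown models are an error
// so that a new model release forces an explicit re-evaluation.
function harnessFor(modelKey, overrides = {}) {
  const base = HARNESS_PROFILES[modelKey];
  if (!base) {
    throw new Error(`No harness profile for "${modelKey}"; evaluate the baseline first`);
  }
  return { ...base, ...overrides };
}

console.log(harnessFor("opus-4.6"));
console.log(harnessFor("opus-4.6", { perSprintQa: true }));
```

Making an unknown model an error instead of choosing a silent default is a design choice that matches the advice to re-evaluate with each release.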
Author: Prithvi Rajasekaran (Anthropic Labs)