Anthropic's Multi-Agent Harness: Generator-Evaluator Architecture

Anthropic Labs has published detailed research on multi-agent harness design for long-running application development. The work, by Prithvi Rajasekaran, reports that spending more than 20x as much ($9 versus $200) turned a broken single-agent output into a functional, feature-rich application. The architecture draws inspiration from Generative Adversarial Networks (GANs), applying generator-evaluator loops to both frontend design and full-stack coding tasks that run autonomously for multiple hours.

The Context Anxiety Problem

A key insight from the research is "context anxiety": a behavior where models begin wrapping up work prematurely as they approach what they perceive as their context limit. This manifests as agents declaring tasks complete when they are not, or rushing to finish before the context window fills. Compaction, where earlier parts of the conversation are summarized in place, does not solve this on its own, because the same agent keeps running without a clean slate. The solution is context resets: clearing the context window entirely and starting a fresh agent with a structured handoff.

Generator-Evaluator Pattern

The architecture separates generation from evaluation. A generator produces outputs. An evaluator grades them. This addresses a critical problem: agents evaluating their own work tend to respond with confident praise even when quality is mediocre. Tuning a standalone evaluator to be skeptical proves more tractable than making a generator critical of its own work.
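One way to tune an evaluator toward skepticism is to show it graded examples before it scores anything. The following is a minimal sketch, assuming a hypothetical buildEvaluatorPrompt helper and invented calibration examples; none of these names or example texts come from the research itself.

```javascript
// Hypothetical helper: builds a skeptical evaluator prompt from criteria and
// a few graded examples (few-shot calibration). All names are illustrative.
function buildEvaluatorPrompt(criteria, examples) {
  const criteriaLines = criteria.map((c) => `- ${c}`).join("\n");
  const exampleLines = examples
    .map((e, i) => `Example ${i + 1}: ${e.description}\nScore: ${e.score}/10\nWhy: ${e.reason}`)
    .join("\n\n");
  return [
    "You are a QA evaluator. You did not write this work.",
    "Be skeptical: when in doubt, choose the lower score.",
    "Criteria:",
    criteriaLines,
    "Calibration examples:",
    exampleLines,
  ].join("\n");
}

// Invented calibration data, for illustration only.
const evaluatorPrompt = buildEvaluatorPrompt(
  ["Design Quality", "Originality", "Craft", "Functionality"],
  [
    {
      description: "Default component-library look with a centered hero section",
      score: 3,
      reason: "Template-like; no custom decisions.",
    },
    {
      description: "Distinct visual identity applied consistently across pages",
      score: 8,
      reason: "Coherent whole with deliberate choices.",
    },
  ]
);
console.log(evaluatorPrompt);
```

Keeping the calibration examples in data rather than in the prompt text makes it easier to add a new low-scoring example whenever the evaluator turns out to be too lenient.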

Frontend Design Criteria

Four criteria were developed to make subjective quality gradable:

Design Quality: Coherent whole vs collection of parts (high weight)
Originality: Custom decisions vs templates, penalizes "AI slop" (high weight)
Craft: Typography, spacing, color harmony (standard weight)
Functionality: Usability, task completion (standard weight)
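The weighting can be made concrete with a weighted mean. This is only a sketch: the source says "high" and "standard" weight without giving numbers, so the 2:1 ratio, the camelCase keys, and the weightedScore function are all assumptions.

```javascript
// Assumed numeric weights: the source only says "high" vs "standard".
const WEIGHTS = { designQuality: 2, originality: 2, craft: 1, functionality: 1 };

// Weighted mean of per-criterion scores (each on a 1-10 scale).
function weightedScore(scores, weights = WEIGHTS) {
  let total = 0;
  let weightSum = 0;
  for (const [criterion, weight] of Object.entries(weights)) {
    if (typeof scores[criterion] !== "number") {
      throw new Error(`Missing score for ${criterion}`);
    }
    total += scores[criterion] * weight;
    weightSum += weight;
  }
  return total / weightSum;
}

// A polished but generic page: strong craft, weak design and originality.
const polishedButGeneric = { designQuality: 4, originality: 3, craft: 9, functionality: 9 };

// (4*2 + 3*2 + 9*1 + 9*1) / 6, which is lower than the plain mean of 25 / 4.
console.log(weightedScore(polishedButGeneric).toFixed(2));
```

With these assumed weights, a page that is well crafted but template-like scores below its unweighted average, which is the behavior the criteria are meant to produce.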

The evaluator uses the Playwright MCP to interact with live pages directly, navigating and studying implementations before scoring. Runs span 5-15 iterations over up to four hours.

Three-Agent Architecture

Agent	Role	Function
Planner	Spec expansion	1-4 sentences → full product spec with AI features
Generator	Implementation	Builds one feature at a time (React, Vite, FastAPI, SQLite)
Evaluator	QA/Grading	Playwright click-through, criteria scoring, hard thresholds
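The flow between the three roles can be sketched with stub agents. Everything below is hypothetical: the stubs replace real model calls, the feature names are invented, and the rule that a first attempt always fails exists only to show the feedback path.

```javascript
// Stub agents standing in for real model calls. Everything here is hypothetical.
const planner = {
  // Expands a short idea into a spec with a feature list.
  run: (idea) => ({ idea, features: ["level editor", "sprite editor", "play mode"] }),
};

const generator = {
  // Pretends to build a feature; an attempt without feedback is incomplete.
  run: (feature, feedback) => ({ feature, complete: feedback !== null }),
};

const evaluator = {
  // Hard threshold: an incomplete feature fails outright.
  run: (artifact) =>
    artifact.complete
      ? { status: "PASS", feedback: null }
      : { status: "FAIL", feedback: `${artifact.feature} is not wired up` },
};

function runHarness(idea, maxAttempts = 3) {
  const spec = planner.run(idea);
  const log = [];
  // The generator works on one feature at a time.
  for (const feature of spec.features) {
    let feedback = null;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      const artifact = generator.run(feature, feedback);
      const verdict = evaluator.run(artifact);
      log.push({ feature, attempt, status: verdict.status });
      if (verdict.status === "PASS") break;
      feedback = verdict.feedback; // the critique drives the next attempt
    }
  }
  return log;
}

const harnessLog = runHarness("A 2D retro game maker");
console.log(harnessLog);
```

The attempt cap is a design choice of this sketch: without it, a generator that never satisfies the evaluator would loop indefinitely.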

Sprint Contracts

Before each sprint, the generator and evaluator negotiate a "sprint contract" — agreeing on what "done" looks like. This bridges the gap between high-level user stories and testable implementation. Communication is handled via files: one agent writes, another reads and responds.
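File-based negotiation can be sketched with Node's built-in fs module. The file names, the JSON shape, and the approval rule (every deliverable needs a verification step) are assumptions made for illustration, not details from the research.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical layout: one JSON file per message in a shared directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "sprint-contract-"));
const proposalPath = path.join(dir, "proposal.json");
const reviewPath = path.join(dir, "review.json");

// Generator side: write the proposed contract.
function writeProposal(proposal) {
  fs.writeFileSync(proposalPath, JSON.stringify(proposal, null, 2));
}

// Evaluator side: read the proposal and respond in a separate file.
// Assumed rule: every deliverable must say how it will be verified.
function reviewProposal() {
  const proposal = JSON.parse(fs.readFileSync(proposalPath, "utf8"));
  const missing = proposal.deliverables.filter((d) => !d.verifiedBy);
  const review =
    missing.length === 0
      ? { status: "APPROVED", feedback: [] }
      : { status: "REVISE", feedback: missing.map((d) => `No check for: ${d.name}`) };
  fs.writeFileSync(reviewPath, JSON.stringify(review, null, 2));
  return review;
}

writeProposal({
  sprint: 1,
  deliverables: [
    { name: "Rectangle fill tool", verifiedBy: "Drag across the grid; every tile in the region is filled" },
    { name: "Delete selected entity" }, // no verification step yet
  ],
});
const firstReview = reviewProposal();
console.log(firstReview.status, firstReview.feedback);
```

Because each side only reads and writes files, neither agent's context is filled with the other's conversation history; the files are the whole interface.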

Results: Game Maker Test

Harness	Duration	Cost	Outcome
Solo	20 min	$9	Broken — unplayable, broken wiring
Full Harness	6 hr	$200	Functional — 16 features, 10 sprints, AI integration

The evaluator found specific issues: "Tool only places tiles at drag start/end points instead of filling the region" and "Delete key handler condition should be selection || (selectedEntityId && activeLayer === 'entity')."

Opus 4.6 Simplification

With Claude Opus 4.6, sprint decomposition became unnecessary. The model could run coherently for over two hours without resets. The evaluator moved to end-of-run rather than per-sprint. Key principle: the evaluator is worth the cost when the task sits beyond what the current model does reliably solo. As models improve, this boundary moves outward.

DAW Test (V2 Harness)

A Digital Audio Workstation (DAW) built with the updated harness took 3 hr 50 min and cost $124.70 in total. The QA agent caught real gaps across three rounds: "Clips can't be dragged/moved on the timeline," "Audio recording is stub-only," "Clip resize by edge drag not implemented."

Key Principles

Decompose complex tasks into tractable chunks with specialized agents
Use structured handoffs rather than compaction for long sessions
External evaluation drives improvement; self-evaluation is unreliable
Stress test harness assumptions as models improve
Explicit criteria make subjective quality gradable

Conclusion

Anthropic's research demonstrates that significant quality improvements are achievable through deliberate architecture design, though at substantially higher cost. The generator-evaluator pattern provides a concrete framework for building reliable long-running agentic systems. As the author notes: "the space of interesting harness combinations doesn't shrink as models improve. Instead, it moves."

Quick Facts

Metric	Value
Author	Prithvi Rajasekaran (Anthropic Labs)
Architecture	Generator-Evaluator (GAN-inspired)
Agents	3 (Planner, Generator, Evaluator)
Cost Ratio	More than 20x ($9 → $200)
Quality Delta	Broken → Functional
Max Runtime	6 hours
Key Models	Claude Opus 4.5, 4.6

Core Techniques

Technique	Purpose
Context Resets	Clean slate instead of compaction, to counter context anxiety
Sprint Contracts	Align spec with implementation
File Communication	Agent handoffs without pollution
Few-shot Calibration	Tune evaluator skepticism
Playwright MCP	Live testing vs static analysis

When Evaluators Add Value

Scenario	Evaluator Value
Within reliable capability	Unnecessary overhead
At edge of capability	High — catches critical gaps
Subjective quality	Essential — self-grading poor

Opus 4.6 Impact

Removed need for sprint decomposition
Single continuous session viable (2+ hours)
Evaluator moved to end-of-run
Automatic compaction sufficient

Cost Breakdown (DAW)

Phase	Time	Cost
Planner	4.7 min	$0.46
Build R1	2 hr 7 min	$71.08
QA R1	8.8 min	$3.24
Build R2	1 hr 2 min	$36.89
QA R2	6.8 min	$3.09
Build R3	10.9 min	$5.88
QA R3	9.6 min	$4.06
Total	3 hr 50 min	$124.70
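The totals in the table above can be checked, and the QA share of the spend computed, with a few lines of arithmetic. The figures are copied from the table; the variable names and the "kind" labels are only illustrative.

```javascript
// Phase data copied from the table above; costs are held in cents to avoid
// floating-point drift when summing.
const phases = [
  { name: "Planner", kind: "plan", minutes: 4.7, cents: 46 },
  { name: "Build R1", kind: "build", minutes: 127, cents: 7108 },
  { name: "QA R1", kind: "qa", minutes: 8.8, cents: 324 },
  { name: "Build R2", kind: "build", minutes: 62, cents: 3689 },
  { name: "QA R2", kind: "qa", minutes: 6.8, cents: 309 },
  { name: "Build R3", kind: "build", minutes: 10.9, cents: 588 },
  { name: "QA R3", kind: "qa", minutes: 9.6, cents: 406 },
];

const sumCents = (rows) => rows.reduce((acc, p) => acc + p.cents, 0);

const totalCents = sumCents(phases);
const qaCents = sumCents(phases.filter((p) => p.kind === "qa"));
const qaSharePct = (100 * qaCents) / totalCents;
const totalMinutes = phases.reduce((acc, p) => acc + p.minutes, 0);

console.log(`Total: $${(totalCents / 100).toFixed(2)}`); // Total: $124.70
console.log(`QA share: ${qaSharePct.toFixed(1)}%`);      // QA share: 8.3%
console.log(`Runtime: ${Math.round(totalMinutes)} min`); // Runtime: 230 min
```

The three QA rounds together account for about 8% of the total cost, so almost all of the spend goes to the build phases.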

Implementation Guide

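The sketches in this guide show how to build a multi-agent harness around generator-evaluator loops. They are pseudocode-level JavaScript: generator, evaluator, and their run methods stand in for whatever agent API you use.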

Basic Loop Structure

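// Generator with criteria
generator.instructions = `
  Create frontends. Weight design quality and
  originality higher than craft/functionality.
  Decide after each eval: refine or pivot.
`;

// Evaluator with Playwright
evaluator.instructions = `
  Navigate live pages via Playwright MCP.
  Score 1-10 on each criterion.
  BE SKEPTICAL: err toward lower scores.
  Output: {scores, critique, specific_fixes}
`;

// Iteration loop. `generator` and `evaluator` are placeholder agent objects,
// not a specific SDK; `scores` maps each criterion to a number.
let input = prompt;
let previous_mean = -Infinity;
for (let i = 0; i < max_iterations; i++) {
  const result = generator.run(input);
  const evaluation = evaluator.run(result); // avoid naming a variable `eval`

  // Stop only when every criterion clears the threshold
  const scores = Object.values(evaluation.scores);
  if (scores.every((s) => s >= threshold)) break;

  // Refine while scores are improving; otherwise pivot to a new direction
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const direction = mean > previous_mean ? "refine" : "pivot";
  previous_mean = mean;

  // Feed the critique back in as the next prompt
  input = `${direction}: ${evaluation.critique}`;
}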

Sprint Contract Pattern

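function negotiate_contract(generator, evaluator, sprint, max_rounds = 5) {
  // Generator proposes. `sprint.description` is an assumed field holding
  // the user story text for this sprint.
  let proposal = generator.run(`
    Propose implementation for: ${sprint.description}
    Include: what you'll build, how success is verified
  `);

  // Evaluator reviews; the same prompt (with criteria) is used every round
  const review_prompt = (p) => `
    Review proposal: ${p}
    Is this the right thing to build?
    Criteria: ${sprint.criteria}
  `;
  let review = evaluator.run(review_prompt(proposal));

  // Iterate until agreement, capped so a disagreement cannot loop forever
  for (let round = 1; review.status !== "APPROVED"; round++) {
    if (round > max_rounds) throw new Error("Contract not approved");
    proposal = generator.run(review.feedback);
    review = evaluator.run(review_prompt(proposal));
  }

  return proposal; // This is the contract
}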

function run_sprint(generator, evaluator, contract) {
  // Build
  let implementation = generator.run(contract);

  // Test
  let result = evaluator.run({
    artifact: implementation,
    contract: contract,
    mode: "strict" // Hard thresholds
  });

  // Iterate if failed, keeping the latest revision and the same strict mode
  while (result.status === "FAIL") {
    implementation = generator.run(result.feedback);
    result = evaluator.run({
      artifact: implementation,
      contract: contract,
      mode: "strict"
    });
  }

  return implementation; // The revision that passed
}

Context Reset Pattern

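// `create_agent` and `base_instructions` are placeholders for your own agent
// factory and system prompt; they are passed in rather than read from globals.
function context_reset(current_agent, handoff_artifact, create_agent, base_instructions) {
  // Terminate current agent
  current_agent.end_session();

  // Create fresh agent
  const new_agent = create_agent({
    instructions: base_instructions,
    context: null // Clean slate
  });

  // Provide handoff: the only state the new agent receives
  new_agent.run(`
    Previous state: ${handoff_artifact.state}
    Completed: ${handoff_artifact.completed_tasks}
    Next: ${handoff_artifact.next_tasks}
    Continue from here.
  `);

  return new_agent;
}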
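The handoff artifact itself has to be assembled before the reset. A minimal sketch follows, assuming a hypothetical task list of the shape { id, title, done }; the three output fields mirror the ones context_reset reads, and the example tasks are invented.

```javascript
// Hypothetical task shape: { id, title, done }. The artifact fields mirror
// the ones read by context_reset: state, completed_tasks, next_tasks.
function buildHandoff(tasks, stateSummary) {
  return {
    state: stateSummary,
    completed_tasks: tasks.filter((t) => t.done).map((t) => t.title),
    next_tasks: tasks.filter((t) => !t.done).map((t) => t.title),
  };
}

// Invented example data.
const handoff = buildHandoff(
  [
    { id: 1, title: "Project scaffold", done: true },
    { id: 2, title: "Tile placement", done: true },
    { id: 3, title: "Entity deletion", done: false },
  ],
  "App builds and runs; tile editor works; entities cannot be deleted yet"
);

// Serialize so the artifact can be written to a file and read by the next agent.
const serialized = JSON.stringify(handoff, null, 2);
console.log(serialized);
```

Keeping the artifact small and structured is the point: the fresh agent gets the current state and the remaining work, but none of the previous agent's conversation.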

When to Add Complexity

Model	Sprint Decomp	Context Reset	Per-sprint QA
Sonnet 4.5	Required	Required	Required
Opus 4.6	Optional	Optional	Optional

Principle: Start simple. Add components only when baseline fails. Re-evaluate with each model release.

Source: Harness design for long-running application development — Anthropic Engineering, March 2026
Author: Prithvi Rajasekaran (Anthropic Labs)