LabNotes

Anthropic's Multi-Agent Harness: Generator-Evaluator Architecture

Anthropic Labs has published detailed research on multi-agent harness design for long-running application development. In the work, by Prithvi Rajasekaran, a more than 20x cost increase ($9 to $200) turned a broken single-agent output into a functional, feature-rich application. The architecture draws inspiration from Generative Adversarial Networks (GANs), applying generator-evaluator loops to both frontend design and full-stack coding tasks that run autonomously for multiple hours.

The Context Anxiety Problem

A key insight from the research is "context anxiety": models begin wrapping up work prematurely as they approach what they perceive as their context limit. Agents declare tasks complete when they are not, or rush to finish before the context window fills. Compaction, where earlier parts of the conversation are summarized in place, does not solve this, because the same agent keeps running without a clean slate. The solution is a context reset: clearing the context window entirely and starting a fresh agent with a structured handoff.

Generator-Evaluator Pattern

The architecture separates generation from evaluation. A generator produces outputs. An evaluator grades them. This addresses a critical problem: agents evaluating their own work tend to respond with confident praise even when quality is mediocre. Tuning a standalone evaluator to be skeptical proves more tractable than making a generator critical of its own work.

Frontend Design Criteria

Four criteria were developed to make subjective quality gradable:

  • Design Quality: Coherent whole vs collection of parts (high weight)
  • Originality: Custom decisions vs templates, penalizes "AI slop" (high weight)
  • Craft: Typography, spacing, color harmony (standard weight)
  • Functionality: Usability, task completion (standard weight)

The evaluator uses the Playwright MCP to interact with live pages directly, navigating and studying implementations before scoring. Runs span 5-15 iterations over up to four hours.
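The weighting of the four criteria can be sketched as a small scoring function. The numeric weights (2 for high, 1 for standard) and the 1-10 scale per criterion are assumptions for illustration; the source only distinguishes high from standard weight.

```javascript
// Hypothetical weights: the source says "high" vs "standard", not numbers.
const WEIGHTS = {
  designQuality: 2, // high weight
  originality: 2,   // high weight
  craft: 1,         // standard weight
  functionality: 1, // standard weight
};

// Combine per-criterion scores (assumed 1-10) into a weighted average.
function weightedScore(scores) {
  let total = 0;
  let weightSum = 0;
  for (const [criterion, weight] of Object.entries(WEIGHTS)) {
    if (typeof scores[criterion] !== "number") {
      throw new Error(`Missing score for ${criterion}`);
    }
    total += scores[criterion] * weight;
    weightSum += weight;
  }
  return total / weightSum;
}

// Strong craft and functionality cannot fully offset weak originality.
const example = { designQuality: 6, originality: 4, craft: 8, functionality: 9 };
console.log(weightedScore(example).toFixed(2)); // 6.17
```

An unweighted mean of the same scores would be 6.75, so the weighting pulls the result toward the design and originality scores.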

Three-Agent Architecture

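| Agent     | Role           | Function                                                     |
|-----------|----------------|--------------------------------------------------------------|
| Planner   | Spec expansion | 1-4 sentences → full product spec with AI features           |
| Generator | Implementation | Builds one feature at a time (React, Vite, FastAPI, SQLite)  |
| Evaluator | QA/Grading     | Playwright click-through, criteria scoring, hard thresholds  |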
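The division of labor between the three agents can be sketched as a pipeline of plain functions. Everything here is a stand-in: the stub planner, generator, and evaluator, the feature names, and the `maxRounds` cap are assumptions for illustration, and a real harness would call a model at each step.

```javascript
// Hypothetical stand-ins for model-backed agents.
const planner = (brief) => ({
  brief,
  features: ["level editor", "sprite editor", "play mode"], // illustrative only
});

const generator = (feature, feedback) => ({
  feature,
  revision: feedback ? feedback.revision + 1 : 1,
});

// Stub evaluator: fails the first revision to exercise the feedback path.
const evaluator = (build) => ({
  pass: build.revision >= 2,
  revision: build.revision,
  notes: build.revision >= 2 ? "ok" : `${build.feature}: needs another pass`,
});

function runPipeline(brief, maxRounds = 3) {
  const spec = planner(brief); // short brief -> feature list
  const log = [];
  for (const feature of spec.features) { // one feature at a time
    let feedback = null;
    for (let round = 1; round <= maxRounds; round++) {
      const build = generator(feature, feedback);
      const verdict = evaluator(build);
      log.push({ feature, round, pass: verdict.pass });
      if (verdict.pass) break;
      feedback = verdict; // failed verdict feeds the next revision
    }
  }
  return log;
}

const pipelineLog = runPipeline("A 2D retro game maker");
console.log(pipelineLog);
```

With these stubs each feature fails once and passes on the second round, so the log holds two entries per feature.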

Sprint Contracts

Before each sprint, the generator and evaluator negotiate a "sprint contract" — agreeing on what "done" looks like. This bridges the gap between high-level user stories and testable implementation. Communication is handled via files: one agent writes, another reads and responds.
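File-based communication between the two agents might look like the following sketch. The directory layout, file names, JSON shape, and the stub approval rule are all assumptions; the source only states that one agent writes a file and another reads and responds.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical layout: one JSON file per message in a shared directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "sprint-"));
const proposalPath = path.join(dir, "proposal.json");
const reviewPath = path.join(dir, "review.json");

// Generator side: write the proposed contract.
function writeProposal(feature, checks) {
  fs.writeFileSync(proposalPath, JSON.stringify({ feature, checks }, null, 2));
}

// Evaluator side: read the proposal and respond in its own file.
function reviewProposal() {
  const proposal = JSON.parse(fs.readFileSync(proposalPath, "utf8"));
  const approved = proposal.checks.length > 0; // stub rule: must be testable
  const review = {
    status: approved ? "APPROVED" : "CHANGES_REQUESTED",
    feedback: approved ? "" : "List at least one verifiable check.",
  };
  fs.writeFileSync(reviewPath, JSON.stringify(review, null, 2));
  return review;
}

writeProposal("rectangle fill tool", []);
const firstReview = reviewProposal();
console.log(firstReview.status); // CHANGES_REQUESTED

writeProposal("rectangle fill tool", ["dragging a region fills every tile in it"]);
const secondReview = reviewProposal();
console.log(secondReview.status); // APPROVED
```

Keeping each message in a file means neither agent needs the other's conversation history, only the latest proposal or review.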

Results: Game Maker Test

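| Harness      | Duration | Cost | Outcome                                             |
|--------------|----------|------|-----------------------------------------------------|
| Solo         | 20 min   | $9   | Broken: unplayable, broken wiring                   |
| Full Harness | 6 hr     | $200 | Functional: 16 features, 10 sprints, AI integration |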

The evaluator found specific issues: "Tool only places tiles at drag start/end points instead of filling the region" and "Delete key handler condition should be selection || (selectedEntityId && activeLayer === 'entity')."

Opus 4.6 Simplification

With Claude Opus 4.6, sprint decomposition became unnecessary. The model could run coherently for over two hours without resets. The evaluator moved to end-of-run rather than per-sprint. Key principle: the evaluator is worth the cost when the task sits beyond what the current model does reliably solo. As models improve, this boundary moves outward.

DAW Test (V2 Harness)

A Digital Audio Workstation built with the updated harness: 3 hr 50 min, $124.70 total. The QA agent caught real gaps across three rounds: "Clips can't be dragged/moved on the timeline," "Audio recording is stub-only," "Clip resize by edge drag not implemented."

Key Principles

  • Decompose complex tasks into tractable chunks with specialized agents
  • Use structured handoffs rather than compaction for long sessions
  • External evaluation drives improvement; self-evaluation is unreliable
  • Stress test harness assumptions as models improve
  • Explicit criteria make subjective quality gradable

Conclusion

Anthropic's research demonstrates that deliberate harness design can deliver significant quality improvements, albeit at substantially higher cost. The generator-evaluator pattern provides a concrete framework for building reliable long-running agentic systems. As the author notes: "the space of interesting harness combinations doesn't shrink as models improve. Instead, it moves."

Quick Facts

| Metric        | Value                                |
|---------------|--------------------------------------|
| Author        | Prithvi Rajasekaran (Anthropic Labs) |
| Architecture  | Generator-Evaluator (GAN-inspired)   |
| Agents        | 3 (Planner, Generator, Evaluator)    |
| Cost Ratio    | >20x ($9 → $200)                     |
| Quality Delta | Broken → Functional                  |
| Max Runtime   | 6 hours                              |
| Key Models    | Claude Opus 4.5, 4.6                 |

Core Techniques

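| Technique            | Purpose                                                      |
|----------------------|--------------------------------------------------------------|
| Context Resets       | Clean slate instead of compaction, to counter context anxiety |
| Sprint Contracts     | Align spec with implementation                               |
| File Communication   | Agent handoffs without context pollution                     |
| Few-shot Calibration | Tune evaluator skepticism                                    |
| Playwright MCP       | Live testing instead of static analysis                      |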
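Few-shot calibration could be sketched as building the evaluator prompt from a handful of pre-graded examples. The example descriptions, scores, reasons, and prompt wording below are invented for illustration and are not taken from the source.

```javascript
// Hypothetical calibration set: descriptions and scores are made up.
const calibration = [
  {
    description: "Centered hero, purple gradient, stock card grid",
    score: 3,
    reason: "Template defaults, no custom decisions.",
  },
  {
    description: "Custom type scale and layout tied to the product's theme",
    score: 8,
    reason: "Coherent identity with deliberate choices.",
  },
];

// Assemble a grading prompt that anchors the evaluator to the graded examples.
function buildEvaluatorPrompt(criterion, examples) {
  const shots = examples
    .map((e, i) => `Example ${i + 1}: ${e.description}\nScore: ${e.score}/10\nWhy: ${e.reason}`)
    .join("\n\n");
  return [
    `Grade the page on "${criterion}" from 1 to 10.`,
    "Be skeptical: when unsure, choose the lower score.",
    "Calibrate against these graded examples:",
    shots,
  ].join("\n\n");
}

const evaluatorPrompt = buildEvaluatorPrompt("Originality", calibration);
console.log(evaluatorPrompt);
```

Including a low-scoring example alongside a high-scoring one gives the evaluator a concrete picture of what a 3 looks like, which is one plausible way to push scores away from default praise.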

When Evaluators Add Value

| Scenario                   | Evaluator Value                    |
|----------------------------|------------------------------------|
| Within reliable capability | Unnecessary overhead               |
| At edge of capability      | High: catches critical gaps        |
| Subjective quality         | Essential: self-grading is poor    |

Opus 4.6 Impact

  • Removed need for sprint decomposition
  • Single continuous session viable (2+ hours)
  • Evaluator moved to end-of-run
  • Automatic compaction sufficient

Cost Breakdown (DAW)

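| Phase    | Time        | Cost    |
|----------|-------------|---------|
| Planner  | 4.7 min     | $0.46   |
| Build R1 | 2 hr 7 min  | $71.08  |
| QA R1    | 8.8 min     | $3.24   |
| Build R2 | 1 hr 2 min  | $36.89  |
| QA R2    | 6.8 min     | $3.09   |
| Build R3 | 10.9 min    | $5.88   |
| QA R3    | 9.6 min     | $4.06   |
| Total    | 3 hr 50 min | $124.70 |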
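As a quick check, the per-phase figures can be summed to confirm the reported totals and to see how small the QA share is. The figures are copied from the breakdown above; the variable names are illustrative.

```javascript
// Phase figures from the DAW breakdown; cents avoid floating-point drift.
const phases = [
  { name: "Planner",  minutes: 4.7,  cents: 46 },
  { name: "Build R1", minutes: 127,  cents: 7108 },
  { name: "QA R1",    minutes: 8.8,  cents: 324 },
  { name: "Build R2", minutes: 62,   cents: 3689 },
  { name: "QA R2",    minutes: 6.8,  cents: 309 },
  { name: "Build R3", minutes: 10.9, cents: 588 },
  { name: "QA R3",    minutes: 9.6,  cents: 406 },
];

const totalCents = phases.reduce((sum, p) => sum + p.cents, 0);
const totalMinutes = phases.reduce((sum, p) => sum + p.minutes, 0);
const qaCents = phases
  .filter((p) => p.name.startsWith("QA"))
  .reduce((sum, p) => sum + p.cents, 0);

console.log(`$${(totalCents / 100).toFixed(2)}`); // $124.70
console.log(`${Math.round(totalMinutes)} min`);    // 230 min, i.e. 3 hr 50 min
console.log(`QA share: ${((qaCents / totalCents) * 100).toFixed(1)}%`); // QA share: 8.3%
```

The three QA rounds together account for about 8% of the total cost; the build rounds dominate.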

Implementation Guide

Building multi-agent harnesses with generator-evaluator loops.

Basic Loop Structure

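// Generator instructions: criteria weighting plus the refine-or-pivot decision
generator.instructions = `
  Create frontends. Weight design quality and
  originality higher than craft/functionality.
  Decide after each eval: refine or pivot.
`;

// Evaluator instructions: grade the live page through Playwright
evaluator.instructions = `
  Navigate live pages via Playwright MCP.
  Score 1-10 on each criterion.
  BE SKEPTICAL: err toward lower scores.
  Output: {scores, critique, specific_fixes}
`;

// Iteration loop. generator, evaluator, user_prompt, max_iterations and
// threshold are assumed to be supplied by the surrounding harness.
let input = user_prompt; // first pass uses the user's prompt
for (let i = 0; i < max_iterations; i++) {
  const result = generator.run(input);
  const evaluation = evaluator.run(result); // avoid the name "eval", a JS built-in

  // Stop once every criterion clears the threshold
  if (Object.values(evaluation.scores).every((s) => s >= threshold)) break;

  // Later passes receive the critique plus a refine-or-pivot directive
  generator.instruct(evaluation.trend === "improving" ? "refine" : "pivot");
  input = evaluation.critique;
}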
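The loop relies on a trend signal to choose between refining and pivoting. One way to derive it from the score history is sketched below; the comparison rule, the `minDelta` value, and the score keys are assumptions, since the source does not say how the trend is computed.

```javascript
// Mean of one evaluation's per-criterion scores.
function mean(scores) {
  const values = Object.values(scores);
  return values.reduce((a, b) => a + b, 0) / values.length;
}

// Hypothetical rule: "improving" if the latest mean beats the previous by minDelta.
function trend(history, minDelta = 0.25) {
  if (history.length < 2) return "improving"; // no basis to pivot yet
  const prev = mean(history[history.length - 2]);
  const last = mean(history[history.length - 1]);
  return last - prev >= minDelta ? "improving" : "stalled";
}

function nextInstruction(history) {
  return trend(history) === "improving" ? "refine" : "pivot";
}

const history = [
  { design: 4, originality: 3, craft: 6, functionality: 7 }, // mean 5
  { design: 5, originality: 5, craft: 6, functionality: 7 }, // mean 5.75
  { design: 5, originality: 5, craft: 6, functionality: 7 }, // mean 5.75
];

console.log(nextInstruction(history.slice(0, 2))); // refine
console.log(nextInstruction(history));             // pivot
```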

Sprint Contract Pattern

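function negotiate_contract(generator, evaluator, sprint, max_rounds = 5) {
  // Generator proposes what it will build and how success is verified.
  // sprint.description is an assumed field holding the sprint's goal.
  let proposal = generator.run(`
    Propose implementation for: ${sprint.description}
    Include: what you'll build, how success is verified
  `);

  // Evaluator reviews every proposal against the same sprint criteria
  const review_prompt = (p) => `
    Review proposal: ${p}
    Is this the right thing to build?
    Criteria: ${sprint.criteria}
  `;
  let review = evaluator.run(review_prompt(proposal));

  // Iterate until agreement; max_rounds is an added safeguard against
  // a negotiation that never converges
  for (let round = 1; review.status !== "APPROVED"; round++) {
    if (round > max_rounds) throw new Error("No agreement on sprint contract");
    proposal = generator.run(review.feedback);
    review = evaluator.run(review_prompt(proposal));
  }

  return proposal; // The approved proposal is the contract
}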

function run_sprint(generator, evaluator, contract) {
  // Build
  let implementation = generator.run(contract);

  // Test against the contract with hard thresholds
  let result = evaluator.run({
    artifact: implementation,
    contract: contract,
    mode: "strict"
  });

  // Revise until the evaluator passes the build
  while (result.status === "FAIL") {
    implementation = generator.run(result.feedback);
    result = evaluator.run({
      artifact: implementation,
      contract: contract,
      mode: "strict" // Keep the same thresholds on every retest
    });
  }

  return implementation; // The latest revision, which passed
}

Context Reset Pattern

function context_reset(current_agent, handoff_artifact, base_instructions) {
  // Terminate the current agent so none of its context carries over
  current_agent.end_session();

  // Create a fresh agent (create_agent is assumed to be the harness's factory)
  const new_agent = create_agent({
    instructions: base_instructions,
    context: null // Clean slate
  });

  // Provide the structured handoff as the first message
  new_agent.run(`
    Previous state: ${handoff_artifact.state}
    Completed: ${handoff_artifact.completed_tasks}
    Next: ${handoff_artifact.next_tasks}
    Continue from here.
  `);

  return new_agent;
}
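The handoff artifact itself might be assembled as in the sketch below. The field names `state`, `completed_tasks`, and `next_tasks` follow the handoff fields used above; the task list shape and the helper names are assumptions.

```javascript
// Build the handoff from a task list (the { name, done } shape is assumed).
function buildHandoff(tasks, stateSummary) {
  return {
    state: stateSummary,
    completed_tasks: tasks.filter((t) => t.done).map((t) => t.name),
    next_tasks: tasks.filter((t) => !t.done).map((t) => t.name),
  };
}

// Render the artifact as the first message for the fresh agent.
function renderHandoff(h) {
  return [
    `Previous state: ${h.state}`,
    `Completed: ${h.completed_tasks.join(", ") || "none"}`,
    `Next: ${h.next_tasks.join(", ") || "none"}`,
    "Continue from here.",
  ].join("\n");
}

const handoff = buildHandoff(
  [
    { name: "timeline view", done: true },
    { name: "clip drag", done: false },
    { name: "audio recording", done: false },
  ],
  "Timeline renders; clips are static"
);

console.log(renderHandoff(handoff));
```

Joining the task names explicitly avoids relying on how an array happens to stringify inside a template literal.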

When to Add Complexity

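| Model      | Sprint Decomp | Context Reset | Per-sprint QA |
|------------|---------------|---------------|---------------|
| Sonnet 4.5 | Required      | Required      | Required      |
| Opus 4.6   | Optional      | Optional      | Optional      |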

Principle: Start simple. Add components only when baseline fails. Re-evaluate with each model release.
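The start-simple principle can be sketched as an escalation loop that enables one harness component at a time, and only after the simpler configuration has failed. The component names, their ordering, and the `runBaseline` callback are assumptions for illustration.

```javascript
// Hypothetical component names, in an assumed order of escalation.
const COMPONENTS = ["endOfRunQA", "perSprintQA", "sprintDecomposition", "contextResets"];

// Enable the next component that is not yet on.
function escalate(enabled) {
  const next = COMPONENTS.find((c) => !enabled.includes(c));
  return next ? [...enabled, next] : enabled;
}

// runBaseline is caller-supplied and returns true when the output is acceptable.
function tuneHarness(runBaseline) {
  let enabled = []; // start with the bare single-agent setup
  while (!runBaseline(enabled)) {
    const grown = escalate(enabled);
    if (grown.length === enabled.length) break; // nothing left to add
    enabled = grown;
  }
  return enabled;
}

// Stub: pretend the task only succeeds once per-sprint QA is on.
const tuned = tuneHarness((enabled) => enabled.includes("perSprintQA"));
console.log(tuned); // [ 'endOfRunQA', 'perSprintQA' ]
```

Rerunning the same loop after a model release, starting again from the empty configuration, is one way to test whether earlier components are still needed.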

Source: Harness design for long-running application development — Anthropic Engineering, March 2026
Author: Prithvi Rajasekaran (Anthropic Labs)