AutoAgent: Let Your Agent Engineer Itself Overnight

Most teams build AI agents the slow way: write code, test manually, tweak prompts, repeat. What if the agent could improve itself instead, running experiments overnight, scoring the results, keeping improvements, and discarding regressions, all autonomously?

AutoAgent does exactly this. It is not just another agent framework. It is a meta-agent system that iteratively edits its own harness, benchmark-tests changes, and hill-climbs on performance without human intervention. You describe what you want in a program.md file, go to sleep, and wake up to a measurably better agent.

The Meta-Agent Loop

Traditional agent development looks like this: engineer writes code → tests on a few examples → deploys → discovers edge cases → repeats. The cycle takes days.

AutoAgent replaces this with a self-improving loop:

program.md — You write the directive (what agent to build, constraints, tools to use)
agent.py — The harness under test (system prompt, tool definitions, orchestration)
tasks/ — Containerized benchmarks in Harbor format
meta-agent — Reads program, examines harness, runs benchmarks, diagnoses failures
Edit → Run → Score — Modifies harness, tests, writes passed/total to results.tsv
Keep or Discard — Hill-climbs: keeps improvements, reverts regressions
Repeat — Runs 20-50 iterations overnight

The insight: you are not programming the agent. You are programming the meta-agent through constraints in program.md. The actual harness is a single file that gets iteratively rewritten based on empirical performance.
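The keep-or-discard step can be sketched as a plain hill-climbing loop. This is a minimal illustration under stated assumptions, not AutoAgent's actual code: `propose_edit` and `run_benchmark` are hypothetical callables standing in for the meta-agent's edit step and the benchmark run, and the strict-improvement rule is an assumption.

```python
from typing import Callable, List, Tuple


def hill_climb(
    harness: str,
    propose_edit: Callable[[str], str],   # hypothetical: meta-agent rewrites the harness
    run_benchmark: Callable[[str], int],  # hypothetical: returns the number of passed tasks
    iterations: int = 20,
) -> Tuple[str, int, List[Tuple[int, int, str]]]:
    """Keep a candidate only if it passes strictly more tasks than the current best."""
    best_score = run_benchmark(harness)
    ledger = [(0, best_score, "baseline")]
    for i in range(1, iterations + 1):
        candidate = propose_edit(harness)
        score = run_benchmark(candidate)
        if score > best_score:
            # Improvement: the candidate becomes the new starting point.
            harness, best_score = candidate, score
            ledger.append((i, score, "keep"))
        else:
            # Regression or tie: drop the candidate and retry from the current best.
            ledger.append((i, score, "discard"))
    return harness, best_score, ledger
```

The returned ledger plays the role that results.tsv plays in the article: one row per iteration, with the score and the decision.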

What You Control vs. What It Discovers

You Provide (Fixed)	Meta-Agent Discovers (Variable)
High-level directive in program.md	Exact system prompt phrasing
Which tools exist (tool signatures)	Tool selection strategy per task
Model (gpt-5, fixed)	Prompt engineering for that model
Benchmark tasks (Harbor format)	Failure mode diagnosis
Keep/discard criteria (score-based)	Orchestration patterns
Guardrails (no prod deploy without human review)	Reusable skill extraction

The Harness Architecture

AutoAgent uses a small, fixed file structure:

program.md          ← Human writes the directive
agent.py            ← Meta-agent edits the harness
tasks/              ← Human creates benchmarks
  task-name/
    task.toml       ← Config (timeout, metadata)
    instruction.md  ← Prompt to agent
    tests/
      test.sh       ← Verification script
      test.py       ← LLM-as-judge or assertions
    environment/
      Dockerfile    ← Isolated task container
    files/          ← Reference materials
results.tsv         ← Score ledger (auto-generated)
.agent/             ← Workspace for reusable skills
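Before an overnight run, it can help to confirm that every task directory follows this layout. The sketch below is illustrative only: which files count as required is an assumption (here test.py and files/ are treated as optional), and `missing_task_files` is a hypothetical helper, not part of AutoAgent or Harbor.

```python
from pathlib import Path
from typing import List

# Paths taken from the layout above, relative to one task directory.
# Treating exactly these four as required is an assumption.
REQUIRED_PATHS = [
    "task.toml",
    "instruction.md",
    "tests/test.sh",
    "environment/Dockerfile",
]


def missing_task_files(task_dir: Path) -> List[str]:
    """Return the required paths that do not exist under task_dir."""
    return [rel for rel in REQUIRED_PATHS if not (task_dir / rel).exists()]
```

Running this over each subdirectory of tasks/ and stopping on a non-empty result catches a malformed benchmark before it costs an iteration.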

The Editable Boundary

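The agent.py file is split into two clearly marked sections: an editable section the meta-agent may rewrite, and a fixed adapter section it must leave alone: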

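from agents import Agent, function_tool  # assumed: OpenAI Agents SDK, inferred from the calls below

# === EDITABLE SECTION ===
# Meta-agent modifies everything below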

SYSTEM_PROMPT = """You are a support agent for Glowup..."""
MODEL = "gpt-5"  # Fixed unless human changes
MAX_TURNS = 30

def create_tools(environment):
    @function_tool
    async def check_credits(user_id: str) -> str:
        ...
    
    @function_tool  
    async def search_knowledge(query: str) -> str:
        ...
    
    return [check_credits, search_knowledge]

def create_agent(environment):
    return Agent(
        name="glowup-support",
        instructions=SYSTEM_PROMPT,
        tools=create_tools(environment),
        model=MODEL
    )

async def run_task(environment, instruction):
    agent = create_agent(environment)
    # Orchestration logic the meta-agent can reshape
    ...

# === FIXED ADAPTER BOUNDARY ===
# Harbor integration, trajectory logging
# Do not modify below this line

The meta-agent treats this as a constrained optimization problem: maximize passed tasks while staying within the boundary (model fixed, tools must use provided environment, etc.).
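One way to enforce that boundary is to reject any candidate harness that changes the fixed section or the model. The sketch below is an assumption about how such a guard could look, not AutoAgent's own check. The marker string and the `MODEL = "..."` line come from the listing above, while `split_harness`, `model_name`, and `respects_boundary` are hypothetical helpers.

```python
import re
from typing import Optional, Tuple

FIXED_MARKER = "# === FIXED ADAPTER BOUNDARY ==="
MODEL_RE = re.compile(r"^MODEL\s*=\s*[\"']([^\"']+)[\"']", re.MULTILINE)


def split_harness(source: str) -> Tuple[str, str]:
    """Split harness source into (editable, fixed) parts at the boundary marker."""
    editable, marker, fixed = source.partition(FIXED_MARKER)
    if not marker:
        raise ValueError("fixed adapter boundary marker not found")
    return editable, marker + fixed


def model_name(source: str) -> Optional[str]:
    """Return the value assigned to MODEL, or None if no assignment is found."""
    match = MODEL_RE.search(source)
    return match.group(1) if match else None


def respects_boundary(old_source: str, new_source: str) -> bool:
    """True if the candidate kept the fixed section and the model unchanged."""
    try:
        same_fixed = split_harness(old_source)[1] == split_harness(new_source)[1]
    except ValueError:
        return False
    return same_fixed and model_name(old_source) == model_name(new_source)
```

A guard like this runs before the benchmark, so a candidate that breaks a constraint is discarded without spending any API budget on it.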

Harbor: The Benchmark Engine

AutoAgent uses Harbor for containerized task evaluation. Each task runs in its own Docker container, which receives the agent's trajectory and writes a score between 0.0 and 1.0 to /logs/reward.txt.

This design is crucial: tasks are isolated, deterministic, and can test anything — API calls, file manipulation, multi-step workflows. The agent cannot game the benchmark by memorizing answers because each task runs fresh.
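To make the scoring concrete, here is a rough sketch of how a runner might read a reward file and turn per-task rewards into the passed/total figure. It is not Harbor's API: `read_reward` and `summarize` are hypothetical helpers, and counting a task as passed only at a reward of 1.0 is an assumption.

```python
from pathlib import Path
from typing import Iterable, Tuple


def read_reward(path: Path) -> float:
    """Parse a reward file and reject values outside the 0.0 to 1.0 range."""
    value = float(path.read_text().strip())
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"reward {value} is outside [0.0, 1.0]")
    return value


def summarize(rewards: Iterable[float], pass_threshold: float = 1.0) -> Tuple[int, int]:
    """Return (passed, total); the threshold is an assumption, not Harbor's rule."""
    values = list(rewards)
    passed = sum(1 for r in values if r >= pass_threshold)
    return passed, len(values)
```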

What AutoAgent Is Good For

Not every problem benefits from autonomous iteration. AutoAgent excels in specific scenarios:

Well-Defined Support Agents

Customer service agents with clear boundaries — known APIs, documented processes, escalation criteria. Examples:

Photo enhancement support (Glowup): "How do I enhance?" "Why did it fail?" "How many credits?"
VoIP setup guidance (APS): "How do I forward my number?" "Why didn't I get the alert?"
SaaS onboarding: "Where is my data?" "How do I integrate?"

The key: limited but deep domain, existing API surface, real user questions you can capture.

Rapid Prompt Engineering

AutoAgent also fits when you need to optimize prompts for a specific model and task distribution. The meta-agent tries variations you might not think of, measures the difference, and keeps what works.

Tool Orchestration Discovery

Which tools to use when? In what order? When to escalate? AutoAgent discovers patterns from failure trajectories rather than requiring upfront design.

What It Is Not For

Novel architecture: If you need a new type of agent (multi-agent teams, new memory system), design it manually first.
Ill-defined domains: If you cannot write clear benchmarks, the meta-agent has no signal to hill-climb on.
High-stakes production: AutoAgent produces candidates. Humans review before production deploy. The results.tsv is a ledger, not a deployment trigger.

Integration for Product Teams

Here is how a product team would adopt AutoAgent:

Phase 1: Infrastructure (1-2 days)

Clone autoagent to your workspace
Build the base Docker image: docker build -f Dockerfile.base -t autoagent-base .
Set up OpenAI API credentials (requires gpt-5)
Install Harbor: pip install harbor

Phase 2: Benchmark Design (3-5 days)

This is the critical investment. Good benchmarks determine everything. A bad benchmark produces an agent that overfits to specific answers rather than learning general capabilities.

Start with 10-20 real user questions:

# Example: Glowup support task
tasks/credit-balance-check/
  task.toml:
    name = "credit-balance-check"
    timeout_sec = 60
    
  instruction.md:
    "User asks: 'How many credits do I have left?' 
     Look up their balance and explain what they can do with remaining credits."
     
  tests/test.py:
    # Verify agent called check_credits tool
    # Verify response included actual number
    # Verify response suggested enhancement options
    # Score: 1.0 if all checks pass, 0.0 otherwise

Mix question types: simple lookups, multi-step workflows, edge cases, escalation triggers. Use LLM-as-judge for fuzzy criteria ("was the tone helpful?") and assertions for objective checks ("did it call the right API?").
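The objective checks for the credit-balance task could be written roughly as follows. This is a sketch under assumptions: the trajectory shape (a dict with `tool_calls` and `final_response`) is hypothetical, as is the `expected_balance` parameter, and the keyword match for enhancement suggestions is a crude stand-in for what an LLM-as-judge would assess.

```python
import re
from typing import Any, Dict


def score_trajectory(trajectory: Dict[str, Any], expected_balance: int) -> float:
    """Return 1.0 only if all three checks from the task description pass."""
    tool_names = [call.get("name") for call in trajectory.get("tool_calls", [])]
    response = trajectory.get("final_response", "")

    # Check 1: the agent called the check_credits tool.
    called_tool = "check_credits" in tool_names
    # Check 2: the response states the actual balance as a standalone number.
    has_balance = re.search(rf"\b{expected_balance}\b", response) is not None
    # Check 3: the response mentions enhancement options (simple keyword match).
    suggests_enhancement = "enhance" in response.lower()

    return 1.0 if (called_tool and has_balance and suggests_enhancement) else 0.0
```

The word-boundary match means a response quoting 120 credits does not pass when the real balance is 12.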

Phase 3: Meta-Agent Operation

You have two integration options:

Option A: External Meta-Agent (Recommended)

Use your existing coding agent (Claude Code, Claude Desktop, OpenAI coding agent). Point it at the repo:

"Read program.md. The current agent.py scores 3/20 on tasks/. 
 Diagnose failures, modify the harness, run experiments, 
 and hill-climb on passed tasks. Keep changes that improve, 
 discard those that regress. Run overnight."

This keeps the meta-agent separate from your production infrastructure. You review results.tsv in the morning and decide what to merge.

Option B: Built-in Loop (Advanced)

Implement the meta-agent as an ACP harness session that runs on a schedule. Requires careful guardrails and cost controls. Only for teams with existing agent infrastructure experience.

Phase 4: Production Integration

AutoAgent does not deploy to production. It produces candidate harnesses. Your workflow:

Review results.tsv from overnight run
Examine the winning harness (agent.py)
Test manually on edge cases not in benchmark
Merge to product codebase if acceptable
Add new tasks as new failure modes are discovered
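For the morning review, a few lines of Python can pull the strongest iteration out of the ledger. The column names below (`iteration`, `passed`, `total`) are an assumption about the layout of results.tsv, since the article only says that passed/total is written there. The sketch also assumes every row has a non-zero total.

```python
import csv
import io
from typing import Dict, List


def load_results(tsv_text: str) -> List[Dict[str, int]]:
    """Parse tab-separated rows with assumed columns: iteration, passed, total."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{key: int(row[key]) for key in ("iteration", "passed", "total")} for row in reader]


def best_iteration(rows: List[Dict[str, int]]) -> Dict[str, int]:
    """Return the row with the highest pass rate, preferring the earliest on ties."""
    return max(rows, key=lambda r: (r["passed"] / r["total"], -r["iteration"]))
```

Preferring the earliest iteration on ties favors the harness with fewer accumulated edits, which tends to be easier to review.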

Cost Analysis

Fixed Costs (One-Time)

Item	Time	Cost
Infrastructure setup	1-2 days	~$0 (Docker-based)
Benchmark design (20 tasks)	3-5 days	Engineering time
Baseline runs	1 day	~$10-50
Initial integration	2-3 days	Engineering time
Total fixed	7-11 days	~$10-50 + eng time

Variable Costs (Per Overnight Run)

Assume 20 iterations × 20 tasks:

GPT-5 API: ~$5-20 per iteration (diagnosis + harness generation)
Task compute: $0.01-0.50 (Docker container time)
Total per experiment: $100-400 for 20 iterations
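The per-run figure is just the per-iteration API cost multiplied by the iteration count. The sketch below reproduces that arithmetic and adds a budget check. Both function names are hypothetical, and task compute is left out because it is negligible next to the API cost.

```python
from typing import Tuple


def run_cost_range(iterations: int, api_low: float, api_high: float) -> Tuple[float, float]:
    """Return the (low, high) API cost in dollars for one overnight run."""
    return iterations * api_low, iterations * api_high


def iterations_within_budget(budget: float, worst_case_per_iteration: float) -> int:
    """Return how many iterations fit in the budget at the worst-case price."""
    return int(budget // worst_case_per_iteration)


low, high = run_cost_range(20, 5, 20)
print(f"20 iterations: ${low}-{high}")
```

With a $150 cap and the $20 worst case, `iterations_within_budget(150, 20)` allows 7 iterations, so a cap set below the $400 upper bound shortens the run rather than stopping it outright.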

Cost optimizations:

Start with 5-10 tasks for faster iteration
Use GPT-4o for early experiments (cheaper, faster, 80% of the signal)
Run parallel experiments with different program.md variants
Set daily API spend limits

Comparison to Manual Engineering

Approach	Time per improvement	Cost per improvement
Manual agent engineering	2-5 days	$800-2000 (eng time)
AutoAgent overnight	8-12 hours	$100-400 (API + compute)
Break-even	After 2-3 overnight runs vs. one manual cycle

AutoAgent wins when you need many small improvements, have systematic benchmark coverage, and have more ideas than engineering time. Manual engineering wins for novel architecture changes or when benchmarks do not exist yet.

Getting Started

AutoAgent is open-source on GitHub. If you have a support agent that handles repetitive questions, and you can write 10-20 example tasks, it is worth experimenting with.

Recommended first project: a narrow-scope support agent with existing API integration, such as Glowup credit checks and photo enhancement guidance, or APS phone setup instructions. These have clear success criteria and measurable outcomes.

The framework is not magic. It is disciplined empirical optimization: measure, modify, keep, discard, repeat. But disciplined optimization at machine speed, overnight, beats manual intuition during business hours.
