Most teams build AI agents the slow way: write code, test manually, tweak prompts, repeat. What if the agent could improve itself, running experiments overnight, scoring results, keeping improvements, and discarding regressions, all autonomously?
AutoAgent does exactly this. It is not just another agent framework. It is a meta-agent system that iteratively edits its own harness, benchmark-tests changes, and hill-climbs on performance without human intervention. You describe what you want in a program.md file, go to sleep, and wake up to a measurably better agent.
The Meta-Agent Loop
Traditional agent development looks like this: engineer writes code → tests on a few examples → deploys → discovers edge cases → repeats. The cycle takes days.
AutoAgent replaces this with a self-improving loop:
- program.md — You write the directive (what agent to build, constraints, tools to use)
- agent.py — The harness under test (system prompt, tool definitions, orchestration)
- tasks/ — Containerized benchmarks in Harbor format
- meta-agent — Reads program, examines harness, runs benchmarks, diagnoses failures
- Edit → Run → Score — Modifies harness, tests, writes passed/total to results.tsv
- Keep or Discard — Hill-climbs: keeps improvements, reverts regressions
- Repeat — Runs 20-50 iterations overnight
The insight: you are not programming the agent. You are programming the meta-agent through constraints in program.md. The actual harness is a single file that gets iteratively rewritten based on empirical performance.
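The keep-or-discard loop above can be sketched as a greedy hill-climb. This is a minimal illustration, not AutoAgent's implementation: `propose` and `evaluate` are hypothetical stand-ins for the meta-agent's edit step and the benchmark run, and the harness is represented as a plain string.

```python
from typing import Callable, List, Tuple


def hill_climb(
    baseline: str,
    propose: Callable[[str], str],
    evaluate: Callable[[str], int],
    iterations: int,
) -> Tuple[str, int, List[Tuple[int, int, str]]]:
    """Greedy keep-or-discard loop over harness candidates.

    `propose` (edit the harness) and `evaluate` (count passed tasks) are
    hypothetical callables supplied by the caller, not AutoAgent APIs.
    """
    best, best_score = baseline, evaluate(baseline)
    ledger = [(0, best_score, "baseline")]
    for i in range(1, iterations + 1):
        candidate = propose(best)          # always edit the best harness so far
        score = evaluate(candidate)
        if score > best_score:             # strict improvement: keep
            best, best_score = candidate, score
            ledger.append((i, score, "keep"))
        else:                              # tie or regression: revert
            ledger.append((i, score, "discard"))
    return best, best_score, ledger


if __name__ == "__main__":
    # Toy stand-ins: an "edit" appends a letter, the "score" counts the letter a.
    edits = iter(["a", "b", "a", "b"])
    best, score, ledger = hill_climb(
        baseline="",
        propose=lambda harness: harness + next(edits),
        evaluate=lambda harness: harness.count("a"),
        iterations=4,
    )
    print(best, score)
    for row in ledger:
        print(row)
```

Treating ties as discards is a design choice in this sketch: it keeps the harness from drifting through changes that the benchmark cannot distinguish.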
What You Control vs. What It Discovers
| You Provide (Fixed) | Meta-Agent Discovers (Variable) |
|---|---|
| High-level directive in program.md | Exact system prompt phrasing |
| Which tools exist (tool signatures) | Tool selection strategy per task |
| Model (gpt-5, fixed) | Prompt engineering for that model |
| Benchmark tasks (Harbor format) | Failure mode diagnosis |
| Keep/discard criteria (score-based) | Orchestration patterns |
| Guardrails (no prod deploy without human review) | Reusable skill extraction |
The Harness Architecture
AutoAgent uses a simple but powerful file structure:
program.md            ← Human writes the directive
agent.py              ← Meta-agent edits the harness
tasks/                ← Human creates benchmarks
  task-name/
    task.toml         ← Config (timeout, metadata)
    instruction.md    ← Prompt to agent
    tests/
      test.sh         ← Verification script
      test.py         ← LLM-as-judge or assertions
    environment/
      Dockerfile      ← Isolated task container
    files/            ← Reference materials
results.tsv           ← Score ledger (auto-generated)
.agent/               ← Workspace for reusable skills
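Before an overnight run it can help to confirm that every task directory follows this layout. The check below is a small sketch: the list of expected files is read off the tree above, and treating all of them as required is an assumption of this example rather than a documented Harbor rule.

```python
import tempfile
from pathlib import Path
from typing import List

# Files the layout above shows for each task. Treating every one of them as
# mandatory is an assumption of this sketch.
EXPECTED_FILES = (
    "task.toml",
    "instruction.md",
    "tests/test.sh",
    "tests/test.py",
    "environment/Dockerfile",
)


def missing_task_files(task_dir: Path) -> List[str]:
    """Return the expected files that are absent from one task directory."""
    return [rel for rel in EXPECTED_FILES if not (task_dir / rel).is_file()]


if __name__ == "__main__":
    # Build a deliberately incomplete task in a temp directory and report gaps.
    with tempfile.TemporaryDirectory() as tmp:
        task = Path(tmp) / "credit-balance-check"
        (task / "tests").mkdir(parents=True)
        (task / "task.toml").write_text('name = "credit-balance-check"\n')
        (task / "instruction.md").write_text("How many credits do I have left?\n")
        print(missing_task_files(task))
```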
The Editable Boundary
The agent.py file has two clearly marked sections:
# === EDITABLE SECTION ===
# Meta-agent modifies everything below
SYSTEM_PROMPT = """You are a support agent for Glowup..."""
MODEL = "gpt-5"  # Fixed unless human changes
MAX_TURNS = 30

def create_tools(environment):
    @function_tool
    async def check_credits(user_id: str) -> str:
        ...

    @function_tool
    async def search_knowledge(query: str) -> str:
        ...

    return [check_credits, search_knowledge]

def create_agent(environment):
    return Agent(
        name="glowup-support",
        instructions=SYSTEM_PROMPT,
        tools=create_tools(environment),
        model=MODEL,
    )

async def run_task(environment, instruction):
    agent = create_agent(environment)
    # Orchestration logic the meta-agent can reshape
    ...

# === FIXED ADAPTER BOUNDARY ===
# Harbor integration, trajectory logging
# Do not modify below this line
The meta-agent treats this as a constrained optimization problem: maximize passed tasks while staying within the boundary (model fixed, tools must use provided environment, etc.).
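One way to enforce those constraints mechanically is to reject any candidate edit that touches the fixed section or changes the model. The sketch below assumes the marker comment and the `MODEL = "..."` line appear exactly as in the listing above; the function names are hypothetical and not part of AutoAgent.

```python
import re
from typing import Optional

BOUNDARY_MARKER = "# === FIXED ADAPTER BOUNDARY ==="
MODEL_PATTERN = re.compile(r'^MODEL\s*=\s*["\']([^"\']+)["\']', re.MULTILINE)


def fixed_section(source: str) -> str:
    """Return the marker line and everything after it."""
    index = source.find(BOUNDARY_MARKER)
    if index == -1:
        raise ValueError("boundary marker not found")
    return source[index:]


def declared_model(source: str) -> Optional[str]:
    """Return the string assigned to MODEL, or None if no such line exists."""
    match = MODEL_PATTERN.search(source)
    return match.group(1) if match else None


def edit_respects_boundary(old_source: str, new_source: str) -> bool:
    """True if the edit left the fixed section and the model untouched.

    A missing marker in either version counts as a violation.
    """
    try:
        same_fixed = fixed_section(old_source) == fixed_section(new_source)
    except ValueError:
        return False
    return same_fixed and declared_model(old_source) == declared_model(new_source)
```

A check like this would run before the benchmark, so a candidate that breaks the boundary is discarded without spending API budget on it.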
Harbor: The Benchmark Engine
AutoAgent uses Harbor for containerized task evaluation. Each task runs in a Docker container that receives the agent's trajectory and writes a score between 0.0 and 1.0 to /logs/reward.txt.
This design is crucial: tasks are isolated, deterministic, and can test anything, including API calls, file manipulation, and multi-step workflows. Because each task runs in a fresh container, the agent cannot carry state or cached answers from one run into the next.
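The reward-file contract is simple enough to sketch in a few lines. The helpers below are illustrative only and are not Harbor's API: the path is passed in as a parameter so the example runs outside a container, and counting a task as passed only at a reward of 1.0 is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_reward(path: Path, score: float) -> None:
    """Write a score in [0.0, 1.0] to the reward file, as a test script might."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{score}\n")


def read_reward(path: Path) -> float:
    """Read the score back and reject values outside [0.0, 1.0]."""
    score = float(path.read_text().strip())
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score


def summarize(rewards: Iterable[float], pass_threshold: float = 1.0) -> str:
    """Collapse per-task rewards into a passed/total string."""
    rewards = list(rewards)
    passed = sum(1 for reward in rewards if reward >= pass_threshold)
    return f"{passed}/{len(rewards)}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        reward_file = Path(tmp) / "logs" / "reward.txt"
        write_reward(reward_file, 1.0)
        print(read_reward(reward_file))
    print(summarize([1.0, 0.0, 1.0, 0.5]))
```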
What AutoAgent Is Good For
Not every problem benefits from autonomous iteration. AutoAgent excels in specific scenarios:
Well-Defined Support Agents
Customer service agents with clear boundaries — known APIs, documented processes, escalation criteria. Examples:
- Photo enhancement support (Glowup): "How do I enhance?" "Why did it fail?" "How many credits?"
- VoIP setup guidance (APS): "How do I forward my number?" "Why didn't I get the alert?"
- SaaS onboarding: "Where is my data?" "How do I integrate?"
The key: limited but deep domain, existing API surface, real user questions you can capture.
Rapid Prompt Engineering
When you need to optimize prompts for a specific model and task distribution. The meta-agent tries variations you might not think of, measures the difference, and keeps what works.
Tool Orchestration Discovery
Which tools to use when? In what order? When to escalate? AutoAgent discovers patterns from failure trajectories rather than requiring upfront design.
What It Is Not For
- Novel architecture: If you need a new type of agent (multi-agent teams, new memory system), design it manually first.
- Ill-defined domains: If you cannot write clear benchmarks, the meta-agent has no signal to hill-climb on.
- High-stakes production: AutoAgent produces candidates. Humans review before production deploy. The results.tsv is a ledger, not a deployment trigger.
Integration for Product Teams
Here is how a product team would adopt AutoAgent:
Phase 1: Infrastructure (1-2 days)
- Clone autoagent to your workspace
- Build the base Docker image:
  docker build -f Dockerfile.base -t autoagent-base .
- Set up OpenAI API credentials (requires gpt-5)
- Install Harbor:
  pip install harbor
Phase 2: Benchmark Design (3-5 days)
This is the critical investment. Good benchmarks determine everything. A bad benchmark produces an agent that overfits to specific answers rather than learning general capabilities.
Start with 10-20 real user questions:
# Example: Glowup support task
tasks/credit-balance-check/
  task.toml:
    name = "credit-balance-check"
    timeout_sec = 60
  instruction.md:
    "User asks: 'How many credits do I have left?'
    Look up their balance and explain what they can do with remaining credits."
  tests/test.py:
    # Verify agent called check_credits tool
    # Verify response included actual number
    # Verify response suggested enhancement options
    # Score: 1.0 if all checks pass, 0.0 otherwise
Mix question types: simple lookups, multi-step workflows, edge cases, escalation triggers. Use LLM-as-judge for fuzzy criteria ("was the tone helpful?") and assertions for objective checks ("did it call the right API?").
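The objective checks for the credit-balance task could look roughly like the function below. It is a sketch under stated assumptions: the inputs (a list of called tool names, the final response text, and the known balance) are a hypothetical simplification of whatever trajectory format the harness actually logs, and the keyword test for "enhancement options" is deliberately crude.

```python
import re
from typing import List


def score_credit_balance(
    tool_calls: List[str],
    response: str,
    expected_balance: int,
) -> float:
    """Return 1.0 only if all three assertion-style checks pass, else 0.0."""
    # Check 1: the agent looked the balance up instead of guessing.
    called_tool = "check_credits" in tool_calls
    # Check 2: the response states the actual balance as a standalone number.
    numbers = re.findall(r"\d+", response)
    has_balance = str(expected_balance) in numbers
    # Check 3: the response mentions what the credits can be used for.
    suggests_options = "enhance" in response.lower()
    return 1.0 if (called_tool and has_balance and suggests_options) else 0.0


if __name__ == "__main__":
    print(
        score_credit_balance(
            tool_calls=["check_credits"],
            response="You have 12 credits left, enough to enhance 12 photos.",
            expected_balance=12,
        )
    )
```

Fuzzy criteria such as tone would not fit this pattern; those are the cases where an LLM-as-judge check is the better tool.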
Phase 3: Meta-Agent Operation
You have two integration options:
Option A: External Meta-Agent (Recommended)
Use your existing coding agent (Claude Code, Claude Desktop, OpenAI coding agent). Point it at the repo:
"Read program.md. The current agent.py scores 3/20 on tasks/.
Diagnose failures, modify the harness, run experiments,
and hill-climb on passed tasks. Keep changes that improve,
discard those that regress. Run overnight."
This keeps the meta-agent separate from your production infrastructure. You review results.tsv in the morning and decide what to merge.
Option B: Built-in Loop (Advanced)
Implement the meta-agent as an ACP harness session that runs on a schedule. Requires careful guardrails and cost controls. Only for teams with existing agent infrastructure experience.
Phase 4: Production Integration
AutoAgent does not deploy to production. It produces candidate harnesses. Your workflow:
- Review results.tsv from overnight run
- Examine the winning harness (agent.py)
- Test manually on edge cases not in benchmark
- Merge to product codebase if acceptable
- Add new tasks as new failure modes are discovered
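The first review step can be partly scripted. The sketch below picks the best-scoring row from a results ledger; the column names `iteration`, `score`, and `status` are assumptions, since the only confirmed detail is that scores are recorded as passed/total in results.tsv.

```python
import csv
import io
from typing import Dict, Optional, TextIO


def parse_score(cell: str) -> float:
    """Convert a 'passed/total' cell such as '7/20' into a pass rate."""
    passed, total = cell.split("/")
    return int(passed) / int(total) if int(total) else 0.0


def best_row(tsv: TextIO, score_column: str = "score") -> Optional[Dict[str, str]]:
    """Return the row with the highest pass rate (first one wins on ties)."""
    best, best_rate = None, -1.0
    for row in csv.DictReader(tsv, delimiter="\t"):
        rate = parse_score(row[score_column])
        if rate > best_rate:
            best, best_rate = row, rate
    return best


if __name__ == "__main__":
    # Hypothetical ledger contents, inlined so the example runs without a file.
    sample = (
        "iteration\tscore\tstatus\n"
        "0\t3/20\tbaseline\n"
        "1\t7/20\tkeep\n"
        "2\t5/20\tdiscard\n"
    )
    print(best_row(io.StringIO(sample)))
```

To read a real ledger, pass an open file handle instead of the `io.StringIO` wrapper.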
Cost Analysis
Fixed Costs (One-Time)
| Item | Time | Cost |
|---|---|---|
| Infrastructure setup | 1-2 days | ~$0 (Docker-based) |
| Benchmark design (20 tasks) | 3-5 days | Engineering time |
| Baseline runs | 1 day | ~$10-50 |
| Initial integration | 2-3 days | Engineering time |
| Total fixed | 7-11 days | ~$10-50 + eng time |
Variable Costs (Per Overnight Run)
Assume 20 iterations × 20 tasks:
- GPT-5 API: ~$5-20 per iteration (diagnosis + harness generation)
- Total per experiment: $100-400 for 20 iterations
- Task compute: $0.01-0.50 (Docker container time)
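The arithmetic behind these figures is just the per-iteration API cost multiplied by the iteration count, plus compute. The helper below reproduces it; treating the compute figure as a per-run total is an assumption, since the list above does not state its unit.

```python
from typing import Tuple


def overnight_cost_range(
    iterations: int,
    api_low: float,
    api_high: float,
    compute_low: float = 0.0,
    compute_high: float = 0.0,
) -> Tuple[float, float]:
    """Return the (low, high) dollar cost of one overnight run.

    API cost scales with the number of iterations; compute is added once,
    which assumes the compute figures are totals for the whole run.
    """
    low = iterations * api_low + compute_low
    high = iterations * api_high + compute_high
    return low, high


if __name__ == "__main__":
    # 20 iterations at $5-20 each, API cost only.
    print(overnight_cost_range(20, 5, 20))
    # Same run with the compute range added.
    print(overnight_cost_range(20, 5, 20, compute_low=0.01, compute_high=0.50))
```

With these inputs, compute is negligible next to API spend, so trimming iterations or task count is where savings come from.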
Cost optimizations:
- Start with 5-10 tasks for faster iteration
- Use GPT-4o for early experiments (cheaper and faster, with roughly 80% of the signal)
- Run parallel experiments with different program.md variants
- Set daily API spend limits
Comparison to Manual Engineering
| Approach | Time per improvement | Cost per improvement |
|---|---|---|
| Manual agent engineering | 2-5 days | $800-2000 (eng time) |
| AutoAgent overnight | 8-12 hours | $100-400 (API + compute) |
| Break-even | After 2-3 overnight runs vs. one manual cycle | |
AutoAgent wins when you need many small improvements, have systematic benchmark coverage, and have more ideas than engineering time. Manual engineering wins for novel architecture changes or when benchmarks do not exist yet.
Getting Started
AutoAgent is open-source on GitHub. If you have a support agent that handles repetitive questions — and you can write 10-20 example tasks — it is worth experimenting.
Recommended first project: A narrow-scope support agent with existing API integration. Glowup credit checks and photo enhancement guidance, or APS phone setup instructions. These have clear success criteria and measurable outcomes.
The framework is not magic. It is disciplined empirical optimization: measure, modify, keep, discard, repeat. But disciplined optimization at machine speed, overnight, beats manual intuition during business hours.