LabNotes

Self-Improving Agents: Continuous Learning Loops

Every LLM agent makes mistakes. The question isn't whether an agent will hallucinate, misinterpret a command, or produce broken code — it's what happens after. The gap between agents that repeat the same error and agents that genuinely improve over time isn't a model capability problem. It's a memory architecture problem.

We've spent the past several months testing learning loops in production agent systems — not the theoretical "agents will self-improve" pitch, but the practical mechanics of how errors get captured, stored, and re-applied. Here's what we found.

The Learning Loop Architecture

A self-improving agent doesn't need to retrain its weights. It needs a structured pipeline that turns failure into future guidance. The loop has four stages:

  1. Detection — Recognizing that something went wrong (or could be better)
  2. Capture — Structured logging of what happened, why, and what the correct behavior should be
  3. Storage — Persisting learnings in a format the agent can retrieve in future sessions
  4. Application — Loading relevant learnings before executing similar tasks

Each stage has failure modes. Detection misses errors that don't throw exceptions. Capture produces unstructured notes that can't be parsed. Storage sits in files that never get read. Application loads irrelevant context and drowns the agent in noise.

The systems that actually work address all four — imperfectly, but structurally.
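The four stages can be sketched as a small pipeline. This is a minimal illustration, not the implementation of any system discussed here: `Learning`, `LearningStore`, `detect`, `run_task`, and the toy `install` task are all hypothetical names, and an in-memory list stands in for persisted files.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Learning:
    task_kind: str         # hypothetical label used to match similar tasks
    what_happened: str
    correct_behavior: str


@dataclass
class LearningStore:
    """In-memory stand-in for the persisted learning files (stage 3)."""
    entries: List[Learning] = field(default_factory=list)

    def save(self, learning: Learning) -> None:
        self.entries.append(learning)

    def relevant_to(self, task_kind: str) -> List[Learning]:
        return [e for e in self.entries if e.task_kind == task_kind]


def detect(exit_code: int, correction: Optional[str]) -> bool:
    """Stage 1: a non-zero exit code or a user correction counts as a failure."""
    return exit_code != 0 or correction is not None


# An executor takes the loaded guidance and returns (exit_code, output, correction).
Executor = Callable[[List[str]], Tuple[int, str, Optional[str]]]


def run_task(store: LearningStore, task_kind: str, execute: Executor) -> str:
    # Stage 4: load only learnings recorded for this kind of task.
    guidance = [e.correct_behavior for e in store.relevant_to(task_kind)]
    exit_code, output, correction = execute(guidance)
    if detect(exit_code, correction):
        # Stage 2: capture what happened and, if known, the correct behavior.
        store.save(Learning(task_kind, output, correction or "unresolved"))
    return output


def install(guidance: List[str]) -> Tuple[int, str, Optional[str]]:
    """Toy task that fails until the pnpm convention has been learned."""
    if "use pnpm, not npm" in guidance:
        return 0, "installed with pnpm", None
    return 1, "npm install failed", "use pnpm, not npm"


store = LearningStore()
first = run_task(store, "install", install)   # fails and records the correction
second = run_task(store, "install", install)  # succeeds using the stored guidance
```

Note that `detect` only sees exit codes and explicit corrections, so an error that produces neither never reaches `store.save`.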

Real Implementations in Production

OpenClaw's Self-Improvement Skill

The most mature example we've tested is the self-improvement skill for OpenClaw (and compatible agents). It defines three structured log files:

  • .learnings/LEARNINGS.md — Corrections, knowledge gaps, best practices
  • .learnings/ERRORS.md — Command failures, exceptions, unexpected behavior
  • .learnings/FEATURE_REQUESTS.md — Capabilities users want but the agent doesn't have

Each entry follows a strict schema: timestamped ID (LRN-20260314-001), priority level, status tracking, area tags for filtering, and structured metadata including reproduction steps and suggested fixes. The critical innovation is the promotion pathway — when a learning proves broadly applicable (recurrence-count ≥ 3 across 2+ distinct tasks within 30 days), it gets promoted from the verbose log into permanent project guidance files: AGENTS.md, SOUL.md, or TOOLS.md.

This creates a three-tier memory hierarchy:

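| Tier            | File                | Purpose                    | Lifespan                |
|-----------------|---------------------|----------------------------|-------------------------|
| Raw logs        | .learnings/*.md     | Verbose incident records   | Until resolved/promoted |
| Project memory  | AGENTS.md, TOOLS.md | Curated rules and patterns | Long-lived              |
| Behavioral core | SOUL.md             | Personality and principles | Permanent               |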

It's not sophisticated. It's grep-friendly markdown files with consistent headers. And that's exactly why it works.
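To make the ID format and the promotion rule concrete, here is a hedged sketch. The function and class names (`make_entry_id`, `Occurrence`, `should_promote`) are hypothetical, and reading "within 30 days" as a trailing window ending today is an assumption; the skill itself may count recurrences differently.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable


def make_entry_id(kind: str, day: date, seq: int) -> str:
    """Format an ID in the TYPE-YYYYMMDD-XXX shape, e.g. LRN-20260314-001."""
    return f"{kind}-{day:%Y%m%d}-{seq:03d}"


@dataclass(frozen=True)
class Occurrence:
    task: str      # hypothetical identifier of the task where the learning recurred
    seen_on: date


def should_promote(
    occurrences: Iterable[Occurrence],
    today: date,
    min_count: int = 3,     # recurrence count of at least 3
    min_tasks: int = 2,     # across 2 or more distinct tasks
    window_days: int = 30,  # assumption: trailing window ending today
) -> bool:
    cutoff = today - timedelta(days=window_days)
    recent = [o for o in occurrences if cutoff <= o.seen_on <= today]
    distinct_tasks = {o.task for o in recent}
    return len(recent) >= min_count and len(distinct_tasks) >= min_tasks


today = date(2026, 3, 14)
seen = [
    Occurrence("fix-ci", date(2026, 3, 1)),
    Occurrence("fix-ci", date(2026, 3, 5)),
    Occurrence("add-endpoint", date(2026, 3, 12)),
]
entry_id = make_entry_id("LRN", today, 1)
promote = should_promote(seen, today)
```

Keeping the thresholds as parameters makes it easy to tune them per learning type instead of hard-coding a single cutoff.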

CLAUDE.md and Cursor Rules

Claude Code's CLAUDE.md pattern and Cursor's .cursorrules serve a similar function with less structure. They're static instruction files that humans manually update when they discover the agent repeatedly makes a specific mistake. The workflow is: notice pattern → edit file → agent loads it next session.

This is learning-by-proxy — the human detects, captures, and stores the learning, and the agent simply reads what's there. It works surprisingly well for project-specific conventions ("use pnpm, not npm," "test with vitest, not jest") but fails silently for anything that requires the agent to recognize its own errors without human intervention.

The limitation is obvious: it scales with human attention, not with agent capability. If nobody's watching, nobody updates the rules.

Session Memory and Daily Logs

OpenClaw's workspace pattern uses memory/YYYY-MM-DD.md for raw daily logs and MEMORY.md for curated long-term memory. A periodic heartbeat process reviews recent daily logs and distills significant events into the long-term file. This is the closest thing to automated learning extraction we've seen in practice.

The effectiveness depends entirely on what gets logged. Daily logs that record only actions ("committed code," "sent email") produce noise. Daily logs that record decisions and reasoning ("chose approach X because Y," "discovered that Z doesn't work in this context") produce signal.
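The action-versus-reasoning distinction can be illustrated with a crude filter over a daily log. This is only a sketch: the marker list and the `distill` function are assumptions made for illustration, not the heartbeat's actual logic, which is more likely to ask the model itself to summarize.

```python
import re
from typing import List

# Hypothetical markers suggesting that a line records a decision or a finding.
REASONING_MARKERS = re.compile(
    r"\b(because|chose|decided|discovered that|doesn't work)\b",
    re.IGNORECASE,
)


def distill(daily_log: str) -> List[str]:
    """Keep log lines that record reasoning; drop bare action lines."""
    kept = []
    for line in daily_log.splitlines():
        text = line.strip().lstrip("-• ").strip()
        if text and REASONING_MARKERS.search(text):
            kept.append(text)
    return kept


log = """\
- committed code
- chose approach X because Y
- sent email
- discovered that Z doesn't work in this context
"""
signal = distill(log)
```

Here `signal` keeps the two reasoning lines and drops "committed code" and "sent email". A keyword filter like this will miss reasoning phrased differently, so it is best treated as a first pass before curation.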

Error Detection Triggers

The self-improvement skill defines explicit triggers for when to log a learning:

| Trigger               | Example                          | Log Target                   |
|-----------------------|----------------------------------|------------------------------|
| Command failure       | Non-zero exit code, timeout      | ERRORS.md                    |
| User correction       | "No, that's not right..."        | LEARNINGS.md (correction)    |
| Missing capability    | "Can you also do X?"             | FEATURE_REQUESTS.md          |
| Outdated knowledge    | Reference yields wrong info      | LEARNINGS.md (knowledge_gap) |
| Better approach found | Discovered more efficient method | LEARNINGS.md (best_practice) |

These triggers can be manual (the agent decides to log) or hook-based (external scripts fire on specific events). The hook approach — triggering on PostToolUse for bash commands, for example — catches errors that the agent might otherwise miss or rationalize away.
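A sketch of the routing in the table, plus a hook-style check for the one trigger that is easy to mechanize. The `Trigger` enum, `LOG_TARGET` mapping, and `on_command_finished` signature are hypothetical; they do not mirror the payload of any real hook system, which would need to be adapted to this shape.

```python
from enum import Enum
from typing import Optional


class Trigger(Enum):
    COMMAND_FAILURE = "command_failure"
    USER_CORRECTION = "correction"
    MISSING_CAPABILITY = "feature_request"
    OUTDATED_KNOWLEDGE = "knowledge_gap"
    BETTER_APPROACH = "best_practice"


# Log targets taken from the trigger table above.
LOG_TARGET = {
    Trigger.COMMAND_FAILURE: ".learnings/ERRORS.md",
    Trigger.USER_CORRECTION: ".learnings/LEARNINGS.md",
    Trigger.MISSING_CAPABILITY: ".learnings/FEATURE_REQUESTS.md",
    Trigger.OUTDATED_KNOWLEDGE: ".learnings/LEARNINGS.md",
    Trigger.BETTER_APPROACH: ".learnings/LEARNINGS.md",
}


def on_command_finished(
    command: str, exit_code: int, timed_out: bool = False
) -> Optional[dict]:
    """Hook-style check that runs after every shell command.

    It does not depend on the agent deciding that something went wrong.
    Returns a record to log, or None when the command succeeded.
    """
    if exit_code == 0 and not timed_out:
        return None
    return {
        "trigger": Trigger.COMMAND_FAILURE.value,
        "target": LOG_TARGET[Trigger.COMMAND_FAILURE],
        "command": command,
        "exit_code": exit_code,
        "timed_out": timed_out,
    }


ok = on_command_finished("pnpm test", 0)
failed = on_command_finished("docker build .", 1)
```

The other four triggers depend on interpreting conversation content, which is why they tend to stay manual.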

What Actually Improves Performance

After testing across multiple agent platforms, we found that the set of patterns producing measurable improvement is narrow. Three learning types work; one does not:

1. Convention capture works. Logging that "this project uses X, not Y" produces immediate, durable improvement. It's the highest-ROI learning type because it's binary (right/wrong) and universally applicable within the project.

2. Error reproduction patterns work. When an error entry includes the exact command, input, and error message, the agent can match against it in future sessions. When it says "something failed with Docker," it's useless.

3. Behavioral corrections work — with caveats. "Be more concise," "don't use marketing language," "ask before modifying external files" — these produce improvement when they're concrete and testable. Abstract corrections ("be better at reasoning") don't.

4. Architectural learnings don't transfer well. "The database schema should have been normalized differently" is a learning that requires understanding the full system context to apply. It rarely generalizes to new situations, even within the same project.
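Why exact reproduction details matter (item 2) is easy to see in code. The sketch below assumes a hypothetical `ErrorEntry` record and a simple matching rule: same command after whitespace and case normalization, and the stored error message appearing in the new one. A real system might use fuzzier matching.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ErrorEntry:
    command: str
    error_message: str
    suggested_fix: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def find_known_error(
    entries: Iterable[ErrorEntry], command: str, error_message: str
) -> Optional[ErrorEntry]:
    cmd, msg = normalize(command), normalize(error_message)
    for entry in entries:
        same_command = normalize(entry.command) == cmd
        same_error = normalize(entry.error_message) in msg
        if same_command and same_error:
            return entry
    return None


known = [
    # Vague entry: no command, no concrete message, so it can never match.
    ErrorEntry("", "something failed with Docker", "unknown"),
    # Specific entry: exact command and error text.
    ErrorEntry(
        "docker compose up",
        "port is already allocated",
        "stop the old container before starting a new one",
    ),
]
hit = find_known_error(
    known,
    "docker  compose up",
    "Error: Bind for 0.0.0.0:5432 failed: port is already allocated",
)
```

The specific entry matches and carries a usable fix; the vague one is dead weight in the log.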

Limitations and Honest Assessment

The self-improvement paradigm has real constraints that marketing materials don't mention:

No weight updating. These are prompt-layer interventions, not training-time improvements. The base model doesn't change. "Learning" means "better instructions next time," not "smarter model." If the model fundamentally can't do something, no amount of logging will fix it.

Context window competition. Every learning file competes for context space with the actual task. A LEARNINGS.md with 200 entries doesn't make the agent 200 times smarter — it makes the agent slower and potentially confused as irrelevant learnings dilute task-relevant context. The promotion-and-pruning cycle isn't optional; it's load-bearing.
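One way to limit that competition is to load only tag-matched learnings under a fixed budget. This is a sketch under stated assumptions: the `Entry` fields, the integer priority scale, and a character budget standing in for tokens are all hypothetical choices, not part of any system described here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Entry:
    text: str
    tags: FrozenSet[str]   # area tags used for filtering
    priority: int          # hypothetical scale: higher means more important


def select_for_task(
    entries: Iterable[Entry], task_tags: Iterable[str], max_chars: int
) -> List[Entry]:
    """Pick tag-matched entries, best match first, until the budget is spent."""
    wanted = set(task_tags)
    relevant = [e for e in entries if e.tags & wanted]
    # Most overlapping tags first, then higher priority.
    relevant.sort(key=lambda e: (-len(e.tags & wanted), -e.priority))
    chosen, used = [], 0
    for entry in relevant:
        if used + len(entry.text) > max_chars:
            continue  # skip entries that would exceed the budget
        chosen.append(entry)
        used += len(entry.text)
    return chosen


entries = [
    Entry("use pnpm, not npm", frozenset({"build"}), 2),
    Entry("test with vitest, not jest", frozenset({"tests", "build"}), 1),
    Entry("prefer tabs in Makefiles", frozenset({"make"}), 3),
]
tight = select_for_task(entries, {"build", "tests"}, max_chars=30)
roomy = select_for_task(entries, {"build", "tests"}, max_chars=50)
```

With the tight budget only the best-matching entry is loaded; the unrelated Makefile entry is never considered regardless of its priority.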

Session boundary amnesia. Between sessions, the agent "forgets" everything except what's in the persisted files. The learning loop is only as good as what got written down. Subtle realizations, contextual insights, and implicit knowledge evaporate at session end.

Detection is the bottleneck. The agent can only learn from errors it recognizes as errors. Confident hallucinations — where the agent is wrong but doesn't know it — never enter the pipeline. This is the hardest unsolved problem in the loop.

Promotion thresholds are arbitrary. The "3 recurrences across 2+ distinct tasks within 30 days" rule is a heuristic, not a principled cutoff. Some patterns should be promoted after a single occurrence; others will recur for months without being broadly applicable enough to warrant permanent storage.

The Current State

Self-improving agents are real, but the term oversells what's happening. What actually works is structured note-taking with a promotion pathway — agents that write down what went wrong in a consistent format, review those notes before similar tasks, and promote proven patterns to permanent guidance.

This is closer to a well-organized engineer's personal wiki than to autonomous learning. And that's fine. A well-organized wiki is genuinely useful. The gap between "repeats the same mistake forever" and "writes it down and checks next time" is large enough to matter in production.

The next step isn't better logging formats or smarter promotion heuristics. It's closing the detection gap — getting the agent to recognize confident-but-wrong outputs without external correction. Until that's solved, the learning loop is fundamentally human-supervised, and the "self-improving" label is aspirational.


References
OpenClaw self-improvement skill: skills/self-improving-agent/SKILL.md
LEARNINGS.md schema: ID format TYPE-YYYYMMDD-XXX, priority levels, promotion workflow
CLAUDE.md pattern: Anthropic's project-level instruction file convention
Cursor rules: .cursorrules for project-specific AI editor behavior
ByteRover: Structured memory for cross-session agent knowledge persistence