Self-Improving Agents: Continuous Learning Loops (Experimental)

Visual breakdown of the four-stage learning loop: detection rates, capture quality, storage tiers, and application effectiveness across real implementations.

KPI Snapshot

Loop Stages: Detect → Capture → Store → Apply

Storage Tiers: Raw logs → Project rules → Behavioral core

Error Detection Rate: ~30% (confident hallucinations missed)

Promotion Threshold: 3× (recurrences before permanent storage)
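The promotion threshold can be illustrated with a small counter. This is a minimal sketch, not any platform's actual implementation: `PROMOTION_THRESHOLD`, `record_occurrence`, and the in-memory store are hypothetical names, and reading "3×" as "promote on the third occurrence" is an assumption.

```python
from collections import Counter

# Assumption: "3x" means a learning is promoted on its third occurrence.
PROMOTION_THRESHOLD = 3

# Hypothetical in-memory store; a real loop would persist counts to disk.
occurrences = Counter()


def record_occurrence(learning_key: str) -> bool:
    """Count one occurrence and report whether the learning should be promoted now."""
    occurrences[learning_key] += 1
    # Promote exactly once, at the moment the threshold is reached.
    return occurrences[learning_key] == PROMOTION_THRESHOLD
```

Returning `True` only when the count equals the threshold keeps promotion a one-time event; later recurrences still increment the count but do not trigger it again.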

Loop Stage Effectiveness

Stage-by-stage effectiveness (qualitative assessment)

Hook-based, Manual log, Static rules, Session memory

Error detection: High, Medium, Low

Capture quality: Medium, High, Medium

Storage durability: Medium, High, Medium

Application ease: High, Medium, High, Low

Human dependency: Low, Medium, High, Medium

Component Maturity Bars

Error Detection Coverage

Command failures (exit codes): ~95%

User corrections ("actually..."): ~80%

Suboptimal but working output: ~35%

Confident hallucinations: ~5%
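The two high-coverage signals, exit codes and user corrections, are the easiest to check mechanically. The sketch below is one possible shape for such checks, under stated assumptions: the function names are hypothetical, and the cue list beyond "actually" is invented for illustration.

```python
import re
import subprocess
import sys

# Hypothetical cue list; the source only gives "actually..." as an example.
CORRECTION_CUES = re.compile(r"^\s*(actually|no,|that's wrong)", re.IGNORECASE)


def command_failed(argv: list[str]) -> bool:
    """Detect a command failure from a non-zero exit code."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode != 0


def looks_like_correction(user_message: str) -> bool:
    """Heuristic check for a user correction such as 'Actually, use pnpm'."""
    return CORRECTION_CUES.search(user_message) is not None


if __name__ == "__main__":
    # Run a child Python process that exits with a non-zero status.
    print(command_failed([sys.executable, "-c", "import sys; sys.exit(3)"]))
    print(looks_like_correction("Actually, use pnpm"))
```

Neither check says anything about output that runs cleanly but is wrong, which is consistent with the low coverage figures for suboptimal output and confident hallucinations.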

Learning Type ROI

Convention capture ("use pnpm"): High

Error reproduction patterns: High

Behavioral corrections: Medium

Architectural learnings: Low

Storage Tier Durability

.learnings/ raw logs: Session-scoped

AGENTS.md / TOOLS.md rules: Project-scoped

SOUL.md behavioral core: Permanent
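The three tiers can be modeled as an ordered enumeration. This is a sketch only: `Tier` and `next_tier` are hypothetical names, the project tier is represented by AGENTS.md alone for brevity, and promoting one tier at a time is an assumption drawn from the Raw logs → Project rules → Behavioral core ordering.

```python
from enum import Enum


class Tier(Enum):
    """Storage tiers from the breakdown, ordered from least to most durable."""
    RAW_LOG = ".learnings/"
    PROJECT_RULE = "AGENTS.md"  # TOOLS.md omitted here for brevity
    BEHAVIORAL_CORE = "SOUL.md"


def next_tier(tier: Tier) -> Tier:
    """Return the next more durable tier, or the same tier if already at the top."""
    order = list(Tier)  # Enum iteration follows definition order
    index = order.index(tier)
    return order[min(index + 1, len(order) - 1)]
```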

Implementation Comparison

Real implementations assessed

OpenClaw Self-Improvement, CLAUDE.md / Cursor Rules, Session Memory / Daily Logs, RAG-based Learning

Structured logging: ✓ Strict schema, ✗ Freeform, ~ Semi-structured, ~ Depends on embed

Promotion pathway: ✓ Built-in, ✗ Manual, ~ Heartbeat distill, ✗ N/A

Cross-session: ✓ Persistent files, ~ File-based, ✓ Vector store

Error hooks: ✓ PostToolUse, ✗, ~ Platform dep.

Recurrence tracking: ✓ Count + dates, ✗
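Strict-schema logging with recurrence tracking ("Count + dates") might look like the sketch below. The `LearningEntry` class and its field names are hypothetical and are not taken from OpenClaw or any other implementation listed here.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class LearningEntry:
    """Hypothetical schema for one logged learning; field names are illustrative."""
    key: str
    summary: str
    count: int = 0
    seen_on: list[str] = field(default_factory=list)  # ISO-8601 dates

    def record(self, day: date) -> None:
        """Register one recurrence: bump the count and remember the date."""
        self.count += 1
        self.seen_on.append(day.isoformat())

    def to_json(self) -> str:
        """Serialize the entry with stable key order so log diffs stay readable."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping both the count and the individual dates lets a promotion step distinguish a burst of repeats in one session from a pattern that recurs over weeks.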

The Detection Gap

Detectable (~95%): Command failures, timeouts, exceptions

Partial (~35%): Suboptimal output, missed better approaches

Blind Spot (~5%): Confident hallucinations, silent failures

Assessment based on production testing across the OpenClaw, Claude Code, and Cursor agent platforms, March 2026. Effectiveness ratings are qualitative, not automated benchmarks.
