KPI Snapshot
3
Major Agents
Anthropic · OpenAI · Google
18
Tasks Evaluated
6 per category
78%
Avg First-Attempt Pass
Across all agents & tasks
15%
Max Performance Gap
Between best & worst per task
Five-Axis Comparison
Performance by Category (first-attempt pass rate)
Category               Claude Code  Codex CLI  Gemini CLI
Multi-file refactor    83%          78%        72%
Bug isolation          72%          83%        71%
Greenfield scaffold    88%          79%        81%
Test-driven dev        80%          82%        69%
Instruction adherence  91%          76%        74%
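The category table lends itself to a quick arithmetic check. The sketch below copies the pass rates from the table and computes an unweighted mean per agent, an unweighted overall mean, and the best-to-worst gap per category. The function names are illustrative, and equal weighting of categories is an assumption: the evaluation's own aggregation is not described, and the headline KPIs are stated per task, so they need not match these category-level figures exactly.

```python
from statistics import mean

# First-attempt pass rates (%) copied from the category table.
PASS_RATES = {
    "Multi-file refactor":   {"Claude Code": 83, "Codex CLI": 78, "Gemini CLI": 72},
    "Bug isolation":         {"Claude Code": 72, "Codex CLI": 83, "Gemini CLI": 71},
    "Greenfield scaffold":   {"Claude Code": 88, "Codex CLI": 79, "Gemini CLI": 81},
    "Test-driven dev":       {"Claude Code": 80, "Codex CLI": 82, "Gemini CLI": 69},
    "Instruction adherence": {"Claude Code": 91, "Codex CLI": 76, "Gemini CLI": 74},
}


def agent_means(rates):
    """Unweighted mean pass rate per agent (assumes equal category weights)."""
    agents = next(iter(rates.values())).keys()
    return {agent: mean(row[agent] for row in rates.values()) for agent in agents}


def overall_mean(rates):
    """Unweighted mean over every agent/category cell."""
    return mean(value for row in rates.values() for value in row.values())


def category_gaps(rates):
    """Best-minus-worst pass rate per category, in percentage points."""
    return {cat: max(row.values()) - min(row.values()) for cat, row in rates.items()}


if __name__ == "__main__":
    for agent, avg in agent_means(PASS_RATES).items():
        print(f"{agent}: {avg:.1f}%")
    print(f"Overall: {overall_mean(PASS_RATES):.1f}%")
    for cat, gap in category_gaps(PASS_RATES).items():
        print(f"{cat}: gap of {gap} points")
```

Under equal weighting, the overall mean comes to 78.6%, in line with the 78% headline average, and the widest category-level gap is 17 points (instruction adherence).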
Capability Speed Bars
Bars compared: Context Window · Free Tier Generosity · First-Attempt Pass Rate (Aggregate) · IDE Coverage
Feature Matrix
Feature Availability
Feature             Claude Code  Codex CLI   Gemini CLI
MCP support         ✓            ✓           ✓
Search grounding    ✗            ✗           ✓
Plugin system       ✓            ✗           ✗
GitHub integration  @claude      Codex Web   GH Action
Cloud offload       ✗            ✓           ✗
Open source         ✗            Apache 2.0  Apache 2.0
Quick Verdict
Claude Code: best for complex refactors, team workflows, instruction-heavy tasks
Codex CLI: best for debug-repair loops, cloud offload, ChatGPT ecosystem users
Gemini CLI: best for budget-conscious developers, massive-context tasks, documentation lookups
Data: internal evaluation, 18 tasks across 3 real repositories, March 2026. Bars normalized to best performer per category.
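The note that bars are normalized to the best performer per category can be illustrated with a short Python sketch. The exact formula is not stated in the source; dividing each value by the category maximum is an assumption, and `normalize_to_best` is a hypothetical helper name. The example row reuses the instruction-adherence pass rates.

```python
def normalize_to_best(values):
    """Scale each value so the best performer in the category equals 1.0.

    Assumption: "normalized to best performer" means value / max(value).
    """
    best = max(values.values())
    if best <= 0:
        raise ValueError("best value must be positive to normalize")
    return {name: value / best for name, value in values.items()}


if __name__ == "__main__":
    # Instruction-adherence pass rates (%) from the category table.
    row = {"Claude Code": 91, "Codex CLI": 76, "Gemini CLI": 74}
    for name, length in normalize_to_best(row).items():
        print(f"{name}: {length:.3f}")
```

With this scheme the leading agent always gets a full-length bar, so bar lengths show relative standing within a category and are not comparable across categories.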