March 14, 2026 · 9 min read · Agent Infrastructure

The Context Drought: Anthropic Ships 1M in GA, but the Plateau is Real

Anthropic shipped 1M context windows to general availability today — Opus 4.6 with 78.3% on MRCR v2 at full million-token length, the highest long-context benchmark score we've seen. They also dropped the API's extra charge for long context, removed the beta header requirement, and expanded media limits to 600 images or PDF pages per request.

This is good. It's also two years late — and that's the real story.

The Timeline Nobody Wants to Talk About

Google shipped 1M context in Gemini in February 2024. Anthropic announced it for Claude 3 in March 2024. OpenAI GA'd theirs last week. And here we are in March 2026, still at 1 million tokens.

Two years. The same order of magnitude. In an industry where other dimensions such as cost, speed, and quality are expected to improve tenfold every 12 to 18 months, context windows have flatlined.

As Latent Space put it: "just under 1 order of magnitude growth in 2 years in context windows, which is much slower growth than all other dimensions (cost/speed/quality) of LLMs."

The bottleneck isn't algorithms. It's physics.

The Memory Wall

Every token in a 1M context window has to stay available for attention. In practice that means its cached keys and values sit in HBM, the high-bandwidth memory on the GPU. There's a fixed amount of it per chip, and the global supply is constrained. You can't just spin up more context without more silicon.
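To make the memory pressure concrete, here is a back-of-the-envelope sketch of the key-value cache needed for one sequence. The model shape (80 layers, 8 KV heads, head dimension 128, 2 bytes per value) and the 80 GiB of HBM per chip are illustrative assumptions, not the specs of any model or GPU named in this post.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold cached keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical dense-transformer shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
HBM_PER_CHIP_GIB = 80  # assumed accelerator capacity

for tokens in (8_000, 200_000, 1_000_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:7.1f} GiB of KV cache "
          f"({gib / HBM_PER_CHIP_GIB:.2f} chips' worth of HBM)")
```

Under these assumptions a single 1M-token sequence needs roughly 305 GiB for the cache alone, before counting model weights or any other request being served on the same hardware.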

In a recent Latent Space podcast, semiconductor analyst Doug O'Laughlin laid out the blunt arithmetic: we can't even 2x the available HBM easily, much less 10x it. The hardware pipeline is years behind the software ambition.

The result? "Context rationing" — a term coined by swyx during that conversation. The idea that large context windows become a premium, metered resource. Free tier users might get 1,000 tokens. Enterprise gets 1M. Everyone else negotiates in between.

"The 1 million context window is like a mansion." — Doug O'Laughlin

This has immediate implications for how we build agent systems.

What This Means for Agent Builders

If context windows are plateauing, then the architecture of agent memory matters more than ever. The "just stuff everything into the prompt" approach has a hard ceiling. Agent systems that rely on ever-growing context will hit it.

The practical alternatives are already emerging:

1. Tiered Memory Hierarchies

Instead of loading all context at once, agents need structured memory layers. OpenViking — which hit 2,191 stars today on GitHub — models this as a filesystem with L0/L1/L2 tiers. On-demand loading by relevance, not brute-force inclusion. This mirrors how computer architecture solved the same problem decades ago: cache, RAM, disk.

Recent research from IBM showed that extracting reusable strategy and recovery tips from agent trajectories improved task completion from 69.6% to 73.2% — without any increase in context window size. The gains came from smarter memory, not more memory.
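A minimal sketch of the tiered idea follows. It is not OpenViking's API: the meaning assigned to each tier here (L0 as a one-line abstract, L1 as a short overview, L2 as full detail), the word-count tokenizer, and the keyword-overlap relevance score are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    name: str
    l0_abstract: str  # one-line summary, always loaded
    l1_overview: str  # short overview, loaded when relevant
    l2_detail: str    # full content, loaded only when relevant and affordable


def approx_tokens(text):
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return len(text.split())


def relevance(query, entry):
    # Toy score: fraction of query words that appear in the entry's abstract.
    q = set(query.lower().split())
    return len(q & set(entry.l0_abstract.lower().split())) / len(q) if q else 0.0


def build_context(query, entries, budget):
    """Load every L0 abstract, then upgrade relevant entries to L2 or L1 if the budget allows."""
    # Assumes the abstracts alone fit within the budget.
    chosen = {e.name: e.l0_abstract for e in entries}
    used = sum(approx_tokens(t) for t in chosen.values())
    for entry in sorted(entries, key=lambda e: relevance(query, e), reverse=True):
        if relevance(query, entry) == 0:
            break  # everything after this is unrelated; leave it at L0
        for text in (entry.l2_detail, entry.l1_overview):  # prefer the richest tier
            extra = approx_tokens(text) - approx_tokens(chosen[entry.name])
            if used + extra <= budget:
                chosen[entry.name] = text
                used += extra
                break
    return chosen, used


entries = [
    MemoryEntry("auth", "auth service login tokens",
                "The auth service issues and refreshes login tokens.",
                "Full auth notes: token lifetimes, refresh flow, failure modes, rollout history."),
    MemoryEntry("billing", "billing invoices and refunds",
                "Billing generates invoices and handles refunds.",
                "Full billing notes: invoice schedule, refund policy, ledger layout, audit steps."),
]
context, used = build_context("why do login tokens expire", entries, budget=20)
```

With a budget of 20 the relevant entry is promoted to full detail while the unrelated one stays at its abstract. Shrinking the budget makes the same query fall back to the overview tier instead.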

2. Compaction Over Inclusion

Every long-running agent session eventually hits what Latent Space calls "the compaction dumb zone" — the point where context gets summarized to fit, and quality degrades. The fix isn't a bigger window. It's better compression: extracting durable learnings, storing structured facts, and discarding ephemeral conversation noise.

Agents that learn to compress — retaining what matters and discarding what doesn't — will outperform agents that just accumulate tokens.
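As a rough sketch of that kind of compaction, the snippet below keeps the most recent turns verbatim and reduces older turns to their durable lines. The `FACT:` and `DECISION:` prefixes are a hypothetical tagging convention invented for this example. A real system would more likely use a model to extract the durable content.

```python
DURABLE_PREFIXES = ("FACT:", "DECISION:")  # hypothetical tagging convention


def extract_durable(message):
    """Return the lines of a message worth keeping long-term."""
    return [line.strip() for line in message.splitlines()
            if line.strip().startswith(DURABLE_PREFIXES)]


def compact(history, keep_recent=2):
    """Reduce older turns to durable lines; keep the latest turns verbatim."""
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    durable = []
    for message in older:
        durable.extend(extract_durable(message))
    memory = list(dict.fromkeys(durable))  # de-duplicate while preserving order
    return memory, recent


history = [
    "Hi! Let's get started.\nFACT: The repo uses Python 3.12.",
    "Sounds good, thanks.",
    "DECISION: Use SQLite for the cache.\nFACT: The repo uses Python 3.12.",
    "Running the tests now.",
    "Two tests failed; investigating.",
]
memory, recent = compact(history, keep_recent=2)
```

Here three older turns collapse into two durable lines, and the small talk disappears without a summarization pass touching the recent turns.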

3. Cross-Session Persistence

The most interesting agent architectures right now — Hermes Agent, OpenClaw, deer-flow — all share a trait: they persist state across sessions. Not by keeping context windows alive, but by writing to external memory stores that survive restarts.

This is the agent equivalent of "write it down instead of trying to remember." Files beat context windows for durability. Database entries beat prompt stuffing for retrieval. Structured memory beats raw history for relevance.
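One simple way to get that durability is a small key-value store backed by a SQLite file. This is a sketch under stated assumptions, not how Hermes Agent, OpenClaw, or deer-flow implement persistence. The `MemoryStore` class, its method names, and the table layout are hypothetical.

```python
import os
import sqlite3
import tempfile


class MemoryStore:
    """Key-value memory that outlives the process, backed by a SQLite file."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS memory (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self.conn.commit()

    def remember(self, key, value):
        # INSERT OR REPLACE overwrites any earlier value stored under the same key.
        self.conn.execute(
            "INSERT OR REPLACE INTO memory (key, value) VALUES (?, ?)", (key, value)
        )
        self.conn.commit()

    def recall(self, key):
        row = self.conn.execute(
            "SELECT value FROM memory WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def close(self):
        self.conn.close()


# Simulate two sessions separated by a restart.
path = os.path.join(tempfile.mkdtemp(), "agent_memory.db")

session_one = MemoryStore(path)
session_one.remember("deploy_target", "staging")
session_one.close()  # the "restart"

session_two = MemoryStore(path)
restored = session_two.recall("deploy_target")
session_two.close()
```

The second session never sees the first session's context window, yet it recovers the stored fact from disk.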

4. Sparse Attention Optimizations

On the infra side, IndexCache — which reuses sparse-attention index information across layers — showed 1.2x end-to-end speedup on GLM-5 (744B) with matching quality. At 200K context on a 30B model, gains reached 1.82x on prefill and 1.48x on decode after removing 75% of indexers. These are the kinds of optimizations that make existing context windows more useful without requiring new hardware.
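The general idea of reusing index information across layers can be shown with a toy example. This is not IndexCache's implementation: the dot-product indexer, the fixed sharing interval, and the single shared query vector are simplifying assumptions, and the sketch says nothing about whether quality is preserved. It only shows where the saved indexer work comes from.

```python
import math
import random


def top_k_indices(query, keys, k):
    """Indexer: score every key with a dot product and keep the k best positions."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    best = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)


def sparse_attention(query, keys, values, indices):
    """Softmax attention restricted to the selected positions."""
    logits = [sum(q * x for q, x in zip(query, keys[i])) for i in indices]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices)) / total
            for d in range(dim)]


def run_layers(query, layers, k, share_every):
    """Run the indexer on every `share_every`-th layer; other layers reuse its indices."""
    indexer_runs, cached, outputs = 0, None, []
    for n, (keys, values) in enumerate(layers):
        if n % share_every == 0:
            cached = top_k_indices(query, keys, k)
            indexer_runs += 1
        outputs.append(sparse_attention(query, keys, values, cached))
    return outputs, indexer_runs


rng = random.Random(0)


def rand_vec(dim):
    return [rng.uniform(-1, 1) for _ in range(dim)]


query = rand_vec(4)
layers = [([rand_vec(4) for _ in range(16)], [rand_vec(4) for _ in range(16)])
          for _ in range(8)]

_, runs_full = run_layers(query, layers, k=4, share_every=1)
outputs, runs_shared = run_layers(query, layers, k=4, share_every=4)
print(runs_full, runs_shared)  # indexer runs without and with sharing
```

With eight layers and sharing every fourth layer, the indexer runs twice instead of eight times, which is the toy analogue of removing 75% of indexers.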

The Anthropic Release, Assessed

Let's be clear about what Anthropic actually shipped:

Opus 4.6 with 1M context: now the default for Max, Team, and Enterprise tiers
78.3% on MRCR v2 at 1M tokens: a new frontier score on the benchmark, fighting context rot at full scale
No extra API charge for long context: removing a cost barrier that discouraged experimentation
Beta header removed: production-ready, no opt-in friction
600 images or PDF pages per request: massive media ingestion for document-heavy workflows

The MRCR score is the most impressive number here. MRCR (Multi-Round Coreference Resolution) tests whether a model can find specific information buried deep in a long context. At 1M tokens, 78.3% means Claude retrieves buried details from a million-token haystack most of the time, though not every time. That haystack is about 750,000 words, or roughly five novels.

For agent systems processing large codebases, document corpora, or multi-day conversation histories, this is a meaningful capability jump. The removal of extra charges is equally important — it lowers the experimentation barrier for teams building context-heavy agents.

What Comes Next

We're betting that context windows don't meaningfully exceed 1M tokens in the next two years. The hardware math doesn't support it, and dense attention's quadratic cost is a fundamental constraint, not an engineering problem waiting for a clever fix.
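The scaling behind that bet is easy to work through. The sketch below counts query-key pairs for dense self-attention in a single head of a single layer, ignoring causal masking (which roughly halves the count but leaves it quadratic), and compares that with the number of cached positions. The function names are illustrative.

```python
def dense_attention_pairs(tokens):
    """Query-key pairs scored by dense self-attention in one head of one layer."""
    return tokens * tokens


def cached_positions(tokens):
    """Positions whose keys and values must be kept; grows linearly with context."""
    return tokens


base, bigger = 1_000_000, 10_000_000
compute_ratio = dense_attention_pairs(bigger) // dense_attention_pairs(base)
memory_ratio = cached_positions(bigger) // cached_positions(base)
print(f"10x more context -> {compute_ratio}x attention pairs, {memory_ratio}x cached positions")
```

Going from 1M to 10M tokens multiplies the dense attention work by 100 and the cached state by 10, so both compute and memory push back at the same time.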

What will improve:

Context quality over quantity — better retrieval, smarter chunking, relevance scoring
Memory architectures — tiered, persistent, self-organizing agent memory systems
Compression techniques — lossy summarization that retains task-critical information
Sparse attention — making the existing 1M window more efficient through algorithmic optimization
Multi-model orchestration — using different models for different context sizes, routing by task complexity
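The last item, routing by context size and task complexity, can be sketched in a few lines. The tier names, context limits, relative costs, and the rule that the largest tier is also the strongest reasoner are all hypothetical values chosen for illustration, not real product specs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical label, not a real product name
    max_context: int      # tokens the tier accepts
    relative_cost: float  # assumed per-token cost relative to the cheapest tier


TIERS = [
    ModelTier("small-32k", 32_000, 1.0),
    ModelTier("medium-200k", 200_000, 4.0),
    ModelTier("large-1m", 1_000_000, 15.0),
]


def route(prompt_tokens, needs_deep_reasoning=False, reserve_for_output=4_000):
    """Pick the cheapest tier whose window fits the prompt plus reserved output."""
    required = prompt_tokens + reserve_for_output
    candidates = [t for t in TIERS if t.max_context >= required]
    if not candidates:
        raise ValueError(f"{required:,} tokens exceeds every tier; compact or retrieve first")
    if needs_deep_reasoning:
        # Assumption: the largest tier that fits is also the strongest reasoner.
        return candidates[-1]
    return min(candidates, key=lambda t: t.relative_cost)


print(route(10_000).name)
print(route(30_000).name)
print(route(150_000, needs_deep_reasoning=True).name)
```

The error path matters as much as the happy path: when nothing fits, the router pushes the caller back toward compaction or retrieval instead of silently truncating.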

The agent systems that win won't be the ones with the longest context. They'll be the ones that need the least.

Practical Takeaways

If you're building agent systems today:

Design for the 1M ceiling — assume this is the max for the foreseeable future. Architect accordingly.
Invest in memory infrastructure now — tiered storage, persistent state, and retrieval systems will differentiate winning agent platforms.
Test Anthropic's new pricing — the removed long-context surcharge makes 1M-token API calls viable for production. Benchmark your workloads.
Track MRCR, not raw context length — a model that scores 78% at 1M is more useful than one that claims 10M but degrades past 200K.
Build compression pipelines — every agent session should end with structured memory extraction, not raw log storage.

The context drought isn't a crisis. It's a forcing function. It pushes agent builders toward architectures that are more efficient, more persistent, and more intelligent about what they remember. That's a better future than "just add more tokens."