The dominant architecture for agentic AI is LLMs calling tools. Give the model a problem, let it decide which tools to invoke, synthesize the results. The model is the orchestrator.
A new paper from Wenz, Treutwein, Arenja, Demiralp, and Stonebraker at MIT/Stanford says this is fundamentally wrong for enterprise use. Their argument is crisp: enterprises don't have a reasoning deficit; they have a data integration problem. When your data lives across SQL databases, document stores, APIs, and email, each with its own schema and access controls, dropping an opaque reasoning engine on top doesn't help. It makes things worse.
The five failure modes
LLM-centric agents fail on five dimensions:
- Data isolation. Enterprise data isn't in the training set. LLMs can't retrieve what they haven't seen.
- Structure loss. Converting SAP/Salesforce records into text destroys the underlying relational model.
- Access control blindness. LLMs don't know which user has permission to see which rows.
- Join failure. Joining a private data warehouse with Wikipedia is combinatorially expensive and error-prone under LLM orchestration.
- Text-to-SQL collapse. Models score 85%+ on the Spider benchmark, but accuracy in real enterprise deployments drops below 50% because of redundant schemas, site-specific codes, and complex business logic.
The RUBICON alternative
RUBICON treats agentic tasks as virtual data integration. Instead of hiding query plans inside black-box LLM reasoning chains, it makes them explicit across three layers:
AQL (Agentic Query Language). A small, SQL-like algebra built from three clauses: FIND, FROM, WHERE. Complex questions decompose into structured query plans rather than hidden LLM calls. The plan is the artifact: you can inspect it, optimize it, and reproduce it.
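To make "the plan is the artifact" concrete, here is a minimal sketch of a plan held as plain data. The `Step` shape, the `render` and `serialize` helpers, the source names, and the example question are all assumptions for illustration; this is not the paper's actual AQL syntax.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class Step:
    """One FIND ... FROM ... WHERE ... step of a plan (hypothetical shape)."""
    find: str    # FIND: the attribute or entity to retrieve
    source: str  # FROM: the name of a wrapped data source
    where: str   # WHERE: a predicate, which may be a natural-language utterance


def render(plan: list[Step]) -> str:
    """Print the plan in a readable, SQL-like form for inspection."""
    return "\n".join(
        f"{i}. FIND {s.find} FROM {s.source} WHERE {s.where}"
        for i, s in enumerate(plan, start=1)
    )


def serialize(plan: list[Step]) -> str:
    """Serialize the plan so it can be logged, diffed, and replayed."""
    return json.dumps([asdict(s) for s in plan], sort_keys=True)


# Hypothetical question: "Which of our faculty have won a given award?"
plan = [
    Step(find="winner_name", source="wikipedia", where="person won the award"),
    Step(find="faculty_id", source="hr_db", where="name matches winner_name"),
]

print(render(plan))
```

Because the plan is ordinary data, it can be stored next to the answer it produced and re-run later, which is what makes the result reproducible.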
Source wrappers. Every data source (SQL DB, document corpus, API, video) gets a wrapper that presents a relational view — rows and columns — regardless of its native format. The wrapper enforces access control, schema alignment, and result normalization. An LLM search over documents returns structured rows, not raw chunks.
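A wrapper of this kind might look roughly like the sketch below. The `SourceWrapper` protocol, the `DocumentWrapper` class, the column names, and the ACL layout are hypothetical; the paper's real wrapper interface is not reproduced here.

```python
from typing import Protocol

Row = dict[str, str]


class SourceWrapper(Protocol):
    """Hypothetical wrapper contract: fixed columns, rows filtered per user."""
    columns: tuple[str, ...]

    def scan(self, user: str) -> list[Row]: ...


class DocumentWrapper:
    """Presents a document corpus as rows of (doc_id, title, snippet)."""
    columns = ("doc_id", "title", "snippet")

    def __init__(self, docs: list[dict], acl: dict[str, set[str]]) -> None:
        self._docs = docs
        self._acl = acl  # doc id -> users allowed to read it

    def scan(self, user: str) -> list[Row]:
        rows = []
        for doc in self._docs:
            # Access control is enforced inside the wrapper, before any
            # row reaches the planner or the LLM.
            if user not in self._acl.get(doc["id"], set()):
                continue
            # Result normalization: every row has exactly the declared columns.
            rows.append({
                "doc_id": doc["id"],
                "title": doc["title"].strip(),
                "snippet": doc["body"][:40],
            })
        return rows


docs = [
    {"id": "d1", "title": " Q3 pricing memo ", "body": "Pricing changes for Q3 ..."},
    {"id": "d2", "title": "Board minutes", "body": "Confidential discussion ..."},
]
acl = {"d1": {"alice", "bob"}, "d2": {"alice"}}
wrapper: SourceWrapper = DocumentWrapper(docs, acl)

print(wrapper.scan("bob"))
```

The point of the design is that the caller never sees a row it is not entitled to and never has to guess at the shape of what comes back.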
Cost-based optimizer. Database-style query optimization returns. The order of joins matters dramatically. A "faculty-first" plan might require 1,000 Wikipedia lookups; an "award-first" plan might need 10. The optimizer chooses the cheapest path.
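The faculty-first versus award-first comparison can be sketched as a toy cost model. The `CandidatePlan` class and the one-lookup-per-row assumption are mine, chosen so the numbers match the 1,000 and 10 lookups quoted above; a real optimizer would estimate cardinalities rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePlan:
    name: str
    driving_rows: int     # estimated rows from the source scanned first
    lookups_per_row: int  # remote lookups issued for each driving row

    @property
    def cost(self) -> int:
        # Toy cost model: total number of remote lookups.
        return self.driving_rows * self.lookups_per_row


def choose_plan(plans: list[CandidatePlan]) -> CandidatePlan:
    """Pick the candidate with the lowest estimated cost."""
    return min(plans, key=lambda p: p.cost)


candidates = [
    # Scan all faculty, then look each one up on Wikipedia.
    CandidatePlan("faculty-first", driving_rows=1000, lookups_per_row=1),
    # Fetch the award winners first, then check each against the faculty list.
    CandidatePlan("award-first", driving_rows=10, lookups_per_row=1),
]

best = choose_plan(candidates)
print(best.name, best.cost)  # award-first 10
```

Both plans return the same answer; only the order of the join differs, and that alone accounts for the 100x gap in lookups.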
The results
Their experiment tested three approaches in a simulated 5-source enterprise environment, using GPT-5-mini, Gemini-3-flash, and Claude-Sonnet-4.6:
- Vanilla LLM: 0% accuracy (hallucinated from training data)
- ReAct agent: 0% accuracy (failed to consult all sources)
- RUBICON: 100% accuracy (deterministic execution over required sources)
Costs dropped sharply as well. ReAct agents accumulated 20K–270K input tokens per query; RUBICON averaged 4,182 tokens with only 2 tool calls. The Gemini ReAct agent burned 469K tokens and still got it wrong.
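For a sense of scale, the figures quoted above can be turned into ratios. This is back-of-the-envelope arithmetic only: the ReAct range is given in input tokens while the RUBICON average is given simply as tokens, so the two may not be measured on exactly the same basis.

```python
RUBICON_AVG_TOKENS = 4_182

# Figures as quoted in the text above.
react_tokens = {
    "ReAct, low end": 20_000,
    "ReAct, high end": 270_000,
    "Gemini ReAct run": 469_000,
}

ratios = {
    label: round(tokens / RUBICON_AVG_TOKENS, 1)
    for label, tokens in react_tokens.items()
}

for label, ratio in ratios.items():
    print(f"{label}: {ratio}x RUBICON's average")
```

On these numbers the ReAct runs cost roughly 5x to 65x RUBICON's average, and the Gemini run about 112x.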
What's novel
This paper matters because it asks the right question. The field is obsessed with making LLMs better reasoners. RUBICON asks: what if the bottleneck isn't reasoning at all? What if your agent fails because it doesn't know which databases exist, doesn't understand their schemas, and can't enforce access controls?
Three ideas worth extracting:
- Declarative query plans. Making orchestration explicit — inspectable, auditable, reproducible — is more valuable than making it smarter. We've internalized this in Pantheon's design (explicit agent roles, documented protocols) but haven't applied it to the retrieval layer.
- Typed wrappers. A SQL database and a document corpus are different things. Wrapping both in a uniform relational interface creates a contract — schema on write, normalization on read. This is the missing piece between "raw Markdown indexed by QMD" and "structured institutional memory."
- Plan sensitivity as a feature. The observation that join order changes cost by 100× is obvious to database people and invisible to LLM people. It reveals a class of optimization that doesn't exist in current agent architectures.
What to be skeptical about
The 0% vs. 100% accuracy framing is polemical, not scientific. The experiment uses 5 synthetic sources and a fixed query set. Real enterprise environments are messier. The 100% claim means "the system executed the plan we designed," not "the plan answered every possible question correctly."
The architecture also assumes you can wrap every data source behind a relational interface. That's true for SQL databases and REST APIs. It's less true for Slack threads, meeting transcripts, and institutional memory, which is the unstructured knowledge that makes up most of what Pantheon works with.
And the paper presents itself as an alternative to LLM agents, but RUBICON still uses LLMs — for the NL utterance in WHERE clauses, for synthesis, for understanding. It's not replacing LLMs. It's constraining where and how they're used.
What this means for Pantheon
We run a multi-agent system with ~900 indexed documents. Our knowledge infrastructure is QMD (hybrid lex+vec+hyde search), a cloud memory layer (cross-session user modeling), and the Library of Thoth (structured institutional memory). We don't have the enterprise data integration problem RUBICON is solving.
But we do have a query composition problem. QMD is single-hop: one query → one result set. Institutional memory questions are multi-step by nature. "What does the Pantheon believe about X?" should decompose into:
- Find "X" from pantheon-docs (lex)
- Find decisions about "X" from thoth-memories (vec)
- Filter by freshness > 2026-04-01
- Synthesize with uncertainty labels
This is exactly AQL's FIND chain. A thin composition layer on top of QMD — where the LLM writes the plan but execution is deterministic — would give us RUBICON's traceability without replacing our existing stack.
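A thin composition layer along these lines could be sketched as follows. Everything here is hypothetical: the `Find` and `FresherThan` step types, the `execute` function, and especially the `search(query, collection, mode)` signature, which stands in for QMD and is not its real API. The final synthesis step is left to the LLM and is not shown.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Union


@dataclass(frozen=True)
class Find:
    query: str
    collection: str
    mode: str  # "lex" or "vec"


@dataclass(frozen=True)
class FresherThan:
    cutoff: date


Step = Union[Find, FresherThan]
# Hypothetical search signature; not QMD's real API.
SearchFn = Callable[[str, str, str], list[dict]]


def execute(plan: list[Step], search: SearchFn) -> tuple[list[dict], list[str]]:
    """Run the steps in order and return (hits, trace)."""
    hits: list[dict] = []
    trace: list[str] = []
    for step in plan:
        if isinstance(step, Find):
            found = search(step.query, step.collection, step.mode)
            hits.extend(found)
            trace.append(
                f"FIND {step.query!r} FROM {step.collection} ({step.mode}): {len(found)} hits"
            )
        elif isinstance(step, FresherThan):
            hits = [h for h in hits if h["updated"] > step.cutoff]
            trace.append(f"WHERE updated > {step.cutoff.isoformat()}: {len(hits)} kept")
        else:
            raise TypeError(f"unknown step: {step!r}")
    return hits, trace


# In-memory stand-in for the search backend, for illustration only.
CORPUS = {
    "pantheon-docs": [{"id": "doc-1", "text": "X overview", "updated": date(2026, 3, 1)}],
    "thoth-memories": [{"id": "mem-7", "text": "Decision on X", "updated": date(2026, 4, 20)}],
}


def fake_search(query: str, collection: str, mode: str) -> list[dict]:
    return [h for h in CORPUS[collection] if query.lower() in h["text"].lower()]


plan = [
    Find("X", "pantheon-docs", "lex"),
    Find("X", "thoth-memories", "vec"),
    FresherThan(date(2026, 4, 1)),
]
hits, trace = execute(plan, fake_search)
print("\n".join(trace))
```

The LLM's only job in this arrangement is to emit the `plan` list. The executor is ordinary code, so the trace records exactly which collections were consulted and how many results survived each step.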
The bigger integration opportunity is typed memory objects. The Institutional Memory PRD already calls for structured objects (client profiles, decision records, workflow patterns). RUBICON's wrapper pattern is the implementation: each object type gets a schema contract enforced at write and normalized at read. QMD indexes typed objects rather than raw files, and queries return structured results with mandatory fields.
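As a rough sketch of what "enforced at write, normalized at read" could mean for one object type: the `DecisionRecord` fields, the confidence levels, and the `MemoryStore` class below are assumptions for illustration, not the schema from the Institutional Memory PRD.

```python
from dataclasses import asdict, dataclass
from datetime import date

CONFIDENCE_LEVELS = {"low", "medium", "high", "unknown"}


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical schema for one typed memory object."""
    topic: str
    decision: str
    decided_on: date
    confidence: str = "unknown"

    def __post_init__(self) -> None:
        # Schema contract, checked when the object is created (at write time).
        if not self.topic.strip() or not self.decision.strip():
            raise ValueError("topic and decision are mandatory")
        if not isinstance(self.decided_on, date):
            raise ValueError("decided_on must be a date")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"confidence must be one of {sorted(CONFIDENCE_LEVELS)}")


class MemoryStore:
    def __init__(self) -> None:
        self._records: list[DecisionRecord] = []

    def write(self, **fields) -> DecisionRecord:
        record = DecisionRecord(**fields)  # raises if the contract is violated
        self._records.append(record)
        return record

    def read(self, topic: str) -> list[dict]:
        # Normalization at read: same keys in every row, ISO dates, trimmed text.
        wanted = topic.strip().lower()
        rows = []
        for r in self._records:
            if r.topic.strip().lower() == wanted:
                row = asdict(r)
                row["topic"] = r.topic.strip()
                row["decided_on"] = r.decided_on.isoformat()
                rows.append(row)
        return rows


store = MemoryStore()
store.write(topic=" Retrieval layer ", decision="Add a plan composer",
            decided_on=date(2026, 4, 15))
print(store.read("retrieval layer"))
```

A malformed record fails loudly at write time instead of surfacing later as a missing field in a query result.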
The verdict
RUBICON is a reference architecture, not a product. It's designed for problems an order of magnitude larger than Pantheon — 10+ heterogeneous, access-controlled data systems queried together. But it identifies a real blind spot in current agent design: the field is optimizing reasoning quality while the bottleneck is data coordination.
The ideas that scale down cleanly — query plan composability, typed wrappers, audit trails — are directly applicable to the next version of our knowledge infrastructure. The ones that don't — cost optimization, multi-source joins, enterprise access mediation — are problems we don't have.
Read the paper for the critique, steal the query algebra. The architecture itself is a 2028 problem. The insight is a 2026 one.
Paper: "An Alternate Agentic AI Architecture (It's About the Data)" — Wenz, Treutwein, Arenja, Demiralp, Stonebraker. arXiv:2604.21413, April 2026.