
The Four Architectures for Website AI Agents

RAG bots can only talk. API-tool agents ship a second maintenance org. Code sandboxes burn per-user server cost. Only DOM-native execution inherits the web's own truth model — live HTML, user session, existing IAM. A structural comparison.

rtrvr.ai Team
· April 17, 2026 · 11 min read

A research thesis on where embedded agents are headed, what's structurally broken, and why DOM-native execution is the only architecture that composes with how the web actually ships.



Every production AI agent attached to a website in 2026 fits into one of four architectures. Three of them are structurally compromised — by staleness, by maintenance burden, or by security exposure.

The argument is not that one implementation is better polished than another. The argument is that three of these architectures have ceilings that no amount of engineering can push through. RAG bots cannot act. API-tool agents cannot ship without a parallel maintenance org. Code-sandbox agents cannot run on a consumer website without exposing platform shape and paying sandbox cost per user per action.

Only DOM-native embedded execution grandfathers in the web's existing truth model: live HTML, user session, server-side IAM, existing deploy cadence.

2,000+ websites are already integrated with Rover. That number moves because the architecture moves — not because the pitch does.


Architecture 1: RAG Bots

Intercom Fin, Drift, every chatbot your company has shipped.

Chunk your docs. Embed them. Store in a vector DB. User asks a question, similarity-search, stuff top-k chunks into an LLM prompt, stream a response. This is ~99% of customer-facing chatbots on the web today.
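The loop above can be sketched end to end. This is a toy illustration of the RAG pipeline only: real deployments use a learned embedding model and a vector database, so the hashed bag-of-words "embedding" here exists purely to make the chunk → embed → rank → prompt flow runnable. All names (`Chunk`, `retrieve`, `buildPrompt`) are illustrative.

```typescript
type Chunk = { id: string; text: string };

// Toy embedding: hash each word into one of `dims` buckets and count.
// A real system would call an embedding model here.
function embed(text: string, dims = 64): number[] {
  const v = new Array(dims).fill(0);
  for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    let h = 0;
    for (const c of word) h = (h * 31 + c.charCodeAt(0)) % dims;
    v[h] += 1;
  }
  return v;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Rank all chunks by similarity to the query; keep the top k.
function retrieve(query: string, chunks: Chunk[], k = 2): Chunk[] {
  const q = embed(query);
  return [...chunks]
    .sort((x, y) => cosine(embed(y.text), q) - cosine(embed(x.text), q))
    .slice(0, k);
}

// Stuff the retrieved chunks into the prompt the LLM will see.
function buildPrompt(query: string, chunks: Chunk[]): string {
  const context = retrieve(query, chunks).map((c) => c.text).join("\n");
  return `Answer using only this context:\n${context}\n\nQuestion: ${query}`;
}
```

Note where the failure modes live: `retrieve` ranks by similarity, not truth (retrieval noise), and the chunks were embedded at index time, not request time (temporal drift).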

The ceiling

The one that matters most is the simplest: RAG bots can only talk. They cannot click a button. They cannot fill a form. They cannot finish a checkout or advance an onboarding flow. They are a search bar with a chat veneer.

Six compounding failure modes sit underneath:

  1. Temporal drift. Ship a feature Tuesday. Embeddings reflect Monday. The bot lies confidently about what your product does right now.
  2. Retrieval noise. At scale, thematic adjacency is not factual correctness. The bot grabs a chunk that sounds right and generates around it.
  3. Scope hallucination. Stripe's chatbot will help you with React. Your healthcare chatbot will debate philosophy. The base LLM has no native concept of "I am a payments site; decline off-topic."
  4. Chunking artifacts. No universal chunking strategy exists. It is domain-specific tuning you pay for forever.
  5. The evaluation gap. Dashboards say 87% positive. Users report the bot is an annoyance. Industry data: ~20% of users rate chatbot experiences as acceptable.
  6. The maintenance tax. You didn't deploy AI support. You deployed a second full-time job: re-embed, re-index, re-test, re-prompt.

RAG is a good primitive for knowledge retrieval. As a customer-facing website agent it has peaked. The data: 45% of users abandon chatbot interactions after three failed attempts; mobile rage-click rates up 667% YoY after AI chatbot deployment. The ceiling is load-bearing, not a temporary engineering gap.


Architecture 2: API-Connected Agents

Google WebMCP, custom MCP integrations, Intercom Fin with actions.

Expose your internal APIs as structured tools. Publish schemas. Let an agent call those tools. Google's WebMCP is the most aggressive version: let Chrome's agent talk to your site through exposed tool definitions.
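To make the "second surface" concrete, here is a minimal MCP-style tool definition. The shape is illustrative (this is not the WebMCP spec, and `apply_coupon` is a hypothetical tool): every field below must be re-edited and re-tested whenever the underlying API changes, which is exactly the lockstep cost described next.

```typescript
// Hypothetical tool definition exposed to an agent. Each property is a
// maintenance obligation: rename a parameter server-side without
// updating this schema and the agent breaks silently.
const applyCouponTool = {
  name: "apply_coupon",
  description: "Apply a coupon code to the current cart.",
  inputSchema: {
    type: "object",
    properties: {
      cartId: { type: "string", description: "ID of the active cart" },
      code:   { type: "string", description: "Coupon code to apply" },
    },
    required: ["cartId", "code"],
  },
};

// Simplistic validator: does the agent's call satisfy the published
// schema? (Real servers also check types, auth, tier, and rate limits.)
function validateCall(
  tool: typeof applyCouponTool,
  args: Record<string, unknown>,
): boolean {
  return tool.inputSchema.required.every((key) => typeof args[key] === "string");
}
```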

The ceiling

Not broken in theory. Broken in operations. Three unavoidable costs:

Maintenance coupling. Modern product teams ship multiple times a day. Every meaningful API change requires a corresponding schema update, tool description update, and re-test. You've created a second surface that must ship in lockstep with your product. Miss a sync and the agent silently breaks — and because agents are stochastic, you find out from a customer, not a test.

Security surface. Every exposed tool is an authorization question. Who can call it? Under what subscription tier? With what user context? With what rate limit? The agent now sits inside your IAM graph, and every tier change, every permission boundary, every row-level access rule must be re-encoded as agent-side policy. This is a permanent engineering org, not a one-time integration.

Client-side exposure or server-side cost. You pick your poison. Expose your API surface client-side and you've handed your platform shape to anyone who reads a network tab. Keep it server-side and you're spinning per-user sandboxes — which pushes you into Architecture 3.

WebMCP specifically: Google intermediates your users. You do the integration work. You maintain the tool schemas. And the user has the conversation with Chrome's agent, not with your site. Your checkout, your onboarding, your brand moment is mediated by a third party whose incentives are not aligned with yours. This is the 2010s SEO trade again, with higher stakes.

This architecture will exist for backend-to-backend agent coordination. As the front door between an AI agent and a consumer website, it imposes a cost structure — ship-in-lockstep + permanent IAM surface + exposure risk — that only a small minority of sites can afford.


Architecture 3: Code-Writing Sandbox Agents

Cloudflare Agent Lee (launched April 15, 2026), Codemode patterns.

Agent Lee is Cloudflare's in-dashboard AI assistant. The mechanism: rather than presenting tool definitions directly to the model, Agent Lee uses Codemode to convert tools into a TypeScript API and asks the model to write code that calls it. Generated code runs in a sandboxed Durable Object that acts as a credentialed proxy — API keys never appear in the generated code, read ops proxy directly, write ops gate behind user approval.

The numbers: ~18,000 daily users, ~250K tool calls/day across DNS, Workers, R2, Registrar, Cache, Tunnels, and more.

Credit where due — this is elegant. For Cloudflare's problem.

Why it doesn't generalize

Cloudflare's problem is a developer dashboard: one company, one platform, one permission graph. Every property that makes Agent Lee work depends on Cloudflare owning the entire stack. For a consumer website, the same architecture fails on four axes:

Sandbox latency is not consumer-grade. Spinning a sandboxed execution environment per request is fine for developers troubleshooting at 2am. It is not fine for a user trying to check out in under 10 seconds. The LLM-writes-code-then-executes loop adds seconds per action.

Approval gates break consumer flow. Every write operation requires a confirmation dialog. On a developer dashboard: a feature. On a checkout page: death. Consumers will not click through modals to complete a purchase they asked the agent to do.

Platform shape becomes a typed API exposed to the LLM. Maintenance debt and security debt at the same time. Every internal capability becomes a documented surface the model reasons about. You don't ship a feature; you ship a feature plus its type signature plus its sandbox permission profile.

Per-user server cost. Each interaction burns a sandboxed DO plus LLM tokens for code gen plus tool-call round-trips. Cloudflare's users pay $20–$20K/month — that's fine. For a consumer site where the visitor is a free browser user, unit economics are broken before you start.

Code-writing sandbox agents are the right architecture for developer platforms — high-consequence ops, developer users, approval gates as a feature. They are the wrong architecture for consumer websites.

The generative UI trap

Agent Lee and CopilotKit both push generative UI — the agent dynamically composes charts, tables, forms, and cards at runtime. CopilotKit built an entire protocol (AG-UI) around it, adopted by Google, LangChain, and AWS.

Three reasons this is wrong for consumer websites:

  1. Token cost compounds per turn. Every UI-rendering interaction costs 2–5x the tokens of text, plus the rendering pipeline, plus re-renders on state change. On a paid dashboard: fine. On a consumer site where every visit is a cost center: not.

  2. It is a maintenance surface disguised as a feature. Good generative UI needs skill prompts, block libraries, brand-matched style tokens, responsive breakpoints, accessibility audits on LLM-composed output. Most deployments do none of these.

  3. Your website already has a UI. It is tested, designed, conversion-optimized, and A/B-validated. Letting an LLM improvise UI blocks on top of that — for one user, at checkout, at the moment of highest intent — undoes all of that work. Every user sees a slightly different, untested surface.

Rover takes the opposite position: don't replace your UI. Drive it. The site owner already decided what the right button, form, and flow are. The agent's job is to get the user there faster.


Architecture 4: DOM-Native Embedded Execution

Rover.

One <script> tag. The runtime reads the live DOM and accessibility tree, plans the next action, and executes directly in the user's browser — in the user's existing authenticated session, on the site's own domain. No screenshots. No VMs. No server-side sandbox. No exposed API schema.

Seven structural properties

These are not features. They are consequences of DOM-native execution that the other three architectures cannot replicate without rebuilding as Rover.

1. Zero documentation maintenance. The DOM is the source of truth. When your team ships a new button at 2pm, Rover sees it at 2:00:01pm. No re-embedding. No schema update. No chunk re-index.

2. Zero API exposure. No backend tools to spec. No TypeScript types to publish. No client-side API keys. The agent acts through the same interface a human does — the rendered page. Your platform shape stays private.

3. IAM grandfathered in. The user is already signed in. The agent acts inside that session. Whatever the user can see and do, the agent can. Whatever they cannot, it cannot. Your existing server-side IAM — every tier, every permission, every row-level rule — is the agent's authorization layer for free.

4. Zero deploy dependency. Ship your product at whatever cadence you want. Rover doesn't need a release. You don't version your agent against your product. The agent reads whatever shipped.

5. Sub-second actions. No screenshot round-trip. No vision model inference. No sandbox spin-up. Rover identifies the target element by semantic identity (ARIA label, role, data attributes) and dispatches a native DOM event. Milliseconds, not seconds.
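The semantic-identity idea can be sketched as a pure matching function. In the browser the candidates would come from the live DOM and accessibility tree; here they are plain objects so the logic stands alone, and the scoring rule is a deliberate simplification, not Rover's actual algorithm.

```typescript
type Candidate = { role: string; label: string; selector: string };

// Resolve an intent ("click the Buy button") to a concrete element by
// role + accessible label, rather than by screenshot or brittle CSS path.
function resolveTarget(
  candidates: Candidate[],
  intent: { role: string; label: string },
): Candidate | undefined {
  let best: Candidate | undefined;
  let bestScore = 0;
  for (const c of candidates) {
    if (c.role !== intent.role) continue; // wrong role: never a match
    const a = c.label.toLowerCase();
    const b = intent.label.toLowerCase();
    const score = a === b ? 2 : a.includes(b) || b.includes(a) ? 1 : 0;
    if (score > bestScore) { best = c; bestScore = score; }
  }
  return best;
}

// In the browser, the resolved target would then receive a native event:
//   document.querySelector(target.selector)?.dispatchEvent(
//     new MouseEvent("click", { bubbles: true }));
```

Because matching and dispatch are both local, no model inference or sandbox sits between intent and click.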

6. Bounded scope by construction. The agent can only do what the DOM exposes. It cannot answer React questions on your payments page — not because of a guardrail prompt, but because there is no React documentation in its input. Scope isn't policy; it's architecture.

7. Client-side execution. User data never leaves the browser except through the same network calls your site already makes. Privacy, compliance, and residency stories are inherited. No new data plane to audit.


Head-to-head

| | RAG bots | API-tool agents | Code sandboxes | Rover (DOM-native) |
| --- | --- | --- | --- | --- |
| Can act on the page | No | Via exposed APIs | Via generated code | Yes (native DOM) |
| Maintenance per ship | High (re-embed) | High (schema sync) | Medium (type sync) | Zero |
| API exposure risk | None | High | Medium (platform shape to LLM) | None |
| IAM integration cost | N/A | Rebuild per tool | Rebuild in sandbox proxy | Free (inherits session) |
| Latency per action | ~1s (retrieval) | 0.5–2s (API call) | 2–8s (code gen + sandbox) | <200ms |
| Per-user cost | Moderate | Moderate | High (sandbox + tokens) | Low (client-side) |
| Scope guarantee | Prompt policy | Tool whitelist | Code classifier + gate | Architecture (DOM boundary) |
| Consumer-grade UX | Yes (but only talks) | No (auth flows, latency) | No (approval modals) | Yes |

Validation

  • #1 on Halluminate WebBench at 81.39%, ahead of OpenAI Operator and Anthropic Computer Use
  • 25,000+ users, 3M+ executed workflows
  • 2,000+ websites integrated via the embed, Chrome extension, cloud API, and MCP server
  • Open source under FSL-1.1-Apache-2.0 — site owners can read every line that runs in their users' browsers

What this unlocks

When the runtime is DOM-native, embedded, and session-native, intent-to-outcome distance collapses:

  • Prompt-to-checkout. "Buy the Pro plan" — Rover navigates, selects, fills, submits. In the user's session, on the site's domain.
  • Guided onboarding. "Show me how to set up my first campaign" — Rover clicks alongside the user, advancing the real product UI.
  • Form completion. Multi-step forms become conversational. 40% less drop-off in production deployments.
  • Cross-site handoffs. Workflows that span Rover-enabled sites with aggregated lineage.

Where this is going

The agent-web interaction layer is being built in public — one Agents Week at a time. Cloudflare is solving content, payment, and identity edges. Google is pushing A2A for agent-to-agent. Anthropic is driving MCP for agent-to-tool. All of these layers matter. None of them is the execution layer.

The execution layer — the one that turns "I want to do X on your site" into X, on your site, inside the user's session, at sub-second latency, without a parallel engineering org — is DOM-native and embedded.

The other three architectures will exist. RAG will keep answering questions. API-tool agents will run backend-to-backend. Code sandboxes will own developer platforms. None of them will own the consumer website agent layer, because none of them composes with how the web actually works.

Rover does.
