Engineering Deep Dive

Seeing the Web Agent: Live VNC Takeover for Serverless Chrome Workers

Web agents don't fail because LLMs can't click buttons. They fail because the web is a distributed system disguised as a UI. This is about live view + takeover for ephemeral cloud browsers - without sticky instance routing.

rtrvr.ai Team • January 4, 2026 • 16 min read

[Video, 2:45: rtrvr.ai executing across 53+ parallel cloud browsers with live view and mid-run takeover capability. Live view, instant takeover, zero CDP, DOM-native only.]


Cloud Run Relay + DOM-Native Automation

Web agents don't fail in production because "LLMs can't click buttons."

They fail because the web is a distributed system disguised as a UI: dynamic DOMs, nested iframes, cross-origin boundaries, shadow roots, overlays, challenge flows, flaky UI state — and once you ship… you go blind.

This post is about the thing that made everything else easier for us:

Live view + takeover for ephemeral cloud browsers - without relying on sticky instance routing - and how that changed our debugging loop for real-world web agent failures.

We'll cover:

  • Why we built live VNC view + takeover (born from prod debugging pain)
  • The Cloud Run constraints that make "just add VNC" fragile
  • The architecture that works: Runner + Relay + signed pairing tokens
  • Why we moved from Fluxbox to Xvfb + x11vnc + xsetroot
  • A pragmatic security model (no public VNC ports; execution-bound access)
  • Production metrics + what we're instrumenting next
  • How this fits with our broader approach: DOM-native automation (no screenshots/vision) and no CDP / Playwright / Puppeteer

TL;DR

  • rtrvr.ai is DOM-native: we read and reason over the DOM directly - not pixels - and we don't use screenshots/vision as the primary understanding layer
  • We avoid CDP-based automation (and the "debugger port" control model). We integrate with Chrome via extension APIs (locally and in cloud browsers)
  • We built live VNC view + takeover because debugging native captcha solving (image selection challenges) required seeing exactly what the agent sees in real time; logs weren't enough
  • Cloud Run's reality (best-effort affinity + WebSockets + autoscaling) pushes you toward a design where the viewer never needs to hit the same worker instance
  • The solution: a separate relay service that pairs viewer ↔ runner via short-lived signed tokens and forwards bytes

Context: Our Agent is DOM-Native (and That's Why Observability Still Matters)

Most web agents fall into one of two camps:

1) Vision-first "computer use" agents: screenshot → vision model → guess what's clickable → click

2) CDP tooling: Playwright/Puppeteer/Selenium driving the browser via the Chrome DevTools Protocol

We take a third path:

  • DOM-native intelligence via Chrome extension APIs (locally and in the cloud)
  • No screenshot parsing as the primary "understanding layer"
  • No CDP / Playwright / Puppeteer / debugger port controlling the browser

If you want the deeper details:

  • Benchmark results (Halluminate Web Bench)
  • Architecture deep dive

"If you're DOM-native, why VNC?"

Because humans still debug with eyes.

VNC isn't how the agent understands the web. It's how we:

  • see the real page state in production
  • verify what the agent clicked
  • identify the frame/realm/shadow boundary we mis-targeted
  • recover runs via takeover (then hand back control to the agent)

What We Shipped (Product Surface)

What users get:

1) Live VNC view + takeover for cloud browsers

  • Watch what's happening in prod
  • Take control mid-run when needed
  • Optionally attach "later" while a scheduled run is still active

2) Parallel execution UI

  • When we run 50-100+ browsers in parallel, users see a grid/graph
  • This turns "hours of web research" into minutes

3) Better action success in iframes / shadow DOM / cross-origin boundaries

  • We've done heavy work to traverse and target the correct realm/subtree reliably

4) Templates (rtrvr.ai/retrieve) + Shareable outputs

  • One-click workflows
  • Share results via a single URL

This post focuses on the engineering behind live view + takeover - because that's the piece most builders underestimate.


The Moment That Forced This: Native Captcha Solving

We made a deliberate choice early on: no third-party captcha services.

Why? We don't want to insecurely forward user data to external providers. We don't want dependencies on services that could disappear or change pricing. And frankly, if our DOM intelligence can traverse iframes, shadow DOM, and cross-origin boundaries - it should be able to handle captchas natively.

So we built it. And it works.

Our agent can click through Google reCAPTCHA - including the image selection challenges - by leveraging the same iframe and shadow DOM capabilities we use everywhere else. The captcha widget lives in a cross-origin iframe? We handle it. The checkbox is inside nested shadow roots? We handle it.
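To make "we handle it" slightly more concrete: within any single frame, a content script can walk open shadow roots to find a target. A minimal sketch in TypeScript (illustrative only, not our actual traversal code; cross-origin iframes aren't reachable this way and are instead covered by injecting the content script into every frame):

// Hypothetical helper: find an element through nested *open* shadow roots.
// Cross-origin iframes can't be traversed from the parent; a content script
// injected per frame (all_frames: true) covers those.
function deepQuery(root: Document | ShadowRoot, selector: string): Element | null {
  const direct = root.querySelector(selector);
  if (direct) return direct;
  for (const el of root.querySelectorAll("*")) {
    if (el.shadowRoot) {
      const found = deepQuery(el.shadowRoot, selector);
      if (found) return found;
    }
  }
  return null;
}

// e.g. locating a checkbox-like control from within the frame that owns it
const checkbox = deepQuery(document, ".recaptcha-checkbox");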

But here's where it got interesting: image selection captchas.

When the agent needs to "select all images with traffic lights," it's making real-time decisions about which tiles to click. And when something goes wrong - when it's clicking the wrong images or missing tiles - logs don't tell you why.

I found myself staring at execution logs thinking: "Did it click the right images? Which ones did it miss? Was the grid even fully loaded?"

That's when I wanted to just see it.

Not screenshots after the fact. Not logs describing click coordinates. I wanted to watch the agent work through a captcha in real-time, see exactly what it was seeing, and understand where decisions went wrong.

That debugging need - born from native captcha solving - is what drove the VNC implementation.


Cloud Run Realities: Why Naive "VNC in the Runner" Breaks

If you're running ephemeral Chrome workers on Cloud Run, you can't wish away these constraints:

1) Concurrency is a cap, not isolation

By default, a Cloud Run instance can serve multiple requests concurrently. If you want "one browser per container" and clean isolation, set runner concurrency=1.

This is the simplest model that keeps "one execution ↔ one browser ↔ one container" consistent.

2) Session affinity is best-effort (not a correctness primitive)

Even with session affinity enabled, Cloud Run explicitly frames it as best-effort. You can't build a system that depends on "viewer reconnects always hit the same instance" and expect it to hold under autoscaling, restarts, or deployment rollouts.

3) WebSockets don't make stickiness deterministic

WebSockets on Cloud Run still inherit the platform's routing reality. New WebSocket requests can land on different instances; if your design requires "attach to the exact same instance later," you will eventually lose that bet.

So we designed around one principle:

Never require the viewer to hit the same runner instance.


The Architecture That Survives Production: Runner + Relay + Signed Pairing Tokens

Design Goals

  • Runner stays clean: concurrency=1, one browser per container, minimal mixed traffic
  • Viewer can attach "later" without sticky routing hacks
  • VNC is gated by auth and bound to the correct execution session
  • No public VNC ports. No "open 5900 to the internet"

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           VIEWER (noVNC)                                │
│                         User's Browser                                  │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                          wss /vnc
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                          VNC RELAY                                      │
│                     Cloud Run (high concurrency)                        │
│                                                                         │
│   • Token validation (sid + uid + role)                                 │
│   • Viewer ↔ Runner pairing                                             │
│   • Bidirectional byte forwarding                                       │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                       wss /agent (outbound)
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           RUNNER                                        │
│                  Cloud Run (concurrency=1)                              │
│                                                                         │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                │
│   │   Chrome    │───►│    Xvfb     │───►│   x11vnc    │                │
│   │  + Extension│    │  :99 display│    │  TCP 5900   │                │
│   └─────────────┘    └─────────────┘    └─────────────┘                │
│                                              (localhost only)           │
└─────────────────────────────────────────────────────────────────────────┘

"Attach Later" Flow

1. User Browser ──► API: Request viewer token (sid)
   API ──► User: viewerToken(sid, uid, exp, role=viewer)

2. Runner ──► API: Request runner token (sid)
   API ──► Runner: runnerToken(sid, uid, exp, role=runner)

3. User Browser ──► Relay: wss /vnc (viewerToken)
   Runner ──► Relay: wss /agent (runnerToken)

4. Relay validates tokens, matches sid+uid, pairs viewer↔runner
   Relay ◄──► User: Start forwarding bytes
   Relay ◄──► Runner: Start forwarding bytes

How Pairing Works

A live session is two WebSockets:

  • viewer → relay at /vnc
  • runner → relay at /agent (outbound from the runner)

Both present short-lived signed tokens (JWT/HMAC) containing:

  • sid (execution/session id)
  • uid (owner id)
  • role (viewer vs runner)
  • exp (expiry)

Relay pairs only if:

  • sid matches
  • uid matches
  • roles are correct
  • token is valid + unexpired

Once paired, relay forwards bytes both directions.

Why this matters: attach/reconnect becomes deterministic, and Cloud Run routing becomes irrelevant.
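In code, the relay's job is small. Here's a minimal sketch of the pairing logic, assuming a Node relay with the ws and jsonwebtoken packages (endpoint paths and claim names match the flow above; everything else is illustrative, not our production code):

// Minimal relay pairing sketch.
import { createServer } from "http";
import { WebSocketServer, WebSocket } from "ws";
import jwt from "jsonwebtoken";

type Role = "viewer" | "runner";
interface PairToken { sid: string; uid: string; role: Role; exp: number; }

const waiting = new Map<string, { uid: string; viewer?: WebSocket; runner?: WebSocket }>();
const wss = new WebSocketServer({ noServer: true });
const server = createServer();

server.on("upgrade", (req, socket, head) => {
  const url = new URL(req.url ?? "", "http://relay");
  const role: Role | null =
    url.pathname === "/vnc" ? "viewer" : url.pathname === "/agent" ? "runner" : null;
  let claims: PairToken;
  try {
    // verify() checks signature and exp; throws if invalid or expired
    claims = jwt.verify(url.searchParams.get("token") ?? "",
                        process.env.SIGNING_SECRET!) as PairToken;
  } catch { socket.destroy(); return; }
  if (!role || claims.role !== role) { socket.destroy(); return; } // role separation

  wss.handleUpgrade(req, socket, head, (ws) => {
    const entry = waiting.get(claims.sid) ?? { uid: claims.uid };
    if (entry.uid !== claims.uid) { ws.close(); return; } // sid must belong to same owner
    entry[role] = ws;
    waiting.set(claims.sid, entry);
    if (entry.viewer && entry.runner) pair(claims.sid, entry.viewer, entry.runner);
  });
});

function pair(sid: string, viewer: WebSocket, runner: WebSocket) {
  // Once paired, the relay is just a dumb byte pipe in both directions
  viewer.on("message", (data) => runner.send(data));
  runner.on("message", (data) => viewer.send(data));
  const teardown = () => { viewer.close(); runner.close(); waiting.delete(sid); };
  viewer.on("close", teardown);
  runner.on("close", teardown);
}

server.listen(8080);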


Display Stack: We Started with Fluxbox and Removed It

We started with the "typical headless VNC recipe." It worked - but it was heavier than we wanted.

Iteration 1: Fluxbox (works, but too much)

Fluxbox was fine for "a desktop," but we didn't want a desktop. We wanted:

  • deterministic pixels
  • minimal overhead
  • fewer moving parts
  • reproducible background/viewport

Iteration 2: The Minimal Stack (what we shipped)

  • Xvfb (virtual X server)
  • x11vnc (expose X display over VNC)
  • xsetroot (set background to confirm correct display & keep consistent visuals)
  • noVNC on the client side

Startup sequence:

# 1) Create a deterministic virtual display
Xvfb :99 -screen 0 1920x1080x24 -nolisten tcp &
export DISPLAY=:99

# 2) Make it visually obvious you're on the right display
xsetroot -solid "#0B0F14"  # pick your brand-friendly dark

# 3) Start Chrome against DISPLAY=:99
google-chrome --no-sandbox --disable-dev-shm-usage ... &

# 4) Expose the display via VNC, bound to localhost only (never public);
#    auth happens at the relay, not here
x11vnc -display :99 -rfbport 5900 -localhost -shared -forever -nopw -noxdamage &

Important: we do not expose 5900 publicly. We bridge it through the runner → relay tunnel.
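That bridge is small: one outbound WebSocket to the relay, piped into the local VNC port. A sketch of the runner side, assuming Node with the ws package (RELAY_URL and RUNNER_TOKEN env names are illustrative):

// Runner-side bridge sketch: localhost:5900 ↔ relay tunnel.
import net from "net";
import WebSocket from "ws";

const ws = new WebSocket(`${process.env.RELAY_URL}/agent?token=${process.env.RUNNER_TOKEN}`);

ws.on("open", () => {
  const vnc = net.connect(5900, "127.0.0.1");            // x11vnc, localhost only
  vnc.on("data", (chunk) => ws.send(chunk));             // frames out: VNC → relay → viewer
  ws.on("message", (data) => vnc.write(data as Buffer)); // input back: viewer → relay → VNC
  const teardown = () => { vnc.destroy(); ws.close(); };
  vnc.on("close", teardown);
  ws.on("close", teardown);
});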


Why the Relay is a Separate Service

We intentionally did not terminate viewer WebSockets on the same runner service that executes tasks:

  • WebSockets are long-lived requests → they "occupy" instance capacity
  • Even with session affinity, Cloud Run can route new connections to different instances
  • We want runner to remain simple: concurrency=1, one browser, isolated execution, predictable scaling

So:

  • Runner: executes the agent, owns the browser, stays isolated
  • Relay: just pairs + forwards, can run high concurrency, long timeouts

Security Model

The worst version of this system is: "expose raw VNC to the internet."

VNC is literally remote control. Treat it like production SSH:

1) No public VNC ports. VNC only exists on localhost inside the runner container.

2) Short-lived tokens. Viewer access is tied to an authenticated execution; tokens expire quickly.

3) Session binding. The relay only pairs a viewer and a runner for the same sid and uid.

4) Role separation. Viewer tokens can't impersonate runner tokens (and vice versa).

5) No CDP/debugger port for core automation. Our core automation runs via Chrome extension APIs, not a debugger backdoor.
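For completeness, the viewer side: with noVNC, "takeover" is just toggling the client out of view-only mode for an already-authenticated session. A sketch (the RFB import path varies across noVNC versions; the relay hostname, element ids, and token plumbing are illustrative):

// Viewer-side sketch using noVNC's RFB client.
import RFB from "@novnc/novnc/core/rfb";

const viewerToken = "...";  // minted per sid/uid by the API, as described above

const rfb = new RFB(
  document.getElementById("screen")!,
  `wss://relay.example.com/vnc?token=${viewerToken}`,
);
rfb.viewOnly = true;       // start as a pure live view
rfb.scaleViewport = true;  // fit the remote 1920x1080 display

// Takeover = letting input events flow; hand-back = re-enabling viewOnly
document.getElementById("takeover")!.addEventListener("click", () => {
  rfb.viewOnly = false;
});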


Observability: What Live View Immediately Unlocked

1) Captcha debugging became visual

Once we could watch real captcha flows:

  • wrong tile selections became obvious immediately
  • timing issues (clicking before images fully loaded) became visible
  • we could see exactly which images the agent was interpreting correctly vs incorrectly

2) Iframes, shadow DOM, cross-origin boundaries became debuggable

Modern sites are not "one DOM." They are a forest:

  • top frame
  • nested iframes (sometimes cross-origin)
  • shadow roots
  • dynamic overlays/modals

Action failures often come down to:

  • wrong realm
  • wrong subtree
  • wrong clickability state

Live view makes these failure classes visually undeniable.

3) Parallel fleets became something users can trust

Once users can see what's happening across a fleet, parallelism becomes less scary:

  • failures are localized (one worker fails; the rest complete)
  • you can recover outliers via takeover
  • throughput becomes a function of parallelism, not patience

Performance & Capacity (Production)

Report date: Jan 2026

Executive Summary

Service              Status        Avg Concurrent    Error Rate
VNC Relay            Stable        0.75 sessions     0%
Web Agent (Runner)   Some errors   0.99 executions   2.4%

VNC Relay — Current Load

Metric                              Value
Window analyzed                     2,667s (~44 min)
Total connections                   88
Avg concurrent viewer connections   0.75
Avg concurrent runner connections   1.02
Avg concurrent sessions             0.75
Error count                         0
Error rate                          0%

How sessions are computed: each live session requires two WebSocket connections (viewer + runner). We compute average concurrency from request durations and take min(viewer, runner) as the paired-session count: here, min(0.75, 1.02) = 0.75.
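A toy version of that computation, with illustrative per-connection durations (not real log data):

// Average concurrency = total connection-seconds / window length.
const WINDOW_S = 2667;                // the ~44 min window above
const viewerDurations = [1200, 800];  // per-connection seconds, hypothetical
const runnerDurations = [1500, 1220]; // per-connection seconds, hypothetical

const avgConcurrent = (d: number[]) => d.reduce((a, b) => a + b, 0) / WINDOW_S;

// A paired session needs both legs, so take the minimum of the two averages
const avgSessions = Math.min(avgConcurrent(viewerDurations), avgConcurrent(runnerDurations));
console.log(avgSessions.toFixed(2)); // ≈ 0.75 with these sample numbers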

Relay Configuration

Setting                                Value
Concurrency                            200
Max instances                          200
Max concurrent connections (ceiling)   40,000
Max concurrent sessions (ceiling)      ~20,000

This ceiling is just concurrency × max_instances ÷ 2: 200 × 200 = 40,000 connections, or ~20,000 paired sessions. Real limits show up earlier via socket/FD limits, memory per connection, bandwidth saturation, and autoscaling dynamics.

Bandwidth: The Metric That Matters Next

CPU often looks easy. Bandwidth/session is usually the true limiter.

What we're instrumenting next in the relay:

  • bytes runner→viewer, bytes viewer→runner
  • attach latency (click → first frame)
  • p50/p95 attach latency under load

Runner Latency + Errors

Metric                      Value
Window analyzed             3,415s (~57 min)
Total requests              41
5xx errors                  1
Error rate                  2.4%
Avg concurrent executions   0.99

Latency distribution:

Percentile   Latency
Avg          153.8s
P50          118.2s
P95          513.2s (~8.5 min)

What Broke (Lessons Learned)

1) "VNC in the runner service" created routing problems Reconnects and attach-later were unreliable. We stopped fighting this and built the relay.

2) Fluxbox was unnecessary complexity It's great if you want a desktop. We didn't. Removing it improved determinism.

3) Bandwidth is the real capacity question CPU looked fine; bandwidth/session is what we're measuring next.

4) The web is many realms, not one page iframe/shadow-root mis-targeting is where many "works in demo, fails in prod" bugs live. Live view made these obvious.


Why This Matters Beyond Debugging

This is the productivity multiplier we care about:

  • Give rtrvr a list of URLs + one sentence describing output fields
  • Spin up dozens of cloud browsers in parallel
  • Return a structured dataset in minutes
  • Let users watch and take over when needed

Open Questions

If you've shipped similar "live attach" systems:

  1. What replaced VNC for you (WebRTC? custom streaming?) and what was the biggest win?
  2. Best approach for recording/replay of agent runs without insane storage?
  3. Any gotchas with pairing/rendezvous services at high scale?

References

rtrvr.ai

  • Web Bench results
  • DOM Intelligence Architecture

Cloud Run

  • Max concurrent requests / concurrency
  • Configure concurrency
  • Session affinity
  • WebSockets guidance
  • Request timeouts
  • Max instances

Display + VNC Stack

  • Xvfb (man page)
  • xsetroot (man page)
  • Fluxbox

DOM Complexity

  • MDN ShadowRoot
  • Chrome extensions: content scripts

Benchmarks

  • WebArena paper (arXiv)