Seeing the Web Agent: Live VNC Takeover for Serverless Chrome Workers
Cloud Run Relay + DOM-Native Automation
Web agents don't fail in production because "LLMs can't click buttons."
They fail because the web is a distributed system disguised as a UI: dynamic DOMs, nested iframes, cross-origin boundaries, shadow roots, overlays, challenge flows, flaky UI state — and once you ship… you go blind.
This post is about the thing that made everything else easier for us:
Live view + takeover for ephemeral cloud browsers - without relying on sticky instance routing - and how that changed our debugging loop for real-world web agent failures.
We'll cover:
- Why we built live VNC view + takeover (born from prod debugging pain)
- The Cloud Run constraints that make "just add VNC" fragile
- The architecture that works: Runner + Relay + signed pairing tokens
- Why we moved from Fluxbox to Xvfb + x11vnc + xsetroot
- A pragmatic security model (no public VNC ports; execution-bound access)
- Production metrics + what we're instrumenting next
- How this fits with our broader approach: DOM-native automation (no screenshots/vision) and no CDP / Playwright / Puppeteer
TL;DR
- rtrvr.ai is DOM-native: we read and reason over the DOM directly - not pixels - and we don't use screenshots/vision as the primary understanding layer
- We avoid CDP-based automation (and the "debugger port" control model). We integrate with Chrome via extension APIs (locally and in cloud browsers)
- We built live VNC view + takeover because debugging native captcha solving (image selection challenges) required seeing exactly what the agent sees in real time; logs weren't enough
- Cloud Run's reality (best-effort affinity + WebSockets + autoscaling) pushes you toward a design where the viewer never needs to hit the same worker instance
- The solution: a separate relay service that pairs viewer ↔ runner via short-lived signed tokens and forwards bytes
Context: Our Agent is DOM-Native (and That's Why Observability Still Matters)
Most web agents fall into one of two camps:
1) Vision-first "computer use" agents: screenshot → vision model → guess what's clickable → click
2) CDP tooling: Playwright/Puppeteer/Selenium via the Chrome DevTools Protocol
We take a third path:
- DOM-native intelligence via Chrome extension APIs (locally and in the cloud)
- No screenshot parsing as the primary "understanding layer"
- No CDP / Playwright / Puppeteer / debugger port controlling the browser
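To make that control model concrete, here's a minimal sketch of extension-driven actuation; it is not our actual action pipeline, just an illustration of what "extension APIs instead of a debugger port" means. It assumes a Manifest V3 extension with "scripting" and host permissions; the helper name and selector plumbing are made up for the example.

```typescript
// Minimal sketch of extension-driven actuation (not our action pipeline):
// a Manifest V3 service worker clicks a node via chrome.scripting, with no
// CDP and no debugger port. Assumes "scripting" + host permissions.
async function clickBySelector(tabId: number, selector: string): Promise<boolean> {
  const results = await chrome.scripting.executeScript({
    target: { tabId, allFrames: true }, // inject into every frame we can reach
    func: (sel: string) => {
      const el = document.querySelector<HTMLElement>(sel);
      if (!el) return false;
      el.click(); // a real DOM click, dispatched in the page's own realm
      return true;
    },
    args: [selector],
  });
  // One result per frame; true anywhere means the click landed.
  return results.some((r) => r.result === true);
}
```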
"If you're DOM-native, why VNC?"
Because humans still debug with eyes.
VNC isn't how the agent understands the web. It's how we:
- see the real page state in production
- verify what the agent clicked
- identify the frame/realm/shadow boundary we mis-targeted
- recover runs via takeover (then hand back control to the agent)
What We Shipped (Product Surface)
What users get:
1) Live VNC view + takeover for cloud browsers
- Watch what's happening in prod
- Take control mid-run when needed
- Optionally attach "later" while a scheduled run is still active
2) Parallel execution UI
- When we run 50-100+ browsers in parallel, users see a grid/graph
- This turns "hours of web research" into minutes
3) Better action success in iframes / shadow DOM / cross-origin boundaries
- We've done heavy work to traverse and target the correct realm/subtree reliably
4) Templates (rtrvr.ai/retrieve) + Shareable outputs
- One-click workflows
- Share results via a single URL
This post focuses on the engineering behind live view + takeover - because that's the piece most builders underestimate.
The Moment That Forced This: Native Captcha Solving
We made a deliberate choice early on: no third-party captcha services.
Why? We don't want to insecurely forward user data to external providers. We don't want dependencies on services that could disappear or change pricing. And frankly, if our DOM intelligence can traverse iframes, shadow DOM, and cross-origin boundaries - it should be able to handle captchas natively.
So we built it. And it works.
Our agent can click through Google reCAPTCHA - including the image selection challenges - by leveraging the same iframe and shadow DOM capabilities we use everywhere else. The captcha widget lives in a cross-origin iframe? We handle it. The checkbox is inside nested shadow roots? We handle it.
But here's where it got interesting: image selection captchas.
When the agent needs to "select all images with traffic lights," it's making real-time decisions about which tiles to click. And when something goes wrong - when it's clicking the wrong images or missing tiles - logs don't tell you why.
I found myself staring at execution logs thinking: "Did it click the right images? Which ones did it miss? Was the grid even fully loaded?"
That's when I wanted to just see it.
Not screenshots after the fact. Not logs describing click coordinates. I wanted to watch the agent work through a captcha in real-time, see exactly what it was seeing, and understand where decisions went wrong.
That debugging need - born from native captcha solving - is what drove the VNC implementation.
Cloud Run Realities: Why Naive "VNC in the Runner" Breaks
If you're running ephemeral Chrome workers on Cloud Run, you can't wish away these constraints:
1) Concurrency is a cap, not isolation
By default, a Cloud Run instance can serve multiple requests concurrently. If you want "one browser per container" and clean isolation, set runner concurrency=1.
This is the simplest model that keeps "one execution ↔ one browser ↔ one container" consistent.
2) Session affinity is best-effort (not a correctness primitive)
Even with session affinity enabled, Cloud Run explicitly frames it as best-effort. You can't build a system that depends on "viewer reconnects always hit the same instance" and expect it to hold under autoscaling, restarts, or deployment rollouts.
3) WebSockets don't make stickiness deterministic
WebSockets on Cloud Run still inherit the platform's routing reality. New WebSocket requests can land on different instances; if your design requires "attach to the exact same instance later," you will eventually lose that bet.
So we designed around one principle:
Never require the viewer to hit the same runner instance.
The Architecture That Survives Production: Runner + Relay + Signed Pairing Tokens
Design Goals
- Runner stays clean: concurrency=1, one browser per container, minimal mixed traffic
- Viewer can attach "later" without sticky routing hacks
- VNC is gated by auth and bound to the correct execution session
- No public VNC ports. No "open 5900 to the internet"
High-Level Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ VIEWER (noVNC) │
│ User's Browser │
└────────────────────────────────┬────────────────────────────────────────┘
│
wss /vnc
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ VNC RELAY │
│ Cloud Run (high concurrency) │
│ │
│ • Token validation (sid + uid + role) │
│ • Viewer ↔ Runner pairing │
│ • Bidirectional byte forwarding │
└────────────────────────────────┬────────────────────────────────────────┘
│
wss /agent (outbound)
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ RUNNER │
│ Cloud Run (concurrency=1) │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Chrome │◄───│ Xvfb │◄───│ x11vnc │ │
│ │ + Extension│ │ :99 display│ │ TCP 5900 │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ (localhost only) │
└─────────────────────────────────────────────────────────────────────────┘
"Attach Later" Flow
1. User Browser ──► API: Request viewer token (sid)
API ──► User: viewerToken(sid, uid, exp, role=viewer)
2. Runner ──► API: Request runner token (sid)
API ──► Runner: runnerToken(sid, uid, exp, role=runner)
3. User Browser ──► Relay: wss /vnc (viewerToken)
Runner ──► Relay: wss /agent (runnerToken)
4. Relay validates tokens, matches sid+uid, pairs viewer↔runner
Relay ◄──► User: Start forwarding bytes
Relay ◄──► Runner: Start forwarding bytes
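As a minimal sketch of steps 1 and 2, here's what minting those tokens can look like, assuming a Node API service and the jsonwebtoken package. The claim names mirror the flow above; mintPairingToken and the secret handling are illustrative, not our exact implementation.

```typescript
// Minimal sketch of steps 1 and 2: the API mints short-lived signed
// pairing tokens. Assumes a Node service and the `jsonwebtoken` package;
// `mintPairingToken` and the secret handling are illustrative.
import jwt from "jsonwebtoken";

type Role = "viewer" | "runner";

function mintPairingToken(sid: string, uid: string, role: Role): string {
  return jwt.sign(
    { sid, uid, role },                      // execution id, owner id, role
    process.env.PAIRING_SECRET!,             // HMAC secret shared with the relay
    { algorithm: "HS256", expiresIn: "60s" } // exp: short-lived by design
  );
}

// Same mint path for both sides; only the role claim differs.
const viewerToken = mintPairingToken("exec-123", "user-42", "viewer");
const runnerToken = mintPairingToken("exec-123", "user-42", "runner");
```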
How Pairing Works
A live session is two WebSockets:
- viewer → relay at /vnc
- runner → relay at /agent (outbound from the runner)
Both present short-lived signed tokens (JWT/HMAC) containing:
- sid (execution/session id)
- uid (owner id)
- role (viewer vs runner)
- exp (expiry)
Relay pairs only if:
- sid matches
- uid matches
- roles are correct
- token is valid + unexpired
Once paired, relay forwards bytes both directions.
Why this matters: attach/reconnect becomes deterministic, and Cloud Run routing becomes irrelevant.
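For concreteness, here's a condensed sketch of that pairing logic, assuming a Node relay built on the ws and jsonwebtoken packages. The production service adds idle timeouts, close/error cleanup, and metrics; this shows only the validate → match → forward core.

```typescript
// Condensed sketch of relay pairing, assuming a Node service using the
// `ws` and `jsonwebtoken` packages. The real relay adds idle timeouts,
// close/error cleanup, and metrics.
import { WebSocketServer, WebSocket } from "ws";
import jwt from "jsonwebtoken";

type Claims = { sid: string; uid: string; role: "viewer" | "runner" };

// Half-open sessions waiting for their counterpart, keyed by uid:sid.
const pending = new Map<string, { viewer?: WebSocket; runner?: WebSocket }>();

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws, req) => {
  const url = new URL(req.url ?? "/", "http://relay");
  let claims: Claims;
  try {
    // Verifies signature AND expiry in one call.
    claims = jwt.verify(
      url.searchParams.get("token") ?? "",
      process.env.PAIRING_SECRET!
    ) as Claims;
  } catch {
    return ws.close(4001, "invalid or expired token");
  }

  // Role separation: viewers connect on /vnc, runners on /agent.
  const expectedPath = claims.role === "viewer" ? "/vnc" : "/agent";
  if (url.pathname !== expectedPath) return ws.close(4003, "role mismatch");

  // Pair only on matching sid + uid.
  const key = `${claims.uid}:${claims.sid}`;
  const slot = pending.get(key) ?? {};
  slot[claims.role] = ws;
  pending.set(key, slot);

  if (slot.viewer && slot.runner) {
    pending.delete(key);
    pipe(slot.viewer, slot.runner); // forward bytes both directions
  }
});

function pipe(a: WebSocket, b: WebSocket) {
  a.on("message", (data) => b.send(data));
  b.on("message", (data) => a.send(data));
  a.on("close", () => b.close());
  b.on("close", () => a.close());
}
```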
Display Stack: We Started with Fluxbox and Removed It
We started with the "typical headless VNC recipe." It worked - but it was heavier than we wanted.
Iteration 1: Fluxbox (works, but too much)
Fluxbox was fine for "a desktop," but we didn't want a desktop. We wanted:
- deterministic pixels
- minimal overhead
- fewer moving parts
- reproducible background/viewport
Iteration 2: The Minimal Stack (what we shipped)
- Xvfb (virtual X server)
- x11vnc (expose X display over VNC)
- xsetroot (set background to confirm correct display & keep consistent visuals)
- noVNC on the client side
Startup sequence:
```bash
# 1) Create a deterministic virtual display
Xvfb :99 -screen 0 1920x1080x24 -nolisten tcp &
export DISPLAY=:99

# 2) Make it visually obvious you're on the right display
xsetroot -solid "#0B0F14"   # pick your brand-friendly dark color

# 3) Start Chrome against DISPLAY=:99
google-chrome --no-sandbox --disable-dev-shm-usage ... &

# 4) Expose the display via VNC locally (NOT public)
x11vnc -display :99 -rfbport 5900 -shared -forever -nopw -noxdamage &
```
Important: we do not expose 5900 publicly. We bridge it through the runner → relay tunnel.
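That bridge is just an outbound WebSocket from the runner spliced onto the local VNC socket. A sketch, assuming Node with the built-in net module and the ws client; the relayUrl and runnerToken plumbing is illustrative:

```typescript
// Sketch of the runner-side bridge: dial OUT to the relay over wss and
// splice the stream onto localhost:5900. Assumes Node's built-in `net`
// and the `ws` client; relayUrl/runnerToken plumbing is illustrative.
import net from "node:net";
import WebSocket from "ws";

function bridgeVnc(relayUrl: string, runnerToken: string) {
  const ws = new WebSocket(`${relayUrl}/agent?token=${runnerToken}`);
  ws.on("open", () => {
    // 5900 is only ever reachable from inside this container.
    const vnc = net.connect({ host: "127.0.0.1", port: 5900 });
    vnc.on("data", (chunk) => ws.send(chunk));             // runner -> viewer
    ws.on("message", (data) => vnc.write(data as Buffer)); // viewer -> runner
    vnc.on("close", () => ws.close());
    ws.on("close", () => vnc.destroy());
  });
}
```

Because the runner dials out, no inbound port needs to be exposed on the runner at all; the relay is the only public surface.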
Why the Relay is a Separate Service
We intentionally did not terminate viewer WebSockets on the same runner service that executes tasks:
- WebSockets are long-lived requests → they "occupy" instance capacity
- Even with session affinity, Cloud Run can route new connections to different instances
- We want runner to remain simple: concurrency=1, one browser, isolated execution, predictable scaling
So:
- Runner: executes the agent, owns the browser, stays isolated
- Relay: just pairs + forwards, can run high concurrency, long timeouts
Security Model
The worst version of this system is: "expose raw VNC to the internet."
VNC is literally remote control. Treat it like production SSH:
1) No public VNC ports. VNC only exists on localhost inside the runner container.
2) Short-lived tokens. Viewer access is tied to an authenticated execution; tokens expire quickly.
3) Session binding. Relay only pairs viewer and runner for the same sid and uid.
4) Role separation. Viewer tokens can't impersonate runner tokens (and vice versa).
5) No CDP/debugger port for core automation. Our core automation is via Chrome extension APIs, not a debugger backdoor.
Observability: What Live View Immediately Unlocked
1) Captcha debugging became visual
Once we could watch real captcha flows:
- wrong tile selections became obvious immediately
- timing issues (clicking before images fully loaded) became visible
- we could see exactly which images the agent was interpreting correctly vs incorrectly
2) Iframes, shadow DOM, cross-origin boundaries became debuggable
Modern sites are not "one DOM." They are a forest:
- top frame
- nested iframes (sometimes cross-origin)
- shadow roots
- dynamic overlays/modals
Action failures often come down to:
- wrong realm
- wrong subtree
- wrong clickability state
Live view makes these failure classes visually undeniable.
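To see why "one selector query" isn't enough, here's an illustrative sketch (not our traversal engine) of how many roots a single document can hide. Open shadow roots and same-origin iframes are reachable recursively; cross-origin frames are opaque from this realm and need their own injection, like the allFrames example earlier.

```typescript
// Illustrative sketch (not our traversal engine): enumerate every realm a
// query has to cover starting from one document. Open shadow roots and
// same-origin iframes are reachable recursively; cross-origin iframes are
// opaque from here and need their own injection.
function collectRoots(root: Document | ShadowRoot): (Document | ShadowRoot)[] {
  const roots: (Document | ShadowRoot)[] = [root];
  for (const el of root.querySelectorAll<HTMLElement>("*")) {
    if (el.shadowRoot) roots.push(...collectRoots(el.shadowRoot)); // open roots only
  }
  for (const frame of root.querySelectorAll<HTMLIFrameElement>("iframe")) {
    // contentDocument is null for cross-origin frames; nothing to walk here.
    if (frame.contentDocument) roots.push(...collectRoots(frame.contentDocument));
  }
  return roots;
}

// Usage: search every reachable realm for a target node.
const verifyButton = collectRoots(document)
  .flatMap((r) => [...r.querySelectorAll<HTMLElement>("button")])
  .find((b) => b.textContent?.trim() === "Verify");
```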
3) Parallel fleets became something users can trust
Once users can see what's happening across a fleet, parallelism becomes less scary:
- failures are localized (one worker fails; the rest complete)
- you can recover outliers via takeover
- throughput becomes a function of parallelism, not patience
Performance & Capacity (Production)
Report date: Jan 2026
Executive Summary
| Service | Status | Avg Concurrent | Error Rate |
|---|---|---|---|
| VNC Relay | Stable | 0.75 sessions | 0% |
| Web Agent (Runner) | Some errors | 0.99 executions | 2.4% |
VNC Relay — Current Load
| Metric | Value |
|---|---|
| Window analyzed | 2,667s (~44 min) |
| Total connections | 88 |
| Avg concurrent viewer connections | 0.75 |
| Avg concurrent runner connections | 1.02 |
| Avg concurrent sessions | 0.75 |
| Error count | 0 |
| Error rate | 0% |
How sessions are computed: each session requires two WS connections (viewer + runner). We compute avg concurrency using request durations and take min(viewer, runner) as paired sessions.
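In code, the computation is tiny; a sketch under those definitions, with per-connection durations summed over the window:

```typescript
// The same computation in code: average concurrency is connection-seconds
// divided by the window, and paired sessions are bounded by the scarcer leg.
function avgConcurrent(durationsSec: number[], windowSec: number): number {
  const connectionSeconds = durationsSec.reduce((sum, d) => sum + d, 0);
  return connectionSeconds / windowSec;
}

// A session needs both legs, so paired concurrency is the min of the two.
// With the numbers above: min(0.75, 1.02) = 0.75 avg concurrent sessions.
const avgSessions = Math.min(0.75, 1.02);
```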
Relay Configuration
| Setting | Value |
|---|---|
| Concurrency | 200 |
| Max instances | 200 |
| Max concurrent connections (ceiling) | 40,000 |
| Max concurrent sessions (ceiling) | ~20,000 |
This ceiling is just concurrency × max_instances ÷ 2: 200 × 200 = 40,000 concurrent connections, halved to ~20,000 paired sessions because each session uses two connections. Real limits show up earlier via sockets/FD limits, memory per connection, bandwidth saturation, and autoscaling dynamics.
Bandwidth: The Metric That Matters Next
CPU often looks easy. Bandwidth/session is usually the true limiter.
What we're instrumenting next in the relay:
- bytes runner→viewer, bytes viewer→runner
- attach latency (click → first frame)
- p50/p95 attach latency under load
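As a sketch of where those counters would live, here's the forwarding loop from the pairing example with per-direction byte counts and a first-frame timestamp added. Metric names and the console sink are illustrative; a real version would export to a metrics backend.

```typescript
// Sketch: the pipe() from the pairing example, with per-direction byte
// counters and a first-frame timestamp. Metric names and the console sink
// are illustrative.
import WebSocket from "ws";

function instrumentedPipe(viewer: WebSocket, runner: WebSocket, attachedAt: number) {
  let runnerToViewerBytes = 0;
  let viewerToRunnerBytes = 0;
  let firstFrameAt: number | undefined;

  runner.on("message", (data: Buffer) => {
    firstFrameAt ??= Date.now();        // attach latency: click -> first frame
    runnerToViewerBytes += data.length; // framebuffer updates dominate bandwidth
    viewer.send(data);
  });
  viewer.on("message", (data: Buffer) => {
    viewerToRunnerBytes += data.length; // input events: tiny by comparison
    runner.send(data);
  });
  viewer.on("close", () => {
    console.log({
      attachLatencyMs: (firstFrameAt ?? attachedAt) - attachedAt,
      runnerToViewerBytes,
      viewerToRunnerBytes,
    });
    runner.close();
  });
}
```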
Runner Latency + Errors
| Metric | Value |
|---|---|
| Window analyzed | 3,415s (~57 min) |
| Total requests | 41 |
| 5xx errors | 1 |
| Error rate | 2.4% |
| Avg concurrent executions | 0.99 |
Latency distribution:
| Percentile | Latency |
|---|---|
| Avg | 153.8s |
| P50 | 118.2s |
| P95 | 513.2s (~8.5 min) |
What Broke (Lessons Learned)
1) "VNC in the runner service" created routing problems Reconnects and attach-later were unreliable. We stopped fighting this and built the relay.
2) Fluxbox was unnecessary complexity It's great if you want a desktop. We didn't. Removing it improved determinism.
3) Bandwidth is the real capacity question CPU looked fine; bandwidth/session is what we're measuring next.
4) The web is many realms, not one page iframe/shadow-root mis-targeting is where many "works in demo, fails in prod" bugs live. Live view made these obvious.
Why This Matters Beyond Debugging
This is the productivity multiplier we care about:
- Give rtrvr a list of URLs + one sentence describing output fields
- Spin up dozens of cloud browsers in parallel
- Return a structured dataset in minutes
- Let users watch and take over when needed
Open Questions
If you've shipped similar "live attach" systems:
- What replaced VNC for you (WebRTC? custom streaming?) and what was the biggest win?
- Best approach for recording/replay of agent runs without insane storage?
- Any gotchas with pairing/rendezvous services at high scale?
References
rtrvr.ai
Cloud Run
- Max concurrent requests / concurrency
- Configure concurrency
- Session affinity
- WebSockets guidance
- Request timeouts
- Max instances
