Browser Use Is a Robotics Problem

Why AI can one-shot an entire app but can't click a dropdown

Tracy Chou posted the receipts. A log of Claude trying to create a Google Calendar event: take screenshot, find the date, click it, type the event name, try to uncheck "All day," fumble with a time picker dropdown, scroll too far, scroll back up, hunt for 10:00am, click the wrong thing, recover. Eleven agonizing steps to do what a human does in four seconds without thinking.

Meanwhile, the same model can architect an entire functioning application from scratch in a single pass. Complex routing, database schemas, authentication flows, deployment configs — all correct, all coherent, all on the first try.

Everyone notices this gap. Nobody seems to ask the obvious question: what kind of problem produces this specific error signature?

* * *

In Browser Agents Aren't the Future, I argued that forcing a linguistic intelligence through a visual interface is like making a geometer use surveying instruments. That framing was true but incomplete. It explained why browser use felt wrong but not why it's so persistently bad while everything else improves so fast.

The missing piece came from a conversation about what happens when an AI looks at a screenshot. It turns out there are two completely different things you can do with an image, and they're as different as reading a book and juggling.

When an AI looks at a chart and extracts "revenue grew 40% in Q3," it's doing symbolic work. The image is a container for information. You extract the meaning, convert it to language, and then you're back in your native medium for all the actual reasoning. One translation, then home. AI is excellent at this.

When an AI looks at a screenshot and has to click a specific button at specific coordinates, it's doing something entirely different. The image isn't a container for information — it's an environment to act in. And every property of that environment matters with a precision that content understanding never requires.

If you're reading a chart and identify a bar as "roughly 340" when it's actually 347, you got the meaning. If you're clicking a button and target (340, 182) when it's at (347, 182), you clicked the wrong element entirely.

That's not a difference of degree. It's a difference of kind. One is perception. The other is control.

* * *

Think about what happens when you click something on your phone.

You don't compute the coordinates of the button and fire your finger at them ballistically. You start moving your finger roughly toward the target. Your eyes track the finger as it approaches. You make continuous micro-adjustments. You tap when you see alignment. The precision doesn't come from calculation. It comes from the feedback loop being fast and continuous.

An AI doing browser use gets none of this. It takes a screenshot — a frozen instant. It reasons about coordinates. It fires blind. Then it takes another screenshot to see what happened. It's the difference between driving by looking through the windshield and driving by taking a photo, closing your eyes, turning the wheel some amount, then taking another photo to see where you ended up.

You could get better at the photo-based version with practice. But it would never be natural and it would always be brittle, because the problem isn't your driving ability. It's the information architecture.
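
The gap is easier to see written down. Here is a minimal sketch, with hypothetical helper functions rather than any real agent's code, contrasting the two styles: an open-loop click that estimates once and fires, and a closed-loop click that keeps correcting against fresh feedback.

```python
import math

def ballistic_click(estimate_target, fire):
    """Open loop: one frozen observation, then fire blind.
    Roughly the screenshot-then-act pattern described above."""
    x, y = estimate_target()   # single estimate from a single snapshot
    fire(x, y)                 # no chance to correct before contact

def closed_loop_click(observe_offset, nudge, tap, tolerance=2.0, gain=0.5):
    """Closed loop: observe, correct, observe again. Roughly how a finger lands.
    observe_offset() returns (dx, dy) from the effector to the target."""
    dx, dy = observe_offset()
    while math.hypot(dx, dy) > tolerance:
        nudge(gain * dx, gain * dy)   # small proportional correction
        dx, dy = observe_offset()     # fresh feedback every step
    tap()                             # commit only once aligned
```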

* * *

This is where it gets structural.

A transformer — the architecture underneath every major language model — is fundamentally a discrete, asynchronous system. Tokens in, tokens out. There is no "during." There's no process running between input and output that's perceiving and adjusting. There's a single forward pass that produces a complete output, and then the system is inert until the next input arrives.

There is literally no place in the architecture for continuous feedback to happen.

You can simulate it. Run inference in a tight loop — screenshot, reason, act, screenshot, reason, act. This is what browser agents do. But you're simulating continuity through rapid discrete steps, and each step is a full forward pass through a massive model. Real time, real money, real latency, all to approximate something a human nervous system does with negligible conscious effort.
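
As a sketch, the simulation is just a loop of discrete forward passes. Every name below is a hypothetical stand-in for whatever a real agent harness provides; the point is the shape: one frozen frame per step, one expensive model call per frame, and no feedback in between.

```python
def run_browser_agent(model, take_screenshot, execute, goal, max_steps=25):
    """Hypothetical sketch of the discrete screenshot-reason-act loop.
    model.decide() stands in for a full forward pass through a large model;
    take_screenshot() and execute() stand in for the browser harness."""
    history = []
    for _ in range(max_steps):
        frame = take_screenshot()                      # one frozen instant
        action = model.decide(goal, frame, history)    # expensive, discrete
        if action.kind == "done":                      # model declares success
            return True
        execute(action)            # fire blind: click, type, or scroll
        history.append(action)     # no feedback until the next screenshot
    return False                   # step budget exhausted
```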

This is different from other AI limitations. When Claude makes a logic error or hallucinates a fact, those are failures within the space the architecture was designed for — failures of the discrete reasoning process that transformers are built to do. They're improvable through normal means: better training, more data, refined techniques. But "Claude can't do continuous sensorimotor control" isn't a failure of the architecture. It's an absence. The mechanism doesn't exist.

There are intelligence problems and there are control problems. Language models are very good at intelligence problems. Browser use is a control problem.

* * *

Now look at Tracy Chou's log again. "Scrolled too far. Need to scroll back up." That's not a reasoning error. That's an overshoot-and-correct pattern. If you showed that log to a roboticist, they'd recognize it instantly — it's what a robot arm does when it overshoots a grasp target.

And the fumbling spiral that follows is characteristic too. Each mistake changes the environment (the dropdown is now in a different state), which makes the next perception harder (the screenshot shows something unexpected), which makes the next action more likely to fail. Correlated sequential errors cascading through a changing environment. That's not how language problems fail. That's how control problems fail.

Look at what browser use actually requires. A coordinate space. An effector that must be positioned precisely within it. A feedback loop where you observe results and adjust. Latency between intention and outcome. A physics of the environment — elements with positions, sizes, overlapping z-indices, scroll states. Things that move. Dropdowns that occlude other elements. Hover states that change what's visible.

That's just robotics through a pane of glass.
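
To make that list concrete, here is the same "physics" written as a data structure. It is an illustrative sketch, not any browser's actual model, and the field names are invented, but each one maps to a requirement above.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    x: float                  # position in the viewport's coordinate space
    y: float
    width: float
    height: float
    z_index: int = 0          # dropdowns and modals occlude what sits beneath
    visible: bool = True      # hover states flip this without any click

@dataclass
class Viewport:
    scroll_y: float = 0.0     # where the overshoot in the log lives
    elements: list[Element] = field(default_factory=list)

    def element_at(self, x: float, y: float):
        """Topmost visible element under a click: what the effector actually hits."""
        hits = [e for e in self.elements
                if e.visible
                and e.x <= x <= e.x + e.width
                and e.y <= y <= e.y + e.height]
        return max(hits, key=lambda e: e.z_index) if hits else None
```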

* * *

We've been fooled by the medium. Everyone assumes screen interaction is a software problem because it happens on a computer. But the computational structure of the task — spatial precision, continuous state, real-time feedback, error recovery in a changing environment — is the computational structure of robotics. The screen is just the glass between the robot and its workspace.

And the history of AI confirms this. Language got good fast. Image understanding got good fast. Robot manipulation remains brutally hard, even in constrained environments. The "last mile" of physical interaction has been the stubborn unsolved problem for decades. Not because nobody's working on it. Because the problem is genuinely different in kind from pattern recognition and reasoning. It requires real-time closed-loop control in environments with continuous state.

That's exactly what clicking a time picker requires. The fact that it happens behind glass rather than in meatspace is almost irrelevant to the computational challenge.

* * *

In The Text Distance, I argued that the speed at which AI transforms a domain is a function of how close that domain already is to text. Browser use is the confounding case. It looks like it should be close to text — it's on a screen, it's digital, the DOM is literally HTML. But the interaction modality is embodied. The text distance of reading a webpage is zero. The text distance of using a webpage is closer to surgery.

That's the illusion. We conflated the medium with the modality. A webpage is a text artifact that requires physical interaction to operate. Building the artifact is an intelligence problem. Operating it is a control problem. AI solved the first one years ago. The second one is the same problem robotics has been stuck on for decades — just rendered in pixels instead of atoms.

* * *

This reframe makes a prediction. If browser use is a robotics problem, then the path to solving it looks like the path to solving robotics, not the path to improving language models.

Bigger models with better vision won't crack it. You probably need continuous perception rather than discrete snapshots. Learned motor primitives rather than explicit coordinate reasoning. Some form of spatial memory that persists across actions. Maybe a small fast model running in a tight perception-action loop for the control part while a large model handles planning. These are robotics solutions, not language model solutions.
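
One way to picture that split, as a hedged sketch in which every name is hypothetical: a large model plans in language a few times per task, and a small fast policy closes the perception-action loop many times per second.

```python
import time
from dataclasses import dataclass

@dataclass
class Subgoal:
    description: str          # e.g. "open the start-time dropdown"

def run_task(planner, controller, sense, act, goal, control_hz=30):
    """Hypothetical hierarchical agent: planner(goal, observation) is a large
    model called once per task that returns a list of Subgoals;
    controller(subgoal, observation) is a small fast policy that returns a
    low-level action, or None once its subgoal is reached."""
    plan = planner(goal, sense())             # one slow, deliberate pass
    for subgoal in plan:
        while True:
            observation = sense()             # fresh frame every tick
            action = controller(subgoal, observation)
            if action is None:                # subgoal reached, move on
                break
            act(action)                       # tiny correction
            time.sleep(1.0 / control_hz)      # hold the loop at control rate
    return True
```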

And this means the browser agent companies may be fighting two hard problems simultaneously while thinking they're fighting one. They think they're solving "AI that understands interfaces." They're actually also solving "AI that has a body" — a virtual one, but a body nonetheless, with all the control problems that implies.

No wonder progress feels slower than everyone expected.

* * *

The original Browser Agents essay argued that the browser agent race is building sophisticated surveying instruments when you could just do the geometry directly. That's still true. For any task where a code path exists — an API, a database query, a programmatic interface — using the browser is the wrong approach entirely. Intelligence works better in its native medium.
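
The calendar event from the opening is the cleanest illustration. Where a code path exists, eleven fumbling UI steps collapse into a single structured request. The sketch below targets the Google Calendar API and assumes an OAuth token is already in hand; treat the field names as something to verify against the official docs rather than as gospel.

```python
import requests

def create_event(token: str, title: str, start_iso: str, end_iso: str):
    """One structured request instead of screenshots, scrolls, and dropdowns.
    Field names follow the Calendar v3 events.insert endpoint; verify against
    the official docs, since this is an illustrative sketch."""
    response = requests.post(
        "https://www.googleapis.com/calendar/v3/calendars/primary/events",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "summary": title,
            "start": {"dateTime": start_iso},   # ISO 8601 with offset
            "end": {"dateTime": end_iso},
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```

The intelligence problem, deciding what event to create, is the part the model already handles well; the request itself is pure text, which is exactly the medium the model is good in.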

But now the argument is sharper. Browser use isn't just the wrong layer. It's the wrong type of computation. It's asking an architecture that's natively good at discrete symbolic reasoning to do continuous sensorimotor control. The surveying instruments metaphor may actually understate the problem.

The honest path forward isn't to make language models better at clicking. It's to make clicking unnecessary — to rebuild the interfaces so that AI can operate in its native medium instead of ours. Every API built, every programmatic interface exposed, every workflow that lets AI work in text and code instead of pixels and coordinates is a step toward letting intelligence do what intelligence is good at, and stop asking it to be a robot.