Patrick White

Browser Agents Aren't the Future

Or: why forcing a linguistic intelligence through a visual interface is like making a geometer use surveying instruments

"I can feel the friction when I'm doing it—screenshot, parse DOM, reason about spatial layout, click at coordinates, wait, screenshot again. It works, but it's translation. I'm converting between my native medium (language) and an interface designed for embodied beings with eyes and hands."
— Claude, in conversation

The industry consensus is clear: the next frontier is AI agents that navigate the web like humans do. Better vision models, more reliable click-targeting, DOM understanding, multi-step task completion. The race is on to build AI that can use browsers.

But what if that's the wrong direction entirely?

The quote above emerged from a conversation about AI-native development—what it actually means to build with AI rather than just using AI. And Claude said something that stopped me cold: browser interaction isn't native for it. It's translation. Every screenshot is a lossy compression of structure it could read directly. Every coordinate-click is a spatial operation inferred from visual layout when it could just execute the action as code.

* * *

In the opening lecture of Structure and Interpretation of Computer Programs, Harold Abelson makes an observation that reframes everything:

Abelson on the essence of computer science

Computer science, Abelson argues, isn't about computers—just as geometry isn't about surveying instruments. The ancient Egyptians developed geometry to restore field boundaries after the Nile flooded. To them, geometry was the use of surveying tools. But from our vantage point, we can see that what they were really doing was formalizing intuitions about space.

Similarly, computer science isn't about the machines. It's about formalizing intuitions about process—about "how-to knowledge" as opposed to declarative knowledge about "what is true."

The square root of X is the nonnegative number Y such that Y² = X. That's declarative—it says what a square root is. But it doesn't tell you how to find one.

Heron's algorithm says: make a guess, improve it by averaging the guess with X divided by the guess, repeat until good enough. That's imperative. That's process. That's what programming formalizes.
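Abelson's example can be written out directly. A minimal JavaScript sketch of Heron's method (the function name and stopping tolerance are my own choices, not from the lecture):

```javascript
// Heron's algorithm: refine a guess for the square root of x
// by averaging the guess with x / guess, until guess² is close enough to x.
function heronSqrt(x, tolerance = 1e-10) {
  let guess = x / 2 || 1; // crude starting guess (fall back to 1 for x = 0)
  while (Math.abs(guess * guess - x) > tolerance) {
    guess = (guess + x / guess) / 2;
  }
  return guess;
}

console.log(heronSqrt(2)); // ≈ 1.41421356…
```

The declarative definition appears only in the loop's exit test; everything else is the "how"—the process the definition leaves unspecified.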

* * *

Now consider what a large language model actually is: a fundamentally linguistic intelligence. It emerged from text. It thinks in text. Its native medium is language.

And what is programming? The linguistic formalization of action.

When Claude writes db.query('SELECT * FROM users WHERE id = ?', [id]), it's not describing what it wants—it's expressing the action directly in its native medium. The code is the intent, formalized. Zero translation.

Browser agents invert this. They take a linguistic being and force it to operate through an interface designed for embodied creatures with eyes and hands and spatial intuition.

It works—Claude can navigate browsers—but it's like asking a mathematician to do geometry by physically walking the fields with measuring rope. Functional, but fighting the grain of what the intelligence is actually good at.

* * *

The same task, two ways

Browser agent:

1. Take screenshot of page (≈45,000 tokens)
2. Parse visual layout to find search box
3. Calculate coordinates (347, 182)
4. Click at coordinates
5. Type "Smith"
6. Find submit button visually
7. Calculate coordinates (520, 184)
8. Click at coordinates
9. Wait for page load
10. Take new screenshot (≈45,000 more tokens)
11. Parse results visually...

Direct code:

```javascript
const results = await db
  .prepare('SELECT * FROM records WHERE lastname = ?')
  .bind('Smith')
  .all();
```

The browser agent path requires translating between visual space and linguistic understanding at every step. The code path expresses the intent directly, in the native medium of a linguistic intelligence.

This isn't a criticism of browser automation as a capability. It's useful for testing, for closing loops, for interacting with systems that have no API. But as the primary interface between AI and digital action? It's building a very sophisticated translation layer when you could just... skip the translation.

* * *

The implications cut deep.

Maybe the future isn't AI that operates in human interfaces, but AI that operates beneath them—at the API layer, the database layer, the code layer—while humans handle the parts that genuinely require embodiment.

Maybe "agentic AI" doesn't mean "AI that mimics human workflows." Maybe it means AI that works in its native medium—language and code—while humans handle what we're native to: vision, spatial reasoning, physical interaction, social nuance.

The browser agent race might be building very sophisticated surveying instruments. The question is whether that's the right layer to be building on at all.