The Controlled Experiment
How the smartest model ever built proved that computer use isn't an intelligence problem
Anthropic's Claude Mythos Preview can find vulnerabilities in Firefox's source code that no human has ever found. It solved every cybersecurity CTF challenge in the Cybench benchmark—100%, all trials, no failures. It autonomously discovered thousands of zero-day vulnerabilities across every major operating system and every major web browser. Anthropic considered it too dangerous to release publicly.
Mythos cannot reliably open Firefox and click a bookmark.
These two facts are not in tension. They are the same fact, observed from different angles. And together they constitute the cleanest empirical evidence we've ever had for a claim I've been making since January: computer use is a robotics problem, not an intelligence problem, and no amount of scaling will fix it.
Here are the benchmark gains from Opus 4.6 to Mythos Preview:
| Benchmark | Opus 4.6 | Mythos | Gain (pts) |
|---|---|---|---|
| SWE-bench Pro | 53.4% | 77.8% | +24.4 |
| CyberGym | 66.6% | 83.1% | +16.5 |
| Terminal-Bench 2.0 | 65.4% | 82.0% | +16.6 |
| SWE-bench Verified | 80.8% | 93.9% | +13.1 |
| OSWorld | 72.7% | 79.6% | +6.9 |
Every symbolic benchmark jumped by double digits; on SWE-bench Pro, the error rate fell by more than half. Cybench went to 100%. USAMO hit 97.6%. GPQA Diamond, the graduate-level science benchmark, hit 94.6%. On seventeen of the eighteen benchmarks measured, Mythos is the highest-scoring model on record.
And OSWorld—the benchmark that tests whether the model can interact with a graphical desktop environment, take screenshots, reason about what it sees, and click the right elements—gained less than seven points.
On a model that is dramatically smarter at everything else.
This is why the numbers matter so much. Mythos is an accidental controlled experiment.
Same architecture. Same company. Same training approach, scaled further. The only variable is capability—raw intelligence, cranked as high as it's ever been cranked. Everything else is held constant. When you do this and the symbolic benchmarks leap to the ceiling while the computer use benchmark barely moves, you've isolated the variable. You've demonstrated, empirically, that computer use performance is not primarily a function of intelligence.
If it were an intelligence problem, Mythos would have crushed it the way it crushed everything else. The model that solves every CTF challenge ever written would find “crop an image in GIMP” trivially easy. The model that scores 97.6% on olympiad mathematics would handle “format a cell in LibreOffice Calc” without breaking a sweat.
But it doesn't. Because the bottleneck in OSWorld was never knowledge or reasoning. It's the screenshot-reason-click loop: take a frozen image, infer coordinates, fire blind, take another screenshot to see what happened. That loop is continuous sensorimotor control—a robotics computation—and making the model smarter doesn't help for the same reason that making a calculator faster doesn't help it walk across a room. The problem isn't in the reasoning. It's in the type of computation being attempted.
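That loop is easy to state precisely. Here is a minimal sketch of it; every name is illustrative (no real agent framework, vision API, or desktop environment is assumed), and the "environment" is a stub standing in for a live screen:

```python
# A sketch of the screenshot-reason-click loop described above.
# All names are illustrative; the env dict is a stand-in for a real desktop.

def take_screenshot(env):
    """Freeze the desktop into a single static image; all motion is lost."""
    return env["frame"]

def infer_click(model, screenshot, goal):
    """The model maps one frozen image to (x, y) coordinates, open loop."""
    return model(screenshot, goal)  # stand-in for a vision-language model call

def run_task(env, model, goal, max_steps=10):
    """Observe, infer, act, re-observe. Between infer_click and the next
    screenshot the agent is blind: there is no continuous feedback,
    which is the crux of the argument above."""
    for _ in range(max_steps):
        shot = take_screenshot(env)
        x, y = infer_click(model, shot, goal)
        env["clicks"].append((x, y))   # fire blind
        if env["done"](env):           # only the NEXT frame reveals the effect
            return True
    return False

# Toy demo: the task succeeds only if the (hypothetical) bookmark at
# (240, 64) is actually hit.
env = {"frame": "desktop.png", "clicks": [],
       "done": lambda e: (240, 64) in e["clicks"]}
run_task(env, lambda shot, goal: (240, 64), "open the bookmark")
```

Notice what the sketch makes visible: intelligence lives entirely inside `infer_click`, but the failure mode lives in the loop around it, where every click is committed before its effect can be observed.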
Look at the gain differentials again, but this time through the lens of text distance—how close each benchmark's task is to pure text.
SWE-bench Pro is pure code. Text in, text out. Zero translation. The model reads a codebase, reasons about the bug, writes a patch. This is a linguistic intelligence operating in its native medium. Gain: +24.4 points.
CyberGym and Terminal-Bench are close to text. Exploit reproduction involves reading source code and writing payloads. Terminal interaction is command-line—text-native. Gains: +16.5 and +16.6 points.
SWE-bench Verified is also pure code but ceiling-compressed—Opus was already at 80.8%, so the remaining headroom was smaller. Gain: +13.1 points.
OSWorld requires visual perception, spatial reasoning, and coordinate-precise interaction with a graphical interface. It's the furthest from text of any benchmark in the suite. Gain: +6.9 points.
This isn't random variation in benchmark difficulty. It's a function—the same function I described in The Text Distance—now visible in a single model's improvement across its own benchmark suite.
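One objection is that the smaller OSWorld gain merely reflects differing headroom. A back-of-envelope check, using only the table above, addresses that: normalize each gain by the distance to 100%.

```python
# Fraction of remaining headroom each benchmark closed, computed from the
# Opus 4.6 -> Mythos scores in the table above.
scores = {
    "SWE-bench Pro":      (53.4, 77.8),
    "CyberGym":           (66.6, 83.1),
    "Terminal-Bench 2.0": (65.4, 82.0),
    "SWE-bench Verified": (80.8, 93.9),
    "OSWorld":            (72.7, 79.6),
}

def headroom_closed(before, after):
    """Share of the gap to 100% that the newer model closed."""
    return (after - before) / (100.0 - before)

for name, (before, after) in scores.items():
    print(f"{name:20s} {headroom_closed(before, after):.0%}")
```

By this measure the text-native benchmarks close roughly half to two-thirds of their remaining gap (SWE-bench Verified, the ceiling-compressed one, closes the most at about 68%), while OSWorld closes about a quarter. The pattern survives the normalization, so ceilings don't explain it.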
The Firefox example deserves its own moment, because it makes the distinction almost absurdly clear.
Mythos can read Firefox's C++ source code—millions of lines—and find logic errors that the humans who wrote it missed. It found vulnerabilities that have been sitting in production code for years, seen by thousands of developers, caught by nobody. On the knowledge and reasoning dimension of understanding Firefox, Mythos is superhuman. It understands the browser better than the people who built it.
And it still can't use the browser.
Same application. Two paths to interact with it. One path goes through text—source code, symbolic reasoning, logical analysis. The other path goes through pixels—screenshots, coordinate inference, mouse clicks. On the text path, Mythos is the best that has ever existed. On the pixel path, it fails one time in five at tasks an intern could do blindfolded.
This is not a paradox. This is two different kinds of computation producing two different results. The text path is intelligence. The pixel path is control. Mythos maxed out one and barely moved the other because they're not the same kind of thing, no matter how similar they look on a benchmark report card.
There's a human side to this story too. Reports have surfaced recently that researchers at the major labs are frustrated by the lack of progress on computer use. This frustration is the subjective experience of hitting the wall the Mythos numbers describe.
They're doing what smart people do: throwing more intelligence at the problem. Better vision models. More training data. Refined prompting strategies. Tighter inference loops. And the numbers inch up—6.9 points is not zero—but the returns are diminishing in a way that feels wrong. Everything else is leaping forward. Computer use is trudging.
The frustration is the feeling of pushing against a category error from inside. If you believe computer use is an intelligence problem, then more intelligence should solve it, and the fact that it doesn't is baffling. But if computer use is a control problem—if it's robotics through a pane of glass—then the frustration is exactly what you'd predict. You're applying the wrong kind of computation and watching it underperform. Of course it's frustrating. It would be like trying to solve a math problem by running faster.
In This Time Isn't Different, I argued that AI's inability to reliably navigate graphical interfaces preserves the institutional walls that slow economic transformation to a survivable pace. Legacy enterprise systems—Epic, SAP, Oracle—are protected by a technical limitation the vendors didn't build and don't understand. If browser agents worked, AI could slot into any existing system through the GUI. The bypass would open and the transformation would be sudden.
Mythos updates that argument in an important way. It's no longer just that browser agents don't work today. It's that scaling—the primary lever the industry has—doesn't fix it.
Anthropic built the most capable model in history. They pushed intelligence to a point where the model discovers zero-day vulnerabilities in production operating systems autonomously. And OSWorld went up seven points. If this isn't evidence that the problem is architectural rather than capability-limited, what would be?
Making models smarter—which is what every lab is doing, which is where all the investment is going—does not close the computer use gap at a rate anywhere close to the rate it closes every other gap. The walls will hold until someone either solves a robotics problem or builds the APIs that let intelligence operate in its native medium instead of through pixels.
Both of those take time. The robotics problem has been stubborn for decades. API buildout requires institutional cooperation from vendors whose business model depends on lock-in. Neither moves at the pace of model improvement. And that means the electrification analogy from that essay—rapid transformation of text-native domains, slow transformation of everything behind a GUI—isn't just an analogy. It's a prediction the Mythos benchmarks support with hard numbers.
The broader point is about what benchmark suites actually measure when they put different kinds of tasks on the same report card.
SWE-bench and OSWorld produce numbers that look comparable. Percentages on a scale. You can put them in a table, compare them across models, plot them on a chart. The format suggests they're measuring the same dimension—“AI capability”—at different difficulty levels.
They're not. They're measuring fundamentally different kinds of computation. SWE-bench measures symbolic reasoning in a text-native domain. OSWorld measures continuous sensorimotor control in a visual-spatial domain. One is what transformers were built to do. The other is what they architecturally cannot do—not badly, not yet, but cannot, in the way that a calculator cannot walk regardless of how fast it computes.
Putting these on the same benchmark card is like measuring a fish's swimming speed and tree-climbing ability and averaging the results. The average tells you nothing. The gap tells you everything. And Mythos, by pushing the swimming speed to the theoretical maximum while the climbing barely improved, made the gap impossible to ignore.
Nobody is talking about the Mythos benchmarks this way. The coverage is about the safety implications—should Anthropic release a model this powerful?—and about the cybersecurity applications of Project Glasswing. Those are important conversations.
But buried in the benchmark table is something more fundamental than any policy discussion. It's empirical evidence for a claim about the nature of computation: that what we call “AI capability” is not a single dimension, that scaling intelligence along the symbolic axis does not transfer to the control axis, and that the stubborn 20% failure rate on tasks like “click a dropdown” or “crop an image” is not a problem waiting for a smarter model. It's a problem waiting for a different kind of solution entirely.
The smartest model ever built proved it. The experiment is right there in the numbers. Nobody designed it as an experiment, which is what makes it so clean. No agenda, no cherry-picking, no motivated reasoning. Just a capability jump so large that the absence of improvement on one benchmark became the most informative result in the entire suite.
Mythos can exploit every browser ever built. It just can't use one.