Why it mattered
If agents are going to matter, they need more than clever browser tricks. They need evaluation regimes that map to real work, real domains, and real failure modes.
Agent evaluation
A Humanity's Last Exam for computer-use agents: it spans 63 high-GDP domains and sets a much tougher standard for useful agent performance.
Pushed the work from vague agent demos toward domain-based evaluation where competence has to cash out as real action.
Benchmark design, domain framing, and the work of turning broad real-world tasks into something measurable without flattening them into toy problems.
The quality of an evaluation determines the quality of the conversation around a system. Weak evals manufacture false confidence.