Agent evaluation

AgentHLE

Humanity's Last Exam for computer-use agents, spanning 63 high-GDP domains and setting a much tougher standard for what counts as useful agent performance.

Pushed the work away from vague agent demos and toward domain-based evaluation, where competence has to show up as real action.

Why it mattered

If agents are going to matter, they need more than clever browser tricks. They need evaluation regimes that map to real work, real domains, and real failure modes.

What I worked on

Benchmark design, domain framing, and the work of turning broad real-world tasks into something measurable without flattening them into toy problems.

What I learned

The quality of an evaluation determines the quality of the conversation around a system. Weak evals manufacture fake confidence.