Researchers developed REAL, a benchmark for evaluating autonomous web agents that comprises 11 deterministic, high-fidelity simulations of real-world websites and 112 multi-turn tasks. When frontier language models were evaluated on REAL, the best-performing agent achieved only a 41.07% success rate, indicating that current agents remain far from reliable on realistic multi-step web tasks.
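The headline figure is a simple task-level success rate. A minimal sketch of how such a number could be aggregated (the helper name and the pass/fail results below are illustrative, not from the paper; 46 of 112 passing tasks happens to yield the reported percentage):

```python
# Hypothetical sketch: aggregating per-task pass/fail outcomes into a
# headline success rate for a benchmark like REAL. The results list is
# made up for illustration, not taken from the paper.
def success_rate(results: list[bool]) -> float:
    """Fraction of tasks the agent completed successfully."""
    return sum(results) / len(results)

# e.g. 46 successes out of 112 tasks
results = [True] * 46 + [False] * 66
print(f"{success_rate(results):.2%}")  # → 41.07%
```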