| No. | Model | Score |
|---|---|---|
| 1 | Gemini 3 Deep Think (Preview) | 1.00 |
| 2 | Gemini 3 Pro | 0.92 |
| 3 | o3 | 0.82 |
| 4 | GPT-5.1 | 0.81 |
| 5 | Grok 4 | 0.77 |
| 6 | Gemini 2.5 Pro | 0.76 |
| 7 | GPT-5 | 0.75 |
| 8 | GPT-5 (high) | 0.72 |
| 9 | o3 (high) | 0.70 |
| 10 | o4-mini (high) | 0.63 |
| 11 | GPT-5 Mini | 0.61 |
| 12 | Grok 4 Fast (reasoning) | 0.59 |
| 13 | Gemini 2.5 Flash | 0.59 |
| 14 | o4-mini (medium) | 0.53 |
| 15 | | 0.52 |
| 16 | Gemini 2.0 Flash | 0.51 |
| 17 | Gemini 1.5 Pro | 0.51 |
| 18 | | 0.47 |
| 19 | o3 Mini High | 0.45 |
| 20 | GPT-5 Pro | 0.45 |
| 21 | | 0.44 |
| 22 | o4-mini (low) | 0.44 |
| 23 | Deepseek R1 0528 | 0.43 |
| 24 | Gemma 3 27B | 0.43 |
| 25 | | 0.41 |
| 26 | | 0.40 |
| 27 | GPT-4.1 | 0.40 |
| 28 | DeepSeek-R1 | 0.40 |
| 29 | GPT-4.1 Mini | 0.39 |
| 30 | o1-mini | 0.39 |
| 31 | Qwen3 235B A22B Instruct 2507 | 0.39 |
| 32 | GPT-5.1 (high) | 0.39 |
| 33 | Z.AI: GLM 4.5V | 0.39 |
| 34 | Z.AI: GLM 4.5 | 0.36 |
| 35 | Llama 4 Maverick | 0.35 |
| 36 | | 0.35 |
| 37 | Llama 4 Scout | 0.34 |
| 38 | GPT-4.1 Nano | 0.33 |
| 39 | GPT-4o-mini | 0.33 |
| 40 | Qwen2.5 72B Instruct | 0.32 |
| 41 | o4 mini | 0.32 |
| 42 | Grok 3 | 0.30 |
| 43 | | 0.30 |
| 44 | GPT 4.5 | 0.29 |
| 45 | GPT-4o | 0.28 |
| 46 | | 0.25 |
| 47 | Gemma 3 12B | 0.25 |
| 48 | Llama 3.1 70B Instruct | 0.23 |
| 49 | gpt-oss-20b | 0.23 |
| 50 | Pixtral 12B | 0.23 |
| 51 | o1 | 0.22 |
| 52 | Kimi K2 | 0.22 |
| 53 | o3 Mini | 0.21 |
| 54 | | 0.19 |
| 55 | Qwen3 14B | 0.19 |
| 56 | Tiny Recursive Model | 0.14 |
| 57 | | 0.13 |
| 58 | o3 Pro | 0.11 |
| 59 | Gemini 2.0 Flash (Thinking) | 0.11 |
| 60 | Qwen3 32B | 0.09 |
| 61 | Qwen2.5 32B Instruct | 0.09 |
| 62 | QwQ 32B | 0.01 |
| 63 | DeepSeek V3.1 | 0.01 |
| 64 | | 0.01 |
| 65 | | 0.00 |
| 66 | | 0.00 |
| 67 | Gemma 2 27B | 0.00 |

| Benchmark | Organization | Best Model |
|---|---|---|
| PHYBench | Peking University | Gemini 2.5 Pro |
| Humanity's Last Exam | Scale AI | Gemini 3 Pro |
| MMMU-Pro | Carnegie Mellon University | Gemini 3 Pro |
| CharXiv | University of Wisconsin | o3 (high) |
| Global PIQA | University of California, San Diego | Gemini 2.5 Pro |
| ARC AGI 2 | ARC Prize Foundation | Gemini 3 Deep Think (Preview) |
| QuestBench | Google DeepMind | Gemini Flash Thinking 2.0 Exp 01-21 |
| KUMO | Tsinghua University | DeepSeek-V3 |
| DeltaBench | Alibaba Group | GPT-4-turbo-128k |
| MathIF | Shanghai AI Laboratory | Qwen3-14B |
| Reasoning Gym | GitHub | G-e-en-13b-instruct |
| SPIN-Bench | The University of Texas at Austin | o1 |
| PhysReason | National University of Singapore | Deepseek-R1 |
| MME-CoT | Shanghai AI Laboratory | Kimi k1.5 |
| BWOR | Shanghai Jiao Tong University | DeepSeek-R1 |
| REASONINGWEEKLY | Charles University | OpenAI o1 |
| LLM-SRBench | Carnegie Mellon University | GPT-4o-mini |
| SPLAT | Australian Institute for Machine Learning, University of Adelaide | GPT-4 |
| BBEH | Google DeepMind | o3-mini (high) |
| KORGym | Beihang University | O3-mini |
| HellaSwag | University of Washington | Human Performance |
| PRMBench | Shanghai AI Laboratory | Human |
| ChartQA | Nanyang Technological University | VL-T5 Pretrained |
| ZebraLogic | University of Washington | o1-full |
| BABILong | AIRI | ARMT (137M) fine-tune |
| OmniSpatial | Shanghai AI Laboratory | Human |
| SocialIQA | Allen Institute for Artificial Intelligence | BERT-large |
| BIG-Bench Hard | Stanford University | Max human-rater |
| ScienceQA | UCLA | Multimodal-T-SciQ_Large |
| OlympicArena | Shanghai Artificial Intelligence Laboratory | O1 |
| REVEAL | Google DeepMind | PaLM-2-L |
| SpatialMQA | Fudan University | Human |
| OptimalThinkingBench | Carnegie Mellon University | o3 (Thinking) |
| WM-ABench | University of Michigan | Human |
| LLM4Causal | Amazon | LLM4Causal-Mixed (Llama-2 7B) |
| THINKSLM | University of Oxford | Llama3.1 70B (FP8) |
| ACPBench | IBM Research | LLAMA-3.1 405B |
| AceMath-RewardBench | NVIDIA | AceMath-72B-Instruct |
| GAMEBoT | University of Cambridge | GPT-4o |
| DROP | Allen Institute for Artificial Intelligence | Human |
| CLadder | University of Washington | GPT-4 + CAUSALCOT |
| A-I-RAVEN | Warsaw University of Technology | SCL |
| Sudoku-Bench | Sakana AI | GPT-5 High |
| ShortcutQA | Alibaba Group | Llama 3 (70B Instruct) |
| Sys2Bench | Texas A&M University | LLaMa 3.1 405B |
| gg-bench | UC Berkeley | o1 |
| Reasoning-Intensive Regression | MIT | GPT-5 MENTAT (Basic Prompt) |
| VerifyBench | Meituan Group | CompassVerifier-32B |
| MME-Reasoning | Shanghai AI Laboratory | Gemini-2.5-Pro-T |
| COLD | Indian Institute of Technology Kanpur (IIT Kanpur) | phi-2 |
| DeepTheorem | Shanghai Jiao Tong University | o3-mini |
| MM-HELIX | Shanghai AI Laboratory | GPT-5 |
| MM-Bench | Shanghai Jiao Tong University | DeepSeek-R1-671B |
| GameArena | University of California, San Diego | Claude 3.5 Sonnet |
| R-HORIZON | Fudan University | Qwen3-235B-Thinking |
| ChemCoTBench | Peking University | Claude3.7-sonnet-think |
| Unpuzzles | Google DeepMind | o3 |
| NLGraph | Xi’an Jiaotong University, University of Washington | text-davinci-003 (COT+SC) |
| MARS | HKUST | Gemma-2 9B (Fine-tuned) |
| CLRS | Google DeepMind | PGN |
| CHARTOM | ETH Zurich | gemini-2.5 |
| UGPhysics | Tsinghua University | DeepSeek-R1 |
| LongReason | University of Illinois at Urbana-Champaign | Gemini-1.5 Pro |
| LogicGame | Tsinghua University | o1-preview |
| RuleArena | University of California, Santa Barbara | o1-preview |
| VisuRiddles | Huazhong University of Science and Technology | PAVR |
| USACO | Princeton University | Human Average |
| Gravity-Bench-v1 | University of Toronto | o4-mini-high-2025-04-16 |
| CriticBench | Tsinghua University | GPT-4 |
| LINGOLY | University of Oxford | Claude Opus |
| MR-Ben | University of Cambridge | o1-preview-2024-09-12 |
| AutoLogi | Alibaba Group | Claude-3.5-sonnet |
| KOR-Bench | University of Illinois at Urbana-Champaign | O1-preview-2024-09-12 |
| GraCoRe | Harbin Institute of Technology | OpenAI o1 |
| LogiCity | University of Toronto | Oracle |
| NPHardEval | University of Michigan | Mistral-7b |
| NaturalProofs | University of Washington | BERT (P/S) +joint |
| LogiQA | Fudan University | Ceiling Performance |
| DetectiveQA | Huawei Noah’s Ark Lab | Claude 3 Opus (200k) |
| MPBench | Shanghai AI Laboratory | GPT-4o |
| QuantumTheorems | MIT | Ax-Prover |
| EnigmaEval | MIT | gpt-5-pro-2025-10-06 |
| SATBench | University of Illinois at Urbana-Champaign | o4-mini |
| PuzzleVQA | Alibaba Group | GPT-4V |
| SCoRE | Huawei Noah’s Ark Lab | o1-preview |
| SCIREAS | Yale University | GPT-5-2025-08-07 |
| PhysGym | KAUST | Gemini-2.5-pro |
| PRISM-Physics | Harvard University | GPT-5 High |
| L0-Bench | NVIDIA | Deepseek-R1 |
| Braingle Brainteaser | Georgia Institute of Technology | OpenAI o3 |
| CLEAR | Shanghai AI Laboratory | GPT-4 |
| LR²Bench | Shanghai Artificial Intelligence Laboratory | o1-preview |
| TextGames | NAIST | GPT-o3 Mini |
| ProcBench | AI Alignment Network | o1-preview |
| Reasoning Core | CNRS | gpt-5 |
| PhysUniBench | Michigan State University | GPT-o4-mini |
| RULEARN | University of Texas at Dallas | Human |
| HellaSwag-Pro | Alibaba Group | Human |
| LongReasonArena | Microsoft | o1 |
| Socratic-PRMBench | Chinese Academy of Sciences | o3-mini |
| LogiGLUE | Arizona State University | GPT-4 |
| LLM Planning Benchmark | Google DeepMind | Gemini 1.5 Pro (Our-BW, 2-shot, NL) |
| Task Structure Variations | Sun Yat-Sen University | GPT-3.5 |
| Formal Logic Deduction | Hitachi, Ltd. | T5 (fine-tuned) |
| DocPuzzle | Huawei Noah’s Ark Lab | GPT-4o-0811 |
| ReClor | National University of Singapore | Graduate Students |
| ControlBench | University of Illinois at Urbana-Champaign | Claude 3 Opus |
| GridPuzzle | Arizona State University | GPT-4-Turbo |
| BizBench | Kensho Technologies | GPT 4* |
| HeroBench | Skoltech | Grok-4 |
| LogiEval | Westlake University | DeepSeek R1 |
| SeqEval | University of Edinburgh | Llama-3-8B (SIT on TuluV2) |
| GLoRE | Alibaba Group | QwQ-32B |
| Chain-of-Thought Hub | University of Washington | GPT-4 |
| BRAINTEASER | Tencent AI Lab | Human |
| TMBench | Tianjin University | Gemini-2.5-Pro |
| CLR-Bench | The Hong Kong Polytechnic University | qwen2.5-32b-instruct |
| TurnaboutLLM | University of Pennsylvania | DeepSeek-V3 |
| ACADREASON | University of Michigan | OAgents |
| RUPBench | Stanford University | Llama3 8B |
| BlendQA | Tsinghua University | GPT-4o |
| bAbI | Facebook AI Research | GPT-5 (2025-08) |
| FormalML | ETH Zurich | STP |
| Verbose ListOps | UNSW | Gemini 2.5 Pro |
| IOLBENCH | University of Michigan | GPT-5 |
| Multi-Turn Puzzles | Google DeepMind | Gemini-2.5-Pro-Exp-0325 |
| ZeMPE | Stony Brook University | GPT-4 Turbo |
| Entailment Verification | University of Washington | GPT-4 |
| FaithCoT-Bench | City University of Hong Kong | GPT-4o-mini |
| Generalized Associative Recall | Beijing University of Posts and Telecommunications | GPT-4 |
| RECV | The Alan Turing Institute | GPT-4 |
| EconLogicQA | Georgia Institute of Technology | GPT-4-Turbo |
| DAG-MATH | Google DeepMind | GPT-4.1-M |
| TMGBench | Harbin Institute of Technology | Qwen3-32B |
| PuzzleWorld | Imperial College London | GPT-o3 |
| RE-IMAGINE | Microsoft | GPT-o1 |
| DRE-Bench | Shanghai Artificial Intelligence Laboratory | Claude-3.7 |
| RULEBREAKERS | The University of Sheffield | Meta-Llama-3-8B-Instruct |
| RoomSpace | The Alan Turing Institute | GPT-4 |
| Multi-LogiEval | Arizona State University | GPT-4 |
| AnaloBench | Johns Hopkins University | Human |
| Visual Abductive Reasoning | ETH Zurich | REASONER |
| FCoReBench | Indian Institute of Technology Delhi | GPT-4-Turbo |
| InfiMM-Eval | ByteDance | GPT-4V |
| ActionReasoningBench | Arizona State University | Finetuned Llama-3.1-8B |
| Connections | New York University | GPT-4-Turbo (CoT) |
| ReasoningLLMs | University of Milano Bicocca | GPT-4-0613 |
| QCBench | Wuhan University | o3 |
| THiNK | McGill University | GPT-4O |
| CLUTRR | McGill University | GAT |
| THINK-Bench | Jilin University | Grok-3-mini-beta |
| SearchBench | UC Berkeley | GPT-4 (MSMT A* Prompting) |
| CorrectBench | Griffith University | Claude 3.5-Sonnet |
| CryptoBench | Beihang University | o1 |
| PuzzLing Machines | University of Copenhagen | ChatGPT |
| StyleBench | UC Berkeley | Qwen 32B |
| SpartQA | Michigan State University | Human |
| SATQuest | Chinese Academy of Sciences | o3-mini |
| Mental-Ability Reasoning Benchmark | University of Southern California | GPT-4 |
| GVGAI-LLM | New York University | GPT-o3-mini |
| TruthQuest | Munich Center for Machine Learning (MCML) | LLaMA-3-70B |
| MATCHA | University of Illinois at Urbana-Champaign | Qwen2.5-7B |
| Compound-QA | Shanghai University of Finance and Economics | InternLM |
| TurnBench-MS | University of New South Wales | gpt-o4-mini-high |
| CogniLoad | University of Oslo | gpt-5-2025-08-07 |
| R2PE | HKUST | text-davinci-003 |
| MastermindEval | Humboldt-Universität zu Berlin | o3-mini |
| WikiWhy | University of California, Santa Barbara | GPT-3 (davinci-002) |
| DimEval | Fudan University | DimPerc |
| KnotGym | Cornell University | DreamerV3 |
| MatSciBench | UCLA | Gemini-2.5-Pro |
| PhysicsEval | Islamic University of Technology | Phi-4-reasoning-plus |
| ConfProBench | Jilin University | Gemini-2.5-flash |
| SMART-101 | Mitsubishi Electric Research Labs | Second Grader |
| TCP | University of Cambridge | o4-mini |
| True Detective | University of Tartu | GPT-4 |
| BoardgameQA | Google Research | PaLM 62B (Prompt-tuned w/ CoT) |
| AQA-Bench | University of Edinburgh | O1-Preview |
| rsbench | University of Edinburgh | CLIP |
| STEM-POM | University of Illinois at Urbana-Champaign | GPT-4o |
| Karp Dataset | Worcester Polytechnic Institute | Strawberry |
| MCR Benchmark | University of Toronto | GPT-4 (gpt-4-0613) |
| DIA-Bench | University of Oslo | ChatGPT-4o |
| PuzzlePlex | New York University | Custom |
| DivLogicEval | Fudan University | OpenAI o1-preview (o1-preview-2024-09-12) |
| LatEval | Tsinghua University | GPT-4 |
| Plausible Distractors | University of Manchester | longformer |
| DSR-Bench | Stanford University | GPT-5 (med) |
| LogicPrpBank | University of Pittsburgh | BERT-base (110M) |
| CODAH | Northwestern University | Human |
| Com² | Fudan University | GPT-4o |
| State Tracking | Cardiff University | GPT 4o |
| Pretty-CLEVR | Google DeepMind | Recurrent Relational Network |
| XCOPA | University of Cambridge | Human |
| ALERT | Meta | OPT-CoT 13B |
| Comparative Reasoning Benchmark | University of Notre Dame | T5 + CMP |
| CHECK-MAT | Lomonosov Moscow State University | OpenAI o4-mini |
| Giraffe Bench | Zhejiang University | Vicuna-13B |
| MASLegalBench | Tsinghua University | Llama3.1-8B-Instruct |
| MskQA and MskCal | i’s Factory Corporation, Ltd. | GPT-4o |
| RiddleBench | | GPT-oss-120B |
| Concept-Reversed Winograd Schema Challenge | Zhejiang University | Llama-3.1 |
| Abstract Reasoning Benchmark | Xiamen University | AGENTCHAT(AUTOGEN) |
| seqBench | Capital One | GPT-5 |
| V-LoL | TU Darmstadt | αILP |
| CHBench | Renmin University of China | Llama-3.1-70b |
| KANDY | University of Pisa | Aleph (Nat-Mid) |
| RAVEN-FAIR | Tel Aviv University | MRNet (MC) |
| SPaRC | University of Göttingen | Human |
| Fermi Problems | Max Planck Institute for Intelligent Systems | T5 (FT both) |
| STREET | Northwestern University | GPT-3 davinci (few-shot) |
| RobustLR | University of Southern California | Human |
| RRIP | Beijing University of Technology | GPT-3.5 |
| SOP-Maze | Nanjing University | DeepSeek-V3.1-Thinking |
| FeasibilityQA | Arizona State University | GPT-3 (text-davinci-002) |
| TRIP | Michigan State University | RoBERTa |
| AlgoSimBench | The University of Texas at Austin | o3-mini-medium |
| MultiZebraLogic | The Alexandra Institute | o3-mini |
| oLMpics | Tel Aviv University | RoBERTa-L |
| ImplicitRelations | Allen Institute for Artificial Intelligence | Davinci |
| ZsLR | Xi’an Jiaotong University | TaCo |
| RepublicQA | Huazhong University of Science and Technology | GPT-4o |
| HardcoreLogic | University of Southern California | GPT-5 |
| Cross-Platform LLM Benchmark | Universidad Pontificia Comillas | Hermes-4-70B |
| TRAC | Sun Yat-Sen University | GPT-2-small |
| HATS | University of Pennsylvania | Gemini 3 Pro Preview (2025-11-18) |
| RegexPSPACE | Yonsei University | gpt-oss-low |
| Analogical Reasoning Test | University of Cambridge | ChatGPT (Categorical) |