Reasoning | alphaXiv

Complex problem solving

228 datasets

Model Leaderboard

No.	Model	Score
1	Gemini 3 Deep Think (Preview)	1.00
2	Gemini 3 Pro	0.92
3	o3	0.82
4	GPT-5.1	0.81
5	Grok 4	0.77
6	Gemini 2.5 Pro	0.76
7	GPT-5	0.75
8	GPT-5 (high)	0.72
9	o3 (high)	0.70
10	o4-mini (high)	0.63
11	GPT-5 Mini	0.61
12	Grok 4 Fast (reasoning)	0.59
13	Gemini 2.5 Flash	0.59
14	o4-mini (medium)	0.53
15	Claude Sonnet 4.5	0.52
16	Gemini 2.0 Flash	0.51
17	Gemini 1.5 Pro	0.51
18	Claude 3.5 Sonnet	0.47
19	o3 Mini High	0.45
20	GPT-5 Pro	0.45
21	Claude Sonnet 4	0.44
22	o4-mini (low)	0.44
23	Deepseek R1 0528	0.43
24	Gemma 3 27B	0.43
25	Claude Opus 4	0.41
26	Claude Haiku 4.5	0.40
27	GPT-4.1	0.40
28	DeepSeek-R1	0.40
29	GPT-4.1 Mini	0.39
30	o1-mini	0.39
31	Qwen3 235B A22B Instruct 2507	0.39
32	GPT-5.1 (high)	0.39
33	Z.AI: GLM 4.5V	0.39
34	Z.AI: GLM 4.5	0.36
35	Llama 4 Maverick	0.35
36	Claude 3.7 Sonnet	0.35
37	Llama 4 Scout	0.34
38	GPT-4.1 Nano	0.33
39	GPT-4o-mini	0.33
40	Qwen2.5 72B Instruct	0.32
41	o4 mini	0.32
42	Grok 3	0.30
43	Claude 4.5 Sonnet (thinking)	0.30
44	GPT 4.5	0.29
45	GPT-4o	0.28
46	Claude Opus 4.1	0.25
47	Gemma 3 12B	0.25
48	Llama 3.1 70B Instruct	0.23
49	gpt-oss-20b	0.23
50	Pixtral 12B	0.23
51	o1	0.22
52	Kimi K2	0.22
53	o3 Mini	0.21
54	Claude Opus 4 (thinking)	0.19
55	Qwen3 14B	0.19
56	Tiny Recursive Model	0.14
57	Claude Sonnet 4 (thinking)	0.13
58	o3 Pro	0.11
59	Gemini 2.0 Flash (Thinking)	0.11
60	Qwen3 32B	0.09
61	Qwen2.5 32B Instruct	0.09
62	QwQ 32B	0.01
63	DeepSeek V3.1	0.01
64	Claude 3 Sonnet	0.01
65	Claude 3 Haiku	0.00
66	Claude 3.7 Sonnet (thinking)	0.00
67	Gemma 2 27B	0.00

Name	Organization	Best Model
PHYBench	Peking University	Gemini 2.5 Pro
Humanity's Last Exam	Scale AI	Gemini 3 Pro
MMMU-Pro	Carnegie Mellon University	Gemini 3.0 Pro
CharXiv	University of Wisconsin	o3 (high)
Global PIQA	University of California, San Diego	Gemini 2.5 Pro
ARC AGI 2	ARC Prize Foundation	Gemini 3 Deep Think (Preview)
QuestBench	Google DeepMind	Gemini Flash Thinking 2.0 Exp 01-21
KUMO	Tsinghua University	DeepSeek-V3
DeltaBench	Alibaba Group	GPT-4-turbo-128k
MathIF	Shanghai AI Laboratory	Qwen3-14B
Reasoning Gym	GitHub	G-e-en-13b-instruct
SPIN-Bench	The University of Texas at Austin	o1
PhysReason	National University of Singapore	Deepseek-R1
MME-CoT	Shanghai AI Laboratory	Kimi k1.5
BWOR	Shanghai Jiao Tong University	DeepSeek-R1
REASONINGWEEKLY	Charles University	OpenAI o1
LLM-SRBench	Carnegie Mellon University	GPT-4o-mini
SPLAT	Australian Institute for Machine Learning, University of Adelaide	GPT-4
BBEH	Google DeepMind	o3-mini (high)
KORGym	Beihang University	O3-mini
HellaSwag	University of Washington	Human Performance
PRMBench	Shanghai AI Laboratory	Human
ChartQA	Nanyang Technological University	VL-T5 Pretrained
ZebraLogic	University of Washington	o1-full
BABILong	AIRI	ARMT (137M) fine-tune
OmniSpatial	Shanghai AI Laboratory	Human
SocialIQA	Allen Institute for Artificial Intelligence	BERT-large
BIG-Bench Hard	Stanford University	Max human-rater
ScienceQA	UCLA	Mutimodal-T-SciQ_Large
OlympicArena	Shanghai Artificial Intelligence Laboratory	O1
REVEAL	Google DeepMind	PaLM-2-L
SpatialMQA	Fudan University	Human
OptimalThinkingBench	Carnegie Mellon University	o3 (Thinking)
WM-ABench	University of Michigan	Human
LLM4Causal	Amazon	LLM4Causal-Mixed (Llama-2 7B)
THINKSLM	University of Oxford	Llama3.1 70B (FP8)
ACPBench	IBM Research	LLAMA-3.1 405B
AceMath-RewardBench	NVIDIA	AceMath-72B-Instruct
GAMEBoT	University of Cambridge	GPT-4o
DROP	Allen Institute for Artificial Intelligence	Human
CLadder	University of Washington	GPT-4 + CAUSALCOT
A-I-RAVEN	Warsaw University of Technology	SCL
Sudoku-Bench	Sakana AI	GPT-5 High
ShortcutQA	Alibaba Group	Llama 3 (70B Instruct)
Sys2Bench	Texas A&M University	LLaMa 3.1 405B
gg-bench	UC Berkeley	o1
Reasoning-Intensive Regression	MIT	GPT-5 MENTAT (Basic Prompt)
VerifyBench	Meituan Group	CompassVerifier-32B
MME-Reasoning	Shanghai AI Laboratory	Gemini-2.5-Pro-T
COLD	Indian Institute of Technology Kanpur (IIT Kanpur)	phi-2
DeepTheorem	Shanghai Jiao Tong University	o3-mini
MM-HELIX	Shanghai AI Laboratory	GPT-5
MM-Bench	Shanghai Jiao Tong University	DeepSeek-R1-671B
GameArena	University of California, San Diego	Claude 3.5 Sonnet
R-HORIZON	Fudan University	Qwen3-235B-Thinking
ChemCoTBench	Peking University	Claude3.7-sonnet-think
Unpuzzles	Google DeepMind	o3
NLGraph	Xi’an Jiaotong University, University of Washington	text-davinci-003 (COT+SC)
MARS	HKUST	Gemma-2 9B (Fine-tuned)
CLRS	Google DeepMind	PGN
CHARTOM	ETH Zurich	gemini-2.5
UGPhysics	Tsinghua University	DeepSeek-R1
LongReason	University of Illinois at Urbana-Champaign	Gemini-1.5 Pro
LogicGame	Tsinghua University	o1-preview
RuleArena	University of California, Santa Barbara	o1-preview
VisuRiddles	Huazhong University of Science and Technology	PAVR
USACO	Princeton University	Human Average
Gravity-Bench-v1	University of Toronto	o4-mini-high-2025-04-16
CriticBench	Tsinghua University	GPT-4
LINGOLY	University of Oxford	Claude Opus
MR-Ben	University of Cambridge	o1-preview-2024-09-12
AutoLogi	Alibaba Group	Claude-3.5-sonnet
KOR-Bench	University of Illinois at Urbana-Champaign	O1-preview-2024-09-12
GraCoRe	Harbin Institute of Technology	OpenAI o1
LogiCity	University of Toronto	Oracle
NPHardEval	University of Michigan	Mistral-7b
NaturalProofs	University of Washington	BERT (P/S) +joint
LogiQA	Fudan University	Ceiling Performance
DetectiveQA	Huawei Noah’s Ark Lab	Claude 3 Opus (200k)
MPBench	Shanghai AI Laboratory	GPT-4o
QuantumTheorems	MIT	Ax-Prover
EnigmaEval	MIT	gpt-5-pro-2025-10-06
SATBench	University of Illinois at Urbana-Champaign	o4-mini
PuzzleVQA	Alibaba Group	GPT-4V
SCoRE	Huawei Noah’s Ark Lab	o1-preview
SCIREAS	Yale University	GPT-5-2025-08-07
PhysGym	KAUST	Gemini-2.5-pro
PRISM-Physics	Harvard University	GPT-5 High
L0-Bench	NVIDIA	Deepseek-R1
Braingle Brainteaser	Georgia Institute of Technology	OpenAI o3
CLEAR	Shanghai AI Laboratory	GPT-4
LR²Bench	Shanghai Artificial Intelligence Laboratory	o1-preview
TextGames	NAIST	GPT-o3 Mini
ProcBench	AI Alignment Network	o1-preview
Reasoning Core	CNRS	gpt-5
PhysUniBench	Michigan State University	GPT-o4-mini
RULEARN	University of Texas at Dallas	Human
HellaSwag-Pro	Alibaba Group	Human
LongReasonArena	Microsoft	o1
Socratic-PRMBench	Chinese Academy of Sciences	o3-mini
LogiGLUE	Arizona State University	GPT-4
LLM Planning Benchmark	Google DeepMind	Gemini 1.5 Pro (Our-BW, 2-shot, NL)
Task Structure Variations	Sun Yat-Sen University	GPT-3.5
Formal Logic Deduction	Hitachi, Ltd.	T5 (fine-tuned)
DocPuzzle	Huawei Noah’s Ark Lab	GPT-4o-0811
ReClor	National University of Singapore	Graduate Students
ControlBench	University of Illinois at Urbana-Champaign	Claude 3 Opus
GridPuzzle	Arizona State University	GPT-4-Turbo
BizBench	Kensho Technologies	GPT 4*
HeroBench	Skoltech	Grok-4
LogiEval	Westlake University	DeepSeek R1
SeqEval	University of Edinburgh	Llama-3-8B (SIT on TuluV2)
GLoRE	Alibaba Group	QwQ-32B
Chain-of-Thought Hub	University of Washington	GPT-4
BRAINTEASER	Tencent AI Lab	Human
TMBench	Tianjin University	Gemini-2.5-Pro
CLR-Bench	The Hong Kong Polytechnic University	qwen2.5-32b-instruct
TurnaboutLLM	University of Pennsylvania	DeepSeek-V3
ACADREASON	University of Michigan	OAgents
RUPBench	Stanford University	Llama3 8B
BlendQA	Tsinghua University	GPT-4o
bAbI	Facebook AI Research	GPT-5 (2025-08)
FormalML	ETH Zurich	STP
Verbose ListOps	UNSW	Gemini 2.5 Pro
IOLBENCH	University of Michigan	GPT-5
Multi-Turn Puzzles	Google DeepMind	Gemini-2.5-Pro-Exp-0325
ZeMPE	Stony Brook University	GPT-4 Turbo
Entailment Verification	University of Washington	GPT-4
FaithCoT-Bench	City University of Hong Kong	GPT-4o-mini
Generalized Associative Recall	Beijing University of Posts and Telecommunications	GPT-4
RECV	The Alan Turing Institute	GPT-4
EconLogicQA	Georgia Institute of Technology	GPT-4-Turbo
DAG-MATH	Google DeepMind	GPT-4.1-M
TMGBench	Harbin Institute of Technology	Qwen3-32B
PuzzleWorld	Imperial College London	GPT-o3
RE-IMAGINE	Microsoft	GPT-o1
DRE-Bench	Shanghai Artificial Intelligence Laboratory	Claude-3.7
RULEBREAKERS	The University of Sheffield	Meta-Llama-3-8B-Instruct
RoomSpace	Alan Turing Institute	GPT-4
Multi-LogiEval	Arizona State University	GPT-4
AnaloBench	Johns Hopkins University	Human
Visual Abductive Reasoning	ETH Zurich	REASONER
FCoReBench	Indian Institute of Technology Delhi	GPT-4-Turbo
InfiMM-Eval	ByteDance	GPT-4V
ActionReasoningBench	Arizona State University	Finetuned Llama-3.1-8B
Connections	New York University	GPT-4-Turbo (CoT)
ReasoningLLMs	University of Milano Bicocca	GPT-4-0613
QCBench	Wuhan University	o3
THiNK	McGill University	GPT-4O
CLUTRR	McGill University	GAT
THINK-Bench	Jilin University	Grok-3-mini-beta
SearchBench	UC Berkeley	GPT-4 (MSMT A* Prompting)
CorrectBench	Griffith University	Claude 3.5-Sonnet
CryptoBench	Beihang University	o1
PuzzLing Machines	University of Copenhagen	ChatGPT
StyleBench	UC Berkeley	Qwen 32B
SpartQA	Michigan State University	Human
SATQuest	Chinese Academy of Sciences	o3-mini
Mental-Ability Reasoning Benchmark	University of Southern California	GPT-4
GVGAI-LLM	New York University	GPT-o3-mini
TruthQuest	Munich Center for Machine Learning (MCML)	LLaMA-3-70B
MATCHA	University of Illinois at Urbana-Champaign	Qwen2.5-7B
Compound-QA	Shanghai University of Finance and Economics	InternLM
TurnBench-MS	University of New South Wales	gpt-o4-mini-high
CogniLoad	University of Oslo	gpt-5-2025-08-07
R2PE	HKUST	text-davinci-003
MastermindEval	Humboldt-Universität zu Berlin	o3-mini
WikiWhy	University of California, Santa Barbara	GPT-3 (davinci-002)
DimEval	Fudan University	DimPerc
KnotGym	Cornell University	DreamerV3
MatSciBench	UCLA	Gemini-2.5-Pro
PhysicsEval	Islamic University of Technology	Phi-4-reasoning-plus
ConfProBench	Jilin University	Gemini-2.5-flash
SMART-101	Mitsubishi Electric Research Labs	Second Grader
TCP	University of Cambridge	o4-mini
True Detective	University of Tartu	GPT-4
BoardgameQA	Google Research	PaLM 62B (Prompt-tuned w/ CoT)
AQA-Bench	University of Edinburgh	O1-Preview
rsbench	University of Edinburgh	CLIP
STEM-POM	University of Illinois at Urbana-Champaign	GPT-4o
Karp Dataset	Worcester Polytechnic Institute	Strawberry
MCR Benchmark	University of Toronto	GPT-4 (gpt-4-0613)
DIA-Bench	University of Oslo	ChatGPT-4o
PuzzlePlex	New York University	Custom
DivLogicEval	Fudan University	OpenAI o1-preview (o1-preview-2024-09-12)
LatEval	Tsinghua University	GPT-4
Plausible Distractors	University of Manchester	longformer
DSR-Bench	Stanford University	GPT-5 (med)
LogicPrpBank	University of Pittsburgh	BERT-base (110M)
CODAH	Northwestern University	Human
Com²	Fudan University	GPT-4o
State Tracking	Cardiff University	GPT 4o
Pretty-CLEVR	Google DeepMind	Recurrent Relational Network
XCOPA	University of Cambridge	Human
ALERT	Meta	OPT-CoT 13B
Comparative Reasoning Benchmark	University of Notre Dame	T5 + CMP
CHECK-MAT	Lomonosov Moscow State University	OpenAI o4-mini
Giraffe Bench	Zhejiang University	Vicuna-13B
MASLegalBench	Tsinghua University	Llama3.1-8B-Instruct
MskQA and MskCal	i’s Factory Corporation, Ltd.	GPT-4o
RiddleBench	Google	GPT-oss-120B
Concept-Reversed Winograd Schema Challenge	Zhejiang University	Llama-3.1
Abstract Reasoning Benchmark	Xiamen University	AGENTCHAT(AUTOGEN)
seqBench	Capital One	GPT-5
V-LoL	TU Darmstadt	αILP
CHBench	Renmin University of China	Llama-3.1-70b
KANDY	University of Pisa	Aleph (Nat-Mid)
RAVEN-FAIR	Tel Aviv University	MRNet (MC)
SPaRC	University of Göttingen	Human
Fermi Problems	Max Planck Institute for Intelligent Systems	T5 (FT both)
STREET	Northwestern University	GPT-3 davinci (few-shot)
RobustLR	University of Southern California	Human
RRIP	Beijing University of Technology	GPT-3.5
SOP-Maze	Nanjing University	DeepSeek-V3.1-Thinking
FeasibilityQA	Arizona State University	GPT-3 (text-davinci-002)
TRIP	Michigan State University	RoBERTa
AlgoSimBench	The University of Texas at Austin	o3-mini-medium
MultiZebraLogic	The Alexandra Institute	o3-mini
oLMpics	Tel Aviv University	RoBERTa-L
ImplicitRelations	Allen Institute for Artificial Intelligence	Davinci
ZsLR	Xi’an Jiaotong University	TaCo
RepublicQA	Huazhong University of Science and Technology	GPT-4o
HardcoreLogic	University of Southern California	GPT-5
Cross-Platform LLM Benchmark	Universidad Pontificia Comillas	Hermes-4-70B
TRAC	Sun Yat-Sen University	GPT-2-small
HATS	University of Pennsylvania	Gemini 3 Pro Preview (2025-11-18)
RegexPSPACE	Yonsei University	gpt-oss-low
Analogical Reasoning Test	University of Cambridge	ChatGPT (Categorical)
