Reasoning
Complex problem solving
228 datasets
Model Leaderboard
No. | Model | Score
1 | Gemini 3 Deep Think (Preview) | 1.00
2 | Gemini 3 Pro | 0.92
3 | o3 | 0.82
4 | GPT-5.1 | 0.81
5 | Grok 4 | 0.77
6 | Gemini 2.5 Pro | 0.76
7 | GPT-5 | 0.75
8 | GPT-5 (high) | 0.72
9 | o3 (high) | 0.70
10 | o4-mini (high) | 0.63
11 | GPT-5 Mini | 0.61
12 | Grok 4 Fast (reasoning) | 0.59
13 | Gemini 2.5 Flash | 0.59
14 | o4-mini (medium) | 0.53
15 | Claude Sonnet 4.5 | 0.52
16 | Gemini 2.0 Flash | 0.51
17 | Gemini 1.5 Pro | 0.51
18 | Claude 3.5 Sonnet | 0.47
19 | o3 Mini High | 0.45
20 | GPT-5 Pro | 0.45
21 | Claude Sonnet 4 | 0.44
22 | o4-mini (low) | 0.44
23 | Deepseek R1 0528 | 0.43
24 | Gemma 3 27B | 0.43
25 | Claude Opus 4 | 0.41
26 | Claude Haiku 4.5 | 0.40
27 | GPT-4.1 | 0.40
28 | DeepSeek-R1 | 0.40
29 | GPT-4.1 Mini | 0.39
30 | o1-mini | 0.39
31 | Qwen3 235B A22B Instruct 2507 | 0.39
32 | GPT-5.1 (high) | 0.39
33 | Z.AI: GLM 4.5V | 0.39
34 | Z.AI: GLM 4.5 | 0.36
35 | Llama 4 Maverick | 0.35
36 | Claude 3.7 Sonnet | 0.35
37 | Llama 4 Scout | 0.34
38 | GPT-4.1 Nano | 0.33
39 | GPT-4o-mini | 0.33
40 | Qwen2.5 72B Instruct | 0.32
41 | o4 mini | 0.32
42 | Grok 3 | 0.30
43 | Claude 4.5 Sonnet (thinking) | 0.30
44 | GPT 4.5 | 0.29
45 | GPT-4o | 0.28
46 | Claude Opus 4.1 | 0.25
47 | Gemma 3 12B | 0.25
48 | Llama 3.1 70B Instruct | 0.23
49 | gpt-oss-20b | 0.23
50 | Pixtral 12B | 0.23
51 | o1 | 0.22
52 | Kimi K2 | 0.22
53 | o3 Mini | 0.21
54 | Claude Opus 4 (thinking) | 0.19
55 | Qwen3 14B | 0.19
56 | Tiny Recursive Model | 0.14
57 | Claude Sonnet 4 (thinking) | 0.13
58 | o3 Pro | 0.11
59 | Gemini 2.0 Flash (Thinking) | 0.11
60 | Qwen3 32B | 0.09
61 | Qwen2.5 32B Instruct | 0.09
62 | QwQ 32B | 0.01
63 | DeepSeek V3.1 | 0.01
64 | Claude 3 Sonnet | 0.01
65 | Claude 3 Haiku | 0.00
66 | Claude 3.7 Sonnet (thinking) | 0.00
67 | Gemma 2 27B | 0.00
Name | Best Model
PHYBench | Gemini 2.5 Pro
Humanity's Last Exam | Gemini 3 Pro
MMMU-Pro | Gemini 3.0 Pro
CharXiv | o3 (high)
Global PIQA | Gemini 2.5 Pro
ARC AGI 2 | Gemini 3 Deep Think (Preview)
QuestBench | Gemini Flash Thinking 2.0 Exp 01-21
KUMO | DeepSeek-V3
DeltaBench | GPT-4-turbo-128k
MathIF | Qwen3-14B
Reasoning Gym | G-e-en-13b-instruct
SPIN-Bench | o1
PhysReason | Deepseek-R1
MME-CoT | Kimi k1.5
BWOR | DeepSeek-R1
REASONINGWEEKLY | OpenAI o1
LLM-SRBench | GPT-4o-mini
SPLAT | GPT-4
BBEH | o3-mini (high)
KORGym | O3-mini
HellaSwag | Human Performance
PRMBench | Human
ChartQA | VL-T5 Pretrained
ZebraLogic | o1-full
BABILong | ARMT (137M) fine-tune
OmniSpatial | Human
SocialIQA | BERT-large
BIG-Bench Hard | Max human-rater
ScienceQA | Mutimodal-T-SciQ_Large
OlympicArena | O1
REVEAL | PaLM-2-L
SpatialMQA | Human
OptimalThinkingBench | o3 (Thinking)
WM-ABench | Human
LLM4Causal | LLM4Causal-Mixed (Llama-2 7B)
THINKSLM | Llama3.1 70B (FP8)
ACPBench | LLAMA-3.1 405B
AceMath-RewardBench | AceMath-72B-Instruct
GAMEBoT | GPT-4o
DROP | Human
CLadder | GPT-4 + CAUSALCOT
A-I-RAVEN | SCL
Sudoku-Bench | GPT-5 High
ShortcutQA | Llama 3 (70B Instruct)
Sys2Bench | LLaMa 3.1 405B
gg-bench | o1
Reasoning-Intensive Regression | GPT-5 MENTAT (Basic Prompt)
VerifyBench | CompassVerifier-32B
MME-Reasoning | Gemini-2.5-Pro-T
COLD | phi-2
DeepTheorem | o3-mini
MM-HELIX | GPT-5
MM-Bench | DeepSeek-R1-671B
GameArena | Claude 3.5 Sonnet
R-HORIZON | Qwen3-235B-Thinking
ChemCoTBench | Claude3.7-sonnet-think
Unpuzzles | o3
NLGraph | text-davinci-003 (COT+SC)
MARS | Gemma-2 9B (Fine-tuned)
CLRS | PGN
CHARTOM | gemini-2.5
UGPhysics | DeepSeek-R1
LongReason | Gemini-1.5 Pro
LogicGame | o1-preview
RuleArena | o1-preview
VisuRiddles | PAVR
USACO | Human Average
Gravity-Bench-v1 | o4-mini-high-2025-04-16
CriticBench | GPT-4
LINGOLY | Claude Opus
MR-Ben | o1-preview-2024-09-12
AutoLogi | Claude-3.5-sonnet
KOR-Bench | O1-preview-2024-09-12
GraCoRe | OpenAI o1
LogiCity | Oracle
NPHardEval | Mistral-7b
NaturalProofs | BERT (P/S) +joint
LogiQA | Ceiling Performance
DetectiveQA | Claude 3 Opus (200k)
MPBench | GPT-4o
QuantumTheorems | Ax-Prover
EnigmaEval | gpt-5-pro-2025-10-06
SATBench | o4-mini
PuzzleVQA | GPT-4V
SCoRE | o1-preview
SCIREAS | GPT-5-2025-08-07
PhysGym | Gemini-2.5-pro
PRISM-Physics | GPT-5 High
L0-Bench | Deepseek-R1
Braingle Brainteaser | OpenAI o3
CLEAR | GPT-4
LR²Bench | o1-preview
TextGames | GPT-o3 Mini
ProcBench | o1-preview
Reasoning Core | gpt-5
PhysUniBench | GPT-o4-mini
RULEARN | Human
HellaSwag-Pro | Human
LongReasonArena | o1
Socratic-PRMBench | o3-mini
LogiGLUE | GPT-4
LLM Planning Benchmark | Gemini 1.5 Pro (Our-BW, 2-shot, NL)
Task Structure Variations | GPT-3.5
Formal Logic Deduction | T5 (fine-tuned)
DocPuzzle | GPT-4o-0811
ReClor | Graduate Students
ControlBench | Claude 3 Opus
GridPuzzle | GPT-4-Turbo
BizBench | GPT 4*
HeroBench | Grok-4
LogiEval | DeepSeek R1
SeqEval | Llama-3-8B (SIT on TuluV2)
GLoRE | QwQ-32B
Chain-of-Thought Hub | GPT-4
BRAINTEASER | Human
TMBench | Gemini-2.5-Pro
CLR-Bench | qwen2.5-32b-instruct
TurnaboutLLM | DeepSeek-V3
ACADREASON | OAgents
RUPBench | Llama3 8B
BlendQA | GPT-4o
bAbI | GPT-5 (2025-08)
FormalML | STP
Verbose ListOps | Gemini 2.5 Pro
IOLBENCH | GPT-5
Multi-Turn Puzzles | Gemini-2.5-Pro-Exp-0325
ZeMPE | GPT-4 Turbo
Entailment Verification | GPT-4
FaithCoT-Bench | GPT-4o-mini
Generalized Associative Recall | GPT-4
RECV | GPT-4
EconLogicQA | GPT-4-Turbo
DAG-MATH | GPT-4.1-M
TMGBench | Qwen3-32B
PuzzleWorld | GPT-o3
RE-IMAGINE | GPT-o1
DRE-Bench | Claude-3.7
RULEBREAKERS | Meta-Llama-3-8B-Instruct
RoomSpace | GPT-4
Multi-LogiEval | GPT-4
AnaloBench | Human
Visual Abductive Reasoning | REASONER
FCoReBench | GPT-4-Turbo
InfiMM-Eval | GPT-4V
ActionReasoningBench | Finetuned Llama-3.1-8B
Connections | GPT-4-Turbo (CoT)
ReasoningLLMs | GPT-4-0613
QCBench | o3
THiNK | GPT-4O
CLUTRR | GAT
THINK-Bench | Grok-3-mini-beta
SearchBench | GPT-4 (MSMT A* Prompting)
CorrectBench | Claude 3.5-Sonnet
CryptoBench | o1
PuzzLing Machines | ChatGPT
StyleBench | Qwen 32B
SpartQA | Human
SATQuest | o3-mini
Mental-Ability Reasoning Benchmark | GPT-4
GVGAI-LLM | GPT-o3-mini
TruthQuest | LLaMA-3-70B
MATCHA | Qwen2.5-7B
Compound-QA | InternLM
TurnBench-MS | gpt-o4-mini-high
CogniLoad | gpt-5-2025-08-07
R2PE | text-davinci-003
MastermindEval | o3-mini
WikiWhy | GPT-3 (davinci-002)
DimEval | DimPerc
KnotGym | DreamerV3
MatSciBench | Gemini-2.5-Pro
PhysicsEval | Phi-4-reasoning-plus
ConfProBench | Gemini-2.5-flash
SMART-101 | Second Grader
TCP | o4-mini
True Detective | GPT-4
BoardgameQA | PaLM 62B (Prompt-tuned w/ CoT)
AQA-Bench | O1-Preview
rsbench | CLIP
STEM-POM | GPT-4o
Karp Dataset | Strawberry
MCR Benchmark | GPT-4 (gpt-4-0613)
DIA-Bench | ChatGPT-4o
PuzzlePlex | Custom
DivLogicEval | OpenAI o1-preview (o1-preview-2024-09-12)
LatEval | GPT-4
Plausible Distractors | longformer
DSR-Bench | GPT-5 (med)
LogicPrpBank | BERT-base (110M)
CODAH | Human
Com² | GPT-4o
State Tracking | GPT 4o
Pretty-CLEVR | Recurrent Relational Network
XCOPA | Human
ALERT | OPT-CoT 13B
Comparative Reasoning Benchmark | T5 + CMP
CHECK-MAT | OpenAI o4-mini
Giraffe Bench | Vicuna-13B
MASLegalBench | Llama3.1-8B-Instruct
MskQA and MskCal | GPT-4o
RiddleBench | GPT-oss-120B
Concept-Reversed Winograd Schema Challenge | Llama-3.1
Abstract Reasoning Benchmark | AGENTCHAT(AUTOGEN)
seqBench | GPT-5
V-LoL | αILP
CHBench | Llama-3.1-70b
KANDY | Aleph (Nat-Mid)
RAVEN-FAIR | MRNet (MC)
SPaRC | Human
Fermi Problems | T5 (FT both)
STREET | GPT-3 davinci (few-shot)
RobustLR | Human
RRIP | GPT-3.5
SOP-Maze | DeepSeek-V3.1-Thinking
FeasibilityQA | GPT-3 (text-davinci-002)
TRIP | RoBERTa
AlgoSimBench | o3-mini-medium
MultiZebraLogic | o3-mini
oLMpics | RoBERTa-L
ImplicitRelations | Davinci
ZsLR | TaCo
RepublicQA | GPT-4o
HardcoreLogic | GPT-5
Cross-Platform LLM Benchmark | Hermes-4-70B
TRAC | GPT-2-small
HATS | Gemini 3 Pro Preview (2025-11-18)
RegexPSPACE | gpt-oss-low
Analogical Reasoning Test | ChatGPT (Categorical)