Knowledge
Factual recall and domain expertise
350 datasets
last indexed 17h ago
Model Leaderboard
No. | Model | Score
1 | Gemini 3 Pro | 1.00
2 | Grok 4 | 0.60
3 | GPT-5 | 0.60
4 | Qwen3 235B A22B | 0.59
5 | Gemini 2.5 Pro | 0.58
6 | o3 | 0.56
7 | GPT-5 Mini | 0.48
8 | Claude Sonnet 4.5 | 0.32
9 | GPT-4.1 | 0.31
10 | o1 | 0.30
11 | Grok 3 | 0.28
12 | GPT-5.1 | 0.18
13 | Claude Opus 4.1 | 0.17
14 | Gemini 2.5 Flash | 0.15
15 | Claude 3.7 Sonnet | 0.12
16 | DeepSeek-R1 | 0.10
17 | GPT-4o | 0.08
18 | Claude 4.5 Sonnet (thinking) | 0.06
19 | DeepSeek-V3 | 0.02
20 | Claude Opus 4 (thinking) | 0.00
21 | Claude 3.5 Sonnet | 0.00
Name | Best Model
Humanity's Last Exam | Gemini 3 Pro
SimpleQA Verified | Gemini 3 Pro Preview
TimeQA | Human Worker
RAGBench | DeBERTA
OMGEval | GPT-4
SimpleQA | OpenAI o1-preview
SuperGPQA | gpt-5
MedXpertQA | o1
RAGTruth | GPT-4-0613
POPQA | GPT-3 davinci-003
Visual-RAG | Gemini 2.5 Pro
ESGenius | o3
ChroKnowBench | GPT4o-mini
Factcheck-Bench | Factcheck-GPT (Web)
LongFact | GPT-4-Turbo
LLM-AggreFact | Bespoke-Minicheck-7B
SciAssess | GPT-4o
ARCD | GEM
CulturalBench | o1
AGIEval | GPT-4o
SciFIBench | GPT-4o
HalluEditBench | Llama3-8B
DLAMA | arBERT
PubMedQA | GPT-4 (Medprompt)
FinDER | Claude-3.7-Sonnet
HaluBench | LYNX (70B)
BizFinBench | ChatGPT-o3
QuALITY | Human annotators
Chinese SimpleQA | o1-preview
KGQuiz | Davinci (text-davinci-003)
Wikidata5M | KEPLER-Cond
FinEval | Claude 3.5-Sonnet
FaithJudge | gemini-2.5-flash
STaRK | Claude3 Reranker
Pinocchio | GPT-3.5-Turbo
MedQA-CS | Claude 3.5 Sonnet
KGQAGen-10k | GPT-4o (w/ SP)
CRUD-RAG | GPT-4o
MedBrowseComp | O3 deepresearch
LexRAG | Qwen-2.5-72B
C-Eval | 海信星海
XRAG | GPT-4o
SecKnowledge-Eval | CyberPal.AI-Mistral
SimpleVQA | Gemini-2.0-flash
INFOSEEK | CLIP → FiD
ICD-Bench | o4-mini
A-OKVQA | GPV-2
RAG-QA Arena | GPT-4O (without CoT)
CommonsenseQA | Human
VERISCORE | GPT-4o
MLaKE | Qwen1.5-7B-Chat
FACT-AUDIT | GPT-4o
RAG-Check | GPT-4o
LawBench | GPT-4
TheoremQA | GPT-4
BoolQ | RankVicuna
ChineseEcomQA | DeepSeek-R1
WikiBigEdit | mistral-7b
MedBench | GPT-4
MedRGB | GPT-3.5
ORQA | Llama3.1-405B-I
ReasonVQA | GPT-4o
ToolQA | ReAct (GPT-3.5)
HistBench | HistAgent (gpt-4o)
FactBench | GPT-4o
KG-LLM-Bench | claude-3.5-sonnet-v2
SecBench | o1-preview
ScienceMeter | OLMO2-32B-INSTRUCT
Chinese SafetyQA | o1-preview
BMMR | Gemini-2.5-pro
DomainRAG | Baichuan2-33B-32k
SciKnowEval | o4-mini
WildHallucinations | GPT-4o
OKGQA | GPT-4o (CoT+SC)
GAOKAO-Bench | GPT-4-0314
CFinBench | Yi1.5-34B
Continual Knowledge Learning | T5-Modular
Hallucinations Leaderboard | OpenHermes-2.5-Mistral-7B
WikiContradict | Mistral-7b-inst
ICR2 | GPT-4-turbo
CFLUE | GPT-4
FELM | GPT-4
TeleQnA | GPT-4
NEPAQuAD | Gemini 1.5 Pro
YESciEval | LLaMA-3.1-8B (SFT+RL adversarial)
OpenFactCheck | GPT-4
KoLA | GPT-4
ECHOQA | OpenAI o1
SciFact | Oracle Rationale
MMKE-Bench | LLaVA-1.5 (IKE)
SUBARU | Llama3.1-8B-Instruct
ArabicMMLU | GPT-4
ChineseSimpleVQA | o1-preview (0901)
PubHealthBench | GPT-4.5
EWoK | Human
BMIKE-53 | Llama 3.1 8B
GuessArena | OpenAI-o1
MILU | GPT-4o
MEMERAG | Qwen 2.5 32B
FRANK | BERTScore Precision
HalluQA | ERNIE-Bot
KorNAT | HyperCLOVA X
InfiBench | GPT-4o-0125
FACTRBENCH | GPT4o
SeaExam and SeaBench | Gemma-2-9b-it
CAQA | Mistral-v0.3 (7B)
ORD-QA | RAG-EDA (ours)
OpenMedQA | DeepSeek-R1 with CR
MIRAGE-Bench | GPT-4o
UniKnow | QWEN-14B
FeTaQA | T5-large
AA-Omniscience | Gemini 3 Pro Preview
EverGreenQA | EG-E5
Head-to-Tail | GPT-4
KCIF | GPT-4o-2024-08-06
DefAn | GPT-3.5
TEMPREASON | TempT5 (T5-base)
WixQA | GPT-4o
HAE-RAE Bench | GPT-4
ProfBench | GPT-5 (high)
ContextualBench | SFR-RAG-9B
MULTI | expert
DebateQA | GPT-4o
ZNO-Eval | GPT-4o
CPsyExam | GPT-4
EVOUNA | Another Human
ComprehendEdit | BLIP-2 OPT
CliMedBench | ERNIE-Bot
ATEB | Gecko
GEOHALUBENCH | Gemini-2.0-flash
FACTOR | OPT-66B
MKQA | XLM-R (Gold Passages, Translate-Train)
InsQABench | GLM4 (Fine-tuned) + RAG-ReAct 9B
VND-Bench | GPT-4
ArabLegalEval | GPT-4o
Question Answering with Conflicting Contexts | Phi-3 Medium (finetuned)
RoleEval | GPT-4-0613
FinCDM | GLM-4
CHARM | Qwen-72B
FAMMA | PoT + GPT-o1
FACTORY | Qwen3
MMLU-Pro+ | O1-preview
EvoWiki | SFT + Open-book (Mistral-7B-Instruct-v0.3)
SciRerankBench | MXBAI
CLIcK | Claude 2
MedThink-Bench | MedGemma-27B
DRAGON | Qwen 2.5 32b Instruct
BioKGBench | GPT-4
CREAK | T5-3B (In-Domain)
KnowMT-Bench | GPT-4o
KnowShiftQA | o1-preview
ArXivBench | Claude-3.5-sonnet
SportQA | GPT-4(5S,CoT)
TUMLU | Claude 3.5 Sonnet
MISBENCH | Qwen2.5-14B
KaLMA | GPT-4 (temp=0.5)
LegalBench.PT | GPT-4o
OneEval | o3
X-FaKT | Llama-3-70B
MaScQA | GPT-4-CoT
AICrypto | human
KG-FPQ | Llama2-13B-Chat
K-QA | GPT-4+ICL
BnMMLU | gemini-2.0-flash
HaystackCraft | GPT-5
Pedagogy Benchmark | Gemini-2.5 Pro
RARE | Llama-3.2-90B-Vision-Instruct
FAC²E | GPT-4
SciTabQA | OmniTab
KoNET | gpt-4o-2024-05-13
KoLasSimpleQA | GPT-4o
EMPEC | GPT-4
RoMQA | seq2seq+retrieval
EvolvingQA | DPR
CAMB | Qwen3-235B-A22B-Instruct
Fakepedia | Mistral-7B-Instruct-v0.1
PROBELM | PYTHIA-2.8B
X-FACT | XLM-R large
SummEdits | Human Performance
GTSQA | SubgraphRAG (200)
TrendFact | FactISR(QwQ-32B)
DFIR-Metric | GPT-4.1
CUB | COMMAND A
FactIR | Snowflake-arctic-embed-s
RAG Hallucination Benchmark | TLM
MedREQAL | Mixtral
ScholarBench | o3-mini
Futurepedia | Qwen2-72B-Instruct
ParamBench | Llama-3.3-70B
MultiHoax | Gemini-2.0-pro-exp
AncientDoc | Gemini2.5-Pro
KBL | Mistral-7B-v0.2-Instruct
NoMIRACL | Mistral-7B
SciEx | Claude-3-opus-20240229
QuanTemp | FinQA-Roberta-Large
DQABench | GPT-4
StructFact | GPT-4o-mini
QualBench | Qwen2.5-7b-instruct
PediaBench | Qwen-MAX
ALCUNA | gpt-4o-2024-04-09
KQA Pro | BART KoPL
NEWTON | GPT-4
SECQUE | GPT-4o
GaRAGe | Nova Pro
DocTER | ChatGPT
CROLIN-MQUAKE | GPT-3.5-turbo-instruct
EarthSE | Gemini-2.5
COMPKE | GPT-3.5-TURBO
Xiezhi | GPT-4
KFinEval-Pilot | gpt-o1
Explain-Query-Test | Sonnet-3.5
Temporally Consistent Factuality Probe | GPT-J [6B]
AC-EVAL | ERNIE-Bot 4.0
MedicineQA | RagPULSE (20B)
CONNER | ChatGPT (text-davinci-003) (Zero-shot)
BertaQA | GPT-4 Turbo
Uhura | GPT-o1-preview
CLR-Fact | GPT-4o
INSEva | Doubao-1.5
FoundaBench | InternLM-123B
LLMzSzŁ | Mistral-Large-Instruct-2407
ObfusQA | DeepSeek R1
EnviroExam | deepseek-67b-chat
Polish Cultural Competency | Gemini-Exp-1206
NetEval | gpt-4
EpiK-Eval | Flan-T5-XL
AgriEval | Qwen-Plus
ArxEval | Qwen-2.5 7B
ComparisonQA | GPT-2
LegalScore | Claude Sonnet 3.5
ECKGBench | Qwen2-max
SeaEval | GPT-4 (gpt-4-0613)
HSSBench | Human
SKA-Bench | TableGPT-2
PRGB Benchmark | Gemini-2.5-pro-preview
MultiHal | Gemini 2.0 Flash
TransportationGames | GPT-4
MultifacetEval | Gemini-pro
CoLoTa | OpenAI-o1
OntoURL | LLaMA3.1-8B
DailyQA | Qwen2.5-72B-Instruct
MultiNativQA | Gemini-1.5 Flash
PretexEval | GPT-4o
LPFQA | GPT-5
NorQA | NorMistral-7B-warm
Temporal Wiki | Llama 3.1 70B
KaRR | OPT (175B)
FinEval-KR | DeepSeek-R1
BELIEF | Llama3-70B
MAGIC | GPT-4o-mini
LHMKE | Baichuan2-13B-Chat
TAXI | Human
M3KE | GPT-4
EESE | GPT-5
LLM-KG-Bench | LLMKE
BaRDa | GPT-4
CPSDBench | GPT-4
TuringQ | GPT-4
HEAD-QA | Liu et al. (2020)
GeoGLUE | Nezha
FACT-BENCH | GPT-4
MedMKEB | LLaVA-Med
Alvorada-Bench | O3 Pro
EFO-1-QA | LogicE
CG-Eval | GPT-4
BeerQA | IRRR (SQuAD + HotpotQA)
Down and Across | RAG wiki
COPEN | Human
QuantumBench | GPT-5-high
Knowledge Crosswords | VERIFY-ALL (GPT-4)
NuclearQA | Llama 2
ThaiCLI | GPT-4o
MQA-AEVAL | GPT-3.5-TURBO-INSTRUCT
CUS-QA | Llama-3.3-70B-Instruct
SANSKRITI | GPT-4o
MINED | Gemini-2.5-Pro
M-QALM | Flan-T5 (11B)
CoDEx | ConvE
ZhuJiu | GPT-3.5-turbo
KoSimpleQA | HCX SEED 14B
KEO QA Benchmark | gemma-3-it
XL-BEL | XLMR + SAP all syn
Entity Cloze by Date | T5-large
SLAQ | Gemma-3 1B
Self-Diagnostic Atomic Knowledge | GPT-4
Multilingual Compositional Wikidata Questions | mT5-base+RIR
OKBench | GPT-4o
SCiPS-QA | meta-llama-3-70B
CK-Arena | DeepSeek-V3
KG Attributes Benchmark | GPT-4
ONTOLAMA | RoBERTa-large-pm-m3-voc
KULTURE Bench | Human
BIOLAMA | Bio-LM
LoFTI | GPT-4
BEAR | Meta-Llama-3-8B
FinLFQA | GPT-4o
LiveSearchBench | gpt-5
Hakka Benchmark | Llama 3.1 with RAG
QUENCH | GPT-4-Turbo
Japanese Financial Benchmark | Claude 3.5 Sonnet
RAmBLA | GPT-3.5
MuLan | Alpaca
RealFactBench | Claude-3.7-Sonnet
YpathR | YpathRAG
InsCoQA | GPT-4
GeoSQA | PMI
FailureSensorIQ | o1
MedFact | XiaoYi
CheckThat! 2020 | Buster.AI
ClimateEval | Mistral 24B
URDUFACTBENCH | GPT-4O
Ambiguous Entity Retrieval | Bootleg
MultiWikiQA | Mistral-Small-3.1-24B-Instruct-2503
MedBench-IT | DeepSeek-R1
CultSportQA | GPT-4o
X-FACTR | M-BERT
LM-PUB-QUIZ | Meta-Llama-3-8B
Materials Knowledge Benchmark | phi-4
ChnEval | RoBERTa-wwm-ext
DentalBench | GPT-4o
ReFACT | GPT-4o
MultiReQA | USE-QA (fine-tuned)
TempQA-WD | SYGMA
MediQAl | o3
DEEPAMBIGQA | GPT-5
FinS-Pilot | Xiaofa-1.0
ConvQuestions | Oracle + Convex
evolveQA | GPT-5-mini
Swedish Facts | gemma-3-27b
Chinese Commonsense Multi-hop Reasoning | Gemini-2.5-Pro
Nunchi-Bench | gemini-2.5-pro-preview
Agri-Query | Gemini 2.5 Flash
FATHOMS-RAG | claude-sonnet-4
ADAM | Gemini Flash 2.5
SinhalaMMLU | Claude 3.5 Sonnet
XLQA | Oracle LM (GPT-4.1)
TripJudge | Ensemble
RoBiologyDataChoiceQA | gemini-2.0-flash
MedLAMA | SapBERT
ArcMMLU | GPT-4 (gpt-4-0613)
KEO | gemma-3-it
EFO_k-CQA | CQD
FactChecker | GPT-4
ECLeKTic | Gemini 2.0 Pro
KACC | AttH
ParRoT | Macaw-11B
PalmX-GC | GPT-4.1
CIKQA | G2T (Bian et al., 2021)
TEXTWORLDSQA | DrQA-M
KMIR | RoBERTa