Software Engineering
Code development and debugging
320 datasets
Model Leaderboard
| No. | Model | Score |
| --- | --- | --- |
| 1 | Claude Sonnet 4.5 | 0.93 |
| 2 | GPT-5 Codex | 0.87 |
| 3 | GPT-5.1 (high) | 0.87 |
| 4 | GPT-5.1 | 0.86 |
| 5 | Claude Sonnet 4 (thinking) | 0.86 |
| 6 | GPT-5 | 0.84 |
| 7 | Claude 3.7 Sonnet (thinking) | 0.75 |
| 8 | Claude Sonnet 4 | 0.75 |
| 9 | Claude Opus 4.1 | 0.72 |
| 10 | GPT-5 (high) | 0.71 |
| 11 | Gemini 3 Pro | 0.68 |
| 12 | Grok 4 Fast (reasoning) | 0.65 |
| 13 | Kimi K2 Thinking | 0.55 |
| 14 | GPT-5 Mini | 0.54 |
| 15 | Gemini 2.5 Pro | 0.54 |
| 16 | DeepSeek V3.1 | 0.52 |
| 17 | Qwen3 235B A22B | 0.52 |
| 18 | Claude 4.5 Sonnet (thinking) | 0.49 |
| 19 | Grok 4 | 0.49 |
| 20 | Qwen3 VL 235B A22B Thinking | 0.49 |
| 21 | Claude 3.7 Sonnet | 0.48 |
| 22 | Grok 4 Fast | 0.43 |
| 23 | Claude Haiku 4.5 | 0.43 |
| 24 | Kimi-K2-Instruct | 0.42 |
| 25 | Kimi K2 0905 | 0.42 |
| 26 | o4-mini (high) | 0.40 |
| 27 | Grok Code Fast 1 | 0.40 |
| 28 | Claude Opus 4 | 0.27 |
| 29 | MiniMax M2 | 0.26 |
| 30 | Qwen3 Next 80B A3B Thinking | 0.26 |
| 31 | Claude 3.5 Sonnet v2 | 0.24 |
| 32 | Gemini 2.5 Flash | 0.21 |
| 33 | Z.AI: GLM 4.5 | 0.19 |
| 34 | Qwen3 Coder 480B A35B | 0.18 |
| 35 | o3 (high) | 0.18 |
| 36 | gpt-oss-120b | 0.17 |
| 37 | DeepSeek-R1 | 0.13 |
| 38 | Z.AI: GLM 4.5 Air | 0.11 |
| 39 | GPT-5 Nano | 0.05 |
| 40 | gpt-oss-20b | 0.00 |
| Name | Best Model |
| --- | --- |
| LiveCodeBench Pro | Gemini 3 Pro Preview |
| LiveBench | Claude 4 Sonnet |
| SWE-Bench Pro | Claude 4.5 Sonnet |
| GSO | Claude-4.5-Sonnet |
| Terminal Bench | Gemini 3 Pro |
| SWE-Lancer | Claude 3.5 Sonnet |
| HumanEval | Codex-12B |
| Design2Code | GPT-4o |
| LiveCodeBench | O4-Mini (High) |
| AL-Bench | FastLog |
| HumanEval-Decompile | Ghidra + LLM4Decompile-Ref-22B |
| SWT-Bench | GPT-5 |
| CodeAgentBench | GPT-4-turbo |
| BigCodeBench | Claude 3.7 Sonnet (20250219) |
| CodeElo | o1-mini |
| SWE-Dev | Claude-3.7-Sonnet-thinking |
| rSDE-Bench | EvoMAC |
| PwP-Bench | Claude 3.5 Sonnet |
| HPC Performance Optimization Benchmark | Codee |
| SWE-bench Multimodal | GPT-4o |
| DA-Code | GPT-4 |
| VADER | o3 |
| APPS | GPT-Neo 2.7B |
| SWE-bench-Live | Qwen3-Coder-480B-A35B |
| SciReplicate-Bench | claude-3-sonnet-20240229 |
| SecRepoBench | GPT-5 |
| EvalPlus | O1 Preview |
| WebGen-Bench | WebGen-LM-32B |
| LeetCodeDataset | DeepSeek-R1 |
| CyberSecEval | Llama-2-7b-chat |
| Multi-SWE-bench | Gemini-2.5-Pro |
| REPOEXEC | DeepSeek-R1 |
| SciCode | OpenAI o3-mini-low |
| Web-Bench | Web-Agent (Claude 3.7 Sonnet) |
| FEA-Bench | DeepSeek-R1 |
| RustEvo2 | Claude-3.7-Sonnet |
| Copilot Arena | deepseek-coder-fim |
| BigOBench | DeepSeek-R1 Llama 70B |
| CodeFlowBench | GPT-4.1-mini |
| SWE-PolyBench | Deepseek R1 |
| RepoBench | Codex (code-davinci-002) |
| AutoCodeBench | Claude Opus 4 (20250514) (Reasoning) |
| LongCodeBench | Claude 3.5 Sonnet |
| CoV-Eval | claude-3-sonnet-20240229 |
| SWE-QA | Claude 3.7 Sonnet |
| EffiBench | starcoder2-15b |
| GitTaskBench | Claude 3.7 Sonnet |
| MigrationBench | Claude-3.5-Sonnet-v2 |
| DS-1000 | Codex-002 |
| Sallm | GPT-3.5 |
| TestGenEval | GPT-4o |
| A.S.E | Claude-3.7-Sonnet-20250219 |
| Vibe Checker | GPT 5 |
| CodeXGLUE | CodeBERT Baseline |
| CODEGUARD+ | GPT-4-1106-preview |
| Commit0 | Claude 3.5 Sonnet |
| ConvCodeWorld | GPT-4-0613 |
| SecureAgentBench | DeepSeek-V3.1 |
| TDD-Bench Verified | GPT-4o |
| HumanEval-V | Claude 3.5 Sonnet |
| EvoCodeBench | gpt-4 |
| Real-World Project Benchmark | DeepSeek-R1 |
| BaxBench | GPT-5 |
| ProjectEval | GPT-4o |
| InterCode | GPT-4 |
| CodeEval-Pro | Deepseek-R1 |
| ClassEval-T | DeepSeek-V3 |
| RACE | o1-mini-2024-09-12 |
| BinMetric | GPT-4 |
| WebApp1K | o1-preview |
| Long Code Arena | GPT-4 |
| TESTEVAL | GPT-4o |
| CLEVER | DeepSeek-R1 |
| PyBench | GPT-4o |
| Mercury | deepseek-coder-33b-base + DPO |
| ClassEval | GPT-4 |
| CWEval | gpt-4o-2024-08-06 |
| SWE-Compass | Claude-Sonnet-4 |
| CodeCriticBench | GPT-OSS-120B |
| CanItEdit | GPT-4 |
| M2RC-EVAL | DeepSeekCoder-6.7B |
| LoCoBench | Gemini-2.5-Pro |
| CodeCrash | DeepSeek-R1 |
| ExecRepoBench | Qwen2.5-Coder-Instruct-C (7B) |
| SURGE | DeepSeek-V3 |
| EffiBench-X | Gemini-2.5-Pro |
| DomainCodeBench | DeepSeekCoder-33B |
| CodeIF | Claude-3-5-Sonnet-20241022 |
| XCODEEVAL | CodeLlama-13b-Instruct |
| ParEval | GPT-3.5 |
| AetherCode | o4-mini-high |
| EvoEval | GPT-4-Turbo |
| DevEval | gpt-4 |
| SWA-Bench and SWEE-Bench | GPT-4O |
| CodeReviewQA | Llama-3.1-70B-Instruct |
| DebugBench | gpt-4-0613 |
| CodeMind | GPT-4-Turbo |
| ReCode | CodeGen 16B mono |
| ArtifactsBench | GPT-5 |
| ENAMEL | HumanEval+ |
| DesignBench | GPT-4o |
| SEC-bench | Claude 3.7 Sonnet |
| CodeClash | Claude Sonnet 4.5 |
| TestBench | GPT-4 |
| RepoTransBench | GPT-4o |
| DafnyBench | Claude 3 Opus |
| APEval | GPT-4o |
| DeepCRCEval | LLM-Reviewer |
| RustRepoTrans | DeepSeek-R1-0528 |
| CodeEditorBench | gemini-ultra |
| REval | GPT-4-Turbo (0125) |
| RepoClassBench | Llama3-70b |
| HumanEvalNext | deepseek-coder |
| ProBench | QwQ-32B-Preview |
| EDIT-Bench | claude-sonnet-4 |
| ProjectTest | GPT-o1 |
| CodeHaluEval | GPT-4 |
| CodeMMLU | GPT 4o |
| CodeLMSec | CodeGen-6B |
| CodeTransOcean | ChatGPT (gpt-3.5-turbo) |
| CODESYNC | GPT-4o |
| EVALPERF | GPT-4 Turbo perf-CoT |
| VERINA | o4-mini |
| CRUXEval-X | GPT-4o |
| CodeSmellEval | Mistral-7B-v0.3 |
| FrontendBench | o3-mini |
| SAFIM | Qwen2.5-Coder-32B |
| StackEval | O1 Preview |
| VulDetectBench | ERNIE 4.0 |
| R2C2-Bench | DeepSeekCoder-6.7B (R2C2-Tuned) |
| CoderEval | PanGu-Coder (300M) |
| CodeAssistBench | ChatGPT 4.1 Mini |
| ICPC-Eval | o3-mini High |
| McEval | GPT-4o |
| SeCodePLT | Claude-3.7-Sonnet |
| COFFE | Llama3.1 |
| ComplexCodeEval | CodeLLaMa 34B |
| Stateful SWE | Claude Sonnet 4 |
| DI-BENCH | GPT-4o |
| GitChameleon 2.0 | Gemini 2.5 Pro |
| CoRe | GPT o3 |
| DSDBench | Qwen2.5-72B-Instruct |
| HumanEvalComm | DeepSeek Coder |
| SafeGenBench | o3 |
| SyncBench | Claude-3.5-Sonnet |
| EditEval | ChatGPT (gpt-3.5-turbo-0613) |
| HackerRank-ASTRA | o1 |
| M2-EVAL | GPT-4o |
| DyCodeEval | Llama-3.1-8B-Instruct |
| CodeJudge-Eval | Claude-3.5-Sonnet |
| SR-Eval | QW3-235 |
| FreshBrew | Gemini 2.5 Flash |
| CodeArena | DeepSeek-Coder |
| DSCodeBench | GPT-4o |
| CodeMirage | GPT-4 |
| CodePrefBench | GPT-4o |
| ManyIFEval | o3-mini (high) |
| SWE-Effi | Qwen3-32B |
| Mostly Hard Python Problems | o1-preview |
| MDEVAL | o1-mini |
| VulnLLMEval | Llama3.1-8b |
| DafnyComp | CLAUDE-3.5-SONNET |
| SWR-Bench | Gemini-2.5-Pro |
| DependEval | DeepSeek-V3 (37/671B) |
| USEbench | USEagent |
| HLCE | o4-mini (High) |
| Scoring Verifiers | o3-mini |
| DynaCode | GPT-4o |
| REPOCOD | GPT-4o |
| RACodeBench | GPT-4o-mini |
| ECCO | GPT-4o |
| NaturalCodeBench | GPT-4 |
| AutoAPIEval | ChatGPT (gpt-3.5-turbo) |
| CodeScope | GPT-4 |
| SE Arena | GPT-4o |
| CodeSense | Claude 3.5 Sonnet |
| SWE-fficiency | Human Expert |
| ExKLoP | Codestral-22B |
| MBXP | CodeGen-mono 16B |
| HumanEval-NFR | ARCHCODE (GPT-3.5-Turbo) |
| HumanEvo | GPT-4 |
| Copilot Evaluation Harness | GPT-4 |
| ARCADE | PACHINCO |
| CoreCodeBench | Claude-3.7-Sonnet |
| SwingArena | DeepSeek-V3 |
| KGym | GPT-4 Turbo |
| TCGBench | Human |
| CLMEEval | CodeLlama (7B) |
| FullFront | Claude 3.7 Sonnet |
| RepoMasterEval | DeepSeek-Coder-Base 33B |
| COMPASS | O4-Mini-High |
| UA-Code-Bench | OpenAI o4-mini medium |
| CoderUJB | GPT-4 |
| NaturalCC | cpt-code M |
| SpecEval | GPT-4 |
| CTF-Code | CTF |
| OBFUSEVAL | GPT-4-Turbo-0125 |
| OSVBench | Doubao-1.5-pro |
| CLOVER | GPT-4O |
| BioCoder | GPT-4 |
| AICoderEval | AICoder (Llama-3-8B-Instruct w/ SFT) |
| SWE-MERA | DeepSeek-R1-0528 |
| IFEvalCode | ControlledCoder |
| FeatBench | GPT-5 |
| LiveOIBench | GPT-5 |
| VeriEquivBench | Human |
| GitChameleon | Yi-1.5-Chat 34B |
| CodeIF-Bench | Claude-3.5-Sonnet |
| PythonSaga | GPT-4 |
| Breakpoint | o4-mini |
| InteractScience | GPT-5 |
| CPP-UT-Bench | TinyLlama-1.1B-Chat-v1.0 |
| MRG-Bench | Claude 3.5 Sonnet |
| VersiCode | GPT-4o |
| TransCoder-test-X | ExeCoder |
| TestCase-Eval | Human Expert |
| CodeMixBench | Phi-4 |
| Exec-CSN | GPT-4 |
| CoCo-Bench | o1-mini |
| DOMAINEVAL | GPT-4o-mini |
| Next Edit Prediction | Claude 4 Sonnet |
| GitGoodBench | GPT-4o |
| CodeFuse-CR-Bench | Gemini 2.5 Pro |
| FeedbackEval | Claude-3.5 Sonnet |
| IdentityChain | GPT-4 |
| GBCV | GPT-4o |
| DRCodePilot | DRCodePilot |
| OOP | ChatGPT |
| MERA Code | GPT-4o |
| NoFunEval | GPT-4 |
| CODEMENV | GPT-4O |
| OEIS Benchmark | o1-preview |
| FAUN-Eval | GPT-4o |
| JavaBench | gpt-4o-2024-05-13 |
| TC-Bench | Claude4 |
| PseudoEval | Qwen32B |
| STEPWISE-CODEX-Bench | openai-o3 |
| LLM SAST Benchmark | GPT-4.1 |
| DebugEval | DeepSeek-Coder-V2 |
| mHumanEval | Claude-3.5-Opus |
| WebUIBench | GPT-4o |
| SysMBench | Qwen3-32B |
| CACP | GPT-4 (gpt-4-1106) |
| TRACY | Qwen2.5-Coder-14B-Instruct |
| buggy-HumanEval | CODEGEN-16B-MONO |
| OSS-Bench | PHP (baseline) |
| L2CEval | gpt-4 |
| ScratchEval | Gemini-1.5-Pro |
| PYMIGBENCH | GPT-4o |
| CodeApex | GPT-4 |
| CRQBench | GPT-4 |
| TutorCode | GPT-4 |
| E2EDevBench | Gemini-2.5-Pro |
| ThrowBench | Qwen2.5 Coder Instr. |
| BICS | GPT-4o |
| MultiCodeIF | Claude-3-7-Sonnet |
| FPBench | DeepSeek-R1 |
| TypyBench | CLAUDE-3.5-SONNET |
| Turbulence | GPT-4 (t=0) |
| Code2Bench | Claude-Sonnet-4 |
| RunBugRun | CodeT5 |
| PostcondGen | Mistral-7B-Instruct |
| QCoder | o3 |
| PromptSE | Qwen-1.5b |
| MLDebugging | DeepSeek-V3 (72B) |
| ReDef | CodeBERT |
| RepoDebug | Claude 3.5 Sonnet |
| SWEDE | Claude-SONNET-3-5 |
| Vericoding Benchmark | claude-opus-4.1 |
| Defects4C | CodeLlama-Instruct-7B |
| MINICODE | Claude Sonnet 4 |
| Python Programming Puzzles | davinci-codex (Long Prompt) |
| VJBench | Codex |
| OpenCodeEdit | GPT-4 |
| CodeInsight | CodeLLAMA 13B |
| RaCGEval | Gemma 7B |
| Geospatial Code Generation | bigcode/starcoder2-7b |
| PRDBench | GPT-5 (Minimal) |
| CoverageEval | GPT-4 |
| PerfBench | Claude Sonnet 4 |
| UnLeakedTestBench | gemma-3-27b-it |
| HumanEval-Haskell | UniXcoder (Fine-tuned) |
| StudentEval | StarCoderBase |
| ExeDS | JuPyT5 |
| MT-Sec | Claude Opus 4 (Thinking) |
| SolBench | Claude-3.5-Haiku |
| CoQuIR | Voyage-code-3 |
| ACEOB | CodeT5-small |
| VerifyThisBench | o3-mini |
| VUL4C | VulnFix |
| RepairBench | o4-mini-2025-04-16-high |
| SeqCoBench | qwen2.5-coder-instruct (32B) |
| ReCatcher | DeepSeek-Coder-6.7B (Merged) |
| MultiOOP | GPT-4o mini |
| Code Execution Simulation | Gemini-2.5-Pro |
| SIMCODE | GPT-4.1 (FT) |
| Defects4J-Nl2fix | gpt-3.5-turbo |
| X-HumanEval-X | DeepSeek-Coder-33B |
| Educational Program Repair Benchmark | starcoder (3B) |
| Build-bench | GPT-5 |
| EVALOOP | o3-mini-2025-01-31 |
| AppForge | GPT-5-High |
| UniCode | o4-mini (high) |
| MacroBench | GPT-4o-Mini |
| PECC | Claude 3 Haiku |
| TREAT | GPT-5 |
| JUGE | EVOSUITE |
| LoCaL | CrystalBLEU |
| SimCopilot | o3-mini (high) |
| CCrepairBench | llama3.3-70B |
| AutoGEEval++ | o4-mini |
| RegMiner4APR | ChatGPT-4o + Conversation |
| SWE-Sharp-Bench | GPT-5 |
| PACT | Qwen3 |
| Assertion Messages | Codestral-22B |
| Diff-XYZ | GPT 4.1 |
| BRACE | Seed-Coder-8B-Instruct |
| Code Comprehension Benchmark | Codestral-22B |
| ComErrFix-CGP-v1.1 | WARP-Full (CodeLlama-70B) |
| CryptoAPI-Bench | CryptoGuard |