Updated 5 hours ago. Sources: LiveBench Math

Math benchmarks

Numerical reasoning and mathematical problem solving from LiveBench.

| # | Model | Provider | Score |
|---|-------|----------|-------|
| 1 | GPT-5.5 Thinking xHigh Effort | OpenAI | 96.3% |
| 2 | GPT-5.4 Thinking xHigh Effort | OpenAI | 94.2% |
| 3 | GPT-5.2 High | OpenAI | 93.2% |
| 4 | Claude 4.7 Opus Thinking xHigh Effort | Anthropic | 93.1% |
| 5 | GPT-5.4 Nano xHigh | OpenAI | 91.3% |
| 6 | Gemini 3.1 Pro Preview High | Google | 91.0% |
| 7 | DeepSeek V4 Pro | DeepSeek | 90.7% |
| 8 | Claude 4.5 Opus Thinking High Effort | Anthropic | 90.4% |
| 9 | Claude 4.6 Opus Thinking High Effort | Anthropic | 89.3% |
| 10 | GPT-5.2 Codex | OpenAI | 88.8% |
| 11 | Gemini 3.5 Flash High | Google | 88.2% |
| 12 | GPT-5.3 Codex High | OpenAI | 87.8% |
| 13 | Grok 4.20 Beta | xAI | 87.1% |
| 14 | Claude 4.6 Sonnet Thinking Medium Effort | Anthropic | 87.0% |
| 15 | GPT-5.1 High | OpenAI | 86.9% |
| 16 | GPT-5 Pro | OpenAI | 86.2% |
| 17 | Qwen 3.7 Max | Alibaba | 85.3% |
| 18 | DeepSeek V3.2 Thinking | DeepSeek | 85.0% |
| 19 | GLM 5.1 | Z.AI | 84.9% |
| 20 | Kimi K2.5 Thinking | Moonshot AI | 84.9% |
| 21 | Grok 4.3 | xAI | 84.3% |
| 22 | Kimi K2.6 Thinking | Moonshot AI | 84.3% |
| 23 | Gemini 3 Flash Preview High | Google | 84.2% |
| 24 | Qwen 3.6 Plus | Alibaba | 83.7% |
| 25 | Grok 4.1 Fast | xAI | 83.7% |
| 26 | GLM 5 | Z.AI | 83.5% |
| 27 | GPT-5.1 Codex Max High | OpenAI | 83.2% |
| 28 | Grok 4 | xAI | 83.0% |
| 29 | DeepSeek V3.2 Exp Thinking | DeepSeek | 82.4% |
| 30 | GPT-5 Mini High | OpenAI | 82.2% |
| 31 | Gemini 3 Pro Preview High | Google | 81.8% |
| 32 | GLM 4.6 | Z.AI | 81.1% |
| 33 | Kimi K2 Thinking | Moonshot AI | 81.1% |
| 34 | Minimax M2.7 | Minimax | 80.5% |
| 35 | Qwen 3.6 27B | Alibaba | 79.9% |
| 36 | DeepSeek V4 Flash | DeepSeek | 79.7% |
| 37 | GPT-5.1 Codex | OpenAI | 79.6% |
| 38 | Claude Sonnet 4.5 Thinking | Anthropic | 79.3% |
| 39 | Qwen 3.6 Flash | Alibaba | 78.9% |
| 40 | GPT-5.4 Mini xHigh | OpenAI | 78.6% |
| 41 | Claude Haiku 4.5 Thinking | Anthropic | 77.5% |
| 42 | Minimax M2.5 | Minimax | 77.4% |
| 43 | MiMo V2 Pro | Xiaomi | 77.0% |
| 44 | GPT-5.1 Codex Mini | OpenAI | 76.3% |
| 45 | GLM 4.7 | Z.AI | 76.0% |
| 46 | Gemini 2.5 Flash (Max Thinking) (2025-09-25) | Google | 75.3% |
| 47 | Qwen 3 Next 80B A3B Thinking | Alibaba | 74.3% |
| 48 | Gemma 4 31B | Google | 73.9% |
| 49 | Gemini 3.1 Flash Lite Preview High | Google | 73.6% |
| 50 | Qwen 3 235B A22B Thinking 2507 | Alibaba | 73.4% |
| 51 | Claude 4.1 Opus Thinking | Anthropic | 73.2% |
| 52 | GPT-5.3 Instant | OpenAI | 72.4% |
| 53 | Claude 4 Sonnet Thinking | Anthropic | 70.5% |
| 54 | GLM 5V Turbo | Z.AI | 70.4% |
| 55 | Qwen 3 Next 80B A3B Instruct | Alibaba | 70.2% |
| 56 | GPT OSS 120b | OpenAI | 68.9% |
| 57 | Gemini 2.5 Flash (Max Thinking) (2025-06-05) | Google | 68.8% |
| 58 | GPT-5 Nano High | OpenAI | 68.4% |
| 59 | Gemini 2.5 Pro (Max Thinking) | Google | 68.3% |
| 60 | Qwen 3 235B A22B Instruct 2507 | Alibaba | 68.0% |
| 61 | Qwen 3 32B | Alibaba | 67.4% |
| 62 | Claude 4.5 Opus Medium Effort | Anthropic | 66.3% |
| 63 | Qwen 3 30B A3B | Alibaba | 65.3% |
| 64 | Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25) | Google | 64.9% |
| 65 | DeepSeek V3.2 Exp | DeepSeek | 64.4% |
| 66 | DeepSeek V3.2 | DeepSeek | 64.0% |
| 67 | Claude 4.1 Opus | Anthropic | 62.8% |
| 68 | Claude Sonnet 4.5 | Anthropic | 62.6% |
| 69 | GLM 4.6V | Z.AI | 62.5% |
| 70 | Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17) | Google | 61.0% |
| 71 | Claude 4 Sonnet | Anthropic | 60.4% |
| 72 | GPT-5.2 No Thinking | OpenAI | 58.3% |
| 73 | Kimi K2 Instruct | Moonshot AI | 58.1% |
| 74 | Claude Haiku 4.5 | Anthropic | 58.0% |
| 75 | Elephant Alpha | OpenRouter | 57.5% |
| 76 | Grok Code Fast | xAI | 56.0% |
| 77 | Devstral 2 | Mistral | 52.5% |
| 78 | Grok 4.20 Beta (Non-Reasoning) | xAI | 45.5% |
| 79 | Trinity Large Preview | Arcee | 44.9% |
| 80 | GPT-5.1 No Thinking | OpenAI | 44.5% |
| 81 | Grok 4.1 Fast (Non-Reasoning) | xAI | 38.9% |
| 82 | Nemotron 3 Super 120B A12B | NVIDIA | 36.4% |
