
Updated 5 hours ago. Sources: LiveBench Instruction Following


Instruction following benchmarks

Scores for adherence to formatting constraints and complex instructions, as reported by LiveBench.

LiveBench Instruction Following

# | Model | Provider | Score
1 | Gemini 3.1 Pro Preview High | Google | 79.1%
2 | Gemini 3.5 Flash High | Google | 75.6%
3 | Gemini 3 Flash Preview High | Google | 74.9%
4 | Qwen 3.7 Max | Alibaba | 74.0%
5 | GPT-5.5 Thinking xHigh Effort | OpenAI | 73.0%
6 | GPT-5.1 Codex Max High | OpenAI | 70.4%
7 | GPT-5.4 Thinking xHigh Effort | OpenAI | 70.2%
8 | Gemini 3.1 Flash Lite Preview High | Google | 68.6%
9 | GLM 5.1 | Z.AI | 68.5%
10 | Gemma 4 31B | Google | 67.6%
11 | GPT-5.4 Nano xHigh | OpenAI | 67.2%
12 | GPT-5.2 Codex | OpenAI | 66.5%
13 | Gemini 3 Pro Preview High | Google | 65.8%
14 | GPT-5.3 Codex High | OpenAI | 65.4%
15 | GPT-5 Mini High | OpenAI | 65.3%
16 | Kimi K2.6 Thinking | Moonshot AI | 64.4%
17 | GPT-5 Pro | OpenAI | 64.0%
18 | GPT-5.1 High | OpenAI | 63.9%
19 | GPT-5.1 Codex | OpenAI | 63.4%
20 | Grok 4.20 Beta | xAI | 63.4%
21 | Claude 4.6 Opus Thinking High Effort | Anthropic | 63.3%
22 | Claude 4.6 Sonnet Thinking Medium Effort | Anthropic | 63.2%
23 | DeepSeek V4 Flash | DeepSeek | 63.1%
24 | Grok 4.3 | xAI | 62.8%
25 | Claude 4.5 Opus Thinking High Effort | Anthropic | 62.5%
26 | DeepSeek V4 Pro | DeepSeek | 62.4%
27 | Kimi K2 Thinking | Moonshot AI | 62.0%
28 | GPT-5.2 High | OpenAI | 61.8%
29 | Minimax M2.7 | Minimax | 61.1%
30 | GPT-5.4 Mini xHigh | OpenAI | 60.3%
31 | GPT-5.3 Instant | OpenAI | 59.4%
32 | Claude 4.7 Opus Thinking xHigh Effort | Anthropic | 59.3%
33 | GPT-5.1 Codex Mini | OpenAI | 59.0%
34 | Qwen 3.6 Plus | Alibaba | 58.3%
35 | Kimi K2.5 Thinking | Moonshot AI | 57.4%
36 | Minimax M2.5 | Minimax | 57.2%
37 | GPT-5 Nano High | OpenAI | 55.7%
38 | GLM 5 | Z.AI | 55.3%
39 | Claude Sonnet 4.5 Thinking | Anthropic | 53.4%
40 | Qwen 3.6 27B | Alibaba | 53.2%
41 | GPT OSS 120b | OpenAI | 50.3%
42 | Claude Haiku 4.5 Thinking | Anthropic | 49.8%
43 | DeepSeek V3.2 Thinking | DeepSeek | 48.2%
44 | Qwen 3.6 Flash | Alibaba | 47.2%
45 | Claude 4 Sonnet Thinking | Anthropic | 44.3%
46 | MiMo V2 Pro | Xiaomi | 43.2%
47 | Claude 4.1 Opus Thinking | Anthropic | 42.4%
48 | Qwen 3 Next 80B A3B Thinking | Alibaba | 41.5%
49 | DeepSeek V3.2 Exp Thinking | DeepSeek | 41.3%
50 | Qwen 3 235B A22B Thinking 2507 | Alibaba | 40.6%
51 | GLM 4.7 | Z.AI | 35.7%
52 | Gemini 2.5 Pro (Max Thinking) | Google | 33.1%
53 | Elephant Alpha | OpenRouter | 29.6%
54 | Grok 4 | xAI | 29.1%
55 | Gemini 2.5 Flash (Max Thinking) (2025-06-05) | Google | 28.5%
56 | Nemotron 3 Super 120B A12B | NVIDIA | 28.4%
57 | Grok 4.1 Fast | xAI | 28.2%
58 | Claude 4.5 Opus Medium Effort | Anthropic | 28.1%
59 | Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25) | Google | 28.1%
60 | Gemini 2.5 Flash (Max Thinking) (2025-09-25) | Google | 27.7%
61 | GLM 5V Turbo | Z.AI | 27.2%
62 | GPT-5.2 No Thinking | OpenAI | 27.2%
63 | GLM 4.6 | Z.AI | 26.2%
64 | Claude 4.1 Opus | Anthropic | 25.9%
65 | Grok 4.20 Beta (Non-Reasoning) | xAI | 24.4%
66 | Claude Sonnet 4.5 | Anthropic | 23.5%
67 | GPT-5.1 No Thinking | OpenAI | 23.5%
68 | Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17) | Google | 23.1%
69 | DeepSeek V3.2 | DeepSeek | 23.1%
70 | Claude 4 Sonnet | Anthropic | 22.7%
71 | Grok Code Fast | xAI | 22.3%
72 | Qwen 3 235B A22B Instruct 2507 | Alibaba | 21.7%
73 | Qwen 3 30B A3B | Alibaba | 21.1%
74 | Kimi K2 Instruct | Moonshot AI | 20.4%
75 | DeepSeek V3.2 Exp | DeepSeek | 19.3%
76 | Qwen 3 Next 80B A3B Instruct | Alibaba | 19.2%
77 | Qwen 3 32B | Alibaba | 17.8%
78 | Claude Haiku 4.5 | Anthropic | 17.8%
79 | GLM 4.6V | Z.AI | 17.1%
80 | Grok 4.1 Fast (Non-Reasoning) | xAI | 17.0%
81 | Devstral 2 | Mistral | 13.5%
82 | Trinity Large Preview | Arcee | 12.2%

