Instruction-following benchmarks

Scores for adherence to formatting constraints and complex instructions, taken from LiveBench.

LiveBench Instruction Following

#	Model	Provider	Score	Input $/M	Output $/M	Context	CI
1	Gemini 3.1 Pro Preview High	Google	79.1%	—	—	—	—
2	Gemini 3.5 Flash High	Google	75.6%	—	—	—	—
3	Gemini 3 Flash Preview High	Google	74.9%	—	—	—	—
4	Qwen 3.7 Max	Alibaba	74.0%	—	—	—	—
5	GPT-5.5 Thinking xHigh Effort	OpenAI	73.0%	—	—	—	—
6	GPT-5.1 Codex Max High	OpenAI	70.4%	—	—	—	—
7	GPT-5.4 Thinking xHigh Effort	OpenAI	70.2%	—	—	—	—
8	Gemini 3.1 Flash Lite Preview High	Google	68.6%	—	—	—	—
9	GLM 5.1	Z.AI	68.5%	—	—	—	—
10	Gemma 4 31B	Google	67.6%	—	—	—	—
11	Claude 4.8 Opus Thinking xHigh Effort	Anthropic	67.5%	—	—	—	—
12	GPT-5.4 Nano xHigh	OpenAI	67.2%	—	—	—	—
13	GPT-5.2 Codex	OpenAI	66.5%	—	—	—	—
14	Gemini 3 Pro Preview High	Google	65.8%	—	—	—	—
15	GPT-5.3 Codex High	OpenAI	65.4%	—	—	—	—
16	GPT-5 Mini High	OpenAI	65.3%	—	—	—	—
17	Grok Build 0.1	xAI	65.2%	—	—	—	—
18	Kimi K2.6 Thinking	Moonshot AI	64.4%	—	—	—	—
19	GPT-5 Pro	OpenAI	64.0%	—	—	—	—
20	GPT-5.1 High	OpenAI	63.9%	—	—	—	—
21	GPT-5.1 Codex	OpenAI	63.4%	—	—	—	—
22	Grok 4.20 Beta	xAI	63.4%	—	—	—	—
23	Claude 4.6 Opus Thinking High Effort	Anthropic	63.3%	—	—	—	—
24	Claude 4.6 Sonnet Thinking Medium Effort	Anthropic	63.2%	—	—	—	—
25	DeepSeek V4 Flash	DeepSeek	63.1%	—	—	—	—
26	Grok 4.3	xAI	62.8%	—	—	—	—
27	Claude 4.5 Opus Thinking High Effort	Anthropic	62.5%	—	—	—	—
28	DeepSeek V4 Pro	DeepSeek	62.4%	—	—	—	—
29	GLM 5.2	Z.AI	62.3%	—	—	—	—
30	Kimi K2 Thinking	Moonshot AI	62.0%	—	—	—	—
31	GPT-5.2 High	OpenAI	61.8%	—	—	—	—
32	Minimax M2.7	MiniMax	61.1%	—	—	—	—
33	GPT-5.4 Mini xHigh	OpenAI	60.3%	—	—	—	—
34	Claude Fable 5 Thinking xHigh Effort*	Anthropic	60.0%	—	—	—	—
35	GPT-5.3 Instant	OpenAI	59.4%	—	—	—	—
36	Claude Sonnet 5 xHigh Effort	Anthropic	59.4%	—	—	—	—
37	Claude 4.7 Opus Thinking xHigh Effort	Anthropic	59.3%	—	—	—	—
38	GPT-5.1 Codex Mini	OpenAI	59.0%	—	—	—	—
39	Qwen 3.6 Plus	Alibaba	58.3%	—	—	—	—
40	Nemotron 3 Ultra 550B A55B	NVIDIA	58.2%	—	—	—	—
41	Minimax M3	MiniMax	57.5%	—	—	—	—
42	Kimi K2.5 Thinking	Moonshot AI	57.4%	—	—	—	—
43	Minimax M2.5	MiniMax	57.2%	—	—	—	—
44	Kimi K2.7 Code	Moonshot AI	56.3%	—	—	—	—
45	GPT-5 Nano High	OpenAI	55.7%	—	—	—	—
46	GLM 5	Z.AI	55.3%	—	—	—	—
47	Claude Sonnet 4.5 Thinking	Anthropic	53.4%	—	—	—	—
48	Qwen 3.6 27B	Alibaba	53.2%	—	—	—	—
49	GPT OSS 120b	OpenAI	50.3%	—	—	—	—
50	Claude Haiku 4.5 Thinking	Anthropic	49.8%	—	—	—	—
51	DeepSeek V3.2 Thinking	DeepSeek	48.2%	—	—	—	—
52	Qwen 3.6 Flash	Alibaba	47.2%	—	—	—	—
53	Claude 4 Sonnet Thinking	Anthropic	44.3%	—	—	—	—
54	MiMo V2 Pro	Xiaomi	43.2%	—	—	—	—
55	Claude 4.1 Opus Thinking	Anthropic	42.4%	—	—	—	—
56	Qwen 3 Next 80B A3B Thinking	Alibaba	41.5%	—	—	—	—
57	DeepSeek V3.2 Exp Thinking	DeepSeek	41.3%	—	—	—	—
58	Qwen 3 235B A22B Thinking 2507	Alibaba	40.6%	—	—	—	—
59	GLM 4.7	Z.AI	35.7%	—	—	—	—
60	Gemini 2.5 Pro (Max Thinking)	Google	33.1%	—	—	—	—
61	Elephant Alpha	OpenRouter	29.6%	—	—	—	—
62	Grok 4	xAI	29.1%	—	—	—	—
63	Gemini 2.5 Flash (Max Thinking) (2025-06-05)	Google	28.5%	—	—	—	—
64	Nemotron 3 Super 120B A12B	NVIDIA	28.4%	—	—	—	—
65	Grok 4.1 Fast	xAI	28.2%	—	—	—	—
66	Claude 4.5 Opus Medium Effort	Anthropic	28.1%	—	—	—	—
67	Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25)	Google	28.1%	—	—	—	—
68	Gemini 2.5 Flash (Max Thinking) (2025-09-25)	Google	27.7%	—	—	—	—
69	GLM 5V Turbo	Z.AI	27.2%	—	—	—	—
70	GPT-5.2 No Thinking	OpenAI	27.2%	—	—	—	—
71	GLM 4.6	Z.AI	26.2%	—	—	—	—
72	Claude 4.1 Opus	Anthropic	25.9%	—	—	—	—
73	Grok 4.20 Beta (Non-Reasoning)	xAI	24.4%	—	—	—	—
74	Claude Sonnet 4.5	Anthropic	23.5%	—	—	—	—
75	GPT-5.1 No Thinking	OpenAI	23.5%	—	—	—	—
76	Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17)	Google	23.1%	—	—	—	—
77	DeepSeek V3.2	DeepSeek	23.1%	—	—	—	—
78	Claude 4 Sonnet	Anthropic	22.7%	—	—	—	—
79	Grok Code Fast	xAI	22.3%	—	—	—	—
80	Qwen 3 235B A22B Instruct 2507	Alibaba	21.7%	—	—	—	—
81	Qwen 3 30B A3B	Alibaba	21.1%	—	—	—	—
82	Kimi K2 Instruct	Moonshot AI	20.4%	—	—	—	—
83	DeepSeek V3.2 Exp	DeepSeek	19.3%	—	—	—	—
84	Qwen 3 Next 80B A3B Instruct	Alibaba	19.2%	—	—	—	—
85	Qwen 3 32B	Alibaba	17.8%	—	—	—	—
86	Claude Haiku 4.5	Anthropic	17.8%	—	—	—	—
87	GLM 4.6V	Z.AI	17.1%	—	—	—	—
88	Grok 4.1 Fast (Non-Reasoning)	xAI	17.0%	—	—	—	—
89	Devstral 2	Mistral	13.5%	—	—	—	—
90	Trinity Large Preview	Arcee AI	12.2%	—	—	—	—

* Losing out due to stricter content moderation.
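
Several models in the table appear in both a thinking and a non-thinking variant, and the score difference between the two can be read off directly. The following is a minimal sketch of that arithmetic, assuming a handful of (model, score) pairs copied by hand from the rows above. The names `SCORES` and `thinking_gaps` are hypothetical and are not part of LiveBench or any library.

```python
# Sample (model, score in %) pairs copied from the leaderboard rows above.
# This is an illustrative subset, not the full table.
SCORES = {
    "Claude Sonnet 4.5 Thinking": 53.4,
    "Claude Sonnet 4.5": 23.5,
    "Claude Haiku 4.5 Thinking": 49.8,
    "Claude Haiku 4.5": 17.8,
    "DeepSeek V3.2 Thinking": 48.2,
    "DeepSeek V3.2": 23.1,
    "Claude 4 Sonnet Thinking": 44.3,
    "Claude 4 Sonnet": 22.7,
}


def thinking_gaps(scores):
    """Map each base model to (thinking score - non-thinking score), in points.

    Assumption: a thinking variant is named exactly "<base> Thinking".
    Models without a matching base entry are skipped.
    """
    suffix = " Thinking"
    gaps = {}
    for name, score in scores.items():
        if name.endswith(suffix):
            base = name[: -len(suffix)]
            if base in scores:
                gaps[base] = round(score - scores[base], 1)
    return gaps


if __name__ == "__main__":
    # Print the largest gaps first.
    ranked = sorted(thinking_gaps(SCORES).items(), key=lambda kv: -kv[1])
    for base, gap in ranked:
        print(f"{base}: +{gap:.1f} points with thinking")
```

The gaps are differences in percentage points on this one benchmark; the sketch says nothing about why the variants differ.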
