Agentic Coding Benchmarks

Multi-step code editing and tool use: agentic workflows from LiveBench.

LiveBench Agentic Coding

#	Model	Vendor	Score	Input $/M	Output $/M	Context	CI
1	GLM 5.2	Z.AI	73.3%	—	—	—	—
2	GPT-5.4 Thinking xHigh Effort	OpenAI	70.0%	—	—	—	—
3	Kimi K2.7 Code	Moonshot AI	70.0%	—	—	—	—
4	Gemini 3.1 Pro Preview High	Google	65.0%	—	—	—	—
5	Claude Sonnet 5 xHigh Effort	Anthropic	65.0%	—	—	—	—
6	Claude 4.5 Opus Thinking High Effort	Anthropic	63.3%	—	—	—	—
7	Claude 4.5 Opus Medium Effort	Anthropic	63.3%	—	—	—	—
8	Claude 4.6 Opus Thinking High Effort	Anthropic	61.7%	—	—	—	—
9	Claude 4.8 Opus Thinking xHigh Effort	Anthropic	60.0%	—	—	—	—
10	Claude Fable 5 Thinking xHigh Effort*	Anthropic	60.0%	—	—	—	—
11	Claude 4.7 Opus Thinking xHigh Effort	Anthropic	60.0%	—	—	—	—
12	Claude 4.6 Sonnet Thinking Medium Effort	Anthropic	60.0%	—	—	—	—
13	Minimax M3	MiniMax	60.0%	—	—	—	—
14	Kimi K2.6 Thinking	Moonshot AI	58.3%	—	—	—	—
15	GPT-5.5 Thinking xHigh Effort	OpenAI	56.7%	—	—	—	—
16	DeepSeek V4 Pro	DeepSeek	56.7%	—	—	—	—
17	Gemini 3 Pro Preview High	Google	55.0%	—	—	—	—
18	GPT-5.3 Codex High	OpenAI	55.0%	—	—	—	—
19	Qwen 3.6 Plus	Alibaba	55.0%	—	—	—	—
20	GLM 5.1	Z.AI	55.0%	—	—	—	—
21	GLM 5	Z.AI	55.0%	—	—	—	—
22	GPT-5.1 Codex Max High	OpenAI	53.3%	—	—	—	—
23	GPT-5.1 High	OpenAI	53.3%	—	—	—	—
24	Grok Build 0.1	xAI	53.3%	—	—	—	—
25	GPT-5.1 Codex	OpenAI	53.3%	—	—	—	—
26	Claude Sonnet 4.5 Thinking	Anthropic	53.3%	—	—	—	—
27	Claude 4.1 Opus	Anthropic	53.3%	—	—	—	—
28	Gemini 3.5 Flash High	Google	51.7%	—	—	—	—
29	GPT-5.2 High	OpenAI	51.7%	—	—	—	—
30	GPT-5.2 Codex	OpenAI	51.7%	—	—	—	—
31	Qwen 3.7 Max	Alibaba	51.7%	—	—	—	—
32	GPT-5 Pro	OpenAI	51.7%	—	—	—	—
33	Minimax M2.5	MiniMax	51.7%	—	—	—	—
34	DeepSeek V4 Flash	DeepSeek	50.0%	—	—	—	—
35	Grok 4.3	xAI	50.0%	—	—	—	—
36	Qwen 3.6 27B	Alibaba	50.0%	—	—	—	—
37	Minimax M2.7	MiniMax	50.0%	—	—	—	—
38	GPT-5.4 Nano xHigh	OpenAI	49.1%	—	—	—	—
39	Kimi K2.5 Thinking	Moonshot AI	48.3%	—	—	—	—
40	Claude 4.1 Opus Thinking	Anthropic	48.3%	—	—	—	—
41	Claude Sonnet 4.5	Anthropic	48.3%	—	—	—	—
42	GPT-5.4 Mini xHigh	OpenAI	47.5%	—	—	—	—
43	GPT-5 Mini High	OpenAI	46.7%	—	—	—	—
44	Qwen 3.6 Flash	Alibaba	46.7%	—	—	—	—
45	DeepSeek V3.2	DeepSeek	46.7%	—	—	—	—
46	Nemotron 3 Ultra 550B A55B	NVIDIA	46.7%	—	—	—	—
47	Grok 4.20 Beta	xAI	43.3%	—	—	—	—
48	Devstral 2	Mistral	43.3%	—	—	—	—
49	Claude Haiku 4.5 Thinking	Anthropic	41.7%	—	—	—	—
50	GLM 4.7	Z.AI	41.7%	—	—	—	—
51	Gemini 3 Flash Preview High	Google	40.0%	—	—	—	—
52	DeepSeek V3.2 Thinking	DeepSeek	40.0%	—	—	—	—
53	Gemma 4 31B	Google	40.0%	—	—	—	—
54	Claude 4 Sonnet Thinking	Anthropic	40.0%	—	—	—	—
55	GPT-5.1 Codex Mini	OpenAI	40.0%	—	—	—	—
56	GPT-5.2 No Thinking	OpenAI	40.0%	—	—	—	—
57	Kimi K2 Thinking	Moonshot AI	38.3%	—	—	—	—
58	Claude 4 Sonnet	Anthropic	38.3%	—	—	—	—
59	Grok 4.20 Beta (Non-Reasoning)	xAI	38.3%	—	—	—	—
60	DeepSeek V3.2 Exp	DeepSeek	36.7%	—	—	—	—
61	GLM 4.6	Z.AI	35.0%	—	—	—	—
62	Gemini 3.1 Flash Lite Preview High	Google	33.3%	—	—	—	—
63	Gemini 2.5 Pro (Max Thinking)	Google	33.3%	—	—	—	—
64	Claude Haiku 4.5	Anthropic	33.3%	—	—	—	—
65	Grok Code Fast	xAI	33.3%	—	—	—	—
66	Grok 4.1 Fast	xAI	31.7%	—	—	—	—
67	DeepSeek V3.2 Exp Thinking	DeepSeek	31.7%	—	—	—	—
68	Kimi K2 Instruct	Moonshot AI	31.7%	—	—	—	—
69	Grok 4	xAI	30.0%	—	—	—	—
70	MiMo V2 Pro	Xiaomi	30.0%	—	—	—	—
71	GPT-5.3 Instant	OpenAI	28.3%	—	—	—	—
72	GPT-5.1 No Thinking	OpenAI	28.3%	—	—	—	—
73	Gemini 2.5 Flash (Max Thinking) (2025-09-25)	Google	23.3%	—	—	—	—
74	GPT-5 Nano High	OpenAI	23.3%	—	—	—	—
75	Nemotron 3 Super 120B A12B	NVIDIA	23.0%	—	—	—	—
76	Gemini 2.5 Flash (Max Thinking) (2025-06-05)	Google	16.7%	—	—	—	—
77	GPT OSS 120b	OpenAI	16.7%	—	—	—	—
78	Qwen 3 235B A22B Instruct 2507	Alibaba	13.3%	—	—	—	—
79	Qwen 3 Next 80B A3B Instruct	Alibaba	10.0%	—	—	—	—
80	Grok 4.1 Fast (Non-Reasoning)	xAI	10.0%	—	—	—	—
81	Qwen 3 Next 80B A3B Thinking	Alibaba	8.3%	—	—	—	—
82	Qwen 3 235B A22B Thinking 2507	Alibaba	6.7%	—	—	—	—
83	Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17)	Google	5.0%	—	—	—	—
84	GLM 5V Turbo	Z.AI	3.3%	—	—	—	—
85	Qwen 3 32B	Alibaba	3.3%	—	—	—	—
86	GLM 4.6V	Z.AI	3.3%	—	—	—	—
87	Trinity Large Preview	Arcee AI	3.3%	—	—	—	—
88	Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25)	Google	1.7%	—	—	—	—
89	Qwen 3 30B A3B	Alibaba	1.7%	—	—	—	—
90	Elephant Alpha	OpenRouter	1.7%	—	—	—	—

* losing out due to stricter content moderation


Need help choosing the right AI model?

Benchmarks are a starting point, not an answer. The right model depends on your workload, budget, and integration requirements; let's work it out together.
