Updated 5 hours ago. Source: LiveBench Agentic Coding

Agentic Coding Benchmarks

Multi-step code editing and tool use: agentic workflows from LiveBench.

LiveBench Agentic Coding

| # | Model | Provider | Score |
|---|-------|----------|-------|
| 1 | GPT-5.4 Thinking xHigh Effort | OpenAI | 70.0% |
| 2 | Gemini 3.1 Pro Preview High | Google | 65.0% |
| 3 | Claude 4.5 Opus Thinking High Effort | Anthropic | 63.3% |
| 4 | Claude 4.5 Opus Medium Effort | Anthropic | 63.3% |
| 5 | Claude 4.6 Opus Thinking High Effort | Anthropic | 61.7% |
| 6 | Claude 4.7 Opus Thinking xHigh Effort | Anthropic | 60.0% |
| 7 | Claude 4.6 Sonnet Thinking Medium Effort | Anthropic | 60.0% |
| 8 | Kimi K2.6 Thinking | Moonshot AI | 58.3% |
| 9 | GPT-5.5 Thinking xHigh Effort | OpenAI | 56.7% |
| 10 | DeepSeek V4 Pro | DeepSeek | 56.7% |
| 11 | Gemini 3 Pro Preview High | Google | 55.0% |
| 12 | GPT-5.3 Codex High | OpenAI | 55.0% |
| 13 | Qwen 3.6 Plus | Alibaba | 55.0% |
| 14 | GLM 5.1 | Z.AI | 55.0% |
| 15 | GLM 5 | Z.AI | 55.0% |
| 16 | GPT-5.1 Codex Max High | OpenAI | 53.3% |
| 17 | GPT-5.1 High | OpenAI | 53.3% |
| 18 | GPT-5.1 Codex | OpenAI | 53.3% |
| 19 | Claude Sonnet 4.5 Thinking | Anthropic | 53.3% |
| 20 | Claude 4.1 Opus | Anthropic | 53.3% |
| 21 | Gemini 3.5 Flash High | Google | 51.7% |
| 22 | GPT-5.2 High | OpenAI | 51.7% |
| 23 | GPT-5.2 Codex | OpenAI | 51.7% |
| 24 | Qwen 3.7 Max | Alibaba | 51.7% |
| 25 | GPT-5 Pro | OpenAI | 51.7% |
| 26 | Minimax M2.5 | Minimax | 51.7% |
| 27 | DeepSeek V4 Flash | DeepSeek | 50.0% |
| 28 | Grok 4.3 | xAI | 50.0% |
| 29 | Qwen 3.6 27B | Alibaba | 50.0% |
| 30 | Minimax M2.7 | Minimax | 50.0% |
| 31 | GPT-5.4 Nano xHigh | OpenAI | 49.1% |
| 32 | Kimi K2.5 Thinking | Moonshot AI | 48.3% |
| 33 | Claude 4.1 Opus Thinking | Anthropic | 48.3% |
| 34 | Claude Sonnet 4.5 | Anthropic | 48.3% |
| 35 | GPT-5.4 Mini xHigh | OpenAI | 47.5% |
| 36 | GPT-5 Mini High | OpenAI | 46.7% |
| 37 | Qwen 3.6 Flash | Alibaba | 46.7% |
| 38 | DeepSeek V3.2 | DeepSeek | 46.7% |
| 39 | Grok 4.20 Beta | xAI | 43.3% |
| 40 | Devstral 2 | Mistral | 43.3% |
| 41 | Claude Haiku 4.5 Thinking | Anthropic | 41.7% |
| 42 | GLM 4.7 | Z.AI | 41.7% |
| 43 | Gemini 3 Flash Preview High | Google | 40.0% |
| 44 | DeepSeek V3.2 Thinking | DeepSeek | 40.0% |
| 45 | Gemma 4 31B | Google | 40.0% |
| 46 | Claude 4 Sonnet Thinking | Anthropic | 40.0% |
| 47 | GPT-5.1 Codex Mini | OpenAI | 40.0% |
| 48 | GPT-5.2 No Thinking | OpenAI | 40.0% |
| 49 | Kimi K2 Thinking | Moonshot AI | 38.3% |
| 50 | Claude 4 Sonnet | Anthropic | 38.3% |
| 51 | Grok 4.20 Beta (Non-Reasoning) | xAI | 38.3% |
| 52 | DeepSeek V3.2 Exp | DeepSeek | 36.7% |
| 53 | GLM 4.6 | Z.AI | 35.0% |
| 54 | Gemini 3.1 Flash Lite Preview High | Google | 33.3% |
| 55 | Gemini 2.5 Pro (Max Thinking) | Google | 33.3% |
| 56 | Claude Haiku 4.5 | Anthropic | 33.3% |
| 57 | Grok Code Fast | xAI | 33.3% |
| 58 | Grok 4.1 Fast | xAI | 31.7% |
| 59 | DeepSeek V3.2 Exp Thinking | DeepSeek | 31.7% |
| 60 | Kimi K2 Instruct | Moonshot AI | 31.7% |
| 61 | Grok 4 | xAI | 30.0% |
| 62 | MiMo V2 Pro | Xiaomi | 30.0% |
| 63 | GPT-5.3 Instant | OpenAI | 28.3% |
| 64 | GPT-5.1 No Thinking | OpenAI | 28.3% |
| 65 | Gemini 2.5 Flash (Max Thinking) (2025-09-25) | Google | 23.3% |
| 66 | GPT-5 Nano High | OpenAI | 23.3% |
| 67 | Nemotron 3 Super 120B A12B | NVIDIA | 23.0% |
| 68 | Gemini 2.5 Flash (Max Thinking) (2025-06-05) | Google | 16.7% |
| 69 | GPT OSS 120b | OpenAI | 16.7% |
| 70 | Qwen 3 235B A22B Instruct 2507 | Alibaba | 13.3% |
| 71 | Qwen 3 Next 80B A3B Instruct | Alibaba | 10.0% |
| 72 | Grok 4.1 Fast (Non-Reasoning) | xAI | 10.0% |
| 73 | Qwen 3 Next 80B A3B Thinking | Alibaba | 8.3% |
| 74 | Qwen 3 235B A22B Thinking 2507 | Alibaba | 6.7% |
| 75 | Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17) | Google | 5.0% |
| 76 | GLM 5V Turbo | Z.AI | 3.3% |
| 77 | Qwen 3 32B | Alibaba | 3.3% |
| 78 | GLM 4.6V | Z.AI | 3.3% |
| 79 | Trinity Large Preview | Arcee | 3.3% |
| 80 | Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25) | Google | 1.7% |
| 81 | Qwen 3 30B A3B | Alibaba | 1.7% |
| 82 | Elephant Alpha | OpenRouter | 1.7% |

Need help choosing the right AI model?

Benchmarks are a starting point, not an answer. The right model depends on your workload, budget, and integration requirements. Let's work it out together.