| 1 | gpt-5 (high) · OpenAI | 88.0% correct | — | — | — | — |
| 2 | gpt-5 (medium) · OpenAI | 86.7% correct | — | — | — | — |
| 3 | o3-pro (high) · OpenAI | 84.9% correct | — | — | — | — |
| 4 | gemini-2.5-pro-preview-06-05 (32k think) · Google DeepMind | 83.1% correct | — | — | — | — |
| 5 | gpt-5 (low) · OpenAI | 81.3% correct | — | — | — | — |
| 6 | o3 (high) · OpenAI | 81.3% correct | — | — | — | — |
| 7 | grok-4 (high) · xAI | 79.6% correct | — | — | — | — |
| 8 | gemini-2.5-pro-preview-06-05 (default think) · Google DeepMind | 79.1% correct | — | — | — | — |
| 9 | o3 (high) + gpt-4.1 · OpenAI | 78.2% correct | — | — | — | — |
| 10 | O3 · OpenAI | 76.9% correct | — | — | — | — |
| 11 | Gemini 2.5 Pro Preview 05-06 · Google DeepMind | 76.9% correct | — | — | — | — |
| 12 | DeepSeek-V3.2-Exp (Reasoner) · DeepSeek | 74.2% correct | — | — | — | — |
| 13 | Gemini 2.5 Pro Preview 03-25 · Google DeepMind | 72.9% correct | — | — | — | — |
| 14 | claude-opus-4-20250514 (32k thinking) · Anthropic | 72.0% correct | — | — | — | — |
| 15 | o4-mini (high) · OpenAI | 72.0% correct | — | — | — | — |
| 16 | DeepSeek R1 (0528) · DeepSeek | 71.4% correct | — | — | — | — |
| 17 | claude-opus-4-20250514 (no think) · Anthropic | 70.7% correct | — | — | — | — |
| 18 | DeepSeek-V3.2-Exp (Chat) · DeepSeek | 70.2% correct | — | — | — | — |
| 19 | claude-3-7-sonnet-20250219 (32k thinking tokens) · Anthropic | 64.9% correct | — | — | — | — |
| 20 | DeepSeek R1 + claude-3-5-sonnet-20241022 · DeepSeek | 64.0% correct | — | — | — | — |
| 21 | o1-2024-12-17 (high) · OpenAI | 61.7% correct | — | — | — | — |
| 22 | claude-sonnet-4-20250514 (32k thinking) · Anthropic | 61.3% correct | — | — | — | — |
| 23 | claude-3-7-sonnet-20250219 (no thinking) · Anthropic | 60.4% correct | — | — | — | — |
| 24 | o3-mini (high) · OpenAI | 60.4% correct | — | — | — | — |
| 25 | Qwen3 235B A22B diff, no think, Alibaba API · Alibaba | 59.6% correct | — | — | — | — |
| 26 | Kimi K2 · Moonshot AI | 59.1% correct | — | — | — | — |
| 27 | DeepSeek R1 · DeepSeek | 56.9% correct | — | — | — | — |
| 28 | claude-sonnet-4-20250514 (no thinking) · Anthropic | 56.4% correct | — | — | — | — |
| 29 | gemini-2.5-flash-preview-05-20 (24k think) · Google DeepMind | 55.1% correct | — | — | — | — |
| 30 | DeepSeek V3 (0324) · DeepSeek | 55.1% correct | — | — | — | — |
| 31 | Quasar Alpha · Unknown | 54.7% correct | — | — | — | — |
| 32 | o3-mini (medium) · OpenAI | 53.8% correct | — | — | — | — |
| 33 | Grok 3 Beta · xAI | 53.3% correct | — | — | — | — |
| 34 | Optimus Alpha · Unknown | 52.9% correct | — | — | — | — |
| 35 | Gpt 4.1 · OpenAI | 52.4% correct | — | — | — | — |
| 36 | Claude 3 5 Sonnet 20241022 · Anthropic | 51.6% correct | — | — | — | — |
| 37 | Grok 3 Mini Beta (high) · xAI | 49.3% correct | — | — | — | — |
| 38 | DeepSeek Chat V3 (prev) · DeepSeek | 48.4% correct | — | — | — | — |
| 39 | gemini-2.5-flash-preview-04-17 (default) · Google DeepMind | 47.1% correct | — | — | — | — |
| 40 | chatgpt-4o-latest (2025-03-29) · Unknown | 45.3% correct | — | — | — | — |
| 41 | Gpt 4.5 Preview · OpenAI | 44.9% correct | — | — | — | — |
| 42 | gemini-2.5-flash-preview-05-20 (no think) · Google DeepMind | 44.0% correct | — | — | — | — |
| 43 | gpt-oss-120b (high) · OpenAI | 41.8% correct | — | — | — | — |
| 44 | Qwen3 32B · Alibaba | 40.0% correct | — | — | — | — |
| 45 | Gemini Exp 1206 · Google DeepMind | 38.2% correct | — | — | — | — |
| 46 | Gemini 2.0 Pro exp-02-05 · Google DeepMind | 35.6% correct | — | — | — | — |
| 47 | Grok 3 Mini Beta (low) · xAI | 34.7% correct | — | — | — | — |
| 48 | O1 Mini 2024 09 12 · OpenAI | 32.9% correct | — | — | — | — |
| 49 | Gpt 4.1 Mini · OpenAI | 32.4% correct | — | — | — | — |
| 50 | Claude 3 5 Haiku 20241022 · Anthropic | 28.0% correct | — | — | — | — |
| 51 | chatgpt-4o-latest (2025-02-15) · Unknown | 27.1% correct | — | — | — | — |
| 52 | QwQ-32B + Qwen 2.5 Coder Instruct · Unknown | 26.2% correct | — | — | — | — |
| 53 | Gpt 4o 2024 08 06 · OpenAI | 23.1% correct | — | — | — | — |
| 54 | Gemini 2.0 Flash Exp · Google DeepMind | 22.2% correct | — | — | — | — |
| 55 | Qwen Max 2025 01 25 · Alibaba | 21.8% correct | — | — | — | — |
| 56 | QwQ 32B · Unknown | 20.9% correct | — | — | — | — |
| 57 | Gemini 2.0 Flash Thinking Exp 01 21 · Google DeepMind | 18.2% correct | — | — | — | — |
| 58 | Gpt 4o 2024 11 20 · OpenAI | 18.2% correct | — | — | — | — |
| 59 | DeepSeek Chat V2.5 · DeepSeek | 17.8% correct | — | — | — | — |
| 60 | Qwen2.5 Coder 32B Instruct · Alibaba | 16.4% correct | — | — | — | — |
| 61 | Llama 4 Maverick · Meta | 15.6% correct | — | — | — | — |
| 62 | Yi Lightning · Unknown | 12.9% correct | — | — | — | — |
| 63 | Command A 03 2025 Quality · Unknown | 12.0% correct | — | — | — | — |
| 64 | Codestral 25.01 · Mistral | 11.1% correct | — | — | — | — |
| 65 | Openhands Lm 32b v0.1 · Unknown | 10.2% correct | — | — | — | — |
| 66 | Gpt 4.1 Nano · OpenAI | 8.9% correct | — | — | — | — |
| 67 | Qwen2.5 Coder 32B Instruct · Alibaba | 8.0% correct | — | — | — | — |
| 68 | Gemma 3 27b It · Google DeepMind | 4.9% correct | — | — | — | — |
| 69 | Gpt 4o Mini 2024 07 18 · OpenAI | 3.6% correct | — | — | — | — |