Gemini 3.1 Pro Shatters AI Benchmarks, Yet Developers Lament Persistent Usability Challenges
Google’s Gemini 3.1 Pro Preview has reportedly set new industry standards for raw AI intelligence, outperforming competitors such as Opus 4.6 Max and GPT-5 across numerous benchmarks. It scored four points higher on the Artificial Intelligence Index at a fraction of the cost ($892 vs. ~$2,500 for Opus 4.6 Max) and posted an “insane” 78% on ARC AGI 2. External validation and internal testing confirm its exceptional knowledge and spatial reasoning: it scores a consistent 100% on the Skate Bench and is the first model to produce usable, animatable SVGs for complex prompts. Gemini 3.1 Pro also demonstrated superior knowledge and low hallucination rates on Artificial Analysis’s Omniscience benchmark, along with impressive results on the Convex LLM leaderboard, reaching nearly 95% accuracy on complex backend tasks when given explicit guidelines. Its peculiar ability to achieve a perfect “snitch” score on SnitchBench, reporting medical malpractice 100% of the time in certain scenarios, further underscores its distinct, benchmark-optimized intelligence.
Despite its unparalleled intellectual prowess, the practical usability of Gemini 3.1 Pro presents a jarring contrast for developers. Testers report a deeply frustrating developer experience, primarily due to Google’s command-line interface (CLI), which is described as “unusable”: frequently buggy, prone to randomly switching models, and lacking Day 1 support for 3.1 Pro. A major point of contention is the model’s inconsistent and often flawed tool calling. Unlike models from Anthropic (e.g., Haiku’s reliability) or those tuned by third parties (such as Cursor’s adapted Gemini), Gemini 3.1 Pro struggles with basic tool execution, often overusing, underusing, or incorrectly formatting calls. This leads to inefficient, token-wasting loops and demands constant human oversight, reminiscent of earlier, less mature LLM iterations. While incredibly smart, the model lacks the “competence” for sustained agentic tasks, gets easily confused, and exhibits bizarre failures such as a hardcoded file-reading limit (100 lines) and hallucinated, non-existent packages, indicating a fundamental disconnect between benchmark optimization and real-world software development needs.
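The reports above don’t include code, but the kind of oversight testers describe often takes the form of a validation layer between the model and the tools it calls. Below is a minimal, hypothetical TypeScript sketch of that idea; the `ToolCall` shape, the `read_file` tool, and the validation rules are assumptions for illustration, not anything from Google’s tooling or the testers’ setups. It shows how a harness might reject hallucinated or malformed tool calls before execution rather than letting the model burn tokens in a retry loop.

```typescript
// Hypothetical shape of a tool call emitted by a model; not an actual Gemini API type.
interface ToolCall {
  name: string;
  // Arguments often arrive as a JSON string in tool-calling protocols.
  argsJson: string;
}

// A registry of tools the harness actually exposes, each with an argument validator
// that returns an error message or null when the arguments look acceptable.
const toolRegistry: Record<string, (args: unknown) => string | null> = {
  read_file: (args) => {
    const a = args as { path?: unknown; maxLines?: unknown };
    if (typeof a?.path !== "string") return "read_file: 'path' must be a string";
    if (a.maxLines !== undefined && typeof a.maxLines !== "number")
      return "read_file: 'maxLines' must be a number if present";
    return null;
  },
};

// Result of screening a single tool call before it is ever executed.
type Screened =
  | { ok: true; name: string; args: unknown }
  | { ok: false; reason: string };

function screenToolCall(call: ToolCall): Screened {
  // Reject calls to tools that don't exist (e.g. hallucinated packages or functions).
  const validate = toolRegistry[call.name];
  if (!validate) return { ok: false, reason: `unknown tool '${call.name}'` };

  // Reject calls whose arguments aren't even valid JSON (incorrectly formatted calls).
  let args: unknown;
  try {
    args = JSON.parse(call.argsJson);
  } catch {
    return { ok: false, reason: `malformed JSON arguments for '${call.name}'` };
  }

  // Reject calls whose arguments don't match the tool's expected schema.
  const error = validate(args);
  return error ? { ok: false, reason: error } : { ok: true, name: call.name, args };
}

// Example: a hallucinated tool and a malformed argument payload are both caught,
// so they can be returned to the model as corrective feedback instead of executed.
console.log(screenToolCall({ name: "npm_install_imaginary_pkg", argsJson: "{}" }));
console.log(screenToolCall({ name: "read_file", argsJson: "{path: src/main.ts}" }));
```

Harnesses tuned by third parties, such as Cursor’s adapted Gemini mentioned above, reportedly layer this sort of adaptation on top of the raw model, which may explain why they behave more reliably than the stock CLI experience.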