GPT-5.2 Faces Scrutiny: New Model's Real-World Performance Questioned Despite Benchmark Wins

The latest iteration of OpenAI’s large language model, GPT-5.2, is drawing unexpected criticism for its real-world performance, despite achieving high scores on traditional benchmarks. Early users report a series of concerning inconsistencies, including basic reasoning errors, such as miscounting letters in a word, and financially illogical advice, prompting closer scrutiny of the model’s practical utility versus its benchmark results. The hypothesis emerging from these observations is that GPT-5.2 may be ‘benchmark-maxed,’ optimized to excel on specific tests rather than demonstrating robust general intelligence.

Independent benchmarks, designed to assess more nuanced capabilities, paint a different picture. AI Explained’s ‘Simple Bench,’ comprising hard questions from various fields, places GPT-5.2 Pro significantly lower than expected: eighth, behind models like Claude 4 Opus and even Gemini 2.5 Pro. The ‘SKA Bench,’ a spatial reasoning test, showed a regression for GPT-5.2 (97% vs. GPT-5’s perfect 100% on a previous run) and a dismal 2% for its ‘no reasoning’ version, indicating a loss in a specific reasoning capacity.

Conversely, in a ‘Writing Arena’ project, GPT-5.2 demonstrated strong instruction following and adeptness at incorporating feedback, significantly improving essays after review. This contrasts sharply with Gemini 3 Pro, which was criticized for poor writing quality and an inability to apply feedback effectively.

Observers acknowledge GPT-5.2’s ‘smartness’ in certain domains, but the overall sentiment points to a model that is powerful without always being ‘better’ in steerability and practical application. Those traits are exemplified by faster, more focused models like Cursor’s Composer, which prioritizes speed and direct instruction following over raw intelligence. As one observer summarized, “GPT-5.2 is the smarter model. Opus 4.5 is the better model. And Gemini 3 Pro is indeed a model.”