
Apple Research — The Illusion of Thinking: The Thinking Revolution That Wasn’t

A deep dive into Apple’s recent research that proves what LLM engineers have long suspected: today’s reasoning models often fail to think when it matters most.

Mahesh
8 min read · Jun 8, 2025

tl;dr (Sort of 🫣)

If you’ve been working with large language models for a while, you’ve probably noticed something: those shiny new “reasoning” models like OpenAI’s o1, DeepSeek-R1, and Claude’s thinking variants aren’t quite the game-changers they were marketed to be. Sure, they can solve some math problems better, but there’s always been this nagging feeling that something fundamental is missing. And do they truly understand complex tasks — or are they just cleverly mimicking solutions?

This paper from Apple Research, “The Illusion of Thinking,” digs into that question with surgical precision. Using cleverly constructed puzzles and controlled experiments, the authors show that current reasoning-enabled LLMs (a.k.a. Large Reasoning Models or LRMs) exhibit clear limits. In fact, once the complexity of a problem hits a certain threshold, these models stop trying altogether — despite having plenty of computation left to use.

This confirms what many of us have been quietly thinking: these Large Reasoning Models (LRMs) aren’t actually reasoning in any meaningful sense. They’re just very sophisticated pattern matchers that happen to write out their “thoughts” before giving answers. And more importantly, they hit a hard wall when problems get complex enough.

Overthinking, performance collapse, inconsistent reasoning: what engineers working with LLMs have long suspected, this paper backs up with rigorous evidence.

The paper, “The Illusion of Thinking,” is a systematic takedown of the reasoning capabilities we thought these models possessed. But it’s not just another negative result — it’s a carefully crafted investigation that reveals exactly where and why these models fail, using controllable puzzle environments.

The Problem with Math Benchmarks (And Why We Needed Better Tests)

Before diving into the meat of the research, let’s talk about why this study was necessary.

LLMs have evolved from just language generators into what some call “thinking agents.” With techniques like Chain-of-Thought (CoT), self-verification, and Reinforcement Learning with feedback, models now reason out loud before giving a final answer.

This “thinking” boosts performance on benchmarks like math problems and coding challenges, and it’s inspired a whole new class of LLMs: Large Reasoning Models (LRMs). Examples include OpenAI’s o3, DeepSeek-R1, Gemini Thinking, and Claude 3.7 Sonnet (Thinking).

The problem? These benchmarks might be contaminated by training data, and they don’t show us what’s happening inside the model’s thought process. Are these LRMs really reasoning — or just generating more verbose patterns?

In the paper, the researchers found something telling when they compared thinking vs. non-thinking models on these math benchmarks. On MATH-500, both model types performed similarly when given the same compute budget, but on AIME24 and AIME25 the thinking models pulled further ahead. Here’s the kicker: humans actually scored better on AIME25 than on AIME24, suggesting AIME25 is the easier set, yet the models did worse on it. The most plausible explanation is data contamination: the older AIME24 problems have likely leaked into training data and inflated scores there.

This is exactly why the research team turned to puzzle environments.

So They Built a Puzzle Lab…

The researchers designed four puzzle environments with tightly controlled complexity.

Each puzzle has simple, well-defined rules. As you increase complexity (like more disks, more blocks, more actors), the model needs to plan deeper and reason longer.

Illustration of the four puzzle environments. Columns show the progression from initial state (top) through intermediate state (middle) to target state (bottom) for puzzles: Tower of Hanoi (disk transfer across pegs), Checkers Jumping (position swapping of colored tokens), River Crossing (transporting entities across a river), and Blocks World (stack reconfiguration). Source: The Illusion of Thinking
  1. Tower of Hanoi: The classic disk-moving puzzle where complexity scales exponentially
  2. Checker Jumping: A one-dimensional puzzle about swapping colored pieces
  3. River Crossing: A constraint satisfaction problem with actors and agents
  4. Blocks World: Stack rearrangement with minimum moves

Each puzzle allows fine-grained control over complexity while maintaining the same underlying logical structure. This is brilliant experimental design — instead of trying to guess whether one math problem is harder than another, you can literally dial the difficulty up by adding more disks, checkers, or blocks.
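To make that dial concrete: for two of the puzzles, the minimum number of moves has a simple closed form, so you can see exactly how fast the required plan length grows with N. A quick sketch (these are standard results for the classic puzzles, not code from the paper):

```python
# Minimum number of moves required to solve each puzzle as a function of N.
# Closed-form results for the classic puzzles, not taken from the paper's code.

def hanoi_min_moves(n_disks: int) -> int:
    """Tower of Hanoi: moving N disks takes 2^N - 1 moves (exponential)."""
    return 2 ** n_disks - 1

def checker_min_moves(n_per_side: int) -> int:
    """Checker Jumping with N tokens per side: N^2 + 2N moves (quadratic)."""
    return n_per_side ** 2 + 2 * n_per_side

for n in (3, 5, 10, 15, 20):
    print(f"N={n:>2}  Hanoi: {hanoi_min_moves(n):>8}  Checkers: {checker_min_moves(n):>4}")
```

Adding a single disk doubles the length of the shortest correct plan, which is exactly the kind of pressure you want when probing how deep a model can actually reason.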

These puzzles aren’t just cute toys — they’re powerful diagnostic tools. They:

  • Avoid data contamination.
  • Allow precise control over problem size.
  • Force true planning and logical steps.
  • Let researchers test both final answers and the step-by-step thoughts (see the validator sketch after this list).
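Because the rules are mechanical, checking a model’s output takes only a few lines of code. Here is a minimal Tower of Hanoi validator as an illustration; the (disk, from_peg, to_peg) move format is my assumption, not necessarily the exact format the paper prompts for:

```python
# Minimal Tower of Hanoi move checker: verifies that a proposed move sequence is
# legal and ends with every disk on the target peg. Move format (disk, from, to)
# is assumed for illustration.

def validate_hanoi(n_disks: int, moves: list[tuple[int, int, int]]) -> bool:
    pegs = {0: list(range(n_disks, 0, -1)), 1: [], 2: []}  # peg 0 holds disks N..1, smallest on top
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # solved only if all disks reached peg 2

# Example: the optimal 7-move solution for 3 disks passes the check.
solution_3 = [(1, 0, 2), (2, 0, 1), (1, 2, 1), (3, 0, 2), (1, 1, 0), (2, 1, 2), (1, 0, 2)]
print(validate_hanoi(3, solution_3))  # True
```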

What Did They Find? (Spoiler: It’s Not Great)

Using frontier models like Claude 3.7 (thinking vs non-thinking) and DeepSeek-R1 vs DeepSeek-V3, they found three distinct regimes of reasoning:

Pass@k performance of thinking vs. non-thinking models across equivalent compute budgets in puzzle environments of low, medium and high complexity. Non-thinking models excel in simple problems, thinking models show advantages at medium complexity, while both approaches fail at high complexity regardless of compute allocation. Source: The Illusion of Thinking

Regime 1: Low Complexity — The Overthinking Problem

At low complexity, something surprising happens: standard LLMs actually outperform their thinking counterparts. The reasoning models find the correct answer early but then waste tokens exploring wrong paths. It’s like watching someone solve a simple addition problem correctly, then second-guessing themselves into the wrong answer.

Regime 2: Medium Complexity — Where Thinking Shines

This is the sweet spot where LRMs justify their existence. The extra thinking time and self-reflection mechanisms start paying dividends. The models explore multiple approaches and course-correct when they hit dead ends.

Regime 3: High Complexity — Universal Collapse

Both model types completely fall apart. But here’s the really weird part: as problems get harder, thinking models actually start thinking less. Despite having plenty of tokens left in their budget, they give shorter and shorter responses as complexity increases.

The Collapse: When Smart Models Get Lazy

The most striking finding is what happens at the complexity threshold. Look at this pattern across different state-of-the-art reasoning models:

Accuracy and thinking tokens vs. problem complexity for reasoning models across puzzle environments. As complexity increases, reasoning models initially spend more tokens while accuracy declines gradually, until a critical point where reasoning collapses — performance drops sharply and reasoning effort decreases. Source: The Illusion of Thinking

Every single reasoning model — o3-mini, DeepSeek-R1, Claude 3.7 Sonnet — follows the same pattern:

  1. Accuracy gradually declines with complexity
  2. Thinking tokens initially increase with problem difficulty
  3. At a critical threshold, both accuracy and thinking effort collapse simultaneously

This isn’t a gradual degradation — it’s a cliff. And the fact that models reduce their reasoning effort right when they need it most suggests something fundamental about how these systems work (or don’t work).

The authors also showed that giving the model the algorithm directly doesn’t help. Even when told exactly how to solve Tower of Hanoi, the model still fails on complex cases.

Inside the Mind of a “Thinking” Model

The researchers didn’t stop at measuring accuracy. They dove deep into the actual reasoning traces to see what these models are doing when they “think.” Using puzzle simulators, they could validate every intermediate solution attempt within the thinking process.

Source: The Illusion of Thinking

The patterns they found are fascinating and somewhat depressing 😞:

In simple problems: Models find the right answer quickly but then explore wrong alternatives. The correct solutions appear early in the thinking process, while incorrect ones pile up toward the end. Classic overthinking.

In medium problems: The pattern reverses. Models struggle initially, exploring many wrong paths before eventually finding correct solutions later in their thinking.

In complex problems: Almost everything the model generates is wrong, regardless of position in the thinking trace.
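A rough sketch of how this kind of positional analysis can be done (my illustration of the idea, not the paper’s code): tag each solution attempt extracted from a thinking trace with its relative position in the trace and whether a simulator accepts it, then compare where correct and incorrect attempts cluster.

```python
# Sketch: each extracted solution attempt is tagged with its relative position in the
# thinking trace (0.0 = start, 1.0 = end) and whether a puzzle simulator accepted it.
# The attempt list below is purely illustrative, not data from the paper.

def mean_position(attempts: list[tuple[float, bool]], correct: bool):
    positions = [pos for pos, ok in attempts if ok == correct]
    return sum(positions) / len(positions) if positions else None

# Hypothetical trace for an easy instance: correct answers early, wrong detours later.
attempts = [(0.1, True), (0.3, True), (0.6, False), (0.8, False), (0.9, False)]
print("mean position of correct attempts:", mean_position(attempts, True))
print("mean position of incorrect attempts:", mean_position(attempts, False))
```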

The Algorithm Test: When Cheating Still Doesn’t Help

Here’s perhaps the most damning finding. The researchers gave models the complete algorithm for solving Tower of Hanoi — basically the cheat codes. The models just had to execute the given steps.

Source: The Illusion of Thinking

Performance didn’t improve. At all.

Think about what this means. These models can’t even follow explicit, step-by-step instructions when the complexity gets high enough. They’re not failing because they can’t devise solutions — they’re failing because they can’t execute solutions even when handed to them on a silver platter.
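For context, the “cheat codes” are tiny. The standard recursive solution, shown below as a sketch (the well-known textbook algorithm, not the exact pseudocode from the paper’s prompt), generates the full optimal move list with no search at all:

```python
# Standard recursive Tower of Hanoi solution: emits the complete optimal move list.
# This is the textbook algorithm, not the exact prompt text used in the paper.

def solve_hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int, int]]:
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, dst, aux)    # move the top n-1 disks out of the way
        + [(n, src, dst)]                    # move the largest disk to the target peg
        + solve_hanoi(n - 1, aux, src, dst)  # move the n-1 disks back onto the largest disk
    )

moves = solve_hanoi(10)
print(len(moves))  # 1023 moves (2^10 - 1); executing them needs no insight, only bookkeeping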

What This Means for Practitioners

If you’re working with reasoning models in production, this research has several important implications:

1. Know Your Complexity Limits

Every reasoning model has a complexity cliff. Find it early in your testing, because performance doesn’t degrade gracefully — it falls off a cliff.
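In practice that means sweeping the complexity knob in your own evals. A rough sketch of such a probe is below; ask_model and validate are placeholders you would supply for your own task, not a real API:

```python
# Sketch of a complexity sweep: increase problem size N until accuracy collapses.
# `ask_model` and `validate` are placeholders for your model call and answer checker.

def find_complexity_cliff(ask_model, validate, sizes=range(1, 16), trials=10, floor=0.2):
    accuracy_by_size = {}
    for n in sizes:
        correct = sum(validate(n, ask_model(n)) for _ in range(trials))
        accuracy_by_size[n] = correct / trials
        if accuracy_by_size[n] < floor:  # accuracy fell off the cliff; stop probing
            break
    return accuracy_by_size
```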

2. Simple Problems Don’t Need Reasoning Models

For straightforward tasks, standard LLMs are often more efficient and accurate. Don’t pay the reasoning tax unless you’re in that medium-complexity sweet spot.

3. Validation Is Critical

These models can produce very confident-sounding reasoning that’s completely wrong. If you’re using them for anything important, you need robust validation mechanisms.

4. Beware the Token Budget Paradox

Just because a model has tokens left doesn’t mean it will use them effectively. More tokens ≠ better thinking: as problems get harder, reasoning models actually think less and may give up rather than trying harder.

5. Reasoning Models Aren’t There Yet

LRMs collapse under pressure. Their reasoning is shallow and inconsistent beyond a certain point.

6. Benchmarks Lie. Puzzles Don’t

Puzzle environments reveal model behavior with surgical precision — and should be the new gold standard for reasoning tests.

7. Giving Them the Answer Doesn’t Help

Even with step-by-step algorithms, models still fail. That’s a serious limitation for exact computation tasks.

The Bigger Picture: What Is Reasoning Anyway?

This research raises fundamental questions about what we mean by “reasoning” in AI systems. These models can:

  • Generate coherent explanations
  • Show their work step by step
  • Sometimes self-correct mistakes
  • Handle moderately complex problems

But they can’t:

  • Scale to truly complex problems
  • Follow explicit algorithms reliably
  • Reason consistently across problem types
  • Avoid catastrophic failure modes

The paper’s title — “The Illusion of Thinking” — captures this perfectly. These models create a compelling illusion of reasoning, but underneath they’re still fundamentally pattern matching systems with all the limitations that entails.

Open Questions Worth Exploring

  • Can we design training paradigms that teach true reasoning — not just verbose pattern-matching?
  • Why do models give up despite having ample compute left?
  • Can we detect when a model is “pretending to think” but has already collapsed?
  • Are there architectures better suited for deep compositional planning?

Final Thoughts: Keeping Expectations Realistic

This paper doesn’t just show cracks in the foundation of reasoning models — it shows that we may need to rethink what “thinking” really means for AI. As we build more advanced LLMs, puzzle environments like the ones in this study will be essential to track progress — not just in answers, but in how those answers are formed.

For those of us building real systems, that’s actually valuable information. Better to know the limitations upfront than discover them in production. And who knows? Maybe understanding these failure modes will point the way toward the next breakthrough.
