Critique of Apple’s AI Testing | Limitations in Model Evaluation

Claude Opus has released a response to Apple's much-debated paper “The Illusion of Thinking.” The response carries a mirrored title, “The Illusion of the Illusion of Thinking,” lists Claude Opus as the first of two authors, and is already available on arXiv.

The document is short, only three pages, but it lays out a focused critique of the experiments conducted by Apple. Here are the main points:

First, the automatic evaluation system was flawed. It counted an answer as correct only when the model directly and completely enumerated every step of the solution, and so it ignored the important difference between “I can’t” and “I can, but I won’t.” The task-complexity metric was also questionable: difficulty was measured solely by the number of steps, without considering the diversity of solutions, NP-completeness, and other nuances.

Second, the models were tested on unsolvable problems. One example is the well-known “River Crossing” puzzle with N ≥ 6 elements and a boat capacity of 3. Such instances have no solution at all, yet the models apparently did not flag this and were scored zero, as if the missing answer were a failure on their part.
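The unsolvability claim can be checked by brute force. The sketch below is a minimal breadth-first search over the puzzle's states, not code from either paper. It assumes the usual actor/agent formulation: an actor may not be near another actor's agent unless their own agent is present, and the rule applies on both banks and in the boat. The names `is_safe` and `min_crossings` are made up for this illustration.

```python
from collections import deque
from itertools import combinations


def is_safe(group):
    """True if no actor is with a foreign agent while their own agent is absent."""
    agents = {i for kind, i in group if kind == "agent"}
    if not agents:
        return True  # no agents present, so no actor can be threatened
    return all(i in agents for kind, i in group if kind == "actor")


def min_crossings(n_pairs, boat_capacity):
    """Return the minimum number of boat trips, or None if no solution exists."""
    everyone = frozenset(
        (kind, i) for kind in ("actor", "agent") for i in range(n_pairs)
    )
    start = (everyone, "left")  # (people on the left bank, boat position)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (left, boat_side), trips = queue.popleft()
        if not left:
            return trips  # everyone has reached the right bank
        here = left if boat_side == "left" else everyone - left
        for size in range(1, boat_capacity + 1):
            for combo in combinations(sorted(here), size):
                passengers = frozenset(combo)
                if not is_safe(passengers):
                    continue  # the rule also applies inside the boat
                if boat_side == "left":
                    new_left = left - passengers
                else:
                    new_left = left | passengers
                if not (is_safe(new_left) and is_safe(everyone - new_left)):
                    continue
                state = (new_left, "right" if boat_side == "left" else "left")
                if state not in seen:
                    seen.add(state)
                    queue.append((state, trips + 1))
    return None  # search space exhausted without reaching the goal


if __name__ == "__main__":
    for n, capacity in [(3, 2), (6, 3)]:
        print(f"N={n}, capacity={capacity}: {min_crossings(n, capacity)}")
```

Under this encoding the search solves the classic three-pair, two-seat case and exhausts every reachable state for six pairs with a three-seat boat without finding a solution. A grader that only checks for a valid move list would therefore mark any answer to that instance as wrong.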

And third, the experiment apparently should not have capped the length of the reasoning. The original paper reports that on tasks like “Towers of Hanoi” the models broke down partway through the chain of reasoning, when in fact they stopped because they ran into token limits. If the task is posed differently, for example by asking the model to write a function that solves the problem, everything works without any issues.
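To see why the output format matters, compare the size of a solver with the size of its output. The sketch below is a standard recursive Tower of Hanoi generator, written for illustration and not taken from either paper. `TOKENS_PER_MOVE` is a purely hypothetical figure used to show how a fully enumerated answer grows.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) for an optimal solution with n disks."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest disk to its destination.
    yield (n, source, target)
    # Move the n-1 smaller disks on top of it.
    yield from hanoi_moves(n - 1, spare, target, source)


# Hypothetical cost of writing one move out in text; an assumption, not a measurement.
TOKENS_PER_MOVE = 10

if __name__ == "__main__":
    for n in (5, 10, 15):
        move_count = sum(1 for _ in hanoi_moves(n))  # equals 2**n - 1
        print(f"{n} disks: {move_count} moves, "
              f"roughly {move_count * TOKENS_PER_MOVE} tokens to list them all")
```

The function stays a few lines long however many disks there are, while the move list doubles with every added disk. A fixed output budget is therefore exhausted at some disk count regardless of whether the model knows the algorithm.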

Overall, the paper's point is that conclusions about how these models reason are being drawn from studies like Apple's, when in reality those experiments have pitfalls and limitations of their own. Welcome to 2025!