AI Operator

Apple vs. OpenAI

Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, o1, and o3-mini don't actually reason at all.

Marc Baumann and Sangam Bharti
Jun 10, 2025

A fresh Apple study just dropped a bombshell:

The most hyped reasoning models — Claude, DeepSeek, o3 — fail at real reasoning.

They just memorize patterns really well.

And when things get hard, they give up.

Let’s dig in.

What are Large Reasoning Models (LRMs)?

LRMs are an evolution of LLMs, built for reasoning, problem-solving, and logic, not just text generation.

Key Features

  • Structured Thinking: Use Chain-of-Thought (CoT) to break problems into steps (a short prompt sketch follows the figure below)

  • Interpretable Output: Intermediate steps make decisions more transparent

  • Problem-Solving: Excel at math, code, and multi-step tasks

  • Multimodal Ready: Can work across text, images, and structured data

Examples: OpenAI o1 / o3, DeepSeek-R1, Claude 3.7 Sonnet

[Figure: Comparison of Large Reasoning Models (LRMs)]
(Source: Intuition Machine)
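
To make the "Structured Thinking" bullet concrete, here is a minimal sketch of a Chain-of-Thought prompt next to a direct prompt, using the OpenAI Python SDK. The model name and prompt wording are illustrative assumptions, not the setup from Apple's paper.

```python
# Minimal sketch: direct prompt vs. Chain-of-Thought prompt.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name and prompt wording are illustrative, not from Apple's paper.
from openai import OpenAI

client = OpenAI()

QUESTION = "A train leaves at 3:40 pm and arrives at 6:15 pm. How long is the trip?"

def ask(system_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model choice for the sketch
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": QUESTION},
        ],
    )
    return response.choices[0].message.content

# Direct answer: one shot, no visible intermediate steps.
direct = ask("Answer with the final result only.")

# Chain-of-Thought: ask for intermediate steps, which is roughly what
# LRMs like o1/o3 or DeepSeek-R1 do internally at much larger scale.
cot = ask("Think step by step, show your intermediate reasoning, then give the final answer.")

print(direct)
print(cot)
```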

👉Get your brand in front of 30,000+ decision-makers — book your ad spot now.

Bottom line: LRMs are smarter, but not always sharper, and knowing when to use them is the real edge.

What happened: Apple threw models like OpenAI’s o1 and DeepSeek-R1 into puzzle tests (think: Tower of Hanoi, River Crossing) to see how well they actually “reason” (a small Tower of Hanoi sketch follows the list below):

  • Low complexity: Non-thinking LLMs outperform. They’re faster and more accurate.

  • Medium complexity: Reasoning LLMs shine — long Chain-of-Thought helps.

  • High complexity: Both collapse. Reasoning fails completely, with models giving up before exhausting their token budget.

    (Source: Apple)
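
Why does "high complexity" break them? Tower of Hanoi makes it visible: the solution is a textbook recursive algorithm, but the number of moves that must be stated exactly grows as 2^n - 1. Here is a small, self-contained sketch of that scaling (my own illustration, not Apple's test harness).

```python
# Tower of Hanoi: one of the puzzles Apple used to scale difficulty.
# Illustrative sketch of the move-count explosion, not Apple's evaluation code.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)      # park n-1 disks on the spare peg
        + [(src, dst)]                         # move the largest disk
        + hanoi_moves(n - 1, aux, src, dst)    # re-stack the n-1 disks on top of it
    )

# The move count doubles with every extra disk: 2**n - 1 exact steps,
# every one of which a model must state correctly to solve the puzzle.
for n in range(1, 13):
    print(f"{n:2d} disks -> {len(hanoi_moves(n)):5d} moves")
```

Seven exact moves at 3 disks, 4,095 at 12. Even a tiny per-step error rate compounds into near-certain failure at that length, which is consistent with the collapse Apple reports.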

And here’s where it gets weird:

  • As tasks get harder, LRMs think less. They use fewer reasoning tokens, even with plenty of budget left.

  • On easy stuff, they overthink. They find the right answer, then talk themselves out of it.

This kills the myth that Chain of Thought (CoT) = intelligence. More “thinking” doesn’t mean better reasoning. It just means longer (and often worse) outputs.
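
If you want to check the "thinking less as it gets harder" effect on your own stack, one approach is to sweep puzzle size and log how many reasoning tokens the model actually spends against its budget. The sketch below is a harness outline only: count_reasoning_tokens() is a hypothetical helper you would implement around your provider's usage metadata, and the budget number is an assumption; none of this is Apple's measurement code.

```python
# Sketch: does "thinking effort" keep growing with difficulty, or shrink?
# count_reasoning_tokens() is a HYPOTHETICAL helper you would implement around
# your provider's usage metadata; none of this is Apple's measurement code.

def count_reasoning_tokens(prompt: str) -> int:
    """Hypothetical: call your reasoning model and return the tokens it spent 'thinking'."""
    raise NotImplementedError("wrap your provider's API and usage fields here")

def hanoi_prompt(n: int) -> str:
    return (
        f"Solve Tower of Hanoi with {n} disks on pegs A, B, C. "
        "List every move as 'disk: from -> to'."
    )

TOKEN_BUDGET = 32_000  # assumed per-call thinking budget, not a vendor default

for n in range(3, 13):
    spent = count_reasoning_tokens(hanoi_prompt(n))
    print(f"{n:2d} disks: {spent:6d} / {TOKEN_BUDGET} reasoning tokens")
    # Apple's finding: past a critical size, 'spent' starts falling even though
    # the budget is nowhere near exhausted -- the model effectively gives up.
```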

Instead, the research shows:

  • Chain-of-Thought ≠ robust reasoning

  • Models don’t generalize reasoning patterns

  • Exact computation is still out of reach, even when the solution algorithm is spelled out in the prompt (a minimal verifier sketch follows this list)
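
That last bullet is easy to check yourself, because scoring a puzzle solution doesn't need another model, just a simulator that replays the proposed moves and rejects any illegal one. Here is a minimal Tower of Hanoi verifier in that spirit (my own sketch of the idea, not Apple's evaluation code).

```python
# Minimal Tower of Hanoi verifier: replay a proposed move list and check legality.
# A sketch of the "simulate and verify" idea, not Apple's actual evaluation code.

def is_valid_solution(n: int, moves: list[tuple[str, str]]) -> bool:
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # disk n at the bottom of A
    for src, dst in moves:
        if not pegs[src]:
            return False                       # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks on C, in order

# Example: the optimal 3-disk solution passes; a wrong final move fails.
good = [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C"), ("B", "A"), ("B", "C"), ("A", "C")]
print(is_valid_solution(3, good))                      # True
print(is_valid_solution(3, good[:-1] + [("A", "B")]))  # False
```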

Why it matters: Companies are pouring money into AI for big, complex jobs—like forecasting or risk analysis.

But here’s the catch: if LRMs break down when things get tough, that’s a costly mistake.

Most benchmarks? Flawed. They’re stuffed with data the models have already seen, which hides real weaknesses. Apple’s clean, controllable puzzles show what these models can really handle, and it’s not much.

And the waste? Wild. LRMs often overthink easy tasks, using way more compute than needed, which drives up your costs for no real benefit.

Devil’s advocate: Apple missed the first wave of AI.

Now it’s debunking the core claims of its biggest competitors.

Strategic acumen? Maybe.

But Apple is not alone.

Meta's chief AI scientist, Yann LeCun, believes LLMs will be largely obsolete within five years.

The message is clear:

We don’t have reasoning agents yet.

We have glorified autocomplete engines with good marketing.

Take care,

Marc

PS: Are you an AI vendor? Work with us to build enterprise visibility, credibility & trust. We’re helping Avalanche, Near, MoonPay, and others do the same.

For PRO readers: What this means for you

If you're deploying AI copilots or automation agents for complex reasoning tasks…

This post is for paid subscribers
