AI Operator

Apple vs. OpenAI

Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, o1, and o3-mini don't actually reason at all.

Marc Baumann and Sangam Bharti
Jun 10, 2025

A fresh Apple study just dropped a bombshell:

The most hyped reasoning models — Claude, DeepSeek, o3 — fail at real reasoning.

They just memorize patterns really well.

And when things get hard, they give up.

Let’s dig in.

What are Large Reasoning Models (LRMs)?

LRMs are an evolution of LLMs, built for reasoning, problem-solving, and logic, not just text generation.

Key Features

  • Structured Thinking: Use Chain-of-Thought (CoT) to break problems into steps (a short prompt sketch follows the figure below)

  • Interpretable Output: Intermediate steps make decisions more transparent

  • Problem-Solving: Excel at math, code, and multi-step tasks

  • Multimodal Ready: Can work across text, images, and structured data

Examples: OpenAI o1 / o3, DeepSeek-R1, Claude 3.7 Sonnet

[Figure: Comparison of Large Reasoning Models (LRMs)]
(Source: Intuition Machine)
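
To make the "Structured Thinking" bullet concrete, here is a minimal sketch of a Chain-of-Thought prompt next to a direct prompt, using the OpenAI Python SDK. The model name and prompt wording are illustrative assumptions, not the setup from Apple's paper.

```python
# Minimal sketch: direct prompt vs. Chain-of-Thought prompt.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name and prompt wording are illustrative, not from Apple's paper.
from openai import OpenAI

client = OpenAI()

QUESTION = "A train leaves at 3:40 pm and arrives at 6:15 pm. How long is the trip?"

def ask(system_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model choice for the sketch
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": QUESTION},
        ],
    )
    return response.choices[0].message.content

# Direct answer: one shot, no visible intermediate steps.
direct = ask("Answer with the final result only.")

# Chain-of-Thought: ask for intermediate steps, which is roughly what
# LRMs like o1/o3 or DeepSeek-R1 do internally at much larger scale.
cot = ask("Think step by step, show your intermediate reasoning, then give the final answer.")

print(direct)
print(cot)
```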

👉Get your brand in front of 30,000+ decision-makers — book your ad spot now.

Bottom line: LRMs are smarter, but not always sharper, and knowing when to use them is the real edge.

What happened: Apple threw models like OpenAI’s o1 and DeepSeek-R1 into puzzle tests (think: Tower of Hanoi, River Crossing) to see how well they actually “reason” (a small Tower of Hanoi sketch follows the list below):

  • Low complexity: Non-thinking LLMs outperform. They’re faster and more accurate.

  • Medium complexity: Reasoning LLMs shine — long Chain-of-Thought helps.

  • High complexity: Both collapse. Reasoning fails completely, with models giving up before exhausting their token budget.

    (Source: Apple)
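
Why does "high complexity" break them? Tower of Hanoi makes it visible: the solution is a textbook recursive algorithm, but the number of moves that must be stated exactly grows as 2^n - 1. Here is a small, self-contained sketch of that scaling (my own illustration, not Apple's test harness).

```python
# Tower of Hanoi: one of the puzzles Apple used to scale difficulty.
# Illustrative sketch of the move-count explosion, not Apple's evaluation code.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)      # park n-1 disks on the spare peg
        + [(src, dst)]                         # move the largest disk
        + hanoi_moves(n - 1, aux, src, dst)    # re-stack the n-1 disks on top of it
    )

# The move count doubles with every extra disk: 2**n - 1 exact steps,
# every one of which a model must state correctly to solve the puzzle.
for n in range(1, 13):
    print(f"{n:2d} disks -> {len(hanoi_moves(n)):5d} moves")
```

Seven exact moves at 3 disks, 4,095 at 12. Even a tiny per-step error rate compounds into near-certain failure at that length, which is consistent with the collapse Apple reports.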

And here’s where it gets weird:

  • As tasks get harder, LRMs think less. They use fewer reasoning tokens, even with plenty of budget left.

  • On easy stuff, they overthink. They find the right answer, then talk themselves out of it.

This kills the myth that Chain of Thought (CoT) = intelligence. More “thinking” doesn’t mean better reasoning. It just means longer (and often worse) outputs.
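
If you want to check the "thinking less as it gets harder" effect on your own stack, one approach is to sweep puzzle size and log how many reasoning tokens the model actually spends against its budget. The sketch below is a harness outline only: count_reasoning_tokens() is a hypothetical helper you would implement around your provider's usage metadata, and the budget number is an assumption; none of this is Apple's measurement code.

```python
# Sketch: does "thinking effort" keep growing with difficulty, or shrink?
# count_reasoning_tokens() is a HYPOTHETICAL helper you would implement around
# your provider's usage metadata; none of this is Apple's measurement code.

def count_reasoning_tokens(prompt: str) -> int:
    """Hypothetical: call your reasoning model and return the tokens it spent 'thinking'."""
    raise NotImplementedError("wrap your provider's API and usage fields here")

def hanoi_prompt(n: int) -> str:
    return (
        f"Solve Tower of Hanoi with {n} disks on pegs A, B, C. "
        "List every move as 'disk: from -> to'."
    )

TOKEN_BUDGET = 32_000  # assumed per-call thinking budget, not a vendor default

for n in range(3, 13):
    spent = count_reasoning_tokens(hanoi_prompt(n))
    print(f"{n:2d} disks: {spent:6d} / {TOKEN_BUDGET} reasoning tokens")
    # Apple's finding: past a critical size, 'spent' starts falling even though
    # the budget is nowhere near exhausted -- the model effectively gives up.
```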

Instead, the research shows:

  • Chain-of-Thought ≠ robust reasoning

  • Models don’t generalize reasoning patterns

  • Exact computation is still out of reach, even when the solution algorithm is spelled out in the prompt (a minimal verifier sketch follows this list)
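
That last bullet is easy to check yourself, because scoring a puzzle solution doesn't need another model, just a simulator that replays the proposed moves and rejects any illegal one. Here is a minimal Tower of Hanoi verifier in that spirit (my own sketch of the idea, not Apple's evaluation code).

```python
# Minimal Tower of Hanoi verifier: replay a proposed move list and check legality.
# A sketch of the "simulate and verify" idea, not Apple's actual evaluation code.

def is_valid_solution(n: int, moves: list[tuple[str, str]]) -> bool:
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # disk n at the bottom of A
    for src, dst in moves:
        if not pegs[src]:
            return False                       # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks on C, in order

# Example: the optimal 3-disk solution passes; a wrong final move fails.
good = [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C"), ("B", "A"), ("B", "C"), ("A", "C")]
print(is_valid_solution(3, good))                      # True
print(is_valid_solution(3, good[:-1] + [("A", "B")]))  # False
```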

Why it matters: Companies are pouring money into AI for big, complex jobs—like forecasting or risk analysis.

But here’s the catch: if LRMs break down when things get tough, that’s a costly mistake.

Most benchmarks? Flawed. They’re stuffed with data the models have already seen, which hides real weaknesses. Apple’s clean, controllable puzzles show what these models can really handle, and it’s not much.

And the waste? Wild. LRMs often overthink easy tasks, using way more compute than needed, which drives up your costs for no real benefit.

Devil’s advocate: Apple missed the first wave of AI.

Now it’s debunking the core claims of its biggest competitors.

Strategic acumen? Maybe.

But Apple is not alone.

Meta's chief AI scientist, Yann LeCun, believes LLMs will be largely obsolete within five years.

The message is clear:

We don’t have reasoning agents yet.

We have glorified autocomplete engines with good marketing.

Take care,

Marc

PS: Are you an AI vendor? Work with us to build enterprise visibility, credibility & trust. We’re helping Avalanche, Near, MoonPay, and others do the same.

For PRO readers: What this means for you

If you're deploying AI copilots or automation agents for complex reasoning tasks…

This post is for paid subscribers
