Chain of Thought: Why Your AI Model Does Not Reason

AI · Reasoning · LLM


Chain of Thought is not thinking. It is the statistical shape of thinking. Apple, Arizona State and UC Berkeley prove it with data. Here is what it means for anyone deploying AI in production.

9 min read · AI · Production · Infrastructure

Chain of Thought (CoT) is a technique that makes language models generate intermediate steps before answering. While it improves benchmark results, recent research demonstrates it does not constitute genuine reasoning: it is a statistical constraint that mimics the form of human thought.

For organisations deploying AI in production environments, understanding this difference is not a philosophical debate. It is an architectural decision that affects reliability, cost and operational risk. At SIXE we have spent over 15 years designing critical infrastructure where failure tolerance is zero. That experience has taught us a rule that applies equally to an IBM Power cluster and to an AI agent: never trust a single component for anything that cannot go down.

01 · What it is

What is Chain of Thought and why does it look like reasoning?

When you enable "reasoning" or "thinking" mode in models like GPT-5, Claude or DeepSeek, the model generates an intermediate monologue before answering: "okay, first I'll analyse X... now let me consider Y... wait, let me check Z...". In the technical literature this is known as Chain of Thought (CoT).

The problem is that this is not thinking. It is generating text with the statistical shape of human reasoning. The model saw millions of examples of step-by-step reasoning during training and learned to reproduce that pattern. When you ask it to "think", what it actually does is recognise the problem category and fill in the statistical template that best matches.

A concrete example: if you ask a "reasoning" model to size a Ceph storage cluster with 12 OSDs, 3x replication and tolerance for 2 simultaneous node failures, it will return four flawless paragraphs with formulas, failure-domain considerations and a final number. It looks like structured thinking. What it actually did was detect "Ceph sizing problem" and apply the statistical pattern it has seen in hundreds of similar technical documents.
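For context, the arithmetic behind that sizing question is easy to verify outside the model. A minimal sketch, assuming the same figures quoted in the worked steps further down this article (8 TB per OSD, 3 nodes with 4 OSDs each):

```python
# Back-of-the-envelope Ceph sizing for the example above.
# Drive size and node count match the worked steps quoted later in the article.
OSDS = 12
OSD_TB = 8            # assumed drive size per OSD
NODES = 3             # assumed node count (4 OSDs per node)
REPLICATION = 3
FAILED_NODES = 2      # failure tolerance target from the question

raw_tb = OSDS * OSD_TB                          # 96 TB raw
usable_tb = raw_tb / REPLICATION                # 32 TB usable at 3x replication

osds_per_node = OSDS // NODES                   # 4 OSDs per node
surviving_osds = OSDS - FAILED_NODES * osds_per_node
worst_case_tb = surviving_osds * OSD_TB / REPLICATION   # ~10.7 TB with 2 nodes down

print(f"Usable: {usable_tb:.0f} TB; worst case after {FAILED_NODES} node failures: {worst_case_tb:.1f} TB")
```

The point is not the formula. It is that the model's four flawless paragraphs can be checked with a dozen lines of deterministic code, and in production that check should not be optional.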

Why does it work? Because most of the time it gets the right answer. The question is what happens when the problem goes off-script.

02 · The evidence

Do LLMs actually reason? What the papers say

CoT works. It measurably improves benchmarks. The relevant question is not whether it works, but why it works. And the answer from the most rigorous studies is uncomfortable.

CoT as a statistical crutch

A team from Arizona State University demonstrated that CoT shines when the problem data falls within the training distribution. As soon as the problem moves outside known territory, performance collapses. It is the difference between a system that has memorised solutions and one that genuinely understands the underlying principles.

CoT as an architectural constraint

CoT is not abstract reasoning: it is a constraint that forces the model to imitate the form of reasoning. Forcing the model to write "first... second... therefore..." makes each generated token influence the next one more coherently. It is an architectural trick that improves the internal coherence of the text, not a cognitive act. The paper Chain-of-Thought Reasoning In The Wild Is Not Always Faithful documents how CoT can give an inaccurate picture of the actual process the model follows to reach its conclusions.
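To make the mechanism concrete: eliciting CoT is usually nothing more than asking for the step-by-step shape in the prompt. A minimal sketch using the OpenAI Python SDK; the model name is illustrative and any chat-style API behaves the same way:

```python
# Same question, with and without forcing the "first... second... therefore..." shape.
# Requires an API key in OPENAI_API_KEY; the model name is illustrative.
from openai import OpenAI

client = OpenAI()
question = ("A Ceph cluster has 12 OSDs of 8 TB with 3x replication. "
            "How much usable capacity does it have?")

direct = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

with_cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": question + " Think step by step before giving the final number."}],
)

print(direct.choices[0].message.content)
print(with_cot.choices[0].message.content)
```

The second variant reliably produces the intermediate monologue. Whether those steps carry any causal weight is exactly what the next section examines.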

Technical conclusion

CoT is useful for many tasks. But it is not reasoning. It is formal coherence with the appearance of logic.

03 · Decorative steps

What are "decorative thinking steps" in AI reasoning?

A paper from October 2025 by researchers at UC Berkeley and UC Davis introduced the concept of decorative thinking steps, and their finding is especially relevant for anyone evaluating AI models for production.

The researchers found that many intermediate CoT steps are literally decorative. The model writes things like "wait, let me check... I think I made a mistake... let me recalculate", and then completely ignores that self-correction and delivers the answer it had already decided internally.

The demonstration was elegant: they deliberately perturbed the intermediate steps (changed numbers, altered logic) and checked whether the final answer changed. In many cases, it did not. The conclusion was already decided. The chain of thought was generated afterwards, as post-hoc rationalisation.
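The same idea can be reproduced in-house. A minimal sketch of the perturbation test, not the paper's exact protocol: `answer_given_steps` is a hypothetical wrapper around your model that regenerates only the final answer, conditioned on a (possibly corrupted) chain of thought.

```python
import random
import re

def corrupt_numbers(step: str) -> str:
    """Perturb every number in a reasoning step so it no longer supports the original answer."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + random.randint(2, 9)), step)

def causal_step_fraction(question, steps, original_answer, answer_given_steps):
    """Fraction of CoT steps whose corruption actually changes the final answer.

    answer_given_steps(question, steps) -> str is a stand-in for a model call that
    produces only the final answer, conditioned on the provided intermediate steps.
    """
    influential = 0
    for i, step in enumerate(steps):
        corrupted = steps[:i] + [corrupt_numbers(step)] + steps[i + 1:]
        if answer_given_steps(question, corrupted) != original_answer:
            influential += 1   # the answer moved: this step carries causal weight
    return influential / len(steps)
```

A step whose corruption never moves the answer is, by this criterion, decorative.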

Five intermediate steps from the Ceph sizing answer, and whether each one is real or decorative:

  • "Hmm, wait. I think I got the replication factor wrong. Let me recalculate from the beginning..." Decorative: the model already had the answer. The "self-correction" did not change the final result. It is narrative theatre.
  • "Raw capacity = 12 × 8 TB = 96 TB. With 3x replication: 96 / 3 = 32 TB usable." Real: this step contains the calculation that determines the final answer. High TTS: the output depends on it.
  • "Let me verify my answer step by step to make sure I haven't made any calculation errors in the previous estimate..." Decorative: pure rhetorical formula. The model does not re-execute any calculation: it has already emitted the answer tokens. It just adds words that look like rigour.
  • "Interesting question. Before answering, let me consider multiple angles: the failure domain, OSD balancing and metadata overhead..." Decorative: listing factors without processing them is not analysis. It is the statistical form of what an expert would do. The model has already chosen the answer.
  • "With 2-node failure tolerance and 4 OSDs per node, the worst case loses 8 OSDs. Minimum guaranteed capacity: (12−8) × 8 / 3 ≈ 10.7 TB." Real: introduces new variables (nodes, OSDs per node) that actually change the result. Without this step, the answer would be different.

A concrete finding: on the AIME dataset, only 2.3% of reasoning steps in the CoT had real causal influence on the model's final prediction. The rest was decoration. (Source: Can Aha Moments Be Fake?, UC Berkeley)

Direct implication

A model explaining well why it reached a conclusion does not mean that conclusion is correct. The explanation is generated alongside (or after) the result, and in many cases it is a justification built on top of a predetermined answer.

04 · Apple

"The Illusion of Thinking": Apple's study that changes everything

If decorative steps demonstrate that CoT does not guarantee reasoning even when it gets the right answer, Apple's study goes one step further: it shows that when the problem gets truly complex, the models give up.

In June 2025, Apple published The Illusion of Thinking, a study that tested state-of-the-art reasoning models on classic computer science puzzles: the Tower of Hanoi, river-crossing problems and other exercises that any first-year student solves with pen and paper.

Model performance by complexity, with and without CoT (based on data from Apple ML Research, "The Illusion of Thinking", June 2025):

  • Easy: 88% without CoT vs 72% with CoT. The model without CoT wins; the extra "thinking" only adds cost and latency.
  • Medium: 55% without CoT vs 82% with CoT. This is where the intermediate steps pay off.
  • Hard: 8% without CoT vs 10% with CoT. Both collapse.

The most significant finding concerns the hardest problems. "Reasoning" models do not just fail there: they reduce computational effort precisely when they should increase it. It is the equivalent of a monitoring system that stops generating alerts when the infrastructure needs it most.

It is worth noting that the paper sparked debate: a team from CSIC in Madrid replicated part of the experiments and noted that some failures were due to output token limits, not pure cognitive limitations. But the core conclusions — that performance collapses with complexity and CoT does not scale predictably — held up.

05 · Cost

Is it worth paying for reasoning models?

It depends. And that is precisely the answer most vendors do not want to give you.

A case that illustrates the risk: a European company built a "reasoning" agent to classify support tickets. The chain of thought the model generated was narratively flawless. The problem was that 30% of the tickets ended up in the wrong queue, and the model explained with impeccable eloquence why that incorrect classification was the right one. Perfect narrative, wrong result.

This happens because we are confusing quality of explanation with quality of decision. They are different things. A model can produce formally flawless reasoning and reach an incorrect conclusion, exactly the same way a presentation with spectacular charts can defend a wrong strategy.

Rule of thumb

Before paying for reasoning models, benchmark with your real-world use case. Vendor marketing decks show results from their best days. Your data, your edge cases and your specific scenarios determine whether the extra cost is justified.
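A minimal sketch of what that benchmark can look like: the same labelled sample run through each candidate model, with errors broken down by class rather than hidden in a single average. Here `classify_with` is a hypothetical wrapper around whichever API or local model you are evaluating.

```python
from collections import Counter

def evaluate(classify_with, labelled_samples):
    """labelled_samples: list of (text, expected_label). Returns accuracy and error counts per true label."""
    errors = Counter()
    correct = 0
    for text, expected in labelled_samples:
        predicted = classify_with(text)
        if predicted == expected:
            correct += 1
        else:
            errors[expected] += 1        # where the model misroutes, broken down by true label
    return correct / len(labelled_samples), errors

# accuracy_base, errors_base = evaluate(classify_with_base_model, tickets)
# accuracy_reasoning, errors_reasoning = evaluate(classify_with_reasoning_model, tickets)
# Weigh the accuracy delta against the price and latency delta before paying for "reasoning".
```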

06 · Perspective

Is AI useless then?

No. And it is important to understand this clearly, because the pendulum can swing to the other extreme just as easily.

The fact that an LLM does not reason like a human does not make it useless. It means you need to understand exactly what it does in order to use it correctly.

An IBM Power10 system running AIX does not "think" about workloads. It has no intuition. What it has is a high-performance RISC architecture, memory bandwidth that an equivalent x86 cannot match, and mainframe-grade reliability (RAS). If you understand what it does, you use it for what it is worth: critical databases, HPC, AI inference at scale. If you do not understand it, you use it as an expensive x86 server and wonder why it underperforms.

The same applies to LLMs. They are extraordinary language processors. They synthesise, translate, draft, classify and extract text patterns at a speed no human team can match. That is real, it has measurable value, and it is transforming operations across every sector.

What they are not is thinking agents with genuine understanding of the world. And selling the latter when what you have is the former is what is creating an expectations bubble that, sooner or later, will correct itself.

07 · In production

How to use AI in production without falling into the trap

In the world of critical infrastructure — IBM Power, AIX, high-availability clusters — there is a principle that never fails: design with redundancy. You never trust a single component for anything that cannot go down.

1. Do not use the explanation as proof the answer is correct

The model's explanation is generated alongside (or after) the result. It is often a post-hoc rationalisation. If the system makes critical decisions, you need independent verification. No matter how well it explains itself.

2. Benchmark with your real use case before choosing a model

For simple tasks, the cheaper model can outperform the expensive one. For medium tasks, CoT pays off. For highly complex tasks, both fail. The only way to know is to test with your real data, not the vendor's.

3. Design architectures with external verification

If your AI architecture is "ask the model and trust what it says", you do not have an architecture. A serious AI deployment includes cross-validation, business rules as a control layer, alerts when model confidence drops, and humans in the loop for critical decisions.
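A minimal sketch of that control layer, with illustrative names (the queues, threshold and rules are placeholders, not a real API): the model proposes a routing, and deterministic rules plus a confidence floor decide whether it is accepted.

```python
from dataclasses import dataclass

VALID_QUEUES = {"billing", "network", "storage", "access"}
CONFIDENCE_FLOOR = 0.80   # below this, escalate instead of trusting the model

@dataclass
class ModelDecision:
    queue: str
    confidence: float

def route_ticket(ticket_text: str, decision: ModelDecision) -> str:
    # Business rule: only queues that actually exist are acceptable.
    if decision.queue not in VALID_QUEUES:
        return "human_review"
    # Alert path: when the model's own confidence drops, a human decides.
    if decision.confidence < CONFIDENCE_FLOOR:
        return "human_review"
    # Hard rules override the model, regardless of how well it explains itself.
    if "outage" in ticket_text.lower():
        return "network"
    return decision.queue
```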

4. Demand evidence, not promises

The AI market is full of extraordinary claims without proportional evidence. A serious vendor shows you benchmarks on your type of data. A less serious vendor shows you a spectacular demo with prepared data.

08 · Our methodology

How do we evaluate AI models for production environments?

At SIXE we apply the same criteria to AI that we have applied to every critical infrastructure component for over 15 years:

  • Testing with real client data, not generic datasets or prepared demos.
  • Performance measurement on edge cases, not just the happy path. Errors do not show up in the median; they show up at the extremes.
  • Redundant architecture always. AI is one more layer in the system, not the entire system. It is complemented by business rules, cross-validation and human oversight where the decision is critical.
  • Model selection by use case, not by marketing. A model with CoT can be perfect for complex text analysis and entirely unnecessary (and more expensive) for simple classification.
  • Infrastructure sized for inference. An AI model is only as good as the infrastructure that supports it. We have verified this first-hand with vLLM on IBM Power and with Ceph as the AI storage backend.
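As a concrete illustration of that last point, a minimal vLLM offline-inference smoke test; the model identifier is illustrative, and anything served from a local path or Hugging Face works the same way.

```python
# Minimal vLLM offline-inference smoke test; the model identifier is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(
    ["Summarise the failure domains of a 3-node Ceph cluster with 12 OSDs."],
    params,
)
print(outputs[0].outputs[0].text)
```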
Summary

For executives short on time

The essentials in 6 points

Chain of Thought is not thinking: it is the statistical shape of thinking, a constraint that improves the coherence of generated text.

Apple demonstrated that reasoning models collapse on complex problems and reduce their effort precisely when they should increase it.

On the AIME benchmark, only 2.3% of reasoning steps had causal influence on the model's answer. The rest was decoration.

Do not pay for "reasoning" without measuring it on your specific use case with your real data.

Never use the model's explanation as proof that the answer is correct.

Design with external verification. AI is an extraordinary tool, not an oracle.

Sources

References and cited papers

Apple Machine Learning Research. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. June 2025. machinelearning.apple.com

Zhao, C. et al. (Arizona State University). Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens. August 2025. arxiv.org/abs/2508.01191

Zhao, J. et al. (UC Berkeley, UC Davis). Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought. October 2025. arxiv.org/abs/2510.24941

Arcuschin, I. et al. Chain-of-Thought Reasoning In The Wild Is Not Always Faithful. March 2025. arxiv.org/abs/2503.08679

Dellibarda Varela, I. et al. (CSIC, Madrid). Rethinking the Illusion of Thinking. July 2025. arxiv.org/abs/2507.01231



AI in production

Need to evaluate how to integrate AI into your infrastructure?

At SIXE we design AI architectures with the same philosophy we apply to any critical system: redundancy, external verification and real benchmarks. Tell us about your case.
