Alice's Brother Has Five Sisters. AI Says Four.
The first comprehensive map of LLM reasoning failures reveals three systematic patterns across 400+ papers and a roadmap for building AI that actually thinks.
Here's a question my six-year-old nephew nailed on the first try: Alice has four sisters and one brother. How many sisters does Alice's brother have?
The answer is five. Alice herself is a sister to her brother. A child sees this instantly. GPT-4—the model trusted with legal analysis, medical triage, and billion-dollar code reviews—gets it wrong.
This isn’t a cherry-picked gotcha. It’s one specimen in a vast, now-cataloged ecosystem of reasoning failures that persist even in state-of-the-art models. And for the first time, we have a comprehensive map of the entire terrain.
The core thesis: A landmark survey from Caltech, Carleton College, and Stanford (Song, Han & Goodman, published in Transactions on Machine Learning Research, January 2026) systematically analyzed over 400 papers on LLM reasoning failures and discovered something striking: these failures aren’t random. They fall into three predictable categories across three reasoning domains—forming a 3×3 matrix that explains nearly every documented breakdown in LLM reasoning.
If you build products on top of LLMs, deploy AI agents in production, or simply want to understand why the most impressive technology of our decade still can’t count letters in a word, this piece is your field guide. I’ll share the taxonomy, the most jaw-dropping failures, and a practical framework for anyone shipping AI systems.
But first, a story about babies.
We’re shipping reasoning models into production. That makes this urgent.
In the last twelve months, the AI industry has made a decisive bet: reasoning is the product. OpenAI's o1 and o3 models, DeepSeek-R1, and a growing army of "reasoning-specialized" LLMs are being deployed into high-stakes applications: legal analysis, financial modeling, scientific discovery, autonomous robotics.
The benchmark scores look spectacular. But benchmarks are not the real world.
The Song, Han & Goodman survey is the first to pull fragmented research into a unified framework. What it reveals should make anyone deploying reasoning models pay closer attention: the failures are systematic, predictable, and often fundamental to how these systems are built.
Your AI makes the same mistake as a 10-month-old baby
In developmental psychology, there’s a famous experiment called the A-not-B task. You show an infant a toy being hidden in location A, repeatedly. The baby reaches for A and finds it. Then, right in front of the baby, you move the toy to location B.
Babies under about 12 months will still reach for location A. Their brains get stuck in the pattern, unable to override the learned response with new evidence. Developmental psychologists call this a failure of inhibitory control, one of the core executive functions that mature as we grow.
Now watch what happens when researchers from Caltech ran a version of this test on Gemini. Shown "Answer: A" twice in a row, the model got stuck in the pattern: like a 10-month-old reaching for the wrong location, it couldn't override the established response even when the correct answer was staring it in the face.
This is not a quirky anecdote. It's one manifestation of a fundamental failure category that the survey traces across informal reasoning, formal logic, and embodied AI alike. The same architectural features that make LLMs so powerful at pattern matching (the self-attention mechanism, the next-token prediction objective) are exactly what make them brittle when the pattern needs to break.
The thing that makes LLMs brilliant is the same thing that makes them fail.
The 3×3 matrix that explains (nearly) every LLM reasoning failure
The survey’s most important contribution is a two-axis taxonomy. One axis categorizes the type of reasoning (what domain is the model working in?). The other categorizes the type of failure (how fundamental is the breakdown?). Cross them and you get nine cells that map the entire landscape:
Fundamental failures: A-not-B errors and inherited cognitive biases (informal); the reversal curse and compositional collapse (formal); missing world models and physical planning gaps (embodied).
Application-specific limitations: Theory of Mind and moral judgment (informal); math word problems and multi-hop question answering (formal); physical commonsense and spatial reasoning (embodied).
Robustness issues: judgments that flip under rephrasing (informal); answers that flip when options are reordered (formal); safety constraints that fail under prompt manipulation (embodied).
Let me walk you through each axis. The reasoning is straightforward—but the implications are profound.
The failure axis distinguishes three types of breakdown. Fundamental failures are intrinsic to LLM architectures: they stem from how attention works, how tokens are predicted, how training data is structured. These affect everything downstream. Application-specific limitations show up in particular domains where we expect competence but don't get it—Theory of Mind, math word problems, physics. Robustness issues are the sneakiest: the model appears to work but collapses under minor variations, like reordering the options in a multiple-choice question.
The reasoning axis spans three domains. Informal reasoning (intuition, social cognition, biases) is the stuff humans develop in childhood. Formal reasoning (logic, math, code) is the stuff we learn in school. Embodied reasoning (physics, spatial awareness, real-world action) is the stuff we learn by existing in the physical world.
Now here’s where it gets interesting. Let me take you through the most important failures in each domain.
When intuition fails: your LLM has the cognitive biases of its training data
Humans develop informal reasoning early. We learn to read faces, judge intentions, navigate social situations. We also develop predictable cognitive biases along the way: anchoring, framing effects, confirmation bias. They're well-documented in psychology.
Here’s the uncomfortable finding: LLMs have inherited our biases without inheriting our corrective mechanisms.
The survey documents anchoring bias (early inputs disproportionately shape reasoning), framing effects (logically equivalent but differently phrased prompts produce different answers), and confirmation bias (models favor information that aligns with prior context). These aren’t occasional glitches. They’re systematic, reproducible, and traced to three root causes: biased training data, architectural features like causal masking, and alignment processes like RLHF that amplify human raters’ own biases.
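Framing effects in particular are cheap to probe for. Here's a minimal sketch of a consistency check, assuming a hypothetical `ask` callable that wraps whatever model you're testing; the `toy_model` stand-in is not a real LLM, just a stub with a framing bias baked in so the harness has something to catch:

```python
from typing import Callable

def framing_probe(ask: Callable[[str], str], framings: list[str]) -> bool:
    """Return True if the model gives the same answer to every
    logically equivalent framing of a question, False otherwise."""
    answers = {ask(prompt).strip().lower() for prompt in framings}
    return len(answers) == 1

# Two logically equivalent framings of the same fact (200 saved = 400 lost).
framings = [
    "A treatment saves 200 of 600 patients. Is it effective? Answer yes or no.",
    "A treatment lets 400 of 600 patients die. Is it effective? Answer yes or no.",
]

# Hypothetical stand-in for a real model call: it keys on surface wording,
# so it exhibits a classic framing effect.
def toy_model(prompt: str) -> str:
    return "no" if "die" in prompt else "yes"

print(framing_probe(toy_model, framings))  # False: the framing flipped the answer
```

A model that reasons about the underlying quantities would pass this probe; one that pattern-matches on wording will not.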
But the social reasoning failures are what really stopped me cold.
Your AI can’t understand what a child sees through a transparent bag
Theory of Mind (the ability to understand that other people have beliefs, knowledge, and intentions different from your own) is something human children develop around age four. It's so fundamental to social reasoning that its absence is a diagnostic criterion for certain developmental disorders.

Here's the test that trips up LLMs: Sam fills a transparent plastic bag with popcorn, but the bag's label says "chocolate." What does Sam think is in the bag? Models answer: chocolate.

The bag is transparent. Sam can see the popcorn. Yet the model ignores the visual evidence and defaults to the label. This isn't just wrong; it reveals that the model is performing shallow pattern matching on the word "label" rather than building a mental model of what Sam actually perceives.
And the moral reasoning failures are even more troubling for real-world deployment. The survey finds that LLMs produce contradictory ethical judgments when questions are slightly reworded. In one documented case, GPT-4 said no crime was occurring in a surveillance video, then recommended calling the police about the same video when asked a differently framed question.
If you’re building AI-powered moderation, customer service, or decision support tools, this should be a flashing red warning. The model’s ethical reasoning is inconsistent in ways that would be unacceptable from a human colleague.
The reversal curse: your model knows that A equals B, but not that B equals A
This might be the most elegant failure in the entire survey. Ask a model "Who is Tom Cruise's mother?" and it gets it right: Mary Lee Pfeiffer. Now ask "Who is Mary Lee Pfeiffer's son?"

It draws a blank. The model knows that Tom Cruise's mother is Mary Lee Pfeiffer, but the reverse question stumps it. The logical equivalence that any human grasps instantly (if she's his mother, he's her son) is invisible to the model.
Why? The unidirectional training objective. Models learn to predict the next token moving left to right. They learn “Tom Cruise → mother → Mary Lee Pfeiffer” but never form the bidirectional association. The knowledge is stored as a one-way street. Research shows that scaling alone can’t fix this—it’s structural, rooted in how the weights encode directional associations.
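You can see the one-way street in miniature with a toy next-token model trained the way LLMs are: strictly left to right. This is an illustrative caricature, not the survey's experiment, but it shows how directional training never creates the reverse association:

```python
from collections import defaultdict

# Train a toy next-token (bigram) model, left to right only --
# a drastically simplified stand-in for the LLM training objective.
corpus = "tom cruise 's mother is mary lee pfeiffer".split()
next_token = defaultdict(set)
for a, b in zip(corpus, corpus[1:]):
    next_token[a].add(b)

# Forward direction: the association exists.
print(next_token["is"])        # {'mary'} -- the model can continue "... mother is"

# Reverse direction: nothing was ever learned *from* "pfeiffer".
print(next_token["pfeiffer"])  # set() -- "Mary Lee Pfeiffer's son is ..." draws a blank
```

Scaling this toy up adds more one-way streets; it never paves the road back.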
But here’s the thing that nobody talks about: the reversal curse is just one instance of a much deeper problem.
Solves part A, solves part B, fails part A+B
Compositional reasoning collapse might be the most practically dangerous failure in the entire taxonomy. The model can solve each individual component of a problem, but when asked to combine them, even just two steps, it falls apart.
Consider a right triangle with legs 7 and 24, where angle X is the right angle. Ask the model for tan(Y) and it computes it correctly from the side lengths. Ask whether tan(90°) exists and it correctly says it's undefined. Now ask for tan(X).

The model knows tan(Y) in the triangle. It knows tan(90°) doesn't exist. But when you ask for tan(X), where X is the 90° angle, it happily gives you 24/7. It never connects the dots.
This same pattern appears in multi-hop question answering (combine two facts across documents), claim verification (check multiple evidence sources), and code generation (compose separate functions). The model passes every unit test but fails integration testing. Sound familiar?
Your robot assistant thinks flannel is less malleable than a baseball
The embodied reasoning failures are where the survey gets genuinely unsettling for anyone building AI agents that interact with the physical world.
LLMs fail at basic physical commonsense in ways that reveal a profound absence of world modeling. When asked whether flannel is more malleable than a baseball, models say no. When told "a house is inside an electric bulb" and asked whether the bulb is bigger than the house, they say no. When asked about acceleration at the apex of a thrown object, ChatGPT simultaneously claims the correct value (9.8 m/s² downward) and that there's "no net force," contradicting itself within the same response.
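For the record, the apex question has exactly one self-consistent answer, and it takes three lines of arithmetic to verify: at the top of the arc the velocity is zero, but the acceleration (and therefore the net force) is not. A quick sketch, with assumed values for the throw speed and mass:

```python
# Ball thrown straight up under gravity: v(t) = v0 - g*t.
g, v0, m = 9.8, 14.0, 0.5   # assumed throw speed (m/s) and mass (kg)
t_apex = v0 / g              # time at which vertical velocity reaches zero

v_apex = v0 - g * t_apex     # velocity at the apex: ~0 m/s
a_apex = -g                  # acceleration at the apex: still 9.8 m/s^2 downward
f_net = m * a_apex           # net force: m*g downward, NOT zero

print(v_apex, a_apex, f_net)
```

Zero velocity and zero force are different claims; a model with a working physical world model would never assert both "9.8 m/s² downward" and "no net force" about the same instant.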
And it gets worse in 3D. Vision-language models fail at basic anomaly detection (not noticing someone ice-skating on a wooden floor), can’t count overlapping objects, and when asked to estimate real-world distances for robotics applications, generate plans involving physically impossible actions.
Most alarming: researchers have demonstrated that embodied LLMs can be jailbroken into performing harmful physical actions, recording private information, and violating safety constraints, all through prompt manipulation. The survey describes "an urgent need for robust, self-correcting, and safety-aware embodied AI systems before real-world deployment."
The pattern behind the patterns: why the same architecture creates failures everywhere
Here’s what the survey makes devastatingly clear when you read across all nine cells of the matrix: the same architectural features cause failures in every domain.
The self-attention mechanism? It disperses focus under complex tasks, causing working memory failures (informal), compositional reasoning breakdowns (formal), and spatial modeling errors (embodied). The next-token prediction objective? It prioritizes statistical pattern completion over deliberate reasoning—explaining the A-not-B error (informal), the reversal curse (formal), and the lack of physical planning (embodied). The causal masking in transformers? It introduces order bias everywhere.
Remember the 10-month-old baby reaching for the wrong location? The A-not-B error. That same failure pattern—an inability to override a learned response when new evidence arrives—manifests as confirmation bias in informal reasoning, as the reversal curse in formal reasoning, and as robots repeating failed actions in embodied reasoning.
Different symptoms. Same disease.
This is the survey's deepest insight: if you understand the architectural roots, you can predict where new failures will emerge before they do. That's the difference between playing whack-a-mole with bugs and building genuinely resilient systems.
The RADAR framework: a practitioner’s guide to reasoning failure mitigation
Based on the survey's findings and mitigation strategies, here's a practical framework for anyone building on LLMs. I'm calling it RADAR: five steps that map directly to the patterns identified across 400+ papers.
The RADAR Framework in Practice
R - Recognize the failure type. Is it fundamental (architectural), application-specific (domain gap), or robustness (brittleness under variation)? This determines your mitigation budget and timeline.
A - Analyze the reasoning domain. Informal failures need bias mitigation and alignment work. Formal failures need structural interventions. Embodied failures need grounding and simulation.
D - Diagnose the root cause. Trace the failure to architecture (attention dispersion?), training (data bias? tokenization?), or deployment (prompt sensitivity?). The survey shows most failures trace to a small set of architectural causes.
A - Address with matched interventions. Data augmentation for bias. Bidirectional training for reversal curse. External tools for arithmetic. Physics simulators for embodied reasoning. The key insight: match the intervention to the root cause, not the symptom.
R - Retest with perturbation-based evaluation. Don’t just check if the original failure is fixed—apply semantics-preserving transformations (reorder options, rename variables, rephrase questions) to verify true robustness. Then iterate.
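That last R is the step teams most often skip. As a sketch of what perturbation-based retesting can look like, here's an option-reordering check; `ask` is a hypothetical stand-in for your model call, and the `position_biased_model` stub illustrates the failure mode the check is designed to expose:

```python
import itertools
from typing import Callable

def answers_under_reordering(ask: Callable[[str], str],
                             question: str,
                             options: list[str]) -> set[str]:
    """Ask the same question under every ordering of the options
    (a semantics-preserving perturbation) and collect the *content*
    of each pick. A robust model yields exactly one distinct answer."""
    picks = set()
    for perm in itertools.permutations(options):
        labels = dict(zip("ABCD", perm))
        prompt = question + "\n" + "\n".join(f"{k}) {v}" for k, v in labels.items())
        picks.add(labels[ask(prompt)])  # map the chosen letter back to its content
    return picks

# Hypothetical stub: always answers "A", i.e. pure position bias.
def position_biased_model(prompt: str) -> str:
    return "A"

picks = answers_under_reordering(
    position_biased_model,
    "Which gas do plants absorb for photosynthesis?",
    ["carbon dioxide", "oxygen", "nitrogen", "helium"],
)
print(len(picks))  # 4 -- the "answer" changed with the ordering: not robust
```

The same harness shape works for variable renaming and paraphrase perturbations: transform the prompt, hold the semantics fixed, and demand a fixed answer.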
Fail better
There’s a line near the end of the survey that I keep returning to. The authors compare the systematic study of LLM reasoning failures to fault-tolerance research in early computing and incident analysis in safety-critical industries. In aviation, nuclear power, and medicine, the disciplines that dramatically reduced catastrophic failures didn’t start by building better systems. They started by understanding how existing systems failed.
We’re at that inflection point with AI reasoning. The models are getting deployed. The stakes are getting higher. And for the first time, thanks to work like this survey, we have a comprehensive map of where the ground is weak.
The Alice problem, my six-year-old nephew's victory lap over GPT-4, isn't just a funny anecdote. It's a signal. These failures aren't random, and they aren't going away with scale alone. They're structural, predictable, and, if we take them seriously, addressable.
The survey’s closing line captures it perfectly:
"As reasoning-specialized models become more prevalent, sustained attention to failure modes will be essential to ensure that future LLMs not only perform better in reasoning tasks, but fail better—gracefully, transparently, recoverably."
— Song, Han & Goodman, TMLR 2026
The goal has never been AI that never fails. It’s AI that fails in ways we can predict, detect, and recover from.
And the first step toward that future is knowing exactly where the failures are.
Now you have the map.