Hierarchical Reasoning Model

Juan Vera

August 2025

Abstract

Reading Hierarchical Reasoning Model by Wang et al.

Reasoning remains a critical challenge in AI. Chain of Thought (CoT) reasoning in language models suffers from brittle task decomposition, extensive data requirements, and high latency (self-attention alone scales as $O(n^2 \cdot d)$ with sequence length $n$ and dimension $d$).

The brittleness stems from autoregressive inference: a language model samples as $P(y_1, y_2, \ldots, y_T) = \prod_{t=1}^T P(y_t \mid y_{<t})$, where $y_t$ is the $t$-th token in the sequence and $y_{<t}$ is the set of tokens before position $t$, so small hallucinations in early steps can compound into larger errors in downstream steps. See more by Yann LeCun here.

Directly inspired by the human brain, the HRM executes sequential reasoning in a single forward pass through two interdependent recurrent modules: a high-level module for slow, abstract planning and a low-level module for rapid, detailed computations.

With 27M parameters and ~1000 training examples, HRM beats larger transformer-based models on ARC-AGI-(1 & 2), Sudoku-Extreme, and Maze-Hard.

Introduction

Language models, despite being heavily over-parameterized, with many transformer blocks stacked on top of one another, are paradoxically shallow in computational terms.

This paper proves that language models can be simulated by uniform constant-depth threshold circuits, meaning they are in the class $\text{TC}^0$.

$\text{TC}^0$ is the class of functions computed by uniform families of Boolean circuits of constant depth and polynomial size, built from unbounded fan-in AND, OR, NOT, and majority (threshold) gates.

$\text{TC}^0 \subseteq \text{P}$, sitting low in the complexity hierarchy and widely believed to be a strict subset, meaning such models likely cannot solve all problems in $\text{P}$ (polynomial time); this places fundamental limits on what they can compute directly. There are simply some functions or patterns transformers cannot represent (more here).

LLMs are therefore not Turing-complete: a Turing machine can solve every problem in $\text{P}$ (and beyond), so these models cannot compute a significant portion of problems efficiently, or even at all.

Chain of Thought reasoning can also break down easily: a single misordered step can collapse the entire reasoning process.

In this paper, the authors explore latent reasoning, where the model reasons within its internal hidden state, aligning with evidence that the human mind reasons separately from language.

Recurrent Neural Networks use such a hidden state, but they are computationally expensive, not parallelizable, and suffer from the vanishing gradient problem due to backpropagation through time.

The human brain is a compelling blueprint, organizing computations hierarchically rather than sequentially, operating at different speeds, which can enable deep multi-stage reasoning in parallel.

The HRM is constructed similarly, with a high-level module designed for abstract, deliberate reasoning and a low-level module designed for rapid, detailed computations. This interplay avoids the premature convergence typical of recurrent models, a behavior the authors coin "hierarchical convergence".

The higher-level module only advances after the lower-level module has completed multiple steps and reached a stable state.

They also propose a one-step gradient approximation for training the HRM, which eliminates the need for backpropagation through time, decreasing the computational requirements of training the recurrent model.

Due to its architecture, the HRM offers excellent performance using only 1,000 training examples, without any pre-training or CoT supervised fine-tuning, even learning solutions to problems that are intractable for LLMs.

Architecture

The model is inspired by three principles:

  • Hierarchical Processing, where the brain processes information across hierarchical cortical regions, separating long-term slow reasoning from fast thinking.
  • Temporal Separation, where different regions of the brain operate at different frequencies (timescales).
  • Recurrent Connectivity, where many connections in the brain are recurrent, allowing for more context-sensitive reasoning.

The model consists of:

  • An input network, $f_I(\cdot; \theta_I)$
  • A low-level recurrent module, $f_L(\cdot; \theta_L)$
  • A high-level recurrent module, $f_H(\cdot; \theta_H)$
  • An output network, $f_O(\cdot; \theta_O)$

An inference pass over the HRM is done through $N$ high-level cycles of $T$ low-level timesteps, meaning $N \times T$ total timesteps.

$f_L$ and $f_H$ each keep a hidden state: $z_L^i$ for the low-level module and $z_H^i$ for the high-level module.

Given an input vector $x$, the HRM maps it to an output prediction vector $\hat{y}$ through the following process:

First, the input is projected by the input network: $\hat{x} = f_I(x; \theta_I)$

At each timestep $i$, the low-level module updates its hidden state conditioned on $z_H^i$, $z_L^i$, and $\hat{x}$, while $f_H(\cdot; \theta_H)$ updates its state only after a full cycle of $f_L(\cdot; \theta_L)$, using the low-level hidden state $z_L^i$ and $z_H^i$, without $\hat{x}$:

$$z_L^{i+1} = f_L(z_H^i, z_L^i, \hat{x}; \theta_L)$$
$$z_H^{i+1} = \begin{cases} f_H(z_H^i, z_L^i; \theta_H) & \text{if } i \bmod T = 0 \\ z_H^i & \text{otherwise} \end{cases}$$

In other words, unless $i$ is divisible by $T$, $f_H(\cdot; \theta_H)$ is not updated and simply carries its previous state forward; with $T = 4$, for example, the high-level state changes only once every four timesteps.

Once $N$ full cycles are completed, a prediction $\hat{y}$ is extracted from the high-level hidden state by the output network:

$$\hat{y} = f_O(z_H^{(N \times T)}; \theta_O)$$

The entire set of $N \times T$ timesteps represents a single forward pass through the HRM.

The HRM is designed to counteract the premature convergence seen in standard RNNs: the high-level module only advances after the low-level module has completed a full cycle of $T$ timesteps, which allows the low-level module to reach a "local equilibrium" before the high-level state updates and kicks off a fresh low-level computation.
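
To make the schedule concrete, here is a minimal PyTorch sketch of one forward pass following the equations above. The class and argument names are illustrative rather than the authors' implementation, and the high-level update here consumes the low-level state computed at the end of each cycle.

```python
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    """Illustrative HRM forward pass: N high-level cycles of T low-level steps."""

    def __init__(self, f_I, f_L, f_H, f_O, d_model, N=2, T=4):
        super().__init__()
        self.f_I, self.f_L, self.f_H, self.f_O = f_I, f_L, f_H, f_O
        self.N, self.T = N, T
        # learned initial hidden states z_L^0 and z_H^0
        self.z_L0 = nn.Parameter(torch.zeros(d_model))
        self.z_H0 = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        x_hat = self.f_I(x)                 # x_hat = f_I(x; theta_I)
        z_L = self.z_L0.expand_as(x_hat)
        z_H = self.z_H0.expand_as(x_hat)
        for i in range(1, self.N * self.T + 1):
            # low-level update at every timestep, conditioned on z_H and x_hat
            z_L = self.f_L(z_L, z_H, x_hat)
            if i % self.T == 0:
                # high-level update only at the end of each T-step cycle
                z_H = self.f_H(z_H, z_L)
        return self.f_O(z_H)                # y_hat = f_O(z_H^(N*T); theta_O)
```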

Approximate Gradient Methods

Recurrent models typically use backpropagation through time (BPTT) to compute gradients. BPTT requires storing the hidden states from the forward pass and combining them with gradients during the backward pass, which is expensive in memory: the cost is $O(T)$, growing linearly with the sequence length $T$, thereby forcing small batch sizes.

Consider the situation where you have an RNN, $\mathcal{R}$, that is run on the same input, $\hat{x}$, over $T$ timesteps, with a state update at every timestep.

Eventually, the hidden state will converge to a fixed point, $z^*$, where $\mathcal{R}(z^*, \hat{x}; \theta) = z^*$.

Meaning that after lengthy recursive application to the same input, the RNN will reach a fixed hidden state, or in the case of the HRM, a local equilibrium.

So if we consider the HRM, where the high-level state $z_H^k = f_H(z_H^{k-1}, z_L^*; \theta_H)$ serves as conditioning for the low-level module $f_L(z_H^k, z_L^i, \hat{x}; \theta_L)$, we can observe that within each cycle the low-level state converges toward a local fixed point $z_L^*$ satisfying $f_L(z_H^k, z_L^*, \hat{x}; \theta_L) = z_L^*$, and only then does the high-level module update its hidden state, $z_H^{k+1} = f_H(z_H^k, z_L^*; \theta_H)$; across cycles, the high-level state itself approaches a fixed point $z_H^*$.

This is because during each low-level cycle, $f_L(\cdot)$ is conditioned on a frozen high-level state from $f_H(\cdot)$ and a fixed input vector $\hat{x}$, meaning the only variable being updated is $z_L^i$; since the same map is applied recursively to its own output, it will (assuming the update is stable) eventually converge to a fixed point.
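
A toy numerical illustration of this convergence (not from the paper): iterate a simple recurrent map with frozen conditioning, with the weight matrix scaled to spectral norm below 1 so the map is a contraction, and watch the residual between successive states shrink toward zero.

```python
import torch

torch.manual_seed(0)
d = 8
W = torch.randn(d, d)
W = 0.9 * W / torch.linalg.matrix_norm(W, ord=2)  # spectral norm 0.9 -> contraction
x_hat = torch.randn(d)                            # frozen conditioning (stands in for z_H and x_hat)

def f_L(z):
    # toy low-level update with fixed conditioning
    return torch.tanh(W @ z + x_hat)

z = torch.zeros(d)
for i in range(60):
    z_next = f_L(z)
    if i % 10 == 0:
        print(f"step {i:2d}  residual {torch.norm(z_next - z).item():.2e}")
    z = z_next
```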

Let's define $\mathcal{F}$ as the transformation containing one full cycle of low-level updates followed by the high-level update, $z_H^k = \mathcal{F}(z_H^{k-1}; \hat{x}, \theta)$, where $\theta$ collects the parameters of the modules involved.

The fixed point where we have this equilibrium can be written as $z_H^* = \mathcal{F}(z_H^*; \hat{x}, \theta)$.

$J_\mathcal{F} = \frac{\partial \mathcal{F}}{\partial z_H}$ is the Jacobian matrix of $\mathcal{F}$ with respect to $z_H$.

If the matrix $I - J_\mathcal{F}$ is invertible at $z_H^*$ and $\mathcal{F}$ is continuously differentiable, the implicit function theorem then allows for the computation of the exact gradient at that point without explicit backpropagation:

$$\frac{\partial z_H^*}{\partial \theta} = \left(I - J_\mathcal{F}\right)^{-1} \left.\frac{\partial \mathcal{F}}{\partial \theta}\right|_{z_H^*}$$

Computing the inverse of $I - J_\mathcal{F}$ is computationally expensive, so we can take the Neumann series expansion of the inverse:

$$\left(I - J_\mathcal{F}\right)^{-1} = \sum_{k=0}^\infty (J_\mathcal{F})^k = I + J_\mathcal{F} + J_\mathcal{F}^2 + \cdots$$

and truncate it at the first term ($k = 0$), i.e. $(I - J_\mathcal{F})^{-1} \approx I$, which gives the one-step approximation of the gradient at $z_H^*$: $\frac{\partial z_H^*}{\partial \theta} \approx \frac{\partial \mathcal{F}}{\partial \theta} \big|_{z_H^*}$.
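
In practice, this approximation can be realized by running every recurrent update except the final low-level and high-level steps without gradient tracking, so autograd only differentiates through the last update of each module. A hedged sketch (assuming the model exposes its modules and an init_states helper as below; not the reference implementation):

```python
import torch

def hrm_segment_one_step_grad(model, x, N=2, T=4):
    """One HRM segment with the one-step gradient approximation (sketch).

    Only the final f_L and f_H calls are tracked by autograd, so memory stays
    constant in N*T and no backpropagation through time is needed.
    """
    x_hat = model.f_I(x)
    z_L, z_H = model.init_states(x_hat)   # hypothetical helper returning (z_L^0, z_H^0)

    with torch.no_grad():                  # all but the last timestep: no graph is built
        for i in range(1, N * T):
            z_L = model.f_L(z_L, z_H, x_hat)
            if i % T == 0:
                z_H = model.f_H(z_H, z_L)

    # final low-level and high-level updates carry the gradient
    z_L = model.f_L(z_L, z_H, x_hat)
    z_H = model.f_H(z_H, z_L)
    return z_H, model.f_O(z_H)
```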

Deep Supervision

Given a data sample $(x, y)$, you run multiple forward passes ("segments") of the HRM, where $M$ is the total number of segments and $m$ indexes the $m$-th segment.

For each segment $m \in \{1, \ldots, M\}$, you compute an update, similar to standard gradient descent:

$$z^m, \hat{y}^m = \text{HRM}(z^{m-1}, x; \theta)$$
$$L^m \leftarrow \text{Loss}(\hat{y}^m, y)$$
$$\theta \leftarrow \theta - \eta \nabla_\theta L^m$$

with the caveat that $z^m$ is detached from the computation graph before the next segment, so it is not involved in the computation of gradients: since the gradient is approximated with a single step, the incoming state $z^{m-1}$ is treated as a constant.
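
A hedged sketch of the resulting deep-supervision loop, assuming the model exposes a segment method that takes and returns the carried state (as in the equations above) and an init_carry helper for $z^0$; both names are illustrative.

```python
import torch
import torch.nn.functional as F

def deep_supervision_step(model, optimizer, x, y, M=4):
    """Run M supervised segments on one (x, y) pair, updating theta after each (sketch)."""
    z = model.init_carry(x)                 # hypothetical helper producing z^0
    for m in range(1, M + 1):
        z, y_hat = model.segment(z, x)      # one segment (one-step gradient inside)
        loss = F.cross_entropy(y_hat, y)    # L^m = Loss(y_hat^m, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        z = z.detach()                      # z^m carries no gradient into segment m+1
    return loss.item()
```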

Adaptive Computational Time

They incorporate a halting strategy into the HRM via $Q$-learning, enabling the HRM to dynamically select the number of segments $M$ based on the complexity of the task.

The $Q$-head uses the final state of $f_H(\cdot)$ to predict the Q-values, $\hat{Q}^m = (\hat{Q}^m_{\text{halt}}, \hat{Q}^m_{\text{continue}})$, as:

$$\hat{Q}^m = \sigma(\theta_Q^\top z_H^{mNT})$$

where $\sigma$ is the sigmoid function used to derive the Q-values for halting and continuing.

Let:

  • $M_{\text{max}}$ be the maximum number of segments
  • $M_{\text{min}}$ be the minimum number of segments; to encourage exploration, with probability $\epsilon$ it is sampled uniformly from $\{2, \ldots, M_{\text{max}}\}$, and with probability $1 - \epsilon$ it is set to $1$.

The criteria for halting are:

  • the segment count surpasses the maximum threshold $M_{\text{max}}$, or
  • the halt value exceeds the continue value and the segment count has reached at least the minimum threshold $M_{\text{min}}$.

The Q-head is trained through Q-learning, defined over a Markov decision process with state space $S$, action space $A$, and reward function $R(s, a)$, where $s^m = \{z^0, \ldots, z^{mNT}\}$ is the state, $a^m \in \{\text{halt}, \text{continue}\}$ is the action, and $R(s^m, a^m)$ is the reward.

Once the model halts, it returns a prediction and receives a binary reward for that prediction (1 if correct, 0 otherwise). Continuing returns a reward of $0$.

The targets for each possible action, $\hat{G}^m = (\hat{G}^m_{\text{halt}}, \hat{G}^m_{\text{continue}})$, are:

$$\hat{G}^m_{\text{halt}} = \mathbf{1}\{\hat{y}^m = y\}$$
$$\hat{G}^m_{\text{continue}} = \begin{cases} \hat{Q}^{m+1}_{\text{halt}} & \text{if } m \geq M_{\text{max}} \\ \max\left(\hat{Q}^{m+1}_{\text{halt}}, \hat{Q}^{m+1}_{\text{continue}}\right) & \text{otherwise} \end{cases}$$

Meaning the halt target is simply the reward for the current prediction, while the continue target bootstraps from the next segment's Q-values: once the segment count reaches the maximum, only halting is considered; otherwise it takes the better of halting or continuing.

The loss function is then $L^m_{\text{ACT}} = \text{Loss}(\hat{y}^m, y) + \text{BCE}(\hat{Q}^m, \hat{G}^m)$.
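
A sketch of how the halting rule, Q-targets, and ACT loss fit together for one segment. All names are illustrative, the next segment's Q-values are passed in as q_next, and the binary reward is assumed to be exact-match correctness of the halted prediction.

```python
import torch
import torch.nn.functional as F

def act_step(q_m, q_next, m, M_min, M_max, seq_loss, prediction_correct):
    """Halting decision, Q-learning targets, and ACT loss for segment m (sketch).

    q_m, q_next: tensors of shape (2,) holding sigmoid Q-values
                 [Q_halt, Q_continue] at segments m and m + 1.
    """
    q_halt, q_continue = q_m[0], q_m[1]

    # halt when the segment cap is reached, or when halting wins after M_min segments
    halt = bool(m >= M_max or (q_halt > q_continue and m >= M_min))

    # targets: halting earns the binary prediction reward; continuing bootstraps
    # from the next segment's Q-values (only halting remains once m >= M_max)
    g_halt = torch.tensor(1.0 if prediction_correct else 0.0)
    g_continue = q_next[0] if m >= M_max else torch.max(q_next)

    targets = torch.stack([g_halt, g_continue]).detach()
    loss = seq_loss + F.binary_cross_entropy(q_m, targets)   # L_ACT^m
    return halt, loss
```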

The stability of Q-learning is questionable, but under some conditions—such as Post-Normalization and weight decay—stability can be achieved.

Architecture

The HRM is a sequence-to-sequence architecture: the input and output are both sequences of tokens, which are then mapped into vectors.

The model includes an embedding layer fIf_I that converts tokens into vectors, and an output head that transforms the hidden state of the final timestep into the output probability vector y^\hat{y}.

The low-level and high-level modules are implemented as encoder-only Transformers with identical architectures and dimensions. They include enhancements found in modern models, such as Rotary Position Embeddings (RoPE), Gated Linear Units (GLUs), and Post-RMSNorm.
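
As a rough picture of such a block (a sketch only: RoPE is omitted, and nn.RMSNorm requires a recent PyTorch), each sub-layer, attention and a gated MLP, is followed by an RMSNorm applied after the residual addition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PostNormGLUBlock(nn.Module):
    """Encoder-only block with a gated MLP and Post-RMSNorm (illustrative sketch)."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # GLU gate projection
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.norm1 = nn.RMSNorm(d_model)
        self.norm2 = nn.RMSNorm(d_model)

    def forward(self, x):                        # x: (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)             # post-norm: normalize after the residual
        mlp_out = self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
        x = self.norm2(x + mlp_out)
        return x
```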

Results

  • 40.3% accuracy on ARC-AGI-1
  • 5.0% accuracy on ARC-AGI-2
  • 55.0% accuracy on Sudoku-Extreme
  • 74.5% accuracy on Maze-Hard

These results beat DeepSeek R1, Claude 3.7, and o3-mini-high across all four benchmarks.