RetNet: Retentive Network

RetNet is a foundation architecture for LLMs that achieves training parallelism, low-cost inference, and good performance simultaneously, three goals often described as the “impossible triangle” of sequence modeling.

The key contribution is the Retention mechanism, which has a dual form of recurrence and parallelism. This means we can train models in parallel (like Transformers) while doing inference recurrently (like RNNs).

Three computation paradigms:

Parallel — training parallelism on GPU
Recurrent — O(1) inference, reducing decode latency and GPU memory
Chunkwise Recurrent — efficient long-sequence modeling

Paper: https://arxiv.org/abs/2307.08621

Comparison with Other Architectures

The three main sequence modeling families each make a different tradeoff:

	Training	Inference memory	Inference cost per step	Positional encoding
Transformer	Parallel	$O(n)$ KV cache grows with sequence	$O(n)$ per token	RoPE / ALiBi / learned
RNN (RWKV, SSM)	Sequential (classic RNNs)	$O(1)$ fixed state	$O(1)$ per token	Built into recurrence
RetNet	Parallel	$O(1)$ fixed state $S_n$	$O(1)$ per token	Built into $A$ diagonalization

vs. Transformer. Attention computes $\text{softmax}(QK^T / \sqrt{d}) V$ over the full context at every step. Every new token attends to all previous tokens, so inference memory and compute grow linearly with sequence length. RetNet replaces softmax with the decay matrix $D$ (no normalization across positions, just exponential weighting) and compresses the entire history into a fixed-size state $S_n$ . The tradeoff: Transformers can attend to any past token equally; RetNet exponentially forgets distant tokens controlled by $\gamma$ .

vs. RNNs / RWKV / SSMs. These also maintain a fixed state and have O(1) inference. Classic RNNs, however, are trained sequentially: each step depends on the previous one, so training cannot be parallelized across the sequence. (Some models in this family, such as S4-style SSMs, do have a parallel training form; the contrast here is with the sequential case.) RetNet gets both: because the recurrence $S_n = \gamma S_{n-1} + K_n^T V_n$ can be unrolled into a closed-form sum (the parallel form), the full sequence can be computed with a few matrix multiplies during training, just like Transformers.

vs. Linear Attention. Linear attention replaces softmax with a kernel $\phi(Q)\phi(K)^T$ and also has O(1) recurrent inference. RetNet is similar in spirit but adds two things linear attention lacks: (1) the exponential decay $\gamma^{n-m}$ which gives a proper forgetting mechanism rather than equal weighting of all past tokens, and (2) the complex rotations from diagonalizing $A$ which give relative position encoding for free.

Variables at a Glance

Symbol	Shape	What it is
$X$	$|x| \times d_{\text{model}}$	Full input sequence, one row per token
$W_Q, W_K, W_V$	$d_{\text{model}} \times d$	Learned projection matrices (per head)
$Q_n, K_n, V_n$	$1 \times d$	Query, Key, Value vectors for token $n$
$S_n$	$d \times d$	Retention state — running memory accumulating all past KV pairs
$o_n$	$1 \times d$	Output for token $n$ , read from the state
$A$	$d \times d$	Recurrence matrix — controls how state evolves
$\gamma$	scalar $\in (0,1)$	Decay factor — eigenvalue magnitude of $A$ after diagonalization
$\theta$	scalar	Rotation frequency — eigenvalue angle of $A$ , encodes relative position
$\Lambda$	invertible $d \times d$	Change-of-basis that diagonalizes $A$, absorbed into $W_Q$ / $W_K$
$D$	$|x| \times |x|$	Causal decay matrix — combines mask and exponential decay

The intuition: $S_n$ is the memory at step $n$ . It accumulates all past $(K_m, V_m)$ pairs, exponentially forgetting older ones via $\gamma$ . The output $o_n$ is a query into that memory.

2.1 Retention: Recurrent Form

The input sequence $\{x_i\}$ is packed into a matrix $X \in \mathbb{R}^{|x| \times d_{\text{model}}}$ .

The general recurrence. The retention state evolves as:

$S_n = A \, S_{n-1} + K_n^T V_n$

$S_n$ is a $d \times d$ matrix — an outer-product memory. Each token writes $K_n^T V_n$ into it (store this key-value association), while $A$ decays what is already there.

Why diagonalize $A$ ? A full $d \times d$ matrix $A$ is expensive and hard to train stably. The paper constrains it to be diagonalizable:

$A = \Lambda \, (\gamma \, e^{i\theta}) \, \Lambda^{-1}$

Here $\gamma e^{i\theta}$ is diagonal with entries $\gamma e^{i\theta_j}$: a magnitude $\gamma$ and a rotation angle $\theta_j$ per dimension. Absorbing $\Lambda$ into $W_Q$ and $\Lambda^{-1}$ into $W_K$ (they are just linear projections) leaves only the diagonal factor. Its phase $e^{i\theta}$ moves onto $Q$ and $K$ as a rotation (described below), so the recurrence keeps only the scalar decay:

$S_n = \gamma \, S_{n-1} + K_n^T V_n$

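So $\gamma$ is not an ad hoc addition: it is the eigenvalue magnitude of $A$. (In practice it is fixed per head rather than learned; see Multi-Scale Retention.) The state update is now $O(d^2)$ per step: a scalar rescaling plus one outer product, with no $d \times d$ matrix product.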
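Unrolling this recurrence shows where the exponential weighting comes from. A short worked derivation, assuming an empty initial state $S_0 = 0$ (the symbol $S_0$ is introduced here for illustration):

$S_1 = K_1^T V_1$

$S_2 = \gamma \, K_1^T V_1 + K_2^T V_2$

$S_3 = \gamma^2 K_1^T V_1 + \gamma \, K_2^T V_2 + K_3^T V_3$

$S_n = \sum_{m=1}^{n} \gamma^{n-m} K_m^T V_m$

Each step multiplies everything already stored by $\gamma$ once more, so the pair written at step $m$ carries weight $\gamma^{n-m}$ by step $n$.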

Position encoding comes for free. The phase factors $e^{i\theta}$ from the diagonal form of $A$ become xPos-style complex rotations on $Q$ and $K$:

$Q_n \leftarrow Q_n e^{in\theta}, \qquad K_m \leftarrow K_m e^{im\theta}$

The inner product $Q_n K_m^*$ then gives $e^{i(n-m)\theta}$ — relative position $n-m$ is baked in. No separate RoPE layer needed on $Q$ and $K$ ; the diagonalization of $A$ handles it.
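A quick numerical check of this relative-position property, as a minimal sketch for a single dimension. The values of `theta`, `q`, `k` and the helper names `rotate` and `score` are illustrative assumptions, not taken from the paper:

```python
import cmath

theta = 0.3          # assumed rotation frequency for one dimension
q, k = 1.5, -0.7     # arbitrary real query / key entries

def rotate(x, pos, theta):
    """Multiply x by the phase e^{i * pos * theta}."""
    return x * cmath.exp(1j * pos * theta)

def score(n, m):
    """Rotated query at position n times the conjugate of the rotated key at position m."""
    return rotate(q, n, theta) * rotate(k, m, theta).conjugate()

# Two position pairs with the same offset n - m = 3.
s_near = score(5, 2)
s_far = score(105, 102)

# Closed form: q * k * e^{i (n - m) theta}
expected = q * k * cmath.exp(1j * 3 * theta)
```

Both scores agree with `expected` up to floating-point error: only the offset $n - m$ survives, not the absolute positions.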

Reading out the output and unrolling the recurrence:

$o_n = Q_n S_n = \sum_{m=1}^{n} \gamma^{n-m} \left( Q_n e^{in\theta} \right) \left( K_m e^{im\theta} \right)^* V_m$

Token $m$ ’s contribution to the current output is weighted by $\gamma^{n-m}$ — exponentially smaller the further back it is. The conjugate $(\cdot)^*$ on $K_m$ is what makes the phase difference $e^{i(n-m)\theta}$ emerge.
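The recurrent form can be sketched in a few lines of NumPy. This is an illustrative single-head version under stated assumptions: complex arithmetic is kept as in the formulas, `Q`, `K`, `V` stand for already-projected rows, and the function names and test sizes are made up for the example.

```python
import numpy as np

def retention_recurrent(Q, K, V, gamma, theta):
    """Recurrent retention for a single head.

    Q, K, V: real arrays of shape (n, d). theta: shape (d,). gamma: scalar in (0, 1).
    Positions run from 1 to n, matching the sum in the text.
    """
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]), dtype=complex)    # fixed-size state
    outputs = []
    for t in range(n):
        phase = np.exp(1j * (t + 1) * theta)        # e^{i n theta}, one entry per dimension
        q = Q[t] * phase
        k = K[t] * phase
        S = gamma * S + np.outer(k.conj(), V[t])    # decay old memory, write the new pair
        outputs.append(q @ S)                       # o_n = Q_n S_n
    return np.array(outputs)

def retention_unrolled(Q, K, V, gamma, theta, n):
    """Evaluate the unrolled sum directly for one position n (1-indexed)."""
    q_n = Q[n - 1] * np.exp(1j * n * theta)
    out = np.zeros(V.shape[1], dtype=complex)
    for m in range(1, n + 1):
        k_m = K[m - 1] * np.exp(1j * m * theta)
        out += gamma ** (n - m) * (q_n @ k_m.conj()) * V[m - 1]
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
theta = np.linspace(0.1, 1.0, 4)
out = retention_recurrent(Q, K, V, gamma=0.9, theta=theta)
last = retention_unrolled(Q, K, V, 0.9, theta, n=6)
```

Here `out[-1]` and `last` match numerically, which is the equality $o_n = Q_n S_n = \sum_m \ldots$ stated above. The loop only ever holds `S`, never the past tokens.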

2.2 Retention: Parallel Form

For training, process the full sequence at once. Apply rotations across all positions:

$Q = (X W_Q) \odot \Theta, \quad K = (X W_K) \odot \bar{\Theta}, \quad V = X W_V$

where $\Theta_n = e^{in\theta}$ and $\bar{\Theta}$ is its conjugate (so $K$ gets the conjugated rotation, matching the recurrent form).

The causal decay matrix $D \in \mathbb{R}^{|x| \times |x|}$ encodes both causal masking and $\gamma$ decay:

$D_{nm} = \begin{cases} \gamma^{n-m} & n \geq m \\ 0 & n < m \end{cases}$

Row $n$ , column $m$ of $D$ is exactly the weight $\gamma^{n-m}$ that token $m$ gets when computing output at position $n$ . Future tokens are zeroed.
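For a concrete picture of $D$, here is a small sketch that builds it for four positions with $\gamma = 0.5$. The helper name `decay_matrix` and the chosen values are illustrative only:

```python
def decay_matrix(n, gamma):
    """D[i][j] = gamma**(i - j) when i >= j, else 0 (row i is the output position)."""
    return [[gamma ** (i - j) if i >= j else 0.0 for j in range(n)]
            for i in range(n)]

D = decay_matrix(4, 0.5)
for row in D:
    print(row)
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 1.0, 0.0, 0.0]
# [0.25, 0.5, 1.0, 0.0]
# [0.125, 0.25, 0.5, 1.0]
```

The zeros above the diagonal are the causal mask; each row reads right to left as progressively older, progressively down-weighted tokens.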

$\text{Retention}(X) = (Q K^T \odot D) V$

Mathematically identical to the recurrent form, just computed all at once.
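The equivalence can be checked numerically. The sketch below is a minimal single-head NumPy version, assuming already-projected `Q`, `K`, `V` and 1-indexed positions; the function names are hypothetical, and the recurrent loop is included only as a reference to compare against.

```python
import numpy as np

def rotate(X, theta, conj=False):
    """Multiply row n of X by e^{+i n theta} (or e^{-i n theta} if conj), n = 1..len(X)."""
    pos = np.arange(1, X.shape[0] + 1)[:, None]
    phase = np.exp(1j * pos * theta[None, :])
    return X * (phase.conj() if conj else phase)

def retention_parallel(Q, K, V, gamma, theta):
    n = Q.shape[0]
    Qr = rotate(Q, theta)                  # (X W_Q) ⊙ Θ
    Kr = rotate(K, theta, conj=True)       # (X W_K) ⊙ conj(Θ)
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]     # n - m
    D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((Qr @ Kr.T) * D) @ V           # (Q K^T ⊙ D) V

def retention_recurrent(Q, K, V, gamma, theta):
    Qr, Kr = rotate(Q, theta), rotate(K, theta, conj=True)
    S = np.zeros((Q.shape[1], V.shape[1]), dtype=complex)
    out = []
    for q, k, v in zip(Qr, Kr, V):
        S = gamma * S + np.outer(k, v)     # S_n = gamma S_{n-1} + K_n^T V_n
        out.append(q @ S)                  # o_n = Q_n S_n
    return np.array(out)

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((7, 4)) for _ in range(3))
theta = np.linspace(0.05, 0.8, 4)
par = retention_parallel(Q, K, V, 0.95, theta)
rec = retention_recurrent(Q, K, V, 0.95, theta)
```

`par` and `rec` agree to floating-point precision. The parallel version materializes an $n \times n$ score matrix, which is what the recurrent version avoids.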

2.3 Retention: Chunkwise Recurrent Form

The chunkwise form is the training-time workhorse for long sequences. The idea: split the sequence into non-overlapping chunks of length $B$ , run the parallel form inside each chunk, and pass a recurrent state $R_i$ across chunks.

Notation. The $i$ -th chunk extracts:

$Q_{[i]} = Q_{(Bi\,:\,B(i+1))}, \quad K_{[i]} = K_{(Bi\,:\,B(i+1))}, \quad V_{[i]} = V_{(Bi\,:\,B(i+1))}$

Each is a $B \times d$ matrix. The $D$ matrix is now $B \times B$ (causal decay within the chunk).

Output for chunk $i$ — two terms:

$\text{Retention}(X)_{[i]} = \underbrace{(Q_{[i]} K_{[i]}^T \odot D)\, V_{[i]}}_{\text{intra-chunk}} + \underbrace{(Q_{[i]}\, R_{i-1}) \odot \xi}_{\text{cross-chunk}}$

The intra-chunk term is a standard parallel retention over $B$ tokens. Fully parallelizable, just like section 2.2.

The cross-chunk term reads from the carried state $R_{i-1}$ (a $d \times d$ matrix summarizing all history before chunk $i$ ). The scaling matrix $\xi$ weights each position within the chunk by how far it is from the start of the chunk:

$\xi_{j} = \gamma^{j+1}, \quad j = 0, \ldots, B-1$

Position $j=0$ (start of chunk) gets weight $\gamma^1$ ; position $j=B-1$ gets $\gamma^B$ . This ensures the cross-chunk contribution decays correctly relative to the global timeline.

State update across chunks:

$R_i = K_{[i]}^T \left(V_{[i]} \odot \zeta\right) + \gamma^B\, R_{i-1}$

where $\zeta_j = \gamma^{B-j-1}$ down-weights tokens earlier in the chunk before folding them into the state (they will be further in the past for future chunks). The full previous state $R_{i-1}$ is discounted by $\gamma^B$ — one chunk length of decay.

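The state before the first chunk is zero: with chunks indexed from $i = 0$, as the slices $Q_{(Bi\,:\,B(i+1))}$ imply, that is $R_{-1} = 0$. The state has fixed size $d \times d$ regardless of total sequence length.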
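The chunk output and state update above can be sketched as follows. This is a minimal real-valued NumPy version under stated assumptions: rotations are taken as already applied to `Q` and `K`, the sequence length is a multiple of `B`, and the function names are illustrative.

```python
import numpy as np

def decay_matrix(n, gamma):
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    return np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)

def retention_parallel(Q, K, V, gamma):
    return ((Q @ K.T) * decay_matrix(Q.shape[0], gamma)) @ V

def retention_chunkwise(Q, K, V, gamma, B):
    n, d = Q.shape
    assert n % B == 0, "this sketch assumes the length is a multiple of B"
    D = decay_matrix(B, gamma)                 # B x B decay inside a chunk
    j = np.arange(B)
    xi = (gamma ** (j + 1))[:, None]           # xi_j = gamma^{j+1}
    zeta = (gamma ** (B - j - 1))[:, None]     # zeta_j = gamma^{B-j-1}
    R = np.zeros((d, V.shape[1]))              # zero state before the first chunk
    out = []
    for s in range(0, n, B):
        q, k, v = Q[s:s + B], K[s:s + B], V[s:s + B]
        intra = ((q @ k.T) * D) @ v            # parallel form inside the chunk
        cross = (q @ R) * xi                   # read the carried state
        out.append(intra + cross)
        R = k.T @ (v * zeta) + gamma ** B * R  # fold this chunk into the state
    return np.concatenate(out)

rng = np.random.default_rng(2)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
reference = retention_parallel(Q, K, V, 0.9)
results = {B: retention_chunkwise(Q, K, V, 0.9, B) for B in (1, 2, 4, 8)}
```

Every entry of `results` matches `reference`, including the extreme cases `B = 1` (purely recurrent) and `B = 8` (one chunk covering the whole sequence).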

Why this is efficient. For a sequence of length $n$ with chunk size $B$ , there are $n/B$ chunks. Each chunk costs $O(B^2 d)$ for the intra-chunk parallel term and $O(B d^2)$ for the cross-chunk recurrent term. Total: $O(n d (B + d))$ — linear in $n$ . Standard attention is $O(n^2 d)$ . The cross-chunk recurrence is sequential but only involves $d \times d$ updates, which are cheap.

	Intra-chunk	Cross-chunk	State size
Compute	$O(B^2 d)$ per chunk, parallel	$O(B d^2)$ per chunk, sequential	—
Memory	$O(B^2)$ score matrix	$O(d^2)$ fixed	$d \times d$

2.5 Retention Score Normalization

Raw retention scores can vary widely in magnitude, especially as sequence length grows or as $\gamma$ varies per head. The paper exploits a key property: GroupNorm is scale-invariant, i.e. $\text{GroupNorm}(\alpha \cdot h) = \text{GroupNorm}(h)$ for any positive scalar $\alpha$ (up to the small $\epsilon$ in the denominator). So we can insert normalizing scale factors at intermediate steps without changing the final output.
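The scale-invariance is easy to verify on a toy vector. The sketch below normalizes a single group with no learned affine parameters and no epsilon, which is a simplification of a real GroupNorm layer; the function name and the numbers are illustrative.

```python
import math

def normalize_group(h):
    """Zero-mean, unit-variance normalization of one group (no affine, eps = 0)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var) for x in h]

h = [0.5, -1.0, 2.0, 3.5]
base = normalize_group(h)
scaled = normalize_group([1000.0 * x for x in h])   # alpha = 1000
max_gap = max(abs(a - b) for a, b in zip(base, scaled))
```

`max_gap` is zero up to rounding: the factor of 1000 scales the mean and the standard deviation equally and cancels. With a nonzero epsilon the cancellation is only approximate, and a negative $\alpha$ would flip the sign of the output.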

Three stabilization steps, applied in sequence:

Step 1 — Scale the scores:

$\frac{QK^T}{\sqrt{d}}$

Standard scaled dot-product to prevent large inner products (same as Transformer).

Step 2 — Normalize the decay matrix $D$ :

$\hat{D}_{nm} = \frac{D_{nm}}{\sqrt{\sum_{i=1}^{n} D_{ni}}}$

Divide each row of $D$ by the square root of its row sum. This normalizes the total weight each output position accumulates from the past, preventing early positions (which see fewer past tokens) from having systematically different magnitudes than late positions.

Step 3 — Normalize the retention scores $R = QK^T \odot \hat{D}$ :

$\hat{R}_{nm} = \frac{R_{nm}}{\max\!\left(\left|\sum_{i=1}^{n} R_{ni}\right|,\; 1\right)}$

Divide each row of $R$ by the absolute value of its row sum, clamped to at least 1. The clamp prevents division by near-zero values when the row sum is small. This keeps the magnitude of the aggregated output bounded regardless of sequence length.

Because each step multiplies a whole row (one output position) by a positive factor, and GroupNorm normalizes each position separately, all three steps leave the final output unchanged while keeping intermediate activations numerically well-behaved for both forward and backward passes.
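The three steps and the invariance claim can be checked together. This is a minimal single-head NumPy sketch: `head_norm` stands in for GroupNorm restricted to one head (no affine, `eps = 0`), rotations are omitted, and all function names are assumptions made for the example.

```python
import numpy as np

def head_norm(Y):
    """Normalize each position over the head dimension (GroupNorm for one head, no affine)."""
    mu = Y.mean(axis=-1, keepdims=True)
    sd = Y.std(axis=-1, keepdims=True)
    return (Y - mu) / sd

def decay_matrix(n, gamma):
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    return np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)

def retention_plain(Q, K, V, gamma):
    return ((Q @ K.T) * decay_matrix(len(Q), gamma)) @ V

def retention_stabilized(Q, K, V, gamma):
    n, d = Q.shape
    D = decay_matrix(n, gamma)
    scores = (Q @ K.T) / np.sqrt(d)                          # step 1: scale scores
    D_hat = D / np.sqrt(D.sum(axis=1, keepdims=True))        # step 2: row-normalize D
    R = scores * D_hat
    row_sum = np.abs(R.sum(axis=1, keepdims=True))
    R_hat = R / np.maximum(row_sum, 1.0)                     # step 3: clamp and divide
    return R_hat @ V

rng = np.random.default_rng(3)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
plain = head_norm(retention_plain(Q, K, V, 0.97))
stable = head_norm(retention_stabilized(Q, K, V, 0.97))
```

`plain` and `stable` agree to floating-point precision even though the un-normalized outputs differ row by row. The agreement relies on every scale factor being positive and constant along a row.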

2.6 Multi-Scale Retention (MSR)

Just like multi-head attention, retention is computed in parallel across $h = d_{\text{model}} / d$ heads. The “multi-scale” part: each head gets its own $\gamma_i$ , giving different memory horizons:

$\gamma = 1 - 2^{-5 - \text{arange}(0,\, h)}$

For $h = 8$ this gives $\gamma \in \{1 - \tfrac{1}{32},\; 1 - \tfrac{1}{64},\; \ldots,\; 1 - \tfrac{1}{4096}\}$. Heads with $\gamma$ close to 1 retain information over long ranges; heads with smaller $\gamma$ focus on local context. The model uses both simultaneously.
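The schedule is simple to evaluate directly. In the sketch below, `horizons` uses $1 / (1 - \gamma)$, the sum of the geometric weights $\sum_{k \ge 0} \gamma^k$, as one rough measure of how many tokens a head effectively sees; that measure is an illustrative choice, not something the formula above prescribes.

```python
h = 8
gammas = [1 - 2.0 ** (-5 - i) for i in range(h)]   # gamma = 1 - 2^(-5 - arange(0, h))
horizons = [1 / (1 - g) for g in gammas]           # total geometric weight per head

print(gammas[0], gammas[-1])
# 0.96875 0.999755859375
print(horizons)
# [32.0, 64.0, 128.0, 256.0, 512.0, 1024.0, 2048.0, 4096.0]
```

Each successive head roughly doubles its memory horizon, from about 32 tokens to about 4096.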

Different $\gamma$ values produce activations with different variances. A single LayerNorm would conflate them. GroupNorm with $h$ groups normalizes each head independently, correcting each head’s variance before concatenation.

Full MSR block:

$\text{head}_i = \text{Retention}(X,\; \gamma_i)$

$Y = \text{GroupNorm}_h\!\left(\text{Concat}(\text{head}_1, \ldots, \text{head}_h)\right)$

$\text{MSR}(X) = \big(\text{swish}(X W_G) \odot Y\big)\, W_O$

The Swish gate $\text{swish}(X W_G)$ is a learned input-dependent signal that modulates the retention output before the final projection. Similar to SwiGLU in FFN layers, it adds non-linearity and improves model capacity.
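Putting the three equations together, a full MSR block might look like the following NumPy sketch. It is a simplified illustration under several assumptions: rotations and the score-normalization steps are left out, GroupNorm has no learned affine parameters, and the weight shapes and the name `msr` are chosen for this example rather than taken from any reference implementation.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))              # x * sigmoid(x)

def decay_matrix(n, gamma):
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    return np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)

def msr(X, W_Q, W_K, W_V, W_G, W_O, gammas, eps=1e-6):
    """X: (n, d_model). W_Q, W_K, W_V: (h, d_model, d). W_G, W_O: (d_model, d_model)."""
    n = X.shape[0]
    heads = []
    for Wq, Wk, Wv, g in zip(W_Q, W_K, W_V, gammas):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        head = ((Q @ K.T) * decay_matrix(n, g)) @ V          # Retention(X, gamma_i)
        # GroupNorm with one group per head: normalize each position over d features.
        mu = head.mean(axis=-1, keepdims=True)
        var = head.var(axis=-1, keepdims=True)
        heads.append((head - mu) / np.sqrt(var + eps))
    Y = np.concatenate(heads, axis=-1)                       # (n, d_model)
    return (swish(X @ W_G) * Y) @ W_O                        # gate, then project

rng = np.random.default_rng(4)
n, d_model, h = 5, 8, 2
d = d_model // h
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V = (0.1 * rng.standard_normal((h, d_model, d)) for _ in range(3))
W_G, W_O = (0.1 * rng.standard_normal((d_model, d_model)) for _ in range(2))
gammas = [1 - 2.0 ** (-5 - i) for i in range(h)]
out = msr(X, W_Q, W_K, W_V, W_G, W_O, gammas)
```

`out` has shape `(n, d_model)`. Because the decay matrix is causal and both the norm and the gate act per position, changing the last row of `X` leaves all earlier rows of `out` unchanged.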

Appendix A: The Three Forms Side by Side

All three forms compute the same function. The choice is purely about what is efficient given your hardware and sequence length.

What each form does

Recurrent unrolls as a step-by-step state update. Token $n$ arrives, updates $S_n$ , and immediately produces output $o_n$ . You never store past tokens — only the $d \times d$ state.

Parallel treats the entire sequence as a single matrix operation. No recurrence, no state. The $D$ matrix encodes all the decay weights at once. This is what GPUs are built for.

Chunkwise splits the sequence into blocks of size $B$ . Inside each block: parallel. Across blocks: recurrent state $R_i$ . It is a controlled interpolation between the two.

Complexity comparison

	Training compute	Training memory	Inference compute/step	Inference memory
Parallel	$O(n^2 d)$	$O(n^2)$	— (not used)	—
Recurrent	$O(n d^2)$ sequential	$O(d^2)$	$O(d^2)$	$O(d^2)$
Chunkwise	$O(nd(B+d))$	$O(B^2 + d^2)$	— (not used)	—
Transformer	$O(n^2 d)$	$O(n^2)$	$O(nd)$	$O(nd)$ KV cache

For typical $d = 256$, $B = 512$, $n = 8192$: by the compute counts above, chunkwise is roughly $10.7\times$ cheaper than parallel training, since the ratio is $n^2 d \,/\, (n d (B + d)) = n / (B + d)$.
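The ratio follows from the two training-compute entries in the table. A small arithmetic sketch, which counts only the leading terms and ignores constant factors and the projection costs:

```python
d, B, n = 256, 512, 8192

parallel_cost = n * n * d              # O(n^2 d)
chunkwise_cost = n * d * (B + d)       # O(n d (B + d))
ratio = parallel_cost / chunkwise_cost # simplifies to n / (B + d)

print(parallel_cost, chunkwise_cost)
# 17179869184 1610612736
```

Here `ratio` equals $8192 / 768 \approx 10.7$. The advantage grows linearly with $n$ for fixed $B$ and $d$.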

When to use each

Recurrent is used at inference time. You process one token at a time, keep $S_n$ in memory, output $o_n$ . No KV cache. Memory is $O(d^2)$ and fixed.

Parallel is used for short-sequence training. One shot, maximum GPU utilization, but $O(n^2)$ memory means it does not scale past a few thousand tokens.

Chunkwise is used for long-sequence training. The $B \times B$ intra-chunk matrix stays small, and the $d \times d$ cross-chunk state does not grow with sequence length. For sequences in the tens of thousands, this is the only practical option.

The key insight

The three forms are not approximations of each other — they are algebraically identical. The parallel form is the chunkwise form with $B = n$ (one chunk = whole sequence). The recurrent form is the chunkwise form with $B = 1$ (one chunk = one token). Chunkwise with intermediate $B$ just picks a sweet spot for GPU efficiency.

This differs from the usual Transformer-versus-RNN tradeoff, where you have to commit to one architecture. In RetNet you train with the chunkwise form (parallel within a chunk, recurrent across chunks) and switch to the recurrent form at inference: same weights, same math, different compute schedule.

Section 3: Code (WIP)

3.1 Retention: Recurrent Form

WIP

3.2 Retention: Parallel Form

WIP

3.3 Retention: Chunkwise Recurrent Form

WIP

3.4 Multi-Scale Retention (MSR)

WIP