Replacing fixed residual connections with learned softmax attention over depth.
Each layer selectively routes information from all preceding blocks.
In standard Transformers, the residual connection is a simple addition: h_{l+1} = h_l + F_l(h_l), where F_l is the layer's attention or MLP sublayer.
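A minimal NumPy sketch of that fixed residual; the width D, the random weights, and `layer_fn` are illustrative stand-ins, not the article's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical model width

def layer_fn(h):
    # stand-in for a Transformer sublayer (attention or MLP)
    W = rng.standard_normal((D, D)) / np.sqrt(D)
    return np.tanh(h @ W)

h = rng.standard_normal(D)
# fixed residual: the layer's output is always added with weight 1
h_next = h + layer_fn(h)
```

However useful or useless `layer_fn(h)` is at a given position, it is added with the same fixed weight of 1; nothing in this connection can down-weight it.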
Every layer contributes equally to the output, regardless of whether its representation is useful at a given position. In deep networks (100+ layers), this creates two problems:
Key insight: What if each layer could attend to all prior layers and dynamically decide how much weight to assign each one?
Collect all completed block sums b_0, …, b_{n−1}, plus the current partial block, into a value matrix.
Prevent large-magnitude blocks from dominating the attention logits. Without this normalization, deeper blocks (which accumulate more layer outputs) would receive disproportionate weight.
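These two steps, stacking the blocks into a value matrix and then normalizing them, might be sketched as follows; the L2 normalization and the shapes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 8, 4  # hypothetical width and number of completed blocks

# completed block sums b_0, ..., b_{n-1}; deeper blocks have larger norms
blocks = [rng.standard_normal(D) * (i + 1) for i in range(n)]
current = rng.standard_normal(D)  # the current partial block

V = np.stack(blocks + [current])  # value matrix, shape (n + 1, D)

# normalize each block so magnitude cannot dominate the attention logits
V_norm = V / np.linalg.norm(V, axis=-1, keepdims=True)
```

After normalization every row of `V_norm` has unit norm, so attention weights depend only on direction, not on how many layer outputs a block has accumulated.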
A learned pseudo-query w_l ∈ ℝ^D scores each block. Crucially, w_l is initialized to zero, ensuring the model starts as a standard residual network and transitions smoothly to selective routing during training.
The softmax is taken over the block/depth dimension, not the sequence dimension. This is attention over layers, not over tokens.
The output is a learned convex combination of all block representations. Each layer can choose exactly how much information to draw from each depth.
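Putting the steps together, a sketch of the full depth-attention combine: zero-initialized pseudo-query, softmax over the block dimension, convex combination of blocks. The sizes and the L2 normalization of the values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_blocks = 8, 5  # hypothetical width and number of blocks

# normalized block representations stacked as a value matrix
V = rng.standard_normal((n_blocks, D))
V = V / np.linalg.norm(V, axis=-1, keepdims=True)

w = np.zeros(D)  # learned pseudo-query, initialized to zero

logits = V @ w                     # one logit per block
a = np.exp(logits - logits.max())
a /= a.sum()                       # softmax over depth, not over tokens

out = a @ V                        # convex combination of the blocks
```

At initialization w = 0 makes every logit zero, so the attention is uniform over depth and `out` is a plain average of the blocks; selective patterns only emerge as w is trained.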
Adjust the pseudo-query weights and observe how the attention distribution over depth changes in real time. With zero weights (initialization), all sources receive equal attention. As weights evolve during training, selective patterns emerge.
Observe how depth attention patterns evolve during training. The pseudo-query vectors w_l start at zero (uniform attention) and gradually learn to attend selectively to the most useful blocks at each depth.