Transformers & LLMs
Transformers are not magic. They are very organized gossip machines: every word looks around the sentence and asks, "Which other words should I listen to?" This room makes that invisible conversation visible. Pick a word, aim the beam, and watch attention light up the words that matter.
Four levels · Beginner → Intermediate → Advanced → Expert. The rail on the left tracks where you are.
Beams = attention. The brighter the beam, the stronger the listening.
Level 1 · Beginner
Reading is not about staring at each word alone. When you read "it", your brain quietly glances back to figure out what "it" means. Transformers do the same trick with math.
A word on its own is fuzzy. "Bank" could be a riverbank or a money bank. The other words in the sentence are the clue. Attention is just a word choosing how much to listen to each of the other words so its meaning becomes clear.
"The trophy did not fit in the suitcase because it was too big." Does "it" mean the trophy or the suitcase? You look back. So does the model.
"I sat by the bank and watched the river." The word "river" tells you this is a riverbank, not a money bank.
To answer a question you do not listen to everyone equally. You listen most to the few people who actually know. Attention is that spotlight.
Attention is a spotlight, not a floodlight. Each word spends its attention budget on the few words that help it most.
Level 2 · Intermediate
How does a word decide who to listen to? Every word makes three little notes about itself: a Query, a Key, and a Value.
"What am I looking for?" The looking word holds up a wanted-poster of the kind of word it needs.
"What do I offer?" Every other word holds up a label saying what it is about.
"What do I pass on?" If a word gets listened to, this is the content it hands over to be mixed in.
A query is compared against every key. Good match → loud value. Poor match → that value is mostly ignored. The matching works like a dating app: the query swipes, the keys are profiles, and the values are what you actually get on the date.
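A minimal sketch of that matching step, assuming hand-picked 2-dimensional toy vectors (the words and numbers are illustrative, not taken from a trained model):

```python
import torch

# Hypothetical toy vectors: one query for "it", keys/values for two candidate words.
query = torch.tensor([1.0, 0.0])        # what "it" is looking for
keys = torch.tensor([[2.0, 0.0],        # "trophy": good match with the query
                     [0.0, 2.0]])       # "suitcase": poor match
values = torch.tensor([[10.0, 0.0],     # content "trophy" would pass on
                       [0.0, 10.0]])    # content "suitcase" would pass on

scores = keys @ query                   # one match score per key
weights = torch.softmax(scores, dim=0)  # scores become listening weights
mixed = weights @ values                # blend of values, dominated by "trophy"
```

The poorly matching value is not dropped entirely; it simply receives a small weight in the blend.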
One spotlight is limiting. Real transformers run several attention games in parallel, called heads. Each head learns to chase a different kind of relationship, then the results are stitched back together.
Level 3 · Advanced
Time for the actual equations. Every formula below comes with a plain-English reading, a symbol legend, and where useful a picture of what it does.
In words: take the stack of word vectors \(X\) and multiply by three learned matrices to produce queries, keys and values: three different "views" of the same words.
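Written out, assuming the usual names \(W_Q\), \(W_K\), \(W_V\) for the three learned matrices (the same names used in the code comments later in this section):

\[
Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V
\]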
In words: score every query against every key, divide by \(\sqrt{d_k}\) so the numbers stay sane, soften the scores into weights that sum to 1, then take a weighted blend of the values.
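As a single expression, assuming the standard scaled dot-product form that matches the four steps just described:

\[
\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]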
The three formulas above are really one assembly line. Press play to watch the query word "lifts" flow through all four stages: raw scores QK⊤, scaled by ÷√dk, softened by softmax, then blended into one output ·V.
In words: exponentiate each score, then normalize so they form a probability distribution. The biggest score wins the most weight, but everyone gets a slice.
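A small sketch of that recipe. The max-subtraction step is a common numerical-stability trick added here as an assumption; it does not change the result. The scores are illustrative.

```python
import torch

def softmax(scores: torch.Tensor) -> torch.Tensor:
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    shifted = scores - scores.max(dim=-1, keepdim=True).values
    exps = torch.exp(shifted)
    return exps / exps.sum(dim=-1, keepdim=True)

scores = torch.tensor([3.0, 1.0, 0.2])  # illustrative raw scores
weights = softmax(scores)               # largest score gets the largest weight
```

Every weight is strictly positive, which is the "everyone gets a slice" part.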
In words: run \(h\) attention heads in parallel, glue their outputs side by side, then mix them with one more learned matrix to get the final result.
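One way to sketch this in PyTorch, assuming the common trick of computing all heads with one projection and then reshaping. The sizes and the random stand-in weights (including the output matrix, called `Wo` here) are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, D, H = 2, 5, 16, 4   # batch, tokens, model dim, heads (illustrative)
d_head = D // H

X = torch.randn(B, T, D)
# Random stand-ins for the learned projection matrices.
Wq, Wk, Wv, Wo = (torch.randn(D, D) / D**0.5 for _ in range(4))

def split_heads(t: torch.Tensor) -> torch.Tensor:
    # (B, T, D) -> (B, H, T, d_head): each head sees its own slice of the features.
    return t.view(B, T, H, d_head).transpose(1, 2)

Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
weights = F.softmax(Q @ K.transpose(-2, -1) / d_head**0.5, dim=-1)  # (B, H, T, T)
heads = weights @ V                                # (B, H, T, d_head)
concat = heads.transpose(1, 2).reshape(B, T, D)    # glue the heads side by side
out = concat @ Wo                                  # final mixing matrix
```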
In words: attention treats the sentence as a bag of words with no order, so we add sine/cosine waves of different frequencies to each position. Together they form a unique "fingerprint" for every slot, letting the model tell "dog bites man" from "man bites dog".
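A sketch of the fingerprint table, assuming the 2017 convention of sine on even dimensions, cosine on odd dimensions, a base of 10000, and an even model dimension. The function name is hypothetical.

```python
import math
import torch

def sinusoidal_encoding(num_positions: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (T, 1)
    even = torch.arange(0, d_model, 2, dtype=torch.float32)              # 0, 2, 4, ...
    freq = torch.exp(-math.log(10000.0) * even / d_model)                # 1 / 10000^(k/d)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * freq)  # odd dimensions
    return pe

pe = sinusoidal_encoding(num_positions=6, d_model=8)  # one row per position
```

Each row is added to the token vector at that position, so identical words in different slots end up with different inputs.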
The 2017 sinusoidal encoding above is added to the input once at the bottom. Most open large models today (Llama, Mistral, Qwen, DeepSeek, and similar GPT-style decoder stacks) instead use Rotary Position Embedding (RoPE): it leaves the values alone and rotates each query and key by an angle proportional to its position, inside every attention layer.
In words: after rotating the query at position \(m\) and the key at position \(n\), their dot product depends only on the gap \(m-n\), not on where the pair sits in the sentence. That is exactly the "relative position" property attention wants, and it makes it easier to stretch models to longer contexts than they trained on.
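The gap-only property can be checked numerically. This sketch assumes the interleaved pairing (dimensions 0-1, 2-3, ...) and a base of 10000; real implementations differ in how they pair dimensions, and the helper name `rotate` is hypothetical.

```python
import torch

def rotate(vec: torch.Tensor, position: int, base: float = 10000.0) -> torch.Tensor:
    # Treat consecutive pairs (x0, x1), (x2, x3), ... as 2-D points and rotate
    # pair j by the angle position * base^(-2j/d).
    d = vec.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float64) / d)
    angles = position * freqs
    cos, sin = torch.cos(angles), torch.sin(angles)
    x, y = vec[0::2], vec[1::2]
    out = torch.empty_like(vec)
    out[0::2] = x * cos - y * sin
    out[1::2] = x * sin + y * cos
    return out

torch.manual_seed(0)
q = torch.randn(8, dtype=torch.float64)
k = torch.randn(8, dtype=torch.float64)

# Same gap (m - n = 3) at two very different places in the sequence.
score_near = rotate(q, 5) @ rotate(k, 2)
score_far = rotate(q, 105) @ rotate(k, 102)
```

The two scores agree up to floating-point error, because the rotations cancel except for the difference in positions.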
Want the geometry and the proof? See the RoPE card in the Frontier tier below.
Level 4 · Expert
Attention is one sublayer. A transformer block wraps it with residual connections, LayerNorm, and a feed-forward network, then stacks the whole thing many times.
The residual ("+") lets the original signal skip past each sublayer, so gradients flow cleanly through dozens of stacked blocks. LayerNorm rescales each token vector to keep the numbers stable. Without them, deep transformers barely train.
The original block did Post-LN (normalize after adding the residual). Modern LLMs put the norm before each sublayer (Pre-LN), which keeps the residual highway clean and lets very deep stacks train stably with far less reliance on warmup tricks. Most also swap LayerNorm for the cheaper RMSNorm.
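The two orderings differ by where the norm sits relative to the "+". A minimal sketch, using hypothetical helper names and a do-nothing sublayer to make the "clean highway" visible:

```python
import torch
import torch.nn as nn

def post_ln_step(x, sublayer, norm):
    # Original 2017 ordering: add the residual, then normalize the sum.
    return norm(x + sublayer(x))

def pre_ln_step(x, sublayer, norm):
    # Modern ordering: normalize the sublayer input, leave the residual path untouched.
    return x + sublayer(norm(x))

torch.manual_seed(0)
D = 16
x = torch.randn(2, 5, D) * 5.0 + 2.0
norm = nn.LayerNorm(D)
silent = lambda t: torch.zeros_like(t)    # hypothetical sublayer that contributes nothing

pre_out = pre_ln_step(x, silent, norm)    # x passes through unchanged
post_out = post_ln_step(x, silent, norm)  # x is still rescaled by the norm
```

With Pre-LN the input survives the block exactly when the sublayer is silent; with Post-LN every block rewrites the signal even then.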
In words: divide the vector by its own root-mean-square so its scale is fixed, then multiply by a learned gain \(g\). Unlike LayerNorm it never subtracts the mean: one fewer statistic, slightly faster, and in practice just as stable.
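A sketch of the operation described above. The small `eps` inside the square root is an assumption added for numerical safety, and the function name is hypothetical.

```python
import torch

def rms_norm(x: torch.Tensor, gain: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Divide by the root-mean-square over the last dimension, then apply the gain.
    # Note there is no mean subtraction, unlike LayerNorm.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * gain

torch.manual_seed(0)
x = torch.randn(2, 5, 16) * 7.0 + 3.0  # deliberately off-scale, non-zero mean
g = torch.ones(16)                     # gain initialized to 1
y = rms_norm(x, g)                     # each token vector now has RMS close to 1
```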
Used in Llama, Mistral, Gemma and most 2024-2026 open models. The Add & Norm boxes in the diagram above still apply: just read them as Pre-RMSNorm then sublayer then add.
The diagram's "Feed-Forward" box is a per-token MLP. The classic version is \(\operatorname{FFN}(x)=\operatorname{ReLU}(xW_1)W_2\). Modern models use a gated variant (SwiGLU), where one branch decides how much of the other branch to let through.
In words: project \(x\) two ways, pass one through the smooth Swish gate, multiply the two element-wise (\(\odot\)), then project back down. The element-wise product acts as a gate that lets the layer suppress or amplify each hidden unit, which tends to train better than a plain ReLU MLP at the same parameter budget.
Because the gate adds a third matrix, gated FFNs use a hidden width near \(\tfrac{8}{3}d\) instead of \(4d\) to keep the parameter count matched.
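The parameter match can be checked directly. This sketch assumes bias-free linear layers and an illustrative width \(d=48\) (chosen so that \(\tfrac{8}{3}d\) is an integer). The class and attribute names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, hidden, bias=False)  # branch passed through Swish
        self.w_up = nn.Linear(d_model, hidden, bias=False)    # branch being gated
        self.w_down = nn.Linear(hidden, d_model, bias=False)  # project back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

d = 48
gated = SwiGLU(d, hidden=8 * d // 3)  # three matrices, hidden width 128
classic = nn.Sequential(              # two matrices, hidden width 4d = 192
    nn.Linear(d, 4 * d, bias=False), nn.ReLU(), nn.Linear(4 * d, d, bias=False)
)
y = gated(torch.randn(2, 5, d))
```

Both layers hold \(8d^2\) weights: \(3 \cdot d \cdot \tfrac{8}{3}d\) for the gated version and \(2 \cdot d \cdot 4d\) for the classic one.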
Every token can look left and right. Great for understanding a whole input (e.g. BERT-style models, the encoder half of translation).
During generation a token must not peek at words that come after it; that would be cheating, since those words do not exist yet. The causal mask enforces this.
In words: before softmax, set every "future" score to negative infinity. \(e^{-\infty}=0\), so future tokens get exactly zero attention weight.
Tip: turn on Causal mask in the Beam Runner above and watch the upper-right triangle of the heatmap go dark.
In words: attention compares every token with every other token, so the work grows with the square of the sequence length \(n\). Double the context and you roughly quadruple the cost, which is exactly why very long context is expensive and hard.
The KV cache trick stores past keys/values, so each new token computes its attention once against the cached prefix instead of recomputing everything, turning the per-step cost from quadratic to linear in the sequence length.
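A single-head sketch of the idea, assuming random stand-in weights and plain Python lists as the cache. It checks that step-by-step decoding with cached keys/values reproduces full causal attention.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, D = 6, 8  # tokens, model dim (illustrative)
X = torch.randn(T, D)
Wq, Wk, Wv = (torch.randn(D, D) / math.sqrt(D) for _ in range(3))

# Reference: full causal attention over the whole sequence at once.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
full = F.softmax((Q @ K.T / math.sqrt(D)).masked_fill(mask, float("-inf")), dim=-1) @ V

# Incremental decoding: append one key/value row per step and reuse the rest.
k_cache, v_cache, steps = [], [], []
for t in range(T):
    x_t = X[t]
    k_cache.append(x_t @ Wk)
    v_cache.append(x_t @ Wv)
    K_t, V_t = torch.stack(k_cache), torch.stack(v_cache)   # (t + 1, D)
    w = F.softmax(K_t @ (x_t @ Wq) / math.sqrt(D), dim=-1)  # one row of attention
    steps.append(w @ V_t)
cached = torch.stack(steps)
```

No mask is needed in the loop: at step `t` the cache only contains positions up to `t`, so the future is simply absent.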
The KV cache fixes compute, but its memory still grows with sequence length × number of key/value heads in every layer. The fix used by Llama 3, Mistral, and most 2026 models: keep many query heads but let groups of them share one set of key/value heads.
Multi-query attention (MQA): all \(h\) query heads share a single K/V head. Smallest cache, but can lose a little quality.
Grouped-query attention (GQA): use \(g\) K/V groups with \(1 < g < h\). The sweet spot: nearly full quality at a fraction of the cache.
In words: the cache stores one K and one V vector per group per token. Drop from \(g=h\) (vanilla multi-head) toward \(g=1\) (MQA) and the cache shrinks by the same factor, the lever that makes long-context serving affordable.
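A back-of-the-envelope sketch of that lever. The configuration is assumed for illustration (not a specific model's published numbers), with 2 bytes per stored value as in 16-bit formats. The function name is hypothetical.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    # Leading 2 = one K and one V vector per K/V head, per token, per layer.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value

# Illustrative configuration with h = 32 query heads.
cfg = dict(tokens=8192, layers=32, head_dim=128)
mha = kv_cache_bytes(kv_heads=32, **cfg)  # g = h: vanilla multi-head
gqa = kv_cache_bytes(kv_heads=8, **cfg)   # g = 8 groups
mqa = kv_cache_bytes(kv_heads=1, **cfg)   # g = 1: multi-query
```

Under these assumptions the cache for one sequence drops from 4 GiB to 1 GiB with 8 groups, and to 128 MiB with a single shared K/V head.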
The \(O(n^2 d)\) cost above is the compute, but the real bottleneck on a GPU is usually reading and writing the full \(n\times n\) score grid in slow off-chip memory. FlashAttention avoids that: it streams over the sequence in tiles and computes softmax online, so the full matrix is never materialized outside fast on-chip memory. The math is identical; only the memory traffic changes.
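The online-softmax idea can be sketched in plain PyTorch for a single query vector. This illustrates only the arithmetic, not the fused GPU kernel; the function name and tile size are hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def tiled_attention(q, K, V, tile=4):
    # Visit keys/values one tile at a time, keeping only a running max,
    # a running normalizer, and a running (unnormalized) output.
    scale = 1.0 / math.sqrt(q.shape[-1])
    running_max = torch.tensor(float("-inf"))
    normalizer = torch.tensor(0.0)
    acc = torch.zeros(V.shape[-1])
    for start in range(0, K.shape[0], tile):
        s = (K[start:start + tile] @ q) * scale        # scores for this tile only
        new_max = torch.maximum(running_max, s.max())
        correction = torch.exp(running_max - new_max)  # rescale earlier partial sums
        p = torch.exp(s - new_max)
        normalizer = normalizer * correction + p.sum()
        acc = acc * correction + p @ V[start:start + tile]
        running_max = new_max
    return acc / normalizer

torch.manual_seed(0)
n, d = 10, 8
q, K, V = torch.randn(d), torch.randn(n, d), torch.randn(n, d)
reference = F.softmax(K @ q / math.sqrt(d), dim=-1) @ V  # materializes all n scores
tiled = tiled_attention(q, K, V)                         # never holds more than one tile
```

The correction factor is what makes the result exact: whenever a larger score appears, earlier partial sums are rescaled to the new maximum.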
FlashAttention-2 (2023) and FlashAttention-3 (2024, tuned for Hopper GPUs) are the production default. In PyTorch you can call torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True), which dispatches to a fused kernel such as FlashAttention when the device, dtype, and shapes allow it, with no manual QK⊤ / softmax / @V needed.
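A quick check that the fused call matches the hand-written math, assuming PyTorch 2.0 or newer and illustrative tensor sizes. Which kernel actually runs depends on the device and dtype.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, d_head = 2, 4, 5, 8  # batch, heads, tokens, head dim (illustrative)
q, k, v = (torch.randn(B, H, T, d_head) for _ in range(3))

# One call: scaling, causal masking, softmax, and the value blend.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Manual reference: the same math written out step by step.
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_head)
manual = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v
```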
import math, torch
import torch.nn.functional as F
B, T, D = 2, 5, 16 # batch, tokens, model dim
X = torch.randn(B, T, D) # input word vectors
Wq = torch.randn(D, D) # learned projections
Wk = torch.randn(D, D)
Wv = torch.randn(D, D)
Q = X @ Wq # Q = X W_Q
K = X @ Wk # K = X W_K
V = X @ Wv # V = X W_V
scores = Q @ K.transpose(-2, -1) / math.sqrt(D) # QK^T / sqrt(d_k)
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf")) # causal mask
weights = F.softmax(scores, dim=-1) # rows sum to 1
out = weights @ V # weighted blend of values
X - token vectors after embedding + positional encoding.
Q, K, V - three learned views of the same words.
scores - \(QK^\top/\sqrt{d_k}\): how well each query matches each key.
mask - upper triangle set to \(-\infty\) so a token can't see the future.
weights @ V - softmax weights mix the value vectors into the output.
import torch
import torch.nn as nn
D, H = 16, 4 # model dim, number of heads
layer = nn.TransformerEncoderLayer(
    d_model=D,
    nhead=H, # multi-head attention
    dim_feedforward=4 * D, # the FFN sublayer
    batch_first=True,
)
X = torch.randn(2, 5, D) # (batch, tokens, dim)
# causal mask for decoder-style use (True = block)
T = X.size(1)
causal = torch.triu(torch.ones(T, T), diagonal=1).bool()
out = layer(X, src_mask=causal) # attention + add&norm + FFN + add&norm
nn.TransformerEncoderLayer - one ready-made block: multi-head attention, residual + LayerNorm, feed-forward, residual + LayerNorm.
nhead=H - runs H attention heads in parallel and concatenates them.
dim_feedforward - the hidden size of the per-token FFN (usually \(4\times d\)).
src_mask=causal - pass the causal mask to forbid looking ahead, turning the encoder layer into a decoder-style block.
Frontier · research-grade
The four levels above build the transformer from intuition to the full block. This tier goes further: the pieces that make real LLMs work and the active research that pushes them. Each topic is a guided lesson with step-through proofs, a worked numerical example, a computational visualization, and citations. Pick a card.