Transformers & LLMs

Aim the attention beam and find out why words keep staring at other words.

Transformers are not magic. They are very organized gossip machines: every word looks around the sentence and asks, "Which other words should I listen to?" This room makes that invisible conversation visible. Pick a word, aim the beam, and watch attention light up the words that matter.

Four levels · Beginner → Intermediate → Advanced → Expert. The rail on the left tracks where you are.

The robot lifts the box

Beams = attention. The brighter the beam, the stronger the listening.

Level 1 · Beginner

Attention = every word deciding which other words to look at.

Reading is not about staring at each word alone. When you read "it", your brain quietly glances back to figure out what "it" means. Transformers do the same trick with math.

Explanation

Words borrow meaning from their neighbors.

A word on its own is fuzzy. "Bank" could be a riverbank or a money bank. The other words in the sentence are the clue. Attention is just a word choosing how much to listen to each of the other words so its meaning becomes clear.

Resolving "it"

"The trophy did not fit in the suitcase because it was too big." Does "it" mean the trophy or the suitcase? You look back. So does the model.

"The bank"

"I sat by the bank and watched the river." The word "river" tells you this is a riverbank, not a money bank.

Classroom

To answer a question you do not listen to everyone equally. You listen most to the few people who actually know. Attention is that spotlight.

Interactive game · Aim the Beam

Click a word to make it the looker.

Worked example

Walk through "The robot lifts the box."

The word lifts is the looker. It asks: who is doing the lifting, and what gets lifted?
It aims a bright beam at "robot" (the actor) and a bright beam at "box" (the thing lifted).
The little words "the" get dim beams; they carry almost no meaning here.
"lifts" updates itself into "lifts (by a robot, a box)". Now it knows the full action.

Takeaway

Attention is a spotlight, not a floodlight. Each word spends its attention budget on the few words that help it most.
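
The "The robot lifts the box" walkthrough can be sketched in a few lines of plain Python. The scores below are made-up numbers chosen for illustration (a real model computes them); the only real machinery is softmax, which turns the scores into an attention budget that sums to 1.

```python
import math

# Hypothetical match scores for the looker "lifts" (illustrative, not from a trained model).
words = ["The", "robot", "lifts", "the", "box"]
scores = [0.1, 2.0, 0.5, 0.1, 2.0]

# Softmax: exponentiate each score, then normalize so the weights sum to 1.
exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

# Draw each beam as a bar: longer bar = brighter beam.
for word, w in zip(words, weights):
    print(f"{word:>6}  {'#' * round(w * 40):<20} {w:.2f}")
```

"robot" and "box" get the long bars and the two "the" tokens get short ones: a spotlight, not a floodlight.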

Level 2 · Intermediate

Query, Key, Value: how a word actually aims its beam.

How does a word decide who to listen to? Every word makes three little notes about itself: a Query, a Key, and a Value.

Explanation

Three notes every word writes.

Q · Query

"What am I looking for?" The looking word holds up a wanted-poster of the kind of word it needs.

K · Key

"What do I offer?" Every other word holds up a label saying what it is about.

V · Value

"What do I pass on?" If a word gets listened to, this is the content it hands over to be mixed in.

A query is compared against every key. Good match → loud value. Poor match → that value is mostly ignored. The matching works like a dating app: the query swipes, the keys are profiles, and the values are what you actually get on the date.

Each word's vector is multiplied by three learned weight matrices to make its Q, K and V.

Interactive · Attention heatmap

Edit a sentence, read the heatmap

Each row = a word looking; each square = how hard it listens.

Explanation · Multi-head

Many beams at once: multi-head attention.

One spotlight is limiting. Real transformers run several attention games in parallel, called heads. Each head learns to chase a different kind of relationship, then the results are stitched back together.

Interactive · Head Parade

Head 1 · Actor finder: Links an action to who did it.

Head 2 · Neighbor net: Tracks nearby words for grammar.

Head 3 · Lookback hook: Points to the previous word.

Head 4 · Descriptor glue: Binds an adjective to its noun.

Worked example

"The cat sat on the mat": who does "sat" listen to?

Query of "sat": "I'm a verb, I want my subject and my place."
Keys: "cat" advertises "I'm a noun/animal"; "mat" advertises "I'm a place"; "the" advertises almost nothing.
The query matches "cat" and "mat" strongly, "the" weakly.
Values of "cat" and "mat" get mixed into "sat", so it becomes "sat (the cat, on the mat)".
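
The four steps above can be sketched with tiny hand-made vectors. Everything numeric here is an assumption for illustration: the two dimensions stand for "noun-ness" and "place-ness", and only three of the sentence's words are included to keep it short. A trained model learns these vectors instead.

```python
import math

# Hypothetical 2-d keys: [noun-ness, place-ness]. Illustrative numbers only.
keys = {"the": [0.1, 0.1], "cat": [2.0, 0.2], "mat": [0.3, 2.0]}
# Hypothetical values: the content each word hands over if it is listened to.
values = {"the": [0.0, 0.0], "cat": [1.0, 0.0], "mat": [0.0, 1.0]}
# Query of "sat": "I want my subject and my place."
query = [1.0, 1.0]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Step 1-3: compare the query with every key, then soften into weights.
scores = {word: dot(query, key) for word, key in keys.items()}
total = sum(math.exp(s) for s in scores.values())
weights = {word: math.exp(s) / total for word, s in scores.items()}

# Step 4: blend the values by their weights to get the update for "sat".
mixed = [sum(weights[w] * values[w][i] for w in keys) for i in range(2)]

print({w: round(p, 2) for w, p in weights.items()})
print([round(v, 2) for v in mixed])
```

"cat" and "mat" take almost all of the weight, so the mixed vector is mostly their content, with barely any "the" in it.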

Level 3 · Advanced

The real math: scaled dot-product attention.

Time for the actual equations. Every formula below comes with a plain-English reading, a symbol legend, and where useful a picture of what it does.

Formula · Projections

\[ Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V \]

In words: take the stack of word vectors \(X\) and multiply by three learned matrices to produce queries, keys and values: three different "views" of the same words.

\(X\) - matrix of input word vectors, one row per token (shape \(n\times d\))
\(W_Q, W_K, W_V\) - learned projection matrices (the model's knobs)
\(Q, K, V\) - the resulting query, key and value matrices
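
As a rough sketch of the projection step, here it is with plain Python lists so the shapes are easy to follow. The sizes and the random matrices are stand-ins (assumptions), since real weights come from training.

```python
import random

def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix, both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, d_k = 5, 8, 4                  # tokens, model width, projection width (assumed sizes)

X = random_matrix(n, d, rng)         # one row per token
W_Q = random_matrix(d, d_k, rng)     # random stand-ins for the learned matrices
W_K = random_matrix(d, d_k, rng)
W_V = random_matrix(d, d_k, rng)

# Three different "views" of the same words.
Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
print(len(Q), len(Q[0]))             # n rows, d_k columns
```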

Formula · The core equation

\[ \operatorname{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \]

In words: score every query against every key, divide by \(\sqrt{d_k}\) so the numbers stay sane, soften the scores into weights that sum to 1, then take a weighted blend of the values.

\(QK^{\top}\) - all query-key match scores at once (a square \(n\times n\) grid)
\(\sqrt{d_k}\) - scaling factor; \(d_k\) is the key dimension. Big vectors make big dot products, so we shrink them
\(\operatorname{softmax}\) - turns each row of scores into probabilities that add to 1
\(V\) - the value content that gets mixed by those weights
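
A minimal sketch of the core equation, written with lists instead of tensors so each stage is visible. The tiny Q, K and V matrices are hypothetical inputs, not model outputs.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)                                   # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))               # every query against every key (n x n)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]     # each row sums to 1
    return matmul(weights, V), weights             # weighted blend of the values

# Hypothetical toy inputs: 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

out, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 2) for w in row])
```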

Watch it move · The attention pipeline

The formulas in this level are really one assembly line. Press play to watch the query word "lifts" flow through all four stages: raw scores QK^⊤, scaled by ÷√d_k, softened by softmax, then blended into one output ·V.

QK^⊤ → ÷√d_k → softmax → ·V

Formula · Softmax

\[ \operatorname{softmax}(z)_i=\frac{e^{z_i}}{\sum_{j} e^{z_j}} \]

In words: exponentiate each score, then normalize so they form a probability distribution. The biggest score wins the most weight, but everyone gets a slice.

\(z_i\) - the raw match score for key \(i\)
\(e^{z_i}\) - exponentiation; makes everything positive and exaggerates gaps
denominator - the sum over all keys, so the weights total 1
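
A small sketch of softmax that also shows why scaling matters. It subtracts the largest score before exponentiating, which is mathematically the same formula but avoids overflow; the example inputs are arbitrary.

```python
import math

def softmax(z):
    m = max(z)                          # shifting by the max leaves the result unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 4) for p in softmax([1.0, 2.0, 3.0])])      # roughly [0.09, 0.2447, 0.6652]
print([round(p, 4) for p in softmax([10.0, 20.0, 30.0])])   # same ordering, but the winner takes nearly all
print([round(p, 4) for p in softmax([1000.0, 1001.0])])     # huge scores, no overflow
```

The second line is the reason for the \(\sqrt{d_k}\) division: when scores grow large, the gaps between them grow too, and softmax collapses onto a single key.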

Interactive · Beam Runner playground

QK^⊤ → softmax → weighted V

Formula · Multi-head concat

\[ \operatorname{MultiHead}(Q,K,V)=\operatorname{Concat}(\operatorname{head}_1,\dots,\operatorname{head}_h)\,W_O \]

In words: run \(h\) attention heads in parallel, glue their outputs side by side, then mix them with one more learned matrix to get the final result.

\(\operatorname{head}_i\) - attention computed with that head's own \(W_Q^i,W_K^i,W_V^i\)
\(h\) - number of heads (e.g. 8, 12, 96)
\(\operatorname{Concat}\) - lay the head outputs next to each other
\(W_O\) - output projection that blends the heads back into one vector
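
Here is a rough list-based sketch of the multi-head formula. The sizes are assumptions, the weights are random stand-ins for learned ones, and each head works at width \(d/h\), which is a common convention rather than something the formula requires.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    head_outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concat: lay the head outputs side by side, token by token.
    concat = [
        [v for head_out in head_outputs for v in head_out[t]]
        for t in range(len(X))
    ]
    return matmul(concat, W_O)          # blend the heads back into one vector per token

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

n, d, h = 4, 8, 2                       # tokens, model width, heads (assumed sizes)
d_head = d // h
X = rand(n, d)
heads = [(rand(d, d_head), rand(d, d_head), rand(d, d_head)) for _ in range(h)]
W_O = rand(h * d_head, d)

out = multi_head(X, heads, W_O)
print(len(out), len(out[0]))            # 4 tokens, 8 dims each
```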

Formula · Positional encoding

\[ PE_{(pos,\,2i)}=\sin\!\left(\frac{pos}{10000^{\,2i/d}}\right),\qquad PE_{(pos,\,2i+1)}=\cos\!\left(\frac{pos}{10000^{\,2i/d}}\right) \]

In words: attention treats the sentence as a bag of words with no order, so we add sine/cosine waves of different frequencies to each position. Together they form a unique "fingerprint" for every slot, letting the model tell "dog bites man" from "man bites dog".

\(pos\) - the position of the token (0, 1, 2, ...)
\(i\) - which dimension pair of the encoding
\(d\) - the model's vector size
\(10000^{2i/d}\) - sets the wavelength; low \(i\) wiggles fast, high \(i\) wiggles slowly
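
The formula translates almost directly into code. This sketch assumes an even \(d\) and builds the table for the first 50 positions; the sizes are arbitrary.

```python
import math

def positional_encoding(n_positions, d):
    """Return an n_positions x d table of sinusoidal encodings (d assumed even)."""
    pe = [[0.0] * d for _ in range(n_positions)]
    for pos in range(n_positions):
        for i in range(d // 2):
            angle = pos / (10000 ** (2 * i / d))   # low i: fast wiggle, high i: slow wiggle
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(50, 8)
print([round(v, 3) for v in pe[0]])    # position 0: sin(0)=0 and cos(0)=1 in every pair
print([round(v, 3) for v in pe[1]])
```

Every row of the table is different, which is the "fingerprint" property: the row is added to the token's vector so the model can tell slot 3 from slot 30.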

[Interactive: Positional waves, by dimension pair]

Latest · What 2026 LLMs use

From adding waves to rotating queries: RoPE.

The 2017 sinusoidal encoding above is added to the input once at the bottom. Almost every large model today (Llama, Mistral, Qwen, DeepSeek, GPT-style stacks) instead uses Rotary Position Embedding (RoPE): it leaves the values alone and rotates each query and key by an angle proportional to its position, inside every attention layer.

\[ \langle R_{\theta m}\,q,\; R_{\theta n}\,k\rangle \;=\; g\!\left(q,\,k,\;m-n\right) \]

In words: after rotating the query at position \(m\) and the key at position \(n\), their dot product depends only on the gap \(m-n\), not on where the pair sits in the sentence. That is exactly the "relative position" property attention wants, and it is the basis for the scaling tricks that stretch models to longer contexts than they trained on.

\(R_{\theta m}\) - a rotation matrix; the angle grows with position \(m\)
\(m-n\) - the relative distance between query and key
\(g(\cdot)\) - a function of \(q,k\) and that distance alone
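
The relative-position property is easy to check numerically. This is a two-dimensional sketch with one assumed angle step `theta` and made-up vectors; real RoPE applies the same rotation to every pair of dimensions, each with its own frequency.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-d vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return [x * c - y * s, x * s + y * c]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

theta = 0.1                      # assumed angle per position step
q, k = [1.0, 2.0], [0.5, -1.0]   # hypothetical query and key

def rope_score(m, n):
    """Dot product of the query rotated to position m and the key rotated to position n."""
    return dot(rotate(q, theta * m), rotate(k, theta * n))

# Same gap (m - n = 2) at three different places in the sequence.
print(rope_score(3, 1), rope_score(10, 8), rope_score(103, 101))
# A different gap gives a different score.
print(rope_score(3, 2))
```

The first three numbers agree up to floating-point rounding, because rotating both vectors by the same extra angle leaves their dot product unchanged.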

Want the geometry and the proof? See the RoPE card in the Frontier tier below.

Level 4 · Expert

The full transformer block, masking, complexity, and code.

Attention is one sublayer. A transformer block wraps it with residual connections, LayerNorm, and a feed-forward network, then stacks the whole thing many times.

Explanation · The block

One transformer block, floor by floor.

Read bottom to top. Skip lines are residual connections.

Block walkthrough

Why residuals + LayerNorm?

The residual ("+") lets the original signal skip past each sublayer, so gradients flow cleanly through dozens of stacked blocks. LayerNorm rescales each token vector to keep the numbers stable. Without them, deep transformers barely train.

Latest · Modern normalization

The "Norm" box, updated: Pre-LN and RMSNorm.

The original block did Post-LN (normalize after adding the residual). Modern LLMs put the norm before each sublayer (Pre-LN), which keeps the residual highway clean and makes very deep stacks far easier to train, with much less reliance on warmup tricks. Most also swap LayerNorm for the cheaper RMSNorm.

\[ \operatorname{RMSNorm}(x) \;=\; \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^{2}+\epsilon}}\;\odot\; g \]

In words: divide the vector by its own root-mean-square so its scale is fixed, then multiply by a learned gain \(g\). Unlike LayerNorm it never subtracts the mean: one fewer statistic, slightly faster, and in practice just as stable.

\(x\) - the token vector (length \(d\))
\(\tfrac{1}{d}\sum x_i^2\) - mean square of its entries; its root is the RMS
\(g\) - learned per-dimension gain (the only parameters)
\(\epsilon\) - tiny constant so we never divide by zero
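
A minimal sketch of RMSNorm on a single token vector. The input numbers and the value of `eps` are arbitrary choices for illustration, and the gain is left at 1 as if untrained.

```python
import math

def rms_norm(x, g, eps=1e-6):
    """Divide x by its root-mean-square, then apply the per-dimension gain g."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * gi for v, gi in zip(x, g)]

x = [3.0, -4.0, 0.0, 5.0]
g = [1.0] * len(x)               # gain of 1 everywhere (an untrained stand-in)

y = rms_norm(x, g)
rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print([round(v, 3) for v in y], round(rms_after, 4))

# Scaling the input by 10 gives almost exactly the same output: the scale is fixed.
y_big = rms_norm([10 * v for v in x], g)
print([round(v, 3) for v in y_big])
```

Note that the signs and the relative sizes of the entries survive; only the overall length of the vector is normalized, and no mean is subtracted.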

Used in Llama, Mistral, Gemma and most 2024-2026 open models. The Add & Norm boxes in the diagram above still apply: just read them as Pre-RMSNorm then sublayer then add.

Latest · Inside the feed-forward box

What the FFN actually is now: SwiGLU.

The diagram's "Feed-Forward" box is a per-token MLP. The classic version is \(\operatorname{FFN}(x)=\operatorname{ReLU}(xW_1)W_2\). Modern models use a gated variant (SwiGLU), where one branch decides how much of the other branch to let through.

\[ \operatorname{FFN}_{\text{SwiGLU}}(x) \;=\; \bigl(\operatorname{Swish}(xW_1)\;\odot\; xV\bigr)\,W_2,\qquad \operatorname{Swish}(z)=z\,\sigma(z) \]

In words: project \(x\) two ways; pass one through the smooth Swish function to form the gate, multiply the two element-wise, then project back down. The gate lets the layer suppress or amplify each hidden unit, which tends to train better than a plain ReLU MLP at the same parameter budget.

\(W_1, V\) - two up-projections (the "gate" and "value" branches, respectively)
\(\odot\) - element-wise (Hadamard) product of the two branches
\(\sigma\) - the logistic sigmoid; \(\operatorname{Swish}(z)=z\,\sigma(z)\) is a smooth, ReLU-like activation
\(W_2\) - down-projection back to model width \(d\)

Because the gate adds a third matrix, gated FFNs use a hidden width near \(\tfrac{8}{3}d\) instead of \(4d\) to keep the parameter count matched.
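
A sketch of the gated FFN on one token, followed by the parameter arithmetic behind the \(\tfrac{8}{3}d\) rule. The tiny weight matrices and the width `d = 4096` are assumptions for illustration, and biases are ignored.

```python
import math

def matvec(x, W):
    """Row vector x (length a) times matrix W (a x b, list of rows) gives length b."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

def swish(z):
    return z / (1.0 + math.exp(-z))          # same as z * sigmoid(z)

def ffn_swiglu(x, W1, V, W2):
    gate = [swish(z) for z in matvec(x, W1)]      # gate branch
    value = matvec(x, V)                          # value branch
    hidden = [a * b for a, b in zip(gate, value)] # element-wise product
    return matvec(hidden, W2)                     # project back down to width d

# Hypothetical tiny weights: d = 2, hidden width = 3.
x = [1.0, 2.0]
W1 = [[0.5, -1.0, 0.0], [0.25, 0.5, 1.0]]
V = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(ffn_swiglu(x, W1, V, W2))

# Parameter budget: two matrices at width 4d versus three at width 8d/3.
d = 4096
classic_params = 2 * d * (4 * d)
gated_params = 3 * d * (8 * d // 3)          # hidden width rounded down to an integer
print(classic_params, gated_params)
```

The two counts differ only by the rounding of the hidden width, which is the sense in which the budgets are matched.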

Explanation · Two doors

Encoder vs decoder, and the causal mask.

Encoder: reads both ways

Every token can look left and right. Great for understanding a whole input (e.g. BERT-style models, the encoder half of translation).

Decoder: looks left only

During generation a token must not peek at words that come after it; that would be cheating, since those words do not exist yet. The causal mask enforces this.

\[ \text{scores}_{ij} \;\leftarrow\; -\infty \quad\text{for all } j > i \]

In words: before softmax, set every "future" score to negative infinity. \(e^{-\infty}=0\), so future tokens get exactly zero attention weight.

\(i\) - the query position (the word doing the looking)
\(j\) - the key position (a word being looked at)
\(j > i\) - "in the future" relative to \(i\); masked out
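
The masking rule can be sketched on a small score grid. All scores are set equal here (an arbitrary choice) so the effect of the mask is easy to read off the weights.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]   # math.exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Set every score with key position j > query position i to -infinity."""
    n = len(scores)
    return [
        [scores[i][j] if j <= i else -math.inf for j in range(n)]
        for i in range(n)
    ]

scores = [[0.0] * 4 for _ in range(4)]       # 4 tokens, all matches equally good
weights = [softmax(row) for row in causal_mask(scores)]

for row in weights:
    print([round(w, 2) for w in row])
```

Each row spreads its weight evenly over itself and the tokens before it, and the upper-right triangle is exactly zero.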

Tip: turn on Causal mask in the Beam Runner above and watch the upper-right triangle of the heatmap go dark.

Formula · Complexity

\[ \text{cost} \;=\; O\!\left(n^{2} d\right) \]

In words: attention compares every token with every other token, so the work grows with the square of the sequence length \(n\). Double the context and you roughly quadruple the cost, which is exactly why very long context is expensive and hard.

\(n\) - number of tokens in the sequence
\(d\) - model / head dimension
\(n^2\) - the size of the query-key score grid

[Interactive: Why long context is hard, by sequence length]

The KV cache trick stores past keys/values so each new token attends once against the cached prefix instead of recomputing everything, turning the attention cost of each generation step from quadratic to linear in the sequence length.
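
Some back-of-the-envelope counting makes both claims concrete. This sketch only counts query-key comparisons (it ignores the factor \(d\) and everything outside attention), and it assumes the no-cache baseline recomputes the whole score grid at every generation step.

```python
def grid_entries(n):
    """Size of the query-key score grid for a sequence of n tokens."""
    return n * n

def comparisons_without_cache(n):
    """Generate n tokens, recomputing the full t x t grid at every step t."""
    return sum(t * t for t in range(1, n + 1))

def comparisons_with_cache(n):
    """Generate n tokens; at step t only the new query meets the t cached keys."""
    return sum(t for t in range(1, n + 1))

# Double the context, quadruple the grid.
print(grid_entries(1000), grid_entries(2000))

# Total comparisons to generate 1000 tokens, with and without the cache.
print(comparisons_without_cache(1000), comparisons_with_cache(1000))
```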

Latest · Shrinking the KV cache

GQA and MQA: fewer key/value heads.

The KV cache fixes compute, but its memory still grows with sequence length × number of heads. The fix used by Llama 3, Mistral, and most 2026 models: keep many query heads but let groups of them share one set of key/value heads.

MQA: multi-query

All \(h\) query heads share a single K/V head. Smallest cache, but can lose a little quality.

GQA: grouped-query

Use \(g\) K/V groups with \(1 < g < h\). The sweet spot: nearly full quality at a fraction of the cache.

\[ \text{KV memory} \;\propto\; 2\,\cdot\, g \,\cdot\, d_{\text{head}} \,\cdot\, n, \qquad 1 \le g \le h \]

In words: in each layer, the cache stores one K and one V vector per group per token. Drop from \(g=h\) (vanilla multi-head) toward \(g=1\) (MQA) and the cache shrinks by the same factor, the lever that makes long-context serving affordable.

\(h\) - number of query heads
\(g\) - number of K/V groups (\(g=h\): standard, \(g=1\): MQA, in between: GQA)
\(d_{\text{head}}\) - per-head dimension; \(n\) - tokens cached so far
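
To put numbers on the proportionality, here is a small calculator. The configuration (32 layers, 32 query heads, head size 128, 2-byte values, 8192 cached tokens) is a hypothetical example, not the published spec of any particular model, and the layer count and byte size are extra factors that the formula above folds into the proportionality sign.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_groups, d_head, bytes_per_value=2):
    """Memory for cached keys and values: 2 (K and V) per layer, group, dimension and token."""
    return 2 * n_layers * n_kv_groups * d_head * n_tokens * bytes_per_value

# Hypothetical model shape, for illustration only.
n_tokens, n_layers, h, d_head = 8192, 32, 32, 128

for name, g in [("MHA (g = h)", h), ("GQA (g = 8)", 8), ("MQA (g = 1)", 1)]:
    size = kv_cache_bytes(n_tokens, n_layers, g, d_head)
    print(f"{name:<12} {size / 2**20:8.0f} MiB")
```

Going from 32 groups to 8 cuts the cache by 4×, and going to a single group cuts it by 32×, exactly the factor by which \(g\) shrinks.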

Latest · How attention runs in 2026

You rarely write the loop by hand: fused kernels.

The \(O(n^2 d)\) cost above is the compute, but the real bottleneck on a GPU is writing the full \(n\times n\) score grid to memory. FlashAttention avoids that: it streams over the sequence in tiles and computes softmax online, so the big matrix never leaves fast on-chip memory. The math is identical; only the memory traffic changes.

FlashAttention-2 (2023) and FlashAttention-3 (2024, tuned for Hopper GPUs) are widely used in production. In PyTorch you just call torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True); when the hardware and inputs allow, it dispatches to a fused FlashAttention kernel for you, with no manual QK⊤ / softmax / @V needed.
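
The "softmax online" idea can be sketched without a GPU. This is a simplified illustration for a single query row with scalar values, not the actual FlashAttention kernel: it walks over the scores one tile at a time, keeps only a running max, a running sum and a running accumulator, and still lands on the same answer as the all-at-once version. The scores, values and tile size are arbitrary.

```python
import math

def attend_full(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum of the values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total

def attend_streaming(scores, values, tile=2):
    """Same result, one tile at a time, never holding the full row of weights."""
    running_max = -math.inf
    running_sum = 0.0
    acc = 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        rescale = math.exp(running_max - new_max)   # 0.0 on the first tile
        running_sum *= rescale                      # re-express old totals against the new max
        acc *= rescale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum

scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(attend_full(scores, values), attend_streaming(scores, values))
```

The rescaling step is what makes this work: whenever a larger score shows up, everything accumulated so far is shrunk to match the new maximum, so the final ratio is unchanged.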

Code · PyTorch

From math to PyTorch

import math, torch
import torch.nn.functional as F

B, T, D = 2, 5, 16            # batch, tokens, model dim
X  = torch.randn(B, T, D)      # input word vectors

Wq = torch.randn(D, D)         # learned projections
Wk = torch.randn(D, D)
Wv = torch.randn(D, D)

Q = X @ Wq                     # Q = X W_Q
K = X @ Wk                     # K = X W_K
V = X @ Wv                     # V = X W_V

scores = Q @ K.transpose(-2, -1) / math.sqrt(D)   # QK^T / sqrt(d_k)
mask   = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))  # causal mask
weights = F.softmax(scores, dim=-1)               # rows sum to 1
out     = weights @ V                             # weighted blend of values

X - token vectors after embedding + positional encoding.

Q, K, V - three learned views of the same words.

scores - \(QK^\top/\sqrt{d_k}\): how well each query matches each key.

mask - upper triangle set to \(-\infty\) so a token can't see the future.

weights @ V - softmax weights mix the value vectors into the output.

import torch
import torch.nn as nn

D, H = 16, 4                   # model dim, number of heads
layer = nn.TransformerEncoderLayer(
    d_model=D,
    nhead=H,                   # multi-head attention
    dim_feedforward=4 * D,     # the FFN sublayer
    batch_first=True,
)

X = torch.randn(2, 5, D)       # (batch, tokens, dim)

# causal mask for decoder-style use (True = block)
T = X.size(1)
causal = torch.triu(torch.ones(T, T), diagonal=1).bool()

out = layer(X, src_mask=causal)   # attention + add&norm + FFN + add&norm

Challenge · Boss quiz

Transformers & LLMs boss quiz

Lock it in

Say it back in your own words.

Frontier · research-grade

The deeper machinery & the research frontier.

The four levels above build the transformer from intuition to the full block. This tier goes further: the pieces that make real LLMs work and the active research that pushes them. Each topic is a guided lesson with step-through proofs, a worked numerical example, a computational visualization, and citations. Pick a card.
