Deep Learning
A deep neural network is just layers of simple little switches wired together. Each one adds, weighs, and decides, and stacked deep enough, those switches learn to recognize digits, translate languages, and paint pictures. In Deep Learning you'll light up neurons, solve the puzzle that nearly killed AI, and train a real network from scratch in your browser, watching it learn live.
Four levels · Beginner → Intermediate → Advanced → Expert. The rail on the left tracks where you are.
Dots = neurons. Lines = weighted connections. Pulses = signals flowing forward.
Level 1 · Beginner
Imagine a crowd of very simple helpers. Each one looks at a few numbers, gives each a thumbs-up or thumbs-down weight, adds them up, and shouts a single number back. Wire thousands of these helpers in layers and the crowd can learn almost anything.
Your brain has billions of neurons, each one a tiny switch: it gathers signals from its neighbors and, if they're strong enough together, it "fires". An artificial neural network copies the idea (not the biology): many simple units, arranged in layers, each unit passing a number to the next layer.
Each judge listens to the judges before them, weighs their opinions, and forms their own. The final judge gives the verdict.
Numbers are the baton. Each layer transforms the baton a little and hands it on, until the last layer crosses the finish line with an answer.
Every connection is a dimmer: turn it up to let a signal through loudly, down to mute it. Learning is just adjusting the dimmers.
Flip the input switches and slide the two connection weights so the output bulb glows as close as possible to the target. Get within 0.1 to win.
A single neuron does three things: it multiplies each input by a weight, adds them up plus a bias, then squashes the total through an activation function to decide how strongly to fire.
In words: weigh every input, add them together with a bias nudge, then pass the sum through a squashing function to get the neuron's output.
Set the weights and bias so the neuron behaves like the chosen logic gate. It must output near 1 for "true" rows and near 0 for "false" rows on all four input combinations.
A neuron is just "weighted vote → squash". Everything else in deep learning is stacking and tuning millions of these.
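The "weighted vote → squash" recipe can be sketched in a few lines. This is a minimal illustration, assuming a sigmoid as the squashing function; the helper names and the hand-picked AND-gate weights are hypothetical and not taken from the page's widget.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hand-picked (not learned) parameters that make the neuron behave like AND.
and_weights, and_bias = [10.0, 10.0], -15.0

for x1 in (0, 1):
    for x2 in (0, 1):
        out = neuron([x1, x2], and_weights, and_bias)
        print(x1, x2, round(out, 3))
```

Only the row with both inputs on pushes the weighted sum above zero, so only that row lands near 1.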
So far we've squinted at one neuron. Now press play and watch a tiny 2 → 3 → 1 network actually compute: each neuron lights up in turn, displaying the exact number it produced. The signal washes left → right, one layer at a time, and that whole trip is the forward pass.
Each connection shows its weight; each neuron shows its computed activation once the wave reaches it. Change the inputs and replay to see every number recompute.
Every hidden neuron took the two inputs, formed its own weighted vote, and squashed it. The output neuron then took a weighted vote of those three results. No magic, just the same "weighted sum → squash" repeated, layer after layer.
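The forward pass described above can be written out directly. The sketch below assumes sigmoid activations, and the weights are arbitrary illustrative values, not the ones shown in the animation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One dense layer: each row of `weights` feeds one neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Illustrative weights, chosen arbitrarily for this sketch.
W1 = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]   # 3 hidden neurons, 2 inputs each
b1 = [0.0, 0.1, -0.1]
W2 = [[0.7, -0.2, 0.5]]                        # 1 output neuron, 3 inputs
b2 = [0.05]

x = [1.0, 0.0]
hidden = layer(x, W1, b1)        # 2 -> 3
output = layer(hidden, W2, b2)   # 3 -> 1
print("hidden:", [round(h, 3) for h in hidden])
print("output:", round(output[0], 3))
```

The same `layer` function is reused for both steps, which is the point: a forward pass is one operation repeated per layer.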
Level 2 · Intermediate
One neuron can only draw a single straight line. The magic appears when you stack neurons into layers (input, hidden, output) and let signals flow forward through all of them. Suddenly the network can bend, curve, and carve up the world.
Information enters at the input layer, gets transformed by one or more hidden layers, and leaves at the output layer. Pushing numbers through, layer by layer, to get a prediction is called the forward pass. Each hidden neuron is a fresh "weighted vote → squash" on the layer below it.
Choose how many hidden neurons, set the two inputs, then press Fire signal and watch each layer light up in turn. Goal: make the output neuron cross the firing line (output > 0.5). The node turns gold when you succeed.
Stack two linear layers with no squashing in between and, surprise, you still only get a straight line. The activation function's bend is what lets layers combine into curves. The classic example is XOR: output 1 when the two inputs differ, 0 when they match. No single straight line can separate those four points.
In words: two matrix multiplies in a row collapse into one matrix, so the result is still a straight line. Slip a nonlinear \(\sigma\) between them and the network gains the power to bend its decision boundary.
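The collapse can be checked numerically. In this sketch the matrices are made-up values and biases are left out for brevity; the two helper functions are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matvec(A, x):
    """Multiply a matrix by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 hidden (illustrative)
W2 = [[0.5, -1.0, 2.0]]                      # 3 hidden -> 1 output

x = [0.7, -0.3]
two_layers = matvec(W2, matvec(W1, x))   # W2 (W1 x), no activation in between
W_combined = matmul(W2, W1)              # a single 1x2 matrix
one_layer = matvec(W_combined, x)

print(two_layers, one_layer)   # the two results agree up to rounding
```

Whatever the two linear layers compute, the single matrix `W_combined` computes too, so the extra layer adds no expressive power until a nonlinearity sits between them.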
First try with no hidden layer: drag the single line; you'll never split the four points correctly. Then switch on the hidden layer and bend the two cuts until each class sits on its own side. The accuracy meter tells you how close you are.
Hidden layers invent their own features (edges, corners, "is one input on") that make the final decision easy. Deeper networks build features on top of features.
Before training, every weight gets a starting value. Pick badly and the network may never learn. Start all at zero and every neuron computes the same thing forever (they can't specialize). Start too large and signals blow up or saturate. A balanced start (Xavier/He-style, small random values scaled to the layer size) lets learning take off smoothly.
In words: draw each weight from a small random range whose width shrinks as the layer gets wider, so the signal keeps a healthy size as it passes through.
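As one concrete instance of "a range whose width shrinks as the layer gets wider", here is a sketch of Xavier (Glorot) uniform initialization. It assumes the common limit \(\sqrt{6/(n_\text{in}+n_\text{out})}\); the page's own formula may use a different variant (for example He scaling), and the function name is hypothetical.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix from U(-limit, +limit),
    with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
narrow = xavier_uniform(4, 4, rng)       # small layer: wider range
wide = xavier_uniform(400, 400, rng)     # big layer: much tighter range

max_narrow = max(abs(w) for row in narrow for w in row)
max_wide = max(abs(w) for row in wide for w in row)
print(round(max_narrow, 3), round(max_wide, 3))
```

The wide layer's weights are bounded by a limit ten times smaller than the narrow layer's, which keeps the weighted sums at a similar scale in both.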
Press Train all three. The all-zero start is stuck (symmetry never breaks), the too-large start thrashes or stalls, and the balanced start glides down. The loss curves race in real time.
Initialization is the network's "starting posture". Zero is paralysis (all neurons identical), huge is chaos (exploding activations), and a carefully scaled small-random start is the sweet spot that real frameworks use by default.
Level 3 · Advanced
Now the actual equations: the forward pass, the loss, and the chain rule that lets a network learn from its mistakes. Then the centerpiece: a real MLP, implemented from scratch in plain JavaScript, that you can train on 2D data and watch learn in real time.
In words: each layer takes the previous layer's activations, applies its weight matrix and bias, then squashes. Repeat from input to output.
In words: measure how wrong the network is by averaging the squared gap between each prediction and its true answer. Bigger mistakes are punished much harder.
In words: for yes/no problems, reward the network when it is confident and right, and punish it hard when it is confident and wrong.
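Both losses are short enough to write by hand. This sketch assumes the standard definitions of mean squared error and binary cross-entropy; the clipping constant `eps` is an added safeguard against \(\log 0\), not something the page specifies.

```python
import math

def mse(preds, targets):
    """Mean squared error: average squared gap between prediction and truth."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Average of -[y log p + (1 - y) log(1 - p)] over the examples."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

labels = [1, 0, 1]
confident_right = [0.9, 0.1, 0.9]
confident_wrong = [0.1, 0.9, 0.1]

mse_right = mse(confident_right, labels)
bce_right = binary_cross_entropy(confident_right, labels)
bce_wrong = binary_cross_entropy(confident_wrong, labels)
print(mse_right, bce_right, bce_wrong)
```

Being confidently wrong costs roughly twenty times more cross-entropy than being confidently right on these made-up predictions.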
In words: to know how a single weight affected the loss, multiply the sensitivities along the path back from the loss to that weight. The chain rule links them.
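To make the chain rule concrete, the sketch below applies it to a single sigmoid neuron with a squared-error loss and checks the result against a finite-difference estimate. The setup (one weight, one input, these particular numbers) is an illustrative assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    a = sigmoid(w * x + b)
    return (a - y) ** 2

w, b, x, y = 0.5, -0.2, 1.5, 1.0

# Forward pass, keeping the intermediate values.
z = w * x + b
a = sigmoid(z)

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = 2.0 * (a - y)
da_dz = a * (1.0 - a)      # slope of the sigmoid at z
dz_dw = x
grad_w = dL_da * da_dz * dz_dw

# Independent check: nudge w a tiny bit and measure the change in loss.
h = 1e-6
numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
print(grad_w, numeric)
```

The two numbers agree to several decimal places, which is a handy way to test any hand-written gradient.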
In words: the scalar chain rule above, written once for a whole layer. Take the error signal \(\delta^{(\ell+1)}\) sitting one layer closer to the loss, send it backward through that layer's weights (the transpose), then gate it by how steep this layer's activation was. That single recursion is run from the output layer down to layer 1, and once you have every \(\delta^{(\ell)}\), each weight's gradient is just \(\partial L/\partial W^{(\ell)} = \delta^{(\ell)}\,\bigl(a^{(\ell-1)}\bigr)^{\!\top}\).
This is still literally what runs inside every modern framework. PyTorch / JAX autodiff builds the same recursion automatically from your forward code, and large 2025 models tame the cost of storing all the \(a^{(\ell)}\) with activation checkpointing: recomputing them in the backward pass to trade compute for memory.
Forward pushes activations left→right (teal). Then we compute the loss. Then the gradient flows right→left (rose), and one highlighted path shows the chain rule multiplying its links.
In words: nudge every weight a little bit downhill (opposite its gradient), scaled by the learning rate. Repeat thousands of times and the loss drops.
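The update rule can be tried on a one-weight toy problem. The loss \((w-3)^2\) and the learning rate below are made-up values for illustration.

```python
def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
for step in range(100):
    w = w - learning_rate * grad(w)   # step opposite the gradient

print(w)   # ends very close to the minimum at w = 3
```

Each step shrinks the distance to the minimum by a constant factor, so the weight homes in on 3 without ever being told where the minimum is.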
Plain SGD zig-zags. Momentum builds speed downhill. Adam adapts the step per-weight and usually converges fastest. The convergence theory behind these optimizers is in the Frontier tier below.
For multi-class problems the last layer outputs one raw score (a logit) per class. Softmax turns those scores into probabilities that are all positive and sum to 1, so the network can say "70% cat, 20% dog, 10% fox". Bigger logits get exponentially more of the probability; the largest logit wins.
In words: exponentiate every score (making them all positive and stretching gaps), then divide by the total so the results form a clean probability distribution that adds up to 1.
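A minimal softmax sketch follows. Subtracting the largest logit before exponentiating is a standard numerical-stability step that does not change the result; it is an addition here, not something the page describes.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], sum(probs))
```

The largest logit gets the largest share, and the gap between shares grows faster than the gap between logits because of the exponential.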
Slide each class's raw logit. The bars show the softmax probabilities live. Notice they always sum to 100%, and pulling one logit up steals probability from the others. The tallest bar (the argmax) is the predicted class.
In words: stack the softmax above onto cross-entropy and almost everything cancels. The gradient flowing back into the raw logits is simply the predicted probability minus the true label: prediction error, nothing more. Predict the right class with probability 1 and the gradient is 0; over-confident on a wrong class and it points straight at the fix. This is why the two cards above belong together, and it is exactly the dz = out - target line you'll see in the trainer.
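The "prediction minus label" result can be verified numerically. This sketch assumes a one-hot target and the usual cross-entropy \(-\log p_\text{target}\); the logits are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Loss for one example: minus the log-probability of the true class."""
    return -math.log(softmax(logits)[target_index])

logits = [1.0, 2.0, 0.5]
target = 1
one_hot = [1.0 if i == target else 0.0 for i in range(len(logits))]

# Claimed gradient with respect to the logits: probability minus label.
analytic = [p - t for p, t in zip(softmax(logits), one_hot)]

# Finite-difference check, one logit at a time.
h = 1e-6
numeric = []
for i in range(len(logits)):
    up = list(logits)
    down = list(logits)
    up[i] += h
    down[i] -= h
    numeric.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * h))

print(analytic)
print(numeric)
```

The entries of the gradient sum to zero: whatever probability is pushed toward the true class is taken from the others.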
Backprop multiplies a long chain of slopes together. If each slope is a bit less than 1, the product shrinks toward zero by the time it reaches the early layers, so the gradient vanishes and those layers barely learn. If each slope is bigger than 1, the product blows up, so the gradient explodes. The activation function is a big culprit.
In words: the gradient reaching layer 1 is a product of every layer's weight times its activation slope. Many factors under 1 → it dies; many over 1 → it explodes.
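A rough back-of-the-envelope version of that product, under the simplifying assumption that every layer contributes the same weight and the same activation slope (real networks vary per layer and per neuron):

```python
def gradient_scale(slope, weight, n_layers):
    """Product of (weight * activation slope) over n_layers identical layers."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= weight * slope
    return scale

# 0.25 is the largest slope a sigmoid can have (at z = 0).
sigmoid_like = gradient_scale(slope=0.25, weight=1.0, n_layers=10)
# An active ReLU unit has slope exactly 1.
relu_like = gradient_scale(slope=1.0, weight=1.0, n_layers=10)
# Weights a bit above 1 compound in the other direction.
exploding = gradient_scale(slope=1.0, weight=1.5, n_layers=10)

print(sigmoid_like, relu_like, exploding)
```

Ten layers are enough to shrink the sigmoid-style gradient below one millionth of its starting size, while the 1.5 weights inflate it more than fifty-fold.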
Each bar is the gradient size at one layer, with layer 1 (the farthest from the loss) on the left. Toggle the activation and weight scale, then send the gradient backward and watch it shrink to nothing (sigmoid) or stay alive (ReLU).
Each gradient step can use the whole dataset (smooth but slow per step), one example at a time (fast, very noisy, i.e. pure SGD), or a mini-batch of a few (the practical middle ground). More examples per step = a smoother, more accurate gradient; fewer = noisier steps that can actually help escape bad spots.
Watch three dots roll down the contour bowl. Full-batch takes a clean straight line, mini-batch wobbles a little, and single-example zig-zags wildly, but they all head downhill.
In high dimensions the loss surface is full of saddle points: flat spots that go down in some directions but up in others. Plain gradient descent crawls onto the flat ridge of a saddle and nearly freezes because the gradient there is tiny, so each step barely moves. Momentum keeps a running velocity, so it coasts across the flat part and rolls off the ridge into a descending direction. That is one reason momentum and Adam (next tier) often train faster.
Drag momentum to 0 and the dot stalls on the flat ridge near the saddle ✦. Raise it and the dot builds speed, slips off the ridge, and escapes down the valley.
What it is: two networks in a row. An encoder reads an input sequence (say, an English sentence) and squeezes everything it understood into a single context vector. A decoder then reads that summary and generates an output sequence one token at a time (say, the French translation).
What it's for: machine translation, summarization, speech-to-text, and any other task that turns one sequence into another, possibly of a different length.
In words: the encoder turns the whole input into one context summary \(c\); the decoder produces each output token from \(c\) plus everything it has written so far.
The input words flow into the encoder (left → right) and collapse into one context vector in the middle. Then the decoder reads that vector and emits the translation one word at a time. Pick a sentence and play it.
Cramming a long sentence into one fixed vector loses detail, which is exactly the problem attention fixed (the decoder peeks back at every input word). See Transformers & LLMs.
What it is: the architecture behind GPT, BERT, and most modern large language models. Instead of looping like an RNN, a Transformer lets every token attend to every other token at once, then refines each one with a small feed-forward network (FFN). One block = attention + FFN, each wrapped with a residual skip and a normalization. Stack \(N\) blocks and add a head on top.
What it's for: language, vision, audio, and code. It has largely replaced RNNs because it trains in parallel and models long-range relationships.
Click each stage (or press Step) to light up the data path. Notice the + nodes: those are the residual skips, and they're what let a Transformer be dozens of blocks deep. Full attention mechanics live in Transformers & LLMs.
Level 4 · Expert
A trained network that only works on the training data is useless. Experts fight overfitting, choose the right architecture (CNN, RNN, Transformer...), and write the training loop. Here are the tools of the trade.
A network with too much capacity can memorize the training set (including its noise) and then fail on new data. The cures: weight decay (keep weights small), dropout (randomly silence neurons so none becomes a crutch), early stopping (quit when validation loss turns up), and batch norm (normalize layer inputs to stabilize training).
In words: add a penalty for large weights to the loss. The network now has to justify every big weight with a real drop in data error, so it prefers simpler, smoother solutions that generalize.
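A sketch of the penalty, assuming the common form \(L_\text{total} = L_\text{data} + \tfrac{\lambda}{2}\sum w^2\). The \(\tfrac{1}{2}\) factor and the value of \(\lambda\) are conventions chosen for this example and may differ from the page's formula.

```python
def l2_penalty(weights, lam):
    """Weight-decay term: (lambda / 2) * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)

def total_loss(data_loss, weights, lam):
    """Data error plus the penalty for large weights."""
    return data_loss + l2_penalty(weights, lam)

small = [0.1, -0.2, 0.1]
large = [3.0, -4.0, 2.0]
lam = 0.01   # illustrative regularization strength

# Same data error, different weight sizes.
print(total_loss(0.5, small, lam), total_loss(0.5, large, lam))
```

With equal data error, the solution with smaller weights scores better, so large weights survive training only if they buy a real reduction in data error.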
Click cells on the left to draw a shape. Pick a kernel (or hand-tune it), then drag/step the window across the image and watch the right-hand feature map glow where the kernel matches: that's how CNNs detect edges and textures.
Image (click to draw)
Feature map (kernel response)
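The sliding-window operation can be sketched in plain Python. This assumes stride 1 and no padding, and it computes cross-correlation (no kernel flip), which is what deep-learning libraries usually call convolution. The image and kernel are made-up examples.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and record
    the sum of elementwise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# A dark left half and a bright right half: one vertical edge in the middle.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Responds where brightness increases from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)   # each row is [0, 2, 0]
```

The feature map is zero over the flat regions and peaks exactly where the window straddles the edge.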
After convolution, CNNs pool: slide a small window over the feature map and keep just one number per window, either the max (strongest response) or the average. With a 2×2 window and stride 2, this halves the resolution in each direction, keeps the important signal, and makes the network cheaper and a little position-invariant.
Feature map (input to pooling)
Pooled output (downsampled)
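Max pooling is a few lines on top of the same sliding-window idea. The sketch assumes a 2×2 window with stride 2, and the feature map values are invented for the example.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        out.append(row)
    return out

fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]
pooled = max_pool2d(fm)
print(pooled)   # [[4, 2], [2, 7]]
```

A 4×4 map becomes 2×2, and each surviving number is the strongest response in its quadrant.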
Feed the sequence in one step at a time. The hidden state \(h_t\) is a mix of the new input and its own previous value, which is how the network "remembers".
In words: the new memory is a blend of the last memory and the current input, squashed by tanh.
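A scalar sketch of that recurrence, assuming the usual form \(h_t = \tanh(w_h h_{t-1} + w_x x_t + b)\). Real RNNs use weight matrices and vector states; the single-number version and its parameter values are simplifications for illustration.

```python
import math

def rnn_step(h_prev, x_t, w_h, w_x, b):
    """Blend the previous memory with the current input, then squash."""
    return math.tanh(w_h * h_prev + w_x * x_t + b)

w_h, w_x, b = 0.8, 1.0, 0.0   # illustrative parameters
h = 0.0                       # empty memory at the start
history = []
for x_t in [1.0, 0.0, 0.0, 0.0]:   # one pulse, then silence
    h = rnn_step(h, x_t, w_h, w_x, b)
    history.append(h)

print([round(v, 3) for v in history])
```

The input pulse arrives only at the first step, yet the hidden state stays nonzero afterwards and fades gradually: that lingering value is the network's memory.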
During training, dropout randomly switches off a fraction of neurons on every step. No single neuron can become a crutch, so the network spreads its knowledge out and generalizes better. Too little dropout overfits; too much starves the net of capacity. There's a sweet spot.
Watch neurons blink off at random each step. Find the dropout rate that best closes the gap between training and validation accuracy: that's the rate that generalizes. Hit the green zone to win.
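Here is a sketch of dropout on a list of activations. It assumes the "inverted dropout" convention, where surviving activations are scaled up by \(1/(1-\text{rate})\) during training so nothing needs rescaling at test time; the page does not say which convention its demo uses.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` while training and
    scale the survivors so the expected value stays the same."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
acts = [1.0] * 10000
dropped = dropout(acts, rate=0.5, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(round(zero_fraction, 3), round(mean_after, 3))
```

About half the units are silenced on each call, but the average activation stays near its original value, so the next layer sees a signal of the same overall size.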
As data flows through layers its distribution drifts and stretches, which slows learning. Batch norm takes each layer's inputs and re-centers them to mean 0 and re-scales them to variance 1 (then learns a slope and shift to undo that if helpful). The result: stable, well-behaved signals and much faster training.
In words: subtract the batch's mean and divide by its spread to standardize, then let the network rescale (\(\gamma\)) and reshift (\(\beta\)) if it wants the original shape back.
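The standardize-then-rescale step can be sketched for one feature across a batch. The small `eps` added inside the square root is the usual guard against division by zero; the batch values are made up, and running statistics for inference are left out.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one feature over the batch, then rescale and reshift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [10.0, 12.0, 14.0, 20.0]   # off-center, uneven inputs
normed = batch_norm(batch)

new_mean = sum(normed) / len(normed)
new_var = sum((v - new_mean) ** 2 for v in normed) / len(normed)
print([round(v, 3) for v in normed], round(new_mean, 6), round(new_var, 6))
```

With \(\gamma = 1\) and \(\beta = 0\) the output has mean 0 and variance very close to 1; other values of \(\gamma\) and \(\beta\) let the network move the distribution wherever it finds useful.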
The raw inputs (left) wander off-center and uneven. With batch norm on, the same values get re-centered to 0 and scaled to a tidy spread (right), exactly what the next layer wants to see.
Stack too many layers and the signal (and its gradient) gets mangled on the long trip. A residual / skip connection adds a shortcut that lets the input bypass a block and rejoin it further along, so the layer only has to learn a small change, and the gradient has a clean highway straight back. This is the trick behind ResNets and Transformers.
In words: the block's output is the original input plus whatever the layers learned to add, so doing "nothing" (the identity) is easy, and the gradient flows through the \(+x\) shortcut unharmed.
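The skip connection itself is a single addition. In this sketch the "layers" inside the block are stand-in functions with hypothetical names, not trained networks.

```python
def residual_block(x, f):
    """Output = input + whatever the inner layers f compute."""
    return [xi + fi for xi, fi in zip(x, f(x))]

def zero_layers(x):
    """Inner layers that have learned to add nothing."""
    return [0.0 for _ in x]

def small_update(x):
    """Inner layers that learned a small correction."""
    return [0.1 * xi for xi in x]

x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_layers)
updated_out = residual_block(x, small_update)
print(identity_out, updated_out)
```

When the inner layers output zeros, the block passes its input through untouched, which is why adding more residual blocks does not have to make a deep network worse.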
With the skip off, the pulse must crawl through every layer (and fades). Turn it on and a second pulse leaps over the block on the shortcut and rejoins at the + node, keeping the signal strong even in a deep stack.
An autoencoder forces data through a narrow bottleneck and then tries to reconstruct the original from that tiny summary. To succeed it must learn the most important features. Make the bottleneck smaller and the reconstruction gets blurrier, so you can literally see how much information the squeeze throws away.
Draw on the input grid. The network compresses it to just a few numbers (the bottleneck), then rebuilds it on the right. Shrink the bottleneck and watch detail vanish: the essence survives, the fine detail doesn't.
Input (click to draw)
Reconstruction
CNNs, RNNs, Transformers, and diffusion models look different but share the same DNA: layers of weighted sums + nonlinearities, trained by backprop. They differ mainly in how they wire the connections to match their data.
CNN: shares small filters across an image, great for vision. Weights are reused everywhere.
RNN: loops a hidden state through a sequence, built for time and language order.
Transformer: lets every token attend to every other. See Transformers & LLMs.
Embeddings: turn words/items into learned vectors. Explore them in Embeddings & Vector Search.
Diffusion: a deep net that denoises noise into images, step by step.
MLP: the plain fully-connected net you trained above, the foundation of them all.
A network learns to place each word at a point so that related words sit near each other. Press scatter, then watch the words glide into meaning-based clusters: fruit here, animals there, royals over there. Dig deeper in Embeddings & Vector Search.
A generator makes fake samples; a discriminator tries to tell fakes from real. Each pushes the other to improve. Watch the generator's blurry fakes (orange) sharpen toward the real distribution (teal) as the two compete.
Below are the big architecture families. Click a tile to read a one-line "what it is / what it's for". They all share the same DNA (layers of weighted sums and nonlinearities trained by backprop) but each wires its connections to match its data.
Tap a tile above to learn what each architecture does.
What it is: a U-shaped network. The left side (encoder) repeatedly downsamples the image to capture "what" is in it; the right side (decoder) upsamples back to full size to say "where". The trick is the skip connections that copy fine detail straight across from each encoder level to the matching decoder level, so sharp edges aren't lost in the squeeze.
What it's for: image segmentation (medical scans, self-driving), and it's the backbone of diffusion models: see Diffusion Models.
A pulse travels down the left arm (each step halves the resolution), across the bottom, and back up the right arm. With skips on, copies of detail leap across the gap (gold arrows) and rejoin the matching decoder level: that's why U-Net outputs stay crisp.
You met the GAN tug-of-war above. Here's the real machinery: a generator \(G\) turns random noise \(z\) into a fake sample, and a discriminator \(D\) scores how real it looks. They play a minimax game: \(D\) tries to maximize how often it's right; \(G\) tries to minimize that, i.e. to fool \(D\). At the equilibrium, the fakes are indistinguishable from real data.
What it's for: generating realistic images, super-resolution, style transfer, data augmentation.
In words: the discriminator wants to give real samples a high score and fakes a low score (maximize the sum); the generator wants its fakes \(G(z)\) to score high (minimize the same sum). They pull the value in opposite directions.
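The value both players fight over can be computed for a handful of discriminator scores. This sketch assumes the standard form \(V(D,G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]\); the scores are invented, and no networks are trained here.

```python
import math

def gan_value(d_real, d_fake):
    """Average log-score on real samples plus average log(1 - score) on fakes."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A sharp discriminator: real samples score high, fakes score low.
sharp_d = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# A fooled discriminator: it can only guess 50/50 on everything.
fooled_d = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(sharp_d, fooled_d)
```

The discriminator pushes the value up toward 0; a generator that fools it drags the value down to \(2\log 0.5\), the level reached when fakes and real data cannot be told apart.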
Watch the loop: noise → generator → fake; fake and real → discriminator → a real/fake verdict that trains both. The value bar swings as \(D\) gets sharper, then as \(G\) catches up: the back-and-forth of the minimax game.
What it is: a twist on the autoencoder you built earlier. Instead of encoding to a single point, a VAE's encoder outputs a mean \(\mu\) and a log-variance \(\log\sigma^2\), which together describe a little cloud of possibilities. We sample a latent \(z\) from that cloud, and the decoder rebuilds the input from \(z\). Because the space is continuous and regularized toward a Gaussian, you can sample new \(z\) and generate brand-new data.
What it's for: generative modeling; the full math (the ELBO / KL term) lives in Diffusion Models.
In words: the "reparameterization trick": instead of sampling \(z\) directly (which we couldn't backprop through), we sample plain noise \(\varepsilon\) and shift/scale it by the learned \(\mu\) and \(\sigma\). Now gradients flow right through.
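The trick amounts to one line, \(z = \mu + \sigma\,\varepsilon\) with \(\sigma = \exp(\tfrac{1}{2}\log\sigma^2)\). The sketch below uses scalars and fixed values for \(\mu\) and \(\log\sigma^2\); in a real VAE both come from the encoder.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, with eps drawn from a standard normal."""
    sigma = math.exp(0.5 * log_var)   # log-variance -> standard deviation
    eps = rng.gauss(0.0, 1.0)         # the only random part
    return mu + sigma * eps

rng = random.Random(0)   # fixed seed so the sketch is repeatable
mu, log_var = 2.0, math.log(0.25)   # illustrative values: sigma = 0.5

samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(round(sample_mean, 3), round(sample_std, 3))
```

The samples follow the requested mean and spread, while the randomness lives entirely in `eps`, so \(z\) is a plain differentiable function of \(\mu\) and \(\log\sigma^2\).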
Slide the spread \(\sigma\) to widen or tighten the latent cloud, then hit Re-sample: each \(z\) is drawn from that cloud, so the decoder produces a slightly different output every time. A small \(\sigma\) gives a faithful copy; a larger one explores nearby variations.
What it is: a stack of invertible transforms. Start with a plain Gaussian blob you can sample easily, then warp it step by step until it matches a complicated target distribution. Because every step is reversible, you can also run it backwards to get the exact probability of any data point, something GANs and VAEs can't do directly.
What it's for: density estimation, generative modeling with exact likelihoods, sampling.
In words: chain invertible maps \(f_k\) to turn simple noise \(z\) into data \(x\). The change-of-variables term (the log-determinant of each step's Jacobian) accounts for how the transform stretches or squishes space, so probabilities stay exact.
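The change-of-variables bookkeeping is easiest to see with a single one-dimensional step. This sketch assumes an affine map \(x = s\,z + t\) over a standard normal base; real flows chain many richer invertible layers, and the function names are hypothetical.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def flow_forward(z, scale, shift):
    """Warp base noise z into data space: x = scale * z + shift."""
    return scale * z + shift

def flow_log_prob(x, scale, shift):
    """Exact log p(x): invert the flow, then correct for the stretch."""
    z = (x - shift) / scale            # run the flow backwards
    log_det = math.log(abs(scale))     # log |dx/dz| for this step
    return standard_normal_logpdf(z) - log_det

scale, shift = 2.0, 1.0
x = flow_forward(0.5, scale, shift)
log_p = flow_log_prob(x, scale, shift)

# Cross-check: this flow should reproduce a Normal(mean=shift, std=|scale|).
closed_form = (-0.5 * ((x - shift) / scale) ** 2
               - math.log(abs(scale)) - 0.5 * math.log(2.0 * math.pi))
print(x, log_p, closed_form)
```

Stretching space by a factor of 2 halves the density everywhere, and the log-determinant term is what records that.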
A round Gaussian cloud (teal) is bent through a few invertible steps into a curved target (a ring / two-moons style shape). Press Forward to warp it, and Reverse to flow it exactly back to the blob: that reversibility is the whole point.
What it is: a network that runs on a graph: points (nodes) joined by edges. Each round, every node gathers ("aggregates") messages from its neighbours and updates its own vector. After a few rounds, a node's vector reflects not just itself but its whole local neighbourhood.
What it's for: social networks (friend recommendations), molecules (predicting properties from atoms + bonds), maps, and recommendation systems.
In words: a node's next state is a function of its current state and an aggregate (sum / mean / max) of messages from its neighbours. Repeat to spread information further across the graph.
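One round of message passing can be sketched with dictionaries. The update rule here (half the node's own value plus half the mean of its neighbours) is a hand-picked stand-in for the learned update function, and the four-node path graph is made up.

```python
def message_passing_round(values, neighbours):
    """Every node blends its own value with the mean of its neighbours' values."""
    new_values = {}
    for node, value in values.items():
        messages = [values[n] for n in neighbours[node]]
        aggregate = sum(messages) / len(messages) if messages else 0.0
        new_values[node] = 0.5 * value + 0.5 * aggregate
    return new_values

# A path graph: A - B - C - D. Only A starts "lit".
neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
values = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}

after_one = message_passing_round(values, neighbours)
after_three = after_one
for _ in range(2):
    after_three = message_passing_round(after_three, neighbours)

print(after_one)
print(after_three)
```

After one round only A's direct neighbour has heard the signal; it takes three rounds to reach D, three hops away. That is why the number of rounds sets how far information can travel.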
One node starts "lit" (it holds some signal). Press Pass messages: each round, every node sends its value along its edges and blends in what it receives. Watch the signal ripple outward until the whole graph knows.
What it is: instead of one giant network, you have many smaller expert sub-networks plus a tiny router (gate). For each input the router picks the best few experts and sends the input only to them. The output is a weighted blend of just those experts.
What it's for: scaling huge models cheaply: you can have trillions of parameters but only activate a small slice per token, so it stays fast (used in many frontier LLMs).
In words: the gate scores every expert with a softmax, keeps the top-\(k\) (often just 1 or 2), and the answer is those experts' outputs blended by their gate weights. The rest of the experts do nothing for this input.
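A toy version of top-\(k\) routing follows. The experts are simple stand-in functions, the gate scores are fixed numbers rather than the output of a learned router, and renormalizing the kept gate weights so they sum to 1 is one common choice assumed here.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_scores, k=2):
    """Run only the k best-scoring experts and blend their outputs."""
    probs = softmax(gate_scores)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)   # renormalize over the chosen experts
    output = sum(probs[i] / norm * experts[i](x) for i in top)
    return output, top

# Four stand-in "experts"; a real model would use small neural networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
gate_scores = [0.1, 2.0, -1.0, 2.0]   # hypothetical router output for this input

output, chosen = moe_forward(3.0, experts, gate_scores, k=2)
print(output, chosen)
```

Experts 1 and 3 tie for the highest score, so each contributes half of the answer; experts 0 and 2 are never called for this input, which is where the compute savings come from.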
Send different inputs through. The router in the middle scores all the experts and lights up only the top 2: those run; the others stay dark (and cost nothing). Different inputs route to different experts.
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, d_in=2, d_hidden=16, d_out=1):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)   # weighted sum + bias
        self.act = nn.Tanh()                   # the nonlinearity
        self.fc2 = nn.Linear(d_hidden, d_out)  # output layer

    def forward(self, x):
        x = self.act(self.fc1(x))              # hidden activations
        return self.fc2(x)                     # raw output (logits)


model = MLP()
nn.Linear: one layer: \(Wx+b\). The weights and biases are learned.
nn.Tanh: the nonlinearity between layers (this is what gives the net curves).
forward: defines the forward pass: input → hidden → output.
# Assumes `model` from above, plus X (float tensor, shape [N, 2]) and
# y (float tensor of 0/1 labels, shape [N, 1]) are already defined.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
lossf = nn.BCEWithLogitsLoss()   # cross-entropy for yes/no
for epoch in range(1000):
    opt.zero_grad()              # 1. clear old gradients
    y_hat = model(X)             # 2. forward pass
    loss = lossf(y_hat, y)       # 3. how wrong are we?
    loss.backward()              # 4. backprop: fill .grad
    opt.step()                   # 5. nudge weights downhill
zero_grad: gradients accumulate by default, so clear them first.
forward: run the model to get predictions.
loss / backward: measure error, then the chain rule fills every .grad.
step: the optimizer updates every weight. Plain SGD applies \(w \leftarrow w - \eta\,\nabla L\); Adam, used here, rescales that step per weight.
Frontier · research-grade
The levels above build neural networks and their architectures. This tier opens the engine: the algorithm that computes every gradient, why gradient descent converges (and how momentum and Adam speed it up), how to initialize so deep nets train at all, and why over-parameterized models generalize. Each topic is a guided lesson with step-through proofs, a worked example, a visualization, and citations.