Deep Learning

Forge a brain out of tiny math neurons, then teach it to think.

A deep neural network is just layers of simple little switches wired together. Each one adds, weighs, and decides, and stacked deep enough, those switches learn to recognize digits, translate languages, and paint pictures. In Deep Learning you'll light up neurons, solve the puzzle that nearly killed AI, and train a real network from scratch in your browser, watching it learn live.


Four levels · Beginner → Intermediate → Advanced → Expert.

Dots = neurons. Lines = weighted connections. Pulses = signals flowing forward.

Level 1 · Beginner

What is a neural network? Layers of tiny decision-makers.

Imagine a crowd of very simple helpers. Each one looks at a few numbers, gives each a thumbs-up or thumbs-down weight, adds them up, and shouts a single number back. Wire thousands of these helpers in layers and the crowd can learn almost anything.

Explanation

A brain-inspired stack of simple units.

Your brain has billions of neurons, each one a tiny switch: it gathers signals from its neighbors and, if they're strong enough together, it "fires". An artificial neural network copies the idea (not the biology): many simple units, arranged in layers, each unit passing a number to the next layer.

A line of judges

Each judge listens to the judges before them, weighs their opinions, and forms their own. The final judge gives the verdict.

A relay race

Numbers are the baton. Each layer transforms the baton a little and hands it on, until the last layer crosses the finish line with an answer.

Dimmer switches

Every connection is a dimmer: turn it up to let a signal through loudly, down to mute it. Learning is just adjusting the dimmers.

Game · Light the Network

Light the Network

Match the target output.

Flip the input switches and slide the two connection weights so the output bulb glows as close as possible to the target. Get within 0.1 to win.

Input A Input B

Weight on A 0.50 Weight on B 0.50

Explanation

One neuron, up close: the perceptron.

A single neuron does three things: it multiplies each input by a weight, adds them up plus a bias, then squashes the total through an activation function to decide how strongly to fire.

\[ y=\sigma\!\left(\sum_i w_i x_i + b\right) \]

In words: weigh every input, add them together with a bias nudge, then pass the sum through a squashing function to get the neuron's output.

\(x_i\): the inputs (the numbers coming in)
\(w_i\): the weights (how much each input matters)
\(b\): the bias (a constant nudge, like a default mood)
\(\sigma\): the activation function that squashes the sum
\(y\): the neuron's output (how strongly it fires)

Game · Tune the Neuron

Tune the Neuron


Set the weights and bias so the neuron behaves like the chosen logic gate. It must output near 1 for "true" rows and near 0 for "false" rows on all four input combinations.

Weight on A 0.0 Weight on B 0.0 Bias 0.0

Interactive + Quiz · Activation Lab

Activation Lab

Pick a squashing function and watch the neuron's output curve.

Input value \(z\) 0.0

Quiz · Match the shape

Match the activation to its shape


Worked example

Compute one neuron by hand.

Inputs \(x = [1, 0]\), weights \(w = [0.6, -0.4]\), bias \(b = 0.1\).
Weighted sum: \(0.6\cdot 1 + (-0.4)\cdot 0 + 0.1 = 0.7\).
Squash with sigmoid: \(\sigma(0.7)=\dfrac{1}{1+e^{-0.7}}\approx 0.668\).
The neuron fires at 0.668: a soft "yes, lean true".
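
The hand calculation above can be checked in a few lines. This is a minimal sketch in plain Python; the function names `sigmoid` and `neuron` are illustrative choices, not part of the page's trainer.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Same numbers as the worked example: x = [1, 0], w = [0.6, -0.4], b = 0.1
y = neuron([1, 0], [0.6, -0.4], 0.1)
print(round(y, 3))  # 0.668
```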

Takeaway

A neuron is just "weighted vote → squash". Everything else in deep learning is stacking and tuning millions of these.

Explanation

Watch real numbers flow through a net.

So far we've squinted at one neuron. Now press play and watch a tiny 2 → 3 → 1 network actually compute: each neuron lights up in turn, displaying the exact number it produced. The signal washes left → right, one layer at a time, and that whole trip is the forward pass.

Animation · Numbers Flowing

Numbers Flowing Forward

Set the inputs, then press Play.

Each connection shows its weight; each neuron shows its computed activation once the wave reaches it. Change the inputs and replay to see every number recompute.

Input x₁ 1.00 Input x₂ 0.30

In words

Every hidden neuron took the two inputs, formed its own weighted vote, and squashed it. The output neuron then took a weighted vote of those three results. No magic, just the same "weighted sum → squash" repeated, layer after layer.

Level 2 · Intermediate

Stacking neurons into networks.

One neuron can only draw a single straight line. The magic appears when you stack neurons into layers: input, hidden, output, and let signals flow forward through all of them. Suddenly the network can bend, curve, and carve up the world.

Explanation

Layers and the forward pass.

Information enters at the input layer, gets transformed by one or more hidden layers, and leaves at the output layer. Pushing numbers through, layer by layer, to get a prediction is called the forward pass. Each hidden neuron is a fresh "weighted vote → squash" on the layer below it.

Every arrow carries a weight. The forward pass flows left → right.

Game · Forward-Pass Relay

Forward-Pass Relay

Hit Fire to send a signal through.

Choose how many hidden neurons, set the two inputs, then press Fire signal and watch each layer light up in turn. Goal: make the output neuron cross the firing line (output > 0.5). The node turns gold when you succeed.

Hidden neurons Input 1 1.0

Input 2 0.0

Explanation

Why depth and nonlinearity? Meet the XOR problem.

Stack two linear layers with no squashing in between and, surprise, you still only get a straight line. The activation function's bend is what lets layers combine into curves. The classic demonstration is XOR: output 1 when the two inputs differ, 0 when they match. No single straight line can separate those points.

\[ \underbrace{W_2\,(W_1 x)}_{\text{stays linear}} \;\neq\; \underbrace{W_2\,\sigma(W_1 x)}_{\text{can curve}} \]

In words: two matrix multiplies in a row collapse into one matrix, \(W_2 W_1\), so the result is still a straight line. Slip a nonlinear \(\sigma\) between them and the network gains the power to bend its decision boundary.

\(W_1, W_2\): the weight matrices of two layers
\(\sigma\): the nonlinearity (sigmoid, tanh, ReLU...)
\(x\): the input vector
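
To see the collapse with actual numbers, here is a small sketch. The matrices and the input are made-up illustrative values, and tanh stands in for \(\sigma\).

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, -2.0], [0.5, 1.5]]   # illustrative first-layer weights
W2 = [[2.0, 1.0]]                # illustrative second-layer weights
x = [3.0, -1.0]

two_layers = matvec(W2, matvec(W1, x))   # W2 (W1 x)
one_layer = matvec(matmul(W2, W1), x)    # (W2 W1) x: one merged matrix
with_tanh = matvec(W2, [math.tanh(h) for h in matvec(W1, x)])

print(two_layers, one_layer)  # identical: the two linear layers collapsed
print(with_tanh)              # different: the nonlinearity breaks the collapse
```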

Game · Solve XOR

Solve XOR

One line can't do it.

First try with no hidden layer: drag the single line; you'll never split the four points correctly. Then switch on the hidden layer and bend the two cuts until each class sits on its own side. The accuracy meter tells you how close you are.

Worked example · What hidden layers learn

How two cuts solve XOR.

Hidden neuron H1 learns "is at least one input on?" (an OR-ish cut).
Hidden neuron H2 learns "are both inputs on?" (an AND-ish cut).
The output neuron computes H1 AND NOT H2 → "exactly one on" → XOR.
Each hidden neuron carved a useful feature; the output just combined them.
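
The two cuts can be written down directly. In this sketch the weights are hand-picked rather than learned, and a hard threshold replaces the smooth squashing function so the outputs are exactly 0 or 1; both are simplifying assumptions.

```python
def step(z):
    """Hard threshold: fire (1) when the weighted vote is positive."""
    return 1 if z > 0 else 0

def xor_net(a, b):
    h1 = step(a + b - 0.5)       # H1: "is at least one input on?" (OR-ish)
    h2 = step(a + b - 1.5)       # H2: "are both inputs on?" (AND-ish)
    return step(h1 - h2 - 0.5)   # output: H1 AND NOT H2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```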

Big idea

Hidden layers invent their own features (edges, corners, "is one input on") that make the final decision easy. Deeper networks build features on top of features.

Explanation

Where you start matters: weight initialization.

Before training, every weight gets a starting value. Pick badly and the network may never learn. Start all at zero and every neuron computes the same thing forever (they can't specialize). Start too large and signals blow up or saturate. A balanced start (Xavier/He-style, small random values scaled to the layer size) lets learning take off smoothly.

\[ w \sim \mathcal{U}\!\left(-\sqrt{\tfrac{6}{n_\text{in}+n_\text{out}}},\; \sqrt{\tfrac{6}{n_\text{in}+n_\text{out}}}\right) \]

In words: draw each weight from a small random range whose width shrinks as the layer gets wider, so the signal keeps a healthy size as it passes through. This particular range is the Xavier (Glorot) uniform rule.

\(w\): an initial weight value
\(\mathcal{U}(-a, a)\): uniform random between \(-a\) and \(a\)
\(n_\text{in}, n_\text{out}\): neurons feeding in / out of the layer
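
A minimal sketch of drawing weights from that range, using only the standard library. The function name `xavier_uniform` and the layer sizes are illustrative.

```python
import math
import random

def xavier_uniform(n_in, n_out, rng):
    """Return an n_out x n_in weight matrix drawn from U(-a, a),
    with a = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]

rng = random.Random(0)           # fixed seed so the run is repeatable
W = xavier_uniform(4, 3, rng)    # a layer with 4 inputs and 3 outputs
limit = math.sqrt(6.0 / (4 + 3))
print(round(limit, 3))           # 0.926
```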

Animation · Init Race

Init Race: three starting points

Same network, same data; only the starting weights differ.

Press Train all three. The all-zero start is stuck (symmetry never breaks), the too-large start thrashes or stalls, and the balanced start glides down. The loss curves race in real time.

In words

Initialization is the network's "starting posture". Zero is paralysis (all neurons identical), huge is chaos (exploding activations), and a carefully scaled small-random start is the sweet spot that real frameworks use by default.

Level 3 · Advanced

The real math, and a live network you train yourself.

Now the actual equations: the forward pass, the loss, and the chain rule that lets a network learn from its mistakes. Then the centerpiece: a real MLP, implemented from scratch in plain JavaScript, that you can train on 2D data and watch learn in real time.

Formula · Forward pass

\[ a^{(\ell)}=\sigma\!\left(W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}\right) \]

In words: each layer takes the previous layer's activations, applies its weight matrix and bias, then squashes. Repeat from input to output.

\(a^{(\ell)}\): activations (outputs) of layer \(\ell\)
\(W^{(\ell)}, b^{(\ell)}\): that layer's weights and biases
\(a^{(0)}=x\): the input is "layer 0"
\(\sigma\): the activation applied element-wise
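
The recursion is short enough to sketch in plain Python. The 2 → 3 → 1 shape mirrors the animation earlier on the page, but the weights below are arbitrary illustrative values, and sigmoid is assumed for \(\sigma\).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, a_prev):
    """One layer: a = sigmoid(W a_prev + b), computed row by row."""
    return [sigmoid(sum(w * a for w, a in zip(row, a_prev)) + bi)
            for row, bi in zip(W, b)]

def forward(layers, x):
    """Run the input through every (W, b) pair in order."""
    a = x                        # a^(0) = x
    for W, b in layers:
        a = layer_forward(W, b, a)
    return a

layers = [
    ([[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]], [0.0, 0.1, -0.1]),  # 2 -> 3
    ([[1.0, -1.0, 0.5]], [0.2]),                                  # 3 -> 1
]
out = forward(layers, [1.0, 0.3])
print(out)  # a single activation between 0 and 1
```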

Formula · Loss (MSE)

\[ L_{\text{MSE}}=\frac{1}{n}\sum_{i=1}^{n}\bigl(\hat y_i - y_i\bigr)^2 \]

In words: measure how wrong the network is by averaging the squared gap between each prediction and its true answer. Bigger mistakes are punished much harder.

\(\hat y_i\): the network's prediction for example \(i\)
\(y_i\): the true target value
\(n\): number of examples
square: makes the error positive and penalizes big misses
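
As a quick sketch with made-up predictions, note how a miss of 0.6 costs nine times as much as a miss of 0.2.

```python
def mse(preds, targets):
    """Mean of the squared gaps between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Misses of 0.1, 0.2 and 0.6 contribute 0.01, 0.04 and 0.36.
loss = mse([0.9, 0.2, 0.4], [1, 0, 1])
print(round(loss, 4))  # 0.1367
```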

Formula · Loss (cross-entropy)

\[ L_{\text{CE}}=-\frac{1}{n}\sum_{i=1}^{n}\Bigl[\,y_i\log\hat y_i + (1-y_i)\log(1-\hat y_i)\Bigr] \]

In words: for yes/no problems, reward the network when it is confident and right, and punish it hard when it is confident and wrong.

\(\hat y_i\in(0,1)\): predicted probability of class 1
\(y_i\in\{0,1\}\): the true label
\(\log\): punishes confident wrong answers steeply (toward \(\infty\))
\(-\): negate so smaller loss means better
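
A small sketch of the same formula. The `eps` clipping is an added safeguard (an assumption, not part of the formula) so that a prediction of exactly 0 or 1 never reaches \(\log 0\).

```python
import math

def binary_cross_entropy(preds, targets, eps=1e-12):
    """Average of -[y log p + (1 - y) log(1 - p)] over the examples."""
    total = 0.0
    for p, y in zip(preds, targets):
        p = min(max(p, eps), 1.0 - eps)   # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(preds)

confident_right = binary_cross_entropy([0.99], [1])
confident_wrong = binary_cross_entropy([0.01], [1])
print(round(confident_right, 3), round(confident_wrong, 3))  # 0.01 4.605
```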

Formula · Backprop & the chain rule

\[ \frac{\partial L}{\partial w}=\frac{\partial L}{\partial \hat y}\cdot\frac{\partial \hat y}{\partial w} \]

In words: to know how a single weight affected the loss, multiply the sensitivities along the path back from the loss to that weight. The chain rule links them.

\(\dfrac{\partial L}{\partial \hat y}\): how much the loss changes when the prediction changes
\(\dfrac{\partial \hat y}{\partial w}\): how much the prediction changes when the weight changes
chain: multiply the links to get the full effect of \(w\) on \(L\)

Formula · The equation of backprop

\[ \delta^{(\ell)} \;=\; \Bigl(\bigl(W^{(\ell+1)}\bigr)^{\!\top}\,\delta^{(\ell+1)}\Bigr)\;\odot\;\sigma'\!\bigl(z^{(\ell)}\bigr) \]

In words: the scalar chain rule above, written once for a whole layer. Take the error signal \(\delta^{(\ell+1)}\) sitting one layer closer to the loss, send it backward through that layer's weights (the transpose), then gate it by how steep this layer's activation was. That single recursion is run from the output layer down to layer 1, and once you have every \(\delta^{(\ell)}\), each weight's gradient is just \(\partial L/\partial W^{(\ell)} = \delta^{(\ell)}\,\bigl(a^{(\ell-1)}\bigr)^{\!\top}\).

\(\delta^{(\ell)}\): the error signal at layer \(\ell\) (how much its pre-activation affects the loss, \(\partial L/\partial z^{(\ell)}\))
\(\bigl(W^{(\ell+1)}\bigr)^{\!\top}\): the next layer's weights, transposed, to route the signal back the way it came
\(\odot\): element-wise product (gate each unit by its own slope)
\(\sigma'(z^{(\ell)})\): the activation's slope, one factor in the long product that vanishes or explodes below
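
One way to convince yourself the recursion is right is to compare it with a brute-force finite-difference estimate. The sketch below assumes a tiny 2 → 2 → 1 sigmoid network, a single training example, and the loss \(\tfrac12(\hat y - y)^2\); the weights are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, b1, W2, b2, x):
    a1 = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    a2 = sigmoid(sum(w * a for w, a in zip(W2, a1)) + b2)
    return a1, a2

def loss(W1, b1, W2, b2, x, y):
    _, a2 = forward(W1, b1, W2, b2, x)
    return 0.5 * (a2 - y) ** 2

def backprop(W1, b1, W2, b2, x, y):
    a1, a2 = forward(W1, b1, W2, b2, x)
    # Output layer: delta = dL/dz, using sigmoid'(z) = a * (1 - a).
    delta2 = (a2 - y) * a2 * (1.0 - a2)
    # Hidden layer: route delta2 back through W2, gate by each unit's slope.
    delta1 = [W2[j] * delta2 * a1[j] * (1.0 - a1[j]) for j in range(len(a1))]
    grad_W2 = [delta2 * a for a in a1]               # delta2 * a1^T
    grad_W1 = [[d * xi for xi in x] for d in delta1]  # delta1 * x^T
    return grad_W1, grad_W2

W1, b1 = [[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]
W2, b2 = [1.0, -1.5], 0.2
x, y = [1.0, 0.5], 1.0

grad_W1, grad_W2 = backprop(W1, b1, W2, b2, x, y)

# Finite-difference check on one weight, W1[0][1].
h = 1e-6
W1_up = [row[:] for row in W1]
W1_up[0][1] += h
W1_down = [row[:] for row in W1]
W1_down[0][1] -= h
numeric = (loss(W1_up, b1, W2, b2, x, y)
           - loss(W1_down, b1, W2, b2, x, y)) / (2 * h)
print(grad_W1[0][1], numeric)  # the two values should agree closely
```

The finite-difference route needs two forward passes per weight, which is why backprop's single backward sweep matters once a network has millions of weights.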

Latest (2024-2026)

This is still literally what runs inside every modern framework. PyTorch / JAX autodiff builds the same recursion automatically from your forward code, and large 2025 models tame the cost of storing all the \(a^{(\ell)}\) with activation checkpointing: recomputing them in the backward pass to trade compute for memory.

Interactive · Backprop Flow

Watch the gradient flow backward

Forward pushes activations left→right (teal). Then we compute the loss. Then the gradient flows right→left (rose), and one highlighted path shows the chain rule multiplying its links.

Formula · The learning step

\[ w \;\leftarrow\; w - \eta\,\frac{\partial L}{\partial w} \]

In words: nudge every weight a little bit downhill (opposite its gradient), scaled by the learning rate. Repeat thousands of times and the loss drops.

\(\eta\): learning rate (step size)
\(\partial L/\partial w\): the gradient from backprop
\(\leftarrow\): overwrite the weight with its updated value
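
Here is the update rule on the simplest possible loss, \(L(w) = (w-3)^2\), whose gradient is \(2(w-3)\). The loss, the starting point, and the learning rate are illustrative choices.

```python
def gradient_descent(grad_fn, w, lr, steps):
    """Repeat w <- w - lr * dL/dw for a fixed number of steps."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# L(w) = (w - 3)^2 has its minimum at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0, lr=0.1, steps=100)
print(round(w_final, 6))  # 3.0
```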

SGD vs Momentum vs Adam

Same problem, three optimizers, same number of steps.

Plain SGD zig-zags. Momentum builds speed downhill. Adam adapts the step per-weight and usually converges fastest. The convergence theory behind these optimizers is in the Frontier tier below.

Explanation

The softmax head: turning scores into probabilities.

For multi-class problems the last layer outputs one raw score (a logit) per class. Softmax turns those scores into probabilities that are all positive and sum to 1, so the network can say "70% cat, 20% dog, 10% fox". Bigger logits get exponentially more of the probability; the largest logit wins.

\[ p_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \]

In words: exponentiate every score (making them all positive and stretching gaps), then divide by the total so the results form a clean probability distribution that adds up to 1.

\(z_i\): the raw logit (score) for class \(i\)
\(e^{z_i}\): exponential: always positive, amplifies big scores
\(\sum_j e^{z_j}\): total over all classes (the normalizer)
\(p_i\): output probability for class \(i\) (all sum to 1)
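
A short sketch of the formula. Subtracting the largest logit before exponentiating is a common numerical safeguard added here (it is not in the formula above); it cancels in the division, so the probabilities are unchanged.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                      # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```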

Interactive · Softmax Head

Drag the logits, watch the probabilities


Slide each class's raw logit. The bars show the softmax probabilities live. Notice they always sum to 100%, and pulling one logit up steals probability from the others. The tallest bar (the argmax) is the predicted class.

Formula · Softmax + cross-entropy: the gradient is just \(p-y\)

\[ \frac{\partial L_{\text{CE}}}{\partial z_i} \;=\; p_i - y_i \]

In words: stack the softmax above onto cross-entropy and almost everything cancels. The gradient flowing back into the raw logits is simply the predicted probability minus the true label: prediction error, nothing more. Predict the right class with probability 1 and the gradient is 0; over-confident on a wrong class and it points straight at the fix. This is why the two cards above belong together, and it is exactly the dz = out - target line you'll see in the trainer.

\(z_i\): the raw logit for class \(i\) (input to softmax)
\(p_i\): the softmax probability for class \(i\)
\(y_i\): the one-hot true label (1 for the right class, else 0)
\(p_i-y_i\): the residual; no \(\sigma'\) factor survives, so the gradient at this layer does not shrink just because the output saturates
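
The cancellation can be checked numerically. This sketch assumes the multi-class form of cross-entropy for one example, \(L = -\log p_{\text{true}}\), and compares \(p - y\) against a finite-difference gradient; the logits are made-up values.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Loss for one example: minus the log-probability of the true class."""
    return -math.log(softmax(logits)[target_index])

logits, target = [2.0, 1.0, 0.1], 1
p = softmax(logits)
y = [1.0 if i == target else 0.0 for i in range(len(logits))]
analytic = [pi - yi for pi, yi in zip(p, y)]     # the claimed gradient

h = 1e-6
numeric = []
for i in range(len(logits)):
    up, down = logits[:], logits[:]
    up[i] += h
    down[i] -= h
    numeric.append((cross_entropy(up, target)
                    - cross_entropy(down, target)) / (2 * h))

print(analytic)
print(numeric)  # should match the line above to several decimals
```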

Explanation

Why deep nets used to be so hard: vanishing & exploding gradients.

Backprop multiplies a long chain of slopes together. If each slope is a bit less than 1, the product shrinks toward zero by the time it reaches the early layers, so the gradient vanishes and those layers barely learn. If each slope is bigger than 1, the product blows up, so the gradient explodes. The activation function is a big culprit.

\[ \frac{\partial L}{\partial a^{(1)}} \;\propto\; \prod_{\ell} W^{(\ell)}\,\sigma'\!\left(z^{(\ell)}\right) \]

In words: the gradient reaching layer 1 is a product of every layer's weight times its activation slope. Many factors under 1 → it dies; many over 1 → it explodes.

\(\sigma'(z)\): slope of the activation (sigmoid maxes at 0.25!)
\(W^{(\ell)}\): layer \(\ell\)'s weights
\(\prod_\ell\): multiply across all layers (the chain rule)
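
A back-of-the-envelope sketch of that product. It treats every layer as a single scalar weight times a single slope, which is a deliberate caricature of the matrix product; the weight scales and the 10-layer depth are illustrative.

```python
def gradient_scale(n_layers, weight, slope):
    """Size of a gradient after passing back through n_layers,
    each contributing a factor of weight * slope."""
    return (weight * slope) ** n_layers

sigmoid_best_case = gradient_scale(10, 0.9, 0.25)  # sigmoid slope at its max
relu_active = gradient_scale(10, 0.9, 1.0)         # ReLU slope when active
large_weights = gradient_scale(10, 1.5, 1.0)       # factors above 1

print(sigmoid_best_case)  # roughly 3e-7: vanished
print(relu_active)        # roughly 0.35: still alive
print(large_weights)      # roughly 58: exploding
```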

Animation · Gradient Decay

Gradient magnitude across the layers

Each bar is the gradient size at one layer, with layer 1 (the farthest from the loss) on the left. Toggle the activation and weight scale, then send the gradient backward and watch it shrink to nothing (sigmoid) or stay alive (ReLU).

Weight scale 0.9

Explanation

How much data per step? Single-example vs mini-batch vs full-batch.

Each gradient step can use the whole dataset (smooth but slow per step), one example at a time (fast, very noisy, i.e. pure SGD), or a mini-batch of a few (the practical middle ground). More examples per step = a smoother, more accurate gradient; fewer = noisier steps that can actually help escape bad spots.

Animation · Descent Paths

Three descent paths on one loss bowl

Same start, same steps; different batch size.

Watch three dots roll down the contour bowl. Full-batch follows a smooth, steady path, mini-batch wobbles a little, and single-example zig-zags wildly, but they all head downhill.

Explanation

What actually stalls deep training: saddle points, not valleys.

In high dimensions the loss surface is dominated by saddle points: flat spots that go down in some directions but up in others. Plain gradient descent crawls onto the flat ridge of a saddle and nearly freezes because the gradient there is tiny, so each step barely moves. Momentum keeps a running velocity, so it coasts across the flat part and rolls off the ridge into a descending direction, which is one reason momentum and Adam (next tier) often train faster.

Animation · Saddle Escape

Escaping a saddle with momentum

Surface f(a,b) = a² − b²: down along b, up along a.

Drag momentum to 0 and the dot stalls on the flat ridge near the saddle ✦. Raise it and the dot builds speed, slips off the ridge, and escapes down the valley.

Momentum β 0.00
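
The same experiment can be sketched in code on \(f(a,b) = a^2 - b^2\). The update below is the classic "heavy-ball" form of momentum, which is an assumption about the variant; the starting point, the learning rate, and the escape threshold \(|b| > 1\) are also illustrative choices.

```python
def steps_to_escape(beta, lr=0.01, start=(0.5, 1e-3), max_steps=10_000):
    """Count steps until the dot has slid off the saddle (|b| > 1)."""
    a, b = start
    va = vb = 0.0
    for step in range(1, max_steps + 1):
        ga, gb = 2.0 * a, -2.0 * b     # gradient of a^2 - b^2
        va = beta * va - lr * ga       # velocity: decayed past + new push
        vb = beta * vb - lr * gb
        a, b = a + va, b + vb
        if abs(b) > 1.0:
            return step
    return max_steps

plain = steps_to_escape(beta=0.0)          # momentum off: plain gradient descent
with_momentum = steps_to_escape(beta=0.9)
print(plain, with_momentum)  # momentum needs far fewer steps
```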

★ Centerpiece game · Train a Brain

Decision boundary & loss curve

Architecture · Encoder-Decoder (Seq2Seq)

Read the whole input, then write the whole output.

What it is: two networks in a row. An encoder reads an input sequence (say, an English sentence) and squeezes everything it understood into a single context vector. A decoder then reads that summary and generates an output sequence one token at a time (say, the French translation).
What it's for: machine translation, summarization, speech-to-text, any task that turns one sequence into another, possibly of a different length.

\[ c=\operatorname{Enc}(x_1,\dots,x_T),\qquad y_t=\operatorname{Dec}\!\left(y_{<t},\,c\right) \]

In words: the encoder turns the whole input into one context summary \(c\); the decoder produces each output token from \(c\) plus everything it has written so far.

\(x_1\dots x_T\): the input tokens (the source sentence)
\(c\): the context vector (the encoder's summary)
\(y_{<t}\): the output tokens generated before step \(t\)
\(y_t\): the next output token the decoder emits

Animation · Translate a sentence

Watch tokens flow in, then out

Press play to translate.

The input words flow into the encoder (left → right) and collapse into one context vector in the middle. Then the decoder reads that vector and emits the translation one word at a time. Pick a sentence and play it.

Worked example

"the cat sits" → "le chat est assis"

Encoder reads the · cat · sits, updating its hidden state each word.
The final hidden state becomes the context vector \(c\): a compressed meaning.
Decoder starts from \(c\) and emits le, then feeds it back to emit chat, and so on.
It stops when it emits the special <end> token.

Why a bottleneck hurts

Cramming a long sentence into one fixed vector loses detail, which is exactly the problem attention fixed (the decoder peeks back at every input word). See Transformers & LLMs.

Architecture · The Transformer block

Stack attention + a small MLP, N times.

What it is: the architecture behind GPT, BERT, and modern AI. Instead of looping like an RNN, a Transformer lets every token attend to every other token at once, then refines each one with a tiny feed-forward network. One block = attention + FFN, each wrapped with a residual skip and a normalization. Stack \(N\) blocks and add a head on top.
What it's for: language, vision, audio, code, and it has largely replaced RNNs because it trains in parallel and models long-range relationships.

Diagram · Step through a block

Embed → N×[Attention + FFN] → Head

Tap a stage to highlight it.

Click each stage (or press Step) to light up the data path. Notice the + nodes: those are the residual skips mentioned above (explained under Skip connections in Level 4), and they're what let a Transformer be dozens of blocks deep. Full attention mechanics live in Transformers & LLMs.

Level 4 · Expert

Making it work in practice.

A trained network that only works on the training data is useless. Experts fight overfitting, choose the right architecture (CNN, RNN, Transformer...), and write the training loop. Here are the tools of the trade.

Explanation · Overfitting & regularization

Memorizing vs learning.

A network with too much capacity can memorize the training set (including its noise) and then fail on new data. The cures: weight decay (keep weights small), dropout (randomly silence neurons so none becomes a crutch), early stopping (quit when validation loss turns up), and batch norm (normalize layer inputs to stabilize training).

\[ L_{\text{reg}} = L_{\text{data}} + \lambda \sum_j w_j^{2} \]

In words: add a penalty for large weights to the loss. The network now has to justify every big weight with a real drop in data error, so it prefers simpler, smoother solutions that generalize.

\(L_{\text{data}}\): the usual fit-the-data loss (MSE / CE)
\(\lambda\): regularization strength (how much we punish size)
\(\sum_j w_j^2\): total squared weight magnitude (L2 / weight decay)
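
In code the penalty is a single extra term. The numbers below are made up for illustration.

```python
def l2_regularized_loss(data_loss, weights, lam):
    """Data loss plus lam times the sum of squared weights."""
    return data_loss + lam * sum(w * w for w in weights)

# Squared weights: 0.25 + 4.0 + 1.0 = 5.25, so the penalty is 0.01 * 5.25.
total = l2_regularized_loss(0.25, [0.5, -2.0, 1.0], lam=0.01)
print(round(total, 4))  # 0.3025
```

Notice that the single weight of \(-2.0\) accounts for most of the penalty: squaring makes one large weight cost far more than several small ones.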

Capacity demo: fit vs overfit

Slide the capacity. Too little underfits; too much overfits the noise.

Model capacity (polynomial degree) 3 Apply regularization (smooth it out)

Game · Convolution: the edge detector

Slide the kernel

A CNN slides a tiny filter over the image; matches light up the feature map.

Click cells on the left to draw a shape. Pick a kernel (or hand-tune it), then drag/step the window across the image and watch the right-hand feature map glow where the kernel matches: that's how CNNs detect edges and textures.

Image (click to draw)

Feature map (kernel response)
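
A minimal sketch of the sliding window, assuming stride 1 and no padding. Like most deep-learning libraries, it slides the kernel without flipping it (technically a cross-correlation). The image and the vertical-edge kernel are illustrative.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; each output cell is the sum of
    element-wise products under the window."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(kh) for j in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]

# Dark on the left, bright on the right: one vertical edge in the middle.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = conv2d(image, vertical_edge)
print(feature_map)  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The response is zero over the flat regions and peaks where the window straddles the dark-to-bright boundary.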

Animation · Pooling: shrink the map

Pooling downsamples the feature map

2×2 windows → one value each.

After convolution, CNNs pool: slide a small window over the feature map and keep just one number per window, either the max (strongest response) or the average. With non-overlapping 2×2 windows this halves the width and height (a quarter as many values), keeps the important signal, and makes the network cheaper and a little position-invariant.
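
A sketch of max pooling with non-overlapping 2×2 windows. It assumes the feature map has an even height and width; the values are made up.

```python
def max_pool_2x2(fmap):
    """Keep the largest value from each non-overlapping 2x2 window."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4, 2], [2, 8]]
```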

Feature map (input to pooling)

Pooled output (downsampled)

Interactive · RNN: memory over a sequence

Step through a sequence

An RNN keeps a hidden state and updates it one item at a time.

Feed the sequence in one step at a time. The hidden state \(h_t\) is a mix of the new input and its own previous value, which is how the network "remembers".

\[ h_t=\tanh\!\left(W_h h_{t-1} + W_x x_t + b\right) \]

In words: the new memory is a blend of the last memory and the current input, squashed by tanh.

\(h_t\): hidden state (memory) at step \(t\)
\(x_t\): the input at step \(t\)
\(W_h, W_x, b\): learned recurrence weights and bias
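
To watch the memory at work, here is a sketch with a single scalar hidden state, so \(W_h\) and \(W_x\) become plain numbers. The weight values are illustrative, not learned.

```python
import math

def rnn_step(h_prev, x_t, w_h, w_x, b):
    """One update: blend the old memory with the new input, then squash."""
    return math.tanh(w_h * h_prev + w_x * x_t + b)

def run_rnn(xs, w_h=0.5, w_x=1.0, b=0.0, h0=0.0):
    """Feed the sequence in one item at a time, recording each hidden state."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(h, x_t, w_h, w_x, b)
        states.append(h)
    return states

# One pulse of input followed by silence: the memory of it fades gradually.
states = run_rnn([1.0, 0.0, 0.0])
print([round(h, 3) for h in states])  # [0.762, 0.364, 0.18]
```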

Explanation

Dropout: randomly silence neurons to fight overfitting.

During training, dropout randomly switches off a fraction of neurons on every step. No single neuron can become a crutch, so the network spreads its knowledge out and generalizes better. Too little dropout overfits; too much starves the net of capacity. There's a sweet spot.
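
A minimal sketch of the idea. It uses the common "inverted dropout" convention, where surviving activations are scaled up by \(1/(1-\text{rate})\) so their average stays the same and nothing needs to change at test time. That scaling is an assumption added here; the text above does not specify it.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training and
    scale the survivors by 1 / (1 - rate). Assumes 0 <= rate < 1."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10_000
dropped = dropout(acts, 0.3, rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(zero_fraction, mean_after)  # close to 0.3 and close to 1.0
```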

Game · Dial the Dropout

Dial the Dropout

Pick a rate, then run.

Watch neurons blink off at random each step. Find the dropout rate that best closes the gap between training and validation accuracy: that's the rate that generalizes. Hit the green zone to win.

Dropout rate 0.30

Explanation

Batch Normalization: re-center the signal at every layer.

As data flows through layers its distribution drifts and stretches, which slows learning. Batch norm takes each layer's inputs and re-centers them to mean 0 and re-scales them to variance 1 (then learns a scale and shift to undo that if helpful). The result: stable, well-behaved signals and much faster training.

\[ \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta \]

In words: subtract the batch's mean and divide by its spread to standardize, then let the network rescale (\(\gamma\)) and reshift (\(\beta\)) if it wants the original shape back.

\(\mu_B, \sigma_B^2\): mean and variance over the current batch
\(\hat{x}\): the standardized value (mean 0, variance 1)
\(\gamma, \beta\): learned scale and shift
\(\epsilon\): tiny constant so we never divide by zero
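
The formula for a single feature across one batch, as a sketch. It covers only the training-time computation; the running averages that frameworks keep for inference are left out, and the batch values are made up.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize a batch of values, then apply the scale and shift."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([10.0, 12.0, 14.0, 16.0])   # off-center, wide spread
mean = sum(out) / len(out)
var = sum((v - mean) ** 2 for v in out) / len(out)
print([round(v, 3) for v in out])  # [-1.342, -0.447, 0.447, 1.342]
```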

Animation · Re-centering

A drifting distribution, snapped back

Toggle batch norm and watch the histogram.

The raw inputs (left) wander off-center and uneven. With batch norm on, the same values get re-centered to 0 and scaled to a tidy spread (right), exactly what the next layer wants to see.

Apply batch normalization

Explanation

Skip connections: a shortcut for very deep nets.

Stack too many layers and the signal (and its gradient) gets mangled on the long trip. A residual / skip connection adds a shortcut that lets the input bypass a block and rejoin it further along, so the layer only has to learn a small change, and the gradient has a clean highway straight back. This is the trick behind ResNets and Transformers.

\[ y = x + \mathcal{F}(x) \]

In words: the block's output is the original input plus whatever the layers learned to add, so doing "nothing" (the identity) is easy, and the gradient flows through the \(+x\) shortcut unharmed.

\(x\): the input to the block (the skip path)
\(\mathcal{F}(x)\): the residual the layers learn to add
\(y\): the block's output
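
In code the skip is one addition. The two stand-in functions for \(\mathcal{F}\) below are hypothetical, chosen only to show the identity case and a small correction.

```python
def residual_block(x, f):
    """y = x + F(x): the input rides the shortcut and F's output is added."""
    return [xi + fi for xi, fi in zip(x, f(x))]

def zero_layers(x):
    """Hypothetical F that has learned nothing: outputs all zeros."""
    return [0.0 for _ in x]

def small_change(x):
    """Hypothetical F that has learned a small correction."""
    return [0.1 * xi for xi in x]

print(residual_block([1.0, -2.0], zero_layers))   # [1.0, -2.0]: identity
print(residual_block([1.0, -2.0], small_change))  # input plus a small nudge
```

Because \(\partial y/\partial x = 1 + \partial\mathcal{F}/\partial x\), the gradient always has the "1" path back, even when \(\mathcal{F}\)'s own slope is tiny.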

Animation · Bypass the Block

Watch the signal take the shortcut

Toggle the skip path on and off.

With the skip off, the pulse must crawl through every layer (and fades). Turn it on and a second pulse leaps over the block on the shortcut and rejoins at the + node, keeping the signal strong even in a deep stack.

Enable skip connection

Explanation

Autoencoders: squeeze, then rebuild.

An autoencoder forces data through a narrow bottleneck and then tries to reconstruct the original from that tiny summary. To succeed it must learn the most important features. Make the bottleneck smaller and the reconstruction gets blurrier, so you can literally see how much information the squeeze throws away.

Interactive · The Bottleneck

Compress and reconstruct

Draw on the input grid. The network compresses it to just a few numbers (the bottleneck), then rebuilds it on the right. Shrink the bottleneck and watch detail vanish: the essence survives, the fine detail doesn't.

Bottleneck size 8

Input (click to draw)

Reconstruction

Explanation · The deep-learning landscape

They're all deep nets underneath.

CNNs, RNNs, Transformers, and diffusion models look different but share the same DNA: layers of weighted sums + nonlinearities, trained by backprop. They differ mainly in how they wire the connections to match their data.

CNN

Shares small filters across an image, great for vision. Weights are reused everywhere.

RNN / LSTM

Loops a hidden state through a sequence, built for time and language order.

Transformer

Lets every token attend to every other. See Transformers & LLMs.

Embeddings

Turn words/items into learned vectors. Explore them in Embeddings & Vector Search.

Diffusion

A deep net that denoises noise into images, step by step.

MLP

The plain fully-connected net you trained above: the foundation of them all.

Animation · Words finding their cluster

Embeddings: meaning becomes geometry

Similar words drift together in space.

A network learns to place each word at a point so that related words sit near each other. Press scatter, then watch the words glide into meaning-based clusters: fruit here, animals there, royals over there. Dig deeper in Embeddings & Vector Search.

Animation · GAN tug-of-war

GANs: a forger vs a detective

A generator makes fake samples; a discriminator tries to tell fakes from real. Each pushes the other to improve. Watch the generator's blurry fakes (orange) sharpen toward the real distribution (teal) as the two compete.

Map · The architecture family tree

Pick an architecture to see what it's for.

Below are the big architecture families. Click a tile to read a one-line "what it is / what it's for". They all share the same DNA (layers of weighted sums and nonlinearities trained by backprop) but each wires its connections to match its data.

Tap a tile above to learn what each architecture does.

Architecture · U-Net

Shrink it down, build it back up, with shortcuts across.

What it is: a U-shaped network. The left side (encoder) repeatedly downsamples the image to capture "what" is in it; the right side (decoder) upsamples back to full size to say "where". The trick is the skip connections that copy fine detail straight across from each encoder level to the matching decoder level, so sharp edges aren't lost in the squeeze.
What it's for: image segmentation (medical scans, self-driving), and it's the backbone of diffusion models: see Diffusion Models.

Animation · Trace the U

Down the encoder, across the skips, up the decoder

Toggle the skips and play.

A pulse travels down the left arm (each step halves the resolution), across the bottom, and back up the right arm. With skips on, copies of detail leap across the gap (gold arrows) and rejoin the matching decoder level: that's why U-Net outputs stay crisp.

Show skip connections

Architecture · GAN (the full picture)

Two networks locked in a game.

You met the GAN tug-of-war above. Here's the real machinery: a generator \(G\) turns random noise \(z\) into a fake sample, and a discriminator \(D\) scores how real it looks. They play a minimax game: \(D\) tries to maximize how often it's right; \(G\) tries to minimize that, i.e. to fool \(D\). At the equilibrium, the fakes are indistinguishable from real data.
What it's for: generating realistic images, super-resolution, style transfer, data augmentation.

\[ \min_{G}\max_{D}\; \mathbb{E}_{x\sim p_\text{data}}\!\left[\log D(x)\right] + \mathbb{E}_{z\sim p_z}\!\left[\log\!\left(1 - D(G(z))\right)\right] \]

In words: the discriminator wants to give real samples a high score and fakes a low score (maximize the sum); the generator wants its fakes \(G(z)\) to score high (minimize the same sum). They pull the value in opposite directions.

\(D(x)\): discriminator's "this is real" probability
\(G(z)\): a fake sample built from noise \(z\)
\(p_\text{data}\): the true data distribution
\(p_z\): the noise distribution \(G\) draws from
\(\min_G\max_D\): the adversarial game (one minimizes, one maximizes)
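
To get a feel for the value being fought over, this sketch estimates it from a handful of discriminator scores. The scores are made-up numbers; no networks are trained here.

```python
import math

def gan_value(d_real, d_fake):
    """Estimate E[log D(x)] + E[log(1 - D(G(z)))] from sample scores."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A sharp discriminator: real samples score high, fakes score low.
sharp_d = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])

# A fully fooled discriminator: it can only say 50/50 for everything.
fooled_d = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(round(sharp_d, 3), round(fooled_d, 3))  # -0.157 -1.386
```

The fooled value is \(2\log 0.5 = -\log 4\): the discriminator pushes the value up toward 0, and the generator drags it back down toward that point.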

Animation · The adversarial loop

Generator vs discriminator, value swinging

Press play to run the game.

Watch the loop: noise → generator → fake; fake and real → discriminator → a real/fake verdict that trains both. The value bar swings as \(D\) gets sharper, then as \(G\) catches up: the back-and-forth of the minimax game.

Architecture · Variational Autoencoder (VAE)

An autoencoder that learns a smooth, sample-able latent space.

What it is: a twist on the autoencoder you built earlier. Instead of encoding to a single point, a VAE's encoder outputs a mean \(\mu\) and a log-variance \(\log\sigma^2\), a little cloud of possibilities. We sample a latent \(z\) from that cloud, and the decoder rebuilds the input from \(z\). Because the space is continuous and regularized toward a Gaussian, you can sample new \(z\) and generate brand-new data.
What it's for: generative modeling; the full math (the ELBO / KL term) lives in Diffusion Models.

\[ z = \mu + \sigma\odot\varepsilon,\qquad \varepsilon\sim\mathcal{N}(0,I) \]

In words: the "reparameterization trick": instead of sampling \(z\) directly (which we couldn't backprop through), we sample plain noise \(\varepsilon\) and shift/scale it by the learned \(\mu\) and \(\sigma\). Now gradients flow right through.

\(\mu\): the encoder's predicted mean for this input
\(\sigma\): the spread, recovered from the log-variance as \(\sigma = \exp\!\left(\tfrac12\log\sigma^2\right)\)
\(\varepsilon\): random noise from a standard Gaussian
\(z\): the sampled latent fed to the decoder
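
The trick above fits in a few lines. This is a minimal sketch in plain Python, assuming the encoder has already produced a mean and a log-variance per latent dimension; the name `reparameterize`, the optional `eps` argument, and the example numbers are illustrative, not part of any library.

```python
import math
import random

def reparameterize(mu, log_var, eps=None):
    """Return z = mu + sigma * eps, element-wise, with sigma = exp(0.5 * log_var)."""
    if eps is None:
        # Fresh standard-Gaussian noise, one draw per latent dimension.
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# Hypothetical encoder outputs for one input, in a 2-D latent space.
mu = [1.0, -2.0]
log_var = [0.0, math.log(0.25)]   # so sigma is about [1.0, 0.5]

z_fixed = reparameterize(mu, log_var, eps=[2.0, 2.0])   # noise pinned for illustration
z_random = reparameterize(mu, log_var)                  # a fresh sample on every call
print(z_fixed, z_random)
```

With the noise pinned at 2.0 in both dimensions, `z_fixed` comes out at roughly [3.0, -1.0]: each mean shifted by two of its own standard deviations. Plain Python has no gradients, but the structure is the point: given \(\varepsilon\), \(z\) is an ordinary deterministic function of \(\mu\) and \(\log\sigma^2\), which is what lets an autodiff framework backpropagate through it.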

Diagram · Sample the bottleneck

Encoder → (μ, log-var) → sample → Decoder

Slide the spread \(\sigma\) to widen or tighten the latent cloud, then hit Re-sample: each \(z\) is drawn from that cloud, so the decoder produces a slightly different output every time. A small \(\sigma\) gives a faithful copy; a larger one explores nearby variations.

Architecture · Normalizing Flows

Bend a simple cloud into any shape, reversibly.

What it is: a stack of invertible transforms. Start with a plain Gaussian blob you can sample easily, then warp it step by step until it matches a complicated target distribution. Because every step is reversible, you can also run it backwards to get the exact probability density (likelihood) of any data point, something GANs (which define no explicit density) and VAEs (which only bound it from below) can't do directly.
What it's for: density estimation, generative modeling with exact likelihoods, sampling.

\[ x = f_K\circ\cdots\circ f_1(z),\qquad \log p(x)=\log p(z) - \sum_{k}\log\left|\det \frac{\partial f_k}{\partial z_{k-1}}\right| \]

In words: chain invertible maps \(f_k\) to turn simple noise \(z\) into data \(x\), writing \(z_0 = z\) and \(z_k = f_k(z_{k-1})\) for the intermediate points. The change-of-variables term (the log-determinant of each step's Jacobian) accounts for how the transform stretches or squishes space, so probabilities stay exact.

\(z\): a sample from a simple base (Gaussian)
\(f_k\): the \(k\)-th invertible transform
\(x\): the resulting (complex) data sample
\(\det\,\partial f_k/\partial z_{k-1}\): Jacobian determinant of step \(k\): the local volume change
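
Here is a minimal sketch of the formula in one dimension, assuming each step is a simple affine map \(f_k(z) = a z + b\) (real flows use richer invertible layers). The names `STEPS`, `forward`, `inverse`, and `log_prob` and the slopes and shifts are made up for illustration.

```python
import math

def std_normal_logpdf(z):
    """log N(z; 0, 1) for a scalar z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

# Each step is f_k(z) = a * z + b with a != 0, stored as (a, b). Values are made up.
STEPS = [(2.0, 1.0), (-1.5, 3.0)]

def forward(z, steps=STEPS):
    """Push base noise z through f_1, ..., f_K to get a data sample x."""
    for a, b in steps:
        z = a * z + b
    return z

def inverse(x, steps=STEPS):
    """Undo the steps in reverse order to recover z from x."""
    for a, b in reversed(steps):
        x = (x - b) / a
    return x

def log_prob(x, steps=STEPS):
    """Exact log p(x) = log p(z) - sum_k log|det df_k/dz_{k-1}|."""
    z = inverse(x, steps)
    # For a 1-D affine map the Jacobian is just the slope a.
    log_det = sum(math.log(abs(a)) for a, _ in steps)
    return std_normal_logpdf(z) - log_det

x = forward(0.3)
print(x, inverse(x), log_prob(x))
```

Two affine steps compose into a single affine map, so here \(x\) is itself Gaussian and `log_prob` can be checked against the textbook Gaussian density. With nonlinear steps that shortcut disappears, but the same bookkeeping (invert, then subtract the log-determinants) still gives the exact answer.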

Animation · Warp the blob

Gaussian → target shape

Play to warp; reverse to undo.

A round Gaussian cloud (teal) is bent through a few invertible steps into a curved target (a ring / two-moons style shape). Press Forward to warp it, and Reverse to flow it exactly back to the blob: that reversibility is the whole point.

Architecture · Graph Neural Network (GNN)

Let each node learn from its neighbours.

What it is: a network that runs on a graph: points (nodes) joined by edges. Each round, every node gathers ("aggregates") messages from its neighbours and updates its own vector. After a few rounds, a node's vector reflects not just itself but its whole local neighbourhood.
What it's for: social networks (friend recommendations), molecules (predicting properties from atoms + bonds), maps, and recommendation systems.

\[ h_v^{(k+1)} = \phi\!\left(h_v^{(k)},\; \bigoplus_{u\in\mathcal{N}(v)} \psi\!\left(h_u^{(k)}\right)\right) \]

In words: a node's next state is a function of its current state and an aggregate (sum / mean / max) of messages from its neighbours. Repeat to spread information further across the graph.

\(h_v^{(k)}\): node \(v\)'s feature vector at round \(k\)
\(\mathcal{N}(v)\): the neighbours of node \(v\)
\(\psi\): the message function (what a neighbour sends)
\(\bigoplus\): the permutation-invariant aggregator (sum/mean/max)
\(\phi\): the update function (combine self + messages)
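
One round of the update can be sketched in plain Python. This is a toy under stated assumptions: node features are single numbers, \(\psi\) just forwards the neighbour's value, \(\bigoplus\) is the mean, and \(\phi\) is a fixed 50/50 blend. In a real GNN \(\psi\) and \(\phi\) are small learned networks; the function name and the graph are made up for illustration.

```python
def message_passing_round(h, neighbours):
    """One synchronous round: every node reads the OLD values of its neighbours.

    h:          list of node values (scalars here; vectors in a real GNN)
    neighbours: neighbours[v] is the list of nodes adjacent to v
    """
    def psi(h_u):                 # message function: send your value unchanged
        return h_u

    def aggregate(messages):      # permutation-invariant aggregator: the mean
        return sum(messages) / len(messages) if messages else 0.0

    def phi(h_v, agg):            # update: blend self with the aggregate
        return 0.5 * h_v + 0.5 * agg

    return [
        phi(h[v], aggregate([psi(h[u]) for u in neighbours[v]]))
        for v in range(len(h))
    ]

# A path graph 0 - 1 - 2 - 3 where only node 0 starts "lit".
neighbours = [[1], [0, 2], [1, 3], [2]]
h0 = [1.0, 0.0, 0.0, 0.0]
h1 = message_passing_round(h0, neighbours)
h2 = message_passing_round(h1, neighbours)
print(h1, h2)
```

After one round only node 1 has picked up any signal; after two rounds node 2 has some as well, while node 3 is still at zero. Information travels one hop per round, which is why stacking more rounds widens each node's view of the graph.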

Animation · Message passing

Watch information spread round by round

Round 0, only one node knows.

One node starts "lit" (it holds some signal). Press Pass messages: each round, every node sends its value along its edges and blends in what it receives. Watch the signal ripple outward until the whole graph knows.

Architecture · Mixture of Experts (MoE)

Don't use the whole brain for every thought.

What it is: instead of one giant network, you have many smaller expert sub-networks plus a tiny router (gate). For each input the router picks the best few experts and sends the input only to them. The output is a weighted blend of just those experts.
What it's for: scaling huge models cheaply: you can have trillions of parameters but only activate a small slice per token, so the compute per token stays low even though every expert still has to be stored (used in many frontier LLMs).

\[ y = \sum_{i\in\text{top-}k} g_i(x)\, E_i(x),\qquad g(x)=\operatorname{softmax}\!\left(W_g x\right) \]

In words: the gate scores every expert with a softmax, keeps the top-\(k\) (often just 1 or 2), and the answer is those experts' outputs blended by their gate weights. The rest of the experts do nothing for this input.

\(E_i(x)\): the \(i\)-th expert sub-network
\(g_i(x)\): the gate's weight for expert \(i\)
top-\(k\): keep only the \(k\) highest-scoring experts
\(W_g\): the router's learned weights
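
The routing rule can be sketched in plain Python. This is a toy, not a production layer: the experts are tiny hand-written functions with scalar outputs, and `moe_forward`, the router matrix, and the input are all made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, W_g, experts, k=2):
    """y = sum over the top-k experts of g_i(x) * E_i(x), with g = softmax(W_g x)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W_g]
    gates = softmax(logits)
    # Indices of the k largest gate weights.
    chosen = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    # Only the chosen experts are ever called; the others are skipped entirely.
    y = sum(gates[i] * experts[i](x) for i in chosen)
    return y, chosen

# Three toy experts (scalar outputs) and a made-up router matrix W_g.
experts = [
    lambda x: sum(x),
    lambda x: -sum(x),
    lambda x: 10.0 * x[0],
]
W_g = [[2.0, 0.0], [0.0, 0.0], [1.0, 0.0]]

y, chosen = moe_forward([1.0, 5.0], W_g, experts, k=2)
print(y, chosen)
```

For this input the router's raw scores are 2, 0, and 1, so experts 0 and 2 are kept and expert 1 is never called. The sketch follows the formula as written, using the softmax weights directly; some implementations instead renormalize the kept weights so they sum to 1.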

Animation · The router lights up experts

Each input wakes a different couple of experts

Send different inputs through. The router in the middle scores all the experts and lights up only the top 2: those run; the others stay dark (and cost no compute for that input). Different inputs route to different experts.

Code · PyTorch

From math to PyTorch

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_in=2, d_hidden=16, d_out=1):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)   # weighted sum + bias
        self.act = nn.Tanh()                    # the nonlinearity
        self.fc2 = nn.Linear(d_hidden, d_out)   # output layer

    def forward(self, x):
        x = self.act(self.fc1(x))   # hidden activations
        return self.fc2(x)          # raw output (logits)

model = MLP()

nn.Linear: one layer: \(Wx+b\). The weights and biases are learned.

nn.Tanh: the nonlinearity between layers (this is what gives the net curves).

forward: defines the forward pass: input → hidden → output.

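# Example data (an assumption: the XOR truth table; swap in your own X and y)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])   # float targets, same shape as the logits

opt  = torch.optim.Adam(model.parameters(), lr=1e-2)
lossf = nn.BCEWithLogitsLoss()      # cross-entropy for yes/no, applied to raw logits

for epoch in range(1000):
    opt.zero_grad()                 # 1. clear old gradients
    y_hat = model(X)                # 2. forward pass
    loss  = lossf(y_hat, y)         # 3. how wrong are we?
    loss.backward()                 # 4. backprop: fill .grad
    opt.step()                      # 5. nudge weights downhill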
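
`nn.BCEWithLogitsLoss` takes the raw logits from `forward` and applies the sigmoid internally. As a rough picture of what step 3 computes, here is a plain-Python sketch of binary cross-entropy on logits, averaged over the batch. The helper name `bce_with_logits` is made up for illustration, and the rearranged formula is a standard numerically stable form, not necessarily the exact code PyTorch runs.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy, taking raw logits (no sigmoid applied yet).

    Per example: -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z),
    rewritten as max(z, 0) - z*y + log(1 + exp(-|z|)) to avoid overflow.
    """
    losses = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)

undecided = bce_with_logits([0.0], [1.0])          # sigmoid(0) = 0.5
confident_right = bce_with_logits([6.0], [1.0])    # strongly "yes", label is yes
confident_wrong = bce_with_logits([6.0], [0.0])    # strongly "yes", label is no
print(undecided, confident_right, confident_wrong)
```

An undecided logit of 0 costs \(\log 2\); a confident correct answer costs almost nothing, and a confident wrong one is punished heavily. That asymmetry is what the gradients in step 4 push against.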

Challenge · Boss quiz

Deep Learning boss quiz

Lock it in

Say it back in your own words.

Frontier · research-grade

How learning really works — and the research frontier.

The levels above build neural networks and their architectures. This tier opens the engine: the algorithm that computes every gradient, why gradient descent converges (and how momentum and Adam speed it up), how to initialize so deep nets train at all, and why over-parameterized models generalize. Each topic is a guided lesson with step-through proofs, a worked example, a visualization, and citations.