🖼️ The Restorer's Slider
Drag from t = T (pure static) down to t = 0 (clean image). Watch the painting emerge.
Diffusion Models
A diffusion model is an art restorer trapped inside a snowstorm. It learns to wipe away a little noise at a time, and after enough careful wipes a blizzard of static turns into a real image. Generating = denoising.
Live: static is restored into a painting, then dissolved back into noise: the two directions of diffusion.
Imagine a finished painting. Now sprinkle TV static on top until you can't see it at all. That's noise. A diffusion model has practiced the reverse trick so many times that it can look at the static and guess what was underneath, removing a thin layer of fuzz over and over until the painting reappears.
Shake a snow globe and the scene vanishes behind swirling flakes. Set it down and let it settle: the scene comes back. Diffusion is a model that has learned how to "let the flakes settle" on purpose.
A foggy bathroom mirror hides your reflection. Each swipe of your hand reveals a bit more. The model makes one tiny swipe per step, and after many steps the full reflection (image) is clear.
Waves smear a sandcastle into flat sand (adding noise is easy and automatic). The clever part is a machine that can run the tape backwards and rebuild the castle grain by grain.
Suppose generation uses 1000 tiny steps. At step 1000 the canvas is 100% static. The model removes a sliver of noise, so step 999 is maybe 99.8% static, barely different. Repeat. Around step 600 faint blobs of color appear. By step 200 you can tell it's a face. At step 0 it's a crisp picture.
Key idea: no single step does much. Generation is the accumulation of a thousand gentle denoises.
Predicting a whole sharp image from pure noise in one jump is incredibly hard. Removing a little noise is easy. It's the difference between sculpting a statue with one hammer blow and sculpting it with a thousand careful taps. Splitting the job into many easy steps is what makes diffusion work so well.
A photocopier can reconstruct a picture you feed it, but it can never surprise you with a brand-new one. The dream of generative modelling is a machine that has seen thousands of faces and can now dream up a face nobody has ever seen. To do that, the machine needs a tidy "map of ideas" it can pick a random spot on, and that map has to be smooth, so every spot decodes into something sensible.
An autoencoder squeezes each picture into a tiny code (a dot on the map) and learns to rebuild it. But the dots end up scattered with big empty gaps. Pick a random point and you land in a gap → garbage. A good generative map packs the dots into one smooth blob so every random pick decodes into a believable image. Toggle the two and click Sample a random point.
Each dot is one training image encoded to a 2D code. The big tile shows what the decoder makes from the picked point.
Autoencoder map: songs pinned at random spots with silence in between. Drop the needle anywhere and you mostly hit static. A smooth map clusters similar songs into one dense region, so a random drop always lands on music, maybe a tune you've never heard, but still a tune.
Encode many handwritten "3"s. If the codes for a curly 3 and a blocky 3 sit far apart with a void between, the midpoint decodes to a smudge. We want the in-between point to decode to a reasonable blend, a slightly-curly 3. That "no holes" property is exactly what makes a map samplable.
This is the bridge to the next level: to guarantee "no holes", we make the map a probability distribution and force the codes to fill a known shape (a bell curve). That's a VAE.
Training has two halves. The forward process takes a clean image and adds a measured dose of noise at each timestep. It needs no learning, it's just a recipe. The reverse process is a neural network that looks at a noisy image and predicts the noise that was added, so we can subtract it. The model's only job is to be a good noise-guesser.
The forward chain (rose, fixed recipe) walks a clean image into pure noise. The reverse chain (green, learned) walks noise back to an image, one denoise at a time.
The schedule decides how much signal survives at each timestep. The bar shows how much of the original image (\(\bar\alpha_t\)) remains vs. how much is noise (\(1-\bar\alpha_t\)). Early on the image dominates; later on it's almost all noise.
At t = 0 forward and reverse agree perfectly, nothing to fix. Push t up: FORWARD smears the picture into static, while REVERSE shows what the model recovers from that much noise. Small t → easy → near-perfect recovery. Large t → almost no signal left → only a rough guess survives.
Repeat millions of times over many images and timesteps. A model that can reliably name the noise can also remove it, and removing noise is generation.
A Variational Autoencoder upgrades the encoder so that instead of pinning each image to a single dot, it maps the image to a small fuzzy cloud (a little Gaussian). During training we gently push all those clouds to pile up into one big standard bell curve centred at the origin. Once they do, generating is trivial: draw a random point from that bell curve and let the decoder turn it into an image.
Left: the latent plane with the prior \(p(z)=\mathcal N(0,I)\) drawn as rings. Drag the spread to see each image become a cloud, not a point. Hit Generate to draw \(z\sim p(z)\) and decode it.
In words: the chance the model assigns to a real image \(x\) is the total over every latent code \(z\): how likely that code is under the prior, times how likely the decoder is to draw \(x\) from it. We'd love to maximise this, but the integral runs over all of latent space, so it's intractable.
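For reference, the quantity described above is usually called the marginal likelihood. The display below is a reconstruction in this page's notation (decoder \(p_\theta(x\mid z)\), prior \(p(z)\)), not a quote from the original:

\[
p_\theta(x) \;=\; \int p_\theta(x\mid z)\,p(z)\,dz
\]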
Latent space might be 256-dimensional. Summing \(p_\theta(x\mid z)p(z)\) over all of it is like asking "out of every possible code in the universe, how many would have produced this exact cat photo?" Astronomically many to check. The next level shows the clever dodge: a lower bound we can compute.
Time for the equations behind the snow globe. Three formulas carry almost everything: the forward marginal (jump straight to any timestep), the reparameterization that writes \(x_t\) in one line, and the surprisingly simple training loss.
In words: you don't have to add noise step by step, you can jump straight to timestep \(t\). The noisy image is the clean image shrunk by \(\sqrt{\bar\alpha_t}\), plus Gaussian noise whose size grows as \(1-\bar\alpha_t\).
In words: the same idea written as one concrete recipe: blend the clean image with a single draw of standard noise, weighted by the schedule. This is the "reparameterization trick": it makes \(x_t\) a differentiable function of a fixed random \(\epsilon\).
In words: over random images, random noise, and random timesteps, make the network's predicted noise \(\epsilon_\theta\) match the noise we actually added. It's just mean-squared error on noise.
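Written out, the three formulas described in words above take the following standard DDPM form. This is a reconstruction using the page's symbols (\(\bar\alpha_t\), \(\epsilon\), \(\epsilon_\theta\)); the page's own display may differ in minor notation:

\[
q(x_t\mid x_0) \;=\; \mathcal N\!\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\big)
\]
\[
x_t \;=\; \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\qquad \epsilon\sim\mathcal N(0,I)
\]
\[
L_{\text{simple}} \;=\; \mathbb E_{x_0,\,\epsilon,\,t}\Big[\big\lVert \epsilon - \epsilon_\theta(x_t,t)\big\rVert^2\Big]
\]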
Watch \(\bar\alpha_t\) decay across timesteps. Choose a schedule and drag the marker to read off how much signal vs. noise is in \(x_t\) at that point.
Say at \(t=500\) the schedule gives \(\bar\alpha_t = 0.30\). Then \(\sqrt{\bar\alpha_t}\approx0.55\) and \(\sqrt{1-\bar\alpha_t}\approx0.84\). So \(x_t \approx 0.55\,x_0 + 0.84\,\epsilon\): the noisy image is roughly 55% original picture and 84%-strength static, already noise-dominated. The readout above updates with the live numbers as you drag.
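The arithmetic above is easy to check by hand or in a few lines of Python. This is a minimal sketch using only the standard library; the helper names and the 4-pixel "image" are made up for illustration, and \(\bar\alpha_t = 0.30\) is simply the value assumed in the worked example.

```python
import math
import random

def forward_coeffs(alpha_bar_t):
    """Return (signal_scale, noise_scale) for x_t = signal_scale*x0 + noise_scale*eps."""
    return math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)

def noise_image(x0, alpha_bar_t, rng):
    """Jump straight to timestep t: one standard Gaussian draw per pixel."""
    s, n = forward_coeffs(alpha_bar_t)
    return [s * px + n * rng.gauss(0.0, 1.0) for px in x0]

signal, noise = forward_coeffs(0.30)
print(round(signal, 2), round(noise, 2))  # 0.55 0.84

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]   # hypothetical 4-pixel "image"
x_t = noise_image(x0, 0.30, rng)
```

Note that the two scales do not add up to 1; their squares do, which is why the noisy image keeps roughly the same overall variance as it loses signal.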
We can't compute \(\log p_\theta(x)\) directly, so we build a lower bound on it: the Evidence Lower BOund. Maximising the floor drags the true value up with it. The bound has two readable pieces: a reconstruction term (decode my code back to me) and a KL regulariser (keep my encoder cloud close to the prior so the map stays samplable).
In words: the log-probability of the data is at least "how well the decoder rebuilds \(x\) from codes the encoder suggests" minus "how far the encoder's cloud has drifted from the standard bell curve". Make this big and you've made \(\log p_\theta(x)\) big too.
In words: the gap between the true value and our floor is exactly how wrong the encoder is about the true posterior. A perfect encoder closes the gap and the bound becomes an equality. That's why a better encoder = a tighter bound.
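In symbols, the bound and its gap are usually written as follows. This is a reconstruction in the page's notation (encoder \(q_\phi(z\mid x)\), decoder \(p_\theta(x\mid z)\), prior \(p(z)\)), offered as a reference rather than a quote:

\[
\log p_\theta(x) \;\ge\; \underbrace{\mathbb E_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big)}_{\text{KL regulariser}} \;=\; \mathrm{ELBO}
\]
\[
\log p_\theta(x) - \mathrm{ELBO} \;=\; D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\big) \;\ge\; 0
\]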
The tall bar is the fixed truth \(\log p(x)\). The green bar is the ELBO. Improve the encoder (slide right) and the green floor rises, the pink gap shrinks toward zero, and the bound gets tight.
Say \(\log p(x)=-3.0\) (nats). A mediocre encoder gives ELBO \(=-3.8\), so the gap (its posterior error) is \(0.8\). Improve it until ELBO \(=-3.05\): the gap is now \(0.05\), so we're reading the true likelihood almost exactly, and the reconstruction/KL split tells us where the remaining nats are lost.
The ELBO weights reconstruction and KL equally. In practice you put a knob β in front of the KL term. Turn it up and the encoder cloud snaps tightly to the prior: a cleaner, more disentangled, more samplable map, but blurrier rebuilds. Turn it down and you get crisp reconstructions whose codes ignore the prior (and don't sample well). β-VAE is exactly this dial.
In words: the same two ELBO pieces, but the KL is scaled by \(\beta\). \(\beta=1\) is the honest ELBO from above. \(\beta>1\) pushes the encoder harder toward the standard bell curve (Higgins et al., 2017); \(\beta<1\) lets it drift for sharper rebuilds.
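To make the dial concrete, here is a small sketch of a β-weighted loss for a diagonal-Gaussian encoder. Two assumptions that are not from the page: the decoder is a unit-variance Gaussian (so the reconstruction term reduces to half the squared error, up to a constant), and the function names are hypothetical. The KL term uses the standard closed form against \(\mathcal N(0,I)\).

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), in nats."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))

def beta_vae_loss(x, x_recon, mu, sigma, beta=1.0):
    """Negative beta-weighted ELBO (to minimise), up to an additive constant.

    Assumes a unit-variance Gaussian decoder, so -log p(x|z) becomes
    half the squared reconstruction error.
    """
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * kl_to_standard_normal(mu, sigma)

# Same reconstruction, same encoder cloud, two settings of the dial.
print(beta_vae_loss([1.0], [0.5], [1.0], [1.0], beta=1.0),
      beta_vae_loss([1.0], [0.5], [1.0], [1.0], beta=4.0))  # 0.625 2.125
```

With \(\beta=4\) the identical encoder cloud costs four times as much, so the optimiser is pushed to shrink the KL even at the expense of reconstruction.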
Crank \(\beta\) too high and the cheapest way to win is to make the KL term zero: the encoder simply outputs the prior, \(q_\phi(z\mid x)\approx p(z)\), regardless of \(x\). The code now carries no information about the input (mutual information \(I(x;z)\to 0\)) and a powerful decoder just ignores \(z\) and paints a generic average. This is posterior collapse, the formal version of the blur the Beginner tier showed you, and why β (or KL annealing / free-bits) has to be tuned, not maxed.
In words: the average KL the model is willing to pay is an upper bound on how much the code can tell you about the image. Penalising KL with \(\beta\) therefore caps the information the latent may store: the higher \(\beta\), the tighter the cap, until at the extreme the code says nothing at all.
The same recon-vs-rate trade governs the VAE that front-ends every latent-diffusion model (Stable Diffusion, SDXL, Flux, SD3): its autoencoder is trained with a deliberately tiny KL weight so codes stay sharp and information-rich, and the diffusion model, not the KL prior, does the heavy lifting of making the latent space samplable. The "tidy map" job moved from the VAE to the diffusion process. See the Latent Diffusion card in the Frontier tier.
To train we need gradients, but you can't differentiate through "roll a random \(z\)". The fix: don't sample \(z\) directly. Sample a fixed noise \(\epsilon\) outside the model, then build \(z\) from it with a smooth formula. Now the randomness is an external input and gradients flow cleanly through \(\mu\) and \(\sigma\).
In words: instead of drawing \(z\) from the encoder's cloud, shift and stretch a standard noise draw by the encoder's mean and spread. Same distribution for \(z\), but now \(z\) is a plain differentiable function of \(\mu,\sigma\), and \(\epsilon\) is just data.
The stochastic node is split: a deterministic path (clay, carries gradients) plus an external dice roll \(\epsilon\) (rose, no gradient needed).
Both setups estimate the same gradient, but the naive "score-function" estimator that samples \(z\) directly is wildly noisy. Watch the spread of gradient estimates: with the trick they cluster tightly (training descends smoothly); without it they scatter (training stalls / diverges).
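The variance difference can be reproduced on a toy problem. The sketch below is an illustration, not the page's demo: it estimates \(\tfrac{d}{d\mu}\,\mathbb E_{z\sim\mathcal N(\mu,1)}[z^2] = 2\mu\) two ways, once through the reparameterized sample \(z=\mu+\epsilon\) and once with the score-function estimator \(f(z)\,\partial_\mu \log \mathcal N(z;\mu,1)\). The objective \(z^2\) and all names are chosen purely for illustration.

```python
import random
import statistics

def grad_estimates(mu, n, rng):
    """Per-sample estimates of d/dmu E[z^2] for z ~ N(mu, 1). True value: 2*mu."""
    reparam, score_fn = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        z = mu + eps                        # reparameterized sample (sigma fixed at 1)
        reparam.append(2.0 * z)             # d(z^2)/dmu, differentiating through z = mu + eps
        score_fn.append(z * z * (z - mu))   # f(z) * d/dmu log N(z; mu, 1)
    return reparam, score_fn

rng = random.Random(0)
rp, sf = grad_estimates(mu=1.0, n=20000, rng=rng)
print("means:    ", statistics.mean(rp), statistics.mean(sf))
print("variances:", statistics.variance(rp), statistics.variance(sf))
```

Both estimators are unbiased, so the means should land near 2. For this particular toy the per-sample variances work out analytically to 4 (reparameterized) and 30 (score-function), and the gap typically grows with dimension and with less friendly objectives.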
VAEs work, but their samples are famously soft. The culprit is mathematical: a Gaussian decoder rewards the average of all plausible images. When one blurry code could explain several sharp originals, the safest low-error guess is to blend them, and blends are blurry.
The true data has two equally-good sharp answers for one code (e.g. a "3" written two ways). A Gaussian likelihood minimises squared error by predicting their mean, which is neither answer. Slide the ambiguity up to watch the crisp modes melt into one smear.
In words: a Gaussian decoder is trained with squared error, and the guess that minimises squared error is the mean of all valid images for that code. If two sharp images share a code, the mean is their fuzzy average. Hence: blur.
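A tiny numeric check of that claim, using a one-number "image" with two equally valid sharp answers, \(-1\) and \(+1\). The setup is a made-up toy; it only shows that the squared-error winner is the in-between value that matches neither answer.

```python
def mse(guess, targets):
    """Mean squared error of one constant guess against all valid targets."""
    return sum((guess - t) ** 2 for t in targets) / len(targets)

targets = [-1.0, 1.0]                          # two sharp answers for the same code
candidates = [i / 10 for i in range(-15, 16)]  # guesses from -1.5 to 1.5
best = min(candidates, key=lambda g: mse(g, targets))

print(best, mse(best, targets), mse(1.0, targets))  # 0.0 1.0 2.0
```

Committing to either sharp answer costs twice as much as the blurry midpoint, so a squared-error decoder learns the midpoint.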
Posterior collapse: if the decoder is powerful enough to ignore \(z\), the KL term happily pushes the encoder all the way to the prior and \(z\) carries no information: the model stops using its latent code at all.
One giant leap: mapping pure noise to a full sharp image in a single decode is a brutally hard function to learn. The fix that powers diffusion: don't leap, take a thousand tiny, easy steps. That's the next concept.
Stack the VAE idea over and over. The forward process is a fixed chain of tiny Gaussian noisings; the reverse process is a learned chain of tiny denoisings. The diffusion ELBO is then just a sum of per-timestep KL terms: one easy little VAE per rung of the ladder.
In words: each forward step shrinks the image a hair and adds a hair of noise (top). Because Gaussians compose, you can also jump straight to step \(t\) in one line, the same reparameterization trick, reused (bottom).
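"Gaussians compose" can be verified numerically. The sketch below assumes the usual single-step form \(x_t=\sqrt{\alpha_t}\,x_{t-1}+\sqrt{1-\alpha_t}\,\epsilon_t\) with \(\bar\alpha_t=\prod_{s\le t}\alpha_s\), plus a linear \(\beta\) schedule whose endpoints are illustrative defaults rather than values from the page. It tracks the signal coefficient and noise variance through every small step and compares them with the one-line jump.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule; endpoints are illustrative only."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def step_by_step(betas):
    """Track x_t = coef * x0 + N(0, var) through each single noising step."""
    coef, var = 1.0, 0.0
    for beta in betas:
        alpha = 1.0 - beta
        coef *= math.sqrt(alpha)    # the image shrinks a hair
        var = alpha * var + beta    # old noise shrinks, a hair of new noise is added
    return coef, var

def closed_form(betas):
    """Jump straight to the last step using alpha_bar = product of alphas."""
    alpha_bar = math.prod(1.0 - b for b in betas)
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

betas = linear_betas(1000)
print(step_by_step(betas[:500]))
print(closed_form(betas[:500]))
```

The two printed pairs should agree to floating-point precision, which is the "jump straight to step \(t\)" shortcut in action.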
In words: the whole diffusion objective decomposes into a sum of per-timestep KL terms, each one asking "does my learned reverse step match the true reverse step at this rung?". Add a reconstruction term at the bottom and a prior term at the top.
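For reference, the decomposition described above is commonly written as below (following the standard DDPM derivation; the labels map onto the "top", "rung" and "bottom" wording used here, and the display is a reconstruction rather than a quote):

\[
-\log p_\theta(x_0) \;\le\; \mathbb E_q\Big[\,
\underbrace{D_{\mathrm{KL}}\big(q(x_T\mid x_0)\,\|\,p(x_T)\big)}_{\text{prior term (top)}}
\;+\; \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big)}_{\text{one rung of the ladder}}
\;\underbrace{-\,\log p_\theta(x_0\mid x_1)}_{\text{reconstruction (bottom)}}
\Big]
\]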
The reverse step is only allowed to be a simple Gaussian when each forward step is tiny, i.e. when there are many of them. The true reverse of one big step is a lumpy, multi-modal mess that a single Gaussian can't capture. Drag the step count: the green Gaussian (our model) only hugs the grey true reverse when \(T\) is large.
Walking down a smooth ramp blindfolded: 1000 baby steps and every footfall lands where you expect (Gaussian, predictable). Try it in 3 giant leaps and you can't predict where you'll land: the possible landing points are spread across several places at once. Many small steps are what make each reverse step approximately Gaussian, which is the entire reason diffusion is learnable.
You can name the noise, now use it. Sampling subtracts predicted noise step by step. Conditioning lets you ask for a specific thing. Classifier-free guidance lets you ask more strongly. And there's a neat bridge to score functions.
In words: one reverse step. Take the noisy image, subtract a scaled copy of the predicted noise, rescale, and add a touch of fresh randomness \(\sigma_t z\) so samples stay diverse. Do this from \(t=T\) down to \(t=1\) and an image appears.
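As a sketch, the reverse step described above can be written per pixel with plain Python lists. It assumes the standard DDPM update \(x_{t-1}=\tfrac{1}{\sqrt{\alpha_t}}\big(x_t-\tfrac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\big)+\sigma_t z\); the function name and argument layout are hypothetical, and `eps_pred` stands in for the network's output.

```python
import math
import random

def ddpm_reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng):
    """One reverse step: subtract scaled predicted noise, rescale, add fresh noise."""
    scale = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [
        (x - scale * e) / math.sqrt(alpha_t) + sigma_t * rng.gauss(0.0, 1.0)
        for x, e in zip(x_t, eps_pred)
    ]

# Sanity check at t = 1 (where alpha_bar_1 == alpha_1) with an oracle noise guess.
rng = random.Random(0)
alpha = 0.9999
x0, eps = [0.5, -1.0], [0.3, -0.7]
x1 = [math.sqrt(alpha) * a + math.sqrt(1.0 - alpha) * e for a, e in zip(x0, eps)]
x0_back = ddpm_reverse_step(x1, eps, alpha, alpha, sigma_t=0.0, rng=rng)
print(x0_back)
```

With a perfect noise guess and \(\sigma_t=0\), the final step lands back on the clean image (up to rounding). A real sampler uses the network's imperfect guess and a nonzero \(\sigma_t\) for every step except the last.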
In words: predicting noise is (up to a scale) the same as estimating the score, the direction in pixel-space that makes the image more probable. "Denoise" and "walk uphill in likelihood" are two views of one arrow. This links DDPMs to score-based / SDE models.
In words (the unifying identity): Tweedie's formula. The denoiser's best guess of the clean image (the posterior mean \(\mathbb{E}[x_0\mid x_t]\)), the score, and the predicted noise are the same object in three costumes. Substituting the score relation from above turns the left equality into the familiar \(\hat x_0\)-from-\(\epsilon\) formula, so "predict the noise", "estimate the score", and "guess the clean image" are one network with three read-outs.
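For reference, the two relations described above are usually written as follows for the forward marginal \(x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon\). This is a reconstruction in this page's notation; the approximation sign reflects that a trained \(\epsilon_\theta\) only approximates the ideal noise predictor:

\[
\nabla_{x_t}\log q(x_t) \;\approx\; -\,\frac{\epsilon_\theta(x_t,t)}{\sqrt{1-\bar\alpha_t}}
\]
\[
\mathbb E[x_0\mid x_t] \;=\; \frac{x_t + (1-\bar\alpha_t)\,\nabla_{x_t}\log q(x_t)}{\sqrt{\bar\alpha_t}}
\;\approx\; \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}} \;=\; \hat x_0
\]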
Why it matters (2024 to 2026): this identity is the backbone of the EDM / score-SDE design space and of every \(x_0\)- and \(v\)-prediction model in modern practice; they're algebraic re-parameterisations of this one line. See the score/SDE and EDM cards in the Frontier tier.
In words: classifier-free guidance. Run the model twice: once with your prompt \(c\), once with no prompt, then exaggerate the difference by a factor \(w\). Higher \(w\) pulls the sample harder toward the prompt (sharper, more on-topic) but can hurt diversity and realism.
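In code, the guidance rule is one line. The sketch below uses the convention where \(w=0\) gives the unconditional prediction, \(w=1\) the plain conditional one, and \(w>1\) extrapolates past it. Papers and libraries differ on this scaling (some write \(1+w\)), so treat the exact convention as an assumption. The function name is hypothetical.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Guided noise: start at the unconditional guess, move w times the way toward the conditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # model output with no prompt
eps_cond = [1.0, 1.0]     # model output with the prompt
print(cfg_combine(eps_uncond, eps_cond, 3.0))  # [3.0, 1.0]
```

Only the coordinate where the prompt changes the prediction is exaggerated; where both runs agree, guidance leaves the value alone.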
The toy generator is torn between two prompts. Pick a target and crank guidance \(w\): low \(w\) drifts to a blurry average of both shapes; high \(w\) snaps decisively toward your prompt.
Each reverse step trusts a linear approximation of a curved path from noise to image. Few steps take big, crude jumps and overshoot fine detail; many small steps hug the true trajectory and look cleaner, at a linear cost in compute. Fast samplers curve smarter so you need fewer steps (DDIM, EDM and one-step consistency models are derived in the Frontier tier). Try the Sampling steps slider above: low steps look blocky, high steps look crisp.
The condition \(c\) is just an extra input to \(\epsilon_\theta\). For text-to-image, \(c\) is a text embedding fed in via cross-attention; for class-conditional models it's a learned label vector. During training you randomly drop \(c\) (set it to \(\varnothing\)) some of the time, and that's what gives you the unconditional model needed for the guidance formula above. No separate classifier required.
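The random dropping described above can be sketched in a few lines. The drop probability of 0.1 is an illustrative value, not one stated on this page, and `NULL` is a stand-in for whatever "no prompt" token or embedding a real model uses.

```python
import random

NULL = None  # hypothetical stand-in for the empty condition

def maybe_drop_condition(c, p_drop, rng):
    """Return the condition unchanged, or the empty condition with probability p_drop."""
    return NULL if rng.random() < p_drop else c

rng = random.Random(0)
labels = [maybe_drop_condition("cat", 0.1, rng) for _ in range(10000)]
dropped_fraction = sum(1 for c in labels if c is NULL) / len(labels)
print(dropped_fraction)
```

Because the same network sees both kinds of example, it learns the conditional and the unconditional noise predictor at once.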
DDPM and DDIM start from the same noise and aim at the same data, but take different routes. DDPM adds a fresh kick of randomness \(\sigma_t z\) every step: a jittery walk that needs many steps to settle. DDIM sets \(\sigma_t=0\): a smooth, deterministic update that follows the underlying ODE and can reach the data in a handful of steps. Watch both descend toward a two-mode toy density and drag the step slider down: the DDIM path stays clean long after the DDPM path turns blocky.
Same start point, same target modes. The clay path is stochastic DDPM; the green path is deterministic DDIM. Lower the step count to see DDIM stay smooth while DDPM gets rough and scattered.
In words: the DDPM reverse step from Level 4. The injected \(\sigma_t z\) is genuine randomness, so two runs from the same \(x_t\) diverge, and you need many small steps for the average path to track the true curve.
In words: DDIM. Estimate the clean image \(\hat x_0\), then jump directly to the next noise level with no added randomness. The update is a deterministic ODE solver step, so the same start always gives the same image, and big, accurate steps are allowed, which is why a handful suffice.
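Here is a minimal sketch of that deterministic update on plain Python lists, assuming the standard \(\sigma_t=0\) DDIM form: first \(\hat x_0=(x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta)/\sqrt{\bar\alpha_t}\), then \(x_{\text{prev}}=\sqrt{\bar\alpha_{\text{prev}}}\,\hat x_0+\sqrt{1-\bar\alpha_{\text{prev}}}\,\epsilon_\theta\). The function name is hypothetical and `eps_pred` stands in for the network output.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (sigma_t = 0) from noise level t to an earlier level."""
    out = []
    for x, e in zip(x_t, eps_pred):
        # Estimate the clean value, then re-noise it to the target level with the same eps.
        x0_hat = (x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        out.append(math.sqrt(alpha_bar_prev) * x0_hat + math.sqrt(1.0 - alpha_bar_prev) * e)
    return out

# With an oracle noise guess, one big jump to alpha_bar_prev = 1 recovers x0.
x0, eps = [0.5, -1.0], [0.3, -0.7]
ab = 0.30
x_t = [math.sqrt(ab) * a + math.sqrt(1.0 - ab) * e for a, e in zip(x0, eps)]
print(ddim_step(x_t, eps, ab, 1.0))
```

There is no random draw anywhere in the function, so the same start always gives the same result. How large a jump stays accurate in practice depends on how good the network's noise guess is.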
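# --- one DDPM training step (epsilon-prediction) ---
# Assumes model, opt, x0 (batch of clean images), c, T and alpha_bar (length-T tensor) already exist.
import torch
import torch.nn.functional as F

t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timestep per image
noise = torch.randn_like(x0)                        # the epsilon we will try to predict
ab = alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)   # alpha_bar_t for the schedule, broadcast over C, H, W
sqrt_ab = ab.sqrt()
sqrt_1mab = (1 - ab).sqrt()
x_t = sqrt_ab * x0 + sqrt_1mab * noise              # reparameterized forward sample
pred = model(x_t, t, c)                             # network guesses the noise (c = condition)
loss = F.mse_loss(pred, noise)                      # L = || eps - eps_theta ||^2
opt.zero_grad()                                     # clear gradients left over from the previous step
loss.backward()                                     # backpropagate through the network
opt.step()                                          # update the weights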
t picks a random timestep so the model trains on every noise level, easy to hard.
noise is the target \(\epsilon\), the exact thing the network must learn to name.
x_t = sqrt_ab*x0 + sqrt_1mab*noise is the reparameterization formula from Level 3, in code.
model(x_t, t, c) feeds the noisy image, the timestep, and the optional condition \(c\); randomly setting c=None during training is what enables classifier-free guidance.
mse_loss(pred, noise) is the simple loss \(L\), no fancy likelihood term needed.
The honest ELBO gives every timestep its own weight \(w_t\). Ho et al. noticed that simply throwing those weights away (weighting every timestep equally) trains faster and produces sharper images. It's an approximation to the true objective, but a beneficial one: it stops the loss from obsessing over the near-clean steps that barely matter for perceived quality.
In words: the principled variational loss is the same noise-MSE, but each timestep is scaled by a weight \(w_t\) that blows up at small \(t\). Those huge weights make training spend almost all its effort on the easy, barely-noisy steps.
In words: set every weight to 1. This is the loss everyone actually uses. It quietly up-weights the harder, noisier timesteps, which is where image quality is really decided, so the "wrong" objective gives better pictures.
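To see how lopsided the principled weights are, the sketch below computes them for a toy schedule. It assumes the weight from the standard DDPM derivation with \(\sigma_t^2=\beta_t\), namely \(w_t=\beta_t/\big(2\,\alpha_t\,(1-\bar\alpha_t)\big)\), and a linear \(\beta\) schedule with illustrative endpoints; neither choice is stated on this page.

```python
def elbo_weights(betas):
    """w_t = beta_t / (2 * alpha_t * (1 - alpha_bar_t)), assuming sigma_t^2 = beta_t."""
    weights, alpha_bar = [], 1.0
    for beta in betas:
        alpha = 1.0 - beta
        alpha_bar *= alpha
        weights.append(beta / (2.0 * alpha * (1.0 - alpha_bar)))
    return weights

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # hypothetical linear schedule
w = elbo_weights(betas)
print(w[0], w[-1], w[0] / w[-1])  # first-step weight, last-step weight, their ratio
```

Under these assumptions the least-noisy step is weighted tens of times more heavily than the noisiest one; \(L_{\text{simple}}\) flattens that to 1 everywhere.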
The curve is the per-timestep weight applied to the noise-MSE. Toggle between the true ELBO weighting (spikes at low \(t\)) and the flat \(w_t=1\) of \(L_{\text{simple}}\). The bar below shows how training effort gets distributed across timesteps as a result.
One animated trip through the chain, with the live numbers attached. Play it forward to watch \(x_t=\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\,\epsilon\) bury the image, then reverse to watch the sampler \(x_{t-1}=\tfrac{1}{\sqrt{\alpha_t}}(x_t-\cdots)\) dig it back out. Each frame prints the current \(t\), \(\bar\alpha_t\), and which equation is acting.
Nothing about diffusion says "image". Swap the picture for a robot trajectory (a sequence of waypoints) and the model denoises random scribbles into a smooth, valid plan. This is Diffusion Policy, and you can project each reverse step onto safety constraints (the SafeDiffuser idea). Below is a one-canvas taste; the full story (Diffusion Policy, flow for control (π0), 3D Diffuser Actor, Diffusion Forcing and more) lives in the Frontier tier, alongside flows, the score/SDE view, DDIM, EDM, latent diffusion and consistency models.
The dotted path starts as pure noise between start (green) and goal (clay). Hit Denoise plan and watch it relax into a smooth route. Turn on the safety projection to push every reverse step out of the red obstacle, so the plan bends around it instead of through it.
In words: denoise the trajectory \(\tau\) exactly like an image, conditioned on the current state \(s\). Then project each partially-denoised plan back into the safe set \(\mathcal C\) (no collisions, joint limits) before the next step. Safety is baked into the sampling loop, not bolted on afterward.
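The projection step can be illustrated with the simplest possible safe set: everything outside one circular obstacle. This is a plain nearest-point projection written for illustration, not a reproduction of the SafeDiffuser method; the function name and the 2D waypoint format are assumptions.

```python
import math

def project_out_of_circle(path, center, radius):
    """Move any waypoint inside a circular obstacle to the nearest point on its boundary."""
    cx, cy = center
    safe = []
    for x, y in path:
        dx, dy = x - cx, y - cy
        dist = math.hypot(dx, dy)
        if dist >= radius:
            safe.append((x, y))                      # already safe: leave untouched
        elif dist == 0.0:
            safe.append((cx + radius, cy))           # exactly at the centre: pick an arbitrary direction
        else:
            k = radius / dist
            safe.append((cx + dx * k, cy + dy * k))  # push radially out to the boundary
    return safe

# A partially denoised plan that cuts through an obstacle at the origin.
plan = [(-2.0, 0.1), (-0.5, 0.1), (0.0, 0.0), (0.5, 0.1), (2.0, 0.1)]
safe_plan = project_out_of_circle(plan, center=(0.0, 0.0), radius=1.0)
print(safe_plan)
```

In a sampling loop, a call like this would sit between consecutive reverse steps, so each denoising step starts from a plan that already respects the constraint.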
A robot often has many equally-good ways to reach a goal (go left of the table, or right). A single-output policy averages them and crashes into the table. A diffusion policy keeps the modes separate and commits to one smooth plan, and the per-step projection lets you guarantee it never enters the obstacle, a property a one-shot network can't promise.
Frontier · research-grade
A research-grade curriculum in three strands. (1) Flows: normalizing flows → continuous normalizing flows → flow matching, with exact likelihoods and the velocity-field view. (2) Diffusion & variants: score/SDE, DDIM, the EDM design space that unifies them, latent diffusion, and one-step consistency models. (3) Robotics: how diffusion and flow became the dominant action representation: Diffusion Policy, flow for control (π0), and a tour of recent papers (3D Diffuser Actor, Equivariant DP, RDT, Diffusion Forcing, UniPi). Each topic is a guided lesson with step-through proofs, a worked example, an advanced visualization, and citations. Work them in order, or jump in.