Neuronauts | Diffusion Models

Diffusion Models

Start with static, then teach a model to clean it into a picture.

A diffusion model is an art restorer trapped inside a snowstorm. It learns to wipe away a little noise at a time, and after enough careful wipes a blizzard of static turns into a real image. Generating = denoising.

By the end, you will be able to explain:

  • Why denoising step-by-step is the same thing as generating
  • The forward (add noise) and reverse (remove noise) processes
  • The DDPM marginal, the reparameterization trick, and the loss
  • How text/class conditioning and guidance steer the result

Live: static is restored into a painting, then dissolved back into noise: the two directions of diffusion.

Level 1 · Beginner

The snow-globe idea: shake it up, then let it settle into a picture.

Imagine a finished painting. Now sprinkle TV static on top until you can't see it at all. That's noise. A diffusion model has practiced the reverse trick so many times that it can look at the static and guess what was underneath, removing a thin layer of fuzz over and over until the painting reappears.

🖼️ The Restorer's Slider

Drag from t = T (pure static) down to t = 0 (clean image). Watch the painting emerge.

Everyday example 1: the snow globe

Shake a snow globe and the scene vanishes behind swirling flakes. Set it down and let it settle: the scene comes back. Diffusion is a model that has learned how to "let the flakes settle" on purpose.

Everyday example 2: wiping fog off a mirror

A foggy bathroom mirror hides your reflection. Each swipe of your hand reveals a bit more. The model makes one tiny swipe per step, and after many steps the full reflection (image) is clear.

Everyday example 3: a sandcastle in reverse

Waves smear a sandcastle into flat sand (adding noise is easy and automatic). The clever part is a machine that can run the tape backwards and rebuild the castle grain by grain.

Worked example: counting the swipes

Suppose generation uses 1000 tiny steps. At step 1000 the canvas is 100% static. The model removes a sliver of noise, so step 999 is maybe 99.8% static, barely different. Repeat. Around step 600 faint blobs of color appear. By step 200 you can tell it's a face. At step 0 it's a crisp picture.

Key idea: no single step does much. Generation is the accumulation of a thousand gentle denoises.

Why not just paint it in one shot?

Predicting a whole sharp image from pure noise in one jump is incredibly hard. Removing a little noise is comparatively easy. It's the difference between sculpting a statue with one hammer blow and shaping it with a thousand careful taps. Splitting the job into many easy steps is what makes diffusion work so well.

New idea · the real goal

We don't just want to copy a picture, we want to invent new ones.

A photocopier can reconstruct a picture you feed it, but it can never surprise you with a brand-new one. The dream of generative modelling is a machine that has seen thousands of faces and can now dream up a face nobody has ever seen. To do that, the machine needs a tidy "map of ideas" it can pick a random spot on, and that map has to be smooth, so every spot decodes into something sensible.

🗺️ Autoencoder vs. a samplable map

An autoencoder squeezes each picture into a tiny code (a dot on the map) and learns to rebuild it. But the dots end up scattered with big empty gaps. Pick a random point and you land in a gap → garbage. A good generative map packs the dots into one smooth blob so every random pick decodes into a believable image. Toggle the two and click Sample a random point.

Each dot is one training image encoded to a 2D code. The big tile shows what the decoder makes from the picked point.

Everyday example: a music playlist

Autoencoder map: songs pinned at random spots with silence in between. Drop the needle anywhere and you mostly hit static. A smooth map clusters similar songs into one dense region, so a random drop always lands on music, maybe a tune you've never heard, but still a tune.

Everyday example: handwriting

Encode many handwritten "3"s. If the codes for a curly 3 and a blocky 3 sit far apart with a void between, the midpoint decodes to a smudge. We want the in-between point to decode to a reasonable blend, a slightly-curly 3. That "no holes" property is exactly what makes a map samplable.

This is the bridge to the next level: to guarantee "no holes", we make the map a probability distribution and force the codes to fill a known shape (a bell curve). That's a VAE.

Level 2 · Intermediate

Two arrows: a forward process that ruins, a reverse process that repairs.

Training has two halves. The forward process takes a clean image and adds a measured dose of noise at each timestep. It needs no learning, it's just a recipe. The reverse process is a neural network that looks at a noisy image and predicts the noise that was added, so we can subtract it. The model's only job is to be a good noise-guesser.

Diagram: \(x_0 \to \dots \to x_{t-1} \to x_t \to \dots \to x_T\). Forward \(q(x_t \mid x_{t-1})\): add a little noise (→). Reverse \(p_\theta(x_{t-1} \mid x_t)\): predict and remove noise (←, the neural net).

The forward chain (rose, fixed recipe) walks a clean image into pure noise. The reverse chain (green, learned) walks noise back to an image, one denoise at a time.

⚙️ Forward vs. Reverse

t = 0
FORWARD · clean → noisy (just adds static)
REVERSE · the model's denoised guess at t

📉 The noise schedule

The schedule decides how much signal survives at each timestep. The bar shows how much of the original image (\(\bar\alpha_t\)) remains vs. how much is noise (\(1-\bar\alpha_t\)). Early on the image dominates; late on it's almost all noise.

First real example

At t = 0 forward and reverse agree perfectly, nothing to fix. Push t up: FORWARD smears the picture into static, while REVERSE shows what the model recovers from that much noise. Small t → easy → near-perfect recovery. Large t → almost no signal left → only a rough guess survives.

Worked example: the training loop in plain words

  1. Grab a real image \(x_0\) and a random timestep \(t\) (say 730).
  2. Roll some random noise \(\epsilon\) and add the "t = 730 amount" of it to make a noisy image \(x_t\).
  3. Show \(x_t\) (and the number 730) to the network. Ask: "what noise did I add?"
  4. Compare the network's guess to the real \(\epsilon\). The gap is the loss; nudge the weights to shrink it.

Repeat millions of times over many images and timesteps. A model that can reliably name the noise can also remove it, and removing noise is generation.
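
The four steps above can be sketched in a few lines of plain Python. This is a toy, not the real thing: the "images" are single numbers drawn from a standard bell curve, the "network" is just one learnable weight per timestep (`weights[t]`, a hypothetical stand-in for a neural net), and the 10-step linear schedule is an assumption chosen to keep the run fast. For this particular toy data the best possible linear guess happens to be \(\sqrt{1-\bar\alpha_t}\,x_t\), which gives us something to check the learned weights against.

```python
import math
import random

T = 10                                    # toy number of timesteps (assumption)
betas = [0.05 + 0.25 * i / (T - 1) for i in range(T)]   # assumed linear schedule
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)            # cumulative surviving signal

weights = [0.0] * T                       # toy "network": eps_hat = weights[t] * x_t
rng = random.Random(0)

def sample_pair():
    """Steps 1 and 2: pick a clean sample and a timestep, then noise it."""
    x0 = rng.gauss(0.0, 1.0)
    t = rng.randrange(T)
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, t, eps

def training_step(lr=0.01):
    """Steps 3 and 4: guess the noise, then nudge the weight to shrink the gap."""
    x_t, t, eps = sample_pair()
    error = weights[t] * x_t - eps
    weights[t] -= lr * 2.0 * error * x_t  # gradient of error**2 w.r.t. weights[t]
    return error ** 2

def mean_loss(n=2000):
    """Average noise-MSE on fresh samples, without updating anything."""
    total = 0.0
    for _ in range(n):
        x_t, t, eps = sample_pair()
        total += (weights[t] * x_t - eps) ** 2
    return total / n

loss_before = mean_loss()
for _ in range(40_000):
    training_step()
loss_after = mean_loss()
print(f"noise-MSE before: {loss_before:.3f}  after: {loss_after:.3f}")
print(f"learned weight at the last step: {weights[-1]:.3f}, "
      f"target: {math.sqrt(1.0 - alpha_bars[-1]):.3f}")
```

A real model swaps the per-timestep weight for a network that sees the whole noisy image plus \(t\), but the loop keeps the same four moves.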

New idea · the VAE

The VAE: make the map a bell curve you can sample.

A Variational Autoencoder upgrades the encoder so that instead of pinning each image to a single dot, it maps the image to a small fuzzy cloud (a little Gaussian). During training we gently push all those clouds to pile up into one big standard bell curve centred at the origin. Once they do, generating is trivial: draw a random point from that bell curve and let the decoder turn it into an image.

🌫️ Encode → sample → decode

Left: the latent plane with the prior \(p(z)=\mathcal N(0,I)\) drawn as rings. Drag the spread to see each image become a cloud, not a point. Hit Generate to draw \(z\sim p(z)\) and decode it.

\[ p_\theta(x) = \int p_\theta(x \mid z)\,p(z)\,dz, \qquad p(z)=\mathcal N(0,\mathbf I) \]

In words: the chance the model assigns to a real image \(x\) is the total over every latent code \(z\): how likely that code is under the prior, times how likely the decoder is to draw \(x\) from it. We'd love to maximise this, but the integral runs over all of latent space, so it's intractable.

  • \(z\): a latent code (a point on the map)
  • \(p(z)=\mathcal N(0,\mathbf I)\): the prior: a standard bell curve we can sample directly
  • \(p_\theta(x\mid z)\): the decoder: given a code, how likely is image \(x\)
  • \(q_\phi(z\mid x)\): the encoder: given an image, which codes are plausible (a small Gaussian)
  • \(\theta,\phi\): decoder / encoder weights

Why the integral is hopeless

Latent space might be 256-dimensional. Summing \(p_\theta(x\mid z)p(z)\) over all of it is like asking "out of every possible code in the universe, how many would have produced this exact cat photo?" Astronomically many to check. The next level shows the clever dodge: a lower bound we can compute.
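
To make the integral concrete, here is a minimal sketch in a one-dimensional toy model where the answer is known. The decoder \(p(x\mid z)=\mathcal N(z,1)\) is an assumption picked only because it makes \(p(x)=\mathcal N(0,2)\) available in closed form; the function names are illustrative. The estimate simply averages the decoder's likelihood over random draws from the prior.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def marginal_mc(x, n_samples, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x | z)] by averaging over prior draws."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)           # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)    # toy decoder p(x | z) = N(z, 1)
    return total / n_samples

rng = random.Random(0)
x = 1.0
estimate = marginal_mc(x, 20_000, rng)
exact = normal_pdf(x, 0.0, 2.0)           # closed form, valid for this toy only
print(f"Monte Carlo: {estimate:.4f}   exact: {exact:.4f}")
```

In one dimension a few thousand prior draws are plenty. In a 256-dimensional latent space almost every prior draw explains the image terribly, so this naive average needs an impractical number of samples, which is the motivation for the bound introduced next.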

Level 3 · Advanced

The real DDPM math (now properly typeset).

Time for the equations behind the snow globe. Three formulas carry almost everything: the forward marginal (jump straight to any timestep), the reparameterization that writes \(x_t\) in one line, and the surprisingly simple training loss.

\[ q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\, \sqrt{\bar\alpha_t}\,x_0,\; (1-\bar\alpha_t)\mathbf{I}\big) \]

In words: you don't have to add noise step by step, you can jump straight to timestep \(t\). The noisy image is the clean image shrunk by \(\sqrt{\bar\alpha_t}\), plus Gaussian noise with variance \(1-\bar\alpha_t\) (standard deviation \(\sqrt{1-\bar\alpha_t}\)).

  • \(x_0\): the original clean image
  • \(x_t\): the image after \(t\) steps of noising
  • \(\bar\alpha_t\): fraction of the signal's variance that survives by step \(t\) (starts near 1, decays to ~0); the image itself is scaled by \(\sqrt{\bar\alpha_t}\)
  • \(\mathcal{N}(\mu,\Sigma)\): a Gaussian (bell curve) with mean \(\mu\), covariance \(\Sigma\)
  • \(\mathbf{I}\): identity matrix: noise is added independently to every pixel
\[ x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0,\mathbf{I}) \]

In words: the same idea written as one concrete recipe: blend the clean image with a single draw of standard noise, weighted by the schedule. This is the "reparameterization trick": it makes \(x_t\) a differentiable function of a fixed random \(\epsilon\).

  • \(\epsilon\): standard Gaussian noise, the thing the network will try to predict
  • \(\sqrt{\bar\alpha_t}\): how much clean image to keep
  • \(\sqrt{1-\bar\alpha_t}\): how much noise to pour in
\[ L = \mathbb{E}_{x_0,\,\epsilon,\,t}\Big[\,\big\lVert \epsilon - \epsilon_\theta(x_t,\,t) \big\rVert^2\,\Big] \]

In words: over random images, random noise, and random timesteps, make the network's predicted noise \(\epsilon_\theta\) match the noise we actually added. It's just mean-squared error on noise.

  • \(\epsilon_\theta(x_t,t)\): the network's guess of the noise, given the noisy image and the timestep
  • \(\theta\): the network's trainable weights
  • \(\lVert\cdot\rVert^2\): squared distance (the penalty for a wrong guess)
  • \(\mathbb{E}[\cdot]\): average over all those random draws

🎚️ Noise-schedule explorer

Watch \(\bar\alpha_t\) decay across timesteps. Choose a schedule and drag the marker to read off how much signal vs. noise is in \(x_t\) at that point.

Worked example: plug in numbers

Say at \(t=500\) the schedule gives \(\bar\alpha_t = 0.30\). Then \(\sqrt{\bar\alpha_t}\approx0.55\) and \(\sqrt{1-\bar\alpha_t}\approx0.84\). So \(x_t \approx 0.55\,x_0 + 0.84\,\epsilon\): the noisy image is roughly 55% original picture and 84%-strength static, already noise-dominated. The readout above updates with the live numbers as you drag.
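
The arithmetic above is easy to reproduce. A minimal sketch (the helper name `mix_coefficients` is made up for illustration):

```python
import math

def mix_coefficients(alpha_bar):
    """Return (signal scale, noise scale) for x_t = a * x0 + b * eps."""
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)

signal, noise = mix_coefficients(0.30)
print(f"signal scale {signal:.2f}, noise scale {noise:.2f}")

# The two scales are not shares of 100%: it is their squares that sum to 1.
power_total = signal ** 2 + noise ** 2
print(f"signal^2 + noise^2 = {power_total:.2f}")
```

That is why "55%" and "84%" can both be true at once: they are amplitudes, and the variances (0.30 and 0.70) are what add up to one.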

New idea · the ELBO

The ELBO: a floor we can compute, pushed up against a ceiling we can't.

We can't compute \(\log p_\theta(x)\) directly, so we build a lower bound on it: the Evidence Lower BOund. Maximising the floor drags the true value up with it. The bound has two readable pieces: a reconstruction term (decode my code back to me) and a KL regulariser (keep my encoder cloud close to the prior so the map stays samplable).

\[ \log p_\theta(x) \;\ge\; \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x\mid z)\big]}_{\text{reconstruction}} \;-\; \underbrace{\mathrm{KL}\!\big(q_\phi(z\mid x)\,\|\,p(z)\big)}_{\text{regulariser}} \;=\; \mathcal{L}_{\text{ELBO}} \]

In words: the log-probability of the data is at least "how well the decoder rebuilds \(x\) from codes the encoder suggests" minus "how far the encoder's cloud has drifted from the standard bell curve". Make this big and you've made \(\log p_\theta(x)\) big too.

  • reconstruction: average log-likelihood of getting \(x\) back from sampled codes
  • \(\mathrm{KL}(q\,\|\,p)\): penalty for the encoder cloud straying from \(p(z)=\mathcal N(0,\mathbf I)\)
  • \(\mathbb{E}_{q_\phi}[\cdot]\): average over codes drawn from the encoder
  • \(\mathcal L_{\text{ELBO}}\): the bound we actually maximise
\[ \log p_\theta(x) - \mathcal{L}_{\text{ELBO}} \;=\; \mathrm{KL}\!\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\big) \;\ge\; 0 \]

In words: the gap between the true value and our floor is exactly how wrong the encoder is about the true posterior. A perfect encoder closes the gap and the bound becomes an equality. That's why a better encoder = a tighter bound.

  • \(p_\theta(z\mid x)\): the true (unknowable) posterior over codes
  • gap: \(\ge 0\) always, so the ELBO never overshoots the truth

📊 Tighten the bound

The tall bar is the fixed truth \(\log p(x)\). The green bar is the ELBO. Improve the encoder (slide right) and the green floor rises, the pink gap shrinks toward zero, and the bound gets tight.

Worked example: splitting the score

Say \(\log p(x)=-3.0\) (nats). A mediocre encoder gives ELBO \(=-3.8\), so the gap (its posterior error) is \(0.8\). Improve it until ELBO \(=-3.05\): the gap is now \(0.05\), we're reading the true likelihood almost exactly, and the reconstruction/KL split tells us why any remaining points are lost.
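
The same bookkeeping can be checked exactly in a toy model where every piece has a closed form. The sketch below assumes a one-dimensional model with \(z\sim\mathcal N(0,1)\) and \(x\mid z\sim\mathcal N(z,1)\), so that \(p(x)=\mathcal N(0,2)\) and the true posterior is \(\mathcal N(x/2,\,1/2)\); the encoder is a Gaussian \(\mathcal N(m, s^2)\) with hand-picked parameters rather than a trained network.

```python
import math

def log_evidence(x):
    """log p(x) for the toy model z ~ N(0, 1), x | z ~ N(z, 1), so x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, s2):
    """Reconstruction minus KL for an encoder q(z | x) = N(m, s2)."""
    recon = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl_prior = 0.5 * (m * m + s2 - 1.0 - math.log(s2))
    return recon - kl_prior

def posterior_gap(x, m, s2):
    """KL(q || true posterior); the true posterior here is N(x / 2, 1 / 2)."""
    post_mean, post_var = x / 2.0, 0.5
    return 0.5 * (s2 / post_var + (m - post_mean) ** 2 / post_var
                  - 1.0 + math.log(post_var / s2))

x = 1.0
encoders = [(0.0, 1.0), (0.4, 0.7), (0.5, 0.5)]   # poor -> better -> perfect
for m, s2 in encoders:
    print(f"ELBO {elbo(x, m, s2):.4f}   gap {posterior_gap(x, m, s2):.4f}")
print(f"log p(x) {log_evidence(x):.4f}")
```

For every encoder the printed gap equals \(\log p(x)\) minus the ELBO, and it reaches zero only when the encoder matches the true posterior.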

New idea · the β knob

One dial trades sharp reconstructions for a tidy, disentangled latent map.

The ELBO weights reconstruction and KL equally. In practice you put a knob β in front of the KL term. Turn it up and the encoder cloud snaps tightly to the prior: a cleaner, more disentangled, more samplable map, but blurrier rebuilds. Turn it down and you get crisp reconstructions whose codes ignore the prior (and don't sample well). β-VAE is exactly this dial.

\[ \mathcal{L}_\beta = \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x\mid z)\big]}_{\text{reconstruction}} \;-\; \beta\,\underbrace{\mathrm{KL}\!\big(q_\phi(z\mid x)\,\|\,p(z)\big)}_{\text{regulariser}} \]

In words: the same two ELBO pieces, but the KL is scaled by \(\beta\). \(\beta=1\) is the honest ELBO from above. \(\beta>1\) pushes the encoder harder toward the standard bell curve (Higgins et al., 2017); \(\beta<1\) lets it drift for sharper rebuilds.

  • \(\beta\): the KL weight: the single knob that sets the trade-off
  • \(\beta=1\): recovers the exact ELBO (the bound stays valid)
  • \(\beta\uparrow\): tidier, more disentangled, more samplable codes; blurrier output
  • \(\beta\downarrow\): sharper reconstruction; codes stray from the prior, sample poorly

The failure mode at large β: posterior collapse

Crank \(\beta\) too high and the cheapest way to win is to make the KL term zero: the encoder simply outputs the prior, \(q_\phi(z\mid x)\approx p(z)\), regardless of \(x\). The code now carries no information about the input (mutual information \(I(x;z)\to 0\)) and a powerful decoder just ignores \(z\) and paints a generic average. This is posterior collapse, the formal version of the blur the Beginner tier showed you, and why β (or KL annealing / free-bits) has to be tuned, not maxed.

\[ I(x;z)\;\le\;\mathbb{E}_x\big[\mathrm{KL}(q_\phi(z\mid x)\,\|\,p(z))\big] \]

In words: the average KL the model is willing to pay is an upper bound on how much the code can tell you about the image. Penalising KL with \(\beta\) therefore caps the information the latent may store: the higher \(\beta\), the tighter the cap, until at the extreme the code says nothing at all.

  • \(I(x;z)\): mutual information: bits the code carries about the input
  • \(\mathbb{E}_x[\mathrm{KL}]\): the rate the model spends; an information ceiling
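
Collapse can be watched happening in a toy model. The sketch assumes the one-dimensional setup \(z\sim\mathcal N(0,1)\), \(x\mid z\sim\mathcal N(z,1)\) with a Gaussian encoder \(\mathcal N(m,s^2)\). Setting the derivatives of \(\mathcal L_\beta\) to zero in that setup gives \(m = x/(1+\beta)\) and \(s^2=\beta/(1+\beta)\); this closed form is specific to the toy and is not a general result.

```python
import math

def best_encoder(x, beta):
    """Encoder N(m, s2) maximising recon - beta * KL in the toy model
    z ~ N(0, 1), x | z ~ N(z, 1); found by setting both derivatives to zero."""
    return x / (1.0 + beta), beta / (1.0 + beta)

def kl_to_prior(m, s2):
    """KL(N(m, s2) || N(0, 1)) in nats."""
    return 0.5 * (m * m + s2 - 1.0 - math.log(s2))

x = 1.0
betas = [0.1, 1.0, 10.0, 100.0]
kls = []
for beta in betas:
    m, s2 = best_encoder(x, beta)
    kls.append(kl_to_prior(m, s2))
    print(f"beta={beta:6.1f}  mean={m:.3f}  var={s2:.3f}  KL={kls[-1]:.4f}")
```

As \(\beta\) grows the optimal encoder slides toward mean 0 and variance 1 no matter what \(x\) was, and the KL it pays (the information ceiling) shrinks toward zero.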

Latest (2024 to 2026): where the β idea lives now

The same recon-vs-rate trade governs the VAE that front-ends every latent-diffusion model (Stable Diffusion, SDXL, Flux, SD3): its autoencoder is trained with a deliberately tiny KL weight so codes stay sharp and information-rich, and the diffusion model, not the KL prior, does the heavy lifting of making the latent space samplable. The "tidy map" job moved from the VAE to the diffusion process. See the Latent Diffusion card in the Frontier tier.

New idea · backprop through randomness

The reparameterization trick: move the dice outside the network.

To train we need gradients, but you can't differentiate through "roll a random \(z\)". The fix: don't sample \(z\) directly. Sample a fixed noise \(\epsilon\) outside the model, then build \(z\) from it with a smooth formula. Now the randomness is an external input and gradients flow cleanly through \(\mu\) and \(\sigma\).

\[ z = \mu_\phi(x) + \sigma_\phi(x)\odot\epsilon, \qquad \epsilon\sim\mathcal N(0,\mathbf I) \]

In words: instead of drawing \(z\) from the encoder's cloud, shift and stretch a standard noise draw by the encoder's mean and spread. Same distribution for \(z\), but now \(z\) is a plain differentiable function of \(\mu,\sigma\), and \(\epsilon\) is just data.

  • \(\mu_\phi(x)\): centre of the encoder cloud (learned, differentiable)
  • \(\sigma_\phi(x)\): spread of the cloud (learned, differentiable)
  • \(\epsilon\): external standard noise; the only random part, carries no gradient
  • \(\odot\): elementwise multiply

Diagram: \(x\) → \(\mu(x)\), \(\sigma(x)\); external \(\epsilon\) → \(z\) → decode. Gradients flow through \(\mu\) and \(\sigma\) (✓); \(\epsilon\) is fixed noise (✗).

The stochastic node is split: a deterministic path (clay, carries gradients) plus an external dice roll \(\epsilon\) (rose, no gradient needed).

🧪 What if we DON'T reparameterize?

Both setups estimate the same gradient, but the naive "score-function" estimator that samples \(z\) directly is wildly noisy. Watch the spread of gradient estimates: with the trick they cluster tightly (training descends smoothly); without it they scatter (training stalls / diverges).
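
A minimal sketch of that comparison, on a problem small enough to know the right answer. The assumption is a one-dimensional "encoder" \(z\sim\mathcal N(\mu,\sigma^2)\) and a made-up objective \(f(z)=z^2\), whose true gradient with respect to \(\mu\) is \(2\mu\). Both estimators below are unbiased; only their spread differs.

```python
import random
import statistics

mu, sigma = 1.0, 1.0                      # toy encoder parameters (assumption)
rng = random.Random(0)
reparam, score_fn = [], []
for _ in range(20_000):
    eps = rng.gauss(0.0, 1.0)             # the external dice roll
    z = mu + sigma * eps                  # reparameterized sample
    # Pathwise estimate: d/dmu of f(z) = z**2 with z = mu + sigma * eps.
    reparam.append(2.0 * z)
    # Score-function estimate: f(z) * d/dmu log N(z; mu, sigma**2).
    score_fn.append(z * z * (z - mu) / sigma ** 2)

true_grad = 2.0 * mu                      # d/dmu E[z**2] = d/dmu (mu**2 + sigma**2)
print("reparam   mean, variance:",
      statistics.fmean(reparam), statistics.variance(reparam))
print("score-fn  mean, variance:",
      statistics.fmean(score_fn), statistics.variance(score_fn))
```

Both averages land near the true gradient, but the score-function estimates are spread several times wider, which is the scatter the widget visualises.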

New idea · the catch

Why a single-jump VAE comes out blurry.

VAEs work, but their samples are famously soft. The culprit is mathematical: a Gaussian decoder rewards the average of all plausible images. When one blurry code could explain several sharp originals, the safest low-error guess is to blend them, and blends are blurry.

🌁 Sharp target vs. averaged guess

The true data has two equally-good sharp answers for one code (e.g. a "3" written two ways). A Gaussian likelihood minimises squared error by predicting their mean, which is neither answer. Slide the ambiguity up to watch the crisp modes melt into one smear.

\[ \arg\min_{\hat x}\ \mathbb{E}_{x\sim p(x\mid z)}\big[\lVert x-\hat x\rVert^2\big] \;=\; \mathbb{E}[x\mid z] \]

In words: a Gaussian decoder is trained with squared error, and the guess that minimises squared error is the mean of all valid images for that code. If two sharp images share a code, the mean is their fuzzy average. Hence: blur.

  • \(\hat x\): the decoder's single output for a code
  • \(\mathbb{E}[x\mid z]\): the conditional mean (the blur-prone optimum)
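
The claim is easy to verify by brute force. In this sketch the two "sharp images" are just the numbers −1 and +1 (an illustrative stand-in), and we try every candidate guess on a grid to see which one has the lowest squared error.

```python
def mse(guess, targets):
    """Mean squared error of one fixed guess against all valid targets."""
    return sum((x - guess) ** 2 for x in targets) / len(targets)

targets = [-1.0, 1.0]                     # two equally likely sharp answers for one code
candidates = [i / 100.0 for i in range(-150, 151)]
best = min(candidates, key=lambda g: mse(g, targets))

print("best single guess:", best)                 # the mean of the two targets
print("error of the blurry mean:", mse(best, targets))
print("error of committing to +1:", mse(1.0, targets))
```

The winner is the midpoint, a value that is neither of the real answers, and committing to either sharp answer costs twice the error.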

Two more failure modes

Posterior collapse: if the decoder is powerful enough to ignore \(z\), the KL term happily pushes the encoder all the way to the prior and \(z\) carries no information: the model stops using its latent code at all.

One giant leap: mapping pure noise to a full sharp image in a single decode is a brutally hard function to learn. The fix that powers diffusion: don't leap, take a thousand tiny, easy steps. That's the next concept.

New idea · from VAE to diffusion

Diffusion = a VAE with a thousand-layer latent ladder.

Stack the VAE idea over and over. The forward process is a fixed chain of tiny Gaussian noisings; the reverse process is a learned chain of tiny denoisings. The diffusion ELBO is then just a sum of per-timestep KL terms: one easy little VAE per rung of the ladder.

\[ q(x_t\mid x_{t-1}) = \mathcal N\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf I\big),\qquad x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon \]

In words: each forward step shrinks the image a hair and adds a hair of noise (first formula). Because Gaussians compose, you can also jump straight to step \(t\) in one line, the same reparameterization trick, reused (second formula).

  • \(\beta_t\): the tiny noise added at step \(t\) (the schedule)
  • \(\bar\alpha_t=\prod_{s\le t}(1-\beta_s)\): cumulative surviving signal
  • \(p_\theta(x_{t-1}\mid x_t)\): the learned reverse step (a tiny decoder)
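
"Gaussians compose" can be checked with simple bookkeeping: track how much of \(x_0\) is left and how much noise variance has piled up after each single step, then compare with the one-line jump. The linear schedule from \(10^{-4}\) to \(0.02\) over 1000 steps is an assumption used for illustration.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule

signal, noise_var = 1.0, 0.0              # x_0 itself: all signal, no noise
alpha_bar = 1.0
for beta in betas[:500]:                  # apply 500 single steps q(x_t | x_{t-1})
    signal = math.sqrt(1.0 - beta) * signal        # the mean shrinks a hair
    noise_var = (1.0 - beta) * noise_var + beta    # old noise shrinks, new noise adds
    alpha_bar *= 1.0 - beta                        # bookkeeping for the one-line jump

print(f"step by step: signal {signal:.6f}, noise variance {noise_var:.6f}")
print(f"closed form:  signal {math.sqrt(alpha_bar):.6f}, "
      f"noise variance {1.0 - alpha_bar:.6f}")
```

The two rows agree, so 500 small noisings and one big jump describe the same distribution.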
\[ -\log p_\theta(x_0)\ \le\ \underbrace{\textstyle\sum_{t>1}\mathrm{KL}\!\big(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big)}_{\text{one KL per rung}} + (\text{recon} + \text{prior})\]

In words: the whole diffusion objective decomposes into a sum of per-timestep KL terms, each one asking "does my learned reverse step match the true reverse step at this rung?". Add a reconstruction term at the bottom and a prior term at the top.

  • \(\sum_{t}\mathrm{KL}\): a separate, simple matching problem per timestep
  • \(q(x_{t-1}\mid x_t,x_0)\): the true reverse step (tractable Gaussian once \(x_0\) is known)

🧪 What if we use too FEW steps?

The reverse step is only allowed to be a simple Gaussian when each forward step is tiny, i.e. when there are many of them. The true reverse of one big step is a lumpy, multi-modal mess that a single Gaussian can't capture. Drag the step count: the green Gaussian (our model) only hugs the grey true reverse when \(T\) is large.

Worked example: the staircase

Walking down a smooth ramp blindfolded: 1000 baby steps and every footfall lands where you expect (Gaussian, predictable). Try it in 3 giant leaps and you can't predict where you'll land: the landing spot is spread across several spots at once. Many small steps are what make each reverse step approximately Gaussian, which is the entire reason diffusion is learnable.

Level 4 · Expert

Sampling, conditioning, guidance, and the score connection.

You can name the noise, now use it. Sampling subtracts predicted noise step by step. Conditioning lets you ask for a specific thing. Classifier-free guidance lets you ask more strongly. And there's a neat bridge to score functions.

\[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t,t)\right) + \sigma_t z \]

In words: one reverse step. Take the noisy image, subtract a scaled copy of the predicted noise, rescale, and add a touch of fresh randomness \(\sigma_t z\) so samples stay diverse. Do this from \(t=T\) down to \(t=1\) and an image appears.

  • \(\alpha_t = 1-\beta_t\): the per-step signal-keep factor
  • \(\sigma_t z\): small injected noise (\(z\sim\mathcal N(0,\mathbf I)\)); set to 0 at the last step
  • \(\epsilon_\theta(x_t,t)\): the network's predicted noise, the only learned part
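
Here is the reverse loop as a minimal sketch. Two things are assumptions made so the example runs without a trained network: the dataset contains a single "image", the number `MU`, which lets us write the perfect noise predictor by hand (`oracle_eps`, a stand-in for \(\epsilon_\theta\)); and the injected noise uses \(\sigma_t^2=\beta_t\), one common choice. The 50-step schedule is also just a toy.

```python
import math
import random

T = 50
betas = [1e-4 + (0.2 - 1e-4) * i / (T - 1) for i in range(T)]   # assumed toy schedule
alphas = [1.0 - b for b in betas]
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

MU = 2.0                                  # the single "image" in a one-point dataset

def oracle_eps(x_t, t):
    """Stand-in for eps_theta: the exact noise when every clean sample equals MU."""
    return (x_t - math.sqrt(alpha_bars[t]) * MU) / math.sqrt(1.0 - alpha_bars[t])

def ddpm_sample(rng):
    x = rng.gauss(0.0, 1.0)               # start from pure noise x_T
    for t in reversed(range(T)):          # list index t stands for timestep t + 1
        eps_hat = oracle_eps(x, t)
        coef = (1.0 - alphas[t]) / math.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / math.sqrt(alphas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0   # no fresh noise on the last step
        x = mean + sigma * rng.gauss(0.0, 1.0)
    return x

sample = ddpm_sample(random.Random(0))
print(f"sample: {sample:.6f}  (dataset value: {MU})")
```

With a perfect predictor and a one-point dataset, the walk ends on that point whatever noise it started from. A trained network replaces `oracle_eps`, and the rest of the loop stays the same.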
\[ \nabla_{x_t}\log q(x_t) \;\approx\; -\frac{\epsilon_\theta(x_t,t)}{\sqrt{1-\bar\alpha_t}} \]

In words: predicting noise is (up to a scale) the same as estimating the score, the direction in pixel-space that makes the image more probable. "Denoise" and "walk uphill in likelihood" are two views of one arrow. This links DDPMs to score-based / SDE models.

  • \(\nabla_{x_t}\log q(x_t)\): the score: gradient of log-density w.r.t. the image
  • \(\epsilon_\theta\): predicted noise; its negative points toward cleaner images
\[ \mathbb{E}[x_0\mid x_t] \;=\; \frac{x_t + (1-\bar\alpha_t)\,\nabla_{x_t}\log q(x_t)}{\sqrt{\bar\alpha_t}} \;=\; \frac{x_t - \sqrt{1-\bar\alpha_t}\;\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}} \]

In words (the unifying identity): Tweedie's formula. The denoiser's best guess of the clean image (the posterior mean \(\mathbb{E}[x_0\mid x_t]\)), the score, and the predicted noise are the same object in three costumes. Substituting the score relation from above turns the middle expression into the right-hand one, the familiar \(\hat x_0\)-from-\(\epsilon\) formula, so "predict the noise", "estimate the score", and "guess the clean image" are one network with three read-outs.

  • \(\mathbb{E}[x_0\mid x_t]\): the optimal denoiser: posterior mean of the clean image
  • \(\hat x_0\): in code, this is the same quantity the sampler reconstructs each step
  • \(\bar\alpha_t\): the cumulative signal factor that ties all three views together
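
The three read-outs are a few lines of arithmetic apart. In this sketch the numbers are arbitrary, and the true noise is passed in where a network's prediction would normally go, which is why the clean value is recovered exactly rather than approximately.

```python
import math

def score_from_eps(eps_hat, alpha_bar):
    """Score estimate implied by a noise prediction."""
    return -eps_hat / math.sqrt(1.0 - alpha_bar)

def x0_from_score(x_t, score, alpha_bar):
    """Tweedie's formula: clean-image estimate from the score."""
    return (x_t + (1.0 - alpha_bar) * score) / math.sqrt(alpha_bar)

def x0_from_eps(x_t, eps_hat, alpha_bar):
    """The same estimate written directly in terms of the predicted noise."""
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps_hat) / math.sqrt(alpha_bar)

alpha_bar, x0, eps = 0.30, 0.8, -0.5      # arbitrary illustrative numbers
x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

via_eps = x0_from_eps(x_t, eps, alpha_bar)
via_score = x0_from_score(x_t, score_from_eps(eps, alpha_bar), alpha_bar)
print(f"x0 via eps: {via_eps:.6f}   x0 via score: {via_score:.6f}   true x0: {x0}")
```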

Why it matters (2024 to 2026): this identity is the backbone of the EDM / score-SDE design space and of every \(x_0\)- and \(v\)-prediction model in modern practice; they're algebraic re-parameterisations of this one line. See the score/SDE and EDM cards in the Frontier tier.

\[ \tilde\epsilon_\theta(x_t,t,c) = \epsilon_\theta(x_t,t,\varnothing) + w\big(\epsilon_\theta(x_t,t,c) - \epsilon_\theta(x_t,t,\varnothing)\big) \]

In words: classifier-free guidance. Run the model twice: once with your prompt \(c\), once with no prompt, then exaggerate the difference by a factor \(w\). Higher \(w\) pulls the sample harder toward the prompt (sharper, more on-topic) but can hurt diversity and realism.

  • \(c\): the condition (text embedding, class label, ...)
  • \(\varnothing\): the "no condition" / dropped-prompt case (trained by randomly dropping \(c\))
  • \(w\): guidance strength: \(w=0\) ignores the prompt, \(w=1\) is the plain conditional prediction, \(w>1\) exaggerates the prompt and large \(w\) over-commits to it
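
In code the guidance formula is a one-liner applied to the two noise predictions. The sketch uses short plain lists with made-up values in place of real network outputs.

```python
def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance on two noise predictions of the same shape."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [0.2, -0.4, 1.0]      # hypothetical eps_theta(x_t, t, c)
eps_uncond = [0.0, -0.1, 0.5]    # hypothetical eps_theta(x_t, t, no prompt)

for w in (0.0, 1.0, 3.0):
    print(f"w = {w}: {guided_eps(eps_cond, eps_uncond, w)}")
```

At \(w=0\) the output is the unconditional prediction, at \(w=1\) it is the conditional one, and beyond that it extrapolates past the conditional prediction along the same direction.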

🎯 Guidance-strength playground

The toy generator is torn between two prompts. Pick a target and crank guidance \(w\): low \(w\) drifts to a blurry average of both shapes; high \(w\) snaps decisively toward your prompt.

Why more steps = better but slower

Each reverse step trusts a linear approximation of a curved path from noise to image. Few steps take big, crude jumps and overshoot fine detail; many small steps hug the true trajectory and look cleaner, at a linear cost in compute. Fast samplers curve smarter so you need fewer steps (DDIM, EDM and one-step consistency models are derived in the Frontier tier). Try the Sampling steps slider above: low steps look blocky, high steps look crisp.

Conditioning, briefly

The condition \(c\) is just an extra input to \(\epsilon_\theta\). For text-to-image, \(c\) is a text embedding fed in via cross-attention; for class-conditional models it's a learned label vector. During training you randomly drop \(c\) (set it to \(\varnothing\)) some of the time, and that's what gives you the unconditional model needed for the guidance formula above. No separate classifier required.

New idea · why fewer steps can still work

Two ways down the same hill: stochastic vs deterministic sampling.

DDPM and DDIM start from the same noise and aim at the same data, but take different routes. DDPM adds a fresh kick of randomness \(\sigma_t z\) every step: a jittery walk that needs many steps to settle. DDIM sets \(\sigma_t=0\): a smooth, deterministic update that follows the underlying ODE and can reach the data in a handful of steps. Watch both descend toward a two-mode toy density and drag the step slider down: the DDIM path stays clean long after the DDPM path turns blocky.

🗺️ DDPM vs DDIM trajectories

Same start point (◆), same target modes. The clay path is stochastic DDPM; the green path is deterministic DDIM. Lower the step count to see DDIM stay smooth while DDPM gets rough and scattered.

Legend: DDPM (\(\sigma_t z\) on) · DDIM (\(\sigma_t=0\)) · data modes

\[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\right) + \underbrace{\sigma_t z}_{\text{DDPM}} \]

In words: the DDPM reverse step from Level 4. The injected \(\sigma_t z\) is genuine randomness, so two runs from the same \(x_t\) diverge, and you need many small steps for the average path to track the true curve.

  • \(\sigma_t z\): the stochastic kick; DDIM sets \(\sigma_t=0\), which also changes the weight on \(\epsilon_\theta\) in the deterministic part (compare the next formula)
\[ x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat x_0 + \sqrt{1-\bar\alpha_{t-1}}\;\epsilon_\theta,\qquad \hat x_0 = \frac{x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta}{\sqrt{\bar\alpha_t}} \]

In words: DDIM. Estimate the clean image \(\hat x_0\), then jump directly to the next noise level with no added randomness. The update is a deterministic ODE solver step, so the same start always gives the same image, and big, accurate steps are allowed, which is why a handful suffice.

  • \(\hat x_0\): the current best guess of the clean signal
  • deterministic: reproducible samples; enables fast few-step solvers (EDM in the Frontier)
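
A minimal sketch of the DDIM update, run over just five timesteps out of a thousand. As assumptions for the sake of a runnable example: the dataset holds a single value `MU`, so the perfect noise predictor can be written by hand (`oracle_eps` stands in for \(\epsilon_\theta\)); the schedule is linear from \(10^{-4}\) to \(0.02\); and the update is applied between non-adjacent timesteps, which the formula allows because it only refers to \(\bar\alpha\) at the current and the next visited level.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bars.append(running)

MU = 2.0                                  # the single "image" in a one-point dataset

def oracle_eps(x_t, t):
    """Stand-in for eps_theta: the exact noise when every clean sample equals MU."""
    return (x_t - math.sqrt(alpha_bars[t]) * MU) / math.sqrt(1.0 - alpha_bars[t])

def ddim_sample(x_start, timesteps):
    """Deterministic DDIM pass over a decreasing list of timestep indices."""
    x = x_start
    for i, t in enumerate(timesteps):
        eps_hat = oracle_eps(x, t)
        x0_hat = (x - math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alpha_bars[t])
        # Signal level of the next visited timestep; 1.0 means fully clean.
        next_bar = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = math.sqrt(next_bar) * x0_hat + math.sqrt(1.0 - next_bar) * eps_hat
    return x

few_steps = [999, 749, 499, 249, 0]
run_a = ddim_sample(0.7, few_steps)
run_b = ddim_sample(0.7, few_steps)
print(f"two runs from the same start: {run_a:.6f}, {run_b:.6f}")
```

Both runs print the same number because nothing random happens after the starting point is fixed. With a learned predictor, the quality of few-step sampling depends on how accurate \(\hat x_0\) is at each visited level.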

🧪 PyTorch idea: the whole training step

New idea · the lucky approximation

Dropping the "correct" weighting actually trains better.

The honest ELBO gives every timestep its own weight \(w_t\). Ho et al. noticed that simply throwing those weights away (weighting every timestep equally) trains faster and produces sharper images. It's an approximation to the true objective, but a beneficial one: it stops the loss from obsessing over the near-clean steps that barely matter for perceived quality.

\[ \mathcal L_{\text{vlb}} = \mathbb E_{x_0,\epsilon,t}\Big[\,w_t\,\big\lVert\epsilon-\epsilon_\theta(x_t,t)\big\rVert^2\Big],\qquad w_t=\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t(1-\bar\alpha_t)} \]

In words: the principled variational loss is the same noise-MSE, but each timestep is scaled by a weight \(w_t\) that blows up at small \(t\). Those huge weights make training spend almost all its effort on the easy, barely-noisy steps.

  • \(w_t\): the timestep weight from the true ELBO derivation
  • \(\sigma_t^2\): the reverse-step variance
  • \(\alpha_t,\bar\alpha_t\): per-step and cumulative signal factors
\[ \mathcal L_{\text{simple}} = \mathbb E_{x_0,\epsilon,t}\Big[\,\big\lVert\epsilon-\epsilon_\theta(x_t,t)\big\rVert^2\Big]\qquad(\,w_t \equiv 1\,) \]

In words: set every weight to 1. This is the loss everyone actually uses. It quietly up-weights the harder, noisier timesteps, which is where image quality is really decided, so the "wrong" objective gives better pictures.

  • \(w_t\equiv 1\): the deliberate approximation: equal weight everywhere
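
To see how lopsided the ELBO weighting is, the sketch below evaluates \(w_t\) across a schedule. It assumes a linear schedule from \(10^{-4}\) to \(0.02\) over 1000 steps and the reverse variance \(\sigma_t^2=\beta_t\); other choices of \(\sigma_t^2\) change the numbers but this is enough to show the shape.

```python
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule

elbo_weights = []
alpha_bar = 1.0
for beta in betas:
    alpha = 1.0 - beta
    alpha_bar *= alpha
    sigma2 = beta                         # assumed reverse variance sigma_t^2 = beta_t
    elbo_weights.append(beta ** 2 / (2.0 * sigma2 * alpha * (1.0 - alpha_bar)))

for step in (1, 10, 100, 500, 1000):      # timesteps are 1-based, list indices 0-based
    print(f"t = {step:4d}   w_t = {elbo_weights[step - 1]:.4f}")
```

Under these assumptions the weight at \(t=1\) is dozens of times larger than at the noisy end of the chain, whereas \(L_{\text{simple}}\) treats all of them as 1.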

🧪 Weighted ELBO vs. simple loss

The curve is the per-timestep weight applied to the noise-MSE. Toggle between the true ELBO weighting (spikes at low \(t\)) and the flat \(w_t=1\) of \(L_{\text{simple}}\). The bar below shows how training effort gets distributed across timesteps as a result.

Centerpiece · watch the equations act

The whole journey, annotated: clean → noise → clean.

One animated trip through the chain, with the live numbers attached. Play it forward to watch \(x_t=\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\,\epsilon\) bury the image, then reverse to watch the sampler \(x_{t-1}=\tfrac{1}{\sqrt{\alpha_t}}(x_t-\cdots)\) dig it back out. Each frame prints the current \(t\), \(\bar\alpha_t\), and which equation is acting.

🎬 Annotated diffusion player

Bridge · beyond images → the Frontier

Same trick, new canvas: denoising a plan, not a picture.

Nothing about diffusion says "image". Swap the picture for a robot trajectory (a sequence of waypoints) and the model denoises random scribbles into a smooth, valid plan. This is Diffusion Policy, and you can project each reverse step onto safety constraints (the SafeDiffuser idea). Below is a one-canvas taste; the full story (Diffusion Policy, flow for control (π0), 3D Diffuser Actor, Diffusion Forcing and more) lives in the Frontier tier, alongside flows, the score/SDE view, DDIM, EDM, latent diffusion and consistency models.

🤖 Denoise a trajectory

The dotted path starts as pure noise between start (green) and goal (clay). Hit Denoise plan and watch it relax into a smooth route. Turn on the safety projection to push every reverse step out of the red obstacle, so the plan bends around it instead of through it.

\[ \tau_{t-1} = \underbrace{\tfrac{1}{\sqrt{\alpha_t}}\!\big(\tau_t-c_t\,\epsilon_\theta(\tau_t,t,s)\big)}_{\text{ordinary reverse step}}, \qquad \tau_{t-1}\leftarrow \operatorname{Proj}_{\mathcal C}\big(\tau_{t-1}\big) \]

In words: denoise the trajectory \(\tau\) exactly like an image, conditioned on the current state \(s\). Then project each partially-denoised plan back into the safe set \(\mathcal C\) (no collisions, joint limits) before the next step. Safety is baked into the sampling loop, not bolted on afterward.

  • \(\tau_t\): the whole action/trajectory at noise level \(t\)
  • \(s\): the observed state the plan is conditioned on
  • \(\epsilon_\theta(\tau_t,t,s)\): predicted trajectory-noise (the learned part)
  • \(\operatorname{Proj}_{\mathcal C}\): projection onto the safe/feasible set \(\mathcal C\)
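
The projection half of that update can be sketched on its own. Here the safe set is "everywhere outside one circular obstacle", a toy assumption, and each waypoint that has drifted inside the circle is pushed to the nearest point on its edge. The function names are illustrative and not taken from any library.

```python
import math

def project_out_of_disk(point, center, radius):
    """Nearest point outside a circular obstacle (a toy stand-in for Proj_C)."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    dist = math.hypot(dx, dy)
    if dist >= radius:
        return point                      # already safe: leave untouched
    if dist == 0.0:
        dx, dy, dist = 1.0, 0.0, 1.0      # exactly at the centre: pick a direction
    scale = radius / dist
    return (center[0] + dx * scale, center[1] + dy * scale)

def project_plan(plan, center, radius):
    """Apply the projection to every waypoint of a partially denoised plan."""
    return [project_out_of_disk(p, center, radius) for p in plan]

center, radius = (0.5, 0.0), 0.2
plan = [(0.0, 0.0), (0.4, 0.1), (0.5, 0.0), (0.6, -0.1), (1.0, 0.0)]
safe_plan = project_plan(plan, center, radius)
for before, after in zip(plan, safe_plan):
    print(before, "->", after)
```

In a full sampler this call would sit right after each reverse step, so the next denoise starts from a plan that already respects the obstacle. Real constraint sets (joint limits, several obstacles) need a more careful projection than this.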

Why diffusion suits robots

A robot often has many equally-good ways to reach a goal (go left of the table, or right). A single-output policy averages them and crashes into the table. A diffusion policy keeps the modes separate and commits to one smooth plan, and the per-step projection lets you guarantee it never enters the obstacle, a property a one-shot network can't promise.

🏆 Challenge: earn the Restorer badge

Frontier · research-grade

Flows, diffusion & the generative frontier, all the way to robotics.

A research-grade curriculum in three strands. (1) Flows: normalizing flows → continuous normalizing flows → flow matching, with exact likelihoods and the velocity-field view. (2) Diffusion & variants: score/SDE, DDIM, the EDM design space that unifies them, latent diffusion, and one-step consistency models. (3) Robotics: how diffusion and flow became the dominant action representation: Diffusion Policy, flow for control (π0), and a tour of recent papers (3D Diffuser Actor, Equivariant DP, RDT, Diffusion Forcing, UniPi). Each topic is a guided lesson with step-through proofs, a worked example, an advanced visualization, and citations. Work them in order, or jump in.