L5 Chapter 2 🐥 🕒 9 min

Diffusion Math: From Adding Noise to Generation

Stable Diffusion, DALL·E 3, Sora all use diffusion models. This article explains the core math—with minimal formulas.

HelloAI Editors

7/8/2026

The Diffusion Denoising visualization showed you the “noise to image” animation.

This article opens up the math behind it and explains why this seemingly “counter-intuitive” approach works so well.

A Counter-intuitive Training Objective

Generating images—how should we train?

The intuitive approach is to make the model directly learn “how to paint realistic images,” but that is extremely hard. GANs tried it, and training was unstable. VAEs tried it, and the output was not sharp enough.

Diffusion’s approach is the opposite:

Don’t teach the model how to paint; teach it how to gradually remove noise—then generate by starting from pure noise.

Sounds weird. But this is the most important breakthrough in generative models since 2020.

Station 1: Forward Process (Adding Noise)

Define a “destruction” process for images—gradually adding Gaussian noise:

x_t = \sqrt{1 - \beta_t} \cdot x_{t-1} + \sqrt{\beta_t} \cdot \epsilon_t

$x_0$ : clean original image
$x_T$ : pure noise (T is usually 1000)
$\beta_t$ : noise intensity per step (a small number, e.g., 0.0001 to 0.02 in the original DDPM)
$\epsilon_t$ : Gaussian noise $\mathcal{N}(0, I)$

Intuition: each step pushes the image slightly toward noise; after 1000 steps it’s almost pure noise.

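Designing $\beta_t$ is an art: linear, cosine, and other schedules exist. The original DDPM uses a linear schedule; Stable Diffusion v1 uses a “scaled linear” variant (linear in $\sqrt{\beta_t}$), and the cosine schedule comes from the later Improved DDPM work.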
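To make the forward process concrete, here is a minimal NumPy sketch. It assumes a linear schedule from 0.0001 to 0.02 over 1000 steps; the name `forward_step` and the 64×64 stand-in image are illustrative, not taken from any library.

```python
import numpy as np

T = 1000  # number of diffusion steps, as in the text

# Assumed linear schedule: beta rises evenly from 1e-4 to 0.02.
betas = np.linspace(1e-4, 0.02, T)

def forward_step(x_prev, beta_t, rng):
    """One noising step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
x = np.ones((64, 64))          # stand-in "image": every pixel equals 1
for beta_t in betas:           # apply all T steps in order
    x = forward_step(x, beta_t, rng)

# After T steps almost nothing of the original remains:
# the mean is close to 0 and the standard deviation is close to 1.
print(float(x.mean()), float(x.std()))
```

Each step shrinks the signal by $\sqrt{1 - \beta_t}$ and adds just enough noise to keep the overall variance near 1, which is why the result ends up looking like a standard Gaussian.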

Jump Directly to Any Step

Luckily, jumping from $x_0$ to $x_t$ has a closed form:

x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon

Where $\bar{\alpha}_t = \prod_{i=1}^{t}(1 - \beta_i)$ .

Meaning: given the original $x_0$ and a noise sample $\epsilon$ , you get the step-$t$ image directly, with no step-by-step computation. This makes training especially fast, because each training example needs only one noising operation.
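A quick way to convince yourself of the closed form is to compare it against the step-by-step route. The sketch below assumes the same linear schedule as DDPM; `q_sample` is an illustrative name, and timesteps are 1-based to match the formulas.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)       # alpha_bar[t-1] = prod_{i<=t} (1 - beta_i)

def q_sample(x0, t, eps):
    """Jump straight to step t (1-based) using the closed form."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
x0 = np.full(100_000, 2.0)                # toy "image": many pixels equal to 2
t = 300

# Closed form: a single call.
x_jump = q_sample(x0, t, rng.standard_normal(x0.shape))

# Step by step: t separate noising steps.
x_iter = x0.copy()
for beta in betas[:t]:
    x_iter = np.sqrt(1.0 - beta) * x_iter + np.sqrt(beta) * rng.standard_normal(x0.shape)

# The two routes use different random draws, so individual samples differ,
# but their mean and standard deviation should agree closely.
print(x_jump.mean(), x_iter.mean())
print(x_jump.std(), x_iter.std())
```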

Station 2: Reverse Process (Denoising)

If we can learn a network to infer $x_{t-1}$ from $x_t$ —we can go from pure noise $x_T$ back to clean $x_0$ .

Key Theorem

If $\beta_t$ is small enough, each reverse step approximately follows a Gaussian:

p(x_{t-1} | x_t) \approx \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

Our task: train a neural network with parameters $\theta$ to predict the mean $\mu_\theta$ (the variance can be fixed).

Station 3: Simplified Training Objective

After some math (derived in the DDPM 2020 paper), the training objective simplifies to a surprisingly simple form:

L = \mathbb{E}_{t, x_0, \epsilon}[\| \epsilon - \epsilon_\theta(x_t, t) \|^2]

Translation:

Train a network $\epsilon_\theta$ that takes “noisy image + timestep t” as input, predicts “the noise that was added.” Loss is MSE between predicted and actual noise.

Just this one line. Most diffusion models train on this loss or a close variant (for example, predicting $x_0$ or a velocity instead of the noise).
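It may not be obvious how predicting noise relates to the mean $\mu_\theta$ from Station 2. The following is a short sketch of the link as given in the DDPM paper, using the symbols above plus $\alpha_t = 1 - \beta_t$ (a symbol this article has not introduced up to this point).

```latex
% From the closed form, a noise guess gives an estimate of the clean image:
\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}

% Substituting \hat{x}_0 into the mean of q(x_{t-1} \mid x_t, x_0) gives the reverse mean:
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) \right)
```

So a network that predicts the noise well also determines the reverse mean; the two parameterizations carry the same information.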

Training Loop

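import torch
import torch.nn.functional as F

# Assumes model, optimizer, dataset, T, and alpha_bar (a tensor of length T) already exist.
for image in dataset:
    # 1. Sample a random timestep (0-based index into alpha_bar)
    t = torch.randint(0, T, (1,)).item()

    # 2. Add noise using the closed-form jump
    noise = torch.randn_like(image)
    noisy_image = alpha_bar[t].sqrt() * image + (1 - alpha_bar[t]).sqrt() * noise

    # 3. Let the model predict the noise
    predicted_noise = model(noisy_image, t)

    # 4. MSE loss between predicted and actual noise
    loss = F.mse_loss(predicted_noise, noise)

    # 5. Clear old gradients, backpropagate, update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()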

The whole loop fits in about twenty lines. It is almost suspiciously simple.

Station 4: Sampling (Generation)

After training, how to generate images from noise?

DDPM Sampling

Classic version, each step:

x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z

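$\alpha_t = 1 - \beta_t$ , so $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$
$\sigma_t$ : standard deviation of the noise added back at step $t$ (e.g., $\sigma_t^2 = \beta_t$); $z \sim \mathcal{N}(0, I)$, with $z = 0$ at the final step
First term: remove part of the noise using the predicted noise
Second term: add a bit of new randomness $z$ (ensures diversity)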

Walk all $T$ steps (1000) and you get a clean image from pure noise.
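Here is a minimal NumPy sketch of that sampling loop. Since there is no trained network at hand, it uses a hypothetical `oracle_eps` that returns the exact noise for a toy dataset in which every “image” is the constant `C`; a real sampler would call the trained $\epsilon_\theta$ instead. The linear schedule and the choice $\sigma_t^2 = \beta_t$ are assumptions.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alphas = 1.0 - betas                     # alpha_t = 1 - beta_t
alpha_bar = np.cumprod(alphas)

C = 3.0  # toy dataset: every "image" is the constant 3.0

def oracle_eps(x_t, t):
    """Hypothetical stand-in for the trained network: the exact noise
    when all training data equals C. t is 1-based."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * C) / np.sqrt(1.0 - a)

def ddpm_step(x_t, t, eps_pred, rng):
    """One reverse step x_t -> x_{t-1}, following the DDPM update formula."""
    i = t - 1
    mean = (x_t - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_pred) / np.sqrt(alphas[i])
    if t == 1:
        return mean                      # no fresh noise on the final step
    sigma = np.sqrt(betas[i])            # one common choice: sigma_t^2 = beta_t
    return mean + sigma * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)               # x_T: pure noise
for t in range(T, 0, -1):
    x = ddpm_step(x, t, oracle_eps(x, t), rng)
print(x)                                 # every entry ends at C
```

With a perfect noise predictor, the final step lands exactly on the data point, which is a useful sanity check before plugging in a real network.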

DDIM Sampling (Faster)

DDPM needs 1000 steps—too slow. DDIM (2020 improvement) can do it in 20-50 steps—it skips:

DDPM: x_1000 → x_999 → x_998 → ... → x_0  (1000 steps)
DDIM: x_1000 → x_950  → x_900 → ... → x_0  (20-50 steps)

This trades a small amount of quality for a 20-50× reduction in steps. Most applications use DDIM or a more advanced solver such as DPM-Solver.
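The skipping works because the DDIM update first estimates the clean image and then re-noises it to any earlier step. The sketch below shows the deterministic variant (often written as eta = 0) with a stride of 50. As before, `oracle_eps` is a hypothetical stand-in for a trained network on a toy dataset where every sample equals `C`, and the linear schedule is an assumption.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

C = 3.0  # toy dataset: every "image" is the constant 3.0

def abar(t):
    """alpha_bar at 1-based step t, with abar(0) = 1 (no noise left)."""
    return 1.0 if t == 0 else alpha_bar[t - 1]

def oracle_eps(x_t, t):
    """Hypothetical stand-in for the trained network on the toy dataset."""
    return (x_t - np.sqrt(abar(t)) * C) / np.sqrt(1.0 - abar(t))

def ddim_step(x_t, t, t_prev, eps_pred):
    """Deterministic DDIM update from step t to an earlier step t_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - abar(t)) * eps_pred) / np.sqrt(abar(t))
    return np.sqrt(abar(t_prev)) * x0_pred + np.sqrt(1.0 - abar(t_prev)) * eps_pred

steps = list(range(T, 0, -50)) + [0]     # 1000, 950, ..., 50, 0
rng = np.random.default_rng(0)
x = rng.standard_normal(8)               # x_T: pure noise
for t, t_prev in zip(steps[:-1], steps[1:]):
    x = ddim_step(x, t, t_prev, oracle_eps(x, t))
print(len(steps) - 1, x)                 # 20 updates instead of 1000
```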

Station 5: Make Generation “Listen”—Conditional Diffusion

Everything so far is “unconditional”—gives you random images.

In practice we want “generate image from text”—e.g., “a cat sitting on a window.”

Add Text Condition to Denoiser

Simplest: feed text embedding to the denoising network:

predicted_noise = model(noisy_image, t, text_embedding)
                                          ↑ text condition added

How exactly to “feed”—usually via cross-attention:

Image patches as query
Text tokens as key/value
Attention lets image denoising “hear” the text prompt

This is why Stable Diffusion’s U-Net uses cross-attention.
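The query/key/value roles described above can be sketched as a single-head cross-attention in NumPy. All sizes and the random projection matrices `Wq`, `Wk`, `Wv` are illustrative; a real U-Net uses learned weights, multiple heads, and many such layers.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted for numerical stability."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(image_feats, text_feats, Wq, Wk, Wv):
    """Single-head cross-attention: image patches query the text tokens."""
    Q = image_feats @ Wq                  # (n_patches, d)
    K = text_feats @ Wk                   # (n_tokens, d)
    V = text_feats @ Wv                   # (n_tokens, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n_patches, n_tokens)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_patches, n_tokens, d_img, d_txt, d = 16, 5, 32, 24, 8   # illustrative sizes
image_feats = rng.standard_normal((n_patches, d_img))
text_feats = rng.standard_normal((n_tokens, d_txt))
Wq = rng.standard_normal((d_img, d))
Wk = rng.standard_normal((d_txt, d))
Wv = rng.standard_normal((d_txt, d))

out, weights = cross_attention(image_feats, text_feats, Wq, Wk, Wv)
print(out.shape, weights.shape)           # one output row per image patch
```

Each row of `weights` says how much one image patch attends to each text token, so every patch receives its own mix of the prompt's information.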

Classifier-Free Guidance (CFG)

A simple but powerful trick: during training, randomly drop the text condition (typically with around 10% probability; the exact rate varies by model).

At generation, linearly combine “conditional” and “unconditional” predictions:

\epsilon_{guided} = \epsilon_{uncond} + s \cdot (\epsilon_{cond} - \epsilon_{uncond})

$s$ is guidance scale (typically 7.5).

Intuition: make generation “more obedient” to the text prompt. A larger $s$ follows the prompt more strictly, but too large a value distorts the image.

The CFG number you adjust in Stable Diffusion is $s$ .
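Both halves of CFG fit in a few lines. In this sketch, `cfg_combine` implements the guidance formula, and `maybe_drop_condition` shows the training-time dropping; both names, and the idea of a `null_embedding` standing for the empty prompt, are illustrative assumptions.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, s):
    """Classifier-free guidance: start at the unconditional prediction
    and move s times the distance toward the conditional one."""
    return eps_uncond + s * (eps_cond - eps_uncond)

def maybe_drop_condition(text_embedding, null_embedding, p_drop, rng):
    """Training-time trick: with probability p_drop, swap the text
    embedding for a 'null' (empty-prompt) embedding."""
    return null_embedding if rng.random() < p_drop else text_embedding

eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])
print(cfg_combine(eps_uncond, eps_cond, 0.0))   # s = 0: purely unconditional
print(cfg_combine(eps_uncond, eps_cond, 1.0))   # s = 1: purely conditional
print(cfg_combine(eps_uncond, eps_cond, 7.5))   # s > 1: pushed past the conditional prediction
```

Note that guidance needs two network evaluations per sampling step (one with the prompt, one without), which roughly doubles the cost of generation.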

Station 6: Latent Diffusion (Stable Diffusion’s Optimization)

Original diffusion operates in pixel space—a 512×512 image is 786K-dim space. Computation is super expensive.

Stable Diffusion’s key trick: compress image to “latent space” first, do diffusion there.

Original image 512×512×3
        ↓ VAE Encoder
Latent representation 64×64×4
        ↓ Diffusion (happens here)
New latent 64×64×4
        ↓ VAE Decoder
New image 512×512×3

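Benefit: diffusion happens in a 16K-dim space (64×64×4 = 16,384 values), which is about 48× fewer values than pixel space and 64× fewer spatial positions, so each denoising step is far cheaper.

This is why Stable Diffusion can run on consumer GPUs. OpenAI’s DALL·E 2 ran its image decoder as pixel-space diffusion followed by upsamplers, which is much more expensive.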
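The arithmetic behind these sizes is worth checking once, using only the shapes given in the diagram:

```python
pixel_dims = 512 * 512 * 3     # values in one RGB image
latent_dims = 64 * 64 * 4      # values in its latent representation

print(pixel_dims)                    # 786432
print(latent_dims)                   # 16384
print(pixel_dims // latent_dims)     # 48: ratio of total values
print((512 * 512) // (64 * 64))      # 64: ratio of spatial positions
```

The factor of 64 counts spatial positions (an 8× reduction along each side), while the total number of values shrinks by 48 because the latent has 4 channels instead of 3.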

Complete Stable Diffusion Architecture

        "a cat" (text prompt)
             ↓
        CLIP Text Encoder → text vector
                                ↓
Random noise ──→ U-Net ←── text vector via cross-attention
   ↑              ↓
   └─ repeat 50 times
              latent vector
                  ↓
              VAE Decoder
                  ↓
              Clean image

5 components:

CLIP Text Encoder
U-Net (denoising network)
VAE Encoder / Decoder
Scheduler (DDIM and other sampling algorithms)
Tokenizer (comes with CLIP)

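This is Stable Diffusion’s full architecture. Later open models such as SDXL, SD3, and Flux build on the same latent-diffusion paradigm, although SD3 and Flux replace the U-Net with a Transformer. (Midjourney’s architecture is not public.)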

Video Generation (Sora-class)

Extend diffusion to the time dimension: the input is no longer an (H, W, 3) image but a (T, H, W, 3) video volume, where T here counts frames, not diffusion steps.

Sora’s approach:

Split video into 3D patches (spanning space and time)
Use Transformer (not U-Net) for diffusion
Data: a very large video corpus (OpenAI has not published the exact amount)

Compute is tens to hundreds of times image diffusion—which is why video generation only matured in 2024-2026.
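To illustrate what “3D patches” means, the sketch below cuts a video array into spacetime blocks and flattens each one into a token. The patch sizes are made up for the example; Sora’s actual patch sizes are not public, and according to OpenAI’s technical report it patchifies a compressed latent video, not raw pixels.

```python
import numpy as np

def patchify_video(video, pt, ph, pw):
    """Split a (T, H, W, C) video into flattened spacetime patches.
    Returns an array of shape (num_patches, pt * ph * pw * C)."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "sizes must divide evenly"
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)       # put the three patch-grid axes first
    return x.reshape(-1, pt * ph * pw * C)

video = np.arange(16 * 64 * 64 * 3, dtype=np.float32).reshape(16, 64, 64, 3)
tokens = patchify_video(video, pt=4, ph=16, pw=16)    # illustrative patch sizes
print(tokens.shape)                                    # (64, 3072)
```

Each of the 64 rows is one token covering 4 frames and a 16×16 pixel area, which is the sequence a Transformer would then process.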

Some Interesting “Emergent” Abilities

After large-scale training, diffusion models support abilities beyond plain text-to-image, some natively and some through small add-ons:

1. Inpainting

Mask part of an image, have the model fill in—natively supported because it learns “denoising.”
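One training-free way to do this (in the spirit of methods such as RePaint, and not necessarily what any particular product does) is to overwrite the known region at every sampling step with a suitably noised copy of the original. The sketch below shows just that blending step; the names, the linear schedule, and the random `x_model` stand-in for the sampler’s current state are assumptions.

```python
import numpy as np

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))   # assumed linear schedule

def noise_to_step(x0, t, rng):
    """Noise a clean image to step t (1-based) with the closed-form jump."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)

def blend(x_model, x_known, mask):
    """mask == 1: region to fill (keep the model's sample).
    mask == 0: region to preserve (overwrite with the noised original)."""
    return mask * x_model + (1.0 - mask) * x_known

rng = np.random.default_rng(0)
original = np.ones((4, 4))                 # stand-in for the user's image
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                       # fill the 2x2 centre
x_model = rng.standard_normal((4, 4))      # stand-in for the sampler's current x_t
x_known = noise_to_step(original, 500, rng)
x_t = blend(x_model, x_known, mask)        # repeated at every sampling step in practice
print(x_t)
```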

2. Style Transfer

Through CLIP guidance, make generated image “look like Van Gogh.”

3. ControlNet (2023)

Given line drawings, depth maps, poses—make generation strictly follow these “control signals.”

4. Image-to-Image

Not from pure noise—start from user-provided image with some noise, denoise. Useful for “stylizing photos.”
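In code, the only change from normal sampling is the starting point. The sketch below noises the input image to an intermediate step chosen by a `strength` value between 0 and 1 (an illustrative parameter name here); the usual reverse loop would then run from `t_start` down to 0. The linear schedule is an assumption.

```python
import numpy as np

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))   # assumed linear schedule

def img2img_start(x0, strength, rng):
    """Noise the input image part-way and return (x_t, t_start).
    strength near 0 keeps the input almost intact;
    strength = 1.0 starts from (almost) pure noise."""
    t_start = max(1, int(round(strength * T)))
    a = alpha_bar[t_start - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)
    return x_t, t_start

rng = np.random.default_rng(0)
photo = np.ones((8, 8))                    # stand-in for the user's image
x_t, t_start = img2img_start(photo, 0.3, rng)
print(t_start)                             # 300: denoise from step 300 instead of 1000
```

A lower strength preserves more of the original layout, while a higher one gives the model more freedom to restyle it.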

Some Frontiers

Flow Matching / Rectified Flow

A newer approach (proposed in 2022 and widely adopted in 2023-2024): instead of a 1000-step discrete process, learn a straight-line path between image and noise directly:

x_t = (1 - t) \cdot x_0 + t \cdot \epsilon

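Here $t$ runs continuously from 0 (clean image) to 1 (pure noise). The model learns a velocity field that “pushes” noise straight toward images. Stable Diffusion 3 uses this approach, which allows good samples in a few dozen steps or fewer.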
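A small NumPy sketch makes the “straight line” idea tangible. It takes $t = 0$ as the clean image and $t = 1$ as pure noise, matching the interpolation formula. `oracle_velocity` is a hypothetical stand-in for a trained network on a toy dataset where every sample equals `C`, and sampling is done with plain Euler steps.

```python
import numpy as np

C = 3.0  # toy dataset: every "image" is the constant 3.0

def interpolate(x0, eps, t):
    """Straight-line path: t = 0 is data, t = 1 is pure noise."""
    return (1.0 - t) * x0 + t * eps

def velocity_target(x0, eps):
    """Training target: the derivative of x_t with respect to t along the line."""
    return eps - x0

def oracle_velocity(x_t, t):
    """Hypothetical stand-in for a trained network on the toy dataset."""
    return (x_t - C) / t

def euler_sample(x1, n_steps):
    """Integrate from t = 1 (noise) down to t = 0 (data) with Euler steps."""
    x, dt = x1, 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        x = x - dt * oracle_velocity(x, t)
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal(8)
print(euler_sample(noise, 1))    # a single step already lands on C
print(euler_sample(noise, 4))    # so do four steps
```

Because the path is a straight line in this toy case, one Euler step is exact. Real data gives curved average paths, so real models still need several steps, but far fewer than a 1000-step chain.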

Consistency Models

1-step image generation—distill multi-step diffusion to single-step. Slightly lower quality but extremely fast.

The diffusion field has been evolving incredibly fast 2020-2026—2-3 major breakthroughs per year.

One-line Summary

Diffusion = train a denoiser, then walk backward from noise.

Math is elegant, training is stable, generation quality is amazing— it’s the de facto standard for generative models in the 2020s.

GANs are dead, VAEs are retired as standalone generators (one still lives inside Stable Diffusion), and Diffusion rules.

💡 Want to "see" it

Diffusion Denoising visualization — pick target image, watch 50 steps from noise to clean. Read this with the visualization—math and intuition build together.

Recommended next: L5-03 ViT or jump to L4-04 Agent Construction—your interest decides.