Diffusion Math: From Adding Noise to Generation
Stable Diffusion, DALL·E 3, Sora all use diffusion models. This article explains the core math—with minimal formulas.
The Diffusion Denoising visualization showed you the “noise to image” animation.
This article opens its math box—explaining why this seemingly “counter-intuitive” approach works so well.
A Counter-intuitive Training Objective
Generating images—how should we train?
Intuitive approach: make the model directly learn “how to paint realistic images”—but extremely hard. GANs tried, training was unstable. VAEs tried, output not sharp enough.
Diffusion’s approach is the opposite:
Don’t teach the model how to paint; teach it how to gradually remove noise—then generate by starting from pure noise.
Sounds weird. But this is the most important breakthrough in generative models since 2020.
Station 1: Forward Process (Adding Noise)
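Define a “destruction” process for images that gradually adds Gaussian noise:
$$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_t$$
- $x_0$: clean original image
- $x_T$: almost pure noise (T is usually 1000)
- $\beta_t$: noise intensity at step $t$ (a small number, e.g., 0.0001 to 0.02 in DDPM's linear schedule)
- $\epsilon_t \sim \mathcal{N}(0, I)$: Gaussian noise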
Intuition: each step pushes the image slightly toward noise; after 1000 steps it’s almost pure noise.
Designing $\beta_t$ is an art: linear, cosine, and other schedules exist. Stable Diffusion v1 uses a scaled-linear schedule; the cosine schedule comes from the Improved DDPM paper.
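To make the schedule choice concrete, here is a minimal sketch comparing a linear schedule with a cosine one. The linear endpoints (0.0001 and 0.02) follow DDPM, the cosine form with offset s = 0.008 follows the Improved DDPM paper, and the function names are illustrative rather than taken from any library.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas, as in the original DDPM setup."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Betas derived from a cosine-shaped cumulative signal curve."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    # Clip so the last few betas do not reach 1 exactly.
    return [min(1 - f(t + 1) / f(t), max_beta) for t in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar[t] = product of (1 - beta) up to and including step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

T = 1000
lin = cumulative_alpha_bar(linear_betas(T))
cos = cumulative_alpha_bar(cosine_betas(T))
print(f"linear: alpha_bar mid = {lin[T // 2]:.4f}, end = {lin[-1]:.2e}")
print(f"cosine: alpha_bar mid = {cos[T // 2]:.4f}, end = {cos[-1]:.2e}")
```

Both curves end close to zero signal, but the cosine one keeps noticeably more of the image at the midpoint, which is the usual argument for it.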
Jump Directly to Any Step
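Luckily, jumping from $x_0$ to $x_t$ has a closed form:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$$
Where $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
Meaning: give me the original $x_0$ and noise $\epsilon$, and I directly give you the step-$t$ image, with no step-by-step computation needed. This makes training especially fast.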
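A quick numerical check of the jump, as a hedged sketch: it assumes a linear schedule from 0.0001 to 0.02 and a toy constant "image", and compares the one-shot formula (signal scaled by the square root of the cumulative product of 1 - beta, plus matching noise) against adding noise one step at a time. The helper names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule (an assumption)
alpha_bar = np.cumprod(1.0 - betas)     # alpha_bar[t - 1] is the cumulative product at step t

def q_sample(x0, t, noise):
    """Jump straight from x0 to step t (t is 1-indexed) in closed form."""
    return np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * noise

def q_sample_stepwise(x0, t):
    """Reach step t the slow way, adding fresh noise at every step."""
    x = x0.copy()
    for s in range(t):
        x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)
    return x

x0 = np.full(20000, 2.0)                # toy "image": 20,000 pixels, all equal to 2
t = 300
jumped = q_sample(x0, t, rng.standard_normal(x0.shape))
walked = q_sample_stepwise(x0, t)

# The two routes use different noise, so compare statistics, not pixels.
print("closed form : mean %.3f, std %.3f" % (jumped.mean(), jumped.std()))
print("step by step: mean %.3f, std %.3f" % (walked.mean(), walked.std()))
```

The two rows should agree up to sampling error, which is what lets training pick a random timestep without simulating the whole chain.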
Station 2: Reverse Process (Denoising)
If we can learn a network to infer $x_{t-1}$ from $x_t$, we can go from pure noise $x_T$ back to a clean $x_0$.
Key Theorem
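If $\beta_t$ is small enough, each reverse step approximately follows a Gaussian:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$
Our task: train a neural network to learn the mean $\mu_\theta$ (the variance $\sigma_t^2$ can be fixed).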
Station 3: Simplified Training Objective
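After some math (derived in the DDPM 2020 paper), the training objective simplifies to a surprisingly simple form:
$$L_{\text{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$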
Translation:
Train a network that takes “noisy image + timestep t” as input, predicts “the noise that was added.” Loss is MSE between predicted and actual noise.
Just this one line. All diffusion models essentially do this.
Training Loop
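import torch
import torch.nn.functional as F

# Assumes model, optimizer, dataset, T, and alpha_bar (a length-T tensor) already exist.
for image in dataset:
    # 1. Sample a random timestep (0-indexed, so alpha_bar[t] is always in range)
    t = torch.randint(0, T, (1,))
    # 2. Add noise with the closed-form jump
    noise = torch.randn_like(image)
    noisy_image = alpha_bar[t].sqrt() * image + (1 - alpha_bar[t]).sqrt() * noise
    # 3. Let the model predict the noise
    predicted_noise = model(noisy_image, t)
    # 4. MSE loss between predicted and actual noise
    loss = F.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()
    optimizer.step()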
The whole loop fits in about a dozen lines. It is almost suspiciously simple.
Station 4: Sampling (Generation)
After training, how to generate images from noise?
DDPM Sampling
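Classic version, each step:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \quad z \sim \mathcal{N}(0, I)$$
- First term: remove part of the noise using the predicted noise
- Second term: add a bit of new randomness (ensures diversity)
Walk all $T$ steps (1000) and you generate a clean image from pure noise.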
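The sampling loop can be sketched as follows. This is a toy, not a real sampler: the "dataset" is a single constant value, so the exact noise can be computed by a hand-written `oracle_eps` that stands in for the trained network. The linear schedule and the choice of step variance equal to beta are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

TARGET = 3.0  # toy dataset: every "image" is the constant 3.0

def oracle_eps(x_t, t):
    """Stand-in for the trained network: the exact noise for a one-point dataset."""
    return (x_t - np.sqrt(alpha_bar[t]) * TARGET) / np.sqrt(1.0 - alpha_bar[t])

def ddpm_sample(eps_model, shape):
    x = rng.standard_normal(shape)                 # start from pure noise
    for t in range(T - 1, -1, -1):                 # 0-indexed steps T-1 ... 0
        eps = eps_model(x, t)
        # First term: remove part of the noise using the prediction.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        # Second term: fresh randomness, skipped on the final step.
        z = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * z           # step variance = beta_t, one common choice
    return x

sample = ddpm_sample(oracle_eps, shape=(4,))
print(sample)
```

With a perfect noise predictor the chain lands on the data value, here 3.0 up to floating-point error. A real model replaces `oracle_eps` with a neural network.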
DDIM Sampling (Faster)
DDPM needs 1000 steps, which is too slow. DDIM (a 2020 improvement) can do it in 20-50 steps by skipping timesteps:
DDPM: x_1000 → x_999 → x_998 → ... → x_0 (1000 steps)
DDIM: x_1000 → x_950 → x_900 → ... → x_0 (20-50 steps)
This sacrifices a tiny bit of quality for a 20-50× reduction in steps, which is why most applications use DDIM or a more advanced sampler such as DPM-Solver.
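A minimal sketch of the skipping idea, using the deterministic DDIM update (eta = 0): predict the clean image from the current noise estimate, then re-noise it directly to a much earlier timestep. As a stated assumption, `oracle_eps` is a hand-written stand-in for the trained network on a one-value toy dataset, and the schedule is linear.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
TARGET = 3.0  # toy dataset: every "image" is the constant 3.0

def oracle_eps(x_t, t):
    """Stand-in for the trained network: the exact noise for a one-point dataset."""
    return (x_t - np.sqrt(alpha_bar[t]) * TARGET) / np.sqrt(1.0 - alpha_bar[t])

def ddim_sample(eps_model, x_T, num_steps=20):
    """Deterministic DDIM (eta = 0) over an evenly spaced subset of timesteps."""
    steps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    prev_steps = list(steps[1:]) + [None]          # None marks the final step
    x = x_T
    for t, t_prev in zip(steps, prev_steps):
        eps = eps_model(x, t)
        # Estimate the clean image implied by the current noise prediction.
        x0_pred = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        if t_prev is None:
            x = x0_pred                            # last step: keep the clean estimate
        else:
            # Re-noise that estimate to the earlier timestep, reusing the same eps.
            x = np.sqrt(alpha_bar[t_prev]) * x0_pred + np.sqrt(1.0 - alpha_bar[t_prev]) * eps
    return x

x_T = np.random.default_rng(0).standard_normal(4)
print(ddim_sample(oracle_eps, x_T, num_steps=20))
```

Because no fresh noise is injected, the same starting noise always produces the same output, which is also why DDIM is popular for reproducible generation.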
Station 5: Make Generation “Listen”—Conditional Diffusion
Everything so far is “unconditional”—gives you random images.
In practice we want “generate image from text”—e.g., “a cat sitting on a window.”
Add Text Condition to Denoiser
Simplest: feed text embedding to the denoising network:
predicted_noise = model(noisy_image, t, text_embedding)  # text condition added as a third input
How exactly to “feed”—usually via cross-attention:
- Image patches as query
- Text tokens as key/value
- Attention lets image denoising “hear” the text prompt
This is why Stable Diffusion’s U-Net uses cross-attention.
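The query/key/value roles can be sketched with a single attention head in NumPy. All sizes and the random projection matrices below are made up for illustration; a real U-Net uses learned weights, multiple heads, and many such layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_feats, W_q, W_k, W_v):
    """Single-head cross-attention: image patches query the text tokens."""
    Q = image_feats @ W_q                      # (num_patches, d)
    K = text_feats @ W_k                       # (num_tokens, d)
    V = text_feats @ W_v                       # (num_tokens, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (num_patches, num_tokens)
    return weights @ V, weights

rng = np.random.default_rng(0)
num_patches, num_tokens, d_img, d_txt, d = 16, 5, 32, 24, 8   # made-up sizes
image_feats = rng.standard_normal((num_patches, d_img))
text_feats = rng.standard_normal((num_tokens, d_txt))
W_q = rng.standard_normal((d_img, d))
W_k = rng.standard_normal((d_txt, d))
W_v = rng.standard_normal((d_txt, d))

out, weights = cross_attention(image_feats, text_feats, W_q, W_k, W_v)
print(out.shape, weights.shape)   # (16, 8) (16, 5)
```

Each row of `weights` says how much one image patch listens to each text token, and every row sums to 1.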
Classifier-Free Guidance (CFG)
Simple but powerful trick: at training time, randomly drop the text condition (typically with 10-20% probability; Stable Diffusion used 10%).
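At generation, linearly combine “conditional” and “unconditional” predictions:
$$\tilde{\epsilon} = \epsilon_\theta(x_t, t, \varnothing) + s \cdot \left(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\right)$$
$s$ is the guidance scale (typically 7.5), $c$ is the text condition, and $\varnothing$ is the empty condition.
Intuition: make generation “more obedient” to the text prompt. Bigger $s$ = stricter prompt following, but too much distorts the image.
The CFG number you adjust in Stable Diffusion is $s$.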
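In code, the guidance step is a one-line extrapolation from the unconditional prediction toward the conditional one. The sketch below uses tiny hand-picked arrays in place of real network outputs, and the helper names are hypothetical.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale=7.5):
    """Push the prediction away from 'no prompt' and toward 'with prompt'."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def maybe_drop_condition(text_embedding, null_embedding, drop_prob, rng):
    """Training-time trick: sometimes swap the prompt for an empty one."""
    return null_embedding if rng.random() < drop_prob else text_embedding

eps_uncond = np.array([0.0, 1.0])   # pretend output without the prompt
eps_cond = np.array([1.0, 1.0])     # pretend output with the prompt
print(cfg_combine(eps_uncond, eps_cond, scale=0.0))   # unconditional prediction
print(cfg_combine(eps_uncond, eps_cond, scale=1.0))   # plain conditional prediction
print(cfg_combine(eps_uncond, eps_cond, scale=7.5))   # extrapolates past the conditional one
```

Note the cost: every sampling step now needs two forward passes, one with the prompt and one without.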
Station 6: Latent Diffusion (Stable Diffusion’s Optimization)
Original diffusion operates in pixel space—a 512×512 image is 786K-dim space. Computation is super expensive.
Stable Diffusion’s key trick: compress image to “latent space” first, do diffusion there.
Original image 512×512×3
↓ VAE Encoder
Latent representation 64×64×4
↓ Diffusion (happens here)
New latent 64×64×4
↓ VAE Decoder
New image 512×512×3
Benefit: diffusion happens in a 16K-dim space, which has 48× fewer values than pixel space (and 64× fewer spatial positions), so every denoising step is far cheaper.
This is why Stable Diffusion can run on consumer GPUs. OpenAI’s DALL·E 2 didn’t use latent diffusion—much more expensive.
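The size arithmetic is worth spelling out, since two different ratios are in play: the number of values the denoiser processes and the number of spatial positions it attends over.

```python
pixel_dims = 512 * 512 * 3      # values in the RGB image
latent_dims = 64 * 64 * 4       # values in the VAE latent

print(pixel_dims, latent_dims)            # 786432 16384
print(pixel_dims / latent_dims)           # 48.0 (ratio in number of values)
print((512 * 512) / (64 * 64))            # 64.0 (ratio in spatial positions)
```

The actual wall-clock speedup depends on the network and hardware, so these ratios are best read as a rough indication of the savings rather than a measured benchmark.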
Complete Stable Diffusion Architecture
"a cat" (text prompt)
   ↓
CLIP Text Encoder → text vector
   ↓
Random noise ──→ U-Net ←── text vector via cross-attention
                 ↑   │
                 └───┘ repeat 50 times
                   ↓
             latent vector
                   ↓
              VAE Decoder
                   ↓
              Clean image
5 components:
- CLIP Text Encoder
- U-Net (denoising network)
- VAE Encoder / Decoder
- Scheduler (DDIM and other sampling algorithms)
- Tokenizer (comes with CLIP)
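This is Stable Diffusion’s full architecture. Later models such as SDXL, SD3, and Flux build on this latent-diffusion paradigm, although SD3 and Flux swap the U-Net for a diffusion Transformer. Midjourney’s architecture is not public.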
Video Generation (Sora-class)
Extend diffusion to the time dimension: the input is no longer an (H, W, 3) image but a (T, H, W, 3) video volume (here T counts frames, not diffusion steps).
Sora’s approach:
- Split video into 3D patches (spanning space and time)
- Use Transformer (not U-Net) for diffusion
- Data: a very large video corpus (OpenAI has not disclosed the exact size)
Compute is tens to hundreds of times image diffusion—which is why video generation only matured in 2024-2026.
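Cutting a video into spacetime patches is just a reshape. The sketch below works on raw pixels with made-up patch sizes (4 frames by 16×16 pixels); Sora's real patch sizes and preprocessing are not public, so treat this only as an illustration of the idea.

```python
import numpy as np

def video_to_patches(video, pt, ph, pw):
    """Split a (T, H, W, C) video into flattened spacetime patches."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "sizes must divide evenly"
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)            # put the patch-grid axes first
    return x.reshape(-1, pt * ph * pw * C)          # (num_patches, patch_dim)

video = np.zeros((16, 64, 64, 3), dtype=np.float32)   # 16 frames of 64x64 RGB
tokens = video_to_patches(video, pt=4, ph=16, pw=16)
print(tokens.shape)   # (64, 3072)
```

Each row is one token for the Transformer. Longer or higher-resolution videos simply produce more tokens, which is where the compute cost comes from.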
Some Interesting “Emergent” Abilities
After large-scale training, diffusion models exhibit unexpected abilities:
1. Inpainting
Mask part of an image, have the model fill in—natively supported because it learns “denoising.”
2. Style Transfer
Through CLIP guidance, make generated image “look like Van Gogh.”
3. ControlNet (2023)
Given line drawings, depth maps, poses—make generation strictly follow these “control signals.”
4. Image-to-Image
Not from pure noise—start from user-provided image with some noise, denoise. Useful for “stylizing photos.”
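Image-to-image reuses the forward process: noise the user's image up to an intermediate timestep, then run the usual denoising loop from there. A minimal sketch of that starting point, assuming a linear schedule and a parameter called `strength` (a name borrowed from common tools, used here as an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def img2img_start(image, strength):
    """Noise a user image up to an intermediate step instead of starting from pure noise.

    strength in (0, 1]: 1.0 behaves like generation from pure noise,
    small values keep the result close to the input.
    """
    t_start = max(int(round(strength * T)) - 1, 0)   # 0-indexed starting timestep
    noise = rng.standard_normal(image.shape)
    x_t = np.sqrt(alpha_bar[t_start]) * image + np.sqrt(1.0 - alpha_bar[t_start]) * noise
    return x_t, t_start

photo = rng.uniform(-1.0, 1.0, size=(8, 8, 3))       # stand-in for a real photo
for strength in (0.3, 0.6, 1.0):
    x_t, t_start = img2img_start(photo, strength)
    kept = float(np.sqrt(alpha_bar[t_start]))         # how much of the photo survives
    print(f"strength={strength}: start at step {t_start}, signal scale {kept:.3f}")
```

Denoising then runs only from `t_start` down to 0, so lower strength means both fewer steps and a result closer to the original photo.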
Some Frontiers
Flow Matching / Rectified Flow
A 2022-2024 approach: instead of a 1000-step discrete process, learn a straight line from noise to image directly:
$$x_t = (1 - t)\, x_0 + t\, \epsilon, \quad t \in [0, 1]$$
Learn a velocity field $v_\theta(x_t, t) \approx \epsilon - x_0$ that “pushes” noise straight to images. Stable Diffusion 3 uses this, with 5-10× fewer steps.
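A toy sketch of the straight-line idea: points on the path are a linear blend of data (t = 0) and noise (t = 1), the regression target is the constant velocity along that line, and sampling is plain Euler integration from t = 1 back to t = 0. As a stated assumption, `oracle_velocity` is a hand-written stand-in for a trained network on a dataset containing a single value.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = 3.0   # toy dataset: every "image" is the constant 3.0

def interpolate(x0, noise, t):
    """Point on the straight line between data (t = 0) and noise (t = 1)."""
    return (1.0 - t) * x0 + t * noise

def velocity_target(x0, noise):
    """What the network is trained to regress: the line's constant velocity."""
    return noise - x0

def oracle_velocity(x_t, t):
    """Stand-in for a trained network on the one-point dataset."""
    return (x_t - TARGET) / t

def euler_sample(velocity_model, x, num_steps):
    """Integrate dx/dt = v from t = 1 (noise) down to t = 0 (image)."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_model(x, t)
    return x

noise = rng.standard_normal(4)
print(euler_sample(oracle_velocity, noise, num_steps=1))
print(euler_sample(oracle_velocity, noise, num_steps=10))
```

In this idealized case the path is perfectly straight, so even a single Euler step lands on the data. Real datasets give curved average paths, which is why practical samplers still use a handful of steps.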
Consistency Models
1-step image generation—distill multi-step diffusion to single-step. Slightly lower quality but extremely fast.
The diffusion field has been evolving incredibly fast 2020-2026—2-3 major breakthroughs per year.
One-line Summary
Diffusion = train a denoiser, then walk backward from noise.
Math is elegant, training is stable, generation quality is amazing: it’s the de facto standard for generative models in the 2020s.
GANs are dead, VAEs are retired, Diffusion rules.
Diffusion Denoising visualization — pick target image, watch 50 steps from noise to clean. Read this with the visualization—math and intuition build together.