Diffusion Model
A generative model that produces data by iteratively denoising pure noise, guided at each step by a learned neural network. The dominant architecture for high-fidelity image generation since 2022.
Diffusion models are the architecture behind almost every modern image generator — Stable Diffusion, Midjourney, DALL·E 3, FLUX. They replaced GANs as the default approach to high-fidelity image generation in the early 2020s and now dominate text-to-image, text-to-video, and increasingly text-to-audio as well.
How diffusion works
The core idea is a two-phase process:
- Forward diffusion (training time): take a real image and gradually add Gaussian noise over many steps until it becomes pure noise. The noise schedule is fixed in advance and requires no learning.
- Reverse diffusion (generation time): start from pure noise and, guided by a trained neural network (usually a U-Net, or a transformer in newer models), remove noise step by step until a coherent image emerges.
During training, the network learns to predict "how much noise is in this partially-noisy image?" for every noise level. At generation time, it uses that prediction to iteratively denoise.
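The closed-form forward process and the training target can be sketched in a few lines of numpy. This is a minimal illustration, not any library's actual API: `make_alpha_bars` and `forward_diffuse` are hypothetical names, and the linear beta schedule is one common choice among several.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; alpha_bars[t] is the cumulative signal fraction left at step t."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    a_bar = alpha_bars[t]
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return xt, eps  # eps is the regression target the network learns to predict

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
x0 = rng.standard_normal((8, 8))            # stand-in for a normalized image
xt, eps = forward_diffuse(x0, 500, alpha_bars, rng)
```

Because the forward process has this closed form, training never has to simulate the noise chain step by step: it samples a random `t`, noises the image in one jump, and regresses the network's output against `eps`.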
Why they replaced GANs
Three reasons:
- Training stability. GANs are notoriously unstable — the adversarial game between generator and discriminator can diverge or settle into degenerate equilibria. Diffusion training is a simple supervised regression over noise levels.
- Diversity. GANs tend to mode-collapse, producing outputs that lack variety. Diffusion naturally covers the full data distribution.
- Text conditioning. Diffusion pairs cleanly with text encoders like CLIP and T5, which made text-to-image practical.
The trade-off: diffusion is slower at inference time. A single GAN forward pass generates an image; diffusion requires 20–100 denoising steps. Modern acceleration techniques (latent consistency models, Turbo-style adversarial distillation, and other step-distilled variants) have closed this gap to single-digit step counts.
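The iterative cost is visible in the structure of the sampling loop itself. Below is a minimal numpy sketch of DDPM-style ancestral sampling; the `eps_model` here is a toy stand-in for a trained network, so the output is not a real image — the point is only that generation is a loop over `num_steps` network calls, which is exactly what acceleration methods shrink.

```python
import numpy as np

def sample(eps_model, shape, num_steps=50, rng=None):
    """DDPM-style ancestral sampling: start from noise, denoise step by step."""
    rng = rng or np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)          # start from pure Gaussian noise
    for t in reversed(range(num_steps)):    # one network call per step
        eps_pred = eps_model(x, t)          # network's estimate of the noise in x
        # Mean of the learned reverse transition p(x_{t-1} | x_t)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
        if t > 0:                           # inject fresh noise on all but the last step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Toy stand-in for a trained noise predictor (NOT a real model)
toy_model = lambda x, t: 0.5 * x
img = sample(toy_model, (8, 8))
```

With a real model, `eps_model` would be the expensive U-Net forward pass, so wall-clock generation time scales almost linearly with `num_steps`.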
Detection implications
Diffusion leaves a different fingerprint than GANs:
- Characteristic mid-frequency artifacts. The iterative denoising process embeds subtle periodic patterns in the frequency domain that detectors can learn.
- Color-channel statistics. Diffusion-generated images exhibit inter-channel correlations subtly different from natural photographs, where the camera's Bayer filter and demosaicing impose known correlations.
- Upsampling grid. Latent-space diffusion models (Stable Diffusion and its descendants) generate in a compressed latent space and decode to full resolution through a VAE, whose upsampling layers can leave a grid-like signature.
A detector trained only on GAN outputs will miss diffusion. A detector trained only on an early Stable Diffusion checkpoint may miss FLUX. Good detection requires cross-family training and frequent model updates. See how to detect AI-generated images for the practical workflow.