detect deepfakes · by Resemble AI
Glossary

Diffusion Model

Also: diffusion · denoising diffusion · diffusion models

A generative model that produces data by iteratively denoising pure noise, guided at each step by a learned neural network. The dominant architecture for high-fidelity image generation since 2022.

Diffusion models are the architecture behind almost every modern image generator — Stable Diffusion, Midjourney, DALL·E 3, FLUX. They replaced GANs as the default approach to high-fidelity image generation in the early 2020s and now dominate text-to-image, text-to-video, and increasingly text-to-audio as well.

How diffusion works

The core idea is a two-phase process:

  1. Forward diffusion (training time): take a real image, gradually add Gaussian noise over many steps until it becomes pure noise. This process is fixed in advance and requires no learning.
  2. Reverse diffusion (generation time): start from pure noise and, guided by a trained neural network (usually a U-Net), remove noise step by step until a coherent image emerges.
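The forward phase can be sketched with the closed-form DDPM noising formula, which jumps straight to any noise level t without simulating every intermediate step. This is a minimal illustration, not production code; the schedule values and variable names (`betas`, `alpha_bar`) are illustrative conventions from the DDPM paper.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Noise x0 directly to step t via the closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # a common linear noise schedule
x0 = rng.standard_normal((8, 8))       # stand-in for a real image
x_early = forward_diffuse(x0, 10, betas, rng)   # still mostly image
x_late = forward_diffuse(x0, 999, betas, rng)   # essentially pure noise
```

By the final step the cumulative product `alpha_bar` is nearly zero, so `x_late` is statistically indistinguishable from a standard Gaussian — which is exactly why generation can start from pure noise.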

During training, the network learns to predict "how much noise is in this partially-noisy image?" for every noise level. At generation time, it uses that prediction to iteratively denoise.

Why they replaced GANs

Three reasons:

  • Training stability. GANs are notoriously unstable — generators and discriminators can collapse onto trivial equilibria. Diffusion training is a simple supervised regression over noise levels.
  • Diversity. GANs tend to mode-collapse, producing outputs that lack variety. Diffusion naturally covers the full data distribution.
  • Text conditioning. Diffusion pairs cleanly with text encoders like CLIP and T5, which made text-to-image practical.

The trade-off: diffusion is slower at inference time. A single GAN forward pass generates an image; diffusion requires 20–100 denoising steps. Modern accelerations (latent consistency models, Turbo-style adversarial distillation, step distillation) have closed this gap to single-digit step counts.
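The cost structure is easy to see in a sampling loop: the network is called once per step, so runtime scales linearly with the step count. This is a minimal sketch of DDPM ancestral sampling with a dummy predictor in place of the trained network; the schedule is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 50                                  # number of denoising steps
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def sample(predict_eps, shape):
    """Reverse diffusion: one network call per step, so cutting T
    from 100 to 4 cuts inference cost by roughly 25x."""
    x = rng.standard_normal(shape)      # start from pure noise
    for t in range(T - 1, -1, -1):
        eps_hat = predict_eps(x, t)
        # Mean of the learned reverse transition p(x_{t-1} | x_t).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) \
            / np.sqrt(alphas[t])
        if t > 0:                       # add noise except at the last step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Stand-in predictor; a real sampler calls the trained U-Net here.
img = sample(lambda x, t: np.zeros_like(x), (8, 8))
```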

Detection implications

Diffusion leaves a different fingerprint than GANs:

  • Characteristic mid-frequency artifacts. The iterative denoising process embeds subtle periodic patterns that detectors can learn.
  • Color-channel statistics. Diffusion-generated images have color correlations subtly different from natural photography (where a camera's Bayer filter imposes known correlations).
  • Upsampling grid. Latent-space diffusion (Stable Diffusion and its descendants) generates at low resolution then upsamples, leaving a grid signature.
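One common way to surface periodic and grid-like artifacts is a radially averaged power spectrum, used as a feature for a downstream classifier. The sketch below is a generic illustration of that idea, not Resemble AI's detection pipeline; the function name and bin count are made up for the example.

```python
import numpy as np

def radial_spectrum(img, nbins=16):
    """Radially averaged log power spectrum: periodic generator
    artifacts show up as bumps at specific frequency bins."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.log1p(np.abs(f) ** 2)
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)          # distance from DC
    bins = np.minimum((r / r.max() * nbins).astype(int), nbins - 1)
    return np.array([power[bins == b].mean() for b in range(nbins)])

rng = np.random.default_rng(3)
img = rng.standard_normal((64, 64))   # stand-in for a grayscale image
feat = radial_spectrum(img)           # 16-dim feature vector
```

A classifier trained on such spectra from real and generated images can pick up the upsampling grid and mid-frequency periodicity described above, though learned detectors generally outperform handcrafted features.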

A detector trained only on GAN outputs will miss diffusion. A detector trained only on an early Stable Diffusion checkpoint may miss FLUX. Good detection requires cross-family training and frequent model updates. See how to detect AI-generated images for the practical workflow.

See also