Spectrogram
A two-dimensional visual representation of an audio signal, with time on the horizontal axis, frequency on the vertical axis, and intensity encoded by color or brightness. Used as the primary input for most audio deepfake detection models.
A spectrogram is the audio-analysis analogue of an image. Computed by taking short-time Fourier transforms (STFTs) over a sliding window, it converts a 1D waveform into a 2D plot where time runs left to right, frequency runs bottom to top, and brightness or color shows how much energy is present at each time-frequency point.
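As a concrete illustration, here is a minimal sketch of computing a log-magnitude spectrogram, assuming librosa and NumPy are available; the synthetic chirp signal and the FFT/hop parameters are illustrative choices, not fixed conventions:

```python
import numpy as np
import librosa

sr = 22050                                   # sample rate in Hz
t = np.linspace(0, 10, 10 * sr, endpoint=False)
y = np.sin(2 * np.pi * 220 * t * (1 + t / 20)).astype(np.float32)  # 10 s rising chirp

# Short-time Fourier transform: slide a window across the waveform and
# FFT each frame. Rows are frequency bins, columns are time frames.
D = librosa.stft(y, n_fft=1024, hop_length=256)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)  # log magnitude in dB

print(S_db.shape)  # (513, 862): 513 frequency bins x 862 time frames
```

Plotting `S_db` with time on the horizontal axis and frequency on the vertical axis gives exactly the picture described above.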
Why spectrograms matter for deepfake detection
Audio deepfake detectors overwhelmingly operate on spectrograms rather than raw waveforms. Three reasons:
- Compactness. A 10-second waveform at 22 kHz has 220,000 samples. A spectrogram of the same clip is roughly 500 time frames by 80 frequency bins, a far smaller input to a neural network.
- Relevant structure exposed. The fingerprints of synthesis (phase inconsistencies, unnatural harmonic structure, vocoder artifacts) show up as visible structure in spectrograms.
- Image-model reuse. You can apply mature image-detection architectures (CNNs, Vision Transformers) to spectrograms, borrowing decades of computer-vision research; see the sketch after this list.
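To make the image-model reuse point concrete, here is a minimal sketch, assuming PyTorch, of a small CNN that treats a mel-spectrogram as a one-channel image. The layer sizes and the 80×431 input shape are illustrative assumptions, not a reference architecture:

```python
import torch
import torch.nn as nn

class SpecCNN(nn.Module):
    """Tiny binary classifier over one-channel spectrogram 'images'."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),    # pool away the time/frequency dimensions
            nn.Flatten(),
            nn.Linear(32, n_classes),   # real vs. fake logits
        )

    def forward(self, x):               # x: (batch, 1, mel_bins, frames)
        return self.head(self.features(x))

model = SpecCNN()
batch = torch.randn(4, 1, 80, 431)      # four 10-second mel-spectrograms
print(model(batch).shape)               # torch.Size([4, 2])
```

The point is the interface, not the architecture: once audio is a 2D array, any image classifier slots in with almost no changes.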
The mel-spectrogram
A common variant is the mel-spectrogram, which warps the frequency axis onto the mel scale to approximate human hearing sensitivity (more resolution at low frequencies, less at high). Most modern audio detectors work on mel-spectrograms because many neural synthesis pipelines also produce one internally: an acoustic model predicts a mel-spectrogram, then a vocoder renders it to a waveform. Generation artifacts therefore tend to be most visible in that representation.
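A minimal sketch of computing a mel-spectrogram, again assuming librosa; the 80 mel bands and hop length of 512 are common but illustrative parameter choices, and the random-noise input is just a stand-in for real audio:

```python
import numpy as np
import librosa

sr = 22050
rng = np.random.default_rng(0)
y = rng.standard_normal(10 * sr).astype(np.float32)   # 10 s of noise as stand-in audio

# 80 mel bands and these FFT/hop settings mirror common TTS configurations
# (illustrative values, not a universal standard).
M = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                    hop_length=512, n_mels=80)
M_db = librosa.power_to_db(M, ref=np.max)             # log scale, as detectors usually use

print(M_db.shape)  # (80, 431): 220,500 samples reduced to ~34k values
```

The printed shape also illustrates the compactness point above: the clip shrinks by roughly a factor of six, with the exact frame count depending on the hop length.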
What detectors see that humans don't
Looking at a mel-spectrogram, a trained eye can spot some deepfake signals:
- Unnaturally smooth formant transitions between phonemes.
- Too-clean inter-harmonic gaps (no noise between voice harmonics).
- High-frequency roll-off patterns characteristic of specific vocoder families.
Automated detectors pick these cues up far more reliably than humans do, and they also learn patterns too subtle for visual inspection.
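As a toy illustration of the roll-off cue in the list above, the sketch below measures the fraction of spectral energy above a cutoff frequency. The function name, the 8 kHz cutoff, and the STFT parameters are all hypothetical choices for illustration; this is a crude heuristic, not a detector:

```python
import numpy as np
import librosa

def highband_energy_ratio(y, sr, cutoff_hz=8000):
    """Fraction of total spectral energy above cutoff_hz (toy heuristic).

    Band-limited output, as some vocoders produce, scores near zero;
    full-band recordings from the same source score noticeably higher.
    """
    S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256)) ** 2  # power spectrogram
    freqs = librosa.fft_frequencies(sr=sr, n_fft=1024)            # bin center frequencies
    return S[freqs >= cutoff_hz].sum() / S.sum()

sr = 22050
y = np.random.default_rng(0).standard_normal(sr).astype(np.float32)
print(highband_energy_ratio(y, sr))
```

A real detector would learn such cues from data rather than rely on a single hand-picked threshold, but the heuristic shows how a visual pattern in the spectrogram translates into a measurable quantity.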