Vocoder
A neural network that converts a spectrogram (time-frequency representation of audio) into an audible waveform. In TTS and voice-cloning systems, the vocoder is the final stage and the primary source of synthesis artifacts.
A vocoder is the component of an audio-generation pipeline that turns a spectrogram into a waveform. In modern TTS and voice cloning systems, the acoustic model produces a spectrogram — a visual representation of frequency content over time — and the vocoder synthesizes the actual audible audio from it.
Why vocoders matter for detection
Most of the fingerprints audio deepfake detectors look for come from the vocoder stage. Each vocoder family (WaveNet, HiFi-GAN, MelGAN, diffusion-based vocoders, neural source-filter) leaves a characteristic signature:
- Phase coherence patterns. Real vocal-tract physics produces tight phase relationships across harmonics. Some vocoders leak phase inconsistencies.
- High-frequency hash. Consistency of energy above 8 kHz differs between vocoder families.
- Inter-harmonic noise. Real voices have organic noise between harmonics; vocoders sometimes produce cleaner or differently-structured inter-harmonic content.
Detectors that train across vocoder families generalize better than ones trained on a single pipeline.
Adversarial vocoder selection
Sophisticated attackers sometimes choose vocoders specifically because they're underrepresented in detector training data. Frequent detector updates — including retraining on newly-released vocoder models — are the standard defense.