Zero-Shot
A model's ability to perform its task on inputs it has never explicitly seen during training. In the deepfake context, "zero-shot TTS" clones voices from a few seconds of unseen reference audio, and "zero-shot detection" identifies deepfakes produced by synthesis methods the detector wasn't trained on.
"Zero-shot" in machine learning means the model generalizes to things it didn't see in training. In deepfakes, the term shows up on both sides of the arms race:
Zero-shot voice cloning
Earlier TTS systems required training data from each target speaker — sometimes hours of audio — to produce a clone. Zero-shot TTS uses a short reference clip (10–30 seconds, sometimes less) at inference time: the model sees the reference, extracts a speaker embedding, and generates new speech in that voice without any per-speaker training.
This is what makes modern voice cloning a practical threat. An attacker can clone anyone whose voice appears publicly, without access to that person's model training pipeline.
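The inference-time flow described above can be sketched as follows. This is a toy illustration, not a real synthesizer: `extract_speaker_embedding` stands in for a learned neural speaker encoder (e.g. a d-vector network), and `synthesize` stands in for the conditioned acoustic decoder. The point is the shape of the pipeline: reference clip in, embedding out, generation conditioned on that embedding, with no per-speaker training step anywhere.

```python
import math

def extract_speaker_embedding(reference_audio, n_bands=4):
    """Toy stand-in for a neural speaker encoder: summarize the
    reference clip as per-band RMS energies. A real zero-shot TTS
    system uses a trained encoder, not hand-crafted statistics."""
    band = max(len(reference_audio) // n_bands, 1)
    embedding = []
    for i in range(n_bands):
        chunk = reference_audio[i * band:(i + 1) * band]
        energy = sum(x * x for x in chunk) / max(len(chunk), 1)
        embedding.append(math.sqrt(energy))
    return embedding

def synthesize(text, speaker_embedding):
    """Stub for the acoustic decoder: in a real model the embedding
    conditions generation; here we just record the conditioning."""
    return {"text": text, "voice": speaker_embedding}

# Inference-time cloning: a short reference clip, no retraining.
reference = [math.sin(0.05 * t) for t in range(16000)]  # ~1 s of fake audio
emb = extract_speaker_embedding(reference)
utterance = synthesize("Hello from a cloned voice.", emb)
```

Note that everything happens at inference time: swapping in a different reference clip changes the output voice without touching any model weights.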
Zero-shot deepfake detection
On the defensive side, zero-shot detection means that a detector trained on, say, ten synthesis methods generalizes to new methods it has never seen. This matters because:
- New TTS and image-generation models release every few weeks.
- Attackers deliberately seek out models the detector wasn't trained on.
- Retraining the detector every time a new model ships is impractical.
Good zero-shot detection requires:
- diverse training data spanning many generation architectures,
- loss functions that force the model to learn synthesis fingerprints rather than specific-method patterns, and
- frequent evaluation on held-out synthesis methods.
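The held-out evaluation can be sketched as a leave-one-method-out split. The method names below are hypothetical placeholders, and for brevity the real clips all stay in the training pool (in practice real audio is split across train and test too); the essential property is that fakes from the held-out method never reach training, so accuracy on them measures generalization to an unseen synthesis method.

```python
def leave_one_method_out(samples, held_out):
    """Zero-shot evaluation split: fakes from the held-out synthesis
    method are excluded from training and used only at test time."""
    train = [s for s in samples if s["method"] != held_out]
    test = [s for s in samples if s["method"] == held_out]
    return train, test

# Toy corpus: real clips plus fakes from three hypothetical methods.
METHODS = ["vits_like", "tortoise_like", "diffusion_tts"]
corpus = (
    [{"label": "real", "method": None} for _ in range(6)]
    + [{"label": "fake", "method": m} for m in METHODS for _ in range(4)]
)

# Hold out one method; its fakes appear only in the eval set.
train_set, eval_set = leave_one_method_out(corpus, "vits_like")
```

Rotating `held_out` over every method and averaging the resulting scores gives a single number for how well the detector handles synthesis methods it was never trained on.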
Our audio deepfake detector is a zero-shot model — trained across dozens of cloning architectures so it generalizes to methods it hasn't seen directly.