Zero-Shot
A model's ability to perform its task on inputs it has never explicitly seen during training. In the deepfake context, "zero-shot TTS" clones voices from a few seconds of unseen reference audio, and "zero-shot detection" identifies deepfakes produced by synthesis methods the detector wasn't trained on.
"Zero-shot" in machine learning means the model generalizes to things it didn't see in training. In deepfakes, the term shows up on both sides of the arms race:
Zero-shot voice cloning
Earlier TTS systems required training data from each target speaker — sometimes hours of audio — to produce a clone. Zero-shot TTS uses a short reference clip (10–30 seconds, sometimes less) at inference time: the model sees the reference, extracts a speaker embedding, and generates new speech in that voice without any per-speaker training.
This is what makes modern voice cloning a practical threat. An attacker can clone anyone whose voice appears publicly, without access to that person's model training pipeline.
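The inference-time flow described above can be sketched as follows. This is a toy illustration, not a real synthesizer: `extract_speaker_embedding` stands in for a learned neural speaker encoder (e.g. a d-vector network), and `synthesize` stands in for the conditioned acoustic decoder. The point is the shape of the pipeline: reference clip in, embedding out, generation conditioned on that embedding, with no per-speaker training step anywhere.

```python
import math

def extract_speaker_embedding(reference_audio, n_bands=4):
    """Toy stand-in for a neural speaker encoder: summarize the
    reference clip as per-band RMS energies. A real zero-shot TTS
    system uses a trained encoder, not hand-crafted statistics."""
    band = max(len(reference_audio) // n_bands, 1)
    embedding = []
    for i in range(n_bands):
        chunk = reference_audio[i * band:(i + 1) * band]
        energy = sum(x * x for x in chunk) / max(len(chunk), 1)
        embedding.append(math.sqrt(energy))
    return embedding

def synthesize(text, speaker_embedding):
    """Stub for the acoustic decoder: in a real model the embedding
    conditions generation; here we just record the conditioning."""
    return {"text": text, "voice": speaker_embedding}

# Inference-time cloning: a short reference clip, no retraining.
reference = [math.sin(0.05 * t) for t in range(16000)]  # ~1 s of fake audio
emb = extract_speaker_embedding(reference)
utterance = synthesize("Hello from a cloned voice.", emb)
```

Note that everything happens at inference time: swapping in a different reference clip changes the output voice without touching any model weights.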
Zero-shot deepfake detection
On the defensive side, zero-shot detection means that a detector trained on, say, ten synthesis methods generalizes to new methods it has never seen. This matters because:
- New TTS and image-generation models release every few weeks.
- Attackers deliberately seek out models the detector wasn't trained on.
- Retraining the detector every time a new model ships is impractical.
Good zero-shot detection requires:
- diverse training data spanning many generation architectures,
- loss functions that force the model to learn synthesis fingerprints rather than specific-method patterns, and
- frequent evaluation on held-out synthesis methods.
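The held-out evaluation can be sketched as a leave-one-method-out split. The method names below are hypothetical placeholders, and for brevity the real clips all stay in the training pool (in practice real audio is split across train and test too); the essential property is that fakes from the held-out method never reach training, so accuracy on them measures generalization to an unseen synthesis method.

```python
def leave_one_method_out(samples, held_out):
    """Zero-shot evaluation split: fakes from the held-out synthesis
    method are excluded from training and used only at test time."""
    train = [s for s in samples if s["method"] != held_out]
    test = [s for s in samples if s["method"] == held_out]
    return train, test

# Toy corpus: real clips plus fakes from three hypothetical methods.
METHODS = ["vits_like", "tortoise_like", "diffusion_tts"]
corpus = (
    [{"label": "real", "method": None} for _ in range(6)]
    + [{"label": "fake", "method": m} for m in METHODS for _ in range(4)]
)

# Hold out one method; its fakes appear only in the eval set.
train_set, eval_set = leave_one_method_out(corpus, "vits_like")
```

Rotating `held_out` over every method and averaging the resulting scores gives a single number for how well the detector handles synthesis methods it was never trained on.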
Our audio deepfake detector is a zero-shot model — trained across dozens of cloning architectures so it generalizes to methods it hasn't seen directly.