Voice Cloning
The use of a machine-learning model to synthesize new speech in a target person's voice from a short sample of their recorded audio — often as little as 10–30 seconds.
Voice cloning is the audio-modality equivalent of a face swap: a person's voice becomes a puppet that can say whatever a script provides. The quality jump since 2022 has been dramatic, and as of 2026 a convincing clone can be built from 10–30 seconds of reference audio with open-source models running on consumer GPUs.
How it works
Modern voice cloning uses zero-shot text-to-speech architectures. Unlike earlier systems that required hours of target-speaker training data, zero-shot TTS conditions on a short reference clip at inference time:
- An encoder extracts a speaker-identity embedding from the reference audio (roughly: "what makes this voice sound like this person").
- A decoder (often diffusion-based) generates new audio for arbitrary text, conditioned on that embedding.
The result is speech that carries the target's timbre, pitch, and (usually) prosodic style — speaking text the target never said.
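The encoder/decoder split above can be sketched in miniature. Everything here is a toy: `encode_speaker` stands in for a neural speaker encoder (real ones emit embeddings of hundreds of dimensions), and `synthesize` stands in for the diffusion decoder; it conditions a tone on the embedding rather than producing speech. The function names and the three hand-picked statistics are illustrative assumptions, not any real system's API.

```python
import math

def encode_speaker(reference):
    """Toy 'encoder': reduce a reference waveform (list of floats) to a
    fixed-size speaker embedding. Real zero-shot TTS encoders are neural
    networks; these three crude statistics only illustrate the interface."""
    n = len(reference)
    mean = sum(reference) / n
    # RMS energy is a rough proxy for loudness.
    energy = math.sqrt(sum((x - mean) ** 2 for x in reference) / n)
    # Zero-crossing rate is a rough proxy for pitch/brightness.
    zcr = sum(1 for a, b in zip(reference, reference[1:]) if a * b < 0) / (n - 1)
    return [mean, energy, zcr]

def synthesize(text, embedding, sr=8000):
    """Toy 'decoder': emit a waveform whose amplitude and pitch are
    conditioned on the speaker embedding, for arbitrary input text.
    A real decoder generates speech; this generates a sine tone."""
    energy, zcr = embedding[1], embedding[2]
    f0 = 80 + 400 * zcr                    # map brightness proxy to a pitch
    n = int(sr * 0.05 * max(len(text), 1))  # 50 ms of audio per character
    return [energy * math.sin(2 * math.pi * f0 * t / sr) for t in range(n)]

# The key property of zero-shot TTS: the reference clip is consumed at
# inference time, not during training.
reference = [math.sin(2 * math.pi * 220 * t / 8000) for t in range(8000)]
embedding = encode_speaker(reference)
audio = synthesize("hello", embedding)
```

The point of the sketch is the data flow: one short reference clip produces a compact identity vector, and that vector steers generation for any text.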
Resemble AI's generative voice products use similar architectures for legitimate purposes like accessibility, localization, and content production. The same capability, without consent, is voice cloning as a threat.
Attack surface
Voice cloning enables:
- CEO-fraud calls. A cloned voice of a senior executive instructs a finance employee to authorize a transfer. Industry-wide losses are growing at triple-digit rates — see deepfake statistics 2026.
- Family-member scams. "Mom, I'm in trouble, I need bail money" — the voice matches the grandchild because the clone was built from a 20-second social-media clip.
- Authentication bypass. Phone-based voice ID systems at banks and contact centers can be defeated. Many institutions have since added liveness challenges.
- Non-consensual media. Synthetic voicemails, fake endorsements, intimate-content voice-overs.
What these scenarios share is that the audio channel alone carries the trust decision, which is exactly the assumption cloning breaks.
Detection implications
Cloned voices leave fingerprints that deepfake-audio detectors look for:
- Phase-coherence anomalies across harmonics that differ subtly from the resonance patterns of a real human vocal tract.
- Unnatural breath placement — clones often omit breaths or place them in mechanically implausible spots.
- Lack of room tone. TTS outputs are clean; real phone calls carry HVAC, air movement, microphone self-noise.
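The room-tone cue in particular is simple enough to sketch. The heuristic below is illustrative only, not a calibrated detector: it flags audio whose quietest frame is implausibly silent. The `-80 dBFS` threshold is an assumption chosen for the example, and real detectors combine many such signals with learned models.

```python
import math

def noise_floor_db(samples, frame=256):
    """Estimate the noise floor: RMS energy (in dBFS) of the quietest
    fixed-size frame. Real phone-call recordings carry room tone and
    microphone self-noise, so their quietest frames are rarely silent;
    raw TTS output is often digitally silent between words."""
    frames = [samples[i:i + frame]
              for i in range(0, len(samples) - frame + 1, frame)]
    def rms(f):
        return math.sqrt(sum(x * x for x in f) / len(f))
    quietest = min(rms(f) for f in frames)
    # Clamp to avoid log(0) on digitally silent audio.
    return 20 * math.log10(max(quietest, 1e-10))

def looks_too_clean(samples, threshold_db=-80.0):
    """Flag audio whose noise floor is suspiciously low.
    The threshold is an illustrative assumption, not a tuned value."""
    return noise_floor_db(samples) < threshold_db
```

For example, a buffer of exact zeros is flagged, while the same buffer with even a faint constant hiss is not. Production detectors go much further, but the design idea is the same: measure a property that natural recording chains impose and synthesis pipelines tend to skip.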
The Resemble AI audio detector is a zero-shot detector — trained across dozens of cloning architectures so it generalizes to ones it hasn't seen. See how to detect audio deepfakes for the detection workflow.
See also
- Text-to-Speech (TTS) — in the full glossary
- How to detect audio deepfakes
- Banking deepfake detection