Voice Conversion
A machine-learning technique that transforms existing speech recordings so they sound like they were spoken by a different person, while preserving the linguistic content, timing, and prosody of the original.
Voice conversion (VC) is the "swap" counterpart to voice cloning. Where cloning generates new speech from text in a target's voice, voice conversion takes existing speech and rewrites its vocal identity.
The attack scenario: an attacker records their own voice saying the desired words, then converts it to a target's voice. Because the source carries natural prosody, breathing, and emotional inflection, the output often sounds more realistic than direct TTS-based cloning.
How it differs from cloning
- Cloning generates a new waveform from text plus a target-voice reference. No source speech is needed.
- Voice conversion takes source speech (anyone's) and morphs its identity to match a target reference. The linguistic content and prosody are inherited from the source.
Many modern VC systems are "any-to-any": they can convert any source speaker to any target speaker without being retrained for each new voice.
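The difference in inputs and outputs can be sketched as a toy data model. This is only an illustration of the data flow described above, not a real VC or TTS system: no audio is processed, and the names Utterance, clone, and convert are hypothetical, chosen for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for a speech recording; no audio is processed here."""
    words: str    # linguistic content
    prosody: str  # where the timing and inflection came from
    speaker: str  # whose voice the listener hears


def clone(text: str, target_ref: Utterance) -> Utterance:
    # Cloning: text plus a target reference in, new speech out.
    # There is no source speech, so prosody must be synthesized.
    return Utterance(words=text, prosody="synthesized", speaker=target_ref.speaker)


def convert(source: Utterance, target_ref: Utterance) -> Utterance:
    # Voice conversion: words and prosody are inherited from the source;
    # only the vocal identity is swapped for the target's.
    return replace(source, speaker=target_ref.speaker)


# Hypothetical example data for the attack scenario.
attacker = Utterance("please approve the transfer", "attacker's natural delivery", "attacker")
target_ref = Utterance("unrelated sample of the target", "target's natural delivery", "target")

cloned = clone("please approve the transfer", target_ref)
converted = convert(attacker, target_ref)

print(cloned)
print(converted)
```

Both results carry the target's identity, but only the converted one keeps the attacker's natural delivery. Note that convert accepts any source and any target reference with no per-speaker training step, which mirrors the "any-to-any" property in this simplified model.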
Detection implications
VC outputs inherit real human prosody, which makes them harder to flag via rhythm or breath-placement cues. The fingerprints that remain are mismatches in vocal tract resonance and the vocoder artifacts left by the identity-transfer step. Detectors trained on both cloning and VC architectures tend to generalize better than detectors trained on only one of them.