Text-to-Speech
A system that converts written text into spoken audio using a machine-learning model. Modern neural TTS produces near-human quality and is the underlying technology behind most deepfake audio attacks.
Text-to-speech (TTS) is the engine that turns a string of words into a waveform. Classical TTS used concatenative methods — stitching together pre-recorded speech units such as diphones — which sounded stilted. Modern TTS uses neural networks to generate audio directly, producing outputs that are, for most purposes, indistinguishable from real speech.
How modern TTS works
A typical neural TTS pipeline has three stages:
- Text analysis. Parse the text into phonemes, stress markers, and prosodic cues.
- Acoustic model. A neural network (often Transformer-based) converts the linguistic features into a spectrogram — a time-frequency representation of the target audio.
- Vocoder. A second neural network converts the spectrogram to an actual waveform.
Some recent models collapse these steps into a single end-to-end architecture (e.g., VALL-E, NaturalSpeech, Resemble AI's own models).
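The three stages above can be sketched end to end. Everything in this sketch is a placeholder: the phoneme table, the hash-based "acoustic model," and the sinusoid "vocoder" are illustrative stand-ins, not real components, but the data flowing between stages has the same shape as in a real pipeline (symbols, then a time-frequency matrix, then samples).

```python
import numpy as np

PHONEME_TABLE = {"h": "HH", "i": "IY"}  # hypothetical G2P lookup

def text_analysis(text: str) -> list[str]:
    """Stage 1: map characters to phoneme symbols (stub grapheme-to-phoneme)."""
    return [PHONEME_TABLE.get(c, "SIL") for c in text.lower()]

def acoustic_model(phonemes: list[str], n_mels: int = 16,
                   frames_per_phoneme: int = 20) -> np.ndarray:
    """Stage 2: produce an (n_mels, time) spectrogram-like matrix.
    A real model is a neural net; here each phoneme lights one energy band."""
    frames = []
    for p in phonemes:
        col = np.zeros(n_mels)
        col[hash(p) % n_mels] = 1.0
        frames.append(np.tile(col[:, None], (1, frames_per_phoneme)))
    return np.concatenate(frames, axis=1)

def vocoder(spec: np.ndarray, sr: int = 16000, hop: int = 128) -> np.ndarray:
    """Stage 3: spectrogram -> waveform by summing one sinusoid per
    active bin (a crude stand-in for a neural vocoder)."""
    n_mels, n_frames = spec.shape
    freqs = np.linspace(100, 4000, n_mels)        # assumed bin-to-Hz mapping
    t = np.arange(n_frames * hop) / sr
    wav = np.zeros(n_frames * hop)
    for m in range(n_mels):
        amp = np.repeat(spec[m], hop)             # frame rate -> sample rate
        wav += amp * np.sin(2 * np.pi * freqs[m] * t)
    return wav / max(1e-9, np.abs(wav).max())

phonemes = text_analysis("hi")       # 2 phoneme symbols
spec = acoustic_model(phonemes)      # shape (16, 40)
audio = vocoder(spec)                # shape (5120,)
```

The stage boundaries are also where real systems swap components: the same acoustic model can feed different vocoders, which is one reason vocoder artifacts are a useful detection signal.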
Relationship to voice cloning
Voice cloning is TTS with an added speaker-identity conditioning signal. Generic TTS produces a synthetic-sounding voice; voice-cloning TTS produces a voice that matches a specific target. Architecturally they're the same family — the distinction is the conditioning.
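The conditioning idea can be shown in a few lines. The dimensions, names, and the concatenate-per-frame scheme here are assumptions for illustration; real systems use a learned speaker encoder and may inject the embedding at several layers rather than once at the input.

```python
import numpy as np

def condition_on_speaker(linguistic: np.ndarray,
                         speaker_embedding: np.ndarray) -> np.ndarray:
    """Broadcast one fixed-size speaker embedding across every time frame
    and concatenate it to the per-frame linguistic features."""
    n_frames = linguistic.shape[0]
    tiled = np.tile(speaker_embedding, (n_frames, 1))    # (T, d_spk)
    return np.concatenate([linguistic, tiled], axis=1)   # (T, d_ling + d_spk)

rng = np.random.default_rng(0)
ling = rng.standard_normal((50, 64))   # 50 frames of linguistic features
generic = np.zeros(32)                 # null embedding -> generic voice
target = rng.standard_normal(32)       # embedding derived from target audio

generic_in = condition_on_speaker(ling, generic)  # (50, 96)
cloned_in = condition_on_speaker(ling, target)    # (50, 96)
```

The acoustic model downstream is identical in both cases; only the extra 32 dimensions differ, which is the architectural sense in which cloning and generic TTS are the same family.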
Detection implications
TTS outputs have characteristic fingerprints across the synthesis pipeline: phase coherence patterns from the vocoder, prosodic regularities from the acoustic model, and spectral signatures from the training data distribution. Audio deepfake detectors learn these across many TTS families.
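One cue of this kind can be illustrated with a hand-crafted "prosodic regularity" statistic. This is illustrative only, not a usable detector: real detectors are trained classifiers that learn many such cues implicitly, and the 16 kHz sample rate, frame size, and test signals below are all assumptions.

```python
import numpy as np

def frame_energy(wav: np.ndarray, frame: int = 256) -> np.ndarray:
    """Sum of squared samples per non-overlapping frame."""
    n = len(wav) // frame
    return np.array([np.sum(wav[i * frame:(i + 1) * frame] ** 2)
                     for i in range(n)])

def energy_regularity(wav: np.ndarray) -> float:
    """Coefficient of variation of frame energy. Unnaturally uniform
    energy (a low value) can hint at overly regular synthetic prosody."""
    e = frame_energy(wav)
    return float(np.std(e) / (np.mean(e) + 1e-9))

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
steady = np.sin(2 * np.pi * 250 * t)                        # perfectly even "prosody"
varied = steady * (1 + 0.5 * rng.standard_normal(16000))    # natural-like variation

# The steady signal scores far lower (more regular) than the varied one.
is_flagged = energy_regularity(steady) < energy_regularity(varied)
```

A learned detector generalizes this idea: instead of one fixed statistic, it extracts features across phase, prosody, and spectrum and is trained on outputs from many TTS families so it transfers to unseen ones.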