detect·deepfakes by Resemble AI
Glossary

Text-to-Speech

Also: TTS · speech synthesis · neural TTS

A system that converts written text into spoken audio using a machine-learning model. Modern neural TTS produces near-human quality and is the underlying technology behind most deepfake audio attacks.

Text-to-speech (TTS) is the engine that turns a string of words into a waveform. Classical TTS used concatenative methods — stitching pre-recorded phonemes together — which sounded stilted. Modern TTS uses neural networks to generate audio directly, producing outputs that are, for most purposes, indistinguishable from real speech.

How modern TTS works

A typical neural TTS pipeline has three stages:

  1. Text analysis. Parse the text into phonemes, stress markers, and prosodic cues.
  2. Acoustic model. A neural network (often Transformer-based) converts the linguistic features into a spectrogram — a time-frequency representation of the target audio.
  3. Vocoder. A second neural network converts the spectrogram to an actual waveform.
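The three stages above can be sketched as a toy pipeline. Everything here is illustrative: the phoneme lexicon, the random-output model stubs, and the shape conventions (80 mel bands, 256 samples per frame) are placeholders standing in for real neural components.

```python
import numpy as np

# Hypothetical two-word lexicon standing in for a full text-analysis front end.
PHONEMES = {"hi": ["HH", "AY"], "there": ["DH", "EH", "R"]}

def text_analysis(text):
    """Stage 1: map words to phoneme symbols (stress/prosody omitted)."""
    return [p for word in text.lower().split() for p in PHONEMES.get(word, [])]

def acoustic_model(phonemes, n_mels=80, frames_per_phoneme=5):
    """Stage 2: stand-in for a neural net emitting a mel spectrogram
    (time x frequency); here just random frames of the right shape."""
    rng = np.random.default_rng(0)
    return rng.random((len(phonemes) * frames_per_phoneme, n_mels))

def vocoder(spectrogram, hop_length=256):
    """Stage 3: stand-in for a neural vocoder; emits one hop of
    waveform samples per spectrogram frame."""
    rng = np.random.default_rng(1)
    return rng.uniform(-1.0, 1.0, spectrogram.shape[0] * hop_length)

phonemes = text_analysis("hi there")
spec = acoustic_model(phonemes)
audio = vocoder(spec)
print(len(phonemes), spec.shape, audio.shape)  # 5 (25, 80) (6400,)
```

The point of the sketch is the data flow: symbols in, a time-frequency grid in the middle, raw samples out. Real systems replace each stub with a trained network.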

Some recent models collapse these steps into a single end-to-end architecture (e.g., VALL-E, NaturalSpeech, Resemble AI's own models).

Relationship to voice cloning

Voice cloning is TTS with an added speaker-identity conditioning signal. Generic TTS produces a synthetic-sounding voice; voice-cloning TTS produces a voice that matches a specific target. Architecturally they're the same family — the distinction is the conditioning.
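That "same family, different conditioning" relationship can be shown in a minimal sketch, assuming a toy acoustic model where the speaker embedding is simply concatenated onto the per-phoneme input features (the dimensions and the concatenation scheme are illustrative, not any particular architecture):

```python
import numpy as np

def synthesize(phoneme_ids, speaker_embedding=None, n_mels=80):
    """Stand-in acoustic model: without a speaker embedding it produces
    a generic voice; with one, the same network is conditioned on a
    specific speaker identity."""
    rng = np.random.default_rng(0)
    features = rng.random((len(phoneme_ids), 16))  # linguistic features
    if speaker_embedding is not None:
        # Tile the identity vector across time and append it per frame.
        cond = np.tile(speaker_embedding, (len(phoneme_ids), 1))
        features = np.concatenate([features, cond], axis=1)
    # Pretend projection from features to mel-spectrogram frames.
    proj = rng.random((features.shape[1], n_mels))
    return features @ proj

generic = synthesize([3, 7, 9])                                # default voice
cloned = synthesize([3, 7, 9], speaker_embedding=np.ones(32))  # target voice
print(generic.shape, cloned.shape)  # (3, 80) (3, 80)
```

Both calls run the same code path and produce the same kind of output; only the conditioning input differs, which is exactly the architectural relationship described above.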

Detection implications

TTS outputs have characteristic fingerprints across the synthesis pipeline: phase coherence patterns from the vocoder, prosodic regularities from the acoustic model, and spectral signatures from the training data distribution. Audio deepfake detectors learn these across many TTS families.
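As a rough illustration of the kind of frame-level statistics a detector might feed on, the sketch below computes spectral flatness (a measure of how noise-like versus tonal a frame is) from a simple windowed-FFT spectrogram. The feature choice and all parameters here are hypothetical; production detectors learn their features from data rather than hand-picking them.

```python
import numpy as np

def stft_mag(signal, n_fft=512, hop=128):
    """Magnitude spectrogram via a simple Hann-windowed FFT."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

def mean_spectral_flatness(signal):
    """Geometric mean over arithmetic mean of the magnitude spectrum,
    averaged over frames: near 1 for broadband noise, near 0 for a
    pure tone. One example of a hand-crafted spectral statistic."""
    mag = stft_mag(signal) + 1e-10  # avoid log(0)
    flatness = np.exp(np.mean(np.log(mag), axis=1)) / np.mean(mag, axis=1)
    return float(flatness.mean())

rng = np.random.default_rng(0)
noise_like = rng.standard_normal(4000)                          # broadband
tone_like = np.sin(2 * np.pi * 440 * np.arange(4000) / 16000)   # highly tonal
print(mean_spectral_flatness(noise_like) > mean_spectral_flatness(tone_like))  # True
```

A real detector stacks many such cues (phase behavior, prosodic timing, spectral statistics) and learns decision boundaries over examples drawn from many TTS families, rather than thresholding any single statistic.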

See also