Glossary
The vocabulary of synthetic media.
Plain-language definitions, one per term. Written for journalists, trust & safety teams, and engineers starting to ship their own detection pipelines.
D
- Deepfake: AI-generated or AI-manipulated audio, image, or video content created with intent to deceive — typically by impersonating a real person, depicting an event that didn't happen, or presenting fabricated evidence.
- Deepfake-as-a-Service: A commercial service that produces deepfake content on demand — typically voice clones, face swaps, or manipulated videos — either as a legitimate creative tool (clearly labeled) or as a fraud-enabling offering on underground forums.
- Diffusion Model: A generative model that produces data by iteratively denoising pure noise, guided at each step by a learned neural network. The dominant architecture for high-fidelity image generation since 2022.
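The iterative-denoising loop at the heart of diffusion sampling can be caricatured in a few lines. This is a toy sketch, not a real model: `toy_denoiser` is a hypothetical stand-in for the learned network, and the "data" is a single number rather than an image.

```python
import random

def toy_denoiser(x, step, total_steps):
    # Hypothetical stand-in for the learned network: nudges the sample
    # a little closer to the data (here, the constant 1.0) at each step.
    target = 1.0
    return x + (target - x) / (total_steps - step + 1)

def sample(total_steps=50, seed=0):
    """Diffusion-style sampling: start from pure noise, then apply the
    denoiser repeatedly until a clean sample remains."""
    random.seed(seed)
    x = random.gauss(0.0, 1.0)      # pure Gaussian noise
    for step in range(total_steps):
        x = toy_denoiser(x, step, total_steps)
    return x
```

More denoising steps leave less residual noise, which mirrors the real trade-off between sampling speed and output quality.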
F
- Face Reenactment: A deepfake technique where a driver video (usually the attacker's own face) controls the expressions, head pose, and gaze of a target identity in a generated video — enabling live impersonation.
- Face Swap: A deepfake technique where a machine-learning model replaces one person's face in a video or image with another person's face, preserving the original head pose, lighting, and expression.
- Forensic Analysis: The structured investigation of a piece of media to determine its authenticity, capture-device origin, and edit history, combining algorithmic analysis (deepfake detection, noise-pattern analysis, metadata review) with expert human judgment.
G
- GAN (Generative Adversarial Network): A neural-network architecture introduced in 2014 in which a generator creates synthetic data (images, audio, etc.) while a discriminator tries to distinguish it from real data. The two train in opposition until the generator's output is indistinguishable from real data.
- Generative AI: A class of AI systems that produce new content — text, images, audio, video, or code — rather than classifying or analyzing existing data. Includes diffusion models, large language models, GANs, and TTS systems. Deepfake generation is a subset of generative AI.
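The adversarial dynamic behind GANs can be reduced to a toy two-player gradient game. This is a deliberately simplified sketch, not a real GAN: the "real data" is the constant 3.0, the "generator" is a single number g, and the "discriminator" is a quadratic score D(x) = -(x - d)², all assumptions for illustration.

```python
REAL_MEAN = 3.0   # toy assumption: the "real data" is just this constant

def train(steps=300, lr=0.1):
    """Toy adversarial game: generator g and discriminator parameter d
    take alternating gradient steps against each other, so g is pulled
    toward the real data without ever seeing it directly."""
    g, d = 0.0, 0.0
    for _ in range(steps):
        # Generator ascends D(g) = -(g - d)**2; gradient w.r.t. g is -2(g - d)
        g += lr * (-2.0) * (g - d)
        # Discriminator ascends D(real) - D(g); gradient w.r.t. d is 2(real - g)
        d += lr * 2.0 * (REAL_MEAN - g)
    return g, d
```

Even in this one-dimensional caricature the two players spiral toward equilibrium, where the generator's output matches the real data and the discriminator can no longer separate them.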
L
- Latent Space: The compressed, abstract representation inside a generative model where content is encoded before being decoded back into concrete output (pixels, audio samples). Operations in latent space — interpolation, conditioning, manipulation — underlie most generative AI capabilities.
- Lip-Sync Deepfake: A deepfake technique where a real video is paired with new audio, and only the mouth region is re-generated to match the new speech — leaving the rest of the face, body, and scene untouched.
- Liveness Detection: A class of biometric verification techniques that confirm the input comes from a live, present human — distinguishing a real face or voice from a photo, recording, mask, or deepfake presented to the sensor.
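Latent-space interpolation can be illustrated with a toy encoder/decoder pair. In a real model both are learned neural networks; here they are trivial rescalings, purely so the mechanics are visible.

```python
def encode(x):
    # Hypothetical stand-in for a learned encoder: map pixel values
    # (0-255) into a compact latent representation.
    return [v / 255.0 for v in x]

def decode(z):
    # Stand-in for the learned decoder: map latents back to pixels.
    return [v * 255.0 for v in z]

def interpolate(x_a, x_b, t):
    """Blend two inputs by walking the straight line between their
    latent codes (t in [0, 1]), then decoding back to pixel space."""
    z_a, z_b = encode(x_a), encode(x_b)
    z = [(1 - t) * a + t * b for a, b in zip(z_a, z_b)]
    return decode(z)
```

With a learned model, the midpoint of two face encodings decodes to a plausible in-between face rather than a pixel-wise blur — which is why latent arithmetic, not pixel arithmetic, powers face morphing and editing.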
P
- Perceptual Hash: A hash function that produces similar outputs for perceptually similar media (near-identical images, videos, or audio). Unlike cryptographic hashes, which change drastically on any edit, perceptual hashes are stable across compression, minor cropping, and format changes — enabling fast "have we seen this before" lookups at scale.
- Presentation Attack: An attempt to defeat a biometric authentication system by presenting a fabricated input — a printed photo, replayed voice recording, silicone mask, or deepfake — to the sensor in place of the genuine live user.
- Provenance: The verifiable record of how a piece of media was created, edited, and distributed — who captured or generated it, with what tools, when, and what modifications were applied. In deepfake defense, strong provenance is evidence that a file is authentic.
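A minimal average-hash sketch shows the perceptual-hash idea. Production systems (pHash, PhotoDNA) use larger grids and frequency-domain transforms; this toy version hashes a tiny grayscale grid directly.

```python
def average_hash(pixels):
    """pixels: 2-D list of grayscale values (0-255).
    Each bit records whether a pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming_distance(h_a, h_b):
    """Count differing bits; a small distance means perceptually similar."""
    return sum(a != b for a, b in zip(h_a, h_b))
```

Note that uniformly brightening an image leaves the hash unchanged, because every pixel and the mean shift together — exactly the robustness to benign edits that cryptographic hashes lack.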
S
- Smishing vs Vishing: Both are social-engineering fraud attacks that impersonate trusted parties to extract money, credentials, or sensitive actions. Smishing ("SMS phishing") operates over text messages; vishing ("voice phishing") operates over phone calls. Attackers often combine them — an SMS that asks the target to call a spoofed number is the most common pattern.
- Spectrogram: A two-dimensional visual representation of an audio signal, with time on the horizontal axis, frequency on the vertical axis, and intensity encoded by color or brightness. Used as the primary input for most audio deepfake detection models.
- Synthetic Media: Any media content — audio, images, video, or text — that was generated, modified, or substantially enhanced using AI models rather than captured directly from a camera, microphone, or human author.
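A spectrogram needs nothing more than framing, windowing, and a Fourier transform per frame. This stdlib-only sketch uses a naive O(n²) DFT for clarity; real pipelines use an FFT (e.g. `numpy.fft` or `librosa.stft`).

```python
import cmath
import math

def spectrogram(signal, frame_size=64, hop=32):
    """Naive magnitude spectrogram: slice the signal into overlapping
    frames, apply a Hann window, and take the DFT of each frame.
    Returns frames[time][frequency_bin]."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # Hann window tapers frame edges to reduce spectral leakage
        windowed = [s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
                    for n, s in enumerate(frame)]
        spectrum = []
        for k in range(frame_size // 2):    # keep non-negative frequencies
            c = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n, x in enumerate(windowed))
            spectrum.append(abs(c))
        frames.append(spectrum)
    return frames
```

Feeding a pure sine wave through this produces a single bright horizontal band at the tone's frequency bin — the kind of structure (or its absence) that audio deepfake detectors learn to read.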
V
- Vishing: A social-engineering attack ("voice phishing") conducted over a phone call or voice channel. Attackers impersonate trusted parties — banks, IT support, family members, executives — to pressure the target into transferring money, handing over credentials, or authorizing actions. In 2026, AI voice cloning makes vishing attacks far more convincing and far cheaper to run at scale.
- Vocoder: A neural network that converts a spectrogram (time-frequency representation of audio) into an audible waveform. In TTS and voice-cloning systems, the vocoder is the final stage and the primary source of synthesis artifacts.
- Voice Cloning: The use of a machine-learning model to synthesize new speech in a target person's voice from a short sample of their recorded audio — often as little as 10–30 seconds.
- Voice Conversion: A machine-learning technique that transforms existing speech recordings so they sound like they were spoken by a different person, while preserving the linguistic content, timing, and prosody of the original.
- Voice Phishing: Fraud conducted over a voice channel (phone call, VoIP, voicemail) in which an attacker impersonates a trusted party to extract money, credentials, or sensitive actions from the target. Functionally synonymous with vishing; "voice phishing" is preferred in regulatory and academic contexts, "vishing" in security-industry vernacular.