Lip-Sync Deepfake
A deepfake technique where a real video is paired with new audio, and only the mouth region is re-generated to match the new speech, leaving the rest of the face, body, and scene untouched.
Lip-sync deepfakes are the cheapest, fastest, and most common form of video manipulation in 2026. Unlike face swaps, which replace the entire face, a lip-sync deepfake touches only the mouth region, which makes it harder to detect visually and faster to produce.
The attack
The pipeline:
- Take a real video of a target.
- Generate new audio via voice cloning, or record a voice actor delivering the desired script.
- Use an audio-driven lip-sync model (Wav2Lip was the popular 2020 open-source option; better successors exist now) to regenerate the mouth region frame by frame, matching phonemes to mouth shapes (a minimal invocation sketch follows below).
The resulting video looks real to most viewers, since everything except the mouth is genuine, and it lets an attacker put words in the target's mouth without a full face-swap pipeline.
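As a concrete illustration of the third step, here is a minimal sketch that drives the public Wav2Lip inference script (github.com/Rudrabha/Wav2Lip) from Python. The flags follow that repository's documented CLI; the checkpoint and media paths are placeholders.

```python
# Minimal sketch: invoking the public Wav2Lip inference script.
# Flags follow the Wav2Lip repo's README; all paths are placeholders.
import subprocess

subprocess.run(
    [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip.pth",  # pretrained weights
        "--face", "target_video.mp4",                    # real source video
        "--audio", "cloned_speech.wav",                  # new (cloned or acted) audio
    ],
    check=True,  # raise if the script fails
)
```

Newer models replace this step but keep the same interface: real video in, new audio in, mouth-regenerated video out.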
Why detection requires dual-track analysis
A visual-only deepfake detector often misses lip-sync attacks because:
- The bulk of the frame is genuine.
- Face-identity signals match (it's really the target's face).
- Only the mouth region and its immediate boundary show regeneration artifacts.
This is the core reason our video detector runs audio and video analysis in parallel. A lip-sync attack typically passes the video-track check but fails the audio-track check, because the paired audio is cloned.
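A minimal sketch of that decision logic, assuming each track independently emits a fake-probability score in [0, 1]. The field names and the 0.5 threshold are illustrative assumptions for this sketch, not our production values.

```python
# Illustrative dual-track fusion: the score fields and threshold below
# are assumptions for this sketch, not the production pipeline.
from dataclasses import dataclass

@dataclass
class TrackScores:
    video_fake_prob: float  # visual-track detector output, in [0, 1]
    audio_fake_prob: float  # audio-track (voice-clone) detector output, in [0, 1]

def classify(scores: TrackScores, threshold: float = 0.5) -> str:
    video_fake = scores.video_fake_prob >= threshold
    audio_fake = scores.audio_fake_prob >= threshold
    if audio_fake and not video_fake:
        # The lip-sync signature: frames look genuine, audio looks cloned.
        return "suspected lip-sync deepfake"
    if audio_fake and video_fake:
        return "suspected full synthesis, or face swap with cloned audio"
    if video_fake:
        return "suspected visual manipulation (e.g., face swap)"
    return "no manipulation detected"
```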
Detection signals
For the visual track:
- Mouth-region high-frequency hash. The regenerated mouth carries the generation model's fingerprint.
- Boundary artifacts. A faint blend ring where the regenerated mouth meets the rest of the face.
- Phoneme-viseme mismatch. Subtle timing differences between audio phonemes and the corresponding visual mouth shapes, or visemes (a simplified sync-score sketch follows this list).
- Teeth and tongue implausibility. Generated mouths sometimes render anatomically incorrect teeth positions or tongue placement.
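To make the phoneme-viseme signal concrete, here is a simplified sketch of one classic heuristic: correlate the audio loudness envelope with per-frame mouth opening. The `mouth_openness` array is a hypothetical input (in practice it would come from a facial-landmark tracker), and the sample rate and frame rate defaults are assumptions.

```python
# Simplified sync heuristic: Pearson correlation between the audio RMS
# envelope and a per-frame mouth-openness series. `mouth_openness` is a
# hypothetical input here; production signals are richer than this.
import numpy as np
import librosa

def sync_score(audio_path: str, mouth_openness: np.ndarray, fps: float = 25.0) -> float:
    y, sr = librosa.load(audio_path, sr=16000)  # mono audio at 16 kHz
    hop = int(sr / fps)                         # one RMS value per video frame
    rms = librosa.feature.rms(y=y, frame_length=hop * 2, hop_length=hop)[0]
    n = min(len(rms), len(mouth_openness))      # align the two series
    a = rms[:n]
    v = np.asarray(mouth_openness[:n], dtype=float)
    a = (a - a.mean()) / (a.std() + 1e-8)       # z-score both signals
    v = (v - v.mean()) / (v.std() + 1e-8)
    return float(np.mean(a * v))                # correlation in [-1, 1]
```

Genuine speech video tends to score well above zero; a regenerated mouth driven by mistimed cloned audio tends to score lower. This is a useful secondary signal rather than a standalone detector.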