Lip-Sync Deepfake
A deepfake technique where a real video is paired with new audio, and only the mouth region is re-generated to match the new speech, leaving the rest of the face, body, and scene untouched.
Lip-sync deepfakes are the cheapest, fastest, and most common form of video manipulation in 2026. Unlike face swaps, which replace the entire face, a lip-sync deepfake touches only the mouth region, which makes it harder to detect visually and faster to produce.
The attack
The pipeline:
- Take a real video of a target.
- Generate new audio via voice cloning, or record a voice actor delivering the desired script.
- Use an audio-driven lip-sync model (Wav2Lip was the popular 2020 open-source option; better successors exist now) to regenerate the mouth region frame by frame, matching phonemes to mouth shapes (a minimal invocation sketch follows below).
The resulting video looks real to most viewers, since everything except the mouth is genuine, and it lets an attacker put words in the target's mouth without a full face-swap pipeline.
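As a concrete illustration of the third step, here is a minimal sketch that drives the public Wav2Lip inference script (github.com/Rudrabha/Wav2Lip) from Python. The flags follow that repository's documented CLI; the checkpoint and media paths are placeholders.

```python
# Minimal sketch: invoking the public Wav2Lip inference script.
# Flags follow the Wav2Lip repo's README; all paths are placeholders.
import subprocess

subprocess.run(
    [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip.pth",  # pretrained weights
        "--face", "target_video.mp4",                    # real source video
        "--audio", "cloned_speech.wav",                  # new (cloned or acted) audio
    ],
    check=True,  # raise if the script fails
)
```

Newer models replace this step but keep the same interface: real video in, new audio in, mouth-regenerated video out.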
Why detection requires dual-track analysis
A visual-only deepfake detector often misses lip-sync attacks because:
- The bulk of the frame is genuine.
- Face-identity signals match (it's really the target's face).
- Only the mouth region and its immediate boundary show regeneration artifacts.
This is the core reason our video detector runs audio and video analysis in parallel. A lip-sync attack typically passes the video-track check but fails the audio-track check, because the paired audio is cloned.
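A minimal sketch of that decision logic, assuming each track independently emits a fake-probability score in [0, 1]. The field names and the 0.5 threshold are illustrative assumptions for this sketch, not our production values.

```python
# Illustrative dual-track fusion: the score fields and threshold below
# are assumptions for this sketch, not the production pipeline.
from dataclasses import dataclass

@dataclass
class TrackScores:
    video_fake_prob: float  # visual-track detector output, in [0, 1]
    audio_fake_prob: float  # audio-track (voice-clone) detector output, in [0, 1]

def classify(scores: TrackScores, threshold: float = 0.5) -> str:
    video_fake = scores.video_fake_prob >= threshold
    audio_fake = scores.audio_fake_prob >= threshold
    if audio_fake and not video_fake:
        # The lip-sync signature: frames look genuine, audio looks cloned.
        return "suspected lip-sync deepfake"
    if audio_fake and video_fake:
        return "suspected full synthesis, or face swap with cloned audio"
    if video_fake:
        return "suspected visual manipulation (e.g., face swap)"
    return "no manipulation detected"
```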
Detection signals
For the visual track:
- Mouth-region high-frequency hash. The regenerated mouth carries the generation model's fingerprint.
- Boundary artifacts. A faint blend ring where the regenerated mouth meets the rest of the face.
- Phoneme-viseme mismatch. Subtle timing differences between audio phonemes and the corresponding visual mouth shapes, or visemes (a simplified sync-score sketch follows this list).
- Teeth and tongue implausibility. Generated mouths sometimes render anatomically incorrect teeth positions or tongue placement.
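To make the phoneme-viseme signal concrete, here is a simplified sketch of one classic heuristic: correlate the audio loudness envelope with per-frame mouth opening. The `mouth_openness` array is a hypothetical input (in practice it would come from a facial-landmark tracker), and the sample rate and frame rate defaults are assumptions.

```python
# Simplified sync heuristic: Pearson correlation between the audio RMS
# envelope and a per-frame mouth-openness series. `mouth_openness` is a
# hypothetical input here; production signals are richer than this.
import numpy as np
import librosa

def sync_score(audio_path: str, mouth_openness: np.ndarray, fps: float = 25.0) -> float:
    y, sr = librosa.load(audio_path, sr=16000)  # mono audio at 16 kHz
    hop = int(sr / fps)                         # one RMS value per video frame
    rms = librosa.feature.rms(y=y, frame_length=hop * 2, hop_length=hop)[0]
    n = min(len(rms), len(mouth_openness))      # align the two series
    a = rms[:n]
    v = np.asarray(mouth_openness[:n], dtype=float)
    a = (a - a.mean()) / (a.std() + 1e-8)       # z-score both signals
    v = (v - v.mean()) / (v.std() + 1e-8)
    return float(np.mean(a * v))                # correlation in [-1, 1]
```

Genuine speech video tends to score well above zero; a regenerated mouth driven by mistimed cloned audio tends to score lower. This is a useful secondary signal rather than a standalone detector.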