detect·deepfakes by Resemble AI
Video detection

AI Video & Deepfake Detector

Upload a video or paste a URL to check for face-swaps, lip-sync deepfakes, and fully-synthesized AI video. Get frame-level reasoning.

The four classes of video deepfake

Not all video fakes are the same. Each class leaves different fingerprints and requires a different detection approach:

  • Face swap — another identity transplanted onto real footage. Most common attack. Detects via per-frame identity drift and boundary-blending artifacts.
  • Lip-sync — real video with new audio and a regenerated mouth region. Used for misattributed-quote attacks. Requires dual-track analysis to catch.
  • Reenactment — real face, attacker-driven expressions in real time. The Arup-style video-call attack. Hardest to detect live; easier after recording.
  • Full synthesis — everything generated (Sora, Runway, Veo). Fewer samples in the wild yet, but quality has climbed fast.

How dual-track analysis works

For every video you upload, we run the visual track through our frame-level detector (same DETECT-3B model, video head) and the audio track through the zero-shot audio detector in parallel. Results come back as two independent verdicts plus a combined recommendation. The combination logic:

  • Both tracks ≥ 0.8 real → almost certainly authentic.
  • Either track ≥ 0.7 fake → treat the video as compromised.
  • Both in the middle (0.4–0.6) → inconclusive, request a second source.
  • Video real, audio fake → most likely a lip-sync attack.
  • Video fake, audio real → most likely a face swap.
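The decision rules above can be sketched as a small function. This is a minimal illustration, not the production logic: it assumes each track returns a real-probability in [0, 1] (so a "fake score" of 0.7 corresponds to a real-probability of 0.3), and the function name and fallthrough label are placeholders.

```python
def combine_verdicts(video_real: float, audio_real: float) -> str:
    """Combine per-track real-probabilities into one recommendation.

    Illustrative sketch only: thresholds follow the published rules;
    the tie-breaking order (divergence checks before the generic
    'compromised' verdict) is an assumption.
    """
    # Both tracks confidently real -> almost certainly authentic.
    if video_real >= 0.8 and audio_real >= 0.8:
        return "authentic"

    # Either track >= 0.7 fake (i.e. real-probability <= 0.3).
    if (1 - video_real) >= 0.7 or (1 - audio_real) >= 0.7:
        # Divergent tracks point at a specific attack class.
        if video_real >= 0.8 and (1 - audio_real) >= 0.7:
            return "lip-sync attack suspected"   # video real, audio fake
        if audio_real >= 0.8 and (1 - video_real) >= 0.7:
            return "face-swap suspected"         # video fake, audio real
        return "compromised"

    # Both tracks in the ambiguous middle band.
    if 0.4 <= video_real <= 0.6 and 0.4 <= audio_real <= 0.6:
        return "inconclusive"

    # Scores outside the published rules (e.g. one track at 0.65).
    return "needs manual review"
```

Checking the divergent cases before returning the generic "compromised" verdict is what lets the same rules both flag the video and name the likely attack class.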

What this detector catches

  • Face-swap videos from DeepFaceLab, FaceFusion, Roop and commercial successors
  • Lip-sync attacks built on Wav2Lip and successors
  • Full-synthesis video from Sora, Runway Gen-3, Veo, Kling, Pika, Luma Dream Machine
  • Avatar and digital-twin video from HeyGen and Synthesia
  • Combined attacks: face swap + voice clone used in CEO-fraud video calls

Limitations

  • Clips heavily compressed by social platforms lose frame-level artifacts.
  • Very short clips (<3s) or mostly-static scenes limit the temporal signals we rely on.
  • Novel generation models released after our last retrain will initially slip through until the detector catches up.

Common use cases

  • Deposition and legal: verifying video evidence in litigation.
  • Elections: newsroom verification of politician clips before publication.
  • Insurance: video claims evidence (injuries, accident scenes).
  • Ransom and extortion: verifying “proof of life” videos sent to victims.
  • Content platforms: automated screening for non-consensual deepfake content.

Frequently asked questions

What types of video deepfakes does this catch?

Face swaps (DeepFaceLab-family), lip-sync attacks (Wav2Lip and successors), reenactment (live-puppeted faces on video calls), and full-synthesis video from Sora, Runway, Veo, Kling, Pika, Luma.

Why do you report audio and visual verdicts separately?

Because they can diverge. A lip-sync attack often pairs real video with cloned audio — the video track looks fine alone. A face-swap attack often pairs real audio with a generated face — the audio passes alone. Separate verdicts catch both.

Does short clip length hurt accuracy?

Yes — clips under 3 seconds have significantly less temporal signal. We recommend 10–30 seconds for best confidence.

What about heavy social-media compression?

TikTok- and Reels-grade compression destroys much of the frame-level signal. Analyze the highest-quality source you can find — original uploads, not screen recordings of social embeds.

Can I run this at scale?

Yes — the API handles batch workloads with sub-second median latency per minute of video. See resemble.ai/detect for enterprise pricing.
