The State of Deepfake Detection 2026
The first reproducible, head-to-head benchmark of commercial deepfake detection systems across audio, image, and video. Pre-registered methodology, open sample set, quarterly updates.
Why this benchmark
No one has done it well. Existing roundups are editorial, not empirical. Academic benchmarks (FaceForensics++, ASVspoof) don't cover 2026-era commercial systems. The gap is a benchmark that is empirical and reproducible, covers commercial tools, and updates quarterly. We're the team with the credibility, the model infrastructure, and the willingness to publish losses alongside wins.
Detectors in scope
Fourteen detectors across the three modalities, listed below. Every system with a usable API or web upload that is in active development is on the list.
| Detector | Modalities | Access |
|---|---|---|
| DETECT-3B (Resemble / us) | Audio · Image · Video | API |
| Hive Moderation | Audio · Image · Video | API |
| Reality Defender | Audio · Image · Video | API |
| Sensity AI | Audio · Video | API |
| Deepware Scanner | Video | Web |
| Intel FakeCatcher | Video | Restricted API |
| Microsoft Video Authenticator | Video | Restricted |
| Pindrop | Audio | Enterprise |
| Sightengine | Image · Video | API |
| Illuminarty | Image | API |
| ElevenLabs AI Speech Classifier | Audio | Web |
| Undetectable.ai | Audio | Web |
| AIVoiceDetector.com | Audio | Web |
| Deepfake-o-meter (U. Buffalo) | Image · Video | Academic |
Sample set
5,400 labeled samples in total: 1,500 core samples per modality (750 real, 750 synthetic), plus a 300-sample "in-the-wild" adversarial tier per modality. That adversarial tier is what separates a credible benchmark from a toy one. A hypothetical per-sample manifest entry is sketched after this list.
- Audio: LibriSpeech, VoxCeleb2, podcast corpora, Resemble's own consented-speaker bank. Synthetic samples across ElevenLabs v2 + Flash, PlayHT, Resemble (yes, including ours; credibility requires it), OpenAI TTS, Azure Neural, Google Cloud TTS, Amazon Polly, Hume, Cartesia.
- Image: FFHQ, COCO, Unsplash, commissioned stock. Synthetic samples across Midjourney v7, DALL·E 3, Stable Diffusion XL, FLUX.1, Imagen 3, Firefly, Sora stills, Ideogram, Recraft, Krea.
- Video: YouTube-8M CC, Pond5, Artgrid, commissioned clips. Synthetic samples across Sora, Runway Gen-3, Veo, Kling, Pika, Luma Dream Machine, HeyGen, Synthesia, DeepFaceLab (face-swap), SadTalker (lip-sync).
- Adversarial tier: real scam calls (redacted), virally circulated deepfake videos, re-compressed samples, partial splices, heavily inpainted images. This is the tier academic benchmarks miss.
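Per-sample metadata matters because generator, compression level, license, and tier all feed the slices measured below. Here is a hypothetical shape for one manifest row, assuming a flat JSON manifest; every field name is illustrative, not the final released schema:

```python
# Hypothetical manifest row for one benchmark sample. Field names and
# values are illustrative assumptions, not the released schema.
sample = {
    "id": "aud-003217",
    "modality": "audio",                 # audio | image | video
    "label": "synthetic",                # real | synthetic
    "generator": "elevenlabs-v2",        # None for real samples
    "source": "consented-speaker-bank",  # provenance of the underlying media
    "tier": "adversarial",               # core | adversarial
    "transform": "re-compressed",        # e.g. re-compression, partial splice
    "license": "CC-BY-4.0",              # published per sample (see Ethics & legal)
    "sha256": "<hex digest>",            # pins the exact file bytes
}
```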
What we measure
Per detector, per modality, per slice (a computation sketch follows this list):
- Accuracy (headline)
- Precision / Recall / F1 at default threshold
- ROC AUC
- False Positive Rate @ 1% FNR — the enterprise-fraud metric
- False Negative Rate @ 1% FPR — the journalism metric
- Median + p95 latency
- Per-generator accuracy
- Per-compression-level accuracy
- Per-demographic accuracy (bias table)
- Explainability score (0–4 rubric)
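The two threshold metrics are opposite operating points on the same ROC curve, so it is worth pinning them down. A minimal computation sketch, assuming labels of 1 = synthetic and 0 = real, a continuous score per sample, and a binary call at the detector's default threshold; `slice_metrics` and its field names are our illustration, not the released harness API:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, roc_curve

def slice_metrics(y_true, scores, calls):
    """Per-slice metrics. y_true: 1 = synthetic, 0 = real; scores: continuous
    detector outputs; calls: binary decisions at the default threshold."""
    y_true, scores, calls = map(np.asarray, (y_true, scores, calls))
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, calls, average="binary", zero_division=0
    )
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr  # miss rate at each candidate threshold
    return {
        "accuracy": float(np.mean(calls == y_true)),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "roc_auc": float(roc_auc_score(y_true, scores)),
        # Enterprise-fraud metric: lowest FPR achievable while catching >= 99% of fakes.
        "fpr_at_1pct_fnr": float(fpr[fnr <= 0.01].min()),
        # Journalism metric: lowest FNR achievable while flagging <= 1% of real samples.
        "fnr_at_1pct_fpr": float(fnr[fpr <= 0.01].min()),
    }

def latency_stats(latencies_ms):
    return {"median_ms": float(np.median(latencies_ms)),
            "p95_ms": float(np.percentile(latencies_ms, 95))}
```

Per-generator, per-compression, and per-demographic tables fall out of calling the same function on the corresponding sample slices.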
Fairness commitments
- All detectors tested on identical samples in identical order; a harness sketch showing how this is enforced follows this list.
- DETECT-3B tested blind by a team member not on the detection model team; pre-registered methodology locked before we know our own numbers.
- We publish even when we lose a slice. If Hive wins image by 2%, we say so. Any whiff of fudged numbers destroys the authority play forever.
- Protocol pre-registered on arXiv and OSF before evaluations start.
- Sample set released on Hugging Face under a research license.
- Evaluation code released under MIT.
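The first commitment is mechanical to enforce: a frozen, sorted manifest drives every detector through the same loop, and the ordering itself can be hashed into the pre-registered protocol. A minimal harness sketch; `DetectorClient` and its `detect()` method are assumptions standing in for per-vendor adapters:

```python
import hashlib, json, time
from typing import Protocol

class DetectorClient(Protocol):
    """Per-vendor adapter interface (assumed, for illustration)."""
    name: str
    def detect(self, sample_path: str) -> float: ...  # synthetic score in [0, 1]

def run_eval(manifest_path: str, detectors: list[DetectorClient]) -> list[dict]:
    with open(manifest_path) as f:
        rows = sorted(json.load(f), key=lambda r: r["id"])  # frozen sample order
    # Hash the ordering so the pre-registration can pin it before any scores exist.
    order_hash = hashlib.sha256("".join(r["id"] for r in rows).encode()).hexdigest()
    results = []
    for det in detectors:
        for row in rows:  # identical samples, identical order, every detector
            t0 = time.perf_counter()
            score = det.detect(row["path"])
            results.append({
                "detector": det.name,
                "sample_id": row["id"],
                "score": score,
                "latency_ms": (time.perf_counter() - t0) * 1000.0,
                "order_hash": order_hash,
            })
    return results
```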
Ethics & legal
- No samples of real people without consent.
- No sexual content. No minors.
- No live scam audio containing victim PII — redacted or synthesized.
- Every real sample under CC-BY, CC-BY-SA, commissioned, or research-cleared licensing. Per-sample licenses published.
- Full legal review before publication.
Timeline
- Weeks 1–2: sample-set assembly, labeling, legal review.
- Weeks 3–4: evaluation harness build, competitor API onboarding.
- Weeks 5–6: run evaluations, adjudication.
- Week 7: paper drafting, chart design.
- Week 8: internal review, pre-registration on arXiv + OSF.
- Week 9: tier-1 press embargo warm-up (48h windows).
- Week 10: publish.
The schedule above totals ten weeks; realistic with buffer: 14 weeks. Target publication: Q3 2026.
Follow-ups
Reruns are scheduled for Q1 2027, Q2 2027, and Q4 2027. Each update folds in the new commercial generators that launched since the previous run (Sora 2, DETECT-4B, next-gen Runway, etc.) and publishes the delta against the prior results; a sketch of that computation follows. Regular updates are what keep journalists coming back.
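One way the published delta could be computed, assuming each release ships a flat per-detector results table; file and column names below are illustrative:

```python
import pandas as pd

# Illustrative file and column names, not the released artifact paths.
prev = pd.read_csv("results_2026q3.csv")  # columns: detector, modality, accuracy, ...
curr = pd.read_csv("results_2027q1.csv")
merged = curr.merge(prev, on=["detector", "modality"], suffixes=("_curr", "_prev"))
merged["accuracy_delta"] = merged["accuracy_curr"] - merged["accuracy_prev"]
print(merged[["detector", "modality", "accuracy_delta"]]
      .sort_values("accuracy_delta", ascending=False))
```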
If you're working on deepfake detection and want to participate — either by submitting a detector for inclusion or by reviewing the methodology — reach out. Peer review from outside Resemble strengthens the benchmark.