The State of Deepfake Detection 2026
The first reproducible, head-to-head benchmark of commercial deepfake detection systems across audio, image, and video. Pre-registered methodology, open sample set, quarterly updates.
Why this benchmark
No one has done it well. Existing roundups are editorial, not empirical. Academic benchmarks (FaceForensics++, ASVspoof) don't cover 2026-era commercial systems. The gap is a benchmark that is empirical and reproducible, covers commercial tools, and updates quarterly. We're the team with the credibility, the model infrastructure, and the willingness to publish losses alongside wins.
Detectors in scope
Fourteen detectors across the three modalities, listed below. Every system with a usable API or web upload that is in active development is on the list.
| Detector | Modalities | Access |
|---|---|---|
| DETECT-3B (Resemble / us) | Audio · Image · Video | API |
| Hive Moderation | Audio · Image · Video | API |
| Reality Defender | Audio · Image · Video | API |
| Sensity AI | Audio · Video | API |
| Deepware Scanner | Video | Web |
| Intel FakeCatcher | Video | Restricted API |
| Microsoft Video Authenticator | Video | Restricted |
| Pindrop | Audio | Enterprise |
| Sightengine | Image · Video | API |
| Illuminarty | Image | API |
| ElevenLabs AI Speech Classifier | Audio | Web |
| Undetectable.ai | Audio | Web |
| AIVoiceDetector.com | Audio | Web |
| Deepfake-o-meter (U. Buffalo) | Image · Video | Academic |
Sample set
5,400 labeled samples in total: 1,500 core samples per modality (750 real, 750 synthetic), plus a 300-sample "in-the-wild" adversarial tier per modality. That adversarial tier is what separates a credible benchmark from a toy one. A hypothetical per-sample manifest entry is sketched after this list.
- Audio: LibriSpeech, VoxCeleb2, podcast corpora, Resemble's own consented-speaker bank. Synthetic samples across ElevenLabs v2 + Flash, PlayHT, Resemble (yes, including ours; credibility requires it), OpenAI TTS, Azure Neural, Google Cloud TTS, Amazon Polly, Hume, Cartesia.
- Image: FFHQ, COCO, Unsplash, commissioned stock. Synthetic samples across Midjourney v7, DALL·E 3, Stable Diffusion XL, FLUX.1, Imagen 3, Firefly, Sora stills, Ideogram, Recraft, Krea.
- Video: YouTube-8M CC, Pond5, Artgrid, commissioned clips. Synthetic samples across Sora, Runway Gen-3, Veo, Kling, Pika, Luma Dream Machine, HeyGen, Synthesia, DeepFaceLab (face-swap), SadTalker (lip-sync).
- Adversarial tier: real scam calls (redacted), virally circulated deepfake videos, re-compressed samples, partial splices, heavily inpainted images. This is the tier academic benchmarks miss.
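Per-sample metadata matters because generator, compression level, license, and tier all feed the slices measured below. Here is a hypothetical shape for one manifest row, assuming a flat JSON manifest; every field name is illustrative, not the final released schema:

```python
# Hypothetical manifest row for one benchmark sample. Field names and
# values are illustrative assumptions, not the released schema.
sample = {
    "id": "aud-003217",
    "modality": "audio",                 # audio | image | video
    "label": "synthetic",                # real | synthetic
    "generator": "elevenlabs-v2",        # None for real samples
    "source": "consented-speaker-bank",  # provenance of the underlying media
    "tier": "adversarial",               # core | adversarial
    "transform": "re-compressed",        # e.g. re-compression, partial splice
    "license": "CC-BY-4.0",              # published per sample (see Ethics & legal)
    "sha256": "<hex digest>",            # pins the exact file bytes
}
```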
What we measure
Per detector, per modality, per slice (a computation sketch follows this list):
- Accuracy (headline)
- Precision / Recall / F1 at default threshold
- ROC AUC
- False Positive Rate @ 1% FNR — the enterprise-fraud metric
- False Negative Rate @ 1% FPR — the journalism metric
- Median + p95 latency
- Per-generator accuracy
- Per-compression-level accuracy
- Per-demographic accuracy (bias table)
- Explainability score (0–4 rubric)
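The two threshold metrics are opposite operating points on the same ROC curve, so it is worth pinning them down. A minimal computation sketch, assuming labels of 1 = synthetic and 0 = real, a continuous score per sample, and a binary call at the detector's default threshold; `slice_metrics` and its field names are our illustration, not the released harness API:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, roc_curve

def slice_metrics(y_true, scores, calls):
    """Per-slice metrics. y_true: 1 = synthetic, 0 = real; scores: continuous
    detector outputs; calls: binary decisions at the default threshold."""
    y_true, scores, calls = map(np.asarray, (y_true, scores, calls))
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, calls, average="binary", zero_division=0
    )
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr  # miss rate at each candidate threshold
    return {
        "accuracy": float(np.mean(calls == y_true)),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "roc_auc": float(roc_auc_score(y_true, scores)),
        # Enterprise-fraud metric: lowest FPR achievable while catching >= 99% of fakes.
        "fpr_at_1pct_fnr": float(fpr[fnr <= 0.01].min()),
        # Journalism metric: lowest FNR achievable while flagging <= 1% of real samples.
        "fnr_at_1pct_fpr": float(fnr[fpr <= 0.01].min()),
    }

def latency_stats(latencies_ms):
    return {"median_ms": float(np.median(latencies_ms)),
            "p95_ms": float(np.percentile(latencies_ms, 95))}
```

Per-generator, per-compression, and per-demographic tables fall out of calling the same function on the corresponding sample slices.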
Fairness commitments
- All detectors tested on identical samples in identical order; a harness sketch showing how this is enforced follows this list.
- DETECT-3B tested blind by a team member not on the detection model team; pre-registered methodology locked before we know our own numbers.
- We publish even when we lose a slice. If Hive wins image by 2%, we say so. Any whiff of fudged numbers destroys the authority play forever.
- Protocol pre-registered on arXiv and OSF before evaluations start.
- Sample set released on Hugging Face under a research license.
- Evaluation code released under MIT.
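The first commitment is mechanical to enforce: a frozen, sorted manifest drives every detector through the same loop, and the ordering itself can be hashed into the pre-registered protocol. A minimal harness sketch; `DetectorClient` and its `detect()` method are assumptions standing in for per-vendor adapters:

```python
import hashlib, json, time
from typing import Protocol

class DetectorClient(Protocol):
    """Per-vendor adapter interface (assumed, for illustration)."""
    name: str
    def detect(self, sample_path: str) -> float: ...  # synthetic score in [0, 1]

def run_eval(manifest_path: str, detectors: list[DetectorClient]) -> list[dict]:
    with open(manifest_path) as f:
        rows = sorted(json.load(f), key=lambda r: r["id"])  # frozen sample order
    # Hash the ordering so the pre-registration can pin it before any scores exist.
    order_hash = hashlib.sha256("".join(r["id"] for r in rows).encode()).hexdigest()
    results = []
    for det in detectors:
        for row in rows:  # identical samples, identical order, every detector
            t0 = time.perf_counter()
            score = det.detect(row["path"])
            results.append({
                "detector": det.name,
                "sample_id": row["id"],
                "score": score,
                "latency_ms": (time.perf_counter() - t0) * 1000.0,
                "order_hash": order_hash,
            })
    return results
```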
Ethics & legal
- No samples of real people without consent.
- No sexual content. No minors.
- No live scam audio containing victim PII — redacted or synthesized.
- Every real sample under CC-BY, CC-BY-SA, commissioned, or research-cleared licensing. Per-sample licenses published.
- Full legal review before publication.
Timeline
- Weeks 1–2: sample-set assembly, labeling, legal review.
- Weeks 3–4: evaluation harness build, competitor API onboarding.
- Weeks 5–6: run evaluations, adjudication.
- Week 7: paper drafting, chart design.
- Week 8: internal review, pre-registration on arXiv + OSF.
- Week 9: tier-1 press embargo warm-up (48h windows).
- Week 10: publish.
The schedule above totals ten weeks; realistic with buffer: 14 weeks. Target publication: Q3 2026.
Follow-ups
Reruns are scheduled for Q1 2027, Q2 2027, and Q4 2027. Each update folds in the new commercial generators that launched since the previous run (Sora 2, DETECT-4B, next-gen Runway, etc.) and publishes the delta against the prior results; a sketch of that computation follows. Regular updates are what keep journalists coming back.
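One way the published delta could be computed, assuming each release ships a flat per-detector results table; file and column names below are illustrative:

```python
import pandas as pd

# Illustrative file and column names, not the released artifact paths.
prev = pd.read_csv("results_2026q3.csv")  # columns: detector, modality, accuracy, ...
curr = pd.read_csv("results_2027q1.csv")
merged = curr.merge(prev, on=["detector", "modality"], suffixes=("_curr", "_prev"))
merged["accuracy_delta"] = merged["accuracy_curr"] - merged["accuracy_prev"]
print(merged[["detector", "modality", "accuracy_delta"]]
      .sort_values("accuracy_delta", ascending=False))
```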
If you're working on deepfake detection and want to participate — either by submitting a detector for inclusion or by reviewing the methodology — reach out. Peer review from outside Resemble strengthens the benchmark.