How we'll run the benchmark.
The full methodology is pre-registered before any detector runs. This is the condensed version — the complete protocol will be published on arXiv alongside the Q4 2026 release.
1. Scope
Modalities: audio, image, video. Text is excluded from v1 (different problem; potential v2 addition).
Detectors: every commercial tool in active development as of Q3 2026 that offers a usable API or web upload. The enrolled list on the leaderboard page is canonical.
Non-goals: we don't evaluate open-source research models (too many, too unstable), watermark-only provenance tools (that's provenance, not detection), or celebrity recognition (a different task).
2. Sample set
1,500 labeled samples per modality (4,500 total), split 50% real / 50% synthetic, plus a 300-sample “in-the-wild” tier per modality covering real scam calls, viral deepfake videos, re-compressed audio, and partial splices.
Real samples sourced from licensed corpora (LibriSpeech, VoxCeleb2, FFHQ, YouTube-8M CC subset) plus commissioned content.
Synthetic samples generated with eight to ten major generators per modality. Audio: ElevenLabs, PlayHT, Resemble, OpenAI TTS, Azure, Google, Amazon, Hume, Cartesia. Image: Midjourney, DALL·E 3, Stable Diffusion, Flux, Imagen, Firefly, Sora stills, Ideogram. Video: Sora, Runway, Veo, Kling, Pika, Luma, HeyGen, Synthesia, DeepFaceLab, SadTalker.
Every sample carries labels for modality, real/synthetic, generator + version, compression level, and demographic tags — enabling per-slice reporting including a per-demographic bias table.
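For concreteness, here is a minimal sketch of what a per-sample record could look like. The field names and value vocabularies are illustrative assumptions, not the final schema, which ships with the dataset release:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:
    """Illustrative per-sample record; all field names are placeholders."""
    sample_id: str
    modality: str                      # "audio" | "image" | "video"
    label: str                         # "real" | "synthetic"
    generator: Optional[str]           # e.g. "elevenlabs"; None for real samples
    generator_version: Optional[str]   # version string, if the vendor exposes one
    compression: str                   # e.g. "none", "mp3_64k", "h264_crf28"
    demographics: dict                 # tags feeding the per-demographic bias table
    license: str                       # "CC-BY", "CC-BY-SA", "commissioned", "cleared-for-research"
    in_the_wild: bool                  # True for the 300-sample tier
```

Keeping the generator and compression fields on every sample is what makes the per-slice tables in Section 4 a simple group-by rather than a separate pipeline.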
3. Evaluation protocol
Every detector receives identical inputs in identical order. API detectors are called with default parameters; we don't tune thresholds in any vendor's favor, including our own. Web-upload tools are automated via headless browser where the ToS permits; otherwise we evaluate manually on a representative sub-sample and note the asymmetry in the paper.
Raw responses and latencies are stored. Code and data are released alongside publication — every claim is reproducible.
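A minimal sketch of the API path, assuming a generic REST detector; the endpoint, auth scheme, and response shape are placeholders that vary per vendor:

```python
import time
import requests  # third-party: pip install requests

def run_detector(endpoint: str, api_key: str, samples: list[dict]) -> list[dict]:
    """Call one detector over the fixed sample order with vendor defaults,
    recording the raw response and wall-clock latency for every sample."""
    records = []
    for s in samples:  # identical inputs in identical order for every detector
        with open(s["path"], "rb") as f:
            t0 = time.perf_counter()
            resp = requests.post(
                endpoint,
                headers={"Authorization": f"Bearer {api_key}"},
                files={"file": f},  # no parameter tuning: defaults only
                timeout=120,
            )
            latency = time.perf_counter() - t0
        records.append({
            "sample_id": s["sample_id"],
            "status": resp.status_code,
            "raw_response": resp.text,  # stored verbatim for reproducibility
            "latency_s": latency,
        })
    return records
```

Storing the verbatim response body, not just a parsed verdict, is what lets third parties re-derive every metric from the released data.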
4. Metrics
- Accuracy (headline)
- Precision / Recall / F1 at the detector's default threshold
- ROC-AUC where a confidence score is available
- FPR @ 1% FNR, the metric fraud teams actually care about
- FNR @ 1% FPR, the metric journalists care about (both computed as in the sketch after this list)
- Median + p95 latency
- Per-generator accuracy
- Per-compression-level accuracy
- Per-demographic accuracy (bias table)
- Explainability rubric: 0 (no explanation) → 4 (identifies generator + cites regions)
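Where a detector exposes a confidence score, the two threshold-sweep metrics fall out of a standard ROC sweep. A minimal sketch using scikit-learn, treating synthetic as the positive class:

```python
import numpy as np
from sklearn.metrics import roc_curve  # pip install scikit-learn

def fpr_at_fnr(y_true, scores, max_fnr=0.01):
    """Lowest false-positive rate achievable while keeping FNR <= max_fnr.
    `scores` are detector confidences that the sample is synthetic."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1 - tpr
    ok = fnr <= max_fnr            # operating points that catch >= 99% of fakes
    return fpr[ok].min() if ok.any() else 1.0

def fnr_at_fpr(y_true, scores, max_fpr=0.01):
    """Lowest false-negative rate achievable while keeping FPR <= max_fpr."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1 - tpr
    ok = fpr <= max_fpr            # operating points that rarely flag real media
    return fnr[ok].min() if ok.any() else 1.0
```

Each function answers the same question from opposite sides: hold the error the audience cannot tolerate to 1%, then report the error they must live with.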
5. Fairness rules
- DETECT-3B is evaluated by a team member not on the detection model team. Protocol is locked before we know our own numbers.
- We publish results even when we lose. Honesty is the entire product.
- Methodology is pre-registered on arXiv / OSF before any evaluation runs.
6. Ethics
No real-person samples without consent. No sexual content. No minors. Scam-call samples redacted or synthesized to remove victim PII. Every real sample carries an explicit license (CC-BY, CC-BY-SA, commissioned, or cleared-for-research). License is published per-sample in the release.
7. Deliverables
- Paper on /research/state-of-deepfake-detection-2026 (PDF + interactive HTML + arXiv)
- Interactive leaderboard on /benchmark (sortable, filterable, linkable)
- Dataset release on Hugging Face under a research license (loading sketch after this list)
- Press kit on /press/benchmark-2026 (charts, summary, embargo terms)
- Quarterly re-runs after the initial Q4 2026 release (Q1 2027, Q2 2027, etc.) with version-locked deltas
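For the dataset release, per-slice analysis should be a one-liner. A sketch assuming the Hugging Face `datasets` library; the repo id, split layout, and column names below are hypothetical and will be fixed in the release notes:

```python
from datasets import load_dataset  # pip install datasets

# Hypothetical repo id and split name; the real ones ship with the release.
ds = load_dataset("resemble-ai/deepfake-benchmark-2026", split="audio")

# Example per-slice view: one generator at one compression level.
slice_ = ds.filter(lambda s: s["generator"] == "elevenlabs"
                   and s["compression"] == "mp3_64k")
print(len(slice_), "samples in this slice")
```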
Want to collaborate — as a detector vendor, researcher, or independent reviewer? Reach out to the Resemble AI research team. Independent reviewers on the byline strengthen the benchmark's authority.