detect·deepfakes by Resemble AI
Benchmark methodology · v1

How we'll run the benchmark.

The full methodology is pre-registered before any detector runs. This is the condensed version — the complete protocol will be published on arXiv alongside the Q4 2026 release.

1. Scope

Modalities: audio, image, video. Text is excluded from v1 (different problem; potential v2 addition).

Detectors: every commercial tool in active development as of Q3 2026 that offers a usable API or web upload. The enrolled list on the leaderboard page is canonical.

Non-goals: we don't evaluate open-source research models (too many, too unstable), watermark-only provenance tools (that's not detection), or celebrity recognition (different task).

2. Sample set

1,500 labeled samples per modality (4,500 total), split 50% real / 50% synthetic, plus a 300-sample “in-the-wild” tier per modality covering real scam calls, viral deepfake videos, re-compressed audio, and partial splices.
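To make that composition concrete, here is a minimal sketch of the per-modality sample budget as a config dict. The counts mirror the prose above; the structure and field names are our own illustration, not a released schema.

```python
# Illustrative per-modality sample budget (counts from the prose above).
# Keys and structure are our own; this is not the released dataset schema.
SAMPLE_SET = {
    modality: {
        "core": 1_500,          # 50% real / 50% synthetic
        "real": 750,
        "synthetic": 750,
        "in_the_wild": 300,     # scam calls, viral deepfakes, re-compression, partial splices
    }
    for modality in ("audio", "image", "video")
}

assert sum(m["core"] for m in SAMPLE_SET.values()) == 4_500
```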

Real samples sourced from licensed corpora (LibriSpeech, VoxCeleb2, FFHQ, YouTube-8M CC subset) plus commissioned content.

Synthetic samples are generated across eight to ten widely used generators per modality. Audio: ElevenLabs, PlayHT, Resemble, OpenAI TTS, Azure, Google, Amazon, Hume, Cartesia. Image: Midjourney, DALL·E 3, Stable Diffusion, Flux, Imagen, Firefly, Sora stills, Ideogram. Video: Sora, Runway, Veo, Kling, Pika, Luma, HeyGen, Synthesia, DeepFaceLab, SadTalker.

Every sample carries labels for modality, real/synthetic, generator + version, compression level, and demographic tags — enabling per-slice reporting including a per-demographic bias table.
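A per-sample record might look like the following sketch. The field names are hypothetical and only illustrate the labels listed above, plus the per-sample license from the Ethics section; the released schema may differ.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical per-sample metadata record; field names are illustrative."""
    sample_id: str
    modality: str                       # "audio" | "image" | "video"
    is_synthetic: bool                  # ground truth: real vs. synthetic
    generator: Optional[str]            # e.g. "ElevenLabs"; None for real samples
    generator_version: Optional[str]    # version captured at generation time
    compression_level: str              # e.g. "none", "light", "heavy"
    demographic_tags: tuple[str, ...]   # feeds the per-demographic bias table
    tier: str                           # "core" or "in_the_wild"
    license: str                        # e.g. "CC-BY", "commissioned" (see Ethics)
```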

3. Evaluation protocol

Every detector receives identical inputs in identical order. API detectors are called with default parameters (no tuning, not even in our own favor). Web-upload tools are automated via headless browser where the ToS permits; otherwise they are evaluated manually on a representative sub-sample, with the asymmetry noted in the paper.

Raw responses and latencies are stored. Code and data are released alongside publication — every claim is reproducible.
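A minimal sketch of that harness, assuming a hypothetical `detector.classify()` client per tool (not any vendor's actual API): every detector sees the same samples in the same fixed order, with default parameters, and the raw response plus wall-clock latency is written out verbatim.

```python
import json
import time

def run_detector(detector, samples, out_path):
    """Call one detector over the fixed sample order and log raw outputs."""
    with open(out_path, "w") as out:
        for sample in samples:                         # identical order for every detector
            start = time.monotonic()
            response = detector.classify(sample.path)  # default parameters only, no tuning
            latency_s = time.monotonic() - start
            record = {
                "sample_id": sample.sample_id,
                "raw_response": response,              # stored unmodified for reproducibility
                "latency_s": latency_s,
            }
            out.write(json.dumps(record) + "\n")
```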

4. Metrics

  • Accuracy (headline)
  • Precision / Recall / F1 at the detector's default threshold
  • ROC-AUC where a confidence score is available
  • FPR @ 1% FNR — the false-positive rate at the threshold that misses no more than 1% of fakes; the metric fraud teams actually care about
  • FNR @ 1% FPR — the miss rate at the threshold that falsely flags no more than 1% of real content; the metric journalists care about (see the sketch after this list)
  • Median + p95 latency
  • Per-generator accuracy
  • Per-compression-level accuracy
  • Per-demographic accuracy (bias table)
  • Explainability rubric: 0 (no explanation) → 4 (identifies the generator and points to the manipulated regions)
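For the two threshold metrics, here is a minimal sketch of how they can be computed from detector confidence scores using scikit-learn's ROC curve, assuming higher scores mean "more likely synthetic"; the function and variable names are ours, not part of the published harness.

```python
from sklearn.metrics import roc_curve

def fpr_at_fnr(y_true, scores, max_fnr=0.01):
    """False-positive rate at the operating point that misses at most 1% of fakes."""
    fpr, tpr, _ = roc_curve(y_true, scores)   # y_true: 1 = synthetic, 0 = real
    ok = tpr >= 1.0 - max_fnr                 # FNR = 1 - TPR
    return float(fpr[ok].min()) if ok.any() else 1.0

def fnr_at_fpr(y_true, scores, max_fpr=0.01):
    """Miss rate at the operating point that falsely flags at most 1% of real content."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    ok = fpr <= max_fpr
    return float(1.0 - tpr[ok].max()) if ok.any() else 1.0
```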

5. Fairness rules

  • DETECT-3B is evaluated by a team member not on the detection model team. Protocol is locked before we know our own numbers.
  • We publish results even when we lose. Honesty is the entire product.
  • Methodology is pre-registered on arXiv / OSF before any evaluation runs.

6. Ethics

No real-person samples without consent. No sexual content. No minors. Scam-call samples redacted or synthesized to remove victim PII. Every real sample carries an explicit license (CC-BY, CC-BY-SA, commissioned, or cleared-for-research). License is published per-sample in the release.

7. Deliverables

  • Paper on /research/state-of-deepfake-detection-2026 (PDF + interactive HTML + arXiv)
  • Interactive leaderboard on /benchmark (sortable, filterable, linkable)
  • Dataset release on Hugging Face under a research license
  • Press kit on /press/benchmark-2026 (charts, summary, embargo terms)
  • Quarterly re-runs following the initial Q4 2026 release, with version-locked deltas between runs

Want to collaborate — as a detector vendor, researcher, or independent reviewer? Reach out to the Resemble AI research team. Independent reviewers on the byline strengthen the benchmark's authority.