detect·deepfakes by Resemble AI
Benchmark methodology · v1

How we'll run the benchmark.

The full methodology is pre-registered before any detector runs. This is the condensed version — the complete protocol will be published on arXiv alongside the Q4 2026 release.

1. Scope

Modalities: audio, image, video. Text is excluded from v1 (different problem; potential v2 addition).

Detectors: every commercial tool in active development as of Q3 2026 that offers a usable API or web upload. The enrolled list on the leaderboard page is canonical.

Non-goals: we don't evaluate open-source research models (too many, too unstable), watermark-only provenance tools (that's not detection), or celebrity recognition (different task).

2. Sample set

1,500 labeled samples per modality (4,500 total), split 50% real / 50% synthetic, plus a 300-sample “in-the-wild” tier per modality covering real scam calls, viral deepfake videos, re-compressed audio, and partial splices.
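To make that composition concrete, here is a minimal sketch of the per-modality sample budget as a config dict. The counts mirror the prose above; the structure and field names are our own illustration, not a released schema.

```python
# Illustrative per-modality sample budget (counts from the prose above).
# Keys and structure are our own; this is not the released dataset schema.
SAMPLE_SET = {
    modality: {
        "core": 1_500,          # 50% real / 50% synthetic
        "real": 750,
        "synthetic": 750,
        "in_the_wild": 300,     # scam calls, viral deepfakes, re-compression, partial splices
    }
    for modality in ("audio", "image", "video")
}

assert sum(m["core"] for m in SAMPLE_SET.values()) == 4_500
```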

Real samples sourced from licensed corpora (LibriSpeech, VoxCeleb2, FFHQ, YouTube-8M CC subset) plus commissioned content.

Synthetic samples are generated across eight to ten widely used generators per modality. Audio: ElevenLabs, PlayHT, Resemble, OpenAI TTS, Azure, Google, Amazon, Hume, Cartesia. Image: Midjourney, DALL·E 3, Stable Diffusion, Flux, Imagen, Firefly, Sora stills, Ideogram. Video: Sora, Runway, Veo, Kling, Pika, Luma, HeyGen, Synthesia, DeepFaceLab, SadTalker.

Every sample carries labels for modality, real/synthetic, generator + version, compression level, and demographic tags — enabling per-slice reporting including a per-demographic bias table.
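A per-sample record might look like the following sketch. The field names are hypothetical and only illustrate the labels listed above, plus the per-sample license from the Ethics section; the released schema may differ.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical per-sample metadata record; field names are illustrative."""
    sample_id: str
    modality: str                       # "audio" | "image" | "video"
    is_synthetic: bool                  # ground truth: real vs. synthetic
    generator: Optional[str]            # e.g. "ElevenLabs"; None for real samples
    generator_version: Optional[str]    # version captured at generation time
    compression_level: str              # e.g. "none", "light", "heavy"
    demographic_tags: tuple[str, ...]   # feeds the per-demographic bias table
    tier: str                           # "core" or "in_the_wild"
    license: str                        # e.g. "CC-BY", "commissioned" (see Ethics)
```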

3. Evaluation protocol

Every detector receives identical inputs in identical order. API detectors are called with default parameters (no tuning, not even in our own favor). Web-upload tools are automated via headless browser where the ToS permits; otherwise they are evaluated manually on a representative sub-sample, with the asymmetry noted in the paper.

Raw responses and latencies are stored. Code and data are released alongside publication — every claim is reproducible.
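A minimal sketch of that harness, assuming a hypothetical `detector.classify()` client per tool (not any vendor's actual API): every detector sees the same samples in the same fixed order, with default parameters, and the raw response plus wall-clock latency is written out verbatim.

```python
import json
import time

def run_detector(detector, samples, out_path):
    """Call one detector over the fixed sample order and log raw outputs."""
    with open(out_path, "w") as out:
        for sample in samples:                         # identical order for every detector
            start = time.monotonic()
            response = detector.classify(sample.path)  # default parameters only, no tuning
            latency_s = time.monotonic() - start
            record = {
                "sample_id": sample.sample_id,
                "raw_response": response,              # stored unmodified for reproducibility
                "latency_s": latency_s,
            }
            out.write(json.dumps(record) + "\n")
```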

4. Metrics

  • Accuracy (headline)
  • Precision / Recall / F1 at the detector's default threshold
  • ROC-AUC where a confidence score is available
  • FPR @ 1% FNR — the false-positive rate at the threshold that misses no more than 1% of fakes; the metric fraud teams actually care about
  • FNR @ 1% FPR — the miss rate at the threshold that falsely flags no more than 1% of real content; the metric journalists care about (see the sketch after this list)
  • Median + p95 latency
  • Per-generator accuracy
  • Per-compression-level accuracy
  • Per-demographic accuracy (bias table)
  • Explainability rubric: 0 (no explanation) → 4 (identifies the generator and points to the manipulated regions)
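For the two threshold metrics, here is a minimal sketch of how they can be computed from detector confidence scores using scikit-learn's ROC curve, assuming higher scores mean "more likely synthetic"; the function and variable names are ours, not part of the published harness.

```python
from sklearn.metrics import roc_curve

def fpr_at_fnr(y_true, scores, max_fnr=0.01):
    """False-positive rate at the operating point that misses at most 1% of fakes."""
    fpr, tpr, _ = roc_curve(y_true, scores)   # y_true: 1 = synthetic, 0 = real
    ok = tpr >= 1.0 - max_fnr                 # FNR = 1 - TPR
    return float(fpr[ok].min()) if ok.any() else 1.0

def fnr_at_fpr(y_true, scores, max_fpr=0.01):
    """Miss rate at the operating point that falsely flags at most 1% of real content."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    ok = fpr <= max_fpr
    return float(1.0 - tpr[ok].max()) if ok.any() else 1.0
```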

5. Fairness rules

  • DETECT-3B is evaluated by a team member not on the detection model team. Protocol is locked before we know our own numbers.
  • We publish results even when we lose. Honesty is the entire product.
  • Methodology is pre-registered on arXiv / OSF before any evaluation runs.

6. Ethics

No real-person samples without consent. No sexual content. No minors. Scam-call samples redacted or synthesized to remove victim PII. Every real sample carries an explicit license (CC-BY, CC-BY-SA, commissioned, or cleared-for-research). License is published per-sample in the release.

7. Deliverables

  • Paper on /research/state-of-deepfake-detection-2026 (PDF + interactive HTML + arXiv)
  • Interactive leaderboard on /benchmark (sortable, filterable, linkable)
  • Dataset release on Hugging Face under a research license
  • Press kit on /press/benchmark-2026 (charts, summary, embargo terms)
  • Quarterly re-runs following the initial Q4 2026 release, with version-locked deltas between runs

Want to collaborate — as a detector vendor, researcher, or independent reviewer? Reach out to the Resemble AI research team. Independent reviewers on the byline strengthen the benchmark's authority.