detect·deepfakes by Resemble AI
The Leaderboard

Every deepfake detector, scored honestly.

Deepfake detection marketing pages are full of “99% accurate” claims, and almost none come with reproducible methodology. We're fixing that: every major detector runs through an identical evaluation harness on 4,500 labeled samples, every number gets published (including our own losses), and the dataset is released for independent verification.

Inaugural publication · Q4 2026

14 detectors. 4,500 samples. Three modalities.

The v1 benchmark covers every detector with a usable API or web upload as of Q3 2026. 1,500 labeled samples per modality — audio, image, video. Samples cover ten major generators per modality, four compression levels, and balanced demographics.
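As a rough sketch of how one labeled sample in such a benchmark might be described (the field and value names here are our illustration, not the published schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    """One labeled benchmark sample. Field names are illustrative."""
    sample_id: str
    modality: str      # "audio" | "image" | "video"
    generator: str     # one of ten generators per modality; "real" for genuine media
    compression: str   # one of four levels, e.g. "none" ... "heavy"
    label: int         # 1 = synthetic, 0 = real

# Hypothetical record for one image sample:
example = Sample("img_0001", "image", "hypothetical-gen", "heavy", 1)
```

Stratifying a manifest of 4,500 records like this across generator, compression level, and demographic group is one way to implement the balance described above.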

Detector                 Modalities   Access       Status
Resemble DETECT-3B       A·I·V        API          Enrolled · awaiting run
Hive Moderation          A·I·V        API          Enrolled · awaiting run
Reality Defender         A·I·V        API          Enrolled · awaiting run
Sensity AI               A·V          API          Enrolled · awaiting run
Deepware Scanner         V            Web          Enrolled · awaiting run
Intel FakeCatcher        V            Research     Enrolled · awaiting run
Sightengine              I·V          API          Enrolled · awaiting run
Illuminarty              I            API          Enrolled · awaiting run
Pindrop                  A            Enterprise   Enrolled · awaiting run
ElevenLabs Classifier    A            Web          Enrolled · awaiting run
Undetectable.ai          A            Web          Enrolled · awaiting run
AIVoiceDetector          A            Web          Enrolled · awaiting run
Deepfake-o-meter (UB)    I·V          Academic     Enrolled · awaiting run
SynthID Verifier         I·V          API          Enrolled · awaiting run

Modalities: A = audio · I = image · V = video

Missing a detector? Email us and we'll add it to the harness if there's a usable API.
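For readers curious what "an identical evaluation harness" can look like in practice, here is a minimal sketch of the inner loop: one wrapper per detector, one record per sample, score and latency captured uniformly. The `detect` interface and record fields are our assumptions, not any vendor's API.

```python
import time
from typing import Callable, Iterable

def run_detector(detect: Callable[[bytes], float],
                 samples: Iterable[tuple[str, bytes, int]]) -> list[dict]:
    """Run one detector over labeled samples, recording score and latency.

    `detect` wraps a vendor API or web upload and returns a
    fake-probability in [0, 1]. Every detector gets the exact same
    samples in the exact same order, so results are comparable.
    """
    rows = []
    for sample_id, payload, label in samples:
        t0 = time.perf_counter()
        score = detect(payload)
        rows.append({
            "sample_id": sample_id,
            "label": label,
            "score": score,
            "latency_ms": (time.perf_counter() - t0) * 1000.0,
        })
    return rows

# Usage with a stub detector standing in for a real API wrapper:
rows = run_detector(lambda payload: 0.9, [("s1", b"...", 1), ("s2", b"...", 0)])
```

Keeping the wrapper as a single function per detector is what lets new entrants be added to the harness with a usable API and nothing else.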

How we do it

Three principles that make this citable.

Pre-registered

The full protocol is locked and published on arXiv/OSF before any detector is run. No retroactive tuning.
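One common way to make such a lock verifiable is to publish a cryptographic digest of the frozen protocol document alongside the pre-registration; anyone can later re-hash the file and confirm nothing was retroactively edited. A minimal sketch (the file name is hypothetical):

```python
import hashlib
from pathlib import Path

def lock_protocol(path: str) -> str:
    """Return the SHA-256 digest of the frozen protocol document.

    Publishing this digest in the pre-registration lets any reader
    verify the protocol file byte-for-byte after the fact.
    """
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```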

Blind-tested

DETECT-3B is evaluated by a team that didn't build the model. We publish the results whether we win or lose.

Quarterly updated

Every quarter we re-run the full benchmark against new commercial generators (Sora, Runway, HeyGen, and others), reporting deltas against version-locked detector releases.

What we measure

Twelve metrics per detector, per modality.

Headline accuracy is only the start. We also publish precision, recall, F1, ROC-AUC, operating points at fixed error rates (FPR @ 1% FNR and FNR @ 1% FPR), latency, cost per 1,000 calls, per-generator, per-compression, and per-demographic accuracy, and an explainability rubric: the metric that separates a number from a product a fraud analyst can defend in court.

  • Accuracy
  • Precision / Recall
  • F1 score
  • ROC-AUC
  • FPR @ 1% FNR
  • FNR @ 1% FPR
  • Median + p95 latency
  • Per-generator accuracy
  • Per-compression accuracy
  • Per-demographic accuracy
  • Explainability rubric
  • Cost per 1,000 calls
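Of these, the fixed-operating-point metrics are the least standardized across vendors, so as an illustration here is one way FNR @ 1% FPR can be computed from raw detector scores. This is a sketch under our own assumptions (higher score = more likely fake, label 1 = fake), not the published harness code:

```python
from typing import Sequence

def fnr_at_fpr(scores: Sequence[float], labels: Sequence[int],
               max_fpr: float = 0.01) -> float:
    """False-negative rate at the loosest threshold whose FPR <= max_fpr.

    `scores` are detector outputs (higher = more likely fake) and
    `labels` are ground truth (1 = fake, 0 = real). A sample is called
    fake when its score exceeds the chosen threshold.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0),
                       reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        return 0.0
    allowed_fp = int(max_fpr * len(negatives))
    if allowed_fp >= len(negatives):
        return 0.0  # FPR budget admits every negative; nothing is missed
    # Threshold at the (allowed_fp + 1)-th highest negative score: only
    # the negatives strictly above it are flagged, staying within budget.
    thr = negatives[allowed_fp]
    misses = sum(1 for s in positives if s <= thr)
    return misses / len(positives)

# Toy run: 3 fakes, 4 reals, a 25% FPR budget admits one false positive.
rate = fnr_at_fpr([0.95, 0.8, 0.4, 0.9, 0.5, 0.3, 0.1],
                  [1, 1, 1, 0, 0, 0, 0], max_fpr=0.25)
```

Reporting FNR at a fixed 1% FPR (and vice versa) pins every detector to the same operating point, which headline accuracy alone does not.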