detect·deepfakes by Resemble AI
The Leaderboard

Every deepfake detector, scored honestly.

Deepfake detection marketing pages are full of “99% accurate” claims, and almost none come with reproducible methodology. We're fixing that: every major detector runs through an identical evaluation harness on 4,500 labeled samples, every number gets published (including our own losses), and the dataset is released for independent verification.

Inaugural publication · Q4 2026

14 detectors. 4,500 samples. Three modalities.

The v1 benchmark covers every detector with a usable API or web upload as of Q3 2026. 1,500 labeled samples per modality — audio, image, video. Samples cover ten major generators per modality, four compression levels, and balanced demographics.
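As a rough sketch of how one labeled sample in such a benchmark might be described (the field and value names here are our illustration, not the published schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    """One labeled benchmark sample. Field names are illustrative."""
    sample_id: str
    modality: str      # "audio" | "image" | "video"
    generator: str     # one of ten generators per modality; "real" for genuine media
    compression: str   # one of four levels, e.g. "none" ... "heavy"
    label: int         # 1 = synthetic, 0 = real

# Hypothetical record for one image sample:
example = Sample("img_0001", "image", "hypothetical-gen", "heavy", 1)
```

Stratifying a manifest of 4,500 records like this across generator, compression level, and demographic group is one way to implement the balance described above.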

Detector                 Modalities   Access       Status
Resemble DETECT-3B       A·I·V        API          Enrolled · awaiting run
Hive Moderation          A·I·V        API          Enrolled · awaiting run
Reality Defender         A·I·V        API          Enrolled · awaiting run
Sensity AI               A·V          API          Enrolled · awaiting run
Deepware Scanner         V            Web          Enrolled · awaiting run
Intel FakeCatcher        V            Research     Enrolled · awaiting run
Sightengine              I·V          API          Enrolled · awaiting run
Illuminarty              I            API          Enrolled · awaiting run
Pindrop                  A            Enterprise   Enrolled · awaiting run
ElevenLabs Classifier    A            Web          Enrolled · awaiting run
Undetectable.ai          A            Web          Enrolled · awaiting run
AIVoiceDetector          A            Web          Enrolled · awaiting run
Deepfake-o-meter (UB)    I·V          Academic     Enrolled · awaiting run
SynthID Verifier         I·V          API          Enrolled · awaiting run

Modalities: A = audio · I = image · V = video

Missing a detector? Email us and we'll add it to the harness if there's a usable API.
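For readers curious what "an identical evaluation harness" can look like in practice, here is a minimal sketch of the inner loop: one wrapper per detector, one record per sample, score and latency captured uniformly. The `detect` interface and record fields are our assumptions, not any vendor's API.

```python
import time
from typing import Callable, Iterable

def run_detector(detect: Callable[[bytes], float],
                 samples: Iterable[tuple[str, bytes, int]]) -> list[dict]:
    """Run one detector over labeled samples, recording score and latency.

    `detect` wraps a vendor API or web upload and returns a
    fake-probability in [0, 1]. Every detector gets the exact same
    samples in the exact same order, so results are comparable.
    """
    rows = []
    for sample_id, payload, label in samples:
        t0 = time.perf_counter()
        score = detect(payload)
        rows.append({
            "sample_id": sample_id,
            "label": label,
            "score": score,
            "latency_ms": (time.perf_counter() - t0) * 1000.0,
        })
    return rows

# Usage with a stub detector standing in for a real API wrapper:
rows = run_detector(lambda payload: 0.9, [("s1", b"...", 1), ("s2", b"...", 0)])
```

Keeping the wrapper as a single function per detector is what lets new entrants be added to the harness with a usable API and nothing else.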

How we do it

Three principles that make this citable.

Pre-registered

The full protocol is locked and published on arXiv/OSF before any detector is run. No retroactive tuning.
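One common way to make such a lock verifiable is to publish a cryptographic digest of the frozen protocol document alongside the pre-registration; anyone can later re-hash the file and confirm nothing was retroactively edited. A minimal sketch (the file name is hypothetical):

```python
import hashlib
from pathlib import Path

def lock_protocol(path: str) -> str:
    """Return the SHA-256 digest of the frozen protocol document.

    Publishing this digest in the pre-registration lets any reader
    verify the protocol file byte-for-byte after the fact.
    """
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```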

Blind-tested

DETECT-3B is evaluated by a team that didn't build the model. We publish the results whether we win or lose.

Quarterly updated

Every quarter we re-run the full benchmark against new commercial generators (Sora, Runway, HeyGen, and others), reporting deltas against version-locked detector releases.

What we measure

Twelve metrics per detector, per modality.

Headline accuracy is only the start. We also publish precision, recall, F1, ROC-AUC, operating points at fixed error rates (FPR @ 1% FNR and FNR @ 1% FPR), latency, cost per 1,000 calls, per-generator, per-compression, and per-demographic accuracy, and an explainability rubric: the metric that separates a number from a product a fraud analyst can defend in court.

  • Accuracy
  • Precision / Recall
  • F1 score
  • ROC-AUC
  • FPR @ 1% FNR
  • FNR @ 1% FPR
  • Median + p95 latency
  • Per-generator accuracy
  • Per-compression accuracy
  • Per-demographic accuracy
  • Explainability rubric
  • Cost per 1,000 calls
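Of these, the fixed-operating-point metrics are the least standardized across vendors, so as an illustration here is one way FNR @ 1% FPR can be computed from raw detector scores. This is a sketch under our own assumptions (higher score = more likely fake, label 1 = fake), not the published harness code:

```python
from typing import Sequence

def fnr_at_fpr(scores: Sequence[float], labels: Sequence[int],
               max_fpr: float = 0.01) -> float:
    """False-negative rate at the loosest threshold whose FPR <= max_fpr.

    `scores` are detector outputs (higher = more likely fake) and
    `labels` are ground truth (1 = fake, 0 = real). A sample is called
    fake when its score exceeds the chosen threshold.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0),
                       reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        return 0.0
    allowed_fp = int(max_fpr * len(negatives))
    if allowed_fp >= len(negatives):
        return 0.0  # FPR budget admits every negative; nothing is missed
    # Threshold at the (allowed_fp + 1)-th highest negative score: only
    # the negatives strictly above it are flagged, staying within budget.
    thr = negatives[allowed_fp]
    misses = sum(1 for s in positives if s <= thr)
    return misses / len(positives)

# Toy run: 3 fakes, 4 reals, a 25% FPR budget admits one false positive.
rate = fnr_at_fpr([0.95, 0.8, 0.4, 0.9, 0.5, 0.3, 0.1],
                  [1, 1, 1, 0, 0, 0, 0], max_fpr=0.25)
```

Reporting FNR at a fixed 1% FPR (and vice versa) pins every detector to the same operating point, which headline accuracy alone does not.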