Every deepfake detector, scored honestly.
Deepfake detection marketing pages are full of "99% accurate" claims, and almost none of them come with a reproducible methodology. We're fixing that: we run every major detector through an identical evaluation harness on 4,500 labeled samples, publish every number (including our own losses), and release the dataset for independent verification.
14 detectors. 4,500 samples. Three modalities.
The v1 benchmark covers every detector with a usable API or web upload as of Q3 2026. 1,500 labeled samples per modality — audio, image, video. Samples cover ten major generators per modality, four compression levels, and balanced demographics.
| Detector | Modalities | Access | Status |
|---|---|---|---|
| Resemble DETECT-3B | AIV | API | Enrolled · awaiting run |
| Hive Moderation | AIV | API | Enrolled · awaiting run |
| Reality Defender | AIV | API | Enrolled · awaiting run |
| Sensity AI | AV | API | Enrolled · awaiting run |
| Deepware Scanner | V | Web | Enrolled · awaiting run |
| Intel FakeCatcher | V | Research | Enrolled · awaiting run |
| Sightengine | IV | API | Enrolled · awaiting run |
| Illuminarty | I | API | Enrolled · awaiting run |
| Pindrop | A | Enterprise | Enrolled · awaiting run |
| ElevenLabs Classifier | A | Web | Enrolled · awaiting run |
| Undetectable.ai | A | Web | Enrolled · awaiting run |
| AIVoiceDetector | A | Web | Enrolled · awaiting run |
| Deepfake-o-meter (UB) | IV | Academic | Enrolled · awaiting run |
| SynthID Verifier | IV | API | Enrolled · awaiting run |
Missing a detector? Email us and we'll add it to the harness if there's a usable API.
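To make "identical evaluation harness" concrete, here is a minimal sketch of how the dataset and scoring loop could be organized. The field names (`generator`, `compression`, etc.) and the `detect` callable are illustrative assumptions, not the published harness; in the real run, `detect` would be each vendor's API client.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Sample:
    """One labeled benchmark sample (field names are illustrative)."""
    path: str         # media file on disk
    modality: str     # "audio" | "image" | "video"
    label: int        # 1 = synthetic, 0 = real
    generator: str    # which generator produced it ("" for real samples)
    compression: str  # one of the four compression levels

def run_detector(detect: Callable[[str], float],
                 samples: list[Sample]) -> list[tuple[Sample, float]]:
    """Score every sample with one detector; the same loop runs for all 14."""
    return [(s, detect(s.path)) for s in samples]

# Toy detector standing in for a vendor API client.
always_fake = lambda path: 0.9

demo = [Sample("a.wav", "audio", 1, "gen-x", "high"),
        Sample("b.wav", "audio", 0, "", "low")]
results = run_detector(always_fake, demo)
```

Because every detector sees the same `Sample` list through the same loop, the per-generator, per-compression, and per-demographic breakdowns below are just group-bys over one results table.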
Three principles that make this citable.
- **Pre-registered protocol.** The full protocol is locked and published on arXiv/OSF before any detector is run. No retroactive tuning.
- **Arm's-length evaluation.** DETECT-3B is evaluated by a team that didn't build the model. We publish the results whether we win or lose.
- **Quarterly re-runs.** A full re-run every quarter against new commercial generators (Sora, Runway, HeyGen, etc.), with version-locked deltas.
Twelve metrics per detector, per modality.
Headline accuracy is only the start. We also publish precision, recall, F1, ROC-AUC, error rates at fixed operating points, latency, cost, per-generator accuracy, per-compression accuracy, per-demographic accuracy, and an explainability rubric: the metric that separates a number from a product a fraud analyst can defend in court.
- Accuracy
- Precision / Recall
- F1 score
- ROC-AUC
- FPR @ 1% FNR
- FNR @ 1% FPR
- Median + p95 latency
- Per-generator accuracy
- Per-compression accuracy
- Per-demographic accuracy
- Explainability rubric
- Cost per 1,000 calls
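The fixed-operating-point metrics deserve a concrete definition: FPR @ 1% FNR asks "if the threshold is tuned so we miss at most 1% of fakes, how many real samples get flagged?" A minimal sketch, assuming a hypothetical helper `fpr_at_fnr` (not the published harness code) and scores where higher means "more likely synthetic":

```python
def fpr_at_fnr(scores_pos: list[float],
               scores_neg: list[float],
               max_fnr: float = 0.01) -> float:
    """False-positive rate when the decision threshold is set so that
    at most `max_fnr` of synthetic samples are missed.

    scores_pos: detector scores for synthetic (fake) samples.
    scores_neg: detector scores for real samples.
    """
    pos = sorted(scores_pos)            # ascending
    k = int(len(pos) * max_fnr)         # how many fakes we may miss
    threshold = pos[k]                  # at most k fakes score below this
    return sum(s >= threshold for s in scores_neg) / len(scores_neg)

# 100 fakes (one low outlier), 100 reals (ten high outliers):
fpr = fpr_at_fnr([0.9] * 99 + [0.1], [0.2] * 90 + [0.95] * 10)
# -> 0.1: holding missed fakes to <=1% forces a threshold that
#    flags the ten high-scoring real samples.
```

FNR @ 1% FPR is the mirror image: fix the threshold from the real-sample scores instead, then report the fraction of fakes that slip under it.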