Detecting Audio Deepfakes in 2026
The state of audio deepfake detection — Codecfakes, replay attacks, prosodic analysis, and how DETECT-3B Omni + PerTh watermarking close the gap. A technical brief for security, fraud, and trust-and-safety teams.
Human hearing is no longer a defense. By 2026, one in four Americans has received a deepfake voice call, and nearly half the population admits they can't reliably tell a cloned voice from a real one. The AI-driven evolution of voice cloning has fundamentally destabilized communication security — and every organization that authenticates or transacts through voice channels is in the impact zone.
This guide covers what actually works in audio deepfake detection in 2026: the attack surface, why legacy detectors fail, and how modern multi-layered defenses (Resemble AI's DETECT-3B Omni, the DETECT-2B low-latency model, PerTh neural watermarking, and the Intelligence explainability layer) close the gap.
The 2026 Landscape
The asymmetry of effort between generation and detection has collapsed dramatically in the attacker's favor. Producing a convincing voice clone now takes three to five seconds of reference audio, consumer hardware, and pennies of compute. Detecting it requires specialized machine-learning pipelines, substantial computational overhead, and — critically — time. That latency is where malicious actors operate: long enough to execute wire fraud, extract credentials, or spread disinformation before verification systems register a response.
The consequences are measurable:
- Voice deepfake call volume grew 680% year-over-year in 2024, with a 16% compound annual rise in unwanted automated calls across the US, UK, France, Germany, and Spain.
- Audio fraud cost US corporates $1.3B+ in 2025, with the Arup $25M case as the watershed public incident.
- Public trust in mobile networks has collapsed: survey data shows consumers believe, by a two-to-one margin, that scammers are defeating telecom security.
This matters because enterprise resilience can no longer lean on reactive moderation. The math doesn't work — an infinite supply of synthetic audio cannot be filtered by finite human review capacity.
California's Regulatory Pivot
California has emerged as the global leader in deepfake governance, and its 2025–2026 legislation reshapes the commercial landscape:
- AB 621 (effective October 2025) establishes a private right of action against anyone who creates, shares, or facilitates non-consensual deepfake pornography. It dismantles the safe-harbor protections that previously insulated distribution platforms.
- SB 683 (effective January 1, 2026) grants individuals whose voice, name, or likeness is used without consent immediate injunctive relief with two-business-day takedown compliance. Prevailing plaintiffs recover the greater of $750 or actual damages.
- AB 3211 (effective January 1, 2026) is the most structurally significant. It mandates latent disclosures and provenance watermarks in every output from a generative AI system, including the generator company name and model version. Recording-device manufacturers must offer firmware updates enabling origin-point authenticity watermarks. Watermarks must be compatible with industry standards like C2PA.
The net effect: legal responsibility for authenticity is shifting from the recipient (who previously had to detect fakery) to the creator (who must now cryptographically prove origin). This makes active provenance — not just passive detection — a compliance requirement.
Why Legacy Detection Fails
Most audio deepfake detectors in 2024 were binary classifiers trained to flag specific digital artifacts — spectral correlations, noise estimation anomalies, phase mismatches, or vocoder signatures (Griffin-Lim, early HiFi-GAN). These systems now fail catastrophically in two dimensions.
The Generalization Gap
Empirical studies show that state-of-the-art detectors trained on curated benchmark datasets (like LibriTTS) suffer catastrophic accuracy drops when deployed against real-world fraud. Fraud audio is never pristine: it traverses lossy telephony codecs, background noise, bandwidth fluctuations, and intentional signal degradation. Detectors hyper-optimized for clean vocoder artifacts collapse when those artifacts are smoothed or masked.
Codecfakes
The most insidious evolution is the Codecfake — synthetic audio generated through neural codec tokenization (e.g., the SNAC hierarchical codec used by Maya1) rather than traditional mathematical vocoding. Because these systems don't produce the expected acoustic artifacts, detection models trained exclusively on vocoder output suffer an average Equal Error Rate of 41.4% when confronted with codec-based deepfakes. The artifacts that do survive are exceptionally subtle and transient, and they slip past frontline perimeters that rely on legacy spectral analysis.
Replay Attacks
Perhaps the most mathematically devastating evasion in 2026 is the Replay Attack: the attacker plays a synthetic audio file through a physical loudspeaker and simultaneously re-records it via a microphone in a physical acoustic environment. The introduction of room acoustics — multi-path reverberation, ambient noise, hardware frequency response — masks or destroys the digital artifacts detectors depend on.
Research published at Interspeech 2025 using the ReplayDF dataset (132.5 hours across 109 speaker-microphone combinations, six languages, four TTS models) quantified the damage. The top open-source detector, W2V2-AASIST, saw its Equal Error Rate surge from 4.7% on original digital audio to 18.2% on replayed audio — a near-4x degradation. Researchers also found a positive Pearson correlation between recording-setup aggressiveness and detection failure: the more analog the laundering, the harder the synthetic origin is to detect.
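Since Equal Error Rate is the headline metric throughout these benchmarks, it helps to see how it's computed. The sketch below is a generic EER calculation over genuine-vs-spoof score distributions (not Resemble's evaluation code); the toy score distributions are invented for illustration.

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """Equal Error Rate (EER): the operating point where the false-acceptance
    rate (spoofs passed) equals the false-rejection rate (genuine audio
    flagged). Convention here: higher score = 'more likely genuine'."""
    # Sweep candidate thresholds over all observed scores.
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])   # spoofs accepted
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])  # genuine rejected
    idx = np.argmin(np.abs(far - frr))  # point where the two curves cross
    return (far[idx] + frr[idx]) / 2, thresholds[idx]

# Toy example: well-separated score distributions yield a low EER.
rng = np.random.default_rng(0)
genuine = rng.normal(0.8, 0.1, 1000)
spoof = rng.normal(0.3, 0.1, 1000)
eer, thr = equal_error_rate(genuine, spoof)
```

A detector whose replayed-audio scores drift toward the genuine distribution is exactly what pushes EER from 4.7% toward 18.2%.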
What Works in 2026
The industry has transitioned from artifact-based detection to dynamic, multi-modal, behavior-based analysis. Four approaches now define the frontier:
1. Acoustic Prosodic Modeling
Rather than searching for synthesis artifacts (which generators are learning to eliminate), modern detectors search for the absence of authentic biological markers. Human speech contains intricate higher-order correlations — breathing patterns, micro-pauses, structural rhythms, stress distribution, and emotional inflection — that AI consistently fails to replicate over long durations.
This flips the mathematical problem. Even when a replay attack obscures vocoder artifacts through reverberation, it cannot artificially inject natural respiratory variation or emotional inconsistency. Prosodic modeling achieves ~93% accuracy with 24.7% EER against Codecfakes where traditional artifact detectors drop to near-chance performance.
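One of the biological markers described above — irregular, breath-driven micro-pauses — can be illustrated with a simple short-time-energy pass. This is a toy feature extractor, not the production prosodic model; frame size and energy floor are arbitrary assumptions.

```python
import numpy as np

def pause_statistics(audio, sr=16000, frame_ms=25, energy_floor=0.01):
    """Illustrative prosodic feature: distribution of silent gaps (micro-pauses).
    Natural speech shows irregular, breath-driven pause spacing; long synthetic
    passages tend toward unnaturally regular or absent pauses."""
    frame = int(sr * frame_ms / 1000)
    n = len(audio) // frame
    energy = np.array([np.sqrt(np.mean(audio[i*frame:(i+1)*frame]**2))
                       for i in range(n)])
    silent = energy < energy_floor
    # Collect lengths of consecutive silent runs, in frames.
    runs, count = [], 0
    for s in silent:
        if s:
            count += 1
        elif count:
            runs.append(count); count = 0
    if count:
        runs.append(count)
    runs = np.array(runs, dtype=float)
    return {"pause_count": len(runs),
            # Coefficient of variation: irregular pause lengths score higher.
            "pause_cv": float(runs.std() / runs.mean()) if len(runs) else 0.0}

# Toy check: a continuous tone has no pauses; a gapped signal has several.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
gapped = tone.copy()
gapped[2000:3000] = 0; gapped[12000:14000] = 0
stats_tone = pause_statistics(tone)
stats_gapped = pause_statistics(gapped)
```

A real prosodic model learns far richer representations, but the principle is the same: score the *statistics* of biological rhythm, not the spectral fingerprint of a synthesizer.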
2. Cross-Attention Fusion
Advanced detectors dynamically weigh prosodic, spectral, and source-tracing signals through cross-attention architectures. This multi-dimensional fusion handles cases that single-signal models miss — particularly partial splices where 3–5 seconds of synthetic speech is inserted into otherwise-authentic audio to alter a specific number or instruction.
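The fusion mechanism is standard scaled dot-product cross-attention: one signal stream queries another, so a short spliced region can dominate the fused representation. The sketch below uses random, untrained feature vectors purely to show the shape of the computation; it is not the production architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keys, values):
    """Scaled dot-product cross-attention: the query stream (e.g. a prosodic
    embedding) dynamically weights frames of another stream (e.g. spectral
    features). Frames that look anomalous to the query receive high weight,
    which is how a 3-5 second splice can dominate the fused score."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)   # (1, T) attention logits
    weights = softmax(scores, axis=-1)     # normalized emphasis over frames
    return weights @ values, weights       # fused vector + per-frame weights

rng = np.random.default_rng(0)
T, d = 50, 16                              # 50 frames, 16-dim features
prosodic_q = rng.normal(size=(1, d))       # query from the prosodic stream
spectral_k = rng.normal(size=(T, d))       # keys from the spectral stream
spectral_v = rng.normal(size=(T, d))
fused, w = cross_attention(prosodic_q, spectral_k, spectral_v)
```

In a trained detector the per-frame weights `w` double as localization evidence: they point at *where* in the clip the splice sits.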
3. Replay-Aware Training (RIR Augmentation)
Resemble AI researchers, leveraging the ReplayDF dataset, have demonstrated that adaptive retraining with Room Impulse Responses (RIRs) — mathematical models of physical acoustic reverberation — teaches detection networks to isolate underlying synthetic anomalies beneath layers of analog distortion. Training pipeline augmentation with RIRs reduced replay-attack error rates from 18.2% down to 11.0%. Combined with DETECT-3B's 3B-parameter capacity, this neutralizes one of the most effective laundering techniques in the wild.
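Mechanically, RIR augmentation is a convolution: clean training audio is convolved with a room's impulse response so the detector learns to see through reverberation. A minimal sketch, using a synthetic exponentially-decaying RIR as a stand-in for measured responses like those in ReplayDF:

```python
import numpy as np

def apply_rir(dry_audio, rir):
    """Simulate a replay attack by convolving clean audio with a Room Impulse
    Response, then peak-normalizing. Feeding such 'laundered' copies into the
    training pipeline teaches the detector to look past reverberation."""
    wet = np.convolve(dry_audio, rir)[: len(dry_audio)]
    peak = np.abs(wet).max()
    return wet / peak if peak > 0 else wet

def synthetic_rir(sr=16000, rt60=0.4, length_s=0.5, seed=0):
    """Toy RIR: exponentially decaying noise, decaying ~60 dB over rt60
    seconds. Real pipelines use measured RIRs from physical rooms."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sr * length_s)) / sr
    decay = np.exp(-6.9 * t / rt60)  # ln(1e-3) ~ -6.9: -60 dB at t = rt60
    return rng.normal(size=t.shape) * decay

sr = 16000
dry = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s test tone
augmented = apply_rir(dry, synthetic_rir(sr))
```

Varying rt60, microphone coloration, and noise floors across the training set is what generalizes the detector across the 109 speaker-microphone combinations ReplayDF covers.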
4. Native Provenance via PerTh Watermarking
Detection is reactive. Watermarking is proactive. Resemble's PerTh Neural Watermarker embeds an imperceptible, tamper-resistant signature directly into the latent space of generated audio at creation time. Unlike metadata or post-generation digital signatures (which can be stripped by any editor), PerTh is woven into the fundamental acoustic properties of the output.
The watermark survives:
- Heavy lossy compression
- Dynamic resampling
- Background noise addition
- Re-encoding
- Time-stretching and time-shifting
- Secondary model training (a key AB 3211 compliance advantage)
This creates an unbreakable chain of custody. Enterprises can prove their content's provenance instantly, detect unauthorized scraping used to train competing models, and satisfy the cryptographic-provenance mandates of California AB 3211. PerTh aligns with C2PA open standards through a JavaScript SDK widget for broad manifest verification.
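PerTh itself is a proprietary neural watermarker operating in latent space, but the embed-then-correlate principle behind audio watermarking can be shown with a classical spread-spectrum toy. Everything below (key scheme, strength, threshold) is an illustration of the concept, not PerTh's method:

```python
import numpy as np

def embed_watermark(audio, key, strength=0.02):
    """Toy spread-spectrum watermark: add a low-amplitude pseudorandom carrier
    derived from a secret key. (PerTh embeds neurally in latent space; this
    only illustrates why such marks survive noise and re-encoding.)"""
    carrier = np.random.default_rng(key).choice([-1.0, 1.0], size=len(audio))
    return audio + strength * carrier

def detect_watermark(audio, key, threshold=0.01):
    """Correlate against the keyed carrier: marked audio correlates near the
    embed strength, unmarked (or wrong-key) audio near zero."""
    carrier = np.random.default_rng(key).choice([-1.0, 1.0], size=len(audio))
    return float(np.mean(audio * carrier)) > threshold

sr, key = 16000, 42
audio = np.sin(2 * np.pi * 330 * np.arange(10 * sr) / sr)  # 10 s tone
marked = embed_watermark(audio, key)
# The mark survives additive noise because correlation averages it out.
noisy = marked + np.random.default_rng(1).normal(0, 0.01, size=len(marked))
```

The same averaging argument explains PerTh's robustness list above: compression, resampling, and noise perturb individual samples, but a signature spread across the whole signal survives in aggregate.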
DETECT-3B Omni and the Low-Latency DETECT-2B
Resemble AI's audio defense stack has two layers:
DETECT-3B Omni (3 billion parameters)
The flagship multimodal model. State-of-the-art detection across speech, image, and video through a single unified API. Recent Hugging Face Speech DF Arena snapshots:
| System | Average EER ↓ | Accuracy | Focus |
|---|---|---|---|
| Hiya (Authenticity-Verific) | 2.113% | 97.88% | Low-latency telephony |
| Resemble AI (DETECT-3B Omni) | 2.570% | 97.40% | Multimodal enterprise |
| Modulate (Velma) | n/a | 98.90% | Live conversation |
| Whispeak | n/a | 96.90% | Deepfake platform |
| Deep Learning (SpeakSure-v0.1) | n/a | 96.00% | Acoustic analysis |
DETECT-3B trades slight margins in pure telephony accuracy for massive generalized capability across audio, image, and video synthesis models — the right choice when the detection problem spans modalities (as it increasingly does).
DETECT-2B Neural
For live environments where 3B-parameter overhead is prohibitive, DETECT-2B operates as an ensemble of specialized sub-models leveraging pre-trained self-supervised audio representations. It returns authenticity scores in 200–300ms across 40+ languages with up to 94% accuracy — making it suitable for live contact centers, video conferencing nodes, and real-time transaction workflows.
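A live deployment typically wraps the model call in a streaming gate: score fixed windows as they arrive, and alert only after several consecutive high-risk windows to avoid flapping on a single noisy frame. The sketch below stubs the model with a plain function; the class name, thresholds, and patience logic are illustrative assumptions, not the DETECT-2B API.

```python
from collections import deque

class StreamingDetector:
    """Sketch of a low-latency gate around a scoring model. `score_fn` stands
    in for the real model call: it maps an audio chunk to a synthetic
    probability in [0, 1]."""
    def __init__(self, score_fn, threshold=0.7, patience=3, history=10):
        self.score_fn = score_fn
        self.threshold = threshold
        self.patience = patience          # consecutive hot windows to alert
        self.recent = deque(maxlen=history)
        self.streak = 0

    def push(self, chunk):
        score = self.score_fn(chunk)
        self.recent.append(score)
        self.streak = self.streak + 1 if score >= self.threshold else 0
        return {"score": score,
                "alert": self.streak >= self.patience,
                "rolling_avg": sum(self.recent) / len(self.recent)}

# Usage with a stubbed model: three consecutive high scores trigger the alert.
det = StreamingDetector(score_fn=lambda chunk: chunk)  # identity stub
results = [det.push(s) for s in [0.2, 0.9, 0.8, 0.95, 0.3]]
```

With 200-300ms per-window scoring, a patience of three windows still flags a live call well under two seconds after synthetic speech begins.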
Explainability: Resemble Intelligence and Audio Source Tracing
In high-stakes environments — fraud SOCs, newsrooms, legal proceedings, government intelligence — a binary verdict is insufficient. Analysts need the why. Resemble Intelligence runs concurrently with the detection model, surfacing observable characteristics, structural anomalies, and timestamp-specific findings in plain-English commentary. No extra user steps, no additional API call.
Paired with Intelligence is the Audio Source Tracing capability: rather than merely labeling audio as synthetic, the Source Tracing API performs forensic analysis to identify the specific generative platform — ElevenLabs, Meta, OpenAI, Resemble itself, or a known open-source fork. This attribution enables SOCs to map threat-actor infrastructure and build comprehensive incident reports. The Identity API extends this with enterprise biometric voice profiles, providing continuous voice-based authentication against known personnel.
Integration Patterns
Resemble Detect ships as API-first software with flexible deployment:
- Cloud API for social-media monitoring, journalism, customer-service integrations.
- On-premise / air-gapped Kubernetes for regulated sectors where data sovereignty is mandatory — healthcare, tier-1 banking, defense. SOC 2 Type II, GDPR, HIPAA compliance without external dependencies.
- Python SDK abstracts the HTTP layer with synchronous or asynchronous execution. The `DetectionRequest` schema supports flags for `intelligence`, `audio_source_tracing`, `use_ood_detector`, `visualize`, and `privacy_mode`.
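As a shape reference, a request carrying those flags might look like the dataclass below. The field names follow the brief; the types, defaults, and `audio_url` field are assumptions, not the SDK's actual signature:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class DetectionRequest:
    """Hypothetical mirror of the DetectionRequest flags named in this brief.
    Types and defaults here are illustrative assumptions."""
    audio_url: str
    intelligence: bool = False          # plain-English explainability layer
    audio_source_tracing: bool = False  # attribute the generating platform
    use_ood_detector: bool = True       # out-of-distribution safety net
    visualize: bool = False             # timestamped visual report
    privacy_mode: bool = False          # suppress audio retention

    def to_json(self) -> str:
        return json.dumps(asdict(self))

req = DetectionRequest(audio_url="https://example.com/call.wav",
                       intelligence=True, audio_source_tracing=True)
payload = req.to_json()
```

Keeping the flags in one serializable schema makes it easy to log exactly which analysis layers ran for each verdict — useful for incident reports.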
For journalists and fact-checkers, Resemble introduced a Press Tier in early 2026 — 90 free DETECT-3B Omni scans per month with Intelligence Lite and on-demand technical support.
The Workflow That Actually Works
For any organization serious about defending against audio deepfakes in 2026:
- Treat every unverified voice as suspect, regardless of caller-ID. Spoofing is trivial and widespread.
- Layer three signals in parallel on high-risk calls: voice biometrics (who), deepfake detection (is it synthetic), and call-metadata fraud scoring (did this originate where it claims).
- Require out-of-band callback verification for any financial or credential-reset authority. The WPP defense worked because policy, not technology, was the control.
- Embed PerTh watermarking in every piece of legitimate synthetic audio your organization produces — accessibility narration, localization voiceovers, IVR prompts. This satisfies AB 3211 and establishes proof of authentic provenance downstream.
- Deploy real-time detection in call-center IVR and conferencing where streaming analysis is feasible. See the call-center playbook for integration patterns.
- Re-benchmark quarterly against the Hugging Face Speech DF Arena. Any detector not retrained against Codecfakes, replay attacks, and the latest commercial TTS families is drifting.
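The "layer three signals in parallel" step above can be sketched as a simple risk fusion. The weights and the hard-override rule are illustrative assumptions, not tuned production values:

```python
def fuse_risk(biometric_match, synthetic_prob, metadata_risk,
              weights=(0.35, 0.45, 0.20)):
    """Illustrative parallel fusion of the three checklist signals:
    voice biometrics (who), deepfake detection (is it synthetic), and
    call-metadata scoring (did it originate where it claims).
    All inputs in [0, 1]; higher output = higher fraud risk."""
    w_bio, w_det, w_meta = weights
    risk = (w_bio * (1 - biometric_match)   # weak identity match raises risk
            + w_det * synthetic_prob        # synthetic likelihood raises risk
            + w_meta * metadata_risk)       # suspicious origin raises risk
    # Hard override: a confident synthetic verdict escalates regardless of
    # how well the other two signals score.
    if synthetic_prob >= 0.9:
        risk = max(risk, 0.95)
    return round(risk, 3)

low = fuse_risk(biometric_match=0.95, synthetic_prob=0.05, metadata_risk=0.1)
high = fuse_risk(biometric_match=0.40, synthetic_prob=0.92, metadata_risk=0.7)
```

The override matters: a cloned voice can pass naive biometric matching precisely because it imitates the enrolled speaker, so the detection signal must be able to veto the other two.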
Ship this architecture via DETECT-3B Omni — multimodal detection, Intelligence explainability, PerTh watermarking, on-prem optional. Start free.