How to Detect an AI-Cloned Voice
What to listen for in a suspicious voice call or voicemail — the specific audible tells of AI voice cloning in 2026 and the limits of what your ears can reliably catch.
Audio is the hardest modality to spot by ear. Video gives you a dozen signals at once (lip sync, blink rate, lighting); an image you can zoom into and inspect at leisure. Audio gives you one channel: the waveform itself. And in 2026, commercial voice cloning is good enough that the "does this voice sound wrong" test has a roughly 50/50 hit rate against modern pipelines.
Still, there are things worth listening for before you escalate.
The five things to listen for
1. Breath placement
Real speakers breathe in places that make biomechanical sense — before long phrases, after stressed syllables, mid-thought during a pause. Voice clones often either skip breaths entirely or place them in grammatically odd spots.
Re-listen with breaths specifically in mind. If breaths sound "dropped in" rather than integrated with the phrasing, that's a soft flag.
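If you have a recording rather than a live call, you can mechanize the "where are the pauses" part of this check: scan the audio's short-window energy and list the quiet gaps, so you know exactly where to re-listen for dropped-in breaths. A minimal sketch on a synthetic amplitude envelope, pure-stdlib Python; the function names and thresholds are illustrative, not a production breath detector:

```python
import math

def rms_windows(samples, win=400):
    """RMS energy per non-overlapping window of `win` samples."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + win]) / win)
        for i in range(0, len(samples) - win + 1, win)
    ]

def quiet_spans(energies, threshold=0.05):
    """Window indices whose energy falls below the threshold --
    candidate pauses/breaths worth re-listening to."""
    return [i for i, e in enumerate(energies) if e < threshold]

# Synthetic stand-in for speech: a tone, a half-second gap, more tone.
sr = 8000
speech = [0.5 * math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
gap = [0.0] * (sr // 2)
signal = speech + gap + speech

flags = quiet_spans(rms_windows(signal))
print(flags)  # windows 20-29: the half-second gap
```

On real audio you would load PCM samples from the file and then check whether the flagged gaps fall at sentence boundaries (plausible) or mid-phrase with no audible inhale (a soft flag).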
2. Prosody on emotional content
Voice clones replicate timbre well but often flatten emotional prosody. Listen specifically to exclamations, questions, laughs, moments of surprise. Does the pitch excursion feel natural, or does the voice stay in a narrower range than a real person would?
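The "narrower range" intuition can be quantified if you have a recording: estimate pitch per frame and measure the excursion in semitones. A toy sketch on clean synthetic tones (zero-crossing pitch estimation only works on clean tones; the frame values and the 180 Hz / 300 Hz "speakers" are made up for illustration):

```python
import math

def zero_cross_f0(frame, sr):
    """Crude F0 estimate from zero-crossing count (clean tones only)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings * sr / (2 * len(frame))

def semitone_range(f0s):
    """Pitch excursion between the lowest and highest frame, in semitones."""
    return 12 * math.log2(max(f0s) / min(f0s))

sr = 8000
def tone(freq, n=800):
    return [math.sin(2 * math.pi * freq * t / sr) for t in range(n)]

# "Expressive" speaker: pitch sweeps up on a question.
expressive = [tone(180), tone(300)]
# "Flat" clone: stays in a narrow band.
flat = [tone(200), tone(205)]

exp_range = semitone_range([zero_cross_f0(f, sr) for f in expressive])
flat_range = semitone_range([zero_cross_f0(f, sr) for f in flat])
print(f"expressive: {exp_range:.1f} st, flat: {flat_range:.1f} st")
```

A real measurement would use a proper pitch tracker, but the comparison is the same: genuine surprise or a genuine question typically spans many semitones; a flattened clone stays within a couple.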
3. Sibilance uniformity
Real "s", "sh", and "ch" sounds vary slightly in brightness depending on the surrounding words and the speaker's mouth position. Synthetic sibilance is often consistently bright — a little too clean. Hearing this is an acquired skill, but one worth practicing.
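"Brightness" has a standard numeric proxy: the spectral centroid. Given isolated sibilant frames, you can compare how much the centroid varies across them — wide spread is what real mouths produce, a tight cluster is the "too uniform" flag. A self-contained sketch with narrow tones standing in for sibilant noise (the frame frequencies are invented for illustration; real work would slice actual "s" segments out of the recording):

```python
import math
import statistics

def spectral_centroid(frame, sr):
    """Brightness proxy: magnitude-weighted mean frequency (naive DFT, Hann window)."""
    n = len(frame)
    win = [x * (0.5 - 0.5 * math.cos(2 * math.pi * t / n)) for t, x in enumerate(frame)]
    num = den = 0.0
    for k in range(1, n // 2):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(win))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(win))
        mag = math.hypot(re, im)
        num += mag * k * sr / n
        den += mag
    return num / den

sr, n = 16000, 128

def sibilant(freq):
    """Toy stand-in for one 's' burst: a narrow high-frequency tone."""
    return [math.sin(2 * math.pi * freq * t / sr) for t in range(n)]

real_s = [sibilant(f) for f in (4000, 5200, 4400, 4900)]    # brightness varies
clone_s = [sibilant(f) for f in (5000, 5000, 5125, 5000)]   # suspiciously uniform

real_sd = statistics.pstdev([spectral_centroid(f, sr) for f in real_s])
clone_sd = statistics.pstdev([spectral_centroid(f, sr) for f in clone_s])
print(f"real spread: {real_sd:.0f} Hz, clone spread: {clone_sd:.0f} Hz")
```

The naive O(n²) DFT keeps the sketch dependency-free; on real audio you would use an FFT library and window the actual sibilant segments.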
4. Room tone
A real voice recorded on a phone carries the room behind it: faint HVAC, distance to the microphone, reflections off a window, slight mic noise. TTS output is clean. If the voice is suspiciously free of any ambient acoustics, that's a flag.
On a live call with caller-ID spoofing, the attacker might deliberately add noise to cover this. Listen for whether the noise sounds like a real room (HVAC slowly shifting, distant conversation) or a loop (same background over and over).
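The loop check is easy to mechanize on a recording: a tiled background correlates almost perfectly with itself at a lag equal to the loop length, while genuine room noise does not. A sketch on synthetic noise (a real tool would scan a range of lags rather than know the loop length in advance):

```python
import random

def autocorr(x, lag):
    """Normalized autocorrelation at one lag (1.0 = exact repeat)."""
    n = len(x) - lag
    num = sum(x[i] * x[i + lag] for i in range(n))
    den = (sum(v * v for v in x[:n]) ** 0.5) * (sum(v * v for v in x[lag:]) ** 0.5)
    return num / den

rng = random.Random(0)
chunk = [rng.gauss(0, 1) for _ in range(2000)]

looped = chunk * 5                               # same ambience tiled over and over
fresh = [rng.gauss(0, 1) for _ in range(10000)]  # genuinely evolving room noise

print(round(autocorr(looped, 2000), 2))  # 1.0: the background is a loop
print(round(autocorr(fresh, 2000), 2))   # near 0: no repetition
```

In practice you would first strip or gate out the speech and run this on the residual background, since the voice itself dominates the correlation otherwise.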
5. Response to unexpected input
If you're on a live call and suspicious, throw a curve:
- Ask a question only the real person would know — a shared memory, a project name, the name of their dog
- Say something slightly off-topic and see how naturally they redirect
- Pause mid-sentence and see if the caller fills the silence naturally
A voice clone driven by a live operator with a script will often stumble on any of these. This was the defense that foiled the Ferrari attack — a specific question the real CEO would have answered without hesitation.
When to stop listening and use the detector
If you have a recording (voicemail, downloaded call audio, screen recording), upload it to our free audio deepfake detector. You'll get:
- A real-vs-synthetic verdict with confidence
- Timestamped reasoning: which segments of the audio the model flagged and why
- Generator match: which TTS family (ElevenLabs, PlayHT, Resemble, OpenAI TTS, etc.) the audio most closely resembles
- An explanation you can cite in a newsroom piece or fraud report
For organizations running contact centers or fraud teams, the same model runs via API in the critical call path — see the banking deepfake playbook.
What doesn't work
- "Do you sound like yourself?" as a verification question. A voice clone will agree.
- Relying purely on caller ID. Caller-ID spoofing is trivial and widespread.
- Comparing to a mental baseline if you don't have recent in-person audio of the target. Our mental audio memory is weaker than our visual memory, and decays fast.
Organizational defense
Individual ear-based detection does not scale. If your organization is at risk, these are the defenses that actually work:
- Callback verification policy on a known-good number — catches 99% of vishing attacks regardless of clone quality. See Vishing and voice phishing.
- Shared-context verification (ask about something only the real person would know).
- Real-time audio deepfake detection integrated into call routing — see the banking playbook.