How I Combined pyannote.audio and ECAPA to Build Voice Intelligence
Ok, here’s the deal.
When I stopped improving talktothegame.felo5.com and began rebuilding askthegame.felo5.com, I didn’t just want to transcribe podcast episodes. I wanted to know not only what was said, but who said it, and to remember that voice across time.
That’s where most diarization tools fall short. They segment speech well, but they forget who everyone is once the file ends. I needed memory. So I built it.
Diarization vs. Speaker Recognition (And Why I Needed Both)
There’s a key distinction here that changed how I approached the problem:
- Diarization answers: "Who spoke when?"
- Speaker recognition answers: "Who is each speaker?"
Most diarization pipelines (like pyannote.audio) are amazing at segmenting audio by speaker turns. But they don’t persist speaker identity across files. Every new episode resets the speaker labels.
And that’s not enough when you’re trying to track people like Alex Hormozi across dozens of interviews.
Why I Use pyannote.audio
Let me be super clear — I love pyannote.audio. It’s not the problem.
In fact, I use it as the first step of my pipeline. It tells me:
- When a speaker starts and stops
- Where the boundaries are
- How many unique speakers there are
So yes, I use Pyannote, but I just don’t stop there.
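Roughly, that first step looks something like this. A minimal sketch, assuming pyannote.audio 3.x, a Hugging Face access token, and a local episode.wav file (all placeholder names):

```python
from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline (needs a Hugging Face token)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder
)

# Run diarization on one episode
diarization = pipeline("episode.wav")

# Each turn comes back with a start, an end, and a per-file label like SPEAKER_00
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")
```

Those SPEAKER_00-style labels reset with every file, which is exactly the gap the next step fills.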
Why I Add ECAPA-TDNN from SpeechBrain
This is where ECAPA-TDNN comes in (from speechbrain/spkrec-ecapa-voxceleb). ECAPA doesn’t care about timestamps. It gives me speaker embeddings — unique voiceprints I can reuse across episodes.
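Pulling a voiceprint out of one segment looks roughly like this. A sketch, not gospel: the file path and cache directory are placeholders, and depending on your SpeechBrain version the import may live under speechbrain.inference instead of speechbrain.pretrained:

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load the pretrained ECAPA-TDNN speaker encoder
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="models/spkrec-ecapa-voxceleb",  # local cache dir, placeholder
)

# Load one speaker segment (the model expects 16 kHz mono audio)
signal, sample_rate = torchaudio.load("segment.wav")

# One fixed-size embedding per segment: this is the voiceprint
embedding = encoder.encode_batch(signal).squeeze()
```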
So once I have Pyannote segments, I feed those into ECAPA to answer:
“Which known speaker is this?”
It’s like matching fingerprints, but for voice.
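The matching itself can be as simple as cosine similarity between a fresh embedding and each stored voiceprint. The function name and the 0.75 threshold below are illustrative assumptions, not my exact values:

```python
import torch.nn.functional as F

def match_speaker(embedding, known_voiceprints, threshold=0.75):
    """Return (name, score) of the closest known voiceprint, or (None, score) below threshold."""
    best_name, best_score = None, -1.0
    for name, voiceprint in known_voiceprints.items():
        score = F.cosine_similarity(embedding, voiceprint, dim=0).item()
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score
```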
I matched Alex Hormozi with 89.92% similarity over a 63-minute episode. That’s not segmentation. That’s memory.
The Hybrid Architecture I’m Running Now
Here’s what’s actually happening behind the scenes:
- pyannote.audio: Segments audio into speaker turns (“who spoke when”)
- ECAPA-TDNN (SpeechBrain): Matches each segment to a known voiceprint (“who is it?”)
- Hybrid Mapping: Merges pyannote’s segments with ECAPA’s identities
- Memory Layer: Stores and reuses speaker identities across all future episodes
I didn’t replace Pyannote. I extended it.
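Stitched together, the hybrid mapping looks roughly like this. It reuses the diarization result, encoder, and match_speaker pieces sketched above, and it assumes pyannote.audio 3.x for the Audio helper; treat it as a sketch of the idea, not my exact production code:

```python
from pyannote.audio import Audio

audio = Audio(sample_rate=16000, mono="downmix")
known_voiceprints = {}   # name -> embedding, loaded from the memory layer
labeled_segments = []

for turn, _, local_label in diarization.itertracks(yield_label=True):
    # Crop the waveform for this speaker turn
    waveform, _ = audio.crop("episode.wav", turn)

    # Embed the turn and try to match it to a voice we have seen before
    segment_embedding = encoder.encode_batch(waveform).squeeze()
    name, _ = match_speaker(segment_embedding, known_voiceprints)

    # Fall back to the per-file label when the voice is unknown
    labeled_segments.append((turn.start, turn.end, name or local_label))
```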
💡 Why This Architecture Matters
This setup gives me:
- Timestamps for accurate transcription alignment
- Voiceprints for speaker tracking
- Persistent identity across episodes
- A modular system that I can tune at every level
Instead of one black box, I built three composable engines:
- A segmentation engine (pyannote)
- An identity engine (ECAPA)
- A memory layer (my custom logic)
That’s the real value.
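The memory layer doesn’t need to be fancy, either. Mine is custom logic, but a minimal version could just persist the voiceprints to disk between runs; torch.save and the file name here are only one possible choice:

```python
import os
import torch

VOICEPRINTS_PATH = "voiceprints.pt"  # placeholder location

def load_voiceprints():
    """Load known speaker embeddings, or start with an empty memory."""
    if os.path.exists(VOICEPRINTS_PATH):
        return torch.load(VOICEPRINTS_PATH)
    return {}

def remember_speaker(name, embedding, voiceprints):
    """Add or update a voiceprint and persist it for future episodes."""
    voiceprints[name] = embedding
    torch.save(voiceprints, VOICEPRINTS_PATH)
```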
What I Can Do Now
- Pull every time Alex Hormozi mentions a keyword
- Track recurring guests without manual tagging
- Build speaker-specific summaries
- Extract trends per person, not per file
This isn’t diarization. It’s a voice intelligence layer. And it’s just getting started.
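As one concrete example of the kind of query this unlocks, assuming each transcript segment is stored with its speaker name and text (a hypothetical data shape, not my actual schema):

```python
def quotes_by_speaker(segments, speaker, keyword):
    """Pull every segment where a given speaker mentions a keyword."""
    return [
        seg for seg in segments
        if seg["speaker"] == speaker and keyword.lower() in seg["text"].lower()
    ]

# e.g. quotes_by_speaker(all_segments, "Alex Hormozi", "retention")
```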
☕️ TLDR
I use pyannote.audio to segment speech.
I use ECAPA to identify who’s speaking.
I built a memory layer to track voices across time.
Now I don’t just have transcripts.
I have context.
And that’s everything.
– Benoit Meunier