How I Combined pyannote.audio and ECAPA to Build Voice Intelligence
Ok, here’s the deal.
When I stopped improving talktothegame.felo5.com and began rebuilding askthegame.felo5.com, I didn’t just want to transcribe podcast episodes. I wanted to know not only what was said, but who said it, and to remember that voice across time.
That’s where most diarization tools fall short. They segment speech well, but they forget who everyone is once the file ends. I needed memory. So I built it.
Diarization vs. Speaker Recognition (And Why I Needed Both)
There’s a key distinction here that changed how I approached the problem:
- Diarization answers: "Who spoke when?"
- Speaker recognition answers: "Who is each speaker?"
Most diarization pipelines (like pyannote.audio) are amazing at segmenting audio by speaker turns. But they don’t persist speaker identity across files. Every new episode resets the speaker labels.
And that’s not enough when you’re trying to track people like Alex Hormozi across dozens of interviews.
Why I Use pyannote.audio
Let me be super clear — I love pyannote.audio. It’s not the problem.
In fact, I use it as the first step of my pipeline. It tells me:
- When a speaker starts and stops
- Where the boundaries are
- How many unique speakers there are
So yes, I use Pyannote, but I just don’t stop there.
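Roughly, that first step looks something like this. A minimal sketch, assuming pyannote.audio 3.x, a Hugging Face access token, and a local episode.wav file (all placeholder names):

```python
from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline (needs a Hugging Face token)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder
)

# Run diarization on one episode
diarization = pipeline("episode.wav")

# Each turn comes back with a start, an end, and a per-file label like SPEAKER_00
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")
```

Those SPEAKER_00-style labels reset with every file, which is exactly the gap the next step fills.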
Why I Add ECAPA-TDNN from SpeechBrain
This is where ECAPA-TDNN comes in (from speechbrain/spkrec-ecapa-voxceleb). ECAPA doesn’t care about timestamps. It gives me speaker embeddings — unique voiceprints I can reuse across episodes.
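Pulling a voiceprint out of one segment looks roughly like this. A sketch, not gospel: the file path and cache directory are placeholders, and depending on your SpeechBrain version the import may live under speechbrain.inference instead of speechbrain.pretrained:

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load the pretrained ECAPA-TDNN speaker encoder
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="models/spkrec-ecapa-voxceleb",  # local cache dir, placeholder
)

# Load one speaker segment (the model expects 16 kHz mono audio)
signal, sample_rate = torchaudio.load("segment.wav")

# One fixed-size embedding per segment: this is the voiceprint
embedding = encoder.encode_batch(signal).squeeze()
```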
So once I have Pyannote segments, I feed those into ECAPA to answer:
“Which known speaker is this?”
It’s like matching fingerprints, but for voice.
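The matching itself can be as simple as cosine similarity between a fresh embedding and each stored voiceprint. The function name and the 0.75 threshold below are illustrative assumptions, not my exact values:

```python
import torch.nn.functional as F

def match_speaker(embedding, known_voiceprints, threshold=0.75):
    """Return (name, score) of the closest known voiceprint, or (None, score) below threshold."""
    best_name, best_score = None, -1.0
    for name, voiceprint in known_voiceprints.items():
        score = F.cosine_similarity(embedding, voiceprint, dim=0).item()
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score
```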
I matched Alex Hormozi with 89.92% similarity over a 63-minute episode. That’s not segmentation. That’s memory.
The Hybrid Architecture I’m Running Now
Here’s what’s actually happening behind the scenes:
- pyannote.audio: Segments audio into speaker turns (“who spoke when”)
- ECAPA-TDNN (SpeechBrain): Matches each segment to a known voiceprint (“who is it?”)
- Hybrid Mapping: Merges pyannote’s segments with ECAPA’s identities
- Memory Layer: Stores and reuses speaker identities across all future episodes
I didn’t replace Pyannote. I extended it.
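Stitched together, the hybrid mapping looks roughly like this. It reuses the diarization result, encoder, and match_speaker pieces sketched above, and it assumes pyannote.audio 3.x for the Audio helper; treat it as a sketch of the idea, not my exact production code:

```python
from pyannote.audio import Audio

audio = Audio(sample_rate=16000, mono="downmix")
known_voiceprints = {}   # name -> embedding, loaded from the memory layer
labeled_segments = []

for turn, _, local_label in diarization.itertracks(yield_label=True):
    # Crop the waveform for this speaker turn
    waveform, _ = audio.crop("episode.wav", turn)

    # Embed the turn and try to match it to a voice we have seen before
    segment_embedding = encoder.encode_batch(waveform).squeeze()
    name, _ = match_speaker(segment_embedding, known_voiceprints)

    # Fall back to the per-file label when the voice is unknown
    labeled_segments.append((turn.start, turn.end, name or local_label))
```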
💡 Why This Architecture Matters
This setup gives me:
- Timestamps for accurate transcription alignment
- Voiceprints for speaker tracking
- Persistent identity across episodes
- A modular system that I can tune at every level
Instead of one black box, I built three composable engines:
- A segmentation engine (pyannote)
- An identity engine (ECAPA)
- A memory layer (my custom logic)
That’s the real value.
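The memory layer doesn’t need to be fancy, either. Mine is custom logic, but a minimal version could just persist the voiceprints to disk between runs; torch.save and the file name here are only one possible choice:

```python
import os
import torch

VOICEPRINTS_PATH = "voiceprints.pt"  # placeholder location

def load_voiceprints():
    """Load known speaker embeddings, or start with an empty memory."""
    if os.path.exists(VOICEPRINTS_PATH):
        return torch.load(VOICEPRINTS_PATH)
    return {}

def remember_speaker(name, embedding, voiceprints):
    """Add or update a voiceprint and persist it for future episodes."""
    voiceprints[name] = embedding
    torch.save(voiceprints, VOICEPRINTS_PATH)
```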
What I Can Do Now
- Pull every time Alex Hormozi mentions a keyword
- Track recurring guests without manual tagging
- Build speaker-specific summaries
- Extract trends per person, not per file
This isn’t diarization. It’s a voice intelligence layer. And it’s just getting started.
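As one concrete example of the kind of query this unlocks, assuming each transcript segment is stored with its speaker name and text (a hypothetical data shape, not my actual schema):

```python
def quotes_by_speaker(segments, speaker, keyword):
    """Pull every segment where a given speaker mentions a keyword."""
    return [
        seg for seg in segments
        if seg["speaker"] == speaker and keyword.lower() in seg["text"].lower()
    ]

# e.g. quotes_by_speaker(all_segments, "Alex Hormozi", "retention")
```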
☕️ TLDR
I use pyannote.audio to segment speech.
I use ECAPA to identify who’s speaking.
I built a memory layer to track voices across time.
Now I don’t just have transcripts.
I have context.
And that’s everything.
– Benoit Meunier