Ask The Game, the Build Log

How I Combined pyannote.audio and ECAPA to Build Voice Intelligence

Ok, here’s the deal.

When I stopped improving talktothegame.felo5.com and began rebuilding askthegame.felo5.com, I didn’t just want to transcribe podcast episodes. I wanted to know not only what was said, but who said it, and to remember that voice across time.

That’s where most diarization tools fall short. They segment speech well, but they forget who everyone is once the file ends. I needed memory. So I built it.

Diarization vs. Speaker Recognition (And Why I Needed Both)

There’s a key distinction here that changed how I approached the problem: diarization tells you who spoke when inside a single file; speaker recognition tells you which known person that voice belongs to.

Most diarization pipelines (like pyannote.audio) are amazing at segmenting audio by speaker turns. But they don’t persist speaker identity across files. Every new episode resets the speaker labels.

And that’s not enough when you’re trying to track people like Alex Hormozi across dozens of interviews.

Why I Use pyannote.audio

Let me be super clear — I love pyannote.audio. It’s not the problem.

In fact, I use it as the first step of my pipeline. It tells me who spoke when: each speaker turn, with start and end timestamps.

So yes, I use Pyannote. I just don’t stop there.
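
Here’s a minimal sketch of that first step. The checkpoint name (pyannote/speaker-diarization-3.1) and the token handling are my assumptions for illustration, not necessarily the exact setup behind askthegame.felo5.com; any recent pyannote diarization pipeline behaves the same way.

```python
# Step 1: diarization with pyannote.audio ("who spoke when").
# Checkpoint name and token handling are illustrative assumptions.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # gated model: requires a Hugging Face token
)

diarization = pipeline("episode.wav")

# Each track is a speaker turn: start/end timestamps plus a per-file label
# like SPEAKER_00 -- labels that reset on every new file.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.1f}s - {turn.end:7.1f}s  {speaker}")
```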

Why I Add ECAPA-TDNN from SpeechBrain

This is where ECAPA-TDNN comes in (from speechbrain/spkrec-ecapa-voxceleb). ECAPA doesn’t care about timestamps. It gives me speaker embeddings — unique voiceprints I can reuse across episodes.

So once I have Pyannote segments, I feed those into ECAPA to answer:

“Which known speaker is this?”

It’s like matching fingerprints, but for voice.

I matched Alex Hormozi with 89.92% similarity over a 63-minute episode. That’s not segmentation. That’s memory.
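
Here’s a sketch of that embed-and-compare step, using cosine similarity between ECAPA embeddings. The file paths are placeholders, and the import path targets SpeechBrain releases before 1.0 (newer versions expose the same class under speechbrain.inference.speaker).

```python
# Step 2: speaker embeddings ("voiceprints") with ECAPA-TDNN from SpeechBrain.
# File paths are placeholders.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def voiceprint(path: str) -> torch.Tensor:
    """Return a 192-dim ECAPA embedding for a mono audio file."""
    wav, sr = torchaudio.load(path)
    if sr != 16000:  # the VoxCeleb-trained ECAPA model expects 16 kHz input
        wav = torchaudio.functional.resample(wav, sr, 16000)
    return encoder.encode_batch(wav).squeeze()

known = voiceprint("alex_hormozi_reference.wav")       # stored once, reused forever
candidate = voiceprint("segment_from_new_episode.wav")  # cropped pyannote turn

# Cosine similarity: 1.0 is an identical voiceprint, near 0 is unrelated.
score = torch.nn.functional.cosine_similarity(known, candidate, dim=0).item()
print(f"similarity: {score:.4f}")
```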

The Hybrid Architecture I’m Running Now

Here’s what’s actually happening behind the scenes:

  1. pyannote.audio: Segments audio into speaker turns (“who spoke when”)
  2. ECAPA-TDNN (SpeechBrain): Matches each segment to a known voiceprint (“who is it?”)
  3. Hybrid Mapping: Merges pyannote’s segments with ECAPA’s identities
  4. Memory Layer: Stores and reuses speaker identities across all future episodes
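
Stitched together, steps 3 and 4 look roughly like this. It’s a sketch, not my production code: the 0.75 threshold, the voiceprints.pt store, and the helper names are assumptions for illustration, and `encoder` is the ECAPA model from the sketch above.

```python
# Steps 3 and 4: map each pyannote turn to a known voiceprint, then persist
# identities for future episodes. Threshold, store path, and helper names are
# illustrative assumptions; audio is assumed to be 16 kHz mono, as above.
from pathlib import Path

import torch
import torchaudio

STORE = Path("voiceprints.pt")   # known speaker name -> ECAPA embedding
THRESHOLD = 0.75                 # cosine-similarity cut-off, tune on your own data

def load_store() -> dict:
    return torch.load(STORE) if STORE.exists() else {}

def save_store(store: dict) -> None:
    torch.save(store, STORE)     # the memory layer: survives across episodes

def identify(embedding: torch.Tensor, store: dict):
    """Return (name, score) for the closest known voiceprint, or (None, score)."""
    best_name, best_score = None, -1.0
    for name, ref in store.items():
        score = torch.nn.functional.cosine_similarity(embedding, ref, dim=0).item()
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= THRESHOLD else (None, best_score)

def label_episode(path: str, diarization, store: dict):
    """Hybrid mapping: crop each pyannote turn, embed it with ECAPA, name it."""
    wav, sr = torchaudio.load(path)
    labeled = []
    for turn, _, local_label in diarization.itertracks(yield_label=True):
        segment = wav[:, int(turn.start * sr):int(turn.end * sr)]
        emb = encoder.encode_batch(segment).squeeze()
        name, score = identify(emb, store)
        # fall back to pyannote's per-file label when no known voice matches
        labeled.append((turn.start, turn.end, name or local_label, round(score, 4)))
    return labeled
```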

I didn’t replace Pyannote. I extended it.

💡 Why This Architecture Matters

This setup gives me accurate speaker turns, persistent identities, and a voice memory that compounds with every episode I process.

Instead of one black box, I built three composable engines: pyannote for segmentation, ECAPA for identification, and the voiceprint store for memory. Each one can be improved or swapped out without touching the others.

That’s the real value.

What I Can Do Now

I can follow a voice like Alex Hormozi’s across dozens of interviews, attribute every quote to the right person, and reuse that knowledge on every new episode I ingest. This isn’t diarization. It’s a voice intelligence layer. And it’s just getting started.

☕️ TLDR

I use pyannote.audio to segment speech.
I use ECAPA to identify who’s speaking.
I built a memory layer to track voices across time.

Now I don’t just have transcripts.
I have context.

And that’s everything.

– Benoit Meunier