Ask The Game, the Build Log

Why I Went Local with pyannote for Speaker Diarization (vs API)

I just got this shiny email from pyannote.ai.

No credit card. No strings. It’s a solid offer.

And I said no thanks.

Not because it’s bad. Quite the opposite — pyannote.ai is one of the best diarization platforms out there. But for what I’m building, I need something a little different.

So I’m sticking with the local, open-source version: pyannote.audio.

Hosted API vs Local Setup

Both come from the same DNA. The API is built on top of the same models, just wrapped in a hosted product.

So on paper, the results should be similar.

But there’s a big difference in how you interact with them:

| Feature | pyannote.ai (API) | pyannote.audio (Local) |
| --- | --- | --- |
| Setup | No install, just send audio | Full local setup (Python, torch, models, GPU optional) |
| Diarization | ✅ | ✅ |
| Speaker identification | ✅ (via voiceprints) | ❌ out of the box, but you can integrate ECAPA |
| Custom logic | ❌ Black box | ✅ Full control |
| Embedding access | ❌ | ✅ |
| Offline use | ❌ | ✅ |
| Rate limits | Yes (hourly caps) | None |
| Easy integration | ✅ REST API | Needs glue code |
| Tweakable (confidence, chunking, merging) | ❌ | ✅ |

So for a lot of use cases, the API is actually the better choice.
But not for mine.

What I’m Building

I’m not trying to make a clean transcript with speakers labelled A, B, and C.

I’m trying to build persistent speaker memory, something that can:

- hear a voice in one recording,
- recognize that same voice in a later one,
- and remember it across sessions, not just within one file.

This means I need:

- direct access to raw speaker embeddings,
- full control over matching, merging, and confidence logic,
- and the ability to run everything offline.

The API doesn’t let me do that.

But the local version does, when I combine pyannote.audio with ECAPA-TDNN from SpeechBrain.
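
Here’s roughly what that combination looks like. This is a minimal sketch rather than my actual pipeline: the file name is a placeholder, and it assumes mono audio plus a Hugging Face token for pyannote’s gated diarization model.

```python
import torchaudio
from pyannote.audio import Pipeline
from speechbrain.inference.speaker import EncoderClassifier

AUDIO = "recording.wav"  # placeholder file name

# pyannote.audio: who spoke, when (gated model; needs a Hugging Face token)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN"
)
diarization = pipeline(AUDIO)

# SpeechBrain ECAPA-TDNN: a fixed-size voiceprint per speech segment
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

waveform, sr = torchaudio.load(AUDIO)
waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
if sr != 16000:  # the ECAPA model expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
    sr = 16000

for turn, _, label in diarization.itertracks(yield_label=True):
    # Crop the diarized turn and turn it into an embedding
    segment = waveform[:, int(turn.start * sr) : int(turn.end * sr)]
    embedding = encoder.encode_batch(segment).squeeze()  # ~192-dim vector
    print(f"{label} {turn.start:.1f}-{turn.end:.1f}s -> {tuple(embedding.shape)}")
```

Diarization tells me *that* two segments belong to different speakers; the embeddings are what let me recognize the same speaker again tomorrow.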

My Setup

Here’s what I’m using:

- pyannote.audio for diarization (who spoke, when)
- ECAPA-TDNN from SpeechBrain for speaker embeddings
- cosine similarity to match new voices against stored voiceprints
- custom glue code for confidence thresholds, chunking, and merging

It’s definitely messier. I’m debugging confidence scores, comparing cosine similarities, writing validators to clean up merges — but it works.
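
The cosine-similarity comparison I keep debugging boils down to something like this sketch. The registry contents and the 0.6 threshold are stand-ins I’d tune against real data, not fixed values.

```python
import torch
import torch.nn.functional as F

def match_speaker(embedding, registry, threshold=0.6):
    """Return the closest known speaker above the threshold, else None."""
    best_name, best_score = None, threshold
    for name, voiceprint in registry.items():
        score = F.cosine_similarity(embedding, voiceprint, dim=0).item()
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Usage with the ECAPA embeddings from the sketch above
registry = {"benoit": torch.randn(192)}  # stand-in voiceprint
name, score = match_speaker(torch.randn(192), registry)
print(name or "new speaker", round(score, 2))
```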

And more importantly, I understand most of it. Most. I'm still vibe-coding this.

Why Local Works Better (For Me)

Going local means I can:

- pull raw embeddings straight out of the pipeline,
- tune confidence thresholds, chunking, and merging to my data,
- run everything offline, with no rate limits,
- and build my own speaker-memory layer on top.

It’s slower to get started, sure. But once it’s running, it’s mine. I can tweak it, I can explain it ... and I can build on it.
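
Concretely, “build on it” looks like persistence. Here’s a sketch of a cross-run voiceprint store; the file name and the running-mean update are my own hypothetical choices, not anything pyannote or SpeechBrain ships.

```python
import os
import torch

REGISTRY_PATH = "voiceprints.pt"  # hypothetical on-disk store, shared across runs

def load_registry():
    # name -> (running-mean embedding, number of segments folded in)
    return torch.load(REGISTRY_PATH) if os.path.exists(REGISTRY_PATH) else {}

def remember(registry, name, embedding):
    """Fold a new segment into a speaker's running-mean voiceprint, then persist."""
    if name in registry:
        mean, n = registry[name]
        registry[name] = ((mean * n + embedding) / (n + 1), n + 1)
    else:
        registry[name] = (embedding, 1)
    torch.save(registry, REGISTRY_PATH)
```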

When the API Might Be Better

Honestly, if you’re:

- just labelling speakers A, B, and C in a transcript,
- shipping fast and don’t want to manage Python, torch, and models,
- fine with a hosted black box,

...then use the API. It’s easier.

And if you don’t need to know who the speaker is, just that there are different speakers, well, it’ll save you a ton of time.

They’ve made it ridiculously accessible. You don’t even need a credit card to try it.

So: thank you, but no thank you.

The email I got was intriguing. It made me pause. But then I looked at what I’m building ... and I realized that I’m not looking for finished results. I’m building the “brain”.

That brain needs to hear voices, recognize them, and remember them across time, not just one file at a time.

So I’ll keep running pyannote locally. For now.

– Benoit Meunier