Why I Went Local with pyannote for Speaker Diarization (vs API)
I just got this shiny email from pyannote.ai.
No credit card. No strings. It’s a solid offer.
And I said no thanks.
Not because it’s bad. Quite the opposite — pyannote.ai is one of the best diarization platforms out there. But for what I’m building, I need something a little different.
So I’m sticking with the local, open-source version: pyannote.audio.
Hosted API vs Local Setup
Both come from the same DNA. The API is built on top of the same models, just wrapped in a hosted product.
So on paper, the results should be similar.
But there’s a big difference in how you interact with them:
| Feature | pyannote.ai (API) | pyannote.audio (Local) |
|---|---|---|
| Setup | No install, just send audio | Full local setup (Python, torch, models, GPU optional) |
| Diarization | ✅ | ✅ |
| Speaker identification | ✅ (via voiceprints) | ❌ out of the box, but you can integrate with ECAPA |
| Custom logic | ❌ Black box | ✅ Full control |
| Embedding access | ❌ | ✅ |
| Offline use | ❌ | ✅ |
| Rate limits | Yes (hourly caps) | No |
| Easy integration | ✅ REST API | Needs glue code |
| Tweakable confidence, chunking, merging | ❌ | ✅ |
So for a lot of use cases, the API is actually the better choice.
But not for mine.
What I’m Building
I’m not trying to make a clean transcript with speakers labelled A, B, and C.
I’m trying to build persistent speaker memory, something that can:
- Listen to multiple podcast episodes
- Figure out who is speaking
- Remember that voice later, even in a different episode
- Be confident enough to say “That’s Alex Hormozi” based on sound, not text
This means I need:
- Speaker embeddings
- Custom filtering and merging
- Control over chunk sizes
- The ability to fine-tune what counts as a “speaker change”
The API doesn’t let me do that.
But the local version does, when I combine pyannote.audio with ECAPA-TDNN from SpeechBrain.
My Setup
Here’s what I’m using:
- pyannote.audio to split audio into segments by speaker
- SpeechBrain ECAPA-TDNN to compare those segments to known voices
- A bunch of glue code to filter, merge, and track speaker identities over time
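In code, the shape of it is roughly this. A minimal sketch, assuming pyannote.audio 3.x and SpeechBrain’s pretrained ECAPA-TDNN model; the file names, token placeholder, and variable names are illustrative, not my exact code:

```python
# Minimal sketch: diarize an episode, then compute one ECAPA voiceprint per segment.
# Assumes pyannote.audio 3.x and SpeechBrain's pretrained ECAPA-TDNN model.
import torchaudio
from pyannote.audio import Pipeline
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference.speaker in newer releases

# 1. Diarization: who spoke when, with anonymous labels like SPEAKER_00
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)
diarization = pipeline("episode.wav")

# 2. Embeddings: one ECAPA voiceprint per diarized segment
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
waveform, sample_rate = torchaudio.load("episode.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # ECAPA expects mono...
if sample_rate != 16000:                       # ...at 16 kHz
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    sample_rate = 16000

segment_embeddings = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    start, end = int(turn.start * sample_rate), int(turn.end * sample_rate)
    chunk = waveform[:, start:end]
    embedding = encoder.encode_batch(chunk).squeeze()  # ~192-dim voiceprint
    segment_embeddings.append((turn, speaker, embedding))
```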
It’s definitely messier. I’m debugging confidence scores, comparing cosine similarities, writing validators to clean up merges — but it works.
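Most of that glue looks like this kind of thing: drop segments too short to trust, then compare each embedding to known voiceprints with cosine similarity. The threshold and minimum duration below are placeholders I keep tuning, not recommendations, and `known_voiceprints` is an assumed registry, not a real API:

```python
# Illustrative glue, not my production code: skip junk segments, then match each
# embedding against known voiceprints by cosine similarity.
# segment_embeddings comes from the sketch above; known_voiceprints is an assumed
# dict of name -> reference embedding (see the enrollment sketch further down).
import torch
import torch.nn.functional as F

known_voiceprints: dict[str, torch.Tensor] = {}
MIN_DURATION = 1.0  # seconds; very short segments give noisy embeddings
THRESHOLD = 0.6     # similarity needed before I trust an identification

def identify(embedding, voiceprints):
    """Return (name, score) for the best-matching known voice, or (None, score)."""
    best_name, best_score = None, -1.0
    for name, reference in voiceprints.items():
        score = F.cosine_similarity(embedding, reference, dim=0).item()
        if score > best_score:
            best_name, best_score = name, score
    return (best_name if best_score >= THRESHOLD else None, best_score)

labelled = []
for turn, speaker, embedding in segment_embeddings:
    if turn.end - turn.start < MIN_DURATION:
        continue  # validator: too short to identify reliably
    name, score = identify(embedding, known_voiceprints)
    labelled.append((turn, name or speaker, score))  # fall back to the anonymous label
```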
And more importantly, I understand most of it. Most. I'm still vibe-coding this.
Why Local Works Better (For Me)
Going local means I can:
- Tune thresholds based on my data
- Add new voiceprints at runtime
- Filter out junk segments that mess up identification
- Improve the system as I go
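Adding a voiceprint at runtime, for example, is just growing an in-memory registry. A sketch under the same assumptions as above; averaging a few clean segments into one normalized reference embedding is one reasonable approach, not the only one:

```python
# Sketch of runtime enrollment: the registry is an in-memory dict, and a reference
# voiceprint is the normalized mean of a few trusted segments (an assumption, not
# the only strategy).
import torch
import torch.nn.functional as F

def enroll(voiceprints, name, embeddings):
    """Add or refine a known speaker from one or more trusted segments."""
    reference = torch.stack(list(embeddings)).mean(dim=0)
    voiceprints[name] = F.normalize(reference, dim=0)

# e.g. promote an anonymous diarization label to a real identity:
# enroll(known_voiceprints, "Alex Hormozi",
#        [emb for _, spk, emb in segment_embeddings if spk == "SPEAKER_00"])
```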
It’s slower to get started, sure. But once it’s running, it’s mine. I can tweak it, I can explain it ... and I can build on it.
When the API Might Be Better
Honestly, if you're:
- Transcribing interviews
- Adding speaker labels to meetings
- Building a polished product with clean diarization
...then use the API. It’s easier.
And if you don’t need to know who the speaker is, just that there are different speakers, it’ll save you a ton of time.
They’ve made it ridiculously accessible. You don’t even need a credit card to try it.
So thank you, but no thank you.
The email I got was intriguing. It made me pause. But then I looked at what I’m building ... and I realized that I’m not looking for finished results. I’m building the "brain".
That brain needs to hear voices, recognize them, and remember them across time, not just one file at a time.
So I’ll keep running pyannote locally. For now.
– Benoit Meunier