Why I Went Local with pyannote for Speaker Diarization (vs API)
I just got this shiny email from pyannote.ai.
No credit card. No strings. It’s a solid offer.
And I said no thanks.
Not because it’s bad. Quite the opposite — pyannote.ai is one of the best diarization platforms out there. But for what I’m building, I need something a little different.
So I’m sticking with the local, open-source version: pyannote.audio.
Hosted API vs Local Setup
Both come from the same DNA. The API is built on top of the same models, just wrapped in a hosted product.
So on paper, the results should be similar.
But there’s a big difference in how you interact with them:
| Feature | pyannote.ai (API) | pyannote.audio (Local) |
|---|---|---|
| Setup | No install, just send audio | Full local setup (Python, torch, models, GPU optional) |
| Diarization | ✅ | ✅ |
| Speaker identification | ✅ (via voiceprints) | ❌ out of the box, but you can integrate with ECAPA |
| Custom logic | ❌ Black box | ✅ Full control |
| Embedding access | ❌ | ✅ |
| Offline use | ❌ | ✅ |
| Rate limits | Yes (hourly caps) | No |
| Easy integration | ✅ REST API | Needs glue code |
| Tweakable confidence, chunking, merging | ❌ | ✅ |
So for a lot of use cases, the API is actually the better choice.
But not for mine.
What I’m Building
I’m not trying to make a clean transcript with speakers labelled A, B, and C.
I’m trying to build persistent speaker memory, something that can:
- Listen to multiple podcast episodes
- Figure out who is speaking
- Remember that voice later, even in a different episode
- Be confident enough to say “That’s Alex Hormozi” based on sound, not text
This means I need:
- Speaker embeddings
- Custom filtering and merging
- Control over chunk sizes
- The ability to fine-tune what counts as a “speaker change”
The API doesn’t let me do that.
But the local version does, when I combine pyannote.audio with ECAPA-TDNN from SpeechBrain.
My Setup
Here’s what I’m using:
- pyannote.audio to split audio into segments by speaker
- SpeechBrain ECAPA-TDNN to compare those segments to known voices
- A bunch of glue code to filter, merge, and track speaker identities over time
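In code, the shape of it is roughly this. A minimal sketch, assuming pyannote.audio 3.x and SpeechBrain’s pretrained ECAPA-TDNN model; the file names, token placeholder, and variable names are illustrative, not my exact code:

```python
# Minimal sketch: diarize an episode, then compute one ECAPA voiceprint per segment.
# Assumes pyannote.audio 3.x and SpeechBrain's pretrained ECAPA-TDNN model.
import torchaudio
from pyannote.audio import Pipeline
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference.speaker in newer releases

# 1. Diarization: who spoke when, with anonymous labels like SPEAKER_00
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)
diarization = pipeline("episode.wav")

# 2. Embeddings: one ECAPA voiceprint per diarized segment
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
waveform, sample_rate = torchaudio.load("episode.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # ECAPA expects mono...
if sample_rate != 16000:                       # ...at 16 kHz
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    sample_rate = 16000

segment_embeddings = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    start, end = int(turn.start * sample_rate), int(turn.end * sample_rate)
    chunk = waveform[:, start:end]
    embedding = encoder.encode_batch(chunk).squeeze()  # ~192-dim voiceprint
    segment_embeddings.append((turn, speaker, embedding))
```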
It’s definitely messier. I’m debugging confidence scores, comparing cosine similarities, writing validators to clean up merges — but it works.
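Most of that glue looks like this kind of thing: drop segments too short to trust, then compare each embedding to known voiceprints with cosine similarity. The threshold and minimum duration below are placeholders I keep tuning, not recommendations, and `known_voiceprints` is an assumed registry, not a real API:

```python
# Illustrative glue, not my production code: skip junk segments, then match each
# embedding against known voiceprints by cosine similarity.
# segment_embeddings comes from the sketch above; known_voiceprints is an assumed
# dict of name -> reference embedding (see the enrollment sketch further down).
import torch
import torch.nn.functional as F

known_voiceprints: dict[str, torch.Tensor] = {}
MIN_DURATION = 1.0  # seconds; very short segments give noisy embeddings
THRESHOLD = 0.6     # similarity needed before I trust an identification

def identify(embedding, voiceprints):
    """Return (name, score) for the best-matching known voice, or (None, score)."""
    best_name, best_score = None, -1.0
    for name, reference in voiceprints.items():
        score = F.cosine_similarity(embedding, reference, dim=0).item()
        if score > best_score:
            best_name, best_score = name, score
    return (best_name if best_score >= THRESHOLD else None, best_score)

labelled = []
for turn, speaker, embedding in segment_embeddings:
    if turn.end - turn.start < MIN_DURATION:
        continue  # validator: too short to identify reliably
    name, score = identify(embedding, known_voiceprints)
    labelled.append((turn, name or speaker, score))  # fall back to the anonymous label
```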
And more importantly, I understand most of it. Most. I'm still vibe-coding this.
Why Local Works Better (For Me)
Going local means I can:
- Tune thresholds based on my data
- Add new voiceprints at runtime
- Filter out junk segments that mess up identification
- Improve the system as I go
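Adding a voiceprint at runtime, for example, is just growing an in-memory registry. A sketch under the same assumptions as above; averaging a few clean segments into one normalized reference embedding is one reasonable approach, not the only one:

```python
# Sketch of runtime enrollment: the registry is an in-memory dict, and a reference
# voiceprint is the normalized mean of a few trusted segments (an assumption, not
# the only strategy).
import torch
import torch.nn.functional as F

def enroll(voiceprints, name, embeddings):
    """Add or refine a known speaker from one or more trusted segments."""
    reference = torch.stack(list(embeddings)).mean(dim=0)
    voiceprints[name] = F.normalize(reference, dim=0)

# e.g. promote an anonymous diarization label to a real identity:
# enroll(known_voiceprints, "Alex Hormozi",
#        [emb for _, spk, emb in segment_embeddings if spk == "SPEAKER_00"])
```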
It’s slower to get started, sure. But once it’s running, it’s mine. I can tweak it, I can explain it ... and I can build on it.
When the API Might Be Better
Honestly, if you're:
- Transcribing interviews
- Adding speaker labels to meetings
- Building a polished product with clean diarization
...then use the API. It’s easier.
And if you don’t need to know who the speaker is, just that there are different speakers, it’ll save you a ton of time.
They’ve made it ridiculously accessible. You don’t even need a credit card to try it.
So thank you, but no thank you.
The email I got was intriguing. It made me pause. But then I looked at what I’m building ... and I realized that I’m not looking for finished results. I’m building the "brain".
That brain needs to hear voices, recognize them, and remember them across time, not just one file at a time.
So I’ll keep running pyannote locally. For now.
– Benoit Meunier