Why I'm Still Not Using Deepgram for Speaker ID (Yet)
I've been in touch with the Deepgram team lately through the Deepgram Startup Program.
They're exploring speaker identification features that track real voices, not just "Speaker A/B" diarization. They've been asking what level of accuracy I need, where pyannote performs well and where it falls short, and which features would make speaker ID genuinely useful for me.
So here's the honest explanation of why I haven't used Deepgram for speaker ID yet, what I've actually built, and what I'd love to see from their stack someday.
What I Built
This isn't just about clean diarization. I built something more experimental: a system that remembers speakers across episodes ... something closer to "speaker memory" than transcript labeling.
The idea is simple but weird:
- A podcast has returning voices
- I want the system to know when someone comes back
- I want it to recognize that voice and say, "Yeah, that's them again"
Not from metadata. From the sound of their voice.
Current status: This is now working! The system correctly identifies Alex Hormozi with 99%+ accuracy and recognizes returning guests across episodes.
The Alex Hormozi Voiceprint Creation Process
To make this work, I had to solve the "cold start" problem: creating a baseline voiceprint for Alex Hormozi.
Audio Sample Collection
I manually extracted 3 high-quality audio samples from different episodes:
- alex_sample-001.mp3: 102.31 seconds (1.97 MB)
- alex_sample-002.mp3: 69.95 seconds (1.64 MB)
- alex_sample-003.mp3: 64.72 seconds (1.61 MB)
Total: ~4 minutes of clean Alex Hormozi speech across different contexts and episodes.
Technical Process
The voiceprint creation uses SpeechBrain's ECAPA-TDNN model (speechbrain/spkrec-ecapa-voxceleb), sketched in code after this list:
- Audio Processing: Each sample is loaded and converted from stereo to mono
- Embedding Generation: The model creates a 192-dimensional vector for each sample
- Voiceprint Creation: All embeddings are averaged into a single representative voiceprint
- Storage: The final voiceprint is saved as alex_hormozi_voiceprint.pt (768 bytes)
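For the curious, here's a minimal sketch of that process. It's not my exact script: the import path assumes SpeechBrain 1.x, and the file paths are the samples listed above.

```python
import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier  # SpeechBrain 1.x import path

# Pretrained ECAPA-TDNN speaker encoder (speechbrain/spkrec-ecapa-voxceleb)
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

samples = ["alex_sample-001.mp3", "alex_sample-002.mp3", "alex_sample-003.mp3"]

embeddings = []
for path in samples:
    waveform, sample_rate = torchaudio.load(path)
    # Stereo -> mono by averaging channels
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    # The VoxCeleb-trained model expects 16 kHz audio
    if sample_rate != 16000:
        waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
    # encode_batch returns a (batch, 1, 192) tensor; squeeze it down to (192,)
    embeddings.append(encoder.encode_batch(waveform).squeeze())

# Average the per-sample embeddings into one representative voiceprint
voiceprint = torch.stack(embeddings).mean(dim=0)  # 192-dim float32 tensor
torch.save(voiceprint, "alex_hormozi_voiceprint.pt")
```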
The Result
The final voiceprint is a 192-dimensional float32 tensor that mathematically represents Alex's voice characteristics. When a new speaker embedding comes in, the system computes its cosine similarity against this reference voiceprint; a score of 0.85 or higher counts as a match.
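The matching step itself is small. A sketch, using the 0.85 threshold from above (the function name is mine):

```python
import torch
import torch.nn.functional as F

THRESHOLD = 0.85  # cosine similarity cutoff

def matches_alex(new_embedding: torch.Tensor) -> bool:
    """Compare a fresh 192-dim embedding against the stored reference voiceprint."""
    voiceprint = torch.load("alex_hormozi_voiceprint.pt")
    score = F.cosine_similarity(new_embedding.flatten(), voiceprint.flatten(), dim=0)
    return score.item() >= THRESHOLD
```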
Why Deepgram Isn't a Fit (Right Now)
Here's where I get honest about the limitations of using Deepgram alone.
What Deepgram does well: I DO use Deepgram for transcription and basic speaker diarization. It's excellent at identifying speaker turns ("someone new is talking").
What I need beyond that:
1. Speaker embeddings
I don't want just "Speaker A/B." I want vector representations of voices so I can compare them, store them, match them later, or say "these are similar."
2. Persistent speaker memory
In my system, I now have 25 speaker embeddings in the database: 5 Alex Hormozi voice segments and 20 different unknown speakers across multiple episodes. The system remembers voices across episodes.
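Concretely, "remembering" a voice just means writing its embedding somewhere durable. Here's a sketch of that write using supabase-py; the table and column names are hypothetical, not my actual schema:

```python
from supabase import create_client

# Hypothetical project URL and key, for illustration only
supabase = create_client("https://<project>.supabase.co", "<service-role-key>")

def remember_speaker(speaker_label: str, episode_id: str, embedding) -> None:
    """Persist one 192-dim embedding so the voice can be matched in later episodes."""
    supabase.table("speaker_embeddings").insert({
        "speaker_label": speaker_label,   # e.g. "alex_hormozi" or "unknown_014"
        "episode_id": episode_id,
        "embedding": embedding.tolist(),  # stored in a pgvector(192) column
    }).execute()
```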
3. Flexible registration workflow
Most of the time for The Game Podcast, I have a known host (Alex Hormozi) and an unknown guest. I don't want to pre-register every guest manually. But I also don't want a black box that just auto-IDs everyone without traceability. I want flexible, reviewable registration, sketched in code after this list:
- Match what you can
- Register what's new
- Let me audit and refine later
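Here's roughly the decision logic I mean, sketched against an in-memory dict of known voiceprints (in the real pipeline that lookup hits the database, and the labels are my own convention):

```python
import torch
import torch.nn.functional as F

THRESHOLD = 0.85

def match_or_register(embedding: torch.Tensor, known: dict[str, torch.Tensor]) -> str:
    """Match against known voiceprints; otherwise register a new, reviewable speaker."""
    best_label, best_score = None, -1.0
    for label, voiceprint in known.items():
        score = F.cosine_similarity(embedding.flatten(), voiceprint.flatten(), dim=0).item()
        if score > best_score:
            best_label, best_score = label, score

    if best_label is not None and best_score >= THRESHOLD:
        return best_label                          # match what you can
    new_label = f"unknown_{len(known) + 1:03d}"
    known[new_label] = embedding                   # register what's new
    # Leave a trail so the auto-ID can be audited and refined later
    print(f"registered {new_label} (closest: {best_label} at {best_score:.2f})")
    return new_label
```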
4. Hybrid API/local processing
My current system isn't fully offline the way I initially envisioned. It's a hybrid approach:
APIs I use:
- Deepgram for transcription and basic diarization
- OpenAI for text embeddings
- Supabase for database storage
Local processing:
- ECAPA-TDNN for voice fingerprinting
- Speaker matching and similarity calculation
- Voice memory management
What I'm Doing Instead
Right now my architecture combines:
- Deepgram for transcription and speaker turns (not just pyannote.audio as I initially planned)
- ECAPA-TDNN from SpeechBrain for voiceprint matching
- Supabase + pgvector for persistent speaker memory
- Custom logic for filtering, merging, and scoring matches
It's more complex than I initially wanted, but it gives me:
- Access to voice embeddings
- Full control over thresholds and scoring
- The ability to grow a persistent speaker memory across episodes
- 87.9% speaker identification success rate across the podcast archive
Features I Still Want to Build
1. Speaker merging
Sometimes I realize two voiceprints belong to the same person. I need to be able to say "these are actually the same speaker" and merge their history. The database structure is there, but the logic isn't implemented yet.
2. Automatic voiceprint refinement
I want to update embeddings over time as I process more episodes. Currently, I can add new embeddings, but there's no system in place to refine existing voiceprints based on new data automatically.
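One simple way to do it would be a running mean, so a confirmed new sample nudges the voiceprint without re-processing the original clips. A sketch; the weighting scheme is an assumption, not something I've validated:

```python
import torch

def refine_voiceprint(voiceprint: torch.Tensor, new_embedding: torch.Tensor,
                      n_samples: int) -> torch.Tensor:
    """Fold a newly confirmed embedding into an existing voiceprint as a running mean."""
    # Equivalent to re-averaging all n_samples + 1 embeddings without storing them all
    return (voiceprint * n_samples + new_embedding) / (n_samples + 1)
```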
3. Better unknown speaker clustering
I have 20 "unknown" speakers in my database. Some of these are probably the same person across different episodes. I want better clustering to automatically identify recurring guests.
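A hedged sketch of what I have in mind, using scikit-learn's agglomerative clustering over the unknown embeddings (the distance threshold is a guess I'd have to tune, and the `metric` parameter assumes scikit-learn 1.2+):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_unknowns(embeddings: np.ndarray) -> np.ndarray:
    """Group unknown-speaker embeddings so recurring guests collapse into one cluster.

    embeddings: shape (n_unknown, 192), one ECAPA embedding per unknown speaker.
    Returns one cluster label per row; same label = probably the same person.
    """
    clustering = AgglomerativeClustering(
        n_clusters=None,            # let the distance threshold decide how many clusters
        metric="cosine",            # compare voices the same way matching does
        linkage="average",
        distance_threshold=0.15,    # roughly (1 - 0.85); needs tuning on real data
    )
    return clustering.fit_predict(embeddings)
```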
But ... This Shouldn't Be DIY
I'm happy to build this stuff. I'm learning as I'm vibe-coding this, but I shouldn't have to.
If Deepgram exposed even a few of these controls, like speaker embedding access, voiceprint registration, and merge logic, I'd move parts of my pipeline over tomorrow.
They've already nailed the infrastructure. What I want is a little more flexibility on top.
What I could do next if ...
If Deepgram integrates speaker memory into its stack, even as an advanced feature, it'll unlock a vast set of use cases beyond meetings and transcripts.
Stuff like:
- Podcast archives with persistent speaker tracking
- Audio-first knowledge graphs
- Long-term speaker analytics
- Episodic content intelligence
And yeah, weird builders like me would stop duct-taping ECAPA pipelines in the dark.
To the team at Deepgram: thanks for the convo.
This is the kind of product feedback loop most devs dream about.
Can't wait to see what you ship next.
- Benoit Meunier