Why I'm Still Not Using Deepgram for Speaker ID (Yet)
I've been in touch with the Deepgram team lately through the Deepgram Startup Program.
They're exploring speaker identification features that track real voices, not just "Speaker A/B" diarization. They've been asking what level of accuracy I need, where pyannote performs well and where it falls short, and which features would make speaker ID genuinely useful for me.
So here's the honest explanation of why I haven't used Deepgram for speaker ID yet, what I've actually built, and what I'd love to see from their stack someday.
What I Built
This isn't just about clean diarization. I built something more experimental: a system that remembers speakers across episodes ... something closer to "speaker memory" than transcript labeling.
The idea is simple but weird:
- A podcast has returning voices
- I want the system to know when someone comes back
- I want it to recognize that voice and say, "Yeah, that's them again"
Not from metadata. From the sound of their voice.
Current status: This is now working! The system correctly identifies Alex Hormozi with 99%+ accuracy and recognizes returning guests across episodes.
The Alex Hormozi Voiceprint Creation Process
To make this work, I had to solve the "cold start" problem: creating a baseline voiceprint for Alex Hormozi.
Audio Sample Collection
I manually extracted 3 high-quality audio samples from different episodes:
- alex_sample-001.mp3: 102.31 seconds (1.97 MB)
- alex_sample-002.mp3: 69.95 seconds (1.64 MB)
- alex_sample-003.mp3: 64.72 seconds (1.61 MB)
Total: ~4 minutes of clean Alex Hormozi speech across different contexts and episodes.
Technical Process
The voiceprint creation uses SpeechBrain's ECAPA-TDNN model (speechbrain/spkrec-ecapa-voxceleb), sketched in code after this list:
- Audio Processing: Each sample is loaded and converted from stereo to mono
- Embedding Generation: The model creates a 192-dimensional vector for each sample
- Voiceprint Creation: All embeddings are averaged into a single representative voiceprint
- Storage: The final voiceprint is saved as alex_hormozi_voiceprint.pt (768 bytes)
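For the curious, here's a minimal sketch of that process. It's not my exact script: the import path assumes SpeechBrain 1.x, and the file paths are the samples listed above.

```python
import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier  # SpeechBrain 1.x import path

# Pretrained ECAPA-TDNN speaker encoder (speechbrain/spkrec-ecapa-voxceleb)
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

samples = ["alex_sample-001.mp3", "alex_sample-002.mp3", "alex_sample-003.mp3"]

embeddings = []
for path in samples:
    waveform, sample_rate = torchaudio.load(path)
    # Stereo -> mono by averaging channels
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    # The VoxCeleb-trained model expects 16 kHz audio
    if sample_rate != 16000:
        waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
    # encode_batch returns a (batch, 1, 192) tensor; squeeze it down to (192,)
    embeddings.append(encoder.encode_batch(waveform).squeeze())

# Average the per-sample embeddings into one representative voiceprint
voiceprint = torch.stack(embeddings).mean(dim=0)  # 192-dim float32 tensor
torch.save(voiceprint, "alex_hormozi_voiceprint.pt")
```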
The Result
The final voiceprint is a 192-dimensional float32 tensor that mathematically represents Alex's voice characteristics. When a new speaker embedding comes in, the system computes its cosine similarity against this reference voiceprint; a score of 0.85 or higher counts as a match.
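The matching step itself is small. A sketch, using the 0.85 threshold from above (the function name is mine):

```python
import torch
import torch.nn.functional as F

THRESHOLD = 0.85  # cosine similarity cutoff

def matches_alex(new_embedding: torch.Tensor) -> bool:
    """Compare a fresh 192-dim embedding against the stored reference voiceprint."""
    voiceprint = torch.load("alex_hormozi_voiceprint.pt")
    score = F.cosine_similarity(new_embedding.flatten(), voiceprint.flatten(), dim=0)
    return score.item() >= THRESHOLD
```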
Why Deepgram Isn't a Fit (Right Now)
Here's where I get honest about the limitations of using Deepgram alone.
What Deepgram does well: I DO use Deepgram for transcription and basic speaker diarization. It's excellent at identifying speaker turns ("someone new is talking").
What I need beyond that:
1. Speaker embeddings
I don't want just "Speaker A/B." I want vector representations of voices so I can compare them, store them, match them later, or say "these are similar."
2. Persistent speaker memory
In my system, I now have 25 speaker embeddings in the database: 5 Alex Hormozi voice segments and 20 different unknown speakers across multiple episodes. The system remembers voices across episodes.
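Concretely, "remembering" a voice just means writing its embedding somewhere durable. Here's a sketch of that write using supabase-py; the table and column names are hypothetical, not my actual schema:

```python
from supabase import create_client

# Hypothetical project URL and key, for illustration only
supabase = create_client("https://<project>.supabase.co", "<service-role-key>")

def remember_speaker(speaker_label: str, episode_id: str, embedding) -> None:
    """Persist one 192-dim embedding so the voice can be matched in later episodes."""
    supabase.table("speaker_embeddings").insert({
        "speaker_label": speaker_label,   # e.g. "alex_hormozi" or "unknown_014"
        "episode_id": episode_id,
        "embedding": embedding.tolist(),  # stored in a pgvector(192) column
    }).execute()
```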
3. Flexible registration workflow
Most of the time for The Game Podcast, I have a known host (Alex Hormozi) and an unknown guest. I don't want to pre-register every guest manually. But I also don't want a black box that just auto-IDs everyone without traceability. I want flexible, reviewable registration, sketched in code after this list:
- Match what you can
- Register what's new
- Let me audit and refine later
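Here's roughly the decision logic I mean, sketched against an in-memory dict of known voiceprints (in the real pipeline that lookup hits the database, and the labels are my own convention):

```python
import torch
import torch.nn.functional as F

THRESHOLD = 0.85

def match_or_register(embedding: torch.Tensor, known: dict[str, torch.Tensor]) -> str:
    """Match against known voiceprints; otherwise register a new, reviewable speaker."""
    best_label, best_score = None, -1.0
    for label, voiceprint in known.items():
        score = F.cosine_similarity(embedding.flatten(), voiceprint.flatten(), dim=0).item()
        if score > best_score:
            best_label, best_score = label, score

    if best_label is not None and best_score >= THRESHOLD:
        return best_label                          # match what you can
    new_label = f"unknown_{len(known) + 1:03d}"
    known[new_label] = embedding                   # register what's new
    # Leave a trail so the auto-ID can be audited and refined later
    print(f"registered {new_label} (closest: {best_label} at {best_score:.2f})")
    return new_label
```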
4. Hybrid API/local processing
My current system isn't fully offline the way I initially envisioned. It's a hybrid approach:
APIs I use:
- Deepgram for transcription and basic diarization
- OpenAI for text embeddings
- Supabase for database storage
Local processing:
- ECAPA-TDNN for voice fingerprinting
- Speaker matching and similarity calculation
- Voice memory management
What I'm Doing Instead
Right now my architecture combines:
- Deepgram for transcription and speaker turns (not just pyannote.audio as I initially planned)
- ECAPA-TDNN from SpeechBrain for voiceprint matching
- Supabase + pgvector for persistent speaker memory
- Custom logic for filtering, merging, and scoring matches
It's more complex than I initially wanted, but it gives me:
- Access to voice embeddings
- Full control over thresholds and scoring
- The ability to grow a persistent speaker memory across episodes
- 87.9% speaker identification success rate across the podcast archive
Features I Still Want to Build
1. Speaker merging
Sometimes I realize two voiceprints belong to the same person. I need to be able to say "these are actually the same speaker" and merge their history. The database structure is there, but the logic isn't implemented yet.
2. Automatic voiceprint refinement
I want to update embeddings over time as I process more episodes. Currently, I can add new embeddings, but there's no system in place to refine existing voiceprints based on new data automatically.
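One simple way to do it would be a running mean, so a confirmed new sample nudges the voiceprint without re-processing the original clips. A sketch; the weighting scheme is an assumption, not something I've validated:

```python
import torch

def refine_voiceprint(voiceprint: torch.Tensor, new_embedding: torch.Tensor,
                      n_samples: int) -> torch.Tensor:
    """Fold a newly confirmed embedding into an existing voiceprint as a running mean."""
    # Equivalent to re-averaging all n_samples + 1 embeddings without storing them all
    return (voiceprint * n_samples + new_embedding) / (n_samples + 1)
```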
3. Better unknown speaker clustering
I have 20 "unknown" speakers in my database. Some of these are probably the same person across different episodes. I want better clustering to automatically identify recurring guests.
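A hedged sketch of what I have in mind, using scikit-learn's agglomerative clustering over the unknown embeddings (the distance threshold is a guess I'd have to tune, and the `metric` parameter assumes scikit-learn 1.2+):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_unknowns(embeddings: np.ndarray) -> np.ndarray:
    """Group unknown-speaker embeddings so recurring guests collapse into one cluster.

    embeddings: shape (n_unknown, 192), one ECAPA embedding per unknown speaker.
    Returns one cluster label per row; same label = probably the same person.
    """
    clustering = AgglomerativeClustering(
        n_clusters=None,            # let the distance threshold decide how many clusters
        metric="cosine",            # compare voices the same way matching does
        linkage="average",
        distance_threshold=0.15,    # roughly (1 - 0.85); needs tuning on real data
    )
    return clustering.fit_predict(embeddings)
```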
But ... This Shouldn't Be DIY
I'm happy to build this stuff. I'm learning as I'm vibe-coding this, but I shouldn't have to.
If Deepgram exposed even a few of these controls, like speaker embedding access, voiceprint registration, and merge logic, I'd move parts of my pipeline over tomorrow.
They've already nailed the infrastructure. What I want is a little more flexibility on top.
What I could do next if ...
If Deepgram integrates speaker memory into its stack, even as an advanced feature, it'll unlock a vast set of use cases beyond meetings and transcripts.
Stuff like:
- Podcast archives with persistent speaker tracking
- Audio-first knowledge graphs
- Long-term speaker analytics
- Episodic content intelligence
And yeah, weird builders like me would stop duct-taping ECAPA pipelines in the dark.
To the team at Deepgram: thanks for the convo.
This is the kind of product feedback loop most devs dream about.
Can't wait to see what you ship next.
- Benoit Meunier