I Built A Cross-Episode Speaker Memory
I built what I'm calling an "embedding index mega-system" that gives my podcast pipeline a memory. Not just any memory, though. Cross-episode speaker memory that actually works.
Unknown Speakers
I was processing hundreds of podcast episodes, and every time a guest appeared, my system treated them like a complete stranger. Episode 1? "Unknown Speaker." Episode 50 with the same guest? Still "Unknown Speaker."
It was driving me crazy.
My original pipeline was solid. It could transcribe audio, identify speakers within an episode, and store all the information in Supabase. But each episode existed in its own little bubble. No learning. No recognition. No intelligence.
Enter the Embedding Index Revolution
I decided to create an embedding index that remembers every voice it's ever heard.
Every time someone speaks in an episode, I generate a unique "voiceprint" using ECAPA-TDNN embeddings. Think of it like a fingerprint, but for voices. These voiceprints are stored in a local database and indexed so that similarity lookups stay fast.
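If you're curious what that looks like in code, here's a minimal sketch using SpeechBrain's pretrained ECAPA-TDNN encoder. The file path is a placeholder, and depending on your SpeechBrain version the import may live under speechbrain.pretrained instead:

```python
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

# Load the pretrained ECAPA-TDNN speaker encoder (downloaded on first use).
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# A mono 16 kHz clip containing a single speaker's segment (placeholder path).
signal, sample_rate = torchaudio.load("guest_segment.wav")

# encode_batch returns a (batch, 1, 192) tensor; flatten it to a 192-dim voiceprint.
voiceprint = encoder.encode_batch(signal).squeeze().detach().cpu().numpy()
print(voiceprint.shape)  # (192,)
```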
But here's where it gets interesting.
Cross-Episode Recognition
When a new episode comes in, my system doesn't just process it in isolation. It actually asks: "Hey, have I heard this voice before?"
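Under the hood, that question is just a nearest-neighbour search over the stored voiceprints. Here's a minimal sketch of the matching step; the function name, the 0.7 threshold, and the in-memory dict are illustrative, since my real index lives in SQLite, as described below:

```python
import numpy as np

def best_match(query: np.ndarray, known: dict[str, np.ndarray],
               threshold: float = 0.7) -> tuple[str | None, float]:
    """Return the closest known speaker by cosine similarity, or None."""
    q = query / np.linalg.norm(query)
    best_name, best_score = None, -1.0
    for name, emb in known.items():
        score = float(np.dot(q, emb / np.linalg.norm(emb)))
        if score > best_score:
            best_name, best_score = name, score
    # Anything below the threshold is treated as a brand-new voice.
    if best_score < threshold:
        return None, best_score
    return best_name, best_score
```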
The results? Mind-blowing.
I'm getting 99.9% similarity scores when returning guests appear! The system recognizes voices across episodes with near-perfect accuracy.
Alex Hormozi's voice is instantly recognizable across all episodes. I won't spend time labeling everyone, but once a voice has a name, every guest or interviewer is recognized instantly.
The Technical Stack
Let me break down what's under the hood without getting too nerdy:
- Storage: SQLite database with NumPy arrays for the actual voice embeddings. Simple, fast, and debuggable.
- Recognition: ECAPA-TDNN model from SpeechBrain. It's the gold standard for speaker recognition.
- Intelligence: Cosine similarity matching with temporal boosting. Recent episodes get a slight boost because, well, people's voices change over time (see the sketch below).
- Performance: Everything runs locally. No API calls for the core recognition. Lightning fast.
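Here's a minimal sketch of how the storage and the temporal boost fit together. The table name, schema, and boost constants are illustrative, not my exact production values:

```python
import sqlite3
from datetime import date

import numpy as np

DB_PATH = "voiceprints.db"  # illustrative path

def init_db() -> None:
    # One row per voiceprint: who it belongs to (if known), when it was
    # heard, and the raw float32 bytes of the 192-dim embedding.
    with sqlite3.connect(DB_PATH) as con:
        con.execute(
            """CREATE TABLE IF NOT EXISTS voiceprints (
                   id INTEGER PRIMARY KEY,
                   speaker TEXT,
                   episode_date TEXT,
                   embedding BLOB
               )"""
        )

def save_voiceprint(speaker: str | None, episode_date: str, emb: np.ndarray) -> None:
    with sqlite3.connect(DB_PATH) as con:
        con.execute(
            "INSERT INTO voiceprints (speaker, episode_date, embedding) VALUES (?, ?, ?)",
            (speaker, episode_date, emb.astype(np.float32).tobytes()),
        )

def load_voiceprints() -> list[tuple[str, str, np.ndarray]]:
    with sqlite3.connect(DB_PATH) as con:
        rows = con.execute("SELECT speaker, episode_date, embedding FROM voiceprints")
        return [(s, d, np.frombuffer(b, dtype=np.float32)) for s, d, b in rows]

def boosted_score(cosine: float, episode_date: str, half_life_days: float = 365.0) -> float:
    # Recency decays exponentially: 1.0 for today, 0.5 after one half-life.
    age_days = (date.today() - date.fromisoformat(episode_date)).days
    recency = 0.5 ** (age_days / half_life_days)
    # Cap the boost so similarity still dominates: at most +5% for recent prints.
    return cosine * (0.95 + 0.05 * recency)
```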
Real Results That Made Me Smile
After building and testing this system, here's what I discovered:
- Processing Speed: 14 seconds to fully process an entire episode
- Recognition Accuracy: 100% for known speakers like Alex Hormozi
- Storage Efficiency: 0.1 MB for 33 voice embeddings
- Error Rate: 0% (robust error handling really works)
But the best part? The system learns and gets smarter with every episode.
No UI Yet
Everything runs through CLI tools and generates file-based reports.
Want to see unknown speakers? One command.
Need to manually assign a name to a voice? Another command.
Bulk processing of historical episodes? You got it.
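To give a flavour of how simple these tools are, the unknown-speaker report boils down to a query like this (against the illustrative schema from the storage sketch above):

```python
import sqlite3

# List voiceprints that haven't been assigned a name yet
# (uses the illustrative schema from the storage sketch above).
with sqlite3.connect("voiceprints.db") as con:
    rows = con.execute(
        "SELECT id, episode_date FROM voiceprints WHERE speaker IS NULL"
    )
    for vp_id, episode_date in rows:
        print(f"Unknown voiceprint #{vp_id}, first heard {episode_date}")
```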
It's not time to build a UI yet. I have more work to do.
What's Next?
I'm not stopping here. Ideas are flowing. I could do so many things with this:
- Advanced clustering algorithms for grouping unknown speakers
- Content analysis that tracks what topics each speaker discusses
- Integration with external speaker databases
- Trend detection for recurring themes by specific speakers
However, for now, I'm quite pleased with what I've built. I need to lie down.
– Benoit Meunier