I Had to Build Memory-Efficient Speaker Diarization for Long Podcast Episodes
The Problem I Didn't See Coming
What I wasn't prepared for: everything works great until you hit a one-hour episode.
I'd been running my new hybrid speaker identification pipeline (pyannote.audio + ECAPA) on shorter episodes without a hitch. Alex Hormozi's voice? Identified perfectly. Guest speakers? Tracked across conversations. Everything was smooth.
Then I tried processing "We Made A BIG Decision… | Ep 908", a one-hour episode.
Crash.
Memory exhausted during speaker diarization. Pipeline is dead in the water.
What I didn't know was that I wasn't processing 40 MB of audio, but 644 MB. Why? Because I had to convert the audio first.
Podcasts are typically delivered as MP3 files because they're compressed, efficient, and ideal for streaming. But the AI models I'm using (pyannote.audio and SpeechBrain's ECAPA) need raw, uncompressed audio data to work their magic.
So here's what happens:
- Download: 40MB MP3 file
- Convert to WAV: 644MB uncompressed audio
- Load into memory: Even more as PyTorch tensors
That's a 16x size increase, and suddenly my MacBook's RAM is crying for mercy.
But there's no way around this. I can't perform speaker diarization on compressed audio. The models need every frequency, every nuance to distinguish between voices.
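If you want to see how that 16x blow-up happens, here's a minimal sketch of the conversion step (assuming ffmpeg is on your PATH; the function name is just for illustration):

```python
import subprocess

def mp3_to_wav(mp3_path: str, wav_path: str) -> None:
    # Decompress the MP3 to PCM WAV. A one-hour stereo episode at CD
    # quality is 3600 s x 44,100 samples/s x 2 bytes x 2 channels,
    # roughly 635 MB, which is how a 40 MB MP3 becomes ~644 MB of WAV.
    subprocess.run(["ffmpeg", "-y", "-i", mp3_path, wav_path], check=True)
```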
The Solution: Think in Chunks, Not Episodes
I had no choice but to break long episodes into manageable chunks, process each one independently, and then intelligently merge the results.
Well, I did have a choice, but since I'm not a developer, I'm not ready to complicate things by moving to the cloud. The one Python script I run locally is slow, but I can control it and understand most of its parts. I'm sure there's a better way, but I haven't found it yet.
So here's the approach I built:
Chunked Diarization
Split the big episode into 5-minute chunks with a 30-second overlap. Each chunk gets processed separately (see the sketch after this list):
- Chunk 1: 0-5 minutes
- Chunk 2: 4.5-9.5 minutes (overlap prevents boundary issues)
- Chunk 3: 9-14 minutes
- And so on...
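The windowing itself is simple. Here's a minimal sketch of how those boundaries can be generated (my actual script does more bookkeeping, but the math is this):

```python
from typing import Iterator

def chunk_boundaries(total_s: float, chunk_s: float = 300.0,
                     overlap_s: float = 30.0) -> Iterator[tuple[float, float]]:
    # Each window starts 30 s before the previous one ends:
    # (0, 300), (270, 570), (540, 840), ...
    step = chunk_s - overlap_s
    start = 0.0
    while start < total_s:
        yield start, min(start + chunk_s, total_s)
        start += step
```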
Memory Management
Between each chunk, clear the GPU cache and release memory. This maintains stable memory usage throughout the entire process.
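On my MacBook that means the MPS backend rather than CUDA, but the idea is the same either way. Here's a sketch of what "clear the cache between chunks" looks like in PyTorch (the helper name is mine):

```python
import gc
import torch

def release_memory() -> None:
    # Drop Python-level references first, then ask the accelerator
    # to return its cached blocks before the next chunk starts.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif torch.backends.mps.is_available():  # Apple Silicon
        torch.mps.empty_cache()  # available on recent PyTorch releases
```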
Speaker Consistency Across Chunks
Here's where it gets tricky. Each chunk might assign the same person a different speaker label:
- Chunk 1: Alex = "SPEAKER_00"
- Chunk 5: Alex = "SPEAKER_02"
So I had to merge speakers by base name, then run ECAPA identification on the combined segments to keep Alex Hormozi consistently identified as "Alex Hormozi" throughout.
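Conceptually, the matching step compares each merged speaker's average ECAPA embedding against reference embeddings of known voices using cosine similarity. A hypothetical sketch (the `references` dict and the 0.6 threshold are my illustration, not the pipeline's exact values):

```python
import numpy as np

def match_speaker(embedding: np.ndarray,
                  references: dict[str, np.ndarray],
                  threshold: float = 0.6) -> str:
    # Cosine similarity: 1.0 means identical direction, 0.0 unrelated.
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_name, best_score = "UNKNOWN", threshold
    for name, ref in references.items():
        score = cosine(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```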
Error Recovery and Progress Monitoring
I asked Claude Code to add detailed progress tracking because I couldn't tell what was happening. The pipeline now reports each stage:
- Audio Download
- Audio Padding
- Speaker Diarization
- Speaker Identification
- Transcription
- Data Processing
Each step gets retry logic (see the sketch after this list):
- Diarization: 2 attempts with memory clearing between tries
- Audio processing: 3 attempts with format conversion fallbacks
- Database operations: Reduced batch sizes to avoid URL limits
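In code, the pattern is a small wrapper that retries a step and runs a cleanup (like the memory release above) between attempts. A sketch, with names of my own invention:

```python
import time

def with_retries(fn, attempts: int = 2, cleanup=None, delay_s: float = 5.0):
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as error:
            last_error = error
            if cleanup is not None:
                cleanup()  # e.g. release_memory() between diarization tries
            if attempt < attempts:
                time.sleep(delay_s)
    raise last_error

# e.g.: with_retries(lambda: diarize_chunk(path), attempts=2, cleanup=release_memory)
```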
I need to be transparent here; I don't always know exactly what's happening. I'm guessing and letting Claude Code support me. But I'm slowly getting it.
Database Optimization
I didn't know it, but Supabase (fronted by Cloudflare) has URL length limits. With 493 utterances, even small batch sizes were generating duplicate-check URLs that were too long.
So I had to temporarily disable duplicate checking for long episodes. I'm not sure it's a good idea; it might come back to bite me later. I'm adding it to my list of things to fix. But I wanted to see content in my database.
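For context, Supabase's REST filters put every value in the URL (something like `id=in.(1,2,3,...)`), so checking hundreds of utterances in one request blows past the limit. Smaller write batches are the part I kept; a hypothetical sketch using supabase-py (table name and batch size are placeholders):

```python
BATCH_SIZE = 25  # assumed value; small enough that each request stays modest

def insert_utterances(supabase, utterances: list[dict]) -> None:
    # Insert in small batches instead of one 493-row request.
    for i in range(0, len(utterances), BATCH_SIZE):
        batch = utterances[i:i + BATCH_SIZE]
        supabase.table("utterances").insert(batch).execute()
```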
So, the results?
As I write this, the pipeline is chugging through chunk 6 of 15. Here's what the logs show:
- ✅ Chunk 1 completed - found 4 speakers
- ✅ Chunk 2 completed - found 3 speakers
- ✅ Chunk 3 completed - found 4 speakers
- ✅ Chunk 4 completed - found 3 speakers
- ✅ Chunk 5 completed - found 3 speakers
- 🔄 Processing chunk 6: 22.5-27.5 min
Each chunk takes about 3 minutes to process. Total estimated time: ~50 minutes for an hour-long episode.
Compare that to the previous approach: an immediate crash.
Memory usage? Before: 644 MB loaded at once, then a crash. Now: ~80 MB per chunk, stable throughout.
Speaker Identification Accuracy
This was a concern for me. The feature is crucial: knowing who's speaking is essential as a data structure to feed an LLM or, soon, an agent. You can follow me on Twitter at [x.com/bmeunier](https://x.com/bmeunier) to understand why I think it's essential.
The chunked approach actually preserves the same accuracy for identifying Alex Hormozi. ECAPA's voice embeddings are robust enough that even with segmented processing, the cosine similarity scores remain consistent.
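For the curious, extracting one of those embeddings with SpeechBrain looks roughly like this (the import path moved to `speechbrain.inference` in recent releases; older versions use `speechbrain.pretrained`):

```python
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="models/ecapa"
)

def embed(wav_path: str):
    # ECAPA expects 16 kHz mono audio; resample/downmix first if needed.
    waveform, sample_rate = torchaudio.load(wav_path)
    return encoder.encode_batch(waveform).squeeze()  # 192-dim voice vector
```

Two embeddings of the same voice from different chunks land close together under cosine similarity, which is why the per-chunk labels can be stitched back together reliably.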
What I Learned
I'm learning so much, really, as a non-dev, that my head hurts. Every time I develop something, I have to research the acronym, the concepts, and ask ChatGPT to explain it to me like I'm twelve. But I'm getting there.
- Audio Processing Is Memory-Intensive: Don't underestimate the memory requirements of uncompressed audio. Plan for 15-20x the file size of the original MP3.
- Overlap Is Critical: Without that 30-second overlap between chunks, you'll lose speaker turns that happen right at boundaries. The overlap ensures continuity.
- Progress Monitoring Saves Sanity: When you're dealing with hour-long processing times, detailed progress logs aren't a luxury; they're a necessity. I need to create better prompts earlier in my process that include monitoring.
- Database Limits: Next time, I need to research the limits of cloud services before choosing one. URL length, request size, rate limiting... they're all biting me as I scale up.
- Error Recovery Beats Perfect Code: My code is far from elegant, and I still need to fully understand what graceful degradation means, but it works. That's the magic of coding with AI: it can help me get there.
What's Next
I wanted to work on topic segmentation, but long episodes are slowing me down. At least this chunked approach addresses future problems I didn't even know I'd encounter.
I can now take on longer podcast episodes. I might try the ones on Modern Wisdom (4 hours).
With stable memory usage, I can run more advanced speaker identification models if needed. But for now, I am satisfied.
The chunked architecture could potentially support live podcast processing, which I'm not ready for, but it's giving me an idea.
And maybe I could start tracking not just Alex Hormozi but regular guests across episodes. Oh?!
There are so many constraints I'm working within: memory limits, processing time, database restrictions, even laptop sleep settings (yep, I had to script something to keep my pipeline alive while I'm away).
I hope it keeps working; I'm watching the monitor switch from chunk 6 to chunk 7 as I write.
Next: testing this method on even longer episodes to see how far it can go. I'll do the topic segmentation later.
– Benoit Meunier