I Built a Confidence-Aware Filter and It Removed 8% of Garbage
There's something weirdly satisfying about getting a computer to say "no."
I've been building a modular pipeline that turns podcast audio into speaker-resolved, quote-friendly transcripts. It handles transcription, diarization, speaker identification, and everything in between. And now, it finally has boundaries.
It finally says, "Nope, that segment's not good enough. We're not sending that downstream."
This is what I call confidence-aware filtering. And I just shipped it.
Garbage In, Garbage Out
Here's the problem: transcription models are fast, but not always perfect. Some words come back mumbled, low-confidence, or just weird fragments like "Yeah. Uh. I guess… hmm."
If you feed that into speaker identification or quote extraction, it breaks everything.
- You get quotes that don't mean anything
- You train your voiceprint matcher on junk
- You end up attributing "Mhmm." to the wrong person
So I built a system that inspects each segment before it's allowed to move forward.
Here's What It Does
The new filter module breaks every transcript segment into three buckets:
- ✅ Passed -- high-confidence, multi-word, meaningful stuff
- ⚠️ Flagged -- maybe useful, but short or borderline confidence
- ❌ Dropped -- we don't trust it, and it won't affect speaker ID
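Stripped down, the per-segment decision looks something like this. It's a simplified sketch: the `Segment` shape, the `Verdict` names, and the 0.90 / 0.75 thresholds are illustrative, not the exact values in the module.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASSED = "passed"    # high-confidence, multi-word, meaningful
    FLAGGED = "flagged"  # kept, but marked as borderline
    DROPPED = "dropped"  # not trusted; excluded from speaker ID and quotes


@dataclass
class Segment:
    text: str
    confidence: float  # aggregated word-level confidence, 0 to 1


def classify(seg: Segment, pass_conf: float = 0.90, flag_conf: float = 0.75) -> Verdict:
    """Sort one segment into one of the three buckets."""
    word_count = len(seg.text.split())
    if seg.confidence >= pass_conf and word_count >= 3:
        return Verdict.PASSED
    if seg.confidence >= flag_conf:
        return Verdict.FLAGGED
    return Verdict.DROPPED
```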
It runs locally, it's blazing fast, and it logs everything. It even supports different filtering strategies (average confidence, weighted, and percentile), allowing me to tweak it later without altering the entire architecture.
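For the curious, the three strategies boil down to different ways of collapsing word-level confidences into a single segment score. Another sketch, with assumptions: weighting by spoken duration and a 25th-percentile default are plausible choices for illustration, not necessarily what the module ships with.

```python
from statistics import mean


def average_confidence(word_confs: list[float]) -> float:
    """Plain average of the word-level confidences."""
    return mean(word_confs)


def weighted_confidence(word_confs: list[float], durations: list[float]) -> float:
    """Weight each word's confidence by its spoken duration."""
    total = sum(durations)
    return sum(c * d for c, d in zip(word_confs, durations)) / total


def percentile_confidence(word_confs: list[float], pct: float = 0.25) -> float:
    """Judge the segment by its weaker words: the confidence at the given percentile."""
    ordered = sorted(word_confs)
    idx = min(int(pct * len(ordered)), len(ordered) - 1)
    return ordered[idx]
```

The percentile variant is the pessimist of the three: one mumbled word drags the whole segment down, which is exactly what you want before trusting a quote.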
Smart Min-Words Logic
The system is also smart about word count thresholds. While it normally requires 3+ words per segment, it makes exceptions for high-confidence short segments like "I agree" or "New York" (2 words with 90%+ confidence) and even single words like "Yes" or "Okay" (95%+ confidence). This prevents losing meaningful short responses while still filtering out "uh" and "um" fragments.
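In Python-ish pseudocode, the word-count check from the sketch above gets these carve-outs. The thresholds are the ones described here; the function name and signature are just for illustration.

```python
def meets_word_threshold(text: str, confidence: float,
                         min_words: int = 3,
                         two_word_conf: float = 0.90,
                         one_word_conf: float = 0.95) -> bool:
    """Require 3+ words, with carve-outs for confident short segments."""
    n = len(text.split())
    if n >= min_words:
        return True
    if n == 2 and confidence >= two_word_conf:
        return True   # "I agree", "New York"
    if n == 1 and confidence >= one_word_conf:
        return True   # "Yes", "Okay"
    return False      # "uh", "um", and friends get filtered
```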
Real-World Results
I ran the same episode of The Game through the pipeline twice: once with filtering off, once with it on.
The difference?
- It filtered out 8.3% of total segments
- Speaker switches dropped by 8 (from 92 to 84)
- Confidence improved from 0.954 to 0.982
- 35 garbage segments (under 2 seconds) got removed
- 144.9 seconds of Alex Hormozi segments were filtered out (mostly short utterances and low-confidence segments)
And here's the kicker: not a single meaningful quote was lost.
This wasn't just cleanup; it was a performance upgrade for everything that comes after.
Speaker Attribution Got Smarter
Because the input was cleaner, speaker attribution got better too.
- Fewer false switches
- Fewer short utterances misattributed
- More reliable voiceprint matching
In the noisy version, Alex Hormozi had 344 attributed segments. After filtering, it dropped to 317 -- and every single one that got removed was junk.
What Does This Mean?
The filter doesn't try to be perfect.
It just says: "This segment isn't strong enough to influence speaker identity or quote extraction." And that one decision protects the rest of the system.
It's a guardrail. A quiet upgrade that makes everything else better.
What's Next?
Next, I'll explore visualizing these filters over time, possibly auto-tuning thresholds based on episode patterns, or publishing the module as a Hugging Face Space for others to use. Maybe. Not sure the juice is worth the squeeze. I need to focus on making this better.
But for now? I'm just happy my pipeline can say no.
And I trust it more because of that.
– Benoit Meunier