Ask The Game, the Build Log

Building a JSON File Index for Safer Chunking Experiments

This blog post documents a quiet but critical backend fix I made to unblock the next phase of semantic chunking experiments.

I'm planning to completely rechunk over 900 episodes of The Game podcast using better semantic chunking algorithms. But first, I needed to make sure I could actually rebuild everything from the raw transcript files.

Plus, turns out, I was missing 103 of them.

Here's how I built a hybrid system that indexes all raw Deepgram JSON files as a foundation for future chunking experiments, and why this matters more than just "fixing missing files."

Why Raw Transcripts Are My Foundation

I've been building search tools for Alex Hormozi's podcast, and my current chunking strategy is pretty basic. I split transcripts into fixed-size chunks with some overlap, embed them, and store them in a vector database. It works, but I know it could be much better.
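For context, the current pipeline looks roughly like this. A minimal sketch, assuming plain-text transcripts and word-count-based sizing; the chunk size and overlap values here are illustrative, not my production settings:

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a transcript into fixed-size word chunks with a small overlap."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
    return chunks

# Each chunk then gets embedded and written to the vector database.
```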

I want to experiment with semantic chunking, where content is divided along meaningful topic boundaries rather than arbitrary word counts or speaker changes. That could mean chunking by complete thoughts, shifts in conversation topics, or when Alex begins a new story.

But here's the thing: if I'm going to rechunk 900+ episodes multiple times as I test different strategies, I need a reliable way to start from the raw transcripts. I can’t keep hitting Deepgram’s API every time I test a new chunking approach. That would cost hundreds of dollars... and take forever.

So I started building what I call a "hybrid system": keep all the raw Deepgram JSON files indexed and searchable, while also maintaining my current processed chunks.
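The index side is nothing exotic. Here's a minimal sketch, assuming one raw Deepgram JSON file per episode, named by episode ID, sitting in a directory I'm calling data/raw_transcripts (the layout and names are illustrative):

```python
import json
from pathlib import Path

RAW_DIR = Path("data/raw_transcripts")  # hypothetical location of the raw Deepgram files

def build_raw_index() -> dict[str, Path]:
    """Map episode IDs to their raw Deepgram JSON files on disk."""
    return {path.stem: path for path in RAW_DIR.glob("*.json")}

def load_raw_transcript(index: dict[str, Path], episode_id: str) -> dict:
    """Load the full Deepgram response for one episode."""
    with open(index[episode_id], encoding="utf-8") as f:
        return json.load(f)
```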

That's when I discovered the problem.

Database episodes: 909

Transcript files: 807

Missing: 103

I was missing 11% of my raw transcripts.

The Cost of Lost Raw Data

The raw Deepgram JSON files contain way more than just transcripts. They have:

Word-level timestamps, so I know exactly when each word was spoken.

Speaker labels, so I can tell who said what.

Confidence scores, so I know how sure Deepgram was about each word.

Detected topics and other metadata from the transcription run.
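To make that concrete, here's a hedged sketch of pulling the word-level detail out of one raw file. The nested keys follow Deepgram's pre-recorded response format as I understand it, and may vary by API version and request options:

```python
import json

def word_level_detail(raw_json_path: str) -> list[dict]:
    """Extract per-word timing, confidence, and speaker labels from a raw Deepgram file."""
    with open(raw_json_path, encoding="utf-8") as f:
        response = json.load(f)

    words = response["results"]["channels"][0]["alternatives"][0]["words"]
    return [
        {
            "word": w["word"],
            "start": w["start"],          # seconds into the episode
            "end": w["end"],
            "confidence": w["confidence"],
            "speaker": w.get("speaker"),  # present when diarization was enabled
        }
        for w in words
    ]
```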

Without these files, I'd have to retranscribe the missing episodes through Deepgram every time I wanted to test a new chunking strategy. At $0.15 per episode, those 103 gaps alone come to $15+ per experiment. Run 10 experiments and suddenly you're spending $150+ on transcription costs.

More importantly, it would take hours to retranscribe batches of episodes, making the feedback loop for chunking experiments painfully slow.

Building a Surgical Fix

I built a simple script that identifies the gaps, fetches just the missing transcripts from Deepgram, and saves them using my existing transcript service. And it worked seamlessly.
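In rough form, the script looks like this. It's a sketch: RAW_DIR, transcript_service, and its fetch_and_save method stand in for my actual transcript directory, database query, and existing transcript service:

```python
from pathlib import Path

RAW_DIR = Path("data/raw_transcripts")  # hypothetical location of the raw Deepgram files

def find_missing(db_episode_ids: set[str]) -> set[str]:
    """Episode IDs that exist in the database but have no raw JSON on disk."""
    on_disk = {p.stem for p in RAW_DIR.glob("*.json")}
    return db_episode_ids - on_disk

def backfill(db_episode_ids: set[str], transcript_service) -> None:
    """Fetch and save only the missing transcripts; everything else is left alone."""
    missing = find_missing(db_episode_ids)
    print(f"Missing {len(missing)} of {len(db_episode_ids)} episodes")
    for episode_id in sorted(missing):
        # transcript_service wraps the Deepgram call and writes the raw JSON into RAW_DIR
        transcript_service.fetch_and_save(episode_id)
```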

Final result: every one of the 909 database episodes now has its raw Deepgram JSON on disk. 100% coverage.

I didn't touch any of my existing, working systems. It just filled in the gaps.

What's Coming?

Now that I have 100% raw transcript coverage, I can finally experiment with better chunking strategies:

Topic-based chunking. Use Deepgram's detected topics and word-level timestamps to break content when the conversation naturally shifts subjects.

Speaker-aware chunking. Separate Alex's main points from guest responses, or chunk differently when it's a solo episode vs interview format.

Confidence-filtered chunking. Skip sections where Deepgram has low confidence, focusing chunks on clear, high-quality content.

Story-boundary chunking. Alex often tells stories to illustrate points. I could detect these narrative patterns and create chunks that keep complete stories together.
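None of these exist yet, but to make the direction concrete, here's a hedged sketch of boundary-based chunking over the word-level data from earlier (the word_level_detail output), using long pauses and speaker changes as crude stand-ins for topic or story boundaries, plus a confidence filter. The thresholds are guesses, not tuned values:

```python
def boundary_chunks(words: list[dict], pause_threshold: float = 2.0,
                    min_confidence: float = 0.6) -> list[str]:
    """Group words into chunks, starting a new chunk on a long pause or a speaker change."""
    chunks: list[list[str]] = [[]]
    prev = None
    for w in words:
        if w["confidence"] < min_confidence:
            continue  # confidence-filtered: skip words Deepgram wasn't sure about
        if prev is not None and chunks[-1]:
            long_pause = w["start"] - prev["end"] > pause_threshold
            speaker_change = w.get("speaker") != prev.get("speaker")
            if long_pause or speaker_change:
                chunks.append([])  # start a new chunk at the boundary
        chunks[-1].append(w["word"])
        prev = w
    return [" ".join(c) for c in chunks if c]
```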

I'll be able to test any chunking strategy on the full 900+ episode dataset without worrying about retranscription costs. I can iterate quickly, compare results, and measure which approaches work better for finding specific business insights.

I'm particularly excited about story-boundary chunking. We'll see.

Now that the data's clean, the real work — designing smarter chunks — can finally begin.

— Benoît Meunier