I Just Wanted Better Logs. Now I Have a Whole New Architecture.
I started out wanting better logging, then I went full OCD and refactored the whole file architecture, and it was fun.
Better Logging
The project's running. Things are working. But something felt... foggy. I couldn't see what was going on behind the scenes. So I simply tried to add structured logging so I could observe what was happening.
At the time, I had a massive main.py script with over 1,800 lines of code. It did diarization, embeddings, speaker ID, all of it jammed together.
I started out just wanting better logs. But once you start thinking in terms of observability, you can't unsee what's missing.
I needed:
- Timestamps
- Config snapshots
- Segmentation data (how many speakers?)
- ECAPA similarity scores
- Cluster results (did we match Alex Hormozi correctly?)
- Episode hashes for data integrity
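The last item on that list, episode hashes, is easy to picture in code. Here's a minimal sketch, assuming the hash is a SHA-256 digest of the episode's audio file. The function name `episode_hash` is made up for illustration and may not match what the pipeline really does.

```python
import hashlib
from pathlib import Path


def episode_hash(audio_path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of an episode's audio file (hypothetical helper)."""
    digest = hashlib.sha256()
    with audio_path.open("rb") as handle:
        # Read in chunks so a long episode never has to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the same episode is downloaded twice and the two digests differ, something changed in the file, and that's exactly the kind of thing worth logging.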
I discussed it with ChatGPT and Claude, and both proposed swapping out loose log lines for structured JSON.
# Old way
logging.info("Diarization found some speakers")
# New way
structured_logger.log_segmentation(
    step="diarization_complete",
    speakers_detected=3,
    processing_time=45.2,
    memory_efficient=True,
    details={"speaker_labels": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"]}
)
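For anyone wondering what sits behind a call like that, here's a rough sketch of a structured logger. It's an assumption-heavy illustration, not the real class: the `StructuredLogger` name, its fields, and the `to_json` method are all hypothetical. Only the `log_segmentation` call shape comes from the snippet above.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class StructuredLogger:
    """Collects log events for one pipeline run and serializes them as JSON."""

    run_id: str
    episode_name: str
    segmentation_logs: list[dict[str, Any]] = field(default_factory=list)

    def log_segmentation(self, step: str, **fields: Any) -> None:
        # Every event gets a UTC timestamp plus whatever keyword fields the caller passes.
        event = {"step": step, "timestamp": datetime.now(timezone.utc).isoformat(), **fields}
        self.segmentation_logs.append(event)

    def to_json(self) -> str:
        return json.dumps(
            {
                "run_id": self.run_id,
                "episode_name": self.episode_name,
                "segmentation_logs": self.segmentation_logs,
            },
            indent=2,
        )


structured_logger = StructuredLogger(run_id="20250626_172050", episode_name="Episode Title")
structured_logger.log_segmentation(
    step="diarization_complete",
    speakers_detected=3,
    processing_time=45.2,
)
print(structured_logger.to_json())
```

The point of the design is that each event is a dictionary rather than a sentence, so nothing has to be parsed back out of free text later.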
The result: I can now query the logs directly, much as I did when I converted raw transcripts into fully structured data. My logs are now structured and machine-readable, so they can be analysed, filtered, or acted upon.
{
  "run_id": "20250626_172050",
  "episode_name": "Episode Title",
  "segmentation_logs": [...],
  "ecapa_logs": [...],
  "cluster_logs": [...],
  "success": true,
  "total_processing_time_seconds": 65.3
}
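To make "query the logs directly" concrete, here's a small sketch that filters a handful of run summaries. The field names (`run_id`, `success`, `total_processing_time_seconds`) come from the record above. The run IDs, the values, and the 120-second threshold are made up for this example.

```python
import json

# Illustrative records only: the field names come from the run summary above,
# the values are invented for this example.
raw_runs = [
    '{"run_id": "run_a", "success": true, "total_processing_time_seconds": 65.3}',
    '{"run_id": "run_b", "success": false, "total_processing_time_seconds": 12.0}',
    '{"run_id": "run_c", "success": true, "total_processing_time_seconds": 140.8}',
]
runs = [json.loads(line) for line in raw_runs]

# Two questions that plain-text log lines can't answer without regex gymnastics.
failed = [run["run_id"] for run in runs if not run["success"]]
slow = [run["run_id"] for run in runs if run["total_processing_time_seconds"] > 120]

print("failed runs:", failed)
print("slow runs:", slow)
```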
And that's when it hit me: let's do more cleanup.
Total File Restructure
After asking around, I found out there's no one-size-fits-all standard for structuring files. However, the structure I proposed closely follows best practices from several recognized Python conventions, especially for modular, data-driven projects:
- https://github.com/drivendata/cookiecutter-data-science
- https://packaging.python.org/en/latest/
Companies like OpenAI, Hugging Face, Meta, and Netflix often use a modular ML pipeline approach in their projects. So, why shouldn't I?
And ChatGPT also shared some extra suggestions:
"Use src/ for all reusable logic. Put CLI entry points in scripts/. Migrate config to configs/."
So before:
askthegame/
├── main.py (1800+ lines)
├── debug_chunks.py
├── create_voiceprint.py
├── temp_audio/
├── logs/
├── deepgram_backups/
└── chaos everywhere
... and after.
askthegame/
├── src/askthegame/
│   ├── audio/
│   ├── transcription/
│   ├── speaker/
│   ├── embeddings/
│   ├── database/
│   ├── pipeline/
│   └── utils/
├── scripts/
├── configs/
├── data/
├── tests/
└── docs/
Now everything has a place, and more importantly, it has a purpose.
Modular Design = Sanity
Anyway, I was lost. So many files. Could I break up the 1,800-line file? It felt like cleaning out a cluttered garage. So I did, with some AI help.
Each module now does exactly one thing. Nothing more.
- rss_processor.py: just fetches episodes.
- service.py under embeddings/: just runs OpenAI.
- orchestrator.py: ties it all together.
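Here's a toy sketch of the "one module, one job" idea, with an orchestrator that only wires steps together. Everything in it is hypothetical: the `Orchestrator` class, its two fields, and the fake fetch and embed functions are stand-ins, not the real modules, which talk to an RSS feed and to OpenAI.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Orchestrator:
    """Ties single-purpose steps together without knowing how each one works."""

    fetch_episodes: Callable[[str], list[str]]
    embed_text: Callable[[str], list[float]]

    def run(self, target_episode: str) -> dict[str, list[float]]:
        # Each step is a plain callable, so it can be swapped or faked in tests.
        return {title: self.embed_text(title) for title in self.fetch_episodes(target_episode)}


# Stand-ins for the real modules (hypothetical, no network calls).
def fake_fetch(target: str) -> list[str]:
    return [title for title in ["Ep 907", "Ep 908"] if title == target]


def fake_embed(text: str) -> list[float]:
    return [float(len(text))]


result = Orchestrator(fake_fetch, fake_embed).run("Ep 908")
print(result)
```

Because the orchestrator only receives callables, each piece can be tested on its own, which is most of what I wanted from the split.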
And the CLI? Still works:
python scripts/run_pipeline.py --target-episode "Ep 908"
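A thin entry point like that could be sketched with `argparse` from the standard library. This is an assumption about how scripts/run_pipeline.py might look, not its real contents. Only the `--target-episode` flag is taken from the command above.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the podcast pipeline.")
    # argparse exposes --target-episode as args.target_episode.
    parser.add_argument("--target-episode", required=True, help="Episode title to process")
    return parser


def main(argv: Optional[list[str]] = None) -> str:
    # With argv=None, argparse falls back to the real command-line arguments.
    args = build_parser().parse_args(argv)
    # A real script would hand off to the orchestrator here.
    return args.target_episode


print(main(["--target-episode", "Ep 908"]))
```

Keeping the script this thin means all the reusable logic stays importable from src/, and the CLI is just one way of calling it.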
And because I had built habits around it, I kept the legacy main.py there, so everything stays backward-compatible. I did make a note to remove it once I'm no longer using it.
Configuration? YAML All the Way
Hardcoded variables just don't sit right with me. What if I need to make a change? Plus, I've heard from developers in the past that hardcoded variables aren't ideal. So, I've made a change.
Before:
TARGET_EPISODE_TITLE = "Ep 908"
After:
# pipeline_config.yaml
target_episode_title: "Ep 908"
max_episodes_per_run: 1
Flexible. Clear. Shareable.
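Reading that file back could look like the sketch below. It assumes PyYAML (a third-party package) is installed for the `load_config` step. The `PipelineConfig` class and both function names are made up for illustration. Only the two keys come from the YAML above.

```python
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Any, Mapping


@dataclass(frozen=True)
class PipelineConfig:
    target_episode_title: str
    max_episodes_per_run: int = 1

    @classmethod
    def from_mapping(cls, data: Mapping[str, Any]) -> "PipelineConfig":
        # Fail loudly on typos instead of silently ignoring unknown keys.
        known = {f.name for f in fields(cls)}
        unknown = set(data) - known
        if unknown:
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        return cls(**data)


def load_config(path: Path) -> PipelineConfig:
    import yaml  # PyYAML, third-party (assumed installed); imported lazily on purpose

    with path.open(encoding="utf-8") as handle:
        return PipelineConfig.from_mapping(yaml.safe_load(handle))


# The same mapping yaml.safe_load would return for the file above.
config = PipelineConfig.from_mapping({"target_episode_title": "Ep 908", "max_episodes_per_run": 1})
print(config)
```

Rejecting unknown keys is a deliberate choice: a misspelled setting in a YAML file otherwise fails silently and the default quietly wins.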
Always Be Documenting
During my vibe coding sessions, I've noticed comments at the top of files created by Claude Code. I researched and found that I can benefit from using module-level docstrings in my Python scripts.
So, I reviewed every file and wrote docstrings as I would want to read them six months from now. Not just notes, but proper, developer-grade documentation.
"""
Filename: rss_processor.py
Description:
RSS feed processor for The Game Podcast ETL.
Handles metadata extraction and filtering.
Author: Benoît Meunier
Created: 2025-06-26
"""
Every module. Every script. I really love documentation. And because I'm vibe coding, it's an exercise that is helping me understand what each thing does.
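"Every module" is easy to claim and easy to forget, so a check like the one below could keep me honest. It's a sketch using the standard library's `ast` module. The function name is hypothetical, and it only looks at module-level docstrings, not functions or classes.

```python
import ast
from pathlib import Path


def modules_missing_docstrings(root: Path) -> list[Path]:
    """Return every .py file under root that has no module-level docstring."""
    missing = []
    for path in sorted(root.rglob("*.py")):
        # Parse the source without importing it, so nothing in the file gets executed.
        tree = ast.parse(path.read_text(encoding="utf-8"))
        if ast.get_docstring(tree) is None:
            missing.append(path)
    return missing
```

Calling `modules_missing_docstrings(Path("src"))` from the project root would list any file that slipped through.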
The Unexpected Win
What started as a logging improvement ended up transforming, and documenting, everything. The original code still runs; nothing broke. But now, everything's just... better.
Clean, documented, modular, understandable.
- Benoit Meunier