I Just Wanted Better Logs. Now I Have a Whole New Architecture.

27 Jun, 2025

I started out wanting better logging, then I went full OCD and refactored the entire file architecture, and it was fun.

Better Logging

The project was running. Things were working. But something felt... foggy. I couldn't see what was going on behind the scenes. So I set out to add structured logging, simply to observe what the pipeline was doing.

At the time, I had a massive main.py script with over 1,800 lines of code. It did diarization, embeddings, speaker ID, all of it jammed together.

I started out just wanting better logs. But once you start thinking in terms of observability, you can’t unsee what’s missing.

I needed:

Timestamps
Config snapshots
Segmentation data (how many speakers?)
ECAPA similarity scores
Cluster results (did we match Alex Hormozi correctly?)
Episode hashes for data integrity
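As an illustration of the last item, an episode hash can be as simple as a checksum of the audio file. This is a minimal sketch, not the project's actual code: the function name episode_hash, the choice of SHA-256, and the idea of hashing the raw audio file are all assumptions.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def episode_hash(audio_path: str | Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of an episode's audio file (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(audio_path, "rb") as handle:
        # Read in chunks so a long episode never has to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the same episode is downloaded twice and the two digests differ, the file changed or was corrupted somewhere along the way.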

I discussed it with ChatGPT and Claude, and both proposed swapping out loose log lines for structured JSON.

# Old way
logging.info("Diarization found some speakers")

# New way
structured_logger.log_segmentation(
    step="diarization_complete",
    speakers_detected=3,
    processing_time=45.2,
    memory_efficient=True,
    details={"speaker_labels": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"]}
)
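
The structured_logger object itself is not shown in this post. A minimal sketch of what such a collector could look like follows. The class name StructuredLogger, its fields, and the to_json method are assumptions for illustration; only the log_segmentation call shape comes from the snippet above.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StructuredLogger:
    """Hypothetical collector that stores JSON-ready events for one run."""

    run_id: str
    episode_name: str
    segmentation_logs: list[dict[str, Any]] = field(default_factory=list)

    def log_segmentation(self, step: str, **fields: Any) -> dict[str, Any]:
        # Every event gets a timestamp; the caller supplies the remaining fields.
        event = {"step": step, "timestamp": time.time(), **fields}
        self.segmentation_logs.append(event)
        return event

    def to_json(self) -> str:
        # Serialize the whole run as one machine-readable document.
        return json.dumps(
            {
                "run_id": self.run_id,
                "episode_name": self.episode_name,
                "segmentation_logs": self.segmentation_logs,
            },
            indent=2,
        )


structured_logger = StructuredLogger(run_id="20250626_172050", episode_name="Episode Title")
event = structured_logger.log_segmentation(
    step="diarization_complete",
    speakers_detected=3,
    processing_time=45.2,
)
```

Because each event is a plain dict, adding a new field later does not require changing the logger.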

The result: I can now query the logs directly. It's similar to what I did when I converted raw transcripts into fully structured data. My logs are now structured and machine-readable, so they can be analysed, filtered, or acted upon.

{
  "run_id": "20250626_172050",
  "episode_name": "Episode Title",
  "segmentation_logs": [...],
  "ecapa_logs": [...],
  "cluster_logs": [...],
  "success": true,
  "total_processing_time_seconds": 65.3
}
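
Querying records shaped like the one above takes only a few lines. The sketch below assumes one JSON file per run inside a log directory, which is a guess about the layout. The helper names load_runs and slow_or_failed are hypothetical, while the field names come from the record above.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def load_runs(log_dir: Path) -> Iterator[dict]:
    """Yield one parsed run record per *.json file in log_dir."""
    for path in sorted(log_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def slow_or_failed(runs: Iterable[dict], max_seconds: float = 60.0) -> list[str]:
    """Return the run_id of every run that failed or took longer than max_seconds."""
    return [
        run["run_id"]
        for run in runs
        if not run.get("success", False)
        or run.get("total_processing_time_seconds", 0.0) > max_seconds
    ]
```

Calling slow_or_failed(load_runs(Path("logs"))) would then list the runs worth a closer look, something a grep over free-text log lines cannot do reliably.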

And that’s when it hit me: let's do more cleanup.

Total File Restructure

After asking around, I found out there's no one-size-fits-all standard for structuring files. However, the structure I proposed closely follows best practices from several recognized Python conventions, especially for modular, data-driven projects:

https://github.com/drivendata/cookiecutter-data-science
https://packaging.python.org/en/latest/

Companies like OpenAI, HuggingFace, Meta, and Netflix often use a modular ML pipeline approach in their projects. So, why shouldn't I?

And ChatGPT also shared some extra suggestions:

"Use src/ for all reusable logic. Put CLI entry points in scripts/. Migrate config to configs/."

So before:

askthegame/
├── main.py (1800+ lines)
├── debug_chunks.py
├── create_voiceprint.py
├── temp_audio/
├── logs/
├── deepgram_backups/
└── chaos everywhere

... and after.

askthegame/
├── src/askthegame/
│   ├── audio/         
│   ├── transcription/ 
│   ├── speaker/       
│   ├── embeddings/    
│   ├── database/      
│   ├── pipeline/      
│   └── utils/         
├── scripts/
├── configs/
├── data/
├── tests/
└── docs/

Now everything has a place, and more importantly, it has a purpose.

Modular Design = Sanity

Anyway, I was lost. So many files. Could I break up the 1800-line file? It felt like cleaning out a cluttered garage. So I did it, with some AI help.

Each module now does exactly one thing. Nothing more.

rss_processor.py? Just fetches episodes.
service.py under embeddings/? Just runs OpenAI.
orchestrator.py? Ties it all together.
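
The real orchestrator.py is not shown here. One common way to tie single-purpose modules together is to run them as an ordered list of named stages. The sketch below assumes that design: run_pipeline and the stage names are hypothetical, and the lambdas stand in for the real modules.

```python
from typing import Any, Callable, Iterable

Stage = Callable[[Any], Any]


def run_pipeline(episode: Any, stages: Iterable[tuple[str, Stage]]) -> Any:
    """Pass the episode through each named stage in order."""
    result = episode
    for name, stage in stages:
        try:
            result = stage(result)
        except Exception as exc:
            # Name the failing stage so the error points at a single module.
            raise RuntimeError(f"stage {name!r} failed") from exc
    return result


# Placeholder stages; a real pipeline would call into audio/, speaker/, embeddings/.
stages = [
    ("fetch", lambda title: {"title": title}),
    ("transcribe", lambda ep: {**ep, "transcript": "..."}),
]
result = run_pipeline("Ep 908", stages)
```

Since each stage only sees the previous stage's output, any one of them can be swapped or tested without touching the others.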

And the CLI? Still works:

python scripts/run_pipeline.py --target-episode "Ep 908"
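
A thin entry point that accepts that flag could be built with argparse from the standard library. This is a sketch under assumptions, not the actual contents of scripts/run_pipeline.py: only the --target-episode flag comes from the command above, and the function names are hypothetical.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the podcast pipeline.")
    parser.add_argument(
        "--target-episode",
        help='Title of the episode to process, e.g. "Ep 908".',
    )
    return parser


def main(argv: list[str] | None = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv, which is what a script wants.
    args = build_parser().parse_args(argv)
    # A real entry point would hand args.target_episode to the orchestrator here.
    return args
```

Keeping the script this thin means all reusable logic stays importable from src/, and the CLI is just one of several possible callers.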

And because I had built habits around it, I kept the legacy main.py in place. Backward-compatible. I did make a note to remove it once I'm no longer using it.

Configuration? YAML All the Way

Hardcoded variables just don't sit right with me. What if I need to make a change? Plus, I've heard from developers in the past that hardcoded variables aren't ideal. So, I've made a change.

Before:

TARGET_EPISODE_TITLE = "Ep 908"

After:

# pipeline_config.yaml
target_episode_title: "Ep 908"
max_episodes_per_run: 1

Flexible. Clear. Shareable.
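
Reading such a file back into Python might look like the sketch below. It assumes PyYAML (yaml.safe_load) as the parser. The PipelineConfig class and load_config function are hypothetical names; only the two keys come from the YAML above.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Any, Mapping


@dataclass(frozen=True)
class PipelineConfig:
    target_episode_title: str
    max_episodes_per_run: int = 1

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> PipelineConfig:
        # Unknown keys raise TypeError, so a typo in the YAML surfaces early.
        config = cls(**raw)
        if config.max_episodes_per_run < 1:
            raise ValueError("max_episodes_per_run must be at least 1")
        return config


def load_config(path: str | Path) -> PipelineConfig:
    import yaml  # PyYAML, imported lazily so only file loading depends on it

    with open(path, encoding="utf-8") as handle:
        return PipelineConfig.from_mapping(yaml.safe_load(handle))
```

Validating once at load time means the rest of the pipeline can trust the config object instead of re-checking values.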

Always Be Documenting

During my vibe coding sessions, I've noticed comments at the top of files created by Claude Code. I researched and found that I can benefit from using module-level docstrings in my Python scripts.

So, I reviewed every file and wrote docstrings as I would want to read them six months from now. Not just notes, but proper, developer-grade documentation.

"""
Filename: rss_processor.py

Description:
    RSS feed processor for The Game Podcast ETL.
    Handles metadata extraction and filtering.

Author: Benoît Meunier
Created: 2025-06-26
"""

Every module. Every script. I really love documentation. And because I'm vibe coding, it's an exercise that is helping me understand what each thing does.
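
Checking that every file really has a module-level docstring can be automated with the standard ast module. The helper names below are hypothetical, and the sketch only checks that a docstring exists, not that it follows the template above.

```python
import ast
from pathlib import Path


def has_module_docstring(source: str) -> bool:
    """Return True if the Python source starts with a module-level docstring."""
    return ast.get_docstring(ast.parse(source)) is not None


def missing_docstrings(root: Path) -> list[Path]:
    """List every .py file under root that lacks a module-level docstring."""
    return [
        path
        for path in sorted(root.rglob("*.py"))
        if not has_module_docstring(path.read_text(encoding="utf-8"))
    ]
```

Running missing_docstrings(Path("src")) before a commit would flag any new module that slipped in undocumented.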

The Unexpected Win

What started as a logging improvement ended up transforming, and documenting, the whole project. The original code still runs; nothing broke. But now, everything’s just... better.

Clean, documented, modular, understandable.

– Benoit Meunier