Trying to Bring Audio Snippets into Chatbots

09 Jul, 2025

I’ve been losing sleep over one UX problem that won’t leave me alone. It kickstarted this whole project. I’ll ask a chatbot a question about something I heard in a podcast, and the answer is… fine. It’s a summary, and it’s kind of helpful, but it’s missing something important.

What I want is that moment. The quote. The voice. The way the host or the guests said it. The nuance is the content, not just the words. Instead of giving me that, most tools simply spit out a summary. What’s worse, and even more common, is that they give me a link to the podcast’s home page. That’s like handing me a library and saying the book is somewhere on one of those shelves.

So I’ve been trying to fix that for myself.

My Guess

The idea I’m playing with is simple: when an AI quotes something from a podcast, I want to be able to listen to just that part right there in the response.

No scrubbing through a 50-minute episode. No chasing timestamps. Just the quote, the clip, and a little context if I want more.

I’ve started mocking this up with Hormozi episodes from The Game podcast. I’ve got little quote cards that show the text, the speaker, the exact time in the episode, and a play button. You hit play and hear exactly the moment he says it. That’s it.

It’s basic. But it feels so much better than what chatbots usually do.
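
The quote card described above can be sketched as a small data structure. This is a minimal sketch, not the actual prototype: the `QuoteCard` name, its fields, and the example URL are all assumptions made for illustration. The `#t=start,end` suffix follows the temporal syntax of the W3C Media Fragments URI spec, which browser audio and video players generally honor for the start time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuoteCard:
    """One playable quote: the fields a card would display (hypothetical shape)."""
    text: str
    speaker: str
    audio_url: str   # assumed URL of the full episode audio
    start_s: float   # where the quote begins, in seconds
    end_s: float     # where the quote ends, in seconds

    def timestamp_label(self) -> str:
        """Format the start time as M:SS, or H:MM:SS past the one-hour mark."""
        hours, rem = divmod(int(self.start_s), 3600)
        minutes, seconds = divmod(rem, 60)
        if hours:
            return f"{hours}:{minutes:02d}:{seconds:02d}"
        return f"{minutes}:{seconds:02d}"

    def clip_url(self) -> str:
        """Append a temporal media fragment so playback starts at the quote."""
        return f"{self.audio_url}#t={self.start_s:.2f},{self.end_s:.2f}"


card = QuoteCard(
    text="(placeholder quote text)",
    speaker="Hormozi",
    audio_url="https://example.com/episode.mp3",  # placeholder, not a real file
    start_s=754.5,
    end_s=771.0,
)
print(card.timestamp_label())
print(card.clip_url())
```

Keeping the start and end times on the card itself means the play button needs nothing beyond the card to know where to begin.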

Why This Is Technically Tricky

Most tools don’t do this because AI isn’t great with time-based media. Most retrieval systems (RAG pipelines) are built for text: PDFs, documents, Notion pages, and similar content. Audio is different.

It’s not structured in tidy paragraphs or divided into clear sections. You can’t just chunk it like text. To make a quote clickable and listenable, you have to:

Transcribe the whole episode
Attach timestamps to the transcript
Keep the audio file synced with the quote
Build a player that starts at precisely the right moment
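
The middle steps above, going from a quote to the exact moment in the audio, can be sketched roughly as follows. It assumes the transcription step produces word-level timestamps (a start and end time per word). The `Word` type and `find_quote_span` function are hypothetical names for illustration, and the matching is a plain exact word-for-word search, so it would miss quotes that the AI has paraphrased or lightly cleaned up.

```python
import re
from typing import NamedTuple, Optional


class Word(NamedTuple):
    """One transcribed word with its position in the audio (assumed format)."""
    text: str
    start_s: float
    end_s: float


def _norm(token: str) -> str:
    """Lowercase and drop punctuation so 'thing,' matches 'thing'."""
    return re.sub(r"[^\w]", "", token.lower())


def find_quote_span(words: list[Word], quote: str) -> Optional[tuple[float, float]]:
    """Return (start_s, end_s) of the first word-for-word match, else None."""
    target = [t for t in (_norm(tok) for tok in quote.split()) if t]
    if not target:
        return None
    spoken = [_norm(w.text) for w in words]
    n = len(target)
    # Slide a window of n words across the transcript and compare.
    for i in range(len(spoken) - n + 1):
        if spoken[i:i + n] == target:
            return words[i].start_s, words[i + n - 1].end_s
    return None


# Made-up transcript fragment with made-up timings.
transcript = [
    Word("So", 10.0, 10.2), Word("here's", 10.2, 10.5), Word("the", 10.5, 10.6),
    Word("thing,", 10.6, 11.0), Word("start", 11.2, 11.5), Word("small.", 11.5, 12.0),
]
print(find_quote_span(transcript, "The thing: start small"))
```

The returned pair is what a player would seek to. In practice it might help to start a fraction of a second earlier so the first word is not clipped, but that padding is a guess rather than something measured.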

Most AI tools stop at the transcript and discard the rest. It’s not that the AI can’t do better; it’s that nobody has built the framework to support this user experience.

What This Unlocks

Once you’ve stitched together quotes and audio, a few cool things become possible.

You can:

Let people save and share podcast moments, not just links
Stack related quotes into themes, like “everything Hormozi says about pricing”
Embed those snippets in blog posts or lessons, where people can listen in context, not just read a weak summary
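
Stacking quotes into themes could look something like the sketch below. It assumes each quote already carries one or more theme tags. How those tags get assigned (by hand, by a model, by keyword) is left open here, and the `Quote` fields and `stack_by_theme` name are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Quote:
    text: str
    episode: str
    start_s: float
    themes: list[str] = field(default_factory=list)  # assumed pre-assigned tags


def stack_by_theme(quotes: list[Quote]) -> dict[str, list[Quote]]:
    """Group quotes under each of their themes, ordered by episode then time."""
    stacks: dict[str, list[Quote]] = defaultdict(list)
    for quote in quotes:
        for theme in quote.themes:
            stacks[theme.lower()].append(quote)
    for stack in stacks.values():
        stack.sort(key=lambda q: (q.episode, q.start_s))
    return dict(stacks)


# Placeholder quotes; the text and episode names are made up.
quotes = [
    Quote("(quote B)", "ep-002", 95.0, ["Pricing"]),
    Quote("(quote A)", "ep-001", 310.0, ["pricing", "offers"]),
    Quote("(quote C)", "ep-001", 42.0, ["offers"]),
]
for theme, stack in stack_by_theme(quotes).items():
    print(theme, [q.text for q in stack])
```

A quote can sit in more than one stack, which seems right for a line that touches both pricing and offers.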

I don't want to just read summaries anymore. I want to hear the source. That matters to me, especially when the tone and delivery often carry the point more than the words themselves.

I'm Prototyping the UI Every Day

I'm exploring various ways to integrate audio into chatbot output. I tried scrollytelling, with a bunch of quotes on the left and audio on the right that auto-plays when its quote becomes active. It was fun to use for a while, until it got really annoying. So many voices at the same time, disconnected, shouting one after the other. Honestly, it was a horrible experience.

I also need to design a better UI for exploring multiple quotes from the same summary. Right now it’s a little clunky. You can listen to each one, but you don’t get a good sense of the bigger picture.

There’s also a bigger challenge I haven’t solved: what does this look like when your source isn’t just podcasts, but a mix of audio, video, articles, and pages from a book? I don’t have an answer yet. However, I suspect the key is to treat each media type with the respect it deserves, rather than flattening everything into text simply because it’s easier.

My UX Obsession

This started as a selfish project. My personal obsession. Feels like it's an untapped problem. I also dislike reading AI summaries that lose the spirit of what was actually said.

Now it’s evolving into something bigger. Maybe even a way to rebuild trust in AI-generated answers. I don’t need another chatbot that rewrites quotes. I need one that plays the clip.

I’m starting to believe this isn’t just a UX improvement. It’s a new way of trusting information. Of learning. Of hearing what matters instead of reading someone else’s remix.

– Benoit Meunier