Ask The Game, the Build Log

Migrated from SQLite to Supabase with pgvector

Remember that speaker memory system I built yesterday? The one that could recognize Alex Hormozi's returning guests with 99.9% accuracy using ECAPA-TDNN embeddings? Read about it here: I Built A Cross-Episode Speaker Memory.

Well, I didn't understand why Claude Code had built it locally, so I asked. Here's what happened:

From SQLite to Supabase

The original system was pretty solid: SQLite + NumPy storage, cosine similarity matching, and temporal boosting for recent episodes. It worked great for our 33 voice embeddings across 7 episodes.

But I kept thinking about scalability. What happens when I have 500 episodes? 1000? A local SQLite approach would start showing its limitations pretty quickly, right? I wasn't completely wrong.

And Claude Code suggested: "Why not use Supabase pgvector, since you're already using it for the chunks?"

What is pgvector?

pgvector is a PostgreSQL extension that adds native vector data types and operations. Instead of storing embeddings as files and doing similarity calculations in Python, we can now do everything at the database level.

Here's what that means in practice:

Before (SQLite + NumPy):

  1. Load all embeddings into memory
  2. Calculate similarities in Python
  3. Sort and filter results manually
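The "before" approach can be sketched roughly like this. A minimal reconstruction, not the actual script: function names and the storage layout (a dict of label to NumPy array) are illustrative, but the shape of the work is the point — everything is loaded into memory and scored in Python.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_matches(query: np.ndarray, stored: dict,
                 threshold: float = 0.7, top_k: int = 5) -> list:
    """Load every embedding, score each in Python, sort and filter manually."""
    scores = [(label, cosine_similarity(query, emb))
              for label, emb in stored.items()]
    matches = [(label, s) for label, s in scores if s > threshold]
    return sorted(matches, key=lambda m: m[1], reverse=True)[:top_k]
```

Fine at 33 embeddings; the cost grows linearly with every embedding ever stored.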

After (Supabase pgvector):

  1. Single SQL query with built-in similarity functions
  2. Database handles all the math
  3. Lightning-fast results with proper indexing

The Migration Process

Setting Up the Schema

First, I created a proper table structure in Supabase:

CREATE TABLE speaker_embeddings_vectors (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    speaker_label TEXT,
    episode_id TEXT NOT NULL,
    embedding_id TEXT NOT NULL UNIQUE,
    embedding vector(192),  -- 192-dimensional ECAPA embeddings
    segment_timestamp REAL NOT NULL,
    duration REAL DEFAULT 0.0,
    confidence REAL DEFAULT 1.0
);

The vector(192) type is the magic here. It stores our ECAPA-TDNN embeddings natively and provides built-in similarity functions.
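One caveat worth noting: to actually get fast similarity queries, pgvector needs an explicit index on the embedding column. A hedged example — the index name and method are my own choice, not necessarily what the project uses:

```sql
-- HNSW index for cosine distance (pgvector >= 0.5.0);
-- ivfflat is the alternative on older versions.
CREATE INDEX speaker_embeddings_vectors_embedding_idx
    ON speaker_embeddings_vectors
    USING hnsw (embedding vector_cosine_ops);
```

Without an index, pgvector falls back to an exact sequential scan, which is still correct, just slower at scale.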

Migrating the Data

The migration script transferred all 33 existing embeddings from the SQLite system to Supabase. Zero data loss, and every voice fingerprint made it across perfectly.

New Query Performance

Here's where things get exciting. What used to require loading everything into memory and doing calculations in Python now becomes a single SQL query:

SELECT 
    speaker_label,
    episode_id,
    1 - (embedding <=> %s) as similarity
FROM speaker_embeddings_vectors 
WHERE 1 - (embedding <=> %s) > 0.7
ORDER BY similarity DESC 
LIMIT 5;

The <=> operator is pgvector's cosine distance operator, so 1 - (embedding <=> query) gives cosine similarity. Super fast, and all the math happens at the database level.
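From Python, the embedding has to be passed as a pgvector text literal (a bracketed list like '[0.1,0.2,...]'). A minimal sketch of how the query above might be parameterized — I'm assuming a psycopg-style driver with %s placeholders, and the helper names are my own:

```python
import numpy as np

MATCH_QUERY = """
SELECT speaker_label,
       episode_id,
       1 - (embedding <=> %s) AS similarity
FROM speaker_embeddings_vectors
WHERE 1 - (embedding <=> %s) > %s
ORDER BY similarity DESC
LIMIT %s;
"""

def to_pgvector_literal(embedding: np.ndarray) -> str:
    """Serialize a NumPy embedding into pgvector's '[x,y,...]' text format."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"

def build_match_params(embedding: np.ndarray,
                       threshold: float = 0.7, limit: int = 5):
    """Return (query, params) ready to hand to cursor.execute()."""
    literal = to_pgvector_literal(embedding)
    return MATCH_QUERY, (literal, literal, threshold, limit)
```

The database does the similarity math; Python only serializes the query vector.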

Performance Improvements

Query Speed

That's a 10x speed improvement right out of the gate.

Memory Usage

Scalability

Mystery Voices Analytics

One of the coolest features of this iteration is the automated reporting of mystery voices. The system generates detailed reports about unidentified speakers:

{
  "total_embeddings": 33,
  "unknown_speakers": 1,
  "episodes_with_unknowns": 4,
  "identification_suggestions": [...]
}

This makes it super easy to manually identify recurring guests and improve the system over time. Useful now? Not sure. Maybe later.
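A report with that shape could be aggregated from the embeddings table in a few lines. This is a sketch, not the actual reporting code: the "UNKNOWN" label prefix and the row format are assumptions of mine.

```python
def mystery_voice_report(rows: list) -> dict:
    """Summarize unidentified speakers from embedding rows.

    Each row is assumed to look like:
    {"speaker_label": "UNKNOWN_1" or "Alex Hormozi", "episode_id": "ep01"}
    """
    unknown_labels = set()
    episodes_with_unknowns = set()
    for row in rows:
        label = row.get("speaker_label") or ""
        if label.startswith("UNKNOWN"):
            unknown_labels.add(label)
            episodes_with_unknowns.add(row["episode_id"])
    return {
        "total_embeddings": len(rows),
        "unknown_speakers": len(unknown_labels),
        "episodes_with_unknowns": len(episodes_with_unknowns),
        "identification_suggestions": [],  # filled in by a later matching pass
    }
```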

Integration with the Pipeline

The speaker memory system now plugs seamlessly into the pipeline. When I process a new episode:

  1. Extract voice segments from Deepgram transcription
  2. Generate ECAPA embeddings for each speaker
  3. Query Supabase for similar voices from previous episodes
  4. Automatically identify returning guests with high confidence
  5. Store new voices for future episode recognition

All of this happens automatically during the standard process.
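Steps 3–5 boil down to one decision per speaker: match against stored voices, or store as new. Here's a sketch of that logic with an in-memory stand-in for the Supabase table (the class and threshold are illustrative; in production the matching is the SQL query shown earlier):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.7  # same cutoff as the SQL query

class SpeakerMemory:
    """In-memory stand-in for the Supabase table, to illustrate the flow."""

    def __init__(self):
        self.known = {}  # speaker_label -> embedding

    def identify_or_store(self, label_hint: str, embedding: np.ndarray) -> str:
        """Return a known speaker's label on a match, else store a new voice."""
        best_label, best_sim = None, 0.0
        for label, stored in self.known.items():
            sim = float(np.dot(embedding, stored) /
                        (np.linalg.norm(embedding) * np.linalg.norm(stored)))
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_sim > SIMILARITY_THRESHOLD:
            return best_label                # returning guest recognized
        self.known[label_hint] = embedding   # new voice, remember for next time
        return label_hint
```

The real pipeline swaps the Python loop for the pgvector query, but the decision structure is the same.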

Storage and Cost Efficiency

Storage Requirements:

Scalability Projection:

Supabase's pricing makes this incredibly cost-effective even at scale.

What's Next?

Just geeking out, but it'd be cool to automatically detect when Alex has a returning guest and flag it for special handling or promotion.

As a user, you could also say, "I loved this episode with this guest, so find me more episodes with them."

I could track which guests appear most often.

With pgvector's speed, I could potentially perform voice recognition in near real-time as episodes go live.

All that is definitely possible now. But like Alex says, watch out for the girl in the red dress. ;)

Fast.

Migrating the speaker memory system from SQLite to Supabase pgvector was one of those upgrades that surpassed my expectations: significant performance boosts, improved scalability, and access to features I couldn't use before.

The voice recognition accuracy stays at 99.9% for returning speakers, but now it’s 10 times faster with less memory use. I'm preparing to scale up to all episodes without any trouble.

All that was pretty fun.