Why Your RAG Ingest Should Know What Kind of Content It's Eating

TLDR: When your RAG corpus has multiple source kinds, make source_kind a first-class column from day one. The unit of rebuild should match the unit of change — or you're paying to re-embed content that hasn't moved.

the build

This is an internal SOP canvas tool — a React Flow canvas that renders standard operating procedures as linked nodes, with a bottom-drawer RAG chat (Cmd+K) so anyone can ask questions and get answers grounded in the actual docs.

The corpus isn't uniform. It's three different kinds of content:

SOP node bodies — the markdown prose that lives on each canvas node
Blueprint stubs — auto-generated via npm run import-make, which takes Make blueprint exports and produces node skeletons
Transcripts — .md or .txt call recordings and meeting notes dropped into the source folder

All of it flows through the same ingest pipeline: chunk → strip → embed → SQLite via OpenAI text-embedding-3-small.

the wall i hit immediately

The first version of npm run ingest did what you'd expect: rebuilt everything.

Every single source file, every kind, chunked and embedded from scratch.

That's fine when you're standing the thing up. It's a problem the first time you drop a new transcript in and watch the pipeline chew through every SOP node body it already processed yesterday.

Embeddings through the OpenAI API aren't free. Wall-clock adds up. And the whole point of a RAG pipeline is that it lives alongside the content — you'll be running it constantly.

the fix that actually worked

One more positional argument: npm run ingest [kind...].

No kind argument? Full rebuild. Pass transcripts? Only the transcript slice gets touched. sop-nodes? Just those.

The key that makes this work is source_kind as a stable column in SQLite from the start. When you want to rebuild a kind, you delete that slice and re-embed it. Everything else stays untouched.

That same session I also swapped the embedding model from Voyage to text-embedding-3-small — unrelated decision, but it happened to land in the same commit. (Sometimes you're just in the engine.)

why the unit of rebuild matters

This is the lesson I want to be clear about, because I've gotten it wrong before:

The unit of rebuild should match the unit of change.

Transcripts change on their own schedule. Blueprint imports happen in batches. SOP node bodies get edited one at a time. These are different clocks.

If your ingest pipeline has no concept of source kind, the only safe option is nuke-and-rebuild — because you can't target a slice you haven't named.

Design it keyed from day one. source_kind, source_path, whatever makes sense for your corpus. It costs almost nothing to add when you're building. It costs real time and money to retrofit after you've got 10,000 chunks in the index.

The pipeline that respects content heterogeneity is the one you'll actually keep running.