TLDR: My RAG system silently ignores PDFs. Every PDF I'd ever dropped in was invisible — no error, just gone. The fix: extract to .md, clean the junk, reindex the right vault.

The Setup

I've been running Apollo Obsidian RAG (my semantic search system, my AI's memory layer, built on Claude Agent SDK + OpenAI embeddings) over two vaults — my Apollo memory/ folder and my personal Obsidian vault.

It's a great system. Hybrid retrieval, RRF fusion, agentic chunking. I can ask it things and it actually knows.

So naturally I assumed dropping a PDF into the vault meant it was searchable.

It is not.

The Wall

No error. Nothing. The PDF just… didn't show up.

When I finally dug into corpora.py — the file that defines what gets indexed — I saw it immediately: the glob is **/*.md.

That's it.

Only Markdown files. Every PDF I'd ever placed in either vault was completely invisible to RAG, to /mem-search, to everything.

The dangerous failure mode here isn't a crash. It's silence.

The lesson for any builder: audit what your ingester's glob actually matches — not what you assumed it matched. Then test retrieval. Don't trust ingestion.
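Here's a minimal sketch of that audit in Python. It assumes the ingester's pattern is the `**/*.md` glob from corpora.py and that it is applied relative to the vault root. The function names are mine, not part of Apollo Obsidian RAG.

```python
from collections import Counter
from pathlib import Path


def find_unindexed(vault: Path, pattern: str = "**/*.md") -> list[Path]:
    """Return every file under `vault` that the ingest glob does NOT match."""
    matched = {p for p in vault.glob(pattern) if p.is_file()}
    all_files = {p for p in vault.rglob("*") if p.is_file()}
    return sorted(all_files - matched)


def summarize_by_suffix(paths: list[Path]) -> Counter:
    """Count skipped files per extension, e.g. how many PDFs were ignored."""
    return Counter(p.suffix.lower() or "(none)" for p in paths)
```

Calling `summarize_by_suffix(find_unindexed(Path("/path/to/vault")))` gives a per-extension count of everything the index never sees. A non-zero `.pdf` entry is exactly the silent gap described above.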

What I Tried That Didn't Work

My first instinct was "surely pdftotext just handles it."

For the Enneagram RHETI report (a linear, text-heavy PDF), that was completely true. One command:

/opt/homebrew/bin/pdftotext -layout enneagram.pdf enneagram.md

Clean output. Done. Indexed beautifully.

Then I tried the Working Genius report.

That one is heavily designed — branded layout, three-column option boxes, bold gear-letter callouts. The -layout flag tries to preserve spatial position, and on a designed PDF it absolutely loses its mind.

The output had three kinds of junk: the six option-box blocks, each one re-rendered three times; stray isolated gear-letters (W/I/D/G/E/T) scattered through the text; and spaced banner text like APPL I CATI ON, where the extraction had pulled the glyphs apart with spaces.

Feed that to the agentic chunker and you get garbage chunks. Feed garbage chunks to retrieval and you get garbage answers.

The Fix That Actually Worked

Two changes for the Working Genius PDF:

1. Switch to pdftotext's default mode (drop the -layout flag). For multi-column designed documents, the default mode follows reading order better than the spatial layout mode does.

2. Write a state-machine strip pass: a small script that identifies and removes the boilerplate patterns (the repeated option-box blocks, the isolated single gear-letters, the spaced-glyph banner text).
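A strip pass along those lines might look like the sketch below. It is not my actual script. The `BOX_START` marker and the rule that a blank line ends an option box are assumptions for illustration, and `BANNER_WORDS` holds only the one banner quoted above. It also drops every option-box block outright rather than keeping one copy.

```python
import re
from enum import Enum, auto

GEAR_LETTERS = set("WIDGET")
# Banner words that show up with their glyphs spaced apart.
BANNER_WORDS = {"APPLICATION"}
# Hypothetical marker for the first line of an option-box block.
BOX_START = re.compile(r"^\s*Option [A-C]\b")


class State(Enum):
    TEXT = auto()
    IN_BOX = auto()


def strip_boilerplate(text: str) -> str:
    state = State.TEXT
    kept: list[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if state is State.IN_BOX:
            # Drop everything inside the box; a blank line closes it.
            if not stripped:
                state = State.TEXT
            continue
        if BOX_START.match(line):
            state = State.IN_BOX
            continue
        if stripped in GEAR_LETTERS:
            continue  # isolated single gear-letter on its own line
        if " " in stripped and "".join(stripped.split()) in BANNER_WORDS:
            continue  # spaced-glyph banner such as "APPL I CATI ON"
        kept.append(line)
    # Collapse the runs of blank lines that the removals leave behind.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip() + "\n"
```

The banner check joins the tokens and compares against a known word list instead of guessing from letter spacing, so an ordinary all-caps line is left alone.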

Once the text was clean, I wrote it to notes/working-genius.md in the vault, committed it, and reindexed.
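The extract-and-write step can be sketched in Python like this. It assumes pdftotext is on the PATH and relies on its documented behaviour of writing to stdout when the output file is `-`. The `clean` hook and both function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Callable


def build_pdftotext_cmd(pdf: Path, layout: bool = False,
                        binary: str = "pdftotext") -> list[str]:
    """Build a pdftotext command that sends the extracted text to stdout."""
    cmd = [binary]
    if layout:
        cmd.append("-layout")  # keep spatial layout; suits linear PDFs
    cmd += [str(pdf), "-"]     # "-" as the text file means stdout
    return cmd


def pdf_to_note(pdf: Path, note: Path, layout: bool = False,
                clean: Callable[[str], str] = lambda text: text) -> Path:
    """Extract `pdf`, run the optional cleanup hook, and write the note."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf, layout),
        capture_output=True, encoding="utf-8", check=True,
    )
    note.parent.mkdir(parents=True, exist_ok=True)
    note.write_text(clean(result.stdout), encoding="utf-8")
    return note
```

With `layout=False` this matches the mode that worked for the designed PDF. Passing `layout=True` reproduces the command that was fine for the linear one.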

The Second Trap (Yes, There Were Two)

I ran my usual reindex script and queried for Working Genius.

Still nothing.

Turns out my build-full-reference.sh wrapper only reindexes --vault apollo (my memory/ folder). It does NOT touch my personal Obsidian vault at all.

The note I'd written landed in my personal Obsidian vault. So it was, again, invisible.

The correct command:

apollo-rag reindex --vault all

That covers both vaults. Running with --vault apollo alone skips the personal Obsidian vault entirely.

Lesson two: your convenience wrapper probably has a scope assumption baked in. Know what it covers and test the full path.
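One way to test the full path is a small retrieval smoke test run after every reindex. The sketch below is deliberately generic: `search` is a stand-in for whatever query interface your RAG exposes (I'm not showing Apollo's real API here), and it is assumed to return the source paths of the hits.

```python
from typing import Callable, Iterable

# Hypothetical search interface: takes a query, yields source paths of hits.
SearchFn = Callable[[str], Iterable[str]]


def missing_from_retrieval(search: SearchFn,
                           expectations: dict[str, str]) -> list[str]:
    """Return the queries whose expected note never shows up in the results.

    `expectations` maps a query to a substring of the note path that
    should come back for it.
    """
    missing = []
    for query, expected_path in expectations.items():
        hits = list(search(query))
        if not any(expected_path in hit for hit in hits):
            missing.append(query)
    return missing
```

An empty list means every document you care about is reachable. Anything else names the query that exposed a gap, whether the cause is the glob or the wrapper's scope.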

One More Thing About Agentic Chunking

Once the text is clean and indexed, the agentic chunker (Claude Sonnet via Claude Agent SDK) reads the whole note and identifies where topics shift — creating semantically meaningful chunks rather than fixed-window slices.

On very large notes, it occasionally returns non-JSON and falls back to structural chunking. That's graceful degradation, not a failure. The system handles it.
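The fallback pattern is simple enough to sketch. This is not the chunker's actual code: it assumes the model is asked for a JSON array of chunk strings, and it treats "structural chunking" as splitting on blank lines, which may differ from what the real system does.

```python
import json
import re


def structural_chunks(note: str) -> list[str]:
    """Fallback: split the note into paragraph chunks on blank lines."""
    return [part.strip() for part in re.split(r"\n\s*\n", note) if part.strip()]


def parse_chunks(llm_reply: str, note: str) -> list[str]:
    """Use the model's chunk list if it is valid, else fall back."""
    try:
        chunks = json.loads(llm_reply)
    except json.JSONDecodeError:
        return structural_chunks(note)
    # Valid JSON of the wrong shape is treated like a parse failure.
    if isinstance(chunks, list) and chunks and all(isinstance(c, str) for c in chunks):
        return chunks
    return structural_chunks(note)
```

Either branch returns usable chunks, so a malformed reply costs some chunk quality but never drops the note.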

What it can't handle is a PDF it never saw in the first place.

Why This Matters to Me

I built this memory system so Apollo knows me — my personality profiles, my working style, the context behind decisions. The Working Genius report was a meaningful piece of that.

But I'd assumed "in the vault" meant "indexed." It didn't. And I would never have known, because the failure was completely silent.

Now whenever I add a document I care about, I verify retrieval, not just ingestion. They are two different things, and only one of them matters.