The Memory Cap That Was Crying Wolf

TLDR: I spent two archive sweeps chasing a token cap that was never actually costing me anything. 9% reduction. The cap was aspirational. My session budget was fine the whole time.

The Setup

Apollo (my AI assistant, running inside Claude Code) uses a tiered memory system to survive between sessions.

The rough shape:

Tier 1 — MEMORY.md: Always loaded. One-line pointers only. 80-line hard cap. The map, not the territory.
Tier 2 — FULL_REFERENCE.md: A shell-script-built concatenation of every top-level memory file. Pulled on demand. Never auto-loaded.
Tier 3 — Semantic search: Apollo's RAG CLI. BM25 + vector + RRF over the whole Obsidian vault when I need fuzzy cross-file recall.
Archive — memory/archive/: Shipped projects, resolved incidents, delivered talks. Out of context by default. Still searchable.

I'd built the archive tier in April, raised the Tier 2 cap from 25K → 50K tokens when the corpus grew legitimately past the original budget, and felt pretty good about it.

Then I opened the build output one day and went: "It's MASSIVE. What is going on."

The Wall

The build script warned "exceeds target." Every time.

FULL_REFERENCE.md had ballooned to 167K tokens. I had set a 50K cap and was sitting at 3× that.

So I did the obvious thing: I panic-archived.

What Did NOT Work

I ran a sweep of shipped, dormant, and scoped memories. Twenty files. Moved them all to memory/archive/, cleaned the ARCHIVE.md index, rebuilt Tier 2.

Result: 167K → 152K tokens. 9%.

Nine percent.

I just moved 20 memories and barely moved the needle.

The reason it didn't work: the bloat wasn't stale work — it was active work. The biggest files were live projects (a law firm client's commitment tracker, a client documentation project, a cancer education business finder tool, an ecommerce business KOL CRM) plus 97 accumulated feedback files at 44K words combined. And feedback files cannot be archived per my own rules — they're always future-facing.

There was nothing to cut that I should actually cut.

The Reframe That Fixed It

Here's what I'd missed the whole time.

Tier 1 is small ON PURPOSE. It loads every single session. Token cost is real, it runs every boot, so I guard it with an 80-line hard cap.

Tier 2 is never auto-loaded. Not once. It's pulled on demand, only when I need depth. So its size taxes… nothing. No token is spent on it unless I explicitly call it.

The 50K cap had been an aspirational number from when the system was lean. It was never a real constraint. The build script was crying wolf.

The actual fix: raise the warning threshold to a realistic number (~150K), stop chasing the red warning, and archive only what genuinely meets the criteria — shipped projects with no future relevance, resolved incidents, delivered talks.

Why This Matters to Me

I almost stripped out load-bearing memories to hit a vanity number.

Before you optimize something, ask: is this actually costing me anything? A budget that never gets spent isn't a problem, it's just a number on a screen. The constraint that matters for Apollo is Tier 1 — because that IS loaded every session. Tier 2 can be as big as it needs to be.

Measure what's loaded. Optimize what hurts. The rest is noise.