TLDR: I spent two archive sweeps chasing a token cap that was never actually costing me anything. 9% reduction. The cap was aspirational. My session budget was fine the whole time.
The Setup
Apollo (my AI assistant, running inside Claude Code) uses a tiered memory system to survive between sessions.
The rough shape:
- Tier 1 —
MEMORY.md: Always loaded. One-line pointers only. 80-line hard cap. The map, not the territory. - Tier 2 —
FULL_REFERENCE.md: A shell-script-built concatenation of every top-level memory file. Pulled on demand. Never auto-loaded. - Tier 3 — Semantic search: Apollo's RAG CLI. BM25 + vector + RRF over the whole Obsidian vault when I need fuzzy cross-file recall.
- Archive —
memory/archive/: Shipped projects, resolved incidents, delivered talks. Out of context by default. Still searchable.
I'd built the archive tier in April, raised the Tier 2 cap from 25K → 50K tokens when the corpus grew legitimately past the original budget, and felt pretty good about it.
Then I opened the build output one day and went: "It's MASSIVE. What is going on."
The Wall
The build script warned "exceeds target." Every time.
FULL_REFERENCE.md had ballooned to 167K tokens. I had set a 50K cap and was sitting at 3× that.
So I did the obvious thing: I panic-archived.
What Did NOT Work
I ran a sweep of shipped, dormant, and scoped memories. Twenty files. Moved them all to memory/archive/, cleaned the ARCHIVE.md index, rebuilt Tier 2.
Result: 167K → 152K tokens. 9%.
Nine percent.
I just moved 20 memories and barely moved the needle.
The reason it didn't work: the bloat wasn't stale work — it was active work. The biggest files were live projects (a law firm client's commitment tracker, a client documentation project, a cancer education business finder tool, an ecommerce business KOL CRM) plus 97 accumulated feedback files at 44K words combined. And feedback files cannot be archived per my own rules — they're always future-facing.
There was nothing to cut that I should actually cut.
The Reframe That Fixed It
Here's what I'd missed the whole time.
Tier 1 is small ON PURPOSE. It loads every single session. Token cost is real, it runs every boot, so I guard it with an 80-line hard cap.
Tier 2 is never auto-loaded. Not once. It's pulled on demand, only when I need depth. So its size taxes… nothing. No token is spent on it unless I explicitly call it.
The 50K cap had been an aspirational number from when the system was lean. It was never a real constraint. The build script was crying wolf.
The actual fix: raise the warning threshold to a realistic number (~150K), stop chasing the red warning, and archive only what genuinely meets the criteria — shipped projects with no future relevance, resolved incidents, delivered talks.
Why This Matters to Me
I almost stripped out load-bearing memories to hit a vanity number.
Before you optimize something, ask: is this actually costing me anything? A budget that never gets spent isn't a problem, it's just a number on a screen. The constraint that matters for Apollo is Tier 1 — because that IS loaded every session. Tier 2 can be as big as it needs to be.
Measure what's loaded. Optimize what hurts. The rest is noise.