How I Wired a Local LLM Sidecar with a Pure-Stdlib Fallback Chain

TLDR: I wanted a fast local model ranking my tasks. The runtime had to stay pure stdlib. The answer was a sidecar in its own venv — and the real lesson was: build the fallback chain first, or you don't have a system, you have a demo.

what I was building

The Apollo Dashboard is my personal cockpit — a local Python server I launch from the Dock, no cloud, no runtime deps, just python3 and the standard library.

One of the panels ranks my Things3 (my task manager) "Today" list using an LLM. Which task matters most right now? That's actually hard for a heuristic. A model handles it better.

The original backend was Ollama (a local LLM runtime — you pull models and run them on your Mac). Fine. But I wanted faster.

DFlash (MLX speculative decoding on Apple Silicon: an inference trick where a small draft model predicts tokens for a bigger one, which then verifies them in parallel) is genuinely impressive on an M4 Max. I wanted it.

the wall

Here's the problem.

DFlash runs on MLX (Apple's machine-learning framework, designed for Apple Silicon). MLX is a heavy native package. Installing it means pip install mlx, native wheels, the whole stack.

But the dashboard runtime is pure stdlib by design — no pip install ever, nothing beyond what ships with Python. That constraint isn't accidental. It's what keeps the server lightweight, portable, and always-on.

I couldn't pull DFlash into the runtime process. Full stop.

what I tried that didn't work

My first instinct was to relax the constraint. Just... add a requirements.txt. Make an exception.

I spent maybe twenty minutes looking at that option before I put it down.

The whole point of the pure-stdlib constraint is that the runtime never breaks due to a missing dep. The moment I pull in MLX, I've got a server that won't start on a clean machine without a setup step — exactly what I didn't want.

There had to be a cleaner split.

the fix that worked

Run DFlash as a sidecar: its own isolated venv, its own process, launched on-demand, talking to the runtime over HTTP.

Everything the runtime needs to drive the sidecar is already in the standard library: subprocess launches it and urllib talks to it. The runtime POSTs a ranking request; DFlash answers like any local API server. The MLX dependency stack lives entirely inside the sidecar's venv and never touches the main process.
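Here is a minimal sketch of what the runtime side of that HTTP call could look like using only urllib. The port, the /rank path, and the {"tasks": [...]} payload shape are assumptions made for illustration; the post does not spell out the sidecar's actual API.

```python
import json
import urllib.request

# Hypothetical address: the real port and path are not given in this post.
SIDECAR_URL = "http://127.0.0.1:8765/rank"


def build_rank_request(tasks, url=SIDECAR_URL):
    """Build a JSON POST request carrying the task titles to rank."""
    body = json.dumps({"tasks": tasks}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rank_via_sidecar(tasks, url=SIDECAR_URL, timeout=10.0):
    """POST the tasks and return the decoded JSON reply.

    Raises urllib.error.URLError (a subclass of OSError) when nothing
    is listening at the URL, so callers can treat that as "sidecar down".
    """
    request = build_rank_request(tasks, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the only coupling is an HTTP request, the runtime never imports anything from the sidecar's venv.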

Then I built the lifecycle manager: spin the sidecar up on first request, keep it alive while the dashboard is running, tear it down on shutdown.
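The lifecycle can be sketched roughly like this, assuming the sidecar is started with subprocess.Popen from the venv's own interpreter. The class name, the command, and the five-second grace period are all hypothetical, not the dashboard's actual code.

```python
import subprocess
import sys


class Sidecar:
    """Start a child process on first use and stop it on request."""

    def __init__(self, command):
        # Hypothetical example of a command:
        #   ["<sidecar venv>/bin/python", "<sidecar server script>"]
        self.command = command
        self.proc = None

    def ensure_running(self):
        """Spawn the child if it is not already alive; return its PID."""
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(self.command)
        return self.proc.pid

    def stop(self, grace=5.0):
        """Ask the child to exit; force-kill it if it outlives the grace period."""
        if self.proc is None or self.proc.poll() is not None:
            return  # never started, or already gone
        self.proc.terminate()
        try:
            self.proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            self.proc.kill()
            self.proc.wait()
```

Calling ensure_running() from the request path gives the "spin up on first request" behaviour; stop() is what the shutdown path needs to reach.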

That last part bit me. When I killed the dashboard, the DFlash process kept running.

The fix was one commit: fix(dflash): reap sidecar on SIGTERM. It wires a SIGTERM handler into the runtime that signals the sidecar PID to stop before the runtime itself exits. Simple. But you only discover you need it the first time you leave a ghost process running in the background.
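A handler along these lines is one way to do that with the signal module. This is a sketch, not the contents of that commit: the function names are made up, and stop_sidecar stands for whatever callable terminates the child (for a subprocess.Popen object, its terminate method). By default Python does not run atexit hooks or finally blocks when SIGTERM kills the process, which is why the handler turns the signal into a normal sys.exit.

```python
import signal
import subprocess
import sys


def make_sigterm_handler(stop_sidecar):
    """Return a handler that stops the sidecar, then exits the runtime."""

    def _handler(signum, frame):
        stop_sidecar()   # reap the child first
        sys.exit(0)      # raises SystemExit, so finally/atexit cleanup still runs

    return _handler


def install_sigterm_handler(stop_sidecar):
    """Register the handler. signal.signal only works from the main thread."""
    signal.signal(signal.SIGTERM, make_sigterm_handler(stop_sidecar))
```

At startup this would be called once, for example install_sigterm_handler(child.terminate), after which a plain kill of the dashboard also takes the sidecar down.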

the fallback chain

The thing I'm most glad I built: DFlash → local Ollama → heuristic, each with a visible backend badge in the UI.

If the sidecar isn't running yet, fall back to local Ollama. If Ollama isn't responding, fall back to the deterministic heuristic. The user always gets a ranked list. The badge tells them which backend answered.
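The chain itself is small. A sketch, assuming each backend is a function that takes the task list and either returns a ranked list or raises; the function names and the placeholder heuristic (shortest title first) are illustrative only:

```python
def heuristic_rank(tasks):
    """Deterministic last resort. Placeholder rule: shortest title first,
    ties broken alphabetically."""
    return sorted(tasks, key=lambda title: (len(title), title))


def rank_with_fallback(tasks, backends):
    """Try each (badge, rank_fn) pair in order.

    Returns (badge, ranked_tasks) from the first backend that answers,
    so the UI can show which one produced the list.
    """
    last_error = None
    for badge, rank_fn in backends:
        try:
            return badge, rank_fn(tasks)
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            last_error = exc
    raise RuntimeError("every backend failed") from last_error
```

With the backends listed as DFlash, then Ollama, then the heuristic, the last entry has no network dependency, so the caller always gets a list back along with the badge to display.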

This turned out to matter more than the DFlash win itself. Because sometimes DFlash takes a few seconds to warm up. Sometimes you want to test on a machine without MLX. Sometimes you're moving fast and the sidecar is just… not up yet.

A system without a fallback chain isn't resilient — it's fragile with extra steps.

why this matters to me

I benchmarked DFlash before I built the surrounding subsystem. If it hadn't been meaningfully faster, I would've scrapped it. That's the other lesson: front-load the go/no-go test to the cheapest moment — right after the new thing first stands up — not after you've built everything around it.
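That go/no-go check does not need much tooling. A rough sketch of a timing harness, where the 1.5x threshold and the run count are arbitrary assumptions rather than the numbers I used:

```python
import statistics
import time


def median_seconds(fn, runs=5):
    """Call fn `runs` times and return the median wall-clock duration."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def worth_keeping(candidate, baseline, min_speedup=1.5, runs=5):
    """True if the candidate's median latency beats the baseline's
    by at least a factor of min_speedup."""
    baseline_time = median_seconds(baseline, runs)
    candidate_time = median_seconds(candidate, runs)
    return baseline_time >= min_speedup * candidate_time
```

Here candidate and baseline would be zero-argument callables that each send the same ranking prompt to one backend.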

The pure-stdlib constraint forced the sidecar isolation. The sidecar isolation forced me to think about the fallback. The SIGTERM bug forced me to think about lifecycle.

Every constraint in this build made the final design better.

P.S. The selftest cases I added with the SIGTERM fix have already caught two regressions I would've shipped blind. Write the selftests while the bug is fresh in your head, because you won't circle back later.