How 'Your Voice Sucks' Fixed My TTS Architecture

TLDR: Put your TTS behind a proxy URL endpoint. Use <Play> not <Say>. Swap providers without touching telephony.

The Setup

I built a /call skill for Apollo (my AI executive assistant) so it can dial my cell and read me urgent messages aloud.

The infrastructure is Telnyx (my voice-call provider) with TeXML (Telnyx's call-script XML, basically TwiML) running on a tiny Vercel webhook.

Getting the call to connect took an afternoon. Getting the voice right? That's where it got interesting.

The Wall

My first working /call used <Say voice="Polly.Joanna-Neural"> — AWS Polly's neural voice, baked right into the TeXML response.

I dialed it.

I listened.

My actual, unfiltered reaction: "your voice sucks."

Polly works. It just sounds like a phone tree from 2014. For an assistant I'm supposed to trust, that's a non-starter.

What I Tried That Didn't Work

The obvious next move: pre-generate the MP3 with ElevenLabs (my TTS provider, the gold standard for natural voice) in dial.sh on my laptop, upload it somewhere, and point <Play> at it.

The problem? That splits secrets everywhere. My laptop needs ELEVENLABS_API_KEY to generate. Vercel needs access to whatever blob store I'm hosting on. Now I've got API keys in two environments and a file-hosting dependency I didn't want.

More moving parts. More surface area. No thanks.

The Fix That Actually Worked

I added a second endpoint: /api/audio?text=...

It calls ElevenLabs (model eleven_turbo_v2_5, my voice ID) and streams the MP3 bytes back. That's it.
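
A minimal sketch of what such an endpoint could look like in TypeScript. The names buildTtsRequest and handleAudio and the env shape are assumptions for illustration, not the actual handler. The request shape follows ElevenLabs' public text-to-speech streaming API.

```typescript
// Sketch only: function names and the env shape are hypothetical.
const ELEVENLABS_TTS = "https://api.elevenlabs.io/v1/text-to-speech";

interface UpstreamRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Describe the ElevenLabs call for a piece of text (pure, so easy to test).
function buildTtsRequest(text: string, voiceId: string, apiKey: string): UpstreamRequest {
  return {
    url: `${ELEVENLABS_TTS}/${encodeURIComponent(voiceId)}/stream`,
    method: "POST",
    headers: {
      "xi-api-key": apiKey,
      "Content-Type": "application/json",
      Accept: "audio/mpeg",
    },
    body: JSON.stringify({ text, model_id: "eleven_turbo_v2_5" }),
  };
}

// GET /api/audio?text=... as a fetch-style handler that pipes the MP3 through.
// A Vercel route file would export this and pass values read from env vars.
async function handleAudio(
  request: Request,
  env: { voiceId: string; apiKey: string },
): Promise<Response> {
  const text = new URL(request.url).searchParams.get("text");
  if (!text) return new Response("missing ?text=", { status: 400 });

  const { url, ...init } = buildTtsRequest(text, env.voiceId, env.apiKey);
  const upstream = await fetch(url, init);
  if (!upstream.ok) return new Response("TTS upstream failed", { status: 502 });

  // Pass the bytes back unchanged so <Play> receives plain audio.
  return new Response(upstream.body, { headers: { "Content-Type": "audio/mpeg" } });
}
```

Keeping the request builder separate from the handler means the provider-specific part can be checked without making a network call.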

Then the TeXML uses <Play> instead of <Say>:

<Response>
  <Play>https://my-webhook.vercel.app/api/audio?text=...</Play>
  <Hangup/>
</Response>

ELEVENLABS_API_KEY lives in Vercel env vars only. Never touches my laptop. The telephony layer just sees a URL that returns audio — it doesn't care who made it.
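
One way the webhook could render that response, sketched in TypeScript. renderTeXml and escapeXml are hypothetical helpers. The escaping step is an assumption about what a careful implementation would want, since a raw & in the audio URL would otherwise produce invalid XML.

```typescript
// Hypothetical helpers for building the TeXML body as a string.

// Escape the characters that are unsafe inside XML element text.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap an audio URL in the Play-then-Hangup script shown above.
function renderTeXml(audioUrl: string): string {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `  <Play>${escapeXml(audioUrl)}</Play>`,
    "  <Hangup/>",
    "</Response>",
  ].join("\n");
}
```

The telephony side only ever receives this string, which is what keeps the TTS provider invisible to it.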

Generation is ~1-2s for short text. Telnyx waits up to 30s. Plenty of headroom.

The Bug That Hit After

One more thing nobody warns you about: URL-encode everything in that ?text= param.

Commas, colons, em dashes: any character that's reserved or non-ASCII in a URL can silently corrupt your TeXML response, and the call just… fails. I burned time on that one. Full percent-encoding, no exceptions.
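
A small sketch of the encoding step, assuming the URL is assembled in TypeScript. buildAudioUrl is a hypothetical helper. encodeURIComponent is the standard built-in, and it percent-encodes commas, colons, spaces, and non-ASCII characters alike.

```typescript
// Hypothetical helper: build the <Play> URL with the text fully percent-encoded.
function buildAudioUrl(baseUrl: string, text: string): string {
  // encodeURIComponent leaves only A-Z a-z 0-9 - _ . ! ~ * ' ( ) untouched.
  return `${baseUrl}/api/audio?text=${encodeURIComponent(text)}`;
}

// "a, b: c" becomes "a%2C%20b%3A%20c"
const exampleUrl = buildAudioUrl("https://my-webhook.vercel.app", "a, b: c");
```

On the receiving side, reading the parameter through URL.searchParams decodes it back to the original text.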

Why This Matters to Me

The abstraction boundary is the URL.

Once TTS lives behind GET /api/audio?text=..., I can swap ElevenLabs for anything — a local model, a cheaper provider, a different voice ID — without touching a single line of telephony code. The TeXML doesn't know and doesn't care.
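
To make the seam concrete, here is a rough sketch of how the swap could sit behind the endpoint. Everything in it is hypothetical: the Synthesize type, the provider names, and the placeholder bodies, which return labeled bytes instead of calling any real API.

```typescript
// Hypothetical seam: a provider only has to turn text into audio bytes.
type Synthesize = (text: string) => Promise<Uint8Array>;

const providers: Record<string, Synthesize> = {
  // Placeholder bodies; real ones would call the provider and return MP3 bytes.
  elevenlabs: async (text) => new TextEncoder().encode(`elevenlabs:${text}`),
  local: async (text) => new TextEncoder().encode(`local:${text}`),
};

// Choose a provider by name, defaulting to ElevenLabs when none is configured.
function pickProvider(name: string | undefined): Synthesize {
  const chosen = providers[name ?? "elevenlabs"];
  if (!chosen) throw new Error(`unknown TTS provider: ${name}`);
  return chosen;
}
```

With something like this, switching providers would come down to one config value read by the endpoint (say, a TTS_PROVIDER env var, also an assumption), while the TeXML keeps pointing at the same URL.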

This is how I think about every AI provider now. Don't bake it in. Build a seam. The day the pricing changes or the voice gets better somewhere else, you want a config change, not a rewrite.

P.S. The /speak skill for on-screen laptop TTS is a completely separate path — that one runs ElevenLabs client-side through Claude Code. Same provider, different seam. Each stays in its lane.