TLDR: Put your TTS behind a proxy URL endpoint. Use <Play>, not <Say>. Swap providers without touching telephony.
The Setup
I built a /call skill for Apollo (my AI executive assistant) so it can dial my cell and read me urgent messages aloud.
The infrastructure is Telnyx (my voice-call provider) with TeXML (Telnyx's call-script XML, basically TwiML) running on a tiny Vercel webhook.
Getting the call to connect took an afternoon. Getting the voice right? That's where it got interesting.
The Wall
My first working /call used <Say voice="Polly.Joanna-Neural"> — AWS Polly's neural voice, baked right into the TeXML response.
I dialed it.
I listened.
My actual, unfiltered reaction: "your voice sucks."
Polly works. It just sounds like a phone tree from 2014. For an assistant I want to trust, that's a non-starter.
What I Tried That Didn't Work
The obvious next move: pre-generate the MP3 with ElevenLabs (my TTS provider, the gold standard for natural voice) in dial.sh on my laptop, upload it somewhere, and point <Play> at it.
The problem? That splits secrets everywhere. My laptop needs ELEVENLABS_API_KEY to generate. Vercel needs access to whatever blob store I'm hosting on. Now I've got API keys in two environments and a file-hosting dependency I didn't want.
More moving parts. More surface area. No thanks.
The Fix That Actually Worked
I added a second endpoint: /api/audio?text=...
It calls ElevenLabs (model eleven_turbo_v2_5, my voice ID) and streams the MP3 bytes back. That's it.
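For illustration, here is a rough sketch of what an endpoint like this might do internally. It is written in Python only as an example; the post does not say what language the webhook uses. It assumes ElevenLabs' public text-to-speech REST route and its xi-api-key header. The function names, the ELEVENLABS_VOICE_ID variable, and the 30-second timeout are placeholders, not the deployed code.

```python
import json
import os
import urllib.request

# ElevenLabs text-to-speech route; the voice ID goes in the path.
ELEVENLABS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def build_tts_request(text: str, api_key: str, voice_id: str) -> urllib.request.Request:
    """Build (but do not send) the text-to-speech request."""
    body = json.dumps({"text": text, "model_id": "eleven_turbo_v2_5"}).encode("utf-8")
    return urllib.request.Request(
        ELEVENLABS_URL.format(voice_id=voice_id),
        data=body,
        method="POST",
        headers={
            "xi-api-key": api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
    )


def synthesize(text: str) -> bytes:
    """Return MP3 bytes for `text`. The key is read from the server environment only."""
    request = build_tts_request(
        text,
        api_key=os.environ["ELEVENLABS_API_KEY"],
        voice_id=os.environ["ELEVENLABS_VOICE_ID"],  # hypothetical variable name
    )
    # This buffers the whole clip instead of streaming it, which is simpler
    # and probably fine for short messages.
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

The handler for /api/audio would then return those bytes with a Content-Type of audio/mpeg.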
Then the TeXML uses <Play> instead of <Say>:
<Response>
<Play>https://my-webhook.vercel.app/api/audio?text=...</Play>
<Hangup/>
</Response>
ELEVENLABS_API_KEY lives in Vercel env vars only. Never touches my laptop. The telephony layer just sees a URL that returns audio — it doesn't care who made it.
Generation is ~1-2s for short text. Telnyx waits up to 30s. Plenty of headroom.
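A small sketch of how the webhook could assemble that TeXML response, again in Python purely for illustration. build_texml and its parameters are hypothetical names. The one non-obvious step is escaping twice: percent-encoding for the query string, then XML-escaping for the document the URL sits in.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def build_texml(audio_base_url: str, text: str) -> str:
    """Return a TeXML document that plays synthesized speech, then hangs up."""
    # Percent-encode the whole message so it is safe inside the query string.
    play_url = f"{audio_base_url}?text={quote(text, safe='')}"
    # XML-escape the URL too, in case it ever contains '&' or '<'.
    return (
        "<Response>\n"
        f"  <Play>{escape(play_url)}</Play>\n"
        "  <Hangup/>\n"
        "</Response>"
    )


print(build_texml("https://my-webhook.vercel.app/api/audio", "Hi, Sam: call me"))
```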
The Bug That Hit After
One more thing nobody warns you about: URL-encode everything in that ?text= param.
Commas, colons, em dashes: anything reserved or non-ASCII in a URL can silently corrupt your TeXML response, and the call just… fails. I burned time on that one. Full percent-encoding, no exceptions.
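To make "full percent-encoding" concrete, here is a minimal Python sketch using the standard library's urllib.parse.quote with safe="", which encodes everything except letters, digits, and the characters - . _ ~. The sample message and the encode_text_param helper are made up for the example.

```python
from urllib.parse import parse_qs, quote, urlsplit


def encode_text_param(text: str) -> str:
    """Percent-encode every character outside the unreserved set (as UTF-8 bytes)."""
    return quote(text, safe="")


# \u2014 is an em dash, one of the characters that caused trouble.
message = "Urgent: call Sam, now \u2014 it's about Q3 & the board"
url = "https://my-webhook.vercel.app/api/audio?text=" + encode_text_param(message)
print(url)

# Round trip: the endpoint should receive exactly the original message.
received = parse_qs(urlsplit(url).query)["text"][0]
assert received == message
```

With safe="" even the slash and ampersand are encoded, so nothing in the message can be mistaken for URL structure.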
Why This Matters to Me
The abstraction boundary is the URL.
Once TTS lives behind GET /api/audio?text=..., I can swap ElevenLabs for anything — a local model, a cheaper provider, a different voice ID — without touching a single line of telephony code. The TeXML doesn't know and doesn't care.
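One way that seam could look inside the endpoint, sketched in Python under assumptions: the TTS_PROVIDER environment variable, the provider names, and every function here are hypothetical. The point is only that the choice of provider is a lookup, and the caller gets bytes back either way.

```python
import os
from typing import Callable, Dict, Optional

# A provider is any function that turns text into MP3 bytes.
TtsProvider = Callable[[str], bytes]


def elevenlabs_tts(text: str) -> bytes:
    raise NotImplementedError("call ElevenLabs here")


def local_tts(text: str) -> bytes:
    raise NotImplementedError("call a local model here")


PROVIDERS: Dict[str, TtsProvider] = {
    "elevenlabs": elevenlabs_tts,
    "local": local_tts,
}


def get_provider(name: Optional[str] = None) -> TtsProvider:
    """Pick a provider by name, defaulting to the TTS_PROVIDER env var."""
    key = name or os.environ.get("TTS_PROVIDER", "elevenlabs")
    try:
        return PROVIDERS[key]
    except KeyError:
        raise ValueError(f"unknown TTS provider: {key!r}") from None


def handle_audio_request(text: str) -> bytes:
    """What GET /api/audio?text=... would run; the caller never learns who answered."""
    return get_provider()(text)
```

Swapping providers then means changing one environment variable. The TeXML and the dialing code stay exactly as they were.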
This is how I think about every AI provider now. Don't bake it in. Build a seam. The day the pricing changes or the voice gets better somewhere else, you want a config change, not a rewrite.
P.S. The /speak skill for on-screen laptop TTS is a completely separate path: that one runs ElevenLabs client-side through Claude Code. Same provider, different seam. Each stays in its lane.