Why Stripping HTML Tags Isn't Enough: Two CMSes Taught Me This in One Week

TLDR: CMS rich-text fields store content with HTML tags and entity codes. Strip the tags, and the entities survive — literally. You need two passes. Always.

the setup

I was importing data from two different CMSes in the same week.

One was TakeShape (a headless CMS), pulling 859 cancer treatment centers into a cancer education business's Finder app — names, bios, program descriptions.

The other was Shopify, syncing product records into a Supabase database for an ecommerce business's KOL CRM.

Two projects. Two CMSes. One very sneaky shared problem.

what I assumed

Both CMSes store rich text — the kind a content editor types into a WYSIWYG box.

When you pull that out of the API and display it somewhere that isn't an HTML renderer (a form input, a product card, a search index, an AI context window), you need to strip the tags.

So I did. One line:

text.replace(/<[^>]*>/g, '')

Done, I thought. Clean text.

what actually came out

Not clean.

On the Shopify side, a product description for an "Aloe & Beta Glucan & Hyaluronic Acid" supplement came through as:

Aloe &amp; Beta Glucan &amp; Hyaluronic Acid

On the TakeShape side, facility descriptions had   bleeding into paragraph text and apostrophes showing up as '.

The tag strip works. The problem is what's left behind: HTML entities are text nodes, not tags. The regex doesn't touch them. They sit there in your cleaned string and render literally in any surface that doesn't decode HTML.

So what? GREAT question. In an admin dialog, it looks embarrassing. In an AI context window, the model is reading garbled copy and potentially summarizing it wrong. In a search index, & becomes a searchable string that never matches a real query.

the fix

Two passes. Not one.

const cleaned = (p.body_html ?? '')
  .replace(/<[^>]*>/g, '')     // pass 1 — strip tags
  .replace(/&nbsp;/g, ' ')     // pass 2 — decode entities
  .replace(/&amp;/g, '&')
  .replace(/&lt;/g, '<')
  .replace(/&gt;/g, '>')
  .replace(/&quot;/g, '"')
  .replace(/&#39;/g, "'")
  .replace(/&apos;/g, "'")
  .trim() || null

These seven entities cover ~100% of what Shopify's and TakeShape's rich-text editors actually emit. (If some future product uses … or —, add them — but I haven't seen it yet.)

don't forget the backfill

This is the part I almost skipped, and it would have been a mistake.

Fixing the importer only cleans future syncs. The bad data that already landed in Postgres is still there, sitting in description columns full of & and  .

You need a one-time SQL backfill alongside the code fix:

UPDATE products SET description =
  replace(
    replace(description, '&nbsp;', ' '),
  '&amp;', '&')
-- (chain the rest of the entities the same way)
WHERE description ~ '&(amp|lt|gt|quot|nbsp|#39|apos);';

One gotcha: for the apostrophe, use chr(39) instead of a literal ' inside the SQL string — fighting the quoting gets messy fast, and chr(39) is cleaner every time.

Verify the fix actually worked:

SELECT count(*) FROM products WHERE description ~ '&(amp|lt|gt|quot|nbsp|#39|apos);';
-- Expect 0

why this sticks with me

The thing about this bug is how confident you feel after the first pass.

You wrote a regex. You tested it. The tags are gone. Looks fine. Ship it.

But "fine" in your editor is not the same as "fine in the database" or "fine in the AI's context window." The entities don't announce themselves — they just sit there, quietly wrong, until someone screenshots the admin panel or the AI starts echoing   back to a user.

Two different CMSes, two different projects, same week — that's the universe being loud.

Pass one strips. Pass two decodes. Do both.

P.S. Same pattern applies anywhere you pull rich text out of a CMS into a non-HTML surface: search indexes, LLM prompts, mobile app text fields, CSV exports. If a human typed it into a WYSIWYG, assume entities.