Atom XHTML Content Considered Jerky

Dr. Drang writes about varying support for Atom’s content type="xhtml".

I strongly disliked this part of the Atom spec.

The thing about this feature is that the HTML tags are included as naked tags in the XML. That is, they’re not escaped or placed in a CDATA section.

From the good doctor’s example:

<content type="xhtml" xml:lang="en-US" xml:base="​http://lancemannion.typepad.com/​lance_mannion/">
<div xmlns="​http://www.w3.org/​1999/xhtml"><p><em>Barnes & Noble. Thursday. April 17, 2014. Six forty five p.m.</em></p>…
</content>

Just look at it. This feature is a giant invitation to screwed-up feeds. The HTML — which is probably a blog post, typed by a human — has to be valid XML. People writing scripts to generate these feeds have to make sure they can turn that HTML into valid XML.

Feeds are already screwed-up often enough. This could only make it worse.

The second issue: how can a parser handle this? What an RSS/Atom parser really wants is everything between <content> and </content> as a string.

I remember writing this code for NetNewsWire, which still supports this feature. (I checked. I have no idea if they’ve touched my code or not.)

I’m going from memory, but I’m pretty sure this is what I did.

NetNewsWire used libxml2’s SAX parser, which meant it gets called when the XML parser encounters the beginning and end of a tag and when it encounters a range of characters. There was no easy way — that I know of; correct me if I’m wrong — to tell the parser to treat everything inside a tag (<content> and </content>) as a string when there are XML tags inside that tag.

So I wrote the Atom parser to rebuild the HTML. It maintained a string and pushed stuff on it. If it encountered a tag and (optionally) attributes, it would create a string version and push it. When it encountered characters it would push those. When it encountered the end of a tag it would create a string version and push that. Once the </content> was reached then it had the entire string.

The end result was equivalent HTML, but not necessarily character-for-character the same, since little things like quote characters could change.

It worked. But boy was I coding angry that day.

PS Luckily I didn’t see this feature used very often. Still had to write the code, though.

PPS I still wonder if there’s an easier way. (Using a SAX parser. No way would I give up SAX. Performance and memory use considerations require SAX.)

PPPS For a great list of other ways feeds get screwed up, see Brian’s Stupid Feed Tricks.

21 Apr 2014

Archive