XHTML to HTML5 in Launchpad

Registered 2013-06-12 by Peter S. May

This transform translates an XHTML5 document into the vastly more popular plain HTML syntax. It was born largely of the annoyance and general dirty feeling that comes with trying to generate a working polyglot using XSLT: there is no direct way to output the terse new HTML5 DOCTYPE, acrobatics are necessary to keep an empty <script/> from rendering as a void element, and so forth. If you're using XSLT anyway and aren't limited to only one transform, this may be the converter for you.

The result of the transform, provided appropriate XHTML5 input, is an HTML-syntax HTML5 document whose features include:

-    UTF-8 encoding (if it's not overridden by your pipeline somewhere).
    -    The meta charset is *not* automatically inserted; it must be provided in the source document.
-    The all-important "<!DOCTYPE html>" leader.
-    All CDATA blocks resolved to text.
-    Unescaped text in selected elements such as <script> and <style>.
-    Double-quoted attributes, which may include literal < or > as appropriate.
-    Omitted end tag on known void elements, such as <img>.
    -    Void elements have no self-closing slash.
-    Included end tag on all other elements, such as <script></script>.
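
The serialization rules listed above can be sketched in a few lines. The following is a minimal illustration in Python, not the transform's own XSLT; the names `to_html5`, `serialize`, `VOID`, and `RAW_TEXT` are hypothetical, the void-element list is taken from the HTML standard rather than from this project, and SVG and MathML handling is left out.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Assumption: void elements as listed in the HTML standard; the transform's
# own list may differ.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Assumption: an illustrative subset of the elements written without escaping.
RAW_TEXT = {"script", "style"}


def escape_text(s):
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(s):
    # Inside double quotes only & and " need escaping; < and > stay literal.
    return s.replace("&", "&amp;").replace('"', "&quot;")


def serialize(el, out):
    """Append the HTML-syntax form of an XHTML element to the list `out`."""
    if not el.tag.startswith("{" + XHTML_NS + "}"):
        return  # not an XHTML element: ignored and not recursed into
    name = el.tag.split("}", 1)[1]
    out.append("<" + name)
    for key, value in el.attrib.items():
        if not key.startswith("{"):  # namespaced attributes are skipped here
            out.append(' %s="%s"' % (key, escape_attr(value)))
    out.append(">")
    if name in VOID:
        return  # no end tag and no self-closing slash
    esc = (lambda s: s) if name in RAW_TEXT else escape_text
    out.append(esc(el.text or ""))  # CDATA sections arrive as plain text
    for child in el:
        serialize(child, out)
        out.append(esc(child.tail or ""))
    out.append("</" + name + ">")  # end tag even when the element is empty


def to_html5(xhtml_source):
    out = ["<!DOCTYPE html>\n"]
    serialize(ET.fromstring(xhtml_source), out)
    return "".join(out)
```

Calling `to_html5` on a small XHTML5 string shows an empty `<script/>` coming out as `<script></script>` and `<br/>` as `<br>`. Note that the default ElementTree parser drops comments, which a real converter would need to carry over.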

This transform is not a validator. For the output to be valid HTML5, the input should be (similar to) valid XHTML5 and also be free of certain constructs that are allowed in the XHTML syntax but not in the HTML syntax. For example, the text "</script>" inside a CDATA section in a "<script>" element is unambiguous in XML, but in HTML it ends the element early and breaks the page. XML also allows text at the beginning of a comment, such as a leading ">", that is disallowed in HTML.
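
Since the transform does not check for these constructs, a pipeline could screen its input beforehand. The sketch below is one way to do that in Python; the function names are hypothetical, and the script check is deliberately simplified (it ignores the HTML parser's special handling of "<!--" inside scripts).

```python
import re


def raw_text_is_safe(text, name="script"):
    """True if `text` can be written unescaped inside <name>...</name>.

    In HTML syntax the element ends at the first "</name" that is followed
    by whitespace, "/" or ">", regardless of letter case.
    """
    pattern = r"</%s[\t\n\f\r />]" % re.escape(name)
    return re.search(pattern, text, re.IGNORECASE) is None


def comment_is_safe(text):
    """True if `text` may appear between <!-- and --> in HTML syntax."""
    if text.startswith(">") or text.startswith("->"):
        return False
    if "<!--" in text or "-->" in text or "--!>" in text:
        return False
    return not text.endswith("<!-")
```

A script that must mention its own end tag can usually be rewritten so the literal sequence never appears, for instance by splitting the string into two concatenated parts.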

The input HTML must be in the XHTML namespace, and any embedded SVG or MathML should use the appropriate namespace as well. To refresh your memory:

    HTML: http://www.w3.org/1999/xhtml
    SVG: http://www.w3.org/2000/svg
    MathML: http://www.w3.org/1998/Math/MathML

Elements in any other namespace, including elements with no namespace, are ignored and not recursed into.
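
To make the namespace rule concrete, here is a small Python sketch that walks a parsed document and reports which elements would be kept. The helper names `split_tag` and `kept_elements` are hypothetical; only the three namespace URIs come from the list above.

```python
import xml.etree.ElementTree as ET

KNOWN_NAMESPACES = {
    "http://www.w3.org/1999/xhtml": "HTML",
    "http://www.w3.org/2000/svg": "SVG",
    "http://www.w3.org/1998/Math/MathML": "MathML",
}


def split_tag(tag):
    """Split an ElementTree tag of the form "{uri}local" into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag  # element with no namespace


def kept_elements(el):
    """Yield (vocabulary, local name) for every element that is kept.

    An element outside the three namespaces is skipped along with its whole
    subtree, even if a descendant is back in a known namespace.
    """
    uri, local = split_tag(el.tag)
    if uri not in KNOWN_NAMESPACES:
        return
    yield KNOWN_NAMESPACES[uri], local
    for child in el:
        yield from kept_elements(child)
```

Running `list(kept_elements(ET.fromstring(source)))` on a document is a quick way to see whether a forgotten `xmlns` declaration would cause part of the page to disappear from the output.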

The output is performed using the "text" output method rather than "html" or "xml" in order to retain fine control over the result.

The result document is strictly non-XML since it omits the self-closing slashes on void elements, outputs characters disallowed by XML in attribute data and the character data of certain elements, and strips all namespace information. Crafting polyglot XHTML5 presents a vastly non-trivial problem, and I am in no hurry to solve it in general (substantial demand or arbitrary whims could potentially make me reconsider). Currently, it's probably a better idea just to serve the original XHTML when XML is needed and the converted version when HTML is needed.

If your pipeline supports Java, the tools from the Validator.nu htmlparser distribution are highly recommended as an alternative. Aside from the de facto normative HTML5 parser, the htmlparser suite packs in extremely useful tools for translating HTML to XHTML and back, as well as an XSLT tool that can be run directly on HTML5.

The inspiration for this transform, and a few important bits of its being, originate in nu.validator.htmlparser.sax.HtmlSerializer from htmlparser-1.4. HtmlSerializer is implemented on SAX, a streaming API, so the actual implementation details for XSLT necessarily differ. Among the borrowed parts are the lists of void, non-escaping, and newline-started elements, the attribute prefix mappings, and the particular choice of which characters in character data and attributes are escaped to entities. This transform is not and does not intend to be a perfect workalike to that module. For example, HtmlSerializer outputs character data (though not elements) even while inside void (self-closing) or foreign (outside HTML, SVG, or MathML namespaces) elements; this transform does not recurse into them at all.
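
The "newline-started" list deserves a brief illustration. An HTML parser drops a single line feed that immediately follows certain start tags, so a serializer writes an extra one to keep content that genuinely begins with a newline intact. The Python sketch below assumes the list is pre, textarea, and listing, which is what the HTML parsing rules single out; the list actually borrowed from HtmlSerializer is not reproduced here and may differ, and `start_tag` is a hypothetical helper.

```python
# Assumption: elements after whose start tag an HTML parser ignores one
# leading line feed, per the HTML parsing rules.
NEWLINE_STARTED = {"pre", "textarea", "listing"}


def start_tag(name):
    """Return the start tag, plus a protective newline where one is needed."""
    tag = "<" + name + ">"
    if name in NEWLINE_STARTED:
        tag += "\n"  # the parser swallows this one, not the content's own
    return tag
```

With this, a `pre` element whose text is "\nkeep" is written as `<pre>`, a newline, then "\nkeep", and parses back with its leading newline preserved.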

Project information

Maintainer: Peter S. May

Driver: Peter S. May

Licence: MIT / X / Expat Licence

Series and milestones

trunk series is the current focus of development.

Code

lp://staging/xhtml-to-html5

Version control system: Bazaar

Programming languages: XSLT

Downloads

XHTML to HTML5 does not have any download files registered with Launchpad.