This transform is designed to translate an XHTML5 document to the vastly more popular plain HTML syntax. It was born largely of the annoyance and general dirty feeling that comes with trying to generate a working polyglot using XSLT (no direct way to output the terse new HTML5 DOCTYPE, acrobatics necessary to keep <script/> from rendering as a void element when empty, and so forth). If you're using XSLT anyway and aren't limited to only one transform, this may be the converter for you.
The result of the transform, provided appropriate XHTML5 input, is an HTML-syntax HTML5 document whose features include:
- UTF-8 encoding (if it's not overridden by your pipeline somewhere).
  - The meta charset is *not* automatically inserted; it must be provided in the source document.
- The all-important "<!DOCTYPE html>" leader.
- All CDATA blocks resolved to text.
- Unescaped text in selected elements such as <script> and <style>.
- Double-quoted attributes, which may include literal < or > as appropriate.
- Omitted end tag on known void elements, such as <img>.
- Void elements have no self-closing slash.
- Included end tag on all other elements, such as <script></script>.
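The serialization rules in the list above can be sketched in a few lines of Python. This is only an illustration of the rules, not the transform itself (which is written in XSLT). The names `to_html`, `serialize`, `VOID_ELEMENTS` and `RAW_TEXT_ELEMENTS` are made up for the example, the sets are limited to the elements named here plus the standard HTML void elements, and comments, processing instructions and namespace checks are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Void elements of the HTML syntax: no end tag, no self-closing slash.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}
# Elements whose text is written unescaped (assumption: just these two).
RAW_TEXT_ELEMENTS = {"script", "style"}


def escape_text(s):
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(s):
    # Inside a double-quoted attribute only & and " need escaping,
    # so literal < and > pass through unchanged.
    return s.replace("&", "&amp;").replace('"', "&quot;")


def serialize(el):
    # Drop namespace information: "{uri}name" becomes "name".
    name = el.tag.rpartition("}")[2]
    attrs = "".join(
        ' {}="{}"'.format(k.rpartition("}")[2], escape_attr(v))
        for k, v in el.attrib.items()
    )
    html = "<{}{}>".format(name, attrs)
    if name in VOID_ELEMENTS:
        return html
    emit = (lambda s: s) if name in RAW_TEXT_ELEMENTS else escape_text
    html += emit(el.text or "")
    for child in el:
        html += serialize(child) + emit(child.tail or "")
    # Every non-void element gets an explicit end tag, even when empty.
    return html + "</{}>".format(name)


def to_html(xhtml_source):
    # The XML parser has already resolved CDATA blocks to plain text.
    root = ET.fromstring(xhtml_source)
    return "<!DOCTYPE html>\n" + serialize(root)
```

Feeding it a document that contains `<script src="x.js"/>` and `<img src="a.png"/>` should yield `<script src="x.js"></script>` and `<img src="a.png">` respectively.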
This transform is not a validator; for the output to be valid HTML5, the input should be valid (or nearly valid) XHTML5 and also free of constructs that the XHTML syntax allows but the HTML syntax does not. For example, the text "</script>" inside a "<script>" element is unambiguous in XML when wrapped in CDATA or escaped, but in HTML it ends the element prematurely. XML also allows text at the beginning of a comment that would be disallowed in HTML.
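Since the transform does not check for these constructs, a pipeline could screen for them before converting. The sketch below covers only the two cases mentioned above and is not a full implementation of the HTML syntax restrictions; the function names are hypothetical and not part of the transform.

```python
import re

# A "</script" followed by whitespace, "/" or ">" ends a script element
# in the HTML syntax, no matter how it was written in the XML source.
SCRIPT_END = re.compile(r"</script[\t\n\f\r />]", re.IGNORECASE)


def script_text_is_safe(text):
    """True if the script content cannot close its element early."""
    return SCRIPT_END.search(text) is None


def comment_text_is_safe(text):
    """True if the comment does not start with text that HTML forbids.

    XML accepts comments whose text begins with ">" or "->";
    the HTML syntax does not.
    """
    return not text.startswith((">", "->"))
```

A script body such as `document.write("</script>")` fails the first check, while the usual workaround of writing `"<\/script>"` inside the string passes it.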
The input HTML must be in the XHTML namespace, and any embedded SVG or MathML should use the appropriate namespace as well. To refresh your memory:
HTML: http://
SVG: http://
MathML: http://
Elements outside the above namespaces, including elements with no namespace, are ignored, and the transform does not recurse into them.
Output uses the "text" output method rather than "html" or "xml" in order to retain fine control over the result.
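The effect of the namespace rule can be illustrated with a short Python sketch. It assumes the standard W3C namespace URIs for XHTML, SVG and MathML; `kept_elements` and `split_tag` are hypothetical helpers written for this example, not part of the transform.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard W3C namespace URIs for the three vocabularies.
KNOWN_NAMESPACES = {
    "http://www.w3.org/1999/xhtml",
    "http://www.w3.org/2000/svg",
    "http://www.w3.org/1998/Math/MathML",
}


def split_tag(tag):
    """Split ElementTree's "{uri}local" form into (uri, local)."""
    if tag.startswith("{"):
        ns, _, local = tag[1:].partition("}")
        return ns, local
    return "", tag


def kept_elements(el):
    """Yield local names of the elements that survive, depth-first."""
    ns, local = split_tag(el.tag)
    if ns not in KNOWN_NAMESPACES:
        return  # ignored, and its children are never visited
    yield local
    for child in el:
        yield from kept_elements(child)
```

Note that a correctly namespaced element nested inside an ignored one is dropped as well, because the walk never descends into the ignored parent.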
The result document is strictly non-XML since it omits the self-closing slashes on void elements, outputs characters disallowed by XML in attribute data and the character data of certain elements, and strips all namespace information. Crafting polyglot XHTML5 presents a vastly non-trivial problem, and I am in no hurry to solve it in general (substantial demand or arbitrary whims could potentially make me reconsider). Currently, it's probably a better idea just to serve the original XHTML when XML is needed and the converted version when HTML is needed.
If your pipeline supports Java, the tools from the Validator.nu htmlparser distribution are highly recommended as an alternative. Aside from the de facto normative HTML5 parser, the htmlparser suite packs in extremely useful tools for translating HTML to XHTML and back, as well as an XSLT tool that can be run directly on HTML5.
The inspiration for this transform, and a few important bits of its being, originate in nu.validator.