mod_publisher: Parser

mod_publisher uses a SAX parser from libxml2 to parse and process markup on the fly, in the same way as mod_accessibility and mod_proxy_html. Other markup modules also use libxml2, including mod_htnorm (AccessValet) for accessibility analysis, mod_annot for document editing and management, and mod_transform for XSLT/XInclude processing, but they use it in other modes that are less efficient than SAX.
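As an illustration of what processing markup "on the fly" means, here is a minimal sketch of an event-driven SAX parse fed one chunk at a time. It uses Python's standard xml.sax module (backed by expat, not libxml2), and the names TagCounter and count_tags are invented for this example; it is not mod_publisher code.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts start tags as SAX events arrive; no document tree is built."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_tags(chunks):
    """Feed the parser one chunk at a time, much as a filter receives data."""
    parser = xml.sax.make_parser()
    handler = TagCounter()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)   # events fire as soon as enough input is seen
    parser.close()           # signals end of input and reports any pending error
    return handler.counts
```

Because the handler only reacts to events, memory use does not grow with document size, which is the property that makes SAX attractive when many documents are processed concurrently.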

Input

Unlike WebÞing's previous modules, mod_publisher uses different variants of the parser according to the markup being processed:

When processing text/html documents it uses HTMLparser, which parses HTML and is tolerant of tag-soup.
When processing XML document types with Namespace processing enabled, it uses the default XML parser in SAX2 mode. This requires the input to be well-formed XML.
When processing XML document types without Namespace processing enabled, it uses the default XML parser in the (old) SAX1 mode. This is believed to be faster than SAX2, and also requires the input to be well-formed XML.
It may be possible to modify the above using configuration directives. In particular, text/html may be parsed as XML to enable namespace support. When using such an override, it is your responsibility to ensure that the input is well-formed.
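The selection rules above can be summarised in a few lines. This is only a sketch of the decision logic as described here: the function name, its parameters and the returned labels are hypothetical, and the parse_html_as_xml flag stands in for whatever configuration override is actually used.

```python
def choose_parser(content_type, namespaces_enabled, parse_html_as_xml=False):
    """Return a label for the parser variant described in the text.

    All names are hypothetical; they are not mod_publisher identifiers.
    """
    # Ignore parameters such as "; charset=..." when comparing media types.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html" and not parse_html_as_xml:
        return "HTMLparser"  # tolerant of tag soup
    # Everything else is parsed as XML and must be well-formed.
    return "XML/SAX2" if namespaces_enabled else "XML/SAX1"
```

For example, choose_parser("text/html", True) selects the HTML parser, while the same call with parse_html_as_xml=True selects the namespace-aware XML parse.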

Error Correction

The parser is capable of a limited amount of error correction when processing malformed markup. However, in its normal mode, it is less tolerant than is usual amongst browsers. The most common manifestation of this is that broken JavaScript fails, and fragments appear on the page. The underlying reason for the difference is that browsers can devote huge resources (processor cycles and memory) to fixing bad markup. mod_publisher, by contrast, may be processing thousands of documents concurrently, so efficiency is far more important.

A pre-parser to fix malformed markup is also available. This brings mod_publisher's parsing more in line with browsers, at the cost of a significant additional processing overhead (though still much less than a browser). It is enabled using the MLExtendedFixups directive.

Internationalisation

mod_publisher uses the tried-and-tested charset support from mod_proxy_html versions 2.x:

The HTTP headers, where available, always take precedence over other information.
If the first 2-4 bytes are an XML Byte Order Mark (BOM), this is used.
If the document starts with an XML declaration <?xml .... ?>, this determines encoding by XML rules.
If the document contains the HTML hack <meta http-equiv="Content-Type" ...>, any charset declared here is used.
In the absence of any of the above indications, the HTML-over-HTTP default encoding ISO-8859-1 is assumed.
The parser is set to ignore invalid characters, so a malformed input stream will generate glitches (unexpected characters) rather than risk aborting a parse altogether.
Output is always UTF-8, and is marked as such both in the HTTP headers and in the document itself.
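The precedence order above can be sketched as follows. This is an illustration of the rules as listed, not the module's implementation: the function names, the regular expressions and the 1024-byte sniffing window are all assumptions made for the example, and errors="replace" is used as a stand-in for the "ignore invalid characters" behaviour.

```python
import codecs
import re

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
_XML_DECL = re.compile(rb"""<\?xml[^>]*?encoding\s*=\s*["']([\w.:-]+)["']""")
_META_TAG = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)
_HTTP_EQUIV = re.compile(rb"""http-equiv\s*=\s*["']?content-type""", re.IGNORECASE)
_CHARSET = re.compile(rb"""charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)


def detect_charset(http_charset, body):
    """Apply the precedence rules; http_charset is None if no header gave one."""
    if http_charset:                        # 1. HTTP headers take precedence
        return http_charset.lower()
    for bom, name in _BOMS:                 # 2. byte order mark
        if body.startswith(bom):
            return name
    head = body[:1024]                      # assumed sniffing window
    match = _XML_DECL.match(head)           # 3. XML declaration at the very start
    if match:
        return match.group(1).decode("ascii").lower()
    for tag in _META_TAG.findall(head):     # 4. <meta http-equiv="Content-Type" ...>
        if _HTTP_EQUIV.search(tag):
            match = _CHARSET.search(tag)
            if match:
                return match.group(1).decode("ascii").lower()
    return "iso-8859-1"                     # 5. HTML-over-HTTP default


def to_utf8(http_charset, body):
    """Decode with the detected charset and re-encode as UTF-8."""
    charset = detect_charset(http_charset, body)
    # Invalid bytes become U+FFFD glitches instead of aborting the conversion.
    text = body.decode(charset, errors="replace")
    return text.lstrip("\ufeff").encode("utf-8")
```

Note that this sketch raises LookupError if the declared charset is not one Python knows; a production filter would need a policy for unknown labels.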

META support

The HTML <meta http-equiv...> construct defines notional equivalents to HTTP headers. mod_publisher supports conversion of these to real HTTP headers. The MLMeta directive controls whether this is enabled.
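To make the idea concrete, the sketch below collects http-equiv pairs from a document so that they could be emitted as real response headers. It uses Python's standard html.parser rather than libxml2, and MetaHeaderCollector and meta_to_headers are hypothetical names chosen for this example; how mod_publisher merges such values with existing headers is not shown here.

```python
from html.parser import HTMLParser


class MetaHeaderCollector(HTMLParser):
    """Collects (header-name, value) pairs from <meta http-equiv=... content=...>."""

    def __init__(self):
        super().__init__()
        self.headers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("http-equiv")
        content = attr.get("content")
        if name and content is not None:
            self.headers.append((name, content))


def meta_to_headers(html_text):
    collector = MetaHeaderCollector()
    collector.feed(html_text)
    collector.close()
    return collector.headers
```

A document containing <meta http-equiv="Expires" content="0"> would yield the pair ("Expires", "0"); meta elements that carry only a name attribute are ignored.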

Features

A new feature in mod_publisher is DTD support. Except where defined as a macro or handled by a namespace module, each element and attribute in the input may be checked against a DTD and stripped if not valid. The DTD is pre-parsed and cached for speed.
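The stripping behaviour can be pictured with the following sketch, in which a small dictionary stands in for a pre-parsed, cached DTD. Everything here is an assumption made for illustration: the ALLOWED table, the DtdFilter class and the choice to keep the text content of a stripped element are not taken from mod_publisher, and macros and namespace modules are not modelled.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical stand-in for a cached DTD: element name -> permitted attributes.
ALLOWED = {
    "p": {"class"},
    "a": {"href", "title"},
    "em": set(),
}


class DtdFilter(HTMLParser):
    """Re-emits markup, dropping elements and attributes not in the table."""

    def __init__(self, allowed):
        super().__init__()
        self.allowed = allowed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.allowed:
            return  # unknown element: drop the tag (its text is kept, by assumption)
        parts = [tag]
        for name, value in attrs:
            if name in self.allowed[tag]:
                parts.append('{}="{}"'.format(name, escape(value or "", quote=True)))
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def filter_markup(html_text, allowed=ALLOWED):
    flt = DtdFilter(allowed)
    flt.feed(html_text)
    flt.close()
    return "".join(flt.out)
```

The lookup table is built once and reused for every document, which mirrors the point made above about pre-parsing and caching the DTD for speed.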

Another feature not previously implemented in libxml2-based modules is XML namespace support. This is based on mod_xmlns, and is source- and binary-compatible with namespace modules using version 1.0 of the namespace API.

Output

mod_publisher can convert HTML to XHTML or vice versa, using the MLOutputMode directive (less usefully it can also rewrite other XML document types in HTML-like syntax). It is important to set the right output mode when processing text/html or when using a DTD that transforms the output. In other cases, the output will automatically be correct.

Performance

If namespaces are enabled, mod_publisher parses in SAX2 mode; otherwise it uses the older SAX1 mode. Namespace support therefore incurs some overhead, but the parse is a fast, lean libxml2 SAX parse in both cases. Please see the xmlbench benchmarks for more details.