mod_proxy_html: Technical Guide

mod_proxy_html from Version 2.4 (Sept. 2004). Updates in Version 3 (Dec. 2006) are highlighted.

URL Rewriting

Rewriting URLs into a proxy's address space is of course the primary purpose of this module. From Version 2.0, this capability has been extended from rewriting HTML URLs to processing scripts and stylesheets that may contain URLs.

Because the module doesn't contain parsers for JavaScript or CSS, this additional processing relies on heuristic parsing. That means the parser cannot automatically distinguish between a URL that should be replaced and one that merely appears as text. It's up to you to match the right things! To help you do this, we have introduced some new features:

  1. The ProxyHTMLExtended directive. The extended processing will only be activated if this is On. The default is Off, which gives you the old behaviour.
  2. Regular Expression match-and-replace. This can be used anywhere, but is most useful where context information can help distinguish URLs that should be replaced and avoid false positives. For example, to rewrite URLs of CSS @import, we might define a rule
    ProxyHTMLURLMap url\(http://internal.example.com([^\)]*)\) url(http://proxy.example.com$1) Rihe
    This explicitly rewrites from one servername to another, and uses regexp memory to match a path and append it unchanged in $1, while using the url(...) context to reduce the danger of a match that shouldn't be rewritten. The R flag invokes regexp processing for this rule; i makes the match case-insensitive; while h and e save processing cycles by preventing the match being applied to HTML links and scripting events, where it is clearly irrelevant.
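
As a rough sketch of how these two features might combine in practice, the fragment below enables extended processing and pairs a plain link rule with the regexp rule above. The /app/ location is purely hypothetical, and SetOutputFilter proxy-html is assumed to be how the filter is inserted in your version; the hostnames are the example names already used above.

```apache
# Hypothetical location; adjust to your own proxy setup.
<Location /app/>
    # Assumed filter name for inserting mod_proxy_html into the output chain.
    SetOutputFilter proxy-html

    # With the default (Off), scripts, stylesheets and events are left untouched.
    ProxyHTMLExtended On

    # Plain string rule for HTML links only: e and c exclude scripting
    # events and embedded scripts/stylesheets.
    ProxyHTMLURLMap http://internal.example.com http://proxy.example.com ec

    # Regexp rule for CSS url(...) contexts, as in the example above.
    ProxyHTMLURLMap url\(http://internal.example.com([^\)]*)\) url(http://proxy.example.com$1) Rihe
</Location>
```

Restricting each rule to the contexts where it is meaningful keeps the heuristic matching as narrow as possible.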

HTML Links

HTML links are those attributes defined by the HTML 4 and XHTML 1 DTDs as of type %URI; for example, the href attribute of the a element. For a full list, see the declaration of linked_elts in pstartElement. Rules are applicable provided the h flag is not set. From Version 3, the definition of links to use is delegated to the system administrator via the ProxyHTMLLinks directive.

An HTML link always contains exactly one URL. So whenever mod_proxy_html finds a matching ProxyHTMLURLMap rule, it will apply the transformation once and stop processing the attribute. This can be overridden by the l flag, which causes processing a URL to continue after a rewrite.
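
For illustration, a Version 3 configuration might declare link attributes along the following lines. This is only an excerpt in the style of the sample configuration distributed with the module, not a complete list, and the final pair of rules (including the /old/ and /new/ paths) is a hypothetical example of the l flag.

```apache
# ProxyHTMLLinks element attribute [attribute ...]
ProxyHTMLLinks a      href
ProxyHTMLLinks img    src longdesc usemap
ProxyHTMLLinks form   action
ProxyHTMLLinks script src for

# Hypothetical two-step rewrite: the l flag lets processing continue, so the
# second rule still sees the URL after the first rule has rewritten it.
ProxyHTMLURLMap http://internal.example.com http://proxy.example.com l
ProxyHTMLURLMap http://proxy.example.com/old/ http://proxy.example.com/new/
```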

Scripting Events

Scripting events are the contents of event attributes as defined in the HTML 4 and XHTML 1 DTDs; for example, onclick. For a full list, see the declaration of events in pstartElement. Rules are applicable provided the e flag is not set. From Version 3, the definition of events to use is delegated to the system administrator via the ProxyHTMLEvents directive.

A scripting event may contain more than one URL, and will contain other text. So when ProxyHTMLExtended is On, all applicable rules will be applied in order until and unless a rule with the L flag matches. A rule may match more than once, provided the matches do not overlap, so a URL/pattern that appears more than once is rewritten every time it matches.
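
A minimal sketch of event handling, assuming Version 3 and a deliberately shortened list of event attributes. The /private/ and /public/ paths are hypothetical and only serve to show the L flag.

```apache
# Version 3: declare which attributes are treated as scripting events.
ProxyHTMLEvents onclick ondblclick onmouseover onload onsubmit

# Events are only processed when extended processing is enabled.
ProxyHTMLExtended On

# Applies to events and embedded scripts, but not to HTML links (h).
ProxyHTMLURLMap http://internal.example.com/ http://proxy.example.com/ h

# Hypothetical rule: if it matches, L stops any later rules being applied.
# h and c confine it to scripting events.
ProxyHTMLURLMap /private/ /public/ hcL
```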

Embedded Scripts and Stylesheets

Embedded scripts and stylesheets are the contents of <script> and <style> elements. Rules are applicable provided the c flag is not set.

A script or stylesheet may contain more than one URL, and will contain other text. So when ProxyHTMLExtended is On, all applicable rules will be applied in order until and unless a rule with the L flag matches. A rule may match more than once, provided the matches do not overlap, so a URL/pattern that appears more than once is rewritten every time it matches.
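
One possible way to limit a rule to embedded scripts and stylesheets is sketched below. It assumes, purely for illustration, that the backend writes URLs as single-quoted string literals; the leading quote then acts as context that makes a false positive less likely.

```apache
ProxyHTMLExtended On

# h and e skip HTML links and scripting events, leaving only the contents
# of <script> and <style> elements.
# The leading ' means the rule only matches where the URL begins a
# single-quoted string literal (a hypothetical backend convention).
ProxyHTMLURLMap "'http://internal.example.com/" "'http://proxy.example.com/" he
```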

Output Transformation

mod_proxy_html uses a SAX parser. This means that the input stream - and hence the output generated - will be normalised in various ways, even where nothing is actually rewritten. To an HTML or XML parser, the document is not changed by normalisation, except as noted below. Exceptions to this may arise where the input stream is malformed, when the output of mod_proxy_html may be undefined. These should of course be fixed at the backend: if mod_proxy_html doesn't work as expected, then neither will browsers in real life, except by coincidence.

FPI (Doctype)

Strictly speaking, HTML and XHTML documents are required to have a Formal Public Identifier (FPI), also known as a Document Type Declaration. This references a Document Type Definition (DTD) which defines the grammar/syntax to which the contents of the document must conform.

The parser in mod_proxy_html loses any FPI in the input document, but gives you the option to insert one. You may select either HTML or XHTML (see below), and if your backend is sloppy you may also want to use the "Legacy" keyword to make it declare documents "Transitional". You may also declare a custom DTD, or (if your backend is seriously screwed so no DTD would be appropriate) omit it altogether.
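
The keyword forms might look as follows. This sketch assumes the directive is named ProxyHTMLDoctype, as in the module's configuration reference; only one form would normally be active in a given context, so the alternative is shown commented out.

```apache
# Declare output as HTML 4.01.
ProxyHTMLDoctype HTML

# Alternative for a sloppy backend: declare output as XHTML 1.0 Transitional.
# ProxyHTMLDoctype XHTML Legacy
```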

HTML vs XHTML

The differences between HTML 4.01 and XHTML 1.0 are essentially negligible, and mod_proxy_html can transform between the two. You can safely select either, regardless of what the backend generates, and mod_proxy_html will apply the appropriate rules in generating output. HTML saves a few bytes.

If you declare a custom DTD, you should specify whether to generate HTML or XHTML syntax in the output. This affects empty elements: HTML <br> vs XHTML <br />.

If you select standard HTML or XHTML, mod_proxy_html 3 will perform some additional fixups of bogus markup. If you don't want this, you can enter a standard DTD using the nonstandard form of ProxyHTMLDoctype, which will then be treated as unknown (no corrections).

Character Encoding

The parser uses UTF-8 (Unicode) internally, and mod_proxy_html prior to version 3 always generates output as UTF-8. This is supported by all general-purpose web software, and supports more character sets and languages than any other charset. Version 3 supports, but does not recommend, other output encodings via the ProxyHTMLCharsetOut directive.
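
If a non-UTF-8 output really is needed in Version 3, the setting might look like this. The choice of ISO-8859-1 is only an example, and the argument is assumed to be a plain charset name.

```apache
# Version 3 only, and not recommended: emit ISO-8859-1 instead of UTF-8.
ProxyHTMLCharsetOut ISO-8859-1
```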

The character encoding should be declared in HTTP: for example
Content-Type: text/html; charset=latin1
mod_proxy_html has always supported this in its input, and ensured this happens in output. But prior to version 2, it did not fully support detecting (sniffing) the charset when a backend fails to set the HTTP header.

From version 2.0, mod_proxy_html will detect the encoding of its input as follows:

  1. The HTTP headers, where available, always take precedence over other information.
  2. If the first 2-4 bytes are an XML Byte Order Mark (BOM), this is used.
  3. If the document starts with an XML declaration <?xml .... ?>, this determines encoding by XML rules.
  4. If the document contains the HTML hack <meta http-equiv="Content-Type" ...>, any charset declared here is used.
  5. In the absence of any of the above indications, the HTML-over-HTTP default encoding ISO-8859-1 or the ProxyHTMLCharsetDefault value is assumed.
  6. The parser is set to ignore invalid characters, so a malformed input stream will generate glitches (unexpected characters) rather than risk aborting a parse altogether.
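
As an example of the fallback in step 5, a default might be configured as sketched here. The scenario and the choice of ISO-8859-15 are assumptions for illustration only.

```apache
# Hypothetical case: the backend sends no charset in HTTP, no BOM, no XML
# declaration and no meta element, but is known to emit ISO-8859-15.
ProxyHTMLCharsetDefault ISO-8859-15
```

Because HTTP headers take precedence, fixing the backend to send a correct Content-Type header remains the more robust option.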

In version 3.0, this remains the default, but internationalisation support is further improved, and is no longer limited to the encodings supported by libxml2.

meta http-equiv support

The HTML meta element includes a form <meta http-equiv="Some-Header" content="some-value"> which should notionally be converted to a real HTTP header by the webserver. In practice, it is more commonly supported in browsers than servers, and is common in constructs such as ClientPull (aka "meta refresh"). The ProxyHTMLMeta directive supports the server generating real HTTP headers from these. However, it does not strip them from the HTML (except for Content-Type, which is removed in case it contains conflicting charset information).
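
A minimal sketch of enabling this behaviour; the Refresh element in the comment is a made-up example of backend markup.

```apache
# Generate real HTTP response headers from <meta http-equiv="..."> elements.
# For instance, a backend page containing
#   <meta http-equiv="Refresh" content="5; url=/next">
# would be expected to produce the response header
#   Refresh: 5; url=/next
# while the meta element itself stays in the HTML.
ProxyHTMLMeta On
```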

Other Fixups

For additional minor functions of mod_proxy_html, please see the ProxyHTMLFixups and ProxyHTMLStripComments directives in the Configuration Guide.

Debugging your Configuration

From Version 2.1, mod_proxy_html supports a ProxyHTMLLogVerbose directive, to enable verbose logging at LogLevel Info. This is designed to help with setting up your proxy configuration and diagnosing unexpected behaviour; it is not recommended for normal operation, and can be disabled altogether at compile time for extra performance (see the top of the source).

When verbose logging is enabled, the following messages will be logged:

  1. In Charset Detection, it will report what charset is detected and how (HTTP rules, XML rules, or HTML rules). Note that, regardless of verbose logging, an error or warning will be logged if an unsupported charset is detected or if no information can be found.
  2. When ProxyHTMLMeta is enabled, it logs each header/value pair processed.
  3. Whenever a ProxyHTMLURLMap rule matches and causes a rewrite, it is logged. The message contains abbreviated context information: H denotes a match in an HTML link; E denotes a match in a scripting event; C denotes a match in an inline script or stylesheet. When the match is a regexp find-and-replace, it is also marked as RX.
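
A setup for a debugging session might look like the following sketch; remember to remove it again for normal operation.

```apache
# Verbose messages are written at LogLevel Info, so the server's log level
# must be at least info for them to appear in the error log.
LogLevel info
ProxyHTMLLogVerbose On
```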

Workarounds for Browser Bugs

Because mod_proxy_html unsets the Content-Length header, it risks losing the performance advantage of HTTP Keep-Alive. It therefore sets up HTTP Chunked Encoding when responding to HTTP/1.1 requests. This enables keep-alive again for HTTP/1.1 agents.

Unfortunately some buggy agents will send an HTTP/1.1 request but choke on an HTTP/1.1 response. Typically you will see numbers (the chunk sizes) before and after, and possibly in the middle of, a page. To work around this, set the force-response-1.0 environment variable in httpd.conf. For example,
BrowserMatch MSIE force-response-1.0