mod_proxy_html Version 3.1 (April 2009).
Rewriting URLs into a proxy's address space is of course the primary purpose of this module. From Version 2.0, this capability has been extended from rewriting HTML URLs to processing scripts and stylesheets that may contain URLs.
Because the module doesn't contain parsers for JavaScript or CSS, this additional processing means we have had to introduce some heuristic parsing. What that means is that the parser cannot automatically distinguish between a URL that should be replaced and one that merely appears as text. It's up to you to match the right things! To help you do this, we have introduced some new features:
The ProxyHTMLExtended directive. The extended processing will only be activated if this is On. The default is Off, which gives you the old behaviour.
Regexp rules for ProxyHTMLURLMap. For example:
ProxyHTMLURLMap url\(http://internal.example.com([^\)]*)\) url(http://proxy.example.com$1) Rihe
This rule relies on the url(...) context to reduce the danger of a match that shouldn't be rewritten. The R flag invokes regexp processing for this rule; i makes the match case-insensitive; while h and e save processing cycles by preventing the match being applied to HTML links and scripting events, where it is clearly irrelevant.
HTML links are those attributes defined by the HTML 4 and XHTML 1 DTDs as of type %URI. For example, the href attribute of the a element.
Rules are applicable provided the h flag is not set.
From Version 3, the definition of links to use is delegated to the system administrator via the ProxyHTMLLinks directive (the accompanying proxy_html.conf configuration file gives you standard HTML4 and XHTML 1, as hardwired in earlier mod_proxy_html versions).
An HTML link always contains exactly one URL. So whenever mod_proxy_html finds a matching ProxyHTMLURLMap rule, it will apply the transformation once and stop processing the attribute. This can be overridden by the l flag, which causes processing of a URL to continue after a rewrite.
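As a rough illustration of the link handling described above, the following sketch declares two link attributes and one rule. The hostname is a placeholder, and the ProxyHTMLLinks lines only mirror the kind of entries found in proxy_html.conf; they are not a complete set.

```apache
# Sketch only: the hostname is a placeholder, not a real deployment.
# Treat the href attribute of a and the src attribute of img as HTML links.
ProxyHTMLLinks a href
ProxyHTMLLinks img src

# Default link behaviour: a matching rule rewrites the attribute once,
# then processing of that attribute stops.
ProxyHTMLURLMap http://internal.example.com/ /

# Variant with the l flag: processing of the URL continues after the rewrite.
# ProxyHTMLURLMap http://internal.example.com/ / l
```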
Scripting events are the contents of event attributes as defined in the HTML4 and XHTML1 DTDs; for example onclick.
Rules are applicable provided the e flag is not set.
From Version 3, the definition of events to use is delegated to the system administrator via the ProxyHTMLEvents directive: see proxy_html.conf.
A scripting event may contain more than one URL, and will contain other
text. So when ProxyHTMLExtended
is On, all applicable rules
will be applied in order until and unless a rule with the L flag
matches. A rule may match more than once, provided the matches do not
overlap, so a URL/pattern that appears more than once is rewritten
every time it matches.
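The ordering behaviour for scripting events might be sketched as follows. The hostnames, the path and the choice of event attributes are illustrative assumptions; proxy_html.conf carries the standard list of events.

```apache
# Sketch only: hostnames and paths are placeholders.
ProxyHTMLExtended On

# Attributes whose contents are treated as scripting events.
ProxyHTMLEvents onclick onmouseover onload

# Rules are tried in order. If this rule matches, the L flag stops
# any later rule from being applied.
ProxyHTMLURLMap http://internal.example.com/app http://proxy.example.com/app L

# Reached only when the rule above did not match.
ProxyHTMLURLMap http://internal.example.com http://proxy.example.com
```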
Embedded scripts and stylesheets are the contents of <script> and <style> elements.
Rules are applicable provided the c flag is not set.
A script or stylesheet may contain more than one URL, and will contain other
text. So when ProxyHTMLExtended
is On, all applicable rules
will be applied in order until and unless a rule with the L flag
matches. A rule may match more than once, provided the matches do not
overlap, so a URL/pattern that appears more than once is rewritten
every time it matches.
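A minimal sketch of rules aimed at, or kept away from, embedded scripts and stylesheets, assuming the same placeholder hostnames used in the url(...) example earlier in this guide:

```apache
# Sketch only: hostnames are placeholders.
ProxyHTMLExtended On

# Regexp rule confined to url(...) contexts. The h and e flags keep it away
# from HTML links and scripting events, leaving scripts and stylesheets.
ProxyHTMLURLMap url\(http://internal.example.com([^\)]*)\) url(http://proxy.example.com$1) Rihe

# The c flag keeps this rule out of <script> and <style> contents.
ProxyHTMLURLMap http://internal.example.com http://proxy.example.com c
```

Narrow, context-bound patterns such as the first rule are the safer choice here, since the module cannot tell a real URL from text that merely looks like one.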
mod_proxy_html uses a SAX parser. This means that the input stream - and hence the output generated - will be normalised in various ways, even where nothing is actually rewritten. To an HTML or XML parser, the document is not changed by normalisation, except as noted below. Exceptions to this may arise where the input stream is malformed, when the output of mod_proxy_html may be undefined. These should of course be fixed at the backend: if mod_proxy_html doesn't work as expected, then neither will browsers in real life, except by coincidence.
Strictly speaking, HTML and XHTML documents are required to have a Formal Public Identifier (FPI), also known as a Document Type Declaration. This references a Document Type Definition (DTD) which defines the grammar/syntax to which the contents of the document must conform.
The parser in mod_proxy_html loses any FPI in the input document, but gives you the option to insert one. You may select either HTML or XHTML (see below), and if your backend is sloppy you may also want to use the "Legacy" keyword to make it declare documents "Transitional". You may also declare a custom DTD, or (if your backend is seriously screwed so no DTD would be appropriate) omit it altogether.
The differences between HTML 4.01 and XHTML 1.0 are essentially negligible, and mod_proxy_html can transform between the two. You can safely select either, regardless of what the backend generates, and mod_proxy_html will apply the appropriate rules in generating output. HTML saves a few bytes.
If you declare a custom DTD, you should specify whether to generate HTML or XHTML syntax in the output. This affects empty elements: HTML <br> vs XHTML <br />.
If you select standard HTML or XHTML, mod_proxy_html 3 will perform some additional fixups of bogus markup. If you don't want this, you can enter a standard DTD using the nonstandard form of ProxyHTMLDoctype, which will then be treated as unknown (no corrections).
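The doctype choices described above might look like this in httpd.conf. This is a sketch under assumptions: it takes the directive to be ProxyHTMLDoctype, and the exact argument syntax of the custom form should be checked against the Configuration Guide before use.

```apache
# Sketch only: pick one form. The directive name is assumed to be ProxyHTMLDoctype.

# Standard XHTML output, declared Transitional via the Legacy keyword.
ProxyHTMLDoctype XHTML Legacy

# Alternative (assumed syntax): a custom declaration, with XML selecting
# XHTML-style syntax for empty elements such as <br />.
# ProxyHTMLDoctype '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">' XML
```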
The parser uses UTF-8 (Unicode) internally, and mod_proxy_html prior to version 3 always generated output as UTF-8. This is supported by all general-purpose web software, and supports more character sets and languages than any other charset.
The character encoding should be declared in HTTP: for example
Content-Type: text/html; charset=latin1
mod_proxy_html has always supported this in its input, and ensured this happens in output. But prior to version 2, it did not fully support detection (sniffing) of the charset when a backend fails to set the HTTP header.
From version 2, mod_proxy_html will detect the encoding of its input as follows:
If the document starts with an XML declaration <?xml .... ?>, this determines encoding by XML rules.
If the document contains <meta http-equiv="Content-Type" ...>, any charset declared here is used.
From Version 3.1 the above is delegated to mod_xml2enc, which also expands charset support and enables you to:
The HTML meta element includes a form
<meta http-equiv="Some-Header" content="some-value">
which should notionally be converted to a real HTTP header by the webserver. In practice, it is more commonly supported in browsers than servers, and is common in constructs such as ClientPull (aka "meta refresh").
The ProxyHTMLMeta
directive supports the server generating
real HTTP headers from these. However, it does not strip them from the
HTML (except for Content-Type, which is removed in case it contains
conflicting charset information).
For additional minor functions of mod_proxy_html, please see the
ProxyHTMLFixups
and ProxyHTMLStripComments
directives in the Configuration Guide.
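For orientation, the directives named in the last two paragraphs might be switched on as follows. The argument values shown are assumptions for illustration; the Configuration Guide remains the reference for what each directive accepts.

```apache
# Sketch only: argument values are assumptions, see the Configuration Guide.

# Generate real HTTP headers from <meta http-equiv=...> elements.
ProxyHTMLMeta On

# Minor functions: one fixup and comment stripping.
ProxyHTMLFixups lowercase
ProxyHTMLStripComments On
```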
From Version 2.1, mod_proxy_html supports a ProxyHTMLLogVerbose directive, to enable verbose logging at LogLevel Info. This is designed to help with setting up your proxy configuration and diagnosing unexpected behaviour; it is not recommended for normal operation, and can be disabled altogether at compile time for extra performance (see the top of the source).
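While setting up a proxy, verbose logging might be enabled with something like the sketch below. It assumes ProxyHTMLLogVerbose takes On or Off like the module's other switches, and it should be removed again for normal operation.

```apache
# Sketch only: for setup and diagnosis, not for normal operation.
# The verbose messages are written at info level, so the server
# log level has to let them through.
LogLevel info
ProxyHTMLLogVerbose On
```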
When verbose logging is enabled, the following messages will be logged:
When ProxyHTMLMeta is enabled, it logs each header/value pair processed.
Whenever a ProxyHTMLURLMap rule matches and causes a rewrite, it is logged. The message contains abbreviated context information: H denotes an HTML link matched; E denotes a match in a scripting event; C denotes a match in an inline script or stylesheet. When the match is a regexp find-and-replace, it is also marked as RX.
Because mod_proxy_html unsets the Content-Length header, it risks losing the performance advantage of HTTP Keep-Alive. It therefore sets up HTTP Chunked Encoding when responding to HTTP/1.1 requests. This enables keep-alive again for HTTP/1.1 agents.
Unfortunately some buggy agents will send an HTTP/1.1 request but choke on an HTTP/1.1 response. Typically you will see numbers before and after, and possibly in the middle of, a page. To work around this, set the force-response-1.0 environment variable in httpd.conf. For example,
BrowserMatch MSIE force-response-1.0