I recently had a request at work to set up an additional virtual host on one of the reverse proxy servers to proxy in an application for testing. Unfortunately, all the links needed to be "de-internalized" for the site to be of any use to anyone in the outside world.
Having looked into doing this in the past a bit with Sharepoint, I had come across mod_proxy_html and decided it sounded like a great solution to the problem. Wellll, it almost was. :-) After getting a package together for RHEL5 (and subsequently submitting it to EPEL) I was able to get a seemingly working configuration going pretty quickly. However, it soon became obvious that mod_proxy_html likes to actually modify the HTML instead of simply replacing the text you tell it to. It does its best to generate "correct" (W3C compliant) HTML 4.01 or XHTML. You can specify transitional or not, but the end result was that the code that came out did not look like the code the developers had written. The side effect was the page looking right in Firefox and wrong in IE, or the Flash not loading correctly in one browser, and so on and so on, depending on which doctype we chose for the module to convert to. Good fun. To be fair to the module, the HTML was not valid W3C code and really should have been tweaked, but in the end all the back and forth just made me start looking for another solution that tried to be a little less "smart".
In any case, for comparison's sake, here is the mod_proxy_html configuration I came up with that almost worked:
ProxyRequests off
ProxyPreserveHost off
# We want to redirect people who request / to /index.map
RewriteEngine On
RewriteRule ^/$ /index.map [R]
# Turn the following to "On" for debugging (and then watch error_log)
ProxyHTMLMeta Off
ProxyHTMLLogVerbose on
ProxyHTMLExtended On
ProxyHTMLFixups reset
# If we don't set this, XHTML is used and things look funky in IE.
ProxyHTMLDoctype HTML Legacy
# Custom links for use with mod_proxy_html (a href is probably redundant)
ProxyHTMLLinks input value
ProxyHTMLLinks a href
ProxyHTMLLinks script src
SetOutputFilter proxy-html
RequestHeader unset Accept-Encoding
# Passing this twice to handle multiple occurrences of the string in a URL.
# We also need to use POSIX regular expressions or matching fails miserably.
ProxyHTMLURLMap (.*)http://internal.domain.com:8080/apppath(/?.*) $1http://external.domain.com$2 [R,x,l,e,c]
ProxyHTMLURLMap (.*)http://internal.domain.com:8080/apppath(/?.*) $1http://external.domain.com$2 [R,x,l,e,c]
ProxyHTMLURLMap /apppath/ /
ProxyHTMLURLMap /apppath /
# This will rewrite requests into the base directory, but nothing beneath.
# We'll explicitly proxy those later. I don't know of a way to do this with
# so mod_rewrite to the rescue!
RewriteRule ^/favicon.ico$ http://internal.domain.com:8080/favicon.ico [P,L]
RewriteRule ^/([^/]*)$ http://internal.domain.com:8080/apppath/$1 [P,L]
Anyways, after a lot of tinkering and troubleshooting, I began thinking maybe I could set things up to do the HTML "fixes" differently based on what type of browser was accessing the page. Then I smacked myself upside the head for even considering that route. This was becoming way more work than it should be!
I set out to look for a better solution, and stumbled across mod_sed, released by Sun recently under the Apache license. This also sounded perfect! The developer had posted a lengthy thread about it on the Apache development mailing list, which I read through to see how stable it was. After reading through the thread, however, I realized that, lo and behold, Apache 2.2.7 already had something I could make use of: mod_substitute! Now, RHEL 5 includes only Apache 2.2.3, but fortunately we had built our own RPM of Apache based on the Fedora RPMs and we were at 2.2.9. I quickly loaded the module into my config and came up with the following:
ProxyRequests off
ProxyPreserveHost off
# We want to redirect people who request / to /index.map
RewriteEngine On
RewriteRule ^/$ /index.map [R]
AddOutputFilterByType SUBSTITUTE text/html
Substitute "s|internal.domain.com:8080/apppath|external.domain.com|n"
Substitute "s|/apppath/|/|n"
# This will rewrite requests into the base directory, but nothing beneath.
# We'll explicitly proxy those later. I don't know of a way to do this with
# so mod_rewrite to the rescue!
RewriteRule ^/favicon.ico$ http://internal.domain.com:8080/favicon.ico [P,L]
RewriteRule ^/([^/]*)$ http://internal.domain.com:8080/apppath/$1 [P,L]
Short, sweet and to the point.
I would have liked to give mod_sed a try and gotten it packaged up for Fedora and EPEL, but the above worked perfectly. The application now works as it's supposed to, and the developers don't have to muck with their HTML.
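For anyone adapting the mod_substitute setup above, here is a slightly fuller sketch of how the pieces might sit together. The LoadModule paths, the <Location /> wrapper, and the ProxyPassReverse line are my assumptions for illustration and were not part of the configuration shown earlier. The Accept-Encoding line is carried over from the mod_proxy_html attempt, on the assumption that the backend might otherwise send compressed responses, which a text substitution filter cannot match against.

```apache
# Sketch only: module paths and the <Location /> wrapper are assumptions.
LoadModule substitute_module modules/mod_substitute.so
LoadModule headers_module modules/mod_headers.so

<Location />
    # Ask the backend for uncompressed responses so the filter sees plain text.
    RequestHeader unset Accept-Encoding

    # Run the substitution filter on HTML responses only.
    AddOutputFilterByType SUBSTITUTE text/html

    # The "n" flag treats the pattern as a fixed string, not a regular expression.
    # Rules are applied in the order listed.
    Substitute "s|internal.domain.com:8080/apppath|external.domain.com|n"
    Substitute "s|/apppath/|/|n"
</Location>

# Assumption: the backend may issue redirects that point at its internal URL.
# Substitute only touches response bodies, so Location headers need this.
# ProxyPassReverse does not require a matching ProxyPass, so it can sit
# alongside the RewriteRule ... [P] lines.
ProxyPassReverse / http://internal.domain.com:8080/apppath/
```

Unlike mod_proxy_html, nothing here parses or re-serializes the markup, so whatever the developers wrote reaches the browser unchanged apart from the replaced strings.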