Normalize Accept-Encoding Headers via HAProxy for Squid Cache

I noticed that when browsing one of our heavy traffic website’s under different browsers, I would see completely separate versions of a page depending on which browser I used. We use a cluster of Squid servers to cache the many pages of this particular website, but I believed the only thing that needed to happen was to configure the HAProxy load balancer to balance requests based on URI so each page would always be served by the same Squid server (to optimize hit rate). Apparently not so!

So… what’s the deal?

Apparently, the Vary: Encoding header passed by Squid, in response to the browser requests, is used to make sure browsers receive only pages with encoding, if any, that they support. So, for example, Firefox would tell Squid that it accepts “gzip,deflate” as encoding methods, while Chrome was telling Squid it accepts “gzip,deflate,sdch”. Internet Explorer was only different from Firefox by a single space (“gzip, deflate”), but that was enough for Squid to cache and serve a completely separate object. I felt like this was a waste of resources. The “Big 3” browsers all support gzip and deflate (among other things), so in order to further optimize performance (and hit rate), I decided to normalize the Accept-Encoding headers.

My first attempt was to do something on the Squid servers themselves. However, the “hdr_replace” function I looked into did not have any impact on how Squid actually handles those requests, so I could hdr_replace all day long and each browser would still see separate cache objects for each individual page.

The alternative was to use HAProxy, but it turns out this works well! After a bit of reading, I found the “reqirep” function that allows one to rewrite any header passed down through the back-end servers. It uses regular expressions to do the deed, and after some testing/serverfaulting/luck I ended up adding the following command to the HAProxy backend configuration:

reqirep ^Accept-Encoding:\ gzip[,]*[\ ]*deflate.* Accept-Encoding:\ gzip,deflate

This regex will match for IE (gzip, deflate), Chrome (gzip,deflate,sdhc), and Firefox (gzip,deflate). It should also be noted that Googlebot uses the same header as Firefox. I didn’t bother looking into other browsers, as I was most concerned with the major browsers. If someone wants to contribute a better Regex to get the job done, let me know!

An additional reason normalized Accept-Encoding headers are good is that your cache will be primed much more quickly if everybody that visits your website are retrieving the same version of a page. Less work on your web server farm, database servers, and all that.

Speed is good. Good luck!

Using the ACL in HAProxy for Load Balancing Named Virtual Hosts

Until recently, I wasn’t aware of the ACL system in HAProxy, but once I found it I realized that I have been missing a very important part of load balancing with HAProxy!

While the full configuration settings available for the ACL are listed in the configuration doc, the below example includes the basics that you’ll need to build an HAProxy load balancer that supports multiple host headers.

Here is a quick example haproxy configuration file that uses ACLs:

global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    maxconn 4096
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    option redispatch
    maxconn 2000
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000

frontend http-in
    bind *:80
    acl is_www_example_com hdr_end(host) -i example.com
    acl is_www_domain_com hdr_end(host) -i domain.com
    
    use_backend www_example_com if is_www_example_com
    use_backend www_domain_com if is_www_domain_com
    default_backend www_example_com

backend www_example_com
    balance roundrobin
    cookie SERVERID insert nocache indirect
    option httpchk HEAD /check.txt HTTP/1.0
    option httpclose
    option forwardfor
    server Server1 10.1.1.1:80 cookie Server1
    server Server2 10.1.1.2:80 cookie Server2

backend www_domain_com
    balance roundrobin
    cookie SERVERID insert nocache indirect
    option httpchk HEAD /check.txt HTTP/1.0
    option httpclose
    option forwardfor
    server Server1 192.168.5.1:80 cookie Server1
    server Server2 192.168.5.2:80 cookie Server2

In HAProxy 1.3, the ACL rules are placed in a “frontend” and (depending on the logic) the request is proxied through to any number of “backends”. You’ll notice in our frontend entitled “http-in” that I’m checking the host header using the hdr_end feature. This feature performs a simple check on the host header to see if it ends with the provided argument.

You can find the rest of the Layer 7 matching options by searching for “7.5.3. Matching at Layer 7” in the configuration doc I linked to above. A few of the options I didn’t use but you might find useful are path_beg, path_end, path_sub, path_reg, url_beg, url_end, url_sub, and url_reg. The *_reg commands allow you to perform RegEx matching on the url/path, but there is the usual performance consideration you need to make for RegEx (especially since this is a load balancer).

The first “use_backend” that matches a request will be used, and if none are matched, then HAProxy will use the “default_backend”. You can also combine ACL rules in the “use_backend” statements to match one or more rules. See the configuration doc for more helpful info.

If you’re looking to use HAProxy with SSL, that requires a different approach, and I’ll blog about that soon.

Tips on Load Balancing a Joomla Cluster with HAProxy

For the past several weeks, I have been working with Joomla in a clustered environment. We have a single load-balancer running HAProxy that sends requests to two web servers synchronized with unison. One server is a hybrid and includes both the MySQL database as well as Apache2/PHP5. The other web server is strictly Apache2/PHP5. We have been renting two super fast dedicated servers temporarily until we acquire some new hardware, so I had to make do with what few servers I had.

Update: Having written this blog post almost a full year ago, I have since then completely switched all of my Joomla websites to the automatically scaling website cloud host: Scale My Site. Since doing so, we haven’t had to deal with HAProxy, load balancing, or anything with regard to scaling due to the hosting cloud’s seamlessly clustered environment. I highly recommend anyone reading this article right now to check out cloud hosting to get load balancing/scaling for your Joomla website without breaking a sweat.

The load balancer is located at our own colo. I followed the tutorial on Setting Up A High-Availability Load Balancer (With Failover and Session Support) With HAProxy/Heartbeat On Debian Etch to set up two servers at our colo in an ActivePassive fashion using Heartbeat for redundancy.

Weighted Load Balancing

Since I’m using only two web servers and one needs to serve database requests, I decided to set weights in HAProxy so that the hybrid server receives half as many requests as the dedicated web server. Here is an example of what my haproxy.cfg file contains:

/etc/haproxy.cfg

global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    maxconn 4096
    user haproxy
    group haproxy

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    redispatch
    maxconn 2000
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000

listen webfarm 63.123.123.100:80
    mode http
    balance roundrobin
    cookie SERVERID insert nocache indirect
    option forwardfor
    option httpchk HEAD /check.txt HTTP/1.0

    # Stats
    stats enable
    stats auth admin:password

    # Web Node
    server SBNode1 63.123.123.101:80 cookie Server1 weight 20 check
    # Web + MySQL Node
    server SBNode2 63.123.123.102:80 cookie Server2 weight 10 check

How to Use the Administrator Control Panel in a Joomla Cluster

Many people understand that it’s a super big pain to work with the administrator control panel in a Joomla clustered environment. First of all, you’ll keep getting kicked out every few page requests, even while using sticky/persistent load balancing. Second, working with backend WYSIWYG rich-text editors is nearly impossible. I figured out how to do it, and here’s what I did.

  1. Decide upon the management node
  2. Give the management node a public host entry in DNS (e.g. node1.yourdomain.com)
  3. Open configuration.php for editing
  4. Locate the “live site” variable ($mosConfig_live_site)
  5. Replace with “http://” . $_SERVER[“HTTP_HOST”];
  6. Save

Using the current host as the live site allows you to use node1.yourdomain.com as an access point for the control panel. You can work in the control panel without doing this, but you will run into tons of problems with rich-text editors and custom components that request the live site URL in their underlying code.

Update: Recently, I implemented a load balancing solution using HAProxy that used the ACL system to send all traffic with /administrator/ in the URL to one “master” node, and it provided a way around the Joomla configuration change mentioned above. Check out this blog post for more info.