Exclude Squid Cache from updatedb.mlocat

One of the Squid servers in a cluster I manage had an unusually high load, while the rest of the Squid servers with a relatively equal amount of connections were within acceptable load levels.

While reviewing the process list, I found that updatedb.mlocat was running during the busiest time of the day. Turns out it is enabled by default in cron.daily on Debian Lenny, and it was using a noticeable amount of CPU resources. Wtf?

It was attempting to index all the files added to the Squid repository. On these servers, I have a separate partition for Squid in /mnt/squid/, and so UpdateDB can’t tell if it’s supposed to keep track of all these files alongside everything else in the system. Luckily, there is an easy solution.

Exclude your Squid cache location via the “PRUNEPATHS” variable in /etc/updatedb.conf (in my case it was /mnt/squid)

I added my Squid Cache location to PRUNEPATHS, then the process dropped off the radar, and server load returned to acceptable levels.

Normalize Accept-Encoding Headers via HAProxy for Squid Cache

I noticed that when browsing one of our heavy traffic website’s under different browsers, I would see completely separate versions of a page depending on which browser I used. We use a cluster of Squid servers to cache the many pages of this particular website, but I believed the only thing that needed to happen was to configure the HAProxy load balancer to balance requests based on URI so each page would always be served by the same Squid server (to optimize hit rate). Apparently not so!

So… what’s the deal?

Apparently, the Vary: Encoding header passed by Squid, in response to the browser requests, is used to make sure browsers receive only pages with encoding, if any, that they support. So, for example, Firefox would tell Squid that it accepts “gzip,deflate” as encoding methods, while Chrome was telling Squid it accepts “gzip,deflate,sdch”. Internet Explorer was only different from Firefox by a single space (“gzip, deflate”), but that was enough for Squid to cache and serve a completely separate object. I felt like this was a waste of resources. The “Big 3” browsers all support gzip and deflate (among other things), so in order to further optimize performance (and hit rate), I decided to normalize the Accept-Encoding headers.

My first attempt was to do something on the Squid servers themselves. However, the “hdr_replace” function I looked into did not have any impact on how Squid actually handles those requests, so I could hdr_replace all day long and each browser would still see separate cache objects for each individual page.

The alternative was to use HAProxy, but it turns out this works well! After a bit of reading, I found the “reqirep” function that allows one to rewrite any header passed down through the back-end servers. It uses regular expressions to do the deed, and after some testing/serverfaulting/luck I ended up adding the following command to the HAProxy backend configuration:

reqirep ^Accept-Encoding:\ gzip[,]*[\ ]*deflate.* Accept-Encoding:\ gzip,deflate

This regex will match for IE (gzip, deflate), Chrome (gzip,deflate,sdhc), and Firefox (gzip,deflate). It should also be noted that Googlebot uses the same header as Firefox. I didn’t bother looking into other browsers, as I was most concerned with the major browsers. If someone wants to contribute a better Regex to get the job done, let me know!

An additional reason normalized Accept-Encoding headers are good is that your cache will be primed much more quickly if everybody that visits your website are retrieving the same version of a page. Less work on your web server farm, database servers, and all that.

Speed is good. Good luck!