Looking at the load this blog is causing, virtually all of its requests are for the syndication feeds, so I decided to off-load the network bandwidth to FeedBurner (recently acquired by Google).
FeedBurner basically checks your site every 30 minutes (you can also manually request an update) and caches the response. Rather than ask my subscribers to change their feed source, I decided to use Apache to redirect all requests for my feeds to FeedBurner, unless it was actually FeedBurner requesting the information.
Getting this to work took a bit of Apache shenanigans, so I thought I would document it here both for myself and for anyone else who needs to do the same. It is also a useful example of how powerful Apache is, particularly as a front-facing server that manages the virtual URL space and links it up to the various web server technologies and platforms behind the scenes (loose coupling between URL and implementation). The blog runs on the Typo engine on Ruby on Rails, listening on port 4000 internally. Our firewall blocks outside access to this port, so we use Apache to proxy it for us (it rewrites any URLs on the way out to be the correct external address). We also use it as an SSL gateway, so we set up all the certificates in just one place.
All configuration in Apache is done in the httpd.conf file. Here is the setup I am using for the blogs subdomain, with notes on what is going on. There is also a similar VirtualHost section for port 443 (SSL access), but posting it here wouldn’t add much. I will give a quick summary of each part, but full details can be found in the excellent Apache documentation.
# Concept First blog
<VirtualHost *:80>
ServerName blogs.conceptfirst.com
DocumentRoot "e:/Blogs/www"
ErrorLog "e:/blogs/logs/error.log"
CustomLog e:/blogs/logs/access.log common
DirectoryIndex index.html
<Directory "e:/Blogs/www">
Options none
AllowOverride None
Order allow,deny
Allow from all
</Directory>
This tells Apache which virtual host name the section is for, where the static files are, where to put log files, etc.
RewriteEngine on
RewriteEngine on tells Apache to apply the rewrite rules that follow. Apache can handle redirects and proxying using the Redirect and ProxyPass directives, but I had issues with the order things were being done in, so I used rewrite rules for everything. Rewrite rules are more flexible and powerful than the individual Redirect or ProxyPass directives, so it’s worth understanding their capabilities in full.
RewriteLogLevel 0
# RewriteLog "e:/rewrite.log"
If you are having problems with your rewrite rules, I’d recommend setting up a log file and turning RewriteLogLevel up to 9. Put it back to 0 when everything is working.
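As a rough sketch, a temporary debugging setup might look like the lines below. The log path is only an example location. These two directives belong to Apache 2.0/2.2; from 2.4 onwards they were replaced by the LogLevel directive with a rewrite:traceN setting.

```apache
# Temporary debugging: log every rewrite decision (level 9 is very verbose).
# The log file path below is just an example location.
RewriteLog "e:/blogs/logs/rewrite.log"
RewriteLogLevel 9
```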
# Redirect feeds to feedburner unless actually feed burner. Only do main ones
RewriteCond %{HTTP_USER_AGENT} !FeedBurner
RewriteRule /xml/rss20/feed.xml$ http://feeds.feedburner.com/ConceptFirst [R,L]
RewriteCond %{HTTP_USER_AGENT} !FeedBurner
RewriteRule /xml/atom10/feed.xml$ http://feeds.feedburner.com/ConceptFirst [R,L]
This is the configuration that tells Apache to redirect all my feed traffic to FeedBurner unless it’s from FeedBurner itself.
- RewriteCond: A condition that applies to the next RewriteRule. It is negated by !
- RewriteRule: A rewriting rule, with a match part (regular expression) and a target
- [R]: Send an HTTP 302 redirect
- [P]: Do an internal proxy
- [L]: Stop applying rewrite rules after this one
So the first two lines tell Apache (in random syntax pseudo-code): IF (HTTP_USER_AGENT does not contain 'FeedBurner') AND (URL ends with '/xml/rss20/feed.xml') THEN SEND_REDIRECT_TO('http://feeds.feedburner.com/ConceptFirst') AND STOP_PROCESSING_RULES
FeedBurner sends a User-Agent HTTP header containing FeedBurner, so that is how we detect it and avoid redirecting it to itself! I am only redirecting my main feeds here; the categorised feeds and individual comment feeds are still handled by the blog engine.
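Because a RewriteCond only applies to the RewriteRule that immediately follows it, the condition has to be repeated for each feed. One possible way to collapse the pair into a single rule is sketched below. The anchored pattern with alternation is an assumption about what would be equivalent. It is untested and is not the configuration running on the site.

```apache
# Redirect both main feeds in one rule unless the request comes from FeedBurner.
# The leading ^ and the escaped \. make the match stricter than the original patterns.
RewriteCond %{HTTP_USER_AGENT} !FeedBurner
RewriteRule ^/xml/(rss20|atom10)/feed\.xml$ http://feeds.feedburner.com/ConceptFirst [R=302,L]
```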
# Make admin secure
RewriteRule /accounts(.*) https://blogs.conceptfirst.com/accounts$1 [R,L]
RewriteRule /admin(.*) https://blogs.conceptfirst.com/admin$1 [R,L]
These two lines use similar rules to make sure the admin parts of the blog are handled through HTTPS, so we don’t get any cleartext passwords floating around on the net. The $1 at the end of the redirect target is the data matched by (.*) in the regular expression. So if the URL is /admin/login, $1 will be /login.
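Note that the patterns above are not anchored, so they would also match a URL that merely contains /admin somewhere in its path. A slightly stricter variant, offered only as a sketch, anchors them with ^. Rules like these belong in the port 80 VirtualHost only. In the port 443 one they would redirect to themselves.

```apache
# Same redirects, but anchored with ^ so that only URLs that *start* with
# /accounts or /admin are sent to HTTPS.
RewriteRule ^/accounts(.*) https://blogs.conceptfirst.com/accounts$1 [R,L]
RewriteRule ^/admin(.*) https://blogs.conceptfirst.com/admin$1 [R,L]
# Example: /admin/login  ->  $1 = /login  ->  https://blogs.conceptfirst.com/admin/login
```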
# Proxy to mongrel for everything but the media directory
RewriteCond $1 !^Media/(.*)
RewriteRule /(.*) http://localhost:4000/$1 [P,L]
This is the rule that actually gets the blog pages from the internal Ruby on Rails app. The condition checks the matched part of the URL to make sure it is not in the Media subdirectory (a static directory that Apache serves itself, containing images for blog entries, etc). Everything else gets passed to the server running on port 4000.
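The [P] flag relies on mod_proxy being loaded alongside mod_rewrite. A sketch of the supporting lines follows, assuming the usual default module layout (the modules/ paths are an assumption, not taken from the configuration above). ProxyPassReverse is the standard directive for fixing up redirect headers that come back from the backend.

```apache
# Server level (outside the VirtualHost): modules needed for [P] proxying.
LoadModule rewrite_module modules/mod_rewrite.so
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so

# Inside the VirtualHost: rewrite Location headers in backend redirects
# so they point at the public address rather than localhost:4000.
ProxyPassReverse / http://localhost:4000/
```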
ExpiresActive On
ExpiresByType text/html "now plus 1 day"
ExpiresByType image/gif "now plus 1 week"
ExpiresByType image/jpeg "now plus 1 week"
ExpiresByType text/css "now plus 1 week"
ExpiresByType image/png "now plus 1 week"
ExpiresByType image/jpg "now plus 1 week"
</VirtualHost>
The Expires stuff is the Apache way of setting the HTTP expires headers when sending responses. I’ve set images to be cached on the client for a week to help reduce bandwidth and load on the server.
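In mod_expires, "now" is a synonym for "access", and there is also a catch-all directive for types that have no rule of their own. A small sketch (the one-hour default is an arbitrary illustrative value):

```apache
ExpiresActive On
# Fallback for any content type without its own ExpiresByType rule.
ExpiresDefault "access plus 1 hour"
# Equivalent to "now plus 1 week": both are measured from the time of the request.
ExpiresByType image/png "access plus 1 week"
```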
I greatly recommend Apache. It’s very easy to set up and very powerful (although it does test your knowledge of regular expressions :-). We’ve used it to proxy Ruby on Rails, ColdFusion, IIS and home-grown web servers, and it’s great for rewriting URLs to make them technology agnostic and nice and RESTful.
What
I’ve moved my blog away from Blogger.com and on to a Typo installation that we host ourselves. Typo is a weblog application that runs on top of Ruby on Rails, the latest poster child of the open source community. I chose a Rails-based blog because I’ve been doing work with Rails recently, I understand it, and I really like it (I’m halfway through writing up an entry about why I like it).
Why
The main reason is that the Blogger feed does not include the comments people make. I personally read most blogs in an aggregator, and without comments being included, it’s only one half of the conversation.
For the last year we have been working on a product for a client that uses Queuing Theory to model how servers operate under load, an area known as ‘Capacity Management’. Performance agents log ongoing behaviour on the server, and then the results are plugged into various mathematical formulae that show how the server will cope if your email doubles, your websites have 10 times as much load, etc.
Here at Concept First we have a data centre hosted server that handles email, FTP, and various clients’ websites. In the spirit of ‘dog-fooding’, I’ve been running monitoring software on our hosted server, and building models of its spare capacity.
The results are: we have stupid amounts of spare capacity. The server is a 1U rack-mounted server from Dell with 1 GB of RAM, mirrored RAID disks and a single 2 GHz processor. It’s a fairly low-spec machine these days; we bought it about 3 years ago. And basically it never does anything. I’ve never seen the CPU load rise above 5%, and the same goes for disk utilisation. Modern hardware is so fast that a few web requests and a bit of email are not giving it much of a workout.
For that reason, I’m happy to run an interpreted language framework for this blog, because we really do have CPU cycles to waste, and I don’t have time to waste. Having a Rails app running on my server full time will also allow me to understand if it’s stable under Windows.
How
Style
Typo is easy to install if you’ve worked with Rails; it’s just a load of files in a directory. You can choose from 2 built-in themes or download lots of others. I’ve taken one of the built-in ones and just changed it to make use of the full screen.
I hate fixed-width websites. What is the point of my lovely 1920-pixel-wide laptop screen if the site is only 500 pixels wide! Adjusting to the display has always been a key strength of HTML. Too many people are still designing for the web as though it was a piece of paper!
However, the move away from fixed width has caused a small bug to appear in IE 6 when using the search box in the top right-hand corner. There is flickering as the AJAX call is done. The issue does not occur in IE 7 or Firefox, so it is no doubt one of the boatload of IE 6 rendering bugs, and I’m not going to pursue it. Life is too short.
Database
Typo supports just about every database, but I’ve just left it running on a simple SQLite database. Typo caches everything, so I don’t really need the hassle of putting it in a real RDBMS. The server is already running SQL Server, MySQL and Oracle XE. Keeping track of which apps use which databases, and where, is a bit of a headache. A file-based database is just easier to manage if you can get away with it.
Webserver
Our server is running, or has run in the past, various different websites built on lots of different technologies: ASP, WebSnap, ColdFusion, Zope, PHP, Perl. To manage all this, we use Apache 2 running on Windows 2000 Server. Apache has been good to us. It’s very flexible, well documented, secure, and scalable. It is our Internet-facing web server. It handles SSL, compression (via mod_deflate) and virtual hosting.
To connect our hosted sites up to the Apache ‘gateway’ we’ve tried various things. We’ve tried Apache modules, ISAPI, FastCGI, SCGI, mod_php.
Apache modules / ISAPI: Nice and fast, but too tightly bound. When Apache versions change a recompile is required. Restart the server and you lose server state, like current sessions. Bugs can jeopardise the server’s stability.
FastCGI: Works well; we have various WebSnap sites running over FastCGI. However, the protocol is complex and binary, and FastCGI as a whole is no longer actively developed.
SCGI: A simpler version of FastCGI. I’ve run a couple of Rails sites on SCGI. The protocol is easier to understand, but it’s another set of config to learn, and not all technologies support SCGI.
With this in mind, we now use simple HTTP proxying. Each application runs its own web server on an internal port number, with the firewall preventing any access to that port from the outside world. Apache then uses its mod_proxy settings to pass the request to the application’s web server, and passes the response back to the web browser.
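A minimal sketch of what such a proxied site could look like with plain mod_proxy directives. The host name is a placeholder, and port 4000 is simply the port used by the blog in the earlier post.

```apache
<VirtualHost *:80>
    # Placeholder host name for illustration only.
    ServerName app.example.com
    # Act as a reverse proxy only; never as an open forward proxy.
    ProxyRequests Off
    # Hand every request to the application's own web server on an internal port.
    ProxyPass / http://localhost:4000/
    # Fix up redirect headers sent back by the application.
    ProxyPassReverse / http://localhost:4000/
</VirtualHost>
```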
Advantages:
- Debugging is simple. Just fire up a copy of Fiddler.
- All applications support SSL, Apache handles it.
- All applications support compression, Apache handles it.
- All applications are logged in the same format, so I can use the same tools.
- Apache can use its redirect rules to enforce my rules (e.g. admin in Typo must be secure).
- Upgrading any individual application has no effect on the others; everything is loosely coupled.
- Most importantly: One set of syntax for me to remember.
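To illustrate the SSL and compression points, here is a hedged sketch of an SSL-terminating virtual host in front of a plain HTTP application. The host name and certificate paths are placeholders, and it assumes mod_ssl, mod_deflate and mod_proxy are loaded.

```apache
<VirtualHost *:443>
    # Placeholder host name for illustration only.
    ServerName app.example.com
    # SSL is terminated here, so the backend application only ever speaks plain HTTP.
    SSLEngine on
    # Placeholder certificate and key locations.
    SSLCertificateFile "conf/ssl/server.crt"
    SSLCertificateKeyFile "conf/ssl/server.key"
    # Compress text responses on the way out (requires mod_deflate).
    AddOutputFilterByType DEFLATE text/html text/plain text/css
    ProxyPass / http://localhost:4000/
    ProxyPassReverse / http://localhost:4000/
</VirtualHost>
```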
It all works beautifully. It’s happily proxying to Rails and to web services exposed through Indy components. It’s fast, and it ‘Just Works’, which is not something you often get to say in the world of computing ...