Effective content caching is one of the key features of scalable web sites. Although modern web technologies offer several out-of-the-box caching options, a custom-built cache still provides the best performance.
The primary aim of caching is to speed up processing and reduce load on critical resources. In the case of web sites, the most critical resource is probably going to be the database, so dataset caching is a common way to speed up the web. It is relatively easy to implement, and it can be introduced almost transparently into an existing system. But that is just painting the corpse. First of all, dataset caching requires the same source data to be re-processed over and over, which wastes CPU cycles. Also, a lot of the data in recordsets is never displayed directly, so dataset caching can waste a lot of memory. In fact, I’ve seen sites where this approach caused serious problems by fragmenting memory under heavy load and leaving the application without enough memory to run.
Although a web site will, no doubt, benefit significantly even from a dataset cache, it will scale much better if the cached content is closer to the final product. How much closer depends on the application. The golden rule of web caching is: for caching to be most effective, cache as close to the final product as possible.
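To make the difference concrete, here is a minimal Python sketch (not the ASP/ASP.NET code used on the sites discussed here). The load_rows and render functions and both cache dictionaries are hypothetical stand-ins: one path caches the raw rows and re-renders on every hit, the other caches the finished fragment.

```python
import html

# Hypothetical stand-in for a database query. It returns more columns
# than the page actually displays, as recordsets often do.
def load_rows():
    return [
        {"id": 1, "title": "Caching <basics>", "body": "long text...", "author_id": 7},
        {"id": 2, "title": "Static files", "body": "long text...", "author_id": 9},
    ]

def render(rows):
    """Turn rows into the HTML fragment that is actually sent to the browser."""
    items = "".join(f"<li>{html.escape(r['title'])}</li>" for r in rows)
    return f"<ul>{items}</ul>"

dataset_cache = {}  # holds raw rows; every hit still pays for render()
output_cache = {}   # holds the finished fragment; a hit is a dictionary lookup

def page_from_dataset_cache():
    if "rows" not in dataset_cache:
        dataset_cache["rows"] = load_rows()
    return render(dataset_cache["rows"])

def page_from_output_cache():
    if "page" not in output_cache:
        output_cache["page"] = render(load_rows())
    return output_cache["page"]
```

Both functions return the same page, but the output cache keeps only the text that is displayed and skips the rendering step on every hit.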
Ideally, cache content into static files
The best option for caching, if possible, is to use static files on the disk and allow web servers to publish those files directly. All web servers will process static files very efficiently, and that will also leave more resources for stuff that needs to be processed dynamically. Content on the disk does not take up valuable memory space available for the web application, and it absolutely minimises CPU requirements per request.
Here are the results of a stress test I recently ran on a fairly good web server machine running IIS 6 over a gigabit network. I measured the number of requests per second during a two-minute stress load while serving a couple of files using several techniques.
| Requests/s \ file (KB) | 64 | 32 | 16 | 8 | 4 | 2 | 1 | 0.5 |
|---|---|---|---|---|---|---|---|---|
| Theoretical network capacity | 2,048.00 | 4,096.00 | 8,192.00 | 16,384.00 | 25,600.00 | 51,200.00 | 102,400.00 | 204,800.00 |
- Static – file served directly by IIS from disk
- ASP-SSI – file included using #include SSI directive in an ASP file (turning on the SSI engine)
- ASP-Include – file included using #include SSI directive in an ASP file but with an ASP statement if (true) around the include (so turning on the ASP engine as well)
- ASP-Cache – file read from disk using FSO, but cached in an Application object for up to 10 seconds.
- ASP-Nocache – file read from disk on every request using FSO
- ASPX-SSI – file included using #include SSI directive in an ASPX file
- ASPX-Include – file included using #include SSI directive in an ASPX file, with an additional C# if (true) block around the include
- ASPX-Cache – file read from disk but cached using ASP.NET page caching for 10 seconds
- ASPX-NoCache – file read from disk on each request, page caching turned off
The difference between “static” and “aspx-cache” is the pure overhead of using the ASP.NET engine. For a 4KB file, a file-based cache gives a 38% increase in performance even over the ASP.NET page caching mechanism. Compared to a simple cache based on the Application object in ASP, the performance increase is 150%. And this is just for serving completely static content. A typical web application would pull the content from a database or format it, leading to much bigger differences. Also interesting is the difference between ASP-Include and ASP-SSI, which is effectively the cost of turning on the ASP engine (by having an if (true) statement in the ASP code). The good news is that the difference between those two cases in ASP.NET is negligible.
With larger files we hit the bandwidth bottleneck first, so the benefits of a file-based cache are not that visible. However, with the growing number of Ajax-based sites, pages are getting split into smaller independent requests, and so the overhead becomes quite noticeable. ASP.NET page caching is a great utility given that it requires almost no work to implement, but a file-based cache can also be shared across the server farm, allowing multiple machines to use the same content.
Caching into static files also brings the benefit of automatic support for last-modification timestamps. For browsers with HTTP/1.1 support, if the same files are downloaded often (live Ajax updates), the server can reply with just a “304 Not Modified” status, without sending any file content at all. A cache that works directly from disk files also allows us to use lightweight HTTP servers such as lighttpd to get some extra performance from the same hardware.
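Web servers do this for static files on their own, so nothing needs to be written to get it. Purely to illustrate the mechanism, here is a rough Python sketch of the decision a server makes; the conditional_response function is a hypothetical name, and real servers also handle ETags and other validators.

```python
import os
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime

def conditional_response(path, if_modified_since=None):
    """Return (status, headers, body) for a file, honouring If-Modified-Since."""
    mtime = int(os.stat(path).st_mtime)  # HTTP dates have one-second resolution
    headers = {"Last-Modified": formatdate(mtime, usegmt=True)}
    if if_modified_since:
        try:
            parsed = parsedate_to_datetime(if_modified_since)
            if parsed.tzinfo is None:
                # Dates without a usable zone are treated as UTC here.
                parsed = parsed.replace(tzinfo=timezone.utc)
            since = int(parsed.timestamp())
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        if since is not None and mtime <= since:
            return 304, headers, b""  # the client copy is still current
    with open(path, "rb") as f:
        return 200, headers, f.read()
```

A client that sends back the Last-Modified value it received earlier gets an empty 304 response until the cached file is regenerated.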
Not just for pre-publishing
The file based cache is often used for pre-publishing content, but there is a simple trick to use this technique for on-demand publishing, even for content that must be generated often on the fly. Most web servers I have worked with, including IIS and Apache, will allow us to override the 404 “Not found” handler. This can be used to implement the cache-miss scenario: when users request a file that has not yet been cached, IIS will not find it on the disk and will call the 404 handler; we can then generate the file using dynamic processing (ASP/ASP.NET) and store it to the disk for the next request before sending the content back to the client. The same technique can be used with Apache and PHP.
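The cache-miss flow can be sketched in a few lines of Python (the sites described here would do the same in ASP, ASP.NET or PHP). The handle_cache_miss function, its parameters and the generate callback are hypothetical names. Writing to a temporary file and renaming it is my own addition, so that a concurrent request never sees a half-written file.

```python
import os
import tempfile

def handle_cache_miss(cache_root, relative_path, generate):
    """Run from the 404 handler, i.e. only when the file is not yet on disk.

    generate is a hypothetical callback that builds the content dynamically
    and returns it as bytes. relative_path is assumed to be validated already.
    """
    target = os.path.join(cache_root, relative_path)
    folder = os.path.dirname(target)
    os.makedirs(folder, exist_ok=True)

    content = generate(relative_path)  # the expensive dynamic step

    # Write to a temporary file in the same folder, then rename it into place.
    # Note: mkstemp creates the file readable by its owner only, so a real
    # deployment may need to adjust permissions for the web server.
    fd, tmp_path = tempfile.mkstemp(dir=folder)
    with os.fdopen(fd, "wb") as f:
        f.write(content)
    os.replace(tmp_path, target)

    return content  # sent back to the client for this first request
```

The next request for the same URL finds the file on disk, so the web server serves it directly and the handler is not called again until the file is deleted.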
An important note for this technique is that each request must be identified entirely by the path part of the URL, not by GET parameters. So, for example, instead of using http://myserver/search.aspx?query=fitnesse, we would have to use something like http://myserver/search/fitnesse.query. Then a 404 handler would be set up on the search folder to perform a search based on the active URL path and store the result in the fitnesse.query file. You can generally choose any extension you want, but avoid ASP/ASPX and other standard extensions, to prevent IIS from turning on ASP/ASP.NET processing when serving those files. HTML and TXT extensions are also not a good choice. IE uses “smart” caching by default, so HTML files are downloaded just once per page. If you fire a background Ajax request twice for an HTML file, only the first will actually go to the server. TXT and HTML files may also be cached by transparent proxy systems, so it’s best to avoid using those extensions.
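Inside the 404 handler, the search term then has to be recovered from the path. A minimal Python sketch of that mapping follows; the query_from_url function and its default prefix and extension are assumptions based on the example URL above, and the checks against path separators are my addition so that a crafted URL cannot point outside the cache folder.

```python
from urllib.parse import unquote, urlsplit

def query_from_url(url, prefix="/search/", extension=".query"):
    """Extract the search term from a path-style URL, or return None if it does not match."""
    path = unquote(urlsplit(url).path)
    if not path.startswith(prefix) or not path.endswith(extension):
        return None
    term = path[len(prefix):-len(extension)]
    # Reject empty terms and anything that could escape the cache folder.
    if not term or "/" in term or "\\" in term or term in (".", ".."):
        return None
    return term
```

For http://myserver/search/fitnesse.query this returns fitnesse, while the old search.aspx?query=fitnesse form returns None because the term is not part of the path.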
I'm Gojko Adzic, author of Impact Mapping and Specification by Example.