It is fairly obvious that web site performance can be increased by making the code run faster and optimising the response time. But that only scales up to a point. To really take our web sites to the next level, we need to look at the performance problem from a different angle.

How much can you handle?

Although an average web server is able to process a few thousand requests per second, the number of requests it can actually handle at the same time is severely limited. Here are some simple figures:

  • With sufficient network bandwidth from the datacentre to the world, an average web server could execute several thousand requests per second, depending on the average size of the response.
  • However, servers like IIS and Apache allocate a thread to execute each individual request. Although a single thread is lightweight by nature, four or five hundred concurrent threads are quite a hit on operating system resources, so those web servers can in practice execute at most a few hundred requests at the same time.
  • The actual limit might be lower than you think: by default, a single IIS 6 process creates only up to four threads per processor. You can increase the limit in the registry, but Microsoft suggests keeping it below 20. Even with a really cool eight-core system, this still gives you only 160 concurrent requests.
  • If the requests share other pooled resources, such as database connections, the number becomes even lower. Database connections are split across the whole server farm, so a single web server will most likely have something in the order of tens of available connections in its pool (see the sketch after this list).
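
To see how a small connection pool caps concurrency in practice, here is a minimal, self-contained sketch (plain Python, standard library only; the pool size, the number of requests and the timings are made-up figures, and the simulated "database call" just sleeps):

    import threading
    import time
    from concurrent.futures import ThreadPoolExecutor

    POOL_SIZE = 10          # pretend the web server gets 10 pooled DB connections
    REQUESTS = 200          # concurrent requests hitting the server
    DB_CALL_SECONDS = 0.1   # time each request holds a connection

    # A semaphore stands in for the connection pool: only POOL_SIZE
    # requests can hold a "connection" at any moment.
    pool = threading.Semaphore(POOL_SIZE)

    def handle_request(i):
        with pool:                       # wait for a free connection
            time.sleep(DB_CALL_SECONDS)  # simulated database work

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=REQUESTS) as executor:
        list(executor.map(handle_request, range(REQUESTS)))
    elapsed = time.perf_counter() - start

    # Roughly (REQUESTS / POOL_SIZE) * DB_CALL_SECONDS = 2 seconds,
    # no matter how many threads the server can spin up.
    print(f"{REQUESTS} requests took {elapsed:.2f}s with a pool of {POOL_SIZE}")

However many request threads the server is configured with, the throughput of this little simulation is dictated by the ten "connections".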

So, the number of concurrently running requests per server is most likely to be in the order of tens or hundreds. To squeeze the most out of a single machine, we need to avoid that bottleneck and make the requests as fast as possible. One way, which should not be overlooked, is to optimise the code so that each request does its job faster and releases critical resources, such as database connections, as soon as possible. But that only scales up to a point.

The key to solving this problem lies in the classical definition of speed. In physics, speed is defined as distance divided by the time required to cross that distance. So we can make requests run faster not only by decreasing the time, but also by shortening the distance. Instead of doing the whole thing quicker, we need to look into reducing the amount of work that a single request has to do. Technically, this comes down to drawing the line between what is processed synchronously (inside the request) and asynchronously (in the background) to complete the request workflow.

A clever choice of what to process asynchronously is one of the most important decisions in any enterprise system, and for web sites it plays an absolutely key role. Here are a few ideas to think about when deciding how to split the work.

  • Delegate all long operations to a background process
  • Never ever talk to an external system synchronously, no matter how fast it is
  • Be lazy – if something does not have to be processed now, leave it for later

Idea #1: Delegate all long operations to a background process

Web servers are typically configured to kill a request if it takes too long, and production servers often have much less tolerance for sluggish processing than development machines. If you get the urge to re-configure the web server in order to complete some processing, resist it at all costs. The time limit is imposed for a good reason – web requests are really not suitable for longer operations. They take up scarce system resources, and long requests easily become a point of contention for the system. A few long-running requests can slow everything down, not because they drain CPU or memory themselves, but because they hold on to important resources and make other requests wait for them. Also, web requests are not guaranteed to complete correctly. A server can kill a request because of a timeout. A user can also just close the window and interrupt the workflow, without sending any notification to the server.

It is miles better to just enqueue longer requests and take care of them from a background processing queue. The web response can complete the active database transaction, release system resources and return to the caller. Background services are much more robust and reliable than web requests, and they are the right place to execute long-running operations. The browser can poll the web server every few seconds and just check the status of the operation. If the user closes the browser, the operation still gets processed correctly. If the remote server is temporarily down, the background service can reprocess the request after a few minutes. You can cluster background servers and load-balance longer operations. This kind of asynchronous processing is much more robust and will allow you to scale the system much more easily. Clearly isolating longer operations will make sure that a few long actions do not block thousands of quick ones.
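
As a rough illustration of the pattern (not tied to any particular web server or framework, and with all the names made up for the example), here is a minimal Python sketch: the request handler only records the job and returns, a background worker drains the queue, and the status lookup stands in for the browser polling every few seconds:

    import queue
    import threading
    import time
    import uuid

    jobs = queue.Queue()   # pending long-running operations
    status = {}            # job id -> "queued" | "processing" | "done"

    def handle_web_request(payload):
        """Runs inside the web request: enqueue the work and return immediately."""
        job_id = str(uuid.uuid4())
        status[job_id] = "queued"
        jobs.put((job_id, payload))
        return job_id      # the page can poll for this id every few seconds

    def background_worker():
        """Runs in a separate background service, outside any web request."""
        while True:
            job_id, payload = jobs.get()
            status[job_id] = "processing"
            time.sleep(5)  # stand-in for the actual long operation
            status[job_id] = "done"
            jobs.task_done()

    threading.Thread(target=background_worker, daemon=True).start()

    # Simulated flow: the "request" returns at once, the browser polls later.
    job_id = handle_web_request({"report": "monthly-sales"})
    print("request returned immediately, job:", job_id)
    time.sleep(6)
    print("status on the next poll:", status[job_id])   # "done"

In a real system the queue and the status records would live in durable shared storage (a database table or a message broker) rather than in process memory, so the background servers can be clustered, load-balanced and restarted safely.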

Idea #2: Never ever talk to an external system synchronously, no matter how fast it is

A common issue with external systems and synchronous communication is underestimating the latency. Talking about designing scalable systems, Dan Pritchett from eBay pointed out that “One of the underlying principles is assuming high latency, not low latency. An architecture that is tolerant of high latency will operate perfectly well with low latency, but the opposite is never true.” Anything that goes out of the internal network should be handled asynchronously. This rule of thumb should really be common sense by now, but I still see it violated very often. If you process credit cards using an external payment provider, for the love of God do not try to authorise the transaction from the web response. Do not do this even if the payment provider is really fast. It might work OK at the moment, but in a 24/7 environment, over the course of a few months, you have to expect remote connectivity problems. Network connections can fail, servers can start timing out, and the poor users caught in those requests will be left hanging.

Most web servers process requests from the same session in sequence, so the user will not be able to make a new request to the server while one of their requests is blocked, even from a different tab. To do anything useful, the user will have to close the browser and log in again. And, as the processing is caught in a blocked request, you will have no idea what actually happened. Whenever I hear about transactions getting “stuck”, it is most likely because of synchronous communication with an external server. The payment provider might charge the user’s account, but your server might never record that the money arrived on your end.
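
As a hedged sketch of the alternative (the provider call here is just a stub, and all the function names are hypothetical), the web request would only record that a payment is pending and enqueue it; a background worker then talks to the external system with an explicit timeout, retries with a back-off, and records the outcome durably, so nothing is ever left “stuck” inside a blocked request:

    import random
    import time

    def authorise_with_provider(card_token, amount):
        """Stand-in for the external payment call; assume it is slow and flaky."""
        if random.random() < 0.3:             # simulate connectivity problems
            raise TimeoutError("payment provider did not respond")
        return "AUTH-" + card_token[-4:]

    def record_payment_status(order_id, state, auth_code=None):
        """Stand-in for a durable write (database row, message, audit log)."""
        print(order_id, state, auth_code)

    def authorise_in_background(order_id, card_token, amount, attempts=3):
        """Runs in the background service, never inside a web request."""
        for attempt in range(1, attempts + 1):
            try:
                auth_code = authorise_with_provider(card_token, amount)
                record_payment_status(order_id, "authorised", auth_code)
                return
            except TimeoutError:
                record_payment_status(order_id, "retrying")
                time.sleep(2 ** attempt)      # back off: 2s, 4s, 8s, ...
        # After the retries are exhausted, the order is flagged for follow-up
        # instead of leaving a user hanging in a blocked web request.
        record_payment_status(order_id, "failed")

    authorise_in_background("order-42", "tok-4242-4242", 1999)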

Domain Driven Design suggests using aggregates as boundaries for synchronous processing. It would be very hard to convince anyone that your web server and the payment processor are parts of the same aggregate, regardless of how you structure the application.

Idea #3: Be lazy – if something does not have to be processed now, leave it for later

The CAP theorem, coined by Dr. Eric A. Brewer in ‘98, says that a distributed system can have at most two of the following three properties: Consistency, Availability and tolerance to network Partitioning. As the number of users grows, availability and partition tolerance become much more important than consistency. By (temporarily) giving up on consistency, we can make the system much faster and much more scalable. Gregor Hohpe wrote a really nice article on this subject in 2004, called “Starbucks Does Not Use Two-Phase Commit”. I strongly suggest reading it if you have not already done so.

Web applications often try to do too much at the same time. When an application needs to serve 20 or 30 thousand users over a month, doing too much might not be seen as a problem at all. But once the user base grows to a few hundred thousand, being lazy will significantly improve scalability. Think really hard about the things that do not really have to be processed instantly. Break down the process and see if something can be postponed for later, even if that may cause slight problems. Anything that is unlikely to cause many problems, and whose problems can easily be fixed later, is a good candidate for taking out of the critical path.

My rule of thumb for checking what can be left out of the primary request workflow is to ask what is the worst thing that can happen if things go bad, and how frequently we can expect that. I’ll use an online bookstore again as an example: a shopping cart checkout request should theoretically check whether the book is in stock, authorise the payment, remove a copy of the book from the available stock, create an order item in the database and send it to the shipping department. New payment industry standards, such as Verified by VISA, make it hard to process the payment offline. However, checking and modifying the stock can safely be left for later. What is the worst thing that can happen if things go wrong? We run out of stock and over-sell a bit. We can notify the user about a slight delay, reorder the book and ship it a few days later. Alternatively, the user could cancel the order and get the money back (if the transaction was just authorised and not captured, the money will never be taken from them in the first place). How frequently will this happen? Not a lot, since the bookstore should manage stock levels pro-actively. In this case, the request can just authorise the payment and put the shopping cart into a background queue. We can process the requests in the queue overnight, when the system is not under a heavy load. By not touching a shared resource (stock) inside the request, we both avoid contention and simplify the request workflow.
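
A hedged sketch of that split (the helper functions are just stubs standing in for the real payment provider, order storage and warehouse systems):

    import itertools
    import queue

    deferred_work = queue.Queue()     # drained by the overnight batch job
    _order_ids = itertools.count(1)

    # --- stand-ins for the real systems, just enough to run the sketch ---
    def authorise_payment(card_token, amount):
        return "AUTH-" + card_token[-4:]      # pretend the provider said yes

    def save_order(cart_total, auth_code):
        return next(_order_ids)               # pretend we wrote an order row

    def adjust_stock_and_fulfil(order_id):
        print("overnight: stock adjusted, order sent to shipping:", order_id)

    # --- the split between the checkout request and the batch job ---
    def checkout(cart_total, card_token):
        """The only synchronous steps: authorise the payment, record the order."""
        auth_code = authorise_payment(card_token, cart_total)
        order_id = save_order(cart_total, auth_code)
        # Stock checking and fulfilment are deliberately left for later: the
        # worst case is a slight over-sell, which can be fixed by reordering
        # or refunding, so it stays out of the critical path.
        deferred_work.put(order_id)
        return order_id

    def overnight_batch():
        """Runs when the system is not under heavy load."""
        while not deferred_work.empty():
            adjust_stock_and_fulfil(deferred_work.get())

    order = checkout(cart_total=1999, card_token="tok-4242-4242")
    print("checkout returned straight away, order:", order)
    overnight_batch()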