Put the web server on a diet and increase scalability

May 5, 2008

Published in: Software design articles

HTTP sessions allow us to develop web applications as if they were running on a desktop machine, making the web so much more useful. Although HTTP is a stateless protocol and there is a lot of work involved in providing this abstraction, web servers make it very easy to use, perhaps too easy. To borrow a quote from Spider-Man, with great power comes great responsibility: misusing HTTP sessions is probably the number one obstacle to building scalable web sites today. Here are some tips on how to use HTTP sessions responsibly.

Turn off the session if you don't need it

Session support is often turned on by default for the whole web site, so that it is available to any web page that might need it. This significantly reduces web site performance under heavy load. For a start, session access requires synchronisation, and synchronisation inhibits concurrency: most web servers will execute requests coming from the same session in sequence. If the session is kept locally in web server memory, you also have to pin all the requests from that session to the same server. Turning the session off when it is not required (or even better, turning it on only for the pages that really need it) will allow web servers to process more requests in parallel and with less synchronisation. It will also make it easier to balance requests that do not require a session across multiple servers, and it will reduce processing time, because each request that uses the session incurs some overhead to restore session values. Depending on how session information is kept, this may mean disk IO or even network access to a database. On a single request that overhead is insignificant, but when several thousand users access the site at the same time, it can be quite visible.
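As a rough, framework-agnostic sketch of "turning it on only for the pages that really need it": handlers opt in to session support, and the dispatcher skips the session lookup for everything else. The names `needs_session`, `dispatch`, `SESSION_STORE` and the two page handlers are made up for this illustration; real web servers expose this through their own configuration or page directives.

```python
from typing import Callable, Dict, Optional

# Hypothetical in-memory session store; a real one might hit disk or a database.
SESSION_STORE: Dict[str, dict] = {"abc123": {"user_id": 42}}

# Counts how often the (potentially expensive) session store is touched.
stats = {"session_loads": 0}


def load_session(session_id: Optional[str]) -> dict:
    """Restore session values; this is the overhead we want to avoid."""
    stats["session_loads"] += 1
    return SESSION_STORE.get(session_id, {})


def needs_session(handler: Callable) -> Callable:
    """Mark a handler as one that really requires session state."""
    handler.needs_session = True
    return handler


def dispatch(handler: Callable, session_id: Optional[str] = None) -> str:
    """Load the session only for handlers that opted in."""
    if getattr(handler, "needs_session", False):
        return handler(load_session(session_id))
    return handler(None)


def home_page(session: Optional[dict]) -> str:
    return "Welcome"  # generic content, no session required


@needs_session
def account_page(session: dict) -> str:
    return f"Account for user {session['user_id']}"
```

Calling `dispatch(home_page, "abc123")` never touches the session store, so requests like that need no per-session synchronisation and can be served by any machine in the farm.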

Sharing the session across web servers promises easier load balancing and lets users continue to work even if the server that handled their session goes down. Although it is now incredibly easy to set up shared session storage on a web server farm, that is not a decision that should be taken lightly. Shared sessions require even more synchronisation, and even worse, that synchronisation happens over the network instead of in memory. In most cases, if only minimal security information is kept in the session, the worst thing that can happen when a server dies is that users are asked to log in again. From a financial and performance perspective, that often makes more sense than sharing user sessions across servers.

Consider using cookies as an alternative for storing simple information that is not security-sensitive. That is not going to save you a lot of memory, but it will allow you to turn off the session for more pages while still serving user-specific information. Web sites like Amazon or LinkedIn store the user ID in an (encrypted) cookie and use that to display site content without requiring you to log on, and probably without a session on their end. (I have no access to their code, so this might not be correct, but it would make perfect sense.) Until you add something to the shopping cart or want to save updated account information, there is no need for the session to start.
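A minimal sketch of the cookie idea, using only the Python standard library. Note the substitution: the cookie here is signed with an HMAC rather than encrypted, so the user ID stays readable but cannot be forged without the server-side key. The names `SECRET_KEY`, `make_cookie` and `read_cookie` are assumptions for this example, and nothing here reflects how Amazon or LinkedIn actually do it.

```python
import hashlib
import hmac
from typing import Optional

# Assumption: a server-side secret shared by all web servers in the farm.
SECRET_KEY = b"replace-with-a-real-secret"


def _sign(user_id: str) -> str:
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()


def make_cookie(user_id: str) -> str:
    """Return 'user_id.signature' so that tampering can be detected."""
    return f"{user_id}.{_sign(user_id)}"


def read_cookie(cookie: str) -> Optional[str]:
    """Return the user ID if the signature is valid, otherwise None."""
    user_id, _, signature = cookie.rpartition(".")
    if not user_id:
        return None  # no separator found, so the cookie is malformed
    # compare_digest avoids leaking information through timing differences.
    if hmac.compare_digest(signature.encode(), _sign(user_id).encode()):
        return user_id
    return None
```

Because any server holding the key can verify the cookie, no session lookup is needed to personalise a page. A value like this is suitable for displaying content, not for authorising sensitive operations such as checkout.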

With Web 2.0 and AJAX becoming increasingly popular, we can take this approach even further and limit the session only to AJAX requests that really need it, and have the session turned off for most of the web site.

Encapsulate session access

Most web servers today allow us to use the session as a sort of hash map, dumping anything we like into it and associating it with a string key. Session hash maps are universally accessible and do not provide any compile-time checking or type safety, so they suffer from the same kind of problems that caused global variables to be frowned upon. Uncontrolled use of session maps is very error-prone, and the resulting errors are typically painfully hard to troubleshoot.

Using session hash maps effectively requires an agreement on what will be assigned to which key and calls for a lot of discipline, especially with larger teams. One bad example I can remember instantly is an application that stored different types of objects under the same key name depending on the context. One part of the site expected session['user'] to be the user’s ID, another part expected it to be the whole user object. They were developed by different people, who unintentionally used the same key. Of course, this caused a lot of weird type-checking errors and people being suddenly “logged out”, but the problem became obvious only in production.

This kind of “free for all” access to session details can cause huge consistency problems. An application that I had to fix a few years ago used the session to keep track of the current user and that user’s account balance. The balance was loaded from the database on demand, if the hash map did not have an object under the “balance” key when it was required. When a user logged off, only the “user” key was cleared, which left the “balance” key in place. This caused a lot of weird errors during testing (when people were often logging in and out under different usernames). Luckily, the web server would normally isolate the sessions of real users, but I can imagine that this problem would affect people accessing the site from public computers, such as in libraries or internet cafés.

To avoid this problem, I suggest always encapsulating access to session objects in a class that has a strictly defined interface and provides a degree of type safety and control over what gets stored and how. Do not use the session object directly; use a proxy instead. In an MVC environment, this approach also allows you to decouple controller code from the web server. You can use a test implementation of the session class that does not depend on HTTP contexts for unit testing the controllers from your IDE.
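One possible shape for such a proxy, sketched in Python. The class name `UserSession`, its methods and the `user_id` key are hypothetical; the only assumption about the web server is that it hands the application some dict-like session object.

```python
from typing import MutableMapping, Optional


class UserSession:
    """Typed facade over a raw session map (all names here are hypothetical)."""

    _USER_ID = "user_id"  # the key is defined in exactly one place

    def __init__(self, raw: MutableMapping[str, object]) -> None:
        # 'raw' is whatever dict-like session object the web server provides.
        self._raw = raw

    @property
    def user_id(self) -> Optional[int]:
        value = self._raw.get(self._USER_ID)
        return value if isinstance(value, int) else None

    def log_in(self, user_id: int) -> None:
        if not isinstance(user_id, int):
            raise TypeError("user_id must be an int, not a whole user object")
        self._raw.clear()  # never inherit data left behind by a previous user
        self._raw[self._USER_ID] = user_id

    def log_out(self) -> None:
        self._raw.clear()  # clears everything, not just the user key
```

In a unit test, a plain `dict` can stand in for the HTTP session: `UserSession({})` works without any web server running. Since `log_out` clears the whole map, a leftover key such as the balance from the earlier example cannot survive into the next login.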

Store only non-volatile data in the session

Storing several related objects in the session can cause consistency problems, as in the example with the account balance, and this can happen even within a single session. If personal details are stored in the session along with the user’s ID and the user updates their e-mail address, then we have to remember to reload the details stored in the session object as well. Whenever people complain about stale data, inconsistent session usage is probably to blame. This gets even worse if the data can be modified outside the web application, for example if the customer’s account balance gets reduced because they played a game in another window or because their order was processed in the back-end.

In general, I consider that there are only two types of information that belong in a web server session:

  • security-sensitive identifiers: the ID of the user currently logged in is a good example, since the user should not be able to specify it directly within a request. Even for this, there are exceptions; I'll talk about that later.
  • intermediate results: non-persistent data that is not yet complete and is completely internal to the web application workflow. Shopping cart contents are a good example. Until a shopping cart is converted into an order and persisted, its contents are not observable or accessible by anything outside the current user's session.
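To illustrate the stale-data point with a toy example: if the session holds only the user's ID and volatile details are looked up when needed, a change made elsewhere is visible on the very next request. The `USERS` dictionary is a hypothetical stand-in for the database, and `current_email` is an invented helper.

```python
# Hypothetical stand-in for the database table of users.
USERS = {7: {"email": "old@example.com"}}


def current_email(session: dict) -> str:
    """Look the e-mail up by ID instead of copying it into the session."""
    return USERS[session["user_id"]]["email"]


session = {"user_id": 7}  # only the stable identifier is stored

before = current_email(session)
USERS[7]["email"] = "new@example.com"  # changed in another window or by a back-end job
after = current_email(session)  # picks up the change, nothing to reload
```

Had the e-mail address been copied into the session at login, `after` would still hold the old value until someone remembered to refresh it.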

Use proper caching for mutable data

HTTP sessions are often used to store frequently accessed user-specific information, but that is wrong. It is error-prone from the consistency perspective, and it also significantly inhibits scalability. Session state is kept on the server, and although a single record set might not take up a lot of memory, multiply that by several tens of thousands of users and things quickly get out of hand. Because of the way that HTTP works, it is not easy to guess whether someone stopped using the site or whether they are just reading a really long page or popped out for a coffee. Session memory is typically kept for a period of time before being discarded, often ten to thirty minutes, even if the users close their browsers. So all the cached information keeps taking up memory for quite a while. Strictly speaking, session and cache memory should have different life-cycles anyway. Cached objects are typically loaded on demand, and we should be able to throw them out at any point if we need to release space. We do not have that flexibility with session objects: the web server has to rely on the programmers to explicitly remove objects that are no longer required while the session is still alive.

Session memory is typically controlled by the web server and we have no direct control over it. So storing a lot of objects in the session temporarily, even if we do release them afterwards, can cause an effect very similar to a memory leak. I remember an application that loaded whole record sets into the session space for a short period of time, and although they were quickly released, IIS was running out of memory every few days. Under heavy load, it could happen a few times a day, sometimes even during busy trading hours, which was a big problem. From the programmer’s perspective, the code was correct in the sense that session objects were deleted, but memory was still leaking. It turned out that OLE strings caused huge fragmentation of IIS memory, so the server was always trying to allocate more and more contiguous blocks and could not fit the record sets into the space that had been released. Although automatic memory management has improved a lot over the last five or six years, it can still cause headaches and it’s best to keep the sessions thin. The less information you have in the session for each user, the more users a single web server will be able to handle.

I strongly suggest using a proper caching mechanism instead of session objects for on-demand caching and optimising access to frequently required information. This caching memory can be released at any time, does not have to persist and does not have to be synchronised across requests. If the cached information is not user specific, we can get even better cache performance by re-using the cache across all sessions. Splitting the data streams into generic and user-specific helps us improve web site scalability and performance significantly.
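As a small sketch of what "proper caching" can mean here: a size-bounded, least-recently-used cache that loads entries on demand and is free to evict them at any point. The class `LruCache` and its `loader` callback are illustrative names, not part of any particular web framework. The sketch ignores thread safety, which a cache shared between concurrent requests would need to handle.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class LruCache:
    """Small least-recently-used cache; entries can be dropped at any time."""

    def __init__(self, loader: Callable[[Hashable], object], max_entries: int) -> None:
        self._loader = loader  # e.g. a database query (hypothetical)
        self._max_entries = max_entries
        self._entries: "OrderedDict[Hashable, object]" = OrderedDict()

    def get(self, key: Hashable) -> object:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        value = self._loader(key)  # load on demand
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value

    def invalidate(self, key: Hashable) -> None:
        self._entries.pop(key, None)  # call when the underlying data changes
```

Unlike session state, nothing breaks if an entry disappears: the next `get` simply reloads it. A single instance can also be shared across all users for data that is not user-specific.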

For more information on sessions and platform-specific ideas on how to improve session performance and reduce overhead, see the following pages:


Spy on me

I'm @gojkoadzic on Twitter, and @gojko on GitHub. I also hang out on the Claudia.js chat.
