Mar 03 2008
Two data streams for a happy website
One of the most important architectural decisions to make early in a scalable web site project is splitting the data flow into two streams: one that is user-specific and one that is generic. If this is done properly, the system will be able to grow easily. If the data streams are not separated from the start, the growth options will be severely limited. Trying to make such a web site scale will be just painting the corpse, and the change will cost a whole lot more when you need to introduce it later (and it is “when” in this case, not “if”).
In a classic online book-store example, book details, prices and shop categories are all generic. They do not depend on any particular user. Browsing the catalogue should produce the same results for any two users at the same moment. Actions like buying a book or reviewing orders and transactions are user specific and they cannot be shared between different users. Why is it so important to split these two streams right from the start? Because generic and user-specific data flows are affected by completely different caching and partitioning constraints.
Different constraints
Keeping the session footprint low, caching, and partitioning are tried and tested best practices for scaling web systems. If there is only a single data stream, the different constraints on its two kinds of data make these practices hard to implement and even harder to improve later.
For example, most generic data flow is completely stateless, but user-specific actions are often stateful. If these two flows are clearly separated, then we can process generic actions with stateless services and servers. Stateless services are much easier to manage and scale than stateful services, since they can be easily replaced without any effect on the system operation. You can just stack more cheap servers and throw requests at them using round-robin or simple load balancing when you need more power. Stateful services cannot be scaled so easily: they might rely on resource locking, and load balancers have to send all requests for a single session to the same server. If a stateful server dies, that has a visible effect on the system operation, so these services have to be much more resilient than stateless ones.
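To make the difference concrete, here is a minimal sketch of the two balancing strategies. The server names and class names are made up for illustration, and hashing the session ID is only one possible way to pin a session to a server (real load balancers often use cookies instead).

```python
import hashlib
from itertools import cycle


class RoundRobinBalancer:
    """Stateless pool: any server can take any request."""

    def __init__(self, servers):
        self._servers = cycle(servers)

    def pick(self, session_id=None):
        # The session is irrelevant for stateless services.
        return next(self._servers)


class StickyBalancer:
    """Stateful pool: a session must always land on the same server."""

    def __init__(self, servers):
        self._servers = list(servers)

    def pick(self, session_id):
        # A stable hash keeps the session pinned between requests.
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._servers)
        return self._servers[index]


# Hypothetical server names, for illustration only.
generic_pool = RoundRobinBalancer(["generic-1", "generic-2", "generic-3"])
user_pool = StickyBalancer(["user-1", "user-2"])
```

Adding a server to the generic pool only changes the rotation. Adding one to the sticky pool changes where existing sessions map, which is one reason the stateful side is harder to grow.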
Generic data can be cached and shared between users and even servers. User-specific data like account information or transaction statuses should never be cached. Similarly, there is most likely no authentication required to get the generic data, but in most cases authentication rules apply to user-specific data. For example, anyone can view book details, but a user has to be logged in to view an order status (and the requested order has to belong to that particular user). If the two data streams are mixed, then any caching mechanism will have to analyse and understand the context of requests, and decide what should be cached on a case-by-case basis. If the two streams are split, the decision is easy: everything generic is retrieved from the cache, and everything user-specific does not even try to use the cache. No error-prone analytics involved. The generic data flow also does not have to suffer from the authentication overhead: you can save a lot of processor power by skipping authentication when it is not needed. Remember the Golden rule of web caching? We can cache generic data into static files and avoid any dynamic processing on the web servers, getting much better performance. On e-commerce systems, user-specific pages typically have to go through HTTPS, which adds even more overhead.
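The "easy decision" can be sketched in a few lines. This is only an illustration: `StreamRouter` and the two loader callables are hypothetical names, and the in-memory dictionary stands in for whatever cache the site really uses.

```python
class StreamRouter:
    """Routes reads to one of two streams; only the generic one is cached."""

    def __init__(self, load_generic, load_user_specific):
        # Both loaders are assumed callables that hit the real data source.
        self._load_generic = load_generic
        self._load_user_specific = load_user_specific
        self._cache = {}

    def get_generic(self, key):
        # Generic stream: no authentication, always served through the cache.
        if key not in self._cache:
            self._cache[key] = self._load_generic(key)
        return self._cache[key]

    def get_user_specific(self, user_id, key):
        # User-specific stream: authentication required, cache never consulted.
        if user_id is None:
            raise PermissionError("login required")
        return self._load_user_specific(user_id, key)
```

Neither method has to inspect the request to decide whether caching is safe. The choice was made when the caller picked the stream.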
Because it does not matter where the generic data flow comes from, it is also much easier to partition. Once the system outgrows the capacity of a single database, we can just take the generic content out and split it into several databases. Splitting account data and transactions is never that easy, because those details are bound by a single context and often constrained by business rules and validations.
How to split the flow
In the classic two-tier architecture, where the web site plugs directly into the database, data flow splitting starts with using a different data source for each of the two streams. Even if, at the start, everything sits in the same database, use different connection pools to retrieve user-specific and generic data. The generic data flow does not even have to run in transactions, which can make a big difference with some databases. If an ORM library is used, then caching should be turned on for this stream. User-specific data flow, as a general rule of thumb, should be transactional and should never be cached. It is a good idea to hide any user-specific tables from the generic data flow using database user privileges, so that any attempts to mix the two streams get caught quickly. Internal web site services should be designed to clearly fall into one of those two categories, and not mix methods that retrieve generic data and perform user-specific functions. This is all preparation for the day when the two-tier system needs to be split into more levels. With a clear separation of responsibilities, this divorce should not be painful at all.
In three-tier architectures, I like to split the middleware straight from the start into user-specific and generic servers. Web servers sit on top, get the generic data from the first middleware group and process transactions using the second middleware group. Generic data-flow servers can be clustered and scaled easily, and any load balancing system will work right out of the box. They can be restarted, taken out of the cluster or put back in without any effect on the rest of the system. Transparent caching can be applied to those servers easily. User-specific servers, on the other hand, are much trickier in all those aspects and should absolutely never be transparently cached. This split is a preparation for further scaling and caching, since generic data servers can be split regionally, put under several layers of cache servers, or divided vertically by product range or type. The functionality on user-specific servers is focused and isolated, so that we will have less to focus


Many excellent points in this article, Gojko. I would add one additional point, aimed primarily at larger web entities.
The generic data can easily be cached by a CDN or edge network.
In the case of Akamai, you can employ their Edge Side Includes (ESI) tags within a static page. The entire static page will be cached, and served quickly from cache, but the ESI tags can make a request back to the servers to fill in user-specific content for those fragments. This is very similar to the AJAX approach in your last section, but without the extra interaction on the browser.
I think there is also a category for “slowly changing” data. I’ll use the Amazon product detail page as an example. For the majority of books and movies, user reviews change very seldom, if at all. This is not completely static, as the SKU, title, publisher’s notes, and so on, would be. But, for the long tail, reviews change once in a great while. If it takes a few hours for the new review to appear everywhere in the world, it’s not a big deal.
There is another dimension to consider, along with generic vs. specific:
1. Some data will be read much more often than it changes, meriting a high degree of caching. (Updates are typically based on some back-end publishing process, but could be driven by the site itself.)
2. Other data will change much more often than it is read. This is suited for a demand-pull process (i.e. dynamic generation.)
Note that generic data could fall into either of these categories (so could specific data, for that matter). But, generic data that changes much more often than it is read is probably not worth the processing time or storage to cache.
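One way to express this read/write dimension is a simple ratio check. This is only a sketch: the function name and the threshold of 10 reads per write are arbitrary assumptions, not measured values.

```python
def should_cache(reads, writes, min_ratio=10.0):
    """Return True when data is read often enough, relative to how
    often it changes, for caching to be worth the effort.

    min_ratio is an arbitrary illustrative threshold.
    """
    if writes == 0:
        # Never changes: cache it as soon as anyone reads it.
        return reads > 0
    return reads / writes >= min_ratio
```

A long-tail product page read a thousand times per review posted passes the check. A counter that is updated far more often than it is displayed does not.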
Beautiful article, Gojko.
As you said, and I would like to emphasize: always use good URLs that are semantic, hackable and extensible.
Playing with URLs, redirecting or proxying this request to that server, is extremely easy and a light task for tools like Lighttpd or Varnish HTTP Proxy.
If your URLs are semantic enough, you can create simple regex rules to tell your HTTP server what is cached and what is not. Avoiding hits on your scripting language gains a lot of performance.
For example:
/users/.*?$ is not cached
/users/.*?/rss is cached for 1hour
/users/.*?/new_files is cached for 5 minutes
etc
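The rules above could be sketched as an ordered lookup table. This is an illustration in Python rather than real Lighttpd or Varnish configuration, and the names `CACHE_RULES` and `cache_ttl` are made up. The specific patterns come first, because the catch-all `/users/` rule would otherwise match every path.

```python
import re

# (pattern, time-to-live in seconds); 0 means "do not cache".
CACHE_RULES = [
    (re.compile(r"^/users/[^/]+/rss$"), 3600),       # cached for 1 hour
    (re.compile(r"^/users/[^/]+/new_files$"), 300),  # cached for 5 minutes
    (re.compile(r"^/users/.*$"), 0),                 # everything else under /users/
]


def cache_ttl(path, default=None):
    """Return the TTL of the first matching rule, or default if none match."""
    for pattern, ttl in CACHE_RULES:
        if pattern.match(path):
            return ttl
    return default
```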
(PS: I’m adding your blog to my “can’t miss” folder in Bloglines.)
[...] Two data streams for a happy website It’s a bit technical, but this line of thought will surely interest any entrepreneur facing a scaling problem (a problem inherent in business models built on giving everything away for free, and therefore on advertising, which needs heavy traffic to be profitable). [...]
This is a good idea. However, there are possibly user-specific things that are stateless. Hence, wouldn’t the partitions be better if made simply between stateful and stateless rather than generic vs. user-specific?
shiraz
Hi Shiraz,
this is by no means the final partition, but I propose using this as the first one, as it allows you to cache and further partition generic data and remove all the overhead of authentication and SSL without any special analytics. With user-specific data, even if it is stateless, things become a bit tricky.
If there is no authentication required by business rules to access such data, then I’d “cheat” and push that as well into generic data servers if possible. For example, personal pages on social networking sites may be considered user-specific at first glance, because they are related to a particular user, but they are in fact generic because everyone can access the same details. The site can store your user ID in a cookie and then forward you to the correct profile when you click on the “My page” link without touching the database. There is no harm if I see your page, as long as I cannot change it without authentication.
If authentication to view the data is required by business rules (for example in a typical e-commerce application), then caching that data and exposing it directly would be considered a huge security gap. So this data stream has to go through SSL and be protected by proper authentication. Putting it in the same cache as other data would only complicate things for the rest of the cache. Depending on business rules, this may be cached as well, but under the umbrella of user-specific services. That way the generic data stream remains simple.
So then we may have three classes of data streams: pure stateless, secure stateless and stateful.
What’s with the hidden porn links embedded in the article, beginning after the sentence “then the growth options will be severely limited.”? Are those intentional, or has someone managed to inject stuff into your page? It took me a minute to figure out why your site was coming up under blocked content…
Hi AoD, thanks for the tip. Apparently someone injected that stuff; my WordPress installation was out of date for two weeks. I removed it. Thanks again.
You’re certainly welcome. I suspected it was something like that, but you never want to judge