One of the most important architectural decisions that must be made early in a scalable web site project is splitting the data flow into two streams: one that is user-specific and one that is generic. If this is done properly, the system will be able to grow easily. On the other hand, if the data streams are not separated from the start, the growth options will be severely limited. Trying to make such a web site scale later is little more than painting the corpse, and the change will cost a whole lot more when you need to introduce it (and it is a question of “when” here, not “if”).

In a classic online book-store example, book details, prices and shop categories are all generic. They do not depend on any particular user. Browsing the catalogue should produce the same results for any two users at the same moment. Actions like buying a book or reviewing orders and transactions are user specific and they cannot be shared between different users. Why is it so important to split these two streams right from the start? Because generic and user-specific data flows are affected by completely different caching and partitioning constraints.
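To make the split concrete, here is a minimal sketch (in Java, with names and types that are my own, not taken from the book-store above) of how the two streams might be expressed as separate service interfaces. The point is only that one side never takes an account identifier and the other side always does.

```java
import java.util.List;

// Hypothetical domain types, just enough to make the interfaces compile.
record BookDetails(String isbn, String title, String price) {}
record BookSummary(String isbn, String title) {}
record OrderStatus(String orderId, String state) {}

// Generic flow: identical for every user at a given moment -- cacheable and shareable.
interface CatalogueService {
    BookDetails bookDetails(String isbn);
    List<BookSummary> booksInCategory(String categoryId);
}

// User-specific flow: always bound to one authenticated account -- never cached or shared.
interface OrderService {
    String buyBook(String accountId, String isbn);   // returns an order id
    List<OrderStatus> ordersFor(String accountId);
}
```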

Different constraints

Keeping the session footprint small, caching, and partitioning are tried-and-tested best practices for scaling web systems. If there is only a single mixed data stream, the conflicting constraints of generic and user-specific data make these practices hard to implement and even harder to improve later.

For example, most of the generic data flow is completely stateless, while user-specific actions are often stateful. If these two flows are clearly separated, generic actions can be processed by stateless services and servers. Stateless services are much easier to manage and scale than stateful ones, since they can be replaced without any effect on the operation of the system. When you need more power, you can simply stack more cheap servers and throw requests at them with round-robin or another simple load-balancing scheme. Stateful services cannot be scaled so easily: they may rely on resource locking, and load balancers have to send all requests for a single session to the same server. If a stateful server dies, that has a visible effect on the system, so these services have to be much more resilient than stateless ones.
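The difference is easy to see in code. The sketch below (class names and the in-memory basket are my assumptions, not part of the text above) contrasts a handler with no per-user state, which any server in a round-robin pool can run, with one whose in-memory session state forces sticky routing:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stateless: no per-request state is kept, so any instance behind a round-robin
// load balancer can answer any call, and instances can be added or replaced freely.
class CatalogueHandler {
    List<String> booksInCategory(String categoryId) {
        return List.of("isbn-1", "isbn-2"); // would be a read-only lookup in practice
    }
}

// Stateful: the in-memory basket ties a session to this particular instance,
// so the load balancer must use sticky sessions and a crash loses live state.
class BasketHandler {
    private final Map<String, List<String>> basketsBySession = new ConcurrentHashMap<>();

    void addToBasket(String sessionId, String isbn) {
        basketsBySession.computeIfAbsent(sessionId, s -> new java.util.ArrayList<>()).add(isbn);
    }
}
```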

Generic data can be cached and shared between users and even servers. User-specific data such as account information or transaction statuses should never be cached. Similarly, retrieving generic data most likely requires no authentication, while authentication rules almost always apply to user-specific data. For example, anyone can view book details, but a user has to be logged in to view an order status (and the requested order has to belong to that particular user). If the two data streams are mixed, any caching mechanism has to analyse and understand the context of each request and decide what should be cached on a case-by-case basis. If the two streams are split, the decision is easy: everything generic comes from the cache, and everything user-specific does not even try to use it. No error-prone request analysis is involved. The generic data flow also does not have to suffer the authentication overhead; you can save a lot of processor power by skipping authentication when it is not needed. Generic data can even be cached into static files, avoiding any dynamic processing on the web servers and giving much better performance. On e-commerce systems, user-specific pages typically have to go over HTTPS as well, which adds even more overhead.
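Here is a small sketch of the “no request analysis” point: if the two streams are published under separate path prefixes (the prefixes here are assumptions for illustration), the cache and authentication decision becomes a trivial lookup rather than per-request inspection.

```java
// Once the two streams live under separate path prefixes, the cache/auth
// decision is a one-line check instead of case-by-case request analysis.
final class RequestPolicy {
    record Policy(boolean cacheable, boolean requiresAuth) {}

    static Policy policyFor(String path) {
        if (path.startsWith("/catalogue/")) {
            // Generic stream: serve from cache (or pre-rendered static files), skip authentication.
            return new Policy(true, false);
        }
        // Everything else is treated as user-specific: authenticate, never cache.
        return new Policy(false, true);
    }

    public static void main(String[] args) {
        System.out.println(policyFor("/catalogue/books/123")); // Policy[cacheable=true, requiresAuth=false]
        System.out.println(policyFor("/account/orders/42"));   // Policy[cacheable=false, requiresAuth=true]
    }
}
```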

Because it does not matter where generic data comes from, it is also much easier to partition. Once the system grows past the capacity of a single database, we can simply take the generic content out and split it across several databases. Splitting account data and transactions is never that easy, because those details are bound to a single user context and often constrained by business rules and validations.
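As a rough illustration, partitioning the generic stream can be as simple as a deterministic routing rule, while everything user-specific stays on one transactional database. The JDBC URLs and the category-based rule below are assumptions, not a prescription.

```java
import java.util.List;

// Sketch: catalogue reads are routed to one of several databases by category,
// while all user-specific work stays on a single transactional database.
final class GenericShardRouter {
    private static final List<String> CATALOGUE_SHARDS = List.of(
            "jdbc:postgresql://catalogue-db-1/books",
            "jdbc:postgresql://catalogue-db-2/books");

    // Generic data has no cross-record business rules binding it together,
    // so a simple deterministic rule is enough to pick a shard.
    static String shardFor(String categoryId) {
        int index = Math.floorMod(categoryId.hashCode(), CATALOGUE_SHARDS.size());
        return CATALOGUE_SHARDS.get(index);
    }

    // Accounts and transactions stay together: they are bound to one user context
    // and to business rules that are much harder to enforce across databases.
    static String transactionalDatabase() {
        return "jdbc:postgresql://orders-db/shop";
    }
}
```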

How to split the flow

In the classic two-tier architecture, where the web site plugs directly into the database, splitting the data flow starts with using a different data source for each of the two streams. Even if, at the start, everything sits in the same database, use separate connection pools to retrieve user-specific and generic data. The generic data flow does not even have to run in transactions, which can make a big difference with some databases. If an ORM library is used, caching should be turned on for this stream. The user-specific data flow, as a general rule of thumb, should be transactional and should never be cached. It is a good idea to hide user-specific tables from the generic data flow using database user privileges, so that any attempt to mix the two streams gets caught quickly. Internal web site services should be designed to fall clearly into one of the two categories, and not to mix methods that retrieve generic data with methods that perform user-specific functions. This is all preparation for the day when the two-tier system needs to be split into more levels. With a clear separation of responsibilities, that divorce should not be painful at all.
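A minimal sketch of that two-pool setup, assuming HikariCP as the connection pool; the URLs, pool sizes and database user names are placeholders. The generic pool runs read-only in auto-commit mode under a database user that cannot see the user-specific tables, while the user-specific pool is transactional.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;

final class DataSources {

    // Generic stream: read-only, auto-commit (no transactions), connected as a
    // database user that has no privileges on account or order tables.
    static DataSource genericPool() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:postgresql://localhost/shop");
        cfg.setUsername("catalogue_reader");
        cfg.setPassword("secret");
        cfg.setReadOnly(true);
        cfg.setAutoCommit(true);
        cfg.setMaximumPoolSize(50);
        return new HikariDataSource(cfg);
    }

    // User-specific stream: transactional, connected as a user that can see the
    // account and order tables. Nothing fetched through this pool is ever cached.
    static DataSource userPool() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:postgresql://localhost/shop");
        cfg.setUsername("shop_app");
        cfg.setPassword("secret");
        cfg.setAutoCommit(false);
        cfg.setMaximumPoolSize(10);
        return new HikariDataSource(cfg);
    }
}
```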

In three-tier architectures, I like to split the middleware into user-specific and generic servers right from the start. Web servers sit on top, get generic data from the first middleware group and process transactions using the second. Generic data-flow servers can be clustered and scaled easily, and any load-balancing scheme will work right out of the box. They can be restarted, taken out of the cluster or put back in without any effect on the rest of the system, and transparent caching can be applied to them easily. User-specific servers are much trickier in all those respects and should absolutely never be transparently cached. This split is a preparation for further scaling and caching: generic data servers can be split regionally, put under several layers of cache servers, or divided vertically by product range or type. The functionality on user-specific servers stays focused and isolated, so there is less to worry about when those parts eventually need to scale.
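To illustrate the web tier's view of this split, here is a rough sketch (host names, ports and endpoints are assumptions) in which generic reads are spread round-robin across the generic middleware group while user-specific actions always go to the transactional group:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class MiddlewareClient {
    private static final List<String> GENERIC_GROUP =
            List.of("http://generic-1:8080", "http://generic-2:8080", "http://generic-3:8080");
    private static final String USER_GROUP = "http://user-specific-1:8080";

    private final HttpClient http = HttpClient.newHttpClient();
    private final AtomicInteger next = new AtomicInteger();

    // Generic read: any server in the group can answer, so pick the next one in turn.
    String bookDetails(String isbn) throws Exception {
        String host = GENERIC_GROUP.get(Math.floorMod(next.getAndIncrement(), GENERIC_GROUP.size()));
        HttpRequest request = HttpRequest.newBuilder(URI.create(host + "/books/" + isbn)).GET().build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // User-specific action: routed to the transactional group, with the session token attached.
    String placeOrder(String sessionToken, String isbn) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(USER_GROUP + "/orders"))
                .header("Authorization", "Bearer " + sessionToken)
                .POST(HttpRequest.BodyPublishers.ofString("{\"isbn\":\"" + isbn + "\"}"))
                .build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```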