During the last two years, I was involved in several projects deployed on the Amazon cloud. Being a relatively early adopter was a fantastic experience that provided lots of opportunities to burn my fingers and learn from mistakes. It also seriously challenged my view of scalable software architectures. I spoke about key lessons learned at CloudCamp London last week – here is the summary of that presentation.
Before I start, I’d like to point out that judging from this post it might seem that I have a negative view of cloud deployments, but nothing could be further from the truth. I have many nice things to say about the cloud, but lots of other presenters at CloudCamp do that all the time. I wanted to play the devil’s advocate a bit and expose some of the things that you won’t necessarily find in marketing materials.
First fundamental rule of cloud deployment: No single machine on the cloud is going to be any more reliable than any other machine there
Before the cloud, I was used to investing more in machines which were more important. Database boxes would have better power supplies than web servers, ideally redundant. Content servers got better disks and lots of them. A nice Cisco appliance would balance requests to web servers, and was infinitely more reliable than them. Web servers, for all I cared, could crash and burn at any time, as long as they did not all decide to do it at the same time. With the cloud, this isn’t possible. No matter how many virtual cores or memory you rent, all the boxes are running on very similar hardware. Or, putting it in another way:
Cynical version of the first rule: All your cloud boxes are equally unreliable
A very healthy way to look at this is that all your cloud applications will run on a bunch of cheap web servers. It’s healthy because planning for that in advance will help you keep your mental health when glitches occur, and it will also force you to design for machine failure upfront making the system more resilient.
A few months ago, one of the load balancing appliance providers tried to pitch their software for the cloud. Choose a box to act as a load balancer, install their software, and hey presto you have the equivalent of a balancer appliance in the cloud. Yeah, right. That machine is as probable to go down as any web server, and when it does your entire server cluster will be cut off from the Internet.
Second fundamental rule of cloud deployments: All machines will be affected by the same networking and IO constraints
Amazon lets people get pretty big virtual boxes in terms of processor or memory power. However, IO and networking are a completely different issue. For in-house data grid deployments, getting a separate set of network cards and putting them on a dedicated VLAN or even their own switch is a really good idea, because of the broadcast traffic between the nodes. You can’t do that on the cloud. Putting a card with hardware TCP offloading is not an option (and broadcast is also not an option at least on Amazon, but that’s another story). So the architecture has to work around this. Bottlenecks can’t be just solved by getting better hardware. Beware of this while designing an architecture that depends on all traffic going through a single file server, database machine or load balancer. If all the traffic goes through a single point, the entire capacity of the cluster will be limited to that machine’s IO or network constraints (which is probably shared with who knows how many other virtual machines on the same physical box).
Third fundamental rule of cloud deployment: networking is unreliable
When I started using the cloud, it took me a while to fully grasp the fact that it’s a really really big data centre. Even if you get only two machines there, they still run in a huge data centre, can be separated with lots of cable and switches and might not even be in the same building. That introduces a new level of complexity that I simply did not have to deal with earlier. It’s not often that I had problems with the network in the cloud, but strange things do happen. I watched my grid control machine disappear from the network and come back in ten minutes without any idea what happened. A few months ago I was completely puzzled by the fact that a web server can’t talk to an application server, although I was connected to both of them at that same time. They were working fine, but they simply did not see each other for a while.
Murphy’s law guarantees that these issues, although rare, will occur at the most inconvenient moment. So choose infrastructure and clustering software wisely. I used an opensource messaging system that worked perfectly inhouse, but the cluster simply couldn’t recover from network glitches in the cloud. Put in lots of monitoring scripts to check if machines can see each other, and plan for parts of the cluster being inaccessible for short periods of time.
There is no fast shared storage
All these things together contribute to the last big challenge I had to get my head around. There simply is no equivalent of a nice network attached shared storage on Amazon. If you’re planning to use a hot-standby database by sharing data files, think again. Elastic Block Clouds can only be mounted to a single machine. You can get a machine and expose an elastic block using NFS, but that falls under the effects of all three rules explained above. SimpleDB is slow. PersistentFS allows you to mount S3 as a file system to multiple machines, but read-only. There is no good solution for this, apart from designing around it.
How to keep your sanity
It took me a while to understand that just deploying the same old applications in the way I was used to isn’t going to work that well on the cloud. To get the most out of cloud deployments, applications have to be designed up-front for massive networks and running on cheap unstable web boxes. But I think that is actually a good thing. Designing to work around those constraints makes applications much better – faster, easier to scale, cheaper to operate. Asynchronous persistence can significantly improve performance but I never thought about that before deploying to the cloud and running into IO issues. Data partitioning and replication make applications scale better and work faster. Sections of the system that can work even if they can’t see other sections help provide a better service to customers. This also makes the systems easier to deploy, because you can do one section at a time.
To conclude, there are three key ideas to keep in mind:
- Partition, partition, partition: avoid funnels or single points of failure. Remember that all you have is a bunch of cheap web servers with poor IO. This will prevent bottlenecks and scoring an own-goal by designing a denial of service attack in the system yourself.
- Plan on resources not being there for short periods of time. Break the system apart into pieces that work together, but can keep working in isolation at least for several minutes. This will help make the system resilient to networking issues and help with deployment.
- Plan on any machine going down at any time. Build in mechanisms for automated recovery and reconfiguration of the cluster. We accept failure in hardware as a fact of life – that’s why people buy database servers with redundant disks and power supplies, and buy them in pairs. Designing applications for cloud deployment simply makes us accept this as a fact with software as well.