Learning from Amazon's failure

At the end of April, Amazon Web Services (AWS), the world leader in cloud computing Infrastructure as a Service, suffered a major outage that affected hundreds of websites relying on its platform.

When Amazon went down, it took with it many companies that run their web services in Amazon's data centre in Northern Virginia, in the US East region. Companies such as reddit, indabamusic, foursquare and Quora were affected by this major outage. This was a worrying situation for Amazon.

With cloud computing gaining traction, enterprises are increasingly deploying their services on AWS because of the benefits the cloud provides in terms of cost savings, elasticity and faster time to market. But this AWS outage has raised questions and doubts, among current customers as well as potential cloud users, about the reliability of the cloud.

What lessons can we learn from this failure? Is the cloud something we can rely on? The answer is as simple as

“Yes, we can!”

Other well-known and successful customers, such as Netflix, use AWS to offer their services, yet their services were not affected by the outage. Why? Either because they were lucky enough to run their services in another Amazon region, or because they designed their systems to cope with failures and ensure business continuity.

Enterprises should design their systems to be robust and resilient to failure. That would allow them to keep their business up and running when such failures happen. The nice thing is that AWS allows you to design fault-tolerant architectures.

Moving an application to the cloud is not trivial; it does not merely mean relocating the service and relying on the availability of the cloud service provider. Such an approach is acceptable for services that are not “core business” or “mission critical”, or for companies whose services can be down for several hours, and can probably lose some data, without compromising their business.

Many services are being moved to the cloud to obtain the benefits it provides nowadays: high availability, the ability to scale (orders-of-magnitude increases in usage and/or users), performance, faster deployment times, and so on. Although Amazon is responsible for the interdependencies between Availability Zones, many companies failed to recognise that the other part of the problem lies in the architectures they deploy in the cloud.

At The Server Labs we think that the key to a successful architecture in the cloud is ‘Design for Failure’, just as for any other distributed system. Although the outage put companies out of business for hours and hurt their P&L, many still do not follow that golden rule. The reason is usually one of two: either they lack technical knowledge of complex architectures, distributed systems and cloud computing architectures, or they cannot afford the operational costs associated with a global high-availability architecture on Amazon AWS.

There are several approaches to architecting high-availability systems on Amazon's cloud.

Use multiple Availability Zones

This approach involves running the business services across multiple Availability Zones on AWS (such as US West 1a and US West 1b). When a particular zone fails, traffic is redirected to a different, stable zone. This is a cost-effective solution in comparison with the second approach of distributing business services across multiple regions. However, it may not be sufficient when several zones in the same region fail together, as happened in US East during the recent outage.
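As a rough illustration of the idea, the following sketch (in Python, using the boto3 AWS SDK) launches one instance in each of two zones. The AMI ID and zone names are hypothetical placeholders, and in practice an Elastic Load Balancer spanning both zones would handle the health checks and traffic redirection:

    import boto3

    # A minimal multi-AZ sketch: one instance per Availability Zone,
    # so that the loss of a single zone leaves the service running.
    ec2 = boto3.client("ec2", region_name="us-west-1")

    for zone in ("us-west-1a", "us-west-1b"):
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )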

Use multiple Availability Regions

The second approach is to run the application across multiple regions. In this case the service is hosted in several AWS regions (such as US West and Europe), making geo-distributed traffic and high availability across continents possible. This configuration is recommended for companies with demanding scalability, load-balancing and worldwide user-access requirements. If one region fails, traffic can be redirected to other, stable regions. This approach would have withstood the latest AWS outage scenario, and it is the one used by several companies that were not affected.
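One common way to implement the redirection between regions is DNS failover. The sketch below is again a non-authoritative illustration, using boto3 and Amazon Route 53: it declares a primary record (US West) backed by a health check and a secondary record (Europe). The hosted zone ID, health check ID, domain name and IP addresses are all placeholders:

    import boto3

    route53 = boto3.client("route53")

    # Primary record in US West guarded by a health check; if the
    # check fails, Route 53 serves the secondary record in Europe.
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",  # placeholder hosted zone
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "www.example.com.", "Type": "A",
                "SetIdentifier": "us-west-primary", "Failover": "PRIMARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.10"}],
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",
            }},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "www.example.com.", "Type": "A",
                "SetIdentifier": "eu-west-secondary", "Failover": "SECONDARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "198.51.100.20"}],
            }},
        ]},
    )

A low TTL keeps the failover window short. Note that the data layer still has to be replicated across regions separately, which is where much of the additional cost of this setup comes from.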

Both approaches offer a simple view of the possibilities for designing fault-tolerant cloud architectures. Of course, the design could be extended (involving multiple public clouds or hybrid solutions) according to the business needs.

As with any distributed system, cloud architects should bear in mind some key points. As a rule of thumb, “avoid single points of failure unless your business can live with them”. The architecture should not compromise scalability or availability, and any fail-over mechanism will incur additional costs.

In the end, high availability and scalability come down to a trade-off between the higher cost of infrastructure (the opportunity cost/benefit) and the benefit of not losing customers and revenue in case of failure.
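To make the trade-off concrete, here is a back-of-the-envelope sketch with entirely hypothetical figures; substitute your own standby costs, revenue and expected downtime:

    # Hypothetical figures -- replace with your own business numbers.
    standby_cost_per_month = 4_000.0       # extra infrastructure for a standby region
    revenue_loss_per_hour = 10_000.0       # revenue lost while the service is down
    expected_outage_hours_per_year = 6.0   # e.g. one multi-hour outage a year

    yearly_standby_cost = 12 * standby_cost_per_month                              # 48,000
    expected_yearly_loss = revenue_loss_per_hour * expected_outage_hours_per_year  # 60,000

    # Fail-over pays for itself when the expected loss exceeds its cost.
    print(expected_yearly_loss > yearly_standby_cost)  # True for these figures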

The AWS outage will force enterprises to focus on the importance of robust, well-defined and well-designed architectures for the cloud. If enterprise cloud architects consider and address the many possible failure modes, nothing will ever fail completely. With this approach in mind, those companies will get the most benefit from the cloud.
