On March 1, 2018, an incredibly powerful nor’easter hit the East Coast. The effects of this storm were massive:
- Hurricane-force winds were recorded in New England
- Two feet of snow fell in many areas
- Storm surges rivaling records set by Hurricane Sandy were recorded
This had an unexpected impact on AWS services in the US-East-1 region. The storm caused widespread power outages in Loudoun County, Virginia, home to one of the highest concentrations of data centers on the planet; 15,000 customers were still without power the day after the storm. Equinix, a co-location provider used by AWS, experienced a power outage affecting its DC1 – DC6 and DC10 – DC12 facilities. Another co-location provider, CoreSite, was affected at its VA1 and VA2 data centers. Amazon experienced outages between 6:23 am and 6:32 am during the initial power loss and again from 8:11 am to 8:21 am during service restoration. This is a very small outage window. During this outage, AWS experienced packet loss on its Direct Connect service that affected companies such as Slack and Capital One, as well as its own Alexa service, while the AWS network rerouted traffic.
The title of this blog post is fitting: for any cloud provider, an outage is, to put it mildly, to be avoided. Cloud technology is still relatively new and subject to both fair and unfair biases regarding its place in the market and its viability as an alternative to older, established technology. This is about as bad as it gets, even considering the weather conditions and the outcome.
Netflix, however, was not affected. Netflix is a notable user of AWS services, relying on them to serve its customers around the world, and it has been very open about its AWS configuration. Here is a great blog entry as an example. In it, Ruslan Meshenberg, Naresh Gopalani, and Luke Kosewski write about how Netflix implemented a multi-regional active-active solution to avoid just such an occurrence. They explain that in normal operation, DNS routes users to the geographically closest AWS region; if an outage occurs in a region, tooling re-routes users to a healthy, connected region. They also mention testing this configuration to verify its operation and effectiveness before a disaster strikes.
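The routing behavior described above can be sketched in a few lines. This is a minimal illustration, not Netflix's actual tooling: the region names and latency figures are assumptions chosen for the example, and real deployments would use a managed DNS service with health checks rather than application code.

```python
# Hypothetical client-to-region latencies in milliseconds (assumed values).
REGION_LATENCY_MS = {
    "us-east-1": 20,
    "us-west-2": 70,
    "eu-west-1": 110,
}

def pick_region(healthy_regions):
    """Route a user to the lowest-latency region that is currently healthy,
    mimicking latency-based DNS routing with health-check failover."""
    candidates = [r for r in REGION_LATENCY_MS if r in healthy_regions]
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=REGION_LATENCY_MS.get)

# Normal operation: the closest region wins.
print(pick_region({"us-east-1", "us-west-2", "eu-west-1"}))  # us-east-1

# Regional outage: traffic fails over to the next-best healthy region.
print(pick_region({"us-west-2", "eu-west-1"}))  # us-west-2
```

The key design point is that failover is automatic: when the closest region drops out of the healthy set, users are redirected without any change on their end.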
To summarize, Netflix has done its homework on how to build a resilient cloud architecture. I’m sure the other companies have emulated some of these ideas and will use this experience to make their own environments more resistant to these sorts of outages. It’s a learning experience for all of us. The comforting aspect of this event is that the tools and resources already exist to prevent it from happening again. The event doesn’t prove cloud services are unreliable; quite the opposite. It proves cloud services are extremely reliable if implemented correctly, tested repeatedly, and improved upon over time.