AWS: Here’s what went wrong in our big cloud-computing outage – ZDNet

Amazon Web Services (AWS) rarely goes down unexpectedly, but you can expect a detailed explainer when a major outage does happen.

12/15 update: AWS misfires once more, just days after a massive failure

The latest of AWS's major outages occurred at 7:30AM PST on Tuesday, December 7, lasted five hours and affected customers using certain application interfaces in the US-EAST-1 Region. In a public cloud of AWS's scale, a five-hour outage is a major incident.

Managing the Multicloud

It's easier than ever for enterprises to take a multicloud approach, as AWS, Azure, and Google Cloud Platform all share customers. Here's a look at the issues, vendors and tools involved in the management of multiple clouds.

According to AWS's explanation of what went wrong, the source of the outage was a glitch in its internal network that hosts "foundational services" such as application/service monitoring, the AWS internal Domain Name Service (DNS), authorization, and parts of the Elastic Cloud 2 (EC2) network control plane. DNS was important in this case as it's the system used to translate human-readable domain names to numeric internet (IP) addresses.

SEE: Having a single cloud provider is so last decade

AWS's internal network underpins parts of the main AWS network that most customers connect with in order to deliver their content services. Normally, when the main network scales up to meet a surge in resource demand, the internal network should scale up proportionally via networking devices that handlenetwork address translation (NAT)between the two networks.

However, on Tuesday last week, the cross-network scaling didn't go smoothly, with AWS NAT devices on the internal network becoming "overwhelmed", blocking translation messages between the networks with severe knock-on effects for several customer-facing services that, technically, were not directly impacted.

"At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network," AWS says in its postmortem.

"This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks."

The delays spurred latency and errors for foundational services talking between the networks, triggering even more failing connection attempts that ultimately led to "persistent congestion and performance issues" on the internal network devices.

With the connection between the two networks blocked up, the AWS internal operating team quickly lost visibility into its real-time monitoring services and were forced to rely on past-event logs to figure out the cause of the congestion. After identifying a spike in internal DNS errors, the teams diverted internal DNS traffic away from blocked paths. This work was completed two hours after the initial outage at 9:28AM PST.

This alleviated impact on customer-facing services but didn't fully fix affected AWS services or unblock NAT device congestion. Moreover, the AWS internal ops team still lacked real-time monitoring data, subsequently slowing recovery and restoration.

Besides lacking real-time visibility, AWS internal deployment systems were hampered, again slowing remediation. The third major cause of its non-optimal response was concern that a fix for internal-to-main network communications would disrupt other customer-facing AWS services that weren't affected.

"Because many AWS services on the main AWS network and AWS customer applications were still operating normally, we wanted to be extremely deliberate while making changes to avoid impacting functioning workloads," AWS said.

First, the main AWS network was not affected, so AWS customer workloads were "not directly impacted", AWS says. Rather, customers were affected by AWS services that rely on its internal network.

However, the knock-on effects from the internal network glitch were far and wide for customer-facing AWS services, affecting everything from compute, container and content distribution services to databases, desktop virtualization and network optimization tools.

AWS control planes are used to create and manage AWS resources. These control planes were affected as they are hosted on the internal network. So, while EC2 instances were not affected, the EC2 APIs customers use to launch new EC2 instances were. Higher latency and error rates were the first impacts customers saw at 7:30AM PST.

SEE: Cloud security in 2021: A business guide to essential tools and best practices

With this capability gone, customers had trouble with Amazon RDS (relational database services) and the Amazon EMR big data platform, while customers with Amazon Workspaces's managed desktop virtualization service couldn't create new resources.

Similarly, AWS's Elastic Cloud Balancers (ELB) were not directly affected but, since ELB APIs were, customers couldn't add new instances to existing ELBs as quickly as usual.

Route 53 (CDN) APIs were also impaired for five hours, preventing customers changing DNS entries. There were also login failures to the AWS Console, latency affecting Amazon Secure Token Services for third-party identity services, delays to CloudWatch, and impaired access to Amazon S3 buckets, DynamoDB tables via VPC Endpoints, and problems invoking serverless Lambda functions.

The December 7 incident shared at least one trait with a major outage that occurred this time last year: it stopped AWS from communicating swiftly with customers about the incident via the AWS Service Health Dashboard.

"The impairment to our monitoring systems delayed our understanding of this event, and the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region," AWS explained.

Additionally, the AWS support contact center relies on the AWS internal network, so staff couldn't create new cases at normal speed during the five-hour disruption.

AWS says it will release a new version of its Service Health Dashboard early 2022, which will run across multiple regions to "ensure we do not have delays in communicating with customers."

Cloud outages do happen. Google Cloud has had its fare share and Microsoft in October had to explain its eight-hour outage. While rare, the outages are a reminder that public cloud might be more reliable than conventional data centers, but things do go wrong, sometimes catastrophically, and can impact a wide number of critical services.

"Finally, we want to apologize for the impact this event caused for our customers," said AWS. "While we are proud of our track record of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further."

View original post here:

AWS: Here's what went wrong in our big cloud-computing outage - ZDNet

How Do You Define Cloud Computing? - Data Center Knowledge [Last Updated On: May 3rd, 2017] [Originally Added On: May 3rd, 2017]
RCom arm in tie-up for cloud computing - Moneycontrol.com [Last Updated On: May 3rd, 2017] [Originally Added On: May 3rd, 2017]
Roundup Of Cloud Computing Forecasts, 2017 - Forbes [Last Updated On: May 3rd, 2017] [Originally Added On: May 3rd, 2017]
Cloud Computing Continues to Influence HPC - insideHPC [Last Updated On: May 3rd, 2017] [Originally Added On: May 3rd, 2017]
5 Cloud Computing Stocks to Buy - TheStreet.com [Last Updated On: May 3rd, 2017] [Originally Added On: May 3rd, 2017]
Adobe bets big on cloud computing for marketing, creative professionals - Livemint [Last Updated On: May 3rd, 2017] [Originally Added On: May 3rd, 2017]
Red Hat's New Products Centered Around Cloud Computing, Containers - Virtualization Review [Last Updated On: May 3rd, 2017] [Originally Added On: May 3rd, 2017]
Verizon sells cloud services to IBM in 'unique cooperation between ... - Cloud Tech [Last Updated On: May 3rd, 2017] [Originally Added On: May 3rd, 2017]
Hospital CIOs see benefits of healthcare cloud computing - TechTarget [Last Updated On: May 3rd, 2017] [Originally Added On: May 3rd, 2017]
How Cloud Computing Is Turning the Tide on Heart Attacks - Fortune [Last Updated On: May 3rd, 2017] [Originally Added On: May 3rd, 2017]
What is Cloud Computing Technology?: Cloud Definition ... [Last Updated On: May 3rd, 2017] [Originally Added On: May 3rd, 2017]
Daily Report: Cloud Computing Asserts Itself - New York Times [Last Updated On: May 4th, 2017] [Originally Added On: May 4th, 2017]
Verizon sells cloud services to IBM in 'unique cooperation between two tech leaders' - Cloud Tech [Last Updated On: May 4th, 2017] [Originally Added On: May 4th, 2017]
CIOs embrace the value of cloud computing in healthcare - TechTarget [Last Updated On: May 4th, 2017] [Originally Added On: May 4th, 2017]
Heptio's Joe Beda: Before embracing cloud computing, make sure your culture is ready - GeekWire [Last Updated On: May 4th, 2017] [Originally Added On: May 4th, 2017]
CLOUD COMPUTING Cisco Expands Cloud IoT Services with $610M Viptela Acquisition - CIO Today [Last Updated On: May 6th, 2017] [Originally Added On: May 6th, 2017]
A prepaid wallet that helps start-ups access cloud-computing services - The Hindu [Last Updated On: May 6th, 2017] [Originally Added On: May 6th, 2017]
Google: No to Price War Over Cloud Computing - Investopedia [Last Updated On: May 6th, 2017] [Originally Added On: May 6th, 2017]
3 things to know about the cloud v. data center decision - ZDNet [Last Updated On: May 8th, 2017] [Originally Added On: May 8th, 2017]
OpenStack Foundation cites 'capabilities, compliance and cost' as Summit kicks off - Cloud Tech [Last Updated On: May 9th, 2017] [Originally Added On: May 9th, 2017]
Profit From Cloud Computing Boom With This ETF - Seeking Alpha [Last Updated On: May 9th, 2017] [Originally Added On: May 9th, 2017]
Autonomous Driving Market Focuses on Artificial Intelligence and ... - PR Newswire (press release) [Last Updated On: May 9th, 2017] [Originally Added On: May 9th, 2017]
The cloud computing tidal wave - BetaNews [Last Updated On: May 9th, 2017] [Originally Added On: May 9th, 2017]
Aruba predicts a hybrid future for edge and cloud computing - The Internet of Business (blog) [Last Updated On: May 9th, 2017] [Originally Added On: May 9th, 2017]
China Says Draft Rules on Cloud Computing Have Been Misunderstood - Wall Street Journal (subscription) [Last Updated On: May 9th, 2017] [Originally Added On: May 9th, 2017]
Oracle launches cloud computing service for India | Business Line - Hindu Business Line [Last Updated On: May 11th, 2017] [Originally Added On: May 11th, 2017]
Microsoft is on the edge: Windows, Office? Naah. Let's talk about cloud, AI - The Register [Last Updated On: May 11th, 2017] [Originally Added On: May 11th, 2017]
IBM touts its cloud platform as quickest for AI with benchmark tests - Cloud Tech [Last Updated On: May 11th, 2017] [Originally Added On: May 11th, 2017]
Enterprise-owned data centres still 'essential' despite cloud growth, research notes - Cloud Tech [Last Updated On: May 11th, 2017] [Originally Added On: May 11th, 2017]
You really should know what the Andrew File System is - Network World [Last Updated On: May 11th, 2017] [Originally Added On: May 11th, 2017]
Microsoft launches Android app to manage its Azure cloud computing platform - Android Police [Last Updated On: May 11th, 2017] [Originally Added On: May 11th, 2017]
3 Cloud Computing Stocks To Buy Right Now - May 10, 2017 ... - Zacks.com [Last Updated On: May 11th, 2017] [Originally Added On: May 11th, 2017]
Virtustream Adds Enterprise Cloud to Global Dell EMC Partner Program - Cloud Computing Intelligence (registration) (blog) [Last Updated On: May 13th, 2017] [Originally Added On: May 13th, 2017]
Trump signs cybersecurity executive order, mandating a move to cloud computing - GeekWire [Last Updated On: May 14th, 2017] [Originally Added On: May 14th, 2017]
Cloud Computing, Term of Art Complete Preakness Works - BloodHorse.com (press release) (registration) (blog) [Last Updated On: May 14th, 2017] [Originally Added On: May 14th, 2017]
IBM Announces The Defense Calculator And A Cloud Computing Service - Forbes [Last Updated On: May 14th, 2017] [Originally Added On: May 14th, 2017]
Achieving compliance in the cloud - CSO Online [Last Updated On: May 17th, 2017] [Originally Added On: May 17th, 2017]
Boston schools CIO Mark Racine takes hybrid approach to cloud computing - EdScoop News (press release) (registration) (blog) [Last Updated On: May 17th, 2017] [Originally Added On: May 17th, 2017]
Benefit-risk 'tipping point' for cloud computing now passed, says ... - Out-Law.com [Last Updated On: May 17th, 2017] [Originally Added On: May 17th, 2017]
Cloud Computing puts in work for Preakness before deluge - Daily Racing Form [Last Updated On: May 17th, 2017] [Originally Added On: May 17th, 2017]
How telecom is shifting its strategy to support cloud computing - SiliconANGLE (blog) [Last Updated On: May 17th, 2017] [Originally Added On: May 17th, 2017]
Cloud computing - Simple English Wikipedia, the free encyclopedia [Last Updated On: May 17th, 2017] [Originally Added On: May 17th, 2017]
How Alphabet Views the Cloud Computing Price Wars - Market Realist [Last Updated On: May 18th, 2017] [Originally Added On: May 18th, 2017]
Keying Longshot Cloud Computing in the Preakness - America's Best Racing [Last Updated On: May 18th, 2017] [Originally Added On: May 18th, 2017]
Microsoft Extends Cloud-Computing Arms Race to Africa - Fox Business [Last Updated On: May 18th, 2017] [Originally Added On: May 18th, 2017]
Firms Face Decelerating Cloud Spending: Analyst - Investopedia [Last Updated On: May 20th, 2017] [Originally Added On: May 20th, 2017]
Is edge computing set to blow away the cloud? - Cloud Tech [Last Updated On: May 20th, 2017] [Originally Added On: May 20th, 2017]
Microsoft Extends Cloud-Computing Arms Race to Africa - Wall Street Journal (subscription) [Last Updated On: May 20th, 2017] [Originally Added On: May 20th, 2017]
Rested and ready: 13-1 shot Cloud Computing wins Preakness - Fairfield Daily Republic [Last Updated On: May 22nd, 2017] [Originally Added On: May 22nd, 2017]
Watch Cloud Computing's thrilling come-from-behind finish at the Preakness Stakes - For The Win [Last Updated On: May 22nd, 2017] [Originally Added On: May 22nd, 2017]
Cloud Computing wins the 142nd Preakness Stakes in front of a record crowd [Photos] - Baltimore Business Journal [Last Updated On: May 22nd, 2017] [Originally Added On: May 22nd, 2017]
13-1 shot Cloud Computing edges Classic Empire, springs upset in Preakness - News3LV [Last Updated On: May 22nd, 2017] [Originally Added On: May 22nd, 2017]
Cloud Computing Wins Preakness Stakes, and Techies Are Stoked - Fortune [Last Updated On: May 22nd, 2017] [Originally Added On: May 22nd, 2017]
Cloud Computing wins Preakness Stakes, dashing Always ... [Last Updated On: May 22nd, 2017] [Originally Added On: May 22nd, 2017]
Cloud computing, Galeria Inno and change at Deka - Delano.lu [Last Updated On: May 23rd, 2017] [Originally Added On: May 23rd, 2017]
Roundup Of Cloud Computing Forecasts, 2017 - Enterprise Irregulars (blog) [Last Updated On: May 23rd, 2017] [Originally Added On: May 23rd, 2017]
Cloud Computing Does Not Need Help From Washington - Cramer's ... - Seeking Alpha [Last Updated On: May 23rd, 2017] [Originally Added On: May 23rd, 2017]
CTOvision Assessment on The Megatrend of Cloud Computing - CTOvision (blog) [Last Updated On: May 23rd, 2017] [Originally Added On: May 23rd, 2017]
Cloud Computing's Trainer Wins One for His Mentor at Preakness - New York Times [Last Updated On: May 23rd, 2017] [Originally Added On: May 23rd, 2017]
Make sense of edge computing vs. cloud computing | InfoWorld - InfoWorld [Last Updated On: May 23rd, 2017] [Originally Added On: May 23rd, 2017]
Cloud Computing Takes the Preakness - RFD-TV [Last Updated On: May 23rd, 2017] [Originally Added On: May 23rd, 2017]
Cloud Computing takes Preakness - CNN.com [Last Updated On: May 23rd, 2017] [Originally Added On: May 23rd, 2017]
Make Sense of Edge Computing vs. Cloud Computing - Linux.com (blog) [Last Updated On: May 26th, 2017] [Originally Added On: May 26th, 2017]
How will cloud computing and analytics affect Citrix shops? - TechTarget [Last Updated On: May 26th, 2017] [Originally Added On: May 26th, 2017]
Red Hat to acquire cloud computing firm - Triangle Business Journal [Last Updated On: May 26th, 2017] [Originally Added On: May 26th, 2017]
Cloud computing streamlines oil field monitoring - Williston Daily Herald [Last Updated On: May 26th, 2017] [Originally Added On: May 26th, 2017]
Fans should appreciate Cloud Computing's Preakness win - ESPN [Last Updated On: May 26th, 2017] [Originally Added On: May 26th, 2017]
Cray Takes the Plunge into Cloud Computing - TOP500 News [Last Updated On: May 28th, 2017] [Originally Added On: May 28th, 2017]
Baidu to leverage cloud computing, artificial intelligence, in effort to ramp up behavioural analysis - South China Morning Post [Last Updated On: May 28th, 2017] [Originally Added On: May 28th, 2017]
Cloud computing will change the nature of hospital IT shops - Healthcare IT News [Last Updated On: May 28th, 2017] [Originally Added On: May 28th, 2017]
Microsoft's weapon in high-stakes cloud-computing battle with Amazon? Freebies - The Seattle Times [Last Updated On: May 28th, 2017] [Originally Added On: May 28th, 2017]
Amazon Shares Hit $1000, Showing Dominance of E-Commerce, Cloud - The VAR Guy [Last Updated On: May 30th, 2017] [Originally Added On: May 30th, 2017]
Oracle set to expand cloud reach with Tencent alliance - South China Morning Post [Last Updated On: May 30th, 2017] [Originally Added On: May 30th, 2017]
Cloud Computing to Skip Belmont as Field Comes into Focus - America's Best Racing [Last Updated On: May 30th, 2017] [Originally Added On: May 30th, 2017]
Movers: Amazon's Stock Price Hits $1000 - New York Times [Last Updated On: June 1st, 2017] [Originally Added On: June 1st, 2017]
Mary Meeker: Healthcare technology is booming thanks to cloud computing and wearables - SiliconANGLE (blog) [Last Updated On: June 1st, 2017] [Originally Added On: June 1st, 2017]
Will Amazon's Web Services Business Get Hurt by Cloud Computing Commodification? - HuffPost [Last Updated On: June 1st, 2017] [Originally Added On: June 1st, 2017]
Box CEO Aaron Levie: Artificial intelligence to revolutionize cloud computing - MarketWatch [Last Updated On: June 1st, 2017] [Originally Added On: June 1st, 2017]
Cloud computing takes off as top new discipline on campus - Education Dive [Last Updated On: June 1st, 2017] [Originally Added On: June 1st, 2017]
CIOs and factors overlooked when changing your cloud - Cloud Tech [Last Updated On: June 3rd, 2017] [Originally Added On: June 3rd, 2017]

AWS: Here’s what went wrong in our big cloud-computing outage – ZDNet

The Prometheus League

Breaking News and Updates

Prometheism

Forbidden Fruit

The Evolutionary Perspective

Transtopia Menu

Library Updates

Library Books

Future Euvolution

Lucid Dreams from Childhood

Genetic Revolution

Speciation + Self-Directed Evolution