A major outage in AWS's Eastern region underscored the extent to which businesses and channel partners depend on the company's cloud services. The Sunday evening incident was also a painful reminder of the importance of geographical redundancy.
“Events like this highlight how an issue in one AWS region (such as US-EAST-1) can cascade across numerous dependent services,” Netgain Technology CEO Sumeet Sabharwal told Channel Dive in an email.
The incident took down instances in the Virginia-based US-EAST-1 region Sunday night, and mass outages lingered into Monday evening.
Large digital brands, including Reddit and WhatsApp, were offline for hours. Small and medium-sized businesses suffered connectivity issues during what was deemed to be the largest IT outage since the July 2024 CrowdStrike incident, according to Reuters.
Initial reports of problems in the region surfaced late Sunday on DownDetector. The number of reports peaked at 9,672 at 1:14 p.m. Eastern Time Monday.
AWS confirmed “increased error rates and latencies” in services related to US-EAST-1 Monday morning. The hyperscaler identified the root cause as a DNS problem affecting DynamoDB, AWS's NoSQL database service.
At 3:15 p.m. ET, recovery “across all AWS services” had commenced, according to the company, which said services were continuing to improve at 4 p.m.
Asymmetrical impacts
The outage had varying effects across the IT landscape.
Netgain’s clients were impacted, although none faced a complete outage, according to Sabharwal.
“A small number of clients encountered intermittent service impacts, primarily brief connectivity disruptions (including VPN access) and isolated capacity constraints,” Sabharwal said.
Mission, a CDW-owned MSP and professional services firm, learned of “multiple” affected services within its customer base.
“AWS has applied initial mitigations, and recovery is underway,” Mission said in an emailed statement. “You may encounter throttled requests as they complete restoration. We're tracking this actively and will reach out if your environment needs attention.”
Multi-region resilience
IT services providers can architect bulwarks to protect customers from cloud failures.
While outages are occurring less often than in previous years, Monday put a spotlight on the “far-reaching effects” of a regional cloud site going down, Sabharwal said.
“For clients requiring high availability, building in geographic redundancy and multi-region design is a critical part of modern reference architecture,” he said.
Redundancy, however, comes at a price. Netgain’s base offering deploys client environments in a single cloud region picked by the customer. The company’s premium managed services tier comes with multi-region redundancy.
“That means data and critical workloads are replicated to a secondary region, and failover-capable pathways are architected so that if a single region fails or becomes impaired, the secondary region can pick up with minimal disruption,” Sabharwal said.
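As a rough illustration of the failover pattern Sabharwal describes, the sketch below picks which region should serve traffic based on health status. It is a minimal example under stated assumptions, not Netgain's implementation: the `RegionStatus` type, the `pick_active_region` function and the region names are hypothetical, and a real deployment would derive health from monitoring rather than a hard-coded flag.

```python
from dataclasses import dataclass


@dataclass
class RegionStatus:
    """Hypothetical health record for one cloud region."""
    name: str
    healthy: bool


def pick_active_region(primary: RegionStatus, secondary: RegionStatus) -> str:
    """Prefer the primary region; fall back to the secondary if it is impaired."""
    if primary.healthy:
        return primary.name
    if secondary.healthy:
        return secondary.name
    # Neither region can serve traffic, so surface the failure to the caller.
    raise RuntimeError("no healthy region available")


# Illustrative scenario: the primary region is impaired, the secondary is not.
primary = RegionStatus(name="us-east-1", healthy=False)
secondary = RegionStatus(name="us-west-2", healthy=True)
active = pick_active_region(primary, secondary)
```

The decision logic is the simple part. The replication and failover-capable pathways Sabharwal mentions are what make the secondary region usable at the moment the switch happens.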
What some hyperscalers refer to as geographic redundancy may not go far enough. There are multiple AWS availability zones within US-EAST-1, but that level of redundancy wasn’t enough in this case. US-EAST-1 serves as a shared control plane for services that reach far beyond the region, according to Sabharwal.
“The effect went beyond a single availability zone failure and rippled across services and users in ways that typical AZ-only redundancy doesn’t fully protect against,” he said.
Netgain recommends an architecture that provides failover across different regions, rather than relying on availability zones within one region. Clients should assess which of their services use a single cloud region and conduct regular testing of failover plans, Sabharwal said.
MSPs need to provide “clear dependency mapping” for customers and put in place protocols to quickly respond to events like the AWS outage, he added.
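One simple way to picture the kind of dependency mapping and single-region assessment Sabharwal recommends is an inventory that lists the regions each service runs in. The sketch below is illustrative only: the service names, the inventory format and both helper functions are assumptions, and a real inventory would also need to capture indirect dependencies such as shared control planes.

```python
from typing import Dict, List, Set


def single_region_services(deployments: Dict[str, Set[str]]) -> List[str]:
    """Return services that run in exactly one region."""
    return sorted(name for name, regions in deployments.items() if len(regions) == 1)


def impacted_by(deployments: Dict[str, Set[str]], failed_region: str) -> List[str]:
    """Return services with no deployment outside the failed region."""
    return sorted(
        name
        for name, regions in deployments.items()
        if regions and regions <= {failed_region}
    )


# Hypothetical inventory: service name -> regions where it is deployed.
inventory = {
    "billing-api": {"us-east-1"},
    "customer-portal": {"us-east-1", "us-west-2"},
    "vpn-gateway": {"us-east-1"},
    "reporting": {"eu-west-1"},
}

at_risk = single_region_services(inventory)
down_if_east_fails = impacted_by(inventory, "us-east-1")
```

Here `at_risk` lists every service without a second region, while `down_if_east_fails` narrows that list to the services that would have no fallback if US-EAST-1 were impaired.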