When an AWS cloud outage swept through the hyperscaler’s US-EAST-1 region late Sunday and into Monday, resilience plans were put to the test as theoretical consequences became real headaches for service providers and their customers.
Within hours, major digital brands were knocked offline as AWS scrambled to fix the problem. The hyperscaler said Monday that the disruption was triggered by Domain Name System (DNS) resolution issues involving its DynamoDB service, which took out a subsystem that launches cloud instances. Dependent downstream applications and services also fell within the outage's scope.
The incident was unusual for the way it unfolded, according to executives at several AWS partners.
Cascading effects stemming from a single DNS issue are rare, Jamil Ahmed, director and distinguished engineer at Solace, told Channel Dive. Solace, an AWS ISV partner, is a software company focused on real-time, event-driven integration for agentic AI.
Protection against such an event isn’t easy, Ahmed said. It calls for investment in a multicloud environment, which means doubling up on the technical expertise needed to deploy across more than one infrastructure.
"I don't propose that everybody needs to be multicloud, especially with this outage," Ahmed said. "There will be a class of businesses that are [saying], 'Ok, this is a rare event … Shrug it off and move on.'"
The calculus is different for companies keen to avoid financial or reputational harm tied to cloud outages.
"There is an additional cost to be multicloud, but now [the impacted businesses] can weigh that up against, 'What's the brand impact that we've had? What was the cost of that particular outage?'" Ahmed said. "It makes it at least a clearer cost-benefit analysis."
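The trade-off Ahmed describes can be sketched as a simple comparison between the expected annual cost of outages and the added annual cost of running multicloud. The Python below is an illustrative sketch only: the function names, the `residual_fraction` parameter and every figure in the usage example are hypothetical and do not come from Solace, AWS or any company quoted here.

```python
def expected_outage_cost(outage_hours_per_year: float, loss_per_hour: float) -> float:
    """Expected yearly loss from outages: hours of downtime times loss per hour."""
    return outage_hours_per_year * loss_per_hour


def multicloud_pays_off(
    outage_hours_per_year: float,
    loss_per_hour: float,
    extra_annual_cost: float,
    residual_fraction: float = 0.0,
) -> bool:
    """Return True when the avoided outage loss exceeds the multicloud premium.

    residual_fraction is the (assumed) share of outage loss that multicloud
    would NOT prevent, between 0.0 and 1.0.
    """
    avoided_loss = expected_outage_cost(outage_hours_per_year, loss_per_hour) * (
        1.0 - residual_fraction
    )
    return avoided_loss > extra_annual_cost


if __name__ == "__main__":
    # Hypothetical inputs: 10 hours of downtime a year at $50,000 an hour,
    # against a $300,000 yearly premium for a second cloud.
    print(multicloud_pays_off(10, 50_000, 300_000))
    # Same premium, but only 2 hours of expected downtime a year.
    print(multicloud_pays_off(2, 50_000, 300_000))
```

In practice the loss-per-hour input would also need to account for harder-to-measure effects such as brand impact, which is the part of the calculation Ahmed says a real outage makes clearer.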
The resiliency investment calculus
Software and services partners have strategies that can minimize operational disruptions. But engineering protection comes at a cost.
Randall Hunt, CTO at Caylent, a cloud services company and AWS Premier Tier Services Partner, said that building for resilience is an approach the company has "intentionally evangelized" with its customers. Yet, he agreed that hitting the strictest recovery objectives isn’t a panacea.
"Customers have to be realistic about what they're willing to spend to meet their RTO, RPO, and [service-level agreement] objectives," Hunt said in an email. "Sometimes the juice isn't worth the squeeze."
Businesses in the cloud have to make tough decisions about their risk tolerance, according to Allen Terleto, VP of partners and alliances at Cockroach Labs, a cloud-native distributed SQL database company and AWS partner.
"In the real world, everything is segmented to a point," he said. "There are going to be enterprises that are willing to pay the cost for business continuity because the ROI of being able to invest into those solutions is higher by avoiding the potential for that outage."
The prospect of lost revenue isn't the sole reason to make such an investment. In the highly regulated financial services industry, compliance is another key operational resiliency consideration, Terleto said, citing the risks associated with regulatory action.
Businesses should focus on making sensible investments in resiliency.
AWS’s outage was so rare that it "behooves a business to think realistically about how much they're willing to invest," Hunt said, citing diminishing returns from overspending on unrealistic RPOs and RTOs.
For companies with stringent uptime requirements, Caylent builds "multiregion cellular architectures that limit the blast radius of any potential failure," he said.
Cell-based architectures use multiple, isolated instances of a workload to reduce "the potential impact of a failure," according to AWS.
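To illustrate how isolated cells limit the blast radius, the minimal Python sketch below pins each customer to one cell with a stable hash, so a failure in one cell touches only the customers assigned to it. It is not Caylent's or AWS's implementation; the cell names, customer IDs and function names are hypothetical, and hash-based assignment is just one possible routing scheme.

```python
import hashlib


def assign_cell(customer_id: str, cells: list[str]) -> str:
    """Map a customer to one cell using a stable hash of the customer ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def affected_customers(
    customers: list[str], cells: list[str], failed_cell: str
) -> list[str]:
    """Return only the customers whose assigned cell is the one that failed."""
    return [c for c in customers if assign_cell(c, cells) == failed_cell]


if __name__ == "__main__":
    # Hypothetical cells; in a multiregion design each could live in a different region.
    cells = ["cell-a", "cell-b", "cell-c"]
    customers = [f"customer-{i}" for i in range(300)]

    hit = affected_customers(customers, cells, "cell-a")
    print(f"{len(hit)} of {len(customers)} customers affected by a cell-a failure")
```

Because the assignment is deterministic, the same customer always lands in the same cell, and customers in the remaining cells continue to be served while the failed cell recovers.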
Caylent leverages the Amazon Route 53 Application Recovery Controller "to maintain control even if global dependencies like US-EAST-1 are down," said Hunt. The failover service is designed to help organizations build high-availability applications, according to AWS.
The AWS outage reinforces the importance of multicloud and hybrid resiliency strategies, Jay Pasteris, COO at Blue Mantis, an IT services provider and AWS partner, said in an email. The company’s clients tend to distribute workloads across multiple availability zones and cloud providers to ensure service continuity. Those measures are layered with automated failover, disaster recovery and proactive monitoring.
In most cases, that resilience approach performed as intended this week, according to Pasteris. "Customers experienced minimal disruption while AWS services recovered," he said.
AWS is referring press inquiries on the outage to its health dashboard, a company spokeswoman told Channel Dive on Tuesday.