Amazon.com's Hidden Ties to AWS: Why Outages Impact Retail Operations
A common misconception holds that Amazon.com runs entirely independently of its cloud computing arm, Amazon Web Services (AWS), which would explain why the retail site often appears unaffected by AWS outages. User reports contradict this: Amazon's retail operations do degrade significantly during such events.
The Reality of Impact
Many users report difficulties during AWS outages, including:
- Inability to add items to carts or view prices.
- Customer service chat and phone lines becoming unavailable.
- Delays in package deliveries.
- Loss of filtering functionality and access to order history.
This widespread impact suggests that critical components of the retail platform are, in fact, tied to AWS infrastructure.
Hybrid Infrastructure and Gradual Migration
This interconnection stems from Amazon's complex, evolving infrastructure strategy. The story that AWS simply grew out of Amazon Retail's excess capacity is an "old wives' tale"; in practice, Amazon runs two infrastructures side by side:
- AWS Infrastructure (NAWS): This is the public cloud offering.
- Legacy Infrastructure (MAWS/CDO): This refers to older internal systems, sometimes called CDO (Consumer Devices Other) or COE, which originally ran separately from AWS.
However, a significant migration is under way. Many parts of CDO, including critical services like Alexa, have been moving to AWS and make extensive use of AWS managed services such as DynamoDB, Lambda, Kinesis, and SQS. Even core compute layers are migrating from internal systems like Apollo to AWS offerings like ECS and Fargate. DynamoDB in particular is reportedly used "everywhere" within Amazon Retail, so its unavailability directly causes retail site issues.
This means Amazon.com is not a monolith on separate infrastructure. It is a complex system whose components increasingly run on, and depend on, AWS services, alongside legacy systems still in transition.
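To illustrate how one widely shared dependency can degrade several retail features at once, here is a minimal Python sketch. The feature names and the dependency map are hypothetical, chosen only to mirror the symptoms listed earlier; they do not describe Amazon's actual architecture.

```python
# Hypothetical map from retail features to the services they depend on.
# These pairings are illustrative assumptions, not Amazon's real design.
FEATURE_DEPENDENCIES = {
    "cart": {"dynamodb", "ecs"},
    "order_history": {"dynamodb"},
    "search_filters": {"dynamodb", "lambda"},
    "static_pages": set(),
}


def degraded_features(unavailable):
    """Return the features with at least one unavailable dependency, sorted by name."""
    down = set(unavailable)
    return sorted(
        feature
        for feature, deps in FEATURE_DEPENDENCIES.items()
        if deps & down  # non-empty intersection: some dependency is down
    )


if __name__ == "__main__":
    # A single shared service going down affects every feature that uses it.
    print(degraded_features({"dynamodb"}))
```

Under this toy model, losing the one service that appears in most dependency sets degrades most features, while a feature with no dependencies keeps working.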
Regionality and Resilience Challenges
Another key factor is the regional nature of outages. The impact on Amazon.com often depends on which AWS regions or data centers are affected and where user requests are being routed. Strategies like multi-region failover exist, but they are complex to implement perfectly. Amazon may even route traffic to less-impacted data centers, accepting higher latency in exchange for continued availability rather than shutting down completely.
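The latency-for-availability trade-off described above can be sketched in a few lines of Python. This is a simplified illustration, not Amazon's routing logic: the `Region` type, the `pick_region` function, and the latency figures are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    name: str          # region label, used here only as an example identifier
    latency_ms: float  # assumed round-trip latency from the user (made-up numbers)
    healthy: bool      # whether the region is currently serving traffic


def pick_region(regions: Sequence[Region]) -> Optional[Region]:
    """Choose the lowest-latency healthy region, or None if all are down."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None  # nothing left to fail over to
    return min(healthy, key=lambda r: r.latency_ms)


if __name__ == "__main__":
    regions = [
        Region("us-east-1", 20.0, healthy=False),  # nearest region, but impaired
        Region("us-west-2", 80.0, healthy=True),
        Region("eu-west-1", 140.0, healthy=True),
    ]
    choice = pick_region(regions)
    print(choice.name if choice else "no healthy region")
```

In this sketch the nearest region is impaired, so requests go to the next-closest healthy one: the user sees a slower response instead of an error. Real failover also has to deal with data replication, capacity limits, and health-check accuracy, which is part of why it is hard to get right.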
In essence, Amazon.com's apparent resilience to AWS outages does not come from complete isolation. It reflects ongoing work to manage a hybrid, gradually migrating infrastructure, and the inherent difficulty of keeping global operations seamless during regional cloud service disruptions.