Coinbase Recaps May Outage: AWS Cascading Failure Exposes Architectural Risks
Odaily Odaily reported that Coinbase has released a post-mortem report on the large-scale service outage that occurred on May 7, 2026. The outage lasted approximately 8 hours, with full recovery taking about 12 hours. During this period, trading, deposits, withdrawals, and most core services were either unavailable or severely degraded.
Coinbase stated that the incident was triggered by the simultaneous failure of multiple chillers in the cooling system of a data center in one Availability Zone (use1-az4) within the AWS us-east-1 region. This caused rack thermal protection shutdowns, leading to EC2 instances and EBS volumes going offline, and also impacted several other internet services.
During the recovery process, Coinbase's trading matching engine lost quorum as its cluster architecture, deployed in a single AWS data center, lost a majority of its nodes. Emergency action was required to restore operations through code adjustments and the formation of new node groups, followed by a gradual restart of market trading.
Additionally, a control plane failure in the AWS Managed Kafka (MSK) service prevented automatic re-election of partition leaders, further blocking the quote, fee, and some settlement and data stream systems, amplifying the overall impact. The system gradually returned to normal after Coinbase and the AWS engineering team collaborated on manual partition migration.
Coinbase stated that the incident exposed shortcomings in its cross-Availability Zone automatic failover capabilities and the disaster recovery of managed middleware. The company plans to upgrade its cross-region hot standby architecture, strengthen regular disaster recovery drills, migrate its Kafka system from a dual-AZ to a triple-AZ deployment, and work with AWS to advance root cause fixes and improvements.
