
Coinbase Review of May Outage: AWS Cascading Failures Exposed Architectural Risks

2026-06-01 13:26

Odaily reports that Coinbase has released a post-mortem on the large-scale service outage that occurred on May 7, 2026. The disruption lasted approximately 8 hours, with full recovery taking about 12 hours. During this period, trading, deposits, withdrawals, and most other core services were either unavailable or severely degraded.

Coinbase stated that the outage was triggered by the simultaneous failure of multiple chillers in the cooling system of a data center in one Availability Zone (use1-az4) of the AWS us-east-1 region. The failure triggered thermal shutdown protection on server racks, taking EC2 instances and EBS volumes offline and affecting multiple internet services.

During the recovery process, Coinbase's trade matching engine lost quorum: its cluster was deployed within a single AWS data center, and the outage took the majority of its nodes offline. Restoring operations required emergency code changes and the formation of new node groups, and markets were reopened gradually as recovery progressed.
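
The quorum loss described above comes down to majority arithmetic. The following is a minimal sketch of that arithmetic, not Coinbase's implementation: the cluster size of five, the node layouts, and the zone names `zone-b` and `zone-c` are illustrative assumptions, since the report as summarized here does not state them.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest strict majority of a cluster of the given size."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, healthy_nodes: int) -> bool:
    """True if the healthy nodes still form a majority of the full cluster."""
    return healthy_nodes >= quorum_size(cluster_size)


def survives_zone_loss(nodes_per_zone: dict[str, int], failed_zone: str) -> bool:
    """Check whether a majority remains after every node in one zone goes offline."""
    total = sum(nodes_per_zone.values())
    healthy = total - nodes_per_zone.get(failed_zone, 0)
    return has_quorum(total, healthy)


# Hypothetical layouts: five nodes in one location versus spread over three.
single_location = {"use1-az4": 5}
spread_out = {"use1-az4": 2, "zone-b": 2, "zone-c": 1}

print(survives_zone_loss(single_location, "use1-az4"))  # False: 0 of 5 nodes remain
print(survives_zone_loss(spread_out, "use1-az4"))       # True: 3 of 5 nodes remain
```

Under these assumptions, a cluster confined to one location cannot keep a majority when that location fails, which is consistent with the need to form new node groups by hand.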

Additionally, Amazon Managed Streaming for Apache Kafka (MSK) experienced a control plane failure that prevented automatic re-election of partition leaders. This stalled order books, fee calculation, and parts of the settlement and data streaming systems, widening the overall impact. The system gradually returned to normal after Coinbase and AWS engineering teams worked together to migrate partitions manually.

Coinbase indicated that this incident exposed deficiencies in its cross-Availability Zone automatic failover capabilities and the disaster recovery of managed middleware. The company will upgrade its cross-region hot standby architecture, strengthen regular disaster recovery drills, migrate its Kafka systems from a dual-AZ to a triple-AZ deployment, and work jointly with AWS to address root causes and implement improvements.
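
Why a move from two to three Availability Zones matters for Kafka can be illustrated with a short sketch. It assumes a replication factor of 3, `min.insync.replicas=2`, producers using `acks=all`, and round-robin replica placement across zones; none of these settings are stated in the report, and the zone names are placeholders.

```python
def round_robin_placement(zones: list[str], replication_factor: int) -> list[str]:
    """Assign each replica of a partition to a zone in round-robin order."""
    return [zones[i % len(zones)] for i in range(replication_factor)]


def writable_after_zone_loss(
    replica_zones: list[str], failed_zone: str, min_insync_replicas: int
) -> bool:
    """With acks=all, writes succeed only if enough replicas survive the zone loss."""
    surviving = sum(1 for zone in replica_zones if zone != failed_zone)
    return surviving >= min_insync_replicas


def tolerates_any_zone_loss(
    zones: list[str], replication_factor: int, min_insync_replicas: int
) -> bool:
    """True if the partition stays writable no matter which single zone fails."""
    placement = round_robin_placement(zones, replication_factor)
    return all(
        writable_after_zone_loss(placement, zone, min_insync_replicas)
        for zone in zones
    )


# Assumed settings: replication factor 3, min.insync.replicas 2.
print(tolerates_any_zone_loss(["az-a", "az-b"], 3, 2))          # False
print(tolerates_any_zone_loss(["az-a", "az-b", "az-c"], 3, 2))  # True
```

With two zones, one zone necessarily holds two of the three replicas, so losing it leaves a single replica and writes are rejected. With three zones, any single zone failure leaves two replicas in sync.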