Coinbase Review of May Outage: AWS Cascading Failure Exposes Architectural Risks
Odaily Planet Daily News Coinbase released a post-mortem report on the large-scale service outage that occurred on May 7, 2026. The outage lasted approximately 8 hours, with full recovery taking about 12 hours. During this period, trading, deposits, withdrawals, and most core services were unavailable or severely degraded.
Coinbase stated that the root cause was the simultaneous failure of multiple chillers in the cooling system of a data center within an Availability Zone (use1-az4) of the AWS us-east-1 region. This triggered thermal shutdown protection for server racks, causing EC2 instances and EBS volumes to go offline and impacting several internet services.
During the recovery process, Coinbase's matching engine lost quorum due to the loss of the majority of nodes in a cluster architecture deployed in a single AWS data center. Emergency code adjustments and the establishment of a new node group were required to restore operation, followed by a gradual restart of market trading during the recovery process.
Additionally, a control plane failure occurred in the AWS Managed Streaming for Apache Kafka (MSK) service, preventing automatic re-election of partition leaders. This further blocked the quoting, fee, and certain settlement and data flow systems, expanding the overall scope of impact. After collaborative manual partition migration efforts by Coinbase and AWS engineering teams, the system gradually returned to normal.
Coinbase stated that this incident exposed deficiencies in its cross-Availability Zone automatic failover capabilities and the disaster recovery of managed middleware. The company will upgrade its cross-region hot standby architecture, strengthen regular failover drills, and migrate the Kafka system from dual-Availability Zone to three-Availability Zone deployment. It will also work with AWS to pursue root cause fixes and improvements.
