The Underlying Infrastructure of Web3? A Brief Analysis of the Causes of Yesterday's Cloudflare Service Interruption

星球君的朋友们

Odaily senior author

2022-06-22 12:45

A deep dive into the origins of Cloudflare and Web3.

Original Source: Alpha Rabbit Research Notes

In this article, I will explain what Cloudflare is and what kind of company it is, trace the connection between Cloudflare and Web3, and give a technical explanation of the causes of this outage.

Structure of this article

1. Event background
What happened at the end of June 2022 (this Tuesday)?

2. What is a CDN (Content Delivery Network)?
What is a CDN
What is routing
CDN companies are usually security companies?

3. What kind of company is Cloudflare?

4. The origin of Cloudflare and Web3
IPFS & Ethereum

5. Why was there a service interruption at Cloudflare? (technical analysis)
Related to the architectural transformation

6. Conclusion

Event background

Before talking about Cloudflare, let's first introduce a concept: the CDN

What is a CDN?

CDN stands for Content Delivery Network, also called a Content Distribution Network.

So, what is a Content Delivery Network? It is a system of computers interconnected over the Internet that uses the server closest to each user to send music, pictures, videos, applications and other files faster and more reliably, so that web content reaches users with high performance, good scalability and low cost.

By analogy, a CDN is somewhat like the JD.com logistics model. JD.com sets up logistics points (cache servers) all over the country. When someone buys goods from JD.com (a user requests a resource), JD.com uses the delivery address to find the nearest or fastest logistics point (the CDN resolves the user's domain name) and ships from there (the user is connected to the nearest cache server for the transfer).
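The "nearest logistics point" idea above can be sketched in a few lines. This is only an illustration: the server names and coordinates are made up, and real CDNs typically select servers by measured latency or anycast routing rather than by straight-line distance.

```python
import math

# Hypothetical cache-server locations (name -> (latitude, longitude)).
EDGE_SERVERS = {
    "london": (51.5074, -0.1278),
    "frankfurt": (50.1109, 8.6821),
    "new_york": (40.7128, -74.0060),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_edge(user_location, servers=EDGE_SERVERS):
    """Return the name of the cache server closest to the user."""
    return min(servers, key=lambda name: haversine_km(user_location, servers[name]))


if __name__ == "__main__":
    oxford = (51.7520, -1.2577)
    print(nearest_edge(oxford))
```

Distance is used here only because it is easy to compute; swapping `haversine_km` for a latency measurement would not change the structure of the selection.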

CDN services ensure fast and reliable delivery of static content. Static content can be cached, so it is best stored and distributed from within fast networks. This frees up backbone capacity for dynamic content that must be delivered in real time, such as live webcasts, and so reduces latency.

Let's take an example. Suppose a British company's main customers are also in the UK. If a website is built for this company, its server is usually placed in the UK. Even so, delays can occur that hurt users' experience of the site. If a delay is caused by network congestion, however, it can be improved.

How can it be improved?

One approach is to add bandwidth by laying more optical fiber. Fiber is mostly laid at the same time as other infrastructure such as submarine cables, railways and highways, so the bandwidth available to us has grown steadily over the years. You can think of increasing network bandwidth as widening a road: it is largely a matter of spending money to lay more.

Routing

We mentioned network routing earlier. What is routing? The main problem routing solves is how two points communicate, and which route the traffic between them takes.

For example, if the network between London and Oxford becomes congested, the system can choose another route. This is a bit like intelligent traffic management, and Internet route optimization works in a similar way. So over the years, despite ever-increasing traffic, network performance has kept improving.
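To make the London and Oxford example concrete, here is a minimal sketch that picks the cheapest route with Dijkstra's algorithm and then re-runs it after one link becomes congested. The topology, the intermediate city and the link costs are invented for illustration, and real Internet routing between networks is policy-based rather than a pure shortest-path computation.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; graph maps node -> {neighbor: cost}."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


def with_congestion(graph, a, b, extra_cost):
    """Return a copy of the graph with extra cost added to the a<->b link."""
    congested = {node: dict(links) for node, links in graph.items()}
    congested[a][b] += extra_cost
    congested[b][a] += extra_cost
    return congested


# Hypothetical link costs (think of them as milliseconds of delay).
NETWORK = {
    "London": {"Oxford": 5, "Reading": 4},
    "Reading": {"London": 4, "Oxford": 4},
    "Oxford": {"London": 5, "Reading": 4},
}

if __name__ == "__main__":
    print(shortest_path(NETWORK, "London", "Oxford"))
    busy = with_congestion(NETWORK, "London", "Oxford", 20)
    print(shortest_path(busy, "London", "Oxford"))
```

With the direct link congested, the detour through the third city becomes the cheaper path, which is the behaviour the paragraph above describes.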

In layman's terms, a CDN accelerates websites. Some websites open extremely slowly for various reasons, and a CDN is needed to speed them up.

So if a European user wants to access the content of an American website, the CDN sets up a server in Europe and copies the American content onto it. When a European user looks up the domain name, the CDN operator knows that the request comes from Europe, so it returns the IP address of the European server, and the user naturally connects to the European server.
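The two steps described above (answering the DNS query with a regional server, and copying origin content to that server) might look roughly like this. Everything here is an assumption made for illustration: the domain, the region labels and the record layout are hypothetical, and the IP addresses come from ranges reserved for documentation.

```python
# Hypothetical DNS data: one origin in the US, cache servers keyed by client region.
RECORDS = {
    "www.example.com": {
        "origin": "192.0.2.10",
        "edges": {"EU": "198.51.100.20", "ASIA": "203.0.113.30"},
    }
}


def resolve(domain, client_region):
    """Answer a DNS query with the cache server for the client's region,
    falling back to the origin when that region has no cache server."""
    record = RECORDS[domain]
    return record["edges"].get(client_region, record["origin"])


class EdgeCache:
    """A cache server that copies content from the origin on first request."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._store = {}
        self.origin_fetches = 0

    def get(self, path):
        if path not in self._store:
            # Cache miss: fetch once from the origin and keep a local copy.
            self.origin_fetches += 1
            self._store[path] = self._fetch(path)
        return self._store[path]


if __name__ == "__main__":
    print(resolve("www.example.com", "EU"))
    cache = EdgeCache(lambda path: f"content of {path}")
    cache.get("/index.html")
    cache.get("/index.html")
    print(cache.origin_fetches)
```

Only the first request for a path crosses the Atlantic in this sketch; later requests are served from the European copy.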

CDN companies are usually security companies?

Note: Part of the CDN explanation in this section comes from the YouTube creator Lao Ke Tan Technology Stock.

What kind of company is Cloudflare?

Cloudflare officially launched in 2010 and is headquartered in San Francisco, USA. Its main business is CDN and security services: it provides customers with a reverse-proxy-based content delivery network and a distributed domain name resolution (DNS) service. Since 2009, the company has received venture capital investment from firms such as Union Square Ventures, and Baidu also participated in Cloudflare's Series D financing.

In addition, Cloudflare has acquired a series of network service and security companies, including StopTheHacker and CryptoSeal in 2014; Eager Platform Co. in 2016; Neumob, S2 Systems, Linc and Zaraz from 2017 onward; and, most recently, Vectrix and Area 1 Security.

The origin of Cloudflare and Web3

Cloudflare is a CDN company that began supporting Web3 development relatively early. Its official website notes that Web 1.0 gave the world the ability to disseminate information quickly, while Web 2.0 made that information interactive. Web 3.0, or Web3, is considered the next iteration of the Internet, built on decentralized technologies like IPFS and Ethereum.

Picture from Cloudflare official website

The Cloudflare Ethereum Gateway lets customers use their own domains: queries are sent as JSON-RPC over HTTP to the custom domain. Cloudflare manages, maintains and monitors the Web3 infrastructure, so builders can focus on what matters: building Dapps. Using its industry-leading global network, Cloudflare can provide safe, reliable and fast services based on Web3 technology.
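As a rough idea of what an HTTP JSON-RPC query to a custom domain looks like, the sketch below builds the request body for the standard Ethereum method `eth_blockNumber` and decodes a sample reply. The gateway URL is a hypothetical placeholder, the sample response is made up, and no network call is made.

```python
import json

# Hypothetical custom domain; a real one would be configured by the customer.
GATEWAY_URL = "https://web3.example.com"


def build_rpc_request(method, params=None, request_id=1):
    """Build the JSON body of a JSON-RPC 2.0 call."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params or [],
        "id": request_id,
    })


def parse_block_number(response_body):
    """Decode the hex-encoded result of an eth_blockNumber call."""
    return int(json.loads(response_body)["result"], 16)


if __name__ == "__main__":
    body = build_rpc_request("eth_blockNumber")
    print(f"POST {GATEWAY_URL} with body {body}")
    # Made-up response, shaped like a JSON-RPC 2.0 reply.
    sample = '{"jsonrpc": "2.0", "id": 1, "result": "0xff"}'
    print(parse_block_number(sample))
```

A Dapp would send the body with an HTTP POST and a `Content-Type: application/json` header; that step is left out so the sketch runs offline.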

Why was there a service outage at Cloudflare?

The official explanation of the Cloudflare service outage event on June 21, 2022:

Cloudflare apologizes for this outage, which was Cloudflare's fault and not due to an attack or other malicious activity.

The background of this architectural transformation

Over the past 18 months, Cloudflare has been working to transform the architecture of all of its busiest data centers, making them more agile and resilient. At present, 19 data centers have been successfully converted to this architecture, which Cloudflare internally calls Multi-Colo PoP (MCP); these 19 data centers are located in: Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, Sao Paulo, San Jose, Singapore, Sydney and Tokyo.

This new architecture is designed as a Clos network. A key part of it is an additional routing layer that creates a mesh of connections. This mesh allows Cloudflare to easily disable and enable parts of a data center's internal network for maintenance or to deal with problems. In Cloudflare's architecture diagram, this layer is labeled Spine.

Note: A Clos network is a multi-stage switching network. It was first formalized by Charles Clos in 1953 as an idealized model of practical multi-stage telephone switching systems. A Clos network is used when the required switching capacity exceeds that of the largest feasible single crossbar switch. Its main advantage is that it needs far fewer crosspoints than a single large crossbar switch serving the whole system.

However, because these locations also carry a large share of Cloudflare's traffic, any problem here has very wide-ranging effects, and unfortunately that is what happened in the Cloudflare outage on June 21.
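The crosspoint saving mentioned in the Clos network note can be checked with a short calculation. The sketch assumes the classic symmetric three-stage layout (r ingress switches of size n x m, m middle switches of size r x r, r egress switches of size m x n) with m = 2n - 1 middle switches, which is Clos's strict-sense non-blocking condition. It describes the original telephone-switching model, not Cloudflare's data center design.

```python
def crossbar_crosspoints(ports):
    """Crosspoints in a single N x N crossbar switch."""
    return ports * ports


def clos_crosspoints(n, r, m=None):
    """Crosspoints in a symmetric three-stage Clos network.

    n: inputs per ingress switch
    r: number of ingress (and egress) switches
    m: number of middle switches; defaults to 2n - 1
    """
    if m is None:
        m = 2 * n - 1
    ingress = r * (n * m)   # r switches of size n x m
    middle = m * (r * r)    # m switches of size r x r
    egress = r * (m * n)    # r switches of size m x n
    return ingress + middle + egress


if __name__ == "__main__":
    for n, r in [(6, 6), (32, 32)]:
        ports = n * r
        print(ports, crossbar_crosspoints(ports), clos_crosspoints(n, r))
```

For 36 ports the saving is small, but it grows quickly with size, which is why the multi-stage design pays off for large switching systems.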

Timeline and impact of the service outage

Cloudflare uses a protocol called BGP (Border Gateway Protocol, an autonomous-system routing protocol that runs over TCP). As part of this protocol, operators define policies that determine which prefixes (sets of adjacent IP addresses) are advertised to peers (the other networks they are connected to). These policies have separate components, called terms, that are evaluated sequentially. The end result is that any given prefix is either advertised or not advertised. A change in policy can mean that a previously advertised prefix is no longer advertised, which is known as being "withdrawn"; those IP addresses are then no longer reachable on the Internet.

In other words, the operator defines a policy that decides which route prefixes may be advertised. Advertising here means that the route can be learned by other BGP edge routers, so that other BGP networks learn of the route changes. A prefix uniquely identifies a network connected to the Internet.

While a change to the prefix advertisement policy was being deployed, a reordering of terms caused Cloudflare to withdraw a critical subset of prefixes.
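To see why the order of policy terms matters, here is a minimal sketch of sequential, first-match policy evaluation. It is a simplified model, not Cloudflare's router configuration: the prefixes are documentation ranges, the two-term policy is invented, and real BGP policy languages have much richer match conditions and actions.

```python
import ipaddress


def evaluate(policy, prefix):
    """Return True if the prefix is advertised under the policy.

    policy: ordered list of (match_network, action) terms. The first term
    whose network contains the prefix decides; unmatched prefixes are rejected.
    """
    net = ipaddress.ip_network(prefix)
    for match, action in policy:
        if net.subnet_of(ipaddress.ip_network(match)):
            return action == "accept"
    return False


def advertised(policy, prefixes):
    """List the prefixes that the policy allows to be advertised."""
    return [p for p in prefixes if evaluate(policy, p)]


# Hypothetical policy: advertise one site prefix, reject everything else.
INTENDED = [("198.51.100.0/24", "accept"), ("0.0.0.0/0", "reject")]
# The same two terms, accidentally reordered.
REORDERED = [("0.0.0.0/0", "reject"), ("198.51.100.0/24", "accept")]

if __name__ == "__main__":
    prefixes = ["198.51.100.0/24", "203.0.113.0/24"]
    print(advertised(INTENDED, prefixes))
    print(advertised(REORDERED, prefixes))
```

Both policies contain exactly the same terms. Because the catch-all reject is evaluated first in the reordered version, the site prefix never reaches its accept term and is withdrawn.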

Because of this withdrawal, Cloudflare engineers had additional difficulty reaching the affected data centers to revert the problematic change, although Cloudflare has backup procedures in place to handle such issues.

03:56 UTC: Cloudflare deployed the change to the first location. No locations were affected by the change, because these locations use the older architecture.
06:17: The change was deployed to Cloudflare's busiest locations, but not to locations with the MCP (Multi-Colo PoP) architecture.
06:27: The rollout reached the MCP-enabled locations, and the change was deployed to the spines. This is when the outage began, and 19 data centers quickly went offline.
06:32: Cloudflare declared the incident internally.
06:51: The first change was made on a router to verify the root cause.
06:58: The root cause was found and understood, and work began to revert the problematic change.
07:42: The last revert was completed. This was delayed because network engineers walked over each other's changes, reverting previous reverts, which caused the problem to reappear sporadically.
08:00: The incident was closed.
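The timeline shows a staged rollout in which the new-architecture locations were only reached at the final step. The sketch below models a rollout that checks health after each stage and rolls back on failure, and compares two stage orders. The `Fleet` class, the location names and the rule that the change breaks only "mcp" locations are assumptions invented for this illustration, not Cloudflare's tooling.

```python
def staged_rollout(stages, apply_change, is_healthy):
    """Apply a change stage by stage; roll everything back if a stage is unhealthy.

    stages: ordered list of lists of location names.
    apply_change(location, enable): applies (True) or reverts (False) the change.
    is_healthy(location): health check run after each stage.
    Returns (completed, deployed_locations).
    """
    deployed = []
    for stage in stages:
        for location in stage:
            apply_change(location, True)
            deployed.append(location)
        if not all(is_healthy(location) for location in stage):
            for location in reversed(deployed):
                apply_change(location, False)
            return False, []
    return True, deployed


class Fleet:
    """Toy fleet in which the change breaks every location named 'mcp-*'."""

    def __init__(self):
        self.changed = set()
        self.peak_broken = 0  # most locations broken at the same time

    def apply_change(self, location, enable):
        if enable:
            self.changed.add(location)
        else:
            self.changed.discard(location)
        broken = sum(1 for loc in self.changed if loc.startswith("mcp"))
        self.peak_broken = max(self.peak_broken, broken)

    def is_healthy(self, location):
        return not (location in self.changed and location.startswith("mcp"))


# New-architecture locations only in the final stage.
LATE_MCP = [["legacy-1"], ["legacy-2", "legacy-3"], ["mcp-1", "mcp-2", "mcp-3"]]
# One new-architecture location included in the very first stage.
EARLY_MCP = [["legacy-1", "mcp-1"], ["legacy-2", "legacy-3"], ["mcp-2", "mcp-3"]]

if __name__ == "__main__":
    for name, stages in [("late", LATE_MCP), ("early", EARLY_MCP)]:
        fleet = Fleet()
        completed, _ = staged_rollout(stages, fleet.apply_change, fleet.is_healthy)
        print(name, completed, fleet.peak_broken)
```

In both orders the rollout is stopped and reverted, but putting one new-architecture location in the first stage limits how many such locations are broken at once.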

Although these problematic data centers account for only 4% of Cloudflare's total network, the outage affected 50% of total requests.

(The original post includes a small amount of configuration code at this point, which is omitted here. Interested network engineers can read the original: https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/)

Remediation and next steps

This service interruption had a broad and serious impact. Cloudflare has always taken availability very seriously. It has already identified several areas for improvement and will keep working to uncover any other gaps that could lead to a service interruption.

Process:

While the MCP program was designed to improve availability, a procedural gap in how these data centers are updated had a severe impact. Although Cloudflare does use a staggered rollout strategy, it is not perfect: the deployment process and automation need to include MCP-specific testing and deployment procedures to ensure there are no unintended consequences.

Architecture:

A misconfigured router can prevent the proper routes from being advertised, which stops traffic from flowing normally to the infrastructure. Cloudflare will redesign the policy statements for route advertisement to prevent ordering errors.

Automation:

There are parts of Cloudflare's automation suite that can mitigate the negative impact of this kind of incident. Cloudflare will focus on automation improvements, enforcing improved stagger policies for network configuration rollouts, and providing automated "