Overview
Events and Background
Normally, the state of the Ethereum PoS consensus network is finalized within 2 Epochs, but last week finalization was delayed on two occasions.
The first occurred on May 11, when finalization was delayed by 3 Epochs, about 20 minutes.
The second occurred on May 12, when finalization was delayed by 8 Epochs, about 51 minutes.
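The quoted durations follow from the mainnet timing parameters (32 slots per Epoch and 12 seconds per slot, which gives 6.4 minutes per Epoch). A minimal sketch of that arithmetic, where the helper name is chosen only for illustration:

```python
SECONDS_PER_SLOT = 12  # Ethereum mainnet slot time
SLOTS_PER_EPOCH = 32   # Ethereum mainnet slots per Epoch


def epochs_to_minutes(epochs: int) -> float:
    """Convert a number of Epochs to minutes (illustrative helper)."""
    return epochs * SLOTS_PER_EPOCH * SECONDS_PER_SLOT / 60


print(epochs_to_minutes(1))  # one Epoch
print(epochs_to_minutes(3))  # delay in the first event
print(epochs_to_minutes(8))  # delay in the second event
```

The results for 3 and 8 Epochs round to the "about 20 minutes" and "about 51 minutes" figures above.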
During both events, the Ethereum network continued to produce blocks and process transactions. However, because too few Validators (validating nodes) were voting, Epochs could not be finalized (finalization is what gives an Epoch the consensus-level security guarantee of the Ethereum PoS network). While an Epoch is not finalized, it could be rolled back if a majority of Validators acted maliciously and forked the chain, and the transactions in it would be rolled back with it.
In fact, no fork occurred during the incident and no Validator voted maliciously. The voting rate fell short only because a large number of Validators were offline, which prevented Epochs from being finalized.
Observation showed that the affected Validators were experiencing CPU overload, which is considered the direct cause of their going offline.
In the second event, finalization was delayed by 8 Epochs. Because this delay exceeded MIN_EPOCHS_TO_INACTIVITY_PENALTY (= 4), it triggered the Inactivity leak mechanism of the Ethereum consensus algorithm:
Offline Validators were penalized by deductions from their staked funds; about 28 ETH was forfeited.
Attestation rewards were cancelled, so about 50 ETH was not issued.
This mechanism ensures that online Validators eventually hold ⅔ of the total staked funds, so that the network state can be finalized again.
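The trigger condition can be sketched as a single comparison. This is a simplified illustration that takes the delay as a plain integer; the consensus specification derives the delay from the beacon state (previous Epoch minus last finalized Epoch), and the function name here is an assumption, not a client API.

```python
MIN_EPOCHS_TO_INACTIVITY_PENALTY = 4  # value quoted in the text


def triggers_inactivity_leak(finality_delay_epochs: int) -> bool:
    """Return True when the finality delay exceeds the threshold.

    Simplified sketch: the real check reads the delay from the beacon state.
    """
    return finality_delay_epochs > MIN_EPOCHS_TO_INACTIVITY_PENALTY


print(triggers_inactivity_leak(3))  # delay reported for the first event
print(triggers_inactivity_leak(8))  # delay reported for the second event
```

The comparison is strict, so a delay of exactly 4 Epochs does not start the leak.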
imToken's node service also detected this incident. By monitoring the voting status of consensus-layer Validators in real time, it can warn of anomalies in the Ethereum consensus network before an Epoch fails to finalize. The figure below shows the state of the nodes when the first event occurred.
Under PoW, a transaction is considered successful once a certain number of consecutive blocks have been built on top of it, after which it is unlikely to be rolled back. Under PoS, success is determined by the block height returned by Safe Head. The current specification uses the Justified Checkpoint as the Safe Head. Because that reflects the state as of the previous Epoch, confirmation may lag by 6.4 minutes, which is a very poor experience for users.
The Safe Head service developed by imToken calculates a safe block for transaction confirmation from real-time consensus-layer data, shortening confirmation time while still protecting user transactions. Under normal circumstances, the block height returned by imToken's Safe Head algorithm (yellow in the figure above) is very close to the latest block height (green), which improves the user experience.
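The two confirmation rules can be contrasted in a few lines. This is only a sketch of the rules as described here, not imToken's Safe Head algorithm: how the Safe Head height is computed is out of scope, and the default confirmation depth of 12 is an illustrative assumption.

```python
def confirmed_pow(tx_block: int, head_block: int, confirmations: int = 12) -> bool:
    """PoW-style rule: the transaction is buried under enough later blocks.

    The default depth of 12 is an assumed, illustrative value.
    """
    return head_block - tx_block >= confirmations


def confirmed_pos(tx_block: int, safe_head_block: int) -> bool:
    """PoS-style rule: the transaction's block is at or below the Safe Head."""
    return tx_block <= safe_head_block


# Hypothetical block heights, chosen only to exercise both rules.
print(confirmed_pow(tx_block=100, head_block=105))
print(confirmed_pos(tx_block=100, safe_head_block=103))
```

The closer the Safe Head tracks the latest block, the sooner the second rule reports a transaction as confirmed.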
More information on the Safe Head mechanism:
Cause Analysis
The direct cause of the incident was that some Ethereum consensus-layer client nodes were so heavily loaded that their Validators went offline and could not vote. Analysis identified the following reasons for the high load:
When a node receives attestations that point to stale blocks, it must recompute the beacon chain state to verify them, which consumes a lot of CPU and memory.
When a large number of attestations pointing to old blocks arrive at the same time, the node's CPU and memory are exhausted, which takes its Validators offline.
Normally this is mitigated by a cache keyed on the block an attestation points to. However, as the Validator set grew and such attestations arrived in large numbers, the cache in the affected clients was overwhelmed, and the nodes had to spend a lot of resources recomputing the beacon chain state.
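The cache behaviour described above can be illustrated with a toy model. This sketch uses Python's `functools.lru_cache` as a stand-in; the affected clients' actual cache design, sizes, and eviction policy are not given in this article, so the function names, cache size, and block roots below are all assumptions.

```python
from functools import lru_cache


def replay(block_roots, cache_size):
    """Return (hits, misses) after looking up a state for each block root."""

    @lru_cache(maxsize=cache_size)
    def beacon_state_for(root):
        # Stand-in for the expensive beacon state recomputation.
        return f"state@{root}"

    for root in block_roots:
        beacon_state_for(root)
    info = beacon_state_for.cache_info()
    return info.hits, info.misses


few_targets = ["a", "b"] * 50                        # 2 distinct old blocks
many_targets = [f"r{i}" for i in range(5)] * 20      # 5 distinct old blocks

print(replay(few_targets, cache_size=2))   # almost every lookup is a hit
print(replay(many_targets, cache_size=2))  # every lookup is a miss
```

Once the number of distinct old blocks being referenced exceeds the cache capacity, each lookup falls through to recomputation, which matches the resource exhaustion described above.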
The consensus-layer clients Teku and Prysm have released patch versions that address this problem. Specifically, the patched clients filter out these stale attestations, ignoring an attestation when the following conditions are met:
The attestation points to a stale Slot
The attestation points to a Checkpoint that the node has never seen
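The filter can be sketched as a predicate over an incoming attestation. This is not the Teku or Prysm code: the `Attestation` shape, the `max_age_slots` threshold, and the choice to require both conditions together are assumptions made for illustration, since the text does not say how the two conditions combine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    slot: int               # Slot the attestation points to (simplified)
    target_checkpoint: str  # identifier of the target Checkpoint


def should_ignore(att: Attestation, current_slot: int,
                  seen_checkpoints: set, max_age_slots: int = 32) -> bool:
    """Ignore attestations that are both stale and point to an unseen Checkpoint.

    max_age_slots is a hypothetical staleness threshold.
    """
    stale = current_slot - att.slot > max_age_slots
    unseen = att.target_checkpoint not in seen_checkpoints
    return stale and unseen


seen = {"cp-10"}
print(should_ignore(Attestation(slot=100, target_checkpoint="cp-3"), 200, seen))
print(should_ignore(Attestation(slot=190, target_checkpoint="cp-10"), 200, seen))
```

Dropping such attestations before verification avoids the beacon state recomputation that overloaded the nodes.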
However, finalization on the Ethereum mainnet still needs to be observed to confirm that the patches are effective.
Prysm: v4.0.3-hotfix
Teku: v23.5.0
Ethereum Design Advantages
In this incident, Ethereum remained available, continuing to produce blocks and process transactions, and only the finalization of Epochs was delayed. Two design points made this possible:
1. Diversity of Ethereum clients
2. The availability-oriented design of the Gasper consensus algorithm
Diversity of Ethereum Clients
Although the consensus-layer clients Teku and Prysm had implementation problems in this incident, other consensus-layer clients kept operating normally. The Lighthouse client, for example, was not affected this time: because different clients are implemented differently, its Validators kept working normally.
The Design of the Ethereum Gasper Consensus Algorithm for Availability
Availability is one of the design starting points of Gasper, the Ethereum consensus algorithm, which separates block production from block finalization. Even if finalization is blocked, block production does not stop. Since finalization eventually resumes in most cases (and the blocks already produced are then finalized), the impact on users is very low. In other BFT consensus algorithms, by contrast, consensus nodes stop producing the next block when finalization fails, so the entire blockchain is unavailable in the meantime, which is commonly described as the blockchain halting.
Ethereum Multi-Client Challenges
Figure: distribution of Ethereum clients (source: https://clientdiversity.org/#distribution)
Client diversity in Ethereum clearly still needs to be promoted. If client implementations were diverse enough that Prysm and Teku together accounted for less than ⅓, this event would not have happened at all (the remaining ⅔ of clients working properly would have been enough to finalize Epochs). In addition, execution-layer clients are currently concentrated in Geth, which accounts for as much as 61%. This is a potential risk: if Geth fails, Ethereum will be severely affected.
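The ⅓ threshold can be checked with exact fractions. This sketch assumes that a client's share of nodes equals its share of staked funds, which is a simplification, and the shares used below are hypothetical values rather than measured figures.

```python
from fractions import Fraction

FINALITY_THRESHOLD = Fraction(2, 3)  # share of stake that must be voting


def can_finalize(offline_share: Fraction) -> bool:
    """Return True if the stake still online reaches the ⅔ threshold."""
    return 1 - offline_share >= FINALITY_THRESHOLD


# Hypothetical shares of stake run by the failing clients.
print(can_finalize(Fraction(30, 100)))  # under one third offline
print(can_finalize(Fraction(40, 100)))  # over one third offline
```

Using `Fraction` avoids floating-point rounding right at the ⅓ boundary.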
Beyond the need for more client diversity, this incident also exposed client switching as a pain point: when one client implementation fails, how does a Validator switch to a working one? The process involves:
Safely migrating the Validator keys from the failing client to the working client
Making sure the old and new clients behave consistently so that the Validator is not slashed under Ethereum's slashing rules. For example:
If the old and new clients vote for checkpoints on both sides of a fork, the Validator is slashed
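The risk during migration is that the two clients sign conflicting attestations with the same key. A minimal sketch of the two slashable cases (double vote and surround vote), loosely modelled on the consensus specification's attestation rule; the `Vote` type is a simplified stand-in, and this version checks the surround condition in both directions for convenience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    source_epoch: int
    target_epoch: int
    head_root: str  # block the vote supports


def surrounds(a: Vote, b: Vote) -> bool:
    """True if vote a's source-target span strictly surrounds vote b's."""
    return a.source_epoch < b.source_epoch and b.target_epoch < a.target_epoch


def is_slashable(a: Vote, b: Vote) -> bool:
    """True if signing both votes with one key would be a slashable offence."""
    double_vote = a != b and a.target_epoch == b.target_epoch
    surround_vote = surrounds(a, b) or surrounds(b, a)
    return double_vote or surround_vote


old_client = Vote(source_epoch=10, target_epoch=11, head_root="fork-A")
new_client = Vote(source_epoch=10, target_epoch=11, head_root="fork-B")
print(is_slashable(old_client, new_client))  # same target Epoch, different forks
```

This is why the old client must be fully stopped, and its signing history carried over, before the new client starts signing.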
Monitoring of Ethereum Consensus
Services like Safe Head are needed to continuously monitor the real-time state of the Ethereum PoS network and to detect and warn of such events early, rather than learning that the network is abnormal only once an Epoch fails to finalize as expected. Related recent research can be found in:
Public Education on the Ethereum Consensus Algorithm
Implications for Ethereum Applications
Although the Ethereum network is robust, occasional instability does affect applications, and applications should handle these unstable scenarios correctly.
Layer 1 -> Layer 2 deposits take longer. Before Layer 2 mints, it must be certain that the L1 deposit transaction will not be rolled back. When Epoch finalization is delayed, L1 -> L2 deposits are delayed accordingly.
Similarly, exchanges must guard against on-chain deposit transactions being rolled back, so deposits take correspondingly longer to credit.
Price quotes from on-chain oracles are at risk of being rolled back, so high-value services that rely on them should be suspended as appropriate.
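For deposits, one conservative way to apply this is to credit only deposits whose block is already finalized. The sketch below is a hypothetical policy, not any bridge's or exchange's actual logic; the function name and block numbers are assumptions.

```python
def split_deposits(deposit_blocks, finalized_block):
    """Partition deposit block numbers into (creditable, pending).

    A deposit is creditable once its block is at or below the finalized block.
    """
    creditable = [b for b in deposit_blocks if b <= finalized_block]
    pending = [b for b in deposit_blocks if b > finalized_block]
    return creditable, pending


# Hypothetical block numbers: the finalized block lags during a finality delay.
print(split_deposits([100, 105, 110], finalized_block=105))
```

While finalization is delayed, the finalized block stops advancing, so more deposits stay in the pending list until finality resumes.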
Summary