A comprehensive review of the Kintsugi incident: what are the concrete action plans before the mainnet Merge?
ECN以太坊中国
Invited columnist
2022-02-10 03:18
This article is about 3,212 words; reading it in full takes about 5 minutes.
Kintsugi ran into a series of issues in the first few weeks of its operation, exposing several vulnerabilities across multiple clients.

Original source: notes.ethereum.org

Summary

The merged testnet Kintsugi encountered a series of issues during its first few weeks of operation, exposing several vulnerabilities across multiple clients. The problems were mainly triggered by a fuzzer developed by Marius, which is designed to create interesting blocks and broadcast them to the network.

In one such block, the blockHash was replaced by its parentHash. engine_executePayload carries all the parameters required to construct a block and its blockHash, so EL (execution layer) clients are expected to construct the block from these parameters and verify the blockHash. This particular block correctly failed Geth's check, but passed Nethermind's and Besu's: Nethermind incorrectly validated the block due to a caching issue, while Besu did not perform this check at all. As a result, the block was proposed by a Lighthouse-Besu node and split the chain into two forks, with validators connected to Nethermind or Besu at the execution layer on one fork, and validators connected to Geth on the other.
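
To make the check concrete, here is a minimal Go sketch of the blockHash validation, assuming a simplified set of payload fields and using sha256 as a stand-in for the real RLP-plus-keccak header hashing; none of the type or function names correspond to actual client code.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// ExecutionPayload holds a simplified subset of the fields that
// engine_executePayload delivers to the EL client (illustrative, not the
// full spec type).
type ExecutionPayload struct {
	ParentHash  []byte
	StateRoot   []byte
	BlockNumber uint64
	Timestamp   uint64
	BlockHash   []byte // hash claimed by whoever built the payload
}

// computeBlockHash stands in for the real computation, in which the EL
// client rebuilds the full block header from the payload fields, RLP-encodes
// it and keccak-256 hashes it; sha256 over a few fields is used here only to
// keep the sketch self-contained.
func computeBlockHash(p ExecutionPayload) []byte {
	h := sha256.New()
	h.Write(p.ParentHash)
	h.Write(p.StateRoot)
	binary.Write(h, binary.BigEndian, p.BlockNumber)
	binary.Write(h, binary.BigEndian, p.Timestamp)
	return h.Sum(nil)
}

// validateBlockHash is the check described above: rebuild the hash from the
// payload parameters and compare it with the claimed blockHash. Geth
// performed this check, Nethermind skipped it because of a caching issue,
// and Besu did not implement it at the time.
func validateBlockHash(p ExecutionPayload) error {
	if !bytes.Equal(computeBlockHash(p), p.BlockHash) {
		return fmt.Errorf("blockHash mismatch: payload should be reported as INVALID")
	}
	return nil
}

func main() {
	// Fuzzer-style payload: blockHash simply copied from parentHash.
	bad := ExecutionPayload{ParentHash: []byte{0xaa}, BlockHash: []byte{0xaa}, BlockNumber: 1}
	fmt.Println(validateBlockHash(bad)) // prints the mismatch error
}
```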

Note that checking the current block's blockHash is a requirement added by the Merge, so some clients were missing this validation or had implemented it incorrectly.

One issue in Geth was that, when executing a bad payload, it returned a JSON-RPC error instead of an INVALID status, and Teku's issue (fixed but not yet deployed at that point) was that such errors were treated as passable in optimistic sync mode. Teku-Geth nodes therefore went into optimistic sync mode when they encountered the invalid payload. Since the block itself was valid, the connected Geth nodes fetched the data from the network instead of via the Engine API, so the Teku-Geth nodes ended up on an invalid fork. Because the Teku nodes were still running an older version with several bugs, the Teku-Geth nodes remained in optimistic sync mode and refused to propose blocks while the chain stopped finalizing. We were now in a situation where consensus layer clients (Lighthouse, Prysm, Nimbus and Lodestar) paired with Geth (about 46%) and consensus layer clients paired with Nethermind/Besu (about 19%) were on different forks, while the remaining validators running Teku-Geth (about 35%) were stuck in optimistic sync mode.
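
The distinction between the two failure modes can be sketched from the CL side as follows. The method name engine_executePayloadV1 and the go-ethereum rpc package are real, but the response struct and control flow here are illustrative assumptions, not Teku's actual (Java) code.

```go
package main

import (
	"context"
	"fmt"

	"github.com/ethereum/go-ethereum/rpc"
)

// PayloadStatus mirrors the shape of the engine_executePayloadV1 response in
// the Kintsugi-era spec: a bad payload should come back with status
// "INVALID", not as a JSON-RPC error.
type PayloadStatus struct {
	Status          string `json:"status"` // VALID, INVALID or SYNCING
	LatestValidHash string `json:"latestValidHash"`
}

// executePayload shows how a CL client should keep the two failure modes
// apart. In the incident, Geth surfaced an invalid payload as a JSON-RPC
// error (the first branch), and Teku treated that error as compatible with
// optimistic sync instead of marking the payload invalid.
func executePayload(ctx context.Context, client *rpc.Client, payload any) (string, error) {
	var status PayloadStatus
	if err := client.CallContext(ctx, &status, "engine_executePayloadV1", payload); err != nil {
		// Transport or internal EL failure: the payload's validity is
		// unknown, so optimistic import is the only safe interpretation.
		return "UNKNOWN", fmt.Errorf("engine API call failed: %w", err)
	}
	// Explicit verdict from the EL: an INVALID payload must never be
	// imported, optimistically or otherwise.
	return status.Status, nil
}

func main() {
	fmt.Println("see executePayload for how the two failure modes are separated")
}
```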

After the fixes for the Nethermind and Besu nodes were found and deployed, we were able to get them back onto the correct chain. Updating the Teku-Geth nodes triggered another issue related to invalid memory access, caused by a bug in Geth related to the order of block validation. This particular vulnerability was also triggered by Marius' fuzzer, which produced a block with a valid parentRoot and block_number = 1. Before Geth executes a block, it needs to look up its ancestors to determine whether it needs to sync. One way it does this is by checking the cache using the parentHash and blockNumber. Since Teku executes all payloads on all forks simultaneously, the cache no longer contained the parentHash, so Geth tried to find the parent block by parentHash and blockNumber. However, the database had no block hash at that blockNumber (the block was constructed by the fuzzer). Geth concluded that, since it had no parent block, it needed to trigger a sync. The sync triggered this way, however, tried to sync a chain shorter than the canonical chain, which violated an invariant in Geth, causing the Geth process to error out, the node to shut down, and the Teku-Geth nodes to remain in an unhealthy state.
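
A rough sketch of the parent lookup described above, with a toy cache and database in place of Geth's actual internals; all names here are illustrative.

```go
package main

import "fmt"

// Header is a toy stand-in for a block header: just its hash and height.
type Header struct {
	Hash   string
	Number uint64
}

// resolveParent mimics the lookup order described above: recent-header
// cache first, then the database keyed by block number, and otherwise the
// conclusion that a sync is needed.
func resolveParent(cache map[string]Header, db map[uint64]Header, parentHash string, number uint64) (Header, bool) {
	// 1. Recent-header cache. Once Teku started executing payloads on every
	//    fork, the fuzzer block's parent was no longer present here.
	if h, ok := cache[parentHash]; ok {
		return h, true
	}
	// 2. Database lookup at height number-1. The fuzzer block claimed
	//    block_number = 1, and no stored block at that height matched its
	//    parentHash.
	if h, ok := db[number-1]; ok && h.Hash == parentHash {
		return h, true
	}
	// 3. Parent unknown: the node decides it has to sync. In the incident,
	//    the sync triggered this way targeted a chain shorter than the
	//    canonical one, violating an internal Geth invariant and crashing
	//    the process.
	return Header{}, false
}

func main() {
	cache := map[string]Header{}
	db := map[uint64]Header{0: {Hash: "genesis", Number: 0}}
	_, ok := resolveParent(cache, db, "unknown-parent", 1)
	fmt.Println("parent found:", ok) // false: a sync would be triggered
}
```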

While debugging the above issue, the Geth team also found a race condition in the merge code base that triggered the error. In addition, we had other issues: Nimbus had bugs related to execution layer reconnection, and Lodestar was down-scoring peers that refused to produce blocks.


FAQ

Q: Is this testnet dead?

A: No. After we deployed the fixes and resynced some stalled nodes, the chain started finalizing again. Once finality was restored, the chain could operate as usual. Currently, Kintsugi's participation rate is about 99%, which indicates that all clients have been patched and the network is functioning well. Transactions and smart contract interactions continue to work as usual.

Q: Why hasn't this chain been finalized for so long?

A: While we found the root cause early on, we wanted to keep the chain in a non-finalized state for a while so that client teams could debug their code. Additionally, we wanted to collect client performance data during periods of non-finality.

Q: Will validators on the forked chain be slashed?

A: No. Each validator maintains a slashing protection database, which ensures that the validator does not sign slashable messages. Validators on the "wrong" fork are simply considered inactive on the "correct" fork. Once they rejoin the "correct" fork, the slashing protection database prevents them from signing slashable messages.
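
As an illustration of what such a database checks, here is a minimal Go sketch of the attestation rules captured by the EIP-3076 slashing protection interchange format (double votes and surround votes); real clients also track block proposals and persist this history to disk, which this sketch omits.

```go
package main

import "fmt"

// AttestationRecord is a minimal slashing-protection entry: the source and
// target epochs of an attestation the validator has already signed.
type AttestationRecord struct {
	SourceEpoch uint64
	TargetEpoch uint64
}

// safeToSign returns false if signing a new attestation with the given
// source and target epochs would be slashable against the recorded history:
// either a double vote (same target) or a surround vote in either direction.
func safeToSign(history []AttestationRecord, source, target uint64) bool {
	for _, prev := range history {
		if prev.TargetEpoch == target {
			return false // double vote for the same target epoch
		}
		if source < prev.SourceEpoch && target > prev.TargetEpoch {
			return false // new attestation surrounds a previous one
		}
		if source > prev.SourceEpoch && target < prev.TargetEpoch {
			return false // new attestation is surrounded by a previous one
		}
	}
	return true
}

func main() {
	history := []AttestationRecord{{SourceEpoch: 10, TargetEpoch: 11}}
	// Re-voting for target 11 would be a double vote, so it is refused.
	fmt.Println(safeToSign(history, 10, 11)) // false
	// Moving on to a later, non-conflicting target is fine.
	fmt.Println(safeToSign(history, 11, 12)) // true
}
```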

Q: How will this affect the mainnet launch? Will there be new delays?

A: We do not think this incident will affect the mainnet launch plan. No serious problems were found in the specification itself. The purpose of a testnet is to find bugs, and we think Kintsugi has done a good job of surfacing edge cases in client implementations. The incident was a good stress test for multiple client combinations. We have a public checklist that will guide us in deciding when we are ready for the merge on mainnet.

Q: How will this affect the test plan?

A: We will look into creating several testnets that are forced into a non-finalized state. Continuous testing on these non-finalizing testnets will allow us to trigger more edge cases and improve tooling. The vulnerabilities found in this incident will be added as static test cases to ensure they are covered by regression tests.

Important implications for validators, infrastructure providers and tool developers:

The non-finalization period on the testnet reinforced some assumptions about worst-case hardware requirements. During periods of non-finality, validators should expect:

  • Increased CPU load (sometimes up to 100%) due to having to evaluate the fork choice across multiple forks

  • Increased disk usage during non-finality, since there is no pruning

  • A marginal increase in RAM usage

This means that any additional tools or monitoring running on the same machine will run into resource contention. The Kintsugi testnet tooling (block explorer, faucet, RPC) runs on a three-node Kubernetes cluster, which also runs the beacon nodes used by several tools. Since the beacon nodes used far more resources than provisioned, our tools frequently ran in a degraded state due to insufficient resources. It is prudent for infrastructure providers to run their consensus and execution layer clients on separate machines, or to define strict resource limits.

The merge means that every consensus layer client needs to run its own execution layer client. Execution layer clients (on mainnet) already require a lot of disk space, and the consensus layer's disk usage can also spike during periods of non-finality, which can lead to crashes due to insufficient disk space. All validators should make sure they have a large enough disk space buffer to handle this kind of situation.

Tool developers who rely on finality should give extra consideration to non-finalization periods. One possible approach is to display optimistic information while conveying in the user interface that this information may still change.
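
One way to detect this, sketched below in Go, is to read the execution_optimistic flag that the standard Beacon API exposes on many endpoints; the beacon node URL and the trivial error handling are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// headersResponse keeps only the field this sketch cares about: the
// execution_optimistic flag returned alongside standard Beacon API data.
type headersResponse struct {
	ExecutionOptimistic bool `json:"execution_optimistic"`
}

// headIsOptimistic asks a beacon node whether its current head block was
// imported optimistically, i.e. before the execution layer fully validated
// it. Tools relying on finality can surface this flag instead of presenting
// the data as settled.
func headIsOptimistic(beaconURL string) (bool, error) {
	resp, err := http.Get(beaconURL + "/eth/v1/beacon/headers/head")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var body headersResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return body.ExecutionOptimistic, nil
}

func main() {
	// Assumes a locally running beacon node exposing the standard REST API.
	optimistic, err := headIsOptimistic("http://localhost:5052")
	if err != nil {
		fmt.Println("beacon node unreachable:", err)
		return
	}
	fmt.Println("head is optimistic:", optimistic)
}
```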
