An article to understand the advantages and disadvantages of FPGA and GPU accelerated zero-knowledge proof computing

星球君的朋友们

Odaily资深作者

2023-06-06 03:30

This article is about 2048 words, reading the full article takes about 3 minutes

From practical engineering experience, FPGA is an option, but GPU is currently a cost-effective option.

AI Summary

Expand

From practical engineering experience, FPGA is an option, but GPU is currently a cost-effective option.

Original source: IOSG Ventures

Zero-knowledge proof technology is more and more widely used, such as privacy proof, calculation proof, consensus proof and so on. While looking for more and better application scenarios, many people have gradually discovered that zero-knowledge proof proves that performance is a bottleneck. The Trapdoor Tech team has been deeply researching zero-knowledge proof technology since 2019, and has been exploring efficient zero-knowledge proof acceleration solutions. GPU or FPGA are relatively common acceleration platforms currently on the market. This paper starts with the calculation of MSM and analyzes the advantages and disadvantages of FPGA and GPU accelerated zero-knowledge proof calculation.

TL;DR

ZKP is a technology with broad prospects for the future. More and more applications are adopting zero-knowledge proof technology. However, there are many ZKP algorithms, and various projects use different ZKP algorithms. At the same time, the computational performance of the ZKP proof is relatively poor. This paper analyzes the MSM algorithm, elliptic curve point addition algorithm, Montgomery multiplication algorithm, etc. in detail, and compares the performance difference between GPU and FPGA in BLS 12 _ 381 curve point addition. In general, in terms of ZKP proof computing, short-term GPU has obvious advantages, high throughput, high cost performance, programmability and so on. Relatively speaking, FPGA has certain advantages in power consumption. In the long run, there may be FPGA chips suitable for ZKP calculations, or ASIC chips customized for ZKP.

The ZKP algorithm is complex

ZKP is a general term for Zero Knowledge Proof technology (Zero Knowledge Proof). There are mainly two classifications: zk-SNARK and zk-STARK. The current common algorithms of zk-SNARK are Groth 16, PLONK, PLOOKUP, Marlin and Halo/Halo 2. The iteration of the zk-SNARK algorithm is mainly along two directions: 1/Whether a trusted setup is needed 2/The performance of the circuit structure. The advantage of the zk-STARK algorithm is that no trusted setup is required, but the amount of verification calculation is log-linear.

As far as the application of the zk-SNARK/zk-STARK algorithm is concerned, the zero-knowledge proof algorithms used by different projects are relatively scattered. In the application of the zk-SNARK algorithm, because the PLONK/Halo 2 algorithm is universal (no trusted setup required), there may be more and more applications.

PLONK proves the amount of computation

Taking the PLONK algorithm as an example, analyze the calculation amount of the PLONK proof.

The calculation amount of the PLONK proof part consists of four parts:

1/ MSM - Multiple Scalar Multiplication. MSM is often used to compute polynomial commitments.

2/ NTT Computation - Polynomial conversion between point value and coefficient representation.

3/ Polynomial calculation - polynomial addition, subtraction, multiplication and division. Polynomial evaluation (Evaluation) and so on.

4/ Circuit Synthesize - circuit synthesis. The calculation of this part is related to the scale/complexity of the circuit.

Generally speaking, the amount of calculation in the Circuit Synthesize part is more judgment and loop logic, and the degree of parallelism is relatively low, which is more suitable for CPU calculation. Generally speaking, zero-knowledge proof acceleration generally refers to the calculation acceleration of the first three parts. Among them, the calculation amount of MSM is relatively the largest, followed by NTT.

What's MSM

MSM (Multiple Scalar Multiplication) refers to a given series of points and scalars on the elliptic curve, and calculates the points corresponding to the results of adding these points.

For example, given a series of points on an elliptic curve:

Given a fixed set of Elliptic curve points from one specified curve:

[G_ 1, G_ 2, G_ 3, ..., G_n]

and random coefficients:

and a randomly sampled finite field elements from specified scalar field:

[s_ 1, s_ 2, s_ 3, ..., s_n]

MSM is the calculation to get the Elliptic curve point Q:

Q = \sum_{i= 1 }^{n}s_i*G_i

The industry generally adopts the Pippenger algorithm to optimize the MSM calculation. Take a closer look at the schematic diagram of the process of Pippenger's algorithm:

The calculation process of the Pippenger algorithm is divided into two steps:

1/ Scalar split into Windows. If Scalar is 256 bits, and a Window is 8 bits, then all Scalars are divided into 256/8 = 32 Windows. Each layer of Window uses a "Buckets" to temporarily store intermediate results. GW_x is the point of accumulation result on one layer. Calculating GW_x is also relatively simple. It traverses each Scalar in one layer in turn, and adds the corresponding G_x to the bit of the corresponding Buckets according to the value of the Scalar layer as the Index. In fact, the principle is relatively simple. If the coefficients of the addition of two points are the same, add the two points first and then add the Scalar again, instead of adding the two points twice before adding the Scalar.

2/ The points calculated by each Window are accumulated by double-add to get the final result.

Pippenger's algorithm also has many deformation optimization algorithms. In any case, the underlying calculation of the MSM algorithm is the addition of points on the elliptic curve. Different optimization algorithms correspond to different numbers of points added.

Elliptic Curve Point Add (Point Add)

You can look at various algorithms for point addition on elliptic curves with "short Weierstrass" form from this site.

http://www.hyperelliptic.org/EFD/g 1 p/auto-shortw-jacobian-0.html#addition-madd-2007-bl

Assuming that the Projective coordinates of the two points are (x 1, y 1, z 1) and (x 2, y 2, z 2) respectively, the result of point addition (x 3, y 3) can be calculated by the following calculation formula , z 3).

The reason for giving the calculation process in detail is to show that the entire calculation process is mostly integer operations. The bit width of the integer depends on the parameters of the elliptic curve. Give the bit widths of some common elliptic curves:

BN 256 - 256 bits
BLS 12 _ 381 - 381 bits
BLS 12 _ 377 - 377 bits
Note in particular that these integer operations are operations on the modulo field. Modular addition/modulus subtraction is relatively simple, focus on the principle and implementation of modular multiplication.

Modular Multiplication

Given two values over a modulo field: x and y. The modular multiplication calculation refers to x*y mod p. Note that the bit width of these integers is the bit width of the elliptic curve. The classic algorithm for modular multiplication is Montgomery Multiplication. Before performing Montgomery multiplication, the multiplicand needs to be converted to Montgomery representation:

The formula for calculating Montgomery multiplication is as follows:

There are many Montgomery multiplication implementation algorithms: CIOS (Coarsely Integrated Operand Scanning), FIOS (Finely Integrated Operand Scanning), and FIPS (Finely Integrated Product Scanning) and so on. This article does not introduce the details of various algorithm implementations in depth, and interested readers can do their own research.

In order to compare the performance difference between FPGA and GPU, choose the most basic algorithm implementation method:

Simply put, the modular multiplication algorithm can be further divided into two calculations: large number multiplication and large number addition. After understanding the calculation logic of MSM, you can choose the performance of modular multiplication (Throughput) to compare the performance of FPGA and GPU.

observe and think

Under such FPGA design, it can be estimated that the entire VU 9 P can provide the throughput at BLS 12 _ 381 elliptic curve points. A point addition (add_mix method) requires about 12 modular multiplications. The system clock of FPGA is 450M.

Under the same modular multiplication/modular addition algorithm, using the same point-add algorithm, the point-plus Troughput of Nvidia 3090 (considering data transmission factors) exceeds 500 M/s. Of course, the whole calculation involves a variety of algorithms, some algorithms may be suitable for FPGA, and some algorithms are suitable for GPU. The reason for using the same algorithm comparison is to compare the core computing capabilities of FPGA and GPU.

Summarize

More and more applications are adopting zero-knowledge proof technology. However, there are many ZKP algorithms, and various projects use different ZKP algorithms. From our hands-on engineering experience, FPGA is an option, but GPU is currently a cost-effective option. FPGA prefers deterministic computing, which has the advantages of latency and power consumption. GPU has high programmability, has a relatively mature high-performance computing framework, and has a short development iteration cycle, and prefers throughput scenarios.

ZKP

technology

Welcome to Join Odaily Official Community