Original title: "Using On-Chain Data for Policy Research: Part 1"
Original compilation: Kxp, BlockBeats
I. Introduction
Real, detailed data is rarely used in formulating crypto policy, mainly for three reasons:
1. Most policy work on emerging technologies is still at the stage of theory and qualitative analysis, where data is rarely used.
2. Although on-chain data is open and transparent, accessing it takes a lot of up-front work (that is, extracting the raw data directly from the blockchain), even for crypto-native practitioners.
3. A small number of data products are available from blockchain "forensics" companies and data vendors, but none of them are flexible or customizable enough to meet the needs of economic and financial researchers.
Many researchers in modern economics and finance miss the opportunity to apply their tools to crypto data. By design, crypto can provide granular data to anyone, yet most policy work still relies on external, pre-aggregated time series from sources such as CoinMarketCap instead of going directly to the source.
It is as if policymakers could look up the balance sheet of every major U.S. bank and watch consumer deposits change second by second: stablecoin issuance across the Ethereum ecosystem can be observed just as easily. Even so, most stablecoin policy papers analyze hypothetical events instead of this data.
In this article, I will explain the following points in detail, hoping to be helpful to policy researchers who want to use on-chain data:
How to obtain data on the chain
The structure adopted by the data on the chain
Several basic tools for extracting and using on-chain data
In a subsequent article, I will explore how the data collected here can be used to gauge the direction of the crypto market. In the meantime, I will post free-to-use data and code at the end of this article. By showing how blockchain data can be queried, I hope to illustrate the new possibilities that crypto's openness creates for data-driven decision-making.
II. On-chain data acquisition method
In this article, data collection focuses on one blockchain (Ethereum) and a specific subset of projects, primarily USD-denominated, fiat-backed stablecoins, including USDC, Tether, Binance USD, Pax Dollar and Gemini Dollar. The approach applies broadly to on-chain data, so it carries over if you want to build a different dataset.
Block explorers like Etherscan are great for looking at snapshots of transactions and gathering information about specific smart contracts, but in my experience they are less useful for generating large datasets. When collecting and processing raw data, you basically have two options: (1) run a full node locally, or (2) query a database to which raw data has been written directly from the chain. The first option demands considerable technical skill and computing resources, while the second requires only basic SQL and Python skills, so we will use the second option here.
III. On-chain data structure
Before looking at how the data is structured, you first need to be clear about what the data is for. For this test case, I decided to build a large time-series dataset for the main fiat-backed stablecoins and observe three specific behaviors: minting (issuing stablecoins), burning (removing stablecoins from circulation) and transfers. I chose this focus because policymakers and academics currently pay the most attention to fiat-backed stablecoins, so the data could be quite useful in the short term.
Several major USD-denominated stablecoins have adopted the ERC-20 token standard. As the name suggests, ERC-20 is a standardized way to create tokens using smart contracts on Ethereum. If you think of the blockchain as a giant decentralized Excel sheet, then a smart contract is similar to an Excel function: given input parameters, it applies its built-in logic to produce a specific output (for example, the MAX function returns the largest of its input values).
We can locate smart contracts using their Ethereum addresses, which serve as unique identifiers in the blockchain data structure.
Similar to APIs, smart contracts are reusable programs. Every time a smart contract receives an interaction instruction, a record of the interaction is generated and recorded on the blockchain by the Ethereum protocol in the form of a log, and these logs constitute a reliable source of information about smart contract activities.
When a smart contract performs a specific function, such as burning an ERC-20 Stablecoin to remove it from circulation, that function and its parameters are recorded on the blockchain as a transaction log.
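The relationship between a function call and its log can be sketched with a toy model. This is plain Python, not Solidity and not the real USDC contract; the class `ToyToken`, its method names, and the holder name `"issuer"` are all hypothetical and exist only to show that every state-changing call leaves behind a record carrying the contract address, the name of what happened, and its parameters.

```python
from dataclasses import dataclass, field


@dataclass
class ToyToken:
    """A toy stand-in for a token contract (illustration only)."""
    address: str
    balances: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)

    def _emit(self, name, **params):
        # Append a record of what happened, as the protocol does with event logs.
        self.logs.append({"address": self.address, "name": name, **params})

    def mint(self, to, amount):
        self.balances[to] = self.balances.get(to, 0) + amount
        self._emit("Mint", to=to, amount=amount)

    def burn(self, burner, amount):
        if self.balances.get(burner, 0) < amount:
            raise ValueError("insufficient balance")
        self.balances[burner] -= amount
        self._emit("Burn", burner=burner, amount=amount)


# Amounts are in the token's smallest unit (USDC uses 6 decimal places).
token = ToyToken(address="0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48")
token.mint("issuer", 2_000_000_000)
token.burn("issuer", 1_056_920_000)
for entry in token.logs:
    print(entry)
```

Reading the `logs` list alone is enough to reconstruct what the contract did, which is why the log tables are the starting point for the dataset built in this article.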
In the transaction below, Circle, the issuer of the USDC Stablecoin, burned $1056.92 worth of USDC.
If you switch to the "Logs" tab, you can view the transaction's event log. The relevant fields are:
Address: The contract address of the smart contract. The contract address of the USDC stablecoin is 0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48.
Name: The function executed by the smart contract, and the parameters of that function. Here, the smart contract is calling the burn function, which takes parameters specifying where the burned coins come from (the burner, which must be an Ethereum address) and the amount burned (which must be an unsigned integer of up to 256 bits).
The Etherscan output also shows the topics and data fields, which contain most of the information we need to parse when analyzing transactions.
Topic0 is the hash of the function signature (strictly speaking, the signature of the event that the function emits). Essentially, the name and its argument types are passed through a one-way algorithm to obtain a unique hash. Ethereum uses the Keccak-256 hash function, and a given signature always produces the same Keccak-256 hash, so whenever that hash appears in the logs you can be sure the same function was called.
Topic1 is an indexed parameter of the burn function. Here, Topic1 is the address from which the burned tokens were sent. (Note: if the burn function had more indexed parameters, they would appear as additional topics.)
The data field here contains the number of tokens burned.
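The three fields above can be sketched in Python. The block below is a minimal, unoptimized Keccak-256 written with the standard library only, plus a small decoder for a burn log. Several things are assumptions made for illustration: the event signature `Burn(address,uint256)`, the helper names, and the burner address, which is a made-up placeholder. Note that Python's `hashlib.sha3_256` is the standardized SHA-3 variant and does not give the same result as Ethereum's Keccak-256; in practice you would more likely use an existing Ethereum library than a hand-written hash.

```python
from decimal import Decimal

MASK64 = (1 << 64) - 1


def _rol64(value, shift):
    """Rotate a 64-bit integer left by `shift` bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def _keccak_f1600(lanes):
    """Apply the 24-round Keccak-f[1600] permutation to a 5x5 grid of 64-bit lanes."""
    lfsr = 1
    for _ in range(24):
        # theta: mix column parities into neighbouring columns
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4] for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each lane and move it to a new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], _rol64(current, (t + 1) * (t + 2) // 2)
        # chi: non-linear step along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: XOR in the round constant, generated by a small LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def keccak256(data: bytes) -> bytes:
    """Keccak-256 as used by Ethereum (original Keccak padding)."""
    rate = 136  # bytes absorbed per block for a 256-bit digest
    padded = bytearray(data) + b"\x01" + b"\x00" * (rate - len(data) % rate - 1)
    padded[-1] |= 0x80
    state = bytearray(200)
    for offset in range(0, len(padded), rate):
        for i in range(rate):
            state[i] ^= padded[offset + i]
        lanes = [[int.from_bytes(state[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
                  for y in range(5)] for x in range(5)]
        lanes = _keccak_f1600(lanes)
        for x in range(5):
            for y in range(5):
                state[8 * (x + 5 * y):8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return bytes(state[:32])


def event_topic0(signature: str) -> str:
    """Topic0 is the Keccak-256 hash of the signature string."""
    return "0x" + keccak256(signature.encode("ascii")).hex()


def decode_burn_log(topics, data, decimals=6):
    """Return (burner address, amount in whole tokens) from a burn log."""
    burner = "0x" + topics[1][-40:]  # an address is the last 20 bytes of the 32-byte topic
    raw_amount = int(data, 16)       # data holds a single unsigned 256-bit integer
    return burner, Decimal(raw_amount) / Decimal(10 ** decimals)


# Hypothetical log, built by hand for illustration (not copied from the chain).
hypothetical_burner = "0x" + "ab" * 20
topics = [event_topic0("Burn(address,uint256)"), "0x" + "00" * 12 + hypothetical_burner[2:]]
data = "0x" + format(1_056_920_000, "064x")  # 1056.92 tokens at 6 decimal places

burner, amount = decode_burn_log(topics, data)
print(burner, amount)
```

The amount is divided by 10^6 because USDC is denominated with six decimal places; other tokens may use a different number, so `decimals` is left as a parameter.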
IV. Basic tools to extract and process on-chain data
As mentioned, in this example I chose to pull on-chain data from an existing database rather than accessing live nodes on the Ethereum network. For ease of understanding, I extracted a large number of raw data tables from GCP using SQL, and then cleaned them in Python using the pandas library.
To pull the tables from GCP, we will use BigQuery, which stores many Ethereum data tables, as shown in the left column of the image below. When you click on a table, its schema appears, as with the ethereum.logs table in the image below. The log table records the address, data and topics fields described in the previous section.
The query in the image below will be used to extract all records in the log table that involve interactions with USDC, Tether USD, Binance USD, Pax Dollar or Gemini Dollar contracts. In addition to the information in ethereum.logs, some additional information is useful, so I also merged the data in the ethereum.block table, which covers gas fees and other information.
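A query along these lines might be assembled as in the sketch below. It is not the author's original query. It assumes BigQuery's public dataset `bigquery-public-data.crypto_ethereum`, with tables named `logs` and `blocks`; check the table and column names against the schema you actually see. Only the USDC address given in this article is filled in, and the other four contract addresses would need to be added to the list. The helper names `build_logs_query` and `run_query` are hypothetical.

```python
import re

USDC = "0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48"  # address quoted in this article
ADDRESS_RE = re.compile(r"0x[0-9a-fA-F]{40}")


def build_logs_query(addresses, dataset="bigquery-public-data.crypto_ethereum"):
    """Build SQL that selects logs for the given contracts, joined to block data."""
    for address in addresses:
        if not ADDRESS_RE.fullmatch(address):
            raise ValueError(f"not an Ethereum address: {address!r}")
    address_list = ", ".join(f"'{a.lower()}'" for a in addresses)
    return f"""
SELECT
  logs.log_index, logs.transaction_hash, logs.transaction_index,
  logs.address, logs.data, logs.topics,
  logs.block_timestamp, logs.block_number, logs.block_hash,
  blocks.number, blocks.miner, blocks.size,
  blocks.gas_limit, blocks.gas_used, blocks.base_fee_per_gas
FROM `{dataset}.logs` AS logs
JOIN `{dataset}.blocks` AS blocks
  ON logs.block_number = blocks.number
WHERE logs.address IN ({address_list})
"""


def run_query(sql):
    """Run the query and return a pandas DataFrame.

    Needs the google-cloud-bigquery package and GCP credentials,
    so it is defined here but not called.
    """
    from google.cloud import bigquery
    return bigquery.Client().query(sql).to_dataframe()


sql = build_logs_query([USDC])
print(sql)
```

Filtering on `logs.address` keeps the scan limited to the contracts of interest, and the join on block number attaches the block-level fields (miner, size, gas figures) to each log row.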
The resulting table can be read directly by Python and subdivided into the following fields with the help of a pandas dataframe:
log_index
transaction_hash
transaction_index
address
data
topics
block_timestamp
block_number
block_hash
number
miner
size
gas_limit
gas_used
base_fee_per_gas
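As a rough sketch of the cleaning step, the block below turns the topics and data columns into a tidy table of mints, burns and transfers. It makes one important assumption: that a mint appears as an ERC-20 Transfer event from the zero address and a burn as a Transfer to the zero address. Many tokens follow this convention, but not all do, and some emit dedicated mint or burn events instead, so each contract should be checked. The rows, the two holder addresses and the helper names are made up for illustration and are not real chain data.

```python
import pandas as pd

# Topic0 of the standard ERC-20 event Transfer(address,address,uint256).
TRANSFER_TOPIC0 = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"
ZERO_ADDRESS = "0x" + "00" * 20
DECIMALS = 6  # USDC uses 6 decimal places; other tokens may differ


def as_topic(address):
    """Left-pad a 20-byte address to the 32-byte form used in topics."""
    return "0x" + "00" * 12 + address[2:]


def as_data(raw_amount):
    """Encode an integer amount as a 32-byte hex string."""
    return "0x" + format(raw_amount, "064x")


def classify(row):
    if row["from_address"] == ZERO_ADDRESS:
        return "mint"
    if row["to_address"] == ZERO_ADDRESS:
        return "burn"
    return "transfer"


def tidy_transfers(logs):
    """Keep Transfer logs and decode sender, receiver, amount and event kind."""
    is_transfer = logs["topics"].apply(lambda t: len(t) == 3 and t[0] == TRANSFER_TOPIC0)
    out = logs[is_transfer].copy()
    out["from_address"] = out["topics"].apply(lambda t: "0x" + t[1][-40:])
    out["to_address"] = out["topics"].apply(lambda t: "0x" + t[2][-40:])
    out["amount"] = out["data"].apply(lambda d: int(d, 16) / 10**DECIMALS)
    out["kind"] = out.apply(classify, axis=1)
    out["day"] = out["block_timestamp"].dt.date
    return out


# Made-up rows standing in for the query result.
HOLDER_A, HOLDER_B = "0x" + "aa" * 20, "0x" + "bb" * 20
logs = pd.DataFrame({
    "block_timestamp": pd.to_datetime(
        ["2023-01-01 10:00", "2023-01-01 12:00", "2023-01-02 09:00"], utc=True),
    "topics": [
        [TRANSFER_TOPIC0, as_topic(ZERO_ADDRESS), as_topic(HOLDER_A)],
        [TRANSFER_TOPIC0, as_topic(HOLDER_A), as_topic(HOLDER_B)],
        [TRANSFER_TOPIC0, as_topic(HOLDER_A), as_topic(ZERO_ADDRESS)],
    ],
    "data": [as_data(5_000_000), as_data(2_000_000), as_data(1_000_000)],
})

tidy = tidy_transfers(logs)
daily = tidy.groupby(["day", "kind"])["amount"].sum()
print(daily)
```

Grouping by day and event kind gives the kind of time series described earlier: daily totals of minting, burning and transfers for each token.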
V. Conclusion
This article uses Ethereum's log data, and the same method can also be used to access various data on the chain. Python and SQL are tools familiar to most economists and policymakers, and they can make a big difference. Compared with traditional finance, Crypto is more transparent. This allows researchers to use real-time data to shed light on how the financial system works and contain possible risks in a timely manner.
