ZONFF Research: What are we talking about when we talk about Web3 data?
Zonff Partners
Invited columnist
2022-10-08 13:30
A discussion of the whole data life cycle: generation, collection, storage, management, and use

Original author: Lewis Liao, Zonff Partners

What are we talking about when we talk about Web3 data? To answer that, we first need to understand what data looks like in Web2. This article walks through the whole data life cycle: generation, collection, storage, management, and use. Before that, we clarify how data is defined.

In the "Network Security Standard Practice Guidelines - Data Classification and Grading Guidelines" (Draft for Comment - v1.0 - 202109) issued by the National Information Security Standardization Technical Committee of China, data is classified into personal information, public data, and legal person data.

The specific definitions and examples are as follows:

image

1.1 Data generation, collection and storage

Public data, personal information, and legal person data are mostly generated when we use computer applications in daily life. Of these, personal information and legal person data are the most closely related to ordinary users.

image

Image credit: Zonff Partners

The underlying database stores the data passed in from the back end, which is generated as users interact with the front end. Broadly speaking, this is user data.

As far as mobile applications are concerned, data can be roughly divided into the following categories:

  • User information: information about the user that is recorded while they use the application service, including identity, device, network, geographic location, and even the list of applications installed on the mobile device. It is collected through server-side data tables and event tracking;

  • Content data: data that users generate while using the application service, covering any non-personal content that users actively write in the application. It is part of the application service and is generally collected directly through server-side data tables;

  • Behavioral data: data generated by the user's interactions while using the application, reflecting usage habits such as viewing time, click rate, penetration rate, and scrolling. It is generally collected through event tracking;

  • Log data: data generated by the application itself while the user is using it, including application crash logs;

  • Code data: non-user-interaction data, including front-end and back-end code. Like user data, it is stored on a centralized server somewhere.

In this classification, user information is personal information, while log data and code data are legal person data. Content data and behavioral data are the debatable cases: in the Web2 era, centralized entities mostly treat them as their own business data, that is, as legal person data.
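The classification described above can be written down as a small lookup table. This is only an illustrative sketch: the class and function names are made up for this example, and the mapping simply restates how the article says Web2 treats each category.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    PERSONAL_INFORMATION = "personal information"
    LEGAL_PERSON_DATA = "legal person data"


@dataclass(frozen=True)
class DataCategory:
    name: str
    web2_classification: Classification


# How Web2 applications treat each category in practice, per the text above.
# Content and behavioral data are the debatable cases.
WEB2_CATEGORIES = [
    DataCategory("user information", Classification.PERSONAL_INFORMATION),
    DataCategory("content data", Classification.LEGAL_PERSON_DATA),
    DataCategory("behavioral data", Classification.LEGAL_PERSON_DATA),
    DataCategory("log data", Classification.LEGAL_PERSON_DATA),
    DataCategory("code data", Classification.LEGAL_PERSON_DATA),
]


def categories_classified_as(classification):
    """Return the names of all categories that fall under one classification."""
    return [c.name for c in WEB2_CATEGORIES if c.web2_classification is classification]


print(categories_classified_as(Classification.PERSONAL_INFORMATION))
```

Seen this way, only one of the five categories is treated as belonging to the user in Web2, which is the imbalance the rest of the article discusses.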

image

Image credit: Preethi Kasireddy

Compared with Web2 applications, the user terminal and the front end are almost unchanged; the difference lies in the back end and the database. Through the front end, users interact with node providers (rather than a centralized server) and call contract code deployed on blockchains such as Ethereum (rather than a back-end environment on a server). The same types of data are generated in this process. Because of the different technical architecture, however, the data generated by Web3 is not necessarily stored on a centralized server, and the storage method varies with how the data was generated.

All data generated by smart contract interactions, including asset information, transaction data, and contract code, is published on the blockchain and can be accessed by anyone, so it becomes a public good. In theory, as long as the block space is large enough, any data can be stored on the blockchain, and some projects are even trying to use the blockchain as a database.

image

At the current stage, apart from the three types of data above, most of the data generated by a Web3 application is still stored on centralized servers, including front-end code, user information, content data, behavioral data, and log data. This is because the relevant storage infrastructure is still immature: project teams are either constrained by technical problems or have chosen a centralized solution for reasons such as access speed. As infrastructure develops, more and more powerful storage options have appeared, such as IPFS, Storj, Filecoin, and Ceramic, and more applications have begun to deploy themselves on decentralized storage. Examples include hosting a front-end website on IPFS and accessing it through ENS to build a decentralized front end, or using Arweave to permanently store file data such as the images belonging to an NFT project.

In general, when building a Web3 application, developers usually have three options for storing application data:

  • Store everything on the blockchain. This option is very expensive and forces the application to be as simple as possible, and the data is completely public. The advantage is that it protects application sovereignty most directly;

  • Store smart contract logic on the blockchain and everything else on a traditional back end. This approach sacrifices user sovereignty and carries centralization risk. It is what most Web3 applications currently do;

  • Store smart contract logic on the blockchain and everything else on decentralized storage such as IPFS, Arweave, and Ceramic, managing and updating the data through smart contracts. For now this approach is expensive (Ceramic is currently free) and slow, but it protects the sovereignty of the application.
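To make the third option more concrete, here is a minimal sketch in plain Python of the idea of a contract that only keeps pointers to off-chain content. It is a toy model, not contract code: `PointerRegistry`, the owner address, and the content identifiers are all hypothetical names chosen for illustration.

```python
class PointerRegistry:
    """Toy stand-in for a smart contract that maps a key to an off-chain content identifier."""

    def __init__(self, owner):
        self.owner = owner
        self._pointers = {}
        self.history = []  # append-only log of updates, loosely analogous to on-chain events

    def set_pointer(self, caller, key, content_id):
        # Only the owner may change where a key points, mimicking contract-level access control.
        if caller != self.owner:
            raise PermissionError("only the owner may update pointers")
        self._pointers[key] = content_id
        self.history.append((key, content_id))

    def get_pointer(self, key):
        return self._pointers[key]


registry = PointerRegistry(owner="0xOwner")
registry.set_pointer("0xOwner", "frontend", "content-id-v1")
registry.set_pointer("0xOwner", "frontend", "content-id-v2")  # publish a new front-end build
print(registry.get_pointer("frontend"))
```

The bulk data stays in decentralized storage, while the small and expensive on-chain part records only which version is current and who is allowed to change it.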

1.2 Trend: Decentralized Storage - Data and Application Sovereignty

The three ways of building Web3 applications share one keyword: sovereignty. The term is unavoidable when we talk about the characteristics of Web3. Generally speaking, it covers data sovereignty and application sovereignty. So does sovereignty matter? That is another topic and is not discussed in this article. If you are interested, you can read related articles such as "Web3 Data Market Outlook" and "Web3 - Let the 'right to data' awake". Here I want to approach the path toward Web3 sovereignty from the perspective of data, and infer the direction and focus of infrastructure development.

Data sovereignty covers digital asset sovereignty and user data sovereignty. The article "Vertical Liquidity: How Values Are Interconnected" mentioned that tokens can define users' digital asset sovereignty (identity, relationship, and property rights), which is determined by a broad consensus that is difficult to tamper with. At the most basic level, these rights can be defined by the blockchain itself, for example which address a token belongs to. However, once ownership of more complex digital goods is involved, many problems arise. A typical one is the storage of the pictures (or articles, etc.) that NFTs correspond to, a problem already discussed in "NFT: A Revolution in Digital Ownership". For most NFTs today, the corresponding digital goods are stored on a centralized server somewhere. If the server crashes or is hacked, all the user has left is a string of hashes on the chain. The real "item" behind the hash can be stolen or replaced at any time and become worthless.

User data sovereignty, one of the clearest dividing lines between Web2 and Web3, is a banner for Web3's innovation and progress. Ceramic envisions a data universe: a composable, web-scale data ecosystem owned by everyone and exclusive to no one. User data follows the user from one application to another, and the user sits at the center, in control of their own digital universe. At present almost no application achieves this. CyberConnect has made a good attempt by creating a decentralized social graph protocol that aims to make users' social relationship data interoperable between applications. For now, however, the application does not guarantee users' data sovereignty. Although the team has begun migrating to Ceramic, the work is still in progress.

As for application sovereignty, some people call sovereign applications "hyperstructures". They are unstoppable, free, valuable, expansive, permissionless, positive-sum, and credibly neutral, and together they provide public goods for the digital world and build the infrastructure of the "metaverse" (if you believe in it). At present, most so-called Web3 applications do not have a high degree of application sovereignty. They are not real public goods, and they can easily be sanctioned and changed by those in power. The Tornado Cash incident illustrates this directly. One of the main reasons is that although the contract code of these application protocol layers is published on the blockchain, components such as front ends and domain names are still controlled by centralized third-party entities.

To achieve data sovereignty and application sovereignty, the way Web3 applications are built is very important, and the basic starting point is storage. Where does the data live, and how can it be stored so that users retain sovereignty? In general, there are different solutions depending on the type of data:

  • The user's asset information and transaction data should be public ledger data, and the priority is to ensure on-chain verifiability. That said, applications like Aztec that protect the privacy of users' on-chain transactions are also very valuable;

  • The user's user information, content data, and behavioral data are regarded as personal information, and it is very important that the user stays in control. With the user's consent, this data can be selectively disclosed as a public good to unlock positive externalities;

  • Log data and code data are legal person data, and keeping them private is acceptable and necessary. However, Web3 infrastructure applications such as "hyperstructures" should have the characteristics of public infrastructure: the storage of their application code should be open and should resist censorship beyond the platform level.

Most Web3 applications currently "store smart contract logic on the blockchain and everything else on a traditional back end" because there is not yet a decentralized infrastructure good enough to replace the original centralized solution.

First of all, decentralized storage systems such as IPFS, Filecoin, and Arweave are all static storage. They lack computing and state management capabilities and cannot implement more advanced database-like functions such as mutability, version control, access control, and programmable logic. Ceramic is dynamic storage and solves these problems to some extent, but its access speed is still relatively slow, its development kit is immature, and its degree of decentralization has long been criticized.

The main function of decentralized storage such as IPFS, Filecoin, and Arweave is to statically store unstructured data such as pictures, documents, and static code. Because this storage is difficult to tamper with, it guarantees the sovereignty of digital goods such as NFTs to a certain extent: once the hash recorded on the chain is linked to the decentralized storage address, it is difficult for external forces to interfere by extraordinary means. Front-end code hosted on such storage also makes application sovereignty more complete. However, because storage technology at the current stage is only storage, the lack of computing power leaves its functionality far behind centralized server solutions.
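The link between an on-chain hash and off-chain content can be illustrated with a simplified content-addressed store. This sketch assumes the identifier is just the hex SHA-256 digest of the bytes; real IPFS CIDs use a multihash-based encoding, and `ContentStore` and `matches_on_chain_record` are hypothetical names for this example only.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Simplified content identifier: the hex SHA-256 digest of the bytes."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """In-memory stand-in for a content-addressed storage network."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        cid = content_id(data)
        self._blobs[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        return self._blobs[cid]


def matches_on_chain_record(on_chain_id: str, retrieved: bytes) -> bool:
    """Check that retrieved bytes are exactly what the on-chain identifier commits to."""
    return content_id(retrieved) == on_chain_id


store = ContentStore()
artwork = b"pixels of an NFT image"
on_chain_id = store.put(artwork)  # this identifier is what would be recorded on chain

print(matches_on_chain_record(on_chain_id, store.get(on_chain_id)))
print(matches_on_chain_record(on_chain_id, b"a swapped-in replacement image"))
```

Because the address is derived from the content, replacing the file necessarily changes the address, which is why swapped content can be detected. It is also why this kind of storage is static: any update produces a new identifier.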

image

When: August 23, 2022

2.1 Data management

Building Web3 applications on decentralized storage makes them less susceptible to interference from external forces and breaks up monopoly and power. But storage alone is not enough. The storage environment also needs supporting technologies such as rendering and computation, data processing, permission configuration, and privacy protection to guarantee the sovereignty of applications and users' data, and so to realize the rise of personal sovereignty in the digital world. Access control and privacy protection in particular should be implemented with technical solutions that offer a high level of sovereignty. In Web2 applications, these kinds of data are stored on specific centralized servers according to their security protection level. Their security is guaranteed by network security, and their sovereignty is guaranteed by platforms (such as enterprise platforms and government platforms). In this data management model, users are subject to super administrators and have no rights over the data itself. Data security also depends on the centralized entity behind the super administrator. For example, in a recent public security data leak in one region, a super administrator leaked his private key, exposing the private personal information of hundreds of millions of people.

The data management of Web3 should have the following two characteristics:

  • Data sovereignty protection. This should go beyond the platform level or even the world level, protecting the common rights of users in the digital world through world-level consensus. In the traditional world this protection exists only at the platform level, and the rules do not come from consensus. A platform-level company controls all the rules and systems and can change them at any time, so it can violate users' personal sovereignty at any time;

  • Data privacy guarantee. User data privacy is guaranteed mathematically through cryptography rather than through database network security. User-controlled selective encryption is one of the basic rights of user data sovereignty.

How to manage Web3 data depends on how the data is stored.

image

IPFS and Filecoin are content-centric: stored content is accessed through its Content ID (CID), and third-party applications are built on top of this for data management. For example, ChainSafe Files can solve the single sign-on problem in a localized manner and lets data be conveniently encrypted and stored using asymmetric encryption. The content-centric model makes user management difficult, and assigning ownership to data becomes more complicated. Beyond providing storage, Filecoin's ecosystem is far more extensible than other base layers. Especially after the launch of FVM, specialized tools may appear for vertical areas of data storage and retrieval, helping users and enterprises manage their data better, ensure data security, and develop new applications.

Ceramic is also based on IPFS but is user-centric. Based on the IDX Protocol and the 3ID DID method (CIP-79), it builds a Ceramic-native account system that can be used to authenticate with Ceramic. Users can use a blockchain wallet to control a 3ID DID, execute transactions on data streams, and manage their own data. This is achieved by associating DIDs with data and storing them in a data model. The data model defines the format (schema) of user data, and all applications using the same data model share that data format.

Arweave is a decentralized storage project for on-chain data, with one-time payment and permanent storage. The data is stored openly and transparently on the chain, anyone can access it, and it can be browsed through the Arweave blockchain explorer. Data management in this mode is exactly the same as managing on-chain data: there is no access control and no "hot update" of the original data, and every time the data is updated the index address changes. There is no problem with IPFS and Filecoin. Its advantage is that it is very clear which user the data belongs to, which helps with tracing data rights.

2.2 Trend: Decentralized Data Market

Ceramic's Data Model Marketplace

In describing their data universe, Ceramic says they want to create an open data model market, because data needs to be interoperable, and interoperability can greatly improve productivity. Such a market is realized through emergent consensus on data models, similar to the ERC contract standards in Ethereum: developers can pick a data model as a template so that their application works with all the data conforming to it. For now, this market is not a trading market.

As a simple example, the data model of a decentralized social network can be reduced to four models:

  • PostList: stores an index of the user's posts

  • Post: stores a single post

  • Profile: stores user information

  • FollowList: stores the user's follow list
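As a rough illustration of how these four models fit together, the sketch below expresses them as Python dataclasses. The field names (`did`, `post_ids`, `following`, and so on) are assumptions made for this example and are not Ceramic's actual schemas; the point is only that any application agreeing on the same shapes can read data written by another.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    did: str   # the user's decentralized identifier
    name: str


@dataclass
class Post:
    post_id: str
    author_did: str
    body: str


@dataclass
class PostList:
    owner_did: str
    post_ids: list = field(default_factory=list)   # index of the user's posts


@dataclass
class FollowList:
    owner_did: str
    following: list = field(default_factory=list)  # DIDs the user follows


def publish(post, post_list, posts):
    """'Application A': store a post and add it to the author's index."""
    posts[post.post_id] = post
    post_list.post_ids.append(post.post_id)


def feed_for(follow_list, post_lists, posts):
    """'Application B': build a feed from the same records, in index order."""
    feed = []
    for did in follow_list.following:
        for post_id in post_lists[did].post_ids:
            feed.append(posts[post_id].body)
    return feed


posts = {}
post_lists = {"did:example:alice": PostList("did:example:alice")}
publish(Post("p1", "did:example:alice", "hello"), post_lists["did:example:alice"], posts)
publish(Post("p2", "did:example:alice", "second post"), post_lists["did:example:alice"], posts)

bob_follows = FollowList("did:example:bob", ["did:example:alice"])
print(feed_for(bob_follows, post_lists, posts))
```

The writer and the reader share nothing except the data model, which is what makes the data portable between applications.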

So how can data models be created, shared and reused on Ceramic to enable data interoperability across applications?

Ceramic provides a DataModels Registry, an open source, community-built repository of reusable application data models for Ceramic. Here, developers can openly register, discover, and reuse existing data models, which are the basis for interoperable applications built on shared data models. Currently the registry is stored on GitHub, and in the future it will be distributed on Ceramic.

All data models added to the registry are automatically published under the @datamodels npm package. Any developer can use @datamodels/model-name to install one or more data models, making them available for storing or retrieving data at runtime using any IDX client, including DID DataStore or Self.ID.

Ocean's data trading market

image

Image credit: Ocean Protocol

image

Source: Ocean Protocol

3.1 Data usage and stack

Based on the above, we propose the Web3 data stack shown in the figure below:

  • The bottom layer is where data sources are stored, including decentralized storage, on-chain data, and off-chain data;

  • The second layer is the management applications for this data, including databases, data tables, indexing middleware, and data markets;

image

Image credit: Zonff Partners

At present, most of the Web3 data used in the industry is on-chain data, and data analysis and indexing tools are emerging one after another. The huge gold mine of on-chain data has been extensively tapped. Most of the data table and analysis applications in the figure above mine on-chain data, and only a small part involve off-chain data. In general, data usage follows an ETLA (Extract, Transform, Load, Analysis) process, and each stage has a representative project. The representative of Extract is The Graph; the representatives of Transform (into usable data tables) and Load are Dune and Luabase; and the representatives of Analysis are Nansen and NFTGo.
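The ETLA flow can be sketched end to end on a few made-up records. Everything here is hypothetical: the events stand in for what an indexer might extract, the field names are invented, and a plain list plays the role of the data table. None of it reflects the actual APIs of The Graph, Dune, or Nansen.

```python
from collections import defaultdict

# Hypothetical raw transfer events, standing in for data pulled from a chain by an indexer.
RAW_EVENTS = [
    {"block": 1, "from": "0xA", "to": "0xB", "value_wei": "1000000000000000000"},
    {"block": 2, "from": "0xB", "to": "0xC", "value_wei": "250000000000000000"},
    {"block": 3, "from": "0xA", "to": "0xC", "value_wei": "500000000000000000"},
]


def extract(events, from_block=0):
    """Extract: select the raw events of interest."""
    return [e for e in events if e["block"] >= from_block]


def transform(events):
    """Transform: normalize raw events into flat rows with numeric amounts."""
    return [
        {
            "block": e["block"],
            "sender": e["from"],
            "receiver": e["to"],
            "amount": int(e["value_wei"]) / 10**18,
        }
        for e in events
    ]


def load(rows, table):
    """Load: append rows to a queryable table (a plain list here)."""
    table.extend(rows)
    return table


def analyze(table):
    """Analysis: total amount received per address."""
    received = defaultdict(float)
    for row in table:
        received[row["receiver"]] += row["amount"]
    return dict(received)


table = load(transform(extract(RAW_EVENTS)), [])
print(analyze(table))
```

Each stage only depends on the shape of the previous stage's output, which is why the transform step is the hard part when source data is heterogeneous.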

For decentralized storage, projects supporting the full ETLA process are almost nonexistent, and there are only a few extraction projects. There are huge opportunities and challenges here. The Graph and Ceramic communities are working on extracting data from Ceramic, and the founder of Orbis has also tried to build Cerscan, a browser for data on Ceramic. Data stored on Arweave can already be read and managed with subgraphs through The Graph, and related third-party projects on Filecoin are doing the same. However, almost no one is working on the TLA stages at present. The biggest reason is that data stored on different decentralized storage systems is highly heterogeneous, and it is difficult to have a unified model for mining its value. Ceramic is the most likely to take this step, because its data models greatly reduce the heterogeneity of data on Ceramic and so make the data more usable.

In addition to the data on the chain, there are many projects trying to connect the data on the chain with the data off the chain. Such projects can be regarded as "chain reform" projects.

They fall into the following types:

  • Web2 data sovereignty grant and trading markets: Itheum, Navigate, Swash, Phyllo, etc. These projects mainly combine traditional Internet data with on-chain data, hoping to open up information exchange between Web2 and Web3. The common practice is to export Web2 data and import it into a designated data pool, or to bind traditional Internet social accounts directly;

  • Enterprise data consensus: Authtrail. The project integrates with an enterprise's internal database and adds a consensus layer to make data within the enterprise tamper-proof and traceable;

  • On-chain and off-chain data combination: Space and Time. Like Authtrail, this project integrates off-chain databases, but without a consensus layer. It is more of a joint computation over off-chain and on-chain data. Pool is also doing something similar.

The paradigm for using Web3 data is clearly different from that of Web2, mainly in how data is brought together: storage, indexing, extraction, integration, and utilization all differ by data type. Following the earlier classification, here is a brief summary:

Public data: includes the public data and some of the legal person data classified in the "Network Security Standard Practice Guidelines - Data Classification and Grading Guidelines". As a public good, its value can be mined openly. Access does not require permission, but user ownership can be traced, for example to trace airdrop gains. Typical examples are on-chain data and unencrypted application data stored on decentralized storage (such as user posts, likes, and comments). The most important upstream support for using it is indexing applications such as The Graph, or Web3-native database applications such as Tableland.

Private data: includes the personal information and some of the legal person data classified in the "Network Security Standard Practice Guidelines - Data Classification and Grading Guidelines". This type of data requires encrypted storage and privacy permission configuration. Access is permissioned, and the data cannot be obtained publicly. If it is stored on decentralized storage or a blockchain, it needs encrypted storage with configurable permissions, or protection through other privacy technologies such as ZK, MPC, and TEE. The most important upstream support for using it is database applications such as Kwil and Ceramic.
