From Isolation to Collaboration: The Significance of Web3 Native Data Pipeline
深潮TechFlow
Guest Columnist
2023-08-13 10:30
What opportunities can we capture from the native data flow system of Web3, and what challenges do we need to address in order to seize these opportunities?

Written by: Jay :: FP

Compiled by: 深潮TechFlow

The release of the Bitcoin white paper in 2008 sparked a rethinking of the concept of trust. Blockchain subsequently expanded this into the idea of trustless systems, and quickly evolved into the argument that values such as individual sovereignty, financial democratization, and ownership could be applied to existing systems. Of course, considerable verification and discussion may be required before blockchain sees practical use, since its features can appear somewhat radical compared with existing systems. However, if we are optimistic about these scenarios, building data pipelines and analyzing the valuable information stored on blockchains has the potential to become another important turning point for the industry, because it lets us observe Web3-native business intelligence that never existed before.

This article explores the potential of Web3 native data pipelines by projecting data pipelines commonly used in the existing IT market into a Web3 environment. The article discusses the benefits of these pipelines, the challenges that need to be addressed, and the impact of these pipelines on the industry.

1. Singularity comes from information innovation

Language is one of the most important differences between humans and the lower animals. It is not just the ability to pronounce words, but to associate definite sounds with definite thoughts and to use these sounds as symbols for the communication of ideas. — Darwin

Historically, major advances in human civilization have been accompanied by innovations in information sharing. Our ancestors used language, both spoken and written, to communicate with each other and to pass knowledge on to future generations, which gave them a major advantage over other species. The invention of writing, paper, and printing made it possible to share information more widely, leading to major advances in science, technology, and culture. In particular, the printing of the Gutenberg Bible with movable metal type was a watershed moment because it made it possible to mass-produce books and other printed materials. This had a profound impact on the Reformation, on democratic revolutions, and on the beginnings of scientific progress.

The rapid development of IT technology in the 2000s allowed us to gain a deeper understanding of human behavior. This has led to changes in lifestyle, and most people in modern times make various decisions based on digital information. It is for this reason that we refer to modern society as the IT Innovation Era.

Only 20 years after the Internet was fully commercialized, artificial intelligence technology once again amazed the world. Many applications have emerged that can replace human labor, and many people are discussing how AI will change civilization. Some are even in a state of denial, wondering how a technology capable of shaking the foundations of our society could emerge so quickly. Although Moore's Law indicates that the performance of semiconductors increases exponentially over time, the changes brought about by the emergence of GPT are too sudden to absorb all at once.

Interestingly, however, the GPT model itself is not a particularly groundbreaking architecture. Instead, the AI industry points to the following as the main success factors of GPT models: 1) defining business areas that can target large customer groups, and 2) tuning models through data pipelines, from data acquisition to final results and results-based feedback. In short, these applications enable innovation by refining the purpose of service delivery and upgrading the data/information processing process.

2. Data-driven decisions are everywhere

Most of what we call innovation is actually based on working with accumulated data rather than on chance or intuition. As the saying goes, in the capitalist market it is not the strong who survive, but the survivors who survive. Today's businesses face intense competition in saturated markets, so they collect and analyze every kind of data to capture even the smallest niche.

We may be too obsessed with Schumpeter's theory of creative destruction and place too much emphasis on making decisions based on intuition. However, even great intuition is ultimately the product of an individual's accumulated data and information. The digital world will penetrate deeper into our lives in the future, and more and more sensitive information will be presented in the form of digital data.

The Web3 market is getting a lot of attention for its potential to give users control over their data. However, the blockchain field, the underlying technology of Web3, is currently more focused on solving the trilemma (TechFlow note: the trade-off among security, decentralization, and scalability). For new technologies to be convincing in the real world, it is important to develop applications and intelligence that can be used in multiple ways. We have seen this happen in the world of big data, where methodologies for building big data processing and data pipelines have advanced significantly since around 2010. In the context of Web3, similar efforts must be made to move the industry forward and to build data flow systems that generate data-based intelligence.

3. Opportunities based on on-chain data flows

So, what opportunities can we capture from Web3 native data flow systems, and what challenges need to be solved to seize these opportunities?

3.1 Advantages

In short, the value of configuring Web3 native data flows is to safely and efficiently distribute reliable data to multiple entities so that valuable insights can be extracted.

  • Data redundancy - on-chain data is less likely to be lost and more resilient because the protocol network stores data fragments on multiple nodes.

  • Data security - On-chain data is tamper-proof because it is verified and agreed upon through consensus by a decentralized network of nodes.

  • Data Sovereignty – Data sovereignty is the right of users to own and control their own data. With on-chain data streaming, users can see how their data is being used and choose to share it only with those who have a legitimate need for access.

  • Permissionless and transparent - on-chain data is transparent and tamper-proof. This ensures that the data being processed is also a reliable source of information.

  • Stable operations – When data flows are orchestrated by protocols in a distributed environment, each layer’s exposure to downtime is significantly reduced because there is no single point of failure.

3.2 Application Cases

Trust is the basis for different entities to interact with each other and make decisions. Therefore, when reliable data can be safely distributed, it means that many interactions and decisions can be made through Web3 services in which various entities participate. This helps maximize social capital, and we can imagine several use cases below.

3.2.1 Service/protocol application

  • Rules-based automated decision systems - Protocols use key parameters to run services. These parameters are adjusted regularly to stabilize the service and give users the best experience. However, a protocol cannot always monitor service status and change parameters dynamically in a timely manner. This is where on-chain data flows come in: they can be used to analyze service status in real time and suggest the parameter set that best matches service requirements (e.g., applying an automatic floating-rate mechanism to a lending protocol; a minimal rate-model sketch follows this list).

  • Credit market growth - Credit has traditionally been used in financial markets as a measure of an individual's ability to repay, which helps improve market efficiency. However, the definition of credit remains unclear in the Web3 market because personal data is scarce and there is no data governance across the industry, which makes it difficult to integrate and collect information. Credit markets in Web3 can be redefined by building a process that collects and processes fragmented on-chain data (e.g., Spectral's MACRO (Multi-Asset Credit Risk Oracle) score).

  • Decentralized social/NFT extensions - Decentralized societies prioritize user control, privacy protection, censorship resistance, and community governance. This provides an alternative social paradigm. Therefore, a pipeline can be established to control and update various metadata more smoothly and facilitate migration between platforms.

  • Fraud detection - Web3 services built on smart contracts are vulnerable to malicious attacks that can steal funds, compromise systems, and lead to depegging and liquidity attacks. By creating systems that detect these attacks in advance, Web3 services can develop rapid response plans and protect users from harm.
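
As an illustration of the first item above, the sketch below shows how a data stream over a lending pool's state could drive a utilization-based floating borrow rate. The kink-style formula and all parameter values are illustrative assumptions rather than any particular protocol's actual model.

```python
# Minimal sketch: a utilization-based floating borrow rate.
# The kink-style formula and every parameter value below are illustrative
# assumptions, not the parameters of any specific lending protocol.

def borrow_rate(total_borrowed: float, total_supplied: float,
                base_rate: float = 0.02,   # assumed 2% annual floor
                slope_low: float = 0.10,   # assumed slope below the kink
                slope_high: float = 1.00,  # assumed slope above the kink
                kink: float = 0.80) -> float:
    """Annualized borrow rate as a function of pool utilization."""
    if total_supplied == 0:
        return base_rate
    utilization = total_borrowed / total_supplied
    if utilization <= kink:
        return base_rate + slope_low * utilization
    # Above the kink the rate rises steeply to discourage full utilization.
    return base_rate + slope_low * kink + slope_high * (utilization - kink)

# A pipeline consuming pool state each block could recompute and propose rates:
print(borrow_rate(total_borrowed=8_000_000, total_supplied=10_000_000))  # 0.10
print(borrow_rate(total_borrowed=9_500_000, total_supplied=10_000_000))  # 0.25
```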

3.2.2 Cooperation and governance initiatives

  • Fully on-chain DAOs - Decentralized autonomous organizations (DAOs) currently rely heavily on off-chain tools for effective governance and public funding. By building on-chain data processing and creating transparent processes for DAO operations, the value of Web3-native DAOs can be further enhanced.

  • Alleviating governance fatigue - Web3 protocol decisions are often made through community governance. However, many factors make it hard to take part in governance: geographical barriers, the burden of constant monitoring, a lack of the expertise governance requires, governance agendas released at random, and an inconvenient user experience. A protocol governance framework could operate more efficiently and effectively if a tool streamlined the process for participants, from understanding an individual governance agenda item to actually acting on it.

  • Open Data Platforms for Collaborative Works – In existing academic and industrial circles, many data and research materials are not publicly disclosed, which can make the overall development of the market very inefficient. On the other hand, on-chain data pools can facilitate more collaborative initiatives than existing markets because they are transparent and accessible to anyone. The development of many token standards and DeFi solutions are good examples. Additionally, we may operate public data pools for various purposes.

3.2.3 Network Diagnosis

  • Index research - Web3 users create various indicators to analyze and compare the state of protocols. Multiple objective metrics (e.g., the Nakamoto coefficient displayed by Nakaflow) can be studied and shown in real time (a minimal computation sketch follows this list).

  • Protocol metrics - A protocol's performance can be analyzed by processing data such as the number of active addresses, the number of transactions, asset inflows/outflows, and the fees incurred on the network. This information can be used to assess the impact of specific protocol updates, the state of MEV, and the health of the network.
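
To make the index-research item concrete, here is a minimal sketch of computing a Nakamoto coefficient from a stake distribution: the smallest number of entities whose combined share exceeds a control threshold. The stake values and the one-third threshold are illustrative assumptions.

```python
# Minimal sketch: Nakamoto coefficient from a validator stake distribution.
# The stake values and the 1/3 control threshold are illustrative assumptions.

def nakamoto_coefficient(stakes: list[float], threshold: float = 1 / 3) -> int:
    """Smallest number of entities whose combined share exceeds `threshold`."""
    total = sum(stakes)
    cumulative = 0.0
    for count, stake in enumerate(sorted(stakes, reverse=True), start=1):
        cumulative += stake
        if cumulative / total > threshold:
            return count
    return len(stakes)

# Hypothetical stakes indexed from on-chain validator data.
validator_stakes = [250, 200, 150, 150, 100, 80, 70]
print(nakamoto_coefficient(validator_stakes))  # 2 (250 + 200 > 1/3 of 1000)
```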

3.3 Challenges

On-chain data has unique advantages that can increase industry value. However, to fully realize these benefits, many challenges must be addressed both within and outside the industry.

  • Lack of data governance - Data governance is the process of establishing consistent and shared data policies and standards to facilitate the integration of each data primitive. Currently, each on-chain protocol establishes its own standards and retrieves its own data types. The problem, however, is the lack of data governance between the entities that aggregate these protocol data and provide API services to users. This makes integration between services difficult, and as a result, it is difficult for users to obtain reliable and comprehensive insights.

  • Cost inefficiency - Storing cold data in a protocol saves users server costs and improves data security. However, if the data needs to be accessed frequently for analysis or requires significant computing resources, it may not be cost-effective to store it on the blockchain (see the back-of-envelope sketch after this list).

  • The oracle problem - Smart contracts can only function fully when they have access to data from the real world. However, these data are not always reliable or consistent. Unlike blockchains, which maintain integrity through consensus algorithms, external data is not deterministic. Oracle solutions must evolve to ensure external data integrity, quality, and scalability independent of a specific application layer.

  • The protocol is in its infancy - the protocol uses its own token to incentivize users to keep the service running and pay for it. However, the parameters required to operate the protocol (e.g., the precise definition and incentive scheme of service users) are often naively managed. This means that the economic sustainability of the protocol is difficult to verify. If many protocols connect organically and create data pipelines, there will be greater uncertainty about whether the pipelines will work well.

  • Slow data retrieval time - Protocols typically process transactions through consensus of many nodes, which limits the speed and volume of information processing compared to traditional IT business logic. This bottleneck is difficult to resolve unless the performance of all the protocols that make up the pipeline is significantly improved.

  • The true value of Web3 data - Blockchains are siloed systems that are not yet connected to the real world. When collecting Web3 data, we need to consider whether the collected data can provide insights meaningful enough to cover the cost of building the data pipeline.

  • Unfamiliar syntax - Existing IT data infrastructure and blockchain infrastructure operate very differently. Even the programming languages used are different: blockchain infrastructure often uses low-level languages or new languages designed specifically for blockchain needs. This makes it difficult for new developers and service users to learn how to handle each data primitive, as they need to learn a new programming language or a new way of thinking about working with blockchain data.
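
To put a rough number on the cost-inefficiency point above, the back-of-envelope sketch below estimates what writing raw data into Ethereum contract storage would cost. The gas price and ETH price are illustrative assumptions, and roughly 20,000 gas per new 32-byte slot is an approximation of the SSTORE cost for an empty slot.

```python
# Back-of-envelope sketch: cost of writing 1 MB directly into contract storage.
# Gas price and ETH price are assumptions; ~20,000 gas per new 32-byte slot is
# an approximation of the SSTORE cost for a previously empty slot.

BYTES_TO_STORE = 1_000_000      # 1 MB of cold data
GAS_PER_SLOT   = 20_000         # approx. gas to write one new 32-byte slot
GAS_PRICE_GWEI = 20             # assumed gas price
ETH_PRICE_USD  = 1_800          # assumed ETH price

slots     = BYTES_TO_STORE // 32 + 1
total_gas = slots * GAS_PER_SLOT
cost_eth  = total_gas * GAS_PRICE_GWEI * 1e-9
cost_usd  = cost_eth * ETH_PRICE_USD

print(f"{slots:,} slots, {total_gas:,} gas, ~{cost_eth:.1f} ETH (~${cost_usd:,.0f})")
# ~31,251 slots and ~625 million gas under these assumptions - many times a
# single block's gas limit - which is why hot analytical data rarely lives on-chain.
```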

4. Pipelined Web3 data Lego

Current Web3 data primitives have no connections between them; each extracts and processes data independently. This makes it difficult to experiment with synergies in information processing. To address this issue, this article introduces a data pipeline commonly used in the IT market and maps existing Web3 data primitives onto it, which makes the use cases more concrete.

4.1 Generic Data Pipeline

Building a data pipeline is like conceptualizing and automating the repetitive decision-making processes of everyday life. By doing so, information of a specific quality becomes readily available and can be used for decision-making. The more unstructured the data to be processed, the more frequently the information is used, and the more real-time the analysis required, the more time and cost automating these processes saves in gaining the foresight needed for future decisions.

The diagram above shows a common architecture for building data pipelines in the existing IT infrastructure market. Data suitable for analytical purposes is collected from the correct data source and stored in an appropriate storage solution according to the nature of the data and the analytical requirements. For example, data lakes provide raw data storage solutions for scalable and flexible analysis, while data warehouses focus on storing structured data for query and analysis optimized for specific business logic. The data is then processed into insight or useful information in various ways.

Each solution tier is also available as a packaged service, and there is growing interest in ETL (Extract, Transform, Load) SaaS products that connect the chain of processes from data extraction to loading (e.g., Fivetran, Panoply, Hivo, Rivery). The sequence is not always one-way, and the layers can be connected to each other in a variety of ways depending on the specific needs of the organization. The most important thing when building a data pipeline is to minimize the risk of data loss as data is sent to and received from each server tier. This can be achieved by decoupling servers appropriately and using reliable data storage and processing solutions.
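
As a minimal, technology-agnostic sketch of the extract-transform-load flow described above (the record schema, source, and sink are all hypothetical):

```python
# Minimal ETL sketch: extract raw records, transform them into a structured
# shape, and load them into a sink. All names are hypothetical; a real pipeline
# would plug in an actual source, warehouse, and orchestration layer.

from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Transfer:
    block: int
    sender: str
    receiver: str
    amount: float

def extract(raw_rows: Iterable[dict]) -> Iterator[dict]:
    # In practice this would read from an API, a message queue, or a data lake.
    yield from raw_rows

def transform(rows: Iterable[dict]) -> Iterator[Transfer]:
    # Normalize field names, cast types, and drop malformed records.
    for row in rows:
        try:
            yield Transfer(int(row["block"]), row["from"].lower(),
                           row["to"].lower(), float(row["value"]))
        except (KeyError, ValueError):
            continue  # a real pipeline would route this to a dead-letter store

def load(records: Iterable[Transfer], sink: list) -> None:
    # Stand-in for writing to a warehouse table or analytical database.
    sink.extend(records)

warehouse: list[Transfer] = []
raw = [{"block": 100, "from": "0xA", "to": "0xB", "value": "1.5"},
       {"block": 101, "from": "0xA", "to": "0xC"}]  # malformed: missing value
load(transform(extract(raw)), warehouse)
print(warehouse)  # only the well-formed record reaches the sink
```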

4.2 Pipelines with on-chain environments

The conceptual diagram of the data pipeline introduced earlier can be applied to the on-chain environment, as shown in the figure above, but it should be noted that a completely decentralized pipeline cannot yet be formed, because each basic component still depends to some extent on centralized off-chain solutions. In addition, the figure does not include all Web3 solutions, and the boundaries between categories may be blurred. For example, KYVE, in addition to serving as a streaming platform, also includes data lake functionality and can be regarded as a data pipeline in itself. Likewise, Space and Time is classified as a decentralized database, but it offers API gateway services such as REST APIs and streaming, as well as ETL services.

4.2.1 Capture/Process

For regular users or dApps to efficiently use and operate a service, they need to be able to easily identify and access data sources generated primarily inside the protocol, such as transactions, state, and log events. This layer is where middleware comes into play, helping with processes including oracles, messaging, authentication, and API management. The main solutions are listed below (a minimal capture sketch follows these lists).

Streaming / Indexing Platform

Bitquery, Ceramic, KYVE, Lens, Streamr Network, The Graph, block explorers of various protocols, etc.

Node-as-a-Service and other RPC/API services

Alchemy, All that Node, Infura, Pocket Network, Quicknode, etc.

Oracle

API3, Band Protocol, Chainlink, Nest Protocol, Pyth, Supra Oracles, etc.
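
For a concrete sense of what this capture layer does, the sketch below pulls ERC-20 Transfer log events from a node over standard Ethereum JSON-RPC. The RPC endpoint and token address are placeholders to be replaced with a real provider (such as one of the node services listed above) and a real contract.

```python
# Minimal capture sketch: fetch ERC-20 Transfer events via Ethereum JSON-RPC.
# RPC_URL and TOKEN_ADDRESS are placeholders (assumptions), not real endpoints.
import requests

RPC_URL = "https://example-rpc-provider.invalid"              # replace with a real RPC endpoint
TOKEN_ADDRESS = "0x0000000000000000000000000000000000000000"  # replace with a token contract

# keccak256("Transfer(address,address,uint256)") - the standard ERC-20 Transfer topic
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "eth_getLogs",
    "params": [{
        "fromBlock": hex(18_000_000),
        "toBlock": hex(18_000_100),
        "address": TOKEN_ADDRESS,
        "topics": [TRANSFER_TOPIC],
    }],
}

resp = requests.post(RPC_URL, json=payload, timeout=30).json()
for log in resp.get("result", []):
    # topics[1] and topics[2] are the indexed from/to addresses; data holds the amount.
    print(log["blockNumber"], log["topics"][1], log["topics"][2], log["data"])
```

An indexing platform essentially runs this kind of extraction continuously, decodes the raw fields, and exposes the result behind a query interface.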

4.2.2 Storage

Web3 storage solutions offer several advantages over Web2 storage solutions, such as durability and decentralization. However, they also have disadvantages, such as high cost and difficulty updating and querying data. As a result, various solutions have emerged to address these shortcomings and enable efficient processing of structured and dynamic data on Web3; each solution differs in characteristics such as the type of data it processes, whether that data is structured, and whether query functions are embedded.

Decentralized storage network

Arweave, Filecoin, KYVE, Sia, Storj, etc.

Decentralized database

Arweave-based databases (Glacier, HollowDB, Kwil, WeaveDB), ComposeDB, OrbitDB, Polybase, Space and Time, Tableland, etc.

* Each protocol has different persistent storage mechanisms. For example, Arweave is a blockchain-based model similar to Ethereum storage that stores data permanently on-chain, while Filecoin, Sia, and Storj are contract-based models that store data off-chain.

4.2.3 Conversion

In the context of Web3, the transformation layer is as important as the storage layer. This is because the structure of a blockchain essentially consists of a distributed collection of nodes, which makes it easy to use scalable backend logic. In the artificial intelligence industry, people are actively exploring the use of these advantages to conduct research in the field of federated learning, and protocols specifically used for machine learning and artificial intelligence operations have emerged.

Data training/modeling/computing

Akash, Bacalhau, Bittensor, Gensyn, Golem, Together, etc.

* Federated learning is a method for training artificial intelligence models by distributing the original model across multiple local clients, training it on locally stored data, and then aggregating the learned parameters on a central server.
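
As a minimal sketch of the central step in the federated learning described in the footnote above, the code below performs a FedAvg-style weighted average of client parameters; the client parameter vectors and dataset sizes are toy values.

```python
# Minimal federated-averaging sketch: clients train locally and only parameters
# reach the server. The client parameters and dataset sizes are toy assumptions.
import numpy as np

def federated_average(client_params: list[np.ndarray],
                      client_sizes: list[int]) -> np.ndarray:
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    return sum(p * (n / total) for p, n in zip(client_params, client_sizes))

# Three clients return locally updated parameter vectors of the same shape.
params = [np.array([0.9, 1.1]), np.array([1.0, 1.0]), np.array([1.2, 0.8])]
sizes  = [100, 300, 600]   # local dataset sizes

global_params = federated_average(params, sizes)
print(global_params)  # new global model, redistributed to clients next round
```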

4.2.4 Analysis/Use

The dashboard services and end-user insight and analytics solutions listed below are platforms that allow users to observe and discover various insights into specific protocols. Some of these solutions also provide API services for their final products. However, it is important to note that the data in these solutions is not always accurate, because most of them use separate off-chain tools to store and process data, and discrepancies between solutions can be observed.

At the same time, there are platforms known as Web3 Functions that can automatically trigger the execution of smart contracts, much as centralized platforms such as Google Cloud trigger and execute specific business logic. Using such platforms, users can implement business logic in a Web3-native way rather than merely gaining insights by processing on-chain data (a minimal trigger sketch follows the lists below).

Dashboard service

Dune Analytics, Flipside Crypto, Footprint, Transpose, and more.

End User Insights and Analysis

Chainalysis, Glassnode, Messari, Nansen, The Tie, Token Terminal, and more.

Web3 Functions

Chainlink Functions, Gelato Network, etc.
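
To illustrate the trigger-and-execute pattern behind Web3 Functions, the sketch below polls an on-chain condition via a plain eth_call and signals when execution should happen. The endpoint, contract address, and calldata are placeholders; a hosted Web3 Functions service would sign and submit the actual execution transaction at that point.

```python
# Minimal trigger sketch: poll an on-chain condition and fire when it holds.
# RPC_URL, CONTRACT, and CHECK_CALLDATA are placeholders (assumptions).
import time
import requests

RPC_URL = "https://example-rpc-provider.invalid"   # replace with a real RPC endpoint
CONTRACT = "0x0000000000000000000000000000000000000000"
CHECK_CALLDATA = "0x"   # selector of a hypothetical view function such as shouldRebalance()

def condition_holds() -> bool:
    payload = {"jsonrpc": "2.0", "id": 1, "method": "eth_call",
               "params": [{"to": CONTRACT, "data": CHECK_CALLDATA}, "latest"]}
    result = requests.post(RPC_URL, json=payload, timeout=30).json().get("result")
    # Treat any non-zero return value as "execute now".
    return bool(result) and result not in ("0x", "0x0") and int(result, 16) != 0

while True:
    if condition_holds():
        # A hosted Web3 Functions platform would sign and submit the execution
        # transaction here; this sketch only signals that the trigger fired.
        print("condition met - submit execution transaction")
        break
    time.sleep(12)  # roughly one Ethereum block interval
```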

5. Conclusion and reflection

As Kant said, we can only witness the phenomena of things; we cannot touch their essence. Still, we use records of observations called data to process information and knowledge, and we have seen how innovations in information technology drive the development of civilization. Building data pipelines in the Web3 market, beyond the fact that they are decentralized, can therefore play a key role as a starting point for actually capturing these opportunities. I would like to conclude this article with a few thoughts.

5.1 The role of storage solutions will become more important

The most important prerequisite for a data pipeline is establishing data and API governance. In an increasingly diverse ecosystem, the norms created by each protocol will keep being re-created, and transaction records fragmented across multi-chain ecosystems will make it harder for individuals to derive comprehensive insights. A storage solution is then the entity that can provide integrated data in a unified format by collecting fragmented information and keeping up with each protocol's specifications. We can observe that existing storage solutions in the market, such as Snowflake and Databricks, are growing rapidly, have large customer bases, are vertically integrated by operating at various levels of the pipeline, and are leading the industry.

5.2 Opportunities in the Data Sources Market

As data becomes more accessible and processing improves, successful use cases begin to emerge. This creates a positive cyclical effect, whereby data sources and collection tools emerge explosively—since 2010, the type and amount of digital data collected each year has grown exponentially due to huge advances in technology for building data pipelines. Applying this background to the Web3 market, many data sources can be generated recursively on-chain in the future. This also means that blockchain will expand to various business areas. At this point, we can expect data collection to advance through data marketplaces like Ocean Protocol or DeWi (decentralized wireless) solutions like Helium and XNET, as well as storage solutions.

5.3 What matters is meaningful data and analysis

However, the most important thing is to keep asking what data should be prepared to extract the insights that are truly needed. Nothing is more wasteful than building a data pipeline for its own sake, without clear hypotheses to validate. Existing markets have achieved numerous innovations by building data pipelines, but they have also paid a heavy price through repeated, pointless failures. It is good to have constructive discussions about the development of the technology stack, but the industry also needs time to think about and discuss more fundamental questions, such as what data should be stored in the block space and for what purpose it should be used. The goal should be to realize the value of Web3 through actionable intelligence and use cases; in this process, developing the basic components and completing the pipeline are means to achieve that goal.
