Source: Filecoin Network

Editor's note: This article is based primarily on David Aronchick's keynote at the Filecoin Unleashed conference in Paris in 2023. David is the CEO of Expanso and the former head of compute-over-data at Protocol Labs, where he led the launch of the Bacalhau project. The article reflects the independent views of its original author and is republished with permission.
According to IDC, the volume of data stored globally is expected to exceed 175 ZB by 2025. That is an enormous amount of data, equivalent to 175 trillion 1 GB USB drives. Most of it will be generated between 2020 and 2025, at a projected compound annual growth rate of 61%.
Today, the rapidly growing data sphere faces two major challenges:
Slow and expensive data transfer. Downloading 175 ZB over a typical connection would take roughly 1.8 billion years (see the back-of-the-envelope check after this list).
Compliance burden. Hundreds of data-governance regulations exist around the world, making cross-jurisdictional compliance nearly impossible.
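As a sanity check on those figures, here is a short back-of-the-envelope calculation. The 25 Mbps bandwidth figure is our own assumption of a typical connection speed, not a number from the talk:

```python
# Back-of-the-envelope check of the download-time and USB-drive claims.
# Assumption: a single connection at roughly 25 Mbps average throughput.
ZB_IN_BYTES = 10**21

data_bytes = 175 * ZB_IN_BYTES            # 175 ZB
bandwidth_bps = 25e6                      # 25 Mbit/s (assumed average)
seconds = (data_bytes * 8) / bandwidth_bps
years = seconds / (365.25 * 24 * 3600)
print(f"{years / 1e9:.1f} billion years")                 # ~1.8 billion years

# USB-drive equivalence: 175 ZB divided by 1 GB per drive.
drives = data_bytes / 10**9
print(f"{drives / 1e12:.0f} trillion 1 GB drives")        # 175 trillion
```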
The combined effect of limited bandwidth growth and regulatory restrictions is that nearly 68% of institutional data sits idle. That is why it becomes so important to move compute to where the data is stored (broadly known as compute-over-data), rather than moving data to where the compute lives. Compute-over-data platforms such as Bacalhau (https://www.bacalhau.org/) and CoD (https://docs.filecoin.io/basics/what-is-filecoin/programming-on-filecoin/) are working toward this goal.
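To make the compute-over-data idea concrete, here is a minimal, purely illustrative Python sketch. The node names, shard contents, and the `run_on_node` helper are all hypothetical; this is not Bacalhau's API, and a real network would ship a container or WASM job rather than a Python function:

```python
# Illustrative only: move the computation to the nodes that already hold the
# data, and ship back only the small results, never the raw data itself.

# Hypothetical shards of a large log dataset, each held locally by one node.
node_local_data = {
    "node-eu-1": ["INFO boot", "ERROR disk full", "INFO ok"],
    "node-us-1": ["ERROR timeout", "ERROR retry", "INFO ok"],
    "node-ap-1": ["INFO ok", "INFO ok"],
}

def count_errors(lines):
    """Job logic that runs where the data lives."""
    return sum(1 for line in lines if line.startswith("ERROR"))

def run_on_node(node_id, job):
    """Stand-in for dispatching a job to a remote node.

    In a real compute-over-data network, the job (e.g. a container image)
    is scheduled onto the node; here we simply run it against that node's
    local shard to illustrate the data flow.
    """
    return job(node_local_data[node_id])

# Each node returns a single integer; the raw logs never leave their node.
results = {node: run_on_node(node, count_errors) for node in node_local_data}
print(results)                          # per-node partial results
print("total errors:", sum(results.values()))
```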
In the following sections, we will briefly cover:
How institutions handle their data today.
An alternative approach based on compute-over-data.
And finally, why distributed computing matters.
Status Quo
Currently, institutions employ the following three methods to address data processing challenges, but none of them are ideal.
Using centralized systems
The most common approach is to use a centralized system for large-scale data processing. Institutions typically combine computing frameworks such as Apache Spark, Hadoop, Databricks, Kubernetes, Kafka, and Ray into a cluster connected to a centralized API server. However, these systems do little to address network vulnerabilities or the regulatory constraints on data mobility.
This has led to institutions being fined and penalized billions of dollars for data breaches.
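For contrast, the conventional pattern often looks something like the PySpark sketch below, where the full dataset is pulled across the network into the central cluster before any computation happens. The bucket path and column name are placeholders:

```python
# Conventional "move the data to the compute" pattern (illustrative sketch).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("central-etl").getOrCreate()

# The entire dataset is read over the network into the centralized cluster;
# the bucket path below is a placeholder.
df = spark.read.parquet("s3a://example-bucket/surveillance/2023/")

# Only after the transfer does any computation happen.
print(df.filter(df["event_type"] == "anomaly").count())

spark.stop()
```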
Building their own
Another approach is for developers to build a custom orchestration system tailored to the institution's needs for data awareness and robustness. This approach is innovative, but it often fails because it relies on just a few individuals to maintain and operate the system.
Inaction
Surprisingly, most institutions do nothing with their data at all. A city, for example, may collect vast amounts of surveillance footage every day, but because processing it is so expensive, the footage can only be viewed on local machines and is never archived or analyzed.
Building True Distributed Computing
There are two main ways to address these data-processing pain points.
Solution 1: Building on an open-source data computing platform
Instead of the custom orchestration systems described above, developers can build on an open-source distributed compute platform. Because the platform is open source and extensible, institutions only need to build the components they are missing. Such a setup can support multi-cloud, multi-compute, and non-data-center applications, and can cope with complex regulatory environments. Just as importantly, system maintenance no longer depends on one or a few developers, because the open-source community shares that burden, reducing the risk of failure.
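As a minimal, concrete example of Solution 1, the sketch below submits a containerized job to a Bacalhau network by invoking its CLI from Python. It assumes the `bacalhau` binary is installed and a network is reachable, and the exact subcommands and flags may differ between versions, so treat it as a sketch rather than a reference:

```python
# Minimal sketch: submit a containerized job to a Bacalhau network via its CLI.
# Assumes the `bacalhau` binary is on PATH; flags may vary by version.
import subprocess

result = subprocess.run(
    ["bacalhau", "docker", "run", "ubuntu:latest",
     "--", "echo", "hello, compute-over-data"],
    capture_output=True,
    text=True,
    check=True,
)

# The CLI prints the job ID and status; the job itself runs on whichever
# compute node the network schedules it to, ideally next to the data.
print(result.stdout)
```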
Solution 2: Building on a distributed data protocol
By working with advanced compute projects such as Bacalhau and Lilypad, developers can go a step further and build not only on the open-source data platforms described in Solution 1, but also on truly distributed data protocols such as the Filecoin network.
This means institutions can rely on distributed protocols that know how to coordinate and express user workloads in a more sophisticated way, unlocking compute capacity adjacent to where data is generated and stored. Ideally, this shift from data centers to distributed protocols requires only minimal changes to the data scientist's workflow.
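The key shift is that a job describes the data it needs (for example, by content identifier) and the constraints it must satisfy, and the protocol decides where to run it. The structure below is a hypothetical job description invented for illustration, not an actual Filecoin or Bacalhau schema:

```python
# Hypothetical job description for a compute-over-data protocol.
# Field names and structure are illustrative, not a real schema.
job_spec = {
    "engine": "docker",
    "image": "ghcr.io/example/log-analyzer:1.0",          # placeholder image
    "inputs": [
        # Data is named by content (CID), not by the data center that holds it.
        {"cid": "bafybeigdyrzt...", "mount": "/inputs"}    # truncated example CID
    ],
    "constraints": {
        "region": "eu-west",        # keep computation inside one jurisdiction
        "governance": ["GDPR"],     # only nodes attesting GDPR compliance
    },
    "outputs": [{"path": "/outputs", "publish": "ipfs"}],
}

# A protocol-aware scheduler (not shown) would match this spec against
# providers that already store the referenced CID and satisfy the constraints.
```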
Distributed Means Maximizing Choice
By deploying on distributed protocols such as the Filecoin network, the vision is that users can reach hundreds (or thousands) of machines spread across regions, all participating in the same network and following the same protocol rules. This essentially opens up an ocean of choice for data scientists, who can ask the network to do the following (a selection sketch follows the list):
Select datasets from anywhere in the world.
Adhere to any governance structure, whether it is HIPAA, GDPR, or FISMA.
Operate at the lowest possible cost.
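One way to picture that selection process is as a simple filter-and-rank over candidate providers, as in the sketch below. The provider list, compliance tags, and prices are all invented for illustration:

```python
# Illustrative provider selection: filter by data location and governance,
# then pick the cheapest match. All values are made up for the example.
providers = [
    {"id": "p1", "region": "us-east", "compliance": {"HIPAA", "FISMA"}, "price_per_hour": 0.9},
    {"id": "p2", "region": "eu-west", "compliance": {"GDPR"},           "price_per_hour": 1.1},
    {"id": "p3", "region": "eu-west", "compliance": {"GDPR", "HIPAA"},  "price_per_hour": 0.7},
]

def select_provider(providers, region, required_compliance):
    """Pick the cheapest provider in the right region with the right attestations."""
    candidates = [
        p for p in providers
        if p["region"] == region and required_compliance <= p["compliance"]
    ]
    return min(candidates, key=lambda p: p["price_per_hour"]) if candidates else None

print(select_provider(providers, region="eu-west", required_compliance={"GDPR"}))
# -> provider p3: in-region, GDPR-attested, and the lowest price
```

The real scheduling problem is far richer (reputation, data locality, latency, and so on), but the shape is the same: satisfy the hard constraints first, then optimize for cost.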
Juan's triangle. Abbreviations: FHE (Fully Homomorphic Encryption), MPC (Multi-Party Computation), TEE (Trusted Execution Environment), ZKP (Zero-Knowledge Proof).
When it comes to maximizing choice, we have to mention "Juan's triangle". The term was coined by Juan Benet, founder of Protocol Labs, to explain why different use cases will (in the future) be served by different distributed compute networks.
Juan's triangle holds that a compute network must trade off privacy, verifiability, and performance, and that a traditional one-size-fits-all approach cannot serve every use case. Instead, the modular nature of distributed protocols allows different networks (or sub-networks) to serve different user needs, whether that is privacy, verifiability, or performance. Each network optimizes for the factors its users care about most, and many service providers (the boxes inside the triangle) will emerge to fill these gaps and make distributed computing a reality.
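One way to read the triangle is as a weighting problem: every network sits somewhere in the privacy/verifiability/performance space, and each use case weights those axes differently. The sketch below invents a few networks and scores purely to illustrate the idea:

```python
# Illustrative only: networks scored (0-1) on the three corners of Juan's
# triangle, and use cases expressed as weights over those corners.
networks = {
    "tee-subnet": {"privacy": 0.9, "verifiability": 0.6,  "performance": 0.7},
    "zkp-subnet": {"privacy": 0.7, "verifiability": 0.95, "performance": 0.3},
    "gpu-subnet": {"privacy": 0.2, "verifiability": 0.3,  "performance": 0.95},
}

def best_network(use_case_weights):
    """Return the network whose profile best matches the use case's weights."""
    def score(profile):
        return sum(use_case_weights[axis] * profile[axis] for axis in use_case_weights)
    return max(networks, key=lambda name: score(networks[name]))

# A medical-imaging workload might weight privacy heavily...
print(best_network({"privacy": 0.6, "verifiability": 0.3, "performance": 0.1}))  # tee-subnet
# ...while large-scale model training mostly wants raw performance.
print(best_network({"privacy": 0.1, "verifiability": 0.1, "performance": 0.8}))  # gpu-subnet
```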
In summary, data processing is a complex problem that calls for ready-to-use solutions. Adopting open-source compute-over-data platforms in place of traditional centralized systems is a great first step. Ultimately, deploying compute platforms on distributed protocols such as the Filecoin network gives users the freedom to configure compute resources around their own needs, which is crucial in the era of big data and artificial intelligence.
To stay up to date on the distributed computing platform, follow the CoD Working Group. To learn more about the Filecoin ecosystem, follow the Filecoin Insight blog and find Filecoin Insight, Bacalhau, Lilypad, Expanso, and the CoD WG on Twitter.


