Original article by @BlazingKevin_, Researcher at Movemaker
Nvidia has quietly recovered all the losses triggered by DeepSeek and has even climbed to a new high. The evolution of multimodal models has not brought chaos; it has deepened the technical moat of Web2 AI. From semantic alignment to visual understanding, from high-dimensional embeddings to feature fusion, complex models are integrating the expressions of different modalities at unprecedented speed, building an increasingly closed AI highland. The US stock market has voted with its feet: both crypto-linked stocks and AI stocks have gone through a wave of bull market. Yet this wave of heat has nothing to do with Crypto.

The Web3 AI attempts we have seen, especially the evolution of Agents in recent months, are almost completely wrong in direction: the wishful thinking of using decentralized structures to assemble Web2-style multimodal modular systems is a double dislocation of technology and thinking. In today's world, where modules are tightly coupled, feature distributions are highly unstable, and computing power is increasingly concentrated, multimodal modularization simply cannot stand in Web3. What we want to point out is that the future of Web3 AI is not imitation but a strategic detour. From semantic alignment in high-dimensional space, to the information bottleneck in the attention mechanism, to feature alignment under heterogeneous computing power, I will explain why Web3 AI should take "encircling the cities from the countryside" as its tactical program.
Web3 AI is built on flat multimodal models; semantics cannot be aligned, so performance suffers
In the multimodal systems of modern Web2 AI, semantic alignment means mapping information from different modalities (images, text, audio, video, etc.) into the same, or mutually convertible, semantic space, so that the model can understand and compare the underlying meaning behind these originally distinct signals. For example, given a photo of a cat and the sentence "a cute cat," the model needs to project them to nearby positions in the high-dimensional embedding space, so that when retrieving, generating, or reasoning it can describe what it sees in a picture and picture what it hears in a sentence.
Only on the basis of a shared high-dimensional embedding space can the workflow be divided into separate modules to cut costs and improve efficiency. In Web3 Agent protocols, however, such high-dimensional embedding cannot be achieved, which is why the modularity of Web3 AI is an illusion.
How should we understand a high-dimensional embedding space? At the most intuitive level, imagine it as a coordinate system: just like x-y coordinates on a plane, a pair of numbers locates a point. The difference is that while a point on the familiar two-dimensional plane is fully determined by two numbers (x, y), in a high-dimensional space each point is described by many more numbers, perhaps 128, 512, or even thousands.
From simplest to most complex, let's understand it in three steps:
2D example:
Imagine you have marked the coordinates of several cities on a map, such as Beijing (116.4, 39.9), Shanghai (121.5, 31.2), and Guangzhou (113.3, 23.1). Each city here corresponds to a two-dimensional embedding vector: the two-dimensional coordinates encode the geographic location information into a number.
If you want to measure the “similarity” between cities—cities that are close together on a map tend to be in the same economic or climate zone—you can simply compare the Euclidean distance between their coordinates.
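To make this concrete, here is a minimal Python sketch, using only the coordinates from the example above, of how that "similarity" is read off as Euclidean distance:

```python
import math

# 2D embeddings: (longitude, latitude) from the example above
cities = {
    "Beijing":   (116.4, 39.9),
    "Shanghai":  (121.5, 31.2),
    "Guangzhou": (113.3, 23.1),
}

def euclidean(a, b):
    """Straight-line distance between two coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Cities that are closer on the map have a smaller distance
print(euclidean(cities["Shanghai"], cities["Beijing"]))    # ~10.1
print(euclidean(cities["Shanghai"], cities["Guangzhou"]))  # ~11.5
```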
Extending to multiple dimensions:
Now suppose you want to describe not only the location in geographic space, but also some climate characteristics (average temperature, rainfall), demographic characteristics (population density, GDP), etc. You can assign a vector containing 5, 10, or even more dimensions to each city.
For example, Guangzhou’s 5-dimensional vector might be [113.3, 23.1, 24.5, 1700, 14.5], which respectively represent longitude, latitude, average temperature, annual rainfall (mm), and economic index. This “multidimensional space” allows you to compare cities by geography, climate, economy, and other dimensions at the same time: if the vectors of two cities are very close, it means that they are very similar in these attributes.
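As a rough sketch of the same idea in five dimensions (the non-geographic values below are illustrative placeholders, not real statistics), the comparison works exactly the same way, except that the features sit on very different scales and should be normalized before measuring distance:

```python
import numpy as np

# Hypothetical 5-D city vectors:
# [longitude, latitude, avg. temperature (°C), annual rainfall (mm), economic index]
cities = {
    "Guangzhou": np.array([113.3, 23.1, 24.5, 1700.0, 14.5]),
    "Shanghai":  np.array([121.5, 31.2, 17.1, 1200.0, 18.0]),  # illustrative values
    "Beijing":   np.array([116.4, 39.9, 13.0,  550.0, 17.0]),  # illustrative values
}

# Features live on very different scales, so standardize each dimension first
matrix = np.stack(list(cities.values()))
normed = (matrix - matrix.mean(axis=0)) / (matrix.std(axis=0) + 1e-8)

names = list(cities.keys())
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        dist = np.linalg.norm(normed[i] - normed[j])
        print(f"{names[i]} vs {names[j]}: {dist:.2f}")  # smaller => more similar overall
```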
Switching to semantics - why embed:
In natural language processing (NLP) or computer vision, we likewise want to map words, sentences, or images into such multi-dimensional vectors, so that words or images with similar meanings end up closer together in the space. This mapping process is called embedding.
For example, we train a model to map "cat" to a 300-dimensional vector v₁, "dog" to another vector v₂, and an unrelated word such as "economy" to v₃. In this 300-dimensional space, the distance between v₁ and v₂ will be small (both are animals and often appear in similar linguistic contexts), while the distance between v₁ and v₃ will be large.
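The geometry described here can be sketched with toy vectors; the 300-dimensional embeddings below are randomly constructed stand-ins (a real model such as word2vec or CLIP would learn them from data), shown only to illustrate how "closeness" is typically measured with cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for trained 300-d embeddings: "cat" and "dog" are built
# around a shared "animal" direction, "economy" around an unrelated one.
animal_dir  = rng.normal(size=300)
economy_dir = rng.normal(size=300)

v1 = animal_dir  + 0.3 * rng.normal(size=300)   # "cat"
v2 = animal_dir  + 0.3 * rng.normal(size=300)   # "dog"
v3 = economy_dir + 0.3 * rng.normal(size=300)   # "economy"

def cosine(a, b):
    """Cosine similarity: close to 1 = same direction, close to 0 = unrelated."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(v1, v2))  # high: both cluster around the "animal" direction
print(cosine(v1, v3))  # near zero: semantically unrelated
```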
As the model is trained on massive amounts of text or image-text pairs, the dimensions it learns do not correspond directly to interpretable attributes such as longitude and latitude; instead they capture implicit semantic features. Some dimensions may encode the coarse-grained split between "animal" and "non-animal," some may distinguish "domestic" from "wild," and some may correspond to the feeling of "cute" versus "mighty"... In short, hundreds or thousands of dimensions work together to encode complex, intertwined levels of semantics.
What is the difference between high and low dimensionality? Only with enough dimensions can the space accommodate many intertwined semantic features, and only high dimensionality gives each feature a clearly separated position along its own semantic axis. When semantics cannot be distinguished, that is, when semantics cannot be aligned, different signals crowd each other in the low-dimensional space. First, the model frequently confuses them when retrieving or classifying, and accuracy drops sharply. Second, subtle differences are hard to capture at the strategy-generation stage, so key trading signals are missed or risk thresholds are misjudged, which directly drags down returns. Third, cross-module collaboration becomes impossible: each agent acts on its own, information silos proliferate, overall response latency rises, and robustness deteriorates. Finally, when facing complex market scenarios, a low-dimensional structure has almost no capacity to carry multi-source data, so stability and scalability cannot be guaranteed; long-term operation inevitably runs into performance bottlenecks and maintenance difficulties, leaving the shipped product far short of initial expectations.
So can a Web3 AI or Agent protocol achieve a high-dimensional embedding space? First, how is a high-dimensional space achieved at all? In the traditional sense, it requires that every subsystem, such as market intelligence, strategy generation, execution, and risk control, be aligned and complementary in both data representation and decision process. Most Web3 Agents, however, simply wrap existing APIs (CoinGecko, DEX interfaces, etc.) as independent Agents, with no unified central embedding space and no cross-module attention mechanism. Information therefore cannot interact between modules from multiple angles and levels; it can only flow through a linear pipeline, which yields single-function behavior and prevents any overall closed-loop optimization.
Many Agents call external interfaces directly and do not even perform adequate fine-tuning or feature engineering on the data the interfaces return. The market-analysis Agent simply pulls price and volume, the trade-execution Agent just places orders according to interface parameters, and the risk-control Agent merely raises alarms against a few thresholds. Each performs its own duty, but there is no multimodal fusion or deep semantic understanding of the same risk event or market signal, so the system cannot quickly generate comprehensive, multi-angle strategies when facing extreme market conditions or cross-asset opportunities.
Requiring Web3 AI to achieve a high-dimensional space is therefore equivalent to requiring the Agent protocol to develop every API it relies on by itself, which runs counter to its original intention of modularization. The modular multimodal systems described by small and mid-sized Web3 AI teams do not withstand scrutiny. A high-dimensional architecture demands end-to-end unified training or collaborative optimization: from signal capture to strategy computation to execution and risk control, every link shares the same set of representations and loss functions. The "module as plug-in" approach of Web3 Agents only aggravates fragmentation: each Agent is upgraded, deployed, and tuned inside its own silo, making synchronized iteration difficult, and there is no effective centralized monitoring and feedback mechanism, so maintenance costs surge while overall performance stays limited.
Building a full-link intelligent agent with real industry barriers requires the systems engineering of end-to-end joint modeling, unified embeddings across modules, and collaborative training and deployment. But no such pain point exists in the current market, and so there is naturally no market demand.
In low-dimensional space, attention mechanisms cannot be precisely designed
High-level multimodal models require sophisticated attention mechanisms. An attention mechanism is essentially a way of dynamically allocating computing resources, allowing the model to selectively focus on the most relevant parts of a given modality's input. The most common forms are the self-attention and cross-attention mechanisms in the Transformer: self-attention lets the model measure the dependencies between elements in a sequence, such as how important each word in a text is to the other words; cross-attention lets information from one modality (such as text) decide which features of another modality (such as the feature sequence of an image) to attend to when decoding or generating. Through multi-head attention, the model can learn multiple alignments in different subspaces at the same time, capturing more complex and fine-grained associations.
The premise for the attention mechanism to work is that the multimodal features live in a high-dimensional space. In such a space, a well-designed attention mechanism can locate the most relevant parts of a massive representation in the shortest time. Before explaining why the attention mechanism must operate in a high-dimensional space, let us first look at how Web2 AI, represented by the Transformer decoder, designs it. The core idea is that when processing sequences (text, image patches, audio frames), the model dynamically assigns attention weights to each element, so that it focuses on the most relevant information rather than treating everything equally.
In simple terms, if the attention mechanism is a car, then designing Query-Key-Value is designing the engine. QKV is the mechanism that helps the model decide which information matters: Query means "what am I looking for," Key means "what tags do I carry," and Value means "what content do I hold." For a multimodal model, the input may be a sentence, a picture, or an audio clip. To retrieve the required content in the embedding space, the input is cut into minimal units, such as a character, a small pixel patch, or an audio frame, and the model generates a Query, Key, and Value for each of these units to perform the attention calculation. When processing a given position, the model uses that position's Query to compare against the Keys of all positions and determine which tags best match the current need; then, according to the degree of match, it extracts the Values from the corresponding positions and weights them by importance, producing a new representation that contains both the position's own information and the globally relevant content. In this way, each output can dynamically "ask, retrieve, and integrate" according to context, achieving efficient and precise information focus.
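A minimal NumPy sketch of this "ask, retrieve, integrate" loop, with random projections standing in for learned weights (shapes and values are illustrative only):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each Query scores every Key, then pulls in Values weighted by the match."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # "how well does each tag match my question?"
    weights = softmax(scores, axis=-1)     # attention weights sum to 1 per query
    return weights @ V                     # weighted mix of the matched content

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))                # 4 minimal units (tokens/patches/frames), d_model = 8

# Learned projection matrices in a real model; randomly initialized here
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                           # (4, 8): one context-aware representation per unit
```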
On top of this engine, further components are added to combine global interaction with controllable complexity: scaled dot products for numerical stability, multi-head parallelism for richer expression, positional encoding to preserve sequence order, sparse variants to balance efficiency, residual connections and normalization to stabilize training, and cross-attention to open up multimodality. These modular, progressive designs give Web2 AI both powerful learning capability and efficient operation within an affordable compute budget when handling sequence and multimodal tasks.
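Building on the sketch above, here is a hedged illustration of just the multi-head idea (positional encoding, residuals, normalization, and sparse variants are omitted): the model dimension is split into subspaces, each head attends independently, and the results are concatenated.

```python
import numpy as np

def multi_head_attention(X, num_heads, rng):
    """Attend in several subspaces in parallel, then concatenate the heads."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head gets its own (illustrative, randomly initialized) projections
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ V)              # one alignment pattern per head
    return np.concatenate(heads, axis=-1)      # back to (seq_len, d_model)

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))
print(multi_head_attention(X, num_heads=2, rng=rng).shape)  # (4, 8)
```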
Why can't modular Web3 AI achieve unified attention scheduling? First, the attention mechanism relies on a unified Query-Key-Value space: all input features must be mapped into the same high-dimensional vector space before dynamic weights can be computed via dot products. Independent APIs return data in different formats and distributions (prices, order status, threshold alarms); without a unified embedding layer, they cannot form an interoperable set of Q/K/V. Second, multi-head attention attends to different information sources in parallel at the same layer and then aggregates the results, whereas independent APIs are typically called one after another, A then B then C, with each step's output serving only as the next module's input. There is no parallel, multi-way dynamic weighting, so the fine-grained scheduling of an attention mechanism, which scores all positions or all modalities simultaneously and then integrates them, cannot be simulated. Finally, a real attention mechanism dynamically assigns weights to every element based on the overall context; in the API model, each module sees only its own isolated context when it is called, with no real-time shared central context, so cross-module global association and focus are impossible.
Therefore, simply wrapping functions into discrete APIs, without a shared vector representation or parallel weighting and aggregation, cannot produce the unified attention scheduling of a Transformer, just as a car with a weak engine cannot raise its performance ceiling no matter how it is modified.
Discrete modular patchwork keeps feature fusion at the level of superficial, static concatenation
Feature fusion takes the feature vectors produced after alignment and attention across modalities and combines them further, so that downstream tasks (classification, retrieval, generation, etc.) can use them directly. Fusion methods range from simple concatenation and weighted summation to bilinear pooling, tensor decomposition, and even dynamic routing. A higher-order approach alternates alignment, attention, and fusion across multiple network layers, or uses graph neural networks (GNNs) to establish more flexible message-passing paths between cross-modal features, achieving deep information interaction.
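For intuition, here is a minimal sketch contrasting the three simplest fusion operations named above on two hypothetical modality vectors (the feature values are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
img_feat  = rng.normal(size=16)   # hypothetical image feature vector
text_feat = rng.normal(size=16)   # hypothetical text feature vector

# 1) Concatenation: place the vectors side by side
concat = np.concatenate([img_feat, text_feat])        # shape (32,)

# 2) Weighted summation: a fixed linear blend of the two modalities
weighted = 0.6 * img_feat + 0.4 * text_feat           # shape (16,)

# 3) Bilinear pooling: the outer product captures every pairwise
#    interaction between image and text dimensions
bilinear = np.outer(img_feat, text_feat).flatten()    # shape (256,)

print(concat.shape, weighted.shape, bilinear.shape)
```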
Needless to say, Web3 AI is still stuck at the simplest concatenation stage, because dynamic feature fusion presupposes a high-dimensional space and a precise attention mechanism. When those prerequisites are not met, fusion at the final stage cannot deliver strong performance.
Web2 AI tends to adopt end-to-end joint training: image, text, audio, and other modal features are processed simultaneously in the same high-dimensional space, and the model is optimized together with the downstream task layers through the attention and fusion layers, automatically learning the optimal fusion weights and interaction patterns during forward and backward propagation. Web3 AI, by contrast, uses discrete module splicing: image recognition, market crawling, risk assessment, and other APIs are wrapped as independent Agents, and the labels, values, or threshold alarms each outputs are simply pieced together, with comprehensive decisions left to a main control flow or to humans. This approach lacks a unified training objective and has no gradient flow across modules.
In Web2 AI, the system uses the attention mechanism to compute the importance of each feature in real time according to context and dynamically adjusts the fusion strategy; multi-head attention can also capture several distinct feature-interaction patterns in parallel at the same layer, attending to both local detail and global semantics. Web3 AI, by contrast, often fixes weights in advance, such as "image × 0.5 + text × 0.3 + price × 0.2", or uses simple if/else rules to decide whether to merge, or does not merge at all and merely presents each module's output side by side, which leaves no room for flexibility.
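The contrast can be sketched directly. The static weights below mirror the "image × 0.5 + text × 0.3 + price × 0.2" example, while the dynamic branch re-scores the modalities for every input with a softmax (the context vector and features are placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(4)
modalities = {                       # hypothetical per-modality feature vectors
    "image": rng.normal(size=8),
    "text":  rng.normal(size=8),
    "price": rng.normal(size=8),
}

# Web3-style static fusion: weights fixed in advance, regardless of context
static = (0.5 * modalities["image"]
          + 0.3 * modalities["text"]
          + 0.2 * modalities["price"])

# Web2-style dynamic fusion: a context vector scores each modality,
# and the fusion weights are re-computed for every new input
context = rng.normal(size=8)         # e.g. the current decision state
scores  = np.array([context @ feat for feat in modalities.values()])
weights = softmax(scores)            # changes whenever the context changes
dynamic = sum(w * feat for w, feat in zip(weights, modalities.values()))

print(weights)                       # different inputs => different fusion weights
```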
Web2 AI maps all modal features into a high-dimensional space of thousands of dimensions, and the fusion process is not just vector concatenation but includes higher-order interactive operations such as addition and bilinear pooling; each dimension may correspond to some latent semantic, enabling the model to capture deep, complex cross-modal associations. In contrast, the output of each Web3 AI Agent usually contains only a few key fields or indicators, so the feature dimensionality is extremely low and it is almost impossible to express subtle information such as why image content matches text meaning, or the fine-grained link between price fluctuations and sentiment trends.
In Web2 AI, the loss from downstream tasks propagates continuously back through the attention and fusion layers into every part of the model, automatically adjusting which features should be strengthened or suppressed and forming a closed optimization loop. Web3 AI, by contrast, relies on manual or external processes to evaluate results and adjust parameters after API calls are reported; without automated end-to-end feedback, it is difficult to iterate and optimize fusion strategies online.
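A hedged PyTorch-style sketch (assuming torch is available; the layer sizes and two-encoder layout are illustrative, not any particular production model) of what this closed loop means: the downstream loss backpropagates through the fusion and attention layers, so fusion weights are learned rather than hand-set.

```python
import torch
import torch.nn as nn

class JointMultimodalModel(nn.Module):
    """Two modality encoders, attention-based fusion, one downstream task head."""
    def __init__(self, d=32, n_classes=2):
        super().__init__()
        self.img_enc = nn.Linear(64, d)     # illustrative image-feature encoder
        self.txt_enc = nn.Linear(300, d)    # illustrative text-feature encoder
        self.fusion  = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
        self.head    = nn.Linear(d, n_classes)

    def forward(self, img, txt):
        tokens = torch.stack([self.img_enc(img), self.txt_enc(txt)], dim=1)  # (B, 2, d)
        fused, _ = self.fusion(tokens, tokens, tokens)  # modalities attend to each other
        return self.head(fused.mean(dim=1))             # pooled representation -> task

model = JointMultimodalModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step: the task loss flows back through the
# fusion and attention layers, adjusting which features are strengthened or suppressed.
img, txt, y = torch.randn(8, 64), torch.randn(8, 300), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(img, txt), y)
opt.zero_grad()
loss.backward()
opt.step()
```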
The barriers to the AI industry are deepening, but pain points have not yet appeared
Because end-to-end training must simultaneously handle cross-modal alignment, precise attention computation, and high-dimensional feature fusion, the multimodal systems of Web2 AI are usually enormous engineering projects. They require massive, diverse, and precisely annotated cross-modal datasets, along with weeks or even months of training on thousands of GPUs; in model architecture, they integrate the latest network design concepts and optimization techniques; in engineering, they need scalable distributed training platforms, monitoring systems, model version management, and deployment pipelines; in algorithm research, they demand continued work on more efficient attention variants, more robust alignment losses, and lighter fusion strategies. Such full-link, full-stack systematic work places extremely high demands on capital, data, computing power, talent, and even organizational coordination, and therefore constitutes a very strong industry barrier and the core competitiveness that only a few leading teams have mastered so far.
In April, when I reviewed Chinese AI applications and compared them with Web3 AI, I made one point: Crypto has the potential to break through in industries with strong barriers, that is, industries that are already very mature in the traditional market but still carry huge pain points. High maturity means there are enough users familiar with similar business models; large pain points mean users are willing to try new solutions, that is, they have a strong willingness to adopt Crypto. Both are indispensable. Put differently, if an industry is not already mature in the traditional market yet burdened with major pain points, Crypto cannot take root in it and has no room to live: users will not bother to understand it fully, nor grasp its potential ceiling.
Web3 AI, or any Crypto product that claims PMF, needs to develop with the tactic of encircling the cities from the countryside: test the waters on a small scale in peripheral positions, secure a solid foundation, and wait for the core scenario, the "target city," to emerge. The core of Web3 AI lies in decentralization, and its evolutionary path shows up as high parallelism, low coupling, and compatibility with heterogeneous computing power. This gives Web3 AI an advantage in scenarios such as edge computing, and suits lightweight, easily parallelized, incentivizable tasks: LoRA fine-tuning, post-training behavior-alignment tasks, crowdsourced data training and annotation, small foundation model training, and collaborative training on edge devices. The product architectures in these scenarios are lightweight, and their roadmaps can be iterated flexibly.

But that does not mean the opportunity is now, because the barriers of Web2 AI have only just begun to form. The emergence of DeepSeek has accelerated progress on complex multimodal tasks; this is a competition among leading enterprises and the early stage of Web2 AI's dividend. I believe that only when the Web2 AI dividend fades will the pain points left behind become Web3 AI's entry points, just as DeFi was born. Until then, protocols with self-invented pain points will keep entering the market, and we need to identify carefully which ones actually follow "encircling the cities from the countryside": whether they cut in from the edge, first gaining a foothold in the "countryside" (small markets, small scenarios) where incumbents are weak and few players have taken root, gradually accumulating resources and experience; whether they can combine point and surface and expand step by step, continuously iterating the product within a sufficiently small application scenario (if not, it is hard to reach a US$1 billion valuation on the back of PMF, and such projects will not make the watch list); and whether they can fight a protracted war with flexibility and mobility. The potential barriers of Web2 AI are shifting dynamically, and the corresponding potential pain points are evolving with them, so we must watch whether a Web3 AI protocol is flexible enough to adapt to different scenarios, move quickly between "rural areas," and close in on the target city at top speed. If the protocol itself is too infrastructure-heavy, with a huge network architecture, it is very likely to be eliminated.
About Movemaker
Movemaker is the first official community organization authorized by the Aptos Foundation and jointly initiated by Ankaa and BlockBooster, focusing on promoting the construction and development of the Aptos Chinese ecosystem. As the official representative of Aptos in the Chinese region, Movemaker is committed to building a diverse, open and prosperous Aptos ecosystem by connecting developers, users, capital and many ecological partners.
Disclaimer:
This article/blog is for informational purposes only and represents the personal opinions of the author and does not necessarily represent the position of Movemaker. This article is not intended to provide: (i) investment advice or investment recommendations; (ii) an offer or solicitation to buy, sell or hold digital assets; or (iii) financial, accounting, legal or tax advice. Holding digital assets, including stablecoins and NFTs, is extremely risky and may fluctuate in price and become worthless. You should carefully consider whether trading or holding digital assets is appropriate for you based on your financial situation. If you have questions about your specific situation, please consult your legal, tax or investment advisor. The information provided in this article (including market data and statistical information, if any) is for general information only. Reasonable care has been taken in the preparation of these data and charts, but no responsibility is assumed for any factual errors or omissions expressed therein.