DeepSeek's Road to $10 Trillion: Using Open Source to Leverage a Trillion-Dollar Hardware Ecosystem
Original Title: DeepSeek's 10 Trillion USD Grand Strategy
Original Author: @bookwormengr
Original Translation: Peggy, BlockBeats
Editor's Note: Over the past year, discussions surrounding DeepSeek have mostly focused on model performance, open-source strategy, and price wars. However, understanding DeepSeek solely through whether it sells subscriptions, has multimodal capabilities, or can build a coding agent likely underestimates what it truly aims to change.
This article proposes a more radical thesis: DeepSeek's goal may not be short-term monetization through the application layer. Instead, through a series of underlying architectural innovations, it aims to reshape the cost structure of AI training and inference, indirectly fostering a new hardware ecosystem. From MoE and MLA to DSA, CSA, mHC, and Engram, and on to Dual Path and TileLang, DeepSeek's technical roadmap consistently revolves around one core question: given constraints on HBM, advanced process nodes, packaging, and the CUDA ecosystem, how can it run stronger models with less high-end computing power?
The most noteworthy aspect of the article is not whether DeepSeek can make hundreds of millions of dollars via API or subscriptions, but whether it is binding model capabilities, memory architecture, and the domestic hardware ecosystem together. KV Cache compression reduces reliance on HBM. NAND and SSDs can handle long-term caching. LPDDR can be used for weight streaming and Engram storage. TileLang attempts to erode the CUDA moat. If these innovations continue to proliferate, the beneficiaries extend beyond DeepSeek itself to include storage, ASIC, GPU, networking chips, and the entire AI infrastructure chain.
Of course, the article's conclusions regarding a "10 trillion USD industrial ecosystem" and a "1 trillion USD valuation" are still highly speculative. However, it provides a crucial path for understanding DeepSeek: open-sourcing does not necessarily mean abandoning commercialization, and low prices are not just about subsidizing the market. For DeepSeek, the real business might not reside in the application layer but in making more hardware viable and enabling lower-cost AI supply. In other words, what it sells may not be the model itself, but the feasibility of the next generation of AI infrastructure.
Below is the original text:

Have you ever wondered exactly how DeepSeek plans to make money, and potentially a lot of it?
It hasn't launched competitive coding subscription plans like GLM, MoonShot, or MiniMax. It also lacks multimodal, audio, or video models. So far, it doesn't even have its own harness—the outer runtime framework for model invocation, tool integration, and task execution—though they have recently started recruiting for related positions to build this system.
Meanwhile, DeepSeek seems firmly committed to the open-source side for the long haul, even willingly sharing its "secrets" publicly. Isn't this madness? Isn't it just burning money? Are the investors who are ready to pour $10 billion into it simply flushing their money down the drain?
Personally, I think the opposite is true.
Next, based on what DeepSeek has accomplished so far, I will offer some observations and analyze the strategy it appears to be following. DeepSeek CEO Liang Wenfeng's goal likely extends far beyond the current model competition. He is aiming for an even bigger prize: DeepSeek has the potential to reach a $1 trillion valuation while driving the formation of a new $10 trillion industry.

TechInAsia's report on DeepSeek's latest funding round
Revisiting DeepSeek's "Hero's Journey"
DeepSeek has always been swimming against the current. Instead of continuously releasing slightly more capable models and rushing to package them into directly monetizable applications like coding subscriptions, it has taken a different path. On January 27, 2025, I posted a popular tweet describing what I saw as DeepSeek's "Hero's Journey." Today, this story has become even more intriguing.
While others were building dense models, DeepSeek opted for the more challenging-to-train Mixture of Experts (MoE).
Taking a "first principles" approach, they invented the new GRPO algorithm to replace the then-mainstream but costly-to-implement PPO reinforcement learning algorithm.
They discovered that Reinforcement Learning with Verifiable Rewards (RLVR) is a key strategy for improving model reasoning capabilities.
They also proposed a simple speculative decoding strategy via "Multi Token Prediction," simultaneously making training signals denser.
They refined the "Zero Bubble" pipeline to improve the utilization of limited GPU resources.
They released an Expert Load Balancer to make deploying MoE models easier for everyone. Particularly, through the "Wide Expert Parallel" strategy, models can be served with larger batches, significantly reducing inference costs.
They invented mechanisms like MLA, DSA, CSA, and HCA to reduce the demand for KV Cache and keep the computational requirements, which grow with context length, as close to constant as possible.
They invented Engram, trading memory for computational efficiency.
They also invented mHC to enable stable training as model scale increases. And there are many more examples like these.
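To make one item from the list above concrete: the core idea usually associated with GRPO is to score each sampled answer against the other answers drawn for the same prompt, rather than against a separately trained value model as in PPO. The snippet below is a minimal sketch of that group-relative advantage only, assuming binary "verified" rewards (1 for a checked-correct answer, 0 otherwise). The function name and the `eps` stabiliser are illustrative choices, not DeepSeek's code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumption) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four answers sampled for one prompt; 1.0 means the verifier accepted the answer.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get positive advantages, wrong ones negative
```

Because the baseline is just the group mean, no critic network has to be trained or stored, which is where the cost saving over PPO comes from.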
In the "Hero's Journey," the most universal narrative structure, the hero never knows exactly where the journey will lead from the start. He gradually learns his true great purpose along the way and accomplishes it despite numerous obstacles. He encounters many skeptics but chooses to ignore them. He also faces many malicious actors. He has obvious flaws or weaknesses but eventually overcomes them to fulfill his mission. He confronts seemingly insurmountable challenges, finds ways to form alliances, and learns to use limited and precious resources wisely. This is what makes the audience root for the hero. This is also why DeepSeek has gained followers, global respect, and opponents.
As I will detail below, DeepSeek has been on this path for a long time and has gradually discovered its ultimate destiny: its goal is not to sell coding subscriptions but to propel a $10 trillion Chinese AI hardware ecosystem and achieve a $1 trillion valuation itself. In this process, it will also create opportunities for many new entrants in the Western hardware ecosystem.

Starting with Some Fun KV Cache Calculations
Take a look at this very timely recent tweet from @SemiAnalysis_:

DeepSeek has already solved this problem better than anyone!
Let's do some interesting KV Cache calculations. Don't worry if you don't like math. We'll use the recently released KV Cache Calculator to see how much KV Cache savings DeepSeek V4 Pro offers compared to the latest GLM and Qwen models.
For our calculations, we'll use a context length of 1 million tokens, assuming a KV precision of 8 bits and an indexer precision of 16 bits. Feel free to open the calculator and try it yourself: https://kvcache.ai/tools/kv-cache-calculator/

You can try the calculator yourself!
At a context length of 1 million tokens:
·DeepSeek V4 needs only 5.48GB of HBM;
·GLM-5 needs 60GB of HBM;
·Qwen3-235B-A22B needs a hefty 89GB of HBM.
Note:
·DeepSeek V4 Pro is a 1.6-trillion-parameter model;
·GLM-5 has approximately 700 billion parameters and already uses DeepSeek's MLA and DSA, though not the latest compressed attention mechanisms;
·Qwen3-235B-A22B has approximately 235 billion parameters and uses GQA attention.
DeepSeek has made fundamental contributions to alleviating memory pressure. If such innovations are widely adopted, they will significantly reduce the cost of long-duration agents and unlock the next wave of new applications.
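As a rough sanity check on these figures: for a GQA model, the cache holds one key vector and one value vector per KV head, per layer, per token. The sketch below applies that formula to Qwen3-235B-A22B. The layer count, KV head count, and head dimension are assumptions taken from the publicly posted model configuration, not from this article, and `gqa_kv_cache_bytes` is an illustrative helper name. The MLA, DSA, and compressed-attention models do not follow this simple formula, so their sizes are taken as quoted.

```python
def gqa_kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_value):
    """Bytes needed to cache keys and values for one sequence under GQA."""
    # Factor 2: one key vector and one value vector per KV head, per layer, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens

GIB = 2**30

# Assumed Qwen3-235B-A22B attention shape (not stated in the article).
qwen_bytes = gqa_kv_cache_bytes(
    num_layers=94, num_kv_heads=4, head_dim=128,
    context_tokens=1_000_000, bytes_per_value=1,  # 8-bit KV precision
)
qwen_gib = qwen_bytes / GIB  # about 89.6 GiB under these assumptions

# Sizes quoted in the article, in GB.
quoted = {"DeepSeek V4": 5.48, "GLM-5": 60.0, "Qwen3-235B-A22B": 89.0}
ratios = {name: size / quoted["DeepSeek V4"] for name, size in quoted.items()}

print(f"Qwen3-235B-A22B estimate: {qwen_gib:.1f} GiB")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the DeepSeek V4 footprint")
```

Under these assumptions the GQA estimate lands close to the quoted 89GB, and the quoted sizes put GLM-5 at roughly 11 times and Qwen3-235B-A22B at roughly 16 times the DeepSeek V4 footprint.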

KV Cache footprint comparison for a 1 million token context and model scale
The Methodology Behind the "Madness"
A KV Cache this small, achieved without sacrificing model quality, is precisely why DeepSeek can offer long-term caching at such low prices (less than 3% of Sonnet 4.6's cache-hit price) while keeping the cache available for hours.
For long-duration tasks, a smaller KV Cache means it can be more economically offloaded to SSD and reloaded when needed. This reduces reliance on HBM. From the perspective of China's AI hardware industry, HBM is not only in tight supply but is also one of the most difficult types of memory to manufacture.
Furthermore, DeepSeek has developed techniques for faster KV Cache loading from SSD, as described in its Dual Path paper.
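To see why cache size matters so much for SSD offload, a back-of-the-envelope estimate helps: reload time is simply cache size divided by sustained read bandwidth. The sketch below uses the 1-million-token cache sizes quoted earlier in the article. The 7 GB/s bandwidth is a hypothetical figure for a single NVMe drive chosen for illustration, not a number from the article or the Dual Path paper.

```python
def reload_seconds(cache_gb, bandwidth_gb_per_s):
    """Time to stream a KV cache of cache_gb from storage at a sustained bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return cache_gb / bandwidth_gb_per_s

# Hypothetical sustained read bandwidth for one NVMe SSD; not from the article.
ASSUMED_SSD_GB_PER_S = 7.0

# Cache sizes at a 1-million-token context, as quoted in the article.
cache_gb = {"DeepSeek V4": 5.48, "GLM-5": 60.0, "Qwen3-235B-A22B": 89.0}
times = {name: reload_seconds(size, ASSUMED_SSD_GB_PER_S) for name, size in cache_gb.items()}

for name, seconds in times.items():
    print(f"{name}: {seconds:.2f} s to reload")
```

With this assumed bandwidth, the smallest cache reloads in under a second while the larger ones take several seconds or more, which is the practical difference between pausing and resuming an agent cheaply and holding its cache in HBM.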

DeepSeek V4 compresses the KV Cache so aggressively that this step might not even be necessary anymore.
So, who benefits most directly from KV Cache compression?
Who mass-produces SSDs? Remember, YMTC (Yangtze Memory Technologies Co.) is growing into a giant in the 3D NAND space. NAND can help DeepSeek avoid recomputing the KV Cache. In turn, DeepSeek creates a massive market for NAND and SSDs, benefiting not only YMTC but also other related manufacturers.

But it's not just about NAND and SSDs.
LPDDR memory also holds tremendous potential. It can serve as a location to store model weights, streaming these weights into HBM as needed, thereby alleviating pressure on HBM demand. The SGLang team published an excellent blog post introducing this concept. The diagram below illustrates how this solution works.
While DeepSeek didn't design specifically for this solution, its MoE architecture, which inherently has a large number of experts, together with its 4-bit weight quantization, makes the solution easier to implement.

This schematic shows how memory might be used and how model weights could be streamed from LPDDR to HBM. I highly recommend reading the SGLang blog post.
This innovation, combined with an extremely compact and lossless KV Cache, will significantly reduce the demand for HBM.
So, who produces LPDDR in China? The answer is CXMT (ChangXin Memory Technologies). They lag by only about half a generation in LPDDR speed and one generation in density—a gap that isn't very large.
In addition to abundant NAND, the Chinese AI ecosystem will soon have a plentiful supply of LPDDR. Can this alleviate computing power pressure? The answer is: yes. Keep reading.

Using Memory Intelligently Can Also Ease Pressure on GPU / ASIC
The case for storing the KV Cache on NAND is easy to understand: it allows the KV Cache to be retained for longer, reduces pressure on HBM, and avoids recomputing the KV Cache, thereby lessening the computational burden on GPUs and ASICs.
So, can LPDDR play a similar role? Besides being a storage location for streaming weights to HBM on-demand, can it further reduce computational pressure?
The answer is: yes.
LPDDR can be used to store a large volume of content known as Engram. In DeepSeek's Engram paper, they point out that MoE can expand model capacity through conditional computation, but the Transformer itself lacks a native "knowledge retrieval" mechanism. Therefore, Transformers often have to inefficiently simulate retrieval through computation.
To address this, DeepSeek proposed the Engram module. It modernizes the classic N-gram embedding into a hash-based O(1) lookup mechanism, creating a complementary sparse path they call "conditional memory."
This approach saves computation but requires memory to hold the embedding table, which can be very large.
Essentially, this is a classic "trading memory for compute" scheme. The key insight is that the "memory" side is much cheaper in terms of cost per bit read—a single LPDDR lookup is far less expensive than passing data through multiple Transformer layers for a forward pass. Therefore, at scale, this is a very favorable trade-off.
This is how DeepSeek sacrifices some memory in exchange for computational savings.
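The lookup idea can be sketched in a few lines. The toy class below hashes the trailing N tokens at each position into a slot of a fixed embedding table and reads that row directly, so each position costs one hash and one table read regardless of table size. It is a simplified illustration under stated assumptions, not the Engram module itself: the class name, the rolling hash, the table size, and the zero vector for positions without enough context are all hypothetical choices.

```python
import random

class HashedNgramMemory:
    """Toy hashed N-gram embedding table with constant-time lookup."""

    def __init__(self, num_slots, dim, n=2, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.num_slots = num_slots
        self.dim = dim
        # The table is the "memory" side of the trade: large, but read sparsely.
        self.table = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]

    def _slot(self, ngram):
        # Polynomial rolling hash over token ids (an illustrative stand-in).
        h = 0
        for tok in ngram:
            h = (h * 1_000_003 + tok + 1) % self.num_slots
        return h

    def lookup(self, token_ids):
        """Return one vector per position, keyed by the trailing n tokens."""
        out = []
        for i in range(len(token_ids)):
            if i + 1 < self.n:
                out.append([0.0] * self.dim)  # not enough left context yet
                continue
            ngram = token_ids[i + 1 - self.n : i + 1]
            out.append(self.table[self._slot(ngram)])  # one table read, no matmul
        return out

memory = HashedNgramMemory(num_slots=4096, dim=8, n=2)
tokens = [17, 42, 99, 17, 42]
vectors = memory.lookup(tokens)
print(vectors[1] == vectors[4])  # both positions end the same bigram (17, 42)
```

The same bigram always maps to the same row, so recurring phrases are retrieved by address rather than recomputed through the layers, at the price of holding the table in memory.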

A worthwhile trade-off
Because they lack comparable transistor density and access to EUV lithography, Chinese GPUs and ASICs are likely to lag behind their Western counterparts in raw FLOPs for the foreseeable future. They also trail significantly in advanced packaging. Such trade-offs are therefore very worthwhile, especially given China's ability to mass-produce NAND and LPDDR memory.
Reviewing DeepSeek's Long-Term Strategy
Considering these innovations, DeepSeek's goal doesn't appear to be making a few hundred million dollars in profits now. Many of its past choices illustrate this point: the absence of multimodal capabilities, no voice model, and certainly no video model yet.
What it is truly engaged in is a patient, potentially $10 trillion long-term game: fostering an alternative AI hardware ecosystem.
This is not just about making Chinese memory manufacturers key players in the global AI hardware market. It's about fundamentally reducing resource requirements to make AI model training and serving more cost-effective. This, in turn, allows many GPU, ASIC, and networking chip manufacturers to become viable options.
Simultaneously, these innovations will also benefit the Western open-source ecosystem and a new generation of hardware manufacturers.
All the signs are already there. Let's review in detail the innovations DeepSeek has proposed so far:
1. Mixture of Experts (MoE) and MLA introduced in DeepSeek V2
DeepSeek introduced MoE and MLA in V2. MoE reduces the computation needed to train highly intelligent models by about 40-50%; MLA reduces the KV Cache by 90%.
This makes offloading the KV Cache to SSDs quite efficient.
These ideas first appeared in DeepSeek's V2 paper released in May 2024. Later, they laid the foundation for training DeepSeek V3. At that time, DeepSeek trained a system approaching the performance level of closed-source models using only 2048 crippled H800 GPUs.
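The MLA saving can be illustrated with simple per-token arithmetic. Standard multi-head attention caches a full key and value for every head, while MLA caches one shared compressed latent plus a small positional key. The dimensions below are assumptions based on the publicly posted DeepSeek V2 configuration, not figures from this article, and the helper names are illustrative. The resulting percentage is measured against a plain multi-head baseline of the same shape, so it will differ from a figure measured against another baseline such as a GQA model.

```python
def mha_cache_per_token(num_heads, head_dim):
    """Cached values per token per layer for standard multi-head attention."""
    # A full key and a full value vector are stored for every head.
    return 2 * num_heads * head_dim

def mla_cache_per_token(latent_dim, rope_dim):
    """Cached values per token per layer for MLA."""
    # One compressed latent shared by all heads, plus a small positional (RoPE) key.
    return latent_dim + rope_dim

# Assumed DeepSeek V2 attention dimensions (not stated in the article).
mha = mha_cache_per_token(num_heads=128, head_dim=128)
mla = mla_cache_per_token(latent_dim=512, rope_dim=64)
reduction = 1 - mla / mha

print(f"MHA: {mha} values, MLA: {mla} values, reduction: {reduction:.1%}")
```

Whatever the exact baseline, the structural point is the same: the cache no longer scales with the number of attention heads.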

2. DSA: Introduced in DeepSeek V3.2 Exp to reduce computational overhead in long-context scenarios while alleviating HBM bandwidth pressure.
The core function of DSA is to ensure that computation doesn't continuously grow with context length. As the chart below shows, as context length increases, the processing time for DeepSeek-V3.2 remains largely flat.
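The general mechanism behind this kind of sparse attention can be sketched as two steps: a cheap indexer scores every past position, then full attention runs over only the top-scoring positions. The single-query sketch below is a simplified illustration under stated assumptions, not DeepSeek's kernel: the function name, the tiny indexer vectors, and the `top_k` value are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sparse_attention(query, keys, values, index_query, index_keys, top_k):
    """Attend over only the top_k positions chosen by a cheap indexer."""
    # Step 1: cheap relevance scores from small indexer vectors.
    # This still touches every position, but with far less work per position.
    index_scores = [dot(index_query, ik) for ik in index_keys]
    k = min(top_k, len(keys))
    selected = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)[:k]

    # Step 2: softmax attention restricted to the selected positions,
    # so its cost is bounded by top_k rather than by context length.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, selected)) / total for d in range(dim)]
    return out, sorted(selected)

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [-1.0, 0.0], [0.0, -1.0]]
values = [[float(i)] for i in range(6)]
index_keys = [[0.9], [0.1], [0.8], [0.2], [-0.5], [-0.7]]

out, picked = sparse_attention([1.0, 0.0], keys, values, [1.0], index_keys, top_k=2)
print(picked, out)  # positions 0 and 2 are selected and weighted equally
```

The expensive attention step stays the same size however long the context grows, which is consistent with the flat processing-time curve described above.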

3. mHC: DeepSeek proposed mHC (Manifold-Constrained Hyper-Connections) in a paper published in December 2025.
mHC is an innovation at the macro-architecture level from DeepSeek, redesigning how information flows between Transformer layers.
Traditionally, since ResNet, models have used standard residual connections, i.e., x + F(x). mHC expands the residual stream into multiple parallel information channels and allows the model to perform learnable mixing between these channels. The key is constraining the mixing matrix to be a doubly stochastic matrix, restricting it to the Birkhoff polytope via Sinkhorn-Knopp projection. This mathematically guarantees stable signal amplitude regardless of how deep the model is stacked.
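The doubly stochastic constraint can be illustrated with the classic Sinkhorn-Knopp iteration: exponentiate a matrix of free parameters, then alternately normalise rows and columns until both sum to one. The sketch below is a minimal illustration, assuming a 3-channel residual stream, arbitrary example logits, and a fixed iteration count; it is not the mHC implementation. It also checks the property that matters for depth: the product of two such mixing matrices is again doubly stochastic, so stacking layers does not inflate or shrink the total signal.

```python
import math

def sinkhorn_knopp(logits, num_iters=200):
    """Map a square matrix of logits to an approximately doubly stochastic matrix."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(num_iters):
        # Normalise rows, then columns; repeating this converges for positive matrices.
        m = [[x / sum(row) for x in row] for row in m]
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

# Arbitrary example logits for the mixing matrices of two consecutive layers.
mix_a = sinkhorn_knopp([[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [-2.0, 1.0, 0.3]])
mix_b = sinkhorn_knopp([[2.0, 0.1, -1.0], [0.0, 0.0, 3.0], [1.0, -1.0, 0.5]])
stacked = matmul(mix_a, mix_b)  # mixing applied across both layers

row_sums = [sum(row) for row in stacked]
col_sums = [sum(stacked[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)  # every entry is approximately 1.0
```

An unconstrained mixing matrix offers no such guarantee: its products can grow or vanish with depth, which is the instability the constraint is meant to remove.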
This solves the catastrophic instability problem faced by previous unconstrained Hyper-Connections. Hyper-Connections, initially proposed by ByteDance, suffered from signal


