The cheaper AI becomes, the more expensive chips get
- Core thesis: The continuous and significant decline in AI model inference costs (dropping approximately 1,000-fold over three years) has not led to a decrease in demand for computing power. On the contrary, the surge in application scenarios and deep usage has triggered exponential growth in token consumption (doubling every two months), causing total spending to explode upwards. This has ultimately intensified the supply-demand imbalance and price increases for physical infrastructure like computing power and storage.
- Key elements:
- Claude Sonnet 5 is priced at only 40%-60% of the flagship Opus 4.8, yet its performance reaches over 90% of it, reflecting the continuous improvement in cost-efficiency of AI models.
- After model prices dropped, total enterprise AI spending grew rather than shrank: global enterprise generative AI spending was $11.5 billion in 2024 and $37 billion in 2025, roughly 3.2 times the prior year's level (an increase of about 220%).
- Demand for computing power has passed through to the hardware market: spot prices of DRAM and NAND Flash have gained more than 300% since Q3 2025, and memory chip prices have risen sixfold within a year.
- Goldman Sachs forecasts cumulative global AI infrastructure capital expenditure to be approximately $7.6 trillion from 2026 to 2031. Based on a baseline GPU price of $80,500, NVIDIA accounts for 75% of total computing power spending.
- Application proliferation, deeper usage of each application, and greater model complexity have together pushed a typical application from hundreds of interactions per day in 2023 to tens of thousands in 2025, with each interaction triggering multiple follow-up inferences and consuming more tokens.
- Jevons paradox is being re-enacted in the AI field: just as Watt's improvement of the steam engine reduced coal consumption but ultimately increased total coal usage, improvements in AI inference efficiency similarly stimulate even greater demand for computing power.
Original source: Wall Street CN
On June 30, Anthropic released Claude Sonnet 5.
This is a mid-tier model, the "most capable workhorse" in the Sonnet series. It scored 63.2 on the agent capability test SWE-bench Pro — just 6 points shy of the flagship Opus 4.8's 69.2. On another dimension, the graduate-level reasoning test GPQA-AAA v2, Sonnet 5 actually edged out Opus 4.8.
The pricing matters even more. During the promotional period, Sonnet 5 charges $2 per million input tokens and $10 per million output tokens, against $5 and $25 for Opus 4.8. That is 40% of the flagship's price on both input and output (the low end of the 40% to 60% range cited for the model) for over 90% of its SWE-bench Pro score.
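The ratios behind that comparison can be checked directly. The sketch below uses only the prices and SWE-bench Pro scores quoted above; the variable names are illustrative.

```python
# Prices in USD per million tokens, as quoted in the article.
SONNET_5 = {"input": 2.0, "output": 10.0}   # promotional pricing
OPUS_4_8 = {"input": 5.0, "output": 25.0}

# SWE-bench Pro scores quoted in the article.
SONNET_SCORE = 63.2
OPUS_SCORE = 69.2


def ratio(a: float, b: float) -> float:
    """Return a as a fraction of b."""
    return a / b


price_ratios = {kind: ratio(SONNET_5[kind], OPUS_4_8[kind]) for kind in SONNET_5}
score_ratio = ratio(SONNET_SCORE, OPUS_SCORE)

print(price_ratios)            # input and output both come to 0.4
print(round(score_ratio, 3))   # 0.913
```

On this single benchmark, then, about 91% of the flagship's score is available at 40% of its promotional-period price.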
This news can be interpreted in two ways.
First: AI is getting cheaper again. Lower costs benefit everyone, the chatbot war continues, and model makers compete fiercely.
Second — which is what the market is pricing in — the cheaper the models, the more expensive computing power and storage become.
On the day Claude Sonnet 5 was released, the U.S. semiconductor index rose nearly 4%. In the AI narrative over the past three years, there has been a clear thread: inference efficiency would kill chip demand. But this judgment has been wrong at every data point.
Price Drop: 1,000x in Three Years
Let's look at the price drop trajectory first.
In 2022, the cost of calling a GPT-4 level API was about $0.03 per thousand tokens. By 2025, according to the Stanford AI Index Report, the price for a model of equivalent performance had fallen about 280 times. Adding the effects of open source and efficiency improvements, the reduction commonly cited across the industry is about 1,000 times.
It's not just one type of model; every company is cutting prices.
Anthropic's Sonnet 5 targets the capability density of Opus 4.8 at only 40% to 60% of the price. Google's Gemini Omni Flash charges $0.10 per second for video generation, and the Nano Banana 2 Lite image model generates an image in 4 seconds, costing only $0.034 per thousand images — half the price of its predecessor. DeepSeek-V4-Pro has pushed the cost of a million input tokens down to the $0.035 level.
Price reductions aren't just happening on the pricing sheet.
On June 24, The Information reported that OpenAI had found a pure software optimization technique internally — cutting the GPU requirement for a certain computational step by more than half, slashing the dedicated GPU pool from thousands to just a few hundred. The same month, Meta proposed the Vistara solution: reconnecting DDR4 memory from decommissioned servers via a proprietary CXL chip, pairing it with DDR5 in a 3:1 ratio, and reducing inference server costs by 25%.
By June 30, StepFun had open-sourced the speculative decoding technology JetSpec — which can boost the inference speed of large models by nearly 10 times. In other words, for the same token output, the number of GPUs required can drop by an order of magnitude.
If AI followed a conventional cost-demand relationship, with demand roughly fixed, these signals would all point to one thing: fewer chips will be needed in the future.
Wall Street feared this.
On the weekend when DeepSeek released R1 in January, AI infrastructure stocks experienced their most severe sell-off in recent years. Shares of AI cloud company Nebius plummeted 40%. The narrative was simple: Chinese open-source models sell tokens for $0.1, while American companies charge $2; demand for computing power must collapse.
Explosion: Total Spending More Than Tripled
But what actually happened was the complete opposite.
Nebius co-founder Roman Chernin later recalled that the week of panic triggered by DeepSeek "might have been our best week of sales." When corporate procurement departments saw the sharp drop in costs, their first reaction was not to cut budgets but to conclude that they could finally afford to run inference at scale.
In 2024, global enterprise spending on generative AI was about $11.5 billion. In 2025, that number jumped to $37 billion, roughly 3.2 times the prior year's level (an increase of about 220%). According to a Menlo Ventures enterprise survey, the median company was running "dozens" of AI applications in 2025, compared to 1 or 2 in 2023.
Data from various dimensions all follow the same curve:
Uber had already exhausted its full-year AI budget by April 2026. AT&T currently processes 27 billion tokens per day — that number was 800 million 18 months ago. One large U.S. health insurance company saw its monthly token consumption surge from 3 million to over 150 million.
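To put the AT&T figures on a common scale, the growth can be converted into an implied doubling time. This is a rough sketch that assumes smooth exponential growth between the two quoted data points; the function name is illustrative.

```python
import math


def doubling_time_months(start: float, end: float, months: float) -> float:
    """Months per doubling, assuming smooth exponential growth from start to end."""
    doublings = math.log2(end / start)
    return months / doublings


# AT&T figures quoted above: 800 million -> 27 billion tokens per day in 18 months.
att_doubling = doubling_time_months(800e6, 27e9, 18)

print(f"growth factor: {27e9 / 800e6:.2f}x")
print(f"implied doubling time: {att_doubling:.1f} months")
```

Under that assumption the figures imply roughly a 34-fold increase, or a doubling about every three and a half months. The health insurer's 50-fold increase cannot be converted the same way because no time span is given.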
Breaking it down, the growth comes from three sources that compound one another.
First is the diffusion of applications. Within a single enterprise, the marketing department uses 3 AI tools, sales 4, and customer service 2, plus legal, HR, and finance. Going from 2 applications to dozens is an order-of-magnitude leap.
Second is the depth of individual applications. Take customer service AI as an example: In 2023, daily interactions were about 500, each using about 800 tokens, and the conversation ended there. By 2025, daily interactions are 15,000, each using about 4,500 tokens, and each interaction triggers 3 to 5 subsequent inference tasks — sentiment analysis, escalation prediction, quality scoring — all stacking on the same entry point.
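The customer service example works out as follows. The sketch counts only the tokens of the primary interaction, since the article gives no token counts for the follow-up tasks; for those it counts inference calls instead.

```python
def daily_tokens(interactions: int, tokens_each: int) -> int:
    """Tokens consumed per day by the primary interactions alone."""
    return interactions * tokens_each


tokens_2023 = daily_tokens(500, 800)         # 400,000 tokens per day
tokens_2025 = daily_tokens(15_000, 4_500)    # 67,500,000 tokens per day
token_growth = tokens_2025 / tokens_2023     # 168.75x

# Each 2025 interaction also triggers 3 to 5 follow-up inference tasks.
calls_low = 15_000 * (1 + 3)    # 60,000 inference calls per day
calls_high = 15_000 * (1 + 5)   # 90,000 inference calls per day

print(f"primary tokens: {tokens_2023:,} -> {tokens_2025:,} ({token_growth:.2f}x)")
print(f"inference calls per day in 2025: {calls_low:,} to {calls_high:,}")
```

A single application therefore grows by roughly 170 times in primary token volume before any of the follow-up tasks are counted.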
Third is the increase in model complexity itself. When a single-turn model with 7B parameters is replaced by a multi-step reasoning agent with 70B+ parameters, each round of internal reasoning consumes tens to hundreds of times the tokens of a simple linear interaction.
In other words, the cost per token dropped to one-thousandth, while the number of tokens the market used increased tens of thousands of times. The net effect of this multiplication points in only one direction: an explosion in spending.
Token consumption doubles every two months — multiple independent clues converge on the same number. Extrapolating this exponential curve to 2027, annual enterprise AI spending exceeding $100 billion is an arithmetic problem, not a prediction.
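The arithmetic here is spending = price × volume. The sketch below is only an illustration: the 20,000-fold volume figure is an assumed stand-in for "tens of thousands of times", not a number from the article.

```python
def spending_multiplier(price_drop: float, volume_growth: float) -> float:
    """Spending = price x volume, so the multiplier is volume growth over price drop."""
    return volume_growth / price_drop


def annual_growth_from_doubling(months_per_doubling: float) -> float:
    """Growth factor over 12 months for a quantity that doubles every N months."""
    return 2 ** (12 / months_per_doubling)


# Price per token falls to one-thousandth; volume rises 20,000x (assumed value).
print(spending_multiplier(1_000, 20_000))   # 20.0

# Token consumption doubling every two months.
print(annual_growth_from_doubling(2))       # 64.0
```

Doubling every two months compounds to a 64-fold increase in token volume per year, which is why even steep price cuts do not pull total spending down.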
Transmission: Storage Up 6x, Chip Infrastructure Heading to $7.6 Trillion
The demand stimulated by price drops did not stop at the software layer.
The increase in memory prices is the most direct signal of AI demand transmitting from the model layer to the hardware layer.
Starting from Q3 2025, the spot prices of DRAM and NAND Flash have both risen by over 300%. DDR5 modules saw a price increase that once exceeded 90% in a single month. Entering 2026, the price hikes not only didn't stop but accelerated.
The contract price increase for DRAM in Q1 was revised upwards from an expected 55%-60% to 90%-95%; NAND was revised from 33%-38% to 55%-60%. TrendForce's forecast for Q2 predicts DRAM will rise another 58%-63%, and NAND another 70%-75%.
Using a consumer product as a benchmark: the Acer Predator 32G DDR5 6000 dual-channel kit was priced at around 1,300 yuan at the end of October 2025, but by January 2026, it had skyrocketed to 2,700 yuan. Doubling in three months is extremely rare in the consumer goods market.
Samsung's memory business recorded an all-time high quarterly operating profit in Q4 2025 — exceeding 20 trillion Korean won, approximately 96.2 billion yuan. The most fundamental driver of this year-long rally is not consumer-level upgrades from phones or PCs, but the massive procurement of HBM, enterprise-grade SSDs, and high-density DRAM by AI data centers.
A Goldman Sachs report from May took this calculation the furthest.
The report forecasts cumulative global AI infrastructure capital expenditure from 2026 to 2031 at approximately $7.6 trillion. That's $765 billion in 2026 alone, climbing to $1.6 trillion by 2031. Based on a single benchmark GPU (NVIDIA VR200 Rubin) costing $80,500, NVIDIA accounts for 75% of total computing power spending in each period.
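As a rough check on the scale of these figures, the two annual endpoints imply a compound annual growth rate. The sketch assumes a constant-growth path between 2026 and 2031, which is a simplification and not something the report is quoted as stating.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1


CAPEX_2026 = 765.0     # USD billions, as quoted
CAPEX_2031 = 1_600.0   # USD billions, as quoted
YEARS = 2031 - 2026    # five annual steps

growth = cagr(CAPEX_2026, CAPEX_2031, YEARS)

# Six annual figures (2026 through 2031) on a hypothetical constant-growth path.
path = [CAPEX_2026 * (1 + growth) ** k for k in range(YEARS + 1)]

print(f"implied CAGR: {growth:.1%}")
print(f"constant-growth cumulative: ${sum(path) / 1000:.1f} trillion")
```

The implied growth rate comes out near 16% a year. A constant-growth path sums to somewhat less than the quoted $7.6 trillion, which suggests the forecast's year-by-year path is not a simple constant-growth curve.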
Goldman Sachs also asked a key question in the report: If ASICs (application-specific integrated circuits) massively replace GPUs, could it reduce total demand?
The answer depends on the situation. If demand is inelastic — meaning an enterprise's AI computing needs are fixed — ASIC substitution can directly reduce total capital requirements. But if demand is elastic — the cheaper the computing power, the more you buy — a change in chip mix primarily reshapes the distribution of profits among different suppliers, not the total spending scale.
Goldman Sachs's base case scenario chose the latter.
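The distinction between the two cases can be illustrated with a toy constant-elasticity demand curve. This is a textbook simplification chosen for illustration, not the model used in the Goldman Sachs report, and the elasticity values are assumptions.

```python
def spending_after_price_change(price_factor: float, elasticity: float) -> float:
    """Relative spending after the unit price is multiplied by price_factor.

    Constant-elasticity demand: quantity ~ price ** (-elasticity),
    so spending = price * quantity ~ price ** (1 - elasticity).
    """
    return price_factor ** (1 - elasticity)


# Halve the unit price of compute under three assumed elasticities.
for elasticity in (0.0, 1.0, 1.5):
    print(elasticity, round(spending_after_price_change(0.5, elasticity), 3))
# 0.0 0.5     fixed demand: spending halves
# 1.0 1.0     unit elasticity: spending unchanged
# 1.5 1.414   elastic demand: spending rises
```

With fixed demand, cheaper chips cut the total bill in proportion. Once elasticity exceeds 1, the same price cut raises total spending, which is the case the base scenario assumes.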
U.S. stock prices are also moving in the same direction. SanDisk has risen 857% since the beginning of the year; a Bernstein report on June 30 raised its target price to $3,000. AMD hit an all-time high, rising 7% in a single day. Companies making GPUs, storage, packaging, and data center equipment — all are near new highs.
The most striking figure cited in a June 11 review article by Edgen.tech is this: Memory chip prices have risen sixfold in the past year.
The label "cyclical recovery" doesn't fit. A sixfold rise points to a demand-driven re-pricing of AI's physical infrastructure, with the entire economic system behind it.
Root Cause: Jevons Already Answered This in 1865
William Stanley Jevons wrote a book called *The Coal Question* in 1865.
His core observation was that after Watt improved the steam engine, coal consumption per unit dropped significantly, but Britain's total coal consumption increased rather than decreased. Because the efficiency gain made steam power affordable for more industries — textiles, railways, mining, shipping — each new scenario created coal demand that didn't exist before.
160 years later, the same formula is playing out with AI computing power.
Companies have done the math. At 2022 token prices, running real-time inference for customer service conversations was economically unfeasible. Non-urgent scenarios weren't worth running AI on. Personalized content generation could only be done at the segment level, not the user level. By 2025, with prices having dropped 1,000 times, this "previously non-existent demand" has all become essential.
Nebius's Chernin gave the most direct summary: "Every time we make the same unit of intelligence cheaper, we are not reducing consumption; we are increasing it — because the same budget can solve more complex tasks."
The market has overlooked another structural driver: the positive feedback loop of gross margin.
The gross margin curve for AI inference has no historical precedent. An API company might start with a gross margin of only 10%, because both model training and inference are expensive. But software optimizations (operator fusion, quantization, speculative decoding) reduce inference costs every month, while price adjustments always lag. So the climb in gross margin from 10% to 90% takes less time than in any traditional industry.
Gross margin drives profit, profit drives more procurement, procurement drives down costs — a positive feedback loop with no ceiling.
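The margin dynamic can be sketched with a simple model in which the selling price stays fixed while unit cost falls by a constant fraction each month. The 10% monthly cost decline is purely an assumed value for illustration; the article gives no rate.

```python
import math


def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of price."""
    return (price - cost) / price


def months_to_target(start_margin: float, target_margin: float,
                     monthly_cost_decline: float) -> int:
    """Months until margin reaches the target, with price fixed and
    cost shrinking by a constant fraction every month."""
    start_cost = 1 - start_margin     # cost as a fraction of price
    target_cost = 1 - target_margin
    months = math.log(target_cost / start_cost) / math.log(1 - monthly_cost_decline)
    return math.ceil(months)


# Assumed: cost falls 10% per month while the price does not move.
months = months_to_target(0.10, 0.90, 0.10)
final_margin = gross_margin(1.0, 0.9 * (1 - 0.10) ** months)

print(months)                    # 21
print(round(final_margin, 3))
```

Moving from a 10% to a 90% margin requires unit cost to fall ninefold. Under the assumed rate that takes under two years, and a faster rate of optimization would shorten it further.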
"If you have DRAM, you can sell tokens; if you don't have DRAM, you can't sell tokens." This phrase is becoming the fundamental equation of AI chip demand.
Two sensitivity assumptions in the Goldman Sachs report reinforce the same judgment. If the economic life of a chip shrinks from 5 years to 3 years, the replacement cycle accelerates and cumulative capital requirements step up. If memory per chip is 25% higher than expected, the main effect is on how spending is allocated within the chip stack, with a limited net impact on the $7.6 trillion total. In both cases the direction is the same: spending won't decrease.
The Endgame: Who Holds the Computing Power
The lifting of the Fable 5 export controls — banned on June 12, lifted on June 30, just three weeks in between — provides an unexpected footnote to this paradox.
The reason for the control was "national security risk." Lifting the control has nothing to do with the risk disappearing — alternatives appeared. During the period of the control, Asian teams like Tulongfeng released models close to the Mythos level, quickly nullifying the deterrence of the blockade. The lifting was a matter of reality, not goodwill.
This episode perfectly fits the main thread of the AI cost paradox: models are fungible. From GPT to Claude to DeepSeek to open-source models, no one can monopolize AI capability itself — if someone sets up a barrier, someone will find a detour.
Hardware doesn't follow this logic.
Not GPUs. Not DRAM. Fab construction cycles are measured in years. The production capacity of lithography machines is fixed. The supply elasticity of high-purity silicon is nearly zero. These are laws of physics, not business strategies. Software optimization can reduce model costs by a factor of a thousand, but it can't shorten the construction cycle of a single fab by a single day.
If this paradox continues to play out, the endpoint of AI model price cuts does not point to de-computation — it points to the re-concentration of pricing power for computation. No matter whose model you use, tokens have to run on someone's chips. Every cent that model makers compete away on price ends up as revenue on the ledgers of data centers, fabs, and memory production lines. The more aggressive the cost reduction, the more irreversible this transfer becomes.


