The cheaper AI gets, the more expensive chips become

Odaily Senior Author

2026-07-01 10:00

A prevailing narrative in AI held that improved inference efficiency would kill demand for chips. The reality is the opposite: the cheaper AI becomes, the more valuable physical compute gets. Every price cut that model developers make to compete ultimately flows into the pockets of chipmakers and foundries.

AI Summary

Core Thesis: The sustained dramatic decline in AI model inference costs (a roughly 1,000-fold reduction over three years) has not led to a decrease in compute demand. Instead, it has triggered an exponential surge in token consumption (doubling every two months) due to a proliferation of application scenarios and deeper usage. Total expenditures have paradoxically skyrocketed, ultimately exacerbating the supply-demand imbalance and price increases for physical infrastructure like compute and storage.
Key Elements:
1. Claude Sonnet 5 is priced at only 40%-60% of the flagship Opus 4.8, yet its performance reaches over 90% of the flagship's, demonstrating the continuous improvement in AI model cost efficiency.
2. After model price reductions, total enterprise AI spending bucked the trend and grew: global enterprise generative AI spending was $11.5 billion in 2024 and $37 billion in 2025, about 3.2 times the 2024 level, or an increase of roughly 220%.
3. The demand for compute power has transmitted to the hardware market. Spot prices for DRAM and NAND Flash have accumulated gains of over 300% since Q3 2025, with memory chip prices rising six-fold within a year.
4. Goldman Sachs predicts cumulative global AI infrastructure capital expenditure will be approximately $7.6 trillion between 2026 and 2031. Based on a benchmark GPU price of $80,500, NVIDIA accounts for 75% of total compute spending.
5. The combined effect of application proliferation, increased depth of individual application usage, and rising model complexity has driven usage of a typical application from hundreds of interactions per day in 2023 to tens of thousands in 2025, with each interaction consuming more tokens and triggering multiple subsequent inferences.
6. Jevons' paradox is replaying in the AI field: Watt's improved steam engine reduced the coal needed per unit of work but increased total coal use. Similarly, improved AI inference efficiency stimulates even greater demand for compute power.

Original source: Wall Street CN

On June 30, Anthropic released Claude Sonnet 5.

This is a mid-tier model, the "most capable worker" in the Sonnet series. On the agentic capability benchmark SWE-bench Pro, it scored 63.2% — just 6 points shy of the flagship Opus 4.8's 69.2%. On another dimension, the graduate-level reasoning test GPQA-AAA v2, Sonnet 5 actually outperformed Opus 4.8.

Pricing is even more critical. During the promotional period, it charges $2 per million input tokens and $10 for output. Opus 4.8's corresponding prices are $5 and $25 — Sonnet 5, at 40% to 60% of the price, delivers over 90% of the flagship's capability.
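The ratios in this paragraph can be checked directly. The following is a minimal sketch that uses only the prices and benchmark scores quoted above; the variable and function names are illustrative and do not come from any vendor API. The promotional prices work out to 40% of the flagship on both input and output; the article does not quote the prices behind the 60% upper bound.

```python
# Prices in USD per million tokens, as quoted in the article.
SONNET_5_PROMO = {"input": 2.0, "output": 10.0}
OPUS_4_8 = {"input": 5.0, "output": 25.0}

# SWE-bench Pro scores quoted in the article, in percent.
SONNET_5_SCORE = 63.2
OPUS_4_8_SCORE = 69.2


def price_ratio(cheap: dict, flagship: dict) -> dict:
    """Return the cheaper model's price as a fraction of the flagship's, per token type."""
    return {kind: cheap[kind] / flagship[kind] for kind in cheap}


ratios = price_ratio(SONNET_5_PROMO, OPUS_4_8)
score_ratio = SONNET_5_SCORE / OPUS_4_8_SCORE

print(ratios)                 # {'input': 0.4, 'output': 0.4}
print(round(score_ratio, 3))  # 0.913
```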

This news can be interpreted in two ways.

The first: AI is getting cheaper again. Cost reductions benefit everyone, the Chatbot war continues, and model vendors are locked in fierce competition.

The second — and this is what the market is pricing in — as models get cheaper, compute and storage are actually getting more expensive.

On the day Claude Sonnet 5 was released, the US semiconductor index rose nearly 4%. In the AI narrative of the past three years, there has been a clear assumption: improved inference efficiency would kill demand for chips. But this judgment has been wrong at every data point.

Price Reduction: A Thousandfold Decrease in Three Years

Let's first look at the price reduction trend.

In 2022, the cost of calling a GPT-4-level API was about $0.03 per thousand tokens. By 2025, according to the Stanford AI Index Report, the price for models of equivalent performance had dropped roughly 280-fold. Factoring in the combined effects of open-source models and efficiency improvements, the decrease widely cited across the industry is about 1,000-fold.

It's not just one model; everyone is cutting prices.

Anthropic's Sonnet 5, which comes close to Opus 4.8 in capability, is priced at only 40% to 60% of the flagship. Google's Gemini Omni Flash charges $0.10 per second for video generation, while the Nano Banana 2 Lite image model generates an image in 4 seconds, costing just $0.034 per thousand images, half the price of its predecessor. DeepSeek-V4-Pro has pushed the cost of a million input tokens down to $0.035.

Price cuts aren't just happening on the pricing sheet.

On June 24, The Information reported that OpenAI had internally found a purely software optimization technique — reducing GPU requirements for a specific computing step by more than half, slashing the dedicated GPU pool from thousands to just a few hundred units. That same month, Meta proposed the Vistara solution: reconnecting DDR4 memory from decommissioned servers via its self-developed CXL chips, pairing them with DDR5 in a 3:1 ratio, reducing inference server costs by 25%.

By June 30, Stepfun open-sourced the speculative decoding technology JetSpec — boosting large model inference speed by nearly 10 times. In other words, for the same token output, the number of GPUs needed could drop by an order of magnitude.

If AI followed a traditional cost-demand function in which demand stays roughly fixed, these signals would all point to one thing: fewer chips will be needed in the future.

Wall Street was afraid of this.

Over the weekend when DeepSeek released R1 in January 2025, AI infrastructure stocks experienced their most severe sell-off in recent years. Shares of the AI cloud company Nebius plunged 40%. The narrative was simple: Chinese open-source models sell tokens for $0.1 while US companies charge $2, so compute demand was bound to collapse.

Explosion: Total Spending More Than Tripled Instead

But exactly the opposite happened.

Roman Chernin, co-founder of Nebius, later recalled: the week DeepSeek caused panic "might have been our best sales week ever." The purchasing departments of companies, upon seeing the sudden cost drop, didn't cut budgets; they finally felt they could run inference at scale.

In 2024, global enterprise generative AI spending was approximately $11.5 billion. In 2025, this figure jumped to $37 billion, about 3.2 times the prior year's level. According to a Menlo Ventures enterprise survey, the median company was running "dozens" of AI applications in 2025, compared to 1 to 2 in 2023.

Data across various dimensions all fall on the same curve:

Uber had already exhausted its entire 2026 AI budget by April 2026. AT&T currently processes 27 billion tokens daily — 18 months ago, that number was 800 million. A large US health insurance company saw its monthly token consumption jump from 3 million to over 150 million.

Breaking it down, the growth comes from three overlapping directions.

First, application proliferation. The marketing department in each company uses 3 AI tools, sales uses 4, customer service 2, plus legal, HR, finance — going from 2 tools to dozens is an order-of-magnitude leap.

Second, the depth of individual applications. Take customer service AI as an example: in 2023, there were about 500 daily interactions, each using about 800 tokens, and the work ended when the conversation did. By 2025, there are 15,000 daily interactions, each using about 4,500 tokens, and each interaction triggers 3 to 5 subsequent inference tasks (sentiment analysis, escalation prediction, quality scoring), all stacking on the same entry point.
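The customer service example can be turned into a rough daily token count. The sketch below uses the interaction counts and per-interaction token figures quoted above; the size of each follow-up inference task is not given in the article, so `FOLLOW_UP_TOKENS` is a purely hypothetical placeholder, as are the function and variable names.

```python
# Hypothetical assumption: tokens consumed by each follow-up inference task.
# The article gives the number of follow-up tasks (3 to 5) but not their size.
FOLLOW_UP_TOKENS = 1_000


def daily_tokens(interactions: int, tokens_each: int, follow_ups: int = 0) -> int:
    """Tokens per day: primary conversations plus any follow-up inference tasks."""
    primary = interactions * tokens_each
    secondary = interactions * follow_ups * FOLLOW_UP_TOKENS
    return primary + secondary


tokens_2023 = daily_tokens(500, 800)                       # no follow-up tasks
tokens_2025_primary = daily_tokens(15_000, 4_500)          # conversations only
tokens_2025_low = daily_tokens(15_000, 4_500, follow_ups=3)
tokens_2025_high = daily_tokens(15_000, 4_500, follow_ups=5)

print(tokens_2023)                        # 400000
print(tokens_2025_primary / tokens_2023)  # 168.75
print(tokens_2025_low / tokens_2023)      # 281.25
print(tokens_2025_high / tokens_2023)     # 356.25
```

Even before counting follow-up tasks, the quoted figures imply roughly 170 times more tokens per day from this single application; the follow-up multiplier depends entirely on the assumed task size.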

Third, the increasing complexity of the models themselves. After an upgrade from a 7B-parameter single-turn model to a 70B+ multi-step reasoning agent, the agent's internal reasoning consumes tens to hundreds of times more tokens than a simple linear interaction.

In other words, the cost per token dropped to one-thousandth, but the number of tokens consumed by the market increased tens of thousands of times. Multiply the two together and the net effect points in only one direction: a spending explosion.

Token consumption doubles every two months — multiple independent data points converge on the same number. Extrapolating this exponential curve to 2027, enterprise AI spending breaking $100 billion annually is an arithmetic problem, not a prediction.
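To make the extrapolation concrete, the sketch below computes the growth factor implied by a fixed doubling time, and the doubling time implied by two observations. The function names are illustrative; the only inputs are the two-month doubling figure and the AT&T numbers quoted earlier (800 million tokens per day growing to 27 billion over 18 months).

```python
import math


def growth_factor(months: float, doubling_months: float = 2.0) -> float:
    """Multiple by which consumption grows over `months` at a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


def implied_doubling_months(start: float, end: float, months: float) -> float:
    """Doubling time implied by growth from `start` to `end` over `months`."""
    return months / math.log2(end / start)


one_year = growth_factor(12)
att_doubling = implied_doubling_months(800e6, 27e9, 18)

print(one_year)                # 64.0
print(round(att_doubling, 1))  # 3.5
```

A two-month doubling time compounds to a 64-fold increase per year. The AT&T figures on their own imply a doubling time of roughly three and a half months, so any extrapolation is sensitive to which rate is assumed.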

Transmission: Storage Up Sixfold, Chip Infrastructure Points to $7.6 Trillion

The demand stimulated by price cuts hasn't stopped at the software layer.

The surge in memory prices is the most direct signal of AI demand transmitting from the model layer to the hardware layer.

Starting in Q3 2025, spot prices for DRAM and NAND Flash have cumulatively risen by over 300%. DDR5 modules saw a price spike exceeding 90% in a single month. Entering 2026, the price increases not only continued but accelerated.

In Q1, DRAM contract price increases were revised up from an expected 55%-60% to 90%-95%; NAND was revised up from 33%-38% to 55%-60%. For Q2, TrendForce predicted DRAM would rise another 58%-63% and NAND another 70%-75%.
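Quarterly increases compound rather than add. As a minimal sketch, assuming the Q2 forecasts land as predicted and that both quarters' percentages apply to the same contract price series, the cumulative two-quarter multiple can be computed from the ranges quoted above. The function name is illustrative.

```python
def compound(increases_pct: list) -> float:
    """Cumulative price multiple after applying successive percentage increases."""
    factor = 1.0
    for pct in increases_pct:
        factor *= 1.0 + pct / 100.0
    return factor


# Low and high ends of the quoted Q1 (revised) and Q2 (forecast) ranges.
dram_low, dram_high = compound([90, 58]), compound([95, 63])
nand_low, nand_high = compound([55, 70]), compound([60, 75])

print(round(dram_low, 2), round(dram_high, 2))  # 3.0 3.18
print(round(nand_low, 2), round(nand_high, 2))  # 2.63 2.8
```

Under those assumptions, DRAM contract prices would roughly triple over the first half of 2026, and NAND contract prices would rise to about 2.6 to 2.8 times their starting level.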

Using a consumer-grade product as a benchmark: the Acer Predator 32 GB DDR5-6000 dual-channel kit was priced at around 1,300 RMB at the end of October 2025, but had surged to 2,700 RMB by January 2026. Doubling in three months is extremely rare in the consumer goods market.

Samsung's memory business recorded its highest-ever quarterly operating profit in Q4 2025 — exceeding 20 trillion KRW, or about 96.2 billion RMB. The most fundamental driving force behind this year-long rally is not consumer-grade upgrades from phones or PCs, but the massive procurement of HBM, enterprise SSDs, and high-density DRAM by AI data centers.

A Goldman Sachs report published in May carried this calculation to its limit.

The report predicted cumulative global AI infrastructure capital expenditure from 2026 to 2031 to be approximately $7.6 trillion. This includes $765 billion in 2026 alone, rising to $1.6 trillion by 2031. Assuming a base unit price of $80,500 for a GPU (based on NVIDIA's VR200 Rubin), NVIDIA would account for 75% of total compute spending in each period.

Goldman Sachs also pursued a key question in the report: If ASICs (application-specific integrated circuits) significantly replace GPUs, could it reduce total demand?

The answer depends on the situation. If demand is inelastic — meaning enterprise AI compute needs are fixed — ASIC substitution can directly lower total capital requirements. But if demand is elastic — the cheaper the compute, the more is bought — a change in chip mix primarily reshapes profit distribution among different suppliers, not the total spending scale.
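The elastic versus inelastic distinction can be illustrated with a constant-elasticity demand curve, where quantity scales as price to the power of minus the elasticity, so spending (price times quantity) scales as price to the power of one minus the elasticity. This functional form is an assumption chosen for illustration; it is not the model used in the Goldman Sachs report, and the function name is hypothetical.

```python
def spending_after_price_change(base_spend: float, price_multiple: float,
                                elasticity: float) -> float:
    """Total spending after the unit price is multiplied by `price_multiple`.

    Assumes constant-elasticity demand: quantity ~ price ** (-elasticity),
    hence spending ~ price ** (1 - elasticity).
    """
    return base_spend * price_multiple ** (1.0 - elasticity)


# Cheaper chips halve the unit cost of compute (price_multiple = 0.5).
inelastic = spending_after_price_change(100.0, 0.5, elasticity=0.0)
unit_elastic = spending_after_price_change(100.0, 0.5, elasticity=1.0)
elastic = spending_after_price_change(100.0, 0.5, elasticity=2.0)

print(inelastic)     # 50.0  -> fixed demand: total spending falls
print(unit_elastic)  # 100.0 -> spending unchanged
print(elastic)       # 200.0 -> demand grows faster than price falls
```

Only when the elasticity exceeds 1 does cheaper compute raise total spending, which is the condition under which a shift in chip mix redistributes profit rather than shrinking the total.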

Goldman Sachs's base case scenario chose the latter.

Stock prices in the US are also moving in the same direction. SanDisk has risen 857% since the start of the year, and Bernstein raised its target price to $3,000 in a June 30 report. AMD jumped 7% in a single day to an all-time high. Those making GPUs, storage, packaging, and data center equipment — all are near new highs.

The most striking figure cited by Edgen.tech in its June 11 overview article: the price of memory chips has increased sixfold over the past year.

The label "cyclical recovery" doesn't fit. A sixfold increase signifies that demand from across the entire economy is repricing the physical infrastructure behind AI.

The Root Cause: Jevons Already Answered in 1865

William Stanley Jevons wrote a book in 1865 called "The Coal Question."

His core observation was that after Watt improved the steam engine, coal consumption per unit of work dropped significantly, but Britain's total coal consumption paradoxically rose. The efficiency gain made steam power affordable in more industries (textiles, railways, mining, shipping), and each new scenario created coal demand that hadn't existed before.

160 years later, the same formula is replaying in AI compute.

Companies did the math. At 2022 token prices, running real-time inference for customer service conversations was economically unviable. Non-critical scenarios weren't worth running AI on. Personalized content generation could only be targeted at small user segments, not individual users. By 2025, with prices down 1,000-fold, all this "previously non-existent demand" has become essential.

Nebius' Chernin provided the most direct summary: "Every time we make the same unit of intelligence cheaper, we aren't reducing consumption; we are increasing consumption — because the same budget can solve more complex tasks."

The market has overlooked another structural driver: the positive feedback loop of gross margins.

The gross margin curve for AI inference has no historical parallel. An API provider might start with a gross margin of only 10% — training and inference are expensive. But software optimizations (operator fusion, quantization, speculative decoding) compress inference costs every month, while pricing adjustments always lag. Consequently, the speed at which gross margins climb from 10% to 90% is faster than in any traditional industry.
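The mechanism described here, costs falling every month while prices are cut only occasionally, can be sketched with a toy simulation. Every parameter below (the 10% monthly cost decline, the 30% price cut every six months, the starting price and cost) is a hypothetical value chosen for illustration; only the 10% starting margin comes from the text.

```python
def margin_path(months: int, start_price: float = 1.0, start_cost: float = 0.9,
                monthly_cost_decline: float = 0.10, price_cut: float = 0.30,
                price_cut_every: int = 6) -> list:
    """Gross margin per month when cost falls monthly but price is cut only periodically.

    All default parameter values are hypothetical, for illustration only.
    """
    price, cost = start_price, start_cost
    margins = []
    for month in range(months + 1):
        if month > 0:
            cost *= 1.0 - monthly_cost_decline   # software optimizations, every month
            if month % price_cut_every == 0:
                price *= 1.0 - price_cut         # lagging price adjustment
        margins.append((price - cost) / price)
    return margins


margins = margin_path(24)
for month in (0, 6, 12, 24):
    print(month, round(margins[month], 2))
```

Under these assumed parameters the margin climbs from 10% to roughly 70% within two years, even though the list price falls by about three quarters over the same period.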

Gross margins drive profits, profits drive increased procurement, procurement reduces costs — a positive feedback loop with no ceiling.

"If you have DRAM, you can sell tokens; if you don't have DRAM, you can't." This statement is becoming the fundamental equation for AI chip demand.

Two sensitivity assumptions in the Goldman Sachs report reinforce the same conclusion. If the economic lifespan of a chip shrinks from 5 years to 3 years, the replacement cycle accelerates, pushing cumulative capital requirements significantly higher. If memory per chip is 25% higher than expected, the main effect is on how spending is allocated within the chip stack, with a limited net impact on the $7.6 trillion total. In both cases the direction is the same: total spending does not go down.

The Endgame: Who Holds the Compute

The lifting of the Fable 5 export controls (banned on June 12, lifted on June 30, less than three weeks later) provides an unexpected footnote to this paradox.

The rationale for the controls was "national security risks." They were lifted not because the risk disappeared but because alternatives appeared. Asian teams like Tulongfeng released models approaching the Mythos level during the control period, quickly neutralizing the deterrent effect of the blockade. The lifting was a matter of reality, not goodwill.

This episode perfectly fits the main narrative of the AI cost paradox: models are replaceable. From GPT to Claude to DeepSeek to open-source models, no one can monopolize AI capability itself — someone sets up a barrier, someone finds a way around.

Hardware doesn't follow this logic.

GPUs don't. DRAM doesn't. The construction cycle for fabs is measured in years. The production capacity of lithography machines is fixed. The supply elasticity of high-purity silicon is nearly zero. These are laws of physics, not business strategies. Software optimization can reduce model costs by a thousand times, but it can't shorten the construction cycle of a single fab by a single day.

If this paradox continues, the endpoint of AI model price reductions is not a world that needs less compute. It is the re-concentration of pricing power over compute. No matter whose model you use, tokens must run on someone's chips. Every cent that model vendors slash in price wars ultimately ends up as revenue in the ledgers of data centers, fabs, and memory production lines. The more aggressive the cost reduction, the more irreversible this transfer becomes.
