When Inference Becomes a Scarce Resource, Who Captures the Value?

Friends of Planet Jun (星球君的朋友们)

Odaily Senior Author

2026-06-09 03:38


The ultimate winner will not be the company with the most GPUs.

AI Summary


Core Thesis: The compute bottleneck in the AI industry has shifted from training to inference, and the market is repricing accordingly. Value will no longer solely accrue to companies that own the most GPUs, but will settle in the middleware layer that can aggregate, route, and optimize fragmented inference computing power, such as asset-light platforms like Hyperbolic.
Key Elements:
1. Inference Becomes the New Bottleneck: The market recognizes that inference is a recurring cost that scales with usage, not a one-time capital expenditure like training. J.P. Morgan estimates the inference market to be 10-50 times the size of the training market, and Anthropic has taken over an entire data center dedicated to inference.
2. Industry Giants Confirm the Shift: Nvidia has reorganized its financial reporting around "serving tokens," dividing inference into two fronts – cloud and edge computing – and has released chips with significantly improved inference performance. Cerebras' IPO was 20 times oversubscribed due to its chip architecture focused on inference acceleration.
3. The Answer to the "600 Billion Dollar Question": The ROI gap for AI investments raised by Sequoia will be filled by growing inference demand, not training. The normalized demand for inference will absorb the previous overbuilding of GPU capacity.
4. Hyperbolic's Value Proposition: As the only company spanning GPU leasing, deployment, and model APIs, Hyperbolic profits by aggregating multi-cloud computing resources and providing real-time pricing data. Its asset-light model allows it to capture spreads more effectively during periods of compute oversupply.
5. Slim Margins for Applications, Value in the Middleware: Inference applications like Venice are constrained by upstream compute costs, resulting in thin profit margins. Their economic model reveals that underlying compute power is the primary cost, reinforcing the value of the aggregation layer (like Hyperbolic) that controls compute routing and pricing.

Original Author: Frank Fu

Original Source: IOSG Ventures

The gap that David Cahn identified in 2023 was never filled on the training side. It was filled on the inference side, and the market has only started pricing it in over the past few weeks. Now that Nvidia is restructuring its financial reporting around "serving tokens" and Cerebras' IPO saw 20x oversubscription, the debate over the bottleneck is over. The real question has become: when inference becomes a scarce resource, where will value accrue in the compute stack?

Following the GPU: From the $200 Billion Problem to the $600 Billion Problem

In 2023, Sequoia's David Cahn posed the question hanging over the entire AI build-out: the "$200 Billion Problem." For every dollar spent on a GPU, approximately another dollar is spent on the data center that houses and powers it. Each year's GPU CapEx therefore implies that these chips must eventually generate roughly $200 billion in revenue to recoup that capital. Even with generous assumptions about AI revenue, he still found a gap of over $125 billion between "investment" and "what end customers actually pay." The concern was straightforward: GPUs were being overbuilt ahead of real demand.
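
The arithmetic behind that figure can be sketched as follows. Only the "one more dollar per GPU dollar" multiplier, the $200 billion result, and the $125 billion gap come from this article. The $50 billion GPU run-rate, the 50% end-user margin step, and the $75 billion revenue estimate are assumed inputs recalled from Cahn's original post, and should be checked against it before being relied on.

```python
# Sketch of the "$200 Billion Problem" arithmetic (all figures in $ billions).
# ASSUMPTIONS (not stated in this article): the GPU run-rate, the 50% margin
# step, and the AI revenue estimate below are illustrative inputs.

def required_revenue(gpu_run_rate: float,
                     datacenter_multiplier: float = 2.0,
                     end_user_margin: float = 0.5) -> float:
    """Revenue the chips must eventually generate to pay back the build-out."""
    # Each GPU dollar implies roughly one more dollar of data center spend.
    datacenter_spend = gpu_run_rate * datacenter_multiplier
    # The buyers of that compute also need to earn a margin on it.
    return datacenter_spend / (1.0 - end_user_margin)

gpu_run_rate_2023 = 50.0      # assumed annual GPU revenue run-rate
estimated_ai_revenue = 75.0   # assumed, generous end-customer revenue

needed = required_revenue(gpu_run_rate_2023)
gap = needed - estimated_ai_revenue
print(f"required: ${needed:.0f}B, gap: ${gap:.0f}B")
```

Under the same assumptions, tripling the run-rate to $150 billion gives $600 billion, which is the shape of the 2024 follow-up.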

A year later, the gap hasn't narrowed; it's widened. In Cahn's 2024 follow-up, as hyperscaler CapEx ballooned, he redefined it as the "$600 Billion Problem." The bearish logic converges into a familiar shape: overbuilding leads to oversupply, and oversupply burns capital.

Both articles were essentially asking the same thing: who fills this gap? The answer never appeared on the "training" side of the ledger. It appeared on the inference side, and the market has only started pricing this in over the last few weeks.

The Cerebras IPO and the Inference Squeeze

Cerebras went public on Thursday. The IPO was 20x oversubscribed, and the price nearly doubled even after a final price increase on Wednesday. The demand wasn't driven by bets on the "next Nvidia killer." It stemmed from something simpler: the market is beginning to realize that the real bottleneck in AI is inference, not training.

Cerebras' core competency is a chip architecture that makes inference extremely fast. Not training – inference. That's what got Wall Street excited. The inference market is recurring; it expands with usage. Every time Claude answers a question, every time an agent executes a task, it consumes compute. Training happens once; inference never stops.

J.P. Morgan estimates the inference market size to be 10 to 50 times that of training. When machines start executing tasks issued by other machines – i.e., agentic expansion – inference demand no longer scales with user count, but with compute itself.

Nvidia Redraws the Map: Inference Takes Center Stage

If Cerebras represents the market's awakening, then Nvidia's latest quarterly earnings are confirmation from the top of the chain. On the latest earnings call, Jensen Huang made the unspoken explicit: AI demand is growing parabolically. The reason is simple: agentic AI has arrived. Mainstream AI has transitioned from one-shot inference to logical reasoning, and now to agents that can call tools and orchestrate tasks themselves. Huang stated, "Tokens are now profitable." In the AI era, compute is revenue and profit.

This reshapes the entire industry. Training is a one-time cost to build a model; inference is the recurring cost of running it. And today's bottleneck is in inference, not training.

Nvidia wrote this judgment into its own reporting structure. It now reports under two platforms instead of one: Data Center and Edge Computing. Data Center (~$75 billion for the quarter, up 92% YoY) is further broken down into Hyperscale (~$38 billion, up 12% QoQ) and ACIE (AI Cloud, Industrial & Enterprise, ~$37 billion, up 31% QoQ). The entirely new line is Edge Computing: $6.4 billion, up 29% YoY, covering the endpoints where agentic AI and physical AI actually run – PCs, workstations, AI-RAN base stations, robots, and cars.

Edge Computing currently represents less than 8% of total revenue, but Nvidia has elevated it to a "second platform" alongside Data Center. The signal is that inference is splitting into two fronts: cloud inference within data centers and endpoint inference at the edge. AI needs to see, move, and act in the physical world. The roadmap follows the same logic: Vera Rubin, which begins shipping in Q3, offers up to 35x the inference throughput of Blackwell. Huang also cited a new $200 billion TAM for the Vera CPU, which is designed for agentic workloads. Every frontier model company is expected to fully pivot to it on day one.
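
The segment figures quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes total revenue is simply the sum of the two reported platforms; the variable names are illustrative only.

```python
# Figures quoted in this article, in $ billions for the quarter.
data_center = {"Hyperscale": 38.0, "ACIE": 37.0}
edge_computing = 6.4

# The two Data Center sub-segments should add up to the ~$75B headline.
data_center_total = sum(data_center.values())

# ASSUMPTION: total revenue is just the two reported platforms combined.
total_revenue = data_center_total + edge_computing
edge_share = edge_computing / total_revenue

print(f"Data Center: ${data_center_total:.1f}B")
print(f"Edge share of total: {edge_share:.1%}")  # prints 7.9%
```

On these numbers the edge share comes out just under the 8% mark cited in the text.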

When the most valuable company on earth restructures its financial disclosures around "serving tokens," the debate over the bottleneck is settled. The remainder of this article discusses who captures the value when inference (rather than training) becomes the scarce resource.

A scope clarification first. Between these two fronts, this article discusses cloud inference – rented data center GPUs that provide API token services externally. Endpoint inference runs on local chips *within* the device itself (Nvidia's Jetson, RTX, Drive, AI-RAN) and bypasses the underlying GPU rental and aggregation stack entirely. Here, consider it a tailwind that amplifies the entire inference economy and corroborates the bottleneck thesis, rather than the market where Hyperbolic and Venice operate – both are firmly on the cloud line.

The Squeeze Has Arrived

Anthropic is the canary in the coal mine. Usage far exceeds pre-provisioned capacity. Complaints about Claude being "lobotomized" flood the internet – rate-limited responses, slower reasoning, compressed context windows. The solution is raw compute: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX – 220,000+ Nvidia GPUs, 300+ megawatts – and dedicated it entirely to inference, not training.

This capacity unlocked a wave of limit adjustments, each one a signal. On May 6th, Anthropic doubled the five-hour limit for Claude Code, eliminated peak-hour rate limits, and significantly raised Opus's API rate limits. On May 13th, it raised the weekly limit for Claude Code by another 50% (until July 13th). Then, starting June 15th, it did the opposite of "generous": it carved out agentic and programmatic usage (Agent SDK, headless mode claude -p, CI pipelines) from the flat subscription, placing it into a separate metered credit pool ($20-$200/month, billed at API rates). This last step encapsulates the entire argument in one move: agents consume inference far faster than a flat subscription model was designed to handle, so it must be priced as the recurring cost it truly is.

Training is a one-time capital expenditure. Inference is a recurring operating cost that compounds with every new user and every new agent.
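
A toy model makes the distinction concrete. The sketch below finds the month in which cumulative inference spend overtakes a one-time training bill. Every number in it is hypothetical and chosen only to show the shape of the curve, not to describe any real model or lab.

```python
from typing import Optional

def month_inference_exceeds_training(training_cost: float,
                                     first_month_inference: float,
                                     monthly_growth: float,
                                     max_months: int = 600) -> Optional[int]:
    """Return the first month in which cumulative inference spend
    is larger than the one-time training cost, or None if it never is."""
    cumulative = 0.0
    monthly = first_month_inference
    for month in range(1, max_months + 1):
        cumulative += monthly              # inference is billed every month
        if cumulative > training_cost:
            return month
        monthly *= 1.0 + monthly_growth    # usage grows with users and agents
    return None

# HYPOTHETICAL inputs (arbitrary units): training costs 100 once,
# inference starts at 5 per month.
flat = month_inference_exceeds_training(100.0, 5.0, 0.0)
growing = month_inference_exceeds_training(100.0, 5.0, 0.10)
print(flat, growing)
```

With these made-up inputs, flat usage overtakes the training bill in month 21, while 10% monthly growth pulls the crossover forward to month 12. After that point, inference is the larger line item indefinitely.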

The Stack: Six Layers, One Bottleneck

Every AI application sits on a supply chain that starts at TSMC fabs and ends at API endpoints.

Most companies own only one layer. Nvidia owns the silicon, CoreWeave owns the bare metal, Together AI owns inference optimization, OpenRouter owns model API routing.

Except for one.

Hyperbolic: The Only Company Spanning Three Layers

Hyperbolic launched its on-demand GPU marketplace in June 2025. In its initial months, it surpassed 200,000 developers, with adoption spanning frontier AI labs, search, and large consumer platforms.

What's interesting is its architecture.

Hyperbolic doesn't own a single GPU itself. Every card comes from neoclouds and data centers – CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This sounds like a weakness; it's actually a moat.

By sitting between GPU supply and consumption, Hyperbolic gains real-time data others don't see. It knows who is buying which GPU, at what price, and when. It sees oversupply before it becomes public, and demand spikes before they hit the market.

Today, the moat *is* this multi-cloud aggregation. Hyperbolic stitches together fragmented capacity from dozens of independent clouds and data centers into a standardized, unified pool. Developers can rent the cheapest available GPU anywhere without negotiating with each operator or managing a stack of accounts. The more clouds it connects, the deeper the liquidity, the richer the pricing data. Going forward, the team is exploring using this data to model GPU price curves and eventually deploy its own capital to smooth supply and demand, acting as a market maker for physical compute. But this goal is still early; what compounds in the present is the aggregation layer.
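
The routing idea described above can be sketched in a few lines: pool offers from several providers, filter by what the workload needs, and pick the cheapest. This is an illustration only, not Hyperbolic's implementation. The `Offer` type, the `cheapest_offer` function, the provider labels, and all prices are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Offer:
    provider: str        # hypothetical provider label
    gpu_model: str       # e.g. "H100"
    hourly_price: float  # USD per GPU-hour (made-up values below)
    available: int       # GPUs free right now

def cheapest_offer(offers: Sequence[Offer],
                   gpu_model: str,
                   gpus_needed: int) -> Optional[Offer]:
    """Pick the lowest-priced offer that can serve the whole request."""
    candidates = [o for o in offers
                  if o.gpu_model == gpu_model and o.available >= gpus_needed]
    # min(..., default=None) avoids an exception when nothing matches.
    return min(candidates, key=lambda o: o.hourly_price, default=None)

# HYPOTHETICAL pooled supply from four independent clouds.
pool = [
    Offer("cloud-a", "H100", 2.40, 8),
    Offer("cloud-b", "H100", 1.95, 2),
    Offer("cloud-c", "H100", 2.10, 16),
    Offer("cloud-d", "A100", 1.10, 32),
]

small_job = cheapest_offer(pool, "H100", 1)   # cloud-b is cheapest
large_job = cheapest_offer(pool, "H100", 4)   # cloud-b is too small, so cloud-c
print(small_job.provider, large_job.provider)
```

Each additional cloud in the pool can only lower or hold the best available price for a given request, which is the sense in which liquidity deepens as more supply connects.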

This is the flywheel:

More cloud connections → More aggregated supply
More supply → Deeper market and real-time pricing data
Better data → Smarter routing now, pricing models later
Better liquidity and prices → More developers → More clouds want to connect

No other company is attempting this. Hyperbolic is the only one simultaneously operating across the GPU rental layer, the deployment layer, and the model API layer.

Venice as a Mirror

Venice is the clearest manifestation of the inference economy at the application layer, and a useful contrast to Hyperbolic's position. It's a privacy-first inference application: an OpenAI-compatible API plus consumer subscriptions (Free / Pro / Pro+ / Max) that routes requests to about 75 models, roughly two-thirds of which are open-source or self-hosted (Llama, Mistral, Qwen, DeepSeek), with the rest being anonymous pass-throughs to closed-source frontier models. Crucially, Venice doesn't own meaningful compute itself. It rents from undisclosed GPU partners and confidential computing providers (NEAR AI Cloud, Phala), and pays frontier labs for pass-through, making its true cost of revenue inference compute, not SaaS hosting.

What Venice truly sells is privacy. "Privatization" here doesn't mean turning public compute into private property; it means wrapping commoditized inference with guarantees: no data persistence, no training on data, request anonymization, and some workloads running in TEEs so even the operator can't see plaintext. The underlying compute is generic; the premium comes from this privacy wrapper. But this guarantee is layered and non-uniform. For open-source models running on GPUs that Venice controls or that run inside TEEs, it approaches end-to-end confidential computing. For anonymous pass-throughs to closed-source models like Claude or GPT, privacy is just identity stripping; the frontier lab still processes your raw prompt. So the strongest privacy covers only the open-source part; the frontier model part is "anonymous" rather than "truly confidential." Venice's gross margin = subscription price − inference cost paid upstream. The premium it can command over raw API pricing relies almost entirely on this privacy premium, which is also why its margins are thin and subject to frontier pass-through pricing.
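
The margin relationship stated above is simple enough to write down directly. The sketch below uses hypothetical prices, not Venice's actual figures, and shows how sensitive a thin-margin reseller is to a change in what it pays for compute. Strictly speaking, the subtraction in the text yields gross profit in dollars, and dividing by price gives the margin as a percentage.

```python
def gross_profit(subscription_price: float, inference_cost: float) -> float:
    """Dollar gross profit per subscriber per month."""
    return subscription_price - inference_cost

def gross_margin(subscription_price: float, inference_cost: float) -> float:
    """Gross profit as a fraction of the subscription price."""
    return gross_profit(subscription_price, inference_cost) / subscription_price

# HYPOTHETICAL numbers: a $20/month plan whose user consumes $14 of compute.
price = 20.0
cost = 14.0
base_margin = gross_margin(price, cost)

# Suppose the compute supplier raises prices by 20% and the plan price is unchanged.
shocked_margin = gross_margin(price, cost * 1.20)

print(f"margin before: {base_margin:.0%}, after a 20% cost increase: {shocked_margin:.0%}")
```

In this made-up case a 20% rise in compute cost roughly halves the margin, from 30% to 16%, which illustrates why the application layer is so exposed to whoever sets compute prices.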

The token design packages this inference demand. Venice runs on two tokens: VVV (staking and platform access) and DIEM, an inference credit where 1 DIEM ≈ $1 of compute per day. Paid subscriptions trigger programmatic buyback-and-burns of VVV (approx. $2 / $5 / $10 for Pro / Pro+ / Max). Emissions follow a fixed declining schedule: 6M → 5M → 4M VVV monthly, dropping to 3M on July 1st. The buybacks are real but discretionary and still small: ~$103k burned each in April and May, slowly climbing towards ~$110k in June, well below the $200k/month threshold.

Fundamentals are more modest than the headlines, but healthy. The publicly circulated "$70 million ARR" figure is almost certainly the result of mistakenly counting subscription renewals as new customer acquisitions. A defensible observable range is closer to $6 million to $15 million ARR. Underneath that, traction is real: ~136,000 token holders, ~9.9 million monthly website visits (~330,000 daily), with new Pro subscriptions hovering around ~1,400 per day. This is a real business, but a thin-margin one, its economics constrained by the compute it purchases.
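
The per-tier burn amounts and the subscription pace quoted above allow a rough consistency check on the reported burn. The sketch assumes that each new subscription triggers exactly one burn of the quoted amount and that a month has 30 days. Both are assumptions made for illustration, not a description of Venice's actual mechanism, and renewals are ignored.

```python
# Approximate burn per paid subscription, in USD, as quoted in this article.
BURN_PER_SUBSCRIPTION = {"Pro": 2.0, "Pro+": 5.0, "Max": 10.0}

def estimated_monthly_burn(daily_new_subscriptions: dict, days: int = 30) -> float:
    """Estimate monthly VVV burn (in USD) from new subscriptions per day.

    ASSUMPTION: one burn event per new subscription, no renewals counted.
    """
    daily_burn = sum(BURN_PER_SUBSCRIPTION[tier] * count
                     for tier, count in daily_new_subscriptions.items())
    return daily_burn * days

# Only the Pro pace (~1,400 per day) is given in the article.
pro_only = estimated_monthly_burn({"Pro": 1400})
print(f"${pro_only:,.0f} per month")  # prints $84,000 per month
```

Counting only the Pro tier gives about $84,000 per month, the same order of magnitude as the ~$103k reported. The difference could plausibly come from Pro+ and Max subscriptions, but the article gives no volumes for those tiers.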

This is precisely why Hyperbolic sits one layer above. If Venice is the gas station, Hyperbolic is the refinery. Venice buys compute from the same constrained supply everyone relies on; Hyperbolic aggregates and standardizes that fragmented supply, then sells it to Venice and all players like it. As inference demand grows, value accrues not only to applications consuming compute but, crucially, to the layer that aggregates, routes, and captures the cost of revenue these applications pay.

Why This Matters Now

Nvidia restructured its finances around "serving tokens." Cerebras' IPO proved the market understands inference is the bottleneck. Anthropic scrambling for capacity proves it's a real problem. Agentic and physical AI will amplify demand by orders of magnitude, spanning both cloud and edge.

It also closes the loop on the "$600 Billion Problem" from the other side. Cahn's bearish logic – overbuilding leading to oversupply – will likely be validated. But oversupply is the best possible market for an asset-light aggregator. When GPU prices fall and supply fragments across dozens of clouds, the player that owns no hardware and routes every workload to the cheapest available card earns the spread, while operators holding depreciating GPUs take the losses. Hyperbolic is betting *on* oversupply, not against it.
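
A stylized comparison shows why falling prices hit the two business models differently. All inputs below are hypothetical: the hourly cost of owning a GPU, the market prices, and the aggregator's take rate are invented for illustration, and the model ignores utilization, financing, and volume effects.

```python
def owner_profit_per_hour(market_price: float, ownership_cost: float) -> float:
    """Operator that owns the GPU: earns the rental price, bears a fixed
    hourly cost (depreciation, power, hosting) regardless of the market."""
    return market_price - ownership_cost

def aggregator_profit_per_hour(market_price: float, take_rate: float) -> float:
    """Asset-light router: keeps a percentage spread, carries no hardware cost."""
    return market_price * take_rate

OWNERSHIP_COST = 2.00   # hypothetical USD per GPU-hour
TAKE_RATE = 0.10        # hypothetical 10% spread

for label, price in [("tight market", 2.50), ("oversupplied market", 1.50)]:
    owner = owner_profit_per_hour(price, OWNERSHIP_COST)
    aggregator = aggregator_profit_per_hour(price, TAKE_RATE)
    print(f"{label}: owner {owner:+.2f}, aggregator {aggregator:+.2f} per GPU-hour")
```

In this sketch the owner swings from a profit to a loss when the price drops below its cost of ownership, while the aggregator's spread shrinks per hour but stays positive. Whether cheaper compute also lifts routed volume enough to grow the aggregator's total profit is a separate question that this toy model does not address.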

The ultimate winning company won't be the one with the most GPUs. It will be the one that can tell you which GPUs are available where, at what price, and route every workload to the cheapest place to run it.

Hyperbolic is building that company. It owns no GPUs; it is pure software spanning three layers of the stack, and it is building the aggregation layer for inference compute.
