When Inference Becomes a Scarce Resource, Who Captures the Value?
- Core Thesis: The computational bottleneck in the AI industry has shifted from training to inference, and the market is repricing accordingly. Value will no longer solely belong to companies with the most GPUs but will accrue to the middle layer capable of aggregating, routing, and optimizing fragmented inference compute, such as asset-light platforms like Hyperbolic.
- Key Elements:
- **Inference as the New Bottleneck**: The market recognizes inference as a recurring cost that scales with usage, not a one-time capital expense like training. J.P. Morgan estimates the inference market at 10 to 50 times the size of training, and Anthropic has taken over an entire data center dedicated exclusively to inference.
- **Confirmation from Industry Giants**: Nvidia has restructured its earnings narrative around "serving tokens," split inference into cloud and edge fronts, and launched chips with significantly higher inference performance. Cerebras's IPO was 20x oversubscribed on the strength of a chip architecture focused on inference acceleration.
- **The Answer to the "$600 Billion Question"**: The return gap on AI investment identified by Sequoia will be filled by growing inference demand, not training. Sustained inference demand will absorb the earlier overbuilding of GPU infrastructure.
- **Hyperbolic's Value Proposition**: As the only company spanning GPU leasing, deployment, and model APIs, Hyperbolic profits by aggregating multi-cloud computing resources and providing real-time pricing data. Its asset-light model allows it to better capture spreads during periods of compute surplus.
- **Thin Margins in the Application Layer vs. Value in the Middle Layer**: Inference applications like Venice are constrained by upstream compute costs, resulting in thin profits. Their economic model reveals that underlying compute is the primary cost, which reinforces the value of the aggregation layer (like Hyperbolic) that controls compute routing and pricing.
Original Author: Frank Fu
Original Source: IOSG Ventures
The gap that David Cahn identified in 2023 was never filled on the training side. It was filled on the inference side, and the market has only begun to price this in over the past few weeks. As Nvidia restructures its earnings around "serving tokens" and Cerebras's IPO sees 20x oversubscription, the debate over the bottleneck is over. The real question has become the next one: when inference becomes a scarce resource, where will value accrue in the compute stack?
Follow the GPU: From a $200 Billion Question to a $600 Billion Question
In 2023, Sequoia's David Cahn posed the question looming over the entire AI buildout: the "$200 Billion Question." For every $1 spent on a GPU, roughly another $1 is spent powering it in a data center, and the end user of that compute needs to earn a margin on top. Each year's GPU CapEx therefore implies that these chips must ultimately generate around $200 billion in revenue to recoup the capital. Even with generous assumptions for AI revenue, he found a gap of over $125 billion between the capital invested and what end customers actually pay. The concern was straightforward: GPUs were being overbuilt ahead of real demand.
A year later, the gap had not narrowed; it had widened. In his 2024 follow-up, as hyperscaler CapEx ballooned, Cahn restated it as the "$600 Billion Question." The bearish logic converged into a familiar shape: overbuilding leads to oversupply, and oversupply burns capital.
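The arithmetic in the paragraph above can be sketched in a few lines. The $50 billion annual GPU spend, the 50% end-user margin, and the $75 billion revenue estimate are not stated in this article; they are assumed inputs, chosen to be consistent with the $200 billion and $125 billion figures quoted here.

```python
# Back-of-envelope sketch of the "$200 billion" arithmetic.
# The three inputs marked "assumed" are NOT given in this article.
gpu_capex_bn = 50.0            # assumed annual run-rate GPU spend, in $bn
datacenter_multiplier = 2.0    # $1 on a GPU implies ~$1 more to run it in a data center
end_user_margin = 0.50         # assumed margin the buyer of the compute must earn
assumed_ai_revenue_bn = 75.0   # assumed "generous" AI revenue estimate, in $bn

# Total cost of owning and running the GPUs.
total_cost_bn = gpu_capex_bn * datacenter_multiplier

# Revenue needed so that cost is covered after the end user keeps its margin.
required_revenue_bn = total_cost_bn / (1.0 - end_user_margin)

# Shortfall between required revenue and what customers are assumed to pay.
gap_bn = required_revenue_bn - assumed_ai_revenue_bn

print(f"Data center cost:  ${total_cost_bn:.0f}bn")
print(f"Revenue required:  ${required_revenue_bn:.0f}bn")
print(f"Gap:               ${gap_bn:.0f}bn")
```

Changing any one input moves the gap directly, which is why the size of the "hole" is so sensitive to the revenue estimate.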
Both articles were essentially asking the same thing: who fills this gap? The answer never appeared on the "training" side of the ledger. It appeared on the inference side, and the market has only started pricing it in over the past few weeks.
Cerebras IPO and the Inference Squeeze
Cerebras went public on Thursday. The IPO was 20x oversubscribed, pricing at nearly double the final mark-up from Wednesday. The demand didn't come from bets on a "next Nvidia killer." It stemmed from something simpler: the market is beginning to realize the real bottleneck in AI is inference, not training.
Cerebras's core competency is a chip architecture that makes inference extremely fast. Not training, but inference. That's what excites Wall Street. The inference market is recurring; it expands with usage. Every time Claude answers a question, every time an agent executes a task, it consumes compute. Training happens once; inference never stops.
J.P. Morgan estimates the inference market to be 10 to 50 times the size of training. When machines start executing tasks ordered by other machines – the agentic expansion – inference demand no longer scales with user numbers, but with compute itself.
Nvidia Redraws the Map: Inference Takes Center Stage
If Cerebras was the market's awakening, Nvidia's latest quarterly earnings were confirmation from the top of the chain. On the earnings call, Jensen Huang made explicit what had gone unsaid: AI demand is growing parabolically. The reason is simple: agentic AI has arrived. Mainstream AI has moved from single-shot inference to reasoning, and now to the agent stage, where models call tools and orchestrate tasks themselves. Huang said, "Tokens are now profitable." In the AI era, compute equals revenue and profit.
This reshapes the entire industry. Training is a one-time cost to build a model; inference is the recurring cost of running it. The bottleneck today is inference, not training.
Nvidia embedded this judgment into its own earnings reporting. It now reports under two platforms instead of one: Data Center and Edge Computing. Data Center (~$75 billion in the quarter, +92% YoY) is further broken down into Hyperscale (~$38 billion, +12% QoQ) and ACIE – AI Cloud, Industrial & Enterprise (~$37 billion, +31% QoQ). The entirely new line is Edge Computing: $6.4 billion, +29% YoY, covering the endpoints where agentic AI and physical AI actually run, like PCs, workstations, AI-RAN base stations, robots, and cars.
Edge currently accounts for less than 8% of total revenue, but Nvidia has elevated it to a "second platform" alongside Data Center. The signal is this: inference is splitting into two fronts – cloud inference within data centers, and endpoint inference at the edge. AI needs to see, move, and act in the physical world. The roadmap follows the same logic: Vera Rubin, shipping starting Q3, offers up to 35x the inference throughput of Blackwell; Huang also gave a new $200 billion TAM for the Vera CPU built for agentic workloads. Every frontier model company is expected to shift fully to it on day one.
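The segment figures above can be checked with a short calculation. This is a rough sketch that assumes total revenue is simply the sum of the two reported platforms, Data Center and Edge Computing; the article does not state the total directly.

```python
# Sanity check of the segment figures quoted above (all in $bn, approximate).
data_center_bn = 75.0
hyperscale_bn = 38.0
acie_bn = 37.0
edge_bn = 6.4

# Assumption: total revenue = Data Center + Edge Computing.
total_bn = data_center_bn + edge_bn

# The two Data Center sub-segments should add up to the Data Center line.
sub_segment_sum_bn = hyperscale_bn + acie_bn

# Edge as a fraction of the assumed total.
edge_share = edge_bn / total_bn

print(f"Hyperscale + ACIE: ${sub_segment_sum_bn:.0f}bn")
print(f"Edge share of total: {edge_share:.1%}")
```

Under that assumption the edge share comes out just under 8%, in line with the figure in the text.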
When the world's most valuable company restructures its financial disclosures around "serving tokens," the bottleneck debate is settled. The remainder of this article discusses who captures the value once inference (not training) becomes the scarce resource.
Let's set the scope first. Of these two fronts, this article covers cloud inference: rented data center GPUs serving tokens to external customers through APIs. Endpoint inference runs on local chips inside the device itself (Nvidia's Jetson, RTX, Drive, AI-RAN) and bypasses the GPU leasing and aggregation stack entirely. Treat it here as a tailwind that amplifies the whole inference economy and confirms the bottleneck thesis, not as the market that Hyperbolic and Venice occupy; both operate entirely on the cloud side.
The Squeeze Has Arrived
Anthropic is the canary in the coal mine. Usage far exceeds provisioned capacity. Complaints about Claude being "lobotomized" flood the internet: rate-limited responses, slower reasoning, compressed context windows. The solution is raw compute: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX (220,000+ Nvidia GPUs, 300+ megawatts), dedicated entirely to inference, not training.
This capacity unlocked a series of limit changes, each a signal. On May 6th, Anthropic doubled Claude Code's five-hour limit, removed peak hour throttling, and significantly raised Opus's API rate limits. On May 13th, it raised Claude Code's weekly limit by another 50% (until July 13th). Then, starting June 15th, it did the opposite of "generosity": it separated agentic and programmatic usage (Agent SDK, headless mode `claude -p`, CI pipelines) from flat-rate subscriptions into a separate metered credit pool ($20 to $200 per month, priced at API rates). This last step encapsulates the entire argument in one move: agents consume inference far faster than flat-rate subscriptions can sustain, so it must be priced as the "recurring cost" it truly is.
Training is a one-time capital expenditure. Inference is a recurring operational cost that compounds with every new user and every new agent.
The Stack: Six Layers, One Bottleneck
Every AI application sits on a supply chain starting from TSMC fabs and ending at the API endpoint:


Most companies own just one layer. Nvidia owns silicon, CoreWeave owns bare metal, Together AI owns inference optimization, OpenRouter owns model API routing.
Except for one.
Hyperbolic: The Only Company Spanning Three Layers
Hyperbolic launched its on-demand GPU market in June 2025. In its first months, it passed 200,000 developers and was adopted by frontier AI labs, search platforms, and large consumer-facing applications.
What's interesting is its architecture.
Hyperbolic doesn't own a single GPU itself. Every card comes from neocloud providers and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This might sound like a weakness, but it's actually a moat.
By sitting between GPU suppliers and consumers, Hyperbolic sees real-time data others can't. It knows who buys which GPU, at what price, and when. It sees supply gluts before they become public, and demand surges before they hit the market.
Today, that moat is the multi-cloud aggregation itself. Hyperbolic stitches together fragmented capacity from dozens of independent clouds and data centers into a standardized, unified pool. Developers can rent the cheapest available GPU anywhere without negotiating with each operator or managing multiple accounts. The more clouds it connects, the deeper the liquidity, the richer the pricing data. Going forward, the team is exploring ways to use this data to model GPU price curves and eventually deploy its own capital to smooth supply and demand, acting as a market maker for physical compute. But this goal is still early; what compounds right now is the aggregation layer.
This is the flywheel:
- Connect more clouds → More aggregated supply
- More supply → Deeper market and real-time pricing data
- Better data → Smarter routing now, pricing models later
- Better liquidity and prices → More developers → More clouds want to connect
No other company is attempting this. Hyperbolic is the only one simultaneously spanning the GPU rental layer, the deployment layer, and the model API layer.
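The aggregation step described above, pooling offers from many clouds and renting the cheapest card that fits, can be illustrated with a minimal sketch. It is not Hyperbolic's actual API or routing logic: the `Offer` type, the `cheapest_offer` function, the provider labels, and all prices are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Offer:
    """One provider's listing in the pooled market (hypothetical schema)."""
    provider: str          # hypothetical provider label
    gpu_model: str         # e.g. "H100"
    price_per_hour: float  # USD per GPU-hour (made-up numbers)
    available: int         # GPUs currently free


def cheapest_offer(offers: Sequence[Offer], gpu_model: str, count: int) -> Optional[Offer]:
    """Return the lowest-priced offer that can supply `count` GPUs of `gpu_model`."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.available >= count
    ]
    # min() with default=None avoids an exception when nothing matches.
    return min(candidates, key=lambda o: o.price_per_hour, default=None)


# A pooled order book stitched together from several independent clouds.
offers = [
    Offer("cloud-a", "H100", 2.40, 16),
    Offer("cloud-b", "H100", 1.95, 4),
    Offer("cloud-c", "H100", 2.10, 64),
    Offer("cloud-c", "A100", 1.10, 32),
]

best = cheapest_offer(offers, "H100", 8)
if best is not None:
    print(f"Route to {best.provider} at ${best.price_per_hour:.2f}/GPU-hour")
```

The cheapest listing overall ("cloud-b") is skipped here because it cannot supply eight cards, which is the kind of detail a developer would otherwise have to discover by checking each operator separately. Every request of this kind also leaves a record of who wanted which GPU at what price, which is the pricing data the flywheel relies on.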
Venice as a Mirror
Venice is the clearest expression of the inference economy at the application layer, and a useful contrast to Hyperbolic's position. It's a privacy-first inference application: an OpenAI-compatible API plus consumer subscriptions (Free / Pro / Pro+ / Max) that routes requests across about 75 models, roughly two-thirds of which are open-source or self-hosted (Llama, Mistral, Qwen, DeepSeek), with the rest being anonymous pass-through to closed-source frontier models. Crucially, Venice doesn't own meaningful compute itself. It rents from undisclosed GPU partners and confidential computing providers (NEAR AI Cloud, Phala) and pays frontier labs for pass-through. So its real cost of revenue is inference compute, not SaaS hosting.
What Venice truly sells is privacy. "Privacy-preservation" here doesn't mean turning public compute into private property; it means wrapping commoditized inference in guarantees: no data retention, no training on user data, anonymized requests, and some workloads running in TEEs where even the operator can't see the plaintext. The underlying compute is a commodity; the markup comes from this privacy layer. The guarantee isn't uniform, though. For open-source models running on GPUs Venice controls or on TEE-backed GPUs, it offers near end-to-end confidential computing. For anonymous pass-through to closed models like Claude or GPT, privacy is just identity stripping; the frontier lab still processes your raw prompt. So the strongest privacy covers only the open-source part; frontier models get "anonymity," not true confidentiality. Venice's gross profit = subscription price − inference costs paid upstream. The premium it can charge over bare API pricing rests almost entirely on this privacy layer, which is why the business is thin-margin and exposed to frontier pass-through pricing.
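A small worked example shows how sensitive this kind of margin is to compute pricing. The subscription price, the per-subscriber inference cost, and the 20% cost increase are all hypothetical numbers chosen for illustration; they are not Venice's actual figures.

```python
def gross_profit(subscription_price: float, inference_cost: float) -> float:
    """Profit per subscriber: what the user pays minus what compute costs."""
    return subscription_price - inference_cost


def gross_margin(subscription_price: float, inference_cost: float) -> float:
    """Gross profit as a fraction of the subscription price."""
    return gross_profit(subscription_price, inference_cost) / subscription_price


# Hypothetical monthly figures per subscriber, in USD.
price = 18.0
inference_cost = 14.0

margin_before = gross_margin(price, inference_cost)

# Suppose compute suppliers raise prices by 20% and the
# subscription price cannot follow.
inference_cost_after = inference_cost * 1.20
margin_after = gross_margin(price, inference_cost_after)

print(f"Margin before: {margin_before:.1%}")
print(f"Margin after:  {margin_after:.1%}")
```

With these made-up inputs, a 20% rise in compute cost removes most of the margin, because compute is nearly the entire cost of revenue.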
The token design wraps this inference demand. Venice runs on two tokens: VVV (staking and platform access) and DIEM, an inference credit roughly equivalent to $1 of compute per day. Paid subscriptions trigger programmatic buyback-and-burn of VVV (~$2 / $5 / $10 for Pro / Pro+ / Max), with emissions following a fixed declining schedule: 6M → 5M → 4M VVV monthly, dropping to 3M on July 1st. The buybacks are real but discretionary and still small: ~$103k burned each in April and May, slowly climbing towards ~$110k in June, well below the $200k/month line.
Fundamentals are healthier than the headlines. The oft-cited "$70 million ARR" figure is almost certainly a product of mistaking subscription renewals for net new customer acquisition; a defensible observable range is closer to $6-15 million ARR. Beneath this, traction is real: ~136k token holders, ~9.9 million monthly website visits (~330k daily), with new Pro subscriptions hovering around ~1,400 per day. It's a real business, but a thin-margin one whose economics are dictated by the compute it buys.
This is exactly why Hyperbolic sits one layer above. If Venice is a gas station, Hyperbolic is the refinery. Venice buys compute from the same constrained supply everyone else relies on; Hyperbolic aggregates that fragmented supply, standardizes it, and sells it to Venice and all the players like it. As inference demand grows, value accrues not just to applications consuming compute, but to the layer that aggregates, routes, and captures the cost of revenue these applications pay.
Why This Matters Now
Nvidia restructured its finances around "serving tokens." Cerebras's IPO proves the market understands inference is the bottleneck. Anthropic scrambling for capacity proves it's a real problem. Agentic and physical AI will amplify demand by orders of magnitude, spanning both cloud and edge fronts.
This also closes the loop on the "$600 Billion Question" from the other side. Cahn's bearish logic, that overbuilding leads to a glut, will likely be validated. But a glut is precisely the best market for an asset-light aggregator. When GPU prices fall and supply is fragmented across dozens of clouds, the player who owns no hardware and routes every workload to the cheapest available card collects the spread, while operators holding depreciating GPUs take the losses. Hyperbolic is *long* the glut, not short.
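The asymmetry between the two positions can be sketched with a toy model. It assumes the operator carries a fixed hourly cost per GPU (depreciation plus power) while the aggregator earns a percentage take on each rented hour and carries no hardware cost; the cost, the take rate, and the price path are hypothetical.

```python
def operator_profit_per_hour(market_price: float, fixed_cost: float) -> float:
    """GPU owner: earns the market price but carries a fixed hourly cost."""
    return market_price - fixed_cost


def aggregator_profit_per_hour(market_price: float, take_rate: float) -> float:
    """Asset-light aggregator: keeps a share of each hour routed, no hardware cost."""
    return market_price * take_rate


fixed_cost = 1.80  # hypothetical depreciation + power, USD per GPU-hour
take_rate = 0.10   # hypothetical aggregator take on each rented hour

# Hypothetical market prices as a glut develops (USD per GPU-hour).
for price in (3.00, 2.00, 1.50):
    op = operator_profit_per_hour(price, fixed_cost)
    agg = aggregator_profit_per_hour(price, take_rate)
    print(f"price ${price:.2f}: operator {op:+.2f}, aggregator {agg:+.2f}")
```

In this sketch the operator's profit turns negative once the price drops below its fixed cost, while the aggregator's take shrinks but stays positive. Whether cheaper compute then brings enough extra volume to grow the aggregator's total take is the article's claim and is not something this toy model demonstrates.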
The company that ultimately wins won't be the one with the most GPUs. It will be the one that can tell you which GPUs are available where, at what price, and route every workload to run at the lowest cost.
Hyperbolic is building that company. It owns no GPUs. It is pure software, three layers deep, and it is building itself into the aggregation layer for inference compute.


