When Inference Becomes the Scarce Resource, Who Captures the Value?
- Core view: The AI industry's compute bottleneck has shifted from training to inference, and the market is repricing accordingly. Value will no longer belong to the company with the most GPUs; it will accrue to the middle layer that can aggregate, route, and optimize fragmented inference compute, such as asset-light platforms like Hyperbolic.
- Key factors:
- **Inference becomes the new bottleneck**: The market has realized that inference is a recurring cost that scales with usage, not a one-time capital expenditure like training. J.P. Morgan estimates the inference market to be 10 to 50 times the size of the training market, and Anthropic taking over a data center dedicated solely to inference is cited as evidence.
- **Confirmation from industry giants**: Nvidia restructured its financial reporting around the concept of "serving tokens," split inference into two fronts (cloud and edge computing), and launched chips with substantially higher inference performance. Cerebras' IPO was 20x oversubscribed because the company focuses on a chip architecture that accelerates inference.
- **The answer to the "$600 Billion Problem"**: The AI return-on-investment gap raised by Sequoia will be filled by growing inference demand, not by training. Steady inference demand will absorb the early GPU overbuild.
- **Hyperbolic's value positioning**: As the only company covering all three layers (GPU rental, deployment, and model APIs), Hyperbolic profits by aggregating multi-cloud resources and providing real-time pricing data. Its asset-light model can capture more of the price spread when compute is oversupplied.
- **Thin margins at the application layer, value in the middle layer**: Inference applications such as Venice are constrained by upstream compute costs, which leaves them with thin margins. Their economics show that underlying compute is the main cost, which reinforces the value of the aggregation layer (such as Hyperbolic) that controls compute routing and pricing.
Original Author: Frank Fu
Original Source: IOSG Ventures
The gap David Cahn identified in 2023 was never filled on the training side. It has been filled on the inference side, and the market has only begun to price this in over the past few weeks. As Nvidia restructures its financial reporting around "serving tokens" and Cerebras' IPO sees 20x oversubscription, the battle over the bottleneck is over. The real question has now become: when inference becomes a scarce resource, where does value accrue in the compute stack?
Following the GPU: From a $200 Billion Problem to a $600 Billion Problem
In 2023, Sequoia's David Cahn posed the question hanging over the entire AI buildout – the "$200 Billion Problem." For every dollar spent on a GPU, roughly another dollar must be spent on powering it in a data center. Therefore, each year's GPU CapEx means these chips must ultimately generate about $200 billion in revenue to justify the capital outlay. Even with generous assumptions for AI revenue, he identified a gap of over $125 billion between the "spend" and what "end customers actually pay." The concern was straightforward: GPUs were being overbuilt ahead of real demand.
A year later, the gap hasn't narrowed; it's widened. In Cahn's 2024 follow-up, as hyperscaler CapEx ballooned, he redefined it as the "$600 Billion Problem." The bearish logic converges into a familiar shape: overbuilding leads to oversupply, and oversupply burns capital.
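The arithmetic behind both headline numbers can be reproduced in a few lines. This is a minimal sketch, not Cahn's own model: the GPU run-rate inputs ($50 billion and $150 billion) and the 50% end-user margin are assumptions that do not appear in this article and are chosen so that the outputs match the $200 billion and $600 billion figures. Only the "one dollar of data center cost per GPU dollar" rule and the $125 billion gap come from the text above.

```python
def required_ai_revenue(gpu_run_rate_bn, datacenter_multiplier=2.0, end_user_margin=0.5):
    """Revenue (in $bn) needed to justify one year of GPU CapEx.

    gpu_run_rate_bn and end_user_margin are illustrative assumptions.
    """
    # Each GPU dollar implies roughly one more dollar of data center cost.
    total_cost_bn = gpu_run_rate_bn * datacenter_multiplier
    # The buyer of the compute must also earn a margin on top of that cost.
    return total_cost_bn / (1.0 - end_user_margin)


problem_2023_bn = required_ai_revenue(50.0)   # 200.0
problem_2024_bn = required_ai_revenue(150.0)  # 600.0

# With a gap of about $125bn, the revenue implied for 2023 is the remainder.
implied_revenue_2023_bn = problem_2023_bn - 125.0  # 75.0

print(problem_2023_bn, problem_2024_bn, implied_revenue_2023_bn)
```

Under these assumptions, tripling the GPU run-rate triples the revenue that has to show up somewhere, which is why the gap widens unless demand grows at least as fast.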
Both articles essentially ask the same thing: who fills this gap? The answer was never on the "training" side of the ledger. It appears on the inference side, and the market has only started pricing this in over the past few weeks.
The Cerebras IPO and the Inference Squeeze
Cerebras went public on Thursday. The IPO was 20x oversubscribed, priced at nearly double the final mark-up from Wednesday. The demand wasn't driven by bets on the "next Nvidia killer," but by a simpler realization: the market is beginning to understand that the true bottleneck in AI is inference, not training.
Cerebras' core competency is a chip architecture that makes inference extremely fast. Not training, but inference. This is what excited Wall Street. The inference market is recurring; it expands with usage. Every time Claude answers a question, every time an agent executes a task, it consumes compute power. Training happens once; inference never stops.
J.P. Morgan estimates the inference market size to be 10 to 50 times that of training. When machines begin executing tasks assigned by other machines – i.e., agentic expansion – inference demand no longer scales with user count, but with compute itself.
Nvidia Redraws the Map: Inference Takes Center Stage
If Cerebras was the market's awakening, Nvidia's latest quarterly earnings were confirmation from the top of the chain. On the latest earnings call, Jensen Huang made the unspoken explicit: AI demand is growing parabolically. The reason is simple: agentic AI has arrived. Mainstream AI has transitioned from one-shot inference to logical reasoning, and now to the agent phase where it can call tools and orchestrate tasks. Huang stated, "Tokens are now profitable." In the AI era, compute power is revenue and profit.
This reshapes the entire industry. Training is a one-time cost to build a model; inference is the recurring cost of running it. The bottleneck today is in inference, not training.
Nvidia baked this judgment into its own financial reporting structure. It now reports under two platforms instead of one: Data Center and Edge Computing. The Data Center segment (approximately $75 billion for the quarter, +92% YoY) is further broken down into Hyperscale (approximately $38 billion, +12% QoQ) and ACIE (AI Cloud, Industrial, and Enterprise, approximately $37 billion, +31% QoQ). The new line is Edge Computing: $6.4 billion, +29% YoY, covering the endpoints where agentic AI and physical AI actually run – PCs, workstations, AI-RAN base stations, robots, and cars.
Edge currently accounts for less than 8% of total revenue, but Nvidia has elevated it to a "second platform" alongside Data Center. This signals that inference is splitting into two fronts: cloud inference within data centers, and endpoint inference at the edge. AI needs to see, move, and act in the physical world. The roadmap follows the same logic: Vera Rubin, shipping from Q3, offers up to 35x the inference throughput of Blackwell; Huang also provided a new $200 billion TAM for the Vera CPU designed for agentic workloads. Every frontier model company is expected to fully transition to it on day one.
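The segment figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the approximate numbers given in this article and assumes, for illustration, that total revenue is just Data Center plus Edge Computing with no other reporting lines.

```python
# Approximate quarterly figures from the text, in $bn.
segments_bn = {"hyperscale": 38.0, "acie": 37.0, "edge": 6.4}

# Data Center is the sum of its two sub-segments.
data_center_bn = segments_bn["hyperscale"] + segments_bn["acie"]

# Assumption: total revenue = Data Center + Edge Computing.
total_bn = data_center_bn + segments_bn["edge"]
edge_share = segments_bn["edge"] / total_bn

print(f"Data Center: ${data_center_bn:.1f}bn")
print(f"Edge share of total: {edge_share:.1%}")
```

Under that assumption the two sub-segments add up to the reported Data Center total, and Edge comes out just under 8% of revenue, consistent with the claim above.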
When the world's most valuable company restructures its financial disclosure around "serving tokens," the battle over the bottleneck is over. The remainder of this article discusses who captures value when inference (rather than training) becomes the scarce resource.
First, a scope note. Of these two fronts, this article covers cloud inference: rented data center GPUs that serve tokens to external customers through APIs. Endpoint inference runs on local chips inside the device itself (Nvidia's Jetson, RTX, Drive, AI-RAN) and bypasses the GPU rental and aggregation stack entirely. Treat it here as a tailwind that expands the overall inference economy and reinforces the bottleneck argument, not as the market Hyperbolic and Venice operate in: both sit entirely on the cloud inference side.
The Squeeze Has Arrived
Anthropic is the canary in the coal mine. Usage far exceeded pre-provisioned capacity, and complaints about Claude being "lobotomized" flooded the internet – rate-limited replies, slower reasoning, compressed context windows. The solution was raw compute power: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX – 220,000+ Nvidia GPUs, 300+ megawatts – and dedicated it exclusively to inference, not training.
This capacity unlock triggered a chain of limit changes, each one a signal. On May 6th, Anthropic doubled Claude Code's five-hour limit, removed peak-hour rate limits, and significantly increased Opus API rate limits. On May 13th, it further increased Claude Code's weekly limit by 50% (until July 13th). Then, starting June 15th, it did the opposite of "generous": it carved out agentic and programmatic usage (Agent SDK, headless mode `claude -p`, CI pipelines) from the flat subscription into a separate metered credit pool ($20 to $200 per month, priced at API rates). This final step distills the entire argument into one action: agents consume inference far faster than a flat subscription can bear, so it must be priced as the recurring cost it truly is.
Training is a one-time capital expenditure. Inference is a recurring operational cost that compounds with every new user and every new agent.
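Why a flat subscription breaks under agentic load is easiest to see with numbers. The sketch below is illustrative only: the blended rate of $10 per million tokens, the $200 flat fee, and the 5 million tokens per day agent are hypothetical values, not Anthropic's actual prices or usage data.

```python
def monthly_metered_cost(tokens_per_day, usd_per_million_tokens, days=30):
    """Cost of a workload billed per token, in USD per month."""
    return tokens_per_day * days * usd_per_million_tokens / 1_000_000


def breakeven_tokens_per_day(flat_fee_usd, usd_per_million_tokens, days=30):
    """Daily token volume at which metered cost equals the flat fee."""
    return flat_fee_usd * 1_000_000 / (usd_per_million_tokens * days)


# Hypothetical inputs, chosen only to illustrate the shape of the problem.
RATE = 10.0       # USD per million tokens (assumed blended rate)
FLAT_FEE = 200.0  # USD per month (assumed subscription price)

agent_cost = monthly_metered_cost(5_000_000, RATE)  # 1500.0
breakeven = breakeven_tokens_per_day(FLAT_FEE, RATE)

print(f"Agent at 5M tokens/day costs ${agent_cost:,.0f} per month at API rates")
print(f"Flat fee is underwater above {breakeven:,.0f} tokens/day")
```

With these assumed inputs, a single always-on agent costs several times the flat fee to serve, which is the gap a separately metered credit pool closes.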
The Stack: Six Layers, One Bottleneck
Every AI application sits on a supply chain that starts at a TSMC fab and ends at an API endpoint:


Most companies own only one of these layers. Nvidia owns the silicon, CoreWeave owns the bare metal, Together AI owns inference optimization, OpenRouter owns model API routing.
Except for one.
Hyperbolic: The Only Company Spanning Three Layers
Hyperbolic launched its on-demand GPU marketplace in June 2025. Within its first few months, it surpassed 200,000 developers, with adoption spanning frontier AI labs, search platforms, and large consumer-facing platforms.
What's interesting is its architecture.
Hyperbolic doesn't own a single GPU itself. Every card comes from neoclouds and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This sounds like a weakness, but it's actually a moat.
By sitting between GPU suppliers and consumers, Hyperbolic sees real-time data that others don't. It knows who is buying which GPU, at what price, and when. It sees oversupply before it becomes public, and demand surges before they hit the market.
Today, this moat is the multi-cloud aggregation itself. Hyperbolic stitches together fragmented capacity from dozens of independent clouds and data centers into a standardized, unified pool. Developers can rent the cheapest available GPU anywhere without negotiating with each operator or managing a dozen accounts. The more clouds it connects, the deeper the liquidity and the richer the pricing data. Looking ahead, the team is exploring ways to use this data to model GPU price curves and eventually deploy its own capital to smooth supply and demand, acting as a market maker for physical compute power. That goal is still at an early stage; what truly compounds today is the aggregation layer.
This is the flywheel:
- Connect more clouds → More aggregated supply
- More supply → Deeper market and real-time pricing data
- Better data → Smarter routing today, pricing models in the long run
- Better liquidity and prices → More developers → More clouds wanting to connect
No other company is attempting this. Hyperbolic is the only company simultaneously spanning the GPU rental layer, the deployment layer, and the model API layer.
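The aggregation idea described in this section, one pool of offers from many clouds with each request sent to the cheapest card that fits, can be sketched in a few lines. This is a toy illustration and not Hyperbolic's implementation: the `Offer` type, the provider names, and the prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Offer:
    provider: str            # hypothetical provider name
    gpu_model: str
    usd_per_gpu_hour: float
    gpus_available: int


def cheapest_offer(offers: List[Offer], gpu_model: str, gpus_needed: int) -> Optional[Offer]:
    """Return the lowest-priced offer that can satisfy the request, if any."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.gpus_available >= gpus_needed
    ]
    # min() with default=None avoids an exception when nothing matches.
    return min(candidates, key=lambda o: o.usd_per_gpu_hour, default=None)


# Made-up offers standing in for capacity from several independent clouds.
pool = [
    Offer("cloud-a", "H100", 2.40, 64),
    Offer("cloud-b", "H100", 1.95, 8),
    Offer("cloud-c", "H100", 2.10, 128),
    Offer("cloud-c", "A100", 1.20, 256),
]

small_job = cheapest_offer(pool, "H100", 8)    # cloud-b is cheapest and fits
large_job = cheapest_offer(pool, "H100", 32)   # cloud-b is too small, so cloud-c
print(small_job.provider, large_job.provider)
```

Even in this toy version, every added provider can only lower the best available price or leave it unchanged, and each routing decision produces a price observation. That is the mechanism the flywheel relies on.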
Venice as a Mirror
Venice is the clearest representation of the inference economy at the application layer, and a useful contrast to Hyperbolic's position. It's a privacy-first inference application: an OpenAI-compatible API, plus consumer subscriptions (Free / Pro / Pro+ / Max), routing requests to approximately 75 models. About two-thirds are open-source or self-hosted models (Llama, Mistral, Qwen, DeepSeek), while the rest are anonymous passthroughs to closed-source frontier models. The key point is that Venice doesn't own meaningful compute power itself. It rents from undisclosed GPU partners and confidential computing providers (NEAR AI Cloud, Phala) and pays frontier labs for passthroughs. Its true cost of revenue is inference compute, not SaaS hosting.
What Venice truly sells is privacy. Here, "privacy" isn't about turning public compute into private property; it's wrapping commoditized inference with guarantees: no data retention, no training on data, request anonymization, with some workloads running in TEEs so the operators themselves can't see plaintext. The underlying compute is a commodity; the markup comes from this privacy wrapper. And this guarantee is layered, not uniform: for open-source models running on GPUs it controls or inside TEEs, it achieves near end-to-end confidential computing. But for anonymous passthroughs to closed models like Claude or GPT, privacy means stripping identity – the frontier lab still processes your raw prompt. So the strongest privacy only covers the open-source portion; the frontier model part is "anonymous" rather than "truly confidential." Venice's gross margin = subscription price − inference costs paid upstream. The premium it can charge over the bare API price relies almost entirely on this privacy premium. This is also why it's thin-margined and exposed to the pricing of frontier passthroughs.
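The margin identity above makes the squeeze easy to quantify. The numbers in this sketch are hypothetical and are not Venice's actual prices or costs: an $18 monthly subscription, $14 of inference cost per subscriber, and a 20% increase in what suppliers charge.

```python
def gross_profit(subscription_usd, inference_cost_usd):
    """Per-subscriber gross profit: price minus inference cost paid to suppliers."""
    return subscription_usd - inference_cost_usd


def gross_margin_pct(subscription_usd, inference_cost_usd):
    """Gross profit as a percentage of the subscription price."""
    return 100.0 * gross_profit(subscription_usd, inference_cost_usd) / subscription_usd


# Hypothetical inputs, for illustration only.
PRICE = 18.0
COST = 14.0

base_margin = gross_margin_pct(PRICE, COST)            # about 22.2%
squeezed_margin = gross_margin_pct(PRICE, COST * 1.2)  # after a 20% supplier price rise

print(f"Base margin: {base_margin:.1f}%")
print(f"After a 20% cost increase: {squeezed_margin:.1f}%")
```

With these assumed inputs, a 20% rise in compute cost removes most of the margin, which illustrates why an application whose main cost is purchased inference has little pricing power of its own.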
The token design wraps this inference demand. Venice runs on two tokens: VVV (staking and platform access) and DIEM, an inference credit where 1 DIEM roughly equals $1 of compute per day. Paid subscriptions trigger programmatic buybacks and burns of VVV (approximately $2 / $5 / $10 for Pro / Pro+ / Max), while emissions follow a fixed decreasing schedule: 6M → 5M → 4M VVV per month, dropping to 3M on July 1st. The buybacks are real but discretionary and still small: approximately $103k burned in each of April and May, slowly climbing towards roughly $110k in June, well below the $200k per month line.
The fundamentals are healthier than the headlines. The publicly circulated "$70M ARR" figure is almost certainly an artifact of misattributing subscription renewals as net new acquisitions; a defensible observable range is closer to $6M to $15M ARR. Beneath this, traction is real: approximately 136k token-holding addresses, roughly 9.9M website visits per month (~330k daily), with new Pro subscriptions hovering around 1,400 per day. This is a real business, but a thin-margined one, with its economics determined by the compute power it purchases.
This is precisely why Hyperbolic sits one layer upstream. If Venice is a gas station, Hyperbolic is the refinery. Venice buys compute from the same constrained supply everyone else depends on; Hyperbolic aggregates that fragmented supply, standardizes it, and sells it to Venice and every player like it. As inference demand grows, value doesn't just accrue to the applications consuming compute, but to the layer that aggregates and routes compute, capturing the cost of revenue these applications pay.
Why This Matters Now
Nvidia has restructured its finances around "serving tokens." Cerebras' IPO proves the market understands inference is the bottleneck. Anthropic's scramble for capacity confirms this is a real problem. Agentic and physical AI will multiply demand by orders of magnitude across both cloud and edge lines.
It also closes the loop on the "$600 Billion Problem" from the other side. Cahn's bearish logic – overbuilding, then oversupply – will likely be validated eventually. But oversupply is exactly the best environment for a lightweight aggregator: when GPU prices fall and supply is fragmented across dozens of clouds, the player who owns no hardware and routes every workload to the cheapest available card captures the spread, while operators holding depreciating GPUs take the losses. Hyperbolic is betting *on* oversupply, not against it.
The company that ultimately wins won't be the one with the most GPUs. It will be the one that can tell you which GPUs are available where, at what price, and route every workload to the place where it can be run at the lowest cost.
Hyperbolic is building that company. It owns no GPUs, is pure software, spans three layers of the stack, and is positioning itself as the ultimate aggregation layer for inference compute.


