Will OpenAI Devour the Application Layer? a16z's View: The Real Opportunity Lies Outside General-Purpose Models
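- Key takeaway: The real opportunity for AI application startups is not to compete with large model companies on horizontal, general-purpose tools (the so-called "Yellow Brick Road"), but to go deep into industry processes and build vertical solutions that depend on complex workflows, data accumulation, compliance governance, and system integration capabilities.
- Key elements:
- Large model labs (OpenAI, Anthropic, and others) are sweeping the "Yellow Brick Road" with model capability plus horizontal tools (such as code generation), but they find it hard to go deep into vertical scenarios that require multiple steps and multiple stakeholders.
- A startup's barrier to entry (its moat) comes from the data and learning flywheel accumulated around a specific industry. This includes implicit industry practices and tribal knowledge that are not found in public training sets.
- Application-layer companies can optimize cost and manage complexity by routing across multiple vendors, choosing the best model for each sub-task, and absorbing the migration costs that come with model upgrades.
- Providing a use-case-specific "control plane" for governance, compliance, and audit (handling HIPAA, FINRA rules, and so on) is a core value that horizontal models struggle to replace.
- Vertical companies' "system-type" products replace human work and are embedded in workflows. Because customers pay for business outcomes (such as the number of sales leads) rather than benchmark results, these products achieve high customer lifetime value.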
Original title: Avoiding Death on the Yellow Brick Road
Original author: Joe Schmidt IV, a16z
Translation and compilation by: Peggy
Editor's note: As large language models continue to improve in capability, a common anxiety is emerging in the AI application layer: If model companies like OpenAI and Anthropic control the underlying models, distribution channels, and brand advantages, what can startups still do in the application layer?
This is precisely the question a16z partner Joe Schmidt attempts to answer in this article. Using the "Yellow Brick Road" from *The Wizard of Oz* as a metaphor, he divides AI application opportunities into two categories: one is the main road that large model companies are walking down themselves, such as code generation, writing, image generation, general-purpose agents, and horizontal office assistants; the other is "the rest of Oz"—the vertical scenarios deeply embedded in industry processes that rely on complex workflows, data accumulation, compliance governance, and system integration capabilities.
In his view, the real opportunity for startups lies in the latter.
From sales to insurance, Joe Schmidt repeatedly emphasizes the same logic: what enterprises are truly willing to pay for is not a smarter chat window, but a system that can take responsibility for business outcomes. It needs to understand the messy state of customer data, handle multi-person approvals and edge cases, bear compliance and audit responsibilities, and manage migration, routing, and cost optimization for customers as models continue to upgrade.
This is also the core judgment of this article regarding the next generation of enterprise software: underlying models will become increasingly powerful and increasingly interchangeable; but what is truly irreplaceable are the data, processes, governance capabilities, and operational memory accumulated around specific industries and specific workflows. The opportunity for AI application companies lies not in competing with model companies for the "Yellow Brick Road," but in entering those places that are more complex, messier, slower, yet closer to real business value.
The following is the original text:
Lately, I keep hearing the same question from founders and potential employees: Is there anything left to do in the AI application layer? Or will OpenAI and Anthropic end up killing everything?
There's a typical AI-style anxiety behind this question. Some have concluded that to avoid becoming permanently commoditized, the only positions with long-term value are inside the large model labs, or in fields like robotics, hard tech, or other frontier areas—theoretically, things the "labs can't touch." Because if every type of software is going to be consumed, either directly absorbed by Codex or Claude, or made unnecessary by some future model, the best choice seems to be: Run!
I admit, I too am almost an AI maximalist, and I think they are half right. The large model labs are indeed entering large swaths of the application layer. But the "application layer" is not a homogeneous set of opportunities. The truly important criterion is: Are you walking the "Yellow Brick Road," or are you somewhere else in Oz?
Note: The "Yellow Brick Road" is the main road in *The Wizard of Oz* leading to the heart of the Emerald City to see the "Wizard."
The "Yellow Brick Road" is how we describe the path that large model labs are walking and pouring enormous resources into. Problems like code generation, writing, and image creation are naturally suited for the labs because they get better as the raw capabilities of the models improve: every dollar invested in pre-training and post-training directly improves the quality of the output.
But elsewhere in Oz, there are more complex, usually more vertical problems. These are not simply about giving an enterprise user a horizontal tool that connects to standard tools and computer operation capabilities. The value here comes more from the scaffolding around the model: the scaffolding that makes the output trustworthy, compliant, and capable of truly entering business processes in a specific industry. The raw capabilities of the underlying model are still important, but they are not everything.
We are seeing this in real time. OpenAI and Anthropic are essentially admitting to the market: they can't solve all problems with a single general-purpose AI colleague. They have announced massive investments in forward-deployment-style joint ventures, building entire companies around configuring and customizing models for enterprises. If they truly believed the next model release would solve these problems, they wouldn't be pouring billions into such projects.
So, if you want to make money building AI applications, don't walk the Yellow Brick Road. Go build somewhere else in Oz. Here are the lessons we, and some founders in our portfolio, have learned in practice.
The Yellow Brick Road
If you're starting a company, the Yellow Brick Road is the most obvious path, but also the most dangerous one. Take a high-performance model, connect it to some off-the-shelf connectors like Google Drive, Slack, Salesforce, Notion, GitHub, and build an agent orchestration layer on top. It looks like magic.
The problem is, this is exactly what the large model labs are doing with Cowork and Codex. Obviously, they own the models, which means they have better margins, more control, and can exert pricing power over all downstream participants. But perhaps more importantly, they also control the architectural choices that determine what problems the product is suitable for solving. So far, they have very deliberately adopted the "model + tool calling" pattern, which is precisely the pattern needed for those horizontal, low-step-count tasks on the Yellow Brick Road. Even if a startup could somehow surpass Codex or Claude Code, the large model labs still have massive distribution power and the strongest brand halo in the AI field.
If you are an AI application company using the same playbook—connecting to the same connectors, without sub-agents or configurations underneath, and without distribution channels—you are likely walking a path to nowhere.
Everywhere Else in Oz
The situation isn't entirely pessimistic for startups. Beyond the Yellow Brick Road, there are still huge opportunities. Startups can own customers and solve complex problems here.
These companies are building agentic experiences: models woven into complex networks of tools, automations, and integrations—in other words, software. This also makes most of these startups naturally vertical. They can focus on multi-step, multi-party workflows, design sub-agents for different roles and vertical scenarios, and handle problems that Anthropic's and OpenAI's horizontal platforms find difficult to reach: gathering context across systems, then routing tasks to multiple people who need to approve at different stages.
This type of work typically involves one or more legacy systems, often requires deterministic results because ambiguity is unacceptable, and sometimes directly ties to an important business outcome. The large model labs certainly know how valuable these problems are: that's why they are building their own outsourced configuration teams, and why a whole ecosystem of reinforcement learning service companies for large customers is emerging.
Why the Rest of Oz Won't Be Completely Occupied by the "Wizard"
One counterargument to the above view is that, so far, betting that models or labs won't continue to improve has been a bad trade. They are likely to keep getting stronger and eventually eat the markets these application-layer companies serve.
The large model labs will certainly continue to improve. But I believe that companies in the rest of Oz still have a few defensive strategies in the long run.
Data & Learning Flywheel
Much of what you truly internalize in a business doesn't exist in any training set: unwritten industry customs, undocumented standards, tribal knowledge residing in practitioners' minds. None of it is on the public internet. No amount of training compute can replace being inside the workflows where this knowledge lives.
Two flywheels are at play here: a cross-customer flywheel, where patterns compound as you see more variants of the same problem; and a customer-internal flywheel, where the reasons behind specific decisions, the unspoken exceptions, and the company's own heuristics only emerge when users genuinely interact with the system.
Even if customer data cannot be used across customers, an application company can still leverage pattern recognition of different customer problem types and use it to guide the architecture of future problems. A company whose agents have processed a hundred legal redlines, a thousand insurance underwriting cycles, or ten thousand sales development representative (SDR) activities has an understanding of the problem space that a later entrant cannot replicate the first time it spins up an agent.
Theoretically, a horizontal agent could build the same learning infrastructure. The reason it doesn't, besides a lack of focus, is user experience. Capturing this knowledge depends entirely on what workflow interface you present to users. Vertical players can design these interfaces around what information truly needs to be exposed for a specific workflow; horizontal tools cannot. Evaluation sets, annotated outputs, and edge case taxonomies can all compound into a vertical data flywheel that further supports fine-tuning. Later entrants, without equivalent production exposure, will struggle to generate this flywheel. Whether fine-tuning on this data is feasible depends on data rights, accumulated production usage, and customer contract structures, but the pattern recognition itself will continue to accumulate.
Managing Model Volatility & Complexity
Inside the large model labs, routing already happens: directing different requests to different classes of models, using model ensembles under the hood. But what they *can't* do is route across vendors, evaluate a competitor's model for a specific sub-task, or use a truly optimal open-source fine-tuned model for a narrow task.
Companies elsewhere in Oz will, for each sub-task, choose the most suitable model from across the entire model market, not just the model released by a single lab. They will also take on the work no one wants to do: re-running evaluations with each new model release, recalibrating prompts for customer edge cases, and managing rollouts without breaking production environments. The large model labs won't do this for customers. They sell you the new model and tell you to migrate. Companies elsewhere in Oz absorb the migration cost. The customer gets the best intelligence from across the market and continuity through each upgrade cycle.
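The re-evaluation work described above can be pictured as an evaluation-gated migration. The following is only a minimal sketch under stated assumptions: the sub-task names, the vendor/model identifiers, the pass rates, and the `min_gain` threshold are all hypothetical, and a real system would also need per-customer edge-case suites and staged rollouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    sub_task: str     # hypothetical sub-task name, e.g. "extract_fields"
    model: str        # hypothetical vendor/model identifier
    pass_rate: float  # fraction of evaluation cases passed, 0.0 to 1.0


def plan_migration(current, results, min_gain=0.02):
    """Return a new sub_task -> model assignment.

    A sub-task moves to a different model only if that model beats the
    currently deployed one by at least `min_gain` on that sub-task's evals.
    """
    # Index pass rates by (sub_task, model) for quick lookup.
    scores = {(r.sub_task, r.model): r.pass_rate for r in results}
    plan = {}
    for sub_task, deployed in current.items():
        baseline = scores.get((sub_task, deployed), 0.0)
        best_model, best_score = deployed, baseline
        for (task, model), score in scores.items():
            if task == sub_task and score > best_score:
                best_model, best_score = model, score
        # Switch only on a clear win; otherwise keep production stable.
        if best_score - baseline >= min_gain:
            plan[sub_task] = best_model
        else:
            plan[sub_task] = deployed
    return plan


# Illustrative numbers only: a new model from a second vendor is released.
current = {"extract_fields": "vendor_a/large", "draft_email": "vendor_a/large"}
results = [
    EvalResult("extract_fields", "vendor_a/large", 0.91),
    EvalResult("extract_fields", "vendor_b/new", 0.96),
    EvalResult("draft_email", "vendor_a/large", 0.88),
    EvalResult("draft_email", "vendor_b/new", 0.885),
]
plan = plan_migration(current, results)
```

In this toy run, only `extract_fields` would move to the new model; `draft_email` stays put because the measured gain is below the threshold, which is one way of absorbing migration risk on the customer's behalf.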
Cost Optimization
Throwing every query at Opus 4.7 is the fastest way to turn gross profit margins negative. The best Oz companies route between different tiers of models: the hardest tasks to frontier models, the majority of tasks to mid-range models, and use smaller, specialized, or fine-tuned models where proven effective.
Some of these companies are now doing their own post-training on top of this, optimizing models for the narrower tasks customers truly care about, and delivering them at a fraction of the cost of frontier API calls. Large model labs price for the "ceiling": the maximum intelligence you can buy for X dollars. Oz companies sell the *opposite*: the minimum dollar cost for the level of intelligence actually needed for a specific workflow. This is only possible when you know exactly what level of intelligence each sub-task requires. And large model labs, structurally, cannot know every task within every vertical industry. Ultimately, this translates directly into lower, more controllable outcome-based pricing.
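One way to picture "the minimum dollar cost for the level of intelligence actually needed" is a router that picks the cheapest model clearing a per-sub-task quality bar. This is a rough sketch, not any company's actual router: the model names, prices, and quality scores below are invented for illustration, and the quality numbers are assumed to come from the application company's own evaluations.

```python
from dataclasses import dataclass


@dataclass
class ModelOption:
    name: str                  # hypothetical identifier
    cost_per_1k_tokens: float  # illustrative price, not a real quote
    quality: dict              # sub_task -> measured eval score (0.0 to 1.0)


def cheapest_sufficient(options, sub_task, required_quality):
    """Pick the lowest-cost model whose measured score meets the bar.

    Falls back to the highest-scoring model when none meets it.
    """
    qualified = [
        m for m in options
        if m.quality.get(sub_task, 0.0) >= required_quality
    ]
    if qualified:
        return min(qualified, key=lambda m: m.cost_per_1k_tokens)
    return max(options, key=lambda m: m.quality.get(sub_task, 0.0))


options = [
    ModelOption("frontier", 15.0, {"classify_intent": 0.99, "underwrite_summary": 0.95}),
    ModelOption("mid", 3.0, {"classify_intent": 0.97, "underwrite_summary": 0.86}),
    ModelOption("small_finetuned", 0.4, {"classify_intent": 0.96, "underwrite_summary": 0.60}),
]

easy = cheapest_sufficient(options, "classify_intent", 0.95)
hard = cheapest_sufficient(options, "underwrite_summary", 0.90)
```

With these made-up numbers, the easy classification task lands on the small fine-tuned model and the harder summary task on the frontier model. The router is only as good as the per-sub-task quality measurements behind it, which is the knowledge the paragraph above argues a vertical company is positioned to hold.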
Governance
Becoming the control plane through which customers run AI in a specific vertical creates considerable value. This control plane is where permissions, audits, what agents are allowed to do, and what agents actually did converge.
This control plane is built on use-case-specific guardrails, which differ completely across industries and job types. Because these companies own the tools, workflows, and data that agents touch end-to-end, they can provide deterministic results in ways horizontal tools cannot. They also absorb regulatory complexity for the end buyer: FRCP and professional conduct rules in law, HIPAA in healthcare, SEC and FINRA rules in finance, state-level insurance regulations, and so on. A horizontal player cannot plausibly do this without turning itself into a hundred different verticals. What a CIO needs is a partner who can explicitly commit in a contract to handle the compliance responsibility for the agents it provides.
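As a rough illustration of a control plane where "what agents are allowed to do" and "what agents actually did" converge, the sketch below checks each requested action against a permission table and records every decision. The class name, the policy shape, and the agent and action names are assumptions made for this example; real regulatory handling (HIPAA, FINRA, and so on) involves far more than an allowlist.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ControlPlane:
    # agent role -> set of actions it may perform (hypothetical policy shape)
    allowed: dict
    audit_log: list = field(default_factory=list)

    def authorize(self, agent, action, resource):
        """Decide whether `agent` may perform `action`, and record the decision."""
        permitted = action in self.allowed.get(agent, set())
        # Denied attempts are logged too, so auditors see what was tried.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "action": action,
            "resource": resource,
            "permitted": permitted,
        })
        return permitted


plane = ControlPlane(allowed={"intake_agent": {"read_claim", "summarize_claim"}})
can_read = plane.authorize("intake_agent", "read_claim", "claim-001")
can_pay = plane.authorize("intake_agent", "approve_payout", "claim-001")
```

Because the same object both decides and records, permissions and audit trail cannot drift apart, which is the property a buyer would presumably want a vendor to commit to in a contract.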
All of this ultimately comes back to the same thing: Focus.
This focus can be a vertical industry, like insurance, law, or accounting; or it can be a function pursued in sufficient depth, like sales, customer support, or finance. Either way, the work requires a team to be deeply embedded in the same type of customer base for a long time, understanding its workflows, edge cases, and regulatory requirements. Large model labs are not built for this. They have to serve everyone and be everywhere, which is why they built the Yellow Brick Road in the first place. The same trade-off makes it difficult for them to enter the rest of Oz: you can be everywhere, or you can be exceptional at one thing, but you cannot be both.
Case Study in Sales: Practical Advice from the CEO of 11x
What does this look like in practice? Here is some practical advice from Prabhav Jain, CEO of 11x.
Focus on Outcomes
A viable tactical path to building a company resilient to the impact of large model labs is to start from the specific outcomes customers truly care about. For us, that outcome is helping businesses generate more sales leads and pipeline.
From here, the questions become very concrete: Which activities do we want to own end-to-end that genuinely drive pipeline growth? Break each activity into tasks. Which tasks are suitable for agents, and which are not? Which require complex domain insight, and which do not? Large model labs will also release workflows, but when a workflow has many steps, messy inputs, hard-to-explain state, or real-world constraints, having a better model alone doesn't get it done. The work then returns to traditional software engineering, an area where large model labs have no advantage over a focused application company.
For example, some tasks we handle include: prospect discovery based on custom signals, prospect information enrichment, deep account research, context pulling from CRM, message composition for different channels, lead qualification agents, and email deliverability systems. Some are agent tasks, some are not. These tasks aren't accomplished with a single prompt; they require deep engineering capability.
The key insight in this Oz analogy is: roughly half of any real workflow consists of non-agent tasks, and this half doesn't carry an advantage for the labs. Under the model layer, they are no better than you at writing deterministic software. The other half, the agent tasks, still require you to tune, train, and constrain the model towards the desired outcome.
Domain knowledge is often not in general training data. These capabilities must be built bottom-up from a vertical industry or function and fed to the model at the right moment in the workflow. When our agent judges an inbound lead's qualification over the phone, it must be trained to understand what constitutes a good sales conversation for a specific industry and user profile. This is the application company's work, and this capability compounds.
More importantly, these capabilities constantly become outdated because businesses themselves evolve. Therefore, your ability to continuously evolve the workflow and context becomes a competitive advantage. For instance, when we started building our scaled email outreach product, "AI-written emails" were just emerging. Fast forward to today, people have developed a keen sense for detecting AI-written vs. human-written emails, and crucially, this judgment changes every few months. Our agent must constantly adapt to market dynamics, but the moat is built precisely here. In fact, despite this dynamic change, our positive reply rate has increased 4x over the last few months, generating hundreds of millions of dollars in sales pipeline for customers.
Tackle High-Complexity Problems
Complex problems are where real business value is unlocked. Otherwise, you'll easily find yourself building just a thin wrapper layer.
Decompose any sufficiently complex business problem, and you'll quickly see chaos. Here's a seemingly simple example from the GTM space: If a company is already your customer, you shouldn't send cold outreach to anyone at that company. But this is far from simple.
Maybe your CRM has the company's domain. What about companies with dozens of subsidiaries? What if the CRM records the parent company's domain? What if an outdated matching field in Salesforce causes you to send a cold sales email to the CRO of an existing customer? Real-world data is messy. Humans struggle with it, and models don't magically cross this threshold. Building order from this chaos requires designing specialized agents around the specific shape of the problem, not just pointing a general-purpose copilot at the CRM and calling it done. In fact, we've found that the quality and freshness of our own data surpass the customer's, so we anchor on our data by default.
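To make the subsidiary problem concrete, here is a minimal sketch of a suppression check that resolves every domain to the top of its corporate family before comparing. It is an illustration only: the `parent_of` mapping, the example domains, and the conservative "suppress the whole family" rule are assumptions, and in practice that mapping is exactly the messy data the paragraph above describes.

```python
def email_domain(email):
    """Return the lowercased domain part of an email address."""
    return email.rsplit("@", 1)[-1].strip().lower()


def corporate_root(domain, parent_of):
    """Follow subsidiary -> parent links until reaching the top-level domain."""
    seen = set()
    while domain in parent_of and domain not in seen:
        seen.add(domain)  # guard against cycles in messy data
        domain = parent_of[domain]
    return domain


def should_suppress(email, customer_domains, parent_of):
    """True if the contact's corporate family includes an existing customer."""
    customer_roots = {corporate_root(d, parent_of) for d in customer_domains}
    return corporate_root(email_domain(email), parent_of) in customer_roots


# Hypothetical corporate family: two subsidiaries under one parent.
parent_of = {
    "acme-labs.example": "acme.example",
    "acme-eu.example": "acme.example",
}

# The CRM only recorded one subsidiary as a customer.
crm_has_subsidiary = should_suppress("cro@acme-labs.example", {"acme-eu.example"}, parent_of)
# The CRM recorded the parent instead.
crm_has_parent = should_suppress("cro@acme-labs.example", {"acme.example"}, parent_of)
unrelated = should_suppress("vp@other.example", {"acme.example"}, parent_of)
```

A naive exact-domain match would miss both of the first two cases and send the cold email anyway. The hard part is not this lookup but keeping `parent_of` accurate and fresh.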
Guardrails Aren't Just to Prevent Bad Things. Customers Pay for This.
Guardrails are severely underestimated. Even within the same product, each use case needs its own guardrails. For us, a regulated financial services prospect requires entirely different assurances than a mid-market SaaS customer. And these assurances cascade into how the agent writes, who it can contact, what data it can touch, what it can say on the phone, and how each decision is recorded.
A "one-size-fits-all" system breaks down in the face of this diversity. Guardrails must be built per use case, configured per customer, and continuously audited—and this work falls entirely on the application company. This is why we need forward-deployed engineers and technical deployment strategists to tune for each customer's requirements.
For example, we worked with a Fortune 1000 institution to place consented outbound voice calls to its vast SMB customer base. In the initial rounds, pickup rates were low. We had to iterate quickly to learn how to engage this specific audience within the first 10 seconds of a call. SMB business owners behave differently from large B2B buyers or consumers. Now, we generate more sales opportunities for them in a day than their entire sales team could generate in that segment in a month.
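The layering described in this section, guardrails built per use case and then configured per customer, could be sketched as a merge of two settings dictionaries. This is a hypothetical sketch, not how 11x implements it: the setting names are invented, and the rule that a customer may tighten a guardrail but never loosen it is an assumption chosen for the example.

```python
def resolve_guardrails(use_case_defaults, customer_overrides):
    """Merge per-use-case defaults with per-customer settings.

    Assumed rule for this sketch: a customer may tighten a guardrail but
    never loosen it. Boolean flags stay True once set, numeric limits take
    the smaller value, and channel allowlists take the intersection.
    """
    resolved = dict(use_case_defaults)
    for key, override in customer_overrides.items():
        base = resolved.get(key)
        if base is None:
            resolved[key] = override           # new, customer-specific guardrail
        elif isinstance(base, bool):           # check bool before int: bool is an int
            resolved[key] = base or override
        elif isinstance(base, (int, float)):
            resolved[key] = min(base, override)
        elif isinstance(base, (set, frozenset)):
            resolved[key] = base & set(override)
        else:
            resolved[key] = override
    return resolved


# Hypothetical defaults for a regulated financial services use case.
financial_services = {
    "require_human_review": True,
    "max_daily_contacts": 200,
    "channels": {"email", "phone"},
}
# Hypothetical customer configuration.
customer = {
    "require_human_review": False,  # attempt to loosen; ignored under the rule
    "max_daily_contacts": 50,
    "channels": {"email"},
    "record_calls": True,
}
resolved = resolve_guardrails(financial_services, customer)
```

Under these assumptions the customer ends up with human review still required, a lower contact cap, email only, and call recording on. The resolved settings are what would then cascade into how the agent writes, who it can contact, and how each decision is recorded.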
Case Study in Insurance: Practical Advice from the CEO of FurtherAI
Sales is just one example. Insurance is another, illustrating the same point from a different angle. Here is Aman Gour, CEO of FurtherAI, on what it means to "build away from the Yellow Brick Road."
When we started deploying AI into real insurance operations, we repeatedly heard a hypothesis: the model *is* the intelligence, and the workflow is just scaffolding built around the model.
But the more insurers we work with, the more convinced we are that the opposite is true.
In insurance, a lot of the intelligence *resides* in the workflow itself. Two insurers can run a submission through an apparently identical path: submission, review, quote, underwriting. The path itself is easy. What truly differentiates the two insurers is everything *inside* that path: Which risks need escalation? Which loss signals matter? Which of two conflicting underwriting guidelines takes precedence? When must a human sign off? What external data needs to be pulled? And how


