After AI has consumed everything, what remains untrainable?

BlockBeats (区块律动)

Guest Columnist

2026-06-11 05:44


Trust, Permissions, Responsibility, and Industry Judgment

AI Summary


Core Argument: As AI capabilities continue to advance, general-purpose models will devour all tasks that can be measured by benchmarks, trained on public data, and validated at low cost. True value and moats will reside in "untrainable" domains: areas dependent on an organization's proprietary data, complex workflows, user trust, industry judgment, and the accumulation of long-term internal value.
Key Elements:
1. **Trainability Equals Commoditization**: Any task that can be measured by benchmarks and validated at low cost (e.g., writing code) will be absorbed and commoditized by models. Value drains out of this "readable" work.
2. **Private Correctness is the Moat**: Models cannot automatically gain a bank's system permissions or a doctor's trust. True automation requires deep integration within an organization, handling proprietary data, complex workflows, and long-accumulated experience that cannot be easily replicated from the outside.
3. **Trust and Responsibility Form Barriers**: Models can generate answers but cannot bear responsibility for errors or hold industry licenses. Application companies gain a hard-to-dislodge position by winning customer trust and embedding themselves in customers' decision-making processes and systems.
4. **Defining "Good" is Power**: Whoever can define the "acceptable quality of work" within a real service and establish private benchmarks holds the pricing power and moat of that domain (e.g., Harvey's legal benchmarks, Sierra's customer resolution definitions).
5. **Fierce Competition at Model Layer, Value Exists at Application Layer**: The frontier model market is not dominated by a single giant but remains a multi-player competitive landscape. Customers need supplier competition, and labs will not easily stifle deeply integrated applications, leaving room for value creation.

Original Title: The Untrainable

Original Author: Sarah Guo, Conviction

Translated and compiled by: Peggy, BlockBeats

Editor's Note: As AI capabilities continue to leap forward, a new pessimistic view is emerging in the investment community: if models become increasingly powerful, all application companies will eventually be swallowed up by model and computing power layers like Anthropic, OpenAI, and Nvidia, leaving only frontier models, computing power, and a few pieces of infrastructure as the market's final survivors. However, Sarah Guo believes this judgment is only half correct. Those "thin wrappers" (applications that simply wrap models) will indeed be absorbed. Any task that can be measured by benchmarks, trained on public data, and validated at low cost will gradually become commoditized.

The real question is: after AI devours everything trainable, what remains untrainable?

The answer in this article lies in value that exists within real organizations and cannot be easily replicated from the outside: enterprise proprietary data, complex workflows, user trust, system permissions, industry judgment, compliance responsibilities, and experience accumulated over long-term operations. Models can be smarter, but they cannot automatically enter a bank's production systems; they can generate medical answers, but cannot directly earn a doctor's trust or navigate a hospital's decision-making processes; they can write legal texts, but cannot bear responsibility in a senior lawyer's place or unilaterally define what constitutes qualified legal work.

Therefore, the AI companies with real moats will not simply be smarter than general models. Instead, they will go deep into specific industries to do the difficult but crucial "translation" work: organizing a client's private reality, tools, processes, and judgment standards into a system that models can act upon, and gradually defining what constitutes a "good outcome" through long-term service. The stronger AI becomes, the more it devalues measurable, replicable tasks, and the more it highlights the "untrainable" elements that carry history, relationships, permissions, and professional judgment. This is the real value that may survive even after the model devours everything.

Below is the original text:

By mid-2026, the investor version of "AI psychosis" is a feeling of despair that there is nothing left worth investing in: it seems we should just throw all our money at Anthropic and Nvidia, then go home and sleep. But I have never felt this way. For several minor versions now, I've been convinced that models are already smarter than me; I would happily buy Anthropic and Nvidia at market prices; many of my smartest friends are quite sure that model self-improvement will soon really take off—yet I still don't feel that despair.

This despair isn't stupid. Its logic goes like this: if models keep getting better at everything, then all companies built on top of models are just thin shells waiting to be absorbed; ultimately, the only value left will be computing power and frontier model weights.

Take software as an example, the case this despair relies on most. When Devin was released in 2024, it could only solve 13% of tasks in standard software benchmarks, so the market largely dismissed it. A year and a half later, the strongest agents are scoring over 80% and handling real work inside Goldman Sachs and the U.S. Army. Almost everyone has drawn the same erroneous conclusion: the model has swallowed software engineering.

But as models consume the most easily measurable parts of software engineering, we are also relearning what many teams knew all along: engineering has always resisted measurement, and the most easily measurable parts are not necessarily the only important ones.

MIT's Mert Demirer and collaborators have finally quantified this: among over 100,000 developers, the latest generation of coding agents increased code writing volume by roughly 180%, but the code actually delivered to production only increased by about 30%. Writing code became cheaper, but the remaining steps still involve humans, and these steps matter. Of course, the overall net impact remains astonishing.
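
The gap between those two figures can be made concrete with a little arithmetic. The sketch below assumes both percentages are increases over the same pre-agent baseline, which the passage above does not state explicitly; the function name is illustrative only.

```python
def shipped_share_ratio(written_increase_pct: float, shipped_increase_pct: float) -> float:
    """Return the new shipped-to-written ratio relative to the old one.

    Assumption: both increases are measured against the same baseline period.
    """
    written_multiplier = 1 + written_increase_pct / 100   # +180% means 2.8x as much code written
    shipped_multiplier = 1 + shipped_increase_pct / 100   # +30% means 1.3x as much code shipped
    return shipped_multiplier / written_multiplier


ratio = shipped_share_ratio(180, 30)
print(f"Share of written code reaching production: {ratio:.0%} of its former level")
```

Under that assumption, a given line of written code is less than half as likely to reach production as before, which is one way to read the claim that the remaining steps still matter.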

A benchmark is something you can measure; and anything that can be measured can be trained for. Therefore, coding agents matured first: compilers are free validators, and test suites are also free validators. When answers can be self-checked at near-zero cost, you can keep polishing around that checking signal until you break through.
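
A minimal sketch of what a "free validator" looks like in practice, assuming a toy task and hand-written candidate functions standing in for model outputs. It is not how any lab actually trains; it only shows that once a test suite exists, scoring and selecting candidates costs almost nothing. All names here are hypothetical.

```python
from typing import Callable, Iterable

# A test case pairs an input with its expected output (illustrative only).
TestCase = tuple[int, int]


def pass_rate(candidate: Callable[[int], int], tests: Iterable[TestCase]) -> float:
    """Score a candidate by the fraction of test cases it passes (tests must be non-empty)."""
    tests = list(tests)
    passed = 0
    for arg, expected in tests:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            # A crash counts as a failed check, like a red test.
            pass
    return passed / len(tests)


def best_candidate(candidates, tests):
    """Pick the candidate with the highest pass rate; ties go to the earliest one."""
    return max(candidates, key=lambda c: pass_rate(c, tests))


# Hypothetical task: square a number. Three candidate "solutions".
tests = [(0, 0), (2, 4), (-3, 9)]
candidates = [lambda x: x * 2, lambda x: abs(x) * x, lambda x: x * x]

scores = [pass_rate(c, tests) for c in candidates]
winner = best_candidate(candidates, tests)
```

The checking signal here is fully automatic, which is the property that lets a task like this be optimized quickly. Nothing in the loop can tell whether the winning function is right for a ten-year-old codebase.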

But passing tests never means a change is correct for a codebase that has been running for ten years. The module exists for maybe three reasons no one documented; the deployment pipeline might be held together by a cron job no one wants to admit they wrote.

This kind of correctness cannot be read from a leaderboard, or really from anywhere else. You can only know whether a system this complex really works by letting it run in the real world long enough. And smarter models won't make the real world run faster. No one would run unit tests on a system as large as Google, see a green checkmark, and feel completely at ease. You trust it because it has weathered years of real-world load.

This correctness is not only private; it is also a moat that forms slowly, on a timeline capital cannot compress. Even optimists admit this clock cannot be skipped. Noam Brown, a pioneer of OpenAI's reasoning models, recently wrote that the only reliable way to evaluate an agent's performance over a one-year cycle might be to actually let it run for a year.

As Gabe Pereyra says, true automation isn't just models getting stronger. It's product, model, workflow, and company organization all changing together—and of these four, three move at the organization's pace.

Mobilizing people is the part no benchmark touches: convincing a skeptical partner to change how she handles matters, keeping a team cohesive during a rebuild. This is why, when hiring a CEO, we value their ability to handle people at least as much as their analytical ability. Models getting smarter won't change this weighting.

The feedback here is fuzzy, the time horizon is in years, and trust belongs to a specific person. Every company I know has already put frontier coding models in front of every engineer, yet none of their engineering organizations have changed at a pace close to model improvement. Adopting the tool took a quarter—and what a magical quarter of token growth that was! But real rebuilding takes years.

The work that can be read is leaving. The work that truly has value is structurally unreadable: anything you can put on a leaderboard can be trained for; therefore, anything measurable is already becoming commoditized. This process takes time and will never be entirely complete, but the direction never reverses.

To put it in monetary terms, borrowing a framing from my friend Matt MacInnis of Rippling: a token used just to answer a general question is nearly worthless, because anyone's model can answer it. But a token reasoning over your company's data is much more valuable, because it does what you actually want rather than just generating a plausible-sounding answer.

Readable work gets devoured from two directions.

From below, tasks saturate: once a task can be cheaply checked, buyers stop caring which model completed it and start asking how much it costs. This task then falls to the cheapest open-source or distilled model of the week. As long as margin pressure can work, it eventually will.

From above, labs are trying to make models swallow their own scaffolding. Retrieval, routing between cheap and expensive calls, tool use, even reasoning strategies—all the apparatus once wrapped around models is being pulled into the model weights, until the "wrapper" itself becomes the model. This is the absorption boundary.

Profit pressure also works from another direction: a general agent must be ready to handle anything, so it's expensive; a focused application can optimize a workflow to the point where it consumes only a fraction of the tokens. And unlike the labs selling those tokens, application companies can pocket the difference.

Therefore, we can ask two questions of any kind of work: Is its correctness private and costly—a truth that exists only inside a specific company's data? Is it isolated within a system outsiders cannot enter? Combine these questions with the saturation level of the task, and you get a 2x2 matrix.

Work that is already saturated with public answers is the realm of commoditized tokens; open-source models will own it. Frontier work with public answers, like coding benchmarks, is where labs will win, because when evaluation is free, owning it isn't worth much.

The real prize is the last corner—the "untrainable" corner: frontier work whose correctness exists only in a private environment. You can see this on the inference clouds serving AI-native pioneers: the vast majority of tokens are generated by custom models, not by general open-source models.
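
The 2x2 described above can be written down as a small lookup. This is a sketch under stated assumptions: it collapses the two questions about private correctness and system isolation into a single boolean, and it labels the fourth quadrant (saturated work with private correctness) as not described, since the text does not characterize it. The enum and function names are hypothetical.

```python
from enum import Enum


class Quadrant(Enum):
    COMMODITY_TOKENS = "saturated + public: cheapest open-source or distilled model wins"
    LAB_TERRITORY = "frontier + public: labs win"
    UNTRAINABLE = "frontier + private: the untrainable corner"
    NOT_DESCRIBED = "saturated + private: not characterized in the essay"


def classify(saturated: bool, private_correctness: bool) -> Quadrant:
    """Map a kind of work onto the 2x2.

    saturated: buyers no longer care which model does the task.
    private_correctness: "right" can only be checked inside one organization.
    """
    if private_correctness:
        return Quadrant.NOT_DESCRIBED if saturated else Quadrant.UNTRAINABLE
    return Quadrant.COMMODITY_TOKENS if saturated else Quadrant.LAB_TERRITORY


# Illustrative placements; these are this sketch's reading, not the author's data.
examples = {
    "public coding benchmark": classify(saturated=False, private_correctness=False),
    "change to a bank's production system": classify(saturated=False, private_correctness=True),
}
```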

The walls to this last corner vary in height. A developer's toy codebase is portable and standardized, so climbing in is not hard. But a bank's production system is neither portable nor standardized. Being 2% smarter on SWE-Bench Verified doesn't get you root access to it.

Capability will devour many things, but a better model won't turn private truth standards into public ones. It doesn't hold a license, sign for liability, own the company's files, or become the party to sue when the answer is wrong. The bottleneck here isn't intelligence; it's permission, and also responsibility. You can imagine a model far smarter than any human, but it still must be let in the door, and someone must still sign their name for what it does.

That door has a lock, and a bolt.

The lock is the environment: only after gaining trust within a system, passing security reviews, completing integration, and signing contracts with outcome liability can you verify whether AI truly does something useful.

The bolt is the users. Today, most American doctors open OpenEvidence daily. This cannot be bought with any amount of computing power. A lab could train a perfect medical model tomorrow, but it would still have no way into doctors' usage habits or UCSF's decision-making processes. Trust is built slowly, through relationships and users' acceptance; it cannot be produced by gradient descent.

This is precisely the work of application companies. An application stakes its place in the "untrainable" corner through unglamorous work: organizing a company's private reality so a model can act on it; giving the model tools to act with; working with clients to change how their workforce actually operates.

A company that can perform this "translation" is hard to replicate, and this translation never ends. Integration and maintenance persist alongside the client relationship. The teams that win this are those that put domain-specialist engineers and tools alongside the client.

For example, in a top-tier traditional law firm, the M&A practice alone handles nearly a thousand transactions a year. You can't have hundreds of paralegals download client files to their desktops and hand them to a general agent to read through. Confidentiality reasons alone prevent this, let alone a dozen other issues. Even if you could, you'd only learn fragments: one assistant correcting one thing at a time, no one seeing how an entire transaction flows.

The really important signal exists at the transaction level. A transaction has its own shape: for M&A, it's NDAs, term sheets, due diligence, purchase agreements, ancillary documents, closing checklists; for IP litigation, it's motions, discovery, prior art, more motions. Each practice area has its own structure; lawyers and tools cannot be freely swapped.

And the problem this law firm truly needs to solve is at an even higher level: how to run every practice area simultaneously, like a top partner managing hundreds of matters in parallel while bringing in new business and training associates. Transforming such a company is not a single problem you can write a benchmark for. It requires an operator to handle it like playing "data baseball": the intermediate goals are extremely vague, feedback is incomplete, cycles are very long, and the environment itself doesn't stand still.

Unfortunately, unreadable value is also hard to sell, for the same reason it's hard to commoditize: a company cannot judge from the outside whether an AI can transform its operations as well as benchmarks suggest. Therefore, the strongest companies stop trying to prove themselves externally. Instead, they first get inside the client, and then price for outcomes.

Sierra charges only when its agent resolves a customer's problem; if the issue is escalated to a human, it does not charge. The price itself thus becomes the evaluation mechanism. This works because Sierra owns the definition of "resolved." Cognition's Devin did the same in software, launching "performance guarantees." Only when you are trusted inside a system are you in a position to offer such outcome guarantees.
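
The idea that the price doubles as the evaluation can be sketched in a few lines. This is not Sierra's billing logic or API; the `Conversation` type, the `is_resolved` rule (no handoff to a human), and the price are all hypothetical, chosen only to show how owning the definition of "resolved" determines what gets billed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    conversation_id: str
    escalated_to_human: bool


def is_resolved(conversation: Conversation) -> bool:
    """The vendor's own definition of "resolved" (hypothetical: no human handoff)."""
    return not conversation.escalated_to_human


def invoice_cents(conversations: list[Conversation], price_cents: int) -> int:
    """Bill only for conversations that meet the resolution definition."""
    resolved = sum(1 for c in conversations if is_resolved(c))
    return resolved * price_cents


log = [
    Conversation("a1", escalated_to_human=False),
    Conversation("a2", escalated_to_human=True),
    Conversation("a3", escalated_to_human=False),
]
total = invoice_cents(log, price_cents=150)  # hypothetical price per resolution
```

Changing `is_resolved` changes the invoice without changing the model, which is why the party that defines "resolved" holds the pricing power.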

Even at the layer of providing token services—what everyone likes to call a pure commodity—it doesn't behave like one. The best AI-native companies concentrate their services with one or two providers, like Baseten or Fireworks. Because per-token cost will eventually commoditize, but reliability under real traffic and stable access to scarce compute power will not. Where you serve inference is a different choice from which models you use. The only truly commodity-like part of inference is the price.

A common counterargument is: the lab is your supplier. Why wouldn't it undercut you with its own first-party product, selling below cost to crush you? Or simply revoke your API access and take the market itself? This is the real version of that despair. But it only holds if the model layer is a single-player game.

Clearly, it is not. The model layer looks more like a death match among three and a half players, alongside a cohort of international players about six months behind in training progress, and a development league five times the size of last year's. Clients want competition among their suppliers, and labs want market share more than they want to kill any specific application.

You can see this in markets where labs compete head-to-head. In the consumer chat space, the best model has never simply won the entire market. ChatGPT has led through years of real competition; the share it is now losing goes to Gemini, driven by Android and search distribution, not a better model. Anthropic is currently perceived in prediction markets and internet vibes as having the best model, but it is hardly a major player in consumer chat; it has built its business in enterprise and coding scenarios.

If a better model cannot even take users away from a competitor in its core application, it will not easily absorb a hospital's medical records system or a bank's liability framework simply by integrating with it. Today, people choose products based on more than just coding ability. If the frontier model layer remains crowded, then the application layer above it holds value.

If a piece of work cannot be scored from the outside, then someone inside must decide what counts as a good answer. And that decision is the whole game. Enough such decisions written down become benchmarks. Harvey published legal benchmarks; Sierra published voice agent benchmarks. You earn the right to define what "good" means in a domain because that domain is already using you. And these companies earn that right through the hard struggles of real adoption.

The evaluations that truly determine where money flows are private and formed company by company: what this company, for this type of matter, will accept as good work. And this is far from complete, because the depth of law extends far beyond any public test. OpenEvidence is gradually codifying what defines a safe clinical answer.

None of this is "measurement" in the strict sense; it is judgment about what is true and what is good. These judgments are written down until they become the standard by which everyone else is measured. No matter how smart the foundation model labs become, they cannot write these standards out of thin air, because this authority exists only within the domain.

This authority tends to fall where it already exists. Senior lawyers write legal benchmarks. Doctors define safe clinical answers. What "resolved" means is decided by the company that already has the client relationship.

The absorption boundary will continue to rise, because we will keep learning to measure more work, and what can be measured will be devoured. The untrainable ground will shrink beneath the feet of those standing on it. So you cannot find a defensible position and stop. You must keep moving toward places that cannot yet be scored, and keep re-underwriting and re-judging the risk.

On a narrow task, using your private data and your own evaluation system, you can train to the frontier and beat general models in key scenarios; this specialized model becomes part of the moat. On the other hand, if you compete on general model capabilities, it's a war of capital, and you will lose to whoever has the most computing power. This is also the trap most easily fallen into by companies with only shallow access to highly readable tasks.

When a company, to survive, decides to train beyond frontier capabilities across a large set of general tasks, the outcome usually seems decided by data center scale. The end result is often not an independent champion, but a sale to a player with ample computing power.

All of the above is defense. Harder is offense: first deciding what to build. This is what I've been searching for this whole year, and I've only found it about three times. Models can't help here. They do whatever you point them at; but they cannot tell you what is worth pointing at. You cannot build a benchmark for it, therefore you cannot train for it.

This is also why the existing giants won't take everything: they will hold the territory they already have, while the next thing comes from someone who discovers a use case before others do. Perhaps intent is a scarcer input than compute.

This feeling of despair is half right. Thin wrappers are indeed being absorbed, and many things that look like companies today are indeed just thin wrappers. But it is wrong about what remains after absorption. The mechanism is clear, but the end point is not.

What I am willing to bet on is this direction: intelligence will continue to get cheaper, and value will continue to slide towards the few places models cannot reach. The untrainable is value with history.

So, enter one of these domains, do the unglamorous translation work, and start writing down what "good" means there. Because someone will. The most cited benchmark score of this year is actually a map of territory soon to be worthless, and a notice: a notice to some that they are about to lose the right to define what "good" is.

