After AI has consumed everything, what remains untrainable?

区块律动BlockBeats

Guest columnist

2026-06-11 05:44


Trust, Permission, Responsibility, and Industry Judgment

AI summary


Core Thesis: As AI capabilities continue to advance, general-purpose models will consume all tasks that can be measured by benchmarks, trained on public data, and validated at low cost. True value and moats will reside in the “untrainable” domains: those that depend on enterprise proprietary data, complex workflows, user trust, industry judgment, and the long-term accumulation of genuine organizational value.
Key Elements:
1. **Trainability Equals Commoditization**: Any task measurable by benchmarks and verifiable at low cost (e.g., code writing) will be consumed and commoditized by models. Value will drain from these “readable” tasks.
2. **Private Correctness is the Moat**: Models cannot automatically obtain a bank's system permissions or a doctor's trust. True automation requires deep integration within an organization, handling private data, complex workflows, and long-accumulated experience that cannot be easily replicated from the outside.
3. **Trust and Responsibility Form a Barrier**: Models can generate answers but cannot bear responsibility for errors or hold industry licenses. Application companies secure an unassailable position by winning customer trust and embedding themselves into their decision-making processes and systems.
4. **Defining “Good” Equals Power**: Whoever defines “acceptable work quality” in real-world services and establishes private benchmarks holds the pricing power and moat in that domain (e.g., Harvey’s legal benchmarks, Sierra’s definition of customer resolution).
5. **Intense Competition at the Model Layer, Value Exists at the Application Layer**: The frontier model market is not a single monopoly but remains a multi-player competitive landscape. Clients need supplier competition, and labs will not easily stifle applications with deep integration capabilities, leaving room for value creation.

Original Title: The Untrainable

Original Author: Sarah Guo, Conviction

Translation: Peggy, BlockBeats

Editor's Note: As AI capabilities continue to leap forward, a new pessimistic view is emerging in the investment community: if models become increasingly powerful, all application companies will eventually be consumed by the model and compute layers (Anthropic, OpenAI, Nvidia), leaving only frontier models, compute, and a handful of infrastructure providers in the market. However, Sarah Guo believes this assessment is only half right. The "thin wrappers" (applications that simply wrap around models) will indeed be absorbed. Any task that can be measured by benchmarks, trained on public data, or validated at low cost will gradually become commoditized.

The real question is: after AI consumes everything that is trainable, what remains untrainable?

The answer in this article lies in value that exists within real organizations and cannot be easily replicated from the outside: enterprise proprietary data, complex workflows, user trust, system permissions, industry judgment, compliance responsibilities, and experience accumulated over long-term operations. Models can become smarter, but they cannot automatically enter a bank's production systems; they can generate medical answers, but cannot directly earn a doctor's trust or navigate a hospital's decision-making processes; they can write legal texts, but cannot take on a senior lawyer's responsibility or unilaterally define what counts as qualified legal work.

Therefore, the truly moated AI companies of the future won't simply be smarter than general-purpose models. Instead, they will delve deep into specific industries to complete the difficult but crucial "translation" work: organizing a client's private reality, tools, processes, and judgment criteria into systems that models can act upon, and gradually defining what constitutes a "good result" through long-term service. The stronger AI becomes, the more it will devalue measurable, replicable tasks; and the more it will highlight those "untrainable" elements laden with history, relationships, permissions, and professional judgment. This is the true value that may remain after models have consumed everything else.

Below is the original text:

By mid-2026, the investor version of "AI psychosis" is a feeling of despair that there's nothing left worth investing in: it feels like we should just put all our money into Anthropic and Nvidia and go home to sleep. But I've never felt that way. For several minor versions now, I've been convinced that models are smarter than me; I'd be happy to buy Anthropic and Nvidia at market prices; and many of my smartest friends are quite confident that self-improvement of models will truly take off soon—yet I still don't feel that despair.

This despair isn't foolish. Its logic goes like this: if models keep getting better at everything, then all companies built on top of them are just thin shells waiting to be absorbed; the only value that ultimately survives will be compute and frontier model weights.

Take software as an example, the case this despair relies on most. When Devin was released in 2024, it could only solve 13% of tasks on standard software benchmarks, and was largely dismissed by the market. A year and a half later, the strongest agents are scoring over 80% and handling real work inside Goldman Sachs and the U.S. Army. Almost everyone drew the same wrong conclusion: models have swallowed software engineering.

But as models consume the most measurable parts of software engineering, we are also rediscovering what many teams have long known: engineering has always resisted measurement, and the most measurable parts are not necessarily the only important parts.

MIT's Mert Demirer and collaborators have finally quantified this: among over 100,000 developers, the latest generation of coding agents increased code writing by about 180%, but the amount of code actually deployed to production only increased by about 30%. Writing code became cheaper, but the remaining steps still require humans, and these steps matter. Of course, the overall net impact remains stunning.

A benchmark is something you can measure; and anything you can measure can be trained for. That's why coding agents matured first: the compiler is a free verifier, and so is the test suite. When answers can be self-checked at nearly zero cost, you can iterate around that check signal until you break through it.

But passing a test never means the change is correct for a codebase that has been running for ten years. There might be three reasons for that module's existence that no one documented; the deployment pipeline might be held together by a cron job no one wants to admit they wrote.

This kind of correctness cannot be read from a leaderboard, nor can it truly be read from anything else directly. You can only let a system of such complexity run in the real world long enough to know if it truly works. And smarter models don't make the real world run faster. No one would run unit tests on a system as large as Google, see a green checkmark, and feel completely confident. You trust it because it has withstood years of real-world load.

This correctness is not only private, but also a moat that forms slowly—a moat where capital cannot simply compress time. Even optimists admit this clock cannot be skipped. Noam Brown, a pioneer of OpenAI's reasoning models, recently wrote: the only reliable way to evaluate an agent's performance over a one-year period might be to let it run for a full year.

As Gabe Pereyra puts it, true automation isn't just models getting stronger. It's the product, the model, the workflow, and the organization all changing together—and three out of these four advance at the organization's pace.

Getting people to move is something no benchmark can touch: convincing a skeptical partner to change how she handles transactions, keeping a team cohesive during a rebuild. This is also why, when hiring a CEO, we value their ability to handle people at least as much as their analytical ability. Smarter models won't change this balance.

The feedback here is ambiguous, the time horizon is measured in years, and trust belongs to a specific person. Every company I know has put frontier coding models in the hands of every engineer, but not a single company's engineering organization has changed at a pace even close to model improvement. Adopting the tool took a quarter—and what a magical token-growth quarter that was! But real rebuilding takes years.

Readable work is leaving. Truly valuable work is structurally unreadable: anything you can put on a leaderboard can be trained for; therefore, anything measurable is heading towards commoditization. This process takes time and is never fully complete, but the direction never reverses.

To put it in monetary terms, borrowing from my friend Matt MacInnis of Rippling: a token used to answer a general question is nearly worthless because any model can answer it; but a token reasoning over your company's data is much more valuable because it's doing what you actually want, not just generating a plausible-looking answer.

Readable work gets eaten from two directions.

From below, tasks saturate: once a piece of work can be checked at low cost, buyers stop caring which model completed it and start asking how much it costs. So the work falls to the cheapest open-source or distilled model of the week. Where margins can be squeezed, they eventually will be.

From above, labs are experimenting with making models swallow their own scaffolding. Retrieval, routing between cheap and expensive calls, tool use, even reasoning strategies—all the apparatus once wrapped around models is being pulled into the model weights, until the "shell" itself becomes the model. This is the absorption boundary.

Profit pressure also works from another direction: a general-purpose agent must be ready to handle anything, making it expensive; a focused application can optimize a workflow to the extreme, consuming only a fraction of the tokens. And unlike the labs selling those tokens, the application company can keep the spread.

So we can ask two questions of any task: Is its correctness private and costly—a truth that exists only inside a specific company's data? Is it isolated within a system that outsiders cannot enter? Combining these questions with the task's degree of saturation gives us a 2x2 matrix.

A saturated task with a public answer is the domain of commoditized tokens; open-source models will dominate it. A frontier task with a public answer, like coding benchmarks, is where labs will win, because when evaluation is free, owning it isn't worth much.

The real prize is the last corner—the "untrainable" corner: frontier work whose correctness exists only in a private environment. You can see this on inference clouds serving AI-native early adopters: the vast majority of tokens are generated by custom models, not by general-purpose open-source models.
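The matrix described above can be sketched as a small classifier. This is only an illustration of the two axes as this article states them; the `Task` fields, the function name, and the example tasks are hypothetical, and the fourth corner (saturated work with private correctness) is not characterized in the text, so the code labels it as such rather than guessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    saturated: bool            # cheap models already do it well enough
    private_correctness: bool  # "correct" is only knowable inside one organization


def quadrant(task: Task) -> str:
    """Place a task in the 2x2 matrix (labels paraphrase the article)."""
    if not task.private_correctness:
        if task.saturated:
            return "commodity tokens: open-source models dominate"
        return "public frontier: labs win"
    if not task.saturated:
        return "untrainable corner"
    # Assumption: the article does not describe this corner.
    return "saturated private task: not characterized in the article"


if __name__ == "__main__":
    examples = [
        Task("answer a general question", saturated=True, private_correctness=False),
        Task("public coding benchmark", saturated=False, private_correctness=False),
        Task("change a bank's production system", saturated=False, private_correctness=True),
    ]
    for t in examples:
        print(f"{t.name}: {quadrant(t)}")
```

The point of the sketch is that model capability appears on only one axis; the other axis is a property of where the truth lives, which a smarter model does not change.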

The walls to this last corner vary in height. A developer's toy codebase is transferable and standardized, so climbing in isn't hard. But a bank's production system is neither transferable nor standardized. You don't get root access to it just by being 2% smarter on SWE-Bench Verified.

Capability will consume many things, but a better model won't turn a private truth standard into a public one. It doesn't hold a license, sign for liability, own the company's files, or become the defendant when an answer is wrong. The bottleneck here isn't intelligence; it's permission and accountability. You can imagine a model far smarter than any human, but it still must be allowed in the door, and someone must still sign their name for what it does.

That door has a lock and a bolt.

The lock is the environment: only after being trusted inside a system, passing security reviews, completing integration, and signing contracts with accountability for outcomes, can you verify if the AI truly does something useful.

The bolt is the user. Today, most American doctors open OpenEvidence daily, and no amount of compute can buy that. A lab could train a perfect medical model tomorrow, but it would still have no way into doctors' habits or UCSF's decision-making processes. Trust is built slowly, through relationships and users' acceptance; it cannot be produced by gradient descent.

This is precisely the work of application companies. An application secures its position in the "untrainable" corner by doing unglamorous work: organizing a company's private reality so models can act on it; handing tools to the model; working with clients to change how their workforce actually operates.

A company that can perform this "translation" is hard to replicate, and this translation never ends. Integration and maintenance persist alongside the client relationship. The teams that win are those that put domain-expert engineers and tools alongside their clients.

Take a top-tier traditional law firm. Its M&A practice alone handles nearly a thousand transactions a year. You can't have hundreds of paralegals download client files to their desktops and feed them to a general-purpose agent to read through. Confidentiality rules alone prohibit this, let alone a dozen other issues. Even if you could, you'd only learn fragments: one paralegal corrects one thing at a time, and no one sees how an entire transaction flows.

The truly important signal exists at the transaction level. A transaction has its own shape: for M&A, it's NDAs, term sheets, due diligence, purchase agreements, ancillary documents, closing checklists; for IP litigation, it's motions, discovery, prior art, more motions. Each practice area has its own structure, and lawyers and tools are not freely interchangeable.

And the real problem this law firm needs to solve is one level higher: how to run every practice area simultaneously, just as a top partner manages hundreds of matters in parallel while bringing in new clients and training associate lawyers. Transforming such a company is not a single question you can write a benchmark for. It requires an operator to handle it like a "data baseball" game: the intermediate goals are extremely vague, feedback is incomplete, the cycles are long, and the environment itself doesn't stand still.

Unfortunately, unreadable value is also hard to sell, for the same reason it's hard to commoditize: a company cannot judge from the outside whether AI can transform its operations as effectively as a benchmark suggests. So the strongest companies stop trying to prove themselves externally, enter the client's domain first, and then price based on outcomes.

Sierra charges only when its agent resolves a customer's issue; if the issue is escalated to a human, it doesn't charge. Thus, the price itself becomes the evaluation mechanism. This works because Sierra holds the definition of "resolved." Cognition's Devin does the same in software, offering "performance guarantees." Only when you are trusted inside a system can you offer such guarantees for outcomes.
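To make the pricing mechanism concrete, here is a minimal sketch of outcome-based billing as the paragraph above describes it: charge for a conversation only if the agent resolved it without escalation to a human. This is not Sierra's actual billing logic; the record fields, the function names, and the price are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    conversation_id: str
    resolved_by_agent: bool   # per the vendor's own definition of "resolved"
    escalated_to_human: bool


def is_billable(conv: Conversation) -> bool:
    """Bill only when the agent resolved the issue and no human took over."""
    return conv.resolved_by_agent and not conv.escalated_to_human


def invoice_cents(conversations: list[Conversation], price_cents: int) -> int:
    """Total charge in integer cents (avoids floating-point rounding)."""
    return sum(price_cents for c in conversations if is_billable(c))


if __name__ == "__main__":
    month = [
        Conversation("c1", resolved_by_agent=True, escalated_to_human=False),
        Conversation("c2", resolved_by_agent=False, escalated_to_human=True),
        Conversation("c3", resolved_by_agent=True, escalated_to_human=False),
    ]
    # Hypothetical price of 150 cents per resolved conversation.
    print(invoice_cents(month, price_cents=150))
```

Whoever sets `resolved_by_agent` controls both the invoice and the evaluation, which is why holding the definition of "resolved" matters so much in this model.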

Even the token-serving layer, which everyone likes to call a pure commodity, does not behave like one. The best AI-native companies concentrate their inference with one or two providers, like Baseten or Fireworks. The cost per token will trend towards commoditization over time, but reliability under real traffic and guaranteed access to scarce compute will not. Where you run inference is a different choice from which models you use. The only truly commodity-like part of inference is the price.

A common counterargument is: the labs are your suppliers. Why wouldn't they undercut you by selling their first-party products below cost, or revoke your API access and take the market for themselves? This is the real version of that despair. But it only holds if the model layer is a single-player game.

Clearly, it is not. The model layer looks more like a death match among three and a half players, alongside a batch of international players about six months behind in training progress, and a development league five times larger than last year. Customers want competition among their suppliers, and labs want market share more than they want to kill any specific application.

You can see this in markets where labs compete head-on. In consumer chat, the best model has never simply won the entire market. ChatGPT held its lead through years of real competition; the share it's now losing goes to Gemini, driven by Android and search distribution, not because the model is better. Anthropic is currently perceived in prediction markets and internet sentiment as having the best model, yet it is barely a major player in consumer chat, having built its business around enterprise and coding use cases.

If a better model can't even steal users from a competitor in its core application, it won't easily integrate into a hospital's medical records system or a bank's liability framework. Today, the public chooses products based on factors beyond coding ability. If the frontier model layer remains crowded, the application layer above it holds value.

If a task cannot be scored from the outside, someone inside must decide what constitutes a good answer. And that decision is the entire game. Enough such decisions written down become benchmarks. Harvey published legal benchmarks; Sierra published voice agent benchmarks. You earn the right to define what "good" means in a domain because the domain is already using you. And these companies earned that right through the hard struggle of real-world adoption.

The evaluations that truly determine where money flows are private and formed company by company: what will this company, for this type of matter, accept as good work. And this is far from finished, because the depth of law far exceeds any public test. OpenEvidence is cementing what constitutes a safe clinical answer.

None of this is really "measurement" in the truest sense; it's judgment about what is true and what is good. These judgments are written down until they become the standard against which everyone else is measured. No matter how smart the foundation model labs become, they cannot write these standards out of thin air, because this authority exists only within the domain.

This authority tends to fall where it already lies. Senior lawyers write legal benchmarks. Doctors define safe clinical answers. What "resolved" means is decided by the company that already has the client relationship.

The absorption boundary will continue to rise, because we will keep learning to measure more work, and what is measurable will be consumed. The untrainable ground will shrink beneath the feet of those standing on it, so you cannot find a defensible position and stop. You must keep moving towards places that cannot yet be scored, and continuously re-underwrite and re-judge risks.

On a narrow task, with your private data and your own evaluation system, you can train to the frontier and beat general-purpose models in key scenarios; this specialized model becomes part of the moat. On the other hand, if you compete on general-purpose model capability, it's a capital war, and you will lose to those with the most compute. This is the trap that most easily catches companies with only shallow access and highly readable tasks.

When a company decides, for survival, to train a capability exceeding frontier models across a broad set of general tasks, the outcome usually seems pre-determined by data center scale. The ending is often not an independent champion, but being sold to a player with ample compute.

All of the above is defense. Harder is offense: first deciding what to build. This is what I've been searching for all year, and I've probably only found it three times. The model can't help here. You point it at something, and it does it; but it cannot tell you what is worth pointing at. You can't build a benchmark for that, so you can't train for it.

This is also why incumbents won't take everything: they will hold their own territory, and the next thing will come from someone who discovers a use case before anyone else. Perhaps intent is a scarcer input than compute.

Half of this despair is correct. Thin shells are indeed being absorbed, and many things that look like companies today are indeed just thin shells. But it is wrong about what remains after absorption. The mechanism is clear, but the endpoint is not.

The direction I'm willing to bet on is this: intelligence will continue to get cheaper, and value will continue to slide towards the few places models cannot reach. The untrainable thing is value with history.

So, enter one of these domains, do the unglamorous translation work, and begin writing down the definition of "good" there. Because someone will. This year's most-cited benchmark score is actually a map to territory about to become worthless, and also a notice: a notice to some that they are about to lose the right to define what "good" means.

