After AI devours everything, what remains untrainable?
- Core Argument: As AI capabilities continue to leap forward, general-purpose models will consume all tasks that can be measured by benchmarks, trained on public data, and validated at low cost. True value and moats will reside in the "untrainable" domains: those that depend on enterprise proprietary data, complex workflows, user trust, industry judgment, and the accumulated value of real organizations built over time.
- Key Elements:
- Trainability Equals Commoditization: Any task measurable by benchmarks and verifiable at low cost (such as code writing) will be consumed by models and commoditized. Value will drain from these "readable" tasks.
- Proprietary Correctness is the Moat: Models cannot automatically acquire a bank's system permissions or a doctor's trust. True automation requires deep integration within an organization, handling private data, complex workflows, and long-accumulated experience that cannot be easily replicated from the outside.
- Trust and Responsibility Form Barriers: Models can generate answers, but they cannot bear responsibility for errors or hold industry licenses. Application companies, by winning customer trust and embedding themselves into their decision-making processes and systems, secure a position that algorithms cannot erode.
- Defining "Good" is Power: Whoever can define "acceptable work quality" in real-world services and establish private benchmarks holds the pricing power and moat in that domain (e.g., Harvey's legal benchmarks, Sierra's definition of customer resolution).
- Model Layer is Competitive, Value Exists in the Application Layer: The frontier model market is not a single dominant monopoly but remains a multi-player competitive landscape. Customers need vendor competition, and labs are unlikely to easily eliminate applications with deep integration capabilities, leaving room for value creation.
Original Title: The Untrainable
Original Author: Sarah Guo, Conviction
Original Translation: Peggy, BlockBeats
Editor's Note: As AI capabilities continue to leap forward, a new pessimistic view is emerging in the investment community: if models get increasingly powerful, all application companies will eventually be consumed by model and compute layer giants like Anthropic, OpenAI, and Nvidia, leaving the market with only frontier models, computing power, and a handful of infrastructure players. But Sarah Guo argues this judgment is only half right. Those "thin wrappers" – applications that simply wrap a model – will indeed be absorbed. Any task that can be measured by benchmarks, trained on public data, and validated at low cost will gradually become commoditized.
The real question is: after AI devours everything trainable, what remains untrainable?
The answer in this article lies in values that exist within real organizations and cannot be easily replicated from the outside: enterprise proprietary data, complex workflows, user trust, system permissions, industry judgment, compliance responsibility, and experience accumulated over long-term operations. A model can become smarter, but it cannot automatically enter a bank's production system. It can generate medical answers, but it cannot directly earn a doctor's trust or integrate into a hospital's decision-making process. It can write legal texts, but it cannot assume a senior lawyer's responsibility, nor can it unilaterally define what constitutes qualified legal work.
Therefore, the truly defensible AI companies of the future won't simply be smarter general-purpose models. Instead, they will delve deep into specific industries, undertaking the difficult but crucial work of "translation": organizing clients' private realities, tools, processes, and judgment standards into systems on which models can act, and gradually defining "what constitutes a good result" over long-term service. The stronger AI becomes, the more it devalues measurable, replicable tasks; and the more it highlights those "untrainable" elements – those with history, relationships, permissions, and professional judgment. This is the genuine value that might survive after the model consumes everything else.
Below is the original text:
By mid-2026, the investor version of "AI psychosis" is a sense of despair that there's nothing left worth investing in: we should just put all our money into Anthropic and Nvidia and go home to sleep. But I've never felt that way. For several model iterations now, I've been convinced models are smarter than me. I'd be happy to buy Anthropic and Nvidia at market prices. My smartest friends are fairly certain recursive self-improvement in models will truly take off soon. Yet, I still don't feel that despair.
This despair isn't stupid. Its logic runs like this: if models keep getting better at everything, then all companies built on top of them are just thin wrappers waiting to be absorbed. The only value that ultimately survives is computing power and frontier model weights.
Take software, the primary exhibit for this despair. When Devin launched in 2024, it could only solve 13% of tasks in standard software benchmarks, so the market largely dismissed it. A year and a half later, the best agents achieve over 80%, and are handling real work inside Goldman Sachs and the U.S. Army. Almost everyone drew the same wrong conclusion: the model had swallowed software engineering.
But as models devour the most measurable parts of software engineering, we are re-learning what many teams have long known: engineering has always resisted measurement, and what is easiest to measure is not the only thing that matters.
MIT's Mert Demirer and collaborators have finally quantified this: among over 100,000 developers, the latest generation of coding agents increased the sheer volume of code written by about 180%, but the amount of code actually deployed to production only increased by about 30%. Writing code became cheaper, but the remaining steps still require humans, and these steps are important. Of course, the overall net impact remains staggering.
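The gap between those two figures can be made concrete with a little arithmetic. The sketch below assumes both percentages are increases over the same baseline, normalized to 1.0; the variable names are illustrative and do not come from the study.

```python
# Illustrative arithmetic only: both baselines are normalized to 1.0 (an assumption).
written_growth = 1.80   # code written rose by about 180%
deployed_growth = 0.30  # code deployed to production rose by about 30%

written_after = 1.0 + written_growth    # 2.8x the old volume written
deployed_after = 1.0 + deployed_growth  # 1.3x the old volume deployed

# Share of written code that reaches production, relative to the old share.
relative_ship_rate = deployed_after / written_after

print(f"written: {written_after:.1f}x, deployed: {deployed_after:.1f}x")
print(f"relative ship rate: {relative_ship_rate:.2f}")
```

Under that assumption, a given unit of written code is less than half as likely to reach production as before, which is one way to read the claim that the remaining steps still depend on humans.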
A benchmark is something you can measure, and anything measurable can be trained on. That is why coding agents matured first: the compiler is a free verifier, and so is the test suite. When answers can be self-checked at near-zero cost, you can keep polishing around that signal until you break through.
But passing a test never means the change is the right one for a codebase that has been running for ten years. There might be three undocumented reasons why that module exists. The deployment pipeline might be held together by a cron job nobody wants to admit they wrote.
This kind of correctness cannot be read from a leaderboard, nor truly read from anything else. You have to let such a complex system run long enough in the real world to know if it works. And smarter models won't make the real world run faster. No one would run unit tests on a system as large as Google, see green checkmarks, and be completely satisfied. You trust it because it has withstood years of real-world load.
This correctness is not only proprietary; it is also a moat that forms slowly, and capital cannot compress the time it takes. Even optimists admit this clock cannot be skipped. Noam Brown, a pioneer of OpenAI's reasoning models, recently wrote that the only reliable way to evaluate an agent's performance over a year-long cycle might be to actually let it run for a year.
As Gabe Pereyra said, true automation isn't just models getting stronger. It's products, models, workflows, and company organizations changing together. And of these four, three move at the organization's pace.
Getting people on board is the part no benchmark touches: convincing a skeptical partner to change how she does things, keeping a team cohesive during a rebuild. This is why, when hiring a CEO, we value their ability to handle people at least as much as their analytic ability. Smarter models don't change this weighting.
The feedback here is ambiguous, the time horizons are years, and trust belongs to a specific person. Every company I know has put frontier coding models in front of every engineer, yet no company's engineering organization has changed at a pace anywhere near model progress. Tool adoption took a quarter – what a magical quarter of token growth! But true rebuilding takes years.
The readable work is leaving. The truly valuable work is structurally unreadable: anything you can put on a leaderboard can be trained on; therefore, anything measurable is already heading toward commoditization. This process takes time and is never complete, but its direction never reverses.
My friend Matt MacInnis (Rippling) frames this in monetary terms: a token used to answer a general question is worth almost nothing, because anyone's model can answer it. But a token reasoning over your company's data is much more valuable, because it's doing what you actually want, not just generating a plausible-sounding answer.
Readable work will be devoured from two directions.
From below, tasks saturate: once a piece of work can be cheaply verified, buyers stop caring which model completed it and start asking how much it costs. Consequently, the work falls to the cheapest open-source or distilled model of the week. As long as margin pressure can act, it eventually will.
From above, labs are trying to make models consume their own scaffolding. Retrieval, routing between cheap and expensive calls, tool use, even reasoning strategies – all the apparatus once wrapped around the model is being pulled into the model weights, until the "wrapper" itself becomes the model. This is the absorption boundary.
Profit pressure also acts from the other direction: a general-purpose agent must be ready for anything, making it expensive; a focused application can optimize a workflow to the extreme, consuming a tiny fraction of the tokens. And unlike labs selling those tokens, application companies can keep the difference for themselves.
Therefore, we can ask two questions of any piece of work: Is its correctness private, expensive, a truth that exists only within a single company's data? Is it isolated within a system that outsiders cannot enter? Combine these questions with the degree of task saturation, and you get a 2x2 matrix.
Work that is saturated and has public answers is the domain of commoditized tokens; open-source models will occupy it. Frontier work with public answers, like coding benchmarks, is where labs win – because when evaluation is free, possessing it isn't valuable.
The real prize is the last corner, the "untrainable" corner: frontier work whose correctness exists only in private environments. You can see this in the inference clouds serving AI-native early adopters: the vast majority of tokens are generated by custom models, not by general-purpose open-source models.
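The matrix described above can be written down as a small lookup. This is only a sketch of the framing: the function name `quadrant` and the label strings are illustrative, and the fourth cell (saturated work with private correctness) is left unlabeled because the text does not characterize it.

```python
def quadrant(saturated: bool, private_correctness: bool) -> str:
    """Map the two axes from the text to a quadrant label (labels are illustrative)."""
    if not private_correctness:
        if saturated:
            return "commoditized tokens (open-source models)"
        return "labs win (public frontier benchmarks)"
    if saturated:
        return "not characterized in the text"
    return "untrainable (the real prize)"


# Print all four cells of the 2x2.
for saturated in (True, False):
    for private in (False, True):
        label = quadrant(saturated, private)
        print(f"saturated={saturated!s:5} private={private!s:5} -> {label}")
```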
The walls to this last corner vary in height. A developer's toy codebase is portable and standardized, making it easy to climb over. But a bank's production system is neither portable nor standardized. Being 2% smarter on SWE-Bench Verified doesn't get you root access to it.
Capability will consume many things, but a better model won't turn a private truth standard into a public one. It doesn't hold licenses, sign for liability, own a company's files, or become the defendant when an answer is wrong. The bottleneck here isn't intelligence, but permission, and also responsibility. You can imagine a model far smarter than any human, but it still must be allowed in the door, and someone still has to sign their name for what it does.
That door has a lock and a bolt.
The lock is the environment: only after gaining trust within a system, passing security reviews, completing integration, and signing contracts with accountability for outcomes, can you verify if the AI is truly doing something useful.
The bolt is the user. Today, most U.S. doctors open OpenEvidence every day; this isn't something any amount of computing power can buy. A lab could train a perfect medical model tomorrow, but it still wouldn't have a way into doctors' habits or UCSF's decision-making process, because trust is built slowly, through relationships and user buy-in, and is not something gradient descent can erase.
This is precisely the work of application companies. An application secures its place in the "untrainable" corner through the unglamorous work: organizing a company's private reality so a model can act on it; giving the model the tools to act; working with a client to change how their workforce actually operates.
A company that can perform this kind of "translation" is hard to replicate, and this translation never ends. Integration and maintenance continue as long as the client relationship lasts. The teams that win this are the ones that put domain-specialized engineers and tools alongside their clients.
Example: in a top-tier legacy law firm, the M&A practice alone handles nearly a thousand deals a year. You can't have hundreds of paralegals downloading client documents to their desktops and feeding them to a general-purpose agent to read. Confidentiality alone would forbid it, not to mention a dozen other issues. Even if you could, you'd only learn fragments: one paralegal corrects a bit here, another there, and no one sees how an entire deal flows.
The truly critical signal exists at the deal level. A deal has its own shape: for M&A, it's NDA, term sheet, due diligence, purchase agreement, ancillary documents, closing checklist. For IP litigation, it's motions, discovery, prior art, more motions. Each practice area has its own structure; lawyers and tools cannot be swapped arbitrarily.
And the real problem the firm needs to solve is a level higher: how to run every practice area simultaneously, like top partners managing hundreds of matters in parallel while bringing in new clients and mentoring associates. Transforming such a firm isn't a single question you can create an evaluation task for. It requires a manager handling it like playing "data baseball": intermediate goals are extremely vague, feedback is incomplete, cycles are very long, and the environment itself isn't static.
Unfortunately, unreadable value is also hard to sell, for the same reason it's hard to commoditize: a company cannot judge from the outside whether AI will transform its operations the way a benchmark suggests. Therefore, the strongest companies stop trying to prove themselves externally and instead get inside the client first, then price based on outcomes.
Sierra only charges if its agent solves the client's problem; if the issue is escalated to a human, it doesn't charge. Thus, the price itself becomes the evaluation mechanism. And this works because Sierra holds the definition of what "resolved" means. Cognition's Devin does the same in software, launching "performance guarantees." Only when you are trusted inside a system can you offer such guarantees for outcomes.
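A minimal sketch of how outcome-based billing of this kind could be computed. It is not Sierra's actual billing logic: the `Conversation` type, the `invoice` function, and the price of 1.50 are all hypothetical, chosen only to illustrate that the invoice depends entirely on who defines "resolved."

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    resolved_by_agent: bool   # per the vendor's own definition of "resolved"
    escalated_to_human: bool  # escalations are never billed


def invoice(conversations: list[Conversation], price_per_resolution: float) -> float:
    """Charge only for conversations the agent resolved without escalation."""
    billable = sum(
        1 for c in conversations
        if c.resolved_by_agent and not c.escalated_to_human
    )
    return billable * price_per_resolution


# Hypothetical sample: two resolutions and one escalation.
sample = [
    Conversation(resolved_by_agent=True, escalated_to_human=False),
    Conversation(resolved_by_agent=True, escalated_to_human=False),
    Conversation(resolved_by_agent=False, escalated_to_human=True),
]
total = invoice(sample, price_per_resolution=1.50)
print(total)
```

Changing what sets `resolved_by_agent` changes the bill, which is why holding that definition matters more than the per-resolution price.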
Even token serving, which everyone loves to call a pure commodity, doesn't behave like one. The best AI-native companies concentrate their serving on one or two providers, like Baseten or Fireworks, because while cost per token will commoditize over time, reliability under real traffic and stable access to scarce compute won't. Where inference is served is a distinct choice from which models are used. The only truly commodity-like aspect of inference is its price.
A common rebuttal is: the lab is your supplier; why wouldn't it undercut you with its first-party product sold below cost, effectively strangling you? Or simply revoke your API access and take the market itself? This is the real version of that despair. But it only holds if the model layer is a single-player game.
Clearly, it is not. The model layer looks more like a three-and-a-half-player deathmatch, with a cohort of international players about six months behind in training progress, and a development league five times larger than last year. Clients want competition among their suppliers, and labs want market share more than they want to kill any specific application.
You can see this in markets where labs compete head-to-head. In consumer chat, the best model has never simply won the whole market. ChatGPT stayed ahead through years of real competition. The share it's now losing is flowing to Gemini, driven by Android and search distribution, not a better model. Anthropic is currently perceived in prediction markets and internet sentiment to have the best model, yet it's barely a player in consumer chat; it built its business in enterprise and coding scenarios.
If a better model can't steal users from a competitor in the most central application, it won't easily work its way into a hospital's medical records system or a bank's liability framework through sheer vertical integration. Today, the public chooses products based on more than just coding ability. If the frontier model layer remains crowded, the application layer above it will retain value.
If a piece of work cannot be scored externally, then someone internal must decide what constitutes a good answer. This decision is the entire game. Enough such decisions written down become a benchmark. Harvey published a legal benchmark; Sierra published a voice agent benchmark. You earn the right to define "good" in a domain because that domain is already using you. And these companies earned that right through the hard struggle of real adoption.
The evaluation that truly determines where money flows is private, formed company by company: what will *this* company, for *this* matter, accept as good work. And this is far from finished, because the depth of law extends far beyond any public test. OpenEvidence is defining what constitutes a safe clinical answer.
None of this is truly "measurement" in the objective sense; it's judgment about what is true and what is good. These judgments get written down until they become the standard by which everyone else is measured. No matter how smart the foundation model lab gets, it cannot write these standards out of thin air, because this authority only exists within the domain.
This authority often ends up where it originally resided. Senior lawyers write legal benchmarks. Doctors define safe clinical answers. What "resolved" means is decided by the company that already has the client relationship.
The absorption boundary will continue to rise, because we will keep learning to measure more work, and what is measurable will be consumed. The untrainable ground will shrink under the feet of those standing on it. So you cannot find a single defensible position and stop. You must keep moving toward places that cannot yet be scored, and constantly re-underwrite and reassess risk.
On a narrow task, with your proprietary data and your own evaluation system, you can train a model to frontier performance and beat general-purpose models in critical scenarios; this specialized model becomes part of the moat. Conversely, if you compete on general-purpose model capability, it's a capital war, and you will lose to whoever has the most compute. This is also the easiest trap for companies with only shallow access to highly readable tasks.
When a company decides to survive by training a model to surpass frontier capability on a large swath of general tasks, the outcome usually seems determined by data center size. The endpoint is often not an independent champion, but being sold to a player with ample compute.
All of the above is defense. The harder task is offense: deciding first what to build. This is what I've been looking for this past year, and I've found it perhaps three times. The model can't help here. Point it somewhere, and it does what you ask; but it cannot tell you where it is worth pointing. You cannot build a benchmark for this, and therefore you cannot train for it.
This is also why established giants won't take everything: they will defend what they already own, while the next thing comes from someone who discovers a use before others do. Perhaps intent is a scarcer input than compute.
Half of the despair is right. Thin wrappers are being absorbed, and many things that look like companies today are just thin wrappers. But it's wrong about what survives after absorption. The mechanism is clear, but the endpoint is not.
The bet I'm willing to make is in this direction: intelligence will continue to get cheaper, and value will continue to slide towards the few places models cannot reach. The untrainable is value with history.
So, enter one of these domains, do the unglamorous translation work, and start writing down what "good" means there. Because someone will. This year's most-cited benchmark score is really a map of value that will soon be worthless. It is also a notice to some that they are about to lose the right to define what counts as "good."


