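After AI Has Swallowed Everything, What Remains Untrainable?
- Core view: As AI capabilities keep leaping forward, general-purpose models will swallow every task that can be measured by benchmarks, trained on public data, and verified at low cost. Real value and moats will lie in the "untrainable" territory: areas that depend on a company's private data, complex workflows, user trust, industry judgment, and value accumulated over time inside real organizations.
- Key elements:
- **Trainability is commoditization**: Any task that can be measured by benchmarks and verified at low cost (for example, writing code) will be swallowed by models and commoditized. Value will drain away from this "readable" work.
- **Private correctness is the moat**: A model cannot automatically obtain a bank's system permissions or a doctor's trust. Real automation has to reach deep inside an organization and work with private data, complex workflows, and long-accumulated experience, none of which can simply be replicated from outside.
- **Trust and responsibility form the barrier to entry**: A model can generate answers, but it cannot take responsibility for mistakes or hold an industry license. Application companies earn customers' trust and enter their decision-making processes and systems, securing a position that algorithms cannot take from them.
- **Defining the standard of "good" is power**: Whoever defines "acceptable work quality" in real service and builds private benchmarks controls pricing power and the moat in that field (for example, Harvey's legal benchmarks and Sierra's definition of a resolved customer issue).
- **Competition at the model layer is fierce, but value exists at the application layer**: The frontier model market is not monopolized by a single giant; many players still compete. Customers need competition among suppliers, and labs do not easily choke off applications with deep integration, so room for value creation remains.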
Original Title: The Untrainable
Original Author: Sarah Guo, Conviction
Original Compilation: Peggy, BlockBeats
Editor's Note: As AI capabilities continue to advance, a new pessimistic view is emerging in the investment world: if models become increasingly powerful, all application companies will eventually be consumed by model and compute layers like Anthropic, OpenAI, and Nvidia, leaving only cutting-edge models, compute power, and a few infrastructure pieces in the market. However, Sarah Guo argues this judgment is only half right. Those "thin wrappers" (simple applications built on top of models) will indeed be absorbed, and any task that can be measured by benchmarks, trained on public data, and validated at low cost will gradually become commoditized.
The real question is: after AI consumes everything trainable, what remains untrainable?
This article's answer lies in the kinds of value that exist inside real organizations and cannot be easily replicated from outside: enterprise proprietary data, complex workflows, user trust, system permissions, industry judgment, compliance responsibility, and experience accumulated through long-term operations. Models can become smarter, but they cannot automatically enter a bank's production system; they can generate medical answers, but cannot directly earn doctors' trust or enter a hospital's decision-making process; they can write legal texts, but cannot take on the responsibility that senior lawyers bear or unilaterally define what counts as qualified legal work.
Therefore, truly defensible AI companies in the future will not simply be smarter than general-purpose models. Instead, they will delve deep into specific industries to complete the difficult but critical task of "translation": organizing the client's private reality, tools, processes, and judgment standards into a system the model can act upon, and gradually defining "what constitutes a good result" through long-term service. The stronger AI becomes, the more it will devalue measurable and replicable tasks; simultaneously, it will highlight those "untrainable" elements carrying history, relationships, permissions, and professional judgment. This is the true value that may remain after the model consumes everything.
Below is the original text:
By mid-2026, the investor version of "AI psychosis" is a feeling of despair that there's nothing left worth investing in: we should just put all our money into Anthropic and Nvidia and go home to sleep. But I've never felt that way. For several minor versions now, I've been convinced models are smarter than me; I'd be happy to buy Anthropic and Nvidia at market prices; my smartest friends are fairly certain that model self-improvement will soon truly work – yet I still don't feel that despair.
This despair isn't stupid. Its logic is: if models continuously get better at everything, then all companies built on top of them are just thin shells waiting to be absorbed; the only remaining value will be compute power and frontier model weights.
Take software as an example, the case this despair relies on most. When Devin was released in 2024, it could solve only 13% of tasks on standard software benchmarks, so the market largely dismissed it. A year and a half later, the best agents score above 80% and are handling real work inside Goldman Sachs and the U.S. Army. Almost everyone reached the same wrong conclusion: the model had swallowed software engineering.
But as models consume the most measurable parts of software engineering, we are also rediscovering something many teams knew all along: engineering has always resisted measurement, and the most measurable parts are not necessarily the only important ones.
MIT's Mert Demirer and collaborators finally quantified this: among over 100,000 developers, the latest generation of coding agents increased code writing volume by roughly 180%, but the amount of code actually shipped to production only increased by about 30%. Writing code became cheaper, but the remaining steps still require humans, and these steps are important. Of course, the overall net impact is still astonishing.
Benchmarks are things you can measure, and anything measurable can be used for training. That is why coding agents matured first: compilers are free validators, and so are test suites. When an answer can check itself at almost zero cost, you can iterate against that signal until you break through.
But passing tests never means the change is correct for a codebase that has been running for ten years. There might be three undocumented reasons why that module exists; the deployment pipeline might be held together by a cron job no one wants to admit writing.
This kind of correctness cannot be read from leaderboards, nor, really, from anything else you can inspect directly. You have to let such a complex system run in the real world long enough to know whether it actually works. Smarter models won't make the real world run faster. No one would fully trust a system as large as Google's just because its unit tests show a green checkmark. You trust it because it has withstood years of real-world load.
This correctness is not only private; it is also a slowly forming moat, one in which capital cannot directly compress time. Even optimists admit this clock cannot be skipped. Noam Brown, a pioneer of OpenAI's reasoning models, recently wrote: the only reliable way to evaluate an agent's performance over a year-long period might be to let it run for a full year.
As Gabe Pereyra said, true automation isn't just models getting stronger. It's the product, the model, the workflow, and the organization all changing together, and three out of these four advance at the speed of the organization.
Getting people moving is part of the process that no benchmark touches: convincing a skeptical partner to change how she handles matters, keeping a team cohesive during a rebuild. This is why, when hiring a CEO, we value people skills at least as much as analytical skills. Smarter models won't change this weighting.
The feedback here is ambiguous, the time horizon is measured in years, and trust belongs to a specific person. Every company I know has every engineer using frontier coding models, yet no company's engineering organization has changed at a pace close to model progress. Adopting the tool took a quarter – what a magical quarter of token growth! But true rebuilding takes years.
Readable work is leaving. Truly valuable work is structurally unreadable: anything you can put on a leaderboard can be trained for; therefore, anything measurable is already heading towards commoditization. This process takes time and will never be complete, but the direction never reverses.
My friend Matt MacInnis at Rippling puts this in monetary terms: a token used to answer a general question is almost worthless, because anyone's model can answer it; but a token that reasons over your company's proprietary data is much more valuable, because it does what you actually want rather than just generating a plausible answer.
Readable work will be consumed from two directions.
From below, tasks saturate: once a task can be checked at low cost, buyers stop caring which model completed it and start asking how much it costs. The task then falls to the cheapest open-source or distilled model of the week. Wherever margins can be competed away, they eventually will be.
From above, labs are trying to make models consume their own scaffolding. Retrieval, routing between cheap and expensive calls, tool use, even reasoning strategies – all the apparatus once wrapped around models is being pulled into the model weights until the "wrapper" itself becomes the model. This is the absorption boundary.
Profit pressure also works in another direction: a general-purpose agent must be ready to handle anything, so costs are high; a focused application can optimize a workflow to the extreme, consuming only a fraction of the tokens. And unlike labs selling these tokens, application companies can keep the margin differential.
Therefore, we can ask two questions about any task: Is its correctness private and costly, a truth existing only within a specific company's data? Is it isolated within a system outsiders cannot access? Combining these questions with the task's saturation level yields a 2x2 matrix.
Saturated tasks with public answers are the domain of commoditized tokens, occupied by open-source models. Frontier tasks with public answers, like coding benchmarks, are where labs win, because when evaluation is free, possessing it isn't valuable.
The real prize is the last corner, the "untrainable" corner: frontier work whose correctness only exists in a private environment. You can see this on inference clouds serving AI-native pioneers: the vast majority of tokens are generated by custom models, not general open-source ones.
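The 2x2 above can be written down as a small lookup. This is only an illustrative sketch: the `Task` fields and the quadrant labels are hypothetical names chosen here, and the article describes three of the four corners, so the fourth is marked as not characterized rather than filled in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of a task along the article's two axes."""
    name: str
    saturated: bool            # cheap to check, so buyers shop on price
    private_correctness: bool  # "correct" exists only inside one company's environment


def quadrant(task: Task) -> str:
    """Map a task to the corner of the 2x2 the article describes."""
    if not task.private_correctness:
        if task.saturated:
            return "commodity tokens (open-source models)"
        return "labs win (frontier work, public evaluation)"
    if not task.saturated:
        return "untrainable corner (frontier work, private correctness)"
    # The article does not describe this corner; left unlabeled on purpose.
    return "not characterized in the article"


if __name__ == "__main__":
    examples = [
        Task("answer a general question", saturated=True, private_correctness=False),
        Task("public coding benchmark", saturated=False, private_correctness=False),
        Task("change a bank's production system", saturated=False, private_correctness=True),
    ]
    for t in examples:
        print(f"{t.name}: {quadrant(t)}")
```

The example tasks are taken from the article's own illustrations; where a real task falls is a judgment call, which is part of the article's point.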
The walls around this last corner vary in height. A developer's toy codebase is portable and standardized, so it's not hard to climb in. But a bank's production system is neither portable nor standardized. Being 2% smarter on SWE-Bench Verified won't get you root access to it.
Capability consumes many things, but better models won't turn private standards of truth into public ones. Models don't hold licenses, sign for liability, own the company's files, or get sued when answers are wrong. The bottleneck here isn't intelligence; it's permission and responsibility. You can imagine a model far smarter than any person, but it must still be allowed in the door, and someone must still sign their name for what it does.
That door has a lock and a bolt.
The lock is the environment: only after gaining trust within a system, passing security reviews, completing integration, and signing contracts with outcome liability, can you verify if AI actually does useful things.
The bolt is the user. Today, most American doctors open OpenEvidence daily. This cannot be bought with any amount of compute. A lab could train a perfect medical model tomorrow, but it still has no way into doctors' usage habits or UCSF's decision-making process. Trust is built slowly, through relationships and user acquiescence, not erased by gradient descent.
This is precisely the work of application companies. An application occupies its spot in the "untrainable" corner through unglamorous work: organizing a company's private reality so the model can act on it; giving action tools to the model; working with clients to change how their workforce actually operates.
A company that can perform this kind of "translation" is hard to replicate, and this translation never ends. Integration and maintenance continue alongside the client relationship. The teams that win this are those that put domain-specific engineers and tools close to the client.
For example, in a top-tier legacy law firm, the M&A practice alone handles nearly a thousand transactions annually. You can't have hundreds of paralegals download client files to their desktops and feed them to a general agent to read through. Confidentiality reasons alone forbid it, let alone a dozen other problems. Even if you could, you'd only learn fragments: one paralegal correcting one thing at a time, no one seeing how an entire transaction flows.
The truly important signal exists at the transaction level. A transaction has a shape: for M&A, it's NDA, term sheet, due diligence, purchase agreement, ancillary documents, closing checklist; for IP litigation, it's motions, discovery, prior art, more motions. Each practice area has its own structure; lawyers and tools are not interchangeable.
And the law firm's real problem is at an even higher level: how to run every practice area simultaneously, just as top partners manage hundreds of matters in parallel while bringing in new business and training associates. Rebuilding such a firm is not a single problem you can write an evaluation task for. It requires an operator to approach it like playing "data baseball": intermediate goals are extremely vague, feedback is incomplete, cycles are long, and the environment itself doesn't stand still.
Unfortunately, unreadable value is also hard to sell, for the same reason it's hard to commoditize: a company cannot externally judge whether AI can transform its operations as benchmarks suggest. Therefore, the strongest companies stop trying to prove themselves externally, enter the client's internal environment first, and then price based on outcomes.
Sierra only charges when its agent resolves a client's issue; if the issue is escalated to a human, it doesn't charge. Thus, the price itself becomes the evaluation mechanism. This works because Sierra owns the definition of "resolved." Cognition's Devin does the same in software, launching a "performance guarantee." Only when you are trusted to enter a system's interior can you offer such guarantees for outcomes.
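To make the "price as evaluation" idea concrete, here is a minimal sketch of outcome-based billing. It is not Sierra's actual billing logic: the `Conversation` fields, the `is_billable` rule, and the price are all hypothetical, chosen only to mirror the rule stated above (charge when the agent resolves the issue, do not charge when it is escalated to a human).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    """Hypothetical record of one customer conversation."""
    resolved: bool   # the vendor's own definition of "resolved" was met
    escalated: bool  # the issue was handed off to a human


def is_billable(conv: Conversation) -> bool:
    """Bill only for issues the agent resolved without human escalation."""
    return conv.resolved and not conv.escalated


def invoice_cents(conversations: list[Conversation], price_cents: int) -> int:
    """Total charge in cents; integer arithmetic avoids float rounding."""
    return price_cents * sum(1 for c in conversations if is_billable(c))


if __name__ == "__main__":
    month = [
        Conversation(resolved=True, escalated=False),   # billed
        Conversation(resolved=False, escalated=True),   # not billed
        Conversation(resolved=True, escalated=True),    # not billed: a human stepped in
    ]
    print(invoice_cents(month, price_cents=150))
```

Everything that matters commercially sits inside `is_billable`: whoever writes that predicate owns the definition of "resolved", which is the article's argument about why the pricing only works from inside the client's system.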
Even at the layer of providing token services – which everyone likes to call a pure commodity – it doesn't behave like one. The best AI-native companies concentrate their services on one or two providers, like Baseten or Fireworks. Cost per token will commoditize over time, but reliability under real traffic and stable access to scarce compute will not. Where to serve inference is a different choice from which models to use. The only truly commodity-like part of inference is price.
A common rebuttal is: if the lab is your supplier, why won't it undersell you with its first-party product and drive you out of business? Or simply revoke your API access and take the market itself? That's the real version of that despair. But it only holds if the model layer is a single-player game.
Clearly, it's not. The model layer is more like a three-and-a-half-player death match, with a cohort of international players about six months behind in training progress, and a development league five times larger than last year. Clients want competition among their suppliers, and labs want market share more than they want to kill any specific application.
You can see this in markets where labs compete head-on. In the consumer chat space, the best model never simply won the entire market. ChatGPT has maintained its lead through years of real competition; the share it's now losing goes to Gemini, driven by Android and search distribution, not a better model. Anthropic is currently considered by prediction markets and internet sentiment to have the best model, yet it's barely a major player in consumer chat; it built its business in enterprise and coding scenarios.
If a better model can't take users from competitors in its core application, it won't easily integrate into a hospital's medical records system or a bank's liability framework through absorption. Today, the public chooses products based on more than just coding ability. As long as the frontier model layer remains crowded, the application layer above it holds value.
If a task cannot be scored externally, then someone internally must decide what constitutes a good answer. And this decision is the entire game. Enough such decisions written down become benchmarks. Harvey released legal benchmarks; Sierra released voice agent benchmarks. You earn the right to define "good" in a domain because the domain is already using you. And these companies earned that right through the hard struggle of real adoption.
The evaluations that truly determine where money flows are private and formed company-by-company: what this company, for this type of matter, will accept as good work. And this is far from complete, because the depth of law far exceeds any public test. OpenEvidence is codifying what constitutes safe clinical answers.
None of this is truly "measurement" in the usual sense; it's judgment about what is true and what is good. These judgments are written down until they become standards against which everyone else is measured. No matter how smart the foundational model lab becomes, it cannot write these standards out of thin air, because this status exists only within the domain.
This authority tends to stay where it already exists. Senior lawyers write legal benchmarks. Doctors define safe clinical answers. What "resolved" means is determined by the company that already has the client relationship.
The absorption boundary will continue to rise, because we will keep learning to measure more work, and what is measurable will be consumed. The untrainable ground will shrink under the feet of those standing on it, so you can't find a defensible position and stop. You must keep moving towards places that cannot yet be scored, and continuously reassess and rejudge risk.
On a narrow task, with your proprietary data and your own evaluation system, you can train to frontier level and beat general models in key scenarios; this specialized model becomes part of the moat. On the other hand, if you compete on general model capability, it's a capital war you will lose to whoever has the most compute. This is the trap most easily fallen into by companies with only shallow access and highly readable tasks.
When a company decides to survive by training to surpass frontier capabilities on a large set of general tasks, the outcome usually seems preordained by data center scale. The final result is often not an independent champion, but being sold to a player with ample compute.
All of the above is defense. The harder part is offense: first deciding what to build. This is what I've been searching for this past year, and I've probably found it only three times. Models are no help here. They do whatever you point them at; they can't tell you what's worth pointing at. You can't build benchmarks for it, therefore you can't train for it.
This is also why incumbents won't take everything: they will hold onto their established territories, and the next thing will come from someone who discovers the use case before others. Perhaps intent is a scarcer input than compute.
This feeling of despair is half right. Thin shells are being absorbed, and many things that look like companies today are indeed just thin shells. But it's wrong about what remains after absorption. The mechanism is clear, but the endpoint is not.
My bet is on this direction: intelligence will continue to become cheaper, and value will continue to slide towards a few places models cannot reach. The untrainable thing is value with history.
So, enter one of these domains, do the unglamorous translation work, and start writing down what "good" means there. Because someone will. The most cited benchmark scores this year are actually a map to territory that will soon become worthless, and a notice: a notice to some that they are about to lose the right to define what "good" is.


