AI has swallowed everything; what is left that cannot be trained?
- Core thesis: As AI capabilities keep leaping ahead, general-purpose models will absorb every task that can be measured by benchmarks, trained on public data, and verified at low cost. The real value and the defensive moat will lie in the "untrainable" areas: value intrinsic to an organization that depends on a company's private data, complex workflows, user trust, industry judgment, and long-term accumulation.
- Key points:
- **Trainability means commoditization**: Any task that can be measured by benchmarks and verified at low cost (for example, writing code) will be absorbed by models and become a commodity. Value will shift away from this "readable" work.
- **Proprietary correctness is the moat**: A model cannot automatically obtain a bank's system permissions or a doctor's trust. True automation requires going deep inside an organization and handling private data, complex workflows, and long-accumulated experience, none of which can be simply copied from outside.
- **Trust and accountability create barriers**: A model can generate answers, but it cannot take responsibility for errors or hold a professional license. Application companies, by winning customers' trust and embedding themselves in customers' decision-making processes and systems, gain a position that algorithms cannot take away.
- **Defining "good" is power**: Whoever can define "acceptable work quality" in real-world service and establish private benchmarks holds pricing power and a moat in that domain (for example, Harvey's legal benchmarks and Sierra's definition of a customer resolution).
- **Competition at the model layer is fierce; value at the application layer persists**: The frontier-model market is not a single giant's monopoly but remains a multi-player competitive landscape. Customers need competition among suppliers, and labs will not readily strangle applications capable of deep integration, which leaves room for value creation.
Original title: The Untrainable
Original author: Sarah Guo, Conviction
Original translation: Peggy, BlockBeats
Editor's note: As AI capabilities continue to advance, a new pessimistic view is emerging in investment circles: if models become increasingly powerful, all application companies will eventually be consumed by model and compute layers like Anthropic, OpenAI, and Nvidia, leaving only frontier models, computing power, and minimal infrastructure. However, Sarah Guo argues this view is only half right. Those "thin wrappers" (simple applications wrapping models) will indeed be absorbed, and any task that can be measured by benchmarks, trained on public data, or validated at low cost will gradually become commoditized.
The real question is: after AI devours everything trainable, what remains untrainable?
The answer, according to this article, lies in value embedded within real organizations that cannot be easily replicated externally: enterprise proprietary data, complex workflows, user trust, system permissions, industry judgment, compliance responsibilities, and experience accumulated over long-term operations. Models can become smarter, but they cannot automatically access a bank's production system; they can generate medical answers, but cannot directly earn a doctor's trust or integrate into a hospital's decision-making process; they can write legal texts but cannot replace senior lawyers in taking responsibility or arbitrarily define what constitutes qualified legal work.
Therefore, the truly defensible AI companies of the future are not simply smarter than general-purpose models. Instead, they delve deep into specific industries to perform the difficult but critical "translation" work: organizing a client's proprietary reality, tools, processes, and judgment criteria into a system models can act upon, and over long-term service, gradually defining "what constitutes a good result." The stronger AI becomes, the more it devalues measurable, replicable tasks, and the more it highlights those "untrainable assets" steeped in history, relationships, permissions, and professional judgment. These are the real values that may persist after the model's devouring.
Below is the original text:
By mid-2026, the investor version of "AI psychosis" is a desperate feeling that there's nothing left worth investing in: we should just put all our money into Anthropic and Nvidia and go home to sleep. But I've never felt this way. For several point releases now, I've been convinced that models are already smarter than me; I would be happy to buy Anthropic and Nvidia at market prices; and many of my smartest friends are reasonably certain that self-improvement in models will soon truly take off. Yet I still lack this sense of despair.
This despair isn't stupid. Its logic goes like this: if models keep getting better at everything, then all companies built on top of them are just thin layers waiting to be absorbed. Ultimately, the only value left will be compute power and frontier model weights.
Take software, the case this despair relies on most. When Devin launched in 2024, it could only solve 13% of tasks in standard software benchmarks, and was largely dismissed by the market. A year and a half later, the best agents can achieve scores over 80% and are handling real work inside Goldman Sachs and the U.S. Army. Almost everyone reached the same wrong conclusion: the model swallowed software engineering.
But as models swallowed the most measurable part of software engineering, we are also rediscovering something many teams knew all along: engineering has always resisted measurement, and the most measurable parts are not necessarily the only important ones.
MIT's Mert Demirer and his collaborators finally quantified this: among over 100,000 developers, the latest generation of coding agents increased code writing output by about 180%, but the amount of code actually deployed to production only increased by about 30%. Writing code became cheaper, but the remaining steps still require human involvement, and these steps matter. Of course, the overall net impact remains staggering.
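To make the gap between writing and shipping concrete, here is a small back-of-the-envelope sketch using only the two percentages quoted above. It assumes, as a simplification not stated in the study, that written and deployed code are measured in comparable units; the variable names are illustrative.

```python
# Figures quoted in the text: code written up ~180%, code deployed up ~30%.
written_gain = 1.80   # +180% code written
deployed_gain = 0.30  # +30% code reaching production

written_multiplier = 1 + written_gain    # about 2.8x the pre-agent baseline
deployed_multiplier = 1 + deployed_gain  # about 1.3x the pre-agent baseline

# Deployed code per unit of written code, relative to the baseline.
# Assumption: both quantities are measured in the same units.
relative_ship_rate = deployed_multiplier / written_multiplier

print(f"written: {written_multiplier:.1f}x, deployed: {deployed_multiplier:.1f}x")
print(f"deployed-per-written vs. baseline: {relative_ship_rate:.2f}")
```

Under that assumption, each unit of written code is a bit less than half as likely to reach production as before, which is one way to read "the remaining steps still require human involvement."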
A benchmark is something you can measure; and anything measurable can be trained for. That's why coding agents matured first: compilers are free validators, test suites are free validators. When answers can be checked at nearly zero cost, you can iterate around that signal until you break through.
But passing a test never means a change is correct for a codebase that has been running for a decade. There might be three undocumented reasons why that module exists; the deployment pipeline might be held together by a cron job no one wants to admit they wrote.
This kind of correctness cannot be read from a leaderboard, nor truly from anything else. You can only know if such a complex system truly works by letting it run in the real world long enough. And smarter models don't make the real world run faster. No one would run unit tests on a system as large as Google, see a green checkmark, and be completely satisfied. You trust it because it has withstood years of real-world load.
This correctness is not only proprietary but also a slowly forming moat, one whose timeline capital cannot directly compress. Even optimists acknowledge this clock cannot be skipped. Noam Brown, a pioneer of OpenAI's reasoning models, recently wrote that the only reliable way to evaluate an agent's performance over a one-year cycle might be to let it actually run for a year.
As Gabe Pereyra said, true automation is not just models becoming stronger. It's products, models, workflows, and company organizations changing together, and three out of these four move at the pace of the organization.
Getting people moving is the part no benchmark can touch: convincing a skeptical partner to change how she handles matters, keeping a team cohesive during a rebuild. This is why, when hiring a CEO, we value people skills at least as much as analytical skills. Smarter models won't change this weighting.
The feedback here is vague, the time horizon is years, and trust belongs to specific individuals. Every company I know has put frontier coding models in the hands of every engineer, yet none of their engineering organizations have changed at a pace close to model progress. Adoption of the tool took a quarter—what a magical quarter of token growth! But true rebuilding takes years.
Readable work is leaving. Truly valuable work is structurally unreadable: anything you can put on a leaderboard can be trained for; therefore, anything measurable is already on the path to commoditization. This process takes time and is never fully complete, but the direction never reverses.
To put it in terms of money, as my friend Matt MacInnis of Rippling says: a token used to answer a general question is almost worthless because any model can answer it. But a token reasoning over your company's data is far more valuable because it does something you really want, not just generates a plausible-sounding answer.
Readable work gets consumed from two directions.
From below, tasks saturate: once a task can be checked at low cost, buyers stop caring which model completed it and start asking how much it costs. The work then falls to the cheapest open-source or distilled model of the week. As long as profit margins can work, they eventually will.
From above, labs are trying to make models swallow their own scaffolding. Retrieval, routing between cheap and expensive calls, tool use, even reasoning strategies—all the equipment once wrapped around the model is being pulled into the model weights until the "wrapper" itself becomes the model. This is the absorption boundary.
Profit pressure also works the other way: a general-purpose agent must be ready to handle anything, hence costly; a focused application can optimize a workflow to the extreme, consuming only a fraction of the tokens. And unlike the labs selling those tokens, application companies can pocket the difference.
Therefore, we can ask any task two questions: Is its correctness proprietary, costly, a truth that exists only inside a certain company's data? Is it isolated within a system no outsider can access? Combining these questions with the task's level of saturation yields a 2x2 matrix.
Tasks that are already saturated with public answers belong to the commodity token zone—open-source models will dominate them. Frontier work with public answers, like coding benchmarks, is where labs will win because when evaluation is free, owning it holds little value.
The real prize is the last corner, the "untrainable" corner: frontier tasks whose correctness exists only in proprietary environments. You can see this on inference clouds serving AI-native pioneers: the vast majority of tokens are generated by custom models, not general-purpose open-source models.
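The matrix described above can be sketched as a tiny classifier. This is only an illustration of the essay's framing: the `Task` fields, the function name, and the example tasks are hypothetical, and the text names only three of the four corners, so the fourth is deliberately left undefined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    name: str
    saturated: bool          # models no longer differ meaningfully on this task
    proprietary_truth: bool  # correctness lives only inside one company's data or systems


def likely_winner(task: Task) -> Optional[str]:
    """Map a task to the corner of the 2x2 the essay describes."""
    if not task.proprietary_truth:
        if task.saturated:
            return "commodity tokens (open-source models)"
        return "frontier labs"
    if not task.saturated:
        return "the untrainable corner (application companies)"
    # Saturated + proprietary: the essay does not say who wins here.
    return None


# Hypothetical examples, chosen to mirror the ones in the text.
examples = [
    Task("answer a general question", saturated=True, proprietary_truth=False),
    Task("public coding benchmark", saturated=False, proprietary_truth=False),
    Task("change a bank's production system", saturated=False, proprietary_truth=True),
]
for t in examples:
    print(f"{t.name}: {likely_winner(t)}")
```

The two boolean inputs correspond to the two questions asked of every task; in practice both are matters of degree, which is why the walls around the last corner "vary in height."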
The walls to this last corner vary in height. A developer's toy codebase is portable and standardized, so crawling in isn't hard. But a bank's production system is neither portable nor standardized. You don't gain root access to it just by being 2% smarter on SWE-Bench Verified.
Capability will consume many things, but a better model won't turn proprietary ground truth into public ground truth. It doesn't hold a license, doesn't sign for liability, doesn't own the company's files; when an answer is wrong, it can't be the defendant in a lawsuit. The bottleneck here isn't intelligence—it's permissions, and it's responsibility. You can imagine a model far smarter than any human, yet it still must be let in the door, and someone must still sign their name for what it does.
That door has a lock and a bolt.
The lock is the environment: only after gaining trust within a system, passing security reviews, completing integration, and signing contracts with accountability for results, can you verify if AI is truly doing something useful.
The bolt is the user. Today, most American doctors open OpenEvidence daily. This cannot be bought with any amount of compute. A lab could train a perfect medical model tomorrow, but it still has no way into doctors' usage habits or UCSF's decision-making process. Because trust is built slowly, through relationships, through user acquiescence—not something gradient descent can erase.
This is precisely the work of application companies. An application occupies its spot in the "untrainable" corner through unglamorous work: organizing a company's proprietary reality so models can act upon it; giving models tools to act; working with clients to change how their workforce actually operates.
A company that performs this "translation" is hard to replicate, and the translation never ends: integration and maintenance continue for as long as the client relationship does. The teams that win are those that bring domain-specialist engineers and tools to work alongside the client.
For example, in a top-tier traditional law firm, the M&A practice alone handles nearly a thousand transactions annually. You can't have hundreds of paralegals download client files to their desktops and then feed them to a general-purpose agent to read. Confidentiality reasons alone forbid it, let alone a dozen other issues. Even if you could, what you learn is fragmented: one paralegal makes a minor correction here, another there; no one sees how an entire transaction flows.
The truly important signals exist at the transaction level. A transaction has its own shape: for M&A, it's NDAs, term sheets, due diligence, purchase agreements, ancillary documents, closing checklists; for IP litigation, it's motions, discovery, prior art, more motions. Every practice area has its own structure, and lawyers and tools cannot be swapped interchangeably.
And the real problem for the firm is higher still: how to run every practice area simultaneously, just as a top partner manages hundreds of matters in parallel while bringing in new clients and training associates. Rebuilding such a firm is not a single problem you can write an evaluation task for. It requires an operator to handle it like "data baseball": intermediate goals are extremely vague, feedback is incomplete, cycles are long, and the environment itself doesn't stand still.
Unfortunately, unreadable value is also hard to sell, for the same reason it's hard to commoditize: a company cannot externally judge whether AI can truly transform its operations as benchmarks suggest. Therefore, the strongest companies stop trying to prove themselves externally; they first get inside clients, then price based on outcomes.
Sierra charges only when its agents solve a client's problem; if the issue is escalated to a human, it doesn't charge. Thus, the price itself becomes an evaluation mechanism. And this works because Sierra owns the definition of "resolved." Cognition's Devin does the same thing in software with its "performance guarantee." Only when you are trusted inside a system can you offer such guarantees for outcomes.
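The pricing rule described above (charge for resolutions, not for escalations) can be sketched in a few lines. This is a hypothetical illustration, not Sierra's actual billing logic: the `Conversation` fields, the `invoice_cents` function, and the price are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Conversation:
    resolved_by_agent: bool   # the vendor's own definition of "resolved"
    escalated_to_human: bool  # handed off to a human, so not billable


def invoice_cents(conversations: Iterable[Conversation], price_per_resolution_cents: int) -> int:
    """Bill only conversations the agent resolved without escalation."""
    billable = sum(
        1 for c in conversations
        if c.resolved_by_agent and not c.escalated_to_human
    )
    return billable * price_per_resolution_cents


# Hypothetical month: two resolutions and one escalation at a made-up price.
month = [
    Conversation(resolved_by_agent=True, escalated_to_human=False),
    Conversation(resolved_by_agent=False, escalated_to_human=True),
    Conversation(resolved_by_agent=True, escalated_to_human=False),
]
print(invoice_cents(month, price_per_resolution_cents=150))
```

The whole scheme hinges on who sets `resolved_by_agent`: whoever owns that definition effectively owns the evaluation.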
Even at the layer of providing token services—what everyone loves to call a pure commodity—it doesn't behave like one. The best AI-native companies concentrate their services on one or two providers like Baseten or Fireworks. Because the cost per token commoditizes over time, but reliability under real traffic and stable access to scarce compute do not. Where to serve inference is a different choice from which models to use. The only truly commodity-like part of inference is the price.
A common counterargument: the lab is your supplier. Why won't it undercut you by selling its own first-party product below cost, or revoke your API access and take the market itself? This is the real version of that despair. But it only holds true if the model layer is a single-player game.
Clearly, it's not. The model layer looks more like a three-and-a-half-player deathmatch, with a group of international players about six months behind in training progress, and a development league five times larger than last year. Clients want competition among their suppliers, and labs want market share more than they want to kill any specific application.
You can see this in markets where labs compete head-on. In consumer chat, the best model has never simply won everything. ChatGPT has maintained its lead through years of real competition; the share it's losing now goes to Gemini, driven by Android and search distribution, not a better model. Prediction markets and internet vibes currently credit Anthropic with the best model, yet it is barely a major player in consumer chat, having built its business on enterprise and coding scenarios.
If a better model can't even take users from a competitor in the most core application, it won't easily integrate into a hospital's medical records system or a bank's liability framework either. Today, public choices are based on more than just coding ability. If the frontier model layer remains crowded, then the application layer above it holds value.
If a task cannot be scored externally, then someone internally must decide what counts as a good answer. And that decision is the entire game. Enough such decisions written down become benchmarks. Harvey published legal benchmarks; Sierra published benchmarks for voice agents. You earn the right to define "good" in a domain because that domain is already using you. And these companies earn that right through the hard struggle of real adoption.
The evaluations that truly determine where money flows are private, formed company by company: what will this firm, for this type of matter, accept as good work? And this is far from finished, because the depth of law extends far beyond any public test. OpenEvidence is gradually codifying what constitutes a safe clinical answer.
None of this is truly "measurement" in the narrow sense. It is judgment about what is true and what is good. These judgments are written down until they become the standard against which everyone else is measured. No matter how smart the foundational model lab becomes, it cannot write these standards out of thin air, because such authority exists only within the domain.
This authority tends to reside where it already resides. Senior lawyers write legal benchmarks. Doctors define safe clinical answers. What "resolved" means is determined by the company that already has the client relationship.
The absorption boundary will continue to rise because we will keep learning to measure more work, and the measurable will be consumed. The untrainable ground keeps shrinking beneath the feet of those standing on it, so you cannot find a defensible position and stop. You must constantly move towards what cannot yet be scored, and continuously re-underwrite, re-judge risk.
On a narrow task, with your proprietary data and your own evaluation system, you can train to frontier levels and beat general-purpose models in key scenarios; this specialized model becomes part of the moat. On the other hand, if you compete on general-purpose model capabilities, it's a war of capital, and you will lose to those with the most compute. This is the trap most easily fallen into by companies with only shallow access performing highly readable tasks.
When a company decides, for survival, to train its capabilities beyond frontier models on a broad set of general tasks, the outcome usually seems determined by data center scale. The end result is often not an independent champion, but a sale to a player with ample compute.
This is all defense. Harder is offense: first deciding what to build. This is what I've been looking for all year, and I've found it maybe three times. Models can't help with this. They do whatever you point them at, but they can't tell you what is worth pointing at. You can't build a benchmark for it, so you can't train for it.
This is also why incumbents won't take everything: they'll hold onto what they already have, and the next thing comes from someone who discovers a use case before others. Perhaps intent is a scarcer input than compute.
This despair is half right. Thin wrappers are indeed being absorbed, and many things that look like companies today are indeed just thin wrappers. But it's wrong about what remains after absorption. The mechanism is clear, but the endpoint is not.
Where I'm willing to bet is in this direction: intelligence will continue to get cheaper, and value will continue to slide towards the few places models cannot reach. The untrainable is value with history.
So, enter one of these domains, do the unglamorous translation work, and start writing down the definition of "good" there. Because someone will. The most cited benchmark score this year is actually a map of value soon to become worthless, and also a notice: a notice to some that they are about to lose the right to define what "good" means.


