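Token Budget Wars: Enterprise AI Enters the "Accounting Era"
- Core thesis: Enterprise AI is shifting from "whether to adopt" to "how to account for the cost." The central tension is that token costs are hard to tie directly to business value. The key to the next phase is not model capability but whether token consumption can be attributed precisely to specific business outcomes, which in turn decides how AI resources are allocated.
- Key points:
- AI inference cost has moved from an experimental budget to a recurring operating expense, and CEOs and CFOs want to quantify the real value returned for every dollar of tokens.
- Token consumption does not equal value: the same workflow can differ in cost by 5 to 10 times depending on prompts, context length, model selection, and the number of retries.
- Marginal token utility is the core metric: the business value created by each additional dollar of inference cost. Most companies currently cannot track it.
- AI budget requests are essentially competing with labor costs, and replacing outsourcing (BPO) is easier to benchmark quantitatively than replacing internal employees.
- The long tail of retries, context bloat, and poor routing are the three main reasons token costs spiral out of control, and they materially change the economics.
- Attribution from tokens to outcomes is missing; enterprises need to capture the agent's decision trace to explain why a workflow succeeded or failed.
- Companies that master attribution can make allocation decisions (such as workflow optimization and model switching) and ultimately control where AI resources flow inside the enterprise.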
Original Title: Token Budget Wars
Original Author: Jaya Gupta
Original Translation Compiled by: Peggy
Editor's Note: Enterprise AI is moving from "whether to adopt" to "how to calculate the costs."
Over the past two years, many companies pushed employees to use AI, often to keep up with technological trends and competitive pressure. But when AI inference costs shift from experimental budgets to ongoing operational expenses, CEOs and CFOs start asking a more pragmatic question: How much value does AI actually create? For every dollar spent on token costs, what real results are we getting in return?
This is the core of the "Token Budget Wars." The so-called token budget war isn't just about companies wanting to lower their AI bills; it's about reassessing which business operations deserve more computational power, which tasks should be switched to cheaper models, which processes can replace outsourcing or human labor, and which are merely wasteful consumption.
The most noteworthy point of the article is that AI usage does not equal value. In the SaaS era, usage usually signified software adoption; but in the AI era, token consumption only indicates that "the meter is running." The same workflow could have vastly different costs depending on prompts, context, model selection, and the number of retries. A higher bill could mean AI is genuinely working, or it could mean the system is engaging in futile cycles.
Therefore, the next phase of enterprise AI is not just about model capability, but about whether we can map token costs to business outcomes. The first phase proved that AI can complete tasks; the second phase will answer: Are these tasks worth paying for in the first place?
Below is the original text:
Enterprise AI has moved from "whether to adopt" to "how to allocate."
At the top levels of companies, the new "currency" is your ability to quantify the ROI of AI investments. Every functional department is asked the same question: What did you produce, and what was the cost? Over the past two years, CEOs, waking up to see Jim Cramer on CNBC (#bearish) while watching competitors announce productivity gains, have demanded that everyone in the company use AI. Now, the real pressure comes from the follow-up question: Show me the value.
Claude was released in November 2025, by which time most companies' annual budgets for 2026 were already locked. By the first quarter, actual usage had already far exceeded the original plan. Inference costs were no longer just a budget item for experimentation; they became a recurring operational expense. This brought about a new question: Where exactly is AI creating value?
This question is difficult to answer because the utility of tokens hasn't been quantified. The bill won't tell you whether the expenditure replaced human labor, generated revenue, reduced risk, accelerated a process, or was just a group of engineers burning tokens for a leaderboard (#metamates). When spending is a few hundred thousand dollars, it still looks like an experiment. But beyond a certain threshold, say reaching seven figures, it becomes infrastructure. Technical differences begin to have a material impact on the profit and loss statement: the same workflow, with the same inputs, can have token costs differing by 5 to 10 times between two runs, with no apparent errors. At an experimental scale, this variance is already expensive; but once it's at an infrastructure scale, it becomes a number the CFO has to explain to the CEO.
You could call it the "marginal token utility": the business value created for every additional dollar spent on inference cost. This is the number that truly matters in the scaling phase, and it's the number most companies can't currently see.
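One way to make the definition concrete is a small calculation. The sketch below is illustrative only: the function name `marginal_token_utility` and all dollar figures are hypothetical, and it assumes a team can already estimate business value before and after a change in inference spend, which is precisely the part most companies lack.

```python
def marginal_token_utility(value_before, value_after, spend_before, spend_after):
    """Extra business value created per extra dollar of inference spend.

    All four inputs are assumed to be dollar amounts for comparable periods.
    """
    extra_spend = spend_after - spend_before
    if extra_spend <= 0:
        raise ValueError("spend_after must be greater than spend_before")
    return (value_after - value_before) / extra_spend


# Hypothetical figures: spend rises from $40k to $50k,
# estimated value delivered rises from $120k to $150k.
mtu = marginal_token_utility(120_000, 150_000, 40_000, 50_000)
print(mtu)  # 3.0
```

A result above 1 would suggest the additional spend paid for itself under these assumptions; a result below 1 would suggest the extra tokens cost more than they returned.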
The question in the boardroom is shifting from "Is AI useful?" to "Where exactly is AI creating leverage?" This is why the so-called token budget battle is essentially a fight for the right to allocate tokens.
And the battle over token ownership is heating up because it collides with a three-decade-old executive instinct: large teams mean big titles, wide scope of responsibility, and more power. In the past, a visible sign of a senior manager's success was the size of the team they managed—direct reports, indirect reports, and the number of people in the organizational chart.
But when intelligence becomes a scarce resource, the new status symbol becomes: How much intelligence can you command?
AI spending is fundamentally competing with labor costs.
Most AI budget requests essentially fall into one of three categories: replacing outsourced labor, replacing internal labor, or creating new revenue.
An employee has a salary. A BPO (business process outsourcing) contract has a price per ticket, claim, invoice, or audit review. Humans understand these units of measurement. But inference costs are more complex, because the final cost of completing a task depends on how the system executes the process. A claim processing task that requires three retries, manual correction, and calls to a frontier model could end up being more expensive than the outsourced labor it was intended to replace. This is why the discussion is shifting to: What is the cost to achieve an outcome? For example, the cost per resolved ticket, per processed claim, per reviewed contract, per completed invoice, per avoided new hire, per retained customer, or per dollar of revenue generated.
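The cost-per-outcome framing can be sketched in a few lines. This is a minimal illustration, not a real accounting model: the `ClaimRun` record, the BPO price, and every dollar amount are made up for the example. The point it shows is that spend on failed runs and on human correction belongs in the numerator, while only resolved claims count in the denominator.

```python
from dataclasses import dataclass


@dataclass
class ClaimRun:
    inference_usd: float   # inference spend across all attempts for this claim
    correction_usd: float  # cost of any human correction time
    resolved: bool         # did the claim end up correctly processed?


def cost_per_resolved(runs):
    """Total spend (including failures) divided by the number of resolved claims."""
    total = sum(r.inference_usd + r.correction_usd for r in runs)
    resolved = sum(1 for r in runs if r.resolved)
    if resolved == 0:
        raise ValueError("no resolved claims to divide by")
    return total / resolved


# Hypothetical runs and a hypothetical BPO price per claim.
runs = [
    ClaimRun(0.50, 0.00, True),   # clean first-pass success
    ClaimRun(1.50, 4.00, True),   # three attempts plus manual correction
    ClaimRun(2.00, 0.00, False),  # agent got stuck, claim not resolved
]
BPO_PRICE_PER_CLAIM = 3.50

agent_cost = cost_per_resolved(runs)
print(agent_cost)                        # 4.0
print(agent_cost > BPO_PRICE_PER_CLAIM)  # True
```

In this made-up sample the agent is cheap on the clean run but more expensive than the outsourced baseline once the messy runs are included.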
Executives have realized that BPO is the easiest place to establish a baseline, because this work is already priced per "unit completed." In contrast, comparing AI with internal employees is much more difficult: employees do many things throughout the day, including browsing TikTok during lunch; productivity gains often show up as avoided hires or as capacity freed in small pieces across a team; and managers resist team cuts based solely on partial automation. BPO provides a quantifiable baseline for business teams.
This differs from the logic of SaaS. SaaS once trained companies to treat usage as a proxy for value.
But AI breaks this. The amount of inference resources consumed by the same workflow can vary drastically due to prompts, retrieved context, chosen model, called tools, number of retries, and whether the agent gets stuck. The unit on the bill—the token—is stable, but the amount of work it represents is not.
More precisely: signal and noise use the same unit of measurement. A rising token bill could mean real work is being done; but it could also mean computational power is being wasted on poor prompts, irrelevant context, unnecessary tool calls, redundant reasoning, and overpowered models. Two companies could have identical token bills, yet run fundamentally different operations beneath the surface: one is translating inference into results, the other is paying for futile cycles, and these two scenarios look identical on the invoice.
SaaS usage tells you: the software has been adopted. AI usage only tells you: the meter is running. It doesn't tell you whether the company is actually running.
Why is marginal token utility hard to see?
There are three main reasons.
First is the long tail of retries. If the probability of an agent completing a workflow correctly on a given attempt is p, and each failed attempt is retried at roughly the same base cost T, then the expected token consumption per resolved workflow scales roughly as T/p. If the completion rate drops from 90% to 70%, the effective cost per resolved problem rises from about 1.11T to about 1.43T, an increase of about 29%, not 20%, because cost scales with 1/p rather than linearly with the failure rate. In enterprise workflows, inputs are often messy, and edge cases matter. Failure doesn't just lower accuracy; it changes the economics.
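The retry arithmetic is easy to check. The sketch below assumes the simplest possible model: every attempt costs the same base amount and succeeds independently with the same probability, so the expected number of attempts per resolved workflow is 1/p. The function name is hypothetical, and real agents rarely behave this cleanly.

```python
def expected_cost_per_resolved(base_cost, p_success):
    """Expected spend per resolved workflow when every attempt costs base_cost
    and succeeds independently with probability p_success."""
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    return base_cost / p_success


cost_at_90 = expected_cost_per_resolved(1.0, 0.9)
cost_at_70 = expected_cost_per_resolved(1.0, 0.7)
increase = cost_at_70 / cost_at_90 - 1

print(f"{increase:.1%}")  # 28.6%
```

The success rate fell by 20 percentage points, but the cost per resolved workflow rose by more than that, because the cost is proportional to 1/p.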
Second is context bloat. For operations heavily reliant on attention mechanisms, inference costs roughly grow as O(n²) with context length. Therefore, doubling the context length roughly quadruples the inference cost. Everyone wants the model to have enough information, so systems tend to over-supply: when five documents would suffice, the retrieval pulls in fifty; the connector dumps an entire email thread; the agent carries on with long-obsolete conversation history.
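To see what quadratic growth implies for over-supplied context, here is a deliberately narrow sketch. It models only the attention-dominated term described above, using a hypothetical helper `relative_attention_cost`; it says nothing about how any particular provider prices tokens or about the parts of inference that scale differently.

```python
def relative_attention_cost(context_tokens, baseline_tokens):
    """Cost of the quadratic attention term relative to a baseline context length."""
    if context_tokens <= 0 or baseline_tokens <= 0:
        raise ValueError("token counts must be positive")
    return (context_tokens / baseline_tokens) ** 2


# Doubling the context quadruples the quadratic term.
print(relative_attention_cost(2_000, 1_000))   # 4.0

# Retrieving fifty documents where five would do: 10x the context,
# assuming (hypothetically) about 1,000 tokens per document.
print(relative_attention_cost(50_000, 5_000))  # 100.0
```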
Third is routing. When teams don't know which model is "good enough," they default to using the strongest one. A basic classification task might run on the same model intended for complex reasoning. When call volumes reach millions, the difference between routing simple tasks to a small model versus sending all tasks to a frontier model often separates a manageable bill from a board-level problem.
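A rough comparison shows why routing matters at volume. Everything in this sketch is an assumption made for illustration: the call counts, the share of simple tasks, and the per-call prices are invented, and it presumes the small model is actually good enough for the simple tasks.

```python
def monthly_cost(simple_calls, complex_calls, small_price, frontier_price,
                 route_simple_to_small):
    """Monthly inference spend under a two-tier routing policy.

    Complex calls always go to the frontier model; simple calls go to the
    small model only when route_simple_to_small is True.
    """
    simple_price = small_price if route_simple_to_small else frontier_price
    return simple_calls * simple_price + complex_calls * frontier_price


# Hypothetical workload: 1M calls per month, 80% of them simple classification.
SIMPLE_CALLS, COMPLEX_CALLS = 800_000, 200_000
SMALL_PRICE, FRONTIER_PRICE = 0.001, 0.02  # made-up dollars per call

all_frontier = monthly_cost(SIMPLE_CALLS, COMPLEX_CALLS,
                            SMALL_PRICE, FRONTIER_PRICE, False)
routed = monthly_cost(SIMPLE_CALLS, COMPLEX_CALLS,
                      SMALL_PRICE, FRONTIER_PRICE, True)

print(f"{all_frontier:,.0f}")  # 20,000
print(f"{routed:,.0f}")        # 4,800
```

With these invented numbers, routing cuts the bill by roughly three quarters without touching the complex work.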
Non-software industries will feel this pain in the form of a "transformation."
Software companies see this problem first, because the work being optimized is already heavily instrumented. Engineering teams have metrics like PRs, commits, deployments, incidents, cycle time, and MTTR, and these are linked to products. While not perfect, this type of work is easier to measure.
Non-software companies feel this problem more acutely because their work is operational. Examples include claims, underwriting, customer service tickets, compliance reviews, supply chain anomalies, and payment disputes. Companies with real-world assets face the same issue. These workflows were traditionally measured by headcount, cycle time, SLA achievement rates, and error rates, and often carry a higher bar: they need to be defensible in audits, not just correct on average. The unit of work and the unit of cost don't speak the same language and don't reside in the same organization. The tech team sees token consumption, the business unit sees workflow changes, but connecting the two requires multiple teams to first agree on "what exactly are we measuring."
In my view, software companies will experience the token budget war as a productivity measurement problem, corresponding to the many "AI layoffs" we've seen; non-software companies will experience it as a transformation problem.
The missing layer is the attribution from tokens to outcomes.
Enterprises need a conversion layer that links inference spending to the work completed and the business results generated. This layer must answer three questions: What is the true cost of this workflow, including retries and corrections? Which parts of the agent's execution trajectory are truly important, and which are just futile cycles? Has this work changed the operating model, for example fewer tickets per support agent, shorter claims cycles, smaller BPO budgets, or delayed hiring? The next layer is to perform outcome attribution in business language. Not just "this workflow cost $2.13," but to say: this type of claim is cheaper for the agent to process than the BPO, but if the policy requires additional exception documents, the long tail of retries destroys the economics.
Measurement becomes memory.
To connect a token to an outcome, a company must capture everything that happens in between: what the agent saw, what it retrieved, what tools it called, what it ignored, where it retried, where it was overridden by a human, which exception rule was applied, which precedent was used, and why one path succeeded while another failed. The measurement layer must record the decision trace, which is something companies have almost never truly owned. Logging systems capture what happened, but rarely why. For example, a CRM can tell you a deal was delayed, but not the undocumented judgment behind the sales forecast.
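What a decision trace might look like as data can be sketched briefly. The classes and field names below (`TraceEvent`, `DecisionTrace`, `rationale`) are hypothetical and do not come from any existing product; the only idea being illustrated is that each step records a cost and a reason, so spend and the "why" live in the same record.

```python
from dataclasses import dataclass, field


@dataclass
class TraceEvent:
    kind: str             # e.g. "retrieval", "tool_call", "retry", "human_override"
    detail: str           # what happened
    cost_usd: float = 0.0
    rationale: str = ""   # why it happened, which plain logs rarely keep


@dataclass
class DecisionTrace:
    workflow_id: str
    events: list = field(default_factory=list)
    outcome: str = "pending"

    def record(self, kind, detail, cost_usd=0.0, rationale=""):
        self.events.append(TraceEvent(kind, detail, cost_usd, rationale))

    def total_cost(self):
        return sum(e.cost_usd for e in self.events)

    def count(self, kind):
        return sum(1 for e in self.events if e.kind == kind)


# Hypothetical claim workflow.
trace = DecisionTrace("claim-001")
trace.record("retrieval", "pulled policy document", 0.25)
trace.record("tool_call", "checked coverage limits", 0.50)
trace.record("retry", "re-ran after missing exception form", 0.25,
             rationale="policy requires an additional exception document")
trace.record("human_override", "adjuster approved the exception",
             rationale="precedent from a similar prior claim")
trace.outcome = "resolved"

print(trace.total_cost())     # 1.0
print(trace.count("retry"))   # 1
```

Once traces like this exist, cost per outcome and the reasons behind retries or overrides can be read from the same source.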
Decision rationale is one of the most easily corrupted and perishable assets in a company, as it resides in Slack threads, email chains, escalation meetings, and people's heads. But the problem is, people leave, and processes change.
AI changes this because agents generate traces. Every retrieval, tool call, retry, escalation, manual correction, and final decision becomes part of the path from context to action to outcome. Initially, companies will capture these traces to justify spending. But once captured, these traces become more valuable than the cost report itself, as they become a permanent record of how the organization actually makes decisions. (Ahem, context graph, although I am really tired of hearing that term lately.)
The allocation layer is the real prize.
If inference becomes a metered resource in a customer's operating model, then every dollar must prove its worth. Which vendors can tell when a token has been converted into a result, when it hasn't, and why?
Enterprises won't figure this out entirely on their own. They will buy it as a transformation. Fortune 500 companies have played this script before: buckle up, hire McKinsey, bring in every former Palantir employee on the market, and drive change top-down from the CEO. Token-to-outcome attribution will arrive similarly to ERP, BI, and digital transformation: as a CEO-sponsored "project," backed by a set of underlying infrastructure, and eventually becoming the new source of truth. Founders who can make this happen will assemble different types of founding teams, and they themselves will be different from the traditional archetype of an entrepreneur.
Whoever masters token-to-outcome attribution can make allocation decisions: which workflows deserve more compute power, which should be throttled, which should be switched to cheaper models, which should remain with humans, and which can replace BPO. And once you can make these decisions, you control the flow of AI spending within the enterprise and earn the trust needed to allocate this resource.
The first phase of enterprise AI proved that models can do the work. The next phase will determine how much of that work is truly worth paying for. As Charlie Munger said: Show me the incentives, and I will show you the outcome.


