AI Model Relay Services Need Audits, Not Just Benchmarks: GatewayBench Evaluation Framework Launches, Official Site Check4U.ai Opens

Odaily Senior Author

2026-05-29 10:15




AI relay services solve the model access problem but create new trust issues. GatewayBench, an evaluation framework for AI relay services, has opened its evaluation entry point through the official website Check4U.ai, aiming to turn model authenticity, billing transparency, cache isolation, and actual cost into verifiable indicators. GatewayBench is also opening public evaluation leaderboards for API relay services, aggregation gateways, and model service providers, in the hope that auditable data will help honest delivery earn more market trust.

The GPT-5 or Claude model that enterprises spend a fortune on every month: is it really the official original?

To get around the geographic restrictions, corporate compliance processes, and payment barriers that surround large AI models, a growing number of R&D teams are taking a shortcut: the third-party AI relay market (Shadow APIs). The alternative looks attractive. It lets developers access mainstream models such as GPT and Claude with fewer barriers, offers an experience close to the official API, and fills the gaps left by official channels.

However, this seemingly convenient 'door' conceals an unfathomable black box.

In March 2026, an audit published by the CISPA Helmholtz Center for Information Security in Germany reported that at least 187 top-tier academic papers worldwide had used large model relay APIs (Shadow APIs) for research, 62% of which were accepted by top conferences such as CVPR and ICLR. Because these relay gateways had secretly replaced, quantized, or downgraded the underlying models, many of those research results now risk being irreproducible.

This 'disaster' in academia is a 'time bomb' ready to explode at any moment in a corporate production environment.

Large models are no longer lab experiments. They are core infrastructure, deeply integrated into customer service centers, code generation pipelines, Agent workflows, and risk control chains. For systems that support such critical operations, relay service providers often court enterprises with selling points like 'original models,' 'low prices,' 'high speed,' and 'caching support.'

The problem is that traditional evaluation tools cannot detect what happens inside the black box. Existing benchmarking software mainly measures speed and price from outside the interface, and it assumes by default that everything the gateway returns is real and trustworthy. Conventional evaluation therefore cannot answer the following core business security questions:

● During traffic spikes, is the expensively purchased model secretly being 'switched' or 'downgraded'?

● Behind so-called high speed and low prices, are inflated tokens secretly added to the bill? Are failed requests forcibly charged?

● Are the caching discounts claimed by the gateway actually returned to the enterprise? Is cross-account private data strictly isolated?

AI relay APIs do lower the access barrier for large models, but they also create a new crisis of trust. Until these invisible backend paths are understood, procurement decisions based only on listed unit prices and speed amount to running blind in the dark. Ending the black-box era and making honest delivery a market advantage again is the top priority for the current AI supply chain.

Unveiling the Black Box: The 'Three Unspoken Rules' of the AI Large Model Relay Market

Why can't conventional benchmarking software detect these problems? Because it stays outside the interface and compares only response speed and listed prices. Behind the opaque gateway backend, vendors exploit this information asymmetry and turn complex technical methods into hidden arbitrage tools.

A closer look at the current AI large model relay market reveals three well-concealed 'unspoken rules' that erode enterprise business quality and budgets:

Unspoken Rule One: Dynamic Model Watering Down

In the gray relay market, the hardest arbitrage method to detect is dynamic model substitution.

Many service providers dutifully call the official original models when they face conventional evaluation tools or low system traffic. But during high-concurrency business peaks, or in blind spots that monitoring rarely covers, these gateways quietly switch the backend to low-performance quantized models, weaker distilled versions, or even low-cost open-source models presented under the same name.

On the surface, the interface still outputs text on time. But because the underlying probability distribution has changed, the output quality the enterprise actually receives no longer matches the metrics the vendor promised. This kind of model swapping exposes customer service accuracy and code generation quality to an uncontrollable risk of degradation at any time.

Unspoken Rule Two: Invisible Bills (Hidden Billing & Privacy Exposure)

Since the popularization of reasoning models, a new, harder-to-verify cost item has appeared in large model bills: thinking tokens. Because this part of the reasoning process is invisible by default, it's difficult for buyers to verify the actual reasoning consumption claimed by the platform, leaving room for malicious gateways to inflate costs.

What's more frightening than inflated bills is uncontrolled cache fraud. Some service providers display a cache hit indicator on the bill but fail to pass the actual discount back to the enterprise, turning cache optimization into a purely numerical game. Worse still, to forcibly increase cache hit rates and reduce their own costs, certain gateways stuff prompts from different enterprises into the same shared cache pool. This directly impacts the data isolation boundary of multi-tenant systems, exposing the enterprise's core business data and commercial privacy to the risk of cross-account mixing.

Unspoken Rule Three: False 'Lowest Price Online' (A Carefully Designed Financial Trap)

When procuring large model APIs, the 'lowest price online' often catches the most attention. But in actual business operations, the nominal unit price doesn't equal the actual cost. Especially in the relay gateway market, input, output, cache hit, and cache write costs are often bundled into a single composite price, which seems easy to compare. However, once subjected to real business loads, this price easily becomes distorted.

This distortion stems from the fact that there is no fixed input-output ratio when enterprises call large models. RAG, long document analysis, and complex Agent workflows are typically high-input, low-output; code generation and content creation might be low-input, high-output. If the platform only shows a single blended price, it's difficult for buyers to determine where the true cost lies. A seemingly low-priced platform might optimize its external offer by lowering the output unit price, while simultaneously raising the prices for input, cache writing, or other less conspicuous items. Ultimately, for certain high-input scenarios, the actual bill could be far higher than expected.
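A small worked example makes the distortion concrete. The sketch below compares two hypothetical platforms across two hypothetical workload shapes; every price and token count is invented for illustration and does not come from any real gateway.

```python
# Hypothetical per-1M-token prices in USD. None of these figures are real.
PLATFORMS = {
    "platform_a": {"input": 3.00, "output": 10.00},  # low output price, higher input price
    "platform_b": {"input": 2.00, "output": 14.00},  # the reverse
}

# Hypothetical workload shapes: tokens per request as (input, output).
WORKLOADS = {
    "rag": (9000, 1000),      # high-input, low-output
    "codegen": (1000, 4000),  # low-input, high-output
}


def cost_per_million_tokens(prices, input_tokens, output_tokens):
    """Effective USD cost per 1M total tokens for one workload shape."""
    total_tokens = input_tokens + output_tokens
    weighted = input_tokens * prices["input"] + output_tokens * prices["output"]
    return weighted / total_tokens


def cheapest_platform(workload):
    """Name of the platform with the lowest effective cost for this workload."""
    input_tokens, output_tokens = WORKLOADS[workload]
    return min(
        PLATFORMS,
        key=lambda name: cost_per_million_tokens(
            PLATFORMS[name], input_tokens, output_tokens
        ),
    )


if __name__ == "__main__":
    for workload in WORKLOADS:
        print(workload, "->", cheapest_platform(workload))
```

With these made-up numbers, platform_a advertises the lower output price yet costs more on the RAG workload (3.70 versus 3.20 USD per 1M tokens), while it is the cheaper choice for code generation. The ranking depends on the input-output ratio.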

Furthermore, hidden costs lurk in various inconspicuous clauses. For instance, when the system experiences disconnections, timeouts, or 5xx errors, failed requests may still be forcibly charged; low unit prices often come bundled with high minimum top-up amounts, non-refundable balance policies, and opaque forex and payment channel fees. When all these financial frictions are combined, the actual cost debited from the enterprise's ledger for every million Tokens can be several times higher than the nominal price advertised on the webpage.
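How these frictions stack up can be sketched in a few lines. The function below is an illustrative model only: the parameter names, the assumption that the whole top-up is non-refundable, and the sample figures are all hypothetical and not taken from GatewayBench.

```python
def true_cost_per_million(
    nominal_price_per_million,
    successful_tokens,
    failed_tokens_billed,
    top_up_amount,
    payment_fee_rate,
):
    """USD actually spent per 1M successfully delivered tokens.

    Assumes (hypothetically) that the top-up is non-refundable, so any
    unused balance is a sunk cost, and that payment and forex fees are a
    flat percentage of the cash paid.
    """
    if successful_tokens <= 0:
        raise ValueError("successful_tokens must be positive")
    billed_tokens = successful_tokens + failed_tokens_billed
    usage_charge = billed_tokens / 1_000_000 * nominal_price_per_million
    # The buyer pays at least the minimum top-up, even if usage is lower.
    cash_out = max(usage_charge, top_up_amount) * (1 + payment_fee_rate)
    return cash_out / (successful_tokens / 1_000_000)


if __name__ == "__main__":
    cost = true_cost_per_million(
        nominal_price_per_million=2.0,
        successful_tokens=20_000_000,
        failed_tokens_billed=2_000_000,  # failed requests that were still charged
        top_up_amount=100.0,
        payment_fee_rate=0.03,
    )
    print(f"nominal 2.00 USD, effective {cost:.2f} USD per 1M tokens")
```

In this invented scenario the nominal price is 2.00 USD per 1M tokens, but the enterprise effectively pays about 5.15 USD for every 1M tokens it actually received.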

GatewayBench: A Specialized Audit Framework for Large Model Gateways

Faced with highly opaque gateway backends, traditional speed testing and benchmarking tools have clear limitations. They can compare response speed, model coverage, and nominal prices, but struggle to answer more critical questions: Does the model the gateway actually invokes match what was promised? Is the billing transparent? Are caching and data isolation trustworthy?

Against this backdrop, the GatewayBench audit evaluation framework has officially launched, opening its assessment portal via the website Check4U.ai. As an open-source audit framework for large model gateways, GatewayBench doesn't just look at speed and surface-level prices. Instead, it deconstructs gateway evaluation into three dimensions: Trustworthiness, Affordability, and Performance, using a weight system of 40% Trustworthiness + 40% Affordability + 20% Performance.

This weighting reflects GatewayBench's fundamental judgment: in the context of AI large model relay APIs, trustworthiness and actual cost take precedence over speed. A gateway must first prove its model is authentic, its billing is transparent, and its costs are explainable before it qualifies for a performance comparison.
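The effect of the 40/40/20 weighting can be illustrated with a simple weighted sum. This is a minimal sketch: the 0 to 100 scale for each dimension and the sample scores are assumptions, and the article does not describe how GatewayBench normalizes or aggregates its sub-scores.

```python
# Weights stated in the article; the 0-100 sub-score scale is an assumption.
WEIGHTS = {"trustworthiness": 0.40, "affordability": 0.40, "performance": 0.20}


def composite_score(scores):
    """Weighted sum of the three dimension scores."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical gateways, invented for illustration.
    fast_but_opaque = {"trustworthiness": 30, "affordability": 50, "performance": 100}
    honest_but_slower = {"trustworthiness": 90, "affordability": 70, "performance": 60}
    print(composite_score(fast_but_opaque))
    print(composite_score(honest_but_slower))
```

Under this weighting, a gateway with a perfect performance score but weak trustworthiness (about 52 overall) still ranks well below a slower gateway that scores well on trust and cost (about 76).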

To achieve this goal, GatewayBench offers three core audit capabilities:

L1 Trustworthiness Audit: From Platform Self-Report to Verifiable Trust

In GatewayBench's scoring system, L1 Trustworthiness accounts for 40% of the weight. The logic behind this design is: in the AI large model relay API scenario, while speed and price are important, if the model is not authentic and the billing is not transparent, there is no basis for discussing other metrics.

The core risk of third-party large model relay gateways stems from the invisible processes behind successful calls. A normal response at the interface level only proves the request was processed, but cannot verify that the model origin, billing process, and cache handling all conform to the platform's promises. In the past, these aspects lacked externally verifiable evidence, making systematic auditing difficult.

The L1 dimension of GatewayBench is designed to transform these vague suspicions into reviewable engineering signals. It does this by breaking down trustworthiness into three questions: Is the model authentic? Is billing transparent? Is the cache trustworthy? It employs statistical tests, cryptographic structures, and latency fingerprinting to observe from the outside what is actually happening inside the gateway's black box.

For model authenticity, GatewayBench introduces the RUT (Rank-based Uniformity Test) algorithm, which checks the position of output tokens in the probability ranking of the reference model. Different models might generate similar text, but the token probability distribution is much harder to fake. If the backend undergoes quantization, downgrading, or substitution, the distribution shift leaves traces. Concurrently, GatewayBench can also use Logprob Tracking. By requesting only a single output token under a fixed prompt, it tracks whether the log probability shows a stable drift over different time periods. This provides a lower-cost signal for continuously monitoring model updates, fine-tuning, quantization adjustments, or routing changes.
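The intuition behind a rank-based uniformity check can be sketched as follows. This is not GatewayBench's RUT implementation. It assumes that each gateway response has already been converted into a rank percentile in [0, 1] relative to the reference model, and it substitutes a one-sample Kolmogorov-Smirnov statistic, since the article does not name the test statistic. If the gateway really serves the reference model, the percentiles should look roughly uniform.

```python
import math


def ks_uniform_statistic(percentiles):
    """Kolmogorov-Smirnov distance between the sample and Uniform(0, 1)."""
    values = sorted(percentiles)
    n = len(values)
    if n == 0:
        raise ValueError("need at least one rank percentile")
    # Largest gap above and below the uniform CDF at the sample points.
    d_plus = max((i + 1) / n - u for i, u in enumerate(values))
    d_minus = max(u - i / n for i, u in enumerate(values))
    return max(d_plus, d_minus)


def looks_substituted(percentiles, critical_coefficient=1.36):
    """Flag a sample whose KS distance exceeds the asymptotic 5% critical value.

    1.36 / sqrt(n) is the standard large-sample approximation; it is a
    rough screen, not a substitute for an exact test on small samples.
    """
    n = len(percentiles)
    return ks_uniform_statistic(percentiles) > critical_coefficient / math.sqrt(n)


if __name__ == "__main__":
    # Hypothetical data: evenly spread percentiles vs. percentiles piled near 1.
    consistent = [(i + 0.5) / 100 for i in range(100)]
    shifted = [0.9 + 0.001 * i for i in range(100)]
    print(looks_substituted(consistent), looks_substituted(shifted))
```

A flagged sample is a signal worth investigating, not proof of substitution: prompt selection, sampling parameters, and sample size all affect the result.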

For billing transparency, GatewayBench uses PALACE to estimate a reasonable range of thinking tokens, helping identify abnormally high reporting in reasoning models. Simultaneously, it leverages verifiable structures like CoIn to give billing records stronger traceability and tamper resistance.

For cache trustworthiness, GatewayBench uses latency fingerprinting to determine if a cache hit is genuine, and cross-account isolation tests to identify potential tenant boundary issues. If the bill shows a cache hit but the TTFT hasn't correspondingly decreased, the discount might exist only on paper. If abnormal cache sharing occurs between two independent accounts, it may signal a risk in cache isolation.
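The latency side of this check can be sketched as a comparison of TTFT between requests billed as cache misses and requests billed as cache hits. The function name, the use of medians, and the 1.3x minimum speedup are hypothetical choices for illustration; the article does not specify GatewayBench's actual thresholds.

```python
from statistics import median


def cache_hits_look_real(ttft_miss_ms, ttft_claimed_hit_ms, min_speedup=1.3):
    """True when requests billed as cache hits reach the first token clearly faster.

    min_speedup is a hypothetical threshold: the median miss TTFT must be at
    least this many times the median TTFT of claimed hits.
    """
    if not ttft_miss_ms or not ttft_claimed_hit_ms:
        raise ValueError("need TTFT samples for both groups")
    return median(ttft_miss_ms) / median(ttft_claimed_hit_ms) >= min_speedup


if __name__ == "__main__":
    # Hypothetical TTFT samples in milliseconds.
    misses = [820, 790, 850, 805]
    genuine_hits = [310, 295, 330, 300]
    paper_hits = [800, 815, 790, 830]  # billed as hits, but no faster
    print(cache_hits_look_real(misses, genuine_hits))
    print(cache_hits_look_real(misses, paper_hits))
```

Medians are used here so that a single slow outlier does not dominate the comparison. A real audit would also need to control for prompt length and network conditions.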

Through these methods, GatewayBench transforms the black box, previously only assessable through intuition and suspicion, into sets of measurable, reviewable, and comparable signals, achieving truly 'verifiable trust'.

L2 Performance: Load Testing for Stable Delivery Under Extreme Stress

Only gateways that pass the L1 Trustworthiness audit qualify for a performance and cost-effectiveness comparison.

In the large model infrastructure ecosystem, performance has always been the metric most emphasized by relay and aggregation API providers. Claims like 'Fastest Online' or 'Single Concurrency 150 tokens/s' are not uncommon. However, when designing its metric system, GatewayBench maintained a relatively restrained weight allocation for performance: L2 Performance accounts for only 20% of the total score.

The reason is simple. Speed is important, of course. It determines the basic usability of a system and can filter out services with frequent stalling or long-tail latency issues. But speed should not outweigh trustworthiness and affordability. A gateway, even if very fast, cannot become a trustworthy enterprise-grade infrastructure if it involves model substitution, inflated billing, or opaque caching.

Therefore, in its L2 Performance dimension, GatewayBench doesn't chase single-point peak speeds. Instead, it breaks performance down into questions closer to the production environment: Latency determines how long a user waits; Goodput measures how much effective capacity is delivered within latency thresholds; long-context tests observe how the system degrades under heavy load.

The premise behind this design is: performance is a threshold, but not the end goal. What enterprises are truly buying is stable, timely, and predictable delivery capability under trustworthy conditions.

For commercial-grade applications, the peak speed of an idle system has extremely limited reference value. GatewayBench refuses to simply compete on throughput (Tokens/s). Instead, it introduces SLOs (Service Level Objectives) as business red lines. Enterprises can stipulate, for example, that P95 TTFA must be less than 1.5 seconds, P95 E2E must be less than 8 seconds, and streaming output must not exhibit significant jitter. Goodput is built upon these red lines: only throughput delivered while meeting SLOs counts as effective capacity (Goodput).
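The difference between raw throughput and Goodput can be sketched with the example red lines above. This is a simplification: it applies the 1.5-second TTFA and 8-second E2E limits to each request individually rather than to P95 aggregates, and the record layout is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    ttfa_s: float       # TTFA as named in the article, in seconds
    e2e_s: float        # end-to-end latency, in seconds
    output_tokens: int  # tokens delivered by this request


def raw_throughput(results, wall_clock_s):
    """All delivered tokens per second, regardless of latency."""
    return sum(r.output_tokens for r in results) / wall_clock_s


def goodput(results, wall_clock_s, max_ttfa_s=1.5, max_e2e_s=8.0):
    """Tokens per second counting only requests that met both latency limits."""
    good_tokens = sum(
        r.output_tokens
        for r in results
        if r.ttfa_s < max_ttfa_s and r.e2e_s < max_e2e_s
    )
    return good_tokens / wall_clock_s


if __name__ == "__main__":
    # Hypothetical measurements from a 10-second test window.
    results = [
        RequestResult(0.8, 5.0, 400),
        RequestResult(2.1, 6.0, 500),  # slow first token
        RequestResult(1.0, 9.5, 600),  # slow overall
        RequestResult(1.2, 7.0, 300),
    ]
    print(raw_throughput(results, 10.0), goodput(results, 10.0))
```

In this invented sample the gateway delivers 180 tokens/s in raw terms but only 70 tokens/s of Goodput, because half of the requests broke a latency limit.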

To test the real scheduling capability of relay gateways, GatewayBench runs ultra-long-context stress tests at the 100k level. These tests observe whether the system degrades gracefully under heavy load or instead suffers long-tail latency or even silent downgrades. What enterprises pay for is exactly this ability to deliver results on time under extreme business pressure.

L3 Affordability: Penetrating the Billing Fog to Reveal 'the True Cost per Million Tokens'

Price is the most sensitive variable for enterprises procuring large model relay APIs, and it is also the easiest to repackage. The unit price on a webpage looks clear, but once in production, costs are influenced by input-output ratios, caching rules, failed requests, top-up terms, exchange rates, and payment channel fees.

Therefore, GatewayBench assigns a 40% weight to L3 Affordability, making it a core dimension alongside L1 Trustworthiness. The focus here is not on the nominal unit price on the price list, but on the True Cost per 1M Tokens – the cost an enterprise ultimately pays under real business loads.

For listed pricing, GatewayBench separates input price, output price, cache hit price, and cache write price, preventing a single composite price from masking the cost structure. The input-output ratios vary greatly across different business scenarios: RAG, long document analysis, and Agent workflows are typically high-input, low-output; code generation and content creation may lead to higher output costs. Looking only at a single comprehensive price can easily lead to misjudging real expenses.

For pricing relative to the official price, GatewayBench introduces the price ratio 'Platform Price / Official Price' and considers the gateway's role to determine whether the premium is reasonable. Aggregation routing, multi-channel failover, and unified billing can justify a certain service premium; a simple forwarding proxy should be closer to the official price. A low price is not necessarily an advantage, and a high price needs to correspond to real engineering value.
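A minimal sketch of the ratio and a role-aware reading of it follows. The role names and tolerance values are hypothetical placeholders; the article does not publish the thresholds GatewayBench applies.

```python
# Hypothetical upper bounds on Platform Price / Official Price per gateway role.
MAX_EXPECTED_RATIO = {
    "forwarding_proxy": 1.05,     # simple passthrough, should sit near official price
    "aggregation_gateway": 1.30,  # routing, failover, unified billing justify a premium
}


def price_ratio(platform_price, official_price):
    """Platform Price / Official Price for the same model and token type."""
    if official_price <= 0:
        raise ValueError("official_price must be positive")
    return platform_price / official_price


def assess_premium(role, platform_price, official_price):
    """Classify the ratio; a below-official price is surfaced, not rewarded."""
    ratio = price_ratio(platform_price, official_price)
    if ratio < 1.0:
        return "below_official"
    if ratio <= MAX_EXPECTED_RATIO[role]:
        return "within_expected"
    return "premium_needs_justification"


if __name__ == "__main__":
    print(assess_premium("forwarding_proxy", 3.6, 3.0))
    print(assess_premium("aggregation_gateway", 3.6, 3.0))
    print(assess_premium("forwarding_proxy", 1.5, 3.0))
```

The same 20% premium is classified differently depending on the gateway's role, and a price well below the official one is reported as its own category because, as noted above, a low price is not necessarily an advantage.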

The GatewayBench framework will also delve into the hidden frictions behind the bills: Are failed requests forcibly charged? Are the claimed caching discounts actually passed back? Are there strict consumption limits on the financial account? By deconstructing layer by layer, GatewayBench ultimately restores the true cost for enterprises, stripped of all marketing gloss.

Join GatewayBench: Rewarding Honest Delivery in the Market

The API relay market is not going to disappear because of controversy. As long as regional, payment, risk control, and compliance differences exist for model access, third-party gateways will continue to meet real demand. Since this is an unavoidable reality, GatewayBench aims to make this relay system more transparent, trustworthy, and sustainable.

The biggest contradiction in the current API relay market is that information asymmetry amplifies the 'bad money drives out good' effect. Some service providers gain short-term traffic through model swapping, billing packaging, cache arbitrage, or low-price marketing. Meanwhile, vendors committed to passing requests through to the official original models, transparent billing, and stable service struggle to be seen in a noisy market, because they cannot price below their real costs.

This is not a structure a mature infrastructure market can rely on long-term. Any mature supply chain needs a set of metric-based market trust mechanisms: good services should be seen, stable performance should be recorded, and honest delivery should earn more traffic, more trust, and higher-quality procurement budgets. Service providers who rely on information asymmetry for long-term gains should bear higher trust costs.

Based on the vision of reshaping industry trust consensus, the AI large model relay station evaluation platform Check4U.ai has officially announced its launch. It extends an
