AI model relay stations need not only benchmarks but also audits: GatewayBench evaluation framework launches, with the official website Check4U.ai now open

Odaily Senior Author

2026-05-29 10:15


AI relay stations solve the problem of model access, but they also create new trust issues. The AI relay station evaluation framework, GatewayBench, has now opened its evaluation portal via the official website Check4U.ai, aiming to transform model authenticity, billing transparency, cache isolation, and real costs into verifiable metrics. GatewayBench also opens its public evaluation leaderboard to API relay stations, aggregation gateways, and model service providers, hoping to use auditable data to help honest delivery gain more market trust.

AI Summary


Key Takeaway: The market for AI large model relay APIs (Shadow APIs) conceals practices such as model substitution, hidden fees, and falsely advertised low prices, which undermine the reproducibility of research results and the stability of enterprise operations. The GatewayBench audit framework evaluates gateways along three dimensions (Trustworthiness, Economy, and Performance), aiming to open up the black box and promote market transparency.
Key Elements:
1. An audit by the CISPA Helmholtz Center for Information Security found that at least 187 academic papers globally have used relay APIs, 62% of which were accepted at top-tier conferences. Their results risk being irreproducible because the underlying model may have been substituted, quantized, or downgraded.
2. Hidden Practice 1: Dynamic model watering down. During high concurrency, the relay gateway can quietly replace the original model with a quantized version, a distilled version, or a low-cost open-source model, leading to unstable output quality.
3. Hidden Practice 2: Hidden billing and privacy risks. This includes inflating the "thinking tokens" consumption of reasoning models, not honoring caching discounts, and sharing cache pools across accounts, which invalidates enterprise data isolation.
4. Hidden Practice 3: Falsely advertised "lowest price on the market". Blended unit prices mask the true cost. Bills for high-input scenarios (e.g., RAG) can be far higher than expected, and hidden frictions like charges for failed requests and high minimum top-up thresholds drive up actual spending.
5. GatewayBench Audit Framework: The weight system is 40% Trustworthiness + 40% Economy + 20% Performance. Core audits include the RUT algorithm for verifying model authenticity, PALACE for estimating reasonable token ranges, and SLO metrics for assessing reliable delivery capability.
6. The Economy dimension introduces the "True Cost per 1M Tokens" concept, breaking down input/output/cache prices, comparing them to official prices, and identifying hidden costs such as charges for failed requests and cache discounts that are never passed on.
7. The Check4U.ai platform has launched an open evaluation leaderboard, aiming to use a unified, open-source framework to allow trustworthy, honest service providers to gain market recognition and drive the industry towards a verifiable and accountable transformation.

Is the GPT-5 or Claude that companies spend millions on each month actually the official original model?

To navigate the complex geopolitical restrictions, enterprise compliance processes, and payment barriers surrounding major AI models, a growing number of R&D teams are opting for a shortcut: the third-party AI relay market (Shadow API). This alternative appears highly attractive, offering developers lower-barrier access to mainstream models like GPT and Claude, with an experience close to the official API, filling gaps that official channels cannot cover.

However, this seemingly convenient "gateway" hides a deep, opaque black box.

In March 2026, an audit released by the CISPA Helmholtz Center for Information Security in Germany reported that at least 187 academic papers worldwide had used AI model relay APIs (Shadow APIs) in their research, 62% of which were accepted by top academic conferences such as CVPR and ICLR. Because these relay gateways secretly swapped, quantized, or downgraded the underlying models, a large number of research findings now risk being irreproducible.

This "disaster" in academia translates into a potential "ticking time bomb" ready to explode in enterprise production environments.

Today's large language models are no longer just lab test subjects; they have become core infrastructure, deeply integrated into customer service centers, code generation pipelines, agent workflows, and risk control business chains. To enterprises running such business-critical systems, relay service providers typically pitch selling points like "original model," "low price," "high speed," and "cache support."

The problem lies in the fact that traditional benchmarking tools simply cannot detect the tricks within the black box. Existing scoring software primarily focuses on the speed and price visible outside the API endpoint. These tools operate on the default assumption that all data returned by the gateway is authentic and reliable. Conventional evaluations simply cannot answer the following core business security questions:

● When business traffic peaks, has the expensive model you purchased been secretly "swapped" or "downgraded"?

● Behind the touted speed and low price, are inflated token counts sneaked into your bill, or are you charged for failed requests?

● Are the cache discounts claimed by the gateway genuinely passed back to the enterprise, and is cross-account private data strictly isolated?

AI relay APIs do solve the access problem for large models, but they also create a new crisis of trust. Without visibility into these hidden backend paths, making procurement decisions based solely on surface-level unit prices and speed is equivalent to running blindfolded. Ending the black-box era and making honest delivery a market advantage again is the most urgent task facing the AI supply chain.

Unveiling the Black Box: Three "Unspoken Rules" of the AI Model Relay Market

Why can't existing conventional scoring software detect these problems? Because traditional evaluation tools only operate outside the API endpoint, focusing solely on response speed and advertised prices. Behind the highly opaque gateways, vendors are exploiting this information asymmetry, converting complex technical maneuvers into hidden arbitrage tools.

A deep dive into the current AI model relay market reveals three highly insidious "unspoken rules" that are eroding enterprise business quality and budgets:

Unspoken Rule 1: Dynamic Model Watering Down

In the gray relay market, the hardest arbitrage tactic to detect is dynamic model substitution.

Many service providers behave honestly when facing conventional benchmark tools or low system traffic, faithfully calling the official original models. However, when hit with high-concurrency business peaks or operating in areas difficult to monitor, these gateways quietly switch the backend to cheaper quantized models, weaker distilled versions, or even identically named but low-cost open-source models.

While the API interface might still be producing text on time on the surface, the underlying probability distribution has been tampered with. The actual output quality received by the enterprise no longer matches the metrics originally promised by the provider. This model swapping behavior exposes enterprise customer service response accuracy and code generation quality to uncontrolled risks of degradation at any time.

Unspoken Rule 2: Invisible Bills (Hidden Billing and Privacy Exposure)

With the prevalence of reasoning models, a new cost item has appeared on LLM bills that is harder to verify: thinking tokens. Since this part of the reasoning process is invisible by default, buyers struggle to determine if the reasoning consumption claimed by the platform is real, leaving room for unscrupulous gateways to inflate costs.

Even more frightening than inflated bills is runaway cache fraud. Some service providers display cache hit markers on the bill but fail to pass on the actual discount to the enterprise, reducing cache optimization to a mere bookkeeping game. Worse still, to force higher cache hit rates and compress their own costs, certain gateways may forcibly stuff prompts from different enterprises into the same shared cache pool. This directly impacts the data isolation boundaries of multi-tenant systems, exposing core business data and commercial privacy to the risk of cross-account contamination.

Unspoken Rule 3: The Illusion of "Lowest Price Across the Web" (A Carefully Designed Financial Trap)

When procuring LLM APIs, "lowest price across the web" often grabs the most attention. But in real business operations, the nominal unit price does not equal the true cost. Especially in the relay gateway market, input, output, cache hit, and cache write prices are often bundled into a single composite price. While convenient for comparison, this becomes easily distorted under real business loads.

This distortion stems from the fact that there is no fixed input-to-output ratio for enterprise LLM calls. RAG, long document analysis, and complex agent workflows typically involve high input and low output; code generation and content creation might involve low input and high output. If a platform only shows a blended unit price, it is difficult for buyers to pinpoint where the costs are actually incurred. A platform with a seemingly low blended price might optimize its offering by lowering the output unit price while inflating prices for input, cache writes, or other less conspicuous items. Consequently, the actual bill for certain high-input scenarios could be far higher than expected.
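A small worked example makes the distortion concrete. The sketch below is illustrative only: both price lists are invented rather than taken from any real platform, and the blended price is assumed to be a simple 50/50 average of the input and output prices.

```python
# Hypothetical price lists in USD per 1M tokens; all numbers are made up.
PLATFORM_A = {"input": 3.0, "output": 15.0}
PLATFORM_B = {"input": 6.0, "output": 8.0}


def blended_price(prices):
    """Blended unit price, assuming a 50/50 input/output mix."""
    return (prices["input"] + prices["output"]) / 2


def workload_cost(prices, input_tokens, output_tokens):
    """Cost in USD of a workload with an explicit input/output split."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


# A RAG-style workload: 95M input tokens, 5M output tokens.
rag_a = workload_cost(PLATFORM_A, 95_000_000, 5_000_000)
rag_b = workload_cost(PLATFORM_B, 95_000_000, 5_000_000)

print("blended:", blended_price(PLATFORM_A), blended_price(PLATFORM_B))
print("RAG bill:", rag_a, rag_b)
```

With these invented numbers, platform B advertises the lower blended price (7 versus 9 USD) yet bills 610 USD for the RAG workload against 360 USD on platform A, because almost all of the tokens are input.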

Furthermore, hidden costs lurk in various obscure clauses. For instance, during system interruptions, timeouts, or 5xx errors, failed requests might still be charged. Low unit prices are often tied to high minimum top-up amounts, take-it-or-leave-it "non-refundable balance" policies, and opaque forex and payment channel fees. When all these financial frictions are combined, the actual cost debited from the enterprise's account per million tokens can be several times higher than the nominal price advertised on the website.

GatewayBench: A Professional Audit Framework for LLM Gateways

Faced with highly opaque gateway backends, traditional speed testing and scoring tools have significant limitations. These metrics can compare response speed, model coverage, and nominal prices, but struggle to answer more critical questions: Does the model actually used by the gateway match what was promised? Is the billing transparent? Are the cache and data isolation trustworthy?

Against this backdrop, the GatewayBench audit evaluation framework has been officially launched, with the evaluation portal open on its official website, Check4U.ai. As an open-source audit framework for LLM gateways, GatewayBench goes beyond speed and surface-level pricing, breaking down gateway evaluation into three dimensions: Trustworthiness, Economy, and Performance, using a weight system of 40% Trustworthiness + 40% Economy + 20% Performance.

This weighting reflects GatewayBench's core judgment: in the context of AI model relay APIs, trustworthiness and true cost take precedence over speed. A gateway must first prove model authenticity, billing transparency, and cost explainability before it qualifies for performance comparison.
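The weighting and the trust-first ordering can be sketched in a few lines. Only the 40/40/20 weights come from the article; the 0-100 scale, the `TRUST_GATE` pass mark, and the function name are assumptions made for illustration, since the framework's actual scoring code is not shown here.

```python
# Weights as stated in the article.
WEIGHTS = {"trust": 0.40, "economy": 0.40, "performance": 0.20}

# Hypothetical pass mark for the trust gate; the article does not give one.
TRUST_GATE = 60.0


def composite_score(trust, economy, performance):
    """Weighted score on an assumed 0-100 scale, or None if trust is too low."""
    if trust < TRUST_GATE:
        # A gateway that fails the trust audit is not ranked at all.
        return None
    return (WEIGHTS["trust"] * trust
            + WEIGHTS["economy"] * economy
            + WEIGHTS["performance"] * performance)


print(composite_score(90, 80, 50))   # trustworthy but slow gateway
print(composite_score(40, 95, 100))  # fast and cheap, but fails the gate
```

Under these assumptions, the slow but trustworthy gateway still receives a score, while the fast and cheap one that fails the gate receives none, regardless of its speed.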

To achieve this goal, GatewayBench provides three core audit capabilities:

L1 Trustworthiness Audit: From Self-Proclamation to Verifiable Trust

Within GatewayBench's scoring system, L1 Trustworthiness accounts for 40% of the weight. The logic behind this design is: in the AI model relay API scenario, while speed and price are important, if the model is not authentic and the billing is not transparent, other metrics lose their basis for discussion.

The core risk of third-party LLM relay gateways stems from the invisible processes behind a successful API call. A normal response at the interface level only indicates that a request was processed, but it cannot prove that the model origin, billing process, and cache handling all conform to the platform's promises. In the past, these aspects lacked externally verifiable evidence, making systematic auditing difficult.

GatewayBench's L1 dimension is precisely designed to convert these vague suspicions into auditable engineering signals. It breaks down trustworthiness into three questions: Is the model authentic? Is billing transparent? Is the cache trustworthy? It uses statistical tests, cryptographic structures, and latency fingerprints to observe what's actually happening inside the gateway black box from an external perspective.

Regarding model authenticity, GatewayBench introduces the RUT (Rank-based Uniformity Test) algorithm to check the position of output tokens within the probability ranking of a reference model. Different models might generate similar text, but their token probability distributions are harder to fake. If the backend undergoes quantization, downgrading, or substitution, the distribution drift will leave traces. Concurrently, GatewayBench can combine Logprob Tracking, making a single-token request under a fixed prompt and tracking whether its log probability shows a stable offset over different periods. This provides a lower-cost signal for continuous monitoring of model updates, fine-tuning, quantization adjustments, or route changes.
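The idea behind a rank-based uniformity check can be illustrated with a minimal sketch. It assumes that, for each prompt, the auditor has the log-probability of the gateway's response under the reference model plus the log-probabilities of several responses sampled from the reference model itself. If the gateway really serves that model, the rank of its response among the reference samples should be roughly uniform. The Kolmogorov-Smirnov distance is used here as a simple stand-in uniformity statistic; the article does not say which statistic RUT uses, and all function names are hypothetical.

```python
import math


def rank_percentile(api_logprob, reference_logprobs):
    """Position of the gateway's response among reference samples, in (0, 1)."""
    below = sum(1 for lp in reference_logprobs if lp < api_logprob)
    # Mid-bin placement keeps the value strictly inside (0, 1).
    return (below + 0.5) / (len(reference_logprobs) + 1)


def ks_uniform_statistic(values):
    """Kolmogorov-Smirnov distance between the sample and Uniform(0, 1)."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def looks_substituted(percentiles, coefficient=1.36):
    """Flag drift when the distance exceeds the ~5% asymptotic critical value."""
    threshold = coefficient / math.sqrt(len(percentiles))
    return ks_uniform_statistic(percentiles) > threshold


# Synthetic data: evenly spread ranks versus ranks squeezed toward zero.
honest = [(i + 0.5) / 100 for i in range(100)]
drifted = [x ** 3 for x in honest]

print(looks_substituted(honest), looks_substituted(drifted))
```

The synthetic `drifted` sample stands in for a backend whose responses consistently rank low under the reference model, which is the kind of distribution drift the paragraph above describes.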

For billing transparency, GatewayBench uses PALACE to estimate a reasonable range for thinking tokens, identifying anomalous over-reporting in reasoning models. It also leverages verifiable structures like CoIn to make billing records more traceable and tamper-resistant.
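Once a reasonable range is available, the over-reporting check itself is simple. The sketch below does not implement PALACE or CoIn; it assumes an external estimator has already produced an upper bound for each request and only shows the comparison step. The record layout and the `tolerance` value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BilledRequest:
    request_id: str
    reported_thinking_tokens: int  # what the gateway billed
    estimated_upper: int           # upper bound from an estimator (assumed given)


def inflation_ratio(req):
    """Billed thinking tokens relative to the estimated upper bound."""
    return req.reported_thinking_tokens / req.estimated_upper


def flag_over_reporting(requests, tolerance=1.2):
    """IDs of requests billed more than `tolerance` times the upper bound."""
    return [r.request_id for r in requests if inflation_ratio(r) > tolerance]


bills = [
    BilledRequest("req-1", 900, 1000),   # within the estimated range
    BilledRequest("req-2", 2500, 1000),  # 2.5x the upper bound
]
print(flag_over_reporting(bills))
```

A single flagged request proves little on its own; a persistent pattern across many requests is the more meaningful signal.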

For cache trustworthiness, GatewayBench uses latency fingerprints to determine if a cache hit is genuine and employs cross-account isolation tests to identify potential tenant boundary issues. If the bill shows a cache hit but the TTFT (Time to First Token) does not decrease correspondingly, the discount might only exist on paper. If anomalous cache reuse occurs between two independent accounts, it could signal a cache isolation risk.
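The latency-fingerprint idea can be sketched as a comparison of first-token latency between requests billed as cache hits and requests billed as misses. This is a simplified illustration, not the framework's actual test: the `max_ratio` threshold and the sample timings are invented, and a real audit would need many more samples and control for prompt length.

```python
from statistics import median


def cache_discount_is_plausible(ttft_claimed_hits, ttft_misses, max_ratio=0.8):
    """True if TTFT on claimed cache hits is clearly lower than on misses."""
    return median(ttft_claimed_hits) <= max_ratio * median(ttft_misses)


# Time to first token in seconds; all values are made up.
misses = [1.20, 1.35, 1.10, 1.28]
genuine_hits = [0.42, 0.50, 0.39, 0.47]  # latency drops as expected
paper_hits = [1.18, 1.30, 1.12, 1.25]    # billed as hits, no speed-up

print(cache_discount_is_plausible(genuine_hits, misses))
print(cache_discount_is_plausible(paper_hits, misses))
```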

Through these methods, GatewayBench transforms a black box, previously assessable only by intuition and suspicion, into a set of measurable, auditable, and comparable signals, achieving truly "verifiable trust."

L2 Performance: Extreme Load Testing to Assess Stable Delivery Under Pressure

Only gateways that pass the L1 Trustworthiness audit are eligible for performance and cost-effectiveness comparison.

In the LLM infrastructure ecosystem, performance has always been the metric relay stations and aggregated API providers emphasize most. Claims like "fastest on the web" or "150 tokens/s per concurrent request" are common. However, GatewayBench designed its metric system with a deliberately restrained weight allocation for performance: L2 Performance accounts for only 20% of the composite score.

The reason is simple. Speed is important; it determines basic system usability and can weed out services plagued by frequent freezes or long-tail latency. But speed should not take precedence over trustworthiness and economy. A gateway, even if very fast, cannot be a trustworthy enterprise-grade infrastructure if it engages in model substitution, billing inflation, or cache opacity.

Therefore, in the L2 Performance dimension, GatewayBench does not chase single-point peak speed. Instead, it breaks performance down into questions more relevant to production environments: Latency determines how long a user waits; Goodput measures how much effective throughput can be delivered within a latency budget; Long-context testing observes how the system degrades under heavy loads.

The logic behind this design is: performance is a threshold requirement, but not the end goal. What enterprises truly purchase is stable, timely, and predictable delivery capability under a premise of trust.

In commercial-grade applications, peak speeds observed under no-load conditions offer extremely limited reference value. GatewayBench rejects simply comparing throughput (tokens/s) and instead introduces Service Level Objectives (SLOs) as business red lines. An enterprise can stipulate, for example, that the P95 TTFT (Time to First Token) must be less than 1.5 seconds, the P95 end-to-end latency must be less than 8 seconds, and streaming output must not exhibit significant jitter. Goodput is built on this red line: only throughput delivered while meeting the SLO counts as effective capacity.
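The difference between raw throughput and Goodput can be shown with a short sketch that uses the 1.5-second and 8-second thresholds from the example above. Counting a request toward Goodput only when it individually stays inside both thresholds is an assumption of this sketch, as are the class and function names; the framework's exact definition is not given in the article.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    first_response_s: float  # latency until the first streamed response, seconds
    e2e_s: float             # end-to-end latency, seconds
    output_tokens: int


def p95(values):
    """Nearest-rank 95th percentile."""
    xs = sorted(values)
    rank = math.ceil(95 * len(xs) / 100)
    return xs[rank - 1]


def meets_slo(samples, first_slo=1.5, e2e_slo=8.0):
    """Check both P95 thresholds from the example SLO."""
    return (p95([s.first_response_s for s in samples]) < first_slo
            and p95([s.e2e_s for s in samples]) < e2e_slo)


def goodput(samples, window_s, first_slo=1.5, e2e_slo=8.0):
    """Tokens/s from requests that individually stay inside both thresholds."""
    good_tokens = sum(s.output_tokens for s in samples
                      if s.first_response_s < first_slo and s.e2e_s < e2e_slo)
    return good_tokens / window_s


samples = [
    RequestSample(0.8, 5.0, 400),
    RequestSample(1.2, 7.5, 600),
    RequestSample(2.4, 12.0, 1000),  # slow request: excluded from goodput
]
raw_throughput = sum(s.output_tokens for s in samples) / 10.0
print(raw_throughput, goodput(samples, window_s=10.0), meets_slo(samples))
```

In this made-up window the gateway reports 200 tokens/s of raw throughput, but only 100 tokens/s arrive within the latency budget.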

To test the true scheduling capability of relay gateways, GatewayBench throws in stress tests with ultra-long contexts up to 100k tokens. This type of test observes whether the system maintains stable, graceful degradation under heavy load, or experiences long-tail latency issues, or even hidden downgrading. What enterprises pay for is precisely this stable delivery capability that can still "submit on time" under extreme business pressure.

L3 Economy: Penetrating the Billing Fog to Reveal the "True Cost per Million Tokens"

Price is the most sensitive variable for enterprises procuring LLM relay APIs, and it is also the easiest to repackage. Listed unit prices on websites seem clear, but once production calls begin, costs are influenced by input-output ratios, cache rules, failed requests, top-up terms, exchange rates, and payment channel fees.

Therefore, GatewayBench assigns L3 Economy a 40% weight, making it a core dimension alongside L1 Trustworthiness. The focus here is not on the nominal unit price on the price list, but on the "True Cost per 1M Tokens," i.e., the cost an enterprise ultimately pays under real business workloads.

For explicit pricing, GatewayBench breaks down the input price, output price, cache hit price, and cache write price, preventing a single blended price from masking the cost structure. Input-output ratios vary greatly across different business scenarios. RAG, long document analysis, and agent workflows typically have high input and low output; code generation and content creation may lead to higher output costs. Looking only at a composite price easily leads to miscalculating actual expenditure.

Regarding pricing relative to official channels, GatewayBench introduces a "Platform Price / Official Price" ratio and evaluates whether the premium is reasonable based on the gateway's role. Aggregated routing, multi-channel failover, and unified billing can justify a certain service premium; simple forwarding proxies should be closer to official prices. A low price is not necessarily an advantage, and a high price must correspond to genuine engineering value.

The GatewayBench framework will also delve into hidden friction behind the bills: Are failed requests forcefully charged? Are the claimed cache discounts genuinely passed back? Are there strict consumption limits on fund accounts? By deconstructing layer by layer, GatewayBench ultimately restores the actual debited cost for the enterprise, stripped of all marketing packaging.
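A rough sketch of how such a reconstruction might look is given below. It assumes the auditor can obtain three money figures (total charges including those for failed requests, payment and forex fees, and any non-refundable balance left unused) plus the number of tokens delivered by successful requests. All names and numbers are hypothetical; only the "Platform Price / Official Price" ratio is taken directly from the article.

```python
def true_cost_per_million(charged_usd, fees_usd, stranded_balance_usd,
                          successful_tokens):
    """Total money out of pocket per 1M tokens from successful requests only."""
    spent = charged_usd + fees_usd + stranded_balance_usd
    return spent * 1_000_000 / successful_tokens


def price_ratio(platform_price, official_price):
    """The 'Platform Price / Official Price' ratio for one price component."""
    return platform_price / official_price


# Invented monthly figures for one gateway.
nominal_price = 4.0  # advertised USD per 1M tokens
actual = true_cost_per_million(
    charged_usd=450.0,          # includes charges for failed requests
    fees_usd=15.0,              # forex and payment-channel fees
    stranded_balance_usd=35.0,  # unused, non-refundable top-up
    successful_tokens=80_000_000,
)
print(actual, actual / nominal_price)
print(price_ratio(platform_price=4.0, official_price=5.0))
```

With these figures the enterprise effectively pays 6.25 USD per million useful tokens, about 1.56 times the advertised price, even though the list price sits below the official one.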

Join GatewayBench: Let Honest Delivery Reap Market Rewards

The API relay market will not disappear due to controversy. As long as regional, payment, risk control, and compliance differences exist for model access, third-party gateways will continue to serve real demand. Since this is an unavoidable reality, GatewayBench's goal is to make these relay points more transparent, trustworthy, and sustainable.

The biggest contradiction in the current API relay market is that information asymmetry is amplifying the "bad money drives out good" effect. Some providers can gain short-term traffic through model swapping, billing tricks, cache arbitrage, or low-price marketing. Meanwhile, providers committed to passing through the original model, transparent billing, and stable service cannot price below their real costs, and so find it harder to be noticed in a noisy market.

This is not a structure the infrastructure market can rely on long-term. Any mature supply chain requires a set of metric-based market trust mechanisms: good services should be seen, stable delivery should be recorded, and honest provision should be rewarded with more traffic, more trust, and higher-quality procurement budgets. Providers who profit long-term from information asymmetry should bear higher trust costs.

Based on this vision of reshaping industry trust consensus, the AI model relay evaluation platform Check4U.ai officially announces its launch and extends an open invitation to API relay stations, aggregated routing platforms, and model service providers worldwide: join the public GatewayBench evaluation leaderboard.

