Zhipu launches GLM-5.1 High-Speed API with an output speed of 400 tokens/s

2026-05-22 03:19

Odaily reports that Zhipu has launched the GLM-5.1 High-Speed API for select enterprise customers. The model's output speed reaches 400 tokens/s, which the report describes as a new global record for end-to-end speed among official large-model APIs.
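To put the headline figure in per-token terms, the short sketch below converts an output rate into time per token and time for a full reply. The 400 tokens/s rate comes from the report; the 2,000-token reply length and the function names are illustrative assumptions, and the estimate ignores time to first token and network overhead.

```python
def milliseconds_per_token(tokens_per_second: float) -> float:
    """Average time spent emitting one token, in milliseconds."""
    return 1000.0 / tokens_per_second


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady output rate, in seconds."""
    return num_tokens / tokens_per_second


REPORTED_RATE = 400.0        # tokens/s, as stated in the report
EXAMPLE_REPLY_TOKENS = 2000  # hypothetical reply length, not from the report

per_token_ms = milliseconds_per_token(REPORTED_RATE)
reply_seconds = generation_seconds(EXAMPLE_REPLY_TOKENS, REPORTED_RATE)

print(f"{per_token_ms} ms per token")
print(f"{reply_seconds} s for a {EXAMPLE_REPLY_TOKENS}-token reply")
```

At the reported rate, each token takes 2.5 ms on average, so a reply of the assumed length would stream in about 5 seconds.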

According to the report, the high-speed version retains the capabilities of the original flagship model and runs on a high-performance inference engine jointly developed by Zhipu and the TileRT team. The engine restructures GPU runtime scheduling: the model is statically organized into persistent engine kernels that stay resident on the GPU, which reduces the kernel-launch and memory read/write latency of conventional inference.
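Why kernel-launch overhead matters at this speed can be seen with a toy latency model. The sketch below is not a description of TileRT's implementation; it only illustrates the general idea that a fixed per-launch cost, multiplied by many launches per token, can dominate a millisecond-scale budget. Every number in it (launch count, launch cost, compute time) is a hypothetical placeholder, not a measurement.

```python
def per_token_latency_us(num_launches: int,
                         launch_overhead_us: float,
                         compute_us: float) -> float:
    """Toy model: per-token latency = launch overhead + useful compute."""
    return num_launches * launch_overhead_us + compute_us


# All values below are assumptions chosen for illustration only.
LAUNCH_OVERHEAD_US = 5.0   # hypothetical cost of one kernel launch
COMPUTE_US = 2000.0        # hypothetical useful GPU work per token

# Conventional execution: many separate kernel launches per token.
conventional_us = per_token_latency_us(1000, LAUNCH_OVERHEAD_US, COMPUTE_US)

# Persistent-kernel style: the kernel stays resident, modelled as one launch.
persistent_us = per_token_latency_us(1, LAUNCH_OVERHEAD_US, COMPUTE_US)

print(f"conventional: {conventional_us} us per token")
print(f"persistent:   {persistent_us} us per token")
```

Under these assumed values the launch cost alone exceeds the useful compute in the conventional case, which is the kind of overhead a resident, persistent kernel is meant to remove.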

In multi-GPU deployments, TileRT further specializes the GPU nodes in an 8-card NVL topology into workers with different functions, improving attention-layer computation and cross-card communication efficiency.

The high-speed service is currently available to select enterprise customers on Zhipu's MaaS platform. The company plans to keep optimizing FP8 inference and ultra-long-context capabilities to support low-latency scenarios such as AI programming, real-time interaction, and real-time voice.