Zhipu launches GLM-5.1 High-Speed API with an output speed of 400 tokens/s
Odaily reported that Zhipu has launched the GLM-5.1 High-Speed API for select enterprise customers. The model reaches an output speed of 400 tokens/s, setting a new global record for end-to-end speed among official large-model interfaces.
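To put the headline figure in concrete terms, the short sketch below converts an output rate into per-token latency and total generation time. It assumes a steady decode rate and ignores time to first token and network overhead; the 2,000-token response length is a hypothetical value chosen for illustration, not a figure from the report.

```python
def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average time between output tokens, in milliseconds."""
    return 1000.0 / tokens_per_second


def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady output rate, in seconds."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    rate = 400  # tokens/s, the reported output speed
    response_tokens = 2000  # hypothetical response length
    print(per_token_latency_ms(rate))                 # 2.5
    print(generation_time_s(response_tokens, rate))   # 5.0
```

At the reported rate, each token arrives roughly every 2.5 ms, so a response of that hypothetical length would stream in about five seconds under these assumptions.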
It is understood that the high-speed version retains the capabilities of the original flagship model and is powered by a high-performance inference engine jointly developed by Zhipu and the TileRT team. The engine restructures the GPU runtime scheduling mechanism, statically organizing the model into persistent engine kernels that stay resident on the GPU. This reduces the kernel launch and memory read/write latency incurred in traditional inference.
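The general reason a resident kernel can cut launch latency can be sketched with a toy cost model. This is not TileRT's implementation or a measurement of it: the function names and all numbers below are hypothetical, picked only to show how a fixed per-launch overhead scales when every operation is launched separately versus when one long-lived kernel processes all of them.

```python
def per_launch_total_us(num_ops: int, launch_us: int, compute_us: int) -> int:
    """Every operation pays the launch overhead before it computes."""
    return num_ops * (launch_us + compute_us)


def persistent_total_us(num_ops: int, launch_us: int, compute_us: int) -> int:
    """One resident kernel is launched once, then runs every operation."""
    return launch_us + num_ops * compute_us


if __name__ == "__main__":
    # Hypothetical values for illustration only (microseconds).
    NUM_OPS = 1000
    LAUNCH_US = 5
    COMPUTE_US = 10

    print(per_launch_total_us(NUM_OPS, LAUNCH_US, COMPUTE_US))   # 15000
    print(persistent_total_us(NUM_OPS, LAUNCH_US, COMPUTE_US))   # 10005
```

In this simplified model the saving grows with the number of operations, which is why launch overhead matters most during token-by-token decoding, where each step consists of many small operations.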
In multi-GPU scenarios, TileRT further assigns the GPU nodes in an 8-card NVL topology to specialized functional workers, improving the efficiency of attention-layer computation and cross-card communication.
Currently, this high-speed service has been made available to select enterprise customers of Zhipu's MaaS platform. In the future, the company will continue to optimize FP8 inference and ultra-long context capabilities, providing support for low-latency scenarios such as AI programming, real-time interaction, and real-time voice.
