Zhipu Releases GLM-5.1 High-Speed API with Output Speed of 400 tokens/s
Odaily Planet Daily News: Zhipu AI has launched the GLM-5.1 High-Speed API for select enterprise customers. The model's output speed reaches 400 tokens/s, setting a new global record for end-to-end speed among official large-model APIs.
The high-speed version reportedly retains the capabilities of the original flagship model and is powered by a high-performance inference engine jointly developed by the Zhipu AI and TileRT teams. The engine restructures the GPU runtime scheduling mechanism: it statically organizes the model into a persistent Engine Kernel that stays resident on the GPU, which reduces the kernel launch overhead and memory read/write latency associated with traditional inference.
In multi-GPU scenarios, TileRT further specializes the GPUs within an 8-card NVL topology into workers with different functions, improving the efficiency of attention-layer computation and inter-card communication.
The high-speed service is currently available to select enterprise customers on Zhipu AI's MaaS platform. Going forward, optimization work will continue around FP8 inference and ultra-long-context capabilities, supporting low-latency scenarios such as AI programming, real-time interaction, and real-time voice.
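
To put the 400 tokens/s figure in concrete terms, here is a minimal back-of-the-envelope sketch. It assumes a constant decode rate and ignores time to first token, network latency, and queuing. The function names and the 2,000-token response length are illustrative assumptions, not figures from the announcement.

```python
def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average gap between consecutive output tokens, in milliseconds."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return 1000.0 / tokens_per_second


def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a constant rate, in seconds."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


REPORTED_SPEED = 400.0   # tokens/s, the output speed stated in the article
EXAMPLE_RESPONSE = 2000  # tokens, a hypothetical response length

print(per_token_latency_ms(REPORTED_SPEED))                 # 2.5
print(generation_time_s(EXAMPLE_RESPONSE, REPORTED_SPEED))  # 5.0
```

Under these simplifying assumptions, each token arrives about every 2.5 ms and a 2,000-token response streams in about 5 seconds. This illustrates why the stated speed matters for interactive uses such as coding assistants and real-time voice.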
