@wuhuikx wuhuikx commented Feb 12, 2026

Over recent months, the Qwen C-end Infrastructure Engineering Team and the AMD AIFW Team have worked together on aggressive latency optimizations for Qwen3 235B and Qwen3 VL 235B on the AMD MI300 series GPU platform, built on the SGLang framework.

The work delivers substantial gains in performance, precision, and stability:
• Qwen3 235B: Time to First Token (TTFT) improved by 1.67× and Time Per Output Token (TPOT) improved by 2.12× over the baseline.
• Qwen3 VL 235B: TTFT improved by 1.62× and TPOT improved by 1.90× over the baseline.
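
To make the speedup figures concrete, here is a minimal sketch that converts each one into a relative latency. It assumes each "×" figure is the ratio of baseline latency to optimized latency, which the text above does not state explicitly. The helper name `relative_latency` is hypothetical, and because no absolute latencies are given, only fractions of the baseline are computed.

```python
def relative_latency(speedup: float) -> float:
    """Return optimized latency as a fraction of baseline.

    Assumption: speedup = baseline_latency / optimized_latency.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# Speedup figures quoted in the PR description above.
speedups = {
    ("Qwen3 235B", "TTFT"): 1.67,
    ("Qwen3 235B", "TPOT"): 2.12,
    ("Qwen3 VL 235B", "TTFT"): 1.62,
    ("Qwen3 VL 235B", "TPOT"): 1.90,
}

for (model, metric), s in speedups.items():
    frac = relative_latency(s)
    print(
        f"{model} {metric}: {s:.2f}x speedup -> "
        f"{frac:.1%} of baseline latency ({1 - frac:.1%} reduction)"
    )
```

Under this reading, a 2.12× TPOT improvement means each output token takes a little under half the baseline time. If the figures were instead measured as throughput ratios, the conversion would differ.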

This paper details the performance optimization techniques that the two teams jointly explored and implemented, with a core focus on ultra-low-latency inference.

wuhuikx commented Feb 12, 2026

@sunway513

wuhuikx changed the title from "Unleashing Computational Power: Ultimate Latency Optimization of Qwen3 and Qwen3-VL on AMD MI300 Series" to "Unleashing Computational Power: Ultimate Latency Optimization of Qwen3 and Qwen3-VL on AMD MI300X Series" on Feb 12, 2026