vLLM Development Activity Report - 2026-04-04
Time window: 2026-04-04 11:50 (UTC+8) ~ 2026-04-05 11:50 (UTC+8)
Stats: 16 new issues | 4 closed issues | 32 new PRs | 7 merged PRs | 5 closed without merging
📊 Daily Development Summary
During the April 4-5 window, the vLLM project maintained a very high level of development activity, with 16 new issues, 32 new PRs, and 7 PRs merged. Community attention centered on performance-regression triage and optimization (notably Qwen3.5 397B decode performance) and on compatibility and functionality issues around Mixture-of-Experts (MoE) models (e.g. Gemma 4 MoE, DeepSeek-R1). Issues on the ROCm platform also continued to receive attention and fixes.
🎯 AMD/ROCm Ecosystem
AMD/ROCm-related activity was fairly busy this cycle, including one significant feature addition, several bug reports, and one controversial episode.
- Feature enhancement PR:
  - #39001 - [ROCm] Support unlimited sequence lengths via multi-pass reduction: submitted by ekuznetsov139. This PR removes the 131072-token sequence-length cap in the custom attention op on ROCm. By implementing a multi-pass reduction, arbitrarily long sequences are now supported; for very long contexts (e.g. 150k tokens) this avoids falling back to the much slower reference Triton kernel and brings a dramatic speedup (0.84 tok/s to 45.1 tok/s). An important boost to long-sequence inference on AMD hardware.
- Issue reports:
  - #39010 - vLLM regression on v0.19.0 which causes model load to stop "Capturing CUDA graphs" step at 71%: a user reports that on ROCm, v0.19.0 hangs at 71% while capturing CUDA graphs, whereas v0.18.1 works fine, indicating a version regression. The bot has auto-labeled the issue and requested more ROCm environment details.
  - #38972 - Mistral Small 4 (119B MoE) fails to start on ROCm MI325X: a user hit two blocking problems starting this large MoE model on an MI325X: 1) the AITER MLA attention kernel hard-codes the supported head counts (only 16 or 128), which does not match the model's 32 heads; 2) compiling the FP8 MoE kernel for 128 experts takes too long (15-20 minutes), causing engine startup to time out. This reflects gaps in AMD-platform support for emerging large-parameter MoE models.
- #38963 - meme PR: a joke PR submitted by functionstackx whose title and body string together a long list of AMD-related keywords plus quips about "how to get promoted", satirizing a previously leaked internal AMD PR-priority dashboard. The PR was closed within 15 minutes by AMD employee AndreasKaratzas, sparking a brief exchange about professionalism. The episode has no technical value, but it reflects community attention to the transparency and motives behind vendor contributions.
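The multi-pass reduction in #39001 is, in spirit, the online-softmax trick used by flash-decoding-style attention kernels: process the key/value sequence in fixed-size chunks and merge partial softmax statistics, so no single pass ever needs the whole sequence in flight. A minimal single-head NumPy sketch of the idea (illustrative only, not the actual HIP kernel):

```python
import numpy as np

def attention_single_pass(q, K, V):
    # Reference: softmax(q . K^T) @ V computed over the full sequence at once.
    s = K @ q                              # scores, shape (seq_len,)
    p = np.exp(s - s.max())
    return (p @ V) / p.sum()

def attention_multi_pass(q, K, V, chunk=1024):
    # Walk the sequence chunk by chunk, keeping only running statistics:
    # running max m, normalizer l, and weighted accumulator acc.
    # Earlier partial results are rescaled whenever the running max grows,
    # so the final answer is exact, not an approximation.
    m, l, acc = -np.inf, 0.0, np.zeros(V.shape[1])
    for i in range(0, K.shape[0], chunk):
        s = K[i:i + chunk] @ q
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)          # rescale old statistics
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + chunk]
        m = m_new
    return acc / l
```

Because each pass touches only one chunk, the sequence length is bounded by memory for K/V rather than by any per-kernel reduction limit, which is what lets the custom op drop its 131072 cap.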
💬 High-Engagement Discussions
- PR #38981 - [Perf][GDN] Align TMA usage with upstream FLA (merged):
  - Core issue: fix an inconsistency in TMA (Tensor Memory Accelerator) usage between vLLM's GDN (Gated Delta Network) kernel and the upstream FLA library, resolving a performance problem on Blackwell GPUs.
  - Discussion: submitter arpera showed through testing that with TMA disabled, vLLM's GDN prefill kernel matches upstream FLA, lifting overall throughput on 8xB200 by 3.2%. ZJY0516 and vadiklyutiy then raised questions, reporting a TPOT (time per output token) regression in small-batch (low-concurrency) scenarios and sharing internal benchmark data. The discussion focused on validating the impact under different workloads.
  - Outcome: after further exchange the PR was merged. The point of contention was how universal the speedup is, but the core fix (aligning with upstream) was judged correct.
- Issue #38982 - Enabling cudagraph_mm_encoder results in ModuleNotFoundError:
  - Core issue: in v0.19.0, enabling the cudagraph_mm_encoder compilation option makes startup fail because of a broken module import path.
  - Discussion: reporter zzlol63 quickly traced the root cause to a refactoring PR in which some files were moved without all related import paths being updated, and listed the affected files and the likely culprit PR, showing strong analytical skill.
  - Status: the problem is confirmed and has spawned a dedicated fix, PR #38997.
- PR #38993 - [Perf] Change Trtllm fp8 MoE to use Shuffled Weights and BlockMajorK Layout:
  - Core issue: switch the Trtllm FP8 MoE kernel's weight layout to "Shuffled Weights and BlockMajorK" for better performance.
  - Discussion: submitter wzhao18 provided thorough benchmark data showing the new layout wins across matrix sizes (up to 1.8x). robertgshaw2-redhat endorsed the finding and triggered the full CI test suite.
  - Status: the PR is open, awaiting CI results and further review.
- Issue #39004 - ~23% output throughput regression on Qwen3.5-397B NVFP4 decode:
  - Core issue: through careful daily-snapshot comparisons, vadiklyutiy found that over the past 10 days, decode throughput for Qwen3.5-397B on 8xB200 regressed by roughly 23%, and noted the timing coincides with the FlashInfer upgrade to 0.6.7 (PR #38423).
  - Discussion: ZJY0516 responded that they had observed a similar problem and tentatively suspect the FlashInfer attention kernel. A textbook case of community collaboration on regression hunting, with an excellent reproducible benchmark from the reporter.
  - Status: open, awaiting investigation and root-cause confirmation from core developers.
🔥 Hot Topics and Trends
- Performance regressions and optimization: the central theme of this cycle. Multiple issues (#39004, #38971) and PRs (#38981, #38990, #38993) revolve around performance, reflecting the community's acute sensitivity to, and sustained pursuit of, performance stability across vLLM releases.
- MoE model compatibility: many issues involve MoE models (#38972 Mistral Small 4, #38999/#39000 Gemma 4 MoE, #38971 Nemotron Super), spanning startup failures, quantization support, and data parallelism. As large MoE models proliferate, vLLM still faces plenty of challenges in supporting them end to end.
- ROCm platform support: beyond the AMD ecosystem section above, ROCm-labeled issues keep appearing, covering long sequences, new-hardware (MI325X) bring-up, and version regressions, showing that AMD compatibility and performance tuning remain a long-term, active workstream.
- Tool calling and multimodal parsing: several PRs fix tool-call parsing for Qwen3, Gemma 4, and other models (#38973, #38996, #38992), a sign that support for complex output formats and APIs is maturing beyond basic inference.
🛠️ Key Technical Changes
- PR #38981 - [Perf][GDN] Align TMA usage with upstream FLA (merged): a key fix that disables the default TMA path on Blackwell GPUs so vLLM's GDN kernel matches the upstream FLA reference implementation, resolving a compiler-induced performance degradation. A positive change for users of GDN-architecture models such as Qwen.
- PR #39001 - [ROCm] Support unlimited sequence lengths (open): a major ROCm enhancement that lifts the previous sequence-length cap; its multi-pass reduction algorithm dramatically improves very-long-context inference, extending what AMD hardware can do.
- PR #39008 - [Quant] Add TurboQuant 4-bit (tq4) KV cache quantization (open): introduces TQ4, a new 4-bit KV-cache quantization scheme using nibble packing and a rotation preprocessing step, claiming nearly double the compression of FP8 (3.76x vs FP16). A frontier exploration in memory optimization.
- PR #38990 - [Bugfix][MoE] Fix 6-8% decode regression (open): fixes a decode-throughput regression in TP-only MoE models caused by incorrect shared-expert execution-order logic introduced by an earlier refactor; important for restoring MoE model performance.
📈 Development Activity
- Contribution pace: 32 new PRs in 24 hours, 7 of them merged, shows an extremely fast development cadence. Contributors include familiar core maintainers such as robertgshaw2-redhat and ZJY0516, alongside community members like arpera and vadiklyutiy who contribute deep performance analysis and testing.
- Review and collaboration: hot PRs (e.g. #38981) show fast, technically rich back-and-forth between developers on performance impact. For bug reports (e.g. #38982), community members quickly pinpoint causes and submit fixes, demonstrating an efficient self-healing loop.
- Issue closure: 4 issues were closed, including an RFC on attribution for the RISC-V CPU backend code (#38974), which was closed within hours of being filed, showing prompt handling of community-norms topics.
💡 Issues Worth Watching
- Major performance regression (#39004): the recent decode slowdown for Qwen3.5-397B needs prompt investigation by the core team, especially its link to the FlashInfer 0.6.7 upgrade; it directly affects many users' production deployments.
- Data parallelism for MoE models: issue #38999 exposes a key gap: the Gemma 4 MoE model crashes with data parallelism (DP>1) when expert parallelism (EP) is not enabled. Documentation or error messages likely need strengthening to steer users toward correct configurations.
- ROCm platform stability: issue #39010 suggests v0.19.0 may carry a CUDA-graph-related regression on ROCm; the AMD team or ROCm maintainers should follow up to keep version upgrades smooth.
📋 Appendix: Detailed Data
New Issues
- #39010 [Bug]: vLLM regression on v0.19.0 which causes model load to stop "Capturing CUDA graphs" step at 71% — bug,rocm — by depuhitv (created: 2026-04-05 11:21 (UTC+8))
- #38988 [Performance]: Qwen 3.5 27B Prefix Caching — performance — by NilsHellwig (created: 2026-04-05 01:00 (UTC+8))
- #39004 [Perf]: ~23% output throughput regression on Qwen3.5-397B NVFP4 decode (8×B200) over the last 10 days — performance — by vadiklyutiy (created: 2026-04-05 07:56 (UTC+8))
- #39000 [Bug]: Gemma 4 MoE (26B-A4B) — runtime MXFP4 quantization crashes during weight loading in fused MoE layer — bug — by leuski (created: 2026-04-05 05:29 (UTC+8))
- #38999 [Bug]: Gemma 4 MoE (26B-A4B) crashes with `--data-parallel-size > 1` — AssertionError in cuda_communicator all_gather — bug — by leuski (created: 2026-04-05 05:26 (UTC+8))
- #38991 [Bug]: runai_safetensors_weights_iterator yields tensors in nondeterministic order, breaking FP8 inference on some platforms — no labels — by janbernloehr (created: 2026-04-05 02:37 (UTC+8))
- #38994 Qwen-3.5 9B often producing repetitive/garbled output with Intel Backend — bug — by AlexanderValentini (created: 2026-04-05 03:51 (UTC+8))
- #38986 [Bug]: Sync EPLB rearrangement hangs indefinitely with DP8 + EP on B200 — bug — by arpera (created: 2026-04-04 23:46 (UTC+8))
- #38982 [Bug]: Enabling cudagraph_mm_encoder results in ModuleNotFoundError — bug — by zzlol63 (created: 2026-04-04 20:25 (UTC+8))
- #38980 [Bug]: ModelOpt NVFP4 Qwen3-30B-A3B export fails to load on DGX Spark/GB10 (missing _double_scale key) — no labels — by DrHepa (created: 2026-04-04 19:30 (UTC+8))
- #38979 [Bug]: Regression in vllm 0.19.0 - The page size of the layer is not divisible by the maximum page size — bug — by outermeasure (created: 2026-04-04 19:12 (UTC+8))
- #38976 [Bug]: TimeoutError: RPC call to sample_tokens timed out. when pp is on under xpu env — bug — by zwh20081 (created: 2026-04-04 18:25 (UTC+8))
- #38974 [RFC]: Request to clarify attribution/authorship for the RISC-V CPU backend PR chain (#20292, #32405, #36538, #36578) — RFC,cpu — by ihb2032 (created: 2026-04-04 17:24 (UTC+8))
- #38972 [Bug]: Mistral Small 4 (119B MoE) fails to start on ROCm MI325X - two blocking issues — bug,rocm — by maincodeMax (created: 2026-04-04 15:31 (UTC+8))
- #38971 [Performance]: NVFP4 MoE on SM120: no env override to select backend (FLASHINFER_CUTLASS vs MARLIN) — performance — by mmeyer-datendo (created: 2026-04-04 15:00 (UTC+8))
- #38967 [Bug] vLLM >= 0.18.0 NCCL segfault (cuMemCreate) with TP>1 on RTX 4090 (SM 89) — no labels — by zhouliang5266 (created: 2026-04-04 13:16 (UTC+8))
Closed Issues
- #21236 [Bug]: Quantization does not lead to Throughput Speedup (Please Help) — performance,stale — by jvonrad (closed: 2026-04-05 10:17 (UTC+8))
- #37319 [RFC]: Extensible Per-Token Quantized KV Cache Scale Infrastructure — RFC — by JartX (closed: 2026-04-05 07:04 (UTC+8))
- #31564 [Bug]: Qwen3-VL-8B-Instruct has accuracy issue - Multi modal accuracy issue — bug,stale — by Dineshkumar-Anandan-ZS0367 (closed: 2026-04-04 22:11 (UTC+8))
- #38974 [RFC]: Request to clarify attribution/authorship for the RISC-V CPU backend PR chain (#20292, #32405, #36538, #36578) — RFC,cpu — by ihb2032 (closed: 2026-04-04 17:49 (UTC+8))
New PRs
- #39011 Update MusicFlamingo and add AudioFlamingoNext — documentation,new-model,multi-modality — by lashahub (created: 2026-04-05 11:40 (UTC+8))
- #39008 [Quant] Add TurboQuant 4-bit (tq4) KV cache quantization — v1 — by wizzense (created: 2026-04-05 10:45 (UTC+8))
- #38987 [Bugfix][Spec Decode] Fix extract_hidden_states for VLM models — bug,speculative-decoding,v1 — by abatilo (created: 2026-04-05 00:30 (UTC+8))
- #38984 [KV Offload] Add SLRU eviction policy for CPU offloading cache — v1 — by simpx (created: 2026-04-04 21:19 (UTC+8))
- #39009 [MoE] Move remaining PrepareAndFinalize to prepare finalize folder — no labels — by Jackmin801 (created: 2026-04-05 10:47 (UTC+8))
- #39007 [MoE] Move GPT OSS Triton kernel experts into fused_moe/experts/ — documentation,gpt-oss — by Jackmin801 (created: 2026-04-05 10:11 (UTC+8))
- #39003 [Frontend] Add /v1/files upload endpoint for multimodal inputs (#38531) — documentation,frontend,multi-modality — by Alberto-Codes (created: 2026-04-05 07:30 (UTC+8))
- #39006 Revert "[IR][RmsNorm] pass None if not has_weight" (#38961) — no labels — by vllm-agent (created: 2026-04-05 10:04 (UTC+8))
- #39005 [MoE] Move DEEP_GEMM into experts/ subdirectory — documentation,performance — by Jackmin801 (created: 2026-04-05 09:40 (UTC+8))
- #39001 [ROCm] Support unlimited sequence lengths via multi-pass reduction — rocm,v1 — by ekuznetsov139 (created: 2026-04-05 06:12 (UTC+8))
- #39002 [Bugfix] Fix FlashInfer crash with kv_cache_dtype_skip_layers — bug,v1,nvidia — by yzong-rh (created: 2026-04-05 06:50 (UTC+8))
- #38993 [Perf] Change Trtllm fp8 MoE to use Shuffled Weights and BlockMajorK Layout — ready,nvidia,ready-run-all-tests — by wzhao18 (created: 2026-04-05 03:11 (UTC+8))
- #38997 Fix broken import paths for encoder_cudagraph modules — ready,v1,qwen,nvidia — by Gregory-Pereira (created: 2026-04-05 05:05 (UTC+8))
- #38996 fixing qwen3 tool call parsing — tool-calling,qwen — by Gregory-Pereira (created: 2026-04-05 04:40 (UTC+8))
- #38992 Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters — tool-calling — by Gregory-Pereira (created: 2026-04-05 02:40 (UTC+8))
- #38990 [Bugfix][MoE] Fix 6-8% decode regression: prefer multi-stream shared expert overlap — bug,ready,ready-run-all-tests — by voipmonitor (created: 2026-04-05 01:59 (UTC+8))
- #38989 [Bug] Fix routing bias dtype for trtllm per-block fp8 moe — bug,nvidia — by wzhao18 (created: 2026-04-05 01:41 (UTC+8))
- #38985 [feat] Support modelopt_mixed for Turing and Ampere via Marlin — nvidia — by ir1ka (created: 2026-04-04 21:55 (UTC+8))
- #38978 [MTP] Apply shared_head.norm to DeepSeekMTP multi-step chaining — deepseek — by voipmonitor (created: 2026-04-04 19:09 (UTC+8))
- #38977 [BUILD] Enable compile_commands export for Ninja builds — ci/build — by harrisonGPU (created: 2026-04-04 18:50 (UTC+8))
- #38973 [Bugfix] Re-land: Fix anyOf/oneOf/$ref type resolution in Qwen3CoderToolParser and Qwen3XMLToolParser (#37831) — bug,tool-calling,qwen — by AAISSJ (created: 2026-04-04 16:02 (UTC+8))
- #38969 [Doc] Update multimodal inputs document about video metadata inputs — documentation — by Isotr0py (created: 2026-04-04 13:57 (UTC+8))
- #38964 fix(gdn): use pre-allocated buffer for CUDA graph compatibility — frontend,ci/build,v1,multi-modality,qwen,cpu,nvidia — by lingzhi227 (created: 2026-04-04 12:30 (UTC+8))
- #38963 meme (for avoidance of any doubt): how to get promo at AMD according to internal KPI dashboard: ROCm AMD HIP MI200 MI210 MI250 MI300 MI300X MI308 MI325X MI350 MI355X gfx90a gfx940 gfx941 gfx942 gfx1100 gfx1101 gfx1200 gfx1201 hipblaslt hipblas rocblas triton-rocm AITER Meta Facebook AWS Amazon Anthropic SemiAnalysis Microsoft Google Databricks Anyscale Oracle IBM Intel Samsung Hugging Face regression crash broken data corruption hang deadlock OOM segfault SIGABRT urgent critical hotfix failing attention flash_attn paged_attention quantization model_executor model_runner serving scheduler tensor_parallel pipeline_parallel distributed cuda_graph performance benchmark — rocm,intel-gpu,nvidia — by functionstackx (created: 2026-04-04 12:11 (UTC+8))
- #38998 Revert "[vLLM IR] gemma_rms_norm" — ready — by robertgshaw2-redhat (created: 2026-04-05 05:24 (UTC+8))
- #38970 [Bugfix][CPU] Fix macOS compatibility broken by #36487 — bug,ready,cpu,verified — by 2imi9 (created: 2026-04-04 14:43 (UTC+8))
- #38995 [Quantization] - Layerwise reloading of Attention/KV quantized models — no labels — by Josephasafg (created: 2026-04-05 04:21 (UTC+8))
- #38983 [Compile] enable allreduce_rms fuse for GemmaRMSNorm — no labels — by ZJY0516 (created: 2026-04-04 21:14 (UTC+8))
- #38975 fix(kv_cache): sync `_prob_scale_float` and `q_scale` fallback overwrite — no labels — by namgyu-youn (created: 2026-04-04 17:40 (UTC+8))
- #38981 [Perf][GDN] Align TMA usage with upstream FLA — ready — by arpera (created: 2026-04-04 19:53 (UTC+8))
- #38968 [Transformers v5] Fix tokenizer metadata in quantized LoRA tests — no labels — by SouthWest7 (created: 2026-04-04 13:48 (UTC+8))
- #38966 fix(gptq): auto-detect v1/v2 zero-point format from actual weights — no labels — by lingzhi227 (created: 2026-04-04 12:43 (UTC+8))
Merged PRs
- #32694 [Quantization][Deprecation] Remove Petit NVFP4 — rocm,ready,ci/build — by robertgshaw2-redhat (merged: 2026-04-05 08:07 (UTC+8))
- #38998 Revert "[vLLM IR] gemma_rms_norm" — ready — by robertgshaw2-redhat (merged: 2026-04-05 05:48 (UTC+8))
- #38970 [Bugfix][CPU] Fix macOS compatibility broken by #36487 — bug,ready,cpu,verified — by 2imi9 (merged: 2026-04-04 22:05 (UTC+8))
- #38955 Refactor Arctic loading to use AutoWeightsLoader — intel-gpu,ready — by lalit10 (merged: 2026-04-04 13:01 (UTC+8))
- #38780 [vLLM IR] gemma_rms_norm — ready,vllm-ir — by wxsIcey (merged: 2026-04-05 01:55 (UTC+8))
- #38981 [Perf][GDN] Align TMA usage with upstream FLA — ready — by arpera (merged: 2026-04-05 00:38 (UTC+8))
- #38961 [IR][RmsNorm] pass None if not has_weight — ready — by lk-chen (merged: 2026-04-04 23:02 (UTC+8))
PRs Closed Without Merging
- #31693 Fix TRTLLM decode assertion error when query lengths are non-uniform #31594 — v1,nvidia — by baonudesifeizhai (closed: 2026-04-05 10:31 (UTC+8))
- #38964 fix(gdn): use pre-allocated buffer for CUDA graph compatibility — frontend,ci/build,v1,multi-modality,qwen,cpu,nvidia — by lingzhi227 (closed: 2026-04-04 12:39 (UTC+8))
- #38963 meme (for avoidance of any doubt): how to get promo at AMD according to internal KPI dashboard: ROCm AMD HIP MI200 MI210 MI250 MI300 MI300X MI308 MI325X MI350 MI355X gfx90a gfx940 gfx941 gfx942 gfx1100 gfx1101 gfx1200 gfx1201 hipblaslt hipblas rocblas triton-rocm AITER Meta Facebook AWS Amazon Anthropic SemiAnalysis Microsoft Google Databricks Anyscale Oracle IBM Intel Samsung Hugging Face regression crash broken data corruption hang deadlock OOM segfault SIGABRT urgent critical hotfix failing attention flash_attn paged_attention quantization model_executor model_runner serving scheduler tensor_parallel pipeline_parallel distributed cuda_graph performance benchmark — rocm,intel-gpu,nvidia — by functionstackx (closed: 2026-04-04 12:22 (UTC+8))
- #38913 [NVIDIA] Update FlashInfer to version 0.6.7.post2. Avoid re-downloading BMM export headers when flashinfer-cubin is installed — ready,ci/build,nvidia,ready-run-all-tests — by johnnynunez (closed: 2026-04-05 04:24 (UTC+8))
- #38914 [ROCm] mi250x decode regression — rocm — by rlrs (closed: 2026-04-04 18:23 (UTC+8))