vLLM Development Activity Report - 2025-12-28

Time window: 2025-12-28 11:01 (UTC+8) ~ 2025-12-29 11:01 (UTC+8)
Statistics: New Issues 5 | Closed Issues 7 | New PRs 21 | Merged PRs 8 | PRs closed without merging 15

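📊 Daily Development Summary

During this observation window the vLLM project kept up an active pace: 21 new PRs, 5 new Issues, and 8 merged PRs. Work concentrated on multimodal model support, inference performance optimization (MoE kernel tuning in particular), and front-end API improvements. Compatibility and test stability for the AMD/ROCm ecosystem also continued to receive steady attention in the background.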

🎯 AMD/ROCm Ecosystem Activity

AMD/ROCm activity was high in this window, mainly in CI/CD test fixes, dependency migration, and core feature support.

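ROCm CI test fixes and dependency updates (PR #31327, #31323, #31441)
- Technical details:
  - PR #31327: Migrates the xgrammar dependency from a custom branch to the upstream PyPI release (0.1.29). This release includes a key fix for the warp (wavefront) size assumed by Triton kernels on AMD GPUs (64, versus 32 on NVIDIA), which resolves the out-of-resources error hit by structured output on MI300.
  - PR #31323: Adds a script and CI step that build a ROCm-compatible TorchCodec from source so that the audio transcription tests can run, since the library ships no prebuilt ROCm wheels.
  - PR #31441: Adds the perceptron library to the ROCm test dependencies so that the Isaac multimodal model test passes.
- Impact: Together these changes improve vLLM's test coverage and stability on AMD platforms, reduce reliance on custom branches, and move toward standard upstream packages, which helps long-term maintenance.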
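ROCm-specific test skips and infrastructure fixes (PR #31462, #31460)
- Technical details:
  - PR #31462: Skips the test test_silu_mul_fp8_quant_deep_gemm on ROCm, because it depends on DeepGemm, a CUDA-specific feature.
  - PR #31460: To fix the "NixlConnector PD accuracy tests" group on ROCm, replaces NCCL with ROCm's RCCL/RIXL in the Dockerfile and adds UCX support.
- Impact: These changes make the differences in low-level library support between ROCm and CUDA explicit, keep the CI pass rate up, and round out the distributed communication infrastructure on ROCm.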
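The platform-conditional skip described for PR #31462 can be sketched as follows. This is an illustration only, not the PR's code: `is_rocm` is a hypothetical helper, the test body is a placeholder, and the sketch uses the standard library's `unittest` rather than whatever test framework and platform check vLLM actually uses.

```python
import unittest


def is_rocm() -> bool:
    """Hypothetical helper: best-effort ROCm detection.

    Returns False when PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return False
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    return getattr(torch.version, "hip", None) is not None


class DeepGemmQuantTest(unittest.TestCase):
    @unittest.skipIf(is_rocm(), "DeepGemm is CUDA-specific; skipped on ROCm")
    def test_silu_mul_fp8_quant_deep_gemm(self) -> None:
        # Placeholder body: the real test exercises a DeepGemm-backed kernel.
        self.assertTrue(True)


# Run the single test programmatically so the outcome can be inspected.
suite = unittest.TestLoader().loadTestsFromTestCase(DeepGemmQuantTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

On a ROCm build the test is reported as skipped rather than failed, so the CI job stays green without hiding the test on CUDA.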
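Quark quantization refactor (PR #31461)
- Technical details: This is part 11 of the MoE layer refactor series and refactors the quark_moe layer specifically. The contributor notes that there are currently no tests for the Quark method and asks for guidance.
- Impact: The integration of AMD's Quark quantization tooling in the vLLM codebase is being modernized, but test coverage may still need work.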
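PyNCCL fallback improvement (PR #31459)
- Technical details: Fixes the all_gatherv method in CudaCommunicator, which had no torch.distributed fallback path when PyNCCL is unavailable (for example on AMD GPUs). Previously this caused an assertion error when combining several parallelism dimensions (TP/DP/EP) on AMD MI350x.
- Impact: This is an important fix for AMD platforms. It removes a key blocker for complex parallelism strategies and improves the deployability and robustness of vLLM on AMD hardware.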
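The fallback pattern behind PR #31459 can be sketched in plain Python. Everything here is a stand-in: `GatherFn`, `make_fake_backend`, and the function signature are hypothetical, and the "backends" are ordinary functions rather than PyNCCL or torch.distributed collectives. The point is only the control flow: use the fast path when it exists, otherwise fall back instead of asserting.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical type: a gather backend takes the local chunk and returns
# the chunks of every rank, in rank order.
GatherFn = Callable[[Sequence[int]], List[List[int]]]


def all_gatherv(
    local_chunk: Sequence[int],
    fast_backend: Optional[GatherFn],
    fallback_backend: GatherFn,
) -> List[int]:
    """Concatenate variable-length chunks from all ranks.

    Uses the fast backend when one is available; otherwise takes the
    fallback path rather than failing an assertion.
    """
    backend = fast_backend if fast_backend is not None else fallback_backend
    chunks = backend(local_chunk)
    return [item for chunk in chunks for item in chunk]


def make_fake_backend(other_ranks: List[List[int]]) -> GatherFn:
    """Stand-in for a real collective: the local chunk plus fixed peer chunks."""

    def gather(local_chunk: Sequence[int]) -> List[List[int]]:
        return [list(local_chunk)] + other_ranks

    return gather


# Simulate a platform where the fast backend is unavailable.
fallback = make_fake_backend([[3], [4, 5, 6]])
result = all_gatherv([1, 2], fast_backend=None, fallback_backend=fallback)
print(result)
```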

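💬 High-Traffic Discussions

Issue #30624: GPU assertion errors with Qwen3VL under async scheduling
- Core issue: Several users report that running Qwen3-VL family models (8B to 235B) with --async-scheduling enabled triggers CUDA assertion failures in masked_scatter_size_check or vectorized_gather_kernel. The failures are intermittent.
- Views and testing:
  - User feedback: bash99 found that the problem disappears when --async-scheduling is disabled, which points to async scheduling as the trigger.
  - Maintainer hypothesis: DarkLight1337 connected this to a known race condition on multimodal CPU tensors and suggested trying PR #31373, which fixes it.
  - Verification: bash99 then confirmed that PR #31373 resolves the crash with async scheduling enabled.
- Conclusion: The root cause was that an earlier fix for the async multimodal race condition had been reverted by accident. PR #31373 reapplies that fix and was merged in this window, which resolves the issue. The thread shows how the community worked together to locate and fix a complex, intermittent bug.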
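Issue #18619: Qwen2.5 performance regression from v0.8.4 to v0.8.5 (closed)
- Core issue: A user reports a throughput drop for Qwen2.5 models after upgrading vLLM (from ~60 requests/s to ~57 requests/s, roughly 5%).
- Dispute and investigation:
  - Original report: Includes detailed Nsight Systems profiling screenshots showing a change in the distribution of kernel execution time.
  - Community verification: Another contributor, Zerohertz, tried to reproduce the problem, but the measured throughput (~3 requests/s) was an order of magnitude away from the original report, which prompted a discussion about differences in test configuration.
  - Focus of discussion: The two sides disagreed substantially on the baseline numbers and never agreed on what the "performance regression" actually looked like. The discussion covered scheduling configuration, benchmarking method, and similar details.
- Current status: Because the issue is old and never reached a consistent, reproducible conclusion, it was closed as stale. It shows how much performance investigations depend on a standardized, reproducible benchmarking environment.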

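🔥 Hot Topics and Trends

Multimodal and vision-language model support: Several related PRs were opened (support for t5gemma-2 and Cogagent, a fix for Jina reranker multimodal input), and several of the closed CI Failure Issues also concerned multimodal model tests. Integrating a wider range of VLMs and keeping them stable is a current priority.
Deep inference performance optimization: Tuned Triton MoE kernel configurations for specific hardware (such as NVIDIA B200) and models (such as Qwen3-MoE, GLM-4.5/4.6) are being submitted and merged frequently (PR #31448, #31407), reflecting the community's push for maximum inference performance.
Front-end API and user experience: Two fix PRs (#31451, #31457) appeared in the same window for streaming tool call output; both aim to stop streaming responses from losing metadata such as the tool name. There were also Issues about Docker image pull rate limits and failures when loading local models, which shows attention to the production deployment experience.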
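The tool-name problem targeted by PR #31451 and #31457 can be illustrated with a small sketch. This is not vLLM's implementation: `build_tool_call_delta` and its parameters are hypothetical, and the dictionary layout only follows the general shape of OpenAI-style streaming tool call deltas (`index`, `id`, `type`, `function.name`, `function.arguments`). The idea is that the function name must still be emitted when the remaining argument text is empty.

```python
from typing import Any, Dict, Optional


def build_tool_call_delta(
    index: int,
    call_id: str,
    name: str,
    remaining_args: str,
    name_already_sent: bool,
) -> Optional[Dict[str, Any]]:
    """Build one streaming delta for a tool call (hypothetical helper).

    The name is included whenever it has not been sent yet, even if there
    are no argument characters left to stream.
    """
    function: Dict[str, str] = {}
    if not name_already_sent:
        function["name"] = name
    if remaining_args:
        function["arguments"] = remaining_args
    if not function:
        # Nothing new to report for this tool call.
        return None

    delta: Dict[str, Any] = {"index": index}
    if not name_already_sent:
        # Identification fields accompany the first delta of a tool call.
        delta["id"] = call_id
        delta["type"] = "function"
    delta["function"] = function
    return delta


first = build_tool_call_delta(0, "call_1", "get_weather", "", name_already_sent=False)
later = build_tool_call_delta(0, "call_1", "get_weather", '{"city": "Paris"}', name_already_sent=True)
print(first)
print(later)
```

A naive version that returns early whenever `remaining_args` is empty would drop the first delta here, and the client would never learn which tool was called.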

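🛠️ Key Technical Changes

PR #31461 ([MoE Refactor][11/N] refactor quark_moe layer): Part of the large-scale MoE layer refactor, this PR handles the MoE implementation for AMD Quark quantization. It matters because it pushes the codebase toward a more modern, unified structure, but the "no tests" concern raised by the contributor also points to verification work that is still needed.
PR #31373 ([BugFix] Re-fix async multimodal cpu tensor race condition): Fixes intermittent GPU assertion errors (such as Issue #30624) that occur with async scheduling enabled when multimodal models reuse CPU tensors without proper synchronization. It is essential for the stability of multimodal models under heavy-load, asynchronous inference.
Issue #31467 ([RFC]: A Triton operator dispatch mechanism through modified CustomOp): An important architecture proposal. The author points out that Triton kernels are currently invoked in scattered ways (through attention backends, through CustomOp, and by direct calls), that different hardware backends (such as ROCm and CUDA) need different implementations, and that there is no unified dispatch mechanism. The proposal is to create an abstract TritonOp layer that dispatches dynamically by hardware backend. This could lay the groundwork for better support of heterogeneous compute platforms in vLLM (including AMD NPUs and others).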
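The kind of per-backend dispatch the RFC describes can be sketched with a small registry. This is an assumption-laden illustration, not the RFC's design: the class reuses the `TritonOp` name from the proposal, but its methods, the platform strings, and the `"default"` fallback key are all hypothetical, and the registered functions are plain Python stand-ins for kernel launches.

```python
from typing import Callable, Dict, List, Tuple

# Each implementation returns a label for the backend it represents plus
# the computed values, so the dispatch decision is visible to the caller.
Impl = Callable[[List[float], float], Tuple[str, List[float]]]


class TritonOp:
    """Hypothetical dispatcher: one logical op, one implementation per platform."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._impls: Dict[str, Impl] = {}

    def register(self, platform: str) -> Callable[[Impl], Impl]:
        """Decorator that registers an implementation for a platform."""

        def decorator(fn: Impl) -> Impl:
            self._impls[platform] = fn
            return fn

        return decorator

    def __call__(self, platform: str, xs: List[float], factor: float):
        # Prefer the platform-specific implementation, else the default one.
        impl = self._impls.get(platform, self._impls.get("default"))
        if impl is None:
            raise NotImplementedError(f"{self.name}: no implementation for {platform!r}")
        return impl(xs, factor)


scale = TritonOp("scale")


@scale.register("default")
def _scale_default(xs: List[float], factor: float) -> Tuple[str, List[float]]:
    return "default", [x * factor for x in xs]


@scale.register("rocm")
def _scale_rocm(xs: List[float], factor: float) -> Tuple[str, List[float]]:
    # Stand-in for a launch with ROCm-specific tuning parameters.
    return "rocm", [x * factor for x in xs]


print(scale("rocm", [1.0, 2.0], 2.0))
print(scale("cuda", [1.0, 2.0], 2.0))
```

Call sites name only the logical op; adding a backend means registering one more implementation rather than editing every caller.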

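📈 Development Activity Observations

Contributors: AndreasKaratzas contributed heavily to AMD CI test fixes, reflecting sustained investment in ROCm platform stability. Jzz1943 submitted several tuning-configuration PRs in a row for B200 and Qwen3-MoE, showing a focus on performance tuning.
Code review and merging: 8 PRs were merged within one day, including important bug fixes (such as #31373, #31395) and AMD ecosystem updates (such as #31327, #31323). The core maintainers are keeping up an efficient review and merge cadence, so critical fixes reach the main branch promptly.
Collaboration pattern: The resolution of Issue #30624 clearly shows an efficient loop of user feedback, maintainers linking it to known problems, and PR verification, a typical example of how an open-source community solves a complex problem.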

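💡 Items Worth Watching

Issue #31467 (Triton operator dispatch mechanism RFC): An architecture discussion with potentially far-reaching effects. If adopted, it will change how Triton kernels are organized and invoked in vLLM, with the goal of better support for multiple hardware backends. Community feedback on the design and the final decision are worth following.
Issue #31454 (support for Qwen2.5 ModelOpt/GradNAS pruned checkpoints): The request involves reading model weights dynamically to infer a different intermediate size for each layer, in order to support checkpoints produced by advanced model compression techniques. It reflects demand for compatibility with newer model optimization tooling; once implemented it would make vLLM's model loading more flexible.
Missing tests raised in PR #31461: While refactoring the Quark MoE layer, the contributor explicitly noted the lack of corresponding unit tests. This suggests possible gaps in test coverage for features tied to specific hardware or toolchains, and it is a potential risk to the long-term reliability of AMD Quark quantization support.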
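Inferring per-layer intermediate sizes from weight shapes, as requested in Issue #31454, could look roughly like the sketch below. It is not taken from the issue: `infer_intermediate_sizes` is a hypothetical helper, the parameter naming (`model.layers.<i>.mlp.gate_proj.weight`) and the `[intermediate_size, hidden_size]` layout are assumed to follow the usual Hugging Face convention for Qwen2-style checkpoints, and the shapes are made up for illustration.

```python
import re
from typing import Dict, List, Tuple

# Assumed naming convention for the MLP gate projection of each layer.
_GATE_PROJ = re.compile(r"^model\.layers\.(\d+)\.mlp\.gate_proj\.weight$")


def infer_intermediate_sizes(shapes: Dict[str, Tuple[int, ...]]) -> List[int]:
    """Return the MLP intermediate size of every layer, in layer order.

    `shapes` maps parameter names to tensor shapes; the first dimension of
    each gate_proj weight is taken as that layer's intermediate size.
    """
    sizes: Dict[int, int] = {}
    for name, shape in shapes.items():
        match = _GATE_PROJ.match(name)
        if match:
            sizes[int(match.group(1))] = shape[0]

    layer_ids = sorted(sizes)
    if layer_ids != list(range(len(layer_ids))):
        raise ValueError(f"non-contiguous layer indices: {layer_ids}")
    return [sizes[i] for i in layer_ids]


# Made-up shapes for a pruned three-layer model with hidden size 1024.
example_shapes = {
    "model.layers.0.mlp.gate_proj.weight": (4096, 1024),
    "model.layers.1.mlp.gate_proj.weight": (3072, 1024),
    "model.layers.2.mlp.gate_proj.weight": (2048, 1024),
    "model.layers.0.self_attn.q_proj.weight": (1024, 1024),
}
print(infer_intermediate_sizes(example_shapes))
```

A loader built this way would size each layer's MLP from the checkpoint itself instead of from a single global `intermediate_size` value in the model config.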

📋 Appendix: Detailed Data

New Issues

#31467 [RFC]: A Triton operator dispatch mechanism through modified CustomOp | RFC | by MengqingCao (created: 2025-12-29 10:44 (UTC+8))
#31454 [Feature]: Support per-layer MLP sizes for Qwen2.5 ModelOpt/GradNAS pruned checkpoints | feature request | by CedricHwong (created: 2025-12-28 23:04 (UTC+8))
#31450 [Feature]: vLLM should apply to Docker Open Source Program for removing image pull limits | feature request | by ehfd (created: 2025-12-28 18:42 (UTC+8))
#31444 [Bug]: “No tokenizer file found in directory” is seen when serve model from local directory after upgrading vllm from 0.11.2 to 0.12 | bug | by nobodynobody123 (created: 2025-12-28 13:10 (UTC+8))
#31443 [Bug]: Tool name is lost in chat_completion_stream_generator | bug | by aabbccddwasd (created: 2025-12-28 12:15 (UTC+8))

Closed Issues

#7519 [Feature]: Context Parallelism | feature request,stale | by huseinzol05 (closed: 2025-12-29 10:18 (UTC+8))
#29536 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 | ci-failure | by AndreasKaratzas (closed: 2025-12-29 08:26 (UTC+8))
#29511 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 1 | ci-failure | by AndreasKaratzas (closed: 2025-12-29 08:26 (UTC+8))
#29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) | ci-failure | by AndreasKaratzas (closed: 2025-12-29 08:26 (UTC+8))
#18619 [Bug][PERF]: Qwen2.5 performance degradation 0.8.4 -> 0.8.5 | bug,stale,qwen | by vadiklyutiy (closed: 2025-12-29 01:13 (UTC+8))
#30624 [Bug]: masked_scatter_size_check failed when running Qwen3VLMoE | bug,qwen,nvidia | by soodoshll (closed: 2025-12-28 19:05 (UTC+8))
#31377 [Bug]: torch.ops._C.static_scaled_fp8_quant IMA error | bug | by BoyuanFeng (closed: 2025-12-28 11:20 (UTC+8))

New PRs

#31466 [CI/Build][CPU] Update CPU CI test cases | ready,ci/build,cpu | by bigPYJ1151 (created: 2025-12-29 10:22 (UTC+8))
#31465 fixed mypy warnings for files under vllm/v1/attention | rocm,v1,nvidia | by MrIceCreamMan (created: 2025-12-29 10:20 (UTC+8))
#31451 [BugFix] Preserve tool function name in delta messages when remaining arguments are empty | frontend | by luke396 (created: 2025-12-28 19:33 (UTC+8))
#31464 [Bugfix] Apply RMSNorm weight correction for Gemma2 GGUF models | no labels | by kitaekatt (created: 2025-12-29 10:15 (UTC+8))
#31446 Adding the support t5gemma-2 | documentation,new-model,rocm,ci/build,multi-modality | by akh64bit (created: 2025-12-28 15:18 (UTC+8))
#31445 [Bugfix][Frontend] Fix Jina reranker multimodal input compatibility | frontend,ready,multi-modality | by twjww (created: 2025-12-28 15:15 (UTC+8))
#31463 Add Cogagent Model to vllm | documentation,new-model,v1 | by JBurtn (created: 2025-12-29 09:35 (UTC+8))
#31462 [ROCm][CI] Skip DeepGemm-dependent test on ROCm platform | rocm | by AndreasKaratzas (created: 2025-12-29 08:50 (UTC+8))
#31461 [MoE Refactor][11/N] refactor quark_moe layer | no labels | by raayandhar (created: 2025-12-29 05:52 (UTC+8))
#31460 [CI]Test Group ‘NixlConnector PD accuracy tests’ is fixed | documentation,rocm,ci/build,kv-connector | by qli88 (created: 2025-12-29 05:23 (UTC+8))
#31457 Fix missing tool call fields in streaming responses | frontend | by ssam18 (created: 2025-12-29 02:47 (UTC+8))
#31459 Add torch.distributed fallback for all_gatherv when PyNCCL unavailable | nvidia,meta-exported,fb-exported | by iseeyuan (created: 2025-12-29 03:03 (UTC+8))
#31458 Vox streaming on top | frontend,tpu,needs-rebase,v1 | by patrickvonplaten (created: 2025-12-29 03:02 (UTC+8))
#31456 Streaming vox support | frontend,tpu,v1 | by patrickvonplaten (created: 2025-12-29 02:18 (UTC+8))
#31455 [Bugfix] Fix EPLB state logging error | no labels | by tlrmchlsmth (created: 2025-12-29 01:46 (UTC+8))
#31453 [BugFix] add select_gemm_impl on CompressedTensorsWNA16MoEMethod to support LoRA | no labels | by JartX (created: 2025-12-28 21:15 (UTC+8))
#31448 [Model] Add tuned triton fused_moe configs for Qwen3Moe on B200 | qwen | by Jzz1943 (created: 2025-12-28 17:20 (UTC+8))
#31452 [LoRA] Add lora_target_regex to selectively apply LoRA layers | documentation | by laohyx (created: 2025-12-28 20:49 (UTC+8))
#31449 fix no think of GLM-4.5 / GLM-4.7 | no labels | by zRzRzRzRzRzRzR (created: 2025-12-28 18:23 (UTC+8))
#31447 [Model] Add tuned triton fused_moe configs for Qwen3Moe on B200 | qwen | by Jzz1943 (created: 2025-12-28 16:27 (UTC+8))
#31442 [Model] Add tuned triton fused_moe configs for Qwen3Moe on B200 | qwen | by Jzz1943 (created: 2025-12-28 12:04 (UTC+8))

Merged PRs

#31327 [ROCm] Migrate xgrammar to upstream release | rocm,ci/build,ready-run-all-tests | by AndreasKaratzas (merged: 2025-12-28 16:08 (UTC+8))
#31323 [ROCm][CI] Add TorchCodec source build for transcription tests | rocm,ready,ci/build | by AndreasKaratzas (merged: 2025-12-28 16:06 (UTC+8))
#31407 Add Fused MoE Triton kernels for GLM-4.5-Air, GLM-4.5v, GLM-4.6v on 2x RTX Pro 6000 | performance,ready | by mratsim (merged: 2025-12-29 00:38 (UTC+8))
#31448 [Model] Add tuned triton fused_moe configs for Qwen3Moe on B200 | qwen | by Jzz1943 (merged: 2025-12-29 00:38 (UTC+8))
#31373 [BugFix] Re-fix async multimodal cpu tensor race condition | ready,v1 | by njhill (merged: 2025-12-28 19:05 (UTC+8))
#31441 [ROCm][CI] Added perceptron lib in requirements for isaac multi-modal test | rocm,ready,ci/build | by AndreasKaratzas (merged: 2025-12-28 12:15 (UTC+8))
#31395 [BugFix] register quant scale tensors as buffer | ready | by BoyuanFeng (merged: 2025-12-28 11:20 (UTC+8))
#31385 add tip for VLLM_USE_PRECOMPILED arg to reduce docker build time | documentation,ready | by yitingdc (merged: 2025-12-28 11:19 (UTC+8))

PRs Closed Without Merging

#31456 Streaming vox support | frontend,tpu,v1 | by patrickvonplaten (closed: 2025-12-29 02:55 (UTC+8))
#31392 [Draft] More streaming vox | frontend,tpu,needs-rebase,v1 | by patrickvonplaten (closed: 2025-12-29 02:18 (UTC+8))
#31412 [Core] Sort block IDs at I/O layer for contiguous memory access | v1 | by majiayu000 (closed: 2025-12-28 14:59 (UTC+8))
#31224 [Bugfix][Frontend] Fix Jina reranker multimodal input compatibility | documentation,performance,new-model,rocm,structured-output,frontend,ready,ci/build,v1,multi-modality | by twjww (closed: 2025-12-28 14:13 (UTC+8))
#31447 [Model] Add tuned triton fused_moe configs for Qwen3Moe on B200 | qwen | by Jzz1943 (closed: 2025-12-28 16:45 (UTC+8))
#31442 [Model] Add tuned triton fused_moe configs for Qwen3Moe on B200 | qwen | by Jzz1943 (closed: 2025-12-28 15:55 (UTC+8))
#31420 [Cleanup] Replace generic Exception with specific types (part 2) | v1 | by yurekami (closed: 2025-12-28 14:47 (UTC+8))
#31419 [Cleanup] Replace generic Exception with ValueError in quant utils | no labels | by yurekami (closed: 2025-12-28 14:47 (UTC+8))
#31421 [Cleanup] Add descriptive messages to empty exceptions | no labels | by yurekami (closed: 2025-12-28 14:47 (UTC+8))
#31423 [UX] Improve DBO/microbatching error message for unsupported backends | no labels | by yurekami (closed: 2025-12-28 14:46 (UTC+8))
#31425 [Cleanup] Replace generic Exception with specific types | frontend,v1 | by yurekami (closed: 2025-12-28 14:45 (UTC+8))
#31435 Add descriptive error messages to IPEX ops assertions | no labels | by yurekami (closed: 2025-12-28 14:41 (UTC+8))
#31429 Add descriptive error messages to bare asserts in forward_context.py | no labels | by yurekami (closed: 2025-12-28 14:41 (UTC+8))
#31433 Add explicit warning categories to warnings.warn() calls | no labels | by yurekami (closed: 2025-12-28 14:40 (UTC+8))
#31434 Add return type annotations to post_init methods | no labels | by yurekami (closed: 2025-12-28 14:40 (UTC+8))