vLLM Development Activity Report - 2026-01-25

Time window: 2026-01-25 11:08 (UTC+8) ~ 2026-01-26 11:08 (UTC+8)
Statistics: New issues 11 | Closed issues 20 | New PRs 25 | Merged PRs 7 | PRs closed without merging 11

📊 Daily Development Summary

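In the 24 hours from January 25 to 26, 2026, the vLLM project stayed active: 25 PRs and 11 issues were opened, 7 PRs were merged, and 20 issues were closed. Work concentrated on three areas: broader model support (the Qwen family, GLM, Hunyuan), performance and kernel tuning (notably for specific hardware such as Blackwell and ROCm), and fixes for critical defects in the V1 engine and in multimodal models. Continuous integration (CI) stability also drew considerable attention and several fixes.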

🎯 AMD/ROCm Ecosystem Updates

This cycle saw clear AMD-related activity, mainly feature enhancements and bug fixes on the ROCm platform.

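PR #33047: [W8A8 Block Linear Refactor][1/N] Extract Input Quantization Kernels into Modular Architecture
- Overview: This PR refactors the FP8 input quantization implementation into a modular, extensible kernel architecture, in preparation for simplifying W8A8BlockFp8LinearOp and moving to the new IR.
- AMD relevance: The design creates a dedicated kernel implementation module (aiter.py) for the ROCm/AITER platform. The new kernel selection system specifies a priority chain for ROCm: AiterInputQuantKernel → CudaInputQuantKernel → TritonInputQuantKernel → PytorchInputQuantKernel. This suggests the team is systematically optimizing the quantization compute path for AMD hardware.
- Technical impact: Modular separation and an explicit fallback chain make the code easier to maintain and reason about across CUDA and ROCm, and should help with finer-grained performance tuning for AMD GPUs (such as the MI300 series) later on.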
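The priority chain described for PR #33047 can be pictured as a first-match lookup over an ordered list. The following is only a minimal sketch of that idea, not the PR's code: `KernelCandidate`, `select_kernel`, and the hard-coded availability checks are hypothetical names introduced for illustration, and only the four kernel names come from the report.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class KernelCandidate:
    """Hypothetical record pairing a kernel name with an availability check."""
    name: str
    is_supported: Callable[[], bool]


def select_kernel(chain: Sequence[KernelCandidate]) -> str:
    """Return the name of the first supported kernel in priority order."""
    for candidate in chain:
        if candidate.is_supported():
            return candidate.name
    raise RuntimeError("no input quantization kernel is available")


# Order taken from the report; the availability results are made up
# for this example (pretend AITER and the CUDA path are unavailable).
ROCM_CHAIN = [
    KernelCandidate("AiterInputQuantKernel", lambda: False),
    KernelCandidate("CudaInputQuantKernel", lambda: False),
    KernelCandidate("TritonInputQuantKernel", lambda: True),
    KernelCandidate("PytorchInputQuantKernel", lambda: True),
]

print(select_kernel(ROCM_CHAIN))
```

Keeping the chain as plain data means a platform only needs to supply a different ordered list, while the selection logic stays shared.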
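PR #33018: [ROCm][Bugfix] Fix ptpc scale load issue for fused shared expert path in deepseek mtp
- Overview: Fixes a model loading failure on ROCm that occurs when MTP (multi-token prediction) is enabled together with the VLLM_ROCM_USE_AITER_FUSED_SHARED_EXPERTS environment variable.
- AMD relevance: A direct fix for a compatibility bug in the ROCm backend when handling the MTP path of DeepSeek models. The contributor, ganyi1996ppo, has no "-amd" suffix but focuses on ROCm fixes.
- Technical impact: ROCm users running the AITER fused kernels can load and run DeepSeek models that use MTP, which strengthens AMD platform support for recent model architectures.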
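PR #33043: [rocm][aiter] add env var VLLM_ROCM_USE_AITER_SAMPLING
- Overview: Adds the environment variable VLLM_ROCM_USE_AITER_SAMPLING, which controls whether the ROCm platform uses the sampling op from AITER (AMD's optimized kernels).
- AMD relevance: One more switch in the family of controls for AMD-specific kernels. The PR description notes that this op previously introduced an accuracy problem (#32413); the new switch lets users fall back to another implementation.
- Technical impact: Users gain flexibility and a way to limit risk when a new kernel may be unstable, which protects service reliability. It also reflects attention to the day-to-day operational experience of AMD users.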
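An environment-variable switch of this kind usually comes down to reading a boolean flag and branching on it. The sketch below shows that general pattern under stated assumptions: `env_flag` and `pick_sampler` are hypothetical helpers, the accepted values ("1", "true") and the default of enabled are assumptions, and none of this is taken from the PR itself. Only the variable name comes from the report.

```python
import os
from typing import Mapping


def env_flag(name: str, default: bool,
             environ: Mapping[str, str] = os.environ) -> bool:
    """Read a boolean flag; unset means `default` (accepted values assumed)."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true")


def pick_sampler(environ: Mapping[str, str] = os.environ) -> str:
    """Choose a sampling implementation; the default of True is an assumption."""
    if env_flag("VLLM_ROCM_USE_AITER_SAMPLING", True, environ):
        return "aiter"
    return "fallback"


# Disabling the flag routes sampling to the non-AITER path.
print(pick_sampler({"VLLM_ROCM_USE_AITER_SAMPLING": "0"}))
```

Passing the environment in as a mapping keeps the helper easy to test without mutating the real process environment.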

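Summary: AMD-related updates in this cycle centered on refactoring the kernel architecture to better support ROCm, fixing ROCm compatibility problems in advanced features such as MTP (multi-token prediction), and giving users finer control over AMD-specific kernels. Nothing new appeared that relates directly to the Quark quantization tool or to MI300, but continued optimization and bug fixing of the low-level ROCm kernels is a clear direction of work.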

💬 Most-Discussed Threads

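PR #32772: [Model] Use mm_position to compute mrope positions for Qwen2.5-Omni (17 comments)
- Core topic: Changes how mrope (multi-dimensional RoPE) positions are computed for the Qwen2.5-Omni multimodal model, replacing a per-token loop with direct computation based on mm_position.offset, in order to fix crashes on main under certain configurations (such as use_audio_in_video=True).
- Views and discussion:
  - Contributor (Etelis): Proposed the new offset-based algorithm and validated it over several rounds of testing (single and multiple audio inputs, images, and other modalities).
  - Reviewers (DarkLight1337, Isotr0py): Focused on the correctness of the refactor, asked for a detailed output comparison against main, and tested several modality combinations. They confirmed that the new implementation fixes the crash and matches main's output in the cases that main can run.
- Point of contention: No major disagreement; the discussion consisted mainly of rigorous code review and thorough regression testing.
- Final status: Approved and merged after thorough testing and review. It resolves the problem under the affected configurations and simplifies the computation.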
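To give a feel for what "computing positions from offsets" means, here is a deliberately simplified sketch. It is not the Qwen2.5-Omni logic from the PR: it handles only image-like spans, assumes each span is described by a hypothetical `(offset, grid_h, grid_w)` tuple, and follows the commonly described mrope convention in which text tokens share one position across all three axes and the position after a span resumes at the span's start plus `max(grid_h, grid_w)`.

```python
from typing import List, Sequence, Tuple


def mrope_positions(
    seq_len: int, spans: Sequence[Tuple[int, int, int]]
) -> Tuple[List[int], List[int], List[int]]:
    """Build (t, h, w) position lists from sorted, non-overlapping spans.

    Each span is (offset, grid_h, grid_w) and covers grid_h * grid_w tokens.
    Text segments are filled in bulk from the span offsets rather than by
    inspecting every token.
    """
    t: List[int] = []
    h: List[int] = []
    w: List[int] = []
    cursor = 0    # index of the next token to place
    next_pos = 0  # next free position value

    def add_text(count: int) -> None:
        nonlocal next_pos
        run = list(range(next_pos, next_pos + count))
        t.extend(run)
        h.extend(run)
        w.extend(run)
        next_pos += count

    for offset, grid_h, grid_w in spans:
        add_text(offset - cursor)  # text before this span
        for row in range(grid_h):
            for col in range(grid_w):
                t.append(next_pos)
                h.append(next_pos + row)
                w.append(next_pos + col)
        next_pos += max(grid_h, grid_w)
        cursor = offset + grid_h * grid_w

    add_text(seq_len - cursor)  # trailing text
    return t, h, w


# Two text tokens, a 2x2 image span starting at index 2, one trailing token.
print(mrope_positions(7, [(2, 2, 2)]))
```

The real model also handles audio and interleaved audio-in-video inputs, which this sketch leaves out entirely.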
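PR #32969: [Bugfix][VLM] Fix transformers backend embed_multimodal for Qwen2.5-VL profiling (11 comments)
- Core topic: Fixes a crash in the memory profiling stage for Qwen2.5-VL with the transformers backend, caused by a mismatch in embedding tensor dimensions.
- Views and discussion:
  - Contributor (AndreasKaratzas): The initial fix did not account for differences in the output embedding shape across vision models (for example Idefics3 versus Qwen2.5-VL) and therefore introduced a new bug.
  - Reviewer (Isotr0py): Pointed out promptly that the initial fix broke Idefics3, which pushed the work toward a more complete solution.
- Point of contention: How to design a general fix that handles the variety of embedding shapes different multimodal models produce during profiling (for example [num_patches, tokens_per_patch, dim] versus [total_tokens, dim]).
- Final status: The contributor submitted a v2 that detects and adapts to the embedding shape, fixing Qwen2.5-VL while staying compatible with Idefics3 and other models. It was merged.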
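The shape difference at the heart of PR #32969 can be illustrated with a small normalizer that accepts either layout and always returns [total_tokens, dim]. This is only a sketch of the general idea, not the merged fix: `flatten_mm_embeddings` is a hypothetical helper, and it works on nested Python lists rather than tensors so that it runs without extra dependencies.

```python
from typing import List, Sequence


def flatten_mm_embeddings(emb: Sequence) -> List[List[float]]:
    """Normalize embeddings to a 2-D [total_tokens][dim] nested list.

    Accepts either [num_patches][tokens_per_patch][dim] or
    [total_tokens][dim]; the layout is detected from the nesting depth.
    """
    if not emb:
        return []
    first = emb[0]
    is_three_level = bool(first) and isinstance(first[0], (list, tuple))
    if is_three_level:
        # Merge the patch and token axes into one token axis.
        return [list(token) for patch in emb for token in patch]
    return [list(token) for token in emb]


patched = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]  # 2 x 2 x 2
flat = [[1.0, 2.0], [3.0, 4.0]]                                  # 2 x 2
print(len(flatten_mm_embeddings(patched)), len(flatten_mm_embeddings(flat)))
```

With tensors the same step would typically be a reshape that merges the leading two axes, applied only when the input has three dimensions.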
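Issue #33017: [Bug]: /v1/responses endpoint crashes with 'NoneType has no attribute startswith' when input contains function_call items (4 comments)
- Core topic: The OpenAI Responses API endpoint crashes when the request history contains a function_call item whose id field is None.
- Views and discussion:
  - Reporter (fugarty): Described the problem in detail, quickly traced the root cause (the id of a ResponseFunctionToolCall object can be None), and provided a code fix along with a command for hot-patching inside Docker.
  - Maintainer (chaunceyjiang): Asked for the serve command in order to reproduce the environment, and ultimately confirmed that the problem had already been fixed by PR #31999.
- Current status: The issue is still open, but the root cause and the fix are both clear and point to an existing PR. The thread is a good example of a community user debugging a problem independently.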
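The failure mode in Issue #33017 is the usual one of calling a string method on an optional field. A minimal sketch of the guard follows; `FunctionCallItem` is a hypothetical stand-in for ResponseFunctionToolCall, the "fc_" prefix is only an illustrative value, and this is not the code from PR #31999.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionCallItem:
    """Hypothetical stand-in for a function_call item with an optional id."""
    name: str
    id: Optional[str] = None


def id_has_prefix(item: FunctionCallItem, prefix: str) -> bool:
    """Check the id prefix without assuming the id is present."""
    # Calling item.id.startswith(...) directly raises AttributeError
    # when id is None, which is the crash described in the issue.
    return item.id is not None and item.id.startswith(prefix)


print(id_has_prefix(FunctionCallItem("get_weather"), "fc_"))
print(id_has_prefix(FunctionCallItem("get_weather", id="fc_123"), "fc_"))
```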

🔥 Hot Topics and Trends

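Model support and new features:
- Continued expansion: Issues request support for Qwen3 TTS (#33051) and an mrope optimization for GLM-4V (#33039), and a PR adds Eagle3 speculative decoding support for Hunyuan models (#33035).
- Feature evolution: Integrating layerwise reloading with quantized models for scenarios such as RL was formally proposed as a roadmap for future work (#33038).
Performance optimization and hardware adaptation:
- Hardware-specific problems: A hang with TP=2 on Blackwell GPUs (GB10/B200) under CUDA 13.0 and a specific NCCL version was reported (#33041), which highlights the teething problems of a new hardware ecosystem.
- Automated kernel tuning: A request was filed for an auto-tuning script and config system for Mamba's selective_state_update kernel, modeled on the one for Fused MoE (#33034). A contributor claimed it immediately, which indicates that the kernel tuning infrastructure is spreading to more operators.
Testing and CI stability:
- Failures across several modules: CI saw clustered failures in the MoE integration tests (#33029), the attention kernel tests (#33027), and the multimodal model tests (#33028).
- Fast fixes: The related fix PRs (#33030, #33033) were opened quickly and marked "ready", showing how much weight the team puts on CI stability and how fast it responds.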

🛠️ Key Technical Changes

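PR #33046: [Model Runner V2] Fix slot_mapping after #25954 (merged)
- Technical notes: Fixes a bug introduced by PR #25954 (which split the FlashAttn attention computation from the cache update). The bug caused Model Runner V2 to crash with CUDA Graph enabled, because the block_tables attribute was missing and attn_metadata was not in the expected format.
- Impact: Keeps a performance-oriented change from breaking Model Runner V2, and preserves compatibility between components as the V1 engine evolves.
PR #33018: [ROCm][Bugfix] Fix ptpc scale load issue for fused shared expert path in deepseek mtp (in progress)
- Technical notes: Fixes a model weight loading problem on one specific ROCm path (AITER fused shared experts with MTP enabled).
- Impact: Makes AMD platform support for the MTP (multi-token prediction) path of a complex MoE model (DeepSeek) more complete and reliable.
PR #33019: [MoE][Fix] Fix PPLX CUTLASS FP8 incorrect output with apply_router_weight_on_input (in progress)
- Technical notes: Fixes incorrect output when the PPLX all2all backend and CUTLASS FP8 quantization are used with a model configured for apply_router_weight_on_input (such as Llama4). The cause was a repeat operation wrongly applied to the scale tensor for per-token quantization (per_act_token_quant).
- Impact: Resolves a severe accuracy drop in MoE computation under this configuration and restores the correctness of FP8 quantization on complex MoE models.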

📈 Development Activity Observations

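Fast merges: 7 PRs were merged within 24 hours, including important bug fixes (such as #33046, #32684, #32836) and feature improvements.
Issue cleanup: 20 issues were closed, most of them stale issues with no recent activity, which keeps the issue tracker tidy.
Contributor distribution: Core maintainers (WoosukKwon, robertgshaw2-redhat, and others) were very active on critical bugs and CI fixes. Quality contributions from the community (such as fugarty and Etelis) and from partners (such as AMD-related contributors) were also readily accepted, which points to a healthy project ecosystem.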

💡 Issues Worth Watching

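Issue #33041: [Bug]: vLLM hangs after NCCL init with TP=2 on Blackwell GPUs: This issue involves a deadlock on the latest Blackwell-architecture GPUs with the specific combination of CUDA 13.0 and NCCL 2.27.7. It may affect early Blackwell adopters, so the root cause and the eventual fix are worth following.
Issue #33038: [Feature]: Integrate layerwise reloading with other vLLM loading features: This issue lays out a roadmap for integrating layerwise reloading with other advanced features such as attention/MLA quantization and EPLB (expert parallel load balancing). It matters for emerging use cases such as efficient RL training with quantized models, and marks the continued evolution of a core piece of infrastructure.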

📋 Appendix: Detailed Data

New Issues

#33051 [Feature]: Will it support Qwen3 TTS | feature request | by lovetian1991 (created: 2026-01-26 10:59 (UTC+8))
#33050 [Bug]: [DeepSeek-V3.2] PD Can't instantiate abstract class DeepseekV32IndexerBackend without an implementation for abstract method 'get_impl_cls' | bug | by chaunceyjiang (created: 2026-01-26 10:32 (UTC+8))
#33017 [Bug] /v1/responses endpoint crashes with 'NoneType has no attribute startswith' when input contains function_call items | no labels | by fugarty (created: 2026-01-25 12:39 (UTC+8))
#33041 [Bug]: vLLM hangs after NCCL init with TP=2 on Blackwell GPUs (CUDA 13.0, NCCL 2.27.7) | bug | by robertjmcintyre (created: 2026-01-26 04:59 (UTC+8))
#33034 [Feature][Help Wanted]: Add tuning script and config files for Mamba selective_state_update kernel | help wanted,good first issue,feature request | by danisereb (created: 2026-01-26 00:23 (UTC+8))
#33040 vLLM initialization failing on yggdrasil only | no labels | by ivnle (created: 2026-01-26 04:45 (UTC+8))
#33038 [Feature]: Integrate layerwise reloading with other vLLM loading features | feature request | by kylesayrs (created: 2026-01-26 04:10 (UTC+8))
#33029 [CI Failure]: MoE Integration Tests | ci-failure | by robertgshaw2-redhat (created: 2026-01-25 22:42 (UTC+8))
#33027 [CI Failure]: Kernels Attention | ci-failure | by robertgshaw2-redhat (created: 2026-01-25 22:38 (UTC+8))
#33026 [Feature]: Support fused silu_mul and block-wise quantization Triton kernel | feature request | by weimin023 (created: 2026-01-25 22:12 (UTC+8))
#33028 [CI Failure]: MultiModal Tests | ci-failure | by robertgshaw2-redhat (created: 2026-01-25 22:39 (UTC+8))

Closed Issues

#33003 [Bug]: Model Runner V2 broken CUDA Graph after kvcache update split(#25954) | bug | by izhuhaoran (closed: 2026-01-26 10:30 (UTC+8))
#22162 [Usage]: multi-lora for vision language model | usage,stale | by PizzaSnow (closed: 2026-01-26 10:16 (UTC+8))
#22239 [Performance]: vllm v0.10.0 seems to be much slower than vllm v0.8.5 when using Qwen3-30B-A3B-int4 | performance,stale | by alanayu (closed: 2026-01-26 10:16 (UTC+8))
#22284 [Bug] [gpt-oss-20b] [Responses API]: Could not parse header: too many tokens remaining after extracting content-type and recipient | bug,stale | by cadedaniel (closed: 2026-01-26 10:16 (UTC+8))
#22885 [Bug]: tool_calls content loss } when use hermes_tool_parser | bug,stale,tool-calling | by 394988736 (closed: 2026-01-26 10:16 (UTC+8))
#23935 [Bug]: No platform detected, vLLM is running on UnspecifiedPlatform in Docker with Kubernetes, Nvidia L4 | bug,stale | by a1expol (closed: 2026-01-26 10:16 (UTC+8))
#24112 [RFC]: Improve MoE triton kernel tuning | RFC,stale | by jeejeelee (closed: 2026-01-26 10:16 (UTC+8))
#24484 [RFC]: Enhancing the Pluggable Scheduler with a Workload-Aware Adaptive Policy | RFC,stale | by sidikbro (closed: 2026-01-26 10:16 (UTC+8))
#24779 [CI]: Each test has about 4-5min of setup time | feature request,stale | by afeldman-nm (closed: 2026-01-26 10:15 (UTC+8))
#24916 [Feature]: Make FP8 Attention fast for GPT-OSS w/ FA3 on Hopper | feature request,stale | by jmkuebler (closed: 2026-01-26 10:15 (UTC+8))
#25004 [CI]: Test container layer caching optimization | feature request,stale | by njhill (closed: 2026-01-26 10:15 (UTC+8))
#25108 [Bug]: vLLM chat breaks during multi-turn chat | bug,stale | by RobotSail (closed: 2026-01-26 10:15 (UTC+8))
#25482 [Bug]: V1 scheduler is crashing in concurrent scenarios | bug,stale | by jaluma (closed: 2026-01-26 10:15 (UTC+8))
#25539 [Bug]:When there is an enum type with Chinese values in the response_format schema, a decoding error occurs during output. | bug,stale | by tianqihou (closed: 2026-01-26 10:15 (UTC+8))
#31588 [Bug]: vLLM SM 12.1 (Blackwell GB10) V1 Engine Bug Report (Relates to: #28589, #31128, #28621, #27679) | bug | by ohsono (closed: 2026-01-26 05:06 (UTC+8))
#33040 vLLM initialization failing on yggdrasil only | no labels | by ivnle (closed: 2026-01-26 04:45 (UTC+8))
#33009 [Installation]: vLLM v14.0 Build Error with Official Docker Image | installation | by weimin023 (closed: 2026-01-25 21:16 (UTC+8))
#30872 [Bug]: Triton kernel failing for LoRA on SM100 | bug | by josefdra (closed: 2026-01-25 16:32 (UTC+8))
#27655 [Bug]: 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace' | bug | by peakcrosser7 (closed: 2026-01-25 16:15 (UTC+8))
#32680 [Bug]: hanging durling long video understanding | bug | by JJJYmmm (closed: 2026-01-25 13:17 (UTC+8))

New PRs

#33047 [W8A8 Block Linear Refactor][1/N] Extract Input Quantization Kernels into Modular Architecture | needs-rebase,nvidia | by maralbahari (created: 2026-01-26 10:18 (UTC+8))
#33033 [CI] Fix MHA attention test failure (AttributeError when model_config is None in ViT attention backend) | ready | by LucasWilkinson (created: 2026-01-25 23:55 (UTC+8))
#33048 [Model Runner V2] Minor simplification for finish_requests | v1 | by WoosukKwon (created: 2026-01-26 10:28 (UTC+8))
#33049 [MoE Refactor] DefaultMoERunner simplifcation | documentation,v1 | by bnellnm (created: 2026-01-26 10:29 (UTC+8))
#33046 [Model Runner V2] Fix slot_mapping after #25954 | v1,nvidia | by WoosukKwon (created: 2026-01-26 10:13 (UTC+8))
#33031 [CI] fix attention test | ready,ci/build | by ZJY0516 (created: 2026-01-25 23:17 (UTC+8))
#33018 [ROCm][Bugfix] Fix ptpc scale load issue for fused shared expert path in deepseek mtp | bug,rocm,deepseek | by ganyi1996ppo (created: 2026-01-25 13:03 (UTC+8))
#33045 Fix UnembedMetrics FLOP overcounting for prefill | v1,meta-exported,fb-exported | by omkhalil (created: 2026-01-26 10:08 (UTC+8))
#33044 [Bugfix] Fix Intern / Radio Pooling Tests | bug | by alex-jw-brooks (created: 2026-01-26 08:48 (UTC+8))
#33039 Refactor/glm4xv mrope | documentation | by KKSK-DON (created: 2026-01-26 04:30 (UTC+8))
#33043 [rocm][aiter] add env var VLLM_ROCM_USE_AITER_SAMPLING | rocm,v1 | by yuguo68 (created: 2026-01-26 07:19 (UTC+8))
#33036 [Model] Fix Qwen3-VL load_weights to skip lm_head when tie_word_embeddings is True | qwen | by jeremyteboul (created: 2026-01-26 02:58 (UTC+8))
#33042 [WIP] [Voxtral] Streaming example | documentation | by patrickvonplaten (created: 2026-01-26 05:53 (UTC+8))
#33037 [BugFix] Fix P/D with non-MoE DP | bug,v1,kv-connector | by njhill (created: 2026-01-26 03:26 (UTC+8))
#33032 [Tests] Remove Duplicates | ready,nvidia | by robertgshaw2-redhat (created: 2026-01-25 23:20 (UTC+8))
#33022 [Kernel] Apply 256bit LDG/STG To Activation Kernels | no labels | by AstroVoyager7 (created: 2026-01-25 16:17 (UTC+8))
#33035 feature: support eagle3 for HunyuanVL & Hunyuan | speculative-decoding,v1 | by irisliu10 (created: 2026-01-26 00:50 (UTC+8))
#33019 [MoE][Fix] Fix PPLX CUTLASS FP8 incorrect output with apply_router_weight_on_input | nvidia | by thc1006 (created: 2026-01-25 13:10 (UTC+8))
#33030 [Bugfix] Fix Dtypes for Pynccl Wrapper | bug,ready,nvidia | by robertgshaw2-redhat (created: 2026-01-25 22:52 (UTC+8))
#33020 [Bugfix] Fix display error (inconsistent with context) | bug,ready | by lingebeng (created: 2026-01-25 15:06 (UTC+8))
#33025 merge | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 | by aidendle94 (created: 2026-01-25 20:11 (UTC+8))
#33024 [V1][Hybrid] Support spec decode and optimize block-aligned split in mamba cache align mode | v1 | by peakcrosser7 (created: 2026-01-25 17:09 (UTC+8))
#33016 [Doc] Add Qwen2.5 models to batch invariance tested models | documentation,ready,qwen | by ZhanqiuHu (created: 2026-01-25 11:45 (UTC+8))
#33021 fix: Resolve kv_cache_dtype='auto' to actual dtype and log it (#32843) | no labels | by ItzDEXX (created: 2026-01-25 16:08 (UTC+8))
#33023 refactor: unify FlashInfer utils into vllm.utils.flashinfer | nvidia | by puranikyashaswin (created: 2026-01-25 16:19 (UTC+8))

Merged PRs

#33048 [Model Runner V2] Minor simplification for finish_requests | v1 | by WoosukKwon (merged: 2026-01-26 10:35 (UTC+8))
#33046 [Model Runner V2] Fix slot_mapping after #25954 | v1,nvidia | by WoosukKwon (merged: 2026-01-26 10:29 (UTC+8))
#32969 [Bugfix][VLM] Fix transformers backend embed_multimodal for Qwen2.5-VL profiling | bug,ready,qwen | by AndreasKaratzas (merged: 2026-01-26 08:34 (UTC+8))
#32772 [Model] Use mm_position to compute mrope positions for Qwen2.5-Omni | documentation,ready,v1,qwen | by Etelis (merged: 2026-01-25 20:15 (UTC+8))
#33016 [Doc] Add Qwen2.5 models to batch invariance tested models | documentation,ready,qwen | by ZhanqiuHu (merged: 2026-01-25 17:20 (UTC+8))
#32836 [BugFix] Add env variable to control PDL in LoRA | bug,ready | by jeejeelee (merged: 2026-01-25 16:32 (UTC+8))
#32684 [Bugfix] fix encoder cache hang in Qwen3VL | bug,ready,multi-modality,qwen | by JJJYmmm (merged: 2026-01-25 13:17 (UTC+8))

PRs Closed Without Merging

#33004 [BugFix] fix model runner v2 error after kvcache update split | bug,v1,nvidia | by izhuhaoran (closed: 2026-01-26 10:31 (UTC+8))
#33031 [CI] fix attention test | ready,ci/build | by ZJY0516 (closed: 2026-01-26 02:32 (UTC+8))
#23523 [AMD][FastMath] add --ffast-math build option for AMD Devices | rocm,ci/build,stale | by zejunchen-zejun (closed: 2026-01-26 10:16 (UTC+8))
#23756 [Bugfix] when nixl port by bind, process cannot stop | stale,kv-connector | by lengrongfu (closed: 2026-01-26 10:16 (UTC+8))
#24828 [Core] Avoid unnecessary coordination for non-MoE data parallel | stale,v1 | by ZJY0516 (closed: 2026-01-26 10:15 (UTC+8))
#25499 [Benchmark] Add Multimodal encoder forward pass Benchmark | performance,stale | by punitvara (closed: 2026-01-26 10:15 (UTC+8))
#33044 [Bugfix] Fix Intern / Radio Pooling Tests | bug | by alex-jw-brooks (closed: 2026-01-26 09:21 (UTC+8))
#33025 merge | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 | by aidendle94 (closed: 2026-01-25 20:26 (UTC+8))
#27862 [BugFix] Fixed 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace' | bug,needs-rebase,nvidia | by peakcrosser7 (closed: 2026-01-25 16:03 (UTC+8))
#28176 [V1] [Hybrid] Lighter Mamba Prefix Caching for Hybrid Models | needs-rebase,v1,qwen | by peakcrosser7 (closed: 2026-01-25 16:00 (UTC+8))
#29272 [V1] [Hybrid] Lighter Mamba Prefix Caching with standard memory layout | documentation,frontend,needs-rebase,v1,qwen | by peakcrosser7 (closed: 2026-01-25 15:59 (UTC+8))