vLLM Development Activity Report - 2026-01-25
Time window: 2026-01-25 11:08 (UTC+8) ~ 2026-01-26 11:08 (UTC+8). Statistics: new Issues 11 | closed Issues 20 | new PRs 25 | merged PRs 7 | PRs closed without merging 11
📊 Daily Development Summary
In the 24 hours from January 25 to 26, 2026, the vLLM project stayed highly active: 25 new PRs and 11 new Issues were opened, 7 PRs were merged, and 20 Issues were closed. Development focused on expanding model support (the Qwen family, GLM, Hunyuan), performance optimization and kernel tuning (especially for specific hardware such as Blackwell and ROCm), and fixing critical defects in the V1 engine and multimodal models. CI stability also drew considerable attention and fixes.
🎯 AMD/ROCm Ecosystem Activity
This cycle saw clear AMD-ecosystem activity, mainly feature enhancements and bug fixes on the ROCm platform.
- PR #33047: [W8A8 Block Linear Refactor][1/N] Extract Input Quantization Kernels into Modular Architecture
  - Overview: Refactors the FP8 input quantization implementation into a modular, extensible kernel architecture, preparing for a simpler W8A8BlockFp8LinearOp and the transition to the new IR.
  - AMD relevance: The design explicitly adds a dedicated kernel-implementation module for the ROCm/AITER platform (aiter.py). The new kernel selection system defines a priority chain for ROCm: AiterInputQuantKernel → CudaInputQuantKernel → TritonInputQuantKernel → PytorchInputQuantKernel, showing that the team is systematically optimizing the quantization compute path for AMD hardware.
  - Technical impact: Modular separation and an explicit fallback chain improve cross-platform (CUDA/ROCm) maintainability and clarity, and enable finer-grained performance tuning for AMD GPUs (such as the MI300 series) going forward.
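The priority chain described above can be sketched as a first-match selector. Only the kernel class names come from the PR description; `is_supported()` and `select_kernel()` below are hypothetical illustrations, not vLLM's actual API.

```python
# Illustrative priority-chain kernel selector (hypothetical API; only the
# kernel class names come from the PR description).
class AiterInputQuantKernel:
    @classmethod
    def is_supported(cls) -> bool:
        return False  # e.g. AITER not installed, or not running on ROCm

class TritonInputQuantKernel:
    @classmethod
    def is_supported(cls) -> bool:
        return True  # Triton is broadly available as a fallback

def select_kernel(priority_chain):
    """Return the first kernel in the chain usable on this platform."""
    for kernel_cls in priority_chain:
        if kernel_cls.is_supported():
            return kernel_cls
    raise RuntimeError("no usable input-quant kernel found")

ROCM_CHAIN = [AiterInputQuantKernel, TritonInputQuantKernel]
print(select_kernel(ROCM_CHAIN).__name__)  # falls back past the AITER kernel
```

The point of the pattern is that platform-specific kernels (AITER here) sit first in the chain, while portable implementations guarantee a working fallback.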
- PR #33018: [ROCm][Bugfix] Fix ptpc scale load issue for fused shared expert path in deepseek mtp
  - Overview: Fixes a model-loading failure on ROCm when MTP (Multi-Token Prediction) and the VLLM_ROCM_USE_AITER_FUSED_SHARED_EXPERTS environment variable are enabled together.
  - AMD relevance: A direct fix for a compatibility bug in the ROCm backend's handling of DeepSeek's fused shared-expert path with MTP. The contributor ganyi1996ppo carries no "-amd" suffix but focuses on ROCm fixes.
  - Technical impact: Ensures ROCm users relying on AITER fused kernels can load and run DeepSeek models that use MTP, strengthening AMD-platform support for cutting-edge model architectures.
- PR #33043: [rocm][aiter] add env var VLLM_ROCM_USE_AITER_SAMPLING
  - Overview: Adds the environment variable VLLM_ROCM_USE_AITER_SAMPLING to control whether ROCm uses the AITER (AMD-optimized) sampling op.
  - AMD relevance: Another addition to the family of AMD-specific kernel toggles. Per the PR description, this op previously introduced an accuracy issue (#32413), and the new switch lets users fall back to other implementations.
  - Technical impact: Gives users more flexibility and risk control, keeping services reliable when a new kernel may be unstable, and reflecting attention to the day-to-day operational experience of AMD-ecosystem users.
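A boolean toggle like this is typically parsed once from the environment at startup. The sketch below shows the common pattern; the helper name, accepted values, and default are assumptions for illustration, not vLLM's actual implementation of this flag.

```python
import os

def env_flag(name: str, default: bool) -> bool:
    """Parse a boolean environment variable ("1"/"true"/"yes" -> True).

    Illustrative helper only; check the merged PR for vLLM's real parsing
    and default for VLLM_ROCM_USE_AITER_SAMPLING.
    """
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")

# Hypothetical default shown here; the actual default is defined in the PR.
USE_AITER_SAMPLING = env_flag("VLLM_ROCM_USE_AITER_SAMPLING", default=False)
print(USE_AITER_SAMPLING)
```

Operationally, users who hit accuracy problems with the AITER sampler would set the variable to `0` to fall back, which is exactly the escape hatch the PR provides.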
Summary: AMD-ecosystem work this cycle centered on kernel-architecture refactoring to better support ROCm, fixing compatibility issues for advanced features such as the fused shared-expert/MTP path on ROCm, and giving users finer-grained control over AMD-specific kernels. Nothing new touched the Quark quantization toolchain or MI300 directly, but sustained low-level kernel optimization and bug fixing on the ROCm platform is a clear direction of work.
💬 High-Traffic Discussion Analysis
- PR #32772: [Model] Use mm_position to compute mrope positions for Qwen2.5-Omni (17 comments)
  - Core issue: Optimizes how mrope (multi-dimensional RoPE) positions are computed for the Qwen2.5-Omni multimodal model, replacing a per-token loop with direct computation from mm_position.offset, fixing a crash on main under certain configurations (e.g. use_audio_in_video=True).
  - Discussion:
    - Contributor (Etelis): proposed the offset-based algorithm and ran multiple rounds of validation (single/multiple audio, image, and other modalities).
    - Reviewers (DarkLight1337, Isotr0py): scrutinized correctness after the refactor, requested detailed output comparisons against main, and tested various modality combinations, ultimately confirming that the new implementation both fixes the crash and matches main's output on the passing cases.
  - Points of contention: none of significance; the thread was rigorous code review plus comprehensive regression testing.
  - Final status: merged after thorough testing and review, fixing the crash under the affected configurations and streamlining the computation.
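The shift from a per-token loop to offset-based computation can be illustrated in plain Python. The field name mirrors the PR title, but both functions here are illustrative assumptions, not the actual vLLM code.

```python
# Illustrative only: derive position ids for a multimodal segment directly
# from its offset, instead of advancing a counter token by token.
def positions_from_offset(offset: int, length: int) -> list[int]:
    """Position ids for `length` tokens starting at `offset` (vectorizable)."""
    return list(range(offset, offset + length))

# The per-token loop this style of refactor replaces:
def positions_loop(offset: int, length: int) -> list[int]:
    out, pos = [], offset
    for _ in range(length):
        out.append(pos)
        pos += 1
    return out

assert positions_from_offset(5, 3) == positions_loop(5, 3) == [5, 6, 7]
```

Besides being easier to vectorize, the offset form avoids depending on loop state that can go wrong for configurations mixing modalities, which is the class of bug the PR fixed.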
- PR #32969: [Bugfix][VLM] Fix transformers backend embed_multimodal for Qwen2.5-VL profiling (11 comments)
  - Core issue: Fixes a crash during memory profiling of the Qwen2.5-VL model with the transformers backend, caused by a mismatch in embedding tensor dimensions.
  - Discussion:
    - Contributor (AndreasKaratzas): the initial fix overlooked that different vision models (e.g. Idefics3 vs Qwen2.5-VL) emit embeddings with different shapes, introducing a new bug.
    - Reviewer (Isotr0py): promptly flagged that the initial fix broke the Idefics3 model, pushing toward a more comprehensive solution.
  - Point of contention: how to design a universal fix that handles the diversity of embedding shapes across multimodal models during profiling (e.g. [num_patches, tokens_per_patch, dim] vs [total_tokens, dim]).
  - Final status: the contributor's v2 added smarter shape detection and adaptation, fixing Qwen2.5-VL while preserving compatibility with Idefics3 and other models; merged.
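The two shapes mentioned above can be reconciled by collapsing the patch dimension. This is a hedged sketch of the general idea over bare shape tuples, not the merged fix itself.

```python
def normalize_embed_shape(shape: tuple[int, ...]) -> tuple[int, int]:
    """Collapse a 3-D [num_patches, tokens_per_patch, dim] shape to the
    2-D [total_tokens, dim] form; pass 2-D shapes through unchanged.

    Illustrative sketch of shape adaptation, not vLLM's actual logic.
    """
    if len(shape) == 3:
        num_patches, tokens_per_patch, dim = shape
        return (num_patches * tokens_per_patch, dim)
    if len(shape) == 2:
        return (shape[0], shape[1])
    raise ValueError(f"unexpected embedding shape: {shape}")

print(normalize_embed_shape((4, 16, 1024)))  # (64, 1024)
print(normalize_embed_shape((64, 1024)))     # (64, 1024)
```

The key design point from the review thread is that the adapter must detect the rank rather than assume one model family's layout, which is what broke Idefics3 in the first attempt.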
- Issue #33017: [Bug]: /v1/responses endpoint crashes with 'NoneType has no attribute startswith' when input contains function_call items (4 comments)
  - Core issue: The OpenAI Responses API endpoint crashes when the request history contains a function_call item whose id field is None.
  - Discussion:
    - Reporter (fugarty): described the problem in detail and quickly traced the root cause on their own (the id of a ResponseFunctionToolCall object can be None), supplying both fix code and an in-Docker hot-patch command.
    - Maintainer (chaunceyjiang): asked for the serve command to reproduce the environment, and ultimately confirmed the issue is already fixed by PR #31999.
  - Current status: the Issue remains open, but the root cause and fix are established and point to an existing PR, demonstrating strong self-service debugging within the community.
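The crash pattern is the classic unguarded `str` method call on an optional field. A minimal sketch of the defensive check (with a hypothetical helper name and prefix, not the actual patch):

```python
from typing import Optional

def is_tool_call_id(item_id: Optional[str]) -> bool:
    """Guard against None before calling str methods.

    Hypothetical helper: calling item_id.startswith(...) directly raises
    AttributeError when the id field is None, which is the crash class
    reported in #33017.
    """
    return item_id is not None and item_id.startswith("call_")

print(is_tool_call_id("call_abc123"))  # True
print(is_tool_call_id(None))           # False, instead of raising
```

Any code path that consumes user-supplied request history has to treat optional fields as optional; the short-circuiting `is not None and ...` form keeps the check in one expression.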
🔥 Hot Topics and Trends
- Model support and new features:
  - Continued expansion: an Issue requests Qwen3 TTS support (#33051), a PR refactors GLM-4V mrope handling (#33039), and another PR adds Eagle3 speculative-decoding support for Hunyuan models (#33035).
  - Feature evolution: integrating layerwise reloading with quantized models for RL and similar scenarios was formalized as a future-work roadmap (#33038).
- Performance optimization and hardware support:
  - Hardware-specific issues: a hang with TP=2 on Blackwell GPUs (GB10/B200) under CUDA 13.0 and a specific NCCL version was reported (#33041), highlighting the growing pains of a new hardware ecosystem.
  - Kernel-tuning automation: a request for Fused-MoE-style auto-tuning scripts and config files for Mamba's selective_state_update kernel was filed (#33034) and immediately claimed by a contributor, a sign that the kernel-tuning infrastructure is expanding to more operators.
- Testing and CI stability:
  - Multi-suite failures: CI saw clustered failures in the MoE integration tests (#33029), attention kernel tests (#33027), and multimodal model tests (#33028).
  - Fast fixes: the corresponding fix PRs (#33030, #33033) were opened quickly and marked "ready", reflecting the team's focus on pipeline stability and rapid response.
🛠️ Key Technical Changes
- PR #33046: [Model Runner V2] Fix slot_mapping after #25954 (merged)
  - Analysis: Fixes a bug introduced by PR #25954 (which split FlashAttn attention computation from cache updates) that crashed Model Runner V2 with CUDA Graph enabled, due to a missing block_tables attribute and an unexpected attn_metadata format.
  - Impact: Ensures the performance optimization does not break the stability of the V2 model runner; a key fix for keeping components compatible as the V1 engine evolves.
- PR #33018: [ROCm][Bugfix] Fix ptpc scale load issue for fused shared expert path in deepseek mtp (in progress)
  - Analysis: Fixes a model-weight loading failure on ROCm in a specific path (AITER fused shared experts with MTP enabled).
  - Impact: Improves the completeness and reliability of AMD-platform support for complex MoE models such as DeepSeek.
- PR #33019: [MoE][Fix] Fix PPLX CUTLASS FP8 incorrect output with apply_router_weight_on_input (in progress)
  - Analysis: Fixes incorrect output when using the PPLX all2all backend with CUTLASS FP8 quantization on models configured with apply_router_weight_on_input (e.g. Llama4), caused by a duplicated operation on the per-token-quantization (per_act_token_quant) scale tensor.
  - Impact: Resolves a severe MoE accuracy regression in that configuration and restores FP8 correctness for complex MoE models.
📈 Development Activity Observations
- Efficient merging: 7 PRs merged within 24 hours, including important bug fixes (e.g. #33046, #32684, #32836) and feature improvements.
- Issue triage: 20 Issues closed, most of them long-inactive stale Issues, keeping the tracker clean.
- Contributor distribution: core maintainers (WoosukKwon, robertgshaw2-redhat, and others) were very active on critical bugs and CI fixes, while high-quality contributions from the community (e.g. fugarty, Etelis) and partners (e.g. AMD-related contributors) were actively accepted, a sign of a healthy project ecosystem.
💡 Issues to Watch
- Issue #33041: [Bug]: vLLM hangs after NCCL init with TP=2 on Blackwell GPUs: a deadlock on the latest Blackwell architecture under the specific combination of CUDA 13.0 and NCCL 2.27.7. It may affect early Blackwell adopters, so the root cause and resolution are worth tracking.
- Issue #33038: [Feature]: Integrate layerwise reloading with other vLLM loading features: lays out a roadmap for integrating layerwise reloading with attention/MLA quantization, EPLB (expert-parallel load balancing), and other advanced features. It is critical to frontier use cases such as efficient RL training of quantized models and marks the continued evolution of a core piece of infrastructure.
📋 Appendix: Detailed Data
New Issues
- #33051 [Feature]: Will it support Qwen3 TTS — feature request — by lovetian1991 (created: 2026-01-26 10:59 (UTC+8))
- #33050 [Bug]: [DeepSeek-V3.2] PD Can't instantiate abstract class DeepseekV32IndexerBackend without an implementation for abstract method 'get_impl_cls' — bug — by chaunceyjiang (created: 2026-01-26 10:32 (UTC+8))
- #33017 [Bug] /v1/responses endpoint crashes with 'NoneType has no attribute startswith' when input contains function_call items — no labels — by fugarty (created: 2026-01-25 12:39 (UTC+8))
- #33041 [Bug]: vLLM hangs after NCCL init with TP=2 on Blackwell GPUs (CUDA 13.0, NCCL 2.27.7) — bug — by robertjmcintyre (created: 2026-01-26 04:59 (UTC+8))
- #33034 [Feature][Help Wanted]: Add tuning script and config files for Mamba selective_state_update kernel — help wanted,good first issue,feature request — by danisereb (created: 2026-01-26 00:23 (UTC+8))
- #33040 vLLM initialization failing on yggdrasil only — no labels — by ivnle (created: 2026-01-26 04:45 (UTC+8))
- #33038 [Feature]: Integrate layerwise reloading with other vLLM loading features — feature request — by kylesayrs (created: 2026-01-26 04:10 (UTC+8))
- #33029 [CI Failure]: MoE Integration Tests — ci-failure — by robertgshaw2-redhat (created: 2026-01-25 22:42 (UTC+8))
- #33027 [CI Failure]: Kernels Attention — ci-failure — by robertgshaw2-redhat (created: 2026-01-25 22:38 (UTC+8))
- #33026 [Feature]: Support fused silu_mul and block-wise quantization Triton kernel — feature request — by weimin023 (created: 2026-01-25 22:12 (UTC+8))
- #33028 [CI Failure]: MultiModal Tests — ci-failure — by robertgshaw2-redhat (created: 2026-01-25 22:39 (UTC+8))
Closed Issues
- #33003 [Bug]: Model Runner V2 broken CUDA Graph after kvcache update split(#25954) — bug — by izhuhaoran (closed: 2026-01-26 10:30 (UTC+8))
- #22162 [Usage]: multi-lora for vision language model — usage,stale — by PizzaSnow (closed: 2026-01-26 10:16 (UTC+8))
- #22239 [Performance]: vllm v0.10.0 seems to be much slower than vllm v0.8.5 when using Qwen3-30B-A3B-int4 — performance,stale — by alanayu (closed: 2026-01-26 10:16 (UTC+8))
- #22284 [Bug] [gpt-oss-20b] [Responses API]: Could not parse header: too many tokens remaining after extracting content-type and recipient — bug,stale — by cadedaniel (closed: 2026-01-26 10:16 (UTC+8))
- #22885 [Bug]: tool_calls content loss } when use hermes_tool_parser — bug,stale,tool-calling — by 394988736 (closed: 2026-01-26 10:16 (UTC+8))
- #23935 [Bug]: No platform detected, vLLM is running on UnspecifiedPlatform in Docker with Kubernetes, Nvidia L4 — bug,stale — by a1expol (closed: 2026-01-26 10:16 (UTC+8))
- #24112 [RFC]: Improve MoE triton kernel tuning — RFC,stale — by jeejeelee (closed: 2026-01-26 10:16 (UTC+8))
- #24484 [RFC]: Enhancing the Pluggable Scheduler with a Workload-Aware Adaptive Policy — RFC,stale — by sidikbro (closed: 2026-01-26 10:16 (UTC+8))
- #24779 [CI]: Each test has about 4-5min of setup time — feature request,stale — by afeldman-nm (closed: 2026-01-26 10:15 (UTC+8))
- #24916 [Feature]: Make FP8 Attention fast for GPT-OSS w/ FA3 on Hopper — feature request,stale — by jmkuebler (closed: 2026-01-26 10:15 (UTC+8))
- #25004 [CI]: Test container layer caching optimization — feature request,stale — by njhill (closed: 2026-01-26 10:15 (UTC+8))
- #25108 [Bug]: vLLM chat breaks during multi-turn chat — bug,stale — by RobotSail (closed: 2026-01-26 10:15 (UTC+8))
- #25482 [Bug]: V1 scheduler is crashing in concurrent scenarios — bug,stale — by jaluma (closed: 2026-01-26 10:15 (UTC+8))
- #25539 [Bug]: When there is an enum type with Chinese values in the response_format schema, a decoding error occurs during output. — bug,stale — by tianqihou (closed: 2026-01-26 10:15 (UTC+8))
- #31588 [Bug]: vLLM SM 12.1 (Blackwell GB10) V1 Engine Bug Report (Relates to: #28589, #31128, #28621, #27679) — bug — by ohsono (closed: 2026-01-26 05:06 (UTC+8))
- #33040 vLLM initialization failing on yggdrasil only — no labels — by ivnle (closed: 2026-01-26 04:45 (UTC+8))
- #33009 [Installation]: vLLM v14.0 Build Error with Official Docker Image — installation — by weimin023 (closed: 2026-01-25 21:16 (UTC+8))
- #30872 [Bug]: Triton kernel failing for LoRA on SM100 — bug — by josefdra (closed: 2026-01-25 16:32 (UTC+8))
- #27655 [Bug]: 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace' — bug — by peakcrosser7 (closed: 2026-01-25 16:15 (UTC+8))
- #32680 [Bug]: hanging durling long video understanding — bug — by JJJYmmm (closed: 2026-01-25 13:17 (UTC+8))
New PRs
- #33047 [W8A8 Block Linear Refactor][1/N] Extract Input Quantization Kernels into Modular Architecture — needs-rebase,nvidia — by maralbahari (created: 2026-01-26 10:18 (UTC+8))
- #33033 [CI] Fix MHA attention test failure (AttributeError when model_config is None in ViT attention backend) — ready — by LucasWilkinson (created: 2026-01-25 23:55 (UTC+8))
- #33048 [Model Runner V2] Minor simplification for finish_requests — v1 — by WoosukKwon (created: 2026-01-26 10:28 (UTC+8))
- #33049 [MoE Refactor] DefaultMoERunner simplifcation — documentation,v1 — by bnellnm (created: 2026-01-26 10:29 (UTC+8))
- #33046 [Model Runner V2] Fix slot_mapping after #25954 — v1,nvidia — by WoosukKwon (created: 2026-01-26 10:13 (UTC+8))
- #33031 [CI] fix attention test — ready,ci/build — by ZJY0516 (created: 2026-01-25 23:17 (UTC+8))
- #33018 [ROCm][Bugfix] Fix ptpc scale load issue for fused shared expert path in deepseek mtp — bug,rocm,deepseek — by ganyi1996ppo (created: 2026-01-25 13:03 (UTC+8))
- #33045 Fix UnembedMetrics FLOP overcounting for prefill — v1,meta-exported,fb-exported — by omkhalil (created: 2026-01-26 10:08 (UTC+8))
- #33044 [Bugfix] Fix Intern / Radio Pooling Tests — bug — by alex-jw-brooks (created: 2026-01-26 08:48 (UTC+8))
- #33039 Refactor/glm4xv mrope — documentation — by KKSK-DON (created: 2026-01-26 04:30 (UTC+8))
- #33043 [rocm][aiter] add env var VLLM_ROCM_USE_AITER_SAMPLING — rocm,v1 — by yuguo68 (created: 2026-01-26 07:19 (UTC+8))
- #33036 [Model] Fix Qwen3-VL load_weights to skip lm_head when tie_word_embeddings is True — qwen — by jeremyteboul (created: 2026-01-26 02:58 (UTC+8))
- #33042 [WIP] [Voxtral] Streaming example — documentation — by patrickvonplaten (created: 2026-01-26 05:53 (UTC+8))
- #33037 [BugFix] Fix P/D with non-MoE DP — bug,v1,kv-connector — by njhill (created: 2026-01-26 03:26 (UTC+8))
- #33032 [Tests] Remove Duplicates — ready,nvidia — by robertgshaw2-redhat (created: 2026-01-25 23:20 (UTC+8))
- #33022 [Kernel] Apply 256bit LDG/STG To Activation Kernels — no labels — by AstroVoyager7 (created: 2026-01-25 16:17 (UTC+8))
- #33035 feature: support eagle3 for HunyuanVL & Hunyuan — speculative-decoding,v1 — by irisliu10 (created: 2026-01-26 00:50 (UTC+8))
- #33019 [MoE][Fix] Fix PPLX CUTLASS FP8 incorrect output with apply_router_weight_on_input — nvidia — by thc1006 (created: 2026-01-25 13:10 (UTC+8))
- #33030 [Bugfix] Fix Dtypes for Pynccl Wrapper — bug,ready,nvidia — by robertgshaw2-redhat (created: 2026-01-25 22:52 (UTC+8))
- #33020 [Bugfix] Fix display error (inconsistent with context) — bug,ready — by lingebeng (created: 2026-01-25 15:06 (UTC+8))
- #33025 merge — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by aidendle94 (created: 2026-01-25 20:11 (UTC+8))
- #33024 [V1][Hybrid] Support spec decode and optimize block-aligned split in mamba cache align mode — v1 — by peakcrosser7 (created: 2026-01-25 17:09 (UTC+8))
- #33016 [Doc] Add Qwen2.5 models to batch invariance tested models — documentation,ready,qwen — by ZhanqiuHu (created: 2026-01-25 11:45 (UTC+8))
- #33021 fix: Resolve kv_cache_dtype='auto' to actual dtype and log it (#32843) — no labels — by ItzDEXX (created: 2026-01-25 16:08 (UTC+8))
- #33023 refactor: unify FlashInfer utils into vllm.utils.flashinfer — nvidia — by puranikyashaswin (created: 2026-01-25 16:19 (UTC+8))
Merged PRs
- #33048 [Model Runner V2] Minor simplification for finish_requests — v1 — by WoosukKwon (merged: 2026-01-26 10:35 (UTC+8))
- #33046 [Model Runner V2] Fix slot_mapping after #25954 — v1,nvidia — by WoosukKwon (merged: 2026-01-26 10:29 (UTC+8))
- #32969 [Bugfix][VLM] Fix transformers backend embed_multimodal for Qwen2.5-VL profiling — bug,ready,qwen — by AndreasKaratzas (merged: 2026-01-26 08:34 (UTC+8))
- #32772 [Model] Use mm_position to compute mrope positions for Qwen2.5-Omni — documentation,ready,v1,qwen — by Etelis (merged: 2026-01-25 20:15 (UTC+8))
- #33016 [Doc] Add Qwen2.5 models to batch invariance tested models — documentation,ready,qwen — by ZhanqiuHu (merged: 2026-01-25 17:20 (UTC+8))
- #32836 [BugFix] Add env variable to control PDL in LoRA — bug,ready — by jeejeelee (merged: 2026-01-25 16:32 (UTC+8))
- #32684 [Bugfix] fix encoder cache hang in Qwen3VL — bug,ready,multi-modality,qwen — by JJJYmmm (merged: 2026-01-25 13:17 (UTC+8))
PRs Closed Without Merging
- #33004 [BugFix] fix model runner v2 error after kvcache update split — bug,v1,nvidia — by izhuhaoran (closed: 2026-01-26 10:31 (UTC+8))
- #33031 [CI] fix attention test — ready,ci/build — by ZJY0516 (closed: 2026-01-26 02:32 (UTC+8))
- #23523 [AMD][FastMath] add --ffast-math build option for AMD Devices — rocm,ci/build,stale — by zejunchen-zejun (closed: 2026-01-26 10:16 (UTC+8))
- #23756 [Bugfix] when nixl port by bind, process cannot stop — stale,kv-connector — by lengrongfu (closed: 2026-01-26 10:16 (UTC+8))
- #24828 [Core] Avoid unnecessary coordination for non-MoE data parallel — stale,v1 — by ZJY0516 (closed: 2026-01-26 10:15 (UTC+8))
- #25499 [Benchmark] Add Multimodal encoder forward pass Benchmark — performance,stale — by punitvara (closed: 2026-01-26 10:15 (UTC+8))
- #33044 [Bugfix] Fix Intern / Radio Pooling Tests — bug — by alex-jw-brooks (closed: 2026-01-26 09:21 (UTC+8))
- #33025 merge — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by aidendle94 (closed: 2026-01-25 20:26 (UTC+8))
- #27862 [BugFix] Fixed 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace' — bug,needs-rebase,nvidia — by peakcrosser7 (closed: 2026-01-25 16:03 (UTC+8))
- #28176 [V1] [Hybrid] Lighter Mamba Prefix Caching for Hybrid Models — needs-rebase,v1,qwen — by peakcrosser7 (closed: 2026-01-25 16:00 (UTC+8))
- #29272 [V1] [Hybrid] Lighter Mamba Prefix Caching with standard memory layout — documentation,frontend,needs-rebase,v1,qwen — by peakcrosser7 (closed: 2026-01-25 15:59 (UTC+8))