vLLM Development Activity Report - 2026-02-01

Time window: 2026-02-01 11:35 (UTC+8) ~ 2026-02-02 11:35 (UTC+8)
Statistics: New issues 10 | Closed issues 17 | New PRs 29 | Merged PRs 19 | PRs closed without merging 3

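📊 Daily Development Summary

During this period (2026-02-01 to 02-02) the vLLM community stayed highly active, handling 27 issues (10 opened, 17 closed) and 48 PRs (29 opened, 19 merged). The main focus areas were multimodal model support and compatibility fixes (e.g. Qwen3-VL, GLM4), inference performance and scheduling optimization (e.g. the Triton MLA kernel, scheduler policy), and hardware ecosystem support (notably AMD ROCm and NVIDIA Blackwell). Merging 19 PRs within the window suggests that code review and integration are running smoothly. Distributed test failures and a model-specific accuracy problem in CI drew attention and prompt follow-up.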

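🎯 AMD/ROCm Ecosystem Updates

The AMD ecosystem saw notable updates this period, mainly performance optimizations and compatibility fixes, including a contribution from an AMD employee (amd-hhashemi).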

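PR #33527 - Adds padding and perf improvements to wvSplitK_fp8 (user: amd-hhashemi)
- Technical details: This PR adds activation tensor padding support to the wvSplitK_fp8 kernel (used for FP8 quantized computation) on ROCm. It also optimizes the bias and DPP reduction operations and extends the test scenarios.
- Impact: This is a hardware-level optimization contributed directly by an AMD employee, aimed at improving the performance and robustness of FP8 quantized operations on AMD GPUs. Padding support helps with inputs of non-standard sizes, and the performance work should translate directly into faster inference. It also indicates that the AMD team is closely involved in low-level kernel tuning for vLLM on ROCm.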
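PR #33511 - fix(ROCm): Make flash_attn import optional in MLA attention (user: rabi)
- Technical details: Fixes a startup failure on ROCm caused by an unconditional import of flash_attn, which occurred even when the model being run does not use MLA (Multi-Head Latent Attention). The fix makes the flash_attn import in mla_attention.py optional, so the package is required only when MLA is actually used.
- Impact: Noticeably improves compatibility and ease of use for non-MLA models on ROCm. Users no longer need to install flash_attn for every model, which lowers the deployment barrier and reflects attention to the user experience on AMD platforms.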
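
The optional-import pattern described for PR #33511 can be sketched roughly as follows. This is a minimal illustration, not the code in mla_attention.py: the helper names `optional_import` and `require_flash_attn` are hypothetical, and only the general idea (tolerate a missing package at import time, fail only on the path that needs it) comes from the PR description.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Attempted once at module import time; a missing package is tolerated here.
_flash_attn = optional_import("flash_attn")


def require_flash_attn() -> ModuleType:
    """Hypothetical helper: call only on the code path that really needs MLA."""
    if _flash_attn is None:
        raise ImportError(
            "flash_attn is required for MLA attention but is not installed"
        )
    return _flash_attn
```

With this shape, a non-MLA model never calls `require_flash_attn()`, so the missing package is never reported as an error.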

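💬 Most-Discussed Threads

Issue #21237 - [Bug]: vLLM stops inference (closed)
- Core topic: A user reported that while processing a batch of data, vLLM stops inference at some point; GPU utilization drops to 0 while memory remains allocated.
- Positions:
  - The reporter and users with similar symptoms: shared logs showing the process hanging at EngineCore waiting for work, and attributed the problem to vLLM internals.
  - Contributor @fungaren: pointed out that the problem may be related to guided_decoding, linked several earlier issues, and suggested upgrading to v0.10.0 or later to pick up fixes for bugs in older xgrammar versions.
- Point of contention: The root cause was unclear at first. Users regarded it as a framework defect, while later analysis pointed to a compatibility problem between one feature (guided decoding) and an outdated dependency.
- Outcome: The issue was closed automatically after a long period without updates. The workaround given in the discussion (upgrading vLLM) still offers a clear path for users who hit similar problems.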
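Issue #33526 - [RFC]: Progressive KV Cache CPU Onloading (new)
- Core topic: Proposes "progressive onloading" for the existing KV cache CPU offloading feature, to address head-of-line blocking in which large requests hold up small ones.
- Positions:
  - RFC author (pougetat): described the problem with the current design (all blocks of a request are loaded at once) and proposed loading in batches, together with an architecture design.
  - Maintainer (robertgshaw2-redhat): asked whether there is an example workload that demonstrates the value of the feature, so that its necessity can be assessed and it can serve as a performance baseline.
- Point of contention: No strong disagreement so far. The discussion centers on the practical benefit and use cases of the optimization. The maintainer's question is meant to ensure that the work has clear performance goals and validation criteria.
- Current status: The RFC is open for discussion, pending more concrete use cases and performance data.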
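
The head-of-line blocking that the RFC targets can be illustrated with a toy model. The sketch below is not the RFC's design: the function `finish_steps`, the per-step copy `budget`, and the round-robin `chunk` policy are all assumptions made for illustration. It only compares "load the head request completely first" against "load a few blocks per request in turn".

```python
from collections import deque


def finish_steps(requests, budget, chunk=None):
    """Return {name: step at which all of its blocks have been loaded}.

    requests: list of (name, num_blocks) in arrival order.
    budget:   blocks that can be copied from CPU to GPU per step (>= 1).
    chunk:    None -> the head request is loaded to completion first;
              int >= 1 -> each request gets at most `chunk` blocks per turn.
    """
    remaining = dict(requests)
    queue = deque(name for name, _ in requests)
    done = {}
    step = 0
    while queue:
        step += 1
        left = budget
        while left > 0 and queue:
            name = queue.popleft()
            limit = left if chunk is None else min(chunk, left)
            take = min(remaining[name], limit)
            remaining[name] -= take
            left -= take
            if remaining[name] == 0:
                done[name] = step
            elif chunk is None:
                queue.appendleft(name)  # head keeps its place: others wait
            else:
                queue.append(name)      # rotate so later requests get a turn
    return done


reqs = [("big", 100), ("small", 4)]
print(finish_steps(reqs, budget=8))           # small waits behind big
print(finish_steps(reqs, budget=8, chunk=4))  # small finishes in the first step
```

In this toy setting the large request finishes at the same step in both modes, while the small request no longer waits behind it. Whether real workloads behave this way is exactly what the maintainer's request for an example workload is meant to establish.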
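PR #33529 - Triton MLA GQA perf fixes (4x improvement at 80k context) (new)
- Core topic: Fixes a sharp performance drop in the Triton MLA kernel at long context lengths (e.g. 80K), with significant gains for models such as Kimi.
- Positions:
  - Contributor (koush): traced the problem to PTX code generation and submitted low-level optimizations, with detailed benchmark results showing a 4.36x speedup in long-context scenarios.
  - Community bot (mergify): flagged that the pre-commit checks did not pass.
- Point of contention: No disagreement on substance, but the contributor needs to resolve code style issues. The PR drew a lot of attention because of its strong performance data.
- Current status: The PR is open and will enter review once the pre-commit checks pass.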

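🔥 Hot Topics and Trends

Model support and compatibility: Several issues reflect the challenges of integrating new models quickly. Problems with loading, LoRA adaptation, and config parsing for Qwen3-VL, GLM-4.7-flash, NemotronH, and other models appeared together, showing that the community is very active in expanding model coverage and that compatibility maintenance needs sustained investment.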
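Inference and performance optimization: Performance remains a constant theme. This period's discussions covered several levels:
- Kernel level: the Triton MLA long-context performance fix (#33529) and the AMD FP8 kernel optimization (#33527).
- Scheduling level: the progressive KV cache CPU onloading RFC (#33526) and a scheduler optimization that skips KV-blocked requests (#33499).
- Feature level: a fix for accurate counting of reasoning tokens (reasoning_tokens) (#33512, #33513).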
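
The scheduling-level item on skipping KV-blocked requests (#33499, with a related FCFS ordering fix in #33522) can be pictured with a toy waiting queue. The sketch below is an illustration under stated assumptions, not the scheduler's actual code: it assumes "KV-blocked" means a request that cannot currently obtain enough KV cache blocks, and `schedule_step` is a hypothetical name.

```python
from collections import deque


def schedule_step(waiting, free_blocks):
    """Admit every waiting request that fits; skip the ones that do not.

    waiting: deque of (request_id, blocks_needed) in arrival (FCFS) order.
    Returns (admitted_ids, free_blocks_left). Skipped requests are put back
    at the front of `waiting` in their original relative order.
    """
    admitted, skipped = [], []
    while waiting:
        req_id, need = waiting.popleft()
        if need <= free_blocks:
            free_blocks -= need
            admitted.append(req_id)
        else:
            skipped.append((req_id, need))
    # extendleft reverses its argument, so reverse first to keep FCFS order.
    waiting.extendleft(reversed(skipped))
    return admitted, free_blocks


waiting = deque([("a", 10), ("b", 2), ("c", 9), ("d", 1)])
admitted, left = schedule_step(waiting, free_blocks=4)
print(admitted, left, list(waiting))
```

Here requests "b" and "d" are admitted even though "a" cannot be, and "a" and "c" stay at the front of the queue in arrival order for the next step.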
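Deeper hardware platform support: Beyond the AMD updates above, support for new NVIDIA hardware (e.g. Blackwell B200, SM121) also continues to advance (PR #33517, #33516, #33518), showing that vLLM tracks cutting-edge hardware quickly.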

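🛠️ Key Technical Changes

PR #33524 - [Fix] prefix cache hit rate == 0 bug with gpt-oss style models (merged)
- Summary: Fixes a bug in which the prefix cache hit rate was always 0 for GPT-OSS style hybrid-attention models (one full-attention group plus one sliding-window attention group). The fix treats such simple models as a special case and skips the unnecessary convergence-check loop.
- Impact: Directly improves prefix caching efficiency for the affected models and reduces redundant computation, which should benefit inference speed for GPT-OSS style models.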
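PR #33501 - Fix DeepSeek V2 RoPE initialization error (merged)
- Summary: Fixes a failure to start DeepSeek V2 models on TPU and other platforms caused by a RoPE (rotary position embedding) initialization error. The problem came from how the code handled rope_scaling being null in the model config.
- Impact: Ensures that the DeepSeek V2 model family loads and runs correctly on all backend platforms supported by vLLM, widening its applicability.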
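
The class of bug described for PR #33501 (a config key that is present but null) can be sketched in isolation. This is a generic illustration, not the vLLM fix: the config is modeled as a plain dict, and the helper names and the default factor of 1.0 are assumptions of this sketch.

```python
def get_rope_scaling(config: dict) -> dict:
    """Return rope scaling parameters, treating a missing or null entry as 'no scaling'."""
    # `config.get("rope_scaling", {})` would still return None when the key
    # is present with a null value, so normalize explicitly.
    return config.get("rope_scaling") or {}


def rope_scaling_factor(config: dict) -> float:
    # A default factor of 1.0 (no scaling) is an assumption of this sketch.
    return float(get_rope_scaling(config).get("factor", 1.0))


print(rope_scaling_factor({"rope_scaling": None}))
print(rope_scaling_factor({"rope_scaling": {"factor": 40}}))
```

The point of the pattern is that code downstream can index into the result without first checking for None.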
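PR #33502 / #33500 - Fix and revert for multimodal model initialization
- Summary: This pair of PRs exposes an important problem. PR #33110 introduced initialization logic to unify multimodal chunk processing, but it caused some multimodal models to hang at startup and was reverted in #33500. PR #33502 then re-applied the change, running that initialization inside a context that limits the number of Torch threads, which resolved the hang.
- Impact: Illustrates the subtle challenges (such as thread safety and initialization order) involved in developing a complex feature like multimodal support. The final fix preserves both the stability and the performance of multimodal model initialization.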
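
The idea of running initialization "inside a context that limits the number of threads" can be sketched as a small context manager. This is a generic sketch, not the code from PR #33502: `limited_threads` is a hypothetical name, and the getter and setter are injected so the example runs without PyTorch installed.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def limited_threads(get_threads: Callable[[], int],
                    set_threads: Callable[[int], None],
                    limit: int = 1) -> Iterator[None]:
    """Temporarily cap a thread-count setting, restoring it afterwards."""
    previous = get_threads()
    set_threads(min(previous, limit))
    try:
        yield
    finally:
        # Restore the original value even if the body raises.
        set_threads(previous)


# Stand-in for a real thread-count setting, used only for demonstration.
state = {"threads": 8}
with limited_threads(lambda: state["threads"],
                     lambda n: state.update(threads=n)):
    print("inside:", state["threads"])
print("after:", state["threads"])
```

With PyTorch, the existing functions `torch.get_num_threads` and `torch.set_num_threads` could be passed as the getter and setter; whether the PR uses exactly these calls is not stated in this report.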

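📈 Development Activity Observations

Efficient merging: 19 PRs were merged within 24 hours, which points to a mature review and integration process in the core team.
Diverse contributors: Contributors came from AMD (amd-hhashemi), NVIDIA (Code4me2, amitz-nv), Red Hat (robertgshaw2-redhat), and other companies, as well as independent developers, a sign of a healthy ecosystem.
CI and quality assurance: Many PRs received notices from the mergify bot about pre-commit checks and merge conflicts, showing that automated code quality checks are strictly enforced and are an important part of keeping the codebase healthy.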

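💡 Issues Worth Watching

Issue #33532 - DeepSeek V2 Lite FP8 0% Accuracy [NIGHTLY]: Accuracy of a specific quantized model dropped to zero in CI. This should be checked as a possible regression in a newly introduced quantization path or compute kernel.
Issue #33533 - Distributed Tests (4 GPUs) [Nightly] NCCL hang: An NCCL hang in the distributed tests is a serious stability problem and should be investigated with priority.
Issue #33528 - LongCat-Flash-Lite and “Engram” embedding support: Requests support for the emerging technique of caching word embeddings on disk, which may become an important direction for reducing the memory footprint of very-long-context models.
Issue #33526 - Progressive KV Cache CPU Onloading [RFC]: An important architecture proposal. Its design decisions and results will directly affect multi-tenant inference performance and fairness in memory-constrained scenarios, so the community is advised to keep following the discussion.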

📋 Appendix: Detailed Lists

New Issues

#33534 [Installation]: vllm with ray in AWS cluster | installation | by haozhuang0000 (created: 2026-02-02 10:08 (UTC+8))
#33533 [CI Failure]: Distributed Tests (4 GPUs) [Nightly] | ci-failure | by robertgshaw2-redhat (created: 2026-02-02 10:05 (UTC+8))
#33532 [CI Failure]: DeepSeek V2 Lite FP8 0% Accuracy [NIGHTLY] | ci-failure | by robertgshaw2-redhat (created: 2026-02-02 10:02 (UTC+8))
#33528 [Feature]: LongCat-Flash-Lite and “Engram” embedding support | feature request | by TomLucidor (created: 2026-02-02 09:08 (UTC+8))
#33526 [RFC]: Progressive KV Cache CPU Onloading | RFC | by pougetat (created: 2026-02-02 07:47 (UTC+8))
#33519 [Bug]: GLM4.7-flash cannot load lora adapters | bug | by Otsutsukii (created: 2026-02-02 04:06 (UTC+8))
#33515 [Bug]: Error in inspecting model architecture ‘NemotronHForCausalLM’ | bug | by shahizat (created: 2026-02-02 03:17 (UTC+8))
#33512 Responses API reasoning_tokens always zero for text-based reasoning parsers | no labels | by anencore94 (created: 2026-02-01 21:49 (UTC+8))
#33508 [Bug]: AttributeError: ‘Qwen3VLMoeConfig’ object has no attribute ‘vocab_size’ | bug | by study-overflow (created: 2026-02-01 17:55 (UTC+8))
#33507 [Bug]: Startup of the Qwen3-VL model failed. | bug | by shell-nlp (created: 2026-02-01 17:13 (UTC+8))

Closed Issues

#32829 [Bug]: During GLM-4.7 function calling (fc), the function output is not streamed. | bug | by zhangsongqing (closed: 2026-02-02 11:13 (UTC+8))
#10791 [Usage]: When will the bug in the vllm-flash-attn vision module be fixed? | usage,stale | by yansh97 (closed: 2026-02-02 10:18 (UTC+8))
#21237 [Bug]: vLLM stops inference | bug,stale | by DankoZhang (closed: 2026-02-02 10:18 (UTC+8))
#25036 [Bug]: Service Crashes Upon Any Request When Using OpenReasoning-Nemotron-32B After Switching to V1 | bug,stale | by NaiveYan (closed: 2026-02-02 10:17 (UTC+8))
#25811 [Bug]: Unable to run Qwen3-Next-80B-A3B-Instruct using B200 and Flashinfer backend | bug,stale | by mihirp1998 (closed: 2026-02-02 10:17 (UTC+8))
#25878 [CI]: Automatically cancel fastcheck when full CI have been triggered | feature request,stale | by Isotr0py (closed: 2026-02-02 10:17 (UTC+8))
#26091 [Feature]: GLM thinking budget | feature request,stale | by Cgrandjean (closed: 2026-02-02 10:17 (UTC+8))
#26094 [Performance]: preformence of from vllm.platforms import current_platform | performance,stale | by yonikremer (closed: 2026-02-02 10:17 (UTC+8))
#26114 [Performance]: Scheduler.update_from_output optimization | performance,stale | by Jialin (closed: 2026-02-02 10:17 (UTC+8))
#26133 [RFC]: Support Context Parallelism with Fully Sharded KV Cache and Ring Attention | RFC,stale | by qiruiyangmeta (closed: 2026-02-02 10:17 (UTC+8))
#26157 [Feature]: Make Custom I/O Processing Plugins More General | feature request,stale | by alex-jw-brooks (closed: 2026-02-02 10:17 (UTC+8))
#26166 [Performance]: Mistral Small 3.2 throughput v0.10.2 | performance,stale | by sspeekenbrink (closed: 2026-02-02 10:16 (UTC+8))
#33508 [Bug]: AttributeError: ‘Qwen3VLMoeConfig’ object has no attribute ‘vocab_size’ | bug | by study-overflow (closed: 2026-02-01 19:33 (UTC+8))
#33459 [Performance]: Torch symm AllReduce seems suboptimal on 8×B200 (fixed CTA count?) | performance | by rajagond (closed: 2026-02-01 19:07 (UTC+8))
#32753 [Feature]: [Safety] Add input validation and contract documentation to attention kernels | feature request | by red1239109-cmd (closed: 2026-02-01 17:45 (UTC+8))
#33412 [Bug]: vllm bench always injects Authorization header even when OPENAI_API_KEY is unset | bug | by daily-kim (closed: 2026-02-01 15:35 (UTC+8))
#33460 [v0.13.0] Required Transformers version mismatch | no labels | by gagank1 (closed: 2026-02-01 13:31 (UTC+8))

New PRs

#33536 [Misc] Remove deprecated profiler environment variables | no labels | by carlory (created: 2026-02-02 11:33 (UTC+8))
#33535 [Misc] Remove deprecated VLLM_ALL2ALL_BACKEND environment variable | ci/build | by carlory (created: 2026-02-02 11:17 (UTC+8))
#33530 [Nightly CI] Remove CT Model | no labels | by robertgshaw2-redhat (created: 2026-02-02 09:56 (UTC+8))
#33521 Fix mistral sliding window parsing | ready | by andylolu2 (created: 2026-02-02 04:40 (UTC+8))
#33529 Triton MLA GQA perf fixes (4x improvement at 80k context) | v1 | by koush (created: 2026-02-02 09:40 (UTC+8))
#33520 [Bugfix] Fix tool call streaming for gpt-oss/Harmony models | bug,frontend,gpt-oss | by alexbi29 (created: 2026-02-02 04:26 (UTC+8))
#33511 fix(ROCm): Make flash_attn import optional in MLA attention | rocm | by rabi (created: 2026-02-01 21:28 (UTC+8))
#33523 [Models] Step-3.5-Flash | documentation,new-model,ready | by csy0225 (created: 2026-02-02 06:59 (UTC+8))
#33524 [Fix] prefix cache hit rate == 0 bug with gpt-oss style models | bug,ready,v1,gpt-oss | by ivanium (created: 2026-02-02 07:13 (UTC+8))
#33531 Merge pull request #3 from janhq/sync-upstream | no labels | by nguyenhoangthuan99 (created: 2026-02-02 09:57 (UTC+8))
#33527 Adds padding and perf improvements to wvSplitK_fp8 | rocm | by amd-hhashemi (created: 2026-02-02 08:51 (UTC+8))
#33525 Update get_expert_mapping to include self parameter | no labels | by Otsutsukii (created: 2026-02-02 07:16 (UTC+8))
#33499 Scheduler: skip KV-blocked requests to prevent blocking (#31731) | v1 | by harsh543 (created: 2026-02-01 11:48 (UTC+8))
#33522 [Core][Scheduler] Fix FCFS queue ordering for skipped waiting requests | v1 | by harsh543 (created: 2026-02-02 04:53 (UTC+8))
#33518 [Bugfix] Fix NVFP4 MoE weight shapes for non-gated MLPs (Nemotron-Nano) | bug,nvidia | by Code4me2 (created: 2026-02-02 03:48 (UTC+8))
#33517 [Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support | nvidia | by Code4me2 (created: 2026-02-02 03:48 (UTC+8))
#33516 [Bugfix] Add SM110/SM120 device capability checks for NVFP4 MoE backends | bug,nvidia | by Code4me2 (created: 2026-02-02 03:48 (UTC+8))
#33501 Fix DeepSeek V2 RoPE initialization error | ready,deepseek | by catswe (created: 2026-02-01 14:49 (UTC+8))
#33514 [Bugfix] Fix gpt-oss chat format mismatch with HuggingFace | bug,frontend,gpt-oss | by thjung123 (created: 2026-02-02 02:41 (UTC+8))
#33510 Add MoE config for Super B200 TP2 | ready | by shaharmor98 (created: 2026-02-01 20:38 (UTC+8))
#33506 Support FI fused MoE non gated FP8 & NVFP4 | nvidia | by amitz-nv (created: 2026-02-01 16:58 (UTC+8))
#33513 Fix reasoning_tokens for text-based parsers in Responses API | frontend | by anencore94 (created: 2026-02-01 21:50 (UTC+8))
#33509 [FIX] guidance: use max(vocab_size, len(tokenizer)) for n_vocab | structured-output,v1 | by FredericOdermatt (created: 2026-02-01 20:10 (UTC+8))
#33502 [Redo] #33110 with threading limit | ready,v1 | by DarkLight1337 (created: 2026-02-01 15:21 (UTC+8))
#33505 [Feature]: Qwen3-Next dual-stream execution in_proj_qkvz in_proj_ba | qwen | by SouthWest7 (created: 2026-02-01 16:26 (UTC+8))
#33503 feat(spec_decode): fuse EAGLE step slot mapping and metadata updates | speculative-decoding,v1 | by sladyn98 (created: 2026-02-01 15:30 (UTC+8))
#33504 Add **kwargs parameter to v1 FlashAttentionImpl as catch-all | v1 | by haojin2 (created: 2026-02-01 15:55 (UTC+8))
#33498 [Experimental][Refactor] Refactor vision chunk modality processing for unification | documentation,needs-rebase,multi-modality | by Isotr0py (created: 2026-02-01 11:41 (UTC+8))
#33500 [Critical] Revert #33110 | v1 | by DarkLight1337 (created: 2026-02-01 13:02 (UTC+8))

Merged PRs

#33218 [Bugfix] GLM-4 tool parser: incremental string streaming | bug,ready | by QwertyJack (merged: 2026-02-02 11:13 (UTC+8))
#33530 [Nightly CI] Remove CT Model | no labels | by robertgshaw2-redhat (merged: 2026-02-02 11:09 (UTC+8))
#33523 [Models] Step-3.5-Flash | documentation,new-model,ready | by csy0225 (merged: 2026-02-02 10:21 (UTC+8))
#33524 [Fix] prefix cache hit rate == 0 bug with gpt-oss style models | bug,ready,v1,gpt-oss | by ivanium (merged: 2026-02-02 09:59 (UTC+8))
#32655 Add unpermute-aware fused MoE LoRA path | performance,ready | by RunkaiTao (merged: 2026-02-02 09:46 (UTC+8))
#33374 [ModelRunner V2] Support spec decode with structured outputs | ready,v1 | by njhill (merged: 2026-02-02 08:20 (UTC+8))
#33467 [ModelRunner V2] Misc minor simplifications and optimizations | ready,v1,nvidia | by njhill (merged: 2026-02-02 06:17 (UTC+8))
#33437 [Misc] skip target model mm emb in draft proposal step when draft is text-only | ready,v1 | by kkt-cohere (merged: 2026-02-02 05:13 (UTC+8))
#33501 Fix DeepSeek V2 RoPE initialization error | ready,deepseek | by catswe (merged: 2026-02-02 05:00 (UTC+8))
#33510 Add MoE config for Super B200 TP2 | ready | by shaharmor98 (merged: 2026-02-02 02:48 (UTC+8))
#33077 [BUGFIX] Fix hipErrorIllegalState in Qwen3-Omni during startup profiling allow inference Omni on ROCM | bug,rocm,ready,qwen | by JartX (merged: 2026-02-01 21:36 (UTC+8))
#33047 [W8A8 Block Linear Refactor][1/N] Keep all quantization types into QuantFP8 class. | ready,nvidia | by maralbahari (merged: 2026-02-01 17:28 (UTC+8))
#33502 [Redo] #33110 with threading limit | ready,v1 | by DarkLight1337 (merged: 2026-02-01 17:18 (UTC+8))
#33489 Change defaults for vllm bench startup | performance,ready | by ProExpertProg (merged: 2026-02-01 15:46 (UTC+8))
#33488 fix: only include Authorization header when OPENAI_API_KEY is set | performance,ready | by zack041 (merged: 2026-02-01 15:35 (UTC+8))
#33370 [Models]: lfm2_siglip2 return intermediate encoder layers | ready | by lalo (merged: 2026-02-01 14:17 (UTC+8))
#33500 [Critical] Revert #33110 | v1 | by DarkLight1337 (merged: 2026-02-01 13:06 (UTC+8))
#33481 [Bugfix] Fix inconsistent handling of cache reset | bug,documentation,performance,frontend,ready,v1 | by DarkLight1337 (merged: 2026-02-01 12:23 (UTC+8))
#33440 pin LMCache to v0.3.9 or greater with vLLM v0.15.0 | ready,ci/build,kv-connector | by Gregory-Pereira (merged: 2026-02-01 11:50 (UTC+8))

PRs Closed Without Merging

#26151 [Model] Pass param prefix to VocabParallelEmbedding in hunyuan_v1 | stale | by Anionex (closed: 2026-02-02 10:17 (UTC+8))
#33531 Merge pull request #3 from janhq/sync-upstream | no labels | by nguyenhoangthuan99 (closed: 2026-02-02 09:57 (UTC+8))
#33344 Fix: Validate placeholder tokens don’t exceed batch length in chunked… | v1 | by GOavi101 (closed: 2026-02-01 20:34 (UTC+8))