vLLM Development Activity Report - 2025-12-14

Time window: 2025-12-14 10:54 (UTC+8) ~ 2025-12-15 10:54 (UTC+8)
Statistics: New Issues 13 | Closed Issues 13 | New PRs 19 | Merged PRs 25 | Closed Unmerged PRs 8

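📊 Daily Development Summary

During this 24-hour cycle, vLLM development was very active: 25 PRs were merged, fixing a number of model-specific bugs (especially in the GLM and Qwen families) and delivering several performance optimizations and code-quality improvements. Community attention centered on multimodal model support, attention backend selection, quantization, and scheduler correctness. The AMD ecosystem saw continued code contributions and CI maintenance.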

🎯 AMD/ROCm Ecosystem Updates

This cycle had clearly AMD-related activity, mainly in code contributions and CI maintenance:

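PR #30364 ([Bugfix] awq_gemm: fix argument order swap):
- Contributor: mgehre-amd (AMD employee)
- Content: Fixes an inconsistency between the declared order and the passed order of the scales and zeros arguments in the AWQ GEMM Triton kernel. Because the arguments were swapped twice, the two swaps cancelled out and caused no functional error, but the change makes the code clearer and easier to read.
- Test platform: The relevant tests passed on gfx1151 (AMD Strix Halo) (pytest tests/kernels/quantization/test_awq_triton.py tests/models/quantization/test_awq.py).
- Analysis: A typical code-quality PR that reflects an AMD engineer's familiarity with, and continued maintenance of, the vLLM codebase. The fix improves the clarity of the AWQ quantization code path exercised on AMD platforms.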
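The "double swap" described above can be illustrated with a minimal sketch. The function names and the scalar dequantization formula are hypothetical and are not taken from the vLLM kernel; the sketch only shows how a declaration in one order and a positional call in the other order can cancel out and still return correct results.

```python
def kernel_before(q, zeros, scales):
    # Misleading names (hypothetical): the launcher passes the scale first,
    # so `zeros` really holds the scale and `scales` holds the zero point.
    return (q - scales) * zeros


def launch_before(q, scales, zeros):
    # Positional call in (scales, zeros) order, opposite to the declaration.
    return kernel_before(q, scales, zeros)


def kernel_after(q, scales, zeros):
    # Declaration, call site, and body now agree.
    return (q - zeros) * scales


def launch_after(q, scales, zeros):
    return kernel_after(q, scales, zeros)


q, scale, zero = 11, 0.5, 8
# Both versions compute (q - zero) * scale, so behavior is unchanged.
assert launch_before(q, scale, zero) == launch_after(q, scale, zero) == 1.5
```

The risk with the "before" form is latent: anyone calling kernel_before with keyword arguments that match its declared names would silently get a wrong result.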
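PR #30590 ([ROCm][CI] Add “Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test” Back Into AMD CI):
- Content: The Qwen3-Next MTP async EPLB accuracy test, previously removed because it was unstable in NVIDIA CI, was added back to run in AMD CI only, where it runs reliably.
- Analysis: This suggests that the AMD CI pipeline is relatively independent of the main CI pipeline, and that the AMD team actively maintains a dedicated test set for its hardware to keep specific features (such as MTP async EPLB) stable on ROCm.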
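Closed Issue #22644 ([QUESTION]: any plans to add support for amd strix halo (gfx1151)? and eta?):
- Status: Automatically closed after being marked "stale" due to prolonged inactivity.
- Analysis: The issue reflects community interest in support for new AMD hardware such as Strix Halo. Although it was closed, PR #30364 was tested on that very architecture (gfx1151), which indirectly suggests that related work is ongoing, even without a consolidated roadmap announcement.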

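Summary: AMD activity this cycle consisted mainly of code-quality fixes and CI pipeline maintenance, showing continued, steady engineering investment from the AMD team. Keywords such as the Quark quantization tool or MI300 did not appear in this cycle's data.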

💬 Analysis of High-Activity Discussions

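Issue #30654 ([Feature][Attention][UX]: Incorporate Features into Attention Selection):
- Core topic: How to make the attention backend selector account for hardware priority while also automatically detecting and matching specific feature requirements (such as FP8 KV cache and attention sinks).
- Positions and points of disagreement:
  - Author (robertgshaw2-redhat): Points out that the current logic ignores feature support, which causes test failures (for example, CUTLASS_MLA does not support FP8 KV cache yet is preferred).
  - Contributor (yifant-code): Volunteers to work on it, proposing a design that adds a capability descriptor to each backend and updates the selection logic.
  - Maintainer (mgoin): Clarifies that the existing code already has entries for managing these features (such as KV cache support); the problem may be that one backend (CUTLASS_MLA) incorrectly declares support for a feature it does not support.
- Current status: The discussion cleared up some misunderstandings; the root cause more likely lies in whether the backends' capability declarations are accurate. The contributor has been invited to design a solution together with core developers.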
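The capability-descriptor idea raised in this thread can be sketched as follows. This is a minimal illustration under stated assumptions, not vLLM's selector: BackendSpec, select_backend, the backend names, and the feature strings are all hypothetical. The point is only that filtering by declared features before applying hardware priority avoids picking a preferred backend that lacks a required feature.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BackendSpec:
    name: str                      # hypothetical backend name
    priority: int                  # lower value = preferred on this hardware
    features: frozenset = field(default_factory=frozenset)


def select_backend(backends, required_features):
    """Return the highest-priority backend that declares every required feature."""
    required = frozenset(required_features)
    candidates = [b for b in backends if required <= b.features]
    if not candidates:
        raise ValueError(f"no backend supports {sorted(required)}")
    return min(candidates, key=lambda b: b.priority)


# Hypothetical registry: the preferred backend lacks FP8 KV cache support.
BACKENDS = [
    BackendSpec("FAST_BACKEND", priority=0, features=frozenset({"mla"})),
    BackendSpec("FLEXIBLE_BACKEND", priority=1,
                features=frozenset({"mla", "fp8_kv_cache"})),
]

print(select_backend(BACKENDS, {"mla"}).name)
print(select_backend(BACKENDS, {"mla", "fp8_kv_cache"}).name)
```

A scheme like this is only as reliable as the declarations themselves, which matches the maintainer's point that an inaccurate capability declaration would still lead to a wrong choice.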
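Issue #30637 ([Bug]: CUDA OOM during sampler warmup with DeepSeek-V3.2 (DeepGEMM) on vLLM Nightly (V1 Engine)):
- Core topic: A user hit a CUDA OOM when running DeepSeek-V3.2 with DeepGEMM on H20 GPUs.
- Positions and exchange:
  - Reporter (zkf331): Provided extremely detailed environment information and installation steps.
  - Another user (mondaylord): Ran into similar installation difficulties and asked for a detailed guide.
  - Reporter's reply: Provided reproducible installation steps for vLLM Nightly and DeepGEMM, which now serve as useful troubleshooting documentation for this issue.
  - Suggestion (jeejeelee): Try removing the --gpu-memory-utilization 0.85 argument.
- Current status: Unresolved, but the discussion produced valuable, concrete environment setup and debugging information.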
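Closed Issue #11345 ([Performance]: 1P1D Disaggregation performance):
- Core topic: A long-standing discussion of performance problems with Prefill/Decode (PD) disaggregation, involving unsatisfactory TTFT (time to first token) and ITL (inter-token latency).
- Positions and points of disagreement:
  - Performance skeptics: Benchmarks showed that the disaggregated setup's ITL was even higher than with conventional chunked prefill, contrary to theoretical expectations.
  - Feature maintainer (KuntaiDu): Explained that the main goal of disaggregation is to decouple the tuning of TTFT and ITL, not to raise throughput; also noted that the version at the time retried failed KV cache transfers, which raises ITL.
  - In-depth technical discussion: A contributor suspected that asynchronous NCCL transfers might conflict unexpectedly with the compute stream, causing data corruption and degraded performance, and supported this by showing that setting CUDA_LAUNCH_BLOCKING=1 worked around the problem temporarily.
- Final outcome: The issue was closed after a long period without updates. The discussion exposed the practical complexity of disaggregated architectures in distributed inference, including transfer reliability and stream synchronization.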
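Issue #30659 ([Bug]: scheduler issue with PD disaggregation):
- Core topic: A user identified, at a theoretical level, a potential deadlock in the PD disaggregation scheduling policy: requests in the WAITING_FOR_REMOTE_KVS state hold block resources, which may prevent later requests from ever being scheduled.
- Positions:
  - Reporter (David9857): Used a flowchart to lay out the theoretical case that could lead to a scheduling stall.
  - Responder (robertgshaw2-redhat): Acknowledged the possibility and suggested a potential fix: freeing the blocks held by requests in the waiting queue.
- Current status: Open. It raises a theoretical flaw in scheduler robustness that core developers need to evaluate.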
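The scenario above can be reproduced in a toy model. This is a deliberately simplified sketch and not the vLLM scheduler: ToyScheduler, Request, and the free_on_remote_wait flag are hypothetical names, and only the WAITING_FOR_REMOTE_KVS state name comes from the issue. It shows blocks being reserved for requests that are waiting on remote KV data, and how reclaiming those blocks (the direction suggested in the thread) lets a later request proceed.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    blocks_needed: int
    state: str = "WAITING"
    blocks_held: int = 0


class ToyScheduler:
    def __init__(self, total_blocks, free_on_remote_wait=False):
        self.free_blocks = total_blocks
        self.free_on_remote_wait = free_on_remote_wait  # hypothetical policy flag

    def admit_remote(self, req):
        """Reserve blocks for a request whose KV data is still in flight."""
        if req.blocks_needed > self.free_blocks:
            return False
        self.free_blocks -= req.blocks_needed
        req.blocks_held = req.blocks_needed
        req.state = "WAITING_FOR_REMOTE_KVS"
        return True

    def schedule(self, req, remote_waiters):
        """Try to run `req`, optionally reclaiming blocks from remote waiters."""
        if self.free_on_remote_wait:
            for waiter in remote_waiters:
                if req.blocks_needed <= self.free_blocks:
                    break
                if waiter.state == "WAITING_FOR_REMOTE_KVS":
                    self.free_blocks += waiter.blocks_held
                    waiter.blocks_held = 0
                    waiter.state = "WAITING"  # must re-reserve blocks later
        if req.blocks_needed > self.free_blocks:
            return False
        self.free_blocks -= req.blocks_needed
        req.blocks_held = req.blocks_needed
        req.state = "RUNNING"
        return True


def demo(free_on_remote_wait):
    sched = ToyScheduler(total_blocks=4, free_on_remote_wait=free_on_remote_wait)
    remote = Request("remote", blocks_needed=4)
    sched.admit_remote(remote)          # holds every block while waiting
    local = Request("local", blocks_needed=2)
    return sched.schedule(local, [remote])


print(demo(free_on_remote_wait=False))  # later request cannot be scheduled
print(demo(free_on_remote_wait=True))   # reclaiming blocks unblocks it
```

Whether such a stall occurs in practice depends on whether the remote transfers eventually complete, which is the part core developers would need to evaluate.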
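PR #30635 (Cast inputs_embeds to model dtype if necessary):
- Core topic: Fixes a possible mismatch in the V1 engine between the inputs_embeds buffer and the model dtype, which causes RuntimeError: expected mat1 and mat2 to have the same dtype.
- Positions and analysis:
  - Author (wasertech): Proposed a direct fix that casts the tensor to the model dtype.
  - Maintainer (DarkLight1337): Questioned the root cause, arguing that the buffer should already be initialized with the correct dtype, and suggested adding logging to find when and where the dtype changes.
  - Author's follow-up investigation: Speculated that the problem may arise during model loading or a config update, where self.dtype changes after the buffer is created and the buffer is not updated to match.
- Final outcome: The PR was closed. Rather than settling for a surface fix, the discussion prompted a closer look at how configuration and buffer state are kept in sync during V1 engine initialization.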

🔥 Hot Topics and Trends

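Surge of model-specific bugs: Many of the issues active this cycle are bug reports against specific models, such as the GLM-4.1V video frame count, GLM-4.6V processor compatibility with Transformers v5, duplicated output in batched Qwen3-Next-MTP inference, and the DeepSeek-V3.2 OOM. This indicates that as vLLM rapidly expands its model coverage, integration testing against each architecture's details remains a continuing challenge.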
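Performance optimization and quantization work:
- Attention backends: Issue #30654 and PR #30653 (a revert) together highlight the continuing need for more automated, feature-aware attention backend selection.
- CUDA graphs: PR #30657 skips the piecewise CUDA graph warning when full CUDA graph mode is used, a user-experience improvement.
- MoE performance: PR #30647 aims to eliminate the padding and slicing operations before and after the MoE layer for GPT-OSS models, a low-level performance tuning done in conjunction with FlashInfer.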
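Build, CI, and infrastructure maintenance: Several PRs cover a Dockerfile fix (#30662), dependency conflict resolution (#30646), and automatic rebasing of outdated PRs in CI (#30643), reflecting the project's attention to infrastructure stability during fast-paced development.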

🛠️ Key Technical Changes

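PR #30653 (Revert “[Fix]Load kv-cache dtype from hf_quant_config.json automatically”):
- Technical summary: An urgent revert of an earlier feature that automatically loaded the KV cache dtype from the Hugging Face quantization config, because it caused CI failures (the CUTLASS_MLA backend does not support FP8 KV cache but was selected).
- Impact: A typical case of a fix introducing a regression, underlining the importance of integration testing between new features (automatic config loading) and existing components (backend capability declarations). It directly led to Issue #30654 on improving the attention backend selection logic.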
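PR #29013 (CPU KV Offloading: Use more CUDA streams):
- Technical summary: Changes CPU KV offloading from 2 global CUDA streams to a separate stream created for each transfer, to avoid blocking when the stream queues fill up under highly concurrent load.
- Impact: Expected to improve the throughput and responsiveness of CPU KV offloading under high concurrency, which benefits stability when running large models in memory-constrained environments.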
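PR #30632 ([main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring):
- Technical summary: Fixes a bug where the Qwen3-Next-MTP model produced large amounts of redundant output during batched inference. The root cause lies in the speculative decoding (MTP) logic, where a state variable was incorrectly reused during batching.
- Impact: A quick fix for a serious bug affecting a core capability (batched inference) of a specific model, showing the community's ability to respond quickly to model-support problems.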

📈 Development Activity Observations

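High merge throughput: 25 PRs were merged within 24 hours, indicating an efficient review and merge process.
Contributor diversity: Beyond core maintainers (such as DarkLight1337 and robertgshaw2-redhat), developers from other ecosystems, including AMD (mgehre-amd) and Huawei Cloud (drslark, a vllm-ascend contributor), submitted important fixes.
Discussion quality: Issue threads go beyond reporting problems and often include in-depth technical analysis (such as the theoretical reasoning in #30659 and the root-cause investigation in #30635); the community leans toward deep technical inquiry.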

💡 Issues Worth Watching

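Smarter attention backend selection (Issue #30654): How to systematically manage each backend's capability matrix and design robust automatic fallback logic is a key architectural question for improving vLLM's user experience across hardware and configurations.
Theoretical flaw in the scheduling algorithm (Issue #30659): Core developers need to assess how likely the potential deadlock in the PD disaggregation scheduler is under real workloads, and decide whether the scheduling policy should change to rule it out.
CUDA version compatibility and installation experience (Issue #30633): A user reports that the prebuilt vLLM wheels' hard requirement of CUDA 12.9+ makes installation difficult on systems with older drivers. This touches on the project's long-term trade-off between supporting the latest features and staying user-friendly.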

📋 Appendix: Detailed Data Lists

New Issues

#30656 [Bug]: EngineCore crashes testing v1/embeddings — bug — by hungrymonkey (created: 2025-12-15 04:57 (UTC+8))
#30661 [Bug]: draft_model_config is None when Suffix Speculative Decoding is enabled — bug — by fluctlux (created: 2025-12-15 10:35 (UTC+8))
#30659 [Bug]: scheduler issue with PD disaggregation — bug — by David9857 (created: 2025-12-15 10:00 (UTC+8))
#30637 [Bug]: CUDA OOM during sampler warmup with DeepSeek-V3.2 (DeepGEMM) on vLLM Nightly (V1 Engine) — bug — by zkf331 (created: 2025-12-14 17:28 (UTC+8))
#30660 [Bug]: [Elastic EP]: AssertionError in replicate_experts during DP4 → DP2 scale-down — bug — by 1542305589 (created: 2025-12-15 10:10 (UTC+8))
#30654 [Feature][Attention][UX]: Incorporate Features into Attention Selection — help wanted,good first issue,feature request — by robertgshaw2-redhat (created: 2025-12-15 02:04 (UTC+8))
#30655 [CI Failure]: VLLM with DP performing worst — ci-failure — by akasshdeep (created: 2025-12-15 03:41 (UTC+8))
#30651 [Bug]: mistral_tool_parser.py has a “json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)” problem — bug — by davrot (created: 2025-12-15 00:31 (UTC+8))
#30638 [Bug]: GLM-4.1V dummy video input generation fails with ValueError (num_frames=1 < temporal_factor=2) — bug — by Silas-11 (created: 2025-12-14 17:45 (UTC+8))
#30644 [Feature]: FLASHMLA_SPARSE (SM90/SM100 only) fallback to TILELANG reference kernel (supports ALL) — feature request — by fernandaspets (created: 2025-12-14 19:30 (UTC+8))
#30641 [Bug]: unsloth DeepSeek-V3.1-Terminus-UD-Q3_K_XL gguf run error — bug — by shanying2017 (created: 2025-12-14 18:32 (UTC+8))
#30631 [Bug]: GPU has redundant outputs when running Qwen3-Next-MTP in batched inferring. — bug — by drslark (created: 2025-12-14 11:12 (UTC+8))
#30633 [Installation]: How to install vLLM 0.11.0 with CUDA < 12.9 (Driver 535)? No matching wheels found — installation — by whu125 (created: 2025-12-14 12:29 (UTC+8))

Closed Issues

#30661 [Bug]: draft_model_config is None when Suffix Speculative Decoding is enabled — bug — by fluctlux (closed: 2025-12-15 10:36 (UTC+8))
#11345 [Performance]: 1P1D Disaggregation performance — performance,stale — by Jeffwan (closed: 2025-12-15 10:13 (UTC+8))
#13131 [Bug]: Memory usage calculations are unstable on systems with unified memory (Jetson, GH200, etc.) — bug,stale — by hacker1024 (closed: 2025-12-15 10:13 (UTC+8))
#17801 [Bug]: After converting InternVL3-8B to the Hugging Face (HF) format, vLLM fails to launch and throws the error: ValueError: ‘limit_mm_per_prompt’ is only supported for multimodal models. — bug,stale — by FloSophorae (closed: 2025-12-15 10:13 (UTC+8))
#22619 [Usage]: how to tune a dense fp8_w8a8 model — usage,stale — by aabbccddwasd (closed: 2025-12-15 10:13 (UTC+8))
#22644 [QUESTION]: any plans to add support for amd strix halo (gfx1151)? and eta? — feature request,stale — by bugbuster-dev (closed: 2025-12-15 10:13 (UTC+8))
#30052 [Bug]: An error will occur when using a quantization configuration with modules_to_not_convert for the DeepSeek-V3.2 model on non-NVIDIA GPUs — bug — by tom-zju (closed: 2025-12-15 09:22 (UTC+8))
#28072 [Bug]: WeightsMapper used in QuantizationConfig does not handle regex/wildcard paths — bug,quantization,nvidia — by shengliangxu (closed: 2025-12-14 18:18 (UTC+8))
#30651 [Bug]: mistral_tool_parser.py has a “json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)” problem — bug — by davrot (closed: 2025-12-15 02:52 (UTC+8))
#30465 [Bug]: reranker benchmarking calculate input_token=0 — bug — by hz0ne (closed: 2025-12-14 22:57 (UTC+8))
#30250 [Bug]: Qwen2.5-VL-72B-Instruct W8A8 Abnormal reasoning results — bug — by Haoyanlong (closed: 2025-12-14 21:20 (UTC+8))
#30584 [Bug]: GLM-4.6V-Flash + transformers v5: vLLM passes MediaWithBytes to HF image processor, causing 400 errors for multimodal requests — bug — by dbotwinick (closed: 2025-12-14 21:19 (UTC+8))
#30631 [Bug]: GPU has redundant outputs when running Qwen3-Next-MTP in batched inferring. — bug — by drslark (closed: 2025-12-14 17:32 (UTC+8))

New PRs

#30657 [Log] Skip piecewise cudagraph warn when using full cudagraph — ready,nvidia — by BoyuanFeng (created: 2025-12-15 05:19 (UTC+8))
#30662 [XPU] fix Dockerfile.xpu, avoid wheel conflicts — ready,ci/build — by jikunshang (created: 2025-12-15 10:47 (UTC+8))
#30652 Strengthen input validation and tests for ‘parse_raw_prompts’. — ready — by mivehk (created: 2025-12-15 00:53 (UTC+8))
#30658 [Bugfix] Fix deepseek_v32 tokenizer_mode — structured-output,frontend,ready,v1,deepseek — by jeejeelee (created: 2025-12-15 09:51 (UTC+8))
#30648 [Bugfix] Drop empty tool_calls lists to keep assistant replies in chat template — frontend,ready — by seokhyunan (created: 2025-12-14 22:02 (UTC+8))
#30649 additional protection for CVE-2025-62164 — frontend,ready,multi-modality — by wenqiglantz (created: 2025-12-14 22:24 (UTC+8))
#30636 [Bugfix] Fix RequestOutput miss lora_request — ready,v1,llama,gpt-oss — by jeejeelee (created: 2025-12-14 16:13 (UTC+8))
#30653 Revert “[Fix]Load kv-cache dtype from hf_quant_config.json automatically” — ready,ci-failure — by robertgshaw2-redhat (created: 2025-12-15 01:48 (UTC+8))
#30647 [Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE — ci/build,gpt-oss — by elvischenv (created: 2025-12-14 21:40 (UTC+8))
#30650 [Bugfix] CustomAR + TritonAttn[AMPERE] + FULL_CG - gpt-oss — gpt-oss,nvidia — by bbrowning (created: 2025-12-14 23:12 (UTC+8))
#30635 Cast inputs_embeds to model dtype if necessary — v1 — by wasertech (created: 2025-12-14 15:22 (UTC+8))
#30645 fix: fix engine initialization fails with ValueError — no labels — by leejianwoo-collab (created: 2025-12-14 20:29 (UTC+8))
#30646 fix: unsatisfiable testing dependencies caused by a version conflict — ci/build — by leejianwoo-collab (created: 2025-12-14 20:46 (UTC+8))
#30643 Auto-rebase PRs older than 40 commits compared to main — ci/build — by khluu (created: 2025-12-14 18:47 (UTC+8))
#30640 [dnm] test mergify rebase — ci/build — by khluu (created: 2025-12-14 18:19 (UTC+8))
#30642 Update mergify.yml — ci/build — by khluu (created: 2025-12-14 18:40 (UTC+8))
#30639 [DNM] Test PR for 8xH200 test — needs-rebase,ci/build — by khluu (created: 2025-12-14 18:16 (UTC+8))
#30632 [main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring — ready,v1,qwen — by drslark (created: 2025-12-14 11:15 (UTC+8))
#30634 Update docs README.md to add NVFP4 quantization support — documentation — by omrialmog (created: 2025-12-14 13:52 (UTC+8))

Merged PRs

#30657 [Log] Skip piecewise cudagraph warn when using full cudagraph — ready,nvidia — by BoyuanFeng (merged: 2025-12-15 10:49 (UTC+8))
#30590 [ROCm][CI] Add “Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test” Back Into AMD CI — rocm,ready,ci/build,qwen — by micah-wil (merged: 2025-12-14 14:56 (UTC+8))
#30653 Revert “[Fix]Load kv-cache dtype from hf_quant_config.json automatically” — ready,ci-failure — by robertgshaw2-redhat (merged: 2025-12-15 08:33 (UTC+8))
#29013 CPU KV Offloading: Use more CUDA streams — ready,v1,nvidia — by orozery (merged: 2025-12-15 07:50 (UTC+8))
#30532 [responsesAPI]add extra body parameters — frontend,ready — by Ri0S (merged: 2025-12-15 03:25 (UTC+8))
#30512 Improve parse_raw_prompt test cases for invalid input .v2 — ready — by mivehk (merged: 2025-12-14 11:18 (UTC+8))
#30420 [NIXL][BUG FIX] Fix a bug for PD with host_buffer after merging 29665 — ready,v1,kv-connector — by xuechendi (merged: 2025-12-14 23:38 (UTC+8))
#30118 [Model][Quantization] Override HF defaults to GGUF ones (incl. Qwen3 MoE) — ready,qwen — by a4lg (merged: 2025-12-14 23:01 (UTC+8))
#30596 [Bugfix][benchmarks] Fix input token calculation for rerank benchmark metrics — performance,frontend,ready — by Flink-ddd (merged: 2025-12-14 22:57 (UTC+8))
#29752 [Feature]Add EVS (Efficient Video Sampling) Support for Qwen3-VL — ready,qwen — by skyloevil (merged: 2025-12-14 21:24 (UTC+8))
#30213 [Bugfix] Improve error messages in ModelConfig validation — ready — by yifant-code (merged: 2025-12-14 21:23 (UTC+8))
#30244 [Bugfix] Fix fusion for VL models — ready — by ElizaWszola (merged: 2025-12-14 21:22 (UTC+8))
#30542 [Bugfix] Revert Qwen2-VL part of change in #28271 — ready,qwen — by zifeitong (merged: 2025-12-14 21:20 (UTC+8))
#30583 [Bugfix] Update get_processor_data to use get_all method — ready,multi-modality — by dbotwinick (merged: 2025-12-14 21:19 (UTC+8))
#29919 [Misc] Add a script to benchmark compilation time — performance,frontend,ready — by desertfire (merged: 2025-12-14 21:02 (UTC+8))
#30057 [Bugfix] fix _get_quant_method of FusedMoE for deepseekV3.2 on non-NV… — ready,deepseek — by tom-zju (merged: 2025-12-14 18:20 (UTC+8))
#30164 Nvidia ModelOpt workaround for issue 28072 — quantization,ready,nvidia — by shengliangxu (merged: 2025-12-14 18:18 (UTC+8))
#30390 fix: Update json features supported by xGrammar — structured-output,ready,v1 — by johannesflommersfeld (merged: 2025-12-14 18:16 (UTC+8))
#30364 [Bugfix] awq_gemm: fix argument order swap — ready — by mgehre-amd (merged: 2025-12-14 18:15 (UTC+8))
#30489 [torch.compile] Add encoder tag for compilation — ready,qwen — by ilmarkov (merged: 2025-12-14 18:15 (UTC+8))
#30539 Add AudioFlamingo3 model support — documentation,new-model,ready,multi-modality — by lashahub (merged: 2025-12-14 18:14 (UTC+8))
#30540 [Doc]: fixing typos in various files — documentation,frontend,ready,nvidia — by didier-durand (merged: 2025-12-14 18:14 (UTC+8))
#30632 [main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring — ready,v1,qwen — by drslark (merged: 2025-12-14 17:32 (UTC+8))
#30612 [Chore] Remove redundant RequestPrompt — frontend,ready — by DarkLight1337 (merged: 2025-12-14 17:22 (UTC+8))
#30462 enable unbacked with aot_compile — ready — by laithsakka (merged: 2025-12-14 16:14 (UTC+8))

Closed Unmerged PRs

#17737 [Fix] Auto-detect XGrammar compiler threads based on CPU cores. — documentation,structured-output,needs-rebase,stale,v1 — by Ubospica (closed: 2025-12-15 10:13 (UTC+8))
#17851 [Frontend]: always try to load lora adapter from reslover — frontend,needs-rebase,stale — by CormickKneey (closed: 2025-12-15 10:13 (UTC+8))
#22640 [Misc] add more attention backends in tests/kernel and remove unused … — stale — by wz202020 (closed: 2025-12-15 10:13 (UTC+8))
#30565 [Bug] add sm100f check for trtllm-gen flashinfer attention and moe — v1,nvidia — by IwakuraRein (closed: 2025-12-15 06:03 (UTC+8))
#30635 Cast inputs_embeds to model dtype if necessary — v1 — by wasertech (closed: 2025-12-14 23:03 (UTC+8))
#30640 [dnm] test mergify rebase — ci/build — by khluu (closed: 2025-12-14 18:42 (UTC+8))
#30642 Update mergify.yml — ci/build — by khluu (closed: 2025-12-14 18:40 (UTC+8))
#30639 [DNM] Test PR for 8xH200 test — needs-rebase,ci/build — by khluu (closed: 2025-12-14 18:17 (UTC+8))