vLLM Development Activity Report - 2025-12-21

Time window: 2025-12-21 10:55 (UTC+8) ~ 2025-12-22 10:55 (UTC+8)
Statistics: New issues 3 | Closed issues 16 | New PRs 17 | Merged PRs 7 | PRs closed without merging 15

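📊 Daily Development Status Summary

During this period (December 21 to 22), vLLM development was active: 7 PRs were merged and 17 were opened, and 16 stale or already-resolved issues were closed, which shows the project doing maintenance cleanup alongside rapid iteration. Work centered on model support (multimodal, MoE), quantization (FP8, NVFP4), and the ongoing MoE kernel refactor. Community discussion was frequent, particularly around multimodal model deployment and quantization accuracy and performance.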

🎯 AMD/ROCm Ecosystem Updates

This period saw concrete AMD-related changes and discussion, mainly about fixing FP8 quantization compatibility on ROCm.

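New PR - #31106: [Bugfix][Hardware][AMD] Consolidate FP8 min/max values into helper function
- Contributor: c0de128
- Technical details: This PR addresses an FP8 quantization accuracy problem on ROCm. With the torch.float8_e4m3fnuz dtype on ROCm, PyTorch's default finfo.max (240.0) causes accuracy problems in dynamic quantization; the maximum to use is 224.0. The PR adds a single helper, get_fp8_min_max(dtype), that centralizes the FP8 min/max logic, and updates every place that duplicated it (including input_quant_fp8.py, fp8_utils.py, and deep_gemm.py).
- Impact: An important fix for FP8 quantization on AMD hardware. It keeps FP8 computation accurate on ROCm and makes vLLM more stable and reliable on AMD GPUs.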
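Closed Issue - #22458: [Feature]: Please add support for Amd Instinct MI50
- Closed automatically after more than 90 days without activity. In the earlier discussion, the community noted that mainline vLLM has no plans to support the MI50 (gfx906) architecture and pointed users to the community-maintained fork nlzy/vllm-gfx906. This reflects limited official support for older AMD GPU architectures.
Closed Issue - #22590: [Bug]: ROCm build falls back to default arch despite ARG_PYTORCH_ROCM_ARCH set in Dockerfile.rocm
- Also closed for inactivity. It reported that the GPU_TARGETS environment variable was not passed through correctly during the Docker build, which affected custom builds of the ROCm image. Although the issue is closed, it shows that configuring vLLM's ROCm Docker build is complicated.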

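Summary: The main AMD progress this period is the FP8 quantization fix (PR #31106), part of the ongoing effort to keep quantization consistent across platforms, ROCm in particular. Requests to support older hardware are redirected to community forks.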
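
As an illustration of the idea behind PR #31106, here is a minimal sketch of a centralized min/max lookup. It is not the code from the PR: the real helper takes a torch dtype, whereas this version takes a dtype name and an explicit is_rocm flag (both assumptions) so that it runs without PyTorch installed. The table holds the finite maxima that torch.finfo reports for each FP8 dtype.

```python
# Hypothetical sketch; not the implementation from PR #31106.
# Finite maxima as reported by torch.finfo for each FP8 dtype.
_FINFO_MAX = {
    "float8_e4m3fn": 448.0,
    "float8_e4m3fnuz": 240.0,
    "float8_e5m2": 57344.0,
}

# Value the report says should be used for e4m3fnuz on ROCm.
_ROCM_FNUZ_MAX = 224.0


def get_fp8_min_max(dtype_name: str, is_rocm: bool) -> tuple[float, float]:
    """Return the (min, max) clamp range for an FP8 dtype name."""
    if dtype_name not in _FINFO_MAX:
        raise ValueError(f"not an FP8 dtype: {dtype_name}")
    fp8_max = _FINFO_MAX[dtype_name]
    # Override only the one combination described in the report.
    if is_rocm and dtype_name == "float8_e4m3fnuz":
        fp8_max = _ROCM_FNUZ_MAX
    return -fp8_max, fp8_max
```

Keeping the override in one function means call sites never hard-code 240.0 or 224.0 themselves, which is the duplication the PR description says it removes.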

💬 High-Activity Discussion Analysis

Among the issues closed this period, two threads had seen a lot of discussion in the past.

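Issue #23338: [Usage]: When infer a large multimodal model, the program hangs and becomes unresponsive.
- Core topic: When serving a multimodal model (such as MiniCPM-o-2_6) with video input, the vLLM service stops responding and GPU utilization drops to 0.
- Positions:
  - Reporter (whk6688): Provided detailed reproduction steps, found that Qwen2.5-Omni works, and suspected a model-specific problem.
  - Maintainer (DarkLight1337): Initially guessed that max_model_len was set too low and suggested raising it.
  - Contributor (jio-H): Reported the same problem and offered a workaround, adding the max_num_batched_tokens parameter, citing a known similar issue (#18329).
  - Maintainer (DarkLight1337): Added that #18329 should already have been fixed by another PR (#21798) and asked whether the problem still occurs on the latest version.
- Point of contention: No direct dispute; the thread is mostly troubleshooting and an exchange of workarounds. The underlying problem is a scheduling or memory-allocation anomaly triggered by multimodal input, video in particular.
- Current status: Closed automatically after more than 90 days without progress. The problem still illustrates how complex the multimodal inference path is, and the risk may remain.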
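
The two settings discussed in this thread map to the `--max-model-len` and `--max-num-batched-tokens` flags of `vllm serve`. A minimal sketch of assembling such a command follows; the model identifier and both numeric values are placeholders chosen for illustration, not values taken from the issue.

```python
import shlex


def build_serve_command(model: str, max_model_len: int,
                        max_num_batched_tokens: int) -> list[str]:
    """Assemble a `vllm serve` command line with both limits set explicitly."""
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]


# Placeholder model name and limits; tune them for the actual deployment.
cmd = build_serve_command("MiniCPM-o-2_6", 8192, 8192)
print(shlex.join(cmd))
```

Whether these flags resolve the hang depends on the vLLM version; the maintainer's comment suggests the related bug may already be fixed upstream.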
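Issue #20177: [Bug]: RuntimeError: TopPSamplingFromProbs failed with error code an illegal memory access was encountered
- Core topic: Under high concurrency (160 threads), or with too many threads on a single GPU, users hit a CUDA illegal memory access error that points to the FlashInfer sampling kernel.
- Positions:
  - Reporters (Lucas-Luoling and others): Provided detailed error output and environment information, and noted that lowering concurrency or disabling FlashInfer avoids the problem.
  - Other users (magikRUKKOLA, hnt2601): Confirmed the same problem, also suspected a FlashInfer bug, and suggested investigating with the compute-sanitizer tool.
  - Emerging workaround: The discussion settled on a temporary workaround, disabling the FlashInfer attention backend.
- Point of contention: None. This is a typical low-level kernel bug report in which the community helps itself by sharing experience and workarounds.
- Current status: Closed as stale. Stability problems in low-level kernels like this one are usually fixed by updating the FlashInfer submodule or vLLM itself.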
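
For readers unfamiliar with what a top-p (nucleus) sampling kernel computes, the following pure-Python reference shows the operation on one probability vector. It is a sketch of the general technique only, not FlashInfer's implementation, and it does nothing to reproduce the reported memory fault.

```python
import random


def top_p_sample(probs: list[float], top_p: float, rng: random.Random) -> int:
    """Sample a token id from the smallest high-probability set whose mass reaches top_p."""
    # Visit token ids from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Draw proportionally to the kept probabilities (implicit renormalization).
    threshold = rng.random() * cumulative
    running = 0.0
    for i in kept:
        running += probs[i]
        if running >= threshold:
            return i
    return kept[-1]  # Guard against floating-point rounding.


rng = random.Random(0)
samples = {top_p_sample([0.5, 0.3, 0.15, 0.05], 0.7, rng) for _ in range(200)}
```

With these inputs only token ids 0 and 1 can be drawn, because their combined mass (0.8) is the first prefix to reach 0.7.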

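🔥 Hot Topics and Trends

Multimodal model support and problems: New issues (#31091, #31096) and closed high-activity issues (#23338, #23423) largely involve multimodal models (CLIP, SigLIP, Qwen-VL, MiniCPM-o). The problems include incorrect chat templates, missing function-calling support, video input that hangs the service, and abnormal output. As multimodal models spread, vLLM faces challenges in adaptation, stability, and feature completeness in this area, which is currently a focus for the community.
Deeper quantization work: Activity here was dense.
- Performance: Merged PR #30897 tunes the NVFP4 input quantization kernel for small batch sizes and significantly improves decode throughput. PR #31089 attempts to enable the MXFP4 Triton backend on SM120 (Blackwell).
- Correctness fixes: Closed issue #30830 (accuracy of online FP8 quantization for MoE) and new PR #31106 (AMD FP8 maximum value fix) show how hard it is to keep quantized computation correct across hardware and model structures.
- Feature expansion: PR #31099 and PR #31097 address FP4 quantization support at higher tensor parallelism (TP > 4) and a kernel padding initialization bug.
Ongoing MoE refactoring and adaptation: The MoE kernel refactor is a clear technical thread, with a series of PRs (#31102, #31100, #31036) that move the MoE implementations of different backends onto a modular kernel interface. Issue #23345 (Qwen3MoE under DP/EP) and PR #31105 (LoRA for DeepSeek) show the community's close attention to the correctness of complex MoE models in distributed (DP/EP) and LoRA fine-tuning scenarios.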

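🛠️ Key Technical Changes

PR #31106 (new): Unified fix for AMD FP8 quantization limits
- Technical notes: By adding the get_fp8_min_max helper, the PR corrects in one place the maximum value used for the float8_e4m3fnuz dtype on ROCm. This avoids quantization error caused by a mismatch between PyTorch's default value and what the hardware expects.
- Impact: Improves the accuracy of FP8 quantized inference on AMD GPUs and is an important step for cross-platform quantization support.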
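
To show where the FP8 maximum enters dynamic quantization, here is a small sketch of per-tensor dynamic scaling under common assumptions: the scale is the tensor's absolute maximum divided by the FP8 maximum, and scaled values are clamped to the representable range. The function names are hypothetical, and rounding to the FP8 grid is deliberately omitted.

```python
def dynamic_scale(values: list[float], fp8_max: float) -> float:
    """Per-tensor scale: map the largest magnitude onto fp8_max."""
    amax = max(abs(v) for v in values)
    return max(amax, 1e-12) / fp8_max  # Floor avoids division by zero later.


def scale_and_clamp(values: list[float], fp8_max: float) -> tuple[list[float], float]:
    """Scale values into FP8 range and clamp; FP8 rounding is not modeled."""
    scale = dynamic_scale(values, fp8_max)
    clamped = [min(max(v / scale, -fp8_max), fp8_max) for v in values]
    return clamped, scale


activations = [-3.0, 1.5, 6.0]
_, scale_240 = scale_and_clamp(activations, 240.0)
_, scale_224 = scale_and_clamp(activations, 224.0)
```

The same activations yield different scales for the two maxima, so every quantized value shifts when the limit changes; this is why the limit needs to be consistent everywhere it is used.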
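PR #31105 (new): Fix LoRA weight loading for DeepSeek MoE models
- Technical notes: Resolves three problems: (1) it maps the naming differences between HuggingFace and vLLM internals for the DeepSeek MLA attention modules; (2) it lets shared_experts paths in MoE models pass LoRA validation; (3) it deduplicates module objects so that LoRA is not applied to the same module twice.
- Impact: Lets popular MoE models such as DeepSeek V2/V3 load and apply LoRA adapters correctly, which matters for fine-tuning and deployment.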
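
Two of the three fixes, name mapping and module deduplication, can be sketched generically. Everything below is hypothetical: the module names are invented placeholders and the functions are not taken from PR #31105. The shared_experts validation change is not modeled because the report does not describe its logic.

```python
def remap_name(name: str, mapping: dict[str, str]) -> str:
    """Translate each dotted path component through a name mapping."""
    return ".".join(mapping.get(part, part) for part in name.split("."))


def unique_modules(named_modules: list[tuple[str, object]]) -> list[tuple[str, object]]:
    """Keep the first name seen for each distinct module object."""
    seen: set[int] = set()
    result = []
    for name, module in named_modules:
        if id(module) in seen:
            continue  # Same object reachable under another name; skip it.
        seen.add(id(module))
        result.append((name, module))
    return result


# Placeholder names; the real DeepSeek MLA module names are not given in the report.
mapping = {"hf_attn_proj": "internal_attn_proj"}
remapped = remap_name("model.layers.0.self_attn.hf_attn_proj", mapping)

shared = object()
deduped = unique_modules([("a", shared), ("alias_of_a", shared), ("b", object())])
```

Deduplicating by object identity rather than by name is what prevents an adapter from being attached twice when one module is registered under several paths.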
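PR #30821 (merged): Enable partial RoPE support in fused_qknorm_rope_kernel
- Technical notes: Extends the fused kernel (which fuses Q/K normalization with RoPE) to models that apply RoPE to only part of each head's dimensions, such as GLM4.6-MoE. The gain is small because an existing combined-kernel optimization already applies, but the change removes this optimization's restriction for such models.
- Impact: Offers a small potential performance gain for a wider range of model architectures and shows kernel optimization advancing at a fine granularity.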
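
Partial RoPE is commonly implemented by rotating only the first rotary_dim components of each head vector and passing the rest through unchanged. The sketch below assumes that layout and NeoX-style pairing (component i with component i + rotary_dim / 2); it is a scalar reference for the math, not the fused CUDA kernel from PR #30821.

```python
import math


def apply_partial_rope(head: list[float], position: int, rotary_dim: int,
                       base: float = 10000.0) -> list[float]:
    """Rotate the first rotary_dim entries of one head vector; copy the tail as is."""
    if rotary_dim % 2 != 0 or rotary_dim > len(head):
        raise ValueError("rotary_dim must be even and at most the head size")
    out = list(head)
    half = rotary_dim // 2
    for i in range(half):
        # Standard RoPE frequency for pair i.
        theta = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = head[i], head[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out


head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotated = apply_partial_rope(head, position=3, rotary_dim=4)
```

With rotary_dim=4 on an 8-dimensional head, the last four components are untouched and the rotated part keeps its Euclidean norm, since each pair undergoes a plain 2D rotation.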

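📈 Development Activity Observations

Contribution and merge cadence: 17 new PRs and 7 merges in 24 hours indicate a brisk pace of submission and code review. Of the 16 issues closed, 11 carried the stale label, which shows maintainers actively managing the repository and clearing resolved or outdated items.
Core contributor areas: robertgshaw2-redhat is very active in the MoE refactor series; maintainers such as jeejeelee and DarkLight1337 frequently take part in issue discussion, PR review, and merging. The AMD-related fix was contributed by c0de128.
Collaboration patterns: The high-activity issue threads show users supplying detailed environment information when they hit problems, and core maintainers responding quickly with troubleshooting ideas or pointers to related fixes. The PR description template and CI bot prompts also standardize the contribution process.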

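💡 Issues Worth Watching

New Issue #31108: Mamba-family model fails to load: A user reports that vLLM fails to load mistralai/Mamba-Codestral-7B-v0.1 because of an incorrect assumption about weight prefixes. This touches support for non-Transformer architectures such as Mamba and requires changes to the model loading logic.
New Issue #31096: Function-calling support for Qwen3-Next: A user reports that neither the Instruct nor the Thinking variant supports function calling in vLLM and that the returned format is non-standard. This bears on how quickly vLLM can follow advanced features (function calling, reasoning) of newly released models.
Progress and risk of the MoE refactor: The MoE kernel refactor led by robertgshaw2-redhat (#31102, #31100, and others) is a major restructuring aimed at better maintainability and unified backends. The community should watch its merge progress and its effect on the performance and stability of existing MoE models.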

📋 Appendix: Detailed Data Lists

New Issues

#31108 [Bug]: vLLM fails to load mistralai/Mamba-Codestral-7B-v0.1 due to incorrect weight prefix assumption — bug — by sicong-lin (created: 2025-12-22 10:45 (UTC+8))
#31091 [Usage]: Image Embedding Models (CLIP, Siglip, etc) — usage — by JamesDConley (created: 2025-12-21 12:10 (UTC+8))
#31096 [Usage]: Qwen3-Next: Both Instruct and Thinking models don’t support function calling — usage — by PHOEBEMOON0802 (created: 2025-12-21 20:02 (UTC+8))

Closed Issues

#19828 [Usage]: vllm启动qwen2.5vl-7b以后为什么显存使用越来越多 (English: why does GPU memory usage keep growing after vLLM launches qwen2.5vl-7b) — usage,stale — by wqw0806 (closed: 2025-12-22 10:18 (UTC+8))
#20177 [Bug]: RuntimeError: TopPSamplingFromProbs failed with error code an illegal memory access was encountered — bug,stale — by Lucas-Luoling (closed: 2025-12-22 10:18 (UTC+8))
#21882 [Bug]: RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {‘EngineCore_0’: 1} — bug,stale — by Asstar-X (closed: 2025-12-22 10:17 (UTC+8))
#22458 [Feature]: Please add support for Amd Instinct MI50 — feature request,rocm,stale — by iEddie-cmd (closed: 2025-12-22 10:17 (UTC+8))
#22590 [Bug]: ROCm build falls back to default arch despite ARG_PYTORCH_ROCM_ARCH set in Dockerfile.rocm — bug,rocm,stale — by president-not-sure (closed: 2025-12-22 10:17 (UTC+8))
#23089 [Bug]: vLLM 0.6.3.post1 and vLLM 0.8.5.post1 shows a significant difference in inference speed when running qwen2-vl-3b and internvl3-1b. What could be the cause of this?” — bug,stale — by LZBUAV (closed: 2025-12-22 10:17 (UTC+8))
#23338 [Usage]: When infer a large multimodal model, the program hangs and becomes unresponsive. — usage,stale — by whk6688 (closed: 2025-12-22 10:16 (UTC+8))
#23345 [Bug][DP/EP]: Qwen3MoE Wrong Answer — bug,stale — by robertgshaw2-redhat (closed: 2025-12-22 10:16 (UTC+8))
#23373 [Bug]: since 0.10.1, Pooling output type changed from float32 to bfloat16 (and different numeric results) — bug,stale — by lk-chen (closed: 2025-12-22 10:16 (UTC+8))
#23423 [Bug]: There are problems with the output of the Multimodal large model — bug,stale — by Ron-Favorite (closed: 2025-12-22 10:16 (UTC+8))
#23433 [Feature]: shared_experts overlap with DeepEP low latency dispatch — feature request,stale — by zhangxy9999 (closed: 2025-12-22 10:16 (UTC+8))
#31066 [Doc]: Formatting issue in markdown file — documentation — by ssaketh-ch (closed: 2025-12-22 09:38 (UTC+8))
#30830 [Bug]: accuracy issue on MoE online fp8 quantization — bug — by yma11 (closed: 2025-12-22 09:07 (UTC+8))
#31023 [Doc]: FP8 KV Cache: Does softmax output multiply with FP8 V directly or after dequantization? — documentation — by jorjiang (closed: 2025-12-22 08:41 (UTC+8))
#30853 [Performance]: DeepSeek V3.2 Benchmarking: Significant performance discrepancy between initial and subsequent runs — performance — by zkf331 (closed: 2025-12-21 23:27 (UTC+8))
#30958 [Bug]: Vllm Server stuck and automatically shutdown. — bug — by tzjtatata (closed: 2025-12-21 20:47 (UTC+8))

New PRs

#31106 [Bugfix][Hardware][AMD] Consolidate FP8 min/max values into helper function — rocm — by c0de128 (created: 2025-12-22 10:22 (UTC+8))
#31107 Fix DP group init error — no labels — by wangxiyuan (created: 2025-12-22 10:23 (UTC+8))
#31090 [CI] Fix “2 Node Tests (4 GPUs in total)” — documentation,ready,ci/build — by LucasWilkinson (created: 2025-12-21 11:49 (UTC+8))
#31101 [BugFix] Fix beam search parent mapping for variable logprobs — frontend — by scratch-ml (created: 2025-12-22 02:23 (UTC+8))
#31094 [Feature] add StreamableResponsesParser and token usage counting for ParsableContext — frontend,gpt-oss — by penfree (created: 2025-12-21 19:11 (UTC+8))
#31104 [BugFix] LoRA: FusedMoE make_expert_params_mapping supports base_layer — speculative-decoding,llama,qwen,deepseek,gpt-oss — by HollowMan6 (created: 2025-12-22 08:44 (UTC+8))
#31105 [Bugfix][LoRA] Fix LoRA weight mapping for DeepSeek MLA attention and… — deepseek — by tim0120 (created: 2025-12-22 09:18 (UTC+8))
#31103 [Bugfix] Fix weight_loader v1 block scale — no labels — by kyuyeunk (created: 2025-12-22 07:19 (UTC+8))
#31095 adapt voxtral — new-model,v1,multi-modality — by patrickvonplaten (created: 2025-12-21 19:12 (UTC+8))
#31102 [MoE Refactor][7/N] AITER MK — rocm,v1 — by robertgshaw2-redhat (created: 2025-12-22 02:54 (UTC+8))
#31100 [MoE Refactor][6/N] Use Modular Kernels for ModelOpt FP8 — no labels — by robertgshaw2-redhat (created: 2025-12-22 00:42 (UTC+8))
#31098 Refactor modelopt fp8 modular kernel — nvidia — by robertgshaw2-redhat (created: 2025-12-22 00:20 (UTC+8))
#31099 [FIX] Support TP > 4 for FP4 quantization — no labels — by danielafrimi (created: 2025-12-22 00:38 (UTC+8))
#31093 ci: add nvidia-smi warmup before Prime-RL integration test — ready,ci/build,nvidia — by AmeenP (created: 2025-12-21 18:05 (UTC+8))
#31097 [FIX] FP4 quantization kernel padding initialization bug — no labels — by danielafrimi (created: 2025-12-21 20:23 (UTC+8))
#31089 [Quantization] enable MXFP4 Triton backend on SM120 (Blackwell) — no labels — by janreges (created: 2025-12-21 11:30 (UTC+8))
#31092 Feat/support nemotron h mtp upstream — new-model,speculative-decoding,v1 — by shaharmor98 (created: 2025-12-21 15:54 (UTC+8))

Merged PRs

#30821 [Kernel] Enable fused_qknorm_rope_kernel supports partial rope — ready — by jeejeelee (merged: 2025-12-22 10:39 (UTC+8))
#31090 [CI] Fix “2 Node Tests (4 GPUs in total)” — documentation,ready,ci/build — by LucasWilkinson (merged: 2025-12-22 10:32 (UTC+8))
#31071 [Doc] Clarify FP8 KV cache computation workflow — documentation,ready,v1 — by westers (merged: 2025-12-22 08:41 (UTC+8))
#30897 [NVFP4][Perf] Tune NVFP4 input quant kernel for small batch size — performance,ready,nvidia — by mgoin (merged: 2025-12-22 01:41 (UTC+8))
#31036 [MoE Refactor][4/N] Marlin Fp8 Mk — ready — by robertgshaw2-redhat (merged: 2025-12-22 01:37 (UTC+8))
#31093 ci: add nvidia-smi warmup before Prime-RL integration test — ready,ci/build,nvidia — by AmeenP (merged: 2025-12-21 23:43 (UTC+8))
#31088 add aarnphm and chaunceyjiang to the new tool_parser directory — ready,ci/build — by chaunceyjiang (merged: 2025-12-21 11:24 (UTC+8))

PRs Closed Without Merging

#31068 [Bugfix] Fix truncate_prompt_tokens ignored in PoolingParams.encode() — v1 — by westers (closed: 2025-12-22 10:36 (UTC+8))
#18545 [Bugfix] handle zero-token batches by creating empty attention metadata for KV cache load — needs-rebase,stale,v1 — by hammersam (closed: 2025-12-22 10:18 (UTC+8))
#21076 [CI] upgrade schemathesis to 4.1.0 from 3.39.15 — ci/build,stale — by davidxia (closed: 2025-12-22 10:17 (UTC+8))
#21900 [Bugfix] Fix Dense module loading for sentence-transformers embedding models — stale — by FFFfff1FFFfff (closed: 2025-12-22 10:17 (UTC+8))
#23158 [feat] Support EAGLE for Qwen2 — new-model,speculative-decoding,ready,stale,qwen — by Ximingwang-09 (closed: 2025-12-22 10:16 (UTC+8))
#23323 Fix tool_call parsing when tool_choice is enforced. — frontend,stale — by Gh0u1L5 (closed: 2025-12-22 10:16 (UTC+8))
#23340 add MoE config for A100 80GB for Qwen3 — stale,qwen — by cberge908 (closed: 2025-12-22 10:16 (UTC+8))
#23367 [CI][V0 Deprecation] Remove V0 support for core tests — stale — by csahithi (closed: 2025-12-22 10:16 (UTC+8))
#23406 Make uvloop usage robust on Windows; fall back to asyncio.run when un… — frontend,stale — by imjbassi (closed: 2025-12-22 10:16 (UTC+8))
#23417 Add run_mistral.py; keep local envs ignored — stale — by jscarfy (closed: 2025-12-22 10:16 (UTC+8))
#23428 [Optimize] V1 . Sink the reference counting method touch of the block to align with the free release method — stale,v1 — by CLFutureX (closed: 2025-12-22 10:16 (UTC+8))
#30849 [Misc][Benchmark] use pid tracking, replaced pkill commands — performance — by RuixiangMa (closed: 2025-12-22 07:51 (UTC+8))
#23624 [Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector — tpu,needs-rebase,stale,v1,kv-connector — by KuntaiDu (closed: 2025-12-22 03:12 (UTC+8))
#31098 Refactor modelopt fp8 modular kernel — nvidia — by robertgshaw2-redhat (closed: 2025-12-22 00:20 (UTC+8))
#30977 Docs: add OpenAI SDK example for Qwen2.5-VL classification — documentation,qwen — by Dhruv-80 (closed: 2025-12-21 19:53 (UTC+8))