vLLM Development Activity Report - 2025-12-30

Time window: 2025-12-30 10:50 (UTC+8) ~ 2025-12-31 10:50 (UTC+8)
Statistics: New issues 4 | Closed issues 9 | New PRs 33 | Merged PRs 14 | PRs closed without merging 6

📊 Daily Development Summary

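Between December 30 and 31, 2025, the vLLM project stayed highly active: 33 PRs were opened and 14 were merged. Work concentrated on expanded model support (Nemotron-Flash, BAGEL, Isaac-0.2), performance optimization and bug fixes (MoE, speculative decoding, the CPU platform), and continued integration and fixes for the AMD ROCm ecosystem. Community discussion also stayed lively: a long-standing RFC on deprecating the vLLM V0 engine drew renewed attention when it was formally closed.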

🎯 AMD/ROCm Ecosystem Updates

During this window, AMD/ROCm development and integration work continued on three fronts: CI/CD optimization, bug fixes, and kernel integration.

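CI/CD test optimization and model updates:
- PR #31551 ([ROCm][CI] Update MiniCPM model test…): Updates the test model in ROCm CI, moving MiniCPM from v3 to the more compatible v4.1. The PR also removes the explicit AITER kernel tests in favor of the default backend dispatch, which simplifies the test logic and better reflects real-world usage. It additionally fixes a test accuracy problem caused by the MiniCPM embedding scaling factor.
- PR #31553 ([ROCm][CI] Fix failure in Language Models Tests…): Fixes AMD CI failures that occurred when the test sharding strategy left some workers with no tests to run. The agent pool was changed from mi325_8 to mi325_2 so that the tests run reliably.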
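Key bug fixes and accuracy safeguards:
- PR #31523 ([ROCm][Bugfix] Fix accuracy issue on fmoe…): Fixes a critical bug that caused a severe accuracy regression in all MoE models (such as DeepSeek-R1) when the VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS environment variable is enabled. The root cause was that the grouped_topk operation was not added to the dispatch list during the migration to the CustomOp framework, so the AMD-optimized HIP kernel was never called. With the fix, DeepSeek-R1 accuracy on the GSM8K benchmark recovers from nearly 0 to above 94%.
- Issue #31139 ([Bug]: Build vllm/rocm docker image): An issue about ROCm Docker image build failures was closed in this window. An AMD engineer (@vllmellm) confirmed a successful build on MI300X, and the reporter indicated that the problem was transient and had since been fixed.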
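Platform compatibility and code robustness:
- PR #31552 ([Bugfix][Hardware][AMD] Fix device parameter and exception handling): Fixes two general issues that benefit AMD and other platforms alike. First, the hard-coded device="cuda" in the fusion helper functions is replaced with current_platform.device_type, making the code more platform-agnostic. Second, the exception handling around the AITER import is narrowed so that serious problems such as driver errors are no longer masked, which makes debugging easier.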
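
The two changes in PR #31552 follow a common pattern that can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the PR's code: `try_import_optional` and `make_buffer_spec` are hypothetical helpers, and the `device_type` argument stands in for vLLM's `current_platform.device_type`.

```python
import importlib
from types import ModuleType
from typing import Optional


def try_import_optional(module_name: str) -> Optional[ModuleType]:
    """Import an optional dependency, treating only 'not installed' as benign."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Narrow catch: a missing package is expected on platforms that do
        # not ship it. Any other exception (for example a RuntimeError raised
        # while a driver initializes) propagates and stays visible.
        return None


def make_buffer_spec(num_elems: int, device_type: str) -> dict:
    """Describe an allocation on the active platform's device.

    The dict is a placeholder for a real tensor allocation; the point is
    that the device string is passed in rather than hard-coded as "cuda".
    """
    return {"numel": num_elems, "device": device_type}


# Hypothetical usage: an absent package yields None instead of an error.
missing = try_import_optional("some_package_that_is_not_installed_xyz")
spec = make_buffer_spec(16, device_type="cpu")
```

Catching a bare `Exception` in `try_import_optional` would also return `None` when the package is installed but fails during initialization, which is the masking behavior the PR description says it removes.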

💬 High-Traffic Discussions

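Issue #18571 ([RFC]: Deprecating vLLM V0): This long-standing RFC (request for comments) was closed in this window. Its thread collects the community's main concerns about vLLM's architectural evolution.
- Core topic: formally deprecating the vLLM V0 engine and planning its removal from the codebase.
- Main viewpoints:
  - Project maintainer (@WoosukKwon): V1 is mature, while maintaining V0 adds substantial code complexity, technical debt, and CI cost. Removing V0 simplifies the code, avoids user confusion, and speeds up V1 development.
  - Community users: broadly supportive of the technical direction, but concerned about V1 feature completeness and migration, including:
    - Missing key features: priority scheduling, streaming input support, compatibility with older GPUs such as V100/T4, and FP8 KV cache support on non-Hopper devices.
    - Upgrade cost: users need to assess the migration effort and the impact on existing workloads.
- Point of contention: whether V0 should be removed entirely before V1 covers all of its key features. Users worry about being forced to upgrade, or about workloads that cannot migrate smoothly.
- Current status: the RFC is closed, marking the formal end of the V0 deprecation process. The user needs raised in the discussion (priority scheduling, for example) remain areas where V1 needs continued work.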
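Issue #26191 ([Bug]: Performance Regression in Acceptance length for EAGLE3): A technical discussion of a regression in speculative decoding.
- Core topic: under a specific configuration (disable_padded_drafter_batch=False), the acceptance length (AL) of EAGLE3 dropped.
- Investigation: the reporter (@tomasruizt) bisected the regression to PR #24539 and found that the behavior correlated strongly with the disable_padded_drafter_batch flag. A maintainer (@benchislett) then submitted a fix.
- Outcome: the fix was proposed as PR #26498, and the issue was closed in this window (2025-12-30 22:37 UTC+8). The appendix lists PR #26498 as closed without merging at that same time, so the fix appears to have landed through a different change. The thread is a good example of careful bisection of a performance problem.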
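
For readers unfamiliar with the metric, acceptance length can be computed as sketched below. The convention used here is an assumption (each verification step emits the accepted draft tokens plus one token from the target model); the issue thread may count it differently, and `mean_acceptance_length` is a hypothetical helper.

```python
def mean_acceptance_length(accepted_per_step: list[int]) -> float:
    """Average tokens emitted per verification step.

    accepted_per_step[i] is the number of draft tokens the target model
    accepted at step i. One extra token is added per step for the token
    the target model itself produces (assumed convention).
    """
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    total_tokens = sum(n + 1 for n in accepted_per_step)
    return total_tokens / len(accepted_per_step)


# Four steps that accepted 3, 0, 2, and 1 draft tokens respectively.
al = mean_acceptance_length([3, 0, 2, 1])
```

A drop in this average means each target-model forward pass yields fewer tokens, which is why a regression in AL translates directly into lower decoding throughput.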

🔥 Hot Topics and Trends

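Model support keeps expanding: the community is actively integrating new model architectures.
- Hybrid-architecture models: PR #31543 adds support for NVIDIA Nemotron-Flash-3B (a hybrid of Mamba2, DeltaNet, and Transformer).
- Multimodal models: PR #31546 fixes API serving for the BAGEL model; PR #31550 updates support for the Isaac-0.2 multimodal model.
- MoE models: PR #28775 was merged after a long review, adding support for the openPangu MoE model (which has distinctive KV head sizes and a Sink KV feature).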
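MoE layer and quantization refactoring: the ongoing MoE refactor is a major recent theme.
- Code refactoring: PRs #31533 and #31542, among others, belong to the MoE refactor series, which aims to standardize and modularize the MoE implementation.
- Kernel integration: PR #31548 attempts to integrate the third-party high-performance MoE kernel Sonic MoE to raise throughput on Hopper GPUs.
- Quantization and LoRA: PR #31534 fixes a mismatch between MoE LoRA weight shapes and PEFT; PR #31453 adds the missing select_gemm_impl method to CompressedTensorsWNA16MoEMethod to support LoRA.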
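
To make the LoRA shape question concrete, the sketch below checks shapes for an ordinary linear layer using the layout PEFT uses for its LoRA linear weights: A has shape (r, in_features) and B has shape (out_features, r). The per-expert MoE layout that PR #31534 addresses is not described in this report, so this only illustrates the general convention; `lora_delta_shape` is a hypothetical helper.

```python
from typing import Tuple

Shape = Tuple[int, int]


def lora_delta_shape(a_shape: Shape, b_shape: Shape) -> Shape:
    """Return the shape of the weight update B @ A, validating the rank.

    a_shape is (r, in_features); b_shape is (out_features, r).
    """
    rank_a, in_features = a_shape
    out_features, rank_b = b_shape
    if rank_a != rank_b:
        raise ValueError(
            f"rank mismatch: A has r={rank_a}, B has r={rank_b}; "
            "B may be stored transposed"
        )
    return (out_features, in_features)


# Rank-8 adapter for a layer mapping 1024 -> 4096 features.
delta = lora_delta_shape(a_shape=(8, 1024), b_shape=(4096, 8))
```

If a loader assumes B is stored as (r, out_features) instead, the same check fails on the rank comparison, which is the kind of mismatch a shape-alignment fix has to resolve.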
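CustomOp migration: the project is moving hardware-specific optimized code into the unified CustomOp framework.
- PRs #31523 and #31530 both relate to this migration. It aims to unify operator dispatch, but it also introduces complexity that needs careful testing (the bug fixed in #31523 is one example).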
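
The failure mode described for #31523 (an operator missing from the dispatch list, so the platform kernel is never selected) can be illustrated with a toy registry. This is a sketch under stated assumptions and not vLLM's CustomOp API: `ToyOpRegistry` and its methods are hypothetical, and the implementations simply return a tag string.

```python
from typing import Callable, Dict


class ToyOpRegistry:
    """Toy per-platform operator registry (not vLLM's CustomOp API)."""

    def __init__(self, platform: str) -> None:
        self.platform = platform
        self._impls: Dict[str, Dict[str, Callable[[], str]]] = {}

    def register(self, op_name: str, platform: str, fn: Callable[[], str]) -> None:
        self._impls.setdefault(op_name, {})[platform] = fn

    def dispatch(self, op_name: str) -> Callable[[], str]:
        impls = self._impls[op_name]
        # Prefer the platform-specific kernel; otherwise fall back silently
        # to the native implementation.
        return impls.get(self.platform, impls["native"])


registry = ToyOpRegistry(platform="rocm")
registry.register("grouped_topk", "native", lambda: "native")

# The ROCm kernel was never registered, so dispatch quietly picks "native".
before = registry.dispatch("grouped_topk")()

registry.register("grouped_topk", "rocm", lambda: "rocm")
after = registry.dispatch("grouped_topk")()
```

Because the fallback raises no error, a missing registration only shows up in behavior, which is why a test of forward dispatch such as the one proposed in #31530 is useful.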

🛠️ Key Technical Changes

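PR #31534 ([Fix] Align fused moe lora_b shape with peft): A key fix. When vLLM loaded low-rank adapters trained with PEFT, the shape it assumed for the LoRA B weights in MoE layers did not match the PEFT convention. Aligning the two keeps inference results consistent, which matters for deploying fine-tuned MoE models.
PR #31540 ([Bugfix] Fix block size used in EAGLE slot mapping): Fixes possible errors in the EAGLE speculative decoding slot mapping for models whose linear layers and attention layers use different block sizes (some hybrid architectures, for example). This improves the stability of speculative decoding and long-context support on such models.
PR #31551 ([ROCm][CI] Update MiniCPM model test…): A sign that CI testing on AMD platforms is maturing. Moving from a forced test backend to the default dispatch brings the tests closer to real user environments, and the updated test model keeps the tests relevant.
PR #31525 ([CPU] Disable async schedule on CPU): Disables async scheduling on the CPU backend, where it is unnecessary. This simplifies the CPU execution logic and should help the efficiency and stability of CPU inference.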
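
To show why the block size matters for the slot mapping fixed in PR #31540, here is a minimal sketch of the conventional paged KV cache mapping from a token position to a physical slot. It is not the PR's code: `slot_for_position` and the example block table are hypothetical.

```python
def slot_for_position(pos: int, block_table: list[int], block_size: int) -> int:
    """Map a logical token position to a physical KV cache slot.

    block_table[i] is the physical block id that backs logical block i.
    """
    logical_block, offset = divmod(pos, block_size)
    return block_table[logical_block] * block_size + offset


block_table = [7, 2, 5]

# Correct block size for this cache group: position 20 falls in logical
# block 1 (physical block 2) at offset 4.
correct = slot_for_position(20, block_table, block_size=16)

# Using another layer group's block size changes both the logical block
# and the offset, so the write lands in an unrelated slot.
wrong = slot_for_position(20, block_table, block_size=32)
```

With a single block size per model the two calls would never diverge, which is consistent with the report's note that the bug affects models mixing different block sizes.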

📈 Development Activity Observations

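Diverse contributors: contributors in this window came from NVIDIA, AMD, Intel, Red Hat, and other companies, as well as independent developers, and covered GPU (CUDA/ROCm), CPU, XPU, and other hardware platforms.
Fast issue turnaround: several issues (such as #31535, #31526, and #31139) were closed soon after being filed, either by pointing to a PR or by confirming a fix, which shows a responsive community.
Smooth PR review and merge flow: many PRs were labeled ready and merged quickly, indicating efficient collaboration between the core team and contributors. Automation bots (mergify, github-actions) also played an important role in code style checks, CI triggering, and documentation previews.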

💡 Issues Worth Watching

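Issue #31529 ([Bug]: vllm 0.12.0 fail to start Qwen3-VL-30B-A3B-Thinking): A startup failure with the multimodal model Qwen3-VL. The initial suspicion is a conflict with the pynvml library, and uninstalling pynvml was suggested. The issue is not yet resolved and is worth watching for users of multimodal models.
Issue #31555 ([Docs] Feedback for /en/stable/MONSTERDOG): The content of this issue is anomalous, with many unrelated attachments and garbled text. It looks like spam or a test issue and may need cleanup by maintainers.
V1 engine feature completeness: V0 deprecation is settled, but the discussion in #18571 shows strong community demand for V1 support of production-critical features such as priority scheduling. This will be key to vLLM adoption in more complex serving scenarios.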

📋 Appendix: Detailed Data

New Issues

#31555 [Docs] Feedback for /en/stable/MONSTERDOG | documentation | by s33765387-cpu (created: 2025-12-31 09:20 (UTC+8))
#31535 [Bug]: Model architectures ['NemotronFlashForCausalLM'] are not supported for now. | bug | by LeiYangGH (created: 2025-12-30 23:27 (UTC+8))
#31529 [Bug]: vllm 0.12.0 fail to start Qwen3-VL-30B-A3B-Thinking | bug | by Arashi19901001 (created: 2025-12-30 19:20 (UTC+8))
#31526 [Bug]: kv_connector.get_block_ids_with_load_errors fails to capture load errors in PD separation scenario when P-side uses layer_wise KV cache loading | bug | by SpecterCipher (created: 2025-12-30 17:39 (UTC+8))

Closed Issues

#15702 [Bug]: Engine crash periodically running Deepseek V3/R1 on Hopper GPUs in cutlass_scaled_mm_sm90() | bug,stale | by rymc (closed: 2025-12-31 10:20 (UTC+8))
#16191 [Usage]: How can I quickly obtain the number of prompt tokens containing multimodal data? | help wanted,usage,stale,multi-modality | by yansh97 (closed: 2025-12-31 10:20 (UTC+8))
#31535 [Bug]: Model architectures ['NemotronFlashForCausalLM'] are not supported for now. | bug | by LeiYangGH (closed: 2025-12-31 07:11 (UTC+8))
#26191 [Bug]: Performance Regression in Acceptance length for EAGLE3 | bug | by tomasruizt (closed: 2025-12-30 22:37 (UTC+8))
#31526 [Bug]: kv_connector.get_block_ids_with_load_errors fails to capture load errors in PD separation scenario when P-side uses layer_wise KV cache loading | bug | by SpecterCipher (closed: 2025-12-30 19:18 (UTC+8))
#31139 [Bug]: Build vllm/rocm docker image | bug,rocm | by JartX (closed: 2025-12-30 17:27 (UTC+8))
#31440 [Feature]: Allow usage of continue_final_message in /embeddings endpoint | feature request | by kevin-pw (closed: 2025-12-30 16:42 (UTC+8))
#18571 [RFC]: Deprecating vLLM V0 | RFC,stale | by WoosukKwon (closed: 2025-12-30 12:21 (UTC+8))
#28070 [Usage]: Is there a way to control default thinking behaviour of a model? | usage | by yz342 (closed: 2025-12-30 11:38 (UTC+8))

New PRs

#31551 [ROCm][CI] Update MiniCPM model test: MiniCPM3-4B to MiniCPM4.1-8B and simplify attention backend testing | rocm | by AndreasKaratzas (created: 2025-12-31 07:35 (UTC+8))
#31538 [Model] Add template for qwen3-reranker | frontend,qwen | by alphabetc1 (created: 2025-12-31 01:48 (UTC+8))
#31520 [BugFix] Fix NUMA node validation in CPU platform | ready,cpu | by SameerAsal (created: 2025-12-30 11:38 (UTC+8))
#31534 [Fix] Align fused moe lora_b shape with peft | documentation,ready,gpt-oss | by linitra24 (created: 2025-12-30 21:44 (UTC+8))
#31530 CustomOp: test forward dispatch for grouped_topk | ready | by xinyu-intel (created: 2025-12-30 20:45 (UTC+8))
#31554 [BE] Type coverage for vllm/compilation [1/?] | nvidia | by Lucaskabela (created: 2025-12-31 09:09 (UTC+8))
#31524 fix: improve Mistral tokenizer auto-detection to check model architec… | no labels | by jayhemnani9910 (created: 2025-12-30 16:29 (UTC+8))
#31552 [Bugfix][Hardware][AMD] Fix device parameter and exception handling | rocm | by c0de128 (created: 2025-12-31 08:00 (UTC+8))
#31549 Remove unused use_marlin variable in Mxfp4MoEMethod | ready | by vsourirajan (created: 2025-12-31 06:27 (UTC+8))
#31553 [ROCm][CI] Fix failure in Language Models Tests (Extra Standard) by reducing agent pool size | rocm,ci/build | by AndreasKaratzas (created: 2025-12-31 08:25 (UTC+8))
#31550 feature/issac 0.2 | multi-modality | by AkshatSh (created: 2025-12-31 06:41 (UTC+8))
#31537 [Bugfix] Record metrics for aborted requests | v1 | by jayhemnani9910 (created: 2025-12-31 01:33 (UTC+8))
#31543 [Model] Add support for Nemotron-Flash-3B | documentation,new-model,v1 | by hhzhang16 (created: 2025-12-31 03:31 (UTC+8))
#31533 [MoE Refactor][13/N] Convert FI to Use PFNoEP | nvidia | by robertgshaw2-redhat (created: 2025-12-30 21:42 (UTC+8))
#31548 [Kernel] Add Sonic MoE integration for Hopper GPUs with swiglu support | no labels | by clocksmith (created: 2025-12-31 06:01 (UTC+8))
#31540 [Bugfix] Fix block size used in EAGLE slot mapping | bug,speculative-decoding,ready,v1 | by benchislett (created: 2025-12-31 02:31 (UTC+8))
#31547 [3/n] Migrate norm kernels to libtorch stable ABI | ci/build,cpu,nvidia | by mikaylagawarecki (created: 2025-12-31 05:59 (UTC+8))
#31545 Use Transformers v5 WeightRenaming for Transformers modeling backend | no labels | by hmellor (created: 2025-12-31 05:23 (UTC+8))
#31546 [Bugfix] Fix BAGEL online serving for text and image understanding | no labels | by Dylan1229 (created: 2025-12-31 05:25 (UTC+8))
#31542 [MoE Refactor] Aiter Experts for BF16 MoE | rocm,ready | by zyongye (created: 2025-12-31 02:57 (UTC+8))
#31544 [CI] Add sm_110 to aarch64 CUDA 13.0 builds | ci/build,nvidia | by kbenkhaled (created: 2025-12-31 04:59 (UTC+8))
#31528 [FIX] Add NO_MUL activation support for modular kernel path | no labels | by danielafrimi (created: 2025-12-30 18:38 (UTC+8))
#31539 Add get_expert_mapping to NemotronHModel (for LoRA support) | ready | by danisereb (created: 2025-12-31 02:11 (UTC+8))
#31541 Turn @config into a dataclass_transform | frontend | by hmellor (created: 2025-12-31 02:38 (UTC+8))
#31527 [Fix] Add SIGINT handler for openai | frontend | by wzshiming (created: 2025-12-30 18:25 (UTC+8))
#31536 fix(compile): apply partition wrapper when loading AOT cached functions | no labels | by devbyteai (created: 2025-12-31 01:08 (UTC+8))
#31532 [Bugfix] Enforce strict EOS handling in structured outputs with Outlines backend | structured-output,v1 | by shyoshyo (created: 2025-12-30 21:36 (UTC+8))
#31531 Use the same memory for workspace13 and fused_output. | no labels | by halyavin (created: 2025-12-30 21:10 (UTC+8))
#31525 [CPU] Disable async schedule on CPU | ready,cpu | by bigPYJ1151 (created: 2025-12-30 17:34 (UTC+8))
#31523 [ROCm][Bugfix] Fix accuracy issue on fmoe when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS enabled | rocm,ready | by ganyi1996ppo (created: 2025-12-30 14:47 (UTC+8))
#31522 [xpu] [bugfix] upgrade to latest oneccl in dockerfile | ready,ci/build | by rogerxfeng8 (created: 2025-12-30 12:58 (UTC+8))
#31521 [Triton][CustomOp] A Triton operator dispatch mechanism through modified CustomOp | qwen | by MengqingCao (created: 2025-12-30 11:40 (UTC+8))
#31519 [BugFix] Fix NUMA node validation in CPU platform | cpu | by SameerAsal (created: 2025-12-30 10:51 (UTC+8))

Merged PRs

#31534 [Fix] Align fused moe lora_b shape with peft | documentation,ready,gpt-oss | by linitra24 (merged: 2025-12-31 09:44 (UTC+8))
#31477 Add docker buildx bake configuration | ready | by amrmahdi (merged: 2025-12-31 09:08 (UTC+8))
#31497 [Frontend] add continue_final_message parameter to /embeddings endpoint | frontend,ready | by kevin-pw (merged: 2025-12-30 15:21 (UTC+8))
#31505 [Bugfix]Fix pooling model always disabled due to incorrect PP rank check | ready,v1 | by vintipandey (merged: 2025-12-31 03:27 (UTC+8))
#31453 [BugFix] add select_gemm_impl on CompressedTensorsWNA16MoEMethod to support LoRA | ready | by JartX (merged: 2025-12-31 03:20 (UTC+8))
#31508 [Minor] Various small code cleanups/simplifications | structured-output,frontend,ready,v1,multi-modality | by njhill (merged: 2025-12-30 14:42 (UTC+8))
#28775 [Model] Add support for openPangu moe model | documentation,new-model,ready,v1 | by yt0428 (merged: 2025-12-31 00:11 (UTC+8))
#31408 Add Loraconfig parameter to get_punica_wrapper function | ready | by ZT-AIA (merged: 2025-12-30 14:27 (UTC+8))
#31525 [CPU] Disable async schedule on CPU | ready,cpu | by bigPYJ1151 (merged: 2025-12-30 20:34 (UTC+8))
#31491 [CI][NIXL] Split DPEP tests | ready,ci/build,v1,kv-connector | by NickLucche (merged: 2025-12-30 20:26 (UTC+8))
#31523 [ROCm][Bugfix] Fix accuracy issue on fmoe when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS enabled | rocm,ready | by ganyi1996ppo (merged: 2025-12-30 17:21 (UTC+8))
#31522 [xpu] [bugfix] upgrade to latest oneccl in dockerfile | ready,ci/build | by rogerxfeng8 (merged: 2025-12-30 14:52 (UTC+8))
#31500 Migrate meetups & sponsors [2/N] | documentation,ready | by esmeetu (merged: 2025-12-30 12:26 (UTC+8))
#31343 feat(frontend): add --default-chat-template-kwargs CLI argument | documentation,frontend,ready | by effortprogrammer (merged: 2025-12-30 11:38 (UTC+8))

PRs Closed Without Merging

#31538 [Model] Add template for qwen3-reranker | frontend,qwen | by alphabetc1 (closed: 2025-12-31 10:48 (UTC+8))
#25598 Suppress FA3 "wgmma.mma_async instructions are serialized" warning | ci/build,stale | by mgoin (closed: 2025-12-31 03:35 (UTC+8))
#26498 [Bugfix] Invalidate positions when using padded speculative decoding | documentation,speculative-decoding,needs-rebase,v1 | by benchislett (closed: 2025-12-30 22:37 (UTC+8))
#31393 [CI] Replace pre-commit with prek for faster pre-commit checks | documentation,ci/build,cpu | by j178 (closed: 2025-12-30 11:47 (UTC+8))
#31519 [BugFix] Fix NUMA node validation in CPU platform | cpu | by SameerAsal (closed: 2025-12-30 11:33 (UTC+8))
#31451 [BugFix] Preserve tool function name in delta messages when remaining arguments are empty | frontend | by luke396 (closed: 2025-12-30 11:18 (UTC+8))