vLLM Development Activity Report - 2025-12-22

Time window: 2025-12-22 10:50 (UTC+8) ~ 2025-12-23 10:50 (UTC+8)
Statistics: New Issues 17 | Closed Issues 9 | New PRs 78 | Merged PRs 29 | PRs Closed Without Merging 21

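📊 Daily Development Status Summary

Over the past 24 hours the vLLM community remained highly active, opening 17 new Issues and 78 new PRs and merging 29 PRs. Development focused on three areas: 1) AMD ecosystem enablement, in particular FP8 support for newer architectures (such as Strix Halo) and ROCm build fixes; 2) performance optimization and bug fixes across core modules, including attention, speculative decoding, and the compilation cache; 3) extended support for new hardware (such as NVIDIA Blackwell) and quantization formats (such as ModelOpt FP8). Overall, the project is iterating quickly and actively working through platform compatibility issues and performance bottlenecks.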

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was heavy in this window, mainly build fixes, support for a new architecture (Strix Halo), and AITER runtime optimizations.

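Issues:

#31155 (closed) & #31139: ROCm build broken. The unused variable rotary_dim introduced by PR #30821 caused the ROCm build to fail (-Werror=unused-variable). It was quickly fixed by #31156. The other build failure (the Docker image, #31139) may involve a more complicated HIP compilation warning issue.
#31086 (closed): ROCm Triton backend failure. Multiple tests failed when the Triton backend was enabled. The issue was temporarily resolved by rolling back to an older base image, which suggests a possible compatibility problem with the newer Triton version.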

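PRs (mostly from c0de128, focused on Strix Halo support):

FP8 support extension (#31184): Adds a gfx11 prefix check to the supports_fp8() function, intended to enable FP8 quantization on RDNA 3/3.5 architectures (including Strix Halo). Validation needed: reviewers asked for lm_eval results on gfx11 hardware to confirm that the feature works, and the contributor said they have no access to such hardware.
AITER device parameter unification (#31178, #31176, #31149): A series of PRs that replace the hard-coded device="cuda" with the device taken from the input tensor, to improve ROCm compatibility on multi-GPU setups and specific architectures. These also need validation on real hardware.
Code quality and bug fixes (#31177, #31121, #31119, #31118): Narrower exception handling, a fix for MoE expert assignment errors caused by Python list aliasing, a unified tensor slice assignment pattern, and a fix for an uninitialized variable. These harden the ROCm code paths at a basic level.
CI/test fixes (#31192, #31159): #31192 skips a V1 engine test that depends on fork, because ROCm creates processes with spawn rather than fork. #31159 rolls Triton back to a version that fixes a GPT-OSS import error on gfx950.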
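The list-aliasing class of bug mentioned for #31121 can be reproduced in a few lines. This is a generic illustration, not the vLLM code: the variable names and the "token-0" payload are made up for the example.

```python
# Hypothetical illustration of Python list aliasing; not taken from vLLM.
num_experts = 3

# Buggy pattern: the outer list holds three references to ONE inner list,
# so an append through any slot is visible through every slot.
aliased = [[]] * num_experts
aliased[0].append("token-0")

# Safer pattern: the comprehension creates a separate inner list per expert.
independent = [[] for _ in range(num_experts)]
independent[0].append("token-0")

print(aliased)
print(independent)
```

With the first pattern an item assigned to one expert shows up under every expert, which is the kind of silent mis-assignment such a fix guards against.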

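Summary: AMD ecosystem work focused on widening hardware support (gfx11) and hardening the code base (device compatibility, memory and logic errors). A notable challenge is the lack of an easily accessible gfx11 test environment, so some PRs cannot provide end-to-end performance validation and rely on CI and core maintainers for final confirmation.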
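As a rough sketch of what a prefix-based architecture gate looks like, the helper below matches an architecture string against a tuple of prefixes. The function name, its default argument, and the sample architecture strings are assumptions for illustration; this is not the body of supports_fp8() in vLLM.

```python
# Hypothetical sketch of a prefix-based architecture check.
# The helper name and default prefix tuple are illustrative assumptions.
def has_arch_prefix(arch_name: str, prefixes: tuple = ("gfx11",)) -> bool:
    """Return True if arch_name starts with any of the given prefixes."""
    # str.startswith accepts a tuple and returns True on the first match.
    return arch_name.startswith(prefixes)


print(has_arch_prefix("gfx1151"))  # a gfx11-family name
print(has_arch_prefix("gfx1030"))  # a name outside the gfx11 family
```

A check like this only decides which code path is taken. It says nothing about whether the kernels behind that path produce correct results on the hardware, which is why the reviewers asked for lm_eval results on a real gfx11 device.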

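💬 Analysis of High-Traffic Discussions

PR #31193: [Feature] Add iteration level logging and enhance nvtx marker
- Core topic: whether detailed iteration-level logging (request count, token count, elapsed time) and the enhanced NVTX markers should be enabled by default.
- Viewpoints:
  - Against enabling by default (wangshangsam, nvpohanh, robertgshaw2-redhat): the feature mainly serves expert users doing GPU profiling and tuning; for most ordinary users it is too verbose and may add unnecessary CPU overhead and log noise.
  - PR author's response (maxyanghu): based on the feedback, quickly changed the feature to be off by default, controlled by the environment variable VLLM_LOG_ITERATION_DETAILS.
- Conclusion: the feature is kept but follows a "quiet by default" principle, reflecting consideration for user experience and performance overhead.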
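The "quiet by default" pattern from the #31193 discussion can be sketched as an opt-in environment flag. Only the variable name VLLM_LOG_ITERATION_DETAILS comes from the report; the helper name and the set of accepted values ("1", "true") are assumptions, and vLLM's own parsing may differ.

```python
import os


# Hypothetical helper: the accepted values ("1", "true") are an assumption.
def iteration_logging_enabled(environ=None) -> bool:
    """Return True only when the flag is explicitly switched on."""
    env = os.environ if environ is None else environ
    # An unset variable falls back to "0", so the feature stays off.
    value = env.get("VLLM_LOG_ITERATION_DETAILS", "0")
    return value.strip().lower() in {"1", "true"}


print(iteration_logging_enabled({}))
print(iteration_logging_enabled({"VLLM_LOG_ITERATION_DETAILS": "1"}))
```

Passing the environment as a parameter keeps the helper easy to test without mutating os.environ.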
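Issue #31128: [Feature]: Add support of Blackwell SM121(DGX Spark)
- Core topic: a request for native support for NVIDIA DGX Spark (Blackwell, ARM64, CUDA 13).
- Discussion:
  - Problem: vLLM 0.13.0 depends on PyTorch 2.9.0, whose official wheels do not provide an ARM64 + CUDA build. The NGC PyTorch 2.10 image is currently the only option.
  - Community-provided solutions: several users contributed detailed steps for building from source, using the CUDA 13 specific wheels (available starting with v0.13.0), and using specific Docker images.
- Conclusion: the discussion shows that vLLM already supports Blackwell and CUDA 13 through nightly builds and version-specific wheels, but the official documentation and version compatibility notes may need updating to guide users better.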
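Issue #31148 & PR #31198: Jais2 model unsupported rotary_dim kwarg
- Core topic: the Jais2 model fails at initialization because get_rope() receives an unsupported rotary_dim argument.
- Discussion and resolution: the fix PR (#31198) was quickly linked in the Issue. The PR removes the faulty LoRA logic that caused the problem. This is a typical fast report-to-fix case.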
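Issue #31155: [Bug] [ROCm] [Critical]: ROCm build broken
- Core topic: the ROCm build was broken by an unused-variable error, a high-priority blocking problem.
- Discussion and resolution: the problem was quickly marked critical and was fixed and merged within hours through #31156, showing a fast response to build problems on a key platform.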

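🔥 Hot Topics and Trends

Wave of new hardware support: the community is actively addressing new hardware such as NVIDIA Blackwell (SM100, SM120) and AMD RDNA 3.5 (Strix Halo). Discussions and PRs cover architecture detection, kernel dispatch, CUDA version compatibility (12 vs 13), and installation paths.
Expanding quantization support: FP8 quantization support keeps deepening, ranging from new NVIDIA ModelOpt variants (#30957) to AMD consumer GPUs. Quantization support for MoE models (such as FP8 MoE kernel selection) is another active sub-area.
Compilation and caching issues: an Issue (#31199) and a PR (#31183) involve conflicts in the TorchInductor compilation cache under multi-process concurrency, as well as CUDA Graph capture stability, reflecting the complexity that comes with pushing for maximum performance.
Bugs in distributed inference: bug reports on the KV cache connector under the P/D architecture (#31145) and on Tensor Parallel combined with speculative decoding (#31154) show that state synchronization and determinism remain challenging in complex distributed scenarios.
Model support and CI maintenance: because upstream models were removed (such as mosaicml/mpt-7b), the related tests had to be disabled promptly (#31182). API changes in the Transformers library also require ongoing adaptation (#31181, #31146).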
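For the cache-concurrency theme above, one common general technique for keeping readers from seeing a half-written file is to write to a temporary file and rename it into place. The sketch below shows that technique in isolation; it is not a description of how TorchInductor or vLLM manage their caches, and the function name is hypothetical.

```python
import os
import pickle
import tempfile


# Hypothetical helper showing the generic write-then-rename technique.
def atomic_pickle_dump(obj, path: str) -> None:
    """Pickle obj to path so that readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the complete file into place in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Concurrent writers can still overwrite each other's result with this approach, but a reader opens either the old complete file or the new complete file.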

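🛠️ Key Technical Changes

PR #31197 (merged): Revert [SM100] Enable fp8 compute for prefill MLA: an urgent revert of the previously merged #30746. Until the FlashInfer update it depends on (#30993) is ready, the change could cause problems (such as breaking Blackwell CI). Impact: FP8 acceleration for the MLA prefill phase is withdrawn for now, pending a stable underlying library.
PR #31167 (merged): [Perf] Remove blocking copy in GDN Attention: removes a blocking copy in the GDN attention layer of the Qwen3-Next model. Impact: clears the way for fully non-blocking async scheduling and MTP, with a measured performance gain of about 1% in small-batch scenarios.
PR #31192 (merged): [AMD][CI] fix v1/engine test_preprocess_error_handling: the test is skipped because the ROCm platform uses the spawn multiprocessing method while the test depends on fork. Impact: keeps AMD CI passing, but also highlights the test complexity that cross-platform multiprocessing models introduce.
PR #30957 (merged): Support NVIDIA ModelOpt HF FP8 variants: adds loading support for two ModelOpt-specific FP8 formats (FP8_PER_CHANNEL_PER_TOKEN, FP8_PB_WO). Impact: extends vLLM's compatibility with NVIDIA-optimized models and meets the needs of specific user workflows.
Issue #31200: [Bug]: class Request and block_hasher has circular reference: points out a circular reference between Request and block_hasher in the multimodal feature cache, which may cause a memory leak. Impact: a weakref-based fix is proposed; the issue matters for long-running multimodal services.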
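The fork-versus-spawn point behind #31192 can be expressed as a conditional skip. This sketch uses the standard library's unittest rather than the project's actual test setup, and the helper and class names are hypothetical.

```python
import multiprocessing
import unittest


# Hypothetical helper; not part of the vLLM test suite.
def default_start_method_is_fork() -> bool:
    # get_start_method() reports the start method in effect; if none has
    # been chosen yet it settles on the platform default and returns it.
    return multiprocessing.get_start_method() == "fork"


class ForkDependentSketch(unittest.TestCase):
    @unittest.skipUnless(default_start_method_is_fork(),
                         "relies on fork semantics; skipped under spawn")
    def test_needs_fork(self):
        # Placeholder body: a real test would exercise behaviour that only
        # holds when the child process is forked from the parent.
        self.assertEqual(multiprocessing.get_start_method(), "fork")
```

Skipping keeps CI green on platforms that use spawn, at the cost of leaving the fork-dependent behaviour untested there.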

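📈 Development Activity Observations

Contributor activity: very high. User c0de128 submitted a dense series of PRs for AMD-related fixes. Core contributors such as robertgshaw2-redhat, jeejeelee, mgoin, and AndreasKaratzas stayed heavily engaged in reviewing, fixing, and merging.
Code review and merge speed: critical build fixes and obvious bug fixes get a very fast response. #31156 was merged within hours of the report, and the fix PR #31198 was opened within a day of the Issue. For PRs involving performance optimization and new features, review is stricter and usually asks for evaluation results (such as lm_eval results) or a discussion of the impact on default behavior (as in #31193).
Cross-platform collaboration: in AMD-related PR discussions, review requests from AMD staff (@tjtanaa asking for test results on hardware) meet external contributors who lack that hardware; the community is exploring how to collaborate effectively under this constraint.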

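💡 Issues Worth Watching

FP8 support validation on AMD Strix Halo (gfx11): PR #31184 and related PRs add the code support but lack validation on real hardware. This is a key step toward wider use of vLLM on AMD consumer GPUs and needs coordination by the community or within AMD to complete final validation.
Memory leak risk in multimodal serving: the circular reference described in Issue #31200 may affect long-term service stability, and its proposed fix deserves prompt evaluation and merging.
Qwen3-Next MTP stability: Issue #31186 reports a crash in Qwen3-Next under multi-token prediction (MTP). Qwen3-Next is a recent and important model series whose performance and stability are closely watched.
Concurrency safety of the compilation cache: the TorchInductor compilation cache corruption under concurrent multi-process loading reported in Issue #31199 may affect the stability of high-concurrency deployments.
Coordination logic in complex distributed configurations: Issues #31145 and #31154 expose edge cases in the multi-connector (Multi-Connector) and in N-gram speculative decoding under TP, respectively, suggesting that more thorough integration testing is needed when complex distributed components are combined.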
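The weakref approach mentioned for Issue #31200 can be sketched with a toy pair of objects. The Request class and make_block_hasher function below are simplified stand-ins invented for this example; they are not vLLM's actual classes or its proposed patch.

```python
import weakref


# Simplified stand-in; not vLLM's Request class.
class Request:
    def __init__(self, request_id: str):
        self.request_id = request_id
        self.block_hasher = None


def make_block_hasher(request: Request):
    # Keep only a weak reference, so the closure stored on the request
    # does not hold the request alive (no Request -> hasher -> Request cycle).
    request_ref = weakref.ref(request)

    def block_hasher():
        req = request_ref()
        if req is None:
            return None  # the request has already been freed
        return hash(req.request_id)

    return block_hasher


req = Request("req-0")
req.block_hasher = make_block_hasher(req)
probe = weakref.ref(req)   # lets us observe when the request is freed
value = req.block_hasher()
del req
print(probe() is None)     # in CPython the request is freed by refcounting
```

Had the closure captured the request directly, the two objects would reference each other and would only be reclaimed when the cyclic garbage collector runs, which is the kind of delayed release that matters for a long-running service.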

📋 Appendix: Detailed Data Lists

New Issues

#31202 [Bug]: Mixtral Fp8 Accuracy is Degraded | bug,help wanted | by robertgshaw2-redhat (created: 2025-12-23 10:27 (UTC+8))
#31200 [Bug]: class Request and block_hasher has cirular reference, may cause memory leak. | bug | by frelam (created: 2025-12-23 09:55 (UTC+8))
#31199 [Bug]: UnpicklingError during concurrent model compilation on multiple GPUs | bug | by AIDevCuda (created: 2025-12-23 09:46 (UTC+8))
#31148 [Bug]: Jais2 model in vLLM 0.13.0: get_rope() called with unsupported rotary_dim kwarg (TypeError during model init) | bug | by NikolasTh90 (created: 2025-12-22 21:18 (UTC+8))
#31181 [CI Failure]: Transformers Nightly Models Test | ci-failure | by AndreasKaratzas (created: 2025-12-23 05:04 (UTC+8))
#31186 [Bug]: Qwen3-Next MTP Crash | bug | by benchislett (created: 2025-12-23 06:02 (UTC+8))
#31128 [Feature]: Add support of Blackwell SM121(DGX Spark) | feature request | by yanyunl1991 (created: 2025-12-22 15:35 (UTC+8))
#31170 [Bug]: Torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device | bug | by shahizat (created: 2025-12-23 02:40 (UTC+8))
#31123 [Bug]: skip_tokenizer_init=True crashes google/gemma-3-27b-it | bug | by terrykong (created: 2025-12-22 14:40 (UTC+8))
#31165 [CI Failure]: Entrypoints | ci-failure | by robertgshaw2-redhat (created: 2025-12-23 00:51 (UTC+8))
#31155 [Bug] [ROCm] [Critical]: ROCm build broken | bug,rocm | by tjtanaa (created: 2025-12-22 22:56 (UTC+8))
#31154 [Bug]: Ngram draft tokens diverge across TP ranks when using external_launcher. | bug | by xiaoxiaosuaxuan (created: 2025-12-22 22:52 (UTC+8))
#31145 [Bug]: [P/D] multi-connector cannot be used together with a KV connector that uses a push scheme | bug | by liziyu179 (created: 2025-12-22 20:21 (UTC+8))
#31139 [Bug]: Build vllm/rocm docker image | bug,rocm | by JartX (created: 2025-12-22 19:01 (UTC+8))
#31136 [Bug]: error when run examples/online_serving/prompt_embed_inference_with_openai_client.py | bug | by yuekaizhang (created: 2025-12-22 18:10 (UTC+8))
#31124 [Bug]: vllm.entrypoints.openai.api_server started but can’t be accessed in wsl | bug | by Usigned (created: 2025-12-22 14:44 (UTC+8))
#31122 [Bug]: vllm.entrypoints.openai.api_server started but can’t be access on wsl | bug | by Usigned (created: 2025-12-22 14:40 (UTC+8))

Closed Issues

#14954 [Bug]: vLLM running on Unspecified Platform raises NotImplementedError when using podman/docker-compose | bug,stale | by BastianBN (closed: 2025-12-23 10:46 (UTC+8))
#17171 [Bug]: Qwen2VL-2b / Qwen2.5-7b has AssertionError and Cuda error when qps goes higher | bug,stale | by Ericoool9614 (closed: 2025-12-23 10:46 (UTC+8))
#22961 [Bug]: TypeError: MoeWNA16Method.get_weight_loader..moe_wna16_weight_loader() got an unexpected keyword argument 'return_sucess' | bug,stale | by n1ck-guo (closed: 2025-12-23 10:46 (UTC+8))
#22965 [Bug]: [xPyD]Abnormal results when using v1 P2pNcclConnector as KV cache transport: repeated requests for the same input produce abnormal outputs | bug,stale | by qianyang01 (closed: 2025-12-23 10:46 (UTC+8))
#31202 [Bug]: Mixtral Fp8 Accuracy is Degraded | bug,help wanted | by robertgshaw2-redhat (closed: 2025-12-23 10:42 (UTC+8))
#31181 [CI Failure]: Transformers Nightly Models Test | ci-failure | by AndreasKaratzas (closed: 2025-12-23 08:46 (UTC+8))
#31086 [Bug][ROCm]: Triton backend broken | bug,rocm | by AndreasKaratzas (closed: 2025-12-23 08:45 (UTC+8))
#31165 [CI Failure]: Entrypoints | ci-failure | by robertgshaw2-redhat (closed: 2025-12-23 00:54 (UTC+8))
#31155 [Bug] [ROCm] [Critical]: ROCm build broken | bug,rocm | by tjtanaa (closed: 2025-12-23 00:30 (UTC+8))

New PRs

#31193 [Feature] Add iteration level logging and enhance nvtx marker | v1,nvidia | by maxyanghu (created: 2025-12-23 08:02 (UTC+8))
#31203 [ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention by replacing .view() with .reshape() | rocm,multi-modality | by AndreasKaratzas (created: 2025-12-23 10:49 (UTC+8))
#31178 [ROCm][Strix Halo] Fix for device parameter in AITER topK metadata | rocm | by c0de128 (created: 2025-12-23 03:53 (UTC+8))
#31184 [ROCm][Strix Halo] Fix for FP8 support detection on gfx11x architectures | rocm | by c0de128 (created: 2025-12-23 05:36 (UTC+8))
#31176 [ROCm][Strix Halo] Fix for hardcoded device in MLA sparse attention | rocm,nvidia | by c0de128 (created: 2025-12-23 03:39 (UTC+8))
#31197 Revert “[SM100] Enable fp8 compute for prefill MLA (#30746)” | v1,nvidia | by pavanimajety (created: 2025-12-23 09:39 (UTC+8))
#31201 Add nvidia h800 moe config | nvidia | by lengrongfu (created: 2025-12-23 10:20 (UTC+8))
#31131 [Model] Add verify_and_update_model_config for VerifyAndUpdateConfig. | qwen | by noooop (created: 2025-12-22 17:15 (UTC+8))
#31169 [MoE Refactor][10/N] Cleanup Disk -> Kernel Shuffle | nvidia | by robertgshaw2-redhat (created: 2025-12-23 02:04 (UTC+8))
#31166 [Bugfix] Fix MLA attention crash when using DP with DCP | needs-rebase,v1 | by sachinkumarsingh092 (created: 2025-12-23 01:06 (UTC+8))
#31188 [Doc] Add Claude code usage example | documentation | by mgoin (created: 2025-12-23 06:11 (UTC+8))
#31162 [Feature] OTEL tracing during loading | frontend,needs-rebase,ci/build,v1,cpu | by emricksini-h (created: 2025-12-23 00:36 (UTC+8))
#31161 [Bugfix] Fix MoE LoRA bin/pt loading | ready | by jeejeelee (created: 2025-12-23 00:17 (UTC+8))
#31194 [ci] Fix Pytorch compilation test oom in 2.10 | ready | by angelayi (created: 2025-12-23 08:23 (UTC+8))
#31198 [Bugfix] Fix Jais2ForCausalLM | no labels | by jeejeelee (created: 2025-12-23 09:42 (UTC+8))
#31147 Add prefix continuation feature to DeepSeek v3.2 | deepseek | by PHOEBEMOON0802 (created: 2025-12-22 20:57 (UTC+8))
#31185 [PERF] Use cutlass_scaled_mm for Blackwell instead of deep-gemm’s blockscale gemm | nvidia | by vadiklyutiy (created: 2025-12-23 05:45 (UTC+8))
#31175 [Bugfix] Properly apply v_scale for mimo_v2_flash | no labels | by mgoin (created: 2025-12-23 03:37 (UTC+8))
#31149 [Bugfix][ROCm][Dynamo][DS 3.1][FP8] fix unsupported hasattr call when Dynamo tracing for ROCm device | rocm,ready | by zejunchen-zejun (created: 2025-12-22 21:31 (UTC+8))
#31196 [compile] Remove torch 2.9 patches | no labels | by angelayi (created: 2025-12-23 09:20 (UTC+8))
#31195 Fix TRTLLM Ragged Attention FP8 Path | ready,v1,nvidia | by pavanimajety (created: 2025-12-23 09:18 (UTC+8))
#31192 [AMD][CI] fix v1/engine test_preprocess_error_handling | rocm,ready,v1 | by divakar-amd (created: 2025-12-23 07:41 (UTC+8))
#31144 fix multiconnector for multi connector use push kv connector | kv-connector | by liziyu179 (created: 2025-12-22 20:13 (UTC+8))
#31189 Amd ci/rlhf colocate | rocm,ci/build | by rjrock (created: 2025-12-23 06:45 (UTC+8))
#31160 [Bug] Fix Number of dimensions of tensors must match. for Deepseek V3.2 | ready,deepseek | by yewentao256 (created: 2025-12-23 00:11 (UTC+8))
#31190 [Transformers][Bugfix] Migrated to new transformers nightly logic | no labels | by AndreasKaratzas (created: 2025-12-23 07:01 (UTC+8))
#31182 [CI Failure] Disable mosaicml/mpt-7b and databricks/dbrx-instruct tests | documentation,ready,ci-failure | by mgoin (created: 2025-12-23 05:15 (UTC+8))
#31191 Fix RecursionError in MediaWithBytes unpickling | multi-modality | by nrghosh (created: 2025-12-23 07:02 (UTC+8))
#31167 [Perf] Remove blocking copy in GDN Attention | performance,ready,v1 | by benchislett (created: 2025-12-23 01:45 (UTC+8))
#31173 [Bug] Fix 'CutlassMLAImpl' object has no attribute '_workspace_buffer' | ready,v1,nvidia | by yewentao256 (created: 2025-12-23 03:10 (UTC+8))
#31187 [CI/ROCm] Fixing “V1 Test attention (H100)” test group. | rocm,v1 | by Alexei-V-Ivanov-AMD (created: 2025-12-23 06:07 (UTC+8))
#31183 [wip] disable compile cache in test_fusion_attn | no labels | by angelayi (created: 2025-12-23 05:34 (UTC+8))
#31110 [Bugfix][ROCm] Fix AITER method typos and invalid staticmethod | rocm | by c0de128 (created: 2025-12-22 11:17 (UTC+8))
#31157 [Bugfix][ROCm] Fix typo: triton_fp4_gemm_dynamic_qaunt -> quant | rocm | by c0de128 (created: 2025-12-22 23:39 (UTC+8))
#31118 [ROCm][Strix Halo] Fix for uninitialized prefix_scheduler_metadata | rocm,v1 | by c0de128 (created: 2025-12-22 12:51 (UTC+8))
#31119 [ROCm][Strix Halo] Fix for tensor slice assignment in MLA | rocm,v1 | by c0de128 (created: 2025-12-22 13:00 (UTC+8))
#31121 [ROCm][Strix Halo] Fix for list aliasing in fused MoE initialization | rocm | by c0de128 (created: 2025-12-22 13:19 (UTC+8))
#31177 [ROCm][Strix Halo] Fix for exception types in AITER MLA FP8 check | rocm | by c0de128 (created: 2025-12-23 03:48 (UTC+8))
#31179 [ROCm][Strix Halo] Fix for FP8 dtype in silu_mul quantization | rocm | by c0de128 (created: 2025-12-23 04:20 (UTC+8))
#31180 [WIP] Mimo v2 flash mtp | new-model | by mgoin (created: 2025-12-23 04:53 (UTC+8))
#31172 [NVIDIA] Add env var to override FlashInfer FP8 BMM | nvidia | by Kh4L (created: 2025-12-23 03:09 (UTC+8))
#31174 [Doc] Add vllm-metal to hardware plugin documentation | documentation,ready,cpu | by mgoin (created: 2025-12-23 03:15 (UTC+8))
#31168 Check if item.id exists to prevent 500 | frontend,ci/build | by gdombiak (created: 2025-12-23 01:51 (UTC+8))
#31158 [torch.compile] Improve encoder compilation detection in PiecewiseBackend | qwen | by ilmarkov (created: 2025-12-22 23:43 (UTC+8))
#31171 [perf] Integrate flashinfer concat_mla_k | v1 | by jiahanc (created: 2025-12-23 02:41 (UTC+8))
#31153 [Chore] Update more locations to use attention_config.backend | performance,ready | by DarkLight1337 (created: 2025-12-22 22:46 (UTC+8))
#31159 [ROCm][CI/Build] Fix triton version to one that has triton_kernels required for gpt-oss to run | rocm,ready,ci/build,gpt-oss | by gshtras (created: 2025-12-22 23:52 (UTC+8))
#31164 [openai api] log http exception in handler | frontend | by andyxning (created: 2025-12-23 00:48 (UTC+8))
#31125 [UX] improve profiler error message | ready,v1 | by BoyuanFeng (created: 2025-12-22 14:48 (UTC+8))
#31140 [Feature] Support weight-shape-unaligned block-scale fp8 models | no labels | by Wanli-Jiang (created: 2025-12-22 19:29 (UTC+8))
#31150 [BugFix] Fix architecture flags to prevent issues on SM103 | ready,ci/build,nvidia | by LopezCastroRoberto (created: 2025-12-22 21:44 (UTC+8))
#31163 Fix prefill trace warmup | documentation,frontend,ci/build,v1 | by sraizada-tt (created: 2025-12-23 00:36 (UTC+8))
#31156 [ROCm] [Critical]: Remove unused variable | rocm,ready | by tjtanaa (created: 2025-12-22 23:14 (UTC+8))
#31137 [misc] Sort uvicorn log level description according to verbosity | frontend | by andyxning (created: 2025-12-22 18:13 (UTC+8))
#31113 Fix document of torchrun_example.py | documentation | by foreverlms (created: 2025-12-22 12:01 (UTC+8))
#31151 [CI][Bugfix] Fix entrypoints/openai/test_audio.py | ready | by NickLucche (created: 2025-12-22 22:03 (UTC+8))
#31152 cwm tool parser | no labels | by ErezSC42 (created: 2025-12-22 22:43 (UTC+8))
#31146 Add util function for checking nesting of rope parameters | ready | by hmellor (created: 2025-12-22 20:22 (UTC+8))
#31142 Don’t tolerate llm_config instead of text_config | ready | by hmellor (created: 2025-12-22 19:57 (UTC+8))
#31135 [V1] align text, token_ids and logprobs under stop buffering with str… | v1 | by quanliu1991 (created: 2025-12-22 18:01 (UTC+8))
#31134 Add prefix continuation to DeepSeek v3.2 | deepseek | by PHOEBEMOON0802 (created: 2025-12-22 17:40 (UTC+8))
#31143 Add encode time | documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1 | by LJH-LBJ (created: 2025-12-22 20:07 (UTC+8))
#31141 Epd mooncake engine | documentation,frontend,v1,kv-connector | by khuonglmhw (created: 2025-12-22 19:32 (UTC+8))
#31138 [Mistral common] Ensure all functions are imported from the top & only use public methods | ci/build | by patrickvonplaten (created: 2025-12-22 18:51 (UTC+8))
#31112 [CI]Replace pip in docker.xpu with uv pip | ci/build | by 1643661061leo (created: 2025-12-22 11:54 (UTC+8))
#31132 [Model] Fix bagel failed to run | no labels | by Potabk (created: 2025-12-22 17:16 (UTC+8))
#31133 [Model] use maybe_all_reduce_tensor_model_parallel | no labels | by wangxiyuan (created: 2025-12-22 17:30 (UTC+8))
#31129 [DeepSeek v3.2] Add prefix contiunuation feature for DeepSeek v3.2 | deepseek | by PHOEBEMOON0802 (created: 2025-12-22 15:46 (UTC+8))
#31130 [Bugfix] Fix shape mismatch in sparse 2:4 bitmask decompression for vision models | no labels | by majiayu000 (created: 2025-12-22 16:21 (UTC+8))
#31127 [Frontend] Make pooling entrypoints request schema consensus. | documentation,frontend,multi-modality | by noooop (created: 2025-12-22 15:16 (UTC+8))
#31126 Add explicit n:1 parameter to OpenAI API payloads in benchmark functions | performance | by wisdomfriend (created: 2025-12-22 15:08 (UTC+8))
#31109 [Bugfix][ROCm] Fix typo: is_linear_fp8_enaled -> is_linear_fp8_enabled | rocm | by c0de128 (created: 2025-12-22 11:12 (UTC+8))
#31120 [Misc] Fix typo: ‘occured’ -> ‘occurred’ | no labels | by c0de128 (created: 2025-12-22 13:02 (UTC+8))
#31117 [Misc] Fix spelling typos in model comments | qwen | by c0de128 (created: 2025-12-22 12:14 (UTC+8))
#31115 [Misc] Fix grammar errors in comments and messages | no labels | by c0de128 (created: 2025-12-22 12:10 (UTC+8))
#31116 [Misc] Fix quantization-related typos | no labels | by c0de128 (created: 2025-12-22 12:12 (UTC+8))
#31114 [Misc] Fix spelling typos in comments | ci/build,multi-modality | by c0de128 (created: 2025-12-22 12:07 (UTC+8))
#31111 [ROCm][Refactor] Move the contiguous logic for ROCm in torch_sdpa_wrapper into MMEncoderAttention | rocm | by shen-shanshan (created: 2025-12-22 11:33 (UTC+8))

Merged PRs

#30097 [Feature] Batch invariant: Lora | ready | by quanliu1991 (merged: 2025-12-23 10:32 (UTC+8))
#31197 Revert “[SM100] Enable fp8 compute for prefill MLA (#30746)” | v1,nvidia | by pavanimajety (merged: 2025-12-23 10:15 (UTC+8))
#31194 [ci] Fix Pytorch compilation test oom in 2.10 | ready | by angelayi (merged: 2025-12-23 09:56 (UTC+8))
#31192 [AMD][CI] fix v1/engine test_preprocess_error_handling | rocm,ready,v1 | by divakar-amd (merged: 2025-12-23 09:28 (UTC+8))
#30746 [SM100] Enable fp8 compute for prefill MLA | documentation,rocm,ready,ci/build,v1,multi-modality,tool-calling,deepseek,nvidia | by pavanimajety (merged: 2025-12-23 03:15 (UTC+8))
#31102 [MoE Refactor][7/N] AITER MK | rocm,ready,v1 | by robertgshaw2-redhat (merged: 2025-12-23 07:42 (UTC+8))
#31182 [CI Failure] Disable mosaicml/mpt-7b and databricks/dbrx-instruct tests | documentation,ready,ci-failure | by mgoin (merged: 2025-12-23 07:40 (UTC+8))
#31167 [Perf] Remove blocking copy in GDN Attention | performance,ready,v1 | by benchislett (merged: 2025-12-23 06:25 (UTC+8))
#31173 [Bug] Fix 'CutlassMLAImpl' object has no attribute '_workspace_buffer' | ready,v1,nvidia | by yewentao256 (merged: 2025-12-23 06:24 (UTC+8))
#29845 [SpecDecode] Simplified alternative padded-speculation acceptance rate fix | rocm,speculative-decoding,ready,v1 | by LucasWilkinson (merged: 2025-12-23 05:06 (UTC+8))
#31174 [Doc] Add vllm-metal to hardware plugin documentation | documentation,ready,cpu | by mgoin (merged: 2025-12-23 04:06 (UTC+8))
#31052 [MoE Refactor][9/N] Use modular kernel for unquantized Triton MoE | ready | by zyongye (merged: 2025-12-23 01:34 (UTC+8))
#31159 [ROCm][CI/Build] Fix triton version to one that has triton_kernels required for gpt-oss to run | rocm,ready,ci/build,gpt-oss | by gshtras (merged: 2025-12-23 01:19 (UTC+8))
#31125 [UX] improve profiler error message | ready,v1 | by BoyuanFeng (merged: 2025-12-23 00:46 (UTC+8))
#31156 [ROCm] [Critical]: Remove unused variable | rocm,ready | by tjtanaa (merged: 2025-12-23 00:28 (UTC+8))
#31040 [AMD][CI] Add “V1 Test e2e + engine” to mi325_8 Agent Pool | rocm,ready,ci/build | by micah-wil (merged: 2025-12-22 23:41 (UTC+8))
#31151 [CI][Bugfix] Fix entrypoints/openai/test_audio.py | ready | by NickLucche (merged: 2025-12-22 23:21 (UTC+8))
#30781 [CI] add polling for precompiled wheel in python_only_compile.sh, fix index generation for releases | ready,ci/build | by Harry-Chen (merged: 2025-12-22 21:24 (UTC+8))
#30242 [BugFix] skip language model in Encoder | documentation,ready,v1,qwen,kv-connector | by Bounty-hunter (merged: 2025-12-22 21:25 (UTC+8))
#30205 [gpt-oss] Fix harmony parser in streaming responses | frontend,ready,gpt-oss | by AlonKejzman (merged: 2025-12-22 20:56 (UTC+8))
#31132 [Model] Fix bagel failed to run | no labels | by Potabk (merged: 2025-12-22 18:15 (UTC+8))
#31083 Update MiniMax-M2 ToolCall and add MiniMax-M2.1 in Docs | documentation,ready | by rogeryoungh (merged: 2025-12-22 13:28 (UTC+8))
#31109 [Bugfix][ROCm] Fix typo: is_linear_fp8_enaled -> is_linear_fp8_enabled | rocm | by c0de128 (merged: 2025-12-22 13:14 (UTC+8))
#31120 [Misc] Fix typo: ‘occured’ -> ‘occurred’ | no labels | by c0de128 (merged: 2025-12-22 13:14 (UTC+8))
#31117 [Misc] Fix spelling typos in model comments | qwen | by c0de128 (merged: 2025-12-22 13:14 (UTC+8))
#31115 [Misc] Fix grammar errors in comments and messages | no labels | by c0de128 (merged: 2025-12-22 13:14 (UTC+8))
#31116 [Misc] Fix quantization-related typos | no labels | by c0de128 (merged: 2025-12-22 13:13 (UTC+8))
#31114 [Misc] Fix spelling typos in comments | ci/build,multi-modality | by c0de128 (merged: 2025-12-22 13:13 (UTC+8))
#30957 [Feature]: Support NVIDIA ModelOpt HF FP8 variants FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO in vLLM | documentation,frontend,ready,nvidia | by CedricHwong (merged: 2025-12-22 11:34 (UTC+8))

PRs Closed Without Merging

#17368 enable multiple platform device in DP init | needs-rebase,unstale,v1 | by HanlinDu (closed: 2025-12-23 09:13 (UTC+8))
#19844 [BugFix][V0] Fix AssertionError for prompt_logprobs | needs-rebase,unstale | by xu-song (closed: 2025-12-23 08:16 (UTC+8))
#31190 [Transformers][Bugfix] Migrated to new transformers nightly logic | no labels | by AndreasKaratzas (closed: 2025-12-23 08:04 (UTC+8))
#29605 [BugFix] num_cpu_blocks metrics is None in cache_config_info even when using OffloadingConnector | v1,qwen | by sts07142 (closed: 2025-12-23 07:49 (UTC+8))
#31110 [Bugfix][ROCm] Fix AITER method typos and invalid staticmethod | rocm | by c0de128 (closed: 2025-12-23 05:11 (UTC+8))
#31157 [Bugfix][ROCm] Fix typo: triton_fp4_gemm_dynamic_qaunt -> quant | rocm | by c0de128 (closed: 2025-12-23 05:11 (UTC+8))
#23437 [Frontend] Add deepseek v3.1 reasoning parser | frontend,needs-rebase,qwen,deepseek | by arsenetar (closed: 2025-12-23 02:13 (UTC+8))
#21084 [CI/Build] Add the Nixl test to CI | ready,needs-rebase,ci/build,stale,kv-connector | by kouroshHakha (closed: 2025-12-23 02:08 (UTC+8))
#23870 [Benchmark] Add ability to round robin over a set of urls for benchmarking | performance,needs-rebase,ci/build,nvidia | by kouroshHakha (closed: 2025-12-23 02:08 (UTC+8))
#31163 Fix prefill trace warmup | documentation,frontend,ci/build,v1 | by sraizada-tt (closed: 2025-12-23 00:36 (UTC+8))
#25168 [CI/build] Abort CI if pre-commit fails | documentation,ci/build | by tmuttaki (closed: 2025-12-22 22:25 (UTC+8))
#24210 [DO NOT MERGE] PR for testing | needs-rebase,ci/build,unstale | by tmuttaki (closed: 2025-12-22 22:25 (UTC+8))
#31142 Don’t tolerate llm_config instead of text_config | ready | by hmellor (closed: 2025-12-22 21:59 (UTC+8))
#31134 Add prefix continuation to DeepSeek v3.2 | deepseek | by PHOEBEMOON0802 (closed: 2025-12-22 20:46 (UTC+8))
#31143 Add encode time | documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1 | by LJH-LBJ (closed: 2025-12-22 20:08 (UTC+8))
#31141 Epd mooncake engine | documentation,frontend,v1,kv-connector | by khuonglmhw (closed: 2025-12-22 19:32 (UTC+8))
#31129 [DeepSeek v3.2] Add prefix contiunuation feature for DeepSeek v3.2 | deepseek | by PHOEBEMOON0802 (closed: 2025-12-22 16:40 (UTC+8))
#30478 fix gme model do not use mrope | qwen | by zhuikefeng986285005-byte (closed: 2025-12-22 14:11 (UTC+8))
#31111 [ROCm][Refactor] Move the contiguous logic for ROCm in torch_sdpa_wrapper into MMEncoderAttention | rocm | by shen-shanshan (closed: 2025-12-22 12:40 (UTC+8))
#20036 [New Model] Support StableLMAlphaForCausalLM | new-model | by b8zhong (closed: 2025-12-22 11:50 (UTC+8))
#20901 [WIP][EPLB] Enable Llama4 EPLB | needs-rebase,llama | by b8zhong (closed: 2025-12-22 11:50 (UTC+8))