vLLM Development Activity Report - 2025-12-09

Time window: 2025-12-09 21:37 (UTC+8) ~ 2025-12-10 21:37 (UTC+8)
Statistics: New issues 17 | Closed issues 19 | New PRs 37 | Merged PRs 37 | PRs closed without merging 23

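📊 Daily Development Status Summary

During this observation window the vLLM project remained highly active: 37 PRs were opened and 37 were merged, reflecting a fast pace of iteration and bug fixing. Development focused on AMD platform compatibility and optimization, a series of fixes for the DeepSeek-V3.2 model, and feature work on the CPU and TPU backends. Several newly opened RFCs also show the community actively discussing architectural changes such as online quantization and the benchmarking tool.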

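🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was frequent in this period, covering compatibility fixes, performance optimizations, and CI improvements.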

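ID	Type	Title	Key contributor	Summary and impact
#30360	Issue	[RFC]: Consolidate FP8 min/max values into somewhere reasonable (Python only)	`rasmith`	[Design proposal] Points out that on platforms such as MI300, which use `torch.float8_e4m3fnuz`, the recommended FP8 min/max values (±224) conflict with the PyTorch defaults (±240), causing test failures. The proposal is to manage the recommended FP8 values in one central place, resolving platform differences and preventing the problem from recurring. It is an important step toward better FP8 quantization accuracy and more robust code on AMD platforms.
#30361	PR	[Attention][AMD] Make flash-attn optional	`mgehre-amd`	[Compatibility fix] Fixes a blocking issue on ROCm: `vllm.v1.spec_decode.eagle` imported the `flash-attn` utility module unconditionally, so startup failed whenever `flash-attn` was not installed, even for non-Eagle inference using another backend such as Triton. The PR makes the import conditional, giving ROCm environments more flexibility in dependency management.
#30364	PR	[Bugfix] awq_gemm: fix argument order swap	`mgehre-amd`	[Code cleanup] Corrects the order in which the `scales` and `zeros` parameters are declared and passed inside `_custom_ops.awq_gemm`, so that it matches the callers and the CUDA kernel implementation. The fix does not change behavior but improves readability and maintainability, which benefits AWQ quantization work on AMD platforms.
#30308	PR (merged)	[bugfix][quantization] fix quark qwen3 kv_cache quantization	`haoyangli-amd`	[Quark quantization fix] Fixes incorrect identification of KV cache scaling factors when Qwen3 MoE models are quantized with Quark. The scales are now identified by calling the base class's `get_cache_scale` method, which keeps inference accurate. This is an important improvement to AMD Quark's support for complex models.
#29937	PR (merged)	Improve wvsplitK tile and balance heuristics.	`amd-hhashemi`	[Performance] Improves the tile-size selection and load-balancing heuristics for "skinny" GEMM (matrix multiplication) operations. The changes are based on profiling of large models such as Llama3.1/3.3 and aim to improve compute-kernel efficiency on AMD hardware.
#25552	PR (merged)	[ROCm] Aiter Quant Kernels	`vllmellm`	[Performance] Integrates Aiter's FP8 quantization kernels, supporting static and dynamic per-tensor quantization and dynamic per-token quantization. Performance data from the merged PR shows total throughput (tok/s) improving by roughly 3% on models such as Llama-3.1-70B-Instruct-FP8-KV, a key improvement to quantized inference on AMD ROCm.
#25693	PR (merged)	[Rocm][torch.compile] Adding layernorm + fp8 block quant and silu + fp8 block quant for Aiter	`charlifu`	[Performance] Adds fused LayerNorm + FP8 block quantization and SiLU + FP8 block quantization operators for Aiter. Fusion reduces memory traffic and kernel-launch overhead, with the aim of further improving quantized-model inference on AMD platforms.
#28314	Issue (closed)	[AMD][CI Failure]: Tracking failure for AMD CI Dependencies & Environments	`zhewenl`	[CI fix] This tracking issue for AMD CI environment dependencies has been closed, indicating that the dependency problems with UV, torchaudio, terratorch, pqdm/num2words, and others have been resolved across several PRs, and that the AMD CI environment is now considerably more stable and complete.
#30020	PR (merged)	[CI/Build][AMD] Skip quantization kernels tests that require CUTLASS or e4m3fn when not supported by platform	`rasmith`	[CI fix] Makes the test framework aware of platform capabilities so that it skips tests that platforms such as MI300 cannot run (for example, those requiring CUTLASS or the e4m3fn format), reducing noise and failure rates in AMD CI and making the tests better targeted.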
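
The consolidation idea behind #30360 can be pictured as one lookup table that every caller consults. The following is only a minimal sketch under stated assumptions: the names `Fp8Range`, `PYTORCH_DEFAULT`, `RECOMMENDED_OVERRIDE`, and `fp8_range` are hypothetical and are not taken from the RFC or from vLLM. The ±240 and ±224 values for `float8_e4m3fnuz` are the ones quoted in the table entry for #30360, and ±448 is the PyTorch `finfo` limit for `float8_e4m3fn`.

```python
from typing import NamedTuple


class Fp8Range(NamedTuple):
    min: float
    max: float


# Hypothetical central table: framework default range per FP8 dtype name.
PYTORCH_DEFAULT = {
    "float8_e4m3fn": Fp8Range(-448.0, 448.0),
    "float8_e4m3fnuz": Fp8Range(-240.0, 240.0),
}

# Hypothetical overrides: dtypes whose recommended range differs from the default.
RECOMMENDED_OVERRIDE = {
    "float8_e4m3fnuz": Fp8Range(-224.0, 224.0),
}


def fp8_range(dtype_name: str) -> Fp8Range:
    """Return the range to quantize against, preferring the override if any."""
    if dtype_name in RECOMMENDED_OVERRIDE:
        return RECOMMENDED_OVERRIDE[dtype_name]
    return PYTORCH_DEFAULT[dtype_name]
```

With a single accessor like this, tests and kernels would read the same value instead of each hard-coding ±224 or ±240, which is the kind of drift the RFC describes.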

💬 Analysis of High-Traffic Discussions

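Issue #30358: [Bug]: Prefill scheduled num_block mismatch
- Core issue: With chunked prefill, the number of KV blocks the scheduler assigns to a request differs between initial allocation (update_state_after_alloc) and completion (request_finished), which can leave the distributed KV connector's metadata out of sync.
- Views and progress:
  - Reporter (xuechendi): Through detailed log analysis, traced the problem to the scheduler updating the request's block ID list in later iterations without notifying the KV connector.
  - Conclusion: The reporter has identified the root cause (with a link to the code) and plans to submit a fix. It is an in-depth debugging case that shows the strong diagnostic skills of community contributors.
- Current status: open, awaiting a fix PR.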
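Issue #15636: [Bug]: Outlines broken on vLLM 0.8+ (closed)
- Core issue: Once the V1 engine became the default, user-defined logits processors (which libraries such as Outlines depend on) were no longer supported, breaking many users' workflows.
- Differing views:
  - Users: Expressed strong dissatisfaction, calling it a "bad decision" to switch the default before V1 reached feature parity with V0, and reporting that it made integration difficult.
  - Maintainer (simon-mo): Acknowledged the mistake and apologized, explaining that graceful degradation was not possible because fallback could not be done per request.
  - Maintainers (russellb, mgoin): Offered a workaround (set VLLM_USE_V1=0 to fall back to the V0 engine) and pointed out that V1 users should use the built-in structured output feature.
- Point of contention: How to balance aggressive feature iteration against backward compatibility and user migration cost.
- Outcome: The issue was closed after a long discussion, but it exposed the compatibility challenges the project faces during major architectural upgrades.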
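PR #30062: [CPU] Support for Whisper
- Core issue: Adds support for the Whisper speech model to the CPU backend.
- Key discussion:
  - Codex review: Flagged a P1-severity issue: the CPU attention backend incorrectly applied a causal mask to cross-attention in encoder-decoder models such as Whisper, so the decoder would see only part of the encoder memory and produce wrong output.
  - Maintainer interaction: The review comment was taken seriously and discussed, and the PR was merged after several rounds of CI runs and code fixes.
- Conclusion: Merging this PR is an important extension of multimodal support on the CPU backend, and it also shows the value of automated code review tools in catching deep logic errors.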
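
The masking problem flagged in the review can be illustrated with a toy example. This is a minimal sketch in plain Python, not code from the PR: the helper names are hypothetical, and the sizes (2 decoder tokens, 5 encoder positions) are chosen only for illustration.

```python
from __future__ import annotations


def causal_mask(q_len: int, kv_len: int) -> list[list[bool]]:
    """True where query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(kv_len)] for i in range(q_len)]


def cross_attention_mask(q_len: int, kv_len: int) -> list[list[bool]]:
    """Cross-attention: every decoder query may see every encoder position."""
    return [[True] * kv_len for _ in range(q_len)]


def visible_keys(mask: list[list[bool]]) -> list[int]:
    """Number of key positions each query row is allowed to attend to."""
    return [sum(row) for row in mask]


# Two decoder tokens attending over five encoder positions.
wrong = visible_keys(causal_mask(2, 5))           # [1, 2]
right = visible_keys(cross_attention_mask(2, 5))  # [5, 5]
```

With the causal mask, the first decoder token sees only one of the five encoder positions, which matches the "only part of the encoder memory" symptom described in the review.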
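PR #30346: [Core] Major fix catch backend grammar exceptions (xgrammar, outlines, etc) in scheduler
- Core issue: When a user passes a malformed or unsupported JSON schema, structured-output backends such as xgrammar raise an uncaught exception that crashes the whole vLLM server process.
- Differing views:
  - Contributor (blancsw): Proposes catching these exceptions in the scheduler and converting them into error responses to the client, avoiding a server restart and improving resilience.
  - Core developer (wseaton): Notes that "abort" in vLLM usually refers specifically to client-initiated aborts, and suggests that the error handling follow the new scheduler/engine internal error-handling framework being developed in PR #26813, for a more unified design.
- Current status: The discussion points toward a more architectural solution; this PR may need to be adjusted or coordinated with the other PR.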
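
The general pattern under discussion, turning a backend exception into a per-request error instead of a process crash, can be sketched as follows. Everything here is hypothetical: `GrammarCompileError`, `compile_grammar`, and `ToyScheduler` are stand-ins invented for illustration and do not reflect vLLM's scheduler or the xgrammar API.

```python
from dataclasses import dataclass, field


class GrammarCompileError(Exception):
    """Stand-in for whatever a structured-output backend might raise."""


def compile_grammar(schema: dict) -> str:
    # Hypothetical backend call: rejects schemas that lack a "type" key.
    if "type" not in schema:
        raise GrammarCompileError("schema has no 'type'")
    return f"grammar<{schema['type']}>"


@dataclass
class ToyScheduler:
    running: dict = field(default_factory=dict)
    failed: dict = field(default_factory=dict)

    def add_request(self, request_id: str, schema: dict) -> None:
        try:
            self.running[request_id] = compile_grammar(schema)
        except Exception as exc:
            # Record an error for this request only; other requests and the
            # serving loop itself keep running.
            self.failed[request_id] = f"invalid schema: {exc}"


scheduler = ToyScheduler()
scheduler.add_request("ok", {"type": "object"})
scheduler.add_request("bad", {})  # would crash the loop if left uncaught
```

How the failure is reported back to the client, and what the state is called internally, is exactly the design question raised in the review, so the sketch leaves that part open.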

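🔥 Hot Topics and Trends

A cluster of DeepSeek-V3.2 issues: This was the most active support topic of the period. Related issues and PRs cover tokenizer performance (#30351, caching optimization), structured output support (#30371, fix), AWQ performance (#30370), and tool-call parsing (#30311), reflecting the growing pains typical of adapting an inference engine to a newly released large model.
Quantization work continues to deepen: FP8 quantization is the focus, with the unification of recommended values for AMD platforms (#30360), the Quark fix for Qwen3 (#30308), and the architectural discussion on online quantization and reloading (#30359), all of which underline how central quantization is to inference efficiency.
Steady progress on the CPU and TPU backends: The CPU backend gained Whisper support (#30062), and an issue requests attention benchmarks for it (#30374). The TPU backend fixed a compilation error triggered by the PyTorch 2.9.1 upgrade (#30331). Together these reflect vLLM's strategy of supporting multiple hardware platforms.
Multimodal and vision model support: The HunyuanOCR cross-image contamination problem in batch processing (#30342/#30344) and an LMDB-based multimodal cache implementation (#30373) indicate growing use of vision-language models and new demands on the engine's batching and caching mechanisms.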

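🛠️ Key Technical Changes

PR #30351: [Bugfix] Cache added_vocab to avoid per-token overhead (merged):
- Technical summary: The DeepSeek-V3.2 tokenizer's __len__ method recomputed the added vocabulary on every call. Because it runs on the main thread in the per-token decoding path, it became a bottleneck that caused latency and even server hangs.
- Impact: Precomputing and caching added_vocab removes this hotspot and significantly improves the stability and responsiveness of DeepSeek-V3.2 serving under high concurrency.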
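PR #30371: [Bugfix] Fix the issue where DeepSeek v3.2 cannot use structured_output (merged):
- Technical summary: Fixes DeepSeek-V3.2 being unable to use structured output (for example, JSON Schema or grammar constraints).
- Impact: Makes DeepSeek-V3.2 usable in scenarios that require strict control over output format, closing a key gap in its feature support.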
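PR #30384/#30349: Fix minimax m2 model rotary_dim (merged/closed):
- Technical summary: After PR #29966 unified the RoPE dimension calculation, models such as Minimax-M2 produced garbled output because the meaning of rotary_dim was ambiguous (already scaled vs. still to be scaled). PR #30384 fixes this by letting get_rope recognize an already-scaled dimension and skip scaling it again.
- Impact: Restores correct inference for the affected models and prompted a discussion about a global refactor to standardize the rotary_dim parameter (see PR #30389), which should help unify the code.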
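
The "already scaled vs. still to be scaled" ambiguity can be made concrete with a few lines of arithmetic. This is a hypothetical sketch only: the helper names are invented, the values 128 and 0.5 are illustrative rather than Minimax-M2's configuration, and the code does not reproduce vLLM's get_rope.

```python
def rotary_dim_from_factor(head_dim: int, partial_rotary_factor: float = 1.0) -> int:
    """Derive the rotary width from the head size and the partial factor."""
    return int(head_dim * partial_rotary_factor)


def rescale(rotary_dim: int, partial_rotary_factor: float) -> int:
    """Helper that assumes its input has NOT been scaled yet."""
    return int(rotary_dim * partial_rotary_factor)


head_dim, factor = 128, 0.5

# A model file that already applied the factor passes 64 ...
already_scaled = rotary_dim_from_factor(head_dim, factor)

# ... and a callee that scales again ends up rotating only 32 dimensions.
double_scaled = rescale(already_scaled, factor)
```

Passing the factor itself rather than a pre-multiplied dimension, as the title of PR #30389 suggests, removes the question of who is responsible for the multiplication.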
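PR #30361: [Attention][AMD] Make flash-attn optional (in progress):
- Technical summary: Turns the ROCm platform's hard dependency on upstream flash-attn into a soft one, so users can run other attention backends without installing the package.
- Impact: Makes AMD deployments more flexible and reduces the complexity of dependency management and the risk of conflicts.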
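
A soft dependency of this kind is commonly written as a guarded import. The following is a generic sketch of that pattern, not the change made in the PR: `optional_import` and `pick_attention_backend` are hypothetical helpers, and the backend labels are plain strings used for illustration.

```python
import importlib


def optional_import(module_name: str):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# "flash_attn" is the import name of the flash-attn package.
flash_attn = optional_import("flash_attn")


def pick_attention_backend(requested: str) -> str:
    """Fail only when the caller actually asks for the missing package."""
    if requested == "flash_attn" and flash_attn is None:
        raise ImportError("flash-attn is not installed; choose another backend")
    return requested
```

The point of the pattern is that the failure moves from import time, where it affects every user, to the moment a flash-attn code path is actually selected.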
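PR #30062: [CPU] Support for Whisper (merged):
- Technical summary: Implements support for the Whisper speech encoder-decoder model on the CPU inference backend, including correct mask handling for cross-attention.
- Impact: Substantially broadens what vLLM can do without a GPU by enabling speech transcription.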

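📈 Development Activity Observations

Highly active AMD contributors: Contributors with the -amd suffix submitted several key fixes in this period (Quark, FP8, CI), showing that the AMD team is actively driving deep integration of its hardware ecosystem with vLLM across features, performance, and testing.
Fast response and fixes: For model-specific problems such as those in DeepSeek-V3.2 and Minimax-M2, the community found root causes and shipped fixes quickly, demonstrating the project's ability to respond rapidly when adapting to mainstream models.
Deep community involvement: As Issue #30358 shows, contributors are able to debug complex scheduling logic and propose concrete fixes, which indicates that the vLLM community includes a strong group of technical experts.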

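💡 Issues Worth Watching

Issue #30359: [RFC] [QeRL]: Online Quantization and Model Reloading:
- A large architectural proposal that aims to streamline online reloading of quantized models in scenarios such as reinforcement learning, addressing the high memory usage and limited support of the current implementation. The discussion will affect how effective vLLM is in post-training workflows (RLHF pipelines) and deserves attention from quantization and core architecture developers.
Issue #30358: Prefill scheduled num_block mismatch:
- Although the root cause has been found, the bug exposes how fragile state synchronization is under complex scheduling and distributed KV management. The eventual fix needs careful design to guarantee consistency in all edge cases.
Issue #30383: [RFC]: Multi-Process Benchmark Architecture:
- Points out that the current vllm benchmark tool is bottlenecked by its single-process load generator and cannot accurately measure the real performance of a high-concurrency service. The proposal is to design a multi-process architecture, which matters for fair and reliable performance evaluation, especially when evaluating large serving systems.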

📋 Appendix: Detailed Data Lists

New Issues

#30380 [Usage]: How does everyone generally use vllm/tests? — usage — by tobeprozy (created: 2025-12-10 17:27 (UTC+8))
#30359 [RFC] [QeRL]: Online Quantization and Model Reloading — RFC — by kylesayrs (created: 2025-12-10 05:24 (UTC+8))
#30387 [Bug]: illegal memory access countered when using kv-cache-type=fp8 loading a weight-fp8 model for evaltest in flash-attn backend — bug — by youngze0016 (created: 2025-12-10 19:43 (UTC+8))
#30382 [Bug]: Issues with mistralai/Ministral-3-14B-Instruct-2512 — bug — by eltorre (created: 2025-12-10 17:55 (UTC+8))
#30383 [RFC]: Multi-Process Benchmark Architecture for Scaling Beyond Single-Core Limits — RFC — by GaoHuaZhang (created: 2025-12-10 18:02 (UTC+8))
#30374 [Feature][CPU Backend]: Add Paged Attention Benchmarks for CPU backend — feature request,cpu — by fadara01 (created: 2025-12-10 15:53 (UTC+8))
#30381 [Usage]: — usage — by tobeprozy (created: 2025-12-10 17:27 (UTC+8))
#30379 [Usage]: how to use vllm/tests/? — usage — by tobeprozy (created: 2025-12-10 17:25 (UTC+8))
#30378 [Feature]: Automatically infer Qwen3 reranker settings (remove need for hf_overrides) — feature request — by ilopezluna (created: 2025-12-10 17:21 (UTC+8))
#30375 [Bug]: [TPU] ShapeDtypeStruct error when loading custom safetensors checkpoint on TPU v5litepod — bug — by Baltsat (created: 2025-12-10 16:12 (UTC+8))
#30372 [Bug]: vLLM (GPT-OSS) causes distorted tool argument names + infinite tool-call loop with Korean messenger tool — bug — by minmini2 (created: 2025-12-10 14:59 (UTC+8))
#30370 [Performance]: DeepSeek-V3.2 AWQ Performance is lower then i expected — performance — by yongho-chang (created: 2025-12-10 10:45 (UTC+8))
#30343 [Bug]: In DeepSeek-V3.2 tokenizer mode, detokenization saturates the main thread, causing the server to hang — bug — by scratch-ml (created: 2025-12-09 22:48 (UTC+8))
#30358 [Bug]: Prefill scheduled num_block mismatch at update_state_after_alloc and request_finished — bug — by xuechendi (created: 2025-12-10 04:15 (UTC+8))
#30368 [CI] Test target determination using LLM — feature request,ci — by khluu (created: 2025-12-10 09:42 (UTC+8))
#30360 [RFC]: Consolidate FP8 min/max values into somewhere reasonable (Python only) — rocm,RFC — by rasmith (created: 2025-12-10 05:44 (UTC+8))
#30342 [Bug]: HunyuanOCR batching problem with variable sized images in a batch. — bug — by anker-c2 (created: 2025-12-09 22:22 (UTC+8))

Closed Issues

#15636 [Bug]: Outlines broken on vLLM 0.8+ — bug,structured-output,unstale — by cpfiffer (closed: 2025-12-10 21:18 (UTC+8))
#30382 [Bug]: Issues with mistralai/Ministral-3-14B-Instruct-2512 — bug — by eltorre (closed: 2025-12-10 18:47 (UTC+8))
#30381 [Usage]: — usage — by tobeprozy (closed: 2025-12-10 17:28 (UTC+8))
#30379 [Usage]: how to use vllm/tests/? — usage — by tobeprozy (closed: 2025-12-10 17:26 (UTC+8))
#30311 [Bug]: deepseekv32.DeepseekV32Tokenizer Runtime causes model to crash — bug — by magician-xin (closed: 2025-12-10 16:30 (UTC+8))
#28314 [AMD][CI Failure]: Tracking failure for AMD CI Dependencies & Environments — rocm,ci-failure — by zhewenl (closed: 2025-12-10 13:32 (UTC+8))
#30343 [Bug]: In DeepSeek-V3.2 tokenizer mode, detokenization saturates the main thread, causing the server to hang — bug — by scratch-ml (closed: 2025-12-10 12:05 (UTC+8))
#20181 [Feature]: Batch inference for Multi-Modal Online Serving — feature request,stale — by eslambakr (closed: 2025-12-10 10:25 (UTC+8))
#21097 [Bug]: w8a8 quantization not supporting sm120 — bug,stale — by sarmiena (closed: 2025-12-10 10:24 (UTC+8))
#21909 [Bug]: quant_method is not None — bug,stale — by maxin9966 (closed: 2025-12-10 10:24 (UTC+8))
#22325 [Bug]: gpt-oss model crashes on NVIDIA B200 with any OpenAI chat completion request — bug,stale — by teds-lin (closed: 2025-12-10 10:24 (UTC+8))
#22422 [Feature]: mxfp4 support for 3090 — feature request,stale — by ehartford (closed: 2025-12-10 10:24 (UTC+8))
#22501 [Usage]: Running a 300-400B Parameter Model on Multi-Node Setup (2x 8xA100) — usage,stale — by rangehow (closed: 2025-12-10 10:24 (UTC+8))
#22575 [Bug]: Vllm hangs when I use the offline engine with dp = 2 or more — bug,stale — by Stealthwriter (closed: 2025-12-10 10:24 (UTC+8))
#22623 [Usage]: if openai-mirror/gpt-oss-20b can run in A100? — usage,stale — by neverstoplearn (closed: 2025-12-10 10:23 (UTC+8))
#22624 [Bug]: 1.7B fp16 + 0.6B draft OOM with gpu_memory_utilization=0.9, while 4B int8 + 0.6B works fine on A800 80 GB — bug,stale — by kiexu (closed: 2025-12-10 10:23 (UTC+8))
#22639 [Bug]: function convert_lark_to_gbnf interpreting ‘#’ to parse as lark commentaries — bug,stale — by renout-nicolas (closed: 2025-12-10 10:23 (UTC+8))
#30206 [Bug]: DeepSeek-V3.2 DeepGEMM RuntimeError — bug — by coval3nte (closed: 2025-12-09 22:55 (UTC+8))
#29840 [Bug]: LMCacheConnectorV1Impl has no attribute ‘layerwise_storers’ on remote full cache hit with layerwise mode — bug — by XinyiQiao (closed: 2025-12-10 01:11 (UTC+8))

New PRs

#30390 fix: Update json features supported by xGrammar — structured-output,v1 — by johannesflommersfeld (created: 2025-12-10 20:51 (UTC+8))
#30391 [IMPROVEMENT] Change MistralReasoningParser behavior — no labels — by juliendenize (created: 2025-12-10 21:02 (UTC+8))
#30389 Standardise get_rope to use rope_parameters["partial_rotary_factor"], not rotary_dim — performance,llama,qwen,deepseek,gpt-oss — by hmellor (created: 2025-12-10 20:33 (UTC+8))
#30384 [BugFix] Fix minimax m2 model rotary_dim — ready — by rogeryoungh (created: 2025-12-10 18:37 (UTC+8))
#30340 Add Eagle and Eagle3 support to Transformers modeling backend — no labels — by hmellor (created: 2025-12-09 22:09 (UTC+8))
#30388 [Docs] Generate full list of metrics in user docs — documentation — by markmc (created: 2025-12-10 19:50 (UTC+8))
#30386 [v1] Add PrefixLM support to TritonAttention backend — v1 — by Isotr0py (created: 2025-12-10 19:32 (UTC+8))
#30385 [Core] Whisper support torch.compile — v1 — by NickLucche (created: 2025-12-10 19:11 (UTC+8))
#30376 [Fix]fix import error from lmcache — kv-connector — by wz1qqx (created: 2025-12-10 16:38 (UTC+8))
#30348 [Docs]: adds a new metric vllm:request_prefill_kv_computed_tokens in docs — documentation — by googs1025 (created: 2025-12-09 23:24 (UTC+8))
#30361 [Attention][AMD] Make flash-attn optional — rocm,speculative-decoding,v1 — by mgehre-amd (created: 2025-12-10 06:46 (UTC+8))
#30349 [BugFix] Fix minimax m2 model rope_parameters — no labels — by esmeetu (created: 2025-12-09 23:42 (UTC+8))
#30351 [Bugfix] Cache added_vocab to avoid per-token overhead — ready — by scratch-ml (created: 2025-12-10 00:57 (UTC+8))
#30364 [Bugfix] awq_gemm: fix argument order swap — no labels — by mgehre-amd (created: 2025-12-10 07:17 (UTC+8))
#30377 adding constraint updates of cos-sin to improve mrope performance — no labels — by wujinyuan1 (created: 2025-12-10 16:48 (UTC+8))
#30373 Implement LMDB-based multi-modal cache — ci/build,v1,multi-modality — by petersalas (created: 2025-12-10 15:21 (UTC+8))
#30371 [Bugfix] Fix the issue where DeepSeek v3.2 cannot use structured_output — structured-output,ready,v1,deepseek — by chaunceyjiang (created: 2025-12-10 12:26 (UTC+8))
#30344 [Bugfix] Fix HunyuanOCR cross-image contamination in batch processing — no labels — by anker-c2 (created: 2025-12-09 22:49 (UTC+8))
#30346 [Core] Major fix catch backend grammar exceptions (xgrammar, outlines, etc) in scheduler — v1 — by blancsw (created: 2025-12-09 22:58 (UTC+8))
#30347 [cpu][ci] Add CPU Attention Tests for Neon Backend — ready — by fadara01 (created: 2025-12-09 23:14 (UTC+8))
#30339 [CMake][Build]: Remove unused ACL CMake env variables — ready,ci/build — by Radu2k (created: 2025-12-09 22:09 (UTC+8))
#30369 [Fix] Add default rope theta for qwen1 model — qwen — by iwzbi (created: 2025-12-10 10:36 (UTC+8))
#30345 Fix typos in comments across multiple files — documentation,ready,v1 — by wilsonwu (created: 2025-12-09 22:56 (UTC+8))
#30367 [CI] Reduce Flakiness For test_spec_decode.py::test_suffix_decoding_acceptance — ready,v1 — by micah-wil (created: 2025-12-10 08:18 (UTC+8))
#30363 Remove all2all backend envvar — documentation,ci/build — by elizabetht (created: 2025-12-10 07:09 (UTC+8))
#30366 [Bug Fix] Fix Kimi-Linear model initialization crash due to missing ‘indexer_rotary_emb’ arg — no labels — by yonasTMC (created: 2025-12-10 08:02 (UTC+8))
#30357 Upstream fp8 with static scales gpt oss — needs-rebase,gpt-oss — by maleksan85 (created: 2025-12-10 03:49 (UTC+8))
#30365 fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell — no labels — by kitaekatt (created: 2025-12-10 07:59 (UTC+8))
#30362 [WIP] Bump dockerfile to cuda 13.0.2 (for testing) — ci/build,nvidia — by dougbtv (created: 2025-12-10 06:51 (UTC+8))
#30353 [Fix] Handle multiple tool calls in Qwen3-MTP tool parser — frontend,tool-calling,qwen — by ArkVex (created: 2025-12-10 01:48 (UTC+8))
#30356 [CI][DeepSeek] Add nightly DeepSeek R1 lm_eval tests on H200 — ready,ci/build,deepseek — by MatthewBonanni (created: 2025-12-10 02:05 (UTC+8))
#30352 [CI/Test] Fix FP8 per-tensor quant test reference scale shape — ready — by LucasWilkinson (created: 2025-12-10 01:30 (UTC+8))
#30354 [WIP][Core] Update PyTorch to 2.9.1 generally — rocm,ci/build,nvidia — by orionr (created: 2025-12-10 01:49 (UTC+8))
#30355 [Model Runner V2] Fix Triton warning on tl.where — v1 — by WoosukKwon (created: 2025-12-10 01:56 (UTC+8))
#30350 Remove virtual engine handling — tpu,needs-rebase,v1,codex,qwen,kv-connector — by WoosukKwon (created: 2025-12-10 00:34 (UTC+8))
#30341 [CI] refine more logic when generating and using nightly wheels & indices — ci/build — by Harry-Chen (created: 2025-12-09 22:17 (UTC+8))
#30338 Fix gigachat3 parser + update tests — frontend,tool-calling — by ajpqs (created: 2025-12-09 21:37 (UTC+8))

Merged PRs

#30384 [BugFix] Fix minimax m2 model rotary_dim — ready — by rogeryoungh (merged: 2025-12-10 20:58 (UTC+8))
#30062 [CPU] Support for Whisper — ready,ci/build,v1,multi-modality — by aditew01 (merged: 2025-12-10 20:58 (UTC+8))
#30331 [Bugfix] tpu_model_runner: set vllm config context when calling reset_dynamo_cache() — tpu,ready,v1 — by dtrifiro (merged: 2025-12-10 20:58 (UTC+8))
#30351 [Bugfix] Cache added_vocab to avoid per-token overhead — ready — by scratch-ml (merged: 2025-12-10 12:05 (UTC+8))
#30332 [BUGFIX] Mistral tool call parser v11+ — frontend,ready,tool-calling — by juliendenize (merged: 2025-12-09 22:55 (UTC+8))
#30371 [Bugfix] Fix the issue where DeepSeek v3.2 cannot use structured_output — structured-output,ready,v1,deepseek — by chaunceyjiang (merged: 2025-12-10 16:30 (UTC+8))
#30347 [cpu][ci] Add CPU Attention Tests for Neon Backend — ready — by fadara01 (merged: 2025-12-10 13:37 (UTC+8))
#29358 [ROCm][CI] Attempt to fix the failures under a subgroup of the e2e the test group — rocm,ready,ci/build,v1,multi-modality — by AndreasKaratzas (merged: 2025-12-10 13:33 (UTC+8))
#30339 [CMake][Build]: Remove unused ACL CMake env variables — ready,ci/build — by Radu2k (merged: 2025-12-10 12:27 (UTC+8))
#30345 Fix typos in comments across multiple files — documentation,ready,v1 — by wilsonwu (merged: 2025-12-10 12:05 (UTC+8))
#30308 [bugfix][quantization] fix quark qwen3 kv_cache quantization — ready,qwen — by haoyangli-amd (merged: 2025-12-10 11:24 (UTC+8))
#30367 [CI] Reduce Flakiness For test_spec_decode.py::test_suffix_decoding_acceptance — ready,v1 — by micah-wil (merged: 2025-12-10 10:35 (UTC+8))
#30230 [responsesAPI][6] Fix multi turn MCP tokenization — documentation,frontend,ready,gpt-oss — by qandrew (merged: 2025-12-10 10:13 (UTC+8))
#30020 [CI/Build][AMD] Skip quantization kernels tests that require CUTLASS or e4m3fn when not supported by platform — rocm,ready,nvidia — by rasmith (merged: 2025-12-10 10:28 (UTC+8))
#30336 [Bugfix] Fix fp8 DeepGemm compilation issues — bug,ready,ci-failure,deepseek — by ElizaWszola (merged: 2025-12-10 09:17 (UTC+8))
#29624 [Attention] Make seq_lens_cpu optional in CommonAttentionMetadata to enable true async spec-decode — speculative-decoding,ready,v1,ready-run-all-tests — by LucasWilkinson (merged: 2025-12-10 09:18 (UTC+8))
#30330 [Bugfix] Fix cuda graph sizes when running with speculative decoding — ready,nvidia — by PatrykSaffer (merged: 2025-12-10 08:47 (UTC+8))
#29723 [V1][Spec Decode] Optimize Medusa proposer to avoid GPU-CPU sync — speculative-decoding,ready,v1 — by dongbo910220 (merged: 2025-12-10 08:15 (UTC+8))
#29937 Improve wvsplitK tile and balance heristics. — rocm,ready — by amd-hhashemi (merged: 2025-12-10 07:51 (UTC+8))
#25693 [Rocm][torch.compile] Adding layernorm + fp8 block quant and silu + fp8 block quant for Aiter — rocm,ready — by charlifu (merged: 2025-12-10 06:39 (UTC+8))
#30119 [BugFix] Fix DeepSeek-R1 hang with DP and MTP — ready,v1,deepseek — by LucasWilkinson (merged: 2025-12-10 02:51 (UTC+8))
#29066 [MoE][Refactor] Remove most arguments to FusedMoEMethodBase.apply — moe,ready,nvidia,ready-run-all-tests — by bnellnm (merged: 2025-12-10 05:48 (UTC+8))
#28480 [Quantization] FP8 Weight Reloading for Quantized RL Rollout — quantization,ready,rl — by kylesayrs (merged: 2025-12-10 05:54 (UTC+8))
#30277 [BugFix] Fix non detected failing tests — ready,ci/build — by ilmarkov (merged: 2025-12-10 01:57 (UTC+8))
#29145 [CI/Build] Make test_mha_attn.py run on correct platform only and check for flash_attn_varlen_func in layer.py — rocm,ready,ci/build — by rasmith (merged: 2025-12-10 04:18 (UTC+8))
#30234 Bump actions/stale from 10.1.0 to 10.1.1 — ready,dependencies,ci/build,github_actions — by dependabot (merged: 2025-12-10 04:12 (UTC+8))
#30233 Bump actions/checkout from 6.0.0 to 6.0.1 — ready,dependencies,ci/build,github_actions — by dependabot (merged: 2025-12-10 04:03 (UTC+8))
#30307 [Model][Quantization] Fix / Add GGUF support for Qwen2 MoE models — ready,qwen — by a4lg (merged: 2025-12-10 03:13 (UTC+8))
#30352 [CI/Test] Fix FP8 per-tensor quant test reference scale shape — ready — by LucasWilkinson (merged: 2025-12-10 02:52 (UTC+8))
#29912 [Cleanup] Refactor profiling env vars into a CLI config — documentation,performance,structured-output,frontend,tpu,ready,v1 — by benchislett (merged: 2025-12-10 02:29 (UTC+8))
#30187 [Model Runner V2] Support num NaNs in logits — v1 — by WoosukKwon (merged: 2025-12-10 02:00 (UTC+8))
#30355 [Model Runner V2] Fix Triton warning on tl.where — v1 — by WoosukKwon (merged: 2025-12-10 01:59 (UTC+8))
#29897 [Compile] Fix torch warning TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled — ready,v1 — by yewentao256 (merged: 2025-12-09 23:40 (UTC+8))
#30298 Update AMD test definitions (2025-12-08) — rocm,ready,ci/build,amd — by Alexei-V-Ivanov-AMD (merged: 2025-12-10 01:31 (UTC+8))
#30173 [BugFix] Fix assert batch_descriptor.num_tokens == num_tokens_padded — speculative-decoding,ready,v1,nvidia — by LucasWilkinson (merged: 2025-12-09 23:36 (UTC+8))
#30018 [Feature] Batch-Invariant Support for FA2 and LoRA — ready,v1 — by quanliu1991 (merged: 2025-12-09 23:01 (UTC+8))
#25552 [ROCm] Aiter Quant Kernels — rocm,ready,ci/build — by vllmellm (merged: 2025-12-09 22:27 (UTC+8))

PRs Closed Without Merging

#23997 Feature/sampler benchmark #23977 — performance,unstale — by baonudesifeizhai (closed: 2025-12-10 21:14 (UTC+8))
#26701 [ROCm]: W8A8BlockFp8LinearOp should use AITER on MI355X — rocm,ready,needs-rebase — by gronsti-amd (closed: 2025-12-10 20:33 (UTC+8))
#30348 [Docs]: adds a new metric vllm:request_prefill_kv_computed_tokens in docs — documentation — by googs1025 (closed: 2025-12-10 19:59 (UTC+8))
#30349 [BugFix] Fix minimax m2 model rope_parameters — no labels — by esmeetu (closed: 2025-12-10 18:47 (UTC+8))
#29653 fix potential object has no attribute ‘bias’ error — no labels — by allerou4 (closed: 2025-12-10 15:16 (UTC+8))
#30297 [Core] Add SLA-tiered scheduling (opt-in) and docs — documentation,v1 — by ProdByBuddha (closed: 2025-12-10 13:13 (UTC+8))
#30327 [BugFix] Fix hang issue in LMCache mp mode — v1,kv-connector — by wz1qqx (closed: 2025-12-10 10:32 (UTC+8))
#17830 cmake: Get rid of VLLM_PYTHON_EXECUTABLE — needs-rebase,ci/build,stale — by seemethere (closed: 2025-12-10 10:26 (UTC+8))
#17872 measure peak memory correctly by removing already used memory — needs-rebase,stale,v1 — by MiladInk (closed: 2025-12-10 10:25 (UTC+8))
#17959 [Bugfix] fix check kv cache memory log info — needs-rebase,stale,v1 — by BoL0150 (closed: 2025-12-10 10:25 (UTC+8))
#21056 [Feature][EPLB] Add EPLB support for MiniMax-01 — stale — by haveheartt (closed: 2025-12-10 10:24 (UTC+8))
#21413 Intentionally fail parallel sampling test — stale,v1 — by sethkimmel3 (closed: 2025-12-10 10:24 (UTC+8))
#21506 [V1][SpecDecode]Support relaxed acceptance for thinking tokens in speculative decoding in V1 — documentation,rocm,frontend,ci/build,stale,v1,multi-modality,tool-calling — by DW934 (closed: 2025-12-10 10:24 (UTC+8))
#22238 [V1][SpecDecode]Support Relaxed Acceptance for thinking tokens in speculative decoding when using greedy search, camp up by Nvidia. — stale,v1 — by DW934 (closed: 2025-12-10 10:24 (UTC+8))
#22488 Feat/sliding window metrics — Related to #22480 — needs-rebase,stale,v1 — by NumberWan (closed: 2025-12-10 10:24 (UTC+8))
#22632 [Bugfix]fix deepseek_r1_reasoning bugs when in contents. — stale,deepseek — by z2415445508 (closed: 2025-12-10 10:23 (UTC+8))
#27594 Fix intermediatetensors spawn error #27591 — qwen — by baonudesifeizhai (closed: 2025-12-10 08:44 (UTC+8))
#28627 [Weight Loading] Expand quantized weight reloading support — needs-rebase,v1 — by kylesayrs (closed: 2025-12-10 05:48 (UTC+8))
#30354 [WIP][Core] Update PyTorch to 2.9.1 generally — rocm,ci/build,nvidia — by orionr (closed: 2025-12-10 02:46 (UTC+8))
#30063 Mistral tool parser — frontend,tool-calling — by graelo (closed: 2025-12-09 23:56 (UTC+8))
#27305 [ROCm][torch.compile] Adding MulAddFusionPass to enable AITER fused_mul_add — rocm,needs-rebase — by micah-wil (closed: 2025-12-09 23:49 (UTC+8))
#26257 [Feature][torch.compile] Add pass to rearrange AllGather for FP8 models in sequence parallel for better Async TP fusion — needs-rebase,ci/build — by jasonlizhengjian (closed: 2025-12-09 22:59 (UTC+8))
#25618 [ROCm][Allreduce] Add dispatch mechanism for choosing performant allreduce implementations for AMD platforms — rocm,needs-rebase,nvidia — by zejunchen-zejun (closed: 2025-12-09 21:45 (UTC+8))