vLLM Development Activity Report - 2025-12-09
Time window: 2025-12-09 21:37 (UTC+8) ~ 2025-12-10 21:37 (UTC+8). Stats: 17 new issues | 19 issues closed | 37 new PRs | 37 PRs merged | 23 PRs closed without merging
📊 Daily Development Status Summary
During this observation window, the vLLM project maintained a very high level of development activity: 37 PRs were opened and 37 merged, reflecting a rapid cadence of iteration and bug fixing. Development focused on AMD platform compatibility, a series of fixes for the DeepSeek-V3.2 model, and feature enhancements for the CPU/TPU backends. Several RFCs also show the community debating architectural evolution in depth (e.g., online quantization and benchmarking tooling).
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was frequent this cycle, spanning compatibility fixes, performance optimization, and CI improvements.
| Number | Type | Title | Key Contributor | Summary & Impact |
|---|---|---|---|---|
| #30360 | Issue | [RFC]: Consolidate FP8 min/max values into somewhere reasonable (Python only) | rasmith | **Design proposal.** On platforms such as MI300 (which use `torch.float8_e4m3fnuz`), the recommended FP8 min/max values (±224) conflict with PyTorch's defaults (±240), causing test failures. The proposal consolidates the recommended FP8 values into a single location to resolve platform differences and prevent regressions — an important step for FP8 quantization accuracy and code robustness on AMD. |
| #30361 | PR | [Attention][AMD] Make flash-attn optional | mgehre-amd | **Compatibility fix.** Fixes a blocking issue on ROCm: `vllm.v1.spec_decode.eagle` unconditionally imported flash-attn utilities, so startup failed whenever flash-attn was not installed, even for non-Eagle inference using backends such as Triton. This PR makes the import conditional, improving dependency-management flexibility on ROCm. |
| #30364 | PR | [Bugfix] awq_gemm: fix argument order swap | mgehre-amd | **Code cleanup.** Corrects the declared/passed order of the `scales` and `zeros` arguments inside `_custom_ops.awq_gemm` so it matches callers and the CUDA kernel. No functional change, but it improves readability and maintainability, benefiting AWQ quantization work on AMD. |
| #30308 | PR (merged) | [bugfix][quantization] fix quark qwen3 kv_cache quantization | haoyangli-amd | **Quark quantization fix.** Resolves incorrect KV-cache scale-factor detection when Qwen3 MoE models use Quark quantization, by calling the base class's `get_cache_scale` method to identify scales correctly, ensuring inference accuracy. An important improvement in AMD's Quark quantization support for complex models. |
| #29937 | PR (merged) | Improve wvsplitK tile and balance heuristics. | amd-hhashemi | **Performance optimization.** Improves tile-size selection and load-balancing heuristics for "skinny" GEMMs, based on profiling of large models such as Llama 3.1/3.3, raising compute-kernel efficiency on AMD hardware. |
| #25552 | PR (merged) | [ROCm] Aiter Quant Kernels | vllmellm | **Performance optimization.** Integrates Aiter's FP8 quantization kernels, supporting static/dynamic per-tensor quantization and dynamic per-token quantization. Post-merge numbers show a notable (~3%) total throughput (tok/s) gain on models such as Llama-3.1-70B-Instruct-FP8-KV — a key enhancement for quantized inference on ROCm. |
| #25693 | PR (merged) | [Rocm][torch.compile] Adding layernorm + fp8 block quant and silu + fp8 block quant for Aiter | charlifu | **Performance optimization.** Adds fused LayerNorm + FP8 block-quant and SiLU + FP8 block-quant operators for Aiter. Operator fusion cuts memory traffic and kernel-launch overhead, further improving quantized-model inference on AMD. |
| #28314 | Issue (closed) | [AMD][CI Failure]: Tracking failure for AMD CI Dependencies & Environments | zhewenl | **CI fix.** This tracking issue for AMD CI dependency problems was closed: issues around UV, torchaudio, terratorch, and pqdm/num2words were resolved across several PRs, markedly improving the stability and completeness of the AMD CI environment. |
| #30020 | PR (merged) | [CI/Build][AMD] Skip quantization kernels tests that require CUTLASS or e4m3fn when not supported by platform | rasmith | **CI fix.** Teaches the test framework to detect platform capabilities and skip unsupported tests on MI300 and similar platforms (e.g., tests requiring CUTLASS or the e4m3fn format), reducing AMD CI noise and failure rates and making tests more targeted. |
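The ±224 vs. ±240 conflict behind #30360 lends itself to the consolidation that RFC proposes. A minimal sketch, assuming a hypothetical single lookup module (`FP8_RANGES` and `fp8_min_max` are illustrative names, not vLLM code; the ±224 clamp for e4m3fnuz follows the RFC's recommendation rather than `torch.finfo`):

```python
# One central table of recommended FP8 ranges, instead of literals
# scattered across platform-specific code paths.
FP8_RANGES = {
    "float8_e4m3fn":   (-448.0, 448.0),   # OCP e4m3 (NVIDIA, newer AMD)
    "float8_e4m3fnuz": (-224.0, 224.0),   # MI300 variant, clamped per the RFC
}

def fp8_min_max(dtype_name: str) -> tuple:
    """Single lookup point so callers stop hard-coding platform limits."""
    return FP8_RANGES[dtype_name]
```

Centralizing the values means a future platform (or a revised recommendation) changes one table entry rather than every quantization kernel wrapper.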
💬 High-Traffic Discussions
- Issue #30358: [Bug]: Prefill scheduled num_block mismatch
  - Core issue: under chunked prefill, the number of KV blocks the scheduler assigns a request differs between initial allocation (`update_state_after_alloc`) and request completion (`request_finished`), which can desynchronize distributed KV connector metadata.
  - Views and progress: the reporter (xuechendi), through detailed log analysis, traced the problem to the scheduler updating the request's block-ID list in a later loop without notifying the KV connector. The root cause has been identified (with a code link) and a fix is planned — a deep debugging case that showcases the community's diagnostic strength.
  - Current status: open, awaiting a fix PR.
- Issue #15636: [Bug]: Outlines broken on vLLM 0.8+ (closed)
  - Core issue: once the V1 engine became the default, user-defined logits processors (as required by the Outlines library) were no longer supported, breaking many users' workflows.
  - Differing views:
    - Users: voiced strong dissatisfaction, calling the default switch before V1 reached feature parity with V0 a "bad decision" that made integration difficult.
    - Maintainer (simon-mo): acknowledged the misstep and apologized, explaining that per-request fallback was infeasible, so graceful degradation was not possible.
    - Maintainers (russellb, mgoin): offered a workaround (set `VLLM_USE_V1=0` to fall back to the V0 engine) and pointed out that V1's built-in structured-output support should be used instead.
  - Point of contention: balancing aggressive feature iteration against backward compatibility and user migration cost.
  - Outcome: the issue was closed after a long discussion, but it highlights the compatibility challenges the project faces during major architecture upgrades.
- PR #30062: [CPU] Support for Whisper
  - Core issue: add Whisper speech-model support to the CPU backend.
  - Key discussion:
    - Codex review: flagged a P1-severity problem — the CPU attention backend wrongly applied a causal mask to encoder-decoder (Whisper-style) cross-attention, which would let the decoder see only part of the encoder memory and produce wrong output.
    - Maintainer interaction: the review comments were taken seriously and discussed; the PR was merged after multiple CI rounds and code corrections.
  - Conclusion: merging this PR is a significant expansion of the CPU backend's multimodal support, and it also demonstrates the value of automated code review in catching deep logic bugs.
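The class of masking bug flagged in that review can be shown with a toy mask builder (a pure-Python sketch, not vLLM's backend code): decoder self-attention is causal, but cross-attention over encoder memory must see all encoder positions.

```python
def attention_mask(q_len: int, kv_len: int, causal: bool) -> list:
    """Build a visibility mask: mask[q][k] is True if key k is visible to query q."""
    if causal:
        # Query q may only attend to keys at positions <= q.
        return [[k <= q for k in range(kv_len)] for q in range(q_len)]
    # Cross-attention: every query sees the full encoder memory.
    return [[True] * kv_len for _ in range(q_len)]
```

With a causal mask, decoder step 0 would see only 1 of the encoder's `kv_len` positions — exactly the truncated-memory symptom described in the review.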
- PR #30346: [Core] Major fix catch backend grammar exceptions (xgrammar, outlines, etc) in scheduler
  - Core issue: when a user submits a malformed or unsupported JSON schema, structured-output backends such as xgrammar raise uncaught exceptions that crash the entire vLLM server process.
  - Differing views:
    - Contributor (blancsw): proposed catching these exceptions at the scheduler level and converting them into error responses to the client, avoiding a service restart and improving resilience.
    - Core developer (wseaton): noted that the term "abort" in vLLM usually refers specifically to client-side aborts, and suggested aligning this error handling with the new scheduler/engine internal-error-handling framework being developed in PR #26813 for a more unified design.
  - Current status: the discussion points toward a more architectural solution; this PR may need adjustment or coordination with the other PR.
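The proposed containment strategy can be sketched in a few lines (hedged: `compile_grammar` and `RequestError` are hypothetical stand-ins, not vLLM or xgrammar APIs):

```python
class RequestError(Exception):
    """Error scoped to one request, surfaced to the client as a response."""

def compile_grammar(schema: dict):
    # Stand-in for a structured-output backend rejecting a bad schema.
    if "type" not in schema:
        raise ValueError("unsupported schema")
    return ("grammar", schema["type"])

def schedule_request(schema: dict):
    try:
        return compile_grammar(schema)
    except Exception as exc:
        # Convert a backend crash into a per-request error instead of
        # letting the exception take down the whole engine process.
        raise RequestError(f"invalid structured-output schema: {exc}") from exc
```

The design question raised in review is where this boundary lives: ad hoc in the scheduler, or inside a unified engine-level error-handling framework.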
🔥 Hot Topics & Trends
- DeepSeek-V3.2 issues arriving in bulk: the hottest support topic of the period. Related issues/PRs cover tokenizer performance (#30351, cache optimization), structured-output support (#30371, fix), AWQ performance (#30370), and tool-call parsing (#30311) — the typical growing pains of rapidly adapting a new flagship model to an inference engine.
- Quantization keeps deepening: FP8 is the focus, with cross-platform value consolidation for AMD (#30360), a Quark fix for Qwen3 (#30308), and architecture discussion of online quantization reloading (#30359), underlining quantization's central role in inference efficiency.
- CPU and TPU backends advance steadily: the CPU backend gained Whisper support (#30062) and a request for attention benchmarks (#30374); the TPU backend fixed a compile error triggered by the PyTorch 2.9.1 upgrade (#30331). This reflects vLLM's strategic investment in multi-hardware support.
- Multimodal and vision models: a HunyuanOCR cross-image batch-contamination bug (#30342/30344) and an LMDB-based multimodal cache (#30373) show growing vision-language adoption placing new demands on the engine's batching and caching machinery.
🛠️ Key Technical Changes
- PR #30351: [Bugfix] Cache added_vocab to avoid per-token overhead (merged):
  - Technical summary: the DeepSeek-V3.2 tokenizer's `__len__` method recomputed the added vocabulary on every call; in the per-token decode path on the main thread this became a bottleneck, causing serving latency and even hangs.
  - Impact: precomputing and caching `added_vocab` eliminates the hotspot entirely, significantly improving DeepSeek-V3.2 serving stability and responsiveness under high concurrency.
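The fix pattern is ordinary memoization; an illustrative sketch (class and attribute names are hypothetical, not the actual tokenizer code):

```python
class Tokenizer:
    def __init__(self, base_vocab, added_tokens):
        self.base_vocab = base_vocab
        self.added_tokens = added_tokens
        self.scans = 0                  # counts how often the slow path runs
        self._added_vocab = None        # cache, per the fix in #30351

    def get_added_vocab(self):
        if self._added_vocab is None:   # computed once, not once per token
            self.scans += 1
            self._added_vocab = {tok: i for i, tok in enumerate(self.added_tokens)}
        return self._added_vocab

    def __len__(self):
        # Before the fix, this effectively rebuilt the added-vocab dict on
        # every call -- and __len__ is hit for every decoded token.
        return len(self.base_vocab) + len(self.get_added_vocab())
```

A thousand `len()` calls now trigger exactly one vocabulary scan instead of a thousand.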
- PR #30371: [Bugfix] Fix the issue where DeepSeek v3.2 cannot use structured_output (merged):
  - Technical summary: fixes DeepSeek-V3.2 being unable to enable structured output (e.g., JSON Schema or grammar constraints).
  - Impact: unblocks DeepSeek-V3.2 in scenarios requiring strict output-format control — a key completeness fix for the model's feature support.
- PR #30384/#30349: Fix minimax m2 model rotary_dim (merged/closed):
  - Technical summary: after PR #29966 unified the RoPE dimension computation, models such as Minimax-M2 produced garbled output due to ambiguity in `rotary_dim` semantics (already scaled vs. still to be scaled). PR #30384 fixes this by letting `get_rope` recognize and skip already-scaled dimensions.
  - Impact: restores correct inference for the affected models and sparked discussion of globally standardizing the `rotary_dim` parameter (see PR #30389) to unify the logic.
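The ambiguity can be illustrated with a simplified, hypothetical helper (not the actual `get_rope` signature): if a caller passes a `rotary_dim` that was already derived from `partial_rotary_factor`, scaling it again corrupts the rotary embedding.

```python
def resolve_rotary_dim(head_dim, partial_rotary_factor, rotary_dim=None):
    """Return the number of head dimensions RoPE should rotate.

    Convention sketched here: an explicit rotary_dim is treated as final
    (already scaled) and must NOT be multiplied by the factor again.
    """
    if rotary_dim is not None:
        return rotary_dim
    return int(head_dim * partial_rotary_factor)
```

For `head_dim=128` and `partial_rotary_factor=0.5` the correct answer is 64; scaling an already-scaled value a second time would yield 32, producing the garbled output described above. Standardizing on `rope_parameters["partial_rotary_factor"]` (PR #30389's direction) removes the dual meaning entirely.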
- PR #30361: [Attention][AMD] Make flash-attn optional (in progress):
  - Technical summary: turns ROCm's hard dependency on upstream `flash-attn` into an optional one, allowing users without the package to use other attention backends.
  - Impact: improves deployment flexibility on AMD and reduces dependency-management complexity and potential conflicts.
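The conditional-import pattern involved looks roughly like this (a generic sketch; the backend names and helper are illustrative, not vLLM's actual code):

```python
import importlib.util

# Probe for the package without importing it (and without crashing if absent).
HAS_FLASH_ATTN = importlib.util.find_spec("flash_attn") is not None

def pick_attn_backend(preferred: str = "flash") -> str:
    """Fall back to a backend with no extra dependency when flash-attn is missing."""
    if preferred == "flash" and HAS_FLASH_ATTN:
        return "flash"
    return "triton"
```

The key point of #30361 is moving the import out of module scope: probing at selection time means merely starting a non-Eagle workload no longer requires the package.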
- PR #30062: [CPU] Support for Whisper (merged):
  - Technical summary: implements support for the Whisper speech encoder-decoder model on the CPU inference backend, including correct masking logic for cross-attention.
  - Impact: greatly expands vLLM's GPU-free use cases, enabling speech-transcription workloads.
📈 Development Activity Observations
- AMD contributors highly active: contributors with an `-amd` suffix landed multiple key fixes this cycle (Quark, FP8, CI), showing the AMD team actively driving deep integration of its hardware ecosystem with vLLM, with coverage spanning features, performance, and testing.
- Fast triage and fixes: for model-specific problems in DeepSeek-V3.2 and Minimax-M2, the community located root causes and shipped fixes within a short time, demonstrating rapid responsiveness in supporting mainstream models.
- Deep community engagement: as Issue #30358 shows, contributors can debug complex scheduling logic and propose concrete fixes — evidence of a strong pool of technical experts in the vLLM community.
💡 Issues Worth Watching
- Issue #30359: [RFC] [QeRL]: Online Quantization and Model Reloading:
  - A large-scale architecture proposal to streamline online reloading of quantized models in reinforcement-learning workflows, addressing the current implementation's high memory usage and limited support. The discussion will affect vLLM's effectiveness in post-training pipelines (RLHF), and merits attention from quantization and core-architecture developers.
- Issue #30358: Prefill scheduled num_block mismatch:
  - Although the root cause has been found, the bug exposes how fragile state synchronization is under complex scheduling combined with distributed KV management. The eventual fix needs careful design to guarantee consistency in all edge cases.
- Issue #30383: [RFC]: Multi-Process Benchmark Architecture:
  - Points out that the current `vllm benchmark` tool's single-process load generator is a performance bottleneck that cannot accurately characterize a highly concurrent service. The proposal designs a multi-process architecture, which matters for the fairness and reliability of performance evaluation — particularly when selecting large-scale serving systems.
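The core idea in RFC #30383 — shard request generation across worker processes so it is no longer bound to a single core/GIL — can be sketched as follows (all names are hypothetical; a real worker would issue HTTP requests and collect latencies):

```python
from multiprocessing import Pool

def run_worker(num_requests):
    # Placeholder: a real worker would open its own client session and
    # fire `num_requests` requests at the server under test.
    return num_requests

def shard_requests(total_requests, num_workers):
    """Split the total load across workers, giving the remainder to worker 0."""
    per_worker = total_requests // num_workers
    shards = [per_worker] * num_workers
    shards[0] += total_requests - per_worker * num_workers
    return shards

def run_benchmark(total_requests, num_workers):
    with Pool(num_workers) as pool:
        return sum(pool.map(run_worker, shard_requests(total_requests, num_workers)))
```

Aggregation (summing per-worker counts or merging latency histograms) happens in the parent, so each worker saturates its own core during load generation.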
📋 Appendix: Detailed Data
New Issues
- #30380 [Usage]: How does everyone typically use vllm/tests? — usage — by tobeprozy (created: 2025-12-10 17:27 (UTC+8))
- #30359 [RFC] [QeRL]: Online Quantization and Model Reloading — RFC — by kylesayrs (created: 2025-12-10 05:24 (UTC+8))
- #30387 [Bug]: illegal memory access countered when using kv-cache-type=fp8 loading a weight-fp8 model for evaltest in flash-attn backend — bug — by youngze0016 (created: 2025-12-10 19:43 (UTC+8))
- #30382 [Bug]: Issues with mistralai/Ministral-3-14B-Instruct-2512 — bug — by eltorre (created: 2025-12-10 17:55 (UTC+8))
- #30383 [RFC]: Multi-Process Benchmark Architecture for Scaling Beyond Single-Core Limits — RFC — by GaoHuaZhang (created: 2025-12-10 18:02 (UTC+8))
- #30374 [Feature][CPU Backend]: Add Paged Attention Benchmarks for CPU backend — feature request,cpu — by fadara01 (created: 2025-12-10 15:53 (UTC+8))
- #30381 [Usage]: — usage — by tobeprozy (created: 2025-12-10 17:27 (UTC+8))
- #30379 [Usage]: how to use vllm/tests/? — usage — by tobeprozy (created: 2025-12-10 17:25 (UTC+8))
- #30378 [Feature]: Automatically infer Qwen3 reranker settings (remove need for hf_overrides) — feature request — by ilopezluna (created: 2025-12-10 17:21 (UTC+8))
- #30375 [Bug]: [TPU] ShapeDtypeStruct error when loading custom safetensors checkpoint on TPU v5litepod — bug — by Baltsat (created: 2025-12-10 16:12 (UTC+8))
- #30372 [Bug]: vLLM (GPT-OSS) causes distorted tool argument names + infinite tool-call loop with Korean messenger tool — bug — by minmini2 (created: 2025-12-10 14:59 (UTC+8))
- #30370 [Performance]: DeepSeek-V3.2 AWQ Performance is lower then i expected — performance — by yongho-chang (created: 2025-12-10 10:45 (UTC+8))
- #30343 [Bug]: In DeepSeek-V3.2 tokenizer mode, detokenization saturates the main thread, causing the server to hang — bug — by scratch-ml (created: 2025-12-09 22:48 (UTC+8))
- #30358 [Bug]: Prefill scheduled num_block mismatch at update_state_after_alloc and request_finished — bug — by xuechendi (created: 2025-12-10 04:15 (UTC+8))
- #30368 [CI] Test target determination using LLM — feature request,ci — by khluu (created: 2025-12-10 09:42 (UTC+8))
- #30360 [RFC]: Consolidate FP8 min/max values into somewhere reasonable (Python only) — rocm,RFC — by rasmith (created: 2025-12-10 05:44 (UTC+8))
- #30342 [Bug]: HunyuanOCR batching problem with variable sized images in a batch. — bug — by anker-c2 (created: 2025-12-09 22:22 (UTC+8))
Closed Issues
- #15636 [Bug]: Outlines broken on vLLM 0.8+ — bug,structured-output,unstale — by cpfiffer (closed: 2025-12-10 21:18 (UTC+8))
- #30382 [Bug]: Issues with mistralai/Ministral-3-14B-Instruct-2512 — bug — by eltorre (closed: 2025-12-10 18:47 (UTC+8))
- #30381 [Usage]: — usage — by tobeprozy (closed: 2025-12-10 17:28 (UTC+8))
- #30379 [Usage]: how to use vllm/tests/? — usage — by tobeprozy (closed: 2025-12-10 17:26 (UTC+8))
- #30311 [Bug]: deepseekv32.DeepseekV32Tokenizer Runtime causes model to crash — bug — by magician-xin (closed: 2025-12-10 16:30 (UTC+8))
- #28314 [AMD][CI Failure]: Tracking failure for AMD CI Dependencies & Environments — rocm,ci-failure — by zhewenl (closed: 2025-12-10 13:32 (UTC+8))
- #30343 [Bug]: In DeepSeek-V3.2 tokenizer mode, detokenization saturates the main thread, causing the server to hang — bug — by scratch-ml (closed: 2025-12-10 12:05 (UTC+8))
- #20181 [Feature]: Batch inference for Multi-Modal Online Serving — feature request,stale — by eslambakr (closed: 2025-12-10 10:25 (UTC+8))
- #21097 [Bug]: w8a8 quantization not supporting sm120 — bug,stale — by sarmiena (closed: 2025-12-10 10:24 (UTC+8))
- #21909 [Bug]: quant_method is not None — bug,stale — by maxin9966 (closed: 2025-12-10 10:24 (UTC+8))
- #22325 [Bug]: gpt-oss model crashes on NVIDIA B200 with any OpenAI chat completion request — bug,stale — by teds-lin (closed: 2025-12-10 10:24 (UTC+8))
- #22422 [Feature]: mxfp4 support for 3090 — feature request,stale — by ehartford (closed: 2025-12-10 10:24 (UTC+8))
- #22501 [Usage]: Running a 300-400B Parameter Model on Multi-Node Setup (2x 8xA100) — usage,stale — by rangehow (closed: 2025-12-10 10:24 (UTC+8))
- #22575 [Bug]: Vllm hangs when I use the offline engine with dp = 2 or more — bug,stale — by Stealthwriter (closed: 2025-12-10 10:24 (UTC+8))
- #22623 [Usage]: if openai-mirror/gpt-oss-20b can run in A100? — usage,stale — by neverstoplearn (closed: 2025-12-10 10:23 (UTC+8))
- #22624 [Bug]: 1.7B fp16 + 0.6B draft OOM with gpu_memory_utilization=0.9, while 4B int8 + 0.6B works fine on A800 80 GB — bug,stale — by kiexu (closed: 2025-12-10 10:23 (UTC+8))
- #22639 [Bug]: function convert_lark_to_gbnf interpreting ‘#’ to parse as lark commentaries — bug,stale — by renout-nicolas (closed: 2025-12-10 10:23 (UTC+8))
- #30206 [Bug]: DeepSeek-V3.2 DeepGEMM RuntimeError — bug — by coval3nte (closed: 2025-12-09 22:55 (UTC+8))
- #29840 [Bug]: LMCacheConnectorV1Impl has no attribute ‘layerwise_storers’ on remote full cache hit with layerwise mode — bug — by XinyiQiao (closed: 2025-12-10 01:11 (UTC+8))
New PRs
- #30390 fix: Update json features supported by xGrammar — structured-output,v1 — by johannesflommersfeld (created: 2025-12-10 20:51 (UTC+8))
- #30391 [IMPROVEMENT] Change MistralReasoningParser behavior — no labels — by juliendenize (created: 2025-12-10 21:02 (UTC+8))
- #30389 Standardise `get_rope` to use `rope_parameters["partial_rotary_factor"]`, not `rotary_dim` — performance,llama,qwen,deepseek,gpt-oss — by hmellor (created: 2025-12-10 20:33 (UTC+8))
- #30384 [BugFix] Fix minimax m2 model rotary_dim — ready — by rogeryoungh (created: 2025-12-10 18:37 (UTC+8))
- #30340 Add Eagle and Eagle3 support to Transformers modeling backend — no labels — by hmellor (created: 2025-12-09 22:09 (UTC+8))
- #30388 [Docs] Generate full list of metrics in user docs — documentation — by markmc (created: 2025-12-10 19:50 (UTC+8))
- #30386 [v1] Add PrefixLM support to TritonAttention backend — v1 — by Isotr0py (created: 2025-12-10 19:32 (UTC+8))
- #30385 [Core] Whisper support `torch.compile` — v1 — by NickLucche (created: 2025-12-10 19:11 (UTC+8))
- #30376 [Fix] fix import error from lmcache — kv-connector — by wz1qqx (created: 2025-12-10 16:38 (UTC+8))
- #30348 [Docs]: adds a new metric vllm:request_prefill_kv_computed_tokens in docs — documentation — by googs1025 (created: 2025-12-09 23:24 (UTC+8))
- #30361 [Attention][AMD] Make flash-attn optional — rocm,speculative-decoding,v1 — by mgehre-amd (created: 2025-12-10 06:46 (UTC+8))
- #30349 [BugFix] Fix minimax m2 model rope_parameters — no labels — by esmeetu (created: 2025-12-09 23:42 (UTC+8))
- #30351 [Bugfix] Cache added_vocab to avoid per-token overhead — ready — by scratch-ml (created: 2025-12-10 00:57 (UTC+8))
- #30364 [Bugfix] awq_gemm: fix argument order swap — no labels — by mgehre-amd (created: 2025-12-10 07:17 (UTC+8))
- #30377 adding constraint updates of cos-sin to improve mrope performance — no labels — by wujinyuan1 (created: 2025-12-10 16:48 (UTC+8))
- #30373 Implement LMDB-based multi-modal cache — ci/build,v1,multi-modality — by petersalas (created: 2025-12-10 15:21 (UTC+8))
- #30371 [Bugfix] Fix the issue where DeepSeek v3.2 cannot use structured_output — structured-output,ready,v1,deepseek — by chaunceyjiang (created: 2025-12-10 12:26 (UTC+8))
- #30344 [Bugfix] Fix HunyuanOCR cross-image contamination in batch processing — no labels — by anker-c2 (created: 2025-12-09 22:49 (UTC+8))
- #30346 [Core] Major fix catch backend grammar exceptions (xgrammar, outlines, etc) in scheduler — v1 — by blancsw (created: 2025-12-09 22:58 (UTC+8))
- #30347 [cpu][ci] Add CPU Attention Tests for Neon Backend — ready — by fadara01 (created: 2025-12-09 23:14 (UTC+8))
- #30339 [CMake][Build]: Remove unused ACL CMake env variables — ready,ci/build — by Radu2k (created: 2025-12-09 22:09 (UTC+8))
- #30369 [Fix] Add default rope theta for qwen1 model — qwen — by iwzbi (created: 2025-12-10 10:36 (UTC+8))
- #30345 Fix typos in comments across multiple files — documentation,ready,v1 — by wilsonwu (created: 2025-12-09 22:56 (UTC+8))
- #30367 [CI] Reduce Flakiness For test_spec_decode.py::test_suffix_decoding_acceptance — ready,v1 — by micah-wil (created: 2025-12-10 08:18 (UTC+8))
- #30363 Remove all2all backend envvar — documentation,ci/build — by elizabetht (created: 2025-12-10 07:09 (UTC+8))
- #30366 [Bug Fix] Fix Kimi-Linear model initialization crash due to missing ‘indexer_rotary_emb’ arg — no labels — by yonasTMC (created: 2025-12-10 08:02 (UTC+8))
- #30357 Upstream fp8 with static scales gpt oss — needs-rebase,gpt-oss — by maleksan85 (created: 2025-12-10 03:49 (UTC+8))
- #30365 fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell — no labels — by kitaekatt (created: 2025-12-10 07:59 (UTC+8))
- #30362 [WIP] Bump dockerfile to cuda 13.0.2 (for testing) — ci/build,nvidia — by dougbtv (created: 2025-12-10 06:51 (UTC+8))
- #30353 [Fix] Handle multiple tool calls in Qwen3-MTP tool parser — frontend,tool-calling,qwen — by ArkVex (created: 2025-12-10 01:48 (UTC+8))
- #30356 [CI][DeepSeek] Add nightly DeepSeek R1 `lm_eval` tests on H200 — ready,ci/build,deepseek — by MatthewBonanni (created: 2025-12-10 02:05 (UTC+8))
- #30352 [CI/Test] Fix FP8 per-tensor quant test reference scale shape — ready — by LucasWilkinson (created: 2025-12-10 01:30 (UTC+8))
- #30354 [WIP][Core] Update PyTorch to 2.9.1 generally — rocm,ci/build,nvidia — by orionr (created: 2025-12-10 01:49 (UTC+8))
- #30355 [Model Runner V2] Fix Triton warning on tl.where — v1 — by WoosukKwon (created: 2025-12-10 01:56 (UTC+8))
- #30350 Remove virtual engine handling — tpu,needs-rebase,v1,codex,qwen,kv-connector — by WoosukKwon (created: 2025-12-10 00:34 (UTC+8))
- #30341 [CI] refine more logic when generating and using nightly wheels & indices — ci/build — by Harry-Chen (created: 2025-12-09 22:17 (UTC+8))
- #30338 Fix gigachat3 parser + update tests — frontend,tool-calling — by ajpqs (created: 2025-12-09 21:37 (UTC+8))
Merged PRs
- #30384 [BugFix] Fix minimax m2 model rotary_dim — ready — by rogeryoungh (merged: 2025-12-10 20:58 (UTC+8))
- #30062 [CPU] Support for Whisper — ready,ci/build,v1,multi-modality — by aditew01 (merged: 2025-12-10 20:58 (UTC+8))
- #30331 [Bugfix] tpu_model_runner: set vllm config context when calling reset_dynamo_cache() — tpu,ready,v1 — by dtrifiro (merged: 2025-12-10 20:58 (UTC+8))
- #30351 [Bugfix] Cache added_vocab to avoid per-token overhead — ready — by scratch-ml (merged: 2025-12-10 12:05 (UTC+8))
- #30332 [BUGFIX] Mistral tool call parser v11+ — frontend,ready,tool-calling — by juliendenize (merged: 2025-12-09 22:55 (UTC+8))
- #30371 [Bugfix] Fix the issue where DeepSeek v3.2 cannot use structured_output — structured-output,ready,v1,deepseek — by chaunceyjiang (merged: 2025-12-10 16:30 (UTC+8))
- #30347 [cpu][ci] Add CPU Attention Tests for Neon Backend — ready — by fadara01 (merged: 2025-12-10 13:37 (UTC+8))
- #29358 [ROCm][CI] Attempt to fix the failures under a subgroup of the e2e the test group — rocm,ready,ci/build,v1,multi-modality — by AndreasKaratzas (merged: 2025-12-10 13:33 (UTC+8))
- #30339 [CMake][Build]: Remove unused ACL CMake env variables — ready,ci/build — by Radu2k (merged: 2025-12-10 12:27 (UTC+8))
- #30345 Fix typos in comments across multiple files — documentation,ready,v1 — by wilsonwu (merged: 2025-12-10 12:05 (UTC+8))
- #30308 [bugfix][quantization] fix quark qwen3 kv_cache quantization — ready,qwen — by haoyangli-amd (merged: 2025-12-10 11:24 (UTC+8))
- #30367 [CI] Reduce Flakiness For test_spec_decode.py::test_suffix_decoding_acceptance — ready,v1 — by micah-wil (merged: 2025-12-10 10:35 (UTC+8))
- #30230 [responsesAPI][6] Fix multi turn MCP tokenization — documentation,frontend,ready,gpt-oss — by qandrew (merged: 2025-12-10 10:13 (UTC+8))
- #30020 [CI/Build][AMD] Skip quantization kernels tests that require CUTLASS or e4m3fn when not supported by platform — rocm,ready,nvidia — by rasmith (merged: 2025-12-10 10:28 (UTC+8))
- #30336 [Bugfix] Fix fp8 DeepGemm compilation issues — bug,ready,ci-failure,deepseek — by ElizaWszola (merged: 2025-12-10 09:17 (UTC+8))
- #29624 [Attention] Make seq_lens_cpu optional in CommonAttentionMetadata to enable true async spec-decode — speculative-decoding,ready,v1,ready-run-all-tests — by LucasWilkinson (merged: 2025-12-10 09:18 (UTC+8))
- #30330 [Bugfix] Fix cuda graph sizes when running with speculative decoding — ready,nvidia — by PatrykSaffer (merged: 2025-12-10 08:47 (UTC+8))
- #29723 [V1][Spec Decode] Optimize Medusa proposer to avoid GPU-CPU sync — speculative-decoding,ready,v1 — by dongbo910220 (merged: 2025-12-10 08:15 (UTC+8))
- #29937 Improve wvsplitK tile and balance heuristics. — rocm,ready — by amd-hhashemi (merged: 2025-12-10 07:51 (UTC+8))
- #25693 [Rocm][torch.compile] Adding layernorm + fp8 block quant and silu + fp8 block quant for Aiter — rocm,ready — by charlifu (merged: 2025-12-10 06:39 (UTC+8))
- #30119 [BugFix] Fix DeepSeek-R1 hang with DP and MTP — ready,v1,deepseek — by LucasWilkinson (merged: 2025-12-10 02:51 (UTC+8))
- #29066 [MoE][Refactor] Remove most arguments to FusedMoEMethodBase.apply — moe,ready,nvidia,ready-run-all-tests — by bnellnm (merged: 2025-12-10 05:48 (UTC+8))
- #28480 [Quantization] FP8 Weight Reloading for Quantized RL Rollout — quantization,ready,rl — by kylesayrs (merged: 2025-12-10 05:54 (UTC+8))
- #30277 [BugFix] Fix non detected failing tests — ready,ci/build — by ilmarkov (merged: 2025-12-10 01:57 (UTC+8))
- #29145 [CI/Build] Make test_mha_attn.py run on correct platform only and check for flash_attn_varlen_func in layer.py — rocm,ready,ci/build — by rasmith (merged: 2025-12-10 04:18 (UTC+8))
- #30234 Bump actions/stale from 10.1.0 to 10.1.1 — ready,dependencies,ci/build,github_actions — by dependabot (merged: 2025-12-10 04:12 (UTC+8))
- #30233 Bump actions/checkout from 6.0.0 to 6.0.1 — ready,dependencies,ci/build,github_actions — by dependabot (merged: 2025-12-10 04:03 (UTC+8))
- #30307 [Model][Quantization] Fix / Add GGUF support for Qwen2 MoE models — ready,qwen — by a4lg (merged: 2025-12-10 03:13 (UTC+8))
- #30352 [CI/Test] Fix FP8 per-tensor quant test reference scale shape — ready — by LucasWilkinson (merged: 2025-12-10 02:52 (UTC+8))
- #29912 [Cleanup] Refactor profiling env vars into a CLI config — documentation,performance,structured-output,frontend,tpu,ready,v1 — by benchislett (merged: 2025-12-10 02:29 (UTC+8))
- #30187 [Model Runner V2] Support num NaNs in logits — v1 — by WoosukKwon (merged: 2025-12-10 02:00 (UTC+8))
- #30355 [Model Runner V2] Fix Triton warning on tl.where — v1 — by WoosukKwon (merged: 2025-12-10 01:59 (UTC+8))
- #29897 [Compile] Fix torch warning `TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled` — ready,v1 — by yewentao256 (merged: 2025-12-09 23:40 (UTC+8))
- #30298 Update AMD test definitions (2025-12-08) — rocm,ready,ci/build,amd — by Alexei-V-Ivanov-AMD (merged: 2025-12-10 01:31 (UTC+8))
- #30173 [BugFix] Fix `assert batch_descriptor.num_tokens == num_tokens_padded` — speculative-decoding,ready,v1,nvidia — by LucasWilkinson (merged: 2025-12-09 23:36 (UTC+8))
- #30018 [Feature] Batch-Invariant Support for FA2 and LoRA — ready,v1 — by quanliu1991 (merged: 2025-12-09 23:01 (UTC+8))
- #25552 [ROCm] Aiter Quant Kernels — rocm,ready,ci/build — by vllmellm (merged: 2025-12-09 22:27 (UTC+8))
PRs Closed Without Merging
- #23997 Feature/sampler benchmark #23977 — performance,unstale — by baonudesifeizhai (closed: 2025-12-10 21:14 (UTC+8))
- #26701 [ROCm]: W8A8BlockFp8LinearOp should use AITER on MI355X — rocm,ready,needs-rebase — by gronsti-amd (closed: 2025-12-10 20:33 (UTC+8))
- #30348 [Docs]: adds a new metric vllm:request_prefill_kv_computed_tokens in docs — documentation — by googs1025 (closed: 2025-12-10 19:59 (UTC+8))
- #30349 [BugFix] Fix minimax m2 model rope_parameters — no labels — by esmeetu (closed: 2025-12-10 18:47 (UTC+8))
- #29653 fix potential object has no attribute ‘bias’ error — no labels — by allerou4 (closed: 2025-12-10 15:16 (UTC+8))
- #30297 [Core] Add SLA-tiered scheduling (opt-in) and docs — documentation,v1 — by ProdByBuddha (closed: 2025-12-10 13:13 (UTC+8))
- #30327 [BugFix] Fix hang issue in LMCache mp mode — v1,kv-connector — by wz1qqx (closed: 2025-12-10 10:32 (UTC+8))
- #17830 cmake: Get rid of VLLM_PYTHON_EXECUTABLE — needs-rebase,ci/build,stale — by seemethere (closed: 2025-12-10 10:26 (UTC+8))
- #17872 measure peak memory correctly by removing already used memory — needs-rebase,stale,v1 — by MiladInk (closed: 2025-12-10 10:25 (UTC+8))
- #17959 [Bugfix] fix check kv cache memory log info — needs-rebase,stale,v1 — by BoL0150 (closed: 2025-12-10 10:25 (UTC+8))
- #21056 [Feature][EPLB] Add EPLB support for MiniMax-01 — stale — by haveheartt (closed: 2025-12-10 10:24 (UTC+8))
- #21413 Intentionally fail parallel sampling test — stale,v1 — by sethkimmel3 (closed: 2025-12-10 10:24 (UTC+8))
- #21506 [V1][SpecDecode]Support relaxed acceptance for thinking tokens in speculative decoding in V1 — documentation,rocm,frontend,ci/build,stale,v1,multi-modality,tool-calling — by DW934 (closed: 2025-12-10 10:24 (UTC+8))
- #22238 [V1][SpecDecode]Support Relaxed Acceptance for thinking tokens in speculative decoding when using greedy search, camp up by Nvidia. — stale,v1 — by DW934 (closed: 2025-12-10 10:24 (UTC+8))
- #22488 Feat/sliding window metrics — Related to #22480 — needs-rebase,stale,v1 — by NumberWan (closed: 2025-12-10 10:24 (UTC+8))
- #22632 [Bugfix] fix deepseek_r1_reasoning bugs when in contents. — stale,deepseek — by z2415445508 (closed: 2025-12-10 10:23 (UTC+8))
- #27594 Fix intermediatetensors spawn error #27591 — qwen — by baonudesifeizhai (closed: 2025-12-10 08:44 (UTC+8))
- #28627 [Weight Loading] Expand quantized weight reloading support — needs-rebase,v1 — by kylesayrs (closed: 2025-12-10 05:48 (UTC+8))
- #30354 [WIP][Core] Update PyTorch to 2.9.1 generally — rocm,ci/build,nvidia — by orionr (closed: 2025-12-10 02:46 (UTC+8))
- #30063 Mistral tool parser — frontend,tool-calling — by graelo (closed: 2025-12-09 23:56 (UTC+8))
- #27305 [ROCm][torch.compile] Adding MulAddFusionPass to enable AITER fused_mul_add — rocm,needs-rebase — by micah-wil (closed: 2025-12-09 23:49 (UTC+8))
- #26257 [Feature][torch.compile] Add pass to rearrange AllGather for FP8 models in sequence parallel for better Async TP fusion — needs-rebase,ci/build — by jasonlizhengjian (closed: 2025-12-09 22:59 (UTC+8))
- #25618 [ROCm][Allreduce] Add dispatch mechanism for choosing performant allreduce implementations for AMD platforms — rocm,needs-rebase,nvidia — by zejunchen-zejun (closed: 2025-12-09 21:45 (UTC+8))