vLLM Development Activity Report - 2025-12-19

Time window: 2025-12-19 10:29 (UTC+8) ~ 2025-12-20 10:29 (UTC+8)
Statistics: New Issues 19 | Closed Issues 36 | New PRs 38 | Merged PRs 23 | PRs Closed Without Merging 31

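📊 Daily Development Status Summary

In the 24-hour window from December 19 to 20, 2025, the vLLM project stayed highly active, with 55 issue events (19 opened, 36 closed) and 61 PR events (38 opened, 23 merged), plus 31 PRs closed without merging. Development focused on three areas: diagnosing and fixing performance regressions (notably speculative decoding under high concurrency and quantized models), AMD ecosystem features (the Quark online quantization RFC), and kernel optimization and refactoring (MoE and attention kernels). The community also cleared a large number of stale issues.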

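🎯 AMD/ROCm Ecosystem Updates

AMD-related activity in this window centered on a new feature proposal and on resolving a user environment problem, reflecting the AMD team's continued work on deepening its vLLM integration.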

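New Issue [#31028]: [RFC]: AMD-Quark Online Quantization
- Contributor: hangy-amd (AMD employee).
- Content: A detailed design proposal for adding online quantization to vLLM through the AMD Quark backend. It compares the offline and online quantization flows and plans to reuse most of the existing offline quantization pipeline, inserting quantization steps only during model initialization and weight loading. The goal is to let users obtain a quantized model simply by setting the quantization parameter, with no pre-quantized checkpoint, which simplifies deployment.
- Discussion: Maintainer robertgshaw2-redhat asked which quantization schemes will be supported (presumably FP8) and suggested decoupling the feature from any specific backend such as Quark, implementing it through a generic backend to avoid confusion.
- Analysis: AMD is pushing its quantization technology deeper into the vLLM workflow, moving from "running already-quantized models" to "quantizing dynamically at load time", with the aim of improving user experience and deployment flexibility.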
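The load-time flow described in the RFC can be illustrated with a minimal sketch. This is not the Quark API or vLLM's weight loader: `quantize_on_load`, the symmetric per-tensor int8 scheme, and the plain-list weights are all assumptions chosen to keep the example self-contained (the discussion suggests FP8 is the more likely target).

```python
from typing import Dict, List, Tuple

QMAX = 127  # assumed symmetric int8 range; the RFC itself may target FP8


def quantize_tensor(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    q = [max(-QMAX, min(QMAX, round(w / scale))) for w in weights]
    return q, scale


def dequantize_tensor(q: List[int], scale: float) -> List[float]:
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


def quantize_on_load(
    checkpoint: Dict[str, List[float]],
) -> Dict[str, Tuple[List[int], float]]:
    """Hypothetical loader hook: quantize each weight as it is loaded,
    so no pre-quantized checkpoint is required."""
    return {name: quantize_tensor(w) for name, w in checkpoint.items()}


if __name__ == "__main__":
    ckpt = {"layer0.weight": [0.5, -1.27, 0.01, 1.0]}
    q, scale = quantize_on_load(ckpt)["layer0.weight"]
    print(q, dequantize_tensor(q, scale))
```

The point of the sketch is the placement of the quantization step: it runs once per tensor during loading, so the rest of an offline-quantized inference path could in principle be reused unchanged.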
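New/Closed Issue [#31035]: [Bug]: ROCm docker image failed to build
- Content: A user failed to build the ROCm Docker image for a Radeon RX 7800 XT (gfx1101 architecture).
- Resolution: AMD maintainer hongxiayang responded quickly and pointed out that the build context path in the user's command (/opt/llm/images/vllm-rocm) was wrong; changing it to the current directory (.) resolved the problem. The issue was labeled a user error and closed.
- Analysis: The report reflects community interest in AMD GPU support and shows that the AMD team can respond quickly to user problems.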
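Merged PR [#31021]: [CustomOp][Refactor] Extract common methods for ApplyRotaryEmb CustomOp
- Contributor: shen-shanshan (judging from comment interactions, works closely with the AMD team).
- Content: Refactors the ApplyRotaryEmb custom op by extracting shared pre-processing (_pre_process) and post-processing (_post_process) methods to eliminate redundant code.
- Analysis: Although this PR does not touch AMD hardware directly, it is part of the infrastructure cleanup relevant to the ROCm platform. It improves maintainability and prepares the ground for custom ops that may later be optimized for AMD hardware.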
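The shape of this refactor can be sketched as follows. The class below is an illustrative stand-in, not vLLM's ApplyRotaryEmb: only the helper names `_pre_process` and `_post_process` come from the PR description, while the half-split layout, the list-based math, and `forward_other_backend` are assumptions.

```python
from typing import List, Tuple

Vec = List[float]


class ApplyRotaryEmbSketch:
    """Illustrative stand-in showing shared pre/post-processing helpers."""

    def _pre_process(self, x: Vec) -> Tuple[Vec, Vec]:
        # Shared step: split the vector into two halves (assumed layout).
        half = len(x) // 2
        return x[:half], x[half:]

    def _post_process(self, o1: Vec, o2: Vec) -> Vec:
        # Shared step: join the rotated halves back together.
        return o1 + o2

    def forward_native(self, x: Vec, cos: Vec, sin: Vec) -> Vec:
        x1, x2 = self._pre_process(x)
        o1 = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
        o2 = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        return self._post_process(o1, o2)

    def forward_other_backend(self, x: Vec, cos: Vec, sin: Vec) -> Vec:
        # A second (hypothetical) backend path reuses the same helpers;
        # only the core rotation would differ in a real implementation.
        x1, x2 = self._pre_process(x)
        o1: Vec = []
        o2: Vec = []
        for a, b, c, s in zip(x1, x2, cos, sin):
            o1.append(a * c - b * s)
            o2.append(b * c + a * s)
        return self._post_process(o1, o2)
```

With the helpers factored out, each backend-specific method contains only the rotation itself, which is the duplication the PR is described as removing.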

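Summary: The AMD team is advancing the design of online quantization while also handling user-side environment issues and continuing to improve code infrastructure, making progress on several layers at once, from low-level support to high-level features.

💬 High-Traffic Discussion Analysis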

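Issue [#31014]: [Bug]: GPT-OSS-120B Eagle-v2 High concurrency perf drop
- Core topic: On a B200 server, PR #29624 caused the speculative decoding acceptance length of GPT-OSS-120B to drop significantly under high concurrency (from ~2.4 to ~1.9), a performance regression.
- Positions:
  - Reporter (shyeh25): provided detailed benchmark data and identified the commit that introduced the regression.
  - Maintainer (nvpohanh): suspects the PR introduced a race condition in which, when the CPU runs too far ahead, the draft model's output becomes unstable or even corrupted.
  - Another developer (benchislett): reproduced the problem and noted that the drop appears only when async scheduling is enabled. He also reported that PR #29845 fixes his reproduction case and even yields a performance gain of more than 5%, and suggested re-evaluating after that PR lands.
- Point of contention: none; the discussion focuses on root-cause analysis and on validating a fix.
- Current status: unresolved. The discussion points to a possible fix (PR #29845), but it still needs to be verified against the originally reported B200 high-concurrency scenario.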
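To put the reported numbers in perspective, the short sketch below computes acceptance length and the size of the drop. It assumes the common definition of mean acceptance length as one plus accepted draft tokens per draft step; the function names and the token counts in the usage note are illustrative, not taken from the issue.

```python
def mean_acceptance_length(num_accepted_tokens: int, num_drafts: int) -> float:
    """Average tokens emitted per verification step: the accepted draft
    tokens plus the one token the target model always contributes."""
    if num_drafts <= 0:
        raise ValueError("num_drafts must be positive")
    return 1.0 + num_accepted_tokens / num_drafts


def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    # Figures quoted in the issue: roughly 2.4 before, roughly 1.9 after.
    print(round(relative_drop(2.4, 1.9), 3))  # 0.208
```

For example, `mean_acceptance_length(140, 100)` gives about 2.4 under this definition. If the cost of a verification step is roughly constant, a drop of about 21% in acceptance length translates into a comparable loss in decode throughput, which is why the regression matters at high concurrency.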
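Issue [#31018]: [Bug]: ImportError: libcudart.so.12
- Core topic: The user's environment has a CUDA version mismatch (the system has CUDA 13.0, but vLLM was built and linked against CUDA 12 libraries), so startup fails.
- Positions:
  - Maintainer (robertgshaw2-redhat): quickly identified the nature of the problem.
- Summary: This is a common environment configuration problem, and the maintainer gave a clear diagnostic direction. The thread was not especially busy, but it shows how quickly the community responds to common problems.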
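A mismatch of this kind can be checked from Python before starting vLLM. The helper below is a minimal diagnostic sketch using only the standard `ctypes` module; the function names are hypothetical and the default soname is simply the one from the error message.

```python
import ctypes
import ctypes.util


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic linker can load the given shared library."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        return False


def describe_cudart(soname: str = "libcudart.so.12") -> str:
    """Report whether the CUDA runtime that the binary expects is loadable."""
    if can_load(soname):
        return f"{soname}: loadable"
    found = ctypes.util.find_library("cudart")
    if found:
        hint = f"the linker finds {found!r} instead (likely a version mismatch)"
    else:
        hint = "no libcudart is visible to the linker"
    return f"{soname}: NOT loadable; {hint}"


if __name__ == "__main__":
    print(describe_cudart())
```

If the check reports a different major version, the usual remedies are installing a vLLM build that matches the system CUDA version or making the expected runtime library available on the loader path.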
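Issue [#31045]: [Bug]: No start tag using Qwen3-235B-A22B-Thinking-2507-FP8
- Core topic: When using a "thinking" model, the expected <think> tag does not appear in the output.
- Positions:
  - Another user (qingy1337): suggested that the first <think> tag may be added automatically by the chat template and therefore never shown in the output, but that the missing </think> tag is still a problem. Recommended fixing it by modifying the chat template.
  - Reporter (celsowm): later found and shared a solution: pass the --reasoning-parser deepseek_r1 argument.
- Current conclusion: the problem was resolved by using the correct reasoning parser argument, and the issue was closed quickly.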
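The behavior discussed in this thread, where the opening tag lives in the prompt and only the closing tag appears in the generated text, can be handled by a parser that keys on the closing tag. The function below is a simplified illustration of that idea, not the actual implementation of the deepseek_r1 reasoning parser; `split_reasoning` and its return convention are assumptions.

```python
from typing import Optional, Tuple

START, END = "<think>", "</think>"


def split_reasoning(output: str) -> Tuple[Optional[str], str]:
    """Split model output into (reasoning, answer).

    Tolerates a missing start tag, which happens when the chat template
    has already emitted it as part of the prompt.
    """
    if END not in output:
        # Simplification: with no closing tag, treat everything as the answer.
        return None, output
    reasoning, _, answer = output.partition(END)
    if reasoning.startswith(START):
        reasoning = reasoning[len(START):]
    return reasoning.strip(), answer.strip()


if __name__ == "__main__":
    print(split_reasoning("check the units first</think>The result is 42."))
```

Keying on the closing tag means the same code path works whether or not the opening tag is present in the generated text.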

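🔥 Hot Topics and Trend Analysis

Performance regressions in focus: Several issues report significant performance drops in recent releases. #31030 covers a regression between 0.10.2 and 0.12.0, and #31014 traces a drop to a recently merged PR. This shows that performance monitoring and regression testing are under heavy pressure during rapid iteration, and that the community is very sensitive to performance changes.
Quantization on many fronts: Discussions covered several quantization approaches:
- Online quantization (#31028 - AMD Quark): aims to simplify the deployment flow.
- W4A8 Marlin (#31033): the report notes strong W4A8 Marlin performance on an RTX 5090 and also identifies a configuration-related failure (gpt-oss-20b fails to start with VLLM_MARLIN_INPUT_DTYPE=int8).
- Toolchain integration problems (#31019, #28197 - LLM-Compressor): users hit loading errors with models quantized by llm-compressor (especially sparse and MoE models), exposing compatibility challenges between external toolchains and vLLM.
Multimodal and vision-language support keeps deepening: One issue reports a problem with a Qwen3-VL 2:4 sparse quantized model (#31019), and a merged PR adds support for the new multimodal MoE model MiMo-V2-Flash (#30836). Models in this area iterate quickly, and support work has to keep pace.
Stronger MoE support: Besides new model support, there is a PR fixing the router logits dtype for GLM-4 MoE under data parallelism (#31055) and ongoing refactoring of the MoE kernels (#30825), showing that optimization of this important architecture continues.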

🛠️ Key Technical Changes

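Merged PR [#30887]: [Bugfix] Triton attention kernels: mask out V blocks…
- Interpretation: Fixes a critical sliding-window bug in the Triton attention kernels. Previously, V (value) blocks that fall outside the sliding window were not masked correctly, so the kernel could read garbage data (even NaN) and corrupt the output. The fix completes the masking logic in both the 3D and 2D kernels and resolves a long-standing latent correctness problem.
- Impact: Improves inference correctness and stability for models with sliding-window attention (such as GPT-OSS) on the Triton attention backend, fixing related crashes and incorrect outputs.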
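Why the V blocks themselves must be masked, and not only the attention scores, can be shown with a few lines of arithmetic: a zero attention weight multiplied by NaN is still NaN. The sketch below is a scalar illustration in plain Python, not the Triton kernel; the window convention in `in_window` and all names are assumptions.

```python
import math
from typing import List


def in_window(query_pos: int, key_pos: int, window: int) -> bool:
    """Assumed causal sliding window: the query sees itself and the
    previous window - 1 keys."""
    return query_pos - window < key_pos <= query_pos


def attend(probs: List[float], values: List[float],
           query_pos: int, window: int, mask_values: bool) -> float:
    """Weighted sum of values; optionally zero out values outside the window."""
    if mask_values:
        values = [v if in_window(query_pos, k, window) else 0.0
                  for k, v in enumerate(values)]
    return sum(p * v for p, v in zip(probs, values))


if __name__ == "__main__":
    probs = [0.0, 0.5, 0.5]            # key 0 is outside the window: weight 0
    values = [math.nan, 1.0, 3.0]      # its slot holds garbage (NaN)
    print(attend(probs, values, query_pos=2, window=2, mask_values=False))  # nan
    print(attend(probs, values, query_pos=2, window=2, mask_values=True))   # 2.0
```

Even though the out-of-window key receives zero probability, the unmasked sum is NaN, which matches the corrupted outputs described above.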
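Merged PRs [#30825]/[#30990]: [MoE Refactor][2/N] Use Modular Kernels for Fp8 and [3/N] Deprecate cutlass block quant fp8 (b200)
- Interpretation: Both are part of the fused MoE kernel refactor. #30825 starts routing FP8 TP (tensor parallel) kernel calls through the modular kernel interface. #30990 deprecates and removes the CUTLASS Block Scale FP8 kernel because it does not work on Blackwell (B200) and is unmaintained, and recommends alternatives such as FlashInfer.
- Impact: Simplifies the MoE kernel code structure, easing future optimization and maintenance, and removes an unmaintained code path, reducing the room for latent problems.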
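Merged PR [#30846]: [Bugfix] Fix logprobs with spec decode and modified logits
- Interpretation: Fixes a problem that occurs when speculative decoding is used together with the "raw" logprobs/logits mode. When the sampling parameters require the logits to be modified in place, the rejection sampler in speculative decoding did not clone the logits first, which produced errors.
- Impact: Ensures that speculative decoding is compatible with the logprobs feature, so developers can combine the speed-up with detailed output analysis.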
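The class of bug described here is an aliasing problem: if the "raw" logits are kept by reference and the sampler then modifies them in place, the raw values are silently overwritten. The sketch below reproduces that pattern with plain lists. It is not vLLM's rejection sampler; `apply_temperature_`, `sample_step`, and the `clone_raw` flag are hypothetical names.

```python
from typing import List, Tuple


def apply_temperature_(logits: List[float], temperature: float) -> None:
    """Scale logits in place (the trailing underscore marks mutation)."""
    for i in range(len(logits)):
        logits[i] /= temperature


def sample_step(logits: List[float], temperature: float,
                clone_raw: bool) -> Tuple[List[float], List[float]]:
    """Return (raw_logits_for_logprobs, processed_logits)."""
    # Without a clone, `raw` is just another name for the same list.
    raw = list(logits) if clone_raw else logits
    apply_temperature_(logits, temperature)
    return raw, logits


if __name__ == "__main__":
    print(sample_step([2.0, 4.0], 2.0, clone_raw=False)[0])  # [1.0, 2.0]
    print(sample_step([2.0, 4.0], 2.0, clone_raw=True)[0])   # [2.0, 4.0]
```

Cloning costs an extra copy, so a reasonable design is to clone only when raw logprobs are requested and an in-place modification will actually occur.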
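Merged PR [#30836]: [Model] Add MiMo-V2-Flash support
- Interpretation: Adds vLLM support for MiMo-V2-Flash, Xiaomi's latest open-source multimodal MoE model.
- Impact: Extends vLLM's model coverage so the community can run this new model efficiently. According to the PR description, the model scores above 90% accuracy on the GSM8K benchmark.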

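📈 Development Activity Observations

Active contributors: An AMD employee (hangy-amd) submitted a significant RFC, showing that the AMD team is closely involved in core feature design. Core maintainers such as robertgshaw2-redhat, LucasWilkinson, and jinzhen-lin were very active and led several key fixes, refactors, and code reviews.
Efficient review and merging: 23 PRs were merged within 24 hours, including several important fixes (such as the Triton attention fix and the speculative decoding logprobs fix) and new model support, which suggests healthy continuous integration and review processes.
Engaged community: Users reported several in-depth technical problems (performance regressions, quantization issues), joined the discussions, and contributed solutions (for example in #31045), a sign of a mature technical community.
Maintenance in parallel: While new work kept arriving, the project also closed 36 issues, most of them stale ones auto-closed after long inactivity, which reflects sound project maintenance.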

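💡 Issues Worth Watching

GPT-OSS-120B high-concurrency performance regression (#31014): This affects large-model performance on high-end hardware (B200) and is tied to a recently merged optimization PR (#29624). The root cause needs to be pinned down and a fix landed soon, since this matters for vLLM's competitiveness in high-end inference scenarios.
AMD-Quark online quantization proposal (#31028): The RFC sets out an important feature direction. Points to watch are how the implementation incorporates the maintainer's "generic backend" suggestion and the timeline for landing it.
Performance regressions across versions (#31030 and others): Several users report slowdowns after upgrading. The project should systematically review major changes in the affected releases and consider building more complete performance benchmarking and monitoring to keep such problems from recurring.
LLM-Compressor toolchain integration (#31019, #28197): Because llm-compressor is a popular quantization tool, integration problems with vLLM affect the experience of using quantized models. The two development teams need to coordinate, decide where each problem belongs, and fix the incompatibilities.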

📋 Appendix: Detailed Data Lists

New Issues

#31043 [BugFix]: move torch.Size across graphs in split_graph | feature request,torch.compile | by BoyuanFeng (created: 2025-12-20 02:24 (UTC+8))
#31018 [Bug]: ImportError: libcudart.so.12: cannot open shared object file: No such file or directory | bug | by shahizat (created: 2025-12-19 16:47 (UTC+8))
#31014 [Bug]: GPT-OSS-120B Eagle-v2 High concurrency perf drop | bug,speculative-decoding | by shyeh25 (created: 2025-12-19 14:13 (UTC+8))
#31030 [Performance]: Regression between 0.10.2 and 0.12.0 | performance | by maximegmd (created: 2025-12-19 20:52 (UTC+8))
#31016 [Bug]: FlashInfer Incompatible with Sleep Mode | bug,help wanted | by xiaoxiaosuaxuan (created: 2025-12-19 16:04 (UTC+8))
#31019 [Bug]: Qwen3-VL 2:4 sparsity llm-compressor RuntimeError: shape mismatch (0.12, 0.13rc2) | bug,help wanted,good first issue | by SorenDreano (created: 2025-12-19 17:18 (UTC+8))
#31028 [RFC]: AMD-Quark Online Quantization | rocm,RFC | by hangy-amd (created: 2025-12-19 20:27 (UTC+8))
#31027 [Bug]: 0.13.0 start with CUDA error: the provided PTX was compiled with an unsupported toolchain. | bug | by xsank (created: 2025-12-19 19:50 (UTC+8))
#31045 [Bug]: No start tag using Qwen3-235B-A22B-Thinking-2507-FP8 | bug | by celsowm (created: 2025-12-20 03:22 (UTC+8))
#31035 [Bug]: ROCm docker image failed to build | bug,rocm | by Katharta (created: 2025-12-19 23:54 (UTC+8))
#31039 [Feature]: Integrate Sonic MoE | help wanted,good first issue,feature request | by robertgshaw2-redhat (created: 2025-12-20 01:29 (UTC+8))
#31044 [CI Failure]: Blackwell Fusion Tests | help wanted,torch.compile,ci-failure | by robertgshaw2-redhat (created: 2025-12-20 02:49 (UTC+8))
#31033 [Bug]: gpt-oss-20b fails to start when using W4A8 Marlin with VLLM_MARLIN_INPUT_DTYPE=int8 | bug,gpt-oss | by mgoin (created: 2025-12-19 23:19 (UTC+8))
#31009 [Docs] Feedback for /en/latest/ | no labels | by Leopipi26 (created: 2025-12-19 13:26 (UTC+8))
#31023 [Doc]: FP8 KV Cache: Does softmax output multiply with FP8 V directly or after dequantization? | documentation | by jorjiang (created: 2025-12-19 18:33 (UTC+8))
#31012 [Bug]: truncate_prompt_tokens in PoolingParams has no effect when using AsyncLLM.encode() | bug | by jeffreywang-anyscale (created: 2025-12-19 13:53 (UTC+8))
#31004 [New Model]: T5Gemma 2 | new-model | by ducviet00-h2 (created: 2025-12-19 11:55 (UTC+8))
#31005 [Bug]: Incorrect warning message for tensor parallel size in Ray executor (ray_utils.py:331-340) | bug | by jarieshan (created: 2025-12-19 12:37 (UTC+8))
#30999 [Feature] GLM45 tool parser: Stream tool name before full arguments | no labels | by so2liu (created: 2025-12-19 10:35 (UTC+8))

Closed Issues

#18577 [Bug]: Killing local vLLM worker processes in multiproc_worker_utils.py | bug,stale | by yliu2702 (closed: 2025-12-20 10:13 (UTC+8))
#19653 [Feature]: Add EP/DP/PD deps in docker image | feature request,stale | by simon-mo (closed: 2025-12-20 10:12 (UTC+8))
#19743 [Usage]: ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Gateway Time-out | usage,stale | by Shirley125 (closed: 2025-12-20 10:12 (UTC+8))
#20862 [Usage]: vLLM Server Launch Freezes at Using NCCL on B200 | usage,stale | by kimbochen (closed: 2025-12-20 10:12 (UTC+8))
#20962 [RFC][FEATURE]: TTFT Routing | feature request,stale | by chickeyton (closed: 2025-12-20 10:12 (UTC+8))
#22307 [Bug]: Qwen3-30B-A3B-Instruct-2507-FP deployment failed | bug,stale | by changqingla (closed: 2025-12-20 10:12 (UTC+8))
#22408 [Performance]: n-gram speculative decoding drafting optimizations | performance,stale | by Jialin (closed: 2025-12-20 10:12 (UTC+8))
#22454 [Bug]: Low time to first token of prefill and decode instances but high TTFT with 1p1d | bug,stale | by AsicDyc (closed: 2025-12-20 10:12 (UTC+8))
#22754 [Bug]: [V1][P/D] Perf Gap D2H copy for sampled token ids (output) unintentionally block all other copy operations from other CUDA streams for KV transfers | bug,performance,stale,v1 | by liuzijing2014 (closed: 2025-12-20 10:11 (UTC+8))
#22793 [Usage]: VLLM crash - Latest Docker Image - GPT-OSS-20B - on RTX5090 | usage,stale | by maglat (closed: 2025-12-20 10:11 (UTC+8))
#23144 [Bug]: Unexpected performance of Disaggregated Prefilling | bug,stale | by BruceW-07 (closed: 2025-12-20 10:11 (UTC+8))
#23235 Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions. | stale | by XiaYijun124 (closed: 2025-12-20 10:11 (UTC+8))
#23237 [Bug]: Expected there to be 1 prompt updates corresponding to 1 image items, but instead found 0 prompt updates! This is likely because you forgot to include input plac | bug,stale | by Relroy0517 (closed: 2025-12-20 10:11 (UTC+8))
#23249 [Bug]: vllm loading blocks on loading the model with torch.multiprocessing. | bug,stale | by aymane-03-cs (closed: 2025-12-20 10:11 (UTC+8))
#23256 [Bug]: Qwen2.5-VL GPTQ kernels (Marlin AND Machette) with Tensor Parallelism incorrect AND non-deterministic output since 0.10.0 | bug,stale | by SorenDreano (closed: 2025-12-20 10:11 (UTC+8))
#23263 [Feature]: Jina v3 support tasks | feature request,stale | by swtb3-ryder (closed: 2025-12-20 10:11 (UTC+8))
#23276 [Feature]: Support for Predicted Outputs | feature request,stale | by bwasti (closed: 2025-12-20 10:11 (UTC+8))
#23283 [Bug]: TypeError: AsyncEngineArgs.__init__() got an unexpected keyword argument ‘disable_log_requests’ | bug,stale | by rabeehk (closed: 2025-12-20 10:11 (UTC+8))
#23322 [Usage]: Optimization of CUDA Graph Memory Usage during Sampling Prewarming in kvcache | usage,stale | by EanWang (closed: 2025-12-20 10:11 (UTC+8))
#23326 [Bug]: In the deployment log of the latest version of vllm, after a successful request, the details of the input graph token are not displayed. How to set it up? | bug,stale | by lmingze (closed: 2025-12-20 10:11 (UTC+8))
#23376 [Installation]: Jetson Orin Nano Super with Jetpack 6.x can not pip install vllm | installation,stale | by hamesjan (closed: 2025-12-20 10:11 (UTC+8))
#28197 [Bug]: MoE models quantized with LLM compressor require Linear targets (which llmcompressor doesn’t add by default) | bug | by mratsim (closed: 2025-12-20 06:46 (UTC+8))
#31027 [Bug]: 0.13.0 start with CUDA error: the provided PTX was compiled with an unsupported toolchain. | bug | by xsank (closed: 2025-12-20 06:45 (UTC+8))
#30970 [Bug]: MPClient sends queue message but EngineCoreProc does not receive, causing timeout. | bug | by YuYun329 (closed: 2025-12-20 06:44 (UTC+8))
#30989 [Bug]: CUTLASS BLOCK SCALE FP8 IMA on B200 | bug | by robertgshaw2-redhat (closed: 2025-12-20 06:41 (UTC+8))
#31045 [Bug]: No start tag using Qwen3-235B-A22B-Thinking-2507-FP8 | bug | by celsowm (closed: 2025-12-20 06:35 (UTC+8))
#31035 [Bug]: ROCm docker image failed to build | bug,rocm | by Katharta (closed: 2025-12-20 06:24 (UTC+8))
#30682 [Bug]: No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). | bug | by shahizat (closed: 2025-12-19 16:39 (UTC+8))
#30861 [Bug]: TypeError: unhashable type: ‘dict’ when serving deepseek32 | bug | by jeejeelee (closed: 2025-12-20 05:07 (UTC+8))
#30904 [Bug]: Empty content on NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 | bug | by puppetm4st3r (closed: 2025-12-19 23:33 (UTC+8))
#31009 [Docs] Feedback for /en/latest/ | no labels | by Leopipi26 (closed: 2025-12-19 22:05 (UTC+8))
#24190 [Feature]: Model FLOPs Utilization Reporting | feature request,stale | by bwasti (closed: 2025-12-19 21:58 (UTC+8))
#26019 [Feature]: Add Aarch64 to PyTorch CI HUD vLLM dashboard | feature request,aarch64-cpu | by (closed: 2025-12-19 20:16 (UTC+8))
#29432 [Bug]: Add validation for tool requests that the tool is available | bug,good first issue,gpt-oss | by njhill (closed: 2025-12-19 18:23 (UTC+8))
#30857 [Bug]: 使用VLLM进行推理会卡死 (inference with vLLM hangs) | bug | by Bonjir (closed: 2025-12-19 17:44 (UTC+8))
#30999 [Feature] GLM45 tool parser: Stream tool name before full arguments | no labels | by so2liu (closed: 2025-12-19 10:58 (UTC+8))

New PRs

#31055 [Bugfix] Fix GLM-4 MoE router logits dtype for data parallel chunking | no labels | by ReinforcedKnowledge (created: 2025-12-20 10:18 (UTC+8))
#31053 [CI] FIx fixture 'siglip_attention_config' not found | rocm,ready,multi-modality | by LucasWilkinson (created: 2025-12-20 09:12 (UTC+8))
#31054 [CI] Fix H200 Distributed test | documentation,ready,ci/build | by LucasWilkinson (created: 2025-12-20 09:34 (UTC+8))
#31040 [AMD][CI] Add “V1 Test e2e + engine” to mi325_8 Agent Pool | rocm,ready,ci/build | by micah-wil (created: 2025-12-20 01:31 (UTC+8))
#31049 [CI] Add Qwen3-Next-FP8 to Blackwell model tests | ready,qwen | by vadiklyutiy (created: 2025-12-20 07:55 (UTC+8))
#31052 [MoE Refactor] Use modular kernel for unquantized Triton MoE | ready | by zyongye (created: 2025-12-20 08:39 (UTC+8))
#31036 [MoE Refactor][4/N] Marlin Fp8 Mk | no labels | by robertgshaw2-redhat (created: 2025-12-19 23:55 (UTC+8))
#31051 [DO NOT MERGE] Rename Flashinfer MLA Backend to TRTLLM MLA Backend | documentation,ci/build,v1,nvidia | by pavanimajety (created: 2025-12-20 08:25 (UTC+8))
#31050 [MoE Refactor] Split invoke_fused_moe_kernel | ready | by zyongye (created: 2025-12-20 08:04 (UTC+8))
#31007 [Qwen3-Omni] fixed _get_feat_extract_output_lengths function | ready,qwen | by wangxiongts (created: 2025-12-19 12:44 (UTC+8))
#31046 [Bug] Fix error 'Dynamo failed to run FX node with fake tensors for Deepseek V3.2 | ready,deepseek | by yewentao256 (created: 2025-12-20 05:15 (UTC+8))
#31048 [NIXL] delay req_id clean for _req_to_save to finished_request | kv-connector | by xuechendi (created: 2025-12-20 06:35 (UTC+8))
#31047 [DeepSeek v3.2] Remove unnecessary syncwarps | ready,deepseek | by MatthewBonanni (created: 2025-12-20 05:41 (UTC+8))
#31029 [BugFix] Fix architecture flags to prevent issues on SM103 | ci/build,nvidia | by LopezCastroRoberto (created: 2025-12-19 20:35 (UTC+8))
#31037 [Bug] Fusion pass | ready | by vllmellm (created: 2025-12-20 01:10 (UTC+8))
#31041 [Perf] Add skip_clone to SamplingParams for internal request handling | performance,frontend,ready | by mgoin (created: 2025-12-20 01:51 (UTC+8))
#31042 [Bugfix] Add init_workspace_manager to moe kernel benchmarks | performance,nvidia | by mgoin (created: 2025-12-20 02:03 (UTC+8))
#31013 [Bugfix] Read truncate_prompt_tokens from pooling_params in AsyncLLM.encode() | v1 | by jeffreywang-anyscale (created: 2025-12-19 14:01 (UTC+8))
#31038 [Bug] Fix hash issue RuntimeError: Worker failed with error 'unhashable type: 'dict'' | ready,v1 | by yewentao256 (created: 2025-12-20 01:12 (UTC+8))
#31034 [P/D] rework mooncake connector and introduce its bootstrap server | documentation,kv-connector | by dtcccc (created: 2025-12-19 23:33 (UTC+8))
#31032 [CI] Implement uploading to PyPI and GitHub in the release pipeline | ci/build | by Harry-Chen (created: 2025-12-19 22:17 (UTC+8))
#31021 [CustomOp][Refactor] Extract common methods for ApplyRotaryEmb CustomOp | rocm,ready | by shen-shanshan (created: 2025-12-19 17:33 (UTC+8))
#31031 Add pytest marks, update ci_config and test_area accordingly | ci/build | by avinashsingh77 (created: 2025-12-19 21:40 (UTC+8))
#31015 [Doc][CPU] Fix index link for CPU regular release wheels | documentation,ready,cpu | by bigPYJ1151 (created: 2025-12-19 14:45 (UTC+8))
#31026 [Bugfix] Fix Ray GPU count warning msg, support cuda-like platform | v1,nvidia | by jarieshan (created: 2025-12-19 19:46 (UTC+8))
#31025 [CI/Build][Intel] Add new performance benchmarks for Intel Gaudi 3 | performance,ci/build | by simonreginis (created: 2025-12-19 19:18 (UTC+8))
#31024 Move test dependencies from inline installs to Docker image | ci/build | by avinashsingh77 (created: 2025-12-19 18:46 (UTC+8))
#31022 [CI/Build][Bugfix] Fix regular release wheel upload path | ci/build | by bigPYJ1151 (created: 2025-12-19 17:52 (UTC+8))
#31020 [Bugfix] Fix GPU exceeds warning message on ray backend | v1 | by jarieshan (created: 2025-12-19 17:24 (UTC+8))
#31017 [bugfix]:Cannot connect to host when using disagg epd proxy | documentation,kv-connector | by Shirley125 (created: 2025-12-19 16:45 (UTC+8))
#31008 [Quantization] enable compressed-tensors marlin support for turing (2) | ready | by jinzhen-lin (created: 2025-12-19 12:46 (UTC+8))
#31006 [Bugfix] Fix Ray GPU availability warning message | v1 | by jarieshan (created: 2025-12-19 12:41 (UTC+8))
#31011 fix benchmark moe int8_w8a16 | performance | by RobotGF (created: 2025-12-19 13:52 (UTC+8))
#31010 [Doc] Update LWS docs example lws.yaml | documentation | by liveforfun (created: 2025-12-19 13:29 (UTC+8))
#31003 [Mics] add pcp basic support to MoE model | no labels | by pisceskkk (created: 2025-12-19 11:47 (UTC+8))
#31000 [Quantization] enable compressed-tensors marlin support for turing | no labels | by jinzhen-lin (created: 2025-12-19 11:14 (UTC+8))
#31002 feat(kernel): patch fused_gdn_gating | qwen | by OsirisDuan (created: 2025-12-19 11:41 (UTC+8))
#31001 Feature/nvfp4 universal fallback emulation | no labels | by Rob-P-Smith (created: 2025-12-19 11:17 (UTC+8))

Merged PRs

#30869 [Bugfix] fix the alias bug of AttentionBackendEnum when register CUSTOM attention backend to vllm | rocm,ready | by zejunchen-zejun (merged: 2025-12-20 09:03 (UTC+8))
#30876 GLM-4.7 Tool Parser and Doc Update | documentation,frontend,ready,tool-calling | by zRzRzRzRzRzRzR (merged: 2025-12-20 08:09 (UTC+8))
#30825 [MoE Refactor][2/N] Use Modular Kernels for Fp8 | ready | by robertgshaw2-redhat (merged: 2025-12-20 07:36 (UTC+8))
#31046 [Bug] Fix error 'Dynamo failed to run FX node with fake tensors for Deepseek V3.2 | ready,deepseek | by yewentao256 (merged: 2025-12-20 07:31 (UTC+8))
#30846 [BugFix] Fix logprobs with spec decode and modified logits | bug,ready,v1 | by njhill (merged: 2025-12-19 11:58 (UTC+8))
#30990 [MoE Refactor][3/N] Deprecate cutlass block quant fp8 (b200) | ready,ci/build,nvidia | by robertgshaw2-redhat (merged: 2025-12-20 05:09 (UTC+8))
#30924 [BugFix] Fix TypeError: unhashable type: ‘dict’ when serving deepseek32 | ready,v1,deepseek | by LucasWilkinson (merged: 2025-12-20 05:07 (UTC+8))
#30898 [Refactor] Refactor for DeepGemmQuantScaleFMT using cache | ready | by yewentao256 (merged: 2025-12-20 04:50 (UTC+8))
#27444 Make engine core client handshake timeout configurable | ready,v1 | by eicherseiji (merged: 2025-12-20 04:38 (UTC+8))
#30836 [Model] Add MiMo-V2-Flash support | documentation,new-model,ready | by Abatom (merged: 2025-12-20 01:17 (UTC+8))
#30982 Update Pytorch version update docs | documentation,ready | by atalman (merged: 2025-12-20 00:08 (UTC+8))
#30961 [Quantization] fix marlin w8a8 check | bug,quantization | by jinzhen-lin (merged: 2025-12-19 23:33 (UTC+8))
#31021 [CustomOp][Refactor] Extract common methods for ApplyRotaryEmb CustomOp | rocm,ready | by shen-shanshan (merged: 2025-12-19 22:16 (UTC+8))
#30887 [Bugfix] [Kernel] Triton attention kernels: mask out V blocks that fall outside sliding window | ready | by tdoublep (merged: 2025-12-19 21:39 (UTC+8))
#30613 [Bugfix] Add validation for tool requests when tool_parser is unavailable | documentation,frontend,ready | by majiayu000 (merged: 2025-12-19 18:23 (UTC+8))
#31015 [Doc][CPU] Fix index link for CPU regular release wheels | documentation,ready,cpu | by bigPYJ1151 (merged: 2025-12-19 15:29 (UTC+8))
#30871 [CPU][Bugfix] Fix ppc64le CPU build | ready,cpu | by npanpaliya (merged: 2025-12-19 20:26 (UTC+8))
#26494 Enable aarch64 CPU performance benchmarks | documentation,performance,ready,ci/build,cpu | by (merged: 2025-12-19 20:16 (UTC+8))
#28139 [Frontend][Bug] allow tool calls in analysis channel | frontend,ready | by dr75 (merged: 2025-12-19 18:47 (UTC+8))
#31008 [Quantization] enable compressed-tensors marlin support for turing (2) | ready | by jinzhen-lin (merged: 2025-12-19 16:56 (UTC+8))
#30974 [Bugfix] Fix incorrect tiles creation for mm prefix triton attention | ready | by Isotr0py (merged: 2025-12-19 16:00 (UTC+8))
#30968 Add hidden dimension validation for multimodal embedding inputs | ready,multi-modality | by wenqiglantz (merged: 2025-12-19 15:59 (UTC+8))
#31000 [Quantization] enable compressed-tensors marlin support for turing | no labels | by jinzhen-lin (merged: 2025-12-19 12:29 (UTC+8))

PRs Closed Without Merging

#19058 [BugFix]: Hermes tool parser stream output error #19056 | frontend,stale,tool-calling,qwen | by LiuLi1998 (closed: 2025-12-20 10:13 (UTC+8))
#22577 Fix Ray placement group allocation is not respecting env VLLM_RAY_PER_WORKER_GPUS (fractional gpu) | stale | by eric-higgins-ai (closed: 2025-12-20 10:11 (UTC+8))
#22853 [DRAFT] Fix tool call validation to prevent 500 errors from invalid types | frontend,needs-rebase,stale | by elizabetht (closed: 2025-12-20 10:11 (UTC+8))
#23248 perf(rope): add fast-path helper + unit test + micro-bench | stale | by PrithviElancherran (closed: 2025-12-20 10:11 (UTC+8))
#23290 [Bugfix] using undeclared variables as judgment criteria in deepseek_v2.py | stale,deepseek | by wonderseen (closed: 2025-12-20 10:11 (UTC+8))
#29444 [Hybrid] Mamba2 metadata caching | ready,needs-rebase,v1 | by s3woz (closed: 2025-12-20 09:03 (UTC+8))
#30718 [Perf] Cache for deepgemm moe, 2.1% Throuput improvement, 2.2% TTFT improvement | ready | by yewentao256 (closed: 2025-12-20 05:31 (UTC+8))
#31029 [BugFix] Fix architecture flags to prevent issues on SM103 | ci/build,nvidia | by LopezCastroRoberto (closed: 2025-12-20 05:09 (UTC+8))
#26020 Gcp tweak 5 | needs-rebase,v1,nvidia | by robertgshaw2-redhat (closed: 2025-12-20 05:09 (UTC+8))
#31037 [Bug] Fusion pass | ready | by vllmellm (closed: 2025-12-20 04:38 (UTC+8))
#31038 [Bug] Fix hash issue RuntimeError: Worker failed with error 'unhashable type: 'dict'' | ready,v1 | by yewentao256 (closed: 2025-12-20 01:56 (UTC+8))
#25592 [UX] Integrate DeepGEMM into vLLM wheel | needs-rebase,ci/build | by mgoin (closed: 2025-12-20 00:46 (UTC+8))
#30650 [Bugfix] CustomAR + TritonAttn[AMPERE] + FULL_CG - gpt-oss | gpt-oss,nvidia | by bbrowning (closed: 2025-12-19 22:53 (UTC+8))
#26611 Debug | documentation,performance,new-model,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1,multi-modality | by Jianhong-Zhang (closed: 2025-12-19 23:38 (UTC+8))
#21230 [Nixl] Debug logging | needs-rebase,v1,kv-connector | by robertgshaw2-redhat (closed: 2025-12-19 23:38 (UTC+8))
#20957 [DP/EP] PPLX<>Triton Debug | v1 | by robertgshaw2-redhat (closed: 2025-12-19 23:38 (UTC+8))
#20863 Fused moe tuning ep | performance | by robertgshaw2-redhat (closed: 2025-12-19 23:38 (UTC+8))
#20454 [Tools] Add Utility For P/D Disagg Justfile | performance,kv-connector | by robertgshaw2-redhat (closed: 2025-12-19 23:38 (UTC+8))
#15343 [P/D Disaggregation] PDController and PDWorker Prototype (1p1d) | documentation,frontend,stale,kv-connector | by robertgshaw2-redhat (closed: 2025-12-19 23:38 (UTC+8))
#30979 [MoE Refactor] Add mk for cutlass fp8 block | v1,nvidia | by robertgshaw2-redhat (closed: 2025-12-19 23:38 (UTC+8))
#30087 [Tools] Standard Scripts For Model Bash | no labels | by robertgshaw2-redhat (closed: 2025-12-19 23:38 (UTC+8))
#28416 [torch.compile] Include value of VLLM_ALLREDUCE_USE_SYMM_MEM in Cache Key | no labels | by robertgshaw2-redhat (closed: 2025-12-19 23:38 (UTC+8))
#23785 [DO NOT MERGE] updated | no labels | by robertgshaw2-redhat (closed: 2025-12-19 23:38 (UTC+8))
#22873 [CI][V0 Deprecation] Remove Metrics and Tracing Tests | ready,ci/build,stale | by robertgshaw2-redhat (closed: 2025-12-19 23:37 (UTC+8))
#25091 [Core] Add MFU tracking to GPU model execution | needs-rebase,v1 | by bwasti (closed: 2025-12-19 21:59 (UTC+8))
#22254 [TPU][Misc] Fix TPU.device_name | tpu,stale | by NickLucche (closed: 2025-12-19 21:11 (UTC+8))
#31022 [CI/Build][Bugfix] Fix regular release wheel upload path | ci/build | by bigPYJ1151 (closed: 2025-12-19 20:14 (UTC+8))
#31020 [Bugfix] Fix GPU exceeds warning message on ray backend | v1 | by jarieshan (closed: 2025-12-19 17:56 (UTC+8))
#31006 [Bugfix] Fix Ray GPU availability warning message | v1 | by jarieshan (closed: 2025-12-19 16:45 (UTC+8))
#22636 Add support for apple mps - Phase1 | ci/build | by avinashsingh77 (closed: 2025-12-19 12:05 (UTC+8))
#19096 Fix Incorrect data_parallel_rank and subsequent errors under torchrun | needs-rebase,stale,qwen | by izhuhaoran (closed: 2025-12-19 10:56 (UTC+8))