vLLM Development Activity Report - 2025-12-11

Time window: 2025-12-11 10:48 (UTC+8) ~ 2025-12-12 10:48 (UTC+8)
Statistics: New Issues 15 | Closed Issues 23 | New PRs 68 | Merged PRs 45 | PRs Closed Without Merging 27

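📊 Daily Development Summary

In the 24 hours from December 11 to 12, 2025, the vLLM project stayed highly active: 45 PRs were merged, and 38 Issues were opened or closed (15 new, 23 closed). Development focused on performance optimization, memory management, support for new hardware (such as Blackwell Ultra), and fixes for recently popular models (such as DeepSeek-V3.2 and GPT-OSS). On the AMD/ROCm side, work mainly took the form of better continuous integration (CI) test coverage and adaptation of specific kernels.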

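🎯 AMD/ROCm Ecosystem Updates

During this period few PRs added AMD features directly; most of the work went into CI test coverage and bug fixes:

CI and test adaptation:
- PR #30508 (merged) and PR #30417 (merged): submitted by AMD contributor rasmith. They skip, respectively, the cutlass_w4a8_moe tests that ROCm does not support and the fusion tests that depend on modules missing on ROCm (such as deep_ep), so that the ROCm test pipeline can pass.
- PR #30526 (merged): moves the V1 e2e tests from the mi325_1 agent pool to mi325_4 so that multi-GPU tests have enough hardware.
- PR #30527 (open): proposes a GPU availability check for the multi-GPU speculative decoding tests on ROCm, so that the tests are skipped rather than failing when too few GPUs are available.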
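The availability check described for PR #30527 follows a common test pattern: compare the number of GPUs a test needs with the number actually present, and skip instead of failing. The sketch below illustrates that pattern with the standard library only. It is not the code from the PR; `require_gpus`, `AVAILABLE_GPUS`, and the test class are hypothetical names.

```python
import unittest


def require_gpus(needed: int, available: int):
    """Return a decorator that skips a test unless `available >= needed`."""
    return unittest.skipIf(
        available < needed,
        f"needs {needed} GPUs, only {available} available",
    )


# Stand-in value for illustration. A real test would query the runtime
# instead, for example with torch.cuda.device_count().
AVAILABLE_GPUS = 1


class SpecDecodeMultiGpuTest(unittest.TestCase):
    @require_gpus(needed=4, available=AVAILABLE_GPUS)
    def test_four_gpus(self):
        # Never reached with AVAILABLE_GPUS = 1: the test is skipped.
        self.fail("the real multi-GPU test body would go here")

    @require_gpus(needed=1, available=AVAILABLE_GPUS)
    def test_single_gpu(self):
        self.assertTrue(True)
```

With one GPU reported, the four-GPU test is recorded as skipped and the run as a whole still succeeds, which is the behavior a resource-constrained CI pool needs.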
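Tracking Issues closed:
- Issue #14964 (closed): the umbrella Issue tracking AMD AITER kernel integration was closed by a maintainer as "too stale", with a suggestion that interested parties open a new tracker.
- Issue #22698 (closed): a performance problem where a "bubble" appeared at the end of CUDA graph runs on MI300/MI308. An AMD engineer (vllmellm) confirmed that it no longer reproduces with the latest rocm/vllm-dev:nightly image, so the Issue was closed.
Bug fixes and support confirmation:
- Issue #28184 (closed): Whisper encoder-decoder models did not work with vLLM 0.11 on ROCm. An AMD engineer confirmed that the relevant PR has been merged and that the problem is resolved in the latest nightly image.

Analysis: AMD ecosystem activity in this period was about stability: keeping existing functionality passing in CI and fixing known problems rather than introducing new features. Closing long-standing tracking Issues also suggests that the project is tidying up its issue management.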

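💬 High-Activity Discussions

Issue #30453: Frequent repetitive decoding in DeepSeek-V3.2
- Core topic: a user reports severe repetitive decoding when serving DeepSeek-V3.2 on H200.
- Views and progress:
  - The reporter, makabaka6338, provided very detailed environment information and the request script.
  - Other users, mondaylord and niceallen, report similar problems (the first token of a streaming response being truncated, or a "+1").
  - The discussion quickly turned to checking environment versions. User h1248759474 repeatedly asked for specific dependency versions, and the reporter eventually posted the full pip list output.
- Current status: still Open. The discussion has not yet reached root-cause analysis; it is at the stage of collecting information and confirming that the symptom is shared.
Issue #30498: GPT-OSS-120B returns empty content when hitting max_tokens
- Core topic: a user found that after warm-up (that is, once the prefix cache is hit), GPT-OSS-120B returns null in the content field even when max_tokens is set large enough. Running with the --enforce-eager flag avoids the problem.
- Views:
  - User (ashtarkb): provided complete reproduction steps and considers this a bug.
  - Initial explanation (alew3): if reasoning consumes the entire token budget, an empty content field is expected behavior; suggested improving the checking code.
  - Deeper analysis (bbrowning): this is very likely the same bug as Issue #26480, in which CUDA graphs can cause token id 0 to be generated endlessly after a prefix cache hit. --enforce-eager helps because it disables CUDA graphs. The reporter agreed with this assessment.
- Current status: still Open. The discussion points to a known core bug involving caching and CUDA graphs, which needs a deeper fix.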
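The two explanations in this thread lead to different client-side expectations, and a small check can keep them apart. The sketch below assumes an OpenAI-style chat completion `choice` dictionary with `finish_reason`, `message.content`, and `message.reasoning_content` fields; the function name and the three labels are hypothetical and are not part of vLLM.

```python
def classify_choice(choice: dict) -> str:
    """Label one chat-completion choice by why its content may be empty."""
    message = choice.get("message") or {}
    content = message.get("content")
    reasoning = message.get("reasoning_content")
    finish_reason = choice.get("finish_reason")

    if content:
        return "ok"
    # Reasoning text is present and generation stopped at the token limit:
    # consistent with the "budget used up by reasoning" explanation.
    if finish_reason == "length" and reasoning:
        return "budget_exhausted_in_reasoning"
    # No content and no reasoning either: worth investigating as a bug.
    return "suspicious_empty"


truncated = {
    "finish_reason": "length",
    "message": {"content": None, "reasoning_content": "Let me think..."},
}
empty = {
    "finish_reason": "length",
    "message": {"content": None, "reasoning_content": None},
}
print(classify_choice(truncated))
print(classify_choice(empty))
```

A check like this does not diagnose the CUDA graph bug itself; it only separates responses that are plausibly expected from those that deserve a closer look.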
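Issue #30521: Flashinfer backend memory profiling regression (DP > 1)
- Core topic: with data parallelism (DP > 1), vLLM 0.12.0 using the Flashinfer backend has a regression in the memory profiling stage that can lead to OOM.
- Views:
  - The reporter, nrghosh, did an in-depth analysis and identified the key point: memory profiling runs before CUDA graph capture and sampler warm-up, so the memory they use (in particular Flashinfer's large workspace buffers) is not counted, and the KV cache is allocated too large.
  - He proposed several directions: 1) reserve memory for sampler warm-up before sizing the KV cache; 2) use a smaller warm-up batch; 3) investigate why Flashinfer's CUDA graph memory usage is twice that of FlexAttention.
- Current status: still Open. This is a deep technical discussion opened by an experienced contributor with his own analysis, and it goes straight to the core logic of the memory management flow.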
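The ordering problem described in Issue #30521 comes down to simple arithmetic. The sketch below is not vLLM's profiling code: `kv_cache_budget` and every byte figure are hypothetical, chosen only to show how memory allocated after profiling (CUDA graphs, sampler warm-up, backend workspace buffers) eats into a KV cache that was sized without it.

```python
GIB = 1024 ** 3


def kv_cache_budget(total_bytes: int, utilization: float,
                    profiled_peak_bytes: int,
                    post_profile_bytes: int = 0) -> int:
    """Bytes left for the KV cache under a simple additive memory model."""
    usable = int(total_bytes * utilization)
    return max(usable - profiled_peak_bytes - post_profile_bytes, 0)


# Hypothetical figures for a single GPU.
total = 80 * GIB
peak = 60 * GIB   # weights and activations seen by the profiling run
late = 4 * GIB    # allocated only after profiling has finished

naive = kv_cache_budget(total, 0.9, peak)        # ignores `late`
safe = kv_cache_budget(total, 0.9, peak, late)   # reserves `late` up front

overcommit = naive - safe
print(f"KV cache oversized by {overcommit / GIB:.1f} GiB")
```

In this model the naive budget overshoots by exactly the amount allocated late, which is why reserving that memory before sizing the KV cache (direction 1 above) or shrinking it (direction 2) both reduce the OOM risk.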
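Issue #30139 (closed): Reasoning models do not reason and put the whole completion in reasoning_content
- Core topic: several users report that reasoning models such as Mistral's Ministral do not get their reasoning and answer content separated correctly in vLLM.
- Views and conclusion:
  - User confusion: several users confirmed the problem, which affects models such as Ministral and MiniMax-M2.
  - Root cause: Mistral's reasoning parser (MistralReasoningParser) was originally modeled on the DeepSeek-R1 parser and inherited some of its heuristics (for example, treating the whole output as reasoning when no start tag is present). That does not fit Mistral models, which can switch between "chat" and "reasoning" modes.
  - Fix: Mistral employee juliendenize submitted PR #30391, which changes the parser logic to match the behavior of Mistral models more closely. The PR has been merged and resolves the problem.
- Conclusion: the problem was fixed by a direct contribution from the upstream model provider, which shows how effective collaboration between the community and model developers can be.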
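The heuristic at the heart of this Issue is easy to show in isolation. The sketch below is not the MistralReasoningParser implementation or the logic from PR #30391; `split_reasoning`, its `implicit_start` switch, and the tag strings are placeholders used only to contrast the two behaviors.

```python
def split_reasoning(text: str, *, start_tag: str = "[THINK]",
                    end_tag: str = "[/THINK]",
                    implicit_start: bool = False):
    """Split model output into (reasoning, content).

    implicit_start=True mimics the older heuristic: output without a start
    tag is still treated as reasoning. With implicit_start=False, output
    without a start tag is a plain chat answer.
    """
    if start_tag in text:
        before, _, rest = text.partition(start_tag)
    elif implicit_start:
        before, rest = "", text
    else:
        return None, (text or None)
    reasoning, _, after = rest.partition(end_tag)
    content = (before + after).strip()
    return (reasoning.strip() or None), (content or None)


# A model running in chat mode emits no tags at all.
print(split_reasoning("Paris.", implicit_start=True))
print(split_reasoning("Paris."))
print(split_reasoning("[THINK]capital of France[/THINK]Paris."))
```

With the implicit-start heuristic, the untagged answer lands entirely in the reasoning slot and the content is empty, which matches the symptom users reported. Without it, the same text is returned as content.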

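🔥 Hot Topics and Trends

New model support and problems: DeepSeek-V3.2 and GPT-OSS-120B were the clear focus, with active discussion in the related Issues. This reflects how quickly the community adopts frontier models, and also the difficulty vLLM faces in adapting to these architecturally complex models (repetitive decoding, cache bugs).
Deployment and hardware adaptation:
- New hardware: PR #30484 begins adding initial support for NVIDIA Blackwell Ultra (SM103).
- CPU backend: build problems on Arm CPUs (Issue #30470) and performance tuning (Issue #30487, L2 cache detection) drew attention, indicating the importance of mobile and edge computing scenarios.
- Installation: how to use pre-compiled wheels offline or efficiently (Issue #30464) is a common pain point in real deployments.
Performance and memory optimization: this remains a core topic for vLLM. This period's discussions on CUDA graph memory estimation (PR #30515), encoder cache memory usage (PR #30452, #30475), and the Triton kernel split_k setting (PR #30528) all aim to squeeze out more performance.
Reasoning and tool calling: the discussions on how to disable "thinking" for Qwen models (Issue #30477) and on parsing reasoning model output (Issue #30139, #28852) show that reasoning and tool calling have become advanced features of LLM applications, with strong demand for reliable control over them.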

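🛠️ Key Technical Changes

PR #30515 (open): Account for CUDA graphs during memory profiling. An important improvement: a small dummy CUDA graph capture is performed during memory profiling to estimate the real memory footprint of CUDA graphs, so that the space available for the KV cache can be computed more accurately. This should reduce post-startup OOMs caused by underestimating CUDA graph memory, which matters for running large models such as DeepSeek-R1 stably in tight configurations.
PR #30452 / #30475 (open): Reduce encoder cache memory usage. These two related PRs rework the encoder cache for multimodal models. By storing only the actual embedding vectors rather than the full placeholder tensor, they report memory savings of up to 12.5x. This lets large vision models such as Qwen3-VL support longer contexts or more video frames within limited GPU memory, a significant gain for multimodal inference.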
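The idea of storing only the real embeddings can be sketched with a boolean mask over placeholder positions. This is an illustration of the general technique under simplifying assumptions (plain Python lists instead of tensors), not the implementation from either PR; `compact`, `expand`, and the sample data are hypothetical.

```python
def compact(full_rows, is_embed):
    """Keep only the rows that hold real encoder embeddings."""
    return [row for row, keep in zip(full_rows, is_embed) if keep]


def expand(compact_rows, is_embed, fill=None):
    """Rebuild the full placeholder layout from the compact rows."""
    rows = iter(compact_rows)
    return [next(rows) if keep else fill for keep in is_embed]


# Hypothetical multimodal item: 6 placeholder positions, of which only 2
# carry encoder output; the rest are structural tokens with no embedding.
is_embed = [False, True, False, False, True, False]
full = [None, [0.1, 0.2], None, None, [0.3, 0.4], None]

stored = compact(full, is_embed)
restored = expand(stored, is_embed)

ratio = len(is_embed) / len(stored)
print(f"rows stored: {len(stored)} of {len(is_embed)} ({ratio:.1f}x fewer)")
```

The saving depends entirely on how sparse the mask is for a given model and input, so the ratio printed here says nothing about the figures reported in the PRs.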
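PR #30484 (open): Add SM103 (Blackwell Ultra) support. Adds initial support for the upcoming NVIDIA B300/GB300 series GPUs. It still requires a Triton patch, but it is a necessary step for keeping vLLM current on the newest hardware.
PR #30528 (open): Set split_k=1 for Triton kernels. A performance tuning change for the Hopper architecture (SM90). Forcing split_k=1 keeps the Triton kernels from falling back to float32 precision in small-batch inference, which improves inference speed for small batches. It is an example of fine-grained tuning for the varied workloads seen in production.
PR #30389 (merged): Standardize how the RoPE partial_rotary_factor is obtained. An important low-level refactor that unifies how the rotary position embedding dimension is read from the config: everything now goes through rope_parameters["partial_rotary_factor"]. This fixes load failures caused by inconsistent configs in community-quantized models (such as MiniMax-M2) and improves model compatibility and robustness.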
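As a rough illustration of why a single lookup path helps, the rotary dimension can be derived from the head dimension and the factor in one place. The helper below is a hypothetical sketch, not vLLM's get_rope; the default of 1.0 (rotate the full head dimension) when the key is absent is an assumption made for the example.

```python
from typing import Optional


def rotary_dim(head_dim: int, rope_parameters: Optional[dict]) -> int:
    """Number of head dimensions that receive rotary position embedding."""
    # Assumed default: a missing factor means the whole head is rotated.
    factor = (rope_parameters or {}).get("partial_rotary_factor", 1.0)
    return int(head_dim * factor)


print(rotary_dim(128, {"partial_rotary_factor": 0.5}))
print(rotary_dim(128, {}))
```

Reading the factor from one key means a checkpoint whose config omits or mislabels a separate rotary dimension field still resolves to a consistent value.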

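📈 Development Activity Observations

Fast merging: 45 PRs were merged in 24 hours, which indicates an efficient review and merge process. Many PRs were merged within a day of being opened (e.g. #30481, #30526).
Diverse contributors: besides core maintainers (such as DarkLight1337, yewentao256, hmellor), many contributions came from AMD (rasmith, AndreasKaratzas), NVIDIA (LopezCastroRoberto), and enterprise users (such as nrghosh). User fadara01 reported the Arm CPU build Issue and then quickly submitted the fix PR himself, a good example of active community participation.
Problem-solving pattern: typically a user files a detailed bug report, and a maintainer or community member soon provides a root-cause analysis, links it to a known Issue, or submits a fix directly. This reflects the community's strong collective expertise and accumulated knowledge.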

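💡 Issues Worth Watching

DeepSeek-V3.2 repetitive decoding (Issue #30453): deployment stability matters a great deal for a newly released, popular model. Several users are affected, and the Issue deserves a priority investigation into whether it stems from a specific interaction with vLLM's scheduling or caching logic.
GPT-OSS-120B cache and CUDA graph bug (Issue #30498): linked to a known deeper bug (#26480). It can affect any request that hits the prefix cache and severely degrades output quality, so it is a core stability problem that needs a thorough fix.
NIXL disaggregated serving handshake error under pipeline parallelism (Issue #30501): a user failed to run disaggregated serving of DeepSeek-R1 in a complex configuration (TP=8, PP=2). This exposes compatibility problems of advanced distributed features in edge cases and is a useful reference for advancing large-scale model serving.
Flashinfer backend memory profiling (Issue #30521): the discussion exposes a potential flaw in the current memory profiling flow, which does not fully account for differences between backends (Flashinfer vs FlexAttention) in CUDA graph and workspace memory. A more general and more accurate memory prediction framework may be needed.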

📋 Appendix: Detailed Data Lists

New Issues

#30453 [Bug]: Frequent Repetitive Decoding Problems in DeepSeek-V3.2 — bug — by makabaka6338 (created: 2025-12-11 11:19 (UTC+8))
#30470 [Bug] [CPU Backend]: vLLM build on Arm CPU fails with pytorch nightly — bug,cpu — by fadara01 (created: 2025-12-11 15:51 (UTC+8))
#30530 [Installation]: 在h200部署deepseekv3.2版本库的问题 (Library version problems when deploying deepseekv3.2 on h200) — installation — by h1248759474 (created: 2025-12-12 10:07 (UTC+8))
#30477 [Usage]: How to disable thinking for Qwen-8B — usage — by fancyerii (created: 2025-12-11 17:28 (UTC+8))
#30501 [Bug]/[Feature]:NIXL Disaggregated Serving Fails with Pipeline Parallelism (PP > 1) — bug — by nskpro-cmd (created: 2025-12-12 01:16 (UTC+8))
#30464 [Usage]: How can I use the local pre-compiled wheel of vllm — usage — by gcanlin (created: 2025-12-11 14:22 (UTC+8))
#30498 [Bug]: GPT-OSS-120B returns null content when hitting max_tokens without --enforce-eager — bug — by ashtarkb (created: 2025-12-12 00:40 (UTC+8))
#30521 [Bug]: vLLM 0.12.0 / Flashinfer Backend memory profiling regression (DP > 1) — bug — by nrghosh (created: 2025-12-12 06:05 (UTC+8))
#30511 Potential Deadlock? — no labels — by ChuanLi1101 (created: 2025-12-12 03:57 (UTC+8))
#30493 [Bug]: 5090 RTX seems to be broken — bug — by mobicham (created: 2025-12-11 23:56 (UTC+8))
#30461 [Bug]: Lora support don’t work on 0.12.0 with Qwen3-30B-A30B-Instuct-2507 — bug — by bash99 (created: 2025-12-11 13:49 (UTC+8))
#30487 [Bug] [CPU Backend]: Wrong L2 cache size on Arm CPU for Attention tiles — bug,cpu — by Radu2k (created: 2025-12-11 20:58 (UTC+8))
#30485 [Feature]: Support for GGUF qwen3vl models — feature request — by arno4000 (created: 2025-12-11 20:17 (UTC+8))
#30466 [Feature]: Support transformers>=5 — feature request — by Whadup (created: 2025-12-11 15:05 (UTC+8))
#30465 [Bug]: reranker benchmarking calculate input_token=0 — bug — by hz0ne (created: 2025-12-11 14:38 (UTC+8))

Closed Issues

#18467 [Usage]: Can vllm distributed load model? — usage,stale — by ASTLY123 (closed: 2025-12-12 10:16 (UTC+8))
#21689 [Bug]: VLLM 0.10.0 breaks quantized models batch inference speed for Qwen2.5-VL-7B (tested multiple quantization types) — bug,stale — by Magmanat (closed: 2025-12-12 10:15 (UTC+8))
#21799 [Bug]: Performance Regression in Eagle3: vLLM 0.9.0 vs. vLLM 0.9. — bug,performance,stale — by ggg-s (closed: 2025-12-12 10:15 (UTC+8))
#22004 [Bug]: AWQ fails on MoE models — bug,stale — by someone132s (closed: 2025-12-12 10:15 (UTC+8))
#22334 [Feature]: add --reasoning_parser flag for gpt-oss — feature request,stale — by lorentzbao (closed: 2025-12-12 10:15 (UTC+8))
#22568 [Usage]: VRAM spike while loading gemma3-12b bnb on vllm-0.10 — usage,stale — by ivoras (closed: 2025-12-12 10:15 (UTC+8))
#22718 [Bug]: kimi_k2_tool_parser.py has bug with tool_call_regex — bug,stale — by Jarlene (closed: 2025-12-12 10:15 (UTC+8))
#22719 [Bug]: Qwen2.5vl-3b OOM but 7b works fine — bug,stale — by Elenore1997 (closed: 2025-12-12 10:15 (UTC+8))
#22766 [Usage]: Prevent prefill from being split across different batches — usage,stale — by randombk (closed: 2025-12-12 10:15 (UTC+8))
#22784 [Bug]: AttributeError when loading Step3 model — bug,stale — by Joy666-Li (closed: 2025-12-12 10:15 (UTC+8))
#22816 [Bug]: vllm serve does not bind to all available ipv4 and ipv6 addresses when --host is empty — bug,stale — by smarterclayton (closed: 2025-12-12 10:14 (UTC+8))
#30470 [Bug] [CPU Backend]: vLLM build on Arm CPU fails with pytorch nightly — bug,cpu — by fadara01 (closed: 2025-12-12 10:09 (UTC+8))
#14964 [Feature] [ROCm]: AITER Kernel Integration — feature request,rocm,stale — by tjtanaa (closed: 2025-12-12 09:37 (UTC+8))
#22698 [Performance]: Big bubble at the end of cudagraph run in MI300/MI308 — performance,rocm,unstale — by hoangvictor (closed: 2025-12-12 09:36 (UTC+8))
#30477 [Usage]: How to disable thinking for Qwen-8B — usage — by fancyerii (closed: 2025-12-12 09:17 (UTC+8))
#30445 [Bug]: QuantTrio/MiniMax-M2-AWQ produces garbage in 12/10/2025 build — bug — by eugr (closed: 2025-12-12 04:45 (UTC+8))
#30139 [Bug]: Reasoning models does not reason and put completion in reasoning_content — bug — by Rictus (closed: 2025-12-12 00:53 (UTC+8))
#29259 [Bug]: simple_profiling.py fails on CPU target — bug — by NobuoTsukamoto (closed: 2025-12-12 00:49 (UTC+8))
#30269 [Bug]: Multi-node deployment fails with TP=1 and PP=2 — bug — by doss22 (closed: 2025-12-11 18:41 (UTC+8))
#29861 [Feature]: [CPU Backend] Enable support for Whisper — feature request,cpu — by aditew01 (closed: 2025-12-11 18:09 (UTC+8))
#28852 [Bug]: When infer MiniMax-m2, during streaming returns, the think are contained in the ‘content’ field and cannot be separated. — bug — by gallery2016 (closed: 2025-12-11 17:05 (UTC+8))
#28184 [Bug]: encoder_decoder models (e.g. Whisper) is not working in vLLM 0.11 with ROCm — bug,rocm — by lucaschr21 (closed: 2025-12-11 14:01 (UTC+8))
#29587 [Installation]: PaddleOCR-VL integration with vLLM — installation — by ssuncheol (closed: 2025-12-11 10:55 (UTC+8))

New PRs

#30532 [responsesAPI]add extra body parameters — frontend — by Ri0S (created: 2025-12-12 10:27 (UTC+8))
#30484 [Feature] Add SM103 (Blackwell Ultra) Support to vLLM — ready,needs-rebase,v1,nvidia — by LopezCastroRoberto (created: 2025-12-11 20:09 (UTC+8))
#30531 [CPU] Refactor CPU fused MOE — ci/build — by bigPYJ1151 (created: 2025-12-12 10:08 (UTC+8))
#30471 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (created: 2025-12-11 15:55 (UTC+8))
#30524 [Optimization] Pad the number of tokens to a multiple of 4 to improve FP8 performance — v1 — by 0xjunhao (created: 2025-12-12 07:20 (UTC+8))
#30481 [CPU][FIX] Fix build failures on Arm CPUs with torch nightly — ready,ci/build — by fadara01 (created: 2025-12-11 17:57 (UTC+8))
#30529 [Benchmarks] auto_tune.sh: Use hostname variable for server requests — performance — by KevinMusgrave (created: 2025-12-12 10:01 (UTC+8))
#30515 [UX][Startup] Account for CUDA graphs during memory profiling — v1,nvidia — by MatthewBonanni (created: 2025-12-12 05:12 (UTC+8))
#30490 [DeepSeek V3.2] Proper drop_thinking logic — ready,deepseek — by vladnosiv (created: 2025-12-11 23:02 (UTC+8))
#30527 [ROCm][CI] Skip multi-GPU speculative decoding tests when insufficient GPUs available — rocm,ready,v1 — by AndreasKaratzas (created: 2025-12-12 08:59 (UTC+8))
#30528 [Perf] Set split_k to 1 for triton_kernels — no labels — by xyang16 (created: 2025-12-12 09:15 (UTC+8))
#30526 [ROCm][CI] Use mi325_4 agent pool for V1 e2e tests — rocm,ready,ci/build — by AndreasKaratzas (created: 2025-12-12 08:16 (UTC+8))
#30520 [BugFix] Use late binding to avoid zmq port conflict race conditions — v1 — by njhill (created: 2025-12-12 05:57 (UTC+8))
#30452 [Core] Optimize encoder cache with mask-based storage — v1,multi-modality — by sunYtokki (created: 2025-12-11 11:09 (UTC+8))
#30516 [compile] Parse compile range cache keys as Range during cache loading. — ready — by zhxchen17 (created: 2025-12-12 05:30 (UTC+8))
#30508 [CI/Build][AMD] Skip test_cutlass_w4a8_moe tests on ROCm sine they require cutlass_pack_scale_fp8 — rocm,ready,nvidia — by rasmith (created: 2025-12-12 02:45 (UTC+8))
#30525 [Release 2.10] Test Torch 2.10 RC — rocm,ci/build,nvidia — by atalman (created: 2025-12-12 07:52 (UTC+8))
#30513 [WIP][CI] Speed up fusion tests — ready,needs-rebase,ci/build — by LucasWilkinson (created: 2025-12-12 04:19 (UTC+8))
#30499 [CI] Breakup h200 tests — ready,needs-rebase,ci/build — by LucasWilkinson (created: 2025-12-12 00:45 (UTC+8))
#30519 [Kernels][MoE] Add FusedMoERouter object — no labels — by bnellnm (created: 2025-12-12 05:54 (UTC+8))
#30512 Improve parse_raw_prompt test cases for invalid input .v2 — no labels — by mivehk (created: 2025-12-12 04:05 (UTC+8))
#30523 Fix Kimi K2 thinking model nvfp4 vocab size — v1 — by kjiang249 (created: 2025-12-12 06:43 (UTC+8))
#30510 [Refactor] Remove useless syncwarp — ready — by yewentao256 (created: 2025-12-12 03:35 (UTC+8))
#30494 [Perf] Optimize deepgemm experts initialization, 3.9% TTFT improvement — performance,ready,deepseek — by yewentao256 (created: 2025-12-12 00:15 (UTC+8))
#30496 [Refactor] Reduce duplicate code in per_token_group_quant cuda kernels — ready,nvidia — by yewentao256 (created: 2025-12-12 00:19 (UTC+8))
#30522 [KV Connector][Metrics] Do not count prefix cache hits in connector queries — ready,v1 — by markmc (created: 2025-12-12 06:11 (UTC+8))
#30509 [Doc] Add documents for multi-node distributed serving with MP backend — documentation,v1 — by Isotr0py (created: 2025-12-12 03:18 (UTC+8))
#30491 [Docs][CPU backend] Add pre-built Arm CPU Docker images — documentation,ready — by ioghiban (created: 2025-12-11 23:40 (UTC+8))
#30514 [CI] Update several models in registry that are available online now — ready,ci/build — by mgoin (created: 2025-12-12 04:53 (UTC+8))
#30518 Don’t compile vision encoder for Transformers backend — no labels — by hmellor (created: 2025-12-12 05:52 (UTC+8))
#30517 [CI] Fix mypy for vllm/v1/executor — v1 — by yewentao256 (created: 2025-12-12 05:33 (UTC+8))
#30505 [Bugfix][Model] Fix Afmoe rope_parameters issue — bug,ready — by mgoin (created: 2025-12-12 02:36 (UTC+8))
#30472 [BugFix][MM]support VLLM_RANDOMIZE_DP_DUMMY_INPUTS — ready,v1 — by charlotte12l (created: 2025-12-11 16:02 (UTC+8))
#30506 simplify the return value from generate_beam_search — no labels — by nwaughachukwuma (created: 2025-12-12 02:37 (UTC+8))
#30503 [compile] Stop one-off setting enable_aot_compile and use context manager instead. — ready — by zhxchen17 (created: 2025-12-12 01:48 (UTC+8))
#30474 [Misc] Add mcp to requirements — ready,ci/build — by yeqcharlotte (created: 2025-12-11 17:03 (UTC+8))
#30507 [Bugfix] Dictionary MM embeddings for online chat — frontend,v1 — by DarkLight1337 (created: 2025-12-12 02:41 (UTC+8))
#30495 [Async][Feat] support apply penalty or bad_words for async + spec — v1 — by izhuhaoran (created: 2025-12-12 00:18 (UTC+8))
#30500 feat(gguf): Extract HF config from GGUF metadata for repos without config.json — no labels — by kitaekatt (created: 2025-12-12 00:51 (UTC+8))
#30504 [CI] Whisper logprobs tests — ready,multi-modality — by NickLucche (created: 2025-12-12 02:19 (UTC+8))
#30462 enable unbacked with aot_compile — ready — by laithsakka (created: 2025-12-11 13:50 (UTC+8))
#30502 [Kernel] add H100 triton fused moe config for FP8 Qwen3MoE — qwen — by cjackal (created: 2025-12-12 01:18 (UTC+8))
#30497 fix(gguf): GGUF model support fixes for Blackwell GPUs — structured-output,v1 — by kitaekatt (created: 2025-12-12 00:37 (UTC+8))
#30488 Give pooling examples better names — documentation,ready — by hmellor (created: 2025-12-11 21:21 (UTC+8))
#30492 [WIP] add manual numa binding — v1 — by jasonlizhengjian (created: 2025-12-11 23:44 (UTC+8))
#30459 set assume_32bit_indexing and pass unbacked hints — ready — by laithsakka (created: 2025-12-11 12:58 (UTC+8))
#30480 Make the httpx logger less annoying when Transformers v5 is installed — ready — by hmellor (created: 2025-12-11 17:51 (UTC+8))
#30483 [Misc] Improve error message for is_multimodal — ready,qwen — by DarkLight1337 (created: 2025-12-11 19:01 (UTC+8))
#30489 Add encoder tag for compilation — qwen — by ilmarkov (created: 2025-12-11 21:23 (UTC+8))
#30473 Fix typo of endpoint name in CLI args docs — frontend,ready — by kmaehashi (created: 2025-12-11 16:33 (UTC+8))
#30486 [BugFix] Fix minimax m2 model partial_rotary_factor — no labels — by rogeryoungh (created: 2025-12-11 20:25 (UTC+8))
#30482 [Frontend] Honor chat template for gpt-oss harmony (#23015) — frontend,gpt-oss — by ajayanto (created: 2025-12-11 18:23 (UTC+8))
#30458 [Deprecation] Remove fallbacks for embed_input_ids and embed_multimodal — speculative-decoding,ready,qwen — by DarkLight1337 (created: 2025-12-11 12:33 (UTC+8))
#30469 [Deprecation] Remove missed fallback for embed_input_ids — ready — by DarkLight1337 (created: 2025-12-11 15:34 (UTC+8))
#30476 [Bugfix] Fix task still being passed in tests/benchmarks — performance,ready,v1 — by DarkLight1337 (created: 2025-12-11 17:16 (UTC+8))
#30478 fix gme model do not use mrope — qwen — by zhuikefeng986285005-byte (created: 2025-12-11 17:40 (UTC+8))
#30463 [Deprecation] Deprecation --convert reward, use --convert embed instead. — documentation,ready — by noooop (created: 2025-12-11 14:01 (UTC+8))
#30475 [Core][MM] Optimize encoder cache manager by operating with embeddings only — v1,multi-modality,llama,qwen — by ywang96 (created: 2025-12-11 17:06 (UTC+8))
#30479 [BugFix] Fix unmap_and_release by tag not done correctly — no labels — by Crispig (created: 2025-12-11 17:44 (UTC+8))
#30468 [Feat] EPD Mooncake store — documentation,v1,kv-connector — by khuonglmhw (created: 2025-12-11 15:33 (UTC+8))
#30467 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (created: 2025-12-11 15:30 (UTC+8))
#30457 [Feat] Mooncake storage connector for E-PD disaggregate — documentation,v1,kv-connector — by khuonglmhw (created: 2025-12-11 12:28 (UTC+8))
#30456 feat: support video list inference — frontend,multi-modality — by LiuLi1998 (created: 2025-12-11 11:47 (UTC+8))
#30454 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (created: 2025-12-11 11:31 (UTC+8))
#30460 [chore] Update FA commit — ready,ci/build — by LucasWilkinson (created: 2025-12-11 13:29 (UTC+8))
#30455 [Doc] Add Baidu Kunlun XPU support — documentation,ready — by xyDong0223 (created: 2025-12-11 11:38 (UTC+8))
#30451 [Core] Optimize encoder cache with mask-based storage — v1,multi-modality — by sunYtokki (created: 2025-12-11 11:03 (UTC+8))
#30450 [Core] Optimize encoder cache with mask-based storage — v1,multi-modality — by sunYtokki (created: 2025-12-11 10:50 (UTC+8))

Merged PRs

#29421 [Core] Whisper Enable Encoder Batching — ready,v1 — by NickLucche (merged: 2025-12-12 05:06 (UTC+8))
#30254 gptq marlin quantization support for fused moe with lora — ready — by Bhanu068 (merged: 2025-12-12 10:27 (UTC+8))
#30481 [CPU][FIX] Fix build failures on Arm CPUs with torch nightly — ready,ci/build — by fadara01 (merged: 2025-12-12 10:09 (UTC+8))
#29628 [Core] Refactor _build_attention_metadata — ready,v1,ready-run-all-tests — by LucasWilkinson (merged: 2025-12-12 09:54 (UTC+8))
#30526 [ROCm][CI] Use mi325_4 agent pool for V1 e2e tests — rocm,ready,ci/build — by AndreasKaratzas (merged: 2025-12-12 09:37 (UTC+8))
#30508 [CI/Build][AMD] Skip test_cutlass_w4a8_moe tests on ROCm sine they require cutlass_pack_scale_fp8 — rocm,ready,nvidia — by rasmith (merged: 2025-12-12 09:02 (UTC+8))
#30314 [fix] fix SM check for Flashinfer TRTLLM MOE — ready,nvidia — by jiahanc (merged: 2025-12-12 09:00 (UTC+8))
#30417 [CI/Build][AMD] Skip tests in test_fusions_e2e and test_dbo_dp_ep_gsm8k that require non-existing imports for ROCm — rocm,ready,v1 — by rasmith (merged: 2025-12-12 08:24 (UTC+8))
#30002 [FIX]Patch run-cluster.sh (fix for #28328) — documentation,ready — by evberrypi (merged: 2025-12-12 07:36 (UTC+8))
#30276 [ROCM][CI] Fix AMD Examples Test Group — documentation,rocm,ready,ci/build — by Concurrensee (merged: 2025-12-12 07:03 (UTC+8))
#29804 [EPLB] Support EPLB w/ NVFP4 — ready,nvidia — by andrewbriand (merged: 2025-12-12 06:59 (UTC+8))
#30510 [Refactor] Remove useless syncwarp — ready — by yewentao256 (merged: 2025-12-12 06:43 (UTC+8))
#30494 [Perf] Optimize deepgemm experts initialization, 3.9% TTFT improvement — performance,ready,deepseek — by yewentao256 (merged: 2025-12-12 06:28 (UTC+8))
#30491 [Docs][CPU backend] Add pre-built Arm CPU Docker images — documentation,ready — by ioghiban (merged: 2025-12-12 06:03 (UTC+8))
#30472 [BugFix][MM]support VLLM_RANDOMIZE_DP_DUMMY_INPUTS — ready,v1 — by charlotte12l (merged: 2025-12-12 05:00 (UTC+8))
#30389 Standardise get_rope to use rope_parameters["partial_rotary_factor"], not rotary_dim — performance,ready,llama,qwen,deepseek,gpt-oss — by hmellor (merged: 2025-12-12 04:45 (UTC+8))
#30503 [compile] Stop one-off setting enable_aot_compile and use context manager instead. — ready — by zhxchen17 (merged: 2025-12-12 04:28 (UTC+8))
#30474 [Misc] Add mcp to requirements — ready,ci/build — by yeqcharlotte (merged: 2025-12-12 04:06 (UTC+8))
#30430 [ROCm][Bugfix] Add MLACommonMetadata to allowed attention types for speculative decoding — rocm,speculative-decoding,ready,v1 — by AndreasKaratzas (merged: 2025-12-12 03:25 (UTC+8))
#30442 [Feature] AWQ marlin quantization support for fused moe with lora — ready — by princepride (merged: 2025-12-12 02:03 (UTC+8))
#30340 Add Eagle and Eagle3 support to Transformers modeling backend — ready,v1 — by hmellor (merged: 2025-12-12 01:02 (UTC+8))
#30391 [IMPROVEMENT] Change MistralReasoningParser behavior — ready — by juliendenize (merged: 2025-12-12 00:53 (UTC+8))
#30341 [CI] refine more logic when generating and using nightly wheels & indices, add cuda130 build for aarch64, specify correct manylinux version — ready,ci/build,nvidia — by Harry-Chen (merged: 2025-12-12 00:42 (UTC+8))
#30488 Give pooling examples better names — documentation,ready — by hmellor (merged: 2025-12-12 00:22 (UTC+8))
#30402 [Docs][CPU Backend] Add nightly and per revision pre-built Arm CPU wheels — documentation,ready — by ioghiban (merged: 2025-12-11 23:57 (UTC+8))
#30480 Make the httpx logger less annoying when Transformers v5 is installed — ready — by hmellor (merged: 2025-12-11 23:44 (UTC+8))
#28309 [KVConnector] Add KV events to KV Connectors — ready,ci/build,v1,kv-connector — by hickeyma (merged: 2025-12-11 22:30 (UTC+8))
#30337 fix: enhance human_readable_int function — ready — by andyxning (merged: 2025-12-11 15:30 (UTC+8))
#30483 [Misc] Improve error message for is_multimodal — ready,qwen — by DarkLight1337 (merged: 2025-12-11 22:39 (UTC+8))
#30473 Fix typo of endpoint name in CLI args docs — frontend,ready — by kmaehashi (merged: 2025-12-11 19:07 (UTC+8))
#30050 [Misc][PCP&DCP] relocate PCP feature check — ready,v1 — by pisceskkk (merged: 2025-12-11 19:36 (UTC+8))
#30458 [Deprecation] Remove fallbacks for embed_input_ids and embed_multimodal — speculative-decoding,ready,qwen — by DarkLight1337 (merged: 2025-12-11 14:58 (UTC+8))
#30469 [Deprecation] Remove missed fallback for embed_input_ids — ready — by DarkLight1337 (merged: 2025-12-11 18:06 (UTC+8))
#30476 [Bugfix] Fix task still being passed in tests/benchmarks — performance,ready,v1 — by DarkLight1337 (merged: 2025-12-11 18:33 (UTC+8))
#30463 [Deprecation] Deprecation --convert reward, use --convert embed instead. — documentation,ready — by noooop (merged: 2025-12-11 18:18 (UTC+8))
#30444 [Fix] Update lazing loading of video loader backend — ready,multi-modality — by jeremyteboul (merged: 2025-12-11 18:14 (UTC+8))
#30376 [Fix]fix import error from lmcache — ready,kv-connector — by wz1qqx (merged: 2025-12-11 17:23 (UTC+8))
#29882 [bugfix] fix MiniMaxM2ReasoningParser streaming output not separating reasoning_content. — frontend,ready — by JaviS-Rei (merged: 2025-12-11 17:05 (UTC+8))
#29710 [perf] Use direct copy (broadcast) instead of cat for k_nope/k_pe in MLA prefill — performance,ready,v1 — by minosfuture (merged: 2025-12-11 16:20 (UTC+8))
#30285 Ensure minimum frames for GLM 4.6V compatibility — ready — by gh-wf (merged: 2025-12-11 13:26 (UTC+8))
#30455 [Doc] Add Baidu Kunlun XPU support — documentation,ready — by xyDong0223 (merged: 2025-12-11 12:52 (UTC+8))
#30428 [Chore] Fix torch precision warning — ready,v1 — by yewentao256 (merged: 2025-12-11 12:05 (UTC+8))
#30396 [Deprecation] Remove deprecated plugin and compilation fields for v0.13 release — documentation,ready — by DarkLight1337 (merged: 2025-12-11 11:59 (UTC+8))
#30397 [Deprecation] Remove deprecated task, seed and MM settings — documentation,performance,frontend,ready,qwen — by DarkLight1337 (merged: 2025-12-11 11:59 (UTC+8))
#29439 [Bugfix] Fix grouped_topk pytorch impl when num_experts can’t be grouped properly — ready — by divakar-amd (merged: 2025-12-11 11:47 (UTC+8))

PRs Closed Without Merging

#17982 [Misc]Added span attribute gen_ai.system to identify spans from vLLM — needs-rebase,stale — by LakshmiPriyaSujith (closed: 2025-12-12 10:16 (UTC+8))
#21869 [Bugfix] Fix PyNcclCommunicator device assertion for un-indexed CUDA devices — stale — by CarlosArguilar (closed: 2025-12-12 10:15 (UTC+8))
#21988 [Disagg][Perf] Add env var to allow gpu model work runs in non-default CUDA stream, improving disagg TTIT/TTFT — ready,needs-rebase,stale,v1 — by liuzijing2014 (closed: 2025-12-12 10:15 (UTC+8))
#22094 Fix error message for max_input_length (bugfix of #22092) — frontend,stale — by RobertFischer (closed: 2025-12-12 10:15 (UTC+8))
#30471 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (closed: 2025-12-12 09:43 (UTC+8))
#22681 [Kernel][AMD] Reduce AITER attention CPU overhead — rocm,stale,v1 — by mxz297 (closed: 2025-12-12 10:15 (UTC+8))
#22729 [Bugfix] hermes解析器输出存在被过滤问题 (hermes parser output gets filtered) — frontend,stale,tool-calling — by astrophel0 (closed: 2025-12-12 10:15 (UTC+8))
#22775 [Perf] Add GLM-4.5V tuning configs — performance,stale — by houseroad (closed: 2025-12-12 10:15 (UTC+8))
#25640 [Misc] Fix internal invocation of _register_fake — no labels — by ijpq (closed: 2025-12-12 09:44 (UTC+8))
#29237 [Optimization] Add Fused Triton Kernel for GPT-OSS Router — gpt-oss — by ijpq (closed: 2025-12-12 09:44 (UTC+8))
#27281 [WIP][torch.compile] Add Triton-distributed GEMM+AllReduce fusion compile pass — no labels — by jasonlizhengjian (closed: 2025-12-12 05:22 (UTC+8))
#21879 [Kernel][Machete] Larger tile shape for large K when mem bound — no labels — by czhu-cohere (closed: 2025-12-12 03:12 (UTC+8))
#30440 [fix] Fix qwen3_coder tool call per parameter streaming — frontend,ready,tool-calling,qwen — by koush (closed: 2025-12-12 02:11 (UTC+8))
#30369 [Fix] Add default rope theta for qwen1 model — qwen — by iwzbi (closed: 2025-12-12 01:31 (UTC+8))
#30278 [CPU][Bugfix] Fix CPU Profiler issue — v1 — by zhili03 (closed: 2025-12-12 00:48 (UTC+8))
#30122 [Bugfix][Async] fix update_async_output_token_ids for async + spec — v1 — by izhuhaoran (closed: 2025-12-12 00:20 (UTC+8))
#28252 [WIP][KVConnector] Retrive KV events from LMCache — needs-rebase,kv-connector — by hickeyma (closed: 2025-12-11 22:41 (UTC+8))
#27718 Feature/kv cache average lifetime — documentation,needs-rebase,v1 — by alhridoy (closed: 2025-12-11 19:17 (UTC+8))
#26892 Fix uniform_decode=True in prefill when using CUDA Graph with single-token prompt — v1,nvidia — by Sugar-zsg (closed: 2025-12-11 16:22 (UTC+8))
#30468 [Feat] EPD Mooncake store — documentation,v1,kv-connector — by khuonglmhw (closed: 2025-12-11 15:33 (UTC+8))
#30467 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (closed: 2025-12-11 15:42 (UTC+8))
#30457 [Feat] Mooncake storage connector for E-PD disaggregate — documentation,v1,kv-connector — by khuonglmhw (closed: 2025-12-11 15:22 (UTC+8))
#30454 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (closed: 2025-12-11 13:36 (UTC+8))
#26617 feat: implement compact encoder cache for memory optimization — v1 — by liangwen12year (closed: 2025-12-11 13:08 (UTC+8))
#30422 [ROCm][CI][Bugfix] Fallback for grouped_topk when num_experts can’t be grouped properly — rocm — by micah-wil (closed: 2025-12-11 12:01 (UTC+8))
#30451 [Core] Optimize encoder cache with mask-based storage — v1,multi-modality — by sunYtokki (closed: 2025-12-11 11:10 (UTC+8))
#30450 [Core] Optimize encoder cache with mask-based storage — v1,multi-modality — by sunYtokki (closed: 2025-12-11 11:02 (UTC+8))