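vLLM Development Activity Report - 2026-02-23

Time window: 2026-02-23 11:31 (UTC+8) ~ 2026-02-24 11:31 (UTC+8)
Statistics: New Issues 22 | Closed Issues 20 | New PRs 70 | Merged PRs 31 | PRs closed without merging 23

📊 Daily Development Status Summary

During this period the vLLM project kept up a fast pace of iteration, with 92 new Issues and PRs opened (22 Issues and 70 PRs). Development focused on model support expansion (e.g., Qwen3.5 NVFP4, Ring 2.5), performance optimization (e.g., MoE kernels, GDN decode), and CI/CD stability (especially on AMD platforms). AMD platform support and core architecture evolution (e.g., Model Runner V2) were the standout threads of discussion and fixes.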

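🎯 AMD/ROCm Ecosystem Updates

AMD ecosystem activity was high this period, spanning hardware support, performance optimization, and CI stability.

New Issues

#35089: [RFC]: In-Tree AMD Zen CPU Backend via zentorch
- Contributor: amd-lalithnc (AMD employee)
- Overview: Proposes a detailed plan for integrating an AMD Zen CPU backend (based on zentorch) in-tree in vLLM. The goal is first-class support for AMD CPUs while keeping performance and feature parity with the external plugin.
- Technical details: The design covers platform detection, runtime GEMM dispatch, compile-time graph optimization passes (such as embedding replacement and 11 operator-fusion patterns), and the use of CustomOps.
- Discussion activity: High. Core developer ProExpertProg joined the discussion and raised questions about using vLLM IR for kernel selection, separating compile-time from runtime optimization, and improving the custom pass registration system, all of which touch on core questions about vLLM's future architecture.
- Impact: If adopted, it would give vLLM strong native inference capability on AMD CPU servers and broaden its hardware ecosystem.
#35163: [Bug]: AMD docker image still using torch 2.9…
- Overview: Reports that the AMD CI Docker image still uses PyTorch 2.9 even though requirements/rocm-build.txt already specifies 2.10.0, which blocks PRs that depend on new PyTorch 2.10 features.
- Impact: Holds up work such as kernel migration, and is a blocker for keeping the AMD platform in sync with mainline development.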
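A version drift like the one in #35163 can be caught with a simple comparison between the pinned requirement and the version installed in the image. The sketch below is illustrative only: the helper names are hypothetical and it is not how the vLLM CI checks versions. It also shows why the comparison should be done on numeric tuples, since the string "2.9" sorts after "2.10".

```python
def parse_version(text):
    """Turn '2.10.0' or '2.9.1+rocm6.4' into a comparable tuple of ints."""
    core = text.split("+")[0]  # drop any local build suffix
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def pinned_version(requirement_line, package="torch"):
    """Return the pinned version tuple for 'package==X.Y.Z', else None."""
    name, _, version = requirement_line.strip().partition("==")
    if name != package or not version:
        return None
    return parse_version(version)


def image_is_stale(installed, requirement_line):
    """True when the installed version is older than the pinned one."""
    pinned = pinned_version(requirement_line)
    return pinned is not None and parse_version(installed) < pinned


# Hypothetical inputs mirroring the versions named in the issue.
print(image_is_stale("2.9", "torch==2.10.0"))
print("2.9" < "2.10")  # naive string comparison gets the order wrong
```

Comparing tuples such as (2, 9) and (2, 10, 0) orders the versions numerically, which is the behavior a drift check needs.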
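#35126, #35128, #35130, #35132, #35133: A series of AMD CI failures
- Overview: Reports failures in multiple test groups (TGs) on AMD hardware such as MI355, covering MoE kernels, language models, distributed LoRA tests, and cross-node tests.
- Common thread: Most were created or followed up by AMD employee AndreasKaratzas, reflecting the team's ongoing effort to keep AMD test signals stable and aligned with upstream.
- Of note: #35132 mentions evaluating the new CrossLayer KV layout and NixlConnector tests, indicating that the AMD platform is tracking vLLM's newest KV cache management features.

New/Merged PRs

#35152: [ROCm][CI] Disable skinny GEMMs in language model standard tests… (Open)
- Contributor: AndreasKaratzas
- Content: Disables VLLM_ROCM_USE_SKINNY_GEMM by default in the ROCm language model tests. The skinny GEMM path uses atomicAdd, which introduces floating-point non-determinism and caused accuracy test failures.
- Impact: Improves the determinism and stability of AMD CI tests.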
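The non-determinism comes from floating-point addition not being associative: when atomic adds land in a different order from run to run, the rounded total can differ. The following is a minimal CPU-side sketch of that effect in plain Python, not the ROCm kernel itself; the values and orderings are chosen purely for illustration.

```python
def sum_in_order(values, order):
    """Accumulate values in the given index order, as one thread schedule might."""
    total = 0.0
    for index in order:
        total += values[index]
    return total


# Values of very different magnitude make the rounding difference visible.
values = [1e16, 1.0, -1e16, 1.0]

# Two possible arrival orders for the same four contributions.
first_schedule = sum_in_order(values, [0, 1, 2, 3])
second_schedule = sum_in_order(values, [0, 2, 1, 3])

print(first_schedule, second_schedule)
print(first_schedule == second_schedule)
```

In the first order a small term is absorbed by rounding while the running total is large; in the second the large terms cancel first, so both small terms survive. Accuracy tests with tight tolerances can flip between pass and fail for the same reason.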
#35164: [CI][AMD][BugFix][P/D] Skip test_moriio_connector.py tests if IB verbs is not available (Open)
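- Contributor: rasmith (presumably an AMD employee; works on the MORI KV connector)
- Content: Skips the affected tests when the test machine does not support InfiniBand, in preparation for future TCP backend support in the MORI KV connector.
- Related: Shares its goal with new Issue #35165 ([Feature]: Add TCP support to MORI KV connector).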
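A conditional skip of this kind might look like the sketch below. It is an assumption-laden illustration, not the PR's code: the helper `ib_verbs_available`, the sysfs-based heuristic, and the test class name are all hypothetical, and the standard-library `unittest.skipUnless` decorator stands in for whatever skip mechanism the real test file uses.

```python
import os
import unittest


def ib_verbs_available(sysfs_dir="/sys/class/infiniband"):
    """Heuristic (assumption): Linux lists RDMA devices under this sysfs directory."""
    try:
        return len(os.listdir(sysfs_dir)) > 0
    except OSError:
        # Directory missing or unreadable: treat as "no InfiniBand".
        return False


@unittest.skipUnless(ib_verbs_available(), "InfiniBand verbs not available")
class ConnectorSmokeTest(unittest.TestCase):
    """Placeholder test class; it is skipped as a whole on machines without IB."""

    def test_device_present(self):
        self.assertTrue(ib_verbs_available())
```

Skipping at class level keeps machines without InfiniBand green without hiding real failures on machines that do have it.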
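#35144: [ROCm] Enable GPTQMarlinConfig on ROCm to use choose_mp_linear_kernel (Open)
- Contributor: mgehre-amd (AMD employee)
- Content: Allows GPTQMarlinConfig to be selected on ROCm, so GPTQ models can use the choose_mp_linear_kernel framework (e.g., the Conch kernel) instead of being limited to the older ExLlama v2 path. Also fixes the Conch kernel's handling of symmetric quantization.
- Impact: Unifies the quantization kernel selection path between ROCm and CUDA, and brings a better-performing kernel to AMD platforms.
#35093: [ROCm] add tuned moe_wna16_triton kernel configs for CDNA4 (Open)
- Contributor: amd-asalykov (AMD employee)
- Content: Adds tuned MoE kernel configs for the CDNA4 architecture (MI350X/MI355X), aiming to improve performance of models such as Kimi K2.5 on AMD's latest hardware.
- Impact: Targeted MoE performance optimization for AMD's newest GPUs.
#35103: [Bugfix][Hardware][AMD] Gate FP4 BMM on gfx950 to fix MI300X crash (Open)
- Content: Restricts the FP4 BMM feature to CDNA4 (gfx950) so that it is not mistakenly enabled on MI300X/MI325X (gfx942), where it crashes, and ensures a graceful fallback to FP8.
- Impact: Fixes an MXFP4 quantization compatibility issue on older AMD GPUs.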
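The gating described for #35103 amounts to an allow-list check on the GPU architecture string with a fallback. A minimal sketch follows, assuming the architecture is reported as a string such as "gfx942:sramecc+:xnack-"; the function and constant names are hypothetical and do not come from the PR.

```python
# Architectures allowed to use the FP4 BMM path (per the PR description: CDNA4 only).
FP4_BMM_ARCHS = ("gfx950",)


def select_bmm_precision(gcn_arch):
    """Return 'fp4' on allow-listed architectures, otherwise fall back to 'fp8'."""
    # Architecture strings may carry feature suffixes after a colon; keep the base name.
    base_arch = gcn_arch.split(":")[0]
    return "fp4" if base_arch in FP4_BMM_ARCHS else "fp8"


print(select_bmm_precision("gfx950:sramecc+:xnack-"))
print(select_bmm_precision("gfx942:sramecc+:xnack-"))
```

An allow-list fails safe: an architecture that has not been validated gets the FP8 path rather than a crash.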

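Merged PRs

#35043: [ROCm][CI] Fix spec decode profile assertion and logprob test determinism
- Fixes non-determinism in the speculative decoding profiling and logprob tests on ROCm.

Summary: This period's AMD activity shows the AMD team systematically advancing vLLM's maturity and performance across AMD's hardware stack along several dimensions: hardware compatibility fixes (MI300X FP4), deep performance optimization (CDNA4 MoE kernels), framework alignment (enabling GPTQMarlinConfig), and CI stability.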

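💬 High-Activity Discussion Analysis

Issue #35089: [RFC]: In-Tree AMD Zen CPU Backend via zentorch
- Core topic: Whether and how to integrate the AMD Zen CPU optimized backend into the main vLLM codebase.
- Viewpoints:
  - Proposer (amd-lalithnc): Argues for in-tree integration, with only the core optimized kernels living in the external zentorch library and vLLM handling platform dispatch, optimization pass orchestration, and so on, to provide an out-of-the-box first-class experience.
  - Core developer (ProExpertProg): Welcomes the contribution but raises key architectural questions:
    - Asks whether vLLM IR lacks the needed capabilities, implying that kernel selection should build on the rapidly evolving vLLM IR rather than on compile-time Inductor passes that replace operators.
    - Stresses that kernel selection should be independent of compilation, to keep just-in-time and ahead-of-time compilation consistent. Compilation passes should be used only for optimizing transformations.
    - Asks for feedback on pain points in the custom pass registration system.
- Point of contention: Where to draw the boundary of the integration strategy, that is, how much logic should live in-tree and how it should work with vLLM's evolving IR and compilation infrastructure.
- Current status: Discussion is ongoing; it is an important conversation about vLLM's future platform abstraction and extension architecture.
Issue #35150: [Feature]: Support NVFP4 Checkpoint of Qwen3.5
- Core topic: A community request to support the Qwen3.5 397B NVFP4 quantized model released by NVIDIA.
- Discussion: After ywang96 raised the request, core contributor Isotr0py quickly replied offering to take a look. ywang96 added that another contributor, vadiklyutiy, was also working on it. PR #35156 was then opened to fix a specific loading issue for this model.
- Observation: Shows a fast loop from community request to developer response, and spontaneous coordination of work within the community.
Issue #35163: AMD docker image still using torch 2.9…
- Core topic: The PyTorch version in the AMD CI environment is out of sync with the main development branch, blocking other development work.
- Viewpoints: Reporter mikaylagawarecki states explicitly that this bug blocks their kernel migration PR. The maintainers mentioned, including gshtras, need to assess the impact of upgrading.
- Status: Newly filed, with no discussion of a fix yet, but it is a dependency issue that needs prompt resolution.
PR #35082: [Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp
- Core topic: Fixes a crash in models such as Qwen3.5 when decode context parallelism (DCP) is used together with FlashAttention 3.
- Discussion: Author haosdent provided the fix. Tester ehfd confirmed it works, but also surfaced deeper issues: DCP is incompatible with Mamba attention, and Prefix Caching has limitations in hybrid models.
- Viewpoints:
  - Fixer: Provides a targeted patch.
  - Tester/user: Validates the patch and exposes edge cases in related feature combinations, prompting a practical discussion of whether "Prefix Caching + PP" or "DCP + TP" is better for long-context agent scenarios.
- Conclusion: The PR resolves the immediate crash but opens a broader discussion about compatibility among complex features (DCP, hybrid models, Prefix Caching).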

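🔥 Hot Topics and Trend Analysis

Model support and performance optimization:
- Qwen3.5 series: The center of attention, covering NVFP4 quantization support (#35150, #35156), YaRN scaling for very long contexts (#35056, #35080), an accuracy issue with the FlashInfer backend (#35138), and a request for an updated deployment recipe (#35154).
- New model integration: Work on Ring 2.5 (#35102), Llama 4 Vision LoRA (#35147), and others is in progress.
- Attention-specific work: Items targeting specific attention mechanisms were filed separately: GDN decode optimization (#35149) and a linear attention (Mamba) prefix caching fix (#35157).
CI/CD and platform stability:
- AMD CI: A large number of Issues (#35126, #35128, #35130, #35132, #35133) show that reaching test stability on new hardware such as MI355 still takes work.
- General CI failures: Cases such as the fusion test (#35134) and the audio model reference implementation issue (#35140) illustrate the challenge of maintaining a large test suite.
Security and documentation:
- Discussions on audit-grade request logging (#34947) and data compliance (#35005) appeared (both now closed), and new PRs focus on documenting security risks (#35139) and Host header validation (#35160), suggesting the project is paying more attention to the security needs of enterprise deployments.
Audio and multimodal processing:
- Audio transcription format support (#35109), a timestamp fix (#35159), and processor cache compatibility (#35111) were addressed by PRs, reflecting a push to round out multimodal features.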

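🛠️ Key Technical Changes

Issue #35089 (RFC): In-Tree AMD Zen CPU Backend via zentorch
- Technical analysis: A complete design proposal for adding a high-performance native AMD CPU backend to vLLM. It is more than a single optimization: it is a full-stack integration plan covering platform detection, runtime dispatch, and compile-time graph optimization passes.
- Impact: If implemented, it would significantly extend vLLM's deployment scenarios to pure AMD CPU servers or mixed-architecture environments, increasing its value as a heterogeneous compute platform.
PR #35162: [Model Runner V2] Enable piecewise CUDA graphs for pipeline parallelism
- Technical analysis: Adds piecewise CUDA graph support under pipeline parallelism (PP) to the V2 model runner, removing the performance bottleneck where PP previously had to fall back to eager mode.
- Impact: Further unlocks Model Runner V2's performance potential by letting complex parallel strategies such as TP+PP make better use of CUDA graphs; a key step toward V2's maturity.
PR #34874: [Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models)
- Technical analysis: Fixes a metadata caching bug in the Mamba attention layers of hybrid models (such as Nemotron) under 'all' mode prefix caching. The bug caused CUDA graph replay to read wrong block indices, producing NaNs.
- Impact: Resolves a serious correctness issue and ensures Mamba-style models are reliable when using advanced caching features.
PR #34924: [Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default
- Technical analysis: Enables the FlashInfer DeepGEMM swapAB optimization path by default on SM90 (Hopper) and above. The optimization gives a noticeable performance gain at low batch sizes.
- Impact: Gives users of H100, H200, B200, and similar GPUs a "free" performance boost, reflecting a pattern of continually improving defaults on mainstream high-performance hardware.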
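One common reading of a "swap A/B" GEMM path is that it relies on the identity A·B = (Bᵀ·Aᵀ)ᵀ, so a product with a very small row count (a low batch size) can be computed with the operands in swapped roles and the small dimension mapped to a different tile axis. Whether FlashInfer's DeepGEMM path does exactly this is an assumption here; the sketch below only checks the identity itself on small integer matrices in plain Python.

```python
def matmul(a, b):
    """Naive matrix product of two lists-of-rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def matmul_swapped(a, b):
    """Compute a @ b by multiplying the transposed operands in swapped order."""
    return transpose(matmul(transpose(b), transpose(a)))


activations = [[1, 2, 3]]            # M=1 row, as in a low-batch decode step
weights = [[1, 0], [0, 1], [2, 2]]   # K=3, N=2

print(matmul(activations, weights))
print(matmul_swapped(activations, weights))
```

Both calls produce the same matrix; the swapped form only changes which operand supplies the long dimension.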
PR #35135: [Bugfix] Fix lora_ids in FusedMoE LoRA test
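- Technical analysis: Fixes an incorrect padding value for expert_ids in the FusedMoE LoRA test (it should be -1, not 0) and tightens the assertions.
- Impact: Although it looks like a test-only fix, it corrects a potential misunderstanding of the underlying kernel interface and ensures the MoE LoRA implementation is correct, which matters for fine-tuning mixture-of-experts models.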
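Why the padding value matters can be shown with a toy example: 0 is itself a valid expert index, so padding with 0 makes unused blocks indistinguishable from blocks assigned to expert 0, while -1 cannot collide with any real expert. The helpers below are hypothetical and only model the idea; they are not the kernel's or the test's actual code.

```python
def pad_expert_ids(expert_ids, num_blocks, pad_value):
    """Pad a per-block expert id list to num_blocks entries with pad_value."""
    if len(expert_ids) > num_blocks:
        raise ValueError("more expert ids than blocks")
    return list(expert_ids) + [pad_value] * (num_blocks - len(expert_ids))


def blocks_for_expert(padded_ids, expert):
    """Return the block indices that would be processed for the given expert."""
    return [index for index, value in enumerate(padded_ids) if value == expert]


real_ids = [0, 0, 2]  # three used blocks: two for expert 0, one for expert 2

zero_padded = pad_expert_ids(real_ids, 5, pad_value=0)
sentinel_padded = pad_expert_ids(real_ids, 5, pad_value=-1)

print(blocks_for_expert(zero_padded, 0))      # padded blocks leak into expert 0
print(blocks_for_expert(sentinel_padded, 0))  # only the real blocks remain
```

With the -1 sentinel, a consumer can skip padded blocks with a single comparison, and an assertion such as "no padded block is attributed to a real expert" becomes checkable.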

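📈 Development Activity Observations

AMD team activity: Very high. AndreasKaratzas acted as a "firefighter", creating and handling a large number of AMD CI Issues; other AMD employees (rasmith, mgehre-amd, amd-asalykov, pschlan-amd) contributed actively in their own areas (KV connector, quantization, kernel optimization, performance analysis).
Community contributors: haosdent stood out with a string of key bugfixes (#35080, #35082, #34874). russellb focused on security hardening and documentation.
Core team: Core members such as DarkLight1337, hmellor, and mgoin reviewed and merged code efficiently and took part in architecture discussions (for example, ProExpertProg's in-depth questions on the Zen CPU RFC).
Collaboration: Community interaction was healthy, for example requests being raised and picked up in Issues (#35150) and developers and testers working closely to validate fixes in PRs (#35082). Developers from the AMD and NVIDIA ecosystems are working in the codebase side by side to move the project forward.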

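💡 Issues Worth Watching

AMD Zen CPU backend decision (#35089): The outcome of this RFC will define vLLM's technical path for heterogeneous CPU support; the discussion is worth following for anyone interested in vLLM's platform evolution.
PyTorch version upgrade for AMD CI (#35163): Left unresolved, this will keep the AMD platform from syncing with new mainline features; it is a key dependency update.
GDN decode path optimization (#35149): A newly proposed performance item whose design and implementation are worth tracking, since it may bring significant decode speedups for specific model architectures.
Linear attention (Mamba) state management (#35157): This PR fixes cleanup of Mamba state when the cache is reset, an important patch for keeping linear attention models stable under complex scheduling.
Multi-node test script issue (#35129): Points to a possible syntax or environment problem in the CI scripts. PR #35131 attempts a fix, but the robustness of multi-node tests still needs watching.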

📋 Appendix: Detailed Data Lists

New Issues

#35165 [Feature]: Add TCP support to MORI KV connector | feature request | by rasmith (created: 2026-02-24 10:13 (UTC+8))
#35163 [Bug]: AMD docker image still using torch 2.9 despite 2.10.0 in requirements/rocm-build.txt | bug,rocm | by mikaylagawarecki (created: 2026-02-24 09:47 (UTC+8))
#35150 [Feature]: Support NVFP4 Checkpoint of Qwen3.5 | feature request | by ywang96 (created: 2026-02-24 07:32 (UTC+8))
#35154 [Performance]: Optimized Deployment Recipe for Qwen3.5 | performance | by ywang96 (created: 2026-02-24 07:58 (UTC+8))
#35128 [CI Failure]: mi355_1: Language Models Tests (Standard) | ci-failure | by AndreasKaratzas (created: 2026-02-24 02:16 (UTC+8))
#35149 [Performance]: Optimize GDN Decode | performance | by ywang96 (created: 2026-02-24 07:24 (UTC+8))
#35089 [RFC]: In-Tree AMD Zen CPU Backend via zentorch | rocm,RFC,cpu | by amd-lalithnc (created: 2026-02-23 17:36 (UTC+8))
#35133 [CI Failure]: mi355_4: LoRA TP Test (Distributed) | ci-failure | by AndreasKaratzas (created: 2026-02-24 02:34 (UTC+8))
#35141 [Feature]: Sequence Parallel Support for Model Runner V2 | feature request | by yewentao256 (created: 2026-02-24 04:23 (UTC+8))
#35140 [CI] Ultravox audio model HuggingFace reference produces invalid output with NaN logprobs | ci-failure | by LucasWilkinson (created: 2026-02-24 03:55 (UTC+8))
#35138 [Bug]: Qwen/Qwen3.5-397B-A17B-FP8 and Qwen/Qwen3.5-397B-A17B has accuracy issues when running with Flashinfer Attention backend on Blackwell. | bug,nvidia | by xinli-sw (created: 2026-02-24 03:06 (UTC+8))
#35126 [CI Failure]: mi355_1: Kernels MoE Test %N | rocm,ci-failure | by AndreasKaratzas (created: 2026-02-24 02:11 (UTC+8))
#35129 [CI Failure]: mi355_4: 2 Node Tests (4 GPUs in total) | ci-failure | by AndreasKaratzas (created: 2026-02-24 02:18 (UTC+8))
#35134 [CI Failure]: Fusion E2E Quick (H100) test_tp1_fp8_fusions | ci-failure | by mgoin (created: 2026-02-24 02:34 (UTC+8))
#35130 [CI Failure]: mi355_4: 2 Node Tests (4 GPUs in total) | rocm,ci-failure | by AndreasKaratzas (created: 2026-02-24 02:19 (UTC+8))
#35132 [CI Failure][ROCm]: CrossLayer KV layout Distributed NixlConnector PD accuracy tests (4 GPUs) | rocm,ci-failure | by AndreasKaratzas (created: 2026-02-24 02:28 (UTC+8))
#35118 [Feature]: Support ubuntu 24.04 runtime container | feature request | by aasgaonkar (created: 2026-02-24 01:21 (UTC+8))
#35114 [Usage]: Does vllm cpu supports DP | usage | by akasshdeep (created: 2026-02-24 00:41 (UTC+8))
#35084 [Bug]: VLLM tries to load “inductor” instead of custom compiler | bug,torch.compile | by mergian (created: 2026-02-23 15:51 (UTC+8))
#35104 [Bug]: V1 engine workers die after idle period (SystemError: PyCFunction / EngineDeadError) - TP=2, multiprocessing | no labels | by jsboige (created: 2026-02-23 22:46 (UTC+8))
#35091 [Bug]: Triton CompilationError in speculative decoding (draft_model) | bug | by rse173 (created: 2026-02-23 18:35 (UTC+8))
#35087 [Bug]: DeepSeek 3.2 P/D Disaggregation Support | bug | by yanminjia (created: 2026-02-23 16:31 (UTC+8))

Closed Issues

#22276 [Bug]: Unknown quantization method: mxfp4 | bug,stale | by mkrzywda (closed: 2026-02-24 10:17 (UTC+8))
#23793 [Bug]: vllm.LLM does not release GPU memory after deletion when loaded with a HF model | bug,stale | by HuFY-dev (closed: 2026-02-24 10:17 (UTC+8))
#27504 [Usage]: add_vision_id ignored for Qwen 2.5-VL-32B-Instruct | usage,stale | by justachetan (closed: 2026-02-24 10:16 (UTC+8))
#27505 [Bug]: Value error, Found conflicts between ‘rope_type=default’ (modern field) and ‘type=mrope’ | bug,stale | by asirgogogo (closed: 2026-02-24 10:16 (UTC+8))
#28886 [CI] Notification mechanism for failing nightly jobs | feature request,stale | by khluu (closed: 2026-02-24 05:28 (UTC+8))
#34290 [Model Bash][DeepSeek]: Remove Clone From Shared Expert Stream | model-bash | by robertgshaw2-redhat (closed: 2026-02-24 03:37 (UTC+8))
#34300 [Model Bash][DeepSeek]: Remove Logits Casts in DSR1 NVFP4 TRTLLM | model-bash | by robertgshaw2-redhat (closed: 2026-02-24 03:36 (UTC+8))
#34708 [CI Failure]: Fp8 MoE Kernels (DeepEP + DeepGEMM) | ci-failure | by robertgshaw2-redhat (closed: 2026-02-24 03:36 (UTC+8))
#34947 Feature: Audit-grade request logging for EU AI Act compliance (Article 12) | no labels | by desiorac (closed: 2026-02-24 03:07 (UTC+8))
#35005 Data Governance & Compliance Guide for vLLM Deployments (EU AI Act Article 6) | no labels | by desiorac (closed: 2026-02-24 03:05 (UTC+8))
#35134 [CI Failure]: Fusion E2E Quick (H100) test_tp1_fp8_fusions | ci-failure | by mgoin (closed: 2026-02-24 02:37 (UTC+8))
#35130 [CI Failure]: mi355_4: 2 Node Tests (4 GPUs in total) | rocm,ci-failure | by AndreasKaratzas (closed: 2026-02-24 02:34 (UTC+8))
#29516 [CI Failure]: mi325_4: Distributed Tests (A100) | ci-failure | by AndreasKaratzas (closed: 2026-02-24 02:07 (UTC+8))
#29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) | ci-failure | by AndreasKaratzas (closed: 2026-02-24 02:07 (UTC+8))
#34621 [Usage]: new vLLM version | usage | by eghouti (closed: 2026-02-24 01:32 (UTC+8))
#34326 [Bug]: --served-model-name causes model detection issues | bug | by ssendev (closed: 2026-02-24 00:38 (UTC+8))
#34865 [Bug]: Prefix caching failing for Nemotron models | bug | by benchislett (closed: 2026-02-24 00:31 (UTC+8))
#35056 [Bug]: Qwen3.5 AttributeError: 'MRotaryEmbedding' object has no attribute 'truncate' with RoPE Scaling | bug | by ehfd (closed: 2026-02-24 00:05 (UTC+8))
#33784 [Bug]: [Docker.cpu build] incomplete type ‘qk_vec_type’ {aka ‘void’} used in nested name specifier | bug,cpu | by BartekKruczek (closed: 2026-02-23 21:08 (UTC+8))
#34812 [Bug]: GraniteMoeHybridModel not applying embedding_multiplier to input embeddings | bug | by gabe-l-hart (closed: 2026-02-23 16:42 (UTC+8))

New PRs

#35139 [Docs] Document security risks of GPT-OSS Python tool | documentation,frontend,gpt-oss | by russellb (created: 2026-02-24 03:11 (UTC+8))
#35160 [Security] Add Host header validation and improve security guide | documentation,frontend | by russellb (created: 2026-02-24 08:45 (UTC+8))
#35099 gpu_model_runner: Cache is_encoder_decoder from model config | ready,v1 | by pschlan-amd (created: 2026-02-23 21:24 (UTC+8))
#35147 [Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) | ready,llama | by dorhuri123 (created: 2026-02-24 07:05 (UTC+8))
#35107 Fix custom processors that use deleted behaviour for Transformers v5 | ready | by hmellor (created: 2026-02-23 23:27 (UTC+8))
#35151 [Bugfix][Frontend] Disable /v1/chat/completions for base models without chat templates | bug,frontend | by alxfv (created: 2026-02-24 07:37 (UTC+8))
#35135 [Bugfix] Fix lora_ids in FusedMoE LoRA test | bug,ready | by xyang16 (created: 2026-02-24 02:35 (UTC+8))
#35164 [CI][AMD][BugFix][P/D] Skip test_moriio_connector.py tests if IB verbs is not available | bug,rocm,v1,kv-connector | by rasmith (created: 2026-02-24 10:10 (UTC+8))
#35166 gdn attn metadata builder | v1,meta-exported,fb-exported | by jennyyyyzhen (created: 2026-02-24 10:42 (UTC+8))
#35085 [Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support | bug | by haosdent (created: 2026-02-23 16:03 (UTC+8))
#35082 [Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp | bug,v1 | by haosdent (created: 2026-02-23 15:39 (UTC+8))
#35161 [Bugfix] Fix expert_ids padding values in moe_align_block_size kernel | bug | by xyang16 (created: 2026-02-24 09:32 (UTC+8))
#35102 [Model] Ring 2.5 | new-model | by ZJY0516 (created: 2026-02-23 22:12 (UTC+8))
#35162 [Model Runner V2] Enable piecewise CUDA graphs for pipeline parallelism | v1,nvidia | by ZhanqiuHu (created: 2026-02-24 09:33 (UTC+8))
#35156 [BUGFIX][Qwen3.5] Hardcode mlp.gate as not quantizable | bug,ready,qwen | by vadiklyutiy (created: 2026-02-24 08:13 (UTC+8))
#35076 [Bugfix] Propagate default stop_token_ids to per-request SamplingParams | bug,frontend,gpt-oss | by sriganesh123 (created: 2026-02-23 12:25 (UTC+8))
#35123 [Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches | bug,ready,ci/build | by mgoin (created: 2026-02-24 02:01 (UTC+8))
#35158 [Spec Decode][KV Connector] Fix KV transfer in PD + speculative decoding | ci/build,v1,kv-connector | by ZhanqiuHu (created: 2026-02-24 08:41 (UTC+8))
#35159 fix: compensate chunk split offset in Whisper segment timestamps (closes #32588) | frontend | by stakeswky (created: 2026-02-24 08:42 (UTC+8))
#35157 [Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache | bug,v1 | by heheda12345 (created: 2026-02-24 08:36 (UTC+8))
#35137 [core]Fix CPU Pinned-Memory Race Between _dummy_run() and execute_model() | v1 | by charlotte12l (created: 2026-02-24 02:58 (UTC+8))
#35152 [ROCm][CI] Disable skinny GEMMs in language model standard tests to fix non-determinism | rocm | by AndreasKaratzas (created: 2026-02-24 07:38 (UTC+8))
#35155 updated for rh ubi10.1 docker | ci/build | by sureshnam (created: 2026-02-24 08:02 (UTC+8))
#35153 [MoE Refactor] Make SharedExperts class for use with DefaultMoERunner | needs-rebase,nvidia | by bnellnm (created: 2026-02-24 07:49 (UTC+8))
#35148 [Responses] Decouple SSE event helpers from Harmony context | frontend,gpt-oss | by sfeng33 (created: 2026-02-24 07:24 (UTC+8))
#35146 add attention sink support to aiter fa | rocm,needs-rebase,v1 | by jennyyyyzhen (created: 2026-02-24 06:55 (UTC+8))
#35145 [ROCm][CI] Reverting changes in MI355 pipeline so that some default TGs are blocking on external AMD-CI signal | rocm,ci/build | by AndreasKaratzas (created: 2026-02-24 06:48 (UTC+8))
#35143 [Spec Decode] Remove unnecessary all reduce for dense draft model on dp setting | speculative-decoding,v1 | by ZhengkaiZ (created: 2026-02-24 04:47 (UTC+8))
#35144 [ROCm] Enable GPTQMarlinConfig on ROCm to use choose_mp_linear_kernel | rocm | by mgehre-amd (created: 2026-02-24 05:57 (UTC+8))
#35122 Reapply [Attention] Refactor check_and_update_config | speculative-decoding,v1,multi-modality,nvidia,ready-run-all-tests | by MatthewBonanni (created: 2026-02-24 01:55 (UTC+8))
#35136 [Release] Include source distribution (sdist) in PyPI uploads | ci/build | by dougbtv (created: 2026-02-24 02:54 (UTC+8))
#35112 [Benchmarks] Add --connection-pooling flag to vllm bench serve | performance | by InfraWhisperer (created: 2026-02-24 00:26 (UTC+8))
#35142 XPU updates + attach sanitized CI logs | no labels | by Joshna-Medisetty (created: 2026-02-24 04:45 (UTC+8))
#35117 [compile] Save aot compile artifacts atomically. | ready | by zhxchen17 (created: 2026-02-24 00:55 (UTC+8))
#35088 Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe | performance,ready,nvidia | by danisereb (created: 2026-02-23 16:49 (UTC+8))
#35120 [WIP][Model Runner V2] Support pooling models | v1,nvidia | by WoosukKwon (created: 2026-02-24 01:30 (UTC+8))
#35116 Feat/dsv3 router cublas fallback | needs-rebase,ci/build,deepseek | by robertgshaw2-redhat (created: 2026-02-24 00:46 (UTC+8))
#35093 [ROCm] add tuned moe_wna16_triton kernel configs for CDNA4 | rocm | by amd-asalykov (created: 2026-02-23 18:45 (UTC+8))
#35124 [Bugfix] FlashInfer incompatible with sleep mode | bug,v1,nvidia | by guan404ming (created: 2026-02-24 02:05 (UTC+8))
#35131 Fixing matcher to enable the 2-node-tests-4-gpus-in-total on MI355 | ci/build | by Alexei-V-Ivanov-AMD (created: 2026-02-24 02:19 (UTC+8))
#35127 [Perf] Optimize pooling model redundant copy, 1.8% throughput improvement | ready,v1 | by yewentao256 (created: 2026-02-24 02:13 (UTC+8))
#35125 [BugFix][kv_offload]: Fix kernel block size detection | bug,v1 | by orozery (created: 2026-02-24 02:10 (UTC+8))
#35092 [Bugfix][PD] Fix OffloadingConnector with PD deployments | bug,kv-connector | by NickLucche (created: 2026-02-23 18:39 (UTC+8))
#35121 [Performance] Cublas Bf16 Gate with Fp32 Output | ci/build,deepseek | by roikoren755 (created: 2026-02-24 01:45 (UTC+8))
#35119 [compile] Invalidate cache for cpu flags | no labels | by angelayi (created: 2026-02-24 01:26 (UTC+8))
#35113 [Misc] Monitor interface changes | ready,ci/build | by NickLucche (created: 2026-02-24 00:29 (UTC+8))
#35110 [DO NOT MERGE] Check if set_default_torch_num_threads is still necessary for init | ready,multi-modality | by DarkLight1337 (created: 2026-02-24 00:07 (UTC+8))
#35115 [compile] Improve error message during artifacts load failure. | no labels | by zhxchen17 (created: 2026-02-24 00:45 (UTC+8))
#35101 Fix custom processors that use deleted import for Transformers v5 | ready | by hmellor (created: 2026-02-23 22:10 (UTC+8))
#35075 [Bug][DSV3.2] Always prepare metadata for DeepGEMM Sparse Attention | bug,v1 | by benchislett (created: 2026-02-23 12:10 (UTC+8))
#35108 [glm-asr] change defaults dummy audio size | ready,multi-modality | by eustlb (created: 2026-02-23 23:34 (UTC+8))
#35111 [Bugfix] Fix failing FunASR processor test | bug,ready,multi-modality | by Isotr0py (created: 2026-02-24 00:23 (UTC+8))
#35098 Use Xet high performance mode for Transformers v5 | ready | by hmellor (created: 2026-02-23 20:59 (UTC+8))
#35080 [Bugfix] Fix MRotaryEmbedding missing truncate attr with YaRN scaling | bug,ready | by haosdent (created: 2026-02-23 14:53 (UTC+8))
#35109 [Bugfix][Frontend] Fix audio transcription for MP4, M4A, and WebM formats | bug,frontend | by seanmamasde (created: 2026-02-23 23:43 (UTC+8))
#35100 Support parakeet as audio encoder for nemotron-nano-vl | no labels | by netanel-haber (created: 2026-02-23 21:46 (UTC+8))
#35106 Fix custom compiler backend resolution in make_compiler#35084 | no labels | by baonudesifeizhai (created: 2026-02-23 23:25 (UTC+8))
#35105 [Refactor] Add global helper to deduplicate vectorized memory ops | needs-rebase,nvidia | by LopezCastroRoberto (created: 2026-02-23 22:50 (UTC+8))
#35103 [Bugfix][Hardware][AMD] Gate FP4 BMM on gfx950 to fix MI300X crash | bug,rocm | by c0de128 (created: 2026-02-23 22:29 (UTC+8))
#35083 [Refactor] Decouple TimingContext from InputProcessingContext | performance,ready,v1,multi-modality,llama,qwen,deepseek | by DarkLight1337 (created: 2026-02-23 15:45 (UTC+8))
#35095 [MLA] Add fused Triton concat+quantize kernel for fp8 decode queries | no labels | by elvircrn (created: 2026-02-23 19:23 (UTC+8))
#35096 feat: added num requests and num tokens to profiles | v1 | by hypdeb (created: 2026-02-23 19:33 (UTC+8))
#35094 Fix pipeline parallel with embed scaling in the Transformers modelling backend | ready | by hmellor (created: 2026-02-23 18:59 (UTC+8))
#35097 [BugFix] Fix 3D rope in transformers backend | bug | by zucchini-nlp (created: 2026-02-23 20:01 (UTC+8))
#35090 docs(cpu): Clarify pre-built wheels requirement for CPU Python-only build | documentation,cpu | by sagearc (created: 2026-02-23 18:04 (UTC+8))
#35081 [Hardware][Powerpc]Enable prefix caching and chunked prefill for ppc64le | no labels | by Akashcodes732 (created: 2026-02-23 15:37 (UTC+8))
#35086 more models for vLLM Benchmark Suite | documentation,performance,ci/build,cpu | by louie-tsai (created: 2026-02-23 16:12 (UTC+8))
#35077 [Bugfix] LoRA for fused_qkv_a_proj in MLA (DeepSeek V3.2) | bug,deepseek | by HollowMan6 (created: 2026-02-23 12:28 (UTC+8))
#35078 Bump actions/stale from 10.1.1 to 10.2.0 | ci/build,github_actions,dependencies | by dependabot (created: 2026-02-23 12:28 (UTC+8))
#35074 [Misc] Enable weights loading tracking for quantized models | ready | by Isotr0py (created: 2026-02-23 11:47 (UTC+8))

Merged PRs

#35099 gpu_model_runner: Cache is_encoder_decoder from model config | ready,v1 | by pschlan-amd (merged: 2026-02-24 11:08 (UTC+8))
#34906 [Quantization] Support FP8 MoE bias for models like GPT-OSS | ready,gpt-oss | by jasperjiaguo (merged: 2026-02-24 11:07 (UTC+8))
#33443 [ROCm] AITER fused RoPE+KVCache | rocm,ready,torch.compile,v1,gpt-oss | by Rohan138 (merged: 2026-02-24 11:06 (UTC+8))
#35054 [Mamba1] - Change supports_update_block_table to True | ready,v1 | by Josephasafg (merged: 2026-02-24 11:05 (UTC+8))
#35135 [Bugfix] Fix lora_ids in FusedMoE LoRA test | bug,ready | by xyang16 (merged: 2026-02-24 10:49 (UTC+8))
#34924 [Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default | performance,ready,ready-run-all-tests | by mgoin (merged: 2026-02-24 09:34 (UTC+8))
#35123 [Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches | bug,ready,ci/build | by mgoin (merged: 2026-02-24 09:11 (UTC+8))
#34846 [Perf] Improve default triton fused moe configs | performance,ready | by mgoin (merged: 2026-02-24 08:01 (UTC+8))
#34992 [RL] Validation for pause_mode='keep' | documentation,ready,ci/build | by hao-aaron (merged: 2026-02-24 05:30 (UTC+8))
#34302 [ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) | performance,ready,ci/build,deepseek | by robertgshaw2-redhat (merged: 2026-02-23 22:02 (UTC+8))
#34973 Enforce that model is the first positional arg when --served-model-name is used | ready | by hmellor (merged: 2026-02-24 00:38 (UTC+8))
#35043 [ROCm][CI] Fix spec decode profile assertion and logprob test determinism | rocm,ready,v1 | by AndreasKaratzas (merged: 2026-02-23 21:05 (UTC+8))
#35113 [Misc] Monitor interface changes | ready,ci/build | by NickLucche (merged: 2026-02-24 01:14 (UTC+8))
#35101 Fix custom processors that use deleted import for Transformers v5 | ready | by hmellor (merged: 2026-02-24 00:38 (UTC+8))
#34990 [CI] Skip Responses API | ready,ci-failure | by robertgshaw2-redhat (merged: 2026-02-23 23:46 (UTC+8))
#34874 [Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models) | bug,ready,v1 | by haosdent (merged: 2026-02-24 00:31 (UTC+8))
#35098 Use Xet high performance mode for Transformers v5 | ready | by hmellor (merged: 2026-02-24 00:19 (UTC+8))
#35080 [Bugfix] Fix MRotaryEmbedding missing truncate attr with YaRN scaling | bug,ready | by haosdent (merged: 2026-02-24 00:05 (UTC+8))
#30950 [Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) | documentation,ready,v1 | by markmc (merged: 2026-02-23 23:01 (UTC+8))
#34254 [kv-cache, ct] Use compressed-tensors as a source of ground-truth for quant strategies | ready | by eldarkurtic (merged: 2026-02-23 22:37 (UTC+8))
#35083 [Refactor] Decouple TimingContext from InputProcessingContext | performance,ready,v1,multi-modality,llama,qwen,deepseek | by DarkLight1337 (merged: 2026-02-23 22:15 (UTC+8))
#35033 [Llama4,CI] Bring back Llama-4 bug fixes, and also fix Maverick tests | bug,ready,multi-modality,llama | by eldarkurtic (merged: 2026-02-23 22:04 (UTC+8))
#35060 [CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig | ready | by VincentG1234 (merged: 2026-02-23 21:05 (UTC+8))
#35094 Fix pipeline parallel with embed scaling in the Transformers modelling backend | ready | by hmellor (merged: 2026-02-23 21:04 (UTC+8))
#35010 [XPU] allow TORCH_SDPA/TRITON_ATTN as XPU vit Backend | ready | by yma11 (merged: 2026-02-23 21:06 (UTC+8))
#35068 [Refactor] Remove dead private func _fp8_perm and _extract_mask_for_item | ready | by yewentao256 (merged: 2026-02-23 21:05 (UTC+8))
#34651 [Feature] Lazy import for the “mistral” tokenizer module. | structured-output,frontend,ready,v1,multi-modality | by nascheme (merged: 2026-02-23 16:43 (UTC+8))
#34739 [BugFix]: Fix local mypy issues | bug,frontend,ready,kv-connector | by hickeyma (merged: 2026-02-23 16:40 (UTC+8))
#34813 fix: Apply embedding_multiplier to inputs_embeds | ready | by gabe-l-hart (merged: 2026-02-23 16:42 (UTC+8))
#33752 [Bugfix] Fix kernel benchmark | bug,performance,ready | by jeejeelee (merged: 2026-02-23 13:18 (UTC+8))
#35025 [Refactor] Simplify dummy data generation | documentation,speculative-decoding,ready,multi-modality,llama,qwen,deepseek | by DarkLight1337 (merged: 2026-02-23 12:55 (UTC+8))

PRs Closed Without Merging

#29065 [ROCm] Support AITER paged attention with sliding_window | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,needs-rebase,ci/build | by sammysun0711 (closed: 2026-02-24 11:00 (UTC+8))
#28821 Fix: API server wait timeout on slow storage nodes | stale,v1 | by jaywonchung (closed: 2026-02-24 10:17 (UTC+8))
#27503 Clarify V0→V1 error; keep SamplingParams importable when VLLM_USE_V1=0 | frontend,ci/build,stale,v1 | by nick-allison (closed: 2026-02-24 10:16 (UTC+8))
#27507 qwen3moe on gh200 | stale,qwen | by bhaktatejas922 (closed: 2026-02-24 10:16 (UTC+8))
#35137 [core]Fix CPU Pinned-Memory Race Between _dummy_run() and execute_model() | v1 | by charlotte12l (closed: 2026-02-24 08:14 (UTC+8))
#35142 XPU updates + attach sanitized CI logs | no labels | by Joshna-Medisetty (closed: 2026-02-24 04:46 (UTC+8))
#35116 Feat/dsv3 router cublas fallback | needs-rebase,ci/build,deepseek | by robertgshaw2-redhat (closed: 2026-02-24 03:39 (UTC+8))
#35002 Fixing matcher to enable the 2-node-tests-4-gpus-in-total on MI355 | ci/build | by Alexei-V-Ivanov-AMD (closed: 2026-02-24 02:20 (UTC+8))
#35110 [DO NOT MERGE] Check if set_default_torch_num_threads is still necessary for init | ready,multi-modality | by DarkLight1337 (closed: 2026-02-24 00:51 (UTC+8))
#34768 Improve Transformers v4/v5 compatibility in tokenizers and processors | ci/build,multi-modality | by cccat6 (closed: 2026-02-24 00:44 (UTC+8))
#33327 [Quant] Fix FP8 block quantization for non-aligned dimensions | documentation,needs-rebase | by Etelis (closed: 2026-02-23 23:45 (UTC+8))
#33328 [Quant] Skip CUTLASS block FP8 on Blackwell GPUs | nvidia | by Etelis (closed: 2026-02-23 23:44 (UTC+8))
#34970 [Bugfix] Fix block_size mismatch for MLA models after #34818 | bug,ready,needs-rebase,v1 | by mgoin (closed: 2026-02-23 22:26 (UTC+8))
#35067 [Deprecation] Deprecate resolve_hf_chat_template as scheduled | frontend,ready | by yewentao256 (closed: 2026-02-23 22:19 (UTC+8))
#33180 [Bugfix][Hardware][AMD] Fix MXFP4 MoE backend selection for RDNA3 (gfx1100) | bug,rocm | by c0de128 (closed: 2026-02-23 22:08 (UTC+8))
#34481 [Bugfix][Hardware][AMD] Add ahead-of-time weight dequantization for quantization emulation | bug,rocm | by c0de128 (closed: 2026-02-23 22:08 (UTC+8))
#32878 [Bugfix][Hardware][AMD] Add LoRA guard to unquantized MoE backend selection | bug,rocm | by c0de128 (closed: 2026-02-23 22:08 (UTC+8))
#31118 [Bugfix][Hardware][AMD] Fix uninitialized prefix_scheduler_metadata | bug,rocm,ready,v1 | by c0de128 (closed: 2026-02-23 22:08 (UTC+8))
#34895 Gate cuBLAS kernel fallback | ci/build,deepseek | by roikoren755 (closed: 2026-02-23 22:02 (UTC+8))
#27216 [Kernel] Re-enable mrope triton kernel for CUDA/ROCM platform by default | rocm,needs-rebase,stale,nvidia | by Isotr0py (closed: 2026-02-23 19:44 (UTC+8))
#27515 Fix issue #27486 double bos token | stale | by baonudesifeizhai (closed: 2026-02-23 17:26 (UTC+8))
#34602 [Bugfix] Fix sleep mode wake_up memory leaks | bug | by stmoonar (closed: 2026-02-23 14:59 (UTC+8))
#28330 Fix: tool call streaming when both reasoning and tool parsers are enabled #28297 | frontend,needs-rebase,unstale,nvidia | by baonudesifeizhai (closed: 2026-02-23 11:34 (UTC+8))