vLLM Development Activity Report - 2026-02-23
Time window: 2026-02-23 11:31 (UTC+8) ~ 2026-02-24 11:31 (UTC+8). Stats: new issues 22 | closed issues 20 | new PRs 70 | merged PRs 31 | closed without merging 23
📊 Daily Development Status Summary
During this period the vLLM project kept up a rapid pace of iteration, processing more than 90 issues and PRs in total. Development focused on expanding model support (e.g. Qwen3.5 NVFP4, Ring 2.5), performance optimization (e.g. MoE kernels, GDN decode), and CI/CD stability (especially on the AMD platform). Discussions and fixes around AMD platform support and core architecture evolution (e.g. Model Runner V2) were the day's standout items.
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was brisk this period, spanning hardware support, performance optimization, and CI stability.
New Issues
- #35089: [RFC]: In-Tree AMD Zen CPU Backend via zentorch
  - Contributor: amd-lalithnc (AMD employee)
  - Summary: Proposes a detailed plan to integrate an AMD Zen CPU backend (based on zentorch) into the vLLM tree, aiming to give AMD CPUs first-class support with performance and feature parity with the external plugin.
  - Technical details: The design covers platform detection, runtime GEMM dispatch, compile-time graph optimization passes (such as embedding replacement and 11 operator-fusion patterns), and the use of CustomOps.
  - Discussion heat: High. Core developer ProExpertProg joined in, raising questions about kernel selection via the vLLM IR, separating compile-time from runtime optimization, and improving the custom pass registration system, all of which touch the core of vLLM's future architecture.
  - Impact: If adopted, this would give vLLM strong native inference capability on AMD CPU servers and broaden its hardware ecosystem.
- #35163: [Bug]: AMD docker image still using torch 2.9…
  - Summary: Reports that the AMD CI Docker image still uses PyTorch 2.9 even though requirements/rocm-build.txt specifies 2.10.0, blocking PRs that depend on PyTorch 2.10 features.
  - Impact: Blocks kernel-migration work; a synchronization bottleneck between the AMD platform and the main development flow.
- #35126, #35128, #35130, #35132, #35133: a batch of AMD CI failures
  - Summary: Reports failures across multiple test groups (TG) on AMD hardware such as MI355, covering MoE kernels, language models, distributed LoRA tests, and cross-node tests.
  - Common thread: Most were filed or triaged by AMD employee AndreasKaratzas, reflecting the team's sustained effort to keep the AMD test signal stable and aligned with upstream.
  - Of note: #35132 mentions evaluating new CrossLayer KV layout and NixlConnector tests, showing the AMD platform is tracking vLLM's newest KV-cache management features.
New/Merged PRs
- #35152: [ROCm][CI] Disable skinny GEMMs in language model standard tests… (Open)
  - Contributor: AndreasKaratzas
  - Content: Disables VLLM_ROCM_USE_SKINNY_GEMM by default in ROCm language-model tests to eliminate the floating-point nondeterminism caused by its use of atomicAdd, fixing accuracy-test failures.
  - Impact: Improves the determinism and stability of AMD CI tests.
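The determinism issue above stems from a basic property of floating point: addition is not associative, so when atomicAdd lets GPU threads commit partial sums in whatever order they finish, the low-order bits of a result can differ between runs. A minimal Python illustration:

```python
# Floating-point addition is not associative, so summation order matters.
# atomicAdd commits partial sums in a nondeterministic order across threads,
# which can change low-order bits of a GEMM result from run to run.
left = (0.1 + 0.2) + 0.3   # one accumulation order
right = 0.1 + (0.2 + 0.3)  # another order, identical inputs

print(left == right)  # False: 0.6000000000000001 vs 0.6
```

Disabling the skinny-GEMM path trades a little speed for bit-exact reproducibility, which is what accuracy tests need.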
- #35164: [CI][AMD][BugFix][P/D] Skip test_moriio_connector.py tests if IB verbs is not available (Open)
  - Contributor: rasmith (likely an AMD employee; works on the MORI KV connector)
  - Content: Skips the tests on machines without InfiniBand support, preparing for a future TCP backend in the MORI KV connector.
  - Related: Aligned with new issue #35165 ([Feature]: Add TCP support to MORI KV connector).
- #35144: [ROCm] Enable GPTQMarlinConfig on ROCm to use choose_mp_linear_kernel (Open)
  - Contributor: mgehre-amd (AMD employee)
  - Content: Allows GPTQMarlinConfig to be selected on ROCm, so GPTQ models can use the choose_mp_linear_kernel framework (e.g. Conch kernels) instead of being limited to the older ExLlama v2 path; also fixes Conch kernel handling of symmetric quantization.
  - Impact: Unifies the quantization kernel-selection path across ROCm and CUDA, and brings faster new kernels to the AMD platform.
- #35093: [ROCm] add tuned moe_wna16_triton kernel configs for CDNA4 (Open)
  - Contributor: amd-asalykov (AMD employee)
  - Content: Adds tuned MoE kernel configs for the CDNA4 architecture (MI350X/MI355X), aimed at speeding up models such as Kimi K2.5 on AMD's newest hardware.
  - Impact: Targeted MoE performance optimization for AMD's latest GPUs.
- #35103: [Bugfix][Hardware][AMD] Gate FP4 BMM on gfx950 to fix MI300X crash (Open)
  - Content: Restricts the FP4 BMM feature to CDNA4 (gfx950), preventing it from being wrongly enabled on MI300X/MI325X (gfx942) and crashing; ensures a graceful fallback to FP8.
  - Impact: Fixes MXFP4 quantization compatibility on older AMD GPUs.
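The gating pattern itself is simple: key the feature off the GPU's gfx architecture string. A hypothetical sketch (function names are illustrative, not vLLM's actual API):

```python
# Hypothetical sketch of arch-gated feature selection; names are illustrative
# and not vLLM's actual API. FP4 BMM is only safe on CDNA4 (gfx950), so
# gfx942 parts (MI300X/MI325X) must fall back to FP8.

def supports_fp4_bmm(gcn_arch: str) -> bool:
    # Strip feature suffixes such as "gfx942:sramecc+:xnack-" before comparing.
    base = gcn_arch.split(":")[0]
    return base == "gfx950"

def select_bmm_dtype(gcn_arch: str) -> str:
    return "fp4" if supports_fp4_bmm(gcn_arch) else "fp8"

print(select_bmm_dtype("gfx950"))           # fp4
print(select_bmm_dtype("gfx942:sramecc+"))  # fp8 (graceful fallback)
```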
Merged PRs
- #35043: [ROCm][CI] Fix spec decode profile assertion and logprob test determinism
  - Fixes nondeterminism in speculative-decoding profiling and logprob tests on ROCm.
Summary: This period's AMD activity shows the team systematically maturing vLLM across the full AMD hardware stack along several axes: hardware compatibility fixes (MI300X FP4), deep performance tuning (CDNA4 MoE kernels), framework alignment (enabling GPTQMarlinConfig), and CI stability.
💬 High-Heat Discussion Analysis
- Issue #35089: [RFC]: In-Tree AMD Zen CPU Backend via zentorch
  - Core question: Whether and how to integrate an AMD Zen CPU-optimized backend into the vLLM main repository.
  - Positions:
    - Proposer (amd-lalithnc): Advocates in-tree integration, keeping only the core optimized kernels in the external zentorch library while vLLM handles platform dispatch and optimization-pass orchestration, for a first-class out-of-the-box experience.
    - Core developer (ProExpertProg): Welcomes the contribution but raises key architectural questions:
      - Asks whether the vLLM IR lacks needed capabilities, implying that kernel selection should build on the rapidly evolving vLLM IR rather than rely on compile-time Inductor passes that replace operators.
      - Stresses that kernel selection should stay independent of compilation, so that JIT and AOT paths behave consistently; compile passes should only perform optimizing transforms.
      - Solicits feedback on pain points in the custom pass registration system.
  - Point of contention: Where to draw the integration boundary, i.e. how much logic should live in-tree, and how to cooperate with vLLM's evolving IR and compilation infrastructure.
  - Status: Discussion ongoing; an important conversation about vLLM's future platform abstraction and extension architecture.
- Issue #35150: [Feature]: Support NVFP4 Checkpoint of Qwen3.5
  - Core question: Community request to support NVIDIA's NVFP4-quantized Qwen3.5 397B model.
  - How it unfolded: After ywang96 filed the request, core contributor Isotr0py quickly offered to take a look; ywang96 noted that vadiklyutiy was also working on it. PR #35156 was then opened to fix a specific model-loading issue.
  - Observation: A fast loop from community need to developer response, with spontaneous coordination of work within the community.
- Issue #35163: AMD docker image still using torch 2.9…
  - Core question: The AMD CI environment's PyTorch version is out of sync with the main development branch, blocking other work.
  - Positions: Reporter mikaylagawarecki states plainly that this bug blocks their kernel-migration PR; the mentioned maintainers (gshtras and others) need to assess the impact of upgrading.
  - Status: Just filed, no solution discussed yet, but an urgent dependency issue.
- PR #35082: [Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp
  - Core question: Fixes a crash when models such as Qwen3.5 use decode context parallelism (DCP) with FlashAttention 3.
  - How it unfolded: Author haosdent supplied the fix; tester ehfd verified it works, while also surfacing deeper issues such as DCP's incompatibility with Mamba attention and limitations with prefix caching in hybrid models.
  - Positions:
    - Fixer: provides a targeted patch.
    - Tester/users: validate the patch and probe the boundaries of feature combinations, sparking a practical discussion of "prefix caching + PP" versus "DCP + TP" for long-context agent workloads.
  - Conclusion: The PR resolves the immediate crash but opens a broader discussion about the compatibility of complex features (DCP, hybrid models, prefix caching).
🔥 Hot Topics and Trends
- Model support and performance optimization:
  - Qwen3.5 series: the focal point, spanning NVFP4 quantization support (#35150, #35156), ultra-long-context YaRN scaling (#35056, #35080), FlashInfer backend accuracy issues (#35138), and a deployment-recipe update request (#35154).
  - New model integration: Ring 2.5 (#35102) and Llama 4 Vision LoRA (#35147) are in progress.
  - Decode optimization: dedicated proposals for specific attention mechanisms such as GDN (#35149) and Mamba (#35157).
- CI/CD and platform stability:
  - AMD CI: a wave of issues (#35126, #35128, #35130, #35132, #35133) shows that test stability on newer hardware such as MI355 still needs work.
  - General CI failures: e.g. fusion tests (#35134) and an audio-model reference-implementation problem (#35140), illustrating the challenge of maintaining a large test suite.
- Security and documentation:
  - Discussions on security logging (#34947) and data compliance (#35005) appeared (both since closed), alongside new PRs for security-risk documentation (#35139) and Host header validation (#35160), signaling growing attention to enterprise-deployment security.
- Audio and multimodal processing:
  - Fixes for transcription format support (#35109), timestamps (#35159), and processor-cache compatibility (#35111) reflect the ongoing polish of multimodal features.
🛠️ Key Technical Changes
- PR #35089 (RFC): In-Tree AMD Zen CPU Backend via zentorch
  - Reading: A complete design proposal for a high-performance native AMD CPU backend. More than a single optimization, it is a full-stack integration covering platform detection, runtime dispatch, and compile-time graph transforms.
  - Impact: If implemented, it would extend vLLM deployment to pure AMD CPU servers and mixed-architecture environments, strengthening its value as a heterogeneous compute platform.
- PR #35162: [Model Runner V2] Enable piecewise CUDA graphs for pipeline parallelism
  - Reading: Adds piecewise CUDA graph support under pipeline parallelism (PP) to the V2 model runner, removing the performance bottleneck where PP previously had to fall back to eager mode.
  - Impact: Further unlocks Model Runner V2 performance, letting complex parallel strategies such as TP+PP benefit from CUDA graph optimization; a key step toward V2 maturity.
- PR #34874: [Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models)
  - Reading: Fixes a metadata-caching bug in the Mamba attention layers of hybrid models (e.g. Nemotron) under "all"-mode prefix caching. The bug caused CUDA graph replay to read the wrong block indices and produce NaNs.
  - Impact: Resolves a serious correctness issue, ensuring Mamba-style models are reliable when advanced caching is enabled.
- PR #34924: [Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default
  - Reading: Enables the FlashInfer DeepGEMM swapAB optimization path by default on SM90 (Hopper) and newer architectures, yielding a clear low-batch performance gain.
  - Impact: "Free" performance for H100, H200, and B200 users, continuing the practice of optimizing defaults for mainstream high-end hardware.
- PR #35135: [Bugfix] Fix lora_ids in FusedMoE LoRA test
  - Reading: Fixes a wrong expert_ids padding value in the FusedMoE LoRA test (it should be -1, not 0) and tightens the assertions.
  - Impact: Nominally a test fix, but it corrects a latent misunderstanding of the underlying kernel interface, ensuring MoE LoRA functionality is implemented correctly; this matters for fine-tuning mixture-of-experts models.
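Why the padding value matters: 0 is itself a valid expert id, so zero-padded slots would silently be routed to expert 0 instead of being skipped. A minimal sketch in plain Python (illustrative only, not vLLM's actual kernel interface):

```python
# Illustrative sketch, not vLLM's actual kernel interface: token-to-expert
# assignments get padded up to a block boundary. Padding must use -1, because
# 0 is a legal expert id and zero-padding would route padded slots to expert 0.
BLOCK_SIZE = 8
expert_ids = [3, 1, 1, 2, 0]  # real assignments; note 0 is a valid expert
padded = expert_ids + [-1] * (BLOCK_SIZE - len(expert_ids))

valid = [e for e in padded if e != -1]  # -1 slots are unambiguously skippable
print(padded)      # [3, 1, 1, 2, 0, -1, -1, -1]
print(len(valid))  # 5
```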
📈 Development Activity Observations
- AMD team activity: Very high. AndreasKaratzas acted as the "firefighter", filing and triaging a stream of AMD CI issues, while other AMD employees (rasmith, mgehre-amd, amd-asalykov, pschlan-amd) contributed actively in their own areas (KV connector, quantization, kernel tuning, profiling).
- Community contributors: haosdent stood out with a run of key bugfixes (#35080, #35082, #34874); russellb focused on security hardening and documentation.
- Core team: DarkLight1337, hmellor, mgoin and other core members reviewed and merged efficiently and joined architecture discussions (e.g. ProExpertProg's deep questioning on the Zen CPU RFC).
- Collaboration patterns: Healthy community dynamics were visible, from needs being raised and claimed in issues (#35150) to tight developer/tester loops validating fixes in PRs (#35082). Developers from the AMD and NVIDIA ecosystems are collaborating in the same codebase to advance the project.
💡 Issues Worth Watching
- AMD Zen CPU backend decision (#35089): The direction of this RFC will define vLLM's technical path for heterogeneous CPU support; worth following for anyone interested in vLLM platform evolution.
- AMD CI PyTorch upgrade (#35163): Left unresolved, this will keep the AMD platform out of sync with new main-branch features; a critical dependency update.
- GDN decode-path optimization (#35149): A newly proposed performance item; its design and implementation could bring a significant decode speedup for certain model architectures.
- Linear attention (Mamba) state management (#35157): This PR fixes cleanup of Mamba state on prefix-cache reset, an important patch for keeping linear-attention models stable under complex scheduling.
- Multi-node test script issue (#35129): Points to possible syntax or environment problems in CI scripts. PR #35131 attempts a fix, but the robustness of multi-node testing needs continued observation.
📋 Appendix: Detailed Data Lists
New Issues
- #35165 [Feature]: Add TCP support to MORI KV connector — feature request — by rasmith (created: 2026-02-24 10:13 (UTC+8))
- #35163 [Bug]: AMD docker image still using torch 2.9 despite 2.10.0 in requirements/rocm-build.txt — bug,rocm — by mikaylagawarecki (created: 2026-02-24 09:47 (UTC+8))
- #35150 [Feature]: Support NVFP4 Checkpoint of Qwen3.5 — feature request — by ywang96 (created: 2026-02-24 07:32 (UTC+8))
- #35154 [Performance]: Optimized Deployment Recipe for Qwen3.5 — performance — by ywang96 (created: 2026-02-24 07:58 (UTC+8))
- #35128 [CI Failure]: mi355_1: Language Models Tests (Standard) — ci-failure — by AndreasKaratzas (created: 2026-02-24 02:16 (UTC+8))
- #35149 [Performance]: Optimize GDN Decode — performance — by ywang96 (created: 2026-02-24 07:24 (UTC+8))
- #35089 [RFC]: In-Tree AMD Zen CPU Backend via zentorch — rocm,RFC,cpu — by amd-lalithnc (created: 2026-02-23 17:36 (UTC+8))
- #35133 [CI Failure]: mi355_4: LoRA TP Test (Distributed) — ci-failure — by AndreasKaratzas (created: 2026-02-24 02:34 (UTC+8))
- #35141 [Feature]: Sequence Parallel Support for Model Runner V2 — feature request — by yewentao256 (created: 2026-02-24 04:23 (UTC+8))
- #35140 [CI] Ultravox audio model HuggingFace reference produces invalid output with NaN logprobs — ci-failure — by LucasWilkinson (created: 2026-02-24 03:55 (UTC+8))
- #35138 [Bug]: Qwen/Qwen3.5-397B-A17B-FP8 and Qwen/Qwen3.5-397B-A17B has accuracy issues when running with Flashinfer Attention backend on Blackwell. — bug,nvidia — by xinli-sw (created: 2026-02-24 03:06 (UTC+8))
- #35126 [CI Failure]: mi355_1: Kernels MoE Test %N — rocm,ci-failure — by AndreasKaratzas (created: 2026-02-24 02:11 (UTC+8))
- #35129 [CI Failure]: mi355_4: 2 Node Tests (4 GPUs in total) — ci-failure — by AndreasKaratzas (created: 2026-02-24 02:18 (UTC+8))
- #35134 [CI Failure]: Fusion E2E Quick (H100) test_tp1_fp8_fusions — ci-failure — by mgoin (created: 2026-02-24 02:34 (UTC+8))
- #35130 [CI Failure]: mi355_4: 2 Node Tests (4 GPUs in total) — rocm,ci-failure — by AndreasKaratzas (created: 2026-02-24 02:19 (UTC+8))
- #35132 [CI Failure][ROCm]: CrossLayer KV layout Distributed NixlConnector PD accuracy tests (4 GPUs) — rocm,ci-failure — by AndreasKaratzas (created: 2026-02-24 02:28 (UTC+8))
- #35118 [Feature]: Support ubuntu 24.04 runtime container — feature request — by aasgaonkar (created: 2026-02-24 01:21 (UTC+8))
- #35114 [Usage]: Does vllm cpu supports DP — usage — by akasshdeep (created: 2026-02-24 00:41 (UTC+8))
- #35084 [Bug]: VLLM tries to load "inductor" instead of custom compiler — bug,torch.compile — by mergian (created: 2026-02-23 15:51 (UTC+8))
- #35104 [Bug]: V1 engine workers die after idle period (SystemError: PyCFunction / EngineDeadError) — TP=2, multiprocessing — no labels — by jsboige (created: 2026-02-23 22:46 (UTC+8))
- #35091 [Bug]: Triton CompilationError in speculative decoding (draft_model) — bug — by rse173 (created: 2026-02-23 18:35 (UTC+8))
- #35087 [Bug]: DeepSeek 3.2 P/D Disaggregation Support — bug — by yanminjia (created: 2026-02-23 16:31 (UTC+8))
Closed Issues
- #22276 [Bug]: Unknown quantization method: mxfp4 — bug,stale — by mkrzywda (closed: 2026-02-24 10:17 (UTC+8))
- #23793 [Bug]: vllm.LLM does not release GPU memory after deletion when loaded with a HF model — bug,stale — by HuFY-dev (closed: 2026-02-24 10:17 (UTC+8))
- #27504 [Usage]: add_vision_id ignored for Qwen 2.5-VL-32B-Instruct — usage,stale — by justachetan (closed: 2026-02-24 10:16 (UTC+8))
- #27505 [Bug]: Value error, Found conflicts between 'rope_type=default' (modern field) and 'type=mrope' — bug,stale — by asirgogogo (closed: 2026-02-24 10:16 (UTC+8))
- #28886 [CI] Notification mechanism for failing nightly jobs — feature request,stale — by khluu (closed: 2026-02-24 05:28 (UTC+8))
- #34290 [Model Bash][DeepSeek]: Remove Clone From Shared Expert Stream — model-bash — by robertgshaw2-redhat (closed: 2026-02-24 03:37 (UTC+8))
- #34300 [Model Bash][DeepSeek]: Remove Logits Casts in DSR1 NVFP4 TRTLLM — model-bash — by robertgshaw2-redhat (closed: 2026-02-24 03:36 (UTC+8))
- #34708 [CI Failure]: Fp8 MoE Kernels (DeepEP + DeepGEMM) — ci-failure — by robertgshaw2-redhat (closed: 2026-02-24 03:36 (UTC+8))
- #34947 Feature: Audit-grade request logging for EU AI Act compliance (Article 12) — no labels — by desiorac (closed: 2026-02-24 03:07 (UTC+8))
- #35005 Data Governance & Compliance Guide for vLLM Deployments (EU AI Act Article 6) — no labels — by desiorac (closed: 2026-02-24 03:05 (UTC+8))
- #35134 [CI Failure]: Fusion E2E Quick (H100) test_tp1_fp8_fusions — ci-failure — by mgoin (closed: 2026-02-24 02:37 (UTC+8))
- #35130 [CI Failure]: mi355_4: 2 Node Tests (4 GPUs in total) — rocm,ci-failure — by AndreasKaratzas (closed: 2026-02-24 02:34 (UTC+8))
- #29516 [CI Failure]: mi325_4: Distributed Tests (A100) — ci-failure — by AndreasKaratzas (closed: 2026-02-24 02:07 (UTC+8))
- #29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) — ci-failure — by AndreasKaratzas (closed: 2026-02-24 02:07 (UTC+8))
- #34621 [Usage]: new vLLM version — usage — by eghouti (closed: 2026-02-24 01:32 (UTC+8))
- #34326 [Bug]: --served-model-name causes model detection issues — bug — by ssendev (closed: 2026-02-24 00:38 (UTC+8))
- #34865 [Bug]: Prefix caching failing for Nemotron models — bug — by benchislett (closed: 2026-02-24 00:31 (UTC+8))
- #35056 [Bug]: Qwen3.5 AttributeError: 'MRotaryEmbedding' object has no attribute 'truncate' with RoPE Scaling — bug — by ehfd (closed: 2026-02-24 00:05 (UTC+8))
- #33784 [Bug]: [Docker.cpu build] incomplete type 'qk_vec_type' {aka 'void'} used in nested name specifier — bug,cpu — by BartekKruczek (closed: 2026-02-23 21:08 (UTC+8))
- #34812 [Bug]: GraniteMoeHybridModel not applying embedding_multiplier to input embeddings — bug — by gabe-l-hart (closed: 2026-02-23 16:42 (UTC+8))
New PRs
- #35139 [Docs] Document security risks of GPT-OSS Python tool — documentation,frontend,gpt-oss — by russellb (created: 2026-02-24 03:11 (UTC+8))
- #35160 [Security] Add Host header validation and improve security guide — documentation,frontend — by russellb (created: 2026-02-24 08:45 (UTC+8))
- #35099 gpu_model_runner: Cache is_encoder_decoder from model config — ready,v1 — by pschlan-amd (created: 2026-02-23 21:24 (UTC+8))
- #35147 [Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) — ready,llama — by dorhuri123 (created: 2026-02-24 07:05 (UTC+8))
- #35107 Fix custom processors that use deleted behaviour for Transformers v5 — ready — by hmellor (created: 2026-02-23 23:27 (UTC+8))
- #35151 [Bugfix][Frontend] Disable /v1/chat/completions for base models without chat templates — bug,frontend — by alxfv (created: 2026-02-24 07:37 (UTC+8))
- #35135 [Bugfix] Fix lora_ids in FusedMoE LoRA test — bug,ready — by xyang16 (created: 2026-02-24 02:35 (UTC+8))
- #35164 [CI][AMD][BugFix][P/D] Skip test_moriio_connector.py tests if IB verbs is not available — bug,rocm,v1,kv-connector — by rasmith (created: 2026-02-24 10:10 (UTC+8))
- #35166 gdn attn metadata builder — v1,meta-exported,fb-exported — by jennyyyyzhen (created: 2026-02-24 10:42 (UTC+8))
- #35085 [Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support — bug — by haosdent (created: 2026-02-23 16:03 (UTC+8))
- #35082 [Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp — bug,v1 — by haosdent (created: 2026-02-23 15:39 (UTC+8))
- #35161 [Bugfix] Fix expert_ids padding values in moe_align_block_size kernel — bug — by xyang16 (created: 2026-02-24 09:32 (UTC+8))
- #35102 [Model] Ring 2.5 — new-model — by ZJY0516 (created: 2026-02-23 22:12 (UTC+8))
- #35162 [Model Runner V2] Enable piecewise CUDA graphs for pipeline parallelism — v1,nvidia — by ZhanqiuHu (created: 2026-02-24 09:33 (UTC+8))
- #35156 [BUGFIX][Qwen3.5] Hardcode mlp.gate as not quantizable — bug,ready,qwen — by vadiklyutiy (created: 2026-02-24 08:13 (UTC+8))
- #35076 [Bugfix] Propagate default stop_token_ids to per-request SamplingParams — bug,frontend,gpt-oss — by sriganesh123 (created: 2026-02-23 12:25 (UTC+8))
- #35123 [Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches — bug,ready,ci/build — by mgoin (created: 2026-02-24 02:01 (UTC+8))
- #35158 [Spec Decode][KV Connector] Fix KV transfer in PD + speculative decoding — ci/build,v1,kv-connector — by ZhanqiuHu (created: 2026-02-24 08:41 (UTC+8))
- #35159 fix: compensate chunk split offset in Whisper segment timestamps (closes #32588) — frontend — by stakeswky (created: 2026-02-24 08:42 (UTC+8))
- #35157 [Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache — bug,v1 — by heheda12345 (created: 2026-02-24 08:36 (UTC+8))
- #35137 [core]Fix CPU Pinned-Memory Race Between _dummy_run() and execute_model() — v1 — by charlotte12l (created: 2026-02-24 02:58 (UTC+8))
- #35152 [ROCm][CI] Disable skinny GEMMs in language model standard tests to fix non-determinism — rocm — by AndreasKaratzas (created: 2026-02-24 07:38 (UTC+8))
- #35155 updated for rh ubi10.1 docker — ci/build — by sureshnam (created: 2026-02-24 08:02 (UTC+8))
- #35153 [MoE Refactor] Make SharedExperts class for use with DefaultMoERunner — needs-rebase,nvidia — by bnellnm (created: 2026-02-24 07:49 (UTC+8))
- #35148 [Responses] Decouple SSE event helpers from Harmony context — frontend,gpt-oss — by sfeng33 (created: 2026-02-24 07:24 (UTC+8))
- #35146 add attention sink support to aiter fa — rocm,needs-rebase,v1 — by jennyyyyzhen (created: 2026-02-24 06:55 (UTC+8))
- #35145 [ROCm][CI] Reverting changes in MI355 pipeline so that some default TGs are blocking on external AMD-CI signal — rocm,ci/build — by AndreasKaratzas (created: 2026-02-24 06:48 (UTC+8))
- #35143 [Spec Decode] Remove unnecessary all reduce for dense draft model on dp setting — speculative-decoding,v1 — by ZhengkaiZ (created: 2026-02-24 04:47 (UTC+8))
- #35144 [ROCm] Enable GPTQMarlinConfig on ROCm to use choose_mp_linear_kernel — rocm — by mgehre-amd (created: 2026-02-24 05:57 (UTC+8))
- #35122 Reapply [Attention] Refactor check_and_update_config — speculative-decoding,v1,multi-modality,nvidia,ready-run-all-tests — by MatthewBonanni (created: 2026-02-24 01:55 (UTC+8))
- #35136 [Release] Include source distribution (sdist) in PyPI uploads — ci/build — by dougbtv (created: 2026-02-24 02:54 (UTC+8))
- #35112 [Benchmarks] Add --connection-pooling flag to vllm bench serve — performance — by InfraWhisperer (created: 2026-02-24 00:26 (UTC+8))
- #35142 XPU updates + attach sanitized CI logs — no labels — by Joshna-Medisetty (created: 2026-02-24 04:45 (UTC+8))
- #35117 [compile] Save aot compile artifacts atomically. — ready — by zhxchen17 (created: 2026-02-24 00:55 (UTC+8))
- #35088 Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe — performance,ready,nvidia — by danisereb (created: 2026-02-23 16:49 (UTC+8))
- #35120 [WIP][Model Runner V2] Support pooling models — v1,nvidia — by WoosukKwon (created: 2026-02-24 01:30 (UTC+8))
- #35116 Feat/dsv3 router cublas fallback — needs-rebase,ci/build,deepseek — by robertgshaw2-redhat (created: 2026-02-24 00:46 (UTC+8))
- #35093 [ROCm] add tuned moe_wna16_triton kernel configs for CDNA4 — rocm — by amd-asalykov (created: 2026-02-23 18:45 (UTC+8))
- #35124 [Bugfix] FlashInfer incompatible with sleep mode — bug,v1,nvidia — by guan404ming (created: 2026-02-24 02:05 (UTC+8))
- #35131 Fixing matcher to enable the 2-node-tests-4-gpus-in-total on MI355 — ci/build — by Alexei-V-Ivanov-AMD (created: 2026-02-24 02:19 (UTC+8))
- #35127 [Perf] Optimize pooling model redundant copy, 1.8% throughput improvement — ready,v1 — by yewentao256 (created: 2026-02-24 02:13 (UTC+8))
- #35125 [BugFix][kv_offload]: Fix kernel block size detection — bug,v1 — by orozery (created: 2026-02-24 02:10 (UTC+8))
- #35092 [Bugfix][PD] Fix OffloadingConnector with PD deployments — bug,kv-connector — by NickLucche (created: 2026-02-23 18:39 (UTC+8))
- #35121 [Performance] Cublas Bf16 Gate with Fp32 Output — ci/build,deepseek — by roikoren755 (created: 2026-02-24 01:45 (UTC+8))
- #35119 [compile] Invalidate cache for cpu flags — no labels — by angelayi (created: 2026-02-24 01:26 (UTC+8))
- #35113 [Misc] Monitor interface changes — ready,ci/build — by NickLucche (created: 2026-02-24 00:29 (UTC+8))
- #35110 [DO NOT MERGE] Check if set_default_torch_num_threads is still necessary for init — ready,multi-modality — by DarkLight1337 (created: 2026-02-24 00:07 (UTC+8))
- #35115 [compile] Improve error message during artifacts load failure. — no labels — by zhxchen17 (created: 2026-02-24 00:45 (UTC+8))
- #35101 Fix custom processors that use deleted import for Transformers v5 — ready — by hmellor (created: 2026-02-23 22:10 (UTC+8))
- #35075 [Bug][DSV3.2] Always prepare metadata for DeepGEMM Sparse Attention — bug,v1 — by benchislett (created: 2026-02-23 12:10 (UTC+8))
- #35108 [glm-asr] change defaults dummy audio size — ready,multi-modality — by eustlb (created: 2026-02-23 23:34 (UTC+8))
- #35111 [Bugfix] Fix failing FunASR processor test — bug,ready,multi-modality — by Isotr0py (created: 2026-02-24 00:23 (UTC+8))
- #35098 Use Xet high performance mode for Transformers v5 — ready — by hmellor (created: 2026-02-23 20:59 (UTC+8))
- #35080 [Bugfix] Fix MRotaryEmbedding missing truncate attr with YaRN scaling — bug,ready — by haosdent (created: 2026-02-23 14:53 (UTC+8))
- #35109 [Bugfix][Frontend] Fix audio transcription for MP4, M4A, and WebM formats — bug,frontend — by seanmamasde (created: 2026-02-23 23:43 (UTC+8))
- #35100 Support parakeet as audio encoder for nemotron-nano-vl — no labels — by netanel-haber (created: 2026-02-23 21:46 (UTC+8))
- #35106 Fix custom compiler backend resolution in make_compiler (#35084) — no labels — by baonudesifeizhai (created: 2026-02-23 23:25 (UTC+8))
- #35105 [Refactor] Add global helper to deduplicate vectorized memory ops — needs-rebase,nvidia — by LopezCastroRoberto (created: 2026-02-23 22:50 (UTC+8))
- #35103 [Bugfix][Hardware][AMD] Gate FP4 BMM on gfx950 to fix MI300X crash — bug,rocm — by c0de128 (created: 2026-02-23 22:29 (UTC+8))
- #35083 [Refactor] Decouple TimingContext from InputProcessingContext — performance,ready,v1,multi-modality,llama,qwen,deepseek — by DarkLight1337 (created: 2026-02-23 15:45 (UTC+8))
- #35095 [MLA] Add fused Triton concat+quantize kernel for fp8 decode queries — no labels — by elvircrn (created: 2026-02-23 19:23 (UTC+8))
- #35096 feat: added num requests and num tokens to profiles — v1 — by hypdeb (created: 2026-02-23 19:33 (UTC+8))
- #35094 Fix pipeline parallel with embed scaling in the Transformers modelling backend — ready — by hmellor (created: 2026-02-23 18:59 (UTC+8))
- #35097 [BugFix] Fix 3D rope in transformers backend — bug — by zucchini-nlp (created: 2026-02-23 20:01 (UTC+8))
- #35090 docs(cpu): Clarify pre-built wheels requirement for CPU Python-only build — documentation,cpu — by sagearc (created: 2026-02-23 18:04 (UTC+8))
- #35081 [Hardware][Powerpc]Enable prefix caching and chunked prefill for ppc64le — no labels — by Akashcodes732 (created: 2026-02-23 15:37 (UTC+8))
- #35086 more models for vLLM Benchmark Suite — documentation,performance,ci/build,cpu — by louie-tsai (created: 2026-02-23 16:12 (UTC+8))
- #35077 [Bugfix] LoRA for fused_qkv_a_proj in MLA (DeepSeek V3.2) — bug,deepseek — by HollowMan6 (created: 2026-02-23 12:28 (UTC+8))
- #35078 Bump actions/stale from 10.1.1 to 10.2.0 — ci/build,github_actions,dependencies — by dependabot (created: 2026-02-23 12:28 (UTC+8))
- #35074 [Misc] Enable weights loading tracking for quantized models — ready — by Isotr0py (created: 2026-02-23 11:47 (UTC+8))
Merged PRs
- #35099 gpu_model_runner: Cache is_encoder_decoder from model config — ready,v1 — by pschlan-amd (merged: 2026-02-24 11:08 (UTC+8))
- #34906 [Quantization] Support FP8 MoE bias for models like GPT-OSS — ready,gpt-oss — by jasperjiaguo (merged: 2026-02-24 11:07 (UTC+8))
- #33443 [ROCm] AITER fused RoPE+KVCache — rocm,ready,torch.compile,v1,gpt-oss — by Rohan138 (merged: 2026-02-24 11:06 (UTC+8))
- #35054 [Mamba1] - Change supports_update_block_table to True — ready,v1 — by Josephasafg (merged: 2026-02-24 11:05 (UTC+8))
- #35135 [Bugfix] Fix lora_ids in FusedMoE LoRA test — bug,ready — by xyang16 (merged: 2026-02-24 10:49 (UTC+8))
- #34924 [Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default — performance,ready,ready-run-all-tests — by mgoin (merged: 2026-02-24 09:34 (UTC+8))
- #35123 [Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches — bug,ready,ci/build — by mgoin (merged: 2026-02-24 09:11 (UTC+8))
- #34846 [Perf] Improve default triton fused moe configs — performance,ready — by mgoin (merged: 2026-02-24 08:01 (UTC+8))
- #34992 [RL] Validation for pause_mode='keep' — documentation,ready,ci/build — by hao-aaron (merged: 2026-02-24 05:30 (UTC+8))
- #34302 [ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) — performance,ready,ci/build,deepseek — by robertgshaw2-redhat (merged: 2026-02-23 22:02 (UTC+8))
- #34973 Enforce that model is the first positional arg when --served-model-name is used — ready — by hmellor (merged: 2026-02-24 00:38 (UTC+8))
- #35043 [ROCm][CI] Fix spec decode profile assertion and logprob test determinism — rocm,ready,v1 — by AndreasKaratzas (merged: 2026-02-23 21:05 (UTC+8))
- #35113 [Misc] Monitor interface changes — ready,ci/build — by NickLucche (merged: 2026-02-24 01:14 (UTC+8))
- #35101 Fix custom processors that use deleted import for Transformers v5 — ready — by hmellor (merged: 2026-02-24 00:38 (UTC+8))
- #34990 [CI] Skip Responses API — ready,ci-failure — by robertgshaw2-redhat (merged: 2026-02-23 23:46 (UTC+8))
- #34874 [Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models) — bug,ready,v1 — by haosdent (merged: 2026-02-24 00:31 (UTC+8))
- #35098 Use Xet high performance mode for Transformers v5 — ready — by hmellor (merged: 2026-02-24 00:19 (UTC+8))
- #35080 [Bugfix] Fix MRotaryEmbedding missing truncate attr with YaRN scaling — bug,ready — by haosdent (merged: 2026-02-24 00:05 (UTC+8))
- #30950 [Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) — documentation,ready,v1 — by markmc (merged: 2026-02-23 23:01 (UTC+8))
- #34254 [kv-cache, ct] Use compressed-tensors as a source of ground-truth for quant strategies — ready — by eldarkurtic (merged: 2026-02-23 22:37 (UTC+8))
- #35083 [Refactor] Decouple TimingContext from InputProcessingContext — performance,ready,v1,multi-modality,llama,qwen,deepseek — by DarkLight1337 (merged: 2026-02-23 22:15 (UTC+8))
- #35033 [Llama4,CI] Bring back Llama-4 bug fixes, and also fix Maverick tests — bug,ready,multi-modality,llama — by eldarkurtic (merged: 2026-02-23 22:04 (UTC+8))
- #35060 [CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig — ready — by VincentG1234 (merged: 2026-02-23 21:05 (UTC+8))
- #35094 Fix pipeline parallel with embed scaling in the Transformers modelling backend — ready — by hmellor (merged: 2026-02-23 21:04 (UTC+8))
- #35010 [XPU] allow TORCH_SDPA/TRITON_ATTN as XPU vit Backend — ready — by yma11 (merged: 2026-02-23 21:06 (UTC+8))
- #35068 [Refactor] Remove dead private func _fp8_perm and _extract_mask_for_item — ready — by yewentao256 (merged: 2026-02-23 21:05 (UTC+8))
- #34651 [Feature] Lazy import for the "mistral" tokenizer module. — structured-output,frontend,ready,v1,multi-modality — by nascheme (merged: 2026-02-23 16:43 (UTC+8))
- #34739 [BugFix]: Fix local mypy issues — bug,frontend,ready,kv-connector — by hickeyma (merged: 2026-02-23 16:40 (UTC+8))
- #34813 fix: Apply embedding_multiplier to inputs_embeds — ready — by gabe-l-hart (merged: 2026-02-23 16:42 (UTC+8))
- #33752 [Bugfix] Fix kernel benchmark — bug,performance,ready — by jeejeelee (merged: 2026-02-23 13:18 (UTC+8))
- #35025 [Refactor] Simplify dummy data generation — documentation,speculative-decoding,ready,multi-modality,llama,qwen,deepseek — by DarkLight1337 (merged: 2026-02-23 12:55 (UTC+8))
PRs Closed Without Merging
- #29065 [ROCm] Support AITER paged attention with sliding_window — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,needs-rebase,ci/build — by sammysun0711 (closed: 2026-02-24 11:00 (UTC+8))
- #28821 Fix: API server wait timeout on slow storage nodes — stale,v1 — by jaywonchung (closed: 2026-02-24 10:17 (UTC+8))
- #27503 Clarify V0→V1 error; keep SamplingParams importable when VLLM_USE_V1=0 — frontend,ci/build,stale,v1 — by nick-allison (closed: 2026-02-24 10:16 (UTC+8))
- #27507 qwen3moe on gh200 — stale,qwen — by bhaktatejas922 (closed: 2026-02-24 10:16 (UTC+8))
- #35137 [core]Fix CPU Pinned-Memory Race Between _dummy_run() and execute_model() — v1 — by charlotte12l (closed: 2026-02-24 08:14 (UTC+8))
- #35142 XPU updates + attach sanitized CI logs — no labels — by Joshna-Medisetty (closed: 2026-02-24 04:46 (UTC+8))
- #35116 Feat/dsv3 router cublas fallback — needs-rebase,ci/build,deepseek — by robertgshaw2-redhat (closed: 2026-02-24 03:39 (UTC+8))
- #35002 Fixing matcher to enable the 2-node-tests-4-gpus-in-total on MI355 — ci/build — by Alexei-V-Ivanov-AMD (closed: 2026-02-24 02:20 (UTC+8))
- #35110 [DO NOT MERGE] Check if set_default_torch_num_threads is still necessary for init — ready,multi-modality — by DarkLight1337 (closed: 2026-02-24 00:51 (UTC+8))
- #34768 Improve Transformers v4/v5 compatibility in tokenizers and processors — ci/build,multi-modality — by cccat6 (closed: 2026-02-24 00:44 (UTC+8))
- #33327 [Quant] Fix FP8 block quantization for non-aligned dimensions — documentation,needs-rebase — by Etelis (closed: 2026-02-23 23:45 (UTC+8))
- #33328 [Quant] Skip CUTLASS block FP8 on Blackwell GPUs — nvidia — by Etelis (closed: 2026-02-23 23:44 (UTC+8))
- #34970 [Bugfix] Fix block_size mismatch for MLA models after #34818 — bug,ready,needs-rebase,v1 — by mgoin (closed: 2026-02-23 22:26 (UTC+8))
- #35067 [Deprecation] Deprecate resolve_hf_chat_template as scheduled — frontend,ready — by yewentao256 (closed: 2026-02-23 22:19 (UTC+8))
- #33180 [Bugfix][Hardware][AMD] Fix MXFP4 MoE backend selection for RDNA3 (gfx1100) — bug,rocm — by c0de128 (closed: 2026-02-23 22:08 (UTC+8))
- #34481 [Bugfix][Hardware][AMD] Add ahead-of-time weight dequantization for quantization emulation — bug,rocm — by c0de128 (closed: 2026-02-23 22:08 (UTC+8))
- #32878 [Bugfix][Hardware][AMD] Add LoRA guard to unquantized MoE backend selection — bug,rocm — by c0de128 (closed: 2026-02-23 22:08 (UTC+8))
- #31118 [Bugfix][Hardware][AMD] Fix uninitialized prefix_scheduler_metadata — bug,rocm,ready,v1 — by c0de128 (closed: 2026-02-23 22:08 (UTC+8))
- #34895 Gate cuBLAS kernel fallback — ci/build,deepseek — by roikoren755 (closed: 2026-02-23 22:02 (UTC+8))
- #27216 [Kernel] Re-enable mrope triton kernel for CUDA/ROCM platform by default — rocm,needs-rebase,stale,nvidia — by Isotr0py (closed: 2026-02-23 19:44 (UTC+8))
- #27515 Fix issue #27486 double bos token — stale — by baonudesifeizhai (closed: 2026-02-23 17:26 (UTC+8))
- #34602 [Bugfix] Fix sleep mode wake_up memory leaks — bug — by stmoonar (closed: 2026-02-23 14:59 (UTC+8))
- #28330 Fix: tool call streaming when both reasoning and tool parsers are enabled #28297 — frontend,needs-rebase,unstale,nvidia — by baonudesifeizhai (closed: 2026-02-23 11:34 (UTC+8))