vLLM Development Activity Report - 2026-03-23

Time window: 2026-03-23 11:29 (UTC+8) ~ 2026-03-24 11:29 (UTC+8)
Statistics: New Issues 34 | Closed Issues 12 | New PRs 81 | Merged PRs 34 | PRs Closed Without Merging 20

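📊 Daily Development Status Summary

During this period (2026-03-23 to 2026-03-24) the vLLM project remained highly active, with 34 new issues and 34 merged PRs. Development focused on multimodal model support (embedding and inference problems with Qwen3-VL and GLM-4.6V), inference efficiency (sleep-mode memory release, CUDA graph performance), and cross-platform compatibility (especially performance work and bug fixes in the AMD/ROCm ecosystem). Several critical bugs were quickly diagnosed and fixed, and community collaboration was close.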

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was brisk this period, concentrated on performance optimization, bug fixes, and toolchain support.

Issue #37941: [Usage]: Using RIXL Connector on AMD GPU
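- User: ssyrc
- Summary: Working in an isolated environment, the user is trying to use the Nixl Connector on AMD GPUs for PD (prefill/decode) disaggregation. They ask whether a prebuilt image based on Dockerfile.rocm exists, whether the official ROCm image includes the RIXL library, and whether Nixl supports cross-vendor AMD-NVIDIA communication.
- Analysis: This issue reflects a practical obstacle users hit when deploying advanced vLLM features (such as PD disaggregation) in AMD production environments, and highlights gaps in documentation and resources for containerized deployment and advanced communication libraries in the AMD ecosystem.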
PR #37891: [ROCm][Perf] Add fused AllReduce+RMSNorm for DeepSeek on MI355X
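- Author: attila-dusnoki-htec
- Status: Open
- Summary: For the AMD MI355X (gfx950) platform, introduces a fused AllReduce+RMSNorm kernel (implemented via AITER) for DeepSeek V2/V3/R1 models. The tensor-parallel AllReduce is moved from the o_proj and MoE projection layers into the following RMSNorm layer and fused with the residual add, reducing kernel-launch overhead and memory traffic to improve the performance of FP4/FP8 quantized models when TP>1.
- Analysis: This is a deep performance optimization targeting specific AMD hardware (MI355X) and a specific model family (DeepSeek), showing the AMD ecosystem moving toward finer-grained, model-aware optimization.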
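The fusion described above can be illustrated numerically. The following is a minimal pure-Python sketch, not the AITER kernel: the function names, the list-based "shards", and the eps value are all assumptions made for illustration. It only shows that summing the per-rank partial outputs, adding the residual, and normalizing in a single pass gives the same result as three separate passes.

```python
import math

def all_reduce(shards):
    """Element-wise sum across ranks (stand-in for a tensor-parallel AllReduce)."""
    return [sum(vals) for vals in zip(*shards)]

def rms_norm(x, weight, eps=1e-6):
    """Plain RMSNorm over one hidden vector."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def unfused(shards, residual, weight):
    """Three separate passes over the hidden vector."""
    reduced = all_reduce(shards)                         # pass 1: AllReduce
    hidden = [a + b for a, b in zip(reduced, residual)]  # pass 2: residual add
    return rms_norm(hidden, weight), hidden              # pass 3: RMSNorm

def fused(shards, residual, weight, eps=1e-6):
    """One pass: reduce, add the residual, and accumulate the sum of squares."""
    hidden, sq = [], 0.0
    for i in range(len(residual)):
        h = sum(s[i] for s in shards) + residual[i]
        hidden.append(h)
        sq += h * h
    rms = math.sqrt(sq / len(hidden) + eps)
    return [w * h / rms for h, w in zip(hidden, weight)], hidden
```

On real hardware the benefit comes from fewer kernel launches and fewer round trips through memory; this sketch only demonstrates that the two orderings are numerically equivalent.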
PR #37887: [ROCm][perf] fix Aiter sparse MLA with MTP>1
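- Author: gronsti-amd (AMD employee)
- Status: Open
- Summary: Fixes a problem on DeepSeek v3.2 when the ROCM_AITER_MLA_SPARSE attention backend is used with MTP (multi-token prediction) speculative decoding and num_speculative_tokens > 1. The root cause was a check that depended on the order of the attn_metadata list; after the fix, all attention types are checked.
- Analysis: The fix makes the AMD AITER sparse MLA backend fully compatible with a more advanced speculative decoding method (MTP), improving the efficiency and capability of large-model inference on AMD platforms.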
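The kind of order-dependent check described above can be sketched as follows. This is a hypothetical illustration only: the metadata classes and function names are invented here and are not the PR's actual code. It contrasts a check that inspects only the first list entry with one that inspects every entry.

```python
from dataclasses import dataclass

@dataclass
class DenseMetadata:        # hypothetical metadata type
    num_tokens: int

@dataclass
class SparseMLAMetadata:    # hypothetical metadata type
    num_tokens: int

def uses_sparse_mla_first_only(metadata_list):
    """Fragile: the answer depends on which entry happens to come first."""
    return isinstance(metadata_list[0], SparseMLAMetadata)

def uses_sparse_mla_any(metadata_list):
    """Order-independent: inspects every attention type in the list."""
    return any(isinstance(m, SparseMLAMetadata) for m in metadata_list)
```

With a list such as `[DenseMetadata(4), SparseMLAMetadata(4)]`, the first function reports no sparse MLA layer while the second finds it.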
PR #37533: [ROCm] fix sleep mode not releasing GPU memory problem on ROCm
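- Author: aaab8b
- Status: Merged
- Summary: Fixes sleep mode failing to actually release GPU memory on ROCm. The root cause is that HIP's hipMemRelease does not free physical memory while the virtual address reservation is still held. The fix forces an address-free and re-reserve cycle in unmap_and_release so that the physical pages are released.
- Analysis: This is a key platform-specific bug fix. It resolves a significant resource-management defect on ROCm and makes sleep mode genuinely usable on AMD GPUs, which matters for dynamic workload management.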
PR #36505: [ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel
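- Author: mgehre-amd (AMD employee)
- Status: Merged
- Summary: Refactors AWQ (Activation-aware Weight Quantization) support on ROCm so that AWQMarlinConfig selects a kernel (such as ConchLinearKernel) through the unified choose_mp_linear_kernel framework instead of calling the NVIDIA-specific ops.marlin_gemm directly. This significantly improves AWQ model performance on ROCm (benchmarks show prefill +57%, decode +73%).
- Analysis: An important step in folding AMD quantization support into vLLM's unified kernel-dispatch framework, improving both code consistency and performance.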
PR #36100: [ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs
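- Author: ChuanLi1101
- Status: Merged
- Summary: Fixes a serious bug in the ROCm AITER custom op where the fused_moe_fake meta function's signature did not match the real op (four parameters, including hidden_pad, were missing), along with some minor errors in comments and logs.
- Analysis: Fixes crashes under torch.compile/FakeTensor mode caused by the signature mismatch, keeping the compilation path stable on AMD platforms.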
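A signature mismatch of this kind can be caught mechanically. The sketch below is a generic standard-library check, assuming hypothetical op functions; the `moe_op*` names and their parameter lists are illustrative and do not reproduce the actual AITER signatures.

```python
import inspect

def moe_op(hidden_states, w1, w2, topk_ids, hidden_pad=0):        # hypothetical real op
    return hidden_states

def moe_op_fake_bad(hidden_states, w1, w2, topk_ids):             # fake impl missing a parameter
    return hidden_states

def moe_op_fake_ok(hidden_states, w1, w2, topk_ids, hidden_pad=0):
    return hidden_states

def signature_mismatch(real, fake):
    """Return the parameter names present in one signature but not the other."""
    real_params = set(inspect.signature(real).parameters)
    fake_params = set(inspect.signature(fake).parameters)
    return sorted(real_params ^ fake_params)
```

A unit test that asserts `signature_mismatch(real, fake) == []` for each op and its fake counterpart would flag this class of bug before it reaches a compiled path.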
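PR #37906 & #37930: [ROCm][CI] improvements
- Author: AndreasKaratzas
- Status: #37906 Merged; #37930 Open
- Summary: Syncs the test job split to AMD CI (#37906) and adds a uv pip compile workflow that generates the lockfile for ROCm test dependencies (#37930).
- Analysis: Continued improvement of the CI/CD pipeline for AMD platforms, supporting code quality and test efficiency.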

💬 High-Activity Discussion Analysis

Issue #37855: [Bug]: Qwen3-VL-Embedding-8B Image embedding failed
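- Comments: 7
- Core topic: A user reports a 500 Internal Server Error when computing image embeddings with the Qwen3-VL-Embedding-8B model.
- Positions:
  - Reporter (nuclearwu): Provided a detailed error stack trace and environment information (910B2).
  - Responders (JINO-ROHIT, DarkLight1337): Tried to reproduce the problem and asked for more detailed environment information (collect_env.py output) to help with diagnosis.
  - Contributor (CMLKevin): Said several times that they are investigating and preparing a fix, and believes the problem may be related to how the VLM embedding tensors are prepared.
- Current status: Open. The problem is unresolved and still in the information-gathering and initial-analysis stage, which reflects the complexity of multimodal model support.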
Issue #37858: [Bug]: does not have the attribute ‘FakeTensorMode’
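- Comments: 5
- Core topic: A user hits a FakeTensorMode attribute error, suspected to be a PyTorch compatibility problem.
- Positions:
  - Reporter (ACCEKLL): Provided complete environment information.
  - Contributor (CMLKevin): Initial analysis pointed to a PyTorch version mismatch or an API change.
  - Core developer (zou3519): Pointed out that the root cause is that PR #37158 (which fixes PyTorch 2.10 support) was not included in the release, so the problem appears in v0.18.0, and explained what that PR does.
  - The reporter followed up to ask which version will contain the fix.
- Point of contention: None; the core developer quickly identified the root cause.
- Current status: Open. The root cause is clear, pending the relevant PR landing or a version update.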
Issue #37849: [RFC]: Unify the function of getting device count
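- Comments: 5
- Core topic: A proposal to unify the three functions the project uses to get the device count (cuda_device_count_stateless, current_platform.device_count, torch.accelerator.device_count()).
- Positions:
  - Proposer (wincent8): Argues that the current situation confuses users and that cuda_device_count_stateless hinders support for XPU and other accelerators; suggests unifying in two steps.
  - Core developer (simon-mo): Noted that cuda_device_count_stateless exists to avoid initializing a CUDA context, and asked whether the new approach would preserve that property.
  - Core developer (jikunshang): Suggested a more conservative two-phase plan: first wrap the calls with current_platform.device_count(), then try replacing it with the official torch API.
  - simon-mo ultimately supported the cleanup but asked for verification that the original bug does not recur.
- Point of contention: How to unify the API cleanly while preserving backward compatibility and avoiding implicit context initialization. The discussion leans toward a cautious, incremental refactor.
- Current status: Open. There is consensus on refactoring, and the discussion has moved on to validating a concrete plan.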
Issue #37857: [Bug]: RoutedExpertsCapturer.capture() assertion failure with DP>1 when supports_internal_mk=True
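- Comments: 3
- Core topic: With data parallelism (DP>1) and a quantization method that uses the modular kernel path, the MoE expert capturer crashes on an assertion failure.
- Positions:
  - Reporter (junjzhang): Provided a very detailed analysis that pinpoints the root cause, namely a difference in the shape of topk_ids between the two DP dispatch paths, and included concrete suggested fix code.
  - Contributor (Young-Leo): Responded promptly and said they would work on a fix.
  - Reporter (junjzhang): Endorsed the proposed fix and suggested adding test coverage.
- Point of contention: None. This is a high-quality technical bug report that led directly to the fix PR (#37879).
- Current status: Open. The linked PR #37879 has been submitted to fix the problem.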

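🔥 Hot Topics and Trends

Multimodal model support and debugging: Several issues (#37855, #37928, #37934) involve embedding, serving-endpoint, and quantization problems with multimodal models such as Qwen3-VL, Nemotron-colembed-vl, and GLM-4.6V. Multimodal inference is becoming a hotspot for complex problems and raises the bar for testing and documentation.
Inference efficiency and resource management:
- Sleep mode: Issue #37860 (reported on CUDA) and the merged PR #37533 (ROCm) both concern sleep mode failing to release GPU memory, which shows how much this feature matters in dynamic scaling scenarios.
- Compilation and caching: Issue #37919 reports torch.compile cache misses that cause compile time to balloon, reflecting the tension between performance optimization and the complexity of the compilation system.
- Speculative decoding: Several PRs (#37887, #37944) touch speculative decoding, including a fix and a revert, as work toward lower latency continues.
Deepening AMD performance work: As the AMD ecosystem updates show, AMD contributors are moving from basic functional support (such as the sleep-mode fix) to deep performance optimization (fused kernels) for specific hardware (MI355X) and models (DeepSeek), suggesting that AMD integration in vLLM is entering a more mature, fine-grained phase.
Exploring new quantization methods: Issue #37908 proposes integrating the SVDQuant (W4A4) quantization method from the Nunchaku library, showing continued community interest in more aggressive and more efficient quantization schemes.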

🛠️ Key Technical Changes

PR #35963: [Feature] ViT Full CUDA Graph (Merged)
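- What it does: Introduces full CUDA graph support for the Vision Transformer encoder, with budget-based graph capture, a greedy bin-packing algorithm, and data-parallel support. Models are managed transparently through the SupportsEncoderCudaGraph protocol.
- Impact: Expected to significantly reduce kernel-launch overhead in the ViT portion of multimodal models and improve end-to-end inference performance for models such as Qwen3-VL. A major step forward in multimodal inference optimization.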
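The summary mentions a greedy bin-packing algorithm without giving details. As a generic illustration of that family of algorithms (not the PR's implementation), here is a first-fit-decreasing sketch that packs item sizes into bins of a fixed budget. Treating the items as per-image token counts and the budget as a captured graph's token capacity is an assumption made for this example.

```python
def greedy_pack(sizes, budget):
    """First-fit decreasing: put each item in the first bin that still has room."""
    bins, loads = [], []
    for size in sorted(sizes, reverse=True):
        if size > budget:
            raise ValueError(f"item of size {size} exceeds budget {budget}")
        for i, load in enumerate(loads):
            if load + size <= budget:
                bins[i].append(size)
                loads[i] += size
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([size])
            loads.append(size)
    return bins

# Hypothetical per-image token counts and a per-graph budget of 1000 tokens.
packed = greedy_pack([400, 300, 300, 200, 100, 700], budget=1000)
print(packed)  # [[700, 300], [400, 300, 200, 100]]
```

First-fit decreasing is simple and fast but not optimal in general; it is shown here only because it is the most common greedy variant.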
PR #37533: [ROCm] fix sleep mode not releasing GPU memory problem on ROCm (Merged)
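- What it does: Resolves a long-standing resource-management defect on ROCm by forcing an address-free and re-reserve cycle so that sleep mode actually releases physical GPU memory.
- Impact: Makes sleep mode reliable on AMD GPUs, which is important for cloud services and multi-tenant environments that need to adjust resource usage dynamically.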
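PR #37873 & #37884: RoBERTa position_ids accumulation fix (Both Merged)
- What it does: Two PRs fix the same core bug from different angles. With CUDA graphs enabled, RoBERTa-style embedding models modified the position-ID buffer in place, so values accumulated across steps and eventually caused an out-of-bounds index. One PR zeroes the padding region at the model-runner level; the other switches to an out-of-place computation at the model level.
- Impact: Fixes a serious crash affecting several popular embedding models, including BGE-M3, and improves the stability of embedding model serving.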
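The accumulation mechanism can be reproduced with a toy persistent buffer. The sketch below uses plain Python lists in place of tensors; the buffer size, the offset of `PADDING_IDX + 1`, and the function names are assumptions for illustration rather than the actual vLLM code.

```python
PADDING_IDX = 1  # assumed RoBERTa-style padding index; the offset is PADDING_IDX + 1

def step_in_place(buffer, positions):
    """Buggy pattern: the offset is added to the whole persistent buffer,
    so the padding tail (never overwritten) grows on every step."""
    buffer[:len(positions)] = positions
    for i in range(len(buffer)):
        buffer[i] += PADDING_IDX + 1
    return buffer

def step_out_of_place(buffer, positions):
    """Fixed pattern: compute offset positions into a new list and
    leave the persistent buffer untouched."""
    buffer[:len(positions)] = positions
    return [p + PADDING_IDX + 1 for p in buffer]

buf = [0, 0, 0, 0]              # padded to the captured graph size
for _ in range(3):
    step_in_place(buf, [0, 1])  # only two real tokens per step
print(buf)                      # [2, 3, 6, 6]
```

The two real positions are rewritten each step, but the padding slots keep growing by the offset until they exceed the position-embedding table. Zeroing the padding region before each step, or computing out of place as in `step_out_of_place`, avoids the drift.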
PR #37891: [ROCm][Perf] Add fused AllReduce+RMSNorm for DeepSeek on MI355X (Open)
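- What it does: Fuses AllReduce with RMSNorm for the DeepSeek model family on AMD MI355X, a typical case of hardware, model, and framework co-optimization.
- Impact: If merged, it will directly improve tensor-parallel efficiency for DeepSeek models on this AMD hardware and serve as a reference for similar optimizations in other models.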
PR #37948: [Perf] triton bilinear_pos_embed kernel for ViT (Open)
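- What it does: Implements a fused Triton kernel for bilinear position-embedding interpolation in ViT, replacing a large number of small PyTorch operations.
- Impact: Can substantially reduce CPU overhead and the number of kernel launches in the ViT preprocessing stage, optimizing another part (preprocessing) of the multimodal inference pipeline.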
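For readers unfamiliar with the operation being fused, the following is a plain-Python reference for bilinear interpolation of a 2D grid. It is a sketch under stated assumptions: it handles one scalar channel, uses the align-corners sampling convention, and is unrelated to the Triton kernel's actual code.

```python
def bilinear_resize(grid, out_h, out_w):
    """Resize a 2D grid of scalars with bilinear interpolation (align-corners convention)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend the four neighbours: first along x, then along y.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

A position-embedding table would apply the same weights to every channel of each grid cell. Written as separate tensor operations, each gather, multiply, and add becomes its own kernel, which is the overhead a fused kernel removes.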

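📈 Development Activity Observations

Active contributors: AMD-related contributors (aaab8b, mgehre-amd, gronsti-amd, and others) were active, submitting several core fixes and optimization PRs and showing the AMD team's sustained investment in vLLM.
Community collaboration: Several complex bugs (such as #37857 and #37868) were analyzed in depth by users or contributors who interacted quickly with maintainers, making for an efficient diagnosis-and-fix process.
Code review and merging: 34 PRs were merged this period, including heavyweight changes such as ViT CUDA graphs, several ROCm fixes, and a quantization refactor, showing that the core team is actively integrating community contributions.
Focus areas: Development activity is clearly concentrated in three directions: performance optimization (kernel fusion, CUDA graphs, speculative decoding), platform support (deep AMD/ROCm optimization), and model support (multimodal models, bug fixes for new architectures).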

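💡 Issues Worth Watching

Issue #37856: Shared Expert output is incorrect under Sequence Parallel MoE: When Qwen3.5 MoE models are deployed at scale, the shared expert is computed incorrectly on the sequence-parallel path, producing garbled output. Impact: affects the correctness of large-scale MoE model training/inference with TP>1, DP>1, and EP enabled; a fix is urgently needed.
Issue #37909: “none” reasoning effort doesn’t do what it says it does (and may break output): Points out that the current implementation of reasoning_effort="none" (hiding reasoning content rather than not generating it) is semantically questionable and may corrupt output. Impact: touches on the soundness and backward compatibility of the reasoning-model API design, which needs careful thought.
Issue #37941: Using RIXL Connector on AMD GPU: Exposes shortcomings in toolchain and documentation for advanced deployment features (PD disaggregation) in the AMD ecosystem. Impact: prevents users from using vLLM's full capabilities in AMD production environments.
Issue #37854: NGC vLLM 26.02 rejects Nemotron-3-Super-120B-A12B-NVFP4: The vLLM version inside the NGC container is incompatible with the model's quantization format (MIXED_PRECISION), and the upgrade path is blocked. Impact: users hit obstacles when serving the latest quantized models from the official NGC image, highlighting the need to coordinate container versions with model release cadence.
PR #37944: Revert “Zero-bubble async scheduling + spec decoding”: An important async scheduling optimization was temporarily reverted because it introduced a Triton kernel on the CPU backend and caused widespread CI failures. Impact: performance-optimization code needs particular care about compatibility when multiple backends (CPU/GPU) are supported; the reintroduction of this feature is worth watching.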

📋 Appendix: Detailed Data Lists

New Issues

#37857 [Bug]: RoutedExpertsCapturer.capture() assertion failure with DP>1 when supports_internal_mk=True — no labels — by junjzhang (created: 2026-03-23 14:44 (UTC+8))
#37858 [Bug]: does not have the attribute ‘FakeTensorMode’ — bug — by ACCEKLL (created: 2026-03-23 14:58 (UTC+8))
#37951 [Bug]: KV Cache Memory Error with 262K Context on High VRAM Setup (Regression from Previous Version) — bug — by win10ogod (created: 2026-03-24 10:03 (UTC+8))
#37928 [Usage]: v0.18.0 nvidia/nemotron-colembed-vl-4b-v2 /embeddings 404 — usage — by brandonbiggs (created: 2026-03-24 04:50 (UTC+8))
#37876 [Bug]: After upgrading to v0.18.0, the logs no longer display token output speed — bug — by Formula24Code (created: 2026-03-23 17:43 (UTC+8))
#37946 [Bug]: Minimax-M2.5 on version 0.17.0 results in an keyerror when the pipeline parallelism (PP) is greater than or equal to 2 — bug — by Rcity (created: 2026-03-24 09:15 (UTC+8))
#37942 [Bug]: Editing values.labels in chart-helm breaks Service selector and leaves Endpoints empty — bug — by utsumi-fj (created: 2026-03-24 08:48 (UTC+8))
#37941 [Usage]: Using RIXL Connector on AMD GPU — rocm,usage — by ssyrc (created: 2026-03-24 08:28 (UTC+8))
#37849 [RFC]: Unify the function of getting device count — RFC — by wincent8 (created: 2026-03-23 12:31 (UTC+8))
#37937 [Bug]: IndexError: prev_tool_call_arr list index out of range when streaming tool call hits max_tokens (openai parser) — bug — by onurdemircan-softtech (created: 2026-03-24 07:10 (UTC+8))
#37934 [Bug]: Inflight bitsandbytes quanitzation error in GLM-4.6V-Flash — bug — by akowalsk (created: 2026-03-24 05:57 (UTC+8))
#37933 [Bug]: v0.18.0 fails to run pipeline parallel across nodes — bug — by shunzhiwen (created: 2026-03-24 05:51 (UTC+8))
#37931 [Bug]: flashinfer_cutedsl incompatible with all cross-node EP backends on GB200 NVL72 — bug — by qiching (created: 2026-03-24 05:40 (UTC+8))
#37919 [Bug]: Qwen3.5-35B-A3B compile cache miss 100% on subgraphs. — bug,torch.compile — by zhxchen17 (created: 2026-03-24 03:45 (UTC+8))
#37917 [Bug]: Qwen3.5 27b WorkerProc hit an exception ; CUDA error: an illegal memory access was encountered — bug — by Galigator (created: 2026-03-24 03:42 (UTC+8))
#37908 [RFC]: Add Nunchaku SVDQuant W4A4 quantization backend — no labels — by ultranationalism (created: 2026-03-24 01:42 (UTC+8))
#37909 [Bug]: “none” reasoning effort doesn’t do what it says it does (and may break output) — bug — by scwgoire (created: 2026-03-24 01:42 (UTC+8))
#37842 _update_request_as_session does not update max_tokens from StreamingUpdate — no labels — by warren618 (created: 2026-03-23 11:31 (UTC+8))
#37907 [Usage]: Unable to run Qwen3-14B with vLLM (multiple issues) — usage — by swap-debug (created: 2026-03-24 01:28 (UTC+8))
#37850 [Performance]: Request vLLM input: FlashInfer JIT ops not registered as proper torch.ops custom ops, breaking torch.compile(fullgraph=True) — upstream fix in progress at flashinfer#2734 — performance — by Johnsonms (created: 2026-03-23 12:36 (UTC+8))
#37900 [Bug]: Engine V1 crash with WorkerProc leaked shared_memory on multi-GPU setup — bug — by BioAGI-Moretti (created: 2026-03-23 23:12 (UTC+8))
#37860 [Bug]: sleep mode not releasing GPU memory — bug — by uuddlrlrbaba-letsgo (created: 2026-03-23 15:15 (UTC+8))
#37897 GPT-OSS structured output + reasoning grinds to a halt at long context — no labels — by fergusfinn (created: 2026-03-23 23:00 (UTC+8))
#37890 [Bug]: NaNs in vLLM using DeepSeek-R1-0528-NVFP4-v2 and FlashInfer MLA — bug — by tlrmchlsmth (created: 2026-03-23 21:31 (UTC+8))
#37868 [Bug]: bge-m3 /pooling endpoint breaks in the latest main branch — bug — by yanghui1-arch (created: 2026-03-23 16:52 (UTC+8))
#37893 [Bug]: EBNF grammar not strictly enforced when n > 1 in parallel generation — bug — by kuailehaha (created: 2026-03-23 22:04 (UTC+8))
#37883 [Bug] UVA CPU offload completely broken on WSL with NVFP4 MoE (Qwen3.5-35B-A3B): three distinct crashes across all parameter combinations — bug — by RalufRan (created: 2026-03-23 19:41 (UTC+8))
#37886 [Doc]: The --rope-scaling parameter has taken effect in vLLM supports YaRN — documentation — by labAxiaoming (created: 2026-03-23 20:25 (UTC+8))
#37855 [Bug]: Qwen3-VL-Embedding-8B Image embedding failed — bug — by nuclearwu (created: 2026-03-23 13:47 (UTC+8))
#37846 [Bug]: Deploying GLM-4.7 on A2 with 0.17.0rc1, tool calls misbehave after enabling MTP — bug — by samsuzhang (created: 2026-03-23 11:53 (UTC+8))
#37856 [Bug]: Shared Expert output is incorrect under Sequence Parallel MoE (EP + TP > 1 + DP > 1) for Qwen3.5 MoE models — bug — by jQizhang (created: 2026-03-23 14:26 (UTC+8))
#37854 [Bug]: NGC vLLM 26.02 rejects Nemotron-3-Super-120B-A12B-NVFP4 — quant_algo MIXED_PRECISION not in whitelist — no labels — by algal (created: 2026-03-23 13:32 (UTC+8))
#37852 [Bug]: Phi qk_layernorm appears to be unsupported in vLLM — bug — by Qi-Zhan (created: 2026-03-23 13:05 (UTC+8))
#37847 [Installation]: Documented v0.18.0 cu128 release wheel URL returns 404 — installation — by ouyangziheng (created: 2026-03-23 12:13 (UTC+8))

Closed Issues

#19552 [Bug]: Using qwen2.5-omni for audio recognition maxes out the CPU. — bug,stale — by cy565025164 (closed: 2026-03-24 10:14 (UTC+8))
#36921 [Bug]: V1 engine hangs then crashes with “No available shared memory broadcast block found / RPC call to sample_tokens timed out” under chat completions burst load on Qwen3.5-122B-A10B — bug — by natech-persona (closed: 2026-03-23 19:36 (UTC+8))
#37876 [Bug]: After upgrading to v0.18.0, the logs no longer display token output speed — bug — by Formula24Code (closed: 2026-03-24 09:23 (UTC+8))
#37917 [Bug]: Qwen3.5 27b WorkerProc hit an exception ; CUDA error: an illegal memory access was encountered — bug — by Galigator (closed: 2026-03-24 03:58 (UTC+8))
#37648 BGE-M3 /pooling endpoint crashes with split_with_sizes error after ~50-100 requests — no labels — by hslee-lunit (closed: 2026-03-23 23:15 (UTC+8))
#37868 [Bug]: bge-m3 /pooling endpoint breaks in the latest main branch — bug — by yanghui1-arch (closed: 2026-03-23 22:59 (UTC+8))
#36372 [Bug]: LoRA on Qwen-3.5-27B fails to run — bug — by Nero10578 (closed: 2026-03-23 20:25 (UTC+8))
#37546 [Bug]: CPU backend crashes with TypeError: 'function' object is not subscriptable on first inference request — bug,cpu — by fyuan1316 (closed: 2026-03-23 19:43 (UTC+8))
#37804 [Bug] DeepGemm E8M0 scale format causes accuracy degradation for Qwen3.5 FP8 on Blackwell — bug — by vadiklyutiy (closed: 2026-03-23 16:43 (UTC+8))
#37400 [Bug]: JAIS: ALiBi is applied even when position_embedding_type="learned" — bug,good first issue — by Qi-Zhan (closed: 2026-03-23 15:36 (UTC+8))
#36481 [Performance]: 2-stage custom allreduce (TP4) bandwidth lagging behind NCCL for large message sizes — performance — by shenyt-sanshui (closed: 2026-03-23 14:33 (UTC+8))
#37404 [Bug]: AssertionError: assert num_kv_heads == 1 with CPU KV Offloading + GLM-5-FP8 — bug — by yz342 (closed: 2026-03-23 12:13 (UTC+8))

New PRs

#37956 [Deprecate] Deprecate pooling multi task support. — documentation,frontend — by noooop (created: 2026-03-24 11:28 (UTC+8))
#37899 [Frontend][Bugfix] Pass default_chat_template_kwargs to AnthropicServingMessages — bug,frontend,ready — by jetxa (created: 2026-03-23 23:08 (UTC+8))
#37955 [OpenAI] Fix MiniMax M2 reasoning token usage accounting — frontend — by QwertyJack (created: 2026-03-24 11:25 (UTC+8))
#37915 [Bugfix][Frontend] Pass default_chat_template_kwargs to Anthropic endpoint — bug,frontend — by vinnybad (created: 2026-03-24 02:56 (UTC+8))
#37954 [Parser] Pass tools via ToolParser.init instead of reading from request — frontend,tool-calling,qwen,deepseek — by sfeng33 (created: 2026-03-24 11:20 (UTC+8))
#37861 [Frontend] Remove pooling multi task support. (Hold off until v0.20.0) — frontend,ready — by noooop (created: 2026-03-23 15:19 (UTC+8))
#37953 replace torch.cuda.set_stream/current_stream with accelerator api — documentation,speculative-decoding,v1,kv-connector,nvidia — by xinyu-intel (created: 2026-03-24 11:19 (UTC+8))
#37888 [XPU] Optimize XPU ops using latest vllm-xpu-kernels — no labels — by xwu-intel (created: 2026-03-23 20:54 (UTC+8))
#37952 fix(security): Add VLLM_MAX_N_SEQUENCES environment variable and enforce limit — documentation — by jperezdealgaba (created: 2026-03-24 11:06 (UTC+8))
#37879 fix(moe): fix RoutedExpertsCapturer assertion failure with DP>1 and MK path — no labels — by Young-Leo (created: 2026-03-23 18:25 (UTC+8))
#37914 [Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc — documentation,nvidia — by b-mu (created: 2026-03-24 02:25 (UTC+8))
#37935 [Parser] Remove unused request parameter from extract_reasoning() — documentation,frontend,qwen,deepseek — by sfeng33 (created: 2026-03-24 05:58 (UTC+8))
#37911 [Bugfix] Suppress spurious CPU KV cache warning in launch render — bug,frontend,ready — by sagearc (created: 2026-03-24 02:05 (UTC+8))
#37892 [Bugfix] Fix concat_mla_q kernel for float32 dtype — bug,ready — by xyang16 (created: 2026-03-23 21:41 (UTC+8))
#37936 [BugFix][kv_offload] Reduce memory blocks allocated for CPU offload — bug,v1 — by jonathanc-n (created: 2026-03-24 06:37 (UTC+8))
#37938 [Frontend] Add pluggable store backends for Responses API persistence — documentation,frontend — by nicko170 (created: 2026-03-24 07:28 (UTC+8))
#37943 [NIXL] Strengthen TpKVTopology validation — v1,kv-connector — by yzong-rh (created: 2026-03-24 08:51 (UTC+8))
#37913 Downsize CPU jobs to use small queue — ci/build — by khluu (created: 2026-03-24 02:25 (UTC+8))
#37949 [Attention][GPT-OSS] Add L40S/SM89 CUDA policy and sink-aware Triton tuning — v1,gpt-oss,nvidia — by will-deines (created: 2026-03-24 10:00 (UTC+8))
#37950 [Bugfix] Restore stats logging for multi-server mode — bug,frontend,v1 — by khairulkabir1661 (created: 2026-03-24 10:01 (UTC+8))
#37906 [ROCm][CI] Split Entrypoints Integration (API Server 1) into 3 jobs — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-24 01:28 (UTC+8))
#37948 [Perf] triton bilinear_pos_embed kernel for ViT — performance,qwen — by zhandaz (created: 2026-03-24 09:48 (UTC+8))
#37947 [DRAFT][XPU] Upgrade torch 2.11 for xpu — ci/build — by jikunshang (created: 2026-03-24 09:35 (UTC+8))
#37895 [CI] Add batch invariant test: Block FP8 + small MOE — ready,ci/build — by yewentao256 (created: 2026-03-23 22:43 (UTC+8))
#37932 [Model Runner V2] Gather multimodal embeddings before draft model postprocess — ready,v1 — by TheEpicDolphin (created: 2026-03-24 05:46 (UTC+8))
#37944 Revert “[Async][Spec Decoding] Zero-bubble async scheduling + spec decoding” (#32951) — speculative-decoding,v1 — by zhewenl (created: 2026-03-24 09:08 (UTC+8))
#37945 Revert “[Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision” (#36725) — bug,nvidia — by zhewenl (created: 2026-03-24 09:09 (UTC+8))
#37940 [NIXL][BUG] Fix Triton and Gemma heterogeneous TP — bug,v1,kv-connector — by yzong-rh (created: 2026-03-24 08:01 (UTC+8))
#37918 [LinearAttention] Introduce non_spec_query_start_loc_cpu in GDN metadata — v1 — by Jialin (created: 2026-03-24 03:45 (UTC+8))
#37939 fix(security): replace unsafe eval() in tool-calling examples — documentation — by xr843 (created: 2026-03-24 07:30 (UTC+8))
#37923 [Bugfix] Force continuous usage stats when CLI override is enabled — bug,frontend,ready — by dsingal0 (created: 2026-03-24 04:06 (UTC+8))
#37926 Make microbatch optimization (DBO) work with general models — ready,v1 — by 0xjunhao (created: 2026-03-24 04:45 (UTC+8))
#37929 [Core] Use standalone autograd_cache_key for compilation dedup optimization — no labels — by frgossen (created: 2026-03-24 05:02 (UTC+8))
#37930 [ROCm][CI] Add uv pip compile workflow for rocm-test.txt lockfile — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-24 05:22 (UTC+8))
#37925 [Core] CUDA Checkpoint/Restore — Phase 2: Engine/Executor/API Integration — frontend,v1,nvidia — by elizabetht (created: 2026-03-24 04:24 (UTC+8))
#37927 [ROCm][CI] Remove redundant common.txt from rocm-test.txt — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-24 04:46 (UTC+8))
#37920 [Bugfix] Pass hf_token through config loading paths for gated model support — bug — by javierdejesusda (created: 2026-03-24 03:46 (UTC+8))
#37924 [ROCm][CI][PD] Add Hybrid SSM integration tests to CI — rocm,ready,ci/build,kv-connector — by AndreasKaratzas (created: 2026-03-24 04:16 (UTC+8))
#37922 [Draft] Fix transitional step bug — bug,structured-output,v1,deepseek,gpt-oss — by southfreebird (created: 2026-03-24 04:04 (UTC+8))
#37921 [Core] CUDA Checkpoint/Restore — Phase 1: C Extension, Python Wrapper, Worker Methods — ci/build,v1,nvidia — by elizabetht (created: 2026-03-24 03:51 (UTC+8))
#37916 tests/v1/e2e/spec_decode: assert async scheduling is used — v1 — by puririshi98 (created: 2026-03-24 03:23 (UTC+8))
#37910 Making spec decode testing nightly — ci/build — by puririshi98 (created: 2026-03-24 02:00 (UTC+8))
#37912 [Bugfix] Fuse Qwen3.5 in_qkvz_proj forwarding with LoRA enabled — bug,qwen — by Isotr0py (created: 2026-03-24 02:12 (UTC+8))
#37902 [Mypy] Better fixes for the mypy issues in vllm/config — documentation,performance,ready,multi-modality — by hmellor (created: 2026-03-23 23:48 (UTC+8))
#37889 [CI] Add Qwen3.5 quantized model GSM8K CI evals for Blackwell — ready,qwen — by vadiklyutiy (created: 2026-03-23 21:09 (UTC+8))
#37843 fix(scheduler): update max_tokens from StreamingUpdate in session — v1 — by warren618 (created: 2026-03-23 11:31 (UTC+8))
#37882 [CI] split Entrypoints Integration (API Server 1) into 3 jobs — ready,ci/build — by jikunshang (created: 2026-03-23 19:13 (UTC+8))
#37885 Canonical KV Cache Allocation for HMA Models — v1,kv-connector — by Etelis (created: 2026-03-23 19:56 (UTC+8))
#37905 fusing dynamic quantization in fused_moe_lora_fp8 to keep PDL on — no labels — by yugong333 (created: 2026-03-24 00:49 (UTC+8))
#37874 [KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package — ready,v1 — by ronensc (created: 2026-03-23 17:28 (UTC+8))
#37904 [Mypy] Fix mypy for vllm/model_executor (except vllm/model_executor/layers) — ready — by hmellor (created: 2026-03-24 00:40 (UTC+8))
#37903 nano_nemotron_vl: suppress readonly torch.from_numpy() warning in image and video resize paths — no labels — by netanel-haber (created: 2026-03-24 00:13 (UTC+8))
#37880 [Feature] Support per-draft-model MoE backend via --speculative-config — speculative-decoding,v1 — by askliar (created: 2026-03-23 18:37 (UTC+8))
#37901 [compile] Add a fast serializer for aot save/load. — no labels — by zhxchen17 (created: 2026-03-23 23:16 (UTC+8))
#37887 [ROCm][perf] fix Aiter sparse MLA with MTP>1 — rocm,speculative-decoding,v1 — by gronsti-amd (created: 2026-03-23 20:29 (UTC+8))
#37896 [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding — bug,needs-rebase,nvidia — by xueliangyang-oeuler (created: 2026-03-23 22:52 (UTC+8))
#37866 fix: handle FakeTensorMode patching for PyTorch compatibility — no labels — by CMLKevin (created: 2026-03-23 16:15 (UTC+8))
#37884 [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding — bug,ready,nvidia — by he-yufeng (created: 2026-03-23 19:44 (UTC+8))
#37898 [Hybrid] Marconi-style admission policy for hybrid cache — v1 — by s3woz (created: 2026-03-23 23:05 (UTC+8))
#37873 [Bugfix] RoBERTa position_id accumulation in CUDA graph padding region — bug,ready,v1,nvidia — by yanghui1-arch (created: 2026-03-23 17:25 (UTC+8))
#37891 [ROCm][Perf] Add fused AllReduce+RMSNorm for DeepSeek on MI355X — rocm,deepseek,nvidia — by attila-dusnoki-htec (created: 2026-03-23 21:41 (UTC+8))
#37894 [Feature] Add prefill node failure detection and health query endpoint for MooncakeConnector Proxy — documentation,kv-connector — by SSmallMonster (created: 2026-03-23 22:04 (UTC+8))
#37869 Optimize XPU ops using latest vllm-xpu-kernels — no labels — by xwu-intel (created: 2026-03-23 16:55 (UTC+8))
#37872 Add sanitized MiniMax M2 parser variant for path-like output — tool-calling — by SamerAlshaer1991 (created: 2026-03-23 17:08 (UTC+8))
#37877 [Bugfix][LoRA] Fix incorrect LoRA Log — bug,ready — by jeejeelee (created: 2026-03-23 17:47 (UTC+8))
#37881 Auto mtp — documentation,frontend,speculative-decoding,needs-rebase,ci/build,v1,llama — by Zacks917 (created: 2026-03-23 18:54 (UTC+8))
#37859 [Bugfix][Core] Ignore stale KV xfer callbacks after request cleanup (#37837) — bug,v1 — by rohithj7 (created: 2026-03-23 15:11 (UTC+8))
#37878 docs: add WSL troubleshooting guide — documentation — by CChen19 (created: 2026-03-23 18:20 (UTC+8))
#37844 [Draft][XPU] add gptq marlin support — no labels — by jikunshang (created: 2026-03-23 11:32 (UTC+8))
#37871 [Bugfix][Perf]: avoid Range allocation and dict hashing in _find_range_for_shape — bug — by KrxGu (created: 2026-03-23 16:57 (UTC+8))
#37864 feat: Add OpenAI Conversations API support — frontend — by bongwoobak (created: 2026-03-23 15:55 (UTC+8))
#37875 [Frontend] Use unified Parser for chat completions non-streaming — frontend — by MatteoFari (created: 2026-03-23 17:29 (UTC+8))
#37845 [Bugfix] Fix GLM tool-call finish chunk suffix alignment in streaming — bug,frontend — by QwertyJack (created: 2026-03-23 11:33 (UTC+8))
#37848 [Reasoning] Add structural tag support for reasoning parser via sampling params — frontend,ready — by rishitdholakia13 (created: 2026-03-23 12:14 (UTC+8))
#37870 fix: add qk_layernorm support for Phi models — no labels — by gambletan (created: 2026-03-23 16:56 (UTC+8))
#37867 Fix incorrect tokenizer source path in RunAI object storage pull (#37836) — no labels — by alvinttang (created: 2026-03-23 16:51 (UTC+8))
#37863 [Misc]Update gitignore — ready — by wangxiyuan (created: 2026-03-23 15:42 (UTC+8))
#37865 Fix cudagraph max capture size upper bound — nvidia — by weireweire (created: 2026-03-23 15:58 (UTC+8))
#37862 docs: EU AI Act compliance guide for vLLM deployers — documentation — by BipinRimal314 (created: 2026-03-23 15:37 (UTC+8))
#37853 [kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models — v1,kv-connector — by orozery (created: 2026-03-23 13:27 (UTC+8))
#37851 update doc for online fp8 quantization — documentation,ready — by yma11 (created: 2026-03-23 12:47 (UTC+8))

Merged PRs

#37487 [V0 Deprecation] Refactor kv cache from list to element — rocm,ready,v1,qwen,deepseek,kv-connector — by yewentao256 (merged: 2026-03-24 11:10 (UTC+8))
#37906 [ROCm][CI] Split Entrypoints Integration (API Server 1) into 3 jobs — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-24 09:48 (UTC+8))
#37895 [CI] Add batch invariant test: Block FP8 + small MOE — ready,ci/build — by yewentao256 (merged: 2026-03-24 09:16 (UTC+8))
#37932 [Model Runner V2] Gather multimodal embeddings before draft model postprocess — ready,v1 — by TheEpicDolphin (merged: 2026-03-24 09:14 (UTC+8))
#36803 [Test] E2E Nemotron-3-Super tests — ready,ci/build,nvidia — by roikoren755 (merged: 2026-03-24 08:49 (UTC+8))
#37016 [CI] Split V1 Others into 3 separate jobs — ready,ci/build — by khluu (merged: 2026-03-24 06:44 (UTC+8))
#35007 [Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning — bug,ready,v1,nvidia — by WindChimeRan (merged: 2026-03-24 06:31 (UTC+8))
#32951 [Async][Spec Decoding] Zero-bubble async scheduling + spec decoding — speculative-decoding,ready,v1 — by MatthewBonanni (merged: 2026-03-24 03:37 (UTC+8))
#36728 [Bug][MoE] Strengthen _supports_current_device() checks in the TRTLLM FP8, NVFP4, and FlashInfer CuteDSL MoE experts — bug,documentation,rocm,frontend,ready,ci/build,v1,multi-modality,tool-calling,llama — by yzong-rh (merged: 2026-03-24 05:02 (UTC+8))
#36725 [Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision — bug,ready,nvidia — by robertgshaw2-redhat (merged: 2026-03-24 04:19 (UTC+8))
#36799 [Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels — performance,ready,ci/build,nvidia — by kylesayrs (merged: 2026-03-24 04:03 (UTC+8))
#37812 [MRV2] Consider spec decoding in warmup — ready,v1 — by WoosukKwon (merged: 2026-03-24 01:45 (UTC+8))
#37882 [CI] split Entrypoints Integration (API Server 1) into 3 jobs — ready,ci/build — by jikunshang (merged: 2026-03-24 01:37 (UTC+8))
#37657 [CI][PD] Add Hybrid SSM integration tests to CI — ready,ci/build,v1,kv-connector — by NickLucche (merged: 2026-03-23 23:58 (UTC+8))
#37609 Use lazy graph module during split_module to defer recompile() — ready,torch.compile — by angelayi (merged: 2026-03-23 23:21 (UTC+8))
#37884 [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding — bug,ready,nvidia — by he-yufeng (merged: 2026-03-23 23:15 (UTC+8))
#37873 [Bugfix] RoBERTa position_id accumulation in CUDA graph padding region — bug,ready,v1,nvidia — by yanghui1-arch (merged: 2026-03-23 22:59 (UTC+8))
#37808 [Mypy] Fix mypy for vllm/config — performance,ready,v1,nvidia — by yewentao256 (merged: 2026-03-23 22:34 (UTC+8))
#37533 [ROCm] fix sleep mode not releasing GPU memory problem on ROCm — rocm,ready — by aaab8b (merged: 2026-03-23 21:07 (UTC+8))
#37834 [Test] Consolidate tool parser unit tests to tests/tool_parsers — ready,tool-calling,llama — by bbrowning (merged: 2026-03-23 12:24 (UTC+8))
#37877 [Bugfix][LoRA] Fix incorrect LoRA Log — bug,ready — by jeejeelee (merged: 2026-03-23 19:42 (UTC+8))
#37550 [Bugfix] Fix CPU backend crash in KV cache block zeroing — bug,ready,v1,cpu — by DorBernsohn (merged: 2026-03-23 19:35 (UTC+8))
#37784 [XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle — ready — by jikunshang (merged: 2026-03-23 19:10 (UTC+8))
#37498 [Frontend][Responses API] Fix arrival_time recording for TTFT on initial request — documentation,frontend,ready,gpt-oss,meta-exported,fb-exported — by qandrew (merged: 2026-03-23 17:58 (UTC+8))
#32929 [FP8]add FP8 WoQ kernel abstraction. — performance,ready,ci/build,v1,nvidia — by jikunshang (merged: 2026-03-23 17:47 (UTC+8))
#37863 [Misc]Update gitignore — ready — by wangxiyuan (merged: 2026-03-23 16:14 (UTC+8))
#36100 [ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs — rocm,ready,v1 — by ChuanLi1101 (merged: 2026-03-23 15:48 (UTC+8))
#37338 [Perf] [Bugfix] Fix Triton autotuning in inference for Qwen3.5 — bug,ready,qwen — by arpera (merged: 2026-03-23 15:37 (UTC+8))
#37810 [Bugfix] Store Qwen3Next A_log in fp32 — bug,ready,qwen — by effortprogrammer (merged: 2026-03-23 15:36 (UTC+8))
#37820 [Bugfix] JAIS: Only apply ALiBi when position_embedding_type='alibi' — bug,ready — by r266-tech (merged: 2026-03-23 15:36 (UTC+8))
#36505 [ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel — rocm,ready — by mgehre-amd (merged: 2026-03-23 15:36 (UTC+8))
#37851 update doc for online fp8 quantization — documentation,ready — by yma11 (merged: 2026-03-23 13:19 (UTC+8))
#35963 [Feature] ViT Full CUDA Graph — ready,v1,multi-modality,qwen,nvidia — by b-mu (merged: 2026-03-23 13:01 (UTC+8))
#37816 [CI/Build][LoRA] Update Qwen35 LoRA testing — ready,ci/build,qwen — by jeejeelee (merged: 2026-03-23 12:55 (UTC+8))

PRs Closed Without Merging

#37357 [Bugfix] Fix elastic EP scale-up after scale-down — bug,v1 — by tzulingk (closed: 2026-03-24 11:01 (UTC+8))
#37069 [Test][Nixl] Add YAML-driven test runner for PD accuracy configs — needs-rebase,ci/build,v1,kv-connector — by yzong-rh (closed: 2026-03-24 10:09 (UTC+8))
#35888 [CI] Fix mypy for vllm/config — v1,nvidia — by hickeyma (closed: 2026-03-23 23:53 (UTC+8))
#37229 Fix Qwen3.5-Next RMSNormGated Initialization Error on TPU — qwen — by jrplatin (closed: 2026-03-24 06:08 (UTC+8))
#36309 [BugFix] Fix Qwen3.5 LoRA IndexError in GDN fused projections — bug,ready,needs-rebase,qwen — by JWriter20 (closed: 2026-03-24 04:57 (UTC+8))
#35251 [compile][graph_partition] Remove unused subgraph inputs after split_module — needs-rebase — by fxdawnn (closed: 2026-03-24 04:54 (UTC+8))
#37135 [Bugfix] Fix FP16 overflow in NVFP4 Marlin kernel epilogue and forward input_global_scale on SM75 — bug,performance — by saifmb0 (closed: 2026-03-24 02:47 (UTC+8))
#37101 [Test] — frontend — by liuchenbing2026 (closed: 2026-03-24 00:55 (UTC+8))
#37869 Optimize XPU ops using latest vllm-xpu-kernels — no labels — by xwu-intel (closed: 2026-03-23 21:24 (UTC+8))
#37652 Fix: Handle $ref in json-schema in qwen3_coder tool parser — tool-calling,qwen — by schoennenbeck (closed: 2026-03-23 20:28 (UTC+8))
#35716 [Bugfix] Qwen3Coder streaming tool call JSON missing opening brace in arguments — bug,needs-rebase,tool-calling,qwen — by KrxGu (closed: 2026-03-23 19:33 (UTC+8))
#35144 [ROCm] Enable GPTQMarlinConfig on ROCm to use choose_mp_linear_kernel — rocm,needs-rebase — by mgehre-amd (closed: 2026-03-23 19:08 (UTC+8))
#37881 Auto mtp — documentation,frontend,speculative-decoding,needs-rebase,ci/build,v1,llama — by Zacks917 (closed: 2026-03-23 18:58 (UTC+8))
#37411 fix(jais): only apply ALiBi when position_embedding_type is ‘alibi’ — needs-rebase — by jigangz (closed: 2026-03-23 18:35 (UTC+8))
#37864 feat: Add OpenAI Conversations API support — frontend — by bongwoobak (closed: 2026-03-23 18:12 (UTC+8))
#37281 [Bugfix][Perf]: avoid Range allocation and dict hashing in _find_range_for_shape — bug — by KrxGu (closed: 2026-03-23 16:43 (UTC+8))
#37806 [Bugfix] Auto-disable DeepGemm for Qwen3.5 on Blackwell to fix FP8 accuracy degradation — bug,qwen — by vadiklyutiy (closed: 2026-03-23 16:43 (UTC+8))
#37621 [Bugfix] JAIS: Only apply ALiBi when position_embedding_type is alibi — bug — by simpx (closed: 2026-03-23 16:18 (UTC+8))
#37788 [Refactor] converge xxx_config to vllm_config in async_llm — frontend,v1 — by andyxning (closed: 2026-03-23 13:10 (UTC+8))
#37578 [Bugfix] Fix unclean shutdown from Ctrl-C with AR Fusion — bug — by simpx (closed: 2026-03-23 12:43 (UTC+8))