vLLM Development Activity Report - 2026-03-23
Time window: 2026-03-23 11:29 (UTC+8) to 2026-03-24 11:29 (UTC+8). Totals: 34 new issues | 12 issues closed | 81 new PRs | 34 PRs merged | 20 PRs closed without merging
📊 Daily Development Status Summary
During this period (2026-03-23 to 2026-03-24) the vLLM project remained highly active, with 34 new issues and 34 merged PRs. Development focused on multimodal model support (e.g. embedding and inference problems in Qwen3-VL and GLM-4.6V), inference efficiency (sleep-mode memory release, CUDA graph performance), and cross-platform compatibility, especially performance work and bug fixes in the AMD/ROCm ecosystem. Several critical bugs were quickly triaged and fixed, reflecting close community collaboration.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was brisk this period, centered on performance optimization, bug fixes, and toolchain support.
- Issue #37941: [Usage]: Using RIXL Connector on AMD GPU
  - User: ssyrc
  - Summary: In an isolated environment, the user is trying to run prefill/decode (PD) disaggregation with the Nixl connector on AMD GPUs. They ask whether a prebuilt image based on Dockerfile.rocm is available, whether the official ROCm image ships the RIXL library, and whether Nixl supports cross-vendor AMD-NVIDIA communication.
  - Analysis: This issue reflects the practical obstacles users hit when deploying advanced vLLM features (such as PD disaggregation) in AMD production environments, and highlights documentation and tooling gaps in the AMD ecosystem around containerized deployment and advanced communication libraries.
- PR #37891: [ROCm][Perf] Add fused AllReduce+RMSNorm for DeepSeek on MI355X
  - Author: attila-dusnoki-htec
  - Status: Open
  - Summary: Introduces a fused AllReduce+RMSNorm kernel (via AITER) for DeepSeek V2/V3/R1 on the AMD MI355X (gfx950) platform. By moving the tensor-parallel AllReduce out of o_proj and the MoE projection layers into the subsequent RMSNorm, and fusing it with the residual add, the PR reduces kernel-launch overhead and memory traffic, improving FP4/FP8 quantized model performance at TP>1.
  - Analysis: A deep, hardware-specific (MI355X) and model-family-specific (DeepSeek) optimization, showing the AMD ecosystem moving toward finer-grained, model-aware tuning.
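The correctness requirement behind such a fusion can be illustrated with a tiny numeric sketch (pure Python stand-ins, not the actual AITER kernel): summing the per-rank partial outputs, adding the residual, and applying RMSNorm in one pass must match the unfused all-reduce → residual add → RMSNorm sequence.

```python
import math

def rmsnorm(x, eps=1e-6):
    # Reference RMSNorm over a 1-D vector.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

partials = [[1.0, 2.0], [3.0, 4.0]]   # per-rank partial o_proj outputs (TP=2)
residual = [0.5, 0.5]

# Unfused reference: all-reduce (sum over ranks), then residual add, then norm.
reduced = [sum(col) for col in zip(*partials)]
ref = rmsnorm([r + s for r, s in zip(reduced, residual)])

# "Fused" path: one pass computing sum + residual before normalizing,
# standing in for the single fused kernel launch the PR introduces.
fused = rmsnorm([sum(col) + s for col, s in zip(zip(*partials), residual)])
```

The performance win comes purely from doing this in one kernel launch with one trip through memory; the math is unchanged.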
- PR #37887: [ROCm][perf] fix Aiter sparse MLA with MTP>1
  - Author: gronsti-amd (AMD employee)
  - Status: Open
  - Summary: Fixes a failure on DeepSeek v3.2 when using the ROCM_AITER_MLA_SPARSE attention backend with MTP (multi-token prediction) speculative decoding and num_speculative_tokens > 1. The root cause was a check that depended on the order of the attn_metadata list; the fix inspects all attention types instead.
  - Analysis: Ensures the AMD AITER sparse MLA backend is fully compatible with the more advanced MTP speculative decoding path, improving large-model inference efficiency on AMD.
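The shape of that bug is common enough to sketch (illustrative data structures, not vLLM's actual ones): a check that reads only the first attn_metadata entry silently depends on list order, while the fix examines every entry.

```python
# Simulated metadata: one dense-attention group and one sparse-MLA group.
attn_metadata = [
    {"layer_group": "dense", "is_sparse_mla": False},
    {"layer_group": "sparse_mla", "is_sparse_mla": True},
]

def uses_sparse_mla_broken(metadata):
    # Broken pattern: only the first entry is consulted, so the answer
    # flips depending on how the list happens to be ordered.
    return metadata[0]["is_sparse_mla"]

def uses_sparse_mla_fixed(metadata):
    # Fixed pattern: check all attention types, order-independent.
    return any(m["is_sparse_mla"] for m in metadata)
```

With the ordering above, the broken check misses the sparse-MLA group entirely while the fixed check detects it.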
- PR #37533: [ROCm] fix sleep mode not releasing GPU memory problem on ROCm
  - Author: aaab8b
  - Status: Merged
  - Summary: Fixes sleep mode failing to actually release GPU memory on ROCm. The root cause is that HIP's hipMemRelease does not free physical memory while the virtual address reservation is held. The fix forces a release-and-re-reserve cycle in unmap_and_release so that physical pages are truly freed.
  - Analysis: A key platform-specific fix for a significant resource-management defect on ROCm, making sleep mode genuinely usable on AMD GPUs, which matters for dynamic workload management.
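The mechanism can be modeled with a small allocator simulation (assumed semantics, not the real HIP API): unmapping alone keeps physical pages alive while the virtual range stays reserved, so the fix drops the reservation, frees the pages, and immediately re-reserves the range.

```python
class VirtualAllocator:
    """Toy model of virtual-memory-backed allocation on ROCm."""

    def __init__(self):
        self.reserved = set()   # virtual ranges currently reserved
        self.physical = {}      # virtual range -> backing physical pages

    def map(self, rng, pages):
        self.reserved.add(rng)
        self.physical[rng] = pages

    def unmap(self, rng):
        # HIP quirk being modeled: physical pages are NOT freed here
        # as long as the virtual range remains reserved.
        if rng not in self.reserved:
            self.physical.pop(rng, None)

    def unmap_and_release(self, rng):
        # The fix: release the reservation, free physical pages,
        # then re-reserve so the address range stays stable.
        self.reserved.discard(rng)
        self.physical.pop(rng, None)
        self.reserved.add(rng)

alloc = VirtualAllocator()
alloc.map("kv_cache", pages=1024)
alloc.unmap("kv_cache")                      # pages still resident: the bug
leaked = "kv_cache" in alloc.physical
alloc.unmap_and_release("kv_cache")          # the fixed path
freed = "kv_cache" not in alloc.physical
```

Re-reserving keeps the virtual addresses valid for later wake-up while the physical memory is actually returned to the system.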
- PR #36505: [ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel
  - Author: mgehre-amd (AMD employee)
  - Status: Merged
  - Summary: Refactors AWQ (activation-aware weight quantization) support on ROCm so that AWQMarlinConfig selects kernels (e.g. ConchLinearKernel) through the unified choose_mp_linear_kernel framework instead of calling the NVIDIA-specific ops.marlin_gemm directly. Benchmarks show a substantial AWQ speedup on ROCm: +57% prefill, +73% decode.
  - Analysis: An important step in folding AMD quantization support into vLLM's unified kernel-dispatch framework, improving both code consistency and performance.
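The chooser pattern being adopted can be sketched roughly as follows (class and function names are stand-ins for the real vLLM kernel classes): each candidate kernel declares platform support, and the first compatible candidate wins, instead of hard-coding a CUDA-only kernel.

```python
class MarlinKernel:
    """Stand-in for a CUDA-only mixed-precision GEMM kernel."""
    @staticmethod
    def is_supported(platform: str) -> bool:
        return platform == "cuda"

class ConchKernel:
    """Stand-in for a kernel that also runs on ROCm."""
    @staticmethod
    def is_supported(platform: str) -> bool:
        return platform in ("cuda", "rocm")

def choose_mp_linear_kernel(platform, candidates=(MarlinKernel, ConchKernel)):
    # Unified dispatch: pick the first candidate that supports the platform,
    # rather than unconditionally calling a vendor-specific op.
    for kernel in candidates:
        if kernel.is_supported(platform):
            return kernel
    raise RuntimeError(f"no mixed-precision linear kernel for {platform}")
```

On "rocm" the chooser falls through to ConchKernel, which is the behavioral change this PR enables for AWQ.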
- PR #36100: [ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs
  - Author: ChuanLi1101
  - Status: Merged
  - Summary: Fixes a serious signature mismatch in the fused_moe_fake meta function of a ROCm AITER custom op (four parameters including hidden_pad were missing), along with minor comment and logging errors.
  - Analysis: Resolves crashes under torch.compile/FakeTensor caused by the signature mismatch, stabilizing the compile path on AMD.
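The failure class is easy to demonstrate in isolation (function names and parameters here are illustrative, not the real op's signature): a custom op's "fake" (meta) implementation must accept exactly the same parameters as the real kernel, or FakeTensor tracing under torch.compile breaks.

```python
import inspect

def fused_moe(hidden, w1, w2, topk_ids, hidden_pad):
    """Stand-in for the real kernel entry point."""
    ...

def fused_moe_fake_broken(hidden, w1, w2, topk_ids):
    """Buggy meta function: missing hidden_pad (and, in the real PR,
    three other parameters), so tracing crashes."""
    ...

def fused_moe_fake_fixed(hidden, w1, w2, topk_ids, hidden_pad):
    """Fixed meta function: parameter list mirrors the real op."""
    ...

def signatures_match(real, fake) -> bool:
    # Compare parameter names in order, the invariant the fix restores.
    return (list(inspect.signature(real).parameters)
            == list(inspect.signature(fake).parameters))
```

A check like signatures_match is essentially what the custom-op registration machinery enforces at trace time.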
- PRs #37906 & #37930: [ROCm][CI] improvements
  - Author: AndreasKaratzas
  - Status: Merged / Open
  - Summary: Mirrors the AMD CI test-job split (#37906) and adds a uv pip compile workflow to generate a lockfile for ROCm test dependencies (#37930).
  - Analysis: Ongoing improvement of the AMD CI/CD pipeline, safeguarding code quality and test throughput.
💬 High-Engagement Discussions
- Issue #37855: [Bug]: Qwen3-VL-Embedding-8B Image embedding failed
  - Comments: 7
  - Topic: Qwen3-VL-8B returns a 500 internal server error during image embedding.
  - Viewpoints:
    - Reporter (nuclearwu): provided a detailed stack trace and environment info (910B2).
    - Responders (JINO-ROHIT, DarkLight1337): attempted to reproduce and requested fuller environment details (collect_env.py output) to aid diagnosis.
    - Contributor (CMLKevin): repeatedly stated they are investigating and preparing a fix, suspecting the problem lies in VLM embedding tensor preparation.
  - Status: Open. Unresolved, still in information gathering and initial analysis; a reminder of the complexity of multimodal model support.
- Issue #37858: [Bug]: does not have the attribute ‘FakeTensorMode’
  - Comments: 5
  - Topic: A FakeTensorMode attribute error, suspected to be a PyTorch compatibility issue.
  - Viewpoints:
    - Reporter (ACCEKLL): provided full environment information.
    - Contributor (CMLKevin): initial analysis points to a PyTorch version mismatch or API change.
    - Core developer (zou3519): identified the root cause: PR #37158 (fixing PyTorch 2.10 support) was not included in the release, so the bug surfaces in v0.18.0. He explained what that PR does, and the reporter asked which release will carry the fix.
  - Points of contention: none; a core developer pinpointed the root cause quickly.
  - Status: Open. Root cause is clear; awaiting the PR landing in a release.
- Issue #37849: [RFC]: Unify the function of getting device count
  - Comments: 5
  - Topic: Proposal to unify the project's three ways of getting the device count (cuda_device_count_stateless, current_platform.device_count, torch.accelerator.device_count()).
  - Viewpoints:
    - Proposer (wincent8): argues the current situation confuses users and that cuda_device_count_stateless blocks multi-accelerator support such as XPU; suggests unifying in two steps.
    - Core developer (simon-mo): notes that cuda_device_count_stateless exists precisely to avoid initializing a CUDA context, and asks whether the new scheme preserves that property.
    - Core developer (jikunshang): suggests a safer two-phase plan: first wrap everything in current_platform.device_count(), then try switching to the official torch API. simon-mo ultimately supports the cleanup, provided the original bug is verified not to regress.
  - Points of contention: how to unify the API gracefully while preserving backward compatibility and avoiding implicit context initialization; the discussion leans toward cautious, incremental refactoring.
  - Status: Open. Consensus to refactor reached; now validating the concrete plan.
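The phase-one wrapper under discussion could look roughly like this minimal sketch (everything except the three function names quoted above is a stand-in; `_StubPlatform` simulates vLLM's current_platform object without touching any GPU runtime, which is the property simon-mo wants preserved):

```python
from functools import lru_cache

class _StubPlatform:
    """Stand-in for vLLM's current_platform (illustrative only)."""
    def device_count(self) -> int:
        # A real implementation would use a stateless probe (driver query,
        # env inspection) so no CUDA context is created in this process.
        return 4

current_platform = _StubPlatform()

@lru_cache(maxsize=1)
def device_count() -> int:
    """Single project-wide entry point (phase 1 of the RFC).

    Phase 2 would attempt torch.accelerator.device_count() first and
    fall back to the platform hook, once it's verified not to
    initialize a device context as a side effect.
    """
    return current_platform.device_count()
```

Caching keeps repeated calls cheap while preserving the "stateless" contract, since the underlying probe runs at most once.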
- Issue #37857: [Bug]: RoutedExpertsCapturer.capture() assertion failure with DP>1 when supports_internal_mk=True
  - Comments: 3
  - Topic: With data parallelism (DP>1) and a quantization method on the modular-kernel path, the MoE expert capturer crashes on an assertion.
  - Viewpoints:
    - Reporter (junjzhang): gave an exceptionally detailed analysis, pinpointing the root cause (the shape of topk_ids differs between the two DP dispatch paths) and proposing concrete fix code.
    - Contributor (Young-Leo): responded promptly and volunteered to implement the fix.
    - Reporter (junjzhang): endorsed the fix approach and suggested adding test coverage.
  - Points of contention: none. A high-quality technical bug report that directly led to fix PR #37879.
  - Status: Open. Associated PR #37879 has been submitted to address it.
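The general shape of the bug can be sketched without vLLM internals (shapes and the normalization strategy are assumptions for illustration, not the actual fix in #37879): a capturer that asserts one topk_ids layout breaks when a second dispatch path hands it the same data in a different shape, so the robust version normalizes first.

```python
def capture_topk_ids(topk_ids, num_tokens, topk):
    """Accept topk_ids from either DP dispatch path.

    Path A (illustrative) delivers a nested (num_tokens, topk) layout;
    path B delivers a flat (num_tokens * topk,) layout. Normalizing
    instead of asserting one shape avoids the DP>1 crash pattern.
    """
    if topk_ids and isinstance(topk_ids[0], list):
        flat = [x for row in topk_ids for x in row]   # nested -> flat
    else:
        flat = list(topk_ids)                          # already flat
    assert len(flat) == num_tokens * topk, "unexpected topk_ids size"
    return flat
```

Both layouts now satisfy the single post-normalization invariant.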
🔥 Hot Topics and Trends
- Multimodal model support and debugging: several issues (#37855, #37928, #37934) concern embedding, serving-endpoint, and quantization problems in Qwen3-VL, Nemotron-colembed-vl, and GLM-4.6V, showing that multimodal inference is becoming a hotspot for complex bugs and raising the bar for testing and documentation.
- Inference efficiency and resource management:
  - Sleep mode: Issue #37860 and merged PR #37533 address sleep-mode memory release on CUDA and ROCm respectively, underscoring the feature's importance for dynamic scaling.
  - Compilation and caching: Issue #37919 reports 100% torch.compile cache misses that balloon compile times, illustrating the tension between performance optimization and compiler-stack complexity.
  - Speculative decoding: multiple PRs (#37887, #37944) optimize or fix speculative decoding in the continuing push for lower latency.
- Deepening AMD performance work: as the AMD section shows, AMD contributors are moving from basic enablement (e.g. the sleep-mode fix) to deep, hardware-specific (MI355X) and model-specific (DeepSeek) optimizations such as fused kernels, a sign that AMD integration in vLLM is maturing.
- New quantization methods: Issue #37908 proposes integrating the Nunchaku library's SVDQuant (W4A4) method, showing continued community appetite for more aggressive, more efficient quantization schemes.
🛠️ Key Technical Changes
- PR #35963: [Feature] ViT Full CUDA Graph (Merged)
  - What it does: brings full CUDA graph support to vision Transformer encoders, with budget-based graph capture, greedy bin-packing, and data parallelism, managed transparently for models via the SupportsEncoderCudaGraph protocol.
  - Impact: expected to sharply cut kernel-launch overhead in the ViT portion of multimodal models, lifting end-to-end performance for models like Qwen3-VL; a major step for multimodal inference optimization.
- PR #37533: [ROCm] fix sleep mode not releasing GPU memory problem on ROCm (Merged)
  - What it does: resolves a long-standing resource-management defect on ROCm by forcing a release/re-reserve cycle so that sleep mode truly frees physical GPU memory.
  - Impact: makes sleep mode reliable on AMD GPUs, which is critical for cloud and multi-tenant environments that resize resource footprints dynamically.
- PRs #37873 & #37884: RoBERTa position_ids accumulation fixes (both Merged)
  - What they do: two PRs fix the same core bug from different angles: with CUDA graphs enabled, RoBERTa-style embedding models modify the position-ID buffer in place, so values accumulate across replays and eventually index out of bounds. One PR zeroes the padding region in the model runner; the other switches the model to an out-of-place computation.
  - Impact: fixes a serious crash affecting several popular embedding models such as BGE-M3, improving embedding-service stability.
- PR #37891: [ROCm][Perf] Add fused AllReduce+RMSNorm for DeepSeek on MI355X (Open)
  - What it does: fuses AllReduce with RMSNorm for the DeepSeek family on AMD MI355X, a textbook case of hardware/model/framework co-optimization.
  - Impact: if merged, it will directly raise DeepSeek tensor-parallel efficiency on this AMD hardware and serve as a template for similar optimizations on other models.
- PR #37948: [Perf] triton bilinear_pos_embed kernel for ViT (Open)
  - What it does: replaces a long chain of small PyTorch ops with a fused Triton kernel for bilinear position-embedding interpolation in ViT.
  - Impact: can substantially cut CPU overhead and kernel-launch counts in the ViT preprocessing stage, optimizing yet another link (preprocessing) in the multimodal pipeline.
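The RoBERTa bug class above is worth a minimal sketch (Python lists stand in for graph-captured tensors; the accumulation mechanics are the point, not vLLM's actual code): a persistent buffer updated in place is replayed on every CUDA-graph step, so the update compounds, while an out-of-place computation leaves the captured buffer untouched.

```python
buffer = [0, 0, 0, 0]   # persistent graph input buffer (simulated)
OFFSET = 2              # RoBERTa-style position offset (illustrative value)

def step_inplace():
    # Bug pattern: graph replay re-runs this in-place update each step,
    # so the offset accumulates across iterations.
    for i in range(len(buffer)):
        buffer[i] += OFFSET

def step_out_of_place():
    # Fix pattern: compute into a fresh output; the captured buffer
    # is never mutated, so every replay sees the same inputs.
    return [v + OFFSET for v in buffer]

step_inplace()
step_inplace()
accumulated = buffer[0]          # has drifted after two "replays"

buffer[:] = [0, 0, 0, 0]         # reset for the fixed variant
out = step_out_of_place()
out = step_out_of_place()
stable = out[0]                  # identical on every replay
```

The alternative fix, zeroing the padding region before each replay, attacks the same accumulation from the runner side instead of the model side.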
📈 Development Activity Observations
- Active contributors: AMD-affiliated contributors (aaab8b, mgehre-amd, gronsti-amd, and others) were highly active, landing several core fixes and optimization PRs that show AMD's sustained investment in vLLM.
- Community collaboration: several complex bugs (e.g. #37857, #37868) were deeply analyzed by users or contributors and triaged quickly with maintainers, forming an efficient diagnose-and-fix loop.
- Review and merging: 34 PRs were merged this period, including heavyweight changes such as ViT CUDA graphs, multiple ROCm fixes, and a quantization refactor, showing the core team actively absorbing community contributions.
- Focus areas: activity clearly clusters around performance (kernel fusion, CUDA graphs, speculative decoding), platform support (deep AMD/ROCm work), and model support (multimodal and new-architecture bug fixes).
💡 Issues Worth Watching
- Issue #37856: Shared Expert output is incorrect under Sequence Parallel MoE: when serving Qwen3.5 MoE models at scale, the shared-expert computation is wrong on the sequence-parallel path, producing garbled output. Impact: correctness of large-scale MoE training/inference with TP>1, DP>1, and EP enabled; needs an urgent fix.
- Issue #37909: “none” reasoning effort doesn’t do what it says it does (and may break output): the current reasoning_effort="none" implementation hides reasoning content rather than not generating it, a semantic problem that can also corrupt output. Impact: touches the soundness and backward compatibility of the reasoning-model API design; needs careful handling.
- Issue #37941: Using RIXL Connector on AMD GPU: exposes toolchain and documentation gaps for advanced deployment features (PD disaggregation) in the AMD ecosystem. Impact: hinders users from exercising vLLM's full capability in AMD production environments.
- Issue #37854: NGC vLLM 26.02 rejects Nemotron-3-Super-120B-A12B-NVFP4: the vLLM version inside the NGC container is incompatible with the model's quantization format (MIXED_PRECISION), and the upgrade path is blocked. Impact: users of the official NGC image cannot serve the latest quantized models, highlighting coordination problems between container releases and model release cadence.
- PR #37944: Revert “Zero-bubble async scheduling + spec decoding”: an important async-scheduling optimization was temporarily reverted after a Triton kernel introduced on the CPU backend caused widespread CI failures. Impact: performance work must mind multi-backend (CPU/GPU) compatibility; the feature's reintroduction is worth tracking.
📋 Appendix: Detailed Data
New Issues
- #37857 [Bug]: RoutedExpertsCapturer.capture() assertion failure with DP>1 when supports_internal_mk=True — no labels — by junjzhang (created: 2026-03-23 14:44 (UTC+8))
- #37858 [Bug]: does not have the attribute ‘FakeTensorMode’ — bug — by ACCEKLL (created: 2026-03-23 14:58 (UTC+8))
- #37951 [Bug]: KV Cache Memory Error with 262K Context on High VRAM Setup (Regression from Previous Version) — bug — by win10ogod (created: 2026-03-24 10:03 (UTC+8))
- #37928 [Usage]: v0.18.0 nvidia/nemotron-colembed-vl-4b-v2 /embeddings 404 — usage — by brandonbiggs (created: 2026-03-24 04:50 (UTC+8))
- #37876 [Bug]: After upgrading to v0.18.0, the logs no longer display token output speed — bug — by Formula24Code (created: 2026-03-23 17:43 (UTC+8))
- #37946 [Bug]: Minimax-M2.5 on version 0.17.0 results in an keyerror when the pipeline parallelism (PP) is greater than or equal to 2 — bug — by Rcity (created: 2026-03-24 09:15 (UTC+8))
- #37942 [Bug]: Editing values.labels in chart-helm breaks Service selector and leaves Endpoints empty — bug — by utsumi-fj (created: 2026-03-24 08:48 (UTC+8))
- #37941 [Usage]: Using RIXL Connector on AMD GPU — rocm,usage — by ssyrc (created: 2026-03-24 08:28 (UTC+8))
- #37849 [RFC]: Unify the function of getting device count — RFC — by wincent8 (created: 2026-03-23 12:31 (UTC+8))
- #37937 [Bug]: IndexError: prev_tool_call_arr list index out of range when streaming tool call hits max_tokens (openai parser) — bug — by onurdemircan-softtech (created: 2026-03-24 07:10 (UTC+8))
- #37934 [Bug]: Inflight bitsandbytes quanitzation error in GLM-4.6V-Flash — bug — by akowalsk (created: 2026-03-24 05:57 (UTC+8))
- #37933 [Bug]: v0.18.0 fails to run pipeline parallel across nodes — bug — by shunzhiwen (created: 2026-03-24 05:51 (UTC+8))
- #37931 [Bug]: flashinfer_cutedsl incompatible with all cross-node EP backends on GB200 NVL72 — bug — by qiching (created: 2026-03-24 05:40 (UTC+8))
- #37919 [Bug]: Qwen3.5-35B-A3B compile cache miss 100% on subgraphs. — bug,torch.compile — by zhxchen17 (created: 2026-03-24 03:45 (UTC+8))
- #37917 [Bug]: Qwen3.5 27b WorkerProc hit an exception ; CUDA error: an illegal memory access was encountered — bug — by Galigator (created: 2026-03-24 03:42 (UTC+8))
- #37908 [RFC]: Add Nunchaku SVDQuant W4A4 quantization backend — no labels — by ultranationalism (created: 2026-03-24 01:42 (UTC+8))
- #37909 [Bug]: “none” reasoning effort doesn’t do what it says it does (and may break output) — bug — by scwgoire (created: 2026-03-24 01:42 (UTC+8))
- #37842 _update_request_as_session does not update max_tokens from StreamingUpdate — no labels — by warren618 (created: 2026-03-23 11:31 (UTC+8))
- #37907 [Usage]: Unable to run Qwen3-14B with vLLM (multiple issues) — usage — by swap-debug (created: 2026-03-24 01:28 (UTC+8))
- #37850 [Performance]: Request vLLM input: FlashInfer JIT ops not registered as proper torch.ops custom ops, breaking torch.compile(fullgraph=True) — upstream fix in progress at flashinfer#2734 — performance — by Johnsonms (created: 2026-03-23 12:36 (UTC+8))
- #37900 [Bug]: Engine V1 crash with WorkerProc leaked shared_memory on multi-GPU setup — bug — by BioAGI-Moretti (created: 2026-03-23 23:12 (UTC+8))
- #37860 [Bug]: sleep mode not releasing GPU memory — bug — by uuddlrlrbaba-letsgo (created: 2026-03-23 15:15 (UTC+8))
- #37897 GPT-OSS structured output + reasoning grinds to a halt at long context — no labels — by fergusfinn (created: 2026-03-23 23:00 (UTC+8))
- #37890 [Bug]: NaNs in vLLM using DeepSeek-R1-0528-NVFP4-v2 and FlashInfer MLA — bug — by tlrmchlsmth (created: 2026-03-23 21:31 (UTC+8))
- #37868 [Bug]: bge-m3 /pooling endpoint breaks in the latest main branch — bug — by yanghui1-arch (created: 2026-03-23 16:52 (UTC+8))
- #37893 [Bug]: EBNF grammar not strictly enforced when n > 1 in parallel generation — bug — by kuailehaha (created: 2026-03-23 22:04 (UTC+8))
- #37883 [Bug] UVA CPU offload completely broken on WSL with NVFP4 MoE (Qwen3.5-35B-A3B): three distinct crashes across all parameter combinations — bug — by RalufRan (created: 2026-03-23 19:41 (UTC+8))
- #37886 [Doc]: The –rope-scaling parameter has taken effect in vLLM supports YaRN — documentation — by labAxiaoming (created: 2026-03-23 20:25 (UTC+8))
- #37855 [Bug]: Qwen3-VL-Embedding-8B Image embedding failed — bug — by nuclearwu (created: 2026-03-23 13:47 (UTC+8))
- #37846 [Bug]: 0.17.0rc1: tool calling broken with MTP enabled when deploying GLM-4.7 on A2 — bug — by samsuzhang (created: 2026-03-23 11:53 (UTC+8))
- #37856 [Bug]: Shared Expert output is incorrect under Sequence Parallel MoE (EP + TP > 1 + DP > 1) for Qwen3.5 MoE models — bug — by jQizhang (created: 2026-03-23 14:26 (UTC+8))
- #37854 [Bug]: NGC vLLM 26.02 rejects Nemotron-3-Super-120B-A12B-NVFP4 — quant_algo MIXED_PRECISION not in whitelist — no labels — by algal (created: 2026-03-23 13:32 (UTC+8))
- #37852 [Bug]: Phi qk_layernorm appears to be unsupported in vLLM — bug — by Qi-Zhan (created: 2026-03-23 13:05 (UTC+8))
- #37847 [Installation]: Documented v0.18.0 cu128 release wheel URL returns 404 — installation — by ouyangziheng (created: 2026-03-23 12:13 (UTC+8))
Closed Issues
- #19552 [Bug]: CPU fully saturated when using qwen2.5-omni for audio recognition — bug,stale — by cy565025164 (closed: 2026-03-24 10:14 (UTC+8))
- #36921 [Bug]: V1 engine hangs then crashes with “No available shared memory broadcast block found / RPC call to sample_tokens timed out” under chat completions burst load on Qwen3.5-122B-A10B — bug — by natech-persona (closed: 2026-03-23 19:36 (UTC+8))
- #37876 [Bug]: After upgrading to v0.18.0, the logs no longer display token output speed — bug — by Formula24Code (closed: 2026-03-24 09:23 (UTC+8))
- #37917 [Bug]: Qwen3.5 27b WorkerProc hit an exception ; CUDA error: an illegal memory access was encountered — bug — by Galigator (closed: 2026-03-24 03:58 (UTC+8))
- #37648 BGE-M3 /pooling endpoint crashes with split_with_sizes error after ~50-100 requests — no labels — by hslee-lunit (closed: 2026-03-23 23:15 (UTC+8))
- #37868 [Bug]: bge-m3 /pooling endpoint breaks in the latest main branch — bug — by yanghui1-arch (closed: 2026-03-23 22:59 (UTC+8))
- #36372 [Bug]: LoRA on Qwen-3.5-27B fails to run — bug — by Nero10578 (closed: 2026-03-23 20:25 (UTC+8))
- #37546 [Bug]: CPU backend crashes with TypeError: 'function' object is not subscriptable on first inference request — bug,cpu — by fyuan1316 (closed: 2026-03-23 19:43 (UTC+8))
- #37804 [Bug] DeepGemm E8M0 scale format causes accuracy degradation for Qwen3.5 FP8 on Blackwell — bug — by vadiklyutiy (closed: 2026-03-23 16:43 (UTC+8))
- #37400 [Bug]: JAIS: ALiBi is applied even when position_embedding_type=”learned” — bug,good first issue — by Qi-Zhan (closed: 2026-03-23 15:36 (UTC+8))
- #36481 [Performance]: 2-stage custom allreduce (TP4) bandwidth lagging behind NCCL for large message sizes — performance — by shenyt-sanshui (closed: 2026-03-23 14:33 (UTC+8))
- #37404 [Bug]: AssertionError: assert num_kv_heads == 1 with CPU KV Offloading + GLM-5-FP8 — bug — by yz342 (closed: 2026-03-23 12:13 (UTC+8))
New PRs
- #37956 [Deprecate] Deprecate pooling multi task support. — documentation,frontend — by noooop (created: 2026-03-24 11:28 (UTC+8))
- #37899 [Frontend][Bugfix] Pass default_chat_template_kwargs to AnthropicServingMessages — bug,frontend,ready — by jetxa (created: 2026-03-23 23:08 (UTC+8))
- #37955 [OpenAI] Fix MiniMax M2 reasoning token usage accounting — frontend — by QwertyJack (created: 2026-03-24 11:25 (UTC+8))
- #37915 [Bugfix][Frontend] Pass default_chat_template_kwargs to Anthropic endpoint — bug,frontend — by vinnybad (created: 2026-03-24 02:56 (UTC+8))
- #37954 [Parser] Pass tools via ToolParser.init instead of reading from request — frontend,tool-calling,qwen,deepseek — by sfeng33 (created: 2026-03-24 11:20 (UTC+8))
- #37861 [Frontend] Remove pooling multi task support. (Hold off until v0.20.0) — frontend,ready — by noooop (created: 2026-03-23 15:19 (UTC+8))
- #37953 replace torch.cuda.set_stream/current_stream with accelerator api — documentation,speculative-decoding,v1,kv-connector,nvidia — by xinyu-intel (created: 2026-03-24 11:19 (UTC+8))
- #37888 [XPU] Optimize XPU ops using latest vllm-xpu-kernels — no labels — by xwu-intel (created: 2026-03-23 20:54 (UTC+8))
- #37952 fix(security): Add VLLM_MAX_N_SEQUENCES environment variable and enforce limit — documentation — by jperezdealgaba (created: 2026-03-24 11:06 (UTC+8))
- #37879 fix(moe): fix RoutedExpertsCapturer assertion failure with DP>1 and MK path — no labels — by Young-Leo (created: 2026-03-23 18:25 (UTC+8))
- #37914 [Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc — documentation,nvidia — by b-mu (created: 2026-03-24 02:25 (UTC+8))
- #37935 [Parser] Remove unused request parameter from extract_reasoning() — documentation,frontend,qwen,deepseek — by sfeng33 (created: 2026-03-24 05:58 (UTC+8))
- #37911 [Bugfix] Suppress spurious CPU KV cache warning in launch render — bug,frontend,ready — by sagearc (created: 2026-03-24 02:05 (UTC+8))
- #37892 [Bugfix] Fix concat_mla_q kernel for float32 dtype — bug,ready — by xyang16 (created: 2026-03-23 21:41 (UTC+8))
- #37936 [BugFix][kv_offload] Reduce memory blocks allocated for CPU offload — bug,v1 — by jonathanc-n (created: 2026-03-24 06:37 (UTC+8))
- #37938 [Frontend] Add pluggable store backends for Responses API persistence — documentation,frontend — by nicko170 (created: 2026-03-24 07:28 (UTC+8))
- #37943 [NIXL] Strengthen TpKVTopology validation — v1,kv-connector — by yzong-rh (created: 2026-03-24 08:51 (UTC+8))
- #37913 Downsize CPU jobs to use small queue — ci/build — by khluu (created: 2026-03-24 02:25 (UTC+8))
- #37949 [Attention][GPT-OSS] Add L40S/SM89 CUDA policy and sink-aware Triton tuning — v1,gpt-oss,nvidia — by will-deines (created: 2026-03-24 10:00 (UTC+8))
- #37950 [Bugfix] Restore stats logging for multi-server mode — bug,frontend,v1 — by khairulkabir1661 (created: 2026-03-24 10:01 (UTC+8))
- #37906 [ROCm][CI] Split Entrypoints Integration (API Server 1) into 3 jobs — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-24 01:28 (UTC+8))
- #37948 [Perf] triton bilinear_pos_embed kernel for ViT — performance,qwen — by zhandaz (created: 2026-03-24 09:48 (UTC+8))
- #37947 [DRAFT][XPU] Upgrade torch 2.11 for xpu — ci/build — by jikunshang (created: 2026-03-24 09:35 (UTC+8))
- #37895 [CI] Add batch invariant test: Block FP8 + small MOE — ready,ci/build — by yewentao256 (created: 2026-03-23 22:43 (UTC+8))
- #37932 [Model Runner V2] Gather multimodal embeddings before draft model postprocess — ready,v1 — by TheEpicDolphin (created: 2026-03-24 05:46 (UTC+8))
- #37944 Revert “[Async][Spec Decoding] Zero-bubble async scheduling + spec decoding” (#32951) — speculative-decoding,v1 — by zhewenl (created: 2026-03-24 09:08 (UTC+8))
- #37945 Revert “[Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision” (#36725) — bug,nvidia — by zhewenl (created: 2026-03-24 09:09 (UTC+8))
- #37940 [NIXL][BUG] Fix Triton and Gemma heterogeneous TP — bug,v1,kv-connector — by yzong-rh (created: 2026-03-24 08:01 (UTC+8))
- #37918 [LinearAttention] Introduce non_spec_query_start_loc_cpu in GDN metadata — v1 — by Jialin (created: 2026-03-24 03:45 (UTC+8))
- #37939 fix(security): replace unsafe eval() in tool-calling examples — documentation — by xr843 (created: 2026-03-24 07:30 (UTC+8))
- #37923 [Bugfix] Force continuous usage stats when CLI override is enabled — bug,frontend,ready — by dsingal0 (created: 2026-03-24 04:06 (UTC+8))
- #37926 Make microbatch optimization (DBO) work with general models — ready,v1 — by 0xjunhao (created: 2026-03-24 04:45 (UTC+8))
- #37929 [Core] Use standalone autograd_cache_key for compilation dedup optimization — no labels — by frgossen (created: 2026-03-24 05:02 (UTC+8))
- #37930 [ROCm][CI] Add uv pip compile workflow for rocm-test.txt lockfile — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-24 05:22 (UTC+8))
- #37925 [Core] CUDA Checkpoint/Restore — Phase 2: Engine/Executor/API Integration — frontend,v1,nvidia — by elizabetht (created: 2026-03-24 04:24 (UTC+8))
- #37927 [ROCm][CI] Remove redundant common.txt from rocm-test.txt — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-24 04:46 (UTC+8))
- #37920 [Bugfix] Pass hf_token through config loading paths for gated model support — bug — by javierdejesusda (created: 2026-03-24 03:46 (UTC+8))
- #37924 [ROCm][CI][PD] Add Hybrid SSM integration tests to CI — rocm,ready,ci/build,kv-connector — by AndreasKaratzas (created: 2026-03-24 04:16 (UTC+8))
- #37922 [Draft] Fix transitional step bug — bug,structured-output,v1,deepseek,gpt-oss — by southfreebird (created: 2026-03-24 04:04 (UTC+8))
- #37921 [Core] CUDA Checkpoint/Restore — Phase 1: C Extension, Python Wrapper, Worker Methods — ci/build,v1,nvidia — by elizabetht (created: 2026-03-24 03:51 (UTC+8))
- #37916 tests/v1/e2e/spec_decode: assert async scheduling is used — v1 — by puririshi98 (created: 2026-03-24 03:23 (UTC+8))
- #37910 Making spec decode testing nightly — ci/build — by puririshi98 (created: 2026-03-24 02:00 (UTC+8))
- #37912 [Bugfix] Fuse Qwen3.5 in_qkvz_proj forwarding with LoRA enabled — bug,qwen — by Isotr0py (created: 2026-03-24 02:12 (UTC+8))
- #37902 [Mypy] Better fixes for the mypy issues in vllm/config — documentation,performance,ready,multi-modality — by hmellor (created: 2026-03-23 23:48 (UTC+8))
- #37889 [CI] Add Qwen3.5 quantized model GSM8K CI evals for Blackwell — ready,qwen — by vadiklyutiy (created: 2026-03-23 21:09 (UTC+8))
- #37843 fix(scheduler): update max_tokens from StreamingUpdate in session — v1 — by warren618 (created: 2026-03-23 11:31 (UTC+8))
- #37882 [CI] split Entrypoints Integration (API Server 1) into 3 jobs — ready,ci/build — by jikunshang (created: 2026-03-23 19:13 (UTC+8))
- #37885 Canonical KV Cache Allocation for HMA Models — v1,kv-connector — by Etelis (created: 2026-03-23 19:56 (UTC+8))
- #37905 fusing dynamic quantization in fused_moe_lora_fp8 to keep PDL on — no labels — by yugong333 (created: 2026-03-24 00:49 (UTC+8))
- #37874 [KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package — ready,v1 — by ronensc (created: 2026-03-23 17:28 (UTC+8))
- #37904 [Mypy] Fix mypy for vllm/model_executor (except vllm/model_executor/layers) — ready — by hmellor (created: 2026-03-24 00:40 (UTC+8))
- #37903 nano_nemotron_vl: suppress readonly torch.from_numpy() warning in image and video resize paths — no labels — by netanel-haber (created: 2026-03-24 00:13 (UTC+8))
- #37880 [Feature] Support per-draft-model MoE backend via --speculative-config — speculative-decoding,v1 — by askliar (created: 2026-03-23 18:37 (UTC+8))
- #37901 [compile] Add a fast serializer for aot save/load. — no labels — by zhxchen17 (created: 2026-03-23 23:16 (UTC+8))
- #37887 [ROCm][perf] fix Aiter sparse MLA with MTP>1 — rocm,speculative-decoding,v1 — by gronsti-amd (created: 2026-03-23 20:29 (UTC+8))
- #37896 [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding — bug,needs-rebase,nvidia — by xueliangyang-oeuler (created: 2026-03-23 22:52 (UTC+8))
- #37866 fix: handle FakeTensorMode patching for PyTorch compatibility — no labels — by CMLKevin (created: 2026-03-23 16:15 (UTC+8))
- #37884 [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding — bug,ready,nvidia — by he-yufeng (created: 2026-03-23 19:44 (UTC+8))
- #37898 [Hybrid] Marconi-style admission policy for hybrid cache — v1 — by s3woz (created: 2026-03-23 23:05 (UTC+8))
- #37873 [Bugfix] RoBERTa position_id accumulation in CUDA graph padding region — bug,ready,v1,nvidia — by yanghui1-arch (created: 2026-03-23 17:25 (UTC+8))
- #37891 [ROCm][Perf] Add fused AllReduce+RMSNorm for DeepSeek on MI355X — rocm,deepseek,nvidia — by attila-dusnoki-htec (created: 2026-03-23 21:41 (UTC+8))
- #37894 [Feature] Add prefill node failure detection and health query endpoint for MooncakeConnector Proxy — documentation,kv-connector — by SSmallMonster (created: 2026-03-23 22:04 (UTC+8))
- #37869 Optimize XPU ops using latest vllm-xpu-kernels — no labels — by xwu-intel (created: 2026-03-23 16:55 (UTC+8))
- #37872 Add sanitized MiniMax M2 parser variant for path-like output — tool-calling — by SamerAlshaer1991 (created: 2026-03-23 17:08 (UTC+8))
- #37877 [Bugfix][LoRA] Fix incorrect LoRA Log — bug,ready — by jeejeelee (created: 2026-03-23 17:47 (UTC+8))
- #37881 Auto mtp — documentation,frontend,speculative-decoding,needs-rebase,ci/build,v1,llama — by Zacks917 (created: 2026-03-23 18:54 (UTC+8))
- #37859 [Bugfix][Core] Ignore stale KV xfer callbacks after request cleanup (#37837) — bug,v1 — by rohithj7 (created: 2026-03-23 15:11 (UTC+8))
- #37878 docs: add WSL troubleshooting guide — documentation — by CChen19 (created: 2026-03-23 18:20 (UTC+8))
- #37844 [Draft][XPU] add gptq marlin support — no labels — by jikunshang (created: 2026-03-23 11:32 (UTC+8))
- #37871 [Bugfix][Perf]: avoid Range allocation and dict hashing in _find_range_for_shape — bug — by KrxGu (created: 2026-03-23 16:57 (UTC+8))
- #37864 feat: Add OpenAI Conversations API support — frontend — by bongwoobak (created: 2026-03-23 15:55 (UTC+8))
- #37875 [Frontend] Use unified Parser for chat completions non-streaming — frontend — by MatteoFari (created: 2026-03-23 17:29 (UTC+8))
- #37845 [Bugfix] Fix GLM tool-call finish chunk suffix alignment in streaming — bug,frontend — by QwertyJack (created: 2026-03-23 11:33 (UTC+8))
- #37848 [Reasoning] Add structural tag support for reasoning parser via sampling params — frontend,ready — by rishitdholakia13 (created: 2026-03-23 12:14 (UTC+8))
- #37870 fix: add qk_layernorm support for Phi models — no labels — by gambletan (created: 2026-03-23 16:56 (UTC+8))
- #37867 Fix incorrect tokenizer source path in RunAI object storage pull (#37836) — no labels — by alvinttang (created: 2026-03-23 16:51 (UTC+8))
- #37863 [Misc]Update gitignore — ready — by wangxiyuan (created: 2026-03-23 15:42 (UTC+8))
- #37865 Fix cudagraph max capture size upper bound — nvidia — by weireweire (created: 2026-03-23 15:58 (UTC+8))
- #37862 docs: EU AI Act compliance guide for vLLM deployers — documentation — by BipinRimal314 (created: 2026-03-23 15:37 (UTC+8))
- #37853 [kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models — v1,kv-connector — by orozery (created: 2026-03-23 13:27 (UTC+8))
- #37851 update doc for online fp8 quantization — documentation,ready — by yma11 (created: 2026-03-23 12:47 (UTC+8))
Merged PRs
- #37487 [V0 Deprecation] Refactor kv cache from list to element — rocm,ready,v1,qwen,deepseek,kv-connector — by yewentao256 (merged: 2026-03-24 11:10 (UTC+8))
- #37906 [ROCm][CI] Split Entrypoints Integration (API Server 1) into 3 jobs — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-24 09:48 (UTC+8))
- #37895 [CI] Add batch invariant test: Block FP8 + small MOE — ready,ci/build — by yewentao256 (merged: 2026-03-24 09:16 (UTC+8))
- #37932 [Model Runner V2] Gather multimodal embeddings before draft model postprocess — ready,v1 — by TheEpicDolphin (merged: 2026-03-24 09:14 (UTC+8))
- #36803 [Test] E2E Nemotron-3-Super tests — ready,ci/build,nvidia — by roikoren755 (merged: 2026-03-24 08:49 (UTC+8))
- #37016 [CI] Split V1 Others into 3 separate jobs — ready,ci/build — by khluu (merged: 2026-03-24 06:44 (UTC+8))
- #35007 [Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning — bug,ready,v1,nvidia — by WindChimeRan (merged: 2026-03-24 06:31 (UTC+8))
- #32951 [Async][Spec Decoding] Zero-bubble async scheduling + spec decoding — speculative-decoding,ready,v1 — by MatthewBonanni (merged: 2026-03-24 03:37 (UTC+8))
- #36728 [Bug][MoE] Strengthen _supports_current_device() checks in the TRTLLM FP8, NVFP4, and FlashInfer CuteDSL MoE experts — bug,documentation,rocm,frontend,ready,ci/build,v1,multi-modality,tool-calling,llama — by yzong-rh (merged: 2026-03-24 05:02 (UTC+8))
- #36725 [Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision — bug,ready,nvidia — by robertgshaw2-redhat (merged: 2026-03-24 04:19 (UTC+8))
- #36799 [Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels — performance,ready,ci/build,nvidia — by kylesayrs (merged: 2026-03-24 04:03 (UTC+8))
- #37812 [MRV2] Consider spec decoding in warmup — ready,v1 — by WoosukKwon (merged: 2026-03-24 01:45 (UTC+8))
- #37882 [CI] split Entrypoints Integration (API Server 1) into 3 jobs — ready,ci/build — by jikunshang (merged: 2026-03-24 01:37 (UTC+8))
- #37657 [CI][PD] Add Hybrid SSM integration tests to CI — ready,ci/build,v1,kv-connector — by NickLucche (merged: 2026-03-23 23:58 (UTC+8))
- #37609 Use lazy graph module during split_module to defer recompile() — ready,torch.compile — by angelayi (merged: 2026-03-23 23:21 (UTC+8))
- #37884 [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding — bug,ready,nvidia — by he-yufeng (merged: 2026-03-23 23:15 (UTC+8))
- #37873 [Bugfix] RoBERTa position_id accumulation in CUDA graph padding region — bug,ready,v1,nvidia — by yanghui1-arch (merged: 2026-03-23 22:59 (UTC+8))
- #37808 [Mypy] Fix mypy for vllm/config — performance,ready,v1,nvidia — by yewentao256 (merged: 2026-03-23 22:34 (UTC+8))
- #37533 [ROCm] fix sleep mode not releasing GPU memory problem on ROCm — rocm,ready — by aaab8b (merged: 2026-03-23 21:07 (UTC+8))
- #37834 [Test] Consolidate tool parser unit tests to tests/tool_parsers — ready,tool-calling,llama — by bbrowning (merged: 2026-03-23 12:24 (UTC+8))
- #37877 [Bugfix][LoRA] Fix incorrect LoRA Log — bug,ready — by jeejeelee (merged: 2026-03-23 19:42 (UTC+8))
- #37550 [Bugfix] Fix CPU backend crash in KV cache block zeroing — bug,ready,v1,cpu — by DorBernsohn (merged: 2026-03-23 19:35 (UTC+8))
- #37784 [XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle — ready — by jikunshang (merged: 2026-03-23 19:10 (UTC+8))
- #37498 [Frontend][Responses API] Fix arrival_time recording for TTFT on initial request — documentation,frontend,ready,gpt-oss,meta-exported,fb-exported — by qandrew (merged: 2026-03-23 17:58 (UTC+8))
- #32929 [FP8]add FP8 WoQ kernel abstraction. — performance,ready,ci/build,v1,nvidia — by jikunshang (merged: 2026-03-23 17:47 (UTC+8))
- #37863 [Misc]Update gitignore — ready — by wangxiyuan (merged: 2026-03-23 16:14 (UTC+8))
- #36100 [ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs — rocm,ready,v1 — by ChuanLi1101 (merged: 2026-03-23 15:48 (UTC+8))
- #37338 [Perf] [Bugfix] Fix Triton autotuning in inference for Qwen3.5 — bug,ready,qwen — by arpera (merged: 2026-03-23 15:37 (UTC+8))
- #37810 [Bugfix] Store Qwen3Next A_log in fp32 — bug,ready,qwen — by effortprogrammer (merged: 2026-03-23 15:36 (UTC+8))
- #37820 [Bugfix] JAIS: Only apply ALiBi when position_embedding_type=’alibi’ — bug,ready — by r266-tech (merged: 2026-03-23 15:36 (UTC+8))
- #36505 [ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel — rocm,ready — by mgehre-amd (merged: 2026-03-23 15:36 (UTC+8))
- #37851 update doc for online fp8 quantization — documentation,ready — by yma11 (merged: 2026-03-23 13:19 (UTC+8))
- #35963 [Feature] ViT Full CUDA Graph — ready,v1,multi-modality,qwen,nvidia — by b-mu (merged: 2026-03-23 13:01 (UTC+8))
- #37816 [CI/Build][LoRA] Update Qwen35 LoRA testing — ready,ci/build,qwen — by jeejeelee (merged: 2026-03-23 12:55 (UTC+8))
PRs Closed Without Merging
- #37357 [Bugfix] Fix elastic EP scale-up after scale-down — bug,v1 — by tzulingk (closed: 2026-03-24 11:01 (UTC+8))
- #37069 [Test][Nixl] Add YAML-driven test runner for PD accuracy configs — needs-rebase,ci/build,v1,kv-connector — by yzong-rh (closed: 2026-03-24 10:09 (UTC+8))
- #35888 [CI] Fix mypy for vllm/config — v1,nvidia — by hickeyma (closed: 2026-03-23 23:53 (UTC+8))
- #37229 Fix Qwen3.5-Next RMSNormGated Initialization Error on TPU — qwen — by jrplatin (closed: 2026-03-24 06:08 (UTC+8))
- #36309 [BugFix] Fix Qwen3.5 LoRA IndexError in GDN fused projections — bug,ready,needs-rebase,qwen — by JWriter20 (closed: 2026-03-24 04:57 (UTC+8))
- #35251 [compile][graph_partition] Remove unused subgraph inputs after split_module — needs-rebase — by fxdawnn (closed: 2026-03-24 04:54 (UTC+8))
- #37135 [Bugfix] Fix FP16 overflow in NVFP4 Marlin kernel epilogue and forward input_global_scale on SM75 — bug,performance — by saifmb0 (closed: 2026-03-24 02:47 (UTC+8))
- #37101 [Test] — frontend — by liuchenbing2026 (closed: 2026-03-24 00:55 (UTC+8))
- #37869 Optimize XPU ops using latest vllm-xpu-kernels — no labels — by xwu-intel (closed: 2026-03-23 21:24 (UTC+8))
- #37652 Fix: Handle $ref in json-schema in qwen3_coder tool parser — tool-calling,qwen — by schoennenbeck (closed: 2026-03-23 20:28 (UTC+8))
- #35716 [Bugfix] Qwen3Coder streaming tool call JSON missing opening brace in arguments — bug,needs-rebase,tool-calling,qwen — by KrxGu (closed: 2026-03-23 19:33 (UTC+8))
- #35144 [ROCm] Enable GPTQMarlinConfig on ROCm to use choose_mp_linear_kernel — rocm,needs-rebase — by mgehre-amd (closed: 2026-03-23 19:08 (UTC+8))
- #37881 Auto mtp — documentation,frontend,speculative-decoding,needs-rebase,ci/build,v1,llama — by Zacks917 (closed: 2026-03-23 18:58 (UTC+8))
- #37411 fix(jais): only apply ALiBi when position_embedding_type is ‘alibi’ — needs-rebase — by jigangz (closed: 2026-03-23 18:35 (UTC+8))
- #37864 feat: Add OpenAI Conversations API support — frontend — by bongwoobak (closed: 2026-03-23 18:12 (UTC+8))
- #37281 [Bugfix][Perf]: avoid Range allocation and dict hashing in _find_range_for_shape — bug — by KrxGu (closed: 2026-03-23 16:43 (UTC+8))
- #37806 [Bugfix] Auto-disable DeepGemm for Qwen3.5 on Blackwell to fix FP8 accuracy degradation — bug,qwen — by vadiklyutiy (closed: 2026-03-23 16:43 (UTC+8))
- #37621 [Bugfix] JAIS: Only apply ALiBi when position_embedding_type is alibi — bug — by simpx (closed: 2026-03-23 16:18 (UTC+8))
- #37788 [Refactor] converge xxx_config to vllm_config in async_llm — frontend,v1 — by andyxning (closed: 2026-03-23 13:10 (UTC+8))
- #37578 [Bugfix] Fix unclean shutdown from Ctrl-C with AR Fusion — bug — by simpx (closed: 2026-03-23 12:43 (UTC+8))