vLLM Development Digest - 2026-03-06
Time window: 2026-03-06 11:10 (UTC+8) ~ 2026-03-07 11:10 (UTC+8). Stats: 26 new issues | 17 issues closed | 75 new PRs | 42 PRs merged | 43 PRs closed without merging
📊 Daily Development Summary
During this period (2026-03-06 to 2026-03-07), the vLLM project remained highly active, with 26 new issues, 75 new PRs, and 42 PRs merged. Development focused on performance optimization, bug fixes, and hardware-ecosystem support. Highlights include several fixes and optimizations for the AMD ROCm platform, along with discussions around the Qwen model family, distributed-execution stability, and the speculative-decoding architecture.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was frequent this period, spanning core bug fixes, performance optimization, and issue triage:
- PR #36278 - Quark-quantized weight loading support for Step-3.5-Flash
  - Contributor: ColinZ22 (AMD-affiliated, judging by username)
  - Content: Fixes loading of legacy-format (pre-Transformers v5) MoE weights exported by AMD's Quark quantization tool (and other tools), and adds the missing packed_modules_mapping attribute, so that quantized Step-3.5-Flash models run correctly in vLLM.
  - Impact: Directly improves vLLM's compatibility with model files produced by the AMD Quark toolchain, benefiting AMD-ecosystem users deploying quantized MoE models (such as the Step series) on vLLM.
- PR #36232 - Robustness fix for the Quark OCP MX quantization parser
  - Contributor: xuebwang-amd (AMD employee)
  - Content: Fixes a bug in the Quark OCP MX (weight-only) quantization config parser, making it robust to input_quant_spec being None. The fix ensures that OCP MXFP4 quantization of models such as Qwen3-30B runs correctly on AMD platforms.
  - Impact: Improves the stability and reliability of the AMD-specific OCP MX quantization format in vLLM.
- Issue #36228 (comment) - Qwen3.5 deployment failure in AMD's prebuilt environment
  - User feedback: User Abathur284284284 reported that deploying Qwen3.5 in AMD's official ROCm 7.0 prebuilt environment fails immediately because the bundled transformers version is too old, suggesting that AMD's prebuilt images may not keep pace with the latest model-architecture support.
  - Impact: Exposes a possible lag in the AMD software stack's tracking of new model support, hurting the user experience.
- Issue #35828 (closed) - Qwen3-Next accuracy regression on AMD
  - Summary: This issue is now closed. User jennyyyyzhen initially reported that Qwen3-Next-80B's GSM8k accuracy on MI300x dropped from ~85% to ~50%. Investigation traced the regression to the VLLM_ROCM_USE_AITER_TRITON_ROPE=1 flag introduced in PR #35180; PR #35601 disabled the flag by default, resolving the issue.
  - Impact: Demonstrates the AMD team's rapid iteration on AITER kernels and responsiveness to issues, balancing performance optimization against numerical accuracy by adjusting an experimental feature flag.
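The parser-robustness fix in PR #36232 boils down to tolerating a missing activation-quantization spec. Below is a minimal sketch of the pattern, with hypothetical class and function names (QuantSpec, parse_ocp_mx_scheme); the actual Quark parser in vLLM uses different structures.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for Quark OCP MX config objects; the real
# vLLM/Quark classes and field names differ.
@dataclass
class QuantSpec:
    dtype: str              # e.g. "mxfp4"
    is_dynamic: bool = True

def parse_ocp_mx_scheme(weight_spec: QuantSpec,
                        input_spec: Optional[QuantSpec]) -> str:
    """Derive a scheme name, tolerating input_spec=None (weight-only quant).

    A naive parser that dereferences input_spec.dtype unconditionally would
    crash on weight-only checkpoints, which is the failure mode the PR fixed.
    """
    if input_spec is None:
        # Weight-only quantization: activations stay in the unquantized dtype.
        return f"w_{weight_spec.dtype}_a_fp16"
    return f"w_{weight_spec.dtype}_a_{input_spec.dtype}"
```

The design point is simply to treat the absence of an input spec as a valid, first-class configuration rather than an error path.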
💬 High-Traffic Discussion Analysis
- Issue #36228: Multiprocessing executor produces corrupted vision encoder outputs for Qwen3-VL
  - Core issue: In vLLM v0.12.0, the default multiprocessing (mp) executor backend corrupts the vision encoder outputs of Qwen3-VL, while the Ray backend works correctly. This is a regression from v0.11.0 to v0.12.0.
  - Viewpoints:
    - Reporter (AGENDD): Performed a root-cause analysis, finding that the mp executor, unlike the ray executor, does not set the correct CUDA_VISIBLE_DEVICES environment variable for each worker process, causing F.linear (cuBLAS) in PyTorch 2.9.0 to be numerically unstable under certain conditions.
    - Proposed fix: AGENDD offered three repair options and favors the most fundamental one: have the mp executor set CUDA_VISIBLE_DEVICES correctly when initializing worker processes, eliminating the behavioral difference from the ray backend.
  - Current status: Still open. AGENDD has a test suite ready and is awaiting maintainer input before submitting a PR.
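The fix direction AGENDD favors, restricting each worker to its own device via the environment, can be sketched as follows. This is an illustrative helper (worker_env is a hypothetical name, not vLLM's code): each spawned worker gets a CUDA_VISIBLE_DEVICES containing only its own GPU, so device 0 inside every process is that worker's GPU, matching the ray backend's behavior.

```python
def worker_env(base_env: dict, local_rank: int, world_size: int) -> dict:
    """Build the environment for one worker process (hypothetical sketch).

    If the parent already restricts visibility (CUDA_VISIBLE_DEVICES set),
    pick the local_rank-th entry of that list; otherwise assume GPUs
    0..world_size-1. Each worker then sees exactly one device.
    """
    env = dict(base_env)
    all_gpus = env.get(
        "CUDA_VISIBLE_DEVICES",
        ",".join(str(i) for i in range(world_size)),
    ).split(",")
    # Restrict the child to a single GPU before it initializes CUDA.
    env["CUDA_VISIBLE_DEVICES"] = all_gpus[local_rank]
    return env
```

The key constraint is that the variable must be set before the worker process initializes CUDA, which is why it has to happen at spawn time rather than inside the worker.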
- Issue #36219: Speculative Decoding Proposer Interface Unification Proposal
  - Core issue: User wangxiyuan submitted a detailed RFC to unify the currently fragmented speculative-decoding interfaces, addressing maintainability, extensibility, and the difficulty of integrating hardware plugins.
  - Viewpoints:
    - Proposer (wangxiyuan): Enumerated three main problems - inconsistent interfaces, heavy coupling with the model runner, and difficult hardware extension - and proposed a unified interface built around a BaseProposer abstract class, plus an architecture that moves algorithm-scheduling logic out of the model runner.
    - Maintainer (benchislett): Acknowledged the RFC and will review it in detail next week.
  - Point of contention: This is a large architectural refactoring proposal and is expected to spark active discussion about backward compatibility, implementation complexity, and prioritization.
  - Current status: The RFC is open, awaiting detailed review by the core team and community feedback.
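To make the BaseProposer idea concrete, here is a minimal sketch of what such an abstract interface could look like. The method names and the toy NgramProposer below are assumptions for illustration; the RFC's actual signatures may differ.

```python
from abc import ABC, abstractmethod

class BaseProposer(ABC):
    """Hypothetical unified interface for speculative-decoding proposers,
    sketching the RFC's direction: every algorithm (EAGLE, n-gram, MTP, ...)
    implements the same two entry points, so the model runner no longer
    needs per-algorithm branches."""

    @abstractmethod
    def load_model(self, target_model) -> None:
        """Initialize draft weights, possibly sharing layers with the target."""

    @abstractmethod
    def propose(self, input_ids: list, num_speculative_tokens: int) -> list:
        """Return draft token ids for the verifier to accept or reject."""

class NgramProposer(BaseProposer):
    """Toy model-free proposer: naively repeat the last token."""

    def load_model(self, target_model) -> None:
        pass  # no draft weights needed for an n-gram-style proposer

    def propose(self, input_ids: list, num_speculative_tokens: int) -> list:
        return [input_ids[-1]] * num_speculative_tokens
```

The payoff of such an interface is exactly what the RFC argues: hardware plugins can register their own BaseProposer subclasses without touching the model runner.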
- Issue #36266: Add MLA + Quant support in vLLM
  - Core issue: Requests quantization support, especially FP8, for DeepSeek MLA (Multi-head Latent Attention).
  - Viewpoints:
    - Author (baonudesifeizhai): Noted that the FlashInfer library already ships MLA kernels and suggested vLLM integrate them to enable MLA+Quant support.
    - Participants (dorhuri123, carlyou): Both volunteered to take on the task and collaborate.
    - Participant (ProExpertProg): Pointed out that the issue duplicates #35792, and clarified that the cited existing kernels may not fully solve the core problem of MLA quantization.
  - Outcome: The issue was quickly closed as a duplicate; related work is proceeding in #35792 and follow-up PRs such as #36205.
- Issue #36236: vllm fails to load continual pre-trained Qwen3.5-MoE model
  - Core issue: Qwen3.5-MoE models continually pre-trained (CPT) with Transformers 5.x cannot be loaded by vLLM, because newer Transformers versions renamed the config class from Qwen3_5MoeConfig to Qwen3_5MoeTextConfig.
  - Viewpoints:
    - Reporter (itacora): Believed vLLM's internal references to the config class need updating.
    - Participant (MatteoFari): Clarified that vLLM already has Qwen3_5MoeTextConfig; the real problem is that text-only MoE checkpoints are mis-routed onto the multimodal path, so the fix should add proper support for the text-only MoE model type.
    - Reporter (itacora): Agreed with MatteoFari's analysis.
  - Current status: Open; the agreed direction is to improve the model-type routing logic rather than simply renaming a class.
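The routing fix MatteoFari describes amounts to dispatching on whether a checkpoint's config is multimodal, not on its family name. The sketch below is purely illustrative: the function and the vision_config heuristic are assumptions, and the VL class name is hypothetical; only Qwen3_5MoeForCausalLM appears in this period's data (PR #36289 registers it for text-only checkpoints).

```python
def resolve_architecture(hf_config: dict) -> str:
    """Hypothetical routing sketch for #36236: send text-only MoE
    checkpoints down the text path instead of the multimodal path."""
    model_type = hf_config.get("model_type", "")
    # Illustrative heuristic: a multimodal checkpoint carries a vision_config.
    has_vision = "vision_config" in hf_config
    if model_type.startswith("qwen3_5") and "moe" in model_type:
        if has_vision:
            return "Qwen3_5MoeVLForCausalLM"  # hypothetical VL class name
        # Text-only MoE checkpoints (e.g. Transformers 5.x
        # Qwen3_5MoeTextConfig) must not take the multimodal path.
        return "Qwen3_5MoeForCausalLM"
    raise ValueError(f"unrecognized model_type: {model_type}")
```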
- Issue #36217: The checkpoint you are trying to load has model type qwen3_5 but Transformers does not recognize this architecture
  - Core issue: vLLM v0.16.0 cannot load Qwen3.5 models.
  - Viewpoints:
    - Reporter & participants: Multiple users hit the same error, presumably because vLLM v0.16.0 does not yet support Qwen3.5.
    - Participant (Abathur284284284): Additionally reported that deployment in AMD's ROCm 7.0 prebuilt environment fails due to an outdated transformers version.
  - Current status: Open. The issue shows strong community demand for rapid support of new model architectures, and also exposes dependency-management problems across deployment environments (such as AMD's prebuilt images).
🔥 Hot Topics and Trends
- Concentrated burst of Qwen-family issues: Multiple issues involve Qwen3/Qwen3.5 (including MoE and VL variants): loading failures (#36217, #36236), quantized-model crashes (#36249), and performance comparisons (#36215). The model family's ecosystem is very active, but keeping vLLM's integration current remains a challenge requiring sustained follow-up.
- Distributed-execution stability challenges: Hangs under the Ray compiled-DAG executor (#36237) and high-concurrency crashes in multi-node deployments (#36221) underscore that the robustness of the distributed execution engine remains a key investment area for large-scale, high-concurrency production use.
- Performance optimization and fusion: The community shows strong enthusiasm for performance work, particularly hardware-specific tuning (e.g. the MoE tuning config for NVIDIA Blackwell GB10, #36273) and operator fusion (e.g. MLA+Quant fusion in #36205, #36277, #36297).
- Continuous AMD/ROCm improvement: From weight loading and quantization parsing to attention-backend validation and CI/CD, AMD contributors were very active this period, showing sustained investment in the vLLM experience on MI300/MI355-series GPUs.
🛠️ Key Technical Changes
- PR #36185: Reenable features for ROCm attention backends: After the refactoring of ROCm attention-backend selection logic, re-enables support for features such as FP8 KV cache, fixing backends being wrongly rejected by validation and restoring advanced-feature availability on AMD platforms.
- PR #35758: Support HMA+NixlConnector: Enables NixlConnector to work with the hybrid attention manager (HMA). For models using hybrid attention (e.g. Flash Attention + Sliding Window Attention), this significantly reduces the volume of KV data transferred and lays the groundwork for future KV-transfer optimizations such as state-space models.
- PR #34730: Add shutdown timeout: Adds a graceful-shutdown timeout to the core engine and API server, allowing in-flight requests to finish after a SIGTERM is received, improving service manageability and user experience.
- PR #36042: Fix CUDA graph decode capture crash in AITER FlashAttention: Fixes a crash in the ROCm AITER FlashAttention backend caused by a wrong path selection during CUDA graph capture, improving the stability of CUDA-graph optimization on AMD platforms.
- PR #35538: [docs][torch.compile] Add fusions.md: Adds a thorough reference document on compile-time fusion optimizations, systematically covering the principles, support matrix, and performance benefits of all fusion operations, greatly improving the understandability and debuggability of torch.compile-related features.
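The graceful-shutdown pattern behind PR #34730 (stop accepting work on SIGTERM, then wait up to a timeout for in-flight requests to drain) can be sketched minimally. Everything below is illustrative, not vLLM's implementation; class and method names are invented.

```python
import threading
import time

class GracefulServer:
    """Minimal sketch of shutdown-with-timeout (hypothetical, not vLLM code)."""

    def __init__(self, shutdown_timeout: float = 5.0):
        self.shutdown_timeout = shutdown_timeout
        self.in_flight = 0
        self.accepting = True
        self._lock = threading.Lock()

    def start_request(self) -> bool:
        """Admit a request unless shutdown has begun."""
        with self._lock:
            if not self.accepting:
                return False
            self.in_flight += 1
            return True

    def finish_request(self) -> None:
        with self._lock:
            self.in_flight -= 1

    def handle_sigterm(self, *_) -> bool:
        """Stop admitting new work; wait up to the timeout for drainage.

        Returns True if all in-flight requests finished in time.
        """
        self.accepting = False
        deadline = time.monotonic() + self.shutdown_timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self.in_flight == 0:
                    return True  # drained cleanly
            time.sleep(0.01)
        return False  # timed out; caller may force shutdown
```

In a real server the handler would be registered via signal.signal(signal.SIGTERM, server.handle_sigterm); the timeout bounds how long a deployment orchestrator (e.g. Kubernetes) waits before a hard kill.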
📈 Development Activity Observations
- Active contributors: AMD contributors (with the -amd username suffix) submitted several key fixes and optimization PRs this period, showing deep engagement in hardware-ecosystem support. Community members also contributed high-quality root-cause analyses and solution discussions (e.g. #36228).
- Efficient PR review and merging: 42 PRs merged within 24 hours indicates an efficient core-team review and merge pipeline; many PRs moved quickly into testing and merging once labeled ready (e.g. #36208, #36284, #36293).
- High community engagement: Many new issues include detailed reproduction steps and logs, enabling rapid triage. For new features (such as MLA quantization) and architecture improvements (such as the unified speculative-decoding interface), developers are proactively volunteering to contribute.
💡 Issues Worth Watching
- GPTQ-Int4 quantized-model stability (#36249): Multiple users report that the Qwen3.5 GPTQ-Int4 model crashes once throughput reaches a certain level, while the unquantized version runs fine. This may be a compatibility problem between vLLM kernels and this quantization format under high load, and warrants focused investigation.
- CPU offload broken on Windows (#36279): This prevents VRAM-constrained users from running larger models on Windows and affects a broad user base.
- Speculative-decoding architecture unification (#36219): This RFC proposes a major architectural refactoring whose outcome will shape how speculative-decoding algorithms and hardware plugins are developed going forward; it deserves broad community attention and discussion.
- Multiple model loading/runtime problems: Qwen3.5 load failure (#36217), Qwen3.5 MoE CPT load failure (#36236), Qwen3 AWQ load error (#36234), and others reflect the compatibility challenge vLLM faces in rapidly tracking many new model architectures, pointing to the need for systematic testing and adaptation mechanisms.
📋 Appendix: Detailed Data
New Issues
- #36302 [Installation]: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_jb — installation — by WillLester (created: 2026-03-07 10:49 (UTC+8))
- #36237 [Bug]: Generation hangs until RAY_CGRAPH_get_timeout (300s) with Ray compiled DAG executor — bug — by jQizhang (created: 2026-03-06 16:54 (UTC+8))
- #36295 [Bug]: Deepseek R1 TRTLLM FP8 MoE produces garbage output — bug — by wzhao18 (created: 2026-03-07 08:38 (UTC+8))
- #36236 [Bug]: vllm fails to load continual pre-trained Qwen3.5-MoE model due to missing support for transformers 5.x renamed class (Qwen3_5MoeTextConfig) — bug — by itacora (created: 2026-03-06 16:51 (UTC+8))
- #36264 [Tracking issue]: NVIDIA CI improvements — feature request — by jasonlizhengjian (created: 2026-03-07 00:33 (UTC+8))
- #36266 [Feature]: Add MLA + Quant support in vLLM (leveraging existing FlashInfer MLA support) — feature request — by baonudesifeizhai (created: 2026-03-07 01:07 (UTC+8))
- #36279 [Bug]: --cpu-offload-gb fails with AssertionError on Windows (v0.16.0) — bug — by ghbaud (created: 2026-03-07 04:27 (UTC+8))
- #36249 [Bug]: Qwen3.5-27B-GPTQ-Int4 crashes [vllm version v0.16.1rc1] — bug — by berkayersoyy (created: 2026-03-06 19:24 (UTC+8))
- #36275 [Bug]: Qwen3.5 4b incompatibility — bug — by aminsamir45 (created: 2026-03-07 03:31 (UTC+8))
- #36219 [RFC]: Speculative Decoding Proposer Interface Unification Proposal — RFC — by wangxiyuan (created: 2026-03-06 14:41 (UTC+8))
- #36272 [Feature]: [CPU] Performance improvement: Auto-preload Intel OpenMP on x86 for multi-core CPU inference — feature request,cpu — by gxoga (created: 2026-03-07 02:29 (UTC+8))
- #36257 Security: Unrestricted method invocation via collective_rpc endpoint in dev mode — no labels — by Vext-Labs (created: 2026-03-06 23:09 (UTC+8))
- #36234 [Bug]: Loading the qwen3-14b-awq model with vLLM 0.16.0 raises an error, e.g.: ERROR _wrapper.py:141: Error in wrapped target: CUDA error: the provided PTX was compiled with an unsupported toolchain. — bug — by uOnePiece (created: 2026-03-06 16:41 (UTC+8))
- #36251 [Bug]: Qwen3.5-397B-A17B-FP8 crashes with TP=4 + Expert Parallelism - dimension mismatch in fused linear sharding — no labels — by UmutAlihan (created: 2026-03-06 19:49 (UTC+8))
- #36217 [Bug]: The checkpoint you are trying to load has model type qwen3_5 but Transformers does not recognize this architecture — bug — by huangye123 (created: 2026-03-06 14:27 (UTC+8))
- #36228 [Bug]: Multiprocessing executor produces corrupted vision encoder outputs for Qwen3-VL in v0.12.0 (ray executor works correctly) — no labels — by AGENDD (created: 2026-03-06 16:08 (UTC+8))
- #36245 [Bug]: Gemma3 mmproj-*.gguf is not downloaded in 'download_gguf' — bug — by lmjantsch (created: 2026-03-06 17:45 (UTC+8))
- #36235 [Bug]: When using VLLM version 0.16.0, there will be an error when loading the qwen3-14b-awq model, such as: ERROR _wrapper.py:141: Error in wrapped target: CUDA error: the provided PTX was compiled with an unsupported toolchain. — bug — by uOnePiece (created: 2026-03-06 16:43 (UTC+8))
- #36233 [Bug]: Critical bug in tool_call test scenarios [vllm version 0.16.0] — bug — by ajiang17 (created: 2026-03-06 16:38 (UTC+8))
- #36220 [Bug]: vllm serve quantized GLM-5 failed — bug — by artificialzjy (created: 2026-03-06 14:45 (UTC+8))
- #36215 [Performance]: vLLM is slower than SGLang when deploying the Qwen3.5 model. — performance — by yszhli (created: 2026-03-06 14:13 (UTC+8))
- #36223 [RFC]: Support quarot for eagle3 — RFC — by drslark (created: 2026-03-06 15:13 (UTC+8))
- #36222 [Usage]: MoE flatten_tp_size should not unconditionally include dp_size — DP loses its original semantics for MoE layers — usage — by gerayking (created: 2026-03-06 15:02 (UTC+8))
- #36221 [Bug]: Two-node deployment of kimi2-5, runtime crash — bug — by xidiancpy (created: 2026-03-06 14:55 (UTC+8))
- #36207 [Bug]: kimi-2.5 reasoning_parser error. — bug — by SaltFish11 (created: 2026-03-06 12:10 (UTC+8))
- #36203 [Bug]: CUDAGraphMode.FULL not supported with ChunkedLocalAttention backend for Llama4 models — bug — by huydhn (created: 2026-03-06 11:10 (UTC+8))
Closed Issues
- #18517 [Feature]: hope that xgrammar and vLLM v1 can offer significant inference acceleration on the RTX 4090 as well — feature request,stale — by Zachary-ai-engineer (closed: 2026-03-07 10:18 (UTC+8))
- #18673 [Installation]: Hard to find right wheel files to build the release version — installation,stale — by llsj14 (closed: 2026-03-07 10:18 (UTC+8))
- #21511 [Feature]: Qwen3 Models GGUF Support — feature request,stale — by ctcanbol (closed: 2026-03-07 10:18 (UTC+8))
- #23203 [Feature]: Support gpt-oss mxfp4 on compute capability 7.5 — feature request,stale — by Aaron1011 (closed: 2026-03-07 10:17 (UTC+8))
- #23225 [Feature][Responses API] Support logprobs — feature request,stale — by heheda12345 (closed: 2026-03-07 10:17 (UTC+8))
- #23244 [Bug]: openai/gpt-oss-20b breaks on data parallel — bug,stale — by Fhrozen (closed: 2026-03-07 10:17 (UTC+8))
- #23767 [Feature]: Add LORA Model Name in Open Telemetry — feature request,stale — by alew3 (closed: 2026-03-07 10:17 (UTC+8))
- #24655 [Bug]: msgspec.DecodeError: MessagePack data is malformed: trailing characters (byte 198) — bug,stale — by TideDra (closed: 2026-03-07 10:17 (UTC+8))
- #24790 [Usage]: multi-gpu, I want embedding model deploy on GPU1, and set CUDA_VISIBLE_DEVICES=1, in vllm start script file, but vllm service still malloc memory of GPU 0. — usage,stale — by suchenRoad (closed: 2026-03-07 10:17 (UTC+8))
- #35651 [Bug][XPU]: Inference generates garbage with flash attention — bug — by andswitch (closed: 2026-03-07 03:00 (UTC+8))
- #36266 [Feature]: Add MLA + Quant support in vLLM (leveraging existing FlashInfer MLA support) — feature request — by baonudesifeizhai (closed: 2026-03-07 05:29 (UTC+8))
- #35641 [Bug]: ROCm MI355X Kimi K2.5 AITER TP8 MLA kernel Error (num_head=8) — bug,performance,rocm — by functionstackx (closed: 2026-03-07 04:24 (UTC+8))
- #36257 Security: Unrestricted method invocation via collective_rpc endpoint in dev mode — no labels — by Vext-Labs (closed: 2026-03-07 00:31 (UTC+8))
- #35574 [Bug]: Qwen3.5 Can not close thinking by "enable_thinking": false — bug — by Charmnut (closed: 2026-03-06 16:25 (UTC+8))
- #35828 [Bug]: Qwen3-Next accuracy regression on AMD — bug,rocm — by jennyyyyzhen (closed: 2026-03-06 14:03 (UTC+8))
- #35945 [Bug]: AssertionError in causal_conv1d_update when capturing CUDA graphs for Qwen3.5/GDN layers — bug — by git-jxj (closed: 2026-03-06 14:26 (UTC+8))
- #33418 [Bug]: wrong error reported when len(prompt) + requested tokens > max_context_len — bug — by sducouedic (closed: 2026-03-06 14:15 (UTC+8))
New PRs
- #36303 [Cleanup] Remove duplicate parser registration — no labels — by taneem-ibrahim (created: 2026-03-07 11:02 (UTC+8))
- #36293 [ROCm][CI] Making entrypoints more deterministic on ROCm — rocm,ready — by AndreasKaratzas (created: 2026-03-07 07:42 (UTC+8))
- #36253 [ROCM] Optimize the fused_topk_bias to use aiter instead of fallback torch ops. — rocm,ready — by benenzhu (created: 2026-03-06 20:12 (UTC+8))
- #36300 [Bug] Fix pooling model benchmark script — bug,performance,ready — by yewentao256 (created: 2026-03-07 09:20 (UTC+8))
- #36265 pick up tuned prefill configs for FP8 FA3 — ci/build — by jmkuebler (created: 2026-03-07 00:50 (UTC+8))
- #36241 [Bugfix] Fix FlashInfer block size restriction breaking hybrid attention models — bug,documentation,v1,nvidia — by AjAnubolu (created: 2026-03-06 17:23 (UTC+8))
- #36240 [Bugfix] Fix MoE flatten_tp_size unconditionally including dp_size — bug — by AjAnubolu (created: 2026-03-06 17:23 (UTC+8))
- #36239 [Bugfix] Remove incorrect assertion blocking mixed decode+spec-decode batches in GDN attention — bug,v1 — by AjAnubolu (created: 2026-03-06 17:23 (UTC+8))
- #36227 [ROCm] Add cascade attention support for AITER FlashAttention backend — rocm,needs-rebase,v1 — by ChuanLi1101 (created: 2026-03-06 15:51 (UTC+8))
- #36299 [CI] Fix BackgroundResources double-cleanup crash by adding guard — rocm,ready,v1 — by AndreasKaratzas (created: 2026-03-07 08:59 (UTC+8))
- #36297 Fused BMM+FP8 quant Triton kernel for MLA _v_up_proj (forward_mqa path) — performance,ci/build,v1 — by dorhuri123 (created: 2026-03-07 08:41 (UTC+8))
- #36286 [MoE Refactor] Migrate UnquantizedFusedMoEMethod and oracle to MK flow — nvidia — by yzong-rh (created: 2026-03-07 06:16 (UTC+8))
- #36301 [XPU][Doc] update xpu document about triton dependency/conflict issue. — documentation — by jikunshang (created: 2026-03-07 09:22 (UTC+8))
- #36290 [BugFix] Avoid ignored trust_remote_code warnings — bug,speculative-decoding,ready — by njhill (created: 2026-03-07 06:45 (UTC+8))
- #36281 [BE] Rename should_torch_compile_mm_vit to should_torch_compile_mm_encoder — documentation,ready,llama,qwen — by Lucaskabela (created: 2026-03-07 05:21 (UTC+8))
- #36292 [ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks — rocm,ready,v1 — by AndreasKaratzas (created: 2026-03-07 07:05 (UTC+8))
- #36289 [Model] Register Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM for text-only checkpoints — new-model,qwen — by aminsamir45 (created: 2026-03-07 06:41 (UTC+8))
- #36294 [MoE Refactor] Remove "naive" all2all backend — no labels — by bnellnm (created: 2026-03-07 08:20 (UTC+8))
- #36298 full cudagraph for flex-attn — v1,nvidia — by shunting314 (created: 2026-03-07 08:48 (UTC+8))
- #36280 [Model Runner V2] Fix warmup for pipeline parallel — v1 — by njhill (created: 2026-03-07 05:18 (UTC+8))
- #36296 [Bug] Fix TRTLLM FP8 MoE Monolithic — bug,nvidia — by wzhao18 (created: 2026-03-07 08:40 (UTC+8))
- #36270 [Core] Fix benign error log during normal shutdown — ready,v1 — by njhill (created: 2026-03-07 02:13 (UTC+8))
- #36282 mla: don't update kv cache on dummy forwards — ready,nvidia — by itayalroy (created: 2026-03-07 05:43 (UTC+8))
- #36284 [ROCm][CI] Fixing yaml file for external amd-ci signal — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-07 05:56 (UTC+8))
- #36285 Add structured output support for beam search in online serving — frontend — by ashutosh1919 (created: 2026-03-07 06:08 (UTC+8))
- #36205 [mla] Support fused FP8/NVFP4 output quantization in MLA attention (#35792) — performance,ci/build,v1 — by carlyou (created: 2026-03-06 11:44 (UTC+8))
- #36216 [V0 Deprecation] Remove unused swap_space parameter — documentation,performance,frontend,ready,ci/build,v1,kv-connector — by majiayu000 (created: 2026-03-06 14:22 (UTC+8))
- #36291 Create test-qwen-deepseek-1.5-accuracy.py — v1,qwen,deepseek — by puririshi98 (created: 2026-03-07 06:53 (UTC+8))
- #36288 [torch.compile][BE] Modify cudagraph callable to check for is_forward_context_set — documentation,llama,qwen,nvidia — by Lucaskabela (created: 2026-03-07 06:27 (UTC+8))
- #36287 [Core] Add --tensor-parallel-size-attention for TPA separation — v1,llama — by sungsooha (created: 2026-03-07 06:20 (UTC+8))
- #36283 [Feature] Add LLM.shutdown() for explicit engine teardown — frontend — by wojciech-wais (created: 2026-03-07 05:44 (UTC+8))
- #36277 [Compile] Add MLA attention + FP8 static quant fusion support — v1,nvidia — by dorhuri123 (created: 2026-03-07 04:06 (UTC+8))
- #36258 [Feature] Add /live liveness probe, LLM.shutdown(), and k8s shutdown docs — documentation,frontend,v1 — by wojciech-wais (created: 2026-03-06 23:38 (UTC+8))
- #36276 [EPLB] Add nixl-based eplb communicator — ci/build,v1,kv-connector — by ilmarkov (created: 2026-03-07 04:04 (UTC+8))
- #36278 [Model] [Bugfix] Adding legacy MoE weight format support in Step-3.5-Flash for quantized model inference — bug — by ColinZ22 (created: 2026-03-07 04:10 (UTC+8))
- #36268 [Audio] Bundle get_generation_prompt() params into SpeechToTextParams — documentation,frontend,qwen — by ekagra-ranjan (created: 2026-03-07 01:49 (UTC+8))
- #36273 [Kernel][MoE] Add fp8_w8a8 MoE tuning config for NVIDIA GB10 (DGX Spark) — nvidia — by scottgl9 (created: 2026-03-07 02:36 (UTC+8))
- #36274 [Bugfix][ROCm] Strip block_size before attention backend validation — bug,rocm — by jennyyyyzhen (created: 2026-03-07 02:36 (UTC+8))
- #36271 [EPLB] Remove main waits in case of slow EPLB — no labels — by ilmarkov (created: 2026-03-07 02:28 (UTC+8))
- #36206 [CI][MM] Gate vision encoder attention mask to MiniCPM only, fixing Aria regression — ready — by AndreasKaratzas (created: 2026-03-06 11:49 (UTC+8))
- #36261 [EPLB] Optmize eplb mapping and record in router for prefill — no labels — by ilmarkov (created: 2026-03-07 00:25 (UTC+8))
- #36269 bytes calculation for weight offloading — meta-exported,fb-exported — by mxz297 (created: 2026-03-07 01:55 (UTC+8))
- #36267 [EPLB] Simplify EPLB rearrange by only returning one map — no labels — by SageMoore (created: 2026-03-07 01:26 (UTC+8))
- #36238 Add 'none' reasoning effort to ChatCompletionRequest — frontend — by juliendenize (created: 2026-03-06 17:08 (UTC+8))
- #36232 [ROCm][Quantization] make quark ocp mx dtype parser robust for weight-only quantization — rocm,ready — by xuebwang-amd (created: 2026-03-06 16:19 (UTC+8))
- #36262 Revert "[BugFix] Fix engine hanging after KV cache initialization fai… — bug,v1 — by njhill (created: 2026-03-07 00:27 (UTC+8))
- #36263 feat(attention): extract KV-cache update from FlexAttention backend — v1 — by cong-or (created: 2026-03-07 00:29 (UTC+8))
- #36260 feat(attention): extract KV-cache update from FlexAttention backend — v1 — by cong-or (created: 2026-03-07 00:25 (UTC+8))
- #36259 Feat/sigint graceful shutdown — frontend,v1 — by wojciech-wais (created: 2026-03-07 00:18 (UTC+8))
- #36250 [Bugfix] Fix FP8 paged MQA fallback for CUDA graph capture — bug,v1,nvidia — by ZJY0516 (created: 2026-03-06 19:26 (UTC+8))
- #36252 [Bugfix] Allow concurrency and memory_limit for runai_streamer_sharded — bug — by ziyangliu-666 (created: 2026-03-06 19:54 (UTC+8))
- #36242 [Bugfix] Fix Qwen3-Next in_proj_ba weight sharding with TP > 1 — bug,ready,qwen — by AjAnubolu (created: 2026-03-06 17:42 (UTC+8))
- #36255 fix: improve token_ids_cpu swap to copy only valid indices — tpu,v1 — by zzaebok (created: 2026-03-06 21:35 (UTC+8))
- #36204 use skip_all_guards_unsafe to drop global_state and torch_function_mode_stack guards instead of previous hacks — no labels — by laithsakka (created: 2026-03-06 11:19 (UTC+8))
- #36256 [Misc] Use VLLMValidationError in batch, pooling, and tokenize protocol validators — frontend — by umut-polat (created: 2026-03-06 21:36 (UTC+8))
- #36244 [torch.compile] Remove attention layer name from unified_kv_cache_update — no labels — by SongyouZhong (created: 2026-03-06 17:43 (UTC+8))
- #36247 [Bugfix] Fix compressed-tensors quantization failure for DeepSeek-R1 on MI300x — bug,ready,deepseek — by vllmellm (created: 2026-03-06 18:16 (UTC+8))
- #36218 [UX] Infer dtype for local checkpoint — ready — by Isotr0py (created: 2026-03-06 14:38 (UTC+8))
- #36248 docs+tests: consolidate doc fixes and test assertion — documentation,multi-modality,cpu — by haosenwang1018 (created: 2026-03-06 18:21 (UTC+8))
- #36254 [Misc] Use VLLMValidationError consistently in chat completion and completion protocol validators — frontend — by umut-polat (created: 2026-03-06 21:31 (UTC+8))
- #36230 [CI] Fix startup error test — ready,v1 — by hmellor (created: 2026-03-06 16:14 (UTC+8))
- #36246 [CI/Build] Updated rmsnorm test to improve OOT device coverage — no labels — by romitjain (created: 2026-03-06 18:04 (UTC+8))
- #36214 docs: clarify CPU backend issue reporting sentence — documentation,cpu — by haosenwang1018 (created: 2026-03-06 14:03 (UTC+8))
- #36213 docs: capitalize GitHub Actions in profiling guide — documentation — by haosenwang1018 (created: 2026-03-06 14:02 (UTC+8))
- #36212 tests: assert missing video loader error message — multi-modality — by haosenwang1018 (created: 2026-03-06 13:53 (UTC+8))
- #36211 docs: fix speculative decoding wording typos — documentation — by haosenwang1018 (created: 2026-03-06 13:50 (UTC+8))
- #36243 fix(config): skip out-of-stage layers in get_layers_from_vllm_config for pipeline parallel — no labels — by tusharshetty61 (created: 2026-03-06 17:42 (UTC+8))
- #36231 [Misc] Add enable_log_requests parameter to RequestLogger — frontend — by chaunceyjiang (created: 2026-03-06 16:17 (UTC+8))
- #36226 [Kernel] Add MMQ kernels for all I-Matrix quantization types — no labels — by aimbit-ni (created: 2026-03-06 15:23 (UTC+8))
- #36229 [Build] Fix API rate limit exceeded when using VLLM_USE_PRECOMPILED=1 — ci/build — by elvischenv (created: 2026-03-06 16:09 (UTC+8))
- #36224 [Bugfix] Fix reasoning token routing with tool parsers: prompt false positive and transition-batch loss — bug,frontend,qwen — by alexbi29 (created: 2026-03-06 15:22 (UTC+8))
- #36225 [main][feature] Support quarot for eagle3 — speculative-decoding,llama — by drslark (created: 2026-03-06 15:22 (UTC+8))
- #36210 [ROCm][CI] Preparing gfx90a mirroring — rocm,ci/build — by AndreasKaratzas (created: 2026-03-06 13:13 (UTC+8))
- #36209 docs: fix wrong cc in int8.md — documentation,ready — by KevinZonda (created: 2026-03-06 12:18 (UTC+8))
- #36208 [CI] Fix bge-m3 similarity reference values after Defination typo fix — ready — by AndreasKaratzas (created: 2026-03-06 12:16 (UTC+8))
Merged PRs
- #36293 [ROCm][CI] Making entrypoints more deterministic on ROCm — rocm,ready — by AndreasKaratzas (merged: 2026-03-07 11:04 (UTC+8))
- #36042 Fix CUDA graph decode capture crash in AITER FlashAttention — rocm,ready,v1,nvidia,meta-exported,fb-exported — by iseeyuan (merged: 2026-03-07 10:12 (UTC+8))
- #35971 refine vllm bench throughput --backend hf — performance,ready — by jikunshang (merged: 2026-03-07 10:10 (UTC+8))
- #36290 [BugFix] Avoid ignored trust_remote_code warnings — bug,speculative-decoding,ready — by njhill (merged: 2026-03-07 09:24 (UTC+8))
- #36280 [Model Runner V2] Fix warmup for pipeline parallel — v1 — by njhill (merged: 2026-03-07 08:58 (UTC+8))
- #36270 [Core] Fix benign error log during normal shutdown — ready,v1 — by njhill (merged: 2026-03-07 08:39 (UTC+8))
- #36282 mla: don't update kv cache on dummy forwards — ready,nvidia — by itayalroy (merged: 2026-03-07 08:36 (UTC+8))
- #36284 [ROCm][CI] Fixing yaml file for external amd-ci signal — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-07 08:30 (UTC+8))
- #35538 [docs][torch.compile] Add fusions.md — kernel/operator fusion reference page — documentation,ready,torch.compile — by copilot-swe-agent (merged: 2026-03-07 07:55 (UTC+8))
- #35850 [ROCm] Support MLA with nhead<16 and FP8 KV cache for TP=8 (Kimi K2.5/Linear) — rocm,ready,v1 — by ChuanLi1101 (merged: 2026-03-07 04:24 (UTC+8))
- #35253 Enabling some B200-specific tests on MI355 — ready,ci/build — by Alexei-V-Ivanov-AMD (merged: 2026-03-07 03:27 (UTC+8))
- #35877 [CustomOp] CustomOp FusedRMSNormGated — ready,v1 — by eellison (merged: 2026-03-07 02:53 (UTC+8))
- #36206 [CI][MM] Gate vision encoder attention mask to MiniCPM only, fixing Aria regression — ready — by AndreasKaratzas (merged: 2026-03-06 17:17 (UTC+8))
- #36165 [Bugfix] Fix cudagraph_mode: FULL dispatch (this does not impact FULL_AND_PIECEWISE, the default) — bug,ready,v1,nvidia — by TQCB (merged: 2026-03-06 22:15 (UTC+8))
- #36262 Revert "[BugFix] Fix engine hanging after KV cache initialization fai… — bug,v1 — by njhill (merged: 2026-03-07 00:28 (UTC+8))
- #35478 [BugFix] Fix engine hanging after KV cache initialization failure — bug,ready,v1 — by 842974287 (merged: 2026-03-06 12:58 (UTC+8))
- #36068 [Bugfix] Quickfix followups to busy loop removal in #28053 — bug,ready,v1 — by tjohnson31415 (merged: 2026-03-07 00:13 (UTC+8))
- #36152 [compile] Stop unconditionally patching constrain_to_fx_strides — ready — by zou3519 (merged: 2026-03-06 23:17 (UTC+8))
- #35202 [Refactor] Modular video loader backend refactoring — ready,multi-modality — by Isotr0py (merged: 2026-03-06 22:06 (UTC+8))
- #36024 [Misc] Lazy import registered processors — ready,deepseek — by Isotr0py (merged: 2026-03-06 22:06 (UTC+8))
- #36150 [bugfix] add api process rank in default multimodal request — bug,ready — by fake0fan (merged: 2026-03-06 20:00 (UTC+8))
- #36230 [CI] Fix startup error test — ready,v1 — by hmellor (merged: 2026-03-06 19:50 (UTC+8))
- #36153 [Frontend] Add Support for MM Encoder/Decoder Beam Search (Offline) — frontend,ready,multi-modality — by alex-jw-brooks (merged: 2026-03-06 17:16 (UTC+8))
- #35758 [Core][KVConnector] Support HMA+NixlConnector — ready,v1,kv-connector — by NickLucche (merged: 2026-03-06 15:51 (UTC+8))
- #35158 [Spec Decode][KV Connector] Fix KV transfer in PD + speculative decoding — ready,ci/build,v1,kv-connector,ready-run-all-tests — by ZhanqiuHu (merged: 2026-03-06 15:50 (UTC+8))
- #35553 [ROCm][CI] Fix tool use test stability - disable skinny GEMM, prefix caching, eliminate batch variance — documentation,rocm,ready,ci/build,v1 — by AndreasKaratzas (merged: 2026-03-06 15:15 (UTC+8))
- #36173 Change "following fields were present in the request but ignored" log from warn to debug — frontend,ready — by tlrmchlsmth (merged: 2026-03-06 14:15 (UTC+8))
- #36191 [BugFix] avoid infinite loop with VLLM_PORT and get_open_ports_list — bug,ready — by walterbm (merged: 2026-03-06 14:15 (UTC+8))
- #36192 [Security] Respect user trust_remote_code setting in NemotronVL and KimiK25 — ready — by russellb (merged: 2026-03-06 14:15 (UTC+8))
- #36197 [Bugfix] Fix misleading context length error messages — bug,frontend,ready — by AjAnubolu (merged: 2026-03-06 14:15 (UTC+8))
- #35892 [Bugfix] Fix inner_dp_world initialization order for multi-node TP — bug,ready,ci/build,v1 — by zyongye (merged: 2026-03-06 14:04 (UTC+8))
- #34730 [Frontend][Core] Add shutdown timeout - allowing in-flight requests to finish — frontend,ready,v1 — by markmc (merged: 2026-03-06 14:04 (UTC+8))
- #36164 perf: add slots to KVCacheBlock — ready,v1 — by cong-or (merged: 2026-03-06 14:04 (UTC+8))
- #36209 docs: fix wrong cc in int8.md — documentation,ready — by KevinZonda (merged: 2026-03-06 14:01 (UTC+8))
- #36208 [CI] Fix bge-m3 similarity reference values after Defination typo fix — ready — by AndreasKaratzas (merged: 2026-03-06 13:07 (UTC+8))
- #36047 [Feature] Add --distributed-timeout-seconds CLI option — ready,v1 — by 842974287 (merged: 2026-03-06 12:57 (UTC+8))
- #34754 [Bug] Fix a corner case in _process_simple_streaming_events — bug,frontend,ready — by 842974287 (merged: 2026-03-06 12:57 (UTC+8))
- #36158 [Misc] Rename group_mm_kwargs_by_modality -> group_and_batch_mm_kwargs — ready,v1,multi-modality — by DarkLight1337 (merged: 2026-03-06 12:32 (UTC+8))
- #36177 [ROCm][CI] Adding missing dependencies for Multi-modal models tests — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-06 12:23 (UTC+8))
- #36156 [Bugfix] Fix simple Mistral-Small example — bug,documentation,ready — by DarkLight1337 (merged: 2026-03-06 12:25 (UTC+8))
- #36185 Reenable features for ROCm attention backends — documentation,rocm,ready,v1 — by Rohan138 (merged: 2026-03-06 12:21 (UTC+8))
- #36147 cpu: aarch64: Upgrade OneDNN for aarch64 to add support for int8 matmul — ready,ci/build,cpu — by nikhil-arm (merged: 2026-03-06 11:48 (UTC+8))
PRs Closed Without Merging
- #20275 [Bugfix][Frontend]: Fix API server connection refused on wsl2 — frontend,stale — by Chen-zexi (closed: 2026-03-07 10:18 (UTC+8))
- #21041 [Model] Reasoning Parser for Nemotron Models — stale — by schoennenbeck (closed: 2026-03-07 10:18 (UTC+8))
- #28062 [wip] Fix prime rl test — ready,ci/build,stale — by rzabarazesh (closed: 2026-03-07 10:17 (UTC+8))
- #36299 [CI] Fix BackgroundResources double-cleanup crash by adding guard — rocm,ready,v1 — by AndreasKaratzas (closed: 2026-03-07 09:30 (UTC+8))
- #36277 [Compile] Add MLA attention + FP8 static quant fusion support — v1,nvidia — by dorhuri123 (closed: 2026-03-07 05:59 (UTC+8))
- #35610 [Fix][XPU] TRITON_ATTN: add pytorch-triton-xpu to xpu requirements — ci/build — by andswitch (closed: 2026-03-07 03:00 (UTC+8))
- #36260 feat(attention): extract KV-cache update from FlexAttention backend — v1 — by cong-or (closed: 2026-03-07 00:28 (UTC+8))
- #28850 [Hardware][AMD] Add fused QK RoPE and reshape & cache flash support for ROCm — rocm,needs-rebase,stale,v1,qwen — by mjkvaak-amd (closed: 2026-03-06 22:28 (UTC+8))
- #35187 feat: enable EPLB for NVFP4 compressed-tensors ML3 checkpoint — no labels — by hypdeb (closed: 2026-03-06 22:02 (UTC+8))
- #34062 [torch.compile] Remove attention layer name from unified_kv_cache_update — needs-rebase — by veeceey (closed: 2026-03-06 21:31 (UTC+8))
- #24395 [Performance]: Improve Prefix Cache Hit Rate and Reduce Dirty Cache Impact — v1 — by CLFutureX (closed: 2026-03-06 19:18 (UTC+8))
- #24359 [Feature] Add sliding window support to FlexAttention backend — v1,gpt-oss — by alhridoy (closed: 2026-03-06 19:16 (UTC+8))
- #24286 [Model] Add Support for Grok2 — documentation,new-model,needs-rebase — by wenchen76 (closed: 2026-03-06 19:14 (UTC+8))
- #24202 Support prompt hidden states — tpu,needs-rebase,v1 — by charlotte12l (closed: 2026-03-06 19:10 (UTC+8))
- #24069 [EPLB] Reconstruct EPLB algorithm invocation method — no labels — by raindaywhu (closed: 2026-03-06 19:09 (UTC+8))
- #24068 Gfx908 attn fix — rocm,unstale — by UD-mmcminn (closed: 2026-03-06 19:08 (UTC+8))
- #24027 Support using SigLIP2 text and image embedding as standalone model — new-model,multi-modality — by duc-ph (closed: 2026-03-06 19:08 (UTC+8))
- #36214 docs: clarify CPU backend issue reporting sentence — documentation,cpu — by haosenwang1018 (closed: 2026-03-06 18:21 (UTC+8))
- #36213 docs: capitalize GitHub Actions in profiling guide — documentation — by haosenwang1018 (closed: 2026-03-06 18:21 (UTC+8))
- #36212 tests: assert missing video loader error message — multi-modality — by haosenwang1018 (closed: 2026-03-06 18:21 (UTC+8))
- #36211 docs: fix speculative decoding wording typos — documentation — by haosenwang1018 (closed: 2026-03-06 18:21 (UTC+8))
- #23566 [Model] Add lite-whisper model support in vLLM — new-model,frontend,needs-rebase,unstale — by yuqiannemo (closed: 2026-03-06 17:27 (UTC+8))
- #23601 Implement standardized environment variable parsing with pathlib.Path support and type introspection — needs-rebase — by copilot-swe-agent (closed: 2026-03-06 17:27 (UTC+8))
- #23571 Remove build_for_cudagraph_capture method and use build directly in attention metadata builders — needs-rebase,v1,nvidia — by copilot-swe-agent (closed: 2026-03-06 17:27 (UTC+8))
- #23553 [EPLB] Model Modify of EPLB — needs-rebase,unstale,v1 — by shiyuan680 (closed: 2026-03-06 17:26 (UTC+8))
- #23526 [Not for merge] Debug cuda graph sleep mode — needs-rebase,v1,nvidia — by 22quinn (closed: 2026-03-06 17:25 (UTC+8))
- #23518 Fix gpt-oss tool call — frontend,unstale,gpt-oss — by sa411022 (closed: 2026-03-06 17:25 (UTC+8))
- #23463 [#20711] Use QuantFp8 CustomOp-abstraction for MoE layers — performance,needs-rebase,nvidia — by rojagtap (closed: 2026-03-06 17:25 (UTC+8))
- #23442 Add a flag to use FusedMoE kernel in compressed quantization — needs-rebase,unstale — by chenxi-yang (closed: 2026-03-06 17:23 (UTC+8))
- #23339 [Refactor] Small cleanup for quantized FusedMoE — ready,needs-rebase,stale,nvidia — by amirkl94 (closed: 2026-03-06 17:23 (UTC+8))
- #23277 Stats/encoder scheduled — needs-rebase,v1 — by h-brenoskuk (closed: 2026-03-06 17:21 (UTC+8))
- #23269 [Bugfix] updated moe_wna16.cu BLOCK_SIZE_K check to allow glm-4.5-air — no labels — by anikifoss (closed: 2026-03-06 17:21 (UTC+8))
- #23208 WIP: vllm-kernels package restructure — rocm,needs-rebase,ci/build,v1,qwen,cpu,nvidia — by seemethere (closed: 2026-03-06 17:20 (UTC+8))
- #23151 Add tuned fused_moe configs for Qwen3-30B-A3B — qwen — by wenchen76 (closed: 2026-03-06 17:18 (UTC+8))
- #23085 Bump actions/checkout from 4.2.2 to 5.0.0 — needs-rebase,ci/build,github_actions — by dependabot (closed: 2026-03-06 17:16 (UTC+8))
- #26058 Add token sharding functions for context parallel — needs-rebase,stale,v1 — by qiruiyangmeta (closed: 2026-03-06 16:48 (UTC+8))
- #26057 Add context parallelism configurations and parallel group — needs-rebase,stale,v1 — by qiruiyangmeta (closed: 2026-03-06 16:33 (UTC+8))
- #32168 scheduler: Cache also the last block after KV recving — v1 — by orozery (closed: 2026-03-06 16:26 (UTC+8))
- #29020 [Kernel] Separate Triton Attention Kernel Launches for Prefill and Decode for FULL CUDA Graph mode — v1,nvidia — by jvlunteren (closed: 2026-03-06 15:16 (UTC+8))
- #35064 [ROCm] Support MLA with nhead<16 and FP8 KV cache for TP=8 (Kimi K2.5/Linear) — rocm,ci/build,v1,gpt-oss — by ChuanLi1101 (closed: 2026-03-06 14:31 (UTC+8))
- #35946 Fix incorrect assertion in causal_conv1d_update for indexed conv_state — no labels — by git-jxj (closed: 2026-03-06 14:26 (UTC+8))
- #34121 [BugFix] Mistakenly passing num_reqs_padded as num_reqs in _dummy_run — bug,v1 — by Selkh (closed: 2026-03-06 13:28 (UTC+8))
- #36195 [TEST ONLY]Restore AsyncTP fusion for FlashInfer FP8 BMM (need advices)#27893 — ci/build,nvidia — by baonudesifeizhai (closed: 2026-03-06 11:52 (UTC+8))