vLLM Development Activity Report - 2026-03-06

Time window: 2026-03-06 11:10 (UTC+8) ~ 2026-03-07 11:10 (UTC+8)
Statistics: New Issues 26 | Closed Issues 17 | New PRs 75 | Merged PRs 42 | PRs Closed Without Merging 43

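📊 Daily Development Status Summary

During this period (2026-03-06 to 2026-03-07), the vLLM project remained highly active, with 26 new issues, 75 new PRs, and 42 merged PRs. Development centered on performance optimization, bug fixes, and hardware ecosystem support. The main points of attention were a series of fixes and optimizations for the AMD ROCm platform, plus discussions around the Qwen model family, distributed execution stability, and the speculative decoding architecture.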

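🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was frequent this period, covering core functionality fixes, performance optimization, and issue triage: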

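PR #36278 - Quark-quantized weight loading support for Step-3.5-Flash
- Contributor: ColinZ22 (likely AMD-affiliated; the handle itself carries no -amd suffix)
- Content: Fixes loading of MoE weights in the legacy (pre-Transformers v5) format, as exported by AMD's Quark quantization tool and other tools, and adds the missing packed_modules_mapping attribute so that quantized Step-3.5-Flash models run inference correctly in vLLM.
- Impact: Directly improves vLLM's compatibility with model files produced by the AMD Quark toolchain, making it easier for AMD users to deploy quantized MoE models such as the Step series.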
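PR #36232 - Robustness fix for the Quark OCP MX quantization parser
- Contributor: xuebwang-amd (AMD employee)
- Content: Fixes a bug in the Quark OCP MX (weight-only) quantization config parser so that it handles input_quant_spec being None. The fix lets OCP MXFP4 quantization of models such as Qwen3-30B run correctly on AMD platforms.
- Impact: Improves the stability and reliability of the OCP MX quantization path that AMD's Quark toolchain relies on in vLLM.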
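The None-handling described for PR #36232 can be illustrated with a minimal sketch. This is not the vLLM or Quark parser: the function name `resolve_mx_dtypes` and the dictionary layout with a `dtype` key are assumptions made for illustration. Only the `input_quant_spec` name comes from the report.

```python
from typing import Any, Optional


def resolve_mx_dtypes(
    weight_quant_spec: dict[str, Any],
    input_quant_spec: Optional[dict[str, Any]],
) -> tuple[str, Optional[str]]:
    """Return (weight_dtype, input_dtype) for a hypothetical quantization config.

    In a weight-only configuration there is no activation spec, so
    input_quant_spec is None and must not be indexed.
    """
    weight_dtype = weight_quant_spec["dtype"]
    # Guard the weight-only case instead of assuming the spec always exists.
    input_dtype = None if input_quant_spec is None else input_quant_spec.get("dtype")
    return weight_dtype, input_dtype


# Weight-only quantization: activations stay unquantized.
print(resolve_mx_dtypes({"dtype": "fp4"}, None))
# Weights and activations both quantized.
print(resolve_mx_dtypes({"dtype": "fp4"}, {"dtype": "fp4"}))
```

The point of the sketch is only the guard: a parser that unconditionally reads fields from the activation spec fails on weight-only configs.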
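Issue #36217 (comment) - Qwen3.5 deployment problem in AMD's prebuilt environment
- User feedback: User Abathur284284284 reported that deploying a Qwen3.5 model in AMD's official prebuilt ROCm 7.0 environment fails immediately because the bundled transformers version is too old. This suggests that AMD's prebuilt environments may lag behind support for the newest model architectures.
- Impact: Exposes a possible lag in how quickly the AMD software stack picks up new-model support, which hurts user experience.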
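Issue #35828 (closed) - Qwen3-Next accuracy regression on AMD
- Summary: This issue has been closed. User jennyyyyzhen originally reported that GSM8k accuracy for Qwen3-Next-80B on MI300x dropped from ~85% to ~50%. The root cause was traced to the VLLM_ROCM_USE_AITER_TRITON_ROPE=1 flag introduced in PR #35180. PR #35601 disabled the flag by default, which resolved the regression.
- Impact: Shows the AMD team iterating quickly on AITER kernels and responding to problems, here by turning an experimental flag off by default to restore numerical accuracy.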
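A default-off experimental flag of the kind described for Issue #35828 can be sketched as follows. The variable name VLLM_ROCM_USE_AITER_TRITON_ROPE comes from the report, but the helper `triton_rope_enabled` and its parsing rule are assumptions for illustration and may differ from how vLLM actually reads the variable.

```python
import os
from typing import Mapping, Optional

FLAG = "VLLM_ROCM_USE_AITER_TRITON_ROPE"


def triton_rope_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical reader for an opt-in flag: unset or "0" means disabled."""
    env = os.environ if environ is None else environ
    # Defaulting to "0" keeps the experimental path off unless a user opts in.
    return env.get(FLAG, "0").strip() == "1"


print(triton_rope_enabled({}))            # flag unset
print(triton_rope_enabled({FLAG: "1"}))   # flag explicitly enabled
```

Keeping the flag opt-in means the experimental kernel path stays available for benchmarking while default deployments avoid it.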

💬 Analysis of High-Traffic Discussions

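Issue #36228: Multiprocessing executor produces corrupted vision encoder outputs for Qwen3-VL
- Core topic: In vLLM v0.12.0, the default multiprocessing (mp) executor backend corrupts the vision encoder outputs of Qwen3-VL models, while the Ray backend works correctly. This is a regression from v0.11.0 to v0.12.0.
- Viewpoints:
  - Reporter (AGENDD): Performed a root-cause analysis and found that the mp executor, unlike the ray executor, does not set the correct CUDA_VISIBLE_DEVICES environment variable for each worker process, which leads F.linear (cuBLAS) in PyTorch 2.9.0 to become numerically unstable under certain conditions.
  - Proposed fix: AGENDD offered three fix options and prefers the most fundamental one: have the mp executor set CUDA_VISIBLE_DEVICES correctly when it initializes worker processes, removing the behavioral difference from the ray backend.
- Current status: Open. AGENDD has a test suite ready and is waiting for maintainer feedback before submitting a PR.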
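The preferred fix in Issue #36228, giving each worker its own CUDA_VISIBLE_DEVICES before it starts, can be sketched in generic Python. This is not vLLM's executor code: `build_worker_env` and `launch_probe` are hypothetical helpers, and the sketch only shows how a per-worker device mask can be passed through the child's environment.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional


def build_worker_env(
    local_rank: int,
    device_ids: list[int],
    base_env: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Copy the parent environment and pin the worker to a single device."""
    env = dict(os.environ if base_env is None else base_env)
    # The mask must be in place before the child initializes any CUDA library.
    env["CUDA_VISIBLE_DEVICES"] = str(device_ids[local_rank])
    return env


def launch_probe(local_rank: int, device_ids: list[int]) -> str:
    """Start a child process and return the device mask it observes."""
    code = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=build_worker_env(local_rank, device_ids),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for rank in range(2):
        print(rank, launch_probe(rank, [0, 1]))
```

With this arrangement every worker sees exactly one device as device 0, which is the behavior the report attributes to the ray executor.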
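Issue #36219: Speculative Decoding Proposer Interface Unification Proposal
- Core topic: User wangxiyuan submitted a detailed RFC to unify the currently inconsistent interfaces of speculative decoding algorithms, addressing poor maintainability, poor extensibility, and the difficulty of integrating hardware plugins.
- Viewpoints:
  - Proposer (wangxiyuan): Lays out three problems (inconsistent interfaces, tight coupling to the model runner, and difficult hardware extension) and proposes a unified interface built on a BaseProposer abstract class, together with an architecture that moves algorithm dispatch logic out of the model runner.
  - Maintainer (benchislett): Has acknowledged the RFC and said they will review it in detail next week.
- Point of contention: None has surfaced yet. As a large architectural refactor, the proposal is expected to draw debate over backward compatibility, implementation complexity, and priority.
- Current status: The RFC is open, awaiting detailed review from the core team and community feedback.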
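The idea of a shared proposer interface in RFC #36219 can be sketched as below. Only the BaseProposer name appears in the report. The `propose` signature, the toy `NgramProposer`, and the `build_proposer` registry are assumptions for illustration and do not reflect the RFC's actual design.

```python
from abc import ABC, abstractmethod


class BaseProposer(ABC):
    """Hypothetical common interface that every drafting algorithm implements."""

    @abstractmethod
    def propose(self, token_ids: list[int], num_speculative_tokens: int) -> list[int]:
        """Return up to num_speculative_tokens draft tokens for the context."""


class NgramProposer(BaseProposer):
    """Toy proposer: reuse what followed the latest earlier match of the suffix."""

    def __init__(self, n: int = 2) -> None:
        self.n = n

    def propose(self, token_ids: list[int], num_speculative_tokens: int) -> list[int]:
        if len(token_ids) <= self.n:
            return []
        suffix = token_ids[-self.n:]
        # Scan right to left over earlier positions, skipping the suffix itself.
        for start in range(len(token_ids) - self.n - 1, -1, -1):
            if token_ids[start:start + self.n] == suffix:
                follow = start + self.n
                return token_ids[follow:follow + num_speculative_tokens]
        return []


# Dispatch lives outside the model runner: callers only see BaseProposer.
PROPOSERS: dict[str, type[BaseProposer]] = {"ngram": NgramProposer}


def build_proposer(name: str, **kwargs) -> BaseProposer:
    return PROPOSERS[name](**kwargs)


proposer = build_proposer("ngram", n=2)
print(proposer.propose([1, 2, 3, 4, 1, 2], num_speculative_tokens=2))
```

A hardware plugin could register its own subclass in such a registry without touching the runner, which is the kind of extensibility the RFC is after.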
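Issue #36266: Add MLA + Quant support in vLLM
- Core topic: Requests quantization support, FP8 in particular, for DeepSeek MLA (Multi-head Latent Attention).
- Viewpoints:
  - Author (baonudesifeizhai): Points out that the FlashInfer library already has MLA kernels and suggests integrating them into vLLM to enable MLA + Quant.
  - Participants (dorhuri123, carlyou): Both said they would like to take on the task and collaborate.
  - Participant (ProExpertProg): Noted that the issue duplicates #35792 and clarified that the existing kernels mentioned may not fully solve the core problem of MLA quantization.
- Outcome: The issue was quickly closed as a duplicate. The related work is proceeding in #35792 and follow-up PRs such as #36205.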
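Issue #36236: vllm fails to load continual pre-trained Qwen3.5-MoE model
- Core topic: A Qwen3.5-MoE model produced by continual pre-training (CPT) with Transformers 5.x cannot be loaded by vLLM. According to the report, newer Transformers versions renamed the config class from Qwen3_5MoeConfig to Qwen3_5MoeTextConfig.
- Viewpoints:
  - Reporter (itacora): Believes vLLM's internal references to the config class need updating.
  - Participant (MatteoFari): Clarified that vLLM already has Qwen3_5MoeTextConfig. The real problem is that text-only MoE checkpoints are wrongly routed to the multimodal path, so the fix should add proper support for the text-only MoE model type.
  - Reporter (itacora): Agreed with MatteoFari's analysis.
- Current status: Open. The agreed fix direction is to improve model-type routing rather than simply rename the class.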
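Issue #36217: The checkpoint you are trying to load has model type qwen3_5 but Transformers does not recognize this architecture
- Core topic: vLLM v0.16.0 cannot load Qwen3.5 models.
- Viewpoints:
  - Reporter and participants: Multiple users hit this problem and suspect that vLLM v0.16.0 does not yet support Qwen3.5.
  - Participant (Abathur284284284): Added that deployment in AMD's prebuilt ROCm 7.0 environment fails because the transformers version is too old.
- Current status: Open. The issue shows strong community demand for fast support of new model architectures and exposes dependency-management problems across deployment environments such as AMD's prebuilt environment.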

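🔥 Hot Topics and Trend Analysis

Surge of Qwen-series issues: Several issues involve Qwen3/Qwen3.5 (including the MoE and VL variants): load failures (#36217, #36236), a quantized-model crash (#36249), and a performance comparison (#36215). The volume reflects an active model ecosystem, but it also shows that vLLM's integration is under strain and needs continued follow-up.
Stability challenges in distributed execution: Reports include hangs in the Ray executor's compiled DAG mode (#36237) and crashes under high concurrency in multi-node deployments (#36221). They show that the robustness of the distributed execution engine in large-scale, high-concurrency production settings still needs sustained investment.
Performance optimization and fusion: Community interest in performance work is high, especially hardware-specific tuning (for example, an MoE tuning config for NVIDIA Blackwell GB10, #36273) and operator fusion (for example, MLA + Quant fusion in #36205, #36277, #36297).
Continued improvement of the AMD/ROCm platform: From weight loading and quantization parsing to attention backend validation and CI/CD work, AMD contributors were very active this period, showing sustained investment in the vLLM experience on MI300/MI355-series GPUs.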

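🛠️ Key Technical Changes

PR #36185: Reenable features for ROCm attention backends: After the refactor of the ROCm attention backend selection logic, this PR re-enables support for features such as FP8 KV cache. It fixes backends being wrongly rejected when validation failed and restores advanced features on AMD platforms.
PR #35758: Support HMA+NixlConnector: Lets NixlConnector work with the Hybrid Memory Allocator (HMA), vLLM's hybrid KV cache manager. For models that mix attention types (such as full attention plus sliding-window attention), this significantly reduces the amount of KV data transferred and lays groundwork for future KV transfer optimizations, for example for state space models.
PR #34730: Add shutdown timeout: Adds a graceful shutdown timeout to the core engine and API server, allowing in-flight requests to finish after SIGTERM is received, which gives operators more control over service shutdown.
PR #36042: Fix CUDA graph decode capture crash in AITER FlashAttention: Fixes a crash in the ROCm AITER FlashAttention backend during CUDA graph capture, caused by selecting the wrong code path, improving the stability of CUDA graph optimization on AMD platforms.
PR #35538: [docs][torch.compile] Add fusions.md: Adds a detailed reference page on compile-time fusion optimizations that systematically covers each fusion's mechanism, support matrix, and performance benefit, making torch.compile features easier to understand and debug.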
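The shutdown-timeout behavior described for PR #34730 follows a general drain-then-cancel pattern, sketched here with asyncio. The helper `drain_with_timeout` is hypothetical and is not vLLM's implementation. In a real server it would typically be triggered from a SIGTERM handler (for example one registered with `loop.add_signal_handler`), which is left out so the sketch runs on any platform.

```python
import asyncio


async def drain_with_timeout(
    in_flight: set[asyncio.Task], timeout_s: float
) -> tuple[int, int]:
    """Wait up to timeout_s for in-flight requests, then cancel the rest.

    Returns (completed, cancelled) counts.
    """
    if not in_flight:
        return 0, 0
    done, pending = await asyncio.wait(in_flight, timeout=timeout_s)
    for task in pending:
        task.cancel()
    # Collect cancellations so no task is left pending at exit.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def demo() -> tuple[int, int]:
    fast = asyncio.create_task(asyncio.sleep(0.01))  # finishes within the window
    slow = asyncio.create_task(asyncio.sleep(10))    # exceeds the window
    return await drain_with_timeout({fast, slow}, timeout_s=0.2)


print(asyncio.run(demo()))
```

The timeout bounds how long a rollout or pod termination has to wait, while requests that finish inside the window still return normally.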

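📈 Development Activity Observations

Active contributors: AMD contributors (handles with the -amd suffix) submitted several key fixes and optimization PRs this period, showing deep involvement in hardware ecosystem support. Community members also took an active part in root-cause analysis and solution design (for example #36228), and the quality of contributions was high.
Efficient PR review and merging: 42 PRs were merged within 24 hours, indicating an efficient review and merge process in the core team. Many PRs moved quickly into testing and merging once labeled ready (for example #36208, #36284, #36293).
High community engagement: Many new issues were filed, and many of them include detailed reproduction steps and logs, which helps locate problems quickly. Developers also volunteered to contribute to new features (such as MLA quantization) and architectural improvements (such as the unified speculative decoding interface).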

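💡 Issues Worth Watching

Stability of GPTQ-Int4 quantized models (#36249): Several users report that Qwen3.5 GPTQ-Int4 models crash once throughput reaches a certain level, while the unquantized versions run normally. This may be a compatibility problem between vLLM kernels and this quantization format under high load and deserves focused investigation.
CPU offload broken on Windows (#36279): With --cpu-offload-gb failing on v0.16.0, users with limited VRAM cannot run larger models on Windows.
Unification of the speculative decoding architecture (#36219): The RFC proposes a major architectural refactor. The decision will shape how future speculative decoding algorithms are developed and how hardware plugins integrate, so it merits broad community attention and discussion.
Multiple model loading and runtime problems: These include the Qwen3.5 load failure (#36217), the Qwen3.5 MoE CPT model load failure (#36236), and the Qwen3 AWQ model load error (#36234). They reflect the compatibility challenges vLLM faces in keeping up with many new model architectures and point to the need for systematic testing and adaptation.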

📋 Appendix: Detailed Data Lists

New Issues

#36302 [Installation]: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_jb | installation | by WillLester (created: 2026-03-07 10:49 (UTC+8))
#36237 [Bug]: Generation hangs until RAY_CGRAPH_get_timeout (300s) with Ray compiled DAG executor | bug | by jQizhang (created: 2026-03-06 16:54 (UTC+8))
#36295 [Bug]: Deepseek R1 TRTLLM FP8 MoE produces garbage output | bug | by wzhao18 (created: 2026-03-07 08:38 (UTC+8))
#36236 [Bug]: vllm fails to load continual pre-trained Qwen3.5-MoE model due to missing support for transformers 5.x renamed class (Qwen3_5MoeTextConfig) | bug | by itacora (created: 2026-03-06 16:51 (UTC+8))
#36264 [Tracking issue]: NVIDIA CI improvements | feature request | by jasonlizhengjian (created: 2026-03-07 00:33 (UTC+8))
#36266 [Feature]: Add MLA + Quant support in vLLM (leveraging existing FlashInfer MLA support) | feature request | by baonudesifeizhai (created: 2026-03-07 01:07 (UTC+8))
#36279 [Bug]: --cpu-offload-gb fails with AssertionError on Windows (v0.16.0) | bug | by ghbaud (created: 2026-03-07 04:27 (UTC+8))
#36249 [Bug]: Qwen3.5-27B-GPTQ-Int4 crashes [vllm version v0.16.1rc1] | bug | by berkayersoyy (created: 2026-03-06 19:24 (UTC+8))
#36275 [Bug]: Qwen3.5 4b incompatibility | bug | by aminsamir45 (created: 2026-03-07 03:31 (UTC+8))
#36219 [RFC]: Speculative Decoding Proposer Interface Unification Proposal | RFC | by wangxiyuan (created: 2026-03-06 14:41 (UTC+8))
#36272 [Feature]: [CPU] Performance improvement: Auto-preload Intel OpenMP on x86 for multi-core CPU inference | feature request,cpu | by gxoga (created: 2026-03-07 02:29 (UTC+8))
#36257 Security: Unrestricted method invocation via collective_rpc endpoint in dev mode | no labels | by Vext-Labs (created: 2026-03-06 23:09 (UTC+8))
#36234 [Bug]: When using vllm version 0.16.0, loading the qwen3-14b-awq model reports an error such as: ERROR _wrapper.py:141: Error in wrapped target: CUDA error: the provided PTX was compiled with an unsupported toolchain. | bug | by uOnePiece (created: 2026-03-06 16:41 (UTC+8))
#36251 [Bug]: Qwen3.5-397B-A17B-FP8 crashes with TP=4 + Expert Parallelism - dimension mismatch in fused linear sharding | no labels | by UmutAlihan (created: 2026-03-06 19:49 (UTC+8))
#36217 [Bug]: The checkpoint you are trying to load has model type qwen3_5 but Transformers does not recognize this architecture | bug | by huangye123 (created: 2026-03-06 14:27 (UTC+8))
#36228 [Bug]: Multiprocessing executor produces corrupted vision encoder outputs for Qwen3-VL in v0.12.0 (ray executor works correctly) | no labels | by AGENDD (created: 2026-03-06 16:08 (UTC+8))
#36245 [Bug]: Gemma3 mmproj-*.gguf is not downloaded in 'download_gguf' | bug | by lmjantsch (created: 2026-03-06 17:45 (UTC+8))
#36235 [Bug]: When using VLLM version 0.16.0, there will be an error when loading the qwen3-14b-awq model, such as: ERROR _wrapper.py:141: Error in wrapped target: CUDA error: the provided PTX was compiled with an unsupported toolchain. | bug | by uOnePiece (created: 2026-03-06 16:43 (UTC+8))
#36233 [Bug]: Critical bug in tool_call test scenarios [vllm version 0.16.0] | bug | by ajiang17 (created: 2026-03-06 16:38 (UTC+8))
#36220 [Bug]: vllm serve quantized GLM-5 failed | bug | by artificialzjy (created: 2026-03-06 14:45 (UTC+8))
#36215 [Performance]: vLLM is slower than SGLang when deploying the Qwen3.5 model. | performance | by yszhli (created: 2026-03-06 14:13 (UTC+8))
#36223 [RFC]: Support quarot for eagle3 | RFC | by drslark (created: 2026-03-06 15:13 (UTC+8))
#36222 [Usage]: MoE flatten_tp_size should not unconditionally include dp_size - DP loses its original semantics for MoE layers | usage | by gerayking (created: 2026-03-06 15:02 (UTC+8))
#36221 [Bug]: Two-node deployment of kimi2-5, runtime crash | bug | by xidiancpy (created: 2026-03-06 14:55 (UTC+8))
#36207 [Bug]: kimi-2.5 reasoning_parser error. | bug | by SaltFish11 (created: 2026-03-06 12:10 (UTC+8))
#36203 [Bug]: CUDAGraphMode.FULL not supported with ChunkedLocalAttention backend for Llama4 models | bug | by huydhn (created: 2026-03-06 11:10 (UTC+8))

Closed Issues

#18517 [Feature]: hope that xgrammar and vLLM v1 can offer significant inference acceleration on the RTX 4090 as well | feature request,stale | by Zachary-ai-engineer (closed: 2026-03-07 10:18 (UTC+8))
#18673 [Installation]: Hard to find right wheel files to build the release version | installation,stale | by llsj14 (closed: 2026-03-07 10:18 (UTC+8))
#21511 [Feature]: Qwen3 Models GGUF Support | feature request,stale | by ctcanbol (closed: 2026-03-07 10:18 (UTC+8))
#23203 [Feature]: Support gpt-oss mxfp4 on compute capability 7.5 | feature request,stale | by Aaron1011 (closed: 2026-03-07 10:17 (UTC+8))
#23225 [Feature][Responses API] Support logprobs | feature request,stale | by heheda12345 (closed: 2026-03-07 10:17 (UTC+8))
#23244 [Bug]: openai/gpt-oss-20b breaks on data parallel | bug,stale | by Fhrozen (closed: 2026-03-07 10:17 (UTC+8))
#23767 [Feature]: Add LORA Model Name in Open Telemetry | feature request,stale | by alew3 (closed: 2026-03-07 10:17 (UTC+8))
#24655 [Bug]: msgspec.DecodeError: MessagePack data is malformed: trailing characters (byte 198) | bug,stale | by TideDra (closed: 2026-03-07 10:17 (UTC+8))
#24790 [Usage]: multi-gpu, I want embedding model deploy on GPU1, and set CUDA_VISIBLE_DEVICES=1, in vllm start script file, but vllm service still malloc memory of GPU 0. | usage,stale | by suchenRoad (closed: 2026-03-07 10:17 (UTC+8))
#35651 [Bug][XPU]: Inference generates garbage with flash attention | bug | by andswitch (closed: 2026-03-07 03:00 (UTC+8))
#36266 [Feature]: Add MLA + Quant support in vLLM (leveraging existing FlashInfer MLA support) | feature request | by baonudesifeizhai (closed: 2026-03-07 05:29 (UTC+8))
#35641 [Bug]: ROCm MI355X Kimi K2.5 AITER TP8 MLA kernel Error (num_head=8) | bug,performance,rocm | by functionstackx (closed: 2026-03-07 04:24 (UTC+8))
#36257 Security: Unrestricted method invocation via collective_rpc endpoint in dev mode | no labels | by Vext-Labs (closed: 2026-03-07 00:31 (UTC+8))
#35574 [Bug]: Qwen3.5 Can not close thinking by "enable_thinking": false | bug | by Charmnut (closed: 2026-03-06 16:25 (UTC+8))
#35828 [Bug]: Qwen3-Next accuracy regression on AMD | bug,rocm | by jennyyyyzhen (closed: 2026-03-06 14:03 (UTC+8))
#35945 [Bug]: AssertionError in causal_conv1d_update when capturing CUDA graphs for Qwen3.5/GDN layers | bug | by git-jxj (closed: 2026-03-06 14:26 (UTC+8))
#33418 [Bug]: wrong error reported when len(prompt) + requested tokens > max_context_len | bug | by sducouedic (closed: 2026-03-06 14:15 (UTC+8))

New PRs

#36303 [Cleanup] Remove duplicate parser registration | no labels | by taneem-ibrahim (created: 2026-03-07 11:02 (UTC+8))
#36293 [ROCm][CI] Making entrypoints more deterministic on ROCm | rocm,ready | by AndreasKaratzas (created: 2026-03-07 07:42 (UTC+8))
#36253 [ROCM] Optimize the fused_topk_bias to use aiter instead of fallback torch ops. | rocm,ready | by benenzhu (created: 2026-03-06 20:12 (UTC+8))
#36300 [Bug] Fix pooling model benchmark script | bug,performance,ready | by yewentao256 (created: 2026-03-07 09:20 (UTC+8))
#36265 pick up tuned prefill configs for FP8 FA3 | ci/build | by jmkuebler (created: 2026-03-07 00:50 (UTC+8))
#36241 [Bugfix] Fix FlashInfer block size restriction breaking hybrid attention models | bug,documentation,v1,nvidia | by AjAnubolu (created: 2026-03-06 17:23 (UTC+8))
#36240 [Bugfix] Fix MoE flatten_tp_size unconditionally including dp_size | bug | by AjAnubolu (created: 2026-03-06 17:23 (UTC+8))
#36239 [Bugfix] Remove incorrect assertion blocking mixed decode+spec-decode batches in GDN attention | bug,v1 | by AjAnubolu (created: 2026-03-06 17:23 (UTC+8))
#36227 [ROCm] Add cascade attention support for AITER FlashAttention backend | rocm,needs-rebase,v1 | by ChuanLi1101 (created: 2026-03-06 15:51 (UTC+8))
#36299 [CI] Fix BackgroundResources double-cleanup crash by adding guard | rocm,ready,v1 | by AndreasKaratzas (created: 2026-03-07 08:59 (UTC+8))
#36297 Fused BMM+FP8 quant Triton kernel for MLA _v_up_proj (forward_mqa path) | performance,ci/build,v1 | by dorhuri123 (created: 2026-03-07 08:41 (UTC+8))
#36286 [MoE Refactor] Migrate UnquantizedFusedMoEMethod and oracle to MK flow | nvidia | by yzong-rh (created: 2026-03-07 06:16 (UTC+8))
#36301 [XPU][Doc] update xpu document about triton dependency/conflict issue. | documentation | by jikunshang (created: 2026-03-07 09:22 (UTC+8))
#36290 [BugFix] Avoid ignored trust_remote_code warnings | bug,speculative-decoding,ready | by njhill (created: 2026-03-07 06:45 (UTC+8))
#36281 [BE] Rename should_torch_compile_mm_vit to should_torch_compile_mm_encoder | documentation,ready,llama,qwen | by Lucaskabela (created: 2026-03-07 05:21 (UTC+8))
#36292 [ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks | rocm,ready,v1 | by AndreasKaratzas (created: 2026-03-07 07:05 (UTC+8))
#36289 [Model] Register Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM for text-only checkpoints | new-model,qwen | by aminsamir45 (created: 2026-03-07 06:41 (UTC+8))
#36294 [MoE Refactor] Remove "naive" all2all backend | no labels | by bnellnm (created: 2026-03-07 08:20 (UTC+8))
#36298 full cudagraph for flex-attn | v1,nvidia | by shunting314 (created: 2026-03-07 08:48 (UTC+8))
#36280 [Model Runner V2] Fix warmup for pipeline parallel | v1 | by njhill (created: 2026-03-07 05:18 (UTC+8))
#36296 [Bug] Fix TRTLLM FP8 MoE Monolithic | bug,nvidia | by wzhao18 (created: 2026-03-07 08:40 (UTC+8))
#36270 [Core] Fix benign error log during normal shutdown | ready,v1 | by njhill (created: 2026-03-07 02:13 (UTC+8))
#36282 mla: don't update kv cache on dummy forwards | ready,nvidia | by itayalroy (created: 2026-03-07 05:43 (UTC+8))
#36284 [ROCm][CI] Fixing yaml file for external amd-ci signal | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-03-07 05:56 (UTC+8))
#36285 Add structured output support for beam search in online serving | frontend | by ashutosh1919 (created: 2026-03-07 06:08 (UTC+8))
#36205 [mla] Support fused FP8/NVFP4 output quantization in MLA attention (#35792) | performance,ci/build,v1 | by carlyou (created: 2026-03-06 11:44 (UTC+8))
#36216 [V0 Deprecation] Remove unused swap_space parameter | documentation,performance,frontend,ready,ci/build,v1,kv-connector | by majiayu000 (created: 2026-03-06 14:22 (UTC+8))
#36291 Create test-qwen-deepseek-1.5-accuracy.py | v1,qwen,deepseek | by puririshi98 (created: 2026-03-07 06:53 (UTC+8))
#36288 [torch.compile][BE] Modify cudagraph callable to check for is_forward_context_set | documentation,llama,qwen,nvidia | by Lucaskabela (created: 2026-03-07 06:27 (UTC+8))
#36287 [Core] Add --tensor-parallel-size-attention for TPA separation | v1,llama | by sungsooha (created: 2026-03-07 06:20 (UTC+8))
#36283 [Feature] Add LLM.shutdown() for explicit engine teardown | frontend | by wojciech-wais (created: 2026-03-07 05:44 (UTC+8))
#36277 [Compile] Add MLA attention + FP8 static quant fusion support | v1,nvidia | by dorhuri123 (created: 2026-03-07 04:06 (UTC+8))
#36258 [Feature] Add /live liveness probe, LLM.shutdown(), and k8s shutdown docs | documentation,frontend,v1 | by wojciech-wais (created: 2026-03-06 23:38 (UTC+8))
#36276 [EPLB] Add nixl-based eplb communicator | ci/build,v1,kv-connector | by ilmarkov (created: 2026-03-07 04:04 (UTC+8))
#36278 [Model] [Bugfix] Adding legacy MoE weight format support in Step-3.5-Flash for quantized model inference | bug | by ColinZ22 (created: 2026-03-07 04:10 (UTC+8))
#36268 [Audio] Bundle get_generation_prompt() params into SpeechToTextParams | documentation,frontend,qwen | by ekagra-ranjan (created: 2026-03-07 01:49 (UTC+8))
#36273 [Kernel][MoE] Add fp8_w8a8 MoE tuning config for NVIDIA GB10 (DGX Spark) | nvidia | by scottgl9 (created: 2026-03-07 02:36 (UTC+8))
#36274 [Bugfix][ROCm] Strip block_size before attention backend validation | bug,rocm | by jennyyyyzhen (created: 2026-03-07 02:36 (UTC+8))
#36271 [EPLB] Remove main waits in case of slow EPLB | no labels | by ilmarkov (created: 2026-03-07 02:28 (UTC+8))
#36206 [CI][MM] Gate vision encoder attention mask to MiniCPM only, fixing Aria regression | ready | by AndreasKaratzas (created: 2026-03-06 11:49 (UTC+8))
#36261 [EPLB] Optmize eplb mapping and record in router for prefill | no labels | by ilmarkov (created: 2026-03-07 00:25 (UTC+8))
#36269 bytes calculation for weight offloading | meta-exported,fb-exported | by mxz297 (created: 2026-03-07 01:55 (UTC+8))
#36267 [EPLB] Simplify EPLB rearrange by only returning one map | no labels | by SageMoore (created: 2026-03-07 01:26 (UTC+8))
#36238 Add 'none' reasoning effort to ChatCompletionRequest | frontend | by juliendenize (created: 2026-03-06 17:08 (UTC+8))
#36232 [ROCm][Quantization] make quark ocp mx dtype parser robust for weight-only quantization | rocm,ready | by xuebwang-amd (created: 2026-03-06 16:19 (UTC+8))
#36262 Revert "[BugFix] Fix engine hanging after KV cache initialization fai… | bug,v1 | by njhill (created: 2026-03-07 00:27 (UTC+8))
#36263 feat(attention): extract KV-cache update from FlexAttention backend | v1 | by cong-or (created: 2026-03-07 00:29 (UTC+8))
#36260 feat(attention): extract KV-cache update from FlexAttention backend | v1 | by cong-or (created: 2026-03-07 00:25 (UTC+8))
#36259 Feat/sigint graceful shutdown | frontend,v1 | by wojciech-wais (created: 2026-03-07 00:18 (UTC+8))
#36250 [Bugfix] Fix FP8 paged MQA fallback for CUDA graph capture | bug,v1,nvidia | by ZJY0516 (created: 2026-03-06 19:26 (UTC+8))
#36252 [Bugfix] Allow concurrency and memory_limit for runai_streamer_sharded | bug | by ziyangliu-666 (created: 2026-03-06 19:54 (UTC+8))
#36242 [Bugfix] Fix Qwen3-Next in_proj_ba weight sharding with TP > 1 | bug,ready,qwen | by AjAnubolu (created: 2026-03-06 17:42 (UTC+8))
#36255 fix: improve token_ids_cpu swap to copy only valid indices | tpu,v1 | by zzaebok (created: 2026-03-06 21:35 (UTC+8))
#36204 use skip_all_guards_unsafe to drop global_state and torch_function_mode_stack guards instead of previous hacks | no labels | by laithsakka (created: 2026-03-06 11:19 (UTC+8))
#36256 [Misc] Use VLLMValidationError in batch, pooling, and tokenize protocol validators | frontend | by umut-polat (created: 2026-03-06 21:36 (UTC+8))
#36244 [torch.compile] Remove attention layer name from unified_kv_cache_update | no labels | by SongyouZhong (created: 2026-03-06 17:43 (UTC+8))
#36247 [Bugfix] Fix compressed-tensors quantization failure for DeepSeek-R1 on MI300x | bug,ready,deepseek | by vllmellm (created: 2026-03-06 18:16 (UTC+8))
#36218 [UX] Infer dtype for local checkpoint | ready | by Isotr0py (created: 2026-03-06 14:38 (UTC+8))
#36248 docs+tests: consolidate doc fixes and test assertion | documentation,multi-modality,cpu | by haosenwang1018 (created: 2026-03-06 18:21 (UTC+8))
#36254 [Misc] Use VLLMValidationError consistently in chat completion and completion protocol validators | frontend | by umut-polat (created: 2026-03-06 21:31 (UTC+8))
#36230 [CI] Fix startup error test | ready,v1 | by hmellor (created: 2026-03-06 16:14 (UTC+8))
#36246 [CI/Build] Updated rmsnorm test to improve OOT device coverage | no labels | by romitjain (created: 2026-03-06 18:04 (UTC+8))
#36214 docs: clarify CPU backend issue reporting sentence | documentation,cpu | by haosenwang1018 (created: 2026-03-06 14:03 (UTC+8))
#36213 docs: capitalize GitHub Actions in profiling guide | documentation | by haosenwang1018 (created: 2026-03-06 14:02 (UTC+8))
#36212 tests: assert missing video loader error message | multi-modality | by haosenwang1018 (created: 2026-03-06 13:53 (UTC+8))
#36211 docs: fix speculative decoding wording typos | documentation | by haosenwang1018 (created: 2026-03-06 13:50 (UTC+8))
#36243 fix(config): skip out-of-stage layers in get_layers_from_vllm_config for pipeline parallel | no labels | by tusharshetty61 (created: 2026-03-06 17:42 (UTC+8))
#36231 [Misc] Add enable_log_requests parameter to RequestLogger | frontend | by chaunceyjiang (created: 2026-03-06 16:17 (UTC+8))
#36226 [Kernel] Add MMQ kernels for all I-Matrix quantization types | no labels | by aimbit-ni (created: 2026-03-06 15:23 (UTC+8))
#36229 [Build] Fix API rate limit exceeded when using VLLM_USE_PRECOMPILED=1 | ci/build | by elvischenv (created: 2026-03-06 16:09 (UTC+8))
#36224 [Bugfix] Fix reasoning token routing with tool parsers: prompt false positive and transition-batch loss | bug,frontend,qwen | by alexbi29 (created: 2026-03-06 15:22 (UTC+8))
#36225 [main][feature] Support quarot for eagle3 | speculative-decoding,llama | by drslark (created: 2026-03-06 15:22 (UTC+8))
#36210 [ROCm][CI] Preparing gfx90a mirroring | rocm,ci/build | by AndreasKaratzas (created: 2026-03-06 13:13 (UTC+8))
#36209 docs: fix wrong cc in int8.md | documentation,ready | by KevinZonda (created: 2026-03-06 12:18 (UTC+8))
#36208 [CI] Fix bge-m3 similarity reference values after Defination typo fix | ready | by AndreasKaratzas (created: 2026-03-06 12:16 (UTC+8))

Merged PRs

#36293 [ROCm][CI] Making entrypoints more deterministic on ROCm | rocm,ready | by AndreasKaratzas (merged: 2026-03-07 11:04 (UTC+8))
#36042 Fix CUDA graph decode capture crash in AITER FlashAttention | rocm,ready,v1,nvidia,meta-exported,fb-exported | by iseeyuan (merged: 2026-03-07 10:12 (UTC+8))
#35971 refine vllm bench throughput --backend hf | performance,ready | by jikunshang (merged: 2026-03-07 10:10 (UTC+8))
#36290 [BugFix] Avoid ignored trust_remote_code warnings | bug,speculative-decoding,ready | by njhill (merged: 2026-03-07 09:24 (UTC+8))
#36280 [Model Runner V2] Fix warmup for pipeline parallel | v1 | by njhill (merged: 2026-03-07 08:58 (UTC+8))
#36270 [Core] Fix benign error log during normal shutdown | ready,v1 | by njhill (merged: 2026-03-07 08:39 (UTC+8))
#36282 mla: don't update kv cache on dummy forwards | ready,nvidia | by itayalroy (merged: 2026-03-07 08:36 (UTC+8))
#36284 [ROCm][CI] Fixing yaml file for external amd-ci signal | rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-03-07 08:30 (UTC+8))
#35538 [docs][torch.compile] Add fusions.md - kernel/operator fusion reference page | documentation,ready,torch.compile | by copilot-swe-agent (merged: 2026-03-07 07:55 (UTC+8))
#35850 [ROCm] Support MLA with nhead<16 and FP8 KV cache for TP=8 (Kimi K2.5/Linear) | rocm,ready,v1 | by ChuanLi1101 (merged: 2026-03-07 04:24 (UTC+8))
#35253 Enabling some B200-specific tests on MI355 | ready,ci/build | by Alexei-V-Ivanov-AMD (merged: 2026-03-07 03:27 (UTC+8))
#35877 [CustomOp] CustomOp FusedRMSNormGated | ready,v1 | by eellison (merged: 2026-03-07 02:53 (UTC+8))
#36206 [CI][MM] Gate vision encoder attention mask to MiniCPM only, fixing Aria regression | ready | by AndreasKaratzas (merged: 2026-03-06 17:17 (UTC+8))
#36165 [Bugfix] Fix cudagraph_mode:FULL dispatch (This does not impact FULL_AND_PIECEWISE (default)) | bug,ready,v1,nvidia | by TQCB (merged: 2026-03-06 22:15 (UTC+8))
#36262 Revert "[BugFix] Fix engine hanging after KV cache initialization fai… | bug,v1 | by njhill (merged: 2026-03-07 00:28 (UTC+8))
#35478 [BugFix] Fix engine hanging after KV cache initialization failure | bug,ready,v1 | by 842974287 (merged: 2026-03-06 12:58 (UTC+8))
#36068 [Bugfix] Quickfix followups to busy loop removal in #28053 | bug,ready,v1 | by tjohnson31415 (merged: 2026-03-07 00:13 (UTC+8))
#36152 [compile] Stop unconditionally patching constrain_to_fx_strides | ready | by zou3519 (merged: 2026-03-06 23:17 (UTC+8))
#35202 [Refactor] Modular video loader backend refactoring | ready,multi-modality | by Isotr0py (merged: 2026-03-06 22:06 (UTC+8))
#36024 [Misc] Lazy import registered processors | ready,deepseek | by Isotr0py (merged: 2026-03-06 22:06 (UTC+8))
#36150 [bugfix] add api process rank in default multimodal request | bug,ready | by fake0fan (merged: 2026-03-06 20:00 (UTC+8))
#36230 [CI] Fix startup error test | ready,v1 | by hmellor (merged: 2026-03-06 19:50 (UTC+8))
#36153 [Frontend] Add Support for MM Encoder/Decoder Beam Search (Offline) | frontend,ready,multi-modality | by alex-jw-brooks (merged: 2026-03-06 17:16 (UTC+8))
#35758 [Core][KVConnector] Support HMA+NixlConnector | ready,v1,kv-connector | by NickLucche (merged: 2026-03-06 15:51 (UTC+8))
#35158 [Spec Decode][KV Connector] Fix KV transfer in PD + speculative decoding | ready,ci/build,v1,kv-connector,ready-run-all-tests | by ZhanqiuHu (merged: 2026-03-06 15:50 (UTC+8))
#35553 [ROCm][CI] Fix tool use test stability - disable skinny GEMM, prefix caching, eliminate batch variance | documentation,rocm,ready,ci/build,v1 | by AndreasKaratzas (merged: 2026-03-06 15:15 (UTC+8))
#36173 Change "following fields were present in the request but ignored" log from warn to debug | frontend,ready | by tlrmchlsmth (merged: 2026-03-06 14:15 (UTC+8))
#36191 [BugFix] avoid infinite loop with VLLM_PORT and get_open_ports_list | bug,ready | by walterbm (merged: 2026-03-06 14:15 (UTC+8))
#36192 [Security] Respect user trust_remote_code setting in NemotronVL and KimiK25 | ready | by russellb (merged: 2026-03-06 14:15 (UTC+8))
#36197 [Bugfix] Fix misleading context length error messages | bug,frontend,ready | by AjAnubolu (merged: 2026-03-06 14:15 (UTC+8))
#35892 [Bugfix] Fix inner_dp_world initialization order for multi-node TP | bug,ready,ci/build,v1 | by zyongye (merged: 2026-03-06 14:04 (UTC+8))
#34730 [Frontend][Core] Add shutdown timeout - allowing in-flight requests to finish | frontend,ready,v1 | by markmc (merged: 2026-03-06 14:04 (UTC+8))
#36164 perf: add slots to KVCacheBlock | ready,v1 | by cong-or (merged: 2026-03-06 14:04 (UTC+8))
#36209 docs: fix wrong cc in int8.md | documentation,ready | by KevinZonda (merged: 2026-03-06 14:01 (UTC+8))
#36208 [CI] Fix bge-m3 similarity reference values after Defination typo fix | ready | by AndreasKaratzas (merged: 2026-03-06 13:07 (UTC+8))
#36047 [Feature] Add --distributed-timeout-seconds CLI option | ready,v1 | by 842974287 (merged: 2026-03-06 12:57 (UTC+8))
#34754 [Bug] Fix a corner case in _process_simple_streaming_events | bug,frontend,ready | by 842974287 (merged: 2026-03-06 12:57 (UTC+8))
#36158 [Misc] Rename group_mm_kwargs_by_modality -> group_and_batch_mm_kwargs | ready,v1,multi-modality | by DarkLight1337 (merged: 2026-03-06 12:32 (UTC+8))
#36177 [ROCm][CI] Adding missing dependencies for Multi-modal models tests | rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-03-06 12:23 (UTC+8))
#36156 [Bugfix] Fix simple Mistral-Small example | bug,documentation,ready | by DarkLight1337 (merged: 2026-03-06 12:25 (UTC+8))
#36185 Reenable features for ROCm attention backends | documentation,rocm,ready,v1 | by Rohan138 (merged: 2026-03-06 12:21 (UTC+8))
#36147 cpu: aarch64: Upgrade OneDNN for aarch64 to add support for int8 matmul | ready,ci/build,cpu | by nikhil-arm (merged: 2026-03-06 11:48 (UTC+8))

PRs Closed Without Merging

#20275 [Bugfix][Frontend]: Fix API server connection refused on wsl2 | frontend,stale | by Chen-zexi (closed: 2026-03-07 10:18 (UTC+8))
#21041 [Model] Reasoning Parser for Nemotron Models | stale | by schoennenbeck (closed: 2026-03-07 10:18 (UTC+8))
#28062 [wip] Fix prime rl test | ready,ci/build,stale | by rzabarazesh (closed: 2026-03-07 10:17 (UTC+8))
#36299 [CI] Fix BackgroundResources double-cleanup crash by adding guard | rocm,ready,v1 | by AndreasKaratzas (closed: 2026-03-07 09:30 (UTC+8))
#36277 [Compile] Add MLA attention + FP8 static quant fusion support | v1,nvidia | by dorhuri123 (closed: 2026-03-07 05:59 (UTC+8))
#35610 [Fix][XPU] TRITON_ATTN: add pytorch-triton-xpu to xpu requirements | ci/build | by andswitch (closed: 2026-03-07 03:00 (UTC+8))
#36260 feat(attention): extract KV-cache update from FlexAttention backend | v1 | by cong-or (closed: 2026-03-07 00:28 (UTC+8))
#28850 [Hardware][AMD] Add fused QK RoPE and reshape & cache flash support for ROCm | rocm,needs-rebase,stale,v1,qwen | by mjkvaak-amd (closed: 2026-03-06 22:28 (UTC+8))
#35187 feat: enable EPLB for NVFP4 compressed-tensors ML3 checkpoint | no labels | by hypdeb (closed: 2026-03-06 22:02 (UTC+8))
#34062 [torch.compile] Remove attention layer name from unified_kv_cache_update | needs-rebase | by veeceey (closed: 2026-03-06 21:31 (UTC+8))
#24395 [Performance]: Improve Prefix Cache Hit Rate and Reduce Dirty Cache Impact | v1 | by CLFutureX (closed: 2026-03-06 19:18 (UTC+8))
#24359 [Feature] Add sliding window support to FlexAttention backend | v1,gpt-oss | by alhridoy (closed: 2026-03-06 19:16 (UTC+8))
#24286 [Model] Add Support for Grok2 | documentation,new-model,needs-rebase | by wenchen76 (closed: 2026-03-06 19:14 (UTC+8))
#24202 Support prompt hidden states | tpu,needs-rebase,v1 | by charlotte12l (closed: 2026-03-06 19:10 (UTC+8))
#24069 [EPLB] Reconstruct EPLB algorithm invocation method | no labels | by raindaywhu (closed: 2026-03-06 19:09 (UTC+8))
#24068 Gfx908 attn fix | rocm,unstale | by UD-mmcminn (closed: 2026-03-06 19:08 (UTC+8))
#24027 Support using SigLIP2 text and image embedding as standalone model | new-model,multi-modality | by duc-ph (closed: 2026-03-06 19:08 (UTC+8))
#36214 docs: clarify CPU backend issue reporting sentence | documentation,cpu | by haosenwang1018 (closed: 2026-03-06 18:21 (UTC+8))
#36213 docs: capitalize GitHub Actions in profiling guide | documentation | by haosenwang1018 (closed: 2026-03-06 18:21 (UTC+8))
#36212 tests: assert missing video loader error message | multi-modality | by haosenwang1018 (closed: 2026-03-06 18:21 (UTC+8))
#36211 docs: fix speculative decoding wording typos | documentation | by haosenwang1018 (closed: 2026-03-06 18:21 (UTC+8))
#23566 [Model] Add lite-whisper model support in vLLM | new-model,frontend,needs-rebase,unstale | by yuqiannemo (closed: 2026-03-06 17:27 (UTC+8))
#23601 Implement standardized environment variable parsing with pathlib.Path support and type introspection | needs-rebase | by copilot-swe-agent (closed: 2026-03-06 17:27 (UTC+8))
#23571 Remove build_for_cudagraph_capture method and use build directly in attention metadata builders | needs-rebase,v1,nvidia | by copilot-swe-agent (closed: 2026-03-06 17:27 (UTC+8))
#23553 [EPLB] Model Modify of EPLB | needs-rebase,unstale,v1 | by shiyuan680 (closed: 2026-03-06 17:26 (UTC+8))
#23526 [Not for merge] Debug cuda graph sleep mode | needs-rebase,v1,nvidia | by 22quinn (closed: 2026-03-06 17:25 (UTC+8))
#23518 Fix gpt-oss tool call | frontend,unstale,gpt-oss | by sa411022 (closed: 2026-03-06 17:25 (UTC+8))
#23463 [#20711] Use QuantFp8 CustomOp-abstraction for MoE layers | performance,needs-rebase,nvidia | by rojagtap (closed: 2026-03-06 17:25 (UTC+8))
#23442 Add a flag to use FusedMoE kernel in compressed quantization | needs-rebase,unstale | by chenxi-yang (closed: 2026-03-06 17:23 (UTC+8))
#23339 [Refactor] Small cleanup for quantized FusedMoE | ready,needs-rebase,stale,nvidia | by amirkl94 (closed: 2026-03-06 17:23 (UTC+8))
#23277 Stats/encoder scheduled | needs-rebase,v1 | by h-brenoskuk (closed: 2026-03-06 17:21 (UTC+8))
#23269 [Bugfix] updated moe_wna16.cu BLOCK_SIZE_K check to allow glm-4.5-air | no labels | by anikifoss (closed: 2026-03-06 17:21 (UTC+8))
#23208 WIP: vllm-kernels package restructure | rocm,needs-rebase,ci/build,v1,qwen,cpu,nvidia | by seemethere (closed: 2026-03-06 17:20 (UTC+8))
#23151 Add tuned fused_moe configs for Qwen3-30B-A3B | qwen | by wenchen76 (closed: 2026-03-06 17:18 (UTC+8))
#23085 Bump actions/checkout from 4.2.2 to 5.0.0 | needs-rebase,ci/build,github_actions | by dependabot (closed: 2026-03-06 17:16 (UTC+8))
#26058 Add token sharding functions for context parallel | needs-rebase,stale,v1 | by qiruiyangmeta (closed: 2026-03-06 16:48 (UTC+8))
#26057 Add context parallelism configurations and parallel group | needs-rebase,stale,v1 | by qiruiyangmeta (closed: 2026-03-06 16:33 (UTC+8))
#32168 scheduler: Cache also the last block after KV recving | v1 | by orozery (closed: 2026-03-06 16:26 (UTC+8))
#29020 [Kernel] Separate Triton Attention Kernel Launches for Prefill and Decode for FULL CUDA Graph mode | v1,nvidia | by jvlunteren (closed: 2026-03-06 15:16 (UTC+8))
#35064 [ROCm] Support MLA with nhead<16 and FP8 KV cache for TP=8 (Kimi K2.5/Linear) | rocm,ci/build,v1,gpt-oss | by ChuanLi1101 (closed: 2026-03-06 14:31 (UTC+8))
#35946 Fix incorrect assertion in causal_conv1d_update for indexed conv_state | no labels | by git-jxj (closed: 2026-03-06 14:26 (UTC+8))
#34121 [BugFix] Mistakenly passing num_reqs_padded as num_reqs in _dummy_run | bug,v1 | by Selkh (closed: 2026-03-06 13:28 (UTC+8))
#36195 [TEST ONLY]Restore AsyncTP fusion for FlashInfer FP8 BMM (need advices)#27893 | ci/build,nvidia | by baonudesifeizhai (closed: 2026-03-06 11:52 (UTC+8))