vLLM Development Digest - 2026-03-02
Time window: 2026-03-02 11:29 (UTC+8) ~ 2026-03-03 11:29 (UTC+8)
Stats: 38 new issues | 21 issues closed | 82 new PRs | 37 PRs merged | 29 PRs closed without merging
📊 Daily Development Summary
vLLM development remained highly active over this 24-hour window, with a large volume of new and resolved issues and PRs. Work concentrated on AMD ecosystem support (e.g. NVFP4 model emulation, ROCm CI fixes) and core performance and stability (e.g. disaggregated prefill/decode (PD) systems, attention-backend fusion, broader model support). The community also held in-depth discussions on several high-profile bugs (PD disconnection, speculative decoding compatibility) and design decisions (KV-transfer error handling).
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was very high this cycle, spanning feature development, bug fixes, and CI improvements.
1. New Features and Extended Support
- PR #35733 & #35737 (by fxmarty-amd): Core work enabling NVFP4 dense and MoE models to run via emulation on AMD Instinct and non-Blackwell NVIDIA GPUs (e.g. Ampere, Hopper). This unblocks models such as nvidia/Qwen3-8B-NVFP4 on non-target hardware, which matters for research and cross-platform deployment.
- PR #35765 (by pschlan-amd): Optimizes the AITER MLA backend, using a Triton kernel to avoid a CPU sync in `_build_decode` and improve MLA decode performance on ROCm.
- PR #35722 (by hanlin12-AMD): Adds a `--hip_online_tuning` option to the vLLM server enabling hipBLASLt online GEMM tuning, to extract further performance from AMD GPUs.
- PR #35786 (by Rohan138): Enables the RoPE+KV-cache fusion pass for the ROCm AITER Flash Attention backend under the non-shuffle KV-cache layout, improving attention efficiency.
- Issues #35713 & #35712 (by tjtanaa): Feature requests to enable AITER's `fused_allreduce_rmsnorm_quant` and `fused_allreduce_rmsnorm` fusion passes on ROCm, rounding out the distributed fusion optimizations.
2. Bug Fixes and Compatibility
- PR #35658 (merged): Adds the amd-quark package to the ROCm Dockerfile and requirements, fixing the inability of ROCm images to run MXFP4 and other quantized models (e.g. amd/Kimi-K2.5-MXFP4).
- PR #35791 (by varun-sundar-rabindranath): Fixes a segfault caused by bit-matrix construction logic when running GPT-OSS with Expert Parallel on ROCm, improving the stability of complex MoE models on AMD.
- PR #35787 (by AndreasKaratzas): Improves ROCm's GCN architecture parsing to support alpha steppings (e.g. gfx90a) and guards environment-variable sync calls.
- PR #35798 (merged) & PR #35816: Fix a pytest-marker parsing error in the ROCm CI scripts caused by backslash line continuations, and treat pytest exit code 5 (no tests collected) as success, making CI more robust.
- PR #35806 (by micah-wil): Fixes the GPT-OSS quantization test assertion logic exposed by the amd-quark addition, allowing measured accuracy to exceed the expected value.
3. Testing and Validation
- PR #35710 (by AndreasKaratzas): Enables the async weight-transfer example on ROCm by applying platform-specific determinism settings (e.g. selecting the TRITON_ATTN backend, disabling BATCH_INVARIANT), and relaxes the validation pass rate to 90% to cope with current non-determinism.
💬 High-Activity Discussion Analysis
- Issue #35772: `FusedARRMS` hangs during CUDA graph capture at startup with TP>1
  - Core issue: On multi-GPU (TP>1) setups with the `fuse_allreduce_rms` fusion pass enabled, devices such as B200/H200 hang during CUDA graph capture.
  - Positions:
    - Reporter (benchislett): Provided a reproduction command and a temporary workaround: disable the fusion via `--compilation-config.pass_config.fuse_allreduce_rms false`.
    - Investigator (hjjq): Confirmed the problem is related to PR #34109, noted that setting the environment variable `VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm` is an effective single-node workaround, and suspects the `mnnvl` backend is at fault.
    - Decision (ProExpertProg): Suggested simply submitting a PR to make `trtllm` the default backend.
  - Current status: PR #35793 has been opened to switch the default backend to `trtllm`; a fix is underway.
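The two mitigations reported in the thread can be written out as launch variants. The model name and parallel size are placeholders; the flag and environment-variable names follow what is quoted in the issue:

```shell
# Workaround 1 (benchislett): disable the fusion pass entirely
vllm serve <model> --tensor-parallel-size 2 \
    --compilation-config.pass_config.fuse_allreduce_rms false

# Workaround 2 (hjjq, single node): force the trtllm allreduce backend
VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm \
    vllm serve <model> --tensor-parallel-size 2
```

Workaround 2 keeps the fusion pass enabled, so it is preferable where it applies; PR #35793 makes it the default.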
- Issue #35746: Segfault during model warmup on an AMD host with AVX512_BF16
  - Core issue: A user running vLLM on an AMD 7940HS CPU hits a segfault, suspected to involve Torch Inductor code generation or a library dependency.
  - Positions:
    - Reporter (NetWilliam): Provided detailed error output and the generated `cpp_fused__softmax_0` code, and offered to help with a fix.
    - Helper (bigPYJ1151): Ruled out dynamic-library issues, suspects `torch.compile`, and suggested verifying with the `--enforce-eager` flag.
    - Verification: The reporter confirmed `--enforce-eager` lets the server start, but it still crashes when a request arrives, pointing further at the generated code.
  - Current status: Still open; the root cause is unknown and may be a code-generation defect under this CPU's instruction set.
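The isolation flow used in the thread is a generally useful debugging recipe; sketched below with a placeholder model name (the endpoint and payload follow vLLM's standard OpenAI-compatible server API):

```shell
# Step 1: start with torch.compile disabled; if startup now succeeds,
# compilation is implicated in the startup-time crash
vllm serve <model> --enforce-eager

# Step 2: send a minimal request; in this issue the crash still fired here,
# narrowing the fault toward generated code rather than engine startup
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model>", "prompt": "hi", "max_tokens": 4}'
```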
- Issue #35724: "Device does not support multicasting" running Qwen3.5-122B with TP=2 on H100 PCIe
  - Core issue: On H100s connected only via PCIe, the vLLM V1 engine fails when it tries to use symmetric memory (multicast) for TP communication.
  - Positions:
    - Reporter (wallbreaker740): Provided full environment details and noted that v0.15.0 works on the same hardware, implying a new V1-engine feature is responsible.
    - Responder (Saad-Mallebhari): Identified this as a hardware limitation (PCIe lacks NVLink multicast) and suggested disabling the feature via the `DISABLE_SYMMETRIC_MEMORY=1` environment variable.
    - Follow-up: The reporter tried several disabling approaches (including the suggested one) without success; other users pointed to the related PR #35085.
  - Point of contention: The suggested workaround did not take effect, possibly due to a version- or configuration-specific interaction, so the problem is more involved than a pure hardware limitation.
  - Current status: Still open; why the existing disable paths do not work needs further investigation.
🔥 Hot Topics and Trends
- Deepening AMD platform support: This cycle's activity shows the community moving from "basically works" toward performance optimization and feature parity, covering new quantization formats, compute-graph optimizations, CI/CD stability, and tuning options.
- PD disconnection and hybrid systems: Issue #35799 dissects the root cause of PD disconnections under NIXL 0.10.0 (UCX wrongly selecting RDMA over NVLink), while PR #35758 and #35760 add HMA (hybrid attention manager) support to the NIXL connector and speculative-decoding tests for PD disaggregation, marking this as a frontier of high-performance inference.
- Speculative decoding compatibility challenges: Issue #35800 and Issue #35704 (closed) reflect potential conflicts between speculative decoding and tool-call guided decoding, and with CUDA-graph memory planning, respectively, showing that cross-feature testing is essential when chasing peak speed.
- Qwen3.x support and optimization: Multiple issues (#35704, #35820, #35743) and PRs (#35739, #35777) revolve around Qwen3.5 deployment and performance, reflecting the series' popularity and vLLM's ongoing adaptation effort.
- General bug fixing and CI stability: A stream of issues covering CPU binding, LoRA loading, missing logs, and more, plus several CI-failure issues and matching fix PRs (e.g. #35798), show that keeping the test pipeline stable amid rapid iteration is a real challenge.
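The CI hardening from PR #35798 follows a common pytest wrapper pattern: exit code 5 means "no tests were collected", which a marker-filtered shard may legitimately produce. A minimal sketch; the wrapper name is illustrative, not the actual CI script:

```shell
# Run a test command, but treat pytest exit code 5 ("no tests collected")
# as success, since a marker-filtered shard may match zero tests.
run_pytest_lenient() {
    "$@"
    rc=$?
    if [ "$rc" -eq 5 ]; then
        rc=0
    fi
    return "$rc"
}
```

For example, `run_pytest_lenient pytest -m "core_model" tests/` exits 0 even when the marker selects no tests, while real failures still propagate.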
🛠️ Key Technical Changes
- PR #35733 & #35737 (NVFP4 emulation support): Defaults to the `EMULATION` backend on ROCm (and non-Blackwell CUDA), and fixes weight-scaling logic and CUDA graph capture issues, greatly widening hardware compatibility for NVFP4 models and benefiting the ecosystem.
- PR #35751 (merged, DeepSeekV2 QKVAProj custom op): Wraps the conditional GEMM path in `DeepSeekV2FusedQkvAProj.forward` as a custom op, eliminating graph breaks under torch.compile caused by data-dependent branching and improving compiled execution of DeepSeekV2 models.
- PR #35658 (merged, amd-quark dependency): A seemingly trivial dependency addition that unblocks ROCm users running MXFP4 and other quantized models, a key step toward feature completeness on AMD.
- Issue #35780 (RFC: remove per-block KV-transfer error handling): Proposes removing the complex per-block KV-transfer error-handling logic to cut maintenance burden and friction for new features. An important design discussion that affects the degradation strategy on transfer errors and long-term code health.
📈 Development Activity Observations
- Contributor activity: AMD-affiliated contributors (`-amd` suffixes) are very active, not only submitting code but also filing feature requests (e.g. tjtanaa), showing the AMD team's deep involvement in the project.
- Issue/PR responsiveness: The community responds quickly; #35704, for example, was closed within hours of the user's follow-up, and core contributors jumped on many bug reports promptly.
- Busy CI/CD pipeline: 37 PRs were merged while numerous CI failures were reported and tracked (e.g. #35769, #35783), indicating a period of intense integration and testing.
💡 Issues Worth Watching
- AMD CPU-specific segfault (Issue #35746): Though it occurs in CPU mode, it touches deep interactions between the AMD platform and Torch Inductor and may need joint investigation with the PyTorch team.
- FusedARRMS hang (Issue #35772): Affects multi-GPU performance on high-end GPUs (B200/H200); the suspected root cause (the `mnnvl` backend) needs to be confirmed quickly.
- NIXL 0.10.0 PD-disconnection regression (Issue #35799): A detailed technical analysis exposes a subtle change in how the underlying UCX library selects transports; such issues challenge the construction of stable, reliable PD systems.
- RFC: remove per-block KV-transfer error handling (Issue #35780): Community feedback is invited on a complexity-reduction proposal whose outcome will shape the system's error-recovery capability and code structure.
📋 Appendix: Detailed Data
New Issues
- #35823 [Bug]: branch v0.16.0 still rely on torch 2.9.1, not 2.10 — bug — by chamwen (created: 2026-03-03 11:23 (UTC+8))
- #35820 [Bug]: deploy Qwen3.5-27B error — bug — by ZTurboX (created: 2026-03-03 10:28 (UTC+8))
- #35780 [RFC]: Remove Per-Block KV Transfer Error Handling — RFC — by robertgshaw2-redhat (created: 2026-03-03 02:21 (UTC+8))
- #35789 [Bug]: CPU OMP thread autobind fails when number of local ranks <= NUMA nodes — bug — by shunzhiwen (created: 2026-03-03 03:08 (UTC+8))
- #35807 [Feature]: Performance Tuning of FlashInfer MLA Sparse — feature request — by benchislett (created: 2026-03-03 05:50 (UTC+8))
- #35805 [Feature]: FlashInfer Sparse MLA + FP8 KV Cache — feature request — by benchislett (created: 2026-03-03 05:40 (UTC+8))
- #35772 [Bug]: `FusedARRMS` Hang on startup during cudagraph capture TP>1 — bug,torch.compile — by benchislett (created: 2026-03-03 01:40 (UTC+8))
- #35769 [CI Failure]: mi325_1: Quantized Models Test — ci-failure — by AndreasKaratzas (created: 2026-03-03 00:50 (UTC+8))
- #35804 [Feature]: PRISM 153-key — Legitimacy Verification Layer for Model Selection Algorithm — feature request — by Mossaab-s (created: 2026-03-03 05:30 (UTC+8))
- #35799 [Bug] PD disagg via NixlConnector fails with NIXL 0.10.0 on B200 — bug — by ZhanqiuHu (created: 2026-03-03 04:49 (UTC+8))
- #35800 [Bug]: Enabling speculative coding causes malformed Tool Calls in Qwen 122B MXFP4 — bug — by mdierolf (created: 2026-03-03 04:56 (UTC+8))
- #35796 [Bug][DSV3.2]: Sparse Attention + DBO Crash — bug — by S1ro1 (created: 2026-03-03 04:38 (UTC+8))
- #35795 [Bug]: DSA + Dual batch overlap shape mismatch — bug — by S1ro1 (created: 2026-03-03 04:36 (UTC+8))
- #35779 [Bug]: Harmony models incorrectly drops prior-turn analysis channel in multi-turn conversations — bug — by stevewx (created: 2026-03-03 02:21 (UTC+8))
- #35783 [CI Failure]: mi355_8: V1 Test e2e + engine — ci-failure — by AndreasKaratzas (created: 2026-03-03 02:38 (UTC+8))
- #35792 [Feature]: Support MLA attention + quant fusions — feature request — by carlyou (created: 2026-03-03 03:37 (UTC+8))
- #35726 [Bug]: Vllm 0.16.0 version log missing input content — bug — by mia-wong1016 (created: 2026-03-02 17:26 (UTC+8))
- #35784 [CI Failure]: mi355_1: V1 Test entrypoints — ci-failure — by AndreasKaratzas (created: 2026-03-03 02:45 (UTC+8))
- #35778 [Bug]: Regression: terrible mixed prefill-decode performance with CUDA graphs enabled. — bug — by Zidrewndacht (created: 2026-03-03 02:03 (UTC+8))
- #35724 [Bug] H100 PCIe: RuntimeError ‘[SymmDeviceMemory] Device does not support multicasting’ when running Qwen3.5-122B with TP=2 — usage — by wallbreaker740 (created: 2026-03-02 16:50 (UTC+8))
- #35771 [RFC][torch.compile]: Disable Sequence Parallelism (SP) for piecewise compilation — RFC,torch.compile — by ProExpertProg (created: 2026-03-03 01:21 (UTC+8))
- #35767 [Enhancement]: Qwen3-ASR realtime endpoint produces degraded output — stateless segments, no cross-segment context, raw format leaks — no labels — by TheCodeWrangler (created: 2026-03-03 00:31 (UTC+8))
- #35766 [Bug]: aot_compile setting some aotautograd configs that change the cache key — bug,torch.compile — by zou3519 (created: 2026-03-03 00:27 (UTC+8))
- #35717 [Bug]: RunAI streamer breaks in 0.15.1 — bug — by Sanches166 (created: 2026-03-02 15:25 (UTC+8))
- #35746 [Bug]: Segfault at IP=0 during model warmup on AVX512_BF16 host (AMD 7940HS) — bug,cpu — by NetWilliam (created: 2026-03-02 21:13 (UTC+8))
- #35755 [Bug]: AsyncScheduler crashes with AssertionError during Realtime ASR streaming (num_output_placeholders underflow) — no labels — by TheCodeWrangler (created: 2026-03-02 22:54 (UTC+8))
- #35706 [Bug]: CUDA illegal memory access on H200 MiniMax-M2.5 — bug — by Oseltamivir (created: 2026-03-02 12:42 (UTC+8))
- #35743 [Bug]: Qwen 3.5 27B AWQ 4bit capturing CUDA graph fails — bug — by dule1322 (created: 2026-03-02 20:39 (UTC+8))
- #35704 [Bug]: Cannot start the Qwen3.5-27B-FP8 model with error — bug — by cyysky (created: 2026-03-02 11:30 (UTC+8))
- #35734 [Bug]: LoRA loading fails for modules with numeric indices (e.g., to_out.0 in Diffusion Transformers) — bug — by Wang-Shengyuan (created: 2026-03-02 19:18 (UTC+8))
- #35730 [Feature]: Load Mistral format LoRA when `--load-format=mistral` and `--enable-lora` — feature request — by harshil-shah (created: 2026-03-02 18:50 (UTC+8))
- #35729 [Bug]: Qwen3.5 does not work with pipeline parallelism — bug — by cherrymorning (created: 2026-03-02 18:22 (UTC+8))
- #35718 [Bug]: Garbled / gibberish output after serving Kimi-K2.5 with vLLM on 8×H200 (INT4) for some time — bug — by momaek (created: 2026-03-02 15:37 (UTC+8))
- #35725 [Bug]: .venv/lib/python3.10/site-packages/flashinfer/data/include/flashinfer/flat/hopper/collective/flat_collective_store.hpp:18:10: fatal error: cuda/ptx: No such file or directory (EngineCore_DP0 pid=1553337) 18 #include <cuda/ptx> — bug — by ymshuang (created: 2026-03-02 16:54 (UTC+8))
- #35713 [Feature] [ROCm]: Enable AITER `fused_allreduce_rmsnorm_quant` — feature request,rocm — by tjtanaa (created: 2026-03-02 14:39 (UTC+8))
- #35712 [Feature] [ROCm]: Enable AITER `fused_allreduce_rmsnorm` — feature request,rocm — by tjtanaa (created: 2026-03-02 14:37 (UTC+8))
- #35708 [Bug]: vLLM-compile warm start appears to be saving artifacts — bug,torch.compile — by zou3519 (created: 2026-03-02 12:50 (UTC+8))
- #35705 [Bug]: streaming mode+finish_reason length, delta content not empty with finish_reason length — bug,rocm — by ajiang17 (created: 2026-03-02 11:59 (UTC+8))
Closed Issues
- #35462 [Installation]: branch v0.16.0 still rely on torch 2.9.1, not 2.10 — installation — by chamwen (closed: 2026-03-03 11:23 (UTC+8))
- #33882 [Feature]: Improve sparse embedding pooling output format for better efficiency and usability — feature request — by staugust (closed: 2026-03-03 10:50 (UTC+8))
- #10971 [Bug]: Vllm CPU mode only takes 1 single core for multi-core cpu — bug,stale — by fzyzcjy (closed: 2026-03-03 10:19 (UTC+8))
- #11873 [Bug]: Engine is gracefully shutting down — bug,stale — by Bryce1010 (closed: 2026-03-03 10:19 (UTC+8))
- #21030 [Feature]: The inference accelation for quantized qwen2.5vl — feature request,stale — by WanianXO (closed: 2026-03-03 10:19 (UTC+8))
- #22331 [Bug]: vllm/vllm-openai:gptoss AssertionError: Sinks are only supported in FlashAttention 3 (4090 48gb) — bug,stale — by vakovalskii (closed: 2026-03-03 10:18 (UTC+8))
- #24921 [Bug]: v0.10.2, Qwen3-30B-A3B-NVFP4 MOE model on 5090, sm_120 hardware, `no cutlass_scaled_mm kernel` — bug,stale — by MrVolts (closed: 2026-03-03 10:18 (UTC+8))
- #25022 [Feature]: Request support for AWQ quantization on GPUs with CUDA Compute Capability < 8.0 using Compressed Tensors Quantization. — feature request,stale — by BUJIDAOVS (closed: 2026-03-03 10:18 (UTC+8))
- #27934 V1 Engine: Memory allocation failures and crashes with 7B-13B models on RTX 3060 12GB — stale — by m0nk111 (closed: 2026-03-03 10:17 (UTC+8))
- #27949 [Usage]: How do I deploy GGUF models with vLLM via Docker correct? — usage,stale — by alpha754293 (closed: 2026-03-03 10:17 (UTC+8))
- #35795 [Bug]: DSA + Dual batch overlap shape mismatch — bug — by S1ro1 (closed: 2026-03-03 04:37 (UTC+8))
- #29608 [Bug]: CUDA Graph Replay Skips KV Transfer Synchronization in Full Cache Hit Scenarios — bug,stale — by xiaguan (closed: 2026-03-03 04:04 (UTC+8))
- #35726 [Bug]: Vllm 0.16.0 version log missing input content — bug — by mia-wong1016 (closed: 2026-03-03 03:16 (UTC+8))
- #35633 [Bug]: parity with cuda: rocm image missing amd quark kimi k2.5 mxfp4 — bug,rocm — by functionstackx (closed: 2026-03-02 14:02 (UTC+8))
- #29023 [Feature]: Disable logging `/metrics` — help wanted,good first issue,feature request,stale — by robertgshaw2-redhat (closed: 2026-03-02 23:01 (UTC+8))
- #35412 [Bug]: Qwen3-VL-Reranker produces completely wrong relevance scores compared to native Transformers — bug — by xl2014 (closed: 2026-03-02 19:45 (UTC+8))
- #35704 [Bug]: Cannot start the Qwen3.5-27B-FP8 model with error — bug — by cyysky (closed: 2026-03-02 19:23 (UTC+8))
- #35729 [Bug]: Qwen3.5 does not work with pipeline parallelism — bug — by cherrymorning (closed: 2026-03-02 18:25 (UTC+8))
- #35599 [Bug]: cpu version compile failed in 0.16.0 — bug — by wanghualex1-wq (closed: 2026-03-02 17:50 (UTC+8))
- #29760 [Bug]: perf regression for qwen3-0.6b — bug,stale — by BoyuanFeng (closed: 2026-03-02 14:08 (UTC+8))
- #35702 [Bug]: Qwen3.5-FP8 Crashes VLLM — bug — by darsh12 (closed: 2026-03-02 12:27 (UTC+8))
New PRs
- #35754 [Bugfix] Avoid src/dst as None in irecv/isend_tensor_dict — bug,ready,ci/build,cpu — by bigPYJ1151 (created: 2026-03-02 22:52 (UTC+8))
- #35822 [CI/Build] Automatically patch video metadata for multimodal processor test — ready,multi-modality — by Isotr0py (created: 2026-03-03 10:45 (UTC+8))
- #35774 [Model Runner V2] Use ModelState.prepare_attn() for cuda graph capture [5/N] — v1,nvidia — by WoosukKwon (created: 2026-03-03 01:54 (UTC+8))
- #35727 [model] support FireRedASR2 — documentation,new-model — by AllenDou (created: 2026-03-02 17:39 (UTC+8))
- #35821 [V0 deprecation] Remove Swin model — ready — by Isotr0py (created: 2026-03-03 10:30 (UTC+8))
- #35811 [BUG] Fix async rlhf tests — bug,documentation,ready,ci/build,v1 — by hao-aaron (created: 2026-03-03 06:50 (UTC+8))
- #35764 [Feat][NIXL] Add KV lease refresh mechanism for disaggregated prefill — frontend,v1,kv-connector — by robertgshaw2-redhat (created: 2026-03-03 00:11 (UTC+8))
- #35740 feat(responses): stateless multi-turn via encrypted_content state carrier (RFC #26934) — frontend — by will-deines (created: 2026-03-02 20:07 (UTC+8))
- #35819 [Docs][Model Runner V2] Add Design Docs — documentation — by WoosukKwon (created: 2026-03-03 10:28 (UTC+8))
- #35818 Revert “[CI/Build] Enable Qwen3.5 tests on CI” (#35763) — qwen — by zhewenl (created: 2026-03-03 10:06 (UTC+8))
- #35817 Feat/scaled fp4 quant use functional — no labels — by tianrengao (created: 2026-03-03 10:00 (UTC+8))
- #35762 fix: raise ValueError when `--kv-cache-dtype` conflicts with checkpoint kv_cache_quant_algo — documentation,performance,ci/build,v1,multi-modality,qwen,kv-connector,nvidia — by infektyd (created: 2026-03-03 00:00 (UTC+8))
- #35763 [CI/Build] Enable Qwen3.5 tests on CI — ready,qwen — by Isotr0py (created: 2026-03-03 00:04 (UTC+8))
- #35770 [CI] Temporarily Disable Nightly Failures — ready,ci/build — by robertgshaw2-redhat (created: 2026-03-03 00:53 (UTC+8))
- #35716 [Bugfix] Qwen3Coder streaming tool call JSON missing opening brace in arguments — bug,needs-rebase,qwen — by KrxGu (created: 2026-03-02 15:10 (UTC+8))
- #35816 [CI] Fix `AMD: V1 Others (mi325_1)` Amd CI bug — bug,rocm,ready,ci/build — by yewentao256 (created: 2026-03-03 09:06 (UTC+8))
- #35710 [ROCm][CI] Support async weight transfer example with platform-aware determinism — documentation,rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-02 13:59 (UTC+8))
- #35781 [Perf] Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement — ready,v1,kv-connector — by yewentao256 (created: 2026-03-03 02:30 (UTC+8))
- #35815 [Bugfix] Fix CPU OMP autobind assertion to use local_world_size — bug,v1,cpu — by OiPunk (created: 2026-03-03 08:38 (UTC+8))
- #35814 Revert “[Attention] FA4 integration” (#32974) — documentation,ci/build,v1,nvidia — by zhewenl (created: 2026-03-03 08:32 (UTC+8))
- #35735 [Model] Add support for nvidia/llama-nemotron-rerank-vl-1b-v2 — documentation,new-model,ready,multi-modality,llama,nvidia — by jzakrzew (created: 2026-03-02 19:20 (UTC+8))
- #35813 [Bugfix] Fix misnamed parameter in compressed_tensors_moe.py — bug,ready — by bnellnm (created: 2026-03-03 07:43 (UTC+8))
- #35790 [Model Runner V2] Add WhisperModelState [6/N] — v1,nvidia — by WoosukKwon (created: 2026-03-03 03:12 (UTC+8))
- #35798 [ROCm][CI] Fix backslash-continuation in pytest marker re-quoting and treat exit code 5 as success — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-03 04:45 (UTC+8))
- #35793 [All Reduce] Change default backend of Flashinfer All Reduce to trtllm — ready — by hjjq (created: 2026-03-03 03:52 (UTC+8))
- #35794 [Perf] Optimize FusedMoEModularKernel output tensor using torch.empty — no labels — by xyang16 (created: 2026-03-03 04:16 (UTC+8))
- #35759 [Bugfix] Fix AsyncScheduler crash during streaming input with chunked prefill — bug,v1 — by TheCodeWrangler (created: 2026-03-02 23:29 (UTC+8))
- #35749 [Docs] Add breadcrumbs for better UX — ready — by hmellor (created: 2026-03-02 21:59 (UTC+8))
- #35812 FusedMoEWithLoRA EP support — no labels — by Jackmin801 (created: 2026-03-03 07:13 (UTC+8))
- #35806 [ROCm][CI] Fix Assertion Logic For `test_gpt_oss` — rocm,gpt-oss — by micah-wil (created: 2026-03-03 05:49 (UTC+8))
- #35797 [Bugfix] Fix MM processor test for Qwen3.5 — bug,ready,multi-modality,qwen — by ywang96 (created: 2026-03-03 04:42 (UTC+8))
- #35751 [MoE][Perf] Wrap DSV3 QKVAProj GEMM in custom op for torch.compile — ready,deepseek — by robertgshaw2-redhat (created: 2026-03-02 22:31 (UTC+8))
- #35810 [compile] Consistent compiler config for saved/loaded vllm backends. — no labels — by zhxchen17 (created: 2026-03-03 06:33 (UTC+8))
- #35809 [Models] Cohere ASR — documentation,new-model,frontend,v1,qwen — by ekagra-ranjan (created: 2026-03-03 06:10 (UTC+8))
- #35808 [Kernel] Add FP8 block-quantized MoE config for E=256,N=512 on H100 — no labels — by marklubin (created: 2026-03-03 05:59 (UTC+8))
- #35803 [Bugfix] Fix SM121 (DGX Spark) exclusion from Marlin/CUTLASS FP8 paths — bug,nvidia — by blake-snc (created: 2026-03-03 05:20 (UTC+8))
- #35801 lora: support loading Mistral-format LoRA weights via hf_to_vllm_mapper — no labels — by Godmook (created: 2026-03-03 05:13 (UTC+8))
- #35802 Fix: DBO + DSA shape mismatch — no labels — by S1ro1 (created: 2026-03-03 05:19 (UTC+8))
- #35741 [Bugfix] Fix missing sequence_lengths in qwen3_omni_moe_thinker — bug,ready,qwen — by yeqcharlotte (created: 2026-03-02 20:16 (UTC+8))
- #35787 [ROCm] Optimize gfx arch parsing for alpha stepping and guard env sync — rocm — by AndreasKaratzas (created: 2026-03-03 03:03 (UTC+8))
- #35757 Fix unclean shutdown from ctrl-c with AR Fusion — nvidia — by infektyd (created: 2026-03-02 23:21 (UTC+8))
- #35788 [BUG] Fix rlhf_async example — bug,documentation,ready — by hao-aaron (created: 2026-03-03 03:06 (UTC+8))
- #35791 [Bugfix][RoCM] GPT-OSS + Expert Parallel — bug,rocm,gpt-oss — by varun-sundar-rabindranath (created: 2026-03-03 03:31 (UTC+8))
- #35786 Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) — rocm,v1 — by Rohan138 (created: 2026-03-03 03:01 (UTC+8))
- #35777 [Kernel] Add fused_sigmoid_gating_delta_rule_update kernel for Qwen3 Next — qwen — by xyang16 (created: 2026-03-03 02:00 (UTC+8))
- #35782 [MoE Refactor] Remove SharedFusedMoE class — needs-rebase,v1,llama,qwen,deepseek,nvidia — by bnellnm (created: 2026-03-03 02:32 (UTC+8))
- #35773 [BugFix] Fix cmake based incremental install (wrong vllm install dir) — bug,ready,ci/build — by LucasWilkinson (created: 2026-03-03 01:52 (UTC+8))
- #35709 [torch.compile] Improve cold and warm start compile tests — ready — by zou3519 (created: 2026-03-02 12:53 (UTC+8))
- #35785 Segmented spans — documentation,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,kv-connector — by almogtavor (created: 2026-03-03 03:00 (UTC+8))
- #35768 [Bug] Fix FA install directory in cmake — bug,ready,ci/build — by yewentao256 (created: 2026-03-03 00:35 (UTC+8))
- #35753 [Mamba] Add stochastic rounding support — ci/build — by roikoren755 (created: 2026-03-02 22:46 (UTC+8))
- #35776 [Misc] Use envs module to get VLLM_DISABLED_KERNELS — no labels — by hickeyma (created: 2026-03-03 01:58 (UTC+8))
- #35750 Prefix caching updates — documentation,performance,structured-output,needs-rebase,ci/build,v1 — by viktorpusTT (created: 2026-03-02 22:20 (UTC+8))
- #35775 [CI] Add explicit permissions to macOS smoke test workflow — ci/build — by russellb (created: 2026-03-03 01:56 (UTC+8))
- #35742 [CI] Fix mypy for vllm/reasoning — documentation,deepseek — by hickeyma (created: 2026-03-02 20:28 (UTC+8))
- #35765 AITER MLA backend: Avoid CPU sync in _build_decode — rocm,v1 — by pschlan-amd (created: 2026-03-03 00:16 (UTC+8))
- #35737 [NVFP4] Support NVFP4 MOE models on AMD Instinct, Nvidia Ampere, Hopper through NVFP4 MOE emulation — rocm,nvidia — by fxmarty-amd (created: 2026-03-02 19:38 (UTC+8))
- #35760 [Test][WIP] Add PD disagg + SD acceptance tests — v1,kv-connector — by ZhanqiuHu (created: 2026-03-02 23:51 (UTC+8))
- #35723 [Doc] Improve UX of `--enable-log-requests` — performance,frontend,ready — by DarkLight1337 (created: 2026-03-02 16:45 (UTC+8))
- #35758 [Core][KVConnector] Support HMA+NixlConnector — v1,kv-connector — by NickLucche (created: 2026-03-02 23:23 (UTC+8))
- #35761 Fix import order violating ruff rules — no labels — by pschlan-amd (created: 2026-03-02 23:57 (UTC+8))
- #35731 AITER MLA backend: Fix CPU sync in _build_decode — rocm,v1 — by pschlan-amd (created: 2026-03-02 18:51 (UTC+8))
- #35738 [Misc] Add `--attention-backend auto` option — ready — by NickLucche (created: 2026-03-02 19:54 (UTC+8))
- #35711 [Bugfix] Guard mm_token_type_ids kwarg in get_mrope_input_positions — bug — by AndreasKaratzas (created: 2026-03-02 14:29 (UTC+8))
- #35756 Split generic IO Processor plugins tests from Terratorch specific ones — ci/build — by christian-pinto (created: 2026-03-02 22:55 (UTC+8))
- #35721 [LoRA] Support dual CUDA streams — nvidia — by jeejeelee (created: 2026-03-02 16:10 (UTC+8))
- #35752 [NIXL] Refactor `kernel_block_size` detection — v1,kv-connector — by NickLucche (created: 2026-03-02 22:43 (UTC+8))
- #35748 Fix issue #35743 — v1 — by xueliangyang-oeuler (created: 2026-03-02 21:39 (UTC+8))
- #35732 [Bugfix]: Fix LoRA loading failure for modules with numeric indices (e.g., to_out.0 in Diffusion Transformers) — bug — by Wang-Shengyuan (created: 2026-03-02 19:09 (UTC+8))
- #35747 add linear method — no labels — by xmpp777 (created: 2026-03-02 21:22 (UTC+8))
- #35745 [Performance] Add is_reasoning_end_streaming() override to GptOssReasoningParser — gpt-oss — by fergusfinn (created: 2026-03-02 20:57 (UTC+8))
- #35739 [core] add gdn packed decode path — qwen — by caozuoba (created: 2026-03-02 20:01 (UTC+8))
- #35744 Fix routed experts capture for hybrid models (Mamba + Attention) — v1 — by xhx1022 (created: 2026-03-02 20:55 (UTC+8))
- #35736 Fix Ray compiled-DAG SHM channel stalls by detaching zero-copy `np.ndarray` logprobs buffers — v1 — by JeanPaulShapo (created: 2026-03-02 19:33 (UTC+8))
- #35719 [ROCm][Perf] Enable `sparse_mla`’s cudagraph on ROCm platform — rocm,v1,nvidia — by ganyi1996ppo (created: 2026-03-02 15:45 (UTC+8))
- #35733 [NVFP4] Support NVFP4 dense models from `modelopt` and `compressed-tensors` on AMD Instinct GPUs, on Hopper, Ampere, etc. through emulation — rocm — by fxmarty-amd (created: 2026-03-02 19:13 (UTC+8))
- #35728 Merge pull request #1 from vllm-project/main — no labels — by xueliangyang-oeuler (created: 2026-03-02 18:12 (UTC+8))
- #35715 [Misc] Cleanup useless `current_platform` import — ready,v1,nvidia — by wangxiyuan (created: 2026-03-02 14:59 (UTC+8))
- #35722 Hanlin12 hip online tuning — documentation,frontend — by hanlin12-AMD (created: 2026-03-02 16:14 (UTC+8))
- #35720 [Bugfix] fix kv cache fp8 crash on sm<89 — bug,nvidia — by fo40225 (created: 2026-03-02 15:46 (UTC+8))
- #35714 Support multimodal speculative decoding in non-parallel draft_model mode — speculative-decoding,v1 — by EanWang211123 (created: 2026-03-02 14:46 (UTC+8))
- #35707 scaffold logic which ensures that conditionally imported deps exist — rocm — by wjabbour (created: 2026-03-02 12:46 (UTC+8))
Merged PRs
- #35763 [CI/Build] Enable Qwen3.5 tests on CI — ready,qwen — by Isotr0py (merged: 2026-03-03 01:43 (UTC+8))
- #35770 [CI] Temporarily Disable Nightly Failures — ready,ci/build — by robertgshaw2-redhat (merged: 2026-03-03 09:49 (UTC+8))
- #35615 [Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode — frontend,ready,qwen — by voipmonitor (merged: 2026-03-03 09:40 (UTC+8))
- #35376 [Model Runner V2][Perf] align dummy_run tokens to uniform decode for dp cudagraph — v1,nvidia — by izhuhaoran (merged: 2026-03-03 09:10 (UTC+8))
- #35270 [XPU][NIXL] Add GPUDirect RDMA support for XPU — ready,ci/build,kv-connector — by zhenwei-intel (merged: 2026-03-03 08:42 (UTC+8))
- #35735 [Model] Add support for nvidia/llama-nemotron-rerank-vl-1b-v2 — documentation,new-model,ready,multi-modality,llama,nvidia — by jzakrzew (merged: 2026-03-03 08:32 (UTC+8))
- #35798 [ROCm][CI] Fix backslash-continuation in pytest marker re-quoting and treat exit code 5 as success — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-03 08:07 (UTC+8))
- #35793 [All Reduce] Change default backend of Flashinfer All Reduce to trtllm — ready — by hjjq (merged: 2026-03-03 07:57 (UTC+8))
- #35749 [Docs] Add breadcrumbs for better UX — ready — by hmellor (merged: 2026-03-02 22:31 (UTC+8))
- #35797 [Bugfix] Fix MM processor test for Qwen3.5 — bug,ready,multi-modality,qwen — by ywang96 (merged: 2026-03-03 07:05 (UTC+8))
- #35751 [MoE][Perf] Wrap DSV3 QKVAProj GEMM in custom op for torch.compile — ready,deepseek — by robertgshaw2-redhat (merged: 2026-03-03 07:03 (UTC+8))
- #35552 clean unused cudagraph_batch_sizes — ready,v1,nvidia — by BoyuanFeng (merged: 2026-03-03 06:00 (UTC+8))
- #35741 [Bugfix] Fix missing sequence_lengths in qwen3_omni_moe_thinker — bug,ready,qwen — by yeqcharlotte (merged: 2026-03-03 05:11 (UTC+8))
- #35788 [BUG] Fix rlhf_async example — bug,documentation,ready — by hao-aaron (merged: 2026-03-03 04:36 (UTC+8))
- #34672 [ci] Add Ray compatibility check informational CI job — ready,ci/build — by jeffreywang-anyscale (merged: 2026-03-03 04:06 (UTC+8))
- #31057 [KVConnector] Auto-downgrade to PIECEWISE cudagraph mode for layerwise async ops — ready,kv-connector,nvidia — by yashwantbezawada (merged: 2026-03-03 04:04 (UTC+8))
- #33736 [Spec Decode] Add hidden states extraction system — documentation,new-model,speculative-decoding,ready,v1,llama,kv-connector — by fynnsu (merged: 2026-03-03 03:29 (UTC+8))
- #35709 [torch.compile] Improve cold and warm start compile tests — ready — by zou3519 (merged: 2026-03-03 03:27 (UTC+8))
- #35587 [BugFix][Model] Fix the garbled code in Ernie4.5-VL caused by fast_moe_cold_start — bug,ready — by CSWYF3634076 (merged: 2026-03-03 02:56 (UTC+8))
- #35307 [CI][HPU] Pin vllm commit compatible with vllm-gaudi - HPU tests — ready,ci/build — by PatrykWo (merged: 2026-03-03 01:02 (UTC+8))
- #35580 [CI] Defining extended V1 e2e + engine tests — rocm,ready,ci/build,v1 — by AndreasKaratzas (merged: 2026-03-02 16:10 (UTC+8))
- #35152 [ROCm][CI] Disable skinny GEMMs in language model standard tests to fix non-determinism — rocm,ready — by AndreasKaratzas (merged: 2026-03-02 15:04 (UTC+8))
- #35723 [Doc] Improve UX of `--enable-log-requests` — performance,frontend,ready — by DarkLight1337 (merged: 2026-03-03 00:24 (UTC+8))
- #35672 [Core] Move test utility to test file — ready,gpt-oss — by wjabbour (merged: 2026-03-02 23:56 (UTC+8))
- #35518 [CI] Fix mypy for vllm/device allocator — ready,v1 — by hickeyma (merged: 2026-03-02 23:53 (UTC+8))
- #34169 [CPU][Distributed] Fix Enable _CPUSHMDistributed only when TP/PP ranks share the same SHM group name — ready,v1,cpu — by charlesashby (merged: 2026-03-02 17:34 (UTC+8))
- #34627 [Performance] Extract kv update ops from MLA attention backends — ready,v1 — by ElizaWszola (merged: 2026-03-02 23:43 (UTC+8))
- #34119 [Fix Bug] `num_active_loras` always equals to zero — bug,ready,v1,gpt-oss — by RunkaiTao (merged: 2026-03-02 23:17 (UTC+8))
- #35505 [MyPy][BugFix] Check profiler is assigned before calling start() on it — bug,ready,v1 — by hickeyma (merged: 2026-03-02 21:23 (UTC+8))
- #35681 Fix unresolved-import errors when using Astral’s ty by removing src.root — ready — by tlrmchlsmth (merged: 2026-03-02 18:26 (UTC+8))
- #35588 [Feat] Supports Anthropic Messages count_tokens API — frontend,ready — by chaunceyjiang (merged: 2026-03-02 17:48 (UTC+8))
- #35715 [Misc] Cleanup useless `current_platform` import — ready,v1,nvidia — by wangxiyuan (merged: 2026-03-02 17:36 (UTC+8))
- #35495 [Misc] Bound NIXL upper bound version — ready,ci/build,kv-connector — by NickLucche (merged: 2026-03-02 16:57 (UTC+8))
- #34750 [Rocm][CI] Fix LM Eval Large Models (H100) test group — rocm,ready,ci/build — by charlifu (merged: 2026-03-02 15:43 (UTC+8))
- #34448 [Kernel] Integrate SM100 MXFP8 blockscaled grouped MM and quant kernels — ready,ci/build,nvidia — by EdalatiAli (merged: 2026-03-02 15:31 (UTC+8))
- #35658 [ROCm] add amd-quark package in requirements for rocm to use quantized models — rocm,ready,ci/build — by hongxiayang (merged: 2026-03-02 14:02 (UTC+8))
- #35691 [XPU] fix mxfp4 activation type — ready — by jikunshang (merged: 2026-03-02 11:48 (UTC+8))
PRs Closed Without Merging
- #19028 chore: add an alternative Ubuntu software source to speedup docker image building — documentation,ready,ci/build,stale — by acelyc111 (closed: 2026-03-03 10:19 (UTC+8))
- #22523 [Bench] Add Triton NVFP4 GEMM — performance,stale — by phuhung273 (closed: 2026-03-03 10:18 (UTC+8))
- #27539 [Core] Prefix cache: frequency- and cost-aware eviction (opt-in) — ci/build,stale,v1 — by Aminsed (closed: 2026-03-03 10:17 (UTC+8))
- #33609 [Feature][Speculative Decoding] Consolidate EAGLE input preparation — speculative-decoding,needs-rebase,v1 — by jaewshin (closed: 2026-03-03 09:04 (UTC+8))
- #35814 Revert “[Attention] FA4 integration” (#32974) — documentation,ci/build,v1,nvidia — by zhewenl (closed: 2026-03-03 08:36 (UTC+8))
- #34023 [Bugfix] Fix RAW hazard and optimize stores in EP Scatter Kernel — bug — by Manikvsin (closed: 2026-03-03 08:06 (UTC+8))
- #35812 FusedMoEWithLoRA EP support — no labels — by Jackmin801 (closed: 2026-03-03 07:14 (UTC+8))
- #32377 [Docker] Remove CUDA compatibility library loading; fixes #32373 — ci/build,nvidia — by wangshangsam (closed: 2026-03-03 06:28 (UTC+8))
- #32960 [compile][cuda_graph] Add sym_size handling by folding them to constant — needs-rebase,nvidia — by fxdawnn (closed: 2026-03-03 05:39 (UTC+8))
- #31239 [Feature] Add logprobs support for Whisper transcription API — documentation,frontend,needs-rebase,v1 — by TheCodeWrangler (closed: 2026-03-03 03:56 (UTC+8))
- #35785 Segmented spans — documentation,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,kv-connector — by almogtavor (closed: 2026-03-03 03:00 (UTC+8))
- #35768 [Bug] Fix FA install directory in cmake — bug,ready,ci/build — by yewentao256 (closed: 2026-03-03 02:42 (UTC+8))
- #30010 [Frontend] Improves Anthropic API compatibility — frontend,ready — by bbartels (closed: 2026-03-03 02:37 (UTC+8))
- #29376 Fix LoRA compatibility with quantized MoE models — no labels — by atoniolo76 (closed: 2026-03-03 02:07 (UTC+8))
- #35750 Prefix caching updates — documentation,performance,structured-output,needs-rebase,ci/build,v1 — by viktorpusTT (closed: 2026-03-03 01:58 (UTC+8))
- #23184 ON HOLD - [Core] Lazy/Delayed CUDA graph — needs-rebase,stale,v1,nvidia — by diegocastanibm (closed: 2026-03-03 00:38 (UTC+8))
- #35761 Fix import order violating ruff rules — no labels — by pschlan-amd (closed: 2026-03-03 00:10 (UTC+8))
- #35731 AITER MLA backend: Fix CPU sync in _build_decode — rocm,v1 — by pschlan-amd (closed: 2026-03-03 00:09 (UTC+8))
- #35661 interleave mm strings via request — frontend — by netanel-haber (closed: 2026-03-03 00:06 (UTC+8))
- #35341 [Bug] FA2 is not supported for NVIDIA Blackwell architecture — bug,v1,nvidia — by olka (closed: 2026-03-02 23:16 (UTC+8))
- #34830 [Feature] Add `--disable-uvicorn-metrics-access-log` shorthand — frontend — by miloudbelarebia (closed: 2026-03-02 23:07 (UTC+8))
- #35559 [MoE Refactor] Turn ChunkingMoERunner into a wrapper so it can be used with any MoERunner subclass. — needs-rebase,nvidia — by bnellnm (closed: 2026-03-02 21:31 (UTC+8))
- #34714 [Build] Include OpenTelemetry in release Docker images — ci/build,cpu — by codefromthecrypt (closed: 2026-03-02 20:20 (UTC+8))
- #35728 Merge pull request #1 from vllm-project/main — no labels — by xueliangyang-oeuler (closed: 2026-03-02 18:15 (UTC+8))
- #33514 [Bugfix] Fix gpt-oss chat format mismatch with HuggingFace — bug,frontend,gpt-oss — by thjung123 (closed: 2026-03-02 17:36 (UTC+8))
- #27757 [UX] Include NVTX in cuda.txt — needs-rebase,ci/build,stale,nvidia — by jeejeelee (closed: 2026-03-02 17:26 (UTC+8))
- #31698 [Kernel] Add triton silu_and_mul in fused_moe — performance — by jeejeelee (closed: 2026-03-02 17:12 (UTC+8))
- #35482 Fix AttributeError in RMSNormGated by adding activation attribute and… — needs-rebase,qwen — by xueliangyang-oeuler (closed: 2026-03-02 16:59 (UTC+8))
- #35699 [CLOSED] Support multimodal speculative decoding in non-parallel `draft_model` mode — documentation,performance,new-model,rocm,frontend,speculative-decoding,needs-rebase,ci/build,v1,multi-modality — by EanWang211123 (closed: 2026-03-02 14:29 (UTC+8))