vLLM Development Digest - 2026-03-02
Time window: 2026-03-02 11:29 (UTC+8) ~ 2026-03-03 11:29 (UTC+8)
Stats: 38 new issues | 21 issues closed | 82 new PRs | 37 PRs merged | 29 PRs closed without merging
📊 Daily Development Summary
vLLM development remained highly active over this 24-hour window, with a large volume of new and resolved issues and PRs. Work concentrated on AMD ecosystem support (e.g. NVFP4 model emulation, ROCm CI fixes) and core performance and stability (e.g. disaggregated prefill/decode (PD) systems, attention-backend fusion, broader model support). The community also held in-depth discussions on several high-profile bugs (PD disconnection, speculative decoding compatibility) and design decisions (KV-transfer error handling).
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was very high this cycle, spanning feature development, bug fixes, and CI improvements.
1. New Features and Extended Support
- PR #35733 & #35737 (by fxmarty-amd): Core work enabling NVFP4 dense and MoE models to run via emulation on AMD Instinct and non-Blackwell NVIDIA GPUs (e.g. Ampere, Hopper). This unblocks models such as nvidia/Qwen3-8B-NVFP4 on non-target hardware, which matters for research and cross-platform deployment.
- PR #35765 (by pschlan-amd): Optimizes the AITER MLA backend, using a Triton kernel to avoid a CPU sync in `_build_decode` and improve MLA decode performance on ROCm.
- PR #35722 (by hanlin12-AMD): Adds a `--hip_online_tuning` option to the vLLM server enabling hipBLASLt online GEMM tuning, to extract further performance from AMD GPUs.
- PR #35786 (by Rohan138): Enables the RoPE+KV-cache fusion pass for the ROCm AITER Flash Attention backend under the non-shuffle KV-cache layout, improving attention efficiency.
- Issues #35713 & #35712 (by tjtanaa): Feature requests to enable AITER's `fused_allreduce_rmsnorm_quant` and `fused_allreduce_rmsnorm` fusion passes on ROCm, rounding out the distributed fusion optimizations.
2. Bug Fixes and Compatibility
- PR #35658 (merged): Adds the amd-quark package to the ROCm Dockerfile and requirements, fixing the inability of ROCm images to run MXFP4 and other quantized models (e.g. amd/Kimi-K2.5-MXFP4).
- PR #35791 (by varun-sundar-rabindranath): Fixes a segfault caused by bit-matrix construction logic when running GPT-OSS with Expert Parallel on ROCm, improving the stability of complex MoE models on AMD.
- PR #35787 (by AndreasKaratzas): Improves ROCm's GCN architecture parsing to support alpha steppings (e.g. gfx90a) and guards environment-variable sync calls.
- PR #35798 (merged) & PR #35816: Fix a pytest-marker parsing error in the ROCm CI scripts caused by backslash line continuations, and treat pytest exit code 5 (no tests collected) as success, making CI more robust.
- PR #35806 (by micah-wil): Fixes the GPT-OSS quantization test assertion logic exposed by the amd-quark addition, allowing measured accuracy to exceed the expected value.
3. Testing and Validation
- PR #35710 (by AndreasKaratzas): Enables the async weight-transfer example on ROCm by applying platform-specific determinism settings (e.g. selecting the TRITON_ATTN backend, disabling BATCH_INVARIANT), and relaxes the validation pass rate to 90% to cope with current non-determinism.
💬 High-Activity Discussion Analysis
- Issue #35772: `FusedARRMS` hangs during CUDA graph capture at startup with TP>1
  - Core issue: On multi-GPU (TP>1) setups with the `fuse_allreduce_rms` fusion pass enabled, devices such as B200/H200 hang during CUDA graph capture.
  - Positions:
    - Reporter (benchislett): Provided a reproduction command and a temporary workaround: disable the fusion via `--compilation-config.pass_config.fuse_allreduce_rms false`.
    - Investigator (hjjq): Confirmed the problem is related to PR #34109, noted that setting the environment variable `VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm` is an effective single-node workaround, and suspects the `mnnvl` backend is at fault.
    - Decision (ProExpertProg): Suggested simply submitting a PR to make `trtllm` the default backend.
  - Current status: PR #35793 has been opened to switch the default backend to `trtllm`; a fix is underway.
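The two mitigations reported in the thread can be written out as launch variants. The model name and parallel size are placeholders; the flag and environment-variable names follow what is quoted in the issue:

```shell
# Workaround 1 (benchislett): disable the fusion pass entirely
vllm serve <model> --tensor-parallel-size 2 \
    --compilation-config.pass_config.fuse_allreduce_rms false

# Workaround 2 (hjjq, single node): force the trtllm allreduce backend
VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm \
    vllm serve <model> --tensor-parallel-size 2
```

Workaround 2 keeps the fusion pass enabled, so it is preferable where it applies; PR #35793 makes it the default.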
- Issue #35746: Segfault during model warmup on an AMD host with AVX512_BF16
  - Core issue: A user running vLLM on an AMD 7940HS CPU hits a segfault, suspected to involve Torch Inductor code generation or a library dependency.
  - Positions:
    - Reporter (NetWilliam): Provided detailed error output and the generated `cpp_fused__softmax_0` code, and offered to help with a fix.
    - Helper (bigPYJ1151): Ruled out dynamic-library issues, suspects `torch.compile`, and suggested verifying with the `--enforce-eager` flag.
    - Verification: The reporter confirmed `--enforce-eager` lets the server start, but it still crashes when a request arrives, pointing further at the generated code.
  - Current status: Still open; the root cause is unknown and may be a code-generation defect under this CPU's instruction set.
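The isolation flow used in the thread is a generally useful debugging recipe; sketched below with a placeholder model name (the endpoint and payload follow vLLM's standard OpenAI-compatible server API):

```shell
# Step 1: start with torch.compile disabled; if startup now succeeds,
# compilation is implicated in the startup-time crash
vllm serve <model> --enforce-eager

# Step 2: send a minimal request; in this issue the crash still fired here,
# narrowing the fault toward generated code rather than engine startup
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model>", "prompt": "hi", "max_tokens": 4}'
```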
- Issue #35724: "Device does not support multicasting" running Qwen3.5-122B with TP=2 on H100 PCIe
  - Core issue: On H100s connected only via PCIe, the vLLM V1 engine fails when it tries to use symmetric memory (multicast) for TP communication.
  - Positions:
    - Reporter (wallbreaker740): Provided full environment details and noted that v0.15.0 works on the same hardware, implying a new V1-engine feature is responsible.
    - Responder (Saad-Mallebhari): Identified this as a hardware limitation (PCIe lacks NVLink multicast) and suggested disabling the feature via the `DISABLE_SYMMETRIC_MEMORY=1` environment variable.
    - Follow-up: The reporter tried several disabling approaches (including the suggested one) without success; other users pointed to the related PR #35085.
  - Point of contention: The suggested workaround did not take effect, possibly due to a version- or configuration-specific interaction, so the problem is more involved than a pure hardware limitation.
  - Current status: Still open; why the existing disable paths do not work needs further investigation.
🔥 Hot Topics and Trends
- Deepening AMD platform support: This cycle's activity shows the community moving from "basically works" toward performance optimization and feature parity, covering new quantization formats, compute-graph optimizations, CI/CD stability, and tuning options.
- PD disconnection and hybrid systems: Issue #35799 dissects the root cause of PD disconnections under NIXL 0.10.0 (UCX wrongly selecting RDMA over NVLink), while PR #35758 and #35760 add HMA (hybrid attention manager) support to the NIXL connector and speculative-decoding tests for PD disaggregation, marking this as a frontier of high-performance inference.
- Speculative decoding compatibility challenges: Issue #35800 and Issue #35704 (closed) reflect potential conflicts between speculative decoding and tool-call guided decoding, and with CUDA-graph memory planning, respectively, showing that cross-feature testing is essential when chasing peak speed.
- Qwen3.x support and optimization: Multiple issues (#35704, #35820, #35743) and PRs (#35739, #35777) revolve around Qwen3.5 deployment and performance, reflecting the series' popularity and vLLM's ongoing adaptation effort.
- General bug fixing and CI stability: A stream of issues covering CPU binding, LoRA loading, missing logs, and more, plus several CI-failure issues and matching fix PRs (e.g. #35798), show that keeping the test pipeline stable amid rapid iteration is a real challenge.
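The CI hardening from PR #35798 follows a common pytest wrapper pattern: exit code 5 means "no tests were collected", which a marker-filtered shard may legitimately produce. A minimal sketch; the wrapper name is illustrative, not the actual CI script:

```shell
# Run a test command, but treat pytest exit code 5 ("no tests collected")
# as success, since a marker-filtered shard may match zero tests.
run_pytest_lenient() {
    "$@"
    rc=$?
    if [ "$rc" -eq 5 ]; then
        rc=0
    fi
    return "$rc"
}
```

For example, `run_pytest_lenient pytest -m "core_model" tests/` exits 0 even when the marker selects no tests, while real failures still propagate.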
🛠️ Key Technical Changes
- PR #35733 & #35737 (NVFP4 emulation support): Defaults to the `EMULATION` backend on ROCm (and non-Blackwell CUDA), and fixes weight-scaling logic and CUDA graph capture issues, greatly widening hardware compatibility for NVFP4 models and benefiting the ecosystem.
- PR #35751 (merged, DeepSeekV2 QKVAProj custom op): Wraps the conditional GEMM path in `DeepSeekV2FusedQkvAProj.forward` as a custom op, eliminating graph breaks under torch.compile caused by data-dependent branching and improving compiled execution of DeepSeekV2 models.
- PR #35658 (merged, amd-quark dependency): A seemingly trivial dependency addition that unblocks ROCm users running MXFP4 and other quantized models, a key step toward feature completeness on AMD.
- Issue #35780 (RFC: remove per-block KV-transfer error handling): Proposes removing the complex per-block KV-transfer error-handling logic to cut maintenance burden and friction for new features. An important design discussion that affects the degradation strategy on transfer errors and long-term code health.
📈 Development Activity Observations
- Contributor activity: AMD-affiliated contributors (`-amd` suffixes) are very active, not only submitting code but also filing feature requests (e.g. tjtanaa), showing the AMD team's deep involvement in the project.
- Issue/PR responsiveness: The community responds quickly; #35704, for example, was closed within hours of the user's follow-up, and core contributors jumped on many bug reports promptly.
- Busy CI/CD pipeline: 37 PRs were merged while numerous CI failures were reported and tracked (e.g. #35769, #35783), indicating a period of intense integration and testing.
💡 Issues Worth Watching
- AMD CPU-specific segfault (Issue #35746): Though it occurs in CPU mode, it touches deep interactions between the AMD platform and Torch Inductor and may need joint investigation with the PyTorch team.
- FusedARRMS hang (Issue #35772): Affects multi-GPU performance on high-end GPUs (B200/H200); the suspected root cause (the `mnnvl` backend) needs to be confirmed quickly.
- NIXL 0.10.0 PD-disconnection regression (Issue #35799): A detailed technical analysis exposes a subtle change in how the underlying UCX library selects transports; such issues challenge the construction of stable, reliable PD systems.
- RFC: remove per-block KV-transfer error handling (Issue #35780): Community feedback is invited on a complexity-reduction proposal whose outcome will shape the system's error-recovery capability and code structure.
📋 Appendix: Detailed Data
New Issues
- #35823 [Bug]: branch v0.16.0 still rely on torch 2.9.1, not 2.10 — bug — by chamwen (created: 2026-03-03 11:23 (UTC+8))
- #35820 [Bug]: deploy Qwen3.5-27B error — bug — by ZTurboX (created: 2026-03-03 10:28 (UTC+8))
- #35780 [RFC]: Remove Per-Block KV Transfer Error Handling — RFC — by robertgshaw2-redhat (created: 2026-03-03 02:21 (UTC+8))
- #35789 [Bug]: CPU OMP thread autobind fails when number of local ranks <= NUMA nodes — bug — by shunzhiwen (created: 2026-03-03 03:08 (UTC+8))
- #35807 [Feature]: Performance Tuning of FlashInfer MLA Sparse — feature request — by benchislett (created: 2026-03-03 05:50 (UTC+8))
- #35805 [Feature]: FlashInfer Sparse MLA + FP8 KV Cache — feature request — by benchislett (created: 2026-03-03 05:40 (UTC+8))
- #35772 [Bug]: `FusedARRMS` Hang on startup during cudagraph capture TP>1 — bug,torch.compile — by benchislett (created: 2026-03-03 01:40 (UTC+8))
- #35769 [CI Failure]: mi325_1: Quantized Models Test — ci-failure — by AndreasKaratzas (created: 2026-03-03 00:50 (UTC+8))
- #35804 [Feature]: PRISM 153-key — Legitimacy Verification Layer for Model Selection Algorithm — feature request — by Mossaab-s (created: 2026-03-03 05:30 (UTC+8))
- #35799 [Bug] PD disagg via NixlConnector fails with NIXL 0.10.0 on B200 — bug — by ZhanqiuHu (created: 2026-03-03 04:49 (UTC+8))
- #35800 [Bug]: Enabling speculative coding causes malformed Tool Calls in Qwen 122B MXFP4 — bug — by mdierolf (created: 2026-03-03 04:56 (UTC+8))
- #35796 [Bug][DSV3.2]: Sparse Attention + DBO Crash — bug — by S1ro1 (created: 2026-03-03 04:38 (UTC+8))
- #35795 [Bug]: DSA + Dual batch overlap shape mismatch — bug — by S1ro1 (created: 2026-03-03 04:36 (UTC+8))
- #35779 [Bug]: Harmony models incorrectly drops prior-turn analysis channel in multi-turn conversations — bug — by stevewx (created: 2026-03-03 02:21 (UTC+8))
- #35783 [CI Failure]: mi355_8: V1 Test e2e + engine — ci-failure — by AndreasKaratzas (created: 2026-03-03 02:38 (UTC+8))
- #35792 [Feature]: Support MLA attention + quant fusions — feature request — by carlyou (created: 2026-03-03 03:37 (UTC+8))
- #35726 [Bug]: Vllm 0.16.0 version log missing input content — bug — by mia-wong1016 (created: 2026-03-02 17:26 (UTC+8))
- #35784 [CI Failure]: mi355_1: V1 Test entrypoints — ci-failure — by AndreasKaratzas (created: 2026-03-03 02:45 (UTC+8))
- #35778 [Bug]: Regression: terrible mixed prefill-decode performance with CUDA graphs enabled. — bug — by Zidrewndacht (created: 2026-03-03 02:03 (UTC+8))
- #35724 [Bug] H100 PCIe: RuntimeError ‘[SymmDeviceMemory] Device does not support multicasting’ when running Qwen3.5-122B with TP=2 — usage — by wallbreaker740 (created: 2026-03-02 16:50 (UTC+8))
- #35771 [RFC][torch.compile]: Disable Sequence Parallelism (SP) for piecewise compilation — RFC,torch.compile — by ProExpertProg (created: 2026-03-03 01:21 (UTC+8))
- #35767 [Enhancement]: Qwen3-ASR realtime endpoint produces degraded output — stateless segments, no cross-segment context, raw format leaks — no labels — by TheCodeWrangler (created: 2026-03-03 00:31 (UTC+8))
- #35766 [Bug]: aot_compile setting some aotautograd configs that change the cache key — bug,torch.compile — by zou3519 (created: 2026-03-03 00:27 (UTC+8))
- #35717 [Bug]: RunAI streamer breaks in 0.15.1 — bug — by Sanches166 (created: 2026-03-02 15:25 (UTC+8))
- #35746 [Bug]: Segfault at IP=0 during model warmup on AVX512_BF16 host (AMD 7940HS) — bug,cpu — by NetWilliam (created: 2026-03-02 21:13 (UTC+8))
- #35755 [Bug]: AsyncScheduler crashes with AssertionError during Realtime ASR streaming (num_output_placeholders underflow) — no labels — by TheCodeWrangler (created: 2026-03-02 22:54 (UTC+8))
- #35706 [Bug]: CUDA illegal memory access on H200 MiniMax-M2.5 — bug — by Oseltamivir (created: 2026-03-02 12:42 (UTC+8))
- #35743 [Bug]: Qwen 3.5 27B AWQ 4bit capturing CUDA graph fails — bug — by dule1322 (created: 2026-03-02 20:39 (UTC+8))
- #35704 [Bug]: Cannot start the Qwen3.5-27B-FP8 model with error — bug — by cyysky (created: 2026-03-02 11:30 (UTC+8))
- #35734 [Bug]: LoRA loading fails for modules with numeric indices (e.g., to_out.0 in Diffusion Transformers) — bug — by Wang-Shengyuan (created: 2026-03-02 19:18 (UTC+8))
- #35730 [Feature]: Load Mistral format LoRA when `--load-format=mistral` and `--enable-lora` — feature request — by harshil-shah (created: 2026-03-02 18:50 (UTC+8))
- #35729 [Bug]: Qwen3.5 does not work with pipeline parallelism — bug — by cherrymorning (created: 2026-03-02 18:22 (UTC+8))
- #35718 [Bug]: Garbled / gibberish output after serving Kimi-K2.5 with vLLM on 8×H200 (INT4) for some time — bug — by momaek (created: 2026-03-02 15:37 (UTC+8))
- #35725 [Bug]: .venv/lib/python3.10/site-packages/flashinfer/data/include/flashinfer/flat/hopper/collective/flat_collective_store.hpp:18:10: fatal error: cuda/ptx: No such file or directory (EngineCore_DP0 pid=1553337) 18 #include <cuda/ptx> — bug — by ymshuang (created: 2026-03-02 16:54 (UTC+8))
- #35713 [Feature] [ROCm]: Enable AITER `fused_allreduce_rmsnorm_quant` — feature request,rocm — by tjtanaa (created: 2026-03-02 14:39 (UTC+8))
- #35712 [Feature] [ROCm]: Enable AITER `fused_allreduce_rmsnorm` — feature request,rocm — by tjtanaa (created: 2026-03-02 14:37 (UTC+8))
- #35708 [Bug]: vLLM-compile warm start appears to be saving artifacts — bug,torch.compile — by zou3519 (created: 2026-03-02 12:50 (UTC+8))
- #35705 [Bug]: streaming mode+finish_reason length, delta content not empty with finish_reason length — bug,rocm — by ajiang17 (created: 2026-03-02 11:59 (UTC+8))
Closed Issues
- #35462 [Installation]: branch v0.16.0 still rely on torch 2.9.1, not 2.10 — installation — by chamwen (closed: 2026-03-03 11:23 (UTC+8))
- #33882 [Feature]: Improve sparse embedding pooling output format for better efficiency and usability — feature request — by staugust (closed: 2026-03-03 10:50 (UTC+8))
- #10971 [Bug]: Vllm CPU mode only takes 1 single core for multi-core cpu — bug,stale — by fzyzcjy (closed: 2026-03-03 10:19 (UTC+8))
- #11873 [Bug]: Engine is gracefully shutting down — bug,stale — by Bryce1010 (closed: 2026-03-03 10:19 (UTC+8))
- #21030 [Feature]: The inference accelation for quantized qwen2.5vl — feature request,stale — by WanianXO (closed: 2026-03-03 10:19 (UTC+8))
- #22331 [Bug]: vllm/vllm-openai:gptoss AssertionError: Sinks are only supported in FlashAttention 3 (4090 48gb) — bug,stale — by vakovalskii (closed: 2026-03-03 10:18 (UTC+8))
- #24921 [Bug]: v0.10.2, Qwen3-30B-A3B-NVFP4 MOE model on 5090, sm_120 hardware, `no cutlass_scaled_mm kernel` — bug,stale — by MrVolts (closed: 2026-03-03 10:18 (UTC+8))
- #25022 [Feature]: Request support for AWQ quantization on GPUs with CUDA Compute Capability < 8.0 using Compressed Tensors Quantization. — feature request,stale — by BUJIDAOVS (closed: 2026-03-03 10:18 (UTC+8))
- #27934 V1 Engine: Memory allocation failures and crashes with 7B-13B models on RTX 3060 12GB — stale — by m0nk111 (closed: 2026-03-03 10:17 (UTC+8))
- #27949 [Usage]: How do I deploy GGUF models with vLLM via Docker correct? — usage,stale — by alpha754293 (closed: 2026-03-03 10:17 (UTC+8))
- #35795 [Bug]: DSA + Dual batch overlap shape mismatch — bug — by S1ro1 (closed: 2026-03-03 04:37 (UTC+8))
- #29608 [Bug]: CUDA Graph Replay Skips KV Transfer Synchronization in Full Cache Hit Scenarios — bug,stale — by xiaguan (closed: 2026-03-03 04:04 (UTC+8))
- #35726 [Bug]: Vllm 0.16.0 version log missing input content — bug — by mia-wong1016 (closed: 2026-03-03 03:16 (UTC+8))
- #35633 [Bug]: parity with cuda: rocm image missing amd quark kimi k2.5 mxfp4 — bug,rocm — by functionstackx (closed: 2026-03-02 14:02 (UTC+8))
- #29023 [Feature]: Disable logging `/metrics` — help wanted,good first issue,feature request,stale — by robertgshaw2-redhat (closed: 2026-03-02 23:01 (UTC+8))
- #35412 [Bug]: Qwen3-VL-Reranker produces completely wrong relevance scores compared to native Transformers — bug — by xl2014 (closed: 2026-03-02 19:45 (UTC+8))
- #35704 [Bug]: Cannot start the Qwen3.5-27B-FP8 model with error — bug — by cyysky (closed: 2026-03-02 19:23 (UTC+8))
- #35729 [Bug]: Qwen3.5 does not work with pipeline parallelism — bug — by cherrymorning (closed: 2026-03-02 18:25 (UTC+8))
- #35599 [Bug]: cpu version compile failed in 0.16.0 — bug — by wanghualex1-wq (closed: 2026-03-02 17:50 (UTC+8))
- #29760 [Bug]: perf regression for qwen3-0.6b — bug,stale — by BoyuanFeng (closed: 2026-03-02 14:08 (UTC+8))
- #35702 [Bug]: Qwen3.5-FP8 Crashes VLLM — bug — by darsh12 (closed: 2026-03-02 12:27 (UTC+8))
New PRs
- #35754 [Bugfix] Avoid src/dst as None in irecv/isend_tensor_dict — bug,ready,ci/build,cpu — by bigPYJ1151 (created: 2026-03-02 22:52 (UTC+8))
- #35822 [CI/Build] Automatically patch video metadata for multimodal processor test — ready,multi-modality — by Isotr0py (created: 2026-03-03 10:45 (UTC+8))
- #35774 [Model Runner V2] Use ModelState.prepare_attn() for cuda graph capture [5/N] — v1,nvidia — by WoosukKwon (created: 2026-03-03 01:54 (UTC+8))
- #35727 [model] support FireRedASR2 — documentation,new-model — by AllenDou (created: 2026-03-02 17:39 (UTC+8))
- #35821 [V0 deprecation] Remove Swin model — ready — by Isotr0py (created: 2026-03-03 10:30 (UTC+8))
- #35811 [BUG] Fix async rlhf tests — bug,documentation,ready,ci/build,v1 — by hao-aaron (created: 2026-03-03 06:50 (UTC+8))
- #35764 [Feat][NIXL] Add KV lease refresh mechanism for disaggregated prefill — frontend,v1,kv-connector — by robertgshaw2-redhat (created: 2026-03-03 00:11 (UTC+8))
- #35740 feat(responses): stateless multi-turn via encrypted_content state carrier (RFC #26934) — frontend — by will-deines (created: 2026-03-02 20:07 (UTC+8))
- #35819 [Docs][Model Runner V2] Add Design Docs — documentation — by WoosukKwon (created: 2026-03-03 10:28 (UTC+8))
- #35818 Revert “[CI/Build] Enable Qwen3.5 tests on CI” (#35763) — qwen — by zhewenl (created: 2026-03-03 10:06 (UTC+8))
- #35817 Feat/scaled fp4 quant use functional — no labels — by tianrengao (created: 2026-03-03 10:00 (UTC+8))
- #35762 fix: raise ValueError when `--kv-cache-dtype` conflicts with checkpoint kv_cache_quant_algo — documentation,performance,ci/build,v1,multi-modality,qwen,kv-connector,nvidia — by infektyd (created: 2026-03-03 00:00 (UTC+8))
- #35763 [CI/Build] Enable Qwen3.5 tests on CI — ready,qwen — by Isotr0py (created: 2026-03-03 00:04 (UTC+8))
- #35770 [CI] Temporarily Disable Nightly Failures — ready,ci/build — by robertgshaw2-redhat (created: 2026-03-03 00:53 (UTC+8))
- #35716 [Bugfix] Qwen3Coder streaming tool call JSON missing opening brace in arguments — bug,needs-rebase,qwen — by KrxGu (created: 2026-03-02 15:10 (UTC+8))
- #35816 [CI] Fix `AMD: V1 Others (mi325_1)` Amd CI bug — bug,rocm,ready,ci/build — by yewentao256 (created: 2026-03-03 09:06 (UTC+8))
- #35710 [ROCm][CI] Support async weight transfer example with platform-aware determinism — documentation,rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-02 13:59 (UTC+8))
- #35781 [Perf] Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement — ready,v1,kv-connector — by yewentao256 (created: 2026-03-03 02:30 (UTC+8))
- #35815 [Bugfix] Fix CPU OMP autobind assertion to use local_world_size — bug,v1,cpu — by OiPunk (created: 2026-03-03 08:38 (UTC+8))
- #35814 Revert “[Attention] FA4 integration” (#32974) — documentation,ci/build,v1,nvidia — by zhewenl (created: 2026-03-03 08:32 (UTC+8))
- #35735 [Model] Add support for nvidia/llama-nemotron-rerank-vl-1b-v2 — documentation,new-model,ready,multi-modality,llama,nvidia — by jzakrzew (created: 2026-03-02 19:20 (UTC+8))
- #35813 [Bugfix] Fix misnamed parameter in compressed_tensors_moe.py — bug,ready — by bnellnm (created: 2026-03-03 07:43 (UTC+8))
- #35790 [Model Runner V2] Add WhisperModelState [6/N] — v1,nvidia — by WoosukKwon (created: 2026-03-03 03:12 (UTC+8))
- #35798 [ROCm][CI] Fix backslash-continuation in pytest marker re-quoting and treat exit code 5 as success — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-03 04:45 (UTC+8))
- #35793 [All Reduce] Change default backend of Flashinfer All Reduce to trtllm — ready — by hjjq (created: 2026-03-03 03:52 (UTC+8))
- #35794 [Perf] Optimize FusedMoEModularKernel output tensor using torch.empty — no labels — by xyang16 (created: 2026-03-03 04:16 (UTC+8))
- #35759 [Bugfix] Fix AsyncScheduler crash during streaming input with chunked prefill — bug,v1 — by TheCodeWrangler (created: 2026-03-02 23:29 (UTC+8))
- #35749 [Docs] Add breadcrumbs for better UX — ready — by hmellor (created: 2026-03-02 21:59 (UTC+8))
- #35812 FusedMoEWithLoRA EP support — no labels — by Jackmin801 (created: 2026-03-03 07:13 (UTC+8))
- #35806 [ROCm][CI] Fix Assertion Logic For `test_gpt_oss` — rocm,gpt-oss — by micah-wil (created: 2026-03-03 05:49 (UTC+8))
- #35797 [Bugfix] Fix MM processor test for Qwen3.5 — bug,ready,multi-modality,qwen — by ywang96 (created: 2026-03-03 04:42 (UTC+8))
- #35751 [MoE][Perf] Wrap DSV3 QKVAProj GEMM in custom op for torch.compile — ready,deepseek — by robertgshaw2-redhat (created: 2026-03-02 22:31 (UTC+8))
- #35810 [compile] Consistent compiler config for saved/loaded vllm backends. — no labels — by zhxchen17 (created: 2026-03-03 06:33 (UTC+8))
- #35809 [Models] Cohere ASR — documentation,new-model,frontend,v1,qwen — by ekagra-ranjan (created: 2026-03-03 06:10 (UTC+8))
- #35808 [Kernel] Add FP8 block-quantized MoE config for E=256,N=512 on H100 — no labels — by marklubin (created: 2026-03-03 05:59 (UTC+8))
- #35803 [Bugfix] Fix SM121 (DGX Spark) exclusion from Marlin/CUTLASS FP8 paths — bug,nvidia — by blake-snc (created: 2026-03-03 05:20 (UTC+8))
- #35801 lora: support loading Mistral-format LoRA weights via hf_to_vllm_mapper — no labels — by Godmook (created: 2026-03-03 05:13 (UTC+8))
- #35802 Fix: DBO + DSA shape mismatch — no labels — by S1ro1 (created: 2026-03-03 05:19 (UTC+8))
- #35741 [Bugfix] Fix missing sequence_lengths in qwen3_omni_moe_thinker — bug,ready,qwen — by yeqcharlotte (created: 2026-03-02 20:16 (UTC+8))
- #35787 [ROCm] Optimize gfx arch parsing for alpha stepping and guard env sync — rocm — by AndreasKaratzas (created: 2026-03-03 03:03 (UTC+8))
- #35757 Fix unclean shutdown from ctrl-c with AR Fusion — nvidia — by infektyd (created: 2026-03-02 23:21 (UTC+8))
- #35788 [BUG] Fix rlhf_async example — bug,documentation,ready — by hao-aaron (created: 2026-03-03 03:06 (UTC+8))
- #35791 [Bugfix][RoCM] GPT-OSS + Expert Parallel — bug,rocm,gpt-oss — by varun-sundar-rabindranath (created: 2026-03-03 03:31 (UTC+8))
- #35786 Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) — rocm,v1 — by Rohan138 (created: 2026-03-03 03:01 (UTC+8))
- #35777 [Kernel] Add fused_sigmoid_gating_delta_rule_update kernel for Qwen3 Next — qwen — by xyang16 (created: 2026-03-03 02:00 (UTC+8))
- #35782 [MoE Refactor] Remove SharedFusedMoE class — needs-rebase,v1,llama,qwen,deepseek,nvidia — by bnellnm (created: 2026-03-03 02:32 (UTC+8))
- #35773 [BugFix] Fix cmake based incremental install (wrong vllm install dir) — bug,ready,ci/build — by LucasWilkinson (created: 2026-03-03 01:52 (UTC+8))
- #35709 [torch.compile] Improve cold and warm start compile tests — ready — by zou3519 (created: 2026-03-02 12:53 (UTC+8))
- #35785 Segmented spans — documentation,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,kv-connector — by almogtavor (created: 2026-03-03 03:00 (UTC+8))
- #35768 [Bug] Fix FA install directory in cmake — bug,ready,ci/build — by yewentao256 (created: 2026-03-03 00:35 (UTC+8))
- #35753 [Mamba] Add stochastic rounding support — ci/build — by roikoren755 (created: 2026-03-02 22:46 (UTC+8))
- #35776 [Misc] Use envs module to get VLLM_DISABLED_KERNELS — no labels — by hickeyma (created: 2026-03-03 01:58 (UTC+8))
- #35750 Prefix caching updates — documentation,performance,structured-output,needs-rebase,ci/build,v1 — by viktorpusTT (created: 2026-03-02 22:20 (UTC+8))
- #35775 [CI] Add explicit permissions to macOS smoke test workflow — ci/build — by russellb (created: 2026-03-03 01:56 (UTC+8))
- #35742 [CI] Fix mypy for vllm/reasoning — documentation,deepseek — by hickeyma (created: 2026-03-02 20:28 (UTC+8))
- #35765 AITER MLA backend: Avoid CPU sync in _build_decode — rocm,v1 — by pschlan-amd (created: 2026-03-03 00:16 (UTC+8))
- #35737 [NVFP4] Support NVFP4 MOE models on AMD Instinct, Nvidia Ampere, Hopper through NVFP4 MOE emulation — rocm,nvidia — by fxmarty-amd (created: 2026-03-02 19:38 (UTC+8))
- #35760 [Test][WIP] Add PD disagg + SD acceptance tests — v1,kv-connector — by ZhanqiuHu (created: 2026-03-02 23:51 (UTC+8))
- #35723 [Doc] Improve UX of `--enable-log-requests` — performance,frontend,ready — by DarkLight1337 (created: 2026-03-02 16:45 (UTC+8))
- #35758 [Core][KVConnector] Support HMA+NixlConnector — v1,kv-connector — by NickLucche (created: 2026-03-02 23:23 (UTC+8))
- #35761 Fix import order violating ruff rules — no labels — by pschlan-amd (created: 2026-03-02 23:57 (UTC+8))
- #35731 AITER MLA backend: Fix CPU sync in _build_decode — rocm,v1 — by pschlan-amd (created: 2026-03-02 18:51 (UTC+8))
- #35738 [Misc] Add `--attention-backend auto` option — ready — by NickLucche (created: 2026-03-02 19:54 (UTC+8))
- #35711 [Bugfix] Guard mm_token_type_ids kwarg in get_mrope_input_positions — bug — by AndreasKaratzas (created: 2026-03-02 14:29 (UTC+8))
- #35756 Split generic IO Processor plugins tests from Terratorch specific ones — ci/build — by christian-pinto (created: 2026-03-02 22:55 (UTC+8))
- #35721 [LoRA] Support dual CUDA streams — nvidia — by jeejeelee (created: 2026-03-02 16:10 (UTC+8))
- #35752 [NIXL] Refactor `kernel_block_size` detection — v1,kv-connector — by NickLucche (created: 2026-03-02 22:43 (UTC+8))
- #35748 Fix issue #35743 — v1 — by xueliangyang-oeuler (created: 2026-03-02 21:39 (UTC+8))
- #35732 [Bugfix]: Fix LoRA loading failure for modules with numeric indices (e.g., to_out.0 in Diffusion Transformers) — bug — by Wang-Shengyuan (created: 2026-03-02 19:09 (UTC+8))
- #35747 add linear method — no labels — by xmpp777 (created: 2026-03-02 21:22 (UTC+8))
- #35745 [Performance] Add is_reasoning_end_streaming() override to GptOssReasoningParser — gpt-oss — by fergusfinn (created: 2026-03-02 20:57 (UTC+8))
- #35739 [core] add gdn packed decode path — qwen — by caozuoba (created: 2026-03-02 20:01 (UTC+8))
- #35744 Fix routed experts capture for hybrid models (Mamba + Attention) — v1 — by xhx1022 (created: 2026-03-02 20:55 (UTC+8))
- #35736 Fix Ray compiled-DAG SHM channel stalls by detaching zero-copy `np.ndarray` logprobs buffers — v1 — by JeanPaulShapo (created: 2026-03-02 19:33 (UTC+8))
- #35719 [ROCm][Perf] Enable `sparse_mla`’s cudagraph on ROCm platform — rocm,v1,nvidia — by ganyi1996ppo (created: 2026-03-02 15:45 (UTC+8))
- #35733 [NVFP4] Support NVFP4 dense models from `modelopt` and `compressed-tensors` on AMD Instinct GPUs, on Hopper, Ampere, etc. through emulation — rocm — by fxmarty-amd (created: 2026-03-02 19:13 (UTC+8))
- #35728 Merge pull request #1 from vllm-project/main — no labels — by xueliangyang-oeuler (created: 2026-03-02 18:12 (UTC+8))
- #35715 [Misc] Cleanup useless `current_platform` import — ready,v1,nvidia — by wangxiyuan (created: 2026-03-02 14:59 (UTC+8))
- #35722 Hanlin12 hip online tuning — documentation,frontend — by hanlin12-AMD (created: 2026-03-02 16:14 (UTC+8))
- #35720 [Bugfix] fix kv cache fp8 crash on sm<89 — bug,nvidia — by fo40225 (created: 2026-03-02 15:46 (UTC+8))
- #35714 Support multimodal speculative decoding in non-parallel draft_model mode — speculative-decoding,v1 — by EanWang211123 (created: 2026-03-02 14:46 (UTC+8))
- #35707 scaffold logic which ensures that conditionally imported deps exist — rocm — by wjabbour (created: 2026-03-02 12:46 (UTC+8))
Merged PRs
- #35763 [CI/Build] Enable Qwen3.5 tests on CI — ready,qwen — by Isotr0py (merged: 2026-03-03 01:43 (UTC+8))
- #35770 [CI] Temporarily Disable Nightly Failures — ready,ci/build — by robertgshaw2-redhat (merged: 2026-03-03 09:49 (UTC+8))
- #35615 [Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode — frontend,ready,qwen — by voipmonitor (merged: 2026-03-03 09:40 (UTC+8))
- #35376 [Model Runner V2][Perf] align dummy_run tokens to uniform decode for dp cudagraph — v1,nvidia — by izhuhaoran (merged: 2026-03-03 09:10 (UTC+8))
- #35270 [XPU][NIXL] Add GPUDirect RDMA support for XPU — ready,ci/build,kv-connector — by zhenwei-intel (merged: 2026-03-03 08:42 (UTC+8))
- #35735 [Model] Add support for nvidia/llama-nemotron-rerank-vl-1b-v2 — documentation,new-model,ready,multi-modality,llama,nvidia — by jzakrzew (merged: 2026-03-03 08:32 (UTC+8))
- #35798 [ROCm][CI] Fix backslash-continuation in pytest marker re-quoting and treat exit code 5 as success — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-03 08:07 (UTC+8))
- #35793 [All Reduce] Change default backend of Flashinfer All Reduce to trtllm — ready — by hjjq (merged: 2026-03-03 07:57 (UTC+8))
- #35749 [Docs] Add breadcrumbs for better UX — ready — by hmellor (merged: 2026-03-02 22:31 (UTC+8))
- #35797 [Bugfix] Fix MM processor test for Qwen3.5 — bug,ready,multi-modality,qwen — by ywang96 (merged: 2026-03-03 07:05 (UTC+8))
- #35751 [MoE][Perf] Wrap DSV3 QKVAProj GEMM in custom op for torch.compile — ready,deepseek — by robertgshaw2-redhat (merged: 2026-03-03 07:03 (UTC+8))
- #35552 clean unused cudagraph_batch_sizes — ready,v1,nvidia — by BoyuanFeng (merged: 2026-03-03 06:00 (UTC+8))
- #35741 [Bugfix] Fix missing sequence_lengths in qwen3_omni_moe_thinker — bug,ready,qwen — by yeqcharlotte (merged: 2026-03-03 05:11 (UTC+8))
- #35788 [BUG] Fix rlhf_async example — bug,documentation,ready — by hao-aaron (merged: 2026-03-03 04:36 (UTC+8))
- #34672 [ci] Add Ray compatibility check informational CI job — ready,ci/build — by jeffreywang-anyscale (merged: 2026-03-03 04:06 (UTC+8))
- #31057 [KVConnector] Auto-downgrade to PIECEWISE cudagraph mode for layerwise async ops — ready,kv-connector,nvidia — by yashwantbezawada (merged: 2026-03-03 04:04 (UTC+8))
- #33736 [Spec Decode] Add hidden states extraction system — documentation,new-model,speculative-decoding,ready,v1,llama,kv-connector — by fynnsu (merged: 2026-03-03 03:29 (UTC+8))
- #35709 [torch.compile] Improve cold and warm start compile tests — ready — by zou3519 (merged: 2026-03-03 03:27 (UTC+8))
- #35587 [BugFix][Model] Fix the garbled code in Ernie4.5-VL caused by fast_moe_cold_start — bug,ready — by CSWYF3634076 (merged: 2026-03-03 02:56 (UTC+8))
- #35307 [CI][HPU] Pin vllm commit compatible with vllm-gaudi - HPU tests — ready,ci/build — by PatrykWo (merged: 2026-03-03 01:02 (UTC+8))
- #35580 [CI] Defining extended V1 e2e + engine tests — rocm,ready,ci/build,v1 — by AndreasKaratzas (merged: 2026-03-02 16:10 (UTC+8))
- #35152 [ROCm][CI] Disable skinny GEMMs in language model standard tests to fix non-determinism — rocm,ready — by AndreasKaratzas (merged: 2026-03-02 15:04 (UTC+8))
- #35723 [Doc] Improve UX of `--enable-log-requests` — performance,frontend,ready — by DarkLight1337 (merged: 2026-03-03 00:24 (UTC+8))
- #35672 [Core] Move test utility to test file — ready,gpt-oss — by wjabbour (merged: 2026-03-02 23:56 (UTC+8))
- #35518 [CI] Fix mypy for vllm/device allocator — ready,v1 — by hickeyma (merged: 2026-03-02 23:53 (UTC+8))
- #34169 [CPU][Distributed] Fix Enable _CPUSHMDistributed only when TP/PP ranks share the same SHM group name — ready,v1,cpu — by charlesashby (merged: 2026-03-02 17:34 (UTC+8))
- #34627 [Performance] Extract kv update ops from MLA attention backends — ready,v1 — by ElizaWszola (merged: 2026-03-02 23:43 (UTC+8))
- #34119 [Fix Bug] `num_active_loras` always equals to zero — bug,ready,v1,gpt-oss — by RunkaiTao (merged: 2026-03-02 23:17 (UTC+8))
- #35505 [MyPy][BugFix] Check profiler is assigned before calling start() on it — bug,ready,v1 — by hickeyma (merged: 2026-03-02 21:23 (UTC+8))
- #35681 Fix unresolved-import errors when using Astral’s ty by removing src.root — ready — by tlrmchlsmth (merged: 2026-03-02 18:26 (UTC+8))
- #35588 [Feat] Supports Anthropic Messages count_tokens API — frontend,ready — by chaunceyjiang (merged: 2026-03-02 17:48 (UTC+8))
- #35715 [Misc] Cleanup useless `current_platform` import — ready,v1,nvidia — by wangxiyuan (merged: 2026-03-02 17:36 (UTC+8))
- #35495 [Misc] Bound NIXL upper bound version — ready,ci/build,kv-connector — by NickLucche (merged: 2026-03-02 16:57 (UTC+8))
- #34750 [Rocm][CI] Fix LM Eval Large Models (H100) test group — rocm,ready,ci/build — by charlifu (merged: 2026-03-02 15:43 (UTC+8))
- #34448 [Kernel] Integrate SM100 MXFP8 blockscaled grouped MM and quant kernels — ready,ci/build,nvidia — by EdalatiAli (merged: 2026-03-02 15:31 (UTC+8))
- #35658 [ROCm] add amd-quark package in requirements for rocm to use quantized models — rocm,ready,ci/build — by hongxiayang (merged: 2026-03-02 14:02 (UTC+8))
- #35691 [XPU] fix mxfp4 activation type — ready — by jikunshang (merged: 2026-03-02 11:48 (UTC+8))
PRs Closed Without Merging
- #19028 chore: add an alternative Ubuntu software source to speedup docker image building — documentation,ready,ci/build,stale — by acelyc111 (closed: 2026-03-03 10:19 (UTC+8))
- #22523 [Bench] Add Triton NVFP4 GEMM — performance,stale — by phuhung273 (closed: 2026-03-03 10:18 (UTC+8))
- #27539 [Core] Prefix cache: frequency- and cost-aware eviction (opt-in) — ci/build,stale,v1 — by Aminsed (closed: 2026-03-03 10:17 (UTC+8))
- #33609 [Feature][Speculative Decoding] Consolidate EAGLE input preparation — speculative-decoding,needs-rebase,v1 — by jaewshin (closed: 2026-03-03 09:04 (UTC+8))
- #35814 Revert “[Attention] FA4 integration” (#32974) — documentation,ci/build,v1,nvidia — by zhewenl (closed: 2026-03-03 08:36 (UTC+8))
- #34023 [Bugfix] Fix RAW hazard and optimize stores in EP Scatter Kernel — bug — by Manikvsin (closed: 2026-03-03 08:06 (UTC+8))
- #35812 FusedMoEWithLoRA EP support — no labels — by Jackmin801 (closed: 2026-03-03 07:14 (UTC+8))
- #32377 [Docker] Remove CUDA compatibility library loading; fixes #32373 — ci/build,nvidia — by wangshangsam (closed: 2026-03-03 06:28 (UTC+8))
- #32960 [compile][cuda_graph] Add sym_size handling by folding them to constant — needs-rebase,nvidia — by fxdawnn (closed: 2026-03-03 05:39 (UTC+8))
- #31239 [Feature] Add logprobs support for Whisper transcription API — documentation,frontend,needs-rebase,v1 — by TheCodeWrangler (closed: 2026-03-03 03:56 (UTC+8))
- #35785 Segmented spans — documentation,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,kv-connector — by almogtavor (closed: 2026-03-03 03:00 (UTC+8))
- #35768 [Bug] Fix FA install directory in cmake — bug,ready,ci/build — by yewentao256 (closed: 2026-03-03 02:42 (UTC+8))
- #30010 [Frontend] Improves Anthropic API compatibility — frontend,ready — by bbartels (closed: 2026-03-03 02:37 (UTC+8))
- #29376 Fix LoRA compatibility with quantized MoE models — no labels — by atoniolo76 (closed: 2026-03-03 02:07 (UTC+8))
- #35750 Prefix caching updates — documentation,performance,structured-output,needs-rebase,ci/build,v1 — by viktorpusTT (closed: 2026-03-03 01:58 (UTC+8))
- #23184 ON HOLD - [Core] Lazy/Delayed CUDA graph — needs-rebase,stale,v1,nvidia — by diegocastanibm (closed: 2026-03-03 00:38 (UTC+8))
- #35761 Fix import order violating ruff rules — no labels — by pschlan-amd (closed: 2026-03-03 00:10 (UTC+8))
- #35731 AITER MLA backend: Fix CPU sync in _build_decode — rocm,v1 — by pschlan-amd (closed: 2026-03-03 00:09 (UTC+8))
- #35661 interleave mm strings via request — frontend — by netanel-haber (closed: 2026-03-03 00:06 (UTC+8))
- #35341 [Bug] FA2 is not supported for NVIDIA Blackwell architecture — bug,v1,nvidia — by olka (closed: 2026-03-02 23:16 (UTC+8))
- #34830 [Feature] Add `--disable-uvicorn-metrics-access-log` shorthand — frontend — by miloudbelarebia (closed: 2026-03-02 23:07 (UTC+8))
- #35559 [MoE Refactor] Turn ChunkingMoERunner into a wrapper so it can be used with any MoERunner subclass. — needs-rebase,nvidia — by bnellnm (closed: 2026-03-02 21:31 (UTC+8))
- #34714 [Build] Include OpenTelemetry in release Docker images — ci/build,cpu — by codefromthecrypt (closed: 2026-03-02 20:20 (UTC+8))
- #35728 Merge pull request #1 from vllm-project/main — no labels — by xueliangyang-oeuler (closed: 2026-03-02 18:15 (UTC+8))
- #33514 [Bugfix] Fix gpt-oss chat format mismatch with HuggingFace — bug,frontend,gpt-oss — by thjung123 (closed: 2026-03-02 17:36 (UTC+8))
- #27757 [UX] Include NVTX in cuda.txt — needs-rebase,ci/build,stale,nvidia — by jeejeelee (closed: 2026-03-02 17:26 (UTC+8))
- #31698 [Kernel] Add triton silu_and_mul in fused_moe — performance — by jeejeelee (closed: 2026-03-02 17:12 (UTC+8))
- #35482 Fix AttributeError in RMSNormGated by adding activation attribute and… — needs-rebase,qwen — by xueliangyang-oeuler (closed: 2026-03-02 16:59 (UTC+8))
- #35699 [CLOSED] Support multimodal speculative decoding in non-parallel `draft_model` mode — documentation,performance,new-model,rocm,frontend,speculative-decoding,needs-rebase,ci/build,v1,multi-modality — by EanWang211123 (closed: 2026-03-02 14:29 (UTC+8))