vLLM Development Activity Report - 2026-02-27

Time window: 2026-02-27 11:05 (UTC+8) ~ 2026-02-28 11:05 (UTC+8)
Statistics: New issues 26 | Closed issues 21 | New PRs 81 | Merged PRs 41 | PRs closed without merging 21

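📊 Daily Development Summary

During this observation window the vLLM community remained highly active, with 26 new issues and 81 new PRs opened and 41 PRs merged. Development focused on three areas: troubleshooting inference stability (especially for the Qwen model family), optimizations and fixes for the AMD (ROCm) platform, and new features and performance work (Model Runner V2, WebSocket support, improvements to profiling tools). While iterating quickly on new features, the community also faces compatibility and stability challenges that stem from hardware diversity (AMD/NVIDIA/ARM) and model complexity (MoE, multimodal, speculative decoding).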

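🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was heavy in this window, spanning several key fixes, feature enhancements, and continuous integration (CI) improvements.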

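Issue #35569: [ROCm] ROCM_ATTN backend shows a systematic numerical deviation on Qwen3-VL-Reranker
- Description: On AMD MI355 (gfx950), the score output of the Qwen3-VL-Reranker-2B model shows a deterministic deviation of about 8.5% when the ROCM_ATTN attention backend is used, while the other ROCm backends (ROCM_AITER_FA, TRITON_ATTN, FLEX_ATTENTION) behave normally. The error is smaller on MI325 (gfx942).
- Analysis: This points to a low-level numerical precision problem in the ROCM_ATTN backend for this model and hardware combination, possibly caused by differences in the kernel implementation. Further investigation by the AMD team (e.g. @kenroche, @gshtras) is needed.
- Impact: Affects the accuracy of reranking tasks that use this backend on AMD platforms.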
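A deviation figure like the one quoted for this issue can be read as a relative error against a reference backend. The following is a minimal sketch of that comparison, not code from the issue: the function names and the 1% tolerance are illustrative assumptions.

```python
def relative_deviation(score: float, reference: float) -> float:
    """Return |score - reference| / |reference| for a single score."""
    if reference == 0:
        raise ValueError("reference score must be non-zero")
    return abs(score - reference) / abs(reference)


def within_tolerance(score: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Check a backend's score against a reference backend's score.

    rel_tol is a hypothetical threshold chosen for illustration only.
    """
    return relative_deviation(score, reference) <= rel_tol
```

With these definitions, a score of 0.915 against a reference of 1.0 is a deviation of about 8.5% and would fail a 1% tolerance.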
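PR #35560: [WIP][Bugfix][ROCm] Fix MXFP4 online quantization for MoE models at TP=1
- Description: Addresses three critical bugs that prevent standard HuggingFace MoE models (such as Qwen3-30B-A3B) from loading with quantization="mxfp4" and tensor_parallel_size=1. The bugs involve an index error, an unrecognized quantization method, and a dtype mismatch.
- Analysis: The fix matters for AMD platform support of mainstream MXFP4-quantized models. It distinguishes GPT-OSS-format checkpoints from standard HF-format ones and handles the online quantization flow correctly.
- Impact: The PR is still marked WIP; once merged, it should noticeably improve compatibility and user experience when running MXFP4-quantized MoE models on AMD GPUs.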
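PR #35485: [Bugfix][ROCm] Disable full CUDA graph capture on RDNA3/RDNA4 (gfx1x)
- Description: To work around Triton attention kernels crashing during HIP graph capture on RDNA3/RDNA4, the PR automatically downgrades cudagraph_mode from FULL_AND_PIECEWISE to PIECEWISE when a gfx11xx/gfx12xx GPU is detected.
- Analysis: This is a workaround for a known stability problem on specific AMD GPU architectures. In the comments, @tjtanaa noted that this change alone may not fully resolve the crashes that occur after requests are sent, and that more testing is needed.
- Impact: Intended to improve the stability of vLLM on RDNA3/RDNA4 GPUs; the actual effect has yet to be verified.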
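The downgrade rule described for PR #35485 can be sketched as a small pure function. This is an illustration under stated assumptions, not the PR's code: the function name, the regular expression, and the handling of the architecture string (which is assumed to look like "gfx1100" or "gfx942:sramecc+:xnack-") are hypothetical.

```python
import re

# Matches gfx11xx and gfx12xx (RDNA3/RDNA4) architecture names; hypothetical helper.
_RDNA3_RDNA4 = re.compile(r"^gfx1[12][0-9a-f]{2}$")


def select_cudagraph_mode(gcn_arch: str, requested: str = "FULL_AND_PIECEWISE") -> str:
    """Downgrade full graph capture to piecewise capture on RDNA3/RDNA4.

    Only the FULL_AND_PIECEWISE -> PIECEWISE transition named in the report
    is modelled; any other requested mode is returned unchanged.
    """
    # Drop feature suffixes such as ":sramecc+:xnack-" before matching.
    arch = gcn_arch.split(":")[0].lower()
    if requested == "FULL_AND_PIECEWISE" and _RDNA3_RDNA4.match(arch):
        return "PIECEWISE"
    return requested
```

Keeping the decision in one function that depends only on the architecture string makes the workaround easy to unit test without the affected hardware.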
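PR #35553 & #35571: [ROCm][CI] Stability fixes for the tool use and vision score tests
- Description: To address intermittent failures in ROCm CI tests, these two PRs introduce platform-specific overrides: disabling VLLM_ROCM_USE_SKINNY_GEMM, disabling prefix caching, and setting --max-num-seqs 1, in order to remove the effects of atomic reductions and nondeterministic batching variance.
- Analysis: Reflects the ongoing effort to make test results deterministic on AMD platforms. These adjustments help isolate kernel issues and make the tests more reliable.
- Impact: Improves the stability and trustworthiness of the AMD CI pipeline.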
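The platform-specific overrides listed for these CI fixes might be applied roughly as follows. This is a sketch only: the helper name and the placeholder model name are hypothetical, and the exact spelling of the prefix-caching flag (--no-enable-prefix-caching is assumed here) should be checked against the installed vLLM version.

```python
import os


def build_rocm_ci_launch(model: str, is_rocm: bool, base_env=None):
    """Return (env, args) for launching a test server.

    On ROCm, the three overrides named in the report are applied;
    on other platforms the defaults are left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    args = ["vllm", "serve", model]
    if is_rocm:
        # Disable the skinny GEMM path (atomic reductions are one source of variance).
        env["VLLM_ROCM_USE_SKINNY_GEMM"] = "0"
        # Disable prefix caching and force one sequence per batch
        # to remove batching-dependent variance.
        args += ["--no-enable-prefix-caching", "--max-num-seqs", "1"]
    return env, args
```

Scoping the overrides to ROCm keeps the tests on other platforms running with default settings, so a regression there is not masked by the determinism workarounds.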
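Other related PRs:
- PR #35491 (xuebwang-amd): Adds support for AMD Quark-quantized Qwen3.5 models (MXFP4, PTPC FP8, etc.), focusing on model loading and accuracy.
- PR #35527: Adds head size 80, used by stablelm models, to the head sizes supported by the ROCM_ATTN backend; groundwork for supporting more models.
- PR #35533: Fixes a functionalization error in the AITER RoPE op during compilation. Merged.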

💬 Hot Discussions

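Issue #35477: How to benchmark an encoder-only BERT model with vllm bench serve (5 comments)
- Core topic: The user tried to benchmark the BAAI/bge-large-zh-v1.5 encoder model, but the benchmark tool hung.
- Viewpoints:
  - The user (@duanshengliu) provided detailed reproduction steps and suspected a bug caused by the --served-model-name parameter.
  - A maintainer (@noooop) clarified that bge-reranker-v2-m3 is a reranker model and does not support the /v1/embeddings endpoint.
- Outcome: The user found a workaround on their own: omitting --served-model-name both when starting the server and when running the benchmark. This suggests the parameter may have a logic flaw when handling certain models. The issue is closed.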
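Issue #35541: vLLM hangs indefinitely with a low num_gpu_blocks_override (related PR #35542)
- Core topic: When the number of GPU KV cache blocks is artificially set low and a request needs more blocks than exist in total, the scheduler enters an infinite retry loop, so the process hangs instead of raising an error.
- Viewpoints and proposals:
  - Reporter (@kvcache670): Performed a root cause analysis showing that when allocation fails for a request in the waiting queue, the V1 scheduler simply breaks out of the loop without handling the request, which leaves it blocking the head of the queue.
  - Fix (same user): PR #35542 proposes that, after a retry, the scheduler compare the number of blocks the request needs with the number that can actually be made available; if the request can never be satisfied, it is aborted and scheduling continues with the following requests.
- Points of contention: None of note; this is a clear-cut bug fix. A related PR, #35570, proposes a similar fix.
- Current status: The issue is open, and fix PR #35542 has been submitted and awaits review.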
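The failure mode and the proposed remedy for Issue #35541 can be illustrated with a toy scheduler. This is a simplified sketch, not the vLLM V1 scheduler or the code in PR #35542: the Request class, the schedule function, and the block arithmetic are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    num_tokens: int


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Ceiling division: KV cache blocks required to hold num_tokens."""
    return -(-num_tokens // block_size)


def schedule(waiting: deque, total_blocks: int, free_blocks: int, block_size: int):
    """Run one scheduling step and return (scheduled_ids, aborted_ids)."""
    scheduled, aborted = [], []
    while waiting:
        req = waiting[0]
        need = blocks_needed(req.num_tokens, block_size)
        if need > total_blocks:
            # The request cannot fit even in an empty cache. Abort it instead of
            # leaving it at the head of the queue, where it would block forever.
            aborted.append(waiting.popleft().req_id)
            continue
        if need > free_blocks:
            # It may fit once running requests release blocks: retry next step.
            break
        free_blocks -= need
        scheduled.append(waiting.popleft().req_id)
    return scheduled, aborted
```

Without the first branch, an oversized request at the head of the queue would hit the break on every step, which is the hang described in the issue.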
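Issue #35507: Assertion error when using OffloadingConnector
- Core topic: The user hit the assertion error assert len(req.block_hashes) >= num_gpu_blocks while using OffloadingConnector for KV cache offloading.
- Discussion:
  - Maintainers (@lengrongfu, @haosdent) suggested using the latest version and pointed out that PR #27648 may already have fixed the problem.
  - The user (@liunianxuxie) said project constraints limit them to v0.11.0, and that a new error appeared after applying the changes from PR #27648.
- Current status: The issue remains open. It highlights how hard it is to backport fixes when using complex features (KV offloading) on older versions. The user may need to wait for an official release or debug in depth on their own.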

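🔥 Hot Topics and Trends

Inference stability and hangs: Several issues (#35504, #35502, #35541) report inference hangs under different configurations (TP>1, Qwen models, low memory limits). The robustness of the scheduler and execution engine continues to be tested in scenarios with high concurrency, complex models, and constrained resources.
Multimodal and encoder model support: Active discussion around how to correctly benchmark encoder models (Issue #35477), fixing multimodal model loading (Issue #35522, PR #35535), and handling audio-video interleaving logic (PR #35487) reflects vLLM's continued expansion beyond pure text generation.
AMD platform optimization and testing: As described in the AMD/ROCm section above, this window saw many fixes and CI improvements for AMD, indicating that the community is working to narrow the gap with the NVIDIA platform in features, performance, and stability.
Performance optimization and new features:
- Speculative decoding: MTP weight validation (PR #35548), a suffix decoding crash (Issue #35521), and a benchmark tool fix (PR #35471).
- Frontend and API: A WebSocket mode for the Responses API has been proposed (PR #35492), as have basic metrics for the /realtime endpoint (PR #35500).
- Kernels and compilation: Continued work on torch.compile optimizations (PR #35472, #35475) and on Model Runner V2 (PR #35564, #35520, #35120).
Model support and loading issues: New models continue to be added (PR #32407), while problems loading and running existing models under specific configurations (such as FP8 quantization or TP>1) are also being reported (Issue #35519, #35496, #35528).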

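🛠️ Key Technical Changes

PR #35560: [ROCm] Fix MXFP4 online quantization for MoE (TP=1): An important fix for quantization support on AMD, still marked WIP. It targets three chained errors that occur when loading an MoE model from standard HF format with MXFP4 online quantization enabled, involving the weight loader logic, the quantization method annotation, and dtype handling. It directly affects the usability of popular MoE models such as Qwen3 and Mixtral on AMD GPUs.
PR #35480: Use pinned memory for async H2D transfer in do_mamba_copy_block: This optimization targets models that use linear attention (such as Mamba). It fixes synchronous aten::to waits (up to 20 ms) caused by transfers from non-pinned memory. With pinned memory the latency drops to 0.13 ms, which significantly improves preprocessing efficiency for models with linear attention layers.
PR #34580: Enable the FlashInfer cuDNN backend for Qwen3 VL ViT attention: This PR introduces the FLASHINFER attention backend for the vision encoder of Qwen3 vision-language models. By adapting to the input format the cuDNN backend requires (such as cu_seqlens) and padding where necessary to avoid graph recompilation, it speeds up the encoder forward pass on GB200 from ~7.2 ms to ~3.7 ms, which matters for end-to-end performance on multimodal tasks.
PR #35487: Fix false positives on batched non-interleaved audio/video requests: A key fix to multimodal logic. The original function decided whether audio and video were interleaved simply by checking whether their token position ranges overlapped, which produced false positives when several independent requests were batched and caused the downstream embedding layer to crash. The fix adds a "density check" that ensures that, in a truly interleaved layout, there are no gaps between the audio and video tokens. This improves the stability of models such as Qwen2.5/3-Omni under batching.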
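The difference between the range-overlap test and the added density check in PR #35487 can be shown on plain lists of token positions. This is a toy reconstruction from the description above, not the PR's implementation: the function names and the list-based inputs are assumptions.

```python
def ranges_overlap(audio_pos: list, video_pos: list) -> bool:
    """Original-style check: do the two position ranges overlap at all?"""
    return min(audio_pos) < max(video_pos) and min(video_pos) < max(audio_pos)


def is_interleaved(audio_pos: list, video_pos: list) -> bool:
    """Overlap check plus a density check.

    In a truly interleaved layout the audio and video tokens together form
    one contiguous block, so the merged positions contain no gaps.
    """
    if not audio_pos or not video_pos:
        return False
    if not ranges_overlap(audio_pos, video_pos):
        return False
    merged = sorted(set(audio_pos) | set(video_pos))
    return merged[-1] - merged[0] + 1 == len(merged)
```

For a batch in which one request holds video at positions 0 to 2, another holds audio at 6 to 8, and a third holds video at 12 and 13, the ranges overlap but the merged positions have gaps, so the density check rejects the batch as non-interleaved.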

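📈 Development Activity Observations

Contributor diversity: Beyond the sustained work of the core maintainers (such as @haosdent, @ZJY0516, @DarkLight1337), community members (such as @kvcache670, @duanshengliu) actively report problems, take part in discussing solutions, and even submit fix PRs directly.
AMD team involvement: Contributors whose usernames contain "-amd" (such as xuebwang-amd) are actively submitting code focused on ROCm feature support, bug fixes, and CI stability, showing AMD's continued investment in the vLLM ecosystem.
Code review and merge throughput: 41 PRs were merged within the window, which indicates an efficient review process. Several key fixes (#35487, #35533, #35480) were merged within a day of being opened, showing a fast response to significant problems and regressions.
Closing the loop on issues: Some issues (such as #35477, #35390) were quickly diagnosed and resolved through community interaction, or closed after being resolved by a related PR, demonstrating good community collaboration and issue tracking.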

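💡 Issues Worth Watching

Inference stability of Qwen models: Issues #35504 and #35502 report inference hangs and accuracy degradation for Qwen3.5/Coder-Next models under specific versions and configurations, possibly related to low-level changes such as GDN optimizations and AllReduce fusion. Core developers should prioritize this, because Qwen is one of the key model families vLLM supports.
ARM64 (Grace) + Blackwell (GB10) compatibility: Issue #35519 reports that on an ARM64 Grace CPU paired with a Blackwell GPU, NVFP4-quantized Qwen3.5 models crash because a kernel contains an illegal instruction. This exposes the compatibility challenges that come with new hardware combinations, and may require architecture-specific adaptation of the quantization kernels or an explicit statement that the configuration is unsupported.
Complex failures in multi-node/distributed deployments: Issue #35496 reports that in a TP=8 large-model deployment, a JIT compilation timeout in the FlashInfer GDN kernel led to a Ray RPC timeout and a service crash. In large-scale distributed deployments, first-run compilation time can become a bottleneck for service availability, so precompilation or smarter wait/timeout strategies should be considered.
Support gap for the ModelOpt W4A8 MXFP4+FP8 quantization format: Issue #35528 points out that no NVIDIA PTQ tool (ModelOpt) currently produces checkpoints that can directly use vLLM's native FlashInfer MXFP4 MoE kernels, which leaves a functional gap. Closing it requires the vLLM developers and the NVIDIA ModelOpt team to work together to define or support a unified quantized checkpoint format.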

📋 Appendix: Detailed Data Lists

New Issues

#35569 [Bug]: [ROCm] ROCM_ATTN backend shows ~8.5% systematic score deviation on Qwen3-VL-Reranker pooling | bug,rocm | by AndreasKaratzas (created: 2026-02-28 10:31 (UTC+8))
#35574 [Bug]: Qwen3.5 Can not close thinking by "enable_thinking": false | bug | by Charmnut (created: 2026-02-28 11:03 (UTC+8))
#35496 [Bug]: RPC call to sample_tokens timed out. Qwen3.5-397B-A17B | bug | by DBDXSS (created: 2026-02-27 16:47 (UTC+8))
#35573 [Doc]: Speculative decoding --speculative-config option lacks clear documentation of accepted keys and values | documentation | by ArpitSuthar (created: 2026-02-28 10:56 (UTC+8))
#35507 [Bug]: Assertionerror when using OffloadingConnector: assert len(req.block_hashes) >= num_gpu_blocks | bug | by liunianxuxie (created: 2026-02-27 20:20 (UTC+8))
#35572 [Bug]: Running GLM-5 online with DP enabled hits RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered | bug | by artificialzjy (created: 2026-02-28 10:48 (UTC+8))
#35566 CUDA illegal memory access in MoE layer with MiniMax-M2.5 NVFP4 on Blackwell (SM120) | no labels | by jhsmith409 (created: 2026-02-28 09:34 (UTC+8))
#35547 [Bug]: Long weight loading results in server start failure | bug | by wzhao18 (created: 2026-02-28 04:55 (UTC+8))
#35563 [Bug]: Subset of Lora unit tests fail on NVIDIA VLLM Stack | bug | by puririshi98 (created: 2026-02-28 08:43 (UTC+8))
#35562 [Bug]: MXFP4A16 compressed-tensors quantization produces degenerate output (PPL 22,953 vs 8.74 BF16) | bug | by zeryx (created: 2026-02-28 07:53 (UTC+8))
#35550 [RFC]: Batch-Aware Expert Pruning for MoE Decode (XShare) | no labels | by hai-meh-cs (created: 2026-02-28 05:37 (UTC+8))
#35544 [CI Failure]: Import error in test_utils.py | ci-failure | by dippi9845 (created: 2026-02-28 04:37 (UTC+8))
#35541 [Bug]: vLLM hangs indefinitely with low num_gpu_blocks_override | bug | by kvcache670 (created: 2026-02-28 03:45 (UTC+8))
#35528 [Feature]: Support serving ModelOpt W4A8 MXFP4+FP8 checkpoints | feature request | by zeryx (created: 2026-02-28 00:47 (UTC+8))
#35504 [Bug]: qwen3-coder-next inference randomly hangs, accuracy degradation in 0.16.0+ with TP > 1 and fuse_allreduce_rms=False (H100s on PCIe) | bug | by vitush93 (created: 2026-02-27 19:05 (UTC+8))
#35522 [Bug]: Music Flamingo ValueError: Following weights were not initialized from checkpoint: {'audio_tower.pos_emb.freqs'} | bug | by denadai2 (created: 2026-02-27 23:35 (UTC+8))
#35502 [Bug]: Server hangs indefinitely during inference with Qwen3.5-27B-FP8 (vLLM nightly) | bug | by gallery2016 (created: 2026-02-27 18:34 (UTC+8))
#35521 [Bug]: Suffix decoding crashes with assert total_num_scheduled_tokens > 0 | bug | by rosstang (created: 2026-02-27 23:35 (UTC+8))
#35513 [Bug]: OLMoE missing clip_qkv implementation in vLLM | bug | by Qi-Zhan (created: 2026-02-27 21:39 (UTC+8))
#35519 [Bug]: Qwen3.5 NVFP4 models crash on ARM64 GB10 DGX Spark (CUDA illegal instruction during generation) | bug | by EmilHaase (created: 2026-02-27 23:11 (UTC+8))
#35516 [RFC]: why block_hash maps not always a single KVCacheBlock. | RFC | by binbinzhm (created: 2026-02-27 22:16 (UTC+8))
#35476 [CI] AttributeError: 'RMSNormGated' object has no attribute 'activation' | ci-failure | by LucasWilkinson (created: 2026-02-27 11:58 (UTC+8))
#35501 [Bug]: fp8_blockscale_gemm JIT compilation fails on vLLM Docker image - missing cublasLt.h, nvrtc.h, and -lnvrtc | bug | by onurguner (created: 2026-02-27 18:16 (UTC+8))
#35477 [Usage]: How to test the performance of a encoder-only BertModel using vllm bench serve | usage | by duanshengliu (created: 2026-02-27 12:36 (UTC+8))
#35494 [Bug]: sparsemixer uses hard-coded jitter_eps instead of config | bug | by Qi-Zhan (created: 2026-02-27 16:00 (UTC+8))
#35473 [Feature]: Request vllm support agent skills | feature request | by tonyaw (created: 2026-02-27 11:42 (UTC+8))

Closed Issues

#25661 [Bug]: RedhatAI/Qwen3-30B-A3B-FP8-dynamic TP=1 H100 RuntimeError: CUDA driver error: an illegal memory access was encountered | bug,stale | by thameem-abbas (closed: 2026-02-28 10:16 (UTC+8))
#26967 [Bug]: Problems with serving GPT-OSS-20B via vLLM on Open WebUI | bug,stale | by MikeNatC (closed: 2026-02-28 10:16 (UTC+8))
#27651 [Bug]: Qwen3 VL describes 512x512 images wrong | bug,stale | by andrePankraz (closed: 2026-02-28 10:16 (UTC+8))
#27746 [Bug]: strict value in function definitions causes request error when using Mistral tokenizer | bug,stale | by bbrowning (closed: 2026-02-28 10:16 (UTC+8))
#27783 [Usage]: Model performance different from api | usage,stale | by fny21 (closed: 2026-02-28 10:15 (UTC+8))
#27805 [New Model]: Add support for Nanonets-OCR2-3B | new-model,stale | by zarakokolagar (closed: 2026-02-28 10:15 (UTC+8))
#27810 [Bug]: Potential Out-of-Bounds Access in gptq_marlin.cu and marlin_24_cuda_kernel.cu | bug,stale | by molly-ting (closed: 2026-02-28 10:15 (UTC+8))
#27823 [Doc]: Multi-node distributed guide issues | documentation,stale | by schung-amd (closed: 2026-02-28 10:15 (UTC+8))
#32714 [Bug]: ❗️ Sleep is broken since 0.14.0 ❗️ | bug | by rstanislav (closed: 2026-02-28 03:44 (UTC+8))
#35439 [Feature Request]: W4A8 (compressed-tensors) Kernel support for Blackwell SM100+ | bug | by zeryx (closed: 2026-02-28 01:44 (UTC+8))
#34700 [CI Failure]: V1 Engine E2E | ci-failure | by robertgshaw2-redhat (closed: 2026-02-28 00:05 (UTC+8))
#35394 [Bug]: Qwen3-Omni Fails Under Mixed-Modality Requests at Concurrency 5 | bug | by Shirley125 (closed: 2026-02-27 22:48 (UTC+8))
#35057 [Bug]: Qwen3.5 scheduler_metadata must have shape (metadata_size) with Decode Context Parallel (DCP) | bug | by ehfd (closed: 2026-02-27 22:27 (UTC+8))
#34509 [New Model]: inclusionAI/Ring-2.5-1T | new-model | by youkaichao (closed: 2026-02-27 22:10 (UTC+8))
#34893 [Bug]: Qwen3.5-397B-A17B-FP8 fails with TP=4 - fused linear layer sharding incompatibility | no labels | by UmutAlihan (closed: 2026-02-27 21:19 (UTC+8))
#35476 [CI] AttributeError: 'RMSNormGated' object has no attribute 'activation' | ci-failure | by LucasWilkinson (closed: 2026-02-27 21:10 (UTC+8))
#35390 [Bug]: Qwen3.5 (NVIDIA H200) Pointer argument (at 0) cannot be accessed from Triton | bug | by ehfd (closed: 2026-02-27 13:42 (UTC+8))
#35477 [Usage]: How to test the performance of a encoder-only BertModel using vllm bench serve | usage | by duanshengliu (closed: 2026-02-27 17:38 (UTC+8))
#35438 [Bug]: Invalid response_format leads in 500 errors | bug | by antonovsergey93 (closed: 2026-02-27 16:01 (UTC+8))
#35268 [Bug]: Qwen3-VL-Rerank model's rerank API does not support query with mixed image and text inputs. | bug | by Yimi81 (closed: 2026-02-27 14:27 (UTC+8))
#32895 [Bug]: [ROCm] [MI355X] new 0.14 upstream gptoss hard error TP=1? | bug,rocm | by functionstackx (closed: 2026-02-27 11:37 (UTC+8))

New PRs

#35571 [ROCm][CI] Parametrize vision score tests across attention backends with per-backend tolerances | rocm | by AndreasKaratzas (created: 2026-02-28 10:36 (UTC+8))
#35570 Fix (v1): prevent scheduler hang when request exceeds total KV capacity | v1 | by luca-akka (created: 2026-02-28 10:32 (UTC+8))
#35564 [Model Runner V2] Move MM encoder to Model States [3/N] | v1,nvidia | by WoosukKwon (created: 2026-02-28 08:46 (UTC+8))
#35568 [Bugfix] Fix Marlin W4A8-FP8 check for SM121+ Blackwell variants | bug | by blake-snc (created: 2026-02-28 10:29 (UTC+8))
#35556 Yejin/bench ramp fix v2 | performance,frontend,needs-rebase,v1,cpu | by YJYJLee (created: 2026-02-28 06:23 (UTC+8))
#35492 feat(responses): add WebSocket mode for Responses API | frontend | by rasonyang (created: 2026-02-27 15:42 (UTC+8))
#35561 [Model Runner V2] Use pre-allocated tensors | needs-rebase,v1 | by njhill (created: 2026-02-28 07:20 (UTC+8))
#35520 [Model Runner V2] support qwen35 / mamba hybrid model | needs-rebase,v1,qwen,nvidia | by izhuhaoran (created: 2026-02-27 23:14 (UTC+8))
#35475 [torch.compile] Undo the fast_moe_cold_start hack in torch>=2.11 | documentation,ready,gpt-oss | by zou3519 (created: 2026-02-27 11:45 (UTC+8))
#35567 [XPU][CI] add xpu image build job in vllm CI | ci/build | by jikunshang (created: 2026-02-28 10:03 (UTC+8))
#35531 [Misc] Clean up ResponsesRequest model validators | frontend,ready | by umut-polat (created: 2026-02-28 01:35 (UTC+8))
#35558 [WIP] CI test run with Model Runner V2 | ready-run-all-tests | by njhill (created: 2026-02-28 06:50 (UTC+8))
#35565 fix(benchmarks): align ShareGPT token count with legacy script | performance | by luca-akka (created: 2026-02-28 09:06 (UTC+8))
#35536 [Model Runner V2][Bugfix] Fix MRV2 LoRA warmup | bug,v1 | by jeejeelee (created: 2026-02-28 02:15 (UTC+8))
#35503 [Bugfix] Propagate compilation_time from workers to main process for TP>1 | bug,ready,v1,cpu | by huydhn (created: 2026-02-27 18:54 (UTC+8))
#35517 [misc] cleanup one level of error stack when nixl fails to initialize | ready,kv-connector | by youkaichao (created: 2026-02-27 22:19 (UTC+8))
#35557 [Bugfix] Fix Anthropic API base64 image handling in Messages endpoint | bug,frontend | by voipmonitor (created: 2026-02-28 06:24 (UTC+8))
#35559 [MoE Refactor] Turn ChunkingMoERunner into a wrapper so it can be used with any MoERunner subclass. | needs-rebase,nvidia | by bnellnm (created: 2026-02-28 06:53 (UTC+8))
#35560 [WIP][Bugfix][ROCm] Fix MXFP4 online quantization for MoE models at tp=1 | bug,rocm | by SandishKumarHN (created: 2026-02-28 06:58 (UTC+8))
#35510 [Bugfix] Move chat completion response_format validation to Pydantic model_validator | bug,frontend,ready | by umut-polat (created: 2026-02-27 21:16 (UTC+8))
#35533 [ROCm]: fix aiter rope functionalization | rocm,ready | by Rohan138 (created: 2026-02-28 01:49 (UTC+8))
#35543 [CI] Improve (NCCL and torch) symmetric memory allreduce test coverage | ci/build | by jasonlizhengjian (created: 2026-02-28 04:21 (UTC+8))
#35548 [MTP] Validate that MTP weights are actually loaded | ready,deepseek | by MatthewBonanni (created: 2026-02-28 05:03 (UTC+8))
#35555 Fix bugs in fp8 fused_moe_lora kernels and add unit test for it | no labels | by yugong333 (created: 2026-02-28 06:16 (UTC+8))
#35554 [Benchmark] Add batch ramp-up detection for accurate decode profiling | no labels | by YJYJLee (created: 2026-02-28 06:11 (UTC+8))
#35472 [torch.compile] Stop lazily compiling | no labels | by zou3519 (created: 2026-02-27 11:35 (UTC+8))
#35527 [ROCm] Add stablelm Head Size 80 To Supported Head Sizes For ROCM_ATTN | documentation,rocm,ready,v1 | by micah-wil (created: 2026-02-28 00:39 (UTC+8))
#35553 [ROCm][CI] Fix tool use test stability - disable skinny GEMM, prefix caching, eliminate batch variance | rocm | by AndreasKaratzas (created: 2026-02-28 05:58 (UTC+8))
#35552 clean unused cudagraph_batch_sizes | v1,nvidia | by BoyuanFeng (created: 2026-02-28 05:46 (UTC+8))
#35551 feat: batch-aware expert pruning (XShare) for MoE models | needs-rebase,v1,gpt-oss | by hai-meh-cs (created: 2026-02-28 05:39 (UTC+8))
#35549 [MoE Refactor] Refactor ZeroExpertFusedMoE into new framework | needs-rebase,nvidia | by bnellnm (created: 2026-02-28 05:35 (UTC+8))
#35542 [BugFix] Abort waiting requests that can never fit in KV cache block pool | bug,v1 | by kvcache670 (created: 2026-02-28 04:07 (UTC+8))
#35546 Fixed import error of a test | no labels | by dippi9845 (created: 2026-02-28 04:55 (UTC+8))
#35545 Add helion package version 0.3.0 to requirements | ci/build | by dippi9845 (created: 2026-02-28 04:42 (UTC+8))
#35540 [Bugfix] Fix empty channel/recipient in harmony for /v1/responses | bug,frontend,gpt-oss | by kg6-sleipnir (created: 2026-02-28 03:37 (UTC+8))
#35539 Support Audio Extraction from MP4 Video for Nemotron Nano VL | frontend,ci/build,multi-modality | by askliar (created: 2026-02-28 03:15 (UTC+8))
#35538 docs: Add kernel/operator fusions reference page | documentation | by copilot-swe-agent (created: 2026-02-28 03:04 (UTC+8))
#35537 [Bugfix] Fixes for SLA finder | bug,documentation,performance | by DarkLight1337 (created: 2026-02-28 02:48 (UTC+8))
#35482 Fix AttributeError in RMSNormGated by adding activation attribute and… | needs-rebase,qwen | by xueliangyang-oeuler (created: 2026-02-27 14:44 (UTC+8))
#35535 [Bugfix] Fix loading Music Flamingo | bug | by NickCao (created: 2026-02-28 02:14 (UTC+8))
#35529 Extract guidellm results | ci/build,gpt-oss | by debroy-rh (created: 2026-02-28 01:07 (UTC+8))
#35505 [MyPy][BugFix] Check profiler is assigned before calling start() on it | bug,ready,v1 | by hickeyma (created: 2026-02-27 19:23 (UTC+8))
#35534 Revert "Add GlmOcrConfig for GLM-OCR model type recognition" | no labels | by hujia177 (created: 2026-02-28 02:07 (UTC+8))
#35515 Make DeepSeek V2/V3/R1 RMSNorm consistent with DeepSeek impl | deepseek | by hmellor (created: 2026-02-27 22:03 (UTC+8))
#35525 [Doc] Fix link to Llama chat template for usability | documentation,ready,llama | by hickeyma (created: 2026-02-28 00:20 (UTC+8))
#35480 [perf] Use pinned memory for async H2D transfer in do_mamba_copy_block | ready,v1 | by hl475 (created: 2026-02-27 13:16 (UTC+8))
#35532 [Misc] Use VLLMValidationError consistently in ResponsesRequest | frontend | by umut-polat (created: 2026-02-28 01:37 (UTC+8))
#35530 [Misc] Fix stale doc URL and docstring module path | no labels | by umut-polat (created: 2026-02-28 01:33 (UTC+8))
#35524 [Misc] Fill in some v1 CODEOWNERS gaps | ci/build | by njhill (created: 2026-02-28 00:07 (UTC+8))
#35526 fix(mig): comprehensive MIG UUID handling for NVML device queries | nvidia | by GaneshSubhashPatil (created: 2026-02-28 00:34 (UTC+8))
#35508 [Responses] Normalize OpenResponses input items to match openai-python SDK requirements | frontend,ready | by anencore94 (created: 2026-02-27 20:56 (UTC+8))
#35485 [Bugfix][ROCm] Disable full CUDA graph capture on RDNA3/RDNA4 (gfx1x) | bug,rocm,nvidia | by haosdent (created: 2026-02-27 15:11 (UTC+8))
#35518 [CI] Fix mypy for vllm/device allocator | v1 | by hickeyma (created: 2026-02-27 22:29 (UTC+8))
#35523 [Bugfix] Add missing clip_qkv implementation to OLMoE attention | bug | by haosdent (created: 2026-02-27 23:38 (UTC+8))
#35509 [Bugfix] Fix GLM-OCR text_config model_type handling | bug,needs-rebase | by umut-polat (created: 2026-02-27 21:15 (UTC+8))
#35487 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests | bug,ready,multi-modality,qwen | by linyueqian (created: 2026-02-27 15:18 (UTC+8))
#35493 Grpc renderer | frontend,ci/build | by hyeongyun0916 (created: 2026-02-27 15:47 (UTC+8))
#35495 [Misc] Bound NIXL upper bound version | ready,ci/build,kv-connector | by NickLucche (created: 2026-02-27 16:26 (UTC+8))
#35514 [Bugfix] Replace assert with ValueError for response_format validatio… | bug,frontend,ready | by antonovsergey93 (created: 2026-02-27 21:57 (UTC+8))
#35512 Revert "Add GlmOcrConfig for GLM-OCR model type recognition" | no labels | by hmellor (created: 2026-02-27 21:36 (UTC+8))
#35511 Fix TypeError in GLM OCR config | no labels | by hmellor (created: 2026-02-27 21:22 (UTC+8))
#35506 [Core] Proactively free KV cache blocks when aborting finished requests | v1,kv-connector | by jianzs (created: 2026-02-27 19:45 (UTC+8))
#35468 [Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations | bug,nvidia | by haosdent (created: 2026-02-27 11:07 (UTC+8))
#35489 [Bugfix] Fix Fabric/RDMA attribute queries poisoning global error_code in cumem allocator | bug | by haosdent (created: 2026-02-27 15:26 (UTC+8))
#35500 [Feature] Add basic metrics for /realtime endpoint | frontend | by pougetat (created: 2026-02-27 17:54 (UTC+8))
#35498 Fix indexError in moe wna16 quantization with enable-expert-parallel | no labels | by ivyilike (created: 2026-02-27 17:35 (UTC+8))
#35499 fix(phimoe): use config router_jitter_noise instead of hardcoded jitter_eps | no labels | by stakeswky (created: 2026-02-27 17:50 (UTC+8))
#35497 [Bugfix] Fix NVFP4 MoE scale misalignment for non-128-aligned hidden dims | bug,nvidia | by eous (created: 2026-02-27 17:25 (UTC+8))
#35484 [WIP][Bugfix] Fix multi-node PP crash with logprobs due to pinned memory serialization | bug,v1 | by haosdent (created: 2026-02-27 15:10 (UTC+8))
#35490 [WIP][Bugfix] Fix Qwen3-VL-Reranker wrong relevance scores due to incorrect pooling and attention type | bug,qwen | by haosdent (created: 2026-02-27 15:27 (UTC+8))
#35474 [Perf][DeepSeek-V3] Dynamic MLA/MHA Routing for Sub-1024 Token Prefill (~3x TTFT Speedup) | deepseek | by Joy-In-Code (created: 2026-02-27 11:44 (UTC+8))
#35481 [Feature][CI]: compare func & no_func outputs in test_functionalization.py | no labels | by 11happy (created: 2026-02-27 14:40 (UTC+8))
#35491 [ROCm][Quantization] support amd-quark quantized Qwen3.5 model | rocm,qwen | by xuebwang-amd (created: 2026-02-27 15:27 (UTC+8))
#35488 [WIP][Bugfix] Fix CUDA OOM in sparse_attn_indexer prefill with high concurrency | bug,v1,nvidia | by haosdent (created: 2026-02-27 15:19 (UTC+8))
#35486 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests | bug,needs-rebase,multi-modality,qwen | by linyueqian (created: 2026-02-27 15:14 (UTC+8))
#35483 Add AMD AITER MLA fusion optimization for DeepSeek models | rocm,deepseek | by khairulkabir1661 (created: 2026-02-27 14:47 (UTC+8))
#35479 [BugFix/CI]: Fix AttributeError: 'RMSNormGated' object has no attribute 'activation' | bug,ready | by LucasWilkinson (created: 2026-02-27 13:07 (UTC+8))
#35478 [BugFix] Fix engine hanging after KV cache initialization failure | bug,v1 | by 842974287 (created: 2026-02-27 12:38 (UTC+8))
#35471 fix(benchmarks): correct peak output token throughput calculation for speculative decoding | performance | by hukongyi (created: 2026-02-27 11:27 (UTC+8))
#35470 Fix MIG UUID handling in interface.py and cuda.py | nvidia | by GaneshSubhashPatil (created: 2026-02-27 11:15 (UTC+8))
#35469 feat: add OpenTelemetry Metrics support via OTLP protocol | v1 | by RichardoMrMu (created: 2026-02-27 11:12 (UTC+8))

Merged PRs

#35564 [Model Runner V2] Move MM encoder to Model States [3/N] | v1,nvidia | by WoosukKwon (merged: 2026-02-28 10:32 (UTC+8))
#35120 [Model Runner V2] Support pooling models | v1,nvidia | by WoosukKwon (merged: 2026-02-28 10:03 (UTC+8))
#35531 [Misc] Clean up ResponsesRequest model validators | frontend,ready | by umut-polat (merged: 2026-02-28 09:19 (UTC+8))
#35517 [misc] cleanup one level of error stack when nixl fails to initialize | ready,kv-connector | by youkaichao (merged: 2026-02-28 08:42 (UTC+8))
#35105 [Refactor][Kernel] Add global helper to deduplicate vectorized memory ops | ready,nvidia | by LopezCastroRoberto (merged: 2026-02-28 08:28 (UTC+8))
#35533 [ROCm]: fix aiter rope functionalization | rocm,ready | by Rohan138 (merged: 2026-02-28 06:42 (UTC+8))
#35334 [ROCm] Enabling encoder and encoder-decoder on ROCm and AITER unified backends | documentation,rocm,ready,v1 | by gshtras (merged: 2026-02-28 05:32 (UTC+8))
#34171 [Feat][RL][2/2] Native Weight Syncing API: IPC | documentation,ready,ci/build | by hao-aaron (merged: 2026-02-28 04:45 (UTC+8))
#35404 [Bugfix][Model] Fix gpt-oss batch invariance | bug,ready,gpt-oss | by jzakrzew (merged: 2026-02-28 04:41 (UTC+8))
#34102 [DP] Only use DP padding when cudagraphs are actually used | speculative-decoding,ready,ci/build,v1,nvidia | by LucasWilkinson (merged: 2026-02-28 04:14 (UTC+8))
#35420 [Bugfix] Add monkeypatch to prevent race condition from writing | bug,ready | by Lucaskabela (merged: 2026-02-28 03:51 (UTC+8))
#34888 [Transformers backend] Ignore MTP weights when num_nextn_predict_layers=0 | ready | by SteadfastAsArt (merged: 2026-02-28 03:39 (UTC+8))
#35308 [compile] Fix caching error over pytree slice node. | ready | by zhxchen17 (merged: 2026-02-28 03:34 (UTC+8))
#35172 [Model Runner V2] Warmup kernels | ready,v1 | by njhill (merged: 2026-02-28 02:43 (UTC+8))
#35097 [BugFix] Fix 3D rope in transformers backend | bug,ready | by zucchini-nlp (merged: 2026-02-28 02:34 (UTC+8))
#35100 Support parakeet as audio encoder for nemotron-nano-vl | new-model,ready | by netanel-haber (merged: 2026-02-28 02:07 (UTC+8))
#35525 [Doc] Fix link to Llama chat template for usability | documentation,ready,llama | by hickeyma (merged: 2026-02-28 01:51 (UTC+8))
#35480 [perf] Use pinned memory for async H2D transfer in do_mamba_copy_block | ready,v1 | by hl475 (merged: 2026-02-28 01:50 (UTC+8))
#35524 [Misc] Fill in some v1 CODEOWNERS gaps | ci/build | by njhill (merged: 2026-02-28 01:34 (UTC+8))
#32407 [Model] Add huggingface skt/A.X-K1 model | documentation,new-model,ready | by fort726 (merged: 2026-02-28 01:26 (UTC+8))
#34390 [Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching | rocm,ready | by gmagogsfm (merged: 2026-02-28 01:21 (UTC+8))
#35312 [Core] Fix gpu_worker.py pre-commit errors | ready,v1 | by njhill (merged: 2026-02-27 23:54 (UTC+8))
#35317 Add @BoyuanFeng to CODEOWNERS | documentation,ready,ci/build | by BoyuanFeng (merged: 2026-02-27 23:53 (UTC+8))
#33646 [Bugfix] Handle case when kimi ends reasoning with a tool call | bug,ready | by koush (merged: 2026-02-27 22:58 (UTC+8))
#35487 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests | bug,ready,multi-modality,qwen | by linyueqian (merged: 2026-02-27 22:48 (UTC+8))
#35410 [compile] Cleanup: Remove unnecessary +rms_norm forcing for sequence parallelism | ready | by jasonlizhengjian (merged: 2026-02-27 21:36 (UTC+8))
#35082 [Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp | bug,ready,v1 | by haosdent (merged: 2026-02-27 22:27 (UTC+8))
#35512 Revert "Add GlmOcrConfig for GLM-OCR model type recognition" | no labels | by hmellor (merged: 2026-02-27 22:13 (UTC+8))
#35423 [Bugfix] Add missing activation attr to RMSNormGated | bug,ready | by Tib-Gridello (merged: 2026-02-27 20:53 (UTC+8))
#34580 Flashinfer cuDNN backend for Qwen3 VL ViT attention | ready,ci/build,v1,qwen,nvidia | by maxyanghu (merged: 2026-02-27 20:20 (UTC+8))
#35413 [Bugfix] Fix Qwen3NextForCausalLM packed_modules_mapping | bug,ready,qwen | by jeejeelee (merged: 2026-02-27 11:46 (UTC+8))
#35424 [Bugfix] disable allreduce_rms_fusion by default when pp size > 1 | bug,ready | by ZJY0516 (merged: 2026-02-27 12:18 (UTC+8))
#35456 [Bugfix] Replace assert with ValueError for response_format validation in completions endpoint | bug,frontend,ready | by umut-polat (merged: 2026-02-27 16:01 (UTC+8))
#33088 [Bugfix] Use 'sum' reduction instead of 'avg' in Async TP reduce-scatter | bug,ready | by wangxingran222 (merged: 2026-02-27 15:06 (UTC+8))
#35457 [Model Performance] Add Qwen3MoE tuned MoE configs for H200 | ready,qwen | by chengyinie (merged: 2026-02-27 13:51 (UTC+8))
#35369 [Bug] correct out dtype of rms_norm_gated native path | bug,ready | by zufangzhu (merged: 2026-02-27 13:19 (UTC+8))
#35434 [BugFix] Repo utils debug print patch | bug,ready | by pi314ever (merged: 2026-02-27 11:50 (UTC+8))
#35314 [Bug] Fix outdated links in source code | bug,documentation,ready,ci/build | by yewentao256 (merged: 2026-02-27 11:50 (UTC+8))
#33197 use 'max_active_experts' for moe lora input size | ready | by gnovack (merged: 2026-02-27 11:50 (UTC+8))
#35400 [Misc] Move GPUModelRunner.prepare_kernel_block_sizes to utils | ready,v1 | by NickLucche (merged: 2026-02-27 11:42 (UTC+8))
#33012 [Core]Extract is_last_rank in Ray for tpu to override | ready,v1 | by Chenyaaang (merged: 2026-02-27 11:18 (UTC+8))

PRs Closed Without Merging

#25932 [Misc] Add benchmark for sampler. | performance,stale | by FangJiangyi (closed: 2026-02-28 10:16 (UTC+8))
#35554 [Benchmark] Add batch ramp-up detection for accurate decode profiling | no labels | by YJYJLee (closed: 2026-02-28 06:22 (UTC+8))
#35545 Add helion package version 0.3.0 to requirements | ci/build | by dippi9845 (closed: 2026-02-28 04:50 (UTC+8))
#35534 Revert "Add GlmOcrConfig for GLM-OCR model type recognition" | no labels | by hujia177 (closed: 2026-02-28 02:08 (UTC+8))
#35532 [Misc] Use VLLMValidationError consistently in ResponsesRequest | frontend | by umut-polat (closed: 2026-02-28 01:47 (UTC+8))
#22282 Add option to disable weakref conversion for last piecewise cudagraph in a module | torch.compile,needs-rebase,unstale,v1,nvidia | by sarckk (closed: 2026-02-28 01:42 (UTC+8))
#32135 [Bugfix] Don't Index MM Placeholders Out When LoRA is Enabled | bug,v1,qwen | by alex-jw-brooks (closed: 2026-02-27 23:19 (UTC+8))
#35509 [Bugfix] Fix GLM-OCR text_config model_type handling | bug,needs-rebase | by umut-polat (closed: 2026-02-27 23:09 (UTC+8))
#35514 [Bugfix] Replace assert with ValueError for response_format validatio… | bug,frontend,ready | by antonovsergey93 (closed: 2026-02-27 22:14 (UTC+8))
#35443 [BUGFIX] Replace assert with ValueError for response_format validation in chat completions endpoint | bug,documentation,frontend,ci/build,v1,qwen,deepseek,nvidia | by antonovsergey93 (closed: 2026-02-27 21:59 (UTC+8))
#35511 Fix TypeError in GLM OCR config | no labels | by hmellor (closed: 2026-02-27 21:35 (UTC+8))
#35351 [XPU] special handle for pooler models w8a16 gemm | no labels | by yma11 (closed: 2026-02-27 20:03 (UTC+8))
#34881 [Bugfix] limit cudagraph capture sizes by num_blocks for GDN models | bug,v1,qwen,nvidia | by ZJY0516 (closed: 2026-02-27 18:40 (UTC+8))
#35497 [Bugfix] Fix NVFP4 MoE scale misalignment for non-128-aligned hidden dims | bug,nvidia | by eous (closed: 2026-02-27 17:48 (UTC+8))
#35490 [WIP][Bugfix] Fix Qwen3-VL-Reranker wrong relevance scores due to incorrect pooling and attention type | bug,qwen | by haosdent (closed: 2026-02-27 17:14 (UTC+8))
#35486 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests | bug,needs-rebase,multi-modality,qwen | by linyueqian (closed: 2026-02-27 15:17 (UTC+8))
#35479 [BugFix/CI]: Fix AttributeError: 'RMSNormGated' object has no attribute 'activation' | bug,ready | by LucasWilkinson (closed: 2026-02-27 13:11 (UTC+8))
#32157 feat: add requires_token_ids interface for sampling params | v1 | by llsj14 (closed: 2026-02-27 12:23 (UTC+8))
#34835 [Core] Extract is_last_rank in ray for tpu to override | v1 | by pv97 (closed: 2026-02-27 11:47 (UTC+8))
#29191 [Models] Lfm2-VL Architecture | documentation,new-model,stale | by paulpak58 (closed: 2026-02-27 11:38 (UTC+8))
#34511 [Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations | bug,needs-rebase | by haosdent (closed: 2026-02-27 11:08 (UTC+8))