vLLM Development Activity Report - 2026-02-27
Time window: 2026-02-27 11:05 (UTC+8) ~ 2026-02-28 11:05 (UTC+8). Stats: 26 new issues | 21 issues closed | 81 new PRs | 41 PRs merged | 21 PRs closed without merging
📊 Daily Development Status Summary
During this observation window the vLLM community remained highly active, with a large volume of issues and PRs opened and closed. Development focused on inference-stability triage (especially for the Qwen model family), AMD (ROCm) platform fixes and optimizations, and new features and performance work (Model Runner V2, WebSocket support, profiling-tool improvements). While iterating quickly on new features, the community continues to face compatibility and stability challenges driven by hardware diversity (AMD/NVIDIA/ARM) and model complexity (MoE, multimodal, speculative decoding).
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was very high this cycle, spanning several key fixes, feature enhancements, and continuous integration (CI) improvements.
- Issue #35569: [ROCm] ROCM_ATTN backend shows a systematic numerical deviation on Qwen3-VL-Reranker
  - Description: On the AMD MI355 (gfx950) platform, the `ROCM_ATTN` attention backend produces a deterministic ~8.5% deviation in Qwen3-VL-Reranker-2B scores, while the other ROCm backends (`ROCM_AITER_FA`, `TRITON_ATTN`, `FLEX_ATTENTION`) behave correctly. On the MI325 (gfx942) the error is smaller.
  - Analysis: This points to an underlying numerical-precision problem in the `ROCM_ATTN` backend for this particular model/hardware combination, likely rooted in kernel implementation differences. Further investigation by the AMD team (e.g. @kenroche, @gshtras) is needed.
  - Impact: Affects reranking accuracy on AMD platforms when this backend is used.
- PR #35560: [WIP][Bugfix][ROCm] Fix MXFP4 online quantization for MoE models at TP=1
  - Description: Fixes three critical errors that prevented standard HuggingFace MoE models (e.g. Qwen3-30B-A3B) from loading with `quantization="mxfp4"` and `tensor_parallel_size=1`: an indexing error, an unknown quantization method, and a dtype mismatch.
  - Analysis: This fix is essential for AMD-platform support of mainstream MXFP4-quantized models. It distinguishes GPT-OSS-format checkpoints from standard HF-format checkpoints and handles the online quantization path correctly.
  - Impact: Significantly improves compatibility and user experience for running MXFP4-quantized MoE models on AMD GPUs.
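The checkpoint-format distinction at the heart of PR #35560 can be illustrated with a minimal sketch. The function and config fields below are hypothetical, not vLLM's actual loader API: a checkpoint that already declares an `mxfp4` quant method is treated as pre-quantized GPT-OSS-style weights, while a standard HF checkpoint combined with a user-requested `quantization="mxfp4"` goes through online quantization at load time.

```python
# Hypothetical sketch of the checkpoint-format dispatch described for PR #35560.
# Neither the function name nor the config fields are vLLM's real API.
def select_mxfp4_path(ckpt_quant_config, requested_quant):
    """Decide how MXFP4 weights should be obtained at load time."""
    if ckpt_quant_config and ckpt_quant_config.get("quant_method") == "mxfp4":
        # GPT-OSS-style checkpoint: weights are already packed MXFP4.
        return "load_prequantized"
    if requested_quant == "mxfp4":
        # Standard HF checkpoint (e.g. Qwen3-30B-A3B): quantize while loading.
        return "quantize_online"
    return "no_mxfp4"

print(select_mxfp4_path(None, "mxfp4"))                    # quantize_online
print(select_mxfp4_path({"quant_method": "mxfp4"}, None))  # load_prequantized
```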
- PR #35485: [Bugfix][ROCm] Disable full CUDA graph capture on RDNA3/RDNA4 (gfx1x)
  - Description: To work around Triton attention kernel crashes during HIP graph capture on RDNA3/RDNA4, this PR automatically downgrades `cudagraph_mode` from `FULL_AND_PIECEWISE` to `PIECEWISE` when a gfx11xx/gfx12xx GPU is detected.
  - Analysis: This is a workaround for a known stability issue on these specific AMD GPU architectures. In the comments, @tjtanaa notes that this change alone may not fully resolve the crash seen after requests are sent, and more testing is needed.
  - Impact: Intended to improve vLLM stability on RDNA3/RDNA4 GPUs; the final effect is still to be verified.
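The downgrade logic can be sketched in a few lines. This is an illustration, not vLLM's actual code: the mode strings match the PR description, but the `gcn_arch` parameter and function name are assumptions.

```python
import re

# Illustrative sketch of the downgrade described in PR #35485; the gcn_arch
# parameter and function name are assumptions, not vLLM's exact API.
def resolve_cudagraph_mode(requested, gcn_arch):
    """Downgrade FULL_AND_PIECEWISE to PIECEWISE on RDNA3/RDNA4 (gfx11xx/gfx12xx)."""
    is_rdna3_or_4 = re.fullmatch(r"gfx1[12]\d{2}", gcn_arch) is not None
    if is_rdna3_or_4 and requested == "FULL_AND_PIECEWISE":
        # Full HIP graph capture is known to crash on these architectures.
        return "PIECEWISE"
    return requested

print(resolve_cudagraph_mode("FULL_AND_PIECEWISE", "gfx1100"))  # PIECEWISE
print(resolve_cudagraph_mode("FULL_AND_PIECEWISE", "gfx942"))   # FULL_AND_PIECEWISE
```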
- PRs #35553 & #35571: [ROCm][CI] Stability fixes for tool-use and vision score tests
  - Description: These two PRs address intermittent ROCm CI test failures by introducing platform-specific overrides: disabling `VLLM_ROCM_USE_SKINNY_GEMM`, disabling prefix caching, and setting `--max-num-seqs 1` to eliminate the effects of atomic reductions and nondeterministic batching variance.
  - Analysis: Reflects the ongoing effort to make test results deterministic on AMD platforms. These adjustments help isolate kernel issues and make the tests more reliable.
  - Impact: Improves the stability and trustworthiness of the AMD CI pipeline.
- Other related PRs:
  - PR #35491 (xuebwang-amd): Support AMD Quark-quantized Qwen3.5 models (MXFP4, PTPC FP8, etc.), focused on model loading and accuracy.
  - PR #35527: Add head size 80 support for `stablelm` models to the `ROCM_ATTN` backend, groundwork for supporting more models.
  - PR #35533: Fix a functionalization bug in the AITER RoPE op during compilation; merged.
💬 High-Traffic Discussion Analysis
- Issue #35477: How to benchmark an encoder-only BERT model with `vllm bench serve` (5 comments)
  - Core question: A user tried to benchmark the `BAAI/bge-large-zh-v1.5` encoder model, but the benchmarking tool hung.
  - Differing views:
    - The user (@duanshengliu) provided detailed reproduction steps and suspected a bug triggered by the `--served-model-name` flag.
    - A maintainer (@noooop) clarified that `bge-reranker-v2-m3` is a reranker model and does not support the `/v1/embeddings` endpoint.
  - Resolution: The user solved the problem themselves by omitting `--served-model-name` both when starting the server and when running the benchmark. This does hint at a possible logic flaw in how that flag handles certain models. The issue has been closed.
- Issue #35541: vLLM hangs indefinitely with a low `num_gpu_blocks_override` setting (linked PR #35542)
  - Core question: When the number of GPU KV-cache blocks is artificially capped and a request needs more blocks than the total, the scheduler enters an infinite retry loop, so the process hangs instead of raising an error.
  - Views and proposals:
    - The reporter (@kvcache670) performed a root-cause analysis: when allocation fails for a request in the waiting queue, the V1 scheduler simply breaks out of the loop without handling that request, leaving it permanently blocking the head of the queue.
    - Proposed fix (same user): PR #35542 compares the blocks a request needs against the blocks that can ever become available; if the request can never fit, it is proactively aborted and the scheduler moves on to subsequent requests.
  - Points of contention: None of note; this is a clear-cut bug fix. A related PR, #35570, proposes a similar fix.
  - Status: The issue is open; fix PR #35542 has been submitted and awaits review.
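The fix idea above can be sketched as a toy scheduling loop. This is a minimal illustration of the abort-instead-of-retry behavior, not the vLLM V1 scheduler's real data structures or API; all names are hypothetical.

```python
from collections import deque

# Toy sketch of the fix proposed in PR #35542 (names are illustrative, not
# vLLM's real scheduler API): a waiting request whose KV-block demand exceeds
# the pool's total capacity is aborted rather than blocking the queue head.
def schedule(waiting, free_blocks, total_blocks):
    scheduled, aborted = [], []
    while waiting:
        req_id, blocks_needed = waiting[0]
        if blocks_needed > total_blocks:
            # Can never be satisfied, even with an empty pool: abort it.
            aborted.append(req_id)
            waiting.popleft()
            continue
        if blocks_needed > free_blocks:
            # May fit later once blocks are freed: stop scheduling for now.
            break
        free_blocks -= blocks_needed
        scheduled.append(req_id)
        waiting.popleft()
    return scheduled, aborted

q = deque([("r1", 64), ("r2", 4)])
print(schedule(q, free_blocks=8, total_blocks=16))  # (['r2'], ['r1'])
```

Without the `blocks_needed > total_blocks` branch, `r1` would sit at the head of the queue forever, which is exactly the hang the issue describes.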
- Issue #35507: Assertion error when using OffloadingConnector
  - Core question: When using `OffloadingConnector` for KV-cache offloading, the user hits the assertion error `assert len(req.block_hashes) >= num_gpu_blocks`.
  - Discussion:
    - Maintainers (@lengrongfu, @haosdent) suggested upgrading to the latest version and pointed out that PR #27648 may have fixed it.
    - The user (@liunianxuxie) is constrained by their project to v0.11.0, and after applying the changes from PR #27648 ran into a new error.
  - Status: Still open. It highlights how hard it is to backport fixes when using complex features (KV offloading) on older versions; the user may need to wait for an official release or debug in depth themselves.
🔥 Hot Topics and Trends
- Inference stability and hangs: Multiple issues (#35504, #35502, #35541) report inference hangs under different configurations (TP>1, Qwen models, tight memory limits). This shows the scheduler and execution engine are under continued pressure in high-concurrency, complex-model, and resource-constrained scenarios.
- Multimodal and encoder model support: Active discussion around benchmarking encoder models correctly (Issue #35477), fixing multimodal model loading (Issue #35522, PR #35535), and handling audio-video interleaving logic (PR #35487) reflects vLLM's continued expansion beyond pure text generation.
- AMD platform optimization and testing: As covered in the previous section, this cycle saw a large amount of AMD-focused fixes and CI work, showing the community's push to close the functionality, performance, and stability gap with the NVIDIA platform.
- Performance optimization and new features:
  - Speculative decoding: MTP weight validation (PR #35548), a suffix-decoding crash (Issue #35521), and benchmark tooling fixes (PR #35471).
  - Frontend and API: WebSocket support was added to the Responses API (PR #35492), and monitoring metrics are planned for the Realtime endpoint (PR #35500).
  - Kernels and compilation: Continued torch.compile work (PRs #35472, #35475) and Model Runner V2 work (PRs #35564, #35520, #35120).
- Model support and loading issues: New models keep landing (PR #32407), while existing models are reported failing to load or run under specific configurations such as FP8 quantization or TP>1 (Issues #35519, #35496, #35528).
🛠️ Key Technical Changes
- PR #35560: [ROCm] Fix MXFP4 online quantization for MoE (TP=1): A fix critical to AMD-platform quantization support. It resolves three chained errors when loading an MoE model from a standard HF-format checkpoint with MXFP4 online quantization enabled, touching weight-loader logic, quantization-method annotation, and dtype handling, and directly affects the usability of popular MoE models such as Qwen3 and Mixtral on AMD GPUs.
- PR #35480: Use pinned memory for async H2D transfer in `do_mamba_copy_block`: This optimization targets models using linear attention (e.g. Mamba) and fixes synchronization waits in `aten::to` (up to ~20ms) caused by non-pinned-memory transfers. Switching to pinned memory reduces this latency to ~0.13ms, markedly improving preprocessing efficiency for models containing linear-attention layers.
- PR #34580: Enable the FlashInfer cuDNN backend for Qwen3 VL ViT attention: Introduces the `FLASHINFER` attention backend for the vision encoder of the Qwen3 vision-language models. By adapting the input format the cuDNN backend requires (e.g. `cu_seqlens`) and padding where necessary to avoid graph recompilation, it achieves a substantial encoder forward-pass speedup on GB200 (from ~7.2ms to ~3.7ms), a meaningful gain for end-to-end multimodal performance.
- PR #35487: Fix false-positive detection of interleaved audio/video in batched requests: A key multimodal logic fix. The original function judged audio/video interleaving by a simple overlap of token position ranges, which misfires when multiple independent requests are batched together and crashes the downstream embedding layer. The fix adds a "density check" ensuring that in a truly interleaved pattern there are no gaps between the audio and video tokens. This improves stability for models like Qwen2.5/3-Omni in batched scenarios.
📈 Development Activity Observations
- Contributor diversity: Beyond the sustained work of the core maintainer team (e.g. @haosdent, @ZJY0516, @DarkLight1337), community members (e.g. @kvcache670, @duanshengliu) actively report issues, join solution discussions, and even submit fix PRs directly.
- AMD team participation: Contributors with "-amd" in their usernames (e.g. xuebwang-amd) are actively submitting code focused on ROCm feature support, bug fixes, and CI stability, showing AMD's continued investment in the vLLM ecosystem.
- Review and merge throughput: 41 PRs were merged within the observation window, indicating an efficient review process. Several key fixes (e.g. #35487, #35533, #35480) were merged the same day, reflecting fast response to major issues and regressions.
- Issue resolution loop: Several issues (e.g. #35477, #35390) were quickly diagnosed and resolved through community interaction, or closed once fixed by related PRs, demonstrating healthy collaboration and issue tracking.
💡 Issues Worth Watching
- Qwen-family inference stability: Issues #35504 and #35502 report inference hangs and performance degradation for Qwen3.5/Coder-Next models under specific versions and configurations, possibly involving low-level changes such as GDN optimizations and AllReduce fusion. This deserves core-developer attention, since Qwen is one of the key model families vLLM supports.
- ARM64 (Grace) + Blackwell (GB10) compatibility: Issue #35519 reports that NVFP4-quantized Qwen3.5 models crash with an illegal instruction in a kernel on ARM64 Grace CPUs paired with Blackwell GPUs. This surfaces new compatibility challenges from new hardware combinations; the quantization kernels may need architecture-specific adaptation, or an explicit statement that the combination is unsupported.
- Complex multi-node/distributed failures: Issue #35496 reports that in a TP=8 deployment of a large model, JIT compilation of the FlashInfer GDN kernel timed out, causing Ray RPC timeouts and a service crash. This suggests first-run compilation time can become an availability bottleneck in large-scale distributed deployments, calling for precompilation or smarter wait/timeout strategies.
- Support gap for ModelOpt W4A8 MXFP4+FP8 checkpoints: Issue #35528 points out that no NVIDIA PTQ tool (ModelOpt) currently produces checkpoints directly usable by vLLM's native FlashInfer MXFP4 MoE kernels, leaving a functional gap. Closing it will require the vLLM team to coordinate with the NVIDIA ModelOpt team on defining or supporting a unified quantized-checkpoint format.
📋 Appendix: Detailed Data
New Issues
- #35569 [Bug]: [ROCm] ROCM_ATTN backend shows ~8.5% systematic score deviation on Qwen3-VL-Reranker pooling — bug,rocm — by AndreasKaratzas (created: 2026-02-28 10:31 (UTC+8))
- #35574 [Bug]: Qwen3.5 Can not close thinking by "enable_thinking": false — bug — by Charmnut (created: 2026-02-28 11:03 (UTC+8))
- #35496 [Bug]: RPC call to sample_tokens timed out. Qwen3.5-397B-A17B — bug — by DBDXSS (created: 2026-02-27 16:47 (UTC+8))
- #35573 [Doc]: Speculative decoding --speculative-config option lacks clear documentation of accepted keys and values — documentation — by ArpitSuthar (created: 2026-02-28 10:56 (UTC+8))
- #35507 [Bug]: AssertionError when using OffloadingConnector: assert len(req.block_hashes) >= num_gpu_blocks — bug — by liunianxuxie (created: 2026-02-27 20:20 (UTC+8))
- #35572 [Bug]: Running GLM-5 online with DP enabled hits RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered — bug — by artificialzjy (created: 2026-02-28 10:48 (UTC+8))
- #35566 CUDA illegal memory access in MoE layer with MiniMax-M2.5 NVFP4 on Blackwell (SM120) — no labels — by jhsmith409 (created: 2026-02-28 09:34 (UTC+8))
- #35547 [Bug]: Long weight loading results in server start failure — bug — by wzhao18 (created: 2026-02-28 04:55 (UTC+8))
- #35563 [Bug]: Subset of Lora unit tests fail on NVIDIA VLLM Stack — bug — by puririshi98 (created: 2026-02-28 08:43 (UTC+8))
- #35562 [Bug]: MXFP4A16 compressed-tensors quantization produces degenerate output (PPL 22,953 vs 8.74 BF16) — bug — by zeryx (created: 2026-02-28 07:53 (UTC+8))
- #35550 [RFC]: Batch-Aware Expert Pruning for MoE Decode (XShare) — no labels — by hai-meh-cs (created: 2026-02-28 05:37 (UTC+8))
- #35544 [CI Failure]: Import error in test_utils.py — ci-failure — by dippi9845 (created: 2026-02-28 04:37 (UTC+8))
- #35541 [Bug]: vLLM hangs indefinitely with low `num_gpu_blocks_override` — bug — by kvcache670 (created: 2026-02-28 03:45 (UTC+8))
- #35528 [Feature]: Support serving ModelOpt W4A8 MXFP4+FP8 checkpoints — feature request — by zeryx (created: 2026-02-28 00:47 (UTC+8))
- #35504 [Bug]: qwen3-coder-next inference randomly hangs, accuracy degradation in 0.16.0+ with TP > 1 and fuse_allreduce_rms=False (H100s on PCIe) — bug — by vitush93 (created: 2026-02-27 19:05 (UTC+8))
- #35522 [Bug]: Music Flamingo `ValueError: Following weights were not initialized from checkpoint: {'audio_tower.pos_emb.freqs'}` — bug — by denadai2 (created: 2026-02-27 23:35 (UTC+8))
- #35502 [Bug]: Server hangs indefinitely during inference with Qwen3.5-27B-FP8 (vLLM nightly) — bug — by gallery2016 (created: 2026-02-27 18:34 (UTC+8))
- #35521 [Bug]: Suffix decoding crashes with assert total_num_scheduled_tokens > 0 — bug — by rosstang (created: 2026-02-27 23:35 (UTC+8))
- #35513 [Bug]: OLMoE missing clip_qkv implementation in vLLM — bug — by Qi-Zhan (created: 2026-02-27 21:39 (UTC+8))
- #35519 [Bug]: Qwen3.5 NVFP4 models crash on ARM64 GB10 DGX Spark (CUDA illegal instruction during generation) — bug — by EmilHaase (created: 2026-02-27 23:11 (UTC+8))
- #35516 [RFC]: why block_hash maps not always a single KVCacheBlock. — RFC — by binbinzhm (created: 2026-02-27 22:16 (UTC+8))
- #35476 [CI] AttributeError: 'RMSNormGated' object has no attribute 'activation' — ci-failure — by LucasWilkinson (created: 2026-02-27 11:58 (UTC+8))
- #35501 [Bug]: fp8_blockscale_gemm JIT compilation fails on vLLM Docker image — missing cublasLt.h, nvrtc.h, and -lnvrtc — bug — by onurguner (created: 2026-02-27 18:16 (UTC+8))
- #35477 [Usage]: How to test the performance of a encoder-only BertModel using `vllm bench serve` — usage — by duanshengliu (created: 2026-02-27 12:36 (UTC+8))
- #35494 [Bug]: sparsemixer uses hard-coded jitter_eps instead of config — bug — by Qi-Zhan (created: 2026-02-27 16:00 (UTC+8))
- #35473 [Feature]: Request vllm support agent skills — feature request — by tonyaw (created: 2026-02-27 11:42 (UTC+8))
Closed Issues
- #25661 [Bug]: RedhatAI/Qwen3-30B-A3B-FP8-dynamic TP=1 H100 RuntimeError: CUDA driver error: an illegal memory access was encountered — bug,stale — by thameem-abbas (closed: 2026-02-28 10:16 (UTC+8))
- #26967 [Bug]: Problems with serving GPT-OSS-20B via vLLM on Open WebUI — bug,stale — by MikeNatC (closed: 2026-02-28 10:16 (UTC+8))
- #27651 [Bug]: Qwen3 VL describes 512x512 images wrong — bug,stale — by andrePankraz (closed: 2026-02-28 10:16 (UTC+8))
- #27746 [Bug]: `strict` value in function definitions causes request error when using Mistral tokenizer — bug,stale — by bbrowning (closed: 2026-02-28 10:16 (UTC+8))
- #27783 [Usage]: Model performance different from api — usage,stale — by fny21 (closed: 2026-02-28 10:15 (UTC+8))
- #27805 [New Model]: Add support for Nanonets-OCR2-3B — new-model,stale — by zarakokolagar (closed: 2026-02-28 10:15 (UTC+8))
- #27810 [Bug]: Potential Out-of-Bounds Access in gptq_marlin.cu and marlin_24_cuda_kernel.cu — bug,stale — by molly-ting (closed: 2026-02-28 10:15 (UTC+8))
- #27823 [Doc]: Multi-node distributed guide issues — documentation,stale — by schung-amd (closed: 2026-02-28 10:15 (UTC+8))
- #32714 [Bug]: ❗️ Sleep is broken since 0.14.0 ❗️ — bug — by rstanislav (closed: 2026-02-28 03:44 (UTC+8))
- #35439 [Feature Request]: W4A8 (compressed-tensors) Kernel support for Blackwell SM100+ — bug — by zeryx (closed: 2026-02-28 01:44 (UTC+8))
- #34700 [CI Failure]: V1 Engine E2E — ci-failure — by robertgshaw2-redhat (closed: 2026-02-28 00:05 (UTC+8))
- #35394 [Bug]: Qwen3-Omni Fails Under Mixed-Modality Requests at Concurrency 5 — bug — by Shirley125 (closed: 2026-02-27 22:48 (UTC+8))
- #35057 [Bug]: Qwen3.5 `scheduler_metadata must have shape (metadata_size)` with Decode Context Parallel (DCP) — bug — by ehfd (closed: 2026-02-27 22:27 (UTC+8))
- #34509 [New Model]: inclusionAI/Ring-2.5-1T — new-model — by youkaichao (closed: 2026-02-27 22:10 (UTC+8))
- #34893 [Bug]: Qwen3.5-397B-A17B-FP8 fails with TP=4 - fused linear layer sharding incompatibility — no labels — by UmutAlihan (closed: 2026-02-27 21:19 (UTC+8))
- #35476 [CI] AttributeError: 'RMSNormGated' object has no attribute 'activation' — ci-failure — by LucasWilkinson (closed: 2026-02-27 21:10 (UTC+8))
- #35390 [Bug]: Qwen3.5 (NVIDIA H200) Pointer argument (at 0) cannot be accessed from Triton — bug — by ehfd (closed: 2026-02-27 13:42 (UTC+8))
- #35477 [Usage]: How to test the performance of a encoder-only BertModel using `vllm bench serve` — usage — by duanshengliu (closed: 2026-02-27 17:38 (UTC+8))
- #35438 [Bug]: Invalid response_format leads in 500 errors — bug — by antonovsergey93 (closed: 2026-02-27 16:01 (UTC+8))
- #35268 [Bug]: Qwen3-VL-Rerank model's rerank API does not support query with mixed image and text inputs. — bug — by Yimi81 (closed: 2026-02-27 14:27 (UTC+8))
- #32895 [Bug]: [ROCm] [MI355X] new 0.14 upstream gptoss hard error TP=1? — bug,rocm — by functionstackx (closed: 2026-02-27 11:37 (UTC+8))
New PRs
- #35571 [ROCm][CI] Parametrize vision score tests across attention backends with per-backend tolerances — rocm — by AndreasKaratzas (created: 2026-02-28 10:36 (UTC+8))
- #35570 Fix (v1): prevent scheduler hang when request exceeds total KV capacity — v1 — by luca-akka (created: 2026-02-28 10:32 (UTC+8))
- #35564 [Model Runner V2] Move MM encoder to Model States [3/N] — v1,nvidia — by WoosukKwon (created: 2026-02-28 08:46 (UTC+8))
- #35568 [Bugfix] Fix Marlin W4A8-FP8 check for SM121+ Blackwell variants — bug — by blake-snc (created: 2026-02-28 10:29 (UTC+8))
- #35556 Yejin/bench ramp fix v2 — performance,frontend,needs-rebase,v1,cpu — by YJYJLee (created: 2026-02-28 06:23 (UTC+8))
- #35492 feat(responses): add WebSocket mode for Responses API — frontend — by rasonyang (created: 2026-02-27 15:42 (UTC+8))
- #35561 [Model Runner V2] Use pre-allocated tensors — needs-rebase,v1 — by njhill (created: 2026-02-28 07:20 (UTC+8))
- #35520 [Model Runner V2] support qwen35 / mamba hybrid model — needs-rebase,v1,qwen,nvidia — by izhuhaoran (created: 2026-02-27 23:14 (UTC+8))
- #35475 [torch.compile] Undo the fast_moe_cold_start hack in torch>=2.11 — documentation,ready,gpt-oss — by zou3519 (created: 2026-02-27 11:45 (UTC+8))
- #35567 [XPU][CI] add xpu image build job in vllm CI — ci/build — by jikunshang (created: 2026-02-28 10:03 (UTC+8))
- #35531 [Misc] Clean up ResponsesRequest model validators — frontend,ready — by umut-polat (created: 2026-02-28 01:35 (UTC+8))
- #35558 [WIP] CI test run with Model Runner V2 — ready-run-all-tests — by njhill (created: 2026-02-28 06:50 (UTC+8))
- #35565 fix(benchmarks): align ShareGPT token count with legacy script — performance — by luca-akka (created: 2026-02-28 09:06 (UTC+8))
- #35536 [Model Runner V2][Bugfix] Fix MRV2 LoRA warmup — bug,v1 — by jeejeelee (created: 2026-02-28 02:15 (UTC+8))
- #35503 [Bugfix] Propagate compilation_time from workers to main process for TP>1 — bug,ready,v1,cpu — by huydhn (created: 2026-02-27 18:54 (UTC+8))
- #35517 [misc] cleanup one level of error stack when nixl fails to initialize — ready,kv-connector — by youkaichao (created: 2026-02-27 22:19 (UTC+8))
- #35557 [Bugfix] Fix Anthropic API base64 image handling in Messages endpoint — bug,frontend — by voipmonitor (created: 2026-02-28 06:24 (UTC+8))
- #35559 [MoE Refactor] Turn ChunkingMoERunner into a wrapper so it can be used with any MoERunner subclass. — needs-rebase,nvidia — by bnellnm (created: 2026-02-28 06:53 (UTC+8))
- #35560 [WIP][Bugfix][ROCm] Fix MXFP4 online quantization for MoE models at tp=1 — bug,rocm — by SandishKumarHN (created: 2026-02-28 06:58 (UTC+8))
- #35510 [Bugfix] Move chat completion response_format validation to Pydantic model_validator — bug,frontend,ready — by umut-polat (created: 2026-02-27 21:16 (UTC+8))
- #35533 [ROCm]: fix aiter rope functionalization — rocm,ready — by Rohan138 (created: 2026-02-28 01:49 (UTC+8))
- #35543 [CI] Improve (NCCL and torch) symmetric memory allreduce test coverage — ci/build — by jasonlizhengjian (created: 2026-02-28 04:21 (UTC+8))
- #35548 [MTP] Validate that MTP weights are actually loaded — ready,deepseek — by MatthewBonanni (created: 2026-02-28 05:03 (UTC+8))
- #35555 Fix bugs in fp8 fused_moe_lora kernels and add unit test for it — no labels — by yugong333 (created: 2026-02-28 06:16 (UTC+8))
- #35554 [Benchmark] Add batch ramp-up detection for accurate decode profiling — no labels — by YJYJLee (created: 2026-02-28 06:11 (UTC+8))
- #35472 [torch.compile] Stop lazily compiling — no labels — by zou3519 (created: 2026-02-27 11:35 (UTC+8))
- #35527 [ROCm] Add `stablelm` Head Size 80 To Supported Head Sizes For ROCM_ATTN — documentation,rocm,ready,v1 — by micah-wil (created: 2026-02-28 00:39 (UTC+8))
- #35553 [ROCm][CI] Fix tool use test stability - disable skinny GEMM, prefix caching, eliminate batch variance — rocm — by AndreasKaratzas (created: 2026-02-28 05:58 (UTC+8))
- #35552 clean unused cudagraph_batch_sizes — v1,nvidia — by BoyuanFeng (created: 2026-02-28 05:46 (UTC+8))
- #35551 feat: batch-aware expert pruning (XShare) for MoE models — needs-rebase,v1,gpt-oss — by hai-meh-cs (created: 2026-02-28 05:39 (UTC+8))
- #35549 [MoE Refactor] Refactor ZeroExpertFusedMoE into new framework — needs-rebase,nvidia — by bnellnm (created: 2026-02-28 05:35 (UTC+8))
- #35542 [BugFix] Abort waiting requests that can never fit in KV cache block pool — bug,v1 — by kvcache670 (created: 2026-02-28 04:07 (UTC+8))
- #35546 Fixed import error of a test — no labels — by dippi9845 (created: 2026-02-28 04:55 (UTC+8))
- #35545 Add helion package version 0.3.0 to requirements — ci/build — by dippi9845 (created: 2026-02-28 04:42 (UTC+8))
- #35540 [Bugfix] Fix empty channel/recipient in harmony for /v1/responses — bug,frontend,gpt-oss — by kg6-sleipnir (created: 2026-02-28 03:37 (UTC+8))
- #35539 Support Audio Extraction from MP4 Video for Nemotron Nano VL — frontend,ci/build,multi-modality — by askliar (created: 2026-02-28 03:15 (UTC+8))
- #35538 docs: Add kernel/operator fusions reference page — documentation — by copilot-swe-agent (created: 2026-02-28 03:04 (UTC+8))
- #35537 [Bugfix] Fixes for SLA finder — bug,documentation,performance — by DarkLight1337 (created: 2026-02-28 02:48 (UTC+8))
- #35482 Fix AttributeError in RMSNormGated by adding activation attribute and… — needs-rebase,qwen — by xueliangyang-oeuler (created: 2026-02-27 14:44 (UTC+8))
- #35535 [Bugfix] Fix loading Music Flamingo — bug — by NickCao (created: 2026-02-28 02:14 (UTC+8))
- #35529 Extract guidellm results — ci/build,gpt-oss — by debroy-rh (created: 2026-02-28 01:07 (UTC+8))
- #35505 [MyPy][BugFix] Check profiler is assigned before calling start() on it — bug,ready,v1 — by hickeyma (created: 2026-02-27 19:23 (UTC+8))
- #35534 Revert "Add GlmOcrConfig for GLM-OCR model type recognition" — no labels — by hujia177 (created: 2026-02-28 02:07 (UTC+8))
- #35515 Make DeepSeek V2/V3/R1 RMSNorm consistent with DeepSeek impl — deepseek — by hmellor (created: 2026-02-27 22:03 (UTC+8))
- #35525 [Doc] Fix link to Llama chat template for usability — documentation,ready,llama — by hickeyma (created: 2026-02-28 00:20 (UTC+8))
- #35480 [perf] Use pinned memory for async H2D transfer in do_mamba_copy_block — ready,v1 — by hl475 (created: 2026-02-27 13:16 (UTC+8))
- #35532 [Misc] Use VLLMValidationError consistently in ResponsesRequest — frontend — by umut-polat (created: 2026-02-28 01:37 (UTC+8))
- #35530 [Misc] Fix stale doc URL and docstring module path — no labels — by umut-polat (created: 2026-02-28 01:33 (UTC+8))
- #35524 [Misc] Fill in some v1 CODEOWNERS gaps — ci/build — by njhill (created: 2026-02-28 00:07 (UTC+8))
- #35526 fix(mig): comprehensive MIG UUID handling for NVML device queries — nvidia — by GaneshSubhashPatil (created: 2026-02-28 00:34 (UTC+8))
- #35508 [Responses] Normalize OpenResponses input items to match openai-python SDK requirements — frontend,ready — by anencore94 (created: 2026-02-27 20:56 (UTC+8))
- #35485 [Bugfix][ROCm] Disable full CUDA graph capture on RDNA3/RDNA4 (gfx1x) — bug,rocm,nvidia — by haosdent (created: 2026-02-27 15:11 (UTC+8))
- #35518 [CI] Fix mypy for vllm/device allocator — v1 — by hickeyma (created: 2026-02-27 22:29 (UTC+8))
- #35523 [Bugfix] Add missing clip_qkv implementation to OLMoE attention — bug — by haosdent (created: 2026-02-27 23:38 (UTC+8))
- #35509 [Bugfix] Fix GLM-OCR text_config model_type handling — bug,needs-rebase — by umut-polat (created: 2026-02-27 21:15 (UTC+8))
- #35487 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests — bug,ready,multi-modality,qwen — by linyueqian (created: 2026-02-27 15:18 (UTC+8))
- #35493 Grpc renderer — frontend,ci/build — by hyeongyun0916 (created: 2026-02-27 15:47 (UTC+8))
- #35495 [Misc] Bound NIXL upper bound version — ready,ci/build,kv-connector — by NickLucche (created: 2026-02-27 16:26 (UTC+8))
- #35514 [Bugfix] Replace assert with ValueError for response_format validatio… — bug,frontend,ready — by antonovsergey93 (created: 2026-02-27 21:57 (UTC+8))
- #35512 Revert "Add GlmOcrConfig for GLM-OCR model type recognition" — no labels — by hmellor (created: 2026-02-27 21:36 (UTC+8))
- #35511 Fix TypeError in GLM OCR config — no labels — by hmellor (created: 2026-02-27 21:22 (UTC+8))
- #35506 [Core] Proactively free KV cache blocks when aborting finished requests — v1,kv-connector — by jianzs (created: 2026-02-27 19:45 (UTC+8))
- #35468 [Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations — bug,nvidia — by haosdent (created: 2026-02-27 11:07 (UTC+8))
- #35489 [Bugfix] Fix Fabric/RDMA attribute queries poisoning global error_code in cumem allocator — bug — by haosdent (created: 2026-02-27 15:26 (UTC+8))
- #35500 [Feature] Add basic metrics for /realtime endpoint — frontend — by pougetat (created: 2026-02-27 17:54 (UTC+8))
- #35498 Fix indexError in moe wna16 quantization with enable-expert-parallel — no labels — by ivyilike (created: 2026-02-27 17:35 (UTC+8))
- #35499 fix(phimoe): use config router_jitter_noise instead of hardcoded jitter_eps — no labels — by stakeswky (created: 2026-02-27 17:50 (UTC+8))
- #35497 [Bugfix] Fix NVFP4 MoE scale misalignment for non-128-aligned hidden dims — bug,nvidia — by eous (created: 2026-02-27 17:25 (UTC+8))
- #35484 [WIP][Bugfix] Fix multi-node PP crash with logprobs due to pinned memory serialization — bug,v1 — by haosdent (created: 2026-02-27 15:10 (UTC+8))
- #35490 [WIP][Bugfix] Fix Qwen3-VL-Reranker wrong relevance scores due to incorrect pooling and attention type — bug,qwen — by haosdent (created: 2026-02-27 15:27 (UTC+8))
- #35474 [Perf][DeepSeek-V3] Dynamic MLA/MHA Routing for Sub-1024 Token Prefill (~3x TTFT Speedup) — deepseek — by Joy-In-Code (created: 2026-02-27 11:44 (UTC+8))
- #35481 [Feature][CI]: compare `func` & `no_func` outputs in test_functionalization.py — no labels — by 11happy (created: 2026-02-27 14:40 (UTC+8))
- #35491 [ROCm][Quantization] support amd-quark quantized Qwen3.5 model — rocm,qwen — by xuebwang-amd (created: 2026-02-27 15:27 (UTC+8))
- #35488 [WIP][Bugfix] Fix CUDA OOM in sparse_attn_indexer prefill with high concurrency — bug,v1,nvidia — by haosdent (created: 2026-02-27 15:19 (UTC+8))
- #35486 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests — bug,needs-rebase,multi-modality,qwen — by linyueqian (created: 2026-02-27 15:14 (UTC+8))
- #35483 Add AMD AITER MLA fusion optimization for DeepSeek models — rocm,deepseek — by khairulkabir1661 (created: 2026-02-27 14:47 (UTC+8))
- #35479 [BugFix/CI]: Fix AttributeError: 'RMSNormGated' object has no attribute 'activation' — bug,ready — by LucasWilkinson (created: 2026-02-27 13:07 (UTC+8))
- #35478 [BugFix] Fix engine hanging after KV cache initialization failure — bug,v1 — by 842974287 (created: 2026-02-27 12:38 (UTC+8))
- #35471 fix(benchmarks): correct peak output token throughput calculation for speculative decoding — performance — by hukongyi (created: 2026-02-27 11:27 (UTC+8))
- #35470 Fix MIG UUID handling in interface.py and cuda.py — nvidia — by GaneshSubhashPatil (created: 2026-02-27 11:15 (UTC+8))
- #35469 feat: add OpenTelemetry Metrics support via OTLP protocol — v1 — by RichardoMrMu (created: 2026-02-27 11:12 (UTC+8))
Merged PRs
- #35564 [Model Runner V2] Move MM encoder to Model States [3/N] — v1,nvidia — by WoosukKwon (merged: 2026-02-28 10:32 (UTC+8))
- #35120 [Model Runner V2] Support pooling models — v1,nvidia — by WoosukKwon (merged: 2026-02-28 10:03 (UTC+8))
- #35531 [Misc] Clean up ResponsesRequest model validators — frontend,ready — by umut-polat (merged: 2026-02-28 09:19 (UTC+8))
- #35517 [misc] cleanup one level of error stack when nixl fails to initialize — ready,kv-connector — by youkaichao (merged: 2026-02-28 08:42 (UTC+8))
- #35105 [Refactor][Kernel] Add global helper to deduplicate vectorized memory ops — ready,nvidia — by LopezCastroRoberto (merged: 2026-02-28 08:28 (UTC+8))
- #35533 [ROCm]: fix aiter rope functionalization — rocm,ready — by Rohan138 (merged: 2026-02-28 06:42 (UTC+8))
- #35334 [ROCm] Enabling encoder and encoder-decoder on ROCm and AITER unified backends — documentation,rocm,ready,v1 — by gshtras (merged: 2026-02-28 05:32 (UTC+8))
- #34171 [Feat][RL][2/2] Native Weight Syncing API: IPC — documentation,ready,ci/build — by hao-aaron (merged: 2026-02-28 04:45 (UTC+8))
- #35404 [Bugfix][Model] Fix gpt-oss batch invariance — bug,ready,gpt-oss — by jzakrzew (merged: 2026-02-28 04:41 (UTC+8))
- #34102 [DP] Only use DP padding when cudagraphs are actually used — speculative-decoding,ready,ci/build,v1,nvidia — by LucasWilkinson (merged: 2026-02-28 04:14 (UTC+8))
- #35420 [Bugfix] Add monkeypatch to prevent race condition from writing — bug,ready — by Lucaskabela (merged: 2026-02-28 03:51 (UTC+8))
- #34888 [Transformers backend] Ignore MTP weights when num_nextn_predict_layers=0 — ready — by SteadfastAsArt (merged: 2026-02-28 03:39 (UTC+8))
- #35308 [compile] Fix caching error over pytree slice node. — ready — by zhxchen17 (merged: 2026-02-28 03:34 (UTC+8))
- #35172 [Model Runner V2] Warmup kernels — ready,v1 — by njhill (merged: 2026-02-28 02:43 (UTC+8))
- #35097 [BugFix] Fix 3D rope in transformers backend — bug,ready — by zucchini-nlp (merged: 2026-02-28 02:34 (UTC+8))
- #35100 Support parakeet as audio encoder for nemotron-nano-vl — new-model,ready — by netanel-haber (merged: 2026-02-28 02:07 (UTC+8))
- #35525 [Doc] Fix link to Llama chat template for usability — documentation,ready,llama — by hickeyma (merged: 2026-02-28 01:51 (UTC+8))
- #35480 [perf] Use pinned memory for async H2D transfer in do_mamba_copy_block — ready,v1 — by hl475 (merged: 2026-02-28 01:50 (UTC+8))
- #35524 [Misc] Fill in some v1 CODEOWNERS gaps — ci/build — by njhill (merged: 2026-02-28 01:34 (UTC+8))
- #32407 [Model] Add huggingface skt/A.X-K1 model — documentation,new-model,ready — by fort726 (merged: 2026-02-28 01:26 (UTC+8))
- #34390 [Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching — rocm,ready — by gmagogsfm (merged: 2026-02-28 01:21 (UTC+8))
- #35312 [Core] Fix `gpu_worker.py` pre-commit errors — ready,v1 — by njhill (merged: 2026-02-27 23:54 (UTC+8))
- #35317 Add @BoyuanFeng to CODEOWNERS — documentation,ready,ci/build — by BoyuanFeng (merged: 2026-02-27 23:53 (UTC+8))
- #33646 [Bugfix] Handle case when kimi ends reasoning with a tool call — bug,ready — by koush (merged: 2026-02-27 22:58 (UTC+8))
- #35487 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests — bug,ready,multi-modality,qwen — by linyueqian (merged: 2026-02-27 22:48 (UTC+8))
- #35410 [compile] Cleanup: Remove unnecessary +rms_norm forcing for sequence parallelism — ready — by jasonlizhengjian (merged: 2026-02-27 21:36 (UTC+8))
- #35082 [Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp — bug,ready,v1 — by haosdent (merged: 2026-02-27 22:27 (UTC+8))
- #35512 Revert "Add GlmOcrConfig for GLM-OCR model type recognition" — no labels — by hmellor (merged: 2026-02-27 22:13 (UTC+8))
- #35423 [Bugfix] Add missing activation attr to RMSNormGated — bug,ready — by Tib-Gridello (merged: 2026-02-27 20:53 (UTC+8))
- #34580 Flashinfer cuDNN backend for Qwen3 VL ViT attention — ready,ci/build,v1,qwen,nvidia — by maxyanghu (merged: 2026-02-27 20:20 (UTC+8))
- #35413 [Bugfix] Fix Qwen3NextForCausalLM packed_modules_mapping — bug,ready,qwen — by jeejeelee (merged: 2026-02-27 11:46 (UTC+8))
- #35424 [Bugfix] disable allreduce_rms_fusion by default when pp size > 1 — bug,ready — by ZJY0516 (merged: 2026-02-27 12:18 (UTC+8))
- #35456 [Bugfix] Replace assert with ValueError for response_format validation in completions endpoint — bug,frontend,ready — by umut-polat (merged: 2026-02-27 16:01 (UTC+8))
- #33088 [Bugfix] Use 'sum' reduction instead of 'avg' in Async TP reduce-scatter — bug,ready — by wangxingran222 (merged: 2026-02-27 15:06 (UTC+8))
- #35457 [Model Performance] Add Qwen3MoE tuned MoE configs for H200 — ready,qwen — by chengyinie (merged: 2026-02-27 13:51 (UTC+8))
- #35369 [Bug] correct out dtype of rms_norm_gated native path — bug,ready — by zufangzhu (merged: 2026-02-27 13:19 (UTC+8))
- #35434 [BugFix] Repo utils debug print patch — bug,ready — by pi314ever (merged: 2026-02-27 11:50 (UTC+8))
- #35314 [Bug] Fix outdated links in source code — bug,documentation,ready,ci/build — by yewentao256 (merged: 2026-02-27 11:50 (UTC+8))
- #33197 use 'max_active_experts' for moe lora input size — ready — by gnovack (merged: 2026-02-27 11:50 (UTC+8))
- #35400 [Misc] Move `GPUModelRunner.prepare_kernel_block_sizes` to utils — ready,v1 — by NickLucche (merged: 2026-02-27 11:42 (UTC+8))
- #33012 [Core]Extract is_last_rank in Ray for tpu to override — ready,v1 — by Chenyaaang (merged: 2026-02-27 11:18 (UTC+8))
PRs Closed Without Merging
- #25932 [Misc] Add benchmark for sampler. — performance,stale — by FangJiangyi (closed: 2026-02-28 10:16 (UTC+8))
- #35554 [Benchmark] Add batch ramp-up detection for accurate decode profiling — no labels — by YJYJLee (closed: 2026-02-28 06:22 (UTC+8))
- #35545 Add helion package version 0.3.0 to requirements — ci/build — by dippi9845 (closed: 2026-02-28 04:50 (UTC+8))
- #35534 Revert "Add GlmOcrConfig for GLM-OCR model type recognition" — no labels — by hujia177 (closed: 2026-02-28 02:08 (UTC+8))
- #35532 [Misc] Use VLLMValidationError consistently in ResponsesRequest — frontend — by umut-polat (closed: 2026-02-28 01:47 (UTC+8))
- #22282 Add option to disable weakref conversion for last piecewise cudagraph in a module — torch.compile,needs-rebase,unstale,v1,nvidia — by sarckk (closed: 2026-02-28 01:42 (UTC+8))
- #32135 [Bugfix] Don't Index MM Placeholders Out When LoRA is Enabled — bug,v1,qwen — by alex-jw-brooks (closed: 2026-02-27 23:19 (UTC+8))
- #35509 [Bugfix] Fix GLM-OCR text_config model_type handling — bug,needs-rebase — by umut-polat (closed: 2026-02-27 23:09 (UTC+8))
- #35514 [Bugfix] Replace assert with ValueError for response_format validatio… — bug,frontend,ready — by antonovsergey93 (closed: 2026-02-27 22:14 (UTC+8))
- #35443 [BUGFIX] Replace assert with ValueError for response_format validation in chat completions endpoint — bug,documentation,frontend,ci/build,v1,qwen,deepseek,nvidia — by antonovsergey93 (closed: 2026-02-27 21:59 (UTC+8))
- #35511 Fix TypeError in GLM OCR config — no labels — by hmellor (closed: 2026-02-27 21:35 (UTC+8))
- #35351 [XPU] special handle for pooler models w8a16 gemm — no labels — by yma11 (closed: 2026-02-27 20:03 (UTC+8))
- #34881 [Bugfix] limit cudagraph capture sizes by num_blocks for GDN models — bug,v1,qwen,nvidia — by ZJY0516 (closed: 2026-02-27 18:40 (UTC+8))
- #35497 [Bugfix] Fix NVFP4 MoE scale misalignment for non-128-aligned hidden dims — bug,nvidia — by eous (closed: 2026-02-27 17:48 (UTC+8))
- #35490 [WIP][Bugfix] Fix Qwen3-VL-Reranker wrong relevance scores due to incorrect pooling and attention type — bug,qwen — by haosdent (closed: 2026-02-27 17:14 (UTC+8))
- #35486 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests — bug,needs-rebase,multi-modality,qwen — by linyueqian (closed: 2026-02-27 15:17 (UTC+8))
- #35479 [BugFix/CI]: Fix AttributeError: 'RMSNormGated' object has no attribute 'activation' — bug,ready — by LucasWilkinson (closed: 2026-02-27 13:11 (UTC+8))
- #32157 feat: add requires_token_ids interface for sampling params — v1 — by llsj14 (closed: 2026-02-27 12:23 (UTC+8))
- #34835 [Core] Extract is_last_rank in ray for tpu to override — v1 — by pv97 (closed: 2026-02-27 11:47 (UTC+8))
- #29191 [Models] Lfm2-VL Architecture — documentation,new-model,stale — by paulpak58 (closed: 2026-02-27 11:38 (UTC+8))
- #34511 [Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations — bug,needs-rebase — by haosdent (closed: 2026-02-27 11:08 (UTC+8))