vLLM 开发动态报告 - 2026-03-16

时间窗口: 2026-03-16 11:28 (UTC+8) ~ 2026-03-17 11:28 (UTC+8) 数据统计: 新 Issue 21 | 关闭 Issue 20 | 新 PR 93 | 合并 PR 63 | 关闭未合并 PR 28

📊 每日开发状态摘要

本周期（2026-03-16）vLLM 项目开发活动高度活跃，新增 PR 93 个，合并 63 个，显示出强劲的开发与集成节奏。核心关注点集中在模型支持的精进（特别是 Qwen 系列）、各类量化方法（如 NVFP4、MXFP8）的扩展与优化，以及内存管理机制的创新与修复（包括 KV 缓存卸载和 MoE 专家缓存）。同时，AMD ROCm 平台的兼容性与性能调优工作持续推进。

🎯 AMD/ROCm 生态相关动态

本周期 AMD/ROCm 生态相关工作活跃，涉及问题修复、功能增强和兼容性改进。

1. 关键问题 (Issues):

严重稳定性问题 (#37151): 用户报告在 AMD Ryzen AI MAX+ 395 (gfx1151) 上加载 Qwen3-VL-32B-AWQ 模型时，在 libhsa-runtime64.so 中发生段错误，导致引擎核心初始化失败。该问题在官方 ROCm nightly Docker 镜像中亦可复现，表明是 ROCm 7.0 栈与特定 APU 硬件或模型配置的兼容性问题，亟待解决。

2. 功能与修复 (PRs):

hipBLASLt 调优选项 (#37146): AMD 员工 hanlin12-AMD 提出 PR，旨在为 vllm serve 命令添加 --enable-hipblaslt-online-tuning 选项以启用 AITER 库的在线调优。此提议引发了关于设计哲学的讨论，核心维护者认为 AMD 特定配置应统一通过环境变量（如 VLLM_ROCM_*）管理，而非新增 CLI 参数，体现了项目对接口简洁性和一致性的坚持。
Hybrid 模型 KV 缓存修复 (#37228): 修复了 ROCm 平台上混合注意力/SSM模型（hybrid model）中 KV 缓存因内存布局（stride）错误导致 Triton 内核读取错误数据的问题。
WSL2 环境兼容性 (#37189): 为 WSL2 环境下无法使用 amdsmi 的问题添加了优雅的回退机制，改用 torch.cuda 接口获取设备信息，提升了 ROCm 在 WSL2 上的可用性。
PyTorch 版本兼容性 (#37219): 修复了因使用新版本 PyTorch (2.10+) API torch.compiler.skip_all_guards_unsafe 而导致的旧版本 PyTorch 兼容性问题。
测试完善 (#37213): 为 MORI/O 连接器测试添加缺失的 transfer_id 字段，确保 ROCm 平台相关测试的通过性。
已合并的重要修复 (#36845): 修复了 ROCm 平台上 NixlConnector 因缺少 KV 块复制方法而崩溃的问题，并改进了 spec_decode 测试脚本，使其能根据平台自动选择注意力后端（ROCm 用 TRITON_ATTN，NVIDIA 用 FLASH_ATTN）。

💬 高热度讨论分析

Issue #37167: “[Bug]: responses API, combining of message and tool call”
- 核心议题: responses API 在渲染非 Harmony 模型（如 Qwen3.5）的消息时，未能正确将推理/内容消息与后续的工具调用合并到同一个助理消息中，导致智能体循环提前终止。
- 各方观点:
  - 报告者 (bfroemel): 提供了详细的 bug 描述和修复补丁，指出现有合并逻辑限制过严。
  - 分析者 (weiguangli-io): 进行了深入的技术根因分析，确认了代码中两个过于严格的守卫条件，并给出了具体的修复建议。
  - 原代码作者 (qandrew): 确认了 bug 的存在，并支持修复，同时建议实现方式应保持灵活性以兼容未来可能采用不同消息格式的模型。
- 争议焦点: 无实质性争议，讨论聚焦于如何设计一个既解决当前问题，又保持对未来模型格式扩展性的方案。
- 当前状态: 讨论达成共识，等待具体实现和 review。
PR #37146: “Add the option to turn on hipBLASLt online tuning”
- 核心议题: 是否应该为 AMD 特定的 hipBLASLt 在线调优功能添加一个 CLI 参数。
- 各方观点:
  - 贡献者 (hanlin12-AMD): 倾向于提供 CLI 选项，认为对用户更友好。
  - 维护者 (hmellor): 反对新增 CLI 参数，主张遵循现有模式，所有 AMD 特定配置均应通过环境变量设置，以维护 CLI 的简洁性和跨平台一致性。
- 争议焦点: 功能配置的管理哲学——便捷性 vs. 接口一致性与简洁性。
- 当前状态: PR 处于开放状态，需要贡献者与维护者进一步协商设计。

🔥 热门话题与趋势分析

Qwen 模型生态深度集成: 大量 Issue 和 PR 围绕 Qwen 系列模型（3， 3.5， VL, Next, Coder）展开，涉及工具调用解析、GDN 注意力优化、模型加载、speculative decoding 兼容性等，表明 Qwen 已成为 vLLM 社区使用和优化的重点模型家族。
内存与缓存管理创新:
- Weight Offloading: Issue #37176 及其关联 PR 揭示了 weight offload 与 FP8 KV 缓存量化参数管理的冲突。
- KV Cache Offload: PR #37160 提出了新的 SimpleCPUOffloadConnector 设计，旨在更简洁、通用地支持 CPU KV 缓存卸载。
- MoE 专家缓存: PR #37190 实现了动态的 MoE 专家级 LRU 缓存，允许在固定 GPU 内存中缓存热点专家，是内存受限场景下的重要优化。
工具调用与推理解析器的持续打磨: 多个 PR（#37186， #37187， #36827）专注于修复不同模型系列（QwenCoder, Hermes, Granite）工具调用解析器在流式输出、参数处理、并发安全等方面的 bug，反映了该功能的高复杂度和社区对其稳定性的高度重视。
硬件与平台兼容性拓宽:
- AMD ROCm: 如前述，持续修复和优化。
- NVIDIA Blackwell: Issue #37242 分享了在 RTX 5090 (SM 120) 和 WSL2 2.7.0 上成功启用 CUDA Graphs 的经验。
- Intel XPU: PR #37149 为 XPU 平台添加了透明睡眠模式支持。

🛠️ 重点技术变更

PR #37231: “[Bugfix] Expand quantization method support in perf metrics”: 修复了性能指标（如 MFU）计算模块仅支持少数几种量化方法的问题。新增了对包括 Quark、GPTQ、AWQ、BitsAndBytes 在内的 22 种量化方法的权重字节大小映射，确保量化模型的性能报告准确无误。
PR #37160: “[Feat][v1] Simple yet General CPU KV Cache Offloading”: 提出了一种新的 CPU KV 缓存卸载连接器设计。它复用现有 BlockPool 和 KVCacheCoordinator 基础设施，从而天然支持 HMA、前缀缓存和 LRU 驱逐，旨在以更低的复杂性和开销提供更通用的卸载能力。
PR #37190: “[Feature][Offload] Add dynamic MoE expert LRU cache”: 实现了创新的 MoE 专家权重动态缓存机制。所有专家权重驻留 CPU，仅最近使用的专家被缓存到固定大小的 GPU 缓冲区。当批次所需专家数超出缓存容量时，自动回退到 CPU 计算。这是在有限 GPU 内存下运行大型 MoE 模型的有效实践。
PR #37196: “[Perf] consolidating, vectorizing and cleaning up CUDA/HIP implementations of custom ops.”: 对自定义操作（如 LayerNorm）的 CUDA/HIP 内核实现进行了重要的代码重构和优化，消除了重复代码，并扩展了向量化路径，提升了内核执行效率和代码可维护性。

📈 开发活跃度观察

AMD 贡献者活跃：以 -amd 后缀用户及熟悉 ROCm 的贡献者（如 AndreasKaratzas）为主，在问题诊断、平台适配、性能调优等方面提交了多个 PR 和修复，显示出 AMD 对 vLLM 生态投入的持续增加。
高效合并流程：单日合并 63 个 PR，表明代码审查和集成流程高效。许多 PR 在解决冲突或通过预提交检查后迅速被标记为 ready 并合并。
深度技术讨论：在关键 Bug 和功能设计（如 #37167， #37146）的讨论中，参与者提供了详尽的技术分析和设计权衡，展现了社区较高的技术素养和协作氛围。

💡 值得关注的问题

AMD Ryzen AI 段错误 (#37151): 这是一个严重的稳定性问题，影响特定 AMD APU 运行视觉语言模型。需要社区和 AMD 共同排查，是 ROCm 栈在消费级 AI PC 硬件上可靠性的一个考验。
Weight Offloading 与量化参数的冲突 (#37176): 暴露了模型加载后处理（如删除/重命名参数）与离线加载器初始化时机不匹配的深层问题。其修复方案（PR #37178， #37194）需要谨慎评估，以确保不影响其他场景。
长期运行代理的 KV 缓存调度 (RFC #37168): 提出了一种“主动协调与双区调度”机制，以解决智能体场景中上下文非单纯追加导致的 KV 缓存失效问题。这是一个前瞻性的架构讨论，可能影响未来 vLLM 对长序列、复杂交互应用的支持方式。

📋 附录：详细数据列表

新增 Issue

#37175 [Usage]:wen3.5-35B-A3B (FP8) with vLLM 0.17.1 , the first request takes significantly longer than subsequent requests — usage — by robinyqd (创建于: 2026-03-16 18:18 (UTC+8))
#37250 [Bug]: QWEN 3.5-397B-A17B report “RPC call to sample_tokens timed out” — bug — by mahaocong90 (创建于: 2026-03-17 10:53 (UTC+8))
#37223 [Feature]: Add LoRA support for Qwen3ASRForConditionalGeneration — good first issue,feature request — by henriklied (创建于: 2026-03-17 04:14 (UTC+8))
#37242 [Community] RTX 5090 (Blackwell sm_120) + WSL2 2.7.0: CUDA graphs work — benchmarks + full config — 无标签 — by Kyzcreig (创建于: 2026-03-17 09:02 (UTC+8))
#37168 [RFC]: Active Coordination and Two-Zone Scheduling Mechanism for KV Cache in Long-Running Agents — RFC — by xinrunxue (创建于: 2026-03-16 17:36 (UTC+8))
#37180 [Usage]: Attention backend — usage — by BartekKruczek (创建于: 2026-03-16 18:54 (UTC+8))
#37235 [Bug]: Non-monotonic latency at ctx=8192: concurrency 1 slower than concurrency 2 (local vLLM) — bug — by NoahLundSyrdal (创建于: 2026-03-17 06:52 (UTC+8))
#37141 [Feature]: Upstream DGX spark improvements from Avarok-Cybersecurity/dgx-vllm — help wanted,feature request,nvidia,quantization — by ProExpertProg (创建于: 2026-03-16 13:05 (UTC+8))
#37184 [Bug]: Kimi-K2.5 Tool Parser Critical Issues - 87% Accuracy, 8KB Limit, Token Leakage — bug — by Dixon3 (创建于: 2026-03-16 19:22 (UTC+8))
#37167 [Bug]: responses API, combining of message and tool call — bug — by bfroemel (创建于: 2026-03-16 17:32 (UTC+8))
#37222 [Bug]: _find_range_for_shape in hotpath — bug,torch.compile — by zou3519 (创建于: 2026-03-17 03:58 (UTC+8))
#37204 [RFC]: Create separation between independent HTTP API definitions — RFC — by russellb (创建于: 2026-03-16 23:59 (UTC+8))
#37176 [Bug]: weight offloading fails: ‘Attention’ object has no attribute ‘k_scale’ — bug — by sfbemerk (创建于: 2026-03-16 18:26 (UTC+8))
#37191 [Bug]: Benchmark tuning MoE kernel config fails ; NVIDIA GH200, Qwen3-Coder-Next, vLLM 0.17.1 — bug — by tomschelsen (创建于: 2026-03-16 21:25 (UTC+8))
#37150 [Bug]: the acceptance rate of ngram+async-scheduling is only 1.22% — bug — by HF-001 (创建于: 2026-03-16 15:28 (UTC+8))
#37185 [Bug]: The 2-bit model repeatedly outputs !!!!, while Transformers works correctly. — bug — by wenhuach21 (创建于: 2026-03-16 19:29 (UTC+8))
#37140 [Usage]: CUDA error: the provided PTX was compiled with an unsupported toolchain. — usage — by 870223666 (创建于: 2026-03-16 12:09 (UTC+8))
#37173 [Bug]: AssertionError in cutlass_scaled_mm during profile_run for FP8 pooling/reranker models — bug — by officiallymarky (创建于: 2026-03-16 18:03 (UTC+8))
#37162 [Feature] Fuse per-group FP8 dynamic quant onto Triton attention kernel — 无标签 — by Etelis (创建于: 2026-03-16 17:12 (UTC+8))
#37159 [Bug]: vLLM crashed with V100 by running zai-org/GLM-OCR — bug — by sweihub (创建于: 2026-03-16 16:46 (UTC+8))
#37151 [Bug]: [ROCm][gfx1151] Engine Core segfaults in libhsa-runtime64.so when loading Qwen3-VL-32B-AWQ on AMD Ryzen AI MAX+ 395 — bug,rocm — by MIKAZE3 (创建于: 2026-03-16 15:45 (UTC+8))

已关闭 Issue

#28667 [Bug]: Poor Performance: ~40 t/s for Qwen3-80B-AWQ on Single RTX 6000 — bug,stale — by Wanli-Lee (关闭于: 2026-03-17 10:22 (UTC+8))
#28712 [Bug]: qwen3-vl-32b-instruct ERROR 11-13 21:06:35 [multiproc_executor.py:597] ValueError: Following weights were not initialize from checkpoint: — bug,stale — by l878619717 (关闭于: 2026-03-17 10:22 (UTC+8))
#28797 [RFC]: [CPU]covt_e4m3_bf16 — RFC,stale — by wangyxbh (关闭于: 2026-03-17 10:22 (UTC+8))
#28800 [Bug]: Build vllm on Apple silicon: libc++abi error — bug,stale — by iervn6341-ovo (关闭于: 2026-03-17 10:22 (UTC+8))
#28806 [Bug]: Hermes tool call parser drops empty arguments for parameterless tools while streaming — bug,stale — by shakez0901 (关闭于: 2026-03-17 10:22 (UTC+8))
#37180 [Usage]: Attention backend — usage — by BartekKruczek (关闭于: 2026-03-17 07:26 (UTC+8))
#29462 [CI Failure]: mi325_8: Language Models Tests (Hybrid) %N — ci-failure — by AndreasKaratzas (关闭于: 2026-03-17 05:59 (UTC+8))
#29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) — ci-failure — by AndreasKaratzas (关闭于: 2026-03-17 05:59 (UTC+8))
#36688 [Bug]: torch.opcheck fails for _C.rms_norm_per_block_quant — bug,help wanted — by ProExpertProg (关闭于: 2026-03-17 05:11 (UTC+8))
#36547 [Bug] MultiConnector: NIXL transfers silently broken after HMA migration — bug — by ZhanqiuHu (关闭于: 2026-03-17 04:51 (UTC+8))
#35998 Qwen3.5 Model Tokenizer Loading Failure — 无标签 — by ArtemSultanov-PG (关闭于: 2026-03-17 03:12 (UTC+8))
#37032 [Bug]: GLM 4.7-flash returns gibberish when native KV cache offloading is on — bug — by yz342 (关闭于: 2026-03-17 01:03 (UTC+8))
#33572 [Bug]: GPT-OSS with CPU KV cache offload break with FlashInfer — bug — by cquil11 (关闭于: 2026-03-16 23:20 (UTC+8))
#24684 [Bug]: vllm bench serve isn’t an exact replacement of benchmark_serving.py — bug,good first issue — by gshtras (关闭于: 2026-03-16 22:27 (UTC+8))
#36109 [Bug]: TypeError when rope_scaling.factor is null in model config — bug — by yuanheng-zhao (关闭于: 2026-03-16 22:27 (UTC+8))
#32595 [New Model]: Complexity (Pacific-Prime) - INL Dynamics + Token-Routed MLP — 无标签 — by Complexity-ML (关闭于: 2026-03-16 20:09 (UTC+8))
#36632 [Bug]: MiniMax-M2.5 reasoning missing in chat completions stream — bug — by JakubCerven (关闭于: 2026-03-16 19:58 (UTC+8))
#35823 [Bug]: branch v0.16.0 still rely on torch 2.9.1, not 2.10 — bug — by chamwen (关闭于: 2026-03-16 19:12 (UTC+8))
#34845 [Bug]: Qwen3-Next MTP fails when paired with chunked prefill — bug — by wzhao18 (关闭于: 2026-03-16 17:30 (UTC+8))
#36769 [Bug]: Qwen3.5 tool calling bug — bug,rocm — by DifferentialityDevelopment (关闭于: 2026-03-16 15:38 (UTC+8))

新增 PR

#37241 [Refactor] Relocate responses API tests — v1 — by sfeng33 (创建于: 2026-03-17 08:46 (UTC+8))
#37152 fix(v1/prefix-cache): prevent KV-cache pollution from same-step block registration — v1 — by flutist (创建于: 2026-03-16 15:51 (UTC+8))
#37249 [MoE] Introduce Fp8MoEState and class-based dispatch for DeepGemm — 无标签 — by yzong-rh (创建于: 2026-03-17 10:50 (UTC+8))
#37247 [Model] Implement LoRA support for Qwen3ASRForConditionalGeneration — documentation,qwen — by petern48 (创建于: 2026-03-17 10:27 (UTC+8))
#37251 [Benchmark] Add iteration benchmark with server-side step stats, trac… — 无标签 — by YJYJLee (创建于: 2026-03-17 11:06 (UTC+8))
#37239 [Models][GDN] Prevent D2H sync in ChunkGatedDeltaRule — v1,qwen — by lgeiger (创建于: 2026-03-17 07:19 (UTC+8))
#37248 refactor: standardize kimi_linear and minimax_text_01 model weight loading to use AutoWeightsLoader — 无标签 — by XLiu-2000 (创建于: 2026-03-17 10:32 (UTC+8))
#37212 [MM Encoder] Add Turing compatability for FlashInfer ViT backend — v1 — by Isotr0py (创建于: 2026-03-17 00:52 (UTC+8))
#37205 [Kernel] Add gpt-oss Router GEMM kernel — performance,ci/build,gpt-oss,nvidia — by xyang16 (创建于: 2026-03-17 00:12 (UTC+8))
#37246 [Bugfix] dtype mismatch in ngram gpu propose — bug,speculative-decoding,v1 — by PatchouliTIS (创建于: 2026-03-17 10:25 (UTC+8))
#37179 [XPU] skip unsupported ut and update test_nixl_connector — ready,ci/build,v1,kv-connector — by zhenwei-intel (创建于: 2026-03-16 18:44 (UTC+8))
#37146 Add the option to turn on hipBLASLt online tuning — frontend — by hanlin12-AMD (创建于: 2026-03-16 15:00 (UTC+8))
#37157 [openapi] remove redundant exception stack trace[4/N] — frontend — by andyxning (创建于: 2026-03-16 16:39 (UTC+8))
#37221 [3/n] Migrate cutlass/scaled_mm_entry.cu torch stable ABI — ci/build,nvidia — by mikaylagawarecki (创建于: 2026-03-17 02:35 (UTC+8))
#37245 [Bugfix] Fix incorrect int8 dtype cast for kv_c_normed in MLA prefill — bug — by jacob-crux (创建于: 2026-03-17 09:47 (UTC+8))
#37244 [Perf] Support Flashinfer trtllm tinygemm_bf16 router gemm for GPT-OSS — gpt-oss — by elvischenv (创建于: 2026-03-17 09:30 (UTC+8))
#37243 [ROCm][CI] Refine gating tests — rocm,ci/build — by AndreasKaratzas (创建于: 2026-03-17 09:20 (UTC+8))
#37220 [Bugfix] Consolidate Gemma2/3 GGUF fixes for correctness on Blackwell — bug — by kitaekatt (创建于: 2026-03-17 02:04 (UTC+8))
#37181 Add ability to replace oot ops when using lora — ready — by kyuyeunk (创建于: 2026-03-16 19:15 (UTC+8))
#37207 [Feat] Enable CompressedTensorW4A8Int for XPU — 无标签 — by tianmu-li (创建于: 2026-03-17 00:23 (UTC+8))
#37209 Fix some Mistral parser issues — frontend,ready — by juliendenize (创建于: 2026-03-17 00:30 (UTC+8))
#37237 [Model Runner V2] Spec decode rejection sampler logprobs support — v1 — by TheEpicDolphin (创建于: 2026-03-17 07:12 (UTC+8))
#37238 [Model Runner V2] Spec decode rejection sampler greedy support — v1 — by TheEpicDolphin (创建于: 2026-03-17 07:15 (UTC+8))
#37160 [Feat][v1] Simple yet General CPU KV Cache Offloading — v1,kv-connector — by ivanium (创建于: 2026-03-16 16:50 (UTC+8))
#37228 [ROCM][Bugfix] Use correct stride in cp_mha_gather_cache_kernel for hybrid model (#37228) — bug,rocm,v1,meta-exported,fb-exported — by jennyyyyzhen (创建于: 2026-03-17 05:24 (UTC+8))
#37240 Pr 1 — needs-rebase,ci/build — by sharvil10 (创建于: 2026-03-17 07:24 (UTC+8))
#37234 [Bugfix] Fix for builtins (forward fix of pytorch/177558) — bug — by Lucaskabela (创建于: 2026-03-17 06:35 (UTC+8))
#37236 Fix ambiguous num_blocks for hybrid attn mamba — v1 — by collinmccarthy (创建于: 2026-03-17 06:52 (UTC+8))
#37198 [compile] Enable mega aot artifact for torch 2.12+. — ready — by zhxchen17 (创建于: 2026-03-16 22:53 (UTC+8))
#37232 Fix EagleMistralLarge3Model initialization — speculative-decoding,ready — by juliendenize (创建于: 2026-03-17 06:27 (UTC+8))
#37233 [UX] Add flashinfer-cubin as CUDA default dep — ci/build,nvidia — by mgoin (创建于: 2026-03-17 06:28 (UTC+8))
#37225 [Perf] Optimize top-k search in apply_top_k_top_p_triton sampler — performance,ready,v1 — by mgoin (创建于: 2026-03-17 04:36 (UTC+8))
#37201 [Deprecation] Deprecate --calculate-kv-scales option — ready,quantization — by mgoin (创建于: 2026-03-16 23:28 (UTC+8))
#37231 [Bugfix] Expand quantization method support in perf metrics — bug,v1 — by thillai-c (创建于: 2026-03-17 05:46 (UTC+8))
#37217 [MoE/EPLB] Fix FlashInfer nvfp4 experts + EPLB correctness — ready,nvidia — by elvircrn (创建于: 2026-03-17 01:19 (UTC+8))
#37230 [CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in init — 无标签 — by AndreasKaratzas (创建于: 2026-03-17 05:38 (UTC+8))
#37219 [ROCm] Fix AttributeError for torch.compiler.skip_all_guards_unsafe on older PyTorch — rocm,ready — by AndreasKaratzas (创建于: 2026-03-17 02:03 (UTC+8))
#37229 Fix Qwen3.5-Next RMSNormGated Initialization Error on TPU — qwen — by jrplatin (创建于: 2026-03-17 05:27 (UTC+8))
#37192 WIP: [Feature] KVCACHE NVFP4 — documentation,needs-rebase,v1,quantization — by JartX (创建于: 2026-03-16 21:42 (UTC+8))
#37190 [Feature][Offload] Add dynamic MoE expert LRU cache (–moe-expert-cache-size) — frontend,ci/build — by e1n00r (创建于: 2026-03-16 21:05 (UTC+8))
#37170 [Bugfix] Fix prompt_embeds precision divergence with MTP speculative … — bug,speculative-decoding,v1 — by leihuang-sketch (创建于: 2026-03-16 17:39 (UTC+8))
#37227 [Perf] Use list.extend() over append loops in FlatLogprobs + minor hot-path cleanups — v1 — by vaibhavhariram (创建于: 2026-03-17 04:49 (UTC+8))
#37226 [WIP] Bring back nightly build — ci/build — by atalman (创建于: 2026-03-17 04:47 (UTC+8))
#37197 [BUGFIX][Mamba] Use uint64 for address in KVBlockZeroer — bug,ready,v1 — by jikunshang (创建于: 2026-03-16 22:38 (UTC+8))
#37224 [UltraVox] Fix output type — 无标签 — by vasqu (创建于: 2026-03-17 04:29 (UTC+8))
#37199 [Misc] Add float16 to CacheDType — documentation,rocm,ready,v1,nvidia — by MatthewBonanni (创建于: 2026-03-16 22:55 (UTC+8))
#37213 [CI][BugFix][MORI][AMD] Add transfer_id to kv transfer params for test — bug,rocm,ready,v1,kv-connector — by rasmith (创建于: 2026-03-17 00:53 (UTC+8))
#37178 Bugfix for offloading+prefetch for GLM-4.7-FP8 — bug — by sfbemerk (创建于: 2026-03-16 18:42 (UTC+8))
#37218 [CI] Add retry with 4x backoff to HTTP fetches for transient failures — rocm,ready — by AndreasKaratzas (创建于: 2026-03-17 01:41 (UTC+8))
#37215 [Bugfix] Fix render server crash for quantized models on CPU-only hosts — bug,frontend,ready — by sagearc (创建于: 2026-03-17 01:00 (UTC+8))
#37210 [Bugfix] Handle inaccessible GPUs in CudaPlatform.log_warnings() — bug,nvidia — by eschmidbauer (创建于: 2026-03-17 00:33 (UTC+8))
#37188 [Performance] Enable Triton autotuning disk cache by default — 无标签 — by arpera (创建于: 2026-03-16 20:41 (UTC+8))
#37202 reorder _init_message_queues after init_device() for multi-node DP — v1,meta-exported,fb-exported — by prajjwal1 (创建于: 2026-03-16 23:59 (UTC+8))
#37193 [LoRA] Add LoRA support for Qwen3OmniMoeThinkerForConditionalGeneration — qwen — by pratapyash (创建于: 2026-03-16 21:44 (UTC+8))
#37216 [CI] Add macOS ARM CPU wheel build and publish — ci/build,cpu — by mgoin (创建于: 2026-03-17 01:09 (UTC+8))
#37182 [Pixtral] Enable Pixtral language model support Eagle3 — 无标签 — by Flechman (创建于: 2026-03-16 19:16 (UTC+8))
#37214 Fix minimax m2.5 nvfp4 kv scales weight loading — 无标签 — by wzhao18 (创建于: 2026-03-17 00:58 (UTC+8))
#37195 [V0 Deprecation] Deprecate virtual engine — ready,v1,qwen,kv-connector — by yewentao256 (创建于: 2026-03-16 22:08 (UTC+8))
#37200 [Bugfix] Make siglip/clip compatible with transformers v5 — bug,ready,multi-modality — by zucchini-nlp (创建于: 2026-03-16 23:17 (UTC+8))
#37211 fix: Add missing k_scale attribute to Attention for weight offloading — 无标签 — by BillionClaw (创建于: 2026-03-17 00:39 (UTC+8))
#37208 [Doc] Fix inconsistent hash notation in Prefix Caching diagram — documentation — by BillionClaw (创建于: 2026-03-17 00:30 (UTC+8))
#37206 Unified memory — v1 — by omerpaz95 (创建于: 2026-03-17 00:17 (UTC+8))
#37148 [Bugfix] Add error handling for FINISHED_ERROR in OpenAIServing — bug,frontend,ready — by chaunceyjiang (创建于: 2026-03-16 15:16 (UTC+8))
#37203 refactor hardcoded device strings in vllm tests — speculative-decoding,v1,multi-modality,nvidia — by wincent8 (创建于: 2026-03-16 23:59 (UTC+8))
#37196 [Perf] consolidating, vectorizing and cleaning up CUDA/HIP implementations of custom ops. — nvidia — by GOavi101 (创建于: 2026-03-16 22:32 (UTC+8))
#37194 fix: handle missing parameters gracefully in weight offloading — 无标签 — by muoshuosha (创建于: 2026-03-16 21:48 (UTC+8))
#37183 Remove unused EVS functions in qwen3_vl.py — ready,qwen — by gty111 (创建于: 2026-03-16 19:22 (UTC+8))
#37189 [ROCm] Add torch.cuda fallback for amdsmi-dependent methods on WSL2 — rocm,nvidia — by JoursBleu (创建于: 2026-03-16 20:44 (UTC+8))
#37187 [Tool Parser] Qwen3Coder: incremental string streaming, trailing newline fix, and whitespace content filter — qwen — by ec-jt (创建于: 2026-03-16 20:24 (UTC+8))
#37186 [Tool Parser] Add incremental string streaming and fix trailing newline for Qwen3Coder — qwen — by ec-jt (创建于: 2026-03-16 20:03 (UTC+8))
#37139 [Models][Qwen3 ViT] Keep max_seqlen on CPU to prevent D2H sync — ready,qwen — by lgeiger (创建于: 2026-03-16 11:48 (UTC+8))
#37171 [Frontend] feat: add streaming support for token generation endpoint — frontend — by hhk7734 (创建于: 2026-03-16 17:46 (UTC+8))
#37174 [Docs] Make the link to hardware plugins clearer — documentation,ready — by hmellor (创建于: 2026-03-16 18:14 (UTC+8))
#37177 Fix KV cache memory estimation for hybrid Mamba/Attention models — v1 — by xueliangyang-oeuler (创建于: 2026-03-16 18:27 (UTC+8))
#37169 [Bugfix] Fix “Already borrowed” tokenizer race in Hermes tool parser — bug — by stonelazy (创建于: 2026-03-16 17:38 (UTC+8))
#37172 Fix KV cache size estimation regression in v0.17+ — v1 — by xueliangyang-oeuler (创建于: 2026-03-16 18:00 (UTC+8))
#37149 [XPU][Feature] transparent sleep mode support for XPU platform — documentation,v1 — by yma11 (创建于: 2026-03-16 15:24 (UTC+8))
#37161 Fix issue #37103: Remove shape mismatch warnings in FLA operations — 无标签 — by xueliangyang-oeuler (创建于: 2026-03-16 16:59 (UTC+8))
#37166 Fix issue #37103: Remove shape mismatch warnings in FLA operations — 无标签 — by xueliangyang-oeuler (创建于: 2026-03-16 17:30 (UTC+8))
#37165 [perf][connector] optimize build_connector_meta when host buffer transfer is not used — kv-connector — by youkaichao (创建于: 2026-03-16 17:21 (UTC+8))
#37163 Fix issue #37103: Remove shape mismatch warnings in FLA operations — 无标签 — by xueliangyang-oeuler (创建于: 2026-03-16 17:12 (UTC+8))
#37164 [Bugfix] Fix TOCTOU race in KV block allocator causing prefix-cache block theft — bug,needs-rebase,v1 — by AbhiOnGithub (创建于: 2026-03-16 17:17 (UTC+8))
#37158 [Bugfix] Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 — bug — by dbari (创建于: 2026-03-16 16:40 (UTC+8))
#37156 Fix issue #37037 — v1 — by xueliangyang-oeuler (创建于: 2026-03-16 16:39 (UTC+8))
#37147 [Bugfix] Fix Qwen2.5-Omni/Qwen3-Omni use_audio_in_video with multi-video inputs — bug,ready,multi-modality,qwen — by Isotr0py (创建于: 2026-03-16 15:06 (UTC+8))
#37155 1. — frontend — by nokiaMS (创建于: 2026-03-16 16:38 (UTC+8))
#37154 Fix reasoning parser CI failure for seedoss and glm4 moe — 无标签 — by xueliangyang-oeuler (创建于: 2026-03-16 16:09 (UTC+8))
#37153 Fix issue#37032 — kv-connector — by xueliangyang-oeuler (创建于: 2026-03-16 16:01 (UTC+8))
#37144 [Model Runner V2] Fix processed logits in sample() — ready,v1 — by WoosukKwon (创建于: 2026-03-16 14:17 (UTC+8))
#37143 [XPU] support MLA model on Intel GPU — 无标签 — by jikunshang (创建于: 2026-03-16 14:10 (UTC+8))
#37145 [Bugfix] Avoid LD_PRELOAD check on MacOS — bug,v1,cpu — by bigPYJ1151 (创建于: 2026-03-16 14:20 (UTC+8))
#37138 [ROCm][CI] Fix engine teardown and text normalization to stabilize voxtral test — rocm,ready,multi-modality — by AndreasKaratzas (创建于: 2026-03-16 11:42 (UTC+8))
#37142 [v1] align execute_dummy_batch with uniform decode query length — v1 — by liuxingbo12138 (创建于: 2026-03-16 14:04 (UTC+8))

已合并 PR

#37027 [CI] Fix flaky tool_use chat completion tests with deterministic seed — ready,tool-calling — by sfeng33 (合并于: 2026-03-17 11:24 (UTC+8))
#37181 Add ability to replace oot ops when using lora — ready — by kyuyeunk (合并于: 2026-03-17 09:04 (UTC+8))
#36867 Support non-contiguous KV cache in TRTLLM fp8 dequant kernel — ready,v1,nvidia — by vadiklyutiy (合并于: 2026-03-17 08:48 (UTC+8))
#36030 [BugFix] Correct max memory usage for multiple KV-cache groups — bug,ready,v1 — by peakcrosser7 (合并于: 2026-03-17 08:38 (UTC+8))
#37209 Fix some Mistral parser issues — frontend,ready — by juliendenize (合并于: 2026-03-17 08:08 (UTC+8))
#37074 [Feature][Frontend] add support for Cohere Embed v2 API — documentation,frontend,ready — by walterbm (合并于: 2026-03-17 07:55 (UTC+8))
#27599 [CI/Build] Add common tool call parser test suite — ready,tool-calling,llama,qwen,deepseek — by bbrowning (合并于: 2026-03-17 07:46 (UTC+8))
#36356 [CI] Stabilize multinode DP internal LB completion tests — rocm,ready,v1 — by AndreasKaratzas (合并于: 2026-03-17 06:40 (UTC+8))
#34389 [Custom Ops] Add functional + out variant for scaled_fp4_quant — ready — by tianrengao (合并于: 2026-03-17 06:51 (UTC+8))
#37198 [compile] Enable mega aot artifact for torch 2.12+. — ready — by zhxchen17 (合并于: 2026-03-17 05:05 (UTC+8))
#37232 Fix EagleMistralLarge3Model initialization — speculative-decoding,ready — by juliendenize (合并于: 2026-03-17 06:41 (UTC+8))
#35448 [Quant][Feature] Support online MXFP8 quantization for MoE and dense models — ready,nvidia,quantization — by EdalatiAli (合并于: 2026-03-17 06:07 (UTC+8))
#37064 [Doc] Clarify schema enforcement behavior for tool_choice modes — documentation,ready,tool-calling — by cemigo114 (合并于: 2026-03-17 06:27 (UTC+8))
#37115 [Benchmark] Improvements to attention benchmark script — performance,ready — by wzhao18 (合并于: 2026-03-17 06:22 (UTC+8))
#37217 [MoE/EPLB] Fix FlashInfer nvfp4 experts + EPLB correctness — ready,nvidia — by elvircrn (合并于: 2026-03-17 06:03 (UTC+8))
#36779 [Bugfix] opcheck false mutation error in rms_norm_per_block_quant (#36688) — bug,ready — by KrxGu (合并于: 2026-03-17 05:11 (UTC+8))
#36549 [Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors — bug,ready,kv-connector — by ZhanqiuHu (合并于: 2026-03-17 04:51 (UTC+8))
#37197 [BUGFIX][Mamba] Use uint64 for address in KVBlockZeroer — bug,ready,v1 — by jikunshang (合并于: 2026-03-17 04:39 (UTC+8))
#37199 [Misc] Add float16 to CacheDType — documentation,rocm,ready,v1,nvidia — by MatthewBonanni (合并于: 2026-03-17 04:24 (UTC+8))
#37213 [CI][BugFix][MORI][AMD] Add transfer_id to kv transfer params for test — bug,rocm,ready,v1,kv-connector — by rasmith (合并于: 2026-03-17 04:22 (UTC+8))
#36915 [Refactor] Consolidate GPT-OSS reasoning parser tests — structured-output,ready,v1,gpt-oss — by sfeng33 (合并于: 2026-03-17 03:53 (UTC+8))
#36288 [torch.compile][BE] Modify cudagraph callable to check for is_forward_context_set — documentation,ready,llama,qwen,nvidia — by Lucaskabela (合并于: 2026-03-17 03:42 (UTC+8))
#37215 [Bugfix] Fix render server crash for quantized models on CPU-only hosts — bug,frontend,ready — by sagearc (合并于: 2026-03-17 02:59 (UTC+8))
#36687 [PD][Nixl] Add support for hybrid SSM-FA models — ready,v1,kv-connector — by NickLucche (合并于: 2026-03-17 02:58 (UTC+8))
#36827 Add simple granite4 tool parser — documentation,ready,tool-calling — by maxdebayser (合并于: 2026-03-17 01:49 (UTC+8))
#36982 [MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible — ready,v1 — by MatthewBonanni (合并于: 2026-03-17 01:51 (UTC+8))
#37090 [Bugfix] Disable cross-layer KV cache for MLA attention backends — bug,ready,v1,kv-connector — by haosdent (合并于: 2026-03-17 01:03 (UTC+8))
#37200 [Bugfix] Make siglip/clip compatible with transformers v5 — bug,ready,multi-modality — by zucchini-nlp (合并于: 2026-03-17 00:48 (UTC+8))
#37148 [Bugfix] Add error handling for FINISHED_ERROR in OpenAIServing — bug,frontend,ready — by chaunceyjiang (合并于: 2026-03-17 00:27 (UTC+8))
#36845 [ROCm] Fix KV copy methods and auto-select attention backend for ROCm — rocm,ready,v1,kv-connector — by AndreasKaratzas (合并于: 2026-03-16 16:07 (UTC+8))
#36442 [ROCm][CI] Retrying in case of batch variance effects and reducing flakiness — rocm,ready,v1 — by AndreasKaratzas (合并于: 2026-03-16 16:08 (UTC+8))
#36106 [Bugfix] Add safety check and fallback for null scaling factor — bug,ready — by yuanheng-zhao (合并于: 2026-03-16 22:27 (UTC+8))
#34158 [Bugfix] Relax TRTLLM KV cache contiguity assertion for cross-layer layout — bug,ready,v1,nvidia — by Etelis (合并于: 2026-03-16 23:20 (UTC+8))
#36693 [Compile] Fix compile warning st256_cs in cuda_vec_utils.cuh — ready,nvidia — by yewentao256 (合并于: 2026-03-16 23:12 (UTC+8))
#35594 [BUGFIX]fix CUDA OOM ERROR : invalid argument at cumem_allocator.cpp:119 — bug,ready,nvidia — by flutist (合并于: 2026-03-16 23:10 (UTC+8))
#36022 [Kernel] Add FlashInfer MoE A2A Kernel — documentation,performance,new-model,rocm,structured-output,frontend,ready,ci/build,v1,multi-modality — by leo-cf-tian (合并于: 2026-03-16 14:45 (UTC+8))
#36529 [Compile] Fix compile warning in moe_permute — ready — by yewentao256 (合并于: 2026-03-16 22:16 (UTC+8))
#36992 [Bugfix] accept redacted thinking blocks in Anthropic messages — bug,frontend,ready — by bbartels (合并于: 2026-03-16 22:01 (UTC+8))
#37013 [Spec Decode] Update extract_hidden_states to use deferred kv_connector clear — speculative-decoding,ready,v1,kv-connector — by fynnsu (合并于: 2026-03-16 21:53 (UTC+8))
#37183 Remove unused EVS functions in qwen3_vl.py — ready,qwen — by gty111 (合并于: 2026-03-16 21:09 (UTC+8))
#37104 Patch Mistral config — ready — by juliendenize (合并于: 2026-03-16 20:22 (UTC+8))
#36012 [Performance] Add prefetch for checkpoints to OS page cache — ready — by arpera (合并于: 2026-03-16 19:32 (UTC+8))
#36229 [Build] Fix API rate limit exceeded when using VLLM_USE_PRECOMPILED=1 — ready,ci/build — by elvischenv (合并于: 2026-03-16 20:09 (UTC+8))
#37139 [Models][Qwen3 ViT] Keep max_seqlen on CPU to prevent D2H sync — ready,qwen — by lgeiger (合并于: 2026-03-16 20:11 (UTC+8))
#37174 [Docs] Make the link to hardware plugins clearer — documentation,ready — by hmellor (合并于: 2026-03-16 19:12 (UTC+8))
#35208 GLM4 tool parser: fix streaming mode — ready — by RNabel (合并于: 2026-03-16 18:48 (UTC+8))
#34871 [Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches — bug,ready,v1 — by haosdent (合并于: 2026-03-16 17:30 (UTC+8))
#37057 Fix pipeline parallel with multimodal models with the Transformers modelling backend — ready — by hmellor (合并于: 2026-03-16 18:20 (UTC+8))
#37055 Fix text only inputs for MRoPE models with the Transformers modelling backend — ready — by hmellor (合并于: 2026-03-16 18:31 (UTC+8))
#37031 [Hardware] Replace memory related torch.cuda APIs — performance,ready,v1,nvidia,ready-run-all-tests — by jikunshang (合并于: 2026-03-16 18:24 (UTC+8))
#36675 [Misc] Sync pre-commit to 4.5.1 in workflows and docs — documentation,ready,ci/build — by SoluMilken (合并于: 2026-03-16 18:03 (UTC+8))
#36987 [FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell — ready,v1,qwen,nvidia — by vadiklyutiy (合并于: 2026-03-16 17:04 (UTC+8))
#37093 [Frontend][Misc] Remove unused log in /is_sleeping — frontend,ready — by esmeetu (合并于: 2026-03-16 17:46 (UTC+8))
#37147 [Bugfix] Fix Qwen2.5-Omni/Qwen3-Omni use_audio_in_video with multi-video inputs — bug,ready,multi-modality,qwen — by Isotr0py (合并于: 2026-03-16 16:56 (UTC+8))
#36204 use skip_all_guards_unsafe to drop global_state and torch_function_mode_stack guards instead of previous hacks — ready — by laithsakka (合并于: 2026-03-16 16:52 (UTC+8))
#37136 [Performance][Model Loader] Skip non-local expert weights during EP model loading — ready — by esmeetu (合并于: 2026-03-16 16:33 (UTC+8))
#36774 [Bugfix] fix Qwen3.5 tool calling bug — bug,ready,qwen — by chaunceyjiang (合并于: 2026-03-16 15:38 (UTC+8))
#37144 [Model Runner V2] Fix processed logits in sample() — ready,v1 — by WoosukKwon (合并于: 2026-03-16 15:35 (UTC+8))
#37145 [Bugfix] Avoid LD_PRELOAD check on MacOS — bug,v1,cpu — by bigPYJ1151 (合并于: 2026-03-16 14:31 (UTC+8))
#37138 [ROCm][CI] Fix engine teardown and text normalization to stabilize voxtral test — rocm,ready,multi-modality — by AndreasKaratzas (合并于: 2026-03-16 12:49 (UTC+8))
#37127 [CI][Bugfix] Fix 500 errors from priority overflow and TemplateError subclasses in schema fuzz tests — bug,rocm,frontend,ready,ci/build — by AndreasKaratzas (合并于: 2026-03-16 13:27 (UTC+8))
#37107 [Model] Add HyperCLOVAX-SEED-Think-14B language model support — documentation,new-model,ready — by bigshanedogg (合并于: 2026-03-16 14:40 (UTC+8))
#36612 [XPU] Add deepseek_scaling_rope fused kernel — ready,deepseek — by yitingw1 (合并于: 2026-03-16 12:35 (UTC+8))

关闭但未合并的 PR

#27891 [Feature] Add MetaShufflingMoE as Optional MoE backend to Llama4 models — needs-rebase,stale,llama — by sunfish2010 (关闭于: 2026-03-17 10:22 (UTC+8))
#27943 [V1][Perf] Optimize Medusa proposer: reduce sync overhead — speculative-decoding,stale,v1 — by skyloevil (关闭于: 2026-03-17 10:22 (UTC+8))
#37018 [BUG] Collective causing deadlock for DPEP MoE — bug,frontend,v1 — by hao-aaron (关闭于: 2026-03-17 07:44 (UTC+8))
#37240 Pr 1 — needs-rebase,ci/build — by sharvil10 (关闭于: 2026-03-17 07:25 (UTC+8))
#36930 [Model Runner V2] Spec decode rejection sampler greedy + logprobs support — v1 — by TheEpicDolphin (关闭于: 2026-03-17 07:13 (UTC+8))
#34646 [Bugfix] Fix EPLB + NVFP4: make expanded activation scales contiguous — bug,ready,needs-rebase,v1,nvidia — by elvircrn (关闭于: 2026-03-17 06:33 (UTC+8))
#32716 [Bugfix] Fix MTP edge case in split_decodes_and_prefills — bug,v1 — by Josephasafg (关闭于: 2026-03-16 21:53 (UTC+8))
#33468 Add multimodal support to reranking API — frontend,needs-rebase — by sathergate (关闭于: 2026-03-17 05:24 (UTC+8))
#31464 [Bugfix] Apply RMSNorm weight correction for Gemma2 GGUF models — bug — by kitaekatt (关闭于: 2026-03-17 03:38 (UTC+8))
#30699 [Bugfix] Skip missing parameters during GGUF Gemma2 weight loading — bug — by kitaekatt (关闭于: 2026-03-17 03:38 (UTC+8))
#30434 fix(gguf): Use EOS token ID from GGUF metadata instead of HF tokenizer — 无标签 — by kitaekatt (关闭于: 2026-03-17 03:38 (UTC+8))
#30424 fix(gemma2): Add quant_config to embedding layer for GGUF support — 无标签 — by kitaekatt (关闭于: 2026-03-17 03:38 (UTC+8))
#35948 Add granite4 tool parser — documentation,ready,tool-calling — by maxdebayser (关闭于: 2026-03-17 01:53 (UTC+8))
#31089 [Quantization] enable MXFP4 Triton backend on SM120 (Blackwell) — 无标签 — by janreges (关闭于: 2026-03-17 01:14 (UTC+8))
#35428 [CI] Add macOS ARM CPU wheel build and publish — ready,ci/build,cpu — by mgoin (关闭于: 2026-03-17 01:09 (UTC+8))
#20459 [V1][Spec Decode][Feature] Spec decode with probs — documentation,speculative-decoding,needs-rebase,v1 — by andylolu2 (关闭于: 2026-03-17 00:55 (UTC+8))
#32803 [Model] Add Pacific-Prime / Complexity model support — new-model — by Complexity-ML (关闭于: 2026-03-16 22:12 (UTC+8))
#34559 [Model] Support pacific_i64 which uses Token routed i64 in MoE — new-model,nvidia — by Complexity-ML (关闭于: 2026-03-16 20:45 (UTC+8))
#37186 [Tool Parser] Add incremental string streaming and fix trailing newline for Qwen3Coder — qwen — by ec-jt (关闭于: 2026-03-16 20:23 (UTC+8))
#35762 fix: raise ValueError when –kv-cache-dtype conflicts with checkpoint kv_cache_quant_algo — documentation,performance,ci/build,v1,multi-modality,qwen,kv-connector,nvidia — by infektyd (关闭于: 2026-03-16 18:18 (UTC+8))
#35185 Fix –kv-cache-dtype behavior when checkpoint specifies kv_cache_quant_algo — documentation — by paipeline (关闭于: 2026-03-16 18:18 (UTC+8))
#35044 Fix –kv-cache-dtype auto behavior with checkpoint quantization configs — needs-rebase,v1 — by paipeline (关闭于: 2026-03-16 18:18 (UTC+8))
#34953 Fix kv_cache_dtype auto resolution with checkpoint quantization — 无标签 — by paipeline (关闭于: 2026-03-16 18:18 (UTC+8))
#37034 fix: resolve kv_cache_dtype=’auto’ from checkpoint kv_cache_quant_algo — 无标签 — by alvinttang (关闭于: 2026-03-16 18:17 (UTC+8))
#37155 1. — frontend — by nokiaMS (关闭于: 2026-03-16 16:38 (UTC+8))
#36689 [Feature]: Fused CUTLASS GEMM + static FP8 output quant in epilogue — performance,nvidia — by Itssshikhar (关闭于: 2026-03-16 16:22 (UTC+8))
#23704 [XPU][Feature] sleep mode support for XPU platform — documentation,needs-rebase,v1 — by yma11 (关闭于: 2026-03-16 15:24 (UTC+8))
#37105 [V1] Fix jump-forward decoding correctness and add tests (follow-up to #36142) — structured-output,v1 — by HelloWorldU (关闭于: 2026-03-16 11:54 (UTC+8))