vLLM Development Activity Report - 2026-03-22

Time window: 2026-03-22 11:39 (UTC+8) ~ 2026-03-23 11:39 (UTC+8)
Statistics: New issues 10 | Closed issues 10 | New PRs 48 | Merged PRs 21 | PRs closed without merging 15

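📊 Daily Development Summary

Between March 22 and 23, 2026, the vLLM project remained highly active, with 10 new issues, 48 new PRs, and 21 merged PRs. Development focused on feature work and test fixes for the AMD/ROCm platform, in-depth hardening of tool-call parsers, and new proposals for core scheduling and caching strategies. Performance optimization (especially tuning for specific hardware and models) and multi-platform support (AMD, Intel XPU) are the main drivers at present.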

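🎯 AMD/ROCm Ecosystem Updates

AMD-related development was very active this cycle, centered on performance improvements, feature completeness, and test stability. Contributors include identified AMD employees (such as ChuanLi1101) and other active ROCm developers.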

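Performance and feature enhancements:
- PR #37800 (ChuanLi1101): [ROCm][Perf] Add MXFP4 linear method and enable shared expert fusion. This is a key PR: it closes a gap in vLLM's support for MXFP4-quantized models on ROCm. Previously, MXFP4 configurations fell back to unquantized BF16 compute; the PR implements Mxfp4LinearMethod, which uses AITER's Triton FP4 GEMM kernel on ROCm to bring performance in line with ROCm/ATOM. It also enables shared expert fusion by default to reduce kernel launch overhead for MoE models.
- PR #37826 (laudney): [ROCm] Widen OAI Triton MoE capability range to include gfx12 (RDNA4). By raising the upper bound of the capability range, the standard MXFP4 MoE path now supports the AMD RDNA4 architecture (gfx12) without custom kernels, extending compatibility to AMD consumer GPUs.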
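Test fixes and stability:
- PR #37833 (AndreasKaratzas): [ROCm] Fix MoE kernel test failures on gfx950. Systematically fixes several MoE kernel test failures on MI355 (gfx950), including handling differences in the return value of triton_kernels.topk, gating C++ kernels on the ROCm platform, and optimizing division at FP8 quantization boundaries, which makes the test suite more reliable on the latest AMD hardware.
- PR #37838 (AndreasKaratzas): [ROCm] Fix fused RMS norm quant test failures on gfx90a. Fixes small FP8 quantization mismatches in the fused RMSNorm quantization tests on gfx90a (MI250), caused by differences in floating-point precision handling; unifying the intermediate precision type makes the results consistent.
- PR #37835 (AndreasKaratzas): [ROCm] Add UE8M0 scale packing for Triton silu_mul_quant. Adds UE8M0 scale format support to the Triton fallback kernel on ROCm. DeepGEMM is not currently available on ROCm, but the change keeps the feature set complete and prepares for future support.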
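
To make the UE8M0 scale format mentioned above more concrete, here is a minimal sketch. It assumes the common definition of E8M0 (an unsigned 8-bit exponent with no mantissa and a bias of 127, so only powers of two are representable). The function names, the round-up policy, and the layout of four scales per 32-bit word with the first scale in the low byte are illustrative assumptions, not the layout used by PR #37835.

```python
import math

E8M0_BIAS = 127  # exponent bias assumed for the E8M0 encoding


def encode_ue8m0(scale: float) -> int:
    """Round a positive scale up to a power of two and return its biased exponent byte."""
    if not (scale > 0.0) or math.isinf(scale):
        raise ValueError("scale must be positive and finite")
    mantissa, exp = math.frexp(scale)  # scale == mantissa * 2**exp, 0.5 <= mantissa < 1
    # An exact power of two has mantissa 0.5, i.e. scale == 2**(exp - 1).
    exponent = exp - 1 if mantissa == 0.5 else exp
    # Clamp to 0..254; byte 255 is treated as reserved (NaN) in this sketch.
    return min(max(exponent + E8M0_BIAS, 0), 254)


def decode_ue8m0(byte: int) -> float:
    """Turn a biased exponent byte back into a power-of-two scale."""
    return math.ldexp(1.0, byte - E8M0_BIAS)


def pack_ue8m0(scales) -> int:
    """Pack four encoded scales into one 32-bit word (hypothetical layout)."""
    if len(scales) != 4:
        raise ValueError("expected exactly four scales")
    word = 0
    for i, scale in enumerate(scales):
        word |= encode_ue8m0(scale) << (8 * i)
    return word


def unpack_ue8m0(word: int):
    """Inverse of pack_ue8m0: recover the four power-of-two scales."""
    return [decode_ue8m0((word >> (8 * i)) & 0xFF) for i in range(4)]
```

Because the mantissa is dropped, decoding returns the rounded power of two rather than the original scale; for example, 0.75 and 3.0 come back as 1.0 and 4.0 under the round-up policy assumed here.
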
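Continuous integration (CI) optimization:
- PR #37780, #37723, #37717 (AndreasKaratzas): A series of CI adjustments: tests already covered on MI250 are made optional on MI325/MI355 to reduce test load; the accuracy threshold of the speech-to-text test is adjusted so it passes reliably; and a large-GPU mark is added to the max_tokens test to avoid memory problems. These changes aim to improve the efficiency and stability of the AMD CI pipeline.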

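Summary: AMD contributions this cycle show a shift from "works" to "works well and reliably": closing the performance gap in quantization support, widening hardware compatibility, and systematically resolving test failures, which lays groundwork for production deployment.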

💬 Analysis of High-Traffic Discussions

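RFC #36155: dLLM support via plugin (spec-decode path reuse) (7 comments)
- Core topic: how to support diffusion language models (dLLMs) with minimal core changes by reusing the speculative decoding path through the plugin system.
- Positions and points of contention:
  - Proposer (RedHat): argues for reusing the existing spec-decode data structures and scheduling interfaces, so that a single core hook lets a plugin implement full dLLM semantics with maximum encapsulation.
  - Core maintainer (@benchislett): agrees the design is sound but raises two challenges: 1) the conflict between structured output and generating multiple tokens at once may require refactoring; 2) support for custom-mask attention varies across platforms, which may limit model compatibility.
  - Focus of discussion: how to balance extensibility against core complexity. The thread concluded that handling structured output at the worker/model level is feasible and that current mainstream "modern" dLLM architectures fit the assumptions. The author closed the RFC during this cycle and announced that the project will formally start soon.
- Conclusion: the approach was accepted at a high level; concrete implementation challenges such as structured output are the next area of focus.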
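Issue #26897: [Bug]: deepseekv3.2 tool_calls failure (10 comments)
- Core topic: users deploying the DeepSeek-V3.2-Exp model hit tool-call failures (500 errors) and garbled output ("HHHHH") under concurrency.
- Observations and workarounds:
  - User experience: one user found that switching from data parallelism (-dp 8) to tensor parallelism (-tp 8) made tool calls succeed, but could not explain why.
  - Concurrency: several users confirmed that garbled replies of various kinds appear at concurrency >= 2, which hurts usability.
  - Model specifics: the discussion noted that the default optimization strategy for v3.2 is EP/DP and that the kernels are tuned mainly for TP=1, and recommended DP+EP mode for production deployments.
- Status: the issue was closed automatically after a long period of inactivity, but the questions it raises about parallelism strategy and tool-call stability remain broadly relevant.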
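Issue #29138: [Bug]: AttributeError: 'NoneType' object has no attribute 'use_mxfp4_w4a16' (7 comments)
- Core topic: an MoE model (Qwen3) quantized with bitsandbytes and with LoRA enabled raises an AttributeError when TP>1.
- Root cause and fix:
  - Core developer confirmation: initially attributed to incomplete bitsandbytes+LoRA support for MoE models in a specific version (0.11).
  - Root cause: the _inject_lora_into_fused_moe function assumes quant_config always exists, but it is None for unquantized or unsupported configurations, so the attribute access fails.
  - Fix: a contributor submitted PR #29231, which adds fallback logic for the case where quant_config is None and loads a default unquantized configuration.
- Conclusion: the code's assumption about the quantization config state was not robust enough; the problem has been fixed through PR #29231.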
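
The failure in Issue #29138 is a plain None dereference, and the fix described there amounts to substituting a default configuration before reading attributes. The sketch below shows that pattern in isolation. QuantConfig, UNQUANTIZED_DEFAULT, resolve_quant_config, and uses_mxfp4_w4a16 are hypothetical names made up for illustration; only the attribute name use_mxfp4_w4a16 comes from the error message, and this is not the code from PR #29231.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QuantConfig:
    """Hypothetical stand-in for a fused-MoE quantization config."""
    use_mxfp4_w4a16: bool = False


# Assumed default used when a layer has no quantization config at all.
UNQUANTIZED_DEFAULT = QuantConfig()


def resolve_quant_config(quant_config: Optional[QuantConfig]) -> QuantConfig:
    """Return a usable config, falling back to the unquantized default for None."""
    return UNQUANTIZED_DEFAULT if quant_config is None else quant_config


def uses_mxfp4_w4a16(quant_config: Optional[QuantConfig]) -> bool:
    """Read the flag safely instead of dereferencing a possibly-None config."""
    return resolve_quant_config(quant_config).use_mxfp4_w4a16
```

Resolving the default once at the boundary keeps later attribute reads free of scattered None checks.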

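🔥 Hot Topics and Trends

Tool-call parsers move on to the hard cases: multiple PRs (#37829, #37831, #37790, #37845) and issues address edge cases, streaming output, parameter validation, and format alignment in parsers such as Qwen3Coder, Minimax-M2, and GLM. This suggests that, with basic functionality in place, the community is now working on robustness and user experience in complex scenarios.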
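Hardware-specific performance issues and tuning:
- Blackwell (B200) accuracy issue: Issue #37804 describes in detail how the E8M0 format required by DeepGemm causes a significant accuracy drop for Qwen3.5 FP8 models on Blackwell, and PR #37806 proposes automatically disabling DeepGemm based on architecture.
- AMD performance catch-up: as noted above, several PRs close gaps in quantization (MXFP4) and MoE performance on AMD platforms.
- Intel ARC GPU support: Issue #37828 reports that the Intel ARC 140v GPU is not recognized in recent vLLM versions; the suggested workaround is the Triton attention backend. This illustrates the ongoing challenge of supporting a multi-hardware ecosystem.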
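KV cache and scheduling algorithm innovation:
- T-LRU proposal: Issue #37823 and the accompanying PR #37825 propose a conversation-aware, tail-optimized LRU cache eviction policy that aims to reduce P95 tail latency, a direct response to production SLA requirements.
- Scheduler bug fix: Issue #37842 and PR #37843 identify and fix a bug where max_tokens updates were not propagated in streaming sessions, reflecting continued attention to detail in the core scheduling logic.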
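Memory optimization and OOM prevention: several PRs (#37805, #37799, #37746) fix excessive memory usage or OOM errors that large models such as DeepSeek-R1 hit in specific configurations (e.g. MLA decode buffers, Mamba state in DP mode). The common approach is to share buffers and clear stale state.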

🛠️ Key Technical Changes

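PR #37825: [Core] Add Tail-Optimized LRU (T-LRU) KV cache eviction policy
- Technical summary: builds on the existing LRU prefix-cache eviction policy by introducing a notion of "tail-latency safety". Using the conversation history length and an estimate of the next query length, it places KV blocks that can be evicted without violating the configured latency threshold into a separate queue that is evicted first, optimizing tail latency without reducing the cache hit rate.
- Impact: gives production deployments with strict SLA requirements a new optimization tool, and is a notable algorithm-level exploration of performance optimization.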
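
The two-queue idea described for T-LRU can be sketched at conversation granularity as follows. This is a toy model under stated assumptions, not the implementation in PR #37825: the class and method names are hypothetical, entries are whole conversations rather than KV blocks, and "tail-safe" is decided with an assumed linear cost model (recompute time = tokens x a fixed per-token prefill cost).

```python
from collections import OrderedDict
from typing import Optional


class TailOptimizedLRU:
    """Toy cache index: evict tail-safe conversations before plain LRU ones."""

    def __init__(self, latency_budget_ms: float, prefill_ms_per_token: float) -> None:
        self.latency_budget_ms = latency_budget_ms
        self.prefill_ms_per_token = prefill_ms_per_token
        self._safe = OrderedDict()    # evicting these stays within the latency budget
        self._unsafe = OrderedDict()  # evicting these would exceed the budget

    def is_tail_safe(self, history_tokens: int, est_next_query_tokens: int) -> bool:
        # Assumed cost model: after eviction the whole history plus the next
        # query must be prefilled again, at a constant cost per token.
        recompute_ms = (history_tokens + est_next_query_tokens) * self.prefill_ms_per_token
        return recompute_ms <= self.latency_budget_ms

    def touch(self, conv_id: str, history_tokens: int, est_next_query_tokens: int) -> None:
        """Record an access: reclassify the entry and move it to the MRU end."""
        self._safe.pop(conv_id, None)
        self._unsafe.pop(conv_id, None)
        safe = self.is_tail_safe(history_tokens, est_next_query_tokens)
        (self._safe if safe else self._unsafe)[conv_id] = None

    def evict(self) -> Optional[str]:
        """Evict the LRU tail-safe entry if any, else the LRU unsafe entry."""
        for queue in (self._safe, self._unsafe):
            if queue:
                conv_id, _ = queue.popitem(last=False)
                return conv_id
        return None
```

With a 100 ms budget and 0.1 ms per token, a 5200-token conversation is kept in preference to more recently used 400-token and 800-token conversations, whereas plain LRU would evict the long, oldest one first.
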
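PR #37798: [MRV2] Use FP64 for Gumbel noise
- Technical summary: switches the Gumbel noise computation in Model Runner V2 from FP32 back to FP64. This partially reverts an earlier optimization (#34854) and aims to address numerical stability problems that FP32 may introduce, ensuring deterministic sampling results.
- Impact: trades some performance in large-scale random sampling (up to ~1.8x) for more reliable numerical behavior across batch sizes and hardware, prioritizing correctness.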
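
Gumbel-max sampling picks argmax(logits + g) with g = -log(-log(u)) for uniform u. The sketch below shows one way reduced precision can change the result; it is an illustration, not necessarily the specific issue behind PR #37798. FP32 is emulated by rounding through the struct module, which only approximates a real FP32 kernel, and all function names are hypothetical.

```python
import math
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def gumbel_fp64(u: float) -> float:
    """Gumbel noise computed entirely in FP64."""
    return -math.log(-math.log(u))


def gumbel_fp32(u: float) -> float:
    """Gumbel noise with every intermediate rounded to FP32 (emulated)."""
    u32 = to_fp32(u)
    inner = to_fp32(-math.log(u32))
    if inner == 0.0:
        # u rounded to exactly 1.0, so -log(-log(u)) diverges to +inf.
        return math.inf
    return to_fp32(-math.log(inner))


def gumbel_max_sample(logits, uniforms, noise_fn) -> int:
    """Return the index maximizing logit + Gumbel noise."""
    scores = [logit + noise_fn(u) for logit, u in zip(logits, uniforms)]
    return max(range(len(scores)), key=scores.__getitem__)


# A uniform draw very close to 1 is representable in FP64 but rounds to 1.0 in FP32.
u_near_one = 1.0 - 2.0 ** -30
logits = [30.0, 0.0]
uniforms = [0.5, u_near_one]
print(gumbel_max_sample(logits, uniforms, gumbel_fp64))
print(gumbel_max_sample(logits, uniforms, gumbel_fp32))
```

In FP64 the second token receives finite noise of about 30·ln 2 and the first token still wins; in the emulated FP32 path the noise becomes infinite and the second token is selected regardless of its logit.
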
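PR #37718: [Bug] Fix fp8 deepgemm batch invariance
- Technical summary: fixes a problem in FP8 DeepGemm computation where dynamically choosing between the run_flashinfer_deepgemm_swapAB and run_deepgemm kernels broke batch invariance.
- Impact: ensures models using FP8 DeepGemm produce consistent output across batch sizes, a key fix for determinism requirements in deployment and testing.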
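
Batch invariance means a request's output does not depend on which other requests share its batch. The toy example below shows why picking a kernel by batch size can break it: two mathematically equivalent reductions that accumulate in different orders can round differently. The functions and the batch-size threshold are hypothetical and stand in for the real kernels only by analogy.

```python
def dot_forward(row, weights) -> float:
    """Accumulate products left to right."""
    acc = 0.0
    for x, w in zip(row, weights):
        acc += x * w
    return acc


def dot_reversed(row, weights) -> float:
    """Accumulate the same products right to left."""
    acc = 0.0
    for x, w in zip(reversed(row), reversed(weights)):
        acc += x * w
    return acc


def matvec_dynamic(batch, weights):
    """Pick the 'kernel' by batch size (made-up threshold): not batch invariant."""
    kernel = dot_forward if len(batch) < 2 else dot_reversed
    return [kernel(row, weights) for row in batch]


def matvec_fixed(batch, weights):
    """Always use the same reduction order: batch invariant."""
    return [dot_forward(row, weights) for row in batch]


row = [1.0, 1e16, -1e16]
weights = [1.0, 1.0, 1.0]

alone = matvec_dynamic([row], weights)[0]
batched = matvec_dynamic([row, row], weights)[0]
print(alone == batched)

print(matvec_fixed([row], weights)[0] == matvec_fixed([row, row], weights)[0])
```

Here the forward order loses the 1.0 when it is added to 1e16, while the reversed order cancels the large terms first and keeps it, so the same row yields different results alone and in a batch of two. Pinning a single kernel removes the dependence on batch size.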

📈 Development Activity Observations

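Contribution activity: 48 new PRs in a single day indicate a very active contributor community. Contributors from the AMD ecosystem (such as AndreasKaratzas and ChuanLi1101) submitted many ROCm fixes and optimizations, suggesting that AMD is actively investing in vLLM performance on its hardware.
Review and merge efficiency: 21 PRs were merged, including several important bug fixes and feature improvements, which suggests the core team's review and merge process is running smoothly. Many PRs were marked ready and merged shortly after creation.
Issue management: 10 issues were closed, most of them automatically after being marked stale for inactivity, which also reflects cleanup of old issues.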

💡 Issues Worth Watching

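FP8 accuracy trade-off on Blackwell: the accuracy problem caused by the DeepGemm E8M0 format (Issue #37804) and the model- and architecture-specific disable proposed in PR #37806 raise the question of how to choose the best quantization strategy per hardware platform. The community should keep tracking the effectiveness of this fix and its performance impact.
Real-world results of the new cache policy: the T-LRU policy proposed in RFC #37823 has reached the implementation stage (PR #37825). Its practical effect, parameter-tuning experience, and gains on real workloads deserve close attention and validation in upcoming cycles.
Long-tail issues in multi-hardware support: Issue #37828 (Intel ARC) and the series of AMD test-fix PRs show that supporting a diverse hardware ecosystem is an ongoing effort spanning drivers, kernels, and test cases. Managing and validating multi-platform compatibility more systematically is a long-term challenge.
Runtime model optimization: the "RIY" runtime expert pruning technique proposed in PR #37824 shows that MoE memory usage can be reduced at load time based on a predefined profile. Dynamic optimizations of this kind matter for deploying very large MoE models; their practicality and performance overhead merit further observation.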

📋 Appendix: Detailed Data

New Issues

#37839 [Feature Request] Support chat_template in tokenizer_config.json for DeepSeekV32 — no labels — by xlshaoscu (created: 2026-03-23 11:10 (UTC+8))
#37842 _update_request_as_session does not update max_tokens from StreamingUpdate — no labels — by warren618 (created: 2026-03-23 11:31 (UTC+8))
#37837 Why is an assertion used here? — no labels — by WZKIIIII (created: 2026-03-23 10:26 (UTC+8))
#37836 [Bug] Potential incorrect tokenizer source path in RunAI object storage pull — bug — by mf46 (created: 2026-03-23 10:26 (UTC+8))
#37832 [Performance]: Deepseek performance regressing with norm fusion enabled — performance — by tianrengao (created: 2026-03-23 09:04 (UTC+8))
#37828 [Bug]: Intel ARC 140v not supported as XE2 cutlass kernel — bug,intel-gpu — by PterosDiacos (created: 2026-03-23 05:20 (UTC+8))
#37823 [RFC] Tail-Optimized LRU (T-LRU): Reducing Tail Latency via Conversation-Aware KV Cache Eviction — RFC — by wenxinzhang0 (created: 2026-03-23 04:29 (UTC+8))
#37804 [Bug] DeepGemm E8M0 scale format causes accuracy degradation for Qwen3.5 FP8 on Blackwell — bug — by vadiklyutiy (created: 2026-03-22 20:41 (UTC+8))
#37801 [Bug]: CUDA 13 LMCache KV connector install path still resolves CUDA 12 artifacts — no labels — by malaiwah (created: 2026-03-22 20:16 (UTC+8))
#37794 [Bug]: The Qwen 3.5 model cannot disable thinking in version 0.18.0. — bug — by SongXiaoMao (created: 2026-03-22 14:21 (UTC+8))

Closed Issues

#26963 [Feature][XPU]: speculative decoding support on XPU. — feature request,intel-gpu,stale — by jikunshang (closed: 2026-03-23 10:21 (UTC+8))
#26897 [Bug]: deepseekv3.2 tool_calls failure — bug,stale — by CallmeZhangChenchen (closed: 2026-03-23 10:17 (UTC+8))
#28110 [Bug]: spark can not run MoE GEMM — bug,stale — by qiyuxinlin (closed: 2026-03-23 10:16 (UTC+8))
#28804 [Bug]: DeepSeek V3.1 Tool Parser: Leading whitespace accumulation in multi-turn tool calling conversations — bug,stale — by momaek (closed: 2026-03-23 10:16 (UTC+8))
#28998 [Performance]: Timecost on Qwen2.5VL with multi images — performance,stale — by SmartNight-cc (closed: 2026-03-23 10:16 (UTC+8))
#29138 [Bug]: AttributeError: 'NoneType' object has no attribute 'use_mxfp4_w4a16' — bug,stale — by nole69 (closed: 2026-03-23 10:16 (UTC+8))
#29171 [Feature Request]: Add JSON response format to /health endpoint — stale — by javAlborz (closed: 2026-03-23 10:16 (UTC+8))
#29178 [Bug]: Internal: AttributeError: module 'triton.language' has no attribute 'constexpr_function' — bug,stale — by mephisto1484 (closed: 2026-03-23 10:16 (UTC+8))
#36155 [RFC]: dLLM support via plugin (spec-decode path reuse) — RFC — by AlonKellner-RedHat (closed: 2026-03-22 23:05 (UTC+8))
#37794 [Bug]: The Qwen 3.5 model cannot disable thinking in version 0.18.0. — bug — by SongXiaoMao (closed: 2026-03-22 14:32 (UTC+8))

New PRs

#37845 [Bugfix] Fix GLM tool-call finish chunk suffix alignment in streaming — bug,frontend — by QwertyJack (created: 2026-03-23 11:33 (UTC+8))
#37844 [Draft][XPU] add gptq marlin support — no labels — by jikunshang (created: 2026-03-23 11:32 (UTC+8))
#37843 fix(scheduler): update max_tokens from StreamingUpdate in session — v1 — by warren618 (created: 2026-03-23 11:31 (UTC+8))
#37808 [Mypy] Fix mypy for vllm/config — performance,ready,v1,nvidia — by yewentao256 (created: 2026-03-22 22:18 (UTC+8))
#37841 replace cuda_device_count_stateless() to torch.accelerator.device_count() in test/utils.py — nvidia — by wincent8 (created: 2026-03-23 11:28 (UTC+8))
#37840 [Docs] Adds vllm-musa to custom_op.md — documentation — by yeahdongcn (created: 2026-03-23 11:18 (UTC+8))
#37831 [Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params — bug,tool-calling,qwen — by AAISSJ (created: 2026-03-23 08:12 (UTC+8))
#37833 [ROCm] Fix MoE kernel test failures on gfx950 — rocm,ready,ci/build,gpt-oss — by AndreasKaratzas (created: 2026-03-23 09:25 (UTC+8))
#37838 [ROCm] Fix fused RMS norm quant test failures on gfx90a — rocm,ready — by AndreasKaratzas (created: 2026-03-23 10:57 (UTC+8))
#37825 [Core] Add Tail-Optimized LRU (T-LRU) KV cache eviction policy — documentation,v1 — by wenxinzhang0 (created: 2026-03-23 04:44 (UTC+8))
#37835 [ROCm] Add UE8M0 scale packing for Triton silu_mul_quant — rocm,ci/build,gpt-oss — by AndreasKaratzas (created: 2026-03-23 09:49 (UTC+8))
#37834 [Test] Consolidate tool parser unit tests to tests/tool_parsers — ready,tool-calling,llama — by bbrowning (created: 2026-03-23 09:46 (UTC+8))
#37810 [Bugfix] Store Qwen3Next A_log in fp32 — bug,qwen — by effortprogrammer (created: 2026-03-22 22:37 (UTC+8))
#37788 [Refactor] converge xxx_config to vllm_config in async_llm — frontend,v1 — by andyxning (created: 2026-03-22 11:40 (UTC+8))
#37816 [CI/Build][LoRA] Update Qwen35 LoRA testing — ready,ci/build,qwen — by jeejeelee (created: 2026-03-23 01:17 (UTC+8))
#37814 [Perf] Add A100 fused MoE tuned config for E=16,N=3072 — no labels — by xiaohajiayou (created: 2026-03-23 01:04 (UTC+8))
#37827 [Bugfix] Fix async load failures when multiple requests share same prefix — bug,v1 — by kahfizulkifli (created: 2026-03-23 04:52 (UTC+8))
#37830 [MRV2] Enable PP CUDA graph test — ready,ci/build,nvidia — by WoosukKwon (created: 2026-03-23 06:47 (UTC+8))
#37815 Varun/zero out padding — v1,nvidia — by varun-sundar-rabindranath (created: 2026-03-23 01:16 (UTC+8))
#37829 [Tool Parser] Qwen3Coder: stop string streaming at next <parameter=> boundary — tool-calling,qwen — by ec-jt (created: 2026-03-23 05:27 (UTC+8))
#37809 [Tool Parser] Qwen3Coder: boundary-safe streaming fix (latest main) — tool-calling,qwen — by ec-jt (created: 2026-03-22 22:21 (UTC+8))
#37826 [ROCm] Widen OAI Triton MoE capability range to include gfx12 (RDNA4) — rocm,gpt-oss — by laudney (created: 2026-03-23 04:46 (UTC+8))
#37824 [Feature] RIY: Runtime expert pruning for MoE models — documentation,frontend — by flash7777 (created: 2026-03-23 04:37 (UTC+8))
#37798 [MRV2] Use FP64 for Gumbel noise — ready,v1 — by WoosukKwon (created: 2026-03-22 17:23 (UTC+8))
#37811 [Bigfix]fix lora test by pass padded size back to the layer — rocm,ready,gpt-oss — by zyongye (created: 2026-03-22 23:52 (UTC+8))
#37821 [MRV2] Add FULL CUDA graph support with PP — needs-rebase,v1,nvidia — by WoosukKwon (created: 2026-03-23 01:59 (UTC+8))
#37822 [LMCache CI]: add HND interface protection — v1,kv-connector — by sammshen (created: 2026-03-23 02:59 (UTC+8))
#37806 [Bugfix] Auto-disable DeepGemm for Qwen3.5 on Blackwell to fix FP8 accuracy degradation — bug,qwen — by vadiklyutiy (created: 2026-03-22 20:54 (UTC+8))
#37818 [MRV2] Skip hidden states allocation for PW CUDA graphs — ready,v1,nvidia — by WoosukKwon (created: 2026-03-23 01:40 (UTC+8))
#37819 [Docs] Add guide for editing agent instruction files — documentation — by bbrowning (created: 2026-03-23 01:41 (UTC+8))
#37820 [Bugfix] JAIS: Only apply ALiBi when position_embedding_type='alibi' — bug — by r266-tech (created: 2026-03-23 01:49 (UTC+8))
#37817 [Bugfix] Skip speculator auto-detection for non-Hub models — bug — by johnnyodonnell (created: 2026-03-23 01:17 (UTC+8))
#37813 [Perf] fuse kernels in gdn — qwen — by ZJY0516 (created: 2026-03-23 00:34 (UTC+8))
#37797 [Model] Add GGUF support for Qwen3.5 hybrid models — qwen — by ChuanLi1101 (created: 2026-03-22 17:01 (UTC+8))
#37812 [MRV2] Consider spec decoding in warmup — ready,v1 — by WoosukKwon (created: 2026-03-23 00:12 (UTC+8))
#37803 Enable NemotronHPuzzle + NemotronHMTP — new-model,ready,nvidia — by netanel-haber (created: 2026-03-22 20:26 (UTC+8))
#37799 [Bugfix] Share MLA decode output buffer across layers to fix OOM — bug,v1,nvidia — by elvircrn (created: 2026-03-22 18:27 (UTC+8))
#37807 yes — rocm,structured-output,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1,cpu,nvidia — by Yin-Liang (created: 2026-03-22 21:56 (UTC+8))
#37805 [Fix] Share decode output buffer across MLA layers to reduce memory — v1,nvidia — by xueliangyang-oeuler (created: 2026-03-22 20:44 (UTC+8))
#37802 [CI/Build] Fix LMCache KV connector install path for CUDA 13 images — ci/build,kv-connector,nvidia — by malaiwah (created: 2026-03-22 20:17 (UTC+8))
#37800 [ROCm][Perf] Add MXFP4 linear method and enable shared expert fusion — rocm — by ChuanLi1101 (created: 2026-03-22 18:46 (UTC+8))
#37790 [Bugfix] Fixe MiniMax-M2 parser failed to validate the validity of function names — bug,tool-calling — by xyDong0223 (created: 2026-03-22 13:05 (UTC+8))
#37796 [Fix] Clear stale prompt logprobs on preemption to avoid livelock — v1 — by xueliangyang-oeuler (created: 2026-03-22 16:07 (UTC+8))
#37793 [Fix] Add dynamic engine_id monitoring for MooncakeConnector proxy — documentation,kv-connector — by xueliangyang-oeuler (created: 2026-03-22 14:14 (UTC+8))
#37795 Fix missing logprobs for in streaming chat completions (#37737) — frontend — by agarwalprakhar2511 (created: 2026-03-22 15:31 (UTC+8))
#37789 [Perf] Fuse prefill conv output split into Q/K/V to reduce rearrange overhead — qwen — by ZJY0516 (created: 2026-03-22 11:46 (UTC+8))
#37792 [Feature] Add track_token_ids for efficient selective token logprobs tracking — documentation,v1 — by dengoswei (created: 2026-03-22 13:23 (UTC+8))
#37791 Custom/v0.14.1 — documentation,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,kv-connector,nvidia — by jusikjoa (created: 2026-03-22 13:22 (UTC+8))

Merged PRs

#37632 always use embed&token_classify for bge-m3 — ready — by staugust (merged: 2026-03-23 11:10 (UTC+8))
#37643 Fix AudioFlamingo3/MusicFlamingo HF parity and RoTE handling — documentation,ready,multi-modality — by lashahub (merged: 2026-03-23 10:29 (UTC+8))
#37830 [MRV2] Enable PP CUDA graph test — ready,ci/build,nvidia — by WoosukKwon (merged: 2026-03-23 07:30 (UTC+8))
#35162 [Model Runner V2] Enable piecewise & full CUDA graphs for pipeline parallelism — ready,v1,nvidia — by ZhanqiuHu (merged: 2026-03-23 04:48 (UTC+8))
#37798 [MRV2] Use FP64 for Gumbel noise — ready,v1 — by WoosukKwon (merged: 2026-03-23 03:28 (UTC+8))
#37811 [Bigfix]fix lora test by pass padded size back to the layer — rocm,ready,gpt-oss — by zyongye (merged: 2026-03-23 03:20 (UTC+8))
#37818 [MRV2] Skip hidden states allocation for PW CUDA graphs — ready,v1,nvidia — by WoosukKwon (merged: 2026-03-23 02:47 (UTC+8))
#37723 [ROCm][CI] Stabilize ROCm speech-to-text translation test with lower min acc threshold — rocm,ready — by AndreasKaratzas (merged: 2026-03-22 17:32 (UTC+8))
#37803 Enable NemotronHPuzzle + NemotronHMTP — new-model,ready,nvidia — by netanel-haber (merged: 2026-03-22 23:13 (UTC+8))
#37719 [Test] Only Run MLA model when user explicitly set for batch invariance — ready,v1 — by yewentao256 (merged: 2026-03-22 21:09 (UTC+8))
#37718 [Bug] Fix fp8 deepgemm batch invariant — bug,ready — by yewentao256 (merged: 2026-03-22 20:57 (UTC+8))
#36097 [Model Runner V2] Support multi-modal embeddings for spec decode model — speculative-decoding,ready,v1,llama — by TheEpicDolphin (merged: 2026-03-22 17:48 (UTC+8))
#35927 [MoE] Move PF Methods to Folder — documentation,ready,nvidia — by robertgshaw2-redhat (merged: 2026-03-22 16:42 (UTC+8))
#37763 [ROCm][CI] Fix MEGA_AOT_ARTIFACT fallback when PyTorch < 2.10.0 lacks AOT support — rocm,ready — by AndreasKaratzas (merged: 2026-03-22 16:02 (UTC+8))
#37764 [ROCm][CI] get_cu_count was renamed to num_compute_units in #35042 — rocm,ready — by AndreasKaratzas (merged: 2026-03-22 16:02 (UTC+8))
#37778 [ROCm][CI] Added missing resampy dependency for MM audio tests — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-22 16:02 (UTC+8))
#37780 [ROCm][CI] Make some duplicated tests optional so that they are only evaluated in our nightly — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-22 16:03 (UTC+8))
#37779 [Perf] Optimize glm4.xv VIT — ready — by KKSK-DON (merged: 2026-03-22 14:12 (UTC+8))
#37782 [Bugfix] Handle libsndfile sf_error(NULL) race condition in audio fallback — bug,rocm,ready,multi-modality — by AndreasKaratzas (merged: 2026-03-22 13:37 (UTC+8))
#37781 [CI] Skip ISAAC multimodal tests due to broken upstream HF model weights — rocm,ready,multi-modality — by AndreasKaratzas (merged: 2026-03-22 13:23 (UTC+8))
#37717 [ROCm][CI] Add large_gpu_mark to test_max_tokens_none for ROCm — rocm,ready — by AndreasKaratzas (merged: 2026-03-22 12:25 (UTC+8))

PRs Closed Without Merging

#36131 docs: add missing docstrings for SamplingParams methods — no labels — by Yuxin1999 (closed: 2026-03-23 10:57 (UTC+8))
#23331 [Hardware][POWER] Enable support for AIX on PowerPC — needs-rebase,ci/build,stale — by mehendarkarprajwal (closed: 2026-03-23 10:17 (UTC+8))
#28808 Reset V1 max_model_len after KV sizing — documentation,needs-rebase,stale,v1 — by m0nk111 (closed: 2026-03-23 10:16 (UTC+8))
#28812 v1: clamp max_model_len when KV cache exceeds GPU budget — needs-rebase,stale,v1 — by m0nk111 (closed: 2026-03-23 10:16 (UTC+8))
#29209 [Perf] Introduce a quick access of logprob without creating intermediate Logprob dictionary — stale — by Jialin (closed: 2026-03-23 10:16 (UTC+8))
#37809 [Tool Parser] Qwen3Coder: boundary-safe streaming fix (latest main) — tool-calling,qwen — by ec-jt (closed: 2026-03-23 05:24 (UTC+8))
#34632 [ROCm] Add MXFP4 inline dequant Triton kernel for RDNA4/gfx12 — new-model,rocm,needs-rebase — by laudney (closed: 2026-03-22 20:37 (UTC+8))
#37821 [MRV2] Add FULL CUDA graph support with PP — needs-rebase,v1,nvidia — by WoosukKwon (closed: 2026-03-23 03:11 (UTC+8))
#27238 v1/kv_cache_utils: Respect num_gpu_blocks_override in memory check — v1 — by khaled-wsa (closed: 2026-03-23 01:40 (UTC+8))
#34373 [Feature] Enable uniform KV cache allocation for multi-group HMA models — needs-rebase,v1,kv-connector — by Etelis (closed: 2026-03-22 23:43 (UTC+8))
#37582 Fix EP expert_map init for TPU: avoid dynamic-shape ops on XLA — no labels — by colin2328 (closed: 2026-03-22 23:31 (UTC+8))
#37187 [Tool Parser] Qwen3Coder: incremental string streaming, trailing newline fix, and whitespace content filter — tool-calling,qwen — by ec-jt (closed: 2026-03-22 22:21 (UTC+8))
#37807 yes — rocm,structured-output,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1,cpu,nvidia — by Yin-Liang (closed: 2026-03-22 21:57 (UTC+8))
#37311 Puzzle mtp — no labels — by netanel-haber (closed: 2026-03-22 20:39 (UTC+8))
#30030 [Feature] Add track_token_ids for Efficient Selective Token Logprobs Tracking — documentation,performance,new-model,rocm,structured-output,frontend,needs-rebase,ci/build,v1,multi-modality — by JaviS-Rei (closed: 2026-03-22 15:15 (UTC+8))