vLLM Development Activity Report - 2026-04-02
Time window: 2026-04-02 11:41 (UTC+8) ~ 2026-04-03 11:41 (UTC+8). Stats: 19 new Issues | 27 Issues closed | 69 new PRs | 31 PRs merged | 18 PRs closed without merging
📊 Daily Development Summary
Over the past 24 hours the vLLM community remained highly active, with 19 new Issues and 69 new PRs, 31 of which were merged. Development focused on new model support (notably Gemma 4) and on polishing the reasoning and tool-call parsers. Work also continued on AMD ROCm ecosystem optimizations, KV cache quantization, and the stability and performance of distributed inference (NIXL).
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was very high this cycle, spanning model support, performance optimization, and bug fixes.
Issues:
- #38851 [Feature]: ROCm Kimi K2.5 EAGLE3 MTP heads: @functionstackx asked the AMD team (via @hongxiayang, @chunfangamd, and others) whether AMD will follow NVIDIA and the community in providing open-source EAGLE3 MTP (Multi-Token Prediction) heads for the Kimi K2.5 model to enable speculative decoding. This bears directly on the AMD platform's competitiveness and feature parity in production. The issue is open for discussion.
PRs (merged):
- #38774 [ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle: submitted by AMD engineer @BowenBao. The first step in refactoring the Quark quantization toolchain: the MXFP4 MoE weight-quantization path now runs through a unified oracle and kernel backend, and the backend name "CK" is unified to "AITER", consistent with the CLI flag `--moe-backend aiter`. Foundational work for AMD's quantization stack.
- #38086 [ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 and Qwen/Qwen3.5-35B-A3B-FP8 tp=2: submitted by AMD engineer @vllmellm. Fixes the inability to run FP8 MoE models on RDNA4 (gfx12xx) GPUs (resolves #36105). The PR also includes performance-tuned configurations for specific Qwen FP8 MoE models on the Radeon PRO 9700, demonstrating AMD's optimization progress on mainstream models.
PRs (in progress):
- #38786 Splitting MLA attention Triton kernel: aims to improve GPU utilization at large context lengths by splitting the first stage of the MLA attention kernel. The contributor notes that on ROCm this brings a substantial speedup for models such as GLM-4.7-Flash on the MI325 (e.g., at a 65536-token context, throughput rises from 6.5 to 55.6 tokens/s).
- #38833 [ROCm] pad intermediate size for certain unquantized moe model gemma4 to use aiter fused moe for alignment: submitted by AMD engineer @hongxiayang. Pads the intermediate dimension of the Gemma 4 MoE model (expert_intermediate_size=704), which does not meet the tile-alignment requirement of the AITER CK GEMM and caused runtime errors. A necessary patch to keep new models running on AMD's optimized backends.
- #38817 [ROCm] Enable fused_silu_mul_block_quant on ROCm: follows #32996, properly enabling the new `fused_silu_mul_block_quant` kernel on ROCm instead of simply guarding it off.
- #38824 [ROCm] add head-dim 512 for ROCM_ATTN for gemma4 model support: submitted by AMD engineer @hongxiayang. Adds head-dimension 512 support to the `ROCM_ATTN` attention backend for the newly released Gemma 4 model (head dim 512). Without it, Gemma 4 fails with an unsupported-head-dim error when the user explicitly passes `--attention-backend ROCM_ATTN`.
- #38787 [GDN] Fused all preprocessing into one kernel for chunked stage: fuses four separate Triton kernels in the GDN chunked forward pass into one, eliminating repeated kernel-launch overhead. Tests on the MI350X show an average 1.9x speedup with a slight accuracy improvement.
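The alignment fix in #38833 boils down to rounding a dimension up to the backend's tile size. A minimal sketch of that idea; the actual tile size required by the AITER CK GEMM is not stated in this report, so the 256 below is purely illustrative:

```python
def pad_to_multiple(size: int, tile: int) -> int:
    """Round size up to the nearest multiple of tile (ceiling division)."""
    return ((size + tile - 1) // tile) * tile

# Gemma 4's expert_intermediate_size of 704 padded to a hypothetical
# tile requirement of 256:
padded = pad_to_multiple(704, 256)  # -> 768
```

The padded region is zero-filled in practice, so the extra columns contribute nothing to the GEMM result while satisfying the kernel's shape constraints.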
Summary: the AMD team and its contributors were very active this cycle, covering the full chain from low-level kernel optimization (MLA, GDN, new fused kernels) and quantization-toolchain refactoring (Quark) to ensuring that new models (Gemma 4) and mainstream models (Qwen FP8 MoE) run efficiently and reliably on AMD hardware.
💬 High-Engagement Discussions
- Issue #23837: "[Bug]: gpt-oss Intermittent 500 Internal Server Error…" (6 comments)
  - Core topic: users report that the gpt-oss model intermittently returns HTTP 500 with an empty response body when using a strict-JSON system prompt.
  - Views and debate: several users (@fabienric, @wonjerry, @daviden1013, @FrankTheTank9) confirmed the same problem, showing it is not an isolated incident. @tamastarjanyi provided a key clue: adjusting the `reasoning_effort` parameter changes or even eliminates the error rate (0% when set to `high`), suggesting the bug relates to the model's reasoning process or parsing logic.
  - Outcome: the issue was marked `stale` and closed after prolonged inactivity. The problem itself remains unresolved; its root cause (possibly in the harmony parser) still needs investigation.
- Issue #36105: "[ROCm][v0.16.0] Qwen3-VL-30B-A3B-Instruct-FP8 fails to start…" (10 comments)
  - Core topic: serving FP8 MoE models on ROCm fails at startup with "No FP8 MoE backend supports the deployment configuration".
  - Views and debate: @gbdjxgp first hit the problem on a Radeon 8060S. @MrHighVoltage then reported the same error on a mixed-GPU setup (RX 7900 XTX + AI Pro R9700) and on a pure R9700 system. @vllmellm (AMD) stepped in to investigate, noting that RDNA3/3.5 architectures (e.g. RX 7900 XTX, Radeon 8060S) lack native FP8 support, while the R9700 should support it. The discussion touched on Docker environments, mixed GPU installs, driver versions, and vLLM versions as possible factors.
  - Outcome: resolved by PR #38086, which enables the Triton FP8 MoE backend for RDNA4 with performance tuning. The discussion exposed the differences in FP8 support across AMD GPU architectures and the community's need for a clear hardware support matrix.
- PR #38826: "feat(models): implement Google Gemma 4 architecture support"
  - Core topic: comprehensive native support in vLLM for the Google Gemma 4 model family (MoE, multimodal, reasoning, tool calling).
  - Views and status: a large feature PR that was merged quickly. It introduces the `Gemma4ForCausalLM` model architecture, a video-processing pipeline, `Gemma4ReasoningParser`, and `Gemma4ToolParser`. Two related bugfix issues (#38855, #38837) appeared immediately after the merge, showing that new-feature integration requires rapid follow-up fixes, and reflecting how highly the community prioritizes keeping pace with flagship models.
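The reasoning-parser failures discussed above (#38855) come down to splitting model output on special tokens, which breaks if the tokenizer strips those tokens before the parser runs. A minimal sketch of the splitting idea, using hypothetical `<think>` tags rather than Gemma 4's actual special tokens:

```python
import re

# Hypothetical tag names; real models use their own special tokens, which
# may be stripped by the tokenizer before parsing -- the failure mode
# reported in #38855.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate (reasoning_content, content) from raw model output."""
    m = THINK_RE.search(text)
    if m is None:
        # If the special tokens were removed upstream, everything lands in
        # content and reasoning_content comes back empty.
        return "", text
    reasoning = m.group(1).strip()
    content = THINK_RE.sub("", text, count=1).strip()
    return reasoning, content
```

This also illustrates why streaming is harder than non-streaming: a closing tag can be split across deltas, which is exactly the class of bug fixed for Qwen3 in #38864.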
🔥 Hot Topics and Trends
- The Gemma 4 support wave: the dominant theme of this cycle. Beyond the merged full-support PR (#38826), there was a PR attempting support via the Transformers backend (#38828, closed as a duplicate) and several fixes and adaptations triggered by Gemma 4 (#38833, #38824, #38837, #38844, #38855). Issue #38868 flagged the minimum `transformers` library version required for Gemma 4.
- Refinement of reasoning and tool-call parsers: multiple issues and PRs address edge cases in reasoning-content parsing, such as Qwen3 mis-parsing when streaming is combined with stop sequences (#38789, PR #38864) and Gemma 4 reasoning parsing failing because special tokens are stripped (#38855). Tool-call parsers also had a parameter-passing bug (#38837, fixed by #38847) and edge-case handling issues (#38866).
- Performance optimization, especially quantization: KV cache quantization remains a sustained focus, with PR #38378 (INT8/FP8 per-token-head) followed by PR #38854 (INT4 per-token-head). Kernel-level optimizations such as fused output quantization in MLA attention (#36205) and fused quantization in `merge_attn_states` (#36518) were also merged.
- Stronger multimodal and video inference support: besides Gemma 4's multimodal support, support for the Cheers multimodal model was merged this period (#38788). Issue #38811 reflects user demand for Qwen3-VL video inference.
- Distributed inference and fault tolerance: bug fixes and improvements around NIXL, P2P NCCL, and other KV connectors continue, e.g. fixing request-ID mismatches in P/D disaggregation (#38816), handling negative token counts that crash metrics (#38839), and adding fault tolerance (elastic scale-down) for expert-parallel layers in MoE models (#38862).
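Per-token-head KV cache quantization (#38378, #38854) assigns one scale per (token, head) pair rather than per tensor, which tracks outliers much more tightly. A plain-Python sketch of symmetric INT8 scaling at that granularity; the real implementation is a Triton kernel and its memory layout may differ:

```python
def quantize_per_token_head(k):
    """k: nested list [num_tokens][num_heads][head_dim] of floats.
    Returns int8-range values plus one scale per (token, head)."""
    q, scales = [], []
    for token in k:
        q_t, s_t = [], []
        for head in token:
            # One symmetric scale per (token, head): amax / 127.
            amax = max(abs(x) for x in head) or 1e-8
            scale = amax / 127.0
            s_t.append(scale)
            q_t.append([max(-127, min(127, round(x / scale))) for x in head])
        q.append(q_t)
        scales.append(s_t)
    return q, scales
```

Dequantization is just `value * scale`, so the extra storage cost is one float per (token, head) on top of the 4x (INT8) or 8x (INT4) compression of the cache itself.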
🛠️ Key Technical Changes
- PR #38826: full Gemma 4 support: the landmark change, extending vLLM's supported model families to Google's latest flagship. Beyond model loading, it integrates dedicated reasoning and tool-call parsers plus a video-frame extraction pipeline, demonstrating vLLM's ability to support complex multimodal models.
- PR #38774: Quark quantization refactor: a key step in modernizing AMD's Quark quantization toolchain. The oracle architecture lays the groundwork for unifying and extending quantization-algorithm support on AMD platforms, improving the efficiency and flexibility of model deployment in the AMD ecosystem.
- PR #38786: MLA attention kernel split: targets the performance bottleneck at long context lengths. By enlarging the kernel grid to raise GPU utilization, it achieves order-of-magnitude speedups for specific models on ROCm, a textbook hardware-specific optimization.
- PR #38378: per-token-head KV cache quantization: implements finer-grained KV cache quantization (INT8/FP8 per token and head) in the Triton attention backend. Benchmarks show meaningful memory savings and throughput gains, an important direction for inference optimization.
- Issues #38839 & #38840: NIXL connector production problems: these two issues expose serious problems with NIXL-based P/D disaggregation in production, a metrics-computation crash and a handshake race condition. They highlight the stability challenges of distributed inference in complex deployments and are risk areas that operators and developers should watch closely.
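The crash in #38839 stems from Prometheus client counters rejecting negative increments, so a negative prompt-token delta (as can occur with external KV transfer) raises instead of being recorded. A hypothetical defensive wrapper, not vLLM's actual metrics code, illustrating a clamp-and-record approach:

```python
class SafeTokenCounter:
    """Clamp negative deltas instead of crashing the metrics path.

    Prometheus Counter.inc() raises on amount < 0, since counters are
    monotonic by definition. This sketch clamps the delta to zero and
    tracks the anomaly in a separate field so it stays observable.
    """

    def __init__(self):
        self.total = 0            # monotonic token count
        self.negative_deltas = 0  # how often a bad delta was seen

    def inc(self, amount: int) -> None:
        if amount < 0:
            self.negative_deltas += 1  # record, don't raise
            amount = 0
        self.total += amount
```

A robust fix would also address the root cause of the negative delta in the distributed accounting, but clamping keeps the serving path alive in the meantime.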
📈 Development Activity Observations
- Contributor diversity: beyond the core team (e.g. @vllmellm, @hongxiayang, @Isotr0py, @ywang96), contributors from AMD (@BowenBao, @gshtras), Intel (@mieshkiwrk), and other companies were active, indicating a healthy community ecosystem.
- Model support comes first: development activity clearly tilted toward newly released flagship models (Gemma 4); related PRs were prioritized and merged quickly.
- Review and merge throughput: 31 PRs merged within a single day indicates an efficient review and integration pipeline, though some complex PRs (e.g. #38786, #38831) hit merge conflicts and need timely rebases from their contributors.
- AMD contributions deepen: AMD-related work is no longer limited to basic enablement, extending into performance tuning, quantization integration, and new-model adaptation, signaling growing investment.
💡 Issues Worth Watching
- AMD MTP support roadmap: Issue #38851 directly asks about AMD's plans to open-source Kimi K2.5 MTP heads. Whether the AMD platform can match NVIDIA on advanced inference-optimization features hinges on an official AMD response or roadmap.
- FP8 hardware fragmentation: the #36105 discussion shows users are confused about FP8 support across AMD GPU architectures (RDNA3 vs RDNA4). Project documentation or tooling should provide clearer hardware-compatibility guidance.
- Maturing Gemma 4 reasoning parsing: the `Gemma4ReasoningParser` failure in non-streaming mode surfaced by Issue #38855, and the discussion around the related fix PR #38858, show that parser logic for newly integrated models needs more thorough testing, especially its interaction with tokenizer special-token handling.
- Monitoring and metrics for distributed inference: Issue #38839, where a Prometheus metric goes negative and crashes during cross-node KV transfer, is a latent hazard for production deployments. A robust solution is needed for this distributed accounting scenario.
📋 Appendix: Detailed Data
New Issues
- #38869 [Doc]: Can vLLM be deployed on the 5090? What CUDA configurations are people using? — documentation — by lmingze (created: 2026-04-03 10:32 (UTC+8))
- #38855 [Bug]: Gemma4 reasoning parser fails to separate reasoning_content — <channel> tokens stripped before parsing — no labels — by mabry1985 (created: 2026-04-03 06:44 (UTC+8))
- #38868 Gemma 4 support: model_type `gemma4` not recognized — no labels — by 2imi9 (created: 2026-04-03 10:15 (UTC+8))
- #38851 [Feature]: ROCm Kimi K2.5 EAGLE3 MTP heads — feature request,rocm — by functionstackx (created: 2026-04-03 05:42 (UTC+8))
- #38837 [Bug]: Gemma4ToolParser.init() missing `tools` parameter — 400 error on tool calls — no labels — by hospedales (created: 2026-04-03 03:19 (UTC+8))
- #38843 [Feature]: support unbacked in export — feature request — by laithsakka (created: 2026-04-03 03:46 (UTC+8))
- #38840 fix(nixl): Handshake race when same-node workers re-register with new engine IDs — no labels — by dmvevents (created: 2026-04-03 03:26 (UTC+8))
- #38839 fix(metrics): Prometheus counter crash on negative prompt tokens with external KV transfer — no labels — by dmvevents (created: 2026-04-03 03:26 (UTC+8))
- #38818 [Bug]: Error when running Devstral 2 — bug — by thomasmaindron (created: 2026-04-02 23:14 (UTC+8))
- #38829 [Bug]: Unsharded model cannot be loaded — bug — by shubhamjain0594 (created: 2026-04-03 01:01 (UTC+8))
- #38820 [Usage]: port question — usage — by CertainlyGo (created: 2026-04-02 23:37 (UTC+8))
- #38808 [Bug]: Disaggregate prefill script cannot work due to inconsistent request id between P node and D node. — bug — by Taeyang123456 (created: 2026-04-02 20:30 (UTC+8))
- #38811 [Usage]: Qwen3-VL inference on video complains of lack of metadata — usage — by carlos-havier (created: 2026-04-02 21:03 (UTC+8))
- #38809 [Feature]: How to disable chat template when using vllm serve — feature request — by sleepwalker2017 (created: 2026-04-02 20:33 (UTC+8))
- #38805 [Usage]: Does vllm support online infer for qwen3_asr_forced_aligner now? I only found offline example — usage — by wuchenhu98 (created: 2026-04-02 19:31 (UTC+8))
- #38797 [Usage]: I encountered an error while deploying deepseek-ai/DeepSeek-OCR-2 using vLLM. The logs show: — usage — by xuexigpt (created: 2026-04-02 16:55 (UTC+8))
- #38793 [Feature]: Does P2pNcclConnector support PD separation for the GLM5 model dsa? Testing on the 0.15.1 branch has failed. — feature request — by kahakuka (created: 2026-04-02 16:10 (UTC+8))
- #38790 [Bug]: shm_broadcast.py raise cancelled error on serving BAGEL-7B-MoT — bug — by chickeyton (created: 2026-04-02 14:34 (UTC+8))
- #38789 [Bug]: Qwen3ReasoningParser leaks </think> into content when streaming with `stop` sequences (Related to #17468) — bug — by fsytsuta (created: 2026-04-02 14:24 (UTC+8))
Closed Issues
- #37333 [Bug]: Gemma-3 specific heterogeneous TP failures with PD disagg — bug — by yzong-rh (closed: 2026-04-03 11:18 (UTC+8))
- #23222 [Feature][Responses API] Stream Function Call — feature request,stale — by simon-mo (closed: 2026-04-03 10:17 (UTC+8))
- #22814 [Bug]: Excess system memory usage during GGUF loading — bug,stale — by TNT3530 (closed: 2026-04-03 10:17 (UTC+8))
- #23837 [Bug]: gpt-oss Intermittent 500 Internal Server Error with empty response body when using strict JSON "function router" system prompt — bug,stale,gpt-oss — by bkdoeng (closed: 2026-04-03 10:17 (UTC+8))
- #24584 [Bug]: vLLM Runtime Fails to Honor Context Cancellation During Streaming — bug,stale — by hardikmenger (closed: 2026-04-03 10:17 (UTC+8))
- #38868 Gemma 4 support: model_type `gemma4` not recognized — no labels — by 2imi9 (closed: 2026-04-03 10:17 (UTC+8))
- #27059 [Feature]: Batch-invariant Inference Support for VLMs — feature request,stale — by lionelfeng (closed: 2026-04-03 10:16 (UTC+8))
- #27607 [Bug]: `KeyError: 'layers.47.mlp.experts.w2_weight'` loading a NVFP4 + BF16 mixed-precision `llm-compressor` model — bug,stale — by BenasdTW (closed: 2026-04-03 10:16 (UTC+8))
- #27808 [Bug]: `Eagle` spec decoding not working with Gemma2 — bug,stale — by jalola (closed: 2026-04-03 10:16 (UTC+8))
- #28060 [Bug]: Qwen3VL doesn't recognise upstream `flash-attn` — bug,stale — by lgeiger (closed: 2026-04-03 10:16 (UTC+8))
- #29360 [Bug]: Qwen3-Coder-480B-FP8 running on 8*H20, deepgemm warmup OOM — bug,stale — by Cranshine (closed: 2026-04-03 10:16 (UTC+8))
- #29862 [Model loading error]: quantized version of Llama-3.2-11B-Vision-Instruct — stale — by mbenchwevioo (closed: 2026-04-03 10:16 (UTC+8))
- #29875 [Usage]: Is there a way to inject the grammar into the docker directly — usage,stale — by chwundermsft (closed: 2026-04-03 10:16 (UTC+8))
- #29939 [Bug]: 0.8.5.post1 Partial reasoning is empty — bug,stale — by FangdongWang (closed: 2026-04-03 10:15 (UTC+8))
- #29944 [Usage]: It seems that the prefix cache has not brought about any performance benefits. — usage,stale — by wenba0 (closed: 2026-04-03 10:15 (UTC+8))
- #29958 [CI Failure]: deepseek-ai/deepseek-vl2-tiny `CUBLAS_STATUS_EXECUTION_FAILED` — stale,ci-failure — by NickLucche (closed: 2026-04-03 10:15 (UTC+8))
- #29967 [Feature]: upgrade to CUDA 13 in docker image — feature request,stale — by (closed: 2026-04-03 10:15 (UTC+8))
- #30007 [Feature]: Propagate served-model-name to LMCache metrics — feature request,stale — by mustafayildirim (closed: 2026-04-03 10:15 (UTC+8))
- #33097 [Feature]: Fuse FP8 output quantization into merge_attn_states (DCP / cascade paths) — feature request — by sachinkumarsingh092 (closed: 2026-04-03 09:47 (UTC+8))
- #36615 [Bug]: unknown error trying to run vllm v0.17.0 with ROCm on Radeon 8060S (gfx1151) — bug,rocm — by anomaly256 (closed: 2026-04-03 07:31 (UTC+8))
- #38837 [Bug]: Gemma4ToolParser.init() missing `tools` parameter — 400 error on tool calls — no labels — by hospedales (closed: 2026-04-03 05:35 (UTC+8))
- #36105 [ROCm][v0.16.0] Qwen3-VL-30B-A3B-Instruct-FP8 fails to start with NotImplementedError: No FP8 MoE backend supports the deployment configuration — bug,rocm — by gbdjxgp (closed: 2026-04-02 16:13 (UTC+8))
- #38843 [Feature]: support unbacked in export — feature request — by laithsakka (closed: 2026-04-03 03:47 (UTC+8))
- #38286 [Feature]: Batch invariance on 3090 — feature request — by YM2132 (closed: 2026-04-02 21:30 (UTC+8))
- #38790 [Bug]: shm_broadcast.py raise cancelled error on serving BAGEL-7B-MoT — bug — by chickeyton (closed: 2026-04-02 14:40 (UTC+8))
- #38543 [Bug]: Failed to call /chat/completions after /tokenize for same multimodal query — bug — by sergey-zinchenko (closed: 2026-04-02 12:14 (UTC+8))
- #37748 [Feature]: Is there docker image support vllm and rocm 7.1+ — feature request,rocm — by BigFaceBoy (closed: 2026-04-02 11:46 (UTC+8))
New PRs
- #38872 [Misc] Remove sys.exit from Gemma4 multimodal processor — no labels — by Isotr0py (created: 2026-04-03 11:34 (UTC+8))
- #38871 [9/n] Migrate to torch stable ABI — rocm,needs-rebase,ci/build,nvidia — by mikaylagawarecki (created: 2026-04-03 11:31 (UTC+8))
- #38841 [8/n] Migrate to torch stable ABI — rocm,needs-rebase,ci/build,nvidia — by mikaylagawarecki (created: 2026-04-03 03:32 (UTC+8))
- #38784 [XPU][CI] Add misc cases on Intel GPU in CI — intel-gpu,ci/build — by zxd1997066 (created: 2026-04-02 11:58 (UTC+8))
- #38865 [Refactor] Improve indexer decode path metadata preparation — v1 — by zyongye (created: 2026-04-03 09:33 (UTC+8))
- #38870 [Bugfix] Fix DSV32 weight loading — bug,ready,deepseek — by zyongye (created: 2026-04-03 10:53 (UTC+8))
- #38807 [vLLM IR] add `import_ir_kernels()` to support OOT platforms — no labels — by wxsIcey (created: 2026-04-02 20:22 (UTC+8))
- #38867 [Bugfix] Fix MoE routed input transform when using DeepEP LL — bug — by bnellnm (created: 2026-04-03 10:14 (UTC+8))
- #38795 [Bugfix] Fix EP precision for Qwen3.5 — bug,qwen — by USTCKAY (created: 2026-04-02 16:22 (UTC+8))
- #38824 [ROCm] add head-dim 512 for ROCM_ATTN for gemma4 model support — documentation,rocm,ready,v1 — by hongxiayang (created: 2026-04-03 00:08 (UTC+8))
- #38866 [fix]: deepseekv31 tool parser: `"` and `}` may arrive in separate deltas. — tool-calling,deepseek — by pascal9585 (created: 2026-04-03 10:02 (UTC+8))
- #38827 feat: add max_tokens_per_doc in rerank request (rebase of #33315) — frontend — by jefp (created: 2026-04-03 00:57 (UTC+8))
- #38833 [ROCm] pad intermediate size for certain unquantized moe model gemma4 to use aiter fused moe for alignment — rocm — by hongxiayang (created: 2026-04-03 01:21 (UTC+8))
- #38858 [Bugfix] Fix Gemma4 non-streaming reasoning parsing — bug,frontend — by jacobzhang22 (created: 2026-04-03 08:00 (UTC+8))
- #38864 [Bugfix] Fix Qwen3 </think> leak in streaming with stop sequences — bug,qwen — by jacobzhang22 (created: 2026-04-03 09:28 (UTC+8))
- #38859 [Bugfix] Re-enable Renormalize routing for TRT-LLM MoE experts — bug,ready,nvidia — by yzong-rh (created: 2026-04-03 08:02 (UTC+8))
- #38828 [Model] support Gemma 4 — documentation,ci/build,multi-modality — by effortprogrammer (created: 2026-04-03 00:58 (UTC+8))
- #38798 [vLLM IR] rms_norm_gated — rocm,nvidia — by wxsIcey (created: 2026-04-02 17:04 (UTC+8))
- #38863 Feat/disk offloading — v1,kv-connector — by rarepepi (created: 2026-04-03 08:55 (UTC+8))
- #38849 [Bug] Fix TypeError when hf_config.architectures is None during model loading — bug,intel-gpu,ready — by TihoElek (created: 2026-04-03 04:54 (UTC+8))
- #38862 [EP] Fault tolerance: automatic elastic scale-down on DP engine death — v1 — by tzulingk (created: 2026-04-03 08:47 (UTC+8))
- #38838 [CI] Fix `test_nixl_connector` — ready,v1,kv-connector — by MatthewBonanni (created: 2026-04-03 03:19 (UTC+8))
- #38832 [Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 — bug,ready,qwen — by vadiklyutiy (created: 2026-04-03 01:20 (UTC+8))
- #38860 [Parser] Pass request.tools to tool parser — frontend — by sfeng33 (created: 2026-04-03 08:06 (UTC+8))
- #38861 Omp improvement from PR#36487 — ci/build,v1,cpu — by louie-tsai (created: 2026-04-03 08:38 (UTC+8))
- #38856 [LMCache] vLLM Block Allocation Event — kv-connector — by Oasis-Git (created: 2026-04-03 06:54 (UTC+8))
- #38857 OMP init re-implementation — ci/build,v1,cpu — by louie-tsai (created: 2026-04-03 07:43 (UTC+8))
- #38848 [Bugfix] Fix Qwen3 tool parser for Responses API tools — bug,tool-calling,qwen — by sfeng33 (created: 2026-04-03 04:46 (UTC+8))
- #38854 [Feature] KV cache per-token-head Int4 Quantization — documentation,ready,v1 — by JartX (created: 2026-04-03 06:27 (UTC+8))
- #38825 [Intel][Triton] Support `round_int8` for Intel backend — intel-gpu,ready — by mieshkiwrk (created: 2026-04-03 00:35 (UTC+8))
- #38852 fix a cpu distributed ci testing issue — ci/build,cpu — by louie-tsai (created: 2026-04-03 06:19 (UTC+8))
- #38850 [Perf] Optimize DCP communication with reusable collective scratch buffers, 1.5%~4% E2E throughput improvement — ready,v1,nvidia — by yewentao256 (created: 2026-04-03 05:23 (UTC+8))
- #38853 [Bug] Fix workspace manager `_current_workspaces` size — bug,v1 — by yewentao256 (created: 2026-04-03 06:21 (UTC+8))
- #38815 [Quant] add CompressedTensorsW8A8Mxfp8 for linear and MoE layers — ready — by EdalatiAli (created: 2026-04-02 22:41 (UTC+8))
- #38844 [Bugfix]: Enable Gemma4ForCasualLM to load lora adapters correctly — bug — by ShubyM (created: 2026-04-03 03:57 (UTC+8))
- #38835 [Attention] relax the head dim 512 and paged kv for sm90+FA4 — ci/build,v1 — by IwakuraRein (created: 2026-04-03 01:51 (UTC+8))
- #38847 [Bugfix]: Fix Gemma4ToolParser.init() missing `tools` parameter — bug,ready,tool-calling — by hospedales (created: 2026-04-03 04:12 (UTC+8))
- #38845 Feature/int4 per token — documentation,v1 — by JartX (created: 2026-04-03 03:59 (UTC+8))
- #38846 Replace shape_invariants with simpler approach in dynamic_arg_dims utilizing shape_id property — llama,qwen — by laithsakka (created: 2026-04-03 04:02 (UTC+8))
- #38814 [FlashAttention] Symlink FA4 instead of copying when using `VLLM_FLASH_ATTN_SRC_DIR` — performance,ci/build — by MatthewBonanni (created: 2026-04-02 22:27 (UTC+8))
- #38836 [CI] Fix: pass string cache_dtype in test_register_kv_caches — ready,v1,kv-connector — by ZhanqiuHu (created: 2026-04-03 02:48 (UTC+8))
- #38842 [Refactor] Remove unused dead code — speculative-decoding,ready,v1 — by yewentao256 (created: 2026-04-03 03:36 (UTC+8))
- #38792 [CI] Add flashinfer.py to attention test source deps — ready,ci/build — by stecasta (created: 2026-04-02 15:48 (UTC+8))
- #38819 [Attention][MLA] Re-enable FA4 as default MLA prefill backend — ready — by MatthewBonanni (created: 2026-04-02 23:17 (UTC+8))
- #38834 [Bugfix] Fix oversized piecewise CUDA graphs for Gemma3n cross-decoder — bug,ready,nvidia — by LucasWilkinson (created: 2026-04-03 01:24 (UTC+8))
- #38831 [ModelRunnerV2][Hybrid model] Support Hybrid Model in ModelRunner V2 — v1 — by MengqingCao (created: 2026-04-03 01:13 (UTC+8))
- #38826 feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) — new-model,intel-gpu,ready,multi-modality,tool-calling — by lucianommartins (created: 2026-04-03 00:48 (UTC+8))
- #38823 [Attention] Allow using system FA4 — v1 — by IwakuraRein (created: 2026-04-02 23:55 (UTC+8))
- #38791 [Bugfix] Fix test mocks after SM100 restriction in #38730 — bug,ready,nvidia — by stecasta (created: 2026-04-02 15:43 (UTC+8))
- #38822 [Attention] Add head_dim=512 support for FlashInfer trtllm attention backend — v1,nvidia — by djmmoss (created: 2026-04-02 23:49 (UTC+8))
- #38800 [New Model]: jinaai/jina-reranker-v3 — new-model,frontend,needs-rebase — by noooop (created: 2026-04-02 17:12 (UTC+8))
- #38817 [ROCm] Enable fused_silu_mul_block_quant on ROCm — rocm,ready,ci/build — by gshtras (created: 2026-04-02 23:07 (UTC+8))
- #38821 [Transformers v5] Fix Tarsier2Config text_config nesting (#38736) — no labels — by Zelys-DFKH (created: 2026-04-02 23:48 (UTC+8))
- #38813 [Fix] Align MoRIIO registration format with vLLM router and handle de… — documentation,kv-connector — by mpashkovskii (created: 2026-04-02 22:18 (UTC+8))
- #38786 Splitting MLA attention Triton kernel — v1 — by ekuznetsov139 (created: 2026-04-02 13:32 (UTC+8))
- #38816 Fix/p2p request id mismatch — v1,kv-connector — by groot-code24 (created: 2026-04-02 23:05 (UTC+8))
- #38799 [EASY] Drop duplicate KV-cache initialization — ready — by namgyu-youn (created: 2026-04-02 17:10 (UTC+8))
- #38812 skip — rocm,v1 — by qiruixu666-source (created: 2026-04-02 21:59 (UTC+8))
- #38804 Fix sarvam forward compatibility with transformers v5 — verified — by Vikrantpalle (created: 2026-04-02 19:23 (UTC+8))
- #38802 Change `trust_remote_code` default in test runners — multi-modality,ready-run-all-tests — by hmellor (created: 2026-04-02 18:22 (UTC+8))
- #38788 [Model] Add support for Cheers multimodal model — documentation,new-model,ready — by bingshuailiu (created: 2026-04-02 14:22 (UTC+8))
- #38810 [LMCache][MP] optimize save when mla enabled — kv-connector — by chunxiaozheng (created: 2026-04-02 20:40 (UTC+8))
- #38803 [Frontend] Fixed tool parsing errors for MinimaxM2 — tool-calling — by Csrayz (created: 2026-04-02 18:52 (UTC+8))
- #38806 Revert "[Perf] DSV3.2 Indexer Fused Weights Projection" (#38684) — deepseek — by vllm-agent (created: 2026-04-02 20:05 (UTC+8))
- #38794 [Perf] Reduce H2D pageable memory copies — v1 — by jackcfwang (created: 2026-04-02 16:10 (UTC+8))
- #38801 [CPU] Fix chained comparison static_assert for Clang 21+ — cpu — by ricky-chaoju (created: 2026-04-02 17:37 (UTC+8))
- #38796 [XPU] Fix MoE hang in test_external_lb_dp by handling restricted device visibility — intel-gpu — by 1643661061leo (created: 2026-04-02 16:41 (UTC+8))
- #38787 [GDN] Fused all preprocessing into one kernel for chunked stage — rocm,ci/build — by a-sidorova (created: 2026-04-02 14:00 (UTC+8))
- #38785 [Bugfix] Fix Anthropic adapter missing content field on assistant tool-call messages — bug,frontend — by Lidang-Jiang (created: 2026-04-02 13:18 (UTC+8))
Merged PRs
- #38774 [ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle — rocm,ready — by BowenBao (merged: 2026-04-03 11:29 (UTC+8))
- #37566 refactor hard coded device string in test files under tests/v1 and tests/lora — speculative-decoding,ready,v1,nvidia — by wincent8 (merged: 2026-04-03 11:21 (UTC+8))
- #38460 [Perf] Batch KV cache swap copies via cuMemcpyBatchAsync — ready,v1 — by Etelis (merged: 2026-04-03 11:13 (UTC+8))
- #36518 [Kernel] Fuse FP8 output quantization into merge_attn_states — documentation,performance,rocm,ready,ci/build,v1,llama,qwen,nvidia — by carlyou (merged: 2026-04-03 09:47 (UTC+8))
- #36205 [mla] Support fused FP8/NVFP4 output quantization in MLA attention (#35792) — documentation,performance,ready,ci/build,v1 — by carlyou (merged: 2026-04-03 09:16 (UTC+8))
- #33657 [XPU] Initial support for GDN attention on Qwen3-next/Qwen3.5 — rocm,intel-gpu,ready,v1,qwen,nvidia — by yma11 (merged: 2026-04-03 08:59 (UTC+8))
- #38838 [CI] Fix `test_nixl_connector` — ready,v1,kv-connector — by MatthewBonanni (merged: 2026-04-03 08:52 (UTC+8))
- #38832 [Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 — bug,ready,qwen — by vadiklyutiy (merged: 2026-04-03 08:45 (UTC+8))
- #38510 [New Model]: add support for telechat3 — documentation,new-model,ready,ci/build — by 1096125073 (merged: 2026-04-03 08:26 (UTC+8))
- #37416 [Kernel] Mamba support different layout for Conv state — ready,qwen — by NickLucche (merged: 2026-04-03 07:50 (UTC+8))
- #38847 [Bugfix]: Fix Gemma4ToolParser.init() missing `tools` parameter — bug,ready,tool-calling — by hospedales (merged: 2026-04-03 05:35 (UTC+8))
- #38836 [CI] Fix: pass string cache_dtype in test_register_kv_caches — ready,v1,kv-connector — by ZhanqiuHu (merged: 2026-04-03 03:42 (UTC+8))
- #38792 [CI] Add flashinfer.py to attention test source deps — ready,ci/build — by stecasta (merged: 2026-04-03 03:24 (UTC+8))
- #38826 feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) — new-model,intel-gpu,ready,multi-modality,tool-calling — by lucianommartins (merged: 2026-04-03 02:13 (UTC+8))
- #38062 Bump helion dependency from 0.3.2 to 0.3.3 — ready,ci/build — by gmagogsfm (merged: 2026-04-03 01:59 (UTC+8))
- #38791 [Bugfix] Fix test mocks after SM100 restriction in #38730 — bug,ready,nvidia — by stecasta (merged: 2026-04-03 01:12 (UTC+8))
- #38690 [FA4] Update flash-attention to latest upstream FA4 — ready,ci/build,nvidia — by LucasWilkinson (merged: 2026-04-03 01:02 (UTC+8))
- #38427 [Bugfix] Enable batch-invariant Triton matmul on all Ampere GPUs (SM 8x) — bug,ready — by YM2132 (merged: 2026-04-02 21:29 (UTC+8))
- #38292 [CI][ROCm] Add gpt-oss w4a8 in CI — rocm,ready,gpt-oss — by BowenBao (merged: 2026-04-03 00:06 (UTC+8))
- #38620 [Frontend] Re-enable running MaxSim on GPU — frontend,ready,v1 — by noooop (merged: 2026-04-03 00:03 (UTC+8))
- #33529 Triton MLA perf fixes — performance,ready,v1,deepseek — by koush (merged: 2026-04-02 21:40 (UTC+8))
- #38788 [Model] Add support for Cheers multimodal model — documentation,new-model,ready — by bingshuailiu (merged: 2026-04-02 21:01 (UTC+8))
- #30518 Don't compile vision encoder for Transformers backend — ready — by hmellor (merged: 2026-04-02 20:42 (UTC+8))
- #38378 [Feature] KV cache per-token-head INT8/FP8 quantization — documentation,rocm,ready,v1,quantization — by JartX (merged: 2026-04-02 20:13 (UTC+8))
- #37813 [Perf] fuse kernels in gdn — ready,qwen — by ZJY0516 (merged: 2026-04-02 19:52 (UTC+8))
- #38086 [ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 and Qwen/Qwen3.5-35B-A3B-FP8 tp=2 — performance,rocm,ready,qwen — by vllmellm (merged: 2026-04-02 16:13 (UTC+8))
- #38770 [CPU] Support gelu act in cpu_fused_moe — ready,v1,cpu — by bigPYJ1151 (merged: 2026-04-02 14:14 (UTC+8))
- #38778 Revert "[Kernel] Add gpt-oss Router GEMM kernel (#37205)" — performance,ready,ci/build,gpt-oss — by xyang16 (merged: 2026-04-02 13:02 (UTC+8))
- #38743 [Kernel] [Helion] Use warning_once in get_gpu_name to prevent log spam — ready — by gmagogsfm (merged: 2026-04-02 12:30 (UTC+8))
- #38750 [ROCm][Bugfix] Fix ROCm runtime failure due to missing symbol — bug,rocm,ready — by gshtras (merged: 2026-04-02 12:30 (UTC+8))
- #38545 [Bugfix] Use dedicated MM processor cache in /tokenize to prevent sender-cache pollution — bug,frontend,ready,multi-modality — by sergey-zinchenko (merged: 2026-04-02 12:14 (UTC+8))
PRs Closed Without Merging
- #38773 Revert "[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang" (#38730) — bug,documentation,nvidia — by vllm-agent (closed: 2026-04-03 11:27 (UTC+8))
- #24618 [Core] Reuse ZMQ to level trigger local ShmRingBuffer events. — stale — by noobpwnftw (closed: 2026-04-03 10:17 (UTC+8))
- #28801 Fixes 28713 — frontend,stale,tool-calling — by oneraghavan (closed: 2026-04-03 10:16 (UTC+8))
- #29498 fix: Add validation for tool requests that the tool is available — frontend,needs-rebase,stale,gpt-oss — by bestony (closed: 2026-04-03 10:16 (UTC+8))
- #38858 [Bugfix] Fix Gemma4 non-streaming reasoning parsing — bug,frontend — by jacobzhang22 (closed: 2026-04-03 09:49 (UTC+8))
- #38828 [Model] support Gemma 4 — documentation,ci/build,multi-modality — by effortprogrammer (closed: 2026-04-03 01:51 (UTC+8))
- #35898 [XPU][QWEN3_NEXT] remove fla hardcode to cuda — intel-gpu,needs-rebase,qwen,nvidia — by xuechendi (closed: 2026-04-03 08:59 (UTC+8))
- #38857 OMP init re-implementation — ci/build,v1,cpu — by louie-tsai (closed: 2026-04-03 08:07 (UTC+8))
- #38845 Feature/int4 per token — documentation,v1 — by JartX (closed: 2026-04-03 05:15 (UTC+8))
- #38846 Replace shape_invariants with simpler approach in dynamic_arg_dims utilizing shape_id property — llama,qwen — by laithsakka (closed: 2026-04-03 04:02 (UTC+8))
- #38650 [Bugfix] Enable MTP for the official Qwen3.5 NVFP4 checkpoint — bug,qwen — by mmangkad (closed: 2026-04-03 02:24 (UTC+8))
- #38823 [Attention] Allow using system FA4 — v1 — by IwakuraRein (closed: 2026-04-03 01:57 (UTC+8))
- #38762 (alternative to #37891) [ROCm] Fix rocm allreduce rmsnorm fusion for Deepseek models — rocm,needs-rebase,ci/build,deepseek,nvidia — by rbrugaro-amd (closed: 2026-04-02 23:12 (UTC+8))
- #38767 [Transformers v5] Add SarvamMLAConfig to fix SarvamMLAForCausalLM (#38734) — no labels — by Zelys-DFKH (closed: 2026-04-02 19:30 (UTC+8))
- #36893 [Feature] Kvcache Int8 per-token scale on TRITON_ATTN continue of #34327 thanks EricccYang — documentation,performance,rocm,frontend,ready,needs-rebase,ci/build,v1,multi-modality,gpt-oss — by JartX (closed: 2026-04-02 22:37 (UTC+8))
- #37891 [ROCm][Perf] Add fused AllReduce+RMSNorm for DeepSeek on MI355X — rocm,needs-rebase,ci/build,deepseek,nvidia — by attila-dusnoki-htec (closed: 2026-04-02 14:36 (UTC+8))
- #38715 [Bugfix] Fix intra-step KV block corruption from stale prefix cache hits — bug,v1 — by KrxGu (closed: 2026-04-02 17:22 (UTC+8))
- #36480 bugfix(dcp, gdn): disabling DCP semantics for linear-attention KV/state groups — bug,v1,nvidia — by pisceskkk (closed: 2026-04-02 16:23 (UTC+8))