vLLM Development Activity Report - 2026-04-02
Time window: 2026-04-02 11:41 (UTC+8) ~ 2026-04-03 11:41 (UTC+8). Stats: 19 new Issues | 27 Issues closed | 69 new PRs | 31 PRs merged | 18 PRs closed without merging
📊 Daily Development Summary
Over the past 24 hours the vLLM community remained highly active, with 19 new Issues and 69 new PRs, 31 of which were merged. Development focused on new model support (notably Gemma 4) and on polishing the reasoning and tool-call parsers. Work also continued on AMD ROCm ecosystem optimizations, KV cache quantization, and the stability and performance of distributed inference (NIXL).
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was very high this cycle, spanning model support, performance optimization, and bug fixes.
Issues:
- #38851 [Feature]: ROCm Kimi K2.5 EAGLE3 MTP heads: @functionstackx asked the AMD team (via @hongxiayang, @chunfangamd, and others) whether AMD will follow NVIDIA and the community in providing open-source EAGLE3 MTP (Multi-Token Prediction) heads for the Kimi K2.5 model to enable speculative decoding. This bears directly on the AMD platform's competitiveness and feature parity in production. The issue is open for discussion.
PRs (merged):
- #38774 [ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle: submitted by AMD engineer @BowenBao. The first step in refactoring the Quark quantization toolchain: the MXFP4 MoE weight-quantization path now runs through a unified oracle and kernel backend, and the backend name "CK" is unified to "AITER", consistent with the CLI flag `--moe-backend aiter`. Foundational work for AMD's quantization stack.
- #38086 [ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 and Qwen/Qwen3.5-35B-A3B-FP8 tp=2: submitted by AMD engineer @vllmellm. Fixes the inability to run FP8 MoE models on RDNA4 (gfx12xx) GPUs (resolves #36105). The PR also includes performance-tuned configurations for specific Qwen FP8 MoE models on the Radeon PRO 9700, demonstrating AMD's optimization progress on mainstream models.
PRs (in progress):
- #38786 Splitting MLA attention Triton kernel: aims to improve GPU utilization at large context lengths by splitting the first stage of the MLA attention kernel. The contributor notes that on ROCm this brings a substantial speedup for models such as GLM-4.7-Flash on the MI325 (e.g., at a 65536-token context, throughput rises from 6.5 to 55.6 tokens/s).
- #38833 [ROCm] pad intermediate size for certain unquantized moe model gemma4 to use aiter fused moe for alignment: submitted by AMD engineer @hongxiayang. Pads the intermediate dimension of the Gemma 4 MoE model (expert_intermediate_size=704), which does not meet the tile-alignment requirement of the AITER CK GEMM and caused runtime errors. A necessary patch to keep new models running on AMD's optimized backends.
- #38817 [ROCm] Enable fused_silu_mul_block_quant on ROCm: follows #32996, properly enabling the new `fused_silu_mul_block_quant` kernel on ROCm instead of simply guarding it off.
- #38824 [ROCm] add head-dim 512 for ROCM_ATTN for gemma4 model support: submitted by AMD engineer @hongxiayang. Adds head-dimension 512 support to the `ROCM_ATTN` attention backend for the newly released Gemma 4 model (head dim 512). Without it, Gemma 4 fails with an unsupported-head-dim error when the user explicitly passes `--attention-backend ROCM_ATTN`.
- #38787 [GDN] Fused all preprocessing into one kernel for chunked stage: fuses four separate Triton kernels in the GDN chunked forward pass into one, eliminating repeated kernel-launch overhead. Tests on the MI350X show an average 1.9x speedup with a slight accuracy improvement.
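The alignment fix in #38833 boils down to rounding a dimension up to the backend's tile size. A minimal sketch of that idea; the actual tile size required by the AITER CK GEMM is not stated in this report, so the 256 below is purely illustrative:

```python
def pad_to_multiple(size: int, tile: int) -> int:
    """Round size up to the nearest multiple of tile (ceiling division)."""
    return ((size + tile - 1) // tile) * tile

# Gemma 4's expert_intermediate_size of 704 padded to a hypothetical
# tile requirement of 256:
padded = pad_to_multiple(704, 256)  # -> 768
```

The padded region is zero-filled in practice, so the extra columns contribute nothing to the GEMM result while satisfying the kernel's shape constraints.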
Summary: the AMD team and its contributors were very active this cycle, covering the full chain from low-level kernel optimization (MLA, GDN, new fused kernels) and quantization-toolchain refactoring (Quark) to ensuring that new models (Gemma 4) and mainstream models (Qwen FP8 MoE) run efficiently and reliably on AMD hardware.
💬 High-Engagement Discussions
- Issue #23837: "[Bug]: gpt-oss Intermittent 500 Internal Server Error…" (6 comments)
  - Core topic: users report that the gpt-oss model intermittently returns HTTP 500 with an empty response body when using a strict-JSON system prompt.
  - Views and debate: several users (@fabienric, @wonjerry, @daviden1013, @FrankTheTank9) confirmed the same problem, showing it is not an isolated incident. @tamastarjanyi provided a key clue: adjusting the `reasoning_effort` parameter changes or even eliminates the error rate (0% when set to `high`), suggesting the bug relates to the model's reasoning process or parsing logic.
  - Outcome: the issue was marked `stale` and closed after prolonged inactivity. The problem itself remains unresolved; its root cause (possibly in the harmony parser) still needs investigation.
- Issue #36105: "[ROCm][v0.16.0] Qwen3-VL-30B-A3B-Instruct-FP8 fails to start…" (10 comments)
  - Core topic: serving FP8 MoE models on ROCm fails at startup with "No FP8 MoE backend supports the deployment configuration".
  - Views and debate: @gbdjxgp first hit the problem on a Radeon 8060S. @MrHighVoltage then reported the same error on a mixed-GPU setup (RX 7900 XTX + AI Pro R9700) and on a pure R9700 system. @vllmellm (AMD) stepped in to investigate, noting that RDNA3/3.5 architectures (e.g. RX 7900 XTX, Radeon 8060S) lack native FP8 support, while the R9700 should support it. The discussion touched on Docker environments, mixed GPU installs, driver versions, and vLLM versions as possible factors.
  - Outcome: resolved by PR #38086, which enables the Triton FP8 MoE backend for RDNA4 with performance tuning. The discussion exposed the differences in FP8 support across AMD GPU architectures and the community's need for a clear hardware support matrix.
- PR #38826: "feat(models): implement Google Gemma 4 architecture support"
  - Core topic: comprehensive native support in vLLM for the Google Gemma 4 model family (MoE, multimodal, reasoning, tool calling).
  - Views and status: a large feature PR that was merged quickly. It introduces the `Gemma4ForCausalLM` model architecture, a video-processing pipeline, `Gemma4ReasoningParser`, and `Gemma4ToolParser`. Two related bugfix issues (#38855, #38837) appeared immediately after the merge, showing that new-feature integration requires rapid follow-up fixes, and reflecting how highly the community prioritizes keeping pace with flagship models.
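The reasoning-parser failures discussed above (#38855) come down to splitting model output on special tokens, which breaks if the tokenizer strips those tokens before the parser runs. A minimal sketch of the splitting idea, using hypothetical `<think>` tags rather than Gemma 4's actual special tokens:

```python
import re

# Hypothetical tag names; real models use their own special tokens, which
# may be stripped by the tokenizer before parsing -- the failure mode
# reported in #38855.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate (reasoning_content, content) from raw model output."""
    m = THINK_RE.search(text)
    if m is None:
        # If the special tokens were removed upstream, everything lands in
        # content and reasoning_content comes back empty.
        return "", text
    reasoning = m.group(1).strip()
    content = THINK_RE.sub("", text, count=1).strip()
    return reasoning, content
```

This also illustrates why streaming is harder than non-streaming: a closing tag can be split across deltas, which is exactly the class of bug fixed for Qwen3 in #38864.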
🔥 Hot Topics and Trends
- The Gemma 4 support wave: the dominant theme of this cycle. Beyond the merged full-support PR (#38826), there was a PR attempting support via the Transformers backend (#38828, closed as a duplicate) and several fixes and adaptations triggered by Gemma 4 (#38833, #38824, #38837, #38844, #38855). Issue #38868 flagged the minimum `transformers` library version required for Gemma 4.
- Refinement of reasoning and tool-call parsers: multiple issues and PRs address edge cases in reasoning-content parsing, such as Qwen3 mis-parsing when streaming is combined with stop sequences (#38789, PR #38864) and Gemma 4 reasoning parsing failing because special tokens are stripped (#38855). Tool-call parsers also had a parameter-passing bug (#38837, fixed by #38847) and edge-case handling issues (#38866).
- Performance optimization, especially quantization: KV cache quantization remains a sustained focus, with PR #38378 (INT8/FP8 per-token-head) followed by PR #38854 (INT4 per-token-head). Kernel-level optimizations such as fused output quantization in MLA attention (#36205) and fused quantization in `merge_attn_states` (#36518) were also merged.
- Stronger multimodal and video inference support: besides Gemma 4's multimodal support, support for the Cheers multimodal model was merged this period (#38788). Issue #38811 reflects user demand for Qwen3-VL video inference.
- Distributed inference and fault tolerance: bug fixes and improvements around NIXL, P2P NCCL, and other KV connectors continue, e.g. fixing request-ID mismatches in P/D disaggregation (#38816), handling negative token counts that crash metrics (#38839), and adding fault tolerance (elastic scale-down) for expert-parallel layers in MoE models (#38862).
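Per-token-head KV cache quantization (#38378, #38854) assigns one scale per (token, head) pair rather than per tensor, which tracks outliers much more tightly. A plain-Python sketch of symmetric INT8 scaling at that granularity; the real implementation is a Triton kernel and its memory layout may differ:

```python
def quantize_per_token_head(k):
    """k: nested list [num_tokens][num_heads][head_dim] of floats.
    Returns int8-range values plus one scale per (token, head)."""
    q, scales = [], []
    for token in k:
        q_t, s_t = [], []
        for head in token:
            # One symmetric scale per (token, head): amax / 127.
            amax = max(abs(x) for x in head) or 1e-8
            scale = amax / 127.0
            s_t.append(scale)
            q_t.append([max(-127, min(127, round(x / scale))) for x in head])
        q.append(q_t)
        scales.append(s_t)
    return q, scales
```

Dequantization is just `value * scale`, so the extra storage cost is one float per (token, head) on top of the 4x (INT8) or 8x (INT4) compression of the cache itself.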
🛠️ Key Technical Changes
- PR #38826: full Gemma 4 support: the landmark change, extending vLLM's supported model families to Google's latest flagship. Beyond model loading, it integrates dedicated reasoning and tool-call parsers plus a video-frame extraction pipeline, demonstrating vLLM's ability to support complex multimodal models.
- PR #38774: Quark quantization refactor: a key step in modernizing AMD's Quark quantization toolchain. The oracle architecture lays the groundwork for unifying and extending quantization-algorithm support on AMD platforms, improving the efficiency and flexibility of model deployment in the AMD ecosystem.
- PR #38786: MLA attention kernel split: targets the performance bottleneck at long context lengths. By enlarging the kernel grid to raise GPU utilization, it achieves order-of-magnitude speedups for specific models on ROCm, a textbook hardware-specific optimization.
- PR #38378: per-token-head KV cache quantization: implements finer-grained KV cache quantization (INT8/FP8 per token and head) in the Triton attention backend. Benchmarks show meaningful memory savings and throughput gains, an important direction for inference optimization.
- Issues #38839 & #38840: NIXL connector production problems: these two issues expose serious problems with NIXL-based P/D disaggregation in production, a metrics-computation crash and a handshake race condition. They highlight the stability challenges of distributed inference in complex deployments and are risk areas that operators and developers should watch closely.
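The crash in #38839 stems from Prometheus client counters rejecting negative increments, so a negative prompt-token delta (as can occur with external KV transfer) raises instead of being recorded. A hypothetical defensive wrapper, not vLLM's actual metrics code, illustrating a clamp-and-record approach:

```python
class SafeTokenCounter:
    """Clamp negative deltas instead of crashing the metrics path.

    Prometheus Counter.inc() raises on amount < 0, since counters are
    monotonic by definition. This sketch clamps the delta to zero and
    tracks the anomaly in a separate field so it stays observable.
    """

    def __init__(self):
        self.total = 0            # monotonic token count
        self.negative_deltas = 0  # how often a bad delta was seen

    def inc(self, amount: int) -> None:
        if amount < 0:
            self.negative_deltas += 1  # record, don't raise
            amount = 0
        self.total += amount
```

A robust fix would also address the root cause of the negative delta in the distributed accounting, but clamping keeps the serving path alive in the meantime.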
📈 Development Activity Observations
- Contributor diversity: beyond the core team (e.g. @vllmellm, @hongxiayang, @Isotr0py, @ywang96), contributors from AMD (@BowenBao, @gshtras), Intel (@mieshkiwrk), and other companies were active, indicating a healthy community ecosystem.
- Model support comes first: development activity clearly tilted toward newly released flagship models (Gemma 4); related PRs were prioritized and merged quickly.
- Review and merge throughput: 31 PRs merged within a single day indicates an efficient review and integration pipeline, though some complex PRs (e.g. #38786, #38831) hit merge conflicts and need timely rebases from their contributors.
- AMD contributions deepen: AMD-related work is no longer limited to basic enablement, extending into performance tuning, quantization integration, and new-model adaptation, signaling growing investment.
💡 Issues Worth Watching
- AMD MTP support roadmap: Issue #38851 directly asks about AMD's plans to open-source Kimi K2.5 MTP heads. Whether the AMD platform can match NVIDIA on advanced inference-optimization features hinges on an official AMD response or roadmap.
- FP8 hardware fragmentation: the #36105 discussion shows users are confused about FP8 support across AMD GPU architectures (RDNA3 vs RDNA4). Project documentation or tooling should provide clearer hardware-compatibility guidance.
- Maturing Gemma 4 reasoning parsing: the `Gemma4ReasoningParser` failure in non-streaming mode surfaced by Issue #38855, and the discussion around the related fix PR #38858, show that parser logic for newly integrated models needs more thorough testing, especially its interaction with tokenizer special-token handling.
- Monitoring and metrics for distributed inference: Issue #38839, where a Prometheus metric goes negative and crashes during cross-node KV transfer, is a latent hazard for production deployments. A robust solution is needed for this distributed accounting scenario.
📋 Appendix: Detailed Data
New Issues
- #38869 [Doc]: Can vLLM be deployed on the 5090? What CUDA configurations are people using? — documentation — by lmingze (created: 2026-04-03 10:32 (UTC+8))
- #38855 [Bug]: Gemma4 reasoning parser fails to separate reasoning_content — <channel> tokens stripped before parsing — no labels — by mabry1985 (created: 2026-04-03 06:44 (UTC+8))
- #38868 Gemma 4 support: model_type `gemma4` not recognized — no labels — by 2imi9 (created: 2026-04-03 10:15 (UTC+8))
- #38851 [Feature]: ROCm Kimi K2.5 EAGLE3 MTP heads — feature request,rocm — by functionstackx (created: 2026-04-03 05:42 (UTC+8))
- #38837 [Bug]: Gemma4ToolParser.init() missing `tools` parameter — 400 error on tool calls — no labels — by hospedales (created: 2026-04-03 03:19 (UTC+8))
- #38843 [Feature]: support unbacked in export — feature request — by laithsakka (created: 2026-04-03 03:46 (UTC+8))
- #38840 fix(nixl): Handshake race when same-node workers re-register with new engine IDs — no labels — by dmvevents (created: 2026-04-03 03:26 (UTC+8))
- #38839 fix(metrics): Prometheus counter crash on negative prompt tokens with external KV transfer — no labels — by dmvevents (created: 2026-04-03 03:26 (UTC+8))
- #38818 [Bug]: Error when running Devstral 2 — bug — by thomasmaindron (created: 2026-04-02 23:14 (UTC+8))
- #38829 [Bug]: Unsharded model cannot be loaded — bug — by shubhamjain0594 (created: 2026-04-03 01:01 (UTC+8))
- #38820 [Usage]: port question — usage — by CertainlyGo (created: 2026-04-02 23:37 (UTC+8))
- #38808 [Bug]: Disaggregate prefill script cannot work due to inconsistent request id between P node and D node. — bug — by Taeyang123456 (created: 2026-04-02 20:30 (UTC+8))
- #38811 [Usage]: Qwen3-VL inference on video complains of lack of metadata — usage — by carlos-havier (created: 2026-04-02 21:03 (UTC+8))
- #38809 [Feature]: How to disable chat template when using vllm serve — feature request — by sleepwalker2017 (created: 2026-04-02 20:33 (UTC+8))
- #38805 [Usage]: Does vllm support online infer for qwen3_asr_forced_aligner now? I only found offline example — usage — by wuchenhu98 (created: 2026-04-02 19:31 (UTC+8))
- #38797 [Usage]: I encountered an error while deploying deepseek-ai/DeepSeek-OCR-2 using vLLM. The logs show: — usage — by xuexigpt (created: 2026-04-02 16:55 (UTC+8))
- #38793 [Feature]: Does P2pNcclConnector support PD separation for the GLM5 model dsa? Testing on the 0.15.1 branch has failed. — feature request — by kahakuka (created: 2026-04-02 16:10 (UTC+8))
- #38790 [Bug]: shm_broadcast.py raise cancelled error on serving BAGEL-7B-MoT — bug — by chickeyton (created: 2026-04-02 14:34 (UTC+8))
- #38789 [Bug]: Qwen3ReasoningParser leaks </think> into content when streaming with `stop` sequences (Related to #17468) — bug — by fsytsuta (created: 2026-04-02 14:24 (UTC+8))
Closed Issues
- #37333 [Bug]: Gemma-3 specific heterogeneous TP failures with PD disagg — bug — by yzong-rh (closed: 2026-04-03 11:18 (UTC+8))
- #23222 [Feature][Responses API] Stream Function Call — feature request,stale — by simon-mo (closed: 2026-04-03 10:17 (UTC+8))
- #22814 [Bug]: Excess system memory usage during GGUF loading — bug,stale — by TNT3530 (closed: 2026-04-03 10:17 (UTC+8))
- #23837 [Bug]: gpt-oss Intermittent 500 Internal Server Error with empty response body when using strict JSON "function router" system prompt — bug,stale,gpt-oss — by bkdoeng (closed: 2026-04-03 10:17 (UTC+8))
- #24584 [Bug]: vLLM Runtime Fails to Honor Context Cancellation During Streaming — bug,stale — by hardikmenger (closed: 2026-04-03 10:17 (UTC+8))
- #38868 Gemma 4 support: model_type `gemma4` not recognized — no labels — by 2imi9 (closed: 2026-04-03 10:17 (UTC+8))
- #27059 [Feature]: Batch-invariant Inference Support for VLMs — feature request,stale — by lionelfeng (closed: 2026-04-03 10:16 (UTC+8))
- #27607 [Bug]: `KeyError: 'layers.47.mlp.experts.w2_weight'` loading a NVFP4 + BF16 mixed-precision `llm-compressor` model — bug,stale — by BenasdTW (closed: 2026-04-03 10:16 (UTC+8))
- #27808 [Bug]: `Eagle` spec decoding not working with Gemma2 — bug,stale — by jalola (closed: 2026-04-03 10:16 (UTC+8))
- #28060 [Bug]: Qwen3VL doesn't recognise upstream `flash-attn` — bug,stale — by lgeiger (closed: 2026-04-03 10:16 (UTC+8))
- #29360 [Bug]: Qwen3-Coder-480B-FP8 running on 8*H20, deepgemm warmup OOM — bug,stale — by Cranshine (closed: 2026-04-03 10:16 (UTC+8))
- #29862 [Model loading error]: quantized version of Llama-3.2-11B-Vision-Instruct — stale — by mbenchwevioo (closed: 2026-04-03 10:16 (UTC+8))
- #29875 [Usage]: Is there a way to inject the grammar into the docker directly — usage,stale — by chwundermsft (closed: 2026-04-03 10:16 (UTC+8))
- #29939 [Bug]: 0.8.5.post1 Partial reasoning is empty — bug,stale — by FangdongWang (closed: 2026-04-03 10:15 (UTC+8))
- #29944 [Usage]: It seems that the prefix cache has not brought about any performance benefits. — usage,stale — by wenba0 (closed: 2026-04-03 10:15 (UTC+8))
- #29958 [CI Failure]: deepseek-ai/deepseek-vl2-tiny `CUBLAS_STATUS_EXECUTION_FAILED` — stale,ci-failure — by NickLucche (closed: 2026-04-03 10:15 (UTC+8))
- #29967 [Feature]: upgrade to CUDA 13 in docker image — feature request,stale — by (closed: 2026-04-03 10:15 (UTC+8))
- #30007 [Feature]: Propagate served-model-name to LMCache metrics — feature request,stale — by mustafayildirim (closed: 2026-04-03 10:15 (UTC+8))
- #33097 [Feature]: Fuse FP8 output quantization into merge_attn_states (DCP / cascade paths) — feature request — by sachinkumarsingh092 (closed: 2026-04-03 09:47 (UTC+8))
- #36615 [Bug]: unknown error trying to run vllm v0.17.0 with ROCm on Radeon 8060S (gfx1151) — bug,rocm — by anomaly256 (closed: 2026-04-03 07:31 (UTC+8))
- #38837 [Bug]: Gemma4ToolParser.init() missing `tools` parameter — 400 error on tool calls — no labels — by hospedales (closed: 2026-04-03 05:35 (UTC+8))
- #36105 [ROCm][v0.16.0] Qwen3-VL-30B-A3B-Instruct-FP8 fails to start with NotImplementedError: No FP8 MoE backend supports the deployment configuration — bug,rocm — by gbdjxgp (closed: 2026-04-02 16:13 (UTC+8))
- #38843 [Feature]: support unbacked in export — feature request — by laithsakka (closed: 2026-04-03 03:47 (UTC+8))
- #38286 [Feature]: Batch invariance on 3090 — feature request — by YM2132 (closed: 2026-04-02 21:30 (UTC+8))
- #38790 [Bug]: shm_broadcast.py raise cancelled error on serving BAGEL-7B-MoT — bug — by chickeyton (closed: 2026-04-02 14:40 (UTC+8))
- #38543 [Bug]: Failed to call /chat/completions after /tokenize for same multimodal query — bug — by sergey-zinchenko (closed: 2026-04-02 12:14 (UTC+8))
- #37748 [Feature]: Is there docker image support vllm and rocm 7.1+ — feature request,rocm — by BigFaceBoy (closed: 2026-04-02 11:46 (UTC+8))
New PRs
- #38872 [Misc] Remove sys.exit from Gemma4 multimodal processor — no labels — by Isotr0py (created: 2026-04-03 11:34 (UTC+8))
- #38871 [9/n] Migrate to torch stable ABI — rocm,needs-rebase,ci/build,nvidia — by mikaylagawarecki (created: 2026-04-03 11:31 (UTC+8))
- #38841 [8/n] Migrate to torch stable ABI — rocm,needs-rebase,ci/build,nvidia — by mikaylagawarecki (created: 2026-04-03 03:32 (UTC+8))
- #38784 [XPU][CI] Add misc cases on Intel GPU in CI — intel-gpu,ci/build — by zxd1997066 (created: 2026-04-02 11:58 (UTC+8))
- #38865 [Refactor] Improve indexer decode path metadata preparation — v1 — by zyongye (created: 2026-04-03 09:33 (UTC+8))
- #38870 [Bugfix] Fix DSV32 weight loading — bug,ready,deepseek — by zyongye (created: 2026-04-03 10:53 (UTC+8))
- #38807 [vLLM IR] add `import_ir_kernels()` to support OOT platforms — no labels — by wxsIcey (created: 2026-04-02 20:22 (UTC+8))
- #38867 [Bugfix] Fix MoE routed input transform when using DeepEP LL — bug — by bnellnm (created: 2026-04-03 10:14 (UTC+8))
- #38795 [Bugfix] Fix EP precision for Qwen3.5 — bug,qwen — by USTCKAY (created: 2026-04-02 16:22 (UTC+8))
- #38824 [ROCm] add head-dim 512 for ROCM_ATTN for gemma4 model support — documentation,rocm,ready,v1 — by hongxiayang (created: 2026-04-03 00:08 (UTC+8))
- #38866 [fix]: deepseekv31 tool parser: `"` and `}` may arrive in separate deltas. — tool-calling,deepseek — by pascal9585 (created: 2026-04-03 10:02 (UTC+8))
- #38827 feat: add max_tokens_per_doc in rerank request (rebase of #33315) — frontend — by jefp (created: 2026-04-03 00:57 (UTC+8))
- #38833 [ROCm] pad intermediate size for certain unquantized moe model gemma4 to use aiter fused moe for alignment — rocm — by hongxiayang (created: 2026-04-03 01:21 (UTC+8))
- #38858 [Bugfix] Fix Gemma4 non-streaming reasoning parsing — bug,frontend — by jacobzhang22 (created: 2026-04-03 08:00 (UTC+8))
- #38864 [Bugfix] Fix Qwen3 </think> leak in streaming with stop sequences — bug,qwen — by jacobzhang22 (created: 2026-04-03 09:28 (UTC+8))
- #38859 [Bugfix] Re-enable Renormalize routing for TRT-LLM MoE experts — bug,ready,nvidia — by yzong-rh (created: 2026-04-03 08:02 (UTC+8))
- #38828 [Model] support Gemma 4 — documentation,ci/build,multi-modality — by effortprogrammer (created: 2026-04-03 00:58 (UTC+8))
- #38798 [vLLM IR] rms_norm_gated — rocm,nvidia — by wxsIcey (created: 2026-04-02 17:04 (UTC+8))
- #38863 Feat/disk offloading — v1,kv-connector — by rarepepi (created: 2026-04-03 08:55 (UTC+8))
- #38849 [Bug] Fix TypeError when hf_config.architectures is None during model loading — bug,intel-gpu,ready — by TihoElek (created: 2026-04-03 04:54 (UTC+8))
- #38862 [EP] Fault tolerance: automatic elastic scale-down on DP engine death — v1 — by tzulingk (created: 2026-04-03 08:47 (UTC+8))
- #38838 [CI] Fix `test_nixl_connector` — ready,v1,kv-connector — by MatthewBonanni (created: 2026-04-03 03:19 (UTC+8))
- #38832 [Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 — bug,ready,qwen — by vadiklyutiy (created: 2026-04-03 01:20 (UTC+8))
- #38860 [Parser] Pass request.tools to tool parser — frontend — by sfeng33 (created: 2026-04-03 08:06 (UTC+8))
- #38861 Omp improvement from PR#36487 — ci/build,v1,cpu — by louie-tsai (created: 2026-04-03 08:38 (UTC+8))
- #38856 [LMCache] vLLM Block Allocation Event — kv-connector — by Oasis-Git (created: 2026-04-03 06:54 (UTC+8))
- #38857 OMP init re-implementation — ci/build,v1,cpu — by louie-tsai (created: 2026-04-03 07:43 (UTC+8))
- #38848 [Bugfix] Fix Qwen3 tool parser for Responses API tools — bug,tool-calling,qwen — by sfeng33 (created: 2026-04-03 04:46 (UTC+8))
- #38854 [Feature] KV cache per-token-head Int4 Quantization — documentation,ready,v1 — by JartX (created: 2026-04-03 06:27 (UTC+8))
- #38825 [Intel][Triton] Support `round_int8` for Intel backend — intel-gpu,ready — by mieshkiwrk (created: 2026-04-03 00:35 (UTC+8))
- #38852 fix a cpu distributed ci testing issue — ci/build,cpu — by louie-tsai (created: 2026-04-03 06:19 (UTC+8))
- #38850 [Perf] Optimize DCP communication with reusable collective scratch buffers, 1.5%~4% E2E throughput improvement — ready,v1,nvidia — by yewentao256 (created: 2026-04-03 05:23 (UTC+8))
- #38853 [Bug] Fix workspace manager `_current_workspaces` size — bug,v1 — by yewentao256 (created: 2026-04-03 06:21 (UTC+8))
- #38815 [Quant] add CompressedTensorsW8A8Mxfp8 for linear and MoE layers — ready — by EdalatiAli (created: 2026-04-02 22:41 (UTC+8))
- #38844 [Bugfix]: Enable Gemma4ForCasualLM to load lora adapters correctly — bug — by ShubyM (created: 2026-04-03 03:57 (UTC+8))
- #38835 [Attention] relax the head dim 512 and paged kv for sm90+FA4 — ci/build,v1 — by IwakuraRein (created: 2026-04-03 01:51 (UTC+8))
- #38847 [Bugfix]: Fix Gemma4ToolParser.init() missing `tools` parameter — bug,ready,tool-calling — by hospedales (created: 2026-04-03 04:12 (UTC+8))
- #38845 Feature/int4 per token — documentation,v1 — by JartX (created: 2026-04-03 03:59 (UTC+8))
- #38846 Replace shape_invariants with simpler approach in dynamic_arg_dims utilizing shape_id property — llama,qwen — by laithsakka (created: 2026-04-03 04:02 (UTC+8))
- #38814 [FlashAttention] Symlink FA4 instead of copying when using `VLLM_FLASH_ATTN_SRC_DIR` — performance,ci/build — by MatthewBonanni (created: 2026-04-02 22:27 (UTC+8))
- #38836 [CI] Fix: pass string cache_dtype in test_register_kv_caches — ready,v1,kv-connector — by ZhanqiuHu (created: 2026-04-03 02:48 (UTC+8))
- #38842 [Refactor] Remove unused dead code — speculative-decoding,ready,v1 — by yewentao256 (created: 2026-04-03 03:36 (UTC+8))
- #38792 [CI] Add flashinfer.py to attention test source deps — ready,ci/build — by stecasta (created: 2026-04-02 15:48 (UTC+8))
- #38819 [Attention][MLA] Re-enable FA4 as default MLA prefill backend — ready — by MatthewBonanni (created: 2026-04-02 23:17 (UTC+8))
- #38834 [Bugfix] Fix oversized piecewise CUDA graphs for Gemma3n cross-decoder — bug,ready,nvidia — by LucasWilkinson (created: 2026-04-03 01:24 (UTC+8))
- #38831 [ModelRunnerV2][Hybrid model] Support Hybrid Model in ModelRunner V2 — v1 — by MengqingCao (created: 2026-04-03 01:13 (UTC+8))
- #38826 feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) — new-model,intel-gpu,ready,multi-modality,tool-calling — by lucianommartins (created: 2026-04-03 00:48 (UTC+8))
- #38823 [Attention] Allow using system FA4 — v1 — by IwakuraRein (created: 2026-04-02 23:55 (UTC+8))
- #38791 [Bugfix] Fix test mocks after SM100 restriction in #38730 — bug,ready,nvidia — by stecasta (created: 2026-04-02 15:43 (UTC+8))
- #38822 [Attention] Add head_dim=512 support for FlashInfer trtllm attention backend — v1,nvidia — by djmmoss (created: 2026-04-02 23:49 (UTC+8))
- #38800 [New Model]: jinaai/jina-reranker-v3 — new-model,frontend,needs-rebase — by noooop (created: 2026-04-02 17:12 (UTC+8))
- #38817 [ROCm] Enable fused_silu_mul_block_quant on ROCm — rocm,ready,ci/build — by gshtras (created: 2026-04-02 23:07 (UTC+8))
- #38821 [Transformers v5] Fix Tarsier2Config text_config nesting (#38736) — no labels — by Zelys-DFKH (created: 2026-04-02 23:48 (UTC+8))
- #38813 [Fix] Align MoRIIO registration format with vLLM router and handle de… — documentation,kv-connector — by mpashkovskii (created: 2026-04-02 22:18 (UTC+8))
- #38786 Splitting MLA attention Triton kernel — v1 — by ekuznetsov139 (created: 2026-04-02 13:32 (UTC+8))
- #38816 Fix/p2p request id mismatch — v1,kv-connector — by groot-code24 (created: 2026-04-02 23:05 (UTC+8))
- #38799 [EASY] Drop duplicate KV-cache initialization — ready — by namgyu-youn (created: 2026-04-02 17:10 (UTC+8))
- #38812 skip — rocm,v1 — by qiruixu666-source (created: 2026-04-02 21:59 (UTC+8))
- #38804 Fix sarvam forward compatibility with transformers v5 — verified — by Vikrantpalle (created: 2026-04-02 19:23 (UTC+8))
- #38802 Change `trust_remote_code` default in test runners — multi-modality,ready-run-all-tests — by hmellor (created: 2026-04-02 18:22 (UTC+8))
- #38788 [Model] Add support for Cheers multimodal model — documentation,new-model,ready — by bingshuailiu (created: 2026-04-02 14:22 (UTC+8))
- #38810 [LMCache][MP] optimize save when mla enabled — kv-connector — by chunxiaozheng (created: 2026-04-02 20:40 (UTC+8))
- #38803 [Frontend] Fixed tool parsing errors for MinimaxM2 — tool-calling — by Csrayz (created: 2026-04-02 18:52 (UTC+8))
- #38806 Revert "[Perf] DSV3.2 Indexer Fused Weights Projection" (#38684) — deepseek — by vllm-agent (created: 2026-04-02 20:05 (UTC+8))
- #38794 [Perf] Reduce H2D pageable memory copies — v1 — by jackcfwang (created: 2026-04-02 16:10 (UTC+8))
- #38801 [CPU] Fix chained comparison static_assert for Clang 21+ — cpu — by ricky-chaoju (created: 2026-04-02 17:37 (UTC+8))
- #38796 [XPU] Fix MoE hang in test_external_lb_dp by handling restricted device visibility — intel-gpu — by 1643661061leo (created: 2026-04-02 16:41 (UTC+8))
- #38787 [GDN] Fused all preprocessing into one kernel for chunked stage — rocm,ci/build — by a-sidorova (created: 2026-04-02 14:00 (UTC+8))
- #38785 [Bugfix] Fix Anthropic adapter missing content field on assistant tool-call messages — bug,frontend — by Lidang-Jiang (created: 2026-04-02 13:18 (UTC+8))
Merged PRs
- #38774 [ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle — rocm,ready — by BowenBao (merged: 2026-04-03 11:29 (UTC+8))
- #37566 refactor hard coded device string in test files under tests/v1 and tests/lora — speculative-decoding,ready,v1,nvidia — by wincent8 (merged: 2026-04-03 11:21 (UTC+8))
- #38460 [Perf] Batch KV cache swap copies via cuMemcpyBatchAsync — ready,v1 — by Etelis (merged: 2026-04-03 11:13 (UTC+8))
- #36518 [Kernel] Fuse FP8 output quantization into merge_attn_states — documentation,performance,rocm,ready,ci/build,v1,llama,qwen,nvidia — by carlyou (merged: 2026-04-03 09:47 (UTC+8))
- #36205 [mla] Support fused FP8/NVFP4 output quantization in MLA attention (#35792) — documentation,performance,ready,ci/build,v1 — by carlyou (merged: 2026-04-03 09:16 (UTC+8))
- #33657 [XPU] Initial support for GDN attention on Qwen3-next/Qwen3.5 — rocm,intel-gpu,ready,v1,qwen,nvidia — by yma11 (merged: 2026-04-03 08:59 (UTC+8))
- #38838 [CI] Fix `test_nixl_connector` — ready,v1,kv-connector — by MatthewBonanni (merged: 2026-04-03 08:52 (UTC+8))
- #38832 [Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 — bug,ready,qwen — by vadiklyutiy (merged: 2026-04-03 08:45 (UTC+8))
- #38510 [New Model]: add support for telechat3 — documentation,new-model,ready,ci/build — by 1096125073 (merged: 2026-04-03 08:26 (UTC+8))
- #37416 [Kernel] Mamba support different layout for Conv state — ready,qwen — by NickLucche (merged: 2026-04-03 07:50 (UTC+8))
- #38847 [Bugfix]: Fix Gemma4ToolParser.init() missing `tools` parameter — bug,ready,tool-calling — by hospedales (merged: 2026-04-03 05:35 (UTC+8))
- #38836 [CI] Fix: pass string cache_dtype in test_register_kv_caches — ready,v1,kv-connector — by ZhanqiuHu (merged: 2026-04-03 03:42 (UTC+8))
- #38792 [CI] Add flashinfer.py to attention test source deps — ready,ci/build — by stecasta (merged: 2026-04-03 03:24 (UTC+8))
- #38826 feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) — new-model,intel-gpu,ready,multi-modality,tool-calling — by lucianommartins (merged: 2026-04-03 02:13 (UTC+8))
- #38062 Bump helion dependency from 0.3.2 to 0.3.3 — ready,ci/build — by gmagogsfm (merged: 2026-04-03 01:59 (UTC+8))
- #38791 [Bugfix] Fix test mocks after SM100 restriction in #38730 — bug,ready,nvidia — by stecasta (merged: 2026-04-03 01:12 (UTC+8))
- #38690 [FA4] Update flash-attention to latest upstream FA4 — ready,ci/build,nvidia — by LucasWilkinson (merged: 2026-04-03 01:02 (UTC+8))
- #38427 [Bugfix] Enable batch-invariant Triton matmul on all Ampere GPUs (SM 8x) — bug,ready — by YM2132 (merged: 2026-04-02 21:29 (UTC+8))
- #38292 [CI][ROCm] Add gpt-oss w4a8 in CI — rocm,ready,gpt-oss — by BowenBao (merged: 2026-04-03 00:06 (UTC+8))
- #38620 [Frontend] Re-enable running MaxSim on GPU — frontend,ready,v1 — by noooop (merged: 2026-04-03 00:03 (UTC+8))
- #33529 Triton MLA perf fixes — performance,ready,v1,deepseek — by koush (merged: 2026-04-02 21:40 (UTC+8))
- #38788 [Model] Add support for Cheers multimodal model — documentation,new-model,ready — by bingshuailiu (merged: 2026-04-02 21:01 (UTC+8))
- #30518 Don't compile vision encoder for Transformers backend — ready — by hmellor (merged: 2026-04-02 20:42 (UTC+8))
- #38378 [Feature] KV cache per-token-head INT8/FP8 quantization — documentation,rocm,ready,v1,quantization — by JartX (merged: 2026-04-02 20:13 (UTC+8))
- #37813 [Perf] fuse kernels in gdn — ready,qwen — by ZJY0516 (merged: 2026-04-02 19:52 (UTC+8))
- #38086 [ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 and Qwen/Qwen3.5-35B-A3B-FP8 tp=2 — performance,rocm,ready,qwen — by vllmellm (merged: 2026-04-02 16:13 (UTC+8))
- #38770 [CPU] Support gelu act in cpu_fused_moe — ready,v1,cpu — by bigPYJ1151 (merged: 2026-04-02 14:14 (UTC+8))
- #38778 Revert "[Kernel] Add gpt-oss Router GEMM kernel (#37205)" — performance,ready,ci/build,gpt-oss — by xyang16 (merged: 2026-04-02 13:02 (UTC+8))
- #38743 [Kernel] [Helion] Use warning_once in get_gpu_name to prevent log spam — ready — by gmagogsfm (merged: 2026-04-02 12:30 (UTC+8))
- #38750 [ROCm][Bugfix] Fix ROCm runtime failure due to missing symbol — bug,rocm,ready — by gshtras (merged: 2026-04-02 12:30 (UTC+8))
- #38545 [Bugfix] Use dedicated MM processor cache in /tokenize to prevent sender-cache pollution — bug,frontend,ready,multi-modality — by sergey-zinchenko (merged: 2026-04-02 12:14 (UTC+8))
PRs Closed Without Merging
- #38773 Revert "[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang" (#38730) — bug,documentation,nvidia — by vllm-agent (closed: 2026-04-03 11:27 (UTC+8))
- #24618 [Core] Reuse ZMQ to level trigger local ShmRingBuffer events. — stale — by noobpwnftw (closed: 2026-04-03 10:17 (UTC+8))
- #28801 Fixes 28713 — frontend,stale,tool-calling — by oneraghavan (closed: 2026-04-03 10:16 (UTC+8))
- #29498 fix: Add validation for tool requests that the tool is available — frontend,needs-rebase,stale,gpt-oss — by bestony (closed: 2026-04-03 10:16 (UTC+8))
- #38858 [Bugfix] Fix Gemma4 non-streaming reasoning parsing — bug,frontend — by jacobzhang22 (closed: 2026-04-03 09:49 (UTC+8))
- #38828 [Model] support Gemma 4 — documentation,ci/build,multi-modality — by effortprogrammer (closed: 2026-04-03 01:51 (UTC+8))
- #35898 [XPU][QWEN3_NEXT] remove fla hardcode to cuda — intel-gpu,needs-rebase,qwen,nvidia — by xuechendi (closed: 2026-04-03 08:59 (UTC+8))
- #38857 OMP init re-implementation — ci/build,v1,cpu — by louie-tsai (closed: 2026-04-03 08:07 (UTC+8))
- #38845 Feature/int4 per token — documentation,v1 — by JartX (closed: 2026-04-03 05:15 (UTC+8))
- #38846 Replace shape_invariants with simpler approach in dynamic_arg_dims utilizing shape_id property — llama,qwen — by laithsakka (closed: 2026-04-03 04:02 (UTC+8))
- #38650 [Bugfix] Enable MTP for the official Qwen3.5 NVFP4 checkpoint — bug,qwen — by mmangkad (closed: 2026-04-03 02:24 (UTC+8))
- #38823 [Attention] Allow using system FA4 — v1 — by IwakuraRein (closed: 2026-04-03 01:57 (UTC+8))
- #38762 (alternative to #37891) [ROCm] Fix rocm allreduce rmsnorm fusion for Deepseek models — rocm,needs-rebase,ci/build,deepseek,nvidia — by rbrugaro-amd (closed: 2026-04-02 23:12 (UTC+8))
- #38767 [Transformers v5] Add SarvamMLAConfig to fix SarvamMLAForCausalLM (#38734) — no labels — by Zelys-DFKH (closed: 2026-04-02 19:30 (UTC+8))
- #36893 [Feature] Kvcache Int8 per-token scale on TRITON_ATTN continue of #34327 thanks EricccYang — documentation,performance,rocm,frontend,ready,needs-rebase,ci/build,v1,multi-modality,gpt-oss — by JartX (closed: 2026-04-02 22:37 (UTC+8))
- #37891 [ROCm][Perf] Add fused AllReduce+RMSNorm for DeepSeek on MI355X — rocm,needs-rebase,ci/build,deepseek,nvidia — by attila-dusnoki-htec (closed: 2026-04-02 14:36 (UTC+8))
- #38715 [Bugfix] Fix intra-step KV block corruption from stale prefix cache hits — bug,v1 — by KrxGu (closed: 2026-04-02 17:22 (UTC+8))
- #36480 bugfix(dcp, gdn): disabling DCP semantics for linear-attention KV/state groups — bug,v1,nvidia — by pisceskkk (closed: 2026-04-02 16:23 (UTC+8))