vLLM Development Activity Report - 2026-04-03
Time window: 2026-04-03 11:28 (UTC+8) ~ 2026-04-04 11:28 (UTC+8)
Stats: 29 new issues | 8 closed issues | 62 new PRs | 36 merged PRs | 16 PRs closed without merging
📊 Daily Development Summary
Between April 3 and 4, 2026, the vLLM project remained highly active, with 29 new issues and 62 new PRs. Development focused on Gemma 4 performance optimization, tool-call parsing bug fixes, and deep support and performance tuning for the AMD platform (especially MI355x and MI250x hardware). The community also saw a contentious code-attribution dispute that drew the attention of core maintainers.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was very high during this window, centered on GLM-5 performance optimization and bug fixes on MI-series GPUs, reflecting AMD contributors' deep involvement in project integration.
1. Issue #38954: Active ROCm contributor requests triage permission
- User: ChuanLi1101 (an active AMD contributor)
- Content: Requests triage permission on the repository to manage the many ROCm-related PRs (more than 12) under their care, spanning AITER integration, the MLA attention backend, MXFP4 quantization, and GLM-5 model support. This would shorten review cycles for customer-facing work (e.g. GLM-5 on MI355X).
- Impact: Shows the AMD team owns a full stack of work in vLLM, from low-level kernels to model support, and is seeking a more efficient collaboration process.
2. Issue #38924: [Bug][ROCm] GLM-5 MXFP4 sparse MLA decode crashes on MI355x
- User: ChuanLi1101
- Content: Reports that GLM-5 with MXFP4 quantization on 8x MI355X (gfx950) crashes during decode because the AITER sparse MLA kernel requires num_heads >= 16. With TP=8, each GPU receives only 8 heads, triggering a runtime error.
- Technical detail: The root cause is the hard head-count requirement of AITER's `mla_decode_stage1_asm_fwd` kernel and the FP8 paged MQA logits kernel.
- Related PR: Part of the same GLM-5-on-ROCm performance series as PR #38947 (AITER MLA prefill kernel integration).
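The head-count arithmetic behind this crash can be checked ahead of launch. A minimal sketch of that check (the helper names are illustrative, not vLLM code; the 64-head total is implied by the report's "8 heads per GPU at TP=8"):

```python
# Sketch: validate per-GPU head count against a kernel's minimum requirement
# before launching. The >= 16 constraint mirrors the AITER sparse MLA decode
# requirement described in the issue; the helpers are illustrative.

AITER_SPARSE_MLA_MIN_HEADS = 16  # hard requirement reported for the kernel

def heads_per_gpu(total_heads: int, tensor_parallel_size: int) -> int:
    """Heads assigned to each GPU under tensor parallelism."""
    assert total_heads % tensor_parallel_size == 0, "TP must divide head count"
    return total_heads // tensor_parallel_size

def sparse_mla_decode_supported(total_heads: int, tp: int) -> bool:
    return heads_per_gpu(total_heads, tp) >= AITER_SPARSE_MLA_MIN_HEADS

# The configuration from the report: 64 heads at TP=8 leaves 8 heads per
# GPU, below the kernel's minimum, hence the runtime crash.
print(sparse_mla_decode_supported(64, 8))  # False: 64 // 8 = 8 < 16
print(sparse_mla_decode_supported(64, 4))  # True:  64 // 4 = 16
```

Running such a check at config-validation time would turn the mid-decode crash into an early, actionable error.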
3. PR #38947: [ROCm][Perf] Add AITER MLA prefill kernel for the dense MLA backend
- User: ChuanLi1101
- Content: Integrates AITER's `mla_prefill_fwd` assembly kernel into vLLM's dense MLA backend, replacing the previous prefill-context path that computed block by block and required expensive KV expansion. The new path compresses queries in the latent space and issues a single kernel call directly against the paged KV cache, eliminating the per-block KV expansion loop.
- Significance: A key step toward better inference performance for GLM-5 and similar models on AMD hardware (e.g. MI355X), aimed at closing the gap with the ATOM platform.
4. PR #38914: [ROCm] mi250x decode regression
- User: rlrs
- Content: Fixes a ROCm `wvSplitK` regression on MI250X (gfx90a) decode workloads that produced numerically incorrect output. The PR restores the correct reduction behavior in the gfx9 non-MFMA path and adds a regression test.
- Discussion: AMD engineer amd-hhashemi joined the review, pointing out efficiency risks and raw dependencies in the assembly path and recommending the builtin path plus root-cause analysis. The two sides discussed reproducibility on a specific ROCm version (6.4.4), ultimately recommending a ROCm/clang++ upgrade.
- Significance: Illustrates rigorous code review and collaboration between AMD's internal team and external contributors on hardware-specific optimizations.
5. Other ROCm-related PRs:
- PR #38937, #38951, #38959: a series of ROCm CI/CD pipeline fixes, adding missing dependencies and fixing a Dockerfile parsing error, to keep AMD-platform CI stable.
- PR #38941: removes the "soft fail" status from the AMD image build job, a sign the AMD build pipeline is stabilizing.
- PR #38774: merged; refactors the Quark MoE MXFP4 implementation to run through an oracle and kernel backends, unifying the backend name as "AITER".
💬 High-Activity Discussions
1. Issue #38942: Request for attribution: multi-ISA CPU dispatcher work
- Core issue: Contributor MekayelAnik alleges that his original work on a Python multi-ISA CPU dispatcher (committed to a fork in December 2025) was used without attribution in PR #35466 in the main vLLM repository. He provides a detailed timeline and code comparisons as evidence.
- Positions:
  - Claimant (MekayelAnik): Sees a clear lineage: an intermediate PR (#35346) rebased his commits but was replaced the next day by a re-implementation that gave him no credit, which he regards as unattributed use.
  - Supporter (wjhrdy): Confirms having collaborated with MekayelAnik and believes he deserves recognition.
  - Maintainers (pending): @simon-mo, @WoosukKwon and others were mentioned but have not yet responded publicly in the issue.
- Point of contention: Where the line between "taking inspiration" and "reusing code" lies in open source, and how contributor credit is protected.
- Status: Open, awaiting maintainer action.
2. PR #38914: [ROCm] mi250x decode regression
- Core issue: Whether the proposed fix for the MI250X decode-path regression is the right approach.
- Positions:
  - Contributor (rlrs): Submitted a PR restoring the old, verified assembly reduction logic, with a regression test.
  - AMD engineer (amd-hhashemi): Considers the assembly path inefficient and risky and prefers the builtin path; questions whether the problem reproduces on newer ROCm versions and suggests upgrading the toolchain.
- Focus: Whether to ship a quick rollback to unblock users or dig out the root cause and adopt a more modern optimized path, weighing hardware compatibility, long-term maintainability, and performance.
- Status: Open, under discussion; amd-hhashemi has asked for a simpler reproduction script.
3. Issue #38887: Gemma 4 E4B extremely slow on v0.19.0 due to the forced TRITON_ATTN fallback
- Core issue: Because of its heterogeneous attention head dimensions, Gemma 4 is globally forced onto the slow Triton attention backend, leaving its performance far below comparably sized models.
- Community feedback:
  - Users (CunXin1, jump-drive): Provided performance data (~9 tok/s) on different hardware (RTX 4090, NVIDIA DRIVE AGX Thor), confirming the problem is real and severe.
  - Related PR (#38891): CunXin1 submitted a PR proposing per-layer attention backend selection for Gemma 4 (FlashAttention for sliding-window layers, Triton for full-attention layers) instead of the global fallback; one user reported a substantial speedup with it.
- Point of contention: Whether the global fallback, adopted to avoid potential numerical issues with mixed backends, is too conservative and gives up too much performance.
- Status: Open; the related PR #38891 awaits core-developer review of its numerical safety.
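The idea in PR #38891 amounts to mapping each layer to a backend based on its attention type instead of one global choice. A hedged sketch of that selection logic (the layer-type strings and backend names are illustrative assumptions, not the PR's actual code):

```python
# Sketch of per-layer attention backend selection, following the approach
# described above: sliding-window layers can use FlashAttention, while
# full-attention layers (with the problematic head dimensions) fall back
# to Triton. This illustrates the idea; it is not vLLM's implementation.

FLASH_ATTN = "FLASH_ATTN"
TRITON_ATTN = "TRITON_ATTN"

def select_backend(layer_types: list) -> list:
    """Pick an attention backend per layer instead of forcing one globally."""
    return [
        FLASH_ATTN if t == "sliding_window" else TRITON_ATTN
        for t in layer_types
    ]

# A hypothetical interleaved layer pattern such as Gemma-style models use:
layers = ["sliding_window", "sliding_window", "full_attention"] * 2
print(select_backend(layers))
```

The open question flagged in the issue is whether mixing backends this way preserves numerical equivalence with the all-Triton path.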
🔥 Hot Topics and Trends
- Gemma 4 integration and optimization: the current hotspot, with issues/PRs clustered on:
  - Performance bottleneck (#38887): a globally forced slow backend caused by heterogeneous attention head dimensions.
  - Tool calling and streaming parsing (#38910, #38946): corrupted JSON/HTML text in streamed output.
  - Weight loading (#38874, #38912): compatibility issues with NVFP4-quantized models and MoE expert parameter mapping.
  - Hardware compatibility (#38918): unable to run on Turing-architecture GPUs due to shared-memory limits.
- Deepening AMD platform support: the focus has shifted from basic functionality to deep performance tuning and polish for specific models (GLM-5), spanning kernel integration, quantization support, and CI/CD hardening.
- Tool calling and streaming output: multiple issues (#38910, #38946, #38894) report streaming parse errors for Qwen3.5, Gemma 4, and others when "thinking" or tool calling is enabled (content being None, invalid JSON, duplicated HTML tags), showing that streaming complex output structures remains error-prone.
- Build and deployment:
  - CUDA 13.0 vs. glibc compatibility (#38908): prebuilt wheels depend on a newer glibc, so installation fails on systems such as RHEL 9.
  - Apple Clang compile errors (#38889): newer Clang versions treat warnings as errors; documentation guidance is needed.
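One common source of the "invalid JSON in streaming" class of bug noted above is emitting tool-argument deltas before the accumulated string parses. A minimal guard sketch (illustrative only; not any of the cited parsers' actual code):

```python
import json

# Sketch: buffer streamed tool-argument deltas and only emit once the
# accumulated text is valid JSON. This illustrates why naive per-delta
# emission yields invalid JSON mid-stream; it is not vLLM's parser.

class ToolArgBuffer:
    def __init__(self) -> None:
        self.buffer = ""

    def feed(self, delta: str):
        """Append a delta; return parsed arguments once they are complete."""
        self.buffer += delta
        try:
            return json.loads(self.buffer)
        except json.JSONDecodeError:
            return None  # still incomplete: hold back instead of emitting

buf = ToolArgBuffer()
print(buf.feed('{"city": "Par'))  # None: incomplete, would be invalid JSON
print(buf.feed('is"}'))           # {'city': 'Paris'}: now safe to emit
```

Real streaming parsers emit partial diffs rather than waiting for completion, which is exactly where the delta-accounting bugs in these issues creep in.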
🛠️ Key Technical Changes
- PR #38947 ([ROCm][Perf] Add AITER MLA prefill kernel):
  - Analysis: Integrates AITER's high-performance assembly kernel directly into vLLM's MLA attention pipeline, skipping the expensive intermediate KV tensor expansion. A model case of folding AMD hardware-specific optimization deep into vLLM's architecture.
  - Impact: Significantly speeds up prefill on AMD GPUs for MLA-attention models such as GLM-5; a key step in customer-facing performance delivery.
- PR #38915 ([Bug] Fix compile error for `swap_blocks_batch` in CUDA 13):
  - Analysis: Fixes a source-build failure caused by a CUDA 13.0 API signature change (`cuMemcpyBatchAsync` now takes fewer parameters). An urgent fix affecting CUDA 13.0 users and the release pipeline.
  - Impact: Keeps vLLM buildable across a wider range of CUDA environments.
- PR #38138 ([Frontend] New online quantization frontend):
  - Analysis: Introduces a new, more flexible configuration frontend for online quantization, supporting a global scheme, separate scheme overrides for linear/MoE layers, and ignoring specific layers via regular expressions.
  - Impact: Provides an extensible configuration framework for richer, finer-grained online quantization strategies (e.g. MXFP8); an important architectural improvement to the quantization stack.
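The regex-based layer filtering described above can be sketched as follows (the config keys and layer names here are assumptions for illustration, not PR #38138's actual schema):

```python
import re

# Sketch: resolve the quantization scheme for a layer name, honoring a
# global scheme, a per-layer-type override, and regex ignore patterns,
# in the spirit of the online-quantization frontend described above.

def resolve_scheme(layer_name: str, config: dict):
    # Layers matching any ignore pattern stay unquantized.
    for pattern in config.get("ignore", []):
        if re.search(pattern, layer_name):
            return None
    # A per-layer-type override (here: MoE layers) beats the global scheme.
    if "moe" in layer_name and "moe_scheme" in config:
        return config["moe_scheme"]
    return config.get("scheme")

cfg = {
    "scheme": "fp8",            # global default
    "moe_scheme": "mxfp4",      # override for MoE layers
    "ignore": [r"lm_head", r"layers\.0\."],  # leave these unquantized
}
print(resolve_scheme("model.layers.3.mlp.moe.experts", cfg))  # mxfp4
print(resolve_scheme("model.layers.0.self_attn.q_proj", cfg))  # None
print(resolve_scheme("lm_head", cfg))                          # None
```

The value of this pattern is that one declarative config covers global defaults, targeted overrides, and exclusions without touching model code.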
- PR #37171 ([Frontend] Add streaming support for the disaggregated endpoint):
  - Analysis: Implements streaming responses for the disaggregated inference API (`/inference/v1/generate`), completing that endpoint's feature set.
  - Impact: Improves vLLM's flexibility in microservice architectures, letting prefill/decode-disaggregated deployments also benefit from streaming.
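Streaming endpoints of this kind typically emit server-sent events, one JSON chunk per `data:` line. A minimal client-side parser sketch (the chunk fields and `[DONE]` sentinel are assumptions for illustration; consult the endpoint's actual schema):

```python
import json

# Sketch: parse server-sent-event lines as a streaming generate endpoint
# might emit them. The "data: {...}" framing is the standard SSE
# convention; the payload shape ("text" field, "[DONE]" sentinel) is an
# assumption, not the documented schema of /inference/v1/generate.

def iter_stream_chunks(lines):
    """Yield decoded JSON chunks from SSE lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

raw = [
    'data: {"text": "Hello"}',
    '',                         # keep-alive blank line
    'data: {"text": " world"}',
    'data: [DONE]',
]
print("".join(chunk["text"] for chunk in iter_stream_chunks(raw)))  # Hello world
```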
📈 Development Activity Observations
- Standout AMD contributor activity: ChuanLi1101 is a key contributor driving multiple ROCm PRs and issues in parallel while requesting elevated permissions, showing deep, accountable investment in the AMD ecosystem.
- High core-maintainer engagement: Core maintainers were @-mentioned in key discussions such as the attribution dispute (#38942) and the MI250x regression (#38914) and are expected to step in; prompt technical feedback is also visible in many PR comment threads.
- New contributors emerging: Several bug reports and fixes came from first- or second-time contributors (e.g. #38892, #38909), a sign the community remains attractive and healthy.
- Merge throughput: 36 PRs merged within 24 hours indicates a smoothly running review-and-merge pipeline.
💡 Issues Worth Watching
- Attribution dispute (Issue #38942): A significant event for open-source ethics and collaboration norms. How it is resolved will show how much the maintainers value contributors' work and may affect the motivation of future external contributors.
- Cross-request data contamination (Issue #38903): A serious security issue: user contexts leak between requests in multi-node setups with async scheduling plus pipeline parallelism. The current workaround is to disable async scheduling; the root cause is still unlocated and warrants close attention.
- Gemma 4 performance and compatibility:
  - Performance trade-off (Issue #38887 & PR #38891): an engineering call between "avoiding numerical risk" and "chasing peak performance". Whether mixed backends are safe needs a core-developer conclusion backed by testing.
  - Legacy hardware support (Issue #38918): whether it is worth adapting kernels so that new models like Gemma 4 run on older architectures such as Turing, a maintenance-cost vs. user-benefit trade-off.
📋 Appendix: Detailed Data Lists
New Issues
- #38946 [Bug]: Gemma 4 tool usage produces invalid json due to bug in streaming — bug — by Brummi (created: 2026-04-04 05:13 (UTC+8))
- #38931 [Bug]: Deepseek R1 produces incorrect output — bug — by wzhao18 (created: 2026-04-04 02:31 (UTC+8))
- #38887 [Bug]: Gemma 4 E4B extremely slow on v0.19.0 forced TRITON_ATTN fallback yields ~9 tok/s on RTX 4090 (vs ~100+ tok/s for comparable Llama 3B) — bug — by CunXin1 (created: 2026-04-03 14:51 (UTC+8))
- #38926 [Bug]: Gemma4-31B freezes on multiple RTX6000 PRO during loading — bug — by kovern (created: 2026-04-04 01:51 (UTC+8))
- #38954 Request for triage permission — active ROCm contributor — rocm — by ChuanLi1101 (created: 2026-04-04 06:43 (UTC+8))
- #38942 Request for attribution: Multi-ISA CPU dispatcher work (PR #35466) — no labels — by MekayelAnik (created: 2026-04-04 04:42 (UTC+8))
- #38910 [Bug]: Gemma4 tool parser duplicates HTML tag prefixes in streamed tool arguments — no labels — by yoke233 (created: 2026-04-03 19:07 (UTC+8))
- #38936 [Bug]: NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 offline execution fails — bug — by shilpa-ananth (created: 2026-04-04 03:14 (UTC+8))
- #38925 [Feature]: Support lightweight import of vllm protocol types without torch dependency — no labels — by hexfusion (created: 2026-04-04 00:38 (UTC+8))
- #38930 Entropy-adaptive per-head KV cache quantization: +8% quality over uniform at same compression — no labels — by SCJedi (created: 2026-04-04 02:17 (UTC+8))
- #38929 [Bug]: Draft model speculative decoding tests failing: async_scheduling not enabled and engine core initialization errors — bug — by puririshi98 (created: 2026-04-04 02:15 (UTC+8))
- #38884 [Bug]: Gemma 4 torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors — bug — by NilsHellwig (created: 2026-04-03 14:37 (UTC+8))
- #38916 [Bug]: Qwen3.5 Inference TimeoutError with flashinfer gdn backend — bug — by xyang16 (created: 2026-04-03 22:50 (UTC+8))
- #38885 [Bug]: Potential misalignment between qwen3.5 chat template and recommended tool parser — bug — by sen-ppl (created: 2026-04-03 14:44 (UTC+8))
- #38924 [Bug][ROCm] GLM-5 MXFP4 sparse MLA decode crash on MI355x — rocm — by ChuanLi1101 (created: 2026-04-04 00:36 (UTC+8))
- #38923 [Feature]: Built-in request multiplexer to let `vllm serve` use all available GPUs without external proxies — feature request — by vadimkantorov (created: 2026-04-04 00:30 (UTC+8))
- #38918 [Usage]: Gemma4 on Turing GPUs (SM 7.5): all attention backends hit shared memory limits — usage — by lisp19 (created: 2026-04-03 23:05 (UTC+8))
- #38908 [Bug]: H100 CUDA 13.0 nightly wheels require glibc 2.35 (manylinux_2_35), failing on glibc 2.34 systems — bug — by shahizat (created: 2026-04-03 19:01 (UTC+8))
- #38912 Gemma 4 MoE NVFP4: expert_params_mapping doesn’t handle scale key suffixes — no labels — by marioiseli89 (created: 2026-04-03 19:26 (UTC+8))
- #38911 [Bug]: tool_choice=’required’+PD disaggregation; internal server error for GLM-5 — bug — by fool7367 (created: 2026-04-03 19:19 (UTC+8))
- #38902 [Feature]: Support for transformers>=5.0 in a stable release — feature request — by zhang-prog (created: 2026-04-03 17:39 (UTC+8))
- #38905 [Usage]: max-model-len set to ‘auto’,didn’t work. Why? — usage — by yuan19890512-ux (created: 2026-04-03 18:09 (UTC+8))
- #38894 [Bug]: Qwen3.5 with enable thinking only output content in reasoning field, content=None — bug — by Nevermetyou65 (created: 2026-04-03 16:34 (UTC+8))
- #38903 [Bug]: Cross-request context contamination with async scheduling + pipeline parallelism on multi-node — bug — by agis09 (created: 2026-04-03 17:53 (UTC+8))
- #38898 [Feature]: Mamba DSconv state layout: support speculative decoding with `mamba_cache_mode=align` — feature request — by NickLucche (created: 2026-04-03 17:01 (UTC+8))
- #38897 [Bug] DS conv state layout does not support speculative decoding with mamba_cache_mode=’align’ — no labels — by NickLucche (created: 2026-04-03 16:59 (UTC+8))
- #38893 [Feature]: Add Eagle3 Speculative Decoding Support for Gemma 4 Target Models — feature request — by Ofir408 (created: 2026-04-03 16:28 (UTC+8))
- #38892 [Bug]: matmul_batch_invariant does not handle all torch.matmul dimension combinations (4D x 3D for gemma4-E2B) — bug — by YM2132 (created: 2026-04-03 16:10 (UTC+8))
- #38886 [Bug]: Gemma 4 E4B weight loading fails: `Gemma4ClippableLinear` parameter `input_max` not recognized — bug — by CunXin1 (created: 2026-04-03 14:48 (UTC+8))
Closed Issues
- #28642 [Feature][P0]: Enable BuildKit Remote Cache in CI — feature request,ci/build,stale — by rzabarazesh (closed: 2026-04-04 10:15 (UTC+8))
- #29651 [Feature]: Add P/D disaggregation deployment on Ray — feature request,ray,stale — by JackyMa1997 (closed: 2026-04-04 10:15 (UTC+8))
- #30021 [Bug]: Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause. — bug,stale — by microchila (closed: 2026-04-04 10:15 (UTC+8))
- #30053 [Docker Hub]: missing TPU docker images v.0.11.2 and v.0.12.0 — installation,stale — by m4r1k (closed: 2026-04-04 10:15 (UTC+8))
- #30058 [Feature]: Multi-Adapter Support for Embed Qwen3 8B Embedding Model — feature request,stale — by dawnik17 (closed: 2026-04-04 10:15 (UTC+8))
- #29581 [Bug]: Batch Invariant didn’t work with Qwen3-30B-A3B-Instruct-2507-AWQ — bug — by bi1101 (closed: 2026-04-03 22:54 (UTC+8))
- #38897 [Bug] DS conv state layout does not support speculative decoding with mamba_cache_mode=’align’ — no labels — by NickLucche (closed: 2026-04-03 17:01 (UTC+8))
- #38309 [Bug]: microsoft/Phi-4-reasoning-vision-15B Fails to startup — bug — by varun-sundar-rabindranath (closed: 2026-04-03 12:14 (UTC+8))
New PRs
- #38961 [IR][RmsNorm] pass None if not has_weight — ready — by lk-chen (created: 2026-04-04 10:10 (UTC+8))
- #38962 [Bugfix] Include device index in compile cache paths — no labels — by nascheme (created: 2026-04-04 10:49 (UTC+8))
- #38917 [Bugfix] Fix Qwen3.5 LoRA activation for shared expert modules — bug,qwen — by jayden222 (created: 2026-04-03 22:52 (UTC+8))
- #38959 [ROCm][CI] Fix ROCm Dockerfile conftest generation for older Docker parsers — rocm,ready — by AndreasKaratzas (created: 2026-04-04 08:42 (UTC+8))
- #38949 [IR][RmsNorm] register None param if has_weight==False — ready — by lk-chen (created: 2026-04-04 05:51 (UTC+8))
- #38960 [MoE Refactor] Split up compressed_tensors_moe.py — no labels — by bnellnm (created: 2026-04-04 09:39 (UTC+8))
- #38947 [ROCm][Perf] Add AITER MLA prefill kernel for dense MLA backend — rocm,v1 — by ChuanLi1101 (created: 2026-04-04 05:33 (UTC+8))
- #38940 [vLLM IR] Avoid redundant file reads in IrOpImpl.uuid() — vllm-ir — by sBobHuang (created: 2026-04-04 03:45 (UTC+8))
- #38877 [compile] mla + group fp8 fusion — v1 — by carlyou (created: 2026-04-03 12:26 (UTC+8))
- #38958 [Bugfix] Fix piecewise backend to support torch.cond — no labels — by ydwu4 (created: 2026-04-04 08:01 (UTC+8))
- #38957 [Bugfix][cmake] fix FP4 ARCH for CUDA>=13.0 — no labels — by carlyou (created: 2026-04-04 07:51 (UTC+8))
- #38915 [Bug] Fix compile error for `swap_blocks_batch` in CUDA 13 — bug,ready,nvidia — by yewentao256 (created: 2026-04-03 21:58 (UTC+8))
- #38939 [R3] Add routed experts to openai entrypoint — frontend — by hao-aaron (created: 2026-04-04 03:25 (UTC+8))
- #38927 [Bugfix][LoRA] Fix missing in_proj_z in Qwen3_5ForConditionalGenerati… — bug,ready,qwen — by elenalil-aws (created: 2026-04-04 01:59 (UTC+8))
- #38938 Bug/test eagle dp v0 — bug,ready,v1 — by Monishver11 (created: 2026-04-04 03:19 (UTC+8))
- #38950 [Docker] Add fastsafetensors to NVIDIA Dockerfile — ready — by zhewenl (created: 2026-04-04 06:12 (UTC+8))
- #38956 Add device: h200_18gb to 50 passing CI test steps — no labels — by khluu (created: 2026-04-04 07:15 (UTC+8))
- #38955 Refactor Arctic loading to use AutoWeightsLoader — no labels — by lalit10 (created: 2026-04-04 07:07 (UTC+8))
- #38951 [ROCm][CI] Minor missing import patch — rocm,ready — by AndreasKaratzas (created: 2026-04-04 06:31 (UTC+8))
- #38937 [ROCm][CI] Added back missing common deps — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-04-04 03:16 (UTC+8))
- #38943 [BUG] Fix PP for R3 — bug,v1 — by hao-aaron (created: 2026-04-04 04:42 (UTC+8))
- #38933 [Performance Improvement] Update `batched_count_greater_than` to handle batch size 1 without recompile — v1 — by Lucaskabela (created: 2026-04-04 02:56 (UTC+8))
- #38953 [MLA] Map bfloat16/float16 kv_cache_dtype to auto for C++ cache ops — no labels — by voipmonitor (created: 2026-04-04 06:33 (UTC+8))
- #38952 [SM120] DeepGEMM check: warning instead of raise in sparse indexer — no labels — by voipmonitor (created: 2026-04-04 06:33 (UTC+8))
- #38871 [9/n] Migrate attention and cache kernels to torch stable ABI — rocm,ci/build,nvidia — by mikaylagawarecki (created: 2026-04-03 11:31 (UTC+8))
- #38914 [ROCm] mi250x decode regression — rocm — by rlrs (created: 2026-04-03 21:37 (UTC+8))
- #38948 Rocm72 py311 d12 — documentation,performance,new-model,rocm,frontend,intel-gpu,speculative-decoding,needs-rebase,ci/build,v1 — by kiran-thumma (created: 2026-04-04 05:45 (UTC+8))
- #38935 [PD][HeteroArch]Fix accuracy issue with CPU_ATTN as Decoder and Flash_ATTN as prefiller — intel-gpu,cpu,kv-connector — by xuechendi (created: 2026-04-04 02:58 (UTC+8))
- #38944 Re-enable Inductor pre-grad passes in standalone compile (torch>=2.12) — ready — by frgossen (created: 2026-04-04 04:45 (UTC+8))
- #38945 [Bugfix] Gemma 4: Fix bug around invalid JSON diffs during tool usage — bug,tool-calling — by Brummi (created: 2026-04-04 05:02 (UTC+8))
- #38891 [Gemma4] Allow per-layer attention backend selection for heterogeneou… — no labels — by CunXin1 (created: 2026-04-03 15:48 (UTC+8))
- #38919 [Bugfix] Runtime driver check for cuMemcpyBatchAsync in swap_blocks_batch — bug — by Etelis (created: 2026-04-03 23:18 (UTC+8))
- #38941 [ci] Remove soft fail for AMD image build job — rocm,ready,ci/build — by khluu (created: 2026-04-04 04:13 (UTC+8))
- #38873 fea: avoid long multi-modal preprocessing block the async processsing… — v1 — by robinren03 (created: 2026-04-03 11:50 (UTC+8))
- #38934 Remove MQ multi-node tests — ready,ci/build — by jeffreywang-anyscale (created: 2026-04-04 02:57 (UTC+8))
- #38928 [Bugfix][Perf] Indexer upcast WK to BF16 for fusion — bug,deepseek — by benchislett (created: 2026-04-04 02:06 (UTC+8))
- #38879 [Gemma4] Enable Fast Prefill Optimization — new-model,ready,multi-modality,tool-calling — by LucasWilkinson (created: 2026-04-03 13:00 (UTC+8))
- #38932 Revert “[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync (#38460)” — v1 — by khluu (created: 2026-04-04 02:42 (UTC+8))
- #38875 [Bugfix] Fix Gemma4 NVFP4 quantized model weight loading — bug — by 2imi9 (created: 2026-04-03 11:57 (UTC+8))
- #38921 [ROCm][CI] Move skipped tests out of run-amd-test.sh — rocm,ci/build — by micah-wil (created: 2026-04-04 00:16 (UTC+8))
- #38913 [NVIDIA] Update FlashInfer to version 0.6.7.post1. Avoid re-downloading BMM export headers when flashinfer-cubin is installed — ready,ci/build,nvidia,ready-run-all-tests — by johnnynunez (created: 2026-04-03 19:34 (UTC+8))
- #38922 [Bugfix] Fix broken explicit unquantized kv cache dtype support — bug — by Isotr0py (created: 2026-04-04 00:22 (UTC+8))
- #38920 [Docs] add cache directory security guidance — documentation — by russellb (created: 2026-04-03 23:44 (UTC+8))
- #38872 [Misc] Clean up Gemma4 implementation — ready — by Isotr0py (created: 2026-04-03 11:34 (UTC+8))
- #38889 [Doc] add Apple Clang 21+ compilation troubleshooting — documentation,cpu — by Alex-ai-future (created: 2026-04-03 15:40 (UTC+8))
- #38904 [XPU][CI] Skip test_topp_only and test_topk_and_topp cases on Intel GPU in CI — intel-gpu,ready,ci/build — by zxd1997066 (created: 2026-04-03 18:03 (UTC+8))
- #38890 [Bugfix] Fix logger.warning format string arg mismatch in Qwen3XML tool parser — bug,tool-calling,qwen — by edenfunf (created: 2026-04-03 15:44 (UTC+8))
- #38906 [Core] Make handshake timeout configurable — v1 — by foraxe (created: 2026-04-03 18:10 (UTC+8))
- #38909 [Bugfix][Frontend] Fix Gemma4 streaming HTML duplication after tool calls — bug,tool-calling — by yoke233 (created: 2026-04-03 19:03 (UTC+8))
- #38907 Fix the order of _free_encoder_inputs — v1 — by gty111 (created: 2026-04-03 18:40 (UTC+8))
- #38900 [P/D][Feature] Support kv_transfer_params for parallel sampling (n>1) — frontend,v1,kv-connector — by chaunceyjiang (created: 2026-04-03 17:24 (UTC+8))
- #38901 refactor hard coded device string in test files under tests/compile tests/quantization tests/models and tests/model_executor — v1,multi-modality — by wincent8 (created: 2026-04-03 17:30 (UTC+8))
- #38899 [XPU][CI] Skip test_topk_only cases on Intel GPU in CI — intel-gpu,ready,ci/build — by zxd1997066 (created: 2026-04-03 17:07 (UTC+8))
- #38888 [Multimodal] Add attention-score-based image token pruning for Qwen VL models — documentation,frontend,v1,multi-modality,qwen — by shhn1 (created: 2026-04-03 15:00 (UTC+8))
- #38895 bugfix(flashinfer,dcp): remove kv_cache_layout for BatchDCPPrefillWrapper._new_tokens. — bug,v1,nvidia — by pisceskkk (created: 2026-04-03 16:35 (UTC+8))
- #38896 [XPU] [CT] Enable CT W4A4MxFp4 path and add xpu kernel — intel-gpu — by zufangzhu (created: 2026-04-03 16:36 (UTC+8))
- #38883 [Transformers v5] Skip ColBERT jina HF comparison due to remote code incompatibility — no labels — by Lidang-Jiang (created: 2026-04-03 14:09 (UTC+8))
- #38882 [Bugfix] Fix GGUF parameter mapping for Transformers v5 fused MoE experts — bug — by Lidang-Jiang (created: 2026-04-03 14:08 (UTC+8))
- #38881 fix: expose full steering API in OpenAI protocol models — frontend,v1 — by RhizoNymph (created: 2026-04-03 14:05 (UTC+8))
- #38876 [CI/Build] Add audio deps in Dockerfile.cpu — ready,ci/build,cpu — by bigPYJ1151 (created: 2026-04-03 12:16 (UTC+8))
- #38880 [Helion] Add rotary positional embedding (RoPE) Helion kernels (neox + gptj) — no labels — by meinie0826 (created: 2026-04-03 13:34 (UTC+8))
- #38874 [Bugfix] Fix Gemma4 NVFP4 quantized model weight loading — bug,documentation,performance,new-model,rocm,structured-output,frontend,intel-gpu,speculative-decoding,needs-rebase — by 2imi9 (created: 2026-04-03 11:52 (UTC+8))
Merged PRs
- #38870 [Bugfix] Fix DSV32 weight loading — bug,ready,deepseek — by zyongye (merged: 2026-04-04 10:57 (UTC+8))
- #38959 [ROCm][CI] Fix ROCm Dockerfile conftest generation for older Docker parsers — rocm,ready — by AndreasKaratzas (merged: 2026-04-04 10:41 (UTC+8))
- #38853 [Bug] Fix workspace manager `_current_workspaces` size — bug,ready,v1 — by yewentao256 (merged: 2026-04-04 09:29 (UTC+8))
- #38915 [Bug] Fix compile error for `swap_blocks_batch` in CUDA 13 — bug,ready,nvidia — by yewentao256 (merged: 2026-04-04 07:56 (UTC+8))
- #38927 [Bugfix][LoRA] Fix missing in_proj_z in Qwen3_5ForConditionalGenerati… — bug,ready,qwen — by elenalil-aws (merged: 2026-04-04 07:30 (UTC+8))
- #38951 [ROCm][CI] Minor missing import patch — rocm,ready — by AndreasKaratzas (merged: 2026-04-04 07:01 (UTC+8))
- #38937 [ROCm][CI] Added back missing common deps — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-04-04 06:58 (UTC+8))
- #38585 [ROCm][CI/Build] Fix the pytest hook to properly print out the summary — rocm,ready,ci/build — by gshtras (merged: 2026-04-03 17:24 (UTC+8))
- #38941 [ci] Remove soft fail for AMD image build job — rocm,ready,ci/build — by khluu (merged: 2026-04-04 04:42 (UTC+8))
- #38238 Removed GPU state confirmation and cleanup steps. — rocm,ready,ci/build — by dhonnappa-amd (merged: 2026-04-04 04:11 (UTC+8))
- #38934 Remove MQ multi-node tests — ready,ci/build — by jeffreywang-anyscale (merged: 2026-04-04 04:00 (UTC+8))
- #38758 [Model Runner V2] Add config validation for not-yet-supported features — ready,ci/build — by njhill (merged: 2026-04-04 03:08 (UTC+8))
- #36487 [CPU] Replace OMP initialization — ready,ci/build,v1,cpu,verified — by kot-begemot-uk (merged: 2026-04-03 18:42 (UTC+8))
- #38859 [Bugfix] Re-enable Renormalize routing for TRT-LLM MoE experts — bug,ready,nvidia — by yzong-rh (merged: 2026-04-04 01:48 (UTC+8))
- #38807 [vLLM IR] add `import_ir_kernels()` to support OOT platforms — ready — by wxsIcey (merged: 2026-04-04 01:25 (UTC+8))
- #38711 Fix invalid logprobs with MTP enabled and sync scheduling — ready,v1 — by danisereb (merged: 2026-04-04 00:24 (UTC+8))
- #38138 [Frontend] new online quantization frontend — frontend,ready,quantization — by vkuzo (merged: 2026-04-03 23:58 (UTC+8))
- #38558 [KVConnector] Skip `register_kv_caches` on profiling — ready,v1 — by NickLucche (merged: 2026-04-03 23:40 (UTC+8))
- #38670 [Bugfix] Fix AWQ models batch invariance issues — bug,ready,v1 — by YM2132 (merged: 2026-04-03 22:54 (UTC+8))
- #38872 [Misc] Clean up Gemma4 implementation — ready — by Isotr0py (merged: 2026-04-03 13:47 (UTC+8))
- #38342 [XPU] bump up xpu-kernel v0.1.5, transpose moe weights — intel-gpu,ready,ci/build — by mayuyuace (merged: 2026-04-03 22:10 (UTC+8))
- #38325 [Kernel] Add swapAB support for SM120 CUTLASS blockwise FP8 GEMM — performance,ready,nvidia — by Nekofish-L (merged: 2026-04-03 21:49 (UTC+8))
- #38361 [GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill — ready,v1,nvidia — by arpera (merged: 2026-04-03 21:38 (UTC+8))
- #38825 [Intel][Triton] Support `round_int8` for Intel backend — intel-gpu,ready — by mieshkiwrk (merged: 2026-04-03 20:47 (UTC+8))
- #38904 [XPU][CI] Skip test_topp_only and test_topk_and_topp cases on Intel GPU in CI — intel-gpu,ready,ci/build — by zxd1997066 (merged: 2026-04-03 20:44 (UTC+8))
- #38615 [ROCm] Fix aiter persistent mode mla with q/o nhead<16 for kimi-k2.5 tp8 — rocm,ready,v1 — by wufann (merged: 2026-04-03 18:54 (UTC+8))
- #37171 [Frontend] feat: add streaming support for token generation endpoint — frontend,ready — by hhk7734 (merged: 2026-04-03 18:20 (UTC+8))
- #38899 [XPU][CI] Skip test_topk_only cases on Intel GPU in CI — intel-gpu,ready,ci/build — by zxd1997066 (merged: 2026-04-03 17:50 (UTC+8))
- #38655 Fix Nano Nemotron VL regressions — ready,multi-modality — by netanel-haber (merged: 2026-04-03 15:22 (UTC+8))
- #38876 [CI/Build] Add audio deps in Dockerfile.cpu — ready,ci/build,cpu — by bigPYJ1151 (merged: 2026-04-03 13:05 (UTC+8))
- #38698 [MRV2][KVConnector] Fix missing build_connector_worker_meta — ready,v1 — by ivanium (merged: 2026-04-03 13:42 (UTC+8))
- #38746 [Bug] Add e_score_correction_bias to SKIP_TENSORS — bug,ready — by hao-aaron (merged: 2026-04-03 12:15 (UTC+8))
- #36298 full cudagraph for flex-attn — ready,v1,nvidia — by shunting314 (merged: 2026-04-03 12:15 (UTC+8))
- #38306 [Model] Add Phi4ForCausalLMV for microsoft/Phi-4-reasoning-vision-15B — new-model,ready,ci/build,multi-modality — by varun-sundar-rabindranath (merged: 2026-04-03 12:14 (UTC+8))
- #38664 [CI][ROCm] Add Qwen3.5-35B-A3B-MXFP4 model eval into CI — rocm,ready,qwen — by BowenBao (merged: 2026-04-03 12:05 (UTC+8))
- #38774 [ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle — rocm,ready — by BowenBao (merged: 2026-04-03 11:29 (UTC+8))
PRs Closed Without Merging
- #38753 [Bench] update chat template in gsm8k_eval.py — no labels — by carlyou (closed: 2026-04-04 10:35 (UTC+8))
- #29823 [Frontend] Fixes glm reasoning parser leaking tokens — stale — by bbartels (closed: 2026-04-04 10:15 (UTC+8))
- #29846 [V1][Spec Decode][Feature] support tree attention in flash-linear-attenton — speculative-decoding,stale,v1 — by menggeliu1205 (closed: 2026-04-04 10:15 (UTC+8))
- #29874 [Frontend] add toolparser for deepseek v3.2 reusing qwen xml parser — frontend,stale,tool-calling,qwen,deepseek — by wenmengzhou (closed: 2026-04-04 10:15 (UTC+8))
- #38949 [IR][RmsNorm] register None param if has_weight==False — ready — by lk-chen (closed: 2026-04-04 09:58 (UTC+8))
- #38287 [Perf] Skip kv connector empty work, Around 1% Throughput Improvement — ready,v1,kv-connector — by yewentao256 (closed: 2026-04-04 08:02 (UTC+8))
- #38436 [Bugfix] Fix NaN corruption from CUDA graph padding in NVFP4 models — bug,v1,nvidia — by elvircrn (closed: 2026-04-04 02:57 (UTC+8))
- #38932 Revert “[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync (#38460)” — v1 — by khluu (closed: 2026-04-04 02:46 (UTC+8))
- #38875 [Bugfix] Fix Gemma4 NVFP4 quantized model weight loading — bug — by 2imi9 (closed: 2026-04-04 01:50 (UTC+8))
- #38852 fix a cpu distributed ci testing issue — needs-rebase,ci/build,cpu — by louie-tsai (closed: 2026-04-03 23:43 (UTC+8))
- #37239 [Models][GDN] Prevent D2H sync in `ChunkGatedDeltaRule` — needs-rebase,v1,qwen — by lgeiger (closed: 2026-04-03 22:49 (UTC+8))
- #25386 [Ray][CPU] Ray executor and Ray DP support for CPU backend — documentation,needs-rebase,ci/build,unstale,v1,cpu — by alex-coniasse (closed: 2026-04-03 17:44 (UTC+8))
- #38861 Omp improvement from PR#36487 — ci/build,v1,cpu — by louie-tsai (closed: 2026-04-03 17:44 (UTC+8))
- #37276 fix: combine content/reasoning with tool calls in responses API (#37167) — frontend,needs-rebase — by aayushbaluni (closed: 2026-04-03 15:56 (UTC+8))
- #38881 fix: expose full steering API in OpenAI protocol models — frontend,v1 — by RhizoNymph (closed: 2026-04-03 14:05 (UTC+8))
- #38874 [Bugfix] Fix Gemma4 NVFP4 quantized model weight loading — bug,documentation,performance,new-model,rocm,structured-output,frontend,intel-gpu,speculative-decoding,needs-rebase — by 2imi9 (closed: 2026-04-03 11:54 (UTC+8))