vLLM Development Activity Report - 2026-04-01

Time window: 2026-04-01 11:38 (UTC+8) ~ 2026-04-02 11:38 (UTC+8)
Statistics: New Issues 32 | Closed Issues 17 | New PRs 68 | Merged PRs 30 | PRs Closed Without Merging 10

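📊 Daily Development Summary

During this period (April 1-2), vLLM development activity remained high, with 32 new Issues, 68 new PRs, and 30 merged PRs. Work concentrated on three areas: improving the AMD/ROCm platform experience, migrating to Transformers v5, and advancing the vLLM IR (intermediate representation) architecture. Community discussion centered on bug reports and fixes for disaggregated serving, quantization performance, and tool-calling accuracy.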

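🎯 AMD/ROCm Ecosystem Updates

AMD ecosystem activity was very high this period, centered on closing feature gaps, optimizing performance, and fixing bugs.

New Issues: a push for ROCm experience parity with CUDA and SGLang
- #38692, #38693, #38687: User functionstackx filed three Issues in a row. The core request is to bring the ROCm experience for disaggregated serving and CI testing up to the level of CUDA and of SGLang on ROCm. The specific problems: the vLLM router does not support the MoRI KV cache connector, the router has no ROCm CI tests, and the Docker images do not bundle drivers for the Pollara AINIC and Thor-2 NICs. AMD employee chunfangamd has stepped in and tagged the relevant colleagues.
- Status and impact: All three Issues are still open. They highlight users' pressing need for production readiness on AMD, and resolving them is critical to the deployment experience on AMD GPU clusters.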
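Active PRs: kernel optimization, compatibility fixes, and quantization support
- #38762 (rbrugaro-amd): Fixes pattern matching for the allreduce + rmsnorm fusion on DeepSeek MoE models on ROCm. A preprocessing step eliminates no-op view nodes so that the pattern matches.
- #38757 (mikaylagawarecki): Continues migrating shared CUDA/ROCm kernels to the libtorch stable ABI, improving cross-version compatibility.
- #38750 (gshtras): Fixes a ROCm runtime import failure caused by a missing symbol.
- #38774 (BowenBao): Refactors the Quark MoE MXFP4 quantization path to run through the oracle and kernel backends, and unifies the backend name as "AITER".
- #38719 (xaguilar-amd): Fixes incorrect Kimi-K2.5 output on AMD platforms when the Aiter MLA FP8 persistent kernel is used with CUDA graphs. The cause was KV cache corruption and incorrect metadata.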
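
The preprocessing idea described for #38762 can be pictured with a toy graph pass. The sketch below is not vLLM or torch.fx code: `Node`, `remove_noop_views`, and `matches_allreduce_rmsnorm` are hypothetical names, and a "no-op view" is assumed here to mean a view whose output shape equals its input shape. It only illustrates why dropping such a node lets a matcher see `all_reduce` feeding `rms_norm` directly.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical, minimal graph node: inputs are names of producer nodes.
    name: str
    op: str
    inputs: list = field(default_factory=list)
    shape: tuple = ()


def remove_noop_views(nodes):
    """Drop view nodes that keep the shape unchanged and rewire their users.

    Assumes `nodes` is in topological order.
    """
    by_name = {n.name: n for n in nodes}
    forwarded = {}  # removed view name -> name of the node it passed through
    kept = []
    for n in nodes:
        inputs = [forwarded.get(i, i) for i in n.inputs]
        is_noop_view = (
            n.op == "view"
            and len(inputs) == 1
            and by_name[inputs[0]].shape == n.shape
        )
        if is_noop_view:
            forwarded[n.name] = inputs[0]
            continue
        kept.append(Node(n.name, n.op, inputs, n.shape))
    return kept


def matches_allreduce_rmsnorm(nodes):
    """Naive matcher: an rms_norm whose first input is an all_reduce."""
    by_name = {n.name: n for n in nodes}
    return any(
        n.op == "rms_norm" and n.inputs and by_name[n.inputs[0]].op == "all_reduce"
        for n in nodes
    )


graph = [
    Node("x", "input", [], (4, 8)),
    Node("ar", "all_reduce", ["x"], (4, 8)),
    Node("v", "view", ["ar"], (4, 8)),  # shape-preserving view hides the pattern
    Node("norm", "rms_norm", ["v"], (4, 8)),
]
cleaned = remove_noop_views(graph)
```

In this toy graph the matcher fails on `graph` and succeeds on `cleaned`, which mirrors the effect the PR description attributes to the preprocessing step.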
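Closed Issues: long-standing problems resolved
- #35637: The AITER MoE TP4 error when running the Minimax M2.1 MXFP4 model on MI355 was closed following the merge of PR #34285.
- #35925: The Issue about Qwen3.5-35B producing corrupted output (caused by NaNs) when AITER is enabled has been closed.

Summary: The AMD team is actively responding to the call for "experience parity", and its focus is shifting from basic feature support to performance optimization, stability, and production readiness. Fixes around AITER kernels, quantization support, and build compatibility were the main themes this period.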

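💬 High-Traffic Discussions

RFC #38760: Per-iteration forward pass metrics
- Core topic: Proposes adding detailed per-iteration forward pass metrics at the vLLM engine level (such as prefill/decode request counts, KV cache depth, and actual GPU time). The goal is to address the shortcomings of the current asynchronous Prometheus pull model: lost data, unsynchronized samples, and no history.
- Positions:
  - Proposer (tedzhouhk): Argues that orchestration systems (autoscalers, routers, disaggregated serving planners) need these metrics to build accurate cost models, and that the existing metrics fall short.
  - Likely considerations (implied by the RFC details): The collection and export mechanism must be low-overhead so that it does not slow the engine, and the API design needs thought (for example, callbacks or lightweight RPC).
- Current status: The RFC is open for comment. It is an important infrastructure proposal for the ecosystem toolchain.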
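
To make the contrast with pull-based scraping concrete, here is a minimal sketch of what a per-iteration record with a callback hook could look like. Everything in it is an assumption for illustration: `IterationStats`, `IterationMetricsBuffer`, and the field names are hypothetical and are not taken from the RFC or from vLLM. The point is only that one record is kept per iteration, so nothing is lost between scrapes and all fields describe the same step.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class IterationStats:
    # Hypothetical fields, loosely following the metrics named in the RFC summary.
    iteration: int
    num_prefill_reqs: int
    num_decode_reqs: int
    kv_cache_tokens: int
    gpu_time_ms: float


class IterationMetricsBuffer:
    """Bounded history of per-iteration records plus push-style callbacks."""

    def __init__(self, capacity=1024):
        self._records = deque(maxlen=capacity)  # oldest records are dropped first
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def record(self, stats):
        self._records.append(stats)
        for callback in self._subscribers:
            callback(stats)  # each subscriber sees every iteration exactly once

    def history(self):
        return list(self._records)


buffer = IterationMetricsBuffer(capacity=2)
seen = []
buffer.subscribe(seen.append)
for i in range(3):
    buffer.record(
        IterationStats(
            iteration=i,
            num_prefill_reqs=1,
            num_decode_reqs=8,
            kv_cache_tokens=4096 + i,
            gpu_time_ms=12.5,
        )
    )
```

A real design would also have to settle how records leave the engine process cheaply, which is the overhead question the RFC discussion raises.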
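Issue #38257 (closed): Qwen3-VL-235B OOM on multi-image long conversations
- Core topic: A very large vision-language model runs out of memory in multi-turn conversations with multiple images and long text.
- Discussion:
  - User (cjackal): Provided detailed reproduction steps.
  - Community (DarkLight1337): Suggested ruling out CUDA graphs as the cause.
  - Maintainer (ywang96): Suspected a memory copy problem in the masked_scatter_ operation.
  - Resolution: Pointed to PR #34246, which replaces masked_scatter_ with direct indexing and addresses the memory problem at its root. The user confirmed that the PR works.
- Conclusion: The root cause was how masked_scatter_ behaved under certain conditions in PyTorch before a certain version. The simplification in PR #34246 fixes it. The case reflects ongoing work on memory optimization for large multimodal models.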
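
The operation at issue is the merge of multimodal embeddings into placeholder positions of the input embedding sequence. In PyTorch terms, the two forms being compared are `inputs_embeds.masked_scatter_(mask, src)` and boolean-index assignment `inputs_embeds[mask] = src`. The sketch below shows only the semantics of the direct-indexing form using plain Python lists. `merge_multimodal_embeddings` as written here is a hypothetical helper, not the code from PR #34246, and it says nothing about memory behavior.

```python
def merge_multimodal_embeddings(inputs_embeds, mm_embeds, is_multimodal):
    """Write mm_embeds, in order, into the positions flagged by is_multimodal."""
    if len(inputs_embeds) != len(is_multimodal):
        raise ValueError("mask length must match sequence length")
    positions = [i for i, flag in enumerate(is_multimodal) if flag]
    if len(positions) != len(mm_embeds):
        raise ValueError(
            f"expected {len(positions)} multimodal embeddings, got {len(mm_embeds)}"
        )
    merged = list(inputs_embeds)  # shallow copy; the input list is left untouched
    for pos, emb in zip(positions, mm_embeds):
        merged[pos] = emb  # direct index assignment at each placeholder
    return merged


text = [[0.0, 0.0], [0.1, 0.1], [0.0, 0.0], [0.2, 0.2]]
image = [[9.0, 9.0], [8.0, 8.0]]
mask = [True, False, True, False]
merged = merge_multimodal_embeddings(text, image, mask)
```

The explicit count check matters in practice: a mismatch between placeholder count and embedding count is a common source of silent corruption in this kind of merge.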

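🔥 Hot Topics and Trends

Transformers v5 compatibility push: A series of sub-task Issues labeled "Transformers v5" appeared (for example #38734, #38736, #38740, #38735), aimed at fixing model initialization failures caused by the upgrade of the HuggingFace Transformers library to v5. Community members were quick to claim them, which shows the project is in a critical phase of keeping up with an upstream dependency.
vLLM IR architecture deepens: Several Issues (#38733, #38744, #38745, #38756) cover migrating activations, quantization, RoPE, and other operations to the new vLLM IR (intermediate representation). Discussion spans kernel implementation, compilation optimization, and an OOT (out-of-tree) platform migration guide, marking a new stage in the project's low-level abstraction and hardware independence.
Disaggregated serving and heterogeneous inference: Issue #38710 reports an accuracy problem with heterogeneous XPU + CPU disaggregated serving, and Issue #38692 concerns KV transfer support for disaggregated serving on ROCm. Both reflect the industry's ongoing exploration of, and difficulties with, deploying complex model workflows efficiently across devices and nodes.
Quantization performance and accuracy regressions: Issue #38720 (FP8 significantly slower than BF16) and #38697 (W8A8 decode slower than FP16) report cases where quantization made performance worse rather than better. Meanwhile, PR #38728 proposes converting NVFP4 weights to FP8 for compute on Hopper to improve performance. The practical benefit of quantization depends heavily on the hardware and the kernel implementation, and it remains a focus of optimization.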

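🛠️ Key Technical Changes

Issue #38760 (RFC): engine-level per-iteration metrics proposal. This proposal could reshape the monitoring and scheduling ecosystem. It aims to provide high-fidelity, time-synchronized batch cost data, which is foundational for building intelligent load balancing and resource scheduling systems.
PR #38684 (merged): DeepSeek-V3.2 Indexer fusion. Fuses the WK and Weights_Proj projections in the Indexer, yielding a larger decode speedup than the previous overlap optimization (about 3% in a specific test). It is a good example of a low-overhead optimization targeted at a specific model architecture.
PR #34246 (merged): simplified multimodal masking. Relying on optimizations available since PyTorch 2.9.0, it replaces the masked_scatter_ operation, which was prone to memory problems, with direct index assignment. This resolved OOM for large models such as Qwen3-VL in multi-image scenarios and is a good example of fixing a deep defect through a dependency upgrade and code simplification.
PR #37831 (merged) and #38751 (revert): an illustration of strict code review and CI. PR #37831 fixed how the Qwen3Coder tool parser handles complex JSON Schema (anyOf/oneOf), but it introduced test failures after a merge conflict and was quickly reverted by #38751. This reflects the principle that main-branch stability comes first.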
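
For readers unfamiliar with the anyOf/oneOf problem, the sketch below shows one way a tool parser could resolve the effective type of a nullable parameter before converting a raw string argument. It is not the Qwen3CoderToolParser code from PR #37831: `resolve_param_type`, `convert_value`, and the "first non-null branch wins" policy are assumptions made for illustration.

```python
import json


def _first_non_null_type(schema):
    """Return the first non-null JSON Schema type found, or None."""
    declared = schema.get("type")
    if isinstance(declared, str):
        return None if declared == "null" else declared
    if isinstance(declared, list):  # e.g. {"type": ["null", "number"]}
        for candidate in declared:
            if candidate != "null":
                return candidate
        return None
    for key in ("anyOf", "oneOf"):  # e.g. Optional[int] style schemas
        for option in schema.get(key, []):
            resolved = _first_non_null_type(option)
            if resolved is not None:
                return resolved
    return None


def resolve_param_type(schema, default="string"):
    return _first_non_null_type(schema) or default


def convert_value(raw, schema):
    """Convert a raw string argument according to the resolved type."""
    if resolve_param_type(schema) == "string":
        return raw
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw  # leave unparseable input as-is rather than failing


nullable_int = {"anyOf": [{"type": "integer"}, {"type": "null"}]}
nullable_list = {"oneOf": [{"type": "null"}, {"type": "array", "items": {"type": "string"}}]}
```

Without this resolution step, a schema that has no top-level `type` would typically fall back to a string, so `"42"` would be passed to the tool as text rather than as a number.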

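📈 Development Activity Observations

Contributors: The AMD team (users with the *-amd suffix) stood out, with contributions across ROCm kernels, quantization, and CI. Many community developers (such as mateenali66 and zeel2104) also claimed Transformers v5 compatibility tasks on their own initiative, a sign of healthy community participation.
Code review and merging: 30 PRs were merged in a single day, which is a high rate. However, a number of PRs (such as #38762 and #38766) are in the "needs-rebase" state, which indicates that the main branch is changing quickly and contributors need to sync promptly. Some PRs had CI failures unrelated to their own changes, reflecting the complexity of the test environment.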

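💡 Issues to Watch

ROCm experience parity: Issues #38692, #38693, and #38687 point directly at key gaps in production deployment on AMD platforms. Progress on them will significantly affect vLLM adoption in the AMD ecosystem.
Transformers v5 migration: The series of related good first issues is the current focus of the project's compatibility upgrade. Their completion determines whether the latest Transformers models can be used smoothly in the future.
vLLM IR progress: Discussion and implementation of the related RFCs and tasks (#38744, #38733, and others) will set the ceiling for vLLM's future kernel abstraction, cross-platform portability, and compilation optimization.
Per-iteration Metrics RFC: The outcome of the #38760 discussion will directly shape what external schedulers and monitoring systems can do. Ecosystem developers working in this area are encouraged to take part.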

📋 Appendix: Detailed Data Lists

New Issues

#38782 [vLLM IR] Op test & benchmark infra | vllm-ir | by ProExpertProg (created: 2026-04-02 11:07 (UTC+8))
#38710 [Bug]: heterogeneous disaggregated serving XPU (Prefill) + CPU (Decode) accuracy issue | bug | by Spycsh (created: 2026-04-01 17:49 (UTC+8))
#38772 [Feature]: General LL GEMMs with PDL Support | feature request | by robertgshaw2-redhat (created: 2026-04-02 10:02 (UTC+8))
#38744 [RFC][vLLM IR]: Automatically compile native impl for IR ops | RFC,vllm-ir | by ProExpertProg (created: 2026-04-02 02:22 (UTC+8))
#38776 [Feature]: support nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 for turing and ampere | feature request | by ir1ka (created: 2026-04-02 10:13 (UTC+8))
#38745 [vLLM IR] Port QuantFP8 to IR op | vllm-ir | by ProExpertProg (created: 2026-04-02 02:23 (UTC+8))
#38706 [Usage]: How to launch the Qwen3.5 service using vLLM on a V100 GPU | usage | by yangqinghao-cmss (created: 2026-04-01 16:44 (UTC+8))
#38718 [Bug]: NVFP4 MoE produces garbage output on SM120 (RTX 5080) with CPU Weight Offloading - Nemotron-Cascade-2-30B-A3B | no labels | by lucaspirola (created: 2026-04-01 20:13 (UTC+8))
#38765 [vLLM IR] OOT migration guide | no labels | by ProExpertProg (created: 2026-04-02 08:12 (UTC+8))
#38760 [RFC]: Per-iteration forward pass metrics with accurate engine-level timing | RFC | by tedzhouhk (created: 2026-04-02 06:34 (UTC+8))
#38756 [vLLM IR] Port RoPE ops to IR | vllm-ir | by ProExpertProg (created: 2026-04-02 05:51 (UTC+8))
#38737 [Transformers v5] ColBERTJinaRobertaModel | help wanted,good first issue | by hmellor (created: 2026-04-01 23:58 (UTC+8))
#38734 [Transformers v5] SarvamMLAForCausalLM | help wanted,good first issue | by hmellor (created: 2026-04-01 23:53 (UTC+8))
#38736 [Transformers v5] Tarsier2ForConditionalGeneration | help wanted,good first issue | by hmellor (created: 2026-04-01 23:56 (UTC+8))
#38754 [Bug]: GPT OSS Router GEMM Causing NaNs | bug | by benchislett (created: 2026-04-02 05:15 (UTC+8))
#38692 [Bug]: parity with CUDA & parity with rocm sglang: vLLM router doesn’t current support MoRI kvcache connector | bug,rocm | by functionstackx (created: 2026-04-01 13:30 (UTC+8))
#38729 [Bug] All models hang on GB300 (SM103) with FlashInfer 0.6.7 | no labels | by stecasta (created: 2026-04-01 23:31 (UTC+8))
#38740 [Transformers v5] NemotronParseForConditionalGeneration | help wanted,good first issue | by hmellor (created: 2026-04-02 00:46 (UTC+8))
#38735 [Transformers v5] Ernie4_5_VLMoeForConditionalGenerati | help wanted,good first issue | by hmellor (created: 2026-04-01 23:54 (UTC+8))
#38733 [vLLM IR] Port activations to IR op | vllm-ir | by ProExpertProg (created: 2026-04-01 23:48 (UTC+8))
#38738 [Bug]: Anthropic Messages API + Mistral model: “Invalid assistant message” on multi-turn tool calling | bug | by YoyoSailer (created: 2026-04-02 00:28 (UTC+8))
#38725 [Proposal] Topology-Aware KV Cache Compression for Memory-Efficient Inference | performance | by jianxinglee62-prog (created: 2026-04-01 22:20 (UTC+8))
#38693 [Feature]: Parity with CUDA: vLLM router should have ROCm CI | feature request,rocm | by functionstackx (created: 2026-04-01 13:43 (UTC+8))
#38720 [Performance]: FP8 (Fp8OnlineLinearMethod) significantly slower than BF16 for ReplicatedLinear | performance | by zhangj1an (created: 2026-04-01 21:42 (UTC+8))
#38717 [Bug]: Bench Serve encounter utf-8 UnicodeDecodeError | bug | by JaredforReal (created: 2026-04-01 19:48 (UTC+8))
#38716 [Bug]: RuntimeError: failed to map GGUF parameters (18288) | bug | by ion-elgreco (created: 2026-04-01 19:21 (UTC+8))
#38713 [Bug]: Error when trying to serve MiniMax 2.5 on 4 H100 nodes with 4 GPUS | bug | by F-Michelon (created: 2026-04-01 18:44 (UTC+8))
#38700 [Bug]: vLLM fails to start with LMCache + Qwen3-Coder-Next-FP8 (nightly image) | bug | by Sanches166 (created: 2026-04-01 15:41 (UTC+8))
#38697 [Performance]: llmcompressor W8A8 Inference: decoding stage speed is lower than FP16 | performance | by DarkenStar (created: 2026-04-01 14:52 (UTC+8))
#38696 [Bug]: qwen3.5 when enable response_format json_schema outputs garbled spaces | bug | by Yyong25 (created: 2026-04-01 14:04 (UTC+8))
#38694 [RFC]: O(1) KV Cache for vLLM: 4.8x Speedup & 22x More Accurate than TurboQuant on Qwen2.5-7B | feature request | by JEWONMOON (created: 2026-04-01 13:47 (UTC+8))
#38687 [Bug]: parity with CUDA: ROCm nightly & release docker images aren’t built with Pollara AINIC or Broadcom Thor-2 NICs | bug,rocm | by functionstackx (created: 2026-04-01 12:42 (UTC+8))

Closed Issues

#29053 [Feature]: AMD radeon 8060s rocm7.1 torch== 2.10.0.dev20251113+rocm7.1 | feature request,rocm,stale | by gqyalh (closed: 2026-04-02 11:22 (UTC+8))
#36091 [RFC]: Add InstantTensor Support in vLLM | RFC | by arlo-scitix (closed: 2026-04-02 11:13 (UTC+8))
#27991 [Bug]: Qwen3-VL-8B into AWQ with llm-compressor not recognized. | bug,stale | by LouisAI-DL (closed: 2026-04-02 10:17 (UTC+8))
#28517 [Bug]: ValueError: There is no module or parameter named ‘mlp_AR’ in TransformersForCausalLM | bug,stale | by jiaohuix (closed: 2026-04-02 10:17 (UTC+8))
#29632 [RFC]: Force EOS generation when Structured Output Grammar is terminated | RFC,stale | by Ki-Seki (closed: 2026-04-02 10:16 (UTC+8))
#29722 [RFC]: Add Balance Scheduling | RFC,stale | by GDzhu01 (closed: 2026-04-02 10:16 (UTC+8))
#29817 [Bug]: vllm with lmcache crashes on semi-large number of parallel queries | bug,stale | by kvcop (closed: 2026-04-02 10:16 (UTC+8))
#29871 [Usage]: Extremly low token input speed for DeepSeek-R1-Distill-Llama-70B | usage,stale | by muelphil (closed: 2026-04-02 10:16 (UTC+8))
#35925 [Bug]: Qwen3.5-35B-A3B generates corrupted responses when AITER is enabled | bug,rocm | by jennyyyyzhen (closed: 2026-04-02 09:04 (UTC+8))
#35637 [Bug]: mi355 minimax m2.1 arch mxfp4 rocm AITER TP4 error | bug,rocm | by functionstackx (closed: 2026-04-02 09:01 (UTC+8))
#37387 [Bug]: [SM_120 / Blackwell] AWQ working with awq_marlin + TRITON_ATTN - field report | bug | by Alkevas (closed: 2026-04-02 08:17 (UTC+8))
#38666 [Bug]: Regression can no longer load Qwen 3.5 397B nvfp4 model - CUBLAS_STATUS_NOT_INITIALIZED | bug | by bitbottrap (closed: 2026-04-02 06:28 (UTC+8))
#38729 [Bug] All models hang on GB300 (SM103) with FlashInfer 0.6.7 | no labels | by stecasta (closed: 2026-04-02 03:08 (UTC+8))
#38634 [Bug]: MLA + FP8 KV cache + CUDA Graph causes random NaN in decode phase | bug | by wwwjs (closed: 2026-04-02 00:39 (UTC+8))
#38720 [Performance]: FP8 (Fp8OnlineLinearMethod) significantly slower than BF16 for ReplicatedLinear | performance | by zhangj1an (closed: 2026-04-01 21:55 (UTC+8))
#38257 [Bug]: Qwen3-VL-235B OOM with multi-image long multiturn inputs | bug | by cjackal (closed: 2026-04-01 16:19 (UTC+8))
#36530 [Bug]: pip install 0.17 fails with CXXABI_1.3.15 not found | bug | by jlqibm (closed: 2026-04-01 13:31 (UTC+8))

New PRs

#38780 [vLLM IR] register gemma_rms_norm ir kernel | no labels | by wxsIcey (created: 2026-04-02 11:00 (UTC+8))
#38684 [Perf] DSV3.2 Indexer Fused Weights Projection | ready,deepseek | by benchislett (created: 2026-04-01 11:59 (UTC+8))
#38781 [LongCat flash] Fix ZeroExpertFusedMoE missing select_experts() in router and MTP fix | no labels | by ekagra-ranjan (created: 2026-04-02 11:05 (UTC+8))
#38783 [7/n] libtorch stable ABI | rocm,ci/build,nvidia | by mikaylagawarecki (created: 2026-04-02 11:17 (UTC+8))
#38739 Fix multiline-format string for python 3.10 | ready | by ProExpertProg (created: 2026-04-02 00:38 (UTC+8))
#38750 [ROCm][Bugfix] Fix ROCm runtime failure due to missing symbol | bug,rocm,ready | by gshtras (created: 2026-04-02 04:41 (UTC+8))
#38712 [Bugfix][Core] Fix negative prompt token counter increments with external KV cache accounting | bug,v1 | by chenminghua8 (created: 2026-04-01 18:07 (UTC+8))
#38779 [Transformers v5] Use os._exit() in EngineCoreProc to prevent third-party thread hangs | v1 | by Cursx (created: 2026-04-02 10:40 (UTC+8))
#38775 [vLLM IR] 4/N Compile native implementation | vllm-ir | by ProExpertProg (created: 2026-04-02 10:12 (UTC+8))
#38707 Zufang/ct mxfp8 | intel-gpu | by zufangzhu (created: 2026-04-01 16:50 (UTC+8))
#38703 [ZenCPU] Changes with respect to docker build and relevant cpu tests | ci/build,cpu | by Chinmay-Kulkarni-AMD (created: 2026-04-01 15:51 (UTC+8))
#38764 [Misc][LoRA] Add automerge weight merge for single-adapter LoRA serving | documentation,v1,nvidia | by bhoomit (created: 2026-04-02 07:55 (UTC+8))
#38778 Revert “[Kernel] Add gpt-oss Router GEMM kernel (#37205)” | performance,ready,ci/build,gpt-oss | by xyang16 (created: 2026-04-02 10:18 (UTC+8))
#38770 [CPU] Support gelu act in cpu_fused_moe | ready,v1,cpu | by bigPYJ1151 (created: 2026-04-02 09:45 (UTC+8))
#38777 Deprecate ExpertsInt8Config (fused int8 moe quantiation) | ci/build,v1,cpu | by namgyu-youn (created: 2026-04-02 10:17 (UTC+8))
#38709 [Core][Metrics] Remove vllm:prompt_tokens_recomputed metric | ready,v1,kv-connector | by markmc (created: 2026-04-01 17:07 (UTC+8))
#38769 Deprecate FP-Quant Support | v1 | by namgyu-youn (created: 2026-04-02 09:43 (UTC+8))
#38774 [ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle | rocm | by BowenBao (created: 2026-04-02 10:03 (UTC+8))
#38773 Revert “[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang” (#38730) | bug,documentation,nvidia | by vllm-agent (created: 2026-04-02 10:03 (UTC+8))
#38771 [Bugfix] Fix MLA kv_b_proj activation dtype with Marlin FP8 | bug | by jacobzhang22 (created: 2026-04-02 09:50 (UTC+8))
#38768 [Bugfix] install failed on macbook(cpu) | bug,cpu | by jesse996 (created: 2026-04-02 09:24 (UTC+8))
#38767 [Transformers v5] Add SarvamMLAConfig to fix SarvamMLAForCausalLM (#38734) | no labels | by Zelys-DFKH (created: 2026-04-02 08:51 (UTC+8))
#38766 [ROCM][Bug fix] fix aiter asm atten on hybrid models | bug,rocm,needs-rebase,v1 | by yuankaichen-amd (created: 2026-04-02 08:38 (UTC+8))
#38742 [BugFix] Handle numpy scalar types in MsgpackEncoder | bug,ready,v1 | by yaochengji (created: 2026-04-02 01:56 (UTC+8))
#38761 fix hang with pause and collectives | v1 | by hao-aaron (created: 2026-04-02 06:41 (UTC+8))
#38762 [ROCm] Fix rocm allreduce rmsnorm fusion for Deepseek models | rocm,needs-rebase,ci/build,deepseek,nvidia | by rbrugaro-amd (created: 2026-04-02 06:51 (UTC+8))
#38758 [Model Runner V2] Add config validation for not-yet-supported features | ready,ci/build | by njhill (created: 2026-04-02 06:13 (UTC+8))
#38763 only patch runtime_env for torch >= 2.10 | no labels | by Rohan138 (created: 2026-04-02 07:03 (UTC+8))
#38759 [BugFix] Fix precommit breakage due to conflicting in-flight merges | bug,ready,v1 | by njhill (created: 2026-04-02 06:30 (UTC+8))
#38751 Revert “[Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params (#37831)” | bug,ready,tool-calling,qwen | by khluu (created: 2026-04-02 04:44 (UTC+8))
#38746 [Bug] Add e_score_correction_bias to SKIP_TENSORS | bug,ready | by hao-aaron (created: 2026-04-02 02:56 (UTC+8))
#38757 [6/n] Migrate some shared CUDA/RoCM kernels to libtorch stable ABI | rocm,ci/build,nvidia | by mikaylagawarecki (created: 2026-04-02 05:55 (UTC+8))
#38701 [Refactor] Merge duplicate checks and error handling in Executor | needs-rebase,v1 | by idouba (created: 2026-04-01 15:47 (UTC+8))
#38755 [Parser] Migrate response api streaming to unified parser | frontend | by sfeng33 (created: 2026-04-02 05:19 (UTC+8))
#38753 [Bench] update chat template in gsm8k_eval.py | no labels | by carlyou (created: 2026-04-02 05:11 (UTC+8))
#38752 [Core] Use boxed_return in split_module for tuple-conformant subgraphs | needs-rebase | by frgossen (created: 2026-04-02 04:52 (UTC+8))
#38749 Skip reasoning parsing when using continue_final_message | frontend | by hsiehjackson (created: 2026-04-02 04:28 (UTC+8))
#38748 [Transformers v5] Fix NemotronParse image_size tuple unpack | no labels | by mateenali66 (created: 2026-04-02 03:47 (UTC+8))
#38715 [Bugfix] Fix intra-step KV block corruption from stale prefix cache hits | bug,v1 | by KrxGu (created: 2026-04-01 19:15 (UTC+8))
#38747 [Transformers v5] Fix Ernie4_5_VLMoeForConditionalGeneration rope_theta config | no labels | by mateenali66 (created: 2026-04-02 03:10 (UTC+8))
#38730 [Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang | bug,documentation,ready,nvidia | by stecasta (created: 2026-04-01 23:36 (UTC+8))
#38685 [ROCm][CI] Remove soft_fail from AMD Docker Image Build | rocm,ci/build | by micah-wil (created: 2026-04-01 12:19 (UTC+8))
#38711 Fix invalid logprobs with MTP enabled and sync scheduling | ready,v1 | by danisereb (created: 2026-04-01 18:01 (UTC+8))
#38705 [Bugfix] Fix TypeError in response_input_to_harmony when assistant content is None | bug,frontend,gpt-oss | by dubin555 (created: 2026-04-01 16:24 (UTC+8))
#38689 ROCm sometimes compiles problematically on torch.log on MI325 | rocm,llama | by Concurrensee (created: 2026-04-01 13:05 (UTC+8))
#38743 [Kernel] [Helion] Use warning_once in get_gpu_name to prevent log spam | ready | by gmagogsfm (created: 2026-04-02 02:16 (UTC+8))
#38699 [Bugfix] Correct mistake in chained comparison in static assert logic | bug,cpu | by KyleMylonakisProtopia (created: 2026-04-01 15:39 (UTC+8))
#38686 fix(lora): use float32 intermediate buffer in fused MoE LoRA to prevent bf16 precision loss | no labels | by prsabahrami (created: 2026-04-01 12:25 (UTC+8))
#38741 [Doc] Add Scheduler section to V1 architecture overview | documentation,intel-gpu | by CodersAcademy006 (created: 2026-04-02 00:58 (UTC+8))
#38732 [Bugfix] Fix bench_serve UTF-8 decode crash on split multi-byte chars | bug,performance | by he-yufeng (created: 2026-04-01 23:47 (UTC+8))
#38683 [Quantization] Rename mxfp4 quant layer and oracle to gpt_oss_mxfp4 | needs-rebase,gpt-oss | by zyongye (created: 2026-04-01 11:58 (UTC+8))
#38726 [Bugfix][Core] Fix stuck chunked pipeline parallelism with async scheduling | bug,v1 | by starkwj (created: 2026-04-01 22:27 (UTC+8))
#38690 [WIP][Do not merge yet] Update flash-attention to latest upstream FA4 | ready,ci/build,nvidia | by LucasWilkinson (created: 2026-04-01 13:19 (UTC+8))
#38731 [Bugfix] Harden DP handshake port against non-engine traffic | bug,v1 | by he-yufeng (created: 2026-04-01 23:43 (UTC+8))
#38714 Add ibm-granite/granite-vision-3.3-2b to supported models documentation | documentation,ready | by jesus-talavera-ibm (created: 2026-04-01 18:46 (UTC+8))
#38728 [Quantization] Convert NVFP4 weights to FP8 on Hopper for faster inference | no labels | by Tib-Gridello (created: 2026-04-01 22:47 (UTC+8))
#38722 [Misc] Fix docstring typo: buildin -> builtin | frontend,gpt-oss | by crawfordxx (created: 2026-04-01 22:18 (UTC+8))
#38727 nano-nemotron-vl: get_mm_max_tokens_per_item for audio, video, image == seq_len | no labels | by netanel-haber (created: 2026-04-01 22:36 (UTC+8))
#38723 Fix shape comment in extract_hidden_states example | documentation,ready | by fynnsu (created: 2026-04-01 22:18 (UTC+8))
#38724 [Misc] Fix typos in source code comments | v1,kv-connector | by crawfordxx (created: 2026-04-01 22:20 (UTC+8))
#38721 [Misc] Fix typos in test comments | v1,kv-connector | by crawfordxx (created: 2026-04-01 22:18 (UTC+8))
#38688 [Renderer] Enforce token-only inputs for LLMEngine and AsyncLLM | documentation,frontend,ready,ci/build,v1,multi-modality | by DarkLight1337 (created: 2026-04-01 12:42 (UTC+8))
#38719 Fix Kimi-K2.5 accuracy when Aiter MLA FP8 PS + CUDA graphs are used | rocm,v1,nvidia | by xaguilar-amd (created: 2026-04-01 21:03 (UTC+8))
#38704 [ROCm][perf] Use workspace manager for sparse indexer allocations | rocm,v1 | by gronsti-amd (created: 2026-04-01 16:15 (UTC+8))
#38708 Add verified label to trigger pre-commit | ready,ci/build | by hmellor (created: 2026-04-01 16:58 (UTC+8))
#38698 [MRV2][KVConnector] Fix missing build_connector_worker_meta | ready,v1 | by ivanium (created: 2026-04-01 15:26 (UTC+8))
#38702 [OPT] Optimize the fused moe triton kernel routing expert accumulation | no labels | by BJWang-ant (created: 2026-04-01 15:48 (UTC+8))
#38695 [Bugfix] Support [TOOL_CALLS] single-token format in Jamba tool parser | bug,documentation,tool-calling | by oromanenko-nv (created: 2026-04-01 14:02 (UTC+8))

Merged PRs

#38684 [Perf] DSV3.2 Indexer Fused Weights Projection | ready,deepseek | by benchislett (merged: 2026-04-02 11:34 (UTC+8))
#38739 Fix multiline-format string for python 3.10 | ready | by ProExpertProg (merged: 2026-04-02 11:19 (UTC+8))
#38676 [CPU] Support head_size 512 in cpu_attn | documentation,ready,v1,cpu | by bigPYJ1151 (merged: 2026-04-01 13:42 (UTC+8))
#38759 [BugFix] Fix precommit breakage due to conflicting in-flight merges | bug,ready,v1 | by njhill (merged: 2026-04-02 06:35 (UTC+8))
#38751 Revert “[Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params (#37831)” | bug,ready,tool-calling,qwen | by khluu (merged: 2026-04-02 06:34 (UTC+8))
#38673 [Bugfix] Preserve original ImportError in gRPC server entrypoint | bug,frontend,ready | by CatherineSue (merged: 2026-04-02 06:16 (UTC+8))
#36836 [Feat][Executor] Introduce RayExecutorV2 | ready,ci/build,v1 | by jeffreywang-anyscale (merged: 2026-04-02 05:34 (UTC+8))
#38644 [Refactor] Simplify FutureWrapper in MultiprocExecutor | ready,v1 | by yzong-rh (merged: 2026-04-02 05:28 (UTC+8))
#32996 Feature/silu block quant fusion v1 | documentation,performance,new-model,rocm,structured-output,frontend,intel-gpu,speculative-decoding,ready,ci/build | by Monishver11 (merged: 2026-04-02 02:50 (UTC+8))
#37831 [Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params | bug,ready,tool-calling,qwen | by AAISSJ (merged: 2026-04-01 20:22 (UTC+8))
#38730 [Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang | bug,documentation,ready,nvidia | by stecasta (merged: 2026-04-02 03:08 (UTC+8))
#38573 [Compile] Fix nvfp4 compile warning | ready,ci/build | by yewentao256 (merged: 2026-04-02 02:28 (UTC+8))
#38242 [Misc] Rename think_start_str/think_end_str to reasoning_start_str/reasoning_end_str | documentation,ready,v1 | by chaunceyjiang (merged: 2026-04-02 00:56 (UTC+8))
#34664 [Kernel] Add MXFP8 to Marlin GEMM/MoE and refactor Mxfp8LinearOp | performance,ready,nvidia,quantization | by mgoin (merged: 2026-04-02 00:41 (UTC+8))
#38714 Add ibm-granite/granite-vision-3.3-2b to supported models documentation | documentation,ready | by jesus-talavera-ibm (merged: 2026-04-01 23:22 (UTC+8))
#37940 [NIXL][BUG] Fix Triton heterogeneous TP | bug,ready,v1,kv-connector | by yzong-rh (merged: 2026-04-01 23:23 (UTC+8))
#38722 [Misc] Fix docstring typo: buildin -> builtin | frontend,gpt-oss | by crawfordxx (merged: 2026-04-01 22:39 (UTC+8))
#38723 Fix shape comment in extract_hidden_states example | documentation,ready | by fynnsu (merged: 2026-04-01 22:29 (UTC+8))
#35153 [MoE Refactor] Make SharedExperts class for use with DefaultMoERunner | nvidia,ready-run-all-tests | by bnellnm (merged: 2026-04-01 21:44 (UTC+8))
#38359 [Bugfix] Revert “Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding” | bug,ready,v1,nvidia | by elvircrn (merged: 2026-04-01 21:11 (UTC+8))
#38659 [1/N][Cleanup] Standardize on use of is_quantized_kv_cache | rocm,intel-gpu,ready,v1,cpu,nvidia | by MatthewBonanni (merged: 2026-04-01 12:08 (UTC+8))
#38179 [KVTransfer] Fix TpKVTopology.is_kv_replicated equality case | ready,kv-connector | by JianDan0212 (merged: 2026-04-01 18:41 (UTC+8))
#38636 (security) Enforce frame limit in VideoMediaIO | ready,multi-modality | by jperezdealgaba (merged: 2026-04-01 18:23 (UTC+8))
#38708 Add verified label to trigger pre-commit | ready,ci/build | by hmellor (merged: 2026-04-01 17:31 (UTC+8))
#37948 [Perf] triton bilinear_pos_embed kernel for ViT | performance,ready,multi-modality,qwen,nvidia | by zhandaz (merged: 2026-04-01 16:52 (UTC+8))
#34246 [Core] Simplify multimodal masking | ready,v1,qwen | by lgeiger (merged: 2026-04-01 16:18 (UTC+8))
#36178 [Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking | bug,ready,v1 | by LucasWilkinson (merged: 2026-04-01 12:15 (UTC+8))
#38649 [Bugfix] Lazy import diskcache to avoid sqlite3/libstdc++ ImportError at startup | bug,structured-output,ready,v1 | by jeffreywang-anyscale (merged: 2026-04-01 13:31 (UTC+8))
#38617 [bugfix] do not add extra linebreak for score/rerank with chat template | bug,frontend,ready | by staugust (merged: 2026-04-01 12:50 (UTC+8))
#38559 [Perf] Optimize mean pooling using chunks and index_add, 5.9% E2E throughput improvement | ready | by yewentao256 (merged: 2026-04-01 11:54 (UTC+8))

PRs Closed Without Merging

#38742 [BugFix] Handle numpy scalar types in MsgpackEncoder | bug,ready,v1 | by yaochengji (closed: 2026-04-02 08:09 (UTC+8))
#37491 [Build] Update CUTLASS revision from v4.2.1 to v4.4.2 | ready,ci/build,nvidia | by meena-at-work (closed: 2026-04-02 03:28 (UTC+8))
#25098 [Core] Make Whisper work with b200 + flashinfer | documentation,ready,v1,nvidia | by russellb (closed: 2026-04-02 03:22 (UTC+8))
#36291 add testing for Qwen3.5-27B-FP8 to GSM8K eval | v1,qwen,deepseek | by puririshi98 (closed: 2026-04-02 01:56 (UTC+8))
#38686 fix(lora): use float32 intermediate buffer in fused MoE LoRA to prevent bf16 precision loss | no labels | by prsabahrami (closed: 2026-04-02 01:12 (UTC+8))
#38461 Fixed issues | multi-modality | by rpathade (closed: 2026-04-01 23:40 (UTC+8))
#35229 [Perf] Fused silu_mul_fp8_quant_packed kernel for SM100 DeepGEMM MoE | no labels | by mgoin (closed: 2026-04-01 23:08 (UTC+8))
#37815 [MLAAttention] Clear Cudagraph padded region of FI decode Attention kernel | ready,v1,nvidia | by varun-sundar-rabindranath (closed: 2026-04-01 22:07 (UTC+8))
#37192 WIP: [Feature] KVCACHE NVFP4 | documentation,needs-rebase,v1,quantization | by JartX (closed: 2026-04-01 16:41 (UTC+8))
#30676 [WIP]dynamic port to solve port confict | stale,v1 | by 1643661061leo (closed: 2026-04-01 13:07 (UTC+8))