vLLM Development Activity Report - 2026-04-01
Time window: 2026-04-01 11:38 (UTC+8) – 2026-04-02 11:38 (UTC+8)
Stats: 32 new issues | 17 closed issues | 68 new PRs | 30 merged PRs | 10 PRs closed without merging
📊 Daily Development Summary
This period (Apr 1–2), vLLM development activity remained high: 32 new issues and 68 new PRs were opened, and 30 PRs were merged. Development focused on AMD/ROCm platform experience parity, the Transformers v5 compatibility migration, and progress on the vLLM IR (intermediate representation) architecture. Issue reports and fixes around disaggregated serving, quantization performance, and tool-calling accuracy were central to community discussion.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was very high this period, centered on feature parity, performance optimization, and bug fixes.
- New issues: pushing for ROCm experience parity with CUDA/SGLang
  - #38692, #38693, #38687: user functionstackx filed three issues in a row, the core ask being to bring ROCm's disaggregated-serving and CI-testing experience up to parity with CUDA and with ROCm SGLang. Concretely: the vLLM router does not support the MoRI KV cache connector, ROCm CI coverage is missing, and the Docker images are not built with Pollara AINIC or Broadcom Thor-2 NIC drivers. AMD's chunfangamd has stepped in and @-mentioned colleagues.
  - Status and impact: all three issues remain open, highlighting urgent demand for production readiness on AMD. Resolving them is critical to the deployment experience on AMD GPU clusters.
- Active PRs: kernel optimization, compatibility fixes, and quantization support
  - #38762 (rbrugaro-amd): fixes allreduce + rmsnorm fusion pattern matching for DeepSeek MoE models on ROCm by preprocessing away no-op view nodes so the pattern can match.
  - #38757 (mikaylagawarecki): continues migrating shared CUDA/ROCm kernels to the libtorch stable ABI, improving cross-version compatibility.
  - #38750 (gshtras): fixes a ROCm runtime import failure caused by a missing symbol.
  - #38774 (BowenBao): refactors the Quark MoE MXFP4 quantization path to run through an oracle and kernel backend, unifying the backend name as "AITER".
  - #38719 (xaguilar-amd): fixes corrupted Kimi-K2.5 output on AMD caused by KV cache corruption and metadata errors when the Aiter MLA FP8 persistent kernel is used with CUDA graphs.
- Closed issues: longstanding problems resolved
  - #35637: the AITER MoE TP4 error when running the Minimax M2.1 MXFP4 model on MI355, closed after PR #34285 was merged.
  - #35925: Qwen3.5-35B producing corrupted output (NaN-induced) with AITER enabled, now closed.
Takeaway: the AMD team is actively responding to the "experience parity" ask, shifting focus from basic enablement to performance, stability, and production hardening. Fixes around AITER kernels, quantization support, and compile compatibility were this period's priorities.
💬 High-Traffic Discussions
- RFC #38760: per-iteration forward pass metrics
  - Core topic: proposes adding detailed per-iteration forward-pass metrics at the engine level (prefill/decode request counts, KV cache depth, measured GPU time, etc.) to address the data loss, time skew, and missing history inherent in Prometheus's asynchronous pull model.
  - Positions:
    - Proposer (tedzhouhk): argues this is essential for orchestration systems (autoscalers, routers, disaggregated-serving planners) to build accurate cost models; existing metrics do not suffice.
    - Open considerations (implicit in the RFC details): the collection and export path must be low-overhead so it does not hurt engine performance, and the API surface (callbacks, lightweight RPC) needs design work.
  - Status: open for comments; an important infrastructure proposal for the ecosystem tooling layer.
- Issue #38257 (closed): Qwen3-VL-235B OOM on multi-image long conversations
  - Core topic: a very large vision-language model OOMs on multi-image, long multi-turn conversations.
  - Discussion:
    - User (cjackal): provided detailed reproduction steps.
    - Community (DarkLight1337): suggested ruling out CUDA graphs.
    - Maintainer (ywang96): suspected a memory-copy problem in the masked_scatter_ operation.
    - Resolution: PR #34246 replaces masked_scatter_ with direct index assignment, fixing the memory problem at its root. The user confirmed the PR works.
  - Conclusion: the root cause was masked_scatter_'s implementation under certain conditions in PyTorch versions before 2.9.0; the simplification in PR #34246 fixed it, reflecting continued memory-optimization work for large multimodal models.
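The semantics behind the PR #34246 fix can be illustrated with a minimal sketch. Plain Python lists stand in for torch tensors here (the real change operates on GPU tensors and the function names below are illustrative, not vLLM's code): `masked_scatter_` copies source values into the masked positions of a destination, while direct boolean-index assignment produces the same result without the problematic intermediate copies.

```python
# Sketch of masked-scatter vs. direct index assignment (plain Python
# lists stand in for torch tensors; names are illustrative only).

def masked_scatter(dest, mask, source):
    """Copy source values into dest wherever mask is True (old path)."""
    out = list(dest)
    it = iter(source)
    for i, m in enumerate(mask):
        if m:
            out[i] = next(it)
    return out

def direct_index_assign(dest, mask, source):
    """Same result via direct indexed assignment (new path):
    the equivalent of dest[mask] = source in torch terms."""
    out = list(dest)
    idxs = [i for i, m in enumerate(mask) if m]
    for i, v in zip(idxs, source):
        out[i] = v
    return out

text = [0, 0, 0, 0]                 # placeholder text-token embeddings
mask = [False, True, True, False]   # positions of image tokens
image = [7, 8]                      # image embeddings to splice in

assert masked_scatter(text, mask, image) == direct_index_assign(text, mask, image)
print(masked_scatter(text, mask, image))  # [0, 7, 8, 0]
```

Both paths produce the same spliced sequence; the win in the real PR is avoiding the extra memory traffic of the in-place scatter on large multimodal batches.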
🔥 Trending Topics
- Transformers v5 compatibility push: a series of sub-task issues tagged "Transformers v5" (e.g. #38734, #38736, #38740, #38735) aims to fix model-initialization failures caused by the HuggingFace Transformers v5 upgrade. Community members are actively claiming them, a sign the project is in a critical phase of tracking its upstream dependency.
- vLLM IR deepening: several issues/PRs (#38733, #38744, #38745, #38756) move activations, quantization, RoPE, and other ops onto the new vLLM IR (intermediate representation). Discussion covers kernel implementations, compile optimization, and an OOT (out-of-tree) platform migration guide, marking a new stage for the project's low-level abstractions and hardware independence.
- Disaggregated serving and heterogeneous inference: issue #38710 reports an accuracy problem in XPU (prefill) + CPU (decode) heterogeneous disaggregated serving, while #38692 concerns KV-transfer support in ROCm disaggregated serving, reflecting the industry's push to deploy complex model workflows efficiently across devices and nodes.
- Quantization performance and accuracy regressions: issues #38720 (FP8 markedly slower than BF16) and #38697 (W8A8 decoding slower than FP16) report quantization making inference slower rather than faster, while PR #38728 proposes converting NVFP4 weights to FP8 on Hopper for faster compute. Quantization's real-world payoff remains tightly coupled to the hardware and kernel implementation, and is still a key optimization target.
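For context on why quantization can lose to BF16/FP16 in practice: the matmul itself gets cheaper, but per-call overheads (online quantize/dequantize, scale handling) can dominate at small batch sizes when they are not fused into the kernel. A minimal pure-Python sketch of symmetric per-tensor int8 quantization shows the steps involved (illustrative only; real W8A8 kernels fuse these on the GPU):

```python
# Symmetric per-tensor int8 quantization round-trip (illustrative
# sketch; real W8A8 kernels fuse quantize/matmul/dequantize on GPU).

def quantize_int8(values):
    """Map floats to int8 with a single scale: q = round(v / scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats: v ≈ q * scale."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

# Round-trip error is bounded by scale/2 per element.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(w, w_hat))
```

Each of these extra passes over the data is memory traffic that the saved compute has to pay back, which is exactly the trade-off the issues above are probing.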
🛠️ Key Technical Changes
- PR #38760 (RFC): engine-level per-iteration metrics: a proposal that could reshape the monitoring and scheduling ecosystem. It aims to provide high-fidelity, time-synchronized batch cost data, a foundation for intelligent load balancing and resource scheduling.
- PR #38684 (merged): DeepSeek-V3.2 indexer fusion: fuses the indexer's WK and Weights_Proj projections, delivering a larger decode speedup than the previous overlap optimization (about 3% in specific tests); a good example of a low-overhead, model-specific optimization.
- PR #34246 (merged): simplified multimodal masking: leverages optimizations available after PyTorch 2.9.0 to replace the memory-problematic masked_scatter_ with direct index assignment, fixing the OOM that Qwen3-VL and similar large models hit in multi-image scenarios; a solid example of resolving a deep defect through a dependency upgrade and code simplification.
- PR #37831 (merged) and #38751 (revert): a showcase of strict review and CI. PR #37831 fixed the Qwen3Coder tool parser's handling of complex JSON Schemas (anyOf/oneOf), but introduced test failures after a merge conflict and was promptly reverted in #38751, reflecting the principle that main-branch stability comes first.
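The anyOf/oneOf problem that PR #37831 targeted is a common one for tool parsers: a nullable parameter is typically expressed as `anyOf: [{"type": "string"}, {"type": "null"}]`, and the parser must recover the concrete type. A minimal sketch of that resolution logic (a hypothetical helper, not vLLM's actual parser code):

```python
# Hypothetical sketch of resolving a concrete type from a JSON Schema
# anyOf/oneOf union containing "null" (the class of bug PR #37831
# addressed; this is NOT vLLM's actual parser code).

def resolve_type(schema):
    """Return the non-null type of a possibly-nullable schema, else None."""
    if "type" in schema:
        return schema["type"]
    for key in ("anyOf", "oneOf"):
        if key in schema:
            types = [resolve_type(s) for s in schema[key]]
            non_null = [t for t in types if t not in (None, "null")]
            if non_null:
                return non_null[0]  # first concrete alternative wins
    return None

nullable_str = {"anyOf": [{"type": "string"}, {"type": "null"}]}
assert resolve_type(nullable_str) == "string"
assert resolve_type({"type": "integer"}) == "integer"
```

Getting this wrong makes the parser emit malformed tool-call arguments for perfectly valid schemas, which is why the fix (and its careful re-landing after the revert) matters.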
📈 Development Activity Notes
- Contributors: the AMD team (*-amd accounts) stood out, with work spanning ROCm kernels, quantization, and CI. Many community developers (e.g. mateenali66, zeel2104) also claimed Transformers v5 compatibility tasks, showing healthy community participation.
- Review and merging: 30 PRs merged in a single day is a strong pace, but several PRs (e.g. #38762, #38766) are stuck in "needs-rebase", a sign of a fast-moving main branch that contributors must track closely. Some PRs' CI failures are unrelated to their own changes, reflecting test-environment complexity.
💡 Issues Worth Watching
- ROCm experience parity: issues #38692, #38693, #38687 point directly at AMD-platform gaps in production deployment; their resolution pace will significantly affect vLLM adoption in the AMD ecosystem.
- Transformers v5 migration: the batch of related good-first-issues is the current focus of the compatibility upgrade; completing it determines how smoothly the newest Transformers models can be used going forward.
- vLLM IR progress: discussion and execution of the related RFCs and tasks (#38744, #38733, etc.) will set the ceiling for vLLM's future kernel abstraction, cross-platform portability, and compile optimization.
- Per-iteration metrics RFC: the outcome of #38760 will directly shape what external schedulers and monitoring systems can do; ecosystem developers are encouraged to participate.
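One low-overhead shape such an API could take is a per-iteration callback carrying a small metrics record. This is purely a sketch under assumptions (the RFC has not settled on a design, and every name below is hypothetical, not vLLM's API):

```python
# Sketch of a per-iteration metrics callback, one possible shape for
# RFC #38760. All names here are hypothetical, not vLLM's API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class IterationMetrics:
    step: int                # engine iteration counter
    num_prefill_reqs: int    # requests in prefill this step
    num_decode_reqs: int     # requests in decode this step
    kv_cache_usage: float    # fraction of KV cache blocks in use
    gpu_time_ms: float       # measured forward-pass time

class MetricsBus:
    """Engine-side fan-out: callbacks must be cheap and non-blocking."""

    def __init__(self) -> None:
        self._subs: List[Callable[[IterationMetrics], None]] = []

    def subscribe(self, cb: Callable[[IterationMetrics], None]) -> None:
        self._subs.append(cb)

    def publish(self, m: IterationMetrics) -> None:
        for cb in self._subs:
            cb(m)  # a real impl would hand off to a queue, not call inline

history: List[IterationMetrics] = []
bus = MetricsBus()
bus.subscribe(history.append)
bus.publish(IterationMetrics(step=0, num_prefill_reqs=2,
                             num_decode_reqs=14, kv_cache_usage=0.37,
                             gpu_time_ms=8.5))
assert history[0].num_decode_reqs == 14
```

Unlike a Prometheus scrape, a push model like this delivers every iteration's record in order with engine-side timestamps, which is the property the RFC argues orchestrators need.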
📋 Appendix: Detailed Data
New Issues
- #38782 [vLLM IR] Op test & benchmark infra — vllm-ir — by ProExpertProg (created: 2026-04-02 11:07 (UTC+8))
- #38710 [Bug]: heterogeneous disaggregated serving XPU (Prefill) + CPU (Decode) accuracy issue — bug — by Spycsh (created: 2026-04-01 17:49 (UTC+8))
- #38772 [Feature]: General LL GEMMs with PDL Support — feature request — by robertgshaw2-redhat (created: 2026-04-02 10:02 (UTC+8))
- #38744 [RFC][vLLM IR]: Automatically compile native impl for IR ops — RFC,vllm-ir — by ProExpertProg (created: 2026-04-02 02:22 (UTC+8))
- #38776 [Feature]: support nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 for turing and ampere — feature request — by ir1ka (created: 2026-04-02 10:13 (UTC+8))
- #38745 [vLLM IR] Port QuantFP8 to IR op — vllm-ir — by ProExpertProg (created: 2026-04-02 02:23 (UTC+8))
- #38706 [Usage]: How to launch the Qwen3.5 service using vLLM on a V100 GPU — usage — by yangqinghao-cmss (created: 2026-04-01 16:44 (UTC+8))
- #38718 [Bug]: NVFP4 MoE produces garbage output on SM120 (RTX 5080) with CPU Weight Offloading — Nemotron-Cascade-2-30B-A3B — no labels — by lucaspirola (created: 2026-04-01 20:13 (UTC+8))
- #38765 [vLLM IR] OOT migration guide — no labels — by ProExpertProg (created: 2026-04-02 08:12 (UTC+8))
- #38760 [RFC]: Per-iteration forward pass metrics with accurate engine-level timing — RFC — by tedzhouhk (created: 2026-04-02 06:34 (UTC+8))
- #38756 [vLLM IR] Port RoPE ops to IR — vllm-ir — by ProExpertProg (created: 2026-04-02 05:51 (UTC+8))
- #38737 [Transformers v5] ColBERTJinaRobertaModel — help wanted,good first issue — by hmellor (created: 2026-04-01 23:58 (UTC+8))
- #38734 [Transformers v5] SarvamMLAForCausalLM — help wanted,good first issue — by hmellor (created: 2026-04-01 23:53 (UTC+8))
- #38736 [Transformers v5] Tarsier2ForConditionalGeneration — help wanted,good first issue — by hmellor (created: 2026-04-01 23:56 (UTC+8))
- #38754 [Bug]: GPT OSS Router GEMM Causing NaNs — bug — by benchislett (created: 2026-04-02 05:15 (UTC+8))
- #38692 [Bug]: parity with CUDA & parity with rocm sglang: vLLM router doesn’t current support MoRI kvcache connector — bug,rocm — by functionstackx (created: 2026-04-01 13:30 (UTC+8))
- #38729 [Bug] All models hang on GB300 (SM103) with FlashInfer 0.6.7 — no labels — by stecasta (created: 2026-04-01 23:31 (UTC+8))
- #38740 [Transformers v5] NemotronParseForConditionalGeneration — help wanted,good first issue — by hmellor (created: 2026-04-02 00:46 (UTC+8))
- #38735 [Transformers v5] Ernie4_5_VLMoeForConditionalGenerati — help wanted,good first issue — by hmellor (created: 2026-04-01 23:54 (UTC+8))
- #38733 [vLLM IR] Port activations to IR op — vllm-ir — by ProExpertProg (created: 2026-04-01 23:48 (UTC+8))
- #38738 [Bug]: Anthropic Messages API + Mistral model: “Invalid assistant message” on multi-turn tool calling — bug — by YoyoSailer (created: 2026-04-02 00:28 (UTC+8))
- #38725 [Proposal] Topology-Aware KV Cache Compression for Memory-Efficient Inference — performance — by jianxinglee62-prog (created: 2026-04-01 22:20 (UTC+8))
- #38693 [Feature]: Parity with CUDA: vLLM router should have ROCm CI — feature request,rocm — by functionstackx (created: 2026-04-01 13:43 (UTC+8))
- #38720 [Performance]: FP8 (Fp8OnlineLinearMethod) significantly slower than BF16 for ReplicatedLinear — performance — by zhangj1an (created: 2026-04-01 21:42 (UTC+8))
- #38717 [Bug]: Bench Serve encounter utf-8 UnicodeDecodeError — bug — by JaredforReal (created: 2026-04-01 19:48 (UTC+8))
- #38716 [Bug]: RuntimeError: failed to map GGUF parameters (18288) — bug — by ion-elgreco (created: 2026-04-01 19:21 (UTC+8))
- #38713 [Bug]: Error when trying to serve MiniMax 2.5 on 4 H100 nodes with 4 GPUS — bug — by F-Michelon (created: 2026-04-01 18:44 (UTC+8))
- #38700 [Bug]: vLLM fails to start with LMCache + Qwen3-Coder-Next-FP8 (nightly image) — bug — by Sanches166 (created: 2026-04-01 15:41 (UTC+8))
- #38697 [Performance]: llmcompressor W8A8 Inference: decoding stage speed is lower than FP16 — performance — by DarkenStar (created: 2026-04-01 14:52 (UTC+8))
- #38696 [Bug]: qwen3.5 when enable response_format json_schema outputs garbled spaces — bug — by Yyong25 (created: 2026-04-01 14:04 (UTC+8))
- #38694 [RFC]: O(1) KV Cache for vLLM: 4.8x Speedup & 22x More Accurate than TurboQuant on Qwen2.5-7B — feature request — by JEWONMOON (created: 2026-04-01 13:47 (UTC+8))
- #38687 [Bug]: parity with CUDA: ROCm nightly & release docker images aren’t built with Pollara AINIC or Broadcom Thor-2 NICs — bug,rocm — by functionstackx (created: 2026-04-01 12:42 (UTC+8))
Closed Issues
- #29053 [Feature]: AMD radeon 8060s rocm7.1 torch== 2.10.0.dev20251113+rocm7.1 — feature request,rocm,stale — by gqyalh (closed: 2026-04-02 11:22 (UTC+8))
- #36091 [RFC]: Add InstantTensor Support in vLLM — RFC — by arlo-scitix (closed: 2026-04-02 11:13 (UTC+8))
- #27991 [Bug]: Qwen3-VL-8B into AWQ with llm-compressor not recognized. — bug,stale — by LouisAI-DL (closed: 2026-04-02 10:17 (UTC+8))
- #28517 [Bug]: ValueError: There is no module or parameter named ‘mlp_AR’ in TransformersForCausalLM — bug,stale — by jiaohuix (closed: 2026-04-02 10:17 (UTC+8))
- #29632 [RFC]: Force EOS generation when Structured Output Grammar is terminated — RFC,stale — by Ki-Seki (closed: 2026-04-02 10:16 (UTC+8))
- #29722 [RFC]: Add Balance Scheduling — RFC,stale — by GDzhu01 (closed: 2026-04-02 10:16 (UTC+8))
- #29817 [Bug]: vllm with lmcache crashes on semi-large number of parallel queries — bug,stale — by kvcop (closed: 2026-04-02 10:16 (UTC+8))
- #29871 [Usage]: Extremly low token input speed for DeepSeek-R1-Distill-Llama-70B — usage,stale — by muelphil (closed: 2026-04-02 10:16 (UTC+8))
- #35925 [Bug]: Qwen3.5-35B-A3B generates corrupted responses when AITER is enabled — bug,rocm — by jennyyyyzhen (closed: 2026-04-02 09:04 (UTC+8))
- #35637 [Bug]: mi355 minimax m2.1 arch mxfp4 rocm AITER TP4 error — bug,rocm — by functionstackx (closed: 2026-04-02 09:01 (UTC+8))
- #37387 [Bug]: [SM_120 / Blackwell] AWQ working with awq_marlin + TRITON_ATTN — field report — bug — by Alkevas (closed: 2026-04-02 08:17 (UTC+8))
- #38666 [Bug]: Regression can no longer load Qwen 3.5 397B nvfp4 model - CUBLAS_STATUS_NOT_INITIALIZED — bug — by bitbottrap (closed: 2026-04-02 06:28 (UTC+8))
- #38729 [Bug] All models hang on GB300 (SM103) with FlashInfer 0.6.7 — no labels — by stecasta (closed: 2026-04-02 03:08 (UTC+8))
- #38634 [Bug]: MLA + FP8 KV cache + CUDA Graph causes random NaN in decode phase — bug — by wwwjs (closed: 2026-04-02 00:39 (UTC+8))
- #38720 [Performance]: FP8 (Fp8OnlineLinearMethod) significantly slower than BF16 for ReplicatedLinear — performance — by zhangj1an (closed: 2026-04-01 21:55 (UTC+8))
- #38257 [Bug]: Qwen3-VL-235B OOM with multi-image long multiturn inputs — bug — by cjackal (closed: 2026-04-01 16:19 (UTC+8))
- #36530 [Bug]: pip install 0.17 fails with CXXABI_1.3.15 not found — bug — by jlqibm (closed: 2026-04-01 13:31 (UTC+8))
New PRs
- #38780 [vLLM IR] register gemma_rms_norm ir kernel — no labels — by wxsIcey (created: 2026-04-02 11:00 (UTC+8))
- #38684 [Perf] DSV3.2 Indexer Fused Weights Projection — ready,deepseek — by benchislett (created: 2026-04-01 11:59 (UTC+8))
- #38781 [LongCat flash] Fix `ZeroExpertFusedMoE` missing `select_experts()` in router and MTP fix — no labels — by ekagra-ranjan (created: 2026-04-02 11:05 (UTC+8))
- #38783 [7/n] libtorch stable ABI — rocm,ci/build,nvidia — by mikaylagawarecki (created: 2026-04-02 11:17 (UTC+8))
- #38739 Fix multiline-format string for python 3.10 — ready — by ProExpertProg (created: 2026-04-02 00:38 (UTC+8))
- #38750 [ROCm][Bugfix] Fix ROCm runtime failure due to missing symbol — bug,rocm,ready — by gshtras (created: 2026-04-02 04:41 (UTC+8))
- #38712 [Bugfix][Core] Fix negative prompt token counter increments with external KV cache accounting — bug,v1 — by chenminghua8 (created: 2026-04-01 18:07 (UTC+8))
- #38779 [Transformers v5] Use os._exit() in EngineCoreProc to prevent third-party thread hangs — v1 — by Cursx (created: 2026-04-02 10:40 (UTC+8))
- #38775 [vLLM IR] 4/N Compile native implementation — vllm-ir — by ProExpertProg (created: 2026-04-02 10:12 (UTC+8))
- #38707 Zufang/ct mxfp8 — intel-gpu — by zufangzhu (created: 2026-04-01 16:50 (UTC+8))
- #38703 [ZenCPU] Changes with respect to docker build and relevant cpu tests — ci/build,cpu — by Chinmay-Kulkarni-AMD (created: 2026-04-01 15:51 (UTC+8))
- #38764 [Misc][LoRA] Add automerge weight merge for single-adapter LoRA serving — documentation,v1,nvidia — by bhoomit (created: 2026-04-02 07:55 (UTC+8))
- #38778 Revert “[Kernel] Add gpt-oss Router GEMM kernel (#37205)” — performance,ready,ci/build,gpt-oss — by xyang16 (created: 2026-04-02 10:18 (UTC+8))
- #38770 [CPU] Support gelu act in cpu_fused_moe — ready,v1,cpu — by bigPYJ1151 (created: 2026-04-02 09:45 (UTC+8))
- #38777 Deprecate `ExpertsInt8Config` (fused int8 moe quantiation) — ci/build,v1,cpu — by namgyu-youn (created: 2026-04-02 10:17 (UTC+8))
- #38709 [Core][Metrics] Remove `vllm:prompt_tokens_recomputed` metric — ready,v1,kv-connector — by markmc (created: 2026-04-01 17:07 (UTC+8))
- #38769 Deprecate FP-Quant Support — v1 — by namgyu-youn (created: 2026-04-02 09:43 (UTC+8))
- #38774 [ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle — rocm — by BowenBao (created: 2026-04-02 10:03 (UTC+8))
- #38773 Revert “[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang” (#38730) — bug,documentation,nvidia — by vllm-agent (created: 2026-04-02 10:03 (UTC+8))
- #38771 [Bugfix] Fix MLA kv_b_proj activation dtype with Marlin FP8 — bug — by jacobzhang22 (created: 2026-04-02 09:50 (UTC+8))
- #38768 [Bugfix] install failed on macbook(cpu) — bug,cpu — by jesse996 (created: 2026-04-02 09:24 (UTC+8))
- #38767 [Transformers v5] Add SarvamMLAConfig to fix SarvamMLAForCausalLM (#38734) — no labels — by Zelys-DFKH (created: 2026-04-02 08:51 (UTC+8))
- #38766 [ROCM][Bug fix] fix aiter asm atten on hybrid models — bug,rocm,needs-rebase,v1 — by yuankaichen-amd (created: 2026-04-02 08:38 (UTC+8))
- #38742 [BugFix] Handle numpy scalar types in MsgpackEncoder — bug,ready,v1 — by yaochengji (created: 2026-04-02 01:56 (UTC+8))
- #38761 fix hang with pause and collectives — v1 — by hao-aaron (created: 2026-04-02 06:41 (UTC+8))
- #38762 [ROCm] Fix rocm allreduce rmsnorm fusion for Deepseek models — rocm,needs-rebase,ci/build,deepseek,nvidia — by rbrugaro-amd (created: 2026-04-02 06:51 (UTC+8))
- #38758 [Model Runner V2] Add config validation for not-yet-supported features — ready,ci/build — by njhill (created: 2026-04-02 06:13 (UTC+8))
- #38763 only patch runtime_env for torch >= 2.10 — no labels — by Rohan138 (created: 2026-04-02 07:03 (UTC+8))
- #38759 [BugFix] Fix precommit breakage due to conflicting in-flight merges — bug,ready,v1 — by njhill (created: 2026-04-02 06:30 (UTC+8))
- #38751 Revert “[Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params (#37831)” — bug,ready,tool-calling,qwen — by khluu (created: 2026-04-02 04:44 (UTC+8))
- #38746 [Bug] Add e_score_correction_bias to SKIP_TENSORS — bug,ready — by hao-aaron (created: 2026-04-02 02:56 (UTC+8))
- #38757 [6/n] Migrate some shared CUDA/RoCM kernels to libtorch stable ABI — rocm,ci/build,nvidia — by mikaylagawarecki (created: 2026-04-02 05:55 (UTC+8))
- #38701 [Refactor] Merge duplicate checks and error handling in Executor — needs-rebase,v1 — by idouba (created: 2026-04-01 15:47 (UTC+8))
- #38755 [Parser] Migrate response api streaming to unified parser — frontend — by sfeng33 (created: 2026-04-02 05:19 (UTC+8))
- #38753 [Bench] update chat template in gsm8k_eval.py — no labels — by carlyou (created: 2026-04-02 05:11 (UTC+8))
- #38752 [Core] Use boxed_return in split_module for tuple-conformant subgraphs — needs-rebase — by frgossen (created: 2026-04-02 04:52 (UTC+8))
- #38749 Skip reasoning parsing when using continue_final_message — frontend — by hsiehjackson (created: 2026-04-02 04:28 (UTC+8))
- #38748 [Transformers v5] Fix NemotronParse image_size tuple unpack — no labels — by mateenali66 (created: 2026-04-02 03:47 (UTC+8))
- #38715 [Bugfix] Fix intra-step KV block corruption from stale prefix cache hits — bug,v1 — by KrxGu (created: 2026-04-01 19:15 (UTC+8))
- #38747 [Transformers v5] Fix Ernie4_5_VLMoeForConditionalGeneration rope_theta config — no labels — by mateenali66 (created: 2026-04-02 03:10 (UTC+8))
- #38730 [Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang — bug,documentation,ready,nvidia — by stecasta (created: 2026-04-01 23:36 (UTC+8))
- #38685 [ROCm][CI] Remove soft_fail from AMD Docker Image Build — rocm,ci/build — by micah-wil (created: 2026-04-01 12:19 (UTC+8))
- #38711 Fix invalid logprobs with MTP enabled and sync scheduling — ready,v1 — by danisereb (created: 2026-04-01 18:01 (UTC+8))
- #38705 [Bugfix] Fix TypeError in response_input_to_harmony when assistant content is None — bug,frontend,gpt-oss — by dubin555 (created: 2026-04-01 16:24 (UTC+8))
- #38689 ROCm sometimes compiles problematically on torch.log on MI325 — rocm,llama — by Concurrensee (created: 2026-04-01 13:05 (UTC+8))
- #38743 [Kernel] [Helion] Use warning_once in get_gpu_name to prevent log spam — ready — by gmagogsfm (created: 2026-04-02 02:16 (UTC+8))
- #38699 [Bugfix] Correct mistake in chained comparison in static assert logic — bug,cpu — by KyleMylonakisProtopia (created: 2026-04-01 15:39 (UTC+8))
- #38686 fix(lora): use float32 intermediate buffer in fused MoE LoRA to prevent bf16 precision loss — no labels — by prsabahrami (created: 2026-04-01 12:25 (UTC+8))
- #38741 [Doc] Add Scheduler section to V1 architecture overview — documentation,intel-gpu — by CodersAcademy006 (created: 2026-04-02 00:58 (UTC+8))
- #38732 [Bugfix] Fix bench_serve UTF-8 decode crash on split multi-byte chars — bug,performance — by he-yufeng (created: 2026-04-01 23:47 (UTC+8))
- #38683 [Quantization] Rename mxfp4 quant layer and oracle to gpt_oss_mxfp4 — needs-rebase,gpt-oss — by zyongye (created: 2026-04-01 11:58 (UTC+8))
- #38726 [Bugfix][Core] Fix stuck chunked pipeline parallelism with async scheduling — bug,v1 — by starkwj (created: 2026-04-01 22:27 (UTC+8))
- #38690 [WIP][Do not merge yet] Update flash-attention to latest upstream FA4 — ready,ci/build,nvidia — by LucasWilkinson (created: 2026-04-01 13:19 (UTC+8))
- #38731 [Bugfix] Harden DP handshake port against non-engine traffic — bug,v1 — by he-yufeng (created: 2026-04-01 23:43 (UTC+8))
- #38714 Add ibm-granite/granite-vision-3.3-2b to supported models documentation — documentation,ready — by jesus-talavera-ibm (created: 2026-04-01 18:46 (UTC+8))
- #38728 [Quantization] Convert NVFP4 weights to FP8 on Hopper for faster inference — no labels — by Tib-Gridello (created: 2026-04-01 22:47 (UTC+8))
- #38722 [Misc] Fix docstring typo: buildin -> builtin — frontend,gpt-oss — by crawfordxx (created: 2026-04-01 22:18 (UTC+8))
- #38727 nano-nemotron-vl: get_mm_max_tokens_per_item for audio, video, image == seq_len — no labels — by netanel-haber (created: 2026-04-01 22:36 (UTC+8))
- #38723 Fix shape comment in extract_hidden_states example — documentation,ready — by fynnsu (created: 2026-04-01 22:18 (UTC+8))
- #38724 [Misc] Fix typos in source code comments — v1,kv-connector — by crawfordxx (created: 2026-04-01 22:20 (UTC+8))
- #38721 [Misc] Fix typos in test comments — v1,kv-connector — by crawfordxx (created: 2026-04-01 22:18 (UTC+8))
- #38688 [Renderer] Enforce token-only inputs for LLMEngine and AsyncLLM — documentation,frontend,ready,ci/build,v1,multi-modality — by DarkLight1337 (created: 2026-04-01 12:42 (UTC+8))
- #38719 Fix Kimi-K2.5 accuracy when Aiter MLA FP8 PS + CUDA graphs are used — rocm,v1,nvidia — by xaguilar-amd (created: 2026-04-01 21:03 (UTC+8))
- #38704 [ROCm][perf] Use workspace manager for sparse indexer allocations — rocm,v1 — by gronsti-amd (created: 2026-04-01 16:15 (UTC+8))
- #38708 Add `verified` label to trigger `pre-commit` — ready,ci/build — by hmellor (created: 2026-04-01 16:58 (UTC+8))
- #38698 [MRV2][KVConnector] Fix missing build_connector_worker_meta — ready,v1 — by ivanium (created: 2026-04-01 15:26 (UTC+8))
- #38702 [OPT] Optimize the fused moe triton kernel routing expert accumulation — no labels — by BJWang-ant (created: 2026-04-01 15:48 (UTC+8))
- #38695 [Bugfix] Support [TOOL_CALLS] single-token format in Jamba tool parser — bug,documentation,tool-calling — by oromanenko-nv (created: 2026-04-01 14:02 (UTC+8))
Merged PRs
- #38684 [Perf] DSV3.2 Indexer Fused Weights Projection — ready,deepseek — by benchislett (merged: 2026-04-02 11:34 (UTC+8))
- #38739 Fix multiline-format string for python 3.10 — ready — by ProExpertProg (merged: 2026-04-02 11:19 (UTC+8))
- #38676 [CPU] Support head_size 512 in cpu_attn — documentation,ready,v1,cpu — by bigPYJ1151 (merged: 2026-04-01 13:42 (UTC+8))
- #38759 [BugFix] Fix precommit breakage due to conflicting in-flight merges — bug,ready,v1 — by njhill (merged: 2026-04-02 06:35 (UTC+8))
- #38751 Revert “[Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params (#37831)” — bug,ready,tool-calling,qwen — by khluu (merged: 2026-04-02 06:34 (UTC+8))
- #38673 [Bugfix] Preserve original ImportError in gRPC server entrypoint — bug,frontend,ready — by CatherineSue (merged: 2026-04-02 06:16 (UTC+8))
- #36836 [Feat][Executor] Introduce RayExecutorV2 — ready,ci/build,v1 — by jeffreywang-anyscale (merged: 2026-04-02 05:34 (UTC+8))
- #38644 [Refactor] Simplify FutureWrapper in MultiprocExecutor — ready,v1 — by yzong-rh (merged: 2026-04-02 05:28 (UTC+8))
- #32996 Feature/silu block quant fusion v1 — documentation,performance,new-model,rocm,structured-output,frontend,intel-gpu,speculative-decoding,ready,ci/build — by Monishver11 (merged: 2026-04-02 02:50 (UTC+8))
- #37831 [Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params — bug,ready,tool-calling,qwen — by AAISSJ (merged: 2026-04-01 20:22 (UTC+8))
- #38730 [Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang — bug,documentation,ready,nvidia — by stecasta (merged: 2026-04-02 03:08 (UTC+8))
- #38573 [Compile] Fix nvfp4 compile warning — ready,ci/build — by yewentao256 (merged: 2026-04-02 02:28 (UTC+8))
- #38242 [Misc] Rename think_start_str/think_end_str to reasoning_start_str/reasoning_end_str — documentation,ready,v1 — by chaunceyjiang (merged: 2026-04-02 00:56 (UTC+8))
- #34664 [Kernel] Add MXFP8 to Marlin GEMM/MoE and refactor Mxfp8LinearOp — performance,ready,nvidia,quantization — by mgoin (merged: 2026-04-02 00:41 (UTC+8))
- #38714 Add ibm-granite/granite-vision-3.3-2b to supported models documentation — documentation,ready — by jesus-talavera-ibm (merged: 2026-04-01 23:22 (UTC+8))
- #37940 [NIXL][BUG] Fix Triton heterogeneous TP — bug,ready,v1,kv-connector — by yzong-rh (merged: 2026-04-01 23:23 (UTC+8))
- #38722 [Misc] Fix docstring typo: buildin -> builtin — frontend,gpt-oss — by crawfordxx (merged: 2026-04-01 22:39 (UTC+8))
- #38723 Fix shape comment in extract_hidden_states example — documentation,ready — by fynnsu (merged: 2026-04-01 22:29 (UTC+8))
- #35153 [MoE Refactor] Make SharedExperts class for use with DefaultMoERunner — nvidia,ready-run-all-tests — by bnellnm (merged: 2026-04-01 21:44 (UTC+8))
- #38359 [Bugfix] Revert “Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding” — bug,ready,v1,nvidia — by elvircrn (merged: 2026-04-01 21:11 (UTC+8))
- #38659 [1/N][Cleanup] Standardize on use of `is_quantized_kv_cache` — rocm,intel-gpu,ready,v1,cpu,nvidia — by MatthewBonanni (merged: 2026-04-01 12:08 (UTC+8))
- #38179 [KVTransfer] Fix TpKVTopology.is_kv_replicated equality case — ready,kv-connector — by JianDan0212 (merged: 2026-04-01 18:41 (UTC+8))
- #38636 (security) Enforce frame limit in VideoMediaIO — ready,multi-modality — by jperezdealgaba (merged: 2026-04-01 18:23 (UTC+8))
- #38708 Add `verified` label to trigger `pre-commit` — ready,ci/build — by hmellor (merged: 2026-04-01 17:31 (UTC+8))
- #37948 [Perf] triton bilinear_pos_embed kernel for ViT — performance,ready,multi-modality,qwen,nvidia — by zhandaz (merged: 2026-04-01 16:52 (UTC+8))
- #34246 [Core] Simplify multimodal masking — ready,v1,qwen — by lgeiger (merged: 2026-04-01 16:18 (UTC+8))
- #36178 [Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking — bug,ready,v1 — by LucasWilkinson (merged: 2026-04-01 12:15 (UTC+8))
- #38649 [Bugfix] Lazy import diskcache to avoid sqlite3/libstdc++ ImportError at startup — bug,structured-output,ready,v1 — by jeffreywang-anyscale (merged: 2026-04-01 13:31 (UTC+8))
- #38617 [bugfix] do not add extra linebreak for score/rerank with chat template — bug,frontend,ready — by staugust (merged: 2026-04-01 12:50 (UTC+8))
- #38559 [Perf] Optimize mean pooling using chunks and index_add, 5.9% E2E throughput improvement — ready — by yewentao256 (merged: 2026-04-01 11:54 (UTC+8))
PRs Closed Without Merging
- #38742 [BugFix] Handle numpy scalar types in MsgpackEncoder — bug,ready,v1 — by yaochengji (closed: 2026-04-02 08:09 (UTC+8))
- #37491 [Build] Update CUTLASS revision from v4.2.1 to v4.4.2 — ready,ci/build,nvidia — by meena-at-work (closed: 2026-04-02 03:28 (UTC+8))
- #25098 [Core] Make Whisper work with b200 + flashinfer — documentation,ready,v1,nvidia — by russellb (closed: 2026-04-02 03:22 (UTC+8))
- #36291 add testing for `Qwen3.5-27B-FP8` to GSM8K eval — v1,qwen,deepseek — by puririshi98 (closed: 2026-04-02 01:56 (UTC+8))
- #38686 fix(lora): use float32 intermediate buffer in fused MoE LoRA to prevent bf16 precision loss — no labels — by prsabahrami (closed: 2026-04-02 01:12 (UTC+8))
- #38461 Fixed issues — multi-modality — by rpathade (closed: 2026-04-01 23:40 (UTC+8))
- #35229 [Perf] Fused silu_mul_fp8_quant_packed kernel for SM100 DeepGEMM MoE — no labels — by mgoin (closed: 2026-04-01 23:08 (UTC+8))
- #37815 [MLAAttention] Clear Cudagraph padded region of FI decode Attention kernel — ready,v1,nvidia — by varun-sundar-rabindranath (closed: 2026-04-01 22:07 (UTC+8))
- #37192 WIP: [Feature] KVCACHE NVFP4 — documentation,needs-rebase,v1,quantization — by JartX (closed: 2026-04-01 16:41 (UTC+8))
- #30676 [WIP]dynamic port to solve port confict — stale,v1 — by 1643661061leo (closed: 2026-04-01 13:07 (UTC+8))