vLLM Development Activity Report - 2026-03-27

Time window: 2026-03-27 11:31 (UTC+8) to 2026-03-28 11:31 (UTC+8)
Statistics: 18 new issues | 11 closed issues | 83 new PRs | 27 merged PRs | 29 PRs closed without merging

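📊 Daily Development Summary

During this window (2026-03-27 to 2026-03-28), the vLLM project remained under intensive development, with 83 new PRs, 18 new issues, and 27 merged PRs. Work centered on performance optimization and bug fixes for the AMD/ROCm ecosystem, the large-scale compatibility adjustments required by the Transformers v5 upgrade, and continued improvements to multimodal features and cache management. Community collaboration was active: several contributors claimed the "good first issue" tasks split out of the Transformers v5 migration.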

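🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was heavy in this window, concentrated on performance optimization, build-system fixes, and documentation updates.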

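Issues (performance and crash reports):
- #38406 [Bug]: vllm 0.18 kimi k2.5 way worse than h200 single node: user functionstackx, who frequently files AMD-related reports, notes that even with VLLM_ROCM_USE_AITER=1 enabled, Kimi K2.5 on MI325X still performs significantly worse than on NVIDIA H200. The issue has drawn attention from AMD staff (@chunfangamd, @andyluo7, @hongxiayang) and the community, and is a current focus of ROCm performance tuning.
- #38303 [Bug]: minimax nvfp4 model crash: the issue body is truncated in the source data. Judging by the title alone, it involves an NVFP4-quantized model; a connection to AMD's Quark quantization toolchain is possible but unconfirmed. Per the appendix, this issue was closed during the window (2026-03-28 10:28 (UTC+8)) and is not among the newly created issues.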
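New and merged PRs:
- #38367 (merged) [ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips: submitted by AMD employee hongxiayang; updates the official documentation with instructions for using the ROCm nightly Docker image, which makes getting started easier.
- #38365 [ROCm] patch benchmark_moe: fixes the benchmark_moe.py script for compatibility with ray==2.54.0, which no longer uses the ROCR_VISIBLE_DEVICES environment variable.
- #38346 [ROCM] Optmize redudent d2d copy of moe.: optimizes the output path of the MoE kernel when running MiniMax M2.5 on MI355, eliminating redundant device-to-device (D2D) copies. It is an example of deep optimization for a specific AMD GPU and model.
- #38337 [ROCm][Build] Fix pip install detection when build isolation installs CUDA PyTorch: fixes incorrect platform detection when building with pip on ROCm systems, caused by the build-isolation environment installing the CUDA build of PyTorch.
- #38334 [ROCm] Use Triton attention fallback for ViT to avoid SDPA OOM: adds a Triton fallback path to the Vision Transformer attention backend, resolving OOM when running multimodal models on ROCm devices without efficient SDPA support (such as gfx906).
- #38413 [ROCm] [Release] Update ROCm variant from rocm700 to rocm721: bumps the base image version from ROCm 7.0.0 to 7.2.1, as a follow-up to #38252.
- #38371 Enable building MoRI with AMD AINIC stack: updates the ROCm base image build so that MoRI can be built with the optional AMD AINIC (Pensando) NIC stack, targeting disaggregated serving and KV offloading scenarios.
- #38396 [AMD][CI] Update DeepEP branch: updates the DeepEP branch so that it precompiles correctly for gfx942/gfx950, and moves the test cases to MI325.
- #38415 [ROCm][CI] Fix UV install in Dockerfile.rocm to detect curl failures and retry: fixes the UV install step in the Docker build so that it detects curl failures and retries.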
Other related PRs:
- #38317 [ROCm][CI] Enable hybrid chunked prefill test: enables the hybrid chunked prefill test in ROCm CI by removing the previous CUDA-only skip marker.
- #38381 [ROCm][CI] Pin test_hybrid test to TRITON_ATTN on ROCm: pins the test_hybrid.py test to the TRITON_ATTN backend on ROCm to reduce flakiness caused by batching variance.

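Analysis: AMD contributors (identified in this report by the -amd username suffix) invested heavily in this window. Their work covered performance tuning (#38406, #38346), build-system robustness (#38337, #38415), CI/CD (#38317, #38381), and user experience (#38367). The attention to performance on recent Instinct GPUs (MI325, MI355) and the support for new hardware (AINIC) indicate that AMD is working toward a complete, high-performance software ecosystem on vLLM.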
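
To see why the SDPA path addressed by #38334 can run out of memory on a ViT, it helps to estimate the size of the attention score matrix that a non-fused implementation materializes. The following is a back-of-the-envelope sketch only: the head count, sequence lengths, and fp16 element size are illustrative assumptions, not values taken from the PR.

```python
def attention_scores_bytes(seq_len, num_heads, batch=1, bytes_per_elem=2):
    """Bytes needed to hold the full (batch, heads, N, N) attention score matrix."""
    return batch * num_heads * seq_len * seq_len * bytes_per_elem


# Assumed values for illustration: 16 heads, fp16 scores, batch size 1.
for n in (1024, 4096, 16384):
    gib = attention_scores_bytes(n, num_heads=16) / 2**30
    print(f"N={n:>6}: {gib:.2f} GiB for the score matrix alone")
```

Because the cost grows with N², quadrupling the number of image patches multiplies this buffer by 16. A fused attention kernel computes the same result in tiles without ever storing the full matrix, which is the motivation for a Triton fallback.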

💬 High-Activity Discussions

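Issue #38374: IPC update_weights (checkpoint format): hot-swapped weights can diverge from cold load
- Core topic: after model weights are hot-updated in checkpoint format over IPC (CUDA inter-process communication), inference results differ from those of a server cold-started from the target checkpoint.
- Views and progress: submitter kimihailv provided detailed reproduction steps and a result comparison showing that the hot-swapped model behaves more like the original model than the target model. The issue was closed about 14 minutes after it was opened (00:19 to 00:33 (UTC+8)), which suggests either a quickly identified bug or one already addressed by another PR; the source data does not say which.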
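Issue #38375: IndexError when --renderer-num-workers + --mm-processor-cache-type shm
- Core topic: running multiple multimodal renderer workers (--renderer-num-workers > 1) is incompatible with the shared-memory multimodal cache (--mm-processor-cache-type shm) and causes an IndexError under high concurrency.
- Views and debate: scyyh11 quickly identified the root cause: the SHM cache design assumes a single writer, while multiple renderer workers write concurrently. They proposed a temporary workaround (--renderer-num-workers=1) and asked @DarkLight1337 whether the fix to the originating PR (#34789) should drop the renderer_num_workers option or repair the SHM cache path itself. The debate is a trade-off between restricting the feature and changing the design.
- Status: awaiting a maintainer decision.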
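
The single-writer assumption can be illustrated with a toy example. This is not vLLM's SHM cache: `SingleWriterCache` and `SerializedCache` are hypothetical classes written for this sketch, and serializing writers behind a lock is only one conceivable way of repairing the cache path, not the fix chosen by the maintainers.

```python
import threading


class SingleWriterCache:
    """Toy slot cache that, like the design described above, assumes one writer."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.next_free = 0

    def put(self, item):
        idx = self.next_free      # read the write cursor
        self.slots[idx] = item    # IndexError if the cursor ran past capacity
        self.next_free = idx + 1  # advance; not atomic with the read above
        return idx


class SerializedCache:
    """Hypothetical wrapper that funnels every writer through one lock."""

    def __init__(self, inner):
        self.inner = inner
        self._lock = threading.Lock()

    def put(self, item):
        with self._lock:
            return self.inner.put(item)


NUM_WORKERS, ITEMS_PER_WORKER = 4, 1000
cache = SerializedCache(SingleWriterCache(NUM_WORKERS * ITEMS_PER_WORKER))


def worker(worker_id):
    for i in range(ITEMS_PER_WORKER):
        cache.put(f"{worker_id}:{i}")


threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

filled = sum(slot is not None for slot in cache.inner.slots)
print(filled, cache.inner.next_free)
```

Without the lock, two threads can read the same cursor value and overwrite each other's slot; whether that shows up in a given run depends on thread scheduling. The alternative raised in the issue, limiting the system to one renderer worker, removes the concurrent writers instead of synchronizing them.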
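Issue #38382: [Transformers v5] Mistral multimodal models
- Core topic: after the upgrade to Transformers v5.4, Mistral multimodal models (Pixtral, Voxtral) hit an IndexError because of a processor refactor.
- Views and progress: hmellor noted that the problem stems from PR #38018 having been developed against v5.3, while v5.4 refactored the processors. allgather volunteered to take the issue. This shows how a major upstream dependency update ripples through downstream projects.
- Status: allgather has submitted fix PR #38410.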
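Issue #38387: [Transformers v5] HCXVisionForCausalLM
- Core topic: the Transformers v5 upgrade breaks initialization of the HCXVisionForCausalLM model (AttributeError: 'HCXVisionConfig' object has no attribute 'text_config').
- Views and progress: HanFa pointed to a Transformers PR as a possible cause. hmellor suggested that the defect may instead lie in the model's own config code and encouraged HanFa to attempt a fix. This reflects the community's collaborative approach to resolving upstream changes.
- Status: claimed by HanFa.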
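PR #38315: [Perf][GDN] Fuse kkt + solve_tril kernels in chunk forward
- Core topic: fuse the kkt and solve_tril kernels for Gated Delta Network (GDN) to improve prefill performance.
- Views and debate: the PR reports a significant TTFT improvement for Qwen3.5-397B on H200. However, contributor Nepherpitou reported corrupted output (!!!!!!!!!!!!!) when testing a Qwen 122B AWQ model on 4x3090. Author ZJY0516 replied that they could not reproduce the problem and asked for more information. The open question is whether the change is numerically correct and how well it generalizes across hardware and models.
- Status: the PR has merge conflicts; it is pending conflict resolution and further correctness validation.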

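🔥 Hot Topics and Trends

Transformers v5 upgrade wave: this is the most prominent trend of the window. hmellor created an umbrella issue (#38379) and split out multiple sub-issues (such as #38386, #38387, and #38389), which were claimed by community contributors (kasha01, HanFa, amitmodi, and others). The problems span tokenizer configuration, model initialization, and processor compatibility, making this a systematic engineering migration.
Multimodal features deepen and expose problems: several issues (#38375, #38382, #38385) and PRs (#38330, #38405) revolve around multimodal support, touching cache concurrency, processor compatibility, and serialization protocols. As multimodal features see broader and more complex use, the robustness and extensibility of the underlying architecture are being tested.
Continued performance optimization: deep optimization PRs targeting specific hardware (AMD MI355, NVIDIA Blackwell) and model architectures (GDN, MLA, MoE) keep arriving (for example #38346, #38354, #38315), reflecting vLLM's focus on inference efficiency.
Stronger AMD ecosystem support: as noted above, AMD-related fixes, optimizations, and documentation updates were dense in this window, showing that the AMD team is systematically improving vLLM's usability and performance on its hardware.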

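🛠️ Key Technical Changes

PR #38346 [ROCM] Optmize redudent d2d copy of moe.: this PR optimizes the MoE kernel output path on MI355, eliminating two unnecessary device-to-device (D2D) copies. It is a narrowly targeted optimization for specific hardware (MI355) and a specific model (MiniMax M2.5); removing the copies should improve end-to-end inference throughput and shows the value of platform-specific tuning.
PR #38334 [ROCm] Use Triton attention fallback for ViT to avoid SDPA OOM: this change adds a Triton fallback path for ViT attention on ROCm devices, avoiding the OOM caused by the PyTorch SDPA "math" backend allocating the full N² attention matrix. This matters for running multimodal models stably on AMD GPUs.
PR #38337 [ROCm][Build] Fix pip install detection…: this fix resolves a subtle but high-impact problem when building from source on ROCm systems. It improves the developer experience and build reliability, and is part of hardening the support infrastructure for AMD platforms.
PR #34789 (merged) [Bugfix] Offload blocking tokenizer ops to shared thread pool: this PR offloads blocking tokenizer and multimodal preprocessing operations to a shared thread pool, fixing management endpoints (such as /health) that became unresponsive when the event loop was blocked under high concurrency. It is an important improvement to server stability and observability.
PR #37975 (merged) [Model] Extract GatedDeltaNetAttention into shared layer: refactors the GDN layer implementations in Qwen3Next and Qwen3.5 into a standalone shared module. This improves code reuse and maintainability, and lays the groundwork for custom operator support on other platforms (such as XPU and NPU).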
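
The pattern behind PR #34789 above, moving blocking work off the event loop, can be sketched with the Python standard library. This is a minimal illustration under assumptions, not vLLM's implementation: `blocking_tokenize`, `health`, the pool size, and the simulated 0.2 s delay are all hypothetical stand-ins.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shared pool; the worker count is an arbitrary choice for this sketch.
_POOL = ThreadPoolExecutor(max_workers=4)


def blocking_tokenize(text):
    """Stand-in for a blocking tokenizer or preprocessing call."""
    time.sleep(0.2)  # simulate work that would otherwise stall the event loop
    return text.split()


async def tokenize(text):
    # Run the blocking call in the shared pool so the event loop stays free.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_POOL, blocking_tokenize, text)


async def health():
    """Stand-in for a lightweight management endpoint."""
    return "ok"


async def main():
    task = asyncio.create_task(tokenize("hello vllm world"))
    await asyncio.sleep(0)  # yield once so the tokenize task gets submitted
    start = time.perf_counter()
    status = await health()
    health_latency = time.perf_counter() - start
    tokens = await task
    return status, tokens, health_latency


status, tokens, health_latency = asyncio.run(main())
print(status, tokens, f"health answered in {health_latency * 1000:.2f} ms")
```

If `blocking_tokenize` were called directly inside a coroutine, the loop could not run any other coroutine until it returned, so `health` would wait behind it; with `run_in_executor` the health check is served while tokenization is still in progress.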

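📈 Development Activity Observations

High-intensity development: 83 new PRs in a single day reflect very high development activity and community participation.
Efficient collaboration and merging: several PRs were merged within hours of being opened (for example #38322 in under 3 hours and #38367 in under 9 hours), indicating an efficient review and merge process among core maintainers.
Community-driven work at scale: by splitting the Transformers v5 migration into sub-issues and applying the good first issue label, the project attracted several new contributors (kasha01, dharun4772, guanwei-wu, and others), a good example of community coordination.
Deep AMD involvement: AMD staff not only submit code but also engage with users in issues (such as #38406), forming a healthy feedback loop.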

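💡 Issues to Watch

AMD MI325X performance gap: the problem reported in Issue #38406, where Kimi K2.5 on MI325X performs far worse than on H200, remains unresolved. It may be a key performance issue for the competitiveness of AMD data-center GPUs and needs sustained attention from the AMD team.
Ripple effects of the Transformers v5 upgrade: although the community has stepped in to fix them, the compatibility problems exposed by this upgrade are broad and deep. It is worth watching for further hidden problems and for any impact on the stability of models deployed in production.
Tension between multimodal caching and concurrent preprocessing: Issue #38375 exposes a design conflict between the current multimodal cache and concurrent preprocessing. The eventual fix (restricting the feature or reworking the cache) will affect the high-concurrency capability of multimodal serving.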

📋 Appendix: Detailed Data Lists

New Issues

#38386 [Transformers v5] Base model and LoRA used in test has incorrect tokenizer_config.json — help wanted,good first issue — by hmellor (created: 2026-03-28 02:43 (UTC+8))
#38384 [Transformers v5] Distributed shutdown test timtout — help wanted,good first issue — by hmellor (created: 2026-03-28 02:36 (UTC+8))
#38385 [Transformers v5] MiniCPMV cannot apply processor — help wanted,good first issue — by hmellor (created: 2026-03-28 02:40 (UTC+8))
#38411 [Bug]: Vision encoder crashes on SM100 (Jetson Thor) — _vllm_fa2_C compiled for SM80-only, no override available for vision encoder — bug — by mrjbj (created: 2026-03-28 10:06 (UTC+8))
#38406 [Bug]: vllm 0.18 kimi k2.5 way worse than h200 single node — bug,rocm — by functionstackx (created: 2026-03-28 07:34 (UTC+8))
#38382 [Transformers v5] Mistral multimodal models — help wanted,good first issue — by hmellor (created: 2026-03-28 02:30 (UTC+8))
#38375 [Bug]: IndexError when --renderer-num-workers + --mm-processor-cache-type shm — bug — by cjackal (created: 2026-03-28 00:22 (UTC+8))
#38387 [Transformers v5] HCXVisionForCausalLM — help wanted,good first issue — by hmellor (created: 2026-03-28 02:52 (UTC+8))
#38389 [Transformers v5] IsaacForConditionalGeneration — help wanted,good first issue — by hmellor (created: 2026-03-28 02:55 (UTC+8))
#38379 Upgrade to Transformers v5 — help wanted — by hmellor (created: 2026-03-28 02:19 (UTC+8))
#38339 [Bug] Step-3.5-Flash MTP Speculative Decoding Has Extremely Low Acceptance Rate (2.4%-4.6%) — bug,speculative-decoding — by elinx (created: 2026-03-27 15:49 (UTC+8))
#38376 [Bug]: glm 4.7 fp8 crashes (Worker_TP3 pid=457501) ERROR 03-27 17:11:15 [multiproc_executor.py:852] AttributeError: '_OpNamespace' '_C' object has no attribute 'per_token_group_fp8_quant' — bug — by koush (created: 2026-03-28 01:33 (UTC+8))
#38348 [Bug]: chunked prefill can't be disabled. maybe due to race condition in scheduler — bug — by Dingjifeng (created: 2026-03-27 17:22 (UTC+8))
#38374 [Bug]: IPC update_weights (checkpoint format): hot-swapped weights can diverge from cold load of target checkpoint — bug — by kimihailv (created: 2026-03-28 00:19 (UTC+8))
#38351 [Bug]: When use_audio_in_video is enabled in qwen3-omni, the output may exhibit issues such as empty or repetitive output. — bug — by wwxs123W (created: 2026-03-27 18:32 (UTC+8))
#38349 [Bug]: 使用swift rollout启动vllm，推理结果乱码 (garbled inference output when vLLM is launched via swift rollout) — bug — by Aristotle-wfh (created: 2026-03-27 17:50 (UTC+8))
#38347 [BUG]: Python API stuck after long time running — usage — by sweihub (created: 2026-03-27 17:13 (UTC+8))
#38331 [Usage]: for the tpye: UniformTypeKVCacheSpecs , the calc of num_blocks = available_memory // kv_cache_groups[0].kv_cache_spec.page_size_bytes, no need to divide num_layers ? — usage — by yangshanjun (created: 2026-03-27 15:05 (UTC+8))

Closed Issues

#38303 [Bug]: minimax nvfp4 model crash — bug — by functionstackx (closed: 2026-03-28 10:28 (UTC+8))
#4763 [Feature]: Support W4A8KV4 Quantization(QServe/QoQ) — feature request,stale — by bratao (closed: 2026-03-28 10:14 (UTC+8))
#38304 [Feature]: vllm.ai docs should should instructions for rocm nightly docker & rocm nightly wheel — feature request,rocm — by functionstackx (closed: 2026-03-28 09:22 (UTC+8))
#33560 [Bug]: lm-eval shows significant accuracy differences on RedHatAI/Qwen3-8B-NVFP4 model (Turing vs. Ampere) — bug — by ir1ka (closed: 2026-03-28 07:36 (UTC+8))
#38348 [Bug]: chunked prefill can't be disabled. maybe due to race condition in scheduler — bug — by Dingjifeng (closed: 2026-03-28 00:43 (UTC+8))
#38374 [Bug]: IPC update_weights (checkpoint format): hot-swapped weights can diverge from cold load of target checkpoint — bug — by kimihailv (closed: 2026-03-28 00:33 (UTC+8))
#32932 [Bug][cpu][arm]: Failure case for BF16 dispatch on non-bf16 supported arm HW — bug,cpu — by gassan-arm (closed: 2026-03-27 17:45 (UTC+8))
#38347 [BUG]: Python API stuck after long time running — usage — by sweihub (closed: 2026-03-27 17:43 (UTC+8))
#38245 [Bug]: Responses API text.format.type="json_schema" leaks schema_ in non-stream responses and breaks streaming — bug — by noobHappylife (closed: 2026-03-27 13:58 (UTC+8))
#38201 [Feature]: Support TurboQuant KV quant — feature request — by NiuBlibing (closed: 2026-03-27 13:21 (UTC+8))
#31450 [Feature]: vLLM should apply to Docker Open Source Program for removing image pull limits — feature request — by ehfd (closed: 2026-03-27 13:08 (UTC+8))

New PRs

#38373 [torch.compile]: Disable Sequence Parallelism (SP) for piecewise compilation — no labels — by SouthWest7 (created: 2026-03-27 23:51 (UTC+8))
#38413 [ROCm] [Release] Update ROCm variant from rocm700 to rocm721 — rocm,ready,ci/build — by tjtanaa (created: 2026-03-28 10:27 (UTC+8))
#38414 [Test] Fix flaky race condition in test_abort_final_step — v1 — by yzong-rh (created: 2026-03-28 10:57 (UTC+8))
#38415 [ROCm][CI] Fix UV install in Dockerfile.rocm to detect curl failures and retry — rocm,ci/build — by AndreasKaratzas (created: 2026-03-28 10:58 (UTC+8))
#38412 docs: Add build guide for NVIDIA Blackwell consumer GPUs (RTX 50-series) on Windows/WSL2 — documentation,nvidia — by nguyennhuanhle (created: 2026-03-28 10:21 (UTC+8))
#38410 [Transformers v5] fix missing pixtral/voxtral multimodal dispatch — no labels — by allgather (created: 2026-03-28 09:59 (UTC+8))
#38403 [CI][Eval] Lower Nemotron-3-Super-120B-A12B-BF16 GSM8K accuracy threshold to 0.91 — ready — by SandishKumarHN (created: 2026-03-28 06:23 (UTC+8))
#38365 [ROCm] patch benchmark_moe — performance,rocm — by big-yellow-duck (created: 2026-03-27 22:28 (UTC+8))
#38364 [Bugfix]Fix incorrect KV cache log output — bug,v1 — by WangErXiao (created: 2026-03-27 22:13 (UTC+8))
#38408 Log KV cache GiB usage and warn when max_num_seqs exceeds capacity — v1 — by ashishkamra (created: 2026-03-28 07:58 (UTC+8))
#38409 [Compile] Replace split_module with lightweight vllm-specific graph splitter — no labels — by fxdawnn (created: 2026-03-28 08:10 (UTC+8))
#38395 Fix/mla kv offloading — v1,kv-connector — by rarepepi (created: 2026-03-28 04:59 (UTC+8))
#38407 [Tests] Fix Transformers Nightly CI: use check_max_version=False in test_transformers.py — no labels — by SandishKumarHN (created: 2026-03-28 07:36 (UTC+8))
#38325 [Kernel] Add swapAB support for SM120 CUTLASS blockwise FP8 GEMM — performance,ready,nvidia — by Nekofish-L (created: 2026-03-27 14:11 (UTC+8))
#38399 [WIP][Model Runner V2] Implement flash sampling for the draft model — v1 — by TheEpicDolphin (created: 2026-03-28 05:42 (UTC+8))
#38367 [ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips — documentation,rocm,ready — by hongxiayang (created: 2026-03-27 22:45 (UTC+8))
#38404 [Bugfix] Fix platform detection crash when vllm is not installed as a package — bug — by SandishKumarHN (created: 2026-03-28 06:25 (UTC+8))
#38402 [Tests][Bugfix] Fix race condition in test_abort_final_step.py — bug,v1 — by SandishKumarHN (created: 2026-03-28 06:19 (UTC+8))
#38400 [CI/Build] Fix GH200 build: add --torch-only flag to use_existing_torch.py to preserve torchvision dependency — ci/build — by SandishKumarHN (created: 2026-03-28 06:06 (UTC+8))
#38405 [Frontend] Add multimodal support to /inference/v1/generate endpoint — documentation,frontend,kv-connector — by nithinvc (created: 2026-03-28 07:04 (UTC+8))
#38401 [Tests] Fix Transformers Nightly CI: use check_max_version=False in test_transformers.py — no labels — by SandishKumarHN (created: 2026-03-28 06:15 (UTC+8))
#38360 [compile] Bug fix for _decompose_size_nodes — bug,ready,ready-run-all-tests — by anijain2305 (created: 2026-03-27 21:57 (UTC+8))
#38397 feat: Added Dockerfile for ppc64le CUDA build — ci/build,nvidia — by Okoyl (created: 2026-03-28 05:18 (UTC+8))
#38398 [Bugfix] Restore prepare_fp8_layer_for_marlin removed by merge conflict resolution — bug,ready — by vadiklyutiy (created: 2026-03-28 05:28 (UTC+8))
#38394 [Chore] Remove unused logits_process.py — ready — by WoosukKwon (created: 2026-03-28 04:58 (UTC+8))
#38361 [GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill — needs-rebase,v1 — by arpera (created: 2026-03-27 21:58 (UTC+8))
#38396 [AMD][CI] Update DeepEP branch — rocm,ci/build — by rjrock (created: 2026-03-28 05:04 (UTC+8))
#38369 [CI] Skip failing test — ready,multi-modality — by NickLucche (created: 2026-03-27 22:57 (UTC+8))
#38378 [Feature] Kvcache per token — documentation,v1 — by JartX (created: 2026-03-28 02:10 (UTC+8))
#38393 [BugFix]: use base address for zero-token experts in CUTLASS grouped GEMM to avoid over by 1 — bug,nvidia — by LucasWilkinson (created: 2026-03-28 04:16 (UTC+8))
#38381 [ROCm][CI] Pin test_hybrid test to TRITON_ATTN on ROCm — rocm — by micah-wil (created: 2026-03-28 02:29 (UTC+8))
#38329 [MoE] Add RoutingMethodType.Simulated to TRT-LLM FP8/NVFP4 kernel allowlists — ready,nvidia — by jaewonlee-fb (created: 2026-03-27 14:41 (UTC+8))
#38392 [Bugfix] Fix dead-code validation bypass in structured_outputs + named tool_choice — bug,frontend,tool-calling — by yzong-rh (created: 2026-03-28 03:30 (UTC+8))
#38390 [Model Runner v2] E/P/D disaggregation support — ready,v1,kv-connector — by yewentao256 (created: 2026-03-28 03:24 (UTC+8))
#38391 [CI Bugfix] Pre-download missing FlashInfer headers in Docker build — bug,ready,ci/build,ci-failure — by mgoin (created: 2026-03-28 03:28 (UTC+8))
#38358 Revert "[CI] Add batch invariant test for b200" (#38014) — ci/build — by zhewenl (created: 2026-03-27 20:06 (UTC+8))
#38370 [CI] Fix BatchedGemmEnums.h not found CI issue — ready,ci/build,nvidia — by yewentao256 (created: 2026-03-27 22:57 (UTC+8))
#38380 Add short flag -sc for --speculative-config argument — no labels — by mgoin (created: 2026-03-28 02:22 (UTC+8))
#38388 [Multimodal] Fix nested_tensors_equal: add length check for lists and tuple support — multi-modality — by khairulkabir1661 (created: 2026-03-28 02:55 (UTC+8))
#38383 [Refactor] Remove dead code in kv connector and model runner — intel-gpu,ready,v1,cpu,kv-connector — by yewentao256 (created: 2026-03-28 02:33 (UTC+8))
#38362 [BugFix][Frontend] apply task instruction as system prompt in cohere v2/embed — bug,frontend — by walterbm (created: 2026-03-27 22:08 (UTC+8))
#38366 [Bug Fix][CPU Profiling] Add CPU profiler summary file output on rank 0 — bug — by Elm8116 (created: 2026-03-27 22:42 (UTC+8))
#38318 fix(harmony): run render_for_completion in thread pool to unblock event loop — frontend,gpt-oss — by ztimothy96 (created: 2026-03-27 11:58 (UTC+8))
#38377 [Bugfix]: fix blind access to per_token_group_fp8_quant causing GLM 4.7 crash on RTX 6000 pro — bug — by koush (created: 2026-03-28 01:34 (UTC+8))
#38321 [ROCm] Fix ROCM_ATTN cross-attention for beam search encoder-decoder models — rocm,intel-gpu,ready,v1 — by AndreasKaratzas (created: 2026-03-27 13:06 (UTC+8))
#38317 [ROCm][CI] Enable hybrid chunked prefill test — rocm,ready,ci/build,v1 — by AndreasKaratzas (created: 2026-03-27 11:54 (UTC+8))
#38355 feat: add TQMediaConnector plugin for TransferQueue image data support — multi-modality — by RobotGF (created: 2026-03-27 19:58 (UTC+8))
#38372 [Hybrid] Simplify accepted token counting in spec decode for hybrid models — v1 — by fuscof-ibm (created: 2026-03-27 23:46 (UTC+8))
#38357 Revert "[Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell" (#38083) — bug,qwen — by zhewenl (created: 2026-03-27 20:06 (UTC+8))
#38350 Remove need for explicit \n in docstring lists for --help formatting — ready — by hmellor (created: 2026-03-27 18:09 (UTC+8))
#38371 Enable building MoRI with AMD AINIC stack — rocm,ci/build — by ichbinblau (created: 2026-03-27 23:05 (UTC+8))
#38346 [ROCM] Optmize redudent d2d copy of moe. — rocm — by benenzhu (created: 2026-03-27 17:11 (UTC+8))
#38368 [FlashLinearAttention] reduce recompilations by removing unused triton kernel inputs — no labels — by lgeiger (created: 2026-03-27 22:54 (UTC+8))
#38363 [quantization] Further qr — no labels — by lixixicommute (created: 2026-03-27 22:09 (UTC+8))
#38359 [Bugfix] Revert "Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding" — bug,ready,v1,nvidia — by elvircrn (created: 2026-03-27 20:18 (UTC+8))
#38322 [CI/Build] Move nightly wheel index generation to a single post-build step — ready,ci/build — by Harry-Chen (created: 2026-03-27 13:22 (UTC+8))
#38356 [CI] don't merge modify cases XPU — intel-gpu,ci/build — by wendyliu235 (created: 2026-03-27 20:03 (UTC+8))
#38345 [CI] don't mergify label Intel — intel-gpu,ci/build — by wendyliu235 (created: 2026-03-27 17:06 (UTC+8))
#38353 [Bugfix] Fix Kimi K2 tool parser dropping content before tool section — bug,tool-calling — by felixmr1 (created: 2026-03-27 19:49 (UTC+8))
#38354 [Performance] MLA decode pre-attention fused kernels for SM100 — ci/build,v1,deepseek,nvidia — by elvircrn (created: 2026-03-27 19:55 (UTC+8))
#38352 [Bugfix] Fix int64 expert IDs in routing simulator crashing flashinfer all2all — bug,nvidia — by elvircrn (created: 2026-03-27 19:37 (UTC+8))
#38342 [XPU] transpose moe weights — intel-gpu — by mayuyuace (created: 2026-03-27 16:38 (UTC+8))
#38332 MiniMax-M2: add Eagle3 speculative decoding support — new-model — by liuchenbing2026 (created: 2026-03-27 15:14 (UTC+8))
#38315 [Perf][GDN] Fuse kkt + solve_tril kernels in chunk forward — no labels — by ZJY0516 (created: 2026-03-27 11:35 (UTC+8))
#38344 [CI] don't merge for very — ci/build — by wendyliu235 (created: 2026-03-27 16:56 (UTC+8))
#38341 [XPU] fix lora uts — intel-gpu,gpt-oss — by chaojun-zhang (created: 2026-03-27 16:16 (UTC+8))
#38343 [Model] Sync upstream BT=chunk_size fix for GDN chunk_fwd_kernel_o, simplify warmup to single pass — no labels — by AuYang261 (created: 2026-03-27 16:47 (UTC+8))
#38338 [Kernel] Add Triton reference implementation of NVFP4 GEMM — performance — by Drogon4231 (created: 2026-03-27 15:35 (UTC+8))
#38340 [Doc] fix selector/label mismatch in helm chart — documentation — by utsumi-fj (created: 2026-03-27 16:07 (UTC+8))
#38328 [Doc] Clarify Helm chart location in deployment guide — documentation,ready — by utsumi-fj (created: 2026-03-27 14:20 (UTC+8))
#38337 [ROCm][Build] Fix pip install detection when build isolation installs CUDA PyTorch — rocm,ci/build,nvidia — by westers (created: 2026-03-27 15:33 (UTC+8))
#38336 [Bugfix] Detect missing shard files in quantized checkpoints — bug — by westers (created: 2026-03-27 15:31 (UTC+8))
#38335 [ROCm][CI] Fix spawn test decorator silently skipping tests without main — rocm — by westers (created: 2026-03-27 15:30 (UTC+8))
#38334 [ROCm] Use Triton attention fallback for ViT to avoid SDPA OOM — rocm — by westers (created: 2026-03-27 15:21 (UTC+8))
#38333 feat(grpc): add periodic stats logging and servicer log forwarding — frontend — by CatherineSue (created: 2026-03-27 15:19 (UTC+8))
#38330 [Core] Add score encoder cache manager — v1 — by hotTea123 (created: 2026-03-27 14:56 (UTC+8))
#38326 [Bugfix] Fix N-gram proposer JIT warmup that never triggers compilation — bug,speculative-decoding,v1 — by redwrasse (created: 2026-03-27 14:15 (UTC+8))
#38327 [cpu] Fix chained comparisons in static_assert breaking macOS builds — cpu — by openasocket (created: 2026-03-27 14:17 (UTC+8))
#38324 [Bugfix] Preserve torch profiler summary output on CPU — bug,documentation,v1,cpu — by jacobzhang22 (created: 2026-03-27 14:06 (UTC+8))
#38316 [XPU] Add per-channel quantized model in compressed-tensors — intel-gpu — by yma11 (created: 2026-03-27 11:43 (UTC+8))
#38320 [CI] Add xpu auto-label rule for Intel GPU/XPU PRs — ready,ci/build — by wendyliu235 (created: 2026-03-27 12:32 (UTC+8))
#38323 [Bugfix] Fix Helm chart Deployment using hardcoded labels — bug,documentation — by simpx (created: 2026-03-27 13:42 (UTC+8))
#38319 [Security] Fix CVE-2026-4944: respect trust_remote_code in NemotronVL and KimiK25 — no labels — by Wernerina (created: 2026-03-27 12:25 (UTC+8))

Merged PRs

#38252 [ROCm][CI/Build] ROCm 7.2.1 release version; torch 2.10; triton 3.6 — rocm,ready,ci/build — by gshtras (merged: 2026-03-28 07:03 (UTC+8))
#38126 [NVIDIA] Fix DGX Spark logic — ready,ci/build,nvidia — by johnnynunez (merged: 2026-03-28 06:26 (UTC+8))
#33972 [Bugfix]fix output Nan/Inf in marlin if dtype=float16 — bug,ready — by ir1ka (merged: 2026-03-28 07:36 (UTC+8))
#37695 [Perf] Use torch compile to fuse pack topk in trtllm moe — performance,ready,nvidia — by wzhao18 (merged: 2026-03-28 07:30 (UTC+8))
#31201 Add nvidia h800 moe config — stale,nvidia — by lengrongfu (merged: 2026-03-28 07:28 (UTC+8))
#34789 [Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop — bug,ready,v1 — by scyyh11 (merged: 2026-03-27 13:17 (UTC+8))
#38367 [ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips — documentation,rocm,ready — by hongxiayang (merged: 2026-03-28 07:20 (UTC+8))
#38311 [Model Runner V2] Rebuild attention metadata before eagle decode full… — ready,v1 — by TheEpicDolphin (merged: 2026-03-28 04:46 (UTC+8))
#38369 [CI] Skip failing test — ready,multi-modality — by NickLucche (merged: 2026-03-28 04:25 (UTC+8))
#38032 [QeRL] Compose online quantization with quantized reloading — ready,quantization — by kylesayrs (merged: 2026-03-28 04:22 (UTC+8))
#38380 Add short flag -sc for --speculative-config argument — no labels — by mgoin (merged: 2026-03-28 03:04 (UTC+8))
#37453 [ROCm] Fix GPT-OSS import for triton 3.6 — rocm,ready,gpt-oss — by gshtras (merged: 2026-03-28 02:00 (UTC+8))
#38043 {ROCm]: gpt-oss fusion/padding fixes — rocm,ready,ci/build,gpt-oss — by Rohan138 (merged: 2026-03-28 00:19 (UTC+8))
#38350 Remove need for explicit \n in docstring lists for --help formatting — ready — by hmellor (merged: 2026-03-27 23:38 (UTC+8))
#38262 [frontend] dump openai responses type by alias — frontend,ready — by cjackal (merged: 2026-03-27 13:58 (UTC+8))
#34520 [EPLB] Cleanup the transfer logic for the various eplb maps — ready,ci/build — by SageMoore (merged: 2026-03-27 17:18 (UTC+8))
#33695 enable skipping of SW attention layers when using FP8 KV cache — ready,quantization — by jmkuebler (merged: 2026-03-27 21:25 (UTC+8))
#37952 fix(security): Add VLLM_MAX_N_SEQUENCES environment variable and enforce limit — documentation,ready — by jperezdealgaba (merged: 2026-03-27 21:02 (UTC+8))
#38322 [CI/Build] Move nightly wheel index generation to a single post-build step — ready,ci/build — by Harry-Chen (merged: 2026-03-27 15:44 (UTC+8))
#36946 [P/D] Mooncake: Add unit tests and minor fixes for mooncake connector — ready,v1,kv-connector — by dtcccc (merged: 2026-03-27 16:26 (UTC+8))
#38328 [Doc] Clarify Helm chart location in deployment guide — documentation,ready — by utsumi-fj (merged: 2026-03-27 15:43 (UTC+8))
#38168 [Bugfix] Fix Hermes tool parser when stream interval > 1 — bug,ready,tool-calling — by sfeng33 (merged: 2026-03-27 14:42 (UTC+8))
#34285 [Refactor] Move FusedMoE hidden_size roundup to quant_method — rocm,ready,ci/build,gpt-oss,nvidia — by BowenBao (merged: 2026-03-27 14:38 (UTC+8))
#38320 [CI] Add xpu auto-label rule for Intel GPU/XPU PRs — ready,ci/build — by wendyliu235 (merged: 2026-03-27 14:22 (UTC+8))
#38219 [CPU] Support CT W4A16 on CPU MP kernel — ready,cpu — by bigPYJ1151 (merged: 2026-03-27 14:15 (UTC+8))
#37975 [Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 — ready,qwen — by wxsIcey (merged: 2026-03-27 14:13 (UTC+8))
#37853 [kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models — ready,v1,kv-connector — by orozery (merged: 2026-03-27 13:38 (UTC+8))

PRs Closed Without Merging

#37789 [Perf] Fuse prefill conv output split into Q/K/V to reduce rearrange overhead — needs-rebase,qwen — by ZJY0516 (closed: 2026-03-28 10:50 (UTC+8))
#38037 Fix IndexError in streaming tool calls when max_tokens is hit — frontend,ready — by joaquinhuigomez (closed: 2026-03-28 07:58 (UTC+8))
#35458 [Frontend] Add multimodal support to /inference/v1/generate endpoint — documentation,performance,new-model,rocm,structured-output,frontend,intel-gpu,speculative-decoding,needs-rebase,ci/build — by nithinvc (closed: 2026-03-28 06:59 (UTC+8))
#38401 [Tests] Fix Transformers Nightly CI: use check_max_version=False in test_transformers.py — no labels — by SandishKumarHN (closed: 2026-03-28 06:32 (UTC+8))
#38398 [Bugfix] Restore prepare_fp8_layer_for_marlin removed by merge conflict resolution — bug,ready — by vadiklyutiy (closed: 2026-03-28 05:45 (UTC+8))
#38394 [Chore] Remove unused logits_process.py — ready — by WoosukKwon (closed: 2026-03-28 05:36 (UTC+8))
#37950 [Bugfix] Restore stats logging for multi-server mode — bug,frontend,v1 — by khairulkabir1661 (closed: 2026-03-28 05:28 (UTC+8))
#34647 [ROCm] Add hardware detection for FP4 BMM to prevent MI300X crashes — rocm — by khairulkabir1661 (closed: 2026-03-28 05:28 (UTC+8))
#38285 [AMD][Build] Test DeepEP offload — rocm,ci/build — by rjrock (closed: 2026-03-28 05:05 (UTC+8))
#38370 [CI] Fix BatchedGemmEnums.h not found CI issue — ready,ci/build,nvidia — by yewentao256 (closed: 2026-03-28 03:39 (UTC+8))
#34756 [ROCm] [Nightly Docker Release] nightly rocm docker — rocm,needs-rebase,ci/build — by hongxiayang (closed: 2026-03-28 03:21 (UTC+8))
#38268 [Bugfix] Remove dead name-only regex fallback from DeepSeek V3.1 tool parser — bug,tool-calling,deepseek — by yzong-rh (closed: 2026-03-27 22:59 (UTC+8))
#38283 [Bugfix] Remove dead unicode_escape decoding from 4 tool parsers — bug,needs-rebase,tool-calling — by yzong-rh (closed: 2026-03-27 22:59 (UTC+8))
#37770 [Quant] Consolidate GPTQ: remove gptq.py, rename gptq_marlin.py to gptq.py — rocm,v1,ready-run-all-tests — by robertgshaw2-redhat (closed: 2026-03-27 22:28 (UTC+8))
#25159 [Rocm] [quantization] support quark wmxfp4 for gptoss — rocm,needs-rebase,gpt-oss — by haoyangli0109 (closed: 2026-03-27 21:52 (UTC+8))
#38356 [CI] don't merge modify cases XPU — intel-gpu,ci/build — by wendyliu235 (closed: 2026-03-27 20:16 (UTC+8))
#38345 [CI] don't mergify label Intel — intel-gpu,ci/build — by wendyliu235 (closed: 2026-03-27 20:16 (UTC+8))
#38332 MiniMax-M2: add Eagle3 speculative decoding support — new-model — by liuchenbing2026 (closed: 2026-03-27 19:16 (UTC+8))
#38344 [CI] don't merge for very — ci/build — by wendyliu235 (closed: 2026-03-27 17:05 (UTC+8))
#30377 adding constraint updates of cos-sin to improve mrope performance — needs-rebase,stale — by wujinyuan1 (closed: 2026-03-27 15:54 (UTC+8))
#38181 [ROCm] Fix cp_mha_gather_cache for strided KV views — rocm,ready,needs-rebase,v1 — by yuttian1 (closed: 2026-03-27 15:09 (UTC+8))
#31069 Fix LoRA prefix cache corruption by using lora_int_id — rocm,ci/build,stale,v1 — by westers (closed: 2026-03-27 15:05 (UTC+8))
#31072 [ROCm][Test] Skip RTN quantization tests on ROCm — documentation,rocm,stale,v1 — by westers (closed: 2026-03-27 15:03 (UTC+8))
#31067 [Bugfix] Fix incorrect tensor parallel size in Ray executor warning — bug,v1 — by westers (closed: 2026-03-27 14:57 (UTC+8))
#31077 Fix ROCm CUDA graph replay synchronization bug (issue #29521) — bug,documentation,rocm,needs-rebase,ci/build,v1,nvidia — by westers (closed: 2026-03-27 14:57 (UTC+8))
#38323 [Bugfix] Fix Helm chart Deployment using hardcoded labels — bug,documentation — by simpx (closed: 2026-03-27 13:44 (UTC+8))
#37959 [Bugfix] Fix Helm chart Deployment using hardcoded labels instead of chart.labels — bug,documentation — by simpx (closed: 2026-03-27 13:43 (UTC+8))
#38319 [Security] Fix CVE-2026-4944: respect trust_remote_code in NemotronVL and KimiK25 — no labels — by Wernerina (closed: 2026-03-27 12:25 (UTC+8))
#33774 add support for telechat3 model — new-model — by 1096125073 (closed: 2026-03-27 11:41 (UTC+8))