vLLM Development Activity Report - 2026-03-28

Time window: 2026-03-28 11:50 (UTC+8) ~ 2026-03-29 11:50 (UTC+8)
Statistics: new Issues 6 | closed Issues 17 | new PRs 29 | merged PRs 12 | PRs closed without merging 4

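📊 Daily Development Summary

Between March 28 and 29, the vLLM project was highly active, handling 23 Issues (6 new, 17 closed) and 41 PRs (29 new, 12 merged). Development focused on fixing bugs (in quantization, model loading, and the scheduler, among others) and on improving the CI/CD pipeline. In preparation for the upcoming Transformers v5, the community also started several model adaptation efforts and continued to improve support for AMD platforms.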

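🎯 AMD/ROCm Ecosystem Updates

AMD ecosystem development and fixes were one focus of this period, spanning several key Issues and PRs.

Core issues resolved:
- Issue #36337 (closed): User Slonegg reported that the Kimi-K2.5-MXFP4 model produced gibberish output on MI350X (gfx950) with ROCm 7.2. AMD employee andyluo7 helped investigate and confirmed that the official vllm/vllm-openai-rocm:v0.17.0 image with VLLM_ROCM_USE_AITER=1 works correctly. PR #36422, submitted by ChuanLi1101, fixed a boolean logic error that prevented emulation mode from being enabled. The issue exposed compatibility details in adapting new hardware (MI350X) to the software stack (ROCm 7.2).
New improvements and fixes:
- PR #38434: Improves ROCm platform detection in WSL (Windows Subsystem for Linux) environments. When the usual amd-smi and rocm-smi tools are unavailable, detection falls back to checking torch.version.hip, which makes ROCm setups in WSL easier to identify reliably.
- PR #38444 (new): A major change that rewrites the AMD CI test runner entirely in Python (run-amd-test.py), replacing the original bash script. The new runner addresses three key problems: 1) it correctly captures the real exit code of the test container (avoiding false successes caused by atexit hooks); 2) it fully cleans up GPU state leaked across jobs; 3) it provides more detailed diagnostics. This is expected to significantly improve the stability and maintainability of AMD CI.
- PR #38450 (new): Fixes incorrect cross-attention computation on encoder-decoder models (such as Whisper and BART) when using the ROCM_ATTN and ROCM_AITER_FA backends. The fix temporarily removes the ENCODER_DECODER attention type from the supported list of both backends so that the dispatcher automatically selects a correct backend (such as TRITON_ATTN), which resolves the mismatch between single-beam beam search and greedy decoding results.
- PR #38413 (merged): Updates the default ROCm variant in the Docker build from rocm700 to rocm721 to support a newer ROCm release.
- PR #38108 (merged): Fixes an invalid device ordinal crash, caused by an incorrect device index, when running the MoE auto-tuning script with the Ray backend on AMD multi-GPU systems.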
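
The detection fallback described for PR #38434 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name looks_like_rocm and its injectable parameters are hypothetical and exist only to make the sketch testable. Only shutil.which and the torch.version.hip attribute are real.

```python
import shutil
from typing import Callable, Optional


def looks_like_rocm(
    which: Callable[[str], Optional[str]] = shutil.which,
    hip_version: Optional[str] = None,
) -> bool:
    """Hypothetical helper: True if the environment appears to be ROCm."""
    # First choice: the usual command-line tools are on PATH.
    if which("amd-smi") or which("rocm-smi"):
        return True
    # Fallback (for example under WSL, where the tools may be missing):
    # a ROCm build of PyTorch reports a HIP version string.
    if hip_version is None:
        try:
            import torch

            hip_version = getattr(torch.version, "hip", None)
        except ImportError:
            hip_version = None
    return bool(hip_version)
```

Passing hip_version explicitly skips the PyTorch import, which keeps the check usable on machines without PyTorch installed.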

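💬 Analysis of Most-Discussed Threads

Issue #36337: Kimi-K2.5-MXFP4 produces gibberish output on MI350X
- Core topic: A user running an official AMD quantized model on new MI350X hardware got unreadable output.
- Viewpoints:
  - User (Slonegg): Described the symptoms in detail, tried several vLLM versions and configurations, and suspected a software-stack compatibility problem.
  - AMD employee (andyluo7): Provided a configuration verified to work on MI350X (using the official Docker image) and helped with the investigation.
  - Contributor (ChuanLi1101): Submitted the code fix and identified faulty emulation-mode detection logic as the root cause.
- Discussion focus: The root cause was that, under a specific combination of conditions (MX hardware plus a particular quantization scheme), faulty logic skipped the high-precision emulation fallback path that should have been enabled and went straight to native kernels that may be incompatible.
- Conclusion: The problem was resolved by using the correct Docker image and the follow-up code fix, an example of collaborative debugging during new-hardware bring-up.
Issue #38441: Kimi-k2 tool parser regex problem
- Core topic: When streaming tool calls, the Kimi K2.5 model occasionally inserts a newline between `` and the function name, so the existing regular expression fails to match and the tool call is silently dropped.
- Viewpoints:
  - Reporter (bigBrain1901): Analyzed the root cause (in Python, . does not match \n by default) and proposed a fix (using \s* and re.DOTALL).
  - Community discussion: The Issue led directly to PR #38443. In the PR comments, contributors refined the regular expression further, changing .+ to \S.+ so that the match starts at the first non-whitespace character, and removed a redundant .strip() call to make the logic clearer.
- Points of contention: No major disagreement; the discussion mainly refined the technical approach.
- Conclusion: PR #38443 has been submitted and fixes the parser bug in a more robust way.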
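
The regex behavior behind Issue #38441 can be reproduced with a small sketch. The marker `<TOOL>` and the ID `functions.get_weather:0` are hypothetical placeholders (the report does not show the real token), and the two patterns only illustrate the `.` versus `\s*` / `\S.+` difference; they are not the parser's actual expressions.

```python
import re

MARKER = "<TOOL>"  # hypothetical placeholder; the real token is not shown in the report

# Original-style pattern: "." does not match "\n" by default,
# so a newline right after the marker prevents any match.
strict = re.compile(re.escape(MARKER) + r"(?P<tool_id>.+:\d+)")

# Tolerant pattern: skip optional whitespace, then require the ID
# to start at the first non-whitespace character.
tolerant = re.compile(re.escape(MARKER) + r"\s*(?P<tool_id>\S.+:\d+)")

clean = "<TOOL>functions.get_weather:0"
noisy = "<TOOL>\nfunctions.get_weather:0"

strict_clean = strict.search(clean)      # matches
strict_noisy = strict.search(noisy)      # None: the call would be dropped
tolerant_noisy = tolerant.search(noisy)  # matches despite the newline
```

Because `\S` anchors the capture at the first non-whitespace character, the captured ID needs no later `.strip()` call.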
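PR #38444: New Python runner for AMD CI
- Core topic: Whether and how to rewrite the AMD CI bash test runner in Python to resolve long-standing stability problems.
- Discussion focus: The PR itself reads as a large explanatory document laying out why the rewrite is needed and how the new architecture works. It prompted a discussion about whether the change should be extended to the Dockerfiles of other platforms (CUDA, CPU); see the exchange with tjtanaa in the comments. Author AndreasKaratzas noted that the other platforms' Dockerfiles have the same latent problem (network install failures being silently ignored), but that this fix mainly targets the specific pain points hit by AMD CI.
- Conclusion: The PR is labeled ready as a targeted improvement for AMD CI-specific problems. It also offers a template that other platforms can reference.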
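
One way a runner could guard against a misleading process exit code is to cross-check it against the JUnit report. The sketch below is an assumption made for illustration, not the logic of run-amd-test.py: the helper final_exit_code and its policy (trust the report when the process claims success) are hypothetical.

```python
import xml.etree.ElementTree as ET


def final_exit_code(process_returncode: int, junit_xml: str) -> int:
    """Hypothetical helper: combine the process exit code with a JUnit report."""
    root = ET.fromstring(junit_xml)
    # A report may have a <testsuites> root or a single <testsuite> root.
    suites = [root] if root.tag == "testsuite" else list(root.iter("testsuite"))
    failed = 0
    for suite in suites:
        failed += int(suite.get("failures", 0)) + int(suite.get("errors", 0))
    if failed > 0 and process_returncode == 0:
        # The process claimed success but the report disagrees: trust the report.
        return 1
    return process_returncode
```

A non-zero process exit code is passed through unchanged, so crashes that never produce failing test entries are still reported.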
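PR #38432: Fused LoRA shrink+expand operation
- Core topic: Fusing the two LoRA kernels, shrink and expand, into a single operation to reduce Python dispatch overhead and enable the PDL (Programmatic Dependent Launch) optimization on SM90+ GPUs.
- Viewpoints:
  - Author (bhoomit): Showed significant latency and throughput improvements from the fused operation on H200 and explained how it complements another async-stream PR (#35721).
  - Maintainer (jeejeelee): Pointed the author to the related async-stream PR.
- Discussion focus: How the two optimizations (kernel fusion vs. asynchronous multi-stream) interact and in which scenarios their gains stack.
- Conclusion: The PR remains open; the approach aims to improve LoRA inference performance.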
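
For readers unfamiliar with the two LoRA steps, the arithmetic being fused is a shrink (project to the low rank) followed by an expand (project back out). The pure-Python sketch below only illustrates that arithmetic and the "one call instead of two" shape. It says nothing about the GPU kernels or PDL, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product, enough for a tiny illustration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_two_step(x: Matrix, lora_a: Matrix, lora_b: Matrix) -> Matrix:
    low_rank = matmul(x, lora_a)     # shrink: hidden size -> rank
    return matmul(low_rank, lora_b)  # expand: rank -> output size


def lora_fused(x: Matrix, lora_a: Matrix, lora_b: Matrix) -> Matrix:
    """Single call: the low-rank intermediate stays local to each row."""
    out = []
    for row in x:
        low = [sum(v * a for v, a in zip(row, col)) for col in zip(*lora_a)]
        out.append([sum(l * b for l, b in zip(low, col)) for col in zip(*lora_b)])
    return out
```

Both functions compute the same result; the fused form simply avoids returning to the caller between the two steps.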
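PR #38423: Fix NVFP4 on SM12x GPUs
- Core topic: Fixing illegal-instruction errors when running NVFP4 quantized models on new-architecture GPUs such as the RTX 50 series (SM120) and DGX Spark (SM121).
- Discussion focus: The root cause is complex. It involves missing tile configurations for these architectures in the CUTLASS library, an outdated CUTLASS version bundled with FlashInfer, and faulty kernel-availability checks, among other factors. In the comments, a user (eugr) reported still hitting illegal-instruction crashes after applying the PR; the author (johnnynunez) responded and folded in changes from other related fix PRs.
- Conclusion: The PR targets support for a cutting-edge quantization format on new hardware and is under active iteration, highlighting the challenge of keeping the software and hardware ecosystems in step.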

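🔥 Hot Topics and Trends

Transformers v5 upgrade adaptation: Several Issues and PRs center on this theme. For example, Issue #38425 (InternVL2), PR #38447 (HyperCLOVAX Vision), and PR #38437 (MiniCPMV) all vendor model code into vLLM to work around compatibility problems caused by v5's stricter config validation or model initialization flow.
Quantization work and bug fixes: Quantization is an active area that also exposes many edge cases. Issue #38439 (NVFP4+MLA), PR #38423 (NVFP4+SM12x), and PR #38442 (online quantized reloading) all address compatibility and correctness problems between different quantization schemes and specific model architectures or hardware platforms.
Tool calling and parser hardening: As models such as Kimi see more use, details of their dedicated tool-call parsers are drawing attention (Issue #38441, PR #38443), and the community is polishing the robustness of these advanced features.
CI/CD infrastructure modernization: CI stability and efficiency are a shared concern beyond the AMD platform. PR #38444 and PR #38391 (pre-downloading FlashInfer headers to remove a network dependency) both reflect continued investment in the reliability of the build and test pipelines.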

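🛠️ Key Technical Changes

PR #38444: [ROCm][CI] New Python CI runner: More than a fix, this is a restructuring of the AMD CI flow. By using a Python script to precisely control container lifecycle, resource cleanup, and result verification, it targets the root of the long-standing environment contamination and false test results, which matters for safeguarding code quality on the AMD platform.
PR #38432: [Misc][LoRA] Fused shrink+expand operation: This optimization reduces the Python overhead of kernel launches and enables PDL on supported hardware, directly improving LoRA inference performance. It reflects vLLM's continued attention to performance details.
PR #38423: [NVIDIA] Bugfix NVFP4 DGX Spark and RTX50: This fix touches an underlying library update (CUTLASS), build configuration, and runtime check logic, a typical example of several layers of the stack having to work together. It aims to ensure that vLLM's cutting-edge quantization features extend smoothly to the new generation of consumer and professional GPUs.
PR #38414: [Test] Fix race condition in test_abort_final_step: Fixes an intermittent test failure caused by an insufficient wait time. This reflects the project's emphasis on test-suite stability and ensures that the core request-abort logic is verified reliably.
PR #38391: [CI Bugfix] Pre-download FlashInfer headers during Docker build: By moving a network dependency from run time to build time, this resolves test failures in CI caused by unavailable external resources and makes tests more deterministic and faster.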
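
For the kind of flaky wait addressed by PR #38414 above, a common general remedy is to poll for the expected state up to a deadline instead of sleeping for a fixed, short interval. The sketch below shows that pattern only; it is not the change made in the PR, and wait_until is a hypothetical helper.

```python
import threading
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.01
) -> bool:
    """Poll `condition` until it is true or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One last check so a condition that became true at the deadline still counts.
    return condition()


# Example: a background task finishes after a delay the test cannot predict exactly.
done = threading.Event()
threading.Timer(0.05, done.set).start()
finished = wait_until(done.is_set, timeout=2.0)
```

The timeout bounds the worst case, while the test still returns as soon as the condition holds.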

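📈 Development Activity Observations

Diverse contributors: This period saw many first-time or occasional contributors (such as mukesh-hai, saifmb0, and anantha119) submitting PRs ranging from metrics exposure to bug fixes, a sign of the community's appeal.
Deep corporate involvement: Employees of AMD (andyluo7, tjtanaa, and others) and NVIDIA (johnnynunez and others) were very active in code and discussion, particularly on optimizations and fixes for their respective hardware ecosystems, reflecting vendors' strategic investment in the vLLM ecosystem.
Efficient review and merging: Many PRs were quickly labeled ready and merged, especially bugfix and CI changes, indicating that the core maintainers have clear priorities around keeping the main branch stable and moving.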

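💡 Issues Worth Watching

Long-term impact of the Transformers v5 upgrade: Working around compatibility problems by vendoring model code is only a stopgap. In the long run, upstreaming model code into the Transformers library and adapting vLLM to the standard v5 initialization flow is the more sustainable direction (as mentioned in Issue #38425).
Ongoing adaptation to new hardware architectures: As new architectures such as NVIDIA SM12x and AMD gfx950 reach the market, updates to underlying kernel libraries (such as CUTLASS and FlashInfer), build-system adjustments, and runtime checks all need to keep pace (Issue #36337, PR #38423).
AMD CI stability transition: Whether the rewrite in PR #38444 fully resolves AMD CI flakiness, and whether its design pattern in turn drives CI improvements on other platforms, is worth watching.
Cleanup of long-open bugs: Many historical Issues labeled stale were closed this period, showing that the community is actively managing the codebase and its maintenance burden.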

📋 Appendix: Detailed Data Lists

New Issues

#38441 [Bug]: kimi-k2 tool parser regex is off a tiny bit — bug — by bigBrain1901 (created: 2026-03-29 06:50 (UTC+8))
#38439 [Bug]: NVFP4 + MLA error during processing — bug — by kylesayrs (created: 2026-03-29 05:47 (UTC+8))
#38428 [Bug]: V1 Engine: EngineDeadError (AssertionError) on max_model_len overflow during realtime audio streaming — bug — by sh1man (created: 2026-03-28 19:05 (UTC+8))
#38431 [Installation]: torch 2.11 is not supported — installation — by vanmilleru (created: 2026-03-28 22:30 (UTC+8))
#38425 [Transformers v5] InternVL2 — help wanted,good first issue — by hmellor (created: 2026-03-28 16:52 (UTC+8))
#38420 [Bug]: _C_stable_libtorch fails to build: const& references violate stable ABI trivially_copyable requirement — no labels — by gbanyan (created: 2026-03-28 13:27 (UTC+8))

Closed Issues

#20216 [Bug]: When inferring Qwen3-32B-AWQ with vllm0.9.2, an error message appears: Quantization scheme is not supported — bug,stale — by HelloCard (closed: 2026-03-29 10:17 (UTC+8))
#21602 [Bug]: ValueError: current platform cpu does not support ray. — feature request,ray,stale — by codingl2k1 (closed: 2026-03-29 10:17 (UTC+8))
#23669 [CI]: Quantization tests cleanup — ci/build,stale — by njhill (closed: 2026-03-29 10:17 (UTC+8))
#25850 [Bug]: vLLM subprocesses remain alive after parent process/server exits (MCP client exit → vLLM still alive) — bug,stale — by xhd0728 (closed: 2026-03-29 10:17 (UTC+8))
#27174 [Bug]: Qwen3-VL-30B-A3B OOM on 96GB single gpu — bug,stale — by bhaktatejas922 (closed: 2026-03-29 10:16 (UTC+8))
#29285 [Bug]: Error when loading the PaddleOCR-VL model for OCR recognition — bug,stale — by czhcc (closed: 2026-03-29 10:16 (UTC+8))
#29417 [Bug]: Cosmos-Reason-7B Flash Attention head dim — bug,stale — by ECMGit (closed: 2026-03-29 10:16 (UTC+8))
#29566 [Bug]: disagg_example_p2p_nccl_xpyd.sh RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {} — bug,stale — by ZRJ026 (closed: 2026-03-29 10:16 (UTC+8))
#29571 [Feature]: Batch-invariant in Ampere — feature request,stale — by luo1206 (closed: 2026-03-29 10:16 (UTC+8))
#29591 [Bug] RuntimeError: Cannot re-initialize CUDA in forked subprocess when using lmcache/kv_cache_sharing_lmcache_v1 — bug,stale — by usberkeley (closed: 2026-03-29 10:16 (UTC+8))
#29596 [Bug]: kimi-k2 gets wrong tool_call_id — bug,stale — by MoyanZitto (closed: 2026-03-29 10:16 (UTC+8))
#35310 [Feature]: Qwen-ASR Forced Aligner — good first issue,feature request — by jppm99 (closed: 2026-03-29 08:27 (UTC+8))
#38439 [Bug]: NVFP4 + MLA error during processing — bug — by kylesayrs (closed: 2026-03-29 07:04 (UTC+8))
#36337 [Bug]: Kimi-K2.5-MXFP4 produces gibberish output on MI350X (gfx950) with ROCm 7.2 — bug,rocm — by Slonegg (closed: 2026-03-29 03:54 (UTC+8))
#38221 Flaky test: test_abort_during_final_step[False] fails intermittently — no labels — by markmc (closed: 2026-03-28 21:43 (UTC+8))
#38375 [Bug]: IndexError when --renderer-num-workers + --mm-processor-cache-type shm — bug — by cjackal (closed: 2026-03-28 21:32 (UTC+8))
#38110 [Bug]: flashinfer-cubin does not include all cubins/headers — bug — by mgoin (closed: 2026-03-28 21:09 (UTC+8))

New PRs

#38451 [wip] fix dbo sync — no labels — by czhu-cohere (created: 2026-03-29 11:47 (UTC+8))
#38436 [Bugfix] Fix NaN corruption from CUDA graph padding in NVFP4 models — bug,v1,nvidia — by elvircrn (created: 2026-03-29 04:26 (UTC+8))
#38423 [NVIDIA] Bugfix NVFP4 DGX Spark and RTX50 — bug,ready,ci/build,nvidia,ready-run-all-tests — by johnnynunez (created: 2026-03-28 15:58 (UTC+8))
#38450 [ROCm][CI] Fix cross-attention dispatch for encoder-decoder models — documentation,rocm,ready,v1 — by AndreasKaratzas (created: 2026-03-29 11:03 (UTC+8))
#38444 [ROCm][CI] Add K8s-hardened Python CI runner with JUnit exit-code fix, GPU lifecycle, and LFU cache — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-29 07:49 (UTC+8))
#38449 [CI] Revamp translation validation tests: parametrize ROCm backends, add seed, relax semantic assertions — rocm — by AndreasKaratzas (created: 2026-03-29 09:58 (UTC+8))
#38446 Revert “[QeRL] Compose online quantization with quantized reloading” (#38032) — no labels — by zhewenl (created: 2026-03-29 09:05 (UTC+8))
#38448 fix(tokenizer): skip reasoning_effort when None in Mistral tokenizer — no labels — by marioiseli89 (created: 2026-03-29 09:31 (UTC+8))
#38447 [Transformers v5] Vendor HCXVisionConfig for compatibility — no labels — by HanFa (created: 2026-03-29 09:21 (UTC+8))
#38445 [PERF]MiniMax-M2 gate kernel — performance,ci/build — by jeejeelee (created: 2026-03-29 08:55 (UTC+8))
#38442 [QeRL] Fix online quantized reloading — ci/build,v1 — by kylesayrs (created: 2026-03-29 06:55 (UTC+8))
#38426 [CI]revert initialize_model context manager — ready — by jikunshang (created: 2026-03-28 18:14 (UTC+8))
#38443 [Bugfix][Tool Parser] Fix Kimi-K2 streaming regex to handle leading newline before tool call ID — bug,tool-calling — by saifmb0 (created: 2026-03-29 07:32 (UTC+8))
#38440 [compile] Allow strings in custom ops without regressing compilation times - Part 2 — rocm,needs-rebase,v1,qwen — by anijain2305 (created: 2026-03-29 06:07 (UTC+8))
#38437 Draft: Vendor MiniCPMV processor to unblock Transformers v5 upgrade — no labels — by guanwei-wu (created: 2026-03-29 05:18 (UTC+8))
#38438 [Bugfix] Fix set serialization in Qwen3.5 rope validation configs — bug,qwen — by jrwoolley (created: 2026-03-29 05:23 (UTC+8))
#38435 [metrics] expose num_skipped_waiting_reqs to distinguish constraint-blocked waiting requests — v1 — by mukesh-hai (created: 2026-03-29 01:45 (UTC+8))
#38432 [Misc][LoRA] Add fused shrink+expand op with PDL for LoRA — no labels — by bhoomit (created: 2026-03-28 22:56 (UTC+8))
#38421 -[Bugfix] Fix stable ABI build: pass torch::stable::Tensor by value — bug — by anantha119 (created: 2026-03-28 15:42 (UTC+8))
#38434 [Fix] Improve ROCm detection in WSL environments — rocm — by yiz-liu (created: 2026-03-28 23:46 (UTC+8))
#38433 feat(nixl,dcp): Supports DCP for PD disaggregation with nixl connector and MLA backends — documentation,v1,kv-connector,nvidia — by pisceskkk (created: 2026-03-28 23:27 (UTC+8))
#38429 [CI] Fix Ernie4.5-VL initialization test — ready,ci/build,ci-failure — by haosdent (created: 2026-03-28 20:51 (UTC+8))
#38427 [Bugfix] Enable batch-invariant Triton matmul on all Ampere GPUs (SM 8x) — bug — by YM2132 (created: 2026-03-28 18:22 (UTC+8))
#38430 Devcontainer rocm rx6900xt — documentation,performance,rocm,needs-rebase,ci/build — by ArsArmandi (created: 2026-03-28 22:02 (UTC+8))
#38418 [Bugfix] Disallow renderer_num_workers > 1 with mm processor cache — bug,ready — by scyyh11 (created: 2026-03-28 12:41 (UTC+8))
#38422 [Bugfix] Explain ShareGPT benchmark filtering failures when fixed output_len invalidates all requests — bug,performance — by chenshui223 (created: 2026-03-28 15:48 (UTC+8))
#38424 Debug shutdown — documentation,rocm,intel-gpu,ci/build,v1,cpu — by Cursx (created: 2026-03-28 16:38 (UTC+8))
#38417 fix: ensure CUDA context initialization before MemorySnapshot in EngineCore — v1,nvidia — by amasolov (created: 2026-03-28 12:32 (UTC+8))
#38419 [Bugfix] Fix backup token index in async spec decode (fixes Nemotron BF16 accuracy) — bug,speculative-decoding,v1 — by SandishKumarHN (created: 2026-03-28 12:45 (UTC+8))

Merged PRs

#37049 [Misc]: clean up non-core lint issues — documentation,performance,ready — by whyiug (merged: 2026-03-28 22:28 (UTC+8))
#38111 [Spec Decode, BugFix] Propagate norm_before_fc from Eagle3 speculator — bug,ready — by shubhra (merged: 2026-03-29 08:42 (UTC+8))
#38426 [CI]revert initialize_model context manager — ready — by jikunshang (merged: 2026-03-29 00:56 (UTC+8))
#35367 [Feature] Add Qwen3-ForcedAligner support via token classification pooling — documentation,new-model,ready,multi-modality,qwen — by haosdent (merged: 2026-03-29 08:27 (UTC+8))
#38415 [ROCm][CI] Fix UV install in Dockerfile.rocm to detect curl failures and retry — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-29 00:47 (UTC+8))
#38362 [BugFix][Frontend] apply task instruction as system prompt in cohere v2/embed — bug,frontend,ready — by walterbm (merged: 2026-03-29 02:30 (UTC+8))
#38429 [CI] Fix Ernie4.5-VL initialization test — ready,ci/build,ci-failure — by haosdent (merged: 2026-03-28 22:43 (UTC+8))
#38418 [Bugfix] Disallow renderer_num_workers > 1 with mm processor cache — bug,ready — by scyyh11 (merged: 2026-03-28 21:32 (UTC+8))
#38391 [CI Bugfix] Pre-download missing FlashInfer headers in Docker build — bug,ready,ci/build,ci-failure — by mgoin (merged: 2026-03-28 21:09 (UTC+8))
#38414 [Test] Fix flaky race condition in test_abort_final_step — ready,v1 — by yzong-rh (merged: 2026-03-28 17:06 (UTC+8))
#38108 Fix Device Index for ROCm Ray Workers in MoE Benchmark — performance,rocm,ready — by li-liwen (merged: 2026-03-28 16:27 (UTC+8))
#38413 [ROCm] [Release] Update ROCm variant from rocm700 to rocm721 — rocm,ready,ci/build — by tjtanaa (merged: 2026-03-28 14:07 (UTC+8))

PRs Closed Without Merging

#38023 [Dev] CuTile gemm FP8 — performance,nvidia — by LironKesem (closed: 2026-03-29 10:26 (UTC+8))
#38430 Devcontainer rocm rx6900xt — documentation,performance,rocm,needs-rebase,ci/build — by ArsArmandi (closed: 2026-03-28 22:02 (UTC+8))
#38407 [Tests] Fix Transformers Nightly CI: use check_max_version=False in test_transformers.py — no labels — by SandishKumarHN (closed: 2026-03-28 17:24 (UTC+8))
#38403 [CI][Eval] Lower Nemotron-3-Super-120B-A12B-BF16 GSM8K accuracy threshold to 0.91 — ready — by SandishKumarHN (closed: 2026-03-28 12:45 (UTC+8))