vLLM Development Activity Report - 2026-03-30
Time window: 2026-03-30 11:46 (UTC+8) ~ 2026-03-31 11:46 (UTC+8). Stats: 25 new Issues | 11 closed Issues | 83 new PRs | 39 merged PRs | 17 PRs closed without merging
📊 Daily Development Status Summary
vLLM development was exceptionally active this cycle (2026-03-30 to 2026-03-31), with 25 new Issues, 83 new PRs, and 39 PRs merged. Work centered on deep optimization and bug fixing for the AMD/ROCm ecosystem, performance and stability of hybrid (Mamba/SSM) and MoE models, and polish of advanced features such as LoRA support, multimodality, and tool calling. Community attention to stability and user experience in complex production scenarios (high concurrency, multi-node deployments) continues to grow.
🎯 AMD/ROCm Ecosystem Updates
The AMD/ROCm ecosystem was the clear focus this cycle, with more than 10 related PRs and 3 Issues, driven by several AMD engineers (amd-asalykov, gshtras, benenzhu, AndreasKaratzas, and others), reflecting AMD's substantial investment in vLLM platform support.
- Performance optimizations:
  - PR #38536 (amd-asalykov): Added MoE kernel configs tuned for Kimi K2.5 at TP=4 on MI350X/355X, yielding a 3-10x MoE speedup in isolated tests; end-to-end benchmarks show average TTFT reduced by about 2x and average TPOT by 3-4x.
  - PR #38597 (benenzhu): Removed redundant device-to-device copies in the ROCm MoE kernel, for a ~9% end-to-end throughput gain on MiniMax M2.5.
  - PR #38509 (AndreasKaratzas): Improved FP8/MXFP4 MoE backend selection to list only candidates supported on the current platform (CUDA/ROCm/XPU), making logs clearer.
- Bug fixes and stability:
  - Issue #38512 (AndreasKaratzas): Reported that AITER passed the unsupported `-amdgpu-coerce-illegal-types=1` compiler flag during JIT kernel compilation. The problem was quickly triaged and moved to the ROCm/aiter repository.
  - PR #38555 (dondetir): Fixed encoder cache profiling hanging indefinitely on consumer RDNA3/3.5 GPUs (e.g. RX 7900 XTX) when starting multimodal models, caused by a missing MIOpen solver database. The fix skips the profiling step on these GPUs.
  - PR #38503 (AndreasKaratzas): Fixed an engine stall during async KV offload, where GPU blocks with deferred frees were not counted by `has_requests()`.
  - PR #38502 (AndreasKaratzas): Fixed a ROCm OOM where hybrid Mamba models (e.g. Jamba) with very large block sizes drove the Triton paged attention kernel past per-request shared memory limits.
  - PR #37698 (hongxiayang): Fixed Quark-quantized models (e.g. MiniMax-M2.1-MXFP4) failing to load because of a hard-coded `trust_remote_code=False`; users can now override the setting from the command line.
- Feature extensions:
  - PR #38501 (AndreasKaratzas): Added asymmetric (AZP) INT8 quantization support to the Triton INT8 scaled-MM linear kernel, lifting the restriction on running asymmetric INT8 models on ROCm.
  - PR #38605 (EdalatiAli): Added online block FP8 quantization, allowing BF16/FP16 models to be dynamically quantized to block-wise FP8 at load time, for both dense and MoE models.
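The idea behind block FP8 quantization is to store one scale per weight tile rather than per tensor, so outliers only degrade their own tile. A minimal NumPy sketch of that scaling scheme follows; the function names and the 128x128 tile size are illustrative assumptions, not PR #38605's actual code, and the final e4m3 rounding step is deliberately omitted.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in fp8 e4m3

def quantize_block_fp8(weight: np.ndarray, block: int = 128):
    """Scale a 2-D weight into per-tile fp8 dynamic range (sketch).

    One scale is stored per (block x block) tile. Real code would then
    cast each scaled tile to torch.float8_e4m3; that cast is omitted.
    """
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0
    scales = np.empty((rows // block, cols // block), dtype=np.float32)
    q = np.empty_like(weight, dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = weight[i:i + block, j:j + block]
            # Per-tile scale maps the tile's max magnitude onto e4m3 max.
            scale = max(float(np.abs(tile).max()) / FP8_E4M3_MAX, 1e-12)
            scales[i // block, j // block] = scale
            q[i:i + block, j:j + block] = np.clip(
                tile / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales

def dequantize_block_fp8(q: np.ndarray, scales: np.ndarray, block: int = 128):
    # Broadcast each tile's scale back over its (block x block) region.
    s = np.kron(scales, np.ones((block, block), dtype=np.float32))
    return q * s
```

Because only the scaling is simulated, the round trip here is exact up to float32 precision; a real fp8 cast would add tile-bounded quantization error.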
💬 High-Engagement Discussion Analysis
- Issue #38516: [Feature]: It's user unfriendly to panic when there is not enough VRAM…
  - Core issue: At startup, vLLM checks whether available VRAM can support the model's maximum sequence length and exits with an error if not. The reporter finds this unfriendly and suggests checking dynamically at runtime and returning per-request errors instead.
  - Opposing views:
    - User (yangxi): Many requests never approach the maximum length, so the startup check is too strict; vLLM should start anyway and handle OOM on demand at runtime.
    - Maintainer (DarkLight1337): vLLM targets production serving; the startup check prevents a malicious user from crashing the service with a long-context DoS request. Suggests `--max-model-len -1` to auto-fit memory.
  - Point of contention: the trade-off between security and user experience. The user wants maximum flexibility and fault tolerance; the maintainers prioritize service stability and safety.
  - Status: Under discussion, no consensus. The user maintains that runtime checks would suffice.
- Issue #37749: [Bug]: Qwen 3.5 stops working after upgrade to v0.18.0 (closed)
  - Core issue: After upgrading to v0.18.0, Qwen 3.5 models either fail to start or emit garbled output.
  - Joint debugging:
    - The reporter and community members shared details from several environments (A100, H200, DGX Spark).
    - Maintainer (tjtanaa) suggested disabling prefix caching (`--no-enable-prefix-caching`).
    - Community member (learnbott) resolved some Blackwell GPU cases by specifying `--attention-backend FLASHINFER`.
    - Finally, another user (dotmobo) resolved it by upgrading driver/CUDA versions, using a specific Docker image, and increasing shared memory.
  - Conclusion: The problem likely stems from a combination of driver/CUDA mismatches, GPU architecture compatibility (Flash Attention support), and system configuration (shm size), rather than a single vLLM code defect.
  - Final status: Closed once the reporter found a working configuration.
- PR #38583: [Bugfix] Opt-in INFO prompt summaries for request logging…
  - Core issue: Since v0.17.0, prompt content in request logs was moved to DEBUG level and no longer appears at INFO, confusing users (Issue #38537). This PR offers a compromise.
  - Views and decision:
    - Maintainer (robertgshaw2-redhat): Logging client payloads (prompts) at INFO level is unsafe and must not be the default behavior.
    - PR author (JosephAhn23): Based on this feedback, reworked the design into a double opt-in: users must enable `--enable-log-requests` and additionally `--enable-log-request-prompts` before truncated prompt previews appear at INFO level.
  - Outcome: Security first. The community adopted the stricter policy, controlling the log level of sensitive information through explicit, separate options.
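The double opt-in described in PR #38583 can be sketched with Python's standard `logging` module. This is an illustrative model of the gating logic, not vLLM's actual implementation; the function name, return value, and 64-character preview limit are assumptions.

```python
import logging
from typing import Optional

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("request_logger")

MAX_PREVIEW_CHARS = 64  # illustrative truncation limit

def log_request(prompt: str,
                enable_log_requests: bool,
                enable_log_request_prompts: bool) -> Optional[str]:
    """Return the INFO log line for a request, or None if logging is off.

    Prompt content reaches INFO level only when BOTH flags are set:
    request logging must be on, and prompt previews must be separately
    enabled. Otherwise the payload never appears in the line at all.
    """
    if not enable_log_requests:
        return None
    if enable_log_request_prompts:
        preview = prompt[:MAX_PREVIEW_CHARS]
        if len(prompt) > MAX_PREVIEW_CHARS:
            preview += "..."
        line = f"Received request, prompt preview: {preview!r}"
    else:
        line = "Received request (prompt logging disabled)"
    logger.info(line)
    return line
```

The key property is that a single flag is never enough to leak payload content, matching the "explicit, separate options" decision above.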
🔥 Hot Topics and Trend Analysis
- KV cache integrity and race conditions: Several high-severity Issues point to subtle KV cache corruption.
  - Issue #38606: Rapidly alternating LoRA adapters corrupts KV blocks. Unlike the previously fixed "cancelled request" race, this points to a latent defect in block management under LoRA namespaces.
  - Issue #38551: MTP speculative decoding races with the multimodal encoder cache under high concurrency; evicted encoder cache entries cause MTP proposal reads to fail and crash the engine.
  - Trend: As features stack up (LoRA, speculative decoding, multimodality, prefix caching) and concurrency rises, the correctness and robustness of the KV cache manager has become the core stability battleground.
- AMD platform support and performance push: As noted above, AMD-related contributions made up a striking share of this cycle. From compiler/toolchain fixes, to optimizing MoE and INT8 hot paths, to consumer-GPU quirks, AMD is clearly working to bring the vLLM experience on ROCm to parity with CUDA.
- Problems in complex deployment scenarios:
  - Issue #38602: On multi-node Ray deployments, requesting `logprobs` causes a deadlock, with GPU utilization dropping to 0%.
  - Issue #38515: Crash after running for a while in CPU offloading mode.
  - Trend: As vLLM moves into larger, more complex production environments (multi-node, memory offloading), distributed coordination, resource management, and lifecycle issues that never surface in single-node single-GPU testing are starting to appear.
- Continued expansion of model and feature support:
  - New models: PRs added support for TeleChat3 and DeepSeek-OCR-2.
  - Feature work: Fixed tool-call parsers for Kimi K2, GLM4 MoE, and others; improved LoRA support for Qwen 3.5 MoE and multimodal ASR models.
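The encoder-cache race in Issue #38551 reduces to one invariant: an entry must not be evicted while a speculative proposal still needs to read it. A toy reference-counted cache illustrating that coordination is sketched below; this is a hypothetical model, not vLLM's encoder cache or its actual fix.

```python
class RefCountedCache:
    """Toy cache where entries are evictable only at refcount zero.

    A speculative proposer pins (acquires) entries before reading them,
    so a concurrent eviction pass cannot free them mid-use. Sketch only;
    vLLM's real coordination mechanism may differ.
    """

    def __init__(self):
        self._data = {}
        self._refs = {}

    def put(self, key, value):
        self._data[key] = value
        self._refs[key] = 0

    def acquire(self, key):
        # Pin the entry before reading; the caller must release() later.
        self._refs[key] += 1
        return self._data[key]

    def release(self, key):
        self._refs[key] -= 1

    def evict(self, key) -> bool:
        # Refuse to evict while any reader still holds a reference.
        if self._refs.get(key, 0) > 0:
            return False
        self._data.pop(key, None)
        self._refs.pop(key, None)
        return True
```

Without the pin, the crash mode in the Issue appears: the eviction pass frees the entry, and the proposal's subsequent read hits a cache miss assertion.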
🛠️ Key Technical Changes
- PR #38536 ([ROCm][Perf] Add optimized MoE configs for Kimi K2.5 TP=4): Supplies expert-tuned configurations for a specific model (Kimi K2.5) on specific hardware (MI350X/355X). By presetting optimal kernel parameters in `fused_moe_kernel_gptq_awq`, it bypasses dynamic tuning overhead and delivers a severalfold speedup outright, a textbook case of hardware-model co-optimization.
- PR #38555 ([ROCm] Skip encoder cache profiling on consumer RDNA GPUs): This fix neatly distinguishes the software ecosystems of datacenter and consumer AMD GPUs. By detecting the GPU architecture, it avoids running an operation that hangs on consumer cards lacking the MIOpen tuning database, improving vLLM's startup success rate across a wider range of AMD hardware.
- PR #38509 ([MoE] Filter FP8/MXFP4 MoE backend candidates by platform): A developer-experience improvement. It cleans up backend selection so logs no longer list backends from irrelevant platforms, helping developers see the system's actual options and cutting debugging noise.
- PR #36847 ([Feat][Spec Decode] DFlash) (merged): Introduces DFlash, a new speculative decoding method using bidirectional attention, distinct from existing schemes such as EAGLE and MTP. Although currently limited by the availability of non-causal attention kernels, it opens a new path for future inference efficiency gains.
- PR #35431 ([Bugfix] Use null block (0) for padded block table entries) (merged): Unifies the block table padding convention between Mamba/SSM layers and attention layers (from `-1` to `0`), fixing a long-standing CUDA graph compatibility issue for hybrid models, a key step toward a unified memory layout and a more cohesive framework.
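Why PR #35431's change matters can be seen in a small sketch of block table padding. Filling unused slots with a reserved null block (0) keeps every entry a valid in-bounds index, so kernels replayed under CUDA graphs can gather from any slot without branching on a `-1` sentinel. The helper below is illustrative, not vLLM's actual code; live block ids are assumed to start at 1.

```python
import numpy as np

NULL_BLOCK = 0  # block id 0 is reserved and never holds live KV data

def pad_block_tables(block_ids_per_request, max_blocks: int) -> np.ndarray:
    """Build a fixed-shape [num_requests, max_blocks] block table.

    Unused slots are filled with the null block (0) rather than -1, so
    every entry is a valid index into the KV block pool.
    """
    table = np.full((len(block_ids_per_request), max_blocks),
                    NULL_BLOCK, dtype=np.int32)
    for row, ids in enumerate(block_ids_per_request):
        table[row, :len(ids)] = ids
    return table
```

With `-1` padding, a gather over the full table would read out of bounds unless every kernel special-cases the sentinel, which is exactly what differed between the SSM and attention paths before the fix.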
📈 Development Activity Observations
- Very high contribution rate: 83 new PRs in a single day shows a fast cadence of community innovation and fixes; 39 merges indicates an efficient review and integration pipeline.
- AMD team driving targeted optimization: Contributors such as amd-asalykov, gshtras, and benenzhu submitted a large number of high-quality ROCm PRs this cycle, from kernel optimization to bug fixes, showing deep, systematic investment from the AMD team.
- Community collaboration on hard problems: The Qwen 3.5 upgrade thread shows users, community members, and maintainers sharing information and trying multiple approaches until the environment misconfiguration was isolated, a good demonstration of open-source collaboration.
💡 Issues Worth Watching
- Deeper root cause of KV cache corruption (Issues #38606, #37076): At least two independent triggers are now known (request cancellation, rapid LoRA switching). Watch for a more fundamental meta-bug in the KV cache management logic.
- Multi-node Ray deployment deadlock (Issue #38602): Affects the stability of distributed production deployments; the interaction between `logprobs` computation and Ray communication should be triaged first.
- MTP vs. encoder cache contention (Issue #38551): In multimodal speculative decoding, cache lifecycle management needs finer coordination with speculative decoding's reference counting.
- Full consumer AMD GPU support: PR #38555 fixed one startup issue, but consumer cards still trail datacenter cards in feature completeness and performance tuning; their support status is worth tracking.
- The `--max-model-len` check policy debate (Issue #38516): This security-vs-convenience debate may push the project toward more flexible configuration policies (e.g. tiered check strictness); its outcome is worth following.
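The check at the heart of the Issue #38516 debate is back-of-envelope arithmetic: can the KV cache for one request at `max_model_len` fit in free VRAM? A simplified sketch follows; the model dimensions in the example are hypothetical, and vLLM's real profiling also accounts for weights, activations, and graph capture, which this ignores.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # Per token: key + value tensors, for every layer and every KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def fits_one_max_len_request(free_bytes: int, max_model_len: int,
                             **model_dims):
    """Startup-style check: does one max_model_len request's KV cache fit?

    Returns (ok, bytes_needed). A serving engine that fails this check
    could be DoS'd by a single long-context request, which is the
    maintainers' argument for refusing to start.
    """
    need = max_model_len * kv_cache_bytes_per_token(**model_dims)
    return need <= free_bytes, need
```

For a Llama-like model (32 layers, 8 KV heads, head dim 128, fp16), each token costs 128 KiB of KV cache, so a single 32K-token request needs 4 GiB before any batching.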
📋 Appendix: Detailed Data Lists
New Issues
- #38606 [Bug]: KV block corruption under rapid LoRA adapter alternation — bug — by Yunzez (created: 2026-03-31 11:37 (UTC+8))
- #38603 [Bug]: Streaming last chunk contains non-empty tool_calls with empty fields "type" causing type validation error — bug — by guohuanliang1 (created: 2026-03-31 11:04 (UTC+8))
- #38602 [Bug] API hangs/deadlocks when requesting logprobs on multi-node Ray deployment (PP=2, TP=8) — bug — by YirongWho (created: 2026-03-31 10:47 (UTC+8))
- #38601 [Usage]: In the context of pp parallelism, UniformTypeKVCacheSpecs type, the calculation of page size considers all the layers, not only the layerss of one pp rank ? — usage — by yangshanjun (created: 2026-03-31 10:43 (UTC+8))
- #38591 Bug: ValueError: too many values to unpack in dispatch_cpu_unquantized_gemm when loading Qwen3.5-4B — no labels — by miguel-flowstate (created: 2026-03-31 06:17 (UTC+8))
- #38520 [Bug]: LoRA loading fails for Qwen 3.5 MoE (35b-A3b) due to expert module name mismatch — bug — by uYu (created: 2026-03-30 15:09 (UTC+8))
- #38582 [CI Failure]: tests/models/language/pooling/test_splade_sparse_pooler.py — ci-failure — by bnellnm (created: 2026-03-31 04:32 (UTC+8))
- #38587 [Bug]: RCCL RDNA3 gfx1100 Tp2 ROCM at startup — bug,rocm — by JartX (created: 2026-03-31 05:04 (UTC+8))
- #38537 [Bug]: prompt is missing in Received request only params not prompts — bug — by tongcu (created: 2026-03-30 18:46 (UTC+8))
- #38586 [Bug]: Whisper online benchmark with profiling error: TypeError: multi_modal_content must be a dict containing 'audio' — bug — by AdityaKulshrestha (created: 2026-03-31 04:58 (UTC+8))
- #38560 [Bug]: reasoning_effort passed to MistralCommonTokenizer.apply_chat_template breaks Mistral Small 4 chat completions on vLLM 0.18.0 — bug — by BenjaminFuentesEviden (created: 2026-03-30 23:53 (UTC+8))
- #38551 [Bug]: AssertionError: Encoder cache miss crashes engine with MTP + multimodal under high concurrency — no labels — by kaiktl (created: 2026-03-30 21:52 (UTC+8))
- #38568 [Feature]: Sharded model loader doesn't support GCS — feature request — by kiriloman (created: 2026-03-31 01:03 (UTC+8))
- #38550 [Bug]: can't start b200x2 or b200x4 sm100 with nvidia/Qwen3.5-397B-A17B-NVFP4 — bug — by evgeniiperepelkin (created: 2026-03-30 21:31 (UTC+8))
- #38516 [Feature]: It's user unfriendly to panic when there is not enough VRAM to serve at least one request with the max seq len — feature request — by yangxi (created: 2026-03-30 14:07 (UTC+8))
- #38543 [Bug]: Failed to call /chat/completions after /tokenize for same multimodal query — bug — by sergey-zinchenko (created: 2026-03-30 20:19 (UTC+8))
- #38531 [Feature]: Support direct binary/multipart file upload for video and image in OpenAI-compatible API — feature request,multi-modality — by harshvb20 (created: 2026-03-30 17:15 (UTC+8))
- #38527 [Bug]: Qwen3.5-35B-A3B-FP8 model outputs all exclamation points — bug — by dengtong (created: 2026-03-30 16:09 (UTC+8))
- #38524 [Doc]: comprehensive rewrite of disaggregated prefilling (PD) documentation — documentation — by neweyes (created: 2026-03-30 15:50 (UTC+8))
- #38523 [Feature]: Support `routed_experts` export in disaggregated Prefill/Decode serving — feature request — by Lecooo (created: 2026-03-30 15:38 (UTC+8))
- #38521 [Bug]: Endless generation — bug — by issugo (created: 2026-03-30 15:10 (UTC+8))
- #38512 [Bug]: AITER logs clang error for unsupported `-amdgpu-coerce-illegal-types=1` flag during kernel compilation — bug,rocm — by AndreasKaratzas (created: 2026-03-30 13:54 (UTC+8))
- #38515 [Bug]: vLLM crashed when working with CPU offloading — bug — by sweihub (created: 2026-03-30 14:03 (UTC+8))
- #38511 feat: DNS-AID SVCB endpoint capability advertisement — no labels — by IngmarVG-IB (created: 2026-03-30 13:54 (UTC+8))
- #38507 [Feature]: For small setup max-model-len auto doesn't make sens — feature request — by ErykCh (created: 2026-03-30 13:23 (UTC+8))
Closed Issues
- #29743 [Feature]: Turing support in Qwen3-VL backends — feature request,stale — by TURX (closed: 2026-03-31 10:19 (UTC+8))
- #38307 [Bug]: AMD's minimax mxfp4 trust_remote_code bug — bug,rocm — by functionstackx (closed: 2026-03-30 23:49 (UTC+8))
- #33664 [Bug]: DeepSeek-V3.1 with fp8 KV Cache causes illegal memory access at concurrency ≥ 5 in `serve_benchmark` — bug — by lyg95 (closed: 2026-03-31 05:02 (UTC+8))
- #35336 [Refactor]: Make SSM backends use the null block (0) for padded requests instead of -1 — feature request — by LucasWilkinson (closed: 2026-03-31 05:02 (UTC+8))
- #37570 [Bug]: CUDA ILM (Illegal Memory Access) crash when enabling MTP num_speculative_tokens with >1 for zai-org/GLM-4.7-FP8 under load — bug — by Xarbirus (closed: 2026-03-31 03:40 (UTC+8))
- #36763 [Bug]: Kimi-K2.5 outputs only '!!!!!!!!!!' in reasoning field, content is always null — no labels — by LaRiffle (closed: 2026-03-31 00:51 (UTC+8))
- #34067 [Feature]: nightly docker and wheels for ROCm — feature request,rocm — by hongxiayang (closed: 2026-03-30 21:04 (UTC+8))
- #37103 [Bug]: UserWarning: Input tensor shape suggests potential format mismatch — bug — by eiffel31 (closed: 2026-03-30 20:32 (UTC+8))
- #37749 [Bug]: Qwen 3.5 stops working after upgrade to v0.18.0 — bug — by Pinockel (closed: 2026-03-30 20:03 (UTC+8))
- #34591 [Bug]: Ministral-3 LoRA adapter fails silently — bug — by sbischl (closed: 2026-03-30 14:59 (UTC+8))
- #38512 [Bug]: AITER logs clang error for unsupported `-amdgpu-coerce-illegal-types=1` flag during kernel compilation — bug,rocm — by AndreasKaratzas (closed: 2026-03-30 14:14 (UTC+8))
New PRs
- #38608 [Tests] Update sequence parallelism tests to support XPU — intel-gpu — by chaojun-zhang (created: 2026-03-31 11:45 (UTC+8))
- #38534 Feature Add fault tolerance instruction workflow and pause operation — frontend,v1 — by fangyuchu (created: 2026-03-30 17:50 (UTC+8))
- #38607 feat(tests): enhance async TP tests for XPU compatibility — intel-gpu — by chaojun-zhang (created: 2026-03-31 11:42 (UTC+8))
- #38595 inlined dsv3.2 — new-model,v1,deepseek — by WoosukKwon (created: 2026-03-31 09:34 (UTC+8))
- #38605 [Quant][Feature] Support online block_fp8 quant for dense and MoE models — rocm — by EdalatiAli (created: 2026-03-31 11:21 (UTC+8))
- #38542 [Bugfix] Use importlib.import_module to resolve standalone_compile module — bug — by LJL36 (created: 2026-03-30 20:18 (UTC+8))
- #38604 [Feature]: Better base64 to torch tensor (Fixes #26781) — no labels — by Shreyansh1729 (created: 2026-03-31 11:04 (UTC+8))
- #38522 [LoRA] fix MoE expert module name parsing for models with numeric expert indices (Qwen 3.5 MoE) — qwen — by NIK-TIGER-BILL (created: 2026-03-30 15:28 (UTC+8))
- #38514 [RFC] Context-Aware KV-Cache Retention API (Prioritized Evictions) — performance,frontend,v1 — by hyeongyun0916 (created: 2026-03-30 13:59 (UTC+8))
- #38596 [XPU]move testing dependencies from Dockerfile to xpu-test.in — intel-gpu,ci/build — by 1643661061leo (created: 2026-03-31 09:38 (UTC+8))
- #38581 [Feature] Add LoRA support for Qwen3ASRForConditionalGeneration — qwen — by Shreyansh1729 (created: 2026-03-31 04:27 (UTC+8))
- #38600 [Bugfix] too many values to unpack in dispatch_cpu_unquantized_gemm — bug — by boymucheng (created: 2026-03-31 10:21 (UTC+8))
- #38565 [Spec Decode] Implement Mean Pool Connector to return mean pooled vector over prompt tokens in response — v1,kv-connector — by zzaebok (created: 2026-03-31 00:39 (UTC+8))
- #38599 Revert "[Mamba][Bugfix] Raise on insufficient cache blocks instead of silently capping cudagraph sizes" (#38270) — bug,v1,nvidia — by vllm-agent (created: 2026-03-31 10:06 (UTC+8))
- #38598 Revert "[Bugfix][MLA] Change default SM100 MLA prefill backend back to TRT-LLM" (#38562) — bug — by vllm-agent (created: 2026-03-31 10:06 (UTC+8))
- #38597 [ROCM] Optmize redudent d2d copy of moe. — rocm — by benenzhu (created: 2026-03-31 09:43 (UTC+8))
- #38553 Clarify model-load OOM hint for GPU memory guidance — v1 — by panpan0000 (created: 2026-03-30 22:38 (UTC+8))
- #38594 [CI] Avoid concurrent docker pull in intel XPU CI runners to prevent rate limit issues — intel-gpu,ci/build — by wendyliu235 (created: 2026-03-31 09:09 (UTC+8))
- #38593 Revert "feat(attention): extract KV-cache update from FlashAttentionDiffKV ba… (#36466)" — v1 — by vllm-agent (created: 2026-03-31 08:41 (UTC+8))
- #38572 [Feature]: Per-Request Timing Headers (--enable-request-stats-headers) — documentation,performance,new-model,rocm,structured-output,frontend,intel-gpu,speculative-decoding,ci/build,v1 — by vrdn-23 (created: 2026-03-31 02:28 (UTC+8))
- #38589 Add @vadiklyutiy as committer — ready,ci/build — by vadiklyutiy (created: 2026-03-31 05:37 (UTC+8))
- #38579 [Bugfix] Kimi-K2 tool parser streaming - fix token leakage, argument truncation, and content dropping — bug,tool-calling — by sfeng33 (created: 2026-03-31 03:58 (UTC+8))
- #38592 [Kernel] [Helion] [17/N] Add Helion kernel torch.compile support — no labels — by gmagogsfm (created: 2026-03-31 06:28 (UTC+8))
- #38590 [MoE Refactor] Layer runner refactor — needs-rebase,v1,llama,qwen,deepseek,gpt-oss,nvidia — by bnellnm (created: 2026-03-31 05:59 (UTC+8))
- #38580 [ROCm][CI-Build] Cherry pick triton BUFFER_OPS fix — rocm,ci/build — by gshtras (created: 2026-03-31 04:15 (UTC+8))
- #38567 Restore non-hf processor path for Nano-Nemotron-VL (bypass `call_hf_processor_mm_only`) - fixes #38018 — ready,nvidia — by netanel-haber (created: 2026-03-31 00:55 (UTC+8))
- #38574 [Online Quant] [QeRL] Minor code cleanup — ready — by kylesayrs (created: 2026-03-31 03:14 (UTC+8))
- #38548 chore: wip kimi tool call rewrite — tool-calling — by felixmr1 (created: 2026-03-30 20:55 (UTC+8))
- #38556 [Bugfix][Async] Fix async spec decoding with hybrid models — bug,speculative-decoding,ready,v1 — by MatthewBonanni (created: 2026-03-30 23:01 (UTC+8))
- #38588 Fix Whisper online benchmarking with profiling #38586 — performance — by AdityaKulshrestha (created: 2026-03-31 05:11 (UTC+8))
- #38583 [Bugfix] Opt-in INFO prompt summaries for request logging (--enable-log-request-prompts) — bug,documentation,frontend — by JosephAhn23 (created: 2026-03-31 04:43 (UTC+8))
- #38585 [ROCm][CI/Build] Fix the pytest hook to properly print out the summary — rocm,ready,ci/build — by gshtras (created: 2026-03-31 04:57 (UTC+8))
- #38584 [WIP][CI][Bugfix] Fix `test_run_eagle_dp` — bug,ready,v1 — by MatthewBonanni (created: 2026-03-31 04:51 (UTC+8))
- #38530 [Benchmark] Fix poem dataset sampling infinite loop when input_len is too small — performance — by WJYuuuu (created: 2026-03-30 17:14 (UTC+8))
- #38501 [ROCm][Quantization] Add asymmetric INT8 quantization support to TritonInt8ScaledMMLinearKernel — rocm,ready — by AndreasKaratzas (created: 2026-03-30 12:25 (UTC+8))
- #38578 [release] Automate DockerHub image publishing in release pipeline — ci/build — by khluu (created: 2026-03-31 03:48 (UTC+8))
- #38503 [Engine] Fix engine stalling when async KV offload defers block frees — rocm,ready,v1 — by AndreasKaratzas (created: 2026-03-30 12:39 (UTC+8))
- #38577 Add nightly b200 test for spec decode eagle correctness — ci/build — by puririshi98 (created: 2026-03-31 03:41 (UTC+8))
- #38508 [ROCm][CI] Fix Whisper translation test attention backend selection — rocm,ready — by AndreasKaratzas (created: 2026-03-30 13:28 (UTC+8))
- #38576 vLLM Benchmark Suite perf regression after PR#32723 — performance,ci/build,cpu — by louie-tsai (created: 2026-03-31 03:35 (UTC+8))
- #38575 Cp+nixl — documentation,frontend,needs-rebase,ci/build,v1,multi-modality,cpu,kv-connector,nvidia — by aarondou (created: 2026-03-31 03:27 (UTC+8))
- #38573 [Compile] Fix nvfp4 compile warning — ready — by yewentao256 (created: 2026-03-31 02:29 (UTC+8))
- #38561 [Core] Batch invariant chunked prefill for mamba/hybrid models — v1 — by jzakrzew (created: 2026-03-31 00:03 (UTC+8))
- #38562 [Bugfix][MLA] Change default SM100 MLA prefill backend back to TRT-LLM — bug,ready — by MatthewBonanni (created: 2026-03-31 00:04 (UTC+8))
- #38571 [BugFix] Fix OOB read in CUTLASS grouped GEMM with epilogue — bug,nvidia — by LucasWilkinson (created: 2026-03-31 01:51 (UTC+8))
- #38566 [Bugfix][CI] Fix flaky `test_eagle` test — bug,ready,v1 — by NickLucche (created: 2026-03-31 00:51 (UTC+8))
- #38570 [Misc] Move --grpc CLI argument into make_arg_parser — frontend — by CatherineSue (created: 2026-03-31 01:14 (UTC+8))
- #38509 [MoE] Filter FP8/MXFP4 MoE backend candidates by platform — rocm — by AndreasKaratzas (created: 2026-03-30 13:38 (UTC+8))
- #38504 [Kernels][MoE] Fix legacy_routing to use bitmatrix-based routing path — rocm,ready,v1,gpt-oss — by AndreasKaratzas (created: 2026-03-30 12:44 (UTC+8))
- #38569 Move --grpc arg into make_arg_parser — rocm,frontend,needs-rebase,ci/build — by CatherineSue (created: 2026-03-31 01:06 (UTC+8))
- #38554 [kv_offload+HMA] Fix num_blocks with different per-layer page sizes and improve assert message — ready,v1,kv-connector — by kfirtoledo (created: 2026-03-30 22:50 (UTC+8))
- #38557 [Feature] Add GCS support for sharded state loader — no labels — by kiriloman (created: 2026-03-30 23:13 (UTC+8))
- #38549 [PERF] Extend NCCL symmetric memory to AllGather and ReduceScatter — ci/build,nvidia — by samnordmann (created: 2026-03-30 21:17 (UTC+8))
- #38564 AIFQA-397 BLK-011: [vLLM] DeepSeek-OCR-2 — DeepseekOCR2ForCausalLM arch not supported — intel-gpu,multi-modality,deepseek — by jklawikowski (created: 2026-03-31 00:29 (UTC+8))
- #38563 AIFQA-399 BLK-001: [vLLM/XPU] Multi-GPU CCL/OFI transport hang — shm_broadcast blocks indefinitely (TP>1) — documentation,performance,intel-gpu,v1,multi-modality,qwen,kv-connector,nvidia — by jklawikowski (created: 2026-03-31 00:19 (UTC+8))
- #38545 [Bugfix] Skip multimodal processor cache in /tokenize to prevent stale SenderCache entries — bug,frontend,multi-modality — by sergey-zinchenko (created: 2026-03-30 20:34 (UTC+8))
- #38559 [Perf] Optimize mean pooling using chunks and index_add, 5.9% E2E throughput improvement — ready — by yewentao256 (created: 2026-03-30 23:46 (UTC+8))
- #38558 [KVConnector] Skip `register_kv_caches` on profiling — v1 — by NickLucche (created: 2026-03-30 23:27 (UTC+8))
- #38546 [KVConnector] Remove redundant method KVConnectorOutput::merge() — ready,v1 — by hickeyma (created: 2026-03-30 20:50 (UTC+8))
- #38555 [ROCm] Skip encoder cache profiling on consumer RDNA GPUs — rocm,v1 — by dondetir (created: 2026-03-30 22:56 (UTC+8))
- #38517 [Bugfix][Quantization] Fix PerTensorScale loading with tuple shard_id in MergedColumnParallelLinear — bug — by kkyyxhll (created: 2026-03-30 14:09 (UTC+8))
- #38513 feat(serving): DNS-AID SVCB endpoint registration — documentation,frontend,ci/build — by IngmarVG-IB (created: 2026-03-30 13:55 (UTC+8))
- #38552 openshift-compatibility — ci/build — by SorenDreano (created: 2026-03-30 22:17 (UTC+8))
- #38547 [Misc] Add @tomeras91 as a maintainer of Nemotron related code + mamba block — ci/build — by tomeras91 (created: 2026-03-30 20:53 (UTC+8))
- #38544 [Bugfix] Fix glm4_moe_tool_parser for tools in the responses API format — bug,tool-calling — by math625f (created: 2026-03-30 20:31 (UTC+8))
- #38540 [Minimax-M2.5] Use bfloat16 gate linear — no labels — by SYChen123 (created: 2026-03-30 19:18 (UTC+8))
- #38535 [Bugfix][CPU] Skip set_num_threads after thread binding — bug,ready,ci/build,v1,cpu — by bigPYJ1151 (created: 2026-03-30 18:03 (UTC+8))
- #38541 [Nixl] Support different head_dim — v1,kv-connector — by NickLucche (created: 2026-03-30 19:50 (UTC+8))
- #38526 [BugFix] Fix shared reference bug in parsers for n>=2 and streaming — bug,frontend — by okdshin (created: 2026-03-30 16:07 (UTC+8))
- #38539 [OPT] Optimize the fusion-MOE-kernel matrix calculation for cases whe… — no labels — by BJWang-ant (created: 2026-03-30 19:16 (UTC+8))
- #38538 Add audio extraction at init + automatic audio detection — frontend,needs-rebase,v1 — by askliar (created: 2026-03-30 19:05 (UTC+8))
- #38536 [ROCm][Perf] Add optimized MoE configs for Kimi K2.5 TP=4 — rocm — by amd-asalykov (created: 2026-03-30 18:25 (UTC+8))
- #38533 [Frontend] Skip stop in reasoning content — v1,tool-calling — by chaunceyjiang (created: 2026-03-30 17:39 (UTC+8))
- #38510 [New Model]: add support for telechat3 — new-model — by 1096125073 (created: 2026-03-30 13:49 (UTC+8))
- #38525 [Docs]: comprehensive rewrite of disaggregated prefilling (PD) documentation — documentation,kv-connector — by neweyes (created: 2026-03-30 16:02 (UTC+8))
- #38532 fix/TokenizerRepeatedInitialization — no labels — by jingkuja (created: 2026-03-30 17:18 (UTC+8))
- #38519 Fix Responses JSON schema alias serialization — frontend,ready — by noobHappylife (created: 2026-03-30 14:51 (UTC+8))
- #38528 [Benchmark] Fix Dataset infinite loop when input_len is too small — performance — by WJYuuuu (created: 2026-03-30 16:57 (UTC+8))
- #38529 fix: vgpu segfault before nccl init — no labels — by ljy11a (created: 2026-03-30 17:07 (UTC+8))
- #38518 [Misc][Logger][DP] Enable AggregatedLoggingStatLogger when user specifing aggregate_engine_logging — v1 — by MengqingCao (created: 2026-03-30 14:16 (UTC+8))
- #38505 [ci] Soft fail and disable retry for AMD build image job — rocm,ready,ci/build — by khluu (created: 2026-03-30 13:01 (UTC+8))
- #38506 [Core] Make ModelRunnerOutput.num_nans_in_logits np.ndarray rather than python dict — v1 — by RahulBirCodes (created: 2026-03-30 13:04 (UTC+8))
- #38502 [ROCm] Cap Triton paged attention block size to fix ROCm shared memory OOM — rocm,ready,v1 — by AndreasKaratzas (created: 2026-03-30 12:29 (UTC+8))
Merged PRs
- #37234 [Bugfix] Fix for builtins (forward fix of pytorch/177558) — bug,ready — by Lucaskabela (merged: 2026-03-31 09:08 (UTC+8))
- #37221 [3/n] Migrate cutlass/scaled_mm_entry.cu torch stable ABI — ready,ci/build,nvidia — by mikaylagawarecki (merged: 2026-03-31 02:20 (UTC+8))
- #38265 [Refactor] Consolidate Tool type alias in tool_parsers/utils.py — ready,tool-calling — by sfeng33 (merged: 2026-03-31 08:55 (UTC+8))
- #36070 [Bugfix][DCP] Fix CUDA graph capture for Decode Context Parallelism — bug,ready,v1,nvidia — by sungsooha (merged: 2026-03-31 08:20 (UTC+8))
- #38589 Add @vadiklyutiy as committer — ready,ci/build — by vadiklyutiy (merged: 2026-03-31 07:50 (UTC+8))
- #36466 feat(attention): extract KV-cache update from FlashAttentionDiffKV ba… — ready,v1 — by Prathmesh234 (merged: 2026-03-31 07:16 (UTC+8))
- #38567 Restore non-hf processor path for Nano-Nemotron-VL (bypass `call_hf_processor_mm_only`) - fixes #38018 — ready,nvidia — by netanel-haber (merged: 2026-03-31 05:56 (UTC+8))
- #38478 [Bug fix][Quantization] Fix dummy weight loading — bug,ready,quantization — by Josephasafg (merged: 2026-03-31 04:38 (UTC+8))
- #35431 [Bugfix] Use null block (0) for padded block table entries — bug,ready,v1,ready-run-all-tests — by SandishKumarHN (merged: 2026-03-31 05:02 (UTC+8))
- #38381 [ROCm][CI] Pin test_hybrid test to TRITON_ATTN on ROCm — rocm,ready — by micah-wil (merged: 2026-03-31 04:26 (UTC+8))
- #36261 [EPLB] Optmize eplb mapping and record in router for prefill — ready — by ilmarkov (merged: 2026-03-31 03:48 (UTC+8))
- #36847 [Feat][Spec Decode] DFlash — new-model,speculative-decoding,ready,v1,qwen,nvidia — by benchislett (merged: 2026-03-31 03:03 (UTC+8))
- #38562 [Bugfix][MLA] Change default SM100 MLA prefill backend back to TRT-LLM — bug,ready — by MatthewBonanni (merged: 2026-03-31 00:51 (UTC+8))
- #35862 [Refactor] Unify engine process monitoring in engine manager and add Ray backend support — frontend,ready,v1 — by fangyuchu (merged: 2026-03-31 01:16 (UTC+8))
- #37123 [Core][CI] Add opt-in media URL caching via VLLM_MEDIA_CACHE — rocm,ready,multi-modality — by AndreasKaratzas (merged: 2026-03-30 19:58 (UTC+8))
- #37467 [HMA]Move hybrid blksize to update_block_size_for_backend to fix attn supported block size is not 16 issue — intel-gpu,ready,v1 — by xuechendi (merged: 2026-03-31 00:47 (UTC+8))
- #38423 [NVIDIA] Bugfix NVFP4 DGX Spark and RTX50 — bug,ready,ci/build,nvidia,ready-run-all-tests — by johnnynunez (merged: 2026-03-31 00:36 (UTC+8))
- #35753 [Mamba] Add stochastic rounding support — ready,ci/build,nvidia — by roikoren755 (merged: 2026-03-31 00:33 (UTC+8))
- #37698 [ROCm][Bugfix] fix exception related to trust_remote_code for MiniMax-M2.1-MXFP4 — bug,rocm,ready,cpu — by hongxiayang (merged: 2026-03-30 23:49 (UTC+8))
- #37291 [Bugfix] Handle ParallelLMHead in compressed-tensors get_quant_method — bug,ready,ready-run-all-tests,quantization — by mgehre-amd (merged: 2026-03-30 22:30 (UTC+8))
- #38547 [Misc] Add @tomeras91 as a maintainer of Nemotron related code + mamba block — ci/build — by tomeras91 (merged: 2026-03-30 21:12 (UTC+8))
- #38255 [Bugfix] Remove false-positive format mismatch warnings in FLA ops — bug,ready — by tdoublep (merged: 2026-03-30 20:32 (UTC+8))
- #38535 [Bugfix][CPU] Skip set_num_threads after thread binding — bug,ready,ci/build,v1,cpu — by bigPYJ1151 (merged: 2026-03-30 20:13 (UTC+8))
- #37236 Fix ambiguous num_blocks for hybrid attn mamba — ready,v1 — by collinmccarthy (merged: 2026-03-30 19:09 (UTC+8))
- #38253 [Bugfix][Frontend] Return 400 for corrupt/truncated image inputs instead of 500 — bug,ready,multi-modality — by aliialsaeedii (merged: 2026-03-30 18:26 (UTC+8))
- #38158 [Bugfix] Fix shared-object aliasing in n>1 streaming with tool calls — bug,frontend,ready — by yzong-rh (merged: 2026-03-30 18:12 (UTC+8))
- #38270 [Mamba][Bugfix] Raise on insufficient cache blocks instead of silently capping cudagraph sizes — bug,ready,v1,nvidia — by NickLucche (merged: 2026-03-30 17:41 (UTC+8))
- #38457 [ROCm] [DOC] Update the Documentation to include ROCm Nightly Wheel support — documentation,rocm,ready — by tjtanaa (merged: 2026-03-30 17:25 (UTC+8))
- #38495 [CI] Fix SPLADE pooler test broken by #38139 — ready — by haosdent (merged: 2026-03-30 15:48 (UTC+8))
- #37529 [ROCm] Enable MORI EP for unquantized MoE with AITER backend — rocm,ready — by pinsiangamd (merged: 2026-03-30 15:19 (UTC+8))
- #38482 (security) Fix SSRF in batch runner download_bytes_from_url — documentation,frontend,ready — by jperezdealgaba (merged: 2026-03-30 15:10 (UTC+8))
- #36963 [Bugfix][Model] Fix PixtralForConditionalGeneration LoRA — bug,ready — by jeejeelee (merged: 2026-03-30 14:59 (UTC+8))
- #36965 [Model][Quantization] Add GGUF support for MiniMax-M2.1 — ready — by JoursBleu (merged: 2026-03-30 14:24 (UTC+8))
- #38505 [ci] Soft fail and disable retry for AMD build image job — rocm,ready,ci/build — by khluu (merged: 2026-03-30 14:05 (UTC+8))
- #38329 [MoE] Add RoutingMethodType.Simulated to TRT-LLM FP8/NVFP4 kernel allowlists — ready,nvidia — by jaewonlee-fb (merged: 2026-03-30 13:53 (UTC+8))
- #38487 [Misc] Always use `forward_mulmat` for `Conv3d` on newer versions of torch. — ready — by ywang96 (merged: 2026-03-30 13:39 (UTC+8))
- #38492 [CI] Add temperature=0.0, reduce max_tokens, and add debug prints to audio_in_video tests — rocm,ready — by AndreasKaratzas (merged: 2026-03-30 13:36 (UTC+8))
- #38497 Add @ZJY0516 to CODEOWNERS — ready,ci/build — by ZJY0516 (merged: 2026-03-30 12:10 (UTC+8))
- #33703 [Bugfix] Support multi-type params parsing for DeepSeek v3.2 — bug,ready,tool-calling,deepseek — by kizill (merged: 2026-03-30 12:07 (UTC+8))
PRs Closed Without Merging
- #38542 [Bugfix] Use importlib.import_module to resolve standalone_compile module — bug — by LJL36 (closed: 2026-03-31 11:25 (UTC+8))
- #38581 [Feature] Add LoRA support for Qwen3ASRForConditionalGeneration — qwen — by Shreyansh1729 (closed: 2026-03-31 10:44 (UTC+8))
- #28459 when model_runner not ensure_kv_transfer_shutdown this method, it wil… — stale,v1 — by lengrongfu (closed: 2026-03-31 10:19 (UTC+8))
- #29047 Online Rotations to vLLM — documentation,rocm,needs-rebase,ci/build,stale,llama — by gametekker (closed: 2026-03-31 10:19 (UTC+8))
- #38346 [ROCM] Optmize redudent d2d copy of moe. — documentation,performance,new-model,rocm,frontend,intel-gpu,speculative-decoding,ci/build,v1,multi-modality — by benenzhu (closed: 2026-03-31 09:42 (UTC+8))
- #38593 Revert "feat(attention): extract KV-cache update from FlashAttentionDiffKV ba… (#36466)" — v1 — by vllm-agent (closed: 2026-03-31 08:42 (UTC+8))
- #38548 chore: wip kimi tool call rewrite — tool-calling — by felixmr1 (closed: 2026-03-31 05:20 (UTC+8))
- #38569 Move --grpc arg into make_arg_parser — rocm,frontend,needs-rebase,ci/build — by CatherineSue (closed: 2026-03-31 01:07 (UTC+8))
- #37910 Making spec decode testing nightly — ci/build — by puririshi98 (closed: 2026-03-31 00:58 (UTC+8))
- #38188 Bump flashinfer version to 0.6.7 — ci/build,nvidia,ready-run-all-tests — by wzhao18 (closed: 2026-03-30 23:28 (UTC+8))
- #36957 [NIXL][Mamba][1/N] Heterogeneous TP: full conv transfer + local extract — needs-rebase,v1,kv-connector — by ZhanqiuHu (closed: 2026-03-30 22:52 (UTC+8))
- #37603 [NIXL][Mamba][2/N] Heterogeneous TP : chunk-interleaved layout — kv-connector — by ZhanqiuHu (closed: 2026-03-30 22:52 (UTC+8))
- #34338 [Bugfix][CI] Fix spawned tests — bug,ready — by NickLucche (closed: 2026-03-30 22:35 (UTC+8))
- #38402 [Tests][Bugfix] Fix race condition in test_abort_final_step.py — bug,v1 — by SandishKumarHN (closed: 2026-03-30 21:22 (UTC+8))
- #30735 add Qwen3OmniMoeAudioEncoder and support torch compile — stale,qwen — by XiaobingSuper (closed: 2026-03-30 20:02 (UTC+8))
- #24941 [Frontend] Skip `stop` in reasoning content — ready,needs-rebase,stale,v1,tool-calling — by chaunceyjiang (closed: 2026-03-30 17:58 (UTC+8))
- #38528 [Benchmark] Fix Dataset infinite loop when input_len is too small — performance — by WJYuuuu (closed: 2026-03-30 17:12 (UTC+8))