vLLM Development Activity Report - 2026-03-19

Time window: 2026-03-19 11:26 (UTC+8) ~ 2026-03-20 11:26 (UTC+8)
Statistics: New Issues 25 | Closed Issues 22 | New PRs 100 | Merged PRs 55 | PRs Closed Without Merging 24

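📊 Daily Development Summary

During the window of March 19 to 20, 2026, the vLLM project remained highly active: 100 new PRs were opened and 55 PRs were merged. Work centered on performance optimization, bug fixes, and continued strengthening of the AMD ROCm ecosystem. The community reported several significant bugs involving popular models such as GLM-4.7-FP8 and Qwen3.5, while AMD contributors responded quickly and resolved a number of core issues on the ROCm platform.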

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was heavy this cycle, mainly in bug fixes, performance optimization, and new feature support.

1. Key bug fixes:

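Issue #37548 & PR #37606: User vllmellm reported a compilation failure on ROCm when using the ROCM_AITER_UNIFIED_ATTN backend, caused by a cache block size mismatch (16 vs. 64). AMD engineer divakar-amd quickly tracked down the problem and fixed the configuration logic in PR #37606, so that the backend uses the correct cache block size (64) whether it is enabled through an environment variable or a CLI argument. The PR has been merged.
PR #37547: Submitted by gronsti-amd, this fixes lru_cache being ineffective in the DeepSeek V3 MLA sparse attention implementation. The decorated function was redefined on every call, so its cache was always empty and the module was imported repeatedly. The fix removes that repeated import overhead.
Issue #37596: A ROCm CI failure in the Qwen language model test (PPL). The error is HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION, which points to an invalid memory access. Investigation is ongoing.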
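The lru_cache pitfall described for PR #37547 can be reproduced in a few lines of plain Python. This is only an illustration of the general mechanism, not the vLLM code: `expensive_load`, `lookup_broken`, and `lookup_fixed` are hypothetical names, and a call counter stands in for the costly import.

```python
from functools import lru_cache

calls = {"count": 0}


def expensive_load(name):
    """Hypothetical stand-in for an expensive module import."""
    calls["count"] += 1
    return f"module:{name}"


def lookup_broken(name):
    # A new decorated function, with a new empty cache, is created on
    # every call, so the cache never produces a hit.
    @lru_cache(maxsize=None)
    def _load(n):
        return expensive_load(n)

    return _load(name)


@lru_cache(maxsize=None)
def lookup_fixed(name):
    # Decorated once at module level: the cache persists across calls.
    return expensive_load(name)


for _ in range(3):
    lookup_broken("example")
broken_calls = calls["count"]

calls["count"] = 0
for _ in range(3):
    lookup_fixed("example")
fixed_calls = calls["count"]

print(broken_calls, fixed_calls)  # prints: 3 1
```

The broken variant pays the load cost on each of the three calls, while the module-level variant pays it once.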

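2. Performance and feature optimizations:

PR #37539: Submitted by xaguilar-amd, this speeds up the MLA decode path of the ROCm AITER backend. Allocating the output tensor with torch.empty instead of torch.zeros removes a redundant GPU zero-fill kernel launch and lowers decode latency.
PR #37533: Submitted by aaab8b, this fixes sleep mode on ROCm not actually releasing physical GPU memory. After hipMemRelease, it loops over cuMemAddressFree and cuMemAddressReserve calls to force the driver to release the physical pages, which resolves the memory leak.
PR #37529: Submitted by pinsiangamd, this fixes an incompatibility between the MORI expert-parallel backend and the AITER expert backend for unquantized (BF16) MoE models on ROCm. Previously the combination degraded silently (no error, but lower output quality); with the PR it works correctly.
PR #37418 (merged): Submitted by Duyi-Wang, this fixes the compatibility of MoRI (Modular RDMA Interface) with AITER FP8 quantized dispatch, ensuring that FP8 quantization is applied correctly in MoRI's dispatch stage.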

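3. New features/integrations:

PR #37515 & #37513: Submitted by khairulkabir1661, these aim to add AITER fused kernel support for DeepSeek MLA attention, fusing multiple operations on the decode and prefill paths to improve performance. #37513 was closed without merging minutes after it was opened; #37515 remains open and needs a rebase.
PR #37542 (merged): To speed up CI, this splits the extended multimodal pooling tests out so they run separately from the other tests.

Summary: The AMD team was very active this cycle. Beyond responding quickly to user reports, it kept optimizing and fixing hard areas such as memory management, kernel performance, and expert parallelism, improving vLLM's stability and performance on ROCm.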

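💬 Notable Discussions

No discussion drew an exceptionally high comment count this cycle, but the following threads are representative:

1. Issue #37590: HTTP connections dropped silently

Core issue: User DKingAlpha reported that when a prompt contains a large table, the vLLM server sometimes drops the client connection without logging anything.
Exchange:
- Another user, malaiwah, could not reproduce the problem on a similar setup and shared the details of their test configuration.
- The reporter then described the deployment in more detail (Docker Compose plus a reverse proxy) and debugged it with malaiwah's help.
Outcome: Step-by-step troubleshooting traced the root cause to a misconfiguration in the user's own reverse proxy, not to vLLM. The reporter closed the issue and thanked the helper. The thread is a good example of community members troubleshooting together.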

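2. PR #37522: Compatibility with older build environments

Core issue: panpan0000 submitted a PR adding a backward-compatibility check for the newly added .out operator overload, to keep new Python code from crashing against an older compiled build.
Exchange:
- For compatibility: the submitter argued that a hasattr check makes the code more robust and avoids crashes when the Python code and the C extension are at different versions.
- Against: maintainer ProExpertProg held that a build that does not match the Python code should not be supported, and that users should simply build the new version. If anything is needed, he suggested designing a better mismatch error-reporting system instead of adding runtime checks.
Current status: The PR was closed after the maintainer's explicit objection. This reflects the project's trade-off between development speed and long-term compatibility; for now it leans toward the former.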
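The two positions in the PR #37522 discussion can be contrasted in a short sketch. Everything here is hypothetical: `old_ops` stands in for an older compiled extension, and `quant` / `quant_out` are made-up op names, not vLLM symbols. The first function shows the per-call hasattr fallback; the second shows a single up-front check that reports the mismatch clearly.

```python
from types import SimpleNamespace

# Hypothetical stand-in for an older compiled extension that lacks the
# newer `quant_out` entry point. None of these names come from vLLM.
old_ops = SimpleNamespace(quant=lambda x: [v * 2 for v in x])


def quant_with_fallback(ops, x):
    """Runtime-check approach: probe with hasattr and fall back."""
    if hasattr(ops, "quant_out"):
        out = [0] * len(x)
        ops.quant_out(x, out)
        return out
    return ops.quant(x)


def check_build_matches(ops, required=("quant", "quant_out")):
    """Fail-fast approach: report the mismatch once, at startup."""
    missing = [name for name in required if not hasattr(ops, name)]
    if missing:
        raise RuntimeError(
            "compiled extension is older than the Python code; "
            f"missing ops: {', '.join(missing)}. Rebuild the extension."
        )


fallback_result = quant_with_fallback(old_ops, [1, 2])

try:
    check_build_matches(old_ops)
    mismatch_message = ""
except RuntimeError as exc:
    mismatch_message = str(exc)

print(fallback_result)
print(mismatch_message)
```

The fallback keeps old builds running but adds a branch to every call site; the startup check keeps call sites clean and turns a version mismatch into one explicit error.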

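3. Issue #28572 (closed): Multi-GPU device compatibility

Core issue: vLLM failed to serve on a system containing GPUs with different compute capabilities (an RTX 2080 Ti and an RTX 3090).
Discussion: The user and maintainers went through several possibilities, including isolating devices with CUDA_VISIBLE_DEVICES, checking how environment variables are propagated, and the compatibility of the model's data type (MXFP4).
Point of contention: The problem was never fully resolved. The user argued that vLLM should handle environments with incompatible GPUs more intelligently, or at least give clearer error guidance.
Final status: The issue was marked "stale" after a long period without progress and closed automatically. Problems like this show how difficult it is to mix GPU generations in a production deployment.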

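🔥 Hot Topics and Trends

Bug reports cluster around CUDA illegal memory access and MLA: several issues report cudaErrorIllegalAddress crashes in scenarios that include changing max-num-batched-tokens, enabling throughput mode, and using MTP speculative decoding. Crashes of this kind usually stem from defects in low-level kernels or memory management under specific boundary conditions. Reports of MLA attention backend selection failures also appeared.
GLM-4.7-FP8 problems stand out: user Xarbirus filed several crash reports in a row for zai-org/GLM-4.7-FP8 under different configuration parameters. This suggests that vLLM's support for the model still has stability problems, and it is currently a focus of testing and fixes.
Qwen3.5 support keeps maturing: there was a lot of discussion and fixing around Qwen3.5, including a multi-GPU initialization error, tool parser support, and a Marlin quantization compatibility fix for GatedDeltaNet layers. The series is widely used, and its complex structure (MoE, GDN) places high demands on the inference engine.
MoE (mixture of experts) performance and correctness: on both ROCm and CUDA there were discussions of performance regressions (such as MTP in graph mode running slower than without MTP) and kernel bugs (such as a routing error in the FlashInfer TRTLLM monolithic kernel that caused zero accuracy). MoE inference remains a technical pain point and an optimization priority.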

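🛠️ Key Technical Changes

PR #37606: [ROCm][Bugfix] fix cache block size mismatch for aiter unified attention
- Technical notes: Fixes the underlying logic that left the cache block size inconsistent for the ROCm AITER unified attention backend. Unified attention needs a block size of 64 for best performance, but when the backend was selected with the --attention-config argument, the order of configuration updates meant the default of 16 was still used.
- Impact: Keeps the backend's performance stable and avoids the compilation failure the mismatch caused.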
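The ordering problem described for PR #37606 can be sketched in plain Python. Only the backend name and the sizes 16 and 64 come from the report; the `Config` class, the function names, and the resolution steps are hypothetical and do not mirror vLLM's configuration code.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_BLOCK_SIZE = 16
# Backend name and required size are taken from the report; the rest of
# this sketch (class, function names, resolution order) is hypothetical.
BACKEND_BLOCK_SIZE = {"ROCM_AITER_UNIFIED_ATTN": 64}


@dataclass
class Config:
    backend: Optional[str] = None
    block_size: Optional[int] = None


def resolve_block_size_first(cfg, cli_backend):
    """Buggy order: the block size is fixed before the CLI backend is applied."""
    cfg.block_size = BACKEND_BLOCK_SIZE.get(cfg.backend, DEFAULT_BLOCK_SIZE)
    cfg.backend = cli_backend or cfg.backend
    return cfg


def resolve_backend_first(cfg, cli_backend):
    """Fixed order: settle the backend, then derive the block size from it."""
    cfg.backend = cli_backend or cfg.backend
    cfg.block_size = BACKEND_BLOCK_SIZE.get(cfg.backend, DEFAULT_BLOCK_SIZE)
    return cfg


buggy = resolve_block_size_first(Config(), "ROCM_AITER_UNIFIED_ATTN")
fixed = resolve_backend_first(Config(), "ROCM_AITER_UNIFIED_ATTN")
print(buggy.block_size, fixed.block_size)  # prints: 16 64
```

Deriving dependent settings only after every source of the backend choice has been applied gives the same result regardless of how the backend was selected.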
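PR #37539: [Performance] Remove unnecessary zero-fill of MLA decode output tensor in Aiter backend
- Technical notes: Changes the initialization of the MLA decode output tensor from torch.zeros to torch.empty. The AITER kernel that follows unconditionally overwrites the whole tensor, so the preceding zero-fill is a redundant GPU kernel operation.
- Impact: Saves one vectorized_elementwise kernel launch per layer per decode step, directly lowering decode latency for MLA models on ROCm. A typical low-level performance optimization.
PR #37533: [ROCm] fix sleep mode not releasing GPU memory problem on ROCm
- Technical notes: Exposes a behavioral difference in the HIP runtime: hipMemRelease does not free physical VRAM while the virtual address range is still reserved. The PR works around the driver limitation by forcing a loop of address free and re-reserve calls.
- Impact: Fixes sleep mode's "fake release" of memory on ROCm, so GPU memory can actually be reused by other applications (such as RL training frameworks), improving resource utilization.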
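PR #36056 (merged): [Bugfix] Fix Deepseekv32 tool parser when stream interval > 1
- Technical notes: Reworks the streaming logic of the DeepSeek-V3.2 tool call parser. The old version parsed character by character with a complex state machine, which misparsed tags when stream_interval > 1. The new version buffers tokens until a complete DSML block is available and then parses it in one pass.
- Impact: Fixes tool call parsing errors at larger streaming intervals at the root, making streamed output from DeepSeek-V3.2 more reliable.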
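The buffer-then-parse idea behind PR #36056 can be sketched as follows. This is not the vLLM parser: the `<call>` and `</call>` markers are hypothetical placeholders rather than the real DSML tags, and `BlockBuffer` is a made-up name. For brevity the sketch also drops plain text outside blocks, which a real parser would forward to the client.

```python
class BlockBuffer:
    """Accumulate streamed text and release only complete blocks."""

    def __init__(self, start="<call>", end="</call>"):
        # Hypothetical markers; the real DSML tags are not reproduced here.
        self.start = start
        self.end = end
        self.pending = ""

    def feed(self, chunk):
        """Add a chunk; return the bodies of any blocks completed so far."""
        self.pending += chunk
        blocks = []
        while True:
            begin = self.pending.find(self.start)
            if begin == -1:
                break
            finish = self.pending.find(self.end, begin + len(self.start))
            if finish == -1:
                break  # block still incomplete: keep buffering
            blocks.append(self.pending[begin + len(self.start):finish])
            self.pending = self.pending[finish + len(self.end):]
        return blocks


text = 'hi <call>{"name": "get_weather"}</call> bye'


def run(chunk_size):
    """Stream `text` in fixed-size chunks and collect completed blocks."""
    buf = BlockBuffer()
    found = []
    for i in range(0, len(text), chunk_size):
        found.extend(buf.feed(text[i:i + chunk_size]))
    return found


results = [run(size) for size in (1, 3, 7, len(text))]
print(results)
```

Because a block is parsed only once its closing marker has arrived, the result does not depend on where chunk boundaries fall, which is the property the old per-character state machine lacked at larger stream intervals.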

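📈 Development Activity Observations

Strong AMD contributions: AMD engineers including divakar-amd, gronsti-amd, and xaguilar-amd were very active this cycle and submitted several key fix and optimization PRs, showing AMD's growing investment in the vLLM ROCm backend.
High community engagement: Most of the 25 new issues are detailed user bug reports with reproduction steps and logs, a sign of a mature user community that is willing to contribute. Many issues received a response or a fix the same day.
High merge throughput: 55 PRs were merged in the window against 100 newly opened (the merged set includes PRs opened before the window), among them several key bug fixes. This points to an efficient review and merge process among the core maintainers.
CI/CD keeps running: Several CI failures were reported and tracked (such as the mi355_1 test failure), showing that the automated test system keeps running and surfacing problems, which helps maintain code quality.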

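💡 Issues to Watch

GLM-4.7-FP8 stability: Several separate issues report CUDA illegal memory access crashes for this model under different configurations (#37587, #37598, #37599, #37570). Core developers may need to investigate systematically whether a particular kernel or memory management strategy is incompatible with the model.
Qwen3.5 multi-GPU initialization error: Issue #37623 reports that Qwen3.5-122B-A10B-FP8 fails during multi-GPU (TP=2) initialization with the error "Device does not support multicasting". It may be related to the creation of the FlashInfer allreduce fusion workspace, and it affects large-model deployment.
FlashInfer TRTLLM MoE backend routing bug: Issue #37591 and PR #37605 show that the FlashInfer TRTLLM monolithic MoE kernel selects the wrong experts when all routing logits are negative, which makes Qwen3.5 FP8 models produce completely wrong output (0% accuracy). The problem has been mitigated by temporarily disabling the affected routing method; a root-cause fix has to wait for an upstream update.
CPU backend crash in KV cache block zeroing: PR #37550 reports and fixes a CPU backend crash caused by calling a GPU Triton zeroing kernel. It is a reminder that new features in a multi-backend project must account for every backend.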
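To make the input condition in Issue #37591 concrete, here is a plain-Python reference for top-k routing followed by renormalization, run on all-negative logits. It is a sketch of the expected behavior only, assuming that "Renormalize" means top-k selection followed by a softmax over the selected logits; it is not FlashInfer or vLLM code, and `route_topk_renormalize` is a hypothetical name.

```python
import math


def route_topk_renormalize(logits, k):
    """Pick the k largest logits, then softmax over just those k."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = order[:k]
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]  # stable softmax
    total = sum(exps)
    return chosen, [e / total for e in exps]


all_negative = [-3.0, -0.5, -2.0, -1.0]
experts, weights = route_topk_renormalize(all_negative, k=2)
print(experts, weights)
```

All-negative logits are a valid input: the ranking is still well defined, so the two least negative experts are chosen and their weights sum to 1.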

📋 Appendix: Detailed Data Lists

New Issues

#37608 [Bug]: AttributeError: ‘Parameter’ object has no attribute ‘load_qkv_weight’ | bug | by S1ro1 (created: 2026-03-20 05:55 (UTC+8))
#37623 [Bug]: Qwen3.5-122B-A10B-FP8 multi GPU issue(tensor-parallel-size) | bug | by jungmoklee (created: 2026-03-20 09:06 (UTC+8))
#37625 MTP and Distributed MoE Speculative Decoding Degradation | no labels | by glaziermag (created: 2026-03-20 09:17 (UTC+8))
#37624 MLA and APC Conflict | no labels | by glaziermag (created: 2026-03-20 09:17 (UTC+8))
#37543 [Bug]: 推理vllm，出现如下报错，KeyError：residual (English: vLLM inference fails with KeyError: residual) | bug | by kdywt (created: 2026-03-19 17:52 (UTC+8))
#37618 [Bug]: FP8 kv cache on b200 with qwen3.5 has degraded accuracy | bug | by robertgshaw2-redhat (created: 2026-03-20 08:15 (UTC+8))
#37548 [Bug][ROCm]: Aiter unified attention fails during compilation | bug,rocm | by vllmellm (created: 2026-03-19 18:30 (UTC+8))
#37602 [Bug]: Qwen3.5-122B-A10B-FP8 EngineCore crash on concurrent image requests | no labels | by serdarildercaglar (created: 2026-03-20 05:01 (UTC+8))
#37590 [Bug]: http.client.RemoteDisconnected: Remote end closed connection without response | bug | by DKingAlpha (created: 2026-03-20 02:53 (UTC+8))
#37591 [Bug]: FlashInfer TRTLLM monolithic MoE produces 0% accuracy for Qwen3.5-35B/122B FP8 | bug,qwen,nvidia | by vadiklyutiy (created: 2026-03-20 03:09 (UTC+8))
#37599 [Bug]: cudaErrorIllegalAddress crash when running zai-org/GLM-4.7-FP8 with --max-num-batched-tokens < default (e.g. 4K) under | bug | by Xarbirus (created: 2026-03-20 04:47 (UTC+8))
#37598 [Bug]: cudaErrorIllegalAddress crash when running zai-org/GLM-4.7-FP8 with --max-num-batched-tokens > default (e.g. 16K) under load | bug | by Xarbirus (created: 2026-03-20 04:42 (UTC+8))
#37596 [CI Failure]: mi355_1: Language Models Test (PPL) | rocm,ci-failure | by AndreasKaratzas (created: 2026-03-20 04:15 (UTC+8))
#37587 [Bug]: cudaErrorIllegalAddress crash when enabling --performance-mode throughput for zai-org/GLM-4.7-FP8 under load | bug | by Xarbirus (created: 2026-03-20 02:31 (UTC+8))
#37581 [Bug]: `/v1/chat/completions/render` crashes for Qwen/Qwen3-ASR-0.6B multimodal audio, and chat audio returns empty/junk | bug | by peregilk (created: 2026-03-20 01:03 (UTC+8))
#37570 [Bug]: CUDA ILM (Illegal Memory Access) crash when enabling MTP num_speculative_tokens with >1 for zai-org/GLM-4.7-FP8 under load | bug | by Xarbirus (created: 2026-03-19 23:08 (UTC+8))
#37563 mm_fp4 trtllm backend leaks padding scales into real rows (use_8x4_sf_layout=True) | no labels | by elvircrn (created: 2026-03-19 21:57 (UTC+8))
#37554 [Bug] --calculate-kv-scales produces corrupted FP8 KV cache on hybrid GDN+Attention models (Qwen3.5) | no labels | by daudo (created: 2026-03-19 19:00 (UTC+8))
#37558 “vLLM-deployed Qwen3.5 with Reasoning Parser Shows Empty reasoningContent in Spring AI OpenAI Model” | bug | by lhc75 (created: 2026-03-19 19:51 (UTC+8))
#37546 [Bug]: CPU backend crashes with TypeError: 'function' object is not subscriptable on first inference request | bug,cpu | by fyuan1316 (created: 2026-03-19 18:07 (UTC+8))
#37553 [Bug]: Mistral-Small-4-119B-2603 fails on 8x RTX 3090 (SM 8.6) with vLLM v0.17.1: no valid MLA attention backend | bug | by Martossien (created: 2026-03-19 18:57 (UTC+8))
#37551 [Bug] vLLM 0.17.1: zai-org/GLM-OCR has mtp_graph < no_mtp_graph despite high acceptance | bug | by AlejandroBaron (created: 2026-03-19 18:56 (UTC+8))
#37527 [Performance]: Is SamplingParams support set enable_thinking? | performance | by lcedaw (created: 2026-03-19 15:39 (UTC+8))
#37525 [Bug]: 启动qwen3.5-35B后反复kill进程：vllm.v1.engine exceptions enginedeaderror enginecore encountered an issue (English: process is repeatedly killed after starting qwen3.5-35B) | bug | by promise-hash (created: 2026-03-19 15:20 (UTC+8))
#37520 [Bug]: [V1 Engine] Segfault / NCCL init failure when running 4 GPUs across NUMA nodes (v0.17.0) | bug | by xkkx123 (created: 2026-03-19 14:20 (UTC+8))

Closed Issues

#31043 [BugFix]: move torch.Size across graphs in split_graph | help wanted,feature request,torch.compile | by BoyuanFeng (closed: 2026-03-20 10:55 (UTC+8))
#20223 [Feature]: Any plans to support TokenWeave optimizations in vLLM? | feature request,stale | by solitude-s (closed: 2026-03-20 10:21 (UTC+8))
#22497 [Bug]: Crashed when loading ggml quantized Gemma3 | bug,stale | by Imagium719 (closed: 2026-03-20 10:21 (UTC+8))
#22533 [Bug]: gpt-oss-20b flaky BadRequest 400 | bug,stale | by simonfl (closed: 2026-03-20 10:21 (UTC+8))
#23603 [Feature]: Log prompt for gpt-oss | feature request,stale,gpt-oss | by beom115 (closed: 2026-03-20 10:21 (UTC+8))
#24140 [Bug]: Deepseek V3.1 tool_choice=required，输出混乱 (English: garbled output) | bug,stale | by WangJianQ-0118 (closed: 2026-03-20 10:21 (UTC+8))
#24147 [Bug]: model failure for OpenGVLab/InternVL3-38B-hf | bug,stale | by thesillystudent (closed: 2026-03-20 10:21 (UTC+8))
#28572 [Bug]: vllm refuses to serve LLM in the presence of multiple GPUs | bug,stale | by tigran123 (closed: 2026-03-20 10:20 (UTC+8))
#37625 MTP and Distributed MoE Speculative Decoding Degradation | no labels | by glaziermag (closed: 2026-03-20 09:22 (UTC+8))
#37624 MLA and APC Conflict | no labels | by glaziermag (closed: 2026-03-20 09:22 (UTC+8))
#37618 [Bug]: FP8 kv cache on b200 with qwen3.5 has degraded accuracy | bug | by robertgshaw2-redhat (closed: 2026-03-20 08:16 (UTC+8))
#37548 [Bug][ROCm]: Aiter unified attention fails during compilation | bug,rocm | by vllmellm (closed: 2026-03-20 08:00 (UTC+8))
#37444 Regression in nightly: AttributeError ‘MergedColumnParallelLinear’ has no attribute ‘weight’ with Qwen3.5-9B | no labels | by jhsmith409 (closed: 2026-03-20 07:21 (UTC+8))
#37590 [Bug]: http.client.RemoteDisconnected: Remote end closed connection without response | bug | by DKingAlpha (closed: 2026-03-20 04:45 (UTC+8))
#37471 [Bug]: Accuracy issue running Model Runner V2 with Qwen3.5 | bug | by yewentao256 (closed: 2026-03-20 03:33 (UTC+8))
#29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) | ci-failure | by AndreasKaratzas (closed: 2026-03-20 02:06 (UTC+8))
#32972 [CI Failure]: mi325_8: Kernels Attention Test %N | ci-failure | by AndreasKaratzas (closed: 2026-03-20 02:01 (UTC+8))
#31631 [CI Failure]: mi325_1: V1 Test others | ci-failure | by AndreasKaratzas (closed: 2026-03-20 02:01 (UTC+8))
#29536 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 | ci-failure | by AndreasKaratzas (closed: 2026-03-20 02:00 (UTC+8))
#37397 [Bug]: Kimi-K2.5 chat completion doesn’t return any reasoning content | bug | by wizche (closed: 2026-03-19 21:08 (UTC+8))
#36773 [Bug]: Qwen3.5-35B-A3B on B200 with vllm v0.17.0 output random result | bug | by elepherai (closed: 2026-03-19 20:18 (UTC+8))
#30848 [Bug]: Fail to run Qwen3-Next model. | bug,stale | by MaoJianwei (closed: 2026-03-19 11:51 (UTC+8))

New PRs

#37634 [XPU] Automatically detect target platform as XPU in build. | ready,ci/build | by ccrhx4 (created: 2026-03-20 11:10 (UTC+8))
#37565 [Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention… | bug | by Young-Leo (created: 2026-03-19 22:04 (UTC+8))
#37537 [Model] Deprecate the score task (this will not affect users). | documentation,frontend,ready,v1 | by noooop (created: 2026-03-19 16:07 (UTC+8))
#37612 [V0 Deprecation] Deprecate --disable-frontend-multiprocessing | performance,frontend,ready,v1 | by sfeng33 (created: 2026-03-20 07:25 (UTC+8))
#37633 [bug fix] not override “partial_rotary_factor” if it is in model config | bug | by xgwang (created: 2026-03-20 11:10 (UTC+8))
#37632 always use embed&token_classify for bge-m3 | no labels | by staugust (created: 2026-03-20 10:38 (UTC+8))
#37531 [bugfix] Outputting full results for long audio files in stream_api_response() mode | bug,documentation | by AllenDou (created: 2026-03-19 15:48 (UTC+8))
#37603 [NIXL][Mamba][2/N] Heterogeneous TP : chunk-interleaved layout | kv-connector | by ZhanqiuHu (created: 2026-03-20 05:17 (UTC+8))
#37523 fix(xpu): Re-compute compile ranges after platform-specific config updates | ready,v1 | by Liangyx2 (created: 2026-03-19 14:47 (UTC+8))
#37630 [Bugfix] Add early detection for CUDA < 13.0 on sm_103+ GPUs (GB300) | bug,nvidia | by qiching (created: 2026-03-20 09:55 (UTC+8))
#37588 [Model Runner V2] Add full/piecewise cuda graph support for eagle pre… | v1,nvidia | by TheEpicDolphin (created: 2026-03-20 02:39 (UTC+8))
#37631 [Bugfix] JAIS: Only apply ALiBi when position_embedding_type is alibi | no labels | by simpx (created: 2026-03-20 09:57 (UTC+8))
#37628 [Bugfix] Handle None compilation times in v1 Executor initialize_from_config | bug,v1 | by ec-jt (created: 2026-03-20 09:44 (UTC+8))
#37629 [Bugfix] Fix EAGLE3+multimodal+async crash: clamp -1 placeholders in embedding paths | bug,speculative-decoding,v1 | by haosdent (created: 2026-03-20 09:52 (UTC+8))
#37620 reshape instead of view in FP8ScaledMMLinearKernel | no labels | by krishna-kylist (created: 2026-03-20 08:42 (UTC+8))
#37627 [Bugfix] Handle None compilation times in v1 Executor initialize_from_config | bug,v1 | by ec-jt (created: 2026-03-20 09:31 (UTC+8))
#37571 [UX] Enable torch_profiler_with_stack | documentation | by jeejeelee (created: 2026-03-19 23:11 (UTC+8))
#37626 Replace VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE env var with max_num_batched_tokens | nvidia | by mgoin (created: 2026-03-20 09:23 (UTC+8))
#37622 [Bugfix] Fix Step3 pipeline parallel KeyError for residual tensor | bug | by JMonde (created: 2026-03-20 09:02 (UTC+8))
#37621 [Bugfix] JAIS: Only apply ALiBi when position_embedding_type is alibi | bug | by simpx (created: 2026-03-20 08:50 (UTC+8))
#37610 [ROCm][CI] Reduce image resolution for gemma3 to avoid OOM on MI250 | rocm,ready,multi-modality | by AndreasKaratzas (created: 2026-03-20 06:50 (UTC+8))
#37619 [ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-03-20 08:17 (UTC+8))
#37583 Revert “[docs] Add docs for new RL flows” (#36188) | documentation,needs-rebase,ci/build | by zhewenl (created: 2026-03-20 01:46 (UTC+8))
#37562 [Bugfix] Rebase of #36329: Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 + torch.compile compatibility | bug,needs-rebase,qwen | by kitch2400 (created: 2026-03-19 21:35 (UTC+8))
#37615 [Bugfix] Add bounds check for SHMManager handle in get_singleton_inst… | bug,cpu | by yassha (created: 2026-03-20 07:50 (UTC+8))
#37617 [ROCm][CI] Guard CudaPlatform/RocmPlatform imports to fix test collection on cross-platform builds | rocm,ready,nvidia | by AndreasKaratzas (created: 2026-03-20 08:08 (UTC+8))
#37616 [ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test | rocm,ready | by AndreasKaratzas (created: 2026-03-20 07:57 (UTC+8))
#37606 [ROCm][Bugfix] fix cache block size mismatch for aiter unified attention | bug,rocm,ready,v1 | by divakar-amd (created: 2026-03-20 05:39 (UTC+8))
#37518 [Core] Fix Sampler Eager Allocator OOM via Zero-Tradeoff Pre-Allocation | v1 | by glaziermag (created: 2026-03-19 14:10 (UTC+8))
#37614 [ROCm][CI] Remove deepep DBO tests on gfx90a | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-03-20 07:46 (UTC+8))
#37569 [Refactor] Fix bitsandbytes loader import for pipeline-parallel params | no labels | by ikaadil (created: 2026-03-19 23:07 (UTC+8))
#37613 [ROCm][CI] Fix accuracy for llama-nemotron-vl pooling tests | rocm,ready,multi-modality,llama | by AndreasKaratzas (created: 2026-03-20 07:40 (UTC+8))
#37528 [Model] Add LFM2-ColBERT-350M support | documentation,new-model | by ieBoytsov (created: 2026-03-19 15:39 (UTC+8))
#37573 [Bug] Fix EmbedIOprocessor “classify” <-> “embed” | bug,frontend,ready | by yewentao256 (created: 2026-03-19 23:32 (UTC+8))
#37572 [Refactor] Remove dead code in pooling model | frontend,ready,v1 | by yewentao256 (created: 2026-03-19 23:29 (UTC+8))
#37611 [ROCm][CI] Fix granite_speech test for gfx90a by selecting compatible attention backend | rocm,ready,multi-modality | by AndreasKaratzas (created: 2026-03-20 07:16 (UTC+8))
#37600 [compile] Cache InductorPass uuid | no labels | by angelayi (created: 2026-03-20 04:53 (UTC+8))
#37609 Use lazy graph module during split_module to defer recompile() | no labels | by angelayi (created: 2026-03-20 06:01 (UTC+8))
#37607 [CPU][UX][Perf] Enable tcmalloc by default | ci/build,cpu | by fadara01 (created: 2026-03-20 05:47 (UTC+8))
#37605 [Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing (#37591) | bug,ready,ci/build,qwen,nvidia | by vadiklyutiy (created: 2026-03-20 05:27 (UTC+8))
#37597 [ci] Use fresh cache directory for compile only per test case. | no labels | by zhxchen17 (created: 2026-03-20 04:28 (UTC+8))
#37589 [compile] Add compiled artifact counter for VLLM_USE_MEGA_AOT_ARTIFACT=1. | no labels | by zhxchen17 (created: 2026-03-20 02:51 (UTC+8))
#37604 [compile] Fix aot test failures with torch 2.12. | ready | by zhxchen17 (created: 2026-03-20 05:17 (UTC+8))
#37601 [EPLB] Refactor Async EPLB synchronization logic | needs-rebase | by SageMoore (created: 2026-03-20 04:56 (UTC+8))
#37595 [Refactor] Move serve entrypoint tests under tests/entrypoints/serve/ | ci/build | by sfeng33 (created: 2026-03-20 03:57 (UTC+8))
#37549 [Feature] Rework chunk-based processing with torch.scan | needs-rebase | by bohnstingl (created: 2026-03-19 18:38 (UTC+8))
#37582 Fix EP expert_map init for TPU: avoid dynamic-shape ops on XLA | no labels | by colin2328 (created: 2026-03-20 01:26 (UTC+8))
#37517 Adding DeepEP MoE Test Group. | needs-rebase,ci/build | by Alexei-V-Ivanov-AMD (created: 2026-03-19 13:01 (UTC+8))
#37594 [issue#37343] fix: prevent TTFT regression by adding batched logprobs budget to scheduler | v1 | by Pineberry1 (created: 2026-03-20 03:49 (UTC+8))
#37585 [CI] Removing deprecated rlhf examples reference | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-03-20 02:22 (UTC+8))
#37592 add integration of xqa and fmha through flashinfer | v1,nvidia | by DanBlanaru (created: 2026-03-20 03:12 (UTC+8))
#37593 [Refactor] Relocate entrypoint tests to match serving code structure | ci/build,multi-modality | by sfeng33 (created: 2026-03-20 03:21 (UTC+8))
#37515 [ROCm] Add AITER fused kernel support for DeepSeek MLA attention | rocm,needs-rebase,deepseek | by khairulkabir1661 (created: 2026-03-19 12:34 (UTC+8))
#37568 [Log] Log once in local node by default | ready | by yewentao256 (created: 2026-03-19 22:45 (UTC+8))
#37524 [Hardware][GPU] Profiler config additional to increase it scope and annotation details | v1 | by devalshahamd (created: 2026-03-19 14:56 (UTC+8))
#37586 feat: add IBM POWER8 (ppc64le) CPU backend support | speculative-decoding,ci/build,v1,cpu | by Scottcjn (created: 2026-03-20 02:30 (UTC+8))
#37574 Fix SpeculatorsConfig now that PreTrainedConfig is a dataclass in Transformers | ready | by hmellor (created: 2026-03-19 23:43 (UTC+8))
#37580 Nemotron Nano VL: Streamline pixel shuffle | no labels | by milesial (created: 2026-03-20 00:49 (UTC+8))
#37584 Revert “[BugFix] Correct max memory usage for multiple KV-cache groups” (#36030) | bug,v1 | by zhewenl (created: 2026-03-20 01:47 (UTC+8))
#37579 [Model] Refactor Step3-VL processor to HF style | ready | by DarkLight1337 (created: 2026-03-20 00:48 (UTC+8))
#37567 Run MacOS smoke test on daily cron job instead of every commit | ready,ci/build | by hmellor (created: 2026-03-19 22:42 (UTC+8))
#37535 [P/D] AnthropicMessages add kv_transfer_params for PD disaggregation | frontend,ready,kv-connector | by chaunceyjiang (created: 2026-03-19 16:04 (UTC+8))
#37578 [Bugfix] Fix unclean shutdown from Ctrl-C with AR Fusion | bug | by simpx (created: 2026-03-20 00:39 (UTC+8))
#37577 [Bugfix] Fix unclean shutdown from Ctrl-C with AR Fusion | bug,needs-rebase,nvidia | by simpx (created: 2026-03-20 00:26 (UTC+8))
#37561 [CPU][UX] Do not crash when tcmalloc/libiomp are not ldpreloaded | ready,v1,cpu | by fadara01 (created: 2026-03-19 21:25 (UTC+8))
#37576 [Metrics] Add Prometheus counter for CUDA graph mode | v1,nvidia | by tlrmchlsmth (created: 2026-03-20 00:14 (UTC+8))
#37544 [CI] Gate pre-commit on ready label or number of contributions | ready,ci/build | by hmellor (created: 2026-03-19 18:00 (UTC+8))
#37575 docs: RFC for dynamic LoRA GPU capacity with resolver plugin integration | no labels | by wangchen615 (created: 2026-03-19 23:57 (UTC+8))
#37560 [Misc] Cleanup more configs and processors | ready,qwen | by DarkLight1337 (created: 2026-03-19 20:30 (UTC+8))
#37526 [MRV2] Use fp32 for draft logits | ready,v1 | by WoosukKwon (created: 2026-03-19 15:33 (UTC+8))
#37557 [LoRA] Minor improvements to LoRA log | ready | by jeejeelee (created: 2026-03-19 19:43 (UTC+8))
#37566 refactor hard coded device string in test files under tests/v1 and tests/lora | speculative-decoding,v1,nvidia | by wincent8 (created: 2026-03-19 22:33 (UTC+8))
#37547 [Bugfix][ROCm] Fix lru_cache on paged_mqa_logits_module | bug,rocm,v1 | by gronsti-amd (created: 2026-03-19 18:20 (UTC+8))
#37559 Stop bench CLI from recursively casting all configs to dict | performance,ready | by hmellor (created: 2026-03-19 20:09 (UTC+8))
#37555 Update typing annotations to use ReadOnly for ConversationMessage | frontend | by ikaadil (created: 2026-03-19 19:27 (UTC+8))
#37564 [Bugfix] Zero-init NVFP4 padding scales to prevent NaN contamination | bug | by elvircrn (created: 2026-03-19 22:03 (UTC+8))
#37541 [Misc] Clean up processing logic | ready,qwen | by DarkLight1337 (created: 2026-03-19 17:07 (UTC+8))
#37552 [CI] Merge cleanup_pr_body.yml and reminder_comment.yml | ready,ci/build | by hmellor (created: 2026-03-19 18:56 (UTC+8))
#37556 [EPLB][Refactor] Replace boolean state flags with EPLBPhase enum | no labels | by ilmarkov (created: 2026-03-19 19:34 (UTC+8))
#37536 Fix KV Offloading + MLA AssertionError by using num_kv_heads=1 in cpu… | ready,v1 | by xueliangyang-oeuler (created: 2026-03-19 16:05 (UTC+8))
#37545 [Model] Remove unnecessary get_language_model | ready | by DarkLight1337 (created: 2026-03-19 18:02 (UTC+8))
#37542 [CI/Build] Split out MM pooling tests | rocm,ready,ci/build | by DarkLight1337 (created: 2026-03-19 17:39 (UTC+8))
#37550 [Bugfix] Fix CPU backend crash in KV cache block zeroing | bug,v1,cpu | by DorBernsohn (created: 2026-03-19 18:48 (UTC+8))
#37538 [Bugfix] Avoid more OpenMP thread reallocation in CPU torch compile | bug,ready,cpu | by bigPYJ1151 (created: 2026-03-19 16:28 (UTC+8))
#37533 [ROCm] fix sleep mode not releasing GPU memory problem on ROCm | rocm | by aaab8b (created: 2026-03-19 15:52 (UTC+8))
#37532 [CI] Fix wrong path test file, missing rlhf_async_new_apis.py | ready,ci/build | by tjtanaa (created: 2026-03-19 15:50 (UTC+8))
#37539 [Performance] Remove unnecessary zero-fill of MLA decode output tensor in Aiter backend | rocm,needs-rebase,v1 | by xaguilar-amd (created: 2026-03-19 16:49 (UTC+8))
#37540 Fix test for moved example | ci/build | by hmellor (created: 2026-03-19 16:56 (UTC+8))
#37522 Backward compatible when scaled_fp4_quant.out symbol missing | v1 | by panpan0000 (created: 2026-03-19 14:29 (UTC+8))
#37534 [P/D] let toy proxy handle Responses API | v1,kv-connector | by chaunceyjiang (created: 2026-03-19 16:00 (UTC+8))
#37529 [ROCm] Enable MORI EP for unquantized MoE with AITER backend | rocm | by pinsiangamd (created: 2026-03-19 15:39 (UTC+8))
#37530 [Bugfix] Fix MLA KV cache blocks not zeroed on reuse, causing CUDA crashes under concurrent load | bug,v1,nvidia | by jacob-crux (created: 2026-03-19 15:43 (UTC+8))
#37510 fix(anthropic): remove non-standard ‘data: [DONE]’ from Anthropic streaming | frontend,ready | by cdpath (created: 2026-03-19 11:29 (UTC+8))
#37521 [Bugfix] Fix speculative sampler warmup OOM when using EAGLE | bug,v1 | by wangyxbh (created: 2026-03-19 14:24 (UTC+8))
#37519 refactor: abstract deepgemm support into platform | nvidia | by SherryC41 (created: 2026-03-19 14:11 (UTC+8))
#37516 [Core] Fix Sampler Eager Allocator OOM via Zero-Tradeoff Pre-Allocation | v1 | by glaziermag (created: 2026-03-19 12:50 (UTC+8))
#37514 [MODEL] Cherry-pick: Adding Support for Qwen3.5 Models | documentation,new-model,speculative-decoding,v1,qwen | by ChuanLi1101 (created: 2026-03-19 12:31 (UTC+8))
#37513 [ROCm] Add AITER fused kernel support for DeepSeek MLA attention | rocm,needs-rebase,deepseek | by khairulkabir1661 (created: 2026-03-19 12:31 (UTC+8))
#37512 MiniMax-M2: add Eagle3 speculative decoding support | new-model | by liuchenbing2026 (created: 2026-03-19 12:02 (UTC+8))
#37511 V0.17.1+haosdentfix 34845 | performance,ci/build,v1,qwen,nvidia | by NekoSunflower (created: 2026-03-19 11:46 (UTC+8))

Merged PRs

#36976 [Bugfix][LoRA] Fix Qwen35 LoRA | bug,ready,ci/build,qwen | by jeejeelee (merged: 2026-03-20 11:09 (UTC+8))
#36038 [compile][graph_partition]Add tensor size handling | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality | by fxdawnn (merged: 2026-03-20 10:55 (UTC+8))
#37207 [Feat] Enable CompressedTensorW4A8Int for XPU | ready | by tianmu-li (merged: 2026-03-20 10:34 (UTC+8))
#36708 fix: disambiguate multimodal prefix cache keys | ready,v1 | by tianshu-Michael-yu (merged: 2026-03-20 10:33 (UTC+8))
#37478 [CI] Update mergify tool-calling label paths | ready,ci/build | by sfeng33 (merged: 2026-03-20 10:22 (UTC+8))
#36056 [Bugfix] Fix Deepseekv32 tool parser when stream interval > 1 | bug,ready,deepseek | by sfeng33 (merged: 2026-03-20 07:51 (UTC+8))
#37452 Fix DP coordinator ZMQ port TOCTOU | ready,v1 | by itayalroy (merged: 2026-03-20 08:58 (UTC+8))
#37606 [ROCm][Bugfix] fix cache block size mismatch for aiter unified attention | bug,rocm,ready,v1 | by divakar-amd (merged: 2026-03-20 08:00 (UTC+8))
#37573 [Bug] Fix EmbedIOprocessor “classify” <-> “embed” | bug,frontend,ready | by yewentao256 (merged: 2026-03-20 07:40 (UTC+8))
#37572 [Refactor] Remove dead code in pooling model | frontend,ready,v1 | by yewentao256 (merged: 2026-03-20 07:39 (UTC+8))
#37448 Fix AttributeError in Qwen3.5 GDN layers with quantized models | bug,ready,ci/build,qwen | by jhsmith409 (merged: 2026-03-20 07:21 (UTC+8))
#36996 [CI][BugFix][AMD] Don’t set VLLM_ROCM_USE_AITER anymore in test_rocm_aiter_topk since its not necessary | bug,rocm,ready | by rasmith (merged: 2026-03-20 07:12 (UTC+8))
#37188 [Performance] Enable Triton autotuning disk cache by default | performance,ready | by arpera (merged: 2026-03-20 05:36 (UTC+8))
#36064 test Qwen/Qwen3-4B-Instruct-2507 for unbacked | ready,qwen | by laithsakka (merged: 2026-03-20 05:28 (UTC+8))
#36294 [MoE Refactor] Rename “naive” all2all backend | documentation,ready,nvidia | by bnellnm (merged: 2026-03-20 03:50 (UTC+8))
#35244 Comment fix for async rl example | documentation,ready | by hao-aaron (merged: 2026-03-20 03:46 (UTC+8))
#37218 [CI] Add retry with 4x backoff to HTTP fetches for transient failures | rocm,ready | by AndreasKaratzas (merged: 2026-03-20 03:00 (UTC+8))
#34839 [ROCm][CI] Cleaning and restructuring amd-ci legacy pipeline | rocm,ready,ci/build,v1,kv-connector | by AndreasKaratzas (merged: 2026-03-20 03:30 (UTC+8))
#33049 [MoE Refactor] DefaultMoERunner simplifcation | documentation,ready,v1 | by bnellnm (merged: 2026-03-20 03:07 (UTC+8))
#37568 [Log] Log once in local node by default | ready | by yewentao256 (merged: 2026-03-20 03:04 (UTC+8))
#37574 Fix SpeculatorsConfig now that PreTrainedConfig is a dataclass in Transformers | ready | by hmellor (merged: 2026-03-20 02:04 (UTC+8))
#37358 [Bugfix] Fix AttributeError when serving MXFP8 models with DeepGEMM installed | bug,ready,nvidia | by EdalatiAli (merged: 2026-03-20 01:58 (UTC+8))
#37345 [torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict | documentation,ready,llama,qwen | by Lucaskabela (merged: 2026-03-20 01:26 (UTC+8))
#37567 Run MacOS smoke test on daily cron job instead of every commit | ready,ci/build | by hmellor (merged: 2026-03-20 00:19 (UTC+8))
#37535 [P/D] AnthropicMessages add kv_transfer_params for PD disaggregation | frontend,ready,kv-connector | by chaunceyjiang (merged: 2026-03-20 00:41 (UTC+8))
#37561 [CPU][UX] Do not crash when tcmalloc/libiomp are not ldpreloaded | ready,v1,cpu | by fadara01 (merged: 2026-03-20 00:35 (UTC+8))
#37544 [CI] Gate pre-commit on ready label or number of contributions | ready,ci/build | by hmellor (merged: 2026-03-20 00:21 (UTC+8))
#37560 [Misc] Cleanup more configs and processors | ready,qwen | by DarkLight1337 (merged: 2026-03-19 23:45 (UTC+8))
#37346 [Bug] Fix fp8 trtllm MoE modular kernel supported routing methods | bug,ready,nvidia,quantization | by wzhao18 (merged: 2026-03-19 23:43 (UTC+8))
#37526 [MRV2] Use fp32 for draft logits | ready,v1 | by WoosukKwon (merged: 2026-03-19 23:41 (UTC+8))
#31509 [1/n] Migrate permute_cols to libtorch stable ABI | documentation,rocm,ready,ci/build,cpu,nvidia | by mikaylagawarecki (merged: 2026-03-19 23:27 (UTC+8))
#37480 Remove deprecated reasoning_content message field(part-2) | documentation,frontend,ready,v1,qwen,kv-connector,nvidia | by ikaadil (merged: 2026-03-19 23:20 (UTC+8))
#37557 [LoRA] Minor improvements to LoRA log | ready | by jeejeelee (merged: 2026-03-19 23:18 (UTC+8))
#37466 Cap the number of API servers to 1 when using Elastic EP. | frontend,ready | by SageMoore (merged: 2026-03-19 22:42 (UTC+8))
#37559 Stop bench CLI from recursively casting all configs to dict | performance,ready | by hmellor (merged: 2026-03-19 22:04 (UTC+8))
#37438 [Bugfix] Add Kimi-K2.5 reasoning/tool parser aliases and tool_call_id support | bug,frontend,ready | by DorBernsohn (merged: 2026-03-19 21:08 (UTC+8))
#37369 fix(cpu): add null check for aligned_alloc in ScratchPadManager | ready,cpu | by yassha (merged: 2026-03-19 17:45 (UTC+8))
#37541 [Misc] Clean up processing logic | ready,qwen | by DarkLight1337 (merged: 2026-03-19 21:30 (UTC+8))
#37552 [CI] Merge cleanup_pr_body.yml and reminder_comment.yml | ready,ci/build | by hmellor (merged: 2026-03-19 20:55 (UTC+8))
#37504 [Refactor] Relocate endpoint tests to mirror serving code directory structure | ready,ci/build | by sfeng33 (merged: 2026-03-19 15:19 (UTC+8))
#37536 Fix KV Offloading + MLA AssertionError by using num_kv_heads=1 in cpu… | ready,v1 | by xueliangyang-oeuler (merged: 2026-03-19 20:05 (UTC+8))
#37545 [Model] Remove unnecessary get_language_model | ready | by DarkLight1337 (merged: 2026-03-19 20:00 (UTC+8))
#37542 [CI/Build] Split out MM pooling tests | rocm,ready,ci/build | by DarkLight1337 (merged: 2026-03-19 19:36 (UTC+8))
#37458 Don’t log exc_info when vLLM tries to doenload a file that doesn’t exist | ready | by hmellor (merged: 2026-03-19 18:38 (UTC+8))
#35592 [Docs] Reorganize pooling docs. | documentation,ready,ci/build | by noooop (merged: 2026-03-19 19:25 (UTC+8))
#37538 [Bugfix] Avoid more OpenMP thread reallocation in CPU torch compile | bug,ready,cpu | by bigPYJ1151 (merged: 2026-03-19 18:24 (UTC+8))
#37407 [Bugfix] Fix Nemotron Parse loading | bug,rocm,ready,multi-modality | by DarkLight1337 (merged: 2026-03-19 17:55 (UTC+8))
#37418 [Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility for defer_input_quant | bug,rocm,ready | by Duyi-Wang (merged: 2026-03-19 17:49 (UTC+8))
#37532 [CI] Fix wrong path test file, missing rlhf_async_new_apis.py | ready,ci/build | by tjtanaa (merged: 2026-03-19 17:21 (UTC+8))
#37415 [MISC] fix pin_memory=torch.cuda.is_available(), use is_pin_memory_available | structured-output,ready,v1,nvidia | by jikunshang (merged: 2026-03-19 17:23 (UTC+8))
#36808 Support temporal compression for Nemotron-3-VL videos | ready,v1 | by collinmccarthy (merged: 2026-03-19 16:02 (UTC+8))
#37425 [Perf] Fix slow hasattr in CUDAGraphWrapper.__getattr__ | ready,v1,nvidia | by ZeldaHuang (merged: 2026-03-19 15:43 (UTC+8))
#37510 fix(anthropic): remove non-standard ‘data: [DONE]’ from Anthropic streaming | frontend,ready | by cdpath (merged: 2026-03-19 15:23 (UTC+8))
#37310 [SSM/Mamba] Follow-up: N-1 prefill for P/D disaggregation | ready,v1,kv-connector | by ZhanqiuHu (merged: 2026-03-19 15:22 (UTC+8))
#37009 [ROCm] issue management - request information for bug issues on ROCm | bug,rocm,ready,ci/build | by hongxiayang (merged: 2026-03-19 11:51 (UTC+8))

PRs Closed Without Merging

#30767 Algo | needs-rebase | by Mercykid-bash (closed: 2026-03-20 11:06 (UTC+8))
#23247 [Bugfix]Enable zmq router handover to handle scaling-up after scaling-down in EEP | needs-rebase,stale | by wuhang2014 (closed: 2026-03-20 10:21 (UTC+8))
#23996 [Feature]: Support Phi4Flash model in V1 | needs-rebase,stale,v1 | by aditchawdhary (closed: 2026-03-20 10:21 (UTC+8))
#37627 [Bugfix] Handle None compilation times in v1 Executor initialize_from_config | bug,v1 | by ec-jt (closed: 2026-03-20 09:43 (UTC+8))
#36453 fix: Add SM120 capability family check for FlashInfer NVFP4 MoE backends | nvidia | by brandonmmusic-max (closed: 2026-03-20 09:42 (UTC+8))
#36785 Update rocm get gpu info capability | rocm | by tmm77 (closed: 2026-03-20 05:40 (UTC+8))
#36967 Allow platform plugins to override gpu_memory_utilization default | frontend,v1 | by aws-navyadhara (closed: 2026-03-20 02:29 (UTC+8))
#37577 [Bugfix] Fix unclean shutdown from Ctrl-C with AR Fusion | bug,needs-rebase,nvidia | by simpx (closed: 2026-03-20 00:38 (UTC+8))
#37575 docs: RFC for dynamic LoRA GPU capacity with resolver plugin integration | no labels | by wangchen615 (closed: 2026-03-19 23:58 (UTC+8))
#24919 Reduce noisy startup logs (Roadmap Q3 2025): demote/remove non‑essential INFO logs | frontend | by ikaadil (closed: 2026-03-19 23:42 (UTC+8))
#37203 refactor hardcoded device strings in vllm tests | speculative-decoding,v1,multi-modality,nvidia | by wincent8 (closed: 2026-03-19 22:36 (UTC+8))
#34030 [Bugfix] Add reasoning_content backward compat to DeltaMessage for streaming | bug,frontend | by cradonn (closed: 2026-03-19 21:35 (UTC+8))
#36974 Fix issue 36969 | frontend | by xueliangyang-oeuler (closed: 2026-03-19 21:28 (UTC+8))
#36866 [Bugfix] Fix tool call streaming JSON separator mismatch | bug,frontend | by xr843 (closed: 2026-03-19 18:43 (UTC+8))
#37540 Fix test for moved example | ci/build | by hmellor (closed: 2026-03-19 16:57 (UTC+8))
#37522 Backward compatible when scaled_fp4_quant.out symbol missing | v1 | by panpan0000 (closed: 2026-03-19 16:37 (UTC+8))
#36111 [Perf] add cute dsl kernel for gdn decode | needs-rebase,qwen | by ZJY0516 (closed: 2026-03-19 15:04 (UTC+8))
#35567 [XPU][CI] add xpu image build job in vllm CI | ci/build | by jikunshang (closed: 2026-03-19 15:10 (UTC+8))
#25570 [docs] add more kv-param in nixl usage docs | documentation,unstale,kv-connector | by panpan0000 (closed: 2026-03-19 15:01 (UTC+8))
#37508 [VLLMZ-905] fix(xpu): Clamp compile warmup sizes to model runner token capacity | v1 | by Liangyx2 (closed: 2026-03-19 14:42 (UTC+8))
#37516 [Core] Fix Sampler Eager Allocator OOM via Zero-Tradeoff Pre-Allocation | v1 | by glaziermag (closed: 2026-03-19 12:52 (UTC+8))
#37513 [ROCm] Add AITER fused kernel support for DeepSeek MLA attention | rocm,needs-rebase,deepseek | by khairulkabir1661 (closed: 2026-03-19 12:33 (UTC+8))
#37511 V0.17.1+haosdentfix 34845 | performance,ci/build,v1,qwen,nvidia | by NekoSunflower (closed: 2026-03-19 11:46 (UTC+8))
#37509 fix(anthropic): remove non-standard ‘data: [DONE]’ from Anthropic streaming | frontend | by cdpath (closed: 2026-03-19 11:27 (UTC+8))