vLLM Development Activity Report - 2026-03-19
Time window: 2026-03-19 11:26 (UTC+8) ~ 2026-03-20 11:26 (UTC+8)
Stats: 25 new issues | 22 closed issues | 100 new PRs | 55 merged PRs | 24 PRs closed without merging
📊 Daily Development Summary
During the 2026-03-19 to 2026-03-20 window, the vLLM project maintained a very high level of development activity, opening 100 PRs and merging 55. Work concentrated on performance optimization, bug fixes, and continued strengthening of the AMD ROCm ecosystem. The community reported several critical bugs involving popular models such as GLM-4.7-FP8 and Qwen3.5, and AMD contributors responded quickly to resolve a number of core ROCm platform issues.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was very high this cycle, centered on bug fixes, performance optimization, and new feature support.
1. Critical bug fixes:
- Issue #37548 & PR #37606: user vllmellm reported a compilation failure on ROCm when using the ROCM_AITER_UNIFIED_ATTN backend, caused by a cache-block-size mismatch (16 vs. 64). AMD engineer divakar-amd quickly pinpointed the problem and fixed the configuration logic in PR #37606, ensuring the correct cache block size (64) is used whether the backend is enabled via environment variable or CLI argument. The PR has been merged.
- PR #37547: submitted by gronsti-amd, fixes a broken lru_cache in the DeepSeek V3 MLA sparse-attention implementation (the decorated function was redefined on every call, so the module was re-imported repeatedly). The fix improves performance.
- Issue #37596: a ROCm CI failure in the Qwen language-model (PPL) test with the error HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION, indicating a memory access violation; investigation is ongoing.
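The lru_cache failure mode behind PR #37547 can be sketched in a few lines of plain Python (illustrative only, not vLLM's actual code): decorating a function that is redefined on every call gives each call a fresh, empty cache.

```python
from functools import lru_cache

calls = {"n": 0}  # counts how often the "expensive" work actually runs

def load_module_bad():
    # Anti-pattern: the decorated function is recreated per call, so the
    # cache never persists and the expensive work is redone every time.
    @lru_cache(maxsize=None)
    def _load():
        calls["n"] += 1
        return object()
    return _load()

# The fix: define the cached function once at module scope so the cache
# actually survives across calls.
@lru_cache(maxsize=None)
def _load_once():
    calls["n"] += 1
    return object()

def load_module_good():
    return _load_once()

load_module_bad(); load_module_bad()
assert calls["n"] == 2          # cache never hit: work repeated
calls["n"] = 0
load_module_good(); load_module_good()
assert calls["n"] == 1          # second call served from the cache
```

In vLLM's case the cached work was a module import, which is why the bug showed up as repeated imports and degraded performance.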
2. Performance and feature improvements:
- PR #37539: submitted by xaguilar-amd, optimizes the MLA decode path of the ROCm AITER backend. Switching the output-tensor allocation from torch.zeros to torch.empty removes a redundant GPU zero-fill kernel launch and lowers decode latency.
- PR #37533: submitted by aaab8b, fixes sleep mode on ROCm not actually releasing physical GPU memory. After hipMemRelease, the fix loops over cuMemAddressFree and cuMemAddressReserve to force the driver to release the physical pages, resolving the memory leak.
- PR #37529: submitted by pinsiangamd, fixes an incompatibility between the MORI expert-parallel backend and the AITER expert backend when running unquantized (BF16) MoE models on ROCm. Previously the combination degraded silently (no error, but lower output quality); it now works correctly.
- PR #37418 (merged): submitted by Duyi-Wang, fixes compatibility between MoRI (MoE Remote Invocation) and AITER FP8 quantized dispatch, ensuring FP8 quantization runs correctly during MoRI's dispatch phase.
3. New features / integrations:
- PR #37515 & #37513: submitted by khairulkabir1661, add AITER fused-kernel support for DeepSeek's MLA attention, fusing multiple operations on the decode and prefill paths to improve performance. The PRs currently need merge-conflict resolution.
- PR #37542 (merged): splits the multimodal extended pooling tests out to run separately, speeding up CI.
Summary: the AMD team was very active this cycle, responding quickly to user reports while continuing deep work on memory management, kernel performance, and expert parallelism, noticeably improving vLLM's stability and performance on ROCm.
💬 High-Engagement Discussions
No single thread drew an unusually high comment count this cycle, but the following discussions are representative:
1. Issue #37590: silent HTTP connection drops
- Core issue: user DKingAlpha reported that when a prompt contains a large table, the vLLM server sometimes drops the client connection without logging anything.
- Exchange:
  - Another user, malaiwah, could not reproduce the problem under a similar setup and shared their test configuration.
  - The reporter then provided more deployment detail (Docker Compose + reverse proxy) and debugged further with malaiwah's help.
- Outcome: step-by-step debugging traced the root cause to a misconfigured reverse proxy on the user's side, not vLLM itself. The reporter closed the issue and thanked the helpers, a good example of collaborative community triage.
2. PR #37522: compatibility with older build environments
- Core issue: panpan0000 submitted a PR adding a backward-compatibility check for the newly added .out operator overload, to keep new code from crashing in older build environments.
- Exchange:
  - For compatibility: the author argued that a hasattr check hardens the code against crashes when the Python code and the C-extension version are mismatched.
  - Against: maintainer ProExpertProg held that mismatched builds should not be supported at all; users should simply rebuild. If anything, he suggested, a better version-mismatch error-reporting system would be preferable to runtime checks.
- Current status: the PR was closed following the maintainer's clear objection. This reflects the project's trade-off between development velocity and long-term compatibility, currently favoring the former.
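The two positions in PR #37522 can be contrasted with a hypothetical sketch (the `ops` object and `scaled_fp4_quant` stand-ins below are illustrative, not vLLM's real extension API):

```python
# Simulate an older compiled extension that predates the `.out` overload.
class _OldExtension:
    def scaled_fp4_quant(self, x):
        return x

ops = _OldExtension()

# Option A (the PR's approach): probe for the new overload, fall back silently.
def quant_compat(x):
    fn = ops.scaled_fp4_quant
    if hasattr(fn, "out"):
        return fn.out(x)      # newer in-place variant, if present
    return fn(x)              # older extension: allocating variant

# Option B (the maintainer's position): fail fast with a clear
# version-mismatch error instead of papering over it at runtime.
def quant_strict(x):
    if not hasattr(ops.scaled_fp4_quant, "out"):
        raise RuntimeError(
            "compiled extension predates scaled_fp4_quant.out; "
            "rebuild vLLM against the current source tree")
    return ops.scaled_fp4_quant.out(x)

assert quant_compat(3) == 3   # old path silently used
```

Option A maximizes robustness; Option B keeps the codebase free of compatibility shims and surfaces the mismatch loudly, which is the direction the maintainers chose.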
3. Issue #28572 (closed): multi-GPU device compatibility
- Core issue: vLLM failed to serve because the system contained GPUs of different compute capabilities (an RTX 2080 Ti and an RTX 3090).
- Discussion: the user and maintainers explored isolating devices with CUDA_VISIBLE_DEVICES, checking environment-variable propagation, and model dtype (MXFP4) compatibility.
- Point of contention: the problem was never fully resolved; the user argued that vLLM should handle environments with incompatible GPUs more gracefully, or at least give clearer error guidance.
- Final status: the issue saw no further progress, was marked "stale", and was auto-closed. Issues like this illustrate the complexity of mixing GPU generations in production deployments.
🔥 Hot Topics and Trends
- Bug reports cluster around CUDA illegal memory access and MLA: multiple issues report cudaErrorIllegalAddress crashes in scenarios including changing max-num-batched-tokens, enabling throughput mode, and MTP speculative decoding. These typically point to low-level kernel or memory-management defects at specific boundary conditions. Reports of MLA attention-backend selection failures also appeared.
- GLM-4.7-FP8 issues stand out: user Xarbirus filed a series of crash reports against zai-org/GLM-4.7-FP8 under different configurations, suggesting that support for this model in vLLM is still unstable and is a current test-and-fix hotspot.
- Qwen3.5 support keeps maturing: many discussions and fixes center on Qwen3.5, including multi-GPU initialization errors, tool-parser support, and a Marlin quantization compatibility fix for the GatedDeltaNet layers. The series is clearly in wide use, and its complex architecture (MoE, GDN) places high demands on the inference engine.
- MoE performance and correctness: on both ROCm and CUDA there were discussions of MoE performance regressions (e.g., graph mode slower than non-MTP) and kernel bugs (e.g., the FlashInfer TRTLLM monolithic kernel mis-routing experts and producing zero accuracy), showing that MoE inference remains a hard optimization target.
🛠️ Key Technical Changes
- PR #37606: [ROCm][Bugfix] fix cache block size mismatch for aiter unified attention
  - Analysis: fixes inconsistent cache-block-size configuration for the ROCm AITER unified-attention backend. Unified attention needs a block size of 64 for best performance, but when the backend was enabled via the --attention-config argument, the config-update ordering left the default of 16 in place.
  - Impact: stabilizes the backend's performance and eliminates the resulting compilation failures, improving the user experience.
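This class of ordering bug is easy to reproduce in miniature. The sketch below is hypothetical (field names are illustrative, not vLLM's real config schema): applying defaults after a backend-specific override silently clobbers it.

```python
DEFAULT_BLOCK_SIZE = 16
UNIFIED_ATTN_BLOCK_SIZE = 64

def resolve_block_size_buggy(cli_backend):
    cfg = {"block_size": None}
    if cli_backend == "ROCM_AITER_UNIFIED_ATTN":
        cfg["block_size"] = UNIFIED_ATTN_BLOCK_SIZE
    # Bug: defaults are applied *after* the backend override
    # and overwrite it unconditionally.
    cfg["block_size"] = DEFAULT_BLOCK_SIZE
    return cfg["block_size"]

def resolve_block_size_fixed(cli_backend):
    cfg = {"block_size": None}
    # Fix: fill defaults first, then let backend requirements override.
    cfg["block_size"] = DEFAULT_BLOCK_SIZE
    if cli_backend == "ROCM_AITER_UNIFIED_ATTN":
        cfg["block_size"] = UNIFIED_ATTN_BLOCK_SIZE
    return cfg["block_size"]

assert resolve_block_size_buggy("ROCM_AITER_UNIFIED_ATTN") == 16  # wrong
assert resolve_block_size_fixed("ROCM_AITER_UNIFIED_ATTN") == 64  # intended
```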
- PR #37539: [Performance] Remove unnecessary zero-fill of MLA decode output tensor in Aiter backend
  - Analysis: changes the MLA decode output-tensor initialization from torch.zeros to torch.empty. The subsequent AITER kernel unconditionally overwrites the whole tensor, so the preceding zero-fill was a redundant GPU kernel launch.
  - Impact: removes one vectorized_elementwise kernel launch per layer per decode step, directly lowering MLA decode latency on ROCm, a classic low-level optimization.
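The reasoning behind the change can be shown with a pure-Python analogue (the real code swaps torch.zeros for torch.empty): when a kernel writes every element of its output buffer, zero-filling the buffer first is wasted work and does not change the result.

```python
def decode_kernel(out, values):
    # Writes *every* slot of `out`, so its prior contents never matter.
    for i, v in enumerate(values):
        out[i] = v * 2

values = [1, 2, 3, 4]

# Before: allocate + zero-fill (an extra pass over the buffer,
# analogous to the redundant GPU zero-fill kernel launch).
out_zeroed = [0] * len(values)          # ~ torch.zeros
decode_kernel(out_zeroed, values)

# After: "uninitialized" allocation; safe only because the kernel
# unconditionally overwrites the whole buffer.
out_empty = [None] * len(values)        # ~ torch.empty
decode_kernel(out_empty, values)

assert out_zeroed == out_empty == [2, 4, 6, 8]
```

Note the precondition: this optimization is only correct when the consumer provably writes the entire tensor, which is exactly what the PR review had to establish.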
- PR #37533: [ROCm] fix sleep mode not releasing GPU memory problem on ROCm
  - Analysis: exposes a behavior difference in the HIP runtime, where hipMemRelease does not free physical VRAM while the virtual address range remains reserved. Looping over address free and re-reserve works around the driver limitation.
  - Impact: fixes sleep mode's "fake release" on ROCm so GPU memory can genuinely be reused by other applications (such as RL training frameworks), improving resource utilization.
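A toy model of the driver behavior can make the workaround concrete. This is purely illustrative (a mock, not the HIP/CUDA API): releasing physical pages has no effect while the virtual address range stays reserved, so the fix drops and re-creates the reservation.

```python
class ToyDriver:
    """Mock of a driver where a physical release is deferred until the
    virtual address reservation is dropped."""
    def __init__(self):
        self.reserved = True
        self.physical_freed = False
        self._release_requested = False

    def mem_release(self):               # ~ hipMemRelease
        self._release_requested = True
        if not self.reserved:            # only honored once unreserved
            self.physical_freed = True

    def address_free(self):              # ~ cuMemAddressFree
        self.reserved = False
        if self._release_requested:
            self.physical_freed = True   # pending release now takes effect

    def address_reserve(self):           # ~ cuMemAddressReserve
        self.reserved = True

drv = ToyDriver()
drv.mem_release()
assert not drv.physical_freed            # "fake release": VRAM still held

# The workaround: drop the reservation so the release lands, then
# re-reserve the range so the allocator can wake up later.
drv.address_free()
drv.address_reserve()
assert drv.physical_freed
```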
- PR #36056 (merged): [Bugfix] Fix Deepseekv32 tool parser when stream interval > 1
  - Analysis: reworks the streaming logic of the DeepSeek-V3.2 tool-call parser. The old version used a complex character-by-character state machine that mis-parsed tags when stream_interval > 1; the new version buffers tokens until a complete DSML block has arrived, then parses it in one pass.
  - Impact: fundamentally fixes tool-call parsing errors at larger streaming intervals, making DeepSeek-V3.2 streaming output more reliable.
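The buffering strategy can be sketched as follows (a minimal illustration; the tag names are made up, and the real parser handles DSML blocks, not this `<tool_call>` placeholder). The key property is that the parse result no longer depends on how the stream is chunked.

```python
class BufferedToolParser:
    """Accumulate streamed chunks and only parse once a complete
    <tool_call>...</tool_call> block is present."""
    OPEN, CLOSE = "<tool_call>", "</tool_call>"

    def __init__(self):
        self.buf = ""

    def feed(self, chunk):
        self.buf += chunk
        start = self.buf.find(self.OPEN)
        end = self.buf.find(self.CLOSE)
        if start == -1 or end == -1:
            return None                  # block incomplete: keep buffering
        call = self.buf[start + len(self.OPEN):end]
        self.buf = self.buf[end + len(self.CLOSE):]
        return call

text = '<tool_call>{"name": "get_weather"}</tool_call>'

# Chunking granularity (cf. stream_interval) no longer affects the result.
for size in (1, 3, len(text)):
    p = BufferedToolParser()
    results = [r for i in range(0, len(text), size)
               if (r := p.feed(text[i:i + size])) is not None]
    assert results == ['{"name": "get_weather"}']
```

The trade-off is slightly higher latency before a tool call is emitted, in exchange for correctness that is independent of the streaming interval.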
📈 Development Activity Observations
- Strong AMD contributions: AMD engineers such as divakar-amd, gronsti-amd, and xaguilar-amd were highly active this cycle, landing several key fixes and optimizations and signaling AMD's growing investment in vLLM's ROCm backend.
- High community engagement: most of the 25 new issues are detailed user bug reports with reproduction steps and logs, indicating a mature, contribution-minded user community. Many issues received a response or fix the same day.
- Efficient merging: 55 of the 100 new PRs were merged, including several key bug fixes, showing an efficient review-and-merge pipeline run by the core maintainers.
- Continuous CI/CD: several CI failures were reported and tracked (such as the mi355_1 test failure), showing that the automated test suite keeps surfacing problems and helps maintain code quality.
💡 Issues Worth Watching
- GLM-4.7-FP8 stability: multiple independent issues report CUDA illegal-memory-access crashes for this model under different configurations (#37587, #37598, #37599, #37570). A systematic investigation by core developers may be needed to determine whether a specific kernel or memory-management strategy is incompatible with the model.
- Qwen3.5 multi-GPU initialization: Issue #37623 reports Qwen3.5-122B-A10B-FP8 failing at multi-GPU (TP=2) initialization with "Device does not support multicasting", possibly related to creating the FlashInfer allreduce fusion workspace. This affects large-scale deployments.
- FlashInfer TRTLLM MoE routing bug: Issue #37591 and PR #37605 show that the FlashInfer TRTLLM monolithic MoE kernel selects the wrong experts when all routing logits are negative, making Qwen3.5 FP8 output completely wrong (0% accuracy). The problem has been mitigated by temporarily disabling the affected routing method; a root fix awaits an upstream update.
- CPU backend KV-cache zeroing crash: PR #37550 reports and fixes a CPU-backend crash caused by calling a GPU Triton zeroing kernel, a reminder that new features in a multi-backend project must account for every backend.
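The all-negative-logits failure mode behind Issue #37591 can be illustrated with a toy router (the real bug lives in the FlashInfer TRTLLM kernel; this Python is only an analogue): a top-1 selection that initializes its running best score to 0.0 never picks any expert when every logit is negative.

```python
def route_buggy(logits):
    # Bug analogue: 0.0 beats every negative logit, so no expert is chosen.
    best_expert, best_score = -1, 0.0
    for i, s in enumerate(logits):
        if s > best_score:
            best_expert, best_score = i, s
    return best_expert

def route_fixed(logits):
    # Fix: start strictly below any representable logit.
    best_expert, best_score = 0, float("-inf")
    for i, s in enumerate(logits):
        if s > best_score:
            best_expert, best_score = i, s
    return best_expert

all_negative = [-3.2, -0.7, -1.9]
assert route_buggy(all_negative) == -1   # no expert selected: garbage output
assert route_fixed(all_negative) == 1    # correct argmax
```

Since expert selection gates which weights process the token, a sentinel like this corrupts every affected token, which is consistent with the reported 0% accuracy rather than a mild degradation.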
📋 Appendix: Detailed Data
New Issues
- #37608 [Bug]: AttributeError: 'Parameter' object has no attribute 'load_qkv_weight' — bug — by S1ro1 (created: 2026-03-20 05:55 (UTC+8))
- #37623 [Bug]: Qwen3.5-122B-A10B-FP8 multi GPU issue(tensor-parallel-size) — bug — by jungmoklee (created: 2026-03-20 09:06 (UTC+8))
- #37625 MTP and Distributed MoE Speculative Decoding Degradation — no labels — by glaziermag (created: 2026-03-20 09:17 (UTC+8))
- #37624 MLA and APC Conflict — no labels — by glaziermag (created: 2026-03-20 09:17 (UTC+8))
- #37543 [Bug]: vLLM inference fails with KeyError: residual — bug — by kdywt (created: 2026-03-19 17:52 (UTC+8))
- #37618 [Bug]: FP8 kv cache on b200 with qwen3.5 has degraded accuracy — bug — by robertgshaw2-redhat (created: 2026-03-20 08:15 (UTC+8))
- #37548 [Bug][ROCm]: Aiter unified attention fails during compilation — bug,rocm — by vllmellm (created: 2026-03-19 18:30 (UTC+8))
- #37602 [Bug]: Qwen3.5-122B-A10B-FP8 EngineCore crash on concurrent image requests — no labels — by serdarildercaglar (created: 2026-03-20 05:01 (UTC+8))
- #37590 [Bug]: http.client.RemoteDisconnected: Remote end closed connection without response — bug — by DKingAlpha (created: 2026-03-20 02:53 (UTC+8))
- #37591 [Bug]: FlashInfer TRTLLM monolithic MoE produces 0% accuracy for Qwen3.5-35B/122B FP8 — bug,qwen,nvidia — by vadiklyutiy (created: 2026-03-20 03:09 (UTC+8))
- #37599 [Bug]: cudaErrorIllegalAddress crash when running zai-org/GLM-4.7-FP8 with --max-num-batched-tokens < default (e.g. 4K) under — bug — by Xarbirus (created: 2026-03-20 04:47 (UTC+8))
- #37598 [Bug]: cudaErrorIllegalAddress crash when running zai-org/GLM-4.7-FP8 with --max-num-batched-tokens > default (e.g. 16K) under load — bug — by Xarbirus (created: 2026-03-20 04:42 (UTC+8))
- #37596 [CI Failure]: mi355_1: Language Models Test (PPL) — rocm,ci-failure — by AndreasKaratzas (created: 2026-03-20 04:15 (UTC+8))
- #37587 [Bug]: cudaErrorIllegalAddress crash when enabling --performance-mode throughput for zai-org/GLM-4.7-FP8 under load — bug — by Xarbirus (created: 2026-03-20 02:31 (UTC+8))
- #37581 [Bug]: /v1/chat/completions/render crashes for Qwen/Qwen3-ASR-0.6B multimodal audio, and chat audio returns empty/junk — bug — by peregilk (created: 2026-03-20 01:03 (UTC+8))
- #37570 [Bug]: CUDA ILM (Illegal Memory Access) crash when enabling MTP num_speculative_tokens with >1 for zai-org/GLM-4.7-FP8 under load — bug — by Xarbirus (created: 2026-03-19 23:08 (UTC+8))
- #37563 mm_fp4 trtllm backend leaks padding scales into real rows (use_8x4_sf_layout=True) — no labels — by elvircrn (created: 2026-03-19 21:57 (UTC+8))
- #37554 [Bug] --calculate-kv-scales produces corrupted FP8 KV cache on hybrid GDN+Attention models (Qwen3.5) — no labels — by daudo (created: 2026-03-19 19:00 (UTC+8))
- #37558 "vLLM-deployed Qwen3.5 with Reasoning Parser Shows Empty reasoningContent in Spring AI OpenAI Model" — bug — by lhc75 (created: 2026-03-19 19:51 (UTC+8))
- #37546 [Bug]: CPU backend crashes with TypeError: 'function' object is not subscriptable on first inference request — bug,cpu — by fyuan1316 (created: 2026-03-19 18:07 (UTC+8))
- #37553 [Bug]: Mistral-Small-4-119B-2603 fails on 8x RTX 3090 (SM 8.6) with vLLM v0.17.1: no valid MLA attention backend — bug — by Martossien (created: 2026-03-19 18:57 (UTC+8))
- #37551 [Bug] vLLM 0.17.1: zai-org/GLM-OCR has mtp_graph < no_mtp_graph despite high acceptance — bug — by AlejandroBaron (created: 2026-03-19 18:56 (UTC+8))
- #37527 [Performance]: Is SamplingParams support set enable_thinking? — performance — by lcedaw (created: 2026-03-19 15:39 (UTC+8))
- #37525 [Bug]: qwen3.5-35B process repeatedly killed after startup: vllm.v1.engine exceptions enginedeaderror enginecore encountered an issue — bug — by promise-hash (created: 2026-03-19 15:20 (UTC+8))
- #37520 [Bug]: [V1 Engine] Segfault / NCCL init failure when running 4 GPUs across NUMA nodes (v0.17.0) — bug — by xkkx123 (created: 2026-03-19 14:20 (UTC+8))
Closed Issues
- #31043 [BugFix]: move torch.Size across graphs in split_graph — help wanted,feature request,torch.compile — by BoyuanFeng (closed: 2026-03-20 10:55 (UTC+8))
- #20223 [Feature]: Any plans to support TokenWeave optimizations in vLLM? — feature request,stale — by solitude-s (closed: 2026-03-20 10:21 (UTC+8))
- #22497 [Bug]: Crashed when loading ggml quantized Gemma3 — bug,stale — by Imagium719 (closed: 2026-03-20 10:21 (UTC+8))
- #22533 [Bug]: gpt-oss-20b flaky BadRequest 400 — bug,stale — by simonfl (closed: 2026-03-20 10:21 (UTC+8))
- #23603 [Feature]: Log prompt for gpt-oss — feature request,stale,gpt-oss — by beom115 (closed: 2026-03-20 10:21 (UTC+8))
- #24140 [Bug]: Deepseek V3.1 tool_choice=required produces garbled output — bug,stale — by WangJianQ-0118 (closed: 2026-03-20 10:21 (UTC+8))
- #24147 [Bug]: model failure for OpenGVLab/InternVL3-38B-hf — bug,stale — by thesillystudent (closed: 2026-03-20 10:21 (UTC+8))
- #28572 [Bug]: vllm refuses to serve LLM in the presence of multiple GPUs — bug,stale — by tigran123 (closed: 2026-03-20 10:20 (UTC+8))
- #37625 MTP and Distributed MoE Speculative Decoding Degradation — no labels — by glaziermag (closed: 2026-03-20 09:22 (UTC+8))
- #37624 MLA and APC Conflict — no labels — by glaziermag (closed: 2026-03-20 09:22 (UTC+8))
- #37618 [Bug]: FP8 kv cache on b200 with qwen3.5 has degraded accuracy — bug — by robertgshaw2-redhat (closed: 2026-03-20 08:16 (UTC+8))
- #37548 [Bug][ROCm]: Aiter unified attention fails during compilation — bug,rocm — by vllmellm (closed: 2026-03-20 08:00 (UTC+8))
- #37444 Regression in nightly: AttributeError 'MergedColumnParallelLinear' has no attribute 'weight' with Qwen3.5-9B — no labels — by jhsmith409 (closed: 2026-03-20 07:21 (UTC+8))
- #37590 [Bug]: http.client.RemoteDisconnected: Remote end closed connection without response — bug — by DKingAlpha (closed: 2026-03-20 04:45 (UTC+8))
- #37471 [Bug]: Accuracy issue running Model Runner V2 with Qwen3.5 — bug — by yewentao256 (closed: 2026-03-20 03:33 (UTC+8))
- #29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) — ci-failure — by AndreasKaratzas (closed: 2026-03-20 02:06 (UTC+8))
- #32972 [CI Failure]: mi325_8: Kernels Attention Test %N — ci-failure — by AndreasKaratzas (closed: 2026-03-20 02:01 (UTC+8))
- #31631 [CI Failure]: mi325_1: V1 Test others — ci-failure — by AndreasKaratzas (closed: 2026-03-20 02:01 (UTC+8))
- #29536 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 — ci-failure — by AndreasKaratzas (closed: 2026-03-20 02:00 (UTC+8))
- #37397 [Bug]: Kimi-K2.5 chat completion doesn't return any reasoning content — bug — by wizche (closed: 2026-03-19 21:08 (UTC+8))
- #36773 [Bug]: Qwen3.5-35B-A3B on B200 with vllm v0.17.0 output random result — bug — by elepherai (closed: 2026-03-19 20:18 (UTC+8))
- #30848 [Bug]: Fail to run Qwen3-Next model. — bug,stale — by MaoJianwei (closed: 2026-03-19 11:51 (UTC+8))
New PRs
- #37634 [XPU] Automatically detect target platform as XPU in build. — ready,ci/build — by ccrhx4 (created: 2026-03-20 11:10 (UTC+8))
- #37565 [Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention… — bug — by Young-Leo (created: 2026-03-19 22:04 (UTC+8))
- #37537 [Model] Deprecate the score task (this will not affect users). — documentation,frontend,ready,v1 — by noooop (created: 2026-03-19 16:07 (UTC+8))
- #37612 [V0 Deprecation] Deprecate --disable-frontend-multiprocessing — performance,frontend,ready,v1 — by sfeng33 (created: 2026-03-20 07:25 (UTC+8))
- #37633 [bug fix] not override "partial_rotary_factor" if it is in model config — bug — by xgwang (created: 2026-03-20 11:10 (UTC+8))
- #37632 always use embed & token_classify for bge-m3 — no labels — by staugust (created: 2026-03-20 10:38 (UTC+8))
- #37531 [bugfix] Outputting full results for long audio files in stream_api_response() mode — bug,documentation — by AllenDou (created: 2026-03-19 15:48 (UTC+8))
- #37603 [NIXL][Mamba][2/N] Heterogeneous TP : chunk-interleaved layout — kv-connector — by ZhanqiuHu (created: 2026-03-20 05:17 (UTC+8))
- #37523 fix(xpu): Re-compute compile ranges after platform-specific config updates — ready,v1 — by Liangyx2 (created: 2026-03-19 14:47 (UTC+8))
- #37630 [Bugfix] Add early detection for CUDA < 13.0 on sm_103+ GPUs (GB300) — bug,nvidia — by qiching (created: 2026-03-20 09:55 (UTC+8))
- #37588 [Model Runner V2] Add full/piecewise cuda graph support for eagle pre… — v1,nvidia — by TheEpicDolphin (created: 2026-03-20 02:39 (UTC+8))
- #37631 [Bugfix] JAIS: Only apply ALiBi when position_embedding_type is alibi — no labels — by simpx (created: 2026-03-20 09:57 (UTC+8))
- #37628 [Bugfix] Handle None compilation times in v1 Executor initialize_from_config — bug,v1 — by ec-jt (created: 2026-03-20 09:44 (UTC+8))
- #37629 [Bugfix] Fix EAGLE3+multimodal+async crash: clamp -1 placeholders in embedding paths — bug,speculative-decoding,v1 — by haosdent (created: 2026-03-20 09:52 (UTC+8))
- #37620 reshape instead of view in FP8ScaledMMLinearKernel — no labels — by krishna-kylist (created: 2026-03-20 08:42 (UTC+8))
- #37627 [Bugfix] Handle None compilation times in v1 Executor initialize_from_config — bug,v1 — by ec-jt (created: 2026-03-20 09:31 (UTC+8))
- #37571 [UX] Enable torch_profiler_with_stack — documentation — by jeejeelee (created: 2026-03-19 23:11 (UTC+8))
- #37626 Replace VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE env var with max_num_batched_tokens — nvidia — by mgoin (created: 2026-03-20 09:23 (UTC+8))
- #37622 [Bugfix] Fix Step3 pipeline parallel KeyError for residual tensor — bug — by JMonde (created: 2026-03-20 09:02 (UTC+8))
- #37621 [Bugfix] JAIS: Only apply ALiBi when position_embedding_type is alibi — bug — by simpx (created: 2026-03-20 08:50 (UTC+8))
- #37610 [ROCm][CI] Reduce image resolution for gemma3 to avoid OOM on MI250 — rocm,ready,multi-modality — by AndreasKaratzas (created: 2026-03-20 06:50 (UTC+8))
- #37619 [ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-20 08:17 (UTC+8))
- #37583 Revert "[docs] Add docs for new RL flows" (#36188) — documentation,needs-rebase,ci/build — by zhewenl (created: 2026-03-20 01:46 (UTC+8))
- #37562 [Bugfix] Rebase of #36329: Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 + torch.compile compatibility — bug,needs-rebase,qwen — by kitch2400 (created: 2026-03-19 21:35 (UTC+8))
- #37615 [Bugfix] Add bounds check for SHMManager handle in get_singleton_inst… — bug,cpu — by yassha (created: 2026-03-20 07:50 (UTC+8))
- #37617 [ROCm][CI] Guard CudaPlatform/RocmPlatform imports to fix test collection on cross-platform builds — rocm,ready,nvidia — by AndreasKaratzas (created: 2026-03-20 08:08 (UTC+8))
- #37616 [ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test — rocm,ready — by AndreasKaratzas (created: 2026-03-20 07:57 (UTC+8))
- #37606 [ROCm][Bugfix] fix cache block size mismatch for aiter unified attention — bug,rocm,ready,v1 — by divakar-amd (created: 2026-03-20 05:39 (UTC+8))
- #37518 [Core] Fix Sampler Eager Allocator OOM via Zero-Tradeoff Pre-Allocation — v1 — by glaziermag (created: 2026-03-19 14:10 (UTC+8))
- #37614 [ROCm][CI] Remove deepep DBO tests on gfx90a — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-20 07:46 (UTC+8))
- #37569 [Refactor] Fix bitsandbytes loader import for pipeline-parallel params — no labels — by ikaadil (created: 2026-03-19 23:07 (UTC+8))
- #37613 [ROCm][CI] Fix accuracy for llama-nemotron-vl pooling tests — rocm,ready,multi-modality,llama — by AndreasKaratzas (created: 2026-03-20 07:40 (UTC+8))
- #37528 [Model] Add LFM2-ColBERT-350M support — documentation,new-model — by ieBoytsov (created: 2026-03-19 15:39 (UTC+8))
- #37573 [Bug] Fix EmbedIOprocessor "classify" <-> "embed" — bug,frontend,ready — by yewentao256 (created: 2026-03-19 23:32 (UTC+8))
- #37572 [Refactor] Remove dead code in pooling model — frontend,ready,v1 — by yewentao256 (created: 2026-03-19 23:29 (UTC+8))
- #37611 [ROCm][CI] Fix granite_speech test for gfx90a by selecting compatible attention backend — rocm,ready,multi-modality — by AndreasKaratzas (created: 2026-03-20 07:16 (UTC+8))
- #37600 [compile] Cache InductorPass uuid — no labels — by angelayi (created: 2026-03-20 04:53 (UTC+8))
- #37609 Use lazy graph module during split_module to defer recompile() — no labels — by angelayi (created: 2026-03-20 06:01 (UTC+8))
- #37607 [CPU][UX][Perf] Enable tcmalloc by default — ci/build,cpu — by fadara01 (created: 2026-03-20 05:47 (UTC+8))
- #37605 [Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing (#37591) — bug,ready,ci/build,qwen,nvidia — by vadiklyutiy (created: 2026-03-20 05:27 (UTC+8))
- #37597 [ci] Use fresh cache directory for compile only per test case. — no labels — by zhxchen17 (created: 2026-03-20 04:28 (UTC+8))
- #37589 [compile] Add compiled artifact counter for VLLM_USE_MEGA_AOT_ARTIFACT=1. — no labels — by zhxchen17 (created: 2026-03-20 02:51 (UTC+8))
- #37604 [compile] Fix aot test failures with torch 2.12. — ready — by zhxchen17 (created: 2026-03-20 05:17 (UTC+8))
- #37601 [EPLB] Refactor Async EPLB synchronization logic — needs-rebase — by SageMoore (created: 2026-03-20 04:56 (UTC+8))
- #37595 [Refactor] Move serve entrypoint tests under tests/entrypoints/serve/ — ci/build — by sfeng33 (created: 2026-03-20 03:57 (UTC+8))
- #37549 [Feature] Rework chunk-based processing with torch.scan — needs-rebase — by bohnstingl (created: 2026-03-19 18:38 (UTC+8))
- #37582 Fix EP expert_map init for TPU: avoid dynamic-shape ops on XLA — no labels — by colin2328 (created: 2026-03-20 01:26 (UTC+8))
- #37517 Adding DeepEP MoE Test Group. — needs-rebase,ci/build — by Alexei-V-Ivanov-AMD (created: 2026-03-19 13:01 (UTC+8))
- #37594 [issue#37343] fix: prevent TTFT regression by adding batched logprobs budget to scheduler — v1 — by Pineberry1 (created: 2026-03-20 03:49 (UTC+8))
- #37585 [CI] Removing deprecated rlhf examples reference — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-20 02:22 (UTC+8))
- #37592 add integration of xqa and fmha through flashinfer — v1,nvidia — by DanBlanaru (created: 2026-03-20 03:12 (UTC+8))
- #37593 [Refactor] Relocate entrypoint tests to match serving code structure — ci/build,multi-modality — by sfeng33 (created: 2026-03-20 03:21 (UTC+8))
- #37515 [ROCm] Add AITER fused kernel support for DeepSeek MLA attention — rocm,needs-rebase,deepseek — by khairulkabir1661 (created: 2026-03-19 12:34 (UTC+8))
- #37568 [Log] Log once in local node by default — ready — by yewentao256 (created: 2026-03-19 22:45 (UTC+8))
- #37524 [Hardware][GPU] Profiler config additional to increase it scope and annotation details — v1 — by devalshahamd (created: 2026-03-19 14:56 (UTC+8))
- #37586 feat: add IBM POWER8 (ppc64le) CPU backend support — speculative-decoding,ci/build,v1,cpu — by Scottcjn (created: 2026-03-20 02:30 (UTC+8))
- #37574 Fix SpeculatorsConfig now that PreTrainedConfig is a dataclass in Transformers — ready — by hmellor (created: 2026-03-19 23:43 (UTC+8))
- #37580 Nemotron Nano VL: Streamline pixel shuffle — no labels — by milesial (created: 2026-03-20 00:49 (UTC+8))
- #37584 Revert "[BugFix] Correct max memory usage for multiple KV-cache groups" (#36030) — bug,v1 — by zhewenl (created: 2026-03-20 01:47 (UTC+8))
- #37579 [Model] Refactor Step3-VL processor to HF style — ready — by DarkLight1337 (created: 2026-03-20 00:48 (UTC+8))
- #37567 Run MacOS smoke test on daily cron job instead of every commit — ready,ci/build — by hmellor (created: 2026-03-19 22:42 (UTC+8))
- #37535 [P/D] AnthropicMessages add kv_transfer_params for PD disaggregation — frontend,ready,kv-connector — by chaunceyjiang (created: 2026-03-19 16:04 (UTC+8))
- #37578 [Bugfix] Fix unclean shutdown from Ctrl-C with AR Fusion — bug — by simpx (created: 2026-03-20 00:39 (UTC+8))
- #37577 [Bugfix] Fix unclean shutdown from Ctrl-C with AR Fusion — bug,needs-rebase,nvidia — by simpx (created: 2026-03-20 00:26 (UTC+8))
- #37561 [CPU][UX] Do not crash when tcmalloc/libiomp are not ldpreloaded — ready,v1,cpu — by fadara01 (created: 2026-03-19 21:25 (UTC+8))
- #37576 [Metrics] Add Prometheus counter for CUDA graph mode — v1,nvidia — by tlrmchlsmth (created: 2026-03-20 00:14 (UTC+8))
- #37544 [CI] Gate pre-commit on ready label or number of contributions — ready,ci/build — by hmellor (created: 2026-03-19 18:00 (UTC+8))
- #37575 docs: RFC for dynamic LoRA GPU capacity with resolver plugin integration — no labels — by wangchen615 (created: 2026-03-19 23:57 (UTC+8))
- #37560 [Misc] Cleanup more configs and processors — ready,qwen — by DarkLight1337 (created: 2026-03-19 20:30 (UTC+8))
- #37526 [MRV2] Use fp32 for draft logits — ready,v1 — by WoosukKwon (created: 2026-03-19 15:33 (UTC+8))
- #37557 [LoRA] Minor improvements to LoRA log — ready — by jeejeelee (created: 2026-03-19 19:43 (UTC+8))
- #37566 refactor hard coded device string in test files under tests/v1 and tests/lora — speculative-decoding,v1,nvidia — by wincent8 (created: 2026-03-19 22:33 (UTC+8))
- #37547 [Bugfix][ROCm] Fix lru_cache on paged_mqa_logits_module — bug,rocm,v1 — by gronsti-amd (created: 2026-03-19 18:20 (UTC+8))
- #37559 Stop bench CLI from recursively casting all configs to dict — performance,ready — by hmellor (created: 2026-03-19 20:09 (UTC+8))
- #37555 Update typing annotations to use ReadOnly for ConversationMessage — frontend — by ikaadil (created: 2026-03-19 19:27 (UTC+8))
- #37564 [Bugfix] Zero-init NVFP4 padding scales to prevent NaN contamination — bug — by elvircrn (created: 2026-03-19 22:03 (UTC+8))
- #37541 [Misc] Clean up processing logic — ready,qwen — by DarkLight1337 (created: 2026-03-19 17:07 (UTC+8))
- #37552 [CI] Merge cleanup_pr_body.yml and reminder_comment.yml — ready,ci/build — by hmellor (created: 2026-03-19 18:56 (UTC+8))
- #37556 [EPLB][Refactor] Replace boolean state flags with EPLBPhase enum — no labels — by ilmarkov (created: 2026-03-19 19:34 (UTC+8))
- #37536 Fix KV Offloading + MLA AssertionError by using num_kv_heads=1 in cpu… — ready,v1 — by xueliangyang-oeuler (created: 2026-03-19 16:05 (UTC+8))
- #37545 [Model] Remove unnecessary get_language_model — ready — by DarkLight1337 (created: 2026-03-19 18:02 (UTC+8))
- #37542 [CI/Build] Split out MM pooling tests — rocm,ready,ci/build — by DarkLight1337 (created: 2026-03-19 17:39 (UTC+8))
- #37550 [Bugfix] Fix CPU backend crash in KV cache block zeroing — bug,v1,cpu — by DorBernsohn (created: 2026-03-19 18:48 (UTC+8))
- #37538 [Bugfix] Avoid more OpenMP thread reallocation in CPU torch compile — bug,ready,cpu — by bigPYJ1151 (created: 2026-03-19 16:28 (UTC+8))
- #37533 [ROCm] fix sleep mode not releasing GPU memory problem on ROCm — rocm — by aaab8b (created: 2026-03-19 15:52 (UTC+8))
- #37532 [CI] Fix wrong path test file, missing rlhf_async_new_apis.py — ready,ci/build — by tjtanaa (created: 2026-03-19 15:50 (UTC+8))
- #37539 [Performance] Remove unnecessary zero-fill of MLA decode output tensor in Aiter backend — rocm,needs-rebase,v1 — by xaguilar-amd (created: 2026-03-19 16:49 (UTC+8))
- #37540 Fix test for moved example — ci/build — by hmellor (created: 2026-03-19 16:56 (UTC+8))
- #37522 Backward compatible when scaled_fp4_quant.out symbol missing — v1 — by panpan0000 (created: 2026-03-19 14:29 (UTC+8))
- #37534 [P/D] let toy proxy handle Responses API — v1,kv-connector — by chaunceyjiang (created: 2026-03-19 16:00 (UTC+8))
- #37529 [ROCm] Enable MORI EP for unquantized MoE with AITER backend — rocm — by pinsiangamd (created: 2026-03-19 15:39 (UTC+8))
- #37530 [Bugfix] Fix MLA KV cache blocks not zeroed on reuse, causing CUDA crashes under concurrent load — bug,v1,nvidia — by jacob-crux (created: 2026-03-19 15:43 (UTC+8))
- #37510 fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming — frontend,ready — by cdpath (created: 2026-03-19 11:29 (UTC+8))
- #37521 [Bugfix] Fix speculative sampler warmup OOM when using EAGLE — bug,v1 — by wangyxbh (created: 2026-03-19 14:24 (UTC+8))
- #37519 refactor: abstract deepgemm support into platform — nvidia — by SherryC41 (created: 2026-03-19 14:11 (UTC+8))
- #37516 [Core] Fix Sampler Eager Allocator OOM via Zero-Tradeoff Pre-Allocation — v1 — by glaziermag (created: 2026-03-19 12:50 (UTC+8))
- #37514 [MODEL] Cherry-pick: Adding Support for Qwen3.5 Models — documentation,new-model,speculative-decoding,v1,qwen — by ChuanLi1101 (created: 2026-03-19 12:31 (UTC+8))
- #37513 [ROCm] Add AITER fused kernel support for DeepSeek MLA attention — rocm,needs-rebase,deepseek — by khairulkabir1661 (created: 2026-03-19 12:31 (UTC+8))
- #37512 MiniMax-M2: add Eagle3 speculative decoding support — new-model — by liuchenbing2026 (created: 2026-03-19 12:02 (UTC+8))
- #37511 V0.17.1+haosdentfix 34845 — performance,ci/build,v1,qwen,nvidia — by NekoSunflower (created: 2026-03-19 11:46 (UTC+8))
Merged PRs
- #36976 [Bugfix][LoRA] Fix Qwen35 LoRA — bug,ready,ci/build,qwen — by jeejeelee (merged: 2026-03-20 11:09 (UTC+8))
- #36038 [compile][graph_partition]Add tensor size handling — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by fxdawnn (merged: 2026-03-20 10:55 (UTC+8))
- #37207 [Feat] Enable CompressedTensorW4A8Int for XPU — ready — by tianmu-li (merged: 2026-03-20 10:34 (UTC+8))
- #36708 fix: disambiguate multimodal prefix cache keys — ready,v1 — by tianshu-Michael-yu (merged: 2026-03-20 10:33 (UTC+8))
- #37478 [CI] Update mergify tool-calling label paths — ready,ci/build — by sfeng33 (merged: 2026-03-20 10:22 (UTC+8))
- #36056 [Bugfix] Fix Deepseekv32 tool parser when stream interval > 1 — bug,ready,deepseek — by sfeng33 (merged: 2026-03-20 07:51 (UTC+8))
- #37452 Fix DP coordinator ZMQ port TOCTOU — ready,v1 — by itayalroy (merged: 2026-03-20 08:58 (UTC+8))
- #37606 [ROCm][Bugfix] fix cache block size mismatch for aiter unified attention — bug,rocm,ready,v1 — by divakar-amd (merged: 2026-03-20 08:00 (UTC+8))
- #37573 [Bug] Fix EmbedIOprocessor "classify" <-> "embed" — bug,frontend,ready — by yewentao256 (merged: 2026-03-20 07:40 (UTC+8))
- #37572 [Refactor] Remove dead code in pooling model — frontend,ready,v1 — by yewentao256 (merged: 2026-03-20 07:39 (UTC+8))
- #37448 Fix AttributeError in Qwen3.5 GDN layers with quantized models — bug,ready,ci/build,qwen — by jhsmith409 (merged: 2026-03-20 07:21 (UTC+8))
- #36996 [CI][BugFix][AMD] Don't set VLLM_ROCM_USE_AITER anymore in test_rocm_aiter_topk since its not necessary — bug,rocm,ready — by rasmith (merged: 2026-03-20 07:12 (UTC+8))
- #37188 [Performance] Enable Triton autotuning disk cache by default — performance,ready — by arpera (merged: 2026-03-20 05:36 (UTC+8))
- #36064 test Qwen/Qwen3-4B-Instruct-2507 for unbacked — ready,qwen — by laithsakka (merged: 2026-03-20 05:28 (UTC+8))
- #36294 [MoE Refactor] Rename "naive" all2all backend — documentation,ready,nvidia — by bnellnm (merged: 2026-03-20 03:50 (UTC+8))
- #35244 Comment fix for async rl example — documentation,ready — by hao-aaron (merged: 2026-03-20 03:46 (UTC+8))
- #37218 [CI] Add retry with 4x backoff to HTTP fetches for transient failures — rocm,ready — by AndreasKaratzas (merged: 2026-03-20 03:00 (UTC+8))
- #34839 [ROCm][CI] Cleaning and restructuring amd-ci legacy pipeline — rocm,ready,ci/build,v1,kv-connector — by AndreasKaratzas (merged: 2026-03-20 03:30 (UTC+8))
- #33049 [MoE Refactor] DefaultMoERunner simplifcation — documentation,ready,v1 — by bnellnm (merged: 2026-03-20 03:07 (UTC+8))
- #37568 [Log] Log once in local node by default — ready — by yewentao256 (merged: 2026-03-20 03:04 (UTC+8))
- #37574 Fix SpeculatorsConfig now that PreTrainedConfig is a dataclass in Transformers — ready — by hmellor (merged: 2026-03-20 02:04 (UTC+8))
- #37358 [Bugfix] Fix AttributeError when serving MXFP8 models with DeepGEMM installed — bug,ready,nvidia — by EdalatiAli (merged: 2026-03-20 01:58 (UTC+8))
- #37345 [torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict — documentation,ready,llama,qwen — by Lucaskabela (merged: 2026-03-20 01:26 (UTC+8))
- #37567 Run MacOS smoke test on daily cron job instead of every commit — ready,ci/build — by hmellor (merged: 2026-03-20 00:19 (UTC+8))
- #37535 [P/D] AnthropicMessages add kv_transfer_params for PD disaggregation — frontend,ready,kv-connector — by chaunceyjiang (merged: 2026-03-20 00:41 (UTC+8))
- #37561 [CPU][UX] Do not crash when tcmalloc/libiomp are not ldpreloaded — ready,v1,cpu — by fadara01 (merged: 2026-03-20 00:35 (UTC+8))
- #37544 [CI] Gate pre-commit on ready label or number of contributions — ready,ci/build — by hmellor (merged: 2026-03-20 00:21 (UTC+8))
- #37560 [Misc] Cleanup more configs and processors — ready,qwen — by DarkLight1337 (merged: 2026-03-19 23:45 (UTC+8))
- #37346 [Bug] Fix fp8 trtllm MoE modular kernel supported routing methods — bug,ready,nvidia,quantization — by wzhao18 (merged: 2026-03-19 23:43 (UTC+8))
- #37526 [MRV2] Use fp32 for draft logits — ready,v1 — by WoosukKwon (merged: 2026-03-19 23:41 (UTC+8))
- #31509 [1/n] Migrate permute_cols to libtorch stable ABI — documentation,rocm,ready,ci/build,cpu,nvidia — by mikaylagawarecki (merged: 2026-03-19 23:27 (UTC+8))
- #37480 Remove deprecated reasoning_content message field(part-2) — documentation,frontend,ready,v1,qwen,kv-connector,nvidia — by ikaadil (merged: 2026-03-19 23:20 (UTC+8))
- #37557 [LoRA] Minor improvements to LoRA log — ready — by jeejeelee (merged: 2026-03-19 23:18 (UTC+8))
- #37466 Cap the number of API servers to 1 when using Elastic EP. — frontend,ready — by SageMoore (merged: 2026-03-19 22:42 (UTC+8))
- #37559 Stop bench CLI from recursively casting all configs to dict — performance,ready — by hmellor (merged: 2026-03-19 22:04 (UTC+8))
- #37438 [Bugfix] Add Kimi-K2.5 reasoning/tool parser aliases and tool_call_id support — bug,frontend,ready — by DorBernsohn (merged: 2026-03-19 21:08 (UTC+8))
- #37369 fix(cpu): add null check for aligned_alloc in ScratchPadManager — ready,cpu — by yassha (merged: 2026-03-19 17:45 (UTC+8))
- #37541 [Misc] Clean up processing logic — ready,qwen — by DarkLight1337 (merged: 2026-03-19 21:30 (UTC+8))
- #37552 [CI] Merge cleanup_pr_body.yml and reminder_comment.yml — ready,ci/build — by hmellor (merged: 2026-03-19 20:55 (UTC+8))
- #37504 [Refactor] Relocate endpoint tests to mirror serving code directory structure — ready,ci/build — by sfeng33 (merged: 2026-03-19 15:19 (UTC+8))
- #37536 Fix KV Offloading + MLA AssertionError by using num_kv_heads=1 in cpu… — ready,v1 — by xueliangyang-oeuler (merged: 2026-03-19 20:05 (UTC+8))
- #37545 [Model] Remove unnecessary get_language_model — ready — by DarkLight1337 (merged: 2026-03-19 20:00 (UTC+8))
- #37542 [CI/Build] Split out MM pooling tests — rocm,ready,ci/build — by DarkLight1337 (merged: 2026-03-19 19:36 (UTC+8))
- #37458 Don't log exc_info when vLLM tries to doenload a file that doesn't exist — ready — by hmellor (merged: 2026-03-19 18:38 (UTC+8))
- #35592 [Docs] Reorganize pooling docs. — documentation,ready,ci/build — by noooop (merged: 2026-03-19 19:25 (UTC+8))
- #37538 [Bugfix] Avoid more OpenMP thread reallocation in CPU torch compile — bug,ready,cpu — by bigPYJ1151 (merged: 2026-03-19 18:24 (UTC+8))
- #37407 [Bugfix] Fix Nemotron Parse loading — bug,rocm,ready,multi-modality — by DarkLight1337 (merged: 2026-03-19 17:55 (UTC+8))
- #37418 [Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility for defer_input_quant — bug,rocm,ready — by Duyi-Wang (merged: 2026-03-19 17:49 (UTC+8))
- #37532 [CI] Fix wrong path test file, missing rlhf_async_new_apis.py — ready,ci/build — by tjtanaa (merged: 2026-03-19 17:21 (UTC+8))
- #37415 [MISC] fix pin_memory=torch.cuda.is_available(), use is_pin_memory_available — structured-output,ready,v1,nvidia — by jikunshang (merged: 2026-03-19 17:23 (UTC+8))
- #36808 Support temporal compression for Nemotron-3-VL videos — ready,v1 — by collinmccarthy (merged: 2026-03-19 16:02 (UTC+8))
- #37425 [Perf] Fix slow hasattr in CUDAGraphWrapper.getattr — ready,v1,nvidia — by ZeldaHuang (merged: 2026-03-19 15:43 (UTC+8))
- #37510 fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming — frontend,ready — by cdpath (merged: 2026-03-19 15:23 (UTC+8))
- #37310 [SSM/Mamba] Follow-up: N-1 prefill for P/D disaggregation — ready,v1,kv-connector — by ZhanqiuHu (merged: 2026-03-19 15:22 (UTC+8))
- #37009 [ROCm] issue management - request information for bug issues on ROCm — bug,rocm,ready,ci/build — by hongxiayang (merged: 2026-03-19 11:51 (UTC+8))
PRs Closed Without Merging
- #30767 Algo — needs-rebase — by Mercykid-bash (closed: 2026-03-20 11:06 (UTC+8))
- #23247 [Bugfix]Enable zmq router handover to handle scaling-up after scaling-down in EEP — needs-rebase,stale — by wuhang2014 (closed: 2026-03-20 10:21 (UTC+8))
- #23996 [Feature]: Support Phi4Flash model in V1 — needs-rebase,stale,v1 — by aditchawdhary (closed: 2026-03-20 10:21 (UTC+8))
- #37627 [Bugfix] Handle None compilation times in v1 Executor initialize_from_config — bug,v1 — by ec-jt (closed: 2026-03-20 09:43 (UTC+8))
- #36453 fix: Add SM120 capability family check for FlashInfer NVFP4 MoE backends — nvidia — by brandonmmusic-max (closed: 2026-03-20 09:42 (UTC+8))
- #36785 Update rocm get gpu info capability — rocm — by tmm77 (closed: 2026-03-20 05:40 (UTC+8))
- #36967 Allow platform plugins to override gpu_memory_utilization default — frontend,v1 — by aws-navyadhara (closed: 2026-03-20 02:29 (UTC+8))
- #37577 [Bugfix] Fix unclean shutdown from Ctrl-C with AR Fusion — bug,needs-rebase,nvidia — by simpx (closed: 2026-03-20 00:38 (UTC+8))
- #37575 docs: RFC for dynamic LoRA GPU capacity with resolver plugin integration — no labels — by wangchen615 (closed: 2026-03-19 23:58 (UTC+8))
- #24919 Reduce noisy startup logs (Roadmap Q3 2025): demote/remove non-essential INFO logs — frontend — by ikaadil (closed: 2026-03-19 23:42 (UTC+8))
- #37203 refactor hardcoded device strings in vllm tests — speculative-decoding,v1,multi-modality,nvidia — by wincent8 (closed: 2026-03-19 22:36 (UTC+8))
- #34030 [Bugfix] Add reasoning_content backward compat to DeltaMessage for streaming — bug,frontend — by cradonn (closed: 2026-03-19 21:35 (UTC+8))
- #36974 Fix issue 36969 — frontend — by xueliangyang-oeuler (closed: 2026-03-19 21:28 (UTC+8))
- #36866 [Bugfix] Fix tool call streaming JSON separator mismatch — bug,frontend — by xr843 (closed: 2026-03-19 18:43 (UTC+8))
- #37540 Fix test for moved example — ci/build — by hmellor (closed: 2026-03-19 16:57 (UTC+8))
- #37522 Backward compatible when scaled_fp4_quant.out symbol missing — v1 — by panpan0000 (closed: 2026-03-19 16:37 (UTC+8))
- #36111 [Perf] add cute dsl kernel for gdn decode — needs-rebase,qwen — by ZJY0516 (closed: 2026-03-19 15:04 (UTC+8))
- #35567 [XPU][CI] add xpu image build job in vllm CI — ci/build — by jikunshang (closed: 2026-03-19 15:10 (UTC+8))
- #25570 [docs] add more kv-param in nixl usage docs — documentation,unstale,kv-connector — by panpan0000 (closed: 2026-03-19 15:01 (UTC+8))
- #37508 [VLLMZ-905] fix(xpu): Clamp compile warmup sizes to model runner token capacity — v1 — by Liangyx2 (closed: 2026-03-19 14:42 (UTC+8))
- #37516 [Core] Fix Sampler Eager Allocator OOM via Zero-Tradeoff Pre-Allocation — v1 — by glaziermag (closed: 2026-03-19 12:52 (UTC+8))
- #37513 [ROCm] Add AITER fused kernel support for DeepSeek MLA attention — rocm,needs-rebase,deepseek — by khairulkabir1661 (closed: 2026-03-19 12:33 (UTC+8))
- #37511 V0.17.1+haosdentfix 34845 — performance,ci/build,v1,qwen,nvidia — by NekoSunflower (closed: 2026-03-19 11:46 (UTC+8))
- #37509 fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming — frontend — by cdpath (closed: 2026-03-19 11:27 (UTC+8))