vLLM Development Activity Report - 2026-03-25
Time window: 2026-03-25 11:43 (UTC+8) to 2026-03-26 11:43 (UTC+8). Statistics: 29 new issues | 10 closed issues | 92 new PRs | 44 merged PRs | 74 PRs closed without merging
📊 Daily Development Summary
During this period (2026-03-25 to 2026-03-26), the vLLM project maintained a rapid development pace, with a large volume of issues and PRs opened and closed and very high community activity. Core focus areas were AMD ecosystem compatibility and performance optimization, bug fixes around tool calling, and exploration of frontier features (such as streaming video input and new quantization methods). AMD platform support, in particular performance tuning and fixes across architectures (gfx1030/RDNA 2, gfx1201/RDNA 4, MI300 series), was the highlight of this cycle.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was very lively this cycle, spanning performance diagnosis, feature enablement, bug fixes, and test hardening.
New issues:
- #38107: [Bug]: Abnormally bad performance on AMD ROCM gfx1030: a user reported abnormally low performance (single-digit tokens/s) on RDNA 2 hardware (gfx1030, e.g. W6800, 6900XT). Core contributor Michelle-HCJ pinpointed the root cause: bfloat16 has no hardware support on RDNA 2, forcing a fallback to software emulation, and recommended switching to float16. The issue was quickly diagnosed and closed, reflecting the community's deep understanding of AMD platform performance.
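The diagnosis in #38107 can be sketched as a dtype-selection rule. This is an illustrative sketch, not vLLM's actual platform code; the capability table below is an assumption for a few ROCm architectures (RDNA 2 gfx103x lacks hardware bf16 math, while CDNA and RDNA 3+ parts have it):

```python
# Hypothetical capability table (assumption, not from vLLM source):
# architectures with hardware bfloat16 math units.
HW_BF16_ARCHES = {"gfx90a", "gfx942", "gfx1100", "gfx1201"}

def select_dtype(gcn_arch: str, requested: str = "bfloat16") -> str:
    """Fall back to float16 when bf16 would be emulated in software."""
    if requested == "bfloat16" and gcn_arch not in HW_BF16_ARCHES:
        return "float16"  # avoid the software-emulation performance cliff
    return requested

print(select_dtype("gfx1030"))  # RDNA 2 -> float16
print(select_dtype("gfx942"))   # MI300  -> bfloat16
```

On the command line, the workaround recommended in the issue is simply passing `--dtype float16` when serving on such hardware.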
New PRs (open and merged):
- #38086 ([ROCm] Enable VLLM triton FP8 moe for gfx1201): submitted by an AMD engineer (vllmellm); enables Triton FP8 MoE support on RDNA 4 (gfx12xx) GPUs, with performance tuning for Qwen3/Qwen3.5 FP8 MoE models, resolving a previous NotImplementedError caused by failed capability detection.
- #38109 ([Bugfix] Fix FP8 MoE support detection on ROCm): fixes FP8 MoE support detection on ROCm (especially RDNA 4) failing when amdsmi returns a sentinel value or due to outdated GPU identification logic; detection is now centralized in p.supports_fp8().
- #38108 (Fix Device Index for ROCm Ray Workers in MoE Benchmark): fixes a crash caused by an incorrect device index when running the MoE auto-tuning script with Ray on multi-GPU AMD systems.
- #38181 ([ROCm] Fix cp_mha_gather_cache for strided KV views): fixes incorrect KV-cache gather logic in the ROCm backend for hybrid attention + Mamba models (e.g. Qwen3.5-397B-A17B-FP8) with strided KV views.
- #37833 ([ROCm] Fix MoE kernel test failures on gfx950) (merged): fixes multiple MoE kernel test failures on MI355 (gfx950), covering Triton return-value handling, CUDA/ROCm kernel path separation, and FP8 quantization boundary tolerance adjustments, keeping CI stable.
- #36574 ([ROCm] Utilize persistent MLA kernel from AITER) (merged): integrates AITER's persistent MLA decode kernel, which stays resident on the GPU compute units to reduce launch overhead, delivering substantial speedups (up to 2.4x) on DeepSeek-R1.
- #36702 ([ROCm] Attention selector reordering) (merged): reorders the priority of the ROCm attention backend selector, making the native ROCM_ATTN the highest priority (best performance) and refining backend selection across models.
- #38155 ([ROCm][CI] Add LM Eval Qwen3.5 Models test for MI355): adds a GSM8K evaluation CI job for Qwen3.5 models on MI355 GPUs, further strengthening model-quality validation on AMD hardware.
- #38167 ([ROCm][CI] Fix wvSplitKrc mock argument order): fixes a ROCm-specific GEMM unit test failing because mock-function arguments were passed in the wrong order.
- #38088 ([ROCm][CI] Increase OpenAPI schema test timeouts) (merged): extends OpenAPI schema test timeouts to account for potentially slower ROCm CI environments, improving test stability.
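The centralization described in #38109 can be sketched as follows. The class and constant names here are hypothetical (not vLLM's actual classes); the point is the pattern: one `supports_fp8()` capability method replacing scattered architecture-string checks that broke when amdsmi reported a sentinel value:

```python
SENTINEL = "N/A"  # assumed placeholder amdsmi may report for an unreadable arch

class RocmPlatform:
    """Hypothetical platform object; vLLM's real class differs."""
    def __init__(self, gcn_arch: str):
        self.gcn_arch = gcn_arch

    def supports_fp8(self) -> bool:
        # Treat an unreadable arch as unsupported rather than crashing later.
        if self.gcn_arch == SENTINEL:
            return False
        # FP8 MoE paths exist for CDNA 3 (gfx94x) and RDNA 4 (gfx12xx).
        return self.gcn_arch.startswith(("gfx94", "gfx12"))

assert RocmPlatform("gfx1201").supports_fp8()      # RDNA 4
assert RocmPlatform("gfx942").supports_fp8()       # MI300
assert not RocmPlatform("gfx1030").supports_fp8()  # RDNA 2
assert not RocmPlatform(SENTINEL).supports_fp8()   # sentinel handled safely
```

Centralizing the check means every call site inherits the sentinel handling for free, which is the maintainability win the PR aims for.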
Backport PRs (targeting the v0.13.0 branch):
- #38180: backports PR #31380 (fix for Qwen3-Next's non-standard block size of 544) to v0.13.0.
- #38146: backports PR #31282 (fix for an incorrect last_page_len calculation in the AITER MLA decode path) to v0.13.0.
- #38145: backports PR #31816 (fix for an accuracy regression of the ROCM_AITER_TRITON_MLA backend on DeepSeek-V3) to v0.13.0.
Summary: AMD ecosystem work this cycle combined depth and breadth. On one hand, performance problems and missing features on specific architectures (RDNA 2, RDNA 4, CDNA 3) received targeted fixes; on the other, improved CI coverage, kernel bug fixes, and backports of important patches systematically raised vLLM's stability, feature completeness, and user experience across AMD's full hardware stack.
💬 High-Engagement Discussions
- Issue #38106: `tool_choice="required"` + speculative decoding leads to failed tool calls
  - Core issue: combining forced tool calling (`tool_choice="required"`) with speculative decoding (using a Qwen3.5 model) produced tool calls in a non-JSON (XML) format. Parsing failed, yielding an anomalous response with `finish_reason="tool_calls"` but no tool-call list.
  - Discussion:
    - Reporter SvenLorenz provided detailed reproduction steps and logs, suggested the speculative-decoding path may not honor the JSON format constraint implied by `tool_choice`, and offered workarounds (disabling speculative decoding or adjusting the parser).
    - Maintainer chaunceyjiang followed up actively, confirmed this was the first report of its kind, and requested key data be printed to localize the problem.
  - Points of contention: none significant; the open question is the root cause: a logic flaw in speculative decoding, or the model's own output preference under these conditions?
  - Status: still open, in the diagnosis stage.
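The failure mode in #38106 reduces to a parser expecting JSON arguments but receiving an XML-style payload. A minimal sketch (not vLLM's actual tool-parser code) of why that yields an empty tool-call list:

```python
import json

def parse_tool_call(payload: str):
    """Return parsed arguments, or None when the payload is not valid JSON.

    A None result is what leaves the response with
    finish_reason="tool_calls" but no tool calls attached.
    """
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None  # the XML-style output from the draft path lands here

assert parse_tool_call('{"city": "Berlin"}') == {"city": "Berlin"}
assert parse_tool_call("<tool_call><city>Berlin</city></tool_call>") is None
```

This also explains the proposed workarounds: disabling speculative decoding removes the source of the malformed payload, while a more lenient parser would need to recognize the alternate format.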
- Issue #38107: abnormal performance on AMD gfx1030
  - Core issue: an RDNA 2 consumer GPU performing far below expectations under vLLM (compared with llama.cpp).
  - Discussion:
    - User Nero10578 provided the full configuration, commands, and performance data.
    - Contributor Michelle-HCJ gave an authoritative, professional diagnosis: RDNA 2 has no hardware bf16 support, so vLLM's automatic choice of bf16 triggers software emulation and a steep performance drop; the recommendation was to pass `--dtype float16` explicitly.
  - Points of contention: none; a textbook case of technical education and troubleshooting.
  - Outcome: the user applied the suggestion, the problem was resolved, and the issue was closed. A valuable reference for other community users.
- Issue #38141 & PR #38142: Streaming Video Input RFC
  - Core issue: a proposal to add streaming video input handling to vLLM, enabling real-time video understanding applications.
  - Discussion:
    - Author lishunyang12 wrote a very thorough RFC arguing market demand, technical feasibility, and an implementation roadmap.
    - He then submitted implementation PR #38142 but quickly closed it himself, explaining the work would be consolidated in the separate `vllm-omni` repository.
  - Points of contention: none publicly, but the move reflects an architectural decision for managing complex, cross-domain features: incubate advanced "omni-modal" functionality in a dedicated project so the core vLLM engine stays focused.
  - Status: both the RFC and the PR are closed; the work has moved to vllm-project/vllm-omni#2201.
- Issue #38171: [Feature]: Add TurboQuant Support for KV Cache Quantization
  - Core issue: a request to integrate "TurboQuant", a vector quantization method from a recent paper, for KV cache compression.
  - Discussion:
    - Submitter tunglinwood systematically laid out the method's theoretical advantages (near-optimal distortion and unbiased inner-product estimation).
    - Other users (gaby, 1315577677) expressed strong interest and support, calling the potential memory savings for large models "significant".
  - Points of contention: none; mostly community enthusiasm for a cutting-edge technique.
  - Status: open; a high-quality feature proposal.
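For context on why KV-cache quantization matters, here is a sketch of the baseline scale-and-round int8 scheme that proposals like TurboQuant aim to improve on. This is NOT the TurboQuant algorithm from #38171; it only illustrates the memory trade-off (1 byte per element instead of 2 for fp16, at the cost of bounded rounding error):

```python
def quantize(vec):
    """Per-vector symmetric int8 quantization (baseline sketch)."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # avoid zero scale
    q = [round(x / scale) for x in vec]            # int8 payload: 1 byte/elem
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

kv = [0.12, -1.5, 0.88, 0.0]
q, s = quantize(kv)
restored = dequantize(q, s)
# Rounding error per element is at most half the scale step.
assert all(abs(a - b) <= s for a, b in zip(kv, restored))
```

TurboQuant's claimed advantages (near-optimal distortion, unbiased inner-product estimates) are relative to exactly this kind of simple rounding baseline.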
🔥 Hot Topics and Trends
- Deepening AMD hardware support: beyond the dedicated section above, multiple PRs and issues show the community working through "last mile" problems on AMD platforms from consumer to datacenter hardware, including performance tuning, new-architecture enablement, and test stabilization.
- Tool calling and structured output: related bug fixes and discussion remain active (#38106, #38168, #38103). As model capabilities grow, handling complex output formats has become a core challenge for inference serving; parser correctness, streaming support, and compatibility with features such as speculative decoding are maintenance priorities.
- Video and multimodal processing as a new frontier: although the streaming-video RFC moved to a separate repository, it clearly signals an industry trend. PR #38156 (batched video encoding for NanoNemotron-VL) in the same period shows continued investment in multimodal inference efficiency and features.
- Kernel and quantization optimization: the community keeps introducing and refining specialized compute kernels (e.g. #38086 FP8 MoE; #38169 reverting a problematic kernel) and exploring new quantization schemes (e.g. #38171 TurboQuant, #38138 a new online-quantization frontend) to improve performance and broaden hardware support.
🛠️ Key Technical Changes
- PR #38142 ([Realtime] Add streaming video input support) & Issue #38141: though closed, the proposed architecture and protocol design chart a clear path for extending vLLM's real-time audio streaming capability to video, an important boundary-pushing attempt for the project.
- PR #38086 ([ROCm] Enable VLLM triton FP8 moe for gfx1201): marks official FP8 MoE compute support for AMD's latest RDNA 4 architecture, essential for running advanced MoE models on that hardware.
- PR #38174 ([Feature] Universal speculative decoding for heterogeneous vocabularies): implements a universal speculative decoding algorithm (Token-Level Intersection) supporting heterogeneous vocabularies, letting draft and target models use different tokenizers and greatly widening draft-model choice; an important algorithmic improvement.
- PR #38168 ([Bugfix] Fix Hermes tool parser when stream interval > 1): fixes a boundary bug in the Hermes tool parser when the streaming interval is greater than 1. Parser robustness directly affects user experience, so such fixes matter.
- PR #38138 ([wip] new online quantization frontend): a new online-quantization frontend under development, aimed at a more unified and concise configuration interface (e.g. `quantization="fp8_blockwise"`), hinting at further user-API refinement.
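The Token-Level Intersection idea behind #38174 can be sketched at a high level. This is a simplified illustration, not the PR's implementation: with different vocabularies, a draft token can only be verified by the target model if its surface string also exists in the target vocabulary; the mapping runs through that shared intersection:

```python
# Toy vocabularies (assumed for illustration): token string -> token id.
draft_vocab = {"hel": 0, "lo": 1, "world": 2, "!!": 3}
target_vocab = {"hel": 10, "lo": 11, "world": 12, "?": 13}

# Map draft ids -> target ids via the shared token strings (the intersection).
tli_map = {d_id: target_vocab[tok]
           for tok, d_id in draft_vocab.items() if tok in target_vocab}

def translate_draft(draft_ids):
    """Keep only draft proposals the target model can score; stop at the
    first unmappable token, since later positions depend on it."""
    out = []
    for d in draft_ids:
        if d not in tli_map:
            break
        out.append(tli_map[d])
    return out

assert translate_draft([0, 1, 2]) == [10, 11, 12]  # fully shared prefix
assert translate_draft([0, 3, 2]) == [10]          # "!!" not in target vocab
```

In the real algorithm the target model then verifies the translated proposals as in standard speculative decoding; the intersection mapping is what removes the same-tokenizer requirement.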
📈 Development Activity Observations
- High output, high efficiency: 92 new PRs and 44 merges in a single day show a very active contributor base and an efficient review-and-merge pipeline from the core maintainer team.
- Deep AMD contributor involvement: contributors with an `-amd` suffix or clearly from AMD teams (e.g. vllmellm, mgehre-amd, AndreasKaratzas) submitted or participated in many key PRs this cycle, spanning kernels, platform support, and CI testing, and are the driving force behind AMD ecosystem support.
- Quality and testing in balance: alongside many feature merges, numerous PRs focused on fixing tests and improving CI configuration (e.g. #38155, #38167, #38088), reflecting attention to software engineering quality.
💡 Issues Worth Watching
- The bfloat16 performance trap on AMD RDNA 2: the problem surfaced in Issue #38107 is likely widespread. Documentation or default configuration may need clearer warnings or an automatic fallback so that users on GPUs without hardware bf16 (such as RDNA 2) do not hit a performance cliff.
- Robustness of the ROCm CI test framework: Issue #38097 points out that the way test subprocesses are created can silently skip tests, potentially hurting coverage, especially for the ROCm platform.
- Integration path for new quantization techniques: Issue #38171 (TurboQuant) drew a positive community response. How such academic advances are evaluated and folded into vLLM's existing quantization stack is a technical-governance process worth watching.
- Long-term maintenance of tool-call parsers: tool-calling bug reports are frequent and intricate (e.g. interactions with streaming and speculative decoding). Sustained investment is needed to keep this subsystem stable and correct.
📋 Appendix: Detailed Data
New Issues
- #38182 [Bug]: Based on Qwen3.5-35B-A3B, why does enabling MTP speculative decoding actually reduce the prefix cache hit rate? — bug — by uOnePiece (created: 2026-03-26 11:32 (UTC+8))
- #38175 [RFC]: Support ViT Full CUDA Graph (Tracker) — RFC — by shen-shanshan (created: 2026-03-26 10:22 (UTC+8))
- #38171 [Feature]: Add TurboQuant Support for KV Cache Quantization — feature request — by tunglinwood (created: 2026-03-26 09:24 (UTC+8))
- #38107 [Bug]: Abnormally bad performance on AMD ROCM gfx1030 (W6800, V620, 6900XT 6800XT) — bug,rocm — by Nero10578 (created: 2026-03-25 20:19 (UTC+8))
- #38176 [Bug]: qwen3 235B model with latest vllm is going to generate only 1 token. — bug — by jiangtaozh (created: 2026-03-26 10:22 (UTC+8))
- #38173 [Feature]: Universal Speculative Decoding for Heterogeneous Vocabularies (TLI / Token-Level Intersection) — feature request — by wan-danfeng (created: 2026-03-26 09:51 (UTC+8))
- #38170 [Bug]: MFU statistics on the Step-3.5-Flash model are inaccurate — bug — by kangxiaoning (created: 2026-03-26 09:15 (UTC+8))
- #38164 [Feature][Ray]: Support EEP for RayExecutorV2 — feature request — by jeffreywang-anyscale (created: 2026-03-26 07:51 (UTC+8))
- #38159 [Installation]: nightly builds for vllm/vllm-openai stopped three days ago — installation — by CaptainGlac1er (created: 2026-03-26 07:22 (UTC+8))
- #38098 [CI Failure]: LM Eval Large Models (H200) — ci-failure — by ilmarkov (created: 2026-03-25 18:37 (UTC+8))
- #38122 [Bug]: Qwen 3.5 fails to load from GGUF — bug — by megla-tlanghorst (created: 2026-03-26 00:36 (UTC+8))
- #38147 [RFC]: Add Configuration API — feature request — by hickeyma (created: 2026-03-26 05:22 (UTC+8))
- #38141 [RFC] Streaming Video Input for Real-Time Video Understanding — no labels — by lishunyang12 (created: 2026-03-26 04:12 (UTC+8))
- #38132 [Bug]: truncation: "auto" in Responses API returns 400 instead of truncating input — bug — by lukezTT (created: 2026-03-26 02:30 (UTC+8))
- #38131 [Bug]: [CPU Backend] No CPU profiler summary equivalent; CUDA summary flag is silently disabled on CPU — bug,cpu — by Elm8116 (created: 2026-03-26 02:04 (UTC+8))
- #38121 [Usage]: how does cpu offload work? — usage — by JINO-ROHIT (created: 2026-03-26 00:36 (UTC+8))
- #38118 [Doc]: consider adding docstring for a lot of the methods — documentation — by JINO-ROHIT (created: 2026-03-26 00:02 (UTC+8))
- #38069 [RFC]: Can we implement n-gram and suffix speculative decoding in model_runner_v2? — RFC — by lio1226 (created: 2026-03-25 12:42 (UTC+8))
- #38113 [Installation]: Ray not present in Container Image — installation — by ed-pai (created: 2026-03-25 22:43 (UTC+8))
- #38110 [Bug]: `flashinfer-cubin` does not include all cubins/headers — bug — by mgoin (created: 2026-03-25 21:59 (UTC+8))
- #38106 [Bug]: tool_choice="required" + speculative decoding with lukealonso/Qwen3.5-397B-A17B-NVFP4 leads to failed tool calls. — bug — by SvenLorenz (created: 2026-03-25 20:03 (UTC+8))
- #38101 [CI Failure]: Test Eval Marlin Qwen3-30B-A3B-Fp8 — ci-failure — by ilmarkov (created: 2026-03-25 19:17 (UTC+8))
- #38085 [Bug]: Qwen3.5 LoRA module is not in model's supported LoRA target modules — bug — by wufenglailai (created: 2026-03-25 16:21 (UTC+8))
- #38097 [ROCm][CI]: `create_new_process_for_each_test("spawn")` may silently skip tests without `__main__` entrypoint — rocm,ci-failure — by AndreasKaratzas (created: 2026-03-25 18:31 (UTC+8))
- #38079 [RFC] Redesign enable_return_routed_experts to avoid blocking EngineCore event loop — no labels — by junjzhang (created: 2026-03-25 15:15 (UTC+8))
- #38077 [Bug]: Qwen3.5-9B answer !!!!!!!!! — bug — by hoshinoyouyou (created: 2026-03-25 14:59 (UTC+8))
- #38071 [Feature]: fused RMSNorm + fp8 block quantized kernel in Helion — feature request — by aman-coder03 (created: 2026-03-25 12:58 (UTC+8))
- #38064 [Bug] W4A8-INT compressed_tensors silently runs W4A16 — activations never quantized to int8 — no labels — by yeshihai (created: 2026-03-25 11:59 (UTC+8))
- #38063 [Bug] scalar_types.int4 weight type not supported in Marlin kernel, making W4A8-INT models undeployable — no labels — by yeshihai (created: 2026-03-25 11:59 (UTC+8))
Closed Issues
- #38107 [Bug]: Abnormally bad performance on AMD ROCM gfx1030 (W6800, V620, 6900XT 6800XT) — bug,rocm — by Nero10578 (closed: 2026-03-26 11:09 (UTC+8))
- #20316 [Performance]: Optimize beam search code — performance,stale — by zhanggzh (closed: 2026-03-26 10:15 (UTC+8))
- #38141 [RFC] Streaming Video Input for Real-Time Video Understanding — no labels — by lishunyang12 (closed: 2026-03-26 04:19 (UTC+8))
- #35118 [Feature]: Support ubuntu 24.04 runtime container — feature request — by aasgaonkar (closed: 2026-03-26 00:07 (UTC+8))
- #34231 [Performance]: Refactor the exponential_ method calling to improve performance of TopKTopPSampler — performance — by RocMarshal (closed: 2026-03-25 22:59 (UTC+8))
- #24107 [Bug]: CUDA illegal memory access during structured output (xgrammar) crashes vLLM workers and API returns 500 — bug,stale — by ItzAmirreza (closed: 2026-03-25 21:37 (UTC+8))
- #29391 [Bug]: BF16Vec has no fallback options for Arm CPUs with no BF16 support — bug — by fadara01 (closed: 2026-03-25 15:01 (UTC+8))
- #37250 [Bug]: QWEN 3.5-397B-A17B report "RPC call to sample_tokens timed out" — bug — by mahaocong90 (closed: 2026-03-25 14:22 (UTC+8))
- #36849 [Bug]: GPT-OSS-20B /v1/chat/completions streaming crashes with tool_call_parser=openai (IndexError in chat_completion_stream_generator) — bug — by Priananda620 (closed: 2026-03-25 12:06 (UTC+8))
- #37937 [Bug]: IndexError: prev_tool_call_arr list index out of range when streaming tool call hits max_tokens (openai parser) — bug — by onurdemircan-softtech (closed: 2026-03-25 12:06 (UTC+8))
New PRs
- #38183 [Draft][MRV2] Experimental `build_attn_metadata` refactor — v1 — by LucasWilkinson (created: 2026-03-26 11:32 (UTC+8))
- #38181 [ROCm] Fix cp_mha_gather_cache for strided KV views — rocm,v1 — by yuttian1 (created: 2026-03-26 11:13 (UTC+8))
- #38152 Disable dual stream execution of input projection for Qwen3 — ready,qwen — by xyang16 (created: 2026-03-26 05:55 (UTC+8))
- #38158 [Bugfix] Fix shared-object aliasing in n>1 streaming with tool calls — bug,frontend,needs-rebase — by yzong-rh (created: 2026-03-26 07:11 (UTC+8))
- #38177 Support AsyncTP pattern matching for FlashInfer vllm.bmm_fp8 on Blackwell #27893 — nvidia — by baonudesifeizhai (created: 2026-03-26 10:28 (UTC+8))
- #38136 Fix multi-node allreduce fusion — ci/build,nvidia — by wzhao18 (created: 2026-03-26 03:39 (UTC+8))
- #38180 [Bugfix][Backport][ROCm] Backport PR #31380 to v0.13.0: Fix Qwen3-Next inference with non-standard block size (544) — bug,rocm,v1,qwen — by khairulkabir1661 (created: 2026-03-26 11:09 (UTC+8))
- #38092 [Bugfix][CI] Fix Marlin FP8 Linear Kernel for Compressed Tensors Format — bug,ready,ready-run-all-tests — by BadrBasowid (created: 2026-03-25 17:49 (UTC+8))
- #38168 [Bugfix] Fix Hermes tool parser when stream interval > 1 — bug,tool-calling — by sfeng33 (created: 2026-03-26 08:38 (UTC+8))
- #38137 [ROCm][CI] Fix AITER state leak in shared_fused_moe_routed_transform test — rocm,ready — by AndreasKaratzas (created: 2026-03-26 03:43 (UTC+8))
- #38179 [KVTransfer] Fix TpKVTopology.is_kv_replicated equality case — kv-connector — by JianDan0212 (created: 2026-03-26 10:48 (UTC+8))
- #38167 [ROCm][CI] Fix wvSplitKrc mock argument order in test_rocm_unquantized_gemm — rocm,ready — by AndreasKaratzas (created: 2026-03-26 08:34 (UTC+8))
- #38166 [ROCm][CI] Fix CK MXFP4 MoE GEMM crash for unaligned intermediate_size_per_partition — rocm,ready — by AndreasKaratzas (created: 2026-03-26 08:26 (UTC+8))
- #38178 [CI] Fix conch kernel crash on 3D input by reshaping to 2D before GEMM — rocm,ready — by AndreasKaratzas (created: 2026-03-26 10:36 (UTC+8))
- #38123 [compile] Allow strings in custom ops without regressing compilation times — ready,needs-rebase,qwen — by zou3519 (created: 2026-03-26 00:46 (UTC+8))
- #38174 [Feature] Universal speculative decoding for heterogeneous vocabularies (TLI) — speculative-decoding,v1 — by wan-danfeng (created: 2026-03-26 10:09 (UTC+8))
- #38066 feat(quantization): fix W4A8-INT activation quantization and int4 support in Marlin kernel — no labels — by yeshihai (created: 2026-03-25 12:06 (UTC+8))
- #38143 [Bugfix] Apply truncation in Responses API harmony path — bug,frontend,gpt-oss — by saivedant169 (created: 2026-03-26 04:23 (UTC+8))
- #38139 [Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement — ready,v1 — by yewentao256 (created: 2026-03-26 03:57 (UTC+8))
- #38162 [Bugfix] Add missing f-string prefix in xgrammar choices error message — bug,structured-output,v1 — by yzong-rh (created: 2026-03-26 07:47 (UTC+8))
- #38172 [Misc] Add 20 regression tests for 11 tool parser bug fixes — bug,tool-calling,qwen,deepseek — by bbrowning (created: 2026-03-26 09:40 (UTC+8))
- #38076 [Revert] Remove DeepGEMM availability check in DeepseekV32IndexerMetadataBuilder — ready,v1,deepseek — by chaunceyjiang (created: 2026-03-25 14:24 (UTC+8))
- #38082 [Bugfix] Fix benchmark_fused_collective.py — bug,performance,ready — by jeejeelee (created: 2026-03-25 15:53 (UTC+8))
- #38169 Revert "[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration" (#38050) — nvidia — by zhewenl (created: 2026-03-26 09:05 (UTC+8))
- #38165 [ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test containers — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-26 08:11 (UTC+8))
- #38127 Various Transformers v5 fixes — ready,deepseek — by hmellor (created: 2026-03-26 01:35 (UTC+8))
- #38163 [MRV2] Support expert index capture — ci/build,v1 — by WoosukKwon (created: 2026-03-26 07:48 (UTC+8))
- #38161 [ROCm][CI] Fix flaky GPTQ compile correctness test — rocm,ready — by AndreasKaratzas (created: 2026-03-26 07:40 (UTC+8))
- #38083 [Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell — bug,ready,qwen — by vadiklyutiy (created: 2026-03-25 15:58 (UTC+8))
- #38157 feat(health): add --health-port for out-of-band health check process — frontend,v1 — by bolzzzz (created: 2026-03-26 06:59 (UTC+8))
- #38160 [Core] Add hybrid_ep backend — v1,nvidia — by samnordmann (created: 2026-03-26 07:32 (UTC+8))
- #38120 [Cohere] Enable Cohere-Transcribe — documentation,new-model,ready — by ekagra-ranjan (created: 2026-03-26 00:31 (UTC+8))
- #38153 [Refactor] Remove unused utils — ready — by yewentao256 (created: 2026-03-26 05:55 (UTC+8))
- #38156 [Model] Enable batched video encoding for NanoNemotron-VL (RADIO ViT) — v1 — by askliar (created: 2026-03-26 06:50 (UTC+8))
- #38128 [EPLB] Mask padding in EPLB load recording — speculative-decoding,v1 — by ilmarkov (created: 2026-03-26 01:39 (UTC+8))
- #38155 [ROCm][CI] Add LM Eval Qwen3.5 Models test for MI355 — rocm,ready,ci/build,qwen — by AndreasKaratzas (created: 2026-03-26 06:14 (UTC+8))
- #38154 Improve DCP error message to suggest `VLLM_ATTENTION_BACKEND` alternative — v1 — by amitmodi (created: 2026-03-26 06:02 (UTC+8))
- #38151 Add `sp_min_token_num=0` to E2E correctness tests for SP and AsyncTP — no labels — by copilot-swe-agent (created: 2026-03-26 05:49 (UTC+8))
- #38150 [Mistral Grammar] Support Grammar Factory — structured-output,v1,tool-calling — by juliendenize (created: 2026-03-26 05:49 (UTC+8))
- #38146 [Bugfix][Backport][Hardware][AMD] Backport PR #31282 to v0.13.0: Fix last_page_len calculation in AITER MLA decode — bug,rocm,v1 — by khairulkabir1661 (created: 2026-03-26 05:17 (UTC+8))
- #38149 [WIP][HMA] Add configuration API — frontend,v1 — by hickeyma (created: 2026-03-26 05:27 (UTC+8))
- #38148 Fix NaN from stale FP4 scale padding in create_fp4_scale_tensor — no labels — by elvircrn (created: 2026-03-26 05:24 (UTC+8))
- #38111 [Spec Decode, BugFix] Propagate norm_before_fc from Eagle3 speculator — bug,ready — by shubhra (created: 2026-03-25 22:12 (UTC+8))
- #38126 [NVIDIA] Fix DGX Spark logic — ready,ci/build,nvidia — by johnnynunez (created: 2026-03-26 01:35 (UTC+8))
- #38145 [Bugfix][Backport] Backport PR #31816 to v0.13.0: Fix ROCM_AITER_TRITON_MLA accuracy for DeepSeek-V3 — bug,rocm,v1,deepseek — by khairulkabir1661 (created: 2026-03-26 05:11 (UTC+8))
- #38140 [Bugfix] Fix Qwen 3.5 GGUF loading: add model type mapping and vision config d… — bug,qwen — by phycoding (created: 2026-03-26 04:01 (UTC+8))
- #38144 [RFC] Gap C: Replacement rank bootstrapping for elastic EP peer-swap recovery — v1 — by tzulingk (created: 2026-03-26 04:31 (UTC+8))
- #38072 [RFC] Gap D: DP rank peer-swap recovery for elastic EP — v1 — by tzulingk (created: 2026-03-25 13:08 (UTC+8))
- #38142 [Realtime] Add streaming video input support with Qwen3-Omni — documentation,new-model,frontend,v1,qwen — by lishunyang12 (created: 2026-03-26 04:13 (UTC+8))
- #38119 [MultiModal] add support for numpy array embeddings — ready,multi-modality — by guillaumeguy (created: 2026-03-26 00:29 (UTC+8))
- #38138 [wip] new online quantization frontend — frontend — by vkuzo (created: 2026-03-26 03:51 (UTC+8))
- #38135 [Model] Support for nvidia/gpt-oss-puzzle-88B — new-model,frontend,v1,gpt-oss,nvidia — by TomerRonen34 (created: 2026-03-26 03:32 (UTC+8))
- #38133 Fix multi-node allreduce fusion — nvidia — by wzhao18 (created: 2026-03-26 03:14 (UTC+8))
- #38134 [WIP] Remove kv cache dtype enum from csrc — documentation,performance,rocm,speculative-decoding,v1,cpu,kv-connector,nvidia — by MatthewBonanni (created: 2026-03-26 03:22 (UTC+8))
- #38124 [1/N][Misc][Cleanup] Resolve kv cache dtype `"auto"` at init time and eliminate from internal code — documentation,performance,rocm,speculative-decoding,ready,v1,cpu,kv-connector,nvidia — by MatthewBonanni (created: 2026-03-26 01:15 (UTC+8))
- #38125 DOC: Documentation pages fixes — ready — by mtsokol (created: 2026-03-26 01:19 (UTC+8))
- #38129 DOC: TPU mention fix — ready — by mtsokol (created: 2026-03-26 01:52 (UTC+8))
- #38096 [Core][KV Connector] Remove use of num_cached_tokens in error handling — ready,v1 — by markmc (created: 2026-03-25 18:28 (UTC+8))
- #38115 [Frontend] Move APIServerProcessManager target server fn — frontend,ready,v1 — by njhill (created: 2026-03-25 23:16 (UTC+8))
- #38130 [Misc] Switch py-cpuinfo dependency to maintained fork — ci/build,cpu — by akx (created: 2026-03-26 02:00 (UTC+8))
- #38084 [Frontend] Add /v1/responses/render endpoint and refactor responses preprocessing — frontend,needs-rebase — by hyeongyun0916 (created: 2026-03-25 16:17 (UTC+8))
- #38116 Relocate Encoder CUDA graph manager — ready,v1,nvidia — by WoosukKwon (created: 2026-03-25 23:25 (UTC+8))
- #38117 [Doc] Update example docs to include Nemotron Super v3 and Nano 4B — documentation — by Naveassaf (created: 2026-03-25 23:27 (UTC+8))
- #38086 [ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 and Qwen/Qwen3.5-35B-A3B-FP8 tp=2 — rocm,qwen — by vllmellm (created: 2026-03-25 16:29 (UTC+8))
- #38112 Added faster exp routine for lower precision data types. — cpu — by almayne (created: 2026-03-25 22:40 (UTC+8))
- #38081 [Bugfix] Fix V2 model runner crash on hybrid attention models (Qwen3.5) — bug,ready,v1,qwen — by Lidang-Jiang (created: 2026-03-25 15:44 (UTC+8))
- #38109 [Bugfix] Fix FP8 MoE support detection on ROCm when amdsmi returns sentinel value — bug,rocm — by nemanjaudovic (created: 2026-03-25 20:36 (UTC+8))
- #38114 [Bugfix] Add missing ASRDataset import and CLI args in benchmarks/throughput.py — bug,performance — by nemanjaudovic (created: 2026-03-25 23:09 (UTC+8))
- #38068 [MRV2][Attention] Remove `set_workspace_buffer` pattern in FlashInfer — ready,v1,nvidia — by LucasWilkinson (created: 2026-03-25 12:14 (UTC+8))
- #38100 [Bugfix] Map reasoning_effort="none" to enable_thinking=False for Qwen3 chat templates — bug,frontend,qwen — by Lidang-Jiang (created: 2026-03-25 19:03 (UTC+8))
- #38070 [GGUF Kernel] Remove artificial 255 expert limit to support models with more experts — no labels — by guqiong96 (created: 2026-03-25 12:47 (UTC+8))
- #38089 Fix KeyError for multimodal models with vision tower layers (fixes #3… — v1,closed-as-slop — by xueliangyang-oeuler (created: 2026-03-25 17:10 (UTC+8))
- #38073 Fix hidden size mismatch in eagle3 nonparallel draft path (fixes #37966) — speculative-decoding,v1,closed-as-slop — by xueliangyang-oeuler (created: 2026-03-25 13:20 (UTC+8))
- #38067 Fix IndexError in cpu_fused_moe_torch when TP+EP enabled on CPU-only … — cpu,closed-as-slop — by xueliangyang-oeuler (created: 2026-03-25 12:09 (UTC+8))
- #38103 [Bugfix][Tool Parser] Fix Hunyuan A13B nested JSON tool streaming — bug,tool-calling — by vineetatiwari27 (created: 2026-03-25 19:36 (UTC+8))
- #38090 Fix Plamo 2/3 & LFM2 for Transformers v5 — ready — by hmellor (created: 2026-03-25 17:34 (UTC+8))
- #38074 [Model] Add AutoWeightsLoader support for jais — ready — by grYe99 (created: 2026-03-25 13:55 (UTC+8))
- #38108 Fix Device Index for ROCm Ray Workers in MoE Benchmark — performance,rocm — by li-liwen (created: 2026-03-25 20:30 (UTC+8))
- #38102 [ROCm][CI] Rename filepath test to point to correct file — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-25 19:29 (UTC+8))
- #38099 [Bugfix] Add fallback for TokenizersBackend tokenizer class — bug — by Lidang-Jiang (created: 2026-03-25 19:02 (UTC+8))
- #38105 [Docs] Add Granite Vision to supported multimodal models list — documentation — by jesus-talavera-ibm (created: 2026-03-25 19:54 (UTC+8))
- #38104 [bugfix] DeepSeek MTP support and Pipeline Parallel feature. — bug,v1,deepseek — by zzhx1 (created: 2026-03-25 19:37 (UTC+8))
- #38095 Fix offline mode test for Transformers v5 — ready — by hmellor (created: 2026-03-25 18:24 (UTC+8))
- #38088 [ROCm][CI] Increase OpenAPI schema test timeouts — rocm,ready — by AndreasKaratzas (created: 2026-03-25 16:56 (UTC+8))
- #38094 [BugFix][Frontend] pass kv_transfer_params through to sampling_params — bug,frontend — by hhk7734 (created: 2026-03-25 18:16 (UTC+8))
- #38093 [Bugfix] Fix scaled_mm output narrowing for 3D input tensors — bug — by nemanjaudovic (created: 2026-03-25 18:03 (UTC+8))
- #38091 [Bugfix] Fix ImportError for flash_attn < v2.1.2 missing triton rotary module — bug — by Lidang-Jiang (created: 2026-03-25 17:38 (UTC+8))
- #38087 [Bugfix] fix: normalize layer names for kv cache group to prevent KeyError in — bug — by mahendrarathore1742 (created: 2026-03-25 16:36 (UTC+8))
- #38080 [Core] Preempt requests with fewer num_computed_tokens to reduce wasted computation — v1 — by chaunceyjiang (created: 2026-03-25 15:25 (UTC+8))
- #38078 Fix fixed internal ZMQ port collision in multi-model deployment — documentation,v1 — by Aeovy (created: 2026-03-25 15:14 (UTC+8))
- #38075 add qwen demo. — documentation,qwen — by hongbolv (created: 2026-03-25 13:58 (UTC+8))
- #38065 [Perf] FP8 FlashInfer Attn for ViT — performance,v1,multi-modality,qwen,nvidia — by zhandaz (created: 2026-03-25 12:01 (UTC+8))
Merged PRs
- #38152 Disable dual stream execution of input projection for Qwen3 — ready,qwen — by xyang16 (merged: 2026-03-26 09:20 (UTC+8))
- #38029 [Tool Parser][1/3] Pass tools to ToolParser constructor — frontend,ready,tool-calling — by sfeng33 (merged: 2026-03-26 10:29 (UTC+8))
- #34549 [Misc] Optimized check to encapsulate both CUDA and ROCm platforms — rocm,ready,nvidia — by AndreasKaratzas (merged: 2026-03-26 09:43 (UTC+8))
- #38076 [Revert] Remove DeepGEMM availability check in DeepseekV32IndexerMetadataBuilder — ready,v1,deepseek — by chaunceyjiang (merged: 2026-03-26 09:43 (UTC+8))
- #37214 Fix minimax m2.5 nvfp4 kv scales weight loading — ready — by wzhao18 (merged: 2026-03-26 08:48 (UTC+8))
- #37348 [Bugfix] Fix Qwen3.5-FP8 Weight Loading Error on TPU — bug,ready,qwen — by jrplatin (merged: 2026-03-26 08:46 (UTC+8))
- #38127 Various Transformers v5 fixes — ready,deepseek — by hmellor (merged: 2026-03-26 08:10 (UTC+8))
- #38120 [Cohere] Enable Cohere-Transcribe — documentation,new-model,ready — by ekagra-ranjan (merged: 2026-03-26 07:13 (UTC+8))
- #36716 [ROCm]: Update rope+kvcache fusion conditions and disable custom op by default — documentation,rocm,ready — by Rohan138 (merged: 2026-03-26 04:58 (UTC+8))
- #38119 [MultiModal] add support for numpy array embeddings — ready,multi-modality — by guillaumeguy (merged: 2026-03-26 04:13 (UTC+8))
- #36574 [ROCm] Utilize persistent MLA kernel from AITER — rocm,ready,v1 — by SKPsanjeevi (merged: 2026-03-26 03:00 (UTC+8))
- #37833 [ROCm] Fix MoE kernel test failures on gfx950 — rocm,ready,ci/build,gpt-oss — by AndreasKaratzas (merged: 2026-03-26 02:46 (UTC+8))
- #38096 [Core][KV Connector] Remove use of num_cached_tokens in error handling — ready,v1 — by markmc (merged: 2026-03-26 02:20 (UTC+8))
- #38115 [Frontend] Move APIServerProcessManager target server fn — frontend,ready,v1 — by njhill (merged: 2026-03-26 02:14 (UTC+8))
- #37725 [Bugfix] Preserve CUDA arch suffix (a/f) for SM12x — fixes NVFP4 NaN on desktop Blackwell — bug,ready,ci/build,nvidia — by RobTand (merged: 2026-03-25 23:18 (UTC+8))
- #38057 [CI/Docs] Improve aarch64/DGX Spark support for dev setup — documentation,ready — by bbrowning (merged: 2026-03-26 00:24 (UTC+8))
- #35182 [Misc] Reorganize inputs — documentation,performance,rocm,frontend,ready,v1,multi-modality,llama,qwen,deepseek — by DarkLight1337 (merged: 2026-03-26 01:22 (UTC+8))
- #38050 [MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration — ready,nvidia — by zyongye (merged: 2026-03-26 01:16 (UTC+8))
- #36058 [2/n] Migrate per_token_group_quant to torch stable ABI — ready,ci/build,nvidia — by mikaylagawarecki (merged: 2026-03-26 01:15 (UTC+8))
- #38046 [compile] Add some more startup tests for top models — ready,ci/build — by zou3519 (merged: 2026-03-26 00:02 (UTC+8))
- #38048 [Refactor] Rename `WAITING_FOR_FSM` to `WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR` — structured-output,ready,v1 — by yewentao256 (merged: 2026-03-25 23:41 (UTC+8))
- #37970 [Kernel] Optimize SM120 CUTLASS blockwise FP8 GEMM — performance,ready,nvidia — by Nekofish-L (merged: 2026-03-25 23:20 (UTC+8))
- #37488 [Feature] EPLB Support for GPU Model Runner v2 — ready,v1 — by yewentao256 (merged: 2026-03-25 23:16 (UTC+8))
- #37880 [Feature] Support per-draft-model MoE backend via `--speculative-config` — speculative-decoding,ready,v1 — by askliar (merged: 2026-03-25 22:31 (UTC+8))
- #37819 [Docs] Add guide for editing agent instruction files — documentation,ready — by bbrowning (merged: 2026-03-25 21:54 (UTC+8))
- #37902 [Mypy] Better fixes for the `mypy` issues in `vllm/config` — documentation,performance,frontend,ready,v1,multi-modality,kv-connector — by hmellor (merged: 2026-03-25 21:14 (UTC+8))
- #36869 [KVTransfer][Mooncake] Add heterogeneous TP support for disaggregated P/D in MooncakeConnector — ready,kv-connector — by JianDan0212 (merged: 2026-03-25 21:24 (UTC+8))
- #38090 Fix Plamo 2/3 & LFM2 for Transformers v5 — ready — by hmellor (merged: 2026-03-25 20:29 (UTC+8))
- #37607 [CPU][UX][Perf] Enable tcmalloc by default — ready,ci/build,cpu — by fadara01 (merged: 2026-03-25 20:39 (UTC+8))
- #38074 [Model] Add AutoWeightsLoader support for jais — ready — by grYe99 (merged: 2026-03-25 20:38 (UTC+8))
- #38035 Better weight tying check for multimodal models — ready — by hmellor (merged: 2026-03-25 20:07 (UTC+8))
- #37616 [ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test — rocm,frontend,ready — by AndreasKaratzas (merged: 2026-03-25 18:55 (UTC+8))
- #38102 [ROCm][CI] Rename filepath test to point to correct file — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-25 20:05 (UTC+8))
- #37840 [Docs] Adds vllm-musa to custom_op.md — documentation,ready — by yeahdongcn (merged: 2026-03-25 19:54 (UTC+8))
- #37280 [Bugfix] Pass drafter quant_config to ParallelLMHead in Eagle3 — bug,speculative-decoding,ready,v1,llama — by mgehre-amd (merged: 2026-03-25 19:42 (UTC+8))
- #38095 Fix offline mode test for Transformers v5 — ready — by hmellor (merged: 2026-03-25 19:39 (UTC+8))
- #37483 [CI] Fix realtime WebSocket timeout deadlock and unhandled model validation errors — rocm,frontend,ready — by AndreasKaratzas (merged: 2026-03-25 18:24 (UTC+8))
- #38088 [ROCm][CI] Increase OpenAPI schema test timeouts — rocm,ready — by AndreasKaratzas (merged: 2026-03-25 18:06 (UTC+8))
- #37029 [Hardware][XPU] Align memory usage with cuda on xpu — rocm,ready,nvidia — by jikunshang (merged: 2026-03-25 18:14 (UTC+8))
- #37143 [XPU] support MLA model on Intel GPU — ready — by jikunshang (merged: 2026-03-25 17:43 (UTC+8))
- #36702 [ROCm] Attention selector reordering — documentation,rocm,ready,ci/build,v1 — by gshtras (merged: 2026-03-25 17:42 (UTC+8))
- #37968 [Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function — ready,v1,nvidia — by chaunceyjiang (merged: 2026-03-25 14:19 (UTC+8))
- #37958 [Bugfix] Fix IndexError when accessing prev_tool_call_arr in OpenAIToolParser — bug,frontend,ready,tool-calling — by chaunceyjiang (merged: 2026-03-25 12:06 (UTC+8))
- #37640 [ROCm][Test] Fix ROCM_AITER_UNIFIED_ATTN attn+quant fusion test — rocm,ready — by vllmellm (merged: 2026-03-25 13:06 (UTC+8))
关闭但未合并的 PR
- #31328 [ROCm][Perf] Tune fused_moe and add int4 w4a16 config — rocm — by yuttian1 (关闭于: 2026-03-26 10:41 (UTC+8))
- #28829 [XPU][QWEN] support qwen3-30b-a3b gptq int4 and tp=1,2 — stale,qwen — by mayuyuace (关闭于: 2026-03-26 10:15 (UTC+8))
- #37744 [Bugfix] Fix PyTorch stable ABI compatibility for permute_cols — bug,ci/build — by kilork (关闭于: 2026-03-26 06:00 (UTC+8))
- #38142 [Realtime] Add streaming video input support with Qwen3-Omni — documentation,new-model,frontend,v1,qwen — by lishunyang12 (关闭于: 2026-03-26 04:19 (UTC+8))
- #38133 Fix multi-node allreduce fusion — nvidia — by wzhao18 (关闭于: 2026-03-26 03:34 (UTC+8))
- #30980 [Do not merge][Async] Asynchronous DP coordination — v1 — by MatthewBonanni (关闭于: 2026-03-26 03:24 (UTC+8))
- #36448 fix(tokenizer): handle TokenizersBackendFast class for Qwen3.5 GPTQ models — qwen — by giulio-leone (关闭于: 2026-03-26 02:34 (UTC+8))
- #36360 fix(tool_parser): fix hermes parser dropping final brace during MTP streaming — tool-calling — by giulio-leone (关闭于: 2026-03-26 02:34 (UTC+8))
- #37555 Update typing annotations to use ReadOnly for ConversationMessage — frontend — by ikaadil (关闭于: 2026-03-26 00:40 (UTC+8))
- #33814 refactor fp8.py online quant weight loading to use layerwise reload utils — ready,quantization — by vkuzo (关闭于: 2026-03-26 00:18 (UTC+8))
- #37316 [Models][GDN] Reduce D2H syncs in ChunkGatedDeltaRule — 无标签 — by lgeiger (关闭于: 2026-03-25 23:39 (UTC+8))
- #35947 fix: Software E2M1 conversion for SM12x NVFP4 activation quantization — ready,qwen,nvidia — by blake-snc (关闭于: 2026-03-25 23:18 (UTC+8))
- #34230 [vLLM/v1][sample] Refactor the exponential_ method calling to improve performance of TopKTopPSampler — performance,v1 — by RocMarshal (关闭于: 2026-03-25 22:59 (UTC+8))
- #37791 Custom/v0.14.1 — documentation,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,kv-connector,nvidia — by jusikjoa (closed: 2026-03-25 22:01 (UTC+8))
- #35827 Fix issue #35820 — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #35748 Fix issue #35743 — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #37156 Fix issue #37037 — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #37154 Fix reasoning parser CI failure for seedoss and glm4 moe — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #37153 Fix issue#37032 — needs-rebase,kv-connector,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36977 chore(pre-commit): make bash hooks runnable on Windows — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36975 [Bugfix] Avoid KeyError when loading Qwen3.5 MoE expert weights — bug,qwen,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36970 [Bugfix] Avoid KeyError when loading Qwen3.5 MoE expert weights — bug,needs-rebase,qwen,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36873 [ROCm] Fix build issues with cub:: namespace and missing headers — rocm,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36871 [Qwen3] Fix truncated reasoning extraction in parser — qwen,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36760 Fix(Offline Inference): Enable reasoning parser support in LLM class … — frontend,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36758 Fix(DP): Optimize load balancing score calculation (#36748) — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36757 Fix(Scheduler): Reset num_cached_tokens on preemption to prevent acco… — v1,nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36738 Fix(MoE): Relax assertion for non-gated NVFP4 MoE models (#31782) — documentation,needs-rebase,deepseek,nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36727 fix: replace deprecated F.sigmoid with torch.sigmoid in linear_attn (… — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36572 fix: make MiniMaxM2AppendThinkReasoningParser extract reasoning corre… — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36569 fix: prevent KeyError in harmony parser by preserving None fields (is… — frontend,gpt-oss,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36568 fix: revert triton_kernels tag to v3.5.0 to resolve ImportError (issu… — ci/build,kv-connector,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36468 feat: support fp8 kv cache and chunked prefill for minimax model — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36446 feat(v1): add timeout to engine core step to prevent deadlock — needs-rebase,v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36125 Perf: Relax CUDA kernel condition for MoE INT4 — nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #36081 Perf: Optimize DeepEP prepare/finalize for identity mapping — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #35958 Perf: Enable double-buffered chunked EP communication in DeepEP — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #35955 Perf: Implement fused sort/scan for MoE block alignment using Triton — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #35951 Perf: Optimize regex patterns in MiniMaxM2ToolParser — needs-rebase,tool-calling,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #35942 [GGUF] Add support for MiniMax-M2 architecture — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #35844 [V1] Fix illegal memory access and background thread crash during asy… — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
- #38089 Fix KeyError for multimodal models with vision tower layers (fixes #3… — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
- #38073 Fix hidden size mismatch in eagle3 nonparallel draft path (fixes #37966) — speculative-decoding,v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
- #38067 Fix IndexError in cpu_fused_moe_torch when TP+EP enabled on CPU-only … — cpu,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
- #37994 [Bugfix] Add minimax_m2 to eagle3 supported models list — bug,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
- #37896 [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding — bug,needs-rebase,nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
- #37805 [Fix] Share decode output buffer across MLA layers to reduce memory — v1,nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
- #37796 [Fix] Clear stale prompt logprobs on preemption to avoid livelock — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
- #37793 [Fix] Add dynamic engine_id monitoring for MooncakeConnector proxy — documentation,kv-connector,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
- #37750 fix: specify device for torch.Event to prevent multi-GPU issues — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
- #37747 fix: preserve logprobs for control tokens in streaming tool calls — frontend,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
- #37670 fix: set device for prepare_inputs_event to avoid device mismatch — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
- #37662 fix: handle multicasting error in FlashInfer workspace init — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37653 fix: handle multicasting error in FlashInfer workspace init — needs-rebase,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37637 Fix: Add EAGLE/MTP slots calculation in max_num_new_slots_for_drafting — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37412 Fix Spec Decode + NCCL Illegal Memory Access — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37410 Fix SM121 GB10 FP4 quantization sticky CUDA error — ci/build,nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37409 Fix KV Offloading + MLA AssertionError in get_kv_cache_shape — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37403 Fix tensor size mismatch in per-channel weight scale loading for MoE … — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37395 Fix piecewise CUDA graph bugs in split_graph and cuda_graph replay — nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37389 Fix placeholder_block_size undefined error in initialize_kv_cache — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37177 Fix KV cache memory estimation for hybrid Mamba/Attention models — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37172 Fix KV cache size estimation regression in v0.17+ — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37166 Fix issue #37103: Remove shape mismatch warnings in FLA operations — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37163 Fix issue #37103: Remove shape mismatch warnings in FLA operations — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #37161 Fix issue #37103: Remove shape mismatch warnings in FLA operations — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
- #38105 [Docs] Add Granite Vision to supported multimodal models list — documentation — by jesus-talavera-ibm (closed: 2026-03-25 19:55 (UTC+8))
- #36791 fix for the model not even loading and zero accuracy — no labels — by jdebache (closed: 2026-03-25 18:18 (UTC+8))
- #38080 [Core] Preempt requests with fewer num_computed_tokens to reduce wasted computation — v1 — by chaunceyjiang (closed: 2026-03-25 16:22 (UTC+8))
- #37889 [CI] Add Qwen3.5 quantized model GSM8K CI evals for Blackwell — ready,qwen — by vadiklyutiy (closed: 2026-03-25 16:00 (UTC+8))
- #38075 add qwen demo. — documentation,qwen — by hongbolv (closed: 2026-03-25 14:03 (UTC+8))
- #37918 [LinearAttention] Introduce non_spec_query_start_loc_cpu in GDN metadata — ready,v1 — by Jialin (closed: 2026-03-25 13:13 (UTC+8))
- #35488 [WIP][Bugfix] Fix CUDA OOM in sparse_attn_indexer prefill with high concurrency — bug,v1,nvidia — by haosdent (closed: 2026-03-25 12:03 (UTC+8))
- #37088 [WIP][Bugfix] Move GDN warmup after KV cache allocation to fix memory leak (#36973) — bug,v1,qwen — by haosdent (closed: 2026-03-25 11:54 (UTC+8))
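For context on the sampler refactor mentioned in #34230: vLLM's TopKTopPSampler draws categorical samples via the Gumbel-max style `exponential_` trick rather than `torch.multinomial`. Below is a minimal, self-contained sketch of that trick (the function name `sample_with_exponential` is ours for illustration, not vLLM's actual API); the PR in question concerned how the `exponential_` call is issued, not the math itself.

```python
import torch

def sample_with_exponential(probs: torch.Tensor) -> torch.Tensor:
    """Draw one token index per row from a [batch, vocab] probability tensor.

    Dividing probabilities by i.i.d. Exponential(1) noise and taking the
    argmax is equivalent to sampling from the categorical distribution
    (the Gumbel-max trick in exponential form), and avoids a slower
    torch.multinomial call.
    """
    # In-place fill with Exponential(1) noise, same shape/device as probs.
    q = torch.empty_like(probs).exponential_(1.0)
    return probs.div(q).argmax(dim=-1)
```

Because zero-probability entries stay zero after the division, masked-out tokens (e.g. those removed by top-k/top-p filtering) can never be selected.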