vLLM Development Activity Report - 2026-03-25

Time window: 2026-03-25 11:43 (UTC+8) ~ 2026-03-26 11:43 (UTC+8)
Statistics: New Issues 29 | Closed Issues 10 | New PRs 92 | Merged PRs 44 | PRs Closed Without Merging 74

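📊 Daily Development Status Summary

During this cycle (2026-03-25 to 2026-03-26), the vLLM project kept up a fast development pace: 29 new issues, 92 new PRs, and 44 merged PRs. Work concentrated on three areas: AMD ecosystem compatibility and performance, bug fixes for tool calling, and exploratory features such as streaming video input and new quantization methods. AMD platform support was the main focus, in particular performance tuning and fixes across architectures (gfx1030/RDNA 2, gfx1201/RDNA 4, and the MI300 series).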

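🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was heavy this cycle, covering performance diagnosis, feature enablement, bug fixes, and test improvements.

New Issues:

#38107: [Bug]: Abnormally bad performance on AMD ROCM gfx1030: A user reported abnormally low throughput (single-digit tokens/s) on RDNA 2 hardware (gfx1030, e.g. W6800, 6900XT). Contributor Michelle-HCJ identified the root cause: RDNA 2 has no hardware support for the bfloat16 data type, so execution falls back to software emulation. The recommended fix is to use float16. The issue was diagnosed and closed within the cycle (opened 2026-03-25 20:19, closed 2026-03-26 11:09, UTC+8).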

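New PRs (Open & Merged):

#38086 ([ROCm] Enable VLLM triton FP8 moe for gfx1201): Submitted by an AMD employee (vllmellm). Enables Triton FP8 MoE support on RDNA 4 (gfx12xx) GPUs, adds tuning for Qwen3/Qwen3.5 FP8 MoE models, and resolves the NotImplementedError previously raised when support detection failed.
#38109 ([Bugfix] Fix FP8 MoE support detection on ROCm): Fixes FP8 MoE support detection failures on ROCm (RDNA 4 in particular) caused by amdsmi returning a sentinel value or by inaccurate legacy GPU identification logic. Detection is now centralized in p.supports_fp8().
#38108 (Fix Device Index for ROCm Ray Workers in MoE Benchmark): Fixes a crash caused by an incorrect device index when running the MoE auto-tuning script with Ray on multi-GPU AMD systems.
#38181 ([ROCm] Fix cp_mha_gather_cache for strided KV views): Fixes incorrect KV cache gathering in the ROCm backend for hybrid attention + Mamba models with strided KV views (e.g. Qwen3.5-397B-A17B-FP8).
#37833 ([ROCm] Fix MoE kernel test failures on gfx950) (merged): Fixes several failing MoE kernel tests on MI355 (gfx950). The changes cover Triton return-value handling, separation of CUDA and ROCm kernel paths, and adjusted precision tolerances at FP8 quantization boundaries, all to keep CI stable.
#36574 ([ROCm] Utilize persistent MLA kernel from AITER) (merged): Integrates AITER's persistent MLA decode kernel, which keeps the kernel resident on the GPU's compute units to reduce launch overhead, and reports a significant speedup on DeepSeek-R1 (up to 2.4x).
#36702 ([ROCm] Attention selector reordering) (merged): Reorders the priorities in the ROCm attention backend selector, giving the native ROCM_ATTN backend the highest priority (best performance) and improving backend selection across models.
#38155 ([ROCm][CI] Add LM Eval Qwen3.5 Models test for MI355): Adds a GSM8K evaluation CI job for Qwen3.5 models on MI355 GPUs, extending model-quality validation on AMD hardware.
#38167 ([ROCm][CI] Fix wvSplitKrc mock argument order): Fixes a ROCm-specific GEMM unit test that failed because the mocked function's arguments were in the wrong order.
#38088 ([ROCm][CI] Increase OpenAPI schema test timeouts) (merged): Lengthens the OpenAPI schema test timeouts to account for a potentially slower ROCm CI environment, improving test stability.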
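
The detection problem described for #38109 (a management library returning a sentinel instead of a real architecture string) can be illustrated with a small sketch. This is not the code from the PR: `normalize_arch`, `supports_fp8`, the "N/A" sentinel, and the prefix list are all hypothetical, and the prefixes cover only the two architectures this report names as FP8-relevant (gfx1201 and gfx950).

```python
from typing import Optional

# Hypothetical allow-list for illustration; only gfx1201 and gfx950 are named
# in this report. Extend it only with architectures you have verified.
FP8_CAPABLE_ARCH_PREFIXES = ("gfx12", "gfx95")


def normalize_arch(reported: object, fallback: Optional[str]) -> Optional[str]:
    """Return a usable arch string, treating non-string or placeholder values as unknown."""
    if isinstance(reported, str):
        cleaned = reported.strip()
        # "N/A" is an assumed placeholder; the real sentinel is not given in the report.
        if cleaned and cleaned.upper() != "N/A":
            return cleaned.lower()
    return fallback.lower() if fallback else None


def supports_fp8(arch: Optional[str]) -> bool:
    """Single, central capability check keyed on the architecture name."""
    if arch is None:
        return False
    base = arch.split(":")[0]  # drop feature suffixes such as ":sramecc+:xnack-"
    return base.startswith(FP8_CAPABLE_ARCH_PREFIXES)


# A sentinel from the primary source falls back to a secondary source.
print(supports_fp8(normalize_arch("N/A", "gfx1201")))
```

Keeping the capability decision in one function means every caller sees the same answer, which is the design point the PR description makes about centralizing detection.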

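Backport PRs (targeting the v0.13.0 branch):

#38180: Backports PR #31380, which fixes Qwen3-Next support for a non-standard block size (544), to v0.13.0.
#38146: Backports PR #31282, which fixes the last_page_len calculation in the AITER MLA decode path, to v0.13.0.
#38145: Backports PR #31816, which fixes an accuracy regression of the ROCM_AITER_TRITON_MLA backend on DeepSeek-V3, to v0.13.0.

Summary: AMD ecosystem work this cycle combined depth and breadth. In depth, performance problems and missing features on specific architectures (RDNA 2, RDNA 4, CDNA 3) received targeted fixes. In breadth, improved CI coverage, kernel bug fixes, and backports of important patches raised the stability, feature completeness, and user experience of vLLM across the AMD hardware stack.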

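💬 Analysis of High-Traffic Discussions

Issue #38106: tool_choice="required" + speculative decoding leads to failed tool calls
- Core topic: When forced tool calling (tool_choice="required") is combined with speculative decoding (on a Qwen3.5 model), the model emits tool calls in a non-JSON (XML) format. Parsing fails, and the response ends up with finish_reason="tool_calls" but an empty tool call list.
- Views and discussion:
  - Reporter SvenLorenz provided detailed reproduction steps and logs, suggested that the speculative decoding model may not be following the JSON format constraint imposed by tool_choice, and proposed temporary workarounds (disabling speculative decoding or modifying the parser).
  - Maintainer chaunceyjiang followed up, confirmed this was the first report of the problem, and asked for key data to be printed to help locate it.
- Point of contention: No significant disagreement. The open question is the root cause: a logic flaw in speculative decoding, or the model's own output preference under these conditions?
- Current status: The issue remains open and is in the diagnosis stage.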
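
The symptom in #38106 (finish_reason="tool_calls" with no tool calls attached) is easy to detect on the client side. The following is a minimal sketch that assumes the response has already been parsed into a dict in the OpenAI chat-completions shape; the helper name `find_inconsistent_choices` is hypothetical and is not part of vLLM.

```python
from typing import Any, Dict, List


def find_inconsistent_choices(response: Dict[str, Any]) -> List[int]:
    """Return indices of choices that claim tool calls but carry none."""
    bad: List[int] = []
    for index, choice in enumerate(response.get("choices", [])):
        message = choice.get("message") or {}
        tool_calls = message.get("tool_calls") or []
        if choice.get("finish_reason") == "tool_calls" and not tool_calls:
            bad.append(index)
    return bad


# Example payload: the first choice reproduces the reported symptom.
example = {
    "choices": [
        {"finish_reason": "tool_calls", "message": {"content": "<tool_call>", "tool_calls": []}},
        {"finish_reason": "tool_calls", "message": {"tool_calls": [{"id": "call_1"}]}},
        {"finish_reason": "stop", "message": {"content": "done"}},
    ]
}
print(find_inconsistent_choices(example))
```

A check like this only surfaces the problem so a caller can retry or fall back; it does not address the root cause under investigation in the issue.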
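Issue #38107: Abnormal performance on AMD gfx1030
- Core topic: RDNA 2 consumer GPUs perform far below expectations under vLLM (compared with llama.cpp).
- Views and discussion:
  - User Nero10578 provided the full configuration, commands, and performance data.
  - Contributor Michelle-HCJ diagnosed the problem: RDNA 2 has no hardware bf16 support, so vLLM's automatic choice of bf16 leads to software emulation and a sharp performance drop. The recommendation was to pass --dtype float16 explicitly.
- Point of contention: None. This was a straightforward diagnose-and-fix exchange.
- Outcome: The user applied the suggestion, the problem was resolved, and the issue was closed. The case is a useful reference for other users of RDNA 2 hardware.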
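
The workaround from #38107 amounts to overriding the automatic dtype choice on hardware without native bf16. A minimal sketch of that decision follows; `choose_dtype` and `ARCHS_WITHOUT_HW_BF16` are hypothetical names, and the set contains only gfx1030 because that is the only architecture this report identifies.

```python
# Architectures reported as lacking hardware bfloat16 support.
# Only gfx1030 is named in this report; add others only after verifying them.
ARCHS_WITHOUT_HW_BF16 = {"gfx1030"}


def choose_dtype(gcn_arch: str, requested: str = "auto") -> str:
    """Pick float16 instead of an automatic dtype on archs without hardware bf16."""
    base = gcn_arch.split(":")[0].lower()  # drop feature suffixes like ":xnack-"
    if requested == "auto" and base in ARCHS_WITHOUT_HW_BF16:
        return "float16"
    # Explicit user choices are left untouched.
    return requested


print(choose_dtype("gfx1030"))
```

The returned string can be passed as the `--dtype` value on the vLLM command line, which is the flag recommended in the issue thread.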
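Issue #38141 & PR #38142: Streaming Video Input RFC
- Core topic: A proposal to add streaming video input handling to vLLM in support of real-time video understanding applications.
- Views and discussion:
  - Author lishunyang12 wrote a detailed RFC covering market demand, technical feasibility, and an implementation roadmap.
  - He then submitted implementation PR #38142 but closed it himself shortly afterward, explaining that the work would move to the separate vllm-omni repository.
- Point of contention: No public disagreement. The move does reflect an architectural decision about how the project handles complex, cross-domain features: advanced "omni-modal" functionality is incubated in a dedicated project so that the core vLLM engine stays focused.
- Current status: Both the RFC and the PR are closed; the work has moved to vllm-project/vllm-omni#2201.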
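Issue #38171: [Feature]: Add TurboQuant Support for KV Cache Quantization
- Core topic: A request to integrate "TurboQuant", a vector quantization method proposed in a recent paper, for KV cache compression.
- Views and discussion:
  - Reporter tunglinwood laid out the method's theoretical advantages (near-optimal distortion rate, unbiased inner-product estimation).
  - Other users (gaby, 1315577677) expressed strong interest and support, calling the potential memory savings for large models "significant".
- Point of contention: None; the thread mostly shows community enthusiasm for the technique.
- Current status: The issue is open and stands as a well-argued feature proposal.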

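🔥 Hot Topics and Trends

Deepening AMD hardware support: Beyond the dedicated section above, multiple PRs and issues show the community working through "last mile" problems on AMD platforms from consumer to data-center parts, including performance tuning, new-architecture enablement, and test stabilization.
Tool calling and structured output: Bug fixes and discussion remain frequent (e.g. #38106, #38168, #38103). As model capabilities grow, handling complex output formats has become a core challenge for inference serving. The parsers, streaming support, and compatibility with features such as speculative decoding are the main maintenance concerns.
Video and multimodal processing as a new frontier: Although the streaming video RFC moved to a separate repository, it points to an industry trend. PR #38156 (batched video encoding for NanoNemotron-VL), opened in the same period, shows that efficiency and feature expansion for multimodal inference remain areas of ongoing investment.
Kernel and quantization optimization: The community keeps introducing and refining specialized kernels (e.g. #38086 for FP8 MoE, #38169 reverting a problematic kernel) and exploring new quantization schemes (e.g. #38171 TurboQuant, #38138 a new online quantization frontend), with the goals of better performance and broader hardware support.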

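🛠️ Key Technical Changes

PR #38142 ([Realtime] Add streaming video input support) & Issue #38141: Although both are closed, the proposed architecture and protocol design lay out a clear path for extending vLLM's real-time audio streaming capability to video, and mark a notable attempt to widen the project's scope.
PR #38086 ([ROCm] Enable VLLM triton FP8 moe for gfx1201): This PR, still open at the end of the cycle, would bring FP8 MoE compute support to AMD's latest RDNA 4 architecture, which is needed to run current MoE models on that hardware.
PR #38174 ([Feature] Universal speculative decoding for heterogeneous vocabularies): Implements a universal speculative decoding algorithm for heterogeneous vocabularies (Token-Level Intersection), allowing the draft and target models to use different vocabularies. This widens the choice of draft models for speculative decoding.
PR #38168 ([Bugfix] Fix Hermes tool parser when stream interval > 1): Fixes a boundary error in the Hermes tool parser when the streaming output interval is greater than 1. Tool parser robustness directly affects user experience, which makes fixes of this kind important.
PR #38138 ([wip] new online quantization frontend): A new online quantization frontend under development, aimed at a more unified and concise user-facing configuration interface (e.g. quantization="fp8_blockwise"), pointing to further refinement of the user API.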
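
The idea behind Token-Level Intersection in PR #38174 can be sketched generically: restrict the draft model to tokens that also exist in the target vocabulary, then translate ids between the two. The code below is an illustration under that reading only, not the PR's implementation; both function names and the toy vocabularies are hypothetical.

```python
from typing import Dict, List


def build_intersection(draft_vocab: Dict[str, int], target_vocab: Dict[str, int]) -> Dict[int, int]:
    """Map draft token ids to target token ids for tokens present in both vocabularies."""
    return {
        draft_id: target_vocab[token]
        for token, draft_id in draft_vocab.items()
        if token in target_vocab
    }


def restrict_and_renormalize(draft_probs: List[float], draft_to_target: Dict[int, int]) -> Dict[int, float]:
    """Keep draft probability mass on shared tokens only, renormalize, and key by target id."""
    kept = {
        draft_to_target[draft_id]: prob
        for draft_id, prob in enumerate(draft_probs)
        if draft_id in draft_to_target
    }
    total = sum(kept.values())
    if total == 0.0:
        return {}  # no shared token has any mass; the caller must skip drafting
    return {target_id: prob / total for target_id, prob in kept.items()}


# Toy vocabularies: "hello" and "world" are shared, "##x" and "foo" are not.
draft_vocab = {"hello": 0, "world": 1, "##x": 2}
target_vocab = {"world": 10, "hello": 11, "foo": 12}
mapping = build_intersection(draft_vocab, target_vocab)
shared_probs = restrict_and_renormalize([0.2, 0.2, 0.6], mapping)
print(mapping, shared_probs)
```

Tokens proposed this way are always valid target tokens, so the target model can verify them with its usual acceptance test.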

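📈 Development Activity Observations

High output and throughput: 92 new PRs and 44 merges in a single day indicate very active community contribution and fast review and merge turnaround from the core maintainers.
Deep involvement from AMD contributors: Contributors whose usernames carry an -amd suffix or who clearly belong to AMD teams (e.g. vllmellm, mgehre-amd, AndreasKaratzas) submitted or participated in many key PRs this cycle, spanning kernels, platform support, and CI testing. They are the main force behind AMD ecosystem support.
Equal attention to code quality and testing: Alongside the many new features merged, a large number of PRs focused on fixing tests and improving CI configuration (e.g. #38155, #38167, #38088), reflecting attention to software engineering quality.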

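💡 Issues Worth Watching

The bfloat16 performance trap on AMD RDNA 2: The problem exposed by Issue #38107 is likely to affect many users. The project documentation or default configuration may need a clearer warning or an automatic fallback so that users of RDNA 2 and other GPUs without hardware bf16 do not hit a performance cliff.
Robustness of the ROCm CI test framework: Issue #38097 points out that the way test subprocesses are created can cause tests to be skipped silently. This may reduce test coverage, especially for the ROCm platform.
Integration path for new quantization techniques: Issue #38171 (TurboQuant) drew a positive community response. How advanced quantization methods from academia are evaluated and integrated into vLLM's existing quantization stack is a process worth following.
Long-term maintenance of tool-calling parsers: Tool-calling bug reports are frequent and intricate (for example, interactions with streaming and speculative decoding). Keeping this subsystem stable and correct will need sustained investment.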

📋 Appendix: Detailed Data Lists

New Issues

#38182 [Bug]: Based on Qwen3.5-35B-A3B, why does enabling MTP speculative decoding actually reduce the prefix cache hit rate? — bug — by uOnePiece (created: 2026-03-26 11:32 (UTC+8))
#38175 [RFC]: Support ViT Full CUDA Graph (Tracker) — RFC — by shen-shanshan (created: 2026-03-26 10:22 (UTC+8))
#38171 [Feature]: Add TurboQuant Support for KV Cache Quantization — feature request — by tunglinwood (created: 2026-03-26 09:24 (UTC+8))
#38107 [Bug]: Abnormally bad performance on AMD ROCM gfx1030 (W6800, V620, 6900XT 6800XT) — bug,rocm — by Nero10578 (created: 2026-03-25 20:19 (UTC+8))
#38176 [Bug]: qwen3 235B model with latest vllm is going to generate only 1 token. — bug — by jiangtaozh (created: 2026-03-26 10:22 (UTC+8))
#38173 [Feature]: Universal Speculative Decoding for Heterogeneous Vocabularies (TLI / Token-Level Intersection) — feature request — by wan-danfeng (created: 2026-03-26 09:51 (UTC+8))
#38170 [Bug]: MFU statistics on the Step-3.5-Flash model are inaccurate — bug — by kangxiaoning (created: 2026-03-26 09:15 (UTC+8))
#38164 [Feature][Ray]: Support EEP for RayExecutorV2 — feature request — by jeffreywang-anyscale (created: 2026-03-26 07:51 (UTC+8))
#38159 [Installation]: nightly builds for vllm/vllm-openai stopped three days ago — installation — by CaptainGlac1er (created: 2026-03-26 07:22 (UTC+8))
#38098 [CI Failure]: LM Eval Large Models (H200) — ci-failure — by ilmarkov (created: 2026-03-25 18:37 (UTC+8))
#38122 [Bug]: Qwen 3.5 fails to load from GGUF — bug — by megla-tlanghorst (created: 2026-03-26 00:36 (UTC+8))
#38147 [RFC]: Add Configuration API — feature request — by hickeyma (created: 2026-03-26 05:22 (UTC+8))
#38141 [RFC] Streaming Video Input for Real-Time Video Understanding — no labels — by lishunyang12 (created: 2026-03-26 04:12 (UTC+8))
#38132 [Bug]: truncation: "auto" in Responses API returns 400 instead of truncating input — bug — by lukezTT (created: 2026-03-26 02:30 (UTC+8))
#38131 [Bug]: [CPU Backend] No CPU profiler summary equivalent; CUDA summary flag is silently disabled on CPU — bug,cpu — by Elm8116 (created: 2026-03-26 02:04 (UTC+8))
#38121 [Usage]: how does cpu offload work? — usage — by JINO-ROHIT (created: 2026-03-26 00:36 (UTC+8))
#38118 [Doc]: consider adding docstring for a lot of the methods — documentation — by JINO-ROHIT (created: 2026-03-26 00:02 (UTC+8))
#38069 [RFC]: Can we implement n-gram and suffix speculative decoding in model_runner_v2? — RFC — by lio1226 (created: 2026-03-25 12:42 (UTC+8))
#38113 [Installation]: Ray not present in Container Image — installation — by ed-pai (created: 2026-03-25 22:43 (UTC+8))
#38110 [Bug]: flashinfer-cubin does not include all cubins/headers — bug — by mgoin (created: 2026-03-25 21:59 (UTC+8))
#38106 [Bug]: tool_choice="required" + speculative decoding with lukealonso/Qwen3.5-397B-A17B-NVFP4 leads to failed tool calls. — bug — by SvenLorenz (created: 2026-03-25 20:03 (UTC+8))
#38101 [CI Failure]: Test Eval Marlin Qwen3-30B-A3B-Fp8 — ci-failure — by ilmarkov (created: 2026-03-25 19:17 (UTC+8))
#38085 [Bug]: Qwen3.5 LoRA module is not in model's supported LoRA target modules — bug — by wufenglailai (created: 2026-03-25 16:21 (UTC+8))
#38097 [ROCm][CI]: create_new_process_for_each_test("spawn") may silently skip tests without __main__ entrypoint — rocm,ci-failure — by AndreasKaratzas (created: 2026-03-25 18:31 (UTC+8))
#38079 [RFC] Redesign enable_return_routed_experts to avoid blocking EngineCore event loop — no labels — by junjzhang (created: 2026-03-25 15:15 (UTC+8))
#38077 [Bug]: Qwen3.5-9B answer !!!!!!!!! — bug — by hoshinoyouyou (created: 2026-03-25 14:59 (UTC+8))
#38071 [Feature]: fused RMSNorm + fp8 block quantized kernel in Helion — feature request — by aman-coder03 (created: 2026-03-25 12:58 (UTC+8))
#38064 [Bug] W4A8-INT compressed_tensors silently runs W4A16 — activations never quantized to int8 — no labels — by yeshihai (created: 2026-03-25 11:59 (UTC+8))
#38063 [Bug] scalar_types.int4 weight type not supported in Marlin kernel, making W4A8-INT models undeployable — no labels — by yeshihai (created: 2026-03-25 11:59 (UTC+8))

Closed Issues

#38107 [Bug]: Abnormally bad performance on AMD ROCM gfx1030 (W6800, V620, 6900XT 6800XT) — bug,rocm — by Nero10578 (closed: 2026-03-26 11:09 (UTC+8))
#20316 [Performance]: Optimize beam search code — performance,stale — by zhanggzh (closed: 2026-03-26 10:15 (UTC+8))
#38141 [RFC] Streaming Video Input for Real-Time Video Understanding — no labels — by lishunyang12 (closed: 2026-03-26 04:19 (UTC+8))
#35118 [Feature]: Support ubuntu 24.04 runtime container — feature request — by aasgaonkar (closed: 2026-03-26 00:07 (UTC+8))
#34231 [Performance]: Refactor the exponential_ method calling to improve performance of TopKTopPSampler — performance — by RocMarshal (closed: 2026-03-25 22:59 (UTC+8))
#24107 [Bug]: CUDA illegal memory access during structured output (xgrammar) crashes vLLM workers and API returns 500 — bug,stale — by ItzAmirreza (closed: 2026-03-25 21:37 (UTC+8))
#29391 [Bug]: BF16Vec has no fallback options for Arm CPUs with no BF16 support — bug — by fadara01 (closed: 2026-03-25 15:01 (UTC+8))
#37250 [Bug]: QWEN 3.5-397B-A17B report "RPC call to sample_tokens timed out" — bug — by mahaocong90 (closed: 2026-03-25 14:22 (UTC+8))
#36849 [Bug]: GPT-OSS-20B /v1/chat/completions streaming crashes with tool_call_parser=openai (IndexError in chat_completion_stream_generator) — bug — by Priananda620 (closed: 2026-03-25 12:06 (UTC+8))
#37937 [Bug]: IndexError: prev_tool_call_arr list index out of range when streaming tool call hits max_tokens (openai parser) — bug — by onurdemircan-softtech (closed: 2026-03-25 12:06 (UTC+8))

New PRs

#38183 [Draft][MRV2] Experimental build_attn_metadata refactor — v1 — by LucasWilkinson (created: 2026-03-26 11:32 (UTC+8))
#38181 [ROCm] Fix cp_mha_gather_cache for strided KV views — rocm,v1 — by yuttian1 (created: 2026-03-26 11:13 (UTC+8))
#38152 Disable dual stream execution of input projection for Qwen3 — ready,qwen — by xyang16 (created: 2026-03-26 05:55 (UTC+8))
#38158 [Bugfix] Fix shared-object aliasing in n>1 streaming with tool calls — bug,frontend,needs-rebase — by yzong-rh (created: 2026-03-26 07:11 (UTC+8))
#38177 Support AsyncTP pattern matching for FlashInfer vllm.bmm_fp8 on Blackwell#27893 — nvidia — by baonudesifeizhai (created: 2026-03-26 10:28 (UTC+8))
#38136 Fix multi-node allreduce fusion — ci/build,nvidia — by wzhao18 (created: 2026-03-26 03:39 (UTC+8))
#38180 [Bugfix][Backport][ROCm] Backport PR #31380 to v0.13.0: Fix Qwen3-Next inference with non-standard block size (544) — bug,rocm,v1,qwen — by khairulkabir1661 (created: 2026-03-26 11:09 (UTC+8))
#38092 [Bugfix][CI] Fix Marlin FP8 Linear Kernel for Compressed Tensors Format — bug,ready,ready-run-all-tests — by BadrBasowid (created: 2026-03-25 17:49 (UTC+8))
#38168 [Bugfix] Fix Hermes tool parser when stream interval > 1 — bug,tool-calling — by sfeng33 (created: 2026-03-26 08:38 (UTC+8))
#38137 [ROCm][CI] Fix AITER state leak in shared_fused_moe_routed_transform test — rocm,ready — by AndreasKaratzas (created: 2026-03-26 03:43 (UTC+8))
#38179 [KVTransfer] Fix TpKVTopology.is_kv_replicated equality case — kv-connector — by JianDan0212 (created: 2026-03-26 10:48 (UTC+8))
#38167 [ROCm][CI] Fix wvSplitKrc mock argument order in test_rocm_unquantized_gemm — rocm,ready — by AndreasKaratzas (created: 2026-03-26 08:34 (UTC+8))
#38166 [ROCm][CI] Fix CK MXFP4 MoE GEMM crash for unaligned intermediate_size_per_partition — rocm,ready — by AndreasKaratzas (created: 2026-03-26 08:26 (UTC+8))
#38178 [CI] Fix conch kernel crash on 3D input by reshaping to 2D before GEMM — rocm,ready — by AndreasKaratzas (created: 2026-03-26 10:36 (UTC+8))
#38123 [compile] Allow strings in custom ops without regressing compilation times — ready,needs-rebase,qwen — by zou3519 (created: 2026-03-26 00:46 (UTC+8))
#38174 [Feature] Universal speculative decoding for heterogeneous vocabularies (TLI) — speculative-decoding,v1 — by wan-danfeng (created: 2026-03-26 10:09 (UTC+8))
#38066 feat(quantization): fix W4A8-INT activation quantization and int4 support in Marlin kernel — no labels — by yeshihai (created: 2026-03-25 12:06 (UTC+8))
#38143 [Bugfix] Apply truncation in Responses API harmony path — bug,frontend,gpt-oss — by saivedant169 (created: 2026-03-26 04:23 (UTC+8))
#38139 [Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement — ready,v1 — by yewentao256 (created: 2026-03-26 03:57 (UTC+8))
#38162 [Bugfix] Add missing f-string prefix in xgrammar choices error message — bug,structured-output,v1 — by yzong-rh (created: 2026-03-26 07:47 (UTC+8))
#38172 [Misc] Add 20 regression tests for 11 tool parser bug fixes — bug,tool-calling,qwen,deepseek — by bbrowning (created: 2026-03-26 09:40 (UTC+8))
#38076 [Revert] Remove DeepGEMM availability check in DeepseekV32IndexerMetadataBuilder — ready,v1,deepseek — by chaunceyjiang (created: 2026-03-25 14:24 (UTC+8))
#38082 [Bugfix] Fix benchmark_fused_collective.py — bug,performance,ready — by jeejeelee (created: 2026-03-25 15:53 (UTC+8))
#38169 Revert "[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration" (#38050) — nvidia — by zhewenl (created: 2026-03-26 09:05 (UTC+8))
#38165 [ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test containers — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-26 08:11 (UTC+8))
#38127 Various Transformers v5 fixes — ready,deepseek — by hmellor (created: 2026-03-26 01:35 (UTC+8))
#38163 [MRV2] Support expert index capture — ci/build,v1 — by WoosukKwon (created: 2026-03-26 07:48 (UTC+8))
#38161 [ROCm][CI] Fix flaky GPTQ compile correctness test — rocm,ready — by AndreasKaratzas (created: 2026-03-26 07:40 (UTC+8))
#38083 [Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell — bug,ready,qwen — by vadiklyutiy (created: 2026-03-25 15:58 (UTC+8))
#38157 feat(health): add --health-port for out-of-band health check process — frontend,v1 — by bolzzzz (created: 2026-03-26 06:59 (UTC+8))
#38160 [Core] Add hybrid_ep backend — v1,nvidia — by samnordmann (created: 2026-03-26 07:32 (UTC+8))
#38120 [Cohere] Enable Cohere-Transcribe — documentation,new-model,ready — by ekagra-ranjan (created: 2026-03-26 00:31 (UTC+8))
#38153 [Refactor] Remove unused utils — ready — by yewentao256 (created: 2026-03-26 05:55 (UTC+8))
#38156 [Model] Enable batched video encoding for NanoNemotron-VL (RADIO ViT) — v1 — by askliar (created: 2026-03-26 06:50 (UTC+8))
#38128 [EPLB] Mask padding in EPLB load recording — speculative-decoding,v1 — by ilmarkov (created: 2026-03-26 01:39 (UTC+8))
#38155 [ROCm][CI] Add LM Eval Qwen3.5 Models test for MI355 — rocm,ready,ci/build,qwen — by AndreasKaratzas (created: 2026-03-26 06:14 (UTC+8))
#38154 Improve DCP error message to suggest VLLM_ATTENTION_BACKEND alternative — v1 — by amitmodi (created: 2026-03-26 06:02 (UTC+8))
#38151 Add sp_min_token_num=0 to E2E correctness tests for SP and AsyncTP — no labels — by copilot-swe-agent (created: 2026-03-26 05:49 (UTC+8))
#38150 [Mistral Grammar] Support Grammar Factory — structured-output,v1,tool-calling — by juliendenize (created: 2026-03-26 05:49 (UTC+8))
#38146 [Bugfix][Backport][Hardware][AMD] Backport PR #31282 to v0.13.0: Fix last_page_len calculation in AITER MLA decode — bug,rocm,v1 — by khairulkabir1661 (created: 2026-03-26 05:17 (UTC+8))
#38149 [WIP][HMA] Add configuration API — frontend,v1 — by hickeyma (created: 2026-03-26 05:27 (UTC+8))
#38148 Fix NaN from stale FP4 scale padding in create_fp4_scale_tensor — no labels — by elvircrn (created: 2026-03-26 05:24 (UTC+8))
#38111 [Spec Decode, BugFix] Propagate norm_before_fc from Eagle3 speculator — bug,ready — by shubhra (created: 2026-03-25 22:12 (UTC+8))
#38126 [NVIDIA] Fix DGX Spark logic — ready,ci/build,nvidia — by johnnynunez (created: 2026-03-26 01:35 (UTC+8))
#38145 [Bugfix][Backport] Backport PR #31816 to v0.13.0: Fix ROCM_AITER_TRITON_MLA accuracy for DeepSeek-V3 — bug,rocm,v1,deepseek — by khairulkabir1661 (created: 2026-03-26 05:11 (UTC+8))
#38140 [Bugfix] Fix Qwen 3.5 GGUF loading: add model type mapping and vision config d… — bug,qwen — by phycoding (created: 2026-03-26 04:01 (UTC+8))
#38144 [RFC] Gap C: Replacement rank bootstrapping for elastic EP peer-swap recovery — v1 — by tzulingk (created: 2026-03-26 04:31 (UTC+8))
#38072 [RFC] Gap D: DP rank peer-swap recovery for elastic EP — v1 — by tzulingk (created: 2026-03-25 13:08 (UTC+8))
#38142 [Realtime] Add streaming video input support with Qwen3-Omni — documentation,new-model,frontend,v1,qwen — by lishunyang12 (created: 2026-03-26 04:13 (UTC+8))
#38119 [MultiModal] add support for numpy array embeddings — ready,multi-modality — by guillaumeguy (created: 2026-03-26 00:29 (UTC+8))
#38138 [wip] new online quantization frontend — frontend — by vkuzo (created: 2026-03-26 03:51 (UTC+8))
#38135 [Model] Support for nvidia/gpt-oss-puzzle-88B — new-model,frontend,v1,gpt-oss,nvidia — by TomerRonen34 (created: 2026-03-26 03:32 (UTC+8))
#38133 Fix multi-node allreduce fusion — nvidia — by wzhao18 (created: 2026-03-26 03:14 (UTC+8))
#38134 [WIP] Remove kv cache dtype enum from csrc — documentation,performance,rocm,speculative-decoding,v1,cpu,kv-connector,nvidia — by MatthewBonanni (created: 2026-03-26 03:22 (UTC+8))
#38124 [1/N][Misc][Cleanup] Resolve kv cache dtype "auto" at init time and eliminate from internal code — documentation,performance,rocm,speculative-decoding,ready,v1,cpu,kv-connector,nvidia — by MatthewBonanni (created: 2026-03-26 01:15 (UTC+8))
#38125 DOC: Documentation pages fixes — ready — by mtsokol (created: 2026-03-26 01:19 (UTC+8))
#38129 DOC: TPU mention fix — ready — by mtsokol (created: 2026-03-26 01:52 (UTC+8))
#38096 [Core][KV Connector] Remove use of num_cached_tokens in error handling — ready,v1 — by markmc (created: 2026-03-25 18:28 (UTC+8))
#38115 [Frontend] Move APIServerProcessManager target server fn — frontend,ready,v1 — by njhill (created: 2026-03-25 23:16 (UTC+8))
#38130 [Misc] Switch py-cpuinfo dependency to maintained fork — ci/build,cpu — by akx (created: 2026-03-26 02:00 (UTC+8))
#38084 [Frontend] Add /v1/responses/render endpoint and refactor responses preprocessing — frontend,needs-rebase — by hyeongyun0916 (created: 2026-03-25 16:17 (UTC+8))
#38116 Relocate Encoder CUDA graph manager — ready,v1,nvidia — by WoosukKwon (created: 2026-03-25 23:25 (UTC+8))
#38117 [Doc] Update example docs to include Nemotron Super v3 and Nano 4B — documentation — by Naveassaf (created: 2026-03-25 23:27 (UTC+8))
#38086 [ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 and Qwen/Qwen3.5-35B-A3B-FP8 tp=2 — rocm,qwen — by vllmellm (created: 2026-03-25 16:29 (UTC+8))
#38112 Added faster exp routine for lower precision data types. — cpu — by almayne (created: 2026-03-25 22:40 (UTC+8))
#38081 [Bugfix] Fix V2 model runner crash on hybrid attention models (Qwen3.5) — bug,ready,v1,qwen — by Lidang-Jiang (created: 2026-03-25 15:44 (UTC+8))
#38109 [Bugfix] Fix FP8 MoE support detection on ROCm when amdsmi returns sentinel value — bug,rocm — by nemanjaudovic (created: 2026-03-25 20:36 (UTC+8))
#38114 [Bugfix] Add missing ASRDataset import and CLI args in benchmarks/throughput.py — bug,performance — by nemanjaudovic (created: 2026-03-25 23:09 (UTC+8))
#38068 [MRV2][Attention] Remove set_workspace_buffer pattern in FlashInfer — ready,v1,nvidia — by LucasWilkinson (created: 2026-03-25 12:14 (UTC+8))
#38100 [Bugfix] Map reasoning_effort="none" to enable_thinking=False for Qwen3 chat templates — bug,frontend,qwen — by Lidang-Jiang (created: 2026-03-25 19:03 (UTC+8))
#38070 [GGUF Kernel] Remove artificial 255 expert limit to support models with more experts — no labels — by guqiong96 (created: 2026-03-25 12:47 (UTC+8))
#38089 Fix KeyError for multimodal models with vision tower layers (fixes #3… — v1,closed-as-slop — by xueliangyang-oeuler (created: 2026-03-25 17:10 (UTC+8))
#38073 Fix hidden size mismatch in eagle3 nonparallel draft path (fixes #37966) — speculative-decoding,v1,closed-as-slop — by xueliangyang-oeuler (created: 2026-03-25 13:20 (UTC+8))
#38067 Fix IndexError in cpu_fused_moe_torch when TP+EP enabled on CPU-only … — cpu,closed-as-slop — by xueliangyang-oeuler (created: 2026-03-25 12:09 (UTC+8))
#38103 [Bugfix][Tool Parser] Fix Hunyuan A13B nested JSON tool streaming — bug,tool-calling — by vineetatiwari27 (created: 2026-03-25 19:36 (UTC+8))
#38090 Fix Plamo 2/3 & LFM2 for Transformers v5 — ready — by hmellor (created: 2026-03-25 17:34 (UTC+8))
#38074 [Model] Add AutoWeightsLoader support for jais — ready — by grYe99 (created: 2026-03-25 13:55 (UTC+8))
#38108 Fix Device Index for ROCm Ray Workers in MoE Benchmark — performance,rocm — by li-liwen (created: 2026-03-25 20:30 (UTC+8))
#38102 [ROCm][CI] Rename filepath test to point to correct file — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-25 19:29 (UTC+8))
#38099 [Bugfix] Add fallback for TokenizersBackend tokenizer class — bug — by Lidang-Jiang (created: 2026-03-25 19:02 (UTC+8))
#38105 [Docs] Add Granite Vision to supported multimodal models list — documentation — by jesus-talavera-ibm (created: 2026-03-25 19:54 (UTC+8))
#38104 [bugfix] DeepSeek MTP support and Pipeline Parallel feature. — bug,v1,deepseek — by zzhx1 (created: 2026-03-25 19:37 (UTC+8))
#38095 Fix offline mode test for Transformers v5 — ready — by hmellor (created: 2026-03-25 18:24 (UTC+8))
#38088 [ROCm][CI] Increase OpenAPI schema test timeouts — rocm,ready — by AndreasKaratzas (created: 2026-03-25 16:56 (UTC+8))
#38094 [BugFix][Frontend] pass kv_transfer_params through to sampling_params — bug,frontend — by hhk7734 (created: 2026-03-25 18:16 (UTC+8))
#38093 [Bugfix] Fix scaled_mm output narrowing for 3D input tensors — bug — by nemanjaudovic (created: 2026-03-25 18:03 (UTC+8))
#38091 [Bugfix] Fix ImportError for flash_attn < v2.1.2 missing triton rotary module — bug — by Lidang-Jiang (created: 2026-03-25 17:38 (UTC+8))
#38087 [Bugfix] fix: normalize layer names for kv cache group to prevent KeyError in — bug — by mahendrarathore1742 (created: 2026-03-25 16:36 (UTC+8))
#38080 [Core] Preempt requests with fewer num_computed_tokens to reduce wasted computation — v1 — by chaunceyjiang (created: 2026-03-25 15:25 (UTC+8))
#38078 Fix fixed internal ZMQ port collision in multi-model deployment — documentation,v1 — by Aeovy (created: 2026-03-25 15:14 (UTC+8))
#38075 add qwen demo. — documentation,qwen — by hongbolv (created: 2026-03-25 13:58 (UTC+8))
#38065 [Perf] FP8 FlashInfer Attn for ViT — performance,v1,multi-modality,qwen,nvidia — by zhandaz (created: 2026-03-25 12:01 (UTC+8))

Merged PRs

#38152 Disable dual stream execution of input projection for Qwen3 — ready,qwen — by xyang16 (merged: 2026-03-26 09:20 (UTC+8))
#38029 [Tool Parser][1/3] Pass tools to ToolParser constructor — frontend,ready,tool-calling — by sfeng33 (merged: 2026-03-26 10:29 (UTC+8))
#34549 [Misc] Optimized check to encapsulate both CUDA and ROCm platforms — rocm,ready,nvidia — by AndreasKaratzas (merged: 2026-03-26 09:43 (UTC+8))
#38076 [Revert] Remove DeepGEMM availability check in DeepseekV32IndexerMetadataBuilder — ready,v1,deepseek — by chaunceyjiang (merged: 2026-03-26 09:43 (UTC+8))
#37214 Fix minimax m2.5 nvfp4 kv scales weight loading — ready — by wzhao18 (merged: 2026-03-26 08:48 (UTC+8))
#37348 [Bugfix] Fix Qwen3.5-FP8 Weight Loading Error on TPU — bug,ready,qwen — by jrplatin (merged: 2026-03-26 08:46 (UTC+8))
#38127 Various Transformers v5 fixes — ready,deepseek — by hmellor (merged: 2026-03-26 08:10 (UTC+8))
#38120 [Cohere] Enable Cohere-Transcribe — documentation,new-model,ready — by ekagra-ranjan (merged: 2026-03-26 07:13 (UTC+8))
#36716 [ROCm]: Update rope+kvcache fusion conditions and disable custom op by default — documentation,rocm,ready — by Rohan138 (merged: 2026-03-26 04:58 (UTC+8))
#38119 [MultiModal] add support for numpy array embeddings — ready,multi-modality — by guillaumeguy (merged: 2026-03-26 04:13 (UTC+8))
#36574 [ROCm] Utilize persistent MLA kernel from AITER — rocm,ready,v1 — by SKPsanjeevi (merged: 2026-03-26 03:00 (UTC+8))
#37833 [ROCm] Fix MoE kernel test failures on gfx950 — rocm,ready,ci/build,gpt-oss — by AndreasKaratzas (merged: 2026-03-26 02:46 (UTC+8))
#38096 [Core][KV Connector] Remove use of num_cached_tokens in error handling — ready,v1 — by markmc (merged: 2026-03-26 02:20 (UTC+8))
#38115 [Frontend] Move APIServerProcessManager target server fn — frontend,ready,v1 — by njhill (merged: 2026-03-26 02:14 (UTC+8))
#37725 [Bugfix] Preserve CUDA arch suffix (a/f) for SM12x — fixes NVFP4 NaN on desktop Blackwell — bug,ready,ci/build,nvidia — by RobTand (merged: 2026-03-25 23:18 (UTC+8))
#38057 [CI/Docs] Improve aarch64/DGX Spark support for dev setup — documentation,ready — by bbrowning (merged: 2026-03-26 00:24 (UTC+8))
#35182 [Misc] Reorganize inputs — documentation,performance,rocm,frontend,ready,v1,multi-modality,llama,qwen,deepseek — by DarkLight1337 (merged: 2026-03-26 01:22 (UTC+8))
#38050 [MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration — ready,nvidia — by zyongye (merged: 2026-03-26 01:16 (UTC+8))
#36058 [2/n] Migrate per_token_group_quant to torch stable ABI — ready,ci/build,nvidia — by mikaylagawarecki (merged: 2026-03-26 01:15 (UTC+8))
#38046 [compile] Add some more startup tests for top models — ready,ci/build — by zou3519 (merged: 2026-03-26 00:02 (UTC+8))
#38048 [Refactor] Rename WAITING_FOR_FSM to WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR — structured-output,ready,v1 — by yewentao256 (merged: 2026-03-25 23:41 (UTC+8))
#37970 [Kernel] Optimize SM120 CUTLASS blockwise FP8 GEMM — performance,ready,nvidia — by Nekofish-L (merged: 2026-03-25 23:20 (UTC+8))
#37488 [Feature] EPLB Support for GPU Model Runner v2 — ready,v1 — by yewentao256 (merged: 2026-03-25 23:16 (UTC+8))
#37880 [Feature] Support per-draft-model MoE backend via --speculative-config — speculative-decoding,ready,v1 — by askliar (merged: 2026-03-25 22:31 (UTC+8))
#37819 [Docs] Add guide for editing agent instruction files — documentation,ready — by bbrowning (merged: 2026-03-25 21:54 (UTC+8))
#37902 [Mypy] Better fixes for the mypy issues in vllm/config — documentation,performance,frontend,ready,v1,multi-modality,kv-connector — by hmellor (merged: 2026-03-25 21:14 (UTC+8))
#36869 [KVTransfer][Mooncake] Add heterogeneous TP support for disaggregated P/D in MooncakeConnector — ready,kv-connector — by JianDan0212 (merged: 2026-03-25 21:24 (UTC+8))
#38090 Fix Plamo 2/3 & LFM2 for Transformers v5 — ready — by hmellor (merged: 2026-03-25 20:29 (UTC+8))
#37607 [CPU][UX][Perf] Enable tcmalloc by default — ready,ci/build,cpu — by fadara01 (merged: 2026-03-25 20:39 (UTC+8))
#38074 [Model] Add AutoWeightsLoader support for jais — ready — by grYe99 (merged: 2026-03-25 20:38 (UTC+8))
#38035 Better weight tying check for multimodal models — ready — by hmellor (merged: 2026-03-25 20:07 (UTC+8))
#37616 [ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test — rocm,frontend,ready — by AndreasKaratzas (merged: 2026-03-25 18:55 (UTC+8))
#38102 [ROCm][CI] Rename filepath test to point to correct file — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-25 20:05 (UTC+8))
#37840 [Docs] Adds vllm-musa to custom_op.md — documentation,ready — by yeahdongcn (merged: 2026-03-25 19:54 (UTC+8))
#37280 [Bugfix] Pass drafter quant_config to ParallelLMHead in Eagle3 — bug,speculative-decoding,ready,v1,llama — by mgehre-amd (merged: 2026-03-25 19:42 (UTC+8))
#38095 Fix offline mode test for Transformers v5 — ready — by hmellor (merged: 2026-03-25 19:39 (UTC+8))
#37483 [CI] Fix realtime WebSocket timeout deadlock and unhandled model validation errors — rocm,frontend,ready — by AndreasKaratzas (merged: 2026-03-25 18:24 (UTC+8))
#38088 [ROCm][CI] Increase OpenAPI schema test timeouts — rocm,ready — by AndreasKaratzas (merged: 2026-03-25 18:06 (UTC+8))
#37029 [Hardware][XPU] Align memory usage with cuda on xpu — rocm,ready,nvidia — by jikunshang (merged: 2026-03-25 18:14 (UTC+8))
#37143 [XPU] support MLA model on Intel GPU — ready — by jikunshang (merged: 2026-03-25 17:43 (UTC+8))
#36702 [ROCm] Attention selector reordering — documentation,rocm,ready,ci/build,v1 — by gshtras (merged: 2026-03-25 17:42 (UTC+8))
#37968 [Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function — ready,v1,nvidia — by chaunceyjiang (merged: 2026-03-25 14:19 (UTC+8))
#37958 [Bugfix] Fix IndexError when accessing prev_tool_call_arr in OpenAIToolParser — bug,frontend,ready,tool-calling — by chaunceyjiang (merged: 2026-03-25 12:06 (UTC+8))
#37640 [ROCm][Test] Fix ROCM_AITER_UNIFIED_ATTN attn+quant fusion test — rocm,ready — by vllmellm (merged: 2026-03-25 13:06 (UTC+8))

关闭但未合并的 PR

#31328 [ROCm][Perf] Tune fused_moe and add int4 w4a16 config — rocm — by yuttian1 (关闭于: 2026-03-26 10:41 (UTC+8))
#28829 [XPU][QWEN] support qwen3-30b-a3b gptq int4 and tp=1,2 — stale,qwen — by mayuyuace (关闭于: 2026-03-26 10:15 (UTC+8))
#37744 [Bugfix] Fix PyTorch stable ABI compatibility for permute_cols — bug,ci/build — by kilork (关闭于: 2026-03-26 06:00 (UTC+8))
#38142 [Realtime] Add streaming video input support with Qwen3-Omni — documentation,new-model,frontend,v1,qwen — by lishunyang12 (关闭于: 2026-03-26 04:19 (UTC+8))
#38133 Fix multi-node allreduce fusion — nvidia — by wzhao18 (关闭于: 2026-03-26 03:34 (UTC+8))
#30980 [Do not merge][Async] Asynchronous DP coordination — v1 — by MatthewBonanni (关闭于: 2026-03-26 03:24 (UTC+8))
#36448 fix(tokenizer): handle TokenizersBackendFast class for Qwen3.5 GPTQ models — qwen — by giulio-leone (关闭于: 2026-03-26 02:34 (UTC+8))
#36360 fix(tool_parser): fix hermes parser dropping final brace during MTP streaming — tool-calling — by giulio-leone (关闭于: 2026-03-26 02:34 (UTC+8))
#37555 Update typing annotations to use ReadOnly for ConversationMessage — frontend — by ikaadil (关闭于: 2026-03-26 00:40 (UTC+8))
#33814 refactor fp8.py online quant weight loading to use layerwise reload utils — ready,quantization — by vkuzo (关闭于: 2026-03-26 00:18 (UTC+8))
#37316 [Models][GDN] Reduce D2H syncs in ChunkGatedDeltaRule — 无标签 — by lgeiger (关闭于: 2026-03-25 23:39 (UTC+8))
#35947 fix: Software E2M1 conversion for SM12x NVFP4 activation quantization — ready,qwen,nvidia — by blake-snc (关闭于: 2026-03-25 23:18 (UTC+8))
#34230 [vLLM/v1][sample] Refactor the exponential_ method calling to improve performance of TopKTopPSampler — performance,v1 — by RocMarshal (关闭于: 2026-03-25 22:59 (UTC+8))
#37791 Custom/v0.14.1 — documentation,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,kv-connector,nvidia — by jusikjoa (关闭于: 2026-03-25 22:01 (UTC+8))
#35827 Fix issue #35820 — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#35748 Fix issue #35743 — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#37156 Fix issue #37037 — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#37154 Fix reasoning parser CI failure for seedoss and glm4 moe — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#37153 Fix issue#37032 — needs-rebase,kv-connector,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36977 chore(pre-commit): make bash hooks runnable on Windows — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36975 [Bugfix] Avoid KeyError when loading Qwen3.5 MoE expert weights — bug,qwen,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36970 [Bugfix] Avoid KeyError when loading Qwen3.5 MoE expert weights — bug,needs-rebase,qwen,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36873 [ROCm] Fix build issues with cub:: namespace and missing headers — rocm,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36871 [Qwen3] Fix truncated reasoning extraction in parser — qwen,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36760 Fix(Offline Inference): Enable reasoning parser support in LLM class … — frontend,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36758 Fix(DP): Optimize load balancing score calculation (#36748) — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36757 Fix(Scheduler): Reset num_cached_tokens on preemption to prevent acco… — v1,nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36738 Fix(MoE): Relax assertion for non-gated NVFP4 MoE models (#31782) — documentation,needs-rebase,deepseek,nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36727 fix: replace deprecated F.sigmoid with torch.sigmoid in linear_attn (… — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36572 fix: make MiniMaxM2AppendThinkReasoningParser extract reasoning corre… — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36569 fix: prevent KeyError in harmony parser by preserving None fields (is… — frontend,gpt-oss,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36568 fix: revert triton_kernels tag to v3.5.0 to resolve ImportError (issu… — ci/build,kv-connector,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36468 feat: support fp8 kv cache and chunked prefill for minimax model — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36446 feat(v1): add timeout to engine core step to prevent deadlock — needs-rebase,v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36125 Perf: Relax CUDA kernel condition for MoE INT4 — nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#36081 Perf: Optimize DeepEP prepare/finalize for identity mapping — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#35958 Perf: Enable double-buffered chunked EP communication in DeepEP — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#35955 Perf: Implement fused sort/scan for MoE block alignment using Triton — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#35951 Perf: Optimize regex patterns in MiniMaxM2ToolParser — needs-rebase,tool-calling,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#35942 [GGUF] Add support for MiniMax-M2 architecture — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#35844 [V1] Fix illegal memory access and background thread crash during asy… — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:37 (UTC+8))
#38089 Fix KeyError for multimodal models with vision tower layers (fixes #3… — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
#38073 Fix hidden size mismatch in eagle3 nonparallel draft path (fixes #37966) — speculative-decoding,v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
#38067 Fix IndexError in cpu_fused_moe_torch when TP+EP enabled on CPU-only … — cpu,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
#37994 [Bugfix] Add minimax_m2 to eagle3 supported models list — bug,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
#37896 [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding — bug,needs-rebase,nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
#37805 [Fix] Share decode output buffer across MLA layers to reduce memory — v1,nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
#37796 [Fix] Clear stale prompt logprobs on preemption to avoid livelock — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
#37793 [Fix] Add dynamic engine_id monitoring for MooncakeConnector proxy — documentation,kv-connector,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
#37750 fix: specify device for torch.Event to prevent multi-GPU issues — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
#37747 fix: preserve logprobs for control tokens in streaming tool calls — frontend,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
#37670 fix: set device for prepare_inputs_event to avoid device mismatch — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:36 (UTC+8))
#37662 fix: handle multicasting error in FlashInfer workspace init — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37653 fix: handle multicasting error in FlashInfer workspace init — needs-rebase,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37637 Fix: Add EAGLE/MTP slots calculation in max_num_new_slots_for_drafting — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37412 Fix Spec Decode + NCCL Illegal Memory Access — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37410 Fix SM121 GB10 FP4 quantization sticky CUDA error — ci/build,nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37409 Fix KV Offloading + MLA AssertionError in get_kv_cache_shape — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37403 Fix tensor size mismatch in per-channel weight scale loading for MoE … — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37395 Fix piecewise CUDA graph bugs in split_graph and cuda_graph replay — nvidia,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37389 Fix placeholder_block_size undefined error in initialize_kv_cache — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37177 Fix KV cache memory estimation for hybrid Mamba/Attention models — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37172 Fix KV cache size estimation regression in v0.17+ — v1,closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37166 Fix issue #37103: Remove shape mismatch warnings in FLA operations — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37163 Fix issue #37103: Remove shape mismatch warnings in FLA operations — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#37161 Fix issue #37103: Remove shape mismatch warnings in FLA operations — closed-as-slop — by xueliangyang-oeuler (closed: 2026-03-25 21:35 (UTC+8))
#38105 [Docs] Add Granite Vision to supported multimodal models list — documentation — by jesus-talavera-ibm (closed: 2026-03-25 19:55 (UTC+8))
#36791 fix for the model not even loading and zero accuracy — no labels — by jdebache (closed: 2026-03-25 18:18 (UTC+8))
#38080 [Core] Preempt requests with fewer num_computed_tokens to reduce wasted computation — v1 — by chaunceyjiang (closed: 2026-03-25 16:22 (UTC+8))
#37889 [CI] Add Qwen3.5 quantized model GSM8K CI evals for Blackwell — ready,qwen — by vadiklyutiy (closed: 2026-03-25 16:00 (UTC+8))
#38075 add qwen demo. — documentation,qwen — by hongbolv (closed: 2026-03-25 14:03 (UTC+8))
#37918 [LinearAttention] Introduce non_spec_query_start_loc_cpu in GDN metadata — ready,v1 — by Jialin (closed: 2026-03-25 13:13 (UTC+8))
#35488 [WIP][Bugfix] Fix CUDA OOM in sparse_attn_indexer prefill with high concurrency — bug,v1,nvidia — by haosdent (closed: 2026-03-25 12:03 (UTC+8))
#37088 [WIP][Bugfix] Move GDN warmup after KV cache allocation to fix memory leak (#36973) — bug,v1,qwen — by haosdent (closed: 2026-03-25 11:54 (UTC+8))