vLLM Development Activity Report - 2026-03-11
Time window: 2026-03-11 11:27 (UTC+8) to 2026-03-12 11:27 (UTC+8). Stats: 33 new issues | 25 issues closed | 73 new PRs | 71 PRs merged | 30 PRs closed unmerged
📊 Daily Development Summary
In the 24-hour window from March 11 to 12, 2026, the vLLM project stayed extremely active, opening 73 PRs and merging 71. Progress centered on AMD/ROCm performance optimization, support for new models (especially multimodal ones), and compatibility fixes for large-scale system deployments (e.g. DGX Spark). The community also held in-depth discussions around bug fixes and performance work on several core features (quantization, tool calling, the scheduler).
🎯 AMD/ROCm Ecosystem Updates
AMD-ecosystem contributions were very active this cycle, centered on performance optimization, feature completion, and compatibility fixes.
- Performance:
  - PR #36810 (new): [ROCm][Perf] Fused GEMM + static FP8 output quantization. Submitted by andyluo7. The core idea is fusing the FP8 GEMM and static quantization into a single `torch._scaled_mm` call that emits FP8 directly, eliminating the DRAM round trip of a BF16 intermediate. On MI300X this yields 1.5x-1.6x speedups on common model shapes (e.g. 4096x4096), leveraging native FP8-output support in hipBLASLt on ROCm 6.0+.
  - PR #36743 (new): [ROCm] Optimize concat_mla_q for CDNA3 (MI300X) and CDNA4 (MI355X). Also by andyluo7, with per-architecture tuning: non-temporal store hints on CDNA3 improve HBM bandwidth efficiency (up to 23% faster at small token counts), and 256-bit-wide loads/stores on CDNA4 halve the vectorized loop's iteration count.
  - PR #35093 (merged): added tuned kernel configs for the Kimi K2.5 MoE model on CDNA4, delivering a significant end-to-end gain at TP=8 (lower TTFT and TPOT).
- Features and compatibility:
  - PR #36785 (new): Update rocm get gpu info capability. Submitted by tmm77; lets amdsmi fetch GPU info when torch is unavailable, supporting more flexible deployments (e.g. with the Ray backend).
  - PR #35923 (merged): fixed the ROCm attention backend failing on models with non-standard block sizes (e.g. 1056), such as Qwen3.5.
  - PR #36499 (merged): added the gfx1152/gfx1153 (Krackan, RDNA 3.5) architectures to the HIP supported-architecture list so kernels compile correctly for these new GPUs.
  - PR #36274 (merged): aligned ROCm attention-backend validation with CUDA-platform behavior by stripping `block_size` before validation, supporting irregular block sizes.
- Related issues:
  - Issue #36805: reported a misleading TorchAO-related error message that presents GPU compute capability (SM 8.9) as a CUDA version. Labeled `rocm`; the bot pinged the AMD maintainers (@hongxiayang, @tjtanaa).
  - Issue #36769: a Qwen3.5 tool-calling bug report, also labeled `rocm` and routed to AMD maintainers.
Summary: AMD-ecosystem contributors (andyluo7, amd-asalykov, tmm77, and others) stood out this cycle. Their focus has shifted from basic enablement to deep performance tuning and architecture-specific optimization, especially for MI300X/MI355X and the FP8 compute path, showing that vLLM's AMD support is maturing and growing increasingly fine-grained.
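To make the fusion concrete, here is a minimal NumPy sketch of what the PR's approach removes: a separate quantization pass over a wide intermediate that must round-trip through memory. This is an illustration only, not the actual ROCm kernel or the `torch._scaled_mm` API; the scale value and the `FP8_E4M3_MAX` clamp bound are illustrative assumptions.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max magnitude representable in float8_e4m3 (assumed here)

def gemm_then_quant(a, b, scale):
    """Unfused reference: GEMM into a wide dtype, then a separate
    static-scale quantization pass (an extra trip through memory)."""
    c = a @ b  # wide intermediate (float32 here; BF16 on the GPU)
    return np.clip(c / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

def fused_gemm_quant(a, b, scale):
    """Fused sketch: conceptually one kernel that scales and clamps the
    accumulator before writing it out, skipping the wide intermediate."""
    return np.clip((a @ b) / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 4)).astype(np.float32)
# Both paths produce the same values; the fused path just avoids
# materializing (and re-reading) the wide intermediate.
assert np.allclose(gemm_then_quant(a, b, 0.1), fused_gemm_quant(a, b, 0.1))
```

On real hardware the win comes from DRAM traffic, which this CPU sketch cannot show; it only illustrates that the fused formulation is numerically equivalent.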
💬 High-Activity Discussions
- Issue #36753: "POST /wake_up causes vLLM process to crash"
  - Core issue: a user hot-swapping multiple models on a single GPU via vLLM's sleep/wake feature found that sending `/wake_up` crashes the process.
  - Views and progress:
    - User lazybum-sudo gave detailed reproduction steps and pinged maintainers directly.
    - Maintainer DarkLight1337 routed it to the relevant experts (@youkaichao, @tjtanaa).
    - Contributor KrxGu volunteered to investigate, suspecting the sleep/wake control path and a failure in `EngineCoreClient.ensure_alive()`.
  - Status: maintainers are waiting on the pinged experts; the issue remains open. This is a complex engine-lifecycle problem, and the community is coordinating resources to track it down.
- Issue #36805: "Misleading error message for FP8 quantization"
  - Core issue: the message "Float8 dynamic activation quantization is only supported on CUDA>=8.9 and MI300+" is misleading: "CUDA>=8.9" actually refers to GPU compute capability (the SM version), not the CUDA toolkit version.
  - Views: reporter jranaraki pinpointed it as a TorchAO problem and offered a clear suggested fix. The bot labeled the issue and notified ROCm maintainers.
  - Contention: none; it is a clear-cut bug report with a concrete improvement suggestion.
  - Status: open, awaiting upstream (TorchAO) or vLLM maintainers.
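As a sketch of the suggested improvement, the check could phrase its failure in terms of the SM version rather than "CUDA". The helper below is hypothetical (it is not TorchAO's or vLLM's code); only the quoted requirement thresholds come from the issue.

```python
def quant_support_error(major: int, minor: int) -> str:
    """Hypothetical rewording: name the GPU compute capability (SM version)
    explicitly instead of the ambiguous 'CUDA>=8.9'."""
    return (
        "Float8 dynamic activation quantization requires GPU compute "
        f"capability >= 8.9 (got sm_{major}{minor}) or an AMD MI300+ GPU."
    )

msg = quant_support_error(8, 6)  # e.g. an RTX 3090-class sm_86 device
assert "compute capability" in msg and "sm_86" in msg
```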
- PR #36766 vs. PR #36779: competing fixes for an opcheck test failure
  - Core issue: both aim to fix a false report in the `rms_norm_per_block_quant` opcheck test that the weight tensor is being mutated.
  - Approaches:
    - PR #36766 skips opcheck's `test_schema` check.
    - PR #36779 corrects the shape of the `scales` tensor passed to the kernel and adds runtime validation, addressing the root cause.
  - Contention: KrxGu (author of PR #36779) strongly opposed skipping `test_schema`, arguing it would mask a genuine memory-safety risk, and noted that PR #36766 may not have been validated locally. ProExpertProg ultimately backed KrxGu's root-cause fix.
  - Status: PR #36766 has been closed and PR #36779 is moving forward, reflecting the community's emphasis on code quality, test rigor, and root-cause fixes.
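The shape of the runtime validation PR #36779 adds can be sketched as below. This is a hypothetical stand-in (function name, block-size convention, and expected layout are assumptions, not vLLM's actual signature): a per-block quantization kernel expects one scale per (row, block) pair, and a mis-shaped `scales` tensor can otherwise be read or written out of bounds, which is exactly the kind of corruption opcheck flags as an input mutation.

```python
import numpy as np

def check_per_block_scales(x: np.ndarray, scales: np.ndarray, block: int) -> None:
    """Hypothetical runtime check: reject a scales tensor whose shape does
    not match one scale per (row, block-of-columns) of the input."""
    rows, cols = x.shape
    expected = (rows, (cols + block - 1) // block)  # ceil-divide columns
    if scales.shape != expected:
        raise ValueError(
            f"scales shape {scales.shape} != expected {expected} "
            f"for input {x.shape} with block size {block}"
        )

x = np.zeros((4, 128), dtype=np.float32)
check_per_block_scales(x, np.zeros((4, 1), dtype=np.float32), block=128)  # ok
try:
    check_per_block_scales(x, np.zeros((1, 4), dtype=np.float32), block=128)
except ValueError:
    pass  # mis-shaped scales rejected up front, before the kernel runs
```

Failing fast like this is the root-cause posture argued for in the PR discussion: a skipped schema check hides the bug, while a shape check surfaces it at the call site.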
🔥 Hot Topics and Trends
- Next-generation hardware support:
  - NVIDIA Blackwell (sm_121): several issues (#36821, #36835) report problems running vLLM on DGX Spark (B200), including missing kernel support and CUTLASS compatibility-check failures. The community quickly supplied source-build instructions and custom Docker images.
  - AMD CDNA4 (MI355X): PR #36743 optimizes specifically for this architecture, indicating support has reached the performance-tuning stage.
- Large-scale and complex deployments:
  - CPU offload and UMA: Issue #36796 reports intricate CPU-offload errors under GH200 unified memory.
  - Tensor and data parallelism: Issue #36793 reports stability problems running quantized models at TP=2, DP=2.
  - Distributed communication: PR #36834 proposes a new hierarchical AllReduce implementation to improve multi-node performance.
- Model and tooling ecosystem growth:
  - New model integrations: many PRs add model support, e.g. Kimi-Audio (#36127), ColPali (#36818), and a Granite4 tool parser (#36827).
  - Speculative decoding: several PRs (#36767, #36361, #35461) improve DFlash, EAGLE3 integration, and probabilistic rejection sampling, currently a hot direction for inference efficiency.
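The idea behind a hierarchical AllReduce such as the one proposed in PR #36834 can be sketched in plain Python (this is the general two-level pattern, not the PR's implementation): reduce within each node over the fast local interconnect first, exchange only one partial per node over the slower inter-node link, then broadcast the total back locally.

```python
def hierarchical_allreduce(node_values):
    """Two-level allreduce sketch over a list of nodes, each a list of
    per-rank values. Fewer cross-node messages than a flat allreduce."""
    # Stage 1: intra-node reduction (fast local interconnect, e.g. NVLink).
    partials = [sum(ranks) for ranks in node_values]
    # Stage 2: inter-node reduction over just one value per node.
    total = sum(partials)
    # Stage 3: intra-node broadcast of the global result to every rank.
    return [[total] * len(ranks) for ranks in node_values]

# Two nodes of two ranks each: every rank ends up with the global sum.
assert hierarchical_allreduce([[1, 2], [3, 4]]) == [[10, 10], [10, 10]]
```

The payoff is that the slow inter-node stage moves one value per node instead of one per rank, which is why this layout helps multi-node serving.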
🛠️ Key Technical Changes
- PR #36810 (fused GEMM + FP8 quantization for ROCm): a key FP8 inference optimization on AMD. Kernel fusion cuts data movement and directly exploits new AMD hardware and software-stack features, a tangible gain for MI300-series users.
- PR #36799 (remove Sparse24 CT integration): a notable cleanup and deprecation. Sparse24 models see little use, so the integration and kernels were removed to reduce maintenance burden: a deliberate trade of feature breadth for long-term codebase health.
- PR #35461 (probabilistic rejection sampling for Model Runner V2): adds probabilistic rejection sampling to the V2 model runner. Compared with strict rejection sampling, it markedly raises draft-token acceptance (+6.69 points on GLM 4.7) and overall throughput (+15%), an important evolution for speculative decoding.
- PR #35781 (scheduler optimization for P/D disaggregation): reduces scheduler overhead in async remote KV-loading scenarios with an event-driven design that cuts queue churn, yielding roughly a 5% end-to-end improvement and continuing the performance polish of disaggregated architectures.
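The acceptance rule at the heart of probabilistic rejection sampling can be sketched as follows. This is the standard speculative-decoding criterion, not vLLM's exact code, and `accept_draft_token` is a hypothetical helper: accept a draft token with probability min(1, p_target / p_draft), so tokens the target model also assigns high probability are usually kept even when the two distributions differ, which is what raises the acceptance rate over exact-match schemes.

```python
import random

def accept_draft_token(p_target: float, p_draft: float, rng=random) -> bool:
    """Accept a draft-model token with probability min(1, p_target/p_draft),
    where p_* are the probabilities the target and draft models assign to it."""
    if p_draft <= 0.0:
        return False  # the draft model could not have proposed this token
    return rng.random() < min(1.0, p_target / p_draft)

rng = random.Random(0)
# A token the target model likes at least as much as the draft did
# has acceptance probability 1, so it is always kept.
assert accept_draft_token(0.9, 0.3, rng)
```

On rejection, the full scheme resamples from the normalized residual distribution max(0, p_target - p_draft), which preserves the target model's output distribution exactly; that step is omitted here for brevity.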
📈 Development Activity
- Contributors: activity was very high, with a large number of PRs created and merged within a single day. AMD-ecosystem contributors (andyluo7, amd-asalykov, and others) stood out with a steady stream of high-quality performance work.
- Review and merging: 71 PRs merged, indicating a highly efficient review-and-merge pipeline; many PRs were labeled `ready` and merged promptly once CI passed.
- Cross-domain collaboration: multimodal model support and tool-call parser work show close collaboration among developers, model providers, and users.
💡 Issues Worth Watching
- Next-generation GPU compatibility: Issues #36821 and #36835 show that official support for NVIDIA Blackwell is still incomplete; users currently depend on community-provided custom builds. Official CI and release pipelines need to catch up soon.
- Memory- and concurrency-related stability: Issues #36796 (CPU offload), #36826 (streaming stalls/crashes under concurrent requests), and #36793 (TP/DP configuration failures) all point to deep stability challenges in memory management and concurrency synchronization under complex configurations and heavy load.
- Quantization and compilation interactions: Issue #36805 (error message), PRs #36766/#36779 (opcheck tests), and several GPTQ Marlin illegal-memory-access issues all reflect the error-prone complexity of how the quantization stack interacts with PyTorch compilation and other low-level systems.
- Engine lifecycle management: Issue #36753 (`/wake_up` crash) concerns the advanced sleep/wake feature, whose robustness is critical for multi-model production deployments and warrants focused attention from core maintainers.
📋 Appendix: Detailed Data Lists
New Issues
- #36773 [Bug]: Qwen3.5-35B-A3B on B200 with vllm v0.17.0 output random result — bug — by elepherai (created: 2026-03-11 18:05 (UTC+8))
- #36759 [Bug]: Error deploying Qwen/Qwen3-VL-Reranker-2B with vLLM on 910b4 — bug — by wangjiannan98 (created: 2026-03-11 16:11 (UTC+8))
- #36794 [Feature]: Improve Error Message When Max Length Reached With Tool Call=Required — feature request — by SvenLorenz (created: 2026-03-11 22:24 (UTC+8))
- #36835 [Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries — bug — by LironKesem (created: 2026-03-12 09:34 (UTC+8))
- #36833 [Bug]: GLM-4.7-Flash does not return tool_calls field in vLLM 0.16.0 even with --tool-call-parser glm47 — bug — by CodeofGame (created: 2026-03-12 08:56 (UTC+8))
- #36828 [Bug]: vLLm 0.17.1 (docker) crash with Qwen 3.5 27B-FP8 in BatchPrefillWithPagedKVCache — bug — by matatonic (created: 2026-03-12 06:22 (UTC+8))
- #36821 [Bug]: No sm_121 (Blackwell) support on aarch64 — NVIDIA DGX Spark / Acer GN100 — bug — by blogtheristo (created: 2026-03-12 04:52 (UTC+8))
- #36830 [Bug]: delta_text and delta_token_ids get out of sync when stop sequences are used. — bug — by maxdebayser (created: 2026-03-12 06:48 (UTC+8))
- #36826 [Bug]: Streaming stalls and can crash when concurrent requests hit the same vLLM server — bug — by imi187 (created: 2026-03-12 06:12 (UTC+8))
- #36802 [Bug]: Tesla T4 GPU - triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 81920, Hardware limit: 65536. Reducing block sizes or `num_stages` — bug — by ksecurity45 (created: 2026-03-12 00:21 (UTC+8))
- #36811 [Bug]: CUDA illegal memory access on GPTQ Marlin — bug — by tonibofarull (created: 2026-03-12 02:34 (UTC+8))
- #36741 [Usage]: Not Working with latest version qwen3.5 9B AI MODEL — usage — by kusiyaitkrishna (created: 2026-03-11 12:13 (UTC+8))
- #36796 [Bug]: CPU offload errors on nightly with NVIDIA GH200 Unified Memory (UMA) — bug,torch.compile — by ehfd (created: 2026-03-11 22:53 (UTC+8))
- #36805 [Bug]: Misleading error message for FP8 quantization: refers to CUDA version instead of GPU compute capability — bug,rocm — by jranaraki (created: 2026-03-12 00:45 (UTC+8))
- #36804 [Feature]: FA4 Attention Sinks — feature request — by pbarker-synth (created: 2026-03-12 00:42 (UTC+8))
- #36793 [Bug]: TP=2 DP=2 Broken for Qwen3-Next W4A16 — bug — by Chibukach (created: 2026-03-11 22:14 (UTC+8))
- #36784 [New Model]: OmniASR by meta — no labels — by HDprojec8s (created: 2026-03-11 20:22 (UTC+8))
- #36783 [Bug]: CUBLAS_STATUS_INVALID_VALUE on Qwen3.5-122B-A10B-FP8 during profile run — bug — by twright8 (created: 2026-03-11 19:38 (UTC+8))
- #36778 [Bug]: Using vLLM to deploy Minimax m2.5, the thinking/reasoning cannot be disable. — bug — by shuixiaoer (created: 2026-03-11 18:55 (UTC+8))
- #36781 [Bug]: vLLM 0.17.0 failed to serve Qwen3-30B-A3B-Instruct-2507 after adding `--enable_lora` — bug — by Junxiao-Zhao (created: 2026-03-11 19:24 (UTC+8))
- #36780 [RFC][NixlConnector]: Add support for hybrid SSM-FA models — RFC — by NickLucche (created: 2026-03-11 19:06 (UTC+8))
- #36777 [Feature]: Kimi K2.5 Speculative Decoding — feature request — by casper-hansen (created: 2026-03-11 18:45 (UTC+8))
- #36776 [Bug]: qwen 3.5 crash under dp 8 — bug — by ZJY0516 (created: 2026-03-11 18:31 (UTC+8))
- #36755 [Bug] Preemption + async scheduling race can corrupt prompt-token accounting and crash Prometheus counters — no labels — by xiaguan (created: 2026-03-11 15:49 (UTC+8))
- #36753 [Bug]: POST /wake_up causes vLLM process to crash. 500 Internal Server Error — bug — by lazybum-sudo (created: 2026-03-11 15:44 (UTC+8))
- #36772 [Bug]: Docker Model Runner vLLM ignores tensor parallel config and starts with world_size=1 on 4x RTX 3060 — bug — by PatcharaphonTiw (created: 2026-03-11 18:02 (UTC+8))
- #36771 [Bug]: LMCache does not work with vLLM 0.17.0 (Qwen3Next) — bug — by Sanches166 (created: 2026-03-11 18:02 (UTC+8))
- #36769 [Bug]: Qwen3.5 tool calling bug — bug,rocm — by DifferentialityDevelopment (created: 2026-03-11 17:51 (UTC+8))
- #36763 [Bug]: Kimi-K2.5 outputs only '!!!!!!!!!!' in reasoning field, content is always null — no labels — by LaRiffle (created: 2026-03-11 17:16 (UTC+8))
- #36736 [Bug]: Qwen3.5-35B-A3B-FP8 inference output terminates unexpectedly, logs show normal but request hangs — no labels — by RagnarokChan (created: 2026-03-11 11:37 (UTC+8))
- #36750 [Question] HOW TO Enabling FlashAttention-4 backend for NVIDIA PRO 6000 (Blackwell) with MiniMax-2.5-230B — usage — by luojichen (created: 2026-03-11 14:53 (UTC+8))
- #36748 [Bug]: In DP mode, waiting request stack in a few DP ranks. — bug — by xiaoxiaosuaxuan (created: 2026-03-11 14:22 (UTC+8))
- #36739 [CPU Backend] Refactor CPU FusedMoE to MK flow — cpu — by bigPYJ1151 (created: 2026-03-11 11:45 (UTC+8))
Closed Issues
- #36631 [Bug]: v0.17.0 4*2080ti 22G Qwen3.5 RPC call to sample_tokens timed out. — bug — by VIT-Valentin (closed: 2026-03-12 10:40 (UTC+8))
- #36759 [Bug]: Error deploying Qwen/Qwen3-VL-Reranker-2B with vLLM on 910b4 — bug — by wangjiannan98 (closed: 2026-03-12 10:35 (UTC+8))
- #36233 [Bug]: Critical bug in tool_call test scenarios [vllm version 0.16.0] — bug — by ajiang17 (closed: 2026-03-12 10:26 (UTC+8))
- #19403 [Bug]: Issue of Unstable Output for Identical Queries — bug,torch.compile,stale — by skylee-01 (closed: 2026-03-12 10:17 (UTC+8))
- #23221 [Feature][Responses API] Stream Arbitrary MCP Tool — feature request,stale — by simon-mo (closed: 2026-03-12 10:17 (UTC+8))
- #24580 [Bug]: crash on startup with TP>1 on H100 nodes: AttributeError in ShmRingBuffer.shared_memory — bug,stale — by sanderland (closed: 2026-03-12 10:17 (UTC+8))
- #27400 [Bug]: vLLM0.11.0 is about 10x slower than 0.8.1 on classification task — bug,stale — by a7744hsc (closed: 2026-03-12 10:16 (UTC+8))
- #27437 [Bug]: Extremely slow FA3 on Hopper for CUDA 13.0 — bug,stale — by zyongye (closed: 2026-03-12 10:16 (UTC+8))
- #28063 [Bug]: Error processing images with Qwen3-VL — bug,stale — by carlosdcuba1 (closed: 2026-03-12 10:16 (UTC+8))
- #28125 [Bug]: Qwen3-VL-235B illegal memory access was encountered — bug,stale — by RocketRider (closed: 2026-03-12 10:16 (UTC+8))
- #28326 [RFC]: vLLM Support for Generic Model Definitions — RFC,stale — by bwasti (closed: 2026-03-12 10:16 (UTC+8))
- #28362 [Usage]: Can't get vLLM to run on an Intel 125H with XPU and Arc graphics — usage,intel-gpu,stale — by phlibi (closed: 2026-03-12 10:16 (UTC+8))
- #28381 [Bug]: docker Loading safetensors too slow — bug,stale — by zwukong (closed: 2026-03-12 10:16 (UTC+8))
- #28465 [RFC]: Publishing KVBlocks event with mm_hash — RFC,stale — by Bounty-hunter (closed: 2026-03-12 10:16 (UTC+8))
- #28486 [Installation]: No pre-built wheel available in 0.11.0 — installation,stale — by Tfloow (closed: 2026-03-12 10:16 (UTC+8))
- #28489 [Usage]: Online continuous batching — usage,stale — by GenVr (closed: 2026-03-12 10:16 (UTC+8))
- #33864 Bug: CPU KV cache offloading fails for blocks formed during decode — bug,kv-connector — by sriumcp (closed: 2026-03-12 04:43 (UTC+8))
- #35766 [Bug]: aot_compile setting some aotautograd configs that change the cache key — bug,torch.compile — by zou3519 (closed: 2026-03-12 01:15 (UTC+8))
- #36050 [Bug]: DeepSeek v3.2 FP8 Failure to start server — bug — by wzhao18 (closed: 2026-03-12 00:07 (UTC+8))
- #36653 [Bug]: qwen3.5 Mismatch in `image` token count between text and `input_ids`. Got ids=[4091] — bug — by JakubCerven (closed: 2026-03-11 22:26 (UTC+8))
- #34607 [Bug]: specualative decoding error in 0.15.1 — bug — by hocop (closed: 2026-03-11 22:21 (UTC+8))
- #36433 [Bug]: matryoshka need gpu-memory??? — bug,good first issue — by ciaoyizhen (closed: 2026-03-11 15:15 (UTC+8))
- #35909 [Bug]: Error when using Qwen3-VL/Qwen3.5 with video input and num_frames — bug — by danigarciaoca (closed: 2026-03-11 18:13 (UTC+8))
- #36625 [XPU][NixlConnector] Add ze_ipc transport support for single-node PD disaggregation — no labels — by Yanli2190 (closed: 2026-03-11 11:49 (UTC+8))
- #33768 [Usage]: How to set the language in Qwen3-Asr — usage — by xyqsgdog-ctrl (closed: 2026-03-11 11:34 (UTC+8))
New PRs
- #36841 [Bugfix] Fix crash when tool_choice=required exceeds max_tokens — bug,frontend,ready,tool-calling — by chaunceyjiang (created: 2026-03-12 11:19 (UTC+8))
- #36761 [CI Failure] Fix Language Models Test (Extended Pooling) daily CI Failure — ready — by noooop (created: 2026-03-11 16:40 (UTC+8))
- #36840 fix: correctly save fp8 kernel tuning results in multi-GPU mode — performance — by cs-cat (created: 2026-03-12 10:50 (UTC+8))
- #36834 Added Hierarchical all reduce — nvidia — by lekhit (created: 2026-03-12 09:24 (UTC+8))
- #36824 [Model Runner V2] Do not initialize sampler for non-last PP ranks — ready,v1 — by WoosukKwon (created: 2026-03-12 05:47 (UTC+8))
- #36839 [Bugfix] Fix Responses API crash when ResponseReasoningItem has content=None — bug,frontend — by rranabha (created: 2026-03-12 10:31 (UTC+8))
- #36766 fix(test): skip test_schema in opcheck for rms_norm_per_block_quant (… — needs-rebase — by mahendrarathore1742 (created: 2026-03-11 17:41 (UTC+8))
- #36838 enable flashinfer moe kernel for DP + EP — no labels — by czhu-cohere (created: 2026-03-12 10:11 (UTC+8))
- #36832 Increase Test Coverage for Distributed Comm Patterns — no labels — by puririshi98 (created: 2026-03-12 08:56 (UTC+8))
- #36837 Expand Speculative Decoding Coverage — speculative-decoding,v1 — by puririshi98 (created: 2026-03-12 09:49 (UTC+8))
- #36756 Rename misleading num_sampled_tokens to num_sampled_reqs — v1 — by ENg-122 (created: 2026-03-11 15:51 (UTC+8))
- #36809 Revert "Fix `hf_override_fn` when it modifies `model_type`" (#35200) — no labels — by zhewenl (created: 2026-03-12 01:58 (UTC+8))
- #36831 [XPU][Doc] Remove manual OneAPI install step, now handled by torch-xpu — documentation,ready — by jikunshang (created: 2026-03-12 08:44 (UTC+8))
- #36836 [Feat][Executor] Introduce RayExecutorV2 — ci/build,v1 — by jeffreywang-anyscale (created: 2026-03-12 09:40 (UTC+8))
- #36829 [Frontend] Exclude anthropic billing header to avoid prefix cache miss — documentation,frontend,ready — by njhill (created: 2026-03-12 06:37 (UTC+8))
- #36822 [BugFix] Fix multiple/duplicate stdout prefixes — bug,frontend,ready,v1 — by njhill (created: 2026-03-12 05:04 (UTC+8))
- #36767 Dflash integration — new-model,speculative-decoding,v1,qwen — by StanislavII (created: 2026-03-11 17:43 (UTC+8))
- #36779 [Bugfix] opcheck false mutation error in rms_norm_per_block_quant (#36688) — bug,needs-rebase — by KrxGu (created: 2026-03-11 19:02 (UTC+8))
- #36817 [Model Runner V2] Add Support for XD-RoPE — v1,nvidia — by santiramos27 (created: 2026-03-12 04:08 (UTC+8))
- #36827 Add simple granite4 tool parser — documentation,tool-calling — by maxdebayser (created: 2026-03-12 06:17 (UTC+8))
- #36819 [Docs] Update Claude Code doc to include caching performance tip — documentation — by njhill (created: 2026-03-12 04:23 (UTC+8))
- #36825 [Bugfix] Fix Qwen3.5 LoRA IndexError in packed_modules_mapping — bug,qwen — by hallerite (created: 2026-03-12 06:00 (UTC+8))
- #36823 [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace — torch.compile,nvidia,vllm-ir — by ProExpertProg (created: 2026-03-12 05:35 (UTC+8))
- #36816 [vLLM IR] 2/N batch-invariant-aware dispatching and rms_norm — vllm-ir — by ProExpertProg (created: 2026-03-12 04:00 (UTC+8))
- #36820 Mkorhone openvla pr merge — performance,new-model,rocm,needs-rebase,ci/build,v1,multi-modality,qwen — by mkorhone (created: 2026-03-12 04:41 (UTC+8))
- #36818 [Model] Add ColPali late interaction model for multi-modal retrieval — documentation,new-model,multi-modality — by Kaonael (created: 2026-03-12 04:11 (UTC+8))
- #36814 `nano-nemotron-vl` dummy video: Respect `media_io_kwargs.num_frames` if set to budget encoder cache correctly — no labels — by netanel-haber (created: 2026-03-12 03:36 (UTC+8))
- #36815 [Model Runner V2] Introduce num_tokens_for_attn — v1,nvidia — by WoosukKwon (created: 2026-03-12 03:48 (UTC+8))
- #36744 [Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server — documentation,frontend — by smfirmin (created: 2026-03-11 12:48 (UTC+8))
- #36813 [Tests] Skip model weight download for render-only test server — no labels — by sagearc (created: 2026-03-12 03:28 (UTC+8))
- #36798 Correct link to supported hardware on vllm.ai — documentation,ready — by hmellor (created: 2026-03-11 23:23 (UTC+8))
- #36812 [Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts" — ready,v1 — by markmc (created: 2026-03-12 02:38 (UTC+8))
- #36810 [ROCm][Perf] Fused GEMM + static FP8 output quantization — rocm — by andyluo7 (created: 2026-03-12 02:14 (UTC+8))
- #36799 [Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels — performance,ci/build,nvidia — by kylesayrs (created: 2026-03-12 00:02 (UTC+8))
- #36791 fix for the model not even loading and zero accuracy — no labels — by hypdeb (created: 2026-03-11 21:35 (UTC+8))
- #36787 Make Gemma and Gemma 2 accept `inputs_embeds` like Gemma 3 — ready — by hmellor (created: 2026-03-11 21:04 (UTC+8))
- #36807 [Bugfix] Pad Marlin FP8 MoE weight dims to tile alignment under TP > 1 — bug — by ssubhanjali (created: 2026-03-12 01:16 (UTC+8))
- #36808 Support temporal compression for videos — no labels — by collinmccarthy (created: 2026-03-12 01:22 (UTC+8))
- #36806 Only show FP4 Marlin fallback warning for w4a4 models — ready — by mgoin (created: 2026-03-12 01:07 (UTC+8))
- #36792 Fix `ExaoneMoeMTP` test that never ran in Transformers v4 — ready,multi-modality — by hmellor (created: 2026-03-11 21:37 (UTC+8))
- #36785 Update rocm get gpu info capability — rocm — by tmm77 (created: 2026-03-11 20:27 (UTC+8))
- #36803 [Test] E2E Nemotron-3-Super tests — ready,ci/build,nvidia — by roikoren755 (created: 2026-03-12 00:36 (UTC+8))
- #36770 [Misc] Clean up renderers — ready,multi-modality,llama — by DarkLight1337 (created: 2026-03-11 18:01 (UTC+8))
- #36800 [Bugfix] Fix Qwen2.5-omni/Qwen3-omni mm_processor cache for audio_in_video request — bug,qwen — by Isotr0py (created: 2026-03-12 00:20 (UTC+8))
- #36797 [Bugfix][Core] Allow multi-dtype MambaSpec KV cache spec — bug,v1 — by MohdElgaar (created: 2026-03-11 23:13 (UTC+8))
- #36789 [Bugfix] Fix negative max_tokens when input prompt is too long — bug,frontend,ready — by Isotr0py (created: 2026-03-11 21:20 (UTC+8))
- #36801 [kv_offload] Fix bare Exception types and add FilterReusedOffloadingManager tests — v1 — by Hongbin10 (created: 2026-03-12 00:21 (UTC+8))
- #36775 [Misc] Add online audio_in_video test — ready,multi-modality,qwen — by Isotr0py (created: 2026-03-11 18:28 (UTC+8))
- #36764 [Feature] add Dflash on Ascend — documentation,new-model,speculative-decoding,ci/build,v1,multi-modality,llama,qwen — by chenaoxuan (created: 2026-03-11 17:23 (UTC+8))
- #36790 Disable docs build skipping until a better solution is found — ready — by hmellor (created: 2026-03-11 21:25 (UTC+8))
- #36742 [EPD] update EPD script arguments — documentation,kv-connector — by zhenwei-intel (created: 2026-03-11 12:42 (UTC+8))
- #36788 Fix tied weights in weight mapping test for Transformers v5 — ready,multi-modality — by hmellor (created: 2026-03-11 21:15 (UTC+8))
- #36795 [Perf] Enable dual stream execution of input projection for Qwen3 Next — qwen — by xyang16 (created: 2026-03-11 22:39 (UTC+8))
- #36762 [Model Runner V2] Remove unused warmup_for_prefill method — ready,v1 — by WoosukKwon (created: 2026-03-11 17:01 (UTC+8))
- #36786 Extend EPLB support — no labels — by hypdeb (created: 2026-03-11 20:57 (UTC+8))
- #36782 [Bugfix] Fix Mistral-small `--format` — bug,documentation — by 12010486 (created: 2026-03-11 19:29 (UTC+8))
- #36768 Update Flashinfer to 0.6.6 — ready,ci/build,nvidia — by dbari (created: 2026-03-11 17:43 (UTC+8))
- #36774 [Bugfix] fix Qwen3.5 tool calling bug — bug,qwen — by chaunceyjiang (created: 2026-03-11 18:24 (UTC+8))
- #36749 [openapi] refactor render related openapi [3/N] — frontend,ready — by andyxning (created: 2026-03-11 14:31 (UTC+8))
- #36752 [Bugfix] Reset num_cached_tokens sentinel on request preemption — bug,v1 — by xiaguan (created: 2026-03-11 15:43 (UTC+8))
- #36751 fix(minicpmv): fix audio inference by handling meta device in init_re… — ready — by tc-mb (created: 2026-03-11 15:33 (UTC+8))
- #36765 fix: vulnerability fix for Inference — ci/build — by tusharsadana (created: 2026-03-11 17:27 (UTC+8))
- #36760 Fix(Offline Inference): Enable reasoning parser support in LLM class … — frontend — by xueliangyang-oeuler (created: 2026-03-11 16:34 (UTC+8))
- #36758 Fix(DP): Optimize load balancing score calculation (#36748) — v1 — by xueliangyang-oeuler (created: 2026-03-11 16:08 (UTC+8))
- #36757 Fix(Scheduler): Reset num_cached_tokens on preemption to prevent acco… — v1,nvidia — by xueliangyang-oeuler (created: 2026-03-11 15:59 (UTC+8))
- #36754 [Perf] optimize gdn mtp performance — qwen — by ZJY0516 (created: 2026-03-11 15:46 (UTC+8))
- #36747 [Bugfix] Fix unclean shutdown crash with AllReduce Fusion workspace — bug,nvidia — by mvanhorn (created: 2026-03-11 14:14 (UTC+8))
- #36746 [Bugfix] Deduplicate sampled token in top_logprobs output — bug,v1 — by mvanhorn (created: 2026-03-11 14:10 (UTC+8))
- #36745 [Misc] Relax transformers upper bound to include 5.2.0 — ci/build — by woforce (created: 2026-03-11 13:02 (UTC+8))
- #36743 [ROCm] Optimize concat_mla_q for CDNA3 (MI300X) and CDNA4 (MI355X) — rocm,nvidia — by andyluo7 (created: 2026-03-11 12:43 (UTC+8))
- #36737 [WIP][Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server — documentation,frontend — by smfirmin (created: 2026-03-11 11:39 (UTC+8))
- #36738 Fix(MoE): Relax assertion for non-gated NVFP4 MoE models (#31782) — documentation,deepseek,nvidia — by xueliangyang-oeuler (created: 2026-03-11 11:44 (UTC+8))
- #36740 [Bugfix] Respect hf-config-path during GGUF speculator detection — bug — by WineChord (created: 2026-03-11 11:55 (UTC+8))
Merged PRs
- #36593 [XPU]Bug fix for some unexpected error when use AgRs backend on XPU device. — bug,ready,v1 — by ys950902 (merged: 2026-03-11 17:17 (UTC+8))
- #35895 [Bugfix] Fix minimax_m2 tool parser when stream interval > 1 — bug,documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build — by sfeng33 (merged: 2026-03-12 10:25 (UTC+8))
- #36831 [XPU][Doc] Remove manual OneAPI install step, now handled by torch-xpu — documentation,ready — by jikunshang (merged: 2026-03-12 09:46 (UTC+8))
- #35461 [Model Runner V2] Add probabilistic rejection sampling for spec decoding — ready,v1 — by TheEpicDolphin (merged: 2026-03-12 05:04 (UTC+8))
- #36829 [Frontend] Exclude anthropic billing header to avoid prefix cache miss — documentation,frontend,ready — by njhill (merged: 2026-03-12 09:20 (UTC+8))
- #36316 [BugFix]: add bagel to MM_PREFIX_LM_MODELS — bug,ready — by princepride (merged: 2026-03-12 03:55 (UTC+8))
- #36710 [Perf] Optimize compute maxsim using batched version, 3.2% E2E throughput improvement — frontend,ready,v1 — by yewentao256 (merged: 2026-03-12 08:37 (UTC+8))
- #36499 [ROCm][CI/Build] Add gfx1152/gfx1153 (Krackan) to HIP supported architectures — rocm,ready,ci/build — by mgehre-amd (merged: 2026-03-12 07:14 (UTC+8))
- #35811 [BUG] Fix async rlhf tests — bug,documentation,ready,ci/build,v1 — by hao-aaron (merged: 2026-03-12 06:06 (UTC+8))
- #36556 [ci] Update rtol for test_classification — ready — by angelayi (merged: 2026-03-11 18:08 (UTC+8))
- #36698 [Kernel] [Helion] [15/N] Split config files into per-platform files — ready — by gmagogsfm (merged: 2026-03-12 05:25 (UTC+8))
- #36563 [Kernel] [Helion] [12/N] Use FakeTensorMode to avoid GPU allocation during config key computation — ready — by gmagogsfm (merged: 2026-03-12 05:25 (UTC+8))
- #36683 [Kernel] [Helion] [14/N] Set autotune_ignore_errors=True during autotuning — ready — by gmagogsfm (merged: 2026-03-12 05:25 (UTC+8))
- #35790 [Model Runner V2] Add WhisperModelState [6/N] — ready,ci/build,v1,nvidia — by WoosukKwon (merged: 2026-03-12 05:20 (UTC+8))
- #36129 [LMCache] Pass TP size in lookup for MLA multi-reader locking — ready,kv-connector — by maobaolong (merged: 2026-03-12 04:45 (UTC+8))
- #33881 [BugFix][kv_offload] Fix offloading decodes with async scheduling — bug,ready,v1,kv-connector — by orozery (merged: 2026-03-12 04:43 (UTC+8))
- #36274 [Bugfix][ROCm] Strip block_size before attention backend validation — bug,rocm,ready — by jennyyyyzhen (merged: 2026-03-12 04:37 (UTC+8))
- #36798 Correct link to supported hardware on vllm.ai — documentation,ready — by hmellor (merged: 2026-03-11 23:43 (UTC+8))
- #36424 [Refactor] Remove dead code in KV connector — ready,v1,kv-connector — by yewentao256 (merged: 2026-03-12 03:40 (UTC+8))
- #35093 [ROCm] add tuned moe_wna16_triton kernel configs for CDNA4 — rocm,ready — by amd-asalykov (merged: 2026-03-12 03:00 (UTC+8))
- #36787 Make Gemma and Gemma 2 accept `inputs_embeds` like Gemma 3 — ready — by hmellor (merged: 2026-03-12 02:12 (UTC+8))
- #36551 [torch.compile] Add support for non-contiguous fused RMSNorm + group quant — ready,ci/build — by ProExpertProg (merged: 2026-03-12 01:56 (UTC+8))
- #31964 [KVConnector] Support worker -> scheduler metadata — ready,v1,kv-connector — by orozery (merged: 2026-03-12 01:36 (UTC+8))
- #36707 fix: align lfm2 thumbnail token counting with HF — ready — by tianshu-Michael-yu (merged: 2026-03-12 01:28 (UTC+8))
- #36161 Add 320 dimension size support to MLA — ready — by juliendenize (merged: 2026-03-12 01:21 (UTC+8))
- #36792 Fix `ExaoneMoeMTP` test that never ran in Transformers v4 — ready,multi-modality — by hmellor (merged: 2026-03-12 01:10 (UTC+8))
- #36770 [Misc] Clean up renderers — ready,multi-modality,llama — by DarkLight1337 (merged: 2026-03-12 00:39 (UTC+8))
- #36789 [Bugfix] Fix negative max_tokens when input prompt is too long — bug,frontend,ready — by Isotr0py (merged: 2026-03-12 00:30 (UTC+8))
- #36560 [CI] Add bfcl tool call correctness eval — ready,ci/build — by sfeng33 (merged: 2026-03-12 00:27 (UTC+8))
- #36061 [Bugfix] Fix DP/EP Shared Expert With Monolithic Kernels — bug,ready — by robertgshaw2-redhat (merged: 2026-03-12 00:07 (UTC+8))
- #36093 [torch.compile] Use FakeTensors instead of real GPU tensors for single-size compilation — ready — by zou3519 (merged: 2026-03-12 00:07 (UTC+8))
- #35744 Fix routed experts capture for hybrid models (Mamba + Attention) — documentation,ready,v1 — by xhx1022 (merged: 2026-03-11 23:53 (UTC+8))
- #36726 [Refactor] Remove deadcode in Responses API serving — frontend,ready — by sfeng33 (merged: 2026-03-11 15:11 (UTC+8))
- #36238 Add 'none' reasoning effort to ChatCompletionRequest — frontend,ready — by juliendenize (merged: 2026-03-11 23:45 (UTC+8))
- #36163 Add support to Mistral large 3 eagle with dense layers — speculative-decoding,ready — by juliendenize (merged: 2026-03-11 23:42 (UTC+8))
- #33230 Add XPU MLA Sparse backend for DeepSeek v3.2 — documentation,ready,v1,deepseek — by wuxun-zhang (merged: 2026-03-11 19:19 (UTC+8))
- #36361 Kimi k2.5 MLA based eagle3 — new-model,speculative-decoding,ready,v1,deepseek — by jhaotingc (merged: 2026-03-11 23:36 (UTC+8))
- #36790 Disable docs build skipping until a better solution is found — ready — by hmellor (merged: 2026-03-11 21:53 (UTC+8))
- #36788 Fix tied weights in weight mapping test for Transformers v5 — ready,multi-modality — by hmellor (merged: 2026-03-11 23:10 (UTC+8))
- #36762 [Model Runner V2] Remove unused warmup_for_prefill method — ready,v1 — by WoosukKwon (merged: 2026-03-11 23:02 (UTC+8))
- #36681 [ROCm][Perf] Allow MTP lens > 1 in Sparse MLA — rocm,speculative-decoding,ready,v1 — by tvirolai-amd (merged: 2026-03-11 22:43 (UTC+8))
- #36673 [Misc][Attention] Clean up unused method in `CPU_ATTN` — ready,v1,cpu — by MatthewBonanni (merged: 2026-03-11 12:27 (UTC+8))
- #35776 [Misc] Use envs module to get VLLM_DISABLED_KERNELS — ready — by hickeyma (merged: 2026-03-11 21:37 (UTC+8))
- #36782 [Bugfix] Fix Mistral-small `--format` — bug,documentation — by 12010486 (merged: 2026-03-11 19:47 (UTC+8))
- #36749 [openapi] refactor render related openapi [3/N] — frontend,ready — by andyxning (merged: 2026-03-11 18:14 (UTC+8))
- #36136 [Bugfix] Fix Qwen3-VL timestamp mismatch when using num_frames without fps — bug,ready,multi-modality,qwen — by weiguangli-io (merged: 2026-03-11 18:13 (UTC+8))
- #36318 Disable cascade attention by default — ready,codex — by mgoin (merged: 2026-03-11 18:12 (UTC+8))
- #36751 fix(minicpmv): fix audio inference by handling meta device in init_re… — ready — by tc-mb (merged: 2026-03-11 18:06 (UTC+8))
- #36358 [compile] aot_compile should respect VLLM_DISABLE_COMPILE_CACHE — ready — by zou3519 (merged: 2026-03-11 18:12 (UTC+8))
- #36402 fix(lora): use replaced_module_name in pooling model name check — ready — by gambletan (merged: 2026-03-11 18:11 (UTC+8))
- #36665 platforms: Fix Ray DP startup crash — ready — by itayalroy (merged: 2026-03-11 18:08 (UTC+8))
- #36658 Add: Eagle3 support for Qwen3.5 — ready,qwen — by rahul-tuli (merged: 2026-03-11 18:07 (UTC+8))
- #36667 [Refactor] Remove Molmo2 processor wrapper — ready — by DarkLight1337 (merged: 2026-03-11 18:07 (UTC+8))
- #36321 [Bugfix] Support other quantization methods in glm41v — bug,ready — by LoganJane (merged: 2026-03-11 17:48 (UTC+8))
- #36635 [NemotronH] Small fix reasoning parser — bug,ready,nvidia — by roikoren755 (merged: 2026-03-11 17:44 (UTC+8))
- #36218 [UX] Infer dtype for local checkpoint — ready — by Isotr0py (merged: 2026-03-11 16:50 (UTC+8))
- #36672 Remove unused config field from Gemma2 — ready — by hmellor (merged: 2026-03-11 16:51 (UTC+8))
- #36127 [Model] Add support for moonshotai/Kimi-Audio-7B-Instruct — documentation,new-model,frontend,ready,multi-modality — by tunglinwood (merged: 2026-03-11 12:24 (UTC+8))
- #35752 [NIXL][1/N] Refactor `kernel_block_size` detection — ready,v1,kv-connector — by NickLucche (merged: 2026-03-11 16:11 (UTC+8))
- #35765 AITER MLA backend: Avoid CPU sync in _build_decode — rocm,ready,v1 — by pschlan-amd (merged: 2026-03-11 15:25 (UTC+8))
- #35923 [Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 — bug,documentation,rocm,ready,v1,qwen — by JartX (merged: 2026-03-11 15:45 (UTC+8))
- #36438 [Hardware][NIXL] set default kv buffer type for different platform — ready,kv-connector — by zhenwei-intel (merged: 2026-03-11 13:19 (UTC+8))
- #36458 [XPU] Support block fp8 moe by fallback to TritonExpert on XPU — ready — by jikunshang (merged: 2026-03-11 12:54 (UTC+8))
- #36578 feat: add RISC-V support for CPU backend (v2) — ready,ci/build,cpu — by typer-J (merged: 2026-03-11 12:51 (UTC+8))
- #33503 feat(spec_decode): fuse EAGLE step slot mapping and metadata updates — speculative-decoding,ready,v1 — by sladyn98 (merged: 2026-03-11 12:35 (UTC+8))
- #36713 [Doc] Fix duplicate words in comments — ready,v1,multi-modality,nvidia — by Hongbin10 (merged: 2026-03-11 12:28 (UTC+8))
- #31785 [Fix] Use torch.empty for output in attention+quant fusion — ready — by elvischenv (merged: 2026-03-11 12:26 (UTC+8))
- #35781 [Perf] Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement — ready,v1,kv-connector — by yewentao256 (merged: 2026-03-11 12:25 (UTC+8))
- #36699 Add tuned H100 MoE configs for LFM2 8B and 24B — no labels — by tianshu-Michael-yu (merged: 2026-03-11 12:22 (UTC+8))
- #36719 [ci] Bound nvidia-cudnn-frontend version — ready,ci/build,nvidia — by khluu (merged: 2026-03-11 12:17 (UTC+8))
- #36723 [DSV3.2][MTP] Optimize Indexer MTP handling — ready,v1 — by benchislett (merged: 2026-03-11 12:16 (UTC+8))
PRs Closed Without Merging
- #36766 fix(test): skip test_schema in opcheck for rms_norm_per_block_quant (… — needs-rebase — by mahendrarathore1742 (closed: 2026-03-12 07:05 (UTC+8))
- #17094 [Bugfix] fix phi4-mini tool call parse in streaming mode — documentation,frontend,stale,tool-calling — by yulin-li (closed: 2026-03-12 10:18 (UTC+8))
- #26889 [DeepSeekV3.2] Fix Loading BF16 weights — stale,deepseek — by qili93 (closed: 2026-03-12 10:17 (UTC+8))
- #27299 [Kernel] Mamba2 SSD add fused kernel for 1.5-2.5x SSD (prefill) speedup — stale — by RishiAstra (closed: 2026-03-12 10:16 (UTC+8))
- #27771 [DBO] Compile non-dbo cudagraphs for shapes that are close to dbo_decode_token_threshold — stale,v1,nvidia — by SageMoore (closed: 2026-03-12 10:16 (UTC+8))
- #35957 fix online fp8 for MiniCPM models — needs-rebase — by yma11 (closed: 2026-03-12 09:05 (UTC+8))
- #36819 [Docs] Update Claude Code doc to include caching performance tip — documentation — by njhill (closed: 2026-03-12 06:37 (UTC+8))
- #30413 fix(nemotron_h): Add missing rotary positional embeddings to attention layers — no labels — by kitaekatt (closed: 2026-03-12 06:06 (UTC+8))
- #34068 [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace — rocm,torch.compile,needs-rebase,nvidia,vllm-ir — by ProExpertProg (closed: 2026-03-12 05:45 (UTC+8))
- #36820 Mkorhone openvla pr merge — performance,new-model,rocm,needs-rebase,ci/build,v1,multi-modality,qwen — by mkorhone (closed: 2026-03-12 04:43 (UTC+8))
- #36814 `nano-nemotron-vl` dummy video: Respect `media_io_kwargs.num_frames` if set to budget encoder cache correctly — no labels — by netanel-haber (closed: 2026-03-12 04:06 (UTC+8))
- #36722 [vLLM IR] 2/N batch-invariant-aware dispatching and rms_norm — rocm,needs-rebase,ci/build,nvidia — by ProExpertProg (closed: 2026-03-12 03:59 (UTC+8))
- #36744 [Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server — documentation,frontend — by smfirmin (closed: 2026-03-12 03:56 (UTC+8))
- #36655 [Frontend] Allow `engine_client=None` in `OpenAIServingModels` — frontend,needs-rebase — by sagearc (closed: 2026-03-12 03:15 (UTC+8))
- #36564 [Draft] Support temporal compression for videos — v1 — by collinmccarthy (closed: 2026-03-12 01:22 (UTC+8))
- #34203 [Model] Remove InternLM — documentation,new-model,ready,llama — by noooop (closed: 2026-03-11 19:28 (UTC+8))
- #36764 [Feature] add Dflash on Ascend — documentation,new-model,speculative-decoding,ci/build,v1,multi-modality,llama,qwen — by chenaoxuan (closed: 2026-03-11 23:24 (UTC+8))
- #36549 [Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors — bug,kv-connector — by ZhanqiuHu (closed: 2026-03-11 23:04 (UTC+8))
- #27407 [Frontend] Speed up online server preprocess by using sync tokenizer. — frontend,stale — by noooop (closed: 2026-03-11 19:27 (UTC+8))
- #34727 [PD][Nixl] Add support for hybrid SSM-FA models — needs-rebase,v1,kv-connector — by NickLucche (closed: 2026-03-11 19:24 (UTC+8))
- #36636 [Bugfix] Fix KV cache offloading for hybrid attention models (e.g. Qwen3.5-27B) — bug,v1,qwen,kv-connector — by haosdent (closed: 2026-03-11 18:21 (UTC+8))
- #36765 fix: vulnerability fix for Inference — ci/build — by tusharsadana (closed: 2026-03-11 17:28 (UTC+8))
- #33505 [Feature]: Qwen3-Next dual-stream execution in_proj_qkvz in_proj_ba — qwen — by SouthWest7 (closed: 2026-03-11 17:00 (UTC+8))
- #36754 [Perf] optimize gdn mtp performance — qwen — by ZJY0516 (closed: 2026-03-11 16:02 (UTC+8))
- #36474 [Bugfix] Fix GPTQ Marlin size_k for Qwen3.5 — bug,qwen — by Isotr0py (closed: 2026-03-11 15:49 (UTC+8))
- #35177 [XPU] disable async-scheduling by default on XPU — no labels — by yma11 (closed: 2026-03-11 14:45 (UTC+8))
- #36745 [Misc] Relax transformers upper bound to include 5.2.0 — ci/build — by woforce (closed: 2026-03-11 13:57 (UTC+8))
- #35145 [ROCm][CI] Reverting changes in MI355 pipeline so that some default TGs are blocking on external AMD-CI signal — rocm,ci/build — by AndreasKaratzas (closed: 2026-03-11 13:36 (UTC+8))
- #36737 [WIP][Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server — documentation,frontend — by smfirmin (closed: 2026-03-11 12:47 (UTC+8))
- #36706 [WIP][Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server — documentation,frontend — by smfirmin (closed: 2026-03-11 11:40 (UTC+8))