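vLLM Development Activity Report - 2026-03-11

Time window: 2026-03-11 11:27 (UTC+8) ~ 2026-03-12 11:27 (UTC+8)
Statistics: New issues 33 | Closed issues 25 | New PRs 73 | Merged PRs 71 | Closed unmerged PRs 30

📊 Daily Development Summary

During the 24-hour window from March 11 to March 12, 2026, the vLLM project stayed highly active: 73 PRs were opened and 71 were merged. Progress centered on performance optimization for the AMD/ROCm ecosystem, support for new models (multimodal models in particular), and compatibility fixes for new hardware platforms such as DGX Spark. The community also discussed bug fixes and performance work across several core areas, including quantization, tool calling, and the scheduler.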

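🎯 AMD/ROCm Ecosystem Updates

AMD-related contributions were very active in this window, mainly in performance optimization, feature completion, and compatibility fixes.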

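Performance optimization (Perf):
- PR #36810 (new): [ROCm][Perf] Fused GEMM + static FP8 output quantization. Submitted by andyluo7. The key idea is to fuse the FP8 GEMM and static quantization into a single torch._scaled_mm call that emits FP8 directly, eliminating the DRAM round trip for the BF16 intermediate. On MI300X it reports a 1.5x to 1.6x speedup on common model shapes (such as 4096x4096). It relies on native FP8 output support in hipBLASLt on ROCm 6.0+.
- PR #36743 (new): [ROCm] Optimize concat_mla_q for CDNA3 (MI300X) and CDNA4 (MI355X). Also from andyluo7, with architecture-specific tuning: on CDNA3 it uses non-temporal store hints to improve HBM bandwidth efficiency (up to 23% faster at small token counts); on CDNA4 it enables 256-bit wide loads/stores, halving the number of vectorized loop iterations.
- PR #35093 (merged): Adds tuned kernel configs for the Kimi K2.5 MoE model on CDNA4, improving end-to-end performance at TP=8 (lower TTFT and TPOT).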
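The two arithmetic claims above (no BF16 round trip, half as many loop iterations) can be checked with a back-of-the-envelope sketch. This is not code from either PR. It assumes the unfused path writes a BF16 result, reads it back, and then writes FP8, and it assumes the 256-bit path is compared against a 128-bit baseline; the function names and the row width of 512 are purely illustrative.

```python
def output_traffic_bytes(m: int, n: int, fused: bool) -> int:
    """DRAM bytes moved for the GEMM output only (input reads are the same in both paths)."""
    elems = m * n
    if fused:
        return elems * 1  # the GEMM writes 1-byte FP8 values directly
    # Unfused (assumed): GEMM writes BF16 (2 B), quant kernel reads BF16 (2 B), writes FP8 (1 B).
    return elems * (2 + 2 + 1)


def vector_loop_iterations(num_elems: int, load_bits: int, elem_bits: int = 16) -> int:
    """Iterations of a vectorized copy loop that issues one wide load/store per iteration."""
    elems_per_load = load_bits // elem_bits
    return -(-num_elems // elems_per_load)  # ceiling division


if __name__ == "__main__":
    print(output_traffic_bytes(4096, 4096, fused=False))  # 83886080
    print(output_traffic_bytes(4096, 4096, fused=True))   # 16777216
    print(vector_loop_iterations(512, 128))               # 64
    print(vector_loop_iterations(512, 256))               # 32
```

Under these assumptions the output-side traffic drops by 5x, yet the reported end-to-end gain is 1.5x to 1.6x, which is plausible because input reads and the matrix multiply itself are unchanged by the fusion.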
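Features and compatibility:
- PR #36785 (new): Update rocm get gpu info capability. Submitted by tmm77. It lets amdsmi retrieve GPU information when torch is unavailable, to support more flexible deployments (for example, with the Ray backend).
- PR #35923 (merged): Fixes the ROCm attention backend failing on models with non-standard block sizes (such as 1056), including Qwen3.5.
- PR #36499 (merged): Adds gfx1152/gfx1153 (Krackan, RDNA 3.5) to the list of HIP-supported architectures so that kernels are compiled correctly for these new GPUs.
- PR #36274 (merged): Fixes the ROCm attention backend validation logic to match CUDA platform behavior by stripping block_size before validation, which allows irregular block sizes.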
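Related issues:
- Issue #36805: Reports a misleading TorchAO-related error message that presents the GPU compute capability (SM 8.9) as a CUDA version. The issue carries the rocm label, and a bot automatically notified the AMD maintainers (@hongxiayang, @tjtanaa).
- Issue #36769: A bug report on Qwen3.5 tool calling, likewise labeled rocm with the AMD maintainers notified.

Summary: AMD ecosystem contributors (including andyluo7, amd-asalykov, and tmm77) stood out in this window. The focus has shifted from basic feature support to deeper performance tuning and architecture-specific optimization, especially for MI300X/MI355X and the FP8 compute path, which suggests that vLLM support on AMD hardware is maturing.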

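💬 Analysis of High-Traffic Discussions

Issue #36753: “POST /wake_up causes vLLM process to crash”
- Core topic: A user trying to switch between multiple models on a single GPU with vLLM's Sleep/Wake feature found that sending a /wake_up request crashes the process.
- Views and progress:
  - User lazybum-sudo described the reproduction steps in detail and pinged the maintainers directly.
  - Maintainer DarkLight1337 handed it off to the relevant experts (@youkaichao, @tjtanaa).
  - Contributor KrxGu offered to investigate and suspects the problem lies in the sleep/wake control path and a failure in EngineCoreClient.ensure_alive().
- Current status: Open. The maintainer is waiting for a response from the pinged experts. The issue touches core engine lifecycle management, and the community is coordinating people to investigate.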
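Issue #36805: “Misleading error message for FP8 quantization”
- Core topic: The error message “Float8 dynamic activation quantization is only supported on CUDA>=8.9 and MI300+” is misleading: “CUDA>=8.9” actually refers to GPU compute capability (SM version), not the CUDA toolkit version.
- Views and positions: Reporter jranaraki identified this as a TorchAO issue and gave a clear suggested fix. A community bot automatically labeled the issue and notified the ROCm maintainers.
- Points of contention: None; this is a clear-cut bug report with a suggested improvement.
- Current status: Open, awaiting a fix upstream (TorchAO) or from vLLM maintainers.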
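To make the distinction concrete, here is a minimal sketch of a capability check whose message names compute capability explicitly. It is not TorchAO's code and not the fix proposed in the issue; the function name, the constant, and the wording of the message are hypothetical.

```python
# Hypothetical threshold taken from the quoted message: SM 8.9.
MIN_FP8_CAPABILITY = (8, 9)


def check_fp8_dynamic_quant_support(capability: tuple) -> None:
    """Raise a clearly worded error if an NVIDIA GPU's compute capability is too low.

    `capability` is a (major, minor) pair such as (8, 0) or (9, 0).
    """
    if tuple(capability) < MIN_FP8_CAPABILITY:
        major, minor = capability
        need = ".".join(str(part) for part in MIN_FP8_CAPABILITY)
        raise RuntimeError(
            "Float8 dynamic activation quantization requires an NVIDIA GPU with "
            f"compute capability >= {need} (found {major}.{minor}) or an AMD MI300+ GPU. "
            "This is a hardware property, not the installed CUDA toolkit version."
        )


if __name__ == "__main__":
    check_fp8_dynamic_quant_support((9, 0))  # passes silently
    try:
        check_fp8_dynamic_quant_support((7, 5))
    except RuntimeError as err:
        print(err)
```

Comparing (major, minor) tuples rather than strings or floats keeps the ordering correct for two-digit major versions such as (10, 0).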
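PR #36766 vs. PR #36779: Dispute over how to fix the opcheck test failure
- Core topic: Both PRs aim to fix the opcheck test for the rms_norm_per_block_quant kernel, which falsely reports that the weight tensor was mutated.
- Differing views:
  - PR #36766 skips the test_schema check in opcheck.
  - PR #36779 corrects the shape of the scales tensor that the test passes to the kernel and adds a runtime check, addressing the root cause.
- Points of contention: KrxGu (author of PR #36779) strongly opposed skipping test_schema, arguing that it would mask a real memory-safety risk, and noted that the approach in PR #36766 may not have been fully verified locally. ProExpertProg ultimately backed KrxGu's root-cause fix.
- Current status: PR #36766 has been closed, and PR #36779 is moving forward. The outcome reflects the community's emphasis on code quality, rigorous testing, and root-cause fixes.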
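The value of a schema check of this kind can be illustrated with a pure-Python analogue. This is not how torch's opcheck is implemented; the helper and the two toy kernels below are hypothetical and use plain lists in place of tensors. The check snapshots every argument, runs the kernel, and flags any argument that changed without being declared mutable.

```python
import copy


def find_undeclared_mutations(fn, args, declared_mutable):
    """Return indices of arguments that fn modified without declaring them mutable."""
    before = copy.deepcopy(args)
    fn(*args)
    return [
        i
        for i, (old, new) in enumerate(zip(before, args))
        if old != new and i not in declared_mutable
    ]


def good_kernel(out, weight, scales):
    # Toy per-element quantization: writes only `out` and `scales`.
    for i, w in enumerate(weight):
        scales[i] = abs(w) or 1.0
        out[i] = w / scales[i]


def buggy_kernel(out, weight, scales):
    good_kernel(out, weight, scales)
    weight[0] = 0.0  # stray write into an input that the schema calls read-only


def make_args():
    return [[0.0, 0.0], [2.0, -4.0], [0.0, 0.0]]  # out, weight, scales


if __name__ == "__main__":
    print(find_undeclared_mutations(good_kernel, make_args(), {0, 2}))   # []
    print(find_undeclared_mutations(buggy_kernel, make_args(), {0, 2}))  # [1]
```

Skipping such a check makes both kernels look identical to the test suite, which is the risk raised in the discussion; fixing the inputs that the test passes keeps the check meaningful.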

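🔥 Hot Topics and Trends

Support for new-generation hardware:
- NVIDIA Blackwell (sm_121): Several issues (#36821, #36835) report problems running vLLM on DGX Spark (GB10), including missing kernel support and failed CUTLASS compatibility checks. The community quickly supplied build-from-source workarounds and custom Docker images.
- AMD CDNA4 (MI355X): PR #36743 targets this architecture specifically, showing that support has reached the performance-tuning stage.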
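Large-scale and complex deployment issues:
- CPU offload and UMA: Issue #36796 reports CPU offload errors on GH200 in Unified Memory mode.
- Tensor parallelism and data parallelism: Issue #36793 reports that a quantized model breaks under a TP=2, DP=2 configuration.
- Distributed communication: PR #36834 proposes a new hierarchical AllReduce implementation intended to improve multi-node performance.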
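For readers unfamiliar with the idea, a generic two-level all-reduce can be simulated in a few lines. This sketch shows the textbook scheme only and does not claim to match what PR #36834 implements; the function name and the use of one scalar per rank are illustrative assumptions.

```python
def hierarchical_all_reduce(values, gpus_per_node):
    """Simulate a two-level sum all-reduce.

    values[r] is the scalar held by rank r. Returns (per-rank results,
    number of ranks that take part in the inter-node step).
    """
    # Group ranks by node.
    nodes = [values[i:i + gpus_per_node] for i in range(0, len(values), gpus_per_node)]
    # Step 1: reduce inside each node (fast local links) onto a node leader.
    node_sums = [sum(node) for node in nodes]
    # Step 2: all-reduce across node leaders only (slower network links).
    total = sum(node_sums)
    # Step 3: each leader broadcasts the result inside its node.
    return [total] * len(values), len(nodes)


if __name__ == "__main__":
    result, leaders = hierarchical_all_reduce(list(range(1, 9)), gpus_per_node=4)
    print(result)   # [36, 36, 36, 36, 36, 36, 36, 36]
    print(leaders)  # 2
```

The point of the hierarchy is that only one rank per node (2 here, rather than 8) communicates over the inter-node network.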
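Model and tooling ecosystem expansion:
- New model integration: Many PRs add support for new models and parsers, such as Kimi-Audio (#36127), ColPali (#36818), and the Granite4 tool parser (#36827).
- Speculative decoding enhancements: Several PRs (#36767, #36361, #35461) cover DFlash, EAGLE3 integration, and probabilistic rejection sampling, a current focus for improving inference efficiency.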

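🛠️ Key Technical Changes

PR #36810 (fused GEMM + FP8 quantization for ROCm): A key optimization proposed for FP8 inference on AMD platforms. It reduces data movement through kernel fusion and directly uses newer features of AMD hardware and the software stack; if merged, MI300-series users should see a tangible speedup.
PR #36799 (remove Sparse24 CT integration): A notable cleanup and deprecation change opened in this window. Because Sparse24 models are not widely used, it removes the related integration and kernels to reduce the maintenance burden. It reflects a trade-off between feature breadth and maintenance cost in favor of long-term codebase health.
PR #35461 (Model Runner V2 probabilistic rejection sampling): Adds a probabilistic rejection sampling algorithm to vLLM's Model Runner V2. Compared with strict rejection sampling, it raises the draft-token acceptance rate (+6.69 percentage points on GLM 4.7) and overall throughput (+15%), a meaningful step for speculative decoding.
PR #35781 (scheduler optimization for P/D disaggregation): Reduces scheduler overhead for asynchronous remote KV loading by using an event-driven approach that cuts queue churn, yielding roughly a 5% end-to-end performance gain. It reflects continued tuning of newer architectures such as disaggregation.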
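The acceptance-rate gain attributed to PR #35461 can be understood from the standard speculative sampling rule, sketched below. This is the textbook algorithm, not code from the PR. It assumes that "strict" means accepting a draft token only when it equals the target model's top choice; all function names and the toy distributions are illustrative.

```python
import random


def strict_accept(draft_token, target_probs):
    """Assumed 'strict' rule: accept only if the draft equals the target's argmax."""
    return draft_token == max(target_probs, key=target_probs.get)


def probabilistic_accept(draft_token, draft_probs, target_probs, rng):
    """Accept with probability min(1, p_target / p_draft)."""
    p = target_probs.get(draft_token, 0.0)
    q = draft_probs[draft_token]
    return rng.random() < min(1.0, p / q)


def resample_on_reject(draft_probs, target_probs, rng):
    """After a rejection, sample from the residual max(0, p_target - p_draft)."""
    tokens = list(target_probs)
    weights = [max(0.0, target_probs[t] - draft_probs.get(t, 0.0)) for t in tokens]
    if sum(weights) == 0.0:  # identical distributions: fall back to the target
        weights = [target_probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    target = {"a": 0.6, "b": 0.4}
    draft = {"a": 0.5, "b": 0.5}
    trials = 10_000
    accepted = sum(probabilistic_accept("b", draft, target, rng) for _ in range(trials))
    print(strict_accept("b", target))  # False: "b" is never accepted under the strict rule
    print(accepted / trials)           # close to 0.8 (= 0.4 / 0.5)
```

Accepting with probability min(1, p/q) and resampling from the residual on rejection keeps the output distributed according to the target model, while accepting draft tokens that a strict top-choice match would always discard.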

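📈 Development Activity Observations

Contributors: Activity was very high, with 73 PRs opened and 71 merged within a single day. AMD ecosystem contributors (andyluo7, amd-asalykov, and others) stood out, submitting a series of performance optimizations.
Code review and merging: 71 PRs were merged, indicating an efficient review and merge process. Many PRs were merged shortly after receiving the ready label and passing CI.
Cross-team collaboration: Work on multimodal model support, tool-call parsers, and similar features shows close collaboration among developers, model providers, and users.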

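💡 Issues Worth Watching

Compatibility with new-generation GPUs: Issues #36821 and #36835 show that official support for NVIDIA Blackwell (sm_121) is still incomplete, and users currently depend on community-provided custom builds. The project's official CI and release pipeline need to catch up.
Memory and concurrency stability: Issues #36796 (CPU offload), #36826 (streaming stalls and crashes under concurrent requests), and #36793 (TP/DP configuration failure) all point to deeper stability challenges in memory management and concurrency synchronization under complex configurations and high load.
Interaction between quantization and compilation: Issue #36805 (error message), PRs #36766/#36779 (opcheck test), and reports of GPTQ Marlin illegal memory access (such as Issue #36811) all reflect how complex the interaction is between the quantization stack and lower-level systems such as PyTorch compilation, which makes this an error-prone area.
Engine lifecycle management: Issue #36753 (/wake_up crash) involves the Sleep/Wake feature, whose robustness is critical for multi-model deployments in production and deserves attention from core maintainers.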

📋 Appendix: Detailed Data Lists

New Issues

#36773 [Bug]: Qwen3.5-35B-A3B on B200 with vllm v0.17.0 output random result | bug | by elepherai (created: 2026-03-11 18:05 (UTC+8))
#36759 [Bug]: Error when deploying Qwen/Qwen3-VL-Reranker-2B with vLLM on 910b4 | bug | by wangjiannan98 (created: 2026-03-11 16:11 (UTC+8))
#36794 [Feature]: Improve Error Message When Max Length Reached With Tool Call=Required | feature request | by SvenLorenz (created: 2026-03-11 22:24 (UTC+8))
#36835 [Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries | bug | by LironKesem (created: 2026-03-12 09:34 (UTC+8))
#36833 [Bug]: GLM-4.7-Flash does not return tool_calls field in vLLM 0.16.0 even with --tool-call-parser glm47 | bug | by CodeofGame (created: 2026-03-12 08:56 (UTC+8))
#36828 [Bug]: vLLm 0.17.1 (docker) crash with Qwen 3.5 27B-FP8 in BatchPrefillWithPagedKVCache | bug | by matatonic (created: 2026-03-12 06:22 (UTC+8))
#36821 [Bug]: No sm_121 (Blackwell) support on aarch64 - NVIDIA DGX Spark / Acer GN100 | bug | by blogtheristo (created: 2026-03-12 04:52 (UTC+8))
#36830 [Bug]: delta_text and delta_token_ids get out of sync when stop sequences are used. | bug | by maxdebayser (created: 2026-03-12 06:48 (UTC+8))
#36826 [Bug]: Streaming stalls and can crash when concurrent requests hit the same vLLM server | bug | by imi187 (created: 2026-03-12 06:12 (UTC+8))
#36802 [Bug]: Tesla T4 GPU - triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 81920, Hardware limit: 65536. Reducing block sizes or num_stages | bug | by ksecurity45 (created: 2026-03-12 00:21 (UTC+8))
#36811 [Bug]: CUDA illegal memory access on GPTQ Marlin | bug | by tonibofarull (created: 2026-03-12 02:34 (UTC+8))
#36741 [Usage]: Not Working with latest version qwen3.5 9B AI MODEL | usage | by kusiyaitkrishna (created: 2026-03-11 12:13 (UTC+8))
#36796 [Bug]: CPU offload errors on nightly with NVIDIA GH200 Unified Memory (UMA) | bug,torch.compile | by ehfd (created: 2026-03-11 22:53 (UTC+8))
#36805 [Bug]: Misleading error message for FP8 quantization: refers to CUDA version instead of GPU compute capability | bug,rocm | by jranaraki (created: 2026-03-12 00:45 (UTC+8))
#36804 [Feature]: FA4 Attention Sinks | feature request | by pbarker-synth (created: 2026-03-12 00:42 (UTC+8))
#36793 [Bug]: TP=2 DP=2 Broken for Qwen3-Next W4A16 | bug | by Chibukach (created: 2026-03-11 22:14 (UTC+8))
#36784 [New Model]: OmniASR by meta | no labels | by HDprojec8s (created: 2026-03-11 20:22 (UTC+8))
#36783 [Bug]: CUBLAS_STATUS_INVALID_VALUE on Qwen3.5-122B-A10B-FP8 during profile run | bug | by twright8 (created: 2026-03-11 19:38 (UTC+8))
#36778 [Bug]: Using vLLM to deploy Minimax m2.5, the thinking/reasoning cannot be disable. | bug | by shuixiaoer (created: 2026-03-11 18:55 (UTC+8))
#36781 [Bug]: vLLM 0.17.0 failed to serve Qwen3-30B-A3B-Instruct-2507 after adding --enable_lora | bug | by Junxiao-Zhao (created: 2026-03-11 19:24 (UTC+8))
#36780 [RFC][NixlConnector]: Add support for hybrid SSM-FA models | RFC | by NickLucche (created: 2026-03-11 19:06 (UTC+8))
#36777 [Feature]: Kimi K2.5 Speculative Decoding | feature request | by casper-hansen (created: 2026-03-11 18:45 (UTC+8))
#36776 [Bug]: qwen 3.5 crash under dp 8 | bug | by ZJY0516 (created: 2026-03-11 18:31 (UTC+8))
#36755 [Bug] Preemption + async scheduling race can corrupt prompt-token accounting and crash Prometheus counters | no labels | by xiaguan (created: 2026-03-11 15:49 (UTC+8))
#36753 [Bug]: POST /wake_up causes vLLM process to crash. 500 Internal Server Error | bug | by lazybum-sudo (created: 2026-03-11 15:44 (UTC+8))
#36772 [Bug]: Docker Model Runner vLLM ignores tensor parallel config and starts with world_size=1 on 4x RTX 3060 | bug | by PatcharaphonTiw (created: 2026-03-11 18:02 (UTC+8))
#36771 [Bug]: LMCache does not work with vLLM 0.17.0 (Qwen3Next) | bug | by Sanches166 (created: 2026-03-11 18:02 (UTC+8))
#36769 [Bug]: Qwen3.5 tool calling bug | bug,rocm | by DifferentialityDevelopment (created: 2026-03-11 17:51 (UTC+8))
#36763 [Bug]: Kimi-K2.5 outputs only ‘!!!!!!!!!!’ in reasoning field, content is always null | no labels | by LaRiffle (created: 2026-03-11 17:16 (UTC+8))
#36736 [Bug]: Qwen3.5-35B-A3B-FP8 inference output terminates unexpectedly, logs show normal but request hangs | no labels | by RagnarokChan (created: 2026-03-11 11:37 (UTC+8))
#36750 [Question]HOW TO Enabling FlashAttention- 4 backend for NVIDIA PRO 6000 (Blackwell) with MiniMax-2.5-230B | usage | by luojichen (created: 2026-03-11 14:53 (UTC+8))
#36748 [Bug]: In DP mode, waiting request stack in a few DP ranks. | bug | by xiaoxiaosuaxuan (created: 2026-03-11 14:22 (UTC+8))
#36739 [CPU Backend] Refactor CPU FusedMoE to MK flow | cpu | by bigPYJ1151 (created: 2026-03-11 11:45 (UTC+8))

Closed Issues

#36631 [Bug]: v0.17.0 4*2080ti 22G Qwen3.5 RPC call to sample_tokens timed out. | bug | by VIT-Valentin (closed: 2026-03-12 10:40 (UTC+8))
#36759 [Bug]: Error when deploying Qwen/Qwen3-VL-Reranker-2B with vLLM on 910b4 | bug | by wangjiannan98 (closed: 2026-03-12 10:35 (UTC+8))
#36233 [Bug]: Critical bug in tool_call test scenarios [vllm version 0.16.0] | bug | by ajiang17 (closed: 2026-03-12 10:26 (UTC+8))
#19403 [Bug]: Issue of Unstable Output for Identical Queries | bug,torch.compile,stale | by skylee-01 (closed: 2026-03-12 10:17 (UTC+8))
#23221 [Feature][Responses API] Stream Arbitrary MCP Tool | feature request,stale | by simon-mo (closed: 2026-03-12 10:17 (UTC+8))
#24580 [Bug]: crash on startup with TP>1 on H100 nodes: AttributeError in ShmRingBuffer.shared_memory | bug,stale | by sanderland (closed: 2026-03-12 10:17 (UTC+8))
#27400 [Bug]: vLLM0.11.0 is about 10x slower than 0.8.1 on classification task | bug,stale | by a7744hsc (closed: 2026-03-12 10:16 (UTC+8))
#27437 [Bug]: Extremely slow FA3 on Hopper for CUDA 13.0 | bug,stale | by zyongye (closed: 2026-03-12 10:16 (UTC+8))
#28063 [Bug]: Error processing images with Qwen3-VL | bug,stale | by carlosdcuba1 (closed: 2026-03-12 10:16 (UTC+8))
#28125 [Bug]: Qwen3-VL-235B illegal memory access was encountered | bug,stale | by RocketRider (closed: 2026-03-12 10:16 (UTC+8))
#28326 [RFC]: vLLM Support for Generic Model Definitions | RFC,stale | by bwasti (closed: 2026-03-12 10:16 (UTC+8))
#28362 [Usage]: Can’t get vLLM to run on an Intel 125H with XPU and Arc graphics | usage,intel-gpu,stale | by phlibi (closed: 2026-03-12 10:16 (UTC+8))
#28381 [Bug]: docker Loading safetensors too slow | bug,stale | by zwukong (closed: 2026-03-12 10:16 (UTC+8))
#28465 [RFC]: Publishing KVBlocks event with mm_hash | RFC,stale | by Bounty-hunter (closed: 2026-03-12 10:16 (UTC+8))
#28486 [Installation]: No pre-built wheel available in 0.11.0 | installation,stale | by Tfloow (closed: 2026-03-12 10:16 (UTC+8))
#28489 [Usage]: Online continuous batching | usage,stale | by GenVr (closed: 2026-03-12 10:16 (UTC+8))
#33864 Bug: CPU KV cache offloading fails for blocks formed during decode | bug,kv-connector | by sriumcp (closed: 2026-03-12 04:43 (UTC+8))
#35766 [Bug]: aot_compile setting some aotautograd configs that change the cache key | bug,torch.compile | by zou3519 (closed: 2026-03-12 01:15 (UTC+8))
#36050 [Bug]: DeepSeek v3.2 FP8 Failure to start server | bug | by wzhao18 (closed: 2026-03-12 00:07 (UTC+8))
#36653 [Bug]: qwen3.5 Mismatch in image token count between text and input_ids. Got ids=[4091] | bug | by JakubCerven (closed: 2026-03-11 22:26 (UTC+8))
#34607 [Bug]: specualative decoding error in 0.15.1 | bug | by hocop (closed: 2026-03-11 22:21 (UTC+8))
#36433 [Bug]: matryoshka need gpu-memory??? | bug,good first issue | by ciaoyizhen (closed: 2026-03-11 15:15 (UTC+8))
#35909 [Bug]: Error when using Qwen3-VL/Qwen3.5 with video input and num_frames | bug | by danigarciaoca (closed: 2026-03-11 18:13 (UTC+8))
#36625 [XPU][NixlConnector] Add ze_ipc transport support for single-node PD disaggregation | no labels | by Yanli2190 (closed: 2026-03-11 11:49 (UTC+8))
#33768 [Usage]: How to set the language in Qwen3-Asr | usage | by xyqsgdog-ctrl (closed: 2026-03-11 11:34 (UTC+8))

New PRs

#36841 [Bugfix] Fix crash when tool_choice=required exceeds max_tokens | bug,frontend,ready,tool-calling | by chaunceyjiang (created: 2026-03-12 11:19 (UTC+8))
#36761 [CI Failure] Fix Language Models Test (Extended Pooling) daily CI Failure | ready | by noooop (created: 2026-03-11 16:40 (UTC+8))
#36840 fix: correctly save fp8 kernel tuning results in multi-GPU mode | performance | by cs-cat (created: 2026-03-12 10:50 (UTC+8))
#36834 Added Hierarchical all reduce | nvidia | by lekhit (created: 2026-03-12 09:24 (UTC+8))
#36824 [Model Runner V2] Do not initialize sampler for non-last PP ranks | ready,v1 | by WoosukKwon (created: 2026-03-12 05:47 (UTC+8))
#36839 [Bugfix] Fix Responses API crash when ResponseReasoningItem has content=None | bug,frontend | by rranabha (created: 2026-03-12 10:31 (UTC+8))
#36766 fix(test): skip test_schema in opcheck for rms_norm_per_block_quant (… | needs-rebase | by mahendrarathore1742 (created: 2026-03-11 17:41 (UTC+8))
#36838 enable flashinfer moe kernel for DP + EP | no labels | by czhu-cohere (created: 2026-03-12 10:11 (UTC+8))
#36832 Increase Test Coverage for Distributed Comm Patterns | no labels | by puririshi98 (created: 2026-03-12 08:56 (UTC+8))
#36837 Expand Speculative Decoding Coverage | speculative-decoding,v1 | by puririshi98 (created: 2026-03-12 09:49 (UTC+8))
#36756 Rename misleading num_sampled_tokens to num_sampled_reqs | v1 | by ENg-122 (created: 2026-03-11 15:51 (UTC+8))
#36809 Revert “Fix hf_override_fn when it modifies model_type” (#35200) | no labels | by zhewenl (created: 2026-03-12 01:58 (UTC+8))
#36831 [XPU][Doc] Remove manual OneAPI install step, now handled by torch-xpu | documentation,ready | by jikunshang (created: 2026-03-12 08:44 (UTC+8))
#36836 [Feat][Executor] Introduce RayExecutorV2 | ci/build,v1 | by jeffreywang-anyscale (created: 2026-03-12 09:40 (UTC+8))
#36829 [Frontend] Exclude anthropic billing header to avoid prefix cache miss | documentation,frontend,ready | by njhill (created: 2026-03-12 06:37 (UTC+8))
#36822 [BugFix] Fix multiple/duplicate stdout prefixes | bug,frontend,ready,v1 | by njhill (created: 2026-03-12 05:04 (UTC+8))
#36767 Dflash integration | new-model,speculative-decoding,v1,qwen | by StanislavII (created: 2026-03-11 17:43 (UTC+8))
#36779 [Bugfix] opcheck false mutation error in rms_norm_per_block_quant (#36688) | bug,needs-rebase | by KrxGu (created: 2026-03-11 19:02 (UTC+8))
#36817 [Model Runner V2] Add Support for XD-RoPE | v1,nvidia | by santiramos27 (created: 2026-03-12 04:08 (UTC+8))
#36827 Add simple granite4 tool parser | documentation,tool-calling | by maxdebayser (created: 2026-03-12 06:17 (UTC+8))
#36819 [Docs] Update Claude Code doc to include caching performance tip | documentation | by njhill (created: 2026-03-12 04:23 (UTC+8))
#36825 [Bugfix] Fix Qwen3.5 LoRA IndexError in packed_modules_mapping | bug,qwen | by hallerite (created: 2026-03-12 06:00 (UTC+8))
#36823 [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace | torch.compile,nvidia,vllm-ir | by ProExpertProg (created: 2026-03-12 05:35 (UTC+8))
#36816 [vLLM IR] 2/N batch-invariant-aware dispatching and rms_norm | vllm-ir | by ProExpertProg (created: 2026-03-12 04:00 (UTC+8))
#36820 Mkorhone openvla pr merge | performance,new-model,rocm,needs-rebase,ci/build,v1,multi-modality,qwen | by mkorhone (created: 2026-03-12 04:41 (UTC+8))
#36818 [Model] Add ColPali late interaction model for multi-modal retrieval | documentation,new-model,multi-modality | by Kaonael (created: 2026-03-12 04:11 (UTC+8))
#36814 nano-nemotron-vl dummy video: Respect media_io_kwargs.num_frames if set to budget encoder cache correctly | no labels | by netanel-haber (created: 2026-03-12 03:36 (UTC+8))
#36815 [Model Runner V2] Introduce num_tokens_for_attn | v1,nvidia | by WoosukKwon (created: 2026-03-12 03:48 (UTC+8))
#36744 [Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server | documentation,frontend | by smfirmin (created: 2026-03-11 12:48 (UTC+8))
#36813 [Tests] Skip model weight download for render-only test server | no labels | by sagearc (created: 2026-03-12 03:28 (UTC+8))
#36798 Correct link to supported hardware on vllm.ai | documentation,ready | by hmellor (created: 2026-03-11 23:23 (UTC+8))
#36812 [Metrics] Temporary band-aid for “Counters can only be incremented by non-negative amounts” | ready,v1 | by markmc (created: 2026-03-12 02:38 (UTC+8))
#36810 [ROCm][Perf] Fused GEMM + static FP8 output quantization | rocm | by andyluo7 (created: 2026-03-12 02:14 (UTC+8))
#36799 [Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels | performance,ci/build,nvidia | by kylesayrs (created: 2026-03-12 00:02 (UTC+8))
#36791 fix for the model not even loading and zero accuracy | no labels | by hypdeb (created: 2026-03-11 21:35 (UTC+8))
#36787 Make Gemma and Gemma 2 accept inputs_embeds like Gemma 3 | ready | by hmellor (created: 2026-03-11 21:04 (UTC+8))
#36807 [Bugfix] Pad Marlin FP8 MoE weight dims to tile alignment under TP > 1 | bug | by ssubhanjali (created: 2026-03-12 01:16 (UTC+8))
#36808 Support temporal compression for videos | no labels | by collinmccarthy (created: 2026-03-12 01:22 (UTC+8))
#36806 Only show FP4 Marlin fallback warning for w4a4 models | ready | by mgoin (created: 2026-03-12 01:07 (UTC+8))
#36792 Fix ExaoneMoeMTP test that never ran in Transformers v4 | ready,multi-modality | by hmellor (created: 2026-03-11 21:37 (UTC+8))
#36785 Update rocm get gpu info capability | rocm | by tmm77 (created: 2026-03-11 20:27 (UTC+8))
#36803 [Test] E2E Nemotron-3-Super tests | ready,ci/build,nvidia | by roikoren755 (created: 2026-03-12 00:36 (UTC+8))
#36770 [Misc] Clean up renderers | ready,multi-modality,llama | by DarkLight1337 (created: 2026-03-11 18:01 (UTC+8))
#36800 [Bugfix] Fix Qwen2.5-omni/Qwen3-omni mm_processor cache for audio_in_video request | bug,qwen | by Isotr0py (created: 2026-03-12 00:20 (UTC+8))
#36797 [Bugfix][Core] Allow multi-dtype MambaSpec KV cache spec | bug,v1 | by MohdElgaar (created: 2026-03-11 23:13 (UTC+8))
#36789 [Bugfix] Fix negative max_tokens when input prompt is too long | bug,frontend,ready | by Isotr0py (created: 2026-03-11 21:20 (UTC+8))
#36801 [kv_offload] Fix bare Exception types and add FilterReusedOffloadingManager tests | v1 | by Hongbin10 (created: 2026-03-12 00:21 (UTC+8))
#36775 [Misc] Add online audio_in_video test | ready,multi-modality,qwen | by Isotr0py (created: 2026-03-11 18:28 (UTC+8))
#36764 [Feature] add Dflash on Ascend | documentation,new-model,speculative-decoding,ci/build,v1,multi-modality,llama,qwen | by chenaoxuan (created: 2026-03-11 17:23 (UTC+8))
#36790 Disable docs build skipping until a better solution is found | ready | by hmellor (created: 2026-03-11 21:25 (UTC+8))
#36742 [EPD] update EPD script arguments | documentation,kv-connector | by zhenwei-intel (created: 2026-03-11 12:42 (UTC+8))
#36788 Fix tied weights in weight mapping test for Transformers v5 | ready,multi-modality | by hmellor (created: 2026-03-11 21:15 (UTC+8))
#36795 [Perf] Enable dual stream execution of input projection for Qwen3 Next | qwen | by xyang16 (created: 2026-03-11 22:39 (UTC+8))
#36762 [Model Runner V2] Remove unused warmup_for_prefill method | ready,v1 | by WoosukKwon (created: 2026-03-11 17:01 (UTC+8))
#36786 Extend EPLB support | no labels | by hypdeb (created: 2026-03-11 20:57 (UTC+8))
#36782 [Bugfix] Fix Mistral-small --format | bug,documentation | by 12010486 (created: 2026-03-11 19:29 (UTC+8))
#36768 Update Flashinfer to 0.6.6 | ready,ci/build,nvidia | by dbari (created: 2026-03-11 17:43 (UTC+8))
#36774 [Bugfix] fix Qwen3.5 tool calling bug | bug,qwen | by chaunceyjiang (created: 2026-03-11 18:24 (UTC+8))
#36749 [openapi] refactor render related openapi [3/N] | frontend,ready | by andyxning (created: 2026-03-11 14:31 (UTC+8))
#36752 [Bugfix] Reset num_cached_tokens sentinel on request preemption | bug,v1 | by xiaguan (created: 2026-03-11 15:43 (UTC+8))
#36751 fix(minicpmv): fix audio inference by handling meta device in init_re… | ready | by tc-mb (created: 2026-03-11 15:33 (UTC+8))
#36765 fix: vulnerability fix for Inference | ci/build | by tusharsadana (created: 2026-03-11 17:27 (UTC+8))
#36760 Fix(Offline Inference): Enable reasoning parser support in LLM class … | frontend | by xueliangyang-oeuler (created: 2026-03-11 16:34 (UTC+8))
#36758 Fix(DP): Optimize load balancing score calculation (#36748) | v1 | by xueliangyang-oeuler (created: 2026-03-11 16:08 (UTC+8))
#36757 Fix(Scheduler): Reset num_cached_tokens on preemption to prevent acco… | v1,nvidia | by xueliangyang-oeuler (created: 2026-03-11 15:59 (UTC+8))
#36754 [Perf] optimize gdn mtp performance | qwen | by ZJY0516 (created: 2026-03-11 15:46 (UTC+8))
#36747 [Bugfix] Fix unclean shutdown crash with AllReduce Fusion workspace | bug,nvidia | by mvanhorn (created: 2026-03-11 14:14 (UTC+8))
#36746 [Bugfix] Deduplicate sampled token in top_logprobs output | bug,v1 | by mvanhorn (created: 2026-03-11 14:10 (UTC+8))
#36745 [Misc] Relax transformers upper bound to include 5.2.0 | ci/build | by woforce (created: 2026-03-11 13:02 (UTC+8))
#36743 [ROCm] Optimize concat_mla_q for CDNA3 (MI300X) and CDNA4 (MI355X) | rocm,nvidia | by andyluo7 (created: 2026-03-11 12:43 (UTC+8))
#36737 [WIP][Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server | documentation,frontend | by smfirmin (created: 2026-03-11 11:39 (UTC+8))
#36738 Fix(MoE): Relax assertion for non-gated NVFP4 MoE models (#31782) | documentation,deepseek,nvidia | by xueliangyang-oeuler (created: 2026-03-11 11:44 (UTC+8))
#36740 [Bugfix] Respect hf-config-path during GGUF speculator detection | bug | by WineChord (created: 2026-03-11 11:55 (UTC+8))

Merged PRs

#36593 [XPU]Bug fix for some unexpected error when use AgRs backend on XPU device. | bug,ready,v1 | by ys950902 (merged: 2026-03-11 17:17 (UTC+8))
#35895 [Bugfix] Fix minimax_m2 tool parser when stream interval > 1 | bug,documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build | by sfeng33 (merged: 2026-03-12 10:25 (UTC+8))
#36831 [XPU][Doc] Remove manual OneAPI install step, now handled by torch-xpu | documentation,ready | by jikunshang (merged: 2026-03-12 09:46 (UTC+8))
#35461 [Model Runner V2] Add probabilistic rejection sampling for spec decoding | ready,v1 | by TheEpicDolphin (merged: 2026-03-12 05:04 (UTC+8))
#36829 [Frontend] Exclude anthropic billing header to avoid prefix cache miss | documentation,frontend,ready | by njhill (merged: 2026-03-12 09:20 (UTC+8))
#36316 [BugFix]: add bagel to MM_PREFIX_LM_MODELS | bug,ready | by princepride (merged: 2026-03-12 03:55 (UTC+8))
#36710 [Perf] Optimize compute maxsim using batched version, 3.2% E2E throughput improvement | frontend,ready,v1 | by yewentao256 (merged: 2026-03-12 08:37 (UTC+8))
#36499 [ROCm][CI/Build] Add gfx1152/gfx1153 (Krackan) to HIP supported architectures | rocm,ready,ci/build | by mgehre-amd (merged: 2026-03-12 07:14 (UTC+8))
#35811 [BUG] Fix async rlhf tests | bug,documentation,ready,ci/build,v1 | by hao-aaron (merged: 2026-03-12 06:06 (UTC+8))
#36556 [ci] Update rtol for test_classification | ready | by angelayi (merged: 2026-03-11 18:08 (UTC+8))
#36698 [Kernel] [Helion] [15/N] Split config files into per-platform files | ready | by gmagogsfm (merged: 2026-03-12 05:25 (UTC+8))
#36563 [Kernel] [Helion] [12/N] Use FakeTensorMode to avoid GPU allocation during config key computation | ready | by gmagogsfm (merged: 2026-03-12 05:25 (UTC+8))
#36683 [Kernel] [Helion] [14/N] Set autotune_ignore_errors=True during autotuning | ready | by gmagogsfm (merged: 2026-03-12 05:25 (UTC+8))
#35790 [Model Runner V2] Add WhisperModelState [6/N] | ready,ci/build,v1,nvidia | by WoosukKwon (merged: 2026-03-12 05:20 (UTC+8))
#36129 [LMCache] Pass TP size in lookup for MLA multi-reader locking | ready,kv-connector | by maobaolong (merged: 2026-03-12 04:45 (UTC+8))
#33881 [BugFix][kv_offload] Fix offloading decodes with async scheduling | bug,ready,v1,kv-connector | by orozery (merged: 2026-03-12 04:43 (UTC+8))
#36274 [Bugfix][ROCm] Strip block_size before attention backend validation | bug,rocm,ready | by jennyyyyzhen (merged: 2026-03-12 04:37 (UTC+8))
#36798 Correct link to supported hardware on vllm.ai | documentation,ready | by hmellor (merged: 2026-03-11 23:43 (UTC+8))
#36424 [Refactor] Remove dead code in KV connector | ready,v1,kv-connector | by yewentao256 (merged: 2026-03-12 03:40 (UTC+8))
#35093 [ROCm] add tuned moe_wna16_triton kernel configs for CDNA4 | rocm,ready | by amd-asalykov (merged: 2026-03-12 03:00 (UTC+8))
#36787 Make Gemma and Gemma 2 accept inputs_embeds like Gemma 3 | ready | by hmellor (merged: 2026-03-12 02:12 (UTC+8))
#36551 [torch.compile] Add support for non-contiguous fused RMSNorm + group quant | ready,ci/build | by ProExpertProg (merged: 2026-03-12 01:56 (UTC+8))
#31964 [KVConnector] Support worker -> scheduler metadata | ready,v1,kv-connector | by orozery (merged: 2026-03-12 01:36 (UTC+8))
#36707 fix: align lfm2 thumbnail token counting with HF | ready | by tianshu-Michael-yu (merged: 2026-03-12 01:28 (UTC+8))
#36161 Add 320 dimension size support to MLA | ready | by juliendenize (merged: 2026-03-12 01:21 (UTC+8))
#36792 Fix ExaoneMoeMTP test that never ran in Transformers v4 | ready,multi-modality | by hmellor (merged: 2026-03-12 01:10 (UTC+8))
#36770 [Misc] Clean up renderers | ready,multi-modality,llama | by DarkLight1337 (merged: 2026-03-12 00:39 (UTC+8))
#36789 [Bugfix] Fix negative max_tokens when input prompt is too long | bug,frontend,ready | by Isotr0py (merged: 2026-03-12 00:30 (UTC+8))
#36560 [CI] Add bfcl tool call correctness eval | ready,ci/build | by sfeng33 (merged: 2026-03-12 00:27 (UTC+8))
#36061 [Bugfix] Fix DP/EP Shared Expert With Monolithic Kernels | bug,ready | by robertgshaw2-redhat (merged: 2026-03-12 00:07 (UTC+8))
#36093 [torch.compile] Use FakeTensors instead of real GPU tensors for single-size compilation | ready | by zou3519 (merged: 2026-03-12 00:07 (UTC+8))
#35744 Fix routed experts capture for hybrid models (Mamba + Attention) | documentation,ready,v1 | by xhx1022 (merged: 2026-03-11 23:53 (UTC+8))
#36726 [Refactor] Remove deadcode in Responses API serving | frontend,ready | by sfeng33 (merged: 2026-03-11 15:11 (UTC+8))
#36238 Add ‘none’ reasoning effort to ChatCompletionRequest | frontend,ready | by juliendenize (merged: 2026-03-11 23:45 (UTC+8))
#36163 Add support to Mistral large 3 eagle with dense layers | speculative-decoding,ready | by juliendenize (merged: 2026-03-11 23:42 (UTC+8))
#33230 Add XPU MLA Sparse backend for DeepSeek v3.2 | documentation,ready,v1,deepseek | by wuxun-zhang (merged: 2026-03-11 19:19 (UTC+8))
#36361 Kimi k2.5 MLA based eagle3 | new-model,speculative-decoding,ready,v1,deepseek | by jhaotingc (merged: 2026-03-11 23:36 (UTC+8))
#36790 Disable docs build skipping until a better solution is found | ready | by hmellor (merged: 2026-03-11 21:53 (UTC+8))
#36788 Fix tied weights in weight mapping test for Transformers v5 | ready,multi-modality | by hmellor (merged: 2026-03-11 23:10 (UTC+8))
#36762 [Model Runner V2] Remove unused warmup_for_prefill method | ready,v1 | by WoosukKwon (merged: 2026-03-11 23:02 (UTC+8))
#36681 [ROCm][Perf] Allow MTP lens > 1 in Sparse MLA | rocm,speculative-decoding,ready,v1 | by tvirolai-amd (merged: 2026-03-11 22:43 (UTC+8))
#36673 [Misc][Attention] Clean up unused method in CPU_ATTN | ready,v1,cpu | by MatthewBonanni (merged: 2026-03-11 12:27 (UTC+8))
#35776 [Misc] Use envs module to get VLLM_DISABLED_KERNELS | ready | by hickeyma (merged: 2026-03-11 21:37 (UTC+8))
#36782 [Bugfix] Fix Mistral-small --format | bug,documentation | by 12010486 (merged: 2026-03-11 19:47 (UTC+8))
#36749 [openapi] refactor render related openapi [3/N] | frontend,ready | by andyxning (merged: 2026-03-11 18:14 (UTC+8))
#36136 [Bugfix] Fix Qwen3-VL timestamp mismatch when using num_frames without fps | bug,ready,multi-modality,qwen | by weiguangli-io (merged: 2026-03-11 18:13 (UTC+8))
#36318 Disable cascade attention by default | ready,codex | by mgoin (merged: 2026-03-11 18:12 (UTC+8))
#36751 fix(minicpmv): fix audio inference by handling meta device in init_re… | ready | by tc-mb (merged: 2026-03-11 18:06 (UTC+8))
#36358 [compile] aot_compile should respect VLLM_DISABLE_COMPILE_CACHE | ready | by zou3519 (merged: 2026-03-11 18:12 (UTC+8))
#36402 fix(lora): use replaced_module_name in pooling model name check | ready | by gambletan (merged: 2026-03-11 18:11 (UTC+8))
#36665 platforms: Fix Ray DP startup crash | ready | by itayalroy (merged: 2026-03-11 18:08 (UTC+8))
#36658 Add: Eagle3 support for Qwen3.5 | ready,qwen | by rahul-tuli (merged: 2026-03-11 18:07 (UTC+8))
#36667 [Refactor] Remove Molmo2 processor wrapper | ready | by DarkLight1337 (merged: 2026-03-11 18:07 (UTC+8))
#36321 [Bugfix] Support other quantization methods in glm41v | bug,ready | by LoganJane (merged: 2026-03-11 17:48 (UTC+8))
#36635 [NemotronH] Small fix reasoning parser | bug,ready,nvidia | by roikoren755 (merged: 2026-03-11 17:44 (UTC+8))
#36218 [UX] Infer dtype for local checkpoint | ready | by Isotr0py (merged: 2026-03-11 16:50 (UTC+8))
#36672 Remove unused config field from Gemma2 | ready | by hmellor (merged: 2026-03-11 16:51 (UTC+8))
#36127 [Model] Add support for moonshotai/Kimi-Audio-7B-Instruct | documentation,new-model,frontend,ready,multi-modality | by tunglinwood (merged: 2026-03-11 12:24 (UTC+8))
#35752 [NIXL][1/N] Refactor kernel_block_size detection | ready,v1,kv-connector | by NickLucche (merged: 2026-03-11 16:11 (UTC+8))
#35765 AITER MLA backend: Avoid CPU sync in _build_decode | rocm,ready,v1 | by pschlan-amd (merged: 2026-03-11 15:25 (UTC+8))
#35923 [Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 | bug,documentation,rocm,ready,v1,qwen | by JartX (merged: 2026-03-11 15:45 (UTC+8))
#36438 [Hardware][NIXL] set default kv buffer type for different platform | ready,kv-connector | by zhenwei-intel (merged: 2026-03-11 13:19 (UTC+8))
#36458 [XPU] Support block fp8 moe by fallback to TritonExpert on XPU | ready | by jikunshang (merged: 2026-03-11 12:54 (UTC+8))
#36578 feat: add RISC-V support for CPU backend (v2) | ready,ci/build,cpu | by typer-J (merged: 2026-03-11 12:51 (UTC+8))
#33503 feat(spec_decode): fuse EAGLE step slot mapping and metadata updates | speculative-decoding,ready,v1 | by sladyn98 (merged: 2026-03-11 12:35 (UTC+8))
#36713 [Doc] Fix duplicate words in comments | ready,v1,multi-modality,nvidia | by Hongbin10 (merged: 2026-03-11 12:28 (UTC+8))
#31785 [Fix] Use torch.empty for output in attention+quant fusion | ready | by elvischenv (merged: 2026-03-11 12:26 (UTC+8))
#35781 [Perf] Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement | ready,v1,kv-connector | by yewentao256 (merged: 2026-03-11 12:25 (UTC+8))
#36699 Add tuned H100 MoE configs for LFM2 8B and 24B | no labels | by tianshu-Michael-yu (merged: 2026-03-11 12:22 (UTC+8))
#36719 [ci] Bound nvidia-cudnn-frontend version | ready,ci/build,nvidia | by khluu (merged: 2026-03-11 12:17 (UTC+8))
#36723 [DSV3.2][MTP] Optimize Indexer MTP handling | ready,v1 | by benchislett (merged: 2026-03-11 12:16 (UTC+8))

Closed but Unmerged PRs

#36766 fix(test): skip test_schema in opcheck for rms_norm_per_block_quant (… | needs-rebase | by mahendrarathore1742 (closed: 2026-03-12 07:05 (UTC+8))
#17094 [Bugfix] fix phi4-mini tool call parse in streaming mode | documentation,frontend,stale,tool-calling | by yulin-li (closed: 2026-03-12 10:18 (UTC+8))
#26889 [DeepSeekV3.2] Fix Loading BF16 weights | stale,deepseek | by qili93 (closed: 2026-03-12 10:17 (UTC+8))
#27299 [Kernel] Mamba2 SSD add fused kernel for 1.5-2.5x SSD (prefill) speedup | stale | by RishiAstra (closed: 2026-03-12 10:16 (UTC+8))
#27771 [DBO] Compile non-dbo cudagraphs for shapes that are close to dbo_decode_token_threshold | stale,v1,nvidia | by SageMoore (closed: 2026-03-12 10:16 (UTC+8))
#35957 fix online fp8 for MiniCPM models | needs-rebase | by yma11 (closed: 2026-03-12 09:05 (UTC+8))
#36819 [Docs] Update Claude Code doc to include caching performance tip | documentation | by njhill (closed: 2026-03-12 06:37 (UTC+8))
#30413 fix(nemotron_h): Add missing rotary positional embeddings to attention layers | no labels | by kitaekatt (closed: 2026-03-12 06:06 (UTC+8))
#34068 [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace | rocm,torch.compile,needs-rebase,nvidia,vllm-ir | by ProExpertProg (closed: 2026-03-12 05:45 (UTC+8))
#36820 Mkorhone openvla pr merge | performance,new-model,rocm,needs-rebase,ci/build,v1,multi-modality,qwen | by mkorhone (closed: 2026-03-12 04:43 (UTC+8))
#36814 nano-nemotron-vl dummy video: Respect media_io_kwargs.num_frames if set to budget encoder cache correctly | no labels | by netanel-haber (closed: 2026-03-12 04:06 (UTC+8))
#36722 [vLLM IR] 2/N batch-invariant-aware dispatching and rms_norm | rocm,needs-rebase,ci/build,nvidia | by ProExpertProg (closed: 2026-03-12 03:59 (UTC+8))
#36744 [Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server | documentation,frontend | by smfirmin (closed: 2026-03-12 03:56 (UTC+8))
#36655 [Frontend] Allow engine_client=None in OpenAIServingModels | frontend,needs-rebase | by sagearc (closed: 2026-03-12 03:15 (UTC+8))
#36564 [Draft] Support temporal compression for videos | v1 | by collinmccarthy (closed: 2026-03-12 01:22 (UTC+8))
#34203 [Model] Remove InternLM | documentation,new-model,ready,llama | by noooop (closed: 2026-03-11 19:28 (UTC+8))
#36764 [Feature] add Dflash on Ascend | documentation,new-model,speculative-decoding,ci/build,v1,multi-modality,llama,qwen | by chenaoxuan (closed: 2026-03-11 23:24 (UTC+8))
#36549 [Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors | bug,kv-connector | by ZhanqiuHu (closed: 2026-03-11 23:04 (UTC+8))
#27407 [Frontend] Speed up online server preprocess by using sync tokenizer. | frontend,stale | by noooop (closed: 2026-03-11 19:27 (UTC+8))
#34727 [PD][Nixl] Add support for hybrid SSM-FA models | needs-rebase,v1,kv-connector | by NickLucche (closed: 2026-03-11 19:24 (UTC+8))
#36636 [Bugfix] Fix KV cache offloading for hybrid attention models (e.g. Qwen3.5-27B) | bug,v1,qwen,kv-connector | by haosdent (closed: 2026-03-11 18:21 (UTC+8))
#36765 fix: vulnerability fix for Inference | ci/build | by tusharsadana (closed: 2026-03-11 17:28 (UTC+8))
#33505 [Feature]: Qwen3-Next dual-stream execution in_proj_qkvz in_proj_ba | qwen | by SouthWest7 (closed: 2026-03-11 17:00 (UTC+8))
#36754 [Perf] optimize gdn mtp performance | qwen | by ZJY0516 (closed: 2026-03-11 16:02 (UTC+8))
#36474 [Bugfix] Fix GPTQ Marlin size_k for Qwen3.5 | bug,qwen | by Isotr0py (closed: 2026-03-11 15:49 (UTC+8))
#35177 [XPU] disable async-scheduling by default on XPU | no labels | by yma11 (closed: 2026-03-11 14:45 (UTC+8))
#36745 [Misc] Relax transformers upper bound to include 5.2.0 | ci/build | by woforce (closed: 2026-03-11 13:57 (UTC+8))
#35145 [ROCm][CI] Reverting changes in MI355 pipeline so that some default TGs are blocking on external AMD-CI signal | rocm,ci/build | by AndreasKaratzas (closed: 2026-03-11 13:36 (UTC+8))
#36737 [WIP][Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server | documentation,frontend | by smfirmin (closed: 2026-03-11 12:47 (UTC+8))
#36706 [WIP][Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server | documentation,frontend | by smfirmin (closed: 2026-03-11 11:40 (UTC+8))