vLLM Development Report - 2026-03-12

Time window: 2026-03-12 11:24 (UTC+8) ~ 2026-03-13 11:24 (UTC+8)
Statistics: 26 new issues | 24 closed issues | 82 new PRs | 41 merged PRs | 35 PRs closed without merging

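📊 Daily Development Summary

Between March 12 and March 13, 2026, the vLLM project kept up a high development pace: 82 PRs were opened, 41 were merged, and 24 issues were closed. Progress concentrated in two areas. First, AMD/ROCm support was strengthened markedly, with multiple fixes and feature improvements. Second, discussion and fixes clustered around new model integration, the stability of speculative decoding, and the KV cache and transfer architecture. Overall, the community is iterating on features quickly while working through complex problems exposed in production environments.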

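🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was high this period, centered on bug fixes, feature enhancements, and CI/CD optimization.

1. Issues (bug reports)

#36890 [Bug]: ROCm: tries to allocate 192GB VRAM for Qwen3.5 0.8B: When launching the Qwen3.5-0.8B model on ROCm, vLLM tries to allocate 192GB of VRAM. This points to a serious error in resource estimation or allocation. github-actions automatically labeled the issue and notified the ROCm maintainers (@hongxiayang, @tjtanaa).
#36843 [Bug]: VLLM 0.17.1 initial mtp with FLASH_ATTN randomly crash: The issue is not ROCm-specific, but it carries the rocm label and github-actions notified the ROCm maintainers, which suggests a possible link to attention backend selection on ROCm. The reporter traced the root cause themselves (a conflict between CUDA Graph and Triton JIT compilation) and noted that PR #36634 already fixes it.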

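2. PRs (code changes)

#36851 [ROCm] Enable Sequence Parallelism for AMD GPUs (MI300X/MI325X/MI355X) (submitted by ChuanLi1101): An important feature-enablement PR. It extends the sequence parallelism compilation optimization, previously CUDA-only, to ROCm (MI300X/MI325X/MI355X), offering a potential performance gain for models such as DeepSeek-V3 on multi-GPU AMD setups. Discussion: reviewer @ProExpertProg asked for end-to-end performance and accuracy evaluation, suggested merging the vLLM IR PRs first to reduce code churn, and encouraged the contributor to move on to the more advanced GEMM + communication fusion (AsyncTP).
#36855 [ROCm] Fix AITER sparse MLA crash for num_heads < 16 (e.g. GLM-5 TP=8) (submitted by ChuanLi1101): Fixes a crash on ROCm when running sparse MLA (Multi-Head Latent Attention) models with the AITER backend and fewer than 16 attention heads (num_heads) per GPU, for example GLM-5 at TP=8. The fix aligns the logic with the existing non-sparse AITER MLA implementation.
#36927 [ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn (submitted by divakar-amd): Adds FP8 x FP8 attention support (Query and Key/Value both in FP8) to the ROCm rocm_aiter_unified_attn backend. It follows up on PR #35316 and meets the AITER kernel's requirements by absorbing the scaling factors into softmax_scale and output_scale.
#36845 [ROCm] Fix KV copy methods and auto-select attention backend for ROCm (submitted by AndreasKaratzas): Fixes a TypeError crash on ROCm when NixlConnector is used for P/D disaggregation, caused by the missing insert_blocks_to_device and swap_out_blocks_to_host methods. It also updates the test scripts to select the attention backend automatically by GPU platform (TRITON_ATTN on ROCm, FLASH_ATTN on NVIDIA).
#36873 [ROCm] Fix build issues with cub:: namespace and missing headers: Fixes ROCm build problems caused by the cub namespace and the way headers were included, replacing them with a unified cub_helpers.h.
#36846 [ROCm] Validate block_size for explicitly selected attention backends: Fixes a regression introduced by #36274 so that block_size is validated correctly when the user explicitly selects an attention backend.
#36949 [ROCm][CI] Optimize ROCm Docker build: Several optimizations to the ROCm Docker build and CI flow, including a new build script, an improved caching strategy, and a separate DeepEP compilation stage that ensures DeepEP is installed correctly, all aimed at build efficiency and reliability.
#36086 [AMD][Build] Add DeepEP to ROCm Dockerfile (merged): Integrates AMD's DeepEP (a MoE expert-parallelism library) into the ROCm Docker image, so the related tests (test_deepep_moe.py) go from all skipped to all passing.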
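
The per-platform backend choice described for #36845 can be pictured with a small lookup. This is only an illustrative sketch: `GpuPlatform` and `pick_attention_backend` are hypothetical names invented here, not part of vLLM or of the PR, and only the two backend strings come from the summary above.

```python
from enum import Enum


class GpuPlatform(Enum):
    """Hypothetical platform tag used only in this sketch."""
    ROCM = "rocm"
    NVIDIA = "nvidia"


# Per-platform defaults as described for the updated test scripts.
DEFAULT_ATTN_BACKEND = {
    GpuPlatform.ROCM: "TRITON_ATTN",
    GpuPlatform.NVIDIA: "FLASH_ATTN",
}


def pick_attention_backend(platform: GpuPlatform) -> str:
    """Return the default attention backend name for the given platform."""
    return DEFAULT_ATTN_BACKEND[platform]


rocm_backend = pick_attention_backend(GpuPlatform.ROCM)
nvidia_backend = pick_attention_backend(GpuPlatform.NVIDIA)
```

Keeping the mapping in one table means a test script can stay platform-agnostic and simply ask for the default.
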

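💬 Hot Discussion Analysis

PR #36851 (enabling sequence parallelism on AMD GPUs)
- Core question: whether to merge this PR, which enables the sequence parallelism optimization on AMD GPUs, right away.
- Positions:
  - Contributor (ChuanLi1101): the optimization directly benefits ROCm multi-GPU users and fills a feature gap.
  - Reviewer (@ProExpertProg): urges caution. First, he asks for complete end-to-end performance and accuracy data. Second, he points out that the current pattern-matching implementation may become obsolete soon after the upcoming vLLM IR refactor, adding maintenance burden. He recommends that the contributor prioritize the more advanced GEMM + communication fusion (AsyncTP) and merge this PR after vLLM IR lands, to get the most benefit.
- Point of contention: how to balance feature delivery against the technical roadmap. Should AMD users get the existing optimization now, or should the project wait for more unified infrastructure and avoid a later refactor?
- Current status: the PR remains open, pending further data and a decision.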
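
Issue #36923 (RFC: support KV push from the P node to the D node)
- Core question: whether to introduce a push mode to replace the existing pull mode for KV transfer in P/D disaggregation.
- Positions:
  - Proposer (snadampal): push mode reduces latency (the D node need not wait for P to finish completely), makes layer-wise pipelined transfer easier, and in multi-D-node setups gives the P node better control over network scheduling.
  - Opponent (@robertgshaw2-redhat): push mode requires the D node to reserve and hold KV memory from the moment the P node starts computing, which lowers resource utilization. Supporting two implementations also adds significant maintenance cost. He asks for concrete performance data to justify the complexity.
- Point of contention: whether the potential performance gain from push mode outweighs the lower memory efficiency and the added code complexity.
- Current status: the RFC is under open discussion, and the proposer is looking for a suitable benchmark to quantify the gain.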
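
The trade-off argued in #36923 can be made concrete with a toy timeline model. This is a rough sketch under stated assumptions, not a description of NixlConnector: every layer takes the same compute time on P and the same transfer time, there is one serial link, pull mode allocates D-side KV memory only when the pull starts, and push mode reserves it from t = 0. The function names are invented for illustration.

```python
def pull_timeline(n_layers, compute, transfer):
    """Pull: D starts fetching only after P has finished every layer."""
    prefill_done = n_layers * compute
    kv_ready = prefill_done + n_layers * transfer
    # Assumption: D allocates KV blocks when the pull starts (at prefill_done).
    d_hold_before_decode = kv_ready - prefill_done
    return kv_ready, d_hold_before_decode


def push_timeline(n_layers, compute, transfer):
    """Push: P sends each layer as soon as it is computed, over one serial link."""
    link_free = 0.0
    for layer in range(n_layers):
        computed_at = (layer + 1) * compute
        # A layer is sent once it is computed and the link is idle.
        link_free = max(link_free, computed_at) + transfer
    kv_ready = link_free
    # Assumption: D reserves KV blocks from the moment P starts (t = 0).
    d_hold_before_decode = kv_ready
    return kv_ready, d_hold_before_decode


pull = pull_timeline(n_layers=4, compute=2.0, transfer=1.0)
push = push_timeline(n_layers=4, compute=2.0, transfer=1.0)
```

With these made-up numbers, the KV cache is ready at t = 9 under push versus t = 12 under pull, but the D node holds memory for 9 time units before decoding instead of 4. That is the latency-versus-utilization tension both sides describe; real figures depend on the benchmark the proposer is still looking for.
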
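
Issue #36921 (V1 engine crashes under burst load)
- Core question: with a large model (Qwen3.5-122B) under a burst of chat completion requests, the V1 engine fails to coordinate shared memory and eventually times out and crashes.
- Positions:
  - Reporter (natech-persona): provides detailed reproduction steps and logs, and notes that the failure occurs even with low KV cache utilization and ample shared memory, so it looks like a deeper bug in the engine or in inter-process communication.
  - Commenter (@ZJY0516): suggests checking CPU resources and suspects a link to JIT kernel compilation.
- Open question: whether the root cause lies in the scheduler, shared memory management, or an underlying communication library (such as NCCL).
- Current status: unresolved; deeper debugging information is needed.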

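🔥 Hot Topics and Trends

Deeper AMD platform support: This period's activity shows the vLLM community porting mature NVIDIA-side optimizations (such as sequence parallelism and advanced attention backend features) to AMD ROCm and bringing them to parity, while also investing heavily in ROCm build and test infrastructure.
Hidden reefs in speculative decoding: Several issues (#36843, #36872, #36906, #36917) involve speculative decoding, with problems that include conflicts with CUDA Graph and the compiler, degraded output quality, and concurrency crashes with multimodal inputs. Speculative decoding remains complex and fragile when combined with other features such as compilation optimizations, multimodality, and mixed scheduling.
Multimodal and tool-calling evolution: Several PRs add new multimodal retrieval models (ColPali, ColQwen3.5) and strengthen tool-call parsers (guided decoding for Kimi K2), showing the ecosystem expanding toward more complex AI application scenarios.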
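Performance optimization and kernel work: The community keeps making micro-optimizations on critical kernel paths, for example a packed recurrent decode fast path for Qwen GDN (#36596), INT8 KV cache support for the Triton backend (#36893), and FP8 support for TRTLLM and Triton MLA.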
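
For the INT8 KV cache work (#36893, whose title mentions a per-token, per-head scale), the idea of keeping one scale per (token, head) pair can be sketched in plain Python. This is an assumption-laden illustration: it uses symmetric quantization to the range [-127, 127], which the report does not confirm, and the function names are invented here rather than taken from the PR.

```python
def quantize_head(vec):
    """Symmetric INT8 quantization of one (token, head) vector."""
    amax = max(abs(x) for x in vec)
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def quantize_kv(kv):
    """kv[token][head] is a list of floats; one scale is kept per (token, head)."""
    codes, scales = [], []
    for token in kv:
        pairs = [quantize_head(head) for head in token]
        codes.append([c for c, _ in pairs])
        scales.append([s for _, s in pairs])
    return codes, scales


def dequantize_kv(codes, scales):
    """Multiply each INT8 code by the scale of its (token, head) pair."""
    return [
        [[c * s for c in head] for head, s in zip(tok_codes, tok_scales)]
        for tok_codes, tok_scales in zip(codes, scales)
    ]


kv = [[[1.0, -2.0, 0.5], [0.01, 0.02, -0.04]]]  # 1 token, 2 heads, head_dim 3
codes, scales = quantize_kv(kv)
restored = dequantize_kv(codes, scales)
```

Because the second head has much smaller values than the first, it gets its own, much smaller scale, so its entries are not crushed to zero as they would be with a single tensor-wide scale.
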
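KV cache and transfer architecture: Discussion around KV cache management, offloading, and transfer (P/D disaggregation) is very active (#36923, #36869, #34328). This is the foundation for ultra-long context and distributed inference, and a current R&D focus.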

🛠️ Key Technical Changes

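PR #36851 (AMD sequence parallelism): Technical summary: relaxes the platform check from is_cuda() to is_cuda_alike() and registers matching patterns during compilation for the kernels dispatched through AMD AITER, making sequence parallelism, an important distributed optimization, available on AMD MI300-series GPUs. Impact: improves memory efficiency and potential performance when running models with large hidden sizes on multiple AMD GPUs.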
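
The effect of relaxing the gate from is_cuda() to is_cuda_alike() can be shown with a stand-in platform object. `FakePlatform` and the two `sequence_parallel_allowed_*` helpers are hypothetical names for this sketch, and it assumes that "CUDA-alike" means CUDA or ROCm; it does not reproduce the PR's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakePlatform:
    """Stand-in for a platform object; only the two predicates matter here."""
    name: str  # "cuda", "rocm", or anything else

    def is_cuda(self) -> bool:
        return self.name == "cuda"

    def is_cuda_alike(self) -> bool:
        # Assumption: "CUDA-alike" covers both CUDA and ROCm.
        return self.name in ("cuda", "rocm")


def sequence_parallel_allowed_old(platform) -> bool:
    """Gate before the change: CUDA only."""
    return platform.is_cuda()


def sequence_parallel_allowed_new(platform) -> bool:
    """Gate after the change: CUDA and ROCm."""
    return platform.is_cuda_alike()


rocm = FakePlatform("rocm")
before = sequence_parallel_allowed_old(rocm)
after = sequence_parallel_allowed_new(rocm)
```

Under this sketch a ROCm platform is rejected by the old gate and accepted by the new one, while a non-GPU platform is rejected by both.
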
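PR #36855 (AITER sparse MLA crash fix): Technical summary: adds the head repeat/de-repeat logic that was missing from the sparse MLA path; the non-sparse path already had it to handle a small number of heads per GPU. Impact: sparse-attention models such as GLM-5 can run stably on AMD at higher tensor-parallel degrees.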
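
One way to picture a head repeat/de-repeat step is to tile the head axis up to the kernel's minimum before the call and keep one copy of each head afterwards. The sketch below is a guess at the general shape only: the minimum of 16 comes from the PR title, but the repeat layout, the divisibility requirement, and the function names are assumptions, not the vLLM implementation.

```python
MIN_HEADS = 16  # Assumed kernel minimum, from "num_heads < 16" in the PR title.


def repeat_heads(q, min_heads=MIN_HEADS):
    """Repeat each per-head vector in q until the head count reaches min_heads."""
    n = len(q)
    if n >= min_heads:
        return q, 1
    if min_heads % n != 0:
        raise ValueError("num_heads must divide the kernel minimum")
    factor = min_heads // n
    # Layout: h0, h0, ..., h1, h1, ... (each head repeated `factor` times).
    return [head for head in q for _ in range(factor)], factor


def unrepeat_heads(out, factor):
    """Keep one copy of each head from the kernel output."""
    return out[::factor]


q = [[float(h)] for h in range(8)]        # 8 heads per GPU, head_dim 1
padded, factor = repeat_heads(q)          # 16 heads go to the kernel
kernel_out = padded                       # identity stands in for the kernel
result = unrepeat_heads(kernel_out, factor)
```

Since attention is computed independently per head, duplicated heads produce duplicated outputs, and dropping the extra copies recovers the result for the original heads.
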
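PR #36927 (ROCm FP8 x FP8 attention): Technical summary: folds the Q, K, and V scaling factors into the attention computation's softmax_scale and output_scale, working around the current AITER unified attention kernel's limited support for a Q scaling factor. Impact: enables more efficient fully FP8-quantized attention on AMD, which helps reduce bandwidth and compute overhead.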
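
Why the scales can be folded in follows from linearity: if Q = s_q·q̂, K = s_k·k̂ and V = s_v·v̂ with per-tensor scalars, then softmax(c·QKᵀ)·V equals s_v·softmax((c·s_q·s_k)·q̂k̂ᵀ)·v̂. The pure-Python check below illustrates that identity for a single query. It assumes per-tensor scalar scales and ignores FP8 rounding, and the `attention` helper is written for this sketch rather than taken from AITER or vLLM.

```python
import math


def attention(q, keys, values, softmax_scale, output_scale=1.0):
    """Single-query softmax attention over plain Python lists."""
    logits = [softmax_scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        output_scale * sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


s_q, s_k, s_v = 0.5, 0.25, 2.0            # hypothetical per-tensor scales
q_hat = [1.0, -2.0]                        # stand-ins for quantized values
k_hat = [[3.0, 1.0], [0.0, 4.0]]
v_hat = [[1.0, 2.0], [3.0, -1.0]]
base_scale = 1.0 / math.sqrt(len(q_hat))

# Reference: dequantize first, then run ordinary attention.
q = [s_q * x for x in q_hat]
k = [[s_k * x for x in row] for row in k_hat]
v = [[s_v * x for x in row] for row in v_hat]
reference = attention(q, k, v, base_scale)

# Folded: run on the raw values, with the scales absorbed into the two knobs.
folded = attention(q_hat, k_hat, v_hat, base_scale * s_q * s_k, output_scale=s_v)
```

The two results agree up to floating-point error, which is why a kernel that exposes only a softmax scale and an output scale can still consume tensors that carry their own quantization scales.
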
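PR #36596 (Qwen GDN packed recurrent decode fast path): Technical summary: for Qwen3.5 GDN layers in the non-speculative, uniform decode (T=1) case, adds a fast path that operates directly on the packed mixed_qkv data, avoiding reconstruction of intermediate tensors and redundant memory reads and writes. Impact: the contributor reports roughly 6% higher throughput and lower latency on specific benchmarks, a good example of fine-grained optimization for a popular model architecture.
PR #36818 (ColPali model support): Technical summary: fully integrates ColPali (a ColBERT-style multimodal retrieval model built on PaliGemma) into vLLM, including the model definition, config registration, weight-loading adaptation, and multimodal processor support. Impact: extends vLLM's lineup of pooling models so it can serve multimodal document retrieval and reranking directly.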

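📈 Development Activity Observations

Contributor activity: AMD-related contributors (such as ChuanLi1101, divakar-amd, and AndreasKaratzas) were very active and submitted multiple feature and fix PRs, which suggests the AMD team is investing in integrating with the vLLM mainline.
Code review: When reviewing AMD-related PRs, core maintainers look beyond functional correctness and emphasize performance benchmarks, compatibility with the long-term technical direction (vLLM IR), and avoiding code duplication, reflecting a focus on long-term maintainability.
Issue handling: The community closes stale issues quickly and responds promptly to new, complex production problems (such as #36921), though resolving these usually requires deeper collaboration and diagnosis.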

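💡 Issues Worth Watching

#36890 (abnormal 192GB VRAM allocation on ROCm): This may prevent AMD users from launching even small models. It needs prompt investigation to determine whether the bug is in model loading, device memory queries, or the resource allocation algorithm.
#36921 (V1 engine crash under burst load): This involves a large model and core engine stability. If widespread, it could undermine confidence in the V1 engine for production use, so it should be prioritized.
#36872 (speculative decoding degrades output quality): If speculative decoding makes output unusable, the speedup is pointless. This issue bears directly on the practicality and reliability of the feature.
The decision on PR #36851: Whether to merge AMD sequence parallelism support now requires the maintainers to weigh it against the overall technical roadmap, and the decision itself is worth following.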

📋 Appendix: Detailed Data Lists

New Issues

#36856 [Usage]: 'LLMEngine' object has no attribute 'collective_rpc' — usage — by 870223666 (created: 2026-03-12 14:31 (UTC+8))
#36948 [RFC] [Feature]: PD Disaggregation with MooncakeConnector Roadmap — feature request — by dtcccc (created: 2026-03-13 11:18 (UTC+8))
#36943 [Bug/Regression]: CPU spikes to 100% on Multi-node (PP=2, TP=8) | Regression starting from v0.12.0 — performance — by randomkrml (created: 2026-03-13 08:07 (UTC+8))
#36932 [Bug]: DP>1 doesn't work with weight syncing — bug — by qgallouedec (created: 2026-03-13 05:59 (UTC+8))
#36853 [Bug]: Qwen3-Coder-Next-FP8 start error wifh tp=8 on H100 — bug — by ltm920716 (created: 2026-03-12 13:55 (UTC+8))
#36921 [Bug]: V1 engine hangs then crashes with "No available shared memory broadcast block found / RPC call to sample_tokens timed out" under chat completions burst load on Qwen3.5-122B-A10B — bug — by natech-persona (created: 2026-03-13 03:47 (UTC+8))
#36870 [Bug]: vllm start Qwen3.5-35B-A3B fail — bug — by lcedaw (created: 2026-03-12 18:36 (UTC+8))
#36942 [Bug]: Race condition in AsyncPoolingOutput.get_output() — pooler_output_cpu accessed before copy_event.synchronize() — bug — by dharaghodasara (created: 2026-03-13 07:28 (UTC+8))
#36926 [Bug]: nemotron_h does not work with DeepEP all2all backends due to hidden dim rounding — bug — by bnellnm (created: 2026-03-13 05:09 (UTC+8))
#36923 [RFC]: [KV Connector]: Support KV push from Prefill to Decode node using Nixl Connector — RFC — by snadampal (created: 2026-03-13 03:49 (UTC+8))
#36922 [Performance]: Flashinfer TRTLLM MoE for Qwen3.5 — performance — by Linda-Stadter (created: 2026-03-13 03:48 (UTC+8))
#36917 [Bug]: GDN attention backend crashes with mixed decode/spec_decode batch when serving Qwen3.5 family models with MTP — bug — by lulmer (created: 2026-03-13 03:14 (UTC+8))
#36916 [Feature]: W6A16 Support — feature request — by frenzybiscuit (created: 2026-03-13 03:10 (UTC+8))
#36907 [Bug]: resolve_chat_template_kwargs silently drops chat template kwargs when chat_template is passed in as chat template name instead of Jinja — bug — by xiangxl-a (created: 2026-03-13 01:32 (UTC+8))
#36906 [Bug]: EAGLE3 speculative decoding + multimodal crash under high concurrency — bug — by staghado (created: 2026-03-13 01:30 (UTC+8))
#36898 [Feature]: Publish CPU images compatible with GitHub Actions — feature request — by nathan-weinberg (created: 2026-03-12 23:57 (UTC+8))
#36890 [Bug]: ROCm: tries to allocate 192GB VRAM for Qwen3.5 0.8B — bug,rocm — by lywing-god (created: 2026-03-12 22:30 (UTC+8))
#36886 [Bug]: vllm depends on xgrammar version 0.1.29 exactly which is vulnerable to CVE-2026-25048 — bug — by benglewis (created: 2026-03-12 21:35 (UTC+8))
#36880 [Bug]: Qwen1 use_logn_attn may be unsupported in vLLM — bug — by Qi-Zhan (created: 2026-03-12 20:45 (UTC+8))
#36872 [Bug]: Gibberish output and collapsing generation throughput with Qwen3.5-35B-A3B-FP8 and speculative decoding enabled — bug — by marionchadal (created: 2026-03-12 19:11 (UTC+8))
#36865 [Bug]: SM120 / RTX 5090 source build still registers unsupported FlashMLA / FA targets and uses non-SM120 Marlin defaults. — bug — by CristyNel (created: 2026-03-12 17:15 (UTC+8))
#36861 [Bug]: Why does setting --pipeline-parallel-size > 1 result in an OOM error, but --tensor-parallel-size> 1 does not? — bug — by VirgilG72 (created: 2026-03-12 15:41 (UTC+8))
#36849 [Bug]: GPT-OSS-20B /v1/chat/completions streaming crashes with tool_call_parser=openai (IndexError in chat_completion_stream_generator) — bug — by Priananda620 (created: 2026-03-12 12:56 (UTC+8))
#36852 [Bug]: GPU failure during repeated model loading when using --enable-prefix-caching with KV transfer (LMCacheConnectorV1) — bug — by lavanyabollepalli (created: 2026-03-12 13:42 (UTC+8))
#36857 [Bug]: The arguments invoked by the tool in the GLM-5 streaming output cannot be parsed into the JSON format. — bug — by lililcode9527 (created: 2026-03-12 14:46 (UTC+8))
#36843 [Bug]: VLLM 0.17.1 initial mtp with FLASH_ATTN randomly crash — bug,rocm — by flutist (created: 2026-03-12 11:48 (UTC+8))

Closed Issues

#19491 [Bug]: vLLM outputs are not reproducible — bug,torch.compile,stale — by FelixSchneiderZoom (closed: 2026-03-13 10:19 (UTC+8))
#23307 [Installation]: GH200 with cuda 12.6 — installation,stale — by HelloWorldLTY (closed: 2026-03-13 10:18 (UTC+8))
#25262 [Bug]: Mismatch with prompt logprobs with the same prompt — bug,stale — by SestoAle (closed: 2026-03-13 10:18 (UTC+8))
#26712 [Bug]: W8A16-FP8_Block Quant from llm_Compressor Fails to load on Blackwell SM12.0 — bug,stale — by phaelon74 (closed: 2026-03-13 10:17 (UTC+8))
#26899 [Bug]: qwen3-235b-a22b-instruct-2507 fails on multi-node deployment — bug,stale — by tensorsofthewall (closed: 2026-03-13 10:17 (UTC+8))
#27092 [Feature]: need args to pass set_process_title function — feature request,stale — by XaviereKU (closed: 2026-03-13 10:17 (UTC+8))
#28268 [Bug]: Llama4 accuracy issue when torch.compile enabled on MI300x — bug,rocm,torch.compile,stale — by sarckk (closed: 2026-03-13 10:17 (UTC+8))
#28560 [Feature]: Allow custom prompts for Random-MM with vllm bench serve — feature request,stale — by gerrylwk (closed: 2026-03-13 10:16 (UTC+8))
#28564 [Usage]: Can't get ModernBert models to run in vllm serve — usage,stale — by Logikschleifen (closed: 2026-03-13 10:16 (UTC+8))
#28565 [Installation]: UV_HTTP_TIMEOUT cannot install vllm with uv — installation,stale — by husainnk (closed: 2026-03-13 10:16 (UTC+8))
#28566 [Usage]: pd disagg scenario , I discover in the decoder , also has the prefill operation, is it normal ? — usage,stale — by yangshanjun (closed: 2026-03-13 10:16 (UTC+8))
#36932 [Bug]: DP>1 doesn't work with weight syncing — bug — by qgallouedec (closed: 2026-03-13 09:56 (UTC+8))
#36870 [Bug]: vllm start Qwen3.5-35B-A3B fail — bug — by lcedaw (closed: 2026-03-13 08:44 (UTC+8))
#31404 [Bug]: MM cache AssertionError crashes engine, causes request aborts — bug — by shirejoni (closed: 2026-03-13 07:37 (UTC+8))
#30620 [Feature]: Remove Chunking From FusedMoE — help wanted,good first issue,feature request — by robertgshaw2-redhat (closed: 2026-03-13 02:24 (UTC+8))
#24094 [Bug]: Running Jamba FP8 crashes with cutlass_moe_mm — bug,stale — by Josephasafg (closed: 2026-03-13 01:59 (UTC+8))
#34496 [Bug]: Responses API crashes with KeyError when reasoning input item has no content field — bug — by jeonsworld (closed: 2026-03-13 00:13 (UTC+8))
#36835 [Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries — bug — by LironKesem (closed: 2026-03-13 00:04 (UTC+8))
#36794 [Feature]: Improve Error Message When Max Length Reached With Tool Call=Required — feature request — by SvenLorenz (closed: 2026-03-12 18:28 (UTC+8))
#36828 [Bug]: vLLm 0.17.1 (docker) crash with Qwen 3.5 27B-FP8 in BatchPrefillWithPagedKVCache — bug — by matatonic (closed: 2026-03-12 18:15 (UTC+8))
#36826 [Bug]: Streaming stalls and can crash when concurrent requests hit the same vLLM server — bug — by imi187 (closed: 2026-03-12 18:02 (UTC+8))
#36598 [Bug]: Triton autotuner OOM on Qwen3.5/Qwen3-Next GDN layers (non-SM90 GPUs) — bug — by AuYang261 (closed: 2026-03-12 16:34 (UTC+8))
#36669 [Bug]: DeepSeek-OCR v1 crashes with TensorSchema mismatch when images_crop is empty (small images ≤640px) — torch.compile — by ketyi (closed: 2026-03-12 15:35 (UTC+8))
#36604 [Performance]: What is the performance impact of upgrading from HTTP/1.1 to HTTP/2 or QUIC? — performance — by sl99897 (closed: 2026-03-12 12:05 (UTC+8))

New PRs

#36949 [ROCm][CI] Optimize ROCm Docker build: registry cache, DeepEP, and ci-bake script — rocm,ci/build — by AndreasKaratzas (created: 2026-03-13 11:19 (UTC+8))
#36914 Fix/triton quant query input — v1 — by ishanrev (created: 2026-03-13 02:50 (UTC+8))
#36945 [CI] Split V1 e2e + engine (1 GPU) into parallel jobs — ready,ci/build — by khluu (created: 2026-03-13 09:12 (UTC+8))
#36910 [WIP] [BugFix] Forward spec_step_idx in MTP wrappers and eagle proposer/speculator — bug,speculative-decoding,v1,qwen — by haosdent (created: 2026-03-13 02:06 (UTC+8))
#36947 [Bug]: OpenAI Embeddings API crashes on invalid token_ids input — bug,frontend — by dharaghodasara (created: 2026-03-13 10:33 (UTC+8))
#36946 [P/D] Mooncake: Add unit tests and minor fixes for mooncake connector — v1,kv-connector — by dtcccc (created: 2026-03-13 10:25 (UTC+8))
#36930 [Model Runner V2] Spec decode rejection sampler greedy + logprobs support — v1 — by TheEpicDolphin (created: 2026-03-13 05:52 (UTC+8))
#36940 [BUG] Fix rank calculation in NCCLWeightTransferEngine — bug,ready — by hao-aaron (created: 2026-03-13 07:09 (UTC+8))
#36851 [ROCm] Enable Sequence Parallelism for AMD GPUs (MI300X/MI325X/MI355X) — rocm — by ChuanLi1101 (created: 2026-03-12 13:19 (UTC+8))
#36887 [Model] Add ColQwen3.5 4.5B support — documentation,new-model,multi-modality,qwen — by athrael-soju (created: 2026-03-12 21:50 (UTC+8))
#36928 [LoRA][BugFix] Fix skipped LoRA adapters for Mistral3 — bug,ready — by WoosukKwon (created: 2026-03-13 05:15 (UTC+8))
#36938 build: update smg-grpc-servicer to use vllm extra — ready,ci/build — by slin1237 (created: 2026-03-13 06:28 (UTC+8))
#36929 [Model Runner V2] Some code simplification — ready,v1 — by njhill (created: 2026-03-13 05:27 (UTC+8))
#36944 [Bugfix] Streaming finish_reason chunk should have empty delta content — bug,frontend — by chenwenxiaolive (created: 2026-03-13 09:06 (UTC+8))
#36925 [Bugfix] signature match for passing spec_step_idx in qwen3-next and qwen3.5 — bug,qwen — by JGSweets (created: 2026-03-13 05:04 (UTC+8))
#36920 Fix Qwen3.5 rope validation TypeError: list | set (#36920) — ready,qwen,meta-exported,fb-exported — by diego-urgell (created: 2026-03-13 03:42 (UTC+8))
#36937 fix: resolve chat template names before kwargs detection — no labels — by giulio-leone (created: 2026-03-13 06:27 (UTC+8))
#36939 [BugFix][Core] zero out null kv block after cuda graph capture — bug,ready,v1,nvidia — by walterbm (created: 2026-03-13 06:30 (UTC+8))
#36941 [Bugfix][Core]: Fix silent output_handler death on CancelledError — bug,v1 — by eugenepaniot (created: 2026-03-13 07:12 (UTC+8))
#36936 fix: Resolve template name to Jinja content in resolve_chat_template to prevent dropped kwargs — no labels — by siddharthmohan619-eng (created: 2026-03-13 06:22 (UTC+8))
#36855 [ROCm] Fix AITER sparse MLA crash for num_heads < 16 (e.g. GLM-5 TP=8) — rocm,v1 — by ChuanLi1101 (created: 2026-03-12 14:14 (UTC+8))
#36933 [BugFix] Handle pre-sharded TP MoE expert weights in Grok loader — bug — by dangoldbj (created: 2026-03-13 06:13 (UTC+8))
#36935 creating a pr for crusoe integration for vllm — documentation — by acheamponge (created: 2026-03-13 06:18 (UTC+8))
#36934 [BugFix] KeyError on scope["method"] for realtime api websocket in Authen… — bug,frontend — by daniebrill (created: 2026-03-13 06:16 (UTC+8))
#36931 [Feat][Bugfix] Enable additional dimension for Flashinfer MLA and fix routing dtype — bug,v1,deepseek,nvidia — by dbari (created: 2026-03-13 05:56 (UTC+8))
#36927 [ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn — rocm,v1 — by divakar-amd (created: 2026-03-13 05:10 (UTC+8))
#36924 [Hardware][TPU] Add executors_supports_async_scheduling() method to Platform so that it can be extended for different Platforms — no labels — by gxd3 (created: 2026-03-13 04:54 (UTC+8))
#36912 Test Smoll in v0.16.0 — documentation,performance,new-model,rocm,frontend,speculative-decoding,needs-rebase,ci/build,v1,deepseek — by debroy-rh (created: 2026-03-13 02:38 (UTC+8))
#36889 [WIP][Bugfix] Fix CUDA illegal memory access for GPTQ Marlin MoE with CUDA graphs — bug,nvidia — by haosdent (created: 2026-03-12 22:22 (UTC+8))
#36919 [Refactor] Relocate chat completion and anthropic tests — ci/build — by sfeng33 (created: 2026-03-13 03:18 (UTC+8))
#36918 [Bugfix][Core] Fix gdn kernel mixed batch spec decode crash — bug,v1 — by lulmer (created: 2026-03-13 03:15 (UTC+8))
#36915 [Refactor] Consolidate GPT-OSS reasoning parser tests — structured-output,v1,gpt-oss — by sfeng33 (created: 2026-03-13 02:58 (UTC+8))
#36893 [WIP] int8 kv cache per-token per-head scale continue of #34327 thanks EricccYang — documentation,v1 — by JartX (created: 2026-03-12 22:46 (UTC+8))
#36913 Fix/triton quant query input — v1 — by ishanrev (created: 2026-03-13 02:42 (UTC+8))
#36909 Revise environment setup in AGENTS.md — documentation,ready — by mgoin (created: 2026-03-13 01:52 (UTC+8))
#36891 [Tool Parser] Kimi K2: guided decoding for tool_choice="auto" — 75% → 100% schema accuracy — no labels — by ZhanqiuHu (created: 2026-03-12 22:32 (UTC+8))
#36911 Add Medusa Spec Decode e2e Test — v1 — by ojhaanshika (created: 2026-03-13 02:07 (UTC+8))
#36908 Fix/triton quant query input — v1 — by ishanrev (created: 2026-03-13 01:41 (UTC+8))
#36895 [Kernel][Helion][1/N] Add Helion kernel for rms_norm_per_block_quant — no labels — by xiaohongchen1991 (created: 2026-03-12 23:23 (UTC+8))
#36902 [Kernel][Helion][1/N] Add Helion kernel for per_token_group_fp8_quant — no labels — by xiaohongchen1991 (created: 2026-03-13 00:53 (UTC+8))
#36899 [Perf][Utility] Update RMSNorm benchmark script — performance — by benchislett (created: 2026-03-13 00:16 (UTC+8))
#36903 [Misc] Clean up Kimi-audio whisper encoder loading — no labels — by Isotr0py (created: 2026-03-13 01:03 (UTC+8))
#36896 [bnb] Skip moe + bnb test — ready — by SunMarc (created: 2026-03-12 23:36 (UTC+8))
#36905 [WIP][BugFix] Add missing shutdown() to LMCacheConnectorV1 to fix GPU failure on repeated model loading — bug,kv-connector — by haosdent (created: 2026-03-13 01:24 (UTC+8))
#36904 [WIP][BugFix] Fix PP OOM for Qwen3Next/Qwen3_5 by guarding embed_tokens and lm_head — bug,qwen — by haosdent (created: 2026-03-13 01:24 (UTC+8))
#36877 Add AGENTS.md — documentation,ready — by hmellor (created: 2026-03-12 20:20 (UTC+8))
#36900 [Bugfix] Fix ModuleNotFoundError when flash-attn-4 is installed alongside vLLM — bug — by manueldeprada (created: 2026-03-13 00:23 (UTC+8))
#36901 [CI] Add AVX2-only CPU image build for GitHub Actions compatibility — ci/build — by nathan-weinberg (created: 2026-03-13 00:35 (UTC+8))
#36866 [Bugfix] Fix tool call streaming JSON separator mismatch — bug,frontend — by xr843 (created: 2026-03-12 17:27 (UTC+8))
#36876 [Bugfix] Fix FlashInfer GDN warmup ValueError on SM90 GPUs — bug,ready,qwen — by tdoublep (created: 2026-03-12 20:09 (UTC+8))
#36897 [Bugfix] fix main branch pre-commit error (1 line change) — bug — by SoluMilken (created: 2026-03-12 23:46 (UTC+8))
#36892 Unify FlashInfer utility modules — nvidia — by aalhadxx (created: 2026-03-12 22:41 (UTC+8))
#36845 [ROCm] Fix KV copy methods and auto-select attention backend for ROCm — rocm,ready,v1,kv-connector — by AndreasKaratzas (created: 2026-03-12 12:29 (UTC+8))
#36846 [ROCm] Validate block_size for explicitly selected attention backends — rocm,ready — by AndreasKaratzas (created: 2026-03-12 12:33 (UTC+8))
#36882 [CI] Fix mypy pre-commit errors on main — ready — by tdoublep (created: 2026-03-12 20:56 (UTC+8))
#36862 [Misc] Set default kv_buffer_device in a better way — ready,kv-connector — by hmellor (created: 2026-03-12 16:25 (UTC+8))
#36894 Enhanced paged_attention_v1 function — no labels — by do420 (created: 2026-03-12 23:09 (UTC+8))
#36875 [Fix] Fix --headless mode with openai.api_server — frontend — by emricksini-h (created: 2026-03-12 19:42 (UTC+8))
#36888 fix(serving): add bounds check for prev_tool_call_arr in streaming tool calls — frontend — by gambletan (created: 2026-03-12 22:15 (UTC+8))
#36884 Fix mypy on main — no labels — by hmellor (created: 2026-03-12 21:05 (UTC+8))
#36883 [Frontend] Improve error message when tool_choice=required hits max_tokens — frontend,needs-rebase,tool-calling — by chenwenxiaolive (created: 2026-03-12 21:03 (UTC+8))
#36885 [Model] Add ColQwen3.5 4.5B support — documentation,new-model,rocm,needs-rebase,ci/build,v1,multi-modality,qwen — by athrael-soju (created: 2026-03-12 21:33 (UTC+8))
#36881 set mmencodercache to the max_seq_len for all 3 modalities — no labels — by netanel-haber (created: 2026-03-12 20:53 (UTC+8))
#36878 Move test dependencies from inline installs to Docker image — ci/build — by avinashsingh77 (created: 2026-03-12 20:31 (UTC+8))
#36879 [Bugfix] Fix mamba align mode sync scheduling correctness — bug,v1 — by tdoublep (created: 2026-03-12 20:32 (UTC+8))
#36874 [CI] Fix mypy error with json field name — no labels — by hickeyma (created: 2026-03-12 19:40 (UTC+8))
#36859 [BugFix] Scheduler: Only set num_external_computed_tokens once — bug,v1 — by orozery (created: 2026-03-12 15:03 (UTC+8))
#36844 [Core] Guard mamba prefill split fragmentation — performance,needs-rebase,ci/build,v1,multi-modality,qwen,kv-connector,nvidia — by yunseoLee0343 (created: 2026-03-12 12:21 (UTC+8))
#36873 [ROCm] Fix build issues with cub:: namespace and missing headers — rocm — by xueliangyang-oeuler (created: 2026-03-12 19:20 (UTC+8))
#36871 [Qwen3] Fix truncated reasoning extraction in parser — qwen — by xueliangyang-oeuler (created: 2026-03-12 18:57 (UTC+8))
#36868 [Core] Skip np.repeat in _prepare_inputs when all requests are decodes — v1 — by Aayushgoyal00 (created: 2026-03-12 18:23 (UTC+8))
#36869 [KVTransfer][Mooncake] Add heterogeneous TP support for disaggregated P/D in MooncakeConnector — kv-connector — by JianDan0212 (created: 2026-03-12 18:26 (UTC+8))
#36867 Support non-contiguous KV cache in TRTLLM fp8 dequant kernel — v1,nvidia — by vadiklyutiy (created: 2026-03-12 18:06 (UTC+8))
#36860 [Perf] recreate torch profiler wrapper for each new profiling session to avoid slow starts — v1 — by zhangruoxu (created: 2026-03-12 15:12 (UTC+8))
#36864 [Model] Add OmniASR (Meta Omnilingual ASR) model support — no labels — by ziywang50 (created: 2026-03-12 16:44 (UTC+8))
#36863 docs: add Simplified Chinese README (README_zh-CN.md) — documentation — by jnMetaCode (created: 2026-03-12 16:35 (UTC+8))
#36858 Support Flashinfer rope+quant+cache update fusion kernel for TRTLLM attention — rocm,v1,nvidia — by elvischenv (created: 2026-03-12 14:52 (UTC+8))
#36850 [Bugfix] Support Qwen3.5 text-only configs — bug,new-model,qwen — by WineChord (created: 2026-03-12 13:18 (UTC+8))
#36854 [Bugfix] Clear error message for FP8 torchao quantization on unsupported GPUs — bug — by haosdent (created: 2026-03-12 14:10 (UTC+8))
#36848 fix: enable producer and consumer in ECConnectorBase for ec_both role — no labels — by benyebai (created: 2026-03-12 12:53 (UTC+8))
#36847 [Feat][Spec Decode] DFlash — new-model,speculative-decoding,v1,qwen — by benchislett (created: 2026-03-12 12:38 (UTC+8))
#36842 Add sparse hot KV cache migration (CPU full KV + GPU hot KV) — v1 — by shuohong0208 (created: 2026-03-12 11:41 (UTC+8))

Merged PRs

#36385 [Model] Add support for BERT-like Chinese ERNIE pooling models — documentation,new-model,ready — by whyiug (merged: 2026-03-13 11:23 (UTC+8))
#36818 [Model] Add ColPali late interaction model for multi-modal retrieval — documentation,new-model,ready,multi-modality — by Kaonael (merged: 2026-03-13 10:18 (UTC+8))
#36940 [BUG] Fix rank calculation in NCCLWeightTransferEngine — bug,ready — by hao-aaron (merged: 2026-03-13 09:56 (UTC+8))
#34991 [Bugfix] ep_scatter kernel store-load race condition — bug,ready — by ivanium (merged: 2026-03-13 09:07 (UTC+8))
#36938 build: update smg-grpc-servicer to use vllm extra — ready,ci/build — by slin1237 (merged: 2026-03-13 09:31 (UTC+8))
#36929 [Model Runner V2] Some code simplification — ready,v1 — by njhill (merged: 2026-03-13 08:59 (UTC+8))
#33595 [MoE] Add routing simulation override for MXFP4 quantized MoE — ready — by jaewonlee-fb (merged: 2026-03-13 08:30 (UTC+8))
#34086 [Feature]: Remove Chunking From FusedMoE — documentation,rocm,ready,gpt-oss,nvidia — by SouthWest7 (merged: 2026-03-13 02:24 (UTC+8))
#36545 [Speculative Decoding] Add norm_before_fc for gpt-oss draft models — speculative-decoding,ready,llama,gpt-oss — by shubhra (merged: 2026-03-13 07:03 (UTC+8))
#36086 [AMD][Build] Add DeepEP to ROCm Dockerfile — rocm,ready,ci/build — by rjrock (merged: 2026-03-13 05:33 (UTC+8))
#36210 [ROCm][CI] Preparing gfx90a mirroring — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-13 04:42 (UTC+8))
#36909 Revise environment setup in AGENTS.md — documentation,ready — by mgoin (merged: 2026-03-13 03:21 (UTC+8))
#36768 Update Flashinfer to 0.6.6 — ready,ci/build,nvidia — by dbari (merged: 2026-03-13 01:19 (UTC+8))
#36896 [bnb] Skip moe + bnb test — ready — by SunMarc (merged: 2026-03-13 02:03 (UTC+8))
#36877 Add AGENTS.md — documentation,ready — by hmellor (merged: 2026-03-13 01:20 (UTC+8))
#36897 [Bugfix] fix main branch pre-commit error (1 line change) — bug — by SoluMilken (merged: 2026-03-13 00:08 (UTC+8))
#34499 [Bugfix] Fix KeyError in parse_response_input for reasoning items with optional content — bug,frontend,ready,gpt-oss — by jeonsworld (merged: 2026-03-13 00:13 (UTC+8))
#34732 [Attention] Use FA4 for MLA prefill — documentation,performance,ready,ci/build,v1,nvidia — by MatthewBonanni (merged: 2026-03-13 00:10 (UTC+8))
#34597 [Kernel] Add FP8 KV cache support to Triton MLA decode attention — documentation,ready,v1 — by grimulkan (merged: 2026-03-12 23:32 (UTC+8))
#36882 [CI] Fix mypy pre-commit errors on main — ready — by tdoublep (merged: 2026-03-12 23:22 (UTC+8))
#36145 [Hardware] Replace torch.cuda.device_count/current_device/set_device API — documentation,performance,speculative-decoding,ready,v1,multi-modality,kv-connector,nvidia,ready-run-all-tests — by jikunshang (merged: 2026-03-12 22:57 (UTC+8))
#36307 [Perf] Add TRTLLM FP8 MoE Modular Kernel — ready,nvidia — by wzhao18 (merged: 2026-03-12 22:32 (UTC+8))
#36822 [BugFix] Fix multiple/duplicate stdout prefixes — bug,frontend,ready,v1 — by njhill (merged: 2026-03-12 12:23 (UTC+8))
#35742 [CI] Fix mypy for vllm/reasoning — documentation,structured-output,ready,v1,qwen,deepseek,gpt-oss — by hickeyma (merged: 2026-03-12 20:21 (UTC+8))
#30947 [docs] Add lightweight AI assisted contribution policy — documentation,ready — by markmc (merged: 2026-03-12 19:46 (UTC+8))
#36596 [Perf] add packed recurrent fast path for decode — ready,qwen — by caozuoba (merged: 2026-03-12 19:01 (UTC+8))
#36536 [Frontend] Split OpenAIServingModels into OpenAIModelRegistry + OpenAIServingModels — frontend,ready — by sagearc (merged: 2026-03-12 18:29 (UTC+8))
#36841 [Bugfix] Fix crash when tool_choice=required exceeds max_tokens — bug,frontend,ready,tool-calling — by chaunceyjiang (merged: 2026-03-12 18:28 (UTC+8))
#36605 [MM][OOT] Support CPU seq_lens for OOT MMEncoderAttention kernels — ready,qwen — by shen-shanshan (merged: 2026-03-12 18:28 (UTC+8))
#36806 [UX] Only show FP4 Marlin fallback warning for w4a4 models — ready — by mgoin (merged: 2026-03-12 17:19 (UTC+8))
#36144 replace with torch.cuda.device with with torch.accelerator.device_index — performance,rocm,ready,kv-connector,nvidia,ready-run-all-tests — by yma11 (merged: 2026-03-12 14:12 (UTC+8))
#36599 [Bugfix] Warm up Triton autotuner for GDN layers during V1 profiling — bug,ready,qwen — by AuYang261 (merged: 2026-03-12 15:51 (UTC+8))
#34328 [KV Connector] Support using FlexKV as KV Cache Offloading option. — documentation,ready,v1,kv-connector,nvidia — by feiqiangs (merged: 2026-03-12 15:46 (UTC+8))
#36670 [Bugfix][Model] Fix DeepSeek-OCR TensorSchema crash on empty images_crop — bug,ready,multi-modality,deepseek — by ketyi (merged: 2026-03-12 15:35 (UTC+8))
#29947 [Frontend] OpenAI Responses API supports Tool/Function calling with streaming — frontend,ready,v1 — by chaunceyjiang (merged: 2026-03-12 15:03 (UTC+8))
#36813 [Tests] Skip model weight download for render-only test server — ready — by sagearc (merged: 2026-03-12 14:24 (UTC+8))
#36677 [Kernel][Helion][13/N] Force static_shapes=False in helion register — ready — by gmagogsfm (merged: 2026-03-12 13:35 (UTC+8))
#36761 [CI Failure] Fix Language Models Test (Extended Pooling) daily CI Failure — ready — by noooop (merged: 2026-03-12 12:22 (UTC+8))
#36824 [Model Runner V2] Do not initialize sampler for non-last PP ranks — ready,v1 — by WoosukKwon (merged: 2026-03-12 11:55 (UTC+8))
#36586 [LMCache] Fault Tolerance Mechanism — ready,kv-connector — by Oasis-Git (merged: 2026-03-12 11:54 (UTC+8))
#35086 more models for vLLM Benchmark Suite — documentation,performance,ready,ci/build,cpu — by louie-tsai (merged: 2026-03-12 11:36 (UTC+8))

PRs Closed Without Merging

#36910 [WIP] [BugFix] Forward spec_step_idx in MTP wrappers and eagle proposer/speculator — bug,speculative-decoding,v1,qwen — by haosdent (closed: 2026-03-13 02:21 (UTC+8))
#23010 [FlashInfer] Truncate block tables for sliding window attention — ready,needs-rebase,stale,v1,nvidia — by WoosukKwon (closed: 2026-03-13 10:18 (UTC+8))
#23336 [ROCm][FEAT] Integrate AITER CustomAllreduce in cuda communicator. — rocm,needs-rebase,stale,nvidia — by vllmellm (closed: 2026-03-13 10:18 (UTC+8))
#23679 [Spec-decode] Refoctor cudagraphs for spec-decode;support uniform_alignment of cudagraph sizes. — documentation,speculative-decoding,needs-rebase,stale,v1,nvidia — by fhl2000 (closed: 2026-03-13 10:18 (UTC+8))
#25016 [UX] Only perform FlashInfer autotuning if needed by kernels — needs-rebase,stale,startup-ux,nvidia — by mgoin (closed: 2026-03-13 10:18 (UTC+8))
#25208 Whisper cudagraphs support — needs-rebase,stale,v1,nvidia — by baonudesifeizhai (closed: 2026-03-13 10:18 (UTC+8))
#25798 [Feat][EPLB] Enable Round-robin expert placement strategy while eplb is enabled. — needs-rebase,stale — by cboss6 (closed: 2026-03-13 10:18 (UTC+8))
#25876 [MISC] Refactor allreduce dispatching — needs-rebase,stale,nvidia — by ilmarkov (closed: 2026-03-13 10:18 (UTC+8))
#26013 [Hardware][AMD][Kernel] mori all2all backend integration — documentation,rocm,needs-rebase,stale,nvidia — by whitememory (closed: 2026-03-13 10:18 (UTC+8))
#26429 [CI] Release pipeline address "Build wheel - CUDA 12.6" issue — ready,needs-rebase,ci/build,stale,nvidia — by NickLucche (closed: 2026-03-13 10:17 (UTC+8))
#26484 [Misc] remove useless function — frontend,stale,v1 — by wangxiyuan (closed: 2026-03-13 10:17 (UTC+8))
#28211 Enhance Helm chart installation instructions — documentation,stale — by ccnmxns (closed: 2026-03-13 10:17 (UTC+8))
#34291 [Bugfix] Avoid double-sharding pre-sharded FusedMoE expert weights in TP loaders — bug — by dangoldbj (closed: 2026-03-13 06:13 (UTC+8))
#36484 fix: Responses API streaming tool call support for non-harmony models — frontend,needs-rebase,gpt-oss — by giulio-leone (closed: 2026-03-13 05:12 (UTC+8))
#36562 [DO NOT MERGE] AMD Infra tests — rocm,ready,needs-rebase,ci/build,v1,kv-connector — by AndreasKaratzas (closed: 2026-03-13 04:54 (UTC+8))
#36912 Test Smoll in v0.16.0 — documentation,performance,new-model,rocm,frontend,speculative-decoding,needs-rebase,ci/build,v1,deepseek — by debroy-rh (closed: 2026-03-13 02:40 (UTC+8))
#36913 Fix/triton quant query input — v1 — by ishanrev (closed: 2026-03-13 02:43 (UTC+8))
#36908 Fix/triton quant query input — v1 — by ishanrev (closed: 2026-03-13 02:34 (UTC+8))
#36894 Enhanced paged_attention_v1 function — no labels — by do420 (closed: 2026-03-12 23:11 (UTC+8))
#34394 [torch.compile] Remove duplicated split_with_sizes after RoPE — no labels — by elvischenv (closed: 2026-03-12 22:58 (UTC+8))
#36884 Fix mypy on main — no labels — by hmellor (closed: 2026-03-12 21:08 (UTC+8))
#36883 [Frontend] Improve error message when tool_choice=required hits max_tokens — frontend,needs-rebase,tool-calling — by chenwenxiaolive (closed: 2026-03-12 21:51 (UTC+8))
#36885 [Model] Add ColQwen3.5 4.5B support — documentation,new-model,rocm,needs-rebase,ci/build,v1,multi-modality,qwen — by athrael-soju (closed: 2026-03-12 21:42 (UTC+8))
#31974 Fix bug hf_token argument to LLM in Python SDK ignored in vllm.transformer_utils.config — bug — by benglewis (closed: 2026-03-12 21:37 (UTC+8))
#36879 [Bugfix] Fix mamba align mode sync scheduling correctness — bug,v1 — by tdoublep (closed: 2026-03-12 20:36 (UTC+8))
#36874 [CI] Fix mypy error with json field name — no labels — by hickeyma (closed: 2026-03-12 19:45 (UTC+8))
#36527 [Bugfix][Model] Fix Eagle3 speculative decoding for Qwen3Next-based models — bug,speculative-decoding,needs-rebase,v1,qwen — by NikitosKh (closed: 2026-03-12 18:42 (UTC+8))
#31024 Move test dependencies from inline installs to Docker image — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,needs-rebase,ci/build — by avinashsingh77 (closed: 2026-03-12 17:03 (UTC+8))
#27396 [Model] Re-support MotifForCausalLM — documentation,performance,new-model,needs-rebase,v1,nvidia — by WyldeCat (closed: 2026-03-12 14:53 (UTC+8))
#27495 [WIP] [GPT-OSS] customized symm_mem based EP comm kernel integration — frontend,stale,v1,gpt-oss — by Luosuu (closed: 2026-03-12 14:41 (UTC+8))
#36850 [Bugfix] Support Qwen3.5 text-only configs — bug,new-model,qwen — by WineChord (closed: 2026-03-12 14:40 (UTC+8))
#32206 [WIP][Spec Decode] DFlash — new-model,speculative-decoding,needs-rebase,v1,qwen — by benchislett (closed: 2026-03-12 13:00 (UTC+8))
#36733 Vllm add dflash + Optimize draft models (CUDA graph management) — new-model,speculative-decoding,v1,qwen,nvidia — by biggestCjb (closed: 2026-03-12 12:42 (UTC+8))
#36641 [torch.compile] Let Non-CUDA platforms provide default sp_min_token_num — nvidia — by realliujiaxu (closed: 2026-03-12 11:50 (UTC+8))
#36842 Add sparse hot KV cache migration (CPU full KV + GPU hot KV) — v1 — by shuohong0208 (closed: 2026-03-12 11:42 (UTC+8))