vLLM Development Activity Report - 2026-03-10

Time window: 2026-03-10 11:21 (UTC+8) ~ 2026-03-11 11:21 (UTC+8)
Statistics: 38 new issues | 28 closed issues | 115 new PRs | 34 merged PRs | 42 PRs closed without merging

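📊 Daily Development Summary

From March 10 to March 11 the vLLM community remained highly active, with 38 new issues and 115 new PRs filed. Development focused on stability problems that newer hybrid-architecture models (Mamba + Transformer) such as Qwen3.5 have exposed in production, especially crashes with speculative decoding (MTP) and under high concurrency. Support for AMD platforms also continued to improve, with performance optimization and bug fixes for the ROCm ecosystem forming a major line of work.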

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was frequent this cycle, centered on performance optimization, bug fixes, and ecosystem improvements.

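Performance optimization and kernel support:
- PR #36659: Adds tuned FP8 MoE Triton configs for the AMD Radeon AI PRO R9700 (gfx1201, RDNA4), with reported gains for Qwen3-FP8 models on that card (TTFT down ~20%, TPOT down ~24%).
- PR #35719 (merged): Enables CUDA Graph support for the sparse_mla attention backend on ROCm to improve performance.
- PR #36716: As follow-up tuning, temporarily disables the RoPE custom op on ROCm again, after it was found to cause performance regressions in some configurations (such as MI355).
- PR #36680 / #36681: Allow more than one MTP speculative token on ROCm's MLA (#36680) and sparse MLA (#36681, AITER) backends, extending feature coverage.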
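Bug fixes and stability:
- Issue #35925 (closed): Qwen3.5 output was corrupted by NaN propagation when AITER kernels were used on ROCm. PR #36709 addresses this by adding a nan_to_num_() cleanup step.
- PR #36720: Fixes worker-startup OOM on ROCm/HIP caused by inaccurate CUDA Graph memory profiling, by skipping the unreliable estimate.
- PR #36690: Fixes an error on AMD GPUs when MLA attention is used with an FP8 KV cache.
- PR #36606: Makes dtype parsing in the Quark quantization tooling more robust for W4A8 and similar data types.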
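Ecosystem and user requests:
- Issue #36703 & #36704: A user requests nightly pip wheels and Docker images for ROCm, arguing that they are a prerequisite for "first-class" support. This reflects the community's rising expectations for the maturity of the AMD ecosystem.
- PR #36711: Corrects the GPT-OSS test root path in ROCm CI.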

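Summary: AMD contributors (including users with the -amd suffix) were active, systematically addressing performance and correctness problems on ROCm with recent models (such as Qwen3.5 and DeepSeek-V3) and working toward a better developer experience.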

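💬 Hot Discussions

Issue #36613: CUDA illegal memory access with Qwen3.5 MTP under high concurrency
- Core issue: With MTP speculative decoding enabled, the Qwen3.5-397B model crashes with a CUDA illegal memory access (abbreviated "ILM" in the issue title) under highly concurrent requests.
- Viewpoints:
  - Reporters (xiaochengyige, MLKoz2): Provided detailed reproduction steps and logs, noted that everything works with MTP disabled, and suspect a bug introduced by speculative decoding.
  - Maintainer (ZJY0516): Suggested testing the main branch or a nightly build to check whether the problem is already fixed, and asked for a minimal reproduction script.
  - Community member (mykolademyanov): Commented from a system-design perspective that such problems often stem from tight coupling between model logic and the execution environment, and suggested decoupling them.
  - Other user (cjackal): Pointed out that this issue duplicates another issue and provided a simplified reproduction command.
- Point of contention: No substantive disagreement; the thread is mostly collaborative debugging.
- Current status: Unresolved. Discussion centers on how to reproduce the crash more reliably and locate the root cause.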
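Issue #36627: Qwen3.5 vs. Qwen3 performance
- Core issue: A user observed that Qwen3.5's TTFT (time to first token) is much slower than Qwen3's and questioned its performance.
- Viewpoints:
  - Asker (fangbaolei): Considers the slowdown significant, especially in TTFT.
  - Explainers (ShanningZhuang, yunseoLee0343): Pointed out that Qwen3.5 uses a hybrid architecture with Mamba/DeltaNet-style layers whose recurrent computation prevents the prefill stage from being fully parallelized as in a Transformer, so TTFT is inherently longer. Memory operations in the implementation (such as torch.zeros) may add further overhead.
  - Adviser (MLKoz2): Suggested trying different attention backends and warm-up strategies.
- Point of contention: None; the thread mainly explains the performance characteristics of hybrid architectures.
- Current status: Open. The working conclusion is that the gap stems mainly from the architectural change and is an expected trade-off.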
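
For readers unfamiliar with the two latency metrics quoted in this report, the sketch below shows how TTFT (time to first token) and TPOT (time per output token) are commonly derived from per-token timestamps. The function names and the sample timestamps are illustrative assumptions and are not taken from vLLM's benchmark code.

```python
def ttft(request_time: float, token_times: list[float]) -> float:
    """Time to first token: delay between sending the request and the first token."""
    return token_times[0] - request_time


def tpot(token_times: list[float]) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    if len(token_times) < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


# Hypothetical timestamps in seconds: request sent at t=0.0, five tokens received.
request_time = 0.0
token_times = [0.8, 0.85, 0.9, 0.95, 1.0]

first_token_latency = ttft(request_time, token_times)  # dominated by prefill
per_token_latency = tpot(token_times)                  # dominated by decode
```

Under these definitions, a slower prefill stage shows up in TTFT while leaving TPOT largely unchanged, which matches the pattern described in the thread.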
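Issue #36643: Qwen3.5 does not work with pipeline parallelism
- Core issue: Enabling pipeline parallelism (PP) for a Qwen3.5 model fails.
- Viewpoints and progress:
  - User-reported error: With both PP and MTP enabled, vLLM reports "Pipeline parallelism is not supported".
  - In-depth analysis (weiguangli-io): Identified the root cause through code review. When PP is enabled, the system checks whether the model supports PP. With MTP speculative decoding, the draft model inherits the target model's PP configuration. However, Qwen3_5MoeMTP, the draft model class created specifically for MTP, does not declare support for the PP protocol (SupportsPP), which triggers the error.
- Current status: Unresolved, but the root cause has been pinpointed, which gives a clear direction for a fix.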
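
The failure mode described in the analysis can be sketched with a structural protocol check. This is a simplified illustration, not vLLM's implementation: the protocol SupportsPPSketch, the classes TargetModel and DraftMTPModel, and the function check_pipeline_parallel are all hypothetical stand-ins.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsPPSketch(Protocol):
    """Hypothetical stand-in for a 'supports pipeline parallelism' protocol."""

    def make_empty_intermediate_tensors(self, batch_size: int) -> dict: ...


class TargetModel:
    """Declares PP support by providing the method the protocol requires."""

    def make_empty_intermediate_tensors(self, batch_size: int) -> dict:
        return {"hidden_states": [0.0] * batch_size}


class DraftMTPModel:
    """Draft model that never declared PP support (no protocol method)."""

    def forward(self, tokens: list) -> list:
        return tokens


def check_pipeline_parallel(model: object, pp_size: int) -> None:
    """Reject models that lack the protocol when more than one PP stage is requested."""
    if pp_size > 1 and not isinstance(model, SupportsPPSketch):
        raise NotImplementedError(
            f"Pipeline parallelism is not supported for {type(model).__name__}"
        )


pp_size = 2  # the draft model inherits this value from the target model's config
check_pipeline_parallel(TargetModel(), pp_size)  # passes
try:
    check_pipeline_parallel(DraftMTPModel(), pp_size)
    draft_error = None
except NotImplementedError as exc:
    draft_error = str(exc)  # the draft model is the one that trips the check
```

In this sketch the target model passes the check, and the error only appears because the draft model is validated against the same inherited PP size.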
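PR #36666 / #36628: Shutdown timeout feature causes CI failures
- Core issue: The change that added graceful server shutdown (#34730, #36270) caused persistent failures in the distributed tests (Distributed Tests (4 GPUs)).
- Discussion:
  - PR #36628 reverted the feature outright, which made the tests pass.
  - PR #36666 then resubmitted the feature with a detailed analysis, arguing that the test failures are probably related to timing or communication problems during shutdown rather than a logic error in the feature itself. The author wants to re-establish a baseline and investigate further.
  - The discussion also involved untangling the merge history of several related PRs and their CI results, which illustrates how hard root-cause tracing is in a complex project.
- Point of contention: How to balance introducing new features against keeping tests stable: revert completely, or keep the feature and fix the tests?
- Current status: The revert has been merged and the re-added version is under discussion, underscoring the caution that infrastructure changes require.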

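🔥 Hot Topics and Trends

Growing pains for the Qwen3.5 hybrid architecture: As a new model architecture, Qwen3.5 has exposed many edge-case problems since its integration into vLLM, making it the hottest topic of this cycle. The problems include MTP crashes (#36613), performance questions (#36627), missing pipeline-parallel support (#36643), a prefix-caching validation error (#36697), and Triton autotuner OOM on GDN layers (#36598), showing that full adaptation to the new architecture is still under way.
Stability and innovation in speculative decoding: Beyond the MTP problems, speculative decoding as a whole drew a lot of attention. Fixes are being proposed (for example, PR #36634 fixes an MTP launch error) while new methods are being introduced (for example, PR #36733 adds the DFlash method). This reflects the community's sustained push on techniques that speed up inference.
AMD ecosystem, filling gaps and pursuing parity: From fixing low-level bugs such as NaN and OOM, to adding performance configs for a new GPU (RDNA4), to users calling for nightly builds on par with CUDA, ROCm support is moving from "works" to "works well and fast".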

🛠️ Key Technical Changes

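PR #35219 (merged): Fixes NaN propagation in the KV cache of hybrid models
- Technical details: Hybrid models such as Qwen3.5 share KV cache blocks between layer types. Residual fp32 data left by SSM (Mamba) layers could later be reused by attention (fp8/fp16) layers, where a "multiply-by-zero mask" operation produced NaN that contaminated all subsequent requests. According to the PR title, the fix zeroes freed SSM cache blocks on the GPU.
- Impact: Resolves the degradation of model accuracy over serving time caused by KV cache contamination, which is critical for production stability.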
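
The arithmetic behind the "multiply-by-zero mask" problem can be shown in a few lines of plain Python. This is only an illustration of IEEE 754 behaviour under assumed values, not vLLM code; apply_mask, zero_block, and the sample data are hypothetical.

```python
import math


def apply_mask(values, mask):
    """Mask by multiplication: slots where mask is 0.0 are meant to drop out."""
    return [v * m for v, m in zip(values, mask)]


def zero_block(block):
    """Clear a freed block in place so that no stale values survive reuse."""
    for i in range(len(block)):
        block[i] = 0.0


mask = [1.0, 0.0, 0.0, 0.0]  # only the first slot is valid for the new request

# Reusing a block that still holds stale, non-finite data from a previous owner.
stale_block = [0.5, math.nan, math.inf, 2.0]
masked_stale = apply_mask(stale_block, mask)  # NaN * 0 and inf * 0 are both NaN

# Zeroing the block when it is freed, then writing the new request's valid slot.
zero_block(stale_block)
stale_block[0] = 0.5
masked_clean = apply_mask(stale_block, mask)  # every slot is finite
```

Multiplying by zero removes ordinary stale values but cannot remove NaN or infinity, which is why clearing the block before reuse is the safer approach.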
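PR #36595 (merged): Fixes DeepSeek-V3 errors caused by merging empty partitions
- Technical details: Fixes a logic bug in CUDA Graph compilation that wrongly merged partitions containing only empty operations into splitting-op subgraphs. The bug could cause DeepSeek-V3 and similar models to produce wrong results or CUDA Graph warnings after compilation.
- Impact: Fixes a key compilation-layer bug affecting model correctness, improving the reliability of services that use compilation optimizations.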
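PR #36169 (merged): Extracts the gRPC server implementation into a standalone package
- Technical details: Moves the gRPC server code out of the main vLLM repository into the standalone smg-grpc-servicer package, leaving only the launch entry point in the main repository. This decouples the gRPC protocol from the vLLM core.
- Impact: The gRPC protocol and server logic can iterate independently and quickly without being tied to the vLLM release cycle, improving architectural flexibility and maintainability.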

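📈 Development Activity

Contributor activity: AMD-related contributors (such as vllmellm, indivats, and tvirolai-amd) were very active, submitting multiple performance-optimization and bug-fix PRs, which shows the AMD team's continued investment in vLLM.
Code review and merge cadence: 34 PRs were merged. Critical bug fixes (such as #36595) merged quickly once core maintainers confirmed them. Feature PRs involving complex interactions or test failures (such as #36666) went through a cautious cycle of revert, resubmission, and in-depth discussion, reflecting the priority placed on stability.
Issue handling: 28 issues were closed, including some long-standing CI issues (such as #29529 and #29463) and bug reports, indicating ongoing cleanup by the community.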

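💡 Issues to Watch

Stability of Qwen3.5 MTP under high concurrency (Issue #36613): This is a serious problem for large-scale deployments. The root cause is still unclear, and it needs focused attention from the community.
Prefix caching support for hybrid-architecture models (Issue #36697, PR #36649): The prefix cache hit rate for hybrid models is currently very low, which limits performance in scenarios such as multi-turn conversation. Related optimization work (PR #36649) is in progress and is an important direction for performance gains.
Missing nightly builds for AMD platforms (Issues #36703, #36704): The request is reasonable. Providing continuous-integration artifacts on par with CUDA would be an important step in improving the AMD developer experience and the platform's appeal.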
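
The validation error quoted in the title of Issue #36697 (block_size (2096) > max_num_batched_tokens (2048)) can be illustrated with a small sketch. The exact check inside vLLM is not shown in this report, so both functions below are hypothetical. The second one mirrors the idea suggested by the title of PR #36734, which raises the token budget automatically.

```python
def validate_token_budget(block_size: int, max_num_batched_tokens: int) -> None:
    """Hypothetical check: reject a block size larger than the per-step token budget."""
    if block_size > max_num_batched_tokens:
        raise ValueError(
            f"block_size ({block_size}) > "
            f"max_num_batched_tokens ({max_num_batched_tokens})"
        )


def raise_budget_to_fit(block_size: int, max_num_batched_tokens: int) -> int:
    """Hypothetical remedy: grow the token budget until a whole block fits."""
    return max(block_size, max_num_batched_tokens)


# Values taken from the issue title.
block_size, budget = 2096, 2048

try:
    validate_token_budget(block_size, budget)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

adjusted_budget = raise_budget_to_fit(block_size, budget)
validate_token_budget(block_size, adjusted_budget)  # passes after the adjustment
```

Whether the budget should be raised silently or the user should be asked to change the configuration is a design choice that this sketch does not settle.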

📋 Appendix: Detailed Data Lists

New Issues

#36697 [Bug]: Validation Error: block_size (2096) > max_num_batched_tokens (2048) when enabling prefix caching for Qwen3.5 Mamba architecture — bug — by Imagium719 (created: 2026-03-11 04:23 (UTC+8))
#36602 [RFC]: Make kernel/op and component tests device-agnostic for OOT plugins — RFC — by romitjain (created: 2026-03-10 15:15 (UTC+8))
#36631 [Bug]: v0.17.0 4*2080ti 22G Qwen3.5 RPC call to sample_tokens timed out. — bug — by VIT-Valentin (created: 2026-03-10 17:58 (UTC+8))
#36654 [Bug]: Frequent Tool Call Parsing Failures with DeepSeek-V3.2 — bug — by Curry30Messi (created: 2026-03-10 21:39 (UTC+8))
#36718 [Bug]: vLLM 0.15.0 startup on H200 failed at deep_gemm — bug — by pymhq (created: 2026-03-11 06:41 (UTC+8))
#36730 [Usage]: When running qwen3.5-27b with vllm 0.17.0, the Deep Thinking output is under “reasoning” and not under “reasoning_content”. — usage — by AEGEGE (created: 2026-03-11 10:08 (UTC+8))
#36662 [Bug]: Deepseek-v3 fails on 8xB200 in v0.17.0 (including eager) — bug — by ProExpertProg (created: 2026-03-10 22:34 (UTC+8))
#36640 [Bug]: KeyError: 'language_model.model.layers.20.linear_attn' — bug — by itfwonjulee (created: 2026-03-10 19:02 (UTC+8))
#36625 [XPU][NixlConnector] Add ze_ipc transport support for single-node PD disaggregation — no labels — by Yanli2190 (created: 2026-03-10 17:35 (UTC+8))
#36729 [Feature]: Fast KV Compaction via Attention Matching (50x compression) — feature request — by markg85 (created: 2026-03-11 09:34 (UTC+8))
#36627 [Performance]: qwen3.5 vs qwen3 — performance — by fangbaolei (created: 2026-03-10 17:41 (UTC+8))
#36657 [RFC]: Dynamic Speculation Length (DSL) with Confidence-Threshold Early Exit for vLLM Speculative Decoding — RFC — by jmamou (created: 2026-03-10 22:08 (UTC+8))
#36704 [Feature]: upstream nightly rocm docker — feature request,rocm — by functionstackx (created: 2026-03-11 05:07 (UTC+8))
#36703 [Feature]: upstream nightly rocm vllm — feature request,rocm — by functionstackx (created: 2026-03-11 05:06 (UTC+8))
#36598 [Bug]: Triton autotuner OOM on Qwen3.5/Qwen3-Next GDN layers (non-SM90 GPUs) — bug — by AuYang261 (created: 2026-03-10 14:36 (UTC+8))
#36643 [Bug]: Qwen3.5 does not work with pipeline parallelism — bug — by VirgilG72 (created: 2026-03-10 19:23 (UTC+8))
#36656 [Bug]: Garbled output Qwen3.5-122B-A10B VLLM 0.17.0 — bug — by twright8 (created: 2026-03-10 22:07 (UTC+8))
#36688 [Bug]: torch.opcheck fails for _C.rms_norm_per_block_quant — bug,help wanted — by ProExpertProg (created: 2026-03-11 01:47 (UTC+8))
#36676 [Bug]: dependency nixl<0.10.0 installs nixl-cu12==0.10.1 — bug — by cjackal (created: 2026-03-11 00:19 (UTC+8))
#36669 [Bug]: DeepSeek-OCR v1 crashes with TensorSchema mismatch when images_crop is empty (small images ≤640px) — no labels — by ketyi (created: 2026-03-10 23:27 (UTC+8))
#36668 [RFC]: Support Registry Mechanism for KVCacheSpec — RFC — by MengqingCao (created: 2026-03-10 23:24 (UTC+8))
#36663 [Bug]: CUBLAS_STATUS_INVALID_VALUE in Docker due to LD_LIBRARY_PATH cuBLAS version conflict — no labels — by lishunyang12 (created: 2026-03-10 22:57 (UTC+8))
#36660 [Bug]: Duplicate token with different logprob when requesting top — bug — by CoolFish88 (created: 2026-03-10 22:22 (UTC+8))
#36653 [Bug]: qwen3.5 Mismatch in image token count between text and input_ids. Got ids=[4091] — bug — by JakubCerven (created: 2026-03-10 21:32 (UTC+8))
#36613 [Bug]: CUDA ILM (Illegal Memory Access) crash when enabling MTP for Qwen3.5-397B-A17B under high concurrency — bug — by xiaochengyige (created: 2026-03-10 16:49 (UTC+8))
#36623 [Bug]: OOM when --kv-offloading-size>1024 — bug — by xiejibing (created: 2026-03-10 17:22 (UTC+8))
#36624 [Bug] External LB test_external_lb_dp[4] failing since shutdown timeout PR #34730 — no labels — by elvircrn (created: 2026-03-10 17:28 (UTC+8))
#36651 cumem allocator: double-free and stale error codes during sleep/wake cycles — no labels — by markrogersjr (created: 2026-03-10 20:47 (UTC+8))
#36604 [Performance]: What is the performance impact of upgrading from HTTP/1.1 to HTTP/2 or QUIC? — performance — by sl99897 (created: 2026-03-10 15:39 (UTC+8))
#36615 [Bug]: unknown error trying to run vllm v0.17.0 with ROCm on Radeon 8060S (gfx1151) — bug,rocm — by anomaly256 (created: 2026-03-10 16:52 (UTC+8))
#36584 [Bug]: Qwen3.5-35B-A3B FlashInfer JIT compilation fails with C++17 feature errors (e.g., std::is_unsigned_v) when using vLLM 0.17.0 — bug — by FLCgeigei (created: 2026-03-10 11:51 (UTC+8))
#36585 [Bug]: qwen3.5-27b-gptq deploy fail — bug — by xiaotianns (created: 2026-03-10 11:59 (UTC+8))
#36632 [Bug]: MiniMax-M2.5 reasoning missing in chat completions stream — bug — by JakubCerven (created: 2026-03-10 18:01 (UTC+8))
#36629 [Performance]: W4A16+eagle3 not better than fp8+eagle3 with Qwen2.5-14B — performance — by ljy6j13 (created: 2026-03-10 17:45 (UTC+8))
#36620 [Bug]: Does vllm support deploying glm-5 on A800 or A100, or are there any plans to support it? — bug — by yangzhipeng1108 (created: 2026-03-10 17:04 (UTC+8))
#36583 [Bug]: Empty output when using FP8 + Tensor Parallel (2 GPUs) with Qwen3-8B — bug — by Cqh123-H (created: 2026-03-10 11:49 (UTC+8))
#36594 [Bug]: DPEngineCoreProc may re-arm DP wave while paused (START_DP_WAVE ignores pause state), causing collective timeout after pause_generation + collective_rpc — cpu — by junjzhang (created: 2026-03-10 13:42 (UTC+8))
#36589 [Bug]: SM 7.5 extreme slowness hangs indefinitely on T4 (vllm 0.17.0 with Qwen3.5-27B) — bug — by billysyt (created: 2026-03-10 12:59 (UTC+8))

Closed Issues

#29645 [Bug]: Rotated samples extraction - accuracy loss — bug,stale — by Dineshkumar-Anandan-ZS0367 (closed: 2026-03-11 10:44 (UTC+8))
#36228 [Bug]: Multiprocessing executor produces corrupted vision encoder outputs for Qwen3-VL in v0.12.0 (ray executor works correctly) — no labels — by AGENDD (closed: 2026-03-11 10:34 (UTC+8))
#27901 [Feature]: Get confirmation on pending request(s) from vLLM V1 AsyncLLM servers — feature request,stale — by stevewx (closed: 2026-03-11 10:16 (UTC+8))
#28135 [RFC]: Robust End-to-End CI/CD Regression Testing for Speculative Decoding — RFC,stale — by rahul-tuli (closed: 2026-03-11 10:15 (UTC+8))
#28196 [Feature]: Add VLLM_INSTANCE_ID to KV Event — feature request,stale — by chickeyton (closed: 2026-03-11 10:15 (UTC+8))
#28301 [Performance]: DeepSeek V3 performance drop when enabling MTP on both H200 and MI300X — performance,rocm,speculative-decoding,stale — by ChangLiu0709 (closed: 2026-03-11 10:15 (UTC+8))
#28397 [Feature]: Add use zmq implement AFDConnect — feature request,stale — by lengrongfu (closed: 2026-03-11 10:15 (UTC+8))
#28425 [Feature][RL]: Fix Fp8 Weight Loading for RL — feature request,stale — by robertgshaw2-redhat (closed: 2026-03-11 10:15 (UTC+8))
#30358 [Bug]: NIXL PD disaggregate with host_buffer has accuracy issue - Prefill scheduled num_block mismatch at update_state_after_alloc and request_finished — bug,stale — by xuechendi (closed: 2026-03-11 10:15 (UTC+8))
#28437 [Bug]: Unhandled exception: N9deep_gemm11DGExceptionE. — bug,stale — by bbartels (closed: 2026-03-11 10:15 (UTC+8))
#36662 [Bug]: Deepseek-v3 fails on 8xB200 in v0.17.0 (including eager) — bug — by ProExpertProg (closed: 2026-03-11 10:04 (UTC+8))
#36295 [Bug]: Deepseek R1 TRTLLM FP8 MoE produces garbage output — bug — by wzhao18 (closed: 2026-03-11 10:04 (UTC+8))
#36625 [XPU][NixlConnector] Add ze_ipc transport support for single-node PD disaggregation — no labels — by Yanli2190 (closed: 2026-03-11 09:20 (UTC+8))
#35925 [Bug]: Qwen3.5-35B-A3B generates corrupted responses when AITER is enabled — bug,rocm — by jennyyyyzhen (closed: 2026-03-11 05:26 (UTC+8))
#35805 [Feature]: FlashInfer Sparse MLA + FP8 KV Cache — feature request — by benchislett (closed: 2026-03-11 01:28 (UTC+8))
#29467 [CI Failure]: mi325_4: 2 Node Tests (4 GPUs in total) — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:41 (UTC+8))
#29529 [CI Failure]: mi325_2: Distributed Tests (2 GPUs) — rocm,ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:40 (UTC+8))
#34365 [CI Failure]: mi325_1: Async Engine, Inputs, Utils, Worker, Config Test (CPU) — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:40 (UTC+8))
#34164 [CI Failure]: mi325_4: LM Eval Large Models — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:39 (UTC+8))
#32221 [CI Failure]: Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:39 (UTC+8))
#29515 [CI Failure]: mi325_1: V1 Test attention (H100) — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:39 (UTC+8))
#29463 [CI Failure]: mi325_1: Language Models Tests (Standard) — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:39 (UTC+8))
#29461 [CI Failure]: mi325_1: Language Models Test (PPL) — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:39 (UTC+8))
#27998 [Bug]: Sequence Parallel Pass doesn’t work for MoE models — bug,torch.compile — by jasonlizhengjian (closed: 2026-03-11 00:01 (UTC+8))
#35708 [Bug]: vLLM-compile warm start appears to be saving artifacts — bug,torch.compile — by zou3519 (closed: 2026-03-10 21:40 (UTC+8))
#36624 [Bug] External LB test_external_lb_dp[4] failing since shutdown timeout PR #34730 — no labels — by elvircrn (closed: 2026-03-10 21:20 (UTC+8))
#35138 [Bug]: Qwen/Qwen3.5-397B-A17B-FP8 and Qwen/Qwen3.5-397B-A17B has accuracy issues when running with Flashinfer Attention backend on Blackwell. — bug,nvidia — by xinli-sw (closed: 2026-03-10 18:32 (UTC+8))
#30288 [Bug]:fused_moe_lora_op compile error — bug,stale — by alex-petrenko (closed: 2026-03-10 11:40 (UTC+8))

New PRs

#36715 [Doc] Fix grammatically incorrect error message in gpu_worker and xpu_worker — v1 — by Hongbin10 (created: 2026-03-11 06:26 (UTC+8))
#36706 [WIP][Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server — documentation,frontend — by smfirmin (created: 2026-03-11 05:22 (UTC+8))
#36733 Vllm add dflash — new-model,speculative-decoding,v1,qwen — by biggestCjb (created: 2026-03-11 10:51 (UTC+8))
#36614 fix bugs when token_classify & classify run concurrently — ready — by staugust (created: 2026-03-10 16:49 (UTC+8))
#36726 [Refactor] Remove deadcode in Responses API serving — frontend — by sfeng33 (created: 2026-03-11 09:10 (UTC+8))
#36734 Automatically increased max_num_batched_tokens under Mamba align mode — no labels — by flutist (created: 2026-03-11 11:09 (UTC+8))
#36685 Compute MM Pre-processing Timing Metrics in a Non-Blocking Way for Prod — v1,multi-modality,meta-exported,fb-exported — by d-biswa (created: 2026-03-11 01:28 (UTC+8))
#36691 [Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling — bug,ready,v1,deepseek — by MatthewBonanni (created: 2026-03-11 02:03 (UTC+8))
#36728 [Bug][MoE] Strengthen _supports_current_device() checks in the TRTLLM FP8, NVFP4, and FlashInfer CuteDSL MoE experts — bug,nvidia — by yzong-rh (created: 2026-03-11 09:27 (UTC+8))
#36732 [MoE Refactor] Migrate UnquantizedFusedMoEMethod and oracle to MK flow — nvidia — by yzong-rh (created: 2026-03-11 10:36 (UTC+8))
#36678 [Bugfix] Miscellaneous MoE bug fixes — bug — by bnellnm (created: 2026-03-11 00:40 (UTC+8))
#36731 [Bugfix] Fix Eagle3 aux_hidden_state_layers indexing with pipeline parallelism — bug,llama,qwen — by alvinttang (created: 2026-03-11 10:26 (UTC+8))
#36612 [XPU] Add deepseek_scaling_rope fused kernel — deepseek — by yitingw1 (created: 2026-03-10 16:45 (UTC+8))
#36638 [WIP][Bugfix] Fix negative prompt token counter crash under KV offloading — bug,v1 — by haosdent (created: 2026-03-10 18:49 (UTC+8))
#36719 [ci] Bound nvidia-cudnn-frontend version — ready,ci/build,nvidia — by khluu (created: 2026-03-11 06:42 (UTC+8))
#36633 FunASR model bugfix — bug,ready — by AllenDou (created: 2026-03-10 18:18 (UTC+8))
#36723 [DSV3.2][MTP] Optimize Indexer MTP handling — ready,v1 — by benchislett (created: 2026-03-11 07:43 (UTC+8))
#36634 fix mtp launch error in vllm-0.17.0, about cuda graph during memory profile — v1,nvidia — by flutist (created: 2026-03-10 18:23 (UTC+8))
#36591 [Docs] Add GSM8K accuracy benchmark example — documentation — by ZhuangYu07 (created: 2026-03-10 13:06 (UTC+8))
#36670 [Bugfix][Model] Fix DeepSeek-OCR TensorSchema crash on empty images_crop — bug,multi-modality,deepseek — by ketyi (created: 2026-03-10 23:28 (UTC+8))
#36711 [ROCm][CI] Corrected the GPT-OSS test root path — rocm,ci/build,gpt-oss — by AndreasKaratzas (created: 2026-03-11 05:57 (UTC+8))
#36727 fix: replace deprecated F.sigmoid with torch.sigmoid in linear_attn (… — no labels — by xueliangyang-oeuler (created: 2026-03-11 09:24 (UTC+8))
#36599 [Bugfix] Warm up Triton autotuner for GDN layers during V1 profiling — bug,qwen — by AuYang261 (created: 2026-03-10 14:37 (UTC+8))
#36665 platforms: Fix Ray DP startup crash — ready — by itayalroy (created: 2026-03-10 23:14 (UTC+8))
#36724 [Bug][MoE] Fix TRTLLM EScoreBias Precision — bug,performance,needs-rebase,ci/build,qwen,nvidia — by robertgshaw2-redhat (created: 2026-03-11 08:37 (UTC+8))
#36725 [Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision — bug,ready,nvidia — by robertgshaw2-redhat (created: 2026-03-11 08:39 (UTC+8))
#36700 [Misc] Added curl retries in install_python_libraries.sh — no labels — by dmitry-tokarev-nv (created: 2026-03-11 04:49 (UTC+8))
#36606 [ROCm][Quantization] improve quant dtype parser robust for W4A8 — rocm — by xuebwang-amd (created: 2026-03-10 16:01 (UTC+8))
#36713 [Doc] Fix duplicate words in comments — ready,v1,multi-modality,nvidia — by Hongbin10 (created: 2026-03-11 06:20 (UTC+8))
#36717 [Doc] Fix grammatically incorrect comment: “doesn’t allow to override” -> “doesn’t allow overriding” — no labels — by Hongbin10 (created: 2026-03-11 06:35 (UTC+8))
#36714 [Doc] Capitalize docstrings in vllm/config/ — no labels — by Hongbin10 (created: 2026-03-11 06:23 (UTC+8))
#36659 [ROCm] Enable FP8 inference on gfx1201 AMD RDNA4 (Radeon AI PRO R9700) with aiter kernels — rocm,needs-rebase — by vllmellm (created: 2026-03-10 22:15 (UTC+8))
#36721 Add torch._scaled_mm backend for NVFP4 linear — no labels — by NikhilAPatel (created: 2026-03-11 07:38 (UTC+8))
#36722 [vLLM IR] 2/N batch-invariant-aware dispatching and rms_norm — rocm,ci/build,nvidia — by ProExpertProg (created: 2026-03-11 07:40 (UTC+8))
#36702 [ROCm] Attention selector reordering — documentation,rocm,ci/build,v1 — by gshtras (created: 2026-03-11 05:05 (UTC+8))
#36720 [Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling — bug,rocm,v1,nvidia — by JartX (created: 2026-03-11 07:16 (UTC+8))
#36593 [XPU]Bug fix for some unexpected error when use AgRs backend on XPU device. — bug,ready,v1 — by ys950902 (created: 2026-03-10 13:35 (UTC+8))
#36716 [ROCm]: Re-disable rope customop on rocm and fix rope+kvcache fusion conditions — rocm — by Rohan138 (created: 2026-03-11 06:31 (UTC+8))
#36695 Create test_async_flashinfer_combinations.py — v1,nvidia — by puririshi98 (created: 2026-03-11 04:05 (UTC+8))
#36712 [Doc] Fix typos in comments: i.e./e.g. punctuation and “archive” -> “achieve” — no labels — by Hongbin10 (created: 2026-03-11 06:13 (UTC+8))
#36609 [Minor] Enhance error message for TRTLLM decode uniformity check — ready,v1,nvidia — by WoosukKwon (created: 2026-03-10 16:31 (UTC+8))
#36666 [Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish — frontend,ready,v1 — by markmc (created: 2026-03-10 23:20 (UTC+8))
#36607 feat(models): enable Qwen3.5 text-only (Qwen3_5ForCausalLM) — IsHybrid, SupportsMRoPE, VL weight remapping — new-model,qwen — by groxaxo (created: 2026-03-10 16:03 (UTC+8))
#36710 [Perf] Optimize compute maxsim using batched version, 3.2% E2E throughput improvement — frontend,ready,v1 — by yewentao256 (created: 2026-03-11 05:49 (UTC+8))
#36707 fix: align lfm2 thumbnail token counting with HF — no labels — by tianshu-Michael-yu (created: 2026-03-11 05:23 (UTC+8))
#36705 [Kernel][Helion] [16/N] Refactor register_kernel API to eagerly configure Helion kernel upon importing — no labels — by gmagogsfm (created: 2026-03-11 05:16 (UTC+8))
#36708 fix: disambiguate multimodal prefix cache keys — v1 — by tianshu-Michael-yu (created: 2026-03-11 05:23 (UTC+8))
#36709 [ROCm][Bugfix] Fix NaN corruption — bug,rocm,v1 — by indivats (created: 2026-03-11 05:26 (UTC+8))
#36698 [Kernel] [Helion] [15/N] Split config files into per-platform files — no labels — by gmagogsfm (created: 2026-03-11 04:24 (UTC+8))
#36701 [Core] Remove FlashAttention block size restriction for hybrid models — v1 — by tdoublep (created: 2026-03-11 04:52 (UTC+8))
#36699 Add tuned H100 MoE configs for LFM2 8B and 24B — no labels — by tianshu-Michael-yu (created: 2026-03-11 04:32 (UTC+8))
#36681 [ROCm][Perf] Allow MTP lens > 1 in Sparse MLA — rocm,speculative-decoding,ready,v1 — by tvirolai-amd (created: 2026-03-11 01:12 (UTC+8))
#36696 Protect Download from Object Storage by Multiple Processes with Runai Model Streamer — rocm,ci/build — by noa-neria (created: 2026-03-11 04:12 (UTC+8))
#36687 [PD][Nixl] Add support for hybrid SSM-FA models — v1,kv-connector — by NickLucche (created: 2026-03-11 01:30 (UTC+8))
#36694 Use FLASH_ATTN instead of FLASHINFER for Whisper — nvidia — by WoosukKwon (created: 2026-03-11 03:53 (UTC+8))
#36689 [Feature]: Fused CUTLASS GEMM + static FP8 output quant in epilogue — performance,nvidia — by Itssshikhar (created: 2026-03-11 01:50 (UTC+8))
#36693 [Compile] Fix compile warning st256_cs in cuda_vec_utils.cuh — ready,nvidia — by yewentao256 (created: 2026-03-11 03:15 (UTC+8))
#36683 [Kernel] [Helion] [14/N] Set autotune_ignore_errors=True during autotuning — no labels — by gmagogsfm (created: 2026-03-11 01:17 (UTC+8))
#36677 [Kernel][Helion][13/N] Force static_shapes=False in helion register — no labels — by gmagogsfm (created: 2026-03-11 00:37 (UTC+8))
#36692 Use client-provided UUIDs directly as cache keys — meta-exported,fb-exported — by tensorAI (created: 2026-03-11 02:46 (UTC+8))
#36684 fix(kv-cache): increase hybrid attention grouping threshold from 1.25 to 1.5 — ready,v1,ready-run-all-tests — by hai-meh-cs (created: 2026-03-11 01:19 (UTC+8))
#36616 [Bugfix] Fix FlashMLA sparse accuracy with topk_length and zero-init padding — bug,v1 — by AjAnubolu (created: 2026-03-10 16:55 (UTC+8))
#36690 [AMD] fix to run MLA with kv cache dtype = fp8 — bug,rocm,deepseek — by krishna-kylist (created: 2026-03-11 01:52 (UTC+8))
#36686 Move to Zensical for docs build — documentation,ci/build — by aireilly (created: 2026-03-11 01:29 (UTC+8))
#36682 [Test] Add normal-case tests for split_graph — no labels — by SoluMilken (created: 2026-03-11 01:12 (UTC+8))
#36673 [Misc][Attention] Clean up unused method in CPU_ATTN — ready,v1,cpu — by MatthewBonanni (created: 2026-03-10 23:52 (UTC+8))
#36680 [ROCm][Perf] Allow MTP len > 1 for MLA — rocm,speculative-decoding,v1 — by tvirolai-amd (created: 2026-03-11 00:53 (UTC+8))
#36679 [Bugfix] stream failure when model name not in audio endpoints — bug,frontend — by ekagra-ranjan (created: 2026-03-11 00:44 (UTC+8))
#36674 [Bug] Fix FlashInfer MNNVL socket collisions under concurrent vLLM jobs — bug,ready,nvidia — by yewentao256 (created: 2026-03-11 00:04 (UTC+8))
#36675 [Misc] Sync pre-commit to 4.5.1 in workflows and docs — documentation,ci/build — by SoluMilken (created: 2026-03-11 00:11 (UTC+8))
#36595 [Bugfix] Avoid merging empty-only partitions into splitting-op subgraphs — bug,ready,ready-run-all-tests — by ZJY0516 (created: 2026-03-10 14:18 (UTC+8))
#36582 [compile] Apply stored functorch config while finalizing loaded artifacts. — ready — by zhxchen17 (created: 2026-03-10 11:48 (UTC+8))
#36626 [Model Runner V2] Use unpadded num_tokens for PW CUDA graph attn metadata — bug,v1,nvidia — by WoosukKwon (created: 2026-03-10 17:36 (UTC+8))
#36667 [Refactor] Remove Molmo2 processor wrapper — ready — by DarkLight1337 (created: 2026-03-10 23:21 (UTC+8))
#36658 Add: Eagle3 support for Qwen3.5 — ready,qwen — by rahul-tuli (created: 2026-03-10 22:14 (UTC+8))
#36672 Remove unused config field from Gemma2 — ready — by hmellor (created: 2026-03-10 23:50 (UTC+8))
#36661 feat(attention): extract KV-cache update from TreeAttention backend — v1 — by cong-or (created: 2026-03-10 22:30 (UTC+8))
#36671 chunk parakeet into 30s clips to prevent OOMs on long audios — no labels — by netanel-haber (created: 2026-03-10 23:33 (UTC+8))
#36588 [Model Runner V2] Fix mm input embeddings lookup — ready,v1 — by njhill (created: 2026-03-10 12:16 (UTC+8))
#36603 fix(lora): fix IndexError and GQA tensor size mismatch in QKV LoRA la… — no labels — by dzhengAP (created: 2026-03-10 15:31 (UTC+8))
#36596 [Perf] add packed recurrent fast path for decode — qwen — by caozuoba (created: 2026-03-10 14:26 (UTC+8))
#36664 Add gigachat 3.1 tool parser + fix gigachat3 tool parser — documentation,tool-calling — by ajpqs (created: 2026-03-10 23:08 (UTC+8))
#36655 [Frontend] Allow engine_client=None in OpenAIServingModels — frontend — by sagearc (created: 2026-03-10 22:03 (UTC+8))
#36597 [CI/Build] enable Intel XPU test flow with prebuilt image — ci/build — by wendyliu235 (created: 2026-03-10 14:33 (UTC+8))
#36645 [kv_offload+HMA][4/N]: Support sliding window lookup — v1,kv-connector — by orozery (created: 2026-03-10 19:34 (UTC+8))
#36647 [GDN] add an env variable for gdn kernel selection — qwen — by ZJY0516 (created: 2026-03-10 19:47 (UTC+8))
#36635 [NemotronH] Small fix reasoning parser — bug,ready,nvidia — by roikoren755 (created: 2026-03-10 18:38 (UTC+8))
#36652 Fix Qwen3.5 LoRA packed module mapping — qwen — by notimesea (created: 2026-03-10 21:12 (UTC+8))
#36641 [torch.compile] Let Non-CUDA platforms provide default sp_min_token_num — nvidia — by realliujiaxu (created: 2026-03-10 19:08 (UTC+8))
#36605 [MM][OOT] Support CPU seq_lens for OOT MMEncoderAttention kernels — qwen — by shen-shanshan (created: 2026-03-10 15:49 (UTC+8))
#36630 [Bugfix] Fix processor signature — bug,ready — by zucchini-nlp (created: 2026-03-10 17:56 (UTC+8))
#36628 [Frontend][Core] Revert “Add shutdown timeout” (#34730 and #36270) — frontend,ready,v1 — by markmc (created: 2026-03-10 17:41 (UTC+8))
#36650 [DO NOT MERGE] Reapply “[BugFix] Fix engine hanging after KV cache initialization failure #35478” — bug,ready,v1 — by markmc (created: 2026-03-10 20:08 (UTC+8))
#36649 [WIP] [Hybrid][GDN] Enable prefix caching ‘all’ mode for Qwen3.5/Qwen3Next — needs-rebase,v1,qwen — by haosdent (created: 2026-03-10 20:03 (UTC+8))
#36648 [WIP][LoRA] Add support for PEFT trainable_tokens_delta weights — needs-rebase — by haosdent (created: 2026-03-10 20:02 (UTC+8))
#36646 [DO NOT MERGE][Core] Revert “Fix benign error log during normal shutdown (#36270)” — ready,v1 — by markmc (created: 2026-03-10 19:43 (UTC+8))
#36644 [kv_offload+HMA][3/N]: Remove block_size from KVEvents — v1,kv-connector — by orozery (created: 2026-03-10 19:32 (UTC+8))
#36642 [kv_offload+HMA][2/N]: Support multiple KV groups in GPULoadStoreSpec — v1,kv-connector — by orozery (created: 2026-03-10 19:20 (UTC+8))
#36610 [kv_offload+HMA][1/N]: Support multiple KV groups in OffloadingSpec — v1,kv-connector — by orozery (created: 2026-03-10 16:33 (UTC+8))
#36639 [WIP][Bugfix] Fix causal_conv1d assertion crash during CUDA graph capture (#36566) — bug,nvidia — by haosdent (created: 2026-03-10 19:02 (UTC+8))
#36637 [WIP] [Bugfix] Respect scale_attn_weights config flag in GPTBigCode — bug — by haosdent (created: 2026-03-10 18:49 (UTC+8))
#36636 [WIP] [Bugfix] Fix KV cache offloading for hybrid attention models (e.g. Qwen3.5-27B) — bug,qwen,kv-connector — by haosdent (created: 2026-03-10 18:41 (UTC+8))
#36619 [DO NOT MERGE] Test revert “Remove busy loop from idle buffer readers” (#28053 and #36068) — ready,v1 — by markmc (created: 2026-03-10 17:02 (UTC+8))
#36587 skip triton when graph capture — qwen — by flutist (created: 2026-03-10 12:05 (UTC+8))
#36622 [Bugfix] Fix off-by-one in multimodal prefix cache hash boundary check — bug,v1 — by AjAnubolu (created: 2026-03-10 17:17 (UTC+8))
#36621 [Bugfix] Fix FP8 online quantization premature trigger with TP sharded weights — bug — by AjAnubolu (created: 2026-03-10 17:17 (UTC+8))
#36617 [Bugfix] Fix prefix caching for hybrid models with non-uniform page sizes — bug,v1 — by AjAnubolu (created: 2026-03-10 16:55 (UTC+8))
#36618 Revert “[cudagraph] fix cudagraph warning in deepseekv32 (#28044)” — deepseek,nvidia — by elvircrn (created: 2026-03-10 16:59 (UTC+8))
#36611 [Bugfix] Fix FP8 MLA CUDAGraph stale tile scheduler metadata — bug,v1,nvidia — by AjAnubolu (created: 2026-03-10 16:34 (UTC+8))
#36608 [Bugfix] Fix DP wave race condition re-arming engine while paused — bug,v1 — by AjAnubolu (created: 2026-03-10 16:17 (UTC+8))
#36600 [Models] Align nemotron-nano-vl’s audio_in_video processing — multi-modality — by Isotr0py (created: 2026-03-10 14:51 (UTC+8))
#36590 [Model] Add HyperCLOVAX-SEED-Omni-8B model support — new-model — by bigshanedogg (created: 2026-03-10 13:06 (UTC+8))
#36592 [Kernel] Flush L2 cache in benchmark_moe to reduce cache warmth bias — performance — by Jerry2423 (created: 2026-03-10 13:35 (UTC+8))
#36586 [LMCache] Fault Tolerance Mechanism — kv-connector — by Oasis-Git (created: 2026-03-10 12:03 (UTC+8))
#36581 [DO NOT REVIEW] Add versioned Helion kernel support with CI policy enforcement — documentation,needs-rebase — by gmagogsfm (created: 2026-03-10 11:42 (UTC+8))

已合并 PR

#36614 fix bugs when token_classify & classify run concurrently — ready — by staugust (合并于: 2026-03-11 11:16 (UTC+8))
#36201 [openapi server] log exception in exception handler(2/N) — frontend,ready — by andyxning (合并于: 2026-03-11 11:16 (UTC+8))
#36691 [Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling — bug,ready,v1,deepseek — by MatthewBonanni (合并于: 2026-03-11 11:01 (UTC+8))
#36633 FunASR model bugfix — bug,ready — by AllenDou (合并于: 2026-03-10 23:14 (UTC+8))
#36296 [Bug] Fix TRTLLM Block FP8 MoE Monolithic — bug,ready,nvidia — by wzhao18 (合并于: 2026-03-11 10:04 (UTC+8))
#36397 fix: check HTTP status in batch read_file to prevent silent failures — frontend,ready — by alvinttang (合并于: 2026-03-10 22:22 (UTC+8))
#36090 [ROCm][CI] Making some tests optional to reduce workload — rocm,ready,ci/build — by AndreasKaratzas (合并于: 2026-03-11 07:45 (UTC+8))
#36609 [Minor] Enhance error message for TRTLLM decode uniformity check — ready,v1,nvidia — by WoosukKwon (合并于: 2026-03-11 06:38 (UTC+8))
#36041 [Model Runner V2] Add initial CI tests — ready,ci/build,v1 — by njhill (合并于: 2026-03-11 05:55 (UTC+8))
#36521 [Core] Simplify core kv-cache blocks initialization logic — ready,v1,ready-run-all-tests — by njhill (merged: 2026-03-11 04:20 (UTC+8))
#36508 [Misc] fix typo: homogenous-> homogeneous (2 lines change) — v1 — by SoluMilken (merged: 2026-03-10 21:25 (UTC+8))
#36340 [Test] test_async_scheduling.py improvements — ready,v1 — by njhill (merged: 2026-03-11 02:17 (UTC+8))
#36595 [Bugfix] Avoid merging empty-only partitions into splitting-op subgraphs — bug,ready,ready-run-all-tests — by ZJY0516 (merged: 2026-03-10 22:39 (UTC+8))
#36582 [compile] Apply stored functorch config while finalizing loaded artifacts. — ready — by zhxchen17 (merged: 2026-03-11 00:34 (UTC+8))
#36626 [Model Runner V2] Use unpadded num_tokens for PW CUDA graph attn metadata — bug,v1,nvidia — by WoosukKwon (merged: 2026-03-11 00:30 (UTC+8))
#36104 [CI] Bump mypy version to 1.19.1 — ready,v1,kv-connector — by hmellor (merged: 2026-03-11 00:18 (UTC+8))
#35719 [ROCm][Perf] Enable sparse_mla’s cudagraph on ROCm platform — rocm,ready,v1,nvidia — by ganyi1996ppo (merged: 2026-03-11 00:14 (UTC+8))
#36519 [Bugfix][Sparse MLA] report indexer CG support properly — bug,ready,v1 — by MatthewBonanni (merged: 2026-03-11 00:14 (UTC+8))
#34304 Improvements to wvSplitKrc skinny GEMM solution — rocm,ready — by amd-hhashemi (merged: 2026-03-11 00:14 (UTC+8))
#36588 [Model Runner V2] Fix mm input embeddings lookup — ready,v1 — by njhill (merged: 2026-03-10 15:24 (UTC+8))
#36580 [Model Runner V2] Fix _compute_slot_mappings_kernel for chunked prefill — ready,v1 — by njhill (merged: 2026-03-10 15:23 (UTC+8))
#35200 Fix hf_override_fn when it modifies model_type — ready — by hmellor (merged: 2026-03-10 23:03 (UTC+8))
#35342 feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency — ready,v1,kv-connector — by Srinivasoo7 (merged: 2026-03-10 22:43 (UTC+8))
#36479 [Model] Consolidate score logic by introduce score_type — new-model,frontend,ready,qwen — by noooop (merged: 2026-03-10 21:32 (UTC+8))
#36630 [Bugfix] Fix processor signature — bug,ready — by zucchini-nlp (merged: 2026-03-10 21:20 (UTC+8))
#36628 [Frontend][Core] Revert “Add shutdown timeout” (#34730 and #36270) — frontend,ready,v1 — by markmc (merged: 2026-03-10 21:20 (UTC+8))
#36532 Fix Qwen2.5-VL test for Transformers v5 — documentation,ready,qwen — by hmellor (merged: 2026-03-10 20:05 (UTC+8))
#36169 feat(grpc): extract gRPC servicer into smg-grpc-servicer package, add --grpc flag to vllm serve — rocm,frontend,ready,ci/build — by CatherineSue (merged: 2026-03-10 18:29 (UTC+8))
#36534 Fix LFM2 MoE test for Transformers v5 — ready — by hmellor (merged: 2026-03-10 13:29 (UTC+8))
#35219 [BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU — bug,ready,v1,qwen,ready-run-all-tests — by vadiklyutiy (merged: 2026-03-10 18:32 (UTC+8))
#36494 Fix: Re-Enable EP for trtllm MoE FP8 backend — ready,nvidia — by amirkl94 (merged: 2026-03-10 14:11 (UTC+8))
#36557 [Bugfix] Fix RuntimeError: Already borrowed that degrades VLM serving throughput under concurrent load. — bug,ready — by hallerite (merged: 2026-03-10 13:30 (UTC+8))
#36546 Remove unused disable_fallback field — ready — by zhuohan123 (merged: 2026-03-10 11:56 (UTC+8))
#36159 [Perf] Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement — frontend,ready,v1 — by yewentao256 (merged: 2026-03-10 11:55 (UTC+8))

PRs Closed Without Merging

#28631 [Frontend][Renderer] Refactor score API — documentation,frontend,multi-modality — by noooop (closed: 2026-03-11 10:56 (UTC+8))
#35966 [feat] Kimi K2/DeepSeek Support eagle3 — speculative-decoding,v1,deepseek — by leihuang-sketch (closed: 2026-03-11 04:34 (UTC+8))
#28175 [vllm][OutputProcessor] Allow propagate engine core additional outputs back to client — needs-rebase,stale,v1 — by KingsleyZhang123 (closed: 2026-03-11 10:15 (UTC+8))
#28313 Make tests/lora/utils usable by plugins — stale — by vanbasten23 (closed: 2026-03-11 10:15 (UTC+8))
#36638 [WIP][Bugfix] Fix negative prompt token counter crash under KV offloading — bug,v1 — by haosdent (closed: 2026-03-11 10:14 (UTC+8))
#36724 [Bug][MoE] Fix TRTLLM EScoreBias Precision — bug,performance,needs-rebase,ci/build,qwen,nvidia — by robertgshaw2-redhat (closed: 2026-03-11 08:38 (UTC+8))
#36717 [Doc] Fix grammatically incorrect comment: “doesn’t allow to override” -> “doesn’t allow overriding” — no labels — by Hongbin10 (closed: 2026-03-11 08:13 (UTC+8))
#36714 [Doc] Capitalize docstrings in vllm/config/ — no labels — by Hongbin10 (closed: 2026-03-11 08:11 (UTC+8))
#30574 MLA Based Eagle3 — new-model,speculative-decoding,v1,deepseek — by IzzyPutterman (closed: 2026-03-11 07:29 (UTC+8))
#35918 [Model Runner V2] Fix FA3 cuda graphs — v1,nvidia — by njhill (closed: 2026-03-11 05:27 (UTC+8))
#29853 [WIP] Calculate MFU/MBU for whole model forward pass — stale — by linwang-aviva (closed: 2026-03-11 04:36 (UTC+8))
#35247 Create test_deepseek_v32_config.py — v1,deepseek — by puririshi98 (closed: 2026-03-11 04:18 (UTC+8))
#35333 [Perf] Optimize model runner v2 prepare_inputs copy logic, 6.1% E2E throughput improvement — ready,needs-rebase,v1,kv-connector,v2 — by yewentao256 (closed: 2026-03-11 03:48 (UTC+8))
#34903 [Bug] Fix illegal memory access issue for model runner v2 — bug,ready,needs-rebase,v1,nvidia — by yewentao256 (closed: 2026-03-11 03:45 (UTC+8))
#36523 [Platform] Add MPS (Apple Metal) platform support for macOS — documentation,performance,ci/build,v1,cpu — by robtaylor (closed: 2026-03-11 03:31 (UTC+8))
#36021 bump flashinfer v0.6.4 -> v0.6.6 — ci/build,nvidia — by netanel-haber (closed: 2026-03-11 03:21 (UTC+8))
#36680 [ROCm][Perf] Allow MTP len > 1 for MLA — rocm,speculative-decoding,v1 — by tvirolai-amd (closed: 2026-03-11 01:11 (UTC+8))
#32204 [Core][KVConnector] Support HMA+NixlConnector — needs-rebase,v1,kv-connector — by NickLucche (closed: 2026-03-11 01:10 (UTC+8))
#36338 Add kv_transfer_params to streaming responses — frontend — by RishabhSaini (closed: 2026-03-11 01:07 (UTC+8))
#36661 feat(attention): extract KV-cache update from TreeAttention backend — v1 — by cong-or (closed: 2026-03-10 22:36 (UTC+8))
#30411 fix(gguf): Ensure Gemma2 configs have hidden_act for backward compatibility — ready — by kitaekatt (closed: 2026-03-10 23:51 (UTC+8))
#32551 [Performance] Split FlexAttention and FlashInfer attention and cache update — performance,rocm,needs-rebase,v1,nvidia — by Etelis (closed: 2026-03-10 23:14 (UTC+8))
#33271 [ROCm] Change default settings for ROCm — rocm,needs-rebase,v1 — by gshtras (closed: 2026-03-10 23:03 (UTC+8))
#34998 [ROCm] Check that AITER MHA is not selected with sinks — rocm — by gshtras (closed: 2026-03-10 23:03 (UTC+8))
#36571 Fix issue 04 — v1 — by xueliangyang-oeuler (closed: 2026-03-10 22:22 (UTC+8))
#36652 Fix Qwen3.5 LoRA packed module mapping — qwen — by notimesea (closed: 2026-03-10 21:27 (UTC+8))
#36371 [compile] Remove strict_autograd_cache and force_non_lazy_backward_lowering workaround — ready — by zou3519 (closed: 2026-03-10 21:38 (UTC+8))
#34244 [ROCm] [CI] Add new fusion test cases that are relevant to vLLM IR Ops — rocm,needs-rebase,ci/build — by tjtanaa (closed: 2026-03-10 21:20 (UTC+8))
#36650 [DO NOT MERGE] Reapply “[BugFix] Fix engine hanging after KV cache initialization failure #35478” — bug,ready,v1 — by markmc (closed: 2026-03-10 20:54 (UTC+8))
#34061 [Cleanup] Unify vllm.utils.flashinfer and flashinfer_utils — nvidia — by veeceey (closed: 2026-03-10 21:14 (UTC+8))
#36646 [DO NOT MERGE][Core] Revert “Fix benign error log during normal shutdown (#36270)” — ready,v1 — by markmc (closed: 2026-03-10 20:51 (UTC+8))
#36619 [DO NOT MERGE] Test revert “Remove busy loop from idle buffer readers” (#28053 and #36068) — ready,v1 — by markmc (closed: 2026-03-10 19:31 (UTC+8))
#34682 [kv_offload+HMA][2/N]: Support sliding window lookup — v1,kv-connector — by orozery (closed: 2026-03-10 17:36 (UTC+8))
#34799 [kv_offload+HMA][3/N]: Scheduler-side support for multiple KV groups — v1,kv-connector — by orozery (closed: 2026-03-10 17:36 (UTC+8))
#36618 Revert “[cudagraph] fix cudagraph warning in deepseekv32 (#28044)” — deepseek,nvidia — by elvircrn (closed: 2026-03-10 17:15 (UTC+8))
#34680 [kv_offload+HMA][1/N]: Worker-side support for multiple HMA groups — v1 — by orozery (closed: 2026-03-10 16:26 (UTC+8))
#35995 Fix: support auto_tune to run local model — performance — by panpan0000 (closed: 2026-03-10 16:11 (UTC+8))
#32398 [KVConnector] OffloadingConnector: Add preemptions-only mode — needs-rebase,v1,kv-connector — by orozery (closed: 2026-03-10 16:10 (UTC+8))
#36486 [Bugfix] Fix issues in quark emulative logic — bug — by wangjiaxin99 (closed: 2026-03-10 15:10 (UTC+8))
#36567 fix: disable async scheduling by default due to instability (issue #3… — no labels — by xueliangyang-oeuler (closed: 2026-03-10 12:55 (UTC+8))
#28368 [refactor] [mla]: independently passing q_nope & q_rope — stale,v1 — by vnadathur (closed: 2026-03-10 12:14 (UTC+8))
#21849 [WIP] Add Kimi-Audio integration for vLLM — documentation,new-model,ci/build,v1 — by HelloWorldU (closed: 2026-03-10 11:42 (UTC+8))