vLLM Development Activity Report - 2026-03-10
Time window: 2026-03-10 11:21 (UTC+8) ~ 2026-03-11 11:21 (UTC+8). Stats: new issues 38 | issues closed 28 | new PRs 115 | PRs merged 34 | PRs closed without merging 42
📊 Daily Development Status Summary
During the March 10-11 window the vLLM community remained highly active, handling 38 new issues and 115 new PRs. Development focused on the stability problems that emerging hybrid-architecture (Mamba + Transformer) models such as Qwen3.5 are exposing in production, especially crashes with speculative decoding (MTP) under high concurrency. Support for the AMD platform also continued to improve, with performance tuning and bug fixes across the ROCm ecosystem forming a significant workstream.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was frequent this cycle, centered on performance optimization, bug fixes, and ecosystem maturation.
- Performance optimization and kernel support:
  - PR #36659: Added tuned FP8 MoE Triton configs for the AMD Radeon AI PRO R9700 (gfx1201, RDNA4), significantly improving Qwen3-FP8 performance on that card (~20% lower TTFT, ~24% lower TPOT).
  - PR #35719 (merged): Re-enabled CUDA Graph support for the sparse_mla attention backend on ROCm to improve performance.
  - PR #36716: As follow-up tuning, temporarily disabled the RoPE custom op on ROCm after it was found to cause performance regressions in some configurations (e.g. MI355).
  - PR #36680 / #36681: Allowed more than one MTP speculative token on ROCm's sparse MLA (AITER) backend, extending feature coverage.
- Bug fixes and stability:
  - Issue #35925 (closed): Fixed corrupted Qwen3.5 output caused by NaN propagation when using AITER kernels on ROCm; PR #36709 resolved it by adding a nan_to_num_() cleanup.
  - PR #36720: Fixed a worker-startup OOM on ROCm/HIP caused by inaccurate CUDA Graph memory profiling, by skipping the unreliable estimate.
  - PR #36690: Fixed an error on AMD GPUs when using MLA attention with an FP8 KV cache.
  - PR #36606: Made parsing of data types such as W4A8 in the Quark quantization tooling more robust.
- Ecosystem building and requests:
  - Issue #36703 & #36704: Users requested nightly pip wheels and Docker images for ROCm, arguing these are prerequisites for "first-class citizen" support, a sign of rising community expectations for AMD ecosystem maturity.
  - PR #36711: Corrected the GPT-OSS test path in ROCm CI.
Summary: AMD contributors (including users with the -amd suffix) were active, systematically addressing the performance and correctness issues ROCm hits with the latest models (e.g. Qwen3.5, DeepSeek-V3) while also improving the developer experience.
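The NaN-scrubbing fix in PR #36709 (a `nan_to_num_()` cleanup) can be illustrated with a minimal, framework-free sketch. The helper below is a hypothetical pure-Python analogue of PyTorch's in-place `Tensor.nan_to_num_()`, shown only to make the failure mode concrete; it is not the actual patch:

```python
import math

def nan_to_num(values, nan=0.0):
    # Pure-Python stand-in for torch.Tensor.nan_to_num_(): replace NaN
    # entries so they cannot propagate into downstream computations.
    return [nan if math.isnan(v) else v for v in values]

# A single NaN emitted by one kernel poisons every later reduction:
activations = [0.5, float("nan"), 0.25]
print(math.isnan(sum(activations)))   # True: the corruption spreads
print(sum(nan_to_num(activations)))   # 0.75: scrubbed before reuse
```

The real fix applies the same idea in place on the affected tensor before its values are consumed by later kernels.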
💬 High-Traffic Discussion Analysis
- Issue #36613: CUDA illegal memory access with Qwen3.5 MTP under high concurrency
  - Core topic: With MTP speculative decoding enabled, the Qwen3.5-397B model crashes with a CUDA Illegal Memory Access under highly concurrent requests.
  - Perspectives:
    - Reporters (xiaochengyige, MLKoz2): Provided detailed reproduction steps and logs, noting that everything works with MTP disabled and suspecting a bug introduced by speculative decoding.
    - Maintainer (ZJY0516): Suggested testing the main branch or a nightly build to confirm whether it is already fixed, and asked for a minimal reproduction script.
    - Community member (mykolademyanov): Commented from a systems-design angle that such problems often stem from tight coupling between model logic and the execution environment, and suggested decoupling.
    - Another user (cjackal): Pointed out that this duplicates another issue and supplied a simplified reproduction command.
  - Points of contention: No real disagreement; mostly collaborative triage.
  - Status: Unresolved; discussion is focused on reproducing the crash more reliably and locating the root cause.
- Issue #36627: Qwen3.5 vs. Qwen3 performance comparison
  - Core topic: A user observed that Qwen3.5's TTFT (time to first token) is much slower than Qwen3's and questioned its performance.
  - Perspectives:
    - Asker (fangbaolei): Considered the regression significant, especially in TTFT.
    - Explainers (ShanningZhuang, yunseoLee0343): Pointed out that Qwen3.5 uses a hybrid Mamba/DeltaNet architecture whose recurrent computation prevents the prefill stage from parallelizing fully the way a Transformer does, so a longer TTFT is inherent. Memory operations in the implementation (such as torch.zeros) may add further overhead.
    - Suggestion (MLKoz2): Recommended trying different attention backends and warm-up strategies.
  - Points of contention: None; mainly an explanation of hybrid-architecture performance characteristics.
  - Status: Open; the conclusion is that the performance gap stems mainly from the architectural change and is an expected trade-off.
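The explanation above can be made concrete with a toy recurrence. In the sketch below (a deliberate simplification, not vLLM code), each prefill step depends on the previous hidden state, so n prompt tokens cost n strictly sequential updates, whereas attention can process all prompt positions in parallel:

```python
def recurrent_prefill(xs, a=0.5):
    # Toy linear recurrence h_t = a * h_{t-1} + x_t: step t cannot start
    # until step t-1 has finished, which is why a Mamba/DeltaNet-style
    # layer cannot parallelize prefill across positions the way
    # attention does. This is the structural reason for the higher TTFT.
    h, states = 0.0, []
    for x in xs:
        h = a * h + x
        states.append(h)
    return states

print(recurrent_prefill([1.0, 1.0, 1.0]))  # [1.0, 1.5, 1.75]
```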
- Issue #36643: Qwen3.5 does not support pipeline parallelism
  - Core topic: Enabling pipeline parallelism (PP) for Qwen3.5 fails.
  - Perspectives and progress:
    - User-reported error: With PP and MTP enabled, the server reports "Pipeline parallelism is not supported".
    - Root-cause analysis (weiguangli-io): A code review identified the cause: when PP is enabled, the system checks whether the model supports PP. With MTP speculative decoding, the draft model inherits the target model's PP configuration, but the draft model class created for MTP, Qwen3_5MoeMTP, does not declare the SupportsPP protocol, so the check fails.
  - Status: Unresolved, but the root cause has been clearly identified, pointing the way to a fix.
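The root cause described above is essentially a class-level capability check. A hypothetical sketch of that pattern (the class and function names below are illustrative stand-ins, not vLLM's actual interfaces):

```python
class SupportsPP:
    """Marker base class: a model declares PP support by inheriting it."""

class TargetModel(SupportsPP):
    """Target model: declares SupportsPP, so the PP check passes."""

class DraftMTPModel:
    """MTP draft model: inherits the target's PP *config* at runtime,
    but its class never declares SupportsPP, so the check fails."""

def assert_pp_supported(model_cls):
    # Gate pipeline parallelism on the class-level capability declaration.
    if not issubclass(model_cls, SupportsPP):
        raise NotImplementedError(
            f"Pipeline parallelism is not supported for {model_cls.__name__}"
        )

assert_pp_supported(TargetModel)      # passes silently
# assert_pp_supported(DraftMTPModel)  # raises NotImplementedError
```

Under this reading, the fix is simply to have the draft model class declare the capability its runtime configuration already assumes.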
- PR #36666 / #36628: Shutdown-timeout feature causes CI failures
  - Core topic: A PR adding graceful server shutdown (#34730, #36270) caused persistent failures in the Distributed Tests (4 GPUs) suite.
  - Discussion:
    - PR #36628 reverted the feature outright, making the tests pass.
    - PR #36666 then re-submitted the feature with a detailed analysis arguing that the test failures likely relate to timing or communication during shutdown rather than a logic error in the feature itself; the author wants to re-establish a baseline and investigate further.
    - The thread walks through the merge history of several related PRs and their CI results, illustrating how hard provenance tracking is in a complex project.
  - Points of contention: How to balance landing new features against keeping tests stable: revert entirely, or keep the feature and fix the tests?
  - Status: The revert has been merged and the re-submission is under discussion, underscoring how cautiously infrastructure changes are handled.
🔥 Hot Topics and Trends
- Qwen3.5's hybrid-architecture growing pains: As a new model architecture, Qwen3.5 has exposed many edge cases since its integration into vLLM, making it the hottest topic this cycle. Problems include MTP crashes (#36613), performance questions (#36627), missing pipeline-parallel support (#36643), a prefix-caching validation error (#36697), and Triton autotuner OOM on GDN layers (#36598), showing that full adaptation to the new architecture is still in progress.
- Speculative decoding, stability and innovation: Beyond the MTP problems, speculative decoding drew broad attention: fixes on one hand (e.g. PR #36634 fixing an MTP launch error) and new methods on the other (e.g. PR #36733 adding DFlash). This reflects the community's sustained push on inference-speed techniques.
- AMD ecosystem, closing gaps and seeking parity: From fixing low-level bugs (NaN, OOM), to adding performance configs for new GPUs (RDNA4), to user calls for nightly builds on par with CUDA, ROCm support is moving from merely working to working well and fast.
🛠️ Key Technical Changes
- PR #35219 (merged): Fix NaN propagation in hybrid-model KV cache
  - Technical summary: Fixes a core bug in hybrid models such as Qwen3.5 where, when KV cache blocks are shared, fp32 residue left by SSM (Mamba) layers is later reused by attention (fp8/fp16) layers; the multiply-by-zero masking then produces NaN, which contaminates all subsequent requests.
  - Impact: Fully resolves the KV-cache-pollution problem in which model accuracy degraded the longer the server ran; critical for production stability.
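The mechanism follows directly from IEEE-754 arithmetic: multiplying stale data by a zero mask only neutralizes it if the stale values are finite. Plain Python floats behave like the fp32 values in the cache:

```python
import math

stale = float("inf")       # non-finite residue left in a freed cache block
masked = stale * 0.0       # multiply-by-zero masking of the reused block
print(math.isnan(masked))  # True: inf * 0 is NaN, the mask itself creates NaN

# Zeroing freed blocks (the approach of the fix) leaves nothing non-finite
# for the mask to trip over:
print(0.0 * 0.0 == 0.0)    # True
```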
- PR #36595 (merged): Fix DeepSeek-V3 errors caused by merging empty partitions
  - Technical summary: Fixes a logic bug in CUDA Graph compilation that wrongly merged partitions containing only empty ops into the splitting-op subgraph, causing models such as DeepSeek-V3 to produce incorrect results or CUDA Graph warnings after compilation.
  - Impact: Fixes a key compilation-layer correctness bug, safeguarding services that rely on compile-time optimization.
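The guard this PR adds can be pictured with a toy partition-merging pass. Everything below (op names, function names, the partition representation) is invented for illustration and only mirrors the shape of the corrected rule, not vLLM's compiler internals:

```python
EMPTY_OPS = {"nop", "placeholder"}  # ops that do no real work (illustrative)

def is_empty_only(partition):
    # A partition is "empty-only" if every op in it is a no-op.
    return all(op in EMPTY_OPS for op in partition)

def merge_into_splitting_subgraph(subgraph, partitions):
    # Merge neighboring partitions into the splitting-op subgraph, but
    # skip partitions containing only empty ops: folding those in is the
    # behavior the fix removes.
    merged = list(subgraph)
    for part in partitions:
        if is_empty_only(part):
            continue
        merged.extend(part)
    return merged

print(merge_into_splitting_subgraph(["all_reduce"], [["nop"], ["matmul", "nop"]]))
# ['all_reduce', 'matmul', 'nop']
```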
- PR #36169 (merged): Extract the gRPC server implementation into a standalone package
  - Technical summary: Refactors the gRPC server code out of the main vLLM repository into a separate smg-grpc-servicer package, leaving only the launch entry point in the main repo and decoupling the gRPC protocol from the vLLM core.
  - Impact: Lets the gRPC protocol and server logic iterate independently and quickly, untied from the vLLM release cycle, improving architectural flexibility and maintainability.
📈 Development Activity Observations
- Contributor activity: AMD-related contributors (e.g. vllmellm, indivats, tvirolai-amd) were very active, submitting multiple performance and bug-fix PRs and demonstrating the AMD team's continued investment in vLLM.
- Review and merge cadence: 34 PRs were merged. Critical bug fixes (e.g. #36595) merged quickly once core maintainers confirmed them, while feature PRs involving complex interactions or failing tests (e.g. #36666) went through a cautious revert, resubmit, and discuss cycle, reflecting the emphasis on stability.
- Issue handling: 28 issues were closed, including some long-standing CI failures (e.g. #29529, #29463) and bug reports, showing ongoing backlog cleanup.
💡 Issues Worth Watching
- Qwen3.5 MTP stability under high concurrency (Issue #36613): A serious blocker for large-scale deployment whose root cause is still unknown; it needs focused community attention.
- Prefix caching for hybrid-architecture models (Issue #36697, PR #36649): Hybrid models currently see very low prefix-cache hit rates, limiting performance in multi-turn conversation scenarios. Optimization work (PR #36649) is in progress and is an important performance direction.
- Missing nightly builds for AMD (Issue #36703, #36704): A reasonable request; shipping CI artifacts on par with CUDA is an important step for AMD developer experience and adoption.
📋 Appendix: Detailed Data
New Issues
- #36697 [Bug]: Validation Error: block_size (2096) > max_num_batched_tokens (2048) when enabling prefix caching for Qwen3.5 Mamba architecture — bug — by Imagium719 (created: 2026-03-11 04:23 (UTC+8))
- #36602 [RFC]: Make kernel/op and component tests device-agnostic for OOT plugins — RFC — by romitjain (created: 2026-03-10 15:15 (UTC+8))
- #36631 [Bug]: v0.17.0 4*2080ti 22G Qwen3.5 RPC call to sample_tokens timed out. — bug — by VIT-Valentin (created: 2026-03-10 17:58 (UTC+8))
- #36654 [Bug]: Frequent Tool Call Parsing Failures with DeepSeek-V3.2 — bug — by Curry30Messi (created: 2026-03-10 21:39 (UTC+8))
- #36718 [Bug]: vLLM 0.15.0 startup on H200 failed at deep_gemm — bug — by pymhq (created: 2026-03-11 06:41 (UTC+8))
- #36730 [Usage]: When running qwen3.5-27b with vllm 0.17.0, the Deep Thinking output is under "reasoning" and not under "reasoning_content". — usage — by AEGEGE (created: 2026-03-11 10:08 (UTC+8))
- #36662 [Bug]: Deepseek-v3 fails on 8xB200 in v0.17.0 (including eager) — bug — by ProExpertProg (created: 2026-03-10 22:34 (UTC+8))
- #36640 [Bug]: KeyError: 'language_model.model.layers.20.linear_attn' — bug — by itfwonjulee (created: 2026-03-10 19:02 (UTC+8))
- #36625 [XPU][NixlConnector] Add ze_ipc transport support for single-node PD disaggregation — no labels — by Yanli2190 (created: 2026-03-10 17:35 (UTC+8))
- #36729 [Feature]: Fast KV Compaction via Attention Matching (50x compression) — feature request — by markg85 (created: 2026-03-11 09:34 (UTC+8))
- #36627 [Performance]: qwen3.5 vs qwen3 — performance — by fangbaolei (created: 2026-03-10 17:41 (UTC+8))
- #36657 [RFC]: Dynamic Speculation Length (DSL) with Confidence-Threshold Early Exit for vLLM Speculative Decoding — RFC — by jmamou (created: 2026-03-10 22:08 (UTC+8))
- #36704 [Feature]: upstream nightly rocm docker — feature request,rocm — by functionstackx (created: 2026-03-11 05:07 (UTC+8))
- #36703 [Feature]: upstream nightly rocm vllm — feature request,rocm — by functionstackx (created: 2026-03-11 05:06 (UTC+8))
- #36598 [Bug]: Triton autotuner OOM on Qwen3.5/Qwen3-Next GDN layers (non-SM90 GPUs) — bug — by AuYang261 (created: 2026-03-10 14:36 (UTC+8))
- #36643 [Bug]: Qwen3.5 does not work with pipeline parallelism — bug — by VirgilG72 (created: 2026-03-10 19:23 (UTC+8))
- #36656 [Bug]: Garbled output Qwen3.5-122B-A10B VLLM 0.17.0 — bug — by twright8 (created: 2026-03-10 22:07 (UTC+8))
- #36688 [Bug]: torch.opcheck fails for _C.rms_norm_per_block_quant — bug,help wanted — by ProExpertProg (created: 2026-03-11 01:47 (UTC+8))
- #36676 [Bug]: dependency nixl<0.10.0 installs nixl-cu12==0.10.1 — bug — by cjackal (created: 2026-03-11 00:19 (UTC+8))
- #36669 [Bug]: DeepSeek-OCR v1 crashes with TensorSchema mismatch when images_crop is empty (small images ≤640px) — no labels — by ketyi (created: 2026-03-10 23:27 (UTC+8))
- #36668 [RFC]: Support Registry Mechanism for KVCacheSpec — RFC — by MengqingCao (created: 2026-03-10 23:24 (UTC+8))
- #36663 [Bug]: CUBLAS_STATUS_INVALID_VALUE in Docker due to LD_LIBRARY_PATH cuBLAS version conflict — no labels — by lishunyang12 (created: 2026-03-10 22:57 (UTC+8))
- #36660 [Bug]: Duplicate token with different logprob when requesting top — bug — by CoolFish88 (created: 2026-03-10 22:22 (UTC+8))
- #36653 [Bug]: qwen3.5 Mismatch in image token count between text and input_ids. Got ids=[4091] — bug — by JakubCerven (created: 2026-03-10 21:32 (UTC+8))
- #36613 [Bug]: CUDA ILM (Illegal Memory Access) crash when enabling MTP for Qwen3.5-397B-A17B under high concurrency — bug — by xiaochengyige (created: 2026-03-10 16:49 (UTC+8))
- #36623 [Bug]: OOM when --kv-offloading-size>1024 — bug — by xiejibing (created: 2026-03-10 17:22 (UTC+8))
- #36624 [Bug] External LB test_external_lb_dp[4] failing since shutdown timeout PR #34730 — no labels — by elvircrn (created: 2026-03-10 17:28 (UTC+8))
- #36651 cumem allocator: double-free and stale error codes during sleep/wake cycles — no labels — by markrogersjr (created: 2026-03-10 20:47 (UTC+8))
- #36604 [Performance]: What is the performance impact of upgrading from HTTP/1.1 to HTTP/2 or QUIC? — performance — by sl99897 (created: 2026-03-10 15:39 (UTC+8))
- #36615 [Bug]: unknown error trying to run vllm v0.17.0 with ROCm on Radeon 8060S (gfx1151) — bug,rocm — by anomaly256 (created: 2026-03-10 16:52 (UTC+8))
- #36584 [Bug]: Qwen3.5-35B-A3B FlashInfer JIT compilation fails with C++17 feature errors (e.g. std::is_unsigned_v) when using vLLM 0.17.0 — bug — by FLCgeigei (created: 2026-03-10 11:51 (UTC+8))
- #36585 [Bug]: qwen3.5-27b-gptq deploy fail — bug — by xiaotianns (created: 2026-03-10 11:59 (UTC+8))
- #36632 [Bug]: MiniMax-M2.5 reasoning missing in chat completions stream — bug — by JakubCerven (created: 2026-03-10 18:01 (UTC+8))
- #36629 [Performance]: W4A16+eagle3 not better than fp8+eagle3 with Qwen2.5-14B — performance — by ljy6j13 (created: 2026-03-10 17:45 (UTC+8))
- #36620 [Bug]: Does vllm support deploying glm-5 on A800 or A100, or are there any plans to support it? — bug — by yangzhipeng1108 (created: 2026-03-10 17:04 (UTC+8))
- #36583 [Bug]: Empty output when using FP8 + Tensor Parallel (2 GPUs) with Qwen3-8B — bug — by Cqh123-H (created: 2026-03-10 11:49 (UTC+8))
- #36594 [Bug]: DPEngineCoreProc may re-arm DP wave while paused (START_DP_WAVE ignores pause state), causing collective timeout after pause_generation + collective_rpc — cpu — by junjzhang (created: 2026-03-10 13:42 (UTC+8))
- #36589 [Bug]: SM 7.5 extreme slowness hangs indefinitely on T4 (vllm 0.17.0 with Qwen3.5-27B) — bug — by billysyt (created: 2026-03-10 12:59 (UTC+8))
Closed Issues
- #29645 [Bug]: Rotated samples extraction - accuracy loss — bug,stale — by Dineshkumar-Anandan-ZS0367 (closed: 2026-03-11 10:44 (UTC+8))
- #36228 [Bug]: Multiprocessing executor produces corrupted vision encoder outputs for Qwen3-VL in v0.12.0 (ray executor works correctly) — no labels — by AGENDD (closed: 2026-03-11 10:34 (UTC+8))
- #27901 [Feature]: Get confirmation on pending request(s) from vLLM V1 AsyncLLM servers — feature request,stale — by stevewx (closed: 2026-03-11 10:16 (UTC+8))
- #28135 [RFC]: Robust End-to-End CI/CD Regression Testing for Speculative Decoding — RFC,stale — by rahul-tuli (closed: 2026-03-11 10:15 (UTC+8))
- #28196 [Feature]: Add VLLM_INSTANCE_ID to KV Event — feature request,stale — by chickeyton (closed: 2026-03-11 10:15 (UTC+8))
- #28301 [Performance]: DeepSeek V3 performance drop when enabling MTP on both H200 and MI300X — performance,rocm,speculative-decoding,stale — by ChangLiu0709 (closed: 2026-03-11 10:15 (UTC+8))
- #28397 [Feature]: Add use zmq implement AFDConnect — feature request,stale — by lengrongfu (closed: 2026-03-11 10:15 (UTC+8))
- #28425 [Feature][RL]: Fix Fp8 Weight Loading for RL — feature request,stale — by robertgshaw2-redhat (closed: 2026-03-11 10:15 (UTC+8))
- #30358 [Bug]: NIXL PD disaggregate with host_buffer has accuracy issue - Prefill scheduled num_block mismatch at update_state_after_alloc and request_finished — bug,stale — by xuechendi (closed: 2026-03-11 10:15 (UTC+8))
- #28437 [Bug]: Unhandled exception: N9deep_gemm11DGExceptionE. — bug,stale — by bbartels (closed: 2026-03-11 10:15 (UTC+8))
- #36662 [Bug]: Deepseek-v3 fails on 8xB200 in v0.17.0 (including eager) — bug — by ProExpertProg (closed: 2026-03-11 10:04 (UTC+8))
- #36295 [Bug]: Deepseek R1 TRTLLM FP8 MoE produces garbage output — bug — by wzhao18 (closed: 2026-03-11 10:04 (UTC+8))
- #36625 [XPU][NixlConnector] Add ze_ipc transport support for single-node PD disaggregation — no labels — by Yanli2190 (closed: 2026-03-11 09:20 (UTC+8))
- #35925 [Bug]: Qwen3.5-35B-A3B generates corrupted responses when AITER is enabled — bug,rocm — by jennyyyyzhen (closed: 2026-03-11 05:26 (UTC+8))
- #35805 [Feature]: FlashInfer Sparse MLA + FP8 KV Cache — feature request — by benchislett (closed: 2026-03-11 01:28 (UTC+8))
- #29467 [CI Failure]: mi325_4: 2 Node Tests (4 GPUs in total) — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:41 (UTC+8))
- #29529 [CI Failure]: mi325_2: Distributed Tests (2 GPUs) — rocm,ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:40 (UTC+8))
- #34365 [CI Failure]: mi325_1: Async Engine, Inputs, Utils, Worker, Config Test (CPU) — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:40 (UTC+8))
- #34164 [CI Failure]: mi325_4: LM Eval Large Models — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:39 (UTC+8))
- #32221 [CI Failure]: Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:39 (UTC+8))
- #29515 [CI Failure]: mi325_1: V1 Test attention (H100) — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:39 (UTC+8))
- #29463 [CI Failure]: mi325_1: Language Models Tests (Standard) — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:39 (UTC+8))
- #29461 [CI Failure]: mi325_1: Language Models Test (PPL) — ci-failure — by AndreasKaratzas (closed: 2026-03-11 00:39 (UTC+8))
- #27998 [Bug]: Sequence Parallel Pass doesn't work for MoE models — bug,torch.compile — by jasonlizhengjian (closed: 2026-03-11 00:01 (UTC+8))
- #35708 [Bug]: vLLM-compile warm start appears to be saving artifacts — bug,torch.compile — by zou3519 (closed: 2026-03-10 21:40 (UTC+8))
- #36624 [Bug] External LB test_external_lb_dp[4] failing since shutdown timeout PR #34730 — no labels — by elvircrn (closed: 2026-03-10 21:20 (UTC+8))
- #35138 [Bug]: Qwen/Qwen3.5-397B-A17B-FP8 and Qwen/Qwen3.5-397B-A17B has accuracy issues when running with Flashinfer Attention backend on Blackwell. — bug,nvidia — by xinli-sw (closed: 2026-03-10 18:32 (UTC+8))
- #30288 [Bug]: fused_moe_lora_op compile error — bug,stale — by alex-petrenko (closed: 2026-03-10 11:40 (UTC+8))
New PRs
- #36715 [Doc] Fix grammatically incorrect error message in gpu_worker and xpu_worker — v1 — by Hongbin10 (created: 2026-03-11 06:26 (UTC+8))
- #36706 [WIP][Frontend] Add SubscribeKvEvents KV cache streaming in gRPC server — documentation,frontend — by smfirmin (created: 2026-03-11 05:22 (UTC+8))
- #36733 Vllm add dflash — new-model,speculative-decoding,v1,qwen — by biggestCjb (created: 2026-03-11 10:51 (UTC+8))
- #36614 fix bugs when token_classify & classify run concurrently — ready — by staugust (created: 2026-03-10 16:49 (UTC+8))
- #36726 [Refactor] Remove deadcode in Responses API serving — frontend — by sfeng33 (created: 2026-03-11 09:10 (UTC+8))
- #36734 Automatically increased max_num_batched_tokens under Mamba align mode — no labels — by flutist (created: 2026-03-11 11:09 (UTC+8))
- #36685 Compute MM Pre-processing Timing Metrics in a Non-Blocking Way for Prod — v1,multi-modality,meta-exported,fb-exported — by d-biswa (created: 2026-03-11 01:28 (UTC+8))
- #36691 [Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling — bug,ready,v1,deepseek — by MatthewBonanni (created: 2026-03-11 02:03 (UTC+8))
- #36728 [Bug][MoE] Strengthen _supports_current_device() checks in the TRTLLM FP8, NVFP4, and FlashInfer CuteDSL MoE experts — bug,nvidia — by yzong-rh (created: 2026-03-11 09:27 (UTC+8))
- #36732 [MoE Refactor] Migrate UnquantizedFusedMoEMethod and oracle to MK flow — nvidia — by yzong-rh (created: 2026-03-11 10:36 (UTC+8))
- #36678 [Bugfix] Miscellaneous MoE bug fixes — bug — by bnellnm (created: 2026-03-11 00:40 (UTC+8))
- #36731 [Bugfix] Fix Eagle3 aux_hidden_state_layers indexing with pipeline parallelism — bug,llama,qwen — by alvinttang (created: 2026-03-11 10:26 (UTC+8))
- #36612 [XPU] Add deepseek_scaling_rope fused kernel — deepseek — by yitingw1 (created: 2026-03-10 16:45 (UTC+8))
- #36638 [WIP][Bugfix] Fix negative prompt token counter crash under KV offloading — bug,v1 — by haosdent (created: 2026-03-10 18:49 (UTC+8))
- #36719 [ci] Bound nvidia-cudnn-frontend version — ready,ci/build,nvidia — by khluu (created: 2026-03-11 06:42 (UTC+8))
- #36633 FunASR model bugfix — bug,ready — by AllenDou (created: 2026-03-10 18:18 (UTC+8))
- #36723 [DSV3.2][MTP] Optimize Indexer MTP handling — ready,v1 — by benchislett (created: 2026-03-11 07:43 (UTC+8))
- #36634 fix mtp launch error in vllm-0.17.0, about cuda graph during memory profile — v1,nvidia — by flutist (created: 2026-03-10 18:23 (UTC+8))
- #36591 [Docs] Add GSM8K accuracy benchmark example — documentation — by ZhuangYu07 (created: 2026-03-10 13:06 (UTC+8))
- #36670 [Bugfix][Model] Fix DeepSeek-OCR TensorSchema crash on empty images_crop — bug,multi-modality,deepseek — by ketyi (created: 2026-03-10 23:28 (UTC+8))
- #36711 [ROCm][CI] Corrected the GPT-OSS test root path — rocm,ci/build,gpt-oss — by AndreasKaratzas (created: 2026-03-11 05:57 (UTC+8))
- #36727 fix: replace deprecated F.sigmoid with torch.sigmoid in linear_attn (… — no labels — by xueliangyang-oeuler (created: 2026-03-11 09:24 (UTC+8))
- #36599 [Bugfix] Warm up Triton autotuner for GDN layers during V1 profiling — bug,qwen — by AuYang261 (created: 2026-03-10 14:37 (UTC+8))
- #36665 platforms: Fix Ray DP startup crash — ready — by itayalroy (created: 2026-03-10 23:14 (UTC+8))
- #36724 [Bug][MoE] Fix TRTLLM EScoreBias Precision — bug,performance,needs-rebase,ci/build,qwen,nvidia — by robertgshaw2-redhat (created: 2026-03-11 08:37 (UTC+8))
- #36725 [Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision — bug,ready,nvidia — by robertgshaw2-redhat (created: 2026-03-11 08:39 (UTC+8))
- #36700 [Misc] Added curl retries in install_python_libraries.sh — no labels — by dmitry-tokarev-nv (created: 2026-03-11 04:49 (UTC+8))
- #36606 [ROCm][Quantization] improve quant dtype parser robust for W4A8 — rocm — by xuebwang-amd (created: 2026-03-10 16:01 (UTC+8))
- #36713 [Doc] Fix duplicate words in comments — ready,v1,multi-modality,nvidia — by Hongbin10 (created: 2026-03-11 06:20 (UTC+8))
- #36717 [Doc] Fix grammatically incorrect comment: "doesn't allow to override" -> "doesn't allow overriding" — no labels — by Hongbin10 (created: 2026-03-11 06:35 (UTC+8))
- #36714 [Doc] Capitalize docstrings in vllm/config/ — no labels — by Hongbin10 (created: 2026-03-11 06:23 (UTC+8))
- #36659 [ROCm] Enable FP8 inference on gfx1201 AMD RDNA4 (Radeon AI PRO R9700) with aiter kernels — rocm,needs-rebase — by vllmellm (created: 2026-03-10 22:15 (UTC+8))
- #36721 Add torch._scaled_mm backend for NVFP4 linear — no labels — by NikhilAPatel (created: 2026-03-11 07:38 (UTC+8))
- #36722 [vLLM IR] 2/N batch-invariant-aware dispatching and rms_norm — rocm,ci/build,nvidia — by ProExpertProg (created: 2026-03-11 07:40 (UTC+8))
- #36702 [ROCm] Attention selector reordering — documentation,rocm,ci/build,v1 — by gshtras (created: 2026-03-11 05:05 (UTC+8))
- #36720 [Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling — bug,rocm,v1,nvidia — by JartX (created: 2026-03-11 07:16 (UTC+8))
- #36593 [XPU]Bug fix for some unexpected error when use AgRs backend on XPU device. — bug,ready,v1 — by ys950902 (created: 2026-03-10 13:35 (UTC+8))
- #36716 [ROCm]: Re-disable rope customop on rocm and fix rope+kvcache fusion conditions — rocm — by Rohan138 (created: 2026-03-11 06:31 (UTC+8))
- #36695 Create test_async_flashinfer_combinations.py — v1,nvidia — by puririshi98 (created: 2026-03-11 04:05 (UTC+8))
- #36712 [Doc] Fix typos in comments: i.e./e.g. punctuation and "archive" -> "achieve" — no labels — by Hongbin10 (created: 2026-03-11 06:13 (UTC+8))
- #36609 [Minor] Enhance error message for TRTLLM decode uniformity check — ready,v1,nvidia — by WoosukKwon (created: 2026-03-10 16:31 (UTC+8))
- #36666 [Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish — frontend,ready,v1 — by markmc (created: 2026-03-10 23:20 (UTC+8))
- #36607 feat(models): enable Qwen3.5 text-only (Qwen3_5ForCausalLM) — IsHybrid, SupportsMRoPE, VL weight remapping — new-model,qwen — by groxaxo (created: 2026-03-10 16:03 (UTC+8))
- #36710 [Perf] Optimize compute maxsim using batched version, 3.2% E2E throughput improvement — frontend,ready,v1 — by yewentao256 (created: 2026-03-11 05:49 (UTC+8))
- #36707 fix: align lfm2 thumbnail token counting with HF — no labels — by tianshu-Michael-yu (created: 2026-03-11 05:23 (UTC+8))
- #36705 [Kernel][Helion] [16/N] Refactor register_kernel API to eagerly configure Helion kernel upon importing — no labels — by gmagogsfm (created: 2026-03-11 05:16 (UTC+8))
- #36708 fix: disambiguate multimodal prefix cache keys — v1 — by tianshu-Michael-yu (created: 2026-03-11 05:23 (UTC+8))
- #36709 [ROCm][Bugfix] Fix NaN corruption — bug,rocm,v1 — by indivats (created: 2026-03-11 05:26 (UTC+8))
- #36698 [Kernel] [Helion] [15/N] Split config files into per-platform files — no labels — by gmagogsfm (created: 2026-03-11 04:24 (UTC+8))
- #36701 [Core] Remove FlashAttention block size restriction for hybrid models — v1 — by tdoublep (created: 2026-03-11 04:52 (UTC+8))
- #36699 Add tuned H100 MoE configs for LFM2 8B and 24B — no labels — by tianshu-Michael-yu (created: 2026-03-11 04:32 (UTC+8))
- #36681 [ROCm][Perf] Allow MTP lens > 1 in Sparse MLA — rocm,speculative-decoding,ready,v1 — by tvirolai-amd (created: 2026-03-11 01:12 (UTC+8))
- #36696 Protect Download from Object Storage by Multiple Processes with Runai Model Streamer — rocm,ci/build — by noa-neria (created: 2026-03-11 04:12 (UTC+8))
- #36687 [PD][Nixl] Add support for hybrid SSM-FA models — v1,kv-connector — by NickLucche (created: 2026-03-11 01:30 (UTC+8))
- #36694 Use FLASH_ATTN instead of FLASHINFER for Whisper — nvidia — by WoosukKwon (created: 2026-03-11 03:53 (UTC+8))
- #36689 [Feature]: Fused CUTLASS GEMM + static FP8 output quant in epilogue — performance,nvidia — by Itssshikhar (created: 2026-03-11 01:50 (UTC+8))
- #36693 [Compile] Fix compile warning st256_cs in cuda_vec_utils.cuh — ready,nvidia — by yewentao256 (created: 2026-03-11 03:15 (UTC+8))
- #36683 [Kernel] [Helion] [14/N] Set autotune_ignore_errors=True during autotuning — no labels — by gmagogsfm (created: 2026-03-11 01:17 (UTC+8))
- #36677 [Kernel][Helion][13/N] Force static_shapes=False in helion register — no labels — by gmagogsfm (created: 2026-03-11 00:37 (UTC+8))
- #36692 Use client-provided UUIDs directly as cache keys — meta-exported,fb-exported — by tensorAI (created: 2026-03-11 02:46 (UTC+8))
- #36684 fix(kv-cache): increase hybrid attention grouping threshold from 1.25 to 1.5 — ready,v1,ready-run-all-tests — by hai-meh-cs (created: 2026-03-11 01:19 (UTC+8))
- #36616 [Bugfix] Fix FlashMLA sparse accuracy with topk_length and zero-init padding — bug,v1 — by AjAnubolu (created: 2026-03-10 16:55 (UTC+8))
- #36690 [AMD] fix to run MLA with kv cache dtype = fp8 — bug,rocm,deepseek — by krishna-kylist (created: 2026-03-11 01:52 (UTC+8))
- #36686 Move to Zensical for docs build — documentation,ci/build — by aireilly (created: 2026-03-11 01:29 (UTC+8))
- #36682 [Test] Add normal-case tests for split_graph — no labels — by SoluMilken (created: 2026-03-11 01:12 (UTC+8))
- #36673 [Misc][Attention] Clean up unused method in CPU_ATTN — ready,v1,cpu — by MatthewBonanni (created: 2026-03-10 23:52 (UTC+8))
- #36680 [ROCm][Perf] Allow MTP len > 1 for MLA — rocm,speculative-decoding,v1 — by tvirolai-amd (created: 2026-03-11 00:53 (UTC+8))
- #36679 [Bugfix] stream failure when model name not in audio endpoints — bug,frontend — by ekagra-ranjan (created: 2026-03-11 00:44 (UTC+8))
- #36674 [Bug] Fix FlashInfer MNNVL socket collisions under concurrent vLLM jobs — bug,ready,nvidia — by yewentao256 (created: 2026-03-11 00:04 (UTC+8))
- #36675 [Misc] Sync pre-commit to 4.5.1 in workflows and docs — documentation,ci/build — by SoluMilken (created: 2026-03-11 00:11 (UTC+8))
- #36595 [Bugfix] Avoid merging empty-only partitions into splitting-op subgraphs — bug,ready,ready-run-all-tests — by ZJY0516 (created: 2026-03-10 14:18 (UTC+8))
- #36582 [compile] Apply stored functorch config while finalizing loaded artifacts. — ready — by zhxchen17 (created: 2026-03-10 11:48 (UTC+8))
- #36626 [Model Runner V2] Use unpadded num_tokens for PW CUDA graph attn metadata — bug,v1,nvidia — by WoosukKwon (created: 2026-03-10 17:36 (UTC+8))
- #36667 [Refactor] Remove Molmo2 processor wrapper — ready — by DarkLight1337 (created: 2026-03-10 23:21 (UTC+8))
- #36658 Add: Eagle3 support for Qwen3.5 — ready,qwen — by rahul-tuli (created: 2026-03-10 22:14 (UTC+8))
- #36672 Remove unused config field from Gemma2 — ready — by hmellor (created: 2026-03-10 23:50 (UTC+8))
- #36661 feat(attention): extract KV-cache update from TreeAttention backend — v1 — by cong-or (created: 2026-03-10 22:30 (UTC+8))
- #36671 chunk parakeet into 30s clips to prevent OOMs on long audios — no labels — by netanel-haber (created: 2026-03-10 23:33 (UTC+8))
- #36588 [Model Runner V2] Fix mm input embeddings lookup — ready,v1 — by njhill (created: 2026-03-10 12:16 (UTC+8))
- #36603 fix(lora): fix IndexError and GQA tensor size mismatch in QKV LoRA la… — no labels — by dzhengAP (created: 2026-03-10 15:31 (UTC+8))
- #36596 [Perf] add packed recurrent fast path for decode — qwen — by caozuoba (created: 2026-03-10 14:26 (UTC+8))
- #36664 Add gigachat 3.1 tool parser + fix gigachat3 tool parser — documentation,tool-calling — by ajpqs (created: 2026-03-10 23:08 (UTC+8))
- #36655 [Frontend] Allow engine_client=None in OpenAIServingModels — frontend — by sagearc (created: 2026-03-10 22:03 (UTC+8))
- #36597 [CI/Build] enable Intel XPU test flow with prebuilt image — ci/build — by wendyliu235 (created: 2026-03-10 14:33 (UTC+8))
- #36645 [kv_offload+HMA][4/N]: Support sliding window lookup — v1,kv-connector — by orozery (created: 2026-03-10 19:34 (UTC+8))
- #36647 [GDN] add an env variable for gdn kernel selection — qwen — by ZJY0516 (created: 2026-03-10 19:47 (UTC+8))
- #36635 [NemotronH] Small fix reasoning parser — bug,ready,nvidia — by roikoren755 (created: 2026-03-10 18:38 (UTC+8))
- #36652 Fix Qwen3.5 LoRA packed module mapping — qwen — by notimesea (created: 2026-03-10 21:12 (UTC+8))
- #36641 [torch.compile] Let Non-CUDA platforms provide default sp_min_token_num — nvidia — by realliujiaxu (created: 2026-03-10 19:08 (UTC+8))
- #36605 [MM][OOT] Support CPU seq_lens for OOT MMEncoderAttention kernels — qwen — by shen-shanshan (created: 2026-03-10 15:49 (UTC+8))
- #36630 [Bugfix] Fix processor signature — bug,ready — by zucchini-nlp (created: 2026-03-10 17:56 (UTC+8))
- #36628 [Frontend][Core] Revert "Add shutdown timeout" (#34730 and #36270) — frontend,ready,v1 — by markmc (created: 2026-03-10 17:41 (UTC+8))
- #36650 [DO NOT MERGE] Reapply "[BugFix] Fix engine hanging after KV cache initialization failure #35478" — bug,ready,v1 — by markmc (created: 2026-03-10 20:08 (UTC+8))
- #36649 [WIP] [Hybrid][GDN] Enable prefix caching 'all' mode for Qwen3.5/Qwen3Next — needs-rebase,v1,qwen — by haosdent (created: 2026-03-10 20:03 (UTC+8))
- #36648 [WIP][LoRA] Add support for PEFT trainable_tokens_delta weights — needs-rebase — by haosdent (created: 2026-03-10 20:02 (UTC+8))
- #36646 [DO NOT MERGE][Core] Revert "Fix benign error log during normal shutdown (#36270)" — ready,v1 — by markmc (created: 2026-03-10 19:43 (UTC+8))
- #36644 [kv_offload+HMA][3/N]: Remove block_size from KVEvents — v1,kv-connector — by orozery (created: 2026-03-10 19:32 (UTC+8))
- #36642 [kv_offload+HMA][2/N]: Support multiple KV groups in GPULoadStoreSpec — v1,kv-connector — by orozery (created: 2026-03-10 19:20 (UTC+8))
- #36610 [kv_offload+HMA][1/N]: Support multiple KV groups in OffloadingSpec — v1,kv-connector — by orozery (created: 2026-03-10 16:33 (UTC+8))
- #36639 [WIP][Bugfix] Fix causal_conv1d assertion crash during CUDA graph capture (#36566) — bug,nvidia — by haosdent (created: 2026-03-10 19:02 (UTC+8))
- #36637 [WIP] [Bugfix] Respect scale_attn_weights config flag in GPTBigCode — bug — by haosdent (created: 2026-03-10 18:49 (UTC+8))
- #36636 [WIP] [Bugfix] Fix KV cache offloading for hybrid attention models (e.g. Qwen3.5-27B) — bug,qwen,kv-connector — by haosdent (created: 2026-03-10 18:41 (UTC+8))
- #36619 [DO NOT MERGE] Test revert "Remove busy loop from idle buffer readers" (#28053 and #36068) — ready,v1 — by markmc (created: 2026-03-10 17:02 (UTC+8))
- #36587 skip triton when graph capture — qwen — by flutist (created: 2026-03-10 12:05 (UTC+8))
- #36622 [Bugfix] Fix off-by-one in multimodal prefix cache hash boundary check — bug,v1 — by AjAnubolu (created: 2026-03-10 17:17 (UTC+8))
- #36621 [Bugfix] Fix FP8 online quantization premature trigger with TP sharded weights — bug — by AjAnubolu (created: 2026-03-10 17:17 (UTC+8))
- #36617 [Bugfix] Fix prefix caching for hybrid models with non-uniform page sizes — bug,v1 — by AjAnubolu (created: 2026-03-10 16:55 (UTC+8))
- #36618 Revert "[cudagraph] fix cudagraph warning in deepseekv32 (#28044)" — deepseek,nvidia — by elvircrn (created: 2026-03-10 16:59 (UTC+8))
- #36611 [Bugfix] Fix FP8 MLA CUDAGraph stale tile scheduler metadata — bug,v1,nvidia — by AjAnubolu (created: 2026-03-10 16:34 (UTC+8))
- #36608 [Bugfix] Fix DP wave race condition re-arming engine while paused — bug,v1 — by AjAnubolu (created: 2026-03-10 16:17 (UTC+8))
- #36600 [Models] Align nemotron-nano-vl's audio_in_video processing — multi-modality — by Isotr0py (created: 2026-03-10 14:51 (UTC+8))
- #36590 [Model] Add HyperCLOVAX-SEED-Omni-8B model support — new-model — by bigshanedogg (created: 2026-03-10 13:06 (UTC+8))
- #36592 [Kernel] Flush L2 cache in benchmark_moe to reduce cache warmth bias — performance — by Jerry2423 (created: 2026-03-10 13:35 (UTC+8))
- #36586 [LMCache] Fault Tolerance Mechanism — kv-connector — by Oasis-Git (created: 2026-03-10 12:03 (UTC+8))
- #36581 [DO NOT REVIEW] Add versioned Helion kernel support with CI policy enforcement — documentation,needs-rebase — by gmagogsfm (created: 2026-03-10 11:42 (UTC+8))
已合并 PR
- #36614 fix bugs when token_classify & classify run concurrently — ready — by staugust (合并于: 2026-03-11 11:16 (UTC+8))
- #36201 [openapi server] log exception in exception handler(2/N) — frontend,ready — by andyxning (合并于: 2026-03-11 11:16 (UTC+8))
- #36691 [Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling — bug,ready,v1,deepseek — by MatthewBonanni (合并于: 2026-03-11 11:01 (UTC+8))
- #36633 FunASR model bugfix — bug,ready — by AllenDou (合并于: 2026-03-10 23:14 (UTC+8))
- #36296 [Bug] Fix TRTLLM Block FP8 MoE Monolithic — bug,ready,nvidia — by wzhao18 (合并于: 2026-03-11 10:04 (UTC+8))
- #36397 fix: check HTTP status in batch read_file to prevent silent failures — frontend,ready — by alvinttang (合并于: 2026-03-10 22:22 (UTC+8))
- #36090 [ROCm][CI] Making some tests optional to reduce workload — rocm,ready,ci/build — by AndreasKaratzas (合并于: 2026-03-11 07:45 (UTC+8))
- #36609 [Minor] Enhance error message for TRTLLM decode uniformity check — ready,v1,nvidia — by WoosukKwon (合并于: 2026-03-11 06:38 (UTC+8))
- #36041 [Model Runner V2] Add initial CI tests — ready,ci/build,v1 — by njhill (合并于: 2026-03-11 05:55 (UTC+8))
- #36521 [Core] Simplify core kv-cache blocks initialization logic — ready,v1,ready-run-all-tests — by njhill (合并于: 2026-03-11 04:20 (UTC+8))
- #36508 [Misc] fix typo: homogenous-> homogeneous (2 lines change) — v1 — by SoluMilken (合并于: 2026-03-10 21:25 (UTC+8))
- #36340 [Test]
test_async_scheduling.pyimprovements — ready,v1 — by njhill (合并于: 2026-03-11 02:17 (UTC+8)) - #36595 [Bugfix] Avoid merging empty-only partitions into splitting-op subgraphs — bug,ready,ready-run-all-tests — by ZJY0516 (合并于: 2026-03-10 22:39 (UTC+8))
- #36582 [compile] Apply stored functorch config while finalizing loaded artifacts. — ready — by zhxchen17 (合并于: 2026-03-11 00:34 (UTC+8))
- #36626 [Model Runner V2] Use unpadded num_tokens for PW CUDA graph attn metadata — bug,v1,nvidia — by WoosukKwon (合并于: 2026-03-11 00:30 (UTC+8))
- #36104 [CI] Bump
mypyversion to 1.19.1 — ready,v1,kv-connector — by hmellor (合并于: 2026-03-11 00:18 (UTC+8)) - #35719 [ROCm][Perf] Enable
sparse_mla’s cudagraph on ROCm platform — rocm,ready,v1,nvidia — by ganyi1996ppo (合并于: 2026-03-11 00:14 (UTC+8)) - #36519 [Bugfix][Sparse MLA] report indexer CG support properly — bug,ready,v1 — by MatthewBonanni (合并于: 2026-03-11 00:14 (UTC+8))
- #34304 Improvements to wvSplitKrc skinny GEMM solution — rocm,ready — by amd-hhashemi (合并于: 2026-03-11 00:14 (UTC+8))
- #36588 [Model Runner V2] Fix mm input embeddings lookup — ready,v1 — by njhill (合并于: 2026-03-10 15:24 (UTC+8))
- #36580 [Model Runner V2] Fix `_compute_slot_mappings_kernel` for chunked prefill — ready,v1 — by njhill (合并于: 2026-03-10 15:23 (UTC+8))
- #35200 Fix `hf_override_fn` when it modifies `model_type` — ready — by hmellor (合并于: 2026-03-10 23:03 (UTC+8))
- #35342 feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency — ready,v1,kv-connector — by Srinivasoo7 (合并于: 2026-03-10 22:43 (UTC+8))
- #36479 [Model] Consolidate score logic by introduce score_type — new-model,frontend,ready,qwen — by noooop (合并于: 2026-03-10 21:32 (UTC+8))
- #36630 [Bugfix] Fix processor signature — bug,ready — by zucchini-nlp (合并于: 2026-03-10 21:20 (UTC+8))
- #36628 [Frontend][Core] Revert “Add shutdown timeout” (#34730 and #36270) — frontend,ready,v1 — by markmc (合并于: 2026-03-10 21:20 (UTC+8))
- #36532 Fix Qwen2.5-VL test for Transformers v5 — documentation,ready,qwen — by hmellor (合并于: 2026-03-10 20:05 (UTC+8))
- #36169 feat(grpc): extract gRPC servicer into smg-grpc-servicer package, add `--grpc` flag to vllm serve — rocm,frontend,ready,ci/build — by CatherineSue (合并于: 2026-03-10 18:29 (UTC+8))
- #36534 Fix LFM2 MoE test for Transformers v5 — ready — by hmellor (合并于: 2026-03-10 13:29 (UTC+8))
- #35219 [BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU — bug,ready,v1,qwen,ready-run-all-tests — by vadiklyutiy (合并于: 2026-03-10 18:32 (UTC+8))
- #36494 Fix: Re-Enable EP for trtllm MoE FP8 backend — ready,nvidia — by amirkl94 (合并于: 2026-03-10 14:11 (UTC+8))
- #36557 [Bugfix] Fix `RuntimeError: Already borrowed` that degrades VLM serving throughput under concurrent load — bug,ready — by hallerite (合并于: 2026-03-10 13:30 (UTC+8))
- #36546 Remove unused disable_fallback field — ready — by zhuohan123 (合并于: 2026-03-10 11:56 (UTC+8))
- #36159 [Perf] Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement — frontend,ready,v1 — by yewentao256 (合并于: 2026-03-10 11:55 (UTC+8))
关闭但未合并的 PR
- #28631 [Frontend][Renderer] Refactor score API — documentation,frontend,multi-modality — by noooop (关闭于: 2026-03-11 10:56 (UTC+8))
- #35966 [feat] Kimi K2/DeepSeek Support eagle3 — speculative-decoding,v1,deepseek — by leihuang-sketch (关闭于: 2026-03-11 04:34 (UTC+8))
- #28175 [vllm][OutputProcessor] Allow propagate engine core additional outputs back to client — needs-rebase,stale,v1 — by KingsleyZhang123 (关闭于: 2026-03-11 10:15 (UTC+8))
- #28313 Make tests/lora/utils usable by plugins — stale — by vanbasten23 (关闭于: 2026-03-11 10:15 (UTC+8))
- #36638 [WIP][Bugfix] Fix negative prompt token counter crash under KV offloading — bug,v1 — by haosdent (关闭于: 2026-03-11 10:14 (UTC+8))
- #36724 [Bug][MoE] Fix TRTLLM EScoreBias Precision — bug,performance,needs-rebase,ci/build,qwen,nvidia — by robertgshaw2-redhat (关闭于: 2026-03-11 08:38 (UTC+8))
- #36717 [Doc] Fix grammatically incorrect comment: “doesn’t allow to override” -> “doesn’t allow overriding” — 无标签 — by Hongbin10 (关闭于: 2026-03-11 08:13 (UTC+8))
- #36714 [Doc] Capitalize docstrings in vllm/config/ — 无标签 — by Hongbin10 (关闭于: 2026-03-11 08:11 (UTC+8))
- #30574 MLA Based Eagle3 — new-model,speculative-decoding,v1,deepseek — by IzzyPutterman (关闭于: 2026-03-11 07:29 (UTC+8))
- #35918 [Model Runner V2] Fix FA3 cuda graphs — v1,nvidia — by njhill (关闭于: 2026-03-11 05:27 (UTC+8))
- #29853 [WIP] Calculate MFU/MBU for whole model forward pass — stale — by linwang-aviva (关闭于: 2026-03-11 04:36 (UTC+8))
- #35247 Create test_deepseek_v32_config.py — v1,deepseek — by puririshi98 (关闭于: 2026-03-11 04:18 (UTC+8))
- #35333 [Perf] Optimize model runner v2 `prepare_inputs` copy logic, 6.1% E2E throughput improvement — ready,needs-rebase,v1,kv-connector,v2 — by yewentao256 (关闭于: 2026-03-11 03:48 (UTC+8))
- #34903 [Bug] Fix illegal memory access issue for model runner v2 — bug,ready,needs-rebase,v1,nvidia — by yewentao256 (关闭于: 2026-03-11 03:45 (UTC+8))
- #36523 [Platform] Add MPS (Apple Metal) platform support for macOS — documentation,performance,ci/build,v1,cpu — by robtaylor (关闭于: 2026-03-11 03:31 (UTC+8))
- #36021 bump flashinfer v0.6.4 -> v0.6.6 — ci/build,nvidia — by netanel-haber (关闭于: 2026-03-11 03:21 (UTC+8))
- #36680 [ROCm][Perf] Allow MTP len > 1 for MLA — rocm,speculative-decoding,v1 — by tvirolai-amd (关闭于: 2026-03-11 01:11 (UTC+8))
- #32204 [Core][KVConnector] Support HMA+NixlConnector — needs-rebase,v1,kv-connector — by NickLucche (关闭于: 2026-03-11 01:10 (UTC+8))
- #36338 Add kv_transfer_params to streaming responses — frontend — by RishabhSaini (关闭于: 2026-03-11 01:07 (UTC+8))
- #36661 feat(attention): extract KV-cache update from TreeAttention backend — v1 — by cong-or (关闭于: 2026-03-10 22:36 (UTC+8))
- #30411 fix(gguf): Ensure Gemma2 configs have hidden_act for backward compatibility — ready — by kitaekatt (关闭于: 2026-03-10 23:51 (UTC+8))
- #32551 [Performance] Split FlexAttention and FlashInfer attention and cache update — performance,rocm,needs-rebase,v1,nvidia — by Etelis (关闭于: 2026-03-10 23:14 (UTC+8))
- #33271 [ROCm] Change default settings for ROCm — rocm,needs-rebase,v1 — by gshtras (关闭于: 2026-03-10 23:03 (UTC+8))
- #34998 [ROCm] Check that AITER MHA is not selected with sinks — rocm — by gshtras (关闭于: 2026-03-10 23:03 (UTC+8))
- #36571 Fix issue 04 — v1 — by xueliangyang-oeuler (关闭于: 2026-03-10 22:22 (UTC+8))
- #36652 Fix Qwen3.5 LoRA packed module mapping — qwen — by notimesea (关闭于: 2026-03-10 21:27 (UTC+8))
- #36371 [compile] Remove strict_autograd_cache and force_non_lazy_backward_lowering workaround — ready — by zou3519 (关闭于: 2026-03-10 21:38 (UTC+8))
- #34244 [ROCm] [CI] Add new fusion test cases that are relevant to vLLM IR Ops — rocm,needs-rebase,ci/build — by tjtanaa (关闭于: 2026-03-10 21:20 (UTC+8))
- #36650 [DO NOT MERGE] Reapply “[BugFix] Fix engine hanging after KV cache initialization failure #35478” — bug,ready,v1 — by markmc (关闭于: 2026-03-10 20:54 (UTC+8))
- #34061 [Cleanup] Unify vllm.utils.flashinfer and flashinfer_utils — nvidia — by veeceey (关闭于: 2026-03-10 21:14 (UTC+8))
- #36646 [DO NOT MERGE][Core] Revert “Fix benign error log during normal shutdown (#36270)” — ready,v1 — by markmc (关闭于: 2026-03-10 20:51 (UTC+8))
- #36619 [DO NOT MERGE] Test revert “Remove busy loop from idle buffer readers” (#28053 and #36068) — ready,v1 — by markmc (关闭于: 2026-03-10 19:31 (UTC+8))
- #34682 [kv_offload+HMA][2/N]: Support sliding window lookup — v1,kv-connector — by orozery (关闭于: 2026-03-10 17:36 (UTC+8))
- #34799 [kv_offload+HMA][3/N]: Scheduler-side support for multiple KV groups — v1,kv-connector — by orozery (关闭于: 2026-03-10 17:36 (UTC+8))
- #36618 Revert “[cudagraph] fix cudagraph warning in deepseekv32 (#28044)” — deepseek,nvidia — by elvircrn (关闭于: 2026-03-10 17:15 (UTC+8))
- #34680 [kv_offload+HMA][1/N]: Worker-side support for multiple HMA groups — v1 — by orozery (关闭于: 2026-03-10 16:26 (UTC+8))
- #35995 Fix: support auto_tune to run local model — performance — by panpan0000 (关闭于: 2026-03-10 16:11 (UTC+8))
- #32398 [KVConnector] OffloadingConnector: Add preemptions-only mode — needs-rebase,v1,kv-connector — by orozery (关闭于: 2026-03-10 16:10 (UTC+8))
- #36486 [Bugfix] Fix issues in quark emulative logic — bug — by wangjiaxin99 (关闭于: 2026-03-10 15:10 (UTC+8))
- #36567 fix: disable async scheduling by default due to instability (issue #3… — 无标签 — by xueliangyang-oeuler (关闭于: 2026-03-10 12:55 (UTC+8))
- #28368 [refactor] [mla]: independently passing q_nope & q_rope — stale,v1 — by vnadathur (关闭于: 2026-03-10 12:14 (UTC+8))
- #21849 [WIP] Add Kimi-Audio integration for vLLM — documentation,new-model,ci/build,v1 — by HelloWorldU (关闭于: 2026-03-10 11:42 (UTC+8))