vLLM Development Activity Report - 2026-02-24
Time window: 2026-02-24 11:31 (UTC+8) ~ 2026-02-25 11:31 (UTC+8). Stats: 18 new issues | 17 closed issues | 74 new PRs | 30 merged PRs | 26 PRs closed without merging
📊 Daily Development Status Summary
During this cycle (2026-02-24 to 2026-02-25), vLLM was in a phase of rapid iteration: 74 new PRs opened and 30 merged, reflecting active contribution and an efficient review-and-merge pipeline. Development focused on performance optimization, kernel improvements (especially on the AMD ROCm platform), and fixes for model-inference and compilation bugs. One notable trend: the stability and memory behavior of large multimodal models (particularly the Qwen family) under high concurrency drew in-depth discussion.
🎯 AMD/ROCm Ecosystem Updates
AMD-ecosystem activity was brisk this cycle, centered on ROCm performance optimization, kernel hardening, and CI test expansion.
- PR #35246 ([ROCm] Refactor attention backend selection logic): refactors the ROCm attention-backend selection logic to improve maintainability and make dispatch clearer.
- PR #35250 ([Bugfix][Hardware][AMD] Gate FP4 ops on gfx950 to prevent MI300X crash): a critical fix. Adds a hardware check that gates the gfx950-only FP4 BMM and FP4 ASM GEMM ops; previously, enabling the corresponding environment variables on gfx942 (MI300X) caused illegal-instruction crashes. The fix ensures these ops are enabled only on supported hardware.
- PR #35245 ([ROCm][WIP]: Fused aiter rope kvcache mla): works toward a fused RoPE + KV cache + concat kernel for the ROCm + AITER stack, aiming to speed up inference for MLA models such as DeepSeek and Kimi. This PR depends on several prerequisite PRs.
- PR #35239 ([ROCm][CI] Added MI325 mirrors (stage C)): expands AMD CI coverage by enabling "mirror" runs on MI325 nodes for more test groups (entrypoint integration, examples, standard language/multimodal model tests), a sign of maturing ROCm support and test coverage.
- Issue #35169 ([Bug]: Memory Access Fault during Step-3.5-Flash inference (ROCM)): a user reports a memory access fault when running a specific model on ROCm; the issue was auto-assigned to ROCm maintainers, illustrating the triage flow for AMD-platform problems.
- PR #35183 ([ROCm] Split wvSplitKrc into deterministic/fast kernels, add `--fast-skinny-gemm` CLI flag, overhaul tests): splits the `wvSplitKrc` kernel into a deterministic variant and a fast (non-deterministic) variant, selectable via a new CLI flag, with deterministic results as the default for correctness. The test suite was also rewritten, cutting runtime from over 90 minutes to under 7 minutes, a significant gain for development velocity.
- PR #35180 ([ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE): fixes and enables the AITER RoPE custom op by default, and turns on RoPE + KV cache fusion. Testing shows roughly a 1% speedup on the `gpt-oss` model, and the change lays groundwork for subsequent MLA fusion work.
Analysis: AMD-ecosystem work this cycle is deepening from "functionally usable" toward performance and polish. A critical fix preventing hardware crashes sits alongside fused-kernel development aimed at inference speed, while expanding CI coverage and faster test suites underpin long-term ROCm stability and developer experience. The deterministic/fast kernel split in particular reflects careful attention to differing production needs (correctness vs. peak performance).
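The hardware-gating pattern behind PR #35250 can be sketched as follows. This is a minimal illustration only; the function name, the environment variable, and the supported-architecture set are assumptions for the sketch, not vLLM's actual code:

```python
import os

def fp4_ops_enabled(gcn_arch: str) -> bool:
    """Illustrative gate: honor the opt-in environment variable only on
    architectures assumed to implement the FP4 instructions (gfx950 here);
    on anything else (e.g. gfx942 / MI300X), ignore the opt-in rather than
    risk an illegal-instruction fault."""
    SUPPORTED_FP4_ARCHS = {"gfx950"}  # assumption for this sketch
    opted_in = os.environ.get("VLLM_USE_FP4_ASM", "0") == "1"  # hypothetical var
    return opted_in and gcn_arch in SUPPORTED_FP4_ARCHS
```

The design point is that the check sits at op-registration time, so an unsupported GPU never even sees the fast path, regardless of user flags.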
💬 Analysis of High-Traffic Discussions
- Issue #35191 ([Bug]: Qwen3.5 397B FP8 fills 1TB RAM and OOM killed with high-concurrency multimodal requests)
- Core issue: when serving the Qwen3.5-397B-FP8 model under high-concurrency multimodal requests, system memory (RAM) grows linearly until the OOM killer fires (exhausting 1 TB).
- Viewpoints and triage:
- User (FWao): describes the symptoms in detail, noting that memory starts growing unboundedly once the multimodal cache hit rate drops to zero. Disabling prefix caching, changing the multimodal encoder TP mode, and disabling async scheduling all failed to help.
- Maintainers (DarkLight1337, ywang96): first ruled out prefix caching (off by default for hybrid models) and the usual high-concurrency warning logs; suggested the user try removing the `--mm-encoder-tp-mode data` flag.
- Another user (cjackal): points out this may be a long-standing bug in the Qwen2.5 vision processor (see #28230) and reproduced a similar OOM on the Qwen3 VL 235B model, with the traceback pointing at the multimodal-embedding function `_merge_multimodal_embeddings`.
- Point of contention: the root cause is still undetermined; it may be a memory-management defect in the multimodal pipeline, or a compatibility issue between the Qwen VL architecture and vLLM's multimodal path.
- Status: open and under investigation; maintainers are engaged and have pointed to a possibly related historical issue.
- Issue #35190 ([Feature]: Support for multiple embedding types in a single inference call)
- Core issue: vLLM 0.15 added multi-vector retrieval (e.g. ColBERT), but each vector type currently requires a separate API call. The user asks for a single inference call that returns all supported embedding types (dense, sparse, colbert vectors) to make RAG pipelines more efficient.
- Viewpoints and discussion:
- User (rgidda): argues the current design is unfriendly to applications doing hybrid search (dense + sparse) or ColBERT-style interaction, doubling compute and network overhead.
- Maintainer (noooop): suggests the scenario might be served by a plugin (`io_processor_plugin`).
- Contributor (staugust): notes a plugin can fuse the two calls into one, but because `PoolingParams` in `CachedRequestState` can carry only a single task, vLLM would still run inference twice. Proposes supporting single-call multi-vector retrieval via a plugin first, then upgrading `DispatchPooler.forward` to handle multiple tasks.
- Point of contention: no real disagreement; discussion centers on the implementation path: direct support in the frontend API vs. incremental evolution through the plugin mechanism.
- Status: open; leaning toward incremental improvement via the plugin route.
- Issue #35195 ([CI] Missing or incorrect SpanKind import in tracing code causes AttributeError) (no comments shown, but inferable from the title and state changes)
- Core issue: CI tracing tests fail due to a missing or incorrect import of OpenTelemetry's `SpanKind` enum.
- Viewpoints and actions: the issue was closed the same day it was filed. The associated PR #35211 reverts PR #35032, which had removed the supposedly redundant explicit OpenTelemetry pip install from CI configs; evidently some CI tests still rely on the old explicit install even after the OpenTelemetry dependency was bundled into the main vLLM package.
- Point of contention: none; a clear regression handled with a fast revert.
- Status: resolved by merging the revert PR.
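On the multi-embedding request in Issue #35190: the plugin route under discussion amounts to fanning one client call out to several pooling tasks server-side. A toy sketch of that shape (hypothetical names, not vLLM's plugin API), which also shows why the model still runs once per task until `DispatchPooler.forward` learns multi-task support:

```python
def embed_all(run_pooling_task, prompt, tasks=("dense", "colbert")):
    """Fan one logical request out to several pooling tasks and merge
    the results into a single response. `run_pooling_task` stands in
    for a single-task pooling call: note it is still invoked once per
    task, i.e. the plugin saves a round trip, not a forward pass."""
    return {task: run_pooling_task(prompt, task) for task in tasks}
```

A caller sees one request/response pair; the double indexing cost remains hidden on the server side, which is exactly the limitation staugust points out.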
🔥 Hot Topics and Trend Analysis
- Performance optimization and profiling:
- Issues #35254 and #35226 raise performance analysis and optimization at the macro level (a 4x speedup observed across a version upgrade) and at the per-model micro level (the DeepSeek 3.2 multi-stream indexer), respectively.
- PR #35229 lands a fused kernel targeting Blackwell (SM100), continuing the project's fast follow-up on new hardware features.
- Compilation and kernel stability:
- Issue #35238 and PR #35256 expose a dtype-propagation problem when `torch.compile` interacts with custom ops (`CustomOp`), crashing the model. This reflects the higher bar for operator robustness that compile-time optimization imposes.
- PR #35161 fixes the `expert_ids` padding value in the MoE alignment kernel, an important low-level correctness fix.
- Multimodal and vision-model challenges:
- Issue #35191 and the older #28230 show that large vision-language models, with the Qwen family as the prime example, face serious memory-management challenges in vLLM under high concurrency and long context.
- Issue #35169 shows that multimodal models can also hit low-level memory access faults on ROCm.
- Inference correctness and parsers:
- Issue #35221 and the associated PRs #35258 and #35230 all fix `Qwen3ReasoningParser` misparsing truncated reasoning output, underscoring the need for precise handling of thinking-mode edge cases.
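The dtype-propagation failure called out above (Issue #35238 / PR #35256) follows a common pattern: a custom op's native path upcasts to float32 internally for stability and forgets to cast back. A minimal sketch of the fix pattern, using a tiny stand-in class rather than torch (all names illustrative, not vLLM's implementation):

```python
import math

class FakeTensor:
    """Minimal stand-in carrying a dtype tag; purely illustrative."""
    def __init__(self, data, dtype="float16"):
        self.data, self.dtype = data, dtype

def rms_norm_native(x, weight, eps=1e-6):
    # Upcast: do the reduction in float32 for numerical stability.
    var = sum(v * v for v in x.data) / len(x.data)
    inv = 1.0 / math.sqrt(var + eps)
    out = [v * inv * w for v, w in zip(x.data, weight)]
    # Cast back to the caller's dtype before returning. Omitting this
    # step is the bug class reported: float32 leaks out of a
    # half-precision graph and torch.compile hits a dtype mismatch.
    return FakeTensor(out, dtype=x.dtype)
```

In the real fix the final cast is a `.to(orig_dtype)` on a torch tensor; the point is that the native path must preserve the input dtype contract that the compiled graph assumes.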
🛠️ Key Technical Changes
- PR #35156 ([BUGFIX][Qwen3.5] Hardcode `mlp.gate` as not quantizable): a quick fix for load failures of NVFP4-quantized models such as Qwen3.5-397B-A17B-NVFP4. Hardcoding the `mlp.gate` layer as not quantizable works around checkpoints that are missing the corresponding quantization-exclusion marker, keeping the new quantization format usable immediately.
- PR #35161 ([Bugfix] Fix expert_ids padding values in moe_align_block_size kernel): the `expert_ids` padding value should be -1 but was mistakenly set to 0. Although early-exit logic likely keeps accuracy unaffected today, the fix removes the risk of downstream kernels mistaking padding for a valid expert ID, improving robustness and future compatibility.
- PR #35229 ([Perf] Fused silu_mul_fp8_quant_packed kernel for SM100 DeepGEMM MoE): responding to Issue #35217, implements a fused `silu_mul_fp8_quant_packed` kernel for SM100 (Blackwell) in DeepGEMM MoE, merging the previously separate activation and quantization steps to raise FP8 MoE compute efficiency.
- PR #34055 ([Refactor] [1/N] Reorganize kernel abstraction directory): completes the kernel-abstraction directory restructuring, moving the `kernels` package out of the `quantization` directory up to the `model_executor` root. This is an important structural cleanup: non-quantization linear kernels no longer depend on a quantization-specific directory.
- PR #33726 ([Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support): adds MTP speculative decoding for the Nemotron-H model family and uniformly extends Mamba-style attention backends with speculative-decoding compatibility, letting more linear-attention models benefit from the technique.
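The fusion in PR #35229 merges two steps that would otherwise run as separate kernel launches: the SiLU-gated activation and FP8 quantization. A schematic of the math in plain Python, with integer rounding standing in for real FP8 conversion (the function name mirrors the kernel's, but this is not its implementation):

```python
import math

FP8_E4M3_MAX = 448.0  # max finite magnitude of the e4m3 format

def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))

def silu_mul_fp8_quant(gate, up):
    """One-pass schematic of the fused op: SiLU(gate) * up followed by
    per-tensor FP8 scaling, without materializing the intermediate
    activation tensor between two kernel launches."""
    acts = [silu(g) * u for g, u in zip(gate, up)]
    amax = max((abs(a) for a in acts), default=0.0)
    scale = (amax / FP8_E4M3_MAX) or 1.0  # guard the all-zero case
    return [round(a / scale) for a in acts], scale
```

Fusing saves a full read/write of the activation tensor through HBM, which is where the efficiency gain on memory-bound MoE layers comes from.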
📈 Development Activity Observations
- Fast merging: 30 PRs merged within 24 hours indicates an efficient core-team review-and-merge process and rapid project progress.
- Cross-platform contributions: contributors come not only from NVIDIA and AMD (e.g. the `xxx-amd` user) but also from Intel, Red Hat, and other companies, reflecting vLLM's broad appeal as a cross-hardware inference engine.
- Closing the loop on issues: 17 issues were closed, including several months-old "stale" issues auto-cleaned and quick resolutions like CI issue #35195, showing active community maintenance.
- RFCs and design discussion: Issue #35213 and PR #35228 on splitting the `EngineClient` protocol show the community thinking through the frontend architecture's long-term evolution.
💡 Issues Worth Watching
- Memory-leak risk in large multimodal models: the unbounded memory growth of Qwen3.5 VL under high concurrency (Issue #35191) could affect the production stability of any similarly architected vision-language model and deserves priority investigation.
- Compile optimization vs. operator compatibility: the `dtype` mismatch between `torch.compile` and custom ops exposed by Issue #35238 is a latent bottleneck for compile-mode stability; the "native" implementation path of every custom op warrants a systematic review.
- Stability of speculative-decoding accuracy tests: Issues #35168 and #35167 show that accuracy tests based on exact output-text matching are inherently flaky given model nondeterminism; the test methodology may need adjusting (e.g. deterministic prompts or relaxed matching criteria).
- Architecture evolution: the proposal in Issue #35213 to split the `EngineClient` protocol lays the foundation for a disaggregated frontend/backend (e.g. `vllm online`) and is an architectural direction worth following.
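On the flaky spec-decode accuracy tests (#35168/#35167): one mooted methodology change, relaxing exact-match comparison into a similarity threshold, can be sketched as follows (illustrative only; these helpers are not from the vLLM test suite):

```python
def exact_match(ref: str, out: str) -> bool:
    """The brittle criterion: any nondeterministic token flips the test."""
    return ref == out

def relaxed_match(ref: str, out: str, threshold: float = 0.6) -> bool:
    """Tolerant criterion: accept if the fraction of position-wise
    matching tokens meets a threshold, absorbing benign divergence."""
    ref_t, out_t = ref.split(), out.split()
    same = sum(a == b for a, b in zip(ref_t, out_t))
    return same / max(len(ref_t), 1) >= threshold
```

The trade-off is classic CI hygiene: a threshold low enough to absorb sampling noise but high enough to still catch genuine speculative-decoding regressions.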
📋 Appendix: Detailed Data
New Issues
- #35255 [Bug]: CUDA Error 803 on host with driver 590.48: `system has unsupported display driver / cuda driver combination` — bug — by git-jxj (created: 2026-02-25 11:18 (UTC+8))
- #35238 Qwen3.5-27B dtype mismatch in DeltaNet layers during torch.compile (float != c10::Half) — torch.compile — by carlswiftsc45-dot (created: 2026-02-25 05:36 (UTC+8))
- #35254 [Performance]: curious new kernels from vllm 0.11.1 — performance — by fataswellassad (created: 2026-02-25 10:12 (UTC+8))
- #35190 [Feature]: Support for multiple embedding types in a single inference call — feature request — by rgidda (created: 2026-02-24 20:30 (UTC+8))
- #35252 [Installation]: No module named `vllm.entrypoints.anthropic.serving_messages` — installation — by timtimyim (created: 2026-02-25 09:29 (UTC+8))
- #35191 [Bug]: Qwen3.5 397B FP8 fills 1TB RAM and OOM killed with high-concurrency multimodal requests — bug — by FWao (created: 2026-02-24 20:45 (UTC+8))
- #35235 [CI Failure]: mi355_1: Multi-Modal Models Test (Extended) 1 — ci-failure — by AndreasKaratzas (created: 2026-02-25 05:17 (UTC+8))
- #35233 [CI Failure]: mi355_1: Language Models Test (Extended Generation) — ci-failure — by AndreasKaratzas (created: 2026-02-25 05:01 (UTC+8))
- #35217 [Feature]: Implement silu_mul_fp8_quant_packed for deepgemm for sm100 families — feature request,model-bash — by zyongye (created: 2026-02-25 03:16 (UTC+8))
- #35226 [Performance]: DeepSeek 3.2 Multi-stream indexer — performance — by LucasWilkinson (created: 2026-02-25 04:06 (UTC+8))
- #35221 [Bug]: `qwen3` reasoning parser incorrectly parse reasoning-only output as `content` — bug — by cjackal (created: 2026-02-25 03:30 (UTC+8))
- #35213 [RFC]: Split EngineClient into BaseEngineClient + EngineClient as Foundation for Disaggregated Frontend — RFC — by sagearc (created: 2026-02-25 01:54 (UTC+8))
- #35195 [CI] Missing or incorrect SpanKind import in tracing code causes AttributeError — ci-failure — by LucasWilkinson (created: 2026-02-24 21:59 (UTC+8))
- #35193 [Bug]: Deadlock in async_scheduling with TP=1 (UniProcExecutor) when a DP rank raises exception inside `execute_model` — bug — by fangyuchu (created: 2026-02-24 21:38 (UTC+8))
- #35188 [Feature]: Asynchronous Logit Scrubbing (ALS) for Proactive Path Pruning and 85% Efficiency Gain — feature request — by alexbuiko-sketch (created: 2026-02-24 20:06 (UTC+8))
- #35169 [Bug]: Memory Access Fault during Step-3.5-Flash inference (ROCM) — bug,rocm — by ColinZ22 (created: 2026-02-24 12:41 (UTC+8))
- #35168 [CI] Ngram speculative decoding accuracy below 66% threshold — ci-failure — by LucasWilkinson (created: 2026-02-24 12:15 (UTC+8))
- #35167 [CI] EAGLE speculative decoding accuracy below 60% threshold — ci-failure — by LucasWilkinson (created: 2026-02-24 12:15 (UTC+8))
Closed Issues
- #32402 [RFC]: Add a new `/collect_env` api to response current vllm instance environment — feature request — by lengrongfu (closed: 2026-02-25 10:17 (UTC+8))
- #19569 [Bug]: pythonic tool call parsing does not handle negative numeric literals — bug,stale — by art-dsit (closed: 2026-02-25 10:17 (UTC+8))
- #21037 [Bug]: TTFT of 1p1d using NIXLConnector is twice that of tp1 — bug,stale — by AsicDyc (closed: 2026-02-25 10:17 (UTC+8))
- #24603 [RFC]: Responses API full functionality without stores — RFC,stale,gpt-oss — by alecsolder (closed: 2026-02-25 10:16 (UTC+8))
- #27384 [Bug]: Inference Qwen3-VL-30B-A3B with tp=2 pp=4 on 8x4090 get weired result — bug,stale — by muziyongshixin (closed: 2026-02-25 10:16 (UTC+8))
- #27385 [Bug]: gptoss calls built-in tool when no tools are given — bug,stale,gpt-oss — by ruiqiRichard (closed: 2026-02-25 10:16 (UTC+8))
- #27559 [Bug]: vllm0.10.1 can deploy deepseek-70b model with tp=2 and max-model-len 20000 on the machine with two NVIDIA A800(80GiB) , But vllm 0.11.0 failed — bug,stale — by hfzh06 (closed: 2026-02-25 10:16 (UTC+8))
- #27570 [Bug]: Qwen2.5 ViT Incorrect QKV Split When projection_size != hidden_size in Tensor Parallelism — bug,stale — by yanyongyu (closed: 2026-02-25 10:16 (UTC+8))
- #27575 [Bug]: p2pNccl 3P1,D-Node nccl receives data and triggers a crash — bug,stale — by thzyyx (closed: 2026-02-25 10:16 (UTC+8))
- #35061 [Bug]: [torch.compile] occasional failure to save AOT compiled function after successful graph compilation — bug,torch.compile — by cjackal (closed: 2026-02-25 00:30 (UTC+8))
- #35015 [Bug]: Qwen3.5 FP8 CUDA Illegal Access in CUTLASS GEMM — bug — by kimbochen (closed: 2026-02-25 06:54 (UTC+8))
- #29998 [Bug]: cannot send two POST to /v1/chat/completions endpoint with identic tool function name with model GPT-OSS-120B — bug — by pd-t (closed: 2026-02-25 05:00 (UTC+8))
- #34311 [Model Bash][DeepSeek R1]: Enable AR Fusion By Default — feature request,model-bash — by robertgshaw2-redhat (closed: 2026-02-25 03:35 (UTC+8))
- #35195 [CI] Missing or incorrect SpanKind import in tracing code causes AttributeError — ci-failure — by LucasWilkinson (closed: 2026-02-25 01:26 (UTC+8))
- #28016 [Usage]: How to recognize PDFs in DeepSeek-OCR with openai — usage — by shoted (closed: 2026-02-24 13:38 (UTC+8))
- #34000 Move the Python kernel abstraction package (selectors + interfaces) under `vllm/model_executor` so unquantized linear kernels don't need to be referenced from a quantization-specific directory. — no labels — by tjtanaa (closed: 2026-02-24 14:47 (UTC+8))
- #35150 [Feature]: Support NVFP4 Checkpoint of Qwen3.5 — feature request — by ywang96 (closed: 2026-02-24 11:43 (UTC+8))
New PRs
- #35258 [Reasoning] Fix Qwen3 parser misclassifying truncated reasoning as content — no labels — by sxu75374 (created: 2026-02-25 11:31 (UTC+8))
- #35257 [WIP][Bugfix] Fix placeholder detection when tokenizer merges boundary tokens — bug,multi-modality — by haosdent (created: 2026-02-25 11:28 (UTC+8))
- #35178 [MoE] Make MoERunner/DefaultMoERunner a PluggableLayer for OOT support — documentation — by wxsIcey (created: 2026-02-24 14:57 (UTC+8))
- #35173 [Kernel] Immediately execute argument assertions in wvSplitK — rocm — by wjabbour (created: 2026-02-24 13:40 (UTC+8))
- #35256 [WIP][Bugfix] Fix dtype mismatch in RMSNormGated.forward_native() during torch.compile — bug — by haosdent (created: 2026-02-25 11:27 (UTC+8))
- #35184 [Bugfix] Emit reasoning_part events in simple streaming path for Resp… — bug,frontend — by daniel-salib (created: 2026-02-24 18:03 (UTC+8))
- #35194 [Bugfix] Surface exceptions from non-blocking execute_model in UniProcExecutor to avoid DP deadlocks — bug,v1 — by fangyuchu (created: 2026-02-24 21:53 (UTC+8))
- #35219 [BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU — bug,v1,qwen — by vadiklyutiy (created: 2026-02-25 03:21 (UTC+8))
- #35248 [WIP][Model Runner V2] Support DP/EP for spec decoding — v1 — by TheEpicDolphin (created: 2026-02-25 08:37 (UTC+8))
- #35246 [ROCm] Refactor attention backend selection logic — documentation,rocm,v1 — by SageMoore (created: 2026-02-25 08:25 (UTC+8))
- #35253 Enabling some B200-specific tests on MI355 — ci/build — by Alexei-V-Ivanov-AMD (created: 2026-02-25 09:40 (UTC+8))
- #35249 [XPU]Fix for Qwen-OMNI crash — qwen — by xuechendi (created: 2026-02-25 08:39 (UTC+8))
- #35174 [WIP] [Bugfix] Use atomic writes for AOT compile cache to prevent 0-byte files — bug,needs-rebase — by haosdent (created: 2026-02-24 13:43 (UTC+8))
- #35181 [DRAFT][Bugfix] Fix DCP gibberish output and Mamba block_size with DCP — bug,v1 — by haosdent (created: 2026-02-24 15:47 (UTC+8))
- #35251 [compile][graph_partition] Remove unused subgraph inputs after split_module — needs-rebase — by fxdawnn (created: 2026-02-25 09:21 (UTC+8))
- #35242 Fp8 lora dense kernel — no labels — by yugong333 (created: 2026-02-25 06:55 (UTC+8))
- #35240 Add PyTorch profiler schedule support with warmup/active iterations — meta-exported,fb-exported — by fenypatel99 (created: 2026-02-25 06:36 (UTC+8))
- #35250 [Bugfix][Hardware][AMD] Gate FP4 ops on gfx950 to prevent MI300X crash — bug,rocm — by c0de128 (created: 2026-02-25 08:47 (UTC+8))
- #35247 Create test_deepseek_v32_config.py — deepseek — by puririshi98 (created: 2026-02-25 08:31 (UTC+8))
- #35241 Create tests/distributed/test_mnnvl_alltoall.py — no labels — by puririshi98 (created: 2026-02-25 06:53 (UTC+8))
- #35227 Explicitly disable scale_up/scale_down when async EPLB is enabled — v1 — by SageMoore (created: 2026-02-25 04:11 (UTC+8))
- #35183 [ROCm] Split wvSplitKrc into deterministic/fast kernels, add `--fast-skinny-gemm` CLI flag, overhaul tests — rocm,v1 — by AndreasKaratzas (created: 2026-02-24 16:49 (UTC+8))
- #35201 [torch 2.11 update] Use pinned Docker image — performance,rocm,ci/build,v1,cpu,nvidia — by atalman (created: 2026-02-24 23:30 (UTC+8))
- #35245 [ROCm][WIP]: Fused aiter rope kvcache mla — rocm — by Rohan138 (created: 2026-02-25 07:32 (UTC+8))
- #35180 [ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE — rocm,ready — by Rohan138 (created: 2026-02-24 15:21 (UTC+8))
- #35244 Comment fix for async rl example — documentation — by hao-aaron (created: 2026-02-25 07:09 (UTC+8))
- #35239 [ROCm][CI] Added MI325 mirrors (stage C) — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-02-25 06:07 (UTC+8))
- #35243 [Bugfix] Fix DP MTP Dummy Run — bug,v1 — by benchislett (created: 2026-02-25 07:06 (UTC+8))
- #35225 docs: document committer proposal process in governance — documentation — by simon-mo (created: 2026-02-25 03:54 (UTC+8))
- #35236 [CI] Fix Distributed Tests — ready,ci/build — by robertgshaw2-redhat (created: 2026-02-25 05:17 (UTC+8))
- #35223 [KVConnector] scheduler: Add HMA support for KV load recovery — v1,kv-connector — by orozery (created: 2026-02-25 03:45 (UTC+8))
- #35170 [ROCm][CI] Adding infiniband mappings for moriio tests — rocm,ci/build,v1,kv-connector — by AndreasKaratzas (created: 2026-02-24 13:18 (UTC+8))
- #35232 [Build] Fix LTO build for ROCm when default compiler is GCC but ROCm/HIP is using Clang — rocm,ci/build — by davispuh (created: 2026-02-25 04:48 (UTC+8))
- #35229 [Perf] Fused silu_mul_fp8_quant_packed kernel for SM100 DeepGEMM MoE — no labels — by mgoin (created: 2026-02-25 04:38 (UTC+8))
- #35237 [Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest — bug,frontend — by pks (created: 2026-02-25 05:34 (UTC+8))
- #35234 [Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest — bug,frontend — by pks (created: 2026-02-25 05:16 (UTC+8))
- #35222 TRTLLM gen-full attn spec decode & FP8 KV dequant tests — v1,nvidia — by ojhaanshika (created: 2026-02-25 03:38 (UTC+8))
- #35210 [BugFix] Fix fp4 quant kernel on CUDA 12.8 — bug,nvidia — by LopezCastroRoberto (created: 2026-02-25 01:02 (UTC+8))
- #35231 [Responses][CI] Filter negative token IDs in schema fuzz test to avoid 500 errors — frontend — by AndreasKaratzas (created: 2026-02-25 04:44 (UTC+8))
- #35230 fix(reasoning): Qwen3ReasoningParser returns truncated output as reasoning — qwen — by stakeswky (created: 2026-02-25 04:42 (UTC+8))
- #35228 [Frontend] Split EngineClient and OpenAIServing into base and inference classes — frontend — by sagearc (created: 2026-02-25 04:27 (UTC+8))
- #35172 [Model Runner V2] Warmup kernels — v1 — by njhill (created: 2026-02-24 13:30 (UTC+8))
- #35224 [ROCm][CI] Accept Different But Valid Output for `test_olmoe_tp` — rocm — by micah-wil (created: 2026-02-25 03:50 (UTC+8))
- #35220 [Benchmarks] Plot benchmark timeline and requests statistics — performance,ci/build — by sducouedic (created: 2026-02-25 03:28 (UTC+8))
- #35218 [Docs] Upgrade dynamic LoRA warning to admonition block — documentation — by russellb (created: 2026-02-25 03:18 (UTC+8))
- #35216 [compile] Initialize passes at VllmBackend init — no labels — by angelayi (created: 2026-02-25 02:59 (UTC+8))
- #35215 fixed mtp, multi rank, and test passed — v1,qwen — by ayaan-fw (created: 2026-02-25 02:37 (UTC+8))
- #35185 Fix --kv-cache-dtype behavior when checkpoint specifies kv_cache_quant_algo — documentation — by paipeline (created: 2026-02-24 18:36 (UTC+8))
- #35214 [Perf] Optimize Sampler Redundant Copy for Model Runner v2, 1.8% Throughput Improvement — v1 — by yewentao256 (created: 2026-02-25 02:32 (UTC+8))
- #35207 Add @MatthewBonanni to CODEOWNERS — ready,ci/build — by MatthewBonanni (created: 2026-02-25 00:16 (UTC+8))
- #35211 Revert "[CI/Build] Remove redundant OpenTelemetry pip install from CI configs" — ci/build — by LucasWilkinson (created: 2026-02-25 01:08 (UTC+8))
- #35205 [CI/Build] Fix kernels test location — ci/build — by DarkLight1337 (created: 2026-02-25 00:07 (UTC+8))
- #35203 Remove requirement to use `--hf-overrides` for `DeepseekVLV2ForCausalLM` — documentation,ready,v1,deepseek,kv-connector — by hmellor (created: 2026-02-24 23:52 (UTC+8))
- #35212 [EPLB] Enforce sync eplb for NCCL-based all2all backend — no labels — by ilmarkov (created: 2026-02-25 01:08 (UTC+8))
- #35209 [Feature]: Optimize collectives in TP MoE case using torch.compile pass — no labels — by SouthWest7 (created: 2026-02-25 00:53 (UTC+8))
- #35208 GLM4 tool parser: fix streaming mode — no labels — by RNabel (created: 2026-02-25 00:34 (UTC+8))
- #35206 [Feature] Support Sequence Parallel for Model Runer v2 (Piecewise Cudagraph, PP=1) — ready,v1,nvidia — by yewentao256 (created: 2026-02-25 00:15 (UTC+8))
- #35204 [CI] Fix flaky spec decode accuracy tests — ready,v1 — by LucasWilkinson (created: 2026-02-24 23:57 (UTC+8))
- #35197 [Doc] Add MTP docs and update speculative decoding guidance — documentation — by XingLiu1 (created: 2026-02-24 22:27 (UTC+8))
- #35189 Remove `padding_index` from models that don't use it for better Transformers v5 compatibility — ready,qwen — by hmellor (created: 2026-02-24 20:20 (UTC+8))
- #35199 [CI] Remove Duplicated Tests — ready,ci/build — by robertgshaw2-redhat (created: 2026-02-24 22:29 (UTC+8))
- #35200 Fix `hf_override_fn` when it modifies `model_type` — no labels — by hmellor (created: 2026-02-24 22:36 (UTC+8))
- #35202 [Refactor] Refactor video loader frames reading/sampling interface — multi-modality — by Isotr0py (created: 2026-02-24 23:34 (UTC+8))
- #35196 [ROCm] [CI] Gate the changes to `Dockerfile.rocm_base`. — rocm,ci/build — by tjtanaa (created: 2026-02-24 22:22 (UTC+8))
- #35198 [ROCm] [Release] Change the package from `aiter` to `amd-aiter` — rocm,ci/build — by tjtanaa (created: 2026-02-24 22:27 (UTC+8))
- #35192 [Kernels] Add GGUF mxfp4 dequantize support — no labels — by Isotr0py (created: 2026-02-24 20:51 (UTC+8))
- #35182 [DO NOT MERGE] Reorganize inputs — documentation,performance,rocm,frontend,ready,v1,multi-modality,llama,qwen,deepseek — by DarkLight1337 (created: 2026-02-24 16:08 (UTC+8))
- #35186 [Frontend] Always pass `supported_tasks` to validation — frontend,ready,v1 — by DarkLight1337 (created: 2026-02-24 18:42 (UTC+8))
- #35187 feat: enable EPLB for NVFP4 compressed-tensors ML3 checkpoint — no labels — by hypdeb (created: 2026-02-24 18:56 (UTC+8))
- #35177 [XPU] disable async-scheduling by default on XPU — no labels — by yma11 (created: 2026-02-24 14:14 (UTC+8))
- #35176 [Benchmark] Make ShareGPT length filters configurable via CLI — performance — by dkim (created: 2026-02-24 13:55 (UTC+8))
- #35179 Fix MoE models in EP mode on Ascend — no labels — by ylyhyqsl (created: 2026-02-24 15:05 (UTC+8))
- #35171 [Bugfix] Fix eager torchaudio import causing ModuleNotFoundError — bug — by haosdent (created: 2026-02-24 13:25 (UTC+8))
- #35175 [WIP][Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode — bug,v1,nvidia — by haosdent (created: 2026-02-24 13:43 (UTC+8))
Merged PRs
- #35156 [BUGFIX][Qwen3.5] Hardcode `mlp.gate` as not quantizable — bug,ready,qwen — by vadiklyutiy (merged: 2026-02-24 11:42 (UTC+8))
- #35161 [Bugfix] Fix expert_ids padding values in moe_align_block_size kernel — bug,ready — by xyang16 (merged: 2026-02-25 09:14 (UTC+8))
- #34674 Adding Nemotron fp8 Triton MoE Config — no labels — by yugong333 (merged: 2026-02-25 07:56 (UTC+8))
- #34100 Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. — rocm,ready — by amd-hhashemi (merged: 2026-02-25 07:35 (UTC+8))
- #35075 [Bug][DSV3.2] Always prepare metadata for DeepGEMM Sparse Attention — bug,ready,v1 — by benchislett (merged: 2026-02-25 06:55 (UTC+8))
- #35236 [CI] Fix Distributed Tests — ready,ci/build — by robertgshaw2-redhat (merged: 2026-02-25 06:31 (UTC+8))
- #34923 [ROCm][CI] Added MI325 mirrors — rocm,ready,ci/build,v1,kv-connector — by AndreasKaratzas (merged: 2026-02-25 05:37 (UTC+8))
- #35207 Add @MatthewBonanni to CODEOWNERS — ready,ci/build — by MatthewBonanni (merged: 2026-02-25 01:45 (UTC+8))
- #35211 Revert "[CI/Build] Remove redundant OpenTelemetry pip install from CI configs" — ci/build — by LucasWilkinson (merged: 2026-02-25 01:26 (UTC+8))
- #33726 [Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support — new-model,ready,v1 — by benchislett (merged: 2026-02-25 01:49 (UTC+8))
- #35205 [CI/Build] Fix kernels test location — ci/build — by DarkLight1337 (merged: 2026-02-25 01:20 (UTC+8))
- #33593 [Perf] Optimize Python Slice for Structured Output using `islice` instead of `[:]` — structured-output,ready,v1,deepseek — by yewentao256 (merged: 2026-02-25 01:02 (UTC+8))
- #34434 [CPU][Perf] Accelerate Attention head for s390x using vector intrinsics — ready,v1,cpu — by R3hankhan123 (merged: 2026-02-24 23:25 (UTC+8))
- #35189 Remove `padding_index` from models that don't use it for better Transformers v5 compatibility — ready,qwen — by hmellor (merged: 2026-02-25 00:04 (UTC+8))
- #35199 [CI] Remove Duplicated Tests — ready,ci/build — by robertgshaw2-redhat (merged: 2026-02-24 23:53 (UTC+8))
- #35053 Integrate flashinfer mm_mxfp8 in ModelOpt MXFP8 — ready,nvidia,quantization — by danisereb (merged: 2026-02-24 23:45 (UTC+8))
- #35088 Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe — performance,ready,nvidia — by danisereb (merged: 2026-02-24 23:25 (UTC+8))
- #34905 Fix GLM4 parser tests — ready — by RNabel (merged: 2026-02-24 22:27 (UTC+8))
- #34281 [Attn,KV-cache] Use per-head scales in the attention selector — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by eldarkurtic (merged: 2026-02-24 22:02 (UTC+8))
- #35111 [Bugfix] Fix failing FunASR processor test — bug,ready,multi-modality — by Isotr0py (merged: 2026-02-24 20:13 (UTC+8))
- #35186 [Frontend] Always pass `supported_tasks` to validation — frontend,ready,v1 — by DarkLight1337 (merged: 2026-02-24 20:16 (UTC+8))
- #35108 [glm-asr] change defaults dummy audio size — ready,multi-modality — by eustlb (merged: 2026-02-24 20:13 (UTC+8))
- #35127 [Perf] Optimize pooling model redundant copy, 1.8% throughput improvement — ready,v1 — by yewentao256 (merged: 2026-02-24 20:13 (UTC+8))
- #35117 [compile] Save aot compile artifacts atomically. — ready — by zhxchen17 (merged: 2026-02-24 20:11 (UTC+8))
- #35147 [Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) — ready,llama — by dorhuri123 (merged: 2026-02-24 20:10 (UTC+8))
- #33959 Make voxtral compile friendly — ready — by tugsbayasgalan (merged: 2026-02-24 16:33 (UTC+8))
- #32621 [LoRA] Update LoRA expand kernel block_n calculation — ready — by xyang16 (merged: 2026-02-24 15:17 (UTC+8))
- #34055 [Refactor] [1/N] Reorganize kernel abstraction directory — rocm,ready,cpu,nvidia,ready-run-all-tests — by BadrBasowid (merged: 2026-02-24 14:47 (UTC+8))
- #35032 [CI/Build] Remove redundant OpenTelemetry pip install from CI configs — ready,ci/build — by vladmihailescu (merged: 2026-02-24 14:24 (UTC+8))
- #34628 [MM] Allow audio chunking for offline LLM — documentation,frontend,ready,multi-modality — by NickLucche (merged: 2026-02-24 13:04 (UTC+8))
PRs Closed Without Merging
- #26648 [Core] Lite weight profiler implementation — documentation,performance,needs-rebase,stale,v1 — by namanlalitnyu (closed: 2026-02-25 10:16 (UTC+8))
- #26997 [Feature] Support multi-stream parallelism for the q, kv norm calculations in the Qwen3 model. — stale,qwen — by weijinqian0 (closed: 2026-02-25 10:16 (UTC+8))
- #27541 Enhance benchmark_moe.py compatibility issues across vLLM versions — performance,stale — by massif-01 (closed: 2026-02-25 10:16 (UTC+8))
- #35174 [WIP] [Bugfix] Use atomic writes for AOT compile cache to prevent 0-byte files — bug,needs-rebase — by haosdent (closed: 2026-02-25 09:45 (UTC+8))
- #35181 [DRAFT][Bugfix] Fix DCP gibberish output and Mamba block_size with DCP — bug,v1 — by haosdent (closed: 2026-02-25 09:44 (UTC+8))
- #33325 [ut] enhance structured output ut — structured-output,v1 — by andyxning (closed: 2026-02-25 09:38 (UTC+8))
- #35103 [Bugfix][Hardware][AMD] Gate FP4 BMM on gfx950 to fix MI300X crash — bug,rocm — by c0de128 (closed: 2026-02-25 08:47 (UTC+8))
- #35227 Explicitly disable scale_up/scale_down when async EPLB is enabled — v1 — by SageMoore (closed: 2026-02-25 08:36 (UTC+8))
- #33953 Regular FP8 LoRA kernels — documentation,performance,new-model,rocm,frontend,speculative-decoding,ready,needs-rebase,ci/build,v1 — by yugong333 (closed: 2026-02-25 06:53 (UTC+8))
- #32307 fix pad_align for gfx942 — no labels — by Rohan138 (closed: 2026-02-25 05:08 (UTC+8))
- #35234 [Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest — bug,frontend — by pks (closed: 2026-02-25 05:21 (UTC+8))
- #35155 updated for rh ubi10.1 docker — ci/build — by sureshnam (closed: 2026-02-25 04:14 (UTC+8))
- #34544 [RL] Pause and Resume for DPEP — documentation,ready,v1 — by hao-aaron (closed: 2026-02-25 03:30 (UTC+8))
- #29923 [BUG] Make VLLM_COMPILE_CACHE_SAVE_FORMAT override CompilationConfig — bug,documentation,needs-rebase — by hao-aaron (closed: 2026-02-25 03:29 (UTC+8))
- #35204 [CI] Fix flaky spec decode accuracy tests — ready,v1 — by LucasWilkinson (closed: 2026-02-25 00:23 (UTC+8))
- #34952 [Build] Fix DSV3_FUSED_A_GEMM_ARCHS to only include SM 9.0 (Hopper) — needs-rebase,ci/build,deepseek,nvidia — by aabbccddwasd (closed: 2026-02-24 23:53 (UTC+8))
- #32420 [Frontend] Enable drain shutdown mode for non-DP deployments — documentation,frontend,ready,needs-rebase,v1 — by wseaton (closed: 2026-02-24 23:43 (UTC+8))
- #32879 [Bugfix][Hardware][AMD] Fix MXFP4 weight loading for Quark OCP_MX MoE models — bug,rocm — by c0de128 (closed: 2026-02-24 23:23 (UTC+8))
- #32526 [Performance] Split attention backends KV cache update — rocm,v1,nvidia — by Etelis (closed: 2026-02-24 18:33 (UTC+8))
- #35179 Fix MoE models in EP mode on Ascend — no labels — by ylyhyqsl (closed: 2026-02-24 15:16 (UTC+8))
- #35171 [Bugfix] Fix eager torchaudio import causing ModuleNotFoundError — bug — by haosdent (closed: 2026-02-24 17:03 (UTC+8))
- #33181 [reasoning parser] code clean — structured-output,v1 — by andyxning (closed: 2026-02-24 16:24 (UTC+8))
- #35175 [WIP][Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode — bug,v1,nvidia — by haosdent (closed: 2026-02-24 14:39 (UTC+8))
- #33138 [code clean] remove duplicated code — frontend — by andyxning (closed: 2026-02-24 14:09 (UTC+8))
- #35018 perf(v1): optimize InputBatch.swap_states by swapping active token prefixes — v1 — by VedantMadane (closed: 2026-02-24 12:34 (UTC+8))
- #35164 [CI][AMD][BugFix][P/D] Skip test_moriio_connector.py tests if IB verbs is not available — bug,rocm,v1,kv-connector — by rasmith (closed: 2026-02-24 12:28 (UTC+8))