vLLM Development Activity Report - 2026-02-24

Time window: 2026-02-24 11:31 (UTC+8) ~ 2026-02-25 11:31 (UTC+8)
Statistics: New Issues 18 | Closed Issues 17 | New PRs 74 | Merged PRs 30 | PRs Closed Without Merging 26

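📊 Daily Development Summary

During this period (2026-02-24 to 2026-02-25), vLLM was in a phase of rapid iteration: 74 new PRs were opened and 30 were merged, reflecting active code contribution and an efficient review-and-merge process. Development focused on performance optimization, kernel improvements (especially on the AMD ROCm platform), and fixes for a range of bugs in model inference and compilation. One notable trend is that the stability and memory behavior of large multimodal models (the Qwen series in particular) under high concurrency drew in-depth discussion.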

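🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was brisk this period, centered on ROCm performance optimization, kernel work, and expanded CI coverage.

PR #35246 ([ROCm] Refactor attention backend selection logic): refactors the attention backend selection logic on ROCm to improve maintainability and make dispatch easier to follow.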
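PR #35250 ([Bugfix][Hardware][AMD] Gate FP4 ops on gfx950 to prevent MI300X crash): a key fix. It adds a hardware check that gates the FP4 BMM and FP4 ASM GEMM operations, which are specific to gfx950 (the MI350 series). Previously, enabling the related environment variables on gfx942 (MI300X) caused an illegal-instruction crash. The fix ensures the feature is enabled only on supported hardware.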
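PR #35245 ([ROCm][WIP]: Fused aiter rope kvcache mla): works toward a fused kernel for the RoPE + KV Cache + Cat operations on ROCm with AITER, aiming to improve inference efficiency for MLA models such as DeepSeek and Kimi. This PR depends on several prerequisite PRs.
PR #35239 ([ROCm][CI] Added MI325 mirrors (stage C)): extends AMD CI coverage by enabling "mirror" tests on MI325 nodes for more test groups (such as entrypoint integration, examples, and the standard language and multimodal model tests), a sign of the growing maturity and test coverage of ROCm support.
Issue #35169 ([Bug]: Memory Access Fault during Step-3.5-Flash inference (ROCM)): a user reported a memory access fault when running a specific model on ROCm. The issue was automatically routed to the ROCm maintainers, which illustrates the community's response process for AMD platform problems.
PR #35183 ([ROCm] Split wvSplitKrc into deterministic/fast kernels, add --fast-skinny-gemm CLI flag, overhaul tests): splits the wvSplitKrc kernel into a deterministic version and a fast (non-deterministic) version, and adds a CLI flag so users can choose between them. The default is the deterministic version, to ensure correctness. The PR also rewrites the test suite, cutting its runtime from more than 90 minutes to under 7 minutes, which significantly improves development efficiency.
PR #35180 ([ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE): fixes the AITER RoPE custom op and enables it by default, and also enables RoPE + KV Cache fusion. Tests show roughly a 1% performance gain on the gpt-oss model, and the change lays the groundwork for later fusion optimizations for MLA models.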
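
The hardware gating described for PR #35250 can be illustrated with a minimal sketch. Everything here is hypothetical (the function names, the set of FP4-capable architectures, and the flag argument are assumptions made for illustration, not vLLM's actual code). It assumes the architecture string has the form ROCm tools commonly report, such as "gfx942:sramecc+:xnack-".

```python
# Hypothetical sketch of a hardware gate; not vLLM's implementation.
# Assumption: only gfx950 provides the FP4 instructions these kernels need.
FP4_CAPABLE_ARCHS = {"gfx950"}


def base_arch(gcn_arch_name: str) -> str:
    """Strip feature suffixes, e.g. 'gfx942:sramecc+:xnack-' -> 'gfx942'."""
    return gcn_arch_name.split(":", 1)[0]


def fp4_ops_enabled(gcn_arch_name: str, env_flag_set: bool) -> bool:
    """Enable FP4 kernels only when the user opted in AND the GPU supports them."""
    return env_flag_set and base_arch(gcn_arch_name) in FP4_CAPABLE_ARCHS


# On gfx942 the opt-in flag is ignored instead of dispatching an
# unsupported instruction.
print(fp4_ops_enabled("gfx942:sramecc+:xnack-", env_flag_set=True))
print(fp4_ops_enabled("gfx950:sramecc+:xnack-", env_flag_set=True))
```

The design point is that the environment variable alone never decides dispatch; the hardware check always has the final say.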

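Analysis: AMD activity this period shows a shift from "functionally usable" toward "performance optimization" and a "polished experience". It includes a key safety fix that prevents hardware crashes as well as fused-kernel work aimed at faster inference. At the same time, the continued expansion of CI coverage and the faster test suite give ROCm support a firmer footing for long-term stability and developer experience. In particular, splitting a kernel into "deterministic" and "fast" modes reflects careful attention to the differing needs of production environments (correctness versus peak performance).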
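
One common reason a "fast" reduction kernel can be non-deterministic is that floating-point addition is not associative, so partial sums accumulated in a different order can give a different result. The report does not say whether this is the mechanism in wvSplitKrc, so the sketch below is only a general illustration in plain Python.

```python
def sum_in_order(values, order):
    """Accumulate values in the given index order, as a reduction would."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


partials = [0.1, 0.2, 0.3]
a = sum_in_order(partials, [0, 1, 2])  # (0.1 + 0.2) + 0.3
b = sum_in_order(partials, [2, 1, 0])  # (0.3 + 0.2) + 0.1

# Same inputs, different accumulation order, different bits.
print(a, b, a == b)  # 0.6000000000000001 0.6 False
```

A deterministic kernel fixes the accumulation order (at some cost in parallelism), while a fast one lets the hardware combine partial results in whatever order they arrive.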

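💬 Analysis of High-Traffic Discussions

Issue #35191 ([Bug]: Qwen3.5 397B FP8 fills 1TB RAM and OOM killed with high-concurrency multimodal requests)
- Core topic: while serving high-concurrency multimodal requests with the Qwen3.5-397B-FP8 model, the user saw system memory (RAM) grow linearly until the process was OOM-killed after exhausting 1 TB.
- Views and troubleshooting:
  - User (FWao): described the symptoms in detail and noted that memory began to grow without bound once the multimodal cache hit rate dropped to zero. Disabling prefix caching, changing the multimodal encoder TP mode, and disabling async scheduling did not help.
  - Maintainers (DarkLight1337, ywang96): first ruled out prefix caching (off by default for hybrid models) and the common high-concurrency warning logs as causes, then suggested removing the --mm-encoder-tp-mode data argument.
  - Another user (cjackal): pointed out that this may be a long-standing bug in the Qwen2.5 vision processor (see #28230), and reproduced a similar OOM on the Qwen3 VL 235B model, with the traceback pointing at the multimodal embedding function _merge_multimodal_embeddings.
- Point of contention: the root cause is not yet known. It may be a memory-management defect in the multimodal processing flow, or a compatibility problem between a specific model architecture (Qwen VL) and vLLM's multimodal pipeline.
- Current status: still open and under investigation. Maintainers are engaged, and a possibly related historical issue has been referenced.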
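Issue #35190 ([Feature]: Support for multiple embedding types in a single inference call)
- Core topic: vLLM 0.15 added multi-vector retrieval (such as ColBERT), but each vector type currently requires its own API call. The user asks for a single inference call that returns every supported embedding type (dense, sparse, and ColBERT vectors) to make RAG pipelines more efficient.
- Views and discussion:
  - User (rgidda): argued that the current design is unfriendly to applications that need hybrid search (dense + sparse) or ColBERT-style interaction, because it doubles the compute and network overhead.
  - Maintainer (noooop): suggested that a plugin (io_processor_plugin) might cover this use case.
  - Contributor (staugust): noted that a plugin can merge the two calls into one, but because PoolingParams in CachedRequestState can hold only one task, vLLM would still run the input twice. Proposed first supporting single-call multi-vector retrieval through a plugin, then upgrading DispatchPooler.forward to support multiple tasks.
- Point of contention: no significant disagreement. The discussion is about the implementation path: direct support in the frontend API layer, or gradual evolution through the plugin mechanism.
- Current status: open for discussion, leaning toward incremental improvement through the plugin approach.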
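Issue #35195 ([CI] Missing or incorrect SpanKind import in tracing code causes AttributeError) (no comments were shown, but the following can be inferred from the title and status changes)
- Core topic: tracing code failed in CI because of an import error involving OpenTelemetry's SpanKind enum.
- Views and actions: the issue was closed on the day it was opened. The associated PR #35211 directly reverted PR #35032, which had removed the redundant OpenTelemetry pip install. This suggests that after the OpenTelemetry dependency was bundled into the main vLLM package, some tests in the CI environment still relied on the old explicit install.
- Point of contention: none. This was a clear regression, handled with a quick rollback.
- Current status: resolved by merging the revert PR.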
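
As an illustration of how this kind of failure arises, the sketch below shows a defensive import pattern for an optional tracing dependency. `opentelemetry.trace.SpanKind` is the real OpenTelemetry enum; the `OTEL_AVAILABLE` flag and the `server_span_kind` helper are hypothetical names, and this is not the approach vLLM took (the project reverted the CI change instead).

```python
# A minimal sketch: guard the optional import once, then branch on a flag,
# so a missing package surfaces as a clear "tracing unavailable" state
# rather than an AttributeError deep inside tracing code.
try:
    from opentelemetry.trace import SpanKind  # real OpenTelemetry enum
    OTEL_AVAILABLE = True
except ImportError:
    SpanKind = None  # placeholder so the name always exists
    OTEL_AVAILABLE = False


def server_span_kind():
    """Return SpanKind.SERVER when OpenTelemetry is installed, else None."""
    return SpanKind.SERVER if OTEL_AVAILABLE else None


print("tracing available:", OTEL_AVAILABLE)
```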

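🔥 Hot Topics and Trends

Performance optimization and profiling:
- Issue #35254 and #35226 raise performance analysis and optimization suggestions at the macro level (a version upgrade, where a 4x performance improvement was observed) and the micro level (a multi-stream indexer for DeepSeek 3.2), respectively.
- PR #35229 implements a fused kernel for Blackwell (SM100), showing continued follow-through on the latest hardware features.
Compilation and kernel stability:
- Issue #35238 and PR #35256 expose a dtype propagation problem in the interaction between torch.compile and custom ops (CustomOp) that crashes the model. This reflects the higher bar for robust operator implementations when pursuing compile-time optimization.
- PR #35161 fixes the padding value of expert_ids in the MoE alignment kernel, an important fix for the correctness of low-level computation.
Multimodal and vision model challenges:
- Issue #35191 and the earlier #28230 show that large vision-language models, with the Qwen series as the main example, face serious memory-management challenges in vLLM under high concurrency and long contexts.
- Issue #35169 shows that running multimodal models on ROCm can also hit low-level memory access problems.
Inference correctness and parsers:
- Issue #35221 and the associated PRs #35258 and #35230 all concern fixing how Qwen3ReasoningParser mis-parses reasoning output that has been truncated, underscoring the need to handle the edge cases of the reasoning (thinking) feature precisely.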
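
The truncation edge case can be sketched as follows. This is a simplified, hypothetical splitter, not the actual Qwen3ReasoningParser: the function name and return shape are made up for illustration, and it assumes the model is running with thinking enabled, so output with no closing tag is unfinished reasoning rather than final content.

```python
THINK_START = "<think>"
THINK_END = "</think>"


def split_reasoning(output: str):
    """Split model output into (reasoning, content).

    Assumption: thinking mode is on, so if the end tag never appears
    (for example, generation hit max_tokens), everything is reasoning.
    """
    text = output
    if text.startswith(THINK_START):
        text = text[len(THINK_START):]
    if THINK_END not in text:
        # Truncated before </think>: classify as reasoning, not content.
        return (text or None), None
    reasoning, _, content = text.partition(THINK_END)
    return (reasoning or None), (content or None)


print(split_reasoning("<think>step 1, then step"))
print(split_reasoning("<think>plan</think>answer"))
```

A real parser also has to know whether thinking was disabled for the request, since in that case untagged output is ordinary content.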

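🛠️ Key Technical Changes

PR #35156 ([BUGFIX][Qwen3.5] Hardcode mlp.gate as not quantizable): a quick fix for load failures of NVFP4-quantized models such as Qwen3.5-397B-A17B-NVFP4. By hardcoding the mlp.gate layer as not quantizable, it works around the checkpoint's missing quantization-exclusion entry for that layer, so the new quantized format is supported right away.
PR #35161 ([Bugfix] Fix expert_ids padding values in moe_align_block_size kernel): fixes the padding value of expert_ids in the MoE alignment kernel, which should have been -1 but was mistakenly set to 0. Because of early-exit logic this may not currently affect accuracy, but the fix removes the risk of downstream kernels mistaking padding for a valid expert ID and improves robustness and forward compatibility.
PR #35229 ([Perf] Fused silu_mul_fp8_quant_packed kernel for SM100 DeepGEMM MoE): in response to Issue #35217, implements the fused silu_mul_fp8_quant_packed kernel for DeepGEMM MoE on the SM100 (Blackwell) architecture, merging the previously separate activation and quantization steps to improve FP8 MoE compute efficiency.
PR #34055 ([Refactor] [1/N] Reorganize kernel abstraction directory): completes a restructuring of the kernel abstraction layer's directory layout, moving the kernels package out of the quantization directory and up to the model_executor root. This is an important architectural cleanup that stops linear kernels unrelated to quantization from depending on the quantization directory.
PR #33726 ([Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support): adds MTP speculative decoding support for the Nemotron-H model family and extends speculative-decoding compatibility across the Mamba-style attention backends, letting more linear-attention models benefit from speculative decoding.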
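
To see why the padding value in PR #35161 matters, consider a much-simplified, pure-Python sketch of block alignment. The function below is hypothetical and only mirrors the idea (group tokens by expert, round each group up to whole blocks, and record one expert ID per block). It is not the actual moe_align_block_size kernel.

```python
def align_block_size(topk_ids, num_experts, block_size, max_blocks):
    """Return (expert_ids, used_blocks) for a simplified MoE block alignment."""
    counts = [0] * num_experts
    for expert in topk_ids:
        counts[expert] += 1

    # Unused slots hold -1, a sentinel that can never be a real expert ID.
    expert_ids = [-1] * max_blocks
    block = 0
    for expert, count in enumerate(counts):
        n_blocks = -(-count // block_size)  # ceiling division
        for _ in range(n_blocks):
            expert_ids[block] = expert
            block += 1
    return expert_ids, block


ids, used = align_block_size([0, 0, 2, 2, 2], num_experts=3,
                             block_size=2, max_blocks=6)
print(ids, used)  # [0, 2, 2, -1, -1, -1] 3
```

Had the buffer been padded with 0 instead, the trailing slots would read `[0, 2, 2, 0, 0, 0]`, and a downstream consumer that did not stop at `used` could not tell padding apart from real work for expert 0.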

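📈 Development Activity Observations

Efficient merging: 30 PRs were merged within 24 hours, indicating an efficient review-and-merge process on the core team and fast project progress.
Cross-platform contributions: contributors come not only from NVIDIA and AMD (for example, users with xxx-amd style handles) but also from Intel, Red Hat, and other companies, showing vLLM's broad appeal as a cross-hardware inference engine.
Issue closure: 17 issues were closed. These include several months-old "stale" issues that were cleaned up automatically, as well as CI problems such as #35195 that were resolved quickly, reflecting active community maintenance.
RFCs and design discussion: the discussion in Issue #35213 and PR #35228 about splitting the EngineClient protocol shows the community thinking about the long-term evolution of the frontend architecture.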

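💡 Issues Worth Watching

Memory leak risk in large multimodal models: the unbounded memory growth of Qwen3.5 VL under high concurrency revealed by Issue #35191 could affect the production stability of every vision-language model with a similar architecture, and should be investigated as a priority.
Compile optimization and operator compatibility: the dtype mismatch between torch.compile and custom ops exposed by Issue #35238 is a potential bottleneck for the stability of compiled mode, and calls for a systematic review of the "native" implementation path of every custom op.
Stability of speculative decoding accuracy tests: Issue #35168 and #35167 show that, because model generation is itself non-deterministic, speculative decoding accuracy tests based on exact matching of output text fluctuate inherently. The test methodology may need adjusting (for example, by introducing deterministic prompts or relaxing the matching criterion).
Architecture evolution: the split of the EngineClient protocol proposed in Issue #35213 lays the foundation for a disaggregated frontend and backend (such as vllm online). It is an important architectural direction, and its progress is worth following.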
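
The idea of relaxing the matching criterion for speculative decoding tests can be sketched as below. This is a hypothetical helper, not the actual vLLM test code: the `match_rate` function and its `prefix_words` parameter are assumptions for illustration. Only the 66% threshold is taken from the title of Issue #35168.

```python
def match_rate(reference, candidate, prefix_words=None):
    """Fraction of outputs that match the reference.

    With prefix_words=None, outputs must match exactly. Otherwise only the
    first `prefix_words` whitespace-separated words are compared, which
    tolerates late divergence caused by non-deterministic generation.
    """
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    if not reference:
        return 0.0

    def key(text):
        words = text.split()
        return words if prefix_words is None else words[:prefix_words]

    matches = sum(key(r) == key(c) for r, c in zip(reference, candidate))
    return matches / len(reference)


ref = ["the cat sat down", "a b c"]
out = ["the cat sat up", "a b c"]
THRESHOLD = 0.66  # from the title of Issue #35168

print(match_rate(ref, out) >= THRESHOLD)                  # False (0.5)
print(match_rate(ref, out, prefix_words=3) >= THRESHOLD)  # True (1.0)
```

Whether a prefix comparison is an acceptable proxy for accuracy is a judgment call for the test authors. The sketch only shows how the strictness of the criterion changes the pass rate.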

📋 Appendix: Detailed Data Lists

New Issues

#35255 [Bug]: CUDA Error 803 on host with driver 590.48: `system has unsupported display driver / cuda driver combination` | bug | by git-jxj (created: 2026-02-25 11:18 (UTC+8))
#35238 Qwen3.5-27B dtype mismatch in DeltaNet layers during torch.compile (float != c10::Half) | torch.compile | by carlswiftsc45-dot (created: 2026-02-25 05:36 (UTC+8))
#35254 [Performance]: curious new kernels from vllm 0.11.1 | performance | by fataswellassad (created: 2026-02-25 10:12 (UTC+8))
#35190 [Feature]: Support for multiple embedding types in a single inference call | feature request | by rgidda (created: 2026-02-24 20:30 (UTC+8))
#35252 [Installation]: No module named ‘vllm.entrypoints.anthropic.serving_messages’ | installation | by timtimyim (created: 2026-02-25 09:29 (UTC+8))
#35191 [Bug]: Qwen3.5 397B FP8 fills 1TB RAM and OOM killed with high-concurrency multimodal requests | bug | by FWao (created: 2026-02-24 20:45 (UTC+8))
#35235 [CI Failure]: mi355_1: Multi-Modal Models Test (Extended) 1 | ci-failure | by AndreasKaratzas (created: 2026-02-25 05:17 (UTC+8))
#35233 [CI Failure]: mi355_1: Language Models Test (Extended Generation) | ci-failure | by AndreasKaratzas (created: 2026-02-25 05:01 (UTC+8))
#35217 [Feature]: Implement silu_mul_fp8_quant_packed for deepgemm for sm100 families | feature request,model-bash | by zyongye (created: 2026-02-25 03:16 (UTC+8))
#35226 [Performance]: DeepSeek 3.2 Multi-stream indexer | performance | by LucasWilkinson (created: 2026-02-25 04:06 (UTC+8))
#35221 [Bug]: qwen3 reasoning parser incorrectly parse reasoning-only output as content | bug | by cjackal (created: 2026-02-25 03:30 (UTC+8))
#35213 [RFC]: Split EngineClient into BaseEngineClient + EngineClient as Foundation for Disaggregated Frontend | RFC | by sagearc (created: 2026-02-25 01:54 (UTC+8))
#35195 [CI] Missing or incorrect SpanKind import in tracing code causes AttributeError | ci-failure | by LucasWilkinson (created: 2026-02-24 21:59 (UTC+8))
#35193 [Bug]: Deadlock in async_scheduling with TP=1 (UniProcExecutor) when a DP rank raises exception inside execute_model | bug | by fangyuchu (created: 2026-02-24 21:38 (UTC+8))
#35188 [Feature]: Asynchronous Logit Scrubbing (ALS) for Proactive Path Pruning and 85% Efficiency Gain | feature request | by alexbuiko-sketch (created: 2026-02-24 20:06 (UTC+8))
#35169 [Bug]: Memory Access Fault during Step-3.5-Flash inference (ROCM) | bug,rocm | by ColinZ22 (created: 2026-02-24 12:41 (UTC+8))
#35168 [CI] Ngram speculative decoding accuracy below 66% threshold | ci-failure | by LucasWilkinson (created: 2026-02-24 12:15 (UTC+8))
#35167 [CI] EAGLE speculative decoding accuracy below 60% threshold | ci-failure | by LucasWilkinson (created: 2026-02-24 12:15 (UTC+8))

Closed Issues

#32402 [RFC]: Add a new /collect_env api to response current vllm instance environment | feature request | by lengrongfu (closed: 2026-02-25 10:17 (UTC+8))
#19569 [Bug]: pythonic tool call parsing does not handle negative numeric literals | bug,stale | by art-dsit (closed: 2026-02-25 10:17 (UTC+8))
#21037 [Bug]: TTFT of 1p1d using NIXLConnector is twice that of tp1 | bug,stale | by AsicDyc (closed: 2026-02-25 10:17 (UTC+8))
#24603 [RFC]: Responses API full functionality without stores | RFC,stale,gpt-oss | by alecsolder (closed: 2026-02-25 10:16 (UTC+8))
#27384 [Bug]: Inference Qwen3-VL-30B-A3B with tp=2 pp=4 on 8x4090 get weired result | bug,stale | by muziyongshixin (closed: 2026-02-25 10:16 (UTC+8))
#27385 [Bug]: gptoss calls built-in tool when no tools are given | bug,stale,gpt-oss | by ruiqiRichard (closed: 2026-02-25 10:16 (UTC+8))
#27559 [Bug]: vllm0.10.1 can deploy deepseek-70b model with tp=2 and max-model-len 20000 on the machine with two NVIDIA A800(80GiB) , But vllm 0.11.0 failed | bug,stale | by hfzh06 (closed: 2026-02-25 10:16 (UTC+8))
#27570 [Bug]: Qwen2.5 ViT Incorrect QKV Split When projection_size != hidden_size in Tensor Parallelism | bug,stale | by yanyongyu (closed: 2026-02-25 10:16 (UTC+8))
#27575 [Bug]: p2pNccl 3P1，D-Node nccl receives data and triggers a crash | bug,stale | by thzyyx (closed: 2026-02-25 10:16 (UTC+8))
#35061 [Bug]: [torch.compile] occasional failure to save AOT compiled function after successful graph compilation | bug,torch.compile | by cjackal (closed: 2026-02-25 00:30 (UTC+8))
#35015 [Bug]: Qwen3.5 FP8 CUDA Illegal Access in CUTLASS GEMM | bug | by kimbochen (closed: 2026-02-25 06:54 (UTC+8))
#29998 [Bug]: cannot send two POST to /v1/chat/completions endpoint with identic tool function name with model GPT-OSS-120B | bug | by pd-t (closed: 2026-02-25 05:00 (UTC+8))
#34311 [Model Bash][DeepSeek R1]: Enable AR Fusion By Default | feature request,model-bash | by robertgshaw2-redhat (closed: 2026-02-25 03:35 (UTC+8))
#35195 [CI] Missing or incorrect SpanKind import in tracing code causes AttributeError | ci-failure | by LucasWilkinson (closed: 2026-02-25 01:26 (UTC+8))
#28016 [Usage]: How to recognize PDFs in DeepSeek-OCR with openai | usage | by shoted (closed: 2026-02-24 13:38 (UTC+8))
#34000 Move the Python kernel abstraction package (selectors + interfaces) under vllm/model_executor so unquantized linear kernels don’t need to be referenced from a quantization-specific directory. | no labels | by tjtanaa (closed: 2026-02-24 14:47 (UTC+8))
#35150 [Feature]: Support NVFP4 Checkpoint of Qwen3.5 | feature request | by ywang96 (closed: 2026-02-24 11:43 (UTC+8))

New PRs

#35258 [Reasoning] Fix Qwen3 parser misclassifying truncated reasoning as content | no labels | by sxu75374 (created: 2026-02-25 11:31 (UTC+8))
#35257 [WIP][Bugfix] Fix placeholder detection when tokenizer merges boundary tokens | bug,multi-modality | by haosdent (created: 2026-02-25 11:28 (UTC+8))
#35178 [MoE] Make MoERunner/DefaultMoERunner a PluggableLayer for OOT support | documentation | by wxsIcey (created: 2026-02-24 14:57 (UTC+8))
#35173 [Kernel] Immediately execute argument assertions in wvSplitK | rocm | by wjabbour (created: 2026-02-24 13:40 (UTC+8))
#35256 [WIP][Bugfix] Fix dtype mismatch in RMSNormGated.forward_native() during torch.compile | bug | by haosdent (created: 2026-02-25 11:27 (UTC+8))
#35184 [Bugfix] Emit reasoning_part events in simple streaming path for Resp… | bug,frontend | by daniel-salib (created: 2026-02-24 18:03 (UTC+8))
#35194 [Bugfix] Surface exceptions from non-blocking execute_model in UniProcExecutor to avoid DP deadlocks | bug,v1 | by fangyuchu (created: 2026-02-24 21:53 (UTC+8))
#35219 [BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU | bug,v1,qwen | by vadiklyutiy (created: 2026-02-25 03:21 (UTC+8))
#35248 [WIP][Model Runner V2] Support DP/EP for spec decoding | v1 | by TheEpicDolphin (created: 2026-02-25 08:37 (UTC+8))
#35246 [ROCm] Refactor attention backend selection logic | documentation,rocm,v1 | by SageMoore (created: 2026-02-25 08:25 (UTC+8))
#35253 Enabling some B200-specific tests on MI355 | ci/build | by Alexei-V-Ivanov-AMD (created: 2026-02-25 09:40 (UTC+8))
#35249 [XPU]Fix for Qwen-OMNI crash | qwen | by xuechendi (created: 2026-02-25 08:39 (UTC+8))
#35174 [WIP] [Bugfix] Use atomic writes for AOT compile cache to prevent 0-byte files | bug,needs-rebase | by haosdent (created: 2026-02-24 13:43 (UTC+8))
#35181 [DRAFT][Bugfix] Fix DCP gibberish output and Mamba block_size with DCP | bug,v1 | by haosdent (created: 2026-02-24 15:47 (UTC+8))
#35251 [compile][graph_partition] Remove unused subgraph inputs after split_module | needs-rebase | by fxdawnn (created: 2026-02-25 09:21 (UTC+8))
#35242 Fp8 lora dense kernel | no labels | by yugong333 (created: 2026-02-25 06:55 (UTC+8))
#35240 Add PyTorch profiler schedule support with warmup/active iterations | meta-exported,fb-exported | by fenypatel99 (created: 2026-02-25 06:36 (UTC+8))
#35250 [Bugfix][Hardware][AMD] Gate FP4 ops on gfx950 to prevent MI300X crash | bug,rocm | by c0de128 (created: 2026-02-25 08:47 (UTC+8))
#35247 Create test_deepseek_v32_config.py | deepseek | by puririshi98 (created: 2026-02-25 08:31 (UTC+8))
#35241 Create tests/distributed/test_mnnvl_alltoall.py | no labels | by puririshi98 (created: 2026-02-25 06:53 (UTC+8))
#35227 Explicitly disable scale_up/scale_down when async EPLB is enabled | v1 | by SageMoore (created: 2026-02-25 04:11 (UTC+8))
#35183 [ROCm] Split wvSplitKrc into deterministic/fast kernels, add --fast-skinny-gemm CLI flag, overhaul tests | rocm,v1 | by AndreasKaratzas (created: 2026-02-24 16:49 (UTC+8))
#35201 [torch 2.11 update] Use pinned Docker image | performance,rocm,ci/build,v1,cpu,nvidia | by atalman (created: 2026-02-24 23:30 (UTC+8))
#35245 [ROCm][WIP]: Fused aiter rope kvcache mla | rocm | by Rohan138 (created: 2026-02-25 07:32 (UTC+8))
#35180 [ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE | rocm,ready | by Rohan138 (created: 2026-02-24 15:21 (UTC+8))
#35244 Comment fix for async rl example | documentation | by hao-aaron (created: 2026-02-25 07:09 (UTC+8))
#35239 [ROCm][CI] Added MI325 mirrors (stage C) | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-02-25 06:07 (UTC+8))
#35243 [Bugfix] Fix DP MTP Dummy Run | bug,v1 | by benchislett (created: 2026-02-25 07:06 (UTC+8))
#35225 docs: document committer proposal process in governance | documentation | by simon-mo (created: 2026-02-25 03:54 (UTC+8))
#35236 [CI] Fix Distributed Tests | ready,ci/build | by robertgshaw2-redhat (created: 2026-02-25 05:17 (UTC+8))
#35223 [KVConnector] scheduler: Add HMA support for KV load recovery | v1,kv-connector | by orozery (created: 2026-02-25 03:45 (UTC+8))
#35170 [ROCm][CI] Adding infiniband mappings for moriio tests | rocm,ci/build,v1,kv-connector | by AndreasKaratzas (created: 2026-02-24 13:18 (UTC+8))
#35232 [Build] Fix LTO build for ROCm when default compiler is GCC but ROCm/HIP is using Clang | rocm,ci/build | by davispuh (created: 2026-02-25 04:48 (UTC+8))
#35229 [Perf] Fused silu_mul_fp8_quant_packed kernel for SM100 DeepGEMM MoE | no labels | by mgoin (created: 2026-02-25 04:38 (UTC+8))
#35237 [Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest | bug,frontend | by pks (created: 2026-02-25 05:34 (UTC+8))
#35234 [Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest | bug,frontend | by pks (created: 2026-02-25 05:16 (UTC+8))
#35222 TRTLLM gen-full attn spec decode & FP8 KV dequant tests | v1,nvidia | by ojhaanshika (created: 2026-02-25 03:38 (UTC+8))
#35210 [BugFix] Fix fp4 quant kernel on CUDA 12.8 | bug,nvidia | by LopezCastroRoberto (created: 2026-02-25 01:02 (UTC+8))
#35231 [Responses][CI] Filter negative token IDs in schema fuzz test to avoid 500 errors | frontend | by AndreasKaratzas (created: 2026-02-25 04:44 (UTC+8))
#35230 fix(reasoning): Qwen3ReasoningParser returns truncated output as reasoning | qwen | by stakeswky (created: 2026-02-25 04:42 (UTC+8))
#35228 [Frontend] Split EngineClient and OpenAIServing into base and inference classes | frontend | by sagearc (created: 2026-02-25 04:27 (UTC+8))
#35172 [Model Runner V2] Warmup kernels | v1 | by njhill (created: 2026-02-24 13:30 (UTC+8))
#35224 [ROCm][CI] Accept Different But Valid Output for test_olmoe_tp | rocm | by micah-wil (created: 2026-02-25 03:50 (UTC+8))
#35220 [Benchmarks] Plot benchmark timeline and requests statistics | performance,ci/build | by sducouedic (created: 2026-02-25 03:28 (UTC+8))
#35218 [Docs] Upgrade dynamic LoRA warning to admonition block | documentation | by russellb (created: 2026-02-25 03:18 (UTC+8))
#35216 [compile] Initialize passes at VllmBackend init | no labels | by angelayi (created: 2026-02-25 02:59 (UTC+8))
#35215 fixed mtp, multi rank, and test passed | v1,qwen | by ayaan-fw (created: 2026-02-25 02:37 (UTC+8))
#35185 Fix --kv-cache-dtype behavior when checkpoint specifies kv_cache_quant_algo | documentation | by paipeline (created: 2026-02-24 18:36 (UTC+8))
#35214 [Perf] Optimize Sampler Redundant Copy for Model Runner v2, 1.8% Throughput Improvement | v1 | by yewentao256 (created: 2026-02-25 02:32 (UTC+8))
#35207 Add @MatthewBonanni to CODEOWNERS | ready,ci/build | by MatthewBonanni (created: 2026-02-25 00:16 (UTC+8))
#35211 Revert “[CI/Build] Remove redundant OpenTelemetry pip install from CI configs” | ci/build | by LucasWilkinson (created: 2026-02-25 01:08 (UTC+8))
#35205 [CI/Build] Fix kernels test location | ci/build | by DarkLight1337 (created: 2026-02-25 00:07 (UTC+8))
#35203 Remove requirement to use --hf-overrides for DeepseekVLV2ForCausalLM | documentation,ready,v1,deepseek,kv-connector | by hmellor (created: 2026-02-24 23:52 (UTC+8))
#35212 [EPLB] Enforce sync eplb for NCCL-based all2all backend | no labels | by ilmarkov (created: 2026-02-25 01:08 (UTC+8))
#35209 [Feature]: Optimize collectives in TP MoE case using torch.compile pass | no labels | by SouthWest7 (created: 2026-02-25 00:53 (UTC+8))
#35208 GLM4 tool parser: fix streaming mode | no labels | by RNabel (created: 2026-02-25 00:34 (UTC+8))
#35206 [Feature] Support Sequence Parallel for Model Runer v2 (Piecewise Cudagraph, PP=1) | ready,v1,nvidia | by yewentao256 (created: 2026-02-25 00:15 (UTC+8))
#35204 [CI] Fix flaky spec decode accuracy tests | ready,v1 | by LucasWilkinson (created: 2026-02-24 23:57 (UTC+8))
#35197 [Doc] Add MTP docs and update speculative decoding guidance | documentation | by XingLiu1 (created: 2026-02-24 22:27 (UTC+8))
#35189 Remove padding_index from models that don’t use it for better Transformers v5 compatibility | ready,qwen | by hmellor (created: 2026-02-24 20:20 (UTC+8))
#35199 [CI] Remove Duplicated Tests | ready,ci/build | by robertgshaw2-redhat (created: 2026-02-24 22:29 (UTC+8))
#35200 Fix hf_override_fn when it modifies model_type | no labels | by hmellor (created: 2026-02-24 22:36 (UTC+8))
#35202 [Refactor] Refactor video loader frames reading/sampling interface | multi-modality | by Isotr0py (created: 2026-02-24 23:34 (UTC+8))
#35196 [ROCm] [CI] Gate the changes to Dockerfile.rocm_base. | rocm,ci/build | by tjtanaa (created: 2026-02-24 22:22 (UTC+8))
#35198 [ROCm] [Release] Change the package from aiter to amd-aiter | rocm,ci/build | by tjtanaa (created: 2026-02-24 22:27 (UTC+8))
#35192 [Kernels] Add GGUF mxfp4 dequantize support | no labels | by Isotr0py (created: 2026-02-24 20:51 (UTC+8))
#35182 [DO NOT MERGE] Reorganize inputs | documentation,performance,rocm,frontend,ready,v1,multi-modality,llama,qwen,deepseek | by DarkLight1337 (created: 2026-02-24 16:08 (UTC+8))
#35186 [Frontend] Always pass supported_tasks to validation | frontend,ready,v1 | by DarkLight1337 (created: 2026-02-24 18:42 (UTC+8))
#35187 feat: enable EPLB for NVFP4 compressed-tensors ML3 checkpoint | no labels | by hypdeb (created: 2026-02-24 18:56 (UTC+8))
#35177 [XPU] disable async-scheduling by default on XPU | no labels | by yma11 (created: 2026-02-24 14:14 (UTC+8))
#35176 [Benchmark] Make ShareGPT length filters configurable via CLI | performance | by dkim (created: 2026-02-24 13:55 (UTC+8))
#35179 Fix MoE models in EP mode on Ascend | no labels | by ylyhyqsl (created: 2026-02-24 15:05 (UTC+8))
#35171 [Bugfix] Fix eager torchaudio import causing ModuleNotFoundError | bug | by haosdent (created: 2026-02-24 13:25 (UTC+8))
#35175 [WIP][Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode | bug,v1,nvidia | by haosdent (created: 2026-02-24 13:43 (UTC+8))

Merged PRs

#35156 [BUGFIX][Qwen3.5] Hardcode mlp.gate as not quantizable | bug,ready,qwen | by vadiklyutiy (merged: 2026-02-24 11:42 (UTC+8))
#35161 [Bugfix] Fix expert_ids padding values in moe_align_block_size kernel | bug,ready | by xyang16 (merged: 2026-02-25 09:14 (UTC+8))
#34674 Adding Nemotron fp8 Triton MoE Config | no labels | by yugong333 (merged: 2026-02-25 07:56 (UTC+8))
#34100 Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. | rocm,ready | by amd-hhashemi (merged: 2026-02-25 07:35 (UTC+8))
#35075 [Bug][DSV3.2] Always prepare metadata for DeepGEMM Sparse Attention | bug,ready,v1 | by benchislett (merged: 2026-02-25 06:55 (UTC+8))
#35236 [CI] Fix Distributed Tests | ready,ci/build | by robertgshaw2-redhat (merged: 2026-02-25 06:31 (UTC+8))
#34923 [ROCm][CI] Added MI325 mirrors | rocm,ready,ci/build,v1,kv-connector | by AndreasKaratzas (merged: 2026-02-25 05:37 (UTC+8))
#35207 Add @MatthewBonanni to CODEOWNERS | ready,ci/build | by MatthewBonanni (merged: 2026-02-25 01:45 (UTC+8))
#35211 Revert “[CI/Build] Remove redundant OpenTelemetry pip install from CI configs” | ci/build | by LucasWilkinson (merged: 2026-02-25 01:26 (UTC+8))
#33726 [Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support | new-model,ready,v1 | by benchislett (merged: 2026-02-25 01:49 (UTC+8))
#35205 [CI/Build] Fix kernels test location | ci/build | by DarkLight1337 (merged: 2026-02-25 01:20 (UTC+8))
#33593 [Perf] Optimize Python Slice for Structured Output using islice instead of [:] | structured-output,ready,v1,deepseek | by yewentao256 (merged: 2026-02-25 01:02 (UTC+8))
#34434 [CPU][Perf] Accelerate Attention head for s390x using vector intrinsics | ready,v1,cpu | by R3hankhan123 (merged: 2026-02-24 23:25 (UTC+8))
#35189 Remove padding_index from models that don’t use it for better Transformers v5 compatibility | ready,qwen | by hmellor (merged: 2026-02-25 00:04 (UTC+8))
#35199 [CI] Remove Duplicated Tests | ready,ci/build | by robertgshaw2-redhat (merged: 2026-02-24 23:53 (UTC+8))
#35053 Integrate flashinfer mm_mxfp8 in ModelOpt MXFP8 | ready,nvidia,quantization | by danisereb (merged: 2026-02-24 23:45 (UTC+8))
#35088 Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe | performance,ready,nvidia | by danisereb (merged: 2026-02-24 23:25 (UTC+8))
#34905 Fix GLM4 parser tests | ready | by RNabel (merged: 2026-02-24 22:27 (UTC+8))
#34281 [Attn,KV-cache] Use per-head scales in the attention selector | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 | by eldarkurtic (merged: 2026-02-24 22:02 (UTC+8))
#35111 [Bugfix] Fix failing FunASR processor test | bug,ready,multi-modality | by Isotr0py (merged: 2026-02-24 20:13 (UTC+8))
#35186 [Frontend] Always pass supported_tasks to validation | frontend,ready,v1 | by DarkLight1337 (merged: 2026-02-24 20:16 (UTC+8))
#35108 [glm-asr] change defaults dummy audio size | ready,multi-modality | by eustlb (merged: 2026-02-24 20:13 (UTC+8))
#35127 [Perf] Optimize pooling model redundant copy, 1.8% throughput improvement | ready,v1 | by yewentao256 (merged: 2026-02-24 20:13 (UTC+8))
#35117 [compile] Save aot compile artifacts atomically. | ready | by zhxchen17 (merged: 2026-02-24 20:11 (UTC+8))
#35147 [Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) | ready,llama | by dorhuri123 (merged: 2026-02-24 20:10 (UTC+8))
#33959 Make voxtral compile friendly | ready | by tugsbayasgalan (merged: 2026-02-24 16:33 (UTC+8))
#32621 [LoRA] Update LoRA expand kernel block_n calculation | ready | by xyang16 (merged: 2026-02-24 15:17 (UTC+8))
#34055 [Refactor] [1/N] Reorganize kernel abstraction directory | rocm,ready,cpu,nvidia,ready-run-all-tests | by BadrBasowid (merged: 2026-02-24 14:47 (UTC+8))
#35032 [CI/Build] Remove redundant OpenTelemetry pip install from CI configs | ready,ci/build | by vladmihailescu (merged: 2026-02-24 14:24 (UTC+8))
#34628 [MM] Allow audio chunking for offline LLM | documentation,frontend,ready,multi-modality | by NickLucche (merged: 2026-02-24 13:04 (UTC+8))

PRs Closed Without Merging

#26648 [Core] Lite weight profiler implementation | documentation,performance,needs-rebase,stale,v1 | by namanlalitnyu (closed: 2026-02-25 10:16 (UTC+8))
#26997 [Feature] Support multi-stream parallelism for the q, kv norm calculations in the Qwen3 model. | stale,qwen | by weijinqian0 (closed: 2026-02-25 10:16 (UTC+8))
#27541 Enhance benchmark_moe.py compatibility issues across vLLM versions | performance,stale | by massif-01 (closed: 2026-02-25 10:16 (UTC+8))
#35174 [WIP] [Bugfix] Use atomic writes for AOT compile cache to prevent 0-byte files | bug,needs-rebase | by haosdent (closed: 2026-02-25 09:45 (UTC+8))
#35181 [DRAFT][Bugfix] Fix DCP gibberish output and Mamba block_size with DCP | bug,v1 | by haosdent (closed: 2026-02-25 09:44 (UTC+8))
#33325 [ut] enhance structured output ut | structured-output,v1 | by andyxning (closed: 2026-02-25 09:38 (UTC+8))
#35103 [Bugfix][Hardware][AMD] Gate FP4 BMM on gfx950 to fix MI300X crash | bug,rocm | by c0de128 (closed: 2026-02-25 08:47 (UTC+8))
#35227 Explicitly disable scale_up/scale_down when async EPLB is enabled | v1 | by SageMoore (closed: 2026-02-25 08:36 (UTC+8))
#33953 Regular FP8 LoRA kernels | documentation,performance,new-model,rocm,frontend,speculative-decoding,ready,needs-rebase,ci/build,v1 | by yugong333 (closed: 2026-02-25 06:53 (UTC+8))
#32307 fix pad_align for gfx942 | no labels | by Rohan138 (closed: 2026-02-25 05:08 (UTC+8))
#35234 [Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest | bug,frontend | by pks (closed: 2026-02-25 05:21 (UTC+8))
#35155 updated for rh ubi10.1 docker | ci/build | by sureshnam (closed: 2026-02-25 04:14 (UTC+8))
#34544 [RL] Pause and Resume for DPEP | documentation,ready,v1 | by hao-aaron (closed: 2026-02-25 03:30 (UTC+8))
#29923 [BUG] Make VLLM_COMPILE_CACHE_SAVE_FORMAT override CompilationConfig | bug,documentation,needs-rebase | by hao-aaron (closed: 2026-02-25 03:29 (UTC+8))
#35204 [CI] Fix flaky spec decode accuracy tests | ready,v1 | by LucasWilkinson (closed: 2026-02-25 00:23 (UTC+8))
#34952 [Build] Fix DSV3_FUSED_A_GEMM_ARCHS to only include SM 9.0 (Hopper) | needs-rebase,ci/build,deepseek,nvidia | by aabbccddwasd (closed: 2026-02-24 23:53 (UTC+8))
#32420 [Frontend] Enable drain shutdown mode for non-DP deployments | documentation,frontend,ready,needs-rebase,v1 | by wseaton (closed: 2026-02-24 23:43 (UTC+8))
#32879 [Bugfix][Hardware][AMD] Fix MXFP4 weight loading for Quark OCP_MX MoE models | bug,rocm | by c0de128 (closed: 2026-02-24 23:23 (UTC+8))
#32526 [Performance] Split attention backends KV cache update | rocm,v1,nvidia | by Etelis (closed: 2026-02-24 18:33 (UTC+8))
#35179 Fix MoE models in EP mode on Ascend | no labels | by ylyhyqsl (closed: 2026-02-24 15:16 (UTC+8))
#35171 [Bugfix] Fix eager torchaudio import causing ModuleNotFoundError | bug | by haosdent (closed: 2026-02-24 17:03 (UTC+8))
#33181 [reasoning parser] code clean | structured-output,v1 | by andyxning (closed: 2026-02-24 16:24 (UTC+8))
#35175 [WIP][Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode | bug,v1,nvidia | by haosdent (closed: 2026-02-24 14:39 (UTC+8))
#33138 [code clean] remove duplicated code | frontend | by andyxning (closed: 2026-02-24 14:09 (UTC+8))
#35018 perf(v1): optimize InputBatch.swap_states by swapping active token prefixes | v1 | by VedantMadane (closed: 2026-02-24 12:34 (UTC+8))
#35164 [CI][AMD][BugFix][P/D] Skip test_moriio_connector.py tests if IB verbs is not available | bug,rocm,v1,kv-connector | by rasmith (closed: 2026-02-24 12:28 (UTC+8))