vLLM 开发动态报告 - 2026-02-22

时间窗口: 2026-02-22 11:35 (UTC+8) ~ 2026-02-23 11:35 (UTC+8) 数据统计: 新 Issue 5 | 关闭 Issue 8 | 新 PR 26 | 合并 PR 15 | 关闭未合并 PR 10

📊 每日开发状态摘要

在2026年2月22日至23日的窗口期内，vLLM项目展现出极高活跃度，共处理了38个PR和13个Issue。开发焦点集中在性能优化与问题修复，特别是针对推测解码（Speculative Decoding）、多模态模型（如Qwen-VL）和量化（MXFP4/NVFP4）的支持。同时，AMD ROCm平台的集成与测试稳定性是本周期的一个显著主题，贡献者提交了多个关键修复和功能增强。

🎯 AMD/ROCm 生态相关动态

本周期内AMD生态相关活动非常活跃，共有8个新增或修改的PR，主要围绕CI测试稳定性修复、功能支持与新硬件兼容性展开。以下为详细分析：

新增功能与硬件支持
- PR #35051: feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs
  - 分析：这是本周期最重要的AMD生态进展。该PR通过Petit内核后端，为AMD CDNA2/CDNA3 GPU（如MI300系列）增加了对MXFP4量化稠密模型的原生支持。这标志着AMD平台对ModelOpt等先进量化工具链的支持迈出关键一步，有助于提升AMD GPU在量化推理场景下的竞争力。贡献者为fengli1702。
- PR #35064: [ROCm] Support MLA with nhead=32 and FP8 KV cache (e.g. Kimi-Linear TP=1)
  - 分析：该PR修复了ROCm平台在运行特定MLA（Multi-Head Latent Attention）模型（如Kimi-Linear-48B， TP=1时nhead=32）时的断言错误，并增加了对FP8 KV缓存的后端自动回退支持。这提升了复杂注意力模型在AMD平台上的兼容性和功能完整性。
CI/测试稳定性与基础设施
- PR #35071: [ROCm][CI] Expose tests to AMD production CI and fix amdsmi heap corruption
  - 分析：贡献者AndreasKaratzas将更多测试用例暴露给AMD生产CI（amdproduction信号），并修复了amdsmi库在多线程查询GPU内存时的堆损坏问题。这直接提升了vLLM在AMD生产环境（如MI325）下的测试覆盖率和运行稳定性。
- PR #35069: [ROCm] Derive device capability from GCN arch string without CUDA init
  - 分析：该PR优化了设备能力查询逻辑，使其在不初始化CUDA的情况下也能从GCN架构字符串解析，解决了在Ray等特定环境中因CUDA初始化顺序导致的运行时错误。这增强了vLLM在复杂分布式环境下的鲁棒性。
- PR #35049, #35043, #35052, #35050: 一系列针对ROCm平台测试确定性和稳定性的修复。
  - 分析：这些PR分别解决了因wvSplitKrc“瘦”GEMM内核非确定性输出导致的多模态测试失败、推测解码性能分析断言错误、实时API测试因JIT编译导致的超时，以及嵌入测试因浮点差异导致的断言失败。这系统性地提升了AMD CI流水线的通过率和可靠性，是平台成熟度提升的重要标志。

总结：本周期AMD生态贡献聚焦于扩大功能支持（MXFP4量化）、解决生产环境稳定性问题（amdsmi堆损坏、CI测试）以及提升框架兼容性（设备能力查询）。这表明AMD团队正致力于将ROCm支持从“可用”推向“稳定、高效、功能全面”的阶段。

💬 高热度讨论分析

Issue #35048: [Bug]: perf degradation between vllm 0.14.0 to 0.15.1
- 核心议题：用户报告在H100/H200 GPU上，使用MiniMax-M2.5等MLA模型时，v0.15.1相比v0.14.0存在显著性能下降（高负载下中位延迟增20-24%，P95 TTFT增73%）。
- 观点分析：
  - 报告者 (meitalbensinai): 指出问题仅在高并发、大请求时出现，低负载下性能正常，怀疑与Hopper架构或deepgemm/flashinfer后端变化有关。
  - 调查者 (BaskDuan): 通过代码比对，将嫌疑范围缩小到v0.14.0至v0.15.1之间的13个MLA相关变更（涉及44个文件），并指出问题可能特定于MLA注意力路径。
  - 深入分析者 (sxn-rs5): 建议进行内核级剖析，关注#32709（MLA融合cat和quant）可能引入的同步开销或次优内存访问模式，并验证后端切换的影响。
- 争议焦点：无直接争议，但讨论深入技术细节，指向MLA实现路径的批量处理、内核碎片化或内存带宽瓶颈可能是根本原因。
- 当前状态：Open。社区已定位可疑代码区域，待进一步性能剖析和修复。
Issue #35057 与 #35056: Qwen3.5系列模型在特定配置下的错误
- 核心议题：两个Issue分别报告了Qwen3.5模型在使用解码上下文并行（DCP） 和 RoPE Scaling 时出现的张量形状不匹配与属性错误。
- 观点分析：
  - 报告者 (ehfd): 提供了详细的错误堆栈和环境信息，并关联了两个问题。
  - 分析者 (sxn-rs5): 指出#35057是DCP中元数据张量拆分/聚合的逻辑问题；#35056是MRotaryEmbedding类缺少truncate属性，需参照标准RotaryEmbedding实现补全。
- 争议焦点：无。属于明确的代码缺陷或缺失实现。
- 当前状态：两者均为Open。问题根因已基本明确，等待修复PR。
Issue #13260: [Bug]: Guided generation throws 500 error or endless generation (已关闭)
- 核心议题：历史遗留问题，关于使用引导式生成（JSON Schema等）时，服务器报500错误或陷入无限生成空白字符的循环。
- 观点与解决历程：
  - 用户们：从v0.7.x到v0.11.x多个版本持续报告此问题，尤其在Mistral、Qwen等模型上。
  - 探索方案：社区尝试了调整温度、使用lm-format-enforcer后端、设置LMFE_MAX_CONSECUTIVE_WHITESPACES环境变量、在xgrammar中禁用空白符(disable_any_whitespace)等多种方案。
  - 根本原因与结论：问题源于模型在引导约束下“不知道生成什么”时，倾向于持续生成空白符。最终的有效解决思路是：1) 确保模型指令清晰；2) 使用xgrammar后端并设置”disable_any_whitespace”: true避免语法级空白；3) 配合frequency_penalty等参数进一步降低重复。此Issue因长期无活动而自动关闭，但讨论提供了宝贵的实践经验。

🔥 热门话题与趋势分析

性能回归与深度优化：Issue #35048引发的讨论是典型代表。随着vLLm功能快速迭代（尤其是MLA、推测解码等复杂特性），性能回归测试和深度优化成为社区关注核心。趋势是从宏观测试转向内核级剖析（nsys）、内存访问模式分析和并发场景压力测试。
模型兼容性与边缘用例：Qwen3.5系列模型在DCP、RoPE扩展等高级特性下的错误（#35057, #35056），以及RTX 5090 (SM120)新硬件无法加载NVFP4 MoE模型（#35065），反映了社区在支持快速涌现的新模型、新硬件配置方面面临的持续挑战。
量化支持的广度和深度：量化是绝对热点。动态包括：为Step-3.5-Flash模型添加NVFP4 MoE支持(#34478)、集成FlashInfer的MXFP8 GEMM到ModelOpt后端(#35053)、以及前述为AMD GPU增加MXFP4稠密模型支持(#35051)。趋势是支持更广泛的模型家族、更精细的量化格式（如MXFP8/4）和跨硬件平台部署。
平台稳定性与开发者体验：大量PR（尤其是AMD相关和CI修复）致力于解决测试中的非确定性、超时、堆损坏等问题。这表明项目在追求功能创新的同时，高度重视跨平台（NVIDIA/AMD/CPU）的稳定性和开发者CI体验。
推测解码的成熟与优化：多个已合并的PR（#34049, #34529, #34898）专注于修复推测解码在特定配置（如TP通信、KV缓存分组、分离式部署）下的问题并进行优化，表明该特性正从“可用”向“高效、稳定”迈进。

🛠️ 重点技术变更

PR #34049 (已合并): [Spec Decode] Reduce TP communication for speculative decoding draft token generation
- 解读：针对Eagle等推测解码算法中，草稿模型生成token时需要全局广播完整logits（O(batch_size × vocab_size)）的巨大通信开销，本PR引入了use_local_argmax_reduction选项。新方法仅在TP间交换(最大值, 索引)对（O(batch_size × 2 × tp_size)），大幅减少了通信量。
- 影响：对于Llama 4等超大词表模型，能有效降低TP通信瓶颈，提升推测解码效率，尤其在批量较大时效果显著。
PR #34478 (已合并): [Model] Add NVFP4 quantization support for Step3.5-Flash
- 解读：Step-3.5-Flash模型的MoE层使用了独特的swiglustep激活函数，此前所有NVFP4 MoE后端均不支持。此PR通过扩展CutlassExpertsFp4和MarlinExperts的激活函数白名单、修复格式断言，成功为其添加了NVFP4量化支持。
- 影响：扩展了vLLM尖端量化技术支持的模型边界，使更多复杂MoE模型能够受益于FP4量化带来的内存和带宽优势。
PR #34898 (已合并): [Bug] Refactor max_num_batched_tokens to account for drafting
- 解读：修复了推测解码中一个关键崩溃BUG。原逻辑中，max_num_batched_tokens未正确考虑草稿模型额外生成的token，导致在混合注意力模型中KV缓存分组计算错误。PR通过调整配置初始化和调度器逻辑，确保分组正确。
- 影响：解决了推测解码与混合注意力模型（如GPT-OSS + Eagle）配合使用时的稳定性问题，是保证复杂特性组合可用性的重要基础修复。

📈 开发活跃度观察

高强度代码合并：24小时内合并15个PR，新增26个PR，显示核心团队和社区贡献者审查与合并流程非常高效。
AMD贡献者活跃：用户 AndreasKaratzas 在单一周期内主导或参与了至少6个ROCm相关的PR（#35071, #35069, #35052, #35050, #35049, #35043），展现了AMD团队对vLLm ROCm支持的持续高强度投入和快速响应能力。
测试驱动开发：绝大多数PR都附带详细的测试计划，且大量PR直接针对CI测试稳定性进行修复，体现了项目对代码质量和回归预防的重视。
聚焦复杂特性：开发活动高度集中在推测解码、多模态、量化、并行计算等前沿且复杂的领域，表明vLLM正致力于攻克高性能推理中的技术深水区。

💡 值得关注的问题

性能回归风险 (Issue #35048)：v0.15.1对MLA模型路径的修改在高负载下引发的性能退化需要尽快定位和修复，这关系到生产环境的稳定升级。
新硬件兼容性 (Issue #35065)：NVIDIA即将发布的RTX 5090 (SM120) 在运行NVFP4 MoE模型时遇到后端不支持的错误，需要vLLM团队提前适配新一代硬件架构。
复杂特性组合的稳定性：如Qwen3.5 + DCP + RoPE扩展（#35057, #35056）这类“模型+高级并行+扩展技术”的组合暴露出新的Bug，预示着随着特性矩阵膨胀，测试用例需要更全面的覆盖。
引导生成的实践指南：虽然Issue #13260已关闭，但引导生成（特别是JSON输出）的稳定性和最佳实践（如后端选择、参数配置）仍是用户高频痛点，值得整理成更明确的文档或提供更健壮的默认实现。

📋 附录：详细数据列表

新增 Issue

#35057 [Bug]: Qwen3.5 scheduler_metadata must have shape (metadata_size) with Decode Context Parallel (DCP) — bug — by ehfd (创建于: 2026-02-22 23:30 (UTC+8))
#35065 RTX 5090 (SM120): NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 fails to start, No NvFp4 MoE backend supports the deployment configuration — bug — by CristyNel (创建于: 2026-02-23 06:34 (UTC+8))
#35061 [Bug]: [torch.compile] occasional failure to save AOT compiled function after successful graph compilation — bug — by cjackal (创建于: 2026-02-23 01:34 (UTC+8))
#35048 [Bug]: perf degradation between vllm 0.14.0 to 0.15.1 — bug — by meitalbensinai (创建于: 2026-02-22 14:10 (UTC+8))
#35056 [Bug]: Qwen3.5 AttributeError: 'MRotaryEmbedding' object has no attribute 'truncate' with RoPE Scaling — bug — by ehfd (创建于: 2026-02-22 23:10 (UTC+8))

已关闭 Issue

#13260 [Bug]: Guided generation throws 500 error or endless generation in vllm serve for mistral small 2501 — bug,stale — by Jason-CKY (关闭于: 2026-02-23 10:18 (UTC+8))
#16969 [RFC]: scheduling policy optimization in vLLM — RFC,stale — by RuixiangMa (关闭于: 2026-02-23 10:18 (UTC+8))
#25434 [Bug]: TritonAttn Backend is very slow on A770 — bug,intel-gpu,stale — by iori2333 (关闭于: 2026-02-23 10:17 (UTC+8))
#26932 [Feature]: A more generalized vLLM Helm chart — feature request,stale — by jgchn (关闭于: 2026-02-23 10:16 (UTC+8))
#27073 [Bug]:几个大的token超限制请求之后GPU KV cache usage持续升高一直接近100%，几个小时不下降，最后重启解决 — bug,stale — by dingwenhao (关闭于: 2026-02-23 10:16 (UTC+8))
#27486 [Bug]: Llama-3.2-1b-instruct prepending extra bos token — bug,stale — by BabyChouSr (关闭于: 2026-02-23 10:16 (UTC+8))
#27513 [Bug]: Poor logging around assertion error when using PPLX all-to-all backend with microbatching (MoE) — bug,stale — by Gregory-Pereira (关闭于: 2026-02-23 10:16 (UTC+8))
#34771 [Bug]: LlavaNext + compressed-tensors: loader expects multi_modal_projector.linear_1.weight_packed but model only has linear_1weight/linear_1bias — bug — by gocoding1011 (关闭于: 2026-02-22 20:32 (UTC+8))

新增 PR

#35073 [Refactor] [3/N] Reorganize kernel abstraction directory — cpu,nvidia — by BadrBasowid (创建于: 2026-02-23 11:17 (UTC+8))
#35072 [Bugfix] Fix iteration time for asynchronous scheduler — bug,v1 — by maxyanghu (创建于: 2026-02-23 10:58 (UTC+8))
#35055 [Misc] Add shard_id validation for MergedColumnLinear — ready — by Isotr0py (创建于: 2026-02-22 22:59 (UTC+8))
#35070 [Model Runner V2] Remove propose_draft method — v1 — by WoosukKwon (创建于: 2026-02-23 09:18 (UTC+8))
#35067 [Deprecation] Deprecate resolve_hf_chat_template as scheduled — frontend,ready — by yewentao256 (创建于: 2026-02-23 07:04 (UTC+8))
#35059 feat(cpu): add CPU support for Mamba ShortConv and fix GEMM dispatch — 无标签 — by rahulssv-ibm (创建于: 2026-02-23 00:46 (UTC+8))
#35071 [ROCm][CI] Expose tests to AMD production CI and fix amdsmi heap corruption — rocm,ci/build — by AndreasKaratzas (创建于: 2026-02-23 09:18 (UTC+8))
#35069 [ROCm] Derive device capability from GCN arch string without CUDA init — rocm,nvidia — by AndreasKaratzas (创建于: 2026-02-23 08:32 (UTC+8))
#35063 [Model Runner V2] Fix error-handling — ready,v1 — by njhill (创建于: 2026-02-23 04:33 (UTC+8))
#35053 Integrate flashinfer mm_mxfp8 in ModelOpt MXFP8 — ready,nvidia,quantization — by danisereb (创建于: 2026-02-22 16:28 (UTC+8))
#35068 [Refactor] Remove dead private func _fp8_perm and _extract_mask_for_item — ready — by yewentao256 (创建于: 2026-02-23 07:12 (UTC+8))
#35066 [Frontend] Add multi-server frontend for K8s pod health aggregation — frontend — by robertgshaw2-redhat (创建于: 2026-02-23 06:50 (UTC+8))
#35060 [CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig — ready — by VincentG1234 (创建于: 2026-02-23 00:52 (UTC+8))
#35049 [ROCm][CI] Disable skinny GEMMs in multimodal tests to fix non-deterministic results — rocm,multi-modality — by AndreasKaratzas (创建于: 2026-02-22 14:56 (UTC+8))
#35043 [ROCm][CI] Fix spec decode profile assertion and logprob test determinism — rocm,v1 — by AndreasKaratzas (创建于: 2026-02-22 12:07 (UTC+8))
#35064 [ROCm] Support MLA with nhead=32 and FP8 KV cache (e.g. Kimi-Linear TP=1) — rocm,v1 — by ChuanLi1101 (创建于: 2026-02-23 04:57 (UTC+8))
#35052 [ROCm][CI] Fix realtime test timeouts caused by aiter JIT compilation delays — rocm,ready — by AndreasKaratzas (创建于: 2026-02-22 15:49 (UTC+8))
#35062 [Bugfix] Separate speculator layers into dedicated KV cache group — bug,ci/build,v1,nvidia — by hai-meh-cs (创建于: 2026-02-23 03:13 (UTC+8))
#35058 [Perf] Fused Triton kernels for Qwen3-VL vision encoder preprocessing — multi-modality,qwen,nvidia — by mayank-ketkar-sf (创建于: 2026-02-23 00:41 (UTC+8))
#35051 feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs — rocm — by fengli1702 (创建于: 2026-02-22 15:42 (UTC+8))
#35054 [Mamba1] - Change supports_update_block_table to True — v1 — by Josephasafg (创建于: 2026-02-22 20:45 (UTC+8))
#35050 [ROCm][CI] Fix flaky embedding chat test by using tolerance-based comparison — rocm,ready — by AndreasKaratzas (创建于: 2026-02-22 15:03 (UTC+8))
#35045 [Model Runner V2] Support sharing kv cache layers — ready,v1 — by njhill (创建于: 2026-02-22 13:07 (UTC+8))
#35046 Add engine KV checkpoint rewind and lifecycle tools — documentation,frontend,v1 — by enigmatic-figure (创建于: 2026-02-22 13:22 (UTC+8))
#35047 add mixed precision support for modelopt — 无标签 — by sychen52 (创建于: 2026-02-22 13:32 (UTC+8))
#35044 Fix –kv-cache-dtype auto behavior with checkpoint quantization configs — v1 — by paipeline (创建于: 2026-02-22 12:34 (UTC+8))

已合并 PR

#35070 [Model Runner V2] Remove propose_draft method — v1 — by WoosukKwon (合并于: 2026-02-23 10:27 (UTC+8))
#35039 [Model Runner V2][Minor] Remove redundant do_spec_decode field — ready,v1 — by njhill (合并于: 2026-02-23 08:18 (UTC+8))
#34049 [Spec Decode] Reduce TP communication for speculative decoding draft token generation — speculative-decoding,ready,v1,llama — by zixi-qi (合并于: 2026-02-23 06:59 (UTC+8))
#35052 [ROCm][CI] Fix realtime test timeouts caused by aiter JIT compilation delays — rocm,ready — by AndreasKaratzas (合并于: 2026-02-22 18:07 (UTC+8))
#34478 [Model] Add NVFP4 quantization support for Step3.5-Flash — ready,nvidia,quantization — by tacos8me (合并于: 2026-02-23 03:30 (UTC+8))
#35030 Fix apply_top_k_top_p_triton called by non-cuda logits Tensor — ready,v1,nvidia,meta-exported,fb-exported — by xli (合并于: 2026-02-22 13:11 (UTC+8))
#34898 [Bug] Refactor max_num_batched_tokens to account for drafting — bug,speculative-decoding,ready,v1 — by benchislett (合并于: 2026-02-23 00:18 (UTC+8))
#34529 [Spec Decode] Defer clearing KV connector metadata for EAGLE3 speculative decode + prefill / decode disagg setup — speculative-decoding,ready,v1 — by zixi-qi (合并于: 2026-02-23 00:08 (UTC+8))
#35050 [ROCm][CI] Fix flaky embedding chat test by using tolerance-based comparison — rocm,ready — by AndreasKaratzas (合并于: 2026-02-22 17:03 (UTC+8))
#34779 [Bugfix] Fix Qwen3/Qwen3.5 Reasoning Parser — bug,frontend,ready,qwen — by ywang96 (合并于: 2026-02-22 15:15 (UTC+8))
#35040 [Model Runner V2] Enable CUDA graph for Eagle3 — v1,nvidia — by WoosukKwon (合并于: 2026-02-22 13:42 (UTC+8))
#35027 [Benchmark] Use sns.relplot for plotting — performance,ready — by DarkLight1337 (合并于: 2026-02-22 12:26 (UTC+8))
#34558 [New Model] Add ColModernVBERT — documentation,performance,new-model,rocm,frontend,speculative-decoding,ready,ci/build,v1,multi-modality — by athrael-soju (合并于: 2026-02-22 12:23 (UTC+8))
#34961 [CI] Bump mteb version to mteb[bm25s]>=2, <3 for pooling model unit tests — rocm,ready,ci/build — by yewentao256 (合并于: 2026-02-22 12:23 (UTC+8))
#35008 [CI] Stabilizing ROCm amd-ci signal and minor name fix in upstream — rocm,ready,ci/build — by AndreasKaratzas (合并于: 2026-02-22 12:01 (UTC+8))

关闭但未合并的 PR

#28330 Fix: tool call streaming when both reasoning and tool parsers are enabled #28297 — frontend,needs-rebase,unstale,nvidia — by baonudesifeizhai (关闭于: 2026-02-23 11:34 (UTC+8))
#26160 [Feature] Add OCI model loader for loading models from OCI registries — documentation,stale — by ilopezluna (关闭于: 2026-02-23 10:17 (UTC+8))
#26384 Adding aws creds mount support for sccache in DockerFile — ci/build,stale — by kfhfar (关闭于: 2026-02-23 10:17 (UTC+8))
#26546 [Qwen3 Attention]Update qwen3 models with new interface — stale,qwen — by xintongli1001 (关闭于: 2026-02-23 10:17 (UTC+8))
#27172 Integrate go containerregistry library — documentation,needs-rebase,ci/build,stale — by ericcurtin (关闭于: 2026-02-23 10:16 (UTC+8))
#27203 [Kernel] Support attention sinks in vLLM — ci/build,stale,v1 — by dudugong-gitch (关闭于: 2026-02-23 10:16 (UTC+8))
#27483 [Refactor] Add Shared Block Max Reduction Helper — stale — by harishappana-git (关闭于: 2026-02-23 10:16 (UTC+8))
#33168 Add script to benchmark Mamba for optimal max_num_seqs — performance — by danisereb (关闭于: 2026-02-23 00:40 (UTC+8))
#34412 [DO NOT MERGE ] Evidence for FlashInfer allreduce_fusion one-shot (kARResidualRMSNorm) causes deterministic NaN corruption and GSM8K collapse — 无标签 — by baonudesifeizhai (关闭于: 2026-02-22 15:45 (UTC+8))
#29316 lora cuda multistream — ci/build,v1,nvidia — by hai-meh-cs (关闭于: 2026-02-22 12:22 (UTC+8))