vLLM Development Activity Report - 2026-02-17

Time window: 2026-02-17 11:33 (UTC+8) ~ 2026-02-18 11:33 (UTC+8)
Statistics: 25 new issues | 24 closed issues | 80 new PRs | 23 merged PRs | 23 PRs closed without merging

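📊 Daily Development Summary

Over the past 24 hours the vLLM project remained highly active: 23 PRs were merged and 80 new PRs were opened. Development focused on performance optimization (speculative decoding statistics, attention kernels, GEMM optimization), the evolution of the multimodal and Renderer architecture, and continued work on ecosystem support for AMD ROCm and new NVIDIA hardware such as Blackwell. The community also fixed several CI test failures and model-loading bugs.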

🎯 AMD/ROCm Ecosystem Updates

AMD ecosystem activity was brisk in this period, centered on feature enablement, CI/CD fixes, and kernel optimization.

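PR #34653 (merged) - [BugFix] [Build] fix string literals comparison…
- Contributor: hongxiayang (no -amd suffix in the username, but active in ROCm-related fixes)
- Content: Fixes the -Wstring-compare error triggered by comparing string literals when building on ROCm with Clang 22+. The problem was first reported in Issue #34132. The fix converts the string literals at the macro call site into static std::string objects so that == compares contents correctly.
- Impact: Keeps vLLM buildable on the latest ROCm toolchain, which is foundational for maintaining AMD platform support.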
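PR #34726 (in progress) - [ROCm] Enable DBO (Dynamic Batch Optimization) on ROCm
- Contributor: raviguptaamd (AMD employee)
- Content: Aims to enable DBO (--enable-dbo; expanded as Dynamic Batch Optimization in the PR title) on AMD GPUs such as MI300X. Changes: 1) relax the CUDA-only assertion in SMControlContextManager so that it also accepts the ROCm platform; 2) set a more suitable VLLM_DBO_COMM_SMS default for ROCm (64 CUs, versus 20 SMs on CUDA).
- Impact: Would let AMD GPUs use DBO to overlap communication with computation, improving performance for models such as DeepSeek-V3 under expert parallelism.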
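A minimal sketch of the platform-dependent default described for PR #34726. Only the two default values (64 on ROCm, 20 on CUDA) and the VLLM_DBO_COMM_SMS variable name come from the report; the helper name, the platform strings, and the override rule are assumptions made for illustration and are not taken from the PR.

```python
import os

# Defaults quoted in the report: 64 CUs on ROCm, 20 SMs on CUDA.
DEFAULT_COMM_UNITS = {"rocm": 64, "cuda": 20}


def resolve_dbo_comm_sms(platform, env=None):
    """Hypothetical helper: pick the number of compute units reserved for communication.

    An explicit VLLM_DBO_COMM_SMS value wins; otherwise the platform default applies.
    """
    env = os.environ if env is None else env
    override = env.get("VLLM_DBO_COMM_SMS")
    if override is not None:
        return int(override)
    if platform not in DEFAULT_COMM_UNITS:
        raise ValueError(f"no DBO default known for platform {platform!r}")
    return DEFAULT_COMM_UNITS[platform]
```

Keeping the defaults in one table means that adding a platform is a data change rather than a new code path.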
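PR #34692 (in progress) - [ROCm] Enable DeepEP ROCm as all2allbackend for AMD GPUs.
- Contributors: lcskrishna (AMD employee), itej89 (co-author)
- Content: Integrates DeepEP ROCm as the all2all backend for AMD GPUs. The code is modified so that the FusedBatchedMoE kernels run on AMD GPUs, and support for the float8_e4m3fnuz format is added.
- Impact: Would give AMD GPUs a high-performance all2all communication backend option for MoE, a key step toward more efficient MoE inference on cards such as the MI300.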
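PR #34735 (in progress) - [AMD][CI] Fix test_custom_allreduce for A100 testgroup
- Contributor: rjrock
- Content: Fixes the failing test_custom_allreduce test in AMD CI. The test logic that removes the CUDA_VISIBLE_DEVICES environment variable now removes HIP_VISIBLE_DEVICES on ROCm instead.
- Impact: Keeps AMD CI stable and ensures the related functional tests pass in ROCm environments.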
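The change described for PR #34735 can be pictured with a short sketch. The two environment variable names come from the report; the function name and the is_rocm flag are hypothetical stand-ins for however the test actually detects the platform.

```python
def drop_device_visibility(env, is_rocm):
    """Return a copy of env without the device-visibility variable for this platform.

    On ROCm the relevant variable is HIP_VISIBLE_DEVICES; on CUDA it is
    CUDA_VISIBLE_DEVICES. Variables belonging to the other platform are left alone.
    """
    key = "HIP_VISIBLE_DEVICES" if is_rocm else "CUDA_VISIBLE_DEVICES"
    cleaned = dict(env)      # copy so the caller's mapping is not mutated
    cleaned.pop(key, None)   # no error if the variable was never set
    return cleaned
```

Returning a copy rather than mutating os.environ in place keeps the sketch side-effect free; a real test may well choose to edit the process environment directly.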
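PR #34709 (in progress) - [ROCm] Enable wvSplitK skinny GEMM kernel for RDNA4/gfx1x decode
- Contributor: laudney
- Content: Enables the wvSplitK and wvSplitKQ skinny GEMM kernels on the RDNA4 (gfx12) architecture for decode-phase matrix computation. Kernel adjustments (such as a wave32 DPP reduction) and runtime detection let RDNA GPUs use an optimized path previously limited to the MI series (gfx9).
- Impact: Reports roughly a 15% gain in decode tokens/s on the AMD Radeon AI PRO R9700 (gfx1201), a notable improvement in inference performance on consumer-grade AMD GPUs.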

💬 High-Traffic Discussions

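Issue #34698 - [Feature]: Atomic Rewind & Correct (ARC) via KV-Cache Rollback
- Core topic: User alexbuiko-sketch proposes an "Atomic Rewind & Correct" feature in which an external sidecar monitors for logic errors and triggers a KV-cache rollback, with the goal of reducing compute wasted on logical drift (the proposer claims an 85% saving in compute cycles).
- Main viewpoints:
  - Proposer (alexbuiko-sketch): Argues that a proxy-based interrupt-and-retry approach falls short of high-performance industrial requirements on latency, batch fragmentation, and "atomicity", and that the engine core needs to expose a low-latency (<2 ms) interrupt hook.
  - Skeptic (robertgshaw2-redhat): Disputes the proposer's understanding of latency and of how batching works internally, notes that the feature could add substantial complexity to the core engine, and suggests first exploring an implementation built on the existing prefix caching feature (similar to how beam search is implemented).
- Point of contention: Whether the feature is necessary, and whether its complexity is worth adding to the core engine. Core maintainers lean toward a cautious evaluation and have asked for a more detailed design document.
- Current status: Discussion open, awaiting a more concrete design proposal.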
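Issue #34746 - [Feature]: Add --force-close-connections flag to vllm bench serve
- Core topic: User InfraWhisperer suggests adding a --force-close-connections flag to vllm bench serve so that persistent connections can be closed when benchmarking behind a load balancer, allowing requests to be distributed across backend instances.
- Main viewpoints:
  - Proposer (InfraWhisperer): With the currently hard-coded connection pooling behavior, all requests behind a load balancer hit the same backend, so real production traffic patterns cannot be reproduced. The change is small and self-contained and does not alter the default behavior.
  - Commenter (russellb): Asked why the current setting is hard-coded.
  - Proposer's follow-up: git blame shows the setting was introduced to keep TLS overhead from inflating TTFT, but multi-replica deployments need this option to match the actual topology.
- Current status: Discussion open; the proposer has offered to contribute the code.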
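The effect discussed in Issue #34746 can be reproduced with a toy model. It assumes a load balancer that chooses a backend once per TCP connection in round-robin order; that balancing policy, the function name, and the backend names are illustrative assumptions, not details from the issue or from vllm bench serve.

```python
from collections import Counter
from itertools import cycle


def simulate(num_requests, backends, force_close):
    """Count requests per backend under a per-connection round-robin balancer.

    With a persistent connection the balancer decides only once, so every
    request lands on the same backend. Closing the connection after each
    request forces a fresh balancing decision every time.
    """
    picker = cycle(backends)
    hits = Counter()
    current = None
    for _ in range(num_requests):
        if force_close or current is None:
            current = next(picker)  # new connection, new balancing decision
        hits[current] += 1
    return dict(hits)


# Example: six requests against three replicas.
pooled = simulate(6, ["a", "b", "c"], force_close=False)
spread = simulate(6, ["a", "b", "c"], force_close=True)
```

The model ignores the cost side of the trade-off raised in the thread: each new connection pays handshake overhead (TLS in particular), which is why reuse was the default.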
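Issue #34759 - [Bug]: nvidia/Llama-3.3-70B-Instruct-NVFP4 Degraded / Gibberish Output with TRITON_ATTN
- Core topic: A user reports that a specific NVFP4-quantized model produces gibberish when the TRITON_ATTN backend is used.
- Main viewpoints:
  - Analyst (baonudesifeizhai): Through code review, identified the root cause as the Triton attention backend's current lack of support for the output_block_scale parameter that NVFP4 quantization requires, which leaves the quantize/dequantize paths mismatched. This is an incompatibility between a backend and a quantization fusion pattern, not a general model-quality problem.
- Conclusion: The root cause was located quickly; a specific backend does not yet fully support a new quantization format.
- Current status: Issue open, awaiting a fix.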

🔥 Hot Topics and Trends

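Speculative decoding refinements: Several issues and PRs revolve around this topic.
- Statistics accuracy: Issue #34734 and PR #34757 address spec-decode statistics (acceptance rate) that are still counted, incorrectly, when the drafter is skipped because the input is too long.
- Stronger tests: PR #34772 proposes adding a GSM8k accuracy check to the end-to-end tests in place of the unreliable token-matching tests.
- Vocabulary validation: PR #34722 extends the vocabulary-size check to every speculative decoding method that uses a draft model (MTP, Eagle, and others), guarding against potential out-of-bounds GPU memory accesses.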
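To make the statistics-accuracy item concrete, here is a minimal sketch of an acceptance-rate counter that ignores steps where drafting was skipped. The class, its fields, and the draft_skipped flag are hypothetical; the sketch shows the counting rule only and does not mirror vLLM's metrics code.

```python
from dataclasses import dataclass


@dataclass
class DraftStats:
    """Hypothetical counter for speculative decoding acceptance statistics."""

    num_drafts: int = 0
    num_draft_tokens: int = 0
    num_accepted_tokens: int = 0

    def observe(self, drafted, accepted, draft_skipped=False):
        # A skipped draft proposed nothing, so it must not touch any counter.
        # Counting its nominal draft length would drag the acceptance rate down.
        if draft_skipped:
            return
        self.num_drafts += 1
        self.num_draft_tokens += drafted
        self.num_accepted_tokens += accepted

    @property
    def acceptance_rate(self):
        if self.num_draft_tokens == 0:
            return 0.0
        return self.num_accepted_tokens / self.num_draft_tokens
```

If the skipped step were counted with its nominal draft length and zero accepted tokens, the reported rate would fall even though the drafter never ran.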
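Quantized model loading and compatibility:
- Format matching: Issue #34771 reports that a LLaVA-Next model quantized in the compressed-tensors format fails to load because the weight key names the loader expects do not match the model definition.
- New quantization format compatibility: Issues #34759 (NVFP4 with the Triton backend), #34728 (Nemotron NVFP4 with TRTLLM), and #34694 (BF16 NVFP4 Marlin on GPUs without FP4 support) all reflect the difficulty of supporting the new NVFP4 format across hardware and software backends.
- KV cache quantization config: Issue #34752 points out that the behavior of --kv-cache-dtype auto needs improvement when the checkpoint specifies kv_cache_quant_algo.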
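Multimodal and Renderer architecture evolution:
- Architecture refactoring: PR #34711 (merged) and PR #34560 (merged) continue integrating the Renderer component, moving functionality such as multimodal hash parsing into the Renderer and further unifying the input processing flow. This is steady progress toward the architecture described in Issue #22880 (RFC: Unified Input Formatting).
- Caching and error handling: PR #34749 aims to replace fatal assertions in the multimodal processor cache with a specific exception, so that cache races no longer crash the engine process.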
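The general pattern behind the caching and error-handling item, trading an assertion for a dedicated exception that callers can catch, might look like the sketch below. All names are hypothetical stand-ins (the PR title mentions a MultiModalCacheMissError, whose actual definition is not reproduced here), and the cache is a plain dictionary rather than vLLM's processor cache.

```python
class CacheMissError(KeyError):
    """Hypothetical stand-in for a specific, catchable cache-miss exception."""


class ProcessorCache:
    """Toy cache keyed by a multimodal item hash."""

    def __init__(self):
        self._items = {}

    def put(self, mm_hash, value):
        self._items[mm_hash] = value

    def get(self, mm_hash):
        # An `assert mm_hash in self._items` here would raise AssertionError,
        # which callers rarely catch and which can take the whole process down.
        try:
            return self._items[mm_hash]
        except KeyError:
            raise CacheMissError(mm_hash) from None


def handle_request(cache, mm_hash):
    """Fail only the affected request when its cached item has disappeared."""
    try:
        return ("ok", cache.get(mm_hash))
    except CacheMissError:
        return ("error", f"multimodal item {mm_hash} is no longer cached")
```

A typed exception lets the layer above decide between retrying, recomputing, or rejecting a single request, which an assertion does not allow.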

🛠️ Key Technical Changes

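PR #34718 (merged) - [torch.compile] Turn on silu+fp4 quant fusion by default for O1+
- Analysis: Enables operator fusion of SiLU-Mul with NVFP4 quantization by default at torch.compile optimization level O1 and above. It is a lightweight pointwise fusion with little impact on startup time and a worthwhile performance gain.
- Impact: Further improves inference performance for NVFP4-quantized models that use torch.compile and contain SiLU activations.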
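PR #33538 (merged) - [Kernel] Triton-based Top-k and Top-p sampler kernels
- Analysis: Introduces new Top-k and Top-p sampling kernels written in Triton. The algorithm shrinks the search space by truncating and gathering "outlier" logits. In most cases it is faster than both the previous implementation (#32558) and the native PyTorch implementation, at the cost of roughly 80 MiB of additional VRAM.
- Impact: Gives vLLM a faster sampling kernel option that may become the basis of the default sampling implementation.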
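For readers unfamiliar with what a Top-k / Top-p sampler has to compute, the following pure-Python reference shows the filtering semantics on a single logits vector. It is a plain sort-based sketch, not the truncate-and-gather algorithm of the Triton kernel, and it assumes one common convention (apply top-k first, then top-p over the survivors); implementations differ on such details.

```python
import math


def top_k_top_p_filter(logits, k, p):
    """Mask logits outside the top-k / top-p set with -inf.

    k <= 0 disables top-k. p is the cumulative probability mass to keep,
    measured after top-k filtering; p >= 1.0 disables top-p.
    """
    # Token indices ordered from highest to lowest logit.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    survivors = order[:k] if k > 0 else order

    # Numerically stable softmax over the top-k survivors.
    peak = logits[survivors[0]]
    weights = [math.exp(logits[i] - peak) for i in survivors]
    total = sum(weights)

    # Keep the smallest prefix whose cumulative probability reaches p.
    kept = set()
    cumulative = 0.0
    for idx, weight in zip(survivors, weights):
        kept.add(idx)
        cumulative += weight / total
        if cumulative >= p:
            break

    return [x if i in kept else -math.inf for i, x in enumerate(logits)]
```

Sorting the full vocabulary for every request is the expensive step in a reference like this, and it is that search cost the PR's kernel is described as reducing.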
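PR #34724 (merged) - [Model Runner V2] Further simplification for PP
- Analysis: Converts the pipeline parallelism (PP) PPHandler into simpler utility methods, another step in the ongoing simplification of the Model Runner V2 architecture.
- Impact: Reduces code complexity, making the V2 architecture clearer and easier to maintain.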
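PR #34179 (merged) - [Feature] Decode Context Parallel support for GPU model runner v2
- Analysis: Implements Decode Context Parallel (DCP) support in GPU Model Runner V2, including CUDA Graph support.
- Impact: Extends the V2 architecture's support for advanced parallelism strategies and lays groundwork for future performance work.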

📈 Development Activity Observations

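Active contribution: 80 new PRs in a single day show a very active contributor community, and 23 merged PRs indicate a brisk review and merge cadence from the core team.
AMD team involvement: In this period, several developers whose usernames contain -amd or who are clearly AMD employees submitted significant ROCm enablement and fix PRs (#34726, #34692), showing that AMD is investing actively to strengthen its position in the vLLM ecosystem.
Code cleanup and refactoring: Several PRs focus on simplification, cleanup, and CI/CD fixes (for example #34767, #34514, #34699), indicating that the project attends to code health and infrastructure stability even while iterating quickly.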

💡 Issues Worth Watching

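Issue #34698 - Atomic Rewind & Correct (ARC): The proposal prompted a deeper discussion about engine-core complexity versus performance benefit. Whether it is eventually accepted will show how the vLLM project weighs extreme optimization needs against keeping the core code simple.
Issue #34731 / PR #34733 - swap_states performance optimization: The optimization aims to reduce memory-copy overhead on the request reordering path. Because performance is sensitive to scheduling and memory operations, its final effect is worth watching.
Issue #34681 - v0.15.1 performance regression report: A user reports a noticeable performance drop after upgrading from v0.10.1 to v0.15.1 with a large model (Qwen3-480B) in a multi-GPU (TP=8, PP=2) configuration. The community or maintainers need to investigate further to determine whether this is a configuration problem or a regression.
Multi-model quantization support: Model-loading problems such as Issue #34771 show that as quantization tools and formats diversify, vLLM's loader must keep adapting, an ongoing compatibility challenge.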
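The idea behind the swap_states item (Issue #34731 / PR #34733), copying only the active token prefix when two request slots trade places, can be sketched on plain Python lists. The buffer layout, the lengths array, and the function name are assumptions for illustration; the actual worker code operates on preallocated tensors.

```python
def swap_rows_prefix(buffer, lengths, i, j):
    """Swap rows i and j of a padded token buffer, touching only active columns.

    Each row is preallocated to the maximum sequence length, but only the first
    lengths[row] entries are meaningful. Swapping just the longer of the two
    active prefixes avoids copying the padding.
    Returns the number of columns copied per row.
    """
    n = max(lengths[i], lengths[j])
    # Both right-hand slices are materialized before either assignment runs.
    buffer[i][:n], buffer[j][:n] = buffer[j][:n], buffer[i][:n]
    lengths[i], lengths[j] = lengths[j], lengths[i]
    return n
```

Entries past the copied prefix stay where they were, so this is only safe when the lengths array is the sole source of truth for which entries are valid.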

📋 Appendix: Detailed Data Lists

New Issues

#34736 [Refactor]: Handle OOV Multimodal Tokens Generically | documentation | by alex-jw-brooks (created: 2026-02-18 03:15 (UTC+8))
#34771 [Bug]: LlavaNext + compressed-tensors: loader expects multi_modal_projector.linear_1.weight_packed but model only has linear_1weight/linear_1bias | bug | by gocoding1011 (created: 2026-02-18 10:34 (UTC+8))
#34752 [Bug]: Improve --kv-cache-dtype behavior when checkpoint specifies kv_cache_quant_algo | bug,good first issue | by pavanimajety (created: 2026-02-18 05:29 (UTC+8))
#34759 [Bug]: nvidia/Llama-3.3-70B-Instruct-NVFP4 Degraded / Gibberish Output with TRITON_ATTN | bug | by frankwang28 (created: 2026-02-18 07:19 (UTC+8))
#34755 Qwen3-Coder-Next-FP8 with tool calling causes system hard-freeze on multi-GPU tensor parallel (v0.15.1) | usage | by zaidorx (created: 2026-02-18 05:59 (UTC+8))
#34763 [Performance]: Remove batch_size and max_seq_len padding when cuDNN is upgraded to 9.19 | performance | by maxyanghu (created: 2026-02-18 09:41 (UTC+8))
#34746 [Feature]: Add --force-close-connections flag to vllm bench serve | feature request | by InfraWhisperer (created: 2026-02-18 04:58 (UTC+8))
#34684 [Bug]: Qwen3.5-397B-A17B - reasoning in content with qwen3 reasoning parser | bug | by FWao (created: 2026-02-17 17:56 (UTC+8))
#34698 [Feature]: Atomic Rewind & Correct (ARC) via KV-Cache Rollback for Logical Grounding | feature request | by alexbuiko-sketch (created: 2026-02-17 21:10 (UTC+8))
#34701 [CI Failure]: Entrypoints Integration (Responses API) | ci-failure | by robertgshaw2-redhat (created: 2026-02-17 22:06 (UTC+8))
#34734 [Bug]: Spec Decode Stats still report drafted tokens if drafting is skipped | bug | by benchislett (created: 2026-02-18 02:51 (UTC+8))
#34731 [Performance]: Improve swap_states by swapping active token prefixes | performance | by pjo256 (created: 2026-02-18 02:10 (UTC+8))
#34728 [Bug]: Nemotron NVFP4 Failing With TRTLLM NVFP4 | bug | by robertgshaw2-redhat (created: 2026-02-18 01:50 (UTC+8))
#34702 [CI Failure]: Sequence Parallel Tests | ci-failure | by robertgshaw2-redhat (created: 2026-02-17 22:09 (UTC+8))
#34713 [Bug]: mkdocs --strict fails on ‘device’ arg in pp_handler.py docstring | bug | by junuxyz (created: 2026-02-17 23:15 (UTC+8))
#34708 [CI Failure]: Fp8 MoE Kernels (DeepEP + DeepGEMM) | ci-failure | by robertgshaw2-redhat (created: 2026-02-17 22:40 (UTC+8))
#34704 [CI Failure]: LM Eval Large Models (H100x4) - Nemotron | ci-failure | by robertgshaw2-redhat (created: 2026-02-17 22:27 (UTC+8))
#34703 [CI Failure]: Kernels (B200) | ci-failure | by robertgshaw2-redhat (created: 2026-02-17 22:23 (UTC+8))
#34700 [CI Failure]: V1 Engine E2E | ci-failure | by robertgshaw2-redhat (created: 2026-02-17 22:00 (UTC+8))
#34706 [CI Failure]: LM Eval Small Models (B200) - Qwen3-Next | ci-failure | by robertgshaw2-redhat (created: 2026-02-17 22:31 (UTC+8))
#34705 [Bug]: Old torch compile files cause poor CPU utilisation | bug | by almayne (created: 2026-02-17 22:27 (UTC+8))
#34694 [Bug]: BF16 NVFP4 Marlin produces garbled output on GPUs without native FP4 support | bug | by ricky-chaoju (created: 2026-02-17 20:39 (UTC+8))
#34689 [Bug]: [cpu] CPUFusedMOE passes MoEActivation enum to torch op expecting str | bug | by micyan01 (created: 2026-02-17 19:26 (UTC+8))
#34691 [Bug][Metrics] - record stats when requests are aborted by pause/sleep | bug | by markmc (created: 2026-02-17 19:47 (UTC+8))
#34681 [Performance]: Significant performance drop in v0.15.1 with Qwen3-480B? | performance | by randomkrml (created: 2026-02-17 16:48 (UTC+8))

Closed Issues

#22880 [RFC]: Unified Input Formatting and Processing via Renderer | RFC,keep-open,multi-modality | by ywang96 (closed: 2026-02-18 11:30 (UTC+8))
#34634 [Bug]: SharedStorageConnector: vectorized_gather_kernel assertion on Blackwell (B200) GPUs | bug | by mmkamani7 (closed: 2026-02-18 10:49 (UTC+8))
#30320 [Bug]: v1 scheduler assert num_new_tokens > 0 | bug | by sszgwdk (closed: 2026-02-18 10:55 (UTC+8))
#19849 [Bug]: Subprocess health check / automatic restart for V1 EngineCore | bug,stale | by movchan74 (closed: 2026-02-18 10:18 (UTC+8))
#20479 [Bug]: AssertionError: expected size 250112==136064, stride 5120==4096 at dim=0; expected size 5120==4096, stride 1==1 at dim=1 | bug,stale | by Link-Li (closed: 2026-02-18 10:18 (UTC+8))
#21797 [Bug]: speculative decoding not implemented error | bug,stale | by liana313 (closed: 2026-02-18 10:18 (UTC+8))
#25350 [Bug]: The maximum model length calculation should also take into account the kv-cache-dtype | bug,stale | by ErykCh (closed: 2026-02-18 10:17 (UTC+8))
#26794 [Feature]: Lazy import DeepseekV32IndexerCache | feature request,stale | by noooop (closed: 2026-02-18 10:17 (UTC+8))
#26964 [Bug]: Issue with Deepseek Reasoning parser with Qwen3 2507 chat templates | bug,stale | by MikeNatC (closed: 2026-02-18 10:17 (UTC+8))
#27086 [Bug]: After enabling P-D Disaggregation, the final output results are not entirely identical. | bug,stale | by freedom-cui (closed: 2026-02-18 10:17 (UTC+8))
#27090 [Usage]: Does vLLM support a data-parallel group spanning multiple nodes when starting an online service? | usage,stale | by KrisLu999 (closed: 2026-02-18 10:17 (UTC+8))
#27138 [Bug] GLM-4.5V fails to load weights with KeyError: ‘blocks.0.attn.qkv.weight’ | bug,stale | by richardhundt (closed: 2026-02-18 10:17 (UTC+8))
#27154 [Installation]: How to reduce the vllm image | installation,stale | by geraldstanje (closed: 2026-02-18 10:17 (UTC+8))
#27184 [Doc]: Multi-Modal Benchmark is too simple | documentation,stale | by BigFaceBoy (closed: 2026-02-18 10:17 (UTC+8))
#34625 [Bug]: Reasoning Parser not working with Minimax m2.5 | bug | by AyRickk (closed: 2026-02-18 04:15 (UTC+8))
#34684 [Bug]: Qwen3.5-397B-A17B - reasoning in content with qwen3 reasoning parser | bug | by FWao (closed: 2026-02-18 03:52 (UTC+8))
#34701 [CI Failure]: Entrypoints Integration (Responses API) | ci-failure | by robertgshaw2-redhat (closed: 2026-02-18 03:03 (UTC+8))
#34702 [CI Failure]: Sequence Parallel Tests | ci-failure | by robertgshaw2-redhat (closed: 2026-02-18 01:07 (UTC+8))
#34555 [BUG]: api_server.py: error: unrecognized arguments: --guided-decoding-backend | usage | by robcaulk (closed: 2026-02-18 01:07 (UTC+8))
#34713 [Bug]: mkdocs --strict fails on ‘device’ arg in pp_handler.py docstring | bug | by junuxyz (closed: 2026-02-17 23:48 (UTC+8))
#34642 [Bug]: Incorrect warning when both mistral & HF format are present in repo | bug | by patrickvonplaten (closed: 2026-02-17 19:35 (UTC+8))
#32920 [Bug]: disagg_proxy_demo.py: Method logic error in ‘remove_instance_endpoint’ | bug | by ChenqianCao (closed: 2026-02-17 18:06 (UTC+8))
#34640 [Bug]: Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 producing random symbols after 3bb4e4311 | bug | by stavinsky (closed: 2026-02-17 17:29 (UTC+8))
#34132 [Bug]: Compilation Error with Clang 22+ (ROCm) due to String Literal Comparison in quant_utils.cuh | bug,rocm | by kyuz0 (closed: 2026-02-17 15:09 (UTC+8))

New PRs

#34718 [torch.compile] Turn on silu+fp4 quant fusion by default for O1+ | performance,ready,torch.compile,nvidia | by ProExpertProg (created: 2026-02-18 00:11 (UTC+8))
#34711 [Renderer] Move MM Hash parsing into Renderer | ready,multi-modality,deepseek | by DarkLight1337 (created: 2026-02-17 22:53 (UTC+8))
#34743 [Core] Fix SSRF bypass via backslash-@ URL parsing inconsistency | ready,multi-modality | by russellb (created: 2026-02-18 04:37 (UTC+8))
#34673 [Bugfix][MOE] Fix incorrect routing selection for models without expert groups (e.g., MiniMax-M2.1) | bug,ready,nvidia | by wwl2755 (created: 2026-02-17 14:30 (UTC+8))
#34768 Improve Transformers v4/v5 compatibility in tokenizers and processors | multi-modality | by cccat6 (created: 2026-02-18 10:27 (UTC+8))
#34764 [Build] Update numba version to 0.63.1. | rocm,ready,ci/build,cpu,nvidia | by nascheme (created: 2026-02-18 10:18 (UTC+8))
#34758 [Model Bash] DeepSeek R1 BF16 Min Latency KV A GEMM (0.5% E2E Speedup) | ready,ci/build,deepseek | by robertgshaw2-redhat (created: 2026-02-18 07:16 (UTC+8))
#34773 [Misc][LoRA] Increase max vocab size limit to 258048 in logits processor | no labels | by bhoomit (created: 2026-02-18 11:01 (UTC+8))
#34772 [Tests] Add GSM8k check to SpecDec E2E tests | v1 | by benchislett (created: 2026-02-18 10:52 (UTC+8))
#34766 [Model Runner V2] A bit more PP simplification | ready,v1 | by njhill (created: 2026-02-18 10:21 (UTC+8))
#34767 [CI] Remove unused precompiled wheel args from image build | ready,ci/build | by amrmahdi (created: 2026-02-18 10:25 (UTC+8))
#34738 Fix memory leak in Responses API store | documentation,frontend,v1 | by lisperz (created: 2026-02-18 03:30 (UTC+8))
#34769 Handle setpgrp permission errors in forked test harness | no labels | by cccat6 (created: 2026-02-18 10:27 (UTC+8))
#34737 [Experimental] Piecewise CUDA graph with dynamic-shape FFN | performance,llama,nvidia | by yunjiangster (created: 2026-02-18 03:22 (UTC+8))
#34770 [Build] Add Python 3.14 to supported version list. | ci/build,nvidia | by nascheme (created: 2026-02-18 10:29 (UTC+8))
#34765 [Core] Minor structured-output related scheduler optimization | ready,v1 | by njhill (created: 2026-02-18 10:19 (UTC+8))
#34753 [ROCm][CI] Removed hard-coded attn backend requirement for Qwen VL | rocm,ready,multi-modality,qwen | by AndreasKaratzas (created: 2026-02-18 05:53 (UTC+8))
#34754 [Bug] Fix a corner case in _process_simple_streaming_events | bug,frontend | by 842974287 (created: 2026-02-18 05:56 (UTC+8))
#34760 Add platform method to enable custom collective ops registration | rocm,nvidia,meta-exported,fb-exported | by nkm-meta (created: 2026-02-18 07:23 (UTC+8))
#34726 [ROCm] Enable DBO (Dynamic Batch Optimization) on ROCm | rocm,v1 | by raviguptaamd (created: 2026-02-18 01:30 (UTC+8))
#34739 [BugFix]: Fix local mypy issues | bug,frontend,kv-connector | by hickeyma (created: 2026-02-18 03:43 (UTC+8))
#34693 [Bugfix] Fix mypy errors for StructuredOutputsParams by using stdlib dataclass | bug,ready | by hyeongyun0916 (created: 2026-02-17 20:19 (UTC+8))
#34747 fix: use external_req_id for P2P NCCL keys in disaggregated prefill | v1,kv-connector | by timon0305 (created: 2026-02-18 04:59 (UTC+8))
#34722 fix(spec): extend vocab size validation to MTP and Eagle methods | speculative-decoding,v1 | by timon0305 (created: 2026-02-18 00:49 (UTC+8))
#34762 [Bugfix]: Improving --kv-cache-dtype behavior when checkpoint specifies kv_cache_quant_algo | bug | by WorldExplored (created: 2026-02-18 07:49 (UTC+8))
#34761 Fix: Preserve input type in _get_modality_specific_lora_reqs | documentation,new-model,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,qwen,kv-connector | by puririshi98 (created: 2026-02-18 07:40 (UTC+8))
#34745 Fix empty tool_call_id in Anthropic messages API tool result conversion | frontend | by sfeng33 (created: 2026-02-18 04:58 (UTC+8))
#34724 [Model Runner V2] Further simplification for PP | v1 | by WoosukKwon (created: 2026-02-18 01:22 (UTC+8))
#34735 [AMD][CI] Fix test_custom_allreduce for A100 testgroup | rocm | by rjrock (created: 2026-02-18 03:03 (UTC+8))
#34757 [Bug][Spec Decode] Fix AR metrics when drafting is skipped | bug,v1 | by benchislett (created: 2026-02-18 06:43 (UTC+8))
#34744 [Attention] Enable masked MHA for topk sparse attention using FA4 | documentation,performance,rocm,ci/build,v1,nvidia | by MatthewBonanni (created: 2026-02-18 04:44 (UTC+8))
#34751 Fix spec decode stats when drafting is skipped | v1 | by Chryseisliu (created: 2026-02-18 05:25 (UTC+8))
#34756 preliminary attempt on nightly rocm docker | rocm,ci/build | by hongxiayang (created: 2026-02-18 06:15 (UTC+8))
#34725 Fix NVFP4 TRTLLM MoE non-gated support; add gsm8k for Nemotron-3-Nano FP8+NVFP4 | bug,ready,nvidia | by mgoin (created: 2026-02-18 01:25 (UTC+8))
#34707 tests: make parsable-context MCP assertion order-agnostic | needs-rebase | by anencore94 (created: 2026-02-17 22:39 (UTC+8))
#34750 [Rocm][CI] Fix LM Eval Large Models (H100) test group | rocm,ci/build | by charlifu (created: 2026-02-18 05:24 (UTC+8))
#34749 fix: replace fatal assertions with MultiModalCacheMissError in MM cache | v1,multi-modality | by timon0305 (created: 2026-02-18 05:15 (UTC+8))
#34748 [Bugfix] Step3p5 reasoning end detection state leak in streaming chat | bug | by infernix (created: 2026-02-18 05:07 (UTC+8))
#34709 [ROCm] Enable wvSplitK skinny GEMM kernel for RDNA4/gfx1x decode | new-model,rocm,ci/build | by laudney (created: 2026-02-17 22:41 (UTC+8))
#34741 [ROCm] Enable FP8 KV-cache and relax constraints for RDNA4 custom paged attention | rocm | by laudney (created: 2026-02-18 03:59 (UTC+8))
#34740 [ROCm] Use supports_fp8() for FP8 feature gates instead of arch checks | rocm | by laudney (created: 2026-02-18 03:58 (UTC+8))
#34742 refactor(attention): add default get_kv_cache_stride_order implementation | performance,v1,kv-connector | by timon0305 (created: 2026-02-18 04:12 (UTC+8))
#34687 [Update] Use FlashInfer fast_decode_plan directly instead of replication | v1,nvidia | by askliar (created: 2026-02-17 18:42 (UTC+8))
#34719 [Bugfix] Qwen3.5 kv-scale weight remapping | bug,ready,qwen | by Linda-Stadter (created: 2026-02-18 00:17 (UTC+8))
#34729 Removing blocking status from MI355 tests. | ci/build | by Alexei-V-Ivanov-AMD (created: 2026-02-18 01:52 (UTC+8))
#34733 fix(worker): optimize swap_states to copy only active token prefixes | v1 | by pjo256 (created: 2026-02-18 02:12 (UTC+8))
#34723 [Bugfix] Fix prefix creation for Qwen3.5 | bug,ready,qwen | by mgoin (created: 2026-02-18 00:56 (UTC+8))
#34670 [Core] Default to “mp” rather than “uni” distributed executor backend | ready,v1 | by njhill (created: 2026-02-17 13:28 (UTC+8))
#34717 [CI] Fix flaky test_parsable_context | ready | by sfeng33 (created: 2026-02-18 00:05 (UTC+8))
#34732 [Attention] Use FA4 for MLA prefill | documentation,performance,needs-rebase,ci/build,v1,nvidia | by MatthewBonanni (created: 2026-02-18 02:12 (UTC+8))
#34730 [Frontend][Core] Add “wait” shutdown mode | frontend,v1 | by markmc (created: 2026-02-18 01:59 (UTC+8))
#34720 feat(metrics): add --histogram-buckets CLI flag for configurable Prometheus histogram buckets | v1 | by timon0305 (created: 2026-02-18 00:27 (UTC+8))
#34727 [PD][Nixl] Add support for hybrid SSM-FA models | v1,kv-connector | by NickLucche (created: 2026-02-18 01:49 (UTC+8))
#34699 [CI/Build] Remove use of skip_v1 | ready,ci/build | by DarkLight1337 (created: 2026-02-17 21:34 (UTC+8))
#34692 [ROCm] Enable DeepEP ROCm as all2allbackend for AMD GPUs. | rocm | by lcskrishna (created: 2026-02-17 19:53 (UTC+8))
#34721 fix(api): return HTTP 413 instead of 400 for oversized requests | frontend | by timon0305 (created: 2026-02-18 00:42 (UTC+8))
#34715 fix: Ensure invalid audio files return 400 error | frontend | by jasonozuzu-cohere (created: 2026-02-17 23:31 (UTC+8))
#34716 [BugFix] Fix sp tests | bug,ready | by zou3519 (created: 2026-02-17 23:57 (UTC+8))
#34696 [Bugfix] fix activation in cpu_fused_moe_torch call | bug,ready,cpu | by michalowski-arm (created: 2026-02-17 21:05 (UTC+8))
#34685 [WIP][Bugfix] Fix type mismatch in causal_conv1d Triton kernels PAD_SLOT_ID checks | bug,nvidia | by haosdent (created: 2026-02-17 18:06 (UTC+8))
#34712 [CI] Move tests from H200 to B200 | ready,ci/build | by robertgshaw2-redhat (created: 2026-02-17 23:12 (UTC+8))
#34714 [Build] Include OpenTelemetry in release Docker images | ci/build,cpu | by codefromthecrypt (created: 2026-02-17 23:26 (UTC+8))
#34695 [WIP][Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models | bug | by haosdent (created: 2026-02-17 20:45 (UTC+8))
#34688 [ROCm] Enable bitsandbytes quantization support on ROCm | documentation,rocm,ready,ci/build | by Abdennacer-Badaoui (created: 2026-02-17 18:46 (UTC+8))
#34710 [Bugfix] Fix test_flashinfer_blockscale_fp8_none_expert_group on B200 | bug,nvidia | by tlrmchlsmth (created: 2026-02-17 22:42 (UTC+8))
#34671 Update max_num_tokens value when specdec is enabled | v1 | by shaharmor98 (created: 2026-02-17 14:11 (UTC+8))
#34697 [Bugfix] Redo Qwen3.5/Qwen3-Next GDN projector fusion | bug,qwen | by Isotr0py (created: 2026-02-17 21:06 (UTC+8))
#34690 [WIP][Bugfix] Fix xgrammar nanobind leaked objects at shutdown | bug,structured-output,v1 | by haosdent (created: 2026-02-17 19:34 (UTC+8))
#34677 [Bugfix][CPU] Fix basic unit tests failing in CPU platforms | bug,nvidia | by jasonyanwenl (created: 2026-02-17 15:47 (UTC+8))
#34686 Fix docs build warning | ready,v1 | by hmellor (created: 2026-02-17 18:09 (UTC+8))
#34683 Revert “[Models] Fuse Qwen3.5 GDN’s qkvz_proj and ba_proj” | qwen | by ZJY0516 (created: 2026-02-17 17:22 (UTC+8))
#34682 [kv_offload+HMA][2/N]: Support sliding window lookup | v1,kv-connector | by orozery (created: 2026-02-17 17:20 (UTC+8))
#34676 Add VLLM_SKIP_MODEL_VALIDATION environment variable | frontend | by dsingal0 (created: 2026-02-17 15:41 (UTC+8))
#34678 [GGUF][Model] Add Qwen3-Coder-Next GGUF support | multi-modality,qwen | by rudybear (created: 2026-02-17 15:48 (UTC+8))
#34672 [ci] Add Ray compatibility check informational CI job | ci/build | by jeffreywang-anyscale (created: 2026-02-17 14:18 (UTC+8))
#34680 [kv_offload+HMA][1/N]: Worker-side support for multiple HMA groups | v1 | by orozery (created: 2026-02-17 16:21 (UTC+8))
#34679 [Docs]Fix documentation formatting in architecture overview | documentation | by lichuang (created: 2026-02-17 16:12 (UTC+8))
#34675 fix(modelopt): average divergent w13 scale factors for NVFP4 MoE | no labels | by chknlittle (created: 2026-02-17 15:13 (UTC+8))
#34674 Adding Nemotron fp8 Triton MoE Config | no labels | by yugong333 (created: 2026-02-17 15:03 (UTC+8))
#34669 [Model Runner V2] Minor simplification for BadWordsState | v1 | by WoosukKwon (created: 2026-02-17 13:22 (UTC+8))

Merged PRs

#34718 [torch.compile] Turn on silu+fp4 quant fusion by default for O1+ | performance,ready,torch.compile,nvidia | by ProExpertProg (merged: 2026-02-18 11:29 (UTC+8))
#34653 [BugFix] [Build] fix string literals comparison in indexer_k_quant_and_cache calling site | bug,ready | by hongxiayang (merged: 2026-02-18 11:19 (UTC+8))
#34711 [Renderer] Move MM Hash parsing into Renderer | ready,multi-modality,deepseek | by DarkLight1337 (merged: 2026-02-18 11:18 (UTC+8))
#34767 [CI] Remove unused precompiled wheel args from image build | ready,ci/build | by amrmahdi (merged: 2026-02-18 10:58 (UTC+8))
#33600 [Attention] Refactor check_and_update_config | ready,v1,nvidia | by MatthewBonanni (merged: 2026-02-18 09:06 (UTC+8))
#34179 [Feature] Decode Context Parallel support for GPU model runner v2 | ready,v1,nvidia | by yewentao256 (merged: 2026-02-18 08:27 (UTC+8))
#34724 [Model Runner V2] Further simplification for PP | v1 | by WoosukKwon (merged: 2026-02-18 07:18 (UTC+8))
#33538 [Kernel] Triton-based Top-k and Top-p sampler kernels | performance,ready,v1 | by cakeng (merged: 2026-02-18 07:14 (UTC+8))
#34457 [Bugfix][MTP][Sparse MLA] Allow sparse MLA with MTP to run with FULL cudagraphs | bug,documentation,performance,ready,v1,nvidia | by MatthewBonanni (merged: 2026-02-18 03:01 (UTC+8))
#34717 [CI] Fix flaky test_parsable_context | ready | by sfeng33 (merged: 2026-02-18 02:42 (UTC+8))
#34716 [BugFix] Fix sp tests | bug,ready | by zou3519 (merged: 2026-02-18 01:07 (UTC+8))
#34324 Fixed whisper CPU test that does not spawn properly. | rocm,ready,multi-modality | by almayne (merged: 2026-02-17 22:46 (UTC+8))
#34615 [CI][Nixl] Add CrossLayer KV layout tests | ready,ci/build,kv-connector | by NickLucche (merged: 2026-02-17 21:35 (UTC+8))
#34560 [Renderer] Move InputPreprocessor into Renderer (2/2) | frontend,ready,v1,multi-modality,qwen | by DarkLight1337 (merged: 2026-02-17 21:29 (UTC+8))
#34514 [CI][BugFix] ShellCheck cleanup to remove baseline and preserve runtime behavior | bug,documentation,performance,structured-output,tpu,ready,ci/build,v1,cpu,kv-connector | by junuxyz (merged: 2026-02-17 20:22 (UTC+8))
#34686 Fix docs build warning | ready,v1 | by hmellor (merged: 2026-02-17 18:31 (UTC+8))
#32922 [Bugfix] Fix ‘remove_instance_endpoint’ method logic in disagg_proxy_demo | bug,documentation,ready,kv-connector | by ChenqianCao (merged: 2026-02-17 18:06 (UTC+8))
#34683 Revert “[Models] Fuse Qwen3.5 GDN’s qkvz_proj and ba_proj” | qwen | by ZJY0516 (merged: 2026-02-17 17:29 (UTC+8))
#34633 Remove dead bitsandbytes CxB code from 8-bit inference path | ready | by TimDettmers (merged: 2026-02-17 17:49 (UTC+8))
#34383 [Ray] Propagate third-party env vars to Ray workers via prefix matching | ready,ray,ci/build,v1 | by kouroshHakha (merged: 2026-02-17 17:08 (UTC+8))
#34656 [CI] Fix bake config artifact path for AMI rebuild pipeline | ready,ci/build | by amrmahdi (merged: 2026-02-17 14:39 (UTC+8))
#34662 [Model Runner V2] Minor refactoring for penalties | v1 | by WoosukKwon (merged: 2026-02-17 13:43 (UTC+8))
#34669 [Model Runner V2] Minor simplification for BadWordsState | v1 | by WoosukKwon (merged: 2026-02-17 13:27 (UTC+8))

PRs Closed Without Merging

#34737 [Experimental] Piecewise CUDA graph with dynamic-shape FFN | performance,llama,nvidia | by yunjiangster (closed: 2026-02-18 10:39 (UTC+8))
#21446 Support online_serving for qwen3-reranker model | documentation,stale,qwen | by QianZ423 (closed: 2026-02-18 10:18 (UTC+8))
#25533 Update the vllm quantization support for the AMD GPU | documentation,rocm,stale | by Adityayxt (closed: 2026-02-18 10:17 (UTC+8))
#26998 Enhance: Error handling for structured output compilation failures | structured-output,frontend,stale,v1 | by quanliu1991 (closed: 2026-02-18 10:17 (UTC+8))
#27177 [Benchmark] Add support for list of tokenized ids for custom dataset | performance,stale | by kobe0938 (closed: 2026-02-18 10:17 (UTC+8))
#27185 [Fix][Hybrid] wrong page size bytes for mla attention | stale | by rhluo (closed: 2026-02-18 10:17 (UTC+8))
#27192 [Core] Speculative Early-Exit for LRMs implemented on Eagle3 | frontend,speculative-decoding,needs-rebase,stale,v1,llama,qwen | by RuBing-Yang (closed: 2026-02-18 10:17 (UTC+8))
#27193 [Model] Add shared_head to prefix of SharedHead. | stale,deepseek | by whx-sjtu (closed: 2026-02-18 10:16 (UTC+8))
#34761 Fix: Preserve input type in _get_modality_specific_lora_reqs | documentation,new-model,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,qwen,kv-connector | by puririshi98 (closed: 2026-02-18 07:44 (UTC+8))
#34751 Fix spec decode stats when drafting is skipped | v1 | by Chryseisliu (closed: 2026-02-18 06:19 (UTC+8))
#31461 [MoE Refactor][11/N] refactor quark_moe layer | no labels | by raayandhar (closed: 2026-02-18 04:25 (UTC+8))
#34729 Removing blocking status from MI355 tests. | ci/build | by Alexei-V-Ivanov-AMD (closed: 2026-02-18 03:26 (UTC+8))
#34720 feat(metrics): add --histogram-buckets CLI flag for configurable Prometheus histogram buckets | v1 | by timon0305 (closed: 2026-02-18 01:59 (UTC+8))
#30714 [ROCm] [MXFP4] Deepseek Fp4 projection gemms dynamically quantized | rocm,v1,deepseek | by dllehr-amd (closed: 2026-02-18 01:18 (UTC+8))
#34654 [Feature] Lazy import of ngram_proposer module. | v1 | by nascheme (closed: 2026-02-18 01:02 (UTC+8))
#34712 [CI] Move tests from H200 to B200 | ready,ci/build | by robertgshaw2-redhat (closed: 2026-02-18 00:13 (UTC+8))
#34710 [Bugfix] Fix test_flashinfer_blockscale_fp8_none_expert_group on B200 | bug,nvidia | by tlrmchlsmth (closed: 2026-02-17 22:47 (UTC+8))
#32124 [MoE Refactor][17/N] Clean Up fused_moe.py | performance,needs-rebase | by robertgshaw2-redhat (closed: 2026-02-17 21:48 (UTC+8))
#34165 Fix AttributeError: ‘Glm46VImageProcessorFast’ object has no attribut… | no labels | by VishnuVV27 (closed: 2026-02-17 20:13 (UTC+8))
#34145 Moe config backend | documentation,rocm,frontend,needs-rebase,ci/build,v1,llama,gpt-oss,kv-connector | by danichan-mkm (closed: 2026-02-17 20:08 (UTC+8))
#32182 [Update] Use FlashInfer fast_decode_plan directly instead of replication | ci/build,v1,nvidia | by askliar (closed: 2026-02-17 18:35 (UTC+8))
#34675 fix(modelopt): average divergent w13 scale factors for NVFP4 MoE | no labels | by chknlittle (closed: 2026-02-17 15:15 (UTC+8))
#27927 [feature] [compile] use quantfp8 customOp-abstraction for MoE layers | needs-rebase | by vnadathur (closed: 2026-02-17 11:50 (UTC+8))