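vLLM Development Activity Report - 2026-02-16

Time window: 2026-02-16 11:30 (UTC+8) ~ 2026-02-17 11:30 (UTC+8)
Statistics: 16 new issues | 15 closed issues | 52 new PRs | 30 merged PRs | 15 PRs closed without merging

📊 Daily Development Summary

Over the past 24 hours, vLLM kept up a fast development pace: 16 issues were opened and 15 closed, 52 PRs were opened, and 30 were merged. Work focused on model support (especially the Qwen series), stability fixes for speculative decoding, compatibility and performance on AMD ROCm, and continued hardening of the continuous integration (CI) system. Several core performance optimizations and bug fixes landed on the main branch, reflecting sustained investment in the inference engine's robustness, scalability, and cross-platform support.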

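🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was high in this window, concentrated on bug fixes and feature work.

New issues (key problems):
- [#34641] [ROCm] Default VLLM_ROCM_USE_AITER_FP4BMM=True crashes on MI300X (gfx942): a major bug with broad impact. The VLLM_ROCM_USE_AITER_FP4BMM environment variable is enabled by default on AMD GPUs, but only MI325X/MI350X (gfx950) support FP4, so most MI300X users crash under the default configuration. Contributor khairulkabir1661 clearly identified the root cause and provided a temporary workaround. AMD employee tjtanaa responded quickly and said a fix would follow soon.
New PRs (bug fixes and feature development):
- [#34647] [ROCm] Add hardware detection for FP4 BMM to prevent MI300X crashes: the fix proposed by khairulkabir1661, who also filed Issue #34641. The is_fp4bmm_enabled() method now queries the AITER library for hardware capability, automatically disabling FP4 on MI300X and falling back to FP8, which addresses the problem at its root.
- [#34632] [ROCm] Add MXFP4 inline dequant Triton kernel for RDNA4/gfx12: submitted by AMD employee laudney. It adds a Triton kernel for RDNA4/gfx12 hardware (which does not support tl.dot_scaled), enabling models quantized in the OCP MX FP4 e2m1f format. This is a significant extension of platform functionality.
- [#34655] [CI][AMD][BugFix] Skip tests…: submitted by AMD employee rasmith. It fixes a CI problem by skipping tests that should not run on ROCm.
- [#34630] [Bugfix][ROCm] Fix WNA16 MoE quant config init…: submitted by AMD employee laudney. It fixes initialization of the WNA16 MoE quantization path and a compatibility problem when reading the Qwen3-VL model config.
- [#34636] [ROCm][Bugfix]: Only save unpadded sizes…: fixes a regression introduced by PR #32344 that broke pattern matching for the RMSNorm+padding fusion on ROCm.
- [#34631] [ROCm] Make Whisper causal attention backend-agnostic: removes the hard-coded list of supported attention backends in the Whisper model so that it works with more ROCm backends.
- [#34652] [AMD][CI] Fix test new_weight_syncing/rlhf.py: fixes a distributed test that failed in AMD CI because torch.cuda.get_device_capability was called at the wrong time.
Merged PRs (features and CI improvements):
- [#34639] [CI] Enable mypy import following for vllm/v1/kv_offload: improves type checking.
- [#34589] [ROCm][CI] Fix plugins test group…: fixes dependency problems in the TerraTorch plugin tests.
- [#34629] Targeting the MI355 agent pool…: extends test jobs to the new MI355 agent pool.

Summary: the AMD team was very active in this window. It responded quickly to a critical crash affecting users (#34641/#34647), continued to extend platform functionality (#34632), and kept CI tests stable while widening their coverage. This shows a clear focus on both user and developer experience on AMD platforms.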
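
The capability gating described for #34647 can be illustrated with a minimal sketch. This is not the PR's code: the PR is described as querying the AITER library, whereas the sketch below substitutes a plain architecture-string check. The function names, the `FP4_CAPABLE_ARCHS` set, and the flag-parsing rules are all assumptions made for illustration; only the environment variable name and its default-on behavior come from the issue title.

```python
import os

# Assumption: an illustrative set of architectures with FP4 matmul support.
FP4_CAPABLE_ARCHS = {"gfx950"}


def fp4_bmm_enabled(gcn_arch, env=None):
    """Return True only if the flag is on AND the hardware supports FP4."""
    env = os.environ if env is None else env
    # The flag defaults to enabled, mirroring the behavior reported in #34641.
    raw = env.get("VLLM_ROCM_USE_AITER_FP4BMM", "1")
    flag_on = raw.strip().lower() in ("1", "true")
    # Arch strings may carry feature suffixes, e.g. "gfx942:sramecc+:xnack-".
    base_arch = gcn_arch.split(":")[0]
    return flag_on and base_arch in FP4_CAPABLE_ARCHS


def select_bmm_path(gcn_arch, env=None):
    """Pick FP4 when usable, otherwise fall back to FP8."""
    return "fp4" if fp4_bmm_enabled(gcn_arch, env) else "fp8"
```

With this shape, a default-on flag no longer implies that the FP4 path is taken: the hardware check has the final say, and an explicit `VLLM_ROCM_USE_AITER_FP4BMM=0` still disables FP4 on capable devices.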

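💬 High-Traffic Discussions

Issue [#34619] [Bug]: Qwen3.5. illegal memory access
- Core topic: a user hit an illegal memory access error when running a Qwen3.5 model on a B200 GPU and suspected a link to async scheduling and CUDA graphs.
- Views and positions:
  - Reporter vadiklyutiy provided a detailed reproduction environment and logs, and noted that turning async scheduling off with --no-async-scheduling avoids the problem.
  - Core developer ZJY0516 suspected a connection to the recently changed GDN backend and async scheduling.
  - Developer ywang96 narrowed the problem further to how the causal_conv1d Triton kernel handles PAD_SLOT_ID, and confirmed that runs succeed in enforce-eager mode.
- Points of contention: none of note; this was collaborative root-cause analysis.
- Current status: the problem has been narrowed to a specific kernel, and developers are working on a fix.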
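
The report does not say exactly what goes wrong with PAD_SLOT_ID, so the following is only a sketch of the general pattern at stake: a padding sentinel must be masked out before it is used as an index. The sentinel value of -1 and the function name `gather_conv_state` are assumptions for illustration, and plain Python lists stand in for GPU buffers.

```python
PAD_SLOT_ID = -1  # Assumption: sentinel marking padded batch entries.


def gather_conv_state(conv_states, slot_ids):
    """Gather one state row per request, skipping padded entries.

    In Python, indexing with -1 silently returns the last row; in a GPU
    kernel the same sentinel becomes an out-of-bounds address, which is
    the kind of bug that surfaces as an illegal memory access.
    """
    gathered = []
    for slot in slot_ids:
        if slot == PAD_SLOT_ID:
            gathered.append(None)  # Padded entry: no load is issued.
            continue
        if not 0 <= slot < len(conv_states):
            raise IndexError(f"slot {slot} is outside the state buffer")
        gathered.append(conv_states[slot])
    return gathered
```

In a Triton kernel the equivalent guard would typically be a load mask rather than a branch, but the invariant is the same: no memory access may be derived from the sentinel.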
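Issue [#34601] [Feature]: LoRA-based Routing
- Core topic: a proposal for request routing and hot-swapping based on LoRA adapters, so that many adapters can be served dynamically on limited GPU resources.
- Views and positions:
  - Proposer yeoshuheng described a vision of loading and switching LoRA adapters dynamically.
  - Participant arandomcreatoron expressed interest and flagged potential challenges, namely memory overhead and the complexity of CUDA graph capture, and suggested pairing the feature with selective offloading logic.
  - Another participant, jeejeelee, suggested looking at the aibrix project in the vLLM ecosystem, implying that a related solution may already exist.
- Points of contention: no direct disagreement; the thread mainly explored technical feasibility and implementation complexity, and was exploratory rather than aimed at immediate implementation.
- Current status: open discussion with no concrete implementation plan yet.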
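Issue [#34650] Bug: Speculative Decoding (MTP) Causes </think> Detection Failure…
- Core topic: when speculative decoding (MTP), a reasoning parser, and structured output are used together, a timing mismatch causes the </think> token to be silently ignored, so JSON schema constraints are not enforced correctly.
- Views and positions: reporter cicirori provided a very detailed technical analysis, including root cause, concrete examples, and suggested fixes. Another contributor, Chryseisliu, expressed interest in taking the issue.
- Points of contention: none; this is a clear report of a technical defect.
- Current status: the issue is open and awaiting a fix.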
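
The report does not reproduce the root-cause analysis from #34650, so the sketch below shows only one plausible way such a mismatch can arise, not the confirmed cause: speculative decoding can accept several tokens in a single step, and a detector that inspects only the last token of the step misses a </think> that lands in the middle. All function names here are hypothetical, and tokens are shown as strings for readability.

```python
THINK_END = "</think>"


def reasoning_ended_last_only(step_tokens):
    """Fragile check: inspects only the final token accepted in a step."""
    return bool(step_tokens) and step_tokens[-1] == THINK_END


def reasoning_ended_scan(step_tokens):
    """Robust check: scans every token accepted in the step."""
    return THINK_END in step_tokens


def constrained_suffix(step_tokens):
    """Tokens after </think>, which a structured-output grammar must see."""
    if THINK_END not in step_tokens:
        return []
    return step_tokens[step_tokens.index(THINK_END) + 1:]
```

For a step that accepts `["so", "</think>", "{", '"a"']`, the last-token check reports that reasoning is still in progress, while the scanning check detects the boundary and hands the trailing `{` and `"a"` tokens to the grammar.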

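🔥 Hot Topics and Trends

Model compatibility and stability (Qwen series above all): several issues involve Qwen models (including the MoE, VL, and Next variants), covering illegal memory access, performance regressions, syntax errors, and reasoning parser failures. As the Qwen family grows more complex and new versions ship quickly, vLLM's adaptation and optimization work has to keep pace.
Edge cases in speculative decoding: besides the heavily discussed #34650, another issue reports an attribute error in Eagle speculative decoding after a vLLM upgrade (#34607). Speculative decoding is a key performance feature, and the stability of its interaction with newer functionality (structured output, multimodality, model upgrades) has become a focus for testing and fixes.
Complexity of CUDA graphs and compilation: one issue reports an assertion failure in SharedStorageConnector on Blackwell GPUs (#34634), and another reports that a torch.compile pass, AllReduceFusionPass, takes over a second to construct (#34635). As the push for peak performance continues and hardware architectures change, the complexity of low-level compilation and graph-based execution is increasingly visible.
CI/CD and test stability: many PRs and issues, including closed ones, deal with fixing, skipping, or stabilizing CI tests (for example #34637, #34622, #34617). In a large, fast-moving codebase spanning heterogeneous hardware, keeping the CI pipeline reliable is itself significant, ongoing work.
Deeper AMD platform integration: as noted above, AMD-related activity in this window went beyond bug fixes to include kernel support for new hardware (RDNA4), a sign that the ecosystem is maturing.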

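🛠️ Key Technical Changes

PR [#33960] [Core] Pipeline Parallel support for Model Runner V2 (merged): an important milestone in the architecture's evolution. The PR adds pipeline parallel (PP) support to the next-generation model executor (Model Runner V2), encapsulating all PP logic in a modular PPHandler class to keep the code clean. Tests show performance on par with the V1 baseline, laying the groundwork for future optimization.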
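PR [#34492] [Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj (merged): a performance optimization for Qwen3.5 models. Fusing the two projection layers in the model's Gated DeltaNet (GDN) reduces kernel launches and memory operations and yields a notable throughput gain in benchmarks, a typical example of model-specific optimization.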
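
The general idea behind fusing two projections of the same input can be shown with a small sketch. It is not the code from #34492: it uses plain Python lists instead of tensors, and `matvec` and `fused_projection` are hypothetical names. Stacking the two weight matrices row-wise lets a single matrix-vector product replace two, after which the output is split back into its parts.

```python
def matvec(weight, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def fused_projection(w_a, w_b, x):
    """Apply two projections to the same input with one matvec.

    In a real model the stacked weight would be built once at load time;
    it is rebuilt here only to keep the sketch short.
    """
    fused_weight = w_a + w_b           # Concatenate along the output dim.
    out = matvec(fused_weight, x)      # One product instead of two.
    return out[:len(w_a)], out[len(w_a):]
```

On a GPU the benefit comes from launching one larger GEMM and reading the input activations once, rather than from any change in the arithmetic, which is identical to running the two projections separately.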
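PR [#34632] [ROCm] Add MXFP4 inline dequant Triton kernel for RDNA4/gfx12 (new): this extension lets MXFP4-quantized models run efficiently on the AMD RDNA4 architecture, which lacks the relevant hardware instruction. Its "two half-dots" strategy and in-kernel dequantization show how kernels can be tailored to specific hardware constraints.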
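
To make "in-kernel dequantization" concrete, here is a sketch of element-wise MXFP4 decoding as defined by the OCP Microscaling format: each 4-bit E2M1 code maps to one of eight magnitudes plus a sign bit, and each block of elements shares a power-of-two E8M0 scale. It does not reproduce the Triton kernel or its "two half-dots" strategy. The nibble order (low nibble first) and the function names are assumptions, and the E8M0 NaN encoding (255) is not handled.

```python
# Magnitudes representable by the 3 low bits of an E2M1 code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code; bit 3 is the sign."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_VALUES[nibble & 0x7]


def dequant_mxfp4(packed, scales_e8m0, block_size=32):
    """Dequantize bytes holding two FP4 codes each.

    Assumption: the low nibble holds the earlier element. Each block of
    `block_size` elements shares one E8M0 scale equal to 2**(e - 127).
    """
    out = []
    for i, byte in enumerate(packed):
        for j, nibble in enumerate((byte & 0xF, byte >> 4)):
            index = 2 * i + j
            scale = 2.0 ** (scales_e8m0[index // block_size] - 127)
            out.append(decode_e2m1(nibble) * scale)
    return out
```

For example, `dequant_mxfp4(bytes([0x21, 0xF9]), [128, 127], block_size=2)` decodes the first pair with a scale of 2 and the second pair with a scale of 1.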
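PR [#34507] [Bugfix] Fix fused MoE int32 overflow… (merged): fixes an edge case in which certain large matrix parameters could cause an integer overflow and crash. The fix casts only the necessary offsets to int64, avoiding the performance regression that converting every stride parameter to int64 would cause, and reflects a careful balance between performance and stability.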
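
The arithmetic behind this class of bug can be shown directly. The sketch below emulates 32-bit signed wraparound in Python; it is not the kernel code from #34507, and the function names and example values are chosen for illustration only.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value):
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


def offset_int32(index, stride):
    """Product computed entirely in 32 bits: wraps for large tensors."""
    return wrap_int32(index * stride)


def offset_int64(index, stride):
    """Product after widening the index to 64 bits.

    Python integers are arbitrary precision, so this models the widened
    computation; a 2.8e9 offset fits comfortably in int64.
    """
    return index * stride
```

With `index = 70000` and `stride = 40000`, the true offset is 2,800,000,000, which exceeds `INT32_MAX`. The 32-bit product wraps to -1,494,967,296, a negative address offset, while the widened product is correct. Widening only the operand that feeds the multiplication is enough, which is why the other stride arguments can stay 32-bit.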
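PR [#34666] & [#34667] [Model Runner V2] Minor cleanup… (merged): core developer WoosukKwon cleaned up and optimized the PP implementation and the make_dummy function in Model Runner V2, removing an unnecessary CPU-GPU synchronization. This reflects continued polishing of the new executor's code quality.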

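📈 Development Activity Observations

Active AMD contributors: AMD contributors such as laudney, khairulkabir1661, rasmith, and Alexei-V-Ivanov-AMD were very active in this window, leading several ROCm-related bug fixes, feature work, and CI maintenance. They are a major force behind platform support.
Core team driving architectural evolution: core developers including WoosukKwon and ZhanqiuHu continued to advance Model Runner V2 (for example, PP support), while DarkLight1337 worked on the multimodal renderer refactor. The project is optimizing the existing architecture while also building next-generation infrastructure.
Diverse community contributions: contributions ranged from model integration (Florence2, Qwen3-ASR realtime support) and performance-oriented proposals (LoRA routing) to user-experience feedback (generation config overrides), covering every layer from low-level to high-level. The ecosystem is healthy.
Ongoing CI/CD improvements: amrmahdi, AndreasKaratzas, and others submitted several PRs that improve image builds, caching strategy, and test stability, protecting engineering quality under rapid iteration.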

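💡 Issues to Watch

AMD hardware support and compatibility: the crash caused by FP4 being enabled by default on MI300X (#34641) has broad impact, and the progress and merge of its fix (#34647) deserve close attention.
Interaction between speculative decoding and structured output: Issue #34650 exposes a subtle bug that affects inference correctness. Testing of how speculative decoding interacts with complex output-control logic needs to be strengthened.
Stability of the Qwen model family: several serious Qwen-related problems (memory access, performance) have appeared in quick succession. The team should consider a systematic review of the integration and optimization code for the Qwen series, especially the MoE and VL variants, and build a more complete test matrix.
Complexity of CUDA graph capture and compilation: as torch.compile and CUDA Graphs see wider use, the startup latency, memory leaks (such as #34602), and hardware-specific problems (such as #34634) they bring will be hard to operate and debug, and call for more accumulated best practices and diagnostic tools.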

📋 Appendix: Detailed Data Lists

New Issues

#34650 Bug: Speculative Decoding (MTP) Causes </think> Detection Failure in Structured Output + Reasoning Mode | bug | by cicirori (created: 2026-02-17 06:09 (UTC+8))
#34661 [Bug]: SyntaxError in qwen3_5_moe.py line 81 when loading Qwen3.5-397B-A17B | no labels | by UmutAlihan (created: 2026-02-17 08:20 (UTC+8))
#34619 [Bug]: Qwen3.5. illegal memory access | bug | by vadiklyutiy (created: 2026-02-16 20:52 (UTC+8))
#34642 [Bug]: Incorrect warning when both mistral & HF format are present in repo | bug | by patrickvonplaten (created: 2026-02-17 03:44 (UTC+8))
#34643 [Bug]: vLLM that leaves orphan processes running when the parent process dies due to OOM | bug | by pymhq (created: 2026-02-17 03:48 (UTC+8))
#34641 [ROCm] Default VLLM_ROCM_USE_AITER_FP4BMM=True crashes on MI300X (gfx942) | rocm | by khairulkabir1661 (created: 2026-02-17 03:02 (UTC+8))
#34640 [Bug]: Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 producing random symbols after 3bb4e4311 | bug | by stavinsky (created: 2026-02-17 03:00 (UTC+8))
#34634 [Bug]: SharedStorageConnector: vectorized_gather_kernel assertion on Blackwell (B200) GPUs | bug | by mmkamani7 (created: 2026-02-17 01:21 (UTC+8))
#34637 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 2) | ci-failure | by AndreasKaratzas (created: 2026-02-17 01:46 (UTC+8))
#34635 [Bug]: AllReduceFusionPass(config) takes 1s+ to construct | bug,torch.compile,startup-ux | by zou3519 (created: 2026-02-17 01:22 (UTC+8))
#34621 [Usage]: new vLLM version | usage | by eghouti (created: 2026-02-16 21:10 (UTC+8))
#34625 [Bug]: Reasoning Parser not working with Minimax m2.5 | bug | by AyRickk (created: 2026-02-16 23:11 (UTC+8))
#34622 [CI Failure]: B200 Kernels | ci-failure | by NickLucche (created: 2026-02-16 21:21 (UTC+8))
#34617 [CI Failure]: v1/e2e/test_spec_decode.py::test_ngram_and_suffix_correctness | ci-failure | by NickLucche (created: 2026-02-16 19:32 (UTC+8))
#34601 [Feature]: LoRA-based Routing | feature request | by yeoshuheng (created: 2026-02-16 12:55 (UTC+8))
#34607 [Bug]: specualative decoding error in 0.15.1 | bug | by hocop (created: 2026-02-16 15:46 (UTC+8))

Closed Issues

#15115 [Usage]: Model compute_logits always get None for sampling_metadata | usage,stale | by yanyongyu (closed: 2026-02-17 10:17 (UTC+8))
#26863 [Bug]: Cuda out of memory while inference for Qwen3-VL-4B-Instruct | bug,stale | by Dineshkumar-Anandan-ZS0367 (closed: 2026-02-17 10:16 (UTC+8))
#34413 [Bug]: Qwen coder next performance after d7982da commit. | bug | by stavinsky (closed: 2026-02-17 09:58 (UTC+8))
#18153 [RFC]: Blackwell Enablement for vLLM (SM100) | RFC,stale,nvidia | by pavanimajety (closed: 2026-02-17 06:14 (UTC+8))
#29453 [CI Failure]: mi325_1: Basic Correctness Test | ci-failure | by AndreasKaratzas (closed: 2026-02-17 04:29 (UTC+8))
#29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) | ci-failure | by AndreasKaratzas (closed: 2026-02-17 01:55 (UTC+8))
#31691 [CI Failure]: mi325_4: LoRA TP Test (Distributed) | ci-failure | by AndreasKaratzas (closed: 2026-02-17 01:54 (UTC+8))
#29534 [CI Failure]: mi325_8: LoRA Test %N | ci-failure | by AndreasKaratzas (closed: 2026-02-17 01:54 (UTC+8))
#34005 [Bug]: Qwen3-1.7B apparently not respecting max-model-len (can’t generate >2048 tokens) | bug | by kdu4108 (closed: 2026-02-16 23:58 (UTC+8))
#32938 [Bug]: illegal memory access while running Qwen3-30B-A3B-Instruct-2507 on multi node with DeepEP backend | bug | by llsj14 (closed: 2026-02-16 20:57 (UTC+8))
#34284 [CI Failure]: Metrics - Tracing Test - Race Condition Between gRPC and OpenTelemetry Threads? | ci-failure | by varun-sundar-rabindranath (closed: 2026-02-16 20:57 (UTC+8))
#34578 [Performance]: vLLM’s throughput performance for a single prompt scenario | performance | by kathirvel-balakrishnan (closed: 2026-02-16 17:30 (UTC+8))
#34525 [CI Failure] LoRA TP (Distributed): lora/test_olmoe_tp.py::test_olmoe_lora | ci-failure | by LucasWilkinson (closed: 2026-02-16 12:10 (UTC+8))
#34504 [Bug]: GLM-5-FP8 Crash - Engine core initialization failed | bug | by jxdn (closed: 2026-02-16 12:09 (UTC+8))
#33447 [Bug]: Issue with vllm 0.15.0 image - running via docker | bug | by SKPsanjeevi (closed: 2026-02-16 11:57 (UTC+8))

New PRs

#34655 [CI][AMD][BugFix] Skip tests in test_unquantized_backend_selection that should not run on ROCm | bug,rocm | by rasmith (created: 2026-02-17 07:08 (UTC+8))
#34666 [Model Runner V2] Minor cleanup for PP | v1 | by WoosukKwon (created: 2026-02-17 10:17 (UTC+8))
#34609 [Bugfix] Force multiprocessing fork method on macOS | bug | by lichuang (created: 2026-02-16 16:34 (UTC+8))
#34667 [Model Runner V2] Fix unintended CPU-GPU sync in make_dummy | v1 | by WoosukKwon (created: 2026-02-17 10:56 (UTC+8))
#34612 [Feature]: Implement get_kv_cache_stride_order for all classes | performance,v1,kv-connector | by SouthWest7 (created: 2026-02-16 17:08 (UTC+8))
#34665 [Bugfix] Fix benchmark_fused_collective crash on CustomOp init | bug,performance | by mayank-ketkar-sf (created: 2026-02-17 10:00 (UTC+8))
#34653 [BugFix] [Build] fix string literals comparison in indexer_k_quant_and_cache calling site | bug,ready | by hongxiayang (created: 2026-02-17 06:53 (UTC+8))
#34649 AI Fix #15647 | no labels | by itzyesse99-lgtm (created: 2026-02-17 05:48 (UTC+8))
#34660 [Kernel][Perf] Fuse gather_block_tables + compute_slot_mappings into single kernel | performance,v1 | by mayank-ketkar-sf (created: 2026-02-17 08:14 (UTC+8))
#34639 [CI] Enable mypy import following for vllm/v1/kv_offload | ready | by aneeshkp (created: 2026-02-17 02:36 (UTC+8))
#34664 Add MXFP8 to Marlin dense kernel | no labels | by mgoin (created: 2026-02-17 09:29 (UTC+8))
#34663 Separate TRTLLM and Flashinfer backends | documentation,v1,nvidia | by pavanimajety (created: 2026-02-17 09:21 (UTC+8))
#34614 [Feature] Add Azure Blob Storage support for RunAI Model Streamer | rocm,ci/build | by hasethuraman (created: 2026-02-16 17:46 (UTC+8))
#34659 [CI Failure] Disable Qwen3-30B-A3B-MXFP4A16.yaml due to model deletion | ci/build,ci-failure,qwen | by mgoin (created: 2026-02-17 08:12 (UTC+8))
#34662 [Model Runner V2] Minor refactoring for penalties | v1 | by WoosukKwon (created: 2026-02-17 09:07 (UTC+8))
#34646 [Bugfix] Fix EPLB + NVFP4: make expanded activation scales contiguous | bug,nvidia | by elvircrn (created: 2026-02-17 04:50 (UTC+8))
#34658 Refactor oracle to separate support and selection | needs-rebase | by mgoin (created: 2026-02-17 07:59 (UTC+8))
#34651 [Feature] Lazy import for the “mistral” tokenizer module. | structured-output,frontend,v1,multi-modality | by nascheme (created: 2026-02-17 06:49 (UTC+8))
#34657 [Quant][WIP] Add compressed-tensors MXFP4 support through flashinfer | nvidia | by dsikka (created: 2026-02-17 07:47 (UTC+8))
#34647 [ROCm] Add hardware detection for FP4 BMM to prevent MI300X crashes | rocm | by khairulkabir1661 (created: 2026-02-17 05:09 (UTC+8))
#34654 [Feature] Lazy import of ngram_proposer module. | v1 | by nascheme (created: 2026-02-17 06:57 (UTC+8))
#34656 [CI] Fix bake config artifact path for AMI rebuild pipeline | ci/build | by amrmahdi (created: 2026-02-17 07:23 (UTC+8))
#34652 [AMD][CI] Fix test new_weight_syncing/rlhf.py | rocm | by rjrock (created: 2026-02-17 06:52 (UTC+8))
#34606 [CI] Disable precompiled wheel path in CI image builds | ready,ci/build | by amrmahdi (created: 2026-02-16 15:34 (UTC+8))
#34648 [Feature] Add VLLM_TRITON_AUTOTUNE with functional autotune control | no labels | by debo3 (created: 2026-02-17 05:14 (UTC+8))
#34633 Remove dead bitsandbytes CxB code from 8-bit inference path | ready | by TimDettmers (created: 2026-02-17 01:16 (UTC+8))
#34645 [Quantization] - Added uses_meta_device_weights to quant config | no labels | by Josephasafg (created: 2026-02-17 04:31 (UTC+8))
#34644 [release 2.11] Update to torch 2.11-rc1 | rocm,ci/build,cpu,nvidia | by atalman (created: 2026-02-17 03:54 (UTC+8))
#34611 [Bugfix] Fix ResponseCreatedEvent ValidationError for json_schema format in streaming | bug,frontend,v1 | by agangadi24 (created: 2026-02-16 17:07 (UTC+8))
#34630 [Bugfix][ROCm] Fix WNA16 MoE quant config init and Qwen3-VL tie_word_embeddings | bug,rocm,qwen | by laudney (created: 2026-02-17 01:06 (UTC+8))
#34638 Add Kubernetes monitoring stack and metrics proxy for vLLM deployments | documentation,frontend,needs-rebase,ci/build | by alitirmizi23 (created: 2026-02-17 02:16 (UTC+8))
#34636 [ROCm][Bugfix]: Only save unpadded sizes for shared_experts in MoERunner to fix rmsnorm pad fusion | bug,rocm | by Rohan138 (created: 2026-02-17 01:40 (UTC+8))
#34632 [ROCm] Add MXFP4 inline dequant Triton kernel for RDNA4/gfx12 | new-model,rocm | by laudney (created: 2026-02-17 01:09 (UTC+8))
#34599 [ROCm][CI] Fix spec decode logprobs flakiness and parametrize tree attention backends | rocm,speculative-decoding,v1 | by AndreasKaratzas (created: 2026-02-16 12:26 (UTC+8))
#34631 [ROCm] Make Whisper causal attention backend-agnostic | rocm | by laudney (created: 2026-02-17 01:07 (UTC+8))
#34629 Targeting the MI355 agent pool with all existing tests | ready,ci/build | by Alexei-V-Ivanov-AMD (created: 2026-02-17 00:05 (UTC+8))
#34624 [Bugfix][CI] Fix flaky entrypoints/openai/test_response_api_with_harmony.py::test_function_calling[openai/gpt-oss-20b] | bug,ready,gpt-oss | by NickLucche (created: 2026-02-16 22:33 (UTC+8))
#34627 [Performance] Extract kv update ops from MLA attention backends | no labels | by ElizaWszola (created: 2026-02-16 23:21 (UTC+8))
#34628 [MM] Allow audio chunking for offline LLM | documentation,frontend,multi-modality | by NickLucche (created: 2026-02-16 23:48 (UTC+8))
#34603 add Florence2 model integration and registration | documentation,new-model,frontend,needs-rebase,v1 | by hydracz (created: 2026-02-16 13:18 (UTC+8))
#34613 [Realtime] Add Qwen3-ASR realtime streaming support | new-model,frontend,qwen | by pougetat (created: 2026-02-16 17:15 (UTC+8))
#34618 (bugfix): Fixed encode in LLM entrypoint for IOProcessr plugin prompts | bug,documentation,frontend,ready | by christian-pinto (created: 2026-02-16 19:53 (UTC+8))
#34616 [KVConnector] Scheduler: Fix num_computed_tokens after async KV load | v1 | by orozery (created: 2026-02-16 18:17 (UTC+8))
#34623 [Bugfix] Fix tests/compile/correctness_e2e/test_sequence_parallel.py::test_tp_sp_generation | bug | by NickLucche (created: 2026-02-16 21:33 (UTC+8))
#34620 add tests from higher start point | nvidia | by shaunkotek (created: 2026-02-16 20:53 (UTC+8))
#34602 [Bugfix] Fix sleep mode wake_up memory leaks | bug | by stmoonar (created: 2026-02-16 13:11 (UTC+8))
#34615 [CI][Nixl] Add CrossLayer KV layout tests | ready,ci/build,kv-connector | by NickLucche (created: 2026-02-16 17:50 (UTC+8))
#34610 Revert “[Misc] fix qwen3.5 config” | qwen | by ywang96 (created: 2026-02-16 17:05 (UTC+8))
#34608 feat(spec_decode): add Qwen3-VL Eagle3 multimodal prefill support | new-model,speculative-decoding,v1,qwen | by shx2005 (created: 2026-02-16 16:06 (UTC+8))
#34604 [Misc] fix qwen3.5 config | qwen | by JJJYmmm (created: 2026-02-16 15:21 (UTC+8))
#34605 feat:[SpecDecode][Eagle3] Add image-text pair support for Qwen3-VL drafter | new-model,speculative-decoding,v1,qwen | by shx2005 (created: 2026-02-16 15:30 (UTC+8))
#34600 Fix sleep mode wake_up memory leaks | documentation,frontend,ci/build | by stmoonar (created: 2026-02-16 12:28 (UTC+8))

Merged PRs

#34666 [Model Runner V2] Minor cleanup for PP | v1 | by WoosukKwon (merged: 2026-02-17 11:15 (UTC+8))
#34667 [Model Runner V2] Fix unintended CPU-GPU sync in make_dummy | v1 | by WoosukKwon (merged: 2026-02-17 11:00 (UTC+8))
#34507 [Bugfix] Fix fused MoE int32 overflow in stride*offset without perf regression | bug,ready | by haosdent (merged: 2026-02-17 09:58 (UTC+8))
#34639 [CI] Enable mypy import following for vllm/v1/kv_offload | ready | by aneeshkp (merged: 2026-02-17 09:58 (UTC+8))
#33960 [Core] Pipeline Parallel support for Model Runner V2 | v1 | by ZhanqiuHu (merged: 2026-02-17 09:48 (UTC+8))
#33433 [Model Runner V2] support bad_words sampling param | v1 | by izhuhaoran (merged: 2026-02-17 08:36 (UTC+8))
#34606 [CI] Disable precompiled wheel path in CI image builds | ready,ci/build | by amrmahdi (merged: 2026-02-16 23:14 (UTC+8))
#34582 [NemotronH] Do not force router to run in fp32 | performance,ready,nvidia | by roikoren755 (merged: 2026-02-17 02:15 (UTC+8))
#34566 [CI][Metrics] Stabilize tests with polling and subprocess guards | ready | by AndreasKaratzas (merged: 2026-02-16 18:52 (UTC+8))
#34589 [ROCm][CI] Fix plugins test group; updating terratorch and dependencies | rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-02-16 23:33 (UTC+8))
#34629 Targeting the MI355 agent pool with all existing tests | ready,ci/build | by Alexei-V-Ivanov-AMD (merged: 2026-02-17 01:02 (UTC+8))
#34624 [Bugfix][CI] Fix flaky entrypoints/openai/test_response_api_with_harmony.py::test_function_calling[openai/gpt-oss-20b] | bug,ready,gpt-oss | by NickLucche (merged: 2026-02-17 00:11 (UTC+8))
#34063 [Bugfix] Treat generation_config max_tokens as default not ceiling | bug,frontend,ready | by almogtavor (merged: 2026-02-16 23:58 (UTC+8))
#34492 [Models] Fuse Qwen3.5 GDN’s qkvz_proj and ba_proj | ready,qwen | by Isotr0py (merged: 2026-02-16 23:32 (UTC+8))
#34292 [CI] Enable mypy coverage for individual excluded files | ready,v1,nvidia | by Lucaskabela (merged: 2026-02-16 23:34 (UTC+8))
#34618 (bugfix): Fixed encode in LLM entrypoint for IOProcessr plugin prompts | bug,documentation,frontend,ready | by christian-pinto (merged: 2026-02-16 23:33 (UTC+8))
#34576 [Bugfix] Fix ARC touch KeyError for non-ready T1 blocks in kv offload | bug,ready,v1 | by Vivo50E (merged: 2026-02-16 23:33 (UTC+8))
#34575 Fix call to moe_mk in modelopt MoE modules (required for LoRA) | ready | by danisereb (merged: 2026-02-16 23:33 (UTC+8))
#33994 Bump lm-eval version for Transformers v5 compatibility | documentation,rocm,ready,ci/build | by hmellor (merged: 2026-02-16 21:24 (UTC+8))
#34364 [Fix] Fix tracing test race condition by adding server readiness check | ready | by emricksini-h (merged: 2026-02-16 20:57 (UTC+8))
#31058 [Scheduler][ASR] Fix CrossAttn blocks per-request for Variable length encoder inputs | ready,v1 | by ekagra-ranjan (merged: 2026-02-16 19:08 (UTC+8))
#34320 [Bugfix] Fix Dynamo unexpected keyword argument | bug,ready | by samutamm (merged: 2026-02-16 17:32 (UTC+8))
#34610 Revert “[Misc] fix qwen3.5 config” | qwen | by ywang96 (merged: 2026-02-16 17:06 (UTC+8))
#34604 [Misc] fix qwen3.5 config | qwen | by JJJYmmm (merged: 2026-02-16 16:25 (UTC+8))
#34598 [Renderer] Move InputPreprocessor into Renderer (1.5/2) | ready,multi-modality | by DarkLight1337 (merged: 2026-02-16 15:46 (UTC+8))
#34569 [CI] Write bake config to temp directory instead of repo root | ready,ci/build | by amrmahdi (merged: 2026-02-16 14:15 (UTC+8))
#34590 [CI][Frontend] Return 422 instead of 500 for invalid Anthropic tool_choice | frontend,ready | by AndreasKaratzas (merged: 2026-02-16 12:06 (UTC+8))
#34453 [Bugfix] Add method to swap quant_method on FusedMoE to fix LoRA issues | bug,ready | by bnellnm (merged: 2026-02-16 12:10 (UTC+8))
#34548 [BugFix] Fix Python 3.13 FlashMLA import error | bug,ready,ci/build | by LucasWilkinson (merged: 2026-02-16 12:09 (UTC+8))
#34584 [Doc] Add Mistral-7b-v0.3 model to the batch invariance validated model | documentation,ready | by banparth (merged: 2026-02-16 12:09 (UTC+8))

PRs Closed Without Merging

#34612 [Feature]: Implement get_kv_cache_stride_order for all classes | performance,v1,kv-connector | by SouthWest7 (closed: 2026-02-17 10:50 (UTC+8))
#26910 [Bugfix] [ROCm] Enable DCP through Triton MLA on ROCm | rocm,stale,v1 | by tjtanaa (closed: 2026-02-17 10:16 (UTC+8))
#33383 [Bugfix] Disable torch.compile for batch invariance on Blackwell to ensure determinism | bug | by ZhanqiuHu (closed: 2026-02-17 09:59 (UTC+8))
#34388 [WIP][Spec Decode] P/D disaggregation: transfer hidden states for EAGLE warm-up | documentation,speculative-decoding,needs-rebase,v1,kv-connector | by ZhanqiuHu (closed: 2026-02-17 09:59 (UTC+8))
#34659 [CI Failure] Disable Qwen3-30B-A3B-MXFP4A16.yaml due to model deletion | ci/build,ci-failure,qwen | by mgoin (closed: 2026-02-17 09:19 (UTC+8))
#34630 [Bugfix][ROCm] Fix WNA16 MoE quant config init and Qwen3-VL tie_word_embeddings | bug,rocm,qwen | by laudney (closed: 2026-02-17 02:38 (UTC+8))
#34638 Add Kubernetes monitoring stack and metrics proxy for vLLM deployments | documentation,frontend,needs-rebase,ci/build | by alitirmizi23 (closed: 2026-02-17 02:18 (UTC+8))
#34603 add Florence2 model integration and registration | documentation,new-model,frontend,needs-rebase,v1 | by hydracz (closed: 2026-02-16 23:57 (UTC+8))
#30186 Fix #15483 : Add error handling for model-dependent endpoints during … | frontend,v1 | by erdaltoprak (closed: 2026-02-16 22:57 (UTC+8))
#31811 [KVConnector] Handle KV events from multiple connectors | v1,kv-connector | by hickeyma (closed: 2026-02-16 20:42 (UTC+8))
#34430 [Bugfix] FIx TorchAO config bugs | bug,documentation | by jwpark33 (closed: 2026-02-16 19:24 (UTC+8))
#31893 [Misc] Add --use-flashinfer-rope to control the RoPE kernel for cuda | gpt-oss,nvidia | by elvischenv (closed: 2026-02-16 18:04 (UTC+8))
#34596 Dependency compatibility: transformers>=5.0.0 and resolver fix for vLLM 0.16.0 | rocm,ci/build | by cccat6 (closed: 2026-02-16 17:49 (UTC+8))
#34605 feat:[SpecDecode][Eagle3] Add image-text pair support for Qwen3-VL drafter | new-model,speculative-decoding,v1,qwen | by shx2005 (closed: 2026-02-16 16:04 (UTC+8))
#34600 Fix sleep mode wake_up memory leaks | documentation,frontend,ci/build | by stmoonar (closed: 2026-02-16 13:02 (UTC+8))