vLLM Development Activity Report - 2026-02-16
Time window: 2026-02-16 11:30 (UTC+8) to 2026-02-17 11:30 (UTC+8)
Stats: 16 new issues | 15 closed issues | 52 new PRs | 30 merged PRs | 15 PRs closed without merging
📊 Daily Development Summary
Over the past 24 hours, the vLLM project maintained a high development tempo, handling 31 issues and opening and merging numerous PRs. Work centered on model-support optimization (particularly the Qwen family), stability fixes for speculative decoding, compatibility and performance on the AMD ROCm platform, and improvements to continuous integration (CI). Several core performance optimizations and bug fixes were merged into main, reflecting sustained investment in the inference engine's robustness, scalability, and cross-platform support.
🎯 AMD/ROCm Ecosystem Activity
AMD-ecosystem activity was very high this cycle, concentrated on bug fixes and feature enhancements.
- New issues (key problems):
  - [#34641] [ROCm] Default VLLM_ROCM_USE_AITER_FP4BMM=True crashes on MI300X (gfx942): a high-impact bug. The `VLLM_ROCM_USE_AITER_FP4BMM` environment variable defaults to enabled on AMD GPUs, but only MI325X/MI350X (gfx950) support FP4, so most MI300X users crash under the default configuration. Contributor khairulkabir1661 clearly identified the root cause and provided a workaround, and AMD engineer tjtanaa responded quickly, saying a fix would land soon.
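The gating pattern described for the fix (PR #34647) can be sketched as follows. This is a hedged illustration, not vLLM's actual code: the function name `select_bmm_dtype`, the architecture set, and the FP8 fallback string are assumptions based on the issue report.

```python
# Minimal sketch (assumed names, not vLLM's code) of hardware-gated FP4:
# the FP4 path stays enabled only when the detected GPU architecture
# actually supports it; otherwise the kernel falls back to FP8.

FP4_CAPABLE_ARCHS = {"gfx950"}  # MI325X/MI350X, per issue #34641


def select_bmm_dtype(gfx_arch: str, env_flag_enabled: bool = True) -> str:
    """Pick a BMM dtype: FP4 only on capable hardware, else fall back to FP8."""
    if env_flag_enabled and gfx_arch in FP4_CAPABLE_ARCHS:
        return "fp4"
    return "fp8"


# MI300X (gfx942): the env flag defaults on, but the hardware check
# forces the FP8 fallback instead of crashing.
print(select_bmm_dtype("gfx942"))  # fp8
print(select_bmm_dtype("gfx950"))  # fp4
```

The key design point is that the default becomes safe on every architecture, so users no longer need to know about the environment variable at all.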
- New PRs (bug fixes and feature work):
  - [#34647] [ROCm] Add hardware detection for FP4 BMM to prevent MI300X crashes: the fix from khairulkabir1661, who filed issue #34641. It queries the AITER library's hardware capabilities in `is_fp4bmm_enabled()` to automatically disable FP4 on MI300X and fall back to FP8, resolving the problem at its root.
  - [#34632] [ROCm] Add MXFP4 inline dequant Triton kernel for RDNA4/gfx12: submitted by AMD engineer laudney. Adds a Triton kernel for RDNA4/gfx12 hardware (which lacks `tl.dot_scaled`), enabling OCP MX FP4 e2m1f quantized models. An important platform capability extension.
  - [#34655] [CI][AMD][BugFix] Skip tests…: submitted by AMD engineer rasmith. Fixes CI by skipping tests that should not run on ROCm.
  - [#34630] [Bugfix][ROCm] Fix WNA16 MoE quant config init…: submitted by AMD engineer laudney. Fixes initialization of the WNA16 MoE quantization path and a compatibility issue when reading Qwen3-VL model configs.
  - [#34636] [ROCm][Bugfix]: Only save unpadded sizes…: fixes a regression introduced by PR #32344 that broke RMSNorm+padding fusion pattern matching on ROCm.
  - [#34631] [ROCm] Make Whisper causal attention backend-agnostic: removes the hardcoded list of supported attention backends in the Whisper model, allowing it to work with more ROCm backends.
  - [#34652] [AMD][CI] Fix test new_weight_syncing/rlhf.py: fixes a distributed-test failure in AMD CI caused by an ill-timed call to `torch.cuda.get_device_capability`.
- Merged PRs (features and CI improvements):
  - [#34639] [CI] Enable mypy import following for vllm/v1/kv_offload: improves type checking.
  - [#34589] [ROCm][CI] Fix plugins test group…: fixes dependency issues in the TerraTorch plugin tests.
  - [#34629] Targeting the MI355 agent pool…: extends test jobs to the new MI355 agent pool.
Summary: the AMD team was highly active this cycle, quickly triaging and fixing a key crash affecting users (#34641/#34647), continuing to extend platform capabilities (#34632), and maintaining CI stability and coverage. This reflects a clear commitment to both user and developer experience on AMD platforms.
💬 High-Engagement Discussions
- Issue [#34619] [Bug]: Qwen3.5. illegal memory access
  - Core topic: a user hit an illegal memory access running Qwen3.5 on a B200 GPU, suspected to involve async scheduling and CUDA graphs.
  - Positions:
    - Reporter vadiklyutiy provided a detailed reproduction environment and logs, noting that disabling async scheduling (`--no-async-scheduling`) avoids the problem.
    - Core developer ZJY0516 suspected a link to recent changes in the GDN backend and async scheduling.
    - Developer ywang96 further localized the problem to the `causal_conv1d` Triton kernel's handling of `PAD_SLOT_ID`, and confirmed it runs correctly in enforce-eager mode.
  - Points of contention: none; this was collaborative root-cause triage.
  - Current status: localized to a specific kernel; developers are investigating a fix.
- Issue [#34601] [Feature]: LoRA-based Routing
  - Core topic: a proposal for LoRA-adapter-based request routing and hot-swapping, to serve multiple adapters dynamically on limited GPU resources.
  - Positions:
    - Proposer yeoshuheng described the vision of dynamically loading and switching LoRA adapters.
    - Participant arandomcreatoron expressed interest and flagged challenges (memory overhead, the complexity of CUDA graph capture), suggesting it be paired with selective offloading logic.
    - Another participant, jeejeelee, suggested looking at the aibrix project in the vLLM ecosystem, hinting that a relevant solution may already exist.
  - Points of contention: no direct disagreement; mainly an exploration of feasibility and implementation complexity. The discussion is exploratory rather than a commitment to implement.
  - Current status: open-ended discussion, with no concrete implementation plan yet.
- Issue [#34650] Bug: Speculative Decoding (MTP) Causes </think> Detection Failure…
  - Core topic: when speculative decoding (MTP), a reasoning parser, and structured output are used together, a timing mismatch causes the `</think>` token to be silently ignored, so JSON schema constraints are never enforced.
  - Positions: reporter cicirori provided an exceptionally thorough technical analysis, including root-cause localization, concrete examples, and fix suggestions. Another contributor, Chryseisliu, expressed interest in taking it on.
  - Points of contention: none; a clear-cut defect report.
  - Current status: open, awaiting a fix.
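A hypothetical sketch of the general failure mode reported in #34650: with MTP, several tokens can be accepted in one step, and a parser that inspects only the newest token can miss an end-of-reasoning marker that lands mid-batch. The function names and the exact fix below are illustrative assumptions, not vLLM's implementation.

```python
# Assumed names, not vLLM's API: illustrating why per-step batch
# acceptance breaks a parser that only looks at the latest token.
END_MARKER = "</think>"


def saw_end_last_token_only(step_tokens: list[str]) -> bool:
    """Buggy pattern: inspect only the newest token of the step."""
    return step_tokens[-1] == END_MARKER


def saw_end_any_token(step_tokens: list[str]) -> bool:
    """Fixed pattern: scan the whole span of tokens accepted this step."""
    return END_MARKER in step_tokens


accepted = ["...", "</think>", "{"]       # MTP accepted three tokens at once
print(saw_end_last_token_only(accepted))  # False: marker silently missed
print(saw_end_any_token(accepted))        # True
```

When the marker goes undetected, downstream logic never switches into constrained (JSON-schema) decoding, matching the symptom in the report.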
🔥 Hot Topics and Trends
- Model compatibility and stability (especially the Qwen family): multiple issues involve Qwen models (MoE, VL, and Next variants), covering illegal memory accesses, performance regressions, syntax errors, and broken reasoning parsers. As the Qwen family grows more complex and iterates quickly, vLLM's adaptation and optimization work must keep pace.
- Edge cases in speculative decoding: beyond the heavily discussed #34650, another issue reported an attribute error in Eagle speculative decoding after a vLLM upgrade (#34607). As a key performance feature, speculative decoding's interactions with new functionality (structured output, multimodality, model upgrades) are becoming a focus of testing and fixes.
- CUDA graph and compilation complexity: issues reported a SharedStorageConnector assertion error on Blackwell GPUs (#34634) and slow construction of `torch.compile`-related configuration (#34635). With the push for peak performance and new hardware architectures, the complexity of low-level compilation and graph execution is increasingly visible.
- CI/CD and test stability: many PRs and issues (including closed ones) involve fixing, skipping, or stabilizing CI tests (e.g., #34637, #34622, #34617, #34666). Keeping the CI pipeline reliable across a large, fast-moving codebase and heterogeneous hardware is itself substantial, ongoing work.
- Deepening AMD platform integration: as noted above, AMD activity this cycle went beyond bug fixes to include kernel support for new hardware (RDNA4), showing how far the ecosystem work has progressed.
🛠️ Key Technical Changes
- PR [#33960] [Core] Pipeline Parallel support for Model Runner V2 (merged): a milestone in the architecture's evolution. It adds pipeline parallelism (PP) to the next-generation model executor (Model Runner V2), encapsulating all PP logic in a modular `PPHandler` class to keep the code clean. Testing shows performance on par with the V1 baseline, laying groundwork for future optimization.
- PR [#34492] [Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj (merged): a Qwen3.5-specific performance optimization. Fusing two projection layers in its Gated Dense Network (GDN) reduces kernel launches and memory traffic, delivering a measurable throughput gain in benchmarks; a textbook example of model-specific optimization.
- PR [#34632] [ROCm] Add MXFP4 inline dequant Triton kernel for RDNA4/gfx12 (new): enables efficient MXFP4 quantized models on AMD RDNA4, which lacks the relevant hardware instructions. Its "two half-dots" strategy and in-kernel dequantization showcase custom kernel development under specific hardware constraints.
- PR [#34507] [Bugfix] Fix fused MoE int32 overflow… (merged): fixes an edge case where large matrix parameters could cause an integer-overflow crash. The fix carefully casts only the necessary offsets to int64, avoiding the performance regression of widening all stride parameters; a fine-grained balance of performance and stability.
- PR [#34666] & [#34667] [Model Runner V2] Minor cleanup… (merged): core developer WoosukKwon cleaned up and optimized the Model Runner V2 PP implementation and the `make_dummy` function, eliminating an unnecessary CPU-GPU sync; continued polish of the next-generation executor's code quality.
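The int32 overflow class of bug fixed in #34507 can be illustrated generically. The values and helper below are hypothetical (not the PR's actual shapes or code); only the overflow mechanism and the widen-one-term fix pattern reflect the PR description.

```python
# Illustrative sketch: an index computed as stride * offset can exceed
# INT32_MAX for large matrices and wrap to a negative number in 32-bit
# arithmetic. The described fix widens only that product to 64-bit
# rather than making every stride parameter int64.
INT32_MAX = 2**31 - 1


def to_int32(x: int) -> int:
    """Simulate C-style 32-bit two's-complement truncation."""
    return (x + 2**31) % 2**32 - 2**31


stride, offset = 70_000, 40_000      # hypothetical large-matrix values
true_index = stride * offset         # 2_800_000_000 > INT32_MAX

wrapped = to_int32(stride * offset)  # what a 32-bit multiply would yield
print(true_index > INT32_MAX)        # True: overflow territory
print(wrapped)                       # -1494967296: a bogus negative index

# Fix pattern: compute the product in 64-bit (Python ints do not wrap,
# standing in for an int64 cast), while stride itself stays 32-bit.
safe_index = to_int32(stride) * offset
print(safe_index == true_index)      # True
```

Widening only the product keeps the hot-loop stride arithmetic in 32-bit registers, which is why the PR avoids a performance regression.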
📈 Development Activity Observations
- Active AMD contributors: laudney, khairulkabir1661, rasmith, Alexei-V-Ivanov-AMD and others were highly active this cycle, driving many ROCm-related bug fixes, feature work, and CI maintenance; a major force behind platform support.
- Core team driving architectural evolution: core developers such as WoosukKwon and ZhanqiuHu continue to advance Model Runner V2 (e.g., PP support), while DarkLight1337 is refactoring the multimodal renderer. The project is optimizing its current architecture while laying out next-generation infrastructure.
- Diverse community contributions: from model integrations (Florence2, Qwen-ASR realtime support) and performance proposals (LoRA routing) to user-experience feedback (generation-config overrides), contributions span the stack; a sign of a healthy ecosystem.
- Continuous CI/CD improvement: amrmahdi, AndreasKaratzas and others submitted multiple PRs improving image builds, caching strategy, and test stability, safeguarding engineering quality under rapid iteration.
💡 Issues to Watch
- AMD hardware support and compatibility: the MI300X crash caused by FP4 being enabled by default (#34641) is widespread; the progress and merge of its fix (#34647) deserve close attention.
- Speculative decoding and structured-output interactions: issue #34650 exposes a subtle bug that affects inference correctness. Testing of speculative decoding's interaction with complex output-control logic needs strengthening.
- Qwen model family stability: several serious Qwen-related issues (memory access, performance) appeared in quick succession. The team should systematically review the integration and optimization code for the Qwen family (especially the MoE and VL variants) and build a more complete test matrix.
- CUDA graph capture and compilation complexity: with `torch.compile` and CUDA Graphs in wide use, startup latency, memory leaks (e.g., #34602), and hardware-specific problems (e.g., #34634) will be operational and debugging pain points; more best practices and diagnostic tooling are needed.
📋 Appendix: Detailed Data
New Issues
- #34650 Bug: Speculative Decoding (MTP) Causes </think> Detection Failure in Structured Output + Reasoning Mode — bug — by cicirori (created: 2026-02-17 06:09 (UTC+8))
- #34661 [Bug]: SyntaxError in qwen3_5_moe.py line 81 when loading Qwen3.5-397B-A17B — no labels — by UmutAlihan (created: 2026-02-17 08:20 (UTC+8))
- #34619 [Bug]: Qwen3.5. illegal memory access — bug — by vadiklyutiy (created: 2026-02-16 20:52 (UTC+8))
- #34642 [Bug]: Incorrect warning when both mistral & HF format are present in repo — bug — by patrickvonplaten (created: 2026-02-17 03:44 (UTC+8))
- #34643 [Bug]: vLLM that leaves orphan processes running when the parent process dies due to OOM — bug — by pymhq (created: 2026-02-17 03:48 (UTC+8))
- #34641 [ROCm] Default VLLM_ROCM_USE_AITER_FP4BMM=True crashes on MI300X (gfx942) — rocm — by khairulkabir1661 (created: 2026-02-17 03:02 (UTC+8))
- #34640 [Bug]: Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 producing random symbols after 3bb4e4311 — bug — by stavinsky (created: 2026-02-17 03:00 (UTC+8))
- #34634 [Bug]: SharedStorageConnector: vectorized_gather_kernel assertion on Blackwell (B200) GPUs — bug — by mmkamani7 (created: 2026-02-17 01:21 (UTC+8))
- #34637 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 2) — ci-failure — by AndreasKaratzas (created: 2026-02-17 01:46 (UTC+8))
- #34635 [Bug]: AllReduceFusionPass(config) takes 1s+ to construct — bug,torch.compile,startup-ux — by zou3519 (created: 2026-02-17 01:22 (UTC+8))
- #34621 [Usage]: new vLLM version — usage — by eghouti (created: 2026-02-16 21:10 (UTC+8))
- #34625 [Bug]: Reasoning Parser not working with Minimax m2.5 — bug — by AyRickk (created: 2026-02-16 23:11 (UTC+8))
- #34622 [CI Failure]: B200 Kernels — ci-failure — by NickLucche (created: 2026-02-16 21:21 (UTC+8))
- #34617 [CI Failure]: v1/e2e/test_spec_decode.py::test_ngram_and_suffix_correctness — ci-failure — by NickLucche (created: 2026-02-16 19:32 (UTC+8))
- #34601 [Feature]: LoRA-based Routing — feature request — by yeoshuheng (created: 2026-02-16 12:55 (UTC+8))
- #34607 [Bug]: specualative decoding error in 0.15.1 — bug — by hocop (created: 2026-02-16 15:46 (UTC+8))
Closed Issues
- #15115 [Usage]: Model compute_logits always get None for sampling_metadata — usage,stale — by yanyongyu (closed: 2026-02-17 10:17 (UTC+8))
- #26863 [Bug]: Cuda out of memory while inference for Qwen3-VL-4B-Instruct — bug,stale — by Dineshkumar-Anandan-ZS0367 (closed: 2026-02-17 10:16 (UTC+8))
- #34413 [Bug]: Qwen coder next performance after d7982da commit. — bug — by stavinsky (closed: 2026-02-17 09:58 (UTC+8))
- #18153 [RFC]: Blackwell Enablement for vLLM (SM100) — RFC,stale,nvidia — by pavanimajety (closed: 2026-02-17 06:14 (UTC+8))
- #29453 [CI Failure]: mi325_1: Basic Correctness Test — ci-failure — by AndreasKaratzas (closed: 2026-02-17 04:29 (UTC+8))
- #29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) — ci-failure — by AndreasKaratzas (closed: 2026-02-17 01:55 (UTC+8))
- #31691 [CI Failure]: mi325_4: LoRA TP Test (Distributed) — ci-failure — by AndreasKaratzas (closed: 2026-02-17 01:54 (UTC+8))
- #29534 [CI Failure]: mi325_8: LoRA Test %N — ci-failure — by AndreasKaratzas (closed: 2026-02-17 01:54 (UTC+8))
- #34005 [Bug]: Qwen3-1.7B apparently not respecting max-model-len (can’t generate >2048 tokens) — bug — by kdu4108 (closed: 2026-02-16 23:58 (UTC+8))
- #32938 [Bug]: illegal memory access while running Qwen3-30B-A3B-Instruct-2507 on multi node with DeepEP backend — bug — by llsj14 (closed: 2026-02-16 20:57 (UTC+8))
- #34284 [CI Failure]: Metrics - Tracing Test - Race Condition Between gRPC and OpenTelemetry Threads? — ci-failure — by varun-sundar-rabindranath (closed: 2026-02-16 20:57 (UTC+8))
- #34578 [Performance]: vLLM’s throughput performance for a single prompt scenario — performance — by kathirvel-balakrishnan (closed: 2026-02-16 17:30 (UTC+8))
- #34525 [CI Failure] LoRA TP (Distributed): lora/test_olmoe_tp.py::test_olmoe_lora — ci-failure — by LucasWilkinson (closed: 2026-02-16 12:10 (UTC+8))
- #34504 [Bug]: GLM-5-FP8 Crash - Engine core initialization failed — bug — by jxdn (closed: 2026-02-16 12:09 (UTC+8))
- #33447 [Bug]: Issue with vllm 0.15.0 image - running via docker — bug — by SKPsanjeevi (closed: 2026-02-16 11:57 (UTC+8))
New PRs
- #34655 [CI][AMD][BugFix] Skip tests in test_unquantized_backend_selection that should not run on ROCm — bug,rocm — by rasmith (created: 2026-02-17 07:08 (UTC+8))
- #34666 [Model Runner V2] Minor cleanup for PP — v1 — by WoosukKwon (created: 2026-02-17 10:17 (UTC+8))
- #34609 [Bugfix] Force multiprocessing fork method on macOS — bug — by lichuang (created: 2026-02-16 16:34 (UTC+8))
- #34667 [Model Runner V2] Fix unintended CPU-GPU sync in make_dummy — v1 — by WoosukKwon (created: 2026-02-17 10:56 (UTC+8))
- #34612 [Feature]: Implement get_kv_cache_stride_order for all classes — performance,v1,kv-connector — by SouthWest7 (created: 2026-02-16 17:08 (UTC+8))
- #34665 [Bugfix] Fix benchmark_fused_collective crash on CustomOp init — bug,performance — by mayank-ketkar-sf (created: 2026-02-17 10:00 (UTC+8))
- #34653 [BugFix] [Build] fix string literals comparison in indexer_k_quant_and_cache calling site — bug,ready — by hongxiayang (created: 2026-02-17 06:53 (UTC+8))
- #34649 AI Fix #15647 — no labels — by itzyesse99-lgtm (created: 2026-02-17 05:48 (UTC+8))
- #34660 [Kernel][Perf] Fuse gather_block_tables + compute_slot_mappings into single kernel — performance,v1 — by mayank-ketkar-sf (created: 2026-02-17 08:14 (UTC+8))
- #34639 [CI] Enable mypy import following for vllm/v1/kv_offload — ready — by aneeshkp (created: 2026-02-17 02:36 (UTC+8))
- #34664 Add MXFP8 to Marlin dense kernel — no labels — by mgoin (created: 2026-02-17 09:29 (UTC+8))
- #34663 Separate TRTLLM and Flashinfer backends — documentation,v1,nvidia — by pavanimajety (created: 2026-02-17 09:21 (UTC+8))
- #34614 [Feature] Add Azure Blob Storage support for RunAI Model Streamer — rocm,ci/build — by hasethuraman (created: 2026-02-16 17:46 (UTC+8))
- #34659 [CI Failure] Disable Qwen3-30B-A3B-MXFP4A16.yaml due to model deletion — ci/build,ci-failure,qwen — by mgoin (created: 2026-02-17 08:12 (UTC+8))
- #34662 [Model Runner V2] Minor refactoring for penalties — v1 — by WoosukKwon (created: 2026-02-17 09:07 (UTC+8))
- #34646 [Bugfix] Fix EPLB + NVFP4: make expanded activation scales contiguous — bug,nvidia — by elvircrn (created: 2026-02-17 04:50 (UTC+8))
- #34658 Refactor oracle to separate support and selection — needs-rebase — by mgoin (created: 2026-02-17 07:59 (UTC+8))
- #34651 [Feature] Lazy import for the “mistral” tokenizer module. — structured-output,frontend,v1,multi-modality — by nascheme (created: 2026-02-17 06:49 (UTC+8))
- #34657 [Quant][WIP] Add compressed-tensors MXFP4 support through flashinfer — nvidia — by dsikka (created: 2026-02-17 07:47 (UTC+8))
- #34647 [ROCm] Add hardware detection for FP4 BMM to prevent MI300X crashes — rocm — by khairulkabir1661 (created: 2026-02-17 05:09 (UTC+8))
- #34654 [Feature] Lazy import of ngram_proposer module. — v1 — by nascheme (created: 2026-02-17 06:57 (UTC+8))
- #34656 [CI] Fix bake config artifact path for AMI rebuild pipeline — ci/build — by amrmahdi (created: 2026-02-17 07:23 (UTC+8))
- #34652 [AMD][CI] Fix test new_weight_syncing/rlhf.py — rocm — by rjrock (created: 2026-02-17 06:52 (UTC+8))
- #34606 [CI] Disable precompiled wheel path in CI image builds — ready,ci/build — by amrmahdi (created: 2026-02-16 15:34 (UTC+8))
- #34648 [Feature] Add VLLM_TRITON_AUTOTUNE with functional autotune control — no labels — by debo3 (created: 2026-02-17 05:14 (UTC+8))
- #34633 Remove dead bitsandbytes CxB code from 8-bit inference path — ready — by TimDettmers (created: 2026-02-17 01:16 (UTC+8))
- #34645 [Quantization] - Added uses_meta_device_weights to quant config — no labels — by Josephasafg (created: 2026-02-17 04:31 (UTC+8))
- #34644 [release 2.11] Update to torch 2.11-rc1 — rocm,ci/build,cpu,nvidia — by atalman (created: 2026-02-17 03:54 (UTC+8))
- #34611 [Bugfix] Fix ResponseCreatedEvent ValidationError for json_schema format in streaming — bug,frontend,v1 — by agangadi24 (created: 2026-02-16 17:07 (UTC+8))
- #34630 [Bugfix][ROCm] Fix WNA16 MoE quant config init and Qwen3-VL tie_word_embeddings — bug,rocm,qwen — by laudney (created: 2026-02-17 01:06 (UTC+8))
- #34638 Add Kubernetes monitoring stack and metrics proxy for vLLM deployments — documentation,frontend,needs-rebase,ci/build — by alitirmizi23 (created: 2026-02-17 02:16 (UTC+8))
- #34636 [ROCm][Bugfix]: Only save unpadded sizes for shared_experts in MoERunner to fix rmsnorm pad fusion — bug,rocm — by Rohan138 (created: 2026-02-17 01:40 (UTC+8))
- #34632 [ROCm] Add MXFP4 inline dequant Triton kernel for RDNA4/gfx12 — new-model,rocm — by laudney (created: 2026-02-17 01:09 (UTC+8))
- #34599 [ROCm][CI] Fix spec decode logprobs flakiness and parametrize tree attention backends — rocm,speculative-decoding,v1 — by AndreasKaratzas (created: 2026-02-16 12:26 (UTC+8))
- #34631 [ROCm] Make Whisper causal attention backend-agnostic — rocm — by laudney (created: 2026-02-17 01:07 (UTC+8))
- #34629 Targeting the MI355 agent pool with all existing tests — ready,ci/build — by Alexei-V-Ivanov-AMD (created: 2026-02-17 00:05 (UTC+8))
- #34624 [Bugfix][CI] Fix flaky entrypoints/openai/test_response_api_with_harmony.py::test_function_calling[openai/gpt-oss-20b] — bug,ready,gpt-oss — by NickLucche (created: 2026-02-16 22:33 (UTC+8))
- #34627 [Performance] Extract kv update ops from MLA attention backends — no labels — by ElizaWszola (created: 2026-02-16 23:21 (UTC+8))
- #34628 [MM] Allow audio chunking for offline LLM — documentation,frontend,multi-modality — by NickLucche (created: 2026-02-16 23:48 (UTC+8))
- #34603 add Florence2 model integration and registration — documentation,new-model,frontend,needs-rebase,v1 — by hydracz (created: 2026-02-16 13:18 (UTC+8))
- #34613 [Realtime] Add Qwen3-ASR realtime streaming support — new-model,frontend,qwen — by pougetat (created: 2026-02-16 17:15 (UTC+8))
- #34618 (bugfix): Fixed encode in LLM entrypoint for IOProcessr plugin prompts — bug,documentation,frontend,ready — by christian-pinto (created: 2026-02-16 19:53 (UTC+8))
- #34616 [KVConnector] Scheduler: Fix num_computed_tokens after async KV load — v1 — by orozery (created: 2026-02-16 18:17 (UTC+8))
- #34623 [Bugfix] Fix tests/compile/correctness_e2e/test_sequence_parallel.py::test_tp_sp_generation — bug — by NickLucche (created: 2026-02-16 21:33 (UTC+8))
- #34620 add tests from higher start point — nvidia — by shaunkotek (created: 2026-02-16 20:53 (UTC+8))
- #34602 [Bugfix] Fix sleep mode wake_up memory leaks — bug — by stmoonar (created: 2026-02-16 13:11 (UTC+8))
- #34615 [CI][Nixl] Add CrossLayer KV layout tests — ready,ci/build,kv-connector — by NickLucche (created: 2026-02-16 17:50 (UTC+8))
- #34610 Revert “[Misc] fix qwen3.5 config” — qwen — by ywang96 (created: 2026-02-16 17:05 (UTC+8))
- #34608 feat(spec_decode): add Qwen3-VL Eagle3 multimodal prefill support — new-model,speculative-decoding,v1,qwen — by shx2005 (created: 2026-02-16 16:06 (UTC+8))
- #34604 [Misc] fix qwen3.5 config — qwen — by JJJYmmm (created: 2026-02-16 15:21 (UTC+8))
- #34605 feat:[SpecDecode][Eagle3] Add image-text pair support for Qwen3-VL drafter — new-model,speculative-decoding,v1,qwen — by shx2005 (created: 2026-02-16 15:30 (UTC+8))
- #34600 Fix sleep mode wake_up memory leaks — documentation,frontend,ci/build — by stmoonar (created: 2026-02-16 12:28 (UTC+8))
Merged PRs
- #34666 [Model Runner V2] Minor cleanup for PP — v1 — by WoosukKwon (merged: 2026-02-17 11:15 (UTC+8))
- #34667 [Model Runner V2] Fix unintended CPU-GPU sync in make_dummy — v1 — by WoosukKwon (merged: 2026-02-17 11:00 (UTC+8))
- #34507 [Bugfix] Fix fused MoE int32 overflow in stride*offset without perf regression — bug,ready — by haosdent (merged: 2026-02-17 09:58 (UTC+8))
- #34639 [CI] Enable mypy import following for vllm/v1/kv_offload — ready — by aneeshkp (merged: 2026-02-17 09:58 (UTC+8))
- #33960 [Core] Pipeline Parallel support for Model Runner V2 — v1 — by ZhanqiuHu (merged: 2026-02-17 09:48 (UTC+8))
- #33433 [Model Runner V2] support bad_words sampling param — v1 — by izhuhaoran (merged: 2026-02-17 08:36 (UTC+8))
- #34606 [CI] Disable precompiled wheel path in CI image builds — ready,ci/build — by amrmahdi (merged: 2026-02-16 23:14 (UTC+8))
- #34582 [NemotronH] Do not force router to run in fp32 — performance,ready,nvidia — by roikoren755 (merged: 2026-02-17 02:15 (UTC+8))
- #34566 [CI][Metrics] Stabilize tests with polling and subprocess guards — ready — by AndreasKaratzas (merged: 2026-02-16 18:52 (UTC+8))
- #34589 [ROCm][CI] Fix plugins test group; updating terratorch and dependencies — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-02-16 23:33 (UTC+8))
- #34629 Targeting the MI355 agent pool with all existing tests — ready,ci/build — by Alexei-V-Ivanov-AMD (merged: 2026-02-17 01:02 (UTC+8))
- #34624 [Bugfix][CI] Fix flaky entrypoints/openai/test_response_api_with_harmony.py::test_function_calling[openai/gpt-oss-20b] — bug,ready,gpt-oss — by NickLucche (merged: 2026-02-17 00:11 (UTC+8))
- #34063 [Bugfix] Treat generation_config max_tokens as default not ceiling — bug,frontend,ready — by almogtavor (merged: 2026-02-16 23:58 (UTC+8))
- #34492 [Models] Fuse Qwen3.5 GDN’s qkvz_proj and ba_proj — ready,qwen — by Isotr0py (merged: 2026-02-16 23:32 (UTC+8))
- #34292 [CI] Enable mypy coverage for individual excluded files — ready,v1,nvidia — by Lucaskabela (merged: 2026-02-16 23:34 (UTC+8))
- #34618 (bugfix): Fixed encode in LLM entrypoint for IOProcessr plugin prompts — bug,documentation,frontend,ready — by christian-pinto (merged: 2026-02-16 23:33 (UTC+8))
- #34576 [Bugfix] Fix ARC touch KeyError for non-ready T1 blocks in kv offload — bug,ready,v1 — by Vivo50E (merged: 2026-02-16 23:33 (UTC+8))
- #34575 Fix call to moe_mk in modelopt MoE modules (required for LoRA) — ready — by danisereb (merged: 2026-02-16 23:33 (UTC+8))
- #33994 Bump lm-eval version for Transformers v5 compatibility — documentation,rocm,ready,ci/build — by hmellor (merged: 2026-02-16 21:24 (UTC+8))
- #34364 [Fix] Fix tracing test race condition by adding server readiness check — ready — by emricksini-h (merged: 2026-02-16 20:57 (UTC+8))
- #31058 [Scheduler][ASR] Fix CrossAttn blocks per-request for Variable length encoder inputs — ready,v1 — by ekagra-ranjan (merged: 2026-02-16 19:08 (UTC+8))
- #34320 [Bugfix] Fix Dynamo unexpected keyword argument — bug,ready — by samutamm (merged: 2026-02-16 17:32 (UTC+8))
- #34610 Revert “[Misc] fix qwen3.5 config” — qwen — by ywang96 (merged: 2026-02-16 17:06 (UTC+8))
- #34604 [Misc] fix qwen3.5 config — qwen — by JJJYmmm (merged: 2026-02-16 16:25 (UTC+8))
- #34598 [Renderer] Move InputPreprocessor into Renderer (1.5/2) — ready,multi-modality — by DarkLight1337 (merged: 2026-02-16 15:46 (UTC+8))
- #34569 [CI] Write bake config to temp directory instead of repo root — ready,ci/build — by amrmahdi (merged: 2026-02-16 14:15 (UTC+8))
- #34590 [CI][Frontend] Return 422 instead of 500 for invalid Anthropic tool_choice — frontend,ready — by AndreasKaratzas (merged: 2026-02-16 12:06 (UTC+8))
- #34453 [Bugfix] Add method to swap quant_method on FusedMoE to fix LoRA issues — bug,ready — by bnellnm (merged: 2026-02-16 12:10 (UTC+8))
- #34548 [BugFix] Fix Python 3.13 FlashMLA import error — bug,ready,ci/build — by LucasWilkinson (merged: 2026-02-16 12:09 (UTC+8))
- #34584 [Doc] Add Mistral-7b-v0.3 model to the batch invariance validated model — documentation,ready — by banparth (merged: 2026-02-16 12:09 (UTC+8))
PRs Closed Without Merging
- #34612 [Feature]: Implement get_kv_cache_stride_order for all classes — performance,v1,kv-connector — by SouthWest7 (closed: 2026-02-17 10:50 (UTC+8))
- #26910 [Bugfix] [ROCm] Enable DCP through Triton MLA on ROCm — rocm,stale,v1 — by tjtanaa (closed: 2026-02-17 10:16 (UTC+8))
- #33383 [Bugfix] Disable torch.compile for batch invariance on Blackwell to ensure determinism — bug — by ZhanqiuHu (closed: 2026-02-17 09:59 (UTC+8))
- #34388 [WIP][Spec Decode] P/D disaggregation: transfer hidden states for EAGLE warm-up — documentation,speculative-decoding,needs-rebase,v1,kv-connector — by ZhanqiuHu (closed: 2026-02-17 09:59 (UTC+8))
- #34659 [CI Failure] Disable Qwen3-30B-A3B-MXFP4A16.yaml due to model deletion — ci/build,ci-failure,qwen — by mgoin (closed: 2026-02-17 09:19 (UTC+8))
- #34630 [Bugfix][ROCm] Fix WNA16 MoE quant config init and Qwen3-VL tie_word_embeddings — bug,rocm,qwen — by laudney (closed: 2026-02-17 02:38 (UTC+8))
- #34638 Add Kubernetes monitoring stack and metrics proxy for vLLM deployments — documentation,frontend,needs-rebase,ci/build — by alitirmizi23 (closed: 2026-02-17 02:18 (UTC+8))
- #34603 add Florence2 model integration and registration — documentation,new-model,frontend,needs-rebase,v1 — by hydracz (closed: 2026-02-16 23:57 (UTC+8))
- #30186 Fix #15483 : Add error handling for model-dependent endpoints during … — frontend,v1 — by erdaltoprak (closed: 2026-02-16 22:57 (UTC+8))
- #31811 [KVConnector] Handle KV events from multiple connectors — v1,kv-connector — by hickeyma (closed: 2026-02-16 20:42 (UTC+8))
- #34430 [Bugfix] FIx TorchAO config bugs — bug,documentation — by jwpark33 (closed: 2026-02-16 19:24 (UTC+8))
- #31893 [Misc] Add --use-flashinfer-rope to control the RoPE kernel for cuda — gpt-oss,nvidia — by elvischenv (closed: 2026-02-16 18:04 (UTC+8))
- #34596 Dependency compatibility: transformers>=5.0.0 and resolver fix for vLLM 0.16.0 — rocm,ci/build — by cccat6 (closed: 2026-02-16 17:49 (UTC+8))
- #34605 feat:[SpecDecode][Eagle3] Add image-text pair support for Qwen3-VL drafter — new-model,speculative-decoding,v1,qwen — by shx2005 (closed: 2026-02-16 16:04 (UTC+8))
- #34600 Fix sleep mode wake_up memory leaks — documentation,frontend,ci/build — by stmoonar (closed: 2026-02-16 13:02 (UTC+8))