vLLM Development Digest - 2026-02-18
Time window: 2026-02-18 11:32 (UTC+8) ~ 2026-02-19 11:32 (UTC+8). Stats: 20 new Issues | 18 closed Issues | 73 new PRs | 31 merged PRs | 10 PRs closed without merge
📊 Daily Development Summary
Over the past 24 hours, the vLLM project maintained a very high level of development activity, opening and merging a large volume of PRs (73 opened, 31 merged) while working through a steady stream of bug reports and feature requests. Development focused on fixing model compatibility and initialization problems, optimizing kernel performance (especially MoE and sampling), and strengthening AMD/ROCm platform support. Fixes for several CI test failures and core stability issues were the day's most significant progress.
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was fairly brisk this cycle, centered on bug fixes, feature requests, and CI improvements.
- Issue #34850: MiniMax M2.5 compile error on MI355
  - Reporter: functionstackx, who directly @-mentioned an AMD engineer (@andyluo7) in the issue.
  - Summary: Running the MiniMax M2.5 model on MI355 (gfx942) with the ROCm vLLM image triggered a compilation error pointing at the `unified_attention` op.
  - Analysis and resolution: AMD engineer @andyluo7 responded quickly with install and run commands. The user then resolved it themselves: compilation succeeded after changing the block size from 64 to 32; the root cause was an overly large initial block size. The issue is now closed. A useful reminder to watch kernel parameter configuration when deploying on AMD hardware.
  - Impact: Unblocked this model on MI355 and validated the ROCm recipe in a real-world scenario.
- Issue #34781: Request for a disaggregated PD + wideEP recipe for Kimi K2.5 on ROCm
  - Reporter: functionstackx, again @-mentioning the AMD team.
  - Summary: The user requests a ROCm deployment recipe, at parity with CUDA, for large MoE models such as Kimi K2.5 using disaggregated PD (prefill/decode disaggregation) plus wideEP (wide expert parallelism). They note this was on the Q1 2026 roadmap but remains unshipped as the quarter nears its end.
  - Analysis and status: The issue remains open and reflects the community's eagerness for feature completeness on the AMD platform, especially efficient deployment of large-scale MoE models. The AMD team has not yet replied directly. This is a significant feature gap for users who want to run trillion-parameter-scale MoE models on AMD clusters.
- PR #34848: [ROCm] Add an extra step in config initialization to populate custom ops before compilation config init
  - Contributor: gshtras; labels include `rocm`.
  - Summary: A prerequisite for #33271. Currently, when AITER is enabled, some custom ops are appended after the ROCm compilation config has been initialized, so the corresponding fusion optimizations never fire. This PR moves custom-op initialization earlier, ensuring the fusion passes can take effect.
  - Impact: Streamlines the AITER kernel compilation flow on AMD, paving the way for more kernel fusion optimizations and potential runtime gains on the ROCm backend.
- PR #34839: [ROCm][CI] Clean up and restructure the legacy amd-ci pipeline
  - Contributor: AndreasKaratzas, AMD CI engineer.
  - Summary: A large-scale cleanup of the AMD-specific CI pipeline configuration: removing dead comments, updating hardware tags, unifying formatting, and deleting defunct tests.
  - Impact: Improves the maintainability and readability of the AMD CI pipeline; essential infrastructure work for keeping AMD platform code quality stable.
- PR #34861: [1/N] Elastic EP Milestone 2
  - Contributor: itayalroy; labels include `rocm`. This PR re-lands, on the latest main branch, the elastic expert parallelism feature originally designed by @libertyeagle, with fixes.
  - Summary: Implements the second milestone of Elastic EP (elastic expert parallelism): dynamically scaling the EP size up and down during serving, with requests handled normally between scaling events.
  - Impact: An important architectural feature, not limited to AMD, but significant for elastic resource management and cost optimization on AMD multi-GPU clusters. Its labels indicate testing across multiple EP backends.
Takeaway: the AMD ecosystem centered on problem-solving and infrastructure work this cycle. User-reported runtime issues were resolved quickly, and groundwork was laid for future performance work (AITER fusion) and core features (Elastic EP). The strong demand for large-scale MoE deployment recipes, however, has still not received a direct response and remains the item to watch.
💬 High-Engagement Discussion Analysis
- Issue #34845: [Bug]: Qwen3-Next MTP fails when paired with chunked prefill
  - Comments: 3
  - Core topic: Qwen3-Next crashes when MTP (multi-token prediction) speculative decoding is combined with chunked prefill.
  - Analysis:
    - @benchislett (core contributor) pinpointed the root cause as PR #34077, which added an assertion (for correctness) that small prefill chunks must not be classified as decode steps. Chunked prefill, however, produces exactly such "partial prefills", and there is currently no "fallback" state machine to handle them the way speculative verification has. He further noted this may also affect the "aligned"-mode prefix caching that only works with chunked prefill.
    - His testing showed the same configuration worked with high accuracy before the offending PR, so the feature itself is viable; the failure is in edge-case handling.
  - Points of contention: none; the discussion is root-cause analysis. The open question is how to architect scheduling so chunked prefill and speculative decoding compose correctly.
  - Status: open, awaiting a fix. The discussion exposed a design conflict between low-level scheduling logic and advanced optimization features.
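The failure mode above can be sketched in a few lines. This is a hypothetical illustration (function names and rules are mine, not vLLM's actual scheduler code): a classification rule based only on "how many new tokens were scheduled this step" mislabels the small final chunk of a chunked prefill as a decode step, while a rule that also tracks prompt progress does not.

```python
def classify_step(num_new_tokens: int, num_computed: int, prompt_len: int) -> str:
    # Correct rule: a request still consuming its prompt is a prefill,
    # however small the chunk scheduled this iteration happens to be.
    if num_computed < prompt_len:
        return "prefill"
    return "decode"

def naive_classify(num_new_tokens: int) -> str:
    # Token-count-only rule: mislabels a 1-token prefill tail as a decode
    # step, which is the kind of assumption PR #34077's assertion trips on.
    return "decode" if num_new_tokens == 1 else "prefill"

# A 512-token prompt chunked as 256 + 255 + 1: the last chunk schedules
# exactly one token but is still part of the prefill.
assert classify_step(1, num_computed=511, prompt_len=512) == "prefill"
assert naive_classify(1) == "decode"  # the misclassification
```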
- Issue #34800: [CI Failure]: entrypoints/weight_transfer/test_weight_transfer_llm.py
  - Comments: 2
  - Core topic: a unit test for the weight transfer engine failed because a mock never took effect.
  - Analysis: reporter @ilmarkov suspected process spawning was defeating the mock (patch). @hao-aaron then confirmed a fix had landed via PR #34841.
  - Points of contention: none.
  - Status: fixed and merged via PR #34841. A textbook workflow: CI catches the problem -> root-cause analysis -> quick fix PR.
- Issue #34792: [Bug]: setting VLLM_LOGGING_LEVEL=debug breaks tool calling (related PR #34844)
  - Comments: 2
  - Core topic: enabling debug-level logging breaks tool calling for Mistral models.
  - Analysis: reporter @dtrifiro provided a detailed trace showing that Pydantic wraps `Iterable` fields in a one-shot iterator; serializing it for the debug log exhausts it, so the downstream tool parser sees an empty sequence. @NickLucche suggested fixing it in a PR by replacing the `Iterable` attributes with `Sequence` one by one.
  - Points of contention: none; a clean bug report with an agreed fix direction.
  - Status: PR #34844 has been opened and marked `ready` specifically to fix this.
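The bug class is easy to reproduce in plain Python, without Pydantic (this sketch uses my own toy names): an `Iterable`-typed field backed by a generator can be consumed exactly once, so "just logging it" destroys the data for any later reader, whereas a `Sequence` is safely re-iterable.

```python
from typing import Iterable, Sequence

def make_tool_calls() -> Iterable[str]:
    # A one-shot iterator, like the wrapper Pydantic puts around Iterable fields.
    return (name for name in ["get_weather", "search"])

calls = make_tool_calls()
debug_dump = list(calls)      # debug logging serializes (and exhausts) it
assert debug_dump == ["get_weather", "search"]
assert list(calls) == []      # the real consumer now sees an empty sequence

# The fix direction from the issue: a Sequence can be read any number of times.
calls_seq: Sequence[str] = ["get_weather", "search"]
assert list(calls_seq) == ["get_weather", "search"]
assert list(calls_seq) == ["get_weather", "search"]
```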
- Issue #16348: [Bug]: Missing metrics in V1 (historical issue, closed this cycle)
  - Comments: 16
  - Core topic: with the V1 engine, aggregate throughput metric logs do not appear when using the `AsyncLLM` API directly.
  - Analysis:
    - An issue open for nearly a year. Contributors confirmed early on that main had a fix, but users kept reporting the problem under particular usage patterns.
    - Core contributors @markmc and @njhill traced it to an unfinished TODO: the logging task was never correctly started inside `AsyncLLM`.
    - A later request to emit metrics only when requests arrive (to reduce log volume) did not become a main thread of discussion.
  - Points of contention: mostly communication about the missing functionality and fix timeline; users wanted to know when the feature would work reliably.
  - Outcome: after a long period without activity, it was marked stale by GitHub Actions and auto-closed.
- Issue #21951: [Bug]: Incremental detokenization error when running the `llama-3.3-70b-fp8` model (historical issue, closed this cycle)
  - Comments: 23
  - Core topic: running FP8-quantized models intermittently produces negative token IDs, which break decoding and can crash the engine.
  - Analysis:
    - Multiple users confirmed the same random, hard-to-reproduce failure across several FP8 models.
    - Contributors @david6666666 and @BruceW-07 tried and failed to reproduce it, making triage harder.
    - The prevailing guess is some kind of overflow producing invalid token IDs, but the root cause was never pinned down in the issue.
  - Points of contention: none directly, but it highlights how hard intermittent, unreproducible bugs are to diagnose and fix in an open-source community.
  - Outcome: likewise marked stale and auto-closed after prolonged inactivity. It may have been fixed incidentally by later changes, or may still require specific trigger conditions.
🔥 Hot Topics and Trend Analysis
- Quantization and precision issues: one of this cycle's focal points.
  - Quantized checkpoint loading (#34859): a quantized checkpoint with missing shards loads without error, failing silently; the lack of validation is a serious hazard.
  - FP8 KV cache with DCP (#34795): fixes the incompatibility between FP8 KV cache and decode context parallel in MLA attention, widening the applicability of this memory-efficient setup.
  - NVFP4 and TRTLLM integration (#34728, #34725): fixes compatibility between NVFP4 quantization and the TRTLLM attention backend for models such as Nemotron, keeping the new quantization format usable.
  - Online quantization support (#34824, #34645): discussion around online MXFP8 MoE quantization and the `uses_meta_device_weights` config shows the online quantization path maturing.
- Model initialization and compatibility: several CI failures and new-model loading errors.
  - A cluster of model initialization test failures (#34806, #34810, #34814, #34819) traces back to `_update_block_size_for_backend`, introduced by the recent PR #33600, which initializes CUDA too early and corrupts the test environment. PR #34818 is fixing it.
  - New model support: Granite Moe Hybrid (#34812), Qwen3.5-397B (#34684, #34779), EagleMiniCPM (#34806) and others hit assorted loading or inference problems, reflecting the compatibility churn that comes with vLLM's rapid adoption of new model architectures.
- System stability and memory management:
  - KV cache usage metric (#34860): adds KV cache usage to per-iteration logs, aiding operations and performance tuning.
  - Orphan process cleanup (#34816, #34643): uses the `prctl` mechanism to ensure engine-core and worker processes are terminated when the parent dies unexpectedly, improving resource management reliability in cluster environments.
  - Layernorm integer overflow (#34842): fixes a potential index-computation overflow in the layernorm kernel under very large requests, improving robustness.
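The overflow class fixed by #34842 can be demonstrated on the CPU (values and widths here are illustrative, not taken from the actual kernel): computing a flat element index as `row * stride` in 32-bit integer arithmetic wraps past 2^31 - 1 once the tensor is large enough, yielding a negative, out-of-bounds index; doing the multiplication in 64 bits does not.

```python
import ctypes

def index_i32(row: int, stride: int) -> int:
    # Emulate 32-bit index arithmetic as a kernel might do it: wraps at 2**31.
    return ctypes.c_int32(row * stride).value

def index_i64(row: int, stride: int) -> int:
    # Python ints are arbitrary precision, standing in for int64_t here.
    return row * stride

rows, hidden = 300_000, 8_192          # 300k tokens * hidden 8192 ≈ 2.46e9 elements
assert index_i64(rows, hidden) == 2_457_600_000
assert index_i32(rows, hidden) < 0     # wrapped into a negative (invalid) index
```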
🛠️ Key Technical Changes
- PR #34866: [BUGFIX] Fix `_dummy_run` missing `prepare_inputs_event` synchronization
  - Technical notes: fixes a race condition under async scheduling. The `_dummy_run` operation, executed by idle data-parallel workers for expert-parallel coordination, shares CPU pinned-memory buffers (such as sequence lengths) with the main execution path (`execute_model`), but did not follow the same CUDA event synchronization protocol.
  - Impact: resolves a low-level synchronization bug that could produce `cudaErrorIllegalAddress` under DP+EP configurations, improving stability for multi-dimensional parallelism.
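As a CPU-side analogy (a sketch with my own class names, not vLLM's code), the protocol the fix restores looks like this: every path that writes the shared pinned buffer must first wait on the event signalling that the previous contents have been consumed, and every reader must signal that event when done. A path that skips the wait, as `_dummy_run` did, can clobber data still in flight.

```python
import threading

class SharedInputBuffer:
    """Single reusable buffer guarded by a 'previous contents consumed' event."""
    def __init__(self):
        self.data = None
        self.consumed = threading.Event()
        self.consumed.set()            # initially free to write

    def write(self, value):
        self.consumed.wait()           # don't overwrite an in-flight read
        self.consumed.clear()
        self.data = value

    def read(self):
        value = self.data
        self.consumed.set()            # signal: buffer may now be reused
        return value

buf = SharedInputBuffer()
buf.write([3, 5, 7])                   # the execute_model path
assert buf.read() == [3, 5, 7]
buf.write([1])                         # a dummy run that also follows the protocol
assert buf.read() == [1]
```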
- PR #34846: [Perf] Improve default triton fused moe configs
  - Technical notes: analyzes a large corpus of already-tuned config files to overhaul the Triton fused-MoE kernel's default parameters used when no tuned config is available. Key changes include a batch-size-aware `BLOCK_SIZE_M`, a more sensible `BLOCK_SIZE_K` (64/128 instead of 32), and explicitly set `num_warps`/`num_stages`.
  - Impact: for models with many experts, up to 2x speedup at large batch sizes with no regressions, markedly improving out-of-the-box MoE inference performance.
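The shape of a batch-size-aware default table can be sketched as follows (thresholds and values here are illustrative, not the PR's actual numbers): small batches get a small `BLOCK_SIZE_M` so tiles are not mostly padding, while large batches get bigger tiles, a larger `BLOCK_SIZE_K`, and more warps and pipeline stages.

```python
def default_fused_moe_config(num_tokens: int) -> dict:
    # Hypothetical fallback config chosen by batch size, used only when no
    # tuned config file exists for the current shape.
    if num_tokens <= 16:
        return {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_K": 64, "num_warps": 4, "num_stages": 3}
    if num_tokens <= 256:
        return {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_K": 64, "num_warps": 4, "num_stages": 3}
    # Large batches: bigger tiles amortize memory traffic across more work.
    return {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_K": 128, "num_warps": 8, "num_stages": 4}
```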
- PR #34854: [Model Runner V2] Use FP32 for Gumbel Noise
  - Technical notes: raises the noise generation and arithmetic in the Gumbel sampling kernel from the input precision (e.g. BF16/FP16) to FP32, and merges the `tl.max` and `tl.argmax` operations.
  - Impact: up to 50% speedup at large batch sizes with large vocabularies; a key performance optimization for the v2 model runner's sampling stage.
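For reference, the underlying technique is the Gumbel-max trick: sampling an index from `softmax(logits)` is equivalent to taking `argmax(logits + g)` with i.i.d. Gumbel noise `g = -log(-log(u))`, `u ~ Uniform(0, 1)`. The sketch below runs in Python's 64-bit floats; the PR's point is analogous, keeping this arithmetic in FP32 rather than BF16/FP16 inside the kernel.

```python
import math
import random

def gumbel_max_sample(logits, rng):
    # One categorical sample via the Gumbel-max trick.
    best_idx, best_val = 0, -math.inf
    for i, logit in enumerate(logits):
        u = rng.random()
        if u == 0.0:
            u = 1e-300              # guard against log(0)
        gumbel = -math.log(-math.log(u))
        if logit + gumbel > best_val:
            best_idx, best_val = i, logit + gumbel
    return best_idx

# A huge logit gap is essentially never overcome by the noise.
assert gumbel_max_sample([0.0, 100.0, 0.0], random.Random(0)) == 1
```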
- PR #34834: [Bugfix] Fix lora tests
  - Technical notes: fixes a regression introduced by PR #34560 (the Renderer refactor) in which `_resolve_lora_reqs` was never actually called, breaking LoRA functionality.
  - Impact: a quick fix for the core LoRA test failures, keeping a key feature stable through the frontend refactor.
- PR #32771: [Model Runner V2] support piecewise & mixed cudagraph
  - Technical notes: adds piecewise and mixed CUDA graph support to the V2 model runner.
  - Impact: further unlocks the V2 runner's low-latency potential, allowing more flexible use of CUDA graphs to capture and replay computation, especially under dynamic batching.
📈 Development Activity Observations
- Fast core-team response: for CI failures and serious bugs (e.g. #34845, #34800), core contributors (@benchislett, @hao-aaron, @DarkLight1337, @mgoin) quickly found root causes and opened or merged fixes, showing efficient collaboration and incident handling.
- Active AMD contributors: user functionstackx keeps deploying on AMD hardware and reporting findings, while AMD engineers @andyluo7, @gshtras, @AndreasKaratzas and others stay active on fixes and infrastructure maintenance.
- Brisk review-and-merge cadence: 31 PRs merged within 24 hours, spanning features, fixes, docs, and CI, indicating rapid iteration with a smooth review-and-merge pipeline.
- Depth of community engagement: some issues (e.g. #34845) are discussed down to the architectural-design level, reflecting the technical depth of community users and contributors.
💡 Issues Worth Watching
- AMD platform feature parity: Issue #34781 underscores urgent user demand for AMD to match the CUDA ecosystem on advanced large-scale MoE deployment (disaggregated PD + wideEP). This is one of the keys to whether AMD can compete with NVIDIA in high-end inference.
- Robustness of quantized-model loading: Issue #34859 exposes a dangerous bug: a quantized checkpoint with missing shards fails silently. It needs a prompt fix, and stronger validation should be considered for non-quantized models too.
- Prefix caching vs. chunked prefill interactions: Issues #34845 and #34865 both involve compatibility between prefix caching and advanced scheduling features (chunked prefill, speculative decoding). This calls for a systematic design review rather than scattered patches.
- Stale-issue cleanup and knowledge management: historical bugs like #16348 and #21951 were auto-closed, potentially burying unresolved complex problems. The project should consider mechanisms, such as finer-grained labels or periodic triage, so that important but hard-to-reproduce bugs are not forgotten.
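On the quantized-checkpoint point (#34859), a pre-load shard completeness check could look like the sketch below. The file names follow the usual safetensors index convention (`model.safetensors.index.json` with a `weight_map`); the function itself is hypothetical, not vLLM's loader.

```python
import json
import os

def check_checkpoint_shards(checkpoint_dir: str) -> None:
    """Fail loudly, before loading, if any shard the index references is absent."""
    index_path = os.path.join(checkpoint_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        index = json.load(f)
    expected = set(index["weight_map"].values())   # one shard file per tensor
    present = set(os.listdir(checkpoint_dir))
    missing = sorted(expected - present)
    if missing:
        raise FileNotFoundError(f"checkpoint is missing shards: {missing}")
```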
📋 Appendix: Detailed Data
New Issues
- #34865 [Bug]: Prefix caching failing for Nemotron models — bug — by benchislett (created: 2026-02-19 10:03 (UTC+8))
- #34845 [Bug]: Qwen3-Next MTP fails when paired with chunked prefill — bug — by wzhao18 (created: 2026-02-19 06:14 (UTC+8))
- #34817 [Bug]: Trying to run gpt-oss-120b on rtx pro 6000 — bug — by chadbek (created: 2026-02-18 23:37 (UTC+8))
- #34850 [Bug]: MiniMax M2.5 MI355 compile error — bug,rocm — by functionstackx (created: 2026-02-19 07:13 (UTC+8))
- #34800 [CI Failure]: entrypoints/weight_transfer/test_weight_transfer_llm.py — ci-failure — by ilmarkov (created: 2026-02-18 21:11 (UTC+8))
- #34859 [Bug]: missing shards from quantized checkpoint fails silently — bug — by andrea-fasoli (created: 2026-02-19 08:22 (UTC+8))
- #34857 [Feature]: vLLM ResponsesAPI & Tool Calling H1 2026 lookahead — feature request — by qandrew (created: 2026-02-19 08:09 (UTC+8))
- #34851 [Feature]: Refactor Quark MoE and mxfp4 MoE to align with MoE oracle/MK — feature request — by BowenBao (created: 2026-02-19 07:27 (UTC+8))
- #34804 [CI Failure]: Lora failed tests lora/test_default_mm_loras.py — ci-failure — by ilmarkov (created: 2026-02-18 22:00 (UTC+8))
- #34802 [CI Failure]: Lora failed tests qwen2vl beam search — ci-failure — by ilmarkov (created: 2026-02-18 21:35 (UTC+8))
- #34781 [Feature]: parity with cuda - ROCm Kimi K2.5 disagg PD +wideEP recipe — feature request,rocm — by functionstackx (created: 2026-02-18 14:02 (UTC+8))
- #34827 [Feature]: Support thinking budget for reasoning models (e.g., GLM-4.5V) — feature request — by gaolongxi (created: 2026-02-19 00:45 (UTC+8))
- #34819 [CI Failure]: models/test_initialization.py::test_can_initialize_small_subset[Llama4ForConditionalGeneration] — ci-failure — by ilmarkov (created: 2026-02-18 23:51 (UTC+8))
- #34814 [CI Failure]: models/test_initialization.py::test_can_initialize_large_subset[InternS1ProForConditionalGeneration] — ci-failure — by ilmarkov (created: 2026-02-18 23:23 (UTC+8))
- #34812 [Bug]: GraniteMoeHybridModel not applying embedding_multiplier to input embeddings — bug — by gabe-l-hart (created: 2026-02-18 23:09 (UTC+8))
- #34810 [CI Failure]: models/test_initialization.py::test_can_initialize_large_subset[H2OVLChatModel] — ci-failure — by ilmarkov (created: 2026-02-18 22:58 (UTC+8))
- #34806 [CI Failure]: models/test_initialization.py::test_can_initialize_large_subset[EagleMiniCPMForCausalLM] — ci-failure — by ilmarkov (created: 2026-02-18 22:26 (UTC+8))
- #34792 [Bug]: setting VLLM_LOGGING_LEVEL=debug breaks tool calling — bug — by dtrifiro (created: 2026-02-18 18:27 (UTC+8))
- #34797 [CI Failure]: v1/e2e/test_spec_decode.py::test_eagle_correctness — ci-failure — by ilmarkov (created: 2026-02-18 20:18 (UTC+8))
- #34782 [Feature]: Support structured outputs for beam search — feature request — by BrandonStudio (created: 2026-02-18 14:11 (UTC+8))
Closed Issues
- #16348 [Bug]: Missing metrics in V1 — bug,stale,v1 — by dtransposed (closed: 2026-02-19 10:18 (UTC+8))
- #16575 [Installation]: — installation,stale — by shubh9m (closed: 2026-02-19 10:18 (UTC+8))
- #17832 [Bug]: Inconsistent Output: First API call differs from subsequent identical calls with temperature=0 on Qwen models — bug,stale — by ericshijian (closed: 2026-02-19 10:18 (UTC+8))
- #17847 [Feature]: Add Native Daemon Mode Support for `vllm serve` — feature request,stale — by BobDu (closed: 2026-02-19 10:18 (UTC+8))
- #20256 [Feature]: Limit total GPU memory — feature request,stale — by vitush93 (closed: 2026-02-19 10:18 (UTC+8))
- #21951 [Bug]: Incremental detokenization error when running `llama-3.3-70b-fp8` model — bug,help wanted,stale — by njhill (closed: 2026-02-19 10:17 (UTC+8))
- #27181 [Bug]: `check_enough_kv_cache_memory` didn't consider `num_gpu_blocks_override` — bug,stale — by heheda12345 (closed: 2026-02-19 10:16 (UTC+8))
- #27182 [Feature]: INT8 Support in Blackwell Arch — feature request,stale — by nhanngoc94245 (closed: 2026-02-19 10:16 (UTC+8))
- #32832 [Performance]: H200 Kimi K2 TP8 Triton MoE min-latency — performance — by jhaotingc (closed: 2026-02-19 09:27 (UTC+8))
- #34850 [Bug]: MiniMax M2.5 MI355 compile error — bug,rocm — by functionstackx (closed: 2026-02-19 08:37 (UTC+8))
- #34804 [CI Failure]: Lora failed tests lora/test_default_mm_loras.py — ci-failure — by ilmarkov (closed: 2026-02-19 05:22 (UTC+8))
- #34802 [CI Failure]: Lora failed tests qwen2vl beam search — ci-failure — by ilmarkov (closed: 2026-02-19 05:22 (UTC+8))
- #34706 [CI Failure]: LM Eval Small Models (B200) - Qwen3-Next — ci-failure — by robertgshaw2-redhat (closed: 2026-02-19 05:18 (UTC+8))
- #32221 [CI Failure]: Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy — ci-failure — by AndreasKaratzas (closed: 2026-02-19 02:42 (UTC+8))
- #31244 [CI Failure]: mi325_2: Plugin Tests (2 GPUs) — ci-failure — by AndreasKaratzas (closed: 2026-02-19 02:41 (UTC+8))
- #34637 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 2) — ci-failure — by AndreasKaratzas (closed: 2026-02-19 02:41 (UTC+8))
- #34728 [Bug]: Nemotron NVFP4 Failing With TRTLLM NVFP4 — bug — by robertgshaw2-redhat (closed: 2026-02-19 01:39 (UTC+8))
- #34661 [Bug]: SyntaxError in qwen3_5_moe.py line 81 when loading Qwen3.5-397B-A17B — no labels — by UmutAlihan (closed: 2026-02-18 20:28 (UTC+8))
New PRs
- #34866 [BUGFIX] Fix `_dummy_run` missing `prepare_inputs_event` synchronization — bug,v1 — by vadiklyutiy (created: 2026-02-19 10:12 (UTC+8))
- #34789 [Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop — bug,v1 — by scyyh11 (created: 2026-02-18 16:40 (UTC+8))
- #34858 [WIP] Increase Flexibility for OOV Multimodal Token Handling — no labels — by alex-jw-brooks (created: 2026-02-19 08:18 (UTC+8))
- #34818 [Bugfix] Fix Basic Models Test — bug,speculative-decoding,ready,v1,ci-failure,nvidia — by MatthewBonanni (created: 2026-02-18 23:41 (UTC+8))
- #34862 [Voxtral Realtime] Fix engine crash on empty multimodal embeddings — ready — by talnirnx (created: 2026-02-19 09:18 (UTC+8))
- #34838 [Perf] Custom triton kernel cache — v1 — by njhill (created: 2026-02-19 01:53 (UTC+8))
- #34844 [Bugfix] Fix tool_calls Iterable consumed when debug logging is enabled — bug,frontend,ready — by wojciech-wais (created: 2026-02-19 05:40 (UTC+8))
- #34860 [Feature] Add KV cache usage metrics to iteration logging — v1 — by maxyanghu (created: 2026-02-19 08:40 (UTC+8))
- #34861 [1/N] Elastic EP Milestone 2 — rocm,needs-rebase,ci/build,v1,cpu,nvidia — by itayalroy (created: 2026-02-19 08:53 (UTC+8))
- #34864 Deprecate test-pipeline.yaml — ready,ci/build — by khluu (created: 2026-02-19 09:52 (UTC+8))
- #34863 [Bugfix] Fix compressed-tensors fp8 block assert and FlashInfer scale propagation — bug — by EliasOenal (created: 2026-02-19 09:51 (UTC+8))
- #34834 [Bugfix] Fix lora tests — bug,frontend,ready,qwen — by DarkLight1337 (created: 2026-02-19 01:17 (UTC+8))
- #34854 [Model Runner V2] Use FP32 for Gumbel Noise — v1 — by WoosukKwon (created: 2026-02-19 07:53 (UTC+8))
- #34791 [Bugfix] Gate 256-bit instructions to CUDA 12.9+ — bug,ready,nvidia — by huydhn (created: 2026-02-18 18:00 (UTC+8))
- #34856 [Model Runner V2] Minor CPU optimizations — v1 — by njhill (created: 2026-02-19 08:06 (UTC+8))
- #34852 [Build] Downgrade to CUDA 12.8 but use nvcc 12.9 to build csrc — ci/build,nvidia — by tlrmchlsmth (created: 2026-02-19 07:31 (UTC+8))
- #34855 [Build] Downgrade to cuda 12.8 — documentation,ready,ci/build,nvidia — by LucasWilkinson (created: 2026-02-19 08:03 (UTC+8))
- #34853 [Build] Downgrade to CUDA 12.8 but use nvcc 12.9 to build by overriding in setup.py — documentation,ready,ci/build,nvidia — by LucasWilkinson (created: 2026-02-19 07:43 (UTC+8))
- #34849 [Model Runner V2] Remove unnecessary copies in PW CUDA graph capture — v1,nvidia — by WoosukKwon (created: 2026-02-19 07:10 (UTC+8))
- #34795 [Core] Enable FP8 KV cache with Decode Context Parallel (DCP) for MLA — no labels — by grimulkan (created: 2026-02-18 18:49 (UTC+8))
- #34846 [Perf] Improve default triton fused moe configs — performance — by mgoin (created: 2026-02-19 06:27 (UTC+8))
- #34847 [Bugfix] Add Quant Config to Llava Next Projector — bug — by alex-jw-brooks (created: 2026-02-19 06:29 (UTC+8))
- #34848 [ROCm] Add extra step in config initialization to populate custom ops before compilation config init — rocm — by gshtras (created: 2026-02-19 06:49 (UTC+8))
- #34841 [BUG] Fixing Weight Sync unit test — bug,ready,ci-failure — by hao-aaron (created: 2026-02-19 04:10 (UTC+8))
- #34774 [Build] Add BUILDER_CUDA_VERSION to decouple build and runtime/PyTorch CUDA versions — ready,ci/build,nvidia — by tlrmchlsmth (created: 2026-02-18 11:48 (UTC+8))
- #34843 [CI] Remove failing prime-rl integration test — ready,ci/build — by mgoin (created: 2026-02-19 05:28 (UTC+8))
- #34777 [Core] Avoid second per-step worker RPC when not needed — ready,v1 — by njhill (created: 2026-02-18 12:55 (UTC+8))
- #34808 Revert “[NemotronH] Do not force router to run in fp32 (#34582)” — no labels — by roikoren755 (created: 2026-02-18 22:44 (UTC+8))
- #34842 [Bugfix][Kernel] Fix integer overflow in layernorm kernel index computations — bug — by wojciech-wais (created: 2026-02-19 05:03 (UTC+8))
- #34793 [Bugfix] Fix engine crash when realtime streaming input is empty (#34532) — bug,frontend,v1 — by wojciech-wais (created: 2026-02-18 18:30 (UTC+8))
- #34779 [Bugfix] Fix Qwen3/Qwen3.5 Reasoning Parser — bug,frontend,ready,qwen — by ywang96 (created: 2026-02-18 13:10 (UTC+8))
- #34831 [compile] Move torch_aot_compile directory under torch_compile_cache — ready — by zhxchen17 (created: 2026-02-19 01:00 (UTC+8))
- #34776 [WIP] Add Warmup to `vllm bench throughput` — performance — by micah-wil (created: 2026-02-18 12:44 (UTC+8))
- #34822 [Bugfix] Add is_blackwell_class() for SM121/GB10 DGX Spark support — bug,documentation,v1,nvidia — by 88plug (created: 2026-02-19 00:20 (UTC+8))
- #34821 [Bugfix] Make CUDA compat library loading opt-in to fix consumer GPUs — bug,ci/build,nvidia — by 88plug (created: 2026-02-19 00:20 (UTC+8))
- #34786 [Model Runner V2] Minor simplification for DCP — v1,nvidia — by WoosukKwon (created: 2026-02-18 15:25 (UTC+8))
- #34839 [ROCm][CI] Cleaning and restructuring amd-ci legacy pipeline — rocm,ci/build — by AndreasKaratzas (created: 2026-02-19 02:25 (UTC+8))
- #34836 fix(docs): fix typos in comments and docstrings — gpt-oss — by machov (created: 2026-02-19 01:45 (UTC+8))
- #34783 [BugFix] Fix implicit and incorrect assumption on ECConnector is_producer — bug,documentation,v1,kv-connector — by furionw (created: 2026-02-18 14:50 (UTC+8))
- #34840 [Core] Fix state names in pause_scheduler() — v1 — by markmc (created: 2026-02-19 02:52 (UTC+8))
- #34832 Revert “[Bugfix] Disable TRTLLM attention with KV transfer enabled (#33192)” — bug,v1,nvidia — by ZhanqiuHu (created: 2026-02-19 01:12 (UTC+8))
- #34826 [Misc] Add mooncake-transfer-engine to kv_connectors requirements — ready,ci/build — by stmatengss (created: 2026-02-19 00:43 (UTC+8))
- #34837 [Docker] Fix VLLM_PRECOMPILED_WHEEL_COMMIT build arg name — ci/build — by UranusSeven (created: 2026-02-19 01:52 (UTC+8))
- #34835 [Core] Extract is_last_rank in ray for tpu to override — v1 — by pv97 (created: 2026-02-19 01:41 (UTC+8))
- #34813 fix: Apply embedding_multiplier to inputs_embeds — no labels — by gabe-l-hart (created: 2026-02-18 23:21 (UTC+8))
- #34820 [CI][Bugfix] Fix multinode test script — bug,ci/build — by ilmarkov (created: 2026-02-19 00:04 (UTC+8))
- #34833 Feature Add fault reporting framework for Fault Tolerant EP — frontend,v1 — by fangyuchu (created: 2026-02-19 01:13 (UTC+8))
- #34830 [Feature] Add `--disable-uvicorn-metrics-access-log` shorthand — frontend — by miloudbelarebia (created: 2026-02-19 00:56 (UTC+8))
- #34828 [Doc] Fix multi-node troubleshooting: use static rdzv backend — documentation — by miloudbelarebia (created: 2026-02-19 00:55 (UTC+8))
- #34829 [Doc] Add version note for `--profiler-config` flag — documentation — by miloudbelarebia (created: 2026-02-19 00:55 (UTC+8))
- #34823 [Refactor][MLA]: Expose prefill/decode split to torch.compile — no labels — by therealnaveenkamal (created: 2026-02-19 00:20 (UTC+8))
- #34825 [CI] temporarily disable multi-node tests — ci/build — by robertgshaw2-redhat (created: 2026-02-19 00:27 (UTC+8))
- #34824 [Feature][Quant] Support online MXFP8 MoE quantization for SM100 serving — needs-rebase,ci/build — by EdalatiAli (created: 2026-02-19 00:24 (UTC+8))
- #34809 feat(frontend): support `--tool-call-parser=auto` for automatic parser detection — frontend — by timon0305 (created: 2026-02-18 22:44 (UTC+8))
- #34815 feat(api): add optional admin control plane API and CLI — frontend — by timon0305 (created: 2026-02-18 23:30 (UTC+8))
- #34811 feat(ep): add `--expert-placement-file` for custom MoE expert-to-rank mapping — no labels — by timon0305 (created: 2026-02-18 23:00 (UTC+8))
- #34816 [Bugfix] Kill orphan EngineCore/WorkerProc via prctl(PR_SET_PDEATHSIG) — bug,v1 — by wojciech-wais (created: 2026-02-18 23:33 (UTC+8))
- #34780 [Model Runner V2] Avoid prepare prefill kernel launch overhead — ready,v1 — by njhill (created: 2026-02-18 13:32 (UTC+8))
- #34796 [CI][Bugfix] Fix new_weight_syncing/rlhf.py — bug,documentation — by ilmarkov (created: 2026-02-18 19:13 (UTC+8))
- #34807 [WIP][Bugfix] Fix SIGILL crash on CPU-only systems during model inspection — bug,new-model,cpu — by haosdent (created: 2026-02-18 22:36 (UTC+8))
- #34794 [Refactor] Implement output type check in LLM — frontend,ready,v1 — by DarkLight1337 (created: 2026-02-18 18:41 (UTC+8))
- #34805 [KVConnector]: Support block-level preemption handling — v1,kv-connector — by orozery (created: 2026-02-18 22:09 (UTC+8))
- #34801 [ROCm][Kernel] Add fused MoE exllama GEMM for compressed-tensors int4 — rocm,llama — by mgehre-amd (created: 2026-02-18 21:35 (UTC+8))
- #34803 [WIP][Bugfix] Detect LoRA module name prefix mismatches at load time — bug — by haosdent (created: 2026-02-18 21:43 (UTC+8))
- #34790 [WIP][Bugfix] Wire whitespace_pattern through V1 structured output backends — bug,structured-output,v1 — by haosdent (created: 2026-02-18 17:58 (UTC+8))
- #34799 [kv_offload+HMA][3/N]: Scheduler-side support for multiple KV groups — v1,kv-connector — by orozery (created: 2026-02-18 20:57 (UTC+8))
- #34798 [Mamba1] - Kernel Level Chunk Alignment for Prefix Caching — v1 — by Josephasafg (created: 2026-02-18 20:38 (UTC+8))
- #34788 Fix vllm build for power — ci/build — by vivek8123 (created: 2026-02-18 16:32 (UTC+8))
- #34775 [Renderer] Deprecate code paths for old input processing — documentation,frontend,ready,v1 — by DarkLight1337 (created: 2026-02-18 11:56 (UTC+8))
- #34785 Add Qwen2-VL V1 parity test — v1,qwen — by WeiL11 (created: 2026-02-18 15:25 (UTC+8))
- #34787 [Core] Num External Cached Tokens — frontend,v1 — by aeon-x (created: 2026-02-18 15:31 (UTC+8))
- #34784 fix(docs): use static rdzv backend in multi-node troubleshooting script — documentation — by machov (created: 2026-02-18 15:09 (UTC+8))
- #34778 feat: expose media_io_kwargs at runtime — frontend,needs-rebase — by milesial (created: 2026-02-18 13:01 (UTC+8))
Merged PRs
- #34864 Deprecate test-pipeline.yaml — ready,ci/build — by khluu (merged: 2026-02-19 10:15 (UTC+8))
- #32771 [Model Runner V2] support piecewise & mixed cudagraph — v1,nvidia — by izhuhaoran (merged: 2026-02-19 07:03 (UTC+8))
- #34834 [Bugfix] Fix lora tests — bug,frontend,ready,qwen — by DarkLight1337 (merged: 2026-02-19 05:22 (UTC+8))
- #34854 [Model Runner V2] Use FP32 for Gumbel Noise — v1 — by WoosukKwon (merged: 2026-02-19 09:07 (UTC+8))
- #34697 [Bugfix] Redo Qwen3.5/Qwen3-Next GDN projector fusion — bug,ready,qwen — by Isotr0py (merged: 2026-02-19 01:46 (UTC+8))
- #34849 [Model Runner V2] Remove unnecessary copies in PW CUDA graph capture — v1,nvidia — by WoosukKwon (merged: 2026-02-19 07:52 (UTC+8))
- #34181 [CI][AMD][BugFix] Use torch.testing.assert_close instead of assert torch.allclose in test_rocm_skinny_gemms.py — bug,rocm,ready — by rasmith (merged: 2026-02-19 07:10 (UTC+8))
- #34588 [MoE Refactor] Convert mxfp4 marlin into modular kernel format — ready — by zyongye (merged: 2026-02-19 06:37 (UTC+8))
- #34745 Fix empty tool_call_id in Anthropic messages API tool result conversion — frontend,ready — by sfeng33 (merged: 2026-02-19 06:31 (UTC+8))
- #34841 [BUG] Fixing Weight Sync unit test — bug,ready,ci-failure — by hao-aaron (merged: 2026-02-19 06:20 (UTC+8))
- #34065 [Docs] Clean up speculators docs — documentation,ready,ci/build,v1 — by kylesayrs (merged: 2026-02-19 05:48 (UTC+8))
- #34673 [Bugfix][MoE Kernel] Fix incorrect routing selection for models without expert groups (e.g., MiniMax-M2.1) — bug,ready,nvidia — by wwl2755 (merged: 2026-02-19 05:03 (UTC+8))
- #34655 [CI][AMD][BugFix] Skip tests in test_unquantized_backend_selection that should not run on ROCm — bug,rocm,ready — by rasmith (merged: 2026-02-19 04:00 (UTC+8))
- #34786 [Model Runner V2] Minor simplification for DCP — v1,nvidia — by WoosukKwon (merged: 2026-02-19 03:04 (UTC+8))
- #34455 [Bugfix] Remove assert causing hipErrorStreamCaptureUnsupported — bug,rocm,ready — by JadenMathias (merged: 2026-02-19 02:54 (UTC+8))
- #34826 [Misc] Add mooncake-transfer-engine to kv_connectors requirements — ready,ci/build — by stmatengss (merged: 2026-02-19 02:26 (UTC+8))
- #34758 [Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup) — ready,ci/build,deepseek — by robertgshaw2-redhat (merged: 2026-02-18 23:42 (UTC+8))
- #34725 [Bugfix] Fix NVFP4 TRTLLM MoE non-gated support; add gsm8k for Nemotron-3-Nano FP8+NVFP4 — bug,ready,nvidia — by mgoin (merged: 2026-02-19 01:39 (UTC+8))
- #34820 [CI][Bugfix] Fix multinode test script — bug,ci/build — by ilmarkov (merged: 2026-02-19 00:45 (UTC+8))
- #34825 [CI] temporarily disable multi-node tests — ci/build — by robertgshaw2-redhat (merged: 2026-02-19 00:32 (UTC+8))
- #34780 [Model Runner V2] Avoid prepare prefill kernel launch overhead — ready,v1 — by njhill (merged: 2026-02-18 16:49 (UTC+8))
- #34228 Add unit tests for fp8 output fusion of triton_attn — ready — by bringlein (merged: 2026-02-18 19:22 (UTC+8))
- #34723 [Bugfix] Fix prefix creation for Qwen3.5 — bug,ready,qwen — by mgoin (merged: 2026-02-18 15:39 (UTC+8))
- #34775 [Renderer] Deprecate code paths for old input processing — documentation,frontend,ready,v1 — by DarkLight1337 (merged: 2026-02-18 16:35 (UTC+8))
- #34645 [Quantization] - Added uses_meta_device_weights to quant config — ready,quantization — by Josephasafg (merged: 2026-02-18 15:43 (UTC+8))
- #34696 [Bugfix] fix activation in cpu_fused_moe_torch call — bug,ready,cpu — by michalowski-arm (merged: 2026-02-18 15:39 (UTC+8))
- #33255 [Bugfix] Fix quant RMS norm fusion for quantization with TMA-aligned scales — bug,performance,ready — by ElizaWszola (merged: 2026-02-18 15:35 (UTC+8))
- #34766 [Model Runner V2] A bit more PP simplification — ready,v1 — by njhill (merged: 2026-02-18 13:39 (UTC+8))
- #34699 [CI/Build] Remove use of `skip_v1` — ready,ci/build — by DarkLight1337 (merged: 2026-02-18 12:19 (UTC+8))
- #34753 [ROCm][CI] Removed hard-coded attn backend requirement for Qwen VL — rocm,ready,multi-modality,qwen — by AndreasKaratzas (merged: 2026-02-18 11:59 (UTC+8))
- #34743 [Core] Fix SSRF bypass via backslash-@ URL parsing inconsistency — ready,multi-modality — by russellb (merged: 2026-02-18 11:53 (UTC+8))
PRs Closed Without Merge
- #24700 [Perf] Improve default fused moe configs by analyzing tuned configs — performance,needs-rebase,unstale — by mgoin (closed: 2026-02-19 06:44 (UTC+8))
- #32558 [Perf] Triton-based top-p/top-k masking — performance,ready,v1 — by njhill (closed: 2026-02-19 05:31 (UTC+8))
- #34593 [Bugfix] Fix MXFP4 weight loading for individual expert tensors — bug — by olka (closed: 2026-02-19 00:50 (UTC+8))
- #33353 [KVConnector] Allow connector to protect GPU blocks from eviction — needs-rebase,v1,kv-connector — by orozery (closed: 2026-02-18 22:10 (UTC+8))
- #34469 [Bugfix][Hardware][AMD] Fix string literal comparison in DISPATCH_BY_KV_CACHE_DTYPE macro — bug,rocm — by c0de128 (closed: 2026-02-18 22:06 (UTC+8))
- #30783 [AWQ] Evaluate fused vs unfused GEMM on actual shape — no labels — by mgehre-amd (closed: 2026-02-18 21:37 (UTC+8))
- #34587 [Bugfix] - Fix Mamba prefix caching corruption with chunked prefill — bug,v1 — by Josephasafg (closed: 2026-02-18 20:39 (UTC+8))
- #34685 [WIP][Bugfix] Fix type mismatch in causal_conv1d Triton kernels PAD_SLOT_ID checks — bug,nvidia — by haosdent (closed: 2026-02-18 16:17 (UTC+8))
- #34785 Add Qwen2-VL V1 parity test — v1,qwen — by WeiL11 (closed: 2026-02-18 15:38 (UTC+8))
- #34707 tests: make parsable-context MCP assertion order-agnostic — needs-rebase — by anencore94 (closed: 2026-02-18 13:35 (UTC+8))