vLLM Development Activity Report - 2026-01-14
Time window: 2026-01-14 10:54 (UTC+8) ~ 2026-01-15 10:54 (UTC+8) | Stats: 17 new Issues | 12 Issues closed | 51 new PRs | 22 PRs merged | 7 PRs closed without merge
📊 Daily Development Summary
During the 2026-01-14 to 01-15 window, the vLLM project maintained very high development activity: 17 new Issues and 51 new PRs were opened, and 22 PRs were merged. Development focus centered on AMD/ROCm platform support and optimization (several related PRs merged), core architecture evolution (e.g., the vLLM IR RFC), and performance tuning and bug fixes. Multimodal model support and quantization (especially for NVIDIA Blackwell and AMD platforms) remained hot topics.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was very high this cycle: contributors (including AMD employees) opened and merged several significant PRs, mainly around CI stability, performance-regression fixes, and feature support.
- CI testing & stability (PR #32350, #32346, #32355, #32295)
  - PR #32350 and #32346 both address flaky ROCm CI tests. The former pins the `transformers` version to fix a regression in the Jina model tests; the latter switches the test strategy from setting a global environment variable to explicitly specifying the attention backend (`ROCM_AITER_FA`), fixing numerical-precision failures caused by state leakage when the AITER tests run in batch.
  - PR #32355 temporarily disables async scheduling (`async_scheduling=False`) for one test (`test_structured_output`) on ROCm, working around a bug exposed after PR #31998 enabled the feature by default; this is a stopgap to unblock CI.
  - PR #32295 moves the RIXL/UCX build from the Docker base image into the test-stage image, improving deployment flexibility.
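The state-leakage fix described for #32346 boils down to scoping configuration to a single test instead of mutating global state. A minimal sketch of that pattern, assuming `VLLM_ATTENTION_BACKEND` as the environment variable being scoped (the helper name is illustrative, not vLLM's actual test code):

```python
import os
from contextlib import contextmanager

@contextmanager
def attention_backend(name: str):
    """Temporarily select an attention backend via env var, restoring the
    previous value on exit so state cannot leak into later tests in a batch
    (the leakage problem described for the AITER runs)."""
    old = os.environ.get("VLLM_ATTENTION_BACKEND")
    os.environ["VLLM_ATTENTION_BACKEND"] = name
    try:
        yield
    finally:
        if old is None:
            os.environ.pop("VLLM_ATTENTION_BACKEND", None)
        else:
            os.environ["VLLM_ATTENTION_BACKEND"] = old
```

Each test then wraps its body in `with attention_backend("ROCM_AITER_FA"): ...`, so no setting survives past the test that chose it.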
- Performance-regression fixes (PR #32336, #32303)
  - PR #32336 fixes a severe performance regression. PR #31712 had restricted the `ROCM_ATTN` backend to block sizes 16 and 32, unintentionally causing a large slowdown for models that use non-standard block sizes (e.g., 544), such as Qwen3-Next-80B. This PR restores support for those block sizes, significantly recovering TTFT and throughput.
  - PR #32303 fixes embedding-lookup failures on ROCm caused by invalid speculative-token placeholders when async scheduling and speculative decoding are enabled together. The problem surfaced in AMD CI's structured-output tests.
- Feature work & bug fixes (PR #32363, #32356, #32039)
  - PR #32363 extends the EAGLE data-parallel tests to ROCm, automatically parameterizing over backends such as `ROCM_ATTN` and `ROCM_AITER_FA` based on the detected platform.
  - PR #32356 fixes the FP8 fused MLA RoPE cache test failure on AMD hardware by correcting the FP8 conversion routines in the CUDA kernel and switching to a platform-agnostic tolerance helper.
  - PR #32039 adds an `import_kernels()` method to the ROCm platform class so that the `_moe_C` extension gets loaded, enabling MoE kernel unit tests previously skipped on ROCm.
- No PRs or Issues involving the `Quark` quantization toolkit were observed this cycle.
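The `import_kernels()` fix follows a common lazy-extension-import pattern: importing a compiled extension module registers its custom ops as a side effect. A hedged sketch of the pattern (the class is a stand-in; only the module name `_moe_C` comes from the report, and the real vLLM platform class differs):

```python
import importlib

class RocmPlatformSketch:
    """Illustrative stand-in for the ROCm platform class from PR #32039;
    only the pattern is shown, not vLLM's actual implementation."""

    # Compiled extension modules whose import registers custom ops.
    _KERNEL_MODULES = ("vllm._moe_C", "vllm._C")

    def import_kernels(self) -> list:
        """Import compiled kernel extensions, skipping ones not built here."""
        loaded = []
        for mod in self._KERNEL_MODULES:
            try:
                importlib.import_module(mod)
                loaded.append(mod)
            except ImportError:
                pass  # extension unavailable in this build; tests can skip
        return loaded
```

Before this fix, tests that depended on `_moe_C` ops had to be skipped on ROCm because nothing triggered the import.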
💬 High-Engagement Discussions
High-engagement discussions this cycle centered on a few long-standing Issues that were closed and one important new bugfix PR.
- Closed Issue #19002: "Engine core initialization failed"
  - Core topic: users hit engine-core initialization failures at deployment time, with errors such as `Disk quota exceeded` or abnormal process exits.
  - Perspectives:
    - Reporter: provided detailed error logs and environment information.
    - Community members: suggested several possible causes, including non-standard model-weight file names, uncleaned persistent storage, an overly high `gpu_memory_utilization` setting, and a `tensor_parallel_size` that does not match the GPU count.
    - Maintainers: pointed out concrete issues such as model file format.
  - Takeaway: this failure has no single cause; it results from the combined effects of deployment configuration (storage, memory, model files) and resource allocation. The thread shows the community collaboratively triaging a complex deployment problem.
  - Status: auto-labeled `stale` after 90+ days of inactivity and closed.
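The resource-related knobs called out in that thread map to engine arguments. A hedged configuration sketch (model name and values are illustrative, and this requires a GPU and model weights to actually run):

```python
from vllm import LLM

# Values are illustrative; tune for your hardware.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.85,  # lower this if engine-core init OOMs
    tensor_parallel_size=2,       # must not exceed the visible GPU count
)
```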
- Closed Issue #21800: "How to implement PD disaggregation test"
  - Core topic: a university lab asked how to run Prefill/Decode-disaggregated inference tests with vLLM.
  - Perspectives:
    - Asker: expressed appreciation and requested a tutorial or manual.
    - Maintainer (@wuhang2014): responded warmly with links to the official docs on disaggregated prefill and serving examples, and outlined the workflow.
    - Follow-up: other users hit problems such as `PyNcclConnector` not being supported in V1 and processes hanging due to misconfigured parameters. The maintainer clarified that V1 should use `NixlConnector` or `LmCacheConnector`.
  - Takeaway: a typical onboarding/usage question. The thread exposed gaps between documentation and practice, as well as compatibility issues from API-version evolution.
  - Status: auto-labeled `stale` after 90+ days of inactivity and closed.
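A hedged sketch of a V1-compatible P/D-disaggregated launch using the connector the maintainer recommended. The flags follow vLLM's disaggregated-prefill documentation, but the exact roles and model name here are illustrative, so verify against your vLLM version:

```shell
# Prefill instance (producer)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode instance (consumer)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8200 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```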
- Closed Issue #24264: "CPU Memory leak in P/D disaggregation (with NIXL?)"
  - Core topic: with the NIXL connector in a P/D-disaggregated deployment, the Decode instance leaks CPU memory, especially with a static FP8 model and FP8 KV cache.
  - Perspectives:
    - Reporter (@hasB4K): provided extremely detailed reproduction conditions (static FP8 model + FP8 KV cache, in Docker) and helped reproduce across environments.
    - Maintainer (@robertgshaw2-redhat): actively followed up, reproducing and comparing in Docker versus bare metal, and suspected a NIXL/UCX transport fallback inside Docker (away from CUDA_IPC).
  - Takeaway: a deep debugging case requiring hardware and low-level networking knowledge. The thread shows a contributor and maintainer working closely, narrowing down a complex problem by controlling variables (model precision, KV-cache type, execution mode, transport).
  - Status: closed after PR #32278 merged and the reporter verified the fix.
- New PR #32349: "[BugFix] Fix TRT-LLM NVFP4 DP/EP"
  - Core topic: fixes a failure introduced by PR #31692 that broke FlashInfer TRTLLM NVFP4-format models under data parallelism / expert parallelism (a missing `moe_quant_config` assertion).
  - Perspectives:
    - Author (@jiahanc): explained the root cause and provided a fix plus a test plan.
    - Reviewer (@robertgshaw2-redhat): approved the fix, asked for more context (e.g., the deployment command), and noted that the current `maybe_gather_dp` is a stopgap, suggesting an assertion and a TODO for a future refactor.
  - Takeaway: a quick-turnaround fix for a specific advanced feature combination (TRT-LLM + NVFP4 + DP/EP). The review highlights the value of flagging and tracking temporary workarounds in code.
  - Status: merged.
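The "assertion plus TODO" suggestion from the review is a general fail-fast idiom: validate a required config at the entry point with a clear message instead of crashing deep in the DP/EP path. An illustrative sketch (the function and attribute names mirror the report but are not vLLM's actual signatures):

```python
def maybe_gather_dp(layer) -> None:
    """Stopgap DP/EP gather entry point; fail fast if quantization config
    is missing rather than failing obscurely later. Hypothetical sketch."""
    assert getattr(layer, "moe_quant_config", None) is not None, (
        "moe_quant_config must be set before DP/EP gather; "
        "TODO: remove this stopgap in the planned refactor"
    )
    # ... gather logic would follow here ...
```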
🔥 Hot Topics & Trends
- Performance optimization & debugging: several Issues (#32367, #32320, #32311) and PR #32336 focus on performance, covering quantization-scheme constraints, concurrent-stream serving efficiency, and CUDA Graph behavior consistency. The community keeps pushing for stable, high performance in production.
- Architecture evolution & abstraction: RFC #32358 proposes a vLLM IR to address the pain points of fusing `torch.compile` with custom operators, a significant step toward a cleaner, more extensible compiler stack. In parallel, PR #32331 starts implementing `CustomLayer` to complement `CustomOp`, another thread of architectural simplification.
- Multimodal model support: support requests and implementations for `LlavaQwenForCausalLM` (#32338), `Step3VL-10B` (#32329), and Molmo2 (merged) show vLLM rapidly keeping pace in the multimodal space.
- Platform compatibility & configuration tuning: beyond the active AMD ecosystem, Issue #32324 asks for platform-specific defaults for `max_graph_size` (e.g., on Ascend NPU), reflecting the general configuration challenges vLLM faces across a diverse hardware ecosystem.
- Quantization deep-dives: discussions and implementations around MXFP8 online quantization (PR #32357), NVFP4 MoE support (PR #32349), and Intel quantization-toolkit integration (merged PR #31716) show that quantization remains a core battleground for inference efficiency, with the stack continuing to specialize and deepen.
🛠️ Notable Technical Changes
- RFC #32358: vLLM IR: proposes a functional intermediate representation built on Torch FX that separates operator semantics from implementation and scheduling. It aims to resolve compatibility issues between `torch.compile` and the existing `CustomOp` system and to provide better compilation, registration, and auto-tuning. A far-reaching design proposal.
- PR #32339: default MLA backend change: on Blackwell, switches the default MLA decode backend from `CUTLASS_MLA` to `FLASHINFER_MLA` and sets the default MLA prefill backend to TRTLLM. Driven by benchmark results, this directly affects out-of-the-box performance for Blackwell users.
- PR #32359: async scheduling + pipeline parallelism: lifts the restriction that async scheduling and pipeline parallelism cannot be used together (scheduling different tokens of the same request together is still disallowed), widening the applicability of high-performance scheduling strategies.
- PR #32327 & #32310: multimodal refactor: extracting `context.py` and moving dummy-data generation into the registry resolves circular dependencies among multimodal processors, an important step in code-structure cleanup.
- Issue #32324: platform-specific `max_graph_size`: flags that the "magic number" (a x2 multiplier) in the current `max_cudagraph_capture_size` default computation can waste VRAM on non-CUDA platforms (e.g., Ascend NPU), prompting discussion of how to make defaults better fit heterogeneous hardware.
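To make the "x2 magic number" complaint concrete, here is a hedged reconstruction of the kind of heuristic the issue criticizes; the actual vLLM formula, names, and cap may differ:

```python
def default_max_cudagraph_capture_size(max_num_seqs: int, hard_cap: int = 512) -> int:
    """Illustrative default: capture graphs for up to twice the configured
    batch size, bounded by a hard cap. On platforms where captured graphs
    cost more memory (e.g., Ascend NPU), the 2x multiplier can over-allocate
    VRAM, which motivates per-platform defaults."""
    return min(max_num_seqs * 2, hard_cap)
```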
📈 Development Activity Observations
- High contribution volume: 51 new PRs in a single day shows very high community activity, with contributors coming not only from the core team but also from AMD, Intel, major cloud vendors, and academic researchers.
- Strong AMD showing: contributors such as AndreasKaratzas, micah-wil, and rasmith submitted multiple high-quality PRs focused on ROCm CI stability, performance regressions, and functional defects, indicating that AMD is investing real resources into optimizing the vLLM experience on its platform.
- Efficient merge pipeline: 22 PRs merged, including many bug fixes and optimizations, pointing to an efficient review-and-merge process that responds quickly to problems.
- Issue hygiene: 12 old Issues closed, most auto-labeled `stale` after long inactivity, keeping the tracker tidy.
💡 Issues Worth Watching
- vLLM IR design and rollout (RFC #32358): if adopted, this would be a major refactor of vLLM's kernel dispatch and compilation system; its design details, migration path, and impact on the existing ecosystem deserve continued community attention and discussion.
- Default configuration for heterogeneous hardware (Issue #32324): as vLLM supports more platforms (CUDA, ROCm, Ascend, XPU, etc.), designing defaults that deliver good general performance without over-allocating resources is a problem that needs systematic thought.
- compressed-tensors quantization limits (Issue #32367): a user notes that `compressed-tensors` currently supports only symmetric quantization and asks about the asymmetric constraint. This touches the capability boundary of the underlying quantization library, and the answer may influence users' choice of quantization format.
- Streaming aggregation instability (Issue #32301): multiple users report unstable chunk sizes in streaming mode, possibly caused by aggregation logic when frontend and backend processing speeds mismatch. This degrades streaming smoothness and needs further diagnosis.
- Ongoing AMD CI stability: despite several fixes this cycle, AMD CI still surfaced problems with async scheduling and AITER state leakage, showing that cross-platform testing of complex feature combinations needs strengthening.
📋 Appendix: Detailed Data
New Issues
- #32368 [Bug]: _CPU_MOE_ACT in cpu_fused_moe_torch cause AssertionError: Current vLLM config is not set — bug — by kzwrime (created: 2026-01-15 10:50 (UTC+8))
- #32367 [Performance]: why compressed-tensors only support sym quantization? — performance — by mxjmtxrm (created: 2026-01-15 10:40 (UTC+8))
- #32364 [Bug]: Hybrid models generation slows down noticeably when DP is enabled — bug — by yury-tokpanov (created: 2026-01-15 09:51 (UTC+8))
- #32358 [RFC]: vLLM IR: A Functional Intermediate Representation for vLLM — RFC,torch.compile — by ProExpertProg (created: 2026-01-15 07:06 (UTC+8))
- #32353 [Bug]: Nemotron-3-Nano is broken when using TRTLLM attention on Blackwell — bug — by mgoin (created: 2026-01-15 05:46 (UTC+8))
- #32352 [Bug]: Seed OSS 36B Tool Parser Fails to Parse Array Parameters in Streaming Mode — bug — by ApexArray (created: 2026-01-15 05:37 (UTC+8))
- #32335 [Feature]: Extract KV-Cache update from all attention backends — feature request — by ElizaWszola (created: 2026-01-14 22:24 (UTC+8))
- #32347 [Bug]: MoE config not found Tesla_T4.json — bug — by 0x666-sudo (created: 2026-01-15 03:12 (UTC+8))
- #32338 [Feature]: support for LlavaQwenForCausalLM — feature request — by SH9959 (created: 2026-01-14 23:53 (UTC+8))
- #32324 [Feature]: Support setting the default max_graph_size according to different platforms — feature request — by linfeng-yuan (created: 2026-01-14 18:34 (UTC+8))
- #32309 [Usage]: Unable to pass precomputed image embeddings to vLLM with Qwen3-VL — usage — by hoangtd-asilla (created: 2026-01-14 15:51 (UTC+8))
- #32330 [Usage]: Running Qwen3-VL-235B-A22B-Instruct-AWQ on two A100-80G GPUs results in an error — usage — by StormMapleleaf (created: 2026-01-14 20:38 (UTC+8))
- #32320 [Performance]: Issue in serving concurrent streams — performance — by Adithya-Sakaray (created: 2026-01-14 17:15 (UTC+8))
- #32311 [Bug/Question]: Unexpected CUDA Graph Replay observed only in the first request's prefill under PIECEWISE mode — usage — by zhenwei-intel (created: 2026-01-14 16:13 (UTC+8))
- #32301 [Usage]: Inconsistent chunk size in streaming mode, possibly related to RequestOutput.add aggregation logic — usage — by 22373448 (created: 2026-01-14 12:03 (UTC+8))
- #32308 [Usage]: v1 SharedStorageConnector and PyNcclConnector executes the model again for the input prompt — usage — by sangeeta0201 (created: 2026-01-14 15:45 (UTC+8))
- #32302 [Feature]: ADD CUSTOM LORA ADAPTER CLASS — feature request — by Bharathi1604 (created: 2026-01-14 12:53 (UTC+8))
Closed Issues
- #27015 [Bug]: Input sequence length exceeds model's maximum sequence length (160004 > 131072) — bug,stale — by lemisky (closed: 2026-01-15 10:41 (UTC+8))
- #19002 [Bug]: Engine core initialization failed. See root cause above. Failed core proc(s): {} — bug,stale — by alperen21 (closed: 2026-01-15 10:28 (UTC+8))
- #19666 [Bug]: mismatch between multimodal tokens and placeholders for Qwen_2.5-3B (4 GPUs*24G) — bug,stale — by YokiDia (closed: 2026-01-15 10:27 (UTC+8))
- #20345 [Usage]: I cannot compile vllm on RTX5090 — usage,stale — by 13516513760 (closed: 2026-01-15 10:27 (UTC+8))
- #21800 [Usage]: How to implement the inference test of LLM model PD (Prefill-Decode) disaggregation using the vllm framework ? — usage,stale — by Alan-D-Chen (closed: 2026-01-15 10:27 (UTC+8))
- #22287 [Bug]: gpt-oss-20b sometimes emits reserved tokens — bug,stale,gpt-oss — by cadedaniel (closed: 2026-01-15 10:27 (UTC+8))
- #24264 [Bug]: CPU Memory leak in P/D disaggregation (with NIXL?) — bug — by hasB4K (closed: 2026-01-15 03:48 (UTC+8))
- #26105 [Bug]: LLama4 produces all zero tokens after first token — bug,stale — by kylesayrs (closed: 2026-01-15 01:09 (UTC+8))
- #31957 [Bug]: Nemotron Nano V3 FP8 - Expected 'silu' activation but got relu2_no_mul — bug — by danisereb (closed: 2026-01-14 23:31 (UTC+8))
- #31331 [New Model]: Support molmo2 — new-model — by tunglinwood (closed: 2026-01-14 15:33 (UTC+8))
- #32288 [Build]: Relax anthropic version pin from ==0.71.0 to >=0.71.0 — no labels — by dsfaccini (closed: 2026-01-14 15:21 (UTC+8))
- #30663 [RFC]: Consolidate Intel Quantization Toolkit Integration in vLLM — RFC — by yiliu30 (closed: 2026-01-14 15:11 (UTC+8))
New PRs
- #32363 [ROCm][CI] Add ROCm attention backend support for EAGLE DP tests — rocm,v1 — by AndreasKaratzas (created: 2026-01-15 09:45 (UTC+8))
- #32366 [Misc] Remove redundant line — ready — by Potabk (created: 2026-01-15 10:36 (UTC+8))
- #32360 Add thread_n=64 support to Marlin MoE — ready — by mgoin (created: 2026-01-15 08:02 (UTC+8))
- #32350 [ROCm][CI] Pin transformers 4.57.3 to fix jina test failures — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-01-15 04:42 (UTC+8))
- #32346 [ROCm][CI] Fix AITER test flakiness by using explicit attention backend — rocm,ci/build — by AndreasKaratzas (created: 2026-01-15 03:00 (UTC+8))
- #32357 MXFP8 Online Quantization for Linear — no labels — by jerryzh168 (created: 2026-01-15 06:45 (UTC+8))
- #32345 Support configure skip_special_tokens in openai response api — frontend,ready — by 842974287 (created: 2026-01-15 02:58 (UTC+8))
- #32355 [ROCm][CI] Disable async scheduling on ROCm for test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9] — rocm,structured-output,v1,llama — by micah-wil (created: 2026-01-15 06:35 (UTC+8))
- #32342 Fix optional parameter parsing in MiniMax M2 tool parser #32278 — ready — by baonudesifeizhai (created: 2026-01-15 02:08 (UTC+8))
- #32365 Add NUMA Core binding in nixl_connector for CPU xPyD — kv-connector — by ZhengHongming888 (created: 2026-01-15 10:00 (UTC+8))
- #32362 [BugFix] Fix `assert x_s.shape[-1] == x_q.shape[-1] // group_shape[1]` in Blackwell Quantized MoE Test — bug,ready — by LucasWilkinson (created: 2026-01-15 09:01 (UTC+8))
- #32331 [WIP] Implement CustomLayer to complement for CustomOp — no labels — by whx-sjtu (created: 2026-01-14 20:40 (UTC+8))
- #32361 [BugFix] Fix DeepSeek-V3.1 + DeepGEMM incompatible scale shapes — bug,ready,deepseek — by LucasWilkinson (created: 2026-01-15 08:33 (UTC+8))
- #32356 [CI][AMD][BugFix][FP8] Fix test_concat_and_cache_mla_rope_fused by fixing fp8 conversion routines and using proper atol/rtol — bug,rocm,ready — by rasmith (created: 2026-01-15 06:41 (UTC+8))
- #32351 [WIP] Pause and Resume with keep requests for single engine — documentation,frontend,v1 — by ahao-anyscale (created: 2026-01-15 05:24 (UTC+8))
- #32349 [BugFix] Fix TRT-LLM NVFP4 DP/EP — bug,ready,nvidia — by jiahanc (created: 2026-01-15 03:42 (UTC+8))
- #32344 [MoE Refactor] First cut at MoERunner for FusedMoE layer — documentation,v1,nvidia — by bnellnm (created: 2026-01-15 02:44 (UTC+8))
- #32359 [Feature] Support async scheduling + PP — ready,v1 — by yewentao256 (created: 2026-01-15 07:19 (UTC+8))
- #32339 [Attention][MLA] Make `FLASHINFER_MLA` the default MLA backend on Blackwell, and TRTLLM the default prefill — ready,nvidia — by MatthewBonanni (created: 2026-01-15 00:18 (UTC+8))
- #32354 perf(lora): optimize index lookup from O(n) to O(1) in convert_mapping — no labels — by yifant-code (created: 2026-01-15 05:48 (UTC+8))
- #32312 [Bugfix] Fix stale `common_attn_metadata.max_seq_len` in speculative decoding with Eagle — bug,speculative-decoding,v1 — by ofirzaf (created: 2026-01-14 16:17 (UTC+8))
- #32343 [compile] raise on compile_size implicit padding — ready,nvidia — by dolpm (created: 2026-01-15 02:42 (UTC+8))
- #32307 fix pad_align for gfx942 — no labels — by Rohan138 (created: 2026-01-14 15:26 (UTC+8))
- #32305 Setting is_3d_moe=True for qwen3-moe — qwen — by dcmaddix (created: 2026-01-14 14:40 (UTC+8))
- #32336 [Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten — bug,rocm,ready,v1,qwen — by vllmellm (created: 2026-01-14 23:02 (UTC+8))
- #32348 [Model Runner V2] Support FlashInfer backend & Fix CUDA Graph bug [1/2] — bug,v1,nvidia — by WoosukKwon (created: 2026-01-15 03:28 (UTC+8))
- #32341 skip previous queried ports — meta-exported,fb-exported — by mxz297 (created: 2026-01-15 02:07 (UTC+8))
- #32340 [Nixl][Bugfix] Track `nixl_num_kv_expired_reqs` metric in Prometheus — bug,kv-connector — by NickLucche (created: 2026-01-15 01:02 (UTC+8))
- #32332 Improve SLO management with request tiers — frontend,v1 — by hj-mistral (created: 2026-01-14 21:16 (UTC+8))
- #32314 [Bugfix] Strengthen the check of X-data-parallel-rank in Hybrid LB mode — bug,frontend,ready,v1 — by dtcccc (created: 2026-01-14 16:30 (UTC+8))
- #32337 [Benchmark] [Feature] add vllm bench sweep startup command — documentation,performance — by lengrongfu (created: 2026-01-14 23:48 (UTC+8))
- #32327 [1/N] Reorganize multimodal processing code — documentation,performance,ready,v1,multi-modality,llama,qwen,deepseek — by DarkLight1337 (created: 2026-01-14 20:22 (UTC+8))
- #32334 [1/x][Frontend] A graceful shutdown implementation as per RFC #24885 — frontend,v1 — by wseaton (created: 2026-01-14 22:02 (UTC+8))
- #32328 rename tokenize serving api request id prefix to tokenize — frontend,ready — by andyxning (created: 2026-01-14 20:28 (UTC+8))
- #32329 [Model] Add Step3vl 10b — documentation,new-model — by ltd0924 (created: 2026-01-14 20:36 (UTC+8))
- #32333 add support for Step-Audio-r1.1/step-audio-2-mini — new-model,frontend,ci/build,v1 — by moevis (created: 2026-01-14 21:29 (UTC+8))
- #32325 [Model] Add Moondream3 model support[WIP] — new-model — by sniper35 (created: 2026-01-14 19:19 (UTC+8))
- #32317 Reinterpreting Fused MoE–LoRA as a Standard Fused MoE Kernel — no labels — by RunkaiTao (created: 2026-01-14 16:49 (UTC+8))
- #32326 [misc] remove the magic number in _set_cudagraph_sizes function — nvidia — by linfeng-yuan (created: 2026-01-14 19:47 (UTC+8))
- #32323 [Frontend] track responsesAPI server_load — frontend,ready — by chaunceyjiang (created: 2026-01-14 18:12 (UTC+8))
- #32322 [Misc] Make mem utils can be reused by other platforms — rocm,ready,v1 — by shen-shanshan (created: 2026-01-14 17:32 (UTC+8))
- #32304 integrate openai realtime api — documentation,frontend,needs-rebase — by unlikezy (created: 2026-01-14 14:24 (UTC+8))
- #32319 [Frontend] Standardize use of `create_error_response` — frontend,ready — by DarkLight1337 (created: 2026-01-14 17:08 (UTC+8))
- #32321 fix: avoid crash on zero-arg tool calls in glm4 parser — no labels — by seekskyworld (created: 2026-01-14 17:22 (UTC+8))
- #32313 [Refactor] [9/N] to simplify the vLLM openai translations serving architecture — frontend,ready — by chaunceyjiang (created: 2026-01-14 16:20 (UTC+8))
- #32310 [Refactor] Move top-level dummy data generation to registry — ready,v1,multi-modality,llama — by DarkLight1337 (created: 2026-01-14 16:03 (UTC+8))
- #32318 [Quantization] Add W4A16 NVFP4 MoE support for CompressedTensors — ci/build — by zhangyimi (created: 2026-01-14 16:51 (UTC+8))
- #32316 [Build] Bump python openai version — frontend,ci/build — by chaunceyjiang (created: 2026-01-14 16:45 (UTC+8))
- #32315 [Refactor] Simplify FusedMoEParallelConfig.make() logic and remove redundant assert — no labels — by maang-h (created: 2026-01-14 16:44 (UTC+8))
- #32306 [Bugfix] Refactor to support DP parallel in R3 — bug,v1 — by xhx1022 (created: 2026-01-14 15:08 (UTC+8))
- #32303 [Bugfix][ROCm] Fix invalid spec token placeholders causing embedding lookup failure with async scheduling — bug,rocm,v1 — by micah-wil (created: 2026-01-14 13:42 (UTC+8))
Merged PRs
- #32343 [compile] raise on compile_size implicit padding — ready,nvidia — by dolpm (merged: 2026-01-15 04:46 (UTC+8))
- #32283 [BugFix] Assign page_size_padded when unifying kv cache spec. — bug,ready,v1 — by Lumosis (merged: 2026-01-15 04:10 (UTC+8))
- #32336 [Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten — bug,rocm,ready,v1,qwen — by vllmellm (merged: 2026-01-15 03:32 (UTC+8))
- #26291 [MODEL] Fix handling of multiple channels for gpt-oss with speculative decoding — frontend,ready,gpt-oss — by astralord (merged: 2026-01-15 02:20 (UTC+8))
- #32295 [CI] Move rixl/ucx from Dockerfile.rocm_base to Dockerfile.rocm — rocm,ready,ci/build — by qli88 (merged: 2026-01-15 00:53 (UTC+8))
- #32327 [1/N] Reorganize multimodal processing code — documentation,performance,ready,v1,multi-modality,llama,qwen,deepseek — by DarkLight1337 (merged: 2026-01-14 23:25 (UTC+8))
- #32328 rename tokenize serving api request id prefix to tokenize — frontend,ready — by andyxning (merged: 2026-01-14 22:52 (UTC+8))
- #32281 [ROCm][CI] Handle missing vision_config in Isaac model attention patch — rocm,ready,multi-modality — by AndreasKaratzas (merged: 2026-01-14 15:21 (UTC+8))
- #32167 [Model] Re-implement Qwen3Omni Audio Encoder — ready,qwen — by ywang96 (merged: 2026-01-14 15:40 (UTC+8))
- #32323 [Frontend] track responsesAPI server_load — frontend,ready — by chaunceyjiang (merged: 2026-01-14 20:00 (UTC+8))
- #32322 [Misc] Make mem utils can be reused by other platforms — rocm,ready,v1 — by shen-shanshan (merged: 2026-01-14 19:46 (UTC+8))
- #32319 [Frontend] Standardize use of `create_error_response` — frontend,ready — by DarkLight1337 (merged: 2026-01-14 19:22 (UTC+8))
- #32313 [Refactor] [9/N] to simplify the vLLM openai translations serving architecture — frontend,ready — by chaunceyjiang (merged: 2026-01-14 18:20 (UTC+8))
- #32310 [Refactor] Move top-level dummy data generation to registry — ready,v1,multi-modality,llama — by DarkLight1337 (merged: 2026-01-14 18:17 (UTC+8))
- #30997 Add Molmo2 multimodal model support — documentation,new-model,ready,multi-modality — by sangho-vision (merged: 2026-01-14 15:33 (UTC+8))
- #32260 [Refactor] [8/N] to simplify the vLLM openai responses `api_serving` architecture — frontend,ready,qwen,gpt-oss — by chaunceyjiang (merged: 2026-01-14 15:26 (UTC+8))
- #32035 [Docs] Add docs about OOT Quantization Plugins — documentation,ready — by mgoin (merged: 2026-01-14 15:25 (UTC+8))
- #32039 AMD CI Test - unskip moe_sum test and moe_align_block_size tests — rocm,ready — by hongxiayang (merged: 2026-01-14 15:25 (UTC+8))
- #32296 [misc] Remove is_torch_equal_or_newer(2.4) cases — ready,v1 — by angelayi (merged: 2026-01-14 15:22 (UTC+8))
- #32289 [Build] Relax anthropic version pin from ==0.71.0 to >=0.71.0 — ready,ci/build — by dsfaccini (merged: 2026-01-14 15:21 (UTC+8))
- #31716 Consolidate Intel Quantization Toolkit Integration in vLLM — documentation,ready — by yiliu30 (merged: 2026-01-14 15:11 (UTC+8))
- #32275 [ROCm][CI] Disable Async Scheduling For Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test — rocm,ready,ci/build,qwen — by micah-wil (merged: 2026-01-14 13:29 (UTC+8))
PRs Closed Without Merge
- #19634 [Config] Make prefix cache metrics interval configurable — needs-rebase,stale,v1 — by jeremyeder (closed: 2026-01-15 10:27 (UTC+8))
- #28698 fix fused_recurrent_gated_delta_rule_fwd returns incorrect final_state shape when inplace_final_state=False #28632 — no labels — by baonudesifeizhai (closed: 2026-01-15 04:29 (UTC+8))
- #32305 Setting is_3d_moe=True for qwen3-moe — qwen — by dcmaddix (closed: 2026-01-15 03:42 (UTC+8))
- #32298 allow configure skip_special_tokens in openai response api — frontend,ready — by 842974287 (closed: 2026-01-15 02:48 (UTC+8))
- #26802 [wip] tools for slo management — frontend,v1 — by hj-mistral (closed: 2026-01-14 21:17 (UTC+8))
- #32253 [Frontend] Normalize Responses API input for multi-turn conversations — frontend,needs-rebase,gpt-oss — by daniel-salib (closed: 2026-01-14 19:24 (UTC+8))
- #32207 [MoE Recator] Remove CustomOp from UnquantizedFusedMoEMethod — no labels — by bnellnm (closed: 2026-01-14 11:15 (UTC+8))