vLLM Development Activity Report - 2026-01-14

Time window: 2026-01-14 10:54 (UTC+8) ~ 2026-01-15 10:54 (UTC+8)
Statistics: New issues 17 | Closed issues 12 | New PRs 51 | Merged PRs 22 | Closed unmerged PRs 7

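📊 Daily Development Summary

In the 24-hour window from 2026-01-14 to 2026-01-15, the vLLM project remained highly active, with 17 new issues, 51 new PRs, and 22 merged PRs. Work centered on AMD/ROCm platform support and optimization (several related PRs were merged), discussion of core architecture changes (such as the vLLM IR RFC), and performance tuning and bug fixes. Multimodal model support and quantization (especially for NVIDIA Blackwell and AMD platforms) remained active topics.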

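🎯 AMD/ROCm Ecosystem

AMD-related activity was heavy this cycle. Contributors, including AMD employees, submitted and merged several significant PRs focused on CI stability, performance regression fixes, and feature support.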

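CI testing and stability (PR #32350, #32346, #32355, #32295)
- PR #32350 and #32346 both address flakiness in the ROCm CI tests. The former pins transformers to 4.57.3 to fix a regression in the Jina model tests. The latter changes the test strategy from setting a global environment variable to explicitly specifying the attention backend (ROCM_AITER_FA), which resolves numerical accuracy failures caused by state leaking between AITER tests when they run as a batch.
- PR #32355 temporarily disables async scheduling (async_scheduling=False) on ROCm for one test (test_structured_output), working around a bug exposed after PR #31998 enabled the feature by default. It is a stopgap to unblock CI.
- PR #32295 moves the RIXL/UCX build from the Docker base image to the test-stage image, which makes deployment more flexible.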
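
The state-leak pattern behind the AITER flakiness can be illustrated with a small sketch. This is not the vLLM test code: the environment variable name and the helper functions below are hypothetical, and only the backend name ROCM_AITER_FA comes from the report. It shows why a test that mutates process-global state can change the result of a later test, while a test that passes the backend explicitly cannot.

```python
import os

# Hypothetical variable name used only for this illustration.
ENV_VAR = "HYPOTHETICAL_ATTENTION_BACKEND"


def pick_backend(explicit=None):
    """Return the explicit backend if given, else fall back to global state."""
    return explicit or os.environ.get(ENV_VAR, "DEFAULT")


def leaky_test():
    # Sets process-global state and never restores it.
    os.environ[ENV_VAR] = "ROCM_AITER_FA"
    return pick_backend()


def explicit_test():
    # Passes the backend directly and touches no global state.
    return pick_backend(explicit="ROCM_AITER_FA")


def unrelated_test():
    # Expects whatever the default backend is.
    return pick_backend()
```

Running `explicit_test()` before `unrelated_test()` leaves the latter on the default backend, whereas running `leaky_test()` first silently switches it, which is the kind of order-dependent behavior that only appears in batch runs.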
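
Performance regression fixes (PR #32336, #32303)
- PR #32336 fixes a serious performance regression. PR #31712 had restricted the block sizes supported by the ROCM_ATTN backend to 16 and 32, which unintentionally caused a large slowdown for models that use a non-standard block size (544), such as Qwen3-Next-80B. This PR restores support for such block sizes, and TTFT and throughput recover substantially.
- PR #32303 fixes an embedding lookup failure on ROCm that occurs when async scheduling and speculative decoding are both enabled, caused by invalid speculative-token placeholders. The problem surfaced in the structured-output tests in AMD CI.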
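
To make the block-size regression concrete, here is a minimal sketch of how a strict whitelist excludes a block size of 544 while a relaxed rule accepts it. The function names and the "multiple of a base size" rule are assumptions for illustration only; the actual check in PR #32336 may be expressed differently.

```python
# The restrictive whitelist described above (16 and 32 only).
ALLOWED_EXACT = {16, 32}


def supported_strict(block_size: int) -> bool:
    """Accept only the whitelisted block sizes."""
    return block_size in ALLOWED_EXACT


def supported_relaxed(block_size: int, base: int = 16) -> bool:
    """Assumed relaxed rule: accept any positive multiple of `base`."""
    return block_size > 0 and block_size % base == 0


if __name__ == "__main__":
    for size in (16, 32, 544):
        print(size, supported_strict(size), supported_relaxed(size))
```

Since 544 = 16 x 34, it passes the relaxed rule but not the whitelist, so under the strict rule a model using that block size is no longer served by this backend's supported path.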
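
Features and bug fixes (PR #32363, #32356, #32039)
- PR #32363 extends the EAGLE data-parallel tests to ROCm, parametrizing them automatically with backends such as ROCM_ATTN and ROCM_AITER_FA based on the detected platform.
- PR #32356 fixes the failing FP8 fused MLA RoPE cache test on AMD hardware by correcting the FP8 conversion routines in the CUDA kernel and switching to a platform-independent helper for obtaining tolerances.
- PR #32039 adds an import_kernels() method to the ROCm platform class so that the _moe_C extension is loaded, which enables MoE kernel unit tests that were previously skipped on ROCm.

No PRs or issues involving the Quark quantization tool were found this cycle.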

💬 High-Traffic Discussions

The most active discussions this cycle involved several long-running issues that were closed and one important new bugfix PR.

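Closed Issue #19002: “Engine core initialization failed”
- Core topic: a user hit an engine core initialization failure during deployment, with error messages pointing to Disk quota exceeded or an abnormal process exit.
- Views:
  - Reporter: provided detailed error logs and environment information.
  - Community members: shared several possible causes, including non-standard model weight file names, persistent storage that had not been cleaned up, a GPU memory utilization setting (gpu_memory_utilization) that was too high, and a tensor_parallel_size that did not match the number of GPUs.
  - Maintainers: pointed out specific problems such as the model file format.
- Summary: the failure has no single cause. It results from the combination of deployment environment configuration (storage, memory, model files) and resource allocation. The thread shows the community working together to troubleshoot a complex deployment problem.
- Current status: closed after being automatically marked stale following more than 90 days without activity.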
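
The causes listed in that thread suggest a simple pre-launch check. The sketch below is a hypothetical helper, not part of vLLM: the function name, its thresholds, and the rule that tensor_parallel_size must not exceed the visible GPU count are assumptions for a single-node setup. Note that `shutil.disk_usage` reports free space on the filesystem, which is not the same as a per-user quota.

```python
import shutil


def preflight(tensor_parallel_size, gpu_count, gpu_memory_utilization,
              cache_dir=".", min_free_gb=1.0):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # Assumed single-node rule: one GPU per tensor-parallel rank.
    if tensor_parallel_size < 1 or tensor_parallel_size > gpu_count:
        problems.append(
            f"tensor_parallel_size={tensor_parallel_size} does not fit "
            f"{gpu_count} visible GPU(s)"
        )

    # The value is a fraction of GPU memory, so it must lie in (0, 1].
    if not 0.0 < gpu_memory_utilization <= 1.0:
        problems.append(
            f"gpu_memory_utilization={gpu_memory_utilization} is outside (0, 1]"
        )

    # Free space on the filesystem that holds the model/cache directory.
    free_gb = shutil.disk_usage(cache_dir).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GiB free under {cache_dir!r}")

    return problems
```

For example, `preflight(8, 4, 0.9)` reports the tensor-parallel mismatch before any engine process is started, which is cheaper to diagnose than a failed engine core initialization.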
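
Closed Issue #21800: “How to implement PD disaggregation test”
- Core topic: a university lab asked how to run an inference test of Prefill/Decode (PD) disaggregation for LLMs with vLLM.
- Views:
  - Asker: expressed appreciation for the project and asked for a tutorial or manual.
  - Maintainer (@wuhang2014): responded warmly, linked the official documentation on disaggregated prefill and the serving examples, and outlined the workflow.
  - Follow-up: other users who tried it ran into problems such as PyNcclConnector not being supported in V1 and misconfigured parameters causing processes to hang. The maintainer clarified that V1 should use NixlConnector or LmCacheConnector.
- Summary: a typical onboarding and feature-usage question. The thread exposed a gap between the documentation and actual practice, as well as compatibility problems introduced by API version changes.
- Current status: closed after being automatically marked stale following more than 90 days without activity.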
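
As a rough illustration of the connector choice discussed in that thread, the sketch below assembles a `vllm serve` command line that selects NixlConnector. It assumes the `--kv-transfer-config` option with `kv_connector` and `kv_role` keys as described in the vLLM disaggregated-prefill documentation; the model id and port are placeholders, and the exact keys should be checked against the documentation for the installed version.

```python
import json
import shlex

# Assumed configuration keys; verify against the docs for your vLLM version.
KV_TRANSFER_CONFIG = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}


def serve_command(model: str, port: int) -> str:
    """Build a shell-safe command string for one prefill or decode instance."""
    parts = [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", json.dumps(KV_TRANSFER_CONFIG),
    ]
    return " ".join(shlex.quote(p) for p in parts)


if __name__ == "__main__":
    # Placeholder model id and port, for illustration only.
    print(serve_command("my-org/my-model", 8100))
```

Building the JSON with `json.dumps` and quoting it with `shlex.quote` avoids the hand-written quoting mistakes that often lead to a configuration being rejected or silently misparsed.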
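
Closed Issue #24264: “CPU Memory leak in P/D disaggregation (with NIXL?)”
- Core topic: in P/D disaggregated deployments using the NIXL connector, the Decode instance leaks CPU memory, particularly with a static FP8 model and an FP8 KV cache.
- Views:
  - Reporter (@hasB4K): provided very detailed reproduction conditions (static FP8 model + FP8 KV cache, in a Docker environment) and helped reproduce the problem in other environments.
  - Maintainer (@robertgshaw2-redhat): followed up actively, reproduced and compared the behavior in Docker and on bare metal, and suspected that it is related to NIXL/UCX falling back to a different transport in Docker (not CUDA_IPC).
- Summary: an in-depth debugging case that requires hardware and low-level networking knowledge. The thread shows a contributor and a maintainer working closely together and isolating a complex problem by varying one factor at a time (model precision, KV cache type, execution mode, transport).
- Current status: closed by the reporter after verification, once PR #32278 was merged.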
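
New PR #32349: “[BugFix] Fix TRT-LLM NVFP4 DP/EP”
- Core topic: fixes a failure introduced by PR #31692 that breaks FlashInfer TRTLLM NVFP4 models under data parallelism/expert parallelism (an assertion on a missing moe_quant_config).
- Views:
  - Author (@jiahanc): explained the root cause and provided the fix and a test plan.
  - Reviewer (@robertgshaw2-redhat): approved of the fix and asked for more context (such as the deployment command). He also noted that the current maybe_gather_dp is a temporary solution and suggested adding an assertion and a TODO to support a future refactor.
- Summary: a quick fix for a specific advanced feature combination (TRT-LLM + NVFP4 + DP/EP). The review shows the value of labeling temporary solutions in code and planning their replacement.
- Current status: open and labeled ready at the end of the window; it does not appear in this window's merged PR list.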

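🔥 Hot Topics and Trends

- Performance optimization and debugging: several issues (#32367, #32320, #32311) and a PR (#32336) focus on performance, covering quantization scheme constraints, concurrent stream handling efficiency, and consistency of CUDA Graph behavior. The community continues to push for stable, high performance in production.
- Architecture evolution and abstraction: the vLLM IR proposed in RFC #32358 targets the pain points of combining torch.compile with custom operator fusion and points toward a cleaner, more extensible compiler stack. In parallel, PR #32331 starts implementing CustomLayer to complement CustomOp, another thread of architectural simplification.
- Multimodal model support: new requests and implementations for LlavaQwenForCausalLM (#32338), Step3VL-10B (#32329), and Molmo2 (merged) show vLLM keeping pace with new multimodal architectures.
- Platform compatibility and configuration tuning: beyond the active AMD ecosystem, Issue #32324 asks for a default max_graph_size that depends on the platform (for example Ascend NPU), reflecting the general configuration challenge vLLM faces across diverse hardware.
- Deeper quantization work: discussion and implementation around MXFP8 online quantization (PR #32357), NVFP4 MoE support (PR #32349), and Intel quantization toolkit consolidation (merged PR #31716) show that quantization remains central to inference efficiency and that the stack keeps specializing.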
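
The idea behind the vLLM IR trend above, separating what an operator means from which implementation runs, can be sketched in a few lines. This toy registry is purely illustrative: it does not use Torch FX, and every name in it (`register_impl`, `dispatch`, the "fast" provider) is hypothetical rather than taken from RFC #32358.

```python
import math
from typing import Callable, Dict

# op name -> provider name -> implementation
_IMPLS: Dict[str, Dict[str, Callable]] = {}


def register_impl(op_name: str, provider: str):
    """Decorator that registers one implementation of an operator."""
    def decorator(fn: Callable) -> Callable:
        _IMPLS.setdefault(op_name, {})[provider] = fn
        return fn
    return decorator


def dispatch(op_name: str, provider: str, *args):
    """Run the requested provider, falling back to the reference semantics."""
    impls = _IMPLS[op_name]
    return impls.get(provider, impls["reference"])(*args)


@register_impl("rms_norm", "reference")
def rms_norm_reference(x, eps=1e-6):
    # Defines the operator's semantics in the simplest possible form.
    mean_sq = sum(v * v for v in x) / len(x)
    scale = (mean_sq + eps) ** -0.5
    return [v * scale for v in x]


@register_impl("rms_norm", "fast")
def rms_norm_fast(x, eps=1e-6):
    # Stands in for a platform-specific kernel with the same semantics.
    inv = 1.0 / math.sqrt(math.fsum(v * v for v in x) / len(x) + eps)
    return [v * inv for v in x]
```

Callers name the operator (`"rms_norm"`) and the selection of an implementation happens at dispatch time, so a compiler pass or autotuner could swap providers without touching model code.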

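🛠️ Key Technical Changes

- RFC #32358: vLLM IR. Proposes a functional intermediate representation based on Torch FX that separates operator semantics from implementation and dispatch. It aims to resolve compatibility problems between torch.compile and the existing CustomOp system and to provide better compilation, registration, and autotuning. This is a far-reaching design proposal.
- PR #32339: default MLA backend change. On Blackwell, changes the default MLA decode backend from CUTLASS_MLA to FLASHINFER_MLA and sets the default MLA prefill backend to TRTLLM. The change is based on performance benchmark results and directly affects out-of-the-box performance for Blackwell users.
- PR #32359: async scheduling + pipeline parallelism. Lifts the restriction that async scheduling and pipeline parallelism cannot be used together (scheduling different tokens of the same request together is still disallowed for now). This widens the applicability of the high-performance scheduling path.
- PR #32327 & #32310: multimodal code refactor. Extracts context.py and moves dummy data generation to the registry, resolving circular dependencies between multimodal processors. An important step in cleaning up the code structure.
- Issue #32324: platform-dependent max_graph_size. Points out that the "magic number" (multiply by 2) in the default computation of max_cudagraph_capture_size can waste VRAM on non-CUDA platforms such as Ascend NPU, and opens a discussion on making default configuration fit heterogeneous hardware.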
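
A platform-aware default of the kind requested in Issue #32324 could look like the sketch below. It is not vLLM's implementation: the function name, the 512 cap, and the platform table are assumptions, and only the multiply-by-2 factor comes from the report.

```python
# The "magic number" discussed in the issue.
DEFAULT_MULTIPLIER = 2

# Hypothetical per-platform overrides; the platform key is illustrative.
PLATFORM_MULTIPLIER = {"hypothetical_npu": 1}


def default_max_capture_size(max_num_seqs: int, platform: str = "cuda",
                             cap: int = 512) -> int:
    """Derive a default graph capture size from the batch limit.

    `cap` is an assumed upper bound for this sketch.
    """
    multiplier = PLATFORM_MULTIPLIER.get(platform, DEFAULT_MULTIPLIER)
    return min(max_num_seqs * multiplier, cap)


if __name__ == "__main__":
    print(default_max_capture_size(128))                      # default platform
    print(default_max_capture_size(128, "hypothetical_npu"))  # overridden
```

Keeping the multiplier in a per-platform table makes the default an explicit, reviewable choice per backend instead of a constant buried in shared code.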

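📈 Development Activity Observations

- Active contribution: 51 new PRs in a single day indicate very high community activity. Contributors come not only from the core team but also from AMD, Intel, major cloud vendors, and university researchers.
- AMD contributors stand out: AndreasKaratzas, micah-wil, rasmith, and others submitted several high-quality PRs fixing CI stability, performance regressions, and functional defects on ROCm, which indicates that AMD is investing in the vLLM experience on its platform.
- Efficient merging: 22 PRs were merged, many of them bug fixes and optimizations, which suggests that review and merge turnaround is fast enough to respond to problems quickly.
- Issue management: 12 issues were closed, 7 of them carrying the stale label after long inactivity, which helps keep the issue list tidy.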

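💡 Items Worth Watching

- Design and rollout of vLLM IR (RFC #32358): if adopted, this proposal would be a major restructuring of vLLM's kernel dispatch and compilation system. Its design details, migration path, and impact on the existing ecosystem deserve continued attention and discussion.
- Default configuration for heterogeneous hardware (Issue #32324): as vLLM supports more platforms (CUDA, ROCm, Ascend, XPU, and others), designing defaults that deliver good general performance without over-allocating resources needs systematic thought.
- Limits of compressed-tensors quantization schemes (Issue #32367): a user notes that compressed-tensors currently supports only symmetric quantization and asks about the constraints on asymmetric quantization. The question touches the capability boundary of the underlying quantization library, and the answer may affect which quantization format users choose.
- Unstable chunk sizes in streaming responses (Issue #32301): the reported symptom is inconsistent output chunk sizes in streaming mode, possibly caused by aggregation logic when producer and consumer run at different speeds. This affects how smooth streaming feels and needs further diagnosis.
- Ongoing stability of AMD CI: despite several fixes this cycle, AMD CI still exposed problems with async scheduling and AITER state leakage, which indicates that cross-platform testing of complex feature combinations needs strengthening.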
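
The streaming symptom in Issue #32301 is consistent with a simple producer/consumer model, sketched below. This is a toy simulation, not vLLM code, and the function and parameter names are hypothetical: tokens are produced at a fixed rate, and the consumer drains everything pending each time it wakes up, so a slower consumer receives larger and less regular chunks.

```python
from collections import deque


def simulate_stream(num_tokens: int, produce_per_tick: int,
                    consume_every: int) -> list:
    """Return the chunk sizes seen by a consumer that drains all pending tokens."""
    pending = deque()
    chunks = []
    produced = 0
    tick = 0
    while produced < num_tokens or pending:
        tick += 1
        # Producer: emit up to `produce_per_tick` tokens this tick.
        for _ in range(min(produce_per_tick, num_tokens - produced)):
            pending.append(produced)
            produced += 1
        # Consumer: wakes every `consume_every` ticks and takes everything.
        if tick % consume_every == 0 and pending:
            chunks.append(len(pending))
            pending.clear()
    return chunks


if __name__ == "__main__":
    print(simulate_stream(10, 1, 1))  # consumer keeps up
    print(simulate_stream(10, 1, 3))  # consumer lags behind
```

When the consumer keeps up, every chunk holds one token; when it wakes only every third tick, the same ten tokens arrive as chunks of 3, 3, 3, and 1.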

📋 Appendix: Detailed Data Lists

New Issues

#32368 [Bug]: _CPU_MOE_ACT in cpu_fused_moe_torch cause AssertionError: Current vLLM config is not set | bug | by kzwrime (created: 2026-01-15 10:50 (UTC+8))
#32367 [Performance]: why compressed-tensors only support sym quantization? | performance | by mxjmtxrm (created: 2026-01-15 10:40 (UTC+8))
#32364 [Bug]: Hybrid models generation slows down noticeably when DP is enabled | bug | by yury-tokpanov (created: 2026-01-15 09:51 (UTC+8))
#32358 [RFC]: vLLM IR: A Functional Intermediate Representation for vLLM | RFC,torch.compile | by ProExpertProg (created: 2026-01-15 07:06 (UTC+8))
#32353 [Bug]: Nemotron-3-Nano is broken when using TRTLLM attention on Blackwell | bug | by mgoin (created: 2026-01-15 05:46 (UTC+8))
#32352 [Bug]: Seed OSS 36B Tool Parser Fails to Parse Array Parameters in Streaming Mode | bug | by ApexArray (created: 2026-01-15 05:37 (UTC+8))
#32335 [Feature]: Extract KV-Cache update from all attention backends | feature request | by ElizaWszola (created: 2026-01-14 22:24 (UTC+8))
#32347 [Bug]: MoE config not found Tesla_T4.json | bug | by 0x666-sudo (created: 2026-01-15 03:12 (UTC+8))
#32338 [Feature]: support for LlavaQwenForCausalLM | feature request | by SH9959 (created: 2026-01-14 23:53 (UTC+8))
#32324 [Feature]: Support setting the default max_graph_size according to different platforms | feature request | by linfeng-yuan (created: 2026-01-14 18:34 (UTC+8))
#32309 [Usage]: Unable to pass precomputed image embeddings to vLLM with Qwen3-VL | usage | by hoangtd-asilla (created: 2026-01-14 15:51 (UTC+8))
#32330 [Usage]: Running Qwen3-VL-235B-A22B-Instruct-AWQ on two A100-80G GPUs results in an error | usage | by StormMapleleaf (created: 2026-01-14 20:38 (UTC+8))
#32320 [Performance]: Issue in serving concurrent streams | performance | by Adithya-Sakaray (created: 2026-01-14 17:15 (UTC+8))
#32311 [Bug/Question]: Unexpected CUDA Graph Replay observed only in the first request’s prefill under PIECEWISE mode | usage | by zhenwei-intel (created: 2026-01-14 16:13 (UTC+8))
#32301 [Usage]: Inconsistent chunk size in streaming mode, possibly related to RequestOutput.add aggregation logic | usage | by 22373448 (created: 2026-01-14 12:03 (UTC+8))
#32308 [Usage]: v1 SharedStorageConnector and PyNcclConmector executes the model again for the input prompt | usage | by sangeeta0201 (created: 2026-01-14 15:45 (UTC+8))
#32302 [Feature]: ADD CUSTOM LORA ADAPTER CLASS | feature request | by Bharathi1604 (created: 2026-01-14 12:53 (UTC+8))

Closed Issues

#27015 [Bug]: Input sequence length exceeds model’s maximum sequence length (160004 > 131072) | bug,stale | by lemisky (closed: 2026-01-15 10:41 (UTC+8))
#19002 [Bug]: Engine core initialization failed. See root cause above. Failed core proc(s): {} | bug,stale | by alperen21 (closed: 2026-01-15 10:28 (UTC+8))
#19666 [Bug]: mismatch between multimodal tokens and placeholders for Qwen_2.5-3B (4 GPUs*24G) | bug,stale | by YokiDia (closed: 2026-01-15 10:27 (UTC+8))
#20345 [Usage]: I cannot compile vllm on RTX5090 | usage,stale | by 13516513760 (closed: 2026-01-15 10:27 (UTC+8))
#21800 [Usage]: How to implement the inference test of LLM model PD (Prefill-Decode) disaggregation using the vllm framework ? | usage,stale | by Alan-D-Chen (closed: 2026-01-15 10:27 (UTC+8))
#22287 [Bug]: gpt-oss-20b sometimes emits reserved tokens | bug,stale,gpt-oss | by cadedaniel (closed: 2026-01-15 10:27 (UTC+8))
#24264 [Bug]: CPU Memory leak in P/D disaggregation (with NIXL?) | bug | by hasB4K (closed: 2026-01-15 03:48 (UTC+8))
#26105 [Bug]: LLama4 produces all zero tokens after first token | bug,stale | by kylesayrs (closed: 2026-01-15 01:09 (UTC+8))
#31957 [Bug]: Nemotron Nano V3 FP8 - Expected ‘silu’ activation but got relu2_no_mul | bug | by danisereb (closed: 2026-01-14 23:31 (UTC+8))
#31331 [New Model]: Support molmo2 | new-model | by tunglinwood (closed: 2026-01-14 15:33 (UTC+8))
#32288 [Build]: Relax anthropic version pin from ==0.71.0 to >=0.71.0 | no labels | by dsfaccini (closed: 2026-01-14 15:21 (UTC+8))
#30663 [RFC]: Consolidate Intel Quantization Toolkit Integration in vLLM | RFC | by yiliu30 (closed: 2026-01-14 15:11 (UTC+8))

New PRs

#32363 [ROCm][CI] Add ROCm attention backend support for EAGLE DP tests | rocm,v1 | by AndreasKaratzas (created: 2026-01-15 09:45 (UTC+8))
#32366 [Misc] Remove redundant line | ready | by Potabk (created: 2026-01-15 10:36 (UTC+8))
#32360 Add thread_n=64 support to Marlin MoE | ready | by mgoin (created: 2026-01-15 08:02 (UTC+8))
#32350 [ROCm][CI] Pin transformers 4.57.3 to fix jina test failures | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-01-15 04:42 (UTC+8))
#32346 [ROCm][CI] Fix AITER test flakiness by using explicit attention backend | rocm,ci/build | by AndreasKaratzas (created: 2026-01-15 03:00 (UTC+8))
#32357 MXFP8 Online Quantization for Linear | no labels | by jerryzh168 (created: 2026-01-15 06:45 (UTC+8))
#32345 Support configure skip_special_tokens in openai response api | frontend,ready | by 842974287 (created: 2026-01-15 02:58 (UTC+8))
#32355 [ROCm][CI] Disable async scheduling on ROCm for test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9] | rocm,structured-output,v1,llama | by micah-wil (created: 2026-01-15 06:35 (UTC+8))
#32342 Fix optional parameter parsing in MiniMax M2 tool parser #32278 | ready | by baonudesifeizhai (created: 2026-01-15 02:08 (UTC+8))
#32365 Add NUMA Core binding in nixl_connector for CPU xPyD | kv-connector | by ZhengHongming888 (created: 2026-01-15 10:00 (UTC+8))
#32362 [BugFix] Fix assert x_s.shape[-1] == x_q.shape[-1] // group_shape[1] in Blackwell Quantized MoE Test | bug,ready | by LucasWilkinson (created: 2026-01-15 09:01 (UTC+8))
#32331 [WIP] Implement CustomLayer to complement for CustomOp | no labels | by whx-sjtu (created: 2026-01-14 20:40 (UTC+8))
#32361 [BugFix] Fix DeepSeek-V3.1 + DeepGEMM incompatible scale shapes | bug,ready,deepseek | by LucasWilkinson (created: 2026-01-15 08:33 (UTC+8))
#32356 [CI][AMD][BugFix][FP8] Fix test_concat_and_cache_mla_rope_fused by fixing fp8 conversion routines and using proper atol/rtol | bug,rocm,ready | by rasmith (created: 2026-01-15 06:41 (UTC+8))
#32351 [WIP] Pause and Resume with keep requests for single engine | documentation,frontend,v1 | by ahao-anyscale (created: 2026-01-15 05:24 (UTC+8))
#32349 [BugFix] Fix TRT-LLM NVFP4 DP/EP | bug,ready,nvidia | by jiahanc (created: 2026-01-15 03:42 (UTC+8))
#32344 [MoE Refactor] First cut at MoERunner for FusedMoE layer | documentation,v1,nvidia | by bnellnm (created: 2026-01-15 02:44 (UTC+8))
#32359 [Feature] Support async scheduling + PP | ready,v1 | by yewentao256 (created: 2026-01-15 07:19 (UTC+8))
#32339 [Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill | ready,nvidia | by MatthewBonanni (created: 2026-01-15 00:18 (UTC+8))
#32354 perf(lora): optimize index lookup from O(n) to O(1) in convert_mapping | no labels | by yifant-code (created: 2026-01-15 05:48 (UTC+8))
#32312 [Bugfix] Fix stale common_attn_metadata.max_seq_len in speculative decoding with Eagle | bug,speculative-decoding,v1 | by ofirzaf (created: 2026-01-14 16:17 (UTC+8))
#32343 [compile] raise on compile_size implicit padding | ready,nvidia | by dolpm (created: 2026-01-15 02:42 (UTC+8))
#32307 fix pad_align for gfx942 | no labels | by Rohan138 (created: 2026-01-14 15:26 (UTC+8))
#32305 Setting is_3d_moe=True for qwen3-moe | qwen | by dcmaddix (created: 2026-01-14 14:40 (UTC+8))
#32336 [Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten | bug,rocm,ready,v1,qwen | by vllmellm (created: 2026-01-14 23:02 (UTC+8))
#32348 [Model Runner V2] Support FlashInfer backend & Fix CUDA Graph bug [1/2] | bug,v1,nvidia | by WoosukKwon (created: 2026-01-15 03:28 (UTC+8))
#32341 skip previous queried ports | meta-exported,fb-exported | by mxz297 (created: 2026-01-15 02:07 (UTC+8))
#32340 [Nixl][Bugfix] Track nixl_num_kv_expired_reqs metric in Prometheus | bug,kv-connector | by NickLucche (created: 2026-01-15 01:02 (UTC+8))
#32332 Improve SLO management with request tiers | frontend,v1 | by hj-mistral (created: 2026-01-14 21:16 (UTC+8))
#32314 [Bugfix] Strengthen the check of X-data-parallel-rank in Hybrid LB mode | bug,frontend,ready,v1 | by dtcccc (created: 2026-01-14 16:30 (UTC+8))
#32337 [Benchmark] [Feature] add vllm bench sweep startup command | documentation,performance | by lengrongfu (created: 2026-01-14 23:48 (UTC+8))
#32327 [1/N] Reorganize multimodal processing code | documentation,performance,ready,v1,multi-modality,llama,qwen,deepseek | by DarkLight1337 (created: 2026-01-14 20:22 (UTC+8))
#32334 [1/x][Frontend] A graceful shutdown implementation as per RFC #24885 | frontend,v1 | by wseaton (created: 2026-01-14 22:02 (UTC+8))
#32328 rename tokenize serving api request id prefix to tokenize | frontend,ready | by andyxning (created: 2026-01-14 20:28 (UTC+8))
#32329 [Model] Add Step3vl 10b | documentation,new-model | by ltd0924 (created: 2026-01-14 20:36 (UTC+8))
#32333 add support for Step-Audio-r1.1/step-audio-2-mini | new-model,frontend,ci/build,v1 | by moevis (created: 2026-01-14 21:29 (UTC+8))
#32325 [Model] Add Moondream3 model support[WIP] | new-model | by sniper35 (created: 2026-01-14 19:19 (UTC+8))
#32317 Reinterpreting Fused MoE–LoRA as a Standard Fused MoE Kernel | no labels | by RunkaiTao (created: 2026-01-14 16:49 (UTC+8))
#32326 [misc] remove the magic number in _set_cudagraph_sizes function | nvidia | by linfeng-yuan (created: 2026-01-14 19:47 (UTC+8))
#32323 [Frontend] track responsesAPI server_load | frontend,ready | by chaunceyjiang (created: 2026-01-14 18:12 (UTC+8))
#32322 [Misc] Make mem utils can be reused by other platforms | rocm,ready,v1 | by shen-shanshan (created: 2026-01-14 17:32 (UTC+8))
#32304 integrate openai realtime api | documentation,frontend,needs-rebase | by unlikezy (created: 2026-01-14 14:24 (UTC+8))
#32319 [Frontend] Standardize use of create_error_response | frontend,ready | by DarkLight1337 (created: 2026-01-14 17:08 (UTC+8))
#32321 fix: avoid crash on zero-arg tool calls in glm4 parser | no labels | by seekskyworld (created: 2026-01-14 17:22 (UTC+8))
#32313 [Refactor] [9/N] to simplify the vLLM openai translations serving architecture | frontend,ready | by chaunceyjiang (created: 2026-01-14 16:20 (UTC+8))
#32310 [Refactor] Move top-level dummy data generation to registry | ready,v1,multi-modality,llama | by DarkLight1337 (created: 2026-01-14 16:03 (UTC+8))
#32318 [Quantization] Add W4A16 NVFP4 MoE support for CompressedTensors | ci/build | by zhangyimi (created: 2026-01-14 16:51 (UTC+8))
#32316 [Build] Bump python openai version | frontend,ci/build | by chaunceyjiang (created: 2026-01-14 16:45 (UTC+8))
#32315 [Refactor] Simplify FusedMoEParallelConfig.make() logic and remove redundant assert | no labels | by maang-h (created: 2026-01-14 16:44 (UTC+8))
#32306 [Bugfix] Refactor to support DP parallel in R3 | bug,v1 | by xhx1022 (created: 2026-01-14 15:08 (UTC+8))
#32303 [Bugfix][ROCm] Fix invalid spec token placeholders causing embedding lookup failure with async scheduling | bug,rocm,v1 | by micah-wil (created: 2026-01-14 13:42 (UTC+8))

Merged PRs

#32343 [compile] raise on compile_size implicit padding | ready,nvidia | by dolpm (merged: 2026-01-15 04:46 (UTC+8))
#32283 [BugFix] Assign page_size_padded when unifying kv cache spec. | bug,ready,v1 | by Lumosis (merged: 2026-01-15 04:10 (UTC+8))
#32336 [Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten | bug,rocm,ready,v1,qwen | by vllmellm (merged: 2026-01-15 03:32 (UTC+8))
#26291 [MODEL] Fix handling of multiple channels for gpt-oss with speculative decoding | frontend,ready,gpt-oss | by astralord (merged: 2026-01-15 02:20 (UTC+8))
#32295 [CI] Move rixl/ucx from Dockerfile.rocm_base to Dockerfile.rocm | rocm,ready,ci/build | by qli88 (merged: 2026-01-15 00:53 (UTC+8))
#32327 [1/N] Reorganize multimodal processing code | documentation,performance,ready,v1,multi-modality,llama,qwen,deepseek | by DarkLight1337 (merged: 2026-01-14 23:25 (UTC+8))
#32328 rename tokenize serving api request id prefix to tokenize | frontend,ready | by andyxning (merged: 2026-01-14 22:52 (UTC+8))
#32281 [ROCm][CI] Handle missing vision_config in Isaac model attention patch | rocm,ready,multi-modality | by AndreasKaratzas (merged: 2026-01-14 15:21 (UTC+8))
#32167 [Model] Re-implement Qwen3Omni Audio Encoder | ready,qwen | by ywang96 (merged: 2026-01-14 15:40 (UTC+8))
#32323 [Frontend] track responsesAPI server_load | frontend,ready | by chaunceyjiang (merged: 2026-01-14 20:00 (UTC+8))
#32322 [Misc] Make mem utils can be reused by other platforms | rocm,ready,v1 | by shen-shanshan (merged: 2026-01-14 19:46 (UTC+8))
#32319 [Frontend] Standardize use of create_error_response | frontend,ready | by DarkLight1337 (merged: 2026-01-14 19:22 (UTC+8))
#32313 [Refactor] [9/N] to simplify the vLLM openai translations serving architecture | frontend,ready | by chaunceyjiang (merged: 2026-01-14 18:20 (UTC+8))
#32310 [Refactor] Move top-level dummy data generation to registry | ready,v1,multi-modality,llama | by DarkLight1337 (merged: 2026-01-14 18:17 (UTC+8))
#30997 Add Molmo2 multimodal model support | documentation,new-model,ready,multi-modality | by sangho-vision (merged: 2026-01-14 15:33 (UTC+8))
#32260 [Refactor] [8/N] to simplify the vLLM openai responsesapi_serving architecture | frontend,ready,qwen,gpt-oss | by chaunceyjiang (merged: 2026-01-14 15:26 (UTC+8))
#32035 [Docs] Add docs about OOT Quantization Plugins | documentation,ready | by mgoin (merged: 2026-01-14 15:25 (UTC+8))
#32039 AMD CI Test - unskip moe_sum test and moe_align_block_size tests | rocm,ready | by hongxiayang (merged: 2026-01-14 15:25 (UTC+8))
#32296 [misc] Remove is_torch_equal_or_newer(2.4) cases | ready,v1 | by angelayi (merged: 2026-01-14 15:22 (UTC+8))
#32289 [Build] Relax anthropic version pin from ==0.71.0 to >=0.71.0 | ready,ci/build | by dsfaccini (merged: 2026-01-14 15:21 (UTC+8))
#31716 Consolidate Intel Quantization Toolkit Integration in vLLM | documentation,ready | by yiliu30 (merged: 2026-01-14 15:11 (UTC+8))
#32275 [ROCm][CI] Disable Async Scheduling For Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test | rocm,ready,ci/build,qwen | by micah-wil (merged: 2026-01-14 13:29 (UTC+8))

Closed Unmerged PRs

#19634 [Config] Make prefix cache metrics interval configurable | needs-rebase,stale,v1 | by jeremyeder (closed: 2026-01-15 10:27 (UTC+8))
#28698 fix fused_recurrent_gated_delta_rule_fwd returns incorrect final_state shape when inplace_final_state=False #28632 | no labels | by baonudesifeizhai (closed: 2026-01-15 04:29 (UTC+8))
#32305 Setting is_3d_moe=True for qwen3-moe | qwen | by dcmaddix (closed: 2026-01-15 03:42 (UTC+8))
#32298 allow configure skip_special_tokens in openai response api | frontend,ready | by 842974287 (closed: 2026-01-15 02:48 (UTC+8))
#26802 [wip] tools for slo management | frontend,v1 | by hj-mistral (closed: 2026-01-14 21:17 (UTC+8))
#32253 [Frontend] Normalize Responses API input for multi-turn conversations | frontend,needs-rebase,gpt-oss | by daniel-salib (closed: 2026-01-14 19:24 (UTC+8))
#32207 [MoE Recator] Remove CustomOp from UnquantizedFusedMoEMethod | no labels | by bnellnm (closed: 2026-01-14 11:15 (UTC+8))