vLLM Development Activity Report - 2026-01-15

Time window: 2026-01-15 10:54 (UTC+8) ~ 2026-01-16 10:54 (UTC+8)

Statistics: New Issues 20 | Closed Issues 14 | New PRs 59 | Merged PRs 42 | PRs Closed Without Merging 9

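📊 Daily Development Status Summary

During this period (January 15-16, 2026), the vLLM project sustained a high level of development and bug-fixing activity. Progress centered on multimodal model support, performance optimization, and fixes for defects on the AMD ROCm platform. The most active community discussions covered the design of the quantization API, multimodal input handling, and problems in distributed (Ray) deployments, which suggests the project is iterating quickly to keep up with increasingly complex models and deployment requirements.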

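🎯 AMD/ROCm Ecosystem Updates

AMD/ROCm activity was high this period, concentrated on bug fixes, test adaptation, and performance optimization, reflecting continued investment in the platform's stability.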

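Issues:

#32434: [Bug]: gpt-oss no output with TRITON_ATTN backend with spec decode on ROCm
- User: micah-wil (not an AMD employee)
- Summary: On ROCm, the gpt-oss model produces no output when the TRITON_ATTN attention backend is used together with speculative decoding (spec decode).
- Technical details: gpt-oss currently needs the ROCM_AITER_UNIFIED_ATTN backend to work correctly on ROCm. A fix, PR #32431, has been opened to enable this backend for the affected tests.
- Impact: Exposes a compatibility problem between attention backends and speculative decoding on ROCm; a specific configuration is required.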
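#32410: [Bug][XPU]: Failed to serve w4a16 Qwen3 VL MoE… (mislabeled as ROCm)
- Summary: Although this is an Intel XPU issue, the github-actions bot mislabeled it and @-mentioned the ROCm maintainers. This is a misjudgment by the automation and does not involve AMD technology.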

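Closed Issues:

#26700: [Bug][ROCm]: W8A8BlockFp8LinearOp does not use AITER on MI355X
- User: gronsti-amd (AMD employee)
- Status: Fixed by commit 3c680f4a and closed. This resolves a case where a specific FP8 linear layer did not use the optimized AITER kernels on AMD MI355X GPUs, improving performance.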

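PRs (merged and new):

#32444: [CI][AMD] Skip test_permute_cols… (new PR)
- User: rasmith
- Summary: Skips the test_permute_cols test in ROCm CI because the kernel is neither built nor used on ROCm.
- Analysis: Platform-specific handling that avoids unnecessary test failures and keeps CI stable.
#32431: [ROCm][CI] Enable AITER Unified Attention On ROCm For gpt-oss Test (merged)
- User: micah-wil
- Summary: Enables AITER and the ROCM_AITER_UNIFIED_ATTN attention backend for the gpt-oss test on ROCm, addressing Issue #32434 above.
- Impact: Ensures the test suite for this model runs correctly on ROCm.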
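#32419: Support ROCm aiter specific fusion of per_tensor RMSNorm+QuantFP8 (new PR)
- User: tpopp
- Summary: Adds a fusion pattern for statically scaled, per-tensor RMSNorm + FP8 quantization on the ROCm AITER path.
- Technical details: Similar to fusion.py on the CUDA platform, but tailored to ROCm AITER. Reportedly yields a 4-5% speedup for Llama3 models on MI325.
- Impact: A significant performance optimization that shows the AMD team's continued work on hardware-specific compute graph fusion.
#32408: [CI][Hardware][AMD][Kernel] Align FP8 quant utils and fix test_rotary_embedding_mla_cache_fused (new PR)
- User: mawong-amd (AMD employee)
- Summary: Fixes the failing test_rotary_embedding_mla_cache_fused test in AMD CI by aligning AMD's FP8 conversion logic with NVIDIA's and bringing in related fixes.
- Impact: Ensures FP8 arithmetic on AMD is correct and consistent across platforms, which is foundational for the accuracy of quantized models.
#32404: Remove non-aiter quant matching path for rocm_aiter_fusion.py (new PR)
- User: tpopp
- Summary: Removes the redundant non-AITER quant matching path from rocm_aiter_fusion.py. The path only works when AITER is enabled; otherwise it causes duplicate pattern creation.
- Impact: Code cleanup that makes the ROCm fusion logic more robust and easier to follow.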
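#32372: [CI][BugFix][AMD][FP8] Fix test_rms_norm… (merged)
- User: rasmith
- Summary: Fixes the test_rms_norm test on ROCm by using the platform-specific fp8_dtype and setting the device correctly, which avoids an illegal memory access.
- Impact: A foundational fix that makes FP8 unit tests run reliably on ROCm.
#32413: [ROCm][Bugfix] Disable hip sampler to fix deepseek's accuracy issue on ROCm (merged)
- User: ganyi1996ppo
- Summary: Temporarily disables the ROCm hip sampler to fix an accuracy issue in which the deepseek-r1 model tends to generate repetitive content on ROCm.
- Impact: A critical but temporary fix that directly addresses a serious defect in model output quality on AMD. It suggests the underlying sampling library may be unstable under certain hardware/driver combinations; the root cause still needs investigation.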

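💬 High-Activity Discussion Analysis

Issue #32412: [RFC]: online quantization user facing API
- Core topic: How to design a user-friendly API for online (runtime) quantization configurations that may become more complex in the future.
- Main viewpoints:
  - Proposer (vkuzo): Suggests adding an online_quantization_config object to extend the current simple string parameter (such as quantization="fp8") so that advanced settings like scaling granularity and ignoring specific layers can be expressed.
  - Participant 1 (kylesayrs): Suggests extending the existing quantization parameter so it accepts either a preset string or a config object (such as QuantizationScheme). Also discusses default ignore rules for layers such as MoE routers.
  - Participant 2 (vkuzo): Leans toward keeping the two parameters separate for type clarity, while remaining open to the simpler approach. He further proposes renaming the existing parameter to quantization_backend for clarity.
- Point of contention: API design philosophy, namely simplicity (one polymorphic parameter) versus clarity (multiple single-purpose parameters), and whether to reuse the existing CompressedTensors data structures.
- Current status: Open discussion. There is consensus that a more powerful configuration scheme is needed, but the implementation path is still undecided.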
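To make the two designs in this RFC concrete, here is a minimal sketch of the single-parameter variant, in which one argument accepts either a preset string or a config object. Everything in it is hypothetical: `OnlineQuantizationConfig`, its fields, and `resolve_quantization` are illustrative names and are not part of vLLM or of the RFC text.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class OnlineQuantizationConfig:
    """Hypothetical config object; field names are illustrative only."""

    scheme: str = "fp8"              # preset name, e.g. "fp8"
    granularity: str = "per_tensor"  # e.g. "per_tensor" or "per_channel"
    ignore: Tuple[str, ...] = ()     # glob patterns of layer names to skip

    def should_quantize(self, layer_name: str) -> bool:
        # A layer is quantized unless it matches one of the ignore patterns.
        return not any(fnmatch(layer_name, pat) for pat in self.ignore)


def resolve_quantization(
    quantization: Union[str, OnlineQuantizationConfig, None],
) -> Optional[OnlineQuantizationConfig]:
    """Accept a preset string or a config object and normalize to a config."""
    if quantization is None or isinstance(quantization, OnlineQuantizationConfig):
        return quantization
    if isinstance(quantization, str):
        return OnlineQuantizationConfig(scheme=quantization)
    raise TypeError(f"unsupported quantization argument: {quantization!r}")


# The simple string form keeps working, while advanced users pass an object.
simple = resolve_quantization("fp8")
custom = resolve_quantization(
    OnlineQuantizationConfig(granularity="per_channel", ignore=("*.router",))
)
```

The two-parameter variant discussed in the thread would instead keep the string argument as-is and add a second, strictly typed argument for the config object, trading a slightly larger API surface for clearer types.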
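Issue #32378: [Usage]: How to add mixed text and image modal inputs…
- Core topic: A user of the Qwen3VL-Reranker model is unsure how to pass candidates containing both an image and text in the documents list through the vLLM API.
- Main viewpoints/progress:
  - User (wade0604): Showed working code for text-only and image-only documents, but was unsure how to combine them.
  - Maintainers (DarkLight1337, noooop): Pointed the user to the existing multimodal scoring example code and suggested extending the content field of the request.
  - User follow-up: The user tried a nested structure but got multiple scores instead of the expected single score for the image-text pair.
- Point of contention: No substantive disagreement; this is mainly a matter of explaining API usage and clarifying documentation. It exposes API usability challenges for complex multimodal inputs.
- Current status: Open, awaiting a clearer example or documentation.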
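PR #32379: fix: improve NIXL import safety and add UCX compatibility checks
- Core topic: Fixes segmentation faults in the NIXL connector caused by mixed UCX library versions, and improves its safety and error messages.
- Main viewpoints/interactions:
  - Author (seekskyworld): Proposes detecting mixed UCX installations and failing fast, importing NIXL lazily and safely, and handling the UCX_PROTO_ENABLE environment variable.
  - Affected user (esmeetu): Tested the patch in the comments, initially reported that the problem persisted, and provided environment details (UCX_PROTO_ENABLE not set).
  - Further analysis by the author: Based on the user's feedback, investigated further and found that UCX_PROTO_ENABLE=n may disable the GPU RMA protocol and cause the crash; a guard for this was added in a follow-up commit.
- Point of contention: None. This is collaborative debugging of a specific technical problem, typical of how open-source communities resolve issues.
- Current status: The PR is open and being iterated on based on user feedback.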
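The "fail fast, then import lazily" idea can be sketched as below. This is not the PR's code: `check_ucx_env` and `lazy_import` are hypothetical helpers, and raising an error is only one possible form of guard (the PR may handle `UCX_PROTO_ENABLE` differently, for example by overriding it or logging a warning).

```python
import importlib
import os
from types import ModuleType
from typing import Mapping, Optional


def check_ucx_env(env: Mapping[str, str]) -> None:
    """Fail fast on the UCX setting that the discussion linked to crashes."""
    if env.get("UCX_PROTO_ENABLE", "").strip().lower() == "n":
        raise RuntimeError(
            "UCX_PROTO_ENABLE=n may disable the GPU RMA protocol; "
            "unset it before enabling the connector."
        )


def lazy_import(
    module_name: str, env: Optional[Mapping[str, str]] = None
) -> Optional[ModuleType]:
    """Validate the environment, then import only when actually needed.

    Returns None instead of raising when the module is not installed, so
    callers can report a clear "connector unavailable" message.
    """
    check_ucx_env(os.environ if env is None else env)
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None
```

Passing the environment as an argument keeps the check easy to test without mutating `os.environ`.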

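🔥 Hot Topics and Trend Analysis

Deeper quantization support: The online quantization API design (#32412), NVFP4/MXFP4 MoE support (#32257, #32285), and FP8 fusion optimization on AMD (#32419) show the community working toward more flexible and efficient quantization in vLLM, covering the full range from post-training quantization to online quantization.
Continued expansion of model support: New PRs add support for Molmo2 FP8 quantization (#32385), the OpenVLA robotics model (#32390), and A.X-K1 (#32407), reflecting how vLLM tracks new models and keeps extending its multimodal and specialized model ecosystem.
Performance and core optimization: Dynamic speculative decoding (#32374), the MoE kernel selector refactor (#32414), Triton attention kernel tuning (#32403), and the LoRA expand kernel heuristic update (#32425) show that deep optimization of base performance and core scheduling logic is ongoing.
Maturing multimodal inference: The discussion on multimodal input handling (#32378) and the related refactoring (#32406, #32382) indicate that the feature is moving from "it exists" toward "easy to use" and "clean code".
KV cache and connectors: KV transfer compatibility with speculative decoding (#32409), the native backend becoming the default for KV offloading (#32421), NIXL connector stability (#32379), and the OffloadingConnector "preemptions-only" mode (#32398) show that distributed inference and long-context memory management are current focus areas.
Platform support and stability: Numerous ROCm test fixes, CI improvements, and bug fixes (such as #32413, #32444, #32431), along with a Docker build fix (#32377, for Issue #32373), reflect the project's emphasis on production stability and multi-platform support.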

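🛠️ Key Technical Changes

PR #32414: [MoE Refactor] Make Oracle Select Kernels In Priority Order
- Analysis: A significant refactor of the MoE kernel selection mechanism. It introduces a unified feature-support interface and a priority-based registration mechanism for the different MoE kernels (Triton, FlashInfer, AITER, and others).
- Impact: Kernel selection based on hardware architecture and model characteristics can become more automated and reliable, with compatibility validated at initialization and clearer error messages. This is foundational work for the deployment experience and performance of MoE models.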
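Priority-ordered selection with a per-kernel support check can be sketched in a few lines. The sketch below is only an illustration of the pattern: `LayerSpec`, `KernelEntry`, `select_kernel`, and the support predicates in `PRIORITY` are all made up for this example and do not reflect the PR's actual interfaces or the real capabilities of these kernels.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical description of what an MoE layer needs."""

    platform: str  # e.g. "cuda" or "rocm"
    quant: str     # e.g. "bf16" or "fp8"


@dataclass(frozen=True)
class KernelEntry:
    name: str
    supports: Callable[[LayerSpec], bool]  # uniform feature-support check


def select_kernel(candidates: Sequence[KernelEntry], spec: LayerSpec) -> str:
    """Return the first kernel, in priority order, that supports the spec."""
    rejected: List[str] = []
    for entry in candidates:
        if entry.supports(spec):
            return entry.name
        rejected.append(entry.name)
    # Failing at initialization gives a clear error instead of a late crash.
    raise ValueError(f"no MoE kernel supports {spec}; tried {rejected}")


# Illustrative priority list; the predicates are assumptions, not real rules.
PRIORITY = [
    KernelEntry("aiter", lambda s: s.platform == "rocm"),
    KernelEntry("flashinfer", lambda s: s.platform == "cuda" and s.quant == "fp8"),
    KernelEntry("triton", lambda s: s.quant in {"bf16", "fp8"}),
]
```

Because every kernel answers the same `supports` question, adding a kernel means adding one entry at the right position rather than editing a chain of conditionals.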
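PR #32421: [UX] Use kv_offloading_backend=native by default (merged)
- Analysis: Makes native CPU KV cache offloading the default backend when the user specifies --kv_offloading_size.
- Impact: A notable usability improvement. Users enable CPU offloading by specifying only the offload size, without also choosing a backend. This lowers the barrier to entry and helps promote KV offloading for longer contexts or more concurrent requests.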
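The defaulting rule described here amounts to "if a size is given and no backend is chosen, use native". A minimal sketch follows, assuming a hypothetical helper `resolve_kv_offloading_backend`; the argument names mirror the option names in the PR title and summary, but the function itself and the `"custom"` backend name used in the example are not from vLLM.

```python
from typing import Optional


def resolve_kv_offloading_backend(
    kv_offloading_size: Optional[float],
    kv_offloading_backend: Optional[str] = None,
) -> Optional[str]:
    """Return the backend to use, defaulting to "native" when a size is set."""
    if kv_offloading_size is None:
        # Offloading was not requested; keep whatever the user passed.
        return kv_offloading_backend
    if kv_offloading_size <= 0:
        raise ValueError("kv_offloading_size must be positive")
    # An explicit backend choice always wins over the default.
    return kv_offloading_backend or "native"


default_backend = resolve_kv_offloading_backend(8.0)
explicit_backend = resolve_kv_offloading_backend(8.0, "custom")  # hypothetical name
```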
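PR #32413: [ROCm][Bugfix] Disable hip sampler to fix deepseek's accuracy issue on ROCm (merged)
- Analysis: Works around a defect in the ROCm hip sampler by setting an environment variable; the defect caused DeepSeek models to produce repetitive output on AMD GPUs.
- Impact: Directly resolves a critical problem for model usability on AMD. It treats the symptom rather than the cause, but it restores stability for users right away and buys time for a proper fix in the hip library or in the vLLM integration.
Issue #32432: [Bug]: FlashInfer warmup crash on Blackwell NVFP4
- Analysis: On Blackwell GPUs, running an NVFP4-quantized model with the FlashInfer attention backend crashes during warmup because of a PyTorch API change (non_blocking=None).
- Impact: Affects the combination of the newest hardware (Blackwell) and an advanced quantization format (NVFP4). It can be worked around by setting VLLM_USE_V1=0 to fall back to the legacy engine, but this blocks use of the V1 engine on the new hardware. A fix needs to come from the FlashInfer library or the vLLM adaptation layer.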
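PR #32374: [V1][Spec Decode] Add Dynamic SD (new PR)
- Analysis: Introduces dynamic speculative decoding for the V1 engine. It adjusts the number of speculated tokens (K) on the fly based on the current batch size and token acceptance rate, so that speculative decoding keeps paying off under varying load.
- Impact: Improves the robustness and efficiency of speculative decoding under variable load (such as RL training rollouts), an important step toward mature, adaptive speculative decoding.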
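The report does not describe the PR's actual policy, so the following is only a toy model of how K could depend on batch size and acceptance rate. It assumes each draft token is accepted independently with the same probability, and it uses an invented cost model (`draft_cost`, `batch_penalty`); the function names and all constants are hypothetical.

```python
def expected_tokens(k: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per verification step with k draft tokens.

    Assumes i.i.d. acceptance with probability `acceptance_rate`, which gives
    the geometric sum 1 + a + a^2 + ... + a^k.
    """
    if acceptance_rate >= 1.0:
        return float(k + 1)
    return (1.0 - acceptance_rate ** (k + 1)) / (1.0 - acceptance_rate)


def choose_num_speculative_tokens(
    batch_size: int,
    acceptance_rate: float,
    max_k: int = 4,
    draft_cost: float = 0.1,      # toy per-draft-token cost (assumption)
    batch_penalty: float = 0.01,  # toy extra cost per request (assumption)
) -> int:
    """Pick k in [0, max_k] that maximizes expected tokens per unit of cost."""

    def throughput(k: int) -> float:
        # Larger batches make every extra speculated token more expensive.
        cost = 1.0 + k * (draft_cost + batch_penalty * batch_size)
        return expected_tokens(k, acceptance_rate) / cost

    return max(range(max_k + 1), key=throughput)
```

Under this toy model, a small batch with a high acceptance rate selects the largest K, while a large batch or a low acceptance rate selects K = 0, which corresponds to turning speculation off.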

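📈 Development Activity Observations

Contributor diversity: Active contributors this period include AMD employees (tjtanaa, gronsti-amd, mawong-amd, and others), NVIDIA employees, Intel-affiliated contributors, and many independent community developers, showing broad ecosystem participation.
Active AMD team: AMD-related contributions were dense, spanning driver-level issues (#32413), kernel optimization (#32419), test fixes (#32444, #32408), and CI/CD improvements (#32264), indicating that AMD is systematically improving stability and performance in the vLLM ecosystem.
Deep iteration on core modules: Refactoring and optimization PRs for core modules such as MoE (#32414), speculative decoding (#32374), and attention backends continue, showing that the project is still refining its core architecture while expanding features quickly.
Code review and merge efficiency: 42 PRs were merged and 14 Issues closed within 24 hours, suggesting that community maintenance and automation (CI, mergebot) are running efficiently. Many cross-platform compatibility PRs (such as #32372, #32431) were reviewed and merged quickly.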

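💡 Issues Worth Watching

Online quantization API design decision (Issue #32412): The outcome will shape the user experience and extensibility of vLLM's quantization features; community members may want to follow the discussion and weigh in.
gpt-oss no-output problem on ROCm (Issue #32434): A temporary CI fix is in place, but the root cause (why the TRITON_ATTN backend fails on ROCm) still needs to be identified, which matters for functional completeness on ROCm.
FlashInfer compatibility on Blackwell (Issue #32432): Affects the combination of the newest hardware and quantization techniques; it needs a prompt fix upstream (FlashInfer) or in the vLLM adaptation layer.
Ray cluster deployment problems (Issue #32400, #32401): Placement group conflicts and resource scheduling problems when deploying multiple model instances on large clusters reflect the challenges vLLM faces in complex production scheduling. A solution matters greatly to cloud providers and large enterprise users.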

📋 Appendix: Detailed Data Lists

New Issues

#32436 [Bug]: GLM 4.7 Tool Call issue - bug - by 9620300 (created: 2026-01-16 08:03 (UTC+8))
#32446 [New Model]: Support for TranslateGemma series - no labels - by venki-lfc (created: 2026-01-16 09:37 (UTC+8))
#32445 [Bug]: GLMASR rope error - bug - by catsled (created: 2026-01-16 09:29 (UTC+8))
#32434 [Bug]: gpt-oss no output with TRITON_ATTN backend with spec decode on ROCm - bug,rocm - by micah-wil (created: 2026-01-16 07:05 (UTC+8))
#32432 [Bug]: FlashInfer warmup crash on Blackwell NVFP4: non_blocking=None passed to Tensor.to() - bug - by CristyNel (created: 2026-01-16 06:37 (UTC+8))
#32412 [RFC]: online quantization user facing API - RFC - by vkuzo (created: 2026-01-15 20:58 (UTC+8))
#32424 [Bug]: MoE LoRA hits compile error on B200 - bug - by Jackmin801 (created: 2026-01-16 02:42 (UTC+8))
#32410 [Bug][XPU]: Failed to serve w4a16 Qwen3 VL MoE on Intel Arc Pro B60 - bug,intel-gpu - by noobHappylife (created: 2026-01-15 19:32 (UTC+8))
#32378 [Usage]: How to add mixed text and image modal inputs to documents for qwen3vl-rerank model vllm inference? - usage - by wade0604 (created: 2026-01-15 11:57 (UTC+8))
#32409 [Bug]: KV Transfer does not work with Spec Decode - bug - by szutenberg (created: 2026-01-15 18:54 (UTC+8))
#32402 [RFC]: Add a new /collect_env api to response current vllm instance environment - feature request - by lengrongfu (created: 2026-01-15 16:55 (UTC+8))
#32399 [Bug]: High Rate of endless Decoding in DeepSeekV3.2 Inference with vLLM v0.13.0 - bug - by makabaka6338 (created: 2026-01-15 16:25 (UTC+8))
#32401 [Feature]: when will custom placement group configuration be supported for DP - no labels - by ywExcellent (created: 2026-01-15 16:43 (UTC+8))
#32400 [Bug]: Ray cluster(5 Node), Deploy DP=2, Creating too many placement groups - no labels - by ywExcellent (created: 2026-01-15 16:37 (UTC+8))
#32391 [Bug]: shellcheck hook fails with find syntax error and causes silent failures - bug - by junuxyz (created: 2026-01-15 15:04 (UTC+8))
#32388 [Usage]: Does EPLB support CompressedTensorsWNA16MarlinMoEMethod in v0.12.0 or higher version? - usage - by IEI-mjx (created: 2026-01-15 14:46 (UTC+8))
#32370 [Bug]: use vllm server but tell me not import vllm.benchmarks.latency - bug - by ilovexsir (created: 2026-01-15 11:00 (UTC+8))
#32380 [Feature]: Benchmak Scalability Optimization - feature request - by Bounty-hunter (created: 2026-01-15 12:16 (UTC+8))
#32373 [Bug]: Fail to load vLLM on new NVIDIA driver - bug - by huydhn (created: 2026-01-15 11:15 (UTC+8))
#32375 [Bug]: Tensor Parallel + NixlConnector failed - bug - by esmeetu (created: 2026-01-15 11:20 (UTC+8))

Closed Issues

#15235 [Bug]: LLVM ERROR: Failed to compute parent layout for slice layout. - bug,stale - by GennVa (closed: 2026-01-16 10:13 (UTC+8))
#18567 [Bug]: Single-Node data parallel (--data-parallel-size=4) leads to vLLM crash - bug,stale - by Rexhaif (closed: 2026-01-16 10:13 (UTC+8))
#32424 [Bug]: MoE LoRA hits compile error on B200 - bug - by Jackmin801 (closed: 2026-01-16 02:58 (UTC+8))
#32214 [Feature]: Extend startup time collection script to work with sweep - help wanted,good first issue,feature request - by desertfire (closed: 2026-01-15 17:25 (UTC+8))
#32213 [Bug]: GLM4 tool parser crashes with TypeError when parsing tools with no arguments - bug - by AnasMaar (closed: 2026-01-15 16:46 (UTC+8))
#26700 [Bug][ROCm]: W8A8BlockFp8LinearOp does not use AITER on MI355X - bug,rocm - by gronsti-amd (closed: 2026-01-15 15:54 (UTC+8))
#32311 [Bug/Question]: Unexpected CUDA Graph Replay observed only in the first request’s prefill under PIECEWISE mode - usage - by zhenwei-intel (closed: 2026-01-15 15:13 (UTC+8))
#27787 [Bug]: Under conditions of high concurrency, DeepSeek-V3.1 exhibits the phenomenon of generating nonsensical outputs when running on vllm 0.11.1rc4 - bug - by makabaka6338 (closed: 2026-01-15 14:47 (UTC+8))
#27245 [Bug]: High Latency in One-Time Scenarios for DeepSeek Deployed on H200 - bug - by makabaka6338 (closed: 2026-01-15 14:46 (UTC+8))
#30453 [Bug]: Frequent Repetitive Decoding Problems in DeepSeek-V3.2 - bug - by makabaka6338 (closed: 2026-01-15 14:46 (UTC+8))
#30318 [Bug]: vllm v0.10.1 - Non-compliant JSON Output in Tool Calling When Inferencing DeepSeek-V3.1-Terminus - bug - by makabaka6338 (closed: 2026-01-15 14:46 (UTC+8))
#32225 [Bug]: common_attn_metadata.max_seq_len not incremented properly in Eagle implementation - bug - by ofirzaf (closed: 2026-01-15 14:39 (UTC+8))
#25327 [Bug]: InternVL 3.5 is insanely slow - bug,stale - by mertunsall (closed: 2026-01-15 14:23 (UTC+8))
#31864 [Bug][CPU Backend]: Gibberish output on CPU backend with DP2 + MoE Model - bug,cpu - by kzwrime (closed: 2026-01-15 12:50 (UTC+8))

New PRs

#32447 [Bugfix] Fix ROCm dockerfiles - bug,rocm,ci/build - by tjtanaa (created: 2026-01-16 10:46 (UTC+8))
#32438 [Bug] Add TPU backend option - bug,ready - by vanbasten23 (created: 2026-01-16 08:32 (UTC+8))
#32414 [MoE Refactor] Make Oracle Select Kernels In Priority Order - documentation,performance,rocm,ready,ci/build,v1,llama,gpt-oss,nvidia - by robertgshaw2-redhat (created: 2026-01-15 22:00 (UTC+8))
#32442 [Bugfix] Abort engine requests on client disconnect - bug,frontend - by JayZenith (created: 2026-01-16 09:02 (UTC+8))
#32423 [CI] Fix LM Eval Large Models (H100) - ready,ci-failure - by MatthewBonanni (created: 2026-01-16 02:29 (UTC+8))
#32395 [Frontend][1/n] Make pooling entrypoints request schema consensus | CompletionRequest - documentation,frontend,ready,multi-modality - by noooop (created: 2026-01-15 15:43 (UTC+8))
#32444 [CI][AMD] Skip test_permute_cols since the kernel is not used and not built for ROCm - rocm - by rasmith (created: 2026-01-16 09:09 (UTC+8))
#32437 [Hardware][SM100] Add TRTLLM Kernel for INT4 W4A16 Kernel. - needs-rebase,nvidia - by pavanimajety (created: 2026-01-16 08:21 (UTC+8))
#32443 [Bugfix][chat_completion] ensure tokenizer eos_token_id is added to stop_token_ids - bug,frontend - by Flink-ddd (created: 2026-01-16 09:08 (UTC+8))
#32441 [Docs] Clarify structured outputs configuration for Qwen3 reasoning mode - documentation,structured-output,qwen - by davzaman (created: 2026-01-16 09:00 (UTC+8))
#32440 [Misc] Add VLLM Config to Prometheus Logger - v1 - by pooyadavoodi (created: 2026-01-16 08:50 (UTC+8))
#32431 [ROCm][CI] Enable AITER Unified Attention On ROCm For gpt-oss Test - rocm,ready,gpt-oss - by micah-wil (created: 2026-01-16 05:56 (UTC+8))
#32439 [Bugfix] Abort engine requests on client disconnect - bug,frontend - by JayZenith (created: 2026-01-16 08:42 (UTC+8))
#32385 [Model] Molmo2: Enable quantized weight mapping for vision backbone - no labels - by George-Polya (created: 2026-01-15 13:10 (UTC+8))
#32374 [V1][Spec Decode] Add Dynamic SD - documentation,speculative-decoding,needs-rebase,v1 - by ekagra-ranjan (created: 2026-01-15 11:16 (UTC+8))
#32429 [Core] Cleanup shm based object store on engine shutdown - v1,multi-modality - by walterbm (created: 2026-01-16 05:04 (UTC+8))
#32435 [Misc] Add Device Config to Prometheus Logger - v1 - by pooyadavoodi (created: 2026-01-16 07:28 (UTC+8))
#32390 [Model] Add OpenVLA model support - new-model,multi-modality - by PalmDr (created: 2026-01-15 15:00 (UTC+8))
#32433 [Refactor] Remove unused file pallas_kv_cache_update.py - tpu,ready,v1 - by yewentao256 (created: 2026-01-16 06:53 (UTC+8))
#32422 [Refactor] Remove unused file - frontend,ready - by yewentao256 (created: 2026-01-16 00:56 (UTC+8))
#32430 [Bugfix] Seed OSS 36B Tool Parsing - bug - by ApexArray (created: 2026-01-16 05:47 (UTC+8))
#32418 [EPLB][BugFix]Possible deadlock fix - bug,ready - by ilmarkov (created: 2026-01-15 23:46 (UTC+8))
#32420 [2/x][Frontend] Draft: Also consider async kv transfers in flight for graceful drain decision - frontend,needs-rebase,v1 - by wseaton (created: 2026-01-16 00:26 (UTC+8))
#32427 [ROCM] add 3d triton kernel for non-standard block size support under rocm_attn - rocm,v1 - by jennyyyyzhen (created: 2026-01-16 03:59 (UTC+8))
#32428 [Logging][Bugfix] fix scheduler stats logging - bug,v1,meta-exported,fb-exported - by jennyyyyzhen (created: 2026-01-16 04:02 (UTC+8))
#32371 v1: Standardize request validation errors to VLLMValidationError - needs-rebase,v1 - by JayZenith (created: 2026-01-15 11:11 (UTC+8))
#32425 [LoRA] Update LoRA expand kernel heuristic - no labels - by xyang16 (created: 2026-01-16 02:53 (UTC+8))
#32416 [BugFix] Python file source reading can fail on UnicodeDecodeError - bug,ready - by zou3519 (created: 2026-01-15 23:10 (UTC+8))
#32417 [BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInfer backend. Fixes issue #32353 - bug,ready,v1,nvidia - by vadiklyutiy (created: 2026-01-15 23:42 (UTC+8))
#32426 [CI] Remove unused docker/Dockerfile.nightly_torch - ci/build - by orionr (created: 2026-01-16 03:10 (UTC+8))
#32398 [KVConnector] OffloadingConnector: Add preemptions-only mode - v1,kv-connector - by orozery (created: 2026-01-15 16:23 (UTC+8))
#32421 [UX] Use kv_offloading_backend=native by default - ready,v1,kv-connector - by mgoin (created: 2026-01-16 00:34 (UTC+8))
#32387 [FEATURE] Add support for capturing hidden states in Qwen3MoeLLMModel… - qwen - by tzhouam (created: 2026-01-15 13:53 (UTC+8))
#32413 [ROCm][Bugfix] Disable hip sampler to fix deepseek’s accuracy issue on ROCm - bug,rocm,ready,v1,deepseek - by ganyi1996ppo (created: 2026-01-15 21:53 (UTC+8))
#32419 Support ROCm aiter specific fusion of per_tensor RMSNorm+QuantFP8 - rocm - by tpopp (created: 2026-01-16 00:25 (UTC+8))
#32403 [Performance] Improve Triton prefill attention kernel’s performance - v1 - by Isotr0py (created: 2026-01-15 17:05 (UTC+8))
#32415 [Fix] Fix buffer sizing and memory handling for routed-expert return in TP - v1 - by TomerBN-Nvidia (created: 2026-01-15 22:35 (UTC+8))
#32407 [Model] Add huggingface skt/A.X-K1 model - documentation,new-model - by fort726 (created: 2026-01-15 18:25 (UTC+8))
#32379 fix: improve NIXL import safety and add UCX compatibility checks - documentation,kv-connector - by seekskyworld (created: 2026-01-15 12:04 (UTC+8))
#32411 [Misc] Fix typo: seperator -> separator in flashmla_sparse.py - v1 - by T1mn (created: 2026-01-15 20:12 (UTC+8))
#32406 [3/N] Group together media-related code - ready,multi-modality - by DarkLight1337 (created: 2026-01-15 18:12 (UTC+8))
#32404 Remove non-aiter quant matching path for rocm_aiter_fusion.py - rocm - by tpopp (created: 2026-01-15 17:25 (UTC+8))
#32394 fix(tool_parsers): enable type conversion for Seed OSS tool parser streaming mode - no labels - by karanb192 (created: 2026-01-15 15:30 (UTC+8))
#32372 [CI][BugFix][AMD][FP8] Fix test_rms_norm so it runs correctly on ROCm - bug,rocm,ready - by rasmith (created: 2026-01-15 11:12 (UTC+8))
#32396 [Refactor] [11/N] to simplify the mcp architecture - structured-output,frontend,ready,v1,gpt-oss - by chaunceyjiang (created: 2026-01-15 15:51 (UTC+8))
#32408 [CI][Hardware][AMD][Kernel] Align FP8 quant utils and fix test_rotary_embedding_mla_cache_fused - rocm - by mawong-amd (created: 2026-01-15 18:27 (UTC+8))
#32405 [Hardware][RISCV] Add RISC-V RVV backend support for CPU executor - ci/build,cpu - by hansu2022 (created: 2026-01-15 17:33 (UTC+8))
#32397 [Model] Enable LoRA support for internvl2 - no labels - by MatteoFari (created: 2026-01-15 16:17 (UTC+8))
#32382 [2/N] Move cache factories to MM registry - ready,v1,multi-modality - by DarkLight1337 (created: 2026-01-15 12:22 (UTC+8))
#32389 [Model] Avoid token selection in SigLIP pooling head - ready - by DarkLight1337 (created: 2026-01-15 14:48 (UTC+8))
#32369 [Refactor] [10/N] to simplify the vLLM openai completion serving architecture - frontend,ready,v1,gpt-oss - by chaunceyjiang (created: 2026-01-15 10:58 (UTC+8))
#32386 [Feature] Add FIPS 140-3 compliant hash algorithm option for multimodal hashing - multi-modality - by karanb192 (created: 2026-01-15 13:33 (UTC+8))
#32393 Support custom URI schemes and trace handlers for profiler - v1,meta-exported,fb-exported - by diviramon (created: 2026-01-15 15:18 (UTC+8))
#32392 Support custom URI schemes and trace handlers for profiler - v1,meta-exported,fb-exported - by brad-mengchi (created: 2026-01-15 15:12 (UTC+8))
#32384 [Bugfix] Fix xgrammar dtype mismatch on macOS CPU inference - bug,structured-output,v1 - by karanb192 (created: 2026-01-15 12:48 (UTC+8))
#32376 [code clean] remove duplicate check - ready,v1,multi-modality - by andyxning (created: 2026-01-15 11:44 (UTC+8))
#32381 [responsesAPI] allow reasoning parser to output multiple reasoning items - frontend,qwen,deepseek - by qandrew (created: 2026-01-15 12:20 (UTC+8))
#32383 [responsesAPI] allow tuning include_stop_str_in_output - frontend - by qandrew (created: 2026-01-15 12:42 (UTC+8))
#32377 [Docker] Remove CUDA compatibility library loading; fixes #32373 - ci/build,nvidia - by wangshangsam (created: 2026-01-15 11:54 (UTC+8))

Merged PRs

#32447 [Bugfix] Fix ROCm dockerfiles - bug,rocm,ci/build - by tjtanaa (merged: 2026-01-16 10:50 (UTC+8))
#32423 [CI] Fix LM Eval Large Models (H100) - ready,ci-failure - by MatthewBonanni (merged: 2026-01-16 08:52 (UTC+8))
#32431 [ROCm][CI] Enable AITER Unified Attention On ROCm For gpt-oss Test - rocm,ready,gpt-oss - by micah-wil (merged: 2026-01-16 08:55 (UTC+8))
#32360 Add thread_n=64 support to Marlin MoE - ready - by mgoin (merged: 2026-01-16 08:45 (UTC+8))
#32257 [Feat] Support non-gated MoE with Marlin, NVFP4 CUTLASS, FP8, INT8, compressed-tensors - ready,nvidia - by TomerBN-Nvidia (merged: 2026-01-16 08:15 (UTC+8))
#32422 [Refactor] Remove unused file - frontend,ready - by yewentao256 (merged: 2026-01-16 06:59 (UTC+8))
#31827 [MoE Refactor][17/N] Apply Refactor to Bf16 - ready,nvidia - by zyongye (merged: 2026-01-16 04:53 (UTC+8))
#32238 [ROCM] DSfp4 mla projection gemms weight dynamic quantization - rocm,ready,v1 - by maleksan85 (merged: 2026-01-16 04:13 (UTC+8))
#32416 [BugFix] Python file source reading can fail on UnicodeDecodeError - bug,ready - by zou3519 (merged: 2026-01-16 04:01 (UTC+8))
#32421 [UX] Use kv_offloading_backend=native by default - ready,v1,kv-connector - by mgoin (merged: 2026-01-16 02:55 (UTC+8))
#32264 [ROCm] [CI] [Release] Rocm wheel pipeline with sccache - rocm,ready,ci/build - by tjtanaa (merged: 2026-01-16 02:56 (UTC+8))
#32285 [Quant] Support MXFP4 W4A16 for compressed-tensors MoE models - ready,quantization - by dsikka (merged: 2026-01-15 23:25 (UTC+8))
#32362 [BugFix] Fix assert x_s.shape[-1] == x_q.shape[-1] // group_shape[1] in Blackwell Quantized MoE Test - bug,ready - by LucasWilkinson (merged: 2026-01-16 02:19 (UTC+8))
#30361 [Attention][AMD] Make flash-attn optional - rocm,speculative-decoding,ready,v1 - by mgehre-amd (merged: 2026-01-16 01:18 (UTC+8))
#32131 fixing podman build issue - rocm,ready,ci/build,meta-exported,fb-exported - by smitkadvani (merged: 2026-01-16 01:07 (UTC+8))
#32359 [Feature] Support async scheduling + PP - ready,v1 - by yewentao256 (merged: 2026-01-16 01:06 (UTC+8))
#32350 [ROCm][CI] Pin transformers 4.57.3 to fix jina test failures - rocm,ready,ci/build - by AndreasKaratzas (merged: 2026-01-15 15:19 (UTC+8))
#32348 [Model Runner V2] Support FlashInfer backend & Fix CUDA Graph bug [1/2] - bug,v1,nvidia - by WoosukKwon (merged: 2026-01-16 00:59 (UTC+8))
#32413 [ROCm][Bugfix] Disable hip sampler to fix deepseek’s accuracy issue on ROCm - bug,rocm,ready,v1,deepseek - by ganyi1996ppo (merged: 2026-01-16 00:35 (UTC+8))
#29887 [ROCm][Perf] Enable shuffle kv cache layout and assembly paged attention kernel for AiterFlashAttentionBackend - rocm,ready,v1 - by ganyi1996ppo (merged: 2026-01-15 23:29 (UTC+8))
#32339 [Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill - ready,nvidia - by MatthewBonanni (merged: 2026-01-15 22:49 (UTC+8))
#31715 [ROCm] Improve error handling while loading quantized model on gfx120… - rocm,ready - by brian033 (merged: 2026-01-15 20:16 (UTC+8))
#32406 [3/N] Group together media-related code - ready,multi-modality - by DarkLight1337 (merged: 2026-01-15 19:52 (UTC+8))
#32372 [CI][BugFix][AMD][FP8] Fix test_rms_norm so it runs correctly on ROCm - bug,rocm,ready - by rasmith (merged: 2026-01-15 19:05 (UTC+8))
#31995 [ROCM] Add ROCm image build to release pipeline - rocm,ready,ci/build - by dllehr-amd (merged: 2026-01-15 19:01 (UTC+8))
#32396 [Refactor] [11/N] to simplify the mcp architecture - structured-output,frontend,ready,v1,gpt-oss - by chaunceyjiang (merged: 2026-01-15 18:49 (UTC+8))
#32337 [Benchmark] [Feature] add vllm bench sweep startup command - documentation,performance,ready - by lengrongfu (merged: 2026-01-15 17:25 (UTC+8))
#32382 [2/N] Move cache factories to MM registry - ready,v1,multi-modality - by DarkLight1337 (merged: 2026-01-15 17:02 (UTC+8))
#32389 [Model] Avoid token selection in SigLIP pooling head - ready - by DarkLight1337 (merged: 2026-01-15 17:01 (UTC+8))
#32321 fix: avoid crash on zero-arg tool calls in glm4 parser - ready - by seekskyworld (merged: 2026-01-15 16:45 (UTC+8))
#32314 [Bugfix] Strengthen the check of X-data-parallel-rank in Hybrid LB mode - bug,frontend,ready,v1 - by dtcccc (merged: 2026-01-15 16:31 (UTC+8))
#32369 [Refactor] [10/N] to simplify the vLLM openai completion serving architecture - frontend,ready,v1,gpt-oss - by chaunceyjiang (merged: 2026-01-15 15:41 (UTC+8))
#32312 [Bugfix] Fix stale common_attn_metadata.max_seq_len in speculative decoding with Eagle - bug,speculative-decoding,ready,v1 - by ofirzaf (merged: 2026-01-15 14:39 (UTC+8))
#32361 [BugFix] Fix DeepSeek-V3.1 + DeepGEMM incompatible scale shapes - bug,ready,deepseek - by LucasWilkinson (merged: 2026-01-15 14:32 (UTC+8))
#32376 [code clean] remove duplicate check - ready,v1,multi-modality - by andyxning (merged: 2026-01-15 13:29 (UTC+8))
#32201 [CI][AMD][Quantization][BugFix] Fix fp8 max in quant_utils.py and update test_fp8_quant.::test_static_fp8_quant_group_2d to use correct fp8 dtype and adjust atol/rtol - bug,rocm,ready - by rasmith (merged: 2026-01-15 13:04 (UTC+8))
#32355 [ROCm][CI] Disable async scheduling on ROCm for test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9] - rocm,structured-output,ready,v1,llama - by micah-wil (merged: 2026-01-15 12:53 (UTC+8))
#31997 [CI/Build][Hardware][AMD] Fix v1/shutdown - rocm,ready,v1 - by rjrock (merged: 2026-01-15 12:01 (UTC+8))
#31867 [Bugfix] Add CpuCommunicator.dispatch and combine to fix DP+MoE inference - bug,ready,cpu - by kzwrime (merged: 2026-01-15 12:50 (UTC+8))
#32366 [Misc] Remove redundant line - ready - by Potabk (merged: 2026-01-15 12:29 (UTC+8))
#32345 Support configure skip_special_tokens in openai response api - frontend,ready - by 842974287 (merged: 2026-01-15 12:07 (UTC+8))
#32342 Fix optional parameter parsing in MiniMax M2 tool parser #32278 - ready - by baonudesifeizhai (merged: 2026-01-15 12:05 (UTC+8))

PRs Closed Without Merging

#32442 [Bugfix] Abort engine requests on client disconnect - bug,frontend - by JayZenith (closed: 2026-01-16 10:16 (UTC+8))
#16606 [Kernel] Adding basic Triton JitCache for triton_attn - needs-rebase,ci/build,stale,v1 - by bringlein (closed: 2026-01-16 10:13 (UTC+8))
#32439 [Bugfix] Abort engine requests on client disconnect - bug,frontend - by JayZenith (closed: 2026-01-16 08:46 (UTC+8))
#32293 support non gated fused moe for compressed tensors w8a8 int8 - ready - by NVShreyas (closed: 2026-01-16 08:27 (UTC+8))
#32371 v1: Standardize request validation errors to VLLMValidationError - needs-rebase,v1 - by JayZenith (closed: 2026-01-16 05:46 (UTC+8))
#30395 [ROCm] [CI] [Release] Add rocm wheel release pipeline - rocm,needs-rebase,ci/build - by tjtanaa (closed: 2026-01-16 03:45 (UTC+8))
#32387 [FEATURE] Add support for capturing hidden states in Qwen3MoeLLMModel… - qwen - by tzhouam (closed: 2026-01-16 00:38 (UTC+8))
#32392 Support custom URI schemes and trace handlers for profiler - v1,meta-exported,fb-exported - by brad-mengchi (closed: 2026-01-15 15:13 (UTC+8))
#27958 Make pre-commit work on fedora - no labels - by rabi (closed: 2026-01-15 11:42 (UTC+8))