vLLM Development Activity Report - 2026-01-07

Time window: 2026-01-07 10:53 (UTC+8) ~ 2026-01-08 10:53 (UTC+8)
Statistics: 25 new issues | 11 closed issues | 55 new PRs | 37 merged PRs | 14 PRs closed without merging

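📊 Daily Development Summary

Over the past 24 hours the vLLM community remained highly active: 25 new issues and 55 new PRs were opened (80 in total), while 11 issues were closed, 37 PRs were merged, and 14 PRs were closed without merging. Development focused on model support (particularly complex architectures such as DeepSeek and Kimi), performance optimization (MoE, the CPU backend), and architectural refactoring (reorganizing the attention modules). Problems on AMD platforms also continued to receive attention and fixes.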

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was brisk in this window, centered on bug fixes, test improvements, and feature support.

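Issues:
- #31920: [Bug]: Prefix cache hit rate remains 0 in multi-round conversation with history of identical prompts. (user: fdarbeha-amd-com)
  - Description: An AMD employee reports that in multi-round conversations on ROCm the prefix cache hit rate stays at 0 even when the prompt history repeats.
  - Technical details: The issue carries the rocm label and is assigned to AMD team member @hongxiayang. Initial triage is waiting on the reporter for a more detailed runtime environment and the server launch command.
  - Impact: This may degrade inference that relies on cached history on ROCm. It still has to be determined whether the cause lies in the caching logic or in a ROCm-specific backend implementation.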
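To make the reported metric concrete, here is a toy sketch of block-hash prefix caching and how a hit rate can be computed from it. It is an illustration under stated assumptions, not vLLM's implementation: `BLOCK_SIZE`, `block_hashes`, and `ToyPrefixCache` are hypothetical names, and the hashing scheme is simplified.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (illustrative value, an assumption)


def block_hashes(token_ids):
    """Hash each full block together with the hash of the block before it."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    """Hypothetical cache that only tracks which block hashes were seen."""

    def __init__(self):
        self.cached = set()
        self.queried_tokens = 0
        self.hit_tokens = 0

    def lookup_and_insert(self, token_ids):
        hashes = block_hashes(token_ids)
        hits = 0
        for h in hashes:  # count the longest cached prefix, block by block
            if h not in self.cached:
                break
            hits += 1
        self.cached.update(hashes)
        self.queried_tokens += len(token_ids)
        self.hit_tokens += hits * BLOCK_SIZE
        return hits * BLOCK_SIZE

    @property
    def hit_rate(self):
        if self.queried_tokens == 0:
            return 0.0
        return self.hit_tokens / self.queried_tokens


cache = ToyPrefixCache()
round1 = list(range(10))            # first turn of the conversation
round2 = round1 + [100, 101, 102]   # same history plus a new turn
cache.lookup_and_insert(round1)
reused = cache.lookup_and_insert(round2)
print(reused, round(cache.hit_rate, 3))  # 8 0.348
```

In this toy model a second round that repeats the history reuses two full blocks, so a hit rate that stays at exactly 0 across rounds would point at a lookup that never matches rather than at a small cache.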
PRs (opened in this window, labeled ready):
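- #31931: [ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling (user: AndreasKaratzas)
  - Description: Fixes a MoE LoRA accuracy regression on ROCm introduced by PR #31676. That PR changed the order of operations in the Triton kernel so that the router weight multiplication ran in low precision. Because ROCm and CUDA handle mixed precision differently, the two backends then produced different routing results.
  - Technical details: Moves the MUL_ROUTED_WEIGHT operation back to before the precision cast so that it runs in float32, while keeping the bias addition after dequantization, which is the correct order.
  - Impact: Restores the accuracy of MoE + LoRA inference on ROCm and keeps results consistent across platforms.
- #31905: [ROCm]Skip test_torchao.py::test_pre_quantized_model on CDNA3 arch (user: ZhiweiYan-96)
  - Description: Skips a specific quantization test on CDNA3 architectures (such as MI300), because these architectures support only the fp8_e4m3_fnuz data type while the checkpoint used by the test is fp8_e4m3.
  - Impact: Avoids CI failures caused by a hardware support difference and keeps the test suite meaningful.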
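The data type mismatch behind the skipped test can be illustrated with a small decoder. This is a sketch based on the commonly documented bit layouts (exponent bias 7 for e4m3fn, bias 8 and a single NaN encoding for e4m3_fnuz); `decode_fp8_e4m3` is a hypothetical helper, not part of vLLM or torchao.

```python
def decode_fp8_e4m3(byte, fnuz):
    """Decode one FP8 E4M3 byte; fnuz=True selects the e4m3_fnuz variant."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if fnuz:
        if byte == 0x80:
            return float("nan")  # the only NaN; fnuz has no negative zero
        bias = 8
    else:
        if exp == 0xF and mant == 0x7:
            return float("nan")  # S.1111.111 encodes NaN in e4m3fn
        bias = 7
    if exp == 0:
        return sign * (mant / 8) * 2.0 ** (1 - bias)  # subnormal range
    return sign * (1 + mant / 8) * 2.0 ** (exp - bias)


print(decode_fp8_e4m3(0x38, fnuz=False))  # 1.0
print(decode_fp8_e4m3(0x38, fnuz=True))   # 0.5
print(decode_fp8_e4m3(0x7E, fnuz=False))  # 448.0, largest finite e4m3fn value
print(decode_fp8_e4m3(0x7F, fnuz=True))   # 240.0, largest finite fnuz value
```

The same byte decodes to values that differ by a factor of two for normal numbers, which is why a checkpoint stored in one variant cannot simply be reinterpreted as the other.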
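PRs (in progress):
- #31924 / #31916: [0/N][Attention] Fix miscellaneous pre-commit issues and [1/N][Attention] Restructure attention: move files (user: MatthewBonanni)
  - Description: Although the main goal is a code refactor, both PRs carry the rocm label, which suggests the restructuring accounts for ROCm backend compatibility so that the AMD code paths are updated in step. #31924 was merged within the window (2026-01-08 09:15); #31916 remains open.
- #31929: [ROCm][CI] Fix test script to respect Buildkite parallelism settings (user: AndreasKaratzas)
  - Description: Fixes the ROCm CI test script so that it honors Buildkite's parallel job configuration and avoids running the same tasks more than once.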
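As a rough illustration of what respecting Buildkite parallelism involves, the sketch below shards a test list using the `BUILDKITE_PARALLEL_JOB` and `BUILDKITE_PARALLEL_JOB_COUNT` environment variables that Buildkite sets on parallel steps. The function `select_shard` and the file names are hypothetical; the actual PR changes a CI script whose contents are not shown in this report.

```python
import os


def select_shard(test_files, env=None):
    """Return the subset of tests that this parallel job should run."""
    env = os.environ if env is None else env
    # Outside a parallel step the variables are unset: run everything.
    index = int(env.get("BUILDKITE_PARALLEL_JOB", "0"))
    count = int(env.get("BUILDKITE_PARALLEL_JOB_COUNT", "1"))
    ordered = sorted(test_files)  # every job must see the same order
    return [f for i, f in enumerate(ordered) if i % count == index]


files = ["test_a.py", "test_b.py", "test_c.py", "test_d.py", "test_e.py"]
job0 = select_shard(files, {"BUILDKITE_PARALLEL_JOB": "0",
                            "BUILDKITE_PARALLEL_JOB_COUNT": "2"})
job1 = select_shard(files, {"BUILDKITE_PARALLEL_JOB": "1",
                            "BUILDKITE_PARALLEL_JOB_COUNT": "2"})
print(job0)  # ['test_a.py', 'test_c.py', 'test_e.py']
print(job1)  # ['test_b.py', 'test_d.py']
```

Each test lands in exactly one shard, so no job repeats work done by another.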

💬 High-Traffic Discussions

Discussion in this window focused on model usage and troubleshooting.

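Issue #31859: Intermittent 500 errors when stress-testing Qwen3-VL-2B-Instruct
- Core topic: User AlpacaKnight reports that during a constant-QPS stress test of a Qwen3-VL model on H20 GPUs, the server intermittently hangs and returns 500 errors, then recovers on its own after a while.
- Positions:
  - Reporter: Provided detailed reproduction steps and hit the problem on several vLLM versions.
  - Maintainer (@DarkLight1337): Suggested setting VLLM_SERVER_DEV_MODE=1 and VLLM_LOGGING_LEVEL=DEBUG to capture a more detailed error stack trace and locate the root cause (an inference engine crash versus blocked request handling).
- Points of contention: None. The discussion is collaborative and aimed at gathering more diagnostics.
- Status: Open. The reporter has enabled debug mode as suggested but has not yet posted new logs.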
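One way to apply the maintainer's suggestion from a launcher script is sketched below. The two environment variable names come from the discussion above; `debug_server_env` is a hypothetical helper, and the `Qwen/` organization prefix in the model ID is an assumption.

```python
import os


def debug_server_env(base_env=None):
    """Copy the environment and add the two suggested debug settings."""
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_SERVER_DEV_MODE"] = "1"
    env["VLLM_LOGGING_LEVEL"] = "DEBUG"
    return env


# Model ID prefix is an assumption; substitute the path actually served.
command = ["vllm", "serve", "Qwen/Qwen3-VL-2B-Instruct"]
env = debug_server_env({"PATH": "/usr/bin"})
print(env["VLLM_SERVER_DEV_MODE"], env["VLLM_LOGGING_LEVEL"])  # 1 DEBUG

# To actually launch the server with these settings:
# import subprocess
# subprocess.run(command, env=debug_server_env(), check=True)
```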
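Issue #31889: KimiLinear MLA initialization failure
- Core topic: The MLA path of the KimiLinear model fails to initialize because the indexer_rotary_emb argument is missing.
- Positions:
  - Reporter (aaarkai): Provided a detailed error analysis and proposed fix code.
  - Community member (@jeejeelee): Pointed out that the problem was already fixed in a recent commit and provided a reference link.
- Outcome: The problem is confirmed fixed by commit e3fbb6f1. The reporter then closed the issue and the related fix PR (#31892).
- Status: Closed.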
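Issue #31903: DeepSeek-V3.2 fails under TP8+PP4
- Core topic: When running DeepSeek-V3.2 with the Ray distributed backend, an initialization error occurs once PP (pipeline parallelism) is set to 4. The issue itself is titled "TP8+PP8".
- Positions:
  - Reporter (lengrongfu): Provided the full error log and noted that TP alone and TP8+PP2 both work.
  - Maintainer (@esmeetu): Suggested running with TP only to narrow down the problem.
- Discussion focus: The preliminary assessment is that the problem is tied to the specific configuration at PP=4 or to process mapping under the Ray backend.
- Status: Open. A linked fix PR (#31937) has been submitted to address a rank mapping error in PP mode.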
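For orientation, the sketch below shows one common way to map a global rank to pipeline and tensor coordinates, with the tensor-parallel ranks of a stage kept contiguous. The layout is an assumption made for illustration; this report does not confirm that it matches the mapping PR #31937 touches, and `rank_to_coords` is a hypothetical name.

```python
def rank_to_coords(rank, tp_size, pp_size):
    """Map a global rank to (pp_rank, tp_rank), TP ranks contiguous per stage."""
    world_size = tp_size * pp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    return divmod(rank, tp_size)


# TP8+PP4 gives a world size of 32; with 8 GPUs per node under this
# layout, each pipeline stage would occupy exactly one node.
print(rank_to_coords(0, 8, 4))   # (0, 0)
print(rank_to_coords(17, 8, 4))  # (2, 1)
print(rank_to_coords(31, 8, 4))  # (3, 7)
```

If the placement of workers onto nodes disagrees with a mapping like this, ranks that should share a stage can end up on different nodes, which is the kind of error a rank mapping fix would target.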

🔥 Hot Topics and Trends

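Model support and compatibility:
- DeepSeek-V3.2 problems cluster: Several issues involve this model under different configurations (distributed execution errors, support on new GPUs), reflecting the challenge of integrating a new model architecture.
- Kimi and OpenPangu models: Parameter-passing problems in the MLA architecture surfaced in several models but were fixed quickly, showing a fast response to emerging architectures.
- Multimodal model stability: Crashes of Qwen3-VL models under async scheduling were reported and fixed more than once, indicating that asynchronous handling in multimodal inference remains a hard problem.
New features and integration proposals:
- MCP tool integration: Issue #31917 proposes adding Model Context Protocol support to vLLM for more standardized tool calling and agent workflows.
- Draft-model speculative decoding: Issue #31883 and PR #31886 reflect community demand for, and implementation attempts at, speculative decoding with a separate draft model.
Performance and correctness verification:
- MoE LoRA kernel verification: Issue #31912 reports in detail the correctness verification results of the fused MoE LoRA Triton kernel across a range of configurations, reflecting a high testing bar for core operators.
- Test stability: Several PRs fix or skip flaky test cases to make CI more reliable.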
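The draft-model speculative decoding mentioned above (Issue #31883, PR #31886) can be sketched in its simplest greedy form: a small draft model proposes several tokens and the target model keeps the longest prefix it agrees with. This is a toy version with hypothetical callables `draft_next` and `target_next`; real systems verify all proposed positions in one batched forward pass and may use probabilistic acceptance.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Run one greedy draft-and-verify step and return the accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed token in order.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched: the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    # Agrees with the target on short contexts, then diverges.
    return (ctx[-1] + 1) % 10 if len(ctx) < 3 else 0


print(speculative_step([1], draft_next, target_next, k=3))  # [2, 3, 4]
```

The output always equals what the target model would have generated greedily on its own, which is the property that makes the technique safe to enable.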

🛠️ Key Technical Changes

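PR #31916 / RFC #31919: Attention module restructuring
- Analysis: This is the first step of a large-scale refactoring plan that reorganizes scattered attention code (layers, backends, operators) into a more sensible directory layout (for example vllm/model_executor/layers/attention and vllm/v1/attention).
- Impact: Improves readability and maintainability, and lays the groundwork for later optimizations such as fused KV cache updates and prefill/decode separation.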
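PR #31931: Fix for the ROCm MoE LoRA accuracy regression
- Analysis: This is a typical cross-platform compatibility fix. It shows how the order of operations in a Triton kernel affects numerical precision, and how CUDA and ROCm backends can differ subtly in mixed-precision scenarios.
- Impact: Ensures the accuracy of MoE + LoRA, an important feature, on AMD platforms, which matters greatly to users in the AMD ecosystem.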
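The effect of operation order on precision can be reproduced outside any kernel. In the sketch below, Python floats stand in for the float32 accumulator and IEEE half precision stands in for the low-precision type; both are assumptions for illustration, and the code is not the Triton kernel from the PR.

```python
import struct


def to_f16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def scale_then_cast(acc, weight):
    """Multiply by the router weight in high precision, then cast down."""
    return to_f16(acc * weight)


def cast_then_scale(acc, weight):
    """Cast down first, then multiply in low precision."""
    return to_f16(to_f16(acc) * to_f16(weight))


acc = 1.0 + 2.0 ** -11  # exactly representable in float32, not in float16
weight = 3.0
print(scale_then_cast(acc, weight))  # 3.001953125
print(cast_then_scale(acc, weight))  # 3.0
```

Casting first discards the low-order bits before the multiplication can use them, so the two orders disagree in the last place. Across many experts and tokens, differences of this size are enough to change downstream results.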
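PR #31867: Fix for DP+MoE inference on the CPU backend
- Analysis: Adds the missing dispatch and combine method implementations to CpuCommunicator, resolving inference errors that occurred when running MoE models under data parallelism on CPU, where sequence parallelism is triggered.
- Impact: Rounds out the vLLM CPU backend so that it correctly supports more complex parallel modes.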
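To give a feel for what a dispatch/combine pair does in data-parallel MoE, here is a single-process toy that models dispatch as an all-gather and combine as a reduce-scatter. That pairing is one naive scheme assumed here for illustration; it is not taken from the PR, and the functions below are hypothetical stand-ins rather than CpuCommunicator's methods.

```python
def dispatch(per_rank_tokens):
    """All-gather: every rank receives the tokens of all ranks, in rank order."""
    gathered = [tok for tokens in per_rank_tokens for tok in tokens]
    return [list(gathered) for _ in per_rank_tokens]


def combine(per_rank_partial, tokens_per_rank):
    """Reduce-scatter: sum partial outputs, then return each rank's own slice."""
    summed = [sum(vals) for vals in zip(*per_rank_partial)]
    out, start = [], 0
    for n in tokens_per_rank:
        out.append(summed[start:start + n])
        start += n
    return out


def expert(rank, value):
    """Toy experts: the one on rank 0 doubles, the one on rank 1 negates."""
    return value * 2.0 if rank == 0 else -value


local_tokens = [[1.0, 2.0], [3.0]]  # rank 0 holds two tokens, rank 1 holds one
views = dispatch(local_tokens)
# Toy routing: gathered token i goes to the expert hosted on rank i % 2.
partials = [
    [expert(rank, tok) if i % 2 == rank else 0.0 for i, tok in enumerate(view)]
    for rank, view in enumerate(views)
]
result = combine(partials, [len(t) for t in local_tokens])
print(result)  # [[2.0, -2.0], [6.0]]
```

If either half of the pair is missing, each rank either never sees tokens routed to its experts or never gets its own outputs back, which is consistent with the incorrect output reported for DP2 + MoE on CPU.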

📈 Development Activity Observations

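Diverse contributors: Active contributors include AMD employees (usernames with an -amd suffix), NVIDIA-related developers (identified by the nvidia label), vLLM core team members, and many community developers. Users also frequently help one another in issue discussions.
High merge throughput: 37 PRs were merged in the window against 55 newly opened, a ratio of roughly 67%. The merged PRs include ones opened before the window, so this is a throughput ratio rather than a same-day merge rate. It still suggests strong review and merge capacity in the core team, with many fixes and improvements reaching the main branch quickly.
Fast response and closure: For problems that were already fixed (such as the KimiLinear bug), the community quickly identified and closed the redundant issue and PR, keeping the project tidy.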

💡 Issues to Watch

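Distributed execution problems with DeepSeek-V3.2 (Issues #31903, #31936): Support for this model under complex parallel configurations (TP+PP), especially on new hardware (RTX PRO 6000 Blackwell), is not yet stable and needs attention from core developers.
Prefix cache failure on ROCm (Issue #31920): If confirmed, this would hurt inference performance for chat applications on AMD platforms and calls for joint investigation by the AMD team and core developers.
Multimodal model stability under async scheduling: The problems seen with Qwen3-VL and similar models hint at deeper race conditions or resource management problems in the asynchronous multimodal pipeline.
Non-determinism in the Eagle DP test (Issue #31913): The speculative decoding test is unstable, which may expose underlying problems in batch invariance or in the speculative decoding implementation and affects reliability of the feature.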

📋 Appendix: Detailed Data

New Issues

#31903 [Bug]: TP8+PP8 running deepseek-3.2 error | bug | by lengrongfu (created: 2026-01-07 22:48 (UTC+8))
#31871 [Bug]: Streaming mode with --tool-call-parser hermes returns raw text instead of parsed tool_calls | bug | by Xingsandesu (created: 2026-01-07 14:19 (UTC+8))
#31889 [Bug]: KimiLinear MLA: MLAModules init missing indexer_rotary_emb | bug | by aaarkai (created: 2026-01-07 18:02 (UTC+8))
#31936 [Bug]: run deepseek v3.2 failed，not support RTX PRO 6000 * 8？ | bug | by ljwps (created: 2026-01-08 09:57 (UTC+8))
#31920 [Bug]: Prefix cache hit rate remains 0 in multi-round conversation with history of identical prompts. | bug,rocm | by fdarbeha-amd-com (created: 2026-01-08 04:53 (UTC+8))
#31919 [RFC]: Attention Restructuring Tracker | RFC | by MatthewBonanni (created: 2026-01-08 04:44 (UTC+8))
#31923 [Bug]: Nemotron 3 nano - AttributeError: To support LoRA for MoE model, 'get_expert_mapping' must be implemented | bug | by OrlandoWhite88 (created: 2026-01-08 05:42 (UTC+8))
#31918 [Bug]: nvidia/DeepSeek-R1-NVFP4-v2 accuracy issue with NVFP4 dispatch (CUTEDSL MoE + DeepEP LL) | bug | by minosfuture (created: 2026-01-08 04:36 (UTC+8))
#31917 MCP Tool Integration for vLLM Inference | no labels | by PurpleSquirrelMedia (created: 2026-01-08 04:27 (UTC+8))
#31913 [Bug]: test_eagle_dp test is flaky | bug | by zou3519 (created: 2026-01-08 03:35 (UTC+8))
#31912 MoE LoRA Kernel Correctness Verification Report | no labels | by qywu (created: 2026-01-08 03:10 (UTC+8))
#31904 [Bug]: | bug | by lengrongfu (created: 2026-01-07 22:57 (UTC+8))
#31902 [Feature][Benchmarks] Be able to try a different prompt when sending the first test prompt instead of failing directly | feature request | by sducouedic (created: 2026-01-07 22:37 (UTC+8))
#31894 [Bug] hf_token argument to LLM in Python SDK ignored in vllm.transformer_utils.config | bug | by benglewis (created: 2026-01-07 19:32 (UTC+8))
#31901 [Bug]: Guided decoding with xgrammar failed on macOS CPU inference | bug | by Inokinoki (created: 2026-01-07 22:20 (UTC+8))
#31876 [CI Failure]: backend_xgrammar.py: Failed to advance FSM for request | ci-failure | by BlankRH (created: 2026-01-07 15:38 (UTC+8))
#31859 [Bug]: Intermittent 500 Internal Server Error When Stress-Testing Qwen3-VL-2B-Instruct with vLLM on H20 | bug | by AlpacaKnight (created: 2026-01-07 10:59 (UTC+8))
#31888 [Usage]: rollout slow | usage | by Sherlocktein (created: 2026-01-07 17:55 (UTC+8))
#31884 [Bug]: run Qwen3-30B-A3B on 8*A800 2 nodes with DP failed zmq.error.ZMQError | bug | by baoqian426 (created: 2026-01-07 17:28 (UTC+8))
#31883 [Feature]: draft model about spec decoding | feature request | by forever10086 (created: 2026-01-07 17:26 (UTC+8))
#31878 [Bug]: DeepSeek OCR Triton Error [CUDA] an illegal memory access on vLLM 0.11.2 on H100 | bug | by bojanlazarevski1 (created: 2026-01-07 16:33 (UTC+8))
#31877 [Feature]: Contribute changes from vllm/flash-attention fork back upstream | feature request | by h-vetinari (created: 2026-01-07 16:19 (UTC+8))
#31872 [Bug]: Qwen3-Next-80B-A3B-Instruct-NVFP4 cannot run | bug | by OliverBryant (created: 2026-01-07 15:15 (UTC+8))
#31864 [Bug][CPU Backend]: Gibberish output on CPU backend with DP2 + MoE Model | bug,cpu | by kzwrime (created: 2026-01-07 12:05 (UTC+8))
#31862 [Bug]: | bug | by kzwrime (created: 2026-01-07 11:29 (UTC+8))

Closed Issues

#31889 [Bug]: KimiLinear MLA: MLAModules init missing indexer_rotary_emb | bug | by aaarkai (closed: 2026-01-08 10:30 (UTC+8))
#31480 [BUG]: run deepseek v3.2 failed | usage | by ljwps (closed: 2026-01-08 09:58 (UTC+8))
#31708 [Bug]: When using image_embeds, ImageProcessorItems are used instead of ImageEmbeddingItems, causing an out-of-bounds array error. | bug | by NewZxy (closed: 2026-01-08 09:38 (UTC+8))
#31844 [Bug]: DeepEP LL Accuracy Issue with DeepGEMM E8M0 on B200 | bug | by robertgshaw2-redhat (closed: 2026-01-08 06:48 (UTC+8))
#29539 [Bug]: FULL_AND_PIECEWISE cudagraph mode leading to !!! in generated text | bug | by xyang16 (closed: 2026-01-08 00:20 (UTC+8))
#31904 [Bug]: | bug | by lengrongfu (closed: 2026-01-07 22:57 (UTC+8))
#31697 [New Model]: HY-MT1.5-1.8B | new-model | by Busboy3129 (closed: 2026-01-07 22:36 (UTC+8))
#29625 [Bug][CPU Backend]: EngineCore crashes on ARM CPU when enforce_eager=False | bug,cpu | by Elm8116 (closed: 2026-01-07 21:32 (UTC+8))
#31679 [Bug]: Qwen3-VL-8B crashes on latest nightly with --async-scheduling | bug | by rnik12 (closed: 2026-01-07 14:45 (UTC+8))
#30360 [RFC]: Consolidate FP8 min/max values into somewhere reasonable (Python only) | rocm,RFC | by rasmith (closed: 2026-01-07 14:55 (UTC+8))
#31862 [Bug]: | bug | by kzwrime (closed: 2026-01-07 11:29 (UTC+8))

New PRs

#31938 moe-offload-fixed | ci/build,v1 | by wangyxbh (created: 2026-01-08 10:46 (UTC+8))
#31937 [Bugfix]some time rank mapping error when use ray in pp | v1 | by lengrongfu (created: 2026-01-08 10:33 (UTC+8))
#31933 Naive dispatch combine POC | nvidia | by robertgshaw2-redhat (created: 2026-01-08 08:14 (UTC+8))
#31867 [Bugfix] Add CpuCommunicator.dispatch and combine to fix DP+MoE inference | cpu | by kzwrime (created: 2026-01-07 12:44 (UTC+8))
#31892 [BugFix][Model]: Pass indexer_rotary_emb to MLA modules | no labels | by aaarkai (created: 2026-01-07 18:43 (UTC+8))
#31931 [ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling | rocm,ready | by AndreasKaratzas (created: 2026-01-08 07:33 (UTC+8))
#31905 [ROCm]Skip test_torchao.py::test_pre_quantized_model on CDNA3 arch | rocm,ready | by ZhiweiYan-96 (created: 2026-01-07 23:52 (UTC+8))
#31915 [BugFix] Fix flakiness in test_eagle_dp for PyTorch 2.10 | ready,v1 | by zou3519 (created: 2026-01-08 03:40 (UTC+8))
#31934 [CI] Remove torch nightly checks | ci/build | by simon-mo (created: 2026-01-08 08:17 (UTC+8))
#31886 [Feature][Bugfix] Support draft model tp any of speculative decode | speculative-decoding,v1 | by stormchasingg (created: 2026-01-07 17:35 (UTC+8))
#31930 [Bugfix][MTP] Fix for tool calling performance degradation on GLM models with MTP on | frontend | by andyl98 (created: 2026-01-08 07:27 (UTC+8))
#31924 [0/N][Attention] Fix miscellaneous pre-commit issues | rocm,ready | by MatthewBonanni (created: 2026-01-08 05:51 (UTC+8))
#31935 LoRA Pipelining | v1 | by kfhfar (created: 2026-01-08 08:30 (UTC+8))
#31932 [CI] Skip Qwen-VL in multimodal processing tests due to flaky external dependency | ready,multi-modality,qwen | by AndreasKaratzas (created: 2026-01-08 07:48 (UTC+8))
#31899 UX: add vLLM env info in '/server_info' | frontend,ready | by jeejeelee (created: 2026-01-07 21:39 (UTC+8))
#31865 [refactor] refactor memory constants usage | ready,v1,multi-modality,cpu | by andyxning (created: 2026-01-07 12:10 (UTC+8))
#31929 [ROCm][CI] Fix test script to respect Buildkite parallelism settings | rocm,ci/build | by AndreasKaratzas (created: 2026-01-08 06:50 (UTC+8))
#31911 Add back missing DeepEP LL params | ready | by elvircrn (created: 2026-01-08 02:25 (UTC+8))
#31928 [ROCm][CI] Fix attention backend test flakiness from uninitialized KV cache memory | rocm,v1 | by AndreasKaratzas (created: 2026-01-08 06:45 (UTC+8))
#31926 [Quant] Support MXFP4 W4A16 for compressed-tensors dense models | no labels | by mgoin (created: 2026-01-08 06:22 (UTC+8))
#31916 [1/N][Attention] Restructure attention: move files | documentation,performance,rocm,tpu,speculative-decoding,v1,multi-modality,llama,qwen,deepseek | by MatthewBonanni (created: 2026-01-08 03:56 (UTC+8))
#31927 [Fix] Enable mm_processor_cache with vision LoRA | v1,multi-modality | by prashanth058 (created: 2026-01-08 06:23 (UTC+8))
#31925 [core] [DO NOT REVIEW] Enabling Zero-Copy Video with PyNvVideoCodec and IPC | frontend,needs-rebase,ci/build,v1,multi-modality,nvidia | by brandonpelfrey (created: 2026-01-08 06:09 (UTC+8))
#31908 [BugFix] Fix bad words with speculative decoding | ready,v1 | by njhill (created: 2026-01-08 00:59 (UTC+8))
#31922 [ROCm][CI] Add rocm support for run-multi-node-test.sh | rocm,ci/build | by charlifu (created: 2026-01-08 05:24 (UTC+8))
#31896 [Frontend] extract tool calls before reasoning to prevent marker abso… | frontend | by daniel-salib (created: 2026-01-07 21:18 (UTC+8))
#31921 [Bugfix] Fix OpenAPI schema test failures | frontend | by AndreasKaratzas (created: 2026-01-08 05:08 (UTC+8))
#31914 fix memory for online fp8 quantization with streaming weight load | no labels | by vkuzo (created: 2026-01-08 03:36 (UTC+8))
#31868 [Doc] Fix: Correct vLLM announcing blog post link in docs | documentation | by Ayobami-00 (created: 2026-01-07 14:00 (UTC+8))
#31893 [Misc] Add VLLM_USE_FLASHINFER_ROPE to control the RoPE kernel for cuda | gpt-oss,nvidia | by elvischenv (created: 2026-01-07 18:49 (UTC+8))
#31909 [Core] Add Weighted Fair Queuing scheduler for proportional resource allocation | documentation,v1 | by ishrith-gowda (created: 2026-01-08 01:05 (UTC+8))
#31898 Enable quantized attention in NemotronH models | ready | by roikoren755 (created: 2026-01-07 21:29 (UTC+8))
#31873 [CI][BugFix][AMD] Actually skip tests marked @pytest.mark.skip_v1 | rocm,ci/build | by rasmith (created: 2026-01-07 15:19 (UTC+8))
#31910 [Feature][Benchmarks] Test run: try different prompts until success | performance | by sducouedic (created: 2026-01-08 01:09 (UTC+8))
#31881 [Feature][Benchmarks] Custom dataset: read output length from dataset | performance | by sducouedic (created: 2026-01-07 16:51 (UTC+8))
#31907 [Model] Enable Qwen3-MoE MXFP4 | qwen | by matkle (created: 2026-01-08 00:33 (UTC+8))
#31906 [Bugfix] Fix Var Length Batched Padding in Granite Speech | no labels | by alex-jw-brooks (created: 2026-01-08 00:21 (UTC+8))
#31890 [Models] Allow converting Qwen3-VL into Reranker model | documentation,frontend,qwen | by Isotr0py (created: 2026-01-07 18:19 (UTC+8))
#31870 [OpenAI] Extend VLLMValidationError to additional validation parameters | frontend,ready,v1 | by R3hankhan123 (created: 2026-01-07 14:09 (UTC+8))
#31897 [Refactor] Clean up pooler modules | ready,v1 | by DarkLight1337 (created: 2026-01-07 21:21 (UTC+8))
#31900 Ignore: dynres attempt | needs-rebase | by netanel-haber (created: 2026-01-07 21:58 (UTC+8))
#31891 [Chore] Migrate V0 attention utils | rocm,ready,v1 | by DarkLight1337 (created: 2026-01-07 18:35 (UTC+8))
#31895 Update modelopt KV cache quantization resolution to new scheme | no labels | by roikoren755 (created: 2026-01-07 20:21 (UTC+8))
#31874 [Feature] Add command-line argument support to basic.py example | documentation | by liangzhang-keepmoving (created: 2026-01-07 15:33 (UTC+8))
#31879 [Misc] Set default torch num threads for input processing | ready,v1 | by ywang96 (created: 2026-01-07 16:37 (UTC+8))
#31880 [ROCm][AITER] fix wrong argument passed to AITER flash_attn_varlen_func | rocm,ready,v1 | by vllmellm (created: 2026-01-07 16:43 (UTC+8))
#31887 [Misc] Unify get_kv_cache_stride_order code style | v1,nvidia | by NickLucche (created: 2026-01-07 17:41 (UTC+8))
#31882 [Frontend] reimplement beam-search reference transformers | frontend | by satoaoki1999 (created: 2026-01-07 17:18 (UTC+8))
#31885 [Kernel] Change get_min_capability to is_supported | cpu,nvidia | by BlankRH (created: 2026-01-07 17:31 (UTC+8))
#31869 [Model] Cleanup: Remove redundant manual definition of make_empty_intermediate_tensors in GLM-4-MoE | ready | by maang-h (created: 2026-01-07 14:07 (UTC+8))
#31875 [Feature] Add flag to disable FlashInfer autotune | no labels | by mmangkad (created: 2026-01-07 15:34 (UTC+8))
#31863 feat: add support for logging to file via VLLM_LOG_FILE env var | no labels | by leejianwoo-collab (created: 2026-01-07 11:35 (UTC+8))
#31866 refactor: find_loaded_library | ready,nvidia | by tom-zju (created: 2026-01-07 12:12 (UTC+8))
#31861 fix: LoRA type checking bug by replacing type() with isinstance() in … | no labels | by leejianwoo-collab (created: 2026-01-07 11:23 (UTC+8))
#31860 Lora cache invalidation | documentation,new-model,ci/build,v1,multi-modality,llama,qwen,nvidia | by nitinsurya (created: 2026-01-07 11:07 (UTC+8))

Merged PRs

#31286 fix(rocm): add early return in get_flash_attn_version for ROCm | rocm,ready | by rabi (merged: 2026-01-08 10:28 (UTC+8))
#31645 feat(moe): Add is_act_and_mul=False support for Triton MoE kernels | rocm,ready | by rabi (merged: 2026-01-08 10:27 (UTC+8))
#31924 [0/N][Attention] Fix miscellaneous pre-commit issues | rocm,ready | by MatthewBonanni (merged: 2026-01-08 09:15 (UTC+8))
#31899 UX: add vLLM env info in '/server_info' | frontend,ready | by jeejeelee (merged: 2026-01-08 01:13 (UTC+8))
#31415 [MoE Refactor][15/N] Apply Refactor to Fp8 | documentation,performance,ready,ci/build,llama,nvidia | by robertgshaw2-redhat (merged: 2026-01-08 08:42 (UTC+8))
#31865 [refactor] refactor memory constants usage | ready,v1,multi-modality,cpu | by andyxning (merged: 2026-01-08 02:37 (UTC+8))
#31676 [Bugfix][Kernel] fix bias adding in triton kernel implemented fused moe | bug,ready,gpt-oss | by xuebwang-amd (merged: 2026-01-07 15:36 (UTC+8))
#31911 Add back missing DeepEP LL params | ready | by elvircrn (merged: 2026-01-08 06:47 (UTC+8))
#31908 [BugFix] Fix bad words with speculative decoding | ready,v1 | by njhill (merged: 2026-01-08 04:46 (UTC+8))
#31781 [Kernel] Support bias type in grouped_topk kernel | performance,ready,deepseek,nvidia | by xyang16 (merged: 2026-01-08 04:16 (UTC+8))
#29499 [EPLB] Optimize EPLB with numpy | ready | by ilmarkov (merged: 2026-01-08 04:21 (UTC+8))
#29213 [Perf][Kernels] Enable FlashInfer DeepGEMM swapAB on SM90 (for W8A8 Linear Op) | documentation,performance,ready,v1,nvidia | by katec846 (merged: 2026-01-07 23:53 (UTC+8))
#31837 [Perf] Fuse stride preparation for NVFP4 cutlass_moe | performance,ready,nvidia | by mgoin (merged: 2026-01-08 02:31 (UTC+8))
#31868 [Doc] Fix: Correct vLLM announcing blog post link in docs | documentation | by Ayobami-00 (merged: 2026-01-08 02:06 (UTC+8))
#31898 Enable quantized attention in NemotronH models | ready | by roikoren755 (merged: 2026-01-08 01:37 (UTC+8))
#30761 [KVConnector]: Enable Cross-layers KV cache layout for MultiConnector | ready,v1,kv-connector | by kfirtoledo (merged: 2026-01-08 00:59 (UTC+8))
#31841 [Bugfix] Fix race condition in async-scheduling for vlm model | ready,v1 | by tianshu-Michael-yu (merged: 2026-01-07 14:45 (UTC+8))
#30751 [Bugfix]: prevent leaking tokens in crash log | ready,v1 | by dr75 (merged: 2026-01-08 00:15 (UTC+8))
#31870 [OpenAI] Extend VLLMValidationError to additional validation parameters | frontend,ready,v1 | by R3hankhan123 (merged: 2026-01-07 22:45 (UTC+8))
#31897 [Refactor] Clean up pooler modules | ready,v1 | by DarkLight1337 (merged: 2026-01-08 00:07 (UTC+8))
#31891 [Chore] Migrate V0 attention utils | rocm,ready,v1 | by DarkLight1337 (merged: 2026-01-07 21:44 (UTC+8))
#31779 [Refactor] GLM-ASR Modeling | ready | by JaredforReal (merged: 2026-01-07 21:08 (UTC+8))
#31880 [ROCm][AITER] fix wrong argument passed to AITER flash_attn_varlen_func | rocm,ready,v1 | by vllmellm (merged: 2026-01-07 19:25 (UTC+8))
#31757 [Bugfix][MTP] Fix GLM4 MoE fp8 loading with MTP on | ready | by andyl98 (merged: 2026-01-07 17:18 (UTC+8))
#30593 [Misc] Improve error messages for unsupported types and parameters | performance,ready,kv-connector,nvidia | by BlankRH (merged: 2026-01-07 17:00 (UTC+8))
#31869 [Model] Cleanup: Remove redundant manual definition of make_empty_intermediate_tensors in GLM-4-MoE | ready | by maang-h (merged: 2026-01-07 16:18 (UTC+8))
#31762 [XPU]fallback to TRITON_ATTN on xpu when use float32 dtype | ready,ci/build,v1 | by 1643661061leo (merged: 2026-01-07 16:10 (UTC+8))
#30808 [Refactor][TPU] Remove torch_xla path and use tpu-inference | documentation,performance,new-model,rocm,structured-output,frontend,tpu,speculative-decoding,ready,ci/build | by weiyu0824 (merged: 2026-01-07 16:07 (UTC+8))
#31104 [BugFix] LoRA: Support loading base_layer of experts | speculative-decoding,ready,llama,qwen,deepseek,gpt-oss | by HollowMan6 (merged: 2026-01-07 14:49 (UTC+8))
#31106 [Bugfix][Hardware][AMD] Consolidate FP8 min/max values helper function | rocm,ready | by c0de128 (merged: 2026-01-07 14:55 (UTC+8))
#31866 refactor: find_loaded_library | ready,nvidia | by tom-zju (merged: 2026-01-07 14:42 (UTC+8))
#31850 [Attention][3/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties | rocm,ready,v1 | by LucasWilkinson (merged: 2026-01-07 13:31 (UTC+8))
#31816 [ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend | rocm,ready,v1 | by vllmellm (merged: 2026-01-07 13:04 (UTC+8))
#31786 [Chore] Try remove init_cached_hf_modules | tpu,v1,ready-run-all-tests | by DarkLight1337 (merged: 2026-01-07 12:34 (UTC+8))
#31465 fixed mypy warnings for files vllm/v1/attention with TEMPORARY workaround | rocm,ready,v1,nvidia | by MrIceCreamMan (merged: 2026-01-07 12:08 (UTC+8))
#31855 Change warning in get_current_vllm_config to report caller's line number | ready | by tlrmchlsmth (merged: 2026-01-07 11:48 (UTC+8))
#31799 [Doc] Update release docs | documentation,ready | by DarkLight1337 (merged: 2026-01-07 11:27 (UTC+8))

PRs Closed Without Merging

#31892 [BugFix][Model]: Pass indexer_rotary_emb to MLA modules | no labels | by aaarkai (closed: 2026-01-08 10:31 (UTC+8))
#30775 [Feature]: Implement naive prepare/finalize class to replace naive dispatching in fused_moe/layer.py | needs-rebase | by teddygood (closed: 2026-01-08 10:29 (UTC+8))
#31935 LoRA Pipelining | v1 | by kfhfar (closed: 2026-01-08 09:04 (UTC+8))
#29769 Refactor MLA attention: move prefill logic to layer.py | rocm,ready,torch.compile,v1 | by therealnaveenkamal (closed: 2026-01-08 05:28 (UTC+8))
#30535 [Core] Add repetitive token detection for hallucination prevention | documentation,frontend,v1 | by jeremyteboul (closed: 2026-01-08 02:16 (UTC+8))
#30701 Dynres | needs-rebase | by netanel-haber (closed: 2026-01-07 21:50 (UTC+8))
#31861 fix: LoRA type checking bug by replacing type() with isinstance() in … | no labels | by leejianwoo-collab (closed: 2026-01-07 14:26 (UTC+8))
#31586 [Bugfix] Narrow broad exception in custom all-reduce detection | no labels | by c0de128 (closed: 2026-01-07 13:16 (UTC+8))
#31589 [Bugfix] Narrow broad exceptions in rank detection functions | no labels | by c0de128 (closed: 2026-01-07 13:16 (UTC+8))
#31616 [Bugfix] Narrow broad exceptions in compilation backends | ready | by c0de128 (closed: 2026-01-07 13:16 (UTC+8))
#31639 [Bugfix] Narrow broad exceptions in FLA shared memory detection | no labels | by c0de128 (closed: 2026-01-07 13:16 (UTC+8))
#31640 [Bugfix] Narrow broad exceptions in quick allreduce availability check | no labels | by c0de128 (closed: 2026-01-07 13:15 (UTC+8))
#31860 Lora cache invalidation | documentation,new-model,ci/build,v1,multi-modality,llama,qwen,nvidia | by nitinsurya (closed: 2026-01-07 11:07 (UTC+8))
#31854 refactor: refatcor find_loaded_library | documentation,performance,rocm,ci/build,multi-modality,kv-connector,nvidia | by tom-zju (closed: 2026-01-07 11:58 (UTC+8))