vLLM Development Activity Report - 2025-12-23

Time window: 2025-12-23 10:47 (UTC+8) ~ 2025-12-24 10:47 (UTC+8)
Statistics: New issues 14 | Closed issues 12 | New PRs 37 | Merged PRs 26 | PRs closed without merging 8

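📊 Daily Development Summary

vLLM saw heavy development activity between December 23 and 24: 26 PRs were merged and 12 issues were closed. Work centered on performance optimization (notably MoE kernels and sampling), multimodal feature enhancements, and continued improvement of support for hardware platforms such as AMD ROCm. Community discussion was lively, and several RFCs on core architecture optimization and developer experience appeared, suggesting that the project is now focused on fine-grained polish and forward-looking design.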

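🎯 AMD/ROCm Ecosystem Updates

In this window, AMD-related work consisted mainly of CI fixes and compatibility improvements, reflecting continued investment in the stability of the ROCm platform.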

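CI test and build fixes:
- PR #31242 ([ROCm][CI] Set VLLM_FLOAT32_MATMUL_PRECISION…): Fixes Terratorch plugin test failures in AMD CI that appeared after a PyTorch version update. The root cause is mixing the new torch.backends.cuda.matmul.fp32_precision API with the legacy API. The PR works around the problem for now by setting an environment variable, pending a follow-up update in PyTorch.
- PR #31251 ([Bugfix][Hardware][AMD] Use cub_helpers.h in sampler.cu…): Fixes a compilation error in sampler.cu when building vLLM from source on ROCm. The error came from not using the cub_helpers.h header, which unifies the namespace differences between CUDA CUB and hipCUB.
- PR #31227 ([ROCm][CI] Fix “Distributed Tests (H200)” Test): Adjusts the distributed test script for ROCm: it replaces the unsupported deepep_high_throughput All2All backend with the supported allgather_reducescatter, and uses the CPU for DP synchronization to work around CUDA graph limitations on ROCm.
- Issue #31244 / #31245 (CI Failure): AMD CI owner AndreasKaratzas (whose username carries the -amd suffix) reported two CI failures: a Terratorch TF32 API error in the plugin tests, and instability in the DeepSeek V2-Lite Async EPLB accuracy test. PR #31242 addresses the former; the latter was marked as a known issue and is planned for removal from CI.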
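
Runtime error fixes:
- PR #31203 ([ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention…): Fixes a RuntimeError that the multimodal encoder attention module could raise on ROCm when a tensor is non-contiguous. The fix replaces .view() with the more robust .reshape(). (Merged)
- PR #31235 ([ROCm][CI][Bugfix] Fix Siglip2 rotary embedding dispatch…): Fixes a platform-dispatch error in the Rotary Embedding function of the Siglip2 model (ROCm was incorrectly calling the CUDA implementation), and loosens the test tolerance for the InternVL video model on ROCm to account for numerical precision differences.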

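Analysis summary: AMD-related activity in this window was driven jointly by community members (non-AMD employees) and AMD's CI maintainers, and focused on build, test, and runtime compatibility problems. This suggests that vLLM's ROCm support is moving from "usable" toward "stable" and "high-performance", with ongoing cleanup of details in the cross-platform implementation. No changes directly related to the Quark quantization tool or to new MI300 features were found.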

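💬 High-Traffic Discussion Analysis

Issue #31128: “Add support of Blackwell SM121(DGX Spark)” (9 comments)
- Core topic: A request for native vLLM support for NVIDIA's latest Blackwell-architecture edge platform (DGX Spark, ARM64 + CUDA 13).
- Views and positions:
  - User (yanyunl1991): Pointed out that the main obstacles are official vLLM's strict dependency on PyTorch 2.9 and the lack of ARM64 CUDA 13 wheels, and that falling back to --enforce-eager costs performance.
  - Maintainers (DarkLight1337, eugr): Provided detailed build-from-source steps and noted that v0.13.0 already ships ARM64 wheels for CUDA 13. The key solution is to install with uv from a specific index.
- Point of contention: None of note. The discussion focused on clarifying the current state of support and offering concrete solutions, not on whether the platform should be supported.
- Current status: The user confirmed a successful run using the community-provided approach and thanked the helpers. The issue is closed.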
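
Issue #31229: “Early-Fail Tokenization Guard for Completions or Chat Completions” (3 comments)
- Core topic: How to prevent extremely long prompts (hundreds of millions of characters) from causing CPU OOM and hangs in the server during tokenization.
- Views and positions:
  - Proposer (scratch-ml): Analyzed the root cause in detail (tokenization runs first and the length check runs afterwards) and proposed three incremental solutions: 1) enable protective truncation during tokenization; 2) add a raw-size pre-check; 3) harden the async tokenizer.
  - Core developer (robertgshaw2-redhat): Strongly supports the change, viewing it not only as a stability matter but as a denial-of-service security issue. He favors option 1 but asked for confirmation that it is compatible with the Mistral tokenizer, and assigned the proposer as the owner.
- Point of contention: None so far. The proposal has been received positively and the discussion is refining the approach.
- Current status: Open, awaiting community consensus before the proposer implements it.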
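
Option 2 above, the raw-size pre-check, can be sketched in a few lines. This is a minimal illustration and not vLLM's implementation: `check_raw_prompt_size`, `PromptTooLargeError`, and the `max_chars_per_token` bound are hypothetical names and values chosen for the example. The idea is to reject a prompt by character count before any tokenizer work happens.

```python
class PromptTooLargeError(ValueError):
    """Raised before tokenization when a prompt is implausibly large."""


def check_raw_prompt_size(prompt: str, max_model_len: int,
                          max_chars_per_token: int = 16) -> None:
    """Reject a prompt by raw character count, before tokenizing it.

    max_chars_per_token is a hypothetical, deliberately generous bound
    on how many characters one token may cover. It is a heuristic, not
    a guarantee that holds for every tokenizer.
    """
    limit = max_model_len * max_chars_per_token
    if len(prompt) > limit:  # len() on a str is O(1)
        raise PromptTooLargeError(
            f"prompt has {len(prompt)} characters; "
            f"the pre-check limit is {limit}")


# A short prompt passes silently; an oversized one fails fast.
check_raw_prompt_size("hello world", max_model_len=4096)
try:
    check_raw_prompt_size("x" * 1_000_000, max_model_len=4096)
except PromptTooLargeError as err:
    print("rejected:", err)
```

A real bound would need tuning per tokenizer, since a single token can cover many characters. The point is only that the check is constant-time and runs before the expensive step.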
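
Issue #31219: “Concurrent requests with audio_embeds of different lengths crash EngineCore” (2 comments, but closed quickly)
- Core topic: The engine crashes when it concurrently processes requests whose audio embeddings have different lengths.
- Resolution: After receiving the report, maintainer DarkLight1337 quickly identified the root cause (a missing dynamic_dims setting) and immediately submitted PR #31223 to fix it. The user verified the fix and confirmed that the problem was resolved.
- Analysis: The significance of this issue lies in how fast it was answered and fixed, which shows the team's attention to the stability of multimodal features and its efficiency.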

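🔥 Hot Topics and Trend Analysis

Performance optimization reaches the kernel level: The focus of discussion is shifting from high-level architecture to low-level kernel optimization. PR #31246 adds a topk_sigmoid kernel for MoE models and reports a speedup of more than 2x. RFC #31216 proposes moving the gather in sampling to after the argmax in order to reduce communication and compute overhead, which particularly benefits speculative decoding.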
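
The reordering proposed in RFC #31216 can be illustrated with a small pure-Python sketch. It assumes, as an interpretation not spelled out in this report, that the logits are sharded along the vocabulary dimension across ranks and that sampling is greedy. All function names are hypothetical, and the list operations only simulate the collective communication.

```python
def local_argmax(shard: list[float], vocab_offset: int) -> tuple[float, int]:
    """Return (max_logit, global_token_id) for one vocabulary shard."""
    best = max(range(len(shard)), key=shard.__getitem__)
    return shard[best], vocab_offset + best


def greedy_gather_first(shards: list[list[float]]) -> int:
    """Baseline: gather the full logits, then take the argmax."""
    full = [x for shard in shards for x in shard]  # simulated all-gather
    return max(range(len(full)), key=full.__getitem__)


def greedy_argmax_first(shards: list[list[float]]) -> int:
    """Reordered: each rank reduces locally, then one pair per rank is gathered."""
    candidates = []
    offset = 0
    for shard in shards:  # one iteration per simulated rank
        candidates.append(local_argmax(shard, offset))
        offset += len(shard)
    # Highest logit wins; ties go to the lowest token id, as in the baseline.
    return max(candidates, key=lambda c: (c[0], -c[1]))[1]


shards = [[0.1, 2.0, -1.0], [0.5, 3.5, 0.0]]
assert greedy_gather_first(shards) == greedy_argmax_first(shards) == 4
```

Under these assumptions the gathered payload shrinks from the whole vocabulary to one (value, index) pair per rank, while the selected token stays the same.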
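Multimodal and tool-calling capabilities keep expanding:
- Feature enhancement: PR #31239 adds logprobs support to the Whisper transcription API, which counts as a vLLM extension feature.
- Model support: PR #31218 adds tool-call parser support for Google's FunctionGemma model.
- Bug fixes: PR #31223 fixes concurrent handling of audio embeddings of different lengths, and PR #31224 fixes the Jina reranker's support for mixed image-text input.
Distributed and hardware ecosystem adaptation: Beyond AMD, topics also covered NVIDIA Blackwell (#31128), LoRA transfer optimization (PR #31250), and request ID management under the P/D (prefill/decode disaggregation) architecture (PR #27987), showing how far the project is adapting to complex deployment environments.
Developer experience and infrastructure: RFC #31249 proposes restructuring how environment variables are declared, to resolve duplicate definitions and inconsistent types. It reflects the project's attention to code quality and maintainability as it scales. Several documentation PRs and CI-fix PRs likewise reflect upkeep of project health.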
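
One common way to avoid duplicate definitions and inconsistent types is to declare each variable exactly once, together with its type, default, and parser. The sketch below shows that general pattern only; it is not the design proposed in RFC #31249, and `EnvVar`, `parse_bool`, `EXAMPLE_TIMEOUT_S`, and `EXAMPLE_DEBUG` are hypothetical names, not vLLM variables.

```python
import os
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class EnvVar(Generic[T]):
    """A single declaration holding name, default, and parser together."""
    name: str
    default: T
    parse: Callable[[str], T]

    def get(self) -> T:
        # Read lazily so changes to os.environ are picked up on each call.
        raw = os.environ.get(self.name)
        return self.default if raw is None else self.parse(raw)


def parse_bool(raw: str) -> bool:
    """Accept a small, explicit set of truthy spellings."""
    return raw.strip().lower() in {"1", "true", "yes"}


# Hypothetical variables, each declared in exactly one place.
EXAMPLE_TIMEOUT_S = EnvVar("EXAMPLE_TIMEOUT_S", 30, int)
EXAMPLE_DEBUG = EnvVar("EXAMPLE_DEBUG", False, parse_bool)

os.environ["EXAMPLE_TIMEOUT_S"] = "45"
print(EXAMPLE_TIMEOUT_S.get(), EXAMPLE_DEBUG.get())
```

Because the parser travels with the declaration, every reader of a variable sees the same type, which is the inconsistency the RFC is reported to target.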

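🛠️ Key Technical Changes

PR #31246 ([Kernel] Add topk_sigmoid kernel): Adds a high-performance fused kernel for MoE models that use sigmoid gating, such as MiniMax-M2. It replaces the previous approach, which emulated this routing with grouped top-k, and the reported speedup exceeds 2x. This is a targeted optimization for a specific model architecture.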
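
For readers unfamiliar with sigmoid-gated routing, the computation that such a kernel fuses can be written as a slow pure-Python reference. This is a sketch of the math only: the function name mirrors the PR title, but the signature, the `renormalize` option, and the tie-breaking rule are assumptions made for the example, not details taken from the PR.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def topk_sigmoid(gating_logits: list[float], k: int,
                 renormalize: bool = True) -> tuple[list[float], list[int]]:
    """Score every expert with a sigmoid, then keep the k highest.

    Returns (weights, expert_ids) for one token. Ties keep the lower
    expert id because sorted() is stable.
    """
    scores = [sigmoid(x) for x in gating_logits]
    top_ids = sorted(range(len(scores)),
                     key=lambda i: scores[i], reverse=True)[:k]
    weights = [scores[i] for i in top_ids]
    if renormalize:  # assumption: weights are rescaled to sum to 1
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights, top_ids


weights, expert_ids = topk_sigmoid([0.0, 2.0, -1.0, 1.0], k=2)
assert expert_ids == [1, 3]
```

Unlike softmax gating, each expert's sigmoid score does not depend on the other experts' logits, so scoring and selection can be done in a single pass over the logits.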
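PR #31218 ([Frontend] add FunctionGemma tool parser support): Extends vLLM's tool-calling ecosystem with support for Google's lightweight model designed specifically for function calling, which strengthens edge-deployment scenarios.
PR #27987 ([Core] Add a random suffix to frontend-provided request IDs): An important architectural improvement. By appending a random suffix to client-provided request IDs, it removes the race conditions and correctness problems that duplicate request IDs could cause, which matters most for the P/D architecture and for async scheduling. (Merged)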
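
The idea behind PR #27987 can be shown with a few lines of Python. This is a minimal sketch and not the PR's code: `make_internal_request_id`, the `-` separator, and the suffix length are hypothetical choices for illustration.

```python
import secrets


def make_internal_request_id(client_request_id: str,
                             suffix_bytes: int = 4) -> str:
    """Derive an internally unique ID from a client-supplied one.

    The client ID stays as a readable prefix; the random hex suffix
    keeps two requests that reuse the same client ID from colliding.
    """
    return f"{client_request_id}-{secrets.token_hex(suffix_bytes)}"


# Two requests that arrive with the same client-provided ID.
first = make_internal_request_id("chat-123")
second = make_internal_request_id("chat-123")
print(first, second)
```

Keeping the client ID as a prefix preserves traceability in logs, while internal bookkeeping is keyed on the suffixed value.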
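PR #31203 ([ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention): A simple change from .view() to .reshape() that resolves a potential runtime error on ROCm. It shows how much small differences in API robustness matter in cross-platform development.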

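📈 Development Activity Observations

Efficient merging: 26 PRs were merged while 37 were opened, a merged-to-opened ratio of about 70%. The merged set includes PRs opened before this window, so the figure describes overall throughput and not the fate of the 37 new PRs. It still suggests that code review and integration are running efficiently.
Active AMD support: PRs addressing AMD CI and runtime problems arrived quickly and came from multiple contributors, including AMD employees, which shows that support for the platform is a focus of sustained investment.
Active core developers: DarkLight1337, jeejeelee, hmellor, AndreasKaratzas, and other core members were very active in diagnosing problems, reviewing code, and shipping fixes across the frontend, models, kernels, and CI.
Broad community contribution: Several good PRs came from contributors outside the core team, such as yurekami, c0de128, and micah-wil, spanning feature development, bug fixes, and documentation. This points to a healthy community ecosystem.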

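💡 Issues Worth Watching

Issue #31210: “Wrong Generation Under High Concurrency When Using KVCache CPU Offload”: Enabling KV cache CPU offload under high concurrency produces incorrect generations. This may involve a deep bug in CPU/GPU data synchronization or cache consistency, and it has a large impact on users of the feature.
RFC #31229: “Early-Fail Tokenization Guard”: As discussed above, this is an important security and stability proposal. Its implementation approach and scope of impact deserve continued attention from the community.
RFC #31249: “Improve environment variable declaration and handling”: Aims to pay down technical debt. Implementing it will affect how every environment variable is defined, so it needs careful evaluation and rollout.
RFC #31216: “Sampling Optimization: move gather of logits after argmax”: An optimization proposal with potentially significant performance gains. Implementing it may change the computation flow of the sampling stage, so the final validation of both correctness and performance should be watched.
RFC #31204: “Supporting Multi MTP layers in Speculative Decoding”: Points out that the current Eagle speculative decoding proposer has insufficient support for models with multiple MTP layers. This is one direction for feature expansion.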

📋 Appendix: Detailed Data Lists

New Issues

#31253 [Bug]: VLLM_USE_FLASHINFER_MOE_FP16=1 generate different logprob for the same prompt in different run | bug | by zyongye (created: 2025-12-24 10:43 (UTC+8))
#31252 [Feature]: Make EngineCore shutdown timeout configurable via environment variable | feature request | by sakunkun (created: 2025-12-24 10:25 (UTC+8))
#31219 [Bug]: Concurrent requests with audio_embeds of different lengths crash EngineCore: “audio_embeds contains inconsistent shapes” | bug | by ykkk1 (created: 2025-12-23 20:25 (UTC+8))
#31249 [RFC]: Improve environment variable declaration and handling | RFC | by ProExpertProg (created: 2025-12-24 08:05 (UTC+8))
#31245 [CI Failure]: mi325_4: DeepSeek V2-Lite Async EPLB Accuracy | ci-failure | by AndreasKaratzas (created: 2025-12-24 06:46 (UTC+8))
#31244 [CI Failure]: mi325_2: Plugin Tests (2 GPUs) | ci-failure | by AndreasKaratzas (created: 2025-12-24 06:43 (UTC+8))
#31210 [Bug]: Wrong Generation Under High Concurrency When Using KVCache CPU Offload (vLLM 0.13.0) | bug | by wangqia0309 (created: 2025-12-23 16:02 (UTC+8))
#31229 [RFC]: Early-Fail Tokenization Guard for Completions or Chat Completions | RFC | by scratch-ml (created: 2025-12-24 01:37 (UTC+8))
#31217 [Usage]: suffix decoding | usage | by jiangix-paper (created: 2025-12-23 18:43 (UTC+8))
#31216 [RFC]: Sampling Optimization: move gather of logits after argmax. | RFC | by whx-sjtu (created: 2025-12-23 18:23 (UTC+8))
#31206 [Bug]: AsyncLLM Qwen/Qwen3-Embedding got stuck in max_model_len >= 6100 (vllm==0.13.0) | bug | by galabala (created: 2025-12-23 15:21 (UTC+8))
#31205 ValueError: Qwen3OmniMoeThinkerForConditionalGeneration does not support LoRA yet. | usage | by VJJJJJJ1 (created: 2025-12-23 14:52 (UTC+8))
#31211 [Doc]: Add missing GPT-OSS tool calling instructions | documentation | by amithkk (created: 2025-12-23 16:35 (UTC+8))
#31204 [RFC]: Supporting Multi MTP layers in Speculative Decoding (EagleProposer) | RFC | by DingYibin (created: 2025-12-23 11:34 (UTC+8))

Closed Issues

#31128 [Feature]: Add support of Blackwell SM121(DGX Spark) | feature request | by yanyunl1991 (closed: 2025-12-23 11:33 (UTC+8))
#31219 [Bug]: Concurrent requests with audio_embeds of different lengths crash EngineCore: “audio_embeds contains inconsistent shapes” | bug | by ykkk1 (closed: 2025-12-24 10:15 (UTC+8))
#29461 [CI Failure]: mi325_1: Language Models Test (PPL) | ci-failure | by AndreasKaratzas (closed: 2025-12-24 06:47 (UTC+8))
#29460 [CI Failure]: mi325_1: Language Models Test (MTEB) | ci-failure | by AndreasKaratzas (closed: 2025-12-24 06:47 (UTC+8))
#20342 [Bug]: V1 pre-compiled graph loading much slower than V0 | bug,torch.compile,unstale | by OscarSavNS (closed: 2025-12-24 06:29 (UTC+8))
#29516 [CI Failure]: mi325_4: Distributed Tests (A100) | ci-failure | by AndreasKaratzas (closed: 2025-12-24 06:27 (UTC+8))
#23787 [Bug]: Performance Analysis: Significant Latency on First Inference due to Engine Warm-up (torch.compile & Graph Capture) | bug,torch.compile,stale,startup-ux | by Flink-ddd (closed: 2025-12-24 06:26 (UTC+8))
#30995 [Bug]: Fused MoE errors without safe serialization | bug | by ojh31 (closed: 2025-12-23 19:09 (UTC+8))
#31148 [Bug]: Jais2 model in vLLM 0.13.0: get_rope() called with unsupported rotary_dim kwarg (TypeError during model init) | bug | by NikolasTh90 (closed: 2025-12-23 15:44 (UTC+8))
#31136 [Bug]: error when run examples/online_serving/prompt_embed_inference_with_openai_client.py | bug | by yuekaizhang (closed: 2025-12-23 14:08 (UTC+8))
#28930 [Usage]: How to build a qwen3vl embedding model with a custom mlp layer on the top use vllm? | usage | by neverneverendup (closed: 2025-12-23 12:49 (UTC+8))
#31091 [Usage]: Image Embedding Models (CLIP, Siglip, etc) | usage | by JamesDConley (closed: 2025-12-23 11:26 (UTC+8))

New PRs

#31246 [Kernel] Add topk_sigmoid kernel | performance | by xyang16 (created: 2025-12-24 07:25 (UTC+8))
#31228 Cleanup basic and entrypoint test organisation | ready,ci/build,llama | by hmellor (created: 2025-12-24 01:33 (UTC+8))
#31242 [ROCm][CI] Set VLLM_FLOAT32_MATMUL_PRECISION="tf32" For terratorch Tests In AMD CI | rocm,ready,ci/build | by micah-wil (created: 2025-12-24 05:22 (UTC+8))
#31239 [Feature] Add logprobs support for Whisper transcription API | documentation,frontend | by TheCodeWrangler (created: 2025-12-24 04:54 (UTC+8))
#31251 [Bugfix][Hardware][AMD] Use cub_helpers.h in sampler.cu for ROCm namespace alias | rocm | by c0de128 (created: 2025-12-24 10:16 (UTC+8))
#31223 [Bugfix] Enable dynamic_dims for different embeds shape | ready,multi-modality,qwen | by DarkLight1337 (created: 2025-12-23 22:57 (UTC+8))
#31222 [Chore] Simplify logic of _execute_mm_encoder | ready,v1,multi-modality | by DarkLight1337 (created: 2025-12-23 22:08 (UTC+8))
#31226 [cli] complete vllm cli help message | frontend | by andyxning (created: 2025-12-24 00:53 (UTC+8))
#31218 [Frontend] add FunctionGemma tool parser support | documentation,ready,tool-calling | by gateremark (created: 2025-12-23 20:04 (UTC+8))
#31250 LoRA Slab Optimization | v1,qwen,deepseek,gpt-oss | by Majid-Taheri (created: 2025-12-24 08:18 (UTC+8))
#31221 CustomOp: Unify aiter impl into GroupedTopk | ready | by xinyu-intel (created: 2025-12-23 21:29 (UTC+8))
#31248 grifffe warning | no labels | by Majid-Taheri (created: 2025-12-24 07:47 (UTC+8))
#31247 LoRA Slab Optimization | v1,deepseek,gpt-oss | by Majid-Taheri (created: 2025-12-24 07:27 (UTC+8))
#31225 fix(core): break circular reference in Request using weakref | v1 | by kelvinvelasquez-SDE (created: 2025-12-24 00:30 (UTC+8))
#31243 [WIP] Adopt Dockerfile to build nightly version | ci/build | by atalman (created: 2025-12-24 06:23 (UTC+8))
#31235 [ROCm][CI][Bugfix] Fix Siglip2 rotary embedding dispatch and InternVL video test tolerance | rocm,ready,multi-modality | by AndreasKaratzas (created: 2025-12-24 03:35 (UTC+8))
#31203 [ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention by replacing .view() with .reshape() | rocm,ready,multi-modality | by AndreasKaratzas (created: 2025-12-23 10:49 (UTC+8))
#31241 [Bugfix] Fix eagle dp tests on A100 | v1 | by zou3519 (created: 2025-12-24 05:14 (UTC+8))
#31240 Revert “[bench] Support common prefix len config (for decode-only bench)” | performance | by minosfuture (created: 2025-12-24 05:12 (UTC+8))
#31234 docs: Add llm-d integration to the website | documentation,ready | by terrytangyuan (created: 2025-12-24 03:29 (UTC+8))
#31238 Refactor aiter_shared_expert_fusion logic into helper class | no labels | by yurekami (created: 2025-12-24 04:12 (UTC+8))
#31230 [Bug]: Fix port race condition in distributed initialization | no labels | by yurekami (created: 2025-12-24 03:04 (UTC+8))
#31236 Construting grid using num of active lora in lora kernels | v1,nvidia | by yugong333 (created: 2025-12-24 03:44 (UTC+8))
#31237 fix(models): Handle weight prefix mapping for Mamba-Codestral | no labels | by yurekami (created: 2025-12-24 03:56 (UTC+8))
#31232 [Feature]: Integrate Sonic MoE kernel for Hopper GPUs | no labels | by yurekami (created: 2025-12-24 03:05 (UTC+8))
#31233 [Benchmark] Auto-infer dataset name from path for backward compatibility | performance | by yurekami (created: 2025-12-24 03:20 (UTC+8))
#31231 [Bug] Fix Qwen3-VL 2:4 sparsity shape mismatch during decompression | qwen | by yurekami (created: 2025-12-24 03:04 (UTC+8))
#31227 [ROCm][CI] Fix “Distributed Tests (H200)” Test | rocm,ci/build | by kliuae (created: 2025-12-24 00:53 (UTC+8))
#31214 Only patch original_max_position_embeddings for Transformers v4 | ready | by hmellor (created: 2025-12-23 18:10 (UTC+8))
#31224 [Bugfix][Frontend] Fix Jina reranker multimodal input compatibility | frontend | by twjww (created: 2025-12-23 23:49 (UTC+8))
#31220 fixed glm 4.7 tool call and parser | frontend | by PratikNarola1 (created: 2025-12-23 21:27 (UTC+8))
#31208 [Misc] Introduce encode_*_url utility function | tpu,ready,v1,multi-modality,kv-connector | by DarkLight1337 (created: 2025-12-23 15:38 (UTC+8))
#31215 WIP - Paged Eviction | v1 | by albertoperdomo2 (created: 2025-12-23 18:16 (UTC+8))
#31209 Correct position of docstring of class attributes | v1 | by wdhongtw (created: 2025-12-23 15:39 (UTC+8))
#31212 [Doc] Add tool call parser documentation for GPT-OSS models | documentation,tool-calling,gpt-oss | by amithkk (created: 2025-12-23 16:37 (UTC+8))
#31213 Add a support to disable Cutlass W8A8 kernels | nvidia,meta-exported,fb-exported | by houseroad (created: 2025-12-23 16:39 (UTC+8))
#31207 fix: update kimi k2 tool parser logic | no labels | by wangln19 (created: 2025-12-23 15:23 (UTC+8))

Merged PRs

#31160 [Bug] Fix Number of dimensions of tensors must match. for Deepseek V3.2 | ready,deepseek | by yewentao256 (merged: 2025-12-24 10:41 (UTC+8))
#30133 [P/D] Mooncake connector support more protocols | ready,kv-connector | by LCAIZJ (merged: 2025-12-24 10:24 (UTC+8))
#30544 [KVEvent] User request.block_hash for parent block_hash | ready,v1 | by heheda12345 (merged: 2025-12-24 10:23 (UTC+8))
#30967 [Misc] Remove unused custom ops copy_blocks and copy_blocks_mla | ready | by lengrongfu (merged: 2025-12-24 10:22 (UTC+8))
#31223 [Bugfix] Enable dynamic_dims for different embeds shape | ready,multi-modality,qwen | by DarkLight1337 (merged: 2025-12-24 10:15 (UTC+8))
#31222 [Chore] Simplify logic of _execute_mm_encoder | ready,v1,multi-modality | by DarkLight1337 (merged: 2025-12-24 10:15 (UTC+8))
#31049 [CI] Add Qwen3-Next-FP8 to Blackwell model tests | ready,qwen,nvidia | by vadiklyutiy (merged: 2025-12-24 09:21 (UTC+8))
#31203 [ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention by replacing .view() with .reshape() | rocm,ready,multi-modality | by AndreasKaratzas (merged: 2025-12-24 05:48 (UTC+8))
#27987 [Core] Add a random suffix to frontend-provided request IDs | frontend,ready,v1,gpt-oss,kv-connector,ready-run-all-tests | by markmc (merged: 2025-12-24 05:05 (UTC+8))
#28133 [Mamba] - Consolidate Mambas Attention Logic | ready,v1 | by Josephasafg (merged: 2025-12-24 04:57 (UTC+8))
#31234 docs: Add llm-d integration to the website | documentation,ready | by terrytangyuan (merged: 2025-12-24 04:27 (UTC+8))
#29788 Use helper function instead of looping through attribute names | ready | by hmellor (merged: 2025-12-24 01:31 (UTC+8))
#31214 Only patch original_max_position_embeddings for Transformers v4 | ready | by hmellor (merged: 2025-12-24 00:46 (UTC+8))
#31097 [FIX] FP4 quantization kernel padding initialization bug | performance,ready | by danielafrimi (merged: 2025-12-24 00:45 (UTC+8))
#30724 Fix edge case Mistral tool parser | ready | by joa-stdn (merged: 2025-12-23 22:19 (UTC+8))
#31208 [Misc] Introduce encode_*_url utility function | tpu,ready,v1,multi-modality,kv-connector | by DarkLight1337 (merged: 2025-12-23 21:45 (UTC+8))
#31095 adapt voxtral | new-model,ready,v1,multi-modality | by patrickvonplaten (merged: 2025-12-23 21:31 (UTC+8))
#31146 Add util function for checking nesting of rope parameters | ready | by hmellor (merged: 2025-12-23 19:41 (UTC+8))
#30134 [OpenAI] Add parameter metadata to validation errors | frontend,ready | by R3hankhan123 (merged: 2025-12-23 19:30 (UTC+8))
#30550 [Frontend] Support using chat template as custom score template for reranking models | documentation,new-model,frontend,ready,llama | by jzakrzew (merged: 2025-12-23 19:19 (UTC+8))
#31161 [Bugfix] Fix MoE LoRA bin/pt loading | ready | by jeejeelee (merged: 2025-12-23 19:09 (UTC+8))
#31209 Correct position of docstring of class attributes | v1 | by wdhongtw (merged: 2025-12-23 18:08 (UTC+8))
#26575 [ROCm][FEAT] Support AITER RMSNorm quantization fusion pass | rocm,ready,ci/build | by vllmellm (merged: 2025-12-23 18:07 (UTC+8))
#31198 [Bugfix] Fix Jais2ForCausalLM | ready | by jeejeelee (merged: 2025-12-23 15:44 (UTC+8))
#30538 [XPU] decrease IGC_ForceOCLSIMDWidth for speculative decoding triton-xpu kernel compilation | documentation,ready,ci/build | by yma11 (merged: 2025-12-23 13:22 (UTC+8))
#31153 [Chore] Update more locations to use attention_config.backend | performance,ready | by DarkLight1337 (merged: 2025-12-23 11:19 (UTC+8))

PRs Closed Without Merging

#30418 LoRA Slab Optimization | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 | by Majid-Taheri (closed: 2025-12-24 07:08 (UTC+8))
#30194 Fix compilation tests find commands | ready,ci/build | by ProExpertProg (closed: 2025-12-24 06:04 (UTC+8))
#30790 [Release 2.10] Test Torch 2.10 RC - with skipped test | rocm,needs-rebase,ci/build,v1,cpu,nvidia | by atalman (closed: 2025-12-24 05:01 (UTC+8))
#30984 Grid construction based on num_active_lora and support CUDA graph capture across various num_active_lora | documentation,performance,new-model,rocm,structured-output,frontend,tpu,needs-rebase,ci/build,v1 | by yugong333 (closed: 2025-12-24 03:09 (UTC+8))
#27747 Cleanup basic and entrypoint test organisation | ci/build,tool-calling,llama | by hmellor (closed: 2025-12-24 02:09 (UTC+8))
#23896 [CI] Optimize entrypoints API server tests | ready,needs-rebase,ci/build,tool-calling | by csahithi (closed: 2025-12-24 00:55 (UTC+8))
#31087 [Debug] [Do Not Merge] revert Dockerfile.rocm_base changes | rocm,needs-rebase,ci/build | by tjtanaa (closed: 2025-12-23 18:23 (UTC+8))
#22788 [Attention] Cache attention metadata builds across hybrid KV-cache groups | needs-rebase,v1,nvidia | by LucasWilkinson (closed: 2025-12-23 11:45 (UTC+8))