vLLM Development Activity Report - 2025-12-23
Time window: 2025-12-23 10:47 (UTC+8) ~ 2025-12-24 10:47 (UTC+8). Statistics: 14 new issues | 12 closed issues | 37 new PRs | 26 merged PRs | 8 PRs closed without merging
📊 Daily Development Status Summary
vLLM maintained a high level of development activity between December 23 and 24, merging 26 PRs and closing 12 issues. Development focused on performance optimization (especially MoE kernels and sampling), multimodal feature enhancements, and continued improvements to support for hardware platforms such as AMD ROCm. Community discussion was lively, with several RFCs on core architecture optimization and developer experience emerging, signaling that the project is deep into fine-grained polishing and forward-looking design.
🎯 AMD/ROCm Ecosystem Updates
During this period, AMD-ecosystem work centered on CI fixes and compatibility improvements, reflecting sustained investment in ROCm platform stability.
- CI testing and build fixes:
  - PR #31242 ([ROCm][CI] Set VLLM_FLOAT32_MATMUL_PRECISION…): Fixes Terratorch plugin test failures in AMD CI caused by a PyTorch version update. The problem stems from mixing the new `torch.backends.cuda.matmul.fp32_precision` API with the old one. The PR works around this temporarily by setting an environment variable, pending a follow-up PyTorch update.
  - PR #31251 ([Bugfix][Hardware][AMD] Use cub_helpers.h in sampler.cu…): Fixes a compilation error in `sampler.cu` when building vLLM from source on ROCm. The error arose from not using the `cub_helpers.h` header to unify the CUDA/hipCUB namespace differences.
  - PR #31227 ([ROCm][CI] Fix “Distributed Tests (H200)” Test): For ROCm environments, adjusts the distributed test script, replacing the unsupported `deepep_high_throughput` All2All backend with the supported `allgather_reducescatter`, and using the CPU for DP synchronization to bypass CUDA graph limitations on ROCm.
  - Issue #31244 / #31245 (CI Failure): AMD CI maintainer AndreasKaratzas (username suffix `-amd`) reported two CI failures: a Terratorch TF32 API error in the plugin tests and a flaky async-accuracy test for the DeepSeek V2 model. The former is fixed by PR #31242; the latter was marked as a known issue and is slated for removal from CI.
- Runtime error fixes:
  - PR #31203 ([ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention…): Fixes a `RuntimeError` in the multimodal encoder attention module on ROCm caused by non-contiguous tensors. The solution replaces `.view()` with the more robust `.reshape()`. (Merged)
  - PR #31235 ([ROCm][CI][Bugfix] Fix Siglip2 rotary embedding dispatch…): Fixes incorrect platform dispatch of the rotary embedding function in the Siglip2 model (ROCm was mistakenly calling the CUDA implementation), and relaxes test tolerances for the InternVL video model on ROCm to accommodate numerical precision differences.
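The `.view()` versus `.reshape()` distinction behind the MMEncoderAttention fix can be illustrated with a few lines of PyTorch (a generic sketch of the failure mode, not the actual vLLM code):

```python
import torch

# A transpose makes a tensor non-contiguous: its strides no longer match
# a flat row-major layout, so .view() cannot reinterpret the storage.
x = torch.arange(6).reshape(2, 3).t()  # shape (3, 2), non-contiguous

try:
    x.view(6)
except RuntimeError as e:
    print("view failed:", e)

# .reshape() falls back to a copy when a zero-copy view is impossible,
# so it succeeds on the same tensor.
y = x.reshape(6)
print(y.shape)  # torch.Size([6])
```

Since `.reshape()` only copies when it must, swapping it in costs nothing on the contiguous fast path while surviving the non-contiguous one.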
Summary: AMD-related activity this period was driven jointly by community members (non-AMD employees) and AMD's CI maintainers, concentrating on build, test, and runtime compatibility issues. This suggests vLLM's ROCm support is transitioning from "usable" toward "stable" and "high-performance," with ongoing cleanup of cross-platform implementation details. No changes directly related to the Quark quantization toolkit or new MI300 features were observed.
💬 High-Engagement Discussion Analysis
- Issue #31128: “Add support of Blackwell SM121(DGX Spark)” (9 comments)
  - Core topic: Request for native vLLM support of NVIDIA's latest Blackwell-architecture edge platform (DGX Spark, ARM64 + CUDA 13).
  - Positions:
    - User (`yanyunl1991`): Points out that official vLLM's strict dependency on PyTorch 2.9 and the lack of ARM64 CUDA 13 wheels are the main obstacles, and that using `--enforce-eager` costs performance.
    - Maintainers (`DarkLight1337`, `eugr`): Provided detailed steps for building from source and noted that v0.13.0 already ships ARM64 wheels for CUDA 13. The key solution is installing with `uv` from a specific index.
  - Points of contention: None significant. The discussion focused on clarifying the current support status and providing concrete solutions rather than on whether support should exist.
  - Current status: The user confirmed a successful run using the community-provided approach and thanked the helpers. Issue closed.
- Issue #31229: “Early-Fail Tokenization Guard for Completions or Chat Completions” (3 comments)
  - Core topic: How to prevent extremely long prompts (hundreds of millions of characters) from causing CPU OOM and server hangs during tokenization.
  - Positions:
    - Author (`scratch-ml`): Analyzed the root cause in detail (tokenization runs before the length check) and proposed three incremental solutions: 1) enable protective truncation during tokenization; 2) add a raw-size pre-check; 3) harden the async tokenizer.
    - Core developer (`robertgshaw2-redhat`): Strongly supports the change, arguing it is not just a stability concern but a denial-of-service security issue. He leans toward option 1 but asked for confirmation of compatibility with the Mistral tokenizer, and assigned the author as owner.
  - Points of contention: None yet. The discussion is at the stage of embracing the proposal and refining the approach.
  - Current status: Open; the author will implement once community consensus is reached.
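The raw-size pre-check (option 2) is straightforward to sketch. The limit, exception, and function names below are hypothetical illustrations, not vLLM's actual API:

```python
# Hypothetical raw-size guard: reject oversized prompts before tokenizing.
MAX_RAW_CHARS = 10_000_000  # illustrative threshold, not vLLM's

class PromptTooLongError(ValueError):
    pass

def guarded_tokenize(tokenizer, prompt: str) -> list[int]:
    # The O(1) length check runs before the expensive tokenization, so a
    # hostile multi-hundred-megabyte prompt fails fast instead of pinning
    # a CPU core and exhausting memory.
    if len(prompt) > MAX_RAW_CHARS:
        raise PromptTooLongError(
            f"prompt has {len(prompt)} chars, limit is {MAX_RAW_CHARS}")
    return tokenizer.encode(prompt)
```

The point is ordering: the cheap check must precede tokenization, which is exactly what the issue says the current code path gets backwards.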
- Issue #31219: “Concurrent requests with audio_embeds of different lengths crash EngineCore” (2 comments, but quickly closed)
  - Core topic: The engine crashes when concurrently handling requests whose audio embeddings have different lengths.
  - Resolution: Maintainer `DarkLight1337` identified the root cause (missing `dynamic_dims` settings) immediately upon receiving the report and submitted PR #31223 as a fix. The reporter verified that the problem was resolved.
  - Analysis: The "heat" here lies in the speed of response and repair, showing both the team's emphasis on multimodal stability and its efficiency.
🔥 Hot Topics and Trend Analysis
- Performance optimization reaching into the kernel layer: Discussion hotspots are shifting from high-level architecture down to kernel optimization. PR #31246 adds a `topk_sigmoid` kernel for MoE models, delivering more than a 2x speedup. RFC #31216 proposes moving the `gather` of logits in sampling to after `argmax`, reducing communication and compute overhead, which particularly benefits speculative decoding.
- Multimodal and tool-calling capabilities continue to expand:
  - Features: PR #31239 adds `logprobs` support to the Whisper transcription API, a vLLM-specific extension.
  - Model support: PR #31218 adds a tool-call parser for Google's FunctionGemma model.
  - Bug fixes: PR #31223 fixes concurrent handling of audio embeddings with different lengths; PR #31224 fixes the Jina reranker's handling of mixed image-text input.
- Distributed and hardware ecosystem adaptation: Beyond AMD, topics also covered NVIDIA Blackwell (#31128), LoRA transfer optimization (PR #31250), and request ID management under the P/D architecture (PR #27987), showing deep adaptation to complex deployment environments.
- Developer experience and infrastructure: RFC #31249 proposes refactoring how environment variables are declared, to eliminate duplicate definitions and type inconsistencies, reflecting attention to code quality and maintainability as the project scales. Several documentation and CI-fix PRs likewise reflect ongoing project-health maintenance.
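The gather-after-argmax idea from RFC #31216 can be sketched in one process with NumPy, simulating a vocabulary sharded across tensor-parallel ranks. Each "rank" computes a local argmax and only its per-row winner is combined, instead of gathering full logit rows first. The function name and sharding layout are illustrative, not vLLM's implementation:

```python
import numpy as np

def argmax_then_gather(shards: list[np.ndarray]) -> np.ndarray:
    # shards[r] has shape (batch, vocab_shard_r); together they partition
    # the vocabulary along axis 1.
    offsets = np.cumsum([0] + [s.shape[1] for s in shards[:-1]])
    # Each rank contributes only (best_local_id, best_local_logit):
    # O(batch) per rank instead of O(batch * vocab_shard).
    local_ids = np.stack([s.argmax(axis=1) + off
                          for s, off in zip(shards, offsets)], axis=1)
    local_vals = np.stack([s.max(axis=1) for s in shards], axis=1)
    winner = local_vals.argmax(axis=1)                # reduce over ranks
    return local_ids[np.arange(len(winner)), winner]  # global token ids
```

For greedy sampling this is equivalent to `np.concatenate(shards, axis=1).argmax(axis=1)` while moving far less data between ranks, which is the communication saving the RFC targets.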
🛠️ Key Technical Changes
- PR #31246 ([Kernel] Add topk_sigmoid kernel): Adds a high-performance fused kernel for MoE models that use sigmoid gating, such as MiniMax-M2, replacing the previous grouped-topk emulation and delivering more than a 2x speedup. A targeted optimization for a specific model architecture.
- PR #31218 ([Frontend] add FunctionGemma tool parser support): Extends vLLM's tool-calling ecosystem to support Google's lightweight model designed specifically for function calling, strengthening edge-deployment scenarios.
- PR #27987 ([Core] Add a random suffix to frontend-provided request IDs): An important architectural improvement. Appending a random suffix to client-provided request IDs eliminates the race conditions and correctness issues that duplicate request IDs could trigger, which is especially critical for the P/D architecture and asynchronous scheduling. (Merged)
- PR #31203 ([ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention): A simple change from `.view()` to `.reshape()` that resolves a latent runtime error on ROCm, illustrating the API-robustness details that cross-platform development demands.
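The request-ID idea behind PR #27987 can be sketched in a few lines. The separator and suffix length are illustrative choices, not necessarily what vLLM uses:

```python
import secrets

def internal_request_id(client_id: str) -> str:
    # Append a server-generated random suffix so two clients that reuse
    # the same ID can never collide inside the engine.
    return f"{client_id}-{secrets.token_hex(4)}"

def client_request_id(internal_id: str) -> str:
    # Strip the suffix when echoing the ID back to the client, so the
    # randomization stays invisible to callers.
    return internal_id.rsplit("-", 1)[0]
```

Because the suffix is generated per request, engine-internal state keyed by request ID stays unambiguous even when a frontend resends the same client ID, which is the race-condition class the PR closes.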
📈 Development Activity Observations
- Efficient merging: 26 of the 37 new PRs were merged, a merge rate of roughly 70%, indicating an efficient review and integration pipeline.
- Active AMD support: PRs addressing AMD CI and runtime issues landed quickly, with multiple contributors including AMD employees participating, showing that support for the platform remains a sustained priority.
- Active core developers: Core members such as `DarkLight1337`, `jeejeelee`, `hmellor`, and `AndreasKaratzas` were highly active in diagnosis, review, and fixes, covering the frontend, models, kernels, CI, and more.
- Broad community contribution: Quality PRs from non-core contributors such as `yurekami`, `c0de128`, and `micah-wil` spanned feature work, bug fixes, and documentation, indicating a healthy community ecosystem.
💡 Issues Worth Watching
- Issue #31210: “Wrong Generation Under High Concurrency When Using KVCache CPU Offload”: Enabling KV cache CPU offload under high concurrency produces incorrect generations. This may involve a deep bug in CPU/GPU data synchronization or cache coherence, with significant impact on users of the feature.
- RFC #31229: “Early-Fail Tokenization Guard”: As discussed above, an important security and stability proposal; its implementation plan and scope deserve continued community attention.
- RFC #31249: “Improve environment variable declaration and handling”: Aims to pay down technical debt. Its implementation will change how every environment variable is defined and needs careful evaluation and rollout.
- RFC #31216: “Sampling Optimization: move gather of logits after argmax”: A proposal with potentially significant performance gains; its implementation may change the sampling pipeline's computation flow, so its final correctness and performance validation bear watching.
- RFC #31204: “Supporting Multi MTP layers in Speculative Decoding”: Points out that the current Eagle speculative-decoding proposer does not adequately support models with multiple MTP layers, a direction for feature expansion.
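RFC #31249's declare-once idea can be sketched as a small typed descriptor where name, default, and parser live in one place. All names here are hypothetical illustrations, not the RFC's actual design:

```python
import os
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class EnvVar(Generic[T]):
    """One declaration per variable: name, typed default, and parser."""
    name: str
    default: T
    parse: Callable[[str], T]

    def get(self) -> T:
        raw = os.environ.get(self.name)
        return self.default if raw is None else self.parse(raw)

# Declaring type, default, and parsing together avoids the duplicate
# definitions and ad-hoc casting the RFC complains about.
SHUTDOWN_TIMEOUT_S = EnvVar("MY_SHUTDOWN_TIMEOUT_S", 30.0, float)
USE_FOO = EnvVar("MY_USE_FOO", False, lambda s: s == "1")
```

A single registry of such descriptors also makes it trivial to detect a variable declared twice, one of the failure modes the RFC targets.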
📋 Appendix: Detailed Data Lists
New Issues
- #31253 [Bug]: `VLLM_USE_FLASHINFER_MOE_FP16=1` generate different logprob for the same prompt in different run — bug — by zyongye (created: 2025-12-24 10:43 (UTC+8))
- #31252 [Feature]: Make EngineCore shutdown timeout configurable via environment variable — feature request — by sakunkun (created: 2025-12-24 10:25 (UTC+8))
- #31219 [Bug]: Concurrent requests with audio_embeds of different lengths crash EngineCore: “audio_embeds contains inconsistent shapes” — bug — by ykkk1 (created: 2025-12-23 20:25 (UTC+8))
- #31249 [RFC]: Improve environment variable declaration and handling — RFC — by ProExpertProg (created: 2025-12-24 08:05 (UTC+8))
- #31245 [CI Failure]: mi325_4: DeepSeek V2-Lite Async EPLB Accuracy — ci-failure — by AndreasKaratzas (created: 2025-12-24 06:46 (UTC+8))
- #31244 [CI Failure]: mi325_2: Plugin Tests (2 GPUs) — ci-failure — by AndreasKaratzas (created: 2025-12-24 06:43 (UTC+8))
- #31210 [Bug]: Wrong Generation Under High Concurrency When Using KVCache CPU Offload (vLLM 0.13.0) — bug — by wangqia0309 (created: 2025-12-23 16:02 (UTC+8))
- #31229 [RFC]: Early-Fail Tokenization Guard for Completions or Chat Completions — RFC — by scratch-ml (created: 2025-12-24 01:37 (UTC+8))
- #31217 [Usage]: suffix decoding — usage — by jiangix-paper (created: 2025-12-23 18:43 (UTC+8))
- #31216 [RFC]: Sampling Optimization: move gather of logits after argmax. — RFC — by whx-sjtu (created: 2025-12-23 18:23 (UTC+8))
- #31206 [Bug]: AsyncLLM Qwen/Qwen3-Embedding got stuck in max_model_len >= 6100 (vllm==0.13.0) — bug — by galabala (created: 2025-12-23 15:21 (UTC+8))
- #31205 ValueError: Qwen3OmniMoeThinkerForConditionalGeneration does not support LoRA yet. — usage — by VJJJJJJ1 (created: 2025-12-23 14:52 (UTC+8))
- #31211 [Doc]: Add missing GPT-OSS tool calling instructions — documentation — by amithkk (created: 2025-12-23 16:35 (UTC+8))
- #31204 [RFC]: Supporting Multi MTP layers in Speculative Decoding (EagleProposer) — RFC — by DingYibin (created: 2025-12-23 11:34 (UTC+8))
Closed Issues
- #31128 [Feature]: Add support of Blackwell SM121(DGX Spark) — feature request — by yanyunl1991 (closed: 2025-12-23 11:33 (UTC+8))
- #31219 [Bug]: Concurrent requests with audio_embeds of different lengths crash EngineCore: “audio_embeds contains inconsistent shapes” — bug — by ykkk1 (closed: 2025-12-24 10:15 (UTC+8))
- #29461 [CI Failure]: mi325_1: Language Models Test (PPL) — ci-failure — by AndreasKaratzas (closed: 2025-12-24 06:47 (UTC+8))
- #29460 [CI Failure]: mi325_1: Language Models Test (MTEB) — ci-failure — by AndreasKaratzas (closed: 2025-12-24 06:47 (UTC+8))
- #20342 [Bug]: V1 pre-compiled graph loading much slower than V0 — bug,torch.compile,unstale — by OscarSavNS (closed: 2025-12-24 06:29 (UTC+8))
- #29516 [CI Failure]: mi325_4: Distributed Tests (A100) — ci-failure — by AndreasKaratzas (closed: 2025-12-24 06:27 (UTC+8))
- #23787 [Bug]: Performance Analysis: Significant Latency on First Inference due to Engine Warm-up (torch.compile & Graph Capture) — bug,torch.compile,stale,startup-ux — by Flink-ddd (closed: 2025-12-24 06:26 (UTC+8))
- #30995 [Bug]: Fused MoE errors without safe serialization — bug — by ojh31 (closed: 2025-12-23 19:09 (UTC+8))
- #31148 [Bug]: Jais2 model in vLLM 0.13.0: get_rope() called with unsupported rotary_dim kwarg (TypeError during model init) — bug — by NikolasTh90 (closed: 2025-12-23 15:44 (UTC+8))
- #31136 [Bug]: error when run examples/online_serving/prompt_embed_inference_with_openai_client.py — bug — by yuekaizhang (closed: 2025-12-23 14:08 (UTC+8))
- #28930 [Usage]: How to build a qwen3vl embedding model with a custom mlp layer on the top use vllm? — usage — by neverneverendup (closed: 2025-12-23 12:49 (UTC+8))
- #31091 [Usage]: Image Embedding Models (CLIP, Siglip, etc) — usage — by JamesDConley (closed: 2025-12-23 11:26 (UTC+8))
New PRs
- #31246 [Kernel] Add topk_sigmoid kernel — performance — by xyang16 (created: 2025-12-24 07:25 (UTC+8))
- #31228 Cleanup basic and entrypoint test organisation — ready,ci/build,llama — by hmellor (created: 2025-12-24 01:33 (UTC+8))
- #31242 [ROCm][CI] Set VLLM_FLOAT32_MATMUL_PRECISION=”tf32” For terratorch Tests In AMD CI — rocm,ready,ci/build — by micah-wil (created: 2025-12-24 05:22 (UTC+8))
- #31239 [Feature] Add logprobs support for Whisper transcription API — documentation,frontend — by TheCodeWrangler (created: 2025-12-24 04:54 (UTC+8))
- #31251 [Bugfix][Hardware][AMD] Use cub_helpers.h in sampler.cu for ROCm namespace alias — rocm — by c0de128 (created: 2025-12-24 10:16 (UTC+8))
- #31223 [Bugfix] Enable `dynamic_dims` for different embeds shape — ready,multi-modality,qwen — by DarkLight1337 (created: 2025-12-23 22:57 (UTC+8))
- #31222 [Chore] Simplify logic of `_execute_mm_encoder` — ready,v1,multi-modality — by DarkLight1337 (created: 2025-12-23 22:08 (UTC+8))
- #31226 [cli] complete vllm cli help message — frontend — by andyxning (created: 2025-12-24 00:53 (UTC+8))
- #31218 [Frontend] add FunctionGemma tool parser support — documentation,ready,tool-calling — by gateremark (created: 2025-12-23 20:04 (UTC+8))
- #31250 LoRA Slab Optimization — v1,qwen,deepseek,gpt-oss — by Majid-Taheri (created: 2025-12-24 08:18 (UTC+8))
- #31221 CustomOp: Unify aiter impl into GroupedTopk — ready — by xinyu-intel (created: 2025-12-23 21:29 (UTC+8))
- #31248 grifffe warning — no labels — by Majid-Taheri (created: 2025-12-24 07:47 (UTC+8))
- #31247 LoRA Slab Optimization — v1,deepseek,gpt-oss — by Majid-Taheri (created: 2025-12-24 07:27 (UTC+8))
- #31225 fix(core): break circular reference in Request using weakref — v1 — by kelvinvelasquez-SDE (created: 2025-12-24 00:30 (UTC+8))
- #31243 [WIP] Adopt Dockerfile to build nightly version — ci/build — by atalman (created: 2025-12-24 06:23 (UTC+8))
- #31235 [ROCm][CI][Bugfix] Fix Siglip2 rotary embedding dispatch and InternVL video test tolerance — rocm,ready,multi-modality — by AndreasKaratzas (created: 2025-12-24 03:35 (UTC+8))
- #31203 [ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention by replacing .view() with .reshape() — rocm,ready,multi-modality — by AndreasKaratzas (created: 2025-12-23 10:49 (UTC+8))
- #31241 [Bugfix] Fix eagle dp tests on A100 — v1 — by zou3519 (created: 2025-12-24 05:14 (UTC+8))
- #31240 Revert “[bench] Support common prefix len config (for decode-only bench)” — performance — by minosfuture (created: 2025-12-24 05:12 (UTC+8))
- #31234 docs: Add llm-d integration to the website — documentation,ready — by terrytangyuan (created: 2025-12-24 03:29 (UTC+8))
- #31238 Refactor aiter_shared_expert_fusion logic into helper class — no labels — by yurekami (created: 2025-12-24 04:12 (UTC+8))
- #31230 [Bug]: Fix port race condition in distributed initialization — no labels — by yurekami (created: 2025-12-24 03:04 (UTC+8))
- #31236 Construting grid using num of active lora in lora kernels — v1,nvidia — by yugong333 (created: 2025-12-24 03:44 (UTC+8))
- #31237 fix(models): Handle weight prefix mapping for Mamba-Codestral — no labels — by yurekami (created: 2025-12-24 03:56 (UTC+8))
- #31232 [Feature]: Integrate Sonic MoE kernel for Hopper GPUs — no labels — by yurekami (created: 2025-12-24 03:05 (UTC+8))
- #31233 [Benchmark] Auto-infer dataset name from path for backward compatibility — performance — by yurekami (created: 2025-12-24 03:20 (UTC+8))
- #31231 [Bug] Fix Qwen3-VL 2:4 sparsity shape mismatch during decompression — qwen — by yurekami (created: 2025-12-24 03:04 (UTC+8))
- #31227 [ROCm][CI] Fix “Distributed Tests (H200)” Test — rocm,ci/build — by kliuae (created: 2025-12-24 00:53 (UTC+8))
- #31214 Only patch `original_max_position_embeddings` for Transformers v4 — ready — by hmellor (created: 2025-12-23 18:10 (UTC+8))
- #31224 [Bugfix][Frontend] Fix Jina reranker multimodal input compatibility — frontend — by twjww (created: 2025-12-23 23:49 (UTC+8))
- #31220 fixed glm 4.7 tool call and parser — frontend — by PratikNarola1 (created: 2025-12-23 21:27 (UTC+8))
- #31208 [Misc] Introduce `encode_*_url` utility function — tpu,ready,v1,multi-modality,kv-connector — by DarkLight1337 (created: 2025-12-23 15:38 (UTC+8))
- #31215 WIP - Paged Eviction — v1 — by albertoperdomo2 (created: 2025-12-23 18:16 (UTC+8))
- #31209 Correct position of docstring of class attributes — v1 — by wdhongtw (created: 2025-12-23 15:39 (UTC+8))
- #31212 [Doc] Add tool call parser documentation for GPT-OSS models — documentation,tool-calling,gpt-oss — by amithkk (created: 2025-12-23 16:37 (UTC+8))
- #31213 Add a support to disable Cutlass W8A8 kernels — nvidia,meta-exported,fb-exported — by houseroad (created: 2025-12-23 16:39 (UTC+8))
- #31207 fix: update kimi k2 tool parser logic — no labels — by wangln19 (created: 2025-12-23 15:23 (UTC+8))
Merged PRs
- #31160 [Bug] Fix `Number of dimensions of tensors must match.` for Deepseek V3.2 — ready,deepseek — by yewentao256 (merged: 2025-12-24 10:41 (UTC+8))
- #30133 [P/D] Mooncake connector support more protocols — ready,kv-connector — by LCAIZJ (merged: 2025-12-24 10:24 (UTC+8))
- #30544 [KVEvent] User request.block_hash for parent block_hash — ready,v1 — by heheda12345 (merged: 2025-12-24 10:23 (UTC+8))
- #30967 [Misc] Remove unused custom ops `copy_blocks` and `copy_blocks_mla` — ready — by lengrongfu (merged: 2025-12-24 10:22 (UTC+8))
- #31223 [Bugfix] Enable `dynamic_dims` for different embeds shape — ready,multi-modality,qwen — by DarkLight1337 (merged: 2025-12-24 10:15 (UTC+8))
- #31222 [Chore] Simplify logic of `_execute_mm_encoder` — ready,v1,multi-modality — by DarkLight1337 (merged: 2025-12-24 10:15 (UTC+8))
- #31049 [CI] Add Qwen3-Next-FP8 to Blackwell model tests — ready,qwen,nvidia — by vadiklyutiy (merged: 2025-12-24 09:21 (UTC+8))
- #31203 [ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention by replacing .view() with .reshape() — rocm,ready,multi-modality — by AndreasKaratzas (merged: 2025-12-24 05:48 (UTC+8))
- #27987 [Core] Add a random suffix to frontend-provided request IDs — frontend,ready,v1,gpt-oss,kv-connector,ready-run-all-tests — by markmc (merged: 2025-12-24 05:05 (UTC+8))
- #28133 [Mamba] - Consolidate Mambas Attention Logic — ready,v1 — by Josephasafg (merged: 2025-12-24 04:57 (UTC+8))
- #31234 docs: Add llm-d integration to the website — documentation,ready — by terrytangyuan (merged: 2025-12-24 04:27 (UTC+8))
- #29788 Use helper function instead of looping through attribute names — ready — by hmellor (merged: 2025-12-24 01:31 (UTC+8))
- #31214 Only patch `original_max_position_embeddings` for Transformers v4 — ready — by hmellor (merged: 2025-12-24 00:46 (UTC+8))
- #31097 [FIX] FP4 quantization kernel padding initialization bug — performance,ready — by danielafrimi (merged: 2025-12-24 00:45 (UTC+8))
- #30724 Fix edge case Mistral tool parser — ready — by joa-stdn (merged: 2025-12-23 22:19 (UTC+8))
- #31208 [Misc] Introduce `encode_*_url` utility function — tpu,ready,v1,multi-modality,kv-connector — by DarkLight1337 (merged: 2025-12-23 21:45 (UTC+8))
- #31095 adapt voxtral — new-model,ready,v1,multi-modality — by patrickvonplaten (merged: 2025-12-23 21:31 (UTC+8))
- #31146 Add util function for checking nesting of rope parameters — ready — by hmellor (merged: 2025-12-23 19:41 (UTC+8))
- #30134 [OpenAI] Add parameter metadata to validation errors — frontend,ready — by R3hankhan123 (merged: 2025-12-23 19:30 (UTC+8))
- #30550 [Frontend] Support using chat template as custom score template for reranking models — documentation,new-model,frontend,ready,llama — by jzakrzew (merged: 2025-12-23 19:19 (UTC+8))
- #31161 [Bugfix] Fix MoE LoRA bin/pt loading — ready — by jeejeelee (merged: 2025-12-23 19:09 (UTC+8))
- #31209 Correct position of docstring of class attributes — v1 — by wdhongtw (merged: 2025-12-23 18:08 (UTC+8))
- #26575 [ROCm][FEAT] Support AITER RMSNorm quantization fusion pass — rocm,ready,ci/build — by vllmellm (merged: 2025-12-23 18:07 (UTC+8))
- #31198 [Bugfix] Fix Jais2ForCausalLM — ready — by jeejeelee (merged: 2025-12-23 15:44 (UTC+8))
- #30538 [XPU] decrease IGC_ForceOCLSIMDWidth for speculative decoding triton-xpu kernel compilation — documentation,ready,ci/build — by yma11 (merged: 2025-12-23 13:22 (UTC+8))
- #31153 [Chore] Update more locations to use `attention_config.backend` — performance,ready — by DarkLight1337 (merged: 2025-12-23 11:19 (UTC+8))
PRs Closed Without Merging
- #30418 LoRA Slab Optimization — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by Majid-Taheri (closed: 2025-12-24 07:08 (UTC+8))
- #30194 Fix compilation tests find commands — ready,ci/build — by ProExpertProg (closed: 2025-12-24 06:04 (UTC+8))
- #30790 [Release 2.10] Test Torch 2.10 RC - with skipped test — rocm,needs-rebase,ci/build,v1,cpu,nvidia — by atalman (closed: 2025-12-24 05:01 (UTC+8))
- #30984 Grid construction based on num_active_lora and support CUDA graph capture across various num_active_lora — documentation,performance,new-model,rocm,structured-output,frontend,tpu,needs-rebase,ci/build,v1 — by yugong333 (closed: 2025-12-24 03:09 (UTC+8))
- #27747 Cleanup basic and entrypoint test organisation — ci/build,tool-calling,llama — by hmellor (closed: 2025-12-24 02:09 (UTC+8))
- #23896 [CI] Optimize entrypoints API server tests — ready,needs-rebase,ci/build,tool-calling — by csahithi (closed: 2025-12-24 00:55 (UTC+8))
- #31087 [Debug] [Do Not Merge] revert Dockerfile.rocm_base changes — rocm,needs-rebase,ci/build — by tjtanaa (closed: 2025-12-23 18:23 (UTC+8))
- #22788 [Attention] Cache attention metadata builds across hybrid KV-cache groups — needs-rebase,v1,nvidia — by LucasWilkinson (closed: 2025-12-23 11:45 (UTC+8))