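vLLM Development Activity Report - 2025-12-24

Time window: 2025-12-24 10:51 (UTC+8) ~ 2025-12-25 10:51 (UTC+8)
Statistics: New Issues 13 | Closed Issues 6 | New PRs 61 | Merged PRs 26 | PRs closed without merge 29

📊 Daily Development Summary

Between December 24 and 25, 2025, the vLLM project stayed highly active, with 61 new PRs opened and 26 merged. Work centered on performance optimization, bug fixes (particularly for the AMD ROCm platform and distributed execution), and new feature support, including models such as Kimi-K2 and ERNIE-4.5-VL. Community discussion was lively, with new problems surfacing around memory usage and model inference quality.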

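🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was heavy in this window, mostly in the form of bug-fix and optimization PRs.

PR	Title	Summary	Impact and analysis
#31261	[Bugfix][ROCm] Fix load issue on deepseek quark quantization when shared expert enabled	Fixes a weight-loading failure for Quark-quantized DeepSeek models when shared experts are enabled. The shared-expert path did not handle the `weight_scale` parameter.	Key fix. Keeps DeepSeek MoE models quantized with the Quark tool runnable on ROCm, an important piece of AMD's quantization toolchain.
#31282	[Bugfix][Hardware][AMD] Fix last_page_len calculation in AITER MLA decode	Fixes the computation of `paged_kv_last_page_len` in the ROCm AITER MLA decode path. The kernel uses a block size of 1, but the value was previously set to the full sequence length.	Fixes a low-level bug that could cause incorrect attention results or out-of-bounds memory access, improving the stability and correctness of the AITER backend on AMD hardware.
#31293	[Bugfix][Hardware][AMD] Fix uninitialized Qlocal registers in ROCm attention kernel	Fixes the ROCm PagedAttention kernel leaving some `Qlocal` registers uninitialized when `GQA_RATIO == 1` (non-GQA models), which contaminated numerical results.	Resolves a low-level kernel bug that could cause numerical drift or NaNs on AMD GPUs; it matters for the numerical accuracy and determinism of model output.
#31295	[Bugfix][Hardware][AMD] Use dynamic WARP_SIZE in sampler vectorized_process	Replaces the hard-coded `WARP_SIZE=32` in the sampler kernel with a dynamic macro so that both Wave64 (MI300) and Wave32 (Strix Halo) architectures are supported.	Improves compatibility across AMD GPU architectures (CDNA vs. RDNA) and prepares for AMD consumer GPUs such as Strix Halo.
#31286	fix(rocm): add early return in get_flash_attn_version for ROCm	Skips the CUDA-specific `vllm_flash_attn` import on ROCm, preventing a spurious `libcudart.so.12 not found` error.	Improves user experience and startup robustness in ROCm-only environments.
#31321	[MoE Refactor] AITER Mixtral Fix	Fixes static activation quantization for Mixtral models, which the recent AITER refactor accidentally broke.	Keeps FP8-quantized model support intact on the AMD-optimized MoE compute path.
#31258	Update unquantized_fused_moe_method.py	Changes how AITER allocates unquantized MoE weights on ROCm so that weight reloading in RL scenarios is safer, fixing random-character output after wake-up. The contributor states they are from AMD.	Resolves a correctness issue that could appear with the AITER backend in workloads that reload weights dynamically, such as reinforcement learning.
#31327, #31323, #31259	Other (ROCm CI/CD)	Dependency update (xgrammar), test environment build (TorchCodec), and environment variable setting (`TORCH_NCCL_BLOCKING_WAIT`) for ROCm CI/CD.	Continued improvement of integration test coverage and stability on AMD platforms.

Summary: AMD-related contributions in this window focused on low-level kernel correctness fixes (MLA decode, register initialization), better quantization support (Quark), multi-architecture compatibility (WARP_SIZE), and developer experience (CI/testing). The pattern suggests the AMD team is working in depth on the correctness and performance foundations of its hardware in vLLM.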

💬 Hot Discussion Analysis

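Issue #31276: [Bug]: Stuck with a loop when tensor_parallel_size=2
- Core issue: With tensor_parallel_size=2, the vLLM engine hangs in what looks like an infinite loop during initialization (100% GPU utilization, 0% memory usage).
- Views and analysis:
  - Maintainer (@yurekami): Did an initial investigation and noted that the hang may occur during NCCL initialization, custom all-reduce setup, or the P2P capability test. Pointed specifically to a link with Issue #5854, a problem in the custom all-reduce path when world_size=2. Provided detailed debugging steps, suggesting that the user try disabling custom all-reduce, enabling eager mode, and so on.
  - Focus of discussion: The root cause is not yet determined, but the evidence points at distributed initialization, in particular a possible defect in the world_size=2 configuration. The issue is still in diagnosis and needs more debugging information from the user.
- Current status: Open, waiting for the user to report debugging results.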
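The debugging suggestions above can be turned into a diagnostic launch command. This is a minimal sketch, assuming the `vllm serve` CLI with its `--tensor-parallel-size`, `--disable-custom-all-reduce`, and `--enforce-eager` flags, plus NCCL's `NCCL_DEBUG` environment variable. The helper name `build_debug_launch` and the model name are hypothetical placeholders, and the exact steps given in the issue thread are not reproduced here.

```python
import os
import shlex


def build_debug_launch(model: str, tp_size: int = 2) -> tuple[list[str], dict[str, str]]:
    """Build (argv, env) for a diagnostic run. Nothing is executed here."""
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        # Rule out the custom all-reduce path suspected for world_size=2.
        "--disable-custom-all-reduce",
        # Eager mode skips graph capture, removing one more variable.
        "--enforce-eager",
    ]
    env = dict(os.environ)
    # Verbose NCCL logging shows how far communicator setup gets.
    env["NCCL_DEBUG"] = "INFO"
    return argv, env


# "my-org/my-model" is a placeholder, not a model from the issue.
argv, env = build_debug_launch("my-org/my-model")
print(shlex.join(argv))
```

If the hang disappears with both flags set, re-enabling them one at a time should narrow down which initialization stage is involved.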
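PR #31282: [Bugfix][Hardware][AMD] Fix last_page_len calculation in AITER MLA decode
- Core issue: Fixes a critical parameter computation error in the AMD AITER MLA decode path.
- Views and analysis:
  - Contributor (@c0de128): Explained the bug in detail (the block size is 1, yet last_page_len was set to the sequence length) and backed it up by comparison with the correct implementation in the FlashInfer backend.
  - Reviewer (@tjtanaa): Questioned whether the contributor's test case (an lm_eval benchmark) is relevant to the change, and recommended proper functional testing as described in the vLLM documentation.
  - Focus of discussion: Whether the validation method is appropriate. The reviewer asked for a more direct test tied to the fix rather than a general accuracy benchmark, which reflects the project's high bar for validating code changes.
- Current status: Open; the contributor has added more technical justification and is waiting for further review.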
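To make the arithmetic behind this fix concrete: in a paged KV cache, the last page of a sequence holds between 1 and `page_size` tokens, so with a page size of 1 the value is always 1, never the sequence length. The following is a minimal Python sketch of that relationship only. It is not the AITER or vLLM code, and the function name `last_page_len` is hypothetical.

```python
def last_page_len(seq_len: int, page_size: int) -> int:
    """Number of valid tokens in the final KV-cache page of a sequence."""
    if seq_len <= 0 or page_size <= 0:
        raise ValueError("seq_len and page_size must be positive")
    # A full final page yields page_size rather than 0.
    return (seq_len - 1) % page_size + 1


# With page_size == 1, every page holds exactly one token.
for seq_len in (1, 7, 16, 17):
    assert last_page_len(seq_len, page_size=1) == 1

print(last_page_len(16, page_size=16))  # 16: the last page is full
print(last_page_len(17, page_size=16))  # 1: one token spills into a new page
```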
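PR #31293: [Bugfix][Hardware][AMD] Fix uninitialized Qlocal registers in ROCm attention kernel
- Core issue: Fixes uninitialized GPU registers in the ROCm attention kernel.
- Views and analysis:
  - Contributor (@c0de128): Provided a detailed code analysis and hardware validation results (lm_eval run on MI300X with no NaNs) as evidence that the fix works.
  - CI status: The PR shows failing tests because of an AMD CI infrastructure problem (bootstrap failure). The contributor noted that this is a known CI environment issue unrelated to the code and asked for a CI re-run.
  - Focus of discussion: Separating CI reliability from code review. The thread shows the difficulty of depending on hardware-specific CI, where maintainers have to tell infrastructure failures apart from code failures.
- Current status: Open, waiting for a re-test once CI is stable.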

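🔥 Hot Topics and Trends

Memory and OOM issues: Several issues point to a possible memory-usage regression in the new release (v0.13.0).
- #31273: OOM while loading the Qwen2.5-72B FP8 model on H20, whereas v0.12.0 works.
- #31262: Under CUDA 13.0, the FP8-quantized version of an MoE model fails to run (later resolved by adjusting the parallel configuration).
- Trend: As support for FP8 quantization and large models grows, memory management is becoming more complex and more sensitive, and it is now a major stability challenge.
Model inference quality and correctness: Users reported output quality problems with specific models or configurations.
- #31319: The GLM-4.7-FP8 model is missing the opening <think> tag of its reasoning (chain-of-thought) output.
- #31325: Qwen3-32B shows quality degradation in offline generation mode.
- #31296: Streaming output from the large Qwen3-Coder model drops content.
- Trend: While pursuing maximum performance (quantization, speculative decoding), the community is also demanding more of the correctness and consistency of inference results.
Speculative decoding compatibility and stability:
- #31270: Using N-gram speculative decoding with Pipeline Parallel (PP) > 2 causes a crash.
- #31288/#31307: Fix N-gram/Suffix draft tokens diverging across TP ranks because of CPU non-determinism.
- Trend: Speculative decoding is a core performance feature, and its deep integration with the various parallel strategies (TP, PP, DP) remains a focus of development and debugging.
Multimodal, LoRA, and other advanced feature work:
- Eagle3 speculative decoding support for the kimi_k2 model was requested (#31254), with a fix proposed in PR #31257.
- Video metadata support was added for ERNIE-4.5-VL so that timestamps can be rendered (#31274).
- PR #31317 targets faster LoRA weight loading for MoE models, and a usage question asks about loading LoRA separately for Qwen3-VL (#31278).
- Trend: The project is responding quickly to community demand and widening its support for new models and fine-tuning techniques.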

🛠️ Key Technical Changes

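PR #29431: Add --max-model-len auto to auto-fit context to available memory
- Technical summary: Introduces `--max-model-len auto` (or -1). At startup the system estimates available device memory. If the model's original context length fits, it is used; otherwise a binary search finds the largest supported length.
- Impact: Users no longer have to probe memory capacity by hand, which helps most in memory-constrained deployments and with new models. It is a meaningful step toward smarter startup configuration.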
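The search described above can be illustrated with a short sketch. It assumes that "fits in memory" is a monotonic predicate (if a length fits, every shorter length fits too) and uses a made-up per-token KV-cache cost. The names `fit_max_model_len`, `make_fits`, and `KV_BYTES_PER_TOKEN` are hypothetical and do not come from the PR, whose actual memory estimation is more involved.

```python
from typing import Callable


def fit_max_model_len(native_len: int, fits: Callable[[int], bool]) -> int:
    """Largest length in [1, native_len] for which fits(length) is True, else 0.

    Assumes fits is monotonic: if a length fits, all shorter lengths fit.
    """
    if fits(native_len):
        return native_len  # the model's own context length already fits
    lo, hi = 0, native_len  # invariant: lo fits (or is 0), hi does not fit
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical memory model: a fixed KV-cache cost per token.
KV_BYTES_PER_TOKEN = 160 * 1024


def make_fits(free_bytes: int) -> Callable[[int], bool]:
    return lambda n_tokens: n_tokens * KV_BYTES_PER_TOKEN <= free_bytes


print(fit_max_model_len(131072, make_fits(8 * 1024**3)))   # 52428
print(fit_max_model_len(131072, make_fits(64 * 1024**3)))  # 131072
```

Binary search keeps the number of memory checks logarithmic in the native context length, which matters if each check is expensive.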
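PR #31041: [Perf] Add skip_clone to SamplingParams for internal request handling
- Technical summary: Adds a skip_clone flag to SamplingParams so that internal request handling can skip the expensive deepcopy. The flag is set automatically in methods such as the OpenAI protocol conversion.
- Impact: Reduces per-request overhead and improves throughput. It is a practical answer to the earlier discussion on SamplingParams mutability safety (RFC #29081), balancing safety against performance.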
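The trade-off can be shown with a small stand-alone sketch. It is not vLLM's SamplingParams implementation; the `Params` class and its fields are hypothetical, and only the idea of a flag that bypasses `copy.deepcopy` is taken from the description above.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Params:
    """Hypothetical stand-in for a sampling-parameter object."""
    temperature: float = 1.0
    stop: list[str] = field(default_factory=list)
    skip_clone: bool = False  # set only for objects the engine created itself

    def clone(self) -> "Params":
        if self.skip_clone:
            # Internally built object: no outside reference can mutate it,
            # so the deep copy is unnecessary.
            return self
        # User-supplied object: copy defensively so later edits by the
        # caller cannot leak into an in-flight request.
        return copy.deepcopy(self)


user_params = Params(stop=["</s>"])
cloned = user_params.clone()
cloned.stop.append("END")
print(user_params.stop)  # ['</s>'], the original is untouched

internal = Params(stop=["</s>"], skip_clone=True)
print(internal.clone() is internal)  # True, no copy was made
```

The flag is only safe when the caller can guarantee that nothing else holds a reference to the object, which is why the sketch reserves it for internally constructed instances.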
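PR #31285: [Chore][1/2] Drop v0.14 deprecations
- Technical summary: Starts removing deprecated code scheduled for removal in v0.14. This is routine but important maintenance work within the release cycle.
- Impact: Keeps the codebase clean and reduces legacy burden. Developers should check whether the removals affect their own code.
PR #31294: [Bugfix] Remove dead block_quant_to_tensor_quant function
- Technical summary: Removes block_quant_to_tensor_quant, a dead function that was never called, had a misleading docstring, and contained an implementation bug.
- Impact: Better code quality. Removing the dead code avoids confusing later developers and closes the related documentation issue (#30098).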

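📈 Development Activity Observations

High merge throughput: 26 PRs were merged within 24 hours, indicating that review and merge are running efficiently. The merged set includes several key bug fixes (such as #31261 and #31260) and feature improvements.
Diverse contributors: Beyond the core team (such as DarkLight1337, yurekami), the AMD team (c0de128, ganyi1996ppo, AndreasKaratzas) contributed many ROCm-related fixes. Community members also submitted model support (#31274) and bug reports.
The "bot" contributor phenomenon: Users noticed that yurekami submitted a large number of PRs adding type annotations in a short time (such as #31320 and #31318), and maintainers closed some of them on the grounds of "preferring to keep things as they are". This suggests the project has clear preferences and boundaries regarding code style and contribution scope, and that large-scale style unification may need a more coordinated approach.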

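💡 Issues Worth Watching

Memory usage regression: Issues #31273 and #31262 suggest that v0.13.0 may have a memory-management problem that could cause OOM in deployed services. Recommendation: follow these issues and upgrade production environments with caution.
Distributed inference hang: The tensor_parallel_size=2 hang in Issue #31276, if it turns out to be widespread, would affect the stability of small multi-GPU inference setups. Recommendation: wait for the maintainers' root-cause analysis and fix.
Speculative decoding and parallelism compatibility: Issue #31270 and PR #31288 show how fragile speculative decoding is when combined with complex parallel strategies (PP, TP). Recommendation: test thoroughly before relying on these advanced features.
Model-specific bugs: Issues such as #31319 (GLM-4.7) and #31325 (Qwen3-32B) show that, as new models are integrated quickly, architecture-specific details need continued polish. Recommendation: when using a particular model, watch its dedicated issues.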

📋 Appendix: Detailed Data Lists

New Issues

#31278 [Usage]: Does the Qwen3-VL local loading mode support loading LoRA separately? (original title: 请问Qwen3-VL本地加载模式支持单独加载LoRA么？) | usage | by dengdeng-cat (created: 2025-12-24 19:33 (UTC+8))
#31319 [Bug]: GLM-4.7-FP8 missing beginning tag | bug | by Nemo-G (created: 2025-12-25 02:45 (UTC+8))
#31273 [Bug]:OOM during weight loading on H20 (141GB) in v0.13.0, but works in v0.12.0 (Qwen2.5-72B FP8) | bug | by Sean-LL (created: 2025-12-24 17:52 (UTC+8))
#31325 [Bug]: Qwen3-32B Quality Degeneration on Offline Generate | bug | by thavens (created: 2025-12-25 06:59 (UTC+8))
#31304 [Feature]: Support explicit dependency to enable interaction between requests | feature request | by FLesic (created: 2025-12-25 00:42 (UTC+8))
#31276 [Bug]: Stuck with a loop when tensor_parallel_size=2 | bug | by RyanLoil (created: 2025-12-24 18:26 (UTC+8))
#31296 [Bug]: Streaming output issue in vLLM v0.10.2 | bug | by keilove (created: 2025-12-24 23:08 (UTC+8))
#31272 [Performance]: b200x8 deepseek-ai/DeepSeek-V3.2-Exp max perf | performance | by evgeniiperepelkin (created: 2025-12-24 17:48 (UTC+8))
#31270 [Bug]: Can run Speculative decode with PP >2? | bug | by frankie-ys (created: 2025-12-24 17:10 (UTC+8))
#31271 Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions. | no labels | by frankie-ys (created: 2025-12-24 17:10 (UTC+8))
#31269 [Feature]: Add support for NVIDIA Jetson AGX Thor | feature request | by yekangming (created: 2025-12-24 16:50 (UTC+8))
#31262 [Bug]: When running vLLM 0.13.0 in a CUDA 13.0 environment, the non-quantized version of the MoE model works without issues, but the FP8 quantized version fails to run. | bug | by xzwgit (created: 2025-12-24 15:03 (UTC+8))
#31254 [Feature]: Add kimi_k2 supported eagle3 | feature request | by lengrongfu (created: 2025-12-24 10:52 (UTC+8))

Closed Issues

#15332 [Bug]: RuntimeError: The size of tensor a (1059) must match the size of tensor b (376) at non-singleton dimension, DeepSeek R1 H20x16 pp2, v1 engine | bug,ray,stale | by markluofd (closed: 2025-12-25 10:13 (UTC+8))
#29081 [RFC]: SamplingParams should raise a warning when modified via direct assignment on the user side | RFC | by Ki-Seki (closed: 2025-12-25 06:35 (UTC+8))
#30098 [Doc]: Misleading Logic & Docstring in block_quant_to_tensor_quant (Block FP8) | documentation | by xqoasis (closed: 2025-12-25 01:22 (UTC+8))
#31271 Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions. | no labels | by frankie-ys (closed: 2025-12-24 17:10 (UTC+8))
#31262 [Bug]: When running vLLM 0.13.0 in a CUDA 13.0 environment, the non-quantized version of the MoE model works without issues, but the FP8 quantized version fails to run. | bug | by xzwgit (closed: 2025-12-24 16:06 (UTC+8))
#31254 [Feature]: Add kimi_k2 supported eagle3 | feature request | by lengrongfu (closed: 2025-12-24 11:09 (UTC+8))

New PRs

#31286 fix(rocm): add early return in get_flash_attn_version for ROCm | rocm,ready | by rabi (created: 2025-12-24 22:02 (UTC+8))
#31282 [Bugfix][Hardware][AMD] Fix last_page_len calculation in AITER MLA decode | rocm,v1 | by c0de128 (created: 2025-12-24 21:45 (UTC+8))
#31314 [Bugfix] Suppress UserWarning for non-writable buffer in binary2tensor | no labels | by yurekami (created: 2025-12-25 01:48 (UTC+8))
#31327 [ROCm] Migrate xgrammar to upstream release in rocm-test.txt | rocm,ci/build | by AndreasKaratzas (created: 2025-12-25 09:39 (UTC+8))
#31323 [ROCm][CI] Add TorchCodec source build for transcription tests | rocm,ci/build | by AndreasKaratzas (created: 2025-12-25 05:46 (UTC+8))
#31322 Support LoRA for PLaMo 2/3 | documentation | by Alnusjaponica (created: 2025-12-25 05:21 (UTC+8))
#31317 pin lora_b moe weights on cpu | no labels | by gnovack (created: 2025-12-25 02:24 (UTC+8))
#31326 [Feat] allow inplace loading lora | frontend | by Jackmin801 (created: 2025-12-25 08:14 (UTC+8))
#31324 [CI] Fix flaky vision beam search test with flexible semantic validation | no labels | by AndreasKaratzas (created: 2025-12-25 06:47 (UTC+8))
#31275 Implement optimal group size calculation for KV cache layers, preferr… | v1 | by vovani (created: 2025-12-24 18:08 (UTC+8))
#31311 [Bugfix] Fix circular reference memory leak in Request class | v1 | by yurekami (created: 2025-12-25 01:22 (UTC+8))
#31321 [MoE Refactor] AITER Mixtral Fix | rocm | by robertgshaw2-redhat (created: 2025-12-25 02:54 (UTC+8))
#31309 fix: add weight mapper for Mamba2 model prefix compatibility | no labels | by yurekami (created: 2025-12-25 01:06 (UTC+8))
#31320 [Code Quality] Add missing return type annotations to misc modules | multi-modality | by yurekami (created: 2025-12-25 02:49 (UTC+8))
#31318 [Code Quality] Add missing return type annotations to core modules | no labels | by yurekami (created: 2025-12-25 02:38 (UTC+8))
#31290 fix: handle None tokenizer in multimodal processor initialization | multi-modality | by yurekami (created: 2025-12-24 22:24 (UTC+8))
#31301 fix(ray): correct misleading warning message for multi-node clusters | v1 | by yurekami (created: 2025-12-25 00:22 (UTC+8))
#31294 [Bugfix] Remove dead block_quant_to_tensor_quant function | ready | by yurekami (created: 2025-12-24 22:47 (UTC+8))
#31316 [Code Quality] Add missing return type annotations to entrypoints/utils.py | frontend | by yurekami (created: 2025-12-25 01:59 (UTC+8))
#31291 fix(config): validate skip_tokenizer_init is not used with multimodal models | ready | by yurekami (created: 2025-12-24 22:34 (UTC+8))
#31288 fix(spec_decode): sync ngram draft tokens across TP ranks | speculative-decoding,v1 | by yurekami (created: 2025-12-24 22:10 (UTC+8))
#31315 [Code Quality] Add missing return type annotations to distributed module | no labels | by yurekami (created: 2025-12-25 01:51 (UTC+8))
#31313 [Code Quality] Add missing return type annotations to model_executor utils | no labels | by yurekami (created: 2025-12-25 01:45 (UTC+8))
#31310 [Code Quality] Add missing return type annotations to transformers_utils | no labels | by yurekami (created: 2025-12-25 01:20 (UTC+8))
#31308 Add return type annotation to get_cached_compilation_config() | no labels | by yurekami (created: 2025-12-25 01:04 (UTC+8))
#31306 Add missing return type annotations to config/model.py | no labels | by yurekami (created: 2025-12-25 00:52 (UTC+8))
#31303 [Misc] Add missing return type annotations to lora/utils and engine/arg_utils | no labels | by yurekami (created: 2025-12-25 00:40 (UTC+8))
#31302 [Misc] Add missing return type annotations to collection_utils and deep_gemm | no labels | by yurekami (created: 2025-12-25 00:35 (UTC+8))
#31300 [Misc] Add missing return type annotations to profiling and async_utils | v1 | by yurekami (created: 2025-12-25 00:21 (UTC+8))
#31299 [Misc] Add missing return type annotations to torch_utils | no labels | by yurekami (created: 2025-12-25 00:18 (UTC+8))
#31298 [Misc] Add missing return type annotations to import_utils | no labels | by yurekami (created: 2025-12-25 00:15 (UTC+8))
#31297 [Misc] Add missing return type annotations to utils | no labels | by yurekami (created: 2025-12-25 00:08 (UTC+8))
#31287 [Bugfix] Preserve original tokenizer class name for transformers compatibility | no labels | by yurekami (created: 2025-12-24 22:07 (UTC+8))
#31293 [Bugfix][Hardware][AMD] Fix uninitialized Qlocal registers in ROCm attention kernel | rocm | by c0de128 (created: 2025-12-24 22:46 (UTC+8))
#31285 [Chore][1/2] Drop v0.14 deprecations | documentation,rocm,frontend,ready,v1,multi-modality,tool-calling,gpt-oss | by DarkLight1337 (created: 2025-12-24 21:56 (UTC+8))
#31312 [Bugfix] Add error message for skip_tokenizer_init with multimodal models | no labels | by yurekami (created: 2025-12-25 01:44 (UTC+8))
#31280 fix(v1): Break circular reference in Request class to prevent memory leak | v1 | by yurekami (created: 2025-12-24 20:59 (UTC+8))
#31289 [Bugfix] Correct misleading warning message in Ray executor | v1 | by yurekami (created: 2025-12-24 22:12 (UTC+8))
#31307 fix: broadcast ngram/suffix draft tokens across TP ranks | v1 | by yurekami (created: 2025-12-25 01:01 (UTC+8))
#31283 [Feature] Make EngineCore shutdown timeout configurable via environment variable | v1 | by yurekami (created: 2025-12-24 21:50 (UTC+8))
#31305 fix: correct misleading warning message for Ray TP/PP parallelism | v1 | by yurekami (created: 2025-12-25 00:47 (UTC+8))
#31265 [CI] Reorganization pooling_mteb_test | ready,qwen | by noooop (created: 2025-12-24 15:55 (UTC+8))
#31267 Fix skip_tokenizer_init=True crash for multimodal models | ready | by majiayu000 (created: 2025-12-24 16:40 (UTC+8))
#31292 fix: reset FlashInfer wrappers after sleep mode | v1,nvidia | by yurekami (created: 2025-12-24 22:39 (UTC+8))
#31281 fix(models): Add weight name mapping for Mamba-Codestral model loading | ready | by yurekami (created: 2025-12-24 21:03 (UTC+8))
#31260 [Bugfix] Fix max_model_len="auto" handling | ready,v1 | by DarkLight1337 (created: 2025-12-24 13:45 (UTC+8))
#31295 [Bugfix][Hardware][AMD] Use dynamic WARP_SIZE in sampler vectorized_process | rocm | by c0de128 (created: 2025-12-24 23:02 (UTC+8))
#31279 [Bugfix] Disable FlashInfer MoE in batch invariant mode for determinism | no labels | by yurekami (created: 2025-12-24 20:09 (UTC+8))
#31274 [Model][Ernie4.5-VL] Support video metadata for timestamp rendering | ready | by Tiiiktak (created: 2025-12-24 18:04 (UTC+8))
#31284 [Doc] Add GPT-OSS (openai) tool parser documentation | documentation,tool-calling,gpt-oss | by yurekami (created: 2025-12-24 21:55 (UTC+8))
#31263 [Chore] Remove unused noqas | frontend,ready,v1,multi-modality,tool-calling,qwen,kv-connector | by DarkLight1337 (created: 2025-12-24 15:40 (UTC+8))
#31264 [Chore] Bump lm-eval version | documentation,rocm,ready,ci/build | by DarkLight1337 (created: 2025-12-24 15:47 (UTC+8))
#31277 Support ViT SP parallelism in the encode section of qwen2.5vl/qwen3vl | qwen | by ninjazwen (created: 2025-12-24 19:00 (UTC+8))
#31266 [bugfix] add hasattr check for drafter in _build_attention_metadata with PP | v1 | by JaviS-Rei (created: 2025-12-24 15:56 (UTC+8))
#31256 Feat: Support mixed placement of shared & router experts with EPLB synergy | deepseek | by Mercykid-bash (created: 2025-12-24 11:02 (UTC+8))
#31261 [Bugfix][ROCm] Fix load issue on deepseek quark quantization when shared expert enabled | rocm,ready,deepseek | by ganyi1996ppo (created: 2025-12-24 14:08 (UTC+8))
#31268 Preserve original class name in CachedTokenizer for HuggingFace compatibility | no labels | by majiayu000 (created: 2025-12-24 16:41 (UTC+8))
#31259 [ROCm][CI] Set TORCH_NCCL_BLOCKING_WAIT Distributed Tests On ROCm | rocm,ready,ci/build | by micah-wil (created: 2025-12-24 13:36 (UTC+8))
#31255 [LoRA] Load gpt-oss w13_lora_b interleaved | gpt-oss | by xyang16 (created: 2025-12-24 10:55 (UTC+8))
#31258 Update unquantized_fused_moe_method.py | no labels | by aaab8b (created: 2025-12-24 11:56 (UTC+8))
#31257 [Speculators][Speculative Decoding] Fix Kimi K2 Eagle3 Support | deepseek | by chaunceyjiang (created: 2025-12-24 11:06 (UTC+8))

Merged PRs

#31241 [Bugfix] Fix eagle dp tests on A100 | ready,v1 | by zou3519 (merged: 2025-12-25 08:05 (UTC+8))
#31041 [Perf] Add skip_clone to SamplingParams for internal request handling | performance,frontend,ready,v1 | by mgoin (merged: 2025-12-25 06:35 (UTC+8))
#31294 [Bugfix] Remove dead block_quant_to_tensor_quant function | ready | by yurekami (merged: 2025-12-25 01:22 (UTC+8))
#31285 [Chore][1/2] Drop v0.14 deprecations | documentation,rocm,frontend,ready,v1,multi-modality,tool-calling,gpt-oss | by DarkLight1337 (merged: 2025-12-25 01:54 (UTC+8))
#31265 [CI] Reorganization pooling_mteb_test | ready,qwen | by noooop (merged: 2025-12-24 23:36 (UTC+8))
#31226 [cli] complete vllm cli help message | frontend,ready | by andyxning (merged: 2025-12-24 23:45 (UTC+8))
#31179 [Bugfix][Hardware][AMD] Fix FP8 dtype in silu_mul quantization | rocm,ready | by c0de128 (merged: 2025-12-24 23:37 (UTC+8))
#31260 [Bugfix] Fix max_model_len="auto" handling | ready,v1 | by DarkLight1337 (merged: 2025-12-24 19:15 (UTC+8))
#30800 [PERF] Add interleaved memory allocation to NUMA module | ready,cpu | by skaraban3807 (merged: 2025-12-24 21:47 (UTC+8))
#31263 [Chore] Remove unused noqas | frontend,ready,v1,multi-modality,tool-calling,qwen,kv-connector | by DarkLight1337 (merged: 2025-12-24 21:38 (UTC+8))
#31264 [Chore] Bump lm-eval version | documentation,rocm,ready,ci/build | by DarkLight1337 (merged: 2025-12-24 21:39 (UTC+8))
#31131 [Model] Introduce verify_and_update_model_config for VerifyAndUpdateConfig. | ready,llama,qwen | by noooop (merged: 2025-12-24 17:54 (UTC+8))
#31261 [Bugfix][ROCm] Fix load issue on deepseek quark quantization when shared expert enabled | rocm,ready,deepseek | by ganyi1996ppo (merged: 2025-12-24 16:47 (UTC+8))
#31259 [ROCm][CI] Set TORCH_NCCL_BLOCKING_WAIT Distributed Tests On ROCm | rocm,ready,ci/build | by micah-wil (merged: 2025-12-24 15:19 (UTC+8))
#30281 [CI/Build] Ignore data_parallel_size_local | ready | by rjrock (merged: 2025-12-24 15:40 (UTC+8))
#31227 [ROCm][CI] Fix "Distributed Tests (H200)" Test | rocm,ready,ci/build | by kliuae (merged: 2025-12-24 14:56 (UTC+8))
#28979 [ROCm][CI] Fix entrypoints tests and Python-only installation test on ROCm | rocm,frontend,ready,ci/build,v1,gpt-oss | by AndreasKaratzas (merged: 2025-12-24 14:42 (UTC+8))
#29431 Add --max-model-len auto to auto-fit context to available memory | feature request,ready,v1,startup-ux,ready-run-all-tests | by mgoin (merged: 2025-12-24 13:37 (UTC+8))
#30070 [docker] Fix downloading sccache on aarch64 platform | ready,ci/build | by NickCao (merged: 2025-12-24 13:36 (UTC+8))
#30760 [XPU] Remove distributed_executor_backend check | ready | by 1643661061leo (merged: 2025-12-24 13:34 (UTC+8))
#31007 [Qwen3-Omni] fixed _get_feat_extract_output_lengths function | ready,qwen | by wangxiongts (merged: 2025-12-24 13:33 (UTC+8))
#31047 [DeepSeek v3.2] Remove unnecessary syncwarps | ready,deepseek | by MatthewBonanni (merged: 2025-12-24 13:33 (UTC+8))
#31149 [Bugfix][ROCm][Dynamo][DS 3.1][FP8] fix unsupported hasattr call when Dynamo tracing for ROCm device | rocm,ready | by zejunchen-zejun (merged: 2025-12-24 13:32 (UTC+8))
#31240 Revert "[bench] Support common prefix len config (for decode-only bench)" | performance | by minosfuture (merged: 2025-12-24 13:17 (UTC+8))
#31242 [ROCm][CI] Set VLLM_FLOAT32_MATMUL_PRECISION="tf32" For terratorch Tests In AMD CI | rocm,ready,ci/build | by micah-wil (merged: 2025-12-24 11:21 (UTC+8))
#31235 [ROCm][CI][Bugfix] Fix Siglip2 rotary embedding dispatch and InternVL video test tolerance | rocm,ready,multi-modality | by AndreasKaratzas (merged: 2025-12-24 10:56 (UTC+8))

PRs Closed Without Merge

#30740 Support LoRA of PLaMo 2/3 | documentation,performance,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality,qwen,cpu | by Alnusjaponica (closed: 2025-12-25 05:21 (UTC+8))
#30742 Support LoRA of PLaMo 2/3 | documentation | by Alnusjaponica (closed: 2025-12-25 04:51 (UTC+8))
#30951 [Misc] Improve worker error messages for better debugging | v1,cpu | by yurekami (closed: 2025-12-25 03:00 (UTC+8))
#31311 [Bugfix] Fix circular reference memory leak in Request class | v1 | by yurekami (closed: 2025-12-25 03:00 (UTC+8))
#30946 [Core] Improve DP synchronization error messages | v1 | by yurekami (closed: 2025-12-25 02:56 (UTC+8))
#31309 fix: add weight mapper for Mamba2 model prefix compatibility | no labels | by yurekami (closed: 2025-12-25 02:51 (UTC+8))
#30948 fix: Suppress torch.frombuffer UserWarning for non-writable buffers | no labels | by yurekami (closed: 2025-12-25 02:48 (UTC+8))
#31316 [Code Quality] Add missing return type annotations to entrypoints/utils.py | frontend | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31315 [Code Quality] Add missing return type annotations to distributed module | no labels | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31313 [Code Quality] Add missing return type annotations to model_executor utils | no labels | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31310 [Code Quality] Add missing return type annotations to transformers_utils | no labels | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31308 Add return type annotation to get_cached_compilation_config() | no labels | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31306 Add missing return type annotations to config/model.py | no labels | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31303 [Misc] Add missing return type annotations to lora/utils and engine/arg_utils | no labels | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31302 [Misc] Add missing return type annotations to collection_utils and deep_gemm | no labels | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31300 [Misc] Add missing return type annotations to profiling and async_utils | v1 | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31299 [Misc] Add missing return type annotations to torch_utils | no labels | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31298 [Misc] Add missing return type annotations to import_utils | no labels | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31297 [Misc] Add missing return type annotations to utils | no labels | by yurekami (closed: 2025-12-25 02:24 (UTC+8))
#31312 [Bugfix] Add error message for skip_tokenizer_init with multimodal models | no labels | by yurekami (closed: 2025-12-25 01:47 (UTC+8))
#31280 fix(v1): Break circular reference in Request class to prevent memory leak | v1 | by yurekami (closed: 2025-12-25 01:42 (UTC+8))
#31289 [Bugfix] Correct misleading warning message in Ray executor | v1 | by yurekami (closed: 2025-12-25 01:27 (UTC+8))
#31307 fix: broadcast ngram/suffix draft tokens across TP ranks | v1 | by yurekami (closed: 2025-12-25 01:25 (UTC+8))
#31305 fix: correct misleading warning message for Ray TP/PP parallelism | v1 | by yurekami (closed: 2025-12-25 01:06 (UTC+8))
#19819 LoRA support on llama4 | stale,llama | by frank-wei (closed: 2025-12-25 00:12 (UTC+8))
#31267 Fix skip_tokenizer_init=True crash for multimodal models | ready | by majiayu000 (closed: 2025-12-25 00:09 (UTC+8))
#30551 docs: Clarify block_quant_to_tensor_quant docstring (fixes #30098) | no labels | by yurekami (closed: 2025-12-25 00:07 (UTC+8))
#31281 fix(models): Add weight name mapping for Mamba-Codestral model loading | ready | by yurekami (closed: 2025-12-24 23:35 (UTC+8))
#30353 [Fix] Handle multiple tool calls in Qwen3-MTP tool parser | frontend,needs-rebase,tool-calling,qwen | by ArkVex (closed: 2025-12-24 23:22 (UTC+8))