vLLM Development Activity Report - 2025-12-10

Time window: 2025-12-10 10:48 (UTC+8) ~ 2025-12-11 10:48 (UTC+8)
Statistics: New Issues 19 | Closed Issues 30 | New PRs 60 | Merged PRs 28 | PRs closed without merging 26

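📊 Daily Development Summary

During this observation window, vLLM remained highly active: 60 PRs were opened and 28 merged, concentrated on model support, performance tuning, and test infrastructure. AMD/ROCm support was a visible focus, with several ROCm-related CI fixes and feature PRs submitted or merged. Community discussion centered on the future direction of the benchmarking tools, support issues for new models (such as DeepSeek-V3.2 and Gemma2 GGUF), and the reliability of KV cache data transfer.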

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity in this cycle was high, covering bug fixes, CI pipeline work, and performance optimization.

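Core fixes:
- PR #30432 ([ROCm] Fix broken import in platform attention backend dispatching) (merged): Fixes a ROCm platform initialization failure caused by importing get_env_variable_attn_backend. The problem stemmed from a dependency on the attention backend selector, which is being refactored (PR #30396); the fix bypasses that dependency by checking the environment variable directly, so the ROCm platform starts normally.
- PR #30308 ([bugfix][quantization] fix quark qwen3 kv_cache quantization) (merged): Submitted by an AMD employee (haoyangli-amd). Fixes a failure to recognize cache scales when Quark quantizes the KV cache of Qwen3 MoE models, caused by a base-class method not being called correctly, restoring correctness for the quantized model.
- PR #30430 ([ROCm][Bugfix] Add MLACommonMetadata to allowed attention types for speculative decoding): Fixes a compatibility issue on ROCm when MLA (e.g., DeepSeek models) is used with speculative decoding enabled (num_speculative_tokens > 1) by adding MLACommonMetadata to the list of supported attention types.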
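The workaround described for PR #30432 amounts to reading the backend choice straight from the process environment instead of importing a selector helper. A minimal sketch, assuming the variable is named VLLM_ATTENTION_BACKEND (the report does not name it) and using a hypothetical helper, requested_attn_backend, that is not part of vLLM:

```python
import os
from typing import Mapping, Optional


def requested_attn_backend(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the backend name requested via the environment, or None.

    Assumption: the variable name VLLM_ATTENTION_BACKEND is illustrative;
    the report only says "the environment variable".
    """
    value = env.get("VLLM_ATTENTION_BACKEND", "").strip()
    # Normalize case so "triton_attn" and "TRITON_ATTN" are treated alike.
    return value.upper() or None
```

Passing the mapping explicitly keeps the helper free of imports from code that is mid-refactor and makes it easy to test without touching the real environment.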
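Testing and CI improvements:
- PR #30417 ([CI/Build][AMD] Skip tests in test_fusions_e2e and test_dbo_dp_ep_gsm8k that require non-existing imports for ROCm): Skips tests that depend on CUDA/DeepEP modules missing in the ROCm environment, so that the distributed test group can pass.
- PR #30422 ([ROCm][CI][Bugfix] Fallback for grouped_topk when num_experts can’t be grouped properly): Resolves a grouped_topk kernel error raised when running the MTP speculative decoding tests on ROCm with an expert count that does not meet the grouping condition, by adding a fallback path.
- PR #29358 ([ROCm][CI] Attempt to fix the failures under a subgroup of the e2e the test group) (merged): Adjusts the async scheduling test configuration on ROCm to use a supported attention backend (TRITON_ATTN) and relaxes numerical tolerances so the tests pass.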
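The fallback idea behind PR #30422 can be illustrated in plain Python. This is a sketch under stated assumptions, not the kernel the PR touches: the divisibility check stands in for whatever grouping condition the real code uses, each group is scored by its best expert, and all function and parameter names are hypothetical.

```python
from typing import List


def plain_topk(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def grouped_topk(scores: List[float], k: int,
                 num_groups: int, topk_groups: int) -> List[int]:
    """Pick k experts, restricted to the topk_groups best expert groups.

    Falls back to an ungrouped top-k when the experts cannot be split
    into equally sized groups (assumed grouping condition).
    """
    num_experts = len(scores)
    if num_groups <= 0 or num_experts % num_groups != 0:
        return plain_topk(scores, k)  # fallback path

    group_size = num_experts // num_groups
    # Score each group by its best expert, then keep the best groups.
    group_best = [max(scores[g * group_size:(g + 1) * group_size])
                  for g in range(num_groups)]
    kept = set(plain_topk(group_best, topk_groups))
    candidates = [i for i in range(num_experts) if i // group_size in kept]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
```

With six experts, three groups divide evenly and the grouped path runs; four groups do not, so the call degrades to an ordinary top-k instead of raising an error.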
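Release and deployment pipeline:
- PR #30395 ([ROCm] [CI] [Release] Add rocm wheel release pipeline): An important infrastructure PR that aims to establish a release pipeline for prebuilt ROCm wheels. The design includes dependency build caching, uploads to S3 storage, and a manual trigger, with the goal of lowering the barrier to using vLLM on AMD machines.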

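Summary: AMD-related work in this cycle ranged from fixing urgent runtime errors to improving test coverage and building a sustainable delivery pipeline (prebuilt packages), indicating that AMD platform support is moving from "functionally usable" toward "experience optimization" and "ecosystem building."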

💬 Analysis of High-Traffic Discussions

Issue #30383: [RFC]: Multi-Process Benchmark Architecture for Scaling Beyond Single-Core Limits
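- Core topic: The reporter points out that the single-process design of the current vllm benchmark tool becomes a bottleneck under heavy load (high QPS, high concurrency), which distorts the reported performance metrics, and proposes a multi-process architecture.
- Views and positions:
  - Proposer (GaoHuaZhang): Uses detailed test data to demonstrate how severe the single-process limit is, and stresses the need for a benchmark tool that reflects real high-concurrency scenarios.
  - Supporter (wenba0): Considers the feature critical and is willing to help.
  - Core maintainer (ywang96): Provides important context, noting that vllm bench serve was originally designed for controlled, single-container environments, and points the community to the existing guidellm subproject, which is designed for large-scale benchmarking. Recommends collaborating with the guidellm developers to avoid maintaining duplicate tools, and plans to better integrate and promote guidellm in the future.
- Point of contention: No substantive disagreement, but the thread exposes strategic considerations about the division of labor and tool positioning within the project. The core team prefers to concentrate effort on a dedicated large-scale benchmarking project (guidellm) rather than widen the scope of the existing tool.
- Current status: Discussion is open; the maintainer's reply gives clear guidance on future development direction.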
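Any multi-process load generator of the kind proposed in this RFC has to divide the target request rate among workers and then combine their measurements. The sketch below shows only that bookkeeping; it is not the RFC's design, it does not spawn processes, and every function name is hypothetical.

```python
import math
from typing import List


def split_qps(total_qps: float, num_workers: int) -> List[float]:
    """Divide a target request rate evenly across worker processes."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [total_qps / num_workers] * num_workers


def merge_latencies(per_worker: List[List[float]]) -> List[float]:
    """Pool the raw latency samples from all workers into one sorted list."""
    return sorted(x for samples in per_worker for x in samples)


def percentile(sorted_samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_samples)))
    return sorted_samples[rank - 1]
```

Tail percentiles are computed on the pooled raw samples rather than by averaging per-worker percentiles, because averaged percentiles are generally not the percentile of the combined distribution.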
Issue #30445: [Bug]: QuantTrio/MiniMax-M2-AWQ produces garbage in 12/10/2025 build
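- Core topic: The reporter gets garbage output when running a specific AWQ-quantized model on that day's build, while the previous day's build works correctly, suggesting that a change merged that day introduced a regression.
- Views and positions:
  - Reporter (eugr): Provides detailed environment information and reproduction steps, and proactively ruled out fastsafetensors as the cause through testing.
- Point of contention: None. This is a typical regression report; the key is to identify which PR or code change caused it.
- Current status: Open, waiting for developers to narrow down the offending commit within the build time window.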
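Closed Issue #11247: [Bug]: disaggregated prefilling hangs when TP=2 (closed)
- Core topic: A nearly year-old issue about the service hanging in a disaggregated prefill/decode (P/D) architecture with a tensor parallel size of 2.
- Views and positions:
  - Many users hitting the same problem: Continued to report the same or similar symptoms in the comments and shared their own workarounds (such as setting the environment variable CUDA_LAUNCH_BLOCKING=1 or adjusting the --max-num-seqs parameter).
  - Analyst (yjsunn): Offers an in-depth analysis, arguing that with high QPS and a small kv_buffer_size, an oversized prefill batch can cause part of the KV cache transfer to fail, which then blocks the requests in the same batch that had succeeded.
- Point of contention: None; the thread is mostly users helping one another and collecting symptoms.
- Final outcome: The issue was marked stale after more than 90 days without follow-up from core developers and was eventually closed. Closing does not mean the problem is solved; the user discussion indicates it remains a pain point for production deployments.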

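🔥 Hot Topics and Trends

Model support and compatibility: the largest source of new issues.
- New model architectures: Support for DeepSeek-V3.2 (#30311, #30343, #30371), the Qwen3 series (#30378, #30387, #30433), and Gemma2 GGUF (#30404, #30411, #30424, among others) is the current hot spot, with problems concentrated in tokenizer integration, tool-call parsing, abnormal inference output, and the details of GGUF loading.
- Quantized models: Reports of incorrect output or crashes with quantized models are increasing, including AWQ (#30445), Quark (#30308), and FP8 KV cache (#30387), reflecting the complexity of quantized deployment and its high stability requirements.
CI/CD and test infrastructure: PR activity is heavily concentrated here.
- AMD CI push: Many PRs (#30417, #30422, #30432, #29358, among others) focus on fixing test failures on ROCm, including skipping unsupported tests, fixing kernel compatibility, and adjusting test parameters, with the goal of getting the AMD CI pipeline fully green.
- Test improvements and additions: Tests were added or improved for the CPU backend (#30347), Whisper (#30072), and others, and import errors in tests (#30476) and environment dependency problems were fixed.
Performance optimization and core features:
- KV cache and data transfer: PRs for heterogeneous KV layouts (#30448), NIXL connector fixes (#30419, #30420), and shared-memory barriers (#30407) show that the reliability and performance of KV cache management and data transfer in distributed and P/D scenarios remain an ongoing focus.
- Compilation and CUDA Graphs: CUDA Graph (#30072) and torch.compile (#30385) support for Whisper, together with the discussion of compile-cache contention (#24601), reflect continued work on inference latency and throughput.
Documentation and user experience: Several PRs improve the documentation, including generating a full list of monitoring metrics (#30388), updating the expert parallel deployment guide (#27933), and adding installation instructions for CPU wheels (#30402), showing that the project is starting to improve user experience systematically while it develops quickly.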

🛠️ Key Technical Changes

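PR #30448: [NIXL] Heterogeneous KV Layout and block_size - prefill NHD and nP > nD support: A key performance optimization PR. It allows the prefill stage to use one KV layout (NHD) and block size, and converts dynamically to the layout (HND) and block size required by the decode stage after prefill completes. Each stage can then use the memory access pattern that suits it best, which particularly benefits decode performance. The implementation requires prefix caching to be disabled.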
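The layout change described for PR #30448 is, at its core, a swap of the token and head axes of a KV block. The following pure-Python sketch shows that transpose on nested lists, assuming NHD means [num_tokens][num_heads][head_dim] and HND means [num_heads][num_tokens][head_dim]; it is an illustration only, the function name is hypothetical, and the block-size regrouping the PR also handles is not shown.

```python
from typing import List

Block = List[List[List[float]]]


def nhd_to_hnd(block: Block) -> Block:
    """Reorder a KV block from [token][head][dim] to [head][token][dim]."""
    num_tokens = len(block)
    num_heads = len(block[0]) if num_tokens else 0
    # Copy each head vector so the result does not alias the input.
    return [[list(block[n][h]) for n in range(num_tokens)]
            for h in range(num_heads)]
```

After the conversion, all tokens belonging to one head are contiguous, which is the property the HND layout is named for.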
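PR #30385: [Core] Whisper support torch.compile: Following CUDA Graph support, this PR enables torch.compile for the Whisper encoder-decoder model. Because the first decode step has to process the encoder output, it compiles only the second and later steps, which speeds up Whisper inference.
PR #30440: [fix] Fix qwen3_coder tool call per parameter streaming: Fixes an issue where Qwen3 Coder tool-call parameter contents (strings) were not streamed token by token as they were generated. This improves the tool-calling experience by keeping parameters up to date in streamed responses.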
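Issue #30343 -> PR #30351 (merged): DeepSeek-V3.2 tokenizer performance problem: Identifies and fixes a severe performance problem that caused the server to hang. The root cause was that the __len__ method of DeepseekV32Tokenizer called the expensive get_added_vocab() on every token decode, blocking the main thread. The fix caches the size of added_vocab at initialization, eliminating the per-token overhead.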
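The pattern behind this fix can be shown with a stand-in tokenizer. The classes below are hypothetical and are not the vLLM implementation; they only contrast a __len__ that recomputes the added vocabulary on every call with one that caches its size at construction time.

```python
class SlowTokenizer:
    """Stand-in for a tokenizer whose get_added_vocab() is expensive."""

    def __init__(self) -> None:
        self.base_vocab_size = 100
        self.calls = 0  # counts how often the expensive method runs

    def get_added_vocab(self) -> dict:
        self.calls += 1
        return {"<tool>": 100, "</tool>": 101}


class UncachedLen:
    """__len__ pays the full cost of get_added_vocab() on every call."""

    def __init__(self, tok: SlowTokenizer) -> None:
        self.tok = tok

    def __len__(self) -> int:
        return self.tok.base_vocab_size + len(self.tok.get_added_vocab())


class CachedLen:
    """__len__ reuses a size computed once at initialization."""

    def __init__(self, tok: SlowTokenizer) -> None:
        self.tok = tok
        self._added_vocab_size = len(tok.get_added_vocab())

    def __len__(self) -> int:
        return self.tok.base_vocab_size + self._added_vocab_size
```

The trade-off is that the cached size goes stale if tokens are added after construction, so this approach assumes the vocabulary is fixed once the tokenizer is built.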
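PR #30407 (merged): fix(shm): Add memory barriers for cross-process shared memory visibility: Fixes a potential data race by adding memory barriers to the cross-process shared-memory broadcast. Notably, the submitter reported that the fix even brought a slight performance improvement in low-latency scenarios.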

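📈 Development Activity Observations

Active contributors: Among the 60 new PRs, alongside core team members (such as LucasWilkinson, DarkLight1337, AndreasKaratzas) there were many community contributors (such as kitaekatt, anker-c2, rogeryoungh) fixing model support, documentation, and assorted bugs.
Deep AMD team involvement: Contributors such as AndreasKaratzas, micah-wil, and haoyangli-amd (AMD employees or AMD-affiliated) were very active and led most of the ROCm fixes and testing work, reflecting the resources AMD is investing in the vLLM ecosystem.
Code review and merge speed: 28 PRs were merged during the window, many of them fixes for critical problems, indicating that the core team reviews and merges important fixes efficiently. Meanwhile, large infrastructure PRs such as #30395 (ROCm wheel release pipeline) are still under discussion.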

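💡 Issues Worth Watching

Strategic direction of the benchmarking tools (Issue #30383): The enhancement requested by the community overlaps with the guidellm subproject that the core team plans to build on. Clearly defining the boundary between vllm bench and guidellm, and guiding users and developers smoothly, will require explicit communication and a roadmap.
Maturity of GGUF support: A large number of GGUF loading problems for Gemma2 and other models have surfaced recently (such as dtype conflicts, missing config entries, and incorrect weight mappings). This indicates that vLLM's GGUF loader still needs better robustness and compatibility across diverse model architectures and quantization schemes.
Stability of distributed P/D deployment: Although the old Issue #11247 has been closed, user feedback indicates that hangs in disaggregated prefilling still affect some users. This remains a risk for high-concurrency, distributed production deployments.
Reliability of quantized inference: Several issues show that different kinds of quantized models (AWQ, Quark, FP8 KV cache) produce incorrect output or crash under specific conditions. Ensuring numerical stability and correctness for quantized inference is an ongoing challenge.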

📋 Appendix: Detailed Data Lists

New Issues

#30383 [RFC]: Multi-Process Benchmark Architecture for Scaling Beyond Single-Core Limits — RFC — by GaoHuaZhang (created: 2025-12-10 18:02 (UTC+8))
#30447 [Usage]: how to load kv cache data into local file — usage — by chx725 (created: 2025-12-11 09:43 (UTC+8))
#30445 [Bug]: QuantTrio/MiniMax-M2-AWQ produces garbage in 12/10/2025 build — bug — by eugr (created: 2025-12-11 09:06 (UTC+8))
#30441 [Usage]: vllm serve setup issues on B300 — usage — by navmarri14 (created: 2025-12-11 07:50 (UTC+8))
#30439 [Bug]: Qwen3 Coder parser does not stream tool call arguments — bug — by koush (created: 2025-12-11 07:45 (UTC+8))
#30436 [Bug]: Speculative decode crashes on PP>1 because self.drafter missing — bug — by kvcop (created: 2025-12-11 07:26 (UTC+8))
#30435 [Bug]: vllm problem with cu130 — bug — by Reviveplugins (created: 2025-12-11 07:24 (UTC+8))
#30387 [Bug]: illegal memory access countered when using kv-cache-type=fp8 loading a weight-fp8 model for evaltest in flash-attn backend — bug — by youngze0016 (created: 2025-12-10 19:43 (UTC+8))
#30401 [Bug]: EAGLE3 failure with gpt-oss 20b and 120b — bug — by Mhdaw (created: 2025-12-11 00:42 (UTC+8))
#30394 [Feature]: Prometheus Metrics Abstraction — feature request — by mladjan-gadzic (created: 2025-12-10 22:22 (UTC+8))
#30392 [Usage]: Docker image v0.12.0 Fail to run inference via Docker image — usage — by kuopching (created: 2025-12-10 21:43 (UTC+8))
#30380 [Usage]: 大家一般怎么使用vllm/tests的？ — usage — by tobeprozy (created: 2025-12-10 17:27 (UTC+8))
#30382 [Bug]: Issues with mistralai/Ministral-3-14B-Instruct-2512 — bug — by eltorre (created: 2025-12-10 17:55 (UTC+8))
#30374 [Feature][CPU Backend]: Add Paged Attention Benchmarks for CPU backend — feature request,cpu — by fadara01 (created: 2025-12-10 15:53 (UTC+8))
#30381 [Usage]: — usage — by tobeprozy (created: 2025-12-10 17:27 (UTC+8))
#30379 [Usage]: how to use vllm/tests/？ — usage — by tobeprozy (created: 2025-12-10 17:25 (UTC+8))
#30378 [Feature]: Automatically infer Qwen3 reranker settings (remove need for hf_overrides) — feature request — by ilopezluna (created: 2025-12-10 17:21 (UTC+8))
#30375 [Bug]: [TPU] ShapeDtypeStruct error when loading custom safetensors checkpoint on TPU v5litepod — bug — by Baltsat (created: 2025-12-10 16:12 (UTC+8))
#30372 [Bug]: vLLM (GPT-OSS) causes distorted tool argument names + infinite tool-call loop with Korean messenger tool — bug — by minmini2 (created: 2025-12-10 14:59 (UTC+8))

Closed Issues

#11247 [Bug]: disaggregated prefilling hangs when TP=2 — bug,stale — by Louis-99 (closed: 2025-12-11 10:17 (UTC+8))
#13040 [Bug]: torchvision.libs/libcudart.41118559.so.12 (deleted): cannot open shared object file: No such file or directory — bug,stale — by wuyifan18 (closed: 2025-12-11 10:17 (UTC+8))
#15513 [Bug]: — bug,stale — by znicelya (closed: 2025-12-11 10:17 (UTC+8))
#16545 [Bug]: Unexpected behavior of returned_token_ids in Reward Modeling for LlamaForCausalLM — bug,stale — by ryokamoi (closed: 2025-12-11 10:17 (UTC+8))
#17292 [Feature]: support reasoning output when offline batched inference — feature request,stale — by wa008 (closed: 2025-12-11 10:17 (UTC+8))
#17327 [Usage] Qwen3 Usage Guide — usage,stale — by simon-mo (closed: 2025-12-11 10:17 (UTC+8))
#17634 [Bug]: When using the LLaMA-Factory framework with InternVL3-8B-hf for batch inference, vLLM throws an error: ValueError: limit_mm_per_prompt is only supported for multimodal models. — bug,stale — by Fangyuan-Liu (closed: 2025-12-11 10:17 (UTC+8))
#17697 [Feature]: Addition of pre-built AMD wheel packages — feature request,stale — by Epliz (closed: 2025-12-11 10:17 (UTC+8))
#22470 [Bug]: gpt oss 20/120b generates wired characters and fails later when i use them — bug,stale — by ShlomoSMR (closed: 2025-12-11 10:17 (UTC+8))
#29453 [CI Failure]: mi325_1: Basic Correctness Test — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:21 (UTC+8))
#29464 [CI Failure]: mi325_1: OpenAI API correctness — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:21 (UTC+8))
#29803 [CI Failure]: mi325_1: Cudagraph test — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:20 (UTC+8))
#29465 [CI Failure]: mi325_2: Prime-RL Integration Test — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:18 (UTC+8))
#29526 [CI Failure]: mi325_1: Entrypoints Integration Test (Pooling) — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:17 (UTC+8))
#29514 [CI Failure]: mi325_4: EPLB Execution Test — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:17 (UTC+8))
#29443 [CI Failure]: mi325_1: Python-only Installation Test — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:16 (UTC+8))
#29537 [CI Failure]: mi325_2: Weight Loading Multiple GPU Test - Large Models — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:09 (UTC+8))
#29534 [CI Failure]: mi325_8: LoRA Test %N — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:01 (UTC+8))
#14083 [Feature]: Improve Logging for Error Messages — help wanted,good first issue,feature request,unstale — by robertgshaw2-redhat (closed: 2025-12-11 07:17 (UTC+8))
#30214 [Bug]: DeepSeek V3.2 on B200 fails with “CUTLASS_MLA is not valid… Reason: [‘sparse not supported’]” — bug — by hadikoub (closed: 2025-12-11 04:20 (UTC+8))
#30240 [Bug]: Lots of “Current vLLM config is not set.” warnings when FlashInfer attention is used — bug — by nvpohanh (closed: 2025-12-11 03:18 (UTC+8))
#24601 [Bug]: Launching multiple vLLM processes at the same time doesn’t work well with vLLM’s compile cache — bug,torch.compile — by zou3519 (closed: 2025-12-11 02:53 (UTC+8))
#30342 [Bug]: HunyuanOCR batching problem with variable sized images in a batch. — bug — by anker-c2 (closed: 2025-12-11 02:09 (UTC+8))
#15636 [Bug]: Outlines broken on vLLM 0.8+ — bug,structured-output,unstale — by cpfiffer (closed: 2025-12-10 21:18 (UTC+8))
#30382 [Bug]: Issues with mistralai/Ministral-3-14B-Instruct-2512 — bug — by eltorre (closed: 2025-12-10 18:47 (UTC+8))
#30381 [Usage]: — usage — by tobeprozy (closed: 2025-12-10 17:28 (UTC+8))
#30379 [Usage]: how to use vllm/tests/？ — usage — by tobeprozy (closed: 2025-12-10 17:26 (UTC+8))
#30311 [Bug]: deepseekv32.DeepseekV32Tokenizer Runtime causes model to crash — bug — by magician-xin (closed: 2025-12-10 16:30 (UTC+8))
#28314 [AMD][CI Failure]: Tracking failure for AMD CI Dependencies & Environments — rocm,ci-failure — by zhewenl (closed: 2025-12-10 13:32 (UTC+8))
#30343 [Bug]: In DeepSeek-V3.2 tokenizer mode, detokenization saturates the main thread, causing the server to hang — bug — by scratch-ml (closed: 2025-12-10 12:05 (UTC+8))

New PRs

#30428 [Chore] Fix torch precision warning — ready,v1 — by yewentao256 (created: 2025-12-11 05:34 (UTC+8))
#30448 [NIXL] Heterogeneous KV Layout and block_size - prefill NHD and nP > nD support — v1,kv-connector — by xuechendi (created: 2025-12-11 10:06 (UTC+8))
#30449 Add SLA-tiered scheduling (opt-in) — documentation,v1 — by ProdByBuddha (created: 2025-12-11 10:17 (UTC+8))
#30437 [Bugfix] missing tokens occur in harmony streaming — frontend,gpt-oss — by Ri0S (created: 2025-12-11 07:36 (UTC+8))
#30431 Revert “[CI] Add Async Eplb nightly CI tests (#29385)” — ready,ci/build — by SageMoore (created: 2025-12-11 06:27 (UTC+8))
#30446 Added a test for invalid inputs for parse_raw_prompts — no labels — by mivehk (created: 2025-12-11 09:23 (UTC+8))
#30444 [Fix] Update lazing loading of video loader backend — multi-modality — by jeremyteboul (created: 2025-12-11 08:49 (UTC+8))
#30442 [Feature] AWQ marlin quantization support for fused moe with lora — no labels — by princepride (created: 2025-12-11 08:10 (UTC+8))
#30390 fix: Update json features supported by xGrammar — structured-output,ready,v1 — by johannesflommersfeld (created: 2025-12-10 20:51 (UTC+8))
#30432 [ROCm] Fix broken import in platform attention backend dispatching — rocm,ready — by AndreasKaratzas (created: 2025-12-11 06:40 (UTC+8))
#30440 [fix] Fix qwen3_coder tool call per parameter streaming — frontend,tool-calling,qwen — by koush (created: 2025-12-11 07:46 (UTC+8))
#30395 [ROCm] [CI] [Release] Add rocm wheel release pipeline — rocm,ci/build — by tjtanaa (created: 2025-12-10 22:28 (UTC+8))
#30443 [PT nightlies] Remove nightly_torch Docker image and use standard — ci/build — by orionr (created: 2025-12-11 08:47 (UTC+8))
#30418 LoRA Slab Optimization — v1 — by Majid-Taheri (created: 2025-12-11 02:47 (UTC+8))
#30438 [Feature][Observability] Fine-grained model runner timing metrics — v1 — by andylolu2 (created: 2025-12-11 07:38 (UTC+8))
#30433 [Bugfix] Qwen3-next with --hf-overrides {"num_hidden_layers":8} — qwen — by heheda12345 (created: 2025-12-11 07:19 (UTC+8))
#30434 fix(gguf): Use EOS token ID from GGUF metadata instead of HF tokenizer — no labels — by kitaekatt (created: 2025-12-11 07:20 (UTC+8))
#30407 fix(shm): Add memory barriers for cross-process shared memory visibility — ready — by kitaekatt (created: 2025-12-11 01:43 (UTC+8))
#30417 [CI/Build][AMD] Skip tests in test_fusions_e2e and test_dbo_dp_ep_gsm8k that require non-existing imports for ROCm — rocm,v1 — by rasmith (created: 2025-12-11 02:15 (UTC+8))
#30399 [BugFix] Fix AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight_scale' — bug,ready,deepseek — by LucasWilkinson (created: 2025-12-10 23:14 (UTC+8))
#30430 [ROCm][Bugfix] Add MLACommonMetadata to allowed attention types for speculative decoding — rocm,speculative-decoding,v1 — by AndreasKaratzas (created: 2025-12-11 06:15 (UTC+8))
#30429 test branch - not for merge — needs-rebase,v1 — by debroy-rh (created: 2025-12-11 05:38 (UTC+8))
#30408 fix(gguf): Disable bfloat16 for GGUF on sm120 device — ready — by kitaekatt (created: 2025-12-11 01:45 (UTC+8))
#30389 Standardise get_rope to use rope_parameters["partial_rotary_factor"], not rotary_dim — performance,ready,llama,qwen,deepseek,gpt-oss — by hmellor (created: 2025-12-10 20:33 (UTC+8))
#30423 fix(gguf): Make GGUFMoEMethod.apply() parameters optional — needs-rebase — by kitaekatt (created: 2025-12-11 03:42 (UTC+8))
#30413 fix(nemotron_h): Add missing rotary positional embeddings to attention layers — no labels — by kitaekatt (created: 2025-12-11 01:46 (UTC+8))
#30427 fix(gguf): Extract attn_logit_softcapping from GGUF metadata — no labels — by kitaekatt (created: 2025-12-11 05:27 (UTC+8))
#30422 [ROCm][CI][Bugfix] Fallback for grouped_topk when num_experts can’t be grouped properly — rocm — by micah-wil (created: 2025-12-11 03:25 (UTC+8))
#30426 [Docs] Update EPLB docs — documentation,ready — by mgoin (created: 2025-12-11 04:45 (UTC+8))
#30425 [LMCache] Relax lmcache version requirement — ready,ci/build,kv-connector — by njhill (created: 2025-12-11 04:04 (UTC+8))
#30424 fix(gemma2): Add quant_config to embedding layer for GGUF support — no labels — by kitaekatt (created: 2025-12-11 03:46 (UTC+8))
#30400 {Deprecation] Remove tokenizer setter — frontend,ready,v1 — by DarkLight1337 (created: 2025-12-10 23:33 (UTC+8))
#30376 [Fix]fix import error from lmcache — ready,kv-connector — by wz1qqx (created: 2025-12-10 16:38 (UTC+8))
#30420 [NIXL][BUG FIX] Fix a bug for PD with host_buffer after merging 29665 — kv-connector — by xuechendi (created: 2025-12-11 03:08 (UTC+8))
#30419 [NIXL][BUG FIX] Fix both failing issue and accuracy issue with nixl + host_buffer on CUDA — v1,kv-connector,nvidia — by xuechendi (created: 2025-12-11 02:54 (UTC+8))
#30421 fix(gemma2): Skip missing parameters during GGUF weight loading — structured-output,v1 — by kitaekatt (created: 2025-12-11 03:24 (UTC+8))
#30391 [IMPROVEMENT] Change MistralReasoningParser behavior — ready — by juliendenize (created: 2025-12-10 21:02 (UTC+8))
#30415 [V0 Deprecation] Deprecate use_v1 — documentation — by MatthewBonanni (created: 2025-12-11 02:08 (UTC+8))
#30416 [Deprecation] Remove old _Backend enum — documentation,ready — by MatthewBonanni (created: 2025-12-11 02:11 (UTC+8))
#30414 [Doc] Add instructions for building docker image on GB300 with CUDA13 — documentation,aarch64-cuda,nvidia — by soodoshll (created: 2025-12-11 02:05 (UTC+8))
#30409 [BugFix] Lazy tokenizer init in StructuredOutputManager to prevent GGUF semaphore leak — structured-output,v1 — by kitaekatt (created: 2025-12-11 01:45 (UTC+8))
#30410 fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell — no labels — by kitaekatt (created: 2025-12-11 01:45 (UTC+8))
#30373 Implement LMDB-based multi-modal cache — ci/build,v1,multi-modality — by petersalas (created: 2025-12-10 15:21 (UTC+8))
#30385 [Core] Whisper support torch.compile — v1 — by NickLucche (created: 2025-12-10 19:11 (UTC+8))
#30403 [Misc] Consistent case for vllm bench serve results — documentation,performance,structured-output — by MatthewBonanni (created: 2025-12-11 00:56 (UTC+8))
#30398 [Chore] Delay recent deprecations — ready,v1,multi-modality — by DarkLight1337 (created: 2025-12-10 23:04 (UTC+8))
#30411 fix(gguf): Ensure Gemma2 configs have hidden_act for backward compatibility — no labels — by kitaekatt (created: 2025-12-11 01:45 (UTC+8))
#30412 fix(gguf): Skip lm_head mapping for models with tied word embeddings — no labels — by kitaekatt (created: 2025-12-11 01:46 (UTC+8))
#30406 fix(nemotron_h): Add missing rotary positional embeddings to attention layers — no labels — by kitaekatt (created: 2025-12-11 01:42 (UTC+8))
#30404 fix(gguf): Ensure Gemma2 configs have hidden_act for backward compatibility — no labels — by kitaekatt (created: 2025-12-11 01:40 (UTC+8))
#30405 fix(gguf): Skip lm_head mapping for models with tied word embeddings — no labels — by kitaekatt (created: 2025-12-11 01:41 (UTC+8))
#30397 [Deprecation] Remove deprecated task, seed and MM settings — documentation,performance,frontend,ready,qwen — by DarkLight1337 (created: 2025-12-10 22:54 (UTC+8))
#30396 [Deprecation] Remove deprecated plugin and compilation fields for v0.13 release — documentation,ready — by DarkLight1337 (created: 2025-12-10 22:45 (UTC+8))
#30402 [Docs][CPU Backend] Add nightly and per revision pre-built Arm CPU wheels — documentation — by ioghiban (created: 2025-12-11 00:47 (UTC+8))
#30388 [Docs] Generate full list of metrics in user docs — documentation,ready — by markmc (created: 2025-12-10 19:50 (UTC+8))
#30386 [v1] Add PrefixLM support to TritonAttention backend — v1 — by Isotr0py (created: 2025-12-10 19:32 (UTC+8))
#30393 qk-rmsnorm op — qwen — by ZYang6263 (created: 2025-12-10 22:00 (UTC+8))
#30384 [BugFix] Fix minimax m2 model rotary_dim — ready — by rogeryoungh (created: 2025-12-10 18:37 (UTC+8))
#30377 adding constraint updates of cos-sin to improve mrope performance — no labels — by wujinyuan1 (created: 2025-12-10 16:48 (UTC+8))
#30371 [Bugfix] Fix the issue where DeepSeek v3.2 cannot use structured_output — structured-output,ready,v1,deepseek — by chaunceyjiang (created: 2025-12-10 12:26 (UTC+8))

Merged PRs

#30431 Revert “[CI] Add Async Eplb nightly CI tests (#29385)” — ready,ci/build — by SageMoore (merged: 2025-12-11 08:48 (UTC+8))
#30432 [ROCm] Fix broken import in platform attention backend dispatching — rocm,ready — by AndreasKaratzas (merged: 2025-12-11 09:12 (UTC+8))
#30106 Add more docs for regex — documentation,structured-output,ready — by xu-song (merged: 2025-12-11 08:12 (UTC+8))
#28051 [Bugfix] fix confusing OOM errors during v1 init — ready,v1 — by shivampr (merged: 2025-12-11 07:17 (UTC+8))
#30407 fix(shm): Add memory barriers for cross-process shared memory visibility — ready — by kitaekatt (merged: 2025-12-11 07:01 (UTC+8))
#30399 [BugFix] Fix AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight_scale' — bug,ready,deepseek — by LucasWilkinson (merged: 2025-12-10 23:59 (UTC+8))
#27933 [docs] Improve wide-EP performance + benchmarking documentation — documentation,ready — by eicherseiji (merged: 2025-12-11 06:15 (UTC+8))
#30426 [Docs] Update EPLB docs — documentation,ready — by mgoin (merged: 2025-12-11 04:56 (UTC+8))
#30216 [LMCache] Fix breakage due to new LMCache version — ready,ci/build,kv-connector — by njhill (merged: 2025-12-11 03:52 (UTC+8))
#30400 {Deprecation] Remove tokenizer setter — frontend,ready,v1 — by DarkLight1337 (merged: 2025-12-11 03:10 (UTC+8))
#30241 [bug] Fix “Current vLLM config is not set.” warnings when FlashInfer attention is used — bug,ready,v1,nvidia — by nvpohanh (merged: 2025-12-11 03:18 (UTC+8))
#29289 [Perf] Enable environment cache in EngineCore to enable the feature for UniProcExecutor as well — ready,v1 — by Jialin (merged: 2025-12-11 03:13 (UTC+8))
#26813 [P/D] KV Load Failure Recovery/Abort Configuration — frontend,ready,v1,kv-connector — by wseaton (merged: 2025-12-11 03:00 (UTC+8))
#30344 [Bugfix] Fix HunyuanOCR cross-image contamination in batch processing — ready — by anker-c2 (merged: 2025-12-11 02:09 (UTC+8))
#30403 [Misc] Consistent case for vllm bench serve results — documentation,performance,structured-output — by MatthewBonanni (merged: 2025-12-11 01:44 (UTC+8))
#30398 [Chore] Delay recent deprecations — ready,v1,multi-modality — by DarkLight1337 (merged: 2025-12-11 01:48 (UTC+8))
#30388 [Docs] Generate full list of metrics in user docs — documentation,ready — by markmc (merged: 2025-12-11 00:09 (UTC+8))
#30331 [Bugfix] tpu_model_runner: set vllm config context when calling reset_dynamo_cache() — tpu,ready,v1 — by dtrifiro (merged: 2025-12-10 20:58 (UTC+8))
#30072 [Core] Whisper enable FULL_DECODE_ONLY CudaGraph — ready,v1,multi-modality,nvidia — by NickLucche (merged: 2025-12-10 22:14 (UTC+8))
#30384 [BugFix] Fix minimax m2 model rotary_dim — ready — by rogeryoungh (merged: 2025-12-10 20:58 (UTC+8))
#30062 [CPU] Support for Whisper — ready,ci/build,v1,multi-modality — by aditew01 (merged: 2025-12-10 20:58 (UTC+8))
#30351 [Bugfix] Cache added_vocab to avoid per-token overhead — ready — by scratch-ml (merged: 2025-12-10 12:05 (UTC+8))
#30371 [Bugfix] Fix the issue where DeepSeek v3.2 cannot use structured_output — structured-output,ready,v1,deepseek — by chaunceyjiang (merged: 2025-12-10 16:30 (UTC+8))
#30347 [cpu][ci] Add CPU Attention Tests for Neon Backend — ready — by fadara01 (merged: 2025-12-10 13:37 (UTC+8))
#29358 [ROCm][CI] Attempt to fix the failures under a subgroup of the e2e the test group — rocm,ready,ci/build,v1,multi-modality — by AndreasKaratzas (merged: 2025-12-10 13:33 (UTC+8))
#30339 [CMake][Build]: Remove unused ACL CMake env variables — ready,ci/build — by Radu2k (merged: 2025-12-10 12:27 (UTC+8))
#30345 Fix typos in comments across multiple files — documentation,ready,v1 — by wilsonwu (merged: 2025-12-10 12:05 (UTC+8))
#30308 [bugfix][quantization] fix quark qwen3 kv_cache quantization — ready,qwen — by haoyangli-amd (merged: 2025-12-10 11:24 (UTC+8))

PRs Closed Without Merging

#30086 Revert “[CI] Add Async Eplb nightly CI tests” — ready,ci/build — by SageMoore (closed: 2025-12-11 06:18 (UTC+8))
#29357 Add request-ids to TranscriptionRequest, TranslationRequest — documentation,frontend — by eicherseiji (closed: 2025-12-11 06:10 (UTC+8))
#30429 test branch - not for merge — needs-rebase,v1 — by debroy-rh (closed: 2025-12-11 06:08 (UTC+8))
#29924 [WIP] Try fixing nightly build pipeline — ci/build — by atalman (closed: 2025-12-11 05:54 (UTC+8))
#30423 fix(gguf): Make GGUFMoEMethod.apply() parameters optional — needs-rebase — by kitaekatt (closed: 2025-12-11 05:44 (UTC+8))
#30415 [V0 Deprecation] Deprecate use_v1 — documentation — by MatthewBonanni (closed: 2025-12-11 02:25 (UTC+8))
#30416 [Deprecation] Remove old _Backend enum — documentation,ready — by MatthewBonanni (closed: 2025-12-11 02:26 (UTC+8))
#29198 [Model] Restore Gemma3 GGUF multimodal support with GGUF-only guards — ready,v1,multi-modality — by lucianommartins (closed: 2025-12-11 03:04 (UTC+8))
#29819 fix(shm): Add memory barriers for cross-process shared memory visibility — no labels — by kitaekatt (closed: 2025-12-11 01:42 (UTC+8))
#30406 fix(nemotron_h): Add missing rotary positional embeddings to attention layers — no labels — by kitaekatt (closed: 2025-12-11 01:42 (UTC+8))
#30404 fix(gguf): Ensure Gemma2 configs have hidden_act for backward compatibility — no labels — by kitaekatt (closed: 2025-12-11 01:40 (UTC+8))
#30365 fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell — no labels — by kitaekatt (closed: 2025-12-11 01:42 (UTC+8))
#30284 [BugFix] Lazy tokenizer init in StructuredOutputManager to prevent GGUF semaphore leak — structured-output,v1 — by kitaekatt (closed: 2025-12-11 01:42 (UTC+8))
#30090 fix: Force float16 dtype for GGUF models to fix incorrect output — ready — by kitaekatt (closed: 2025-12-11 01:42 (UTC+8))
#30405 fix(gguf): Skip lm_head mapping for models with tied word embeddings — no labels — by kitaekatt (closed: 2025-12-11 01:41 (UTC+8))
#28156 [CI/Build] Skip encoder-decoder models on AMD — rocm,ready,ci/build — by zhewenl (closed: 2025-12-11 01:22 (UTC+8))
#28353 try fix by record_stream() — needs-rebase — by zhewenl (closed: 2025-12-11 01:22 (UTC+8))
#28799 enable tests — needs-rebase,ci/build — by zhewenl (closed: 2025-12-11 01:22 (UTC+8))
#28523 debug tests/tool_use/test_tool_calls.py failure on AMD — documentation,performance,rocm,frontend,ci/build,tool-calling — by zhewenl (closed: 2025-12-11 01:22 (UTC+8))
#30258 [Feature]: OpenTelemetry Metrics Support — v1 — by mladjan-gadzic (closed: 2025-12-10 22:23 (UTC+8))
#23997 Feature/sampler benchmark #23977 — performance,unstale — by baonudesifeizhai (closed: 2025-12-10 21:14 (UTC+8))
#26701 [ROCm]: W8A8BlockFp8LinearOp should use AITER on MI355X — rocm,ready,needs-rebase — by gronsti-amd (closed: 2025-12-10 20:33 (UTC+8))
#30348 [Docs]: adds a new metric vllm:request_prefill_kv_computed_tokens in docs — documentation — by googs1025 (closed: 2025-12-10 19:59 (UTC+8))
#30349 [BugFix] Fix minimax m2 model rope_parameters — no labels — by esmeetu (closed: 2025-12-10 18:47 (UTC+8))
#29653 fix potential object has no attribute ‘bias’ error — no labels — by allerou4 (closed: 2025-12-10 15:16 (UTC+8))
#30297 [Core] Add SLA-tiered scheduling (opt-in) and docs — documentation,v1 — by ProdByBuddha (closed: 2025-12-10 13:13 (UTC+8))