vLLM Development Activity Report - 2025-12-10
Time window: 2025-12-10 10:48 (UTC+8) ~ 2025-12-11 10:48 (UTC+8). Statistics: new issues 19 | closed issues 30 | new PRs 60 | merged PRs 28 | PRs closed without merging 26
📊 Daily Development Status Summary
During this observation window, the vLLM project maintained a very high level of development activity, opening and merging a large number of PRs (60/28), concentrated on model support, performance tuning, and test-infrastructure improvements. AMD/ROCm ecosystem support was a prominent focus, with multiple ROCm-related CI fixes and feature-enhancement PRs submitted or merged. Community discussion was lively, centering on the future direction of the benchmarking tooling, support issues for new models (e.g. DeepSeek-V3.2, Gemma2 GGUF), and the reliability of KV-cache data transfer.
🎯 AMD/ROCm Ecosystem Updates
AMD ecosystem activity was very high this cycle, spanning bug fixes, CI pipeline construction, and performance optimization.
- Core bug fixes:
  - PR #30432 ([ROCm] Fix broken import in platform attention backend dispatching): fixes a ROCm platform initialization failure caused by importing `get_env_variable_attn_backend`. The problem stemmed from a dependency on the attention backend selector currently being refactored (PR #30396); this fix bypasses that dependency by checking the environment variable directly, so the ROCm platform starts up normally.
  - PR #30308 ([bugfix][quantization] fix quark qwen3 kv_cache quantization) (merged): submitted by an AMD employee (haoyangli-amd), this fixes a cache-scale detection failure in the Quark quantization tool when handling KV-cache quantization for Qwen3 MoE models, caused by not correctly calling the base-class method, ensuring correctness of the quantized models.
  - PR #30430 ([ROCm][Bugfix] Add MLACommonMetadata to allowed attention types for speculative decoding): fixes a compatibility issue on ROCm when using MLA (e.g. DeepSeek models) with speculative decoding enabled (`num_speculative_tokens > 1`) by adding `MLACommonMetadata` to the supported list.
- Testing and CI enhancements:
  - PR #30417 ([CI/Build][AMD] Skip tests in test_fusions_e2e and test_dbo_dp_ep_gsm8k that require non-existing imports for ROCm): skips tests that depend on CUDA/DeepEP modules missing in the ROCm environment, allowing the distributed test group to pass.
  - PR #30422 ([ROCm][CI][Bugfix] Fallback for grouped_topk when num_experts can’t be grouped properly): resolves a `grouped_topk` kernel error triggered when running MTP speculative-decoding tests on ROCm with an expert count that does not satisfy the grouping constraint, adding a fallback mechanism.
  - PR #29358 ([ROCm][CI] Attempt to fix the failures under a subgroup of the e2e the test group) (merged): adjusts the async scheduling test configuration on ROCm to use a supported attention backend (`TRITON_ATTN`) and relaxes numerical tolerances so the tests pass.
- Release and deployment pipeline:
  - PR #30395 ([ROCm] [CI] [Release] Add rocm wheel release pipeline): an important infrastructure PR that sets up a release pipeline for pre-built wheel packages on the ROCm platform. The design includes dependency build caching, upload to S3 storage, and a manual trigger mechanism, with the goal of lowering the barrier to using vLLM on AMD machines.
Summary: AMD-related activity this cycle extended from fixing urgent runtime errors to improving test coverage and building a long-term sustainable delivery pipeline (pre-built packages), showing that AMD platform support is moving from "functional" toward "polished experience" and "ecosystem building".
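The fallback added by PR #30422 can be illustrated with a small sketch. This is a hypothetical, pure-Python illustration of the grouped-top-k pattern (the actual vLLM kernel operates on GPU tensors and its group-selection details may differ): when the expert count cannot be evenly partitioned into groups, fall back to a plain global top-k instead of erroring.

```python
def grouped_topk_with_fallback(scores, num_groups, topk):
    """Pick top-k expert indices for one token's router scores.

    Falls back to an ungrouped top-k when num_experts is not evenly
    divisible into num_groups (the condition that crashed on ROCm).
    """
    num_experts = len(scores)
    if num_groups <= 0 or num_experts % num_groups != 0:
        # Fallback path: ignore grouping, take a global top-k.
        return sorted(range(num_experts),
                      key=lambda i: scores[i], reverse=True)[:topk]

    group_size = num_experts // num_groups
    # Grouped path: rank groups by their best expert, keep the top
    # half of groups, then take top-k experts within kept groups.
    group_best = [max(scores[g * group_size:(g + 1) * group_size])
                  for g in range(num_groups)]
    kept = sorted(range(num_groups),
                  key=lambda g: group_best[g],
                  reverse=True)[:max(1, num_groups // 2)]
    candidates = [i for g in kept
                  for i in range(g * group_size, (g + 1) * group_size)]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:topk]
```

The key design point is that the fallback changes routing quality slightly (no group constraint) but preserves correctness, which is acceptable for making a CI test pass.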
💬 High-Engagement Discussion Analysis
- Issue #30383: [RFC]: Multi-Process Benchmark Architecture for Scaling Beyond Single-Core Limits
  - Core topic: the user points out that the single-process design of the current `vllm bench` tooling becomes a bottleneck under heavy load (high QPS, high concurrency), distorting performance metrics, and proposes a multi-process architecture.
  - Positions:
    - Proposer (GaoHuaZhang): argues with detailed test data that the single-process limitation is severe, stressing the need for a benchmark tool that reflects realistic high-concurrency scenarios.
    - Supporter (wenba0): considers the feature essential and offers to help.
    - Core maintainer (ywang96): provides important context, noting that `vllm bench serve` was originally designed for controlled single-container environments, and points the community to the existing guidellm subproject, which is purpose-built for large-scale benchmarking. Recommends collaborating with the guidellm developers to avoid maintaining duplicate tooling, and plans to better integrate and promote guidellm in the future.
  - Point of contention: no substantive dispute, but the thread reveals strategic considerations around internal division of labor and tool positioning. The core team prefers to concentrate effort on a dedicated large-scale benchmarking project (guidellm) rather than broaden the scope of the existing tool.
  - Status: discussion open; the maintainer's reply gives clear guidance for future development direction.
- Issue #30445: [Bug]: QuantTrio/MiniMax-M2-AWQ produces garbage in 12/10/2025 build
  - Core topic: the user gets garbage output when running a specific AWQ-quantized model on that day's build, while rolling back to the previous day's build works, suggesting a regression introduced by a change merged that day.
  - Positions:
    - Reporter (eugr): provides thorough environment information and reproduction steps, and proactively ruled out `fastsafetensors` as the cause.
  - Point of contention: none. This is a typical regression report; the key is pinpointing which specific PR or code change caused the problem.
  - Status: open, awaiting developers to bisect the commits within the build time window.
- Closed Issue #11247: [Bug]: disaggregated prefilling hangs when TP=2 (closed)
  - Core topic: a nearly year-old issue: under a disaggregated prefill/decode (P/D) architecture with tensor parallelism of 2, the service hangs.
  - Positions:
    - Many affected users: continuously reported the same or similar symptoms in the comments and shared their own workarounds (e.g. setting the environment variable `CUDA_LAUNCH_BLOCKING=1`, tuning `--max-num-seqs`).
    - Analyst (yjsunn): offered an in-depth analysis suggesting that under high QPS with a small `kv_buffer_size`, oversized prefill batches can cause partial KV-cache transfer failures, which then block the already-successful requests in the same batch.
  - Point of contention: none; mostly mutual user support and symptom collection.
  - Outcome: the issue was marked `stale` after more than 90 days without core-developer follow-up and was ultimately closed. Closure does not mean the problem is solved; the discussion thread shows it remains a pain point for production deployments.
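The client-side bottleneck described in Issue #30383 can, in principle, be sidestepped by sharding load generation across processes. The sketch below is purely illustrative (it is not the `vllm bench` implementation, and the function names are hypothetical): it splits a target request count across worker processes so no single Python process limits throughput, then aggregates per-worker results.

```python
import multiprocessing as mp


def _worker(n_requests, result_queue):
    # Stand-in for a load-generation loop; a real worker would issue
    # HTTP requests to the serving endpoint and record latencies.
    completed = 0
    for _ in range(n_requests):
        completed += 1
    result_queue.put(completed)


def run_benchmark(total_requests, num_workers):
    """Shard total_requests across num_workers processes and return the
    aggregate completion count."""
    per_worker = total_requests // num_workers
    shares = [per_worker] * num_workers
    shares[-1] += total_requests - per_worker * num_workers  # remainder
    q = mp.Queue()
    procs = [mp.Process(target=_worker, args=(n, q)) for n in shares]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sum(q.get() for _ in procs)
```

In a real benchmark each worker would also need its own event loop and a shared clock for latency percentiles, which is exactly the coordination complexity the RFC discusses.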
🔥 Hot Topics and Trend Analysis
- Model support and compatibility: the largest source of new issues.
  - New model architectures: support for DeepSeek-V3.2 (#30311, #30343, #30371), the Qwen3 family (#30378, #30387, #30433), and Gemma2 GGUF (#30404, #30411, #30424, etc.) is the current hotspot, with problems concentrated on tokenizer integration, tool-call parsing, abnormal inference output, and assorted details of GGUF-format loading.
  - Quantized models: reports of incorrect output or crashes in practice with AWQ (#30445), Quark (#30308), and FP8 KV cache (#30387) quantized models are increasing, reflecting the complexity of quantized deployment and the high bar for stability.
- CI/CD and test infrastructure: PR activity is heavily concentrated here.
  - AMD CI push: many PRs (#30417, #30422, #30432, #29358, etc.) focus on fixing test failures on the ROCm platform, including skipping unsupported tests, fixing kernel compatibility, and adjusting test parameters, with the goal of getting the AMD CI pipeline fully green.
  - Test optimization and additions: new or improved tests for the CPU backend (#30347) and Whisper (#30072), plus fixes for test import errors (#30476) and environment-dependency issues.
- Performance optimization and core features:
  - KV cache and data transfer: PRs on heterogeneous KV layouts (#30448), NIXL connector fixes (#30419, #30420), and shared-memory barriers (#30407) show that KV-cache management and the reliability and performance of data transfer in distributed and P/D scenarios remain a sustained optimization focus.
  - Compilation and graph capture: CUDA Graph support for Whisper (#30072) and `torch.compile` (#30385), plus the discussion of compile-cache contention (#24601), reflect the relentless pursuit of lower inference latency and higher throughput.
- Documentation and user experience: several PRs improve documentation, including generating a complete list of monitoring metrics (#30388), updating the expert-parallel deployment guide (#27933), and adding CPU wheel installation instructions (#30402), showing that the project is systematically improving user experience even while moving fast.
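The layout switch behind the heterogeneous-KV work (#30448) amounts to a transpose between token-major (NHD: num_tokens x num_heads x head_dim) and head-major (HND) orderings. A minimal pure-Python sketch of the idea (illustrative only; the real implementation operates on contiguous GPU tensors, where the transpose also changes memory-access patterns):

```python
def nhd_to_hnd(kv_block):
    """Convert a KV block from NHD (token-major) to HND (head-major).

    kv_block: nested list indexed as [token][head][dim element].
    Returns the same data indexed as [head][token][dim element].
    """
    num_tokens = len(kv_block)
    num_heads = len(kv_block[0])
    return [
        [kv_block[t][h] for t in range(num_tokens)]
        for h in range(num_heads)
    ]
```

Applying the conversion twice round-trips back to the original layout, which is why prefill can write NHD and decode can read HND without losing data; the cost is the one-time reshuffle at the prefill/decode handoff.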
🛠️ Key Technical Changes
- PR #30448: [NIXL] Heterogeneous KV Layout and block_size - prefill NHD and nP > nD support: a key performance-optimization PR that lets the prefill stage use one KV layout (NHD) and block size, then dynamically convert to the layout (HND) and block size required by the decode stage once prefill completes. This allows each stage to use its optimal memory-access pattern and particularly benefits decode performance. The implementation requires prefix caching to be disabled.
- PR #30385: [Core] Whisper support `torch.compile`: following CUDA Graph support, this PR further enables `torch.compile` for the Whisper encoder-decoder model. Because the first decode step must consume the encoder output, it cleverly adopts the strategy of compiling only the second and subsequent steps, significantly improving Whisper inference speed.
- PR #30440: [fix] Fix qwen3_coder tool call per parameter streaming: fixes a problem where string-valued tool-call parameters of the Qwen3 Coder model could not stream out token by token. This improves the user experience in tool-calling scenarios by making parameter content arrive in real time in the streamed response.
- Issue #30343 -> PR #30351 (merged): DeepSeek-V3.2 tokenizer performance: discovered and fixed a severe performance problem that caused the server to hang. The root cause was that the DeepSeekV32Tokenizer's `__len__` method called the expensive `get_added_vocab()` on every token decode, blocking the main thread. The fix caches the size of `added_vocab` at initialization, eliminating the per-token overhead.
- PR #30407 (merged): fix(shm): Add memory barriers for cross-process shared memory visibility: fixes a potential data race by adding memory barriers to the cross-process shared-memory broadcast. Notably, the author reported that the fix even brought a slight performance improvement in low-latency scenarios.
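The #30351 fix follows a common pattern: hoist an expensive call out of a hot path by caching its result at construction time. A minimal sketch of the pattern (hypothetical class names and instrumentation, not vLLM's actual tokenizer code):

```python
class SlowTokenizer:
    """Illustrates the bug: __len__ recomputes the added vocab on every call."""

    def __init__(self, base_size, added_vocab):
        self.base_size = base_size
        self._added_vocab = added_vocab
        self.vocab_calls = 0  # instrumentation for this sketch only

    def get_added_vocab(self):
        self.vocab_calls += 1  # expensive in the real tokenizer
        return dict(self._added_vocab)

    def __len__(self):
        # Called once per decoded token -> per-token overhead.
        return self.base_size + len(self.get_added_vocab())


class FixedTokenizer(SlowTokenizer):
    """The fix: cache the added-vocab size once at init."""

    def __init__(self, base_size, added_vocab):
        super().__init__(base_size, added_vocab)
        self._added_size = len(self.get_added_vocab())  # computed once

    def __len__(self):
        return self.base_size + self._added_size
```

The trade-off is staleness: the cached size is correct only as long as the added vocabulary does not change after initialization, which holds for the tokenizer-serving scenario described in the issue.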
📈 Development Activity Observations
- Contributor activity: among the 60 new PRs, alongside core team members (e.g. LucasWilkinson, DarkLight1337, AndreasKaratzas), many community contributors appeared (e.g. kitaekatt, anker-c2, rogeryoungh), actively fixing model support, documentation, and assorted bugs.
- Deep AMD team involvement: contributors such as AndreasKaratzas, micah-wil, and haoyangli-amd (AMD or AMD-affiliated) were highly active, driving most of the ROCm platform fixes and test work, reflecting the resources AMD is investing in the vLLM ecosystem.
- Review and merge velocity: 28 PRs were merged within the observation window, many of them fixes for critical issues, indicating high review-and-merge efficiency for important fixes. Meanwhile, large infrastructure PRs such as #30395 (the ROCm wheel release pipeline) are still under discussion and moving forward.
💡 Issues Worth Watching
- Strategic direction of benchmarking tools (Issue #30383): the community's enhancement request overlaps with the core team's planned guidellm subproject. Clearly delineating the boundary between `vllm bench` and guidellm, and smoothly guiding users and developers, will require explicit communication and a roadmap.
- Maturity of GGUF support: a wave of recent issues involve loading GGUF versions of Gemma2 and other models (dtype conflicts, missing config entries, weight-mapping errors, etc.). This suggests vLLM's GGUF loader still needs stronger robustness and compatibility across diverse model architectures and quantization schemes.
- Stability of distributed P/D deployments: although the old Issue #11247 has been closed, user feedback indicates hangs in disaggregated prefilling scenarios still affect some users. This remains a risk for high-concurrency, distributed production deployments.
- Reliability of quantized inference: multiple issues show that different types of quantized models (AWQ, Quark, FP8 KV cache) can produce incorrect output or crash under specific conditions. Ensuring numerical stability and correctness of quantized inference is an ongoing challenge.
📋 Appendix: Detailed Data Lists
New Issues
- #30383 [RFC]: Multi-Process Benchmark Architecture for Scaling Beyond Single-Core Limits — RFC — by GaoHuaZhang (created: 2025-12-10 18:02 (UTC+8))
- #30447 [Usage]: how to load kv cache data into local file — usage — by chx725 (created: 2025-12-11 09:43 (UTC+8))
- #30445 [Bug]: QuantTrio/MiniMax-M2-AWQ produces garbage in 12/10/2025 build — bug — by eugr (created: 2025-12-11 09:06 (UTC+8))
- #30441 [Usage]: vllm serve setup issues on B300 — usage — by navmarri14 (created: 2025-12-11 07:50 (UTC+8))
- #30439 [Bug]: Qwen3 Coder parser does not stream tool call arguments — bug — by koush (created: 2025-12-11 07:45 (UTC+8))
- #30436 [Bug]: Speculative decode crashes on PP>1 because self.drafter missing — bug — by kvcop (created: 2025-12-11 07:26 (UTC+8))
- #30435 [Bug]: vllm problem with cu130 — bug — by Reviveplugins (created: 2025-12-11 07:24 (UTC+8))
- #30387 [Bug]: illegal memory access countered when using kv-cache-type=fp8 loading a weight-fp8 model for evaltest in flash-attn backend — bug — by youngze0016 (created: 2025-12-10 19:43 (UTC+8))
- #30401 [Bug]: EAGLE3 failure with gpt-oss 20b and 120b — bug — by Mhdaw (created: 2025-12-11 00:42 (UTC+8))
- #30394 [Feature]: Prometheus Metrics Abstraction — feature request — by mladjan-gadzic (created: 2025-12-10 22:22 (UTC+8))
- #30392 [Usage]: Docker image v0.12.0 Fail to run inference via Docker image — usage — by kuopching (created: 2025-12-10 21:43 (UTC+8))
- #30380 [Usage]: How do people generally use vllm/tests? — usage — by tobeprozy (created: 2025-12-10 17:27 (UTC+8))
- #30382 [Bug]: Issues with mistralai/Ministral-3-14B-Instruct-2512 — bug — by eltorre (created: 2025-12-10 17:55 (UTC+8))
- #30374 [Feature][CPU Backend]: Add Paged Attention Benchmarks for CPU backend — feature request,cpu — by fadara01 (created: 2025-12-10 15:53 (UTC+8))
- #30381 [Usage]: — usage — by tobeprozy (created: 2025-12-10 17:27 (UTC+8))
- #30379 [Usage]: how to use vllm/tests/? — usage — by tobeprozy (created: 2025-12-10 17:25 (UTC+8))
- #30378 [Feature]: Automatically infer Qwen3 reranker settings (remove need for hf_overrides) — feature request — by ilopezluna (created: 2025-12-10 17:21 (UTC+8))
- #30375 [Bug]: [TPU] ShapeDtypeStruct error when loading custom safetensors checkpoint on TPU v5litepod — bug — by Baltsat (created: 2025-12-10 16:12 (UTC+8))
- #30372 [Bug]: vLLM (GPT-OSS) causes distorted tool argument names + infinite tool-call loop with Korean messenger tool — bug — by minmini2 (created: 2025-12-10 14:59 (UTC+8))
Closed Issues
- #11247 [Bug]: disaggregated prefilling hangs when TP=2 — bug,stale — by Louis-99 (closed: 2025-12-11 10:17 (UTC+8))
- #13040 [Bug]: torchvision.libs/libcudart.41118559.so.12 (deleted): cannot open shared object file: No such file or directory — bug,stale — by wuyifan18 (closed: 2025-12-11 10:17 (UTC+8))
- #15513 [Bug]: — bug,stale — by znicelya (closed: 2025-12-11 10:17 (UTC+8))
- #16545 [Bug]: Unexpected behavior of `returned_token_ids` in Reward Modeling for LlamaForCausalLM — bug,stale — by ryokamoi (closed: 2025-12-11 10:17 (UTC+8))
- #17292 [Feature]: support reasoning output when offline batched inference — feature request,stale — by wa008 (closed: 2025-12-11 10:17 (UTC+8))
- #17327 [Usage] Qwen3 Usage Guide — usage,stale — by simon-mo (closed: 2025-12-11 10:17 (UTC+8))
- #17634 [Bug]: When using the LLaMA-Factory framework with InternVL3-8B-hf for batch inference, vLLM throws an error: ValueError: limit_mm_per_prompt is only supported for multimodal models. — bug,stale — by Fangyuan-Liu (closed: 2025-12-11 10:17 (UTC+8))
- #17697 [Feature]: Addition of pre-built AMD wheel packages — feature request,stale — by Epliz (closed: 2025-12-11 10:17 (UTC+8))
- #22470 [Bug]: gpt oss 20/120b generates wired characters and fails later when i use them — bug,stale — by ShlomoSMR (closed: 2025-12-11 10:17 (UTC+8))
- #29453 [CI Failure]: mi325_1: Basic Correctness Test — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:21 (UTC+8))
- #29464 [CI Failure]: mi325_1: OpenAI API correctness — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:21 (UTC+8))
- #29803 [CI Failure]: mi325_1: Cudagraph test — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:20 (UTC+8))
- #29465 [CI Failure]: mi325_2: Prime-RL Integration Test — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:18 (UTC+8))
- #29526 [CI Failure]: mi325_1: Entrypoints Integration Test (Pooling) — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:17 (UTC+8))
- #29514 [CI Failure]: mi325_4: EPLB Execution Test — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:17 (UTC+8))
- #29443 [CI Failure]: mi325_1: Python-only Installation Test — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:16 (UTC+8))
- #29537 [CI Failure]: mi325_2: Weight Loading Multiple GPU Test - Large Models — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:09 (UTC+8))
- #29534 [CI Failure]: mi325_8: LoRA Test %N — ci-failure — by AndreasKaratzas (closed: 2025-12-11 09:01 (UTC+8))
- #14083 [Feature]: Improve Logging for Error Messages — help wanted,good first issue,feature request,unstale — by robertgshaw2-redhat (closed: 2025-12-11 07:17 (UTC+8))
- #30214 [Bug]: DeepSeek V3.2 on B200 fails with “CUTLASS_MLA is not valid… Reason: [‘sparse not supported’]” — bug — by hadikoub (closed: 2025-12-11 04:20 (UTC+8))
- #30240 [Bug]: Lots of “Current vLLM config is not set.” warnings when FlashInfer attention is used — bug — by nvpohanh (closed: 2025-12-11 03:18 (UTC+8))
- #24601 [Bug]: Launching multiple vLLM processes at the same time doesn’t work well with vLLM’s compile cache — bug,torch.compile — by zou3519 (closed: 2025-12-11 02:53 (UTC+8))
- #30342 [Bug]: HunyuanOCR batching problem with variable sized images in a batch. — bug — by anker-c2 (closed: 2025-12-11 02:09 (UTC+8))
- #15636 [Bug]: Outlines broken on vLLM 0.8+ — bug,structured-output,unstale — by cpfiffer (closed: 2025-12-10 21:18 (UTC+8))
- #30382 [Bug]: Issues with mistralai/Ministral-3-14B-Instruct-2512 — bug — by eltorre (closed: 2025-12-10 18:47 (UTC+8))
- #30381 [Usage]: — usage — by tobeprozy (closed: 2025-12-10 17:28 (UTC+8))
- #30379 [Usage]: how to use vllm/tests/? — usage — by tobeprozy (closed: 2025-12-10 17:26 (UTC+8))
- #30311 [Bug]: deepseekv32.DeepseekV32Tokenizer Runtime causes model to crash — bug — by magician-xin (closed: 2025-12-10 16:30 (UTC+8))
- #28314 [AMD][CI Failure]: Tracking failure for AMD CI Dependencies & Environments — rocm,ci-failure — by zhewenl (closed: 2025-12-10 13:32 (UTC+8))
- #30343 [Bug]: In DeepSeek-V3.2 tokenizer mode, detokenization saturates the main thread, causing the server to hang — bug — by scratch-ml (closed: 2025-12-10 12:05 (UTC+8))
New PRs
- #30428 [Chore] Fix torch precision warning — ready,v1 — by yewentao256 (created: 2025-12-11 05:34 (UTC+8))
- #30448 [NIXL] Heterogeneous KV Layout and block_size - prefill NHD and nP > nD support — v1,kv-connector — by xuechendi (created: 2025-12-11 10:06 (UTC+8))
- #30449 Add SLA-tiered scheduling (opt-in) — documentation,v1 — by ProdByBuddha (created: 2025-12-11 10:17 (UTC+8))
- #30437 [Bugfix] missing tokens occur in harmony streaming — frontend,gpt-oss — by Ri0S (created: 2025-12-11 07:36 (UTC+8))
- #30431 Revert “[CI] Add Async Eplb nightly CI tests (#29385)” — ready,ci/build — by SageMoore (created: 2025-12-11 06:27 (UTC+8))
- #30446 Added a test for invalid inputs for parse_raw_prompts — no labels — by mivehk (created: 2025-12-11 09:23 (UTC+8))
- #30444 [Fix] Update lazing loading of video loader backend — multi-modality — by jeremyteboul (created: 2025-12-11 08:49 (UTC+8))
- #30442 [Feature] AWQ marlin quantization support for fused moe with lora — no labels — by princepride (created: 2025-12-11 08:10 (UTC+8))
- #30390 fix: Update json features supported by xGrammar — structured-output,ready,v1 — by johannesflommersfeld (created: 2025-12-10 20:51 (UTC+8))
- #30432 [ROCm] Fix broken import in platform attention backend dispatching — rocm,ready — by AndreasKaratzas (created: 2025-12-11 06:40 (UTC+8))
- #30440 [fix] Fix qwen3_coder tool call per parameter streaming — frontend,tool-calling,qwen — by koush (created: 2025-12-11 07:46 (UTC+8))
- #30395 [ROCm] [CI] [Release] Add rocm wheel release pipeline — rocm,ci/build — by tjtanaa (created: 2025-12-10 22:28 (UTC+8))
- #30443 [PT nightlies] Remove nightly_torch Docker image and use standard — ci/build — by orionr (created: 2025-12-11 08:47 (UTC+8))
- #30418 LoRA Slab Optimization — v1 — by Majid-Taheri (created: 2025-12-11 02:47 (UTC+8))
- #30438 [Feature][Observability] Fine-grained model runner timing metrics — v1 — by andylolu2 (created: 2025-12-11 07:38 (UTC+8))
- #30433 [Bugfix] Qwen3-next with --hf-overrides {"num_hidden_layers":8} — qwen — by heheda12345 (created: 2025-12-11 07:19 (UTC+8))
- #30434 fix(gguf): Use EOS token ID from GGUF metadata instead of HF tokenizer — no labels — by kitaekatt (created: 2025-12-11 07:20 (UTC+8))
- #30407 fix(shm): Add memory barriers for cross-process shared memory visibility — ready — by kitaekatt (created: 2025-12-11 01:43 (UTC+8))
- #30417 [CI/Build][AMD] Skip tests in test_fusions_e2e and test_dbo_dp_ep_gsm8k that require non-existing imports for ROCm — rocm,v1 — by rasmith (created: 2025-12-11 02:15 (UTC+8))
- #30399 [BugFix] Fix `AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight_scale'` — bug,ready,deepseek — by LucasWilkinson (created: 2025-12-10 23:14 (UTC+8))
- #30430 [ROCm][Bugfix] Add MLACommonMetadata to allowed attention types for speculative decoding — rocm,speculative-decoding,v1 — by AndreasKaratzas (created: 2025-12-11 06:15 (UTC+8))
- #30429 test branch - not for merge — needs-rebase,v1 — by debroy-rh (created: 2025-12-11 05:38 (UTC+8))
- #30408 fix(gguf): Disable bfloat16 for GGUF on sm120 device — ready — by kitaekatt (created: 2025-12-11 01:45 (UTC+8))
- #30389 Standardise `get_rope` to use `rope_parameters["partial_rotary_factor"]`, not `rotary_dim` — performance,ready,llama,qwen,deepseek,gpt-oss — by hmellor (created: 2025-12-10 20:33 (UTC+8))
- #30423 fix(gguf): Make GGUFMoEMethod.apply() parameters optional — needs-rebase — by kitaekatt (created: 2025-12-11 03:42 (UTC+8))
- #30413 fix(nemotron_h): Add missing rotary positional embeddings to attention layers — no labels — by kitaekatt (created: 2025-12-11 01:46 (UTC+8))
- #30427 fix(gguf): Extract attn_logit_softcapping from GGUF metadata — no labels — by kitaekatt (created: 2025-12-11 05:27 (UTC+8))
- #30422 [ROCm][CI][Bugfix] Fallback for grouped_topk when num_experts can’t be grouped properly — rocm — by micah-wil (created: 2025-12-11 03:25 (UTC+8))
- #30426 [Docs] Update EPLB docs — documentation,ready — by mgoin (created: 2025-12-11 04:45 (UTC+8))
- #30425 [LMCache] Relax lmcache version requirement — ready,ci/build,kv-connector — by njhill (created: 2025-12-11 04:04 (UTC+8))
- #30424 fix(gemma2): Add quant_config to embedding layer for GGUF support — no labels — by kitaekatt (created: 2025-12-11 03:46 (UTC+8))
- #30400 {Deprecation] Remove tokenizer setter — frontend,ready,v1 — by DarkLight1337 (created: 2025-12-10 23:33 (UTC+8))
- #30376 [Fix]fix import error from lmcache — ready,kv-connector — by wz1qqx (created: 2025-12-10 16:38 (UTC+8))
- #30420 [NIXL][BUG FIX] Fix a bug for PD with host_buffer after merging 29665 — kv-connector — by xuechendi (created: 2025-12-11 03:08 (UTC+8))
- #30419 [NIXL][BUG FIX] Fix both failing issue and accuracy issue with nixl + host_buffer on CUDA — v1,kv-connector,nvidia — by xuechendi (created: 2025-12-11 02:54 (UTC+8))
- #30421 fix(gemma2): Skip missing parameters during GGUF weight loading — structured-output,v1 — by kitaekatt (created: 2025-12-11 03:24 (UTC+8))
- #30391 [IMPROVEMENT] Change MistralReasoningParser behavior — ready — by juliendenize (created: 2025-12-10 21:02 (UTC+8))
- #30415 [V0 Deprecation] Deprecate use_v1 — documentation — by MatthewBonanni (created: 2025-12-11 02:08 (UTC+8))
- #30416 [Deprecation] Remove old `_Backend` enum — documentation,ready — by MatthewBonanni (created: 2025-12-11 02:11 (UTC+8))
- #30414 [Doc] Add instructions for building docker image on GB300 with CUDA13 — documentation,aarch64-cuda,nvidia — by soodoshll (created: 2025-12-11 02:05 (UTC+8))
- #30409 [BugFix] Lazy tokenizer init in StructuredOutputManager to prevent GGUF semaphore leak — structured-output,v1 — by kitaekatt (created: 2025-12-11 01:45 (UTC+8))
- #30410 fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell — no labels — by kitaekatt (created: 2025-12-11 01:45 (UTC+8))
- #30373 Implement LMDB-based multi-modal cache — ci/build,v1,multi-modality — by petersalas (created: 2025-12-10 15:21 (UTC+8))
- #30385 [Core] Whisper support `torch.compile` — v1 — by NickLucche (created: 2025-12-10 19:11 (UTC+8))
- #30403 [Misc] Consistent case for `vllm bench serve` results — documentation,performance,structured-output — by MatthewBonanni (created: 2025-12-11 00:56 (UTC+8))
- #30398 [Chore] Delay recent deprecations — ready,v1,multi-modality — by DarkLight1337 (created: 2025-12-10 23:04 (UTC+8))
- #30411 fix(gguf): Ensure Gemma2 configs have hidden_act for backward compatibility — no labels — by kitaekatt (created: 2025-12-11 01:45 (UTC+8))
- #30412 fix(gguf): Skip lm_head mapping for models with tied word embeddings — no labels — by kitaekatt (created: 2025-12-11 01:46 (UTC+8))
- #30406 fix(nemotron_h): Add missing rotary positional embeddings to attention layers — no labels — by kitaekatt (created: 2025-12-11 01:42 (UTC+8))
- #30404 fix(gguf): Ensure Gemma2 configs have hidden_act for backward compatibility — no labels — by kitaekatt (created: 2025-12-11 01:40 (UTC+8))
- #30405 fix(gguf): Skip lm_head mapping for models with tied word embeddings — no labels — by kitaekatt (created: 2025-12-11 01:41 (UTC+8))
- #30397 [Deprecation] Remove deprecated task, seed and MM settings — documentation,performance,frontend,ready,qwen — by DarkLight1337 (created: 2025-12-10 22:54 (UTC+8))
- #30396 [Deprecation] Remove deprecated plugin and compilation fields for v0.13 release — documentation,ready — by DarkLight1337 (created: 2025-12-10 22:45 (UTC+8))
- #30402 [Docs][CPU Backend] Add nightly and per revision pre-built Arm CPU wheels — documentation — by ioghiban (created: 2025-12-11 00:47 (UTC+8))
- #30388 [Docs] Generate full list of metrics in user docs — documentation,ready — by markmc (created: 2025-12-10 19:50 (UTC+8))
- #30386 [v1] Add PrefixLM support to TritonAttention backend — v1 — by Isotr0py (created: 2025-12-10 19:32 (UTC+8))
- #30393 qk-rmsnorm op — qwen — by ZYang6263 (created: 2025-12-10 22:00 (UTC+8))
- #30384 [BugFix] Fix minimax m2 model rotary_dim — ready — by rogeryoungh (created: 2025-12-10 18:37 (UTC+8))
- #30377 adding constraint updates of cos-sin to improve mrope performance — no labels — by wujinyuan1 (created: 2025-12-10 16:48 (UTC+8))
- #30371 [Bugfix] Fix the issue where DeepSeek v3.2 cannot use structured_output — structured-output,ready,v1,deepseek — by chaunceyjiang (created: 2025-12-10 12:26 (UTC+8))
Merged PRs
- #30431 Revert “[CI] Add Async Eplb nightly CI tests (#29385)” — ready,ci/build — by SageMoore (merged: 2025-12-11 08:48 (UTC+8))
- #30432 [ROCm] Fix broken import in platform attention backend dispatching — rocm,ready — by AndreasKaratzas (merged: 2025-12-11 09:12 (UTC+8))
- #30106 Add more docs for regex — documentation,structured-output,ready — by xu-song (merged: 2025-12-11 08:12 (UTC+8))
- #28051 [Bugfix] fix confusing OOM errors during v1 init — ready,v1 — by shivampr (merged: 2025-12-11 07:17 (UTC+8))
- #30407 fix(shm): Add memory barriers for cross-process shared memory visibility — ready — by kitaekatt (merged: 2025-12-11 07:01 (UTC+8))
- #30399 [BugFix] Fix `AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight_scale'` — bug,ready,deepseek — by LucasWilkinson (merged: 2025-12-10 23:59 (UTC+8))
- #27933 [docs] Improve wide-EP performance + benchmarking documentation — documentation,ready — by eicherseiji (merged: 2025-12-11 06:15 (UTC+8))
- #30426 [Docs] Update EPLB docs — documentation,ready — by mgoin (merged: 2025-12-11 04:56 (UTC+8))
- #30216 [LMCache] Fix breakage due to new LMCache version — ready,ci/build,kv-connector — by njhill (merged: 2025-12-11 03:52 (UTC+8))
- #30400 {Deprecation] Remove tokenizer setter — frontend,ready,v1 — by DarkLight1337 (merged: 2025-12-11 03:10 (UTC+8))
- #30241 [bug] Fix “Current vLLM config is not set.” warnings when FlashInfer attention is used — bug,ready,v1,nvidia — by nvpohanh (merged: 2025-12-11 03:18 (UTC+8))
- #29289 [Perf] Enable environment cache in EngineCore to enable the feature for UniProcExecutor as well — ready,v1 — by Jialin (merged: 2025-12-11 03:13 (UTC+8))
- #26813 [P/D] KV Load Failure Recovery/Abort Configuration — frontend,ready,v1,kv-connector — by wseaton (merged: 2025-12-11 03:00 (UTC+8))
- #30344 [Bugfix] Fix HunyuanOCR cross-image contamination in batch processing — ready — by anker-c2 (merged: 2025-12-11 02:09 (UTC+8))
- #30403 [Misc] Consistent case for `vllm bench serve` results — documentation,performance,structured-output — by MatthewBonanni (merged: 2025-12-11 01:44 (UTC+8))
- #30398 [Chore] Delay recent deprecations — ready,v1,multi-modality — by DarkLight1337 (merged: 2025-12-11 01:48 (UTC+8))
- #30388 [Docs] Generate full list of metrics in user docs — documentation,ready — by markmc (merged: 2025-12-11 00:09 (UTC+8))
- #30331 [Bugfix] tpu_model_runner: set vllm config context when calling reset_dynamo_cache() — tpu,ready,v1 — by dtrifiro (merged: 2025-12-10 20:58 (UTC+8))
- #30072 [Core] Whisper enable `FULL_DECODE_ONLY` CudaGraph — ready,v1,multi-modality,nvidia — by NickLucche (merged: 2025-12-10 22:14 (UTC+8))
- #30384 [BugFix] Fix minimax m2 model rotary_dim — ready — by rogeryoungh (merged: 2025-12-10 20:58 (UTC+8))
- #30062 [CPU] Support for Whisper — ready,ci/build,v1,multi-modality — by aditew01 (merged: 2025-12-10 20:58 (UTC+8))
- #30351 [Bugfix] Cache added_vocab to avoid per-token overhead — ready — by scratch-ml (merged: 2025-12-10 12:05 (UTC+8))
- #30371 [Bugfix] Fix the issue where DeepSeek v3.2 cannot use structured_output — structured-output,ready,v1,deepseek — by chaunceyjiang (merged: 2025-12-10 16:30 (UTC+8))
- #30347 [cpu][ci] Add CPU Attention Tests for Neon Backend — ready — by fadara01 (merged: 2025-12-10 13:37 (UTC+8))
- #29358 [ROCm][CI] Attempt to fix the failures under a subgroup of the e2e the test group — rocm,ready,ci/build,v1,multi-modality — by AndreasKaratzas (merged: 2025-12-10 13:33 (UTC+8))
- #30339 [CMake][Build]: Remove unused ACL CMake env variables — ready,ci/build — by Radu2k (merged: 2025-12-10 12:27 (UTC+8))
- #30345 Fix typos in comments across multiple files — documentation,ready,v1 — by wilsonwu (merged: 2025-12-10 12:05 (UTC+8))
- #30308 [bugfix][quantization] fix quark qwen3 kv_cache quantization — ready,qwen — by haoyangli-amd (merged: 2025-12-10 11:24 (UTC+8))
PRs Closed Without Merging
- #30086 Revert “[CI] Add Async Eplb nightly CI tests” — ready,ci/build — by SageMoore (closed: 2025-12-11 06:18 (UTC+8))
- #29357 Add request-ids to TranscriptionRequest, TranslationRequest — documentation,frontend — by eicherseiji (closed: 2025-12-11 06:10 (UTC+8))
- #30429 test branch - not for merge — needs-rebase,v1 — by debroy-rh (closed: 2025-12-11 06:08 (UTC+8))
- #29924 [WIP] Try fixing nightly build pipeline — ci/build — by atalman (closed: 2025-12-11 05:54 (UTC+8))
- #30423 fix(gguf): Make GGUFMoEMethod.apply() parameters optional — needs-rebase — by kitaekatt (closed: 2025-12-11 05:44 (UTC+8))
- #30415 [V0 Deprecation] Deprecate use_v1 — documentation — by MatthewBonanni (closed: 2025-12-11 02:25 (UTC+8))
- #30416 [Deprecation] Remove old `_Backend` enum — documentation,ready — by MatthewBonanni (closed: 2025-12-11 02:26 (UTC+8))
- #29198 [Model] Restore Gemma3 GGUF multimodal support with GGUF-only guards — ready,v1,multi-modality — by lucianommartins (closed: 2025-12-11 03:04 (UTC+8))
- #29819 fix(shm): Add memory barriers for cross-process shared memory visibility — no labels — by kitaekatt (closed: 2025-12-11 01:42 (UTC+8))
- #30406 fix(nemotron_h): Add missing rotary positional embeddings to attention layers — no labels — by kitaekatt (closed: 2025-12-11 01:42 (UTC+8))
- #30404 fix(gguf): Ensure Gemma2 configs have hidden_act for backward compatibility — no labels — by kitaekatt (closed: 2025-12-11 01:40 (UTC+8))
- #30365 fix(gguf): Auto-select compatible dtype for GGUF models on Blackwell — no labels — by kitaekatt (closed: 2025-12-11 01:42 (UTC+8))
- #30284 [BugFix] Lazy tokenizer init in StructuredOutputManager to prevent GGUF semaphore leak — structured-output,v1 — by kitaekatt (closed: 2025-12-11 01:42 (UTC+8))
- #30090 fix: Force float16 dtype for GGUF models to fix incorrect output — ready — by kitaekatt (closed: 2025-12-11 01:42 (UTC+8))
- #30405 fix(gguf): Skip lm_head mapping for models with tied word embeddings — no labels — by kitaekatt (closed: 2025-12-11 01:41 (UTC+8))
- #28156 [CI/Build] Skip encoder-decoder models on AMD — rocm,ready,ci/build — by zhewenl (closed: 2025-12-11 01:22 (UTC+8))
- #28353 try fix by record_stream() — needs-rebase — by zhewenl (closed: 2025-12-11 01:22 (UTC+8))
- #28799 enable tests — needs-rebase,ci/build — by zhewenl (closed: 2025-12-11 01:22 (UTC+8))
- #28523 debug tests/tool_use/test_tool_calls.py failure on AMD — documentation,performance,rocm,frontend,ci/build,tool-calling — by zhewenl (closed: 2025-12-11 01:22 (UTC+8))
- #30258 [Feature]: OpenTelemetry Metrics Support — v1 — by mladjan-gadzic (closed: 2025-12-10 22:23 (UTC+8))
- #23997 Feature/sampler benchmark #23977 — performance,unstale — by baonudesifeizhai (closed: 2025-12-10 21:14 (UTC+8))
- #26701 [ROCm]: W8A8BlockFp8LinearOp should use AITER on MI355X — rocm,ready,needs-rebase — by gronsti-amd (closed: 2025-12-10 20:33 (UTC+8))
- #30348 [Docs]: adds a new metric vllm:request_prefill_kv_computed_tokens in docs — documentation — by googs1025 (closed: 2025-12-10 19:59 (UTC+8))
- #30349 [BugFix] Fix minimax m2 model rope_parameters — no labels — by esmeetu (closed: 2025-12-10 18:47 (UTC+8))
- #29653 fix potential object has no attribute ‘bias’ error — no labels — by allerou4 (closed: 2025-12-10 15:16 (UTC+8))
- #30297 [Core] Add SLA-tiered scheduling (opt-in) and docs — documentation,v1 — by ProdByBuddha (closed: 2025-12-10 13:13 (UTC+8))