vLLM Development Activity Report - 2025-12-11
Time window: 2025-12-11 10:48 (UTC+8) ~ 2025-12-12 10:48 (UTC+8). Data: 15 new issues | 23 closed issues | 68 new PRs | 45 merged PRs | 27 PRs closed without merging
📊 Daily Development Summary
In the 24 hours spanning December 11-12, 2025, the vLLM project stayed highly active, merging 45 PRs and processing 38 issues. Development focused on performance optimization, memory management, support for new hardware (such as Blackwell Ultra), and fixes for recently popular models (such as DeepSeek-V3.2 and GPT-OSS). AMD/ROCm platform work centered on hardening CI tests and adapting specific kernels.
🎯 AMD/ROCm Ecosystem Updates
This cycle saw little direct AMD feature development in PRs; work concentrated on CI test hardening and bug fixes:
- CI and test adaptation:
  - PR #30508 (merged) and PR #30417 (merged): submitted by AMD contributor rasmith, these skip the cutlass_w4a8_moe tests that ROCm does not support and the fusion tests that import modules absent on ROCm (such as deep_ep), keeping the ROCm test pipeline green.
  - PR #30526 (merged): moves the V1 e2e tests from the mi325_1 agent pool to mi325_4 so that multi-GPU tests have enough hardware.
  - PR #30527 (open): plans to add a GPU-availability check to multi-GPU speculative decoding tests on ROCm, preventing failures caused by insufficient resources.
- Issue triage:
  - Issue #14964 (closed): the umbrella issue tracking AMD AITER kernel integration was proactively closed by a maintainer as "too stale", with a suggestion that interested parties open a new tracker.
  - Issue #22698 (closed): a performance issue about a large "bubble" at the end of CUDA graph runs on MI300/MI308. AMD engineer vllmellm confirmed it no longer reproduces in the latest rocm/vllm-dev:nightly image, so it was closed.
- Bug fixes and support confirmations:
  - Issue #28184 (closed): Whisper encoder-decoder models did not work with vLLM 0.11 on ROCm. AMD engineers confirmed the relevant PRs have merged and the problem is resolved in the latest nightly image.
Analysis: AMD ecosystem activity this cycle prioritized stability, keeping existing features green in CI and fixing known issues rather than adding new functionality. Closing long-stale tracking issues also suggests the project is tidying its issue-management workflow.
💬 High-Engagement Discussions
- Issue #30453: frequent repetitive decoding with DeepSeek-V3.2
  - Core issue: a user reports severe repetitive decoding when deploying DeepSeek-V3.2 on an H200.
  - Views and progress:
    - Reporter makabaka6338 provided unusually detailed environment information and request scripts.
    - Users mondaylord and niceallen reported similar symptoms (streaming responses with the first token truncated or an extra "+1").
    - Discussion quickly turned to environment triage: h1248759474 repeatedly asked for exact dependency versions, and the reporter eventually posted the full pip list output.
  - Current status: still Open. The discussion has not yet reached root-cause analysis; it remains at the stage of collecting information and confirming the symptom is shared.
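Triage of reports like this one often starts by quantifying the repetition before digging into scheduler or cache interactions. A minimal heuristic detector, where the whitespace tokenization and both thresholds are illustrative assumptions rather than anything from the issue, might look like:

```python
def has_repetition(text: str, n: int = 6, min_repeats: int = 3) -> bool:
    """Flag degenerate decoding: True if any whitespace-token n-gram occurs
    at least min_repeats times. Triage heuristic only; the thresholds and
    tokenization here are arbitrary illustrative choices."""
    tokens = text.split()
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return any(c >= min_repeats for c in counts.values())

# A looping completion trips the check; ordinary prose does not.
print(has_repetition("the answer is 42 and " * 10))  # True
```

Until a root cause lands, such a check can at least confirm whether different deployments are seeing the same degenerate pattern.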
- Issue #30498: GPT-OSS-120B returns empty content when hitting max_tokens
  - Core issue: after warmup (i.e., once the prefix cache is hit), GPT-OSS-120B returns a null content field even with max_tokens set generously; passing --enforce-eager works around it.
  - Views:
    - User perspective (ashtarkb): provided full reproduction steps and considers this a bug.
    - Initial explanation (alew3): if reasoning consumes the entire token budget, an empty content is expected behavior; suggested improving the check in the code.
    - Deeper analysis (bbrowning): this is very likely the same bug as Issue #26480: after a prefix-cache hit, CUDA graphs can cause endless generation of token id 0. --enforce-eager helps because it disables CUDA graphs. The reporter agreed with this reading.
  - Current status: still Open. The discussion points to a known core bug involving caching and CUDA graphs that needs a deeper fix.
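alew3's "budget exhausted" case and bbrowning's bug case can often be told apart from the response itself. A sketch of that check, assuming an OpenAI-compatible response shape with vLLM's reasoning_content extension (field names are assumptions, and this classifies symptoms, it does not fix the bug):

```python
def classify_empty_content(choice: dict) -> str:
    """Distinguish expected-empty from suspicious-empty `content` in a chat
    completion choice. Assumed schema: OpenAI-compatible, plus vLLM's
    `reasoning_content` field for reasoning output."""
    msg = choice.get("message") or {}
    if msg.get("content"):
        return "ok"
    if choice.get("finish_reason") == "length" and msg.get("reasoning_content"):
        # Reasoning consumed the whole max_tokens budget (alew3's reading).
        return "reasoning_used_budget"
    # Empty with no reasoning either: consistent with the prefix-cache /
    # CUDA-graph bug tracked in Issue #26480.
    return "suspect_bug"

print(classify_empty_content(
    {"finish_reason": "length",
     "message": {"content": None, "reasoning_content": "step 1 ..."}}))
# reasoning_used_budget
```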
- Issue #30521: Flashinfer backend memory-profiling regression (DP > 1)
  - Core issue: with data parallelism (DP > 1), vLLM 0.12.0 with the Flashinfer backend has a regression in the memory-profiling phase that can lead to OOM.
  - Analysis:
    - Reporter nrghosh did a deep dive himself and pinpointed the issue: memory profiling runs before CUDA graph capture and sampler warmup, so that memory (notably Flashinfer's large workspace buffers) is not accounted for, and the KV cache is allocated too large.
    - He proposed several directions: 1) reserve memory for sampler warmup before computing the KV cache; 2) use a smaller warmup batch; 3) investigate why Flashinfer's CUDA graph memory usage is twice that of FlexAttention.
  - Current status: still Open. This is a deep technical discussion, opened with self-analysis by an experienced contributor, aimed squarely at core memory-management logic.
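The accounting gap nrghosh describes can be made concrete with back-of-the-envelope numbers. All values below are illustrative, not measurements from the issue, and this is a simplified sketch of the sizing logic, not vLLM's implementation:

```python
GiB = 1024 ** 3

def kv_cache_budget(total_mem, gpu_util, profiled_peak, post_profile_reserve=0):
    """Sketch of KV-cache sizing: whatever is not subtracted here is handed to
    the KV cache. If memory consumed *after* profiling (CUDA graph capture,
    sampler warmup, Flashinfer workspace buffers) is not reserved, the cache
    is oversized and the later allocations collide with it."""
    return int(total_mem * gpu_util) - profiled_peak - post_profile_reserve

naive = kv_cache_budget(80 * GiB, 0.9, 20 * GiB)           # regression behavior
safe = kv_cache_budget(80 * GiB, 0.9, 20 * GiB, 6 * GiB)   # reserve workspace first
print((naive - safe) / GiB)  # 6.0
```

The 6 GiB difference is KV cache allocated on top of memory the backend still needs, which is how the post-startup OOM at DP > 1 arises.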
- Issue #30139 (closed): reasoning models do not reason and put the completion in reasoning_content
  - Core issue: multiple users report that reasoning models such as Mistral's Ministral do not correctly separate reasoning from the answer in vLLM.
  - Views and conclusion:
    - User confirmation: several users reproduced the problem, which affects Ministral, MiniMax-M2, and other models.
    - Root cause: MistralReasoningParser was originally modeled on the DeepSeek V1 implementation and inherited heuristics (e.g., treating everything as reasoning when no start tag is present) that do not fit Mistral models' switchable chat/reasoning modes.
    - Fix: Mistral employee juliendenize submitted PR #30391, which reworks the parser logic to match Mistral model behavior. The PR is merged and resolves the issue.
  - Takeaway: the problem was fixed by a direct contribution from the upstream model vendor, a good example of efficient collaboration between the community and model developers.
🔥 Hot Topics and Trends
- New model support and issues: DeepSeek-V3.2 and GPT-OSS-120B are the clear focus, with lively issue discussions. The community adopts frontier models quickly, and vLLM faces real challenges adapting to these architecturally complex models (repetitive decoding, cache bugs).
- Deployment and hardware support:
  - New hardware: PR #30484 begins initial support for NVIDIA Blackwell Ultra (SM103).
  - CPU backend: build problems on Arm CPUs (Issue #30470) and performance work (Issue #30487, L2 cache detection) drew attention, reflecting the importance of mobile and edge scenarios.
  - Installation: using precompiled wheels offline or efficiently (Issue #30464) remains a common deployment pain point.
- Performance and memory optimization: a perennial vLLM theme. This cycle's discussions of CUDA graph memory estimation (PR #30515), encoder cache memory footprint (PRs #30452, #30475), and the Triton kernel split_k setting (PR #30528) all chase peak performance.
- Reasoning and tool calling: questions about disabling "thinking" for Qwen models (Issue #30477) and parsing reasoning-model output (Issues #30139, #28852) show that reasoning and tool calling have become advanced features of LLM applications, with strong demand for reliable control over them.
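On the Qwen "thinking" question (Issue #30477): Qwen3-style chat templates expose an enable_thinking switch, which vLLM's OpenAI-compatible server can receive through the chat_template_kwargs request field. A request-payload sketch follows; the model name is a placeholder, and whether the flag is honored depends on the model's chat template, so verify against your deployment:

```python
def build_no_thinking_request(model: str, user_msg: str) -> dict:
    """Request body for an OpenAI-compatible vLLM server asking the chat
    template to skip the thinking block. `chat_template_kwargs` is forwarded
    to the template; `enable_thinking` is a Qwen3 template convention."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "chat_template_kwargs": {"enable_thinking": False},
    }

payload = build_no_thinking_request("Qwen/Qwen3-8B", "What is 2 + 2?")
```

The payload would then be POSTed to the server's /v1/chat/completions endpoint.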
🛠️ Key Technical Changes
- PR #30515: account for CUDA graphs during memory profiling: an important improvement. A small dummy CUDA graph capture during the memory-profiling phase estimates the real memory cost, so the available KV cache space is computed more accurately. This should substantially reduce post-startup OOMs caused by underestimating CUDA graph memory, which matters for running large models such as DeepSeek-R1 stably at the limits of their configuration.
- PRs #30452 / #30475: optimize encoder cache memory usage: these related PRs refactor the multimodal encoder cache to store only the actual embedding vectors rather than full placeholder tensors, yielding up to 12.5x memory savings. Large vision models such as Qwen3-VL can then support longer contexts or more video frames within limited GPU memory, a key improvement for multimodal inference.
- PR #30484: add SM103 (Blackwell Ultra) support: initial support for the upcoming NVIDIA B300/GB300 GPUs. It still requires a Triton patch, but it is a necessary step for keeping vLLM ahead on the newest hardware.
- PR #30528: set split_k=1 for Triton kernels: performance tuning for the Hopper architecture (SM90). Forcing split_k=1 avoids the Triton kernel falling back to float32 precision at small batch sizes, speeding up small-batch inference. This reflects fine-grained optimization for the varied workloads seen in real production.
- PR #30389: standardize the RoPE partial_rotary_factor lookup: an important low-level refactor that unifies how the rotary position embedding dimension is read from the config, always via rope_parameters["partial_rotary_factor"]. This fixes load failures caused by inconsistent configs in community-quantized models (such as MiniMax-M2) and improves compatibility and robustness.
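The lookup that PR #30389 standardizes on can be sketched as follows. This is simplified from the summary above; vLLM's real get_rope handles many more cases, so treat the helper as illustrative:

```python
def get_rotary_dim(head_dim: int, rope_parameters: dict) -> int:
    """Derive the rotated dimension from `partial_rotary_factor` (default 1.0,
    i.e. rotate the full head) instead of an ad-hoc `rotary_dim` field that
    community-quantized configs often omit or set inconsistently."""
    factor = rope_parameters.get("partial_rotary_factor", 1.0)
    return int(head_dim * factor)

print(get_rotary_dim(128, {}))                              # 128
print(get_rotary_dim(128, {"partial_rotary_factor": 0.5}))  # 64
```

A single code path with a safe default is exactly what makes inconsistent community configs load instead of fail.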
📈 Development Activity Observations
- Fast merges: 45 PRs merged in 24 hours indicates a very efficient review and merge pipeline. Many PRs merged the same day they were opened (e.g., #30481, #30526).
- Contributor diversity: beyond core maintainers (DarkLight1337, yewentao256, hmellor), contributors from AMD (rasmith, AndreasKaratzas), NVIDIA (LopezCastroRoberto), and enterprise users (e.g., nrghosh) were very active. User fadara01 reported the Arm CPU build issue and promptly submitted the fix PR themselves, a strong example of community engagement.
- Problem-solving pattern: a common pattern is a detailed bug report followed quickly by root-cause analysis from maintainers or community members, often linking directly to known issues or fix commits. This reflects strong collective technical capability and accumulated knowledge in the community.
💡 Issues Worth Watching
- DeepSeek-V3.2 repetitive decoding (Issue #30453): deployment stability for this newly released, popular model is critical. The issue is widely felt and warrants priority investigation into possible interactions with vLLM's scheduling or caching logic.
- GPT-OSS-120B cache/CUDA graph bug (Issue #30498): linked to a known deep bug (#26480). It affects every request that hits the prefix cache and severely degrades output quality; this is a core stability problem that needs a thorough fix.
- NIXL disaggregated-serving handshake failure under pipeline parallelism (Issue #30501): a user's attempt to run disaggregated serving of DeepSeek-R1 in a complex configuration (TP=8, PP=2) failed. This exposes compatibility gaps in advanced distributed features under edge-case configurations and is valuable input for large-scale model serving.
- Flashinfer backend memory profiling (Issue #30521): the discussion exposes a latent flaw in the current memory-profiling flow: it does not fully account for how backends differ (Flashinfer vs FlexAttention) in CUDA graph and workspace memory. A more general and more accurate memory-prediction framework may be needed.
📋 Appendix: Detailed Data
New Issues
- #30453 [Bug]: Frequent Repetitive Decoding Problems in DeepSeek-V3.2 — bug — by makabaka6338 (created: 2025-12-11 11:19 (UTC+8))
- #30470 [Bug] [CPU Backend]: vLLM build on Arm CPU fails with pytorch nightly — bug,cpu — by fadara01 (created: 2025-12-11 15:51 (UTC+8))
- #30530 [Installation]: Problems deploying the deepseekv3.2 repo on an H200 — installation — by h1248759474 (created: 2025-12-12 10:07 (UTC+8))
- #30477 [Usage]: How to disable thinking for Qwen-8B — usage — by fancyerii (created: 2025-12-11 17:28 (UTC+8))
- #30501 [Bug]/[Feature]: NIXL Disaggregated Serving Fails with Pipeline Parallelism (PP > 1) — bug — by nskpro-cmd (created: 2025-12-12 01:16 (UTC+8))
- #30464 [Usage]: How can I use the local pre-compiled wheel of vllm — usage — by gcanlin (created: 2025-12-11 14:22 (UTC+8))
- #30498 [Bug]: GPT-OSS-120B returns null content when hitting max_tokens without --enforce-eager — bug — by ashtarkb (created: 2025-12-12 00:40 (UTC+8))
- #30521 [Bug]: vLLM 0.12.0 / Flashinfer Backend memory profiling regression (DP > 1) — bug — by nrghosh (created: 2025-12-12 06:05 (UTC+8))
- #30511 Potential Deadlock? — no labels — by ChuanLi1101 (created: 2025-12-12 03:57 (UTC+8))
- #30493 [Bug]: 5090 RTX seems to be broken — bug — by mobicham (created: 2025-12-11 23:56 (UTC+8))
- #30461 [Bug]: Lora support don't work on 0.12.0 with Qwen3-30B-A30B-Instuct-2507 — bug — by bash99 (created: 2025-12-11 13:49 (UTC+8))
- #30487 [Bug] [CPU Backend]: Wrong L2 cache size on Arm CPU for Attention tiles — bug,cpu — by Radu2k (created: 2025-12-11 20:58 (UTC+8))
- #30485 [Feature]: Support for GGUF qwen3vl models — feature request — by arno4000 (created: 2025-12-11 20:17 (UTC+8))
- #30466 [Feature]: Support transformers>=5 — feature request — by Whadup (created: 2025-12-11 15:05 (UTC+8))
- #30465 [Bug]: reranker benchmarking calculate input_token=0 — bug — by hz0ne (created: 2025-12-11 14:38 (UTC+8))
Closed Issues
- #18467 [Usage]: Can vllm distributed load model? — usage,stale — by ASTLY123 (closed: 2025-12-12 10:16 (UTC+8))
- #21689 [Bug]: VLLM 0.10.0 breaks quantized models batch inference speed for Qwen2.5-VL-7B (tested multiple quantization types) — bug,stale — by Magmanat (closed: 2025-12-12 10:15 (UTC+8))
- #21799 [Bug]: Performance Regression in Eagle3: vLLM 0.9.0 vs. vLLM 0.9. — bug,performance,stale — by ggg-s (closed: 2025-12-12 10:15 (UTC+8))
- #22004 [Bug]: AWQ fails on MoE models — bug,stale — by someone132s (closed: 2025-12-12 10:15 (UTC+8))
- #22334 [Feature]: add --reasoning_parser flag for gpt-oss — feature request,stale — by lorentzbao (closed: 2025-12-12 10:15 (UTC+8))
- #22568 [Usage]: VRAM spike while loading gemma3-12b bnb on vllm-0.10 — usage,stale — by ivoras (closed: 2025-12-12 10:15 (UTC+8))
- #22718 [Bug]: kimi_k2_tool_parser.py has bug with `tool_call_regex` — bug,stale — by Jarlene (closed: 2025-12-12 10:15 (UTC+8))
- #22719 [Bug]: Qwen2.5vl-3b OOM but 7b works fine — bug,stale — by Elenore1997 (closed: 2025-12-12 10:15 (UTC+8))
- #22766 [Usage]: Prevent prefill from being split across different batches — usage,stale — by randombk (closed: 2025-12-12 10:15 (UTC+8))
- #22784 [Bug]: AttributeError when loading Step3 model — bug,stale — by Joy666-Li (closed: 2025-12-12 10:15 (UTC+8))
- #22816 [Bug]: vllm serve does not bind to all available ipv4 and ipv6 addresses when --host is empty — bug,stale — by smarterclayton (closed: 2025-12-12 10:14 (UTC+8))
- #30470 [Bug] [CPU Backend]: vLLM build on Arm CPU fails with pytorch nightly — bug,cpu — by fadara01 (closed: 2025-12-12 10:09 (UTC+8))
- #14964 [Feature] [ROCm]: AITER Kernel Integration — feature request,rocm,stale — by tjtanaa (closed: 2025-12-12 09:37 (UTC+8))
- #22698 [Performance]: Big bubble at the end of cudagraph run in MI300/MI308 — performance,rocm,unstale — by hoangvictor (closed: 2025-12-12 09:36 (UTC+8))
- #30477 [Usage]: How to disable thinking for Qwen-8B — usage — by fancyerii (closed: 2025-12-12 09:17 (UTC+8))
- #30445 [Bug]: QuantTrio/MiniMax-M2-AWQ produces garbage in 12/10/2025 build — bug — by eugr (closed: 2025-12-12 04:45 (UTC+8))
- #30139 [Bug]: Reasoning models does not reason and put completion in reasoning_content — bug — by Rictus (closed: 2025-12-12 00:53 (UTC+8))
- #29259 [Bug]: `simple_profiling.py` fails on CPU target — bug — by NobuoTsukamoto (closed: 2025-12-12 00:49 (UTC+8))
- #30269 [Bug]: Multi-node deployment fails with TP=1 and PP=2 — bug — by doss22 (closed: 2025-12-11 18:41 (UTC+8))
- #29861 [Feature]: [CPU Backend] Enable support for Whisper — feature request,cpu — by aditew01 (closed: 2025-12-11 18:09 (UTC+8))
- #28852 [Bug]: When infer MiniMax-m2, during streaming returns, the think are contained in the 'content' field and cannot be separated. — bug — by gallery2016 (closed: 2025-12-11 17:05 (UTC+8))
- #28184 [Bug]: encoder_decoder models (e.g. Whisper) is not working in vLLM 0.11 with ROCm — bug,rocm — by lucaschr21 (closed: 2025-12-11 14:01 (UTC+8))
- #29587 [Installation]: PaddleOCR-VL integration with vLLM — installation — by ssuncheol (closed: 2025-12-11 10:55 (UTC+8))
New PRs
- #30532 [responsesAPI]add extra body parameters — frontend — by Ri0S (created: 2025-12-12 10:27 (UTC+8))
- #30484 [Feature] Add SM103 (Blackwell Ultra) Support to vLLM — ready,needs-rebase,v1,nvidia — by LopezCastroRoberto (created: 2025-12-11 20:09 (UTC+8))
- #30531 [CPU] Refactor CPU fused MOE — ci/build — by bigPYJ1151 (created: 2025-12-12 10:08 (UTC+8))
- #30471 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (created: 2025-12-11 15:55 (UTC+8))
- #30524 [Optimization] Pad the number of tokens to a multiple of 4 to improve FP8 performance — v1 — by 0xjunhao (created: 2025-12-12 07:20 (UTC+8))
- #30481 [CPU][FIX] Fix build failures on Arm CPUs with torch nightly — ready,ci/build — by fadara01 (created: 2025-12-11 17:57 (UTC+8))
- #30529 [Benchmarks] `auto_tune.sh`: Use hostname variable for server requests — performance — by KevinMusgrave (created: 2025-12-12 10:01 (UTC+8))
- #30515 [UX][Startup] Account for CUDA graphs during memory profiling — v1,nvidia — by MatthewBonanni (created: 2025-12-12 05:12 (UTC+8))
- #30490 [DeepSeek V3.2] Proper drop_thinking logic — ready,deepseek — by vladnosiv (created: 2025-12-11 23:02 (UTC+8))
- #30527 [ROCm][CI] Skip multi-GPU speculative decoding tests when insufficient GPUs available — rocm,ready,v1 — by AndreasKaratzas (created: 2025-12-12 08:59 (UTC+8))
- #30528 [Perf] Set split_k to 1 for triton_kernels — no labels — by xyang16 (created: 2025-12-12 09:15 (UTC+8))
- #30526 [ROCm][CI] Use mi325_4 agent pool for V1 e2e tests — rocm,ready,ci/build — by AndreasKaratzas (created: 2025-12-12 08:16 (UTC+8))
- #30520 [BugFix] Use late binding to avoid zmq port conflict race conditions — v1 — by njhill (created: 2025-12-12 05:57 (UTC+8))
- #30452 [Core] Optimize encoder cache with mask-based storage — v1,multi-modality — by sunYtokki (created: 2025-12-11 11:09 (UTC+8))
- #30516 [compile] Parse compile range cache keys as Range during cache loading. — ready — by zhxchen17 (created: 2025-12-12 05:30 (UTC+8))
- #30508 [CI/Build][AMD] Skip test_cutlass_w4a8_moe tests on ROCm sine they require cutlass_pack_scale_fp8 — rocm,ready,nvidia — by rasmith (created: 2025-12-12 02:45 (UTC+8))
- #30525 [Release 2.10] Test Torch 2.10 RC — rocm,ci/build,nvidia — by atalman (created: 2025-12-12 07:52 (UTC+8))
- #30513 [WIP][CI] Speed up fusion tests — ready,needs-rebase,ci/build — by LucasWilkinson (created: 2025-12-12 04:19 (UTC+8))
- #30499 [CI] Breakup h200 tests — ready,needs-rebase,ci/build — by LucasWilkinson (created: 2025-12-12 00:45 (UTC+8))
- #30519 [Kernels][MoE] Add FusedMoERouter object — no labels — by bnellnm (created: 2025-12-12 05:54 (UTC+8))
- #30512 Improve parse_raw_prompt test cases for invalid input .v2 — no labels — by mivehk (created: 2025-12-12 04:05 (UTC+8))
- #30523 Fix Kimi K2 thinking model nvfp4 vocab size — v1 — by kjiang249 (created: 2025-12-12 06:43 (UTC+8))
- #30510 [Refactor] Remove useless syncwarp — ready — by yewentao256 (created: 2025-12-12 03:35 (UTC+8))
- #30494 [Perf] Optimize deepgemm experts initialization, 3.9% TTFT improvement — performance,ready,deepseek — by yewentao256 (created: 2025-12-12 00:15 (UTC+8))
- #30496 [Refactor] Reduce duplicate code in `per_token_group_quant` cuda kernels — ready,nvidia — by yewentao256 (created: 2025-12-12 00:19 (UTC+8))
- #30522 [KV Connector][Metrics] Do not count prefix cache hits in connector queries — ready,v1 — by markmc (created: 2025-12-12 06:11 (UTC+8))
- #30509 [Doc] Add documents for multi-node distributed serving with MP backend — documentation,v1 — by Isotr0py (created: 2025-12-12 03:18 (UTC+8))
- #30491 [Docs][CPU backend] Add pre-built Arm CPU Docker images — documentation,ready — by ioghiban (created: 2025-12-11 23:40 (UTC+8))
- #30514 [CI] Update several models in registry that are available online now — ready,ci/build — by mgoin (created: 2025-12-12 04:53 (UTC+8))
- #30518 Don't compile vision encoder for Transformers backend — no labels — by hmellor (created: 2025-12-12 05:52 (UTC+8))
- #30517 [CI] Fix mypy for vllm/v1/executor — v1 — by yewentao256 (created: 2025-12-12 05:33 (UTC+8))
- #30505 [Bugfix][Model] Fix Afmoe rope_parameters issue — bug,ready — by mgoin (created: 2025-12-12 02:36 (UTC+8))
- #30472 [BugFix][MM]support VLLM_RANDOMIZE_DP_DUMMY_INPUTS — ready,v1 — by charlotte12l (created: 2025-12-11 16:02 (UTC+8))
- #30506 simplify the return value from generate_beam_search — no labels — by nwaughachukwuma (created: 2025-12-12 02:37 (UTC+8))
- #30503 [compile] Stop one-off setting enable_aot_compile and use context manager instead. — ready — by zhxchen17 (created: 2025-12-12 01:48 (UTC+8))
- #30474 [Misc] Add mcp to requirements — ready,ci/build — by yeqcharlotte (created: 2025-12-11 17:03 (UTC+8))
- #30507 [Bugfix] Dictionary MM embeddings for online chat — frontend,v1 — by DarkLight1337 (created: 2025-12-12 02:41 (UTC+8))
- #30495 [Async][Feat] support apply penalty or bad_words for async + spec — v1 — by izhuhaoran (created: 2025-12-12 00:18 (UTC+8))
- #30500 feat(gguf): Extract HF config from GGUF metadata for repos without config.json — no labels — by kitaekatt (created: 2025-12-12 00:51 (UTC+8))
- #30504 [CI] Whisper logprobs tests — ready,multi-modality — by NickLucche (created: 2025-12-12 02:19 (UTC+8))
- #30462 enable unbacked with aot_compile — ready — by laithsakka (created: 2025-12-11 13:50 (UTC+8))
- #30502 [Kernel] add H100 triton fused moe config for FP8 Qwen3MoE — qwen — by cjackal (created: 2025-12-12 01:18 (UTC+8))
- #30497 fix(gguf): GGUF model support fixes for Blackwell GPUs — structured-output,v1 — by kitaekatt (created: 2025-12-12 00:37 (UTC+8))
- #30488 Give pooling examples better names — documentation,ready — by hmellor (created: 2025-12-11 21:21 (UTC+8))
- #30492 [WIP] add manual numa binding — v1 — by jasonlizhengjian (created: 2025-12-11 23:44 (UTC+8))
- #30459 set assume_32bit_indexing and pass unbacked hints — ready — by laithsakka (created: 2025-12-11 12:58 (UTC+8))
- #30480 Make the `httpx` logger less annoying when Transformers v5 is installed — ready — by hmellor (created: 2025-12-11 17:51 (UTC+8))
- #30483 [Misc] Improve error message for `is_multimodal` — ready,qwen — by DarkLight1337 (created: 2025-12-11 19:01 (UTC+8))
- #30489 Add encoder tag for compilation — qwen — by ilmarkov (created: 2025-12-11 21:23 (UTC+8))
- #30473 Fix typo of endpoint name in CLI args docs — frontend,ready — by kmaehashi (created: 2025-12-11 16:33 (UTC+8))
- #30486 [BugFix] Fix minimax m2 model partial_rotary_factor — no labels — by rogeryoungh (created: 2025-12-11 20:25 (UTC+8))
- #30482 [Frontend] Honor chat template for gpt-oss harmony (#23015) — frontend,gpt-oss — by ajayanto (created: 2025-12-11 18:23 (UTC+8))
- #30458 [Deprecation] Remove fallbacks for `embed_input_ids` and `embed_multimodal` — speculative-decoding,ready,qwen — by DarkLight1337 (created: 2025-12-11 12:33 (UTC+8))
- #30469 [Deprecation] Remove missed fallback for `embed_input_ids` — ready — by DarkLight1337 (created: 2025-12-11 15:34 (UTC+8))
- #30476 [Bugfix] Fix `task` still being passed in tests/benchmarks — performance,ready,v1 — by DarkLight1337 (created: 2025-12-11 17:16 (UTC+8))
- #30478 fix gme model do not use mrope — qwen — by zhuikefeng986285005-byte (created: 2025-12-11 17:40 (UTC+8))
- #30463 [Deprecation] Deprecation `--convert reward`, use `--convert embed` instead. — documentation,ready — by noooop (created: 2025-12-11 14:01 (UTC+8))
- #30475 [Core][MM] Optimize encoder cache manager by operating with embeddings only — v1,multi-modality,llama,qwen — by ywang96 (created: 2025-12-11 17:06 (UTC+8))
- #30479 [BugFix] Fix unmap_and_release by tag not done correctly — no labels — by Crispig (created: 2025-12-11 17:44 (UTC+8))
- #30468 [Feat] EPD Mooncake store — documentation,v1,kv-connector — by khuonglmhw (created: 2025-12-11 15:33 (UTC+8))
- #30467 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (created: 2025-12-11 15:30 (UTC+8))
- #30457 [Feat] Mooncake storage connector for E-PD disaggregate — documentation,v1,kv-connector — by khuonglmhw (created: 2025-12-11 12:28 (UTC+8))
- #30456 feat: support video list inference — frontend,multi-modality — by LiuLi1998 (created: 2025-12-11 11:47 (UTC+8))
- #30454 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (created: 2025-12-11 11:31 (UTC+8))
- #30460 [chore] Update FA commit — ready,ci/build — by LucasWilkinson (created: 2025-12-11 13:29 (UTC+8))
- #30455 [Doc] Add Baidu Kunlun XPU support — documentation,ready — by xyDong0223 (created: 2025-12-11 11:38 (UTC+8))
- #30451 [Core] Optimize encoder cache with mask-based storage — v1,multi-modality — by sunYtokki (created: 2025-12-11 11:03 (UTC+8))
- #30450 [Core] Optimize encoder cache with mask-based storage — v1,multi-modality — by sunYtokki (created: 2025-12-11 10:50 (UTC+8))
Merged PRs
- #29421 [Core] Whisper Enable Encoder Batching — ready,v1 — by NickLucche (merged: 2025-12-12 05:06 (UTC+8))
- #30254 gptq marlin quantization support for fused moe with lora — ready — by Bhanu068 (merged: 2025-12-12 10:27 (UTC+8))
- #30481 [CPU][FIX] Fix build failures on Arm CPUs with torch nightly — ready,ci/build — by fadara01 (merged: 2025-12-12 10:09 (UTC+8))
- #29628 [Core] Refactor `_build_attention_metadata` — ready,v1,ready-run-all-tests — by LucasWilkinson (merged: 2025-12-12 09:54 (UTC+8))
- #30526 [ROCm][CI] Use mi325_4 agent pool for V1 e2e tests — rocm,ready,ci/build — by AndreasKaratzas (merged: 2025-12-12 09:37 (UTC+8))
- #30508 [CI/Build][AMD] Skip test_cutlass_w4a8_moe tests on ROCm sine they require cutlass_pack_scale_fp8 — rocm,ready,nvidia — by rasmith (merged: 2025-12-12 09:02 (UTC+8))
- #30314 [fix] fix SM check for Flashinfer TRTLLM MOE — ready,nvidia — by jiahanc (merged: 2025-12-12 09:00 (UTC+8))
- #30417 [CI/Build][AMD] Skip tests in test_fusions_e2e and test_dbo_dp_ep_gsm8k that require non-existing imports for ROCm — rocm,ready,v1 — by rasmith (merged: 2025-12-12 08:24 (UTC+8))
- #30002 [FIX]Patch run-cluster.sh (fix for #28328) — documentation,ready — by evberrypi (merged: 2025-12-12 07:36 (UTC+8))
- #30276 [ROCM][CI] Fix AMD Examples Test Group — documentation,rocm,ready,ci/build — by Concurrensee (merged: 2025-12-12 07:03 (UTC+8))
- #29804 [EPLB] Support EPLB w/ NVFP4 — ready,nvidia — by andrewbriand (merged: 2025-12-12 06:59 (UTC+8))
- #30510 [Refactor] Remove useless syncwarp — ready — by yewentao256 (merged: 2025-12-12 06:43 (UTC+8))
- #30494 [Perf] Optimize deepgemm experts initialization, 3.9% TTFT improvement — performance,ready,deepseek — by yewentao256 (merged: 2025-12-12 06:28 (UTC+8))
- #30491 [Docs][CPU backend] Add pre-built Arm CPU Docker images — documentation,ready — by ioghiban (merged: 2025-12-12 06:03 (UTC+8))
- #30472 [BugFix][MM]support VLLM_RANDOMIZE_DP_DUMMY_INPUTS — ready,v1 — by charlotte12l (merged: 2025-12-12 05:00 (UTC+8))
- #30389 Standardise `get_rope` to use `rope_parameters["partial_rotary_factor"]`, not `rotary_dim` — performance,ready,llama,qwen,deepseek,gpt-oss — by hmellor (merged: 2025-12-12 04:45 (UTC+8))
- #30503 [compile] Stop one-off setting enable_aot_compile and use context manager instead. — ready — by zhxchen17 (merged: 2025-12-12 04:28 (UTC+8))
- #30474 [Misc] Add mcp to requirements — ready,ci/build — by yeqcharlotte (merged: 2025-12-12 04:06 (UTC+8))
- #30430 [ROCm][Bugfix] Add MLACommonMetadata to allowed attention types for speculative decoding — rocm,speculative-decoding,ready,v1 — by AndreasKaratzas (merged: 2025-12-12 03:25 (UTC+8))
- #30442 [Feature] AWQ marlin quantization support for fused moe with lora — ready — by princepride (merged: 2025-12-12 02:03 (UTC+8))
- #30340 Add Eagle and Eagle3 support to Transformers modeling backend — ready,v1 — by hmellor (merged: 2025-12-12 01:02 (UTC+8))
- #30391 [IMPROVEMENT] Change MistralReasoningParser behavior — ready — by juliendenize (merged: 2025-12-12 00:53 (UTC+8))
- #30341 [CI] refine more logic when generating and using nightly wheels & indices, add cuda130 build for aarch64, specify correct manylinux version — ready,ci/build,nvidia — by Harry-Chen (merged: 2025-12-12 00:42 (UTC+8))
- #30488 Give pooling examples better names — documentation,ready — by hmellor (merged: 2025-12-12 00:22 (UTC+8))
- #30402 [Docs][CPU Backend] Add nightly and per revision pre-built Arm CPU wheels — documentation,ready — by ioghiban (merged: 2025-12-11 23:57 (UTC+8))
- #30480 Make the `httpx` logger less annoying when Transformers v5 is installed — ready — by hmellor (merged: 2025-12-11 23:44 (UTC+8))
- #28309 [KVConnector] Add KV events to KV Connectors — ready,ci/build,v1,kv-connector — by hickeyma (merged: 2025-12-11 22:30 (UTC+8))
- #30337 fix: enhance human_readable_int function — ready — by andyxning (merged: 2025-12-11 15:30 (UTC+8))
- #30483 [Misc] Improve error message for `is_multimodal` — ready,qwen — by DarkLight1337 (merged: 2025-12-11 22:39 (UTC+8))
- #30473 Fix typo of endpoint name in CLI args docs — frontend,ready — by kmaehashi (merged: 2025-12-11 19:07 (UTC+8))
- #30050 [Misc][PCP&DCP] relocate PCP feature check — ready,v1 — by pisceskkk (merged: 2025-12-11 19:36 (UTC+8))
- #30458 [Deprecation] Remove fallbacks for `embed_input_ids` and `embed_multimodal` — speculative-decoding,ready,qwen — by DarkLight1337 (merged: 2025-12-11 14:58 (UTC+8))
- #30469 [Deprecation] Remove missed fallback for `embed_input_ids` — ready — by DarkLight1337 (merged: 2025-12-11 18:06 (UTC+8))
- #30476 [Bugfix] Fix `task` still being passed in tests/benchmarks — performance,ready,v1 — by DarkLight1337 (merged: 2025-12-11 18:33 (UTC+8))
- #30463 [Deprecation] Deprecation `--convert reward`, use `--convert embed` instead. — documentation,ready — by noooop (merged: 2025-12-11 18:18 (UTC+8))
- #30444 [Fix] Update lazing loading of video loader backend — ready,multi-modality — by jeremyteboul (merged: 2025-12-11 18:14 (UTC+8))
- #30376 [Fix]fix import error from lmcache — ready,kv-connector — by wz1qqx (merged: 2025-12-11 17:23 (UTC+8))
- #29882 [bugfix] fix MiniMaxM2ReasoningParser streaming output not separating reasoning_content. — frontend,ready — by JaviS-Rei (merged: 2025-12-11 17:05 (UTC+8))
- #29710 [perf] Use direct copy (broadcast) instead of cat for k_nope/k_pe in MLA prefill — performance,ready,v1 — by minosfuture (merged: 2025-12-11 16:20 (UTC+8))
- #30285 Ensure minimum frames for GLM 4.6V compatibility — ready — by gh-wf (merged: 2025-12-11 13:26 (UTC+8))
- #30455 [Doc] Add Baidu Kunlun XPU support — documentation,ready — by xyDong0223 (merged: 2025-12-11 12:52 (UTC+8))
- #30428 [Chore] Fix torch precision warning — ready,v1 — by yewentao256 (merged: 2025-12-11 12:05 (UTC+8))
- #30396 [Deprecation] Remove deprecated plugin and compilation fields for v0.13 release — documentation,ready — by DarkLight1337 (merged: 2025-12-11 11:59 (UTC+8))
- #30397 [Deprecation] Remove deprecated task, seed and MM settings — documentation,performance,frontend,ready,qwen — by DarkLight1337 (merged: 2025-12-11 11:59 (UTC+8))
- #29439 [Bugfix] Fix grouped_topk pytorch impl when num_experts can't be grouped properly — ready — by divakar-amd (merged: 2025-12-11 11:47 (UTC+8))
PRs Closed Without Merging
- #17982 [Misc]Added span attribute gen_ai.system to identify spans from vLLM — needs-rebase,stale — by LakshmiPriyaSujith (closed: 2025-12-12 10:16 (UTC+8))
- #21869 [Bugfix] Fix PyNcclCommunicator device assertion for un-indexed CUDA devices — stale — by CarlosArguilar (closed: 2025-12-12 10:15 (UTC+8))
- #21988 [Disagg][Perf] Add env var to allow gpu model work runs in non-default CUDA stream, improving disagg TTIT/TTFT — ready,needs-rebase,stale,v1 — by liuzijing2014 (closed: 2025-12-12 10:15 (UTC+8))
- #22094 Fix error message for max_input_length (bugfix of #22092) — frontend,stale — by RobertFischer (closed: 2025-12-12 10:15 (UTC+8))
- #30471 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (closed: 2025-12-12 09:43 (UTC+8))
- #22681 [Kernel][AMD] Reduce AITER attention CPU overhead — rocm,stale,v1 — by mxz297 (closed: 2025-12-12 10:15 (UTC+8))
- #22729 [Bugfix] hermes parser output gets incorrectly filtered — frontend,stale,tool-calling — by astrophel0 (closed: 2025-12-12 10:15 (UTC+8))
- #22775 [Perf] Add GLM-4.5V tuning configs — performance,stale — by houseroad (closed: 2025-12-12 10:15 (UTC+8))
- #25640 [Misc] Fix internal invocation of _register_fake — no labels — by ijpq (closed: 2025-12-12 09:44 (UTC+8))
- #29237 [Optimization] Add Fused Triton Kernel for GPT-OSS Router — gpt-oss — by ijpq (closed: 2025-12-12 09:44 (UTC+8))
- #27281 [WIP][torch.compile] Add Triton-distributed GEMM+AllReduce fusion compile pass — no labels — by jasonlizhengjian (closed: 2025-12-12 05:22 (UTC+8))
- #21879 [Kernel][Machete] Larger tile shape for large K when mem bound — no labels — by czhu-cohere (closed: 2025-12-12 03:12 (UTC+8))
- #30440 [fix] Fix qwen3_coder tool call per parameter streaming — frontend,ready,tool-calling,qwen — by koush (closed: 2025-12-12 02:11 (UTC+8))
- #30369 [Fix] Add default rope theta for qwen1 model — qwen — by iwzbi (closed: 2025-12-12 01:31 (UTC+8))
- #30278 [CPU][Bugfix] Fix CPU Profiler issue — v1 — by zhili03 (closed: 2025-12-12 00:48 (UTC+8))
- #30122 [Bugfix][Async] fix update_async_output_token_ids for async + spec — v1 — by izhuhaoran (closed: 2025-12-12 00:20 (UTC+8))
- #28252 [WIP][KVConnector] Retrive KV events from LMCache — needs-rebase,kv-connector — by hickeyma (closed: 2025-12-11 22:41 (UTC+8))
- #27718 Feature/kv cache average lifetime — documentation,needs-rebase,v1 — by alhridoy (closed: 2025-12-11 19:17 (UTC+8))
- #26892 Fix uniform_decode=True in prefill when using CUDA Graph with single-token prompt — v1,nvidia — by Sugar-zsg (closed: 2025-12-11 16:22 (UTC+8))
- #30468 [Feat] EPD Mooncake store — documentation,v1,kv-connector — by khuonglmhw (closed: 2025-12-11 15:33 (UTC+8))
- #30467 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (closed: 2025-12-11 15:42 (UTC+8))
- #30457 [Feat] Mooncake storage connector for E-PD disaggregate — documentation,v1,kv-connector — by khuonglmhw (closed: 2025-12-11 15:22 (UTC+8))
- #30454 [Optimization]: Add fused router for GPTOSS — gpt-oss — by ijpq (closed: 2025-12-11 13:36 (UTC+8))
- #26617 feat: implement compact encoder cache for memory optimization — v1 — by liangwen12year (closed: 2025-12-11 13:08 (UTC+8))
- #30422 [ROCm][CI][Bugfix] Fallback for grouped_topk when num_experts can't be grouped properly — rocm — by micah-wil (closed: 2025-12-11 12:01 (UTC+8))
- #30451 [Core] Optimize encoder cache with mask-based storage — v1,multi-modality — by sunYtokki (closed: 2025-12-11 11:10 (UTC+8))
- #30450 [Core] Optimize encoder cache with mask-based storage — v1,multi-modality — by sunYtokki (closed: 2025-12-11 11:02 (UTC+8))