vLLM Development Activity Report - 2025-12-12
Time window: 2025-12-12 10:29 (UTC+8) ~ 2025-12-13 10:29 (UTC+8). Stats: 11 new Issues | 7 closed Issues | 44 new PRs | 29 merged PRs | 7 PRs closed without merging
📊 Daily Development Status Summary
During the 24-hour window from December 12 to 13, 2025, the vLLM project kept up a rapid development pace: 11 new Issues were opened, 29 PRs were merged, and 7 Issues were closed. Development focused on multimodal model support (e.g., GLM-4.6V, AudioFlamingo3), performance optimization (notably for the AMD platform and the MXFP4 Triton backend), and assorted bug fixes. AMD-ecosystem improvements continued, and the community actively discussed CUDA version compatibility, new hardware support (e.g., Blackwell), and tool-calling functionality.
🎯 AMD/ROCm Ecosystem Activity
AMD/ROCm development and bug-fixing activity was lively in this window, concentrating on performance optimization, bug fixes, and test-stability improvements. No updates related to the Quark quantization tool were found.
New PRs (ROCm label):
- PR #30582: [ROCm] Restore 16-wide fast path in Triton unified attention
  - User: hyoon1 (not an AMD employee) - Content: Fixes the Triton unified attention kernel performance regression introduced by PR #21197 on AMD RDNA 3/4 (gfx11/gfx12), whose WMMA units operate on 16-column fragments. Adding a block-aligned fast path and setting appropriate tile sizes on these architectures lets the kernel match the 16-column hardware WMMA fragments again, restoring performance. An important platform-specific optimization.
- PR #30586: [ROCm] [AITER] [DOC] Add usage description about check functions…
  - User: tjtanaa (not an AMD employee) - Content: Adds a usage description for the `_aiter_ops` class and deprecates the old `VLLM_ROCM_USE_AITER_PAGED_ATTN` environment variable. Documentation tidy-up and code cleanup.
- PR #30576: [ROCm][CI] Add retry logic and xfail handling for flaky ROCm test…
  - User: AndreasKaratzas (not an AMD employee) - Content: Adds retry logic and expected-failure (xfail) markers for async scheduling tests that fail intermittently on ROCm due to floating-point non-determinism, improving CI stability.
Closed Issues (AMD-related):
- Issue #14397: `triton_scaled_mm` never used on ROCm - Status: closed (resolved by PR #26668)
  - Analysis: A long-standing bug that prevented the ROCm platform from falling back to the Triton implementation of the FP8 scaled matrix-multiply kernel. After months of development, debugging, and validation by community contributor shivampr (including functional tests and lm-eval evaluation on MI300X), it was finally fixed by PR #26668. The fix closes an important gap in the FP8 quantization compute path on the AMD platform.
Merged PRs (AMD-related):
- PR #30292 & PR #30291: Submitted by rasmith, these fix precision issues and test failures in ROCm FP8 group quantization and its reference implementation caused by using the wrong FP8 numeric range (-240/240 instead of -224/224), ensuring correctness and numerical consistency of quantization ops on AMD platforms.
- PR #26668: [ROCm] Enable Triton ScaledMM fallback + kernel selection fix
  - User: shivampr (not an AMD employee) - Content: As noted above, resolves Issue #14397 by enabling fallback to the Triton `scaled_mm` kernel on ROCm when AITriton is unavailable, and fixes the kernel selection logic.
- PR #30272: [CI/Build] Use spawn subprocess for ROCm
  - User: rjrock (not an AMD employee) - Content: Fixes a Torch re-initialization error in the data-parallel example on the ROCm backend caused by creating subprocesses with `fork`; forcing the `spawn` method improves the example's compatibility.
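To make the numeric-range point concrete, here is a hedged, illustrative sketch (not vLLM's kernel code) of how the assumed FP8 max feeds a per-group quantization scale; the 224 and 240 values are the ones cited in this report, and the 448.0 group maximum is an arbitrary example:

```python
# Illustrative only: per-group FP8 quantization picks a scale that maps the
# group's largest magnitude onto the representable FP8 range, so a wrong
# fp8_max silently skews every dequantized value in the group.
def fp8_group_scale(group_abs_max: float, fp8_max: float) -> float:
    return group_abs_max / fp8_max

scale_correct = fp8_group_scale(448.0, 224.0)  # range cited as correct: +/-224
scale_wrong = fp8_group_scale(448.0, 240.0)    # range cited as the bug: +/-240
```

With a 448.0 group maximum the two scales differ by about 7%, which is exactly the kind of systematic drift that surfaces as precision-test failures.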
Takeaway: This window's AMD-ecosystem updates centered on performance tuning (Triton attention) and quality hardening (quantization bug fixes, CI test stabilization), and closed a key long-standing issue (`triton_scaled_mm`), suggesting that support for the AMD platform is maturing from "functionally usable" toward "optimized and stable".
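The fork-versus-spawn fix described for PR #30272 comes down to the process start method: GPU runtime state does not survive `fork`. A minimal, hedged illustration (not vLLM's actual code):

```python
import multiprocessing as mp

# "fork" copies the parent's already GPU-initialized process image, which
# Torch then cannot re-initialize in the child; "spawn" starts a fresh
# interpreter instead, so GPU-touching workers should use it.
ctx = mp.get_context("spawn")  # local context; leaves the global default alone
# workers would then be created via ctx.Process(target=...) with the target
# defined at module top level so it is importable by the spawned child
```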
💬 High-Engagement Discussion Analysis
- Issue #30543: Ministral-3-14B-Instruct-2512 C++ Compilation Error…
  - Core topic: A user hit a Torch Inductor C++ compilation error with the Ministral-3-14B-Instruct-2512 model, plus batch-generation failures when enforce_eager was disabled.
  - Discussion status: No other developers have followed up yet. The Issue is mainly a detailed error report without multi-party discussion, but it is worth watching because it involves a recent model and a compilation problem.
- Issue #30324: RuntimeError: bmm_fp8_internal_cublaslt failed when deploying
  - Core topic: Multiple users (on GB10, RTX 6000 Pro, B200, and other Blackwell-architecture GPUs) hit a `bmm_fp8_internal_cublaslt failed` error when deploying certain models (e.g., Mistral 123B Instruct 2512).
  - Views and solutions:
    - User LukeKeywalker identified the root cause: the system-installed CUDA version (13.0) did not match the CUDA version PyTorch was compiled against (12.8/12.9).
    - Consensus fix: install a CUDA toolkit matching PyTorch (e.g., 12.8/12.9) side by side, set the relevant environment variables (`VLLM_CUDA_HOME`, `CUDA_HOME`, `PATH`, `LD_LIBRARY_PATH`), and clear the `~/.cache/flashinfer` cache.
  - Summary: A classic problem caused by a tangled driver/CUDA environment rather than a bug in vLLM itself. The discussion produced an effective, reproducible solution and helped several users facing the same error.
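The consensus workaround can be sketched as environment setup done before launching vLLM. This is a hedged illustration: the CUDA 12.8 install path is an assumption and must match the toolkit version your PyTorch build actually targets.

```python
import os
import pathlib
import shutil

# Assumed install location of the CUDA toolkit matching PyTorch (12.8 here);
# adjust to your system and PyTorch build.
cuda_home = "/usr/local/cuda-12.8"

os.environ["CUDA_HOME"] = cuda_home
os.environ["VLLM_CUDA_HOME"] = cuda_home
os.environ["PATH"] = f"{cuda_home}/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = f"{cuda_home}/lib64:" + os.environ.get("LD_LIBRARY_PATH", "")

# Clear FlashInfer's JIT cache so kernels built against the old toolkit
# are not reused.
shutil.rmtree(pathlib.Path.home() / ".cache" / "flashinfer", ignore_errors=True)
```

The same exports can of course be done in the shell before starting the server; the key point is that the toolkit on `PATH`/`LD_LIBRARY_PATH` matches the one PyTorch was compiled against.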
- Issue #28894 (closed): [H200] GPT-OSS-120B + Triton MoE Backend Perf Issue
  - Core topic: With the Triton MoE backend, GPT-OSS-120B ran about 30% slower than the Marlin backend at small batch sizes (concurrency = 1, 2, 4).
  - Views and investigation:
    - Performance comparison: several users (jhaotingc, robertgshaw2-redhat, dcmaddix) confirmed the slowdown.
    - Root cause: through deep analysis, xyang16 found that the Triton kernels automatically compute in float32 when split_k is greater than 1, hurting performance. split_k depends on the compute grid size (grid_size), which in turn depends on the batch size (m); for small batches, split_k can exceed 1 and trigger the inefficient fp32 path.
  - Conclusion and fix: confirmed as a performance bug and fixed by PR #30528, which forces `split_k=1` on Hopper (SM90) so that the Triton kernels always use the efficient bf16 data type, bringing small-batch performance on par with Marlin. The Issue was closed once the fix landed.
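A simplified sketch of the selection behavior described above; the non-Hopper branch's heuristic is hypothetical, and only the pinned `split_k=1` on SM90 reflects PR #30528:

```python
# Sketch, not the actual Triton kernel launcher: split_k > 1 splits the K
# dimension across extra blocks, but forces fp32 partial accumulation, which
# is what made small batches slow.
def choose_split_k(grid_size: int, num_sms: int, is_sm90: bool) -> int:
    if is_sm90:
        return 1  # PR #30528: always take the bf16 fast path on Hopper
    # Hypothetical heuristic: split K when a small grid (i.e. a small batch m)
    # would otherwise leave most SMs idle.
    return max(1, num_sms // max(grid_size, 1))
```

At concurrency 1 the grid is tiny, so the unpinned branch yields a split_k well above 1 and hence the slow fp32 path, matching the issue's symptoms.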
- PR #30550: [Frontend] Support passing custom score template as a CLI argument…
  - Core topic: Whether, and how, to let users specify a custom prompt template for scoring/reranking models (e.g., nvidia/llama-nemotron-rerank-1b-v2) via a CLI argument.
  - Views:
    - PR author (jzakrzew): Supporting custom templates by modifying model classes is tightly coupled; proposes passing a Jinja template file path via a `--score-template` CLI flag to decouple the two and make supporting more models easier.
    - Community feedback (noooop, Samoed, tomaarsen): Sparked a broader discussion on how to standardize prompt templates for cross-encoder models. tomaarsen pointed out that no standard currently exists and suggested borrowing from chat_template or using a custom role (e.g., `document`) as possible directions.
  - Status: The PR remains open; the discussion has broadened from the concrete implementation to industry-level standardization without a final conclusion, reflecting the community's careful thinking on such cross-cutting questions.
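To illustrate what such a score template does, here is a hedged stand-in using Python's `string.Template`; the real proposal uses Jinja files, and the field names below are assumptions rather than the PR's actual schema:

```python
from string import Template

# Hypothetical stand-in for a --score-template file: it shapes a
# (query, document) pair into the prompt a rerank model expects.
score_template = Template("query: $query\ndocument: $document\nrelevance:")

prompt = score_template.substitute(
    query="what is vLLM?",
    document="vLLM is a high-throughput LLM inference engine.",
)
```

The point of the PR is that this pair-to-prompt mapping lives in a user-supplied file instead of being hard-coded per model class.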
🔥 Hot Topics and Trend Analysis
- Multimodal model support keeps heating up: support for the AudioFlamingo3 audio-language model was added (PR #30539), while a GLM-4.6V compatibility bug with Transformers v5 surfaced (Issue #30584, fixed by PR #30583). Image-encoder cache budget calculations for Qwen2-VL, Gemma3, and other models were also fixed (PR #30536, #29692).
- Performance optimization and hardware enablement:
  - AMD platform: as described above, targeted optimizations and fixes for the Triton attention kernel and quantization ops.
  - Blackwell and other new hardware: compatibility issues on GB10/B200 and other Blackwell GPUs (Issue #30579, #30324) and a GGUF precision fix on Blackwell (PR #30408); another PR attempts to add FlashInfer attention support for GB300 (SM103) (PR #30565).
  - Kernel optimization: a key fix for the Triton MoE backend's small-batch performance problem (PR #30528).
- Developer experience and operations tooling: proposals for server warmup (`--warmup-config`, PR #30561) and GPU memory snapshot analysis (PR #30580) aim to improve stability and debuggability in production.
- Security and compliance: one PR adds extra protection around a previously fixed CVE (PR #30572), showing the project's continued attention to security.
🛠️ Key Technical Changes
- PR #30528 ([Perf] Set split_k to 1 for triton_kernels): Core performance fix. Forcing `split_k=1` resolves the severe small-batch performance regression in the Triton MXFP4 backend caused by the erroneous use of fp32, noticeably improving performance for models on that backend such as GPT-OSS.
- PR #30564 & #30563 (remove VLLM_ATTENTION_BACKEND references): Configuration-system cleanup. With PR #26315 having migrated attention-backend selection from an environment variable to the `--attention-config` argument, this update removes all remaining references to the deprecated `VLLM_ATTENTION_BACKEND` variable from docs and tests, wrapping up that refactor.
- PR #30580 ([memory] Add torch memory snapshot dump during model loading stage): Observability improvement. Adds automatic dumping of PyTorch memory snapshots during the model-loading stage (controlled by the `VLLM_MEMORY_SNAPSHOT_DIR` environment variable); the snapshot files can be visualized with PyTorch's official tooling, which should greatly ease analyzing and debugging GPU out-of-memory (OOM) problems.
- PR #30490 ([DeepSeek V3.2] Proper drop_thinking logic): Model-behavior correction. Per the DeepSeek V3.2 technical report, the "drop thinking content" (drop_thinking) logic changes from "do not drop only when tool calls are attached" to "drop when the last message is from the user". Beyond saving tokens at inference time, the correction yields roughly a 5-point gain on the Tau² Airline benchmark.
- PR #30556 (feat: batched shared encoder for whisper beam search): Algorithmic optimization. Implements encoder-output reuse, request batching, and early stopping for beam search in encoder-decoder models (Whisper as the example), targeting the first-inference bottleneck caused by the expensive encoder pass.
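The corrected drop_thinking rule from PR #30490 above can be sketched as a single predicate (message dicts are simplified here; vLLM's real chat structures differ):

```python
# New rule per the report: drop prior reasoning ("thinking") content exactly
# when the last message in the conversation comes from the user.
def should_drop_thinking(messages: list) -> bool:
    return bool(messages) and messages[-1].get("role") == "user"

# The old rule (keep thinking only when tool calls were attached) would retain
# reasoning tokens in the common "user just asked something" case; the new
# rule drops them there, saving tokens.
```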
📈 Development Activity Observations
- Efficient merging: 29 of the 44 new PRs were merged within the same window, a high merge rate that reflects efficient review and integration by the core team.
- Contributor diversity: new PRs and Issues came from many different users (e.g., dbotwinick, MatthewBonanni, WangErXiao, xyang16, lashahub), mixing familiar core contributors with new faces and indicating broad community participation.
- Focus on problem-solving: most merged PRs are bug fixes, performance optimizations, and feature polish, suggesting a healthy mainline devoted to stability, performance, and user experience.
💡 Issues Worth Watching
- Issue #30579: CUDA Illegal Memory Access on 4xB200: an illegal-memory-access error when running a large model (Qwen3-Next-80B) on the latest B200 GPU cluster, possibly related to new-hardware or driver compatibility; deserves close attention.
- Issue #30541 & #30554: tool-calling problems: DeepSeek-V3.2 tool calls missing the DSML token, and a MiniMax M2 tool-call parsing error. These Issues show that multi-model tool-calling support still has inconsistencies and bugs affecting feature completeness.
- Issue #30570: Why is VLLM still using SSE…: a user questions why vLLM still relies on SSE, deprecated for half a year, in MCP (Model Context Protocol). This likely involves technical-debt or backward-compatibility trade-offs, and the community should clarify a direction.
- PR #30535 ([Core] Add repetitive token detection…): proposes repetitive-token detection to stop hallucination loops, a practical new feature. It has passed internal code review, but as a change touching core generation logic, its real-world effect and side effects after merging deserve observation.
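PR #30535's detector is not described in detail here, so the following is a hypothetical sketch of one common approach to the problem: flag a generation whose tail consists of a single repeated n-gram.

```python
# Hypothetical repetitive-token detector (PR #30535's actual implementation
# may differ): trigger when the last `repeats` n-grams are all identical.
def is_repetitive(token_ids: list, ngram: int = 4, repeats: int = 3) -> bool:
    window = ngram * repeats
    if len(token_ids) < window:
        return False
    tail = token_ids[-window:]
    pattern = tail[:ngram]
    return all(tail[i:i + ngram] == pattern for i in range(0, window, ngram))
```

A stopping criterion of this shape is cheap per decoding step and is one plausible way to cut off degenerate repetition loops.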
📋 Appendix: Detailed Data Lists
New Issues
- #30584 [Bug]: GLM-4.6V-Flash + transformers v5: vLLM passes MediaWithBytes to HF image processor, causing 400 errors for multimodal requests — bug — by dbotwinick (created: 2025-12-13 09:26 (UTC+8))
- #30579 [Bug]: CUDA Illegal Memory Access when running Qwen3-Next-80B-A3B-Instruct on 4xB200 GPUs — bug — by BolinSNLHM (created: 2025-12-13 06:24 (UTC+8))
- #30541 [Usage]: missing dsml token " DSML " with DeepSeek-V3.2 tools call — usage — by crischeng (created: 2025-12-12 14:47 (UTC+8))
- #30571 [Bug]: DeepGEMM MoE generating tons of warnings — bug — by MatthewBonanni (created: 2025-12-13 04:14 (UTC+8))
- #30570 [Usage]: Why is VLLM still using SSE at all for mcp? — usage — by bags307 (created: 2025-12-13 04:02 (UTC+8))
- #30569 [Feature]: Add --ssl-ciphers to CLI arguments — feature request — by GraceMoreau (created: 2025-12-13 03:41 (UTC+8))
- #30548 [Feature]: Support for Q.ANT Photonic Computing ? — feature request — by plitc (created: 2025-12-12 18:16 (UTC+8))
- #30554 [Bug]: When using minimax m2 model tool call, throw IndexError in serving_chat.py — bug — by WangErXiao (created: 2025-12-12 22:05 (UTC+8))
- #30546 [Bug]: Incorrect dimension movement in reduce_scatter (first movedim(0, dim) should be movedim(dim, 0)) — bug — by RKai025 (created: 2025-12-12 17:16 (UTC+8))
- #30543 [Bug]: Ministral-3-14B-Instruct-2512 C++ Compilation Error with Torch Inductor and Batch Generation Failure with enforce_eager=False — bug — by KosmoCHE (created: 2025-12-12 16:47 (UTC+8))
- #30533 [Bug]: Fail to run Qwen3-Next-80B-A3B-Instruct vllm with CPU backend — bug,cpu — by MaoJianwei (created: 2025-12-12 11:32 (UTC+8))
Closed Issues
- #22479 [Bug]: --tensor-parallel-size 2 seems broken for Blackwell 6000 pro since version 10 — bug,stale — by fernandaspets (closed: 2025-12-13 10:10 (UTC+8))
- #26582 [Bug]: which triton-kernels version for MXFP4 Triton backend? — bug — by matkle (closed: 2025-12-13 04:30 (UTC+8))
- #28894 [Bug]: [H200] GPT-OSS-120B + Triton MoE Backend Perf Issue — bug — by jhaotingc (closed: 2025-12-13 04:29 (UTC+8))
- #14397 [Bug]: `triton_scaled_mm` never used on ROCm — bug,good first issue,unstale — by ProExpertProg (closed: 2025-12-13 02:28 (UTC+8))
- #30511 Potential Deadlock? — no labels — by ChuanLi1101 (closed: 2025-12-13 02:00 (UTC+8))
- #30324 [Bug]: RuntimeError: bmm_fp8_internal_cublaslt failed when deploying — bug — by Austin-Liang (closed: 2025-12-12 14:25 (UTC+8))
- #30326 [Bug]: H200 GPT-OSS-120B has perf drop — bug — by shyeh25 (closed: 2025-12-12 13:39 (UTC+8))
New PRs
- #30555 [Bugfix][Frontend] Prevent IndexError in MiniMax M2 tool parser during streaming extraction — frontend,tool-calling — by WangErXiao (created: 2025-12-12 22:07 (UTC+8))
- #30587 [Bugfix] Record request stats when request is aborted — v1 — by pooyadavoodi (created: 2025-12-13 09:48 (UTC+8))
- #30563 [Attention] Update tests to remove deprecated env vars — rocm,speculative-decoding,ready,ci/build,v1,multi-modality,kv-connector,nvidia — by MatthewBonanni (created: 2025-12-13 01:12 (UTC+8))
- #30564 [Docs] Remove references to `VLLM_ATTENTION_BACKEND` — documentation,ready — by MatthewBonanni (created: 2025-12-13 01:31 (UTC+8))
- #30582 [ROCm] Restore 16-wide fast path in Triton unified attention — rocm — by hyoon1 (created: 2025-12-13 09:02 (UTC+8))
- #30586 [ROCm] [AITER] [DOC] Add usage description about check functions in `_aiter_ops` — rocm — by tjtanaa (created: 2025-12-13 09:38 (UTC+8))
- #30585 [Bugfix] Fix NaN issue in attention output — v1 — by xyang16 (created: 2025-12-13 09:35 (UTC+8))
- #30575 [Bugfix] Pass FA version in `MultiHeadAttention` — ready — by MatthewBonanni (created: 2025-12-13 05:31 (UTC+8))
- #30581 Add IBM and Red Hat to compute resources sponsors — documentation,ready — by mgoin (created: 2025-12-13 07:32 (UTC+8))
- #30583 Update get_processor_data to use get_all method — multi-modality — by dbotwinick (created: 2025-12-13 09:24 (UTC+8))
- #30553 [Bug][CPU Backend]: Improve L2 cache size detection and usage on aarch64 — no labels — by Radu2k (created: 2025-12-12 21:40 (UTC+8))
- #30568 [WIP][CI] Speed up sequence parallel tests — ready,ci/build — by LucasWilkinson (created: 2025-12-13 03:16 (UTC+8))
- #30580 [memory] Add torch memory snapshot dump during model loading stage — v1 — by minosfuture (created: 2025-12-13 06:56 (UTC+8))
- #30574 MLA Based Eagle3 — new-model,speculative-decoding,v1,deepseek — by IzzyPutterman (created: 2025-12-13 05:05 (UTC+8))
- #30577 [Bug][KVConnector][Metrics] Remove a vacuous assertion breaking external-launcher — kv-connector,fb-exported,meta-exported — by QierLi (created: 2025-12-13 06:01 (UTC+8))
- #30578 [ci] Mark PrimeRL integration test as soft fail — ci/build — by khluu (created: 2025-12-13 06:09 (UTC+8))
- #30576 [ROCm][CI] Add retry logic and xfail handling for flaky ROCm test in test_async_scheduling — rocm,v1 — by AndreasKaratzas (created: 2025-12-13 05:44 (UTC+8))
- #30573 [Misc][Refactor] Separate router from FusedMoE class — no labels — by bnellnm (created: 2025-12-13 04:16 (UTC+8))
- #30535 [Core] Add repetitive token detection for hallucination prevention — documentation,frontend,v1 — by jeremyteboul (created: 2025-12-12 13:07 (UTC+8))
- #30539 Add AudioFlamingo3 model support — documentation,new-model,multi-modality — by lashahub (created: 2025-12-12 14:22 (UTC+8))
- #30572 Add additional protection for CVE-2025-62164 — frontend,multi-modality — by russellb (created: 2025-12-13 04:16 (UTC+8))
- #30565 [Bug] add sm100f check for trtllm-gen flashinfer attention and moe — v1,nvidia — by IwakuraRein (created: 2025-12-13 01:42 (UTC+8))
- #30561 feat(serve): add warmup support for consistent first-request performance — documentation,frontend — by TheCodeWrangler (created: 2025-12-13 00:52 (UTC+8))
- #30550 [Frontend] Support passing custom score template as a CLI argument to vllm serve — documentation,frontend — by jzakrzew (created: 2025-12-12 20:16 (UTC+8))
- #30567 [Bug] Fix AttributeError: 'Qwen3VLMoeConfig' object has no attribute 'intermediate_size' — ready,qwen — by yewentao256 (created: 2025-12-13 02:14 (UTC+8))
- #30540 [Doc]: fixing typos in various files — documentation,frontend,ready,needs-rebase,nvidia — by didier-durand (created: 2025-12-12 14:43 (UTC+8))
- #30547 [CustomOp] Support object-level enable for CustomOp — no labels — by shen-shanshan (created: 2025-12-12 18:12 (UTC+8))
- #30551 docs: Clarify block_quant_to_tensor_quant docstring (fixes #30098) — no labels — by yurekami (created: 2025-12-12 21:36 (UTC+8))
- #30556 feat: batched shared encoder for whisper beam search — documentation,performance,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by TheCodeWrangler (created: 2025-12-12 22:23 (UTC+8))
- #30562 [Refactor] Small refactor for group topk — ready,v1 — by yewentao256 (created: 2025-12-13 00:58 (UTC+8))
- #30566 update to transformers v5 — ready,ci/build — by hmellor (created: 2025-12-13 01:44 (UTC+8))
- #30558 [Core] Support multi prompt for AsyncLLM.generate() and encode() — documentation,v1 — by buaazp (created: 2025-12-13 00:28 (UTC+8))
- #30560 My bad - ignore. — documentation,needs-rebase,ci/build — by fadara01 (created: 2025-12-13 00:36 (UTC+8))
- #30559 [Feat] Enable eplb with default all2all backend — ready — by yewentao256 (created: 2025-12-13 00:31 (UTC+8))
- #30557 [GPT OSS] Fix tool_choice required — frontend,gpt-oss — by southfreebird (created: 2025-12-12 22:30 (UTC+8))
- #30537 Filter safetensors files to download if .safetensors.index.json exists — no labels — by mgoin (created: 2025-12-12 13:36 (UTC+8))
- #30534 [Bug] Fix attention_backend arg string parsing — bug,ready — by mgoin (created: 2025-12-12 12:31 (UTC+8))
- #30545 [Frontend] Map `service_tier` to `priority` for OpenAI API endpoints — frontend — by meffmadd (created: 2025-12-12 17:12 (UTC+8))
- #30542 [Bugfix] Revert Qwen2-VL part of change in #28271 — ready,qwen — by zifeitong (created: 2025-12-12 15:10 (UTC+8))
- #30552 typing: Add type hints to TurnMetrics class in context.py — frontend,gpt-oss — by yurekami (created: 2025-12-12 21:38 (UTC+8))
- #30549 [Core] WhisperEncoder support `torch.compile` — v1 — by NickLucche (created: 2025-12-12 18:48 (UTC+8))
- #30544 [KVEvent] User request.block_hash for parent block_hash — v1 — by heheda12345 (created: 2025-12-12 16:50 (UTC+8))
- #30538 [XPU] decrease IGC_ForceOCLSIMDWidth for speculative decoding triton-xpu kernel compilation — ci/build — by yma11 (created: 2025-12-12 14:05 (UTC+8))
- #30536 encoder cache optimization budget alignment — v1,multi-modality,qwen — by sunYtokki (created: 2025-12-12 13:15 (UTC+8))
Merged PRs
- #30514 [CI] Update several models in registry that are available online now — ready,ci/build — by mgoin (merged: 2025-12-13 10:28 (UTC+8))
- #30564 [Docs] Remove references to `VLLM_ATTENTION_BACKEND` — documentation,ready — by MatthewBonanni (merged: 2025-12-13 10:20 (UTC+8))
- #30575 [Bugfix] Pass FA version in `MultiHeadAttention` — ready — by MatthewBonanni (merged: 2025-12-13 08:02 (UTC+8))
- #30581 Add IBM and Red Hat to compute resources sponsors — documentation,ready — by mgoin (merged: 2025-12-13 09:34 (UTC+8))
- #30292 [CI/Build][Kernel][BugFix][AMD] Fix per_token_group_quant_fp8 to use correct fp8 min/max values and update atol/rtol in test_quantfp8_group_functionality — rocm,ready — by rasmith (merged: 2025-12-13 07:41 (UTC+8))
- #30578 [ci] Mark PrimeRL integration test as soft fail — ci/build — by khluu (merged: 2025-12-13 06:13 (UTC+8))
- #30496 [Refactor] Reduce duplicate code in `per_token_group_quant` cuda kernels — ready,nvidia — by yewentao256 (merged: 2025-12-13 05:45 (UTC+8))
- #29748 [MoE-FP8-modelopt] Add FlashInfer alignment padding for intermediate dimensions — ready — by danielafrimi (merged: 2025-12-13 04:42 (UTC+8))
- #30528 [Perf] Set split_k to 1 for triton_kernels — performance,moe,ready,gpt-oss,nvidia — by xyang16 (merged: 2025-12-13 03:07 (UTC+8))
- #29980 [Fix]Load kv-cache dtype from hf_quant_config.json automatically — quantization,ready — by danielafrimi (merged: 2025-12-13 03:27 (UTC+8))
- #28848 [CI/Build] Add x86 CPU wheel release pipeline — x86-cpu,ready,ci/build,cpu,aarch64-cpu,nvidia — by bigPYJ1151 (merged: 2025-12-13 03:21 (UTC+8))
- #26668 [ROCm] Enable Triton ScaledMM fallback + kernel selection fix — rocm,ready,ci/build,nvidia — by shivampr (merged: 2025-12-13 02:28 (UTC+8))
- #30517 [CI] Fix mypy for vllm/v1/executor — ready,v1 — by yewentao256 (merged: 2025-12-13 02:05 (UTC+8))
- #30059 [bugfix] fix bug when top_logprobs=0 with spec decoding — ready,v1 — by realliujiaxu (merged: 2025-12-13 01:03 (UTC+8))
- #30266 [Frontend] Fixes anthropic streaming message_start usage nesting — frontend,ready — by bbartels (merged: 2025-12-13 00:28 (UTC+8))
- #28306 [Kernel] Support CUDA Graphs in 3D Triton Attention Kernel — ready,v1,nvidia — by jvlunteren (merged: 2025-12-12 23:55 (UTC+8))
- #30425 [LMCache] Relax lmcache version requirement — ready,ci/build,kv-connector — by njhill (merged: 2025-12-12 11:18 (UTC+8))
- #30534 [Bug] Fix attention_backend arg string parsing — bug,ready — by mgoin (merged: 2025-12-12 23:40 (UTC+8))
- #30408 fix(gguf): Disable bfloat16 for GGUF on blackwell device — ready — by kitaekatt (merged: 2025-12-12 23:10 (UTC+8))
- #30490 [DeepSeek V3.2] Proper drop_thinking logic — ready,deepseek — by vladnosiv (merged: 2025-12-12 23:01 (UTC+8))
- #27532 [Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2 — ready,v1,deepseek,gpt-oss,nvidia,ready-run-all-tests — by LucasWilkinson (merged: 2025-12-12 21:57 (UTC+8))
- #28729 [Bugfix] Multiple fixes for gpt-oss Chat Completion prompting — frontend,ready,tool-calling,gpt-oss — by bbrowning (merged: 2025-12-12 12:59 (UTC+8))
- #21804 [Bugfix] Fix CMakeLists Environment Variable — ready,ci/build,unstale — by wu-kan (merged: 2025-12-12 18:54 (UTC+8))
- #29692 [Bugfix] Schedule failure due to wrong get_image_size_with_most_features — ready,multi-modality,qwen — by tomtomjhj (merged: 2025-12-12 18:27 (UTC+8))
- #30291 [CI/Build][AMD] Fix ref_dynamic_per_token_quant reference implementation on ROCm. — rocm,ready — by rasmith (merged: 2025-12-12 17:30 (UTC+8))
- #30516 [compile] Parse compile range cache keys as Range during cache loading. — ready — by zhxchen17 (merged: 2025-12-12 12:30 (UTC+8))
- #30527 [ROCm][CI] Skip multi-GPU speculative decoding tests when insufficient GPUs available — rocm,ready,v1 — by AndreasKaratzas (merged: 2025-12-12 11:54 (UTC+8))
- #30272 [CI/Build] Use spawn subprocess for ROCm — documentation,rocm,ready — by rjrock (merged: 2025-12-12 11:33 (UTC+8))
- #30505 [Bugfix][Model] Fix Afmoe rope_parameters issue — bug,ready — by mgoin (merged: 2025-12-12 10:53 (UTC+8))
PRs Closed Without Merging
- #29377 Draft: DS Arch Eagle3 — deepseek — by IzzyPutterman (closed: 2025-12-13 04:47 (UTC+8))
- #30560 My bad - ignore. — documentation,needs-rebase,ci/build — by fadara01 (closed: 2025-12-13 00:39 (UTC+8))
- #30479 [BugFix] Fix unmap_and_release by tag not done correctly — no labels — by Crispig (closed: 2025-12-13 00:20 (UTC+8))
- #30366 [Bug Fix] Fix Kimi-Linear model initialization crash due to missing 'indexer_rotary_emb' arg — no labels — by yonasTMC (closed: 2025-12-12 18:03 (UTC+8))
- #26924 [Performance] Optimize encoder cache memory consumption by storing encoder outputs only — v1,multi-modality — by imkero (closed: 2025-12-12 13:36 (UTC+8))
- #30446 Added a test for invalid inputs for parse_raw_prompts — no labels — by mivehk (closed: 2025-12-12 11:49 (UTC+8))
- #30486 [BugFix] Fix minimax m2 model partial_rotary_factor — no labels — by rogeryoungh (closed: 2025-12-12 10:57 (UTC+8))