vLLM Development Activity Report - 2025-12-26
Time window: 2025-12-26 10:46 (UTC+8) ~ 2025-12-27 10:46 (UTC+8). Totals: 8 new issues | 17 closed issues | 21 new PRs | 13 merged PRs | 18 PRs closed without merging
📊 Daily Development Summary
During this cycle (December 26-27, 2025) the vLLM project remained highly active, merging 13 PRs and closing 17 historical issues. Development focused on model-compatibility fixes (notably LoRA for multimodal models, GPTQ, and FP8 quantization) and heterogeneous-hardware support (such as ROCm fixes for models with non-standard block sizes). Several long-standing issues (e.g. LoRA support for multimodal models) were resolved by large PRs, showing a push to round out core functionality and improve stability across hardware and model families.
🎯 AMD/ROCm Ecosystem Activity
AMD/ROCm activity was brisk this cycle, centered on ROCm backend compatibility fixes and performance work; no contributors with "-amd" usernames and no changes touching the Quark toolchain appeared.
- PR #31380: [Bugfix][ROCm] Fix Qwen3-Next-80B-A3B-Thinking inference
  - Summary: Fixes an inference crash for Qwen3-Next-80B-A3B-Thinking on the ROCm attention backend (`rocm_attn`). The model uses a non-standard block size of 544 tokens, while the existing kernels computed addresses with power-of-two bit tricks, causing memory errors.
  - Technical details: Reworks several ROCm attention kernels (e.g. `triton_reshape_and_cache_flash.py`, `prefix_prefill.py`), replacing bitwise addressing with general linear arithmetic (`(block_table - base_addr) // block_byte_stride`). The kernels now support arbitrary block sizes natively, improving compatibility with unusual model architectures.
  - Impact: Significantly strengthens vLLM's ability to run models with non-standard block sizes on AMD GPUs (e.g. MI300X), an important step toward broader model coverage in the ROCm ecosystem.
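The switch from power-of-two bit tricks to plain arithmetic addressing in PR #31380 can be sketched in dependency-free Python (hypothetical function names; the real change lives inside the Triton kernels):

```python
def locate_pow2(token_idx: int, block_size: int) -> tuple[int, int]:
    # Original scheme: shift/mask tricks, valid only for power-of-two block sizes.
    assert block_size & (block_size - 1) == 0, "bit tricks need a power-of-two block size"
    shift = block_size.bit_length() - 1           # log2(block_size)
    return token_idx >> shift, token_idx & (block_size - 1)

def locate_general(token_idx: int, block_size: int) -> tuple[int, int]:
    # Reworked scheme: integer division/modulo handles any block size,
    # including the 544-token blocks of Qwen3-Next-80B-A3B-Thinking.
    return token_idx // block_size, token_idx % block_size

# Both schemes agree for power-of-two sizes...
assert locate_pow2(1000, 16) == locate_general(1000, 16)
# ...but only the arithmetic form is correct for block_size = 544.
assert locate_general(1000, 544) == (1, 456)
```

The arithmetic form trades a few cheap integer ops for generality, which is why the PR can claim native support for arbitrary block sizes.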
- PR #31369: [ROCm][CI] Fix rocm attention backends selection on ROCm
  - Summary: Fixes failures in the attention-backend selection unit test on ROCm devices (`test_rocm_attention_backends_selection.py`).
  - Technical details: Updates the test cases to match the latest `get_attn_backend_cls` calling convention, ensuring that CI coverage of the ROCm attention backends (e.g. `ROCM_ATTN`, `ROCM_AITER_FA`) runs correctly and code quality is maintained.
  - Status: Open, pending a merge decision versus the similar PR #31187.
- Issue #31372: [Feature]: Running paddocr‑vl consumes an excessively large amount of memory.
  - Summary: A user reports abnormally high memory consumption (~30 GB) when running the PaddleOCR-VL multimodal model in a ROCm image, far exceeding the model's own size (2 GB).
  - AMD-ecosystem relevance: The problem occurs in a ROCm environment, but the root cause may lie in general multimodal memory management or a leak rather than anything ROCm-specific. It nonetheless reflects the practical challenges of deploying multimodal workloads on AMD hardware.
  - Status: Open; no resolution yet.
💬 High-Engagement Discussions
- Issue #29635: [Bug]: Support loading LoRA for multimodal models (23 comments)
  - Core topic: Users cannot load LoRA weights for the vision modules of multimodal models (e.g. Qwen-VL); vLLM filters these weights out and raises an error.
  - Positions:
    - Users/developers: Urgently need this capability to deploy fine-tuned models in production.
    - Maintainer (@jeejeelee): Acknowledged a known limitation with work in progress (WIP).
  - Points of contention: None; the community strongly wants the feature.
  - Status/outcome: Closed. The capability landed via PR #26674, which implements independent LoRA wrappers for the vision tower and connector modules of multimodal models, resolving the issue.
- Issue #7413: [Bug]: Pipeline parallelism is only supported through AsyncLLMEngine… (9 comments)
  - Core topic: Users hit a warning about severely degraded pipeline-parallel performance when using the non-async LLM engine.
  - Positions:
    - Multiple users (+1 comments): Kept encountering the same problem and asked for progress.
    - Automation: Marked "stale" after more than 90 days without a core-maintainer reply, then closed.
  - Points of contention: No substantive technical discussion; mostly user frustration at the lack of an official response.
  - Status/outcome: Closed by the stale bot for inactivity; the underlying problem may well persist but was not actively resolved.
- Issue #31377: [Bug]: torch.ops._C.static_scaled_fp8_quant IMA error (2 comments)
  - Core topic: An illegal memory access (IMA) occurs when testing the FP8-quantized attention fusion pattern.
  - Discussion:
    - Developer @lengrongfu: Initially suspected the scale tensor was not on the GPU.
    - Reporter @BoyuanFeng: Provided a reproduction command and, via PR #31395, traced the root cause to the quantization scale tensors not being registered as module buffers, so `model.to(device)` never moved them to the GPU.
  - Status: Open, but fix PR #31395 has been submitted; it resolves the problem by registering tensors such as `_k_scale` as buffers.
🔥 Hot Topics & Trends
- FP8 quantization and accuracy: An emerging hotspot. Issue #31394 reports accuracy problems in the FlashInfer FP8 MoE kernels, Issue #31387 asks for FP8 model deployment on GPUs without native FP8 (e.g. A100), and together with the IMA error in #31377 these show FP8 support being stress-tested on both performance and stability.
- Deepening multimodal support: Beyond LoRA support (PR #26674), Issue #31372 exposes the complexity of multimodal memory management, and the merged PR #28367 adds support for the PerceptronAI/Isaac-0.1 multimodal model, showing the community steadily broadening and hardening this area.
- Heterogeneous-platform adaptation: Activity went beyond AMD ROCm: PR #31381 skips an unsupported fork-based test on XPU, and PR #31221 unifies the aiter custom-op implementation. This reflects vLLM's continued investment in compatibility with ROCm, XPU, and other backends beyond CUDA.
- Engineering efficiency and code quality: PR #31393 proposes replacing pre-commit with the faster `prek` tool to speed up local checks, while PRs #31378 and #31374 fix documentation-generation problems, showing sustained attention to developer experience and maintainability alongside feature work.
🛠️ Key Technical Changes
- PR #30166: [Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector (merged)
  - Analysis: A major improvement to the hybrid KV cache manager (used for sliding-window attention and similar schemes). Previously the manager over-allocated blocks for all tokens of a sliding-window layer and freed the excess afterwards, causing memory pressure and data races when cooperating with external KV cache connectors such as LMCache.
  - Impact: The new implementation allocates KV cache only for tokens inside the sliding window, eliminating those problems at the root and letting vLLM cooperate with external cache systems like LMCache more efficiently and reliably.
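The allocation change described for PR #30166 can be illustrated with a rough back-of-the-envelope sketch (hypothetical helper functions; the actual manager logic in vLLM is considerably more involved):

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)

def blocks_old(num_tokens: int, window: int, block_size: int) -> int:
    # Old behavior: allocate KV blocks for every token first and free the
    # out-of-window ones afterwards, so peak usage covers all tokens.
    return ceil_div(num_tokens, block_size)

def blocks_new(num_tokens: int, window: int, block_size: int) -> int:
    # New behavior: only tokens inside the sliding window ever get blocks.
    return ceil_div(min(num_tokens, window), block_size)

# 32k-token request, 4k sliding window, 16-token blocks:
assert blocks_old(32768, 4096, 16) == 2048   # transient peak under the old scheme
assert blocks_new(32768, 4096, 16) == 256    # steady requirement under the new one
```

The transient 2048-block peak is exactly the kind of pressure that made coexistence with an external connector fragile; capping allocation at the window removes it.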
- PR #26674: [Core] Initialize LoRA support for tower and connector in multi-modal models (merged)
  - Analysis: Resolves the long-standing inability to apply LoRA to the vision parts of multimodal models. The key technique is creating separate Punica LoRA wrappers for the language model, the vision tower, and the connector, plus new methods such as `get_num_mm_encoder_tokens` to compute vision-module input lengths precisely.
  - Impact: Officially gives multimodal models such as Qwen-VL and Idefics3 full LoRA deployment support after fine-tuning, meeting a key community need.
- PR #31395: [BugFix] register quant scale tensors as buffer (Open)
  - Analysis: A classic deep-learning engineering detail. Because the scale tensors used in static FP8 quantization were not registered as PyTorch module buffers, `.to(device)` did not migrate them to the GPU, and the kernel then hit an illegal memory access due to the device mismatch.
  - Impact: Fixes the stability problem in the FP8-quantized attention fusion tests and underscores the need to manage tensor device placement and state correctly when implementing custom modules.
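The failure mode is generic PyTorch behavior: only parameters and registered buffers follow a module through `.to(device)`, while plain tensor attributes stay where they were created. A dependency-free sketch of those semantics (a simplified stand-in with hypothetical names, not the vLLM or PyTorch code):

```python
class FakeTensor:
    """Minimal stand-in for a tensor that remembers its device."""
    def __init__(self, value, device="cpu"):
        self.value, self.device = value, device

class FakeModule:
    """Mimics the relevant slice of torch.nn.Module semantics."""
    def __init__(self):
        self._buffers = {}

    def register_buffer(self, name, tensor):
        self._buffers[name] = tensor
        setattr(self, name, tensor)

    def to(self, device):
        for t in self._buffers.values():   # only registered buffers migrate
            t.device = device
        return self                        # plain attributes are left behind

attn = FakeModule()
attn.register_buffer("k_scale", FakeTensor(0.02))   # the pattern PR #31395 adopts
attn.v_scale = FakeTensor(0.02)                     # the buggy pattern it replaces
attn.to("cuda:0")
assert attn.k_scale.device == "cuda:0"   # moved with the module
assert attn.v_scale.device == "cpu"      # stranded -> device mismatch in the kernel
```

In real PyTorch the fix is a one-liner, `self.register_buffer("_k_scale", tensor)` instead of `self._k_scale = tensor`, which is precisely what registering the scale tensors as buffers means.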
📈 Development Activity Observations
- Efficient merging and cleanup: Merging 13 PRs and closing 17 old issues in a single day shows the core team actively landing features and clearing backlog. Several PRs (e.g. #31378, #31381) were quickly marked "ready" and merged, indicating a smooth review pipeline.
- Active AMD-ecosystem contributions: Although no "-amd"-suffixed users appeared, the authors of PRs #31380 and #31369 are actively fixing concrete ROCm problems, and those contributions matter. The AMD-related fixes lean toward low-level kernel compatibility and performance, a key part of ecosystem building.
- Community collaboration on hard bugs: For concrete failures like Issue #31377, community developers (@BoyuanFeng, @lengrongfu) interacted quickly and pinned down the root cause, reflecting a healthy technical culture.
💡 Issues Worth Watching
- Issue #31394: FlashInfer FP8 MoE accuracy: Inaccurate FP8 MoE results on top-tier NVIDIA hardware (B200/GB300) likely involve deep kernel logic and will need investigation by the FlashInfer or vLLM teams.
- Issue #31387: FP8 deployment on GPUs without native FP8: A user tried and failed to run FP8 models on A100-class GPUs via the Marlin kernels. This bears on FP8 usability across a wider hardware range and is a significant compatibility and user-experience concern.
- Issue #31371: Proposal for contiguous KV cache block allocation: Motivated by inter-layer KV cache transfer efficiency, a user suggests sorting the free-block list so allocations come out contiguous. A potentially worthwhile performance optimization whose overhead versus benefit the core developers should evaluate.
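The idea behind Issue #31371 can be sketched as follows (hypothetical function names; the real structure is `FreeKVCacheBlockQueue.append_n`):

```python
def append_n_sorted(free_ids: list[int], returned_ids: list[int]) -> list[int]:
    # Return freed blocks to the queue in block_id order so that a later
    # allocation of k blocks tends to receive a physically contiguous run.
    free_ids.extend(returned_ids)
    free_ids.sort()
    return free_ids

def longest_contiguous_run(ids: list[int]) -> int:
    # Measures how contiguous an allocation drawn from the front would be.
    best = cur = 1
    for a, b in zip(ids, ids[1:]):
        cur = cur + 1 if b == a + 1 else 1
        best = max(best, cur)
    return best

free = append_n_sorted([7, 3], [5, 4, 6])
assert free == [3, 4, 5, 6, 7]
assert longest_contiguous_run(free) == 5   # one span -> fewer transfer descriptors
```

The trade-off the proposal raises is exactly the one visible here: a sort on every free adds O(n log n) bookkeeping in exchange for larger contiguous copies during KV cache transfer.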
📋 Appendix: Detailed Data
New Issues
- #31394 [Bug]: accuracy issue with VLLM_USE_FLASHINFER_MOE_FP8=1 for Qwen3-Coder-480B-A35B-Instruct-FP8 — bug — by mxz297 (created: 2025-12-27 01:42 (UTC+8))
- #31389 [Bug]: Byte fallback is not properly handled when using outlines — bug — by Alnusjaponica (created: 2025-12-27 00:08 (UTC+8))
- #31387 [Feature]: Use Kimi to support to deploy fp8 model in 8xA100-PCIE-40GB GPU — feature request — by shiwanghua (created: 2025-12-26 18:24 (UTC+8))
- #31377 [Bug]: torch.ops._C.static_scaled_fp8_quant IMA error — bug — by BoyuanFeng (created: 2025-12-26 14:09 (UTC+8))
- #31379 [Bug]: GLM47 the empty parameter tool report TypeError: expected string or buffer — bug — by NebulaMao (created: 2025-12-26 14:55 (UTC+8))
- #31375 [Bug]: vLLM online serve ByteDance-Seed/BAGEL-7B-MoT model error — bug — by zxr-creator (created: 2025-12-26 13:26 (UTC+8))
- #31372 [Feature]: Running paddocr‑vl consumes an excessively large amount of memory. — feature request — by wszgrcy (created: 2025-12-26 13:00 (UTC+8))
- #31371 [Feature]: Sort blocks by block_id in FreeKVCacheBlockQueue.append_n to enable contiguous allocation — feature request — by xiaguan (created: 2025-12-26 10:58 (UTC+8))
Closed Issues
- #7413 [Bug]: Pipeline parallelism is only supported through AsyncLLMEngine as performance will be severely degraded otherwise. — bug,stale — by Gaaaaam (closed: 2025-12-27 10:23 (UTC+8))
- #17095 [Bug]: Failed to run dp+tp in 2 GPU Nodes — bug,stale — by RodgerZhu (closed: 2025-12-27 10:22 (UTC+8))
- #18257 [Bug]: Phi-4-multimodal-instruct with function calling + streaming — bug,stale — by thies1006 (closed: 2025-12-27 10:22 (UTC+8))
- #18875 [Usage]: Does vllm support inference or service startup of CPU small model? — usage,stale — by safwaqf (closed: 2025-12-27 10:22 (UTC+8))
- #19595 [Usage]: [vLLM V1] `decoded_token` returns "Ċ" instead of "\n" in Qwen2.5-Math-7B-Instruct — usage,stale — by xiaobanni (closed: 2025-12-27 10:22 (UTC+8))
- #22695 [Performance]: Long-Video Inference for Multimodal LLM — performance,stale — by knlnguyen1802 (closed: 2025-12-27 10:22 (UTC+8))
- #23462 [Bug]:MOE kernels fail to compile on CUDA 12.8 (GCC10/11) due to log2(int) overload ambiguity — bug,stale — by (closed: 2025-12-27 10:22 (UTC+8))
- #23468 [Bug]: YAML served-model-name list is parsed as a single literal string; aliases not accepted (only single-flag multi-token CLI works) — bug,stale — by danwarren13 (closed: 2025-12-27 10:22 (UTC+8))
- #23479 [Bug]: GLM-4.5-Air’s output always start with a newline if streaming and reasoning is enabled — bug,stale — by anon218-wq (closed: 2025-12-27 10:22 (UTC+8))
- #23491 [Feature]: Add support for `seed_oss` tool-call parser in official vLLM image — feature request,stale — by gaopeijie (closed: 2025-12-27 10:21 (UTC+8))
- #25312 [Bug]: Random startup failures on single device with uneven GPUs — bug,stale — by edgeinfinity1 (closed: 2025-12-27 01:41 (UTC+8))
- #25372 [Feature]: call internal function `_register_fake` may violate style guide. — feature request,stale — by ijpq (closed: 2025-12-27 01:40 (UTC+8))
- #31342 [Bug]: vllm profiler cannot working — bug — by lengrongfu (closed: 2025-12-26 23:48 (UTC+8))
- #29635 [Bug]: Support loading LoRA for multimodal models — bug — by shangzhensen0620-ui (closed: 2025-12-26 20:48 (UTC+8))
- #27916 [Feature]: Does the latest version support LoRa for visual models? — feature request — by SmartNight-cc (closed: 2025-12-26 20:48 (UTC+8))
- #26422 [Feature]: Adding Lora on visual blocks of multimodal models — feature request — by HayatoSempai (closed: 2025-12-26 20:48 (UTC+8))
- #20020 [Feature]: Can we release a wheel for x86 CPU based inferencing — feature request,stale — by ericcurtin (closed: 2025-12-26 20:06 (UTC+8))
New PRs
- #31397 implements register kv caches in lmcache connector — kv-connector — by chunxiaozheng (created: 2025-12-27 10:34 (UTC+8))
- #31382 [LoRA] Remove direct_register_custom_op — ready — by jeejeelee (created: 2025-12-26 16:27 (UTC+8))
- #31391 [Bugfix] Fix byte fallback handling when using outlines — documentation,structured-output,v1 — by Alnusjaponica (created: 2025-12-27 00:17 (UTC+8))
- #31378 [Docs] Fix some snippets — documentation,ready — by hmellor (created: 2025-12-26 14:42 (UTC+8))
- #31393 [CI] Replace `pre-commit` with `prek` for faster pre-commit checks — documentation,ci/build,cpu — by j178 (created: 2025-12-27 01:37 (UTC+8))
- #31376 [BugFix] Fix cache issue in compilation_config — ready — by BoyuanFeng (created: 2025-12-26 14:05 (UTC+8))
- #31395 [BugFix] register quant scale tensors as buffer — ready — by BoyuanFeng (created: 2025-12-27 03:58 (UTC+8))
- #31381 [XPU][CI]skip test_preprocess_error_handling due to fork/spawn issue — ready,v1 — by jikunshang (created: 2025-12-26 15:25 (UTC+8))
- #31396 Implement shared VRAM pool and KV cache budget API — frontend,v1 — by DmitriiShubin (created: 2025-12-27 04:26 (UTC+8))
- #31390 [Bug] Fix log issue with `\n` — ready,v1 — by yewentao256 (created: 2025-12-27 00:14 (UTC+8))
- #31392 [Draft] More streaming vox — tpu,needs-rebase,v1 — by patrickvonplaten (created: 2025-12-27 00:20 (UTC+8))
- #31370 [Docs] Add profiler user docs for http request — documentation,ready — by lengrongfu (created: 2025-12-26 10:58 (UTC+8))
- #31388 [Voxtral] Fix speech transcription api — frontend — by patrickvonplaten (created: 2025-12-26 20:06 (UTC+8))
- #31385 add tip for VLLM_USE_PRECOMPILED arg to reduce docker build time — documentation — by yitingdc (created: 2025-12-26 17:38 (UTC+8))
- #31386 GLM Testing — performance,new-model — by zRzRzRzRzRzRzR (created: 2025-12-26 18:06 (UTC+8))
- #31384 [Bugfix] Fix GLM47 tool parser TypeError for empty parameter tools — no labels — by yurekami (created: 2025-12-26 16:59 (UTC+8))
- #31383 [Bugfix] Add GLM-4 append-think reasoning parser for missing token — no labels — by yurekami (created: 2025-12-26 16:43 (UTC+8))
- #31380 [Bugfix][ROCm]Fix Qwen3-Next-80B-A3B-Thinking inference and optimize non-standard block size (544) support under rocm_atten — rocm,v1,qwen — by vllmellm (created: 2025-12-26 15:16 (UTC+8))
- #31374 [Docs] fix the number of nccl group in P2P NCCL Connector example — documentation — by jiangxiaobin96 (created: 2025-12-26 13:22 (UTC+8))
- #31369 [ROCm][CI] Fix rocm attention backends selection on ROCm — rocm,v1 — by zejunchen-zejun (created: 2025-12-26 10:54 (UTC+8))
- #31373 [BugFix] Re-fix async multimodal cpu tensor race condition — ready,v1 — by njhill (created: 2025-12-26 13:15 (UTC+8))
Merged PRs
- #30166 [Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector — documentation,performance,new-model,rocm,structured-output,frontend,tpu,ready,ci/build,v1 — by ivanium (merged: 2025-12-27 10:25 (UTC+8))
- #31378 [Docs] Fix some snippets — documentation,ready — by hmellor (merged: 2025-12-26 20:47 (UTC+8))
- #31381 [XPU][CI]skip test_preprocess_error_handling due to fork/spawn issue — ready,v1 — by jikunshang (merged: 2025-12-27 05:40 (UTC+8))
- #31221 CustomOp: Unify aiter impl into GroupedTopk — rocm,ready — by xinyu-intel (merged: 2025-12-27 01:44 (UTC+8))
- #31370 [Docs] Add profiler user docs for http request — documentation,ready — by lengrongfu (merged: 2025-12-26 23:48 (UTC+8))
- #28367 Feature/isaac 0.1 — documentation,new-model,ready,ci/build,multi-modality — by oscardev256 (merged: 2025-12-26 10:49 (UTC+8))
- #31138 [Mistral common] Ensure all functions are imported from the top & only use public methods — ready,ci/build — by patrickvonplaten (merged: 2025-12-26 20:48 (UTC+8))
- #26674 [Core] Initialize LoRA support for tower and connector in multi-modal models — documentation,rocm,tpu,ready,v1,multi-modality,qwen — by jeejeelee (merged: 2025-12-26 20:48 (UTC+8))
- #31322 [Bugfix] Support LoRA and GPTQModel for PLaMo 2/3 — documentation,ready — by Alnusjaponica (merged: 2025-12-26 11:41 (UTC+8))
- #31339 [Misc] Fix Qwen2-MoE shared_expert_gate — ready,qwen — by jeejeelee (merged: 2025-12-26 13:10 (UTC+8))
- #31329 [benchmark] use model card root instead of id — performance,ready — by andyxning (merged: 2025-12-26 10:55 (UTC+8))
- #31324 [CI] Fix flaky vision beam search test with flexible semantic validation — ready — by AndreasKaratzas (merged: 2025-12-26 12:39 (UTC+8))
- #30619 [CI/Build] Ignore max transformers version skipping for initialization tests — ready — by Isotr0py (merged: 2025-12-26 10:50 (UTC+8))
PRs Closed Without Merging
- #17724 feat: engine v1 post process sampled logprobs — frontend,tpu,needs-rebase,stale,v1 — by RobMcH (closed: 2025-12-27 10:22 (UTC+8))
- #18501 [Bugfix] Fix spec decode on non-cuda platforms — speculative-decoding,needs-rebase,stale — by rand-fly (closed: 2025-12-27 10:22 (UTC+8))
- #18788 [Bugfix] handle `attn_metadata=None` in `calculate_kv_scales` branch of attn forward — stale — by llllvvuu (closed: 2025-12-27 10:22 (UTC+8))
- #18900 [V1][Metrics] Add max_token_capacity_per_batch — stale,v1 — by sahelib25 (closed: 2025-12-27 10:22 (UTC+8))
- #18905 [V1][Metrics] Add num_tokens_preempted — stale,v1 — by sahelib25 (closed: 2025-12-27 10:22 (UTC+8))
- #31382 [LoRA] Remove direct_register_custom_op — ready — by jeejeelee (closed: 2025-12-27 09:40 (UTC+8))
- #31346 [Bugfix] Remove spurious NVFP4 ‘GPU does not support FP4’ warning — v1 — by yurekami (closed: 2025-12-27 08:06 (UTC+8))
- #31347 [Code Quality] Add missing return type annotations to compilation and metrics modules — v1 — by yurekami (closed: 2025-12-27 07:37 (UTC+8))
- #31349 Add missing return type annotations to usage, logging, and arg_utils modules — no labels — by yurekami (closed: 2025-12-27 07:36 (UTC+8))
- #31350 Add missing return type annotations to ray, distributed, and attention modules — no labels — by yurekami (closed: 2025-12-27 07:36 (UTC+8))
- #31356 style(repo_utils): add return type annotations — no labels — by yurekami (closed: 2025-12-27 07:27 (UTC+8))
- #31358 style(utils): add return type annotations — no labels — by yurekami (closed: 2025-12-27 07:26 (UTC+8))
- #31359 style(torch_utils): add return type annotations — no labels — by yurekami (closed: 2025-12-27 07:25 (UTC+8))
- #31362 style: add return type annotations to async_utils and envs — no labels — by yurekami (closed: 2025-12-27 07:25 (UTC+8))
- #31363 style: add return type annotations to import_utils and profiling — no labels — by yurekami (closed: 2025-12-27 07:24 (UTC+8))
- #31396 Implement shared VRAM pool and KV cache budget API — frontend,v1 — by DmitriiShubin (closed: 2025-12-27 04:27 (UTC+8))
- #31320 [Code Quality] Add missing return type annotations to misc modules — multi-modality — by yurekami (closed: 2025-12-26 23:13 (UTC+8))
- #31374 [Docs] fix the number of nccl group in P2P NCCL Connector example — documentation — by jiangxiaobin96 (closed: 2025-12-26 14:00 (UTC+8))