vLLM Development Activity Report - 2026-01-24
Time window: 2026-01-24 11:07 (UTC+8) ~ 2026-01-25 11:07 (UTC+8)
Data summary: 6 new issues | 18 closed issues | 19 new PRs | 25 merged PRs | 6 PRs closed without merging
📊 Daily Development Status Summary
During this period (January 24, 2026), vLLM development was highly active: 25 PRs were merged and 18 issues closed, reflecting efficient code integration and issue triage. Development focused on performance optimization (notably quantization kernels and operator fusion), multimodal model support (GLM-OCR, Qwen3-Omni refinements), and key bug fixes (touching the model runner, LoRA, and KV-cache layout). A cross-platform build issue in the CPU backend was also fixed promptly.
🎯 AMD/ROCm Ecosystem Activity
Public activity directly related to the AMD ecosystem was limited this period, with only one notable documentation update:
- PR #32998 ([DOC] [ROCm] Update doc for v0.14.1): submitted by tjtanaa. A simple documentation update scheduled to merge after the v0.14.1 release, ensuring the ROCm installation docs stay in sync with the new version. This is routine release maintenance.
Analysis:
- No code contributions from users whose names carry the "-amd" suffix were found in this period's data.
- No substantive code changes or discussions involving the Quark quantization toolkit, HIP backend optimizations, or specific hardware such as the MI300 were observed.
- Overall, the AMD ecosystem was in routine-maintenance mode this period, with no major feature updates or newly surfaced issues.
💬 High-Engagement Discussions
- PR #28973 ([Feature] add session based streaming input support to v1)
  - Core topic: design and implement session-based streaming input support for the v1 engine, allowing input fragments to be appended dynamically within a single session.
  - Differing views:
    - Contributor/author (joshuadeng): proposed the current implementation, which allows autoregressive decoding within a streaming session, arguing the behavior should be configurable to match the use case.
    - Reviewers (njhill, ywang96): questioned the design in depth. Key concerns: 1) whether max_tokens=1 should be enforced so that intermediate decode tokens are not "sandwiched" between input chunks; 2) the state-management and memory overhead of duplicated Request object storage; 3) the need for a timeout guard to prevent KV-cache leaks when a client never sends a termination signal.
  - Point of contention: the semantics of a streaming-input session (whether intermediate decoding is allowed) and the implementation approach (lightweight state management vs. a queue of full Request objects) are the key trade-offs.
  - Status: after multiple rounds of in-depth discussion and design iteration, the PR was merged this period, landing a significant new feature for the v1 engine.
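The reviewers' timeout concern can be illustrated with a minimal, self-contained sketch. All names below (StreamingSession, append, finish, expired) are hypothetical stand-ins for the design discussed in PR #28973, not vLLM's actual API: a session buffers appended input fragments, and a server-side reaper would free its resources (e.g. KV-cache blocks) once the client goes idle without ever sending a finish signal.

```python
import time


class StreamingSession:
    """Toy model of a session that buffers streamed input fragments.

    Illustrative only: names and behavior are assumptions, not vLLM's API.
    """

    def __init__(self, idle_timeout_s: float = 30.0):
        # Idle-timeout guard: a client that never sends a termination
        # signal must not leak server-side resources indefinitely.
        self.idle_timeout_s = idle_timeout_s
        self.chunks: list[str] = []
        self.finished = False
        self._last_activity = time.monotonic()

    def append(self, chunk: str) -> None:
        """Client streams another input fragment into the session."""
        if self.finished:
            raise RuntimeError("session already finished")
        self.chunks.append(chunk)
        self._last_activity = time.monotonic()

    def finish(self) -> str:
        """Client signals end of input; the accumulated prompt is released."""
        self.finished = True
        return "".join(self.chunks)

    def expired(self) -> bool:
        """A server-side reaper would poll this and free an expired
        session's resources (e.g. its KV-cache blocks)."""
        idle = time.monotonic() - self._last_activity
        return not self.finished and idle > self.idle_timeout_s
```

Whether decoding may interleave with `append` calls (the max_tokens=1 debate above) is orthogonal to this guard; the sketch only captures the lifecycle/leak-prevention side of the discussion.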
- PR #33002 ([CPU Backend][BugFix] Fix failing Darwin pipelines)
  - Core topic: fix macOS (Darwin) CI pipeline failures caused by a Torch version mismatch.
  - Diagnosis and resolution:
    - Reporter (fadara01): initially attributed the failure to uv's default build isolation, which lets the compile-time and runtime Torch versions diverge.
    - Reviewer (bigPYJ1151): identified the root cause: pyproject.toml pins Torch 2.9.1 while requirements/cpu.txt has moved to 2.10.0, so --no-build-isolation should be used.
    - Consensus: the agreed fix adds --no-build-isolation explicitly to the CI build command and supplies the required build dependency (setuptools-scm), consistent with the recommendation in vLLM's CPU installation docs.
  - Status: the PR has been merged and the issue is fixed.
- PR #30953 ([CPU] Improve CPU Docker build)
  - Core topic: improve the CPU Docker image build process, strengthening Podman compatibility and adding build-info labels.
  - Discussion highlights:
    - Contributor (maryamtahhan): submitted a complete set of improvements, including replacing bind mounts with COPY and adding OCI labels.
    - Reviewer (bigPYJ1151): requested documentation fixes and suggested further optimization in follow-up PRs.
  - Summary: discussion focused on implementation details and documentation accuracy, with the value of the changes broadly acknowledged. The PR was merged this period after several rounds of revision and review.
🔥 Hot Topics and Trends
- Bug fixes and stability: several new issues reflect the need for test coverage and integration stability on new hardware (B200), new feature combinations (LoRA + Fused MoE + DP), and recently refactored components (Model Runner V2). Compatibility and robustness remain ongoing challenges as the project evolves rapidly.
- Performance optimization going deep: optimization work has moved from framework-level changes down to kernels and operator fusion. For example, PR #32520 applies instruction-level optimizations to FP4 quantization kernels on SM100F GPUs, while PR #32950 fuses the concatenation and quantization of the FP8 KV cache in MLA. Performance tuning has clearly entered a fine-grained phase.
- CPU and cross-platform support heating up: multiple discussions and fixes addressed CPU-backend and macOS build/deployment issues (PR #33002, Issue #33001), and the CPU Docker image received dedicated improvements (PR #30953). vLLM is broadening its reach as a cross-platform inference solution.
- Multimodal support keeps growing: support for the GLM-OCR model was added (PR #33005), Qwen3-Omni received a refinement (PR #33010), and a multimodal model-loading bug was fixed (PR #33008). Multimodality has become one of vLLM's core capability areas.
🛠️ Key Technical Changes
- PR #28973: session-based streaming input support (v1): gives AsyncLLM.generate the ability to process an asynchronous input stream, receiving prompt fragments dynamically within a single session while generating output. A significant architectural enhancement for long conversations and real-time interaction, laying the groundwork for richer AI application interfaces.
- PR #32520: FP4 quantization kernel optimization (SM100F): restructures the FP4 quantization kernel's partitioning scheme for Blackwell-architecture (SM100F) GPUs using PTX 8.8's 256-bit global load instructions. The reported gains are substantial (claimed up to 65% faster than the previous vLLM kernel), demonstrating rapid adoption of new hardware features and aggressive performance tuning.
- PR #32950: fused cat + quant for the FP8 KV cache in MLA: a hand-written scaled_fp8_quant_and_cat function fuses the previously separate concatenation and quantization of the FP8 KV cache in MLA attention, saving roughly 20 us per layer. A textbook operator fusion that reduces kernel-launch and memory-access overhead.
- PR #32993: CPU-offloading memory workaround: works around PyTorch's pin_memory (which suffers a 2x allocation problem) by using an environment variable to select between UVA and explicit memory copies for CPU weight offloading. This temporarily resolves unexpected memory exhaustion when offloading large models, which is critical for running very large models under limited CPU memory.
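The cat+quant fusion above can be illustrated with a plain-Python sketch. This is not the actual CUDA kernel: `quantize`, `cat_then_quant`, and `fused_cat_quant` are illustrative names, and `FP8_MAX` merely stands in for the float8 e4m3 saturation bound; real FP8 quantization also rounds onto the fp8 grid.

```python
FP8_MAX = 448.0  # stand-in for the float8 e4m3 saturation bound


def quantize(x: float, scale: float) -> float:
    # Scale and saturate; a real kernel would also round to the fp8 grid.
    return max(-FP8_MAX, min(FP8_MAX, x / scale))


def cat_then_quant(a: list[float], b: list[float], scale: float) -> list[float]:
    # Unfused path: materialize the concatenated buffer, then traverse it
    # again to quantize (two kernel launches, one extra intermediate).
    cat = list(a) + list(b)
    return [quantize(v, scale) for v in cat]


def fused_cat_quant(a: list[float], b: list[float], scale: float) -> list[float]:
    # Fused path: quantize each source directly into its output slot in a
    # single pass, with no intermediate concatenated buffer.
    out = [0.0] * (len(a) + len(b))
    for i, v in enumerate(a):
        out[i] = quantize(v, scale)
    for j, v in enumerate(b):
        out[len(a) + j] = quantize(v, scale)
    return out
```

Both paths produce identical results; the fused variant simply avoids the intermediate buffer and the second traversal, which is where the per-layer savings reported in PR #32950 come from.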
📈 Development Activity Observations
- High merge throughput: 25 PRs merged in a single day indicates a smooth review and integration pipeline. The batch includes several large feature PRs (e.g. #28973) and key optimization PRs (e.g. #32520), reflecting efficient decision-making and merging by the core team.
- Orderly issue management: 18 older issues were closed, most of them long-inactive "stale" issues cleared automatically, keeping the tracker tidy. Newly reported bugs were picked up quickly (e.g. Issue #33011 was /assign-ed by thc1006 within seconds of filing).
- Contributor diversity: beyond core maintainers (ywang96, DarkLight1337, robertgshaw2-redhat, and others), several community developers contributed important fixes and features (e.g. izhuhaoran fixing the model runner, HollowMan6 fixing MoE record/replay). Labels such as bug, performance, and v1 are used extensively for triage.
💡 Issues Worth Watching
- Issue #33011: PPLX + vLLM CUTLASS FP8 incorrect output: a user reports that with the PPLX communication library and CUTLASS FP8 kernels, model output accuracy (30%) falls far below expectations (90%). This may point to a numerical-correctness problem in the combination of FP8 quantization, expert parallelism, or this particular communication backend, and warrants deep investigation given its impact on the reliability of quantized MoE models.
- Issue #33014: LoRA loading fails with Fused MoE when DP > 1: a newly reported feature-combination bug. LoRA/Fused MoE incompatibility under data parallelism could affect many users fine-tuning MoE models and should be prioritized.
- Issue #33003 & PR #33004: Model Runner V2 broken after the attention/cache-update split: the performance refactor in PR #25954 introduced attribute errors and interface mismatches in Model Runner V2's CUDA Graph paths. This highlights the integration risk between the new (V2) architecture and ongoing low-level optimization; tests need to cover both old and new architectures adequately.
📋 Appendix: Detailed Data
New Issues
- #33011 [Bug][Help Wanted]: PPLX + vLLM CUTLASS FP8 Gives Incorrect Responses — bug,help wanted — by robertgshaw2-redhat (created: 2026-01-25 04:07 (UTC+8))
- #33014 [Bug]: LoRA loading fails with Fused MoE when data parallelism (DP > 1) is enabled — bug — by Jackmin801 (created: 2026-01-25 07:45 (UTC+8))
- #33009 [Installation]: vLLM v14.0 Build Error with Official Docker Image — installation — by weimin023 (created: 2026-01-25 03:01 (UTC+8))
- #32992 [Bug]: Batch invariance fails on NVIDIA B200 (Blackwell) with torch.compile — bug — by ZhanqiuHu (created: 2026-01-24 11:15 (UTC+8))
- #33003 [Bug]: Model Runner V2 broken CUDA Graph after kvcache update split(#25954) — bug — by izhuhaoran (created: 2026-01-24 21:58 (UTC+8))
- #33001 [Bug] [CPU Backend]: Failed to import from vllm._C with torch 2.10 on Darwin — bug,cpu — by fadara01 (created: 2026-01-24 20:20 (UTC+8))
Closed Issues
- #25540 [Performance]: Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 performance is worse than Qwen/Qwen3-235B-A22B-Instruct-2507 (bf16) — performance,stale — by pinak-p (closed: 2026-01-25 10:15 (UTC+8))
- #25620 [Performance]: `vllm --version` is too slow — feature request,stale — by yyzxw (closed: 2026-01-25 10:15 (UTC+8))
- #25626 [Bug]: How to debug why child process become defunct process when using vllm deploy LLM model — bug,stale — by haifzhu (closed: 2026-01-25 10:15 (UTC+8))
- #25638 [Bug]: AttributeError: 'NoneType' object has no attribute 'is_sym' -tp 4 run Qwen3-Next-80B-A3B-Instruct-int4-AutoRound — bug,stale — by lsm03624 (closed: 2026-01-25 10:15 (UTC+8))
- #25639 [Bug]: vllm deepseek r1 Output is chaotic — bug,stale — by danfeihu (closed: 2026-01-25 10:15 (UTC+8))
- #25650 [Bug]: Pipeline parallel (pp>1) crashes with CUDA illegal memory access — bug,stale — by zhanghb55 (closed: 2026-01-25 10:15 (UTC+8))
- #25653 [Bug]: AttributeError: MoE Model GptOssForCausalLM does not support BitsAndBytes quantization yet. — bug,stale — by wang824892540 (closed: 2026-01-25 10:15 (UTC+8))
- #25679 [CI Failure]: Cascade attention E2E test fails with FlashInfer backend — stale,ci-failure — by MatthewBonanni (closed: 2026-01-25 10:15 (UTC+8))
- #25680 [Bug]: IndexError when running DeepSeek-V2-Lite with deepep_low_latency and enable-expert-parallel — bug,stale — by ElizaWszola (closed: 2026-01-25 10:15 (UTC+8))
- #25747 [Bug]: GLM-4.5V stuck on Ada series GPU — bug,stale — by yry0008 (closed: 2026-01-25 10:15 (UTC+8))
- #25756 [Usage]: how to do token prune — usage,stale — by betacatZ (closed: 2026-01-25 10:15 (UTC+8))
- #25760 [Bug]: Qwen3-Next-80B-A3B-Thinking-FP8 fails using flashinfer and `qwen3_next_mtp` spec decode — bug,stale — by JaheimLee (closed: 2026-01-25 10:15 (UTC+8))
- #32055 [Feature]: Copy Ops in FP8 MLA — feature request — by robertgshaw2-redhat (closed: 2026-01-25 00:27 (UTC+8))
- #32915 [Bug]: Enable Lora cause OOM — bug — by bi1101 (closed: 2026-01-24 22:28 (UTC+8))
- #33001 [Bug] [CPU Backend]: Failed to import from vllm._C with torch 2.10 on Darwin — bug,cpu — by fadara01 (closed: 2026-01-24 21:22 (UTC+8))
- #32736 [Feature]: support lora for minimax m2.1 — feature request — by tic-top (closed: 2026-01-24 20:48 (UTC+8))
- #25450 [Feature]: Standalone Encoder Benchmark — feature request,stale — by ywang96 (closed: 2026-01-24 16:24 (UTC+8))
- #31443 [Bug]: Tool name is lost in chat_completion_stream_generator — bug — by aabbccddwasd (closed: 2026-01-24 12:32 (UTC+8))
New PRs
- #33015 Loom-mvp0 — performance,needs-rebase,v1,kv-connector — by MoefulYe (created: 2026-01-25 10:32 (UTC+8))
- #32997 [Bugfix]: Prevent reasoning_content leak — bug,frontend,tool-calling — by RohanDisa (created: 2026-01-24 15:09 (UTC+8))
- #32998 [DOC] [ROCm] Update doc for v0.14.1 — documentation,rocm,ready — by tjtanaa (created: 2026-01-24 15:25 (UTC+8))
- #33013 [BugFix] Capture logical routed experts reliably for replay — bug — by HollowMan6 (created: 2026-01-25 04:59 (UTC+8))
- #33012 [Core] Extract is_last_rank in Ray for tpu to override — v1 — by Chenyaaang (created: 2026-01-25 04:11 (UTC+8))
- #33010 [Model] Use mm_position to compute mrope positions for Qwen3-Omni — qwen — by Etelis (created: 2026-01-25 03:39 (UTC+8))
- #33008 [BugFix] KeyError when loading Mistral/vision-enabled checkpoints — bug — by ms1design (created: 2026-01-25 02:25 (UTC+8))
- #33006 [BugFix] KeyError when loading Mistral/vision-enabled checkpoints after PR #32780 — bug,documentation,v1,cpu — by ms1design (created: 2026-01-24 23:26 (UTC+8))
- #33002 [CPU Backend][BugFix] Fix failing Darwin pipelines — bug,ready,ci/build,cpu — by fadara01 (created: 2026-01-24 20:27 (UTC+8))
- #33007 [BugFix] KV cache layout for raw-copy KV connectors in disaggregated mode — bug,v1,kv-connector — by thjung123 (created: 2026-01-24 23:38 (UTC+8))
- #33005 [GLM-OCR] GLM-OCR with MTP Support — documentation,new-model,speculative-decoding,v1,multi-modality — by zRzRzRzRzRzRzR (created: 2026-01-24 22:39 (UTC+8))
- #32991 [BugFix] Fix CPU Weight Offloading with UVA — bug,nvidia — by wzhao18 (created: 2026-01-24 11:13 (UTC+8))
- #33004 [BugFix] fix model runner v2 error after kvcache update split — bug,v1,nvidia — by izhuhaoran (created: 2026-01-24 22:05 (UTC+8))
- #32994 [Bugfix][Core] Fix use audio in video bug — bug,v1,qwen — by xsank (created: 2026-01-24 12:00 (UTC+8))
- #33000 [Frontend] Add --disable-inference flag for GPU-free render-only deployments — frontend,v1 — by hyeongyun0916 (created: 2026-01-24 16:25 (UTC+8))
- #32999 [Doc] Ignore typo check on governance doc — no labels — by ywang96 (created: 2026-01-24 15:41 (UTC+8))
- #32995 [docs] Update governance process links — documentation — by esmeetu (created: 2026-01-24 14:30 (UTC+8))
- #32996 Feature/silu block quant fusion v1 — performance,ci/build — by Monishver11 (created: 2026-01-24 14:36 (UTC+8))
- #32993 [Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation — nvidia — by wzhao18 (created: 2026-01-24 11:44 (UTC+8))
Merged PRs
- #32977 [Docs] Fix Apple silicon include path in CPU installation docs — documentation,ready,cpu — by sjhddh (merged: 2026-01-25 09:51 (UTC+8))
- #32520 [Perf][Kernel] Optimize FP4 quantization kernels (SM100F) — performance,ready,nvidia — by LopezCastroRoberto (merged: 2026-01-25 09:45 (UTC+8))
- #32998 [DOC] [ROCm] Update doc for v0.14.1 — documentation,rocm,ready — by tjtanaa (merged: 2026-01-25 09:13 (UTC+8))
- #28973 [Feature] add session based streaming input support to v1 — ready,v1 — by joshuadeng (merged: 2026-01-25 04:06 (UTC+8))
- #32277 Using max_loras + 1 to construct grid in fused_moe_lora — ready — by yugong333 (merged: 2026-01-25 01:39 (UTC+8))
- #32800 [EncoderCacheManager] Remove unnecessary copy — ready,v1 — by lgeiger (merged: 2026-01-24 22:28 (UTC+8))
- #30953 [CPU] Improve CPU Docker build — documentation,ready,ci/build,cpu — by maryamtahhan (merged: 2026-01-25 01:08 (UTC+8))
- #33002 [CPU Backend][BugFix] Fix failing Darwin pipelines — bug,ready,ci/build,cpu — by fadara01 (merged: 2026-01-25 01:02 (UTC+8))
- #32986 [Tests] Replace flaky sleep with polling in test_background_cancel — ready,v1 — by sjhddh (merged: 2026-01-25 00:39 (UTC+8))
- #32950 [MLA] Fuse cat and qaunt for fp8 kv-cache — documentation,performance,ready,deepseek — by LucasWilkinson (merged: 2026-01-25 00:03 (UTC+8))
- #32963 Update CPU doc according to feedback — documentation,ready,cpu — by louie-tsai (merged: 2026-01-25 00:02 (UTC+8))
- #32842 [Bugfix]: resolve torch.compile cache conflict between mm_encoder_tp_modes — bug,ready — by HirokenOvo (merged: 2026-01-24 22:45 (UTC+8))
- #32646 [Bugfix] Fix E2E latency calculation and add warmup support in mm_processor benchmark — bug,performance,ready — by HirokenOvo (merged: 2026-01-24 18:31 (UTC+8))
- #32763 feat: Complete LoRA support for MiniMaxM2 Fixes #32736 — documentation,ready — by Chenhao-Guan (merged: 2026-01-24 20:48 (UTC+8))
- #31972 [Models]: Make Multimodal config implicit in ViT implementation — speculative-decoding,ready,qwen,deepseek — by Isotr0py (merged: 2026-01-24 20:34 (UTC+8))
- #32984 [Perf] Cache exc.errors() result in validation exception handler — frontend,ready — by sjhddh (merged: 2026-01-24 18:01 (UTC+8))
- #32953 [UX] Deduplicate sampling parameter startup logs — frontend,ready — by DarkLight1337 (merged: 2026-01-24 17:37 (UTC+8))
- #31655 feat(benchmark): add encoder forward pass benchmarking to mm-processor — performance,ready,v1,multi-modality — by reaganjlee (merged: 2026-01-24 16:24 (UTC+8))
- #32999 [Doc] Ignore typo check on governance doc — no labels — by ywang96 (merged: 2026-01-24 15:52 (UTC+8))
- #32082 [Models] Add `SharedFusedMoE` support to Qwen3MoE — ready,qwen — by Isotr0py (merged: 2026-01-24 15:36 (UTC+8))
- #32995 [docs] Update governance process links — documentation — by esmeetu (merged: 2026-01-24 15:32 (UTC+8))
- #32982 [Tests] Standardize RNG seed utility across test files — ready,v1 — by sjhddh (merged: 2026-01-24 14:47 (UTC+8))
- #32981 [Tests] Clarify pytest skip reasons with actionable context — ready,v1 — by sjhddh (merged: 2026-01-24 14:38 (UTC+8))
- #32983 [Perf] Cache xpu_get_mem_info() result to avoid duplicate calls — v1 — by sjhddh (merged: 2026-01-24 12:56 (UTC+8))
- #32948 [Dev UX] Add auto-detection for VLLM_PRECOMPILED_WHEEL_VARIANT during install — documentation,ready,ci/build,nvidia — by mgoin (merged: 2026-01-24 11:15 (UTC+8))
PRs Closed Without Merging
- #33015 Loom-mvp0 — performance,needs-rebase,v1,kv-connector — by MoefulYe (closed: 2026-01-25 10:37 (UTC+8))
- #18497 Add Apple Silicon bf16 Support — ci/build,stale — by rahuja23 (closed: 2026-01-25 10:29 (UTC+8))
- #25645 fix(tracing): Prevent AssertionError in do_tracing when stats are not… — stale,v1 — by git-jxj (closed: 2026-01-25 10:15 (UTC+8))
- #25740 Improve model accuracy by using F32 P*V with v_cache dot product — stale — by guoqingbao (closed: 2026-01-25 10:15 (UTC+8))
- #25791 perf: optimize rejection sampling triton kernel — stale,v1 — by happierpig (closed: 2026-01-25 10:15 (UTC+8))
- #33006 [BugFix] KeyError when loading Mistral/vision-enabled checkpoints after PR #32780 — bug,documentation,v1,cpu — by ms1design (closed: 2026-01-25 02:19 (UTC+8))