vLLM Development Activity Report - 2026-01-04
Time window: 2026-01-04 11:06 (UTC+8) ~ 2026-01-05 11:06 (UTC+8). Statistics: 12 new issues | 19 issues closed | 25 new PRs | 11 PRs merged | 6 PRs closed unmerged
📊 Daily Development Summary
The vLLM project remained highly active this period (January 4-5, 2026), with the core focus on architectural refactoring of quantization schemes and platform-compatibility improvements. On one side, the community has launched a code cleanup and modularization effort for GPTQ/AWQ and other quantization schemes (Issue #31689) and plans to deprecate a batch of long-unmaintained schemes (PR #31688). On the other, performance tuning and bug fixing for hardware such as the AMD MI325X continues. Multimodal model stability (e.g., Qwen3-VL) and new performance-optimization strategies (e.g., repetitive-pattern detection) also drew attention.
🎯 AMD/ROCm Ecosystem Updates
Several important developments this period relate directly to the AMD ecosystem:
- Issue #31695: Very poor disaggregated (PD) serving performance on AMD MI325X: a user reports throughput of only 1 token/s when using SharedStorageConnector on the AMD MI325X. The issue notes that P2pNcclConnector deadlocks on the MI325X, while LMCacheConnectorV1 is unusable because of a hard dependency on libcudart.so.12. The issue has been labeled performance and rocm and automatically pinged the ROCm maintainers (@hongxiayang et al.), suggesting that vLLM's disaggregated-serving path on AMD's latest hardware may be under-optimized or have compatibility gaps.
- PR #31672: [Misc] Support amd-quark online fp8_ptpc quant: submitted by an AMD engineer (@hangy-amd) to integrate the AMD Quark toolkit's online FP8 quantization into vLLM, implementing RFC #31028. The PR drew a design objection in review from a core developer (@robertgshaw2-redhat), who argued that online quantization should not be tightly coupled to a specific quantization backend (such as Quark) but should instead build on vLLM's existing online-quantization abstractions, to avoid adding technical debt and confusion.
- Merged PR #31597: [ROCm][CI] Fix language generation test accuracy…: resolves test failures caused by accuracy issues in the ROCm Flash Attention implementation by force-disabling HuggingFace's flash_sdp and mem_efficient_sdp backends in tests on the ROCm platform. This keeps ROCm CI stable and is an important platform-compatibility fix.
- Issue #31691: CI Failure on AMD runner: reports a transient LoRA TP test failure on the AMD CI runner, related to the merge of PR #31533. A fix PR (#31663) has already been proposed, showing that the AMD CI pipeline is actively monitored and maintained.
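The "ptpc" in PR #31672's fp8_ptpc generally stands for per-token (activation) and per-channel (weight) scaling. As a rough illustration of that scaling math only — this is a pure-Python sketch, not vLLM's or Quark's actual implementation, and the function names are invented for this example:

```python
# Hedged sketch of per-token / per-channel FP8 scale computation.
# Illustrative only; does not reflect vLLM's or AMD Quark's real code.
FP8_E4M3_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def ptpc_scales(activations, weights):
    """One scale per activation row (token), one per weight column
    (output channel), so each value fits the FP8 dynamic range."""
    act_scales = [max(abs(v) for v in row) / FP8_E4M3_MAX for row in activations]
    n_out = len(weights[0])
    w_scales = [
        max(abs(weights[r][c]) for r in range(len(weights))) / FP8_E4M3_MAX
        for c in range(n_out)
    ]
    return act_scales, w_scales

def quantize(value, scale):
    """Scale into FP8 range and clamp; dequantize by multiplying back."""
    if scale == 0:
        return 0.0
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value / scale))

acts = [[0.5, -2.0], [1.0, 0.25]]
w = [[0.1, -0.4], [0.3, 0.2]]
a_s, w_s = ptpc_scales(acts, w)
# Values at the per-token max round-trip exactly through the scale.
recovered = quantize(acts[0][1], a_s[0]) * a_s[0]
```

Per-token activation scales adapt to each request's magnitude at runtime, which is what makes this an "online" scheme as opposed to static calibration.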
💬 High-Activity Discussions
- Issue #31689: [Feature][Quantization][Help Wanted]: Clean up GPTQ + AWQ Quantization
  - Core question: how to systematically refactor and clean up vLLM's GPTQ and AWQ quantization architecture, separating quantization integration (weight loading) from quantization kernels (runtime execution) to resolve the current technical debt and code sprawl.
  - Positions:
    - Initiator (@robertgshaw2-redhat): laid out a clear refactoring blueprint, proposing to model the rework on well-abstracted structures such as Fp8Moe and to consolidate and modularize gptq.py, awq.py, and several other files. He stressed that this is a major undertaking requiring high-quality code and close communication.
    - Respondents (@jikunshang, @Bhanu068, et al.): several contributors volunteered to participate, reflecting the community's enthusiasm for cleaning up core infrastructure.
  - Points of contention: no significant disputes; discussion centers on planning the technical path and dividing up the large refactoring effort.
  - Current status: open and seeking help; contributors have claimed parts of the work, and a series of large PRs is expected.
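The integration-vs-kernel split the issue proposes can be sketched abstractly. All class and method names below (QuantIntegration, QuantKernel, etc.) are hypothetical illustrations of the separation of concerns, not vLLM's actual interfaces:

```python
# Hypothetical sketch of separating checkpoint integration (how weights
# are packed on disk) from the runtime kernel (how the quantized matmul
# executes). Names are invented; this is not vLLM's real class hierarchy.
from abc import ABC, abstractmethod

class QuantKernel(ABC):
    """Runtime execution: runs a matmul on already-packed weights."""
    @abstractmethod
    def apply(self, packed_weight, scale, x):
        ...

class QuantIntegration(ABC):
    """Checkpoint integration: converts raw weights into the kernel's layout."""
    @abstractmethod
    def load(self, raw_weight):
        """Return (packed_weight, scale)."""
        ...

class NaiveInt8Kernel(QuantKernel):
    def apply(self, packed_weight, scale, x):
        # Dequantize-then-multiply reference path (no real kernel here).
        return [[sum(w * scale * xi for w, xi in zip(row, col))
                 for col in zip(*x)] for row in packed_weight]

class SymmetricInt8Integration(QuantIntegration):
    def load(self, raw_weight):
        flat = [abs(v) for row in raw_weight for v in row]
        scale = max(flat) / 127.0 if flat else 1.0
        packed = [[round(v / scale) for v in row] for row in raw_weight]
        return packed, scale

# The two concerns compose without knowing each other's internals:
integration, kernel = SymmetricInt8Integration(), NaiveInt8Kernel()
w, s = integration.load([[1.0, -0.5], [0.25, 0.75]])
y = kernel.apply(w, s, [[1.0], [1.0]])
```

The payoff of this split is that a new kernel (e.g., a Machete or CPU path) only needs weights in its expected layout, while checkpoint formats like GPTQ or AWQ only need to describe how to produce that layout.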
- Issue #31695: Very poor disaggregated-serving performance on AMD MI325X (see the AMD/ROCm section above)
  - Core question: performance bottlenecks and connector compatibility for disaggregated serving on AMD's new hardware.
  - Current status: newly filed; maintainers have been alerted and further diagnosis is pending.
- PR #31662: [CI Failure] Fix NomicBert max_model_len validation
  - Core question: while fixing the max_model_len validation logic for the NomicBert model, the PR sparked an architectural discussion about how to unify the management of model-configuration (model_arch_config) update logic.
  - Positions:
    - PR author (@noooop): wants the NomicBert-specific config-update logic handled in one centralized place.
    - Reviewer (@charlotte12l): suggests planning further ahead and consolidating config update/read logic for all models into a unified model_arch_config_convertor, for architectural cleanliness.
  - Points of contention: whether to take a quick fix for the immediate problem or use the opportunity to push for a more thorough architectural unification, a trade-off between fast iteration and long-term architectural governance.
  - Current status: the PR remains open, with discussion touching deeper architectural design.
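The unified convertor the reviewer suggests would likely look something like a per-architecture registry. The sketch below is purely speculative — the registry, decorator, and config field names are invented; only the name model_arch_config_convertor comes from the discussion:

```python
# Speculative sketch of a per-architecture config-convertor registry,
# in the spirit of the "model_arch_config_convertor" idea from the
# review discussion. Not vLLM's actual code; all names are illustrative.
_CONVERTORS = {}

def register_convertor(arch):
    """Register an architecture-specific config update function."""
    def deco(fn):
        _CONVERTORS[arch] = fn
        return fn
    return deco

@register_convertor("NomicBertModel")
def _nomic_bert(config):
    # Hypothetical update: clamp max_model_len to the checkpoint's
    # trained context length (field name is invented for this example).
    trained = config.get("max_trained_positions")
    if trained is not None:
        config["max_model_len"] = min(config.get("max_model_len", trained), trained)
    return config

def convert_model_config(arch, config):
    """Apply the architecture-specific update, if one is registered."""
    return _CONVERTORS.get(arch, lambda c: c)(dict(config))

cfg = convert_model_config(
    "NomicBertModel",
    {"max_model_len": 8192, "max_trained_positions": 2048},
)
```

Centralizing per-model quirks behind one entry point is what would let a fix like PR #31662 stay local instead of scattering special cases through validation code.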
🔥 Hot Topics and Trends
- Quantization "spring cleaning" and architectural evolution: the most prominent trend this period is the large-scale refactoring (Issue #31689) and cleanup (PR #31688) of quantization code. The vLLM community is tackling the technical debt accumulated from rapidly supporting many quantization schemes, aiming to make the quantization layer more modular, clearer, and easier to maintain.
- Performance optimization and diagnostics: several issues (#31695, #31678, #31683) concern performance, from hardware-specific bottlenecks (AMD MI325X) and embedding-serving scalability to error-logging improvements in the multi-process architecture, reflecting the community's continued pursuit of production stability and peak performance.
- Multimodal and engine stability: Issue #31679 (Qwen3-VL crashing under async scheduling) and #31661 (a multimodal image-processing exception) show that, as complex models (especially multimodal and MoE models) see wide use, stability challenges remain in the V1 engine and under advanced features such as async-scheduling.
- Testing and infrastructure: a CI alerting proposal (Issue #31685), test fixes (PRs #31665, #31597), and a dependency upgrade (PR #31664) show continued investment in code quality and developer experience.
🛠️ Key Technical Changes
- PR #31672 (AMD-Quark online quantization): technically, it wires the AMD Quark toolchain's online FP8 quantization into vLLM. More importantly, it has triggered a design-philosophy debate over whether online quantization should be a platform-agnostic, general-purpose feature or bound to a specific vendor's tooling. Which approach wins will shape the openness and maintainability of vLLM's quantization ecosystem.
- PR #31688 (quantization scheme deprecation): plans to deprecate a long tail of little-used quantization schemes. This is a significant ecosystem-governance move that reduces maintenance burden and steers users and model publishers toward more mainstream, better-maintained formats (such as Compressed Tensors WNA16 and FP8).
- Issue #31689 (GPTQ/AWQ refactor): more than a cleanup task, this is a major architectural refactor. If carried through, it will make vLLM's quantization support more robust and flexible, paving the way for integrating more efficient quantization kernels (such as Machete and CPU kernels).
- PR #31690 (repetitive-pattern detection in the core engine): adds optional detection of repetitive token patterns at the engine core, allowing early termination of degenerate generation loops where the model rambles. This is a practical optimization for serving efficiency and user experience, particularly for hallucination-prone multimodal models.
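The report does not describe PR #31690's exact mechanism, but a simple trailing n-gram check illustrates the general idea of repetitive-pattern detection. The function below is a generic sketch, not the PR's actual algorithm, and its parameter names are invented:

```python
# Illustrative sketch of repetitive-token-pattern detection for early
# stopping. A generic trailing n-gram check, not PR #31690's real code.
def has_trailing_repetition(tokens, max_pattern_len=8, min_repeats=3):
    """Return True if the tail of `tokens` consists of the same pattern
    (up to max_pattern_len tokens long) repeated >= min_repeats times."""
    n = len(tokens)
    for plen in range(1, max_pattern_len + 1):
        if n < plen * min_repeats:
            continue
        pattern = tokens[n - plen:]
        if all(tokens[n - (i + 1) * plen : n - i * plen] == pattern
               for i in range(min_repeats)):
            return True
    return False

assert has_trailing_repetition([5, 1, 2, 1, 2, 1, 2])        # "1 2" repeated 3x
assert not has_trailing_repetition([1, 2, 3, 4, 5, 6, 7])    # no repetition
```

In an engine, such a check would run on each sequence's generated tokens every few steps, stopping generation once a loop is detected rather than burning compute until max_tokens.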
📈 Development Activity Observations
- Contributor diversity: active participants this period include engineers from AMD (@hangy-amd, @xuebwang-amd) and Red Hat (@robertgshaw2-redhat) alongside many independent developers, a sign of a healthy community ecosystem.
- Efficient merging: 11 of the 25 new PRs were merged within the window, including a quick fix for a LoRA MoE initialization error (PR #31663), reflecting an efficient review-and-merge process by the core team.
- Deep technical discussion: in-depth architectural exchanges among multiple core contributors on quantization refactoring (Issue #31689) and config management (PR #31662) show active medium- and long-term technical planning.
💡 Issues Worth Watching
- AMD MI325X performance bottleneck (Issue #31695): resolving this would directly improve vLLM's serving efficiency on AMD's latest data-center GPUs and matters greatly to AMD-ecosystem users.
- Quantization technical debt (Issue #31689): this refactor is large in both effort and impact; its progress and final design will determine the future shape of vLLM's quantization support.
- Multimodal stability under async scheduling (Issue #31679): with multimodality becoming standard, ensuring stable operation under the various optimized scheduling modes is key to user experience.
- Design path for online quantization (PR #31672): the outcome of the debate over decoupling online quantization from Quark will reveal how vLLM balances deep vendor-specific optimization against general software abstraction.
📋 Appendix: Detailed Data
New Issues
- #31695 [Performance]: Poor performance (1 tokens/s) in Disaggregated Serving on AMD MI325X — performance,rocm — by jingzhuochelseahu (created: 2026-01-05 10:16 (UTC+8))
- #31679 [Bug]: Qwen3-VL-8B crashes on latest nightly with --async-scheduling — bug — by rnik12 (created: 2026-01-04 20:32 (UTC+8))
- #31689 [Feature][Quantization][Help Wanted]: Clean up GPTQ + AWQ Quantization — help wanted,feature request — by robertgshaw2-redhat (created: 2026-01-05 04:56 (UTC+8))
- #31691 [CI Failure]: mi325_4: LoRA TP Test (Distributed) — ci-failure — by AndreasKaratzas (created: 2026-01-05 05:46 (UTC+8))
- #31687 [Bug]: BitBlas quantized models fail during inference — bug — by Conzel (created: 2026-01-05 02:17 (UTC+8))
- #31685 [CI][Feature] Alert when CI runner queue wait time is long — feature request — by khluu (created: 2026-01-05 00:58 (UTC+8))
- #31683 [Feature]: Error Logging Redesign — help wanted,feature request — by robertgshaw2-redhat (created: 2026-01-04 22:53 (UTC+8))
- #31681 [Bug]: An error occurred after the inference program was executed. — bug — by cqray1990 (created: 2026-01-04 22:15 (UTC+8))
- #31678 [Performance]: embedding — performance — by shahsavari-mhd (created: 2026-01-04 17:51 (UTC+8))
- #31671 [RFC]: Integrate omni-ai-npu/omni-infer's Topology-Aware Placement and Execution Optimizations into vLLM EPLB — RFC — by taostronger (created: 2026-01-04 15:47 (UTC+8))
- #31668 [Bug]: TorchAO quantization is broken in vLLM v1 engine (v0.13.0) — bug — by jaytonde (created: 2026-01-04 14:45 (UTC+8))
- #31661 [Bug]: t, h, w = image_grid_thw[image_index] IndexError: list index out of range — bug — by chen03191108-lab (created: 2026-01-04 11:35 (UTC+8))
Closed Issues
- #15526 [Bug]: FunctionDefinition missing optional param strict — bug,stale — by hanqingwu (closed: 2026-01-05 10:31 (UTC+8))
- #16914 [Bug]: vllm can't serve for Multi-audio input inference — bug,stale — by yuhp-zts (closed: 2026-01-05 10:31 (UTC+8))
- #17187 [Installation]: deployment failure on Kuberentes with CPU device (testing). — installation,stale — by jihed (closed: 2026-01-05 10:31 (UTC+8))
- #17513 [Bug]: [V1][Spec Dec] EAGLE TP > 1 leads to errors when using --enforce_eager — bug,torch.compile,stale,needs reproduction — by luyuzhe111 (closed: 2026-01-05 10:31 (UTC+8))
- #18252 [Bug]: Qwen3 uses vllm automatic batch inference to abnormal output — bug,stale — by AndersonStra (closed: 2026-01-05 10:31 (UTC+8))
- #19198 [Bug]: Granite-Speech-3.3-2b hangs forever, never produces output — bug,stale — by sjuxax (closed: 2026-01-05 10:31 (UTC+8))
- #19285 [Bug]: continue_final_message + echo + prefix-caching + V0 crash the server — bug,stale — by hibukipanim (closed: 2026-01-05 10:31 (UTC+8))
- #19326 [Usage]: Engine in background thread — usage,stale — by GaryHu-zzz (closed: 2026-01-05 10:31 (UTC+8))
- #19490 [Bug] [TPU]: OOMing on Llama-8B on new vllm nightly docker — bug,tpu,stale — by BabyChouSr (closed: 2026-01-05 10:31 (UTC+8))
- #22800 [Bug]: On the V100-SXM2-32GB single-card machine, it is impossible to run Qwen3-30B-A3B-Instruct-2507-AWQ and Qwen3-30B-A3B-Instruct-2507-GPTQ-Int8 using vllm — bug,stale — by lbl1120 (closed: 2026-01-05 10:30 (UTC+8))
- #23926 [Bug]: Illegal memory access with 4 GPUS — bug,stale — by HadiSDev (closed: 2026-01-05 10:30 (UTC+8))
- #23955 ValueError: Currently, MiniCPMV only supports versions 2.0, 2.5, 2.6, 4.0. Got version: (4, 5) — stale — by ep5000 (closed: 2026-01-05 10:30 (UTC+8))
- #24309 [RFC]: Optimizing the KV cache transmission mechanism in PD disaggregation to reduce its impact on TTFT/TPOT. — RFC,stale — by CalvinXKY (closed: 2026-01-05 10:30 (UTC+8))
- #24321 [Performance]: Garbage Collection Optimizations — performance,stale — by Jialin (closed: 2026-01-05 10:30 (UTC+8))
- #24323 [Bug]: P/D disaggregate perform worse than non-p/d — bug,stale — by aldwnesx (closed: 2026-01-05 10:30 (UTC+8))
- #24355 [Bug]: qwen3-32B Reasoning is slow — bug,stale — by hyqf98 (closed: 2026-01-05 10:30 (UTC+8))
- #24365 [Bug]: high priority not work — bug,stale — by gallery2016 (closed: 2026-01-05 10:30 (UTC+8))
- #30383 [RFC]: Multi-Process Benchmark Architecture for Scaling Beyond Single-Core Limits — RFC — by GaoHuaZhang (closed: 2026-01-04 16:50 (UTC+8))
- #31483 [RFC]: a efficient adaptive rejection sampling for accelerating speculative decoding. — RFC — by sunchendd (closed: 2026-01-04 15:35 (UTC+8))
New PRs
- #31669 [Misc][Model][Refactor] Pass the prefix into Linear layers — speculative-decoding,qwen — by kunpengW-code (created: 2026-01-04 14:49 (UTC+8))
- #31692 [MoE Refactor] Apply Structure to NVFP4 — nvidia — by robertgshaw2-redhat (created: 2026-01-05 08:23 (UTC+8))
- #31674 [platform] Support additional forward context for OOT — no labels — by zzzzwwjj (created: 2026-01-04 15:56 (UTC+8))
- #31666 [Misc] Various code simplifications — structured-output,speculative-decoding,ready,v1 — by njhill (created: 2026-01-04 14:24 (UTC+8))
- #31659 [Platform] Deprecate seed_everything — documentation,tpu,ready,v1,multi-modality,cpu,nvidia — by wangxiyuan (created: 2026-01-04 11:08 (UTC+8))
- #31665 [CI/Build] Revive skipped reward models e2e test — ready — by Isotr0py (created: 2026-01-04 14:03 (UTC+8))
- #31667 [Minor] Small pooler output processing optimization — ready,v1 — by njhill (created: 2026-01-04 14:41 (UTC+8))
- #31663 [Bugfix] Fix AttributeError: 'Stream' object has no attribute 'dp_size' — ready — by jeejeelee (created: 2026-01-04 11:55 (UTC+8))
- #31694 [Docs] Improve malformed exception caused by backslash line continuations — rocm,multi-modality,llama — by maang-h (created: 2026-01-05 10:00 (UTC+8))
- #31662 [CI Failure] Fix NomicBert max_model_len validation — ready — by noooop (created: 2026-01-04 11:44 (UTC+8))
- #31693 Fix TRTLLM decode assertion error when query lengths are non-uniform #31594 — v1,nvidia — by baonudesifeizhai (created: 2026-01-05 09:58 (UTC+8))
- #31688 [Quantization] Deprecate Long Tail of Schemes — ready — by robertgshaw2-redhat (created: 2026-01-05 04:42 (UTC+8))
- #31690 [Core] Add optional flags to check for repetitive token patterns in engine output — frontend,v1 — by aykoppol (created: 2026-01-05 05:13 (UTC+8))
- #31672 [Misc] Support amd-quark online fp8_ptpc quant — rocm — by hangy-amd (created: 2026-01-04 15:50 (UTC+8))
- #31686 [Bugfix] Correct block shape logic in WNA16 MoE triton kernel — no labels — by JartX (created: 2026-01-05 02:14 (UTC+8))
- #31684 Drift stable nvfp4 — performance,speculative-decoding,needs-rebase,v1,nvidia — by radna0 (created: 2026-01-05 00:35 (UTC+8))
- #31682 Apply refactor to ct — performance,llama,nvidia — by robertgshaw2-redhat (created: 2026-01-04 22:23 (UTC+8))
- #31680 [FixBug] Improve exception string in tensorizer.py — no labels — by maang-h (created: 2026-01-04 20:34 (UTC+8))
- #31670 [perf] skip embedding check if have computed — v1 — by Shirley125 (created: 2026-01-04 15:21 (UTC+8))
- #31677 [Bugfix] Sanitize malformed tool call recipients in Harmony parser — frontend,gpt-oss — by eous (created: 2026-01-04 17:30 (UTC+8))
- #31676 [Bugfix][Kernel] fix bias adding in triton kernel implemented fused moe — no labels — by xuebwang-amd (created: 2026-01-04 17:11 (UTC+8))
- #31675 [Cleanup] Remove deprecated vllm:time_per_output_token_seconds metric — v1 — by majiayu000 (created: 2026-01-04 16:51 (UTC+8))
- #31673 Test1 — no labels — by AlexLI-hub (created: 2026-01-04 15:54 (UTC+8))
- #31660 [DNM] test LoRA PDL — no labels — by jeejeelee (created: 2026-01-04 11:34 (UTC+8))
- #31664 [CI] Bump sentence-transformer from 3.2.1 to 5.2.0 — ci/build — by noooop (created: 2026-01-04 13:31 (UTC+8))
Merged PRs
- #31597 [ROCm][CI] Fix language generation test accuracy by disabling HF flash_sdp and mem_efficient_sdp — rocm,ready — by AndreasKaratzas (merged: 2026-01-05 10:17 (UTC+8))
- #31666 [Misc] Various code simplifications — structured-output,speculative-decoding,ready,v1 — by njhill (merged: 2026-01-05 10:35 (UTC+8))
- #31659 [Platform] Deprecate seed_everything — documentation,tpu,ready,v1,multi-modality,cpu,nvidia — by wangxiyuan (merged: 2026-01-05 10:34 (UTC+8))
- #31665 [CI/Build] Revive skipped reward models e2e test — ready — by Isotr0py (merged: 2026-01-05 10:33 (UTC+8))
- #31667 [Minor] Small pooler output processing optimization — ready,v1 — by njhill (merged: 2026-01-05 10:33 (UTC+8))
- #31663 [Bugfix] Fix AttributeError: 'Stream' object has no attribute 'dp_size' — ready — by jeejeelee (merged: 2026-01-05 10:31 (UTC+8))
- #31137 [misc] Sort uvicorn log level description according to verbosity — frontend,ready — by andyxning (merged: 2026-01-05 02:45 (UTC+8))
- #31632 [CI] Skip Phi-MoE test due to old API util — ready,ci/build — by AndreasKaratzas (merged: 2026-01-05 08:52 (UTC+8))
- #31611 [BugFix] Async scheduling: handle model forward errors more cleanly — ready,v1 — by njhill (merged: 2026-01-05 03:04 (UTC+8))
- #31449 fix no think of GLM-4.5 / GLM-4.7 — ready — by zRzRzRzRzRzRzR (merged: 2026-01-04 11:43 (UTC+8))
- #31654 [Docs] Fix argparse include path for mm-processor benchmark — documentation,ready — by reaganjlee (merged: 2026-01-04 11:31 (UTC+8))
PRs Closed Without Merging
- #21876 [Bugfix] Mask OOV tokens IDs from the Rejection Sampler — stale,v1 — by gnovack (closed: 2026-01-05 10:30 (UTC+8))
- #23848 [Misc]Fix an error when enabling allreduce fusion pass — stale — by 842974287 (closed: 2026-01-05 10:30 (UTC+8))
- #24340 [xformers] Force to use xformers kernels — stale,v1 — by qiruiyangmeta (closed: 2026-01-05 10:30 (UTC+8))
- #31600 Fix: Add None check in step_with_batch_queue for async scheduling — v1 — by rickychen-infinirc (closed: 2026-01-05 05:08 (UTC+8))
- #31684 Drift stable nvfp4 — performance,speculative-decoding,needs-rebase,v1,nvidia — by radna0 (closed: 2026-01-05 00:37 (UTC+8))
- #31614 [ROCm][Attention] Enable FlashAttention backend on ROCm (graph-safe cu_seqlens_k) — rocm,speculative-decoding,v1 — by ehartford (closed: 2026-01-05 00:19 (UTC+8))