vLLM Development Activity Report - 2026-01-04
Time window: 2026-01-04 11:06 (UTC+8) ~ 2026-01-05 11:06 (UTC+8). Statistics: 12 new issues | 19 issues closed | 25 new PRs | 11 PRs merged | 6 PRs closed unmerged
📊 Daily Development Summary
The vLLM project remained highly active this period (January 4-5, 2026), with the core focus on architectural refactoring of quantization schemes and platform-compatibility improvements. On one side, the community has launched a code cleanup and modularization effort for GPTQ/AWQ and other quantization schemes (Issue #31689) and plans to deprecate a batch of long-unmaintained schemes (PR #31688). On the other, performance tuning and bug fixing for hardware such as the AMD MI325X continues. Multimodal model stability (e.g., Qwen3-VL) and new performance-optimization strategies (e.g., repetitive-pattern detection) also drew attention.
🎯 AMD/ROCm Ecosystem Updates
Several important developments this period relate directly to the AMD ecosystem:
- Issue #31695: Very poor disaggregated (PD) serving performance on AMD MI325X: a user reports throughput of only 1 token/s when using SharedStorageConnector on the AMD MI325X. The issue notes that P2pNcclConnector deadlocks on the MI325X, while LMCacheConnectorV1 is unusable because of a hard dependency on libcudart.so.12. The issue has been labeled performance and rocm and automatically pinged the ROCm maintainers (@hongxiayang et al.), suggesting that vLLM's disaggregated-serving path on AMD's latest hardware may be under-optimized or have compatibility gaps.
- PR #31672: [Misc] Support amd-quark online fp8_ptpc quant: submitted by an AMD engineer (@hangy-amd) to integrate the AMD Quark toolkit's online FP8 quantization into vLLM, implementing RFC #31028. The PR drew a design objection in review from a core developer (@robertgshaw2-redhat), who argued that online quantization should not be tightly coupled to a specific quantization backend (such as Quark) but should instead build on vLLM's existing online-quantization abstractions, to avoid adding technical debt and confusion.
- Merged PR #31597: [ROCm][CI] Fix language generation test accuracy…: resolves test failures caused by accuracy issues in the ROCm Flash Attention implementation by force-disabling HuggingFace's flash_sdp and mem_efficient_sdp backends in tests on the ROCm platform. This keeps ROCm CI stable and is an important platform-compatibility fix.
- Issue #31691: CI Failure on AMD runner: reports a transient LoRA TP test failure on the AMD CI runner, related to the merge of PR #31533. A fix PR (#31663) has already been proposed, showing that the AMD CI pipeline is actively monitored and maintained.
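The "ptpc" in PR #31672's fp8_ptpc generally stands for per-token (activation) and per-channel (weight) scaling. As a rough illustration of that scaling math only — this is a pure-Python sketch, not vLLM's or Quark's actual implementation, and the function names are invented for this example:

```python
# Hedged sketch of per-token / per-channel FP8 scale computation.
# Illustrative only; does not reflect vLLM's or AMD Quark's real code.
FP8_E4M3_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def ptpc_scales(activations, weights):
    """One scale per activation row (token), one per weight column
    (output channel), so each value fits the FP8 dynamic range."""
    act_scales = [max(abs(v) for v in row) / FP8_E4M3_MAX for row in activations]
    n_out = len(weights[0])
    w_scales = [
        max(abs(weights[r][c]) for r in range(len(weights))) / FP8_E4M3_MAX
        for c in range(n_out)
    ]
    return act_scales, w_scales

def quantize(value, scale):
    """Scale into FP8 range and clamp; dequantize by multiplying back."""
    if scale == 0:
        return 0.0
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value / scale))

acts = [[0.5, -2.0], [1.0, 0.25]]
w = [[0.1, -0.4], [0.3, 0.2]]
a_s, w_s = ptpc_scales(acts, w)
# Values at the per-token max round-trip exactly through the scale.
recovered = quantize(acts[0][1], a_s[0]) * a_s[0]
```

Per-token activation scales adapt to each request's magnitude at runtime, which is what makes this an "online" scheme as opposed to static calibration.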
💬 High-Activity Discussions
- Issue #31689: [Feature][Quantization][Help Wanted]: Clean up GPTQ + AWQ Quantization
  - Core question: how to systematically refactor and clean up vLLM's GPTQ and AWQ quantization architecture, separating quantization integration (weight loading) from quantization kernels (runtime execution) to resolve the current technical debt and code sprawl.
  - Positions:
    - Initiator (@robertgshaw2-redhat): laid out a clear refactoring blueprint, proposing to model the rework on well-abstracted structures such as Fp8Moe and to consolidate and modularize gptq.py, awq.py, and several other files. He stressed that this is a major undertaking requiring high-quality code and close communication.
    - Respondents (@jikunshang, @Bhanu068, et al.): several contributors volunteered to participate, reflecting the community's enthusiasm for cleaning up core infrastructure.
  - Points of contention: no significant disputes; discussion centers on planning the technical path and dividing up the large refactoring effort.
  - Current status: open and seeking help; contributors have claimed parts of the work, and a series of large PRs is expected.
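The integration-vs-kernel split the issue proposes can be sketched abstractly. All class and method names below (QuantIntegration, QuantKernel, etc.) are hypothetical illustrations of the separation of concerns, not vLLM's actual interfaces:

```python
# Hypothetical sketch of separating checkpoint integration (how weights
# are packed on disk) from the runtime kernel (how the quantized matmul
# executes). Names are invented; this is not vLLM's real class hierarchy.
from abc import ABC, abstractmethod

class QuantKernel(ABC):
    """Runtime execution: runs a matmul on already-packed weights."""
    @abstractmethod
    def apply(self, packed_weight, scale, x):
        ...

class QuantIntegration(ABC):
    """Checkpoint integration: converts raw weights into the kernel's layout."""
    @abstractmethod
    def load(self, raw_weight):
        """Return (packed_weight, scale)."""
        ...

class NaiveInt8Kernel(QuantKernel):
    def apply(self, packed_weight, scale, x):
        # Dequantize-then-multiply reference path (no real kernel here).
        return [[sum(w * scale * xi for w, xi in zip(row, col))
                 for col in zip(*x)] for row in packed_weight]

class SymmetricInt8Integration(QuantIntegration):
    def load(self, raw_weight):
        flat = [abs(v) for row in raw_weight for v in row]
        scale = max(flat) / 127.0 if flat else 1.0
        packed = [[round(v / scale) for v in row] for row in raw_weight]
        return packed, scale

# The two concerns compose without knowing each other's internals:
integration, kernel = SymmetricInt8Integration(), NaiveInt8Kernel()
w, s = integration.load([[1.0, -0.5], [0.25, 0.75]])
y = kernel.apply(w, s, [[1.0], [1.0]])
```

The payoff of this split is that a new kernel (e.g., a Machete or CPU path) only needs weights in its expected layout, while checkpoint formats like GPTQ or AWQ only need to describe how to produce that layout.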
- Issue #31695: Very poor disaggregated-serving performance on AMD MI325X (see the AMD/ROCm section above)
  - Core question: performance bottlenecks and connector compatibility for disaggregated serving on AMD's new hardware.
  - Current status: newly filed; maintainers have been alerted and further diagnosis is pending.
- PR #31662: [CI Failure] Fix NomicBert max_model_len validation
  - Core question: while fixing the max_model_len validation logic for the NomicBert model, the PR sparked an architectural discussion about how to unify the management of model-configuration (model_arch_config) update logic.
  - Positions:
    - PR author (@noooop): wants the NomicBert-specific config-update logic handled in one centralized place.
    - Reviewer (@charlotte12l): suggests planning further ahead and consolidating config update/read logic for all models into a unified model_arch_config_convertor, for architectural cleanliness.
  - Points of contention: whether to take a quick fix for the immediate problem or use the opportunity to push for a more thorough architectural unification, a trade-off between fast iteration and long-term architectural governance.
  - Current status: the PR remains open, with discussion touching deeper architectural design.
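The unified convertor the reviewer suggests would likely look something like a per-architecture registry. The sketch below is purely speculative — the registry, decorator, and config field names are invented; only the name model_arch_config_convertor comes from the discussion:

```python
# Speculative sketch of a per-architecture config-convertor registry,
# in the spirit of the "model_arch_config_convertor" idea from the
# review discussion. Not vLLM's actual code; all names are illustrative.
_CONVERTORS = {}

def register_convertor(arch):
    """Register an architecture-specific config update function."""
    def deco(fn):
        _CONVERTORS[arch] = fn
        return fn
    return deco

@register_convertor("NomicBertModel")
def _nomic_bert(config):
    # Hypothetical update: clamp max_model_len to the checkpoint's
    # trained context length (field name is invented for this example).
    trained = config.get("max_trained_positions")
    if trained is not None:
        config["max_model_len"] = min(config.get("max_model_len", trained), trained)
    return config

def convert_model_config(arch, config):
    """Apply the architecture-specific update, if one is registered."""
    return _CONVERTORS.get(arch, lambda c: c)(dict(config))

cfg = convert_model_config(
    "NomicBertModel",
    {"max_model_len": 8192, "max_trained_positions": 2048},
)
```

Centralizing per-model quirks behind one entry point is what would let a fix like PR #31662 stay local instead of scattering special cases through validation code.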
🔥 Hot Topics and Trends
- Quantization "spring cleaning" and architectural evolution: the most prominent trend this period is the large-scale refactoring (Issue #31689) and cleanup (PR #31688) of quantization code. The vLLM community is tackling the technical debt accumulated from rapidly supporting many quantization schemes, aiming to make the quantization layer more modular, clearer, and easier to maintain.
- Performance optimization and diagnostics: several issues (#31695, #31678, #31683) concern performance, from hardware-specific bottlenecks (AMD MI325X) and embedding-serving scalability to error-logging improvements in the multi-process architecture, reflecting the community's continued pursuit of production stability and peak performance.
- Multimodal and engine stability: Issue #31679 (Qwen3-VL crashing under async scheduling) and #31661 (a multimodal image-processing exception) show that, as complex models (especially multimodal and MoE models) see wide use, stability challenges remain in the V1 engine and under advanced features such as async-scheduling.
- Testing and infrastructure: a CI alerting proposal (Issue #31685), test fixes (PRs #31665, #31597), and a dependency upgrade (PR #31664) show continued investment in code quality and developer experience.
🛠️ Key Technical Changes
- PR #31672 (AMD-Quark online quantization): technically, it wires the AMD Quark toolchain's online FP8 quantization into vLLM. More importantly, it has triggered a design-philosophy debate over whether online quantization should be a platform-agnostic, general-purpose feature or bound to a specific vendor's tooling. Which approach wins will shape the openness and maintainability of vLLM's quantization ecosystem.
- PR #31688 (quantization scheme deprecation): plans to deprecate a long tail of little-used quantization schemes. This is a significant ecosystem-governance move that reduces maintenance burden and steers users and model publishers toward more mainstream, better-maintained formats (such as Compressed Tensors WNA16 and FP8).
- Issue #31689 (GPTQ/AWQ refactor): more than a cleanup task, this is a major architectural refactor. If carried through, it will make vLLM's quantization support more robust and flexible, paving the way for integrating more efficient quantization kernels (such as Machete and CPU kernels).
- PR #31690 (repetitive-pattern detection in the core engine): adds optional detection of repetitive token patterns at the engine core, allowing early termination of degenerate generation loops where the model rambles. This is a practical optimization for serving efficiency and user experience, particularly for hallucination-prone multimodal models.
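The report does not describe PR #31690's exact mechanism, but a simple trailing n-gram check illustrates the general idea of repetitive-pattern detection. The function below is a generic sketch, not the PR's actual algorithm, and its parameter names are invented:

```python
# Illustrative sketch of repetitive-token-pattern detection for early
# stopping. A generic trailing n-gram check, not PR #31690's real code.
def has_trailing_repetition(tokens, max_pattern_len=8, min_repeats=3):
    """Return True if the tail of `tokens` consists of the same pattern
    (up to max_pattern_len tokens long) repeated >= min_repeats times."""
    n = len(tokens)
    for plen in range(1, max_pattern_len + 1):
        if n < plen * min_repeats:
            continue
        pattern = tokens[n - plen:]
        if all(tokens[n - (i + 1) * plen : n - i * plen] == pattern
               for i in range(min_repeats)):
            return True
    return False

assert has_trailing_repetition([5, 1, 2, 1, 2, 1, 2])        # "1 2" repeated 3x
assert not has_trailing_repetition([1, 2, 3, 4, 5, 6, 7])    # no repetition
```

In an engine, such a check would run on each sequence's generated tokens every few steps, stopping generation once a loop is detected rather than burning compute until max_tokens.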
📈 Development Activity Observations
- Contributor diversity: active participants this period include engineers from AMD (@hangy-amd, @xuebwang-amd) and Red Hat (@robertgshaw2-redhat) alongside many independent developers, a sign of a healthy community ecosystem.
- Efficient merging: 11 of the 25 new PRs were merged within the window, including a quick fix for a LoRA MoE initialization error (PR #31663), reflecting an efficient review-and-merge process by the core team.
- Deep technical discussion: in-depth architectural exchanges among multiple core contributors on quantization refactoring (Issue #31689) and config management (PR #31662) show active medium- and long-term technical planning.
💡 Issues Worth Watching
- AMD MI325X performance bottleneck (Issue #31695): resolving this would directly improve vLLM's serving efficiency on AMD's latest data-center GPUs and matters greatly to AMD-ecosystem users.
- Quantization technical debt (Issue #31689): this refactor is large in both effort and impact; its progress and final design will determine the future shape of vLLM's quantization support.
- Multimodal stability under async scheduling (Issue #31679): with multimodality becoming standard, ensuring stable operation under the various optimized scheduling modes is key to user experience.
- Design path for online quantization (PR #31672): the outcome of the debate over decoupling online quantization from Quark will reveal how vLLM balances deep vendor-specific optimization against general software abstraction.
📋 Appendix: Detailed Data
New Issues
- #31695 [Performance]: Poor performance (1 tokens/s) in Disaggregated Serving on AMD MI325X — performance,rocm — by jingzhuochelseahu (created: 2026-01-05 10:16 (UTC+8))
- #31679 [Bug]: Qwen3-VL-8B crashes on latest nightly with --async-scheduling — bug — by rnik12 (created: 2026-01-04 20:32 (UTC+8))
- #31689 [Feature][Quantization][Help Wanted]: Clean up GPTQ + AWQ Quantization — help wanted,feature request — by robertgshaw2-redhat (created: 2026-01-05 04:56 (UTC+8))
- #31691 [CI Failure]: mi325_4: LoRA TP Test (Distributed) — ci-failure — by AndreasKaratzas (created: 2026-01-05 05:46 (UTC+8))
- #31687 [Bug]: BitBlas quantized models fail during inference — bug — by Conzel (created: 2026-01-05 02:17 (UTC+8))
- #31685 [CI][Feature] Alert when CI runner queue wait time is long — feature request — by khluu (created: 2026-01-05 00:58 (UTC+8))
- #31683 [Feature]: Error Logging Redesign — help wanted,feature request — by robertgshaw2-redhat (created: 2026-01-04 22:53 (UTC+8))
- #31681 [Bug]: An error occurred after the inference program was executed. — bug — by cqray1990 (created: 2026-01-04 22:15 (UTC+8))
- #31678 [Performance]: embedding — performance — by shahsavari-mhd (created: 2026-01-04 17:51 (UTC+8))
- #31671 [RFC]: Integrate omni-ai-npu/omni-infer's Topology-Aware Placement and Execution Optimizations into vLLM EPLB — RFC — by taostronger (created: 2026-01-04 15:47 (UTC+8))
- #31668 [Bug]: TorchAO quantization is broken in vLLM v1 engine (v0.13.0) — bug — by jaytonde (created: 2026-01-04 14:45 (UTC+8))
- #31661 [Bug]: t, h, w = image_grid_thw[image_index] IndexError: list index out of range — bug — by chen03191108-lab (created: 2026-01-04 11:35 (UTC+8))
Closed Issues
- #15526 [Bug]: FunctionDefinition missing optional param strict — bug,stale — by hanqingwu (closed: 2026-01-05 10:31 (UTC+8))
- #16914 [Bug]: vllm can't serve for Multi-audio input inference — bug,stale — by yuhp-zts (closed: 2026-01-05 10:31 (UTC+8))
- #17187 [Installation]: deployment failure on Kuberentes with CPU device (testing). — installation,stale — by jihed (closed: 2026-01-05 10:31 (UTC+8))
- #17513 [Bug]: [V1][Spec Dec] EAGLE TP > 1 leads to errors when using --enforce_eager — bug,torch.compile,stale,needs reproduction — by luyuzhe111 (closed: 2026-01-05 10:31 (UTC+8))
- #18252 [Bug]: Qwen3 uses vllm automatic batch inference to abnormal output — bug,stale — by AndersonStra (closed: 2026-01-05 10:31 (UTC+8))
- #19198 [Bug]: Granite-Speech-3.3-2b hangs forever, never produces output — bug,stale — by sjuxax (closed: 2026-01-05 10:31 (UTC+8))
- #19285 [Bug]: continue_final_message + echo + prefix-caching + V0 crash the server — bug,stale — by hibukipanim (closed: 2026-01-05 10:31 (UTC+8))
- #19326 [Usage]: Engine in background thread — usage,stale — by GaryHu-zzz (closed: 2026-01-05 10:31 (UTC+8))
- #19490 [Bug] [TPU]: OOMing on Llama-8B on new vllm nightly docker — bug,tpu,stale — by BabyChouSr (closed: 2026-01-05 10:31 (UTC+8))
- #22800 [Bug]: On the V100-SXM2-32GB single-card machine, it is impossible to run Qwen3-30B-A3B-Instruct-2507-AWQ and Qwen3-30B-A3B-Instruct-2507-GPTQ-Int8 using vllm — bug,stale — by lbl1120 (closed: 2026-01-05 10:30 (UTC+8))
- #23926 [Bug]: Illegal memory access with 4 GPUS — bug,stale — by HadiSDev (closed: 2026-01-05 10:30 (UTC+8))
- #23955 ValueError: Currently, MiniCPMV only supports versions 2.0, 2.5, 2.6, 4.0. Got version: (4, 5) — stale — by ep5000 (closed: 2026-01-05 10:30 (UTC+8))
- #24309 [RFC]: Optimizing the KV cache transmission mechanism in PD disaggregation to reduce its impact on TTFT/TPOT. — RFC,stale — by CalvinXKY (closed: 2026-01-05 10:30 (UTC+8))
- #24321 [Performance]: Garbage Collection Optimizations — performance,stale — by Jialin (closed: 2026-01-05 10:30 (UTC+8))
- #24323 [Bug]: P/D disaggregate perform worse than non-p/d — bug,stale — by aldwnesx (closed: 2026-01-05 10:30 (UTC+8))
- #24355 [Bug]: qwen3-32B Reasoning is slow — bug,stale — by hyqf98 (closed: 2026-01-05 10:30 (UTC+8))
- #24365 [Bug]: high priority not work — bug,stale — by gallery2016 (closed: 2026-01-05 10:30 (UTC+8))
- #30383 [RFC]: Multi-Process Benchmark Architecture for Scaling Beyond Single-Core Limits — RFC — by GaoHuaZhang (closed: 2026-01-04 16:50 (UTC+8))
- #31483 [RFC]: a efficient adaptive rejection sampling for accelerating speculative decoding. — RFC — by sunchendd (closed: 2026-01-04 15:35 (UTC+8))
New PRs
- #31669 [Misc][Model][Refactor] Pass the prefix into Linear layers — speculative-decoding,qwen — by kunpengW-code (created: 2026-01-04 14:49 (UTC+8))
- #31692 [MoE Refactor] Apply Structure to NVFP4 — nvidia — by robertgshaw2-redhat (created: 2026-01-05 08:23 (UTC+8))
- #31674 [platform] Support additional forward context for OOT — no labels — by zzzzwwjj (created: 2026-01-04 15:56 (UTC+8))
- #31666 [Misc] Various code simplifications — structured-output,speculative-decoding,ready,v1 — by njhill (created: 2026-01-04 14:24 (UTC+8))
- #31659 [Platform] Deprecate seed_everything — documentation,tpu,ready,v1,multi-modality,cpu,nvidia — by wangxiyuan (created: 2026-01-04 11:08 (UTC+8))
- #31665 [CI/Build] Revive skipped reward models e2e test — ready — by Isotr0py (created: 2026-01-04 14:03 (UTC+8))
- #31667 [Minor] Small pooler output processing optimization — ready,v1 — by njhill (created: 2026-01-04 14:41 (UTC+8))
- #31663 [Bugfix] Fix AttributeError: 'Stream' object has no attribute 'dp_size' — ready — by jeejeelee (created: 2026-01-04 11:55 (UTC+8))
- #31694 [Docs] Improve malformed exception caused by backslash line continuations — rocm,multi-modality,llama — by maang-h (created: 2026-01-05 10:00 (UTC+8))
- #31662 [CI Failure] Fix NomicBert max_model_len validation — ready — by noooop (created: 2026-01-04 11:44 (UTC+8))
- #31693 Fix TRTLLM decode assertion error when query lengths are non-uniform #31594 — v1,nvidia — by baonudesifeizhai (created: 2026-01-05 09:58 (UTC+8))
- #31688 [Quantization] Deprecate Long Tail of Schemes — ready — by robertgshaw2-redhat (created: 2026-01-05 04:42 (UTC+8))
- #31690 [Core] Add optional flags to check for repetitive token patterns in engine output — frontend,v1 — by aykoppol (created: 2026-01-05 05:13 (UTC+8))
- #31672 [Misc] Support amd-quark online fp8_ptpc quant — rocm — by hangy-amd (created: 2026-01-04 15:50 (UTC+8))
- #31686 [Bugfix] Correct block shape logic in WNA16 MoE triton kernel — no labels — by JartX (created: 2026-01-05 02:14 (UTC+8))
- #31684 Drift stable nvfp4 — performance,speculative-decoding,needs-rebase,v1,nvidia — by radna0 (created: 2026-01-05 00:35 (UTC+8))
- #31682 Apply refactor to ct — performance,llama,nvidia — by robertgshaw2-redhat (created: 2026-01-04 22:23 (UTC+8))
- #31680 [FixBug] Improve exception string in tensorizer.py — no labels — by maang-h (created: 2026-01-04 20:34 (UTC+8))
- #31670 [perf] skip embedding check if have computed — v1 — by Shirley125 (created: 2026-01-04 15:21 (UTC+8))
- #31677 [Bugfix] Sanitize malformed tool call recipients in Harmony parser — frontend,gpt-oss — by eous (created: 2026-01-04 17:30 (UTC+8))
- #31676 [Bugfix][Kernel] fix bias adding in triton kernel implemented fused moe — no labels — by xuebwang-amd (created: 2026-01-04 17:11 (UTC+8))
- #31675 [Cleanup] Remove deprecated vllm:time_per_output_token_seconds metric — v1 — by majiayu000 (created: 2026-01-04 16:51 (UTC+8))
- #31673 Test1 — no labels — by AlexLI-hub (created: 2026-01-04 15:54 (UTC+8))
- #31660 [DNM] test LoRA PDL — no labels — by jeejeelee (created: 2026-01-04 11:34 (UTC+8))
- #31664 [CI] Bump sentence-transformer from 3.2.1 to 5.2.0 — ci/build — by noooop (created: 2026-01-04 13:31 (UTC+8))
Merged PRs
- #31597 [ROCm][CI] Fix language generation test accuracy by disabling HF flash_sdp and mem_efficient_sdp — rocm,ready — by AndreasKaratzas (merged: 2026-01-05 10:17 (UTC+8))
- #31666 [Misc] Various code simplifications — structured-output,speculative-decoding,ready,v1 — by njhill (merged: 2026-01-05 10:35 (UTC+8))
- #31659 [Platform] Deprecate seed_everything — documentation,tpu,ready,v1,multi-modality,cpu,nvidia — by wangxiyuan (merged: 2026-01-05 10:34 (UTC+8))
- #31665 [CI/Build] Revive skipped reward models e2e test — ready — by Isotr0py (merged: 2026-01-05 10:33 (UTC+8))
- #31667 [Minor] Small pooler output processing optimization — ready,v1 — by njhill (merged: 2026-01-05 10:33 (UTC+8))
- #31663 [Bugfix] Fix AttributeError: 'Stream' object has no attribute 'dp_size' — ready — by jeejeelee (merged: 2026-01-05 10:31 (UTC+8))
- #31137 [misc] Sort uvicorn log level description according to verbosity — frontend,ready — by andyxning (merged: 2026-01-05 02:45 (UTC+8))
- #31632 [CI] Skip Phi-MoE test due to old API util — ready,ci/build — by AndreasKaratzas (merged: 2026-01-05 08:52 (UTC+8))
- #31611 [BugFix] Async scheduling: handle model forward errors more cleanly — ready,v1 — by njhill (merged: 2026-01-05 03:04 (UTC+8))
- #31449 fix no think of GLM-4.5 / GLM-4.7 — ready — by zRzRzRzRzRzRzR (merged: 2026-01-04 11:43 (UTC+8))
- #31654 [Docs] Fix argparse include path for mm-processor benchmark — documentation,ready — by reaganjlee (merged: 2026-01-04 11:31 (UTC+8))
PRs Closed Without Merging
- #21876 [Bugfix] Mask OOV tokens IDs from the Rejection Sampler — stale,v1 — by gnovack (closed: 2026-01-05 10:30 (UTC+8))
- #23848 [Misc]Fix an error when enabling allreduce fusion pass — stale — by 842974287 (closed: 2026-01-05 10:30 (UTC+8))
- #24340 [xformers] Force to use xformers kernels — stale,v1 — by qiruiyangmeta (closed: 2026-01-05 10:30 (UTC+8))
- #31600 Fix: Add None check in step_with_batch_queue for async scheduling — v1 — by rickychen-infinirc (closed: 2026-01-05 05:08 (UTC+8))
- #31684 Drift stable nvfp4 — performance,speculative-decoding,needs-rebase,v1,nvidia — by radna0 (closed: 2026-01-05 00:37 (UTC+8))
- #31614 [ROCm][Attention] Enable FlashAttention backend on ROCm (graph-safe cu_seqlens_k) — rocm,speculative-decoding,v1 — by ehartford (closed: 2026-01-05 00:19 (UTC+8))