vLLM Development Activity Report - 2026-01-28

Time window: 2026-01-28 11:22 (UTC+8) ~ 2026-01-29 11:22 (UTC+8)
Statistics: New issues 21 | Closed issues 5 | New PRs 64 | Merged PRs 30 | PRs closed without merging 15

📊 Daily Development Summary

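During the window from January 28 to 29, 2026, the vLLM project stayed highly active: 64 PRs were opened and 30 were merged. Work concentrated on performance optimization (operator tuning, CUDA graphs, pipeline-parallel support), feature work (the FP8 LoRA proposal, multimodal model support), and bug fixes (notably FP8 compatibility and tool-call parsing). Taken together, the project is advancing model coverage, inference efficiency, and system stability in parallel.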

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity continued during this period, mainly in performance optimization, default-configuration changes, and test fixes:

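Feature request and performance optimization: AMD employee monajafi-amd filed Issue #33279, proposing an environment variable VLLM_TRITON_AUTOTUNE to control autotuning of Triton kernels. The goal is to address the non-determinism, longer startup time, and instability that autotuning introduces; the issue suggests disabling it by default. The change matters for performance portability and stability on AMD and on other hardware platforms.
ROCm backend default changes: PR #33271 plans to change the ROCm defaults: enable the AITER MoE backend by default on supported platforms (gfx9x), disable AITER MHA, and switch the default attention backend to ROCM_ATTN, which generally outperforms TRITON_ATTN.
KV-cache update and encoder fixes: following PR #33106, the fixes in PR #33269 and PR #33278 add forward_includes_kv_cache_update support to several ROCm attention backends (RocmAiterUnifiedAttention, RocmAttention, TritonAttention) and fix a crash in encoder models on the related path, improving functional completeness. PR #33269 was merged in this window; PR #33278 was closed without merging.
CI/test improvements:
- PR #33277 sets max_num_seqs=1 for the test_sharded_state_loader test on ROCm to reduce batch-to-batch variance and make the test less flaky.
- PR #32891 adds the TORCH_NCCL_BLOCKING_WAIT environment variable to the AMD CI distributed tests (A100) as a temporary workaround for a known HIP runtime bug.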

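Technical impact: this period's updates focus on developer experience and stability on AMD platforms (controllable autotuning, test fixes) and on runtime performance (default backend tuning, operator feature support). There was no work directly targeting the Quark quantization tool or newer hardware such as MI300, but the continued low-level optimization strengthens the foundation of the AMD ecosystem.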
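
The idea behind Issue #33279 can be pictured with a small sketch. This is an illustration under stated assumptions, not vLLM code: only the variable name VLLM_TRITON_AUTOTUNE comes from the issue, while the helper names, the accepted truthy spellings, and the placeholder configurations are hypothetical.

```python
import os

# Hypothetical candidate configurations; a real Triton kernel would use
# triton.Config objects rather than plain dictionaries.
CANDIDATE_CONFIGS = [
    {"BLOCK_M": 64, "BLOCK_N": 64, "num_warps": 4},
    {"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 8},
    {"BLOCK_M": 128, "BLOCK_N": 128, "num_warps": 8},
]


def autotune_enabled(env=None):
    """Return True only if VLLM_TRITON_AUTOTUNE is set to a truthy value.

    Disabled-by-default follows the default suggested in the issue;
    the accepted spellings are an assumption of this sketch.
    """
    env = os.environ if env is None else env
    value = env.get("VLLM_TRITON_AUTOTUNE", "0")
    return value.strip().lower() in {"1", "true", "yes"}


def configs_to_try(env=None):
    """With autotuning off, pin one fixed config so kernel selection is deterministic."""
    if autotune_enabled(env):
        return list(CANDIDATE_CONFIGS)
    return CANDIDATE_CONFIGS[:1]
```

With the variable unset, `configs_to_try()` returns a single configuration, so startup does not spend time benchmarking alternatives and repeated runs pick the same kernel parameters.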

💬 Most Active Discussions

Issue #33264: Should we use prompt_cache_key instead of cache_salt in vLLM?
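- Core topic: whether the OpenAI API's prompt_cache_key and vLLM's existing cache_salt parameter should be unified.
- Positions:
  - User dr75: the two have different motivations. prompt_cache_key is mainly for routing (handled by a gateway or load balancer), while cache_salt is mainly for cache isolation and security inside a vLLM instance. dr75 advises against merging them, to avoid coupling routing with cache isolation unnecessarily. Users can map the OpenAI parameter to the vLLM parameter at a higher layer, depending on their use case.
  - User hustxiayang: the two are implemented similarly, and asks why prompt_cache_key cannot be used directly for cache isolation in vLLM, which would align vLLM with the OpenAI interface.
- Point of contention: whether two conceptually separate concerns (routing vs. security isolation) should share one parameter at the interface level.
- Current status: the discussion is mainly clarifying concepts and use cases. dr75's suggestion (keep them separate and map at the gateway layer) has been elaborated in more detail, and no code change has been made yet.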
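
The gateway-layer mapping suggested in the thread could look roughly like the sketch below. It is a minimal illustration, not code from vLLM or from the discussion: the function names are hypothetical, and copying prompt_cache_key verbatim into cache_salt is only one possible policy a deployment might choose.

```python
import hashlib


def pick_replica(prompt_cache_key, num_replicas):
    """Routing concern: send requests that share a key to the same replica.

    A stable hash is used so the choice does not change between processes.
    """
    digest = hashlib.sha256(prompt_cache_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas


def to_vllm_body(body):
    """Isolation concern: derive cache_salt from prompt_cache_key at the gateway.

    An explicit cache_salt supplied by the caller is never overridden.
    """
    out = dict(body)  # shallow copy so the incoming request is not mutated
    key = out.get("prompt_cache_key")
    if key is not None and "cache_salt" not in out:
        out["cache_salt"] = key
    return out
```

Keeping the two functions separate mirrors dr75's point: the routing decision and the isolation decision can be changed independently, and the inference engine only ever sees cache_salt.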
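Issue #32109: [Bug]: Blackwell (SM120) FP8 MoE path fails for GLM-4.7 (closed this period)
- Core topic: fixing a crash when running FP8 MoE models on Blackwell GPUs (SM120), caused by wrongly selecting a CUTLASS backend that only supports SM90/SM100.
- Positions:
  - Reporter: reported the error "No compiled cutlass_scaled_mm for CUDA device capability: 120".
  - Contributors XiaobingSuper and malaiwah: analyzed the code and pointed out that the CUTLASS FP8 MoE kernels are implemented only for SM90 and SM100. They proposed patches that either add a capability check or implement an SM120 kernel.
  - Final fix: PR #33285. The PR explicitly restricts the CUTLASS backend to SM90 and SM100 in the FP8 backend selection logic, so SM120 devices correctly fall back to the Triton kernels.
- Conclusion: compatibility was restored through correct capability detection and gating in software rather than by implementing a new CUTLASS kernel for SM120, which is the simpler route.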
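
The gating approach described above can be sketched as follows. This is a simplified illustration, not the code from PR #33285: the function name, the backend strings, and the encoding of compute capability as major * 10 + minor are assumptions made for the example.

```python
# Capabilities for which the CUTLASS FP8 MoE kernels exist, per the discussion
# (SM90 and SM100), encoded as major * 10 + minor.
CUTLASS_FP8_MOE_CAPABILITIES = frozenset({90, 100})


def select_fp8_moe_backend(capability):
    """Pick CUTLASS only where a kernel is known to exist; otherwise use Triton.

    An allow-list (rather than a ">= 90" check) keeps newer architectures
    such as SM120 from selecting a kernel that was never compiled for them.
    """
    if capability in CUTLASS_FP8_MOE_CAPABILITIES:
        return "cutlass"
    return "triton"
```

The design choice worth noting is the allow-list: a lower-bound comparison would silently include every future architecture, which is the kind of mismatch the issue reported.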
RFC #33224: Deprecate and delete InductorAdaptor in favor of InductorStandaloneAdaptor
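- Core topic: deprecate and eventually delete InductorAdaptor, moving entirely to InductorStandaloneAdaptor, to simplify the torch.compile integration code.
- Positions:
  - Proposer zou3519: InductorStandaloneAdaptor has been on by default since PyTorch 2.9, and some models (such as Deepseek 3.2) require it to run. Maintaining two code paths adds complexity and maintenance burden.
  - Supporter ProExpertProg: agrees (SGTM). He notes that the deprecation date depends on when the backends that support torch.compile (CUDA, ROCm, etc.) upgrade to PyTorch 2.10.
- Current status: community feedback is positive and favors simplification. A concrete deprecation timeline still needs to be set.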

🔥 Hot Topics and Trends

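Performance optimization continues to deepen:
- Operator-level optimization: PR #32892 optimizes the MoE permute kernel, reporting a 40%-300% kernel performance improvement.
- Compilation and caching: PR #33232 enables piecewise CUDA graphs for the ViT encoder of Qwen multimodal models, aiming to cut kernel launch overhead and lower TTFT.
- Distributed scheduling: PR #32618 completes support for async scheduling together with pipeline parallelism (PP), reporting a 30.8% end-to-end throughput improvement.
FP8 precision and broader use:
- Bug fixes: several PRs (e.g. #33285, #33300, #33280) address FP8 compatibility issues in MoE, CUTLASS backend selection, and CompressedTensors block quantization.
- New feature proposal: RFC #33301 formally proposes FP8 LoRA inference, aiming to substantially reduce the memory footprint of multiple LoRA adapters in large-scale serving and to use the FP8 compute capability of modern GPUs.
Model and tool ecosystem expansion:
- New model support: new-model activity covered DeepSeek OCR 2 (feature request #33252), MiniMax (PR #33244), FunASR (PR #33247), and FunAudioChat (PR #33058, merged).
- Tool and parser improvements: tool-call parsers for GLM-4, Kimi, and other models received several improvements, including streaming output and empty-argument handling (#33218, #33248).
System observability and debugging:
- Finer-grained metrics: Issue #33289 and its PR #33290 add labeled prompt-token source metrics for Prefill/Decode disaggregated deployments, helping users distinguish work done by local compute, KV transfer, and cache hits.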

🛠️ Key Technical Changes

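PR #33285: Fix CUTLASS backend selection for FP8 MoE on Blackwell GPUs
- Change: in the FP8 MoE backend selection logic, the vLLM CUTLASS backend is strictly limited to SM90 and SM100 devices.
- Impact: fixes the root cause of crashes on newer architectures such as Blackwell (SM120), where an unimplemented kernel was selected by mistake, and keeps FP8 MoE models usable on new hardware.
PR #32892: MoE permute kernel performance optimization
- Change: reworks the moe_permute kernel by optimizing how aligned_expert_first_token_offset is computed and by reducing shared-memory usage.
- Impact: 40% to 300% faster across a range of batch sizes. The operation is a small share of overall MoE compute, but the optimization helps reduce system latency and reflects continued tuning of core operators.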
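
To make the quantity concrete, here is a plain-Python reading of what an aligned per-expert first-token offset could mean. It is an interpretation of the name only, not the CUDA kernel from PR #32892: the assumption is that each expert's token segment is padded up to a multiple of an alignment value, and that the offsets are the exclusive prefix sum of those padded sizes.

```python
def aligned_expert_first_token_offset(expert_ids, num_experts, align):
    """Return num_experts + 1 offsets into a buffer of tokens grouped by expert.

    offsets[e] is where expert e's first token would start when every
    expert's segment is padded up to a multiple of `align` (assumed semantics).
    """
    # Count how many tokens are routed to each expert.
    counts = [0] * num_experts
    for expert in expert_ids:
        counts[expert] += 1

    # Exclusive prefix sum over the padded segment sizes.
    offsets = [0]
    for count in counts:
        padded = (count + align - 1) // align * align  # round up to a multiple of align
        offsets.append(offsets[-1] + padded)
    return offsets


# Three tokens go to expert 0, none to expert 1, two to expert 2.
print(aligned_expert_first_token_offset([0, 2, 2, 0, 0], num_experts=3, align=4))
# prints [0, 4, 4, 8]
```

A GPU implementation would compute the counts and the prefix sum in parallel; the sequential version here only shows the arithmetic.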
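PR #32618: Full support for async scheduling + pipeline parallelism
- Change: solves the problem that, under async scheduling, the last pipeline-parallel stage must broadcast sampling results to the other stages. A direct GPU tensor broadcast replaces the previous CPU path.
- Impact: 30.8% end-to-end throughput improvement and 31.8% TPOT improvement in testing, unlocking more of vLLM's performance in complex distributed inference setups.
PR #33191/#33256: Code quality and documentation improvements
- Change: the former adds the flake8-implicit-str-concat rules to the Ruff checks; the latter fixes typos across documentation and code.
- Impact: no direct functional effect, but both improve the long-term maintainability and quality of the codebase, which is typical of a mature project.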

📈 Development Activity Observations

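High merge throughput: 30 PRs were merged within 24 hours, indicating an efficient review and integration process that absorbs community contributions quickly.
Diverse contributors: active contributors include employees of NVIDIA, AMD, Intel, Huawei, Red Hat, and other companies, as well as many independent developers, reflecting vLLM's broad enterprise and community ecosystem.
Timely issue handling: new issues quickly receive replies and diagnosis from community members, including core developers. For example, Issue #33264 and Issue #33261 both received targeted answers within a short time.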

💡 Issues Worth Watching

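RFC #33301: Support FP8 LoRA inference: an important new feature proposal aimed at the memory bottleneck of serving many LoRA adapters at scale. Its experimental results and implementation plan deserve close community review, and it could become the next key performance feature.
Issue #33263: Prevent overallocation of kv-cache: the user proposes that the KV cache be capped automatically based on max_num_seqs * max_model_len instead of always filling the available GPU memory. This touches the balance between ease of use and flexibility in resource management, and may call for smarter defaults or better guidance.
Issue #33279: Environment Variable to Control Triton Autotuning: filed by an AMD employee, this request reflects strong production demand for determinism and stable startup time. Whether or not it ends up as an environment variable, the underlying problem deserves the core team's attention.
CI stability: Issue #33297 reports a CI test failure caused by the GPU memory utilization setting, suggesting that resource-constrained CI environments need more robust defaults or better test isolation.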
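
The cap proposed in Issue #33263 amounts to simple arithmetic, sketched below. The function names and the example model shape are hypothetical, and the per-token size formula assumes a standard attention layout that stores one key and one value vector per layer and KV head; models with other cache layouts would need a different formula.

```python
def kv_cache_bytes_needed(max_num_seqs, max_model_len, num_layers,
                          num_kv_heads, head_dim, dtype_bytes=2):
    """Upper bound on KV-cache bytes if every sequence reaches max_model_len."""
    # Factor 2 accounts for storing both the key and the value tensor.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return max_num_seqs * max_model_len * bytes_per_token


def capped_kv_cache_bytes(available_bytes, needed_bytes):
    """Allocate no more than the workload can use, and no more than is available."""
    return min(available_bytes, needed_bytes)


GIB = 1024 ** 3
# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 2-byte dtype.
needed = kv_cache_bytes_needed(max_num_seqs=4, max_model_len=8192,
                               num_layers=32, num_kv_heads=8, head_dim=128)
print(needed // GIB)                                   # prints 4
print(capped_kv_cache_bytes(64 * GIB, needed) // GIB)  # prints 4
```

In this example the workload can use at most 4 GiB of KV cache, so reserving all 64 GiB of free memory would leave 60 GiB that no request could ever touch.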

📋 Appendix: Detailed Data

New Issues

#33301 [RFC]: Support FP8 LoRA inference | RFC | by jeejeelee (created: 2026-01-29 10:45 (UTC+8))
#33261 [Bug]: EngineArgs.init() got an unexpected keyword argument ‘task’ | bug | by innovation-miracle (created: 2026-01-28 22:46 (UTC+8))
#33299 [Bug]: MTP does not mask embedding on position 0 | bug | by DingYibin (created: 2026-01-29 10:20 (UTC+8))
#33297 [CI Failure]: Entrypoints Integration Test (Responses API) for GPU utilization ValueError | ci-failure | by pacoxu (created: 2026-01-29 09:55 (UTC+8))
#33295 [Bug]: QKNorm+RMS fusion broken for qwen3 | bug | by ProExpertProg (created: 2026-01-29 08:12 (UTC+8))
#33294 [Bug]: Typo in detokenizer.py: read_offest should be read_offset breaks prompt_embeds | bug | by bet0x (created: 2026-01-29 08:09 (UTC+8))
#33289 [Feature]: [Metrics] Labeled prompt token metrics for P/D disaggregation (Follow-up on PR #27569) | feature request | by ZhanqiuHu (created: 2026-01-29 07:15 (UTC+8))
#33264 [Usage]: Should we use prompt_cache_key instead of cache_salt in vLLM? | usage | by hustxiayang (created: 2026-01-28 23:32 (UTC+8))
#33279 [Feature]: Environment Variable to Control Triton Autotuning | feature request | by monajafi-amd (created: 2026-01-29 04:01 (UTC+8))
#33276 [Usage]: question about gguf gemma3-4b different output from llama_cpp | usage | by sarahliu-cisco (created: 2026-01-29 02:58 (UTC+8))
#33267 [Feature]: Remove attention layer name from unified_kv_cache_update | help wanted,good first issue,feature request,torch.compile,startup-ux | by ProExpertProg (created: 2026-01-29 00:22 (UTC+8))
#33263 [Feature]: Prevent overallocation of kv-cache | feature request | by DominikHil (created: 2026-01-28 23:21 (UTC+8))
#33224 [RFC]: Deprecate and delete InductorAdaptor in favor of InductorStandaloneAdaptor | RFC,torch.compile | by zou3519 (created: 2026-01-28 12:49 (UTC+8))
#33258 [Bug]: pplx all2all backend hangs during model warmup on A6000 GPUs | bug | by SeungjaeLim (created: 2026-01-28 21:27 (UTC+8))
#33259 [Usage]: vLLM 的 fused-moe 算子目前只支持给 gate 加 LoRA吗？ (English: Does vLLM's fused-moe operator currently support adding LoRA only to the gate?) | usage | by shiwanghua (created: 2026-01-28 21:31 (UTC+8))
#33252 [Feature]: Add Deepseek OCR 2 support to the latest version | feature request | by ItzAmirreza (created: 2026-01-28 20:30 (UTC+8))
#33242 [Bug]: Qwen3-VL-30B (MoE): CUDA error ‘illegal memory access’ when max_model_len > 128k | bug | by Austin-Liang (created: 2026-01-28 18:05 (UTC+8))
#33245 [Installation]: RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): | installation | by beebol (created: 2026-01-28 18:54 (UTC+8))
#33223 [Usage]: Dose vllm support jsonlines format when using response_format ? | usage | by alanayu (created: 2026-01-28 12:03 (UTC+8))
#33229 [Bug]: vllm v0.14.0 GGUF model with architecture minimax-m2 is not supported yet | bug | by FayeSpica (created: 2026-01-28 14:42 (UTC+8))
#33219 [Usage]: 请问Qwen3-VL-Embedding怎么输入视频获取向量？ (English: How do I feed video into Qwen3-VL-Embedding to get embeddings?) | usage | by zengweigit (created: 2026-01-28 11:34 (UTC+8))

Closed Issues

#32109 [Bug]: Blackwell (SM120) FP8 MoE path fails for GLM-4.7 : No compiled cutlass_scaled_mm for CUDA device capability: 120 on RTX PRO 6000 Blackwell | bug | by christianbalbin (closed: 2026-01-29 10:41 (UTC+8))
#31919 [RFC]: Attention Restructuring Tracker | RFC | by MatthewBonanni (closed: 2026-01-29 06:21 (UTC+8))
#33175 [Bug]: Mistral-Small-3.1-24B crashes on startup | bug | by NickLucche (closed: 2026-01-28 21:44 (UTC+8))
#12568 [V1] Feedback Thread | stale,v1 | by simon-mo (closed: 2026-01-28 16:27 (UTC+8))
#32838 [Bug]: CompressedTensorsW4A16Fp4 is not support on turing | bug | by ir1ka (closed: 2026-01-28 12:54 (UTC+8))

New PRs

#33250 feat(mla): extract KV-cache update | no labels | by dw2761 (created: 2026-01-28 20:27 (UTC+8))
#33282 [CI] Enable mypy import following for vllm/spec_decode | speculative-decoding,v1 | by Lucaskabela (created: 2026-01-29 05:57 (UTC+8))
#33283 [Misc] Remove missed pad_for_cudagraph | ready,nvidia | by LucasWilkinson (created: 2026-01-29 06:03 (UTC+8))
#33232 [Misc][ViT Cuda Graphs] Enable Piecewise CUDA Graphs for Qwen3-VL and Qwen2.5-VL ViT to Improve Performance | documentation,v1,qwen,nvidia | by HirokenOvo (created: 2026-01-28 15:42 (UTC+8))
#33246 [Misc] support collect_env for endpoint /server_info | frontend | by muma378 (created: 2026-01-28 19:09 (UTC+8))
#33298 [Bugfix] Fix Qwen3-VL-Reranker load. | bug,documentation,ready,qwen | by noooop (created: 2026-01-29 10:17 (UTC+8))
#33300 [Bugfix] Enable Triton MoE for FP8 per-tensor dynamic | bug,ready | by mgoin (created: 2026-01-29 10:35 (UTC+8))
#33285 [Bugfix] Register fp8 cutlass_group_gemm as supported for only SM90+SM100 | bug,ready,nvidia | by mgoin (created: 2026-01-29 06:22 (UTC+8))
#33233 [structured output] validate unsupported json features first | structured-output,ready,v1 | by andyxning (created: 2026-01-28 15:42 (UTC+8))
#33260 [Refactor] Define MM data parser in processing info instead of processor itself | ready,multi-modality,llama,qwen | by DarkLight1337 (created: 2026-01-28 22:02 (UTC+8))
#33218 [Bugfix] GLM-4 tool parser: incremental string streaming | bug,ready | by QwertyJack (created: 2026-01-28 11:27 (UTC+8))
#33268 [Bugfix] Handle empty/whitespace tool_call arguments in chat preproce… | bug,frontend | by seli-equinix (created: 2026-01-29 00:29 (UTC+8))
#33265 [Feat] SHMConnector: Share Memory based EC Connector for EPD | documentation,kv-connector | by PiratePai (created: 2026-01-29 00:06 (UTC+8))
#33230 Add XPU MLA Sparse backend for DeepSeek v3.2 | v1,deepseek | by wuxun-zhang (created: 2026-01-28 15:08 (UTC+8))
#33296 [Misc] Improve tracing test quality and fix noisy logging | documentation,frontend,ci/build,v1 | by sriumcp (created: 2026-01-29 09:15 (UTC+8))
#33292 [WIP][KV Connector] Add handshake metadata methods to MultiConnector | documentation,new-model,ci/build,v1,multi-modality,llama,qwen,kv-connector | by eicherseiji (created: 2026-01-29 07:52 (UTC+8))
#33287 [ez] Delete torch25_custom_graph_pass | ready | by angelayi (created: 2026-01-29 07:00 (UTC+8))
#33288 [ez] Delete more torch version checks <= 2.8 | ready | by angelayi (created: 2026-01-29 07:12 (UTC+8))
#33293 Reduce Blackwell test time | ready,ci/build | by ProExpertProg (created: 2026-01-29 08:01 (UTC+8))
#33273 [UX] Remove noisy CT UnquantizedLinearMethod warn | ready | by mgoin (created: 2026-01-29 02:09 (UTC+8))
#33280 Support FP8 block quant for CompressedTensorsW8A16Fp8 | ready,quantization | by mgoin (created: 2026-01-29 05:03 (UTC+8))
#33291 [PERF] Change GDN Attention State Layout from [N, HV, K, V] to [N, HV, V, K] | no labels | by vadiklyutiy (created: 2026-01-29 07:41 (UTC+8))
#33290 [Metrics] Add labeled prompt token metrics for P/D disaggregation | v1,kv-connector | by ZhanqiuHu (created: 2026-01-29 07:19 (UTC+8))
#33270 [Bugfix] Fix Hybrid KV cache hit length computation for eagle | bug,v1 | by xyang16 (created: 2026-01-29 01:05 (UTC+8))
#33286 [CI][HPU]accelerate hpu test by skip python re-install and clean container name | ci/build | by xuechendi (created: 2026-01-29 06:53 (UTC+8))
#33284 [WIP][Attention] Move MLA forward from backend to layer | rocm,v1,nvidia | by MatthewBonanni (created: 2026-01-29 06:19 (UTC+8))
#33266 [ModelRunner V2] Misc code simplification and cleanup | v1 | by njhill (created: 2026-01-29 00:18 (UTC+8))
#33278 [BugFix] Fix IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1) for encoder models | bug,rocm,needs-rebase,v1 | by LucasWilkinson (created: 2026-01-29 03:53 (UTC+8))
#33281 [2/N] move responses/serving _make_response_output_items logic to parser | frontend | by qandrew (created: 2026-01-29 05:41 (UTC+8))
#33269 [Bugfix] Add missing encoder only guard for do_kv_cache_update | bug,rocm,ready,v1 | by gshtras (created: 2026-01-29 00:53 (UTC+8))
#33271 [ROCm] Change default settings for ROCm | rocm | by gshtras (created: 2026-01-29 01:26 (UTC+8))
#33257 [MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 | no labels | by vadiklyutiy (created: 2026-01-28 21:24 (UTC+8))
#33274 fix(symm_mem): add multicast support check to prevent crash on PCIe GPUs | no labels | by stmcgovern (created: 2026-01-29 02:34 (UTC+8))
#33275 [CI] Change GPU key to device key for B200 test | ready,ci/build | by khluu (created: 2026-01-29 02:49 (UTC+8))
#33277 [ROCm][CI] Force max_num_seqs=1 on ROCm In test_sharded_state_loader to reduce flakiness | rocm | by micah-wil (created: 2026-01-29 03:06 (UTC+8))
#33272 [Bugfix] Fix OpenTelemetry trace context propagation between API and engine (issue #21) | bug,documentation,frontend,ci/build,v1 | by sriumcp (created: 2026-01-29 01:33 (UTC+8))
#33262 [BugFix] Fix EPLB fail for MoeFP4 model with Marlin backend | bug,ready | by ilmarkov (created: 2026-01-28 23:09 (UTC+8))
#33247 [model] support FunASR model | documentation,new-model,needs-rebase | by AllenDou (created: 2026-01-28 19:37 (UTC+8))
#33251 [Model Runner V2] support apply penalty for spec decode | v1 | by izhuhaoran (created: 2026-01-28 20:28 (UTC+8))
#33225 replace with torch.cuda.stream() | v1,kv-connector,nvidia | by xinyu-intel (created: 2026-01-28 13:56 (UTC+8))
#33256 [Doc]: fixing multiple typos in diverse files | documentation,frontend,ready,ci/build,v1,cpu | by didier-durand (created: 2026-01-28 21:11 (UTC+8))
#33221 [Misc] Provide a DeepSeek ReasoningParser with thinking enabled by default | ready,deepseek | by chaunceyjiang (created: 2026-01-28 11:47 (UTC+8))
#33248 Kimi K2.5 model generates “(no content)” placeholder in tool call responses | needs-rebase | by shivamashtikar (created: 2026-01-28 20:13 (UTC+8))
#33249 [Frontend] Add structured_outputs field to ResponsesRequest | frontend | by MarcinGodniak (created: 2026-01-28 20:22 (UTC+8))
#33253 Improve Mistral format checks. | no labels | by juliendenize (created: 2026-01-28 20:41 (UTC+8))
#33227 [Misc] Add orozery to CODEOWNERS (core, kv_transfer, kv_offload) | ci/build | by orozery (created: 2026-01-28 14:35 (UTC+8))
#33255 Fix quant RMS norm fusion for quantization with TMA-aligned scales | no labels | by ElizaWszola (created: 2026-01-28 21:00 (UTC+8))
#33254 [Docs] Update link to Benchmark CLI documentation | performance | by eldarkurtic (created: 2026-01-28 20:58 (UTC+8))
#33241 Revert “Enable Cross layers KV cache layout at NIXL Connector (#30207)” | documentation,ready,v1,kv-connector | by orozery (created: 2026-01-28 17:52 (UTC+8))
#33244 Add MiniMax model support to vLLM | needs-rebase | by JingqingZh (created: 2026-01-28 18:17 (UTC+8))
#33243 [CPU][IBM Z][Dockerfile] Fix IBM Z builds | ci/build | by R3hankhan123 (created: 2026-01-28 18:16 (UTC+8))
#33236 Fix encoder-decoder model disabling mm processor cache | ready | by hmellor (created: 2026-01-28 16:20 (UTC+8))
#33231 Fix prefer_cross_layers check | ready,kv-connector | by liranschour (created: 2026-01-28 15:39 (UTC+8))
#33240 [CI] Update job dependency syntax for Intel and AMD jobs | rocm,ready,ci/build | by khluu (created: 2026-01-28 17:19 (UTC+8))
#33238 [Feature]: Kernels Attention Test cleanup | v1 | by SouthWest7 (created: 2026-01-28 16:46 (UTC+8))
#33237 [CI] Update job dependency for hardware and CPU jobs | ci/build | by khluu (created: 2026-01-28 16:45 (UTC+8))
#33239 Move decode context parallel validationn to ParallelConfig | no labels | by hmellor (created: 2026-01-28 16:57 (UTC+8))
#33235 [Core] Improve MetaTensorMode to intercept more tensor factory operations | no labels | by calleum (created: 2026-01-28 16:14 (UTC+8))
#33234 [LoRA] Support LoRA for Embedding and LMHead in Qwen2/3 family | v1,qwen | by AuYang261 (created: 2026-01-28 15:47 (UTC+8))
#33222 Cpu binding to improve GPU performance and utilize idle CPU cores | documentation,performance,cpu | by louie-tsai (created: 2026-01-28 12:01 (UTC+8))
#33226 [XPU]disable test_acceptance_length UT | ready,ci/build | by yma11 (created: 2026-01-28 14:26 (UTC+8))
#33228 [Feature] silu block quant fusion Triton kernel | performance,ci/build | by weimin023 (created: 2026-01-28 14:38 (UTC+8))
#33220 [Doc] add missing model entries in supported_models.md | documentation | by pacoxu (created: 2026-01-28 11:36 (UTC+8))
#33217 [Model Runner V2] Init cuda graph pool when necessary | v1,nvidia | by xinyu-intel (created: 2026-01-28 11:26 (UTC+8))

Merged PRs

#33285 [Bugfix] Register fp8 cutlass_group_gemm as supported for only SM90+SM100 | bug,ready,nvidia | by mgoin (merged: 2026-01-29 10:40 (UTC+8))
#32892 [Perf] Optimize moe_permute kernel, 40%~300% kernel performance improvement | ready,nvidia | by yewentao256 (merged: 2026-01-29 02:15 (UTC+8))
#33273 [UX] Remove noisy CT UnquantizedLinearMethod warn | ready | by mgoin (merged: 2026-01-29 08:09 (UTC+8))
#33266 [ModelRunner V2] Misc code simplification and cleanup | v1 | by njhill (merged: 2026-01-29 06:41 (UTC+8))
#32477 [7/N][Attention][Docs] Add documentation for attention backends | documentation,ready | by MatthewBonanni (merged: 2026-01-29 06:20 (UTC+8))
#33193 [UX] Enable nested configs in config yaml files | feature request,ready | by mgoin (merged: 2026-01-29 05:54 (UTC+8))
#33269 [Bugfix] Add missing encoder only guard for do_kv_cache_update | bug,rocm,ready,v1 | by gshtras (merged: 2026-01-29 05:25 (UTC+8))
#33209 [ez] Remove checks for torch version <= 2.8 | ready | by angelayi (merged: 2026-01-29 05:03 (UTC+8))
#30976 Use aiter triton fused_add_rmsnorm_pad for gpt-oss | rocm,ready,gpt-oss | by Rohan138 (merged: 2026-01-29 04:47 (UTC+8))
#32618 [Feature] Fully support for async scheduling + PP, 30.8% E2E throughput improvement, 31.8% TPOT improvement | ready,v1 | by yewentao256 (merged: 2026-01-29 04:30 (UTC+8))
#33275 [CI] Change GPU key to device key for B200 test | ready,ci/build | by khluu (merged: 2026-01-29 03:14 (UTC+8))
#33098 [CI] Whisper tests enforce_eager=False | ready | by NickLucche (merged: 2026-01-29 01:36 (UTC+8))
#32774 [lora/moe] Avoid extra intermediate buffer & Python slicing in expand phase when split_k == 1 | ready | by cwazai (merged: 2026-01-29 00:22 (UTC+8))
#33106 [ROCm] Enabling forward_includes_kv_cache on ROCm MHA backends | rocm,ready,v1 | by gshtras (merged: 2026-01-28 14:36 (UTC+8))
#33183 [Benchmark] Add startup benchmarking to buildkite run | performance,ready,ci/build | by desertfire (merged: 2026-01-29 00:03 (UTC+8))
#32688 [Quantization][Deprecation] Remove Marlin 24 | performance,ready,ci/build,nvidia | by robertgshaw2-redhat (merged: 2026-01-28 23:55 (UTC+8))
#33221 [Misc] Provide a DeepSeek ReasoningParser with thinking enabled by default | ready,deepseek | by chaunceyjiang (merged: 2026-01-28 21:16 (UTC+8))
#33241 Revert “Enable Cross layers KV cache layout at NIXL Connector (#30207)” | documentation,ready,v1,kv-connector | by orozery (merged: 2026-01-28 20:36 (UTC+8))
#32683 [Quantization][Deprecation] Remove BitBlas | documentation,performance,ready | by robertgshaw2-redhat (merged: 2026-01-28 19:06 (UTC+8))
#33240 [CI] Update job dependency syntax for Intel and AMD jobs | rocm,ready,ci/build | by khluu (merged: 2026-01-28 17:33 (UTC+8))
#33237 [CI] Update job dependency for hardware and CPU jobs | ci/build | by khluu (merged: 2026-01-28 17:10 (UTC+8))
#33199 [CI] Enable mypy import following for vllm/compilation | ready | by hmellor (merged: 2026-01-28 16:59 (UTC+8))
#33208 Don’t use min_pixels/max_pixels from Qwen2VL’s processor | ready,qwen | by hmellor (merged: 2026-01-28 13:02 (UTC+8))
#33191 Add flake8-implicit-str-concat rules to Ruff | documentation,performance,frontend,ready,ci/build,v1,tool-calling,llama,deepseek,kv-connector | by hmellor (merged: 2026-01-28 12:56 (UTC+8))
#33226 [XPU]disable test_acceptance_length UT | ready,ci/build | by yma11 (merged: 2026-01-28 15:24 (UTC+8))
#33071 [Docs] Simplify CPU x86 Docker build documentation | documentation,ready,cpu | by maryamtahhan (merged: 2026-01-28 14:37 (UTC+8))
#33058 Adds FunAudioChat multimodal audio model support (#2) | documentation,new-model,ready,multi-modality | by nemoramo (merged: 2026-01-28 13:18 (UTC+8))
#32821 [Bugfix] Lazy import NgramProposer in GPU model runner | bug,ready,v1 | by 22quinn (merged: 2026-01-28 13:07 (UTC+8))
#33202 Relax protobuf library version constraints | frontend,ready,ci/build | by jeffreywang-anyscale (merged: 2026-01-28 12:15 (UTC+8))
#32891 [ROCm][CI] Add TORCH_NCCL_BLOCKING_WAIT For Distributed Tests (A100) | rocm,ready,ci/build | by micah-wil (merged: 2026-01-28 11:32 (UTC+8))

PRs Closed Without Merging

#33296 [Misc] Improve tracing test quality and fix noisy logging | documentation,frontend,ci/build,v1 | by sriumcp (closed: 2026-01-29 09:15 (UTC+8))
#33278 [BugFix] Fix IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1) for encoder models | bug,rocm,needs-rebase,v1 | by LucasWilkinson (closed: 2026-01-29 06:34 (UTC+8))
#12702 add initial Blackwell support | needs-rebase,ci/build | by kushanam (closed: 2026-01-29 06:16 (UTC+8))
#32711 [LoRA] Support add delta inplace for fused_moe_lora_kernel | needs-rebase | by xyang16 (closed: 2026-01-29 02:03 (UTC+8))
#33272 [Bugfix] Fix OpenTelemetry trace context propagation between API and engine (issue #21) | bug,documentation,frontend,ci/build,v1 | by sriumcp (closed: 2026-01-29 01:34 (UTC+8))
#32474 [Docker][Hotfix] CUDA compatibility enablement | ci/build,nvidia | by emricksini-h (closed: 2026-01-28 18:19 (UTC+8))
#33231 Fix prefer_cross_layers check | ready,kv-connector | by liranschour (closed: 2026-01-28 17:30 (UTC+8))
#19236 [Misc][Bugfix] specify docker registry to support podman | documentation,rocm,ready,needs-rebase,ci/build,stale | by p12tic (closed: 2026-01-28 16:28 (UTC+8))
#27049 [TPU] Fix tpu torch compile error | tpu,stale | by Chenyaaang (closed: 2026-01-28 15:59 (UTC+8))
#31354 [Kernel] Speed up fused MoE LoRA triton op | needs-rebase | by cwazai (closed: 2026-01-28 15:12 (UTC+8))
#31863 feat: add support for logging to file via VLLM_LOG_FILE env var | no labels | by nathon-lee (closed: 2026-01-28 14:16 (UTC+8))
#31770 [Feat] Add Flashinfer TRTLLM MOE to Modular Kernel | documentation,needs-rebase,v1,nvidia | by jiahanc (closed: 2026-01-28 13:51 (UTC+8))
#33185 [Core] Add VLLM_NIXL_DISAGGREGATION_BACKEND env var for NIXL backend selection | v1,kv-connector | by algebra-MCX (closed: 2026-01-28 11:57 (UTC+8))
#32888 [UX] Glm4MoeModelToolParser - true tool-call streaming | no labels | by bigBrain1901 (closed: 2026-01-28 11:48 (UTC+8))
#33216 [Feature] silu block quant fusion Triton kernel | performance,ci/build | by weimin023 (closed: 2026-01-28 11:24 (UTC+8))