vLLM Development Activity Report - 2026-02-04

Time window: 2026-02-04 11:27 (UTC+8) ~ 2026-02-05 11:27 (UTC+8)
Statistics: New issues 23 | Closed issues 22 | New PRs 81 | Merged PRs 40 | PRs closed without merging 19

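📊 Daily Development Status Summary

During this period (2026-02-04 to 2026-02-05) the vLLM community was very active: 81 new PRs were opened and 40 were merged. Development focused on multimodal model support (video frame sparsification, audio transcription, scoring API fixes), CI and compatibility fixes for the AMD/ROCm platform, and the compilation system and performance tuning (torch.compile support, AOT compilation rollout strategy, reverts of performance regressions). The community also held significant discussions on adapting the KV cache to future model architectures and on API design semantics.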

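🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was brisk in this period, concentrated on CI stability fixes, platform compatibility improvements, and documentation.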

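Issue #33812 ([CI Failure]: mi325_4: LM Eval Large Models (H100)): reports that the LM Eval tests fail on ROCm CI with the error message modelopt quantization is currently not supported in rocm. This shows that ROCm support for some quantization schemes (such as modelopt) is still incomplete; the gap needs to be filled, or the scheme explicitly disabled on ROCm.
Issue #33809 ([CI Failure]: Kernels MoE Test %N) and PR #33840: the issue reports a regression in the ROCm MoE kernel tests. PR #33840, from AMD contributor rasmith, followed quickly. It identifies the root cause as the VLLM_ROCM_USE_AITER=1 environment variable not being set, so the AITER operators could not be loaded correctly. The fix ensures that the tests for the ROCm-specific optimized path are valid.
PR #33811 ([Hardware][AMD] Add comments explaining gfx906 (MI50/MI60) is not supported): a contributor shared their experience of failing to build vLLM for the MI50/MI60 (gfx906) architecture and submitted this PR, which adds code comments stating that the root cause is ROCm Triton's lack of support for the Vega 20 architecture. The documentation update is valuable because it helps users avoid wasting build time on older hardware.
PR #32745 ([Hardware][AMD][CI] Refactor AMD tests to properly use BuildKite parallelism): merged. This PR refactors the AMD CI test logic to use BuildKite's native parallelism, replacing the previous complex hand-rolled emulation. Besides improving CI efficiency, it allows finer-grained GPU allocation (for example, a sharded test can occupy a 1-GPU machine instead of a whole 8-GPU machine), which makes it an important improvement to the AMD CI infrastructure.
PR #33713 ([Bugfix][ROCm] Include float8_e4m3fnuz in NCCL Dtype Dispatching): merged. Fixes the Unsupported dtype error raised on ROCm (MI300/325) when the float8_e4m3fnuz data type is used in NCCL communication. The fix is essential for expert parallelism with FP8-quantized models (such as Qwen3-30B-A3B-FP8) on AMD hardware.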
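
As a rough illustration of the sharding idea behind PR #32745 (not the PR's actual code), the sketch below splits a test list across parallel jobs using the BUILDKITE_PARALLEL_JOB and BUILDKITE_PARALLEL_JOB_COUNT environment variables that Buildkite sets for steps that declare parallelism. The function names and the round-robin strategy are assumptions made for this example.

```python
import os


def select_shard(tests, job_index, job_count):
    """Return the subset of tests that one parallel job should run."""
    if job_count < 1 or not 0 <= job_index < job_count:
        raise ValueError("invalid shard index or count")
    # Sort for a stable order, then take every job_count-th test
    # starting at this job's index (round-robin assignment).
    return sorted(tests)[job_index::job_count]


def shard_from_env(tests, env=os.environ):
    """Pick this job's shard from Buildkite's parallelism variables."""
    # Fall back to a single job when the variables are absent (local runs).
    index = int(env.get("BUILDKITE_PARALLEL_JOB", "0"))
    count = int(env.get("BUILDKITE_PARALLEL_JOB_COUNT", "1"))
    return select_shard(tests, index, count)
```

Because every shard is an independent job, each one can be scheduled on the smallest machine that fits it, which is the resource benefit the PR description points to.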

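Summary: AMD work in this period was about shoring up the foundations: fixing CI pipeline stability and platform-specific bugs (NCCL dtypes, environment variables) and adding important documentation. There were no major feature updates involving the Quark quantization tool or new MI300 features.

💬 Analysis of High-Traffic Discussions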

Issue #33789: [RFC]: How will vLLM adapt to more diverse KVCacheSpecs?
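- Core topic: As model architectures grow more complex (hybrid Mamba models, attention layers with different KV hidden sizes), the current design of a single global KV cache block pool is coming under strain. The author asks what the vLLM community plans for the future.
- Positions and points of contention:
  - Author (sheep94lion): notes that a model may contain several attention groups with different kv_hidden_size values and asks which direction a solution would take (for example multiple block pools, or padding for alignment).
  - Core maintainer (simon-mo): stated that vLLM's design principle is to be "driven by trained model architectures, not by optimizations bolted on afterwards without fine-tuning". This set the tone for the discussion.
  - Follow-up: the author asked whether, when the kv_hidden_size values differ only slightly, vLLM plans to pad FullAttention or to adopt multiple block pools.
- Current status: the discussion is open. A core maintainer has stated the design philosophy, but no concrete technical path has been settled. The topic matters for how vLLM's core memory management architecture evolves.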
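
To make the padding-versus-multiple-pools trade-off in this RFC concrete, here is a back-of-the-envelope sketch. It is not vLLM's KVCacheSpec API: LayerGroup, the block size of 16 tokens, and the 2-byte element size are assumptions chosen for illustration. It estimates how much memory is wasted when every layer's block is padded to the largest kv_hidden_size so that a single block pool can be kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerGroup:
    """Hypothetical group of attention layers sharing one KV hidden size."""
    name: str
    num_layers: int
    kv_hidden_size: int  # num_kv_heads * head_dim


def block_bytes(kv_hidden_size, block_size=16, dtype_bytes=2):
    # K and V each hold block_size tokens of kv_hidden_size elements.
    return 2 * block_size * kv_hidden_size * dtype_bytes


def padding_overhead(groups, block_size=16, dtype_bytes=2):
    """Fraction of KV memory wasted if all layers are padded to the largest block."""
    sizes = {g: block_bytes(g.kv_hidden_size, block_size, dtype_bytes) for g in groups}
    largest = max(sizes.values())
    used = sum(g.num_layers * sizes[g] for g in groups)
    padded = sum(g.num_layers for g in groups) * largest
    return (padded - used) / padded


groups = [LayerGroup("full", 8, 1024), LayerGroup("small", 24, 512)]
print(padding_overhead(groups))
```

When the sizes differ only slightly the overhead stays small, which is the case the follow-up question asks about; with large differences, separate pools avoid the waste at the cost of managing more than one allocator.
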
Issue #33802: [CI Failure]: Distributed 2xH100 tests
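- Core topic: the distributed compilation test test_async_tp_pass_correctness hangs and fails in CI, which affects the release process.
- Positions and progress:
  - Reporter (ProExpertProg): documented the failure in detail and confirmed by bisection that the regression was introduced by PR #23465 (“[Attention][FA3] Update FA3 to include new swizzle optimization”).
  - Resolution: PR #33841 was opened promptly to revert the offending commit, and PR #33854 attempted a targeted fix in parallel (unsuccessfully). This shows how seriously CI stability is taken and how quickly regressions are handled.
- Current status: the regression has been traced to its source, and the revert PR (#33841) is ready as the fix.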
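PR #33837 ([Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result) and related issue
- Core topic: the PR contends that PR #33060 introduced a regression: when a single ScoreMultiModalParam containing multiple content items (such as images) is passed to LLM.score(), only one score is returned instead of one per content item.
- Positions and points of contention:
  - Fix author (AndreasKaratzas): considers this a bug that breaks the expected batch-scoring behavior, and supplied a fix.
  - API designer (noooop): explained that the new code is meant to handle mixed image-text input correctly, and that scoring multiple content items together as one "document" is the intended behavior.
  - Core maintainer (DarkLight1337): further clarified the API semantics: one ScoreMultiModalParam represents one document and should yield a single combined score, no matter how many images or text segments it contains. This is fundamentally at odds with the fix author's reading.
- Current status: the fix PR is on hold, and the API designer is reimplementing the feature according to the clarified semantics. The episode shows that multimodal API semantics need better design alignment and documentation.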
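
The two readings of the scoring semantics can be contrasted with a small model. Everything here is hypothetical: Document stands in for one ScoreMultiModalParam, and score_fn for whatever model call produces a score. None of it is vLLM's real API; it only shows "one document, one score" and how a caller could still get per-item scores.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical stand-in for one ScoreMultiModalParam: a list of content items."""
    content: List[str]


def score_documents(query: str, docs: List[Document],
                    score_fn: Callable[[str, List[str]], float]) -> List[float]:
    # One score per document, regardless of how many items each document holds.
    return [score_fn(query, doc.content) for doc in docs]


def split_per_item(doc: Document) -> List[Document]:
    # Caller-side option when one score per item is wanted:
    # wrap every item in its own single-item document.
    return [Document(content=[item]) for item in doc.content]
```

Under the semantics the maintainers describe, a three-image document yields a list with one score; passing the result of split_per_item yields three.
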
PR #32762 ([CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode)
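- Core topic: fixes an inconsistency in the output item type returned for built-in MCP tools (such as python) in the Responses API: streaming mode returned McpCall, while non-streaming mode returned ResponseReasoningItem, which the PR initially treated as the error.
- Discussion: the PR went through several revisions. It first tried to make non-streaming mode return McpCall as well, but this turned out to conflict with existing test expectations (built-in tools should produce ResponseReasoningItem). The final version makes both modes return ResponseReasoningItem, in line with the tests and the likely product semantics.
- Current status: merged. The PR illustrates how hard it is to keep different call paths (streaming and non-streaming) consistent in a complex API implementation (such as GPT-OSS + Harmony), and how important a test suite is for defining and constraining behavior.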

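🔥 Hot Topics and Trends

Model compatibility and quantization challenges: several issues report problems running new models or particular quantization formats, such as DeepSeek V3.2 NVFP4 with the flashinfer MoE (#33859), a Triton allocator error for Qwen3-Coder-Next on a new GPU model (GB10) (#33857), and CutlassW4A8 failing on specific dimensions (K=7168, N=2112) (#33783). New models and low-bit quantization still face many low-level kernel adaptation problems in deployment.
Multimodal support is deepening: the community is not only using multimodal models but also improving their efficiency and features. For example, PR #33780 introduces video frame sparsification for Qwen VL models to support long-video inference; PR #33782 optimizes chat completion streaming performance; and several PRs add torch.compile support for models such as InternVL, Ovis2.5, and Phi3V, with the aim of raising vision encoder throughput.
Compilation system evolution and trade-offs: the rollout strategy for AOT (Ahead-Of-Time) compilation is the subject of an RFC (#33804), in which the community is weighing "aggressive rollout for cold-start gains" against "cautious rollout to avoid user confusion". Meanwhile, PR #33820 reverted the aggressive compilation optimization (#33641) that caused CI test failures and a potential performance regression, a sign that stability takes priority.
CI/CD and test stability get close attention: many issues and PRs revolve around CI failures. The community actively tracks down root causes (e.g. #33802) and improves the test suite itself (e.g. #33293, which restructures the fusion tests to balance run time against coverage). This reflects continued investment in quality assurance during rapid development.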
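
The dimension problem in #33783 is easy to check by hand. The issue title gives K=7168 and N=2112 and an alignment of 128; the helper names below are made up for this sketch, and whether padding is an acceptable fix for the kernel is not something this report states.

```python
def misaligned_dims(dims, alignment=128):
    """Return {name: remainder} for every dimension not divisible by alignment."""
    return {name: size % alignment for name, size in dims.items() if size % alignment}


def round_up(size, alignment=128):
    """Smallest multiple of alignment that is >= size."""
    return -(-size // alignment) * alignment


dims = {"K": 7168, "N": 2112}
print(misaligned_dims(dims))   # {'N': 64}
print(round_up(dims["N"]))     # 2176
```

7168 is exactly 56 x 128, so of the two dimensions only N = 2112 (16 x 128 + 64) actually violates the 128 alignment.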

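🛠️ Key Technical Changes

PR #33293: [CI][torch.compile] Reduce e2e fusion test time: merged. An important refactoring of the test infrastructure. It splits the overly slow end-to-end fusion tests into a "quick sweep" (all models, a single configuration) and a "deep sweep" (a single model, all configurations), and adds utility functions to ease future extension. This cuts CI resource use and run time substantially while preserving coverage.
PR #33686: feat: Add ColBERT late interaction model support: merged. Adds support for ColBERT retrieval models and implements their "late interaction" mechanism. This answers a long-standing community request (#13827) and extends vLLM's use in retrieval-augmented generation (RAG), moving it from a pure generation engine toward broader semantic understanding and retrieval serving.
PR #33820: Revert “[torch.compile] Significantly speed up cold start times”: merged. Reverts PR #33641, which aimed to speed up cold starts by changing the compile cache logic but caused distributed compilation test failures and a potential performance regression. The revert reflects the project's cautious policy of putting system stability and passing tests ahead of performance optimizations.
PR #33192: [Bugfix] Disable TRTLLM attention when KV transfer is enabled: merged. Resolves a crash on Blackwell GPUs when using P/D disaggregation (NixlConnector), caused by the TRTLLM attention kernel requiring a contiguous KV cache. The fix automatically falls back to the more compatible native FlashInfer kernel when a KV transfer configuration is detected. This keeps advanced deployment modes usable on new hardware.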
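
For readers unfamiliar with the "late interaction" mechanism mentioned under PR #33686, the sketch below shows the standard ColBERT MaxSim scoring rule in plain Python: each query token embedding is matched to its most similar document token embedding, and the maxima are summed. This is a generic illustration with made-up helper names, not the code from the PR.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(vec):
    # Scale to unit length so the dot product equals cosine similarity.
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best cosine similarity to any document token."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


query = [[1.0, 0.0], [0.0, 1.0]]   # two query token embeddings
doc = [[1.0, 0.0], [0.0, 2.0]]     # two document token embeddings
print(maxsim_score(query, doc))
```

Because the query and the document are encoded independently and only compared at scoring time, document token embeddings can be computed ahead of time, which is what makes the approach attractive for retrieval.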

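📈 Development Activity Observations

High development throughput: 81 new PRs and 40 merges in a single day show a very active contributor community.
Continued investment in the AMD and CPU backends: AMD contributors are actively fixing platform-specific issues, and the CPU backend is getting improvements such as parallelized CI and optimized attention dispatch, which shows vLLM's multi-hardware support continuing to deepen.
Multimodal and retrieval models are a new focus: PRs around video, audio, and ColBERT show the community rapidly extending vLLM beyond traditional text generation.
Core architecture discussion under way: the RFC on future KV cache adaptation drew a core maintainer's participation, suggesting that the project may soon face an important decision on the evolution of its low-level architecture.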

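💡 Issues Worth Watching

KV cache architecture choice (Issue #33789): the community needs to settle soon on a roadmap for supporting hybrid-attention models (Mamba plus multiple attention types), since this affects the deployment of many complex models on vLLM.
AOT compilation rollout strategy (RFC #33804): whether to enable AOT compilation by default in the next release requires a clear decision that balances development efficiency, user experience, and potential risk.
Clarifying multimodal scoring API semantics (PR #33837): the design intent of APIs such as ScoreMultiModalParam needs to be stated officially and documented so that developers do not misread it.
Long-tail adaptation for new hardware and quantization formats: problems such as the Triton issue on GB10 (sm121) GPUs and the flashinfer MoE error with DeepSeek V3.2 NVFP4 reflect the ongoing challenge of supporting cutting-edge hardware and model formats.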

📋 Appendix: Detailed Data Lists

New Issues

#33813 [Bug]: llm.score() fails on batched multimodal input for qwen3-vl-reranker | bug | by JiahuiChen-GitHub (created: 2026-02-05 02:49 (UTC+8))
#33859 [Bug]: DeepSeek V3.2-NVFP4 with flashinfer moe reports q must have dtype torch::kBFloat16 | bug | by kebe7jun (created: 2026-02-05 10:59 (UTC+8))
#33789 [RFC]: How will vLLM adapt to more diverse KVCacheSpecs? | RFC | by sheep94lion (created: 2026-02-04 20:25 (UTC+8))
#33831 [Bug]: Deepseek V3.2 Benchmark failure “TypeError: argument ‘tokens’: ‘NoneType’ object” | bug | by wzhao18 (created: 2026-02-05 05:38 (UTC+8))
#33802 [CI Failure]: Distributed 2xH100 tests | torch.compile,ci-failure,needs reproduction | by ProExpertProg (created: 2026-02-05 00:23 (UTC+8))
#33857 [Bug]: Qwen3-Coder-Next fails with Triton allocator error on DGX Spark cluster (GB10, sm121) | bug | by eugr (created: 2026-02-05 10:15 (UTC+8))
#33809 [CI Failure]: Kernels MoE Test %N | ci-failure | by AndreasKaratzas (created: 2026-02-05 01:31 (UTC+8))
#33833 [Bug][Docker]: Issues with 0.15.0 and newer docker image when running Qwen3-Next with VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=1 | bug | by jasonlizhengjian (created: 2026-02-05 05:48 (UTC+8))
#33783 [Bug]: CutlassW4A8LinearKernel fails on DeepSeekV3.1 W4AF8 due to dimension alignment (K=7168, N=2112 not divisible by 128) | bug | by zkf331 (created: 2026-02-04 18:02 (UTC+8))
#33828 [Bug]: mistral3 offline multimodal inference example failing with prompt placeholder error | bug | by skavulya (created: 2026-02-05 05:18 (UTC+8))
#33823 [Bug]: Step3p5ForCausalLM fails with pipeline parallelism | bug | by gregporter (created: 2026-02-05 04:03 (UTC+8))
#33816 [CI Failure]: Quantized Models Test in tests/models/quantization/test_gptq_marlin.py::test_models[5-32-bfloat16-model2] | ci-failure | by mgoin (created: 2026-02-05 02:58 (UTC+8))
#33772 [Usage]: about the Chinese documents | usage | by muxuezzz (created: 2026-02-04 15:21 (UTC+8))
#33784 [Bug]: [Docker.cpu build] incomplete type ‘qk_vec_type’ {aka ‘void’} used in nested name specifier | bug | by BartekKruczek (created: 2026-02-04 18:12 (UTC+8))
#33812 [CI Failure]: mi325_4: LM Eval Large Models (H100) | ci-failure | by AndreasKaratzas (created: 2026-02-05 01:50 (UTC+8))
#33792 [Bug]: Logic for selection of routing_method_type in FusedTopKRouter | bug | by dbari (created: 2026-02-04 21:16 (UTC+8))
#33805 [RFC]: Expose RequestOutput hook for programmatic use of Serving layer | RFC | by alecsolder (created: 2026-02-05 00:46 (UTC+8))
#33804 [RFC]: [compile] Rollout strategy for AOT Compilation. | RFC | by zhxchen17 (created: 2026-02-05 00:34 (UTC+8))
#33791 [Bug]: When loading LoRA via load_lora_adapter, the inference becomes very slow with high CPU usage | bug | by elepherai (created: 2026-02-04 21:14 (UTC+8))
#33797 [Performance][CPU Backend]: 40% Performance drop observed on AWQ models from version 0.12.0 vs version 0.11.2 | performance,cpu | by jpiaseck (created: 2026-02-04 22:54 (UTC+8))
#33770 [Bug]: Loading Deepseek models is very slow | bug | by koush (created: 2026-02-04 14:16 (UTC+8))
#33776 [New Model]: Add support for telechat3 model | no labels | by 1096125073 (created: 2026-02-04 15:59 (UTC+8))
#33768 [Usage]: How to set the language in Qwen3-Asr | usage | by xyqsgdog-ctrl (created: 2026-02-04 14:05 (UTC+8))

Closed Issues

#18902 [Bug]: The frequency penalty does not work when spec decoding is enabled in V1, with no warning or error | bug,stale | by southfreebird (closed: 2026-02-05 10:18 (UTC+8))
#22569 [Feature]: subfolder parameter for EngineArgs | feature request,stale | by mynameismon (closed: 2026-02-05 10:17 (UTC+8))
#23900 [Feature]: Kubernetes 1.34 support (Dynamic Resource Allocation DRA) | feature request,stale | by reneleonhardt (closed: 2026-02-05 10:17 (UTC+8))
#24971 [Bug]: Donut model inference, CUDA out of memory | bug,stale | by mfournioux (closed: 2026-02-05 10:17 (UTC+8))
#25486 [Feature]: Support BF16/FP16 FlashInfer CUTLASS MoE for SM90/SM100 | feature request,stale | by mgoin (closed: 2026-02-05 10:16 (UTC+8))
#26182 [Bug]: Quantization using swift for Ovis2.5 9B | bug,stale | by Dineshkumar-Anandan-ZS0367 (closed: 2026-02-05 10:16 (UTC+8))
#26271 [Bug]: Reproducibility of Gemma-3-270m-it output on A10G gpu | bug,stale | by sujit0892 (closed: 2026-02-05 10:16 (UTC+8))
#26273 [Usage]: How to validate that the OffloadingConnector is working or not | usage,stale | by NIJ-117 (closed: 2026-02-05 10:16 (UTC+8))
#26303 [Feature]: Mxfp4MoEMethod support on ROCm | feature request,rocm,stale | by fxmarty-amd (closed: 2026-02-05 10:16 (UTC+8))
#26305 [Bug]: EngineArgs.add_cli_args(parser) fails in python 3.12 | bug,stale | by yonikremer (closed: 2026-02-05 10:16 (UTC+8))
#26318 [Bug]: Pipeline parallelism hangs on multi-node when using Slurm on a slingshot network | bug,stale | by prajwal1210 (closed: 2026-02-05 10:16 (UTC+8))
#33627 [Bug]: DeepSeek R1 with CUTLASS MLA Broken on B200 | bug | by robertgshaw2-redhat (closed: 2026-02-05 09:28 (UTC+8))
#33143 [Bug]: Triton MLIR Error when attempting to run gpt-oss-20b | bug,rocm | by Calandracas606 (closed: 2026-02-05 08:58 (UTC+8))
#13827 [Feature]: Support for ColBERT (Late-Interaction Retrieval) in vLLM | feature request,stale | by FernandoDorado (closed: 2026-02-05 08:05 (UTC+8))
#33663 [Bug]: RotaryEmbedding CustomOp does not work with gpt-oss | bug,rocm | by Rohan138 (closed: 2026-02-05 04:17 (UTC+8))
#25066 [Feature]: Streaming multi-modal input/output | feature request,multi-modality | by DarkLight1337 (closed: 2026-02-05 02:16 (UTC+8))
#25064 [Feature]: Realtime ASR | feature request,multi-modality | by gaardhus (closed: 2026-02-05 02:12 (UTC+8))
#23681 [Doc]: clarify support for cpu-based image | documentation | by ktdreyer (closed: 2026-02-04 21:47 (UTC+8))
#22783 [Bug]: Can’t run Qwen3-32B NVFP4 model | bug,stale | by flaviusburca (closed: 2026-02-04 19:46 (UTC+8))
#32269 [Feature]: Benchmark torch._scaled_mm performance with and without padding | help wanted,feature request | by vllmellm (closed: 2026-02-04 17:29 (UTC+8))
#33245 [Installation]: RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): | installation | by beebol (closed: 2026-02-04 16:54 (UTC+8))
#33289 [Feature]: [Metrics] Labeled prompt token metrics for P/D disaggregation (Follow-up on PR #27569) | feature request,kv-connector | by ZhanqiuHu (closed: 2026-02-04 15:46 (UTC+8))

New PRs

#33840 [CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly | bug,rocm,ready | by rasmith (created: 2026-02-05 06:40 (UTC+8))
#33825 [vLLM IR] 1/N Implement IR skeleton and rms_norm op | torch.compile,vllm-ir | by ProExpertProg (created: 2026-02-05 04:08 (UTC+8))
#33780 add video [Model] Video Frame Sparsification via Pixel-Level Similarity for Efficient Multimodal Long-Video Inference for Qwen2.5_vl and Qwen3_vlsparse | multi-modality,qwen | by rayleeku (created: 2026-02-04 17:10 (UTC+8))
#33832 [Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue | bug,deepseek | by wzhao18 (created: 2026-02-05 05:40 (UTC+8))
#33858 [Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. | bug,deepseek | by pavanimajety (created: 2026-02-05 10:58 (UTC+8))
#33837 [Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result | bug,frontend | by AndreasKaratzas (created: 2026-02-05 06:22 (UTC+8))
#33855 [Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used | structured-output,v1,deepseek | by njhill (created: 2026-02-05 10:05 (UTC+8))
#33856 [Minor] Include StreamingInput in inputs package | ready,v1 | by njhill (created: 2026-02-05 10:13 (UTC+8))
#33841 Revert “[Attention][FA3] Update FA3 to include new swizzle optimization” | ready,ci/build,v1 | by ProExpertProg (created: 2026-02-05 06:43 (UTC+8))
#33778 [WIP][CI/Build] Parallelize CPU CI tests | ready,needs-rebase,ci/build,v1,cpu | by bigPYJ1151 (created: 2026-02-04 16:40 (UTC+8))
#33849 [release] Minor fixes to release annotation | ready,ci/build | by khluu (created: 2026-02-05 08:47 (UTC+8))
#33782 [Perf] Optimize chat completion streaming performance | frontend,ready | by chaunceyjiang (created: 2026-02-04 17:38 (UTC+8))
#33851 models: lazy-load OpenCV in Nemotron Parse to avoid import-time dependency on X11 libs | no labels | by Gregory-Pereira (created: 2026-02-05 09:08 (UTC+8))
#33854 [BugFix] Potential bug fix for test_async_tp_pass_correctness | bug,ready,v1 | by LucasWilkinson (created: 2026-02-05 09:23 (UTC+8))
#33771 [Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate | v1,nvidia | by aabbccddwasd (created: 2026-02-04 14:25 (UTC+8))
#33798 Add Kimi-Audio-7B ASR support for /v1/audio/transcriptions | documentation,new-model,frontend,multi-modality | by tunglinwood (created: 2026-02-04 23:53 (UTC+8))
#33818 [3/N] chatCompletions uses Parser | frontend | by qandrew (created: 2026-02-05 03:37 (UTC+8))
#33850 [Misc] Fix typo in error message: suport -> support | no labels | by schnell3526 (created: 2026-02-05 09:02 (UTC+8))
#33852 fix: correct typo in error message for Swiglu limit assertion | no labels | by schnell3526 (created: 2026-02-05 09:14 (UTC+8))
#33853 fix: correct typo in error message for Swiglu limit assertion | no labels | by schnell3526 (created: 2026-02-05 09:17 (UTC+8))
#33843 [Refactor] Define MoEActivation enum | rocm,cpu,gpt-oss,nvidia | by mgoin (created: 2026-02-05 07:38 (UTC+8))
#33807 [UX] Add --moe-backend arg for explicit kernel selection | no labels | by mgoin (created: 2026-02-05 01:07 (UTC+8))
#33848 [Bug Fix] Fix naive_block_assignment always defaulting to False due to arg misalignment | bug | by RunkaiTao (created: 2026-02-05 08:39 (UTC+8))
#33836 Fix RoutingMethodType.from_topk softmax+renormalize mapping#33792 | nvidia | by baonudesifeizhai (created: 2026-02-05 06:16 (UTC+8))
#33847 [Bugfix] Lazy tokenizer init to prevent semaphore leak in multiprocess mode | bug,structured-output,v1 | by kitaekatt (created: 2026-02-05 08:21 (UTC+8))
#33846 [Bugfix] Reduce Triton TILE_SIZE on Blackwell for large head_size with float32 | bug,v1 | by kitaekatt (created: 2026-02-05 08:19 (UTC+8))
#33821 [Bugfix] Fix _fused_moe_lora_expand signature mismatch | bug | by xyang16 (created: 2026-02-05 03:43 (UTC+8))
#33761 fix(rocm): use correct kv_cache_layout in extend_for_sliding_window | rocm,v1 | by gigamonkeyx (created: 2026-02-04 12:31 (UTC+8))
#33845 [Core] Expose detailed scheduler stats | v1 | by yeqcharlotte (created: 2026-02-05 07:54 (UTC+8))
#33844 [DRAFT] pd disagg config for kimi-k2-thinking or deepseek-r1 | documentation,deepseek,kv-connector | by kjiang249 (created: 2026-02-05 07:53 (UTC+8))
#33842 aria vlm compile | no labels | by v0i0 (created: 2026-02-05 06:47 (UTC+8))
#33839 [Kernel][perf] optimize NCCL symm_mem vs custom_AR selection thresholds | no labels | by pkousha (created: 2026-02-05 06:39 (UTC+8))
#33838 [Model] Add torch.compile support for Ovis2.5 vision encoder | no labels | by tianrengao (created: 2026-02-05 06:38 (UTC+8))
#33834 enabling torch.compile on phi3v | no labels | by yushangdi (created: 2026-02-05 05:58 (UTC+8))
#33824 fix: use ‘vllm_’ instead of ‘vllm:’ as prefix for prom metrics name | documentation,performance,frontend,speculative-decoding,v1,kv-connector | by alxyok (created: 2026-02-05 04:05 (UTC+8))
#33835 Enable torch.compile for Aria model | no labels | by karthickai (created: 2026-02-05 06:11 (UTC+8))
#33763 Add vllm_enable_compile_cache config flag with backward compatibility | documentation,frontend,needs-rebase,llama | by elizabetht (created: 2026-02-04 12:50 (UTC+8))
#33820 Revert “[torch.compile] Significantly speed up cold start times” | ready | by zou3519 (created: 2026-02-05 03:43 (UTC+8))
#33830 Add shutdown timeout CLI and changed shutdown function to send SIGTERM to worker processes before SIGKILL | frontend,v1 | by itrowbri (created: 2026-02-05 05:24 (UTC+8))
#33829 Add shutdown timeout CLI and changed shutdown function to send SIGTEM | frontend,v1 | by itrowbri (created: 2026-02-05 05:23 (UTC+8))
#33795 [Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path | bug,llama,kv-connector | by zackyoray (created: 2026-02-04 21:57 (UTC+8))
#33827 Enable torch.compile for OpenGVLab/InternVL3-2B | documentation | by tianrengao (created: 2026-02-05 04:43 (UTC+8))
#33826 ovis 2.5 - vlm compilation fixes | no labels | by v0i0 (created: 2026-02-05 04:26 (UTC+8))
#33800 [Bugfix] Support RotaryEmbedding CustomOp for gpt-oss | bug,ready,gpt-oss | by simondanielsson (created: 2026-02-05 00:03 (UTC+8))
#33822 try to enable torch.compile for qwen_vl | qwen | by yushangdi (created: 2026-02-05 03:56 (UTC+8))
#33814 [wip] explore using layerwise reloading utils for fp8 online quant | no labels | by vkuzo (created: 2026-02-05 02:50 (UTC+8))
#33819 [DevEnv][NixOS] Add NixOS development environment with CUDA support | documentation,nvidia | by l4b4r4b4b4 (created: 2026-02-05 03:42 (UTC+8))
#33766 [chore] make register_module lazy by default | documentation,tool-calling | by qandrew (created: 2026-02-04 13:36 (UTC+8))
#33817 [Bugfix] Make MM batching more robust | bug,multi-modality | by DarkLight1337 (created: 2026-02-05 03:20 (UTC+8))
#33815 [LoRA] Add TMA support to quantized fused_moe_lora kernel | no labels | by xyang16 (created: 2026-02-05 02:53 (UTC+8))
#33810 [Misc] Fix up attention benchmarks | performance,ci/build | by LucasWilkinson (created: 2026-02-05 01:42 (UTC+8))
#33799 [Docs] Improve documentation | documentation,frontend,ready | by SorenDreano (created: 2026-02-05 00:01 (UTC+8))
#33796 [Refactor] Move task outside of PoolingParams.verify | frontend,ready,v1 | by DarkLight1337 (created: 2026-02-04 22:36 (UTC+8))
#33793 [Bugfix] Fix interns1-pro initialization and PP | bug,ready,multi-modality,qwen | by Isotr0py (created: 2026-02-04 21:35 (UTC+8))
#33811 [Hardware][AMD] Add comments explaining gfx906 (MI50/MI60) is not supported | rocm,ci/build | by randomizedcoder (created: 2026-02-05 01:43 (UTC+8))
#33794 [Bugfix] Fix normalize still being passed to PoolerConfig | bug,frontend,ready | by DarkLight1337 (created: 2026-02-04 21:41 (UTC+8))
#33808 [VoxtralRealtime] Force eager execution by default | no labels | by NickLucche (created: 2026-02-05 01:09 (UTC+8))
#33806 [Test] Add env var to disable reduced precision reduction for PyTorch… | multi-modality | by tianrengao (created: 2026-02-05 01:06 (UTC+8))
#33767 feat: Use an executor to parallelize load of deepseek v2 based models (kimi, etc) | deepseek | by koush (created: 2026-02-04 13:56 (UTC+8))
#33803 [Voxstral Realtime] Enable tests | multi-modality | by patrickvonplaten (created: 2026-02-05 00:29 (UTC+8))
#33801 [Misc] Delay deprecation of CommonAttentionMetadata properties | ready,v1 | by LucasWilkinson (created: 2026-02-05 00:09 (UTC+8))
#33786 Add support for ModelOpt MXFP8 models | documentation,nvidia | by danisereb (created: 2026-02-04 18:36 (UTC+8))
#33773 [ROCm][FEAT] Integrate aiter gemm w8a8 ptpc | rocm | by vllmellm (created: 2026-02-04 15:24 (UTC+8))
#33775 [KVConnector] Fix data race when we have both local and external cache hit | v1,kv-connector | by heheda12345 (created: 2026-02-04 15:44 (UTC+8))
#33758 Apply #33621 to main | rocm,ready,ci/build | by DarkLight1337 (created: 2026-02-04 12:01 (UTC+8))
#33785 [Model] Apply #32631 for recent models | speculative-decoding,ready,qwen | by DarkLight1337 (created: 2026-02-04 18:27 (UTC+8))
#33764 Fix empty content when max_tokens truncates reasoning end token | no labels | by elizabetht (created: 2026-02-04 13:00 (UTC+8))
#33790 [WIP][Kernel]Add Helion kernel for dynamic_per_token_scaled_fp8_quant | no labels | by xiaohongchen1991 (created: 2026-02-04 20:54 (UTC+8))
#33788 [CPU] Add BF16 Kernel type for s390x | cpu | by R3hankhan123 (created: 2026-02-04 20:24 (UTC+8))
#33779 feat(benchmark/kernels): add scaled_fp8_quant benchmark with multiple quantization modes | performance | by junxiangxiaoxiang (created: 2026-02-04 16:42 (UTC+8))
#33774 add support for telechat3 model | new-model | by 1096125073 (created: 2026-02-04 15:42 (UTC+8))
#33787 [Fix]: Prepack weights for w8a8 oneDNN matmul | cpu | by nikhil-arm (created: 2026-02-04 19:13 (UTC+8))
#33781 feat(cpu): add CPU support for draft model speculative decoding with parallel drafting | documentation,speculative-decoding,needs-rebase,v1,llama,cpu,nvidia | by ganeshr10 (created: 2026-02-04 17:23 (UTC+8))
#33777 [Bugfix]Fix gdn_attn in CUDA graph padding | bug,v1,nvidia | by QilaiZhang (created: 2026-02-04 16:12 (UTC+8))
#33769 [XPU] remove common path warning log | ready | by jikunshang (created: 2026-02-04 14:14 (UTC+8))
#33756 [BugFix] Set opencv-python-headless <= 4.12.0.88 for FIPS Compliance | bug,ci/build | by jayteaftw (created: 2026-02-04 11:42 (UTC+8))
#33757 [Deprecation] Remove _get_data_parser in MM processor | ready,multi-modality | by DarkLight1337 (created: 2026-02-04 11:48 (UTC+8))
#33760 add errorType for better type handling | frontend | by lanking520 (created: 2026-02-04 12:16 (UTC+8))
#33762 Add padding support to wvSplitK solution for skinny GEMMs | rocm | by amd-hhashemi (created: 2026-02-04 12:33 (UTC+8))
#33765 [BugFix] Fixed the argument name for the number of redundant experts in eep example | bug,documentation | by jianzs (created: 2026-02-04 13:15 (UTC+8))
#33759 fix(rocm): Use correct kv_cache_layout for sliding window with shuffle KV cache | rocm,v1 | by gigamonkeyx (created: 2026-02-04 12:12 (UTC+8))

Merged PRs

#33293 [CI][torch.compile] Reduce e2e fusion test time | ready,ci/build | by ProExpertProg (merged: 2026-02-05 08:09 (UTC+8))
#32762 [CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode | bug,rocm,frontend,ready,gpt-oss | by AndreasKaratzas (merged: 2026-02-05 11:14 (UTC+8))
#33849 [release] Minor fixes to release annotation | ready,ci/build | by khluu (merged: 2026-02-05 10:07 (UTC+8))
#33782 [Perf] Optimize chat completion streaming performance | frontend,ready | by chaunceyjiang (merged: 2026-02-04 20:30 (UTC+8))
#33637 [Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 | bug,ready,v1,deepseek,nvidia | by chaunceyjiang (merged: 2026-02-05 09:28 (UTC+8))
#33192 [Bugfix] Disable TRTLLM attention when KV transfer is enabled | bug,ready,v1,nvidia | by ZhanqiuHu (merged: 2026-02-05 08:49 (UTC+8))
#33652 [Core] Don’t schedule spec tokens with prefill chunks | ready,v1 | by njhill (merged: 2026-02-05 07:40 (UTC+8))
#33686 feat: Add ColBERT late interaction model support | documentation,new-model,frontend,ready | by ieBoytsov (merged: 2026-02-05 08:05 (UTC+8))
#33573 Change the type signature of MixtureOfExperts.expert_weights to MutableSequence[Sequence[Tensor]] | ready | by SageMoore (merged: 2026-02-05 06:02 (UTC+8))
#33820 Revert “[torch.compile] Significantly speed up cold start times” | ready | by zou3519 (merged: 2026-02-05 06:00 (UTC+8))
#29828 [Model] Add transcription support for Qwen3-Omni | documentation,frontend,ready,qwen | by mu-hashmi (merged: 2026-02-05 05:17 (UTC+8))
#33800 [Bugfix] Support RotaryEmbedding CustomOp for gpt-oss | bug,ready,gpt-oss | by simondanielsson (merged: 2026-02-05 04:17 (UTC+8))
#33732 Implement zero-copy GQA for multimodal and CPU | ready,v1,cpu | by voidbag (merged: 2026-02-05 04:11 (UTC+8))
#33308 [rocm][ray] Fix: Unify Ray device visibility handling across CUDA and ROCm | rocm,ready,ci/build,v1,nvidia | by kouroshHakha (merged: 2026-02-05 02:09 (UTC+8))
#33793 [Bugfix] Fix interns1-pro initialization and PP | bug,ready,multi-modality,qwen | by Isotr0py (merged: 2026-02-05 01:54 (UTC+8))
#33794 [Bugfix] Fix normalize still being passed to PoolerConfig | bug,frontend,ready | by DarkLight1337 (merged: 2026-02-04 23:56 (UTC+8))
#33801 [Misc] Delay deprecation of CommonAttentionMetadata properties | ready,v1 | by LucasWilkinson (merged: 2026-02-05 00:41 (UTC+8))
#33694 [Bugfix] Fix ubatch wrapper num_tokens calculate | bug,ready,v1 | by jiangkuaixue123 (merged: 2026-02-05 00:41 (UTC+8))
#32745 [Hardware][AMD][CI] Refactor AMD tests to properly use BuildKite parallelism | rocm,ready,ci/build | by mawong-amd (merged: 2026-02-04 14:51 (UTC+8))
#33612 [Perf] Optimize spec decoding + async scheduling, 1.5% Throughput improvement | ready,v1 | by yewentao256 (merged: 2026-02-04 22:34 (UTC+8))
#33713 [Bugfix][ROCm] Include float8_e4m3fnuz in NCCL Dtype Dispatching | bug,rocm,ready | by micah-wil (merged: 2026-02-04 21:36 (UTC+8))
#33758 Apply #33621 to main | rocm,ready,ci/build | by DarkLight1337 (merged: 2026-02-04 21:35 (UTC+8))
#33785 [Model] Apply #32631 for recent models | speculative-decoding,ready,qwen | by DarkLight1337 (merged: 2026-02-04 20:23 (UTC+8))
#33605 [Bugfix][Model] Fix audio-in-video support for Qwen2.5-Omni and Qwen3-Omni | bug,ready,qwen | by linyueqian (merged: 2026-02-04 20:15 (UTC+8))
#33291 [PERF] Change GDN Attention State Layout from [N, HV, K, V] to [N, HV, V, K] | ready | by vadiklyutiy (merged: 2026-02-04 19:20 (UTC+8))
#32255 [BugFix] scheduler: Delay freeing blocks of aborted async loads | bug,ready,v1,kv-connector | by orozery (merged: 2026-02-04 19:16 (UTC+8))
#33712 [compile] Remove runner type from ignored caching factor list. | ready | by zhxchen17 (merged: 2026-02-04 18:56 (UTC+8))
#33578 [compile] Clean up AOT compile bypass on evaluate_guards. | ready | by zhxchen17 (merged: 2026-02-04 18:12 (UTC+8))
#33659 [XPU][2/N] add support unquantized moe support for xpu | ready,ci/build | by jikunshang (merged: 2026-02-04 18:12 (UTC+8))
#33548 use ORJSONResponse when available to improve the efficiency of request process | frontend,ready | by staugust (merged: 2026-02-04 18:04 (UTC+8))
#33769 [XPU] remove common path warning log | ready | by jikunshang (merged: 2026-02-04 16:40 (UTC+8))
#33290 [Metrics] Add labeled prompt token metrics for P/D disaggregation | ready,v1,kv-connector | by ZhanqiuHu (merged: 2026-02-04 15:46 (UTC+8))
#33722 [Deprecation] Deprecate profiling envs | documentation,ready | by yewentao256 (merged: 2026-02-04 13:58 (UTC+8))
#33757 [Deprecation] Remove _get_data_parser in MM processor | ready,multi-modality | by DarkLight1337 (merged: 2026-02-04 13:51 (UTC+8))
#33688 [Feature] Enable TRITON_ATTN for Batch Invariance | documentation,ready,v1 | by frankwang28 (merged: 2026-02-04 13:27 (UTC+8))
#33718 [Refactor] Remove unused dead code | ready | by yewentao256 (merged: 2026-02-04 13:25 (UTC+8))
#33737 [Bugfix] Define router_logits_dtype for remaining MoE models | bug,ready | by mgoin (merged: 2026-02-04 13:24 (UTC+8))
#33629 Save startup benchmark results as a list of values | performance,ready | by huydhn (merged: 2026-02-04 12:37 (UTC+8))
#32161 [CPU] Split attention dispatch by head_dim alignment | ready,ci/build,cpu | by R3hankhan123 (merged: 2026-02-04 11:37 (UTC+8))
#33750 [MM] Align the prefix of MMEncoderAttention with Attention | ready,llama,qwen | by shen-shanshan (merged: 2026-02-04 12:07 (UTC+8))

PRs Closed Without Merging

#28066 [Chore] Separate out attention backend constants from vllm.utils | rocm,ready,stale | by hezyin (closed: 2026-02-05 10:41 (UTC+8))
#25385 [Bugfix] Add Flash Attention guards in MLACommonImpl constructor | stale,v1 | by kzawora-intel (closed: 2026-02-05 10:17 (UTC+8))
#26294 [Bugfix] Fix ovis2.5 pre-quant fp8 checkpoint loading | stale | by Isotr0py (closed: 2026-02-05 10:16 (UTC+8))
#33563 generate invoking doesn’t require detokenization for beam search | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality | by gameofdimension (closed: 2026-02-05 09:59 (UTC+8))
#28631 [Frontend][Renderer] Refactor score API | documentation,frontend,needs-rebase,multi-modality | by noooop (closed: 2026-02-05 10:01 (UTC+8))
#33851 models: lazy-load OpenCV in Nemotron Parse to avoid import-time dependency on X11 libs | no labels | by Gregory-Pereira (closed: 2026-02-05 09:49 (UTC+8))
#33850 [Misc] Fix typo in error message: suport -> support | no labels | by schnell3526 (closed: 2026-02-05 09:03 (UTC+8))
#33852 fix: correct typo in error message for Swiglu limit assertion | no labels | by schnell3526 (closed: 2026-02-05 09:14 (UTC+8))
#30409 [BugFix] Lazy tokenizer init in StructuredOutputManager to prevent GGUF semaphore leak | bug,structured-output,needs-rebase,v1 | by kitaekatt (closed: 2026-02-05 08:21 (UTC+8))
#33829 Add shutdown timeout CLI and changed shutdown function to send SIGTEM | frontend,v1 | by itrowbri (closed: 2026-02-05 05:23 (UTC+8))
#32381 [responsesAPI] allow reasoning parser to output multiple reasoning items | frontend,qwen,deepseek | by qandrew (closed: 2026-02-05 02:56 (UTC+8))
#32729 [responsesAPI] support interleaved reasoning via arbitrary type of message outputs | documentation,frontend,meta-exported,fb-exported | by qandrew (closed: 2026-02-05 02:56 (UTC+8))
#32304 [Frontend][Feature][Draft]integrate openai realtime api | documentation,frontend,needs-rebase | by unlikezy (closed: 2026-02-05 02:17 (UTC+8))
#33808 [VoxtralRealtime] Force eager execution by default | no labels | by NickLucche (closed: 2026-02-05 01:39 (UTC+8))
#27569 [Bugfix][P/D] Fix throughput stats in disaggregated setup | bug,ready,needs-rebase,v1,kv-connector | by NickLucche (closed: 2026-02-04 19:12 (UTC+8))
#33781 feat(cpu): add CPU support for draft model speculative decoding with parallel drafting | documentation,speculative-decoding,needs-rebase,v1,llama,cpu,nvidia | by ganeshr10 (closed: 2026-02-04 17:26 (UTC+8))
#32845 [Model] Video Frame Sparsification via Pixel-Level Similarity for Efficient Multimodal Long-Video Inference for Qwen2.5_vl and Qwen3_vl | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality | by rayleeku (closed: 2026-02-04 16:48 (UTC+8))
#33248 Kimi K2.5 model generates “(no content)” placeholder in tool call responses | needs-rebase | by shivamashtikar (closed: 2026-02-04 14:53 (UTC+8))
#33759 fix(rocm): Use correct kv_cache_layout for sliding window with shuffle KV cache | rocm,v1 | by gigamonkeyx (closed: 2026-02-04 12:32 (UTC+8))