vLLM Development Activity Report - 2026-02-04
Time window: 2026-02-04 11:27 (UTC+8) ~ 2026-02-05 11:27 (UTC+8). Statistics: 23 new issues | 22 closed issues | 81 new PRs | 40 merged PRs | 19 PRs closed without merging
📊 Daily Development Status Summary
During this cycle (2026-02-04 to 02-05) the vLLM community was highly active, with 81 new PRs and 40 merges, demonstrating strong development throughput. Work focused on multimodal model support and optimization (video frame sparsification, audio transcription, scoring API fixes), AMD/ROCm CI and platform compatibility fixes, and the compilation system and performance tuning (torch.compile support, AOT compilation rollout strategy, reverts of performance regressions). The community also held substantive discussions on adapting the KV cache to future architectures and on API design semantics.
🎯 AMD/ROCm Ecosystem Updates
AMD ecosystem activity this cycle centered on CI stability fixes, platform compatibility improvements, and documentation.
- Issue #33812 ([CI Failure]: mi325_4: LM Eval Large Models (H100)): reports LM Eval test failures on ROCm CI with the error `modelopt quantization is currently not supported in rocm`. This shows that ROCm support for certain quantization schemes (such as modelopt) is still incomplete; they need to be implemented or explicitly disabled.
- Issue #33809 ([CI Failure]: Kernels MoE Test %N) and PR #33840: report a regression in the ROCm MoE kernel tests. PR #33840, a quick fix from AMD contributor rasmith, identified the root cause: the `VLLM_ROCM_USE_AITER=1` environment variable was not set, so the AITER kernels were never loaded. The fix ensures the ROCm-specific optimization path is actually exercised by the tests.
- Issue #33811 ([Hardware][AMD] Add comments explaining gfx906 (MI50/MI60) is not supported): a contributor shared their experience of failing to build vLLM for the MI50/MI60 (gfx906) architecture and submitted a PR adding code comments that pinpoint the root cause: ROCm Triton does not support the Vega 20 architecture. This documentation update is valuable, saving community users from wasting build time on legacy hardware.
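The failure mode behind #33840 is easy to state in code: a fast path gated on an environment variable silently falls back when the variable is unset, so tests pass without testing the optimized path. A minimal sketch (the helper name is illustrative, not vLLM's actual code):

```python
import os

def aiter_enabled() -> bool:
    # Hypothetical guard mirroring the pattern fixed in PR #33840: the AITER
    # fast path is only taken when the env var is explicitly "1"; when unset,
    # the code quietly uses the generic path instead.
    return os.environ.get("VLLM_ROCM_USE_AITER", "0") == "1"

# The fix amounts to exporting the variable in the CI test setup:
os.environ["VLLM_ROCM_USE_AITER"] = "1"
assert aiter_enabled()
```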
- PR #32745 ([Hardware][AMD][CI] Refactor AMD tests to properly use BuildKite parallelism): merged. This PR refactors the AMD CI test logic to use BuildKite's native parallelism instead of the previous complex hand-rolled emulation. Beyond faster CI runs, it enables finer-grained GPU allocation (for example, a test shard can occupy a 1-GPU machine rather than an entire 8-GPU machine), a significant improvement to the AMD CI infrastructure.
- PR #33713 ([Bugfix][ROCm] Include float8_e4m3fnuz in NCCL Dtype Dispatching): merged. Fixes the `Unsupported dtype` error raised when the `float8_e4m3fnuz` data type is used for NCCL communication on ROCm platforms (MI300/MI325). This fix is essential for expert parallelism with FP8-quantized models (such as Qwen3-30B-A3B-FP8) on AMD hardware.
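A fix like PR #33713 typically amounts to adding one entry to a dtype dispatch table. A hypothetical sketch of the pattern (names and mappings are illustrative, not vLLM's actual implementation; FP8 payloads are commonly shipped over NCCL as raw bytes):

```python
# Illustrative dtype -> NCCL type table, keyed by dtype name.
NCCL_DTYPE_MAP = {
    "float16": "ncclFloat16",
    "bfloat16": "ncclBfloat16",
    "float8_e4m3fn": "ncclUint8",   # FP8 transferred as raw bytes
}

def to_nccl_dtype(dtype: str) -> str:
    # Anything missing from the table surfaces as the "Unsupported dtype"
    # class of error described in the bug report.
    try:
        return NCCL_DTYPE_MAP[dtype]
    except KeyError:
        raise ValueError(f"Unsupported dtype: {dtype}") from None

# The ROCm fix: also dispatch the fnuz variant used on MI300-class GPUs.
NCCL_DTYPE_MAP["float8_e4m3fnuz"] = "ncclUint8"
```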
Summary: AMD ecosystem work this cycle was about shoring up foundations: fixing CI pipeline stability and platform-specific bugs (NCCL dtypes, environment variables) and filling documentation gaps. There were no major feature updates involving the Quark quantization toolkit or new MI300 capabilities.
💬 High-Engagement Discussions
- Issue #33789: [RFC]: How will vLLM adapt to more diverse KVCacheSpecs?
  - Core question: as model architectures grow more complex (hybrid Mamba, attention layers with differing KV hidden sizes), the current single global KV cache block pool design is under strain. The author asks how the vLLM community plans to address this.
  - Positions and debate:
    - The author (sheep94lion) lists models that may contain multiple attention groups with different `kv_hidden_size` values and asks about possible directions (multiple block pools, padding for alignment).
    - Core maintainer simon-mo articulated vLLM's design principle: support should be driven by trained model architectures, not by untrained after-the-fact optimizations. This set the tone for the discussion.
    - Follow-up: the author asked whether, when `kv_hidden_size` values differ only slightly, vLLM plans to pad FullAttention layers or to adopt multiple block pools.
  - Status: discussion open. The core maintainers have stated the design philosophy, but the concrete technical path is undecided. This is a consequential question for the future of vLLM's core memory management architecture.
- Issue #33802: [CI Failure]: Distributed 2xH100 tests
  - Core question: the distributed compilation test `test_async_tp_pass_correctness` hangs and fails in CI, blocking the release process.
  - Findings and progress:
    - The reporter (ProExpertProg) documented the failure in detail and bisected it to a regression introduced by PR #23465 ("[Attention][FA3] Update FA3 to include new swizzle optimization").
    - Resolution: PR #33841 was quickly opened to revert the offending commit, while PR #33854 attempted a targeted fix (unsuccessful). This reflects the project's strong emphasis on CI stability and rapid response.
  - Status: root cause identified; the revert PR (#33841) is ready as the resolution.
- PR #33837 ([Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result) and related issues
  - Core question: PR #33060 introduced a regression: passing `LLM.score()` a single `ScoreMultiModalParam` containing multiple content items (such as images) returns only one score instead of one score per content item.
  - Positions and debate:
    - The fixer (AndreasKaratzas) considers this a bug that breaks expected batch-scoring behavior and submitted a fix.
    - The API designer (noooop) explained that the new code is intended to handle mixed image-text input correctly: scoring multiple content items as a single "document" is the intended behavior.
    - Core maintainer DarkLight1337 further clarified the API semantics: one `ScoreMultiModalParam` represents one document and should return exactly one aggregate score, no matter how many images or text segments it contains. This is fundamentally at odds with the fixer's interpretation.
  - Status: the fix PR is on hold while the API designer reimplements the feature according to the clarified semantics. The episode shows that multimodal API semantics need better design alignment and documentation.
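The disputed semantics can be stated in one line: under the clarified design, the number of scores tracks the number of `ScoreMultiModalParam` documents, not the number of content items inside them. A schematic sketch (plain dicts stand in for the real API types; this is not vLLM code):

```python
def expected_score_count(documents: list) -> int:
    # Clarified semantics (per the maintainer): one ScoreMultiModalParam is
    # one document and yields one aggregate score, regardless of how many
    # content items (images, text segments) it bundles.
    return len(documents)

docs = [
    {"content": ["image_1", "image_2", "image_3"]},  # one doc, three items
    {"content": ["some text"]},                      # one doc, one item
]
assert expected_score_count(docs) == 2  # two documents -> two scores, not four
```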
- PR #32762 ([CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode)
  - Core question: in the Responses API, built-in MCP tools (such as `python`) return inconsistent output item types across modes: streaming returns `McpCall`, while non-streaming incorrectly returns `ResponseReasoningItem`.
  - Discussion: the PR went through several revisions. It initially made non-streaming mode also return `McpCall`, but this conflicted with existing test expectations (built-in tools should produce `ResponseReasoningItem`). The final version unifies both modes to return `ResponseReasoningItem`, consistent with the test definitions and the likely product semantics.
  - Status: merged. This PR illustrates the challenge of keeping different call paths (streaming vs. non-streaming) consistent in a complex API surface (GPT-OSS + Harmony), and the value of a test suite as the definition and constraint of behavior.
🔥 Hot Topics and Trends
- Model compatibility and quantization challenges: multiple issues report failures with new models or particular quantization formats, such as DeepSeek V3.2 NVFP4 with flashinfer MoE (#33859), Qwen3-Coder-Next hitting a Triton allocator error on the new GB10 GPU (#33857), and CutlassW4A8 rejecting certain dimensions (#33783). New models and low-bit quantization still face many low-level kernel adaptation hurdles.
- Multimodal support deepening: the community is optimizing multimodal efficiency and features, not just running the models. PR #33780 adds video frame sparsification to Qwen VL models for long-video inference; PR #33782 optimizes chat completion streaming performance; several PRs add `torch.compile` support to InternVL, Ovis2.5, Phi3V and others, significantly improving vision encoder throughput.
- Compilation system evolution and trade-offs: the rollout strategy for AOT (ahead-of-time) compilation became an RFC topic (#33804), with the community weighing "aggressive rollout for cold-start wins" against "cautious rollout to avoid user disruption". Meanwhile, PR #33820 reverted an aggressive compilation optimization (#33641) that caused CI failures and a potential performance regression, prioritizing stability.
- CI/CD and test stability taken seriously: a large share of issues and PRs concern CI failures. The community actively root-causes them (e.g. #33802) and improves the test suites themselves (e.g. #33293 restructures the fusion tests to balance runtime against coverage). This reflects sustained investment in quality assurance amid rapid development.
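The CutlassW4A8 failure in #33783 is an instance of a common kernel constraint: GEMM dimensions must be multiples of the kernel's tile size, and fixed-tile kernels often reject non-conforming shapes rather than padding them internally. A minimal check (the tile size of 128 comes from the issue title; the helper is illustrative, not vLLM code):

```python
def dims_supported(k: int, n: int, tile: int = 128) -> bool:
    # A fixed-tile GEMM kernel typically requires each problem dimension to
    # divide evenly into tiles; otherwise it refuses the shape outright.
    return k % tile == 0 and n % tile == 0

assert dims_supported(7168, 2048)      # 7168 = 56*128, 2048 = 16*128
assert not dims_supported(7168, 2112)  # 2112 % 128 == 64 -> rejected
```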
🛠️ Key Technical Changes
- PR #33293: [CI][torch.compile] Reduce e2e fusion test time: merged. A significant test-infrastructure refactor. It splits the long-running end-to-end fusion tests into a "quick sweep" (all models, one configuration) and a "deep scan" (one model, all configurations), and adds utility functions to ease future extension. This preserves coverage on both axes while sharply cutting CI time and resource usage.
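The saving from a split like PR #33293's comes straight from the combinatorics: instead of the full model-by-config cross product, CI runs one sweep across models plus one scan across configs. A sketch with made-up model and config names:

```python
MODELS = ["model_a", "model_b", "model_c"]
CONFIGS = ["fusion_off", "fusion_on", "fusion_on_aot", "fusion_on_split"]

# Old scheme: every model under every configuration.
full_matrix = [(m, c) for m in MODELS for c in CONFIGS]   # 12 runs

# New scheme: "quick sweep" (all models, one config) plus
# "deep scan" (one model, all configs) still covers both axes.
quick_sweep = [(m, CONFIGS[0]) for m in MODELS]           # 3 runs
deep_scan = [(MODELS[0], c) for c in CONFIGS]             # 4 runs

assert len(quick_sweep) + len(deep_scan) < len(full_matrix)
```

The gap widens multiplicatively as models and configurations are added, which is why the refactor also ships helpers for extending each list independently.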
- PR #33686: feat: Add ColBERT late interaction model support: merged. Adds support for the ColBERT retrieval model and its "late interaction" mechanism. This fulfills a long-standing community request (#13827) and extends vLLM into retrieval-augmented generation (RAG) scenarios, moving the project from a pure generation engine toward broader semantic understanding and retrieval serving.
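ColBERT's "late interaction" keeps per-token embeddings for both query and document and only combines them at scoring time via MaxSim: each query token contributes its best match against all document tokens, and the contributions are summed. A minimal reference implementation of that scoring rule (vLLM's actual batched kernel will of course differ):

```python
def maxsim_score(query_vecs: list, doc_vecs: list) -> float:
    # Late interaction (ColBERT MaxSim): for each query token embedding,
    # take the maximum dot product over all document token embeddings,
    # then sum the per-query-token maxima.
    return sum(
        max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
        for q in query_vecs
    )

# Two query tokens, each best-matching a different document token:
# max(0.9, 0.2) + max(0.1, 0.8) -> approximately 1.7
score = maxsim_score([[1.0, 0.0], [0.0, 1.0]],
                     [[0.9, 0.1], [0.2, 0.8]])
```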
- PR #33820: Revert "[torch.compile] Significantly speed up cold start times": merged. Reverts PR #33641, which sped up cold starts by changing the compile-cache logic but caused distributed compilation test failures and a potential performance regression. The revert reflects the project's principle of putting system stability and passing tests ahead of performance optimizations.
- PR #33192: [Bugfix] Disable TRTLLM attention when KV transfer is enabled: merged. Fixes a crash on Blackwell GPUs when using P/D disaggregation (NixlConnector), caused by the TRTLLM attention kernel's requirement for a contiguous KV cache. The fix automatically falls back to the more compatible FlashInfer native kernel whenever a KV transfer configuration is detected, keeping advanced deployment modes usable on new hardware.
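A fix of this shape is a guard in backend selection: prefer the fast kernel, but degrade gracefully when a configuration violates its assumptions. A schematic sketch (function and backend names are illustrative, not vLLM's actual selection logic):

```python
def select_attention_backend(on_blackwell: bool,
                             kv_transfer_enabled: bool) -> str:
    # The TRTLLM attention kernel assumes a contiguous KV cache; P/D
    # disaggregation (e.g. via NixlConnector) breaks that assumption, so
    # fall back to the FlashInfer native kernel instead of crashing.
    if on_blackwell and not kv_transfer_enabled:
        return "trtllm"
    return "flashinfer"

assert select_attention_backend(True, True) == "flashinfer"
assert select_attention_backend(True, False) == "trtllm"
```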
📈 Development Activity Observations
- High development throughput: 81 new PRs and 40 merges in a single day show a very active contributor base.
- Sustained investment in AMD and CPU backends: AMD contributors are actively fixing platform-specific issues, and the CPU backend gained parallelized CI and improved attention dispatch, deepening vLLM's multi-hardware support.
- Multimodal and retrieval models are the new hotspots: PRs around video, audio, and ColBERT show the community rapidly expanding vLLM beyond traditional text generation.
- Core architecture discussion in progress: the RFC on future KV cache adaptation drew core maintainer participation, suggesting a major low-level architecture decision may be approaching.
💡 Issues Worth Watching
- KV cache architecture decision (Issue #33789): the community needs to settle a roadmap for hybrid-attention models (Mamba + multiple attention variants) soon, as it will affect how many complex models can be deployed on vLLM.
- AOT compilation rollout strategy (RFC #33804): whether to enable AOT compilation by default in the next release requires an explicit decision balancing development efficiency, user experience, and risk.
- Multimodal scoring API semantics (PR #33837): the design intent of APIs such as `ScoreMultiModalParam` needs official clarification and updated documentation to prevent developer misunderstanding.
- "Long-tail" adaptation of new hardware and quantization formats: the Triton issue on GB10 (sm121) GPUs and the flashinfer MoE error with DeepSeek V3.2 NVFP4 illustrate the ongoing challenge of supporting cutting-edge hardware and model formats.
📋 Appendix: Detailed Data
New Issues
- #33813 [Bug]: llm.score() fails on batched multimodal input for qwen3-vl-reranker — bug — by JiahuiChen-GitHub (created: 2026-02-05 02:49 (UTC+8))
- #33859 [Bug]: DeepSeek V3.2-NVFP4 with flashinfer moe reports `q must have dtype torch::kBFloat16` — bug — by kebe7jun (created: 2026-02-05 10:59 (UTC+8))
- #33789 [RFC]: How will vLLM adapt to more diverse KVCacheSpecs? — RFC — by sheep94lion (created: 2026-02-04 20:25 (UTC+8))
- #33831 [Bug]: Deepseek V3.2 Benchmark failure "TypeError: argument 'tokens': 'NoneType' object" — bug — by wzhao18 (created: 2026-02-05 05:38 (UTC+8))
- #33802 [CI Failure]: Distributed 2xH100 tests — torch.compile,ci-failure,needs reproduction — by ProExpertProg (created: 2026-02-05 00:23 (UTC+8))
- #33857 [Bug]: Qwen3-Coder-Next fails with Triton allocator error on DGX Spark cluster (GB10, sm121) — bug — by eugr (created: 2026-02-05 10:15 (UTC+8))
- #33809 [CI Failure]: Kernels MoE Test %N — ci-failure — by AndreasKaratzas (created: 2026-02-05 01:31 (UTC+8))
- #33833 [Bug][Docker]: Issues with 0.15.0 and newer docker image when running Qwen3-Next with VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=1 — bug — by jasonlizhengjian (created: 2026-02-05 05:48 (UTC+8))
- #33783 [Bug]: CutlassW4A8LinearKernel fails on DeepSeekV3.1 W4AF8 due to dimension alignment (K=7168, N=2112 not divisible by 128) — bug — by zkf331 (created: 2026-02-04 18:02 (UTC+8))
- #33828 [Bug]: mistral3 offline multimodal inference example failing with prompt placeholder error — bug — by skavulya (created: 2026-02-05 05:18 (UTC+8))
- #33823 [Bug]: Step3p5ForCausalLM fails with pipeline parallelism — bug — by gregporter (created: 2026-02-05 04:03 (UTC+8))
- #33816 [CI Failure]: Quantized Models Test in `tests/models/quantization/test_gptq_marlin.py::test_models[5-32-bfloat16-model2]` — ci-failure — by mgoin (created: 2026-02-05 02:58 (UTC+8))
- #33772 [Usage]: about the Chinese documents — usage — by muxuezzz (created: 2026-02-04 15:21 (UTC+8))
- #33784 [Bug]: [Docker.cpu build] incomplete type 'qk_vec_type' {aka 'void'} used in nested name specifier — bug — by BartekKruczek (created: 2026-02-04 18:12 (UTC+8))
- #33812 [CI Failure]: mi325_4: LM Eval Large Models (H100) — ci-failure — by AndreasKaratzas (created: 2026-02-05 01:50 (UTC+8))
- #33792 [Bug]: Logic for selection of routing_method_type in FusedTopKRouter — bug — by dbari (created: 2026-02-04 21:16 (UTC+8))
- #33805 [RFC]: Expose RequestOutput hook for programmatic use of Serving layer — RFC — by alecsolder (created: 2026-02-05 00:46 (UTC+8))
- #33804 [RFC]: [compile] Rollout strategy for AOT Compilation. — RFC — by zhxchen17 (created: 2026-02-05 00:34 (UTC+8))
- #33791 [Bug]: When loading LoRA via load_lora_adapter, the inference becomes very slow with high CPU usage — bug — by elepherai (created: 2026-02-04 21:14 (UTC+8))
- #33797 [Performance][CPU Backend]: 40% Performance drop observed on AWQ models from version 0.12.0 vs version 0.11.2 — performance,cpu — by jpiaseck (created: 2026-02-04 22:54 (UTC+8))
- #33770 [Bug]: Loading Deepseek models is very slow — bug — by koush (created: 2026-02-04 14:16 (UTC+8))
- #33776 [New Model]: Add support for telechat3 model — no labels — by 1096125073 (created: 2026-02-04 15:59 (UTC+8))
- #33768 [Usage]: How to set the language in Qwen3-Asr — usage — by xyqsgdog-ctrl (created: 2026-02-04 14:05 (UTC+8))
Closed Issues
- #18902 [Bug]: The frequency penalty does not work when spec decoding is enabled in V1, with no warning or error — bug,stale — by southfreebird (closed: 2026-02-05 10:18 (UTC+8))
- #22569 [Feature]: subfolder parameter for EngineArgs — feature request,stale — by mynameismon (closed: 2026-02-05 10:17 (UTC+8))
- #23900 [Feature]: Kubernetes 1.34 support (Dynamic Resource Allocation DRA) — feature request,stale — by reneleonhardt (closed: 2026-02-05 10:17 (UTC+8))
- #24971 [Bug]: Donut model inference, CUDA out of memory — bug,stale — by mfournioux (closed: 2026-02-05 10:17 (UTC+8))
- #25486 [Feature]: Support BF16/FP16 FlashInfer CUTLASS MoE for SM90/SM100 — feature request,stale — by mgoin (closed: 2026-02-05 10:16 (UTC+8))
- #26182 [Bug]: Quantization using swift for Ovis2.5 9B — bug,stale — by Dineshkumar-Anandan-ZS0367 (closed: 2026-02-05 10:16 (UTC+8))
- #26271 [Bug]: Reproducibility of Gemma-3-270m-it output on A10G gpu — bug,stale — by sujit0892 (closed: 2026-02-05 10:16 (UTC+8))
- #26273 [Usage]: How to validate that the OffloadingConnector is working or not — usage,stale — by NIJ-117 (closed: 2026-02-05 10:16 (UTC+8))
- #26303 [Feature]: `Mxfp4MoEMethod` support on ROCm — feature request,rocm,stale — by fxmarty-amd (closed: 2026-02-05 10:16 (UTC+8))
- #26305 [Bug]: EngineArgs.add_cli_args(parser) fails in python 3.12 — bug,stale — by yonikremer (closed: 2026-02-05 10:16 (UTC+8))
- #26318 [Bug]: Pipeline parallelism hangs on multi-node when using Slurm on a slingshot network — bug,stale — by prajwal1210 (closed: 2026-02-05 10:16 (UTC+8))
- #33627 [Bug]: DeepSeek R1 with CUTLASS MLA Broken on B200 — bug — by robertgshaw2-redhat (closed: 2026-02-05 09:28 (UTC+8))
- #33143 [Bug]: Triton MLIR Error when attempting to run gpt-oss-20b — bug,rocm — by Calandracas606 (closed: 2026-02-05 08:58 (UTC+8))
- #13827 [Feature]: Support for ColBERT (Late-Interaction Retrieval) in vLLM — feature request,stale — by FernandoDorado (closed: 2026-02-05 08:05 (UTC+8))
- #33663 [Bug]: `RotaryEmbedding` CustomOp does not work with gpt-oss — bug,rocm — by Rohan138 (closed: 2026-02-05 04:17 (UTC+8))
- #25066 [Feature]: Streaming multi-modal input/output — feature request,multi-modality — by DarkLight1337 (closed: 2026-02-05 02:16 (UTC+8))
- #25064 [Feature]: Realtime ASR — feature request,multi-modality — by gaardhus (closed: 2026-02-05 02:12 (UTC+8))
- #23681 [Doc]: clarify support for cpu-based image — documentation — by ktdreyer (closed: 2026-02-04 21:47 (UTC+8))
- #22783 [Bug]: Can't run Qwen3-32B NVFP4 model — bug,stale — by flaviusburca (closed: 2026-02-04 19:46 (UTC+8))
- #32269 [Feature]: Benchmark torch._scaled_mm performance with and without padding — help wanted,feature request — by vllmellm (closed: 2026-02-04 17:29 (UTC+8))
- #33245 [Installation]: RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): — installation — by beebol (closed: 2026-02-04 16:54 (UTC+8))
- #33289 [Feature]: [Metrics] Labeled prompt token metrics for P/D disaggregation (Follow-up on PR #27569) — feature request,kv-connector — by ZhanqiuHu (closed: 2026-02-04 15:46 (UTC+8))
New PRs
- #33840 [CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly — bug,rocm,ready — by rasmith (created: 2026-02-05 06:40 (UTC+8))
- #33825 [vLLM IR] 1/N Implement IR skeleton and rms_norm op — torch.compile,vllm-ir — by ProExpertProg (created: 2026-02-05 04:08 (UTC+8))
- #33780 add video [Model] Video Frame Sparsification via Pixel-Level Similarity for Efficient Multimodal Long-Video Inference for Qwen2.5_vl and Qwen3_vlsparse — multi-modality,qwen — by rayleeku (created: 2026-02-04 17:10 (UTC+8))
- #33832 [Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue — bug,deepseek — by wzhao18 (created: 2026-02-05 05:40 (UTC+8))
- #33858 [Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. — bug,deepseek — by pavanimajety (created: 2026-02-05 10:58 (UTC+8))
- #33837 [Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result — bug,frontend — by AndreasKaratzas (created: 2026-02-05 06:22 (UTC+8))
- #33855 [Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used — structured-output,v1,deepseek — by njhill (created: 2026-02-05 10:05 (UTC+8))
- #33856 [Minor] Include `StreamingInput` in inputs package — ready,v1 — by njhill (created: 2026-02-05 10:13 (UTC+8))
- #33841 Revert "[Attention][FA3] Update FA3 to include new swizzle optimization" — ready,ci/build,v1 — by ProExpertProg (created: 2026-02-05 06:43 (UTC+8))
- #33778 [WIP][CI/Build] Parallelize CPU CI tests — ready,needs-rebase,ci/build,v1,cpu — by bigPYJ1151 (created: 2026-02-04 16:40 (UTC+8))
- #33849 [release] Minor fixes to release annotation — ready,ci/build — by khluu (created: 2026-02-05 08:47 (UTC+8))
- #33782 [Perf] Optimize chat completion streaming performance — frontend,ready — by chaunceyjiang (created: 2026-02-04 17:38 (UTC+8))
- #33851 models: lazy-load OpenCV in Nemotron Parse to avoid import-time dependency on X11 libs — no labels — by Gregory-Pereira (created: 2026-02-05 09:08 (UTC+8))
- #33854 [BugFix] Potential bug fix for `test_async_tp_pass_correctness` — bug,ready,v1 — by LucasWilkinson (created: 2026-02-05 09:23 (UTC+8))
- #33771 [Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate — v1,nvidia — by aabbccddwasd (created: 2026-02-04 14:25 (UTC+8))
- #33798 Add Kimi-Audio-7B ASR support for /v1/audio/transcriptions — documentation,new-model,frontend,multi-modality — by tunglinwood (created: 2026-02-04 23:53 (UTC+8))
- #33818 [3/N] chatCompletions uses Parser — frontend — by qandrew (created: 2026-02-05 03:37 (UTC+8))
- #33850 [Misc] Fix typo in error message: suport -> support — no labels — by schnell3526 (created: 2026-02-05 09:02 (UTC+8))
- #33852 fix: correct typo in error message for Swiglu limit assertion — no labels — by schnell3526 (created: 2026-02-05 09:14 (UTC+8))
- #33853 fix: correct typo in error message for Swiglu limit assertion — no labels — by schnell3526 (created: 2026-02-05 09:17 (UTC+8))
- #33843 [Refactor] Define MoEActivation enum — rocm,cpu,gpt-oss,nvidia — by mgoin (created: 2026-02-05 07:38 (UTC+8))
- #33807 [UX] Add `--moe-backend` arg for explicit kernel selection — no labels — by mgoin (created: 2026-02-05 01:07 (UTC+8))
- #33848 [Bug Fix] Fix `naive_block_assignment` always defaulting to False due to arg misalignment — bug — by RunkaiTao (created: 2026-02-05 08:39 (UTC+8))
- #33836 Fix RoutingMethodType.from_topk softmax+renormalize mapping #33792 — nvidia — by baonudesifeizhai (created: 2026-02-05 06:16 (UTC+8))
- #33847 [Bugfix] Lazy tokenizer init to prevent semaphore leak in multiprocess mode — bug,structured-output,v1 — by kitaekatt (created: 2026-02-05 08:21 (UTC+8))
- #33846 [Bugfix] Reduce Triton TILE_SIZE on Blackwell for large head_size with float32 — bug,v1 — by kitaekatt (created: 2026-02-05 08:19 (UTC+8))
- #33821 [Bugfix] Fix _fused_moe_lora_expand signature mismatch — bug — by xyang16 (created: 2026-02-05 03:43 (UTC+8))
- #33761 fix(rocm): use correct kv_cache_layout in extend_for_sliding_window — rocm,v1 — by gigamonkeyx (created: 2026-02-04 12:31 (UTC+8))
- #33845 [Core] Expose detailed scheduler stats — v1 — by yeqcharlotte (created: 2026-02-05 07:54 (UTC+8))
- #33844 [DRAFT] pd disagg config for kimi-k2-thinking or deepseek-r1 — documentation,deepseek,kv-connector — by kjiang249 (created: 2026-02-05 07:53 (UTC+8))
- #33842 aria vlm compile — no labels — by v0i0 (created: 2026-02-05 06:47 (UTC+8))
- #33839 [Kernel][perf] optimize NCCL symm_mem vs custom_AR selection thresholds — no labels — by pkousha (created: 2026-02-05 06:39 (UTC+8))
- #33838 [Model] Add torch.compile support for Ovis2.5 vision encoder — no labels — by tianrengao (created: 2026-02-05 06:38 (UTC+8))
- #33834 enabling torch.compile on phi3v — no labels — by yushangdi (created: 2026-02-05 05:58 (UTC+8))
- #33824 fix: use 'vllm_' instead of 'vllm:' as prefix for prom metrics name — documentation,performance,frontend,speculative-decoding,v1,kv-connector — by alxyok (created: 2026-02-05 04:05 (UTC+8))
- #33835 Enable torch.compile for Aria model — no labels — by karthickai (created: 2026-02-05 06:11 (UTC+8))
- #33763 Add vllm_enable_compile_cache config flag with backward compatibility — documentation,frontend,needs-rebase,llama — by elizabetht (created: 2026-02-04 12:50 (UTC+8))
- #33820 Revert "[torch.compile] Significantly speed up cold start times" — ready — by zou3519 (created: 2026-02-05 03:43 (UTC+8))
- #33830 Add shutdown timeout CLI and changed shutdown function to send SIGTERM to worker processes before SIGKILL — frontend,v1 — by itrowbri (created: 2026-02-05 05:24 (UTC+8))
- #33829 Add shutdown timeout CLI and changed shutdown function to send SIGTEM — frontend,v1 — by itrowbri (created: 2026-02-05 05:23 (UTC+8))
- #33795 [Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path — bug,llama,kv-connector — by zackyoray (created: 2026-02-04 21:57 (UTC+8))
- #33827 Enable torch.compile for OpenGVLab/InternVL3-2B — documentation — by tianrengao (created: 2026-02-05 04:43 (UTC+8))
- #33826 ovis 2.5 - vlm compilation fixes — no labels — by v0i0 (created: 2026-02-05 04:26 (UTC+8))
- #33800 [Bugfix] Support `RotaryEmbedding` CustomOp for gpt-oss — bug,ready,gpt-oss — by simondanielsson (created: 2026-02-05 00:03 (UTC+8))
- #33822 try to enable torch.compile for qwen_vl — qwen — by yushangdi (created: 2026-02-05 03:56 (UTC+8))
- #33814 [wip] explore using layerwise reloading utils for fp8 online quant — no labels — by vkuzo (created: 2026-02-05 02:50 (UTC+8))
- #33819 [DevEnv][NixOS] Add NixOS development environment with CUDA support — documentation,nvidia — by l4b4r4b4b4 (created: 2026-02-05 03:42 (UTC+8))
- #33766 [chore] make register_module lazy by default — documentation,tool-calling — by qandrew (created: 2026-02-04 13:36 (UTC+8))
- #33817 [Bugfix] Make MM batching more robust — bug,multi-modality — by DarkLight1337 (created: 2026-02-05 03:20 (UTC+8))
- #33815 [LoRA] Add TMA support to quantized fused_moe_lora kernel — no labels — by xyang16 (created: 2026-02-05 02:53 (UTC+8))
- #33810 [Misc] Fix up attention benchmarks — performance,ci/build — by LucasWilkinson (created: 2026-02-05 01:42 (UTC+8))
- #33799 [Docs] Improve documentation — documentation,frontend,ready — by SorenDreano (created: 2026-02-05 00:01 (UTC+8))
- #33796 [Refactor] Move `task` outside of `PoolingParams.verify` — frontend,ready,v1 — by DarkLight1337 (created: 2026-02-04 22:36 (UTC+8))
- #33793 [Bugfix] Fix interns1-pro initialization and PP — bug,ready,multi-modality,qwen — by Isotr0py (created: 2026-02-04 21:35 (UTC+8))
- #33811 [Hardware][AMD] Add comments explaining gfx906 (MI50/MI60) is not supported — rocm,ci/build — by randomizedcoder (created: 2026-02-05 01:43 (UTC+8))
- #33794 [Bugfix] Fix `normalize` still being passed to `PoolerConfig` — bug,frontend,ready — by DarkLight1337 (created: 2026-02-04 21:41 (UTC+8))
- #33808 [VoxtralRealtime] Force eager execution by default — no labels — by NickLucche (created: 2026-02-05 01:09 (UTC+8))
- #33806 [Test] Add env var to disable reduced precision reduction for PyTorch… — multi-modality — by tianrengao (created: 2026-02-05 01:06 (UTC+8))
- #33767 feat: Use an executor to parallelize load of deepseek v2 based models (kimi, etc) — deepseek — by koush (created: 2026-02-04 13:56 (UTC+8))
- #33803 [Voxstral Realtime] Enable tests — multi-modality — by patrickvonplaten (created: 2026-02-05 00:29 (UTC+8))
- #33801 [Misc] Delay deprecation of CommonAttentionMetadata properties — ready,v1 — by LucasWilkinson (created: 2026-02-05 00:09 (UTC+8))
- #33786 Add support for ModelOpt MXFP8 models — documentation,nvidia — by danisereb (created: 2026-02-04 18:36 (UTC+8))
- #33773 [ROCm][FEAT] Integrate aiter gemm w8a8 ptpc — rocm — by vllmellm (created: 2026-02-04 15:24 (UTC+8))
- #33775 [KVConnector] Fix data race when we have both local and external cache hit — v1,kv-connector — by heheda12345 (created: 2026-02-04 15:44 (UTC+8))
- #33758 Apply #33621 to main — rocm,ready,ci/build — by DarkLight1337 (created: 2026-02-04 12:01 (UTC+8))
- #33785 [Model] Apply #32631 for recent models — speculative-decoding,ready,qwen — by DarkLight1337 (created: 2026-02-04 18:27 (UTC+8))
- #33764 Fix empty content when max_tokens truncates reasoning end token — no labels — by elizabetht (created: 2026-02-04 13:00 (UTC+8))
- #33790 [WIP][Kernel]Add Helion kernel for dynamic_per_token_scaled_fp8_quant — no labels — by xiaohongchen1991 (created: 2026-02-04 20:54 (UTC+8))
- #33788 [CPU] Add BF16 Kernel type for s390x — cpu — by R3hankhan123 (created: 2026-02-04 20:24 (UTC+8))
- #33779 feat(benchmark/kernels): add scaled_fp8_quant benchmark with multiple quantization modes — performance — by junxiangxiaoxiang (created: 2026-02-04 16:42 (UTC+8))
- #33774 add support for telechat3 model — new-model — by 1096125073 (created: 2026-02-04 15:42 (UTC+8))
- #33787 [Fix]: Prepack weights for w8a8 oneDNN matmul — cpu — by nikhil-arm (created: 2026-02-04 19:13 (UTC+8))
- #33781 feat(cpu): add CPU support for draft model speculative decoding with parallel drafting — documentation,speculative-decoding,needs-rebase,v1,llama,cpu,nvidia — by ganeshr10 (created: 2026-02-04 17:23 (UTC+8))
- #33777 [Bugfix]Fix gdn_attn in CUDA graph padding — bug,v1,nvidia — by QilaiZhang (created: 2026-02-04 16:12 (UTC+8))
- #33769 [XPU] remove common path warning log — ready — by jikunshang (created: 2026-02-04 14:14 (UTC+8))
- #33756 [BugFix] Set opencv-python-headless <= 4.12.0.88 for FIPS Compliance — bug,ci/build — by jayteaftw (created: 2026-02-04 11:42 (UTC+8))
- #33757 [Deprecation] Remove `_get_data_parser` in MM processor — ready,multi-modality — by DarkLight1337 (created: 2026-02-04 11:48 (UTC+8))
- #33760 add errorType for better type handling — frontend — by lanking520 (created: 2026-02-04 12:16 (UTC+8))
- #33762 Add padding support to wvSplitK solution for skinny GEMMs — rocm — by amd-hhashemi (created: 2026-02-04 12:33 (UTC+8))
- #33765 [BugFix] Fixed the argument name for the number of redundant experts in eep example — bug,documentation — by jianzs (created: 2026-02-04 13:15 (UTC+8))
- #33759 fix(rocm): Use correct kv_cache_layout for sliding window with shuffle KV cache — rocm,v1 — by gigamonkeyx (created: 2026-02-04 12:12 (UTC+8))
Merged PRs
- #33293 [CI][torch.compile] Reduce e2e fusion test time — ready,ci/build — by ProExpertProg (merged: 2026-02-05 08:09 (UTC+8))
- #32762 [CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode — bug,rocm,frontend,ready,gpt-oss — by AndreasKaratzas (merged: 2026-02-05 11:14 (UTC+8))
- #33849 [release] Minor fixes to release annotation — ready,ci/build — by khluu (merged: 2026-02-05 10:07 (UTC+8))
- #33782 [Perf] Optimize chat completion streaming performance — frontend,ready — by chaunceyjiang (merged: 2026-02-04 20:30 (UTC+8))
- #33637 [Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 — bug,ready,v1,deepseek,nvidia — by chaunceyjiang (merged: 2026-02-05 09:28 (UTC+8))
- #33192 [Bugfix] Disable TRTLLM attention when KV transfer is enabled — bug,ready,v1,nvidia — by ZhanqiuHu (merged: 2026-02-05 08:49 (UTC+8))
- #33652 [Core] Don't schedule spec tokens with prefill chunks — ready,v1 — by njhill (merged: 2026-02-05 07:40 (UTC+8))
- #33686 feat: Add ColBERT late interaction model support — documentation,new-model,frontend,ready — by ieBoytsov (merged: 2026-02-05 08:05 (UTC+8))
- #33573 Change the type signature of MixtureOfExperts.expert_weights to MutableSequence[Sequence[Tensor]] — ready — by SageMoore (merged: 2026-02-05 06:02 (UTC+8))
- #33820 Revert "[torch.compile] Significantly speed up cold start times" — ready — by zou3519 (merged: 2026-02-05 06:00 (UTC+8))
- #29828 [Model] Add transcription support for Qwen3-Omni — documentation,frontend,ready,qwen — by mu-hashmi (merged: 2026-02-05 05:17 (UTC+8))
- #33800 [Bugfix] Support `RotaryEmbedding` CustomOp for gpt-oss — bug,ready,gpt-oss — by simondanielsson (merged: 2026-02-05 04:17 (UTC+8))
- #33732 Implement zero-copy GQA for multimodal and CPU — ready,v1,cpu — by voidbag (merged: 2026-02-05 04:11 (UTC+8))
- #33308 [rocm][ray] Fix: Unify Ray device visibility handling across CUDA and ROCm — rocm,ready,ci/build,v1,nvidia — by kouroshHakha (merged: 2026-02-05 02:09 (UTC+8))
- #33793 [Bugfix] Fix interns1-pro initialization and PP — bug,ready,multi-modality,qwen — by Isotr0py (merged: 2026-02-05 01:54 (UTC+8))
- #33794 [Bugfix] Fix `normalize` still being passed to `PoolerConfig` — bug,frontend,ready — by DarkLight1337 (merged: 2026-02-04 23:56 (UTC+8))
- #33801 [Misc] Delay deprecation of CommonAttentionMetadata properties — ready,v1 — by LucasWilkinson (merged: 2026-02-05 00:41 (UTC+8))
- #33694 [Bugfix] Fix ubatch wrapper num_tokens calculate — bug,ready,v1 — by jiangkuaixue123 (merged: 2026-02-05 00:41 (UTC+8))
- #32745 [Hardware][AMD][CI] Refactor AMD tests to properly use BuildKite parallelism — rocm,ready,ci/build — by mawong-amd (merged: 2026-02-04 14:51 (UTC+8))
- #33612 [Perf] Optimize spec decoding + async scheduling, 1.5% Throughput improvement — ready,v1 — by yewentao256 (merged: 2026-02-04 22:34 (UTC+8))
- #33713 [Bugfix][ROCm] Include float8_e4m3fnuz in NCCL Dtype Dispatching — bug,rocm,ready — by micah-wil (merged: 2026-02-04 21:36 (UTC+8))
- #33758 Apply #33621 to main — rocm,ready,ci/build — by DarkLight1337 (merged: 2026-02-04 21:35 (UTC+8))
- #33785 [Model] Apply #32631 for recent models — speculative-decoding,ready,qwen — by DarkLight1337 (merged: 2026-02-04 20:23 (UTC+8))
- #33605 [Bugfix][Model] Fix audio-in-video support for Qwen2.5-Omni and Qwen3-Omni — bug,ready,qwen — by linyueqian (merged: 2026-02-04 20:15 (UTC+8))
- #33291 [PERF] Change GDN Attention State Layout from [N, HV, K, V] to [N, HV, V, K] — ready — by vadiklyutiy (merged: 2026-02-04 19:20 (UTC+8))
- #32255 [BugFix] scheduler: Delay freeing blocks of aborted async loads — bug,ready,v1,kv-connector — by orozery (merged: 2026-02-04 19:16 (UTC+8))
- #33712 [compile] Remove runner type from ignored caching factor list. — ready — by zhxchen17 (merged: 2026-02-04 18:56 (UTC+8))
- #33578 [compile] Clean up AOT compile bypass on evaluate_guards. — ready — by zhxchen17 (merged: 2026-02-04 18:12 (UTC+8))
- #33659 [XPU][2/N] add support unquantized moe support for xpu — ready,ci/build — by jikunshang (merged: 2026-02-04 18:12 (UTC+8))
- #33548 use ORJSONResponse when available to improve the efficiency of request process — frontend,ready — by staugust (merged: 2026-02-04 18:04 (UTC+8))
- #33769 [XPU] remove common path warning log — ready — by jikunshang (merged: 2026-02-04 16:40 (UTC+8))
- #33290 [Metrics] Add labeled prompt token metrics for P/D disaggregation — ready,v1,kv-connector — by ZhanqiuHu (merged: 2026-02-04 15:46 (UTC+8))
- #33722 [Deprecation] Deprecate profiling envs — documentation,ready — by yewentao256 (merged: 2026-02-04 13:58 (UTC+8))
- #33757 [Deprecation] Remove `_get_data_parser` in MM processor — ready,multi-modality — by DarkLight1337 (merged: 2026-02-04 13:51 (UTC+8))
- #33688 [Feature] Enable `TRITON_ATTN` for Batch Invariance — documentation,ready,v1 — by frankwang28 (merged: 2026-02-04 13:27 (UTC+8))
- #33718 [Refactor] Remove unused dead code — ready — by yewentao256 (merged: 2026-02-04 13:25 (UTC+8))
- #33737 [Bugfix] Define router_logits_dtype for remaining MoE models — bug,ready — by mgoin (merged: 2026-02-04 13:24 (UTC+8))
- #33629 Save startup benchmark results as a list of values — performance,ready — by huydhn (merged: 2026-02-04 12:37 (UTC+8))
- #32161 [CPU] Split attention dispatch by head_dim alignment — ready,ci/build,cpu — by R3hankhan123 (merged: 2026-02-04 11:37 (UTC+8))
- #33750 [MM] Align the prefix of MMEncoderAttention with Attention — ready,llama,qwen — by shen-shanshan (merged: 2026-02-04 12:07 (UTC+8))
PRs Closed Without Merging
- #28066 [Chore] Separate out attention backend constants from vllm.utils — rocm,ready,stale — by hezyin (closed: 2026-02-05 10:41 (UTC+8))
- #25385 [Bugfix] Add Flash Attention guards in MLACommonImpl constructor — stale,v1 — by kzawora-intel (closed: 2026-02-05 10:17 (UTC+8))
- #26294 [Bugfix] Fix ovis2.5 pre-quant fp8 checkpoint loading — stale — by Isotr0py (closed: 2026-02-05 10:16 (UTC+8))
- #33563 generate invoking doesn't require detokenization for beam search — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by gameofdimension (closed: 2026-02-05 09:59 (UTC+8))
- #28631 [Frontend][Renderer] Refactor score API — documentation,frontend,needs-rebase,multi-modality — by noooop (closed: 2026-02-05 10:01 (UTC+8))
- #33851 models: lazy-load OpenCV in Nemotron Parse to avoid import-time dependency on X11 libs — no labels — by Gregory-Pereira (closed: 2026-02-05 09:49 (UTC+8))
- #33850 [Misc] Fix typo in error message: suport -> support — no labels — by schnell3526 (closed: 2026-02-05 09:03 (UTC+8))
- #33852 fix: correct typo in error message for Swiglu limit assertion — no labels — by schnell3526 (closed: 2026-02-05 09:14 (UTC+8))
- #30409 [BugFix] Lazy tokenizer init in StructuredOutputManager to prevent GGUF semaphore leak — bug,structured-output,needs-rebase,v1 — by kitaekatt (closed: 2026-02-05 08:21 (UTC+8))
- #33829 Add shutdown timeout CLI and changed shutdown function to send SIGTEM — frontend,v1 — by itrowbri (closed: 2026-02-05 05:23 (UTC+8))
- #32381 [responsesAPI] allow reasoning parser to output multiple reasoning items — frontend,qwen,deepseek — by qandrew (closed: 2026-02-05 02:56 (UTC+8))
- #32729 [responsesAPI] support interleaved reasoning via arbitrary type of message outputs — documentation,frontend,meta-exported,fb-exported — by qandrew (closed: 2026-02-05 02:56 (UTC+8))
- #32304 [Frontend][Feature][Draft]integrate openai realtime api — documentation,frontend,needs-rebase — by unlikezy (closed: 2026-02-05 02:17 (UTC+8))
- #33808 [VoxtralRealtime] Force eager execution by default — no labels — by NickLucche (closed: 2026-02-05 01:39 (UTC+8))
- #27569 [Bugfix][P/D] Fix throughput stats in disaggregated setup — bug,ready,needs-rebase,v1,kv-connector — by NickLucche (closed: 2026-02-04 19:12 (UTC+8))
- #33781 feat(cpu): add CPU support for draft model speculative decoding with parallel drafting — documentation,speculative-decoding,needs-rebase,v1,llama,cpu,nvidia — by ganeshr10 (closed: 2026-02-04 17:26 (UTC+8))
- #32845 [Model] Video Frame Sparsification via Pixel-Level Similarity for Efficient Multimodal Long-Video Inference for Qwen2.5_vl and Qwen3_vl — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by rayleeku (closed: 2026-02-04 16:48 (UTC+8))
- #33248 Kimi K2.5 model generates "(no content)" placeholder in tool call responses — needs-rebase — by shivamashtikar (closed: 2026-02-04 14:53 (UTC+8))
- #33759 fix(rocm): Use correct kv_cache_layout for sliding window with shuffle KV cache — rocm,v1 — by gigamonkeyx (closed: 2026-02-04 12:32 (UTC+8))