vLLM 开发动态报告 - 2026-01-30

时间窗口: 2026-01-30 11:14 (UTC+8) ~ 2026-01-31 11:14 (UTC+8) 数据统计: 新 Issue 27 | 关闭 Issue 21 | 新 PR 50 | 合并 PR 35 | 关闭未合并 PR 15

📊 每日开发状态摘要

vLLM 项目在 2026年1月30日至31日期间保持高强度开发，新增 27 个 Issue 和 50 个 PR，合并 35 个 PR。开发焦点集中在性能优化（如冷启动编译、MoE 内核）、模型支持扩展（尤其是 Qwen、Mistral 等系列的多模态/语音模型）以及缺陷修复。异步调度与批处理不变性（Batch Invariance）的兼容性问题、Mamba 模型前缀缓存支持等技术细节受到持续关注。

🎯 AMD/ROCm 生态相关动态

本周期内，核心 PR 和 Issue 中没有发现直接与 AMD ROCm、Quark 量化工具或 MI300 相关的内容。用户名包含 -amd 后缀的贡献者在此次数据中未出现。在已合并的 PR 中，PR #33173 和 #33366 是针对 ROCm 平台的 BUG 修复和性能优化，但其主旨在于修复通用性问题，而非引入新的 AMD 特定功能。总体而言，本周期 AMD 生态并非社区的主要活跃讨论区。

💬 高热度讨论分析

Issue #33397: [Usage]: How to set structured_output using grammar
- 核心议题: 用户在使用 v0.14.1 时，按照文档通过 extra_body 设置 grammar 进行结构化输出失败。
- 观点与结论: 维护者 @chaunceyjiang 迅速指出用户应参考最新的示例代码 (examples/online_serving/structured_outputs/structured_outputs.py)，暗示文档或接口可能已更新。这反映了 vLLM 功能迭代迅速，用户需关注最新示例。
RFC #27755: reasoning_content -> reasoning
- 核心议题: 讨论将推理模型输出的字段名从 DeepSeek 惯用的 reasoning_content 改为与 OpenAI 建议及 OpenRouter 对齐的 reasoning。
- 观点整理:
  - 支持方: 认为与 OpenAI 生态对齐有利于 vLLM 作为 API 服务器的兼容性和用户体验。讨论了与 reasoning_details 等字段共存的可能性。
  - 讨论焦点: 如何平衡向后兼容性与标准统一。最终决定在经历一段时间的弃用期后，在 PR #33402 中完全移除 reasoning_content。
- 结论: 社区达成共识，遵循主流 API 标准更有利于项目发展，相关 PR 已合并。
Issue #33439: [Bug]: Race condition in V1 engine serial_utils.py
- 核心议题: 用户在使用离线 LLM.score() API 并发处理多模态输入时遇到竞态条件，导致崩溃。
- 观点与结论: 维护者 @robertgshaw2-redhat 明确指出，LLM API 本身并非线程安全，并发请求应使用 AsyncLLM API。该问题迅速被关闭。这凸显了用户需要明确区分 vLLM 不同 API 的适用场景。

🔥 热门话题与趋势分析

模型支持持续扩展：
- 多模态与语音模型：PR #33431 支持 MiniCPM-o 4.5，PR #33389 修复 DeepSeek-OCR-2 精度，PR #33410 和 #33414 完善 Qwen3-ASR 的后处理与 CI 测试，PR #33187 新增基于 WebSocket 的实时语音 API。表明 vLLM 对“多模态+语音”模型的支持正快速成熟。
- 大语言模型迭代：对 GLM-4.7 (Issue #33348)、Kimi-K2.5 (PR #33346) 等最新模型版本的适配工作持续进行。
性能与内核优化深化：
- 编译与缓存：PR #33441 重点修复 torch.compile 导致的冷启动时间回归，并通过融合 KV cache 更新与注意力操作以优化性能。
- MoE 内核：围绕 FP8、NVFP4、INT4 等量化格式的 MoE 内核支持与修复是多个 PR (如 #32437, #33407, #33449) 的共同主题，旨在提升推理效率。
- 批处理不变性（Batch Invariance）：Issue #32481 及相关 PR #32561 深入探讨了在复杂负载下确保批处理结果确定性的挑战，最终通过禁用不兼容的注意力机制（如 Cascade Attention, FlashInfer Chunked Prefill）来解决问题。
基础设施与部署优化：
- 文档与部署：多个 PR (#32286, #33161) 致力于完善 CPU 和 K8s 部署文档，降低用户使用门槛。
- 配置与兼容性：针对 Transformers v5 的兼容性调整（PR #33372, #33359）和模型权重重新加载（Layerwise Reloading, PR #32133）显示了项目对上游依赖升级的前瞻性应对。

🛠️ 重点技术变更

PR #33441: [fix][torch.compile] Fix cold-start compilation time：通过将 unified_kv_cache_update 操作整合到分裂图中，显著修复了因 #25954 引入的冷启动编译时间延长问题，并确保了该操作与注意力计算处于同一子图以最小化运行时开销。
PR #33426 & #33444: Fix typo in read_offset variable name：修复了 detokenizer.py 中 read_offest 的拼写错误，该错误会影响 prompt_embeds 等功能，属于关键的低级缺陷修复。
PR #33414: [CI] Qwen3-ASR transcriptios tests：为 Qwen3-ASR 模型添加 CI 测试，标志着对语音转录模型的支持进入更稳定和自动化测试的阶段。
PR #33376: fix: allow LFM2 MoE prefix caching (align)：扩展了 Mamba 架构模型的前缀缓存支持至 “align” 模式，是提升状态空间模型推理效率的重要一步。
PR #32561: Disable Cascade Attention for Batch Invariance：为了确保批处理不变性，在启用该功能时强制禁用 Cascade Attention，这是解决复杂场景下确定性输出问题的关键决策。

📈 开发活跃度观察

贡献活跃：在 24 小时内处理了超过 80 个 Issue/PR，合并率（35/50=70%）较高，表明核心团队评审和合并流程高效。
多元化贡献：贡献者来自 Red Hat、NVIDIA、AMD、Cohere、Meta 等多个组织和独立开发者，涉及内核、前端API、模型适配、量化、文档等多个领域。
问题闭环迅速：多个 Bug 类 Issue 在提出当天即通过 PR 修复并合并（如 #33426, #33406），显示了社区响应和解决问题的速度。

💡 值得关注的问题

API 线程安全边界：Issue #33439 再次提醒用户，离线 LLM API 与在线 AsyncLLM API 有不同的并发设计。在构建上层应用时需要明确区分。
SM120 (RTX Blackwell) 兼容性：Issue #33416 和 PR #33417 表明，随着新一代消费级 Blackwell GPU (SM120) 上市，vLLM 的各类量化内核（尤其是 NVFP4 MoE）需要及时适配新的计算能力标识，这可能是一个持续性的工作。
批处理不变性的代价：PR #32561 显示，为了确保绝对的确定性，有时需要牺牲某些优化路径（如 Cascade Attention）。在追求极致性能与保证可复现性之间，需要根据应用场景做出权衡。

📋 附录：详细数据列表

新增 Issue

#33424 [Bug]: LoRA MoE Not Matching HF Output — bug — by Jonahcb (创建于: 2026-01-30 22:58 (UTC+8))
#33461 [Bug]: Marlin NVFP4 GEMM kernel on Turing produces meaningless outputs — bug — by iori2333 (创建于: 2026-01-31 10:55 (UTC+8))
#33460 [v0.13.0] Required Transformers version mismatch — 无标签 — by gagank1 (创建于: 2026-01-31 10:30 (UTC+8))
#33459 [Performance]: Torch symm AllReduce seems suboptimal on 8×B200 (fixed CTA count?) — performance — by rajagond (创建于: 2026-01-31 10:25 (UTC+8))
#33447 [Bug]: Issue with vllm 0.15.0 image - running via docker — bug — by SKPsanjeevi (创建于: 2026-01-31 05:36 (UTC+8))
#33428 [Bug]: GLM 4.6v startup failed — bug — by iori2333 (创建于: 2026-01-30 23:57 (UTC+8))
#33457 [Bug]: Unexpected spaces inserted around <think>/</think> tokens with Kimi K2-Thinking / K2.5 — bug — by 0xbe7a (创建于: 2026-01-31 09:03 (UTC+8))
#33458 [Feature][Speculative Decoding]: Multi Modal Draft Model Support — feature request — by benchislett (创建于: 2026-01-31 09:04 (UTC+8))
#33456 [Feature][Speculative Decoding]: Consolidate EAGLE Input Preparation — feature request — by benchislett (创建于: 2026-01-31 08:49 (UTC+8))
#33455 [Feature][Performance][Speculative Decoding]: Kernel Fusion for EAGLE Steps — feature request — by benchislett (创建于: 2026-01-31 08:44 (UTC+8))
#33453 [Usage]: — usage — by MachineLearningHan (创建于: 2026-01-31 08:26 (UTC+8))
#33450 [Bug]: Attention Assertion — bug — by robertgshaw2-redhat (创建于: 2026-01-31 06:42 (UTC+8))
#33448 [Bug]: LoRA MoE Eager Mode Changes Outputs — bug — by Jonahcb (创建于: 2026-01-31 05:47 (UTC+8))
#33445 [RFC] Remove mandatory ray installation — 无标签 — by yewentao256 (创建于: 2026-01-31 04:37 (UTC+8))
#33439 [Bug]: Race condition in V1 engine serial_utils.py - aux_buffers is None during concurrent llm.score() calls with multimodal inputs — bug — by KoderFPV (创建于: 2026-01-31 01:49 (UTC+8))
#33434 [Feature]: Add prefix cache state metrics for KV cache monitoring — 无标签 — by kevkle (创建于: 2026-01-31 01:17 (UTC+8))
#33429 [Feature]: Investigate if higher TORCHINDUCTOR_COMPILE_THREADS leads to faster compile times — feature request,torch.compile — by zou3519 (创建于: 2026-01-31 00:07 (UTC+8))
#33421 [Bug]: FlashInfer backend stopped working with Batch Invariant after moving from 0.14.1 to 0.15.0, Qwen3-4B — bug — by wektorz (创建于: 2026-01-30 22:17 (UTC+8))
#33418 [Bug]: wrong error reported when len(prompt) + requested tokens > max_context_len — bug — by sducouedic (创建于: 2026-01-30 21:32 (UTC+8))
#33416 [Bug] NVFP4 MoE kernels fail on RTX Blackwell (SM12.0) - device capability family check missing SM120 — 无标签 — by renehonig (创建于: 2026-01-30 20:54 (UTC+8))
#33412 [Bug]: vllm bench always injects Authorization header even when OPENAI_API_KEY is unset — bug — by daily-kim (创建于: 2026-01-30 19:57 (UTC+8))
#33411 [Bug]: parameter VLLM_USE_V1=0 is not effective in vllm v0.15.0. — bug — by soooocold (创建于: 2026-01-30 19:44 (UTC+8))
#33398 [RFC]: Layerwise KV cache offloading to support longer sequence length — RFC — by zhangsicheng5 (创建于: 2026-01-30 16:18 (UTC+8))
#33401 [Bug]: [GML-4.5-Air-Fp8] RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {} — bug — by twilighgt (创建于: 2026-01-30 17:14 (UTC+8))
#33400 MultiConnector: Clarify get_finished_count aggregation semantics — 无标签 — by eicherseiji (创建于: 2026-01-30 16:59 (UTC+8))
#33397 [Usage]: How to set structured_output using grammar — usage — by alanayu (创建于: 2026-01-30 15:19 (UTC+8))
#33386 [Feature]: Add support for local image file paths in vision reranker API — feature request — by RC-Qiao (创建于: 2026-01-30 11:18 (UTC+8))

已关闭 Issue

#33294 [Bug]: Typo in detokenizer.py: read_offest should be read_offset breaks prompt_embeds — bug — by bet0x (关闭于: 2026-01-31 10:27 (UTC+8))
#21180 [Bug]: Dockerfile consolidation — bug,stale — by ericcurtin (关闭于: 2026-01-31 10:17 (UTC+8))
#24944 [Bug]: Failed to run Qwen3-Next on vLLM v0.10.2rc3.dev103+g238c4c170.cpu — bug,stale — by MarchBeta2087 (关闭于: 2026-01-31 10:16 (UTC+8))
#25209 [Bug]: Low Accuracy for MMLU Pro with DeepSeekR1-FP4 — bug,stale — by pavanimajety (关闭于: 2026-01-31 10:16 (UTC+8))
#25389 [RFC]: Revise Logits Processor Programming Model — RFC,stale — by afeldman-nm (关闭于: 2026-01-31 10:16 (UTC+8))
#25821 [Bug]: TypeError: argument ‘id’: StreamInput must be either an integer or a list of integers — bug,stale — by zhongzq123 (关闭于: 2026-01-31 10:16 (UTC+8))
#26006 [Bug]: Guided JSON generation not working in Qwen/Qwen2.5-14B-Instruct-1M — bug,stale — by subhalingamd (关闭于: 2026-01-31 10:16 (UTC+8))
#26008 [Bug]: [Bug] [gpt-oss-120b] [Responses API]: Could not parse header: too many tokens remaining after extracting content-type and recipient — bug,stale — by 02deno (关闭于: 2026-01-31 10:16 (UTC+8))
#26055 [Bug]: CUDA error “invalid resource handle” when serving dsr1 with deepep on GB200 — bug,stale — by krishung5 (关闭于: 2026-01-31 10:16 (UTC+8))
#26067 [Bug]: DSR1 + DEP8 on B200 fails with TensorRT-LLM latency and throughput kernels. — bug,stale — by pavanimajety (关闭于: 2026-01-31 10:16 (UTC+8))
#26069 [Bug]: DSR1 FP8 + TEP8 on B200 fails with TensorRT-LLM throughput kernels. — bug,stale — by shyeh25 (关闭于: 2026-01-31 10:16 (UTC+8))
#26079 [Bug]: vLLM Qwen3 Reranker intermittent 500 error - NVMLError_Unknown on score endpoint — bug,stale — by Cyp9715 (关闭于: 2026-01-31 10:15 (UTC+8))
#26108 [RFC]: MoE shared_experts fusion with the selected experts inside FusedMoE — RFC,stale — by alexm-redhat (关闭于: 2026-01-31 10:15 (UTC+8))
#26122 [Performance]: disagg_proxy_demo server request throughput is low — bug,stale — by dkolawole (关闭于: 2026-01-31 10:15 (UTC+8))
#26127 [Installation]: Additional OS level required packages — installation,stale — by bibhas2 (关闭于: 2026-01-31 10:15 (UTC+8))
#26141 vLLM silently hangs on LLaMA Scout 4 with >3M tokens despite sufficient GPU memory — stale,ci-failure — by rajulshakya777 (关闭于: 2026-01-31 10:15 (UTC+8))
#33439 [Bug]: Race condition in V1 engine serial_utils.py - aux_buffers is None during concurrent llm.score() calls with multimodal inputs — bug — by KoderFPV (关闭于: 2026-01-31 02:29 (UTC+8))
#33161 [Doc]: Kubernetes deployment in CPU mode fails (No CUDA..) — documentation — by Josca (关闭于: 2026-01-31 00:05 (UTC+8))
#32481 [Bug]: Batch Invariance fails under more diverse workloads — bug — by frankwang28 (关闭于: 2026-01-30 23:00 (UTC+8))
#27755 [RFC]: reasoning_content -> reasoning — RFC — by hmellor (关闭于: 2026-01-30 19:48 (UTC+8))
#33348 [Bug] GLM-4.7 uses wrong reasoning parser (should use deepseek_r1 instead of glm45) — 无标签 — by QwertyJack (关闭于: 2026-01-30 16:41 (UTC+8))

新增 PR

#33462 Bump flashinfer-python to 0.6.2 — ci/build,nvidia — by esmeetu (创建于: 2026-01-31 11:12 (UTC+8))
#33441 [fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops — ready,ready-run-all-tests — by ProExpertProg (创建于: 2026-01-31 02:28 (UTC+8))
#33443 [ROCm] AITER fused RoPE+KVCache — rocm — by Rohan138 (创建于: 2026-01-31 03:27 (UTC+8))
#33417 fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels — ready,needs-rebase,ci/build,nvidia,quantization — by renehonig (创建于: 2026-01-30 20:55 (UTC+8))
#33426 [Bugfix] Fix typo in read_offset variable name — bug,ready,v1 — by bet0x (创建于: 2026-01-30 23:15 (UTC+8))
#33407 [W8A8 Block Linear Refactor][2/N] Make FP8 Block Linear Ops use kernel abstraction. — performance,needs-rebase,nvidia — by maralbahari (创建于: 2026-01-30 18:32 (UTC+8))
#33454 Rust api server — documentation — by jdw2159 (创建于: 2026-01-31 08:39 (UTC+8))
#33452 Support clear mm and encoder cache — v1,meta-exported,fb-exported — by jma99fb (创建于: 2026-01-31 07:57 (UTC+8))
#33451 [WIP][Attention] Add FlashInfer Sparse MLA backend — documentation,v1,nvidia — by MatthewBonanni (创建于: 2026-01-31 07:44 (UTC+8))
#33446 [Bugfix] Fix Sparse24 Compressed Tensors models — bug,nvidia — by kylesayrs (创建于: 2026-01-31 05:10 (UTC+8))
#33449 [Refactor] Remove align block size logic in moe_permute — performance,ready — by yewentao256 (创建于: 2026-01-31 06:16 (UTC+8))
#33427 [Attention] Clarify comment explaining attn_logits +1 dimension — ready,v1 — by fuscof-ibm (创建于: 2026-01-30 23:47 (UTC+8))
#33391 [ModelRunner V2] Fix spec decoding + logprobs — ready,v1 — by njhill (创建于: 2026-01-30 13:32 (UTC+8))
#33437 [Misc] skip target model mm emb in draft proposal step when draft is text-only — v1 — by kkt-cohere (创建于: 2026-01-31 01:28 (UTC+8))
#33444 [Misc] offest -> offset in comments and variable names — speculative-decoding,v1 — by russellb (创建于: 2026-01-31 03:45 (UTC+8))
#33440 pin LMCache to v0.3.9 or greater with vLLM v0.15.0 — ci/build,kv-connector — by Gregory-Pereira (创建于: 2026-01-31 02:07 (UTC+8))
#33442 Fix/sm120 rtx blackwell support nvfp4/fp8 moe — nvidia — by mgoin (创建于: 2026-01-31 02:57 (UTC+8))
#33438 [Benchmark] Add vllm bench iterations for prefill/decode measurement — performance,frontend,needs-rebase — by jaewonlee-fb (创建于: 2026-01-31 01:45 (UTC+8))
#33415 [Voxtral Streaming -> Voxtral Realtime] Rename all voxtral related classes, fn, files — documentation,new-model,ready,multi-modality — by patrickvonplaten (创建于: 2026-01-30 20:46 (UTC+8))
#33435 [Metrics] Add prefix cache state metrics for KV cache monitoring — documentation,v1 — by kevkle (创建于: 2026-01-31 01:18 (UTC+8))
#33432 fix QERL attention import path — ready,ci-failure — by vkuzo (创建于: 2026-01-31 00:23 (UTC+8))
#33436 [BUGFIX] fix Attention import path — bug — by xuechendi (创建于: 2026-01-31 01:21 (UTC+8))
#33413 Fix test_moe.py for Transformers v5 — ready — by hmellor (创建于: 2026-01-30 20:11 (UTC+8))
#33433 [Model Runner V2] support bad_words sampling param — v1 — by izhuhaoran (创建于: 2026-01-31 00:31 (UTC+8))
#33431 [Model] Support MiniCPM-o 4.5 — needs-rebase — by tc-mb (创建于: 2026-01-31 00:14 (UTC+8))
#33414 [CI] Qwen3-ASR transcriptios tests — ready,qwen — by NickLucche (创建于: 2026-01-30 20:23 (UTC+8))
#33395 [KVConnector][LMCache] Enable Support for cross-layer Layout — kv-connector — by Shaoting-Feng (创建于: 2026-01-30 14:39 (UTC+8))
#33430 renderer: Fix miselading error message — frontend — by RishabhSaini (创建于: 2026-01-31 00:14 (UTC+8))
#33425 [Bugfix] Fix correct error message when len(prompt) + max_tokens > max_model_len — bug,frontend — by sducouedic (创建于: 2026-01-30 22:59 (UTC+8))
#33393 [BugFix][LoRA] TritonExperts is ModularMoEPath for FP8 models — bug,ready — by dcmaddix (创建于: 2026-01-30 14:21 (UTC+8))
#33422 Update context length in error message when prompt + max_tokens exceeds limit — frontend — by kamalesh0406 (创建于: 2026-01-30 22:47 (UTC+8))
#33420 [EPLB][BUGFIX] Add max_num_transfers config option to EPLB — bug — by SageMoore (创建于: 2026-01-30 22:03 (UTC+8))
#33419 [Misc] Algin Qwen3-VL-embedding image example outputs with HF repo example — documentation,qwen — by Isotr0py (创建于: 2026-01-30 21:49 (UTC+8))
#33423 [Test] GLM4.7 performance acceleration. — 无标签 — by liuchen2026fly (创建于: 2026-01-30 22:51 (UTC+8))
#33406 [BUGFIX] Pixtral cannot be loaded with –limit-mm-per-prompt 0 — bug — by juliendenize (创建于: 2026-01-30 18:29 (UTC+8))
#33408 [Refactor] Move MM data parsing outside processor — ready,v1,multi-modality,llama,qwen — by DarkLight1337 (创建于: 2026-01-30 18:46 (UTC+8))
#33410 [Bugfix] Fix Qwen3ASR language asr tag in output — bug,frontend,ready,qwen — by NickLucche (创建于: 2026-01-30 19:36 (UTC+8))
#33402 Remove deprecated reasoning_content message field — frontend,ready,tool-calling,gpt-oss — by hmellor (创建于: 2026-01-30 17:31 (UTC+8))
#33409 [Misc] Apply PEP 563 to replace quoted union types with unquoted ones — rocm,structured-output,frontend,ready,v1,multi-modality,cpu,kv-connector,nvidia — by carlory (创建于: 2026-01-30 18:53 (UTC+8))
#33388 [Doc] [ROCm] Update Documentation to reflect v0.15.0 release — documentation,rocm,ready — by vllmellm (创建于: 2026-01-30 11:47 (UTC+8))
#33392 PR Title: [Feature] EWSJF: Adaptive Scheduler for Mixed-Workloads (30-50% Throughput Gain) — performance,v1 — by sidikbro (创建于: 2026-01-30 14:10 (UTC+8))
#33404 [Core][Bugfix] Add logprobs_mode_override capability for request-level fine-grained control over logprob computation — bug,frontend,v1 — by tangcy98 (创建于: 2026-01-30 17:34 (UTC+8))
#33405 Extract kv cache update from flashmla sparse — v1 — by Kare0638 (创建于: 2026-01-30 17:51 (UTC+8))
#33403 [WIP] pcp alternative impl — rocm,speculative-decoding,v1,nvidia — by LucasWilkinson (创建于: 2026-01-30 17:32 (UTC+8))
#33396 [Refactor] Move MM item count validation outside of processor — frontend,ready,v1,multi-modality — by DarkLight1337 (创建于: 2026-01-30 14:54 (UTC+8))
#33399 Isolate VLLM_CACHE_ROOT per test to fix torch.compile cache reuse — 无标签 — by sladyn98 (创建于: 2026-01-30 16:48 (UTC+8))
#33394 Support vibevoice asr — documentation,new-model,ci/build,multi-modality — by nemoramo (创建于: 2026-01-30 14:39 (UTC+8))
#33389 [bugfix] Solve the accuracy issue of deepseek ocr2 — bug,documentation,new-model,deepseek — by leihuang-sketch (创建于: 2026-01-30 11:48 (UTC+8))
#33387 feat(pooling/score): Enhance RerankRequest with optional instruction field and multimodal input support — frontend — by akaihaoshuai (创建于: 2026-01-30 11:18 (UTC+8))
#33390 [BUG] relax pirate persona assertion in test_system_prompt_override — bug — by pacoxu (创建于: 2026-01-30 12:24 (UTC+8))

已合并 PR

#33362 [Deprecation] Deprecate seed_everything and scatter_mm_placeholders in v0.15 — ready,v1 — by yewentao256 (合并于: 2026-01-31 10:54 (UTC+8))
#33173 [Quantization][ROCm] Fix MoE weight loading to be robust (Qwen3_MoE/Qwen3_next as example models) — documentation,performance,rocm,ready,v1,qwen,kv-connector — by xuebwang-amd (合并于: 2026-01-31 01:50 (UTC+8))
#33426 [Bugfix] Fix typo in read_offset variable name — bug,ready,v1 — by bet0x (合并于: 2026-01-31 09:26 (UTC+8))
#33366 [Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 — bug,rocm,ready — by gshtras (合并于: 2026-01-31 09:05 (UTC+8))
#33201 Refactor NVFP4 Linear utils for ModelOpt and CT — ready,nvidia,quantization — by mgoin (合并于: 2026-01-31 08:37 (UTC+8))
#33286 [CI][HPU]accelerate hpu test by skip python re-install and clean container name — ready,ci/build — by xuechendi (合并于: 2026-01-31 05:36 (UTC+8))
#32990 Indicate compile mode in the benchmark results — performance,ready,ci/build — by huydhn (合并于: 2026-01-31 04:34 (UTC+8))
#33187 [Realtime API] Adds minimal realtime API based on websockets — documentation,frontend,ready,v1 — by patrickvonplaten (合并于: 2026-01-30 18:41 (UTC+8))
#32437 [Hardware][SM100] Add TRTLLM Kernel for INT4 W4A16 Kernel. — ready,nvidia — by pavanimajety (合并于: 2026-01-31 02:30 (UTC+8))
#33432 fix QERL attention import path — ready,ci-failure — by vkuzo (合并于: 2026-01-31 01:29 (UTC+8))
#32740 [Kernel] [Helion] [1/N] Add Helion ConfigManager — ready — by gmagogsfm (合并于: 2026-01-31 01:19 (UTC+8))
#33236 Fix encoder-decoder model disabling mm processor cache — ready,multi-modality — by hmellor (合并于: 2026-01-31 00:30 (UTC+8))
#33413 Fix test_moe.py for Transformers v5 — ready — by hmellor (合并于: 2026-01-30 22:03 (UTC+8))
#33376 fix: allow LFM2 MoE prefix caching (align) — ready — by tianshu-Michael-yu (合并于: 2026-01-30 16:23 (UTC+8))
#33414 [CI] Qwen3-ASR transcriptios tests — ready,qwen — by NickLucche (合并于: 2026-01-31 00:17 (UTC+8))
#33280 Support FP8 block quant for CompressedTensorsW8A16Fp8 — ready,quantization — by mgoin (合并于: 2026-01-31 00:15 (UTC+8))
#32133 [QeRL] Layerwise Reloading — ready,v1 — by kylesayrs (合并于: 2026-01-30 23:50 (UTC+8))
#33393 [BugFix][LoRA] TritonExperts is ModularMoEPath for FP8 models — bug,ready — by dcmaddix (合并于: 2026-01-30 23:27 (UTC+8))
#32561 Disable Cascade Attention for Batch Invariance — ready,v1 — by frankwang28 (合并于: 2026-01-30 23:00 (UTC+8))
#33406 [BUGFIX] Pixtral cannot be loaded with –limit-mm-per-prompt 0 — bug — by juliendenize (合并于: 2026-01-30 18:52 (UTC+8))
#33253 Improve Mistral format checks. — ready — by juliendenize (合并于: 2026-01-30 22:23 (UTC+8))
#32286 [Doc] Enhance documentation around CPU container images — documentation,ready,cpu — by nathan-weinberg (合并于: 2026-01-30 21:36 (UTC+8))
#33323 [Misc] Clean up HIDDEN_DEPRECATED_METRICS after metric removal — ready — by carlory (合并于: 2026-01-30 21:31 (UTC+8))
#33402 Remove deprecated reasoning_content message field — frontend,ready,tool-calling,gpt-oss — by hmellor (合并于: 2026-01-30 19:48 (UTC+8))
#33388 [Doc] [ROCm] Update Documentation to reflect v0.15.0 release — documentation,rocm,ready — by vllmellm (合并于: 2026-01-30 19:06 (UTC+8))

#33332 [Misc] Replace Optional[X] with X

None syntax — rocm,frontend,ready,v1,multi-modality,cpu,kv-connector,nvidia — by carlory (合并于: 2026-01-30 17:56 (UTC+8))

#33396 [Refactor] Move MM item count validation outside of processor — frontend,ready,v1,multi-modality — by DarkLight1337 (合并于: 2026-01-30 17:27 (UTC+8))
#33346 [Models] Refactor Kimi-K2.5 weight loading — ready — by Isotr0py (合并于: 2026-01-30 13:31 (UTC+8))
#33372 Explicitly set return_dict for apply_chat_template — documentation,ready — by hmellor (合并于: 2026-01-30 15:27 (UTC+8))
#33359 Fix tie_word_embeddings for multimodal models in Transformers v5 — ready — by hmellor (合并于: 2026-01-30 11:37 (UTC+8))
#32449 [model] Add support for openPangu7B-VL — documentation,new-model,ready — by hujiaxin0 (合并于: 2026-01-30 15:54 (UTC+8))
#33282 [CI] Enable mypy import following for vllm/spec_decode — speculative-decoding,ready,v1 — by Lucaskabela (合并于: 2026-01-30 14:43 (UTC+8))
#33352 [BugFix] Disable async scheduling for Mamba prefix caching — bug,ready — by peakcrosser7 (合并于: 2026-01-30 12:40 (UTC+8))
#33239 Move decode context parallel validationn to ParallelConfig — ready — by hmellor (合并于: 2026-01-30 14:18 (UTC+8))
#33305 [CI][AMD] Skip 4 GPUs testgroup ray tests — documentation,rocm,ready,ci/build,v1 — by rjrock (合并于: 2026-01-30 13:39 (UTC+8))

关闭但未合并的 PR

#17083 Fix numel() downcast in vllm/csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu +2 — ready,needs-rebase,stale — by r-barnes (关闭于: 2026-01-31 10:17 (UTC+8))
#25412 [AMD/CI] Amd blocking tests v2 — rocm,ci/build,stale — by Alexei-V-Ivanov-AMD (关闭于: 2026-01-31 10:16 (UTC+8))
#32303 [Bugfix][ROCm] Fix invalid spec token placeholders causing embedding lookup failure with async scheduling — bug,rocm,structured-output,v1 — by micah-wil (关闭于: 2026-01-31 05:40 (UTC+8))
#33442 Fix/sm120 rtx blackwell support nvfp4/fp8 moe — nvidia — by mgoin (关闭于: 2026-01-31 03:40 (UTC+8))
#33436 [BUGFIX] fix Attention import path — bug — by xuechendi (关闭于: 2026-01-31 01:24 (UTC+8))
#31849 [CI][Transformers] Fix transformers v5 support: tied weights and packed 3D MoE experts — needs-rebase,v1,multi-modality — by AndreasKaratzas (关闭于: 2026-01-31 01:09 (UTC+8))
#33430 renderer: Fix miselading error message — frontend — by RishabhSaini (关闭于: 2026-01-31 00:16 (UTC+8))
#33422 Update context length in error message when prompt + max_tokens exceeds limit — frontend — by kamalesh0406 (关闭于: 2026-01-30 23:22 (UTC+8))
#33008 [BugFix] KeyError when loading Mistral/vision-enabled checkpoints — bug — by ms1design (关闭于: 2026-01-30 22:17 (UTC+8))
#32171 Optimize FlashInfer TRTLLM FP4 MoE quantization #32057 — performance,needs-rebase,nvidia — by baonudesifeizhai (关闭于: 2026-01-30 20:56 (UTC+8))
#26504 [Spec-Decode] Add DynamicProposer for per-sequence dynamic speculative decoding — documentation,performance,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality,tool-calling,qwen — by yang926 (关闭于: 2026-01-30 17:27 (UTC+8))
#33349 [Reasoning] Add glm47 reasoning parser for GLM-4.7 models — documentation — by QwertyJack (关闭于: 2026-01-30 16:40 (UTC+8))
#33394 Support vibevoice asr — documentation,new-model,ci/build,multi-modality — by nemoramo (关闭于: 2026-01-30 14:39 (UTC+8))
#33387 feat(pooling/score): Enhance RerankRequest with optional instruction field and multimodal input support — frontend — by akaihaoshuai (关闭于: 2026-01-30 13:35 (UTC+8))
#33321 [ROCm] make rocm_aiter_fa support qwen3-next, remove multiple 16 block size support — documentation,rocm,v1,qwen — by ganyi1996ppo (关闭于: 2026-01-30 11:28 (UTC+8))