vLLM Development Activity Report - 2026-01-30
Time window: 2026-01-30 11:14 (UTC+8) ~ 2026-01-31 11:14 (UTC+8). Stats: 27 new issues | 21 closed issues | 50 new PRs | 35 merged PRs | 15 PRs closed without merge
📊 Daily Development Status Summary
vLLM maintained a high development tempo from January 30 to 31, 2026, with 27 new issues and 50 new PRs opened and 35 PRs merged. Work centered on performance optimization (cold-start compilation, MoE kernels), expanded model support (especially multimodal/speech models in the Qwen and Mistral families), and bug fixes. Two technical threads drew sustained attention: compatibility between async scheduling and batch invariance, and prefix-cache support for Mamba models.
🎯 AMD/ROCm Ecosystem Updates
ROCm activity this cycle was maintenance-oriented rather than feature-oriented. Among merged PRs, #33173 (robust MoE weight loading, contributed by xuebwang-amd) and #33366 (skinny GEMM dispatch fix) target the ROCm platform, but both address general correctness and performance issues rather than introducing new AMD-specific functionality. No work related to the Quark quantization tool or MI300 appeared in this window. Overall, the AMD ecosystem was not a major center of community discussion this cycle.
💬 High-Engagement Discussions
- Issue #33397: [Usage]: How to set structured_output using grammar
  - Core question: on v0.14.1, passing a grammar for structured output via `extra_body` as documented fails.
  - Views and outcome: maintainer @chaunceyjiang quickly pointed the reporter to the current example (`examples/online_serving/structured_outputs/structured_outputs.py`), implying the docs or interface had changed. A reminder that vLLM iterates quickly and the latest examples are the reference.
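As a purely client-side illustration of what this discussion is about, the sketch below assembles a chat request that tunnels a grammar through `extra_body`. The field names `structured_outputs` and `grammar`, the model name, and the toy grammar are assumptions for illustration only; the example file referenced above is the authoritative source for the real schema in any given vLLM version.

```python
# Sketch of an OpenAI-compatible chat request that passes a grammar
# through extra_body. The keys "structured_outputs" and "grammar" are
# assumptions -- consult examples/online_serving/structured_outputs/
# structured_outputs.py in your vLLM version for the actual schema.
import json

# A toy grammar restricting the answer to "yes" or "no" (hypothetical).
ANSWER_GRAMMAR = 'root ::= "yes" | "no"'

def build_request(prompt: str) -> dict:
    return {
        "model": "my-model",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        # extra_body contents are forwarded to the server as-is.
        "extra_body": {"structured_outputs": {"grammar": ANSWER_GRAMMAR}},
    }

body = build_request("Is the sky blue?")
print(json.dumps(body, indent=2))
```

The point of the exchange in #33397 is precisely that such payload shapes change between releases, so pin your client code to the examples shipped with the server version you run.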
- RFC #27755: `reasoning_content` -> `reasoning`
  - Core question: rename the reasoning-model output field from DeepSeek's customary `reasoning_content` to `reasoning`, aligning with OpenAI's recommendation and OpenRouter.
  - Views:
    - In favor: aligning with the OpenAI ecosystem improves vLLM's compatibility and user experience as an API server; coexistence with fields such as `reasoning_details` was also discussed.
    - Main tension: balancing backward compatibility against standardization. The final decision was to remove `reasoning_content` entirely in PR #33402 after a deprecation period.
  - Outcome: consensus that following mainstream API standards serves the project best; the PR has been merged.
- Issue #33439: [Bug]: Race condition in V1 engine serial_utils.py
  - Core question: concurrent calls to the offline `LLM.score()` API with multimodal inputs hit a race condition and crash.
  - Views and outcome: maintainer @robertgshaw2-redhat stated plainly that the `LLM` API is not thread-safe and that concurrent requests should use the `AsyncLLM` API. The issue was closed quickly, underscoring that users must match each vLLM API to its intended usage model.
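The maintainer's answer is to use `AsyncLLM` for concurrency. For an application temporarily stuck on the synchronous object, one generic stopgap, sketched below in plain asyncio with a stand-in engine (nothing here is vLLM-specific or taken from the issue thread), is to serialize all calls through a single lock so the non-thread-safe object is never entered concurrently:

```python
import asyncio

class SerializedScorer:
    """Wrap a non-thread-safe scoring object so that concurrent
    async callers are serialized through a single lock."""

    def __init__(self, engine):
        self._engine = engine          # e.g. a synchronous engine instance
        self._lock = asyncio.Lock()

    async def score(self, query: str, doc: str) -> float:
        async with self._lock:         # one call at a time
            # run_in_executor keeps the event loop responsive while
            # the blocking engine call runs in a worker thread.
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(
                None, self._engine.score, query, doc
            )

class FakeEngine:
    """Stand-in engine for demonstration only."""
    def score(self, query: str, doc: str) -> float:
        return float(len(query) + len(doc))

async def main() -> list[float]:
    scorer = SerializedScorer(FakeEngine())
    return await asyncio.gather(*(scorer.score("q", d) for d in ("a", "bb")))

print(asyncio.run(main()))  # -> [2.0, 3.0]
```

This trades away concurrency for safety; the supported path for real concurrent workloads remains the async API.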
🔥 Hot Topics and Trends
- Model support keeps expanding:
  - Multimodal and speech models: PR #33431 adds MiniCPM-o 4.5, PR #33389 fixes DeepSeek-OCR-2 accuracy, PRs #33410 and #33414 polish Qwen3-ASR post-processing and CI coverage, and PR #33187 adds a minimal WebSocket-based realtime speech API. Support for "multimodal + speech" models is maturing quickly.
  - LLM iteration: adaptation continues for the newest model releases, such as GLM-4.7 (Issue #33348) and Kimi-K2.5 (PR #33346).
- Deeper performance and kernel work:
  - Compilation and caching: PR #33441 fixes a cold-start compile-time regression caused by torch.compile and fuses the KV cache update with the attention op to reduce overhead.
  - MoE kernels: MoE kernel support and fixes across the FP8, NVFP4, and INT4 quantization formats are the common thread of several PRs (e.g. #32437, #33407, #33449), all aimed at inference efficiency.
  - Batch invariance: Issue #32481 and the related PR #32561 dug into keeping batched results deterministic under diverse workloads, ultimately resolved by disabling incompatible attention paths (Cascade Attention, FlashInfer chunked prefill).
- Infrastructure and deployment:
  - Docs and deployment: several PRs (#32286, #33161) improve CPU and Kubernetes deployment documentation, lowering the barrier to entry.
  - Configuration and compatibility: Transformers v5 compatibility adjustments (PRs #33372, #33359) and layerwise weight reloading (PR #32133) show the project staying ahead of upstream dependency upgrades.
🛠️ Key Technical Changes
- PR #33441: [fix][torch.compile] Fix cold-start compilation time: moves the `unified_kv_cache_update` op into the splitting ops, fixing the cold-start compile-time regression introduced by #25954 while keeping the op in the same subgraph as attention to minimize runtime overhead.
- PRs #33426 & #33444: Fix typo in read_offset variable name: fixes the `read_offest` misspelling in `detokenizer.py`, which broke features such as `prompt_embeds`; a small but critical low-level fix.
- PR #33414: [CI] Qwen3-ASR transcriptios tests: adds CI tests for Qwen3-ASR, moving speech-transcription support into a more stable, automatically tested state.
- PR #33376: fix: allow LFM2 MoE prefix caching (align): extends prefix-cache support for Mamba-architecture models to "align" mode, an important step for state-space-model inference efficiency.
- PR #32561: Disable Cascade Attention for Batch Invariance: forcibly disables Cascade Attention when batch invariance is enabled, the key decision for deterministic output in complex scenarios.
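The determinism problem behind #32561 and #32481 can be demonstrated with nothing but Python floats (an illustration of the general phenomenon, not of vLLM's actual kernels): floating-point addition is not associative, so any kernel whose reduction order varies with batch composition can produce different bits for the same request.

```python
# Floating-point addition is not associative: summing the same values
# in a different order can give a different rounded result. A kernel
# whose reduction order depends on batch size/composition therefore
# cannot guarantee bitwise-identical outputs per request.
values = [1e16, 1.0, -1e16, 1.0]

left_to_right = sum(values)          # (((1e16 + 1) - 1e16) + 1)
reordered = sum(sorted(values))      # different association order

print(left_to_right, reordered)      # -> 1.0 0.0
print(left_to_right == reordered)    # -> False
```

Batch invariance demands that these reduction orders be fixed, which is why incompatible fast paths such as Cascade Attention get disabled.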
📈 Development Activity Observations
- High throughput: over 80 issues/PRs were handled within 24 hours, and the merge rate is high (35 merged against 50 newly opened PRs, roughly 70%), indicating an efficient core review-and-merge pipeline.
- Diverse contributors: contributions came from Red Hat, NVIDIA, AMD, Cohere, Meta, and independent developers, spanning kernels, the frontend API, model support, quantization, and documentation.
- Fast bug turnaround: several bug reports were fixed and merged the same day they were filed (e.g. #33426, #33406), showing how quickly the community responds and resolves problems.
💡 Issues Worth Watching
- API thread-safety boundaries: Issue #33439 is another reminder that the offline `LLM` API and the online `AsyncLLM` API have different concurrency designs; applications built on top must keep them distinct.
- SM120 (RTX Blackwell) compatibility: Issue #33416 and PR #33417 show that as consumer-grade Blackwell GPUs (SM120) reach the market, vLLM's quantized kernels (especially NVFP4 MoE) need their compute-capability checks updated promptly; this is likely to be ongoing work.
- The cost of batch invariance: PR #32561 shows that guaranteeing strict determinism can mean sacrificing optimization paths such as Cascade Attention; the trade-off between peak performance and reproducibility must be made per use case.
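The SM120 failure mode in Issue #33416 generalizes: gating kernels on an exact allowlist of compute capabilities silently excludes new chips in the same family. The predicate below is a hypothetical sketch of that pattern, not vLLM's actual check:

```python
# Hypothetical kernel gate illustrating the SM120 class of bug.
# An exact allowlist of (major, minor) capabilities excludes new
# chips in the same family; gating on the family does not.
KNOWN_BLACKWELL = {(10, 0)}   # e.g. SM100 only -- SM120 missing

def supported_exact(cap: tuple[int, int]) -> bool:
    return cap in KNOWN_BLACKWELL          # buggy: (12, 0) -> False

def supported_family(cap: tuple[int, int]) -> bool:
    return cap[0] in (10, 12)              # fixed: includes SM120

sm120 = (12, 0)
print(supported_exact(sm120), supported_family(sm120))  # -> False True
```

Each new SM revision will re-trigger this class of bug wherever exact-match gating survives, which is why it is flagged above as ongoing work.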
📋 Appendix: Detailed Data
New Issues
- #33424 [Bug]: LoRA MoE Not Matching HF Output — bug — by Jonahcb (created: 2026-01-30 22:58 (UTC+8))
- #33461 [Bug]: Marlin NVFP4 GEMM kernel on Turing produces meaningless outputs — bug — by iori2333 (created: 2026-01-31 10:55 (UTC+8))
- #33460 [v0.13.0] Required Transformers version mismatch — no labels — by gagank1 (created: 2026-01-31 10:30 (UTC+8))
- #33459 [Performance]: Torch symm AllReduce seems suboptimal on 8×B200 (fixed CTA count?) — performance — by rajagond (created: 2026-01-31 10:25 (UTC+8))
- #33447 [Bug]: Issue with vllm 0.15.0 image - running via docker — bug — by SKPsanjeevi (created: 2026-01-31 05:36 (UTC+8))
- #33428 [Bug]: GLM 4.6v startup failed — bug — by iori2333 (created: 2026-01-30 23:57 (UTC+8))
- #33457 [Bug]: Unexpected spaces inserted around `<think>`/`</think>` tokens with Kimi K2-Thinking / K2.5 — bug — by 0xbe7a (created: 2026-01-31 09:03 (UTC+8))
- #33458 [Feature][Speculative Decoding]: Multi Modal Draft Model Support — feature request — by benchislett (created: 2026-01-31 09:04 (UTC+8))
- #33456 [Feature][Speculative Decoding]: Consolidate EAGLE Input Preparation — feature request — by benchislett (created: 2026-01-31 08:49 (UTC+8))
- #33455 [Feature][Performance][Speculative Decoding]: Kernel Fusion for EAGLE Steps — feature request — by benchislett (created: 2026-01-31 08:44 (UTC+8))
- #33453 [Usage]: — usage — by MachineLearningHan (created: 2026-01-31 08:26 (UTC+8))
- #33450 [Bug]: Attention Assertion — bug — by robertgshaw2-redhat (created: 2026-01-31 06:42 (UTC+8))
- #33448 [Bug]: LoRA MoE Eager Mode Changes Outputs — bug — by Jonahcb (created: 2026-01-31 05:47 (UTC+8))
- #33445 [RFC] Remove mandatory ray installation — no labels — by yewentao256 (created: 2026-01-31 04:37 (UTC+8))
- #33439 [Bug]: Race condition in V1 engine serial_utils.py - aux_buffers is None during concurrent llm.score() calls with multimodal inputs — bug — by KoderFPV (created: 2026-01-31 01:49 (UTC+8))
- #33434 [Feature]: Add prefix cache state metrics for KV cache monitoring — no labels — by kevkle (created: 2026-01-31 01:17 (UTC+8))
- #33429 [Feature]: Investigate if higher TORCHINDUCTOR_COMPILE_THREADS leads to faster compile times — feature request,torch.compile — by zou3519 (created: 2026-01-31 00:07 (UTC+8))
- #33421 [Bug]: FlashInfer backend stopped working with Batch Invariant after moving from 0.14.1 to 0.15.0, Qwen3-4B — bug — by wektorz (created: 2026-01-30 22:17 (UTC+8))
- #33418 [Bug]: wrong error reported when len(prompt) + requested tokens > max_context_len — bug — by sducouedic (created: 2026-01-30 21:32 (UTC+8))
- #33416 [Bug] NVFP4 MoE kernels fail on RTX Blackwell (SM12.0) - device capability family check missing SM120 — no labels — by renehonig (created: 2026-01-30 20:54 (UTC+8))
- #33412 [Bug]: vllm bench always injects Authorization header even when OPENAI_API_KEY is unset — bug — by daily-kim (created: 2026-01-30 19:57 (UTC+8))
- #33411 [Bug]: parameter VLLM_USE_V1=0 is not effective in vllm v0.15.0. — bug — by soooocold (created: 2026-01-30 19:44 (UTC+8))
- #33398 [RFC]: Layerwise KV cache offloading to support longer sequence length — RFC — by zhangsicheng5 (created: 2026-01-30 16:18 (UTC+8))
- #33401 [Bug]: [GML-4.5-Air-Fp8] RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {} — bug — by twilighgt (created: 2026-01-30 17:14 (UTC+8))
- #33400 MultiConnector: Clarify get_finished_count aggregation semantics — no labels — by eicherseiji (created: 2026-01-30 16:59 (UTC+8))
- #33397 [Usage]: How to set structured_output using grammar — usage — by alanayu (created: 2026-01-30 15:19 (UTC+8))
- #33386 [Feature]: Add support for local image file paths in vision reranker API — feature request — by RC-Qiao (created: 2026-01-30 11:18 (UTC+8))
Closed Issues
- #33294 [Bug]: Typo in detokenizer.py: read_offest should be read_offset breaks prompt_embeds — bug — by bet0x (closed: 2026-01-31 10:27 (UTC+8))
- #21180 [Bug]: Dockerfile consolidation — bug,stale — by ericcurtin (closed: 2026-01-31 10:17 (UTC+8))
- #24944 [Bug]: Failed to run Qwen3-Next on vLLM v0.10.2rc3.dev103+g238c4c170.cpu — bug,stale — by MarchBeta2087 (closed: 2026-01-31 10:16 (UTC+8))
- #25209 [Bug]: Low Accuracy for MMLU Pro with DeepSeekR1-FP4 — bug,stale — by pavanimajety (closed: 2026-01-31 10:16 (UTC+8))
- #25389 [RFC]: Revise Logits Processor Programming Model — RFC,stale — by afeldman-nm (closed: 2026-01-31 10:16 (UTC+8))
- #25821 [Bug]: TypeError: argument 'id': StreamInput must be either an integer or a list of integers — bug,stale — by zhongzq123 (closed: 2026-01-31 10:16 (UTC+8))
- #26006 [Bug]: Guided JSON generation not working in `Qwen/Qwen2.5-14B-Instruct-1M` — bug,stale — by subhalingamd (closed: 2026-01-31 10:16 (UTC+8))
- #26008 [Bug]: [Bug] [gpt-oss-120b] [Responses API]: Could not parse header: too many tokens remaining after extracting content-type and recipient — bug,stale — by 02deno (closed: 2026-01-31 10:16 (UTC+8))
- #26055 [Bug]: CUDA error "invalid resource handle" when serving dsr1 with deepep on GB200 — bug,stale — by krishung5 (closed: 2026-01-31 10:16 (UTC+8))
- #26067 [Bug]: DSR1 + DEP8 on B200 fails with TensorRT-LLM latency and throughput kernels. — bug,stale — by pavanimajety (closed: 2026-01-31 10:16 (UTC+8))
- #26069 [Bug]: DSR1 FP8 + TEP8 on B200 fails with TensorRT-LLM throughput kernels. — bug,stale — by shyeh25 (closed: 2026-01-31 10:16 (UTC+8))
- #26079 [Bug]: vLLM Qwen3 Reranker intermittent 500 error - NVMLError_Unknown on score endpoint — bug,stale — by Cyp9715 (closed: 2026-01-31 10:15 (UTC+8))
- #26108 [RFC]: MoE shared_experts fusion with the selected experts inside FusedMoE — RFC,stale — by alexm-redhat (closed: 2026-01-31 10:15 (UTC+8))
- #26122 [Performance]: disagg_proxy_demo server request throughput is low — bug,stale — by dkolawole (closed: 2026-01-31 10:15 (UTC+8))
- #26127 [Installation]: Additional OS level required packages — installation,stale — by bibhas2 (closed: 2026-01-31 10:15 (UTC+8))
- #26141 vLLM silently hangs on LLaMA Scout 4 with >3M tokens despite sufficient GPU memory — stale,ci-failure — by rajulshakya777 (closed: 2026-01-31 10:15 (UTC+8))
- #33439 [Bug]: Race condition in V1 engine serial_utils.py - aux_buffers is None during concurrent llm.score() calls with multimodal inputs — bug — by KoderFPV (closed: 2026-01-31 02:29 (UTC+8))
- #33161 [Doc]: Kubernetes deployment in CPU mode fails (No CUDA..) — documentation — by Josca (closed: 2026-01-31 00:05 (UTC+8))
- #32481 [Bug]: Batch Invariance fails under more diverse workloads — bug — by frankwang28 (closed: 2026-01-30 23:00 (UTC+8))
- #27755 [RFC]: `reasoning_content` -> `reasoning` — RFC — by hmellor (closed: 2026-01-30 19:48 (UTC+8))
- #33348 [Bug] GLM-4.7 uses wrong reasoning parser (should use deepseek_r1 instead of glm45) — no labels — by QwertyJack (closed: 2026-01-30 16:41 (UTC+8))
New PRs
- #33462 Bump flashinfer-python to 0.6.2 — ci/build,nvidia — by esmeetu (created: 2026-01-31 11:12 (UTC+8))
- #33441 [fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops — ready,ready-run-all-tests — by ProExpertProg (created: 2026-01-31 02:28 (UTC+8))
- #33443 [ROCm] AITER fused RoPE+KVCache — rocm — by Rohan138 (created: 2026-01-31 03:27 (UTC+8))
- #33417 fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels — ready,needs-rebase,ci/build,nvidia,quantization — by renehonig (created: 2026-01-30 20:55 (UTC+8))
- #33426 [Bugfix] Fix typo in read_offset variable name — bug,ready,v1 — by bet0x (created: 2026-01-30 23:15 (UTC+8))
- #33407 [W8A8 Block Linear Refactor][2/N] Make FP8 Block Linear Ops use kernel abstraction. — performance,needs-rebase,nvidia — by maralbahari (created: 2026-01-30 18:32 (UTC+8))
- #33454 Rust api server — documentation — by jdw2159 (created: 2026-01-31 08:39 (UTC+8))
- #33452 Support clear mm and encoder cache — v1,meta-exported,fb-exported — by jma99fb (created: 2026-01-31 07:57 (UTC+8))
- #33451 [WIP][Attention] Add FlashInfer Sparse MLA backend — documentation,v1,nvidia — by MatthewBonanni (created: 2026-01-31 07:44 (UTC+8))
- #33446 [Bugfix] Fix Sparse24 Compressed Tensors models — bug,nvidia — by kylesayrs (created: 2026-01-31 05:10 (UTC+8))
- #33449 [Refactor] Remove align block size logic in `moe_permute` — performance,ready — by yewentao256 (created: 2026-01-31 06:16 (UTC+8))
- #33427 [Attention] Clarify comment explaining attn_logits +1 dimension — ready,v1 — by fuscof-ibm (created: 2026-01-30 23:47 (UTC+8))
- #33391 [ModelRunner V2] Fix spec decoding + logprobs — ready,v1 — by njhill (created: 2026-01-30 13:32 (UTC+8))
- #33437 [Misc] skip target model mm emb in draft proposal step when draft is text-only — v1 — by kkt-cohere (created: 2026-01-31 01:28 (UTC+8))
- #33444 [Misc] offest -> offset in comments and variable names — speculative-decoding,v1 — by russellb (created: 2026-01-31 03:45 (UTC+8))
- #33440 pin LMCache to v0.3.9 or greater with vLLM v0.15.0 — ci/build,kv-connector — by Gregory-Pereira (created: 2026-01-31 02:07 (UTC+8))
- #33442 Fix/sm120 rtx blackwell support nvfp4/fp8 moe — nvidia — by mgoin (created: 2026-01-31 02:57 (UTC+8))
- #33438 [Benchmark] Add vllm bench iterations for prefill/decode measurement — performance,frontend,needs-rebase — by jaewonlee-fb (created: 2026-01-31 01:45 (UTC+8))
- #33415 [Voxtral Streaming -> Voxtral Realtime] Rename all voxtral related classes, fn, files — documentation,new-model,ready,multi-modality — by patrickvonplaten (created: 2026-01-30 20:46 (UTC+8))
- #33435 [Metrics] Add prefix cache state metrics for KV cache monitoring — documentation,v1 — by kevkle (created: 2026-01-31 01:18 (UTC+8))
- #33432 fix QERL attention import path — ready,ci-failure — by vkuzo (created: 2026-01-31 00:23 (UTC+8))
- #33436 [BUGFIX] fix Attention import path — bug — by xuechendi (created: 2026-01-31 01:21 (UTC+8))
- #33413 Fix `test_moe.py` for Transformers v5 — ready — by hmellor (created: 2026-01-30 20:11 (UTC+8))
- #33433 [Model Runner V2] support bad_words sampling param — v1 — by izhuhaoran (created: 2026-01-31 00:31 (UTC+8))
- #33431 [Model] Support MiniCPM-o 4.5 — needs-rebase — by tc-mb (created: 2026-01-31 00:14 (UTC+8))
- #33414 [CI] Qwen3-ASR transcriptios tests — ready,qwen — by NickLucche (created: 2026-01-30 20:23 (UTC+8))
- #33395 [KVConnector][LMCache] Enable Support for cross-layer Layout — kv-connector — by Shaoting-Feng (created: 2026-01-30 14:39 (UTC+8))
- #33430 renderer: Fix miselading error message — frontend — by RishabhSaini (created: 2026-01-31 00:14 (UTC+8))
- #33425 [Bugfix] Fix correct error message when `len(prompt) + max_tokens > max_model_len` — bug,frontend — by sducouedic (created: 2026-01-30 22:59 (UTC+8))
- #33393 [BugFix][LoRA] TritonExperts is ModularMoEPath for FP8 models — bug,ready — by dcmaddix (created: 2026-01-30 14:21 (UTC+8))
- #33422 Update context length in error message when prompt + max_tokens exceeds limit — frontend — by kamalesh0406 (created: 2026-01-30 22:47 (UTC+8))
- #33420 [EPLB][BUGFIX] Add `max_num_transfers` config option to EPLB — bug — by SageMoore (created: 2026-01-30 22:03 (UTC+8))
- #33419 [Misc] Algin Qwen3-VL-embedding image example outputs with HF repo example — documentation,qwen — by Isotr0py (created: 2026-01-30 21:49 (UTC+8))
- #33423 [Test] GLM4.7 performance acceleration. — no labels — by liuchen2026fly (created: 2026-01-30 22:51 (UTC+8))
- #33406 [BUGFIX] Pixtral cannot be loaded with --limit-mm-per-prompt 0 — bug — by juliendenize (created: 2026-01-30 18:29 (UTC+8))
- #33408 [Refactor] Move MM data parsing outside processor — ready,v1,multi-modality,llama,qwen — by DarkLight1337 (created: 2026-01-30 18:46 (UTC+8))
- #33410 [Bugfix] Fix `Qwen3ASR` language asr tag in output — bug,frontend,ready,qwen — by NickLucche (created: 2026-01-30 19:36 (UTC+8))
- #33402 Remove deprecated `reasoning_content` message field — frontend,ready,tool-calling,gpt-oss — by hmellor (created: 2026-01-30 17:31 (UTC+8))
- #33409 [Misc] Apply PEP 563 to replace quoted union types with unquoted ones — rocm,structured-output,frontend,ready,v1,multi-modality,cpu,kv-connector,nvidia — by carlory (created: 2026-01-30 18:53 (UTC+8))
- #33388 [Doc] [ROCm] Update Documentation to reflect v0.15.0 release — documentation,rocm,ready — by vllmellm (created: 2026-01-30 11:47 (UTC+8))
- #33392 PR Title: [Feature] EWSJF: Adaptive Scheduler for Mixed-Workloads (30-50% Throughput Gain) — performance,v1 — by sidikbro (created: 2026-01-30 14:10 (UTC+8))
- #33404 [Core][Bugfix] Add logprobs_mode_override capability for request-level fine-grained control over logprob computation — bug,frontend,v1 — by tangcy98 (created: 2026-01-30 17:34 (UTC+8))
- #33405 Extract kv cache update from flashmla sparse — v1 — by Kare0638 (created: 2026-01-30 17:51 (UTC+8))
- #33403 [WIP] pcp alternative impl — rocm,speculative-decoding,v1,nvidia — by LucasWilkinson (created: 2026-01-30 17:32 (UTC+8))
- #33396 [Refactor] Move MM item count validation outside of processor — frontend,ready,v1,multi-modality — by DarkLight1337 (created: 2026-01-30 14:54 (UTC+8))
- #33399 Isolate VLLM_CACHE_ROOT per test to fix torch.compile cache reuse — no labels — by sladyn98 (created: 2026-01-30 16:48 (UTC+8))
- #33394 Support vibevoice asr — documentation,new-model,ci/build,multi-modality — by nemoramo (created: 2026-01-30 14:39 (UTC+8))
- #33389 [bugfix] Solve the accuracy issue of deepseek ocr2 — bug,documentation,new-model,deepseek — by leihuang-sketch (created: 2026-01-30 11:48 (UTC+8))
- #33387 feat(pooling/score): Enhance RerankRequest with optional instruction field and multimodal input support — frontend — by akaihaoshuai (created: 2026-01-30 11:18 (UTC+8))
- #33390 [BUG] relax pirate persona assertion in test_system_prompt_override — bug — by pacoxu (created: 2026-01-30 12:24 (UTC+8))
Merged PRs
- #33362 [Deprecation] Deprecate `seed_everything` and `scatter_mm_placeholders` in v0.15 — ready,v1 — by yewentao256 (merged: 2026-01-31 10:54 (UTC+8))
- #33173 [Quantization][ROCm] Fix MoE weight loading to be robust (Qwen3_MoE/Qwen3_next as example models) — documentation,performance,rocm,ready,v1,qwen,kv-connector — by xuebwang-amd (merged: 2026-01-31 01:50 (UTC+8))
- #33426 [Bugfix] Fix typo in read_offset variable name — bug,ready,v1 — by bet0x (merged: 2026-01-31 09:26 (UTC+8))
- #33366 [Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 — bug,rocm,ready — by gshtras (merged: 2026-01-31 09:05 (UTC+8))
- #33201 Refactor NVFP4 Linear utils for ModelOpt and CT — ready,nvidia,quantization — by mgoin (merged: 2026-01-31 08:37 (UTC+8))
- #33286 [CI][HPU]accelerate hpu test by skip python re-install and clean container name — ready,ci/build — by xuechendi (merged: 2026-01-31 05:36 (UTC+8))
- #32990 Indicate compile mode in the benchmark results — performance,ready,ci/build — by huydhn (merged: 2026-01-31 04:34 (UTC+8))
- #33187 [Realtime API] Adds minimal realtime API based on websockets — documentation,frontend,ready,v1 — by patrickvonplaten (merged: 2026-01-30 18:41 (UTC+8))
- #32437 [Hardware][SM100] Add TRTLLM Kernel for INT4 W4A16 Kernel. — ready,nvidia — by pavanimajety (merged: 2026-01-31 02:30 (UTC+8))
- #33432 fix QERL attention import path — ready,ci-failure — by vkuzo (merged: 2026-01-31 01:29 (UTC+8))
- #32740 [Kernel] [Helion] [1/N] Add Helion ConfigManager — ready — by gmagogsfm (merged: 2026-01-31 01:19 (UTC+8))
- #33236 Fix encoder-decoder model disabling mm processor cache — ready,multi-modality — by hmellor (merged: 2026-01-31 00:30 (UTC+8))
- #33413 Fix `test_moe.py` for Transformers v5 — ready — by hmellor (merged: 2026-01-30 22:03 (UTC+8))
- #33376 fix: allow LFM2 MoE prefix caching (align) — ready — by tianshu-Michael-yu (merged: 2026-01-30 16:23 (UTC+8))
- #33414 [CI] Qwen3-ASR transcriptios tests — ready,qwen — by NickLucche (merged: 2026-01-31 00:17 (UTC+8))
- #33280 Support FP8 block quant for CompressedTensorsW8A16Fp8 — ready,quantization — by mgoin (merged: 2026-01-31 00:15 (UTC+8))
- #32133 [QeRL] Layerwise Reloading — ready,v1 — by kylesayrs (merged: 2026-01-30 23:50 (UTC+8))
- #33393 [BugFix][LoRA] TritonExperts is ModularMoEPath for FP8 models — bug,ready — by dcmaddix (merged: 2026-01-30 23:27 (UTC+8))
- #32561 Disable Cascade Attention for Batch Invariance — ready,v1 — by frankwang28 (merged: 2026-01-30 23:00 (UTC+8))
- #33406 [BUGFIX] Pixtral cannot be loaded with --limit-mm-per-prompt 0 — bug — by juliendenize (merged: 2026-01-30 18:52 (UTC+8))
- #33253 Improve Mistral format checks. — ready — by juliendenize (merged: 2026-01-30 22:23 (UTC+8))
- #32286 [Doc] Enhance documentation around CPU container images — documentation,ready,cpu — by nathan-weinberg (merged: 2026-01-30 21:36 (UTC+8))
- #33323 [Misc] Clean up HIDDEN_DEPRECATED_METRICS after metric removal — ready — by carlory (merged: 2026-01-30 21:31 (UTC+8))
- #33402 Remove deprecated `reasoning_content` message field — frontend,ready,tool-calling,gpt-oss — by hmellor (merged: 2026-01-30 19:48 (UTC+8))
- #33388 [Doc] [ROCm] Update Documentation to reflect v0.15.0 release — documentation,rocm,ready — by vllmellm (merged: 2026-01-30 19:06 (UTC+8))
- #33332 [Misc] Replace Optional[X] with X | None syntax — rocm,frontend,ready,v1,multi-modality,cpu,kv-connector,nvidia — by carlory (merged: 2026-01-30 17:56 (UTC+8))
- #33396 [Refactor] Move MM item count validation outside of processor — frontend,ready,v1,multi-modality — by DarkLight1337 (merged: 2026-01-30 17:27 (UTC+8))
- #33346 [Models] Refactor Kimi-K2.5 weight loading — ready — by Isotr0py (merged: 2026-01-30 13:31 (UTC+8))
- #33372 Explicitly set `return_dict` for `apply_chat_template` — documentation,ready — by hmellor (merged: 2026-01-30 15:27 (UTC+8))
- #33359 Fix `tie_word_embeddings` for multimodal models in Transformers v5 — ready — by hmellor (merged: 2026-01-30 11:37 (UTC+8))
- #32449 [model] Add support for openPangu7B-VL — documentation,new-model,ready — by hujiaxin0 (merged: 2026-01-30 15:54 (UTC+8))
- #33282 [CI] Enable mypy import following for `vllm/spec_decode` — speculative-decoding,ready,v1 — by Lucaskabela (merged: 2026-01-30 14:43 (UTC+8))
- #33352 [BugFix] Disable async scheduling for Mamba prefix caching — bug,ready — by peakcrosser7 (merged: 2026-01-30 12:40 (UTC+8))
- #33239 Move decode context parallel validationn to `ParallelConfig` — ready — by hmellor (merged: 2026-01-30 14:18 (UTC+8))
- #33305 [CI][AMD] Skip 4 GPUs testgroup ray tests — documentation,rocm,ready,ci/build,v1 — by rjrock (merged: 2026-01-30 13:39 (UTC+8))
PRs Closed Without Merge
- #17083 Fix `numel()` downcast in vllm/csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu +2 — ready,needs-rebase,stale — by r-barnes (closed: 2026-01-31 10:17 (UTC+8))
- #25412 [AMD/CI] Amd blocking tests v2 — rocm,ci/build,stale — by Alexei-V-Ivanov-AMD (closed: 2026-01-31 10:16 (UTC+8))
- #32303 [Bugfix][ROCm] Fix invalid spec token placeholders causing embedding lookup failure with async scheduling — bug,rocm,structured-output,v1 — by micah-wil (closed: 2026-01-31 05:40 (UTC+8))
- #33442 Fix/sm120 rtx blackwell support nvfp4/fp8 moe — nvidia — by mgoin (closed: 2026-01-31 03:40 (UTC+8))
- #33436 [BUGFIX] fix Attention import path — bug — by xuechendi (closed: 2026-01-31 01:24 (UTC+8))
- #31849 [CI][Transformers] Fix transformers v5 support: tied weights and packed 3D MoE experts — needs-rebase,v1,multi-modality — by AndreasKaratzas (closed: 2026-01-31 01:09 (UTC+8))
- #33430 renderer: Fix miselading error message — frontend — by RishabhSaini (closed: 2026-01-31 00:16 (UTC+8))
- #33422 Update context length in error message when prompt + max_tokens exceeds limit — frontend — by kamalesh0406 (closed: 2026-01-30 23:22 (UTC+8))
- #33008 [BugFix] KeyError when loading Mistral/vision-enabled checkpoints — bug — by ms1design (closed: 2026-01-30 22:17 (UTC+8))
- #32171 Optimize FlashInfer TRTLLM FP4 MoE quantization #32057 — performance,needs-rebase,nvidia — by baonudesifeizhai (closed: 2026-01-30 20:56 (UTC+8))
- #26504 [Spec-Decode] Add DynamicProposer for per-sequence dynamic speculative decoding — documentation,performance,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality,tool-calling,qwen — by yang926 (closed: 2026-01-30 17:27 (UTC+8))
- #33349 [Reasoning] Add glm47 reasoning parser for GLM-4.7 models — documentation — by QwertyJack (closed: 2026-01-30 16:40 (UTC+8))
- #33394 Support vibevoice asr — documentation,new-model,ci/build,multi-modality — by nemoramo (closed: 2026-01-30 14:39 (UTC+8))
- #33387 feat(pooling/score): Enhance RerankRequest with optional instruction field and multimodal input support — frontend — by akaihaoshuai (closed: 2026-01-30 13:35 (UTC+8))
- #33321 [ROCm] make rocm_aiter_fa support qwen3-next, remove multiple 16 block size support — documentation,rocm,v1,qwen — by ganyi1996ppo (closed: 2026-01-30 11:28 (UTC+8))