vLLM Development Activity Report - 2026-01-26

Time window: 2026-01-26 11:01 (UTC+8) ~ 2026-01-27 11:01 (UTC+8)
Statistics: New Issues 20 | Closed Issues 21 | New PRs 66 | Merged PRs 33 | PRs Closed Without Merging 14

📊 Daily Development Summary

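Between January 26 and 27, 2026, the vLLM community remained highly active, handling 41 Issues (20 opened, 21 closed) and 66 new PRs, with 33 PRs merged in the same window. Development focused on expanding multimodal model support, enhancing and fixing the AMD ROCm platform, and core performance optimization (notably the Mamba SSM and MoE LoRA kernels). The community also held important discussions on simplifying the connector architecture (whether to keep the NCCL Connector) and improving configuration management (protection against deprecated environment variables), reflecting continued attention to system maintainability and user experience.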

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was fairly brisk in this period, concentrated on bug fixes and feature enhancements.

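Issue #33123 - [Bug][ROCm]: Prefix caching produces different output on first request vs subsequent requests on gfx950
- Description: A user found that on AMD gfx950 GPUs (MI350 series), with prefix caching enabled, the output of the first request (cache miss) differs from that of subsequent requests (cache hit). The issue is linked to an internal test that fails on gfx950.
- Technical impact: This directly affects the correctness of prefix caching on gfx950 GPUs and may make generated output unstable. Several AMD employees (@mawong-amd and others) and vLLM ROCm maintainers have been CC'd.
- Status: Open, under investigation.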
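
The symptom described in Issue #33123 can be probed by sending the same prompt several times and comparing the first (cache-miss) output with the later (cache-hit) outputs. The following is a minimal sketch under stated assumptions: decoding is deterministic (for example greedy), and `generate` is a hypothetical callable standing in for whatever client queries the server. It is not a vLLM API and not the internal test mentioned in the issue.

```python
from typing import Callable, List


def find_cache_divergence(generate: Callable[[str], str],
                          prompt: str,
                          repeats: int = 3) -> List[int]:
    """Return indices of repeated requests whose output differs from the first.

    `generate` is a hypothetical stand-in for a deterministic completion call.
    Request 0 is expected to be the cache miss; requests 1..repeats are
    expected to hit the prefix cache.
    """
    baseline = generate(prompt)
    mismatches = []
    for i in range(1, repeats + 1):
        if generate(prompt) != baseline:
            mismatches.append(i)
    return mismatches


class FlakyBackend:
    """Toy backend that mimics the report: first call and later calls disagree."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return "cold" if self.calls == 1 else "warm"


print(find_cache_divergence(FlakyBackend(), "Hello"))        # [1, 2, 3]
print(find_cache_divergence(lambda p: p.upper(), "Hello"))   # []
```

An empty result only shows that the repeated outputs agree with the first one; it does not prove the cached path is numerically identical.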
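PR #33112 - Fix IndexError with encoder-decoder models when using Custom Paged Attention
- Description: Fixes an IndexError raised when running encoder-decoder models (such as Whisper) on the ROCm Attention (Custom Paged Attention) backend, caused by key/value being None during the cross-attention step.
- Technical impact: Improves the ROCm Attention backend's compatibility with encoder-decoder architectures, widening the range of models that run stably on AMD platforms.
- Status: Merged.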
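PR #33106 - [ROCm] Enabling forward_includes_kv_cache on ROCm MHA backends
- Description: Adds support for the forward_includes_kv_cache feature to several ROCm attention backends (RocmAiterUnifiedAttention, RocmAttention, TritonAttention).
- Technical impact: A follow-up to the unified CUDA/ROCm Attention API work (#32335), intended to improve the completeness and performance of attention implementations on AMD platforms.
- Status: Open.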
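PR #33077 - [BUGFIX] Fix hipErrorIllegalState in Qwen3-Omni during startup profiling
- Description: Fixes a crash when initializing Qwen3-Omni on AMD ROCm, where a torch.repeat_interleave operation triggered hipErrorIllegalState during the memory profiling stage. The fix moves the computation involving small tensors to the CPU.
- Technical impact: Removes a startup blocker on AMD hardware for Qwen3-Omni, an important multimodal model.
- Status: Open. The submitter's username is JartX, which does not follow the "xxx-amd" naming pattern, so an AMD affiliation cannot be inferred from the username alone.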

💬 High-Traffic Discussion Analysis

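Issue #33115 - [RFC]: Deprecate NCCL Connector?
- Core topic: Proposes deprecating and removing the NCCL connector to simplify vLLM's connector options (which currently include NIXL, Mooncake, and others).
- Viewpoints:
  - Proposer (@robertgshaw2-redhat): The NCCL connector has few users and has generated a fair number of Issues; a smaller set of options is friendlier to users.
  - Original developer (@Abatom): NCCL is lighter-weight than NIXL and serves as an entry point for users trying out prefill-decode (PD) disaggregation; wants to keep it for further optimization and feature support.
- Point of contention: Balancing feature pruning against backward compatibility and keeping a lightweight option.
- Current status: Discussion is open; the original developer leans toward keeping it.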
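Issue #33096 - [RFC]: Add protection against using deprecated or incorrect environment variables
- Core topic: Proposes adding validation and deprecation warnings for vLLM's many environment variables, so that renamed, replaced, or retired variables do not cause user workflows to fail silently after a version upgrade.
- Community response: Broad support. Core maintainer @robertgshaw2-redhat expressed "strong support"; @simon-mo is supportive but also stressed the need to stop adding more environment variables.
- Likely outcome: The community broadly agrees this is a necessary improvement, and the RFC is expected to lead to an implementation.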
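
The kind of protection the RFC asks for can be illustrated with a startup check that scans the environment for prefixed variables and warns about deprecated or unrecognized names. This is a minimal sketch, not the RFC's design: the registry contents, the variable names, and the `check_env` helper are all hypothetical and chosen only for illustration.

```python
import warnings
from typing import Dict, List, Mapping, Optional, Set

# Hypothetical registry for illustration only; these are not real vLLM variables.
KNOWN_VARS: Set[str] = {"VLLM_EXAMPLE_NEW_FLAG", "VLLM_EXAMPLE_TIMEOUT"}
# Deprecated name -> replacement (None means removed without a replacement).
DEPRECATED_VARS: Dict[str, Optional[str]] = {
    "VLLM_EXAMPLE_OLD_FLAG": "VLLM_EXAMPLE_NEW_FLAG",
    "VLLM_EXAMPLE_REMOVED": None,
}


def check_env(environ: Mapping[str, str], prefix: str = "VLLM_") -> List[str]:
    """Warn about deprecated or unrecognized variables that share the prefix."""
    messages: List[str] = []
    for name in sorted(environ):
        if not name.startswith(prefix):
            continue  # unrelated variables are ignored
        if name in DEPRECATED_VARS:
            new = DEPRECATED_VARS[name]
            hint = f"use {new} instead" if new else "it no longer has any effect"
            messages.append(f"{name} is deprecated; {hint}")
        elif name not in KNOWN_VARS:
            messages.append(f"{name} is not a recognized variable (typo?)")
    for msg in messages:
        warnings.warn(msg, stacklevel=2)
    return messages


# A misspelled name and a deprecated name are both reported; PATH is ignored.
example_env = {"VLLM_EXAMPLE_OLD_FLAG": "1", "VLLM_EXAMPEL_TIMEOUT": "5", "PATH": "/bin"}
for line in check_env(example_env):
    print(line)
```

In a real service the same function would be called once with `os.environ` at startup, so a renamed variable produces a visible warning instead of being silently ignored.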
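Issue #33091 / #33107 - Whisper accuracy issues with FA2+CG+torch.compile
- Core topic: Users report a severe accuracy drop for Whisper (a word error rate of about 134%) under a specific configuration (FlashAttention-2 + CUDA Graph + torch.compile).
- Discussion: Issue #33107 was marked as a duplicate of #33091. Maintainers asked users to test more combinations (for example disabling CUDA Graph, or using Dynamo + piecewise CUDA Graphs) to locate the root cause.
- Current status: The root cause is still being investigated; it may involve the interaction between the FlashAttention version and the compilation pipeline.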
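
A word error rate above 100% may look surprising. WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, so a hypothesis with many extra words can exceed 100%. The sketch below is a generic textbook computation, not the evaluation script used in the issue, and the sample sentences are made up.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the cat sat"))                         # 0.0
# Six inserted words against a three-word reference: 6 / 3 = 200% WER.
print(word_error_rate("the cat sat", "the cat sat on the mat by the door"))  # 2.0
```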

🔥 Hot Topics and Trends

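Multimodal model support keeps expanding:
- Support for the FunAudioChat (audio→text) model is proposed (PR #33058, open).
- Support for the Kimi-K2.5 model is proposed (PR #33131, open).
- MTP (multi-token prediction) support was added for the GLM-OCR model (PR #33005, merged).
- A configuration option to disable image patching (img_patch) was added for the Step-VL model to improve video understanding (PR #32923, merged).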
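Performance optimization and kernel tuning:
- Mamba SSM: an auto-tuning script and config files for the selective_state_update kernel on Blackwell (B200) GPUs (PR #33084, open).
- LoRA MoE: improved indexing and memory access patterns in the Triton kernel for better performance (PR #32770, merged).
- General: removal of the unused _moe_permute function (PR #33108, merged) and optimized tensor allocation for DCP (decode context parallelism) (PR #33102, open).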
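Tool calling and format parsing:
- Several PRs and Issues concern tool-call parsing, for example fixing a JSON parsing error (PR #33085), adding a custom tool template for the MiniMax M2 model to fix a crash caused by parentheses (PR #33087), and adding a dedicated tool parser for Qwen2.5-Coder models (PR #33083). This reflects the community's effort to improve API compatibility across models.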
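
The failure class behind PR #33085 (a JSONDecodeError when one message carries several tool calls) is easy to reproduce with the standard library: `json.loads` rejects two JSON objects placed back to back. One common way to handle such input is `json.JSONDecoder.raw_decode`, which decodes one value and reports where it ended. This is a generic sketch, not the change made in the PR; the payload format and tool names are assumptions for illustration.

```python
import json
from typing import Any, List


def parse_concatenated_json(text: str) -> List[Any]:
    """Decode zero or more JSON values that appear back to back in one string."""
    decoder = json.JSONDecoder()
    values: List[Any] = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here,
        # along with an optional comma separator between values.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


# Hypothetical model output containing two tool calls in a single message.
raw = ('{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
       '{"name": "get_time", "arguments": {"tz": "UTC"}}')

try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    print("json.loads failed:", exc.msg)  # json.loads failed: Extra data

calls = parse_concatenated_json(raw)
print([c["name"] for c in calls])  # ['get_weather', 'get_time']
```

Malformed input still raises `json.JSONDecodeError` from `raw_decode`, so callers keep a single error path.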

🛠️ Key Technical Changes

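Issue #33096 (RFC): Deprecation protection for environment variables: An important user-experience proposal. Adding validation and warnings can substantially reduce hidden failures caused by version upgrades and improve vLLM's maintainability in production.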
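PR #32873 / #33084: Mamba SSM kernel performance tuning: Optimizations targeting the new Blackwell architecture (B200) and the Nemotron Nano model, showing vLLM keeping pace with new hardware and continuing to extract performance. PR #32873 is merged; PR #33084 is still open.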
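PR #33112: Encoder-decoder fix for the ROCm Attention backend: A small change, but it removes a key obstacle to running models such as Whisper on AMD platforms and helps broaden vLLM's applicability across heterogeneous hardware.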
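PR #33126 / #33136: Journey event tracing integration: Part of a series of 9 PRs integrating OpenTelemetry tracing, intended to emit request lifecycle events directly to core spans. The goal is better observability for performance analysis and debugging. Both PRs were closed without merging during this window and carry the needs-rebase label.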

📈 Development Activity Observations

Merge throughput: 33 PRs were merged within 24 hours, indicating that the core team reviews and integrates code quickly.
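Contributor diversity: Beyond the core team (e.g. @robertgshaw2-redhat, @DarkLight1337, @WoosukKwon), contributions came from hardware-platform contributors (e.g. @JartX on ROCm, @fadara01 on Arm CPUs) and from many independent developers.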
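AMD ecosystem responsiveness: For AMD platform issues (such as #33123), the community bot automatically CC'd the relevant maintainers, and an AMD employee (@mawong-amd) has been linked to the related fix PR, suggesting a working response channel for this platform.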

💡 Issues Worth Watching

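The future of the NCCL connector (Issue #33115): A decision that involves both architectural simplification and user impact. The community needs more user feedback before deciding whether to deprecate it.
GPT OSS 20B throws an error when used as the draft model (Issue #33133): A KV cache group assertion error occurs when the same model is used for speculative decoding, which may affect the usability of this performance feature.
FP8 performance regression for MoE models (Issue #33128): A user reports that the FP8 version of a Qwen3 MoE model runs significantly slower on newer vLLM versions (0.12.0 through 0.14.0) than on 0.11.0. This needs the core team's attention and may involve a regression in core operators such as DeepGEMM.
Transformers 5.0 upgrade (Issue #33132 & PR #33100): Hugging Face Transformers has released major version 5.0. vLLM has begun updating its test registry (PR #33100), but a full dependency upgrade may take time, so compatibility needs watching.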

📋 Appendix: Detailed Data Lists

New Issues

#33053 [Bug]: LongCat-Flash-Thinking-2601 still appears to be unsupported — bug — by jambow0320 (created: 2026-01-26 11:34 (UTC+8))
#33137 [Usage]: How to disable cascaded attention mechanism in vLLM version 0.7.3 — usage — by berlintw113 (created: 2026-01-27 10:52 (UTC+8))
#33134 [Bug]: try to run vllm bench with “throughput”, got error. — bug — by liyuerich (created: 2026-01-27 10:44 (UTC+8))
#33133 [Bug]: Using GPT OSS 20B as Drafter Throws Error — bug — by singh-git10 (created: 2026-01-27 10:36 (UTC+8))
#33132 Bump transformers to 5.0.0 — no labels — by NiuBlibing (created: 2026-01-27 10:05 (UTC+8))
#33115 [RFC]: Deprecate NCCL Connector? — RFC — by robertgshaw2-redhat (created: 2026-01-27 05:09 (UTC+8))
#33128 [Bug]: From version 0.12.0 to the latest version 0.14.0 of vllm, running moe model of fp8 with deepgemm will be slower than 0.11.0. — bug — by piekey1994 (created: 2026-01-27 09:11 (UTC+8))
#33091 [Bug]: Whisper accuracy issue with FA2+CG+torch.compile — bug,torch.compile — by NickLucche (created: 2026-01-26 20:57 (UTC+8))
#33123 [Bug][ROCm]: Prefix caching produces different output on first request (cache miss) vs subsequent requests (cache hit) on gfx950 — bug,rocm — by AndreasKaratzas (created: 2026-01-27 07:38 (UTC+8))
#33096 [RFC]: Add protection against using deprecated or incorrect environment variables — RFC — by gshtras (created: 2026-01-26 23:43 (UTC+8))
#33118 [RFC]: Hidden States Extraction — RFC — by fynnsu (created: 2026-01-27 05:30 (UTC+8))
#33092 [Accessory/Distributed] KV Transfer crashes with short prompts when chunked prefill is enabled — bug — by Yaoyikang (created: 2026-01-26 21:41 (UTC+8))
#33097 [Feature]: Fuse FP8 output quantization into merge_attn_states (DCP / cascade paths) — feature request — by sachinkumarsingh092 (created: 2026-01-26 23:47 (UTC+8))
#33107 [Bug]: Whisper large-v3 accuracy degradation in vLLM 0.14.1 (134.56% WER) on L40S - works fine in 0.12.0 — bug — by sayalibhavsar (created: 2026-01-27 02:06 (UTC+8))
#33099 [Bug]: vllm Requests stuck indefinitely — bug — by joules-zapata-pfh (created: 2026-01-26 23:59 (UTC+8))
#33094 [CI Failure]: Entrypoints Test (Responses) — ci-failure — by robertgshaw2-redhat (created: 2026-01-26 22:32 (UTC+8))
#33089 [Feature]: Support multi-turn conversation for OpenAI Response API — feature request — by kcelia (created: 2026-01-26 19:59 (UTC+8))
#33075 [Usage]: Can not disable the thinking mode of deepseek-r1-0528 with request-level chat_template_kwargs — usage — by xlouba (created: 2026-01-26 17:39 (UTC+8))
#33074 [Bug]: PD report DeepseekV32 AssertionError: num_kv_heads == 1 — bug — by nicole-lihui (created: 2026-01-26 17:36 (UTC+8))
#33072 [Usage]: How to enable API Server in the worker nodes? When deploy with DP2. — usage — by MaoJianwei (created: 2026-01-26 17:12 (UTC+8))

Closed Issues

#15781 [Feature]: Implementation of Zero Overhead Scheduling — feature request,stale — by lizhigong1988 (closed: 2026-01-27 10:18 (UTC+8))
#17639 [Bug]: Unable to run Qwen3 on Turing GPUs after upgrading to torch 2.7.0 — bug,stale — by sgsdxzy (closed: 2026-01-27 10:18 (UTC+8))
#18121 [Usage]: Set the kv cache size — usage,stale — by FeliceSchena (closed: 2026-01-27 10:17 (UTC+8))
#19793 [Usage]: ValueError: The checkpoint you are trying to load has model type qwen3_moe but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date. — usage,stale — by Estella31 (closed: 2026-01-27 10:17 (UTC+8))
#19861 [Bug]: max_completion_tokens doesn’t work as max — bug,stale — by cfytrok3 (closed: 2026-01-27 10:17 (UTC+8))
#20007 [Bug]: error on ROCM :0:rocdevice.cpp :3020: 5875275709338d us: Callback: Queue 0x7f4580c00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016 — bug,stale — by shangshng (closed: 2026-01-27 10:17 (UTC+8))
#25549 [Bug]: qwen3-next tool_call error invalid output: !!!!!!!! — bug,stale — by fengyang95 (closed: 2026-01-27 10:16 (UTC+8))
#25576 [Doc]: https://docs.vllm.ai/en/v0.6.5/getting_started/installation.html — documentation,stale — by cjg-aussee-ai (closed: 2026-01-27 10:16 (UTC+8))
#25793 VLLM offline inference raise except when using qianfan-vl — bug,stale — by cqray1990 (closed: 2026-01-27 10:16 (UTC+8))
#29536 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 — ci-failure — by AndreasKaratzas (closed: 2026-01-27 08:42 (UTC+8))
#32972 [CI Failure]: mi325_8: Kernels Attention Test %N — ci-failure — by AndreasKaratzas (closed: 2026-01-27 08:42 (UTC+8))
#32919 [Bug]: Memory fault access when serving DeepSeek-R1-0528 with mori-ep + concurrency of 128 / 256 — bug,rocm — by junkang1991 (closed: 2026-01-27 08:33 (UTC+8))
#33041 [Bug]: vLLM hangs after NCCL init with TP=2 on Blackwell GPUs (CUDA 13.0, NCCL 2.27.7) — bug — by robertjmcintyre (closed: 2026-01-27 07:19 (UTC+8))
#29982 [Feature]: Option to Disable Metrics Health-Check Logging — feature request — by averyngo34 (closed: 2026-01-27 05:49 (UTC+8))
#33029 [CI Failure]: MoE Integration Tests — ci-failure — by robertgshaw2-redhat (closed: 2026-01-27 04:09 (UTC+8))
#33094 [CI Failure]: Entrypoints Test (Responses) — ci-failure — by robertgshaw2-redhat (closed: 2026-01-26 23:31 (UTC+8))
#33050 [Bug]: [DeepSeek-V3.2] PD Can’t instantiate abstract class DeepseekV32IndexerBackend without an implementation for abstract method ‘get_impl_cls’ — bug — by chaunceyjiang (closed: 2026-01-26 22:44 (UTC+8))
#32901 [Feature]: make thinking parameter consistent with openai api — feature request — by ltm920716 (closed: 2026-01-26 21:59 (UTC+8))
#32758 [Feature]: support lora for Gemma3ForConditionalGeneration — help wanted,feature request — by WuChuYi (closed: 2026-01-26 21:56 (UTC+8))
#32309 [Usage]: Unable to pass precomputed image embeddings to vLLM with Qwen3-VL — usage — by hoangtd-asilla (closed: 2026-01-26 15:00 (UTC+8))
#33028 [CI Failure]: MultiModal Tests — ci-failure — by robertgshaw2-redhat (closed: 2026-01-26 11:49 (UTC+8))

New PRs

#33136 [Feature] Emit journey events to core spans (PR #4/9) — documentation,needs-rebase,ci/build,v1 — by sriumcp (created: 2026-01-27 10:50 (UTC+8))
#33110 [Bugfix] Early-reject requests with MM data longer than encode cache capacity — bug,v1 — by YunzhuLu (created: 2026-01-27 02:56 (UTC+8))
#33131 [Models] Kimi-K2.5 — documentation,new-model,frontend,ready,multi-modality — by ywang96 (created: 2026-01-27 10:03 (UTC+8))
#33135 [code clean] remove duplicate code — structured-output,v1 — by andyxning (created: 2026-01-27 10:48 (UTC+8))
#33100 [Model] Bump transformers version for test registry — ready,v1 — by DarkLight1337 (created: 2026-01-27 00:15 (UTC+8))
#33058 Adds FunAudioChat multimodal audio model support (#2) — documentation,new-model,multi-modality — by nemoramo (created: 2026-01-26 12:28 (UTC+8))
#33103 [Frontend] Cleanup serving engine — frontend,ready — by DarkLight1337 (created: 2026-01-27 00:55 (UTC+8))
#33112 Fix IndexError with encoder-decoder models when using Custom Paged Attention — rocm,ready,v1 — by sstamenk (created: 2026-01-27 03:02 (UTC+8))
#33113 [torch.compile] Stop assuming 32 bit indexing — ready — by zou3519 (created: 2026-01-27 03:49 (UTC+8))
#33130 [Quantization][Refactor] use platform dict to choose kernel — no labels — by zufangzhu (created: 2026-01-27 09:49 (UTC+8))
#33064 [Perf] avoid duplicate mem_get_info() call in get_current_memory_usage — rocm,ready — by pacoxu (created: 2026-01-26 15:14 (UTC+8))
#33129 [release] Minor fixes to release annotation and wheel upload — ci/build — by khluu (created: 2026-01-27 09:35 (UTC+8))
#33127 [Kernel] adding native nccl4py support — performance,ci/build,kv-connector,nvidia — by pkousha (created: 2026-01-27 08:39 (UTC+8))
#33070 [Bugfix] (grpc): improve GetServerInfo response consistency and accuracy — bug,frontend — by pacoxu (created: 2026-01-26 16:06 (UTC+8))
#33101 [Frontend] Reduce mixin usage in serving pooling — frontend,ready — by DarkLight1337 (created: 2026-01-27 00:33 (UTC+8))
#33055 [Model Runner V2] Remove UvaBufferPool for cpu->gpu copy — v1 — by WoosukKwon (created: 2026-01-26 11:59 (UTC+8))
#33059 [Model Runner V2] Use a different stream for grammar bitmask h2d copy — needs-rebase,v1 — by WoosukKwon (created: 2026-01-26 13:27 (UTC+8))
#33076 Support compress-tensors with nvfp4 or fp8 weights and modelopt with nvfp4 weights on Turing — ready — by ir1ka (created: 2026-01-26 17:47 (UTC+8))
#33126 [Feature] Add journey state cleanup to scheduler (PR #3/9) — documentation,needs-rebase,ci/build,v1 — by sriumcp (created: 2026-01-27 08:28 (UTC+8))
#33125 [Core] Optimize SWA KV cache management for prefix caching — v1 — by jaewonlee-fb (created: 2026-01-27 08:19 (UTC+8))
#33124 [Model] GPT-OSS: Use layer_types config for sliding window selection — gpt-oss — by jaewonlee-fb (created: 2026-01-27 07:59 (UTC+8))
#33086 [Bugfix] Fix DeepseekV32 AssertionError: num_kv_heads == 1 — bug,v1,deepseek — by chaunceyjiang (created: 2026-01-26 18:56 (UTC+8))
#33116 [CI/Build][BugFix] fix cuda/compat loading order issue in docker build — bug,ready,ci/build,nvidia — by wpc (created: 2026-01-27 05:13 (UTC+8))
#33114 [Draft / POC] Websocket realtime api — frontend — by patrickvonplaten (created: 2026-01-27 04:08 (UTC+8))
#33104 [Bugfix][MXFP4] Call trtllm_fp4_block_scale_moe with kwargs — bug,ready — by wpc (created: 2026-01-27 01:12 (UTC+8))
#33122 [CPU][Feat] Enable KleidiAI accelerated int4 dynamic quant with BF16 activations on Arm CPUs — no labels — by fadara01 (created: 2026-01-27 06:20 (UTC+8))
#33121 Fix grammar — v1,cpu — by smashyalts (created: 2026-01-27 06:05 (UTC+8))
#33120 [Config] Add explicit bind flag for KV events — no labels — by alec-flowers (created: 2026-01-27 05:55 (UTC+8))
#33109 [DOC]: Add warning about max_num_batched_tokens and max_model_len when chunked prefill is disabled — documentation — by VincentG1234 (created: 2026-01-27 02:55 (UTC+8))
#33082 [Doc] Improve serve parameter documentation with meaningful defaults — documentation — by karanb192 (created: 2026-01-26 18:43 (UTC+8))
#33119 [MoE Refactor] Only use RouterMethodType for flashinfer — nvidia — by bnellnm (created: 2026-01-27 05:32 (UTC+8))
#33117 [BugFix] p2p_nccl_connector: Skip non-chunked requests — bug,kv-connector — by RishabhSaini (created: 2026-01-27 05:15 (UTC+8))
#33108 [Refactor] Remove unused _moe_permute function — performance,ready — by yewentao256 (created: 2026-01-27 02:21 (UTC+8))
#33080 [ci] Sync test areas with test-pipeline.yaml and enable new pipeline generator — ci/build — by khluu (created: 2026-01-26 18:30 (UTC+8))
#33105 [Feature] (grpc): add standard gRPC health checking protocol for Kubernetes native probes — frontend,ci/build — by V2arK (created: 2026-01-27 01:20 (UTC+8))
#33088 [Bugfix] Use ‘sum’ reduction instead of ‘avg’ in Async TP reduce-scatter — bug — by wangxingran222 (created: 2026-01-26 19:32 (UTC+8))
#33111 Add EAGLE3 support for AFMoE — no labels — by AutumnAurelium (created: 2026-01-27 03:01 (UTC+8))
#33073 [Bugfix] Fix Voxtral streaming slot_mapping — bug,ready — by NickLucche (created: 2026-01-26 17:26 (UTC+8))
#33077 [BUGFIX] Fix hipErrorIllegalState in Qwen3-Omni during startup profiling allow inference Omni on ROCM — bug,rocm,qwen — by JartX (created: 2026-01-26 17:56 (UTC+8))
#33106 [ROCm] Enabling forward_includes_kv_cache on ROCm MHA backends — rocm,v1 — by gshtras (created: 2026-01-27 01:30 (UTC+8))
#33095 Remove unused logic in models/mistral.py — ready — by andylolu2 (created: 2026-01-26 22:48 (UTC+8))
#33102 [Perf] Optimize dcp allocate tensor — ready,v1 — by yewentao256 (created: 2026-01-27 00:43 (UTC+8))
#33098 [CI] Whisper tests enforce_eager=False — ready — by NickLucche (created: 2026-01-26 23:49 (UTC+8))
#33084 [Feature] Add tuning script and config files for Mamba SSM — performance,needs-rebase — by karanb192 (created: 2026-01-26 18:50 (UTC+8))
#33093 [CI] Fix AssertionError: MCP tool call not found in output_messages — ready — by chaunceyjiang (created: 2026-01-26 21:48 (UTC+8))
#33078 [BugFix] Add synchronize in CutlassW4A8LinearKernel to ensure data is ready for use. — bug,nvidia — by ayrnb (created: 2026-01-26 17:57 (UTC+8))
#33052 [Bugfix] Fix Can’t instantiate abstract class DeepseekV32IndexerBackend — bug,ready,v1,deepseek,kv-connector — by chaunceyjiang (created: 2026-01-26 11:13 (UTC+8))
#33085 [Bugfix] Fix JSONDecodeError when multiple tool_calls in single message — bug,frontend — by karanb192 (created: 2026-01-26 18:56 (UTC+8))
#33087 [Bugfix][Tool Parser] Fix crash when parameter description contains parentheses — bug — by karanb192 (created: 2026-01-26 18:57 (UTC+8))
#33061 [KV MultiConnector]: Add out-of-band handshake metadata get/set functions — kv-connector — by snadampal (created: 2026-01-26 13:58 (UTC+8))
#33063 [Chore] Update type annotation of input_ids in model forward — documentation,speculative-decoding,ready,llama,qwen,deepseek,gpt-oss — by DarkLight1337 (created: 2026-01-26 14:57 (UTC+8))
#33079 [CPU][ARM] Add ARM BF16 cross-compilation support and improve documen… — documentation,ci/build,cpu — by maryamtahhan (created: 2026-01-26 18:26 (UTC+8))
#33090 [Bugfix] Fix DeepseekV32 AssertionError: num_kv_heads == 1 — bug,deepseek,kv-connector — by NickLucche (created: 2026-01-26 20:12 (UTC+8))
#33066 Enhancerequesttracing — documentation,frontend,needs-rebase,v1 — by sriumcp (created: 2026-01-26 15:43 (UTC+8))
#33071 [Docs] Simplify CPU x86 Docker build documentation — documentation,cpu — by maryamtahhan (created: 2026-01-26 17:08 (UTC+8))
#33065 [Doc] Further update multi-modal impl doc — documentation,ready — by DarkLight1337 (created: 2026-01-26 15:31 (UTC+8))
#33083 [Feature][Tool Parser] Add dedicated parser for Qwen2.5-Coder models — qwen — by karanb192 (created: 2026-01-26 18:46 (UTC+8))
#33081 [Bugfix] Fix remove_instance logic error in disagg_proxy_demo.py — bug,documentation,kv-connector — by karanb192 (created: 2026-01-26 18:43 (UTC+8))

#33060 [Frontend][5/n] Make pooling entrypoints request schema consensus | ScoreRequest — frontend — by noooop (created: 2026-01-26 13:53 (UTC+8))

#33067 test — documentation,new-model — by Emilie1001 (created: 2026-01-26 15:48 (UTC+8))
#33054 [StepVL] add step vl offline example — documentation — by ltd0924 (created: 2026-01-26 11:53 (UTC+8))
#33069 [Doc] Add back the instruction: install required python3-dev apt package — documentation,cpu — by jasonyanwenl (created: 2026-01-26 16:01 (UTC+8))
#33068 fix: Add infinite loop detection for multimodal models (e.g., PaddleOCR-VL) — v1 — by bellkjtt (created: 2026-01-26 15:53 (UTC+8))
#33062 [Model Runner V2] Add LoRAState to consolidate lora logic — v1 — by WoosukKwon (created: 2026-01-26 14:11 (UTC+8))
#33057 Bump actions/checkout from 6.0.1 to 6.0.2 — ci/build,github_actions,dependencies — by dependabot (created: 2026-01-26 12:27 (UTC+8))
#33056 Bump actions/setup-python from 6.1.0 to 6.2.0 — ci/build,github_actions,dependencies — by dependabot (created: 2026-01-26 12:27 (UTC+8))

Merged PRs

#33100 [Model] Bump transformers version for test registry — ready,v1 — by DarkLight1337 (merged: 2026-01-27 02:53 (UTC+8))
#33112 Fix IndexError with encoder-decoder models when using Custom Paged Attention — rocm,ready,v1 — by sstamenk (merged: 2026-01-27 10:33 (UTC+8))
#32768 fix: preserve native tool call ID in multi-turn tool calling — frontend,ready — by wangln19 (merged: 2026-01-27 10:22 (UTC+8))
#32567 [MoE Refactor] Integrate Naive Prepare Finalize into MK — documentation,performance,rocm,ready,ci/build,llama,cpu,nvidia,ready-run-all-tests — by robertgshaw2-redhat (merged: 2026-01-27 09:28 (UTC+8))
#33055 [Model Runner V2] Remove UvaBufferPool for cpu->gpu copy — v1 — by WoosukKwon (merged: 2026-01-27 08:47 (UTC+8))
#32908 [Bugfix][TPU] Return a Default fp8 MoE Backend — bug,ready — by vanbasten23 (merged: 2026-01-27 07:46 (UTC+8))
#33104 [Bugfix][MXFP4] Call trtllm_fp4_block_scale_moe with kwargs — bug,ready — by wpc (merged: 2026-01-27 07:13 (UTC+8))
#32913 [fix] CPUDNNLGEMMHandler pointer baked into inductor artifact — ready,cpu — by dolpm (merged: 2026-01-27 05:59 (UTC+8))
#30011 [feat][log]: add --disable-access-log-for-endpoints CLI option — documentation,frontend,ready — by JaredforReal (merged: 2026-01-27 05:49 (UTC+8))
#33108 [Refactor] Remove unused _moe_permute function — performance,ready — by yewentao256 (merged: 2026-01-27 05:06 (UTC+8))
#33080 [ci] Sync test areas with test-pipeline.yaml and enable new pipeline generator — ci/build — by khluu (merged: 2026-01-27 04:28 (UTC+8))
#33030 [Bugfix] Fix Dtypes for Pynccl Wrapper — bug,ready,nvidia — by robertgshaw2-redhat (merged: 2026-01-27 04:09 (UTC+8))
#33073 [Bugfix] Fix Voxtral streaming slot_mapping — bug,ready — by NickLucche (merged: 2026-01-27 02:40 (UTC+8))
#31099 [FIX] Always support TP > 4 for FP4 Gemm — ready — by danielafrimi (merged: 2026-01-27 02:04 (UTC+8))
#33095 Remove unused logic in models/mistral.py — ready — by andylolu2 (merged: 2026-01-27 01:01 (UTC+8))
#33093 [CI] Fix AssertionError: MCP tool call not found in output_messages — ready — by chaunceyjiang (merged: 2026-01-26 23:19 (UTC+8))
#33018 [ROCm][Bugfix] Fix ptpc scale load issue for fused shared expert path in deepseek mtp — bug,rocm,ready,deepseek — by ganyi1996ppo (merged: 2026-01-26 23:19 (UTC+8))
#33052 [Bugfix] Fix Can’t instantiate abstract class DeepseekV32IndexerBackend — bug,ready,v1,deepseek,kv-connector — by chaunceyjiang (merged: 2026-01-26 22:44 (UTC+8))
#33005 [GLM-OCR] GLM-OCR with MTP Support — documentation,new-model,speculative-decoding,ready,v1,multi-modality — by zRzRzRzRzRzRzR (merged: 2026-01-26 22:24 (UTC+8))
#33063 [Chore] Update type annotation of input_ids in model forward — documentation,speculative-decoding,ready,llama,qwen,deepseek,gpt-oss — by DarkLight1337 (merged: 2026-01-26 22:02 (UTC+8))
#32873 [Performance] Tune Mamba selective scan kernel for B200 — ready — by danisereb (merged: 2026-01-26 21:56 (UTC+8))
#32764 [Feature] Add LoRA support for Gemma3 vision components — ready — by vihaan-that (merged: 2026-01-26 21:56 (UTC+8))
#20320 [Misc] HF Hub LoRA Resolver — documentation,frontend,ready,ci/build,unstale — by alex-jw-brooks (merged: 2026-01-26 21:56 (UTC+8))
#33010 [Model] Use mm_position to compute mrope positions for Qwen3-Omni — documentation,ready,qwen — by Etelis (merged: 2026-01-26 21:48 (UTC+8))
#32770 [lora/moe] Improve fused MoE‑LoRA kernel indexing and memory access — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by cwazai (merged: 2026-01-26 20:56 (UTC+8))
#33065 [Doc] Further update multi-modal impl doc — documentation,ready — by DarkLight1337 (merged: 2026-01-26 18:54 (UTC+8))
#33054 [StepVL] add step vl offline example — documentation — by ltd0924 (merged: 2026-01-26 17:00 (UTC+8))
#32955 [Refactor] Use data parser for matching data items to multi-modal UUIDs — documentation,frontend,ready,v1,multi-modality,qwen — by DarkLight1337 (merged: 2026-01-26 15:00 (UTC+8))
#32882 Set splitk=1 for fused-moe-lora expand kernel — ready — by dcmaddix (merged: 2026-01-26 14:52 (UTC+8))
#33062 [Model Runner V2] Add LoRAState to consolidate lora logic — v1 — by WoosukKwon (merged: 2026-01-26 14:21 (UTC+8))
#33032 [Tests] Remove Duplicates — ready,nvidia — by robertgshaw2-redhat (merged: 2026-01-26 13:23 (UTC+8))
#32923 [StepVL] support close img patch — ready — by ltd0924 (merged: 2026-01-26 12:56 (UTC+8))
#33033 [CI] Fix MHA attention test failure (AttributeError when model_config is None in ViT attention backend) — ready — by LucasWilkinson (merged: 2026-01-26 11:49 (UTC+8))

PRs Closed Without Merging

#33136 [Feature] Emit journey events to core spans (PR #4/9) — documentation,needs-rebase,ci/build,v1 — by sriumcp (closed: 2026-01-27 11:00 (UTC+8))
#24579 [Chore] Add E2E test for ‘required’ tool_choice with streaming — frontend,stale,tool-calling — by tlipoca9 (closed: 2026-01-27 10:16 (UTC+8))
#24832 [CI/Build][gpt-oss]: Browser Tool Test Enhancement — frontend,stale,gpt-oss — by simondanielsson (closed: 2026-01-27 10:16 (UTC+8))
#24859 feat: enable vLLM inference for pruned Qwen3 models — stale,qwen — by wangwenmingaa (closed: 2026-01-27 10:16 (UTC+8))
#33126 [Feature] Add journey state cleanup to scheduler (PR #3/9) — documentation,needs-rebase,ci/build,v1 — by sriumcp (closed: 2026-01-27 08:32 (UTC+8))
#32080 Add support for compressed-tensors NVFP4 in non-gated MoE layers #31782 — needs-rebase — by baonudesifeizhai (closed: 2026-01-27 05:12 (UTC+8))
#32622 [Feature] Enable flashinfer moe fp4 by default — ready — by yewentao256 (closed: 2026-01-27 04:54 (UTC+8))
#30175 [Frontend] Add --uvicorn-access-log-exclude-paths option — frontend — by GeoffreyWang1117 (closed: 2026-01-27 02:38 (UTC+8))
#33007 [BugFix] KV cache layout for raw-copy KV connectors in disaggregated mode — bug,v1,kv-connector — by thjung123 (closed: 2026-01-26 21:16 (UTC+8))
#32890 Testing vllm/issues/32718 — no labels — by RishabhSaini (closed: 2026-01-26 22:16 (UTC+8))
#32639 [Bugfix]: Enable Convolution layer custom ops by default — bug — by Isotr0py (closed: 2026-01-26 21:08 (UTC+8))
#33066 Enhancerequesttracing — documentation,frontend,needs-rebase,v1 — by sriumcp (closed: 2026-01-26 19:10 (UTC+8))
#33067 test — documentation,new-model — by Emilie1001 (closed: 2026-01-26 16:04 (UTC+8))
#32405 [Hardware][RISCV] Add RISC-V RVV backend support for CPU executor — needs-rebase,ci/build,cpu — by hansu2022 (closed: 2026-01-26 15:55 (UTC+8))