vLLM Development Activity Report - 2026-01-26
Time window: 2026-01-26 11:01 (UTC+8) to 2026-01-27 11:01 (UTC+8). Totals: 20 new issues | 21 closed issues | 66 new PRs | 33 merged PRs | 14 PRs closed without merging
📊 Daily Development Summary
Between January 26 and 27, 2026, the vLLM community remained highly active, handling 41 issues (20 opened, 21 closed) and 66 PRs (33 merged). Development focused on expanding multimodal model support, enhancing and fixing the AMD ROCm platform, and core performance optimization (particularly the Mamba SSM and MoE LoRA kernels). The community also held substantive discussions on simplifying the connector architecture (whether to keep the NCCL Connector) and improving configuration management (protection against deprecated environment variables), reflecting sustained attention to maintainability and user experience.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was brisk this period, centered on bug fixes and feature enhancements.
- Issue #33123 - [Bug][ROCm]: Prefix caching produces different output on first request vs subsequent requests on gfx950
  - Description: On AMD gfx950 (MI300-series) GPUs with prefix caching enabled, the first request (cache miss) and subsequent requests (cache hit) produce different outputs. The issue is linked to an internal test failing on gfx950.
  - Impact: Directly affects the correctness of prefix caching on AMD MI300-series GPUs and can make generation results unstable. Several AMD engineers (@mawong-amd and others) and vLLM ROCm maintainers have been CC'd.
  - Status: open, under investigation.
- PR #33112 - Fix IndexError with encoder-decoder models when using Custom Paged Attention
  - Description: Fixes an IndexError raised when running encoder-decoder models (e.g. Whisper) on the ROCm Attention (Custom Paged Attention) backend, caused by key/value being None during the cross-attention step.
  - Impact: Improves the ROCm Attention backend's compatibility with encoder-decoder architectures, broadening the set of models that run reliably on AMD hardware.
  - Status: merged.
- PR #33106 - [ROCm] Enabling forward_includes_kv_cache on ROCm MHA backends
  - Description: Adds support for the forward_includes_kv_cache feature to several ROCm attention backends (RocmAiterUnifiedAttention, RocmAttention, TritonAttention).
  - Impact: A follow-up to the unified CUDA/ROCm attention API work (#32335), improving the completeness and performance of the AMD attention implementations.
  - Status: open.
- PR #33077 - [BUGFIX] Fix hipErrorIllegalState in Qwen3-Omni during startup profiling
  - Description: Fixes a crash during memory profiling when initializing Qwen3-Omni on AMD ROCm, where a torch.repeat_interleave call triggered hipErrorIllegalState. The fix moves the small-tensor computation involved to the CPU.
  - Impact: Removes a startup blocker for Qwen3-Omni, an important multimodal model, on AMD hardware.
  - Status: open; submitted by JartX.
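The workaround in PR #33077 follows a common pattern: when a small-tensor op misbehaves on the accelerator, run it on the CPU and transfer the result back. A minimal sketch of that pattern; the `Tensor` class and `repeat_interleave` below are simplified stand-ins for illustration, not the torch API or the actual vLLM code:

```python
class Tensor:
    """Minimal stand-in for a device-aware tensor (not torch)."""
    def __init__(self, data, device="cuda:0"):
        self.data, self.device = data, device

    def cpu(self):
        return Tensor(self.data, "cpu")

    def to(self, device):
        return Tensor(self.data, device)


def repeat_interleave(t: Tensor, repeats: int) -> Tensor:
    # Stand-in for torch.repeat_interleave on a 1-D tensor.
    return Tensor([x for x in t.data for _ in range(repeats)], t.device)


def safe_repeat_interleave(t: Tensor, repeats: int) -> Tensor:
    # Run the op on CPU, then move the result back to the original
    # device, sidestepping the device-side failure described above.
    return repeat_interleave(t.cpu(), repeats).to(t.device)
```

For tensors with only a handful of elements, the extra host-device round trip is negligible compared to a hard crash during startup profiling.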
💬 High-Activity Discussions
- Issue #33115 - [RFC]: Deprecate NCCL Connector?
  - Core question: proposes deprecating and removing the NCCL connector to simplify vLLM's connector options (which currently include NIXL, Mooncake, and others).
  - Positions:
    - Proposer (@robertgshaw2-redhat): the NCCL connector has few users and has generated a fair number of issues; a smaller option set is friendlier to users.
    - Original author (@Abatom): NCCL is lighter-weight than NIXL and serves as an entry point for users trying prefill-decode (PD) disaggregation; he would like to keep it for further optimization and feature work.
  - Point of contention: balancing feature pruning against backward compatibility and keeping a lightweight option available.
  - Status: discussion open; the original author leans toward keeping it.
- Issue #33096 - [RFC]: Add protection against using deprecated or incorrect environment variables
  - Core question: proposes adding validation and deprecation warnings for vLLM's many environment variables, so that renamed, replaced, or removed variables do not silently break user workflows after a version upgrade.
  - Community response: broad support. Core maintainer @robertgshaw2-redhat voiced "strong support"; @simon-mo supports it but stressed the need to stop adding more environment variables.
  - Outlook: the community agrees this is a necessary improvement, and the RFC is expected to move toward implementation.
- Issue #33091 / #33107 - Whisper accuracy issues with FA2+CG+torch.compile
  - Core question: users report a severe accuracy regression for Whisper (a word error rate of 134%) under a specific configuration: FlashAttention-2 + CUDA Graph + torch.compile.
  - Discussion: issue #33107 was marked as a duplicate of #33091. Maintainers asked users to test more combinations (e.g. disabling CUDA Graph, or using Dynamo + piecewise CUDA Graphs) to narrow down the root cause.
  - Status: root cause still under investigation; it may involve an interaction between the FlashAttention version and the compilation pipeline.
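A word error rate above 100%, as reported here, is possible because WER divides the word-level edit distance by the reference length, so a hypothesis with many insertions can accumulate more errors than the reference has words. A minimal sketch of the standard computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edits to turn the ref words seen so far into hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete a reference word
                cur[j - 1] + 1,          # insert a hypothesis word
                prev[j - 1] + (r != h),  # substitute or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

For example, `wer("a b", "a x y z")` is 1.5 (one substitution plus two insertions against a two-word reference), i.e. 150%.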
🔥 Hot Topics and Trends
- Multimodal model support keeps expanding:
  - Added support for the FunAudioChat (audio-to-text) model (PR #33058).
  - Added support for the Kimi-K2.5 model (PR #33131).
  - Added MTP (multi-token prediction) support for the GLM-OCR model (PR #33005).
  - Added a config option for Step-VL to disable image patching (img_patch), improving video understanding (PR #32923).
- Performance optimization and kernel tuning:
  - Mamba SSM: added an autotuning script and config files for the selective_state_update kernel targeting Blackwell (B200) GPUs (PR #33084).
  - LoRA MoE: improved indexing and memory-access patterns in the Triton kernel for better performance (PR #32770).
  - General cleanups: removed the unused _moe_permute function (PR #33108) and optimized DCP (decode context parallel) tensor allocation (PR #33102).
- Tool calling and format parsing:
  - Several PRs and issues addressed tool-call parsing: fixing a JSON parsing error (PR #33085), adding a custom tool template for MiniMax M2 to fix a crash caused by parentheses (PR #33087), and adding a dedicated tool parser for Qwen2.5-Coder models (PR #33083). This reflects ongoing community work on API compatibility across models.
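One common source of such JSONDecodeError failures is a model emitting several tool-call objects back to back, which a single `json.loads` rejects. A hedged sketch of one way to handle that with the standard library's incremental decoder; the actual vLLM parsers are model-specific and differ from this:

```python
import json


def parse_tool_calls(payload: str) -> list[dict]:
    """Parse one or more concatenated JSON objects, e.g.
    '{"name": ...}{"name": ...}', which plain json.loads would reject.
    Illustrative only, not vLLM's actual tool-call parser."""
    decoder = json.JSONDecoder()
    calls, idx = [], 0
    payload = payload.strip()
    while idx < len(payload):
        # raw_decode returns the object plus the index where it ended.
        obj, end = decoder.raw_decode(payload, idx)
        calls.append(obj)
        # Skip any whitespace separating adjacent objects.
        while end < len(payload) and payload[end].isspace():
            end += 1
        idx = end
    return calls
```

The key trick is `raw_decode`, which stops at the end of the first valid object instead of demanding that the whole string be one document.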
🛠️ Key Technical Changes
- Issue #33096 (RFC): environment-variable deprecation protection: an important user-experience proposal. Adding validation and warnings should markedly reduce the hidden breakage that version upgrades can cause, improving vLLM's maintainability in production.
- PR #32873 / #33084: Mamba SSM kernel tuning: deep optimization targeting the new Blackwell architecture (B200) and Nemotron Nano models, showing vLLM keeping pace with hardware generations and continuing to extract performance.
- PR #33112: ROCm attention backend fix for encoder-decoder models: a small change, but it removes a key obstacle to running Whisper and similar models on AMD, broadening vLLM's applicability across heterogeneous hardware.
- PR #33126 / #33136: journey event tracing integration: part of a nine-PR series integrating OpenTelemetry tracing, emitting request-lifecycle events directly onto core spans. This strengthens observability and provides better tools for performance analysis and debugging.
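The environment-variable protection proposed in #33096 can be pictured as a startup check over the process environment. The variable names and tables below are hypothetical placeholders for illustration, not vLLM's actual registry:

```python
import os
import warnings

# Hypothetical tables; vLLM's real variable names and replacements differ.
DEPRECATED_ENV_VARS = {
    "VLLM_OLD_FLAG": "VLLM_NEW_FLAG",  # renamed: point users to the new name
    "VLLM_REMOVED_FLAG": None,         # removed outright, no replacement
}
KNOWN_ENV_VARS = {"VLLM_NEW_FLAG", "VLLM_LOGGING_LEVEL"}


def check_env_vars(environ=os.environ) -> list[str]:
    """Warn on deprecated or unrecognized VLLM_* variables instead of
    silently ignoring them; returns the list of problem messages."""
    problems = []
    for name in environ:
        if not name.startswith("VLLM_"):
            continue  # only vet the project's own namespace
        if name in DEPRECATED_ENV_VARS:
            repl = DEPRECATED_ENV_VARS[name]
            hint = f"use {repl} instead" if repl else "it has been removed"
            problems.append(f"{name} is deprecated; {hint}")
        elif name not in KNOWN_ENV_VARS:
            problems.append(f"{name} is not a recognized vLLM variable")
    for msg in problems:
        warnings.warn(msg, DeprecationWarning, stacklevel=2)
    return problems
```

Running such a check once at engine startup would surface stale configuration loudly at upgrade time rather than letting it fail silently.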
📈 Development Activity Observations
- Merge velocity: 33 PRs merged within 24 hours shows the core team reviews and integrates code quickly.
- Contributor diversity: beyond the core team (e.g. @robertgshaw2-redhat, @DarkLight1337, @WoosukKwon), contributions came from hardware-vendor engineers such as @JartX (AMD-focused fixes) and @fadara01 (Arm CPU work), as well as many independent developers.
- AMD ecosystem responsiveness: for AMD platform issues (e.g. #33123), the community bot auto-CC'd the relevant maintainers, and an AMD engineer (@mawong-amd) is already linked to a related fix PR, indicating a well-functioning response channel for the platform.
💡 Issues to Watch
- The future of the NCCL connector (Issue #33115): a decision weighing architectural simplification against user impact; the community needs more user feedback before deciding whether to deprecate it.
- GPT OSS 20B as a draft model throws an error (Issue #33133): a KV-cache-group assertion failure when using the same model for speculative decoding, which may affect the usability of this performance optimization.
- FP8 performance regression for MoE models (Issue #33128): users report that FP8 Qwen3 MoE models run markedly slower on recent vLLM versions than on older ones. This needs core-team attention and may involve a regression in core kernels such as DeepGEMM.
- Transformers 5.0 upgrade (Issue #33132 & PR #33100): Hugging Face has released Transformers 5.0. vLLM has begun updating the test registry (PR #33100), but a full dependency upgrade may take time; compatibility needs watching.
📋 Appendix: Detailed Data
New Issues
- #33053 [Bug]: LongCat-Flash-Thinking-2601 still appears to be unsupported — bug — by jambow0320 (created: 2026-01-26 11:34 (UTC+8))
- #33137 [Usage]: How to disable cascaded attention mechanism in vLLM version 0.7.3 — usage — by berlintw113 (created: 2026-01-27 10:52 (UTC+8))
- #33134 [Bug]: try to run vllm bench with "throughput", got error. — bug — by liyuerich (created: 2026-01-27 10:44 (UTC+8))
- #33133 [Bug]: Using GPT OSS 20B as Drafter Throws Error — bug — by singh-git10 (created: 2026-01-27 10:36 (UTC+8))
- #33132 Bump transformers to 5.0.0 — no labels — by NiuBlibing (created: 2026-01-27 10:05 (UTC+8))
- #33115 [RFC]: Deprecate NCCL Connector? — RFC — by robertgshaw2-redhat (created: 2026-01-27 05:09 (UTC+8))
- #33128 [Bug]: From version 0.12.0 to the latest version 0.14.0 of vllm, running moe model of fp8 with deepgemm will be slower than 0.11.0. — bug — by piekey1994 (created: 2026-01-27 09:11 (UTC+8))
- #33091 [Bug]: Whisper accuracy issue with FA2+CG+torch.compile — bug,torch.compile — by NickLucche (created: 2026-01-26 20:57 (UTC+8))
- #33123 [Bug][ROCm]: Prefix caching produces different output on first request (cache miss) vs subsequent requests (cache hit) on gfx950 — bug,rocm — by AndreasKaratzas (created: 2026-01-27 07:38 (UTC+8))
- #33096 [RFC]: Add protection against using deprecated or incorrect environment variables — RFC — by gshtras (created: 2026-01-26 23:43 (UTC+8))
- #33118 [RFC]: Hidden States Extraction — RFC — by fynnsu (created: 2026-01-27 05:30 (UTC+8))
- #33092 [Accessory/Distributed] KV Transfer crashes with short prompts when chunked prefill is enabled — bug — by Yaoyikang (created: 2026-01-26 21:41 (UTC+8))
- #33097 [Feature]: Fuse FP8 output quantization into merge_attn_states (DCP / cascade paths) — feature request — by sachinkumarsingh092 (created: 2026-01-26 23:47 (UTC+8))
- #33107 [Bug]: Whisper large-v3 accuracy degradation in vLLM 0.14.1 (134.56% WER) on L40S - works fine in 0.12.0 — bug — by sayalibhavsar (created: 2026-01-27 02:06 (UTC+8))
- #33099 [Bug]: vllm Requests stuck indefinitely — bug — by joules-zapata-pfh (created: 2026-01-26 23:59 (UTC+8))
- #33094 [CI Failure]: Entrypoints Test (Responses) — ci-failure — by robertgshaw2-redhat (created: 2026-01-26 22:32 (UTC+8))
- #33089 [Feature]: Support multi-turn conversation for OpenAI Response API — feature request — by kcelia (created: 2026-01-26 19:59 (UTC+8))
- #33075 [Usage]: Can not disable the thinking mode of deepseek-r1-0528 with request-level chat_template_kwargs — usage — by xlouba (created: 2026-01-26 17:39 (UTC+8))
- #33074 [Bug]: PD report DeepseekV32 AssertionError: num_kv_heads == 1 — bug — by nicole-lihui (created: 2026-01-26 17:36 (UTC+8))
- #33072 [Usage]: How to enable API Server in the worker nodes? When deploy with DP2. — usage — by MaoJianwei (created: 2026-01-26 17:12 (UTC+8))
Closed Issues
- #15781 [Feature]: Implementation of Zero Overhead Scheduling — feature request,stale — by lizhigong1988 (closed: 2026-01-27 10:18 (UTC+8))
- #17639 [Bug]: Unable to run Qwen3 on Turing GPUs after upgrading to torch 2.7.0 — bug,stale — by sgsdxzy (closed: 2026-01-27 10:18 (UTC+8))
- #18121 [Usage]: Set the kv cache size — usage,stale — by FeliceSchena (closed: 2026-01-27 10:17 (UTC+8))
- #19793 [Usage]: ValueError: The checkpoint you are trying to load has model type qwen3_moe but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date. — usage,stale — by Estella31 (closed: 2026-01-27 10:17 (UTC+8))
- #19861 [Bug]: max_completion_tokens doesn't work as max — bug,stale — by cfytrok3 (closed: 2026-01-27 10:17 (UTC+8))
- #20007 [Bug]: error on ROCM :0:rocdevice.cpp :3020: 5875275709338d us: Callback: Queue 0x7f4580c00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016 — bug,stale — by shangshng (closed: 2026-01-27 10:17 (UTC+8))
- #25549 [Bug]: qwen3-next tool_call error invalid output: !!!!!!!! — bug,stale — by fengyang95 (closed: 2026-01-27 10:16 (UTC+8))
- #25576 [Doc]: https://docs.vllm.ai/en/v0.6.5/getting_started/installation.html — documentation,stale — by cjg-aussee-ai (closed: 2026-01-27 10:16 (UTC+8))
- #25793 VLLM offline inference raise except when using qianfan-vl — bug,stale — by cqray1990 (closed: 2026-01-27 10:16 (UTC+8))
- #29536 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 — ci-failure — by AndreasKaratzas (closed: 2026-01-27 08:42 (UTC+8))
- #32972 [CI Failure]: mi325_8: Kernels Attention Test %N — ci-failure — by AndreasKaratzas (closed: 2026-01-27 08:42 (UTC+8))
- #32919 [Bug]: Memory fault access when serving DeepSeek-R1-0528 with mori-ep + concurrency of 128 / 256 — bug,rocm — by junkang1991 (closed: 2026-01-27 08:33 (UTC+8))
- #33041 [Bug]: vLLM hangs after NCCL init with TP=2 on Blackwell GPUs (CUDA 13.0, NCCL 2.27.7) — bug — by robertjmcintyre (closed: 2026-01-27 07:19 (UTC+8))
- #29982 [Feature]: Option to Disable Metrics Health-Check Logging — feature request — by averyngo34 (closed: 2026-01-27 05:49 (UTC+8))
- #33029 [CI Failure]: MoE Integration Tests — ci-failure — by robertgshaw2-redhat (closed: 2026-01-27 04:09 (UTC+8))
- #33094 [CI Failure]: Entrypoints Test (Responses) — ci-failure — by robertgshaw2-redhat (closed: 2026-01-26 23:31 (UTC+8))
- #33050 [Bug]: [DeepSeek-V3.2] PD Can't instantiate abstract class DeepseekV32IndexerBackend without an implementation for abstract method 'get_impl_cls' — bug — by chaunceyjiang (closed: 2026-01-26 22:44 (UTC+8))
- #32901 [Feature]: make thinking parameter consistent with openai api — feature request — by ltm920716 (closed: 2026-01-26 21:59 (UTC+8))
- #32758 [Feature]: support lora for Gemma3ForConditionalGeneration — help wanted,feature request — by WuChuYi (closed: 2026-01-26 21:56 (UTC+8))
- #32309 [Usage]: Unable to pass precomputed image embeddings to vLLM with Qwen3-VL — usage — by hoangtd-asilla (closed: 2026-01-26 15:00 (UTC+8))
- #33028 [CI Failure]: MultiModal Tests — ci-failure — by robertgshaw2-redhat (closed: 2026-01-26 11:49 (UTC+8))
New PRs
- #33136 [Feature] Emit journey events to core spans (PR #4/9) — documentation,needs-rebase,ci/build,v1 — by sriumcp (created: 2026-01-27 10:50 (UTC+8))
- #33110 [Bugfix] Early-reject requests with MM data longer than encode cache capacity — bug,v1 — by YunzhuLu (created: 2026-01-27 02:56 (UTC+8))
- #33131 [Models] Kimi-K2.5 — documentation,new-model,frontend,ready,multi-modality — by ywang96 (created: 2026-01-27 10:03 (UTC+8))
- #33135 [code clean] remove duplicate code — structured-output,v1 — by andyxning (created: 2026-01-27 10:48 (UTC+8))
- #33100 [Model] Bump transformers version for test registry — ready,v1 — by DarkLight1337 (created: 2026-01-27 00:15 (UTC+8))
- #33058 Adds FunAudioChat multimodal audio model support (#2) — documentation,new-model,multi-modality — by nemoramo (created: 2026-01-26 12:28 (UTC+8))
- #33103 [Frontend] Cleanup serving engine — frontend,ready — by DarkLight1337 (created: 2026-01-27 00:55 (UTC+8))
- #33112 Fix IndexError with encoder-decoder models when using Custom Paged Attention — rocm,ready,v1 — by sstamenk (created: 2026-01-27 03:02 (UTC+8))
- #33113 [torch.compile] Stop assuming 32 bit indexing — ready — by zou3519 (created: 2026-01-27 03:49 (UTC+8))
- #33130 [Quantization][Refactor] use platform dict to choose kernel — no labels — by zufangzhu (created: 2026-01-27 09:49 (UTC+8))
- #33064 [Perf] avoid duplicate mem_get_info() call in get_current_memory_usage — rocm,ready — by pacoxu (created: 2026-01-26 15:14 (UTC+8))
- #33129 [release] Minor fixes to release annotation and wheel upload — ci/build — by khluu (created: 2026-01-27 09:35 (UTC+8))
- #33127 [Kernel] adding native nccl4py support — performance,ci/build,kv-connector,nvidia — by pkousha (created: 2026-01-27 08:39 (UTC+8))
- #33070 [Bugfix] (grpc): improve GetServerInfo response consistency and accuracy — bug,frontend — by pacoxu (created: 2026-01-26 16:06 (UTC+8))
- #33101 [Frontend] Reduce mixin usage in serving pooling — frontend,ready — by DarkLight1337 (created: 2026-01-27 00:33 (UTC+8))
- #33055 [Model Runner V2] Remove UvaBufferPool for cpu->gpu copy — v1 — by WoosukKwon (created: 2026-01-26 11:59 (UTC+8))
- #33059 [Model Runner V2] Use a different stream for grammar bitmask h2d copy — needs-rebase,v1 — by WoosukKwon (created: 2026-01-26 13:27 (UTC+8))
- #33076 Support compress-tensors with nvfp4 or fp8 weights and modelopt with nvfp4 weights on Turing — ready — by ir1ka (created: 2026-01-26 17:47 (UTC+8))
- #33126 [Feature] Add journey state cleanup to scheduler (PR #3/9) — documentation,needs-rebase,ci/build,v1 — by sriumcp (created: 2026-01-27 08:28 (UTC+8))
- #33125 [Core] Optimize SWA KV cache management for prefix caching — v1 — by jaewonlee-fb (created: 2026-01-27 08:19 (UTC+8))
- #33124 [Model] GPT-OSS: Use layer_types config for sliding window selection — gpt-oss — by jaewonlee-fb (created: 2026-01-27 07:59 (UTC+8))
- #33086 [Bugfix] Fix DeepseekV32 AssertionError: num_kv_heads == 1 — bug,v1,deepseek — by chaunceyjiang (created: 2026-01-26 18:56 (UTC+8))
- #33116 [CI/Build][BugFix] fix cuda/compat loading order issue in docker build — bug,ready,ci/build,nvidia — by wpc (created: 2026-01-27 05:13 (UTC+8))
- #33114 [Draft / POC] Websocket realtime api — frontend — by patrickvonplaten (created: 2026-01-27 04:08 (UTC+8))
- #33104 [Bugfix][MXFP4] Call trtllm_fp4_block_scale_moe with kwargs — bug,ready — by wpc (created: 2026-01-27 01:12 (UTC+8))
- #33122 [CPU][Feat] Enable KleidiAI accelerated int4 dynamic quant with BF16 activations on Arm CPUs — no labels — by fadara01 (created: 2026-01-27 06:20 (UTC+8))
- #33121 Fix grammar — v1,cpu — by smashyalts (created: 2026-01-27 06:05 (UTC+8))
- #33120 [Config] Add explicit bind flag for KV events — no labels — by alec-flowers (created: 2026-01-27 05:55 (UTC+8))
- #33109 [DOC]: Add warning about max_num_batched_tokens and max_model_len when chunked prefill is disabled — documentation — by VincentG1234 (created: 2026-01-27 02:55 (UTC+8))
- #33082 [Doc] Improve serve parameter documentation with meaningful defaults — documentation — by karanb192 (created: 2026-01-26 18:43 (UTC+8))
- #33119 [MoE Refactor] Only use RouterMethodType for flashinfer — nvidia — by bnellnm (created: 2026-01-27 05:32 (UTC+8))
- #33117 [BugFix] p2p_nccl_connector: Skip non-chunked requests — bug,kv-connector — by RishabhSaini (created: 2026-01-27 05:15 (UTC+8))
- #33108 [Refactor] Remove unused _moe_permute function — performance,ready — by yewentao256 (created: 2026-01-27 02:21 (UTC+8))
- #33080 [ci] Sync test areas with test-pipeline.yaml and enable new pipeline generator — ci/build — by khluu (created: 2026-01-26 18:30 (UTC+8))
- #33105 [Feature] (grpc): add standard gRPC health checking protocol for Kubernetes native probes — frontend,ci/build — by V2arK (created: 2026-01-27 01:20 (UTC+8))
- #33088 [Bugfix] Use 'sum' reduction instead of 'avg' in Async TP reduce-scatter — bug — by wangxingran222 (created: 2026-01-26 19:32 (UTC+8))
- #33111 Add EAGLE3 support for AFMoE — no labels — by AutumnAurelium (created: 2026-01-27 03:01 (UTC+8))
- #33073 [Bugfix] Fix Voxtral streaming slot_mapping — bug,ready — by NickLucche (created: 2026-01-26 17:26 (UTC+8))
- #33077 [BUGFIX] Fix hipErrorIllegalState in Qwen3-Omni during startup profiling allow inference Omni on ROCM — bug,rocm,qwen — by JartX (created: 2026-01-26 17:56 (UTC+8))
- #33106 [ROCm] Enabling forward_includes_kv_cache on ROCm MHA backends — rocm,v1 — by gshtras (created: 2026-01-27 01:30 (UTC+8))
- #33095 Remove unused logic in models/mistral.py — ready — by andylolu2 (created: 2026-01-26 22:48 (UTC+8))
- #33102 [Perf] Optimize dcp allocate tensor — ready,v1 — by yewentao256 (created: 2026-01-27 00:43 (UTC+8))
- #33098 [CI] Whisper tests enforce_eager=False — ready — by NickLucche (created: 2026-01-26 23:49 (UTC+8))
- #33084 [Feature] Add tuning script and config files for Mamba SSM — performance,needs-rebase — by karanb192 (created: 2026-01-26 18:50 (UTC+8))
- #33093 [CI] Fix AssertionError: MCP tool call not found in output_messages — ready — by chaunceyjiang (created: 2026-01-26 21:48 (UTC+8))
- #33078 [BugFix] Add synchronize in CutlassW4A8LinearKernel to ensure data is ready for use. — bug,nvidia — by ayrnb (created: 2026-01-26 17:57 (UTC+8))
- #33052 [Bugfix] Fix Can't instantiate abstract class DeepseekV32IndexerBackend — bug,ready,v1,deepseek,kv-connector — by chaunceyjiang (created: 2026-01-26 11:13 (UTC+8))
- #33085 [Bugfix] Fix JSONDecodeError when multiple tool_calls in single message — bug,frontend — by karanb192 (created: 2026-01-26 18:56 (UTC+8))
- #33087 [Bugfix][Tool Parser] Fix crash when parameter description contains parentheses — bug — by karanb192 (created: 2026-01-26 18:57 (UTC+8))
- #33061 [KV MultiConnector]: Add out-of-band handshake metadata get/set functions — kv-connector — by snadampal (created: 2026-01-26 13:58 (UTC+8))
- #33063 [Chore] Update type annotation of input_ids in model forward — documentation,speculative-decoding,ready,llama,qwen,deepseek,gpt-oss — by DarkLight1337 (created: 2026-01-26 14:57 (UTC+8))
- #33079 [CPU][ARM] Add ARM BF16 cross-compilation support and improve documen… — documentation,ci/build,cpu — by maryamtahhan (created: 2026-01-26 18:26 (UTC+8))
- #33090 [Bugfix] Fix DeepseekV32 AssertionError: num_kv_heads == 1 — bug,deepseek,kv-connector — by NickLucche (created: 2026-01-26 20:12 (UTC+8))
- #33066 Enhance request tracing — documentation,frontend,needs-rebase,v1 — by sriumcp (created: 2026-01-26 15:43 (UTC+8))
- #33071 [Docs] Simplify CPU x86 Docker build documentation — documentation,cpu — by maryamtahhan (created: 2026-01-26 17:08 (UTC+8))
- #33065 [Doc] Further update multi-modal impl doc — documentation,ready — by DarkLight1337 (created: 2026-01-26 15:31 (UTC+8))
- #33083 [Feature][Tool Parser] Add dedicated parser for Qwen2.5-Coder models — qwen — by karanb192 (created: 2026-01-26 18:46 (UTC+8))
- #33081 [Bugfix] Fix remove_instance logic error in disagg_proxy_demo.py — bug,documentation,kv-connector — by karanb192 (created: 2026-01-26 18:43 (UTC+8))
- #33060 [Frontend][5/n] Make pooling entrypoints request schema consensus ScoreRequest — frontend — by noooop (created: 2026-01-26 13:53 (UTC+8))
- #33067 test — documentation,new-model — by Emilie1001 (created: 2026-01-26 15:48 (UTC+8))
- #33054 [StepVL] add step vl offline example — documentation — by ltd0924 (created: 2026-01-26 11:53 (UTC+8))
- #33069 [Doc] Add back the instruction: install required python3-dev apt package — documentation,cpu — by jasonyanwenl (created: 2026-01-26 16:01 (UTC+8))
- #33068 fix: Add infinite loop detection for multimodal models (e.g., PaddleOCR-VL) — v1 — by bellkjtt (created: 2026-01-26 15:53 (UTC+8))
- #33062 [Model Runner V2] Add LoRAState to consolidate lora logic — v1 — by WoosukKwon (created: 2026-01-26 14:11 (UTC+8))
- #33057 Bump actions/checkout from 6.0.1 to 6.0.2 — ci/build,github_actions,dependencies — by dependabot (created: 2026-01-26 12:27 (UTC+8))
- #33056 Bump actions/setup-python from 6.1.0 to 6.2.0 — ci/build,github_actions,dependencies — by dependabot (created: 2026-01-26 12:27 (UTC+8))
Merged PRs
- #33100 [Model] Bump transformers version for test registry — ready,v1 — by DarkLight1337 (merged: 2026-01-27 02:53 (UTC+8))
- #33112 Fix IndexError with encoder-decoder models when using Custom Paged Attention — rocm,ready,v1 — by sstamenk (merged: 2026-01-27 10:33 (UTC+8))
- #32768 fix: preserve native tool call ID in multi-turn tool calling — frontend,ready — by wangln19 (merged: 2026-01-27 10:22 (UTC+8))
- #32567 [MoE Refactor] Integrate Naive Prepare Finalize into MK — documentation,performance,rocm,ready,ci/build,llama,cpu,nvidia,ready-run-all-tests — by robertgshaw2-redhat (merged: 2026-01-27 09:28 (UTC+8))
- #33055 [Model Runner V2] Remove UvaBufferPool for cpu->gpu copy — v1 — by WoosukKwon (merged: 2026-01-27 08:47 (UTC+8))
- #32908 [Bugfix][TPU] Return a Default fp8 MoE Backend — bug,ready — by vanbasten23 (merged: 2026-01-27 07:46 (UTC+8))
- #33104 [Bugfix][MXFP4] Call trtllm_fp4_block_scale_moe with kwargs — bug,ready — by wpc (merged: 2026-01-27 07:13 (UTC+8))
- #32913 [fix] CPUDNNLGEMMHandler pointer baked into inductor artifact — ready,cpu — by dolpm (merged: 2026-01-27 05:59 (UTC+8))
- #30011 [feat][log]: add --disable-access-log-for-endpoints CLI option — documentation,frontend,ready — by JaredforReal (merged: 2026-01-27 05:49 (UTC+8))
- #33108 [Refactor] Remove unused _moe_permute function — performance,ready — by yewentao256 (merged: 2026-01-27 05:06 (UTC+8))
- #33080 [ci] Sync test areas with test-pipeline.yaml and enable new pipeline generator — ci/build — by khluu (merged: 2026-01-27 04:28 (UTC+8))
- #33030 [Bugfix] Fix Dtypes for Pynccl Wrapper — bug,ready,nvidia — by robertgshaw2-redhat (merged: 2026-01-27 04:09 (UTC+8))
- #33073 [Bugfix] Fix Voxtral streaming slot_mapping — bug,ready — by NickLucche (merged: 2026-01-27 02:40 (UTC+8))
- #31099 [FIX] Always support TP > 4 for FP4 Gemm — ready — by danielafrimi (merged: 2026-01-27 02:04 (UTC+8))
- #33095 Remove unused logic in models/mistral.py — ready — by andylolu2 (merged: 2026-01-27 01:01 (UTC+8))
- #33093 [CI] Fix AssertionError: MCP tool call not found in output_messages — ready — by chaunceyjiang (merged: 2026-01-26 23:19 (UTC+8))
- #33018 [ROCm][Bugfix] Fix ptpc scale load issue for fused shared expert path in deepseek mtp — bug,rocm,ready,deepseek — by ganyi1996ppo (merged: 2026-01-26 23:19 (UTC+8))
- #33052 [Bugfix] Fix Can't instantiate abstract class DeepseekV32IndexerBackend — bug,ready,v1,deepseek,kv-connector — by chaunceyjiang (merged: 2026-01-26 22:44 (UTC+8))
- #33005 [GLM-OCR] GLM-OCR with MTP Support — documentation,new-model,speculative-decoding,ready,v1,multi-modality — by zRzRzRzRzRzRzR (merged: 2026-01-26 22:24 (UTC+8))
- #33063 [Chore] Update type annotation of input_ids in model forward — documentation,speculative-decoding,ready,llama,qwen,deepseek,gpt-oss — by DarkLight1337 (merged: 2026-01-26 22:02 (UTC+8))
- #32873 [Performance] Tune Mamba selective scan kernel for B200 — ready — by danisereb (merged: 2026-01-26 21:56 (UTC+8))
- #32764 [Feature] Add LoRA support for Gemma3 vision components — ready — by vihaan-that (merged: 2026-01-26 21:56 (UTC+8))
- #20320 [Misc] HF Hub LoRA Resolver — documentation,frontend,ready,ci/build,unstale — by alex-jw-brooks (merged: 2026-01-26 21:56 (UTC+8))
- #33010 [Model] Use mm_position to compute mrope positions for Qwen3-Omni — documentation,ready,qwen — by Etelis (merged: 2026-01-26 21:48 (UTC+8))
- #32770 [lora/moe] Improve fused MoE-LoRA kernel indexing and memory access — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by cwazai (merged: 2026-01-26 20:56 (UTC+8))
- #33065 [Doc] Further update multi-modal impl doc — documentation,ready — by DarkLight1337 (merged: 2026-01-26 18:54 (UTC+8))
- #33054 [StepVL] add step vl offline example — documentation — by ltd0924 (merged: 2026-01-26 17:00 (UTC+8))
- #32955 [Refactor] Use data parser for matching data items to multi-modal UUIDs — documentation,frontend,ready,v1,multi-modality,qwen — by DarkLight1337 (merged: 2026-01-26 15:00 (UTC+8))
- #32882 Set splitk=1 for fused-moe-lora expand kernel — ready — by dcmaddix (merged: 2026-01-26 14:52 (UTC+8))
- #33062 [Model Runner V2] Add LoRAState to consolidate lora logic — v1 — by WoosukKwon (merged: 2026-01-26 14:21 (UTC+8))
- #33032 [Tests] Remove Duplicates — ready,nvidia — by robertgshaw2-redhat (merged: 2026-01-26 13:23 (UTC+8))
- #32923 [StepVL] support close img patch — ready — by ltd0924 (merged: 2026-01-26 12:56 (UTC+8))
- #33033 [CI] Fix MHA attention test failure (AttributeError when model_config is None in ViT attention backend) — ready — by LucasWilkinson (merged: 2026-01-26 11:49 (UTC+8))
PRs Closed Without Merging
- #33136 [Feature] Emit journey events to core spans (PR #4/9) — documentation,needs-rebase,ci/build,v1 — by sriumcp (closed: 2026-01-27 11:00 (UTC+8))
- #24579 [Chore] Add E2E test for 'required' tool_choice with streaming — frontend,stale,tool-calling — by tlipoca9 (closed: 2026-01-27 10:16 (UTC+8))
- #24832 [CI/Build][gpt-oss]: Browser Tool Test Enhancement — frontend,stale,gpt-oss — by simondanielsson (closed: 2026-01-27 10:16 (UTC+8))
- #24859 feat: enable vLLM inference for pruned Qwen3 models — stale,qwen — by wangwenmingaa (closed: 2026-01-27 10:16 (UTC+8))
- #33126 [Feature] Add journey state cleanup to scheduler (PR #3/9) — documentation,needs-rebase,ci/build,v1 — by sriumcp (closed: 2026-01-27 08:32 (UTC+8))
- #32080 Add support for compressed-tensors NVFP4 in non-gated MoE layers #31782 — needs-rebase — by baonudesifeizhai (closed: 2026-01-27 05:12 (UTC+8))
- #32622 [Feature] Enable flashinfer moe fp4 by default — ready — by yewentao256 (closed: 2026-01-27 04:54 (UTC+8))
- #30175 [Frontend] Add --uvicorn-access-log-exclude-paths option — frontend — by GeoffreyWang1117 (closed: 2026-01-27 02:38 (UTC+8))
- #33007 [BugFix] KV cache layout for raw-copy KV connectors in disaggregated mode — bug,v1,kv-connector — by thjung123 (closed: 2026-01-26 21:16 (UTC+8))
- #32890 Testing vllm/issues/32718 — no labels — by RishabhSaini (closed: 2026-01-26 22:16 (UTC+8))
- #32639 [Bugfix]: Enable Convolution layer custom ops by default — bug — by Isotr0py (closed: 2026-01-26 21:08 (UTC+8))
- #33066 Enhance request tracing — documentation,frontend,needs-rebase,v1 — by sriumcp (closed: 2026-01-26 19:10 (UTC+8))
- #33067 test — documentation,new-model — by Emilie1001 (closed: 2026-01-26 16:04 (UTC+8))
- #32405 [Hardware][RISCV] Add RISC-V RVV backend support for CPU executor — needs-rebase,ci/build,cpu — by hansu2022 (closed: 2026-01-26 15:55 (UTC+8))