vLLM Development Activity Report - 2026-01-13

Time window: 2026-01-13 10:59 (UTC+8) ~ 2026-01-14 10:59 (UTC+8)
Statistics: New Issues 13 | Closed Issues 5 | New PRs 60 | Merged PRs 33 | Closed Unmerged PRs 17

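📊 Daily Development Summary

During the window from January 13 to 14, 2026, the vLLM project remained very active, with 60 new PRs opened and 33 merged. Development focused on core performance optimization (such as Fused MoE and quantization), better support for multimodal models (such as Isaac and Granite Vision), and continued work on compatibility with the AMD ROCm ecosystem. The community also worked through user-reported problems, including bug fixes for LoRA, distributed inference, and the API frontend.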

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was high this period, concentrated on ROCm compatibility fixes, performance optimization, and CI pipeline work.

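ROCm CI and test fixes:
- PR #32233 (merged) and PR #32281: Fixed test failures for the Isaac multimodal model on ROCm that were caused by inaccurate results from the HuggingFace flash_attention_2 implementation. Forcing the SDPA backend for the vision encoder on ROCm keeps output accuracy consistent with CUDA.
- PR #32275: Enabling async scheduling by default exposed a latent problem with the Qwen3-Next-80B model on AMD CI. This PR temporarily disables async scheduling on ROCm to unblock CI while the problem is investigated.
- PR #32244: Enables non-gated MoE (such as NemotronH) on ROCm by relaxing the platform check and falling back to the Triton implementation.
- PR #32264: Optimizes the Dockerfile for ROCm wheel builds by integrating sccache, cutting base image build time from about 6 hours to 1.5 hours (roughly 4x faster) and improving CI/CD efficiency.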
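AMD Quark quantization tool support:
- PR #32272 (user: fxmarty-amd): Adds loading support for MXFP4/MXFP6 quantized models produced with the AMD Quark library that use online block-diagonal rotations. This is a notable extension of AMD ecosystem functionality, aimed at improving accuracy recovery for low-precision formats.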
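ROCm performance optimization:
- PR #32238: Enables dynamic weight quantization for the MLA projection GEMMs of the DeepSeek-R1 MXFP4 model on ROCm.
Build and deployment:
- PR #32295: Moves the RIXL/UCX build from the base Docker image to the test-stage image (per the PR title, from Dockerfile.rocm_base to Dockerfile.rocm), making deployment more flexible.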

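Summary: In this period the AMD team systematically addressed key compatibility problems for multimodal models on ROCm, continued integrating the Quark quantization ecosystem, and worked on the efficiency and stability of infrastructure (CI builds).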

💬 Analysis of High-Activity Discussions

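Issue #32235: [Bug]: Incorrect grid size in fused_moe_lora
- Core issue: The fused_moe_lora kernel computes its launch grid size incorrectly. When a batch mixes base-model tokens with LoRA tokens, some LoRA adapters may be skipped.
- Views and conclusion: Reporter yugong333 analyzed the code in detail and pointed out that the grid size should be max_loras + 1 to cover the worst case. Maintainer jeejeelee thanked the reporter in the comments and asked for a regression test. The discussion quickly reached consensus that this is a bug that needs fixing, and yugong333 submitted the fix directly as PR #32277.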
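The off-by-one described in Issue #32235 can be illustrated with a toy model. The sketch below is not the kernel code: it assumes that base-model tokens are marked with an id of -1 and that the launch visits one distinct slot per grid index along the LoRA axis. Both are assumptions made for illustration, as are the helper names.

```python
NO_LORA = -1  # assumed marker for tokens served by the base model


def active_slots(token_lora_ids):
    """Return the distinct adapter slots present in a batch, sorted."""
    return sorted(set(token_lora_ids))


def slots_covered(token_lora_ids, grid_size):
    """Slots a launch with `grid_size` programs on the LoRA axis would visit."""
    return active_slots(token_lora_ids)[:grid_size]


max_loras = 2
mixed_batch = [NO_LORA, 0, 0, 1, NO_LORA, 1]  # base tokens plus both adapters

too_small = slots_covered(mixed_batch, max_loras)         # adapter 1 is never visited
large_enough = slots_covered(mixed_batch, max_loras + 1)  # every slot is visited
```

Under these assumptions a mixed batch can hold max_loras + 1 distinct slots, so a grid sized to max_loras leaves one slot unvisited, which matches the worst case the reporter describes.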
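PR #32279: [Bugfix] Fix FusedMoE triton kernel out of bound value
- Core issue: Fixes several out-of-bounds access problems in the FusedMoE kernel and its LoRA variant, including an incorrect sentinel value for offs_token, incomplete boundary guards, and out-of-bounds values for moe_weight.
- Views and conclusion: Contributor xyang16 provided a detailed description of the fix along with accuracy test results (showing improved accuracy). Core maintainers dcmaddix and jeejeelee both approved. The discussion was constructive, focused on the correctness and performance impact of the fix, and agreed that it should be merged to address potential bad data reads and accuracy loss.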
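The class of bug fixed in PR #32279 can be sketched in plain Python. This is an emulation of a masked load, not the Triton kernel itself; the buffer contents, the block size, and the choice of `num_valid_tokens` as the padding sentinel are assumptions made for illustration.

```python
def masked_load(buffer, offsets, mask, other=0.0):
    """Emulate a masked load: read buffer[o] only where the mask is set."""
    return [buffer[o] if m else other for o, m in zip(offsets, mask)]


num_valid_tokens = 5
moe_weight = [0.5, 0.1, 0.9, 0.3, 0.7]  # one weight per valid token

# A block of 4 token offsets whose last entry is padding. The sentinel is
# chosen so that the comparison below classifies it as out of range.
sentinel = num_valid_tokens
offs_token = [3, 0, 4, sentinel]

token_mask = [t < num_valid_tokens for t in offs_token]
weights = masked_load(moe_weight, offs_token, token_mask)
```

In this emulation an unmasked read at the sentinel offset raises an IndexError. On a GPU the equivalent read would not fail loudly and could return whatever happens to be in memory, which is why both the sentinel value and the mask need to be right.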
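PR #30885: [Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding (merged)
- Core issue: Introduces a new backend variant for NVFP4 quantization that uses smaller scaling factor tiling (an 8x4 SF layout), aimed specifically at small-batch (≤32) decoding, with a claimed 25-35% throughput improvement.
- Debate and discussion: This PR went through a long development and review cycle, with in-depth discussion of the scope of the performance gains, the applicable scenarios, and the complexity it might introduce. It was merged after thorough microbenchmarks and a trade-off analysis (the new backend is not suited to medium or large batches) were provided. The central question was how to speed up a niche scenario without adding complexity or maintenance burden to the general case.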
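Issue #32262: [Usage]: Requests get stuck with high GPU utilization on vLLM docker
- Core issue: A user reports that requests hang while GPU utilization stays high when using the vLLM Docker image.
- Views and status: User Glebbot provided very detailed environment information. Maintainer njhill asked for reproduction steps and KV cache usage, to determine whether this is a scheduling problem or resource exhaustion. The discussion is at an early diagnostic stage with no conclusion yet, which illustrates how hard complex production problems are to debug.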

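🔥 Hot Topics and Trends

Quantization ecosystem keeps deepening: Support for multiple quantization formats (NVFP4, MXFP4, W8A8 INT8) is a hot area. Related PRs and issues cover MoE model support (#32285, #32293), performance optimization (#30885), and a refactor of the configuration API (#32268).
Multimodal model support and fixes: As more vision-language models are integrated, related bug fixes and adaptation work are arriving in quick succession, such as the Isaac attention problem on ROCm (#32233, #32281), the SigLip pooling head problem in Granite Vision (#32299), and a feature concatenation error in Omni models (#32249).
Kernel and compute optimization: Fine-grained optimization of core compute kernels is an ongoing theme, for example optimizing Top-K + Top-P sampling to avoid a full sort (#32234), optimizing the grouped Top-K kernel (#32058), and several optimizations to the Fused MoE LoRA kernel (#32236, #32279).
Frontend API and developer experience: Improvements to the OpenAI Responses API (#32247, #32253, #32260, #32298), the addition of a built-in web monitoring dashboard (#32256), and relaxed dependency version pins (#32288, #32289) reflect attention to developer friendliness and compatibility.
Distributed inference and scalability: The RFC discussion on NIXL metadata exchange in multi-node TP scenarios (closed Issue #25981) and a related bug fix (#32291), along with a problem under an external load balancer (#32252), show that the complexity of distributed inference configuration remains a concern.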
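The idea behind avoiding a full sort in Top-K + Top-P sampling (#32234) can be sketched with the standard library. This is not the implementation from that PR. It is one simple way to get the same effect, assuming top-k is applied first and top-p is then applied to the renormalized top-k distribution; the function name and the example logits are made up for illustration.

```python
import heapq
import math


def top_k_top_p_candidates(logits, k, p):
    """Return token ids kept after top-k then top-p, without sorting the vocabulary.

    heapq.nlargest selects and orders only the k largest logits.
    """
    top = heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])
    # Softmax over the k survivors only, shifted by the max for stability.
    max_logit = top[0][1]
    weights = [math.exp(logit - max_logit) for _, logit in top]
    total = sum(weights)

    kept, cumulative = [], 0.0
    for (token_id, _), weight in zip(top, weights):
        kept.append(token_id)
        cumulative += weight / total
        if cumulative >= p:
            break  # smallest prefix whose probability mass reaches p
    return kept


logits = [2.0, 0.0, 1.0, -1.0, 3.0]
candidates = top_k_top_p_candidates(logits, k=3, p=0.8)
```

Here the top three logits belong to tokens 4, 0, and 2, with renormalized probabilities of roughly 0.67, 0.24, and 0.09, so a threshold of 0.8 keeps tokens 4 and 0. Only k entries are ever ordered, rather than the whole vocabulary.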

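🛠️ Key Technical Changes

PR #32279: FusedMoE Triton kernel out-of-bounds fix: A key safety fix that addresses potential garbage reads or illegal memory accesses. It has been shown to improve accuracy and matters most to users of MoE with LoRA.
PR #30885: NVFP4 small-batch decoding optimization: Introduces a new kernel variant that substantially speeds up low-concurrency decoding, a common inference scenario, and reflects fine-tuning for real deployment performance.
PR #32245: [Model Runner V2] Refactor Sampler (merged): Moves sampling state out of RequestState into a standalone Sampler and removes SamplingMetadata. This is an important refactoring step toward a cleaner, more modular V2 model runner architecture.
PR #32256: [Frontend][CLI] Add --enable-dashboard for vLLM Web UI: Introduces a built-in web dashboard for monitoring and debugging, giving operators and developers an out-of-the-box tool for real-time monitoring, configuration inspection, and simple testing.
PR #32272: [Quark] Support online block-diagonal rotations in OCP MX dense layers: Deepens vLLM support for the AMD Quark quantization ecosystem and prepares for upcoming quantized models that use advanced rotation techniques.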

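📈 Development Activity Observations

Contributor diversity: Beyond the core maintainer team (such as chaunceyjiang, DarkLight1337, yewentao256, WoosukKwon), engineers from AMD (fxmarty-amd, AndreasKaratzas), NVIDIA (xyang16, malaiwah), and other companies made notable contributions.
Efficient merge process: 33 PRs were merged within 24 hours, which points to efficient code review and CI. Many PRs were merged soon after being labeled ready.
Fast issue turnaround: Several newly reported bugs were handled the same day. #32235 received a fix PR (#32277), and #32258 was closed the day it was opened.
Active AMD team: AMD-affiliated contributors submitted several important PRs this period covering feature support, bug fixes, and CI optimization, showing continued and growing investment in the ROCm ecosystem.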

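💡 Issues Worth Watching

Issue #32235: Fused MoE LoRA grid size bug: A fix PR already exists, but the bug shows that edge-case testing needs strengthening for batches that mix the base model with multiple LoRA adapters.
Issue #32262: Hang with high GPU utilization under Docker: The symptom is clear but the cause is unknown. It may involve the underlying driver, Docker resource isolation, or vLLM internal resource management, and needs further tracking.
Issue #32252: Assertion failure with a non-MoE model and an external LB: Reveals a possible defect in the initialization path of the stats update address under a specific distributed configuration (DP plus an external load balancer), which affects production deployments.
PR #32272 (Quark online rotation support): This is a new-feature PR. Its description as a "naive, unoptimized implementation" suggests room for later performance work, and fuller testing has to wait until matching models are released.
Compatibility problems from enabling async scheduling by default: As PR #32275 shows, the new default may interact in unexpected ways with specific models or hardware platforms, and needs observation and testing across a wider part of the ecosystem.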

📋 Appendix: Detailed Data Lists

New Issues

#32235 [Bug]: Incorrect grid size in fused_moe_lora | bug | by yugong333 (created: 2026-01-13 12:42 (UTC+8))
#32262 [Usage]: Requests get stuck with high GPU utilization on vLLM docker | usage | by Glebbot (created: 2026-01-13 22:10 (UTC+8))
#32268 [Feature]: Refactor Int8ScaledMMLinearLayerConfig to use QuantKey | help wanted,good first issue,feature request | by vllmellm (created: 2026-01-14 00:15 (UTC+8))
#32288 [Build]: Relax anthropic version pin from ==0.71.0 to >=0.71.0 | no labels | by dsfaccini (created: 2026-01-14 04:56 (UTC+8))
#32269 [Feature]: Benchmark torch._scaled_mm performance with and without padding | help wanted,feature request | by vllmellm (created: 2026-01-14 00:31 (UTC+8))
#32278 [Bug]: MiniMax M2 Tool Parser Incorrectly Converts Optional Parameters to null | bug | by KevinBuettner (created: 2026-01-14 02:45 (UTC+8))
#32270 [Bug]: same max_seq_len of flashinfer trtllm decode and prefill. | bug | by xiaoxiaosuaxuan (created: 2026-01-14 00:36 (UTC+8))
#32267 [Feature]: Kernels Attention Test cleanup | feature request | by MatthewBonanni (created: 2026-01-13 23:36 (UTC+8))
#32258 [Bug]: DeepSeek v3.2 garbled output | bug | by esmeetu (created: 2026-01-13 20:17 (UTC+8))
#32259 [Bug]: offline infer of mm model cache | bug | by L-hongbin (created: 2026-01-13 20:44 (UTC+8))
#32252 [Bug]: self.stats_update_address is None for non-Moe and external LB | bug | by dtcccc (created: 2026-01-13 18:09 (UTC+8))
#32242 [Bug]: GLM-4.5-Air-NVFP4 models show unexpectedly low throughput under Pipeline Parallel in vLLM | bug | by east612-ai (created: 2026-01-13 15:26 (UTC+8))
#32228 [Bug]: gpt-oss-20B token ids out of range | bug | by vman049 (created: 2026-01-13 11:00 (UTC+8))

Closed Issues

#22689 [Usage]: How to change system prompts when using LLM.generate for offline use? | usage,stale | by jiangwu300 (closed: 2026-01-14 10:13 (UTC+8))
#25981 [RFC]: Fix for NIXL metadata exchange issue when the worker is multi-node | RFC,stale | by GuanLuo (closed: 2026-01-14 01:11 (UTC+8))
#32258 [Bug]: DeepSeek v3.2 garbled output | bug | by esmeetu (closed: 2026-01-13 22:01 (UTC+8))
#32190 [Bug]: Deepseek-R1 with DEP deployment returns gibberish outputs | bug | by ptarasiewiczNV (closed: 2026-01-13 19:05 (UTC+8))
#26775 [Bug]: ModelRegistry.inspect_model_cls shows that Qwen3ForCausalLM is not pooling model | bug,stale | by xiaojianpinga (closed: 2026-01-13 11:15 (UTC+8))

New PRs

#32260 [Refactor] [8/N] to simplify the vLLM openai responsesapi_serving architecture | frontend,qwen,gpt-oss | by chaunceyjiang (created: 2026-01-13 21:24 (UTC+8))
#32247 [Frontend] Add encrypted_content to reasoning items for round-tripping | frontend,needs-rebase,gpt-oss | by daniel-salib (created: 2026-01-13 17:24 (UTC+8))
#32279 [Bugfix] Fix FusedMoE triton kernel out of bound value | bug | by xyang16 (created: 2026-01-14 02:59 (UTC+8))
#32300 [ASR] Fix audio benchmark and add RTFx metric | performance | by ekagra-ranjan (created: 2026-01-14 10:33 (UTC+8))
#32299 [Bugfix] Fix Granite Vision / Don’t use Siglip Pooling Head Nested Models by Default | bug,multi-modality | by alex-jw-brooks (created: 2026-01-14 10:23 (UTC+8))
#32281 [ROCm][CI] Handle missing vision_config in Isaac model attention patch | rocm,multi-modality | by AndreasKaratzas (created: 2026-01-14 03:06 (UTC+8))
#32277 Using max_loras + 1 to construct grid in fused_moe_lora | no labels | by yugong333 (created: 2026-01-14 02:17 (UTC+8))
#32275 [ROCm][CI] Disable Async Scheduling For Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test | rocm,ready,ci/build,qwen | by micah-wil (created: 2026-01-14 01:32 (UTC+8))
#32296 [misc] Remove is_torch_equal_or_newer(2.4) cases | ready,v1 | by angelayi (created: 2026-01-14 07:46 (UTC+8))
#32245 [Model Runner V2] Refactor Sampler | v1 | by WoosukKwon (created: 2026-01-13 16:38 (UTC+8))
#32298 allow configure skip_special_tokens in openai response api | frontend | by 842974287 (created: 2026-01-14 09:28 (UTC+8))
#32297 Allow configure skip_special_tokens in openai response api | frontend,needs-rebase | by 842974287 (created: 2026-01-14 09:20 (UTC+8))
#32283 [BugFix] Assign page_size_padded when unifying kv cache spec. | bug,v1 | by Lumosis (created: 2026-01-14 03:37 (UTC+8))
#32293 support non gated fused moe for compressed tensors w8a8 int8 | ready | by NVShreyas (created: 2026-01-14 06:44 (UTC+8))
#32295 [CI] Move rixl/ucx from Dockerfile.rocm_base to Dockerfile.rocm | rocm,ci/build | by qli88 (created: 2026-01-14 07:03 (UTC+8))
#32256 [Frontend][CLI] Add --enable-dashboard for vLLM Web UI | frontend | by esmeetu (created: 2026-01-13 18:57 (UTC+8))
#32238 [ROCM] DSfp4 mla projection gemms weight dynamic quantization | rocm,v1 | by maleksan85 (created: 2026-01-13 14:41 (UTC+8))
#32289 [Build] Relax anthropic version pin from ==0.71.0 to >=0.71.0 | ready,ci/build | by dsfaccini (created: 2026-01-14 05:02 (UTC+8))
#32285 [Quant] Support MXFP4 W4A16 for compressed-tensors MoE models | no labels | by dsikka (created: 2026-01-14 03:52 (UTC+8))
#32294 [Doc] Add note to docker.md on --model arg #32292 | documentation | by jaysonfrancis (created: 2026-01-14 06:51 (UTC+8))
#32292 [Doc] Add note to docker.md on --model arg | documentation,ci/build | by jaysonfrancis (created: 2026-01-14 06:27 (UTC+8))
#32249 [Bugfix] interleaved multimodal feature concatenation | bug,v1,multi-modality | by L-hongbin (created: 2026-01-13 17:37 (UTC+8))
#32255 [BugFix] scheduler: Delay freeing blocks of aborted async loads | bug,v1,kv-connector | by orozery (created: 2026-01-13 18:34 (UTC+8))
#32291 [Bugfix] Pass in missing tp_size to get_port_offset() for MORI IO Connector | bug,kv-connector | by ruisearch42 (created: 2026-01-14 06:23 (UTC+8))
#32266 [Bugfix] anthropic: support incoming streaming DeltaMessage with combined content and tool_calls | bug,frontend | by DKingAlpha (created: 2026-01-13 23:17 (UTC+8))
#32257 [Feat] Support non gated MoE with Marlin backend | no labels | by TomerBN-Nvidia (created: 2026-01-13 19:14 (UTC+8))
#32290 [Ignore] Emulate contiguous blocks | v1 | by mgoin (created: 2026-01-14 05:56 (UTC+8))
#32282 [Build] Add scripts for cherry-picking and trigger build | ci/build | by simon-mo (created: 2026-01-14 03:25 (UTC+8))
#32287 Upgrade transformers-4.57.5 | ci/build | by huydhn (created: 2026-01-14 04:52 (UTC+8))
#32286 [Doc] Enhance documentation around CPU container images | documentation,cpu | by nathan-weinberg (created: 2026-01-14 03:54 (UTC+8))
#32284 [Quant] Support MXFP4 W4A16 for compressed-tensors MoE models | no labels | by dsikka (created: 2026-01-14 03:46 (UTC+8))
#32280 Bump triton_kernels to v3.5.1 for version consistency | ci/build | by mmangkad (created: 2026-01-14 03:03 (UTC+8))
#32276 Fix CUDA 13 wheel installation doc | documentation,ready,nvidia | by dmitry-tokarev-nv (created: 2026-01-14 01:56 (UTC+8))
#32264 [ROCm] [CI] [Release] Rocm wheel pipeline with sccache | rocm,ci/build | by tjtanaa (created: 2026-01-13 22:47 (UTC+8))
#32274 Fix Attention when query dim=4 [batch_size, num_tokens, heads, head_dim] | meta-exported,fb-exported | by henryoier (created: 2026-01-14 01:28 (UTC+8))
#32273 [Perf] Only clone when needed for moe_permute | ready | by yewentao256 (created: 2026-01-14 01:04 (UTC+8))
#32246 [Trivial] Remove duplicate enable_mfu_metrics | ready | by markmc (created: 2026-01-13 17:19 (UTC+8))
#32271 ROCm WideEP Changes | rocm,ci/build | by varun-sundar-rabindranath (created: 2026-01-14 00:59 (UTC+8))
#32233 [ROCm][CI] Fix HuggingFace flash_attention_2 accuracy issue in Isaac vision encoder | rocm,ready,multi-modality | by AndreasKaratzas (created: 2026-01-13 12:00 (UTC+8))
#32272 [Quark] Support online block-diagonal rotations in OCP MX dense layers | no labels | by fxmarty-amd (created: 2026-01-14 00:59 (UTC+8))
#32234 [Core] Optimize top-k + top-p sampling by avoiding full vocabulary sort | v1,meta-exported,fb-exported | by nadavrot (created: 2026-01-13 12:39 (UTC+8))
#32232 [Frontend]: minimax_m2 supports structural_tag | documentation,structured-output,frontend,needs-rebase,v1,tool-calling | by chaunceyjiang (created: 2026-01-13 11:59 (UTC+8))
#32253 [Frontend] Normalize Responses API input for multi-turn conversations | frontend,needs-rebase,gpt-oss | by daniel-salib (created: 2026-01-13 18:14 (UTC+8))
#32251 [Refactor] [7/N] to simplify the vLLM lora serving architecture | frontend,ready | by chaunceyjiang (created: 2026-01-13 18:01 (UTC+8))
#32265 Add support for LoRA adapters in Nemotron-H MTP models | no labels | by danisereb (created: 2026-01-13 23:09 (UTC+8))
#32263 [cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation | v1,cpu | by gassan-arm (created: 2026-01-13 22:40 (UTC+8))
#32254 [Refactor] Remove MultiModalProfiler | ready,multi-modality,llama | by DarkLight1337 (created: 2026-01-13 18:22 (UTC+8))
#32261 Benchmarks: add receipts harness (telemetry + logs) | performance | by StanByriukov02 (created: 2026-01-13 21:27 (UTC+8))
#32237 [Fix][MoE] Add SM120 support for FP8 MoE path | documentation,structured-output,frontend,needs-rebase,v1,cpu,nvidia | by malaiwah (created: 2026-01-13 13:32 (UTC+8))
#32240 [Refactor] [6/N] to simplify the vLLM openai chat_completion serving architecture | frontend,ready,v1,multi-modality,tool-calling,llama,qwen,deepseek,gpt-oss | by chaunceyjiang (created: 2026-01-13 15:00 (UTC+8))
#32248 fix(tool_parser): filter whitespace-only content before tool_call | qwen | by minimAluminiumalism (created: 2026-01-13 17:29 (UTC+8))
#32243 [Bugfix] Replace PoolingParams.normalize with use_activation | documentation,frontend,ready | by DarkLight1337 (created: 2026-01-13 16:06 (UTC+8))
#32250 Feat/support marlin for non gated moe | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality | by TomerBN-Nvidia (created: 2026-01-13 17:51 (UTC+8))
#32241 [Refactor] Remove get_encoder_dummy_data | ready,v1,multi-modality | by DarkLight1337 (created: 2026-01-13 15:21 (UTC+8))
#32229 [Model] Refactoring the implementation of qwen3 | qwen | by maang-h (created: 2026-01-13 11:09 (UTC+8))
#32244 fix(rocm): Enable non-gated MoE (is_act_and_mul=False) support on ROCm | rocm | by rabi (created: 2026-01-13 16:28 (UTC+8))
#32236 Fused MoE LoRA Kernel Optimizations | no labels | by cwazai (created: 2026-01-13 13:03 (UTC+8))
#32230 feature: support eagle3 for HunyuanOCR & Qwen3VLMoe & Qwen2Audio | speculative-decoding,v1,qwen | by irisliu10 (created: 2026-01-13 11:17 (UTC+8))
#32239 [Doc] Update installation from source command | documentation,ready | by esmeetu (created: 2026-01-13 14:58 (UTC+8))
#32231 PR_001 (Benchmark): Reproduced upstream change | frontend,needs-rebase | by MitchLewis930 (created: 2026-01-13 11:32 (UTC+8))

Merged PRs

#32245 [Model Runner V2] Refactor Sampler | v1 | by WoosukKwon (merged: 2026-01-14 09:58 (UTC+8))
#30885 [Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding | performance,ready,ci/build,nvidia | by LopezCastroRoberto (merged: 2026-01-14 07:22 (UTC+8))
#30784 [Improvement] Persist CUDA compat libraries paths to prevent reset on apt-get | ready,ci/build,multi-modality,nvidia | by emricksini-h (merged: 2026-01-14 06:35 (UTC+8))
#31980 Add mergify label job for “bug” in PR titles | bug,ready,ci/build | by mgoin (merged: 2026-01-14 06:28 (UTC+8))
#32282 [Build] Add scripts for cherry-picking and trigger build | ci/build | by simon-mo (merged: 2026-01-14 05:21 (UTC+8))
#28502 [Misc] Add In-Container restart capability through supervisord for sagemaker entrypoint | documentation,ready,ci/build | by HappyAmazonian (merged: 2026-01-14 05:06 (UTC+8))
#31753 [BugFix]Fix eagle draft_model_config and add tests | ready | by charlotte12l (merged: 2026-01-13 15:09 (UTC+8))
#31711 fix(rocm): Use refresh_env_variables() for rocm_aiter_ops in test_moe | rocm,ready | by rabi (merged: 2026-01-14 03:11 (UTC+8))
#32050 [EPLB][Cleanup] Remove is_async_enabled from EplbModelState | ready | by SageMoore (merged: 2026-01-14 02:19 (UTC+8))
#32058 [Perf] Optimize grouped topk kernel, 1.2%~2% E2E Throughput improvement | performance,ready | by yewentao256 (merged: 2026-01-14 02:58 (UTC+8))
#32276 Fix CUDA 13 wheel installation doc | documentation,ready,nvidia | by dmitry-tokarev-nv (merged: 2026-01-14 02:48 (UTC+8))
#32100 [responseAPI] support partial message generation | frontend,ready | by qandrew (merged: 2026-01-14 02:41 (UTC+8))
#32169 [BugFix] [KVConnector] Fix KV events for LMCache connector | ready,kv-connector | by hickeyma (merged: 2026-01-13 23:50 (UTC+8))
#32181 nixl_connector: export UCX_MEM_MMAP_HOOK_MODE=none to avoid a UCX memory leak | ready,kv-connector | by hasB4K (merged: 2026-01-14 00:21 (UTC+8))
#32060 [4/N][Attention] Move MLA common to model_executor | rocm,speculative-decoding,ready,v1,kv-connector,nvidia,ready-run-all-tests | by MatthewBonanni (merged: 2026-01-14 01:08 (UTC+8))
#32246 [Trivial] Remove duplicate enable_mfu_metrics | ready | by markmc (merged: 2026-01-14 01:09 (UTC+8))
#32198 [Docs] Nixl Usage recommend fail kv_load_failure_policy | documentation,ready,kv-connector | by NickLucche (merged: 2026-01-13 20:51 (UTC+8))
#32061 [ROCm][CI] Fix engine core client tests for ROCm spawn multiprocessing | rocm,ready,v1 | by AndreasKaratzas (merged: 2026-01-13 15:14 (UTC+8))
#32099 [ROCm][Bugfix] Fix Mamba batched decode producing incorrect output | rocm,ready | by AndreasKaratzas (merged: 2026-01-13 13:46 (UTC+8))
#32233 [ROCm][CI] Fix HuggingFace flash_attention_2 accuracy issue in Isaac vision encoder | rocm,ready,multi-modality | by AndreasKaratzas (merged: 2026-01-13 14:33 (UTC+8))
#32212 Fix various typos found in docs | documentation,structured-output,ready,cpu | by potatosalad (merged: 2026-01-13 11:41 (UTC+8))
#32251 [Refactor] [7/N] to simplify the vLLM lora serving architecture | frontend,ready | by chaunceyjiang (merged: 2026-01-13 23:37 (UTC+8))
#32254 [Refactor] Remove MultiModalProfiler | ready,multi-modality,llama | by DarkLight1337 (merged: 2026-01-13 23:10 (UTC+8))
#32215 [6/N][Attention] Move utils to more appropriate locations | ready,v1,nvidia,ready-run-all-tests | by MatthewBonanni (merged: 2026-01-13 21:38 (UTC+8))
#32240 [Refactor] [6/N] to simplify the vLLM openai chat_completion serving architecture | frontend,ready,v1,multi-modality,tool-calling,llama,qwen,deepseek,gpt-oss | by chaunceyjiang (merged: 2026-01-13 21:01 (UTC+8))
#29867 [Quantization] fix: overflow with static per-tensor scaling | bug,ready,v1,deepseek | by mickaelseznec (merged: 2026-01-13 20:56 (UTC+8))
#32243 [Bugfix] Replace PoolingParams.normalize with use_activation | documentation,frontend,ready | by DarkLight1337 (merged: 2026-01-13 18:45 (UTC+8))
#32241 [Refactor] Remove get_encoder_dummy_data | ready,v1,multi-modality | by DarkLight1337 (merged: 2026-01-13 17:21 (UTC+8))
#32126 [Model] Use mm_position to compute mrope positions for Qwen2-VL/2.5-VL | ready,qwen | by YunzhuLu (merged: 2026-01-13 17:04 (UTC+8))
#32239 [Doc] Update installation from source command | documentation,ready | by esmeetu (merged: 2026-01-13 15:10 (UTC+8))
#32226 [Misc] improve warning/assert messages | ready,v1 | by cjackal (merged: 2026-01-13 11:11 (UTC+8))
#32211 [Perf] Optimize requests abort | ready,v1 | by yewentao256 (merged: 2026-01-13 12:11 (UTC+8))
#31956 [Frontend] Add reasoning_effort to OpenAIServing._preprocess_chat() | frontend,ready | by sanghoon-yn (merged: 2026-01-13 11:21 (UTC+8))

PRs Closed Without Merging

#32051 [Misc] support separate draft model loading config from base model | speculative-decoding,v1,meta-exported,fb-exported | by ZhengkaiZ (closed: 2026-01-14 10:22 (UTC+8))
#31746 [Misc] Create an Eagle3Llama interface to help extend more eagle3 model | speculative-decoding,v1,llama,meta-exported,fb-exported | by ZhengkaiZ (closed: 2026-01-14 00:48 (UTC+8))
#32297 Allow configure skip_special_tokens in openai response api | frontend,needs-rebase | by 842974287 (closed: 2026-01-14 09:21 (UTC+8))
#32292 [Doc] Add note to docker.md on --model arg | documentation,ci/build | by jaysonfrancis (closed: 2026-01-14 06:49 (UTC+8))
#27214 [MXFP4] CT Integration Support | no labels | by dsikka (closed: 2026-01-14 05:25 (UTC+8))
#32284 [Quant] Support MXFP4 W4A16 for compressed-tensors MoE models | no labels | by dsikka (closed: 2026-01-14 03:46 (UTC+8))
#31749 [CI/Build] add new target for building CPU image with model | documentation,ci/build,cpu | by nathan-weinberg (closed: 2026-01-14 03:18 (UTC+8))
#32271 ROCm WideEP Changes | rocm,ci/build | by varun-sundar-rabindranath (closed: 2026-01-14 00:59 (UTC+8))
#28421 map torchao quantized checkpoints to vLLM’s MoE kernels | qwen | by vkuzo (closed: 2026-01-13 21:24 (UTC+8))
#32237 [Fix][MoE] Add SM120 support for FP8 MoE path | documentation,structured-output,frontend,needs-rebase,v1,cpu,nvidia | by malaiwah (closed: 2026-01-13 21:08 (UTC+8))
#32156 [Frontend] Fix missing channel assignment for multi-turn message parsing | frontend,gpt-oss | by daniel-salib (closed: 2026-01-13 18:15 (UTC+8))
#32155 [Frontend] Normalize Responses API input for client compatibility | frontend | by daniel-salib (closed: 2026-01-13 18:15 (UTC+8))
#32250 Feat/support marlin for non gated moe | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality | by TomerBN-Nvidia (closed: 2026-01-13 18:04 (UTC+8))
#26330 [Chore] fix spelling in kv_transfer_config variable name | ready,stale,v1,kv-connector | by natoscott (closed: 2026-01-13 17:35 (UTC+8))
#32145 update cutlass_moe_mm error check message | nvidia | by XiaobingSuper (closed: 2026-01-13 17:03 (UTC+8))
#32229 [Model] Refactoring the implementation of qwen3 | qwen | by maang-h (closed: 2026-01-13 16:36 (UTC+8))
#32231 PR_001 (Benchmark): Reproduced upstream change | frontend,needs-rebase | by MitchLewis930 (closed: 2026-01-13 11:34 (UTC+8))