vLLM Development Activity Report - 2026-01-13
Time window: 2026-01-13 10:59 (UTC+8) ~ 2026-01-14 10:59 (UTC+8). Statistics: 13 new issues | 5 closed issues | 60 new PRs | 33 merged PRs | 17 PRs closed without merging
📊 Daily Development Status Summary
During the 2026-01-13 to 01-14 window, the vLLM project maintained very high development activity, with 60 new PRs opened and 33 merged. Work centered on core performance optimization (e.g. fused MoE, quantization), maturing multimodal model support (e.g. Isaac, Granite Vision), and continued improvements to AMD ROCm compatibility. The community also actively handled user-reported issues, including bug fixes for LoRA, distributed inference, and the API frontend.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was brisk this cycle, concentrated on ROCm compatibility fixes, performance optimization, and CI pipeline work.
- ROCm CI and test fixes:
  - PR #32233 (merged) and PR #32281: resolved test failures of the Isaac multimodal model on ROCm caused by an inaccurate HuggingFace flash_attention_2 implementation. Forcing the SDPA backend for the vision encoder on ROCm restores output accuracy consistent with CUDA.
  - PR #32275: enabling async scheduling by default exposed a latent issue with the Qwen3-Next-80B model on AMD CI; this PR temporarily disables async scheduling on ROCm to unblock CI while the issue is investigated.
  - PR #32244: enabled non-gated MoE (e.g. NemotronH) on ROCm by relaxing the platform check and falling back to the Triton implementation.
  - PR #32264: optimized the Dockerfile for ROCm wheel builds, integrating sccache to cut base-image build time from roughly 6 hours to 1.5 hours, a substantial CI/CD efficiency gain.
- AMD Quark quantization support:
  - PR #32272 (by fxmarty-amd): added loading support for MXFP4/MXFP6 models produced with the AMD Quark library that use online block-diagonal rotations. This is a significant AMD-ecosystem feature addition aimed at improving accuracy recovery for low-precision formats.
- ROCm performance optimization:
  - PR #32238: enabled dynamic weight quantization for the MLA projection GEMMs of the DeepSeek-R1 MXFP4 model on ROCm.
- Build and deployment:
  - PR #32295: moved the RIXL/UCX build from the base Docker image to the test-stage image, improving deployment flexibility.

Summary: this cycle the AMD team systematically resolved key multimodal compatibility issues on ROCm, continued integrating the Quark quantization ecosystem, and improved the efficiency and stability of its build infrastructure (CI).
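The rotation idea behind PR #32272 can be sketched in a few lines. Below is a minimal numpy illustration of a block-diagonal rotation (this is not Quark's actual implementation, nor real MXFP4 arithmetic): each block of weight columns is multiplied by an orthogonal matrix, which tends to spread outlier values across the block before quantization and, because the rotation is orthogonal, can be undone exactly.

```python
import numpy as np

def block_diagonal_rotate(w: np.ndarray, block: int, seed: int = 0):
    """Rotate each `block`-sized column group of w by a random orthogonal matrix.

    Returns the rotated weights and the block-diagonal rotation R, so the
    transform can be reversed with `rotated @ R.T` (R is orthogonal).
    Illustrative sketch only; Quark's rotations are constructed differently.
    """
    rng = np.random.default_rng(seed)
    n = w.shape[1]
    assert n % block == 0
    R = np.zeros((n, n))
    for i in range(0, n, block):
        # QR decomposition of a random matrix yields an orthogonal block
        q, _ = np.linalg.qr(rng.standard_normal((block, block)))
        R[i:i + block, i:i + block] = q
    return w @ R, R

w = np.arange(12.0).reshape(3, 4)
w_rot, R = block_diagonal_rotate(w, block=2)
# orthogonality lets us recover the original weights (up to float error)
assert np.allclose(w_rot @ R.T, w)
```

The point of the orthogonality is that the rotation changes the value distribution seen by the quantizer without changing the layer's mathematical output, since it can be folded into adjacent operations or inverted exactly.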
💬 High-Engagement Discussions
- Issue #32235: [Bug]: Incorrect grid size in fused_moe_lora
  - Core issue: the launch grid computed in the fused_moe_lora kernel is sized incorrectly; when a batch mixes base-model tokens with LoRA tokens, some LoRA adapters can be skipped.
  - Views and outcome: reporter yugong333 analyzed the code in detail and argued the grid size should be max_loras + 1 to cover the worst case. Maintainer jeejeelee thanked the reporter and asked for a regression test. Consensus formed quickly that this is a bug worth fixing, and yugong333 followed up directly with fix PR #32277.
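The reported fix can be illustrated with a small sketch (the helper and parameter names here are hypothetical; the real kernel's grid layout may differ): the LoRA dimension of the launch grid needs max_loras + 1 slots, so a batch mixing base-model tokens (no adapter) with tokens from every adapter is fully covered.

```python
import math

def fused_moe_lora_grid(num_tokens: int, block_m: int, max_loras: int,
                        num_experts: int) -> tuple[int, int]:
    """Hypothetical launch-grid sketch for a fused MoE + LoRA kernel.

    One slot per LoRA adapter plus one for base-model tokens; a grid sized
    only `max_loras` could skip an adapter when the batch is mixed.
    """
    token_blocks = math.ceil(num_tokens / block_m)
    return (token_blocks * num_experts, max_loras + 1)

# 100 tokens, 16-token blocks, 3 adapters, 8 experts
assert fused_moe_lora_grid(100, 16, max_loras=3, num_experts=8) == (56, 4)
```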
- PR #32279: [Bugfix] Fix FusedMoE triton kernel out of bound value
  - Core issue: fixes several out-of-bounds accesses in the FusedMoE kernel and its LoRA variant, including a wrong offs_token sentinel value, incomplete boundary guards, and out-of-range moe_weight values.
  - Views and outcome: contributor xyang16 supplied a thorough description of the fixes plus accuracy-test results (showing improved accuracy). Core maintainers dcmaddix and jeejeelee both approved. Discussion was constructive, centered on correctness and performance impact, and agreed the PR should land to eliminate potential garbage reads and accuracy loss.
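The general pattern behind such fixes (guard every load with a bounds mask and substitute a neutral value for invalid lanes) can be mimicked in numpy. This is an illustration of the Triton `tl.load(..., mask=..., other=...)` idiom, not the actual kernel code:

```python
import numpy as np

def masked_gather(values: np.ndarray, offs_token: np.ndarray,
                  num_valid: int, sentinel: float = 0.0) -> np.ndarray:
    """Gather values[offs_token], guarding out-of-range offsets.

    Lanes whose token offset falls outside the valid range read a neutral
    sentinel instead of whatever memory happens to be there.
    """
    mask = offs_token < num_valid
    safe = np.where(mask, offs_token, 0)   # clamp so indexing stays legal
    out = values[safe]
    return np.where(mask, out, sentinel)   # replace invalid lanes

vals = np.array([10.0, 20.0, 30.0])
offs = np.array([0, 2, 5, 7])              # 5 and 7 are out of bounds
assert masked_gather(vals, offs, num_valid=3).tolist() == [10.0, 30.0, 0.0, 0.0]
```

Without the final `np.where`, the clamped indices would silently return valid-looking but wrong data, which is exactly the class of bug the PR addresses.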
- PR #30885: [Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding (merged)
  - Core issue: introduces a new NVFP4 backend variant with a smaller scaling-factor tile (8x4 SF layout), targeting small-batch (≤32) decoding, with a claimed 25-35% throughput gain.
  - Debate: the PR went through a long development and review cycle, with in-depth discussion of the scope of the gains, applicable scenarios, and the complexity introduced. It was merged after thorough microbenchmarks and trade-off analysis (the new backend is not suited to medium or large batches). The central tension: improving a niche scenario without adding complexity and maintenance burden to the common path.
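To see why a finer scaling-factor tile matters, here is a toy numpy sketch of per-tile max-abs fake quantization (the real NVFP4 format and its 8x4 SF layout are far more involved; this only conveys the granularity intuition): each tile gets its own scale, so every element's rounding error is bounded by half of its local scale, and smaller tiles mean more local, typically tighter, scales.

```python
import numpy as np

def quantize_with_tile_scales(x: np.ndarray, tile: tuple[int, int]):
    """Toy fake-quantization with one max-abs scale per tile.

    Hypothetical illustration only: real NVFP4 encoding differs from this
    int4-style round-to-nearest.
    """
    th, tw = tile
    h, w = x.shape
    assert h % th == 0 and w % tw == 0
    q = np.empty_like(x)
    scales = np.empty((h // th, w // tw))
    for i in range(0, h, th):
        for j in range(0, w, tw):
            blk = x[i:i + th, j:j + tw]
            s = np.abs(blk).max() / 7.0   # 7 ~ max magnitude of a 4-bit code
            s = s if s > 0 else 1.0
            scales[i // th, j // tw] = s
            q[i:i + th, j:j + tw] = np.round(blk / s) * s  # quantize-dequantize
    return q, scales

x = np.random.default_rng(0).standard_normal((16, 8))
q, scales = quantize_with_tile_scales(x, (8, 4))
# round-to-nearest error is bounded by half of each tile's local scale
assert np.abs(q - x).max() <= scales.max() / 2 + 1e-12
```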
- Issue #32262: [Usage]: Requests get stuck with high GPU utilization on vLLM docker
  - Core issue: a user reports requests hanging with persistently high GPU utilization when running the vLLM Docker image.
  - Views and status: user Glebbot provided extremely detailed environment information. Maintainer njhill asked for reproduction steps and KV-cache usage, trying to distinguish a scheduling bug from resource exhaustion. The discussion is still in early diagnosis with no conclusion, underscoring how hard complex production issues are to debug.
🔥 Hot Topics and Trends
- Quantization ecosystem keeps deepening: support for multiple quantization formats (NVFP4, MXFP4, W8A8 INT8) was a hot spot, spanning MoE model support (#32285, #32293), performance optimization (#30885), and a config-API refactor (#32268).
- Multimodal model support and fixes: as more vision-language models land, bug-fix and adaptation work is dense: Isaac's attention issue on ROCm (#32233, #32281), the Granite Vision SigLip pooling-head issue (#32299), and an Omni-model feature-concatenation bug (#32249).
- Kernel and compute optimization: micro-optimization of core compute kernels remains a constant theme, e.g. top-k + top-p sampling without a full sort (#32234), a faster grouped top-k kernel (#32058), and several fused MoE LoRA kernel optimizations (#32236, #32279).
- Frontend API and developer experience: polish of the OpenAI Responses API (#32247, #32253, #32260, #32298), a built-in web monitoring dashboard (#32256), and relaxed dependency version pins (#32288, #32289) show attention to developer friendliness and compatibility.
- Distributed and scalability: the RFC on NIXL metadata exchange under multi-node TP (closed Issue #25981) with a related bug fix (#32291), and a failure under an external load balancer (#32252), show that distributed inference configuration remains a pain point.
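The sampling optimization in #32234 rests on a standard trick that can be sketched with numpy (a simplification of the actual CUDA/torch implementation): select the top-k candidates with a partial selection (`argpartition`, O(V)) instead of sorting the whole vocabulary (O(V log V)), then sort only those k survivors for the nucleus (top-p) cutoff.

```python
import numpy as np

def top_k_top_p(logits: np.ndarray, k: int, p: float) -> np.ndarray:
    """Return candidate token ids after top-k then top-p filtering."""
    topk = np.argpartition(logits, -k)[-k:]        # unordered top-k ids, O(V)
    topk = topk[np.argsort(logits[topk])[::-1]]    # sort only k items
    probs = np.exp(logits[topk] - logits[topk].max())
    probs /= probs.sum()
    # keep tokens whose exclusive cumulative probability is below p
    # (always keeps at least the top token)
    keep = np.cumsum(probs) - probs < p
    return topk[keep]

logits = np.array([0.1, 3.0, 0.2, 2.0, 0.05])
ids = top_k_top_p(logits, k=3, p=0.9)
assert ids[0] == 1  # the highest-logit token always survives
```

For realistic vocabularies (V in the hundreds of thousands) the avoided full sort is the dominant saving, which is the essence of the PR's claim.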
🛠️ Key Technical Changes
- PR #32279: FusedMoE Triton kernel out-of-bounds fix: a key correctness fix that eliminates potential garbage reads and illegal memory accesses, with verified accuracy gains; essential for users of MoE + LoRA.
- PR #30885: NVFP4 small-batch decoding optimization: a new kernel variant significantly speeds up small-concurrency decoding, a common inference scenario, reflecting fine-grained tuning for real deployments.
- PR #32245: [Model Runner V2] Refactor Sampler (merged): moves sampling state out of RequestState into a standalone Sampler and removes SamplingMetadata, an important step toward a cleaner, more modular V2 model-runner architecture.
- PR #32256: [Frontend][CLI] Add --enable-dashboard for vLLM Web UI: introduces a built-in web monitoring and debugging dashboard, giving operators and developers out-of-the-box real-time monitoring, configuration inspection, and simple testing.
- PR #32272: [Quark] Support online block-diagonal rotations in OCP MX dense layers: deepens vLLM's support for the AMD Quark quantization ecosystem and prepares for upcoming quantized models that use advanced rotation techniques.
📈 Development Activity Observations
- Contributor diversity: beyond the core maintainers (e.g. chaunceyjiang, DarkLight1337, yewentao256, WoosukKwon), engineers from AMD (fxmarty-amd, AndreasKaratzas), NVIDIA (xyang16, malaiwah), and other companies contributed substantially.
- Efficient merge pipeline: 33 PRs merged within 24 hours points to a fast review and CI process; many PRs merged quickly after being labeled ready.
- High issue-closure rate: several newly reported bugs (e.g. #32235, #32258) were closed the same day, either via a fix PR or after the cause was confirmed.
- Active AMD team: AMD-affiliated contributors submitted multiple significant PRs this cycle, spanning feature support, bug fixes, and CI optimization, signaling growing investment in the ROCm ecosystem.
💡 Issues Worth Watching
- Issue #32235: fused MoE LoRA grid-size bug: a fix PR exists, but the issue shows that edge-case testing for batches mixing the base model with multiple LoRA adapters needs strengthening.
- Issue #32262: requests stuck with high GPU utilization under Docker: the symptom is clear but the cause is not; candidates include drivers, Docker resource isolation, and vLLM's internal resource management. Needs follow-up.
- Issue #32252: assertion failure for non-MoE models with an external LB: exposes a possible flaw in the stats-update address initialization path under a specific distributed configuration (DP + external load balancing), affecting production deployments.
- PR #32272 (Quark online rotations): a new-feature PR whose "naive, unoptimized implementation" wording implies future performance headroom; fuller testing awaits the release of matching models.
- Compatibility fallout from enabling async scheduling by default: as PR #32275 shows, the new default can interact in unexpected ways with particular models or hardware platforms, and needs observation and testing across the wider ecosystem.
📋 Appendix: Detailed Data
New Issues
- #32235 [Bug]: Incorrect grid size in fused_moe_lora — bug — by yugong333 (created: 2026-01-13 12:42 (UTC+8))
- #32262 [Usage]: Requests get stuck with high GPU utilization on vLLM docker — usage — by Glebbot (created: 2026-01-13 22:10 (UTC+8))
- #32268 [Feature]: Refactor Int8ScaledMMLinearLayerConfig to use QuantKey — help wanted,good first issue,feature request — by vllmellm (created: 2026-01-14 00:15 (UTC+8))
- #32288 [Build]: Relax anthropic version pin from ==0.71.0 to >=0.71.0 — no labels — by dsfaccini (created: 2026-01-14 04:56 (UTC+8))
- #32269 [Feature]: Benchmark torch._scaled_mm performance with and without padding — help wanted,feature request — by vllmellm (created: 2026-01-14 00:31 (UTC+8))
- #32278 [Bug]: MiniMax M2 Tool Parser Incorrectly Converts Optional Parameters to null — bug — by KevinBuettner (created: 2026-01-14 02:45 (UTC+8))
- #32270 [Bug]: same `max_seq_len` of flashinfer trtllm decode and prefill. — bug — by xiaoxiaosuaxuan (created: 2026-01-14 00:36 (UTC+8))
- #32267 [Feature]: Kernels Attention Test cleanup — feature request — by MatthewBonanni (created: 2026-01-13 23:36 (UTC+8))
- #32258 [Bug]: DeepSeek v3.2 garbled output — bug — by esmeetu (created: 2026-01-13 20:17 (UTC+8))
- #32259 [Bug]: offline infer of mm model cache — bug — by L-hongbin (created: 2026-01-13 20:44 (UTC+8))
- #32252 [Bug]: self.stats_update_address is None for non-Moe and external LB — bug — by dtcccc (created: 2026-01-13 18:09 (UTC+8))
- #32242 [Bug]: GLM-4.5-Air-NVFP4 models show unexpectedly low throughput under Pipeline Parallel in vLLM — bug — by east612-ai (created: 2026-01-13 15:26 (UTC+8))
- #32228 [Bug]: gpt-oss-20B token ids out of range — bug — by vman049 (created: 2026-01-13 11:00 (UTC+8))
Closed Issues
- #22689 [Usage]: How to change system prompts when using LLM.generate for offline use? — usage,stale — by jiangwu300 (closed: 2026-01-14 10:13 (UTC+8))
- #25981 [RFC]: Fix for NIXL metadata exchange issue when the worker is multi-node — RFC,stale — by GuanLuo (closed: 2026-01-14 01:11 (UTC+8))
- #32258 [Bug]: DeepSeek v3.2 garbled output — bug — by esmeetu (closed: 2026-01-13 22:01 (UTC+8))
- #32190 [Bug]: Deepseek-R1 with DEP deployment returns gibberish outputs — bug — by ptarasiewiczNV (closed: 2026-01-13 19:05 (UTC+8))
- #26775 [Bug]: ModelRegistry.inspect_model_cls shows that Qwen3ForCausalLM is not pooling model — bug,stale — by xiaojianpinga (closed: 2026-01-13 11:15 (UTC+8))
New PRs
- #32260 [Refactor] [8/N] to simplify the vLLM openai responses `api_serving` architecture — frontend,qwen,gpt-oss — by chaunceyjiang (created: 2026-01-13 21:24 (UTC+8))
- #32247 [Frontend] Add encrypted_content to reasoning items for round-tripping — frontend,needs-rebase,gpt-oss — by daniel-salib (created: 2026-01-13 17:24 (UTC+8))
- #32279 [Bugfix] Fix FusedMoE triton kernel out of bound value — bug — by xyang16 (created: 2026-01-14 02:59 (UTC+8))
- #32300 [ASR] Fix audio benchmark and add RTFx metric — performance — by ekagra-ranjan (created: 2026-01-14 10:33 (UTC+8))
- #32299 [Bugfix] Fix Granite Vision / Don't use Siglip Pooling Head Nested Models by Default — bug,multi-modality — by alex-jw-brooks (created: 2026-01-14 10:23 (UTC+8))
- #32281 [ROCm][CI] Handle missing vision_config in Isaac model attention patch — rocm,multi-modality — by AndreasKaratzas (created: 2026-01-14 03:06 (UTC+8))
- #32277 Using max_loras + 1 to construct grid in fused_moe_lora — no labels — by yugong333 (created: 2026-01-14 02:17 (UTC+8))
- #32275 [ROCm][CI] Disable Async Scheduling For Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test — rocm,ready,ci/build,qwen — by micah-wil (created: 2026-01-14 01:32 (UTC+8))
- #32296 [misc] Remove is_torch_equal_or_newer(2.4) cases — ready,v1 — by angelayi (created: 2026-01-14 07:46 (UTC+8))
- #32245 [Model Runner V2] Refactor Sampler — v1 — by WoosukKwon (created: 2026-01-13 16:38 (UTC+8))
- #32298 allow configure skip_special_tokens in openai response api — frontend — by 842974287 (created: 2026-01-14 09:28 (UTC+8))
- #32297 Allow configure skip_special_tokens in openai response api — frontend,needs-rebase — by 842974287 (created: 2026-01-14 09:20 (UTC+8))
- #32283 [BugFix] Assign page_size_padded when unifying kv cache spec. — bug,v1 — by Lumosis (created: 2026-01-14 03:37 (UTC+8))
- #32293 support non gated fused moe for compressed tensors w8a8 int8 — ready — by NVShreyas (created: 2026-01-14 06:44 (UTC+8))
- #32295 [CI] Move rixl/ucx from Dockerfile.rocm_base to Dockerfile.rocm — rocm,ci/build — by qli88 (created: 2026-01-14 07:03 (UTC+8))
- #32256 [Frontend][CLI] Add --enable-dashboard for vLLM Web UI — frontend — by esmeetu (created: 2026-01-13 18:57 (UTC+8))
- #32238 [ROCM] DSfp4 mla projection gemms weight dynamic quantization — rocm,v1 — by maleksan85 (created: 2026-01-13 14:41 (UTC+8))
- #32289 [Build] Relax anthropic version pin from ==0.71.0 to >=0.71.0 — ready,ci/build — by dsfaccini (created: 2026-01-14 05:02 (UTC+8))
- #32285 [Quant] Support MXFP4 W4A16 for compressed-tensors MoE models — no labels — by dsikka (created: 2026-01-14 03:52 (UTC+8))
- #32294 [Doc] Add note to docker.md on --model arg #32292 — documentation — by jaysonfrancis (created: 2026-01-14 06:51 (UTC+8))
- #32292 [Doc] Add note to docker.md on --model arg — documentation,ci/build — by jaysonfrancis (created: 2026-01-14 06:27 (UTC+8))
- #32249 [Bugfix] interleaved multimodal feature concatenation — bug,v1,multi-modality — by L-hongbin (created: 2026-01-13 17:37 (UTC+8))
- #32255 [BugFix] scheduler: Delay freeing blocks of aborted async loads — bug,v1,kv-connector — by orozery (created: 2026-01-13 18:34 (UTC+8))
- #32291 [Bugfix] Pass in missing tp_size to get_port_offset() for MORI IO Connector — bug,kv-connector — by ruisearch42 (created: 2026-01-14 06:23 (UTC+8))
- #32266 [Bugfix] anthropic: support incoming streaming DeltaMessage with combined content and tool_calls — bug,frontend — by DKingAlpha (created: 2026-01-13 23:17 (UTC+8))
- #32257 [Feat] Support non gated MoE with Marlin backend — no labels — by TomerBN-Nvidia (created: 2026-01-13 19:14 (UTC+8))
- #32290 [Ignore] Emulate contiguous blocks — v1 — by mgoin (created: 2026-01-14 05:56 (UTC+8))
- #32282 [Build] Add scripts for cherry-picking and trigger build — ci/build — by simon-mo (created: 2026-01-14 03:25 (UTC+8))
- #32287 Upgrade transformers-4.57.5 — ci/build — by huydhn (created: 2026-01-14 04:52 (UTC+8))
- #32286 [Doc] Enhance documentation around CPU container images — documentation,cpu — by nathan-weinberg (created: 2026-01-14 03:54 (UTC+8))
- #32284 [Quant] Support MXFP4 W4A16 for compressed-tensors MoE models — no labels — by dsikka (created: 2026-01-14 03:46 (UTC+8))
- #32280 Bump triton_kernels to v3.5.1 for version consistency — ci/build — by mmangkad (created: 2026-01-14 03:03 (UTC+8))
- #32276 Fix CUDA 13 wheel installation doc — documentation,ready,nvidia — by dmitry-tokarev-nv (created: 2026-01-14 01:56 (UTC+8))
- #32264 [ROCm] [CI] [Release] Rocm wheel pipeline with sccache — rocm,ci/build — by tjtanaa (created: 2026-01-13 22:47 (UTC+8))
- #32274 Fix Attention when query dim=4 [batch_size, num_tokens, heads, head_dim] — meta-exported,fb-exported — by henryoier (created: 2026-01-14 01:28 (UTC+8))
- #32273 [Perf] Only clone when needed for `moe_permute` — ready — by yewentao256 (created: 2026-01-14 01:04 (UTC+8))
- #32246 [Trivial] Remove duplicate enable_mfu_metrics — ready — by markmc (created: 2026-01-13 17:19 (UTC+8))
- #32271 ROCm WideEP Changes — rocm,ci/build — by varun-sundar-rabindranath (created: 2026-01-14 00:59 (UTC+8))
- #32233 [ROCm][CI] Fix HuggingFace flash_attention_2 accuracy issue in Isaac vision encoder — rocm,ready,multi-modality — by AndreasKaratzas (created: 2026-01-13 12:00 (UTC+8))
- #32272 [Quark] Support online block-diagonal rotations in OCP MX dense layers — no labels — by fxmarty-amd (created: 2026-01-14 00:59 (UTC+8))
- #32234 [Core] Optimize top-k + top-p sampling by avoiding full vocabulary sort — v1,meta-exported,fb-exported — by nadavrot (created: 2026-01-13 12:39 (UTC+8))
- #32232 [Frontend]: minimax_m2 supports structural_tag — documentation,structured-output,frontend,needs-rebase,v1,tool-calling — by chaunceyjiang (created: 2026-01-13 11:59 (UTC+8))
- #32253 [Frontend] Normalize Responses API input for multi-turn conversations — frontend,needs-rebase,gpt-oss — by daniel-salib (created: 2026-01-13 18:14 (UTC+8))
- #32251 [Refactor] [7/N] to simplify the vLLM lora serving architecture — frontend,ready — by chaunceyjiang (created: 2026-01-13 18:01 (UTC+8))
- #32265 Add support for LoRA adapters in Nemotron-H MTP models — no labels — by danisereb (created: 2026-01-13 23:09 (UTC+8))
- #32263 [cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation — v1,cpu — by gassan-arm (created: 2026-01-13 22:40 (UTC+8))
- #32254 [Refactor] Remove `MultiModalProfiler` — ready,multi-modality,llama — by DarkLight1337 (created: 2026-01-13 18:22 (UTC+8))
- #32261 Benchmarks: add receipts harness (telemetry + logs) — performance — by StanByriukov02 (created: 2026-01-13 21:27 (UTC+8))
- #32237 [Fix][MoE] Add SM120 support for FP8 MoE path — documentation,structured-output,frontend,needs-rebase,v1,cpu,nvidia — by malaiwah (created: 2026-01-13 13:32 (UTC+8))
- #32240 [Refactor] [6/N] to simplify the vLLM openai chat_completion serving architecture — frontend,ready,v1,multi-modality,tool-calling,llama,qwen,deepseek,gpt-oss — by chaunceyjiang (created: 2026-01-13 15:00 (UTC+8))
- #32248 fix(tool_parser): filter whitespace-only content before tool_call — qwen — by minimAluminiumalism (created: 2026-01-13 17:29 (UTC+8))
- #32243 [Bugfix] Replace `PoolingParams.normalize` with `use_activation` — documentation,frontend,ready — by DarkLight1337 (created: 2026-01-13 16:06 (UTC+8))
- #32250 Feat/support marlin for non gated moe — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by TomerBN-Nvidia (created: 2026-01-13 17:51 (UTC+8))
- #32241 [Refactor] Remove `get_encoder_dummy_data` — ready,v1,multi-modality — by DarkLight1337 (created: 2026-01-13 15:21 (UTC+8))
- #32229 [Model] Refactoring the implementation of qwen3 — qwen — by maang-h (created: 2026-01-13 11:09 (UTC+8))
- #32244 fix(rocm): Enable non-gated MoE (is_act_and_mul=False) support on ROCm — rocm — by rabi (created: 2026-01-13 16:28 (UTC+8))
- #32236 Fused MoE LoRA Kernel Optimizations — no labels — by cwazai (created: 2026-01-13 13:03 (UTC+8))
- #32230 feature: support eagle3 for HunyuanOCR & Qwen3VLMoe & Qwen2Audio — speculative-decoding,v1,qwen — by irisliu10 (created: 2026-01-13 11:17 (UTC+8))
- #32239 [Doc] Update installation from source command — documentation,ready — by esmeetu (created: 2026-01-13 14:58 (UTC+8))
- #32231 PR_001 (Benchmark): Reproduced upstream change — frontend,needs-rebase — by MitchLewis930 (created: 2026-01-13 11:32 (UTC+8))
Merged PRs
- #32245 [Model Runner V2] Refactor Sampler — v1 — by WoosukKwon (merged: 2026-01-14 09:58 (UTC+8))
- #30885 [Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding — performance,ready,ci/build,nvidia — by LopezCastroRoberto (merged: 2026-01-14 07:22 (UTC+8))
- #30784 [Improvement] Persist CUDA compat libraries paths to prevent reset on `apt-get` — ready,ci/build,multi-modality,nvidia — by emricksini-h (merged: 2026-01-14 06:35 (UTC+8))
- #31980 Add mergify label job for "bug" in PR titles — bug,ready,ci/build — by mgoin (merged: 2026-01-14 06:28 (UTC+8))
- #32282 [Build] Add scripts for cherry-picking and trigger build — ci/build — by simon-mo (merged: 2026-01-14 05:21 (UTC+8))
- #28502 [Misc] Add In-Container restart capability through supervisord for sagemaker entrypoint — documentation,ready,ci/build — by HappyAmazonian (merged: 2026-01-14 05:06 (UTC+8))
- #31753 [BugFix] Fix eagle draft_model_config and add tests — ready — by charlotte12l (merged: 2026-01-13 15:09 (UTC+8))
- #31711 fix(rocm): Use refresh_env_variables() for rocm_aiter_ops in test_moe — rocm,ready — by rabi (merged: 2026-01-14 03:11 (UTC+8))
- #32050 [EPLB][Cleanup] Remove `is_async_enabled` from `EplbModelState` — ready — by SageMoore (merged: 2026-01-14 02:19 (UTC+8))
- #32058 [Perf] Optimize grouped topk kernel, 1.2%~2% E2E Throughput improvement — performance,ready — by yewentao256 (merged: 2026-01-14 02:58 (UTC+8))
- #32276 Fix CUDA 13 wheel installation doc — documentation,ready,nvidia — by dmitry-tokarev-nv (merged: 2026-01-14 02:48 (UTC+8))
- #32100 [responseAPI] support partial message generation — frontend,ready — by qandrew (merged: 2026-01-14 02:41 (UTC+8))
- #32169 [BugFix] [KVConnector] Fix KV events for LMCache connector — ready,kv-connector — by hickeyma (merged: 2026-01-13 23:50 (UTC+8))
- #32181 nixl_connector: export UCX_MEM_MMAP_HOOK_MODE=none to avoid a UCX memory leak — ready,kv-connector — by hasB4K (merged: 2026-01-14 00:21 (UTC+8))
- #32060 [4/N][Attention] Move MLA common to model_executor — rocm,speculative-decoding,ready,v1,kv-connector,nvidia,ready-run-all-tests — by MatthewBonanni (merged: 2026-01-14 01:08 (UTC+8))
- #32246 [Trivial] Remove duplicate enable_mfu_metrics — ready — by markmc (merged: 2026-01-14 01:09 (UTC+8))
- #32198 [Docs] Nixl Usage recommend `fail` kv_load_failure_policy — documentation,ready,kv-connector — by NickLucche (merged: 2026-01-13 20:51 (UTC+8))
- #32061 [ROCm][CI] Fix engine core client tests for ROCm spawn multiprocessing — rocm,ready,v1 — by AndreasKaratzas (merged: 2026-01-13 15:14 (UTC+8))
- #32099 [ROCm][Bugfix] Fix Mamba batched decode producing incorrect output — rocm,ready — by AndreasKaratzas (merged: 2026-01-13 13:46 (UTC+8))
- #32233 [ROCm][CI] Fix HuggingFace flash_attention_2 accuracy issue in Isaac vision encoder — rocm,ready,multi-modality — by AndreasKaratzas (merged: 2026-01-13 14:33 (UTC+8))
- #32212 Fix various typos found in `docs` — documentation,structured-output,ready,cpu — by potatosalad (merged: 2026-01-13 11:41 (UTC+8))
- #32251 [Refactor] [7/N] to simplify the vLLM lora serving architecture — frontend,ready — by chaunceyjiang (merged: 2026-01-13 23:37 (UTC+8))
- #32254 [Refactor] Remove `MultiModalProfiler` — ready,multi-modality,llama — by DarkLight1337 (merged: 2026-01-13 23:10 (UTC+8))
- #32215 [6/N][Attention] Move utils to more appropriate locations — ready,v1,nvidia,ready-run-all-tests — by MatthewBonanni (merged: 2026-01-13 21:38 (UTC+8))
- #32240 [Refactor] [6/N] to simplify the vLLM openai chat_completion serving architecture — frontend,ready,v1,multi-modality,tool-calling,llama,qwen,deepseek,gpt-oss — by chaunceyjiang (merged: 2026-01-13 21:01 (UTC+8))
- #29867 [Quantization] fix: overflow with static per-tensor scaling — bug,ready,v1,deepseek — by mickaelseznec (merged: 2026-01-13 20:56 (UTC+8))
- #32243 [Bugfix] Replace `PoolingParams.normalize` with `use_activation` — documentation,frontend,ready — by DarkLight1337 (merged: 2026-01-13 18:45 (UTC+8))
- #32241 [Refactor] Remove `get_encoder_dummy_data` — ready,v1,multi-modality — by DarkLight1337 (merged: 2026-01-13 17:21 (UTC+8))
- #32126 [Model] Use mm_position to compute mrope positions for Qwen2-VL/2.5-VL — ready,qwen — by YunzhuLu (merged: 2026-01-13 17:04 (UTC+8))
- #32239 [Doc] Update installation from source command — documentation,ready — by esmeetu (merged: 2026-01-13 15:10 (UTC+8))
- #32226 [Misc] improve warning/assert messages — ready,v1 — by cjackal (merged: 2026-01-13 11:11 (UTC+8))
- #32211 [Perf] Optimize requests abort — ready,v1 — by yewentao256 (merged: 2026-01-13 12:11 (UTC+8))
- #31956 [Frontend] Add `reasoning_effort` to `OpenAIServing._preprocess_chat()` — frontend,ready — by sanghoon-yn (merged: 2026-01-13 11:21 (UTC+8))
PRs Closed Without Merging
- #32051 [Misc] support separate draft model loading config from base model — speculative-decoding,v1,meta-exported,fb-exported — by ZhengkaiZ (closed: 2026-01-14 10:22 (UTC+8))
- #31746 [Misc] Create an Eagle3Llama interface to help extend more eagle3 model — speculative-decoding,v1,llama,meta-exported,fb-exported — by ZhengkaiZ (closed: 2026-01-14 00:48 (UTC+8))
- #32297 Allow configure skip_special_tokens in openai response api — frontend,needs-rebase — by 842974287 (closed: 2026-01-14 09:21 (UTC+8))
- #32292 [Doc] Add note to docker.md on --model arg — documentation,ci/build — by jaysonfrancis (closed: 2026-01-14 06:49 (UTC+8))
- #27214 [MXFP4] CT Integration Support — no labels — by dsikka (closed: 2026-01-14 05:25 (UTC+8))
- #32284 [Quant] Support MXFP4 W4A16 for compressed-tensors MoE models — no labels — by dsikka (closed: 2026-01-14 03:46 (UTC+8))
- #31749 [CI/Build] add new target for building CPU image with model — documentation,ci/build,cpu — by nathan-weinberg (closed: 2026-01-14 03:18 (UTC+8))
- #32271 ROCm WideEP Changes — rocm,ci/build — by varun-sundar-rabindranath (closed: 2026-01-14 00:59 (UTC+8))
- #28421 map torchao quantized checkpoints to vLLM's MoE kernels — qwen — by vkuzo (closed: 2026-01-13 21:24 (UTC+8))
- #32237 [Fix][MoE] Add SM120 support for FP8 MoE path — documentation,structured-output,frontend,needs-rebase,v1,cpu,nvidia — by malaiwah (closed: 2026-01-13 21:08 (UTC+8))
- #32156 [Frontend] Fix missing channel assignment for multi-turn message parsing — frontend,gpt-oss — by daniel-salib (closed: 2026-01-13 18:15 (UTC+8))
- #32155 [Frontend] Normalize Responses API input for client compatibility — frontend — by daniel-salib (closed: 2026-01-13 18:15 (UTC+8))
- #32250 Feat/support marlin for non gated moe — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by TomerBN-Nvidia (closed: 2026-01-13 18:04 (UTC+8))
- #26330 [Chore] fix spelling in kv_transfer_config variable name — ready,stale,v1,kv-connector — by natoscott (closed: 2026-01-13 17:35 (UTC+8))
- #32145 update cutlass_moe_mm error check message — nvidia — by XiaobingSuper (closed: 2026-01-13 17:03 (UTC+8))
- #32229 [Model] Refactoring the implementation of qwen3 — qwen — by maang-h (closed: 2026-01-13 16:36 (UTC+8))
- #32231 PR_001 (Benchmark): Reproduced upstream change — frontend,needs-rebase — by MitchLewis930 (closed: 2026-01-13 11:34 (UTC+8))