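vLLM Development Report - 2026-01-05

Time window: 2026-01-05 10:53 (UTC+8) ~ 2026-01-06 10:53 (UTC+8)
Statistics: New Issues 10 | Closed Issues 13 | New PRs 60 | Merged PRs 37 | PRs closed without merging 9

📊 Daily Development Summary

From January 5 to 6, 2026, vLLM development stayed very active: 10 new Issues were opened and 37 PRs were merged. Work this period centered on performance optimization (especially the GLM-4 series and the scheduler) and on platform compatibility. The two highlights were AMD (ROCm) fixes and multimodal model support (VL/ASR). The community also worked through a number of installation bugs, runtime bugs, and CI/CD pipeline problems to keep the project stable.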

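🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was high this period, concentrated on bug fixes and platform compatibility.

CI test fixes:
- PR #31713 (mawong-amd): [Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group. This PR aims to fix all remaining failures in the AMD quantization test group, reflecting the AMD team's continued investment in stable quantization support on its platform.
- PR #31728: [CI][ROCm] Fix NIXL tests on ROCm. A follow-up fix for running the NIXL (NVIDIA) KV connector tests in the AMD CI pipeline, which preserves cross-platform test coverage.
Hardware compatibility and error handling:
- PR #31715: [ROCm] Improve error handling while loading quantized model on gfx120…. This PR fixes a startup crash when loading any quantized model (such as AWQ or GPTQ) on architectures that aiter does not support, such as AMD RDNA 4 (gfx1201). Imports are wrapped safely and degrade gracefully, so an attempt to load the unsupported Quark OCP MX component no longer brings down the whole server.
- PR #31729 & PR #31733: Both PRs replace hardcoded device="cuda" strings in code that runs on AMD (ROCm): the first in the AITER MLA and Fused MoE kernels, the second in the compilation fusion helpers. The code now uses current_platform.device_type, which avoids runtime errors.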
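The "wrap the import and degrade gracefully" approach described for PR #31715 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from the PR: `try_import`, `select_quant_backend`, and the module name `hypothetical_aiter_backend` are all made up for the example.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def try_import(module_name: str) -> Optional[ModuleType]:
    """Import an optional backend module; return None instead of raising."""
    try:
        return importlib.import_module(module_name)
    except Exception as exc:
        # Broad catch on purpose: an optional native backend can fail with
        # ImportError, OSError, or RuntimeError depending on the platform.
        logger.warning(
            "Optional module %s is unavailable (%s); using fallback.",
            module_name,
            exc,
        )
        return None


def select_quant_backend(
    optional_module: str = "hypothetical_aiter_backend",  # hypothetical name
) -> str:
    """Pick the accelerated path only if its module imports cleanly."""
    if try_import(optional_module) is not None:
        return "accelerated"
    return "reference"
```

With this shape, a missing or unsupported optional component changes which path is selected instead of aborting startup.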
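Feature adaptation:
- PR #31712: fix(rocm): Auto-switch hybrid models from ROCM_ATTN to TRITON_ATTN. Hybrid models such as Nemotron-H need large block sizes for Mamba state alignment, which exceeds the shared memory limit of AMD GPUs. With this PR, when ROCM_ATTN is explicitly requested for a hybrid model, vLLM falls back to the TRITON_ATTN backend automatically and gracefully.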
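The fallback rule from PR #31712 amounts to a small decision function. The sketch below is an assumption-laden illustration: `ModelInfo` and `resolve_attention_backend` are hypothetical names, and only the backend strings ROCM_ATTN and TRITON_ATTN come from the PR title.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class ModelInfo:
    """Hypothetical model descriptor used only in this sketch."""
    name: str
    is_hybrid: bool  # True for attention + Mamba models such as Nemotron-H


def resolve_attention_backend(requested: str, model: ModelInfo) -> str:
    """Return the backend to use, downgrading ROCM_ATTN for hybrid models."""
    if requested == "ROCM_ATTN" and model.is_hybrid:
        logger.warning(
            "%s is a hybrid model; switching from ROCM_ATTN to TRITON_ATTN.",
            model.name,
        )
        return "TRITON_ATTN"
    # Any other combination is left exactly as the user requested it.
    return requested
```

Logging the switch rather than raising keeps an explicit but unworkable user setting from blocking startup, which matches the "automatic and graceful" behavior described above.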
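Speculative decoding support:
- PR #31714: [Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in eagle.py. Fixes an attention metadata type mismatch when running Eagle speculative decoding on the ROCm ATTN backend, improving speculative decoding support on AMD.
MoE refactor:
- PR #31542: [MoE Refactor] Aiter Experts for BF16 MoE. Wraps the unquantized rocm_aiter_fused_expert kernel in the modular kernel format. It is part of the MoE refactor and helps unify and optimize the MoE execution path on AMD.

Summary: AMD-related contributions this period were mostly foundational. They fixed several key bugs in quantized model loading, device identification, and backend compatibility, and continued the refactoring and integration of core modules such as MoE on ROCm, showing a systematic deepening of AMD hardware support.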

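💬 Highly Discussed Issues

Issue #31726: [Usage]: Why does vllm serve keep filling up my system disk… (5 comments)
- Core topic: A user found that when vllm serve loads a model from a network mount, system disk space keeps being consumed and is not released after the process stops.
- Viewpoints:
  - User (@tingjun-cs): After detailed troubleshooting, ruled out both vLLM's own cache directory (~/.cache/vllm) and cache redirection through environment variables as the cause, and suggested the problem may lie in kernel-level filesystem usage accounting, for example files that are deleted but still held open.
  - Contributor (@chaunceyjiang): Initially suggested checking the vLLM cache directory, then proposed a deeper disk usage analysis with du -d 1 /.
- Current status: Still under open discussion. The investigation has moved from a simple cache path question to filesystem behavior at the container and operating system level. It is a typical diagnosis of a complex environment problem.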
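One way to test the "deleted but held open" hypothesis from Issue #31726 is to scan a process's file descriptors under /proc. The sketch below is Linux-only and is not taken from the issue thread; the function names are made up for this example. It relies on the Linux convention that the symlink target of an unlinked open file ends in " (deleted)".

```python
import os
from typing import List, Optional, Tuple

DELETED_SUFFIX = " (deleted)"


def deleted_target(link_target: str) -> Optional[str]:
    """Return the original path if a /proc fd link points at a deleted file."""
    if link_target.endswith(DELETED_SUFFIX):
        return link_target[: -len(DELETED_SUFFIX)]
    return None


def find_deleted_open_files(
    pid: int, proc_root: str = "/proc"
) -> List[Tuple[int, str, int]]:
    """List (fd, path, size_bytes) for deleted files a process still holds open."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    results: List[Tuple[int, str, int]] = []
    try:
        fds = os.listdir(fd_dir)
    except OSError:
        return results  # not Linux, no such process, or no permission
    for fd in fds:
        fd_path = os.path.join(fd_dir, fd)
        try:
            path = deleted_target(os.readlink(fd_path))
            if path is None:
                continue
            # stat() follows the link to the still-open inode.
            size = os.stat(fd_path).st_size
        except OSError:
            continue  # descriptor was closed while scanning
        results.append((int(fd), path, size))
    return results
```

Calling `find_deleted_open_files(pid)` with the server's process ID while disk usage is growing shows whether space is pinned by unlinked files; `lsof +L1` gives a similar system-wide view.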
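Issue #31708: [Bug]: When using image_embeds… (3 comments)
- Core topic: A user hit an out-of-bounds array error when passing image_embeds input, involving multimodal input handling for models such as Qwen2-VL.
- Viewpoints:
  - Contributor (@DarkLight1337): Explained in detail the tensor dimensions expected for image_embeds and image_grid_thw (2-D, with no batch dimension) and noted that this format applies only to v0.13 and later.
  - User (@NewZxy): Reported that the problem persisted after adjusting the dimensions as suggested, and offered a preliminary analysis suspecting that ImageProcessorItems is used where ImageEmbeddingItems should be.
- Points of contention: None of substance. The thread is mostly clarification and version guidance during troubleshooting.
- Current status: Open. The problem may be a bug in how vLLM handles multimodal data structures internally.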
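The shape guidance in the thread above (2-D tensors, no batch dimension) can be checked before a request is sent. The helper below is a hypothetical pre-flight check, not part of vLLM; it works on shape tuples so it needs no tensor library, and the "3 columns for (t, h, w)" rule is an assumption read off the name image_grid_thw.

```python
from typing import Sequence


def check_image_embed_shapes(
    embeds_shape: Sequence[int], grid_thw_shape: Sequence[int]
) -> None:
    """Raise ValueError if the shapes do not match the 2-D layout described above."""
    if len(embeds_shape) != 2:
        raise ValueError(
            f"image_embeds must be 2-D without a batch dimension, "
            f"got shape {tuple(embeds_shape)}"
        )
    if len(grid_thw_shape) != 2:
        raise ValueError(
            f"image_grid_thw must be 2-D without a batch dimension, "
            f"got shape {tuple(grid_thw_shape)}"
        )
    # Assumption: each row holds (t, h, w), so there are exactly 3 columns.
    if grid_thw_shape[1] != 3:
        raise ValueError(
            f"image_grid_thw rows must have 3 entries, got {grid_thw_shape[1]}"
        )
```

A caller would pass `tensor.shape` for each input. Passing this check does not rule out the internal bug suspected in the issue; it only removes the shape mismatch as a cause.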
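Issue #31755: [Feature]: Optimizations for GLM4.7 (3 comments)
- Core topic: Tracks optimization tasks for the GLM-4.7 model in vLLM.
- Viewpoints:
  - Author (@yewentao256): Listed a detailed set of optimization tasks (a group_topk kernel, a cutlass moe fill optimization, and turning get_cutlass_moe_mm_problem_sizes_from_expert_offsets into a kernel) and provided full accuracy and performance benchmarks.
  - Other contributors (@jeejeelee, @h1248759074): Added that the qk_norm_rope fusion is not triggered for the FP8 version, and asked about dependency installation details.
- Trend: This Issue shows the community collaborating on deep, systematic performance tuning for a specific popular model family (GLM). The related optimization PR #31754 (optimizing the extra fill(0) in cutlass moe) was merged in the same period.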

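🔥 Hot Topics and Trends

Performance optimization wave: Optimization of the GLM model family was the dominant topic. Beyond tracking Issue #31755, the merged PR #31754 removes a redundant fill(0) operation and reports a 2.9% end-to-end throughput improvement and a 10.8% TTFT improvement for GLM-4.7-FP8. This reflects a trend of squeezing maximum performance out of high-profile models.
Multimodal support keeps deepening:
- New model integrations: PR #31758 adds support for LiquidAI's LFM2-VL model family, and PR #30864 adds NVIDIA Nemotron Parse 1.1, so the set of supported vision-language models keeps growing.
- Feature completion: Several PRs (such as #31724, #31620, and #31696) enable LoRA support for existing multimodal models including Pixtral, BLIP2, and H2OVL, extending their fine-tuning options.
- Low-level fixes: Issue #31708 points to a possible bug in internal multimodal input handling, and PR #31765 (along with the earlier PR #31699, closed without merging) targets a potential scheduler deadlock tied to encoder cache management in high-concurrency multimodal workloads. The community is actively working on fixes.
Systematic hardening of the AMD ecosystem: As described above, this period saw a batch of AMD fix PRs covering the test pipeline, device identification, error handling, and kernel compatibility. This suggests the AMD team or community is closing gaps in ROCm support systematically rather than adding scattered features.
CPU backend steady progress: The focus is shifting from "works" to "works well" and "robust". There are requests to add microbenchmarks for the CPU backend (Issue #31721, PR #31720), and core bugs are being fixed, such as the gibberish output under --enforce-eager caused by missing forward_cpu() implementations (PR #31643).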

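🛠️ Key Technical Changes

PR #31754: [Perf] Optimize additional fill(0) in cutlass moe…: Uses the expert_first_token_offset information to avoid one redundant zero-fill, giving MoE models such as GLM-4.7-FP8 a measurable end-to-end gain (2.9% throughput and 10.8% TTFT, per the PR title). It is a good example of a narrowly targeted optimization for a specific compute pattern.
PR #31406: [v1] Add encoder-only/cross attention support to Triton Attention backend: Adds encoder-only and cross attention support to the Triton Attention backend and introduces matching bidirectional and causal sliding-window MHA kernels. This addresses FlexAttention's weak support for some head_size values and Flash Attention's compute capability limits on older GPUs (such as Turing and Volta), giving encoder models such as Whisper a better and more compatible backend option.
PR #31643: [Bugfix][CPU] Fix RotaryEmbedding fallback causing gibberish with --enforce-eager: A root-cause fix for gibberish output on the CPU backend in forced eager mode. Several CustomOp layers (such as RotaryEmbedding and RMSNorm) lacked a forward_cpu() implementation and incorrectly fell back to the CUDA path. Correcting the default dispatch logic of CustomOp.forward_cpu() restores correct CPU execution.
PR #31732 / #31750: Disabling and restoring the B200 tests: Because of connection problems with the B200 runner on DGX Cloud, the related CI tests were temporarily disabled, then quickly restored once the problem appeared to be resolved. This shows agile management of CI resources and attention to test coverage.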
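The dispatch problem behind PR #31643 can be illustrated with a toy class hierarchy. This is a simplified sketch, not vLLM's actual CustomOp: `CustomOpSketch` and `ScaleOp` are hypothetical, and the only idea carried over from the description above is that the CPU default should route to a portable implementation instead of a GPU-specific one.

```python
from typing import Callable, List


class CustomOpSketch:
    """Toy model of per-platform dispatch (hypothetical, not vLLM code)."""

    def __init__(self, platform: str) -> None:
        self._forward = self._dispatch(platform)

    def _dispatch(self, platform: str) -> Callable[[List[float]], List[float]]:
        table = {"cuda": self.forward_cuda, "cpu": self.forward_cpu}
        # Unknown platforms use the portable implementation.
        return table.get(platform, self.forward_native)

    def forward_native(self, x: List[float]) -> List[float]:
        raise NotImplementedError

    def forward_cuda(self, x: List[float]) -> List[float]:
        raise NotImplementedError

    def forward_cpu(self, x: List[float]) -> List[float]:
        # Safe default: fall back to the portable path, never the GPU path.
        return self.forward_native(x)

    def __call__(self, x: List[float]) -> List[float]:
        return self._forward(x)


class ScaleOp(CustomOpSketch):
    """An op that defines only a native and a GPU path, with no forward_cpu()."""

    def forward_native(self, x: List[float]) -> List[float]:
        return [2.0 * v for v in x]

    def forward_cuda(self, x: List[float]) -> List[float]:
        raise RuntimeError("GPU kernel is unavailable in this sketch")
```

Here `ScaleOp("cpu")([1.0, 2.0])` runs the native path even though `ScaleOp` never defines `forward_cpu()`. If the base-class default pointed at `forward_cuda()` instead, every op without an explicit CPU override would take the GPU-specific path.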

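📈 Development Activity

Efficient merging: 37 PRs were merged within 24 hours, which points to an active contributor base and an efficient review and merge process.
Platform contributors stand out: AMD employees or contributors (such as mawong-amd and c0de128) were especially active this period, submitting several key fix PRs that improved the stability and feature completeness of the ROCm platform.
Multimodal work is active: Several contributors collaborated on new vision-language models, LoRA support, and bug fixes, making this area the current frontier of expansion.
Active Issue closure: 13 Issues were closed, including some long-standing ones (such as #24207 and #23739), showing that the community is cleaning up its backlog while shipping new features.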

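💡 Issues Worth Watching

Potential scheduler blocking and deadlock: Issue #31731 proposes a scheduler improvement so that a single KV-cache-heavy request cannot block other requests. Separately, PR #31765 and PR #31699 try to resolve a scheduling deadlock that the encoder cache can trigger in multimodal workloads. Scheduler fairness and robustness strongly affect service quality under high concurrency and mixed load, and deserve continued attention.
Transformers 5.x compatibility pains: Issue #31485 (closed via PR #31622) showed that tool calling for GLM-4.6V models broke under Transformers 5.0.0rc1. It is a reminder that major upstream version upgrades can bring complex compatibility problems that need broader testing and adaptation.
New AMD architecture support: PR #31715 handles the quantized-model loading crash on gfx1201 (RDNA 4) caused by missing aiter support. As AMD ships new GPU architectures, making sure vLLM's core components such as quantization and attention either support them or fail gracefully is an important part of maintaining platform compatibility.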
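The head-of-line blocking concern raised in Issue #31731 above can be shown with a toy admission loop. This is an illustrative sketch only: it is neither vLLM's scheduler nor the proposal in the issue, and `Request`, `schedule_strict_fcfs`, `schedule_skip_ahead`, and `max_skips` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    request_id: str
    blocks_needed: int  # KV cache blocks required to admit the request


def schedule_strict_fcfs(waiting: List[Request], free_blocks: int) -> List[str]:
    """Admit in order and stop at the first request that does not fit."""
    scheduled: List[str] = []
    for req in waiting:
        if req.blocks_needed > free_blocks:
            break  # everything behind this request is blocked too
        scheduled.append(req.request_id)
        free_blocks -= req.blocks_needed
    return scheduled


def schedule_skip_ahead(
    waiting: List[Request], free_blocks: int, max_skips: int = 1
) -> List[str]:
    """Admit in order, but step over up to max_skips requests that do not fit."""
    scheduled: List[str] = []
    skipped = 0
    for req in waiting:
        if req.blocks_needed > free_blocks:
            skipped += 1
            if skipped > max_skips:
                break
            continue
        scheduled.append(req.request_id)
        free_blocks -= req.blocks_needed
    return scheduled
```

With a queue of one 100-block request followed by two small ones and only 10 free blocks, the strict policy admits nothing while the skip-ahead policy admits both small requests. The trade-off is fairness: a policy that keeps stepping over the large request needs some bound or aging rule so that request is not starved.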

📋 Appendix: Detailed Data

New Issues

#31755 [Feature]: Optimizations for GLM4.7 | feature request | by yewentao256 (created: 2026-01-06 07:15 (UTC+8))
#31708 [Bug]: When using image_embeds, ImageProcessorItems are used instead of ImageEmbeddingItems, causing an out-of-bounds array error. | bug | by NewZxy (created: 2026-01-05 15:16 (UTC+8))
#31710 perf.vllm.ai is not available | no label | by pacoxu (created: 2026-01-05 15:44 (UTC+8))
#31731 [Performance]: scheduler improvement to prevent KV-cache heavy requests from blocking others | performance | by Waloid24 (created: 2026-01-05 23:48 (UTC+8))
#31726 [Usage]: Why does vllm serve keep filling up my system disk when loading a model from a network mount? | usage | by tingjun-cs (created: 2026-01-05 22:50 (UTC+8))
#31721 [Feature]: Fused MoE Micro Benchmark for CPU Backend | feature request,cpu | by andikarachman (created: 2026-01-05 20:31 (UTC+8))
#31718 [Installation]: cuda error when install from source | installation | by inv1s10n (created: 2026-01-05 18:51 (UTC+8))
#31709 [Bug]: After upgrade to 0.11.2, vllm crashs with Qwen3. | bug | by tonyaw (created: 2026-01-05 15:17 (UTC+8))
#31697 [New Model]: HY-MT1.5-1.8B | new-model | by Busboy3129 (created: 2026-01-05 11:36 (UTC+8))
#31700 [Bug]: CUDA error: “provided PTX was compiled with an unsupported toolchain” on vLLM 0.11.2 | bug | by dry86 (created: 2026-01-05 12:57 (UTC+8))

Closed Issues

#24207 [Feature]: Support similar API, such as /health_generate | feature request,stale | by (closed: 2026-01-06 10:42 (UTC+8))
#28975 [Bug]: aot_compile does not preserve dynamic shapes state on cache hit. | bug,torch.compile | by laithsakka (closed: 2026-01-06 07:03 (UTC+8))
#31579 [Bug]: VLLM_FLOAT32_MATMUL_PRECISION=tf32 does not set cublas tf32 matmul | bug | by cjackal (closed: 2026-01-06 06:31 (UTC+8))
#29515 [CI Failure]: mi325_1: V1 Test attention (H100) | ci-failure | by AndreasKaratzas (closed: 2026-01-06 05:50 (UTC+8))
#29459 [CI Failure]: mi325_1: Language Models Test (Extended Generation) | ci-failure | by AndreasKaratzas (closed: 2026-01-06 05:49 (UTC+8))
#30628 [Bug]: For building a CUDA 13 vLLM docker image, when building LMCache, wrong version of NIXL (nixl-cu12) is downloaded | bug,kv-connector,nvidia | by wangshangsam (closed: 2026-01-06 04:50 (UTC+8))
#31485 [Feature]: Support Tool Calling with transformers 5.x for GLM-4.6V Models | feature request | by tamthaihoangminh (closed: 2026-01-06 03:32 (UTC+8))
#23739 [Performance]: The full cudagraph seems not work. | performance,stale | by xsank (closed: 2026-01-06 01:34 (UTC+8))
#31642 [Bug]: TypeError in DeviceCommunicatorBase.dispatch due to method signature mismatch | bug,cpu | by kzwrime (closed: 2026-01-06 01:26 (UTC+8))
#31626 [Bug][CPU Backend]: Gibberish output on CPU backend when --enforce-eager is enabled (Qwen3-0.6B) | bug,cpu | by kzwrime (closed: 2026-01-06 01:25 (UTC+8))
#31014 [Bug]: GPT-OSS-120B Eagle-v2 High concurrency perf drop | bug,speculative-decoding | by shyeh25 (closed: 2026-01-05 17:15 (UTC+8))
#31691 [CI Failure]: mi325_4: LoRA TP Test (Distributed) | ci-failure | by AndreasKaratzas (closed: 2026-01-05 13:45 (UTC+8))
#31700 [Bug]: CUDA error: “provided PTX was compiled with an unsupported toolchain” on vLLM 0.11.2 | bug | by dry86 (closed: 2026-01-05 13:00 (UTC+8))

New PRs

#31762 [Bugfix]Add rollback mechanism when XPU kernel does not support FP32 precision FLASH_ATTN in UT | v1 | by 1643661061leo (created: 2026-01-06 10:21 (UTC+8))
#31765 [BUGFIX] free encode inputs in _update_after_schedule to avoid dead lock in sc… | v1 | by frelam (created: 2026-01-06 10:49 (UTC+8))
#31761 [Frontend] Add MCP tool streaming support to Responses API | frontend,gpt-oss | by daniel-salib (created: 2026-01-06 10:13 (UTC+8))
#31717 [Perf] GLM ASR | no label | by JaredforReal (created: 2026-01-05 18:45 (UTC+8))
#31735 [bugfix] prevent special token injection in user content | frontend | by usepr (created: 2026-01-06 00:07 (UTC+8))
#31716 Consolidate Intel Quantization Toolkit Integration in vLLM | documentation | by yiliu30 (created: 2026-01-05 18:23 (UTC+8))
#31764 [CI] Fix CPU MM PRocessor Test | no label | by robertgshaw2-redhat (created: 2026-01-06 10:32 (UTC+8))
#31763 [perf] Fused operator SplitMrope used in the Qwen2.5-Omni-7B model | qwen | by fuzhihong699 (created: 2026-01-06 10:32 (UTC+8))
#31759 [MoE Refactor] Add Temporary Integration Tests - H100 | ready,ci/build,nvidia | by robertgshaw2-redhat (created: 2026-01-06 09:31 (UTC+8))
#31747 [Misc] Fix Current vLLM config is not set. warnings, assert to avoid issues in the future | v1,multi-modality,cpu,nvidia,ready-run-all-tests | by LucasWilkinson (created: 2026-01-06 05:45 (UTC+8))
#31699 [BUGFIX] free encode cache in _update_after_schedule to avoid dead lock in sc… | v1 | by frelam (created: 2026-01-05 11:53 (UTC+8))
#31753 [BugFix]Fix eagle draft_model_config and add tests | no label | by charlotte12l (created: 2026-01-06 07:09 (UTC+8))
#31760 [Cleanup] Remove redundant decoder_layer_type assignment in Qwen2 | qwen | by maang-h (created: 2026-01-06 09:40 (UTC+8))
#31725 [Misc] Enable Paligemma’s PrefixLM attention mask computation | ready,multi-modality | by Isotr0py (created: 2026-01-05 22:23 (UTC+8))
#31754 [Perf] Optimize additional fill(0) in cutlass moe, 2.9% E2E throughput improvement, 10.8% TTFT improvement | ready,nvidia | by yewentao256 (created: 2026-01-06 07:10 (UTC+8))
#31758 [Model] Add LFM2-VL model support | new-model,v1 | by tianshu-Michael-yu (created: 2026-01-06 08:14 (UTC+8))
#31751 [Bug Fix] Handle variable-length tensors in MultiModalFlatField batching | multi-modality | by AndriiPasternak31 (created: 2026-01-06 06:48 (UTC+8))
#31750 Revert “[CI Failure] Disable B200 tests while runner is broken” | ready,ci/build | by mgoin (created: 2026-01-06 06:24 (UTC+8))
#31738 [Models]: Use MMEncoderAttention for MoonViT | ready | by Isotr0py (created: 2026-01-06 00:41 (UTC+8))
#31757 [Bugfix][MTP] Fix GLM4 MoE fp8 loading with MTP on | no label | by andyl98 (created: 2026-01-06 07:54 (UTC+8))
#31756 [Misc][BE] Turn on strict type coverage for vllm/compilation | nvidia | by Lucaskabela (created: 2026-01-06 07:39 (UTC+8))
#31744 [Misc][BE] Type coverage for vllm/compilation [2/3] | nvidia | by Lucaskabela (created: 2026-01-06 03:39 (UTC+8))
#31748 [Misc][BE] Type coverage for vllm/compilation [3/3] | nvidia | by Lucaskabela (created: 2026-01-06 05:58 (UTC+8))
#31752 [MoE Refactoring][Bugfix]Wrap WNA16 Triton kernel into mk and change compressed tensor kernel selection | no label | by zyongye (created: 2026-01-06 06:57 (UTC+8))
#31742 [Bugfix] Fix Broken ModelOpt NVFP4 MoE | bug,ready,llama,nvidia | by robertgshaw2-redhat (created: 2026-01-06 02:16 (UTC+8))
#31724 [Model] Enable LoRA support for Pixtral | documentation | by A1c0r-Z (created: 2026-01-05 22:06 (UTC+8))
#31713 [Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group | rocm,ci/build | by mawong-amd (created: 2026-01-05 16:54 (UTC+8))
#31722 [PERF] Speed-up of GDN attention decode part (Qwen3-Next) | performance,ready,qwen | by vadiklyutiy (created: 2026-01-05 20:37 (UTC+8))
#31737 [Frontend][gpt-oss] Allow system message to overwrite model identity | frontend,gpt-oss | by qandrew (created: 2026-01-06 00:37 (UTC+8))
#31749 [CI/Build] add new target for building CPU image with model | documentation,ci/build,cpu | by nathan-weinberg (created: 2026-01-06 06:00 (UTC+8))
#31707 [Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator | v1 | by ivanium (created: 2026-01-05 15:14 (UTC+8))
#31734 [Cleanup] Remove deprecated fields from CachedRequestData class | ready,v1 | by njhill (created: 2026-01-06 00:01 (UTC+8))
#31745 [Bugfix] Remove the num_hidden_layers override for glm4_moe | no label | by andyl98 (created: 2026-01-06 03:57 (UTC+8))
#31746 Create an interface to support more eagle3 model | speculative-decoding,v1,llama,meta-exported,fb-exported | by ZhengkaiZ (created: 2026-01-06 05:10 (UTC+8))
#31741 Add /pause/step fast pause + barrier for online weight update | frontend,v1 | by vmoens (created: 2026-01-06 01:50 (UTC+8))
#31739 [Spec Decode][UX] Add acceptance rates to vllm bench serve report | performance | by MatthewBonanni (created: 2026-01-06 00:52 (UTC+8))
#31740 feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support | v1,nvidia | by seli-equinix (created: 2026-01-06 01:23 (UTC+8))
#31743 [perf][MLA] Fuse RoPE/FP8 quantization/Q write using mla_rope_quantize_fp8 | v1,nvidia | by minosfuture (created: 2026-01-06 03:15 (UTC+8))
#31723 [Core] Optimize expensive deepcopy in GPU model runner | v1 | by GOavi101 (created: 2026-01-05 21:04 (UTC+8))
#31736 Update how docs are rendered for cli reference | documentation | by ashwin-phadke (created: 2026-01-06 00:19 (UTC+8))
#31732 [CI Failure] Disable B200 tests while runner is broken | ready,ci/build,ci-failure | by mgoin (created: 2026-01-05 23:53 (UTC+8))
#31730 [Cleanup] Unify flashinfer utility code | nvidia | by majiayu000 (created: 2026-01-05 23:20 (UTC+8))
#31729 [Bugfix][Hardware][AMD] Fix hardcoded device in AITER MLA and Fused MOE | rocm | by c0de128 (created: 2026-01-05 23:04 (UTC+8))
#31727 Fix versions for vulnerable packages | ci/build | by adobrzyn (created: 2026-01-05 22:57 (UTC+8))
#31733 [Bugfix][Hardware][AMD] Use platform device type in compilation fusion helpers | rocm | by c0de128 (created: 2026-01-05 23:57 (UTC+8))
#31719 [Misc] Support qwen3-next lora | qwen | by BJWang-ant (created: 2026-01-05 19:30 (UTC+8))
#31728 [CI][ROCm] Fix NIXL tests on ROCm | rocm,ready,ci/build,kv-connector | by NickLucche (created: 2026-01-05 23:04 (UTC+8))
#31715 [ROCm] Improve error handling while loading quantized model on gfx120… | rocm | by brian033 (created: 2026-01-05 17:44 (UTC+8))
#31720 [cpu][bench] Add CPU paged attention benchmarks | performance,cpu | by fadara01 (created: 2026-01-05 20:28 (UTC+8))
#31714 [Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in eagle.py | rocm,speculative-decoding,v1 | by vllmellm (created: 2026-01-05 17:09 (UTC+8))
#31702 Fix ijson build for Power. | ci/build | by npanpaliya (created: 2026-01-05 14:06 (UTC+8))
#31712 fix(rocm): Auto-switch hybrid models from ROCM_ATTN to TRITON_ATTN | documentation,rocm | by rabi (created: 2026-01-05 16:28 (UTC+8))
#31704 [KVconnector][LMCache] remove the import of legacy LMCache code | ready,kv-connector | by ApostaC (created: 2026-01-05 14:26 (UTC+8))
#31711 fix(rocm): Use refresh_env_variables() for rocm_aiter_ops in test_moe | rocm | by rabi (created: 2026-01-05 15:55 (UTC+8))
#31705 [BugFix] Support setting tp=1 for the Eagle draft model to take effect | speculative-decoding,v1 | by zhaomingyu13 (created: 2026-01-05 14:32 (UTC+8))
#31706 perf glmasr | no label | by JaredforReal (created: 2026-01-05 14:57 (UTC+8))
#31696 [WIP][Model] Enable LoRA support for tower and connector in H2OVL | documentation | by shwetha-s-poojary (created: 2026-01-05 11:35 (UTC+8))
#31703 [Misc] Add packed_modules_mapping for MiniMaxM2ForCausalLM | no label | by jeejeelee (created: 2026-01-05 14:25 (UTC+8))
#31698 [Kernel] Add triton silu_and_mul in fused_moe | performance | by jeejeelee (created: 2026-01-05 11:41 (UTC+8))
#31701 [MLA] Expose prefill/decode paths to torch.compile | rocm,v1 | by therealnaveenkamal (created: 2026-01-05 12:57 (UTC+8))

Merged PRs

#31760 [Cleanup] Remove redundant decoder_layer_type assignment in Qwen2 | qwen | by maang-h (merged: 2026-01-06 10:09 (UTC+8))
#31725 [Misc] Enable Paligemma’s PrefixLM attention mask computation | ready,multi-modality | by Isotr0py (merged: 2026-01-06 03:31 (UTC+8))
#31754 [Perf] Optimize additional fill(0) in cutlass moe, 2.9% E2E throughput improvement, 10.8% TTFT improvement | ready,nvidia | by yewentao256 (merged: 2026-01-06 10:01 (UTC+8))
#31694 [Docs] Improve malformed exception caused by backslash line continuations | rocm,multi-modality,llama | by maang-h (merged: 2026-01-06 09:51 (UTC+8))
#31750 Revert “[CI Failure] Disable B200 tests while runner is broken” | ready,ci/build | by mgoin (merged: 2026-01-06 09:26 (UTC+8))
#31643 [Bugfix][CPU] Fix RotaryEmbedding fallback causing gibberish with --enforce-eager | ready | by rickychen-infinirc (merged: 2026-01-06 01:25 (UTC+8))
#28874 [Bugfix] vLLM produces invalid UTF-8 tokens and “�” | bug,ready,v1 | by johncalesp (merged: 2026-01-06 08:23 (UTC+8))
#30732 [CI/Build] Allow user to configure NVSHMEM version via ENV or command line | ready,ci/build | by eicherseiji (merged: 2026-01-06 07:56 (UTC+8))
#31175 [Bugfix] Properly apply v_scale for mimo_v2_flash | bug,ready | by mgoin (merged: 2026-01-06 07:20 (UTC+8))
#31742 [Bugfix] Fix Broken ModelOpt NVFP4 MoE | bug,ready,llama,nvidia | by robertgshaw2-redhat (merged: 2026-01-06 07:18 (UTC+8))
#31542 [MoE Refactor] Aiter Experts for BF16 MoE | rocm,ready | by zyongye (merged: 2026-01-06 06:52 (UTC+8))
#31585 [Bug] Revert torch warning fix | bug,ready,v1 | by yewentao256 (merged: 2026-01-06 06:31 (UTC+8))
#30356 [CI][DeepSeek] Add nightly DeepSeek R1 lm_eval tests on H200 | ready,ci/build,deepseek | by MatthewBonanni (merged: 2026-01-06 06:17 (UTC+8))
#31734 [Cleanup] Remove deprecated fields from CachedRequestData class | ready,v1 | by njhill (merged: 2026-01-06 05:07 (UTC+8))
#30864 [Model] Nemotron Parse 1.1 Support | new-model,ready,ci/build,multi-modality | by amitz-nv (merged: 2026-01-06 05:00 (UTC+8))
#30913 [docker] install cuda13 version of lmcache and nixl | ready,ci/build,kv-connector,nvidia | by soodoshll (merged: 2026-01-06 04:50 (UTC+8))
#31317 pin lora_b moe weights on cpu | ready | by gnovack (merged: 2026-01-06 04:15 (UTC+8))
#31150 [BugFix] Fix architecture flags to prevent issues on SM103 | ready,ci/build,nvidia | by LopezCastroRoberto (merged: 2026-01-06 04:11 (UTC+8))
#31669 [Misc][Model][Refactor] Pass the prefix into Linear layers | speculative-decoding,ready,qwen | by kunpengW-code (merged: 2026-01-06 04:03 (UTC+8))
#31622 Fix GLM-4.6v flash tool calling in transformers 5.x | documentation,ready,tool-calling | by baonudesifeizhai (merged: 2026-01-06 03:32 (UTC+8))
#30687 Triton Attention: Support cross-layers blocks | ready,v1 | by orozery (merged: 2026-01-06 03:29 (UTC+8))
#31644 [Bugfix] Add missing extra_tensors arg to DeviceCommunicatorBase.disp… | ready | by kzwrime (merged: 2026-01-06 01:26 (UTC+8))
#31732 [CI Failure] Disable B200 tests while runner is broken | ready,ci/build,ci-failure | by mgoin (merged: 2026-01-06 00:50 (UTC+8))
#30322 [Frontend] [Doc] Exclude log deltas feature | frontend,ready | by Catacomba (merged: 2026-01-06 00:34 (UTC+8))
#31406 [v1] Add encoder-only/cross attention support to Triton Attention backend | rocm,ready,v1,multi-modality | by Isotr0py (merged: 2026-01-06 00:00 (UTC+8))
#31335 [Model] Let more models to support the score template. | documentation,ready,ci/build,qwen | by noooop (merged: 2026-01-05 19:54 (UTC+8))
#31674 [platform] Support additional forward context for OOT | ready | by zzzzwwjj (merged: 2026-01-05 18:25 (UTC+8))
#31704 [KVconnector][LMCache] remove the import of legacy LMCache code | ready,kv-connector | by ApostaC (merged: 2026-01-05 18:11 (UTC+8))
#31660 [LoRA] LoRA PDL improvement | ready | by jeejeelee (merged: 2026-01-05 16:28 (UTC+8))
#31620 [Model] Enable LoRA support for BLIP2 | documentation,ready | by ppppqp (merged: 2026-01-05 16:02 (UTC+8))
#29993 [ROCM] Reorder arguments and rename parameters for rope_cached_thd_positions_2c_fwd_inplace | rocm,ready | by tpopp (merged: 2026-01-05 15:37 (UTC+8))
#31482 [log] enable max_log_len trim only when needed | frontend,ready | by andyxning (merged: 2026-01-05 11:55 (UTC+8))
#31664 [CI] Bump sentence-transformer from 3.2.1 to 5.2.0 | ready,ci/build | by noooop (merged: 2026-01-05 13:45 (UTC+8))
#31581 [Frontend] [Bugfix] respect server-level default chat template kwargs in reasoning parser | frontend,ready | by cjackal (merged: 2026-01-05 13:42 (UTC+8))
#31147 Add chat prefix completion feature to DeepSeek v3.2 | ready,deepseek | by PHOEBEMOON0802 (merged: 2026-01-05 11:20 (UTC+8))
#31455 [Bugfix] Fix EPLB state logging error | ready | by tlrmchlsmth (merged: 2026-01-05 12:06 (UTC+8))
#31662 [CI Failure] Fix NomicBert max_model_len validation | ready | by noooop (merged: 2026-01-05 11:06 (UTC+8))

PRs Closed Without Merging

#31699 [BUGFIX] free encode cache in _update_after_schedule to avoid dead lock in sc… | v1 | by frelam (closed: 2026-01-05 14:07 (UTC+8))
#26799 [UX] Fallback to native implementation when flashinfer sampler failed to compile | v1 | by Isotr0py (closed: 2026-01-06 04:41 (UTC+8))
#31686 [Bugfix] Correct block shape logic in WNA16 MoE triton kernel | no label | by JartX (closed: 2026-01-06 07:39 (UTC+8))
#31682 Apply refactor to ct | performance,needs-rebase,llama,nvidia | by robertgshaw2-redhat (closed: 2026-01-06 06:36 (UTC+8))
#31706 perf glmasr | no label | by JaredforReal (closed: 2026-01-05 15:29 (UTC+8))
#31615 [Docker][ROCm] Update base image to ROCm 7.1 for GFX1150/1151 support | rocm,ci/build | by c0de128 (closed: 2026-01-05 13:29 (UTC+8))
#19704 Make sure the correct version of ao is installed in CI | ready,needs-rebase,ci/build,stale | by drisspg (closed: 2026-01-05 12:40 (UTC+8))
#31084 Fix ROCm attention backend selection for encoder-only models | rocm | by westers (closed: 2026-01-05 12:08 (UTC+8))
#31653 [Hardware][AMD] Enable AITER by default for optimized ROCm performance | rocm | by c0de128 (closed: 2026-01-05 11:37 (UTC+8))