vLLM Development Activity Report - 2026-01-05
Time window: 2026-01-05 10:53 (UTC+8) ~ 2026-01-06 10:53 (UTC+8). Stats: 10 new issues | 13 closed issues | 60 new PRs | 37 merged PRs | 9 PRs closed without merging
📊 Daily Development Summary
From January 5 to 6, 2026, the vLLM project kept up a highly active development pace, adding 10 new issues and merging 37 PRs. This cycle's focus was on performance optimization (notably the GLM-4 series and the scheduler) and platform compatibility, with AMD ecosystem (ROCm) fixes and multimodal model (VL/ASR) support as the two highlights. The community also worked through a number of installation and runtime bugs and CI/CD pipeline issues, keeping the project stable.
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was very high this cycle, concentrated on bug fixes and platform compatibility improvements.
- CI test fixes:
  - PR #31713 (mawong-amd): [Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group. Fixes all remaining failing cases in the AMD quantization test group, showing the AMD team's continued investment in keeping quantization stable on their platform.
  - PR #31728: [CI][ROCm] Fix NIXL tests on ROCm. A follow-up fix to get the NIXL (NVIDIA) KV-connector tests running in the AMD CI pipeline, preserving cross-platform test coverage.
- Hardware compatibility and error handling:
  - PR #31715: [ROCm] Improve error handling while loading quantized model on gfx120…. Fixes a startup crash when loading any quantized model (e.g. AWQ, GPTQ) on architectures without `aiter` support, such as AMD RDNA 4 (gfx1201). By wrapping the import safely and degrading gracefully, the server no longer goes down just because the unsupported Quark OCP MX components cannot be loaded.
  - PR #31729 & PR #31733: Both fix hardcoded `device="cuda"` in several code paths on the AMD (ROCm) platform, targeting the AITER MLA/Fused MoE kernels and the compilation fusion helpers respectively, so that `current_platform.device_type` is used correctly and runtime errors are avoided.
- Feature adaptation:
  - PR #31712: fix(rocm): Auto-switch hybrid models from ROCM_ATTN to TRITON_ATTN. Hybrid models (e.g. Nemotron-H) need large block sizes for Mamba state alignment, exceeding the shared-memory limits of AMD GPUs; this PR makes vLLM fall back automatically, with a friendly message, to the TRITON_ATTN backend when ROCM_ATTN is explicitly requested for a hybrid model.
- Speculative decoding support:
  - PR #31714: [Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in `eagle.py`. Fixes an attention-metadata type mismatch when running Eagle speculative decoding on the ROCm ATTN backend, rounding out speculative-decoding support on AMD.
- MoE refactor:
  - PR #31542: [MoE Refactor] Aiter Experts for BF16 MoE. Wraps the unquantized `rocm_aiter_fused_expert` kernel in the modular-kernel format; part of the broader MoE refactor, it helps unify and optimize the MoE execution path on AMD.
Summary: AMD-related contributions this cycle were mostly about consolidating the foundation: fixing key bugs in quantized model loading, device identification, and backend-specific compatibility, while continuing the refactor and integration of core modules such as MoE on ROCm. Together they reflect a systematic deepening of AMD hardware support.
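The "wrap the import, degrade gracefully" approach described for PR #31715 can be sketched as follows (a minimal illustration only; `load_optional_kernel` and `HAS_AITER` are hypothetical names, not vLLM's actual code):

```python
import importlib

def load_optional_kernel(module_name: str):
    """Return the module if importable on this platform, else None."""
    try:
        return importlib.import_module(module_name)
    except (ImportError, OSError):
        # e.g. gfx1201 without aiter support: return None and let the
        # caller fall back, instead of crashing the server at load time.
        return None

# Downstream code checks a flag instead of assuming the import succeeded:
aiter = load_optional_kernel("aiter")
HAS_AITER = aiter is not None
kernel_path = "aiter" if HAS_AITER else "torch-native fallback"
```

The key design point is that an unsupported architecture turns into a fallback decision rather than an unhandled exception at model-load time.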
💬 High-Engagement Discussions
- Issue #31726: [Usage]: Why does `vllm serve` keep filling up my system disk… (5 comments)
  - Core question: when `vllm serve` loads a model from a network mount, system disk space is steadily consumed and is not released after the process stops.
  - Viewpoints:
    - User (@tingjun-cs): through detailed troubleshooting, ruled out vLLM's own cache directory (`~/.cache/vllm`) and cache-directory redirection via environment variables, pointing instead at kernel-level filesystem accounting (e.g. files that are deleted but held open).
    - Contributor (@chaunceyjiang): first suggested checking the vLLM cache directory, then proposed `du -d 1 /` for a deeper disk-usage breakdown.
  - Status: still open. The discussion has moved from a simple cache-path question toward container/OS-level filesystem behavior, a typical diagnosis of a complex environment issue.
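The troubleshooting steps from the thread can be condensed into a short shell sketch (assumptions: a Linux host, with `lsof` optionally available for the deleted-file check; illustrative only, not a prescribed fix):

```shell
# 1. Per-directory usage, as suggested by @chaunceyjiang:
du -d 1 / 2>/dev/null | sort -n | tail -n 5
# 2. Compare with the filesystem-level total; a large gap between du and df
#    hints at space held by deleted-but-open files, which du cannot see:
df -h /
# 3. If lsof is installed, list open handles to deleted files:
command -v lsof >/dev/null && lsof +L1 2>/dev/null | grep -i deleted || true
```

The du-vs-df comparison is the standard way to confirm the "deleted but held open" hypothesis raised by the reporter.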
- Issue #31708: [Bug]: When using image_embeds… (3 comments)
  - Core question: an out-of-bounds array error when passing `image_embeds` input, involving multimodal input handling in models such as Qwen2-VL.
  - Viewpoints:
    - Contributor (@DarkLight1337): explained the expected tensor shapes for `image_embeds` and `image_grid_thw` (2-D, with no batch dimension) and noted that this format only applies to v0.13+.
    - User (@NewZxy): the problem persisted after adjusting the dimensions as suggested; their preliminary analysis suspects `ImageProcessorItems` being used where `ImageEmbeddingItems` is expected.
  - Points of contention: none of substance; mostly clarification and version-compatibility notes during triage.
  - Status: open; the problem may involve a bug in vLLM's internal multimodal data-structure handling.
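The shape expectations explained in the thread can be restated as a small sketch (sizes are illustrative and the tuples only model tensor shapes; this is not the vLLM API itself):

```python
# Per the maintainer's comment (v0.13+ format): image_embeds for one image
# is 2-D, one row per visual patch token, with NO leading batch dimension.
grid_t, grid_h, grid_w, hidden_size = 1, 4, 4, 1536
num_patches = grid_t * grid_h * grid_w

image_embeds_shape = (num_patches, hidden_size)   # (16, 1536) -- accepted
image_grid_thw_shape = (1, 3)                     # one (t, h, w) row per image

# The reported mistake class: a leading batch axis makes the tensor 3-D.
batched_shape = (1,) + image_embeds_shape         # (1, 16, 1536) -- rejected

assert len(image_embeds_shape) == 2
assert len(image_grid_thw_shape) == 2
assert len(batched_shape) == 3
```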
- Issue #31755: [Feature]: Optimizations for GLM4.7 (3 comments)
  - Core question: tracking the optimization work for the GLM-4.7 model in vLLM.
  - Viewpoints:
    - Author (@yewentao256): laid out a detailed task list (a group_topk kernel, cutlass MoE fill optimization, kernelizing `get_cutlass_moe_mm_problem_sizes_from_expert_offsets`) together with full accuracy and performance baselines.
    - Other contributors (@jeejeelee, @h1248759074): reported that the `qk_norm_rope` fusion does not trigger on the FP8 variant and asked about dependency installation details.
  - Trend: this issue illustrates the community's collaborative pattern of deep, systematic performance tuning for popular model families (the GLM series). The related optimization PR #31754 (removing an extra `fill(0)` in cutlass MoE) was merged in the same period.
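The `fill(0)` optimization merged in PR #31754 can be illustrated with a toy sketch (plain Python lists stand in for the GPU workspace; everything except the `expert_first_token_offset` concept is invented for illustration):

```python
# Stale workspace that the old code zero-filled in its entirety.
capacity = 8
buf = [7.0] * capacity
expert_first_token_offset = [0, 3, 5]       # cumulative per-expert row offsets
valid_rows = expert_first_token_offset[-1]  # rows the kernel will overwrite

# Before: buf = [0.0] * capacity  (zeroing even the rows about to be written).
# After: zero only the tail the kernel will NOT touch.
buf[valid_rows:] = [0.0] * (capacity - valid_rows)
buf[:valid_rows] = [1.0] * valid_rows       # stand-in for the kernel's writes

assert buf == [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

Skipping writes to memory that is immediately overwritten is exactly the kind of pattern-specific saving that shows up as the reported 2.9% end-to-end gain.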
🔥 Hot Topics and Trends
- A wave of performance optimization: GLM-series optimization is the clear hot spot. Beyond tracking issue #31755, merged PR #31754 removed a redundant `fill(0)` and delivered a 2.9% end-to-end throughput gain and a 10.8% TTFT improvement for GLM-4.7-FP8, reflecting the community's drive to squeeze maximum performance out of flagship models.
- Multimodal support keeps deepening:
  - New model integrations: PR #31758 added support for LiquidAI's LFM2-VL model family and PR #30864 added NVIDIA Nemotron Parse 1.1, steadily enriching the vision-language model library.
  - Feature completion: several PRs (e.g. #31724, #31620, #31696) enable LoRA support for existing multimodal models such as Pixtral, BLIP2, and H2OVL, extending their fine-tuning capabilities.
  - Low-level fixes: issue #31708 and PRs such as #31765 surfaced potential deadlocks in encoder-cache management under high multimodal concurrency, which the community is actively fixing.
- Systematic hardening of the AMD ecosystem: as described above, this cycle brought a batch of AMD-focused fix PRs covering test pipelines, device identification, error handling, and kernel compatibility. This suggests the AMD team or community is doing systematic gap-filling on ROCm support rather than scattered feature additions.
- Steady progress on the CPU backend: the focus is shifting from "it works" to "it works well and robustly", with a request for CPU micro-benchmarks (issue #31721, PR #31720) and a fix for garbled output caused by missing `forward_cpu()` implementations under `--enforce-eager` (PR #31643).
🛠️ Key Technical Changes
- PR #31754: [Perf] Optimize additional `fill(0)` in cutlass moe…: uses the `expert_first_token_offset` information to skip a redundant zero-fill, yielding a significant end-to-end speedup for MoE models such as GLM-4.7-FP8. A textbook example of surgical optimization aimed at a specific compute pattern.
- PR #31406: [v1] Add encoder-only/cross attention support to Triton Attention backend: adds encoder/cross-attention support to the Triton Attention backend, introducing adapted bidirectional/causal sliding-window MHA kernels. This sidesteps FlexAttention's poor support for certain `head_size` values and Flash Attention's compute-capability limits on older GPUs (e.g. Turing/Volta), giving encoder models such as Whisper a better, more compatible backend choice.
- PR #31643: [Bugfix][CPU] Fix RotaryEmbedding fallback causing gibberish with --enforce-eager: a root-cause fix for garbled output on the CPU backend in enforce-eager mode, where several CustomOp layers (e.g. RotaryEmbedding, RMSNorm) lacked a `forward_cpu()` implementation and wrongly fell back to the CUDA path. Correcting the default dispatch logic of `CustomOp.forward_cpu()` restores correct CPU execution.
- PR #31732 / #31750: disabling and restoring B200 tests: connectivity problems with the B200 runner on DGX Cloud led to a temporary disable of the affected CI tests, reverted quickly once the problem appeared resolved, showing agile management of CI resources and attention to test coverage.
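The dispatch-fallback bug class fixed by PR #31643 can be sketched in a few lines (simplified, hypothetical classes modeled loosely on vLLM's CustomOp layering; not the actual implementation):

```python
class CustomOp:
    def forward_native(self, x):
        raise NotImplementedError      # pure-PyTorch reference path

    def forward_cuda(self, x):
        raise RuntimeError("CUDA kernel invoked on a CPU-only host")

    # Buggy default: the CPU path silently reuses the CUDA path.
    def forward_cpu(self, x):
        return self.forward_cuda(x)

class FixedCustomOp(CustomOp):
    # Fixed default: fall back to the portable native implementation.
    def forward_cpu(self, x):
        return self.forward_native(x)

class RotaryEmbedding(FixedCustomOp):
    def forward_native(self, x):
        return [v * 2 for v in x]      # stand-in for the real rotation math

assert RotaryEmbedding().forward_cpu([1, 2]) == [2, 4]
```

The gibberish output came from the buggy default: a CUDA-tuned kernel path running where only the native implementation is valid.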
📈 Development Activity Observations
- Efficient merging: 37 PRs merged within 24 hours points to an active contributor base and an efficient review-and-merge pipeline.
- Prominent platform contributors: AMD employees or affiliated contributors (e.g. mawong-amd, c0de128) were exceptionally active this cycle, landing several key fix PRs that materially advanced ROCm stability and feature completeness.
- Active multimodal work: multiple contributors collaborated on new vision-language models, LoRA support, and bug fixes, marking this area as a current frontier of expansion.
- Active issue closure: 13 issues were closed, including some long-standing ones (e.g. #24207, #23739), showing that the community cleans up its backlog while pushing new features.
💡 Issues Worth Watching
- Potential scheduler deadlock: issue #31731 proposes preventing a single KV-cache-starved request from blocking others in the scheduler, while PRs #31765 and #31699 attempt to fix a possible scheduling deadlock caused by the encoder cache in multimodal scenarios. Scheduler fairness and robustness are deep, core determinants of service quality under high-concurrency mixed workloads and deserve continued attention.
- Transformers 5.x compatibility pains: issue #31485 (closed via PR #31622) revealed that GLM-4.6V tool calling broke under Transformers 5.0.0rc1, a reminder that major upstream version bumps can bring complex compatibility challenges requiring broader testing and adaptation strategies.
- New AMD architectures: PR #31715 handled the quantized-model-loading crash on gfx1201 (RDNA 4) caused by missing aiter support. As AMD ships new GPU architectures, ensuring vLLM's quantization, attention, and other core components handle unsupported architectures gracefully is an ongoing platform-compatibility task.
📋 Appendix: Detailed Data
New Issues
- #31755 [Feature]: Optimizations for GLM4.7 — feature request — by yewentao256 (created: 2026-01-06 07:15 (UTC+8))
- #31708 [Bug]: When using image_embeds, ImageProcessorItems are used instead of ImageEmbeddingItems, causing an out-of-bounds array error. — bug — by NewZxy (created: 2026-01-05 15:16 (UTC+8))
- #31710 perf.vllm.ai is not available — no labels — by pacoxu (created: 2026-01-05 15:44 (UTC+8))
- #31731 [Performance]: scheduler improvement to prevent KV-cache heavy requests from blocking others — performance — by Waloid24 (created: 2026-01-05 23:48 (UTC+8))
- #31726 [Usage]: Why does `vllm serve` keep filling up my system disk when loading a model from a network mount? — usage — by tingjun-cs (created: 2026-01-05 22:50 (UTC+8))
- #31721 [Feature]: Fused MoE Micro Benchmark for CPU Backend — feature request,cpu — by andikarachman (created: 2026-01-05 20:31 (UTC+8))
- #31718 [Installation]: cuda error when install from source — installation — by inv1s10n (created: 2026-01-05 18:51 (UTC+8))
- #31709 [Bug]: After upgrade to 0.11.2, vllm crashs with Qwen3. — bug — by tonyaw (created: 2026-01-05 15:17 (UTC+8))
- #31697 [New Model]: HY-MT1.5-1.8B — new-model — by Busboy3129 (created: 2026-01-05 11:36 (UTC+8))
- #31700 [Bug]: CUDA error: “provided PTX was compiled with an unsupported toolchain” on vLLM 0.11.2 — bug — by dry86 (created: 2026-01-05 12:57 (UTC+8))
Closed Issues
- #24207 [Feature]: Support similar API, such as /health_generate — feature request,stale — by (closed: 2026-01-06 10:42 (UTC+8))
- #28975 [Bug]: aot_compile does not preserve dynamic shapes state on cache hit. — bug,torch.compile — by laithsakka (closed: 2026-01-06 07:03 (UTC+8))
- #31579 [Bug]: `VLLM_FLOAT32_MATMUL_PRECISION=tf32` does not set cublas tf32 matmul — bug — by cjackal (closed: 2026-01-06 06:31 (UTC+8))
- #29515 [CI Failure]: mi325_1: V1 Test attention (H100) — ci-failure — by AndreasKaratzas (closed: 2026-01-06 05:50 (UTC+8))
- #29459 [CI Failure]: mi325_1: Language Models Test (Extended Generation) — ci-failure — by AndreasKaratzas (closed: 2026-01-06 05:49 (UTC+8))
- #30628 [Bug]: For building a CUDA 13 vLLM docker image, when building LMCache, wrong version of NIXL (`nixl-cu12`) is downloaded — bug,kv-connector,nvidia — by wangshangsam (closed: 2026-01-06 04:50 (UTC+8))
- #31485 [Feature]: Support Tool Calling with transformers 5.x for GLM-4.6V Models — feature request — by tamthaihoangminh (closed: 2026-01-06 03:32 (UTC+8))
- #23739 [Performance]: The full cudagraph seems not work. — performance,stale — by xsank (closed: 2026-01-06 01:34 (UTC+8))
- #31642 [Bug]: TypeError in DeviceCommunicatorBase.dispatch due to method signature mismatch — bug,cpu — by kzwrime (closed: 2026-01-06 01:26 (UTC+8))
- #31626 [Bug][CPU Backend]: Gibberish output on CPU backend when --enforce-eager is enabled (Qwen3-0.6B) — bug,cpu — by kzwrime (closed: 2026-01-06 01:25 (UTC+8))
- #31014 [Bug]: GPT-OSS-120B Eagle-v2 High concurrency perf drop — bug,speculative-decoding — by shyeh25 (closed: 2026-01-05 17:15 (UTC+8))
- #31691 [CI Failure]: mi325_4: LoRA TP Test (Distributed) — ci-failure — by AndreasKaratzas (closed: 2026-01-05 13:45 (UTC+8))
- #31700 [Bug]: CUDA error: “provided PTX was compiled with an unsupported toolchain” on vLLM 0.11.2 — bug — by dry86 (closed: 2026-01-05 13:00 (UTC+8))
New PRs
- #31762 [Bugfix]Add rollback mechanism when XPU kernel does not support FP32 precision FLASH_ATTN in UT — v1 — by 1643661061leo (created: 2026-01-06 10:21 (UTC+8))
- #31765 [BUGFIX] free encode inputs in _update_after_schedule to avoid dead lock in sc… — v1 — by frelam (created: 2026-01-06 10:49 (UTC+8))
- #31761 [Frontend] Add MCP tool streaming support to Responses API — frontend,gpt-oss — by daniel-salib (created: 2026-01-06 10:13 (UTC+8))
- #31717 [Perf] GLM ASR — no labels — by JaredforReal (created: 2026-01-05 18:45 (UTC+8))
- #31735 [bugfix] prevent special token injection in user content — frontend — by usepr (created: 2026-01-06 00:07 (UTC+8))
- #31716 Consolidate Intel Quantization Toolkit Integration in vLLM — documentation — by yiliu30 (created: 2026-01-05 18:23 (UTC+8))
- #31764 [CI] Fix CPU MM PRocessor Test — no labels — by robertgshaw2-redhat (created: 2026-01-06 10:32 (UTC+8))
- #31763 [perf] Fused operator SplitMrope used in the Qwen2.5-Omni-7B model — qwen — by fuzhihong699 (created: 2026-01-06 10:32 (UTC+8))
- #31759 [MoE Refactor] Add Temporary Integration Tests - H100 — ready,ci/build,nvidia — by robertgshaw2-redhat (created: 2026-01-06 09:31 (UTC+8))
- #31747 [Misc] Fix `Current vLLM config is not set.` warnings, assert to avoid issues in the future — v1,multi-modality,cpu,nvidia,ready-run-all-tests — by LucasWilkinson (created: 2026-01-06 05:45 (UTC+8))
- #31699 [BUGFIX] free encode cache in _update_after_schedule to avoid dead lock in sc… — v1 — by frelam (created: 2026-01-05 11:53 (UTC+8))
- #31753 [BugFix]Fix eagle draft_model_config and add tests — no labels — by charlotte12l (created: 2026-01-06 07:09 (UTC+8))
- #31760 [Cleanup] Remove redundant `decoder_layer_type` assignment in `Qwen2` — qwen — by maang-h (created: 2026-01-06 09:40 (UTC+8))
- #31725 [Misc] Enable Paligemma’s PrefixLM attention mask computation — ready,multi-modality — by Isotr0py (created: 2026-01-05 22:23 (UTC+8))
- #31754 [Perf] Optimize additional `fill(0)` in cutlass moe, 2.9% E2E throughput improvement, 10.8% TTFT improvement — ready,nvidia — by yewentao256 (created: 2026-01-06 07:10 (UTC+8))
- #31758 [Model] Add LFM2-VL model support — new-model,v1 — by tianshu-Michael-yu (created: 2026-01-06 08:14 (UTC+8))
- #31751 [Bug Fix] Handle variable-length tensors in MultiModalFlatField batching — multi-modality — by AndriiPasternak31 (created: 2026-01-06 06:48 (UTC+8))
- #31750 Revert “[CI Failure] Disable B200 tests while runner is broken” — ready,ci/build — by mgoin (created: 2026-01-06 06:24 (UTC+8))
- #31738 [Models]: Use `MMEncoderAttention` for MoonViT — ready — by Isotr0py (created: 2026-01-06 00:41 (UTC+8))
- #31757 [Bugfix][MTP] Fix GLM4 MoE fp8 loading with MTP on — no labels — by andyl98 (created: 2026-01-06 07:54 (UTC+8))
- #31756 [Misc][BE] Turn on strict type coverage for vllm/compilation — nvidia — by Lucaskabela (created: 2026-01-06 07:39 (UTC+8))
- #31744 [Misc][BE] Type coverage for vllm/compilation [2/3] — nvidia — by Lucaskabela (created: 2026-01-06 03:39 (UTC+8))
- #31748 [Misc][BE] Type coverage for vllm/compilation [3/3] — nvidia — by Lucaskabela (created: 2026-01-06 05:58 (UTC+8))
- #31752 [MoE Refactoring][Bugfix]Wrap WNA16 Triton kernel into mk and change compressed tensor kernel selection — no labels — by zyongye (created: 2026-01-06 06:57 (UTC+8))
- #31742 [Bugfix] Fix Broken ModelOpt NVFP4 MoE — bug,ready,llama,nvidia — by robertgshaw2-redhat (created: 2026-01-06 02:16 (UTC+8))
- #31724 [Model] Enable LoRA support for Pixtral — documentation — by A1c0r-Z (created: 2026-01-05 22:06 (UTC+8))
- #31713 [Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group — rocm,ci/build — by mawong-amd (created: 2026-01-05 16:54 (UTC+8))
- #31722 [PERF] Speed-up of GDN attention decode part (Qwen3-Next) — performance,ready,qwen — by vadiklyutiy (created: 2026-01-05 20:37 (UTC+8))
- #31737 [Frontend][gpt-oss] Allow system message to overwrite model identity — frontend,gpt-oss — by qandrew (created: 2026-01-06 00:37 (UTC+8))
- #31749 [CI/Build] add new target for building CPU image with model — documentation,ci/build,cpu — by nathan-weinberg (created: 2026-01-06 06:00 (UTC+8))
- #31707 [Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator — v1 — by ivanium (created: 2026-01-05 15:14 (UTC+8))
- #31734 [Cleanup] Remove deprecated fields from CachedRequestData class — ready,v1 — by njhill (created: 2026-01-06 00:01 (UTC+8))
- #31745 [Bugfix] Remove the num_hidden_layers override for glm4_moe — no labels — by andyl98 (created: 2026-01-06 03:57 (UTC+8))
- #31746 Create an interface to support more eagle3 model — speculative-decoding,v1,llama,meta-exported,fb-exported — by ZhengkaiZ (created: 2026-01-06 05:10 (UTC+8))
- #31741 Add `/pause` `/step` fast pause + barrier for online weight update — frontend,v1 — by vmoens (created: 2026-01-06 01:50 (UTC+8))
- #31739 [Spec Decode][UX] Add acceptance rates to `vllm bench serve` report — performance — by MatthewBonanni (created: 2026-01-06 00:52 (UTC+8))
- #31740 feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support — v1,nvidia — by seli-equinix (created: 2026-01-06 01:23 (UTC+8))
- #31743 [perf][MLA] Fuse RoPE/FP8 quantization/Q write using mla_rope_quantize_fp8 — v1,nvidia — by minosfuture (created: 2026-01-06 03:15 (UTC+8))
- #31723 [Core] Optimize expensive deepcopy in GPU model runner — v1 — by GOavi101 (created: 2026-01-05 21:04 (UTC+8))
- #31736 Update how docs are rendered for cli reference — documentation — by ashwin-phadke (created: 2026-01-06 00:19 (UTC+8))
- #31732 [CI Failure] Disable B200 tests while runner is broken — ready,ci/build,ci-failure — by mgoin (created: 2026-01-05 23:53 (UTC+8))
- #31730 [Cleanup] Unify flashinfer utility code — nvidia — by majiayu000 (created: 2026-01-05 23:20 (UTC+8))
- #31729 [Bugfix][Hardware][AMD] Fix hardcoded device in AITER MLA and Fused MOE — rocm — by c0de128 (created: 2026-01-05 23:04 (UTC+8))
- #31727 Fix versions for vulnerable packages — ci/build — by adobrzyn (created: 2026-01-05 22:57 (UTC+8))
- #31733 [Bugfix][Hardware][AMD] Use platform device type in compilation fusion helpers — rocm — by c0de128 (created: 2026-01-05 23:57 (UTC+8))
- #31719 [Misc] Support qwen3-next lora — qwen — by BJWang-ant (created: 2026-01-05 19:30 (UTC+8))
- #31728 [CI][ROCm] Fix NIXL tests on ROCm — rocm,ready,ci/build,kv-connector — by NickLucche (created: 2026-01-05 23:04 (UTC+8))
- #31715 [ROCm] Improve error handling while loading quantized model on gfx120… — rocm — by brian033 (created: 2026-01-05 17:44 (UTC+8))
- #31720 [cpu][bench] Add CPU paged attention benchmarks — performance,cpu — by fadara01 (created: 2026-01-05 20:28 (UTC+8))
- #31714 [Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in `eagle.py` — rocm,speculative-decoding,v1 — by vllmellm (created: 2026-01-05 17:09 (UTC+8))
- #31702 Fix ijson build for Power. — ci/build — by npanpaliya (created: 2026-01-05 14:06 (UTC+8))
- #31712 fix(rocm): Auto-switch hybrid models from ROCM_ATTN to TRITON_ATTN — documentation,rocm — by rabi (created: 2026-01-05 16:28 (UTC+8))
- #31704 [KVconnector][LMCache] remove the import of legacy LMCache code — ready,kv-connector — by ApostaC (created: 2026-01-05 14:26 (UTC+8))
- #31711 fix(rocm): Use refresh_env_variables() for rocm_aiter_ops in test_moe — rocm — by rabi (created: 2026-01-05 15:55 (UTC+8))
- #31705 [BugFix] Support setting tp=1 for the Eagle draft model to take effect — speculative-decoding,v1 — by zhaomingyu13 (created: 2026-01-05 14:32 (UTC+8))
- #31706 perf glmasr — no labels — by JaredforReal (created: 2026-01-05 14:57 (UTC+8))
- #31696 [WIP][Model] Enable LoRA support for tower and connector in H2OVL — documentation — by shwetha-s-poojary (created: 2026-01-05 11:35 (UTC+8))
- #31703 [Misc] Add packed_modules_mapping for MiniMaxM2ForCausalLM — no labels — by jeejeelee (created: 2026-01-05 14:25 (UTC+8))
- #31698 [Kernel] Add triton silu_and_mul in fused_moe — performance — by jeejeelee (created: 2026-01-05 11:41 (UTC+8))
- #31701 [MLA] Expose prefill/decode paths to torch.compile — rocm,v1 — by therealnaveenkamal (created: 2026-01-05 12:57 (UTC+8))
Merged PRs
- #31760 [Cleanup] Remove redundant `decoder_layer_type` assignment in `Qwen2` — qwen — by maang-h (merged: 2026-01-06 10:09 (UTC+8))
- #31725 [Misc] Enable Paligemma’s PrefixLM attention mask computation — ready,multi-modality — by Isotr0py (merged: 2026-01-06 03:31 (UTC+8))
- #31754 [Perf] Optimize additional `fill(0)` in cutlass moe, 2.9% E2E throughput improvement, 10.8% TTFT improvement — ready,nvidia — by yewentao256 (merged: 2026-01-06 10:01 (UTC+8))
- #31694 [Docs] Improve malformed exception caused by backslash line continuations — rocm,multi-modality,llama — by maang-h (merged: 2026-01-06 09:51 (UTC+8))
- #31750 Revert “[CI Failure] Disable B200 tests while runner is broken” — ready,ci/build — by mgoin (merged: 2026-01-06 09:26 (UTC+8))
- #31643 [Bugfix][CPU] Fix RotaryEmbedding fallback causing gibberish with --enforce-eager — ready — by rickychen-infinirc (merged: 2026-01-06 01:25 (UTC+8))
- #28874 [Bugfix] vLLM produces invalid UTF-8 tokens and “�” — bug,ready,v1 — by johncalesp (merged: 2026-01-06 08:23 (UTC+8))
- #30732 [CI/Build] Allow user to configure NVSHMEM version via ENV or command line — ready,ci/build — by eicherseiji (merged: 2026-01-06 07:56 (UTC+8))
- #31175 [Bugfix] Properly apply v_scale for mimo_v2_flash — bug,ready — by mgoin (merged: 2026-01-06 07:20 (UTC+8))
- #31742 [Bugfix] Fix Broken ModelOpt NVFP4 MoE — bug,ready,llama,nvidia — by robertgshaw2-redhat (merged: 2026-01-06 07:18 (UTC+8))
- #31542 [MoE Refactor] Aiter Experts for BF16 MoE — rocm,ready — by zyongye (merged: 2026-01-06 06:52 (UTC+8))
- #31585 [Bug] Revert torch warning fix — bug,ready,v1 — by yewentao256 (merged: 2026-01-06 06:31 (UTC+8))
- #30356 [CI][DeepSeek] Add nightly DeepSeek R1 `lm_eval` tests on H200 — ready,ci/build,deepseek — by MatthewBonanni (merged: 2026-01-06 06:17 (UTC+8))
- #31734 [Cleanup] Remove deprecated fields from CachedRequestData class — ready,v1 — by njhill (merged: 2026-01-06 05:07 (UTC+8))
- #30864 [Model] Nemotron Parse 1.1 Support — new-model,ready,ci/build,multi-modality — by amitz-nv (merged: 2026-01-06 05:00 (UTC+8))
- #30913 [docker] install cuda13 version of lmcache and nixl — ready,ci/build,kv-connector,nvidia — by soodoshll (merged: 2026-01-06 04:50 (UTC+8))
- #31317 pin lora_b moe weights on cpu — ready — by gnovack (merged: 2026-01-06 04:15 (UTC+8))
- #31150 [BugFix] Fix architecture flags to prevent issues on SM103 — ready,ci/build,nvidia — by LopezCastroRoberto (merged: 2026-01-06 04:11 (UTC+8))
- #31669 [Misc][Model][Refactor] Pass the prefix into Linear layers — speculative-decoding,ready,qwen — by kunpengW-code (merged: 2026-01-06 04:03 (UTC+8))
- #31622 Fix GLM-4.6v flash tool calling in transformers 5.x — documentation,ready,tool-calling — by baonudesifeizhai (merged: 2026-01-06 03:32 (UTC+8))
- #30687 Triton Attention: Support cross-layers blocks — ready,v1 — by orozery (merged: 2026-01-06 03:29 (UTC+8))
- #31644 [Bugfix] Add missing extra_tensors arg to DeviceCommunicatorBase.disp… — ready — by kzwrime (merged: 2026-01-06 01:26 (UTC+8))
- #31732 [CI Failure] Disable B200 tests while runner is broken — ready,ci/build,ci-failure — by mgoin (merged: 2026-01-06 00:50 (UTC+8))
- #30322 [Frontend] [Doc] Exclude log deltas feature — frontend,ready — by Catacomba (merged: 2026-01-06 00:34 (UTC+8))
- #31406 [v1] Add encoder-only/cross attention support to Triton Attention backend — rocm,ready,v1,multi-modality — by Isotr0py (merged: 2026-01-06 00:00 (UTC+8))
- #31335 [Model] Let more models to support the score template. — documentation,ready,ci/build,qwen — by noooop (merged: 2026-01-05 19:54 (UTC+8))
- #31674 [platform] Support additional forward context for OOT — ready — by zzzzwwjj (merged: 2026-01-05 18:25 (UTC+8))
- #31704 [KVconnector][LMCache] remove the import of legacy LMCache code — ready,kv-connector — by ApostaC (merged: 2026-01-05 18:11 (UTC+8))
- #31660 [LoRA] LoRA PDL improvement — ready — by jeejeelee (merged: 2026-01-05 16:28 (UTC+8))
- #31620 [Model] Enable LoRA support for BLIP2 — documentation,ready — by ppppqp (merged: 2026-01-05 16:02 (UTC+8))
- #29993 [ROCM] Reorder arguments and rename parameters for rope_cached_thd_positions_2c_fwd_inplace — rocm,ready — by tpopp (merged: 2026-01-05 15:37 (UTC+8))
- #31482 [log] enable max_log_len trim only when needed — frontend,ready — by andyxning (merged: 2026-01-05 11:55 (UTC+8))
- #31664 [CI] Bump sentence-transformer from 3.2.1 to 5.2.0 — ready,ci/build — by noooop (merged: 2026-01-05 13:45 (UTC+8))
- #31581 [Frontend] [Bugfix] respect server-level default chat template kwargs in reasoning parser — frontend,ready — by cjackal (merged: 2026-01-05 13:42 (UTC+8))
- #31147 Add chat prefix completion feature to DeepSeek v3.2 — ready,deepseek — by PHOEBEMOON0802 (merged: 2026-01-05 11:20 (UTC+8))
- #31455 [Bugfix] Fix EPLB state logging error — ready — by tlrmchlsmth (merged: 2026-01-05 12:06 (UTC+8))
- #31662 [CI Failure] Fix NomicBert max_model_len validation — ready — by noooop (merged: 2026-01-05 11:06 (UTC+8))
PRs Closed Without Merging
- #31699 [BUGFIX] free encode cache in _update_after_schedule to avoid dead lock in sc… — v1 — by frelam (closed: 2026-01-05 14:07 (UTC+8))
- #26799 [UX] Fallback to native implementation when flashinfer sampler failed to compile — v1 — by Isotr0py (closed: 2026-01-06 04:41 (UTC+8))
- #31686 [Bugfix] Correct block shape logic in WNA16 MoE triton kernel — no labels — by JartX (closed: 2026-01-06 07:39 (UTC+8))
- #31682 Apply refactor to ct — performance,needs-rebase,llama,nvidia — by robertgshaw2-redhat (closed: 2026-01-06 06:36 (UTC+8))
- #31706 perf glmasr — no labels — by JaredforReal (closed: 2026-01-05 15:29 (UTC+8))
- #31615 [Docker][ROCm] Update base image to ROCm 7.1 for GFX1150/1151 support — rocm,ci/build — by c0de128 (closed: 2026-01-05 13:29 (UTC+8))
- #19704 Make sure the correct version of ao is installed in CI — ready,needs-rebase,ci/build,stale — by drisspg (closed: 2026-01-05 12:40 (UTC+8))
- #31084 Fix ROCm attention backend selection for encoder-only models — rocm — by westers (closed: 2026-01-05 12:08 (UTC+8))
- #31653 [Hardware][AMD] Enable AITER by default for optimized ROCm performance — rocm — by c0de128 (closed: 2026-01-05 11:37 (UTC+8))