vLLM 开发动态报告 - 2026-01-21

时间窗口: 2026-01-21 11:00 (UTC+8) ~ 2026-01-22 11:00 (UTC+8) 数据统计: 新 Issue 14 | 关闭 Issue 13 | 新 PR 62 | 合并 PR 30 | 关闭未合并 PR 12

📊 每日开发状态摘要

本周期（2026-01-21至2026-01-22）vLLM项目保持高活跃度，新增PR（62个）与合并PR（30个）数量均较多，核心开发重点集中在模型支持扩展（如TranslateGemma、Pacific-Prime）、性能优化（如MoE、torch.compile冷启动）以及AMD平台适配性的修复上。社区讨论热度不减，尤其是围绕GPU型号兼容性、性能回归及配置文档等问题展开了深入交流。

🎯 AMD/ROCm 生态相关动态

本周期AMD生态相关活动活跃，主要集中在问题修复和平台适配性增强上，多位AMD员工（用户名含-amd后缀）贡献了重要修复。

Bugfix PR (AMD员工贡献)：
- PR #32813: [Hardware][AMD][Bugfix] Fix PTPC FP8 quantization (by mawong-amd)
  - 内容：修复了PR #32189重构后导致的PTPCFP8LinearMethod继承链错误，使其正确继承自FP8OnlineLinearMethod，从而修复了AMD量化测试组。
  - 影响：确保了在AMD平台上使用PTPC（可能是AMD特定的量化方案）进行FP8量化的功能正确性。
- PR #32783: [ROCm] fix import for on_gfx9 (by divakar-amd)
  - 内容：修复了fused_moe和fused_batched_moe中关于on_gfx9的导入错误，从直接导入替代了通过平台接口调用的方式。
  - 影响：解决了在ROCm平台（特别是GFX9架构）上相关模块的导入问题，保障了代码的鲁棒性。
- PR #32787: [ROCm][CI] fix get_valid_backends (by divakar-amd)
  - 内容：修复了Platform接口中__getattr__方法导致hasattr检查始终返回True的问题，影响了推测解码测试中后端有效性检查的逻辑。
  - 影响：为后续完全修复ROCm上的推测解码接受长度测试（PR #32825）奠定了基础。
- PR #32825: [Hardware][AMD][Bugfix] Fix spec decode acceptance length test (by mawong-amd)
  - 内容：修复ROCm平台上推测解码接受长度测试的失败问题。根本原因是hasattr检查因__getattr__而失效，导致错误地调用了未实现的get_valid_backends方法。
  - 影响：直接修复了AMD CI中推测解码相关测试的失败，提升了测试套件在AMD平台上的通过率。
功能优化/问题修复 PR (涉及AMD平台)：
- PR #32754: [Bugfix][ROCm] Fix DeepSeek-R1 repetition via hybrid sampler routing (by vllmellm)
  - 内容：修复了DeepSeek-R1在AMD ROCm上因Aiter采样器不完全支持每请求RNG而导致的输出重复/无限循环问题。方案是引入混合路由逻辑，对需要随机性的请求回退到PyTorch原生采样器。
  - 影响：解决了特定模型在AMD平台上的功能正确性问题，同时兼顾了高性能场景（贪婪采样）对Aiter加速的利用。
新增PR (AMD员工贡献)：
- PR #32825 (同上)：除了修复测试，还包含了修复EAGLE推测解码在CUDA图捕获期间潜在段错误的PR #32818的更改，这对AMD平台上的EAGLE功能稳定性很重要。

小结：本周期AMD相关贡献以修复测试和平台特定问题为主，体现出团队在确保vLLM在AMD硬件（MI300X， MI325X）上功能完备和测试通过方面的持续投入。未发现与Quark量化工具或MI300新特性直接相关的修改。

💬 高热度讨论分析

Issue #32782: [Bug]: KV cache 0.14 regression vs 0.11 and 0.13
- 核心议题：用户报告从vLLM 0.11/0.13升级到0.14后，在特定配置（Qwen3-Coder-30B, PP=4, TP=1）下出现严重的KV缓存性能回归，导致吞吐量大幅下降和内存溢出（OOM）。
- 观点与讨论：
  - 报告者 (dthp-git)：提供了详细的性能对比数据和错误日志，指出0.14版本在相同硬件和配置下性能更差。
  - 维护者 (robertgshaw2-redhat)：迅速回应感谢报告，并请用户提供完整复现命令以便进一步调查。
- 争议焦点：无公开争议，属于严重的性能回归问题报告。
- 当前状态：Open。维护者已介入，等待更详细信息或内部排查。
Issue #32791: [Bug]: chat.completions returns content: null for GPT-OSS multi-turn with json_object
- 核心议题：在使用GPT-OSS系列模型进行多轮对话并结合json_object格式限制时，返回的content字段为null。
- 观点与讨论：
  - 报告者 (supersteves)：提供了完整的复现步骤和环境信息，并确认该问题为GPT-OSS（Harmony）系列特有，其他模型（如Llama-3.3）正常。
- 争议焦点：无。这是一个针对特定模型系列的bug报告。
- 当前状态：Open。尚未有维护者直接回复，但问题描述清晰，可复现性强。
Issue #32755: [Installation]: Newly required -cc tag in 0.14.0 for Docker compose -- documentation
- 核心议题：vLLM 0.14.0的Docker Compose使用方式发生变化，需要-cc参数，但官方文档完全没有解释该参数的用法、可选值及含义，导致用户无法顺利部署。
- 观点与讨论：
  - 报告者 (PathosEthosLogos)：强烈抱怨文档缺失，尝试了聊天机器人推荐的各种值均无效，使用体验差。
- 争议焦点：用户对文档质量的失望与现有支持渠道（聊天机器人）未能提供有效帮助之间的矛盾。
- 当前状态：Open，但PR #32809已提交以修复此问题，该PR旨在添加Docker Compose指南并详细说明--cc参数。
Issue #32765: [Feature]: Support gemma3 Models in eagle3 Speculative Decoding
- 核心议题：请求为Gemma3模型添加Eagle3推测解码支持。
- 观点与讨论：
  - 请求者 (Ofir408)：提出了明确的需求。
  - 贡献者 (baonudesifeizhai)：表示可以处理，但提出疑问“Gemma3最大模型只有27B，值得吗？”，暗示可能因模型规模较小而优先级不高。
- 争议焦点：功能实现的“性价比”或优先级评估。
- 当前状态：Open。有社区贡献者表示愿意实施，但尚未提交PR。

🔥 热门话题与趋势分析

模型支持扩展：社区持续推动对新模型架构的支持，本期涉及TranslateGemma (PR #32819)、Pacific-Prime/Complexity (PR #32803)、Eagle2.5-VL (已合并PR #32456)、Pixtral的Eagle3支持 (已合并PR #32542) 以及Gemma3的LoRA支持 (Issue #32758, PR #32764)。
性能优化与回归：
- 性能回归：Issue #32782凸显了版本升级可能引入的性能回退风险，引发对测试覆盖率和性能基准的重视。
- 内存优化：PR #32773试图修复在线FP8量化导致OOM的问题（回滚了PR #29196），PR #32789修复Whisper编码器-解码器模型的内存泄漏，均显示对内存效率的持续关注。
- 计算优化：多个PR聚焦MoE性能（PR #32414, #32774, #32778, #32790）、torch.compile冷启动（PR #32805）和自定义算子编译（PR #32806）。
平台与兼容性问题：
- AMD ROCm：如前述，有多项修复。
- CPU后端：Issue #32786报告了ARM架构多NUMA域下模型并行推理性能差的问题，并随即有PR #32792提交以通过自定义共享内存通信器进行加速。
- WSL：Issue #32559 (已关闭) 和 PR #32749 解决了WSL平台下NVML初始化和多进程方法的问题。
推测解码（Speculative Decoding）的增强与问题修复：围绕EAGLE与前缀缓存的兼容性（Issue #32802, PR #32801）、AMD平台上的测试与稳定性（PR #32825, #32818）以及对新模型架构的支持（Issue #32765, 已合并PR #32542, #32048）是持续热点。

🛠️ 重点技术变更

PR #32744: [PluggableLayer][1/N] Define PluggableLayer (已合并)
- 解读：此PR引入了“可插拔层”（PluggableLayer）的概念和基础框架，旨在替代或补充现有的CustomOp机制，为模型组件（如MLA）提供更规范、更易管理的扩展接口。这是实现RFC #23786的第一步。
- 影响：长期来看，有利于统一和简化vLLM中可扩展组件的开发与管理，提升代码模块化和可维护性。
PR #32414: [MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (已合并)
- 解读：对MoE内核选择逻辑进行重大重构。引入了内核优先级概念、标准化的内核构造接口和功能支持注册机制。
- 影响：实现了跨硬件和模型架构的自动最优内核选择（例如可自动选择TRTLLM内核），并在初始化时而非运行时进行兼容性验证，提升了部署的便捷性和错误信息的清晰度。
PR #32805: [torch.compile] Improve Cold Start for MoEs (进行中)
- 解读：通过避免在图编译期间将字符串硬编码到计算图中，显著减少了使用torch.compile时MoE模型的冷启动时间（例如GPT-OSS-120b从46秒降至16秒）。
- 影响：直接改善了用户使用torch.compile功能时的体验，特别是对于大型MoE模型，降低了迭代和部署的门槛。
PR #32806: [torch.compile] Compile CustomOp.forward_native for SiluAndMul and QuantFP8 (进行中)
- 解读：手动编译SiluAndMul和QuantFP8这两个CustomOp的forward_native方法。当它们被嵌套在另一个不透明的自定义操作（如fused_moe）中调用时，原始的forward_native会以eager模式执行，影响性能。此PR确保它们在所有上下文中都能被编译。
- 影响：优化了Llama 4、DeepSeek等模型中专家层（使用SiluAndMul）的计算性能，是性能优化向底层算子深入的体现。
PR #32812: [Deprecation] Remove deprecated environment variables (已合并)
- 解读：随着v0.14.0版本发布，清理了一批已废弃的环境变量。
- 影响：保持代码库的整洁，减少配置复杂性，引导用户使用新的、更稳定的配置方式。

📈 开发活跃度观察

贡献者活跃：在新增的62个PR中，除了核心维护团队（如WoosukKwon, njhill, DarkLight1337, robertgshaw2-redhat等），也有大量社区贡献者（如adityapuranik99, minosfuture, cwazai, danisereb等）积极参与模型支持、性能优化和问题修复。
AMD参与度：AMD员工（mawong-amd, divakar-amd）在本周期提交了多个关键修复PR，显示出AMD团队在确保其硬件平台兼容性方面的积极投入。
代码审查与合并效率：合并了30个PR，表明审查流程运作高效。多个PR被打上ready标签等待CI或合并，显示项目有良好的协作节奏。

💡 值得关注的问题

Issue #32782 (KV缓存性能回归)：这是一个影响升级体验的严重性能问题，需要核心团队优先调查和修复，其结果可能影响vLLM 0.14版本的稳定性评价。
Issue #32755 (Docker Compose文档缺失)：反映了项目在版本变更时，文档和用户沟通可能存在滞后。虽然已有修复PR，但也提醒团队需重视升级指南和变更日志的及时更新。
Issue #32786 / PR #32792 (ARM CPU跨NUMA性能)：该问题及解决方案凸显了vLLM在非GPU（特别是ARM服务器）异构计算环境下的性能挑战与优化机会，对于拓展部署场景有重要意义。

📋 附录：详细数据列表

新增 Issue

#32826 [Bug]: MiniMax-M2.1 NVFP4 fails on RTX PRO 6000 Blackwell (SM120) with expert parallel — bug — by gittb (创建于: 2026-01-22 10:59 (UTC+8))
#32822 [Bug]: When the temperature is set to 0 in qwen3-235B-A22B-Instruct-2507, there is still randomness. — bug — by xidiancpy (创建于: 2026-01-22 10:09 (UTC+8))
#32823 [Feature]: microsoft/VibeVoice-ASR support — feature request — by baonudesifeizhai (创建于: 2026-01-22 10:20 (UTC+8))
#32765 [Feature]: Support gemma3 Models in eagle3 Speculative Decoding — feature request — by Ofir408 (创建于: 2026-01-21 17:01 (UTC+8))
#32807 [Usage]: How to use tensorizer to load models from S3 with vLLM — usage — by asharkhan3101 (创建于: 2026-01-22 04:32 (UTC+8))
#32793 [Bug]: importing bitsandbytes causes pytest to fail — bug — by qgallouedec (创建于: 2026-01-22 02:06 (UTC+8))
#32802 [Bug]: GPT-OSS 0% prefix cache hits with hybrid attention + EAGLE — bug — by mgoin (创建于: 2026-01-22 03:22 (UTC+8))
#32791 [Bug]: chat.completions returns content: null for GPT-OSS multi-turn with json_object — bug — by supersteves (创建于: 2026-01-22 01:24 (UTC+8))
#32782 [Bug]: KV cache 0.14 regression vs 0.11 and 0.13 — bug — by dthp-git (创建于: 2026-01-21 22:39 (UTC+8))
#32786 [Bug]: [CPU Backend] [Arm] Slow model-parallel inference across NUMA domains — bug,cpu — by fadara01 (创建于: 2026-01-22 00:11 (UTC+8))
#32755 [Installation]: Newly required -cc tag in 0.14.0 for Docker compose – documentation — installation — by PathosEthosLogos (创建于: 2026-01-21 12:00 (UTC+8))
#32758 [Feature]: support lora for Gemma3ForConditionalGeneration — help wanted,feature request — by WuChuYi (创建于: 2026-01-21 14:06 (UTC+8))
#32753 [Feature]: [Safety] Add input validation and contract documentation to attention kernels — feature request — by red1239109-cmd (创建于: 2026-01-21 11:47 (UTC+8))
#32751 [CI Failure]: mi325_1 Entrypoints Integration Test (Responses API) — ci-failure — by AndreasKaratzas (创建于: 2026-01-21 11:11 (UTC+8))

已关闭 Issue

#32822 [Bug]: When the temperature is set to 0 in qwen3-235B-A22B-Instruct-2507, there is still randomness. — bug — by xidiancpy (关闭于: 2026-01-22 10:56 (UTC+8))
#25336 [Bug]: metrics — bug,stale — by sidikbro (关闭于: 2026-01-22 10:13 (UTC+8))
#25352 [Bug]: AssertionError while passing otlp_traces_endpoint in Mac — bug,stale — by skshashankkumar41 (关闭于: 2026-01-22 10:13 (UTC+8))
#20451 [RFC]: vLLM-compile low-hanging fruit cold start improvements — RFC,torch.compile,stale,startup-ux — by zou3519 (关闭于: 2026-01-21 23:20 (UTC+8))
#29509 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 3 — ci-failure — by AndreasKaratzas (关闭于: 2026-01-21 21:12 (UTC+8))
#23350 [Bug]: vLLM aarch64 support (GH200) — bug,stale — by wahabk (关闭于: 2026-01-21 19:03 (UTC+8))
#31345 [Bug]: The ‘’draft_tensor_parallel_size‘’ parameter of the Eagel3 draft model does not take effect — bug — by zhaomingyu13 (关闭于: 2026-01-21 17:43 (UTC+8))
#32559 [Bug]: NVML initialization failure even when running the basic example in the WSL platform — bug — by jasonyanwenl (关闭于: 2026-01-21 17:35 (UTC+8))
#19953 [Feature]: vllm support for mistral3.1 with no consolidated.safetensors — feature request,stale — by AbdelrahmanHagrass (关闭于: 2026-01-21 15:27 (UTC+8))
#32584 [Bug]: Mistral 3.1 24B fails to load with sharded weights - “no module named ‘language_model’ in LlamaForCausalLM” error — bug — by chay1045 (关闭于: 2026-01-21 15:21 (UTC+8))
#25562 [Bug]: update_from_kv_xfer_finished raise KeyError when using NixlConnector — bug,stale — by kebe7jun (关闭于: 2026-01-21 14:27 (UTC+8))
#27304 [Feature]: Allow custom stat_loggers in V1 engine initialization — feature request,stale — by yinggeh (关闭于: 2026-01-21 11:16 (UTC+8))
#29518 [CI Failure]: mi325_1: Language Models Test (Extended Generation) — ci-failure — by AndreasKaratzas (关闭于: 2026-01-21 11:01 (UTC+8))

新增 PR

#32825 [Hardware][AMD][Bugfix] Fix spec decode acceptance length test — bug,rocm,speculative-decoding,v1 — by mawong-amd (创建于: 2026-01-22 10:53 (UTC+8))
#32824 [Frontend] add prompt_cache_key for openresponses — frontend — by chaunceyjiang (创建于: 2026-01-22 10:27 (UTC+8))
#32819 [Model] Add TranslateGemma support — frontend — by adityapuranik99 (创建于: 2026-01-22 08:16 (UTC+8))
#32806 [torch.compile] Compile CustomOp.forward_native for SiluAndMul and QuantFP8 to avoid raw torch ops inside opaque custom ops — ready,torch.compile,ready-run-all-tests — by ProExpertProg (创建于: 2026-01-22 04:11 (UTC+8))
#32787 [ROCm][CI] fix get_valid_backends — rocm,speculative-decoding,ready,v1 — by divakar-amd (创建于: 2026-01-22 00:58 (UTC+8))
#32788 Cleanup some huggingface_hub-related stuff — ready — by Wauplin (创建于: 2026-01-22 01:00 (UTC+8))
#32812 [Deprecation] Remove deprecated environment variables — rocm,speculative-decoding,ready,ci/build,v1 — by yewentao256 (创建于: 2026-01-22 06:10 (UTC+8))
#32778 Workspace Reuse for MOE-LoRA Intermediate Buffers — 无标签 — by cwazai (创建于: 2026-01-21 20:55 (UTC+8))
#32805 [torch.compile] Improve Cold Start for MoEs — ready — by zou3519 (创建于: 2026-01-22 03:40 (UTC+8))
#32810 [FlashMLA] Update FlashMLA to expose new arguments — ci/build,v1,ready-run-all-tests — by LucasWilkinson (创建于: 2026-01-22 05:56 (UTC+8))
#32820 [Model Runner V2] Do not error on attention backends — v1 — by WoosukKwon (创建于: 2026-01-22 08:37 (UTC+8))
#32821 [Bugfix] Lazy import NgramProposer in GPU model runner — bug,v1 — by 22quinn (创建于: 2026-01-22 08:47 (UTC+8))
#32818 [Bugfix] Fix potential EAGLE spec decode segfault during graph capture — bug,speculative-decoding,ready,v1 — by mawong-amd (创建于: 2026-01-22 08:07 (UTC+8))
#32817 [MLA][DeepSeek] Add VLLM_MLA_FP8_PROJ to force FP8 for MLA q_b_proj layer — deepseek — by minosfuture (创建于: 2026-01-22 08:02 (UTC+8))
#32816 [Misc] fix: GPU Worker uses allowlist for distributed_executor_backend — v1 — by HollowMan6 (创建于: 2026-01-22 07:29 (UTC+8))
#32815 [Feature]: Remove DtoH Copy for lfm2_vl On Default Stream — v1 — by tianshu-Michael-yu (创建于: 2026-01-22 07:24 (UTC+8))
#32811 [Model Runner V2] Refactor Prompt Logprobs — v1 — by WoosukKwon (创建于: 2026-01-22 06:00 (UTC+8))
#32780 [Llama.py -> mistral.py] Extract mistral-only relevant code into separate file — new-model,ready,llama — by patrickvonplaten (创建于: 2026-01-21 21:44 (UTC+8))
#32814 [Bench] Add interactive mode and sample caching to vllm bench serve — performance — by minosfuture (创建于: 2026-01-22 06:22 (UTC+8))
#32809 [Doc/Fix] Add Docker Compose guide and fix doc-build hook — documentation — by SakshamKapoor2911 (创建于: 2026-01-22 05:40 (UTC+8))
#32813 [Hardware][AMD][Bugfix] Fix PTPC FP8 quantization — bug,rocm — by mawong-amd (创建于: 2026-01-22 06:10 (UTC+8))
#32775 [Docs] Remove outdated async_scheduling limitation with speculative decoding — ready — by ikaadil (创建于: 2026-01-21 19:19 (UTC+8))
#32797 [Misc] Add Helion version check to collect_env — ready — by gmagogsfm (创建于: 2026-01-22 02:28 (UTC+8))
#32808 [Bench] Add /slow_down endpoint for P/D disaggregation benchmarking (when input/output ratio is large) — frontend,v1,kv-connector — by minosfuture (创建于: 2026-01-22 05:38 (UTC+8))
#32799 [ModelRunner V2] Don’t pin reused flashinfer tensors — v1,nvidia — by njhill (创建于: 2026-01-22 02:56 (UTC+8))
#32804 Add fused MoE config for Nemotron Nano BF16 on B200 — 无标签 — by danisereb (创建于: 2026-01-22 03:38 (UTC+8))
#32796 [Misc] Log vLLM logo when starting server — frontend — by njhill (创建于: 2026-01-22 02:26 (UTC+8))
#32772 [Model] Use mm_position to compute mrope positions for Qwen2.5-Omni — documentation,qwen — by Etelis (创建于: 2026-01-21 18:15 (UTC+8))
#32803 [Model] Add Pacific-Prime / Complexity model support — new-model — by Complexity-ML (创建于: 2026-01-22 03:32 (UTC+8))
#32798 [Feature] Update env for trtllm prefill True by default — 无标签 — by yewentao256 (创建于: 2026-01-22 02:34 (UTC+8))
#32789 [Bugfix] Fix Whisper/encoder-decoder GPU memory leak — bug,v1,multi-modality — by NickLucche (创建于: 2026-01-22 01:07 (UTC+8))
#32801 [WIP] Fix GPT-OSS prefix caching not working with EAGLE — v1,gpt-oss — by mgoin (创建于: 2026-01-22 03:22 (UTC+8))
#32769 [fix] tesdt mcp_tool_calling_streaming with a more complex math question — ready — by daniel-salib (创建于: 2026-01-21 17:28 (UTC+8))
#32800 [EncoderCacheManager] Remove unnecessary copy — v1 — by lgeiger (创建于: 2026-01-22 03:05 (UTC+8))
#32795 [Bugfix][Attention] Explicitly report support for kv_cache_dtype bfloat16 — bug,v1,cpu,nvidia — by MatthewBonanni (创建于: 2026-01-22 02:21 (UTC+8))
#32762 [CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode — bug,frontend,gpt-oss — by AndreasKaratzas (创建于: 2026-01-21 16:01 (UTC+8))
#32783 [ROCm] fix import for on_gfx9 — rocm,ready — by divakar-amd (创建于: 2026-01-21 23:03 (UTC+8))
#32792 [CPU Backend] [Perf] Accelerate tensor-parallel/data-parallel inference across NUMA domains on Arm — ci/build,v1,cpu — by fadara01 (创建于: 2026-01-22 02:02 (UTC+8))
#32784 Add missing import of fused_topk to benchmark_moe — performance,ready — by danisereb (创建于: 2026-01-21 23:32 (UTC+8))
#32794 [Model Runner V2] Minor refactor for compute_slot_mappings — v1 — by WoosukKwon (创建于: 2026-01-22 02:14 (UTC+8))
#32785 Segmented prefill for gapped external KV cache hits — documentation,v1,kv-connector — by sdavidbd (创建于: 2026-01-22 00:03 (UTC+8))
#32790 [MoE] Enable Shared/Routed Overlap For Latent MoE (Nemotron-H) — 无标签 — by danielafrimi (创建于: 2026-01-22 01:07 (UTC+8))
#32767 [Docs] add Dynamo/aibrix integration and kubeai/aks link — documentation — by pacoxu (创建于: 2026-01-21 17:11 (UTC+8))
#32757 [Model] Extend collect_children and no_init_weights contexts — ready,llama,qwen — by DarkLight1337 (创建于: 2026-01-21 12:30 (UTC+8))
#32768 fix: preserve native tool call ID in multi-turn tool calling — frontend — by wangln19 (创建于: 2026-01-21 17:21 (UTC+8))
#32763 feat: Complete LoRA support for MiniMaxM2 Fixes #32736 — documentation — by Chenhao-Guan (创建于: 2026-01-21 16:40 (UTC+8))
#32781 [WIP] Add Mistral guidance — structured-output,frontend,v1 — by juliendenize (创建于: 2026-01-21 21:47 (UTC+8))
#32770 [lora/moe] Improve fused MoE‑LoRA kernel indexing and memory access — 无标签 — by cwazai (创建于: 2026-01-21 17:48 (UTC+8))
#32779 Fix infinite recursive search issue in quark.py — 无标签 — by xiao-llm (创建于: 2026-01-21 21:30 (UTC+8))
#32760 Support nccl fp8 communication — 无标签 — by amirkl94 (创建于: 2026-01-21 15:24 (UTC+8))
#32777 [Bugfix] Fix _CPU_MOE_ACT AssertionError when vLLM config not set — bug,cpu — by karanb192 (创建于: 2026-01-21 20:52 (UTC+8))
#32771 [Model Runner V2] support piecewise & mixed cudagraph — v1,nvidia — by izhuhaoran (创建于: 2026-01-21 17:59 (UTC+8))
#32773 [Bugfix][Quant] fix online fp8 quantization oom — bug — by CSWYF3634076 (创建于: 2026-01-21 18:59 (UTC+8))
#32764 [Feature] Add LoRA support for Gemma3 vision components — 无标签 — by vihaan-that (创建于: 2026-01-21 16:43 (UTC+8))
#32754 [Bugfix][ROCm] Fix DeepSeek-R1 repetition via hybrid sampler routing — bug,rocm,v1,deepseek — by vllmellm (创建于: 2026-01-21 11:52 (UTC+8))
#32776 Deprecated — 无标签 — by Zhenzhong1 (创建于: 2026-01-21 19:28 (UTC+8))
#32774 [lora/moe] Avoid extra intermediate buffer & Python slicing in expand phase when split_k == 1 — 无标签 — by cwazai (创建于: 2026-01-21 19:11 (UTC+8))
#32766 [P/D] Mooncake Connector support setting device — kv-connector — by Sean-LL (创建于: 2026-01-21 17:03 (UTC+8))
#32761 [Misc] Add JSON format logging support — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by dsuhinin (创建于: 2026-01-21 15:34 (UTC+8))
#32759 [Bugfix] Add glm4_moe_lite to MLA model list to fix excessive KV cache memory usage — bug — by yhfgyyf (创建于: 2026-01-21 14:41 (UTC+8))
#32756 Refactor save_path creation for benchmark results of benchmark/kernels/bench_fp8_gemm.py — performance — by GuoxiangZu (创建于: 2026-01-21 12:09 (UTC+8))
#32752 test check code — documentation,new-model — by Emilie1001 (创建于: 2026-01-21 11:38 (UTC+8))

已合并 PR

#32812 [Deprecation] Remove deprecated environment variables — rocm,speculative-decoding,ready,ci/build,v1 — by yewentao256 (合并于: 2026-01-22 10:25 (UTC+8))
#32820 [Model Runner V2] Do not error on attention backends — v1 — by WoosukKwon (合并于: 2026-01-22 09:02 (UTC+8))
#32811 [Model Runner V2] Refactor Prompt Logprobs — v1 — by WoosukKwon (合并于: 2026-01-22 07:12 (UTC+8))
#31246 [Kernel] Add topk_sigmoid kernel — performance,ready — by xyang16 (合并于: 2026-01-22 06:49 (UTC+8))
#32797 [Misc] Add Helion version check to collect_env — ready — by gmagogsfm (合并于: 2026-01-22 05:54 (UTC+8))
#32799 [ModelRunner V2] Don’t pin reused flashinfer tensors — v1,nvidia — by njhill (合并于: 2026-01-22 05:17 (UTC+8))
#32783 [ROCm] fix import for on_gfx9 — rocm,ready — by divakar-amd (合并于: 2026-01-22 02:41 (UTC+8))
#32784 Add missing import of fused_topk to benchmark_moe — performance,ready — by danisereb (合并于: 2026-01-22 02:30 (UTC+8))
#32794 [Model Runner V2] Minor refactor for compute_slot_mappings — v1 — by WoosukKwon (合并于: 2026-01-22 02:24 (UTC+8))
#32707 [Misc] Omit “disable NCCL for DP sync” startup log when not applicable — ready — by njhill (合并于: 2026-01-22 01:03 (UTC+8))
#30993 Bump Flashinfer to v0.6.1 — ready,ci/build,v1,nvidia — by elvischenv (合并于: 2026-01-22 00:49 (UTC+8))
#32744 [PluggableLayer][1/N] Define PluggableLayer (Fix ci) — documentation,performance,ready,ready-run-all-tests — by whx-sjtu (合并于: 2026-01-22 00:38 (UTC+8))
#32077 [Cleanup] Remove unused KVConnectorModelRunnerMixin methods — ready,v1,kv-connector — by njhill (合并于: 2026-01-21 11:16 (UTC+8))
#32697 [Quantization][Deprecation] Remove RTN — ready — by robertgshaw2-redhat (合并于: 2026-01-22 00:34 (UTC+8))
#29287 [ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp — rocm,ready,v1,deepseek — by ganyi1996ppo (合并于: 2026-01-21 23:16 (UTC+8))
#32681 [Quantization][Deprecation] Deprecate HQQ — ready — by robertgshaw2-redhat (合并于: 2026-01-21 22:32 (UTC+8))
#32679 [Quantization][Deprecation] Remove DeepSpeedFp8 — documentation,ready — by robertgshaw2-redhat (合并于: 2026-01-21 22:32 (UTC+8))
#32760 Support nccl fp8 communication — 无标签 — by amirkl94 (合并于: 2026-01-21 21:31 (UTC+8))
#32414 [MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority — documentation,performance,rocm,ready,ci/build,v1,llama,qwen,gpt-oss,nvidia — by robertgshaw2-redhat (合并于: 2026-01-21 21:22 (UTC+8))
#32727 [bugfix] Aria model — bug,ready — by divakar-amd (合并于: 2026-01-21 21:11 (UTC+8))
#32456 [Model] Add Eagle2.5-8B Vision-Language Model support — documentation,new-model,speculative-decoding,ready — by George-Polya (合并于: 2026-01-21 17:39 (UTC+8))
#32749 [Bugfix] Force using spawn multiprocess method when it’s the WSL platform — bug,ready — by jasonyanwenl (合并于: 2026-01-21 17:35 (UTC+8))
#32741 [Documentation] Fix typo in docs/design/torch_compile_multimodal.md — documentation — by Lucaskabela (合并于: 2026-01-21 15:54 (UTC+8))
#32673 [Bugfix] Support HF sharded weights for Mistral3/Pixtral models — bug,ready — by ricky-chaoju (合并于: 2026-01-21 15:27 (UTC+8))
#32582 [Docs] Fix GitHub handle in governance process — documentation,ready — by pacoxu (合并于: 2026-01-21 15:07 (UTC+8))
#32491 [FlashMLA] Update FlashMLA — ready,ci/build,v1,ready-run-all-tests — by LucasWilkinson (合并于: 2026-01-21 13:03 (UTC+8))
#32682 [Bugfix] Fix Nemotron-Nano-v2-vlm static resolution — bug,ready — by netanel-haber (合并于: 2026-01-21 14:28 (UTC+8))
#32048 Added qwen3 vision language moe support for speculative decoding — speculative-decoding,ready,v1,qwen — by shanjiaz (合并于: 2026-01-21 11:24 (UTC+8))
#32542 Enable Eagle3 speculative decoding for Pixtral (LlavaForConditionalGeneration) — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by gopalsarda (合并于: 2026-01-21 11:18 (UTC+8))
#32299 [Bugfix] Fix Granite Vision / Don’t use Siglip Pooling Head Nested Models by Default — bug,ready,multi-modality — by alex-jw-brooks (合并于: 2026-01-21 11:11 (UTC+8))

关闭但未合并的 PR

#32318 [Quantization] Add W4A16 NVFP4 MoE support for CompressedTensors — ci/build — by zhangyimi (关闭于: 2026-01-22 09:42 (UTC+8))
#32191 [AOT compilation] cached inductor artifacts benchmark #32043 — performance — by dolpm (关闭于: 2026-01-22 06:37 (UTC+8))
#32689 [Quantization][Deprecation] Remove ExpertsInt8 — ready,needs-rebase — by robertgshaw2-redhat (关闭于: 2026-01-22 04:45 (UTC+8))
#32798 [Feature] Update env for trtllm prefill True by default — 无标签 — by yewentao256 (关闭于: 2026-01-22 02:37 (UTC+8))
#31703 [Misc] Add packed_modules_mapping for MiniMaxM2ForCausalLM — documentation — by jeejeelee (关闭于: 2026-01-21 23:04 (UTC+8))
#32776 Deprecated — 无标签 — by Zhenzhong1 (关闭于: 2026-01-21 19:28 (UTC+8))
#32146 [Frontend] Support OpenAI-style tool call IDs in Kimi K2 parser — documentation — by daniel-salib (关闭于: 2026-01-21 19:22 (UTC+8))
#32216 [Frontend] Add dedicated KimiK2ReasoningParser for tool call handling — 无标签 — by daniel-salib (关闭于: 2026-01-21 19:22 (UTC+8))
#20422 [Core] Implement sleep/wake_up for SpecDecodeWorker — speculative-decoding,needs-rebase,unstale — by kebe7jun (关闭于: 2026-01-21 18:36 (UTC+8))
#25563 [Bugfix][Core] fix KeyError in _update_waiting_for_remote_kv — stale,v1,kv-connector — by kebe7jun (关闭于: 2026-01-21 14:28 (UTC+8))
#32750 Gb200 unified memory v0.14.0 — documentation,nvidia — by pst2154 (关闭于: 2026-01-21 14:20 (UTC+8))
#32752 test check code — documentation,new-model — by Emilie1001 (关闭于: 2026-01-21 11:59 (UTC+8))