vLLM Development Activity Report - 2026-01-21
Time window: 2026-01-21 11:00 (UTC+8) to 2026-01-22 11:00 (UTC+8). Stats: 14 new issues | 13 issues closed | 62 new PRs | 30 PRs merged | 12 PRs closed without merging
📊 Daily Development Summary
vLLM remained highly active this cycle (2026-01-21 to 2026-01-22), with a large volume of both new PRs (62) and merged PRs (30). Core development focused on extending model support (e.g. TranslateGemma, Pacific-Prime), performance optimization (e.g. MoE, torch.compile cold start), and AMD platform compatibility fixes. Community discussion stayed lively, particularly around GPU model compatibility, performance regressions, and configuration documentation.
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was brisk this cycle, centered on bug fixes and platform compatibility work; several AMD engineers (usernames with the -amd suffix) contributed key fixes.
- Bugfix PRs (contributed by AMD engineers):
  - PR #32813: [Hardware][AMD][Bugfix] Fix PTPC FP8 quantization (by mawong-amd)
    - What: fixes the PTPCFP8LinearMethod inheritance chain broken by the refactor in PR #32189, so it correctly inherits from FP8OnlineLinearMethod again, repairing the AMD quantization test group.
    - Impact: restores correct FP8 quantization via PTPC (likely an AMD-specific quantization scheme) on AMD platforms.
  - PR #32783: [ROCm] fix import for on_gfx9 (by divakar-amd)
    - What: fixes the on_gfx9 import in fused_moe and fused_batched_moe, replacing the call through the platform interface with a direct import.
    - Impact: resolves the import problem in these modules on ROCm (notably the GFX9 architecture), improving robustness.
  - PR #32787: [ROCm][CI] fix get_valid_backends (by divakar-amd)
    - What: fixes a bug where the __getattr__ method on the Platform interface made hasattr checks always return True, breaking the backend-validity check in speculative decoding tests.
    - Impact: lays the groundwork for fully fixing the spec decode acceptance length test on ROCm (PR #32825).
  - PR #32825: [Hardware][AMD][Bugfix] Fix spec decode acceptance length test (by mawong-amd)
    - What: fixes the failing spec decode acceptance length test on ROCm. Root cause: a hasattr check was defeated by __getattr__, leading to a call to the unimplemented get_valid_backends method.
    - Impact: directly fixes the failing spec decode tests in AMD CI, improving the test suite's pass rate on AMD.
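The hasattr pitfall behind PR #32787 and PR #32825 can be reproduced in a few lines of plain Python (a minimal sketch; this Platform class is illustrative, not vLLM's actual code):

```python
class Platform:
    def __getattr__(self, name):
        # Fallback for optional, platform-specific attributes: instead of
        # raising AttributeError on a missing attribute, return None.
        return None

p = Platform()

# hasattr() only checks whether attribute lookup raises AttributeError.
# Because __getattr__ swallows the miss, the check always passes...
print(hasattr(p, "get_valid_backends"))  # True

# ...so a guarded call site would fetch a "method" that is actually None.
fn = p.get_valid_backends
print(fn is None)  # True — calling fn() would raise TypeError
```

This is why a `hasattr`-based capability check on any class defining a permissive `__getattr__` is unreliable, and the check logic had to be fixed rather than the call site.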
- Feature/bugfix PRs touching the AMD platform:
  - PR #32754: [Bugfix][ROCm] Fix DeepSeek-R1 repetition via hybrid sampler routing (by vllmellm)
    - What: fixes repeated/looping output from DeepSeek-R1 on AMD ROCm, caused by the Aiter sampler's incomplete support for per-request RNG. The fix introduces hybrid routing logic that falls back to the native PyTorch sampler for requests that need randomness.
    - Impact: restores correctness for this model on AMD while still exploiting Aiter acceleration in high-performance (greedy sampling) scenarios.
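The hybrid routing idea can be sketched in a few lines (hypothetical names; not vLLM's actual implementation): the fast Aiter path lacks full per-request RNG support, so only deterministic requests may use it.

```python
def needs_per_request_rng(temperature: float) -> bool:
    # Greedy decoding (temperature == 0) is deterministic: no RNG needed.
    return temperature > 0.0

def route_sampler(temperature: float) -> str:
    if needs_per_request_rng(temperature):
        # Fall back to the native PyTorch sampler for randomness.
        return "native_fallback"
    # Greedy requests stay on the accelerated path.
    return "fast_greedy"

print(route_sampler(temperature=0.0))  # fast_greedy
print(route_sampler(temperature=0.8))  # native_fallback
```

The design trades a per-request branch for correctness: the fast path is kept wherever its semantics are known to match, and everything else takes the safe path.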
- New PRs (contributed by AMD engineers):
  - PR #32825 (see above): besides the test fix, it carries the changes from PR #32818, which fixes a potential segfault in EAGLE speculative decoding during CUDA graph capture. This matters for EAGLE stability on AMD.
Summary: AMD contributions this cycle centered on test and platform-specific fixes, reflecting sustained investment in keeping vLLM functional and its tests passing on AMD hardware (MI300X, MI325X). No changes directly related to the Quark quantization tool or new MI300 features were observed.
💬 High-Traffic Discussions
- Issue #32782: [Bug]: KV cache 0.14 regression vs 0.11 and 0.13
  - Topic: after upgrading from vLLM 0.11/0.13 to 0.14, a user reports a severe KV-cache performance regression under a specific configuration (Qwen3-Coder-30B, PP=4, TP=1), with a large throughput drop and out-of-memory (OOM) errors.
  - Discussion:
    - Reporter (dthp-git): provided detailed performance comparisons and error logs showing that 0.14 performs worse on identical hardware and configuration.
    - Maintainer (robertgshaw2-redhat): responded quickly with thanks and asked for the full reproduction command for further investigation.
  - Contention: none; this is a straightforward report of a serious performance regression.
  - Status: Open. Maintainers are engaged, awaiting more details or internal triage.
- Issue #32791: [Bug]: chat.completions returns content: null for GPT-OSS multi-turn with json_object
  - Topic: when using GPT-OSS models for multi-turn chat combined with the json_object response format, the returned content field is null.
  - Discussion:
    - Reporter (supersteves): provided full reproduction steps and environment details, and confirmed the problem is specific to the GPT-OSS (Harmony) family; other models (e.g. Llama-3.3) behave normally.
  - Contention: none; a bug report scoped to one model family.
  - Status: Open. No maintainer has replied yet, but the report is clear and readily reproducible.
- Issue #32755: [Installation]: Newly required -cc tag in 0.14.0 for Docker compose -- documentation
  - Topic: the Docker Compose workflow changed in vLLM 0.14.0 and now requires a -cc parameter, but the official documentation does not explain its usage, valid values, or meaning at all, leaving users unable to deploy.
  - Discussion:
    - Reporter (PathosEthosLogos): complained strongly about the missing documentation; every value suggested by chatbots failed, making for a poor experience.
  - Contention: user frustration over documentation quality, compounded by existing support channels (chatbots) failing to help.
  - Status: Open, but PR #32809 has been submitted to fix this by adding a Docker Compose guide that documents the --cc parameter.
- Issue #32765: [Feature]: Support gemma3 Models in eagle3 Speculative Decoding
  - Topic: a request to add Eagle3 speculative decoding support for Gemma3 models.
  - Discussion:
    - Requester (Ofir408): stated a clear need.
    - Contributor (baonudesifeizhai): offered to take it on, but asked "the largest Gemma3 is only 27B, is it worth it?", hinting the small model size may make this low priority.
  - Contention: the cost/benefit, i.e. the priority, of implementing the feature.
  - Status: Open. A community contributor is willing to implement it, but no PR has been submitted yet.
🔥 Hot Topics & Trends
- Model support expansion: the community keeps pushing support for new model architectures; this cycle covers TranslateGemma (PR #32819), Pacific-Prime/Complexity (PR #32803), Eagle2.5-VL (merged PR #32456), Eagle3 for Pixtral (merged PR #32542), and LoRA for Gemma3 (Issue #32758, PR #32764).
- Performance optimization and regressions:
  - Regression: Issue #32782 highlights the risk of performance regressions introduced by version upgrades, drawing attention to test coverage and performance baselines.
  - Memory: PR #32773 attempts to fix an OOM caused by online FP8 quantization (reverting PR #29196), and PR #32789 fixes a memory leak in Whisper encoder-decoder models; both reflect continued focus on memory efficiency.
  - Compute: several PRs target MoE performance (PR #32414, #32774, #32778, #32790), torch.compile cold start (PR #32805), and custom-op compilation (PR #32806).
- Platform and compatibility:
  - AMD ROCm: multiple fixes, as covered above.
  - CPU backend: Issue #32786 reports poor model-parallel inference performance across NUMA domains on ARM, with PR #32792 promptly submitted to accelerate it via a custom shared-memory communicator.
  - WSL: Issue #32559 (closed) and PR #32749 address NVML initialization and the multiprocessing method on WSL.
- Speculative decoding enhancements and fixes: EAGLE's compatibility with prefix caching (Issue #32802, PR #32801), testing and stability on AMD (PR #32825, #32818), and support for new model architectures (Issue #32765, merged PRs #32542, #32048) remain ongoing hot spots.
🛠️ Key Technical Changes
- PR #32744: [PluggableLayer][1/N] Define PluggableLayer (merged)
  - Analysis: introduces the PluggableLayer concept and its base framework, intended to replace or complement the existing CustomOp mechanism and provide a cleaner, more manageable extension interface for model components (e.g. MLA). This is the first step toward RFC #23786.
  - Impact: over the long term, this should unify and simplify how extensible components are developed and managed in vLLM, improving modularity and maintainability.
- PR #32414: [MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (merged)
  - Analysis: a major refactor of MoE kernel selection, introducing kernel priorities, a standardized kernel-construction interface, and a capability registration mechanism.
  - Impact: enables automatic selection of the best kernel across hardware and model architectures (e.g. auto-selecting the TRTLLM kernel), and moves compatibility validation from runtime to initialization, improving deployment ergonomics and error-message clarity.
- PR #32805: [torch.compile] Improve Cold Start for MoEs (in progress)
  - Analysis: by avoiding hardcoding strings into the computation graph during compilation, it significantly reduces cold-start time for MoE models under torch.compile (e.g. GPT-OSS-120b drops from 46s to 16s).
  - Impact: directly improves the torch.compile experience, especially for large MoE models, lowering the bar for iteration and deployment.
- PR #32806: [torch.compile] Compile CustomOp.forward_native for SiluAndMul and QuantFP8 (in progress)
  - Analysis: manually compiles the forward_native methods of the SiluAndMul and QuantFP8 custom ops. When these ops are invoked inside another opaque custom op (such as fused_moe), forward_native would otherwise run in eager mode and hurt performance; this PR ensures they get compiled in all contexts.
  - Impact: speeds up expert-layer computation (which uses SiluAndMul) in models like Llama 4 and DeepSeek, extending the performance work down to the operator level.
- PR #32812: [Deprecation] Remove deprecated environment variables (merged)
  - Analysis: with v0.14.0 released, a batch of deprecated environment variables has been removed.
  - Impact: keeps the codebase clean, reduces configuration complexity, and steers users toward the newer, more stable configuration paths.
📈 Development Activity Observations
- Contributor activity: among the 62 new PRs, beyond the core maintainers (WoosukKwon, njhill, DarkLight1337, robertgshaw2-redhat, etc.), many community contributors (adityapuranik99, minosfuture, cwazai, danisereb, etc.) were active across model support, performance work, and bug fixes.
- AMD engagement: AMD engineers (mawong-amd, divakar-amd) submitted several key bugfix PRs this cycle, showing the AMD team's active investment in hardware-platform compatibility.
- Review and merge throughput: 30 PRs were merged, indicating an efficient review pipeline. Several PRs carry the ready label awaiting CI or merge, suggesting a healthy collaboration cadence.
💡 Issues Worth Watching
- Issue #32782 (KV-cache performance regression): a serious problem affecting the upgrade experience that warrants priority investigation by the core team; its outcome could shape perceptions of vLLM 0.14's stability.
- Issue #32755 (missing Docker Compose docs): shows that documentation and user communication can lag behind releases. A fix PR exists, but it is a reminder to keep upgrade guides and changelogs current.
- Issue #32786 / PR #32792 (ARM CPU cross-NUMA performance): the problem and its fix highlight vLLM's performance challenges and optimization opportunities in non-GPU environments (particularly ARM servers), which matters for broadening deployment scenarios.
📋 Appendix: Detailed Data
New Issues
- #32826 [Bug]: MiniMax-M2.1 NVFP4 fails on RTX PRO 6000 Blackwell (SM120) with expert parallel — bug — by gittb (created: 2026-01-22 10:59 (UTC+8))
- #32822 [Bug]: When the temperature is set to 0 in qwen3-235B-A22B-Instruct-2507, there is still randomness. — bug — by xidiancpy (created: 2026-01-22 10:09 (UTC+8))
- #32823 [Feature]: microsoft/VibeVoice-ASR support — feature request — by baonudesifeizhai (created: 2026-01-22 10:20 (UTC+8))
- #32765 [Feature]: Support gemma3 Models in eagle3 Speculative Decoding — feature request — by Ofir408 (created: 2026-01-21 17:01 (UTC+8))
- #32807 [Usage]: How to use tensorizer to load models from S3 with vLLM — usage — by asharkhan3101 (created: 2026-01-22 04:32 (UTC+8))
- #32793 [Bug]: importing bitsandbytes causes pytest to fail — bug — by qgallouedec (created: 2026-01-22 02:06 (UTC+8))
- #32802 [Bug]: GPT-OSS 0% prefix cache hits with hybrid attention + EAGLE — bug — by mgoin (created: 2026-01-22 03:22 (UTC+8))
- #32791 [Bug]: chat.completions returns content: null for GPT-OSS multi-turn with json_object — bug — by supersteves (created: 2026-01-22 01:24 (UTC+8))
- #32782 [Bug]: KV cache 0.14 regression vs 0.11 and 0.13 — bug — by dthp-git (created: 2026-01-21 22:39 (UTC+8))
- #32786 [Bug]: [CPU Backend] [Arm] Slow model-parallel inference across NUMA domains — bug,cpu — by fadara01 (created: 2026-01-22 00:11 (UTC+8))
- #32755 [Installation]: Newly required -cc tag in 0.14.0 for Docker compose – documentation — installation — by PathosEthosLogos (created: 2026-01-21 12:00 (UTC+8))
- #32758 [Feature]: support lora for Gemma3ForConditionalGeneration — help wanted,feature request — by WuChuYi (created: 2026-01-21 14:06 (UTC+8))
- #32753 [Feature]: [Safety] Add input validation and contract documentation to attention kernels — feature request — by red1239109-cmd (created: 2026-01-21 11:47 (UTC+8))
- #32751 [CI Failure]: mi325_1 Entrypoints Integration Test (Responses API) — ci-failure — by AndreasKaratzas (created: 2026-01-21 11:11 (UTC+8))
Closed Issues
- #32822 [Bug]: When the temperature is set to 0 in qwen3-235B-A22B-Instruct-2507, there is still randomness. — bug — by xidiancpy (closed: 2026-01-22 10:56 (UTC+8))
- #25336 [Bug]: metrics — bug,stale — by sidikbro (closed: 2026-01-22 10:13 (UTC+8))
- #25352 [Bug]: AssertionError while passing otlp_traces_endpoint in Mac — bug,stale — by skshashankkumar41 (closed: 2026-01-22 10:13 (UTC+8))
- #20451 [RFC]: vLLM-compile low-hanging fruit cold start improvements — RFC,torch.compile,stale,startup-ux — by zou3519 (closed: 2026-01-21 23:20 (UTC+8))
- #29509 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 3 — ci-failure — by AndreasKaratzas (closed: 2026-01-21 21:12 (UTC+8))
- #23350 [Bug]: vLLM aarch64 support (GH200) — bug,stale — by wahabk (closed: 2026-01-21 19:03 (UTC+8))
- #31345 [Bug]: The ‘’draft_tensor_parallel_size‘’ parameter of the Eagel3 draft model does not take effect — bug — by zhaomingyu13 (closed: 2026-01-21 17:43 (UTC+8))
- #32559 [Bug]: NVML initialization failure even when running the basic example in the WSL platform — bug — by jasonyanwenl (closed: 2026-01-21 17:35 (UTC+8))
- #19953 [Feature]: vllm support for mistral3.1 with no consolidated.safetensors — feature request,stale — by AbdelrahmanHagrass (closed: 2026-01-21 15:27 (UTC+8))
- #32584 [Bug]: Mistral 3.1 24B fails to load with sharded weights - “no module named ‘language_model’ in LlamaForCausalLM” error — bug — by chay1045 (closed: 2026-01-21 15:21 (UTC+8))
- #25562 [Bug]: update_from_kv_xfer_finished raise KeyError when using NixlConnector — bug,stale — by kebe7jun (closed: 2026-01-21 14:27 (UTC+8))
- #27304 [Feature]: Allow custom stat_loggers in V1 engine initialization — feature request,stale — by yinggeh (closed: 2026-01-21 11:16 (UTC+8))
- #29518 [CI Failure]: mi325_1: Language Models Test (Extended Generation) — ci-failure — by AndreasKaratzas (closed: 2026-01-21 11:01 (UTC+8))
New PRs
- #32825 [Hardware][AMD][Bugfix] Fix spec decode acceptance length test — bug,rocm,speculative-decoding,v1 — by mawong-amd (created: 2026-01-22 10:53 (UTC+8))
- #32824 [Frontend] add prompt_cache_key for openresponses — frontend — by chaunceyjiang (created: 2026-01-22 10:27 (UTC+8))
- #32819 [Model] Add TranslateGemma support — frontend — by adityapuranik99 (created: 2026-01-22 08:16 (UTC+8))
- #32806 [torch.compile] Compile CustomOp.forward_native for SiluAndMul and QuantFP8 to avoid raw torch ops inside opaque custom ops — ready,torch.compile,ready-run-all-tests — by ProExpertProg (created: 2026-01-22 04:11 (UTC+8))
- #32787 [ROCm][CI] fix get_valid_backends — rocm,speculative-decoding,ready,v1 — by divakar-amd (created: 2026-01-22 00:58 (UTC+8))
- #32788 Cleanup some huggingface_hub-related stuff — ready — by Wauplin (created: 2026-01-22 01:00 (UTC+8))
- #32812 [Deprecation] Remove deprecated environment variables — rocm,speculative-decoding,ready,ci/build,v1 — by yewentao256 (created: 2026-01-22 06:10 (UTC+8))
- #32778 Workspace Reuse for MOE-LoRA Intermediate Buffers — no labels — by cwazai (created: 2026-01-21 20:55 (UTC+8))
- #32805 [torch.compile] Improve Cold Start for MoEs — ready — by zou3519 (created: 2026-01-22 03:40 (UTC+8))
- #32810 [FlashMLA] Update FlashMLA to expose new arguments — ci/build,v1,ready-run-all-tests — by LucasWilkinson (created: 2026-01-22 05:56 (UTC+8))
- #32820 [Model Runner V2] Do not error on attention backends — v1 — by WoosukKwon (created: 2026-01-22 08:37 (UTC+8))
- #32821 [Bugfix] Lazy import NgramProposer in GPU model runner — bug,v1 — by 22quinn (created: 2026-01-22 08:47 (UTC+8))
- #32818 [Bugfix] Fix potential EAGLE spec decode segfault during graph capture — bug,speculative-decoding,ready,v1 — by mawong-amd (created: 2026-01-22 08:07 (UTC+8))
- #32817 [MLA][DeepSeek] Add VLLM_MLA_FP8_PROJ to force FP8 for MLA q_b_proj layer — deepseek — by minosfuture (created: 2026-01-22 08:02 (UTC+8))
- #32816 [Misc] fix: GPU Worker uses allowlist for distributed_executor_backend — v1 — by HollowMan6 (created: 2026-01-22 07:29 (UTC+8))
- #32815 [Feature]: Remove DtoH Copy for lfm2_vl On Default Stream — v1 — by tianshu-Michael-yu (created: 2026-01-22 07:24 (UTC+8))
- #32811 [Model Runner V2] Refactor Prompt Logprobs — v1 — by WoosukKwon (created: 2026-01-22 06:00 (UTC+8))
- #32780 [Llama.py -> mistral.py] Extract mistral-only relevant code into separate file — new-model,ready,llama — by patrickvonplaten (created: 2026-01-21 21:44 (UTC+8))
- #32814 [Bench] Add interactive mode and sample caching to vllm bench serve — performance — by minosfuture (created: 2026-01-22 06:22 (UTC+8))
- #32809 [Doc/Fix] Add Docker Compose guide and fix doc-build hook — documentation — by SakshamKapoor2911 (created: 2026-01-22 05:40 (UTC+8))
- #32813 [Hardware][AMD][Bugfix] Fix PTPC FP8 quantization — bug,rocm — by mawong-amd (created: 2026-01-22 06:10 (UTC+8))
- #32775 [Docs] Remove outdated async_scheduling limitation with speculative decoding — ready — by ikaadil (created: 2026-01-21 19:19 (UTC+8))
- #32797 [Misc] Add Helion version check to collect_env — ready — by gmagogsfm (created: 2026-01-22 02:28 (UTC+8))
- #32808 [Bench] Add /slow_down endpoint for P/D disaggregation benchmarking (when input/output ratio is large) — frontend,v1,kv-connector — by minosfuture (created: 2026-01-22 05:38 (UTC+8))
- #32799 [ModelRunner V2] Don’t pin reused flashinfer tensors — v1,nvidia — by njhill (created: 2026-01-22 02:56 (UTC+8))
- #32804 Add fused MoE config for Nemotron Nano BF16 on B200 — no labels — by danisereb (created: 2026-01-22 03:38 (UTC+8))
- #32796 [Misc] Log vLLM logo when starting server — frontend — by njhill (created: 2026-01-22 02:26 (UTC+8))
- #32772 [Model] Use mm_position to compute mrope positions for Qwen2.5-Omni — documentation,qwen — by Etelis (created: 2026-01-21 18:15 (UTC+8))
- #32803 [Model] Add Pacific-Prime / Complexity model support — new-model — by Complexity-ML (created: 2026-01-22 03:32 (UTC+8))
- #32798 [Feature] Update env for trtllm prefill True by default — no labels — by yewentao256 (created: 2026-01-22 02:34 (UTC+8))
- #32789 [Bugfix] Fix Whisper/encoder-decoder GPU memory leak — bug,v1,multi-modality — by NickLucche (created: 2026-01-22 01:07 (UTC+8))
- #32801 [WIP] Fix GPT-OSS prefix caching not working with EAGLE — v1,gpt-oss — by mgoin (created: 2026-01-22 03:22 (UTC+8))
- #32769 [fix] tesdt mcp_tool_calling_streaming with a more complex math question — ready — by daniel-salib (created: 2026-01-21 17:28 (UTC+8))
- #32800 [EncoderCacheManager] Remove unnecessary copy — v1 — by lgeiger (created: 2026-01-22 03:05 (UTC+8))
- #32795 [Bugfix][Attention] Explicitly report support for kv_cache_dtype bfloat16 — bug,v1,cpu,nvidia — by MatthewBonanni (created: 2026-01-22 02:21 (UTC+8))
- #32762 [CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode — bug,frontend,gpt-oss — by AndreasKaratzas (created: 2026-01-21 16:01 (UTC+8))
- #32783 [ROCm] fix import for on_gfx9 — rocm,ready — by divakar-amd (created: 2026-01-21 23:03 (UTC+8))
- #32792 [CPU Backend] [Perf] Accelerate tensor-parallel/data-parallel inference across NUMA domains on Arm — ci/build,v1,cpu — by fadara01 (created: 2026-01-22 02:02 (UTC+8))
- #32784 Add missing import of fused_topk to benchmark_moe — performance,ready — by danisereb (created: 2026-01-21 23:32 (UTC+8))
- #32794 [Model Runner V2] Minor refactor for compute_slot_mappings — v1 — by WoosukKwon (created: 2026-01-22 02:14 (UTC+8))
- #32785 Segmented prefill for gapped external KV cache hits — documentation,v1,kv-connector — by sdavidbd (created: 2026-01-22 00:03 (UTC+8))
- #32790 [MoE] Enable Shared/Routed Overlap For Latent MoE (Nemotron-H) — no labels — by danielafrimi (created: 2026-01-22 01:07 (UTC+8))
- #32767 [Docs] add Dynamo/aibrix integration and kubeai/aks link — documentation — by pacoxu (created: 2026-01-21 17:11 (UTC+8))
- #32757 [Model] Extend collect_children and no_init_weights contexts — ready,llama,qwen — by DarkLight1337 (created: 2026-01-21 12:30 (UTC+8))
- #32768 fix: preserve native tool call ID in multi-turn tool calling — frontend — by wangln19 (created: 2026-01-21 17:21 (UTC+8))
- #32763 feat: Complete LoRA support for MiniMaxM2 Fixes #32736 — documentation — by Chenhao-Guan (created: 2026-01-21 16:40 (UTC+8))
- #32781 [WIP] Add Mistral guidance — structured-output,frontend,v1 — by juliendenize (created: 2026-01-21 21:47 (UTC+8))
- #32770 [lora/moe] Improve fused MoE‑LoRA kernel indexing and memory access — no labels — by cwazai (created: 2026-01-21 17:48 (UTC+8))
- #32779 Fix infinite recursive search issue in quark.py — no labels — by xiao-llm (created: 2026-01-21 21:30 (UTC+8))
- #32760 Support nccl fp8 communication — no labels — by amirkl94 (created: 2026-01-21 15:24 (UTC+8))
- #32777 [Bugfix] Fix _CPU_MOE_ACT AssertionError when vLLM config not set — bug,cpu — by karanb192 (created: 2026-01-21 20:52 (UTC+8))
- #32771 [Model Runner V2] support piecewise & mixed cudagraph — v1,nvidia — by izhuhaoran (created: 2026-01-21 17:59 (UTC+8))
- #32773 [Bugfix][Quant] fix online fp8 quantization oom — bug — by CSWYF3634076 (created: 2026-01-21 18:59 (UTC+8))
- #32764 [Feature] Add LoRA support for Gemma3 vision components — no labels — by vihaan-that (created: 2026-01-21 16:43 (UTC+8))
- #32754 [Bugfix][ROCm] Fix DeepSeek-R1 repetition via hybrid sampler routing — bug,rocm,v1,deepseek — by vllmellm (created: 2026-01-21 11:52 (UTC+8))
- #32776 Deprecated — no labels — by Zhenzhong1 (created: 2026-01-21 19:28 (UTC+8))
- #32774 [lora/moe] Avoid extra intermediate buffer & Python slicing in expand phase when split_k == 1 — no labels — by cwazai (created: 2026-01-21 19:11 (UTC+8))
- #32766 [P/D] Mooncake Connector support setting device — kv-connector — by Sean-LL (created: 2026-01-21 17:03 (UTC+8))
- #32761 [Misc] Add JSON format logging support — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by dsuhinin (created: 2026-01-21 15:34 (UTC+8))
- #32759 [Bugfix] Add glm4_moe_lite to MLA model list to fix excessive KV cache memory usage — bug — by yhfgyyf (created: 2026-01-21 14:41 (UTC+8))
- #32756 Refactor save_path creation for benchmark results of benchmark/kernels/bench_fp8_gemm.py — performance — by GuoxiangZu (created: 2026-01-21 12:09 (UTC+8))
- #32752 test check code — documentation,new-model — by Emilie1001 (created: 2026-01-21 11:38 (UTC+8))
Merged PRs
- #32812 [Deprecation] Remove deprecated environment variables — rocm,speculative-decoding,ready,ci/build,v1 — by yewentao256 (merged: 2026-01-22 10:25 (UTC+8))
- #32820 [Model Runner V2] Do not error on attention backends — v1 — by WoosukKwon (merged: 2026-01-22 09:02 (UTC+8))
- #32811 [Model Runner V2] Refactor Prompt Logprobs — v1 — by WoosukKwon (merged: 2026-01-22 07:12 (UTC+8))
- #31246 [Kernel] Add topk_sigmoid kernel — performance,ready — by xyang16 (merged: 2026-01-22 06:49 (UTC+8))
- #32797 [Misc] Add Helion version check to collect_env — ready — by gmagogsfm (merged: 2026-01-22 05:54 (UTC+8))
- #32799 [ModelRunner V2] Don’t pin reused flashinfer tensors — v1,nvidia — by njhill (merged: 2026-01-22 05:17 (UTC+8))
- #32783 [ROCm] fix import for on_gfx9 — rocm,ready — by divakar-amd (merged: 2026-01-22 02:41 (UTC+8))
- #32784 Add missing import of fused_topk to benchmark_moe — performance,ready — by danisereb (merged: 2026-01-22 02:30 (UTC+8))
- #32794 [Model Runner V2] Minor refactor for compute_slot_mappings — v1 — by WoosukKwon (merged: 2026-01-22 02:24 (UTC+8))
- #32707 [Misc] Omit “disable NCCL for DP sync” startup log when not applicable — ready — by njhill (merged: 2026-01-22 01:03 (UTC+8))
- #30993 Bump Flashinfer to v0.6.1 — ready,ci/build,v1,nvidia — by elvischenv (merged: 2026-01-22 00:49 (UTC+8))
- #32744 [PluggableLayer][1/N] Define PluggableLayer (Fix ci) — documentation,performance,ready,ready-run-all-tests — by whx-sjtu (merged: 2026-01-22 00:38 (UTC+8))
- #32077 [Cleanup] Remove unused KVConnectorModelRunnerMixin methods — ready,v1,kv-connector — by njhill (merged: 2026-01-21 11:16 (UTC+8))
- #32697 [Quantization][Deprecation] Remove RTN — ready — by robertgshaw2-redhat (merged: 2026-01-22 00:34 (UTC+8))
- #29287 [ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp — rocm,ready,v1,deepseek — by ganyi1996ppo (merged: 2026-01-21 23:16 (UTC+8))
- #32681 [Quantization][Deprecation] Deprecate HQQ — ready — by robertgshaw2-redhat (merged: 2026-01-21 22:32 (UTC+8))
- #32679 [Quantization][Deprecation] Remove DeepSpeedFp8 — documentation,ready — by robertgshaw2-redhat (merged: 2026-01-21 22:32 (UTC+8))
- #32760 Support nccl fp8 communication — no labels — by amirkl94 (merged: 2026-01-21 21:31 (UTC+8))
- #32414 [MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority — documentation,performance,rocm,ready,ci/build,v1,llama,qwen,gpt-oss,nvidia — by robertgshaw2-redhat (merged: 2026-01-21 21:22 (UTC+8))
- #32727 [bugfix] Aria model — bug,ready — by divakar-amd (merged: 2026-01-21 21:11 (UTC+8))
- #32456 [Model] Add Eagle2.5-8B Vision-Language Model support — documentation,new-model,speculative-decoding,ready — by George-Polya (merged: 2026-01-21 17:39 (UTC+8))
- #32749 [Bugfix] Force using spawn multiprocess method when it’s the WSL platform — bug,ready — by jasonyanwenl (merged: 2026-01-21 17:35 (UTC+8))
- #32741 [Documentation] Fix typo in docs/design/torch_compile_multimodal.md — documentation — by Lucaskabela (merged: 2026-01-21 15:54 (UTC+8))
- #32673 [Bugfix] Support HF sharded weights for Mistral3/Pixtral models — bug,ready — by ricky-chaoju (merged: 2026-01-21 15:27 (UTC+8))
- #32582 [Docs] Fix GitHub handle in governance process — documentation,ready — by pacoxu (merged: 2026-01-21 15:07 (UTC+8))
- #32491 [FlashMLA] Update FlashMLA — ready,ci/build,v1,ready-run-all-tests — by LucasWilkinson (merged: 2026-01-21 13:03 (UTC+8))
- #32682 [Bugfix] Fix Nemotron-Nano-v2-vlm static resolution — bug,ready — by netanel-haber (merged: 2026-01-21 14:28 (UTC+8))
- #32048 Added qwen3 vision language moe support for speculative decoding — speculative-decoding,ready,v1,qwen — by shanjiaz (merged: 2026-01-21 11:24 (UTC+8))
- #32542 Enable Eagle3 speculative decoding for Pixtral (LlavaForConditionalGeneration) — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by gopalsarda (merged: 2026-01-21 11:18 (UTC+8))
- #32299 [Bugfix] Fix Granite Vision / Don’t use Siglip Pooling Head Nested Models by Default — bug,ready,multi-modality — by alex-jw-brooks (merged: 2026-01-21 11:11 (UTC+8))
PRs Closed Without Merging
- #32318 [Quantization] Add W4A16 NVFP4 MoE support for CompressedTensors — ci/build — by zhangyimi (closed: 2026-01-22 09:42 (UTC+8))
- #32191 [AOT compilation] cached inductor artifacts benchmark #32043 — performance — by dolpm (closed: 2026-01-22 06:37 (UTC+8))
- #32689 [Quantization][Deprecation] Remove ExpertsInt8 — ready,needs-rebase — by robertgshaw2-redhat (closed: 2026-01-22 04:45 (UTC+8))
- #32798 [Feature] Update env for trtllm prefill True by default — no labels — by yewentao256 (closed: 2026-01-22 02:37 (UTC+8))
- #31703 [Misc] Add packed_modules_mapping for MiniMaxM2ForCausalLM — documentation — by jeejeelee (closed: 2026-01-21 23:04 (UTC+8))
- #32776 Deprecated — no labels — by Zhenzhong1 (closed: 2026-01-21 19:28 (UTC+8))
- #32146 [Frontend] Support OpenAI-style tool call IDs in Kimi K2 parser — documentation — by daniel-salib (closed: 2026-01-21 19:22 (UTC+8))
- #32216 [Frontend] Add dedicated KimiK2ReasoningParser for tool call handling — no labels — by daniel-salib (closed: 2026-01-21 19:22 (UTC+8))
- #20422 [Core] Implement sleep/wake_up for SpecDecodeWorker — speculative-decoding,needs-rebase,unstale — by kebe7jun (closed: 2026-01-21 18:36 (UTC+8))
- #25563 [Bugfix][Core] fix KeyError in _update_waiting_for_remote_kv — stale,v1,kv-connector — by kebe7jun (closed: 2026-01-21 14:28 (UTC+8))
- #32750 Gb200 unified memory v0.14.0 — documentation,nvidia — by pst2154 (closed: 2026-01-21 14:20 (UTC+8))
- #32752 test check code — documentation,new-model — by Emilie1001 (closed: 2026-01-21 11:59 (UTC+8))