vLLM Development Activity Report - 2026-01-20
Time window: 2026-01-20 10:55 (UTC+8) ~ 2026-01-21 10:55 (UTC+8)
Stats: new issues 29 | closed issues 19 | new PRs 92 | merged PRs 29 | PRs closed without merging 19
📊 Daily Development Summary
During the 2026-01-20 to 2026-01-21 cycle, the vLLM project was in a highly active iteration phase. Following the v0.14.0 release, the community focused on a batch of regressions (e.g. increased memory use in sleep mode, load failures for specific models) and on performance optimization. At the same time, several significant architecture proposals (unified Parser, separating rendering from inference, PluggableLayer) sparked in-depth discussion, showing sustained investment in inference performance, broader multimodal support, and hardware-ecosystem compatibility (especially the AMD platform).
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was brisk this cycle, concentrated in performance optimization, bug fixes, and CI/CD improvements.
- Closed issues:
  - #31695: A user testing disaggregated (prefill-decode) serving on an AMD MI325X saw extremely poor throughput (1 token/s) with SharedStorageConnector. An AMD engineer (@hongxiayang) stepped in and helped the user switch to the MORIIO KV connector, which resolved the bottleneck; the issue was closed.
  - #32647: Reported an incompatibility between GB300 and the latest deep_gemm library. Contributor @chaunceyjiang (not an -amd-suffixed account, but active on AMD GPU issues) landed a quick fix in PR #32652.
- New / in-progress PRs:
  - #32710 (@AndreasKaratzas): Important fix. Resolves garbled output from hybrid models (Mamba/Jamba) on ROCm caused by non-contiguous tensors being fed into GEMM, restoring correctness on AMD hardware.
  - #32745 (@mawong-amd): Refactors the AMD CI test pipeline to make better use of BuildKite parallelism, aiming to improve test efficiency and reduce resource usage.
  - #32717 (@micah-wil) & #32719 (@Alexei-V-Ivanov-AMD): AMD CI cleanup and expansion. The former removes an obsolete DeepSeek async EPLB test; the latter enables "2 node" distributed tests.
  - #32703 (@tjtanaa): Log hygiene: suppresses irrelevant QuarkOCP_MX log messages on non-ROCm platforms.
  - #32731 (@micah-wil): Lowers the acceptance-length threshold in test_draft_model_quantization to accommodate small performance differences on AMD hardware.
  - #32700, #32694 (@robertgshaw2-redhat): Following the v0.14 deprecation notice, begins removing legacy quantization schemes (PTPC FP8, Petit NVFP4); these changes also affect the ROCm ecosystem.
- Key observations:
  - High AMD team engagement: -amd-suffixed accounts (@mawong-amd) plus @hongxiayang, @tjtanaa, and others were active in CI optimization, performance tuning, and issue triage.
  - Maturing connector ecosystem: for disaggregated-serving performance problems, the official guidance is to move from the debug-oriented SharedStorageConnector to the MORIIO or NIXL connectors, indicating that AMD-tuned high-performance connectors are in place.
  - CI catching up: AMD CI is converging toward parity with the NVIDIA CI, including test parallelization and broader coverage.
💬 High-Traffic Discussions
- Issue #32713: [RFC]: Unified Parser for tokenization, reasoning, tool calling
  - Core proposal: refactor vLLM's scattered input/output parsing logic (tokenization, reasoning, tool calling) into a unified Parser interface, eliminating redundancy and simplifying model integration.
  - Key viewpoints:
    - Proposer (@qandrew): today's multiple code paths (OpenAI Harmony, per-model parsers) are costly to maintain and easy to misconfigure; a unified abstraction adds flexibility and future-proofing.
    - Third-party developer (@wseaton): wants the new design to keep the "plugin-like" ability to inject custom parsers via the CLI, and suggests sandboxing parsers in a subprocess for safety.
    - Model-provider representative (@patrickvonplaten, Mistral): supports clarifying the input/output contract, shares Mistral's experience integrating external parsing logic through a custom tokenizer, and argues the new abstraction should give model providers more freedom.
    - Maintainer (@chaunceyjiang): explicit parser configuration is currently necessary, because tool-calling and reasoning formats can differ even within one model family and are hard to infer automatically from config.json.
  - Point of contention: how to balance smart out-of-the-box defaults against deep customization and experimentation by users and integrators.
  - Status: discussion open; feedback being collected.
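To make the proposal concrete, a unified parser interface might look roughly like the following sketch. All names here are illustrative assumptions, not the RFC's actual API; the toy `<think>` tag handling merely stands in for model-specific reasoning formats:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class ParsedOutput:
    # Hypothetical container splitting raw model text into channels.
    content: str = ""
    reasoning: str = ""
    tool_calls: list = field(default_factory=list)

class Parser(ABC):
    """Illustrative unified parser: one object owns reasoning and
    tool-call extraction instead of separate per-model code paths."""

    @abstractmethod
    def parse(self, text: str) -> ParsedOutput: ...

class TagParser(Parser):
    # Toy example: treat <think>...</think> as the reasoning channel.
    def parse(self, text: str) -> ParsedOutput:
        if "<think>" in text and "</think>" in text:
            pre, rest = text.split("<think>", 1)
            reasoning, post = rest.split("</think>", 1)
            return ParsedOutput(content=(pre + post).strip(),
                                reasoning=reasoning.strip())
        return ParsedOutput(content=text.strip())

out = TagParser().parse("<think>check units</think>The answer is 42.")
```

The open design question maps directly onto this sketch: whether `TagParser`-style defaults get picked automatically per model, or must always be named explicitly in configuration.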
- Issue #32648: [RFC]: Add --disable-inference flag for deployments without GPU
  - Core proposal: add a --disable-inference flag for sidecar deployments that only need the "render API" (converting messages into tokens) and no GPU inference.
  - Key viewpoints:
    - Proposer (@hyeongyun0916): a lightweight, quick solution for disaggregated stacks such as llm-d.
    - Maintainers (@robertgshaw2-redhat, @vMaroon): strongly support a pure render/tokenize service, but oppose bolting a flag onto the existing vllm serve command; they suggest a separate, CPU-oriented entry point (e.g. vllm render) to build a minimal-dependency, dedicated tokenization service.
  - Consensus: the community agrees a standalone render service is needed; it fits vLLM's long-term "tokens in, tokens out" contract and the move toward a fully disaggregated architecture.
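What such a render-only service computes can be pictured with a toy stand-in. Everything below is illustrative: the `<|role|>` template and whitespace tokenizer are placeholders, and `vllm render` does not exist yet; a real service would apply the model's own chat template and tokenizer:

```python
def render(messages: list[dict]) -> tuple[str, list[int]]:
    """Toy stand-in for a render-only endpoint: turn chat messages into
    a prompt string and a token-id list, with no GPU involved."""
    # Hypothetical chat template: concatenate role-tagged contents.
    prompt = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
    # Hypothetical tokenizer: assign ids to whitespace-split pieces.
    vocab: dict[str, int] = {}
    token_ids = [vocab.setdefault(tok, len(vocab)) for tok in prompt.split()]
    return prompt, token_ids

prompt, ids = render([{"role": "user", "content": "hello world"}])
```

The point of the RFC is that this whole computation is CPU-only, which is why maintainers prefer a separate entry point over a flag on the GPU-serving path.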
- Issue #32637: [Bug]: vLLM fails to start GLM-4.7-Flash
  - Core issue: launching the new GLM-4.7-Flash model fails because the installed transformers library is too old to recognize the model architecture.
  - Discussion: a contributor suggested installing transformers from its main branch, which immediately triggered follow-on problems: a version conflict (vLLM pins transformers<5 while main is 5.0.0.dev0) and a NumPy incompatibility.
  - Underlying problem: highlights the inherent tension between supporting cutting-edge models and keeping dependencies stable; users are left to manage a complex dependency environment by hand, a poor experience.
  - Status: still open; no official one-stop fix yet.
- Issue #32685: [Feature]: Support multi-modal inputs for OpenAI Response API
  - Core request: add multimodal (image, video) input support to the already-supported OpenAI Responses API.
  - Discussion: the requester raised the need; a maintainer confirmed it is on the roadmap and asked whether they would like to contribute. Another contributor (@chaunceyjiang) then pointed out that PR #20975 already supports image input, with video support coming soon.
  - Takeaway: a textbook case of a request being filed, immediately claimed, and nearly resolved, showing the community's quick turnaround on API compatibility.
🔥 Hot Topics and Trends
- Deepening multimodal support:
  - Feature work: actively extending multimodal input support in the Responses API (#32685).
  - Architecture: refactoring multimodal model initialization via a "context manager" pattern (tracked in #32631) to cleanly support encoder-only and language-model-only modes, paving the way for disaggregated serving.
  - Bug fixes: addressing low-level stability issues in video understanding, including an encoder-cache hang (#32680, #32684) and an illegal-memory-access (IMA) problem (#32687).
- Disaggregated serving (P-D/EPD) optimization:
  - RFC wave: several RFCs target pain points of this architecture, e.g. bidirectional KV-cache transfer between prefill and decode nodes to speed up multi-turn conversations (#32733), and a tracker for full encode-prefill-decode disaggregation (#32659).
  - Fixes: request-ID mismatches (#32630) and stale weight caching (#32718) exposed in real deployments.
- Tool-calling and reasoning parser refactor:
  - A unified Parser abstraction (#32713) became the central topic, aiming to simplify increasingly complex model-output handling (tool calls, reasoning chains) and lower integration and maintenance costs.
- Performance regressions and optimization:
  - Regression triage: v0.14.0 introduced increased memory usage in sleep mode (#32714) and attention-backend failures for certain models on Blackwell GPUs (#32732), drawing significant attention.
  - Proactive optimization: ongoing kernel-level work on MoE models (tracked in #31755), MLA attention (#32734), and fused LoRA (#32655, #32711).
- Model support and compatibility frontline:
  - The community is busy adapting the latest models, including GLM-4.7 (#32637), Nemotron-3-Nano (#32353), Nemotron-Parse FP8 (#32743), and MusicFlamingo (#32696), handling the challenges of rapid model iteration and dependence on new transformers features.
🛠️ Key Technical Changes
- PR #32331 / #32725 / #32744: PluggableLayer introduced, then reverted:
  - What: introduces a new PluggableLayer abstraction designed to work alongside the existing CustomOp, giving weight-owning high-level model components (such as MLA attention and FusedMoE) a cleaner plugin mechanism. After merging it broke CI and was temporarily reverted (#32725); a fixed version followed immediately (#32744).
  - Why it matters: an important step in vLLM's kernel-architecture evolution, establishing a clearer layering (PluggableLayer owns weights; CustomOp/vLLM IR owns compute logic) to cope with increasingly diverse hardware and model customization needs.
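The layering described above follows a familiar plugin-registry pattern. A minimal sketch, with all names hypothetical rather than vLLM's actual API:

```python
class PluggableLayerBase:
    """Hypothetical sketch of a pluggable, weight-owning layer.
    Out-of-tree backends register an implementation per layer name;
    compute stays elsewhere (CustomOp / vLLM IR in the real design)."""
    _registry: dict[str, type] = {}

    @classmethod
    def register(cls, name: str):
        # Decorator that records an implementation under a stable name.
        def deco(impl: type) -> type:
            cls._registry[name] = impl
            return impl
        return deco

    @classmethod
    def create(cls, name: str, **kwargs):
        # Instantiate whichever implementation is currently registered.
        return cls._registry[name](**kwargs)

@PluggableLayerBase.register("fused_moe")
class ReferenceFusedMoE(PluggableLayerBase):
    def __init__(self, num_experts: int):
        self.num_experts = num_experts  # placeholder for real weights

layer = PluggableLayerBase.create("fused_moe", num_experts=8)
```

A hardware vendor could then register its own `"fused_moe"` class without patching model code, which is the kind of substitution the abstraction is meant to enable.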
- PR #32750: GB200 unified memory v0.14.0:
  - What: implements a fast path for sleep mode on NVIDIA GB200 (Grace-Blackwell) unified-memory systems. Using cudaMemAdvise and cudaMemPrefetchAsync instead of physical copies makes sleep and wake-up hundreds of times faster.
  - Why it matters: a major efficiency gain for model sleep/wake on GB200 systems, a model case of exploiting new-generation hardware features, and a reference for supporting similar architectures in the future.
- Issue #32713 & PR #32712: Unified Parser RFC:
  - What: proposes unifying tokenization, reasoning parsing, and tool-calling parsing under one abstraction, addressing today's multi-path, configuration-heavy setup.
  - Why it matters: a major refactor proposal for vLLM's "frontend" processing logic, intended to improve maintainability, lower the bar for new-model integration, and interoperate with external parsing logic (such as mistral-common). An experimental PR (#32712) has been opened.
- Tracker #32631: migrating multimodal components to context managers:
  - What: through a series of PRs (#32632, #32641, #32650, #32663, #32695), wraps the vision-encoder and language-model initialization code of most multimodal models in context managers.
  - Why it matters: provides uniform, minimal infrastructure for enabling or disabling multimodal components via configuration (--mm-encoder-only), key groundwork for the encode-prefill-decode (EPD) disaggregated architecture.
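The pattern amounts to a context manager that tells constructors which submodules to build. A minimal sketch under assumed names, not vLLM's actual implementation:

```python
import contextlib
import contextvars

# Context variable naming the components the current construction builds.
_init_mode = contextvars.ContextVar("init_mode", default={"encoder", "lm"})

@contextlib.contextmanager
def init_components(*components: str):
    """Limit which submodules get constructed inside the with-block."""
    token = _init_mode.set(set(components))
    try:
        yield
    finally:
        _init_mode.reset(token)

class ToyMultimodalModel:
    def __init__(self):
        mode = _init_mode.get()
        # Only build the parts requested by the surrounding context.
        self.encoder = object() if "encoder" in mode else None
        self.lm = object() if "lm" in mode else None

with init_components("encoder"):
    encoder_only = ToyMultimodalModel()  # encoder built, LM skipped
full = ToyMultimodalModel()              # outside the block: build both
```

Because the switch lives in ambient context rather than constructor arguments, each model's `__init__` stays unchanged, which is what makes migrating many models via a PR series tractable.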
📈 Development Activity Observations
- High contribution volume: 92 new PRs and 29 merges in a single day indicate a very active contribution and merge pipeline, with participation from the core maintainer team, engineers at companies such as AMD, NVIDIA, and Meta, and independent developers.
- Deep AMD involvement: AMD engineers (@mawong-amd, @tjtanaa, @hongxiayang, and others) were especially active this cycle, not only fixing platform-specific bugs but also driving CI/CD improvements, reflecting long-term investment in, and growing maturity of, the ROCm ecosystem within vLLM.
- Review and merge cadence: the quick revert-and-fix cycle on large refactor PRs (such as PluggableLayer) shows the project pursues progress while staying highly sensitive to CI stability and code quality, with disciplined maintenance.
💡 Issues to Watch
- Unified parser design decisions: the discussion in issue #32713 will deeply shape vLLM's future model-integration paradigm; the community needs consensus on balancing "works out of the box" against deep customization.
- Disaggregated-serving optimization is pressing: multiple RFCs (#32733, #32648, #32659) and bug reports show disaggregated serving still faces redundant work and complexity in production; related optimization is a near-term development focus.
- Windows support explicitly out of scope: issue #32738 was quickly closed with "we dont support windows", making clear vLLM has no current plans to support Windows; affected users should seek alternatives.
- Cutting-edge model support vs. dependency management: as the GLM-4.7 startup failure shows, smoothly supporting new models that depend on bleeding-edge transformers features, without breaking most users' stable environments, remains an ongoing management challenge.
- Fast response to performance regressions: the regressions that surfaced after v0.14.0 (sleep mode, Blackwell compatibility) need priority handling from the core team to safeguard release stability.
📋 Appendix: Detailed Data
New Issues
- #32685 [Feature]: Support multi-modal inputs for OpenAI Response API — feature request — by FlynnOwen (created: 2026-01-20 23:26 (UTC+8))
- #32713 [RFC]: Unified Parser for tokenization, reasoning, tool calling — RFC — by qandrew (created: 2026-01-21 03:00 (UTC+8))
- #32732 [Bug]: Regression in v0.14.0: "No valid attention backend found" for nvidia/DeepSeek-R1-0528-NVFP4 on RTX Pro 6000 (Blackwell) — bug — by heiderich (created: 2026-01-21 06:26 (UTC+8))
- #32738 [Bug]: Cannot find vllm_C, is not compiled on windows — bug — by MohamedLahmeri01 (created: 2026-01-21 07:39 (UTC+8))
- #32736 [Feature]: support lora for minimax m2.1 — feature request — by tic-top (created: 2026-01-21 07:08 (UTC+8))
- #32733 [RFC]: [P/D] Prefill compute optimizations with bi-directional KV cache transfers between P and D nodes — RFC — by snadampal (created: 2026-01-21 06:27 (UTC+8))
- #32730 [Bug]: CPU tests hit CUDA path when VLLM_TARGET_DEVICE=cpu — bug — by wjhrdy (created: 2026-01-21 06:18 (UTC+8))
- #32637 [Bug]: vLLM fails to start GLM-4.7-Flash — bug — by TZJ12 (created: 2026-01-20 13:57 (UTC+8))
- #32720 [Bug]: More robust 32 bit indexing — bug — by laithsakka (created: 2026-01-21 04:42 (UTC+8))
- #32648 [RFC]: Add --disable-inference flag for deployments without GPU — RFC — by hyeongyun0916 (created: 2026-01-20 15:44 (UTC+8))
- #32718 [Bug]: `reload_weights` and `_get_weights_iterator` return cached/stale weights instead of re-reading from disk — bug — by RobotSail (created: 2026-01-21 04:21 (UTC+8))
- #32714 [Bug]: Sleep is broken in 0.14.0 — bug — by rstanislav (created: 2026-01-21 03:07 (UTC+8))
- #32658 [Feature]: batch invariance for A100 — feature request — by Weili17 (created: 2026-01-20 17:34 (UTC+8))
- #32701 [Feature]: Async Scheduling + Pipeline Parallel Support — feature request — by yewentao256 (created: 2026-01-21 00:49 (UTC+8))
- #32665 [Bug]: [DeepSeek-V3.2] PD reports `NotImplementedError` — bug — by kebe7jun (created: 2026-01-20 20:21 (UTC+8))
- #32631 [Tracker]: Initialize MM components in context managers — multi-modality — by DarkLight1337 (created: 2026-01-20 11:59 (UTC+8))
- #32656 [Tracker]: Use `mm_features` for M-RoPE calculation for all models — help wanted,multi-modality — by DarkLight1337 (created: 2026-01-20 17:10 (UTC+8))
- #32680 [Bug]: hanging during long video understanding — bug — by JJJYmmm (created: 2026-01-20 23:02 (UTC+8))
- #32676 [Tracker]: Apply PluggableLayer and vLLM IR to replace current CustomOp — feature request — by whx-sjtu (created: 2026-01-20 22:04 (UTC+8))
- #32674 [Feature][P1]: Add OCI Annotations to container images — feature request — by Harry-Chen (created: 2026-01-20 21:51 (UTC+8))
- #32670 [Feature]: Can the 0.14.0 release ship a wheel supporting Python 3.12? — feature request — by dengdeng-cat (created: 2026-01-20 21:14 (UTC+8))
- #32666 [Feature]: Save the start time of the benchmark request — feature request — by kebe7jun (created: 2026-01-20 20:47 (UTC+8))
- #32647 [Bug]: GB300 is not compatible with the latest version of deep_gemm. — bug — by chaunceyjiang (created: 2026-01-20 15:32 (UTC+8))
- #32659 [RFC]: Tracking follow-up progress on Encode-Prefill-Decode Disaggregation — RFC — by fake0fan (created: 2026-01-20 17:37 (UTC+8))
- #32651 [Bug]: v0.13 CPU fails with more than 16 OMP threads — bug — by kot-begemot-uk (created: 2026-01-20 16:28 (UTC+8))
- #32653 [Usage]: CPU offload still results in out of VRAM error (Unsloth's DeepSeek R1) — usage — by Midori-Cell (created: 2026-01-20 16:51 (UTC+8))
- #32636 [Bug]: Invalid base64-encoded string for audio input — bug — by IamMegatron2025 (created: 2026-01-20 13:39 (UTC+8))
- #32638 [Bug]: Multiple tool_calls parsed correctly by hermes_tool_parser, but fail in serving_chat.py with JSONDecodeError — bug — by caoxu915683474 (created: 2026-01-20 13:58 (UTC+8))
- #32643 [Bug] FlashInfer >=0.6.0 TypeError: non_blocking must be bool during CUDA graph capture — no labels — by amanwalksdownthestreet (created: 2026-01-20 14:35 (UTC+8))
Closed Issues
- #27256 [Bug]: Qwen3-Embedding-0.6B gives inconsistent numerical results across three request methods — bug,stale — by DankoZhang (closed: 2026-01-21 10:23 (UTC+8))
- #15338 [Bug]: Expected there to be 4 prompt updates corresponding to 4 image items, but instead found 3 prompt updates! Either the prompt text has missing/incorrect tokens for multi-modal inputs — bug,stale — by moshilangzi (closed: 2026-01-21 10:17 (UTC+8))
- #24320 [Feature]: Batch interface request reordering to boost APC hits — feature request,stale — by kexinoh (closed: 2026-01-21 10:16 (UTC+8))
- #24506 [Feature]: Add speculative decoding with draft model pruning — feature request,stale — by jmamou (closed: 2026-01-21 10:16 (UTC+8))
- #24755 [Installation]: VLLM dependency Issue — installation,stale — by NishTewari (closed: 2026-01-21 10:16 (UTC+8))
- #25361 [Installation]: H20 vllm==0.10.2 for Qwen3-Next-80B, RuntimeError: Worker failed with error "'staticmethod' object is not callable", please check the stack trace above for the root cause — installation,stale — by XTLYC (closed: 2026-01-21 10:15 (UTC+8))
- #25373 [Feature]: Optimize Tokenization and First Token Generation Redundancy Between Prefill and Decode Stages in Disaggregated Prefill — feature request,stale — by Zaragoto (closed: 2026-01-21 10:15 (UTC+8))
- #25387 [Bug] Nightly crash with Qwen3-14B-FP8-dynamic: CUDA illegal memory access (EngineDeadError) — bug,stale — by celsowm (closed: 2026-01-21 10:15 (UTC+8))
- #25399 [Performance]: Load KV with partial block — performance,stale — by Zhathw (closed: 2026-01-21 10:15 (UTC+8))
- #25426 [Bug]: NotImplementedError: Auto tool choice is not supported yet unless using Harmony — bug,stale — by Anna4242 (closed: 2026-01-21 10:15 (UTC+8))
- #32738 [Bug]: Cannot find vllm_C, is not compiled on windows — bug — by MohamedLahmeri01 (closed: 2026-01-21 09:25 (UTC+8))
- #31755 [Feature]: Optimizations for MOE models (GLM4.7, DeepSeek series) — feature request — by yewentao256 (closed: 2026-01-21 04:41 (UTC+8))
- #31695 [Performance]: Poor performance (1 tokens/s) in Disaggregated Serving on AMD MI325X — performance,rocm — by jane-jhu (closed: 2026-01-21 04:19 (UTC+8))
- #31389 [Bug]: Byte fallback is not properly handled when using outlines — bug — by Alnusjaponica (closed: 2026-01-21 03:48 (UTC+8))
- #23786 [RFC]: Design a new Layer-Pluggable abstraction to work together with CustomOp — RFC,keep-open — by whx-sjtu (closed: 2026-01-21 00:23 (UTC+8))
- #32452 [Bug]: `is_strictly_contiguous` assertion fails in FlashInfer TRTLLM decode path on Blackwell for Scout — bug — by luccafong (closed: 2026-01-21 00:08 (UTC+8))
- #32353 [Bug]: Nemotron-3-Nano is broken when using TRTLLM attention on Blackwell — bug — by mgoin (closed: 2026-01-20 21:53 (UTC+8))
- #32647 [Bug]: GB300 is not compatible with the latest version of deep_gemm. — bug — by chaunceyjiang (closed: 2026-01-20 18:48 (UTC+8))
- #29765 [Bug]: Server launch fails on vLLM 0.11.1 / 0.11.2 when loading AWQ or GPTQ MoE models — bug — by aaarkai (closed: 2026-01-20 16:31 (UTC+8))
New PRs
- #32750 Gb200 unified memory v0.14.0 — documentation,nvidia — by pst2154 (created: 2026-01-21 10:47 (UTC+8))
- #32748 [Misc] Add index url for torch==2.9.1+cpu — ready,ci/build,cpu — by lk-chen (created: 2026-01-21 10:39 (UTC+8))
- #32749 [Bugfix] Force using spawn multiprocess method when it's the WSL platform — bug — by jasonyanwenl (created: 2026-01-21 10:40 (UTC+8))
- #32747 [draft][compile][graph_partition] Add tensor size handling — no labels — by fxdawnn (created: 2026-01-21 10:16 (UTC+8))
- #32746 [Misc] Replace urllib's `urlparse` with urllib3's `parse_url` — multi-modality — by Isotr0py (created: 2026-01-21 10:09 (UTC+8))
- #32661 [Metrics] Complete removal of deprecated vllm:time_per_output_token_seconds metric — documentation,ready,v1 — by carlory (created: 2026-01-20 18:34 (UTC+8))
- #32710 [ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) — bug,rocm,ci/build — by AndreasKaratzas (created: 2026-01-21 02:23 (UTC+8))
- #32684 [Bugfix] fix encoder cache hang in Qwen3VL — bug,ready,qwen — by JJJYmmm (created: 2026-01-20 23:25 (UTC+8))
- #32673 [Bugfix] Support HF sharded weights for Mistral3/Pixtral models — bug — by ricky-chaoju (created: 2026-01-20 21:46 (UTC+8))
- #32745 [Hardware][AMD][CI] Refactor AMD tests to properly use BuildKite parallelism — rocm,ci/build — by mawong-amd (created: 2026-01-21 09:36 (UTC+8))
- #32687 [Bugfix] fix the ima issue of qwen-vit — bug,ready,qwen — by JJJYmmm (created: 2026-01-20 23:28 (UTC+8))
- #32744 [PluggableLayer][1/N] Define PluggableLayer (Fix ci) — documentation — by whx-sjtu (created: 2026-01-21 09:28 (UTC+8))
- #32739 DPMetadata raises assert error for dense model — meta-exported,fb-exported — by River12 (created: 2026-01-21 08:01 (UTC+8))
- #32743 Fix issues when run Nemotron Parse FP8 checkpoint — needs-rebase,v1,nvidia — by Edwardf0t1 (created: 2026-01-21 09:28 (UTC+8))
- #32742 [Model Runner V2] Add KV Connector support — v1 — by njhill (created: 2026-01-21 08:50 (UTC+8))
- #32707 [Misc] Omit "disable NCCL for DP sync" startup log when not applicable — ready — by njhill (created: 2026-01-21 02:05 (UTC+8))
- #32740 [Kernel] [Helion] Add Helion ConfigManager — no labels — by gmagogsfm (created: 2026-01-21 08:09 (UTC+8))
- #32667 [Performance] add start_times field to vllm bench serve json result — performance — by kebe7jun (created: 2026-01-20 20:59 (UTC+8))
- #32741 [Documentation] Fix typo in `docs/design/torch_compile_multimodal.md` — documentation — by Lucaskabela (created: 2026-01-21 08:20 (UTC+8))
- #32725 Revert "[PluggableLayer][1/N] Define PluggableLayer" — documentation,ready — by robertgshaw2-redhat (created: 2026-01-21 05:36 (UTC+8))
- #32728 Fix quantized Falcon-H1 model loading issues — no labels — by shengliangxu (created: 2026-01-21 05:59 (UTC+8))
- #32731 [ROCm][CI] Lower Acceptance Len Threshold For test_draft_model_quantization — rocm,v1 — by micah-wil (created: 2026-01-21 06:25 (UTC+8))
- #32737 Fix dtype detection for local model paths — no labels — by kamalesh0406 (created: 2026-01-21 07:25 (UTC+8))
- #32729 [responsesAPI] support interleaved reasoning via arbitrary type of message outputs — documentation,frontend,meta-exported,fb-exported — by qandrew (created: 2026-01-21 06:17 (UTC+8))
- #32690 [Doc] Add contribution guide for MoE/attention/quantization backends — documentation — by clocksmith (created: 2026-01-21 00:07 (UTC+8))
- #32655 Add unpermute-aware fused MoE LoRA path — performance — by RunkaiTao (created: 2026-01-20 16:55 (UTC+8))
- #32735 Factor out masked-m format silu_mul_quant and use it in BatchedTritonExperts — performance — by tlrmchlsmth (created: 2026-01-21 06:47 (UTC+8))
- #32734 [Core] [MLA] Optimize the copy for k_nope, k_pe and v — no labels — by pavanimajety (created: 2026-01-21 06:27 (UTC+8))
- #32717 [ROCm][CI] Remove DS async eplb accuracy test from AMD CI — rocm,ready,ci/build — by micah-wil (created: 2026-01-21 04:08 (UTC+8))
- #32727 [bugfix] Aria model — bug — by divakar-amd (created: 2026-01-21 05:58 (UTC+8))
- #32726 Bugfix: zero-initialized metrics for failed/aborted/skipped requests throwing off histograms — bug,v1 — by wseaton (created: 2026-01-21 05:49 (UTC+8))
- #32698 [Misc] Add `get_name` to missing AttentionBackends — v1 — by NickLucche (created: 2026-01-21 00:36 (UTC+8))
- #32719 Enabling "2 node" distributed tests in the AMD CI pipeline. — rocm,ci/build — by Alexei-V-Ivanov-AMD (created: 2026-01-21 04:37 (UTC+8))
- #32703 [Bugfix] Suppress log on non-ROCm platform — bug,rocm,ready — by tjtanaa (created: 2026-01-21 01:07 (UTC+8))
- #32724 Revert "[AOT compilation] support torch.compile inductor artifacts in VllmCompiledFunction" — no labels — by robertgshaw2-redhat (created: 2026-01-21 05:22 (UTC+8))
- #32723 [Benchmark] Don't default to `temperature==0` in `vllm bench serve` — performance — by njhill (created: 2026-01-21 05:14 (UTC+8))
- #32722 [CI] Fix mypy for `vllm/v1/structured_output` — rocm,structured-output,ready,v1 — by yewentao256 (created: 2026-01-21 05:01 (UTC+8))
- #32692 [Refactor] Clean up unused variables & func — rocm,frontend,ready — by yewentao256 (created: 2026-01-21 00:08 (UTC+8))
- #32721 [WIP][CI][build] Move torch deps into requirements/torch.txt and torchlib.txt — rocm,ci/build,cpu,nvidia — by orionr (created: 2026-01-21 04:55 (UTC+8))
- #32711 [LoRA] Support add delta inplace for fused_moe_lora_kernel — no labels — by xyang16 (created: 2026-01-21 02:31 (UTC+8))
- #32645 [BugFix] Fix AssertionError: TRTLLM decode requires uniform query lengths per request. — bug,ready,v1 — by LucasWilkinson (created: 2026-01-20 14:59 (UTC+8))
- #32716 Classify new requests as prefills regardless of query length — v1 — by Josephasafg (created: 2026-01-21 04:07 (UTC+8))
- #32708 [Misc] Don't color stdout/err prefix for non-tty — no labels — by njhill (created: 2026-01-21 02:06 (UTC+8))
- #32697 [Quantization][Deprecation] Remove RTN — no labels — by robertgshaw2-redhat (created: 2026-01-21 00:32 (UTC+8))
- #32706 [Cleanup] Move scheduler `get_routed_experts` logic to separate method — ready,v1 — by njhill (created: 2026-01-21 02:04 (UTC+8))
- #32715 PoC: SIG ownership marker — ci/build — by dougbtv (created: 2026-01-21 03:39 (UTC+8))
- #32705 Fix MoE Model DP+TP with NaiveAll2AllManger Bug — bug,meta-exported,fb-exported — by River12 (created: 2026-01-21 01:49 (UTC+8))
- #32695 [5/N] Initialize MM components in context managers (Q-Z) — ready,qwen — by DarkLight1337 (created: 2026-01-21 00:27 (UTC+8))
- #32686 Fix accessing hidden_act from model config — no labels — by grzegorz-k-karch (created: 2026-01-20 23:28 (UTC+8))
- #32712 [WIP] Parser — frontend — by qandrew (created: 2026-01-21 02:49 (UTC+8))
- #32691 [Doc] Update docs for MM model development with context usage — documentation,ready,kv-connector — by DarkLight1337 (created: 2026-01-21 00:07 (UTC+8))
- #32678 [Model] Qwen3-Next Splitting GDN attention calculation in mixed batches of Prefill and Decode — qwen — by xyDong0223 (created: 2026-01-20 22:25 (UTC+8))
- #32709 [Model Runner V2] Support FLASHINFER_MLA backend — v1 — by WoosukKwon (created: 2026-01-21 02:21 (UTC+8))
- #32700 [Quantization][Deprecation] Remove PTPC FP8 — rocm,ready,ci/build — by robertgshaw2-redhat (created: 2026-01-21 00:49 (UTC+8))
- #32699 [Quantization][Deprecation] Remove FPQuant — no labels — by robertgshaw2-redhat (created: 2026-01-21 00:41 (UTC+8))
- #32704 [Bugfix] Auto-configure TRITON_PTXAS_PATH for new GPU architectures — bug — by danielostrow (created: 2026-01-21 01:11 (UTC+8))
- #32696 [Model][Multimodal] Add explicit MusicFlamingo adapter — new-model — by WangHaoyuuu (created: 2026-01-21 00:27 (UTC+8))
- #32702 [Bugfix] Fix Quant Type Descriptor for Weights — bug,rocm,ready — by tjtanaa (created: 2026-01-21 00:56 (UTC+8))
- #32630 [Bugfix] Resolve request_id mismatch and prevent crashes in Disaggregated Serving — bug,documentation,v1,kv-connector — by jane-jhu (created: 2026-01-20 11:09 (UTC+8))
- #32694 [Quantization][Deprecation] Remove Petit NVFP4 — rocm,ready,ci/build — by robertgshaw2-redhat (created: 2026-01-21 00:22 (UTC+8))
- #32688 [Quantization][Deprecation] Remove Marlin 24 — performance,ready,ci/build,nvidia — by robertgshaw2-redhat (created: 2026-01-20 23:35 (UTC+8))
- #32693 Add trinity tool parser — tool-calling — by lckr (created: 2026-01-21 00:13 (UTC+8))
- #32689 [Quantization][Deprecation] Remove ExpertsInt8 — ready — by robertgshaw2-redhat (created: 2026-01-20 23:50 (UTC+8))
- #32682 [Bugfix] Fix Nemotron-Nano-v2-vlm static resolution — bug,ready — by netanel-haber (created: 2026-01-20 23:10 (UTC+8))
- #32660 [Doc] Update outdated link to Ray documentation — documentation,ready — by graftim (created: 2026-01-20 18:22 (UTC+8))
- #32679 [Quantization][Deprecation] Remove `DeepSpeedFp8` — documentation,ready — by robertgshaw2-redhat (created: 2026-01-20 23:01 (UTC+8))
- #32681 [Quantization][Deprecation] Deprecate HQQ — ready — by robertgshaw2-redhat (created: 2026-01-20 23:09 (UTC+8))
- #32683 [Quantization][Deprecation] Remove BitBlas — documentation,performance,ready — by robertgshaw2-redhat (created: 2026-01-20 23:22 (UTC+8))
- #32649 [ROCm][Deepseekv3.2][Perf] dsv3.2 further optimization on vllm — rocm,needs-rebase,v1,deepseek — by ganyi1996ppo (created: 2026-01-20 15:50 (UTC+8))
- #32675 [Refactor] Make Int8ScaledMMLinearLayerConfig to use QuantKey — no labels — by andikarachman (created: 2026-01-20 22:00 (UTC+8))
- #32654 [XPU] Support AgRsAll2AllManager on XPU device — ready — by ys950902 (created: 2026-01-20 16:52 (UTC+8))
- #32677 [BugFix] Support DP/EP in AG/RS for FLASHINFER_CUTLASS FP8 — bug,nvidia — by amirkl94 (created: 2026-01-20 22:09 (UTC+8))
- #32671 Adding the Triton 3D unified attention kernel for speculative decoding workloads — v1 — by xaguilar-amd (created: 2026-01-20 21:39 (UTC+8))
- #32663 [4/N] Initialize MM components in context managers (M-P) — ready,qwen — by DarkLight1337 (created: 2026-01-20 19:26 (UTC+8))
- #32668 [Misc] Bump opencv-python dependency version to 4.13 — ready,ci/build — by Isotr0py (created: 2026-01-20 21:01 (UTC+8))
- #32672 [Refactor] Make Int8ScaledMMLinearLayerConfig to use QuantKey — v1,multi-modality,cpu — by andikarachman (created: 2026-01-20 21:40 (UTC+8))
- #32657 Introduce InferenceProfile as execution-intent metadata — v1 — by santhanuss (created: 2026-01-20 17:33 (UTC+8))
- #32669 Bugfix: Pass router logits dtype in nemotron shared experts — bug — by amirkl94 (created: 2026-01-20 21:04 (UTC+8))
- #32664 Release 0.11.0 — documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1 — by xjf-303 (created: 2026-01-20 19:36 (UTC+8))
- #32662 feat(cpu): add CPU support for draft model speculative decoding — speculative-decoding,v1,cpu — by ganeshr10 (created: 2026-01-20 19:07 (UTC+8))
- #32652 [Bugfix] Fix the fp8_mqa_logits dim mismatch — bug,ready,deepseek — by chaunceyjiang (created: 2026-01-20 16:31 (UTC+8))
- #32650 [3/N] Initialize MM components in context managers (I-L) — ready — by DarkLight1337 (created: 2026-01-20 16:01 (UTC+8))
- #32641 [2/N] Initialize MM components in context managers (E-H) — ready — by DarkLight1337 (created: 2026-01-20 14:29 (UTC+8))
- #32646 [Bugfix] Fix E2E latency calculation and add warmup support in mm_processor benchmark — bug,performance — by HirokenOvo (created: 2026-01-20 15:05 (UTC+8))
- #32644 [Feature] Add GPU UUID support in CUDA_VISIBLE_DEVICES — nvidia — by adityakamat24 (created: 2026-01-20 14:49 (UTC+8))
- #32642 Feat: add sampling (min_tokens,…) support for speculative decoding — v1 — by qianlihuang (created: 2026-01-20 14:34 (UTC+8))
- #32640 [ROCm] [CI] [Release] Update the docker image annotation — rocm,ci/build — by tjtanaa (created: 2026-01-20 14:27 (UTC+8))
- #32639 [Bugfix]: Enable Convolution layer custom ops by default — bug — by Isotr0py (created: 2026-01-20 14:23 (UTC+8))
- #32634 [Model Runner V2] Skip kernel launch for penalties & logit_bias — v1 — by WoosukKwon (created: 2026-01-20 12:29 (UTC+8))
- #32632 [1/N] Initialize MM components in context managers (A-D) — ready,deepseek — by DarkLight1337 (created: 2026-01-20 12:02 (UTC+8))
- #32635 Prefill1 — new-model,v1,qwen — by TYTTYTTYT (created: 2026-01-20 13:16 (UTC+8))
- #32633 Fix gpt-oss Harmony token leaks in tool names and streaming content #32587 — frontend,gpt-oss — by baonudesifeizhai (created: 2026-01-20 12:19 (UTC+8))
Merged PRs
- #32661 [Metrics] Complete removal of deprecated vllm:time_per_output_token_seconds metric — documentation,ready,v1 — by carlory (merged: 2026-01-20 20:28 (UTC+8))
- #32687 [Bugfix] fix the ima issue of qwen-vit — bug,ready,qwen — by JJJYmmm (merged: 2026-01-21 01:21 (UTC+8))
- #32331 [PluggableLayer][1/N] Define PluggableLayer — documentation,ready — by whx-sjtu (merged: 2026-01-21 00:19 (UTC+8))
- #29087 OffloadingConnector: Prevent redundant loads — ready,v1,kv-connector — by orozery (merged: 2026-01-21 09:15 (UTC+8))
- #32725 Revert "[PluggableLayer][1/N] Define PluggableLayer" — documentation,ready — by robertgshaw2-redhat (merged: 2026-01-21 08:21 (UTC+8))
- #32189 fp8 online quant: split out Fp8OnlineLinearMethod — ready,quantization — by vkuzo (merged: 2026-01-21 07:13 (UTC+8))
- #32717 [ROCm][CI] Remove DS async eplb accuracy test from AMD CI — rocm,ready,ci/build — by micah-wil (merged: 2026-01-21 05:40 (UTC+8))
- #32703 [Bugfix] Suppress log on non-ROCm platform — bug,rocm,ready — by tjtanaa (merged: 2026-01-21 05:38 (UTC+8))
- #30143 [Misc] Remove pad_for_cudagraphs from config — speculative-decoding,ready,v1,nvidia,ready-run-all-tests — by LucasWilkinson (merged: 2026-01-21 04:05 (UTC+8))
- #31391 [Bugfix] Fix byte fallback handling when using outlines — bug,documentation,structured-output,ready,v1 — by Alnusjaponica (merged: 2026-01-21 03:48 (UTC+8))
- #25205 [AOT compilation] support torch.compile inductor artifacts in VllmCompiledFunction — performance,rocm,ready,ci/build,v1,nvidia — by dolpm (merged: 2026-01-21 03:45 (UTC+8))
- #32695 [5/N] Initialize MM components in context managers (Q-Z) — ready,qwen — by DarkLight1337 (merged: 2026-01-21 03:10 (UTC+8))
- #32030 Test: added acceptance length tests — speculative-decoding,ready,v1 — by rahul-tuli (merged: 2026-01-21 02:55 (UTC+8))
- #32691 [Doc] Update docs for MM model development with context usage — documentation,ready,kv-connector — by DarkLight1337 (merged: 2026-01-21 02:37 (UTC+8))
- #32709 [Model Runner V2] Support FLASHINFER_MLA backend — v1 — by WoosukKwon (merged: 2026-01-21 02:26 (UTC+8))
- #32580 [Doc] [ROCm] Update ROCm getting started doc — documentation,rocm,ready — by tjtanaa (merged: 2026-01-21 01:20 (UTC+8))
- #32273 [Perf] Only clone when needed for `moe_permute` — ready — by yewentao256 (merged: 2026-01-21 00:34 (UTC+8))
- #32603 [Bugfix] Fix Off-by-one error in _num_tokens_to_min_blocks calculation — bug,ready — by lingebeng (merged: 2026-01-21 00:13 (UTC+8))
- #32654 [XPU] Support AgRsAll2AllManager on XPU device — ready — by ys950902 (merged: 2026-01-20 22:27 (UTC+8))
- #32663 [4/N] Initialize MM components in context managers (M-P) — ready,qwen — by DarkLight1337 (merged: 2026-01-20 22:06 (UTC+8))
- #32652 [Bugfix] Fix the fp8_mqa_logits dim mismatch — bug,ready,deepseek — by chaunceyjiang (merged: 2026-01-20 18:48 (UTC+8))
- #32650 [3/N] Initialize MM components in context managers (I-L) — ready — by DarkLight1337 (merged: 2026-01-20 18:21 (UTC+8))
- #32429 [Core] Cleanup shm based object store on engine shutdown — ready,v1,multi-modality — by walterbm (merged: 2026-01-20 16:53 (UTC+8))
- #32641 [2/N] Initialize MM components in context managers (E-H) — ready — by DarkLight1337 (merged: 2026-01-20 16:12 (UTC+8))
- #27814 [Refactor] Make FP8 Linear Ops use kernel abstraction — rocm,ready,ci/build,cpu,nvidia,ready-run-all-tests — by vllmellm (merged: 2026-01-20 14:48 (UTC+8))
- #32634 [Model Runner V2] Skip kernel launch for penalties & logit_bias — v1 — by WoosukKwon (merged: 2026-01-20 14:20 (UTC+8))
- #32632 [1/N] Initialize MM components in context managers (A-D) — ready,deepseek — by DarkLight1337 (merged: 2026-01-20 14:12 (UTC+8))
- #32605 [Model] Use context managers for encoder- and LM-only mode — documentation,ready,v1,llama,qwen,kv-connector — by DarkLight1337 (merged: 2026-01-20 11:43 (UTC+8))
- #32629 [Model Runner V2] Decouple temperature from penalties — v1 — by WoosukKwon (merged: 2026-01-20 11:13 (UTC+8))
PRs Closed Without Merging
- #24568 [Feature] Add predicted outputs and num of prediction tokens — documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,stale,v1,multi-modality — by LightningPan (closed: 2026-01-21 10:16 (UTC+8))
- #25411 [Qwen] Add MergedColumnParallelLinear to improve bf16 inference speed for Qwen3-Next — stale,qwen — by lkm2835 (closed: 2026-01-21 10:15 (UTC+8))
- #31185 [PERF] Use `cutlass_scaled_mm` for Blackwell instead of deep-gemm's blockscale gemm — nvidia — by vadiklyutiy (closed: 2026-01-21 09:22 (UTC+8))
- #32724 Revert "[AOT compilation] support torch.compile inductor artifacts in VllmCompiledFunction" — no labels — by robertgshaw2-redhat (closed: 2026-01-21 05:35 (UTC+8))
- #32708 [Misc] Don't color stdout/err prefix for non-tty — no labels — by njhill (closed: 2026-01-21 04:00 (UTC+8))
- #32699 [Quantization][Deprecation] Remove FPQuant — no labels — by robertgshaw2-redhat (closed: 2026-01-21 01:07 (UTC+8))
- #32693 Add trinity tool parser — tool-calling — by lckr (closed: 2026-01-21 00:14 (UTC+8))
- #32453 [BugFix] Fix is_strictly_contiguous assertion for decode_query in TRT… — bug,needs-rebase,v1,nvidia — by luccafong (closed: 2026-01-21 00:08 (UTC+8))
- #32123 updated — needs-rebase,v1 — by robertgshaw2-redhat (closed: 2026-01-20 22:41 (UTC+8))
- #31413 poc of removing ModularKernelMethod and maybe_init_modular_kernel — needs-rebase — by robertgshaw2-redhat (closed: 2026-01-20 22:38 (UTC+8))
- #32528 create apply_monolithic concept — nvidia — by robertgshaw2-redhat (closed: 2026-01-20 22:37 (UTC+8))
- #31933 Naive dispatch combine POC — needs-rebase,llama,nvidia — by robertgshaw2-redhat (closed: 2026-01-20 22:36 (UTC+8))
- #32672 [Refactor] Make Int8ScaledMMLinearLayerConfig to use QuantKey — v1,multi-modality,cpu — by andikarachman (closed: 2026-01-20 21:52 (UTC+8))
- #27302 Allow custom stat_loggers in V1 engine initialization — frontend — by yinggeh (closed: 2026-01-20 20:08 (UTC+8))
- #32664 Release 0.11.0 — documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1 — by xjf-303 (closed: 2026-01-20 19:38 (UTC+8))
- #31675 [Cleanup] Remove deprecated vllm:time_per_output_token_seconds metric — v1 — by majiayu000 (closed: 2026-01-20 18:50 (UTC+8))
- #30992 [Misc] Remove deprecated metric vllm:time_per_output_token_seconds for v0.13 release — v1 — by jliu9515 (closed: 2026-01-20 18:49 (UTC+8))
- #29612 [Feature]: Support structured output and tool call together — frontend,needs-rebase,tool-calling — by mladjan-gadzic (closed: 2026-01-20 16:35 (UTC+8))
- #29766 [Bugfix]: Fix missing SPLIT_K in GPTQ/AWQ MoE Triton config — bug,needs-rebase — by aaarkai (closed: 2026-01-20 16:30 (UTC+8))