vLLM Development Activity Report - 2026-01-27
Time window: 2026-01-27 10:58 (UTC+8) ~ 2026-01-28 10:58 (UTC+8). Stats: 17 new issues | 8 closed issues | 61 new PRs | 34 merged PRs | 23 PRs closed without merging
📊 Daily Development Summary
This 24-hour cycle saw highly active vLLM development, with large numbers of new and merged PRs. Community attention centered on the accuracy of multimodal embedding models (Qwen3-VL-Embedding), continued integration of the AMD/ROCm ecosystem, and infrastructure modernization (e.g. reducing reliance on environment variables and streamlining startup). Model support and bug fixes advanced in parallel, showing rapid iteration across hardware platforms and complex model architectures.
🎯 AMD/ROCm Ecosystem
AMD-related activity was brisk this cycle, spanning performance optimization, bug fixes, and infrastructure improvements:
- Issue #33163: Refactor VLLM_ROCM_USE_AITER env vars to config
  - Summary: proposes refactoring the 13 existing environment variables that control AITER optimizations into proper configuration options (e.g. --kernel-config) to improve maintainability and reduce technical debt.
  - Discussion and progress: AMD engineer @tjtanaa suggested waiting for PR #32358 (IR Ops), which will provide a generic kernel dispatch mechanism and a natural point at which to deprecate these environment variables. Maintainers agreed, and the issue description was updated accordingly.
  - Impact: reflects the project's migration toward structured configuration management; AMD-specific optimization knobs will gradually fold into a unified framework.
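The general shape of such a migration can be sketched as follows. This is a minimal illustration, not the actual design (which will follow #32358); the `KernelConfig` class and its fields are hypothetical, though the `VLLM_ROCM_USE_AITER*` variable names are from the issue.

```python
import os
from dataclasses import dataclass


@dataclass
class KernelConfig:
    """Hypothetical config object standing in for scattered
    VLLM_ROCM_USE_AITER_* environment variables."""
    use_aiter: bool = False
    use_aiter_moe: bool = False

    @classmethod
    def from_legacy_env(cls) -> "KernelConfig":
        # Transitional shim: honor the old env vars until they are retired.
        def flag(name: str) -> bool:
            return os.environ.get(name, "0") == "1"

        return cls(
            use_aiter=flag("VLLM_ROCM_USE_AITER"),
            use_aiter_moe=flag("VLLM_ROCM_USE_AITER_MOE"),
        )
```

A shim like `from_legacy_env` is a common way to keep old deployments working while the documented configuration surface moves to a single structured option.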
- PR #33200: unpack zeros for asymmetric compressed-tensors W4A16
  - Author: mgehre-amd (AMD employee).
  - Summary: fixes a HIP illegal memory access in ConchLinearKernel on ROCm, caused by a mismatched zero-point data layout when handling asymmetrically quantized W4A16 models.
  - Impact: resolves crashes for this quantization format on AMD GPUs and improves the ROCm backend's compatibility with compressed-tensors formats.
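To make the layout concern concrete, here is a generic sketch of unpacking 4-bit zero-points, two per byte. This is illustrative only; the actual packing order used by the kernel may differ, which is exactly the kind of mismatch the PR addresses.

```python
def unpack_int4_zeros(packed: bytes) -> list[int]:
    """Unpack two 4-bit zero-points per byte, low nibble first.
    Illustrative of the data-layout concern behind the fix; the
    real kernel's packing convention may differ."""
    zeros = []
    for byte in packed:
        zeros.append(byte & 0x0F)         # low nibble
        zeros.append((byte >> 4) & 0x0F)  # high nibble
    return zeros
```

If the producer packs high-nibble-first but the consumer unpacks low-nibble-first, indices computed from the zeros point at the wrong memory, which is how layout bugs surface as illegal memory accesses.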
- PR #33180: Fix MXFP4 MoE backend selection for RDNA3
  - Summary: fixes an MLIR compilation crash on RDNA3 (gfx1100) GPUs, where MXFP4 MoE models incorrectly selected the Triton backend. After the fix, non-gfx9 (CDNA) architectures fall back to the NONE backend.
  - Impact: unblocks startup of certain quantized MoE models on consumer AMD GPUs, improving hardware compatibility.
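The corrected selection logic, as described in the PR, amounts to gating the Triton path on the GPU architecture family. A minimal sketch (not the actual vLLM code):

```python
def select_mxfp4_moe_backend(gcn_arch: str) -> str:
    """Sketch of the fix described in PR #33180: only gfx9 (CDNA)
    architectures take the Triton path; everything else falls back
    to NONE instead of crashing in MLIR compilation."""
    if gcn_arch.startswith("gfx9"):
        return "TRITON"
    return "NONE"  # e.g. RDNA3 / gfx1100
```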
- PR #33179: Fix FP8 dtype for gfx950 (closed)
  - Summary: attempted to add gfx950 to the list of architectures using the float8_e4m3fnuz dtype.
  - Outcome: @tjtanaa pointed out the premise was wrong: MI355 (gfx950) uses the same FP8 format as CUDA, not the fnuz variant. The PR was closed.
  - Takeaway: shows AMD's internal team keeping a tight grip on hardware details, preventing an incorrect change from landing.
- PR #33156: [CI] Optim release pipeline
  - Author: tjtanaa
  - Summary: optimizes the ROCm Docker image and wheel release pipelines, unifies the caching strategy, and ensures the published images are suitable as development environments.
  - Impact: improves the development and deployment experience for AMD platform users.
💬 High-Traffic Discussions
- Issue #33169: Add env var support for configuring NIXL disaggregation backend
  - Core question: whether to add a new environment variable, VLLM_NIXL_DISAGGREGATION_BACKEND, to simplify selecting the NIXL disaggregation backend (e.g. UCX, LIBFABRIC).
  - Positions:
    - Proposer @erezzarum: environment variables are easier to configure in dynamic environments and across frameworks such as Ray.
    - Maintainers @NickLucche and @markmc: firmly opposed, noting that the existing kv_transfer_config parameter is already a framework-agnostic configuration path, and the project is deliberately curbing environment-variable sprawl (see #25700) to avoid global state and scattered configuration.
  - Conclusion: no new environment variable will be added. @KrxGu suggested improving the documentation to show how to configure this through the existing --kv-transfer-config parameter, addressing the usability pain point. The discussion illustrates the project's trade-off between flexibility and architectural cleanliness.
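For reference, configuration via the existing flag looks roughly like the following. This is a config-fragment sketch: `kv_connector` and `kv_role` are real KVTransferConfig keys, but the exact JSON for selecting a specific NIXL transport should be taken from the vLLM documentation.

```shell
# Illustrative: select the NIXL connector via the existing CLI flag
# rather than a new environment variable.
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
```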
- Issue #33175: Mistral-Small-3.1-24B crashes on startup
  - Core question: a specific Mistral model crashes during vLLM startup.
  - Positions:
    - Reporter @NickLucche: provided a detailed stack trace showing the failure occurs during processor initialization.
    - Mistral employee @juliendenize: quickly diagnosed the issue: when a model repo contains both tekken.json and HF-format model files, vLLM can incorrectly fall back to the incompatible Mistral-format tokenizer; he recommended --tokenizer-mode hf as a workaround.
    - Contributor @dtrifiro: supplied a more fundamental fix that checks for the presence of consolidated*.safetensors files to decide more reliably whether the Mistral tokenizer mode should be used.
  - Conclusion: the root cause is a flaw in the tokenizer-mode auto-detection logic. The community and the upstream model provider collaborated closely and delivered a fix quickly.
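The detection heuristic described above can be sketched as follows. This mirrors the idea of the proposed fix rather than vLLM's actual code; `guess_tokenizer_mode` is a hypothetical helper name.

```python
from pathlib import Path


def guess_tokenizer_mode(repo_dir: str) -> str:
    """Illustrative sketch of the proposed detection: only choose the
    Mistral tokenizer when the repo looks like a native Mistral
    checkpoint, i.e. it ships consolidated*.safetensors weights."""
    repo = Path(repo_dir)
    if any(repo.glob("consolidated*.safetensors")):
        return "mistral"
    # Repos that merely include tekken.json alongside HF-format files
    # should fall back to the HF tokenizer instead of crashing.
    return "hf"
```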
- Issue #33147: vLLM 0.11 SSL Initialization Failure on MicroOS with FIPS
  - Core question: on FIPS-enabled MicroOS systems, vLLM fails to start because aiohttp's SSL initialization errors out.
  - Discussion: @KrxGu volunteered to investigate and pinpointed the cause: vLLM loads aiohttp (used for async HTTP downloads) at import time, and creating aiohttp's default SSL context fails under FIPS. He plans to make the import lazy, so an error can only occur when network functionality is actually needed.
  - Status: claimed, with a clear fix in view; reflects attention to compatibility with less common systems.
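The deferred-import pattern behind the planned fix can be sketched generically. This is illustrative (vLLM's actual fix may simply move the `import aiohttp` statement into the downloading function); the point is that the real import, and hence any SSL context creation, only happens on first use.

```python
import importlib


def lazy_module(name: str):
    """Generic deferred-import proxy: the named module is not imported
    until an attribute is first accessed (illustrative pattern, not
    vLLM's actual helper)."""
    class _Proxy:
        def __getattr__(self, attr):
            return getattr(importlib.import_module(name), attr)
    return _Proxy()


# 'json' stands in for aiohttp here; nothing is imported until first use,
# so merely importing the package cannot trip FIPS-related SSL failures.
http_lib = lazy_module("json")
```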
🔥 Hot Topics and Trends
- Accuracy of multimodal embedding models: a cluster of Qwen3-VL-Embedding issues (#33167, #33204) report that vLLM's embeddings differ significantly from the official qwen_vl_utils results (similarity around 0.92). Maintainer @noooop noted the model requires special parameter settings (e.g. continue_final_message=True) and pointed users to the official examples. This highlights the integration and documentation challenges complex multimodal models pose.
- Infrastructure modernization:
  - Fewer environment variables: now an explicit project direction (#33163, #33169), aimed at improving configuration discoverability, type safety, and maintainability.
  - Code modularization: PRs #33158 and #33139 refactor the API server and extract shared logic, continuing the ongoing codebase cleanup.
  - Hardware backend decoupling: RFC #33214 proposes migrating Intel XPU kernels from IPEX to a dedicated vllm-xpu-kernels library, echoing the AMD AITER config refactor; together they show hardware backend support evolving toward lighter, more focused designs.
- Performance and debugging tooling:
  - Questions about the accuracy of NIXL throughput telemetry (#33170).
  - A new tuning script for the Mamba max-num-seqs parameter (#33168).
  - Structured logging and documentation improvements for the torch.compile diagnostics tool tlparse (#33213, #33211).
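For readers unfamiliar with tlparse, the workflow is roughly as follows. This is a hedged sketch: `TORCH_TRACE` is PyTorch's structured-trace output variable, but the exact invocation recommended for vLLM should be taken from the updated docs (#33211).

```shell
# Capture torch.compile structured trace logs while serving, then
# render them into an HTML report with tlparse.
TORCH_TRACE=/tmp/vllm_trace vllm serve <model>
tlparse /tmp/vllm_trace
```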
🛠️ Key Technical Changes
- PR #33214 (RFC): XPU kernel migration to vllm-xpu-kernels
  - Technical summary: the Intel team proposes migrating Intel GPU (XPU) support, in a future release, from a dependency on Intel Extension for PyTorch (IPEX) to a standalone vllm-xpu-kernels kernel library.
  - Impact: aims to resolve the heavy dependencies and version-compatibility complexity IPEX brings; a kernel library purpose-built for vLLM should improve performance, maintainability, and iteration speed. A significant evolution in hardware backend architecture.
- PR #33173: [Quantization][ROCm] Fix MoE weight loading
  - Technical summary: fixes a crash from a failed shape assertion when loading online-quantized MoE weights for models such as Qwen3_MoE/Qwen3_next on ROCm.
  - Impact: ensures AMD platforms support complex quantized MoE models; an important ecosystem-hardening patch.
- PR #33187: [Realtime API] Adds minimal realtime API based on websockets
  - Technical summary: implements an initial realtime API over WebSockets (currently supporting speech-to-text), inspired by the OpenAI Realtime API.
  - Impact: opens a new entry point for low-latency interactive scenarios such as streaming audio, extending the boundaries of vLLM's serving capabilities.
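The shape of such a WebSocket protocol can be sketched with a client-side message frame. The event schema below is hypothetical, loosely modeled on the OpenAI Realtime API's audio-append event; the schema the vLLM PR actually adopts is not specified here.

```python
import base64
import json


def audio_append_event(pcm_chunk: bytes) -> str:
    """Hypothetical client message frame for a realtime speech-to-text
    session: raw PCM audio is base64-encoded into a typed JSON event
    and sent over the WebSocket."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })
```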
📈 Development Activity Observations
- Diverse contributors: active contributors include engineers from AMD (@tjtanaa, @mgehre-amd), Intel (@jikunshang), Red Hat (@robertgshaw2-redhat), Meta/Facebook (@jaewonlee-fb, @zou3519), and OpenAI (@noooop), alongside many independent developers.
- Fast issue response: for model crashes and configuration questions, core maintainers (e.g. @NickLucche, @hmellor, @noooop) and upstream model representatives (e.g. Mistral's @juliendenize) stepped in quickly with fixes or clear guidance.
- Reviews focused on architecture and quality: in the environment-variable and API-refactoring discussions, maintainers weighed not just functionality but code structure, long-term maintenance cost, and consistency with project principles.
💡 Issues Worth Watching
- Qwen3-VL-Embedding accuracy: two independent issues report the same problem, suggesting either a widespread bug or a usability trap. Interim guidance exists, but an official root-cause investigation and a more durable fix are needed.
- Timing of the AMD AITER env-var refactor: the work is gated on #32358 (IR Ops); its progress will pace the modernization of AMD optimization configuration.
- Intel XPU kernel migration proposal: a high-impact architectural change that needs broad community feedback; watch the execution of its migration plan (starting after v0.15.0, targeting completion in v0.16.0).
- Long-term environment-variable cleanup: the project has committed to reducing environment variables (#25700); similar configuration requests will face strict scrutiny, and developers should prefer existing CLI options or config files.
📋 Appendix: Detailed Data
New Issues
- #33167 [Usage]: Help: online serving of the qwen3-vl-Embedding model with vLLM produces results inconsistent with offline transformers calls; why? vllm=0.14.0 — usage — by July-shisan (created: 2026-01-27 20:30 (UTC+8))
- #33204 [Bug]: Qwen3-VL-Embedding model produces different embeddings than official qwen_vl_utils implementation — bug — by namburiamit (created: 2026-01-28 07:14 (UTC+8))
- #33215 [Bug]: DeepSeek V3.2 `tool_choice=="required"` in thinking mode gives internal server error. — bug — by clovisNyu (created: 2026-01-28 09:26 (UTC+8))
- #33214 [RFC]: XPU kernel migration to vllm-xpu-kernels — RFC — by jikunshang (created: 2026-01-28 09:22 (UTC+8))
- #33210 [Bug]: gpt-oss chat format mismatch with HF apply_chat_template — bug — by sgunasekar (created: 2026-01-28 08:22 (UTC+8))
- #33147 [Bug]: vLLM 0.11 SSL Initialization Failure (ssl.SSLError: [CRYPTO] unknown error) on MicroOS 5.5 with FIPS Enabled — bug — by fyuan1316 (created: 2026-01-27 15:01 (UTC+8))
- #33169 [Feature]: Add environment variable support for configuring NIXL disaggregation backend — feature request — by erezzarum (created: 2026-01-27 21:38 (UTC+8))
- #33170 [Performance]: NIXL telemetry throughput does not consider transfer overlapping — performance,kv-connector — by Spycsh (created: 2026-01-27 22:25 (UTC+8))
- #33175 [Bug]: Mistral-Small-3.1-24B crashes on startup — bug — by NickLucche (created: 2026-01-27 23:11 (UTC+8))
- #33161 [Doc]: Kubernetes deployment in CPU mode fails (No CUDA..) — documentation — by Josca (created: 2026-01-27 18:17 (UTC+8))
- #33163 [Feature][AMD][ROCm] Refactor VLLM_ROCM_USE_AITER env vars to config — feature request,rocm — by markmc (created: 2026-01-27 19:13 (UTC+8))
- #33140 [Usage]: How to enable the grpc api with vllm serve? — usage — by MaoJianwei (created: 2026-01-27 11:55 (UTC+8))
- #33155 [Bug]: gptoss120B is OOM on one H100 after upgrading from v0.11.2 to v0.14.1 — bug — by tonyaw (created: 2026-01-27 16:42 (UTC+8))
- #33153 [CI Failure]: Blackwell Test failed for cudaErrorDevicesUnavailable — ci-failure — by pacoxu (created: 2026-01-27 16:37 (UTC+8))
- #33144 [Bug]: PD report xpxd with deepseekv32 fp4 Assertion error kv.second.dim()==1 — bug — by nicole-lihui (created: 2026-01-27 13:47 (UTC+8))
- #33143 [Bug]: Triton MLIR Error when attempting to run gpt-oss-20b — bug,rocm — by Calandracas606 (created: 2026-01-27 12:32 (UTC+8))
- #33142 [Feature]: Tequila/Sherry/BitNet and Ternary support — feature request — by TomLucidor (created: 2026-01-27 12:21 (UTC+8))
Closed Issues
- #24198 [Feature][gpt-oss]: Browser Tool Test Enhancement — feature request,stale,gpt-oss — by heheda12345 (closed: 2026-01-28 10:15 (UTC+8))
- #32959 [Bug]: KeyError: `merging_layer.weight` when loading Mistral/vision-enabled checkpoints after PR #32780 refactor — bug — by ms1design (closed: 2026-01-28 05:22 (UTC+8))
- #22843 [Bug]: vLLM (AsyncLLMEngine, LLM) engine initialization fails when using runai_streamer — bug — by jiangwu300 (closed: 2026-01-28 02:08 (UTC+8))
- #32910 [Doc]: Having default of "None" in docs for serve params is not helpful, because there is actually a default — documentation — by jcowles (closed: 2026-01-28 00:47 (UTC+8))
- #32665 [Bug]: [DeepSeek-V3.2] PD reports `NotImplementedError` — bug — by kebe7jun (closed: 2026-01-27 18:44 (UTC+8))
- #33144 [Bug]: PD report xpxd with deepseekv32 fp4 Assertion error kv.second.dim()==1 — bug — by nicole-lihui (closed: 2026-01-27 14:02 (UTC+8))
- #32643 [Bug] FlashInfer >=0.6.0 TypeError: non_blocking must be bool during CUDA graph capture — no labels — by amanwalksdownthestreet (closed: 2026-01-27 12:32 (UTC+8))
- #29739 [Feature]: GGUF model with architecture qwen3vlmoe is not supported yet. — feature request — by (closed: 2026-01-27 11:56 (UTC+8))
New PRs
- #33181 [reasoning parser] code clean — structured-output,v1 — by andyxning (created: 2026-01-28 00:04 (UTC+8))
- #33201 Refactor NVFP4 Linear utils for ModelOpt and CT — nvidia — by mgoin (created: 2026-01-28 06:18 (UTC+8))
- #33212 support returning tokenids in responses api — frontend — by cmunley1 (created: 2026-01-28 08:54 (UTC+8))
- #33141 Fix tool call indexing double-counting — frontend,ready — by wangln19 (created: 2026-01-27 12:19 (UTC+8))
- #33165 [Model] Support DeepSeek-OCR-2 — documentation,new-model,deepseek — by LiuLi1998 (created: 2026-01-27 20:00 (UTC+8))
- #33208 Don't use `min_pixels`/`max_pixels` from Qwen2VL's processor — ready,qwen — by hmellor (created: 2026-01-28 08:16 (UTC+8))
- #33199 [CI] Enable mypy import following for `vllm/compilation` — ready — by hmellor (created: 2026-01-28 05:41 (UTC+8))
- #33186 [Docs] Use definition lists for CLI reference docs — documentation,ready — by hmellor (created: 2026-01-28 01:10 (UTC+8))
- #33191 Add flake8-implicit-str-concat rules to Ruff — documentation,performance,frontend,ready,ci/build,v1,tool-calling,llama,deepseek,kv-connector — by hmellor (created: 2026-01-28 04:08 (UTC+8))
- #33209 [ez] Remove checks for torch version <= 2.8 — no labels — by angelayi (created: 2026-01-28 08:18 (UTC+8))
- #33211 [docs] Improve tlparse section — documentation,ready — by angelayi (created: 2026-01-28 08:28 (UTC+8))
- #33156 [Release] [CI] Optim release pipeline — rocm,ci/build — by tjtanaa (created: 2026-01-27 16:57 (UTC+8))
- #33213 [ez] Add structured torch.compile logs — needs-rebase — by angelayi (created: 2026-01-28 09:20 (UTC+8))
- #33207 [WIP] PCP simplification — rocm,speculative-decoding,v1,nvidia — by LucasWilkinson (created: 2026-01-28 08:07 (UTC+8))
- #33151 [CI] minor fixes to pipeline generator and tests — ci/build,ready-run-all-tests — by khluu (created: 2026-01-27 15:57 (UTC+8))
- #33202 Relax protobuf library version constraints — frontend,ready,ci/build — by jeffreywang-anyscale (created: 2026-01-28 07:08 (UTC+8))
- #33206 [XPU] minor fix fp8 online quantization — no labels — by yma11 (created: 2026-01-28 07:52 (UTC+8))
- #33173 [Quantization][ROCm] Fix MoE weight loading to be robust (Qwen3_MoE/Qwen3_next as example models) — rocm,qwen — by xuebwang-amd (created: 2026-01-27 22:38 (UTC+8))
- #33145 inital commit — no labels — by robertgshaw2-redhat (created: 2026-01-27 13:49 (UTC+8))
- #33205 Make `mypy` opt-out instead of opt-in — v1 — by hmellor (created: 2026-01-28 07:26 (UTC+8))
- #33184 [torch.compile] Speed up MOE handling in forward_context — ready — by zou3519 (created: 2026-01-28 00:33 (UTC+8))
- #33203 [Kernel] [Helion] Helion kernel registry — no labels — by gmagogsfm (created: 2026-01-28 07:08 (UTC+8))
- #33187 [Realtime API] Adds minimal realtime API based on websockets — frontend,v1 — by patrickvonplaten (created: 2026-01-28 02:37 (UTC+8))
- #33198 [Core] Profiler improvements and lazy initialization — frontend,v1,cpu — by jaewonlee-fb (created: 2026-01-28 05:41 (UTC+8))
- #33200 unpack zeros for asymmetric compressed-tensors W4A16 (ConchLinearKernel) — no labels — by mgehre-amd (created: 2026-01-28 06:06 (UTC+8))
- #33189 [Misc][Build] Lazy load cv2 in nemotron_parse.py — bug,ready — by kiersten-stokes (created: 2026-01-28 03:29 (UTC+8))
- #33192 [Bugfix] Disable TRTLLM attention when KV transfer is enabled — bug,v1,nvidia — by ZhanqiuHu (created: 2026-01-28 04:10 (UTC+8))
- #33183 [Benchmark] Add startup benchmarking to buildkite run — performance,ci/build — by desertfire (created: 2026-01-28 00:25 (UTC+8))
- #33195 [Core] Add sleep level 0 mode with enqueue/wait pattern — frontend,v1 — by jaewonlee-fb (created: 2026-01-28 04:57 (UTC+8))
- #33197 use 'max_active_experts' for moe lora input size — no labels — by gnovack (created: 2026-01-28 05:22 (UTC+8))
- #33196 [BugFix] Handle dict values in YAML config by converting to JSON in config file — bug — by wangluochao902 (created: 2026-01-28 05:12 (UTC+8))
- #33194 [Hybrid] Simplify mamba block size logic for `all` mode — v1 — by tdoublep (created: 2026-01-28 04:29 (UTC+8))
- #33193 [UX] Enable nested configs in config yaml files — no labels — by mgoin (created: 2026-01-28 04:25 (UTC+8))
- #33190 [Feature] Add API↔Engine context propagation for journey tracing (PR #7/9) — documentation,frontend,ci/build,v1 — by sriumcp (created: 2026-01-28 03:47 (UTC+8))
- #33188 Fix merge conflict. — frontend,ci/build,v1,llama,qwen,cpu — by gtnaila (created: 2026-01-28 03:24 (UTC+8))
- #33177 [Attention] Use `has_flashinfer` helper — ready — by MatthewBonanni (created: 2026-01-27 23:29 (UTC+8))
- #33176 [EPLB] Add alternative communication for EPLB weight exchange — ci/build — by ilmarkov (created: 2026-01-27 23:17 (UTC+8))
- #33179 [Bugfix][Hardware][AMD] Fix FP8 dtype for gfx950 (MI325X/MI355X) — bug,rocm — by c0de128 (created: 2026-01-28 00:02 (UTC+8))
- #33185 [Core] Add VLLM_NIXL_DISAGGREGATION_BACKEND env var for NIXL backend selection — v1,kv-connector — by algebra-MCX (created: 2026-01-28 01:06 (UTC+8))
- #33157 [Misc] Cleanup Kimi-K2.5's vision chunk modality entrypoints — frontend,multi-modality — by Isotr0py (created: 2026-01-27 17:22 (UTC+8))
- #33180 [Bugfix][Hardware][AMD] Fix MXFP4 MoE backend selection for RDNA3 (gfx1100) — bug,rocm — by c0de128 (created: 2026-01-28 00:03 (UTC+8))
- #33162 Fix weight mapping test for Transfomers v5 — ready,multi-modality — by hmellor (created: 2026-01-27 18:24 (UTC+8))
- #33182 [Feature] Add API parent span lifecycle management (PR #6/9) — documentation,frontend,ci/build,v1 — by sriumcp (created: 2026-01-28 00:04 (UTC+8))
- #33178 [Quantization] - Consolidate experts_int8 with FP8 Modular Kernels — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by Josephasafg (created: 2026-01-27 23:39 (UTC+8))
- #33158 [Frontend] Cleanup api server — frontend,ready — by noooop (created: 2026-01-27 17:23 (UTC+8))
- #33168 Add script to benchmark Mamba for optimal max_num_seqs — performance — by danisereb (created: 2026-01-27 20:52 (UTC+8))
- #33148 [Bugfix] Fix xgrammar cleanup leakage — bug,structured-output,v1 — by DamonJiang777 (created: 2026-01-27 15:04 (UTC+8))
- #33174 Add support for Mistral Large 3 inference with Flashinfer MoE — ci/build,deepseek,nvidia — by dbari (created: 2026-01-27 22:40 (UTC+8))
- #33172 [Feature] Add prefix-aware routing for data parallel load balancing — documentation,needs-rebase,v1 — by toslali-ibm (created: 2026-01-27 22:31 (UTC+8))
- #33171 [Bugfix] Upgrade Numba to fix segfault on CUDA drivers version 580 — bug,rocm,ci/build,cpu,nvidia — by tlrmchlsmth (created: 2026-01-27 22:29 (UTC+8))
- #33160 fix:paddle ocr infinite inference bug — bug,v1 — by bellkjtt (created: 2026-01-27 18:15 (UTC+8))
- #33164 [Bugfix] Disable CG for Whisper+FA2 — bug,ready,v1 — by NickLucche (created: 2026-01-27 19:32 (UTC+8))
- #33152 [PluggableLayer][2/N] Apply PluggableLayer to linear layers — ready,ready-run-all-tests — by whx-sjtu (created: 2026-01-27 16:09 (UTC+8))
- #33139 [Frontend] Frontend will only attach supported tasks corresponding entrypoints. — frontend,ready — by noooop (created: 2026-01-27 11:46 (UTC+8))
- #33159 [Feature]: Container image WORKDIR consistency — ready,ci/build,cpu — by SouthWest7 (created: 2026-01-27 17:24 (UTC+8))
- #33166 Update README.md — documentation — by chaosisnotopen (created: 2026-01-27 20:17 (UTC+8))
- #33138 [code clean] remove duplicated code — frontend — by andyxning (created: 2026-01-27 11:11 (UTC+8))
- #33146 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,nvidia — by vllmellm (created: 2026-01-27 14:19 (UTC+8))
- #33154 [CI] Split responses API MCP tests into separate CI step — ci/build — by pacoxu (created: 2026-01-27 16:41 (UTC+8))
- #33150 [CI] fix(tests): avoid Open-Meteo API timeout in test_function_calling_with_stream — no labels — by pacoxu (created: 2026-01-27 15:41 (UTC+8))
- #33149 [BugFix] Fix minimax_m2 tool call parser for stream_interval > 1 — bug,tool-calling — by MrIceCreamMan (created: 2026-01-27 15:19 (UTC+8))
Merged PRs
- #33186 [Docs] Use definition lists for CLI reference docs — documentation,ready — by hmellor (merged: 2026-01-28 10:22 (UTC+8))
- #33211 [docs] Improve tlparse section — documentation,ready — by angelayi (merged: 2026-01-28 10:07 (UTC+8))
- #33151 [CI] minor fixes to pipeline generator and tests — ci/build,ready-run-all-tests — by khluu (merged: 2026-01-28 09:04 (UTC+8))
- #33059 [Model Runner V2] Use a different stream for grammar bitmask h2d copy — v1 — by WoosukKwon (merged: 2026-01-28 08:37 (UTC+8))
- #26835 Add attention benchmarking tools — performance,ready,nvidia — by MatthewBonanni (merged: 2026-01-28 08:09 (UTC+8))
- #33184 [torch.compile] Speed up MOE handling in forward_context — ready — by zou3519 (merged: 2026-01-28 07:17 (UTC+8))
- #33102 [Perf] Optimize dcp allocate tensor — ready,v1 — by yewentao256 (merged: 2026-01-28 06:24 (UTC+8))
- #33020 [Bugfix] Fix display error (inconsistent with context) — bug,ready — by lingebeng (merged: 2026-01-28 04:33 (UTC+8))
- #32719 Enabling "2 node" distributed tests in the AMD CI pipeline. — rocm,ready,ci/build — by Alexei-V-Ivanov-AMD (merged: 2026-01-28 03:13 (UTC+8))
- #33177 [Attention] Use `has_flashinfer` helper — ready — by MatthewBonanni (merged: 2026-01-28 02:33 (UTC+8))
- #33035 feature: support eagle3 for HunyuanVL & Hunyuan — speculative-decoding,ready,v1 — by irisliu10 (merged: 2026-01-28 01:55 (UTC+8))
- #33082 [Doc] Improve serve parameter documentation with meaningful defaults — documentation,ready — by karanb192 (merged: 2026-01-28 01:19 (UTC+8))
- #32909 [CI][Pooling] Stabilize ModernBERT test — ready — by AndreasKaratzas (merged: 2026-01-27 13:26 (UTC+8))
- #33162 Fix weight mapping test for Transfomers v5 — ready,multi-modality — by hmellor (merged: 2026-01-27 20:30 (UTC+8))
- #33076 Support compress-tensors with nvfp4 or fp8 weights and modelopt with nvfp4 weights on Turing — ready — by ir1ka (merged: 2026-01-28 00:04 (UTC+8))
- #33037 [BugFix] Fix P/D with non-MoE DP — bug,ready,v1,kv-connector — by njhill (merged: 2026-01-28 00:03 (UTC+8))
- #32549 Support heterogeneous NemotronHPuzzle model — new-model,ready — by danielafrimi (merged: 2026-01-27 23:55 (UTC+8))
- #32265 [LoRA][Spec Decode] Support LoRA for Nemotron-H MTP models — ready — by danisereb (merged: 2026-01-27 23:53 (UTC+8))
- #32064 [5/N][Attention] Finish eliminating `vllm/attention` folder — documentation,rocm,speculative-decoding,ready,ci/build,v1,llama,qwen,deepseek,gpt-oss — by MatthewBonanni (merged: 2026-01-27 23:02 (UTC+8))
- #33158 [Frontend] Cleanup api server — frontend,ready — by noooop (merged: 2026-01-27 23:18 (UTC+8))
- #33045 [Metrics][MFU] Fix UnembedMetrics FLOP overcounting for prefill (#33045) — ready,v1,meta-exported,fb-exported — by omkhalil (merged: 2026-01-27 23:16 (UTC+8))
- #33090 [Bugfix] Fix DeepseekV32 `AssertionError: num_kv_heads == 1` — bug,ready,deepseek,kv-connector — by NickLucche (merged: 2026-01-27 23:03 (UTC+8))
- #33164 [Bugfix] Disable CG for Whisper+FA2 — bug,ready,v1 — by NickLucche (merged: 2026-01-27 21:46 (UTC+8))
- #27942 [Metrics] [KVConnector] Add Offloading Connector metrics — ready,v1,kv-connector — by omerpaz95 (merged: 2026-01-27 21:34 (UTC+8))
- #33139 [Frontend] Frontend will only attach supported tasks corresponding entrypoints. — frontend,ready — by noooop (merged: 2026-01-27 20:15 (UTC+8))
- #32042 [AMD][QWEN3-NEXT] FP8 Tunings — performance,rocm,ready,qwen — by draftbk (merged: 2026-01-27 17:34 (UTC+8))
- #33131 [Models] Kimi-K2.5 — documentation,new-model,frontend,ready,multi-modality — by ywang96 (merged: 2026-01-27 14:50 (UTC+8))
- #32976 [AMD][Kernel][BugFix] Use correct scale in concat_and_cache_ds_mla_kernel when on gfx942 — bug,rocm,ready — by rasmith (merged: 2026-01-27 15:16 (UTC+8))
- #33135 [code clean] remove duplicate code — structured-output,ready,v1 — by andyxning (merged: 2026-01-27 12:57 (UTC+8))
- #33103 [Frontend] Cleanup serving engine — frontend,ready — by DarkLight1337 (merged: 2026-01-27 12:47 (UTC+8))
- #33113 [torch.compile] Stop assuming 32 bit indexing — ready — by zou3519 (merged: 2026-01-27 12:25 (UTC+8))
- #33101 [Frontend] Reduce mixin usage in serving pooling — frontend,ready — by DarkLight1337 (merged: 2026-01-27 11:50 (UTC+8))
- #33064 [Perf] avoid duplicate mem_get_info() call in get_current_memory_usage — rocm,ready — by pacoxu (merged: 2026-01-27 11:45 (UTC+8))
- #33109 [DOC]: Add warning about max_num_batched_tokens and max_model_len when chunked prefill is disabled — documentation,ready — by VincentG1234 (merged: 2026-01-27 11:05 (UTC+8))
PRs Closed Without Merging
- #32994 [Bugfix][Core] Fix use audio in video bug — bug,needs-rebase,v1,qwen — by xsank (closed: 2026-01-28 10:01 (UTC+8))
- #33206 [XPU] minor fix fp8 online quantization — no labels — by yma11 (closed: 2026-01-28 08:44 (UTC+8))
- #33008 [BugFix] KeyError when loading Mistral/vision-enabled checkpoints — bug — by ms1design (closed: 2026-01-28 05:22 (UTC+8))
- #26482 [CI] Fix mypy for `vllm/attention` and `vllm/compilation` — tpu,ready,needs-rebase,v1,cpu,nvidia — by hmellor (closed: 2026-01-28 04:30 (UTC+8))
- #23507 feat: Add TPU v6e architecture-adaptive attention backend — documentation,tpu,unstale,v1 — by Tar-ive (closed: 2026-01-28 05:09 (UTC+8))
- #32961 [Bugfix][Kernel] Rename NoEP to NoDPEP and fix NvFP4 MoE expert routing — bug,documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build — by aviralgarg05 (closed: 2026-01-28 04:14 (UTC+8))
- #33190 [Feature] Add API↔Engine context propagation for journey tracing (PR #7/9) — documentation,frontend,ci/build,v1 — by sriumcp (closed: 2026-01-28 03:48 (UTC+8))
- #33114 [Draft / POC] Websocket realtime api — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by patrickvonplaten (closed: 2026-01-28 02:37 (UTC+8))
- #33179 [Bugfix][Hardware][AMD] Fix FP8 dtype for gfx950 (MI325X/MI355X) — bug,rocm — by c0de128 (closed: 2026-01-28 00:14 (UTC+8))
- #31736 Update how docs are rendered for cli reference — documentation — by ashwin-phadke (closed: 2026-01-28 01:10 (UTC+8))
- #33182 [Feature] Add API parent span lifecycle management (PR #6/9) — documentation,frontend,ci/build,v1 — by sriumcp (closed: 2026-01-28 00:04 (UTC+8))
- #30208 Feat/support nemotron h mtp wip — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by shaharmor98 (closed: 2026-01-27 23:00 (UTC+8))
- #28541 Feat/support nemotron h mtp — new-model,speculative-decoding,needs-rebase,v1,qwen — by shaharmor98 (closed: 2026-01-27 23:08 (UTC+8))
- #31092 Feat/support nemotron h mtp upstream — new-model,speculative-decoding,needs-rebase,v1 — by shaharmor98 (closed: 2026-01-27 22:57 (UTC+8))
- #29884 DRAFT: Mistral large 3 Extended Blackwell Support — performance,ready,needs-rebase,ci/build,v1,deepseek,nvidia — by hypdeb (closed: 2026-01-27 22:53 (UTC+8))
- #33172 [Feature] Add prefix-aware routing for data parallel load balancing — documentation,needs-rebase,v1 — by toslali-ibm (closed: 2026-01-27 22:32 (UTC+8))
- #33160 fix:paddle ocr infinite inference bug — bug,v1 — by bellkjtt (closed: 2026-01-27 22:03 (UTC+8))
- #30672 [Model][Last/N] Improve all pooling task Generate runner supports using embed and token_embed tasks. — frontend,v1 — by noooop (closed: 2026-01-27 21:20 (UTC+8))
- #33086 [Bugfix] Fix DeepseekV32 AssertionError: num_kv_heads == 1 — bug,v1,deepseek — by chaunceyjiang (closed: 2026-01-27 21:05 (UTC+8))
- #32543 refactor: extract KV cache update logic into method in RocmAttention — rocm,v1 — by Mohit-Gaur (closed: 2026-01-27 16:41 (UTC+8))
- #31228 Cleanup basic and entrypoint test organisation — ready,ci/build,tool-calling,llama,gpt-oss — by hmellor (closed: 2026-01-27 16:11 (UTC+8))
- #32515 [Bugfix] Add missing o_data_type parameter for FlashInfer >=0.6.0 compatibility — bug,needs-rebase,v1,nvidia — by amanwalksdownthestreet (closed: 2026-01-27 12:32 (UTC+8))
- #33136 [Feature] Emit journey events to core spans (PR #4/9) — documentation,needs-rebase,ci/build,v1 — by sriumcp (closed: 2026-01-27 11:00 (UTC+8))