vLLM Development Activity Report - 2026-01-27
Time window: 2026-01-27 10:58 (UTC+8) ~ 2026-01-28 10:58 (UTC+8). Stats: 17 new issues | 8 closed issues | 61 new PRs | 34 merged PRs | 23 PRs closed without merging
📊 Daily Development Summary
This 24-hour cycle saw highly active vLLM development, with large numbers of new and merged PRs. Community attention centered on the accuracy of multimodal embedding models (Qwen3-VL-Embedding), continued integration of the AMD/ROCm ecosystem, and infrastructure modernization (e.g. reducing reliance on environment variables and streamlining startup). Model support and bug fixes advanced in parallel, showing rapid iteration across hardware platforms and complex model architectures.
🎯 AMD/ROCm Ecosystem
AMD-related activity was brisk this cycle, spanning performance optimization, bug fixes, and infrastructure improvements:
- Issue #33163: Refactor VLLM_ROCM_USE_AITER env vars to config
  - Summary: proposes refactoring the 13 existing environment variables that control AITER optimizations into proper configuration options (e.g. --kernel-config) to improve maintainability and reduce technical debt.
  - Discussion and progress: AMD engineer @tjtanaa suggested waiting for PR #32358 (IR Ops), which will provide a generic kernel dispatch mechanism and a natural point at which to deprecate these environment variables. Maintainers agreed, and the issue description was updated accordingly.
  - Impact: reflects the project's migration toward structured configuration management; AMD-specific optimization knobs will gradually fold into a unified framework.
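The general shape of such a migration can be sketched as follows. This is a minimal illustration, not the actual design (which will follow #32358); the `KernelConfig` class and its fields are hypothetical, though the `VLLM_ROCM_USE_AITER*` variable names are from the issue.

```python
import os
from dataclasses import dataclass


@dataclass
class KernelConfig:
    """Hypothetical config object standing in for scattered
    VLLM_ROCM_USE_AITER_* environment variables."""
    use_aiter: bool = False
    use_aiter_moe: bool = False

    @classmethod
    def from_legacy_env(cls) -> "KernelConfig":
        # Transitional shim: honor the old env vars until they are retired.
        def flag(name: str) -> bool:
            return os.environ.get(name, "0") == "1"

        return cls(
            use_aiter=flag("VLLM_ROCM_USE_AITER"),
            use_aiter_moe=flag("VLLM_ROCM_USE_AITER_MOE"),
        )
```

A shim like `from_legacy_env` is a common way to keep old deployments working while the documented configuration surface moves to a single structured option.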
- PR #33200: unpack zeros for asymmetric compressed-tensors W4A16
  - Author: mgehre-amd (AMD employee).
  - Summary: fixes a HIP illegal memory access in ConchLinearKernel on ROCm, caused by a mismatched zero-point data layout when handling asymmetrically quantized W4A16 models.
  - Impact: resolves crashes for this quantization format on AMD GPUs and improves the ROCm backend's compatibility with compressed-tensors formats.
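To make the layout concern concrete, here is a generic sketch of unpacking 4-bit zero-points, two per byte. This is illustrative only; the actual packing order used by the kernel may differ, which is exactly the kind of mismatch the PR addresses.

```python
def unpack_int4_zeros(packed: bytes) -> list[int]:
    """Unpack two 4-bit zero-points per byte, low nibble first.
    Illustrative of the data-layout concern behind the fix; the
    real kernel's packing convention may differ."""
    zeros = []
    for byte in packed:
        zeros.append(byte & 0x0F)         # low nibble
        zeros.append((byte >> 4) & 0x0F)  # high nibble
    return zeros
```

If the producer packs high-nibble-first but the consumer unpacks low-nibble-first, indices computed from the zeros point at the wrong memory, which is how layout bugs surface as illegal memory accesses.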
- PR #33180: Fix MXFP4 MoE backend selection for RDNA3
  - Summary: fixes an MLIR compilation crash on RDNA3 (gfx1100) GPUs, where MXFP4 MoE models incorrectly selected the Triton backend. After the fix, non-gfx9 (CDNA) architectures fall back to the NONE backend.
  - Impact: unblocks startup of certain quantized MoE models on consumer AMD GPUs, improving hardware compatibility.
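The corrected selection logic, as described in the PR, amounts to gating the Triton path on the GPU architecture family. A minimal sketch (not the actual vLLM code):

```python
def select_mxfp4_moe_backend(gcn_arch: str) -> str:
    """Sketch of the fix described in PR #33180: only gfx9 (CDNA)
    architectures take the Triton path; everything else falls back
    to NONE instead of crashing in MLIR compilation."""
    if gcn_arch.startswith("gfx9"):
        return "TRITON"
    return "NONE"  # e.g. RDNA3 / gfx1100
```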
- PR #33179: Fix FP8 dtype for gfx950 (closed)
  - Summary: attempted to add gfx950 to the list of architectures using the float8_e4m3fnuz dtype.
  - Outcome: @tjtanaa pointed out the premise was wrong: MI355 (gfx950) uses the same FP8 format as CUDA, not the fnuz variant. The PR was closed.
  - Takeaway: shows AMD's internal team keeping a tight grip on hardware details, preventing an incorrect change from landing.
- PR #33156: [CI] Optim release pipeline
  - Author: tjtanaa
  - Summary: optimizes the ROCm Docker image and wheel release pipelines, unifies the caching strategy, and ensures the published images are suitable as development environments.
  - Impact: improves the development and deployment experience for AMD platform users.
💬 High-Traffic Discussions
- Issue #33169: Add env var support for configuring NIXL disaggregation backend
  - Core question: whether to add a new environment variable, VLLM_NIXL_DISAGGREGATION_BACKEND, to simplify selecting the NIXL disaggregation backend (e.g. UCX, LIBFABRIC).
  - Positions:
    - Proposer @erezzarum: environment variables are easier to configure in dynamic environments and across frameworks such as Ray.
    - Maintainers @NickLucche and @markmc: firmly opposed, noting that the existing kv_transfer_config parameter is already a framework-agnostic configuration path, and the project is deliberately curbing environment-variable sprawl (see #25700) to avoid global state and scattered configuration.
  - Conclusion: no new environment variable will be added. @KrxGu suggested improving the documentation to show how to configure this through the existing --kv-transfer-config parameter, addressing the usability pain point. The discussion illustrates the project's trade-off between flexibility and architectural cleanliness.
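For reference, configuration via the existing flag looks roughly like the following. This is a config-fragment sketch: `kv_connector` and `kv_role` are real KVTransferConfig keys, but the exact JSON for selecting a specific NIXL transport should be taken from the vLLM documentation.

```shell
# Illustrative: select the NIXL connector via the existing CLI flag
# rather than a new environment variable.
vllm serve <model> \
  --kv-transfer-config '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
```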
- Issue #33175: Mistral-Small-3.1-24B crashes on startup
  - Core question: a specific Mistral model crashes during vLLM startup.
  - Positions:
    - Reporter @NickLucche: provided a detailed stack trace showing the failure occurs during processor initialization.
    - Mistral employee @juliendenize: quickly diagnosed the issue: when a model repo contains both tekken.json and HF-format model files, vLLM can incorrectly fall back to the incompatible Mistral-format tokenizer; he recommended --tokenizer-mode hf as a workaround.
    - Contributor @dtrifiro: supplied a more fundamental fix that checks for the presence of consolidated*.safetensors files to decide more reliably whether the Mistral tokenizer mode should be used.
  - Conclusion: the root cause is a flaw in the tokenizer-mode auto-detection logic. The community and the upstream model provider collaborated closely and delivered a fix quickly.
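The detection heuristic described above can be sketched as follows. This mirrors the idea of the proposed fix rather than vLLM's actual code; `guess_tokenizer_mode` is a hypothetical helper name.

```python
from pathlib import Path


def guess_tokenizer_mode(repo_dir: str) -> str:
    """Illustrative sketch of the proposed detection: only choose the
    Mistral tokenizer when the repo looks like a native Mistral
    checkpoint, i.e. it ships consolidated*.safetensors weights."""
    repo = Path(repo_dir)
    if any(repo.glob("consolidated*.safetensors")):
        return "mistral"
    # Repos that merely include tekken.json alongside HF-format files
    # should fall back to the HF tokenizer instead of crashing.
    return "hf"
```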
- Issue #33147: vLLM 0.11 SSL Initialization Failure on MicroOS with FIPS
  - Core question: on FIPS-enabled MicroOS systems, vLLM fails to start because aiohttp's SSL initialization errors out.
  - Discussion: @KrxGu volunteered to investigate and pinpointed the cause: vLLM loads aiohttp (used for async HTTP downloads) at import time, and creating aiohttp's default SSL context fails under FIPS. He plans to make the import lazy, so an error can only occur when network functionality is actually needed.
  - Status: claimed, with a clear fix in view; reflects attention to compatibility with less common systems.
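The deferred-import pattern behind the planned fix can be sketched generically. This is illustrative (vLLM's actual fix may simply move the `import aiohttp` statement into the downloading function); the point is that the real import, and hence any SSL context creation, only happens on first use.

```python
import importlib


def lazy_module(name: str):
    """Generic deferred-import proxy: the named module is not imported
    until an attribute is first accessed (illustrative pattern, not
    vLLM's actual helper)."""
    class _Proxy:
        def __getattr__(self, attr):
            return getattr(importlib.import_module(name), attr)
    return _Proxy()


# 'json' stands in for aiohttp here; nothing is imported until first use,
# so merely importing the package cannot trip FIPS-related SSL failures.
http_lib = lazy_module("json")
```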
🔥 Hot Topics and Trends
- Accuracy of multimodal embedding models: a cluster of Qwen3-VL-Embedding issues (#33167, #33204) report that vLLM's embeddings differ significantly from the official qwen_vl_utils results (similarity around 0.92). Maintainer @noooop noted the model requires special parameter settings (e.g. continue_final_message=True) and pointed users to the official examples. This highlights the integration and documentation challenges complex multimodal models pose.
- Infrastructure modernization:
  - Fewer environment variables: now an explicit project direction (#33163, #33169), aimed at improving configuration discoverability, type safety, and maintainability.
  - Code modularization: PRs #33158 and #33139 refactor the API server and extract shared logic, continuing the ongoing codebase cleanup.
  - Hardware backend decoupling: RFC #33214 proposes migrating Intel XPU kernels from IPEX to a dedicated vllm-xpu-kernels library, echoing the AMD AITER config refactor; together they show hardware backend support evolving toward lighter, more focused designs.
- Performance and debugging tooling:
  - Questions about the accuracy of NIXL throughput telemetry (#33170).
  - A new tuning script for the Mamba max-num-seqs parameter (#33168).
  - Structured logging and documentation improvements for the torch.compile diagnostics tool tlparse (#33213, #33211).
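For readers unfamiliar with tlparse, the workflow is roughly as follows. This is a hedged sketch: `TORCH_TRACE` is PyTorch's structured-trace output variable, but the exact invocation recommended for vLLM should be taken from the updated docs (#33211).

```shell
# Capture torch.compile structured trace logs while serving, then
# render them into an HTML report with tlparse.
TORCH_TRACE=/tmp/vllm_trace vllm serve <model>
tlparse /tmp/vllm_trace
```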
🛠️ Key Technical Changes
- PR #33214 (RFC): XPU kernel migration to vllm-xpu-kernels
  - Technical summary: the Intel team proposes migrating Intel GPU (XPU) support, in a future release, from a dependency on Intel Extension for PyTorch (IPEX) to a standalone vllm-xpu-kernels kernel library.
  - Impact: aims to resolve the heavy dependencies and version-compatibility complexity IPEX brings; a kernel library purpose-built for vLLM should improve performance, maintainability, and iteration speed. A significant evolution in hardware backend architecture.
- PR #33173: [Quantization][ROCm] Fix MoE weight loading
  - Technical summary: fixes a crash from a failed shape assertion when loading online-quantized MoE weights for models such as Qwen3_MoE/Qwen3_next on ROCm.
  - Impact: ensures AMD platforms support complex quantized MoE models; an important ecosystem-hardening patch.
- PR #33187: [Realtime API] Adds minimal realtime API based on websockets
  - Technical summary: implements an initial realtime API over WebSockets (currently supporting speech-to-text), inspired by the OpenAI Realtime API.
  - Impact: opens a new entry point for low-latency interactive scenarios such as streaming audio, extending the boundaries of vLLM's serving capabilities.
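The shape of such a WebSocket protocol can be sketched with a client-side message frame. The event schema below is hypothetical, loosely modeled on the OpenAI Realtime API's audio-append event; the schema the vLLM PR actually adopts is not specified here.

```python
import base64
import json


def audio_append_event(pcm_chunk: bytes) -> str:
    """Hypothetical client message frame for a realtime speech-to-text
    session: raw PCM audio is base64-encoded into a typed JSON event
    and sent over the WebSocket."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })
```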
📈 Development Activity Observations
- Diverse contributors: active contributors include engineers from AMD (@tjtanaa, @mgehre-amd), Intel (@jikunshang), Red Hat (@robertgshaw2-redhat), Meta/Facebook (@jaewonlee-fb, @zou3519), and OpenAI (@noooop), alongside many independent developers.
- Fast issue response: for model crashes and configuration questions, core maintainers (e.g. @NickLucche, @hmellor, @noooop) and upstream model representatives (e.g. Mistral's @juliendenize) stepped in quickly with fixes or clear guidance.
- Reviews focused on architecture and quality: in the environment-variable and API-refactoring discussions, maintainers weighed not just functionality but code structure, long-term maintenance cost, and consistency with project principles.
💡 Issues Worth Watching
- Qwen3-VL-Embedding accuracy: two independent issues report the same problem, suggesting either a widespread bug or a usability trap. Interim guidance exists, but an official root-cause investigation and a more durable fix are needed.
- Timing of the AMD AITER env-var refactor: the work is gated on #32358 (IR Ops); its progress will pace the modernization of AMD optimization configuration.
- Intel XPU kernel migration proposal: a high-impact architectural change that needs broad community feedback; watch the execution of its migration plan (starting after v0.15.0, targeting completion in v0.16.0).
- Long-term environment-variable cleanup: the project has committed to reducing environment variables (#25700); similar configuration requests will face strict scrutiny, and developers should prefer existing CLI options or config files.
📋 Appendix: Detailed Data
New Issues
- #33167 [Usage]: Help: online serving of the qwen3-vl-Embedding model with vLLM produces results inconsistent with offline transformers calls; why? vllm=0.14.0 — usage — by July-shisan (created: 2026-01-27 20:30 (UTC+8))
- #33204 [Bug]: Qwen3-VL-Embedding model produces different embeddings than official qwen_vl_utils implementation — bug — by namburiamit (created: 2026-01-28 07:14 (UTC+8))
- #33215 [Bug]: DeepSeek V3.2 `tool_choice=="required"` in thinking mode gives internal server error. — bug — by clovisNyu (created: 2026-01-28 09:26 (UTC+8))
- #33214 [RFC]: XPU kernel migration to vllm-xpu-kernels — RFC — by jikunshang (created: 2026-01-28 09:22 (UTC+8))
- #33210 [Bug]: gpt-oss chat format mismatch with HF apply_chat_template — bug — by sgunasekar (created: 2026-01-28 08:22 (UTC+8))
- #33147 [Bug]: vLLM 0.11 SSL Initialization Failure (ssl.SSLError: [CRYPTO] unknown error) on MicroOS 5.5 with FIPS Enabled — bug — by fyuan1316 (created: 2026-01-27 15:01 (UTC+8))
- #33169 [Feature]: Add environment variable support for configuring NIXL disaggregation backend — feature request — by erezzarum (created: 2026-01-27 21:38 (UTC+8))
- #33170 [Performance]: NIXL telemetry throughput does not consider transfer overlapping — performance,kv-connector — by Spycsh (created: 2026-01-27 22:25 (UTC+8))
- #33175 [Bug]: Mistral-Small-3.1-24B crashes on startup — bug — by NickLucche (created: 2026-01-27 23:11 (UTC+8))
- #33161 [Doc]: Kubernetes deployment in CPU mode fails (No CUDA..) — documentation — by Josca (created: 2026-01-27 18:17 (UTC+8))
- #33163 [Feature][AMD][ROCm] Refactor VLLM_ROCM_USE_AITER env vars to config — feature request,rocm — by markmc (created: 2026-01-27 19:13 (UTC+8))
- #33140 [Usage]: How to enable the grpc api with vllm serve? — usage — by MaoJianwei (created: 2026-01-27 11:55 (UTC+8))
- #33155 [Bug]: gptoss120B is OOM on one H100 after upgrading from v0.11.2 to v0.14.1 — bug — by tonyaw (created: 2026-01-27 16:42 (UTC+8))
- #33153 [CI Failure]: Blackwell Test failed for cudaErrorDevicesUnavailable — ci-failure — by pacoxu (created: 2026-01-27 16:37 (UTC+8))
- #33144 [Bug]: PD report xpxd with deepseekv32 fp4 Assertion error kv.second.dim()==1 — bug — by nicole-lihui (created: 2026-01-27 13:47 (UTC+8))
- #33143 [Bug]: Triton MLIR Error when attempting to run gpt-oss-20b — bug,rocm — by Calandracas606 (created: 2026-01-27 12:32 (UTC+8))
- #33142 [Feature]: Tequila/Sherry/BitNet and Ternary support — feature request — by TomLucidor (created: 2026-01-27 12:21 (UTC+8))
Closed Issues
- #24198 [Feature][gpt-oss]: Browser Tool Test Enhancement — feature request,stale,gpt-oss — by heheda12345 (closed: 2026-01-28 10:15 (UTC+8))
- #32959 [Bug]: KeyError: `merging_layer.weight` when loading Mistral/vision-enabled checkpoints after PR #32780 refactor — bug — by ms1design (closed: 2026-01-28 05:22 (UTC+8))
- #22843 [Bug]: vLLM (AsyncLLMEngine, LLM) engine initialization fails when using runai_streamer — bug — by jiangwu300 (closed: 2026-01-28 02:08 (UTC+8))
- #32910 [Doc]: Having default of "None" in docs for serve params is not helpful, because there is actually a default — documentation — by jcowles (closed: 2026-01-28 00:47 (UTC+8))
- #32665 [Bug]: [DeepSeek-V3.2] PD reports `NotImplementedError` — bug — by kebe7jun (closed: 2026-01-27 18:44 (UTC+8))
- #33144 [Bug]: PD report xpxd with deepseekv32 fp4 Assertion error kv.second.dim()==1 — bug — by nicole-lihui (closed: 2026-01-27 14:02 (UTC+8))
- #32643 [Bug] FlashInfer >=0.6.0 TypeError: non_blocking must be bool during CUDA graph capture — no labels — by amanwalksdownthestreet (closed: 2026-01-27 12:32 (UTC+8))
- #29739 [Feature]: GGUF model with architecture qwen3vlmoe is not supported yet. — feature request — by (closed: 2026-01-27 11:56 (UTC+8))
New PRs
- #33181 [reasoning parser] code clean — structured-output,v1 — by andyxning (created: 2026-01-28 00:04 (UTC+8))
- #33201 Refactor NVFP4 Linear utils for ModelOpt and CT — nvidia — by mgoin (created: 2026-01-28 06:18 (UTC+8))
- #33212 support returning tokenids in responses api — frontend — by cmunley1 (created: 2026-01-28 08:54 (UTC+8))
- #33141 Fix tool call indexing double-counting — frontend,ready — by wangln19 (created: 2026-01-27 12:19 (UTC+8))
- #33165 [Model] Support DeepSeek-OCR-2 — documentation,new-model,deepseek — by LiuLi1998 (created: 2026-01-27 20:00 (UTC+8))
- #33208 Don't use `min_pixels`/`max_pixels` from Qwen2VL's processor — ready,qwen — by hmellor (created: 2026-01-28 08:16 (UTC+8))
- #33199 [CI] Enable mypy import following for `vllm/compilation` — ready — by hmellor (created: 2026-01-28 05:41 (UTC+8))
- #33186 [Docs] Use definition lists for CLI reference docs — documentation,ready — by hmellor (created: 2026-01-28 01:10 (UTC+8))
- #33191 Add flake8-implicit-str-concat rules to Ruff — documentation,performance,frontend,ready,ci/build,v1,tool-calling,llama,deepseek,kv-connector — by hmellor (created: 2026-01-28 04:08 (UTC+8))
- #33209 [ez] Remove checks for torch version <= 2.8 — no labels — by angelayi (created: 2026-01-28 08:18 (UTC+8))
- #33211 [docs] Improve tlparse section — documentation,ready — by angelayi (created: 2026-01-28 08:28 (UTC+8))
- #33156 [Release] [CI] Optim release pipeline — rocm,ci/build — by tjtanaa (created: 2026-01-27 16:57 (UTC+8))
- #33213 [ez] Add structured torch.compile logs — needs-rebase — by angelayi (created: 2026-01-28 09:20 (UTC+8))
- #33207 [WIP] PCP simplification — rocm,speculative-decoding,v1,nvidia — by LucasWilkinson (created: 2026-01-28 08:07 (UTC+8))
- #33151 [CI] minor fixes to pipeline generator and tests — ci/build,ready-run-all-tests — by khluu (created: 2026-01-27 15:57 (UTC+8))
- #33202 Relax protobuf library version constraints — frontend,ready,ci/build — by jeffreywang-anyscale (created: 2026-01-28 07:08 (UTC+8))
- #33206 [XPU] minor fix fp8 online quantization — no labels — by yma11 (created: 2026-01-28 07:52 (UTC+8))
- #33173 [Quantization][ROCm] Fix MoE weight loading to be robust (Qwen3_MoE/Qwen3_next as example models) — rocm,qwen — by xuebwang-amd (created: 2026-01-27 22:38 (UTC+8))
- #33145 inital commit — no labels — by robertgshaw2-redhat (created: 2026-01-27 13:49 (UTC+8))
- #33205 Make `mypy` opt-out instead of opt-in — v1 — by hmellor (created: 2026-01-28 07:26 (UTC+8))
- #33184 [torch.compile] Speed up MOE handling in forward_context — ready — by zou3519 (created: 2026-01-28 00:33 (UTC+8))
- #33203 [Kernel] [Helion] Helion kernel registry — no labels — by gmagogsfm (created: 2026-01-28 07:08 (UTC+8))
- #33187 [Realtime API] Adds minimal realtime API based on websockets — frontend,v1 — by patrickvonplaten (created: 2026-01-28 02:37 (UTC+8))
- #33198 [Core] Profiler improvements and lazy initialization — frontend,v1,cpu — by jaewonlee-fb (created: 2026-01-28 05:41 (UTC+8))
- #33200 unpack zeros for asymmetric compressed-tensors W4A16 (ConchLinearKernel) — no labels — by mgehre-amd (created: 2026-01-28 06:06 (UTC+8))
- #33189 [Misc][Build] Lazy load cv2 in nemotron_parse.py — bug,ready — by kiersten-stokes (created: 2026-01-28 03:29 (UTC+8))
- #33192 [Bugfix] Disable TRTLLM attention when KV transfer is enabled — bug,v1,nvidia — by ZhanqiuHu (created: 2026-01-28 04:10 (UTC+8))
- #33183 [Benchmark] Add startup benchmarking to buildkite run — performance,ci/build — by desertfire (created: 2026-01-28 00:25 (UTC+8))
- #33195 [Core] Add sleep level 0 mode with enqueue/wait pattern — frontend,v1 — by jaewonlee-fb (created: 2026-01-28 04:57 (UTC+8))
- #33197 use 'max_active_experts' for moe lora input size — no labels — by gnovack (created: 2026-01-28 05:22 (UTC+8))
- #33196 [BugFix] Handle dict values in YAML config by converting to JSON in config file — bug — by wangluochao902 (created: 2026-01-28 05:12 (UTC+8))
- #33194 [Hybrid] Simplify mamba block size logic for `all` mode — v1 — by tdoublep (created: 2026-01-28 04:29 (UTC+8))
- #33193 [UX] Enable nested configs in config yaml files — no labels — by mgoin (created: 2026-01-28 04:25 (UTC+8))
- #33190 [Feature] Add API↔Engine context propagation for journey tracing (PR #7/9) — documentation,frontend,ci/build,v1 — by sriumcp (created: 2026-01-28 03:47 (UTC+8))
- #33188 Fix merge conflict. — frontend,ci/build,v1,llama,qwen,cpu — by gtnaila (created: 2026-01-28 03:24 (UTC+8))
- #33177 [Attention] Use `has_flashinfer` helper — ready — by MatthewBonanni (created: 2026-01-27 23:29 (UTC+8))
- #33176 [EPLB] Add alternative communication for EPLB weight exchange — ci/build — by ilmarkov (created: 2026-01-27 23:17 (UTC+8))
- #33179 [Bugfix][Hardware][AMD] Fix FP8 dtype for gfx950 (MI325X/MI355X) — bug,rocm — by c0de128 (created: 2026-01-28 00:02 (UTC+8))
- #33185 [Core] Add VLLM_NIXL_DISAGGREGATION_BACKEND env var for NIXL backend selection — v1,kv-connector — by algebra-MCX (created: 2026-01-28 01:06 (UTC+8))
- #33157 [Misc] Cleanup Kimi-K2.5's vision chunk modality entrypoints — frontend,multi-modality — by Isotr0py (created: 2026-01-27 17:22 (UTC+8))
- #33180 [Bugfix][Hardware][AMD] Fix MXFP4 MoE backend selection for RDNA3 (gfx1100) — bug,rocm — by c0de128 (created: 2026-01-28 00:03 (UTC+8))
- #33162 Fix weight mapping test for Transfomers v5 — ready,multi-modality — by hmellor (created: 2026-01-27 18:24 (UTC+8))
- #33182 [Feature] Add API parent span lifecycle management (PR #6/9) — documentation,frontend,ci/build,v1 — by sriumcp (created: 2026-01-28 00:04 (UTC+8))
- #33178 [Quantization] - Consolidate experts_int8 with FP8 Modular Kernels — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by Josephasafg (created: 2026-01-27 23:39 (UTC+8))
- #33158 [Frontend] Cleanup api server — frontend,ready — by noooop (created: 2026-01-27 17:23 (UTC+8))
- #33168 Add script to benchmark Mamba for optimal max_num_seqs — performance — by danisereb (created: 2026-01-27 20:52 (UTC+8))
- #33148 [Bugfix] Fix xgrammar cleanup leakage — bug,structured-output,v1 — by DamonJiang777 (created: 2026-01-27 15:04 (UTC+8))
- #33174 Add support for Mistral Large 3 inference with Flashinfer MoE — ci/build,deepseek,nvidia — by dbari (created: 2026-01-27 22:40 (UTC+8))
- #33172 [Feature] Add prefix-aware routing for data parallel load balancing — documentation,needs-rebase,v1 — by toslali-ibm (created: 2026-01-27 22:31 (UTC+8))
- #33171 [Bugfix] Upgrade Numba to fix segfault on CUDA drivers version 580 — bug,rocm,ci/build,cpu,nvidia — by tlrmchlsmth (created: 2026-01-27 22:29 (UTC+8))
- #33160 fix:paddle ocr infinite inference bug — bug,v1 — by bellkjtt (created: 2026-01-27 18:15 (UTC+8))
- #33164 [Bugfix] Disable CG for Whisper+FA2 — bug,ready,v1 — by NickLucche (created: 2026-01-27 19:32 (UTC+8))
- #33152 [PluggableLayer][2/N] Apply PluggableLayer to linear layers — ready,ready-run-all-tests — by whx-sjtu (created: 2026-01-27 16:09 (UTC+8))
- #33139 [Frontend] Frontend will only attach supported tasks corresponding entrypoints. — frontend,ready — by noooop (created: 2026-01-27 11:46 (UTC+8))
- #33159 [Feature]: Container image WORKDIR consistency — ready,ci/build,cpu — by SouthWest7 (created: 2026-01-27 17:24 (UTC+8))
- #33166 Update README.md — documentation — by chaosisnotopen (created: 2026-01-27 20:17 (UTC+8))
- #33138 [code clean] remove duplicated code — frontend — by andyxning (created: 2026-01-27 11:11 (UTC+8))
- #33146 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,nvidia — by vllmellm (created: 2026-01-27 14:19 (UTC+8))
- #33154 [CI] Split responses API MCP tests into separate CI step — ci/build — by pacoxu (created: 2026-01-27 16:41 (UTC+8))
- #33150 [CI] fix(tests): avoid Open-Meteo API timeout in test_function_calling_with_stream — no labels — by pacoxu (created: 2026-01-27 15:41 (UTC+8))
- #33149 [BugFix] Fix minimax_m2 tool call parser for stream_interval > 1 — bug,tool-calling — by MrIceCreamMan (created: 2026-01-27 15:19 (UTC+8))
Merged PRs
- #33186 [Docs] Use definition lists for CLI reference docs — documentation,ready — by hmellor (merged: 2026-01-28 10:22 (UTC+8))
- #33211 [docs] Improve tlparse section — documentation,ready — by angelayi (merged: 2026-01-28 10:07 (UTC+8))
- #33151 [CI] minor fixes to pipeline generator and tests — ci/build,ready-run-all-tests — by khluu (merged: 2026-01-28 09:04 (UTC+8))
- #33059 [Model Runner V2] Use a different stream for grammar bitmask h2d copy — v1 — by WoosukKwon (merged: 2026-01-28 08:37 (UTC+8))
- #26835 Add attention benchmarking tools — performance,ready,nvidia — by MatthewBonanni (merged: 2026-01-28 08:09 (UTC+8))
- #33184 [torch.compile] Speed up MOE handling in forward_context — ready — by zou3519 (merged: 2026-01-28 07:17 (UTC+8))
- #33102 [Perf] Optimize dcp allocate tensor — ready,v1 — by yewentao256 (merged: 2026-01-28 06:24 (UTC+8))
- #33020 [Bugfix] Fix display error (inconsistent with context) — bug,ready — by lingebeng (merged: 2026-01-28 04:33 (UTC+8))
- #32719 Enabling "2 node" distributed tests in the AMD CI pipeline. — rocm,ready,ci/build — by Alexei-V-Ivanov-AMD (merged: 2026-01-28 03:13 (UTC+8))
- #33177 [Attention] Use `has_flashinfer` helper — ready — by MatthewBonanni (merged: 2026-01-28 02:33 (UTC+8))
- #33035 feature: support eagle3 for HunyuanVL & Hunyuan — speculative-decoding,ready,v1 — by irisliu10 (merged: 2026-01-28 01:55 (UTC+8))
- #33082 [Doc] Improve serve parameter documentation with meaningful defaults — documentation,ready — by karanb192 (merged: 2026-01-28 01:19 (UTC+8))
- #32909 [CI][Pooling] Stabilize ModernBERT test — ready — by AndreasKaratzas (merged: 2026-01-27 13:26 (UTC+8))
- #33162 Fix weight mapping test for Transfomers v5 — ready,multi-modality — by hmellor (merged: 2026-01-27 20:30 (UTC+8))
- #33076 Support compress-tensors with nvfp4 or fp8 weights and modelopt with nvfp4 weights on Turing — ready — by ir1ka (merged: 2026-01-28 00:04 (UTC+8))
- #33037 [BugFix] Fix P/D with non-MoE DP — bug,ready,v1,kv-connector — by njhill (merged: 2026-01-28 00:03 (UTC+8))
- #32549 Support heterogeneous NemotronHPuzzle model — new-model,ready — by danielafrimi (merged: 2026-01-27 23:55 (UTC+8))
- #32265 [LoRA][Spec Decode] Support LoRA for Nemotron-H MTP models — ready — by danisereb (merged: 2026-01-27 23:53 (UTC+8))
- #32064 [5/N][Attention] Finish eliminating `vllm/attention` folder — documentation,rocm,speculative-decoding,ready,ci/build,v1,llama,qwen,deepseek,gpt-oss — by MatthewBonanni (merged: 2026-01-27 23:02 (UTC+8))
- #33158 [Frontend] Cleanup api server — frontend,ready — by noooop (merged: 2026-01-27 23:18 (UTC+8))
- #33045 [Metrics][MFU] Fix UnembedMetrics FLOP overcounting for prefill (#33045) — ready,v1,meta-exported,fb-exported — by omkhalil (merged: 2026-01-27 23:16 (UTC+8))
- #33090 [Bugfix] Fix DeepseekV32 `AssertionError: num_kv_heads == 1` — bug,ready,deepseek,kv-connector — by NickLucche (merged: 2026-01-27 23:03 (UTC+8))
- #33164 [Bugfix] Disable CG for Whisper+FA2 — bug,ready,v1 — by NickLucche (merged: 2026-01-27 21:46 (UTC+8))
- #27942 [Metrics] [KVConnector] Add Offloading Connector metrics — ready,v1,kv-connector — by omerpaz95 (merged: 2026-01-27 21:34 (UTC+8))
- #33139 [Frontend] Frontend will only attach supported tasks corresponding entrypoints. — frontend,ready — by noooop (merged: 2026-01-27 20:15 (UTC+8))
- #32042 [AMD][QWEN3-NEXT] FP8 Tunings — performance,rocm,ready,qwen — by draftbk (merged: 2026-01-27 17:34 (UTC+8))
- #33131 [Models] Kimi-K2.5 — documentation,new-model,frontend,ready,multi-modality — by ywang96 (merged: 2026-01-27 14:50 (UTC+8))
- #32976 [AMD][Kernel][BugFix] Use correct scale in concat_and_cache_ds_mla_kernel when on gfx942 — bug,rocm,ready — by rasmith (merged: 2026-01-27 15:16 (UTC+8))
- #33135 [code clean] remove duplicate code — structured-output,ready,v1 — by andyxning (merged: 2026-01-27 12:57 (UTC+8))
- #33103 [Frontend] Cleanup serving engine — frontend,ready — by DarkLight1337 (merged: 2026-01-27 12:47 (UTC+8))
- #33113 [torch.compile] Stop assuming 32 bit indexing — ready — by zou3519 (merged: 2026-01-27 12:25 (UTC+8))
- #33101 [Frontend] Reduce mixin usage in serving pooling — frontend,ready — by DarkLight1337 (merged: 2026-01-27 11:50 (UTC+8))
- #33064 [Perf] avoid duplicate mem_get_info() call in get_current_memory_usage — rocm,ready — by pacoxu (merged: 2026-01-27 11:45 (UTC+8))
- #33109 [DOC]: Add warning about max_num_batched_tokens and max_model_len when chunked prefill is disabled — documentation,ready — by VincentG1234 (merged: 2026-01-27 11:05 (UTC+8))
PRs Closed Without Merging
- #32994 [Bugfix][Core] Fix use audio in video bug — bug,needs-rebase,v1,qwen — by xsank (closed: 2026-01-28 10:01 (UTC+8))
- #33206 [XPU] minor fix fp8 online quantization — no labels — by yma11 (closed: 2026-01-28 08:44 (UTC+8))
- #33008 [BugFix] KeyError when loading Mistral/vision-enabled checkpoints — bug — by ms1design (closed: 2026-01-28 05:22 (UTC+8))
- #26482 [CI] Fix mypy for `vllm/attention` and `vllm/compilation` — tpu,ready,needs-rebase,v1,cpu,nvidia — by hmellor (closed: 2026-01-28 04:30 (UTC+8))
- #23507 feat: Add TPU v6e architecture-adaptive attention backend — documentation,tpu,unstale,v1 — by Tar-ive (closed: 2026-01-28 05:09 (UTC+8))
- #32961 [Bugfix][Kernel] Rename NoEP to NoDPEP and fix NvFP4 MoE expert routing — bug,documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build — by aviralgarg05 (closed: 2026-01-28 04:14 (UTC+8))
- #33190 [Feature] Add API↔Engine context propagation for journey tracing (PR #7/9) — documentation,frontend,ci/build,v1 — by sriumcp (closed: 2026-01-28 03:48 (UTC+8))
- #33114 [Draft / POC] Websocket realtime api — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by patrickvonplaten (closed: 2026-01-28 02:37 (UTC+8))
- #33179 [Bugfix][Hardware][AMD] Fix FP8 dtype for gfx950 (MI325X/MI355X) — bug,rocm — by c0de128 (closed: 2026-01-28 00:14 (UTC+8))
- #31736 Update how docs are rendered for cli reference — documentation — by ashwin-phadke (closed: 2026-01-28 01:10 (UTC+8))
- #33182 [Feature] Add API parent span lifecycle management (PR #6/9) — documentation,frontend,ci/build,v1 — by sriumcp (closed: 2026-01-28 00:04 (UTC+8))
- #30208 Feat/support nemotron h mtp wip — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by shaharmor98 (closed: 2026-01-27 23:00 (UTC+8))
- #28541 Feat/support nemotron h mtp — new-model,speculative-decoding,needs-rebase,v1,qwen — by shaharmor98 (closed: 2026-01-27 23:08 (UTC+8))
- #31092 Feat/support nemotron h mtp upstream — new-model,speculative-decoding,needs-rebase,v1 — by shaharmor98 (closed: 2026-01-27 22:57 (UTC+8))
- #29884 DRAFT: Mistral large 3 Extended Blackwell Support — performance,ready,needs-rebase,ci/build,v1,deepseek,nvidia — by hypdeb (closed: 2026-01-27 22:53 (UTC+8))
- #33172 [Feature] Add prefix-aware routing for data parallel load balancing — documentation,needs-rebase,v1 — by toslali-ibm (closed: 2026-01-27 22:32 (UTC+8))
- #33160 fix:paddle ocr infinite inference bug — bug,v1 — by bellkjtt (closed: 2026-01-27 22:03 (UTC+8))
- #30672 [Model][Last/N] Improve all pooling task Generate runner supports using embed and token_embed tasks. — frontend,v1 — by noooop (closed: 2026-01-27 21:20 (UTC+8))
- #33086 [Bugfix] Fix DeepseekV32 AssertionError: num_kv_heads == 1 — bug,v1,deepseek — by chaunceyjiang (closed: 2026-01-27 21:05 (UTC+8))
- #32543 refactor: extract KV cache update logic into method in RocmAttention — rocm,v1 — by Mohit-Gaur (closed: 2026-01-27 16:41 (UTC+8))
- #31228 Cleanup basic and entrypoint test organisation — ready,ci/build,tool-calling,llama,gpt-oss — by hmellor (closed: 2026-01-27 16:11 (UTC+8))
- #32515 [Bugfix] Add missing o_data_type parameter for FlashInfer >=0.6.0 compatibility — bug,needs-rebase,v1,nvidia — by amanwalksdownthestreet (closed: 2026-01-27 12:32 (UTC+8))
- #33136 [Feature] Emit journey events to core spans (PR #4/9) — documentation,needs-rebase,ci/build,v1 — by sriumcp (closed: 2026-01-27 11:00 (UTC+8))