vLLM Development Activity Report - 2026-02-14

Time window: 2026-02-14 11:37 (UTC+8) ~ 2026-02-15 11:37 (UTC+8)
Statistics: New Issues 3 | Closed Issues 27 | New PRs 22 | Merged PRs 13 | PRs Closed Without Merging 10

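📊 Daily Development Summary

During this window (2026-02-14 to 02-15), the vLLM community remained highly active: it cleared a large backlog of stale issues (27 closed) and kept feature development and optimization moving (22 new PRs, 13 merged). Development focused on performance optimization (such as activation functions and CPU weight offloading), broader model support (especially multimodal retrieval models), and cross-platform compatibility fixes (AMD ROCm in particular), reflecting continued investment in both core inference performance and ecosystem breadth.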

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity in this window was mostly bug fixes and test stability work; no major new features were introduced.

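Closed AMD-related issues:
- Issue #34376 & #34073: User amd-hhashemi (an AMD employee) reported two build errors caused by earlier PRs (#33731 and #33758), involving Triton compilation and aiohttp compilation respectively. Both were closed quickly after a maintainer (vllmellm) offered a fix (for example, making sure the ROCm dependencies are installed correctly), which reflects ongoing maintenance of the build process on AMD platforms.
- Issue #17837 (Add MXFP6 Quantization Format): This feature request was implemented and merged by AMD employee fxmarty-amd via PR #21166. The issue was auto-closed in this window after a long period of inactivity. Before it closed, fxmarty-amd noted that there is still room to integrate optimized MXFP6 kernels for various hardware, including AMD. This points to AMD's continued involvement in vLLM's quantization ecosystem.

AMD-related PR work:
- PR #34567 ([CI] Fix ColBERT HF comparison tests on AMD CI + refactor): The main goal is a CI test fix, but because the failures occurred on AMD CI the PR carries the rocm label. It uses conditional imports and dynamic parametrization to fix a test collection failure caused by the flashinfer library being unavailable on ROCm, making tests on the ROCm platform more robust.
- PR #34543 ([Bugfix] Fix ROCm UVA CPU weight offloading broken by #32993): This is a key fix. PR #32993 originally added UVA (Unified Virtual Addressing) CPU weight offloading support for CUDA/XPU but omitted a platform check for ROCm, which broke the feature on AMD GPUs. This PR fixes it by adding an is_rocm() check, so the ROCm platform gets the same CPU weight offloading performance and memory optimization.
- PR #34549 ([Misc] Optimized check to encapsulate both CUDA and ROCm platforms): A code cleanup PR that replaces an explicit is_cuda() or is_rocm() check with the existing is_cuda_alike() helper, improving readability and future maintainability.
- PR #34570 ([ROCm][AITER] Fix aiter paged_attention_v1 decode for sliding window and head_size < 64): Fixes a regression introduced by PR #34378 that produced incorrect decode output on the ROCm AITER backend for models using a sliding window (such as Mixtral) or head_size < 64. The fix is essential for inference correctness of these model architectures on AMD platforms.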
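
The platform-check pattern behind #34543 and #34549 can be sketched in isolation. The snippet below is a minimal illustration, not vLLM's code: `PlatformKind`, `Platform`, `is_xpu()` and both `uva_offload_supported_*` functions are hypothetical names invented here, and only `is_cuda()`, `is_rocm()` and `is_cuda_alike()` are taken from the PR descriptions above. It also folds the two PRs into one example for brevity.

```python
from enum import Enum


class PlatformKind(Enum):
    CUDA = "cuda"
    ROCM = "rocm"
    XPU = "xpu"
    CPU = "cpu"


class Platform:
    """Hypothetical stand-in for a platform descriptor (not vLLM's class)."""

    def __init__(self, kind: PlatformKind) -> None:
        self.kind = kind

    def is_cuda(self) -> bool:
        return self.kind is PlatformKind.CUDA

    def is_rocm(self) -> bool:
        return self.kind is PlatformKind.ROCM

    def is_xpu(self) -> bool:
        # Assumed helper, added only so the example can mention XPU.
        return self.kind is PlatformKind.XPU

    def is_cuda_alike(self) -> bool:
        # One helper that covers both CUDA and ROCm.
        return self.is_cuda() or self.is_rocm()


def uva_offload_supported_before(platform: Platform) -> bool:
    # Shape of the regression: ROCm is missing from the check.
    return platform.is_cuda() or platform.is_xpu()


def uva_offload_supported_after(platform: Platform) -> bool:
    # Shape of the fix: ROCm is covered through is_cuda_alike().
    return platform.is_cuda_alike() or platform.is_xpu()
```

Routing every "CUDA or ROCm" decision through a single helper means a new feature gate cannot leave out one of the two platforms by accident, which is the class of bug #34543 had to repair.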

💬 High-Traffic Discussion Analysis

The stale issues closed in this window include several historically popular discussions.

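Issue #14452: [Doc]: Steps to run vLLM on your RTX5080 or 5090! (30+ comments)
- Core topic: Users shared and discussed detailed steps for running vLLM on Blackwell GPUs (RTX 5080/5090) before official CUDA releases fully supported them.
- Viewpoints:
  - Original author (@pavanimajety): provided a complete recipe that starts from a specific NGC container and builds vLLM from source, aimed at testing the latest features on the main branch.
  - Another user (@shahizat): offered a quicker alternative, NVIDIA Triton Server's prebuilt vLLM container, while noting that it may not carry the latest vLLM code.
  - Later focus: the thread shifted to the inability to use P2P communication across multiple RTX 5090 cards for multi-GPU inference. Users tried several approaches (building the latest NCCL, setting environment variables) without success and suspected a GPU hardware or driver-level problem.
- Conclusion: The issue served as a valuable piece of community documentation for nearly a year before being auto-closed as stale. The discussion shows the community's self-driven effort to work through compatibility problems during early support for new hardware.

Issue #22265: [New Model]: OpenAI OSS model support (30+ comments)
- Core topic: After OpenAI released its open-weight model (GPT-OSS), the community urgently asked about vLLM support, especially for tool calling.
- Discussion:
  - Early confirmation: users quickly pointed out that the official vLLM Docker image already provided support and that the official OpenAI Cookbook confirmed tool calling support.
  - Problems surfaced: many users then reported the startup error AssertionError: Sinks are only supported in FlashAttention 3, which indicates compatibility problems in the initial integration.
  - Feature verification: meanwhile, users were checking tool calling parameters and actual behavior in detail.
- Conclusion: The issue shows how quickly the community follows up on and tests support for an important new model, and the early bugs that surface along the way. These problems were fixed in follow-up PRs.

Issue #27016: [Bug]: when use “Image Embedding Inputs” in openai api，have some bug (10+ comments)
- Core topic: A user hit an internal server error when using the multimodal "Image Embedding Inputs" API on vLLM 0.8.5.
- Viewpoints and troubleshooting:
  - User (@wwkww): provided detailed reproduction code showing that passing precomputed image embeddings as base64 fails.
  - Maintainer (@DarkLight1337): recommended upgrading to the latest version (v0.11.0), since an older release can contain unknown bugs even in features it nominally supports, and asked for the complete server-side logs for deeper investigation.
- Conclusion: As multimodal features grow more complex, users on older versions can run into problems that are hard to diagnose, and the maintainers' first recommendation is to upgrade to the latest release, which contains many fixes. This underlines the value of staying current in an open-source project.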

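🔥 Hot Topics and Trends

Model support keeps expanding, especially for multimodal and retrieval models: ColQwen3 (PR #34398) was merged in this window and ColModernVBERT (PR #34558) was opened as a new PR; both add multimodal retrieval model support. Other new PRs touch GLM-5 MTP and Token routed models. This shows strong community interest in, and willingness to contribute, support for advanced model architectures, particularly those with retrieval and multimodal capabilities.
Performance optimization reaches lower levels: Several PRs target micro-level gains, for example PR #34562 (SiLU early exit, aimed at MoE expert-parallel scenarios) and PR #34535 (selective CPU weight offloading, aimed at low-concurrency MoE workloads). Performance work is moving toward finer-grained, scenario-specific optimization.
Quantization ecosystem diversifies: New PR #34556 starts integrating the Humming quantization kernel (currently WIP), a quantization scheme distinct from Marlin. Together with the closed MXFP6 issue, this suggests that vLLM's quantization backend ecosystem is attracting more third-party integrations that aim to offer more choice and potential performance gains.
Frontend and architecture decoupling: New PR #34551 proposes a significant new concept, the vllm online command, which launches a GPU-less preprocessing service (tokenization and chat template rendering only). This fits the cloud-native and disaggregated serving trend and allows frontend load to be physically separated from the GPU inference engine. Its naming and design are still under discussion, but the direction deserves close attention.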
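
The SiLU early-exit idea mentioned for PR #34562 rests on a simple fact: SiLU(0) is exactly 0, so an all-zero input needs no arithmetic. The sketch below illustrates that fact in plain Python under stated assumptions. It is not the PR's kernel code; `silu_rows` is a hypothetical helper, and the idea that the zero rows are padding in an expert-parallel batch is an assumption, since the source gives only the PR title and the MoE context.

```python
import math


def silu(x: float) -> float:
    # SiLU(x) = x * sigmoid(x), written in a numerically stable form.
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def silu_rows(rows: list[list[float]]) -> tuple[list[list[float]], int]:
    """Apply SiLU row by row, skipping rows that are entirely zero.

    Returns the activated rows and the number of rows that were skipped.
    """
    out: list[list[float]] = []
    skipped = 0
    for row in rows:
        if not any(row):
            # Early exit: SiLU(0) == 0, so the output row is all zeros.
            out.append([0.0] * len(row))
            skipped += 1
            continue
        out.append([silu(v) for v in row])
    return out, skipped
```

The skip pays off only when zero rows are common enough to outweigh the cost of checking for them, which is why such an optimization is tied to a specific scenario.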

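🛠️ Key Technical Changes

PR #33997: [Hybrid] Enable mamba prefix cache "align" mode with async scheduling (merged)
- Technical summary: Fixes a cache-block management error, caused by inaccurate scheduler accounting, that affected hybrid Mamba-attention models (such as Qwen3-Coder-Next) when the prefix cache "align" mode was combined with async scheduling.
- Impact: More Mamba-capable models can safely use async scheduling together with prefix caching, improving inference performance and throughput.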
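PR #34535: [Feature][Perf] Support Selective CPU Weight Offloading (merged)
- Technical summary: On top of the existing global CPU weight offloading, adds selective offloading that matches parameter names against patterns (--cpu-offload-params).
- Impact: Users can choose precisely which parameters (for example, expert weights in MoE models) stay on the CPU. In low-concurrency scenarios this sharply reduces GPU memory use while avoiding the cost of moving every parameter across PCIe, giving a better cost-performance trade-off.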
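
Pattern-based selection of parameters can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the source says only that parameter names are matched against patterns, so the use of regular expressions, the function `split_params_for_offload`, and the parameter names in the usage note are all hypothetical.

```python
import re


def split_params_for_offload(param_names, patterns):
    """Partition parameter names into (offload_to_cpu, keep_on_gpu).

    A name is offloaded when any pattern matches somewhere in it.
    Regex matching is an assumption made for this sketch.
    """
    compiled = [re.compile(p) for p in patterns]
    offload, keep = [], []
    for name in param_names:
        if any(rx.search(name) for rx in compiled):
            offload.append(name)
        else:
            keep.append(name)
    return offload, keep
```

For example, with the made-up names `model.layers.0.self_attn.qkv_proj.weight` and `model.layers.0.mlp.experts.w2_weight`, the pattern `\.experts\.` selects only the second, so attention weights stay on the GPU while expert weights are the ones held on the CPU.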
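PR #34527: fix: use __annotations__ instead of get_type_hints() for dynamic kwargs detection (merged)
- Technical summary: Fixes a critical bug in the multimodal processor cache. The original code resolved type hints with get_type_hints(), which raises an exception in some forward-reference cases; the exception was swallowed silently, so the processor cache was never used.
- Impact: After the fix, processor loading time for multimodal models (such as Qwen-VL) drops from about 24 seconds per request to milliseconds, greatly improving end-to-end performance for complex requests such as video.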
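
The difference between the two lookups is easy to reproduce with the standard library. In the sketch below, `DemoProcessor` and both helper functions are hypothetical and only mimic the failure mode described above; they are not vLLM code. `typing.get_type_hints()` tries to evaluate string annotations and raises `NameError` when a name cannot be resolved, whereas `__annotations__` returns the raw annotations without evaluating them.

```python
from typing import get_type_hints


class DemoProcessor:
    """Hypothetical processor whose signature uses unresolvable forward refs."""

    def __call__(self, images: "NotImportedImage", fps: "NotImportedFps" = None) -> None:
        ...


def kwarg_names_via_type_hints(fn) -> set:
    try:
        hints = get_type_hints(fn)
    except Exception:
        # Swallowing the error hides the failure: the caller sees "no kwargs".
        return set()
    return set(hints) - {"return"}


def kwarg_names_via_annotations(fn) -> set:
    # __annotations__ holds the raw (possibly string) annotations, unevaluated.
    return set(getattr(fn, "__annotations__", {})) - {"return"}
```

Calling `kwarg_names_via_type_hints(DemoProcessor.__call__)` yields an empty set because the `NameError` is suppressed, while `kwarg_names_via_annotations(DemoProcessor.__call__)` still reports `images` and `fps`. When only the parameter names matter, reading `__annotations__` avoids depending on every referenced type being importable.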
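PR #34571: [Bugfix] Cap cudagraph capture sizes for Mamba/hybrid models (newly opened)
- Technical summary: For hybrid Mamba models, an assertion fails during CUDA Graph capture when memory limits leave fewer Mamba state cache blocks than the maximum capture batch size. This PR adds logic to adjust the capture sizes accordingly.
- Impact: Resolves a crash when CUDA Graph is enabled for hybrid Mamba models, making these models more stable on the optimized inference path.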
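
One plausible form of such a cap is shown below. The source does not describe the PR's actual logic, so this is only a sketch of the general idea: `cap_capture_sizes` and its arguments are hypothetical, and falling back to a single size equal to the block count is an assumption.

```python
def cap_capture_sizes(capture_sizes, num_mamba_cache_blocks):
    """Drop capture batch sizes that exceed the available Mamba state blocks."""
    if num_mamba_cache_blocks < 1:
        raise ValueError("need at least one Mamba state cache block")
    capped = sorted(s for s in set(capture_sizes) if s <= num_mamba_cache_blocks)
    if not capped:
        # Assumed fallback: capture one graph at the largest size that fits.
        capped = [num_mamba_cache_blocks]
    return capped
```

With capture sizes `[1, 2, 4, 8, 16, 32]` and 10 available blocks, only the sizes up to 8 are kept, so no graph is captured for a batch that could not be given a state block per sequence.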

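📈 Development Activity Observations

Efficient issue management: 27 issues were closed within 24 hours, far more than the 3 opened, showing active backlog cleanup. Most were "stale" issues with 90 days of inactivity, and automation (github-actions) played a major role.
Active code contribution: 22 PRs were opened and 13 merged, a merged-to-opened ratio of about 59% (13/22). Contributors include developers from companies such as Cohere (czhu-cohere) and AMD (amd-hhashemi, AndreasKaratzas and others working on ROCm issues), along with many independent contributors.
Cross-platform collaboration: ROCm fixes and test hardening (#34543, merged; #34567, newly opened) are being driven by contributors familiar with the AMD ecosystem, and #34543 was merged quickly, which shows the project's commitment to multi-hardware support and efficient collaboration.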

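💡 Issues Worth Watching

New bug reports:
- Issue #34561: The GLM-4.7-Flash-AWQ model fails to load with 'ColumnParallelLinear' object has no attribute 'weight'. This may be a new bug in the loading path for AWQ-quantized models.
- Issue #34553: GLM-5 FP8 on H200 GPUs runs out of memory in sparse_attn_indexer under high concurrency. This may point to a memory management problem in the sparse attention implementation that only shows up on high-end hardware under heavy load.
- Issue #34555: The --guided-decoding-backend argument is not recognized in the nightly build, which may indicate an API incompatibility or documentation that has not been updated.

Architecture evolution discussion: PR #34551 (the vllm online command) introduces a new pattern that decouples the frontend from the inference engine. Its naming, API design, and relationship to the existing --data-parallel-size-local 0 mode are under discussion among core maintainers (such as @njhill), and the outcome will shape the design of vLLM's future distributed serving architecture.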

📋 Appendix: Detailed Data Lists

New Issues

#34561 [Bug]: GLM-4.7-Flash-AWQ fails with AttributeError: ‘ColumnParallelLinear’ object has no attribute ‘weight’ | bug | by eugr (created: 2026-02-15 02:29 (UTC+8))
#34555 [BUG]: api_server.py: error: unrecognized arguments: --guided-decoding-backend | usage | by robcaulk (created: 2026-02-14 20:57 (UTC+8))
#34553 [Bug]: GLM-5 FP8 on H200 CUDA OOM in sparse_attn_indexer at High Concurrency | bug | by kimbochen (created: 2026-02-14 16:10 (UTC+8))

Closed Issues

#14452 [Doc]: Steps to run vLLM on your RTX5080 or 5090! | documentation,stale | by pavanimajety (closed: 2026-02-15 10:17 (UTC+8))
#17837 [Feature]: Add MXFP6 Quantization Format | feature request,stale | by zcuuu (closed: 2026-02-15 10:17 (UTC+8))
#19534 [Usage]: Error when building the XPU image using Dockerfile | usage,intel-gpu,stale | by Estella31 (closed: 2026-02-15 10:17 (UTC+8))
#22265 [New Model]: OpenAI OSS model support | stale | by ItzAmirreza (closed: 2026-02-15 10:17 (UTC+8))
#23905 [Bug]: gpt-oss-120b has high possibility to generate response as part of reasoning by using vllm v0.10.1 | bug,stale,gpt-oss | by tonyaw (closed: 2026-02-15 10:17 (UTC+8))
#24079 [Feature]: Add uccl as kvconnect provide | feature request,stale | by lengrongfu (closed: 2026-02-15 10:17 (UTC+8))
#25043 [Bug]: disaggregated_serving_p2p_nccl_xpyd prefill worker failed with “EngineCore failed to start”. | bug,stale | by juewAtAmazon (closed: 2026-02-15 10:16 (UTC+8))
#26083 [Bug]: GPT-OSS Tool Calls Fail in Stream Mode | bug,stale | by Yuyz0112 (closed: 2026-02-15 10:16 (UTC+8))
#26606 [Performance]: Lite-Profiler Framework for Model Analysis | performance,stale | by namanlalitnyu (closed: 2026-02-15 10:16 (UTC+8))
#26607 [Bug]: Since version 0.9.2 comes with nccl built-in, using PCIE causes sys errors. How to disable nccl in vllm for versions after 0.9.2? | bug,stale | by tina0852 (closed: 2026-02-15 10:16 (UTC+8))
#26616 [Usage]: How to enable MTP when using Qwen3-Next in local infer ( not vllm serve) | usage,stale | by Kimagure7 (closed: 2026-02-15 10:16 (UTC+8))
#26621 [Bug]: Encountered AssertionError and precision issues when enabling MTP in deepseek v3.1 | bug,stale | by WangQiangItachi (closed: 2026-02-15 10:16 (UTC+8))
#26720 [Bug]: Model fails to deploy: Illegal CUDA memory access with FP8 Qwen3-235b model in vLLM v0.11.0 | bug,stale | by Harshith-umesh (closed: 2026-02-15 10:16 (UTC+8))
#26865 [Bug]: XPU - granite-4.0-h fails to load due to CUDA assumption | bug,intel-gpu,stale | by DatCaptainHorse (closed: 2026-02-15 10:16 (UTC+8))
#26942 [Bug]: --decode-context-parallel deployment for DeepSeek-R1 fails on v0.11.0 but works on v0.10.2 | bug,stale | by Harshith-umesh (closed: 2026-02-15 10:16 (UTC+8))
#26974 [Bug]: failed to use vllm docker on jetson thor | bug,stale | by zzc0721 (closed: 2026-02-15 10:16 (UTC+8))
#26984 [Bug]: qwen2.5-vl vit fp8 scaled_mm_kernel kernel bug, | bug,stale | by WenMang98k (closed: 2026-02-15 10:16 (UTC+8))
#26995 [Bug]: deep seek v3 rope implement is not align with transformers deep seek v3 rope | bug,stale | by a32543254 (closed: 2026-02-15 10:16 (UTC+8))
#27002 [Bug]: XPU - Model loads fine but generation crashes | bug,intel-gpu,stale | by DatCaptainHorse (closed: 2026-02-15 10:16 (UTC+8))
#27006 [Usage]: In vLLM version 0.8.5, when I send an HTTP image URL directly, the model cannot recognize the image content, but it works correctly when I use a base64-encoded image. I’d like to understand why this happens. | usage,stale | by Lislttt (closed: 2026-02-15 10:16 (UTC+8))
#27011 [Usage]: Runnig GLM4.5-Air with Speculative Decoding | usage,stale | by aqx95 (closed: 2026-02-15 10:16 (UTC+8))
#27016 [Bug]: when use “Image Embedding Inputs” in openai api，have some bug | bug,stale | by wwkww (closed: 2026-02-15 10:16 (UTC+8))
#27044 [Bug]: set_model_tag has strange interactions with global state | bug,stale | by Lucaskabela (closed: 2026-02-15 10:16 (UTC+8))
#34376 [Bug]: #33731 is causing Triton build fails with gptoss | bug,rocm,gpt-oss | by amd-hhashemi (closed: 2026-02-14 14:02 (UTC+8))
#34073 [Bug]: #33758 is causing aiohttp/_websocket.c build error | bug,rocm | by amd-hhashemi (closed: 2026-02-14 14:01 (UTC+8))
#34283 [CI Failure]: Multi Modal Models (Extended) 1 | ci-failure | by varun-sundar-rabindranath (closed: 2026-02-14 12:04 (UTC+8))
#34225 [Bug]: maybe_serialize_tool_calls() fails to verify the tool_calls type | bug | by elklepo (closed: 2026-02-14 12:04 (UTC+8))

New PRs

#34562 [perf] silu early return for 0s | performance | by czhu-cohere (created: 2026-02-15 05:09 (UTC+8))
#34551 [Frontend] Add vllm online command for GPU-less preprocessing serving | frontend,v1,cpu | by hyeongyun0916 (created: 2026-02-14 15:14 (UTC+8))
#34571 [Bugfix] Cap cudagraph capture sizes for Mamba/hybrid models (#34094) | bug,v1,nvidia | by haosdent (created: 2026-02-15 10:56 (UTC+8))
#34563 [Model Runner V2] Minor cleanup for Sampler | v1 | by WoosukKwon (created: 2026-02-15 06:01 (UTC+8))
#34570 [ROCm][AITER] Fix aiter paged_attention_v1 decode for sliding window and head_size < 64 | rocm,v1 | by AndreasKaratzas (created: 2026-02-15 09:55 (UTC+8))
#34565 feat(whisper): add decoder prefix and custom task tokens for transcription API | frontend | by timon0305 (created: 2026-02-15 06:34 (UTC+8))
#34569 [CI] Write bake config to temp directory instead of repo root | ci/build | by amrmahdi (created: 2026-02-15 08:41 (UTC+8))
#34568 Fix early return sliding window | needs-rebase,v1 | by yunjiangster (created: 2026-02-15 08:33 (UTC+8))
#34567 [CI] Fix ColBERT HF comparison tests on AMD CI + refactor | rocm | by AndreasKaratzas (created: 2026-02-15 07:49 (UTC+8))
#34558 [New Model] Add ColModernVBERT | documentation,new-model,multi-modality | by athrael-soju (created: 2026-02-15 01:25 (UTC+8))
#34564 feat(metrics): add configurable Prometheus histogram buckets via CLI flags | v1 | by timon0305 (created: 2026-02-15 06:29 (UTC+8))
#34566 [CI] Stabilize metrics tests with polling and subprocess guards | no labels | by AndreasKaratzas (created: 2026-02-15 07:03 (UTC+8))
#34559 Token routed i64 | new-model | by Complexity-ML (created: 2026-02-15 01:49 (UTC+8))
#34556 [WIP][Quantization] add humming quantization kernel | no labels | by jinzhen-lin (created: 2026-02-14 22:05 (UTC+8))
#34560 [Renderer] Move InputPreprocessor into Renderer (2/2) | frontend,v1,multi-modality | by DarkLight1337 (created: 2026-02-15 02:16 (UTC+8))
#34552 [BugFix] Fix GLM-5 MTP not supporting num_speculative_tokens > 1 | bug,speculative-decoding,v1 | by LucasWilkinson (created: 2026-02-14 16:08 (UTC+8))
#34557 [Feature] Add ColModernVBERT model and configuration for multimodal retrieval | documentation,new-model,multi-modality | by athrael-soju (created: 2026-02-15 00:03 (UTC+8))
#34554 [Bugfix] Fix Qwen3.5 config loading | bug,qwen | by ywang96 (created: 2026-02-14 19:52 (UTC+8))
#34550 [CI] [Kernel] Rename Helion config key to match actual H100 SXM5 device name used in CI | no labels | by gmagogsfm (created: 2026-02-14 13:30 (UTC+8))
#34547 [Docs] Fix NCCL typo | documentation | by jiangkuaixue123 (created: 2026-02-14 11:39 (UTC+8))
#34549 [Misc] Optimized check to encapsulate both CUDA and ROCm platforms | rocm,nvidia | by AndreasKaratzas (created: 2026-02-14 12:14 (UTC+8))
#34548 [BugFix] Fix Python 3.13 FlashMLA import error | bug,ready,ci/build | by LucasWilkinson (created: 2026-02-14 11:52 (UTC+8))

Merged PRs

#34563 [Model Runner V2] Minor cleanup for Sampler | v1 | by WoosukKwon (merged: 2026-02-15 10:29 (UTC+8))
#33997 [Hybrid] Enable mamba prefix cache “align” mode with async scheduling | ready,v1 | by tdoublep (merged: 2026-02-15 05:15 (UTC+8))
#34538 [ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests | rocm,ready,v1 | by AndreasKaratzas (merged: 2026-02-14 23:32 (UTC+8))
#34510 [Renderer] Move InputPreprocessor into Renderer (1/2) | performance,frontend,ready,v1,multi-modality,llama | by DarkLight1337 (merged: 2026-02-15 02:14 (UTC+8))
#34554 [Bugfix] Fix Qwen3.5 config loading | bug,qwen | by ywang96 (merged: 2026-02-14 19:56 (UTC+8))
#34416 [Misc] Update tests and examples for Prithvi/Terratorch models | documentation,ready,ci/build,multi-modality | by christian-pinto (merged: 2026-02-14 15:03 (UTC+8))
#34398 [new model] add COLQwen3 code & Inference | documentation,new-model,ready,multi-modality,qwen | by craftsangjae (merged: 2026-02-14 12:15 (UTC+8))
#34294 [CI] Heavy refactoring of Voxtral multimodal audio model tests | rocm,ready,ci/build,v1,multi-modality | by AndreasKaratzas (merged: 2026-02-14 12:04 (UTC+8))
#34543 [Bugfix] Fix ROCm UVA CPU weight offloading broken by #32993 | bug,rocm,ready | by AndreasKaratzas (merged: 2026-02-14 12:01 (UTC+8))
#34438 Add explicit validation error for tool calls. | ready | by juliendenize (merged: 2026-02-14 12:04 (UTC+8))
#34527 fix: use __annotations__ instead of get_type_hints() for dynamic kwargs detection | ready | by perone (merged: 2026-02-14 12:03 (UTC+8))
#34533 [bug] Make sure get_modality_with_max_tokens is deterministic | bug,ready,multi-modality | by 842974287 (merged: 2026-02-14 12:02 (UTC+8))
#34535 [Feature][Perf] Support Selective CPU Weight Offloading | ready,v1,nvidia | by wzhao18 (merged: 2026-02-14 12:02 (UTC+8))

PRs Closed Without Merging

#33238 [Feature]: Kernels Attention Test cleanup | v1 | by SouthWest7 (closed: 2026-02-15 11:36 (UTC+8))
#20800 [Feature] Add command tool parser for Command-A model | documentation,frontend,ready,stale,tool-calling | by gjgjos (closed: 2026-02-15 10:17 (UTC+8))
#23201 [Misc][Feature] confidence based early stopping | stale,v1 | by Viol2000 (closed: 2026-02-15 10:17 (UTC+8))
#24998 [Frontend]: Allow the qwen reasoning parser to work without think tokens | structured-output,frontend,needs-rebase,stale,v1,qwen,deepseek,gpt-oss | by maxdebayser (closed: 2026-02-15 10:16 (UTC+8))
#25062 [V1][performance] modify the logic of check_stop in update_from_output | stale,v1 | by chengda-wu (closed: 2026-02-15 10:16 (UTC+8))
#26800 [Model] add kosmos2_5 for vllm | new-model,stale | by yugeeklab (closed: 2026-02-15 10:16 (UTC+8))
#26861 [Misc] Set process title and logging prefix early | tpu,needs-rebase,stale,v1 | by orionr (closed: 2026-02-15 10:16 (UTC+8))
#34346 Add group quantization support to fused FP8 RMSNorm quant kernels | no labels | by Bias92 (closed: 2026-02-15 04:09 (UTC+8))
#34557 [Feature] Add ColModernVBERT model and configuration for multimodal retrieval | documentation,new-model,multi-modality | by athrael-soju (closed: 2026-02-15 01:06 (UTC+8))
#34546 Add colqwen3 support to vLLM | documentation,new-model,needs-rebase,multi-modality,qwen | by athrael-soju (closed: 2026-02-14 16:26 (UTC+8))