vLLM Development Activity Report - 2026-02-08

Time window: 2026-02-08 11:37 (UTC+8) ~ 2026-02-09 11:37 (UTC+8)
Statistics: New Issues 8 | Closed Issues 25 | New PRs 29 | Merged PRs 10 | PRs closed without merging 12

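📊 Daily Development Summary

Between February 8 and 9, 2026, the vLLM project was highly active in both development and maintenance: 25 older Issues were closed and 29 new PRs were opened. Work concentrated on three areas: first, compatibility preparation for a newer technology stack (Python 3.14, PyTorch 2.10); second, support and adaptation for new model architectures (such as Qwen3.5 and GLM-4.7-Flash); third, ongoing performance optimization and bug fixes, particularly in multimodality, speculative decoding, and AMD ecosystem support.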

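🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was fairly brisk during this period, mainly in core performance optimization, bug fixes, and CI resource management.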

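Performance optimization: preparing for the next-generation architecture
- PR #34100 (Open): Submitted by an AMD employee (amd-hhashemi). This PR converts the compute instructions of the wvSplitKQ kernel from those used previously to 16x16 MFMA (Matrix Fused Multiply-Add), in preparation for the upcoming mi4xx architecture (presumably the MI400 series). The change also lowers register pressure and improves the performance of the existing wvSplitKQ kernel on architectures such as MI300.
Bug fix: resolving a torch.compile compatibility issue
- PR #34108 (Open): Fixes a critical bug introduced by PR #33941. On ROCm, using torch.compile caused the model forward pass to crash because Dynamo could not trace the FFI calls made by amdsmi. The fix moves the GCN architecture detection logic (the on_gfx* checks) to module import time and stores the results as constants, which sidesteps Dynamo's tracing limitation and lets compilation work normally.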
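
The import-time pattern described for PR #34108 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `_query_gcn_arch`, `ON_GFX9`, and `pick_kernel` are hypothetical names, and the hard-coded return value stands in for a real driver query such as an amdsmi call.

```python
def _query_gcn_arch() -> str:
    """Hypothetical stand-in for a driver/FFI query (e.g. an amdsmi call)."""
    return "gfx942"  # assumed value for illustration only


# Evaluated exactly once, when the module is imported. After this point the
# results are plain Python values, so nothing on the hot path calls into FFI.
_GCN_ARCH: str = _query_gcn_arch()
ON_GFX9: bool = _GCN_ARCH.startswith("gfx9")


def pick_kernel() -> str:
    """Code on the traced path reads only the precomputed constant."""
    return "gfx9_kernel" if ON_GFX9 else "generic_kernel"
```

The design choice is that a tracer only ever sees a constant boolean in `pick_kernel`, rather than a call it cannot follow.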
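Feature cleanup: removing redundant logic from the MoE kernel
- PR #34086 (Open): Removes the chunking mechanism from the FusedMoE kernel. Because chunked prefill now guarantees that a single forward pass handles no more than ~65K tokens, the kernel-level chunking loop has become unnecessary overhead. The change simplifies the code by removing about 300 lines of complex logic. Although the PR is not AMD-specific, its rocm label indicates that it also applies to the ROCm platform as part of a general performance improvement.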
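
To illustrate why such a loop becomes redundant, here is a toy sketch (not the FusedMoE code). It assumes a per-token kernel and an illustrative `CHUNK_SIZE` of 65,536; `toy_kernel`, `apply_chunked`, and `apply_direct` are hypothetical names. When the scheduler already caps the batch at or below the chunk size, the chunked loop runs a single iteration and returns the same result as a direct call.

```python
from typing import Callable, List

CHUNK_SIZE = 65_536  # illustrative cap; the report only says "~65K"


def toy_kernel(tokens: List[float]) -> List[float]:
    """Stand-in for a per-token kernel: each output depends on one input."""
    return [2.0 * t for t in tokens]


def apply_chunked(tokens: List[float],
                  kernel: Callable[[List[float]], List[float]],
                  chunk_size: int = CHUNK_SIZE) -> List[float]:
    """Old style: split the batch and call the kernel once per chunk."""
    out: List[float] = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(kernel(tokens[start:start + chunk_size]))
    return out


def apply_direct(tokens: List[float],
                 kernel: Callable[[List[float]], List[float]]) -> List[float]:
    """New style: the batch is already bounded upstream, so call once."""
    return kernel(tokens)
```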
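CI / resource management
- PR #34059 (Merged): Adjusts AMD CI resource allocation by reducing GPU usage for the "Benchmarks" and "Benchmarks CLI Test" test groups, freeing resources for other test jobs. It reflects ongoing management of AMD CI pipeline efficiency.
Ecosystem sync: upgrade to PyTorch 2.10
- PR #30525 (Merged): A project-wide upgrade that is nonetheless critical for the AMD ecosystem. It raises the project's PyTorch dependency to 2.10.0 (including the ROCm build) and underpins vLLM's support for the latest AMD hardware and software stack.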

💬 Analysis of Highly Discussed Issues

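Issue #29595: Qwen3-VL grounding accuracy regression on vLLM v0.11.1+ (closed, 30 comments)
- Core topic: A user reported that on vLLM v0.11.1 and later, the Qwen3-VL model's object detection (grounding) ability degraded severely, especially on Hopper GPUs (H20/H100), while A100 behaved normally.
- Views and points of contention:
  - User: Provided detailed reproduction steps and comparison screenshots, noted that a performance drop and an accuracy problem occurred together, and suspected the Hopper-specific implementation of the Flash Attention backend.
  - Maintainers: Responded quickly, first asking the user to confirm the attention backend and then hypothesizing that the problem was related to internal changes in PyTorch 2.9 on Hopper devices. The issue was ultimately resolved by merging PR #30525 (upgrade to PyTorch 2.10), which suggests the root cause may have been a compatibility problem between PyTorch 2.9 and the Hopper architecture.
- Current status: Closed along with the PyTorch 2.10 upgrade, which underscores how sensitive vLLM is to the upstream PyTorch version.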
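Issue #22308: Custom tool-calling support for GPT-OSS models (closed, 19 comments)
- Core topic: A user asked how to configure --tool-call-parser so that GPT-OSS models can call custom tools.
- Discussion:
  - The first reply mistakenly pointed to reasoning-parser.
  - A maintainer then clarified that tool calling in the Harmony format is supported through --tool-call-parser openai and that the documentation had been updated.
  - Another user reported that the openai parser worked poorly in streaming mode (stream=True): tool calls were not parsed correctly.
  - Core point of contention: Streaming and non-streaming requests differ in their tool-call processing paths and prompt formats, which leads to inconsistent model output.
- Conclusion: A maintainer confirmed a bug in streaming tool calls (fixed by PR #24768), caused by the commentary channel not being correctly activated in the system prompt. The discussion shows why the streaming and non-streaming paths of a tool-calling implementation need to stay consistent.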
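
One way to check this kind of consistency is to assert that a streaming parser, fed arbitrary deltas, yields the same tool calls as a one-shot parse of the full text. The sketch below is an assumption-laden toy: it uses a made-up `<tool_call>...</tool_call>` tag format rather than Harmony, and `parse_full`, `StreamingParser`, and `parse_stream` are hypothetical names, not vLLM APIs.

```python
import json
import re
from typing import Iterable, List

# Toy wire format (assumed for illustration): JSON wrapped in tool_call tags.
_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_full(text: str) -> List[dict]:
    """Non-streaming path: parse every complete tool call in the full text."""
    return [json.loads(body) for body in _CALL_RE.findall(text)]


class StreamingParser:
    """Streaming path: buffer deltas and emit calls as soon as they complete."""

    def __init__(self) -> None:
        self._buffer = ""
        self.calls: List[dict] = []

    def feed(self, delta: str) -> None:
        self._buffer += delta
        while True:
            match = _CALL_RE.search(self._buffer)
            if match is None:
                break  # wait for more text; a call may be split across deltas
            self.calls.append(json.loads(match.group(1)))
            self._buffer = self._buffer[match.end():]


def parse_stream(deltas: Iterable[str]) -> List[dict]:
    parser = StreamingParser()
    for delta in deltas:
        parser.feed(delta)
    return parser.calls
```

A regression test can then split one response at several different chunk sizes and compare `parse_stream` against `parse_full` for each split.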
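Issue #34096: Supporting vLLM with Python 3.14 (open, 3 comments)
- Core topic: The author proposed that vLLM begin supporting Python 3.14 now that PyTorch officially supports it.
- Views and progress:
  - Author: Tested dependency compatibility in detail, identified ray[cgraph] as the main obstacle, and provided a solution that makes it an optional dependency, plus a patch that allows a successful install from source.
  - Maintainer: Immediately volunteered to take the task ("I will take this :)").
  - PyTorch team member: Added that torch.compile supports Python 3.14 as of PyTorch 2.10.
- Current status: In progress. It shows the vLLM community planning ahead to stay compatible with the latest Python ecosystem.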
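
Treating a package as an optional dependency usually means probing for it at runtime and falling back when it is absent. The following is a generic sketch of that pattern, not the patch from the Issue; `has_optional_dependency` and `select_distributed_backend` are hypothetical helpers, and the "ray"/"mp" labels are used only as illustrative backend names.

```python
import importlib.util


def has_optional_dependency(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def select_distributed_backend(preferred: str = "ray") -> str:
    """Use Ray only when requested and installed; otherwise fall back.

    Hypothetical selection logic for illustration.
    """
    if preferred == "ray" and has_optional_dependency("ray"):
        return "ray"
    return "mp"
```

With this shape, an environment where no Ray wheel exists still installs and runs, and only the Ray-specific code path is unavailable.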

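🔥 Hot Topics and Trends

Adapting to a newer environment: Python 3.14 support and the PyTorch 2.10 upgrade are the focus of this period, showing that the project keeps pace with the underlying technology stack.
Model support challenges: Several Issues reflect the difficulty of keeping up with rapidly iterating model architectures:
- GLM-4.7-Flash cannot be loaded because it uses the glm4_moe_lite architecture, which exists only on the transformers main branch (Issue #34098).
- A Ministral-3 model fine-tuned with Unsloth lacks configuration files that vLLM requires (such as params.json), which makes deployment difficult (Issue #34099).
- A new PR (#34110) begins adding official support for the Qwen3.5 model family.
Complexity of tool calling and streaming: Issue #22308 and PR #34101 (which fixes truncation of streamed Hermes tool calls) both show that the robustness of tool calling remains under pressure when streaming output, stop sequences, and complex parsing logic interact.
Multimodality and memory optimization: The merged PR #32493 introduces the enable_mm_embeds flag, which lets users disable the vision encoder (saving memory) while still supplying precomputed image embeddings, giving multimodal deployments a more flexible configuration option.
Installation and dependency problems: In Issue #34090, installation failed for a user because libmpi_cxx.so.40 was missing. This may be an indirect dependency pulled in by a specific PyTorch wheel, and it highlights the user-support burden of complex dependency environments.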

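🛠️ Key Technical Changes

Looking ahead to Python 3.14 support (Issue #34096): Although the work is at an early stage, the community has started to evaluate and resolve dependency problems systematically, a key step toward keeping vLLM compatible in the future.
Qwen3.5 model support (PR #34110): Adds a support framework ahead of release for the upcoming Qwen3.5 dense and MoE models, reflecting close collaboration with a leading model vendor.
Official upgrade to PyTorch 2.10 (PR #30525): A large PR that took nearly two months has finally been merged. The upgrade resolves the performance and accuracy problems on the Hopper architecture and lays the groundwork for using new PyTorch 2.10 features, such as torch.compile support for Python 3.14.
Enhanced "text-only" multimodal mode (PR #32493): The combination --limit-mm-per-prompt '{"image": 0}' --enable-mm-embeds provides a lightweight multimodal serving mode that keeps embedding input available while saving the encoder's memory overhead, which improves deployment flexibility.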
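
As a rough illustration, the flag combination quoted above could be assembled programmatically like this. The two flags and the JSON value come from the report; the `vllm serve` entry point is the standard CLI, while `build_serve_command` and the model name `my-org/my-vlm` are hypothetical placeholders.

```python
import json
import shlex
from typing import List


def build_serve_command(model: str) -> List[str]:
    """Assemble an argv list for the lightweight multimodal mode.

    Hypothetical helper; json.dumps keeps the JSON argument well-formed.
    """
    return [
        "vllm", "serve", model,
        "--limit-mm-per-prompt", json.dumps({"image": 0}),
        "--enable-mm-embeds",
    ]


# "my-org/my-vlm" is a placeholder model name, not a real checkpoint.
print(shlex.join(build_serve_command("my-org/my-vlm")))
```

Building the argument as a list and quoting it with `shlex.join` avoids shell-quoting mistakes around the embedded JSON.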
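AMD architecture optimization (PR #34100): Migrating the kernel's instructions to 16x16 MFMA is a low-level optimization aimed at AMD's next-generation GPU architecture and is intended to keep improving its competitiveness in AI compute.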

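📈 Development Activity Observations

Active AMD contributors: The contributors amd-hhashemi and AndreasKaratzas submitted significant kernel optimization and bug-fix PRs, showing that the AMD team's investment in vLLM reaches into the core performance layer and the framework compatibility layer.
High community engagement: In model support (such as Qwen3.5 and GLM) and troubleshooting (such as the Qwen3-VL accuracy issue), developers from model vendors or related teams (such as JJJYmmm and chaunceyjiang) participated and collaborated in depth.
Efficient Issue handling: 25 older Issues were closed within 24 hours. Most were closed automatically as stale, but some core problems were resolved through code fixes (such as the Qwen3-VL accuracy issue), which indicates that the maintainers are actively cleaning up the backlog.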

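💡 Issues Worth Watching

Python 3.14 dependency hell (Issue #34096): Key dependencies such as ray[cgraph] do not yet ship Python 3.14 wheels, which makes them the biggest obstacle to vLLM supporting the new Python version. The community may need to push upstream dependencies forward or define a transition plan.
Rapid evolution of model architectures: Models such as GLM-4.7-Flash depend on the main branch of the transformers library, which conflicts with vLLM's policy of depending on stable PyPI releases. Balancing fast support for new models against dependency stability is an ongoing challenge.
Deeper AMD ecosystem integration: PR #34108 exposed a compatibility problem between an AMD-specific library (amdsmi) and PyTorch's graph compilation (torch.compile). As compilation becomes more widely used, more low-level integration problems of this kind may surface.
Consistency of streaming tool calls (Issue #22308 and PR #34101): Differences in how the streaming and non-streaming APIs parse tool calls easily confuse developers. Keeping the parsing logic fully consistent across both modes is essential for a stable developer experience.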

📋 Appendix: Detailed Data Lists

New Issues

#34096 [Tracking Feature]: Support vLLM with Python 3.14 | feature request | by mgoin (created: 2026-02-09 01:12 (UTC+8))
#34106 [Bug]: MLP Speculator AttributeError: 'MLPSpeculatorConfig' object has no attribute 'num_attention_heads' | bug | by kylesayrs (created: 2026-02-09 08:53 (UTC+8))
#34090 what am I doing wrong ? libmpi_cxx.so.40: cannot open shared object file: No such file or directory | usage | by peter247 (created: 2026-02-08 23:24 (UTC+8))
#34099 [Usage]: VLLM with finetuned unsloth ministral 3 | usage | by LeDuySon (created: 2026-02-09 02:38 (UTC+8))
#34098 [Bug]: GLM-4.7-Flash requires transformers from git (glm4_moe_lite but Transformers does not recognize this architecture) | bug | by brokedba (created: 2026-02-09 02:13 (UTC+8))
#34095 [Usage]: Option to set system_fingerprint? | usage | by vrdn-23 (created: 2026-02-09 00:49 (UTC+8))
#34094 [Bug]: assert num_cache_lines >= batch | bug | by vitalik (created: 2026-02-09 00:45 (UTC+8))
#34076 [Bug]: KV Cache Memory Bottleneck Calculation in Pipeline Parallel (_check_enough_kv_cache_memory in get_kv_cache_configs) | bug | by chizhiwei (created: 2026-02-08 12:02 (UTC+8))

Closed Issues

#22308 [Feature]: If I want gpt-oss to be able to call custom tools, how should I set the --tool-call-parser parameter during deployment? | feature request,stale,gpt-oss | by LZAndy (closed: 2026-02-09 10:18 (UTC+8))
#23033 [RFC]: Support mmmu benchmark | RFC,stale | by tanruixiang (closed: 2026-02-09 10:18 (UTC+8))
#23107 [Feature]: consider all env vars in compilation hash with some opt-out | good first issue,feature request,stale | by youkaichao (closed: 2026-02-09 10:18 (UTC+8))
#25443 [Usage]: Why is OpenCV used for video and image preprocessing? Especially when it comes to processing videos, the efficiency is too low. I want to modify this part to torchvision parallel computing. Where should I start? | usage,stale | by HAOYON-666 (closed: 2026-02-09 10:18 (UTC+8))
#25561 [Bug]: output is not deterministic | bug,stale | by zhink (closed: 2026-02-09 10:18 (UTC+8))
#25587 [Feature Request] Support q_lora_rank=None for MiniCPM3 | feature request,stale | by bzantium (closed: 2026-02-09 10:18 (UTC+8))
#26282 [Usage]: silencing ALL vLLM-enabled logging | usage,stale | by BramVanroy (closed: 2026-02-09 10:17 (UTC+8))
#26393 [Bug]: vllm-v0.11.0 run Qwen3-Next-80B-A3B-Instruct-FP8 fail | bug,stale | by exceedzhang (closed: 2026-02-09 10:17 (UTC+8))
#26409 [Bug]: Performance issue with v1 engine | bug,stale | by rse173 (closed: 2026-02-09 10:17 (UTC+8))
#26433 [RFC]: Batching speculation | RFC,stale | by pkuwangh (closed: 2026-02-09 10:17 (UTC+8))
#26479 [Bug]: 0.10.1 offline stop profiler error Can't disable Kineto profiler when it's not running | bug,stale | by Xerxes-cn (closed: 2026-02-09 10:17 (UTC+8))
#26491 [Bug]: vllm/vllm-openai:nightly container crashes with --otlp-traces-endpoint due to missing opentelemetry package | bug,stale | by Aymendje (closed: 2026-02-09 10:17 (UTC+8))
#26493 [Feature]: expose model revisions in OpenAI v1/models endpoint | feature request,stale | by mikix (closed: 2026-02-09 10:17 (UTC+8))
#26511 [Performance]: Deepseek3.2 Compile Time Regression on B200 with Pytorch 2.9 | performance,torch.compile,stale | by Lucaskabela (closed: 2026-02-09 10:17 (UTC+8))
#26544 [Bug]: TranscriptionRequest is missing request_id | bug,stale | by eicherseiji (closed: 2026-02-09 10:17 (UTC+8))
#26555 [Bug]: Highly concurrent calls to the vllm service can cause graphics card crashes | bug,stale | by shuoshuo0 (closed: 2026-02-09 10:17 (UTC+8))
#26561 [Bug]: Qwen3-Coder encountered a large number of errors when using the calling capabilities of vllm-0.11.0. | bug,stale | by Jeremy-J-J (closed: 2026-02-09 10:17 (UTC+8))
#26573 [Bug]: When stream=True, the seed-oss model, the tool_calls = None | bug,stale,tool-calling | by CallmeZhangChenchen (closed: 2026-02-09 10:17 (UTC+8))
#26576 [Bug]: Hardcoded token_ids size causes OOV error for small tokenizer | bug,stale | by grzegorz-k-karch (closed: 2026-02-09 10:17 (UTC+8))
#26589 [Bug]: Ovis2.5 9B FP8 quantisation bug | bug,stale | by Dineshkumar-Anandan-ZS0367 (closed: 2026-02-09 10:17 (UTC+8))
#26598 [Bug]: Pipeline parallelism skips GPUs and hangs during model load | bug,stale | by devnen (closed: 2026-02-09 10:17 (UTC+8))
#33888 [Feature]: when support torch 2.10.0 | feature request | by chamwen (closed: 2026-02-09 05:51 (UTC+8))
#29595 [Bug]: Qwen3-VL-235B-A22B-Instruct Grounding Accuracy Issue in vLLM (>= v0.11.1) | bug,torch.compile | by 420516460 (closed: 2026-02-09 05:51 (UTC+8))
#33974 [Feature]: We propose the official development and maintenance of a VLLM integration or plugin within Dify. | feature request | by ooodwbooo (closed: 2026-02-09 01:04 (UTC+8))
#21943 [Feature] Skip modules for disabled modalities | good first issue,feature request,multi-modality | by DarkLight1337 (closed: 2026-02-08 20:57 (UTC+8))

New PRs

#34110 [MODEL] Adding Support for Qwen3.5 Models | new-model,speculative-decoding,v1,qwen | by JJJYmmm (created: 2026-02-09 11:16 (UTC+8))
#34111 [DRAFT][XPU] clean up existing ipex | documentation,v1 | by jikunshang (created: 2026-02-09 11:17 (UTC+8))
#34091 refactor_code_for_repeated_functions | cpu,nvidia | by tom-zju (created: 2026-02-09 00:05 (UTC+8))
#34075 Support MP backend for elastic EP scale-down | frontend,v1 | by jianzs (created: 2026-02-08 11:41 (UTC+8))
#34086 [Feature]: Remove Chunking From FusedMoE | documentation,rocm,gpt-oss,nvidia | by SouthWest7 (created: 2026-02-08 19:39 (UTC+8))
#34109 [Kernel] Refactor FlashInfer allreduce for mnnvl backend | nvidia | by hjjq (created: 2026-02-09 10:47 (UTC+8))
#34102 [DP] Only use DP padding when cudagraphs are actually used | speculative-decoding,ready,v1,nvidia | by LucasWilkinson (created: 2026-02-09 03:46 (UTC+8))
#34092 [torch.compile] Disable recursive pre_grad_passes | no labels | by zou3519 (created: 2026-02-09 00:19 (UTC+8))
#34108 [ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection | bug,rocm | by AndreasKaratzas (created: 2026-02-09 10:05 (UTC+8))
#34077 [BUGFIX] Fix accuracy bugs in Qwen3-Next MTP | bug,ready,v1,qwen | by vadiklyutiy (created: 2026-02-08 12:31 (UTC+8))
#34103 [Tiny] Rename encoder budget file to more specific name | ready,v1,multi-modality | by reaganjlee (created: 2026-02-09 05:20 (UTC+8))
#34097 [LoRA] Support LoRA for Qwen3OmniMoeThinkerForConditionalGeneration | documentation,v1,qwen | by linitra24 (created: 2026-02-09 01:28 (UTC+8))
#34107 [CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr | ready,multi-modality | by AndreasKaratzas (created: 2026-02-09 09:30 (UTC+8))
#34087 [Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) | bug,nvidia | by TomerBN-Nvidia (created: 2026-02-08 20:38 (UTC+8))
#34105 Move spec decode offline script from example/ to inside vllm for reusability | documentation,speculative-decoding,v1 | by ekagra-ranjan (created: 2026-02-09 08:00 (UTC+8))
#34104 Fix Mistral config remap to accept compressed-tensors quantization #34028 | no labels | by baonudesifeizhai (created: 2026-02-09 06:04 (UTC+8))
#34100 Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. | rocm | by amd-hhashemi (created: 2026-02-09 03:00 (UTC+8))
#34101 Fix hermes tool call stream truncation when stop sequences are used | frontend | by maxdebayser (created: 2026-02-09 03:08 (UTC+8))
#34085 Fix DeepSeek-OCR tensor validation for all size variants | ready,deepseek | by yichuan-w (created: 2026-02-08 18:09 (UTC+8))
#34088 [BugFix] Change support no act and mul for marlin | bug,ready,nvidia | by TomerBN-Nvidia (created: 2026-02-08 21:57 (UTC+8))
#34079 [Bugfix] Fix negative local_cache_hit in P/D disaggregation metrics | bug,documentation,v1,kv-connector | by Prowindy (created: 2026-02-08 13:58 (UTC+8))
#34093 [torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter | no labels | by zou3519 (created: 2026-02-09 00:23 (UTC+8))
#34084 Convert online APIs to use Renderer | frontend | by reaganjlee (created: 2026-02-08 17:31 (UTC+8))
#34089 Constrain numpy/opencv for compatibility | ci/build | by aissam-out (created: 2026-02-08 23:06 (UTC+8))
#34083 [Docs] Update docs to include mm processor + encoder benchmarks | documentation,frontend,v1,multi-modality | by reaganjlee (created: 2026-02-08 17:07 (UTC+8))
#34082 [Frontend, grpc] add support for Embed RPC in grpc_server | frontend | by santiramos27 (created: 2026-02-08 15:59 (UTC+8))
#34081 custom behavior | frontend | by dsingal0 (created: 2026-02-08 15:39 (UTC+8))
#34080 [bugfix] fix: correctly identify bottleneck worker in pipeline parallel KV cache allocation | bug,v1 | by chizhiwei (created: 2026-02-08 14:27 (UTC+8))
#34078 [bugfix] fix: correctly identify bottleneck worker in pipeline parallel KV cache allocation (#34076) | bug,v1 | by chizhiwei (created: 2026-02-08 13:44 (UTC+8))

Merged PRs

#34027 [bug-fix] supported_tasks is breaking backward compatibility at init_app_state | bug,frontend,ready | by kouroshHakha (merged: 2026-02-09 09:46 (UTC+8))
#30525 [Release 2.10] Update to Torch 2.10 - final release | documentation,rocm,ci/build,v1,cpu,gpt-oss,nvidia,ready-run-all-tests | by atalman (merged: 2026-02-09 05:51 (UTC+8))
#33786 Add support for ModelOpt MXFP8 dense models | documentation,ready,nvidia | by danisereb (merged: 2026-02-09 03:16 (UTC+8))
#32958 glm 4.6 fused tuned inference config for B200 | ready | by navmarri14 (merged: 2026-02-09 02:55 (UTC+8))
#33735 [torch.compile] Add an option to force-enable the MOE cold start optimization | ready | by zou3519 (merged: 2026-02-09 02:42 (UTC+8))
#34088 [BugFix] Change support no act and mul for marlin | bug,ready,nvidia | by TomerBN-Nvidia (merged: 2026-02-09 01:18 (UTC+8))
#33771 [Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate | ready,v1,nvidia | by aabbccddwasd (merged: 2026-02-09 00:13 (UTC+8))
#32493 Add embedding input functionality for disabled modalities [remake] | documentation,frontend,ready,v1,multi-modality | by reaganjlee (merged: 2026-02-08 20:57 (UTC+8))
#34059 [ROCm] [CI] Reduce Resource of two test groups | rocm,ready,ci/build | by tjtanaa (merged: 2026-02-08 15:17 (UTC+8))
#33855 [Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used | structured-output,ready,v1,deepseek | by njhill (merged: 2026-02-08 15:16 (UTC+8))

PRs Closed Without Merging

#31458 Vox streaming on top | frontend,tpu,needs-rebase,v1 | by patrickvonplaten (closed: 2026-02-09 10:59 (UTC+8))
#19710 [Model] Activated LoRA | documentation,frontend,needs-rebase,stale,v1,tool-calling | by tdoublep (closed: 2026-02-09 10:19 (UTC+8))
#22236 [Perf][Feat][Core] Workload-Aware KVCache Eviction Policy | documentation,performance,needs-rebase,ci/build,stale,v1 | by Chasingdreams6 (closed: 2026-02-09 10:19 (UTC+8))
#25722 [RFC][Core] propagate the error message up to the frontend process | stale,v1 | by 842974287 (closed: 2026-02-09 10:18 (UTC+8))
#25742 feat(minicpm3): support q_lora_rank is None | speculative-decoding,stale,v1,deepseek | by bzantium (closed: 2026-02-09 10:18 (UTC+8))
#26557 [Qwen3-Next] MoE configs for 4090 TP=1/PP | stale,qwen | by ReinForce-II (closed: 2026-02-09 10:17 (UTC+8))
#33469 Add Kimi-Audio-7B model support | documentation,new-model,multi-modality | by tunglinwood (closed: 2026-02-09 10:12 (UTC+8))
#33854 [BugFix] Potential bug fix for test_async_tp_pass_correctness | bug,ready,v1 | by LucasWilkinson (closed: 2026-02-09 07:25 (UTC+8))
#29726 [Frontend] Add streaming tool-call support to Responses API (non-Harmony) | frontend,gpt-oss | by sumitaryal (closed: 2026-02-09 01:25 (UTC+8))
#34081 custom behavior | frontend | by dsingal0 (closed: 2026-02-08 16:55 (UTC+8))
#33023 refactor: unify FlashInfer utils into vllm.utils.flashinfer | nvidia | by puranikyashaswin (closed: 2026-02-08 14:35 (UTC+8))
#34078 [bugfix] fix: correctly identify bottleneck worker in pipeline parallel KV cache allocation (#34076) | bug,v1 | by chizhiwei (closed: 2026-02-08 13:59 (UTC+8))