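vLLM Development Activity Report - 2026-02-07

Time window: 2026-02-07 11:50 (UTC+8) ~ 2026-02-08 11:50 (UTC+8)
Statistics: New Issues 7 | Closed Issues 7 | New PRs 31 | Merged PRs 21 | PRs closed without merging 5

📊 Daily Development Summary

Between February 7 and 8, 2026, the vLLM project remained highly active, merging 21 PRs and opening 31 new ones. Development focused on multimodal model support (in particular, encoder cache configuration), the AMD/ROCm ecosystem (performance tuning and CI fixes), and speculative decoding, API compliance, and compilation performance. The project is iterating quickly to address user-reported issues and improve platform compatibility.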

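🎯 AMD/ROCm Ecosystem Updates

AMD/ROCm-related activity was high this cycle, concentrated on performance tuning, bug fixes, and CI/CD improvements.

Direct contributions from AMD employees:
- User amd-hhashemi:
  - PR #33493 (merged) - “Perf tuning and expansion of cases covered for wvSplitKrc”: a performance-tuning PR for a specific quantized GEMM kernel (wvSplitKrc) on ROCm that also expands the set of cases the kernel covers, which should improve inference efficiency for certain models on AMD GPUs.
  - Issue #34073 - “[Bug]: #33758 is causing aiohttp/_websocket.c build error”: reports a build error in ROCm environments caused by an earlier PR (#33758), showing that the AMD team is actively running integration tests and reporting regressions.
ROCm platform bug fixes:
- PR #34069 (merged) - “[ROCm][Bugfix] fix act_quant_fusion module import error”: fixes an import error in the activation-quantization fusion module on ROCm, unblocking the affected test cases. Submitted by AMD contributor AndreasKaratzas and merged the same day.
- PR #34038 (merged) - “[ROCm][CI] Pinning lm-eval version to resolve multi-modal small eval bug”: temporarily pins lm-eval to 0.4.9.2 to fix failing multimodal evaluation tests in ROCm CI. This highlights the challenge of keeping CI stable across multiple platforms.
Ecosystem feature and infrastructure requests:
- Issue #34067 - “[Feature]: nightly docker and wheels for ROCm”: opened by user hongxiayang, requesting nightly Docker images and wheels for ROCm on par with those available for CUDA. This reflects AMD users' demand for a more convenient and timely deployment path and is a key step toward a better user experience in the AMD ecosystem.

Summary: the AMD team was involved this cycle across the full workflow, from low-level kernel performance tuning and bug fixes to CI stability maintenance. The community is also clearly asking for a more complete deployment toolchain on AMD platforms (nightly builds), indicating that vLLM's AMD hardware support is moving from “functionally usable” toward “experience optimization”.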

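💬 Hot Discussion Analysis

PR #34051 - “[Core] Make encoder budget and cache size configurable”:
- Core topic: how to make the encoder compute budget (encoder_compute_budget) and encoder cache size (encoder_cache_size) for multimodal models configurable instead of hard-coded, and how to resolve the related naming and validation questions.
- Views and discussion:
  - Author (DarkLight1337): proposed a detailed change plan aimed at the encoder cache overflow problems users hit in multimodal scenarios (e.g., Issue #34040).
  - Other participant (NickLucche): asked whether there are unit tests that verify the new configuration values, stressing the importance of test coverage.
  - Author's response: agreed and plans to add scheduler tests to ensure the system does not hang under various configurations.
- Current status: the PR is open, and test cases are being added in response to feedback.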
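Issue #33954 - “[Bug]: Qwen3-VL-Embedding-8B embedding quality declines…”:
- Core topic: a user found that, starting from some vLLM version, the embedding quality of the Qwen3-VL-Embedding-8B model dropped significantly.
- Views and discussion:
  - Reporter (kevin-pw): provided detailed reproduction steps and an environment comparison, and narrowed the problem down to the --load-format=fastsafetensors option.
  - Maintainers (Isotr0py, noooop): could not reproduce the problem at first, but quickly identified and confirmed the root cause once the reporter supplied more precise reproduction conditions (using fastsafetensors).
  - Other user (XyWzzZ): shared test results from another version (where the problem did not occur), helping to narrow the search.
- Point of contention: none of substance; this was an efficient collaborative debugging effort.
- Outcome: the problem was located and the Issue was subsequently closed. The discussion also spurred related fixes such as PR #34070 (fixing wasted memory across GPUs with fastsafetensors).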
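Issue #28589 - “[Bug]: V1 Engine fails on Blackwell GB10 (SM 12.1)…”:
- Core topic: a long-standing failure of the V1 engine on NVIDIA Blackwell GB10 (SM 12.1) GPUs.
- Views and discussion:
  - Reporter (ohsono): described the error (“sink setting not supported”) and the complex reproduction environment in detail.
  - Community contributors (alkari, johnnynunez, and others): provided an in-depth root-cause analysis (an incorrect CUTLASS MoE SM100 macro definition) and several workarounds (environment variable settings, source patches).
  - Maintainer/user (lavdnone2): shared a configuration that worked on a specific version.
- Outcome: the Issue was closed this cycle because the root cause has been fixed on the main branch by PR #28938. Open for nearly three months, it is a representative example of the community working together on a complex platform-compatibility problem.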

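🔥 Hot Topics and Trend Analysis

Deeper multimodal (Multi-Modality) support: one of the most active technical areas. Encoder cache management produced Issue #34040 and PR #34051, showing that the community is dealing with the challenges of complex real-world inputs (large images) and working toward more flexible resource allocation.
Speculative decoding optimization: interest remains high. PR #34049 aims to reduce tensor-parallel (TP) communication overhead during Eagle draft-token generation, using a local argmax to improve performance for large-vocabulary models. In addition, the documentation PR #34065 updates the examples for each speculative method, suggesting that the feature is maturing and seeing wider adoption.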
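The local-argmax idea mentioned for PR #34049 can be illustrated with a small, framework-free sketch. This is not the PR's code: the function names are hypothetical, and the ranks are simulated as plain Python lists. It only shows why greedy selection over a vocabulary-sharded logit vector needs one (value, index) pair per rank rather than the full logits.

```python
from typing import List, Tuple


def local_argmax(shard: List[float], offset: int) -> Tuple[float, int]:
    """Return (max value, global token id) for one rank's vocabulary shard."""
    best = max(range(len(shard)), key=lambda i: shard[i])
    return shard[best], offset + best


def sharded_argmax(shards: List[List[float]]) -> int:
    """Combine per-rank candidates; each rank contributes a single pair."""
    candidates = []
    offset = 0
    for shard in shards:
        candidates.append(local_argmax(shard, offset))
        offset += len(shard)
    # Stand-in for the cross-rank reduction: keep the pair with the largest value.
    return max(candidates, key=lambda c: c[0])[1]


# Two simulated tensor-parallel ranks, each holding half of an 8-token vocabulary.
shards = [[0.1, 2.5, -1.0, 0.3], [0.7, 3.2, 0.0, 1.1]]
full_logits = [x for shard in shards for x in shard]
greedy_token = sharded_argmax(shards)
```

In this sketch the amount of data exchanged per rank is constant, whereas gathering full logits grows with the vocabulary size, which is the motivation the report gives for large-vocabulary models.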
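Frontend API compliance and feature work:
- API spec compliance: PR #34072 adds validation that rejects non-text content (such as images) in system messages, in line with the OpenAI API specification.
- API feature extension: PR #33934 (merged) adds transcription and translation support to the run_batch offline batch API. Although this is a vLLM-specific extension, it broadens enterprise use cases.
- Protocol compatibility: PR #34053 fixes six protocol violations in the Anthropic Messages API adapter to ensure compatibility with clients such as Claude Code.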
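As a rough illustration of the kind of check described for PR #34072, the sketch below rejects system messages whose content includes a non-text part. It assumes the common chat-message shape (a role plus content that is either a string or a list of typed parts); the function name and error wording are hypothetical and are not taken from the PR.

```python
from typing import Any, Dict, List


def validate_system_messages(messages: List[Dict[str, Any]]) -> None:
    """Raise ValueError if any system message carries a non-text content part."""
    for index, message in enumerate(messages):
        if message.get("role") != "system":
            continue  # Only system messages are restricted in this sketch.
        content = message.get("content")
        if isinstance(content, str):
            continue  # Plain string content is text by definition.
        for part in content or []:
            part_type = part.get("type")
            if part_type != "text":
                raise ValueError(
                    f"messages[{index}]: system message content must be text, "
                    f"got part of type {part_type!r}"
                )
```

A server would typically run a check like this before rendering the prompt, so that the client receives a clear request error instead of a failure deeper in the pipeline.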
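Compilation and execution optimization:
- Cold-start optimization: PR #34003 (merged) and PR #34062 reduce torch.compile cold-start time by, respectively, reusing identical compiled artifacts and no longer hard-coding the layer name in the KV cache update op.
- Execution path optimization: PR #33943 (merged) restructures the checks around the unified_kv_cache_update op, simplifying the kernel abstraction.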

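🛠️ Key Technical Changes

PR #34051 - “[Core] Make encoder budget and cache size configurable”:
- Technical summary: turns two key parameters for multimodal models, max_num_batched_encoder_embeds (the encoder compute budget) and encoder_cache_size (the encoder cache size), from hard-coded values into configurable startup parameters. It also renames variables for clarity (e.g., Scheduler.encoder_compute_budget) and adds validation logic that automatically corrects user settings that are too small and emits a warning.
- Impact: lets users tune system resources to the characteristics of their actual multimodal inputs. It addresses the root cause of errors such as the “encoder cache overflow” in Issue #34040 and makes vLLM more flexible and robust on complex multimodal workloads.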
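The "correct and warn" validation described above might look roughly like the following sketch. Everything here is an assumption for illustration: the helper name, its parameters, and the rule that each limit must be at least as large as the biggest single multimodal item are hypothetical and are not confirmed by the PR.

```python
import warnings
from typing import Optional, Tuple


def resolve_encoder_limits(
    requested_budget: Optional[int],
    requested_cache_size: Optional[int],
    max_embeds_per_item: int,
    default_value: int,
) -> Tuple[int, int]:
    """Return (compute_budget, cache_size), raising too-small values with a warning."""

    def resolve(name: str, requested: Optional[int]) -> int:
        # Fall back to the default when the user did not set the option.
        value = default_value if requested is None else requested
        # Assumed rule: a single item must always fit, otherwise it could never run.
        if value < max_embeds_per_item:
            warnings.warn(
                f"{name}={value} cannot hold the largest multimodal item "
                f"({max_embeds_per_item} embeddings); using {max_embeds_per_item}."
            )
            value = max_embeds_per_item
        return value

    return (
        resolve("encoder_compute_budget", requested_budget),
        resolve("encoder_cache_size", requested_cache_size),
    )
```

Correcting the value rather than failing keeps a misconfigured server usable, while the warning still tells the user that the requested setting was not honored.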
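PR #33493 (merged) - “Perf tuning and expansion of cases covered for wvSplitKrc”:
- Technical summary: fine-tunes and optimizes a specific quantized GEMM kernel (wvSplitKrc) on the AMD ROCm platform and extends the range of matrix shapes it supports.
- Impact: directly improves inference efficiency for specific quantized models (such as GPT-OSS-120B) on AMD GPUs (such as MI355X). It reflects the AMD team's ongoing platform performance work and helps narrow the gap with the CUDA platform on specific kernels.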
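PR #34075 - “Support MP backend for elastic EP scale-down”:
- Technical summary: extends elastic expert parallelism (Elastic Expert Parallel, EEP) so that scale-down is supported not only with the Ray backend but also with the MP (multiprocessing) backend. It introduces an EngineRegistry that unifies engine handshake and process-manager registration, and reuses the handshake socket to deliver scale-down events.
- Impact: adds flexibility and deployment options to vLLM's distributed inference architecture. Users on the MP backend can now also use EEP's dynamic resource adjustment, bringing elasticity to a wider range of cluster environments.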

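📈 Development Activity Observations

Highly active contributors: more than 50 Issues/PRs were handled within 24 hours, including 21 merged PRs, showing that the core team and community contributors kept up an efficient review and merge cadence.
Deep AMD involvement: contributors including amd-hhashemi, AndreasKaratzas, and tjtanaa submitted multiple PRs spanning performance tuning, bug fixes, and CI maintenance, showing that AMD is investing sustained, professional engineering resources in vLLM support on its hardware.
Efficient collaboration: the resolution of Issue #33954 shows that users, contributors, and maintainers can communicate quickly and effectively on the basis of detailed problem descriptions and test data, so that problems are located and resolved promptly.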

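💡 Issues Worth Watching

Issue #34073 - ROCm build error: a newly reported build error from an AMD employee involving the aiohttp dependency. It should be resolved soon, since it may block ROCm users from updating to the latest code.
Issue #34067 - ROCm nightly build request: the community clearly wants a nightly release pipeline for ROCm. Providing one would make continuous integration and testing of cutting-edge features much easier for developers in the AMD ecosystem.
PR #34046 - Deterministic prefix caching: proposes an optional scheduler feature (--deterministic-prefix-caching) that forces the prefix to be split so that the cache-hit and cache-miss paths produce identical results (addressing ULP-level errors caused by bf16 GEMM tiling differences). This matters for scenarios that need strictly reproducible results (such as certain CI tests and model evaluation), and the design approach is worth watching.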
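The underlying numerical effect is that floating-point addition is not associative, so grouping the same work differently can change the last bits of the result. The snippet below is only an analogy: it uses Python's 64-bit floats rather than bf16 GEMM kernels, and the "cache-hit" and "cache-miss" labels are illustrative, not a model of vLLM's scheduler.

```python
import math

values = [0.1, 0.2, 0.3]

# "Cache-miss" analogy: accumulate left to right in a single pass.
single_pass = (values[0] + values[1]) + values[2]

# "Cache-hit" analogy: the same terms, but grouped differently.
regrouped = values[0] + (values[1] + values[2])

# The two results agree to many digits yet are not bit-identical.
same_bits = single_pass == regrouped
close_enough = math.isclose(single_pass, regrouped, rel_tol=1e-12)
```

Forcing both paths to use the same split, as the PR proposes, removes this source of divergence by making the grouping identical regardless of whether the prefix was cached.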

📋 Appendix: Detailed Data Lists

New Issues

#34040 [Doc]: Setting Encoder Cache for MultiModal LLMs — documentation — by willingWill17 (created: 2026-02-07 13:23 (UTC+8))
#34073 [Bug]: #33758 is causing aiohttp/_websocket.c build error — bug,rocm — by amd-hhashemi (created: 2026-02-08 09:31 (UTC+8))
#34067 [Feature]: nightly docker and wheels for ROCm — feature request,rocm — by hongxiayang (created: 2026-02-08 05:31 (UTC+8))
#34054 [Bug]: pd disaggregation on the same host with nixl connector can not use nvlink to transfer kv cache — bug — by ChowXu (created: 2026-02-07 20:51 (UTC+8))
#34058 [Installation]: On ubuntu i coundn’t install it — installation — by nawnaw0 (created: 2026-02-07 22:54 (UTC+8))
#34042 [Feature]: vllm support for MedGemma Models — feature request — by SabariArulnagarajan (created: 2026-02-07 14:01 (UTC+8))
#34041 [Bug]: required numpy version mismatch between modules — bug — by llsj14 (created: 2026-02-07 13:39 (UTC+8))

Closed Issues

#34058 [Installation]: On ubuntu i coundn’t install it — installation — by nawnaw0 (closed: 2026-02-07 22:54 (UTC+8))
#33601 vLLM replicas init failure when launching with Ray on a single instance — bug — by jiangwu300 (closed: 2026-02-07 21:30 (UTC+8))
#33792 [Bug]: Logic for selection of routing_method_type in FusedTopKRouter — bug — by dbari (closed: 2026-02-07 18:48 (UTC+8))
#31379 [Bug]: GLM47 the empty parameter tool report TypeError: expected string or buffer — bug — by NebulaMao (closed: 2026-02-07 17:24 (UTC+8))
#34042 [Feature]: vllm support for MedGemma Models — feature request — by SabariArulnagarajan (closed: 2026-02-07 14:08 (UTC+8))
#33954 [Bug]: Qwen/Qwen3-VL-Embedding-8B embedding quality declines significantly sometime after vLLM version 0.14.0rc2.dev199+gc80f92c14 — bug — by kevin-pw (closed: 2026-02-07 13:30 (UTC+8))
#28589 [Bug]: V1 Engine fails on Blackwell GB10 (SM 12.1): “sink setting not supported” by all compatible attention backends — bug — by ohsono (closed: 2026-02-07 12:28 (UTC+8))

New PRs

#34075 Support MP backend for elastic EP scale-down — frontend,v1 — by jianzs (created: 2026-02-08 11:41 (UTC+8))
#34063 [Bugfix] Treat generation_config max_tokens as default not ceiling — bug,frontend — by almogtavor (created: 2026-02-08 03:35 (UTC+8))
#34051 [Core] Make encoder budget and cache size configurable — v1,multi-modality — by DarkLight1337 (created: 2026-02-07 17:37 (UTC+8))
#34074 [Bugfix] Fix Qwen3 context length limited to sliding_window when use_sliding_window=False — bug,qwen — by veeceey (created: 2026-02-08 11:31 (UTC+8))
#34061 [Cleanup] Unify vllm.utils.flashinfer and flashinfer_utils — nvidia — by veeceey (created: 2026-02-08 03:25 (UTC+8))
#34070 [BugFix] Fix fastsafetensors TP all procs using all GPUs — bug — by njhill (created: 2026-02-08 08:46 (UTC+8))
#34069 [ROCm][Bugfix] fix act_quant_fusion module import error — bug,rocm,ready — by AndreasKaratzas (created: 2026-02-08 07:37 (UTC+8))
#34072 Add validation to reject non-text content in system messages — frontend — by veeceey (created: 2026-02-08 09:11 (UTC+8))
#34071 Fix realtime transcription never sending transcription.done with slow streaming — frontend — by veeceey (created: 2026-02-08 09:09 (UTC+8))
#34062 [torch.compile] Remove attention layer name from unified_kv_cache_update — no labels — by veeceey (created: 2026-02-08 03:28 (UTC+8))
#34065 [Docs] Clean up speculators docs — documentation,ready — by kylesayrs (created: 2026-02-08 05:14 (UTC+8))
#34068 [vLLM IR] fused_add_rms_norm and maybe_inplace — rocm,nvidia — by ProExpertProg (created: 2026-02-08 07:10 (UTC+8))
#34056 [Doc] Fix run_batch docs — frontend,ready — by DarkLight1337 (created: 2026-02-07 21:29 (UTC+8))
#34060 fix: compute chunk start times from actual audio lengths — frontend — by veeceey (created: 2026-02-08 03:24 (UTC+8))
#34064 Fix TRTLLM decode assertion in dummy batch with full cudagraph padding — v1,nvidia,meta-exported,fb-exported — by minosfuture (created: 2026-02-08 03:49 (UTC+8))
#34046 [Feature][Scheduler] Add split prefix caching feature to eliminate bf16 GEMM tiling divergence across cache-hit/miss paths — performance,v1 — by AndreasKaratzas (created: 2026-02-07 16:11 (UTC+8))
#34066 [Observability] Export EPLB balancedness metrics to Prometheus — no labels — by Gregory-Pereira (created: 2026-02-08 05:27 (UTC+8))
#34059 [ROCm] [CI] Reduce Resource of two test groups — rocm,ci/build — by tjtanaa (created: 2026-02-08 01:36 (UTC+8))
#34057 [CI/Build] Skip GCS test — ready — by DarkLight1337 (created: 2026-02-07 21:51 (UTC+8))
#34055 [DO Not Merge Yet][Refactor] [1/N] Reorganize kernel abstraction directory — ready,cpu,nvidia — by BadrBasowid (created: 2026-02-07 21:07 (UTC+8))
#34039 [Renderer] Define render_cmpl and render_chat — frontend,ready,multi-modality,qwen — by DarkLight1337 (created: 2026-02-07 12:28 (UTC+8))
#34048 [CI][Build] Pin grpcio-tools==1.78.0 — rocm,ready,ci/build — by noooop (created: 2026-02-07 17:12 (UTC+8))
#34049 [Spec Decode] Reduce TP communication for speculative decoding draft token generation — speculative-decoding,v1,llama — by zixi-qi (created: 2026-02-07 17:17 (UTC+8))
#34053 [Anthropic API] Fix 6 protocol compliance bugs in Messages endpoint — frontend — by SilviuSavu (created: 2026-02-07 20:02 (UTC+8))
#34052 fix(cpu): fix mla_decode compilation on x86 without AVX512 — cpu — by ihb2032 (created: 2026-02-07 17:59 (UTC+8))
#34050 [CI] Add P2pNccl integration test + rename nixl_integration to pd_integration — tpu,ci/build,v1,kv-connector — by eicherseiji (created: 2026-02-07 17:21 (UTC+8))
#34047 [ROCm][CI] Fix serving tokens test failures — rocm — by AndreasKaratzas (created: 2026-02-07 16:21 (UTC+8))
#34045 [Refactor] Reduce indirection in encoder cache manager — ready,v1,multi-modality — by DarkLight1337 (created: 2026-02-07 15:10 (UTC+8))
#34038 [ROCm][CI] Pinning lm-eval version to resolve multi-modal small eval bug — bug,rocm,ci/build — by AndreasKaratzas (created: 2026-02-07 12:26 (UTC+8))
#34043 Reapply [Attention][FA3] Update FA3 to include new swizzle optimization — ready,ci/build,v1 — by LucasWilkinson (created: 2026-02-07 14:03 (UTC+8))
#34044 fix: optimize cold start compilation by triggering torch.compile with minimal batch size — v1 — by gouthamk16 (created: 2026-02-07 14:05 (UTC+8))

Merged PRs

#34069 [ROCm][Bugfix] fix act_quant_fusion module import error — bug,rocm,ready — by AndreasKaratzas (merged: 2026-02-08 11:21 (UTC+8))
#34056 [Doc] Fix run_batch docs — frontend,ready — by DarkLight1337 (merged: 2026-02-07 22:18 (UTC+8))
#34057 [CI/Build] Skip GCS test — ready — by DarkLight1337 (merged: 2026-02-08 00:52 (UTC+8))
#33755 [Model] Enable Step3p5ForCausalLM testing — documentation,ready — by jeejeelee (merged: 2026-02-07 21:25 (UTC+8))
#34006 [Kernel] Add KernelConfig flag to enable/disable FlashInfer autotune — ready — by mmangkad (merged: 2026-02-07 21:24 (UTC+8))
#33493 Perf tuning and expansion of cases covered for wvSplitKrc — rocm,ready — by amd-hhashemi (merged: 2026-02-07 21:33 (UTC+8))
#33604 Make directory exist ok for ray spinning up multiple replicas on a single instance — ready — by jiangwu300 (merged: 2026-02-07 21:30 (UTC+8))
#33935 Update DeepGEMM version pin in Dockerfile to match #32479 — ready,ci/build — by zifeitong (merged: 2026-02-07 21:30 (UTC+8))
#33943 move checks out of unified_kv_cache_update custom op — rocm,ready,v1,ready-run-all-tests — by Rohan138 (merged: 2026-02-07 21:30 (UTC+8))
#33934 [Frontend]Add support for transcriptions and translations to run_batch — frontend,ready — by pooyadavoodi (merged: 2026-02-07 21:24 (UTC+8))
#34039 [Renderer] Define render_cmpl and render_chat — frontend,ready,multi-modality,qwen — by DarkLight1337 (merged: 2026-02-07 21:24 (UTC+8))
#33660 [PluggableLayer][3/N] Apply PluggableLayer to mamba layers. — ready,ready-run-all-tests — by whx-sjtu (merged: 2026-02-07 21:26 (UTC+8))
#33939 Enable Eagle3 speculative decoding for Mistral3ForConditionalGeneration to support eagle3 — ready — by TundeAtSN (merged: 2026-02-07 21:24 (UTC+8))
#34003 [torch.compile] Stop compiling identical artifacts — ready,ready-run-all-tests — by zou3519 (merged: 2026-02-07 21:24 (UTC+8))
#34048 [CI][Build] Pin grpcio-tools==1.78.0 — rocm,ready,ci/build — by noooop (merged: 2026-02-07 21:24 (UTC+8))
#34036 [Misc] Simplify get_max_tokens — frontend,ready — by DarkLight1337 (merged: 2026-02-07 16:59 (UTC+8))
#33978 Fix spelling errors — ready,v1 — by sleepcoo (merged: 2026-02-07 15:58 (UTC+8))
#34038 [ROCm][CI] Pinning lm-eval version to resolve multi-modal small eval bug — bug,rocm,ci/build — by AndreasKaratzas (merged: 2026-02-07 14:21 (UTC+8))
#34035 [Misc] Make PlaceholderRange.get_num_embeds a method — ready,v1,multi-modality,llama — by DarkLight1337 (merged: 2026-02-07 13:30 (UTC+8))
#33517 [Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support — ready,nvidia — by Code4me2 (merged: 2026-02-07 12:28 (UTC+8))
#33998 [Revert] Add util handle_deprecated back — ready — by yewentao256 (merged: 2026-02-07 12:14 (UTC+8))

PRs Closed Without Merging

#23348 [CI/Build] [Misc] feat(pv&pull secrets): add additional PV to deployment.yaml to use existing ones in project and add pull secrets to download images (e.g. for specific arch) from private registry — documentation,stale,v1 — by myadryshnikova (closed: 2026-02-08 10:17 (UTC+8))
#25296 [Profiling] Enable Ray worker process profiling using unitrace on Intel(R) GPUs — documentation,stale — by zma2 (closed: 2026-02-08 10:17 (UTC+8))
#34064 Fix TRTLLM decode assertion in dummy batch with full cudagraph padding — v1,nvidia,meta-exported,fb-exported — by minosfuture (closed: 2026-02-08 06:02 (UTC+8))
#29626 [Feature]: Qwen3 Omni Transcriptions — documentation,multi-modality,qwen — by sumitaryal (closed: 2026-02-07 23:32 (UTC+8))
#34045 [Refactor] Reduce indirection in encoder cache manager — ready,v1,multi-modality — by DarkLight1337 (closed: 2026-02-07 15:55 (UTC+8))