vLLM Development Activity Report - 2026-02-09

Time window: 2026-02-09 11:46 (UTC+8) ~ 2026-02-10 11:46 (UTC+8)
Statistics: New issues 24 | Closed issues 24 | New PRs 69 | Merged PRs 26 | PRs closed without merging 18

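📊 Daily Development Summary

During the 24-hour window (February 9-10, 2026), the vLLM community remained highly active: 24 issues were opened and 24 closed, while 69 PRs were opened and 26 merged. Development focused on AMD/ROCm platform compatibility and performance, multimodal and MoE model support, and stability fixes for the core engine and kernels. Several issues reported ROCm build and runtime problems, and the AMD team submitted matching performance and bug-fix PRs, which points to continued investment in the AMD ecosystem.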

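🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was very high in this period, ranging from build errors to performance optimization.

New issues (all still open and worth watching):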

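#34154 & #34132 - ROCm build errors: Both issues report -Wstring-compare and -Warray-compare errors when building on ROCm (Clang 22+/ROCm 6.3) with a newer toolchain. The root cause is a macro in quant_utils.cuh that compares string literals directly with ==. This is a forward-compatibility problem; the code should compare via strcmp or std::string instead so that it passes the stricter compiler diagnostics.
#34118 - GPTQ INT4 MoE kernel limitation: A user reports that a GPTQ-quantized Step 3.5-Flash MoE model fails on AMD MI300X because the current moe_wna16 quantization method is hard-coded to support only the SiLU activation. This exposes how narrow quantized MoE kernel support is on AMD, and with no Marlin-style fallback available there, model compatibility suffers.
#34113 - Startup hang on AI MAX+ 395: On the new AMD AI MAX+ 395, vLLM waits indefinitely for the core engine after startup. This may be a compatibility or initialization problem between this new hardware and the ROCm 7.2 stack and needs further diagnosis.
#34162, #34160, #34164 - CI test regressions: Several test groups in the AMD CI pipeline fail because the PyTorch 2.10 upgrade broke GPT-OSS model support. A fix is provided in PR #34153, which shows how an external dependency upgrade cascades into the AMD backend.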

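New/merged PRs (proactive work from the AMD team):

Merged #34153: Submitted by gshtras as a quick response to the CI failures above. For GPT-OSS models on ROCm, the PR falls back to the old triton_kernels API implementation to stay compatible with the ROCm build of Triton, which has not been updated yet. It is a temporary patch that keeps existing functionality working.
Open #34192: Submitted by AMD employee dllehr-amd (note the -amd suffix). The PR enables pre-shuffling of MXFP4 MoE weights on gfx950 (MI350 series) GPUs and bumps the aiter dependency to obtain the required shuffle_weight op. This is a hardware-specific performance optimization aimed at improving memory access patterns.
Open #34181: Submitted by AMD employee rasmith. The PR replaces assert torch.allclose with torch.testing.assert_close in the tests so that failures report clearer error information. This is a developer-experience improvement that makes AMD kernels easier to debug.
Open #34177: Submitted by AMD employee mgehre-amd. The PR adds wvSplitK skinny GEMM kernel support for the GFX11 (RDNA3) architecture, including adaptation to the wave32 execution model and related performance tuning. According to the description, it improves decode TPOT for Qwen3-4B on Strix Halo by about 20%. This is an important step in extending an existing optimization to consumer GPUs.
Open #34157: Submitted by dllehr-amd. It enables runtime dynamic MXFP4 quantization for the projection layers of DeepSeek V2 models via the dynamic_mxfp4_quant flag. Excluded layers can then be quantized dynamically at model load time instead of relying on pre-quantized weights, which makes the Quark OCP MX quantization scheme more flexible.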
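The motivation behind #34181 can be illustrated with a dependency-free sketch. It does not call PyTorch: allclose_sketch and assert_close_sketch are hypothetical stand-ins that apply the tolerance rule |actual - expected| <= atol + rtol * |expected| to plain Python lists, and the sample values are made up. The point is only that a boolean check tells you a test failed, while an assertion helper can say where and by how much.

```python
def allclose_sketch(actual, expected, rtol=1e-5, atol=1e-8):
    """Boolean-only check: True when every element is within tolerance."""
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


def assert_close_sketch(actual, expected, rtol=1e-5, atol=1e-8):
    """Same tolerance rule, but a failure names the worst element."""
    mismatches = [
        (i, abs(a - e))
        for i, (a, e) in enumerate(zip(actual, expected))
        if abs(a - e) > atol + rtol * abs(e)
    ]
    if mismatches:
        worst_index, worst_diff = max(mismatches, key=lambda m: m[1])
        raise AssertionError(
            f"{len(mismatches)} / {len(actual)} elements mismatched; "
            f"greatest absolute difference {worst_diff:.3g} "
            f"at index {worst_index}"
        )


def failure_message(check, actual, expected):
    """Run a check and return the text of its AssertionError, or None."""
    try:
        check(actual, expected)
    except AssertionError as exc:
        return str(exc)
    return None


actual = [1.0, 2.0, 3.5]    # hypothetical kernel output
expected = [1.0, 2.0, 3.0]  # hypothetical reference output

# The boolean check only says "not close".
print(allclose_sketch(actual, expected))
# The assertion helper reports the mismatch count and the worst offender.
print(failure_message(assert_close_sketch, actual, expected))
```

When debugging a numerically sensitive kernel, the second style of message usually narrows the search much faster than a bare failed assert.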

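Summary: AMD activity in this period shows the team working on two fronts. First, it responds quickly to compatibility problems caused by toolchain upgrades and external dependencies (#34153). Second, it pushes hardware-specific performance optimizations and feature extensions for AMD Instinct accelerators and RDNA3 (#34192, #34177, #34157). On the user side, reports centered on new compiler warnings, quantized kernel limitations, and new-hardware compatibility.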

💬 High-Traffic Discussion Analysis

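Issue #34180 - fastsafetensors distributed loading crashes due to unsorted file list
- Core issue: When loading a model with fastsafetensors in a multi-node TP setup, each node's local filesystem may enumerate files in a different order, so the shards assigned to different ranks do not match, which leads to an NCCL crash.
- Views and progress:
  - Reporter jaim12005: Traced the root cause in detail to fastsafetensors_weights_iterator, which lacks the _natural_sort_key sorting that safetensors_weights_iterator applies.
  - Maintainer mgoin: Quickly pointed to existing related work (#33491) as a reference and encouraged a PR.
  - Outcome: jaim12005 promptly submitted fix PR #34190. The discussion was efficient and pragmatic: only about 2 hours passed between the report and the PR, which reflects good collaboration and responsiveness in the community.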
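The failure mode in #34180 can be sketched in a few lines. This is an illustration under stated assumptions, not vLLM's code: natural_sort_key is a hypothetical re-implementation in the spirit of _natural_sort_key, assign_files_to_rank assumes a simple round-robin split across ranks, and the file names are invented.

```python
import re


def natural_sort_key(name):
    """Split a name into text and integer runs so shard-10 sorts after shard-2."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def assign_files_to_rank(files, rank, world_size, sort=True):
    """Hypothetical round-robin assignment of weight files to one rank."""
    ordered = sorted(files, key=natural_sort_key) if sort else list(files)
    return ordered[rank::world_size]


# The same four files, as two nodes might enumerate them locally.
node_a_listing = ["shard-11.safetensors", "shard-10.safetensors",
                  "shard-2.safetensors", "shard-1.safetensors"]
node_b_listing = ["shard-2.safetensors", "shard-11.safetensors",
                  "shard-1.safetensors", "shard-10.safetensors"]

# Without sorting, "rank 0" means different files on each node.
unsorted_a = assign_files_to_rank(node_a_listing, 0, 2, sort=False)
unsorted_b = assign_files_to_rank(node_b_listing, 0, 2, sort=False)
print(unsorted_a == unsorted_b)

# With a shared sort key, every node derives the same assignment.
sorted_a = assign_files_to_rank(node_a_listing, 0, 2)
sorted_b = assign_files_to_rank(node_b_listing, 0, 2)
print(sorted_a == sorted_b, sorted_a)
```

Sorting before partitioning makes the assignment a pure function of the file names, so it no longer depends on how each filesystem happens to list the directory.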
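Issue #26366 - Improve config validation using Pydantic
- Core issue: A long-standing "good first issue" that aims to migrate the hand-written __post_init__ validation logic in vLLM's config classes to the more standardized Pydantic Field and validators.
- Views and status:
  - Author hmellor: Provided a clear task breakdown, implementation guide, and example PRs, with the goal of improving code quality and maintainability.
  - Community contributors: Several volunteers (such as vrdn-23 and simondanielsson) claimed different config files, showing enthusiasm for contributing to project infrastructure.
  - Current status: The issue was closed during this period, marking the completion or wind-down of this refactoring effort. It shows how the vLLM project has successfully guided community contributors into core code-quality work.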
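Issue #20468 - Extend EPLB support to more MoE models (e.g. Qwen 3, Llama 4)
- Core issue: A call for community help in extending dynamic expert-parallel load balancing (EPLB) from DeepSeek models to other MoE architectures.
- Views and progress:
  - Author abmfy: Labeled it a "good first issue" and provided detailed implementation steps and code examples to lower the barrier to entry.
  - Community response: Drew many positive replies, with several developers (aladerran, ztang2370, rahul713rk, and others) asking to be assigned specific models.
  - Final status: The issue was auto-closed as stale during this period. Despite the enthusiastic initial response, no substantial list of follow-up PRs materialized, possibly because of implementation complexity or competing priorities. This reflects how hard it is to sustain feature requests in open source.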

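🔥 Hot Topics and Trends

Fine-grained fixes for multimodal and audio models: Several bug-fix PRs target specific models (Voxtral, GLM-4V, Nemotron Parse) (#34140, #34142, #34165, #34189), covering audio processing alignment, image processor compatibility, and runtime dependencies. This suggests that after broad model coverage, vLLM is moving into a phase of polishing and stability work.
Complex interactions between quantization and MoE: Issues #34129 and #34118 reveal new problems when online FP8 quantization and GPTQ INT4 quantization are combined with MoE (especially hybrid architectures such as Qwen3Next), including expert weights that are not split across GPUs and limited activation-function support. This is a frontier challenge for high-performance inference.
Adapting to new hardware and compiler ecosystems: Issues #34132, #34154, and #33991 all relate to compiler versions or specific CPU instruction sets. As LLVM/Clang evolves and heterogeneous hardware spreads, the effort needed to keep the code portable and forward-compatible grows significantly.
Routine performance profiling and optimization: PRs #34177 (AMD wvSplitK for RDNA3), #32846 (FlashInfer for GDN), and #34130 (MoE int4 tuning) show that tuning specific kernels for specific hardware has become routine for hardware vendors and contributors, and that performance competition is intense.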

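🛠️ Key Technical Changes

Merged #34175 - LMCache token-based IPC API: Switches the communication basis of the LMCache multi-process connector from "hash mode" to "token mode". Full token_ids are passed instead of block hashes, which brings the cache logic closer to the semantics of the request and lays the groundwork for finer-grained cache management and invalidation. It is an important low-level change in the direction of externalizing the KV cache.
Merged #34070 - Fix fastsafetensors memory waste under TP: Disables GDS mode in the distributed case, fixing the problem where every worker process initialized a CUDA context on every GPU and wasted GPU memory. The fix is a key step toward making fastsafetensors the default loading format and should noticeably improve multi-GPU startup.
Open #34183 - Fix CPU memory leak for multimodal requests caused by prefix caching: Targets the V1 engine, where Request objects form a reference cycle in the prefix cache so that the mm_features of multimodal requests cannot be released. The fix matters for the stability of long-running multimodal services because it directly prevents the leak.
Merged #34110 - Add Qwen3.5 model support: Support for the Qwen3.5 hybrid architecture (interleaved attention and linear-attention layers) was merged before the official model release. This shows how quickly vLLM tracks new model architectures and prepares for upcoming releases.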
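The reference-cycle problem described for #34183 can be reproduced in miniature. The sketch below is not vLLM code: Request, CacheEntry, and Features are hypothetical stand-ins, and the weak back-reference is just one common way to break a cycle, not necessarily what the PR does. It relies on CPython's reference counting, and the cyclic collector is disabled during the check so the result is deterministic.

```python
import gc
import weakref


class Features:
    """Stand-in for a large multimodal payload (hypothetical)."""


class Request:
    def __init__(self):
        self.mm_features = Features()
        self.cache_entry = None


class CacheEntry:
    def __init__(self, request_ref):
        # Either a strong reference (creates a cycle) or a weakref.
        self.request_ref = request_ref


def features_survive_drop(use_weak_backref):
    """Return True if mm_features is still alive after the request is dropped."""
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep the cyclic collector from interfering with the demo
    try:
        request = Request()
        back_ref = weakref.ref(request) if use_weak_backref else request
        request.cache_entry = CacheEntry(back_ref)
        probe = weakref.ref(request.mm_features)
        del request, back_ref  # the caller is done with the request
        return probe() is not None
    finally:
        if gc_was_enabled:
            gc.enable()
        gc.collect()  # clean up any cycle left behind by the demo


# Strong back-reference: the payload lingers until a cyclic GC pass.
print(features_survive_drop(use_weak_backref=False))
# Weak back-reference: reference counting frees the payload immediately.
print(features_survive_drop(use_weak_backref=True))
```

With a strong back-reference, the payload's lifetime depends on when the cyclic collector next runs, which is what makes such cycles costly when the payload is large.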

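📈 Development Activity Observations

Contributor diversity: Beyond the core team (ywang96, DarkLight1337, mgoin, noooop, and others), AMD employees (dllehr-amd, mgehre-amd, rasmith, gshtras) submitted several key technical PRs in this period. Many individual community contributors also handled everything from documentation to core bugs.
Efficient issue-to-PR loop: In cases such as #34180 -> #34190, a community member submitted a fix shortly after the report, guided by a maintainer, and the process ran smoothly.
Large refactor wrapped up: Issue #26366 on Pydantic-based config validation was closed, which suggests that a long-running, community-driven infrastructure improvement has been completed or has reached a natural stopping point.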

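💡 Issues Worth Watching

New hardware compatibility (#34113, #34133): Startup hangs were reported on both AMD AI MAX+ 395 and NVIDIA B300 (Blackwell). This is a reminder that initial support for new consumer and server GPUs may face new driver, firmware, and initialization challenges and needs dedicated testing.
Generality of quantized MoE (#34118, #34129): Current quantization schemes (especially on AMD) have clear gaps in MoE support, namely activation-function restrictions and expert weights that are not split. This will be a key technical obstacle to adopting large MoE models on AMD and in quantized deployments.
Warnings as errors (#34132, #34154): Newer compilers treat more warnings as errors, exposing potentially incompatible code patterns. These need systematic review and fixes to keep the code forward-compatible.
Follow-up on EPLB extension (#20468): This enthusiastic "good first issue" was closed as stale. Community managers may want to consider how to turn initial enthusiasm into sustained, productive contributions, for example through finer-grained task breakdowns or more active mentoring.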

📋 Appendix: Detailed Data Lists

New Issues

#34201 [Bug]: AttributeError: ‘Parameter’ object has no attribute ‘weight_loader’ | bug | by bingoct (created: 2026-02-10 10:23 (UTC+8))
#34199 [Bug]: The error message reads: “‘tcp://a URL that I haven’t set’ appears to be a URI. This might be due to a Kubernetes service discovery problem.” | bug | by wzz981 (created: 2026-02-10 09:48 (UTC+8))
#34186 [Bug]: LoRA adapters with mismatched module name prefixes silently produce base-model output | no labels | by Butanium (created: 2026-02-10 07:40 (UTC+8))
#34193 [Installation]: | installation | by freedom22-cyber (created: 2026-02-10 09:33 (UTC+8))
#34196 [Installation]: | installation | by freedom22-cyber (created: 2026-02-10 09:33 (UTC+8))
#34195 [Installation]: test issue | installation | by freedom22-cyber (created: 2026-02-10 09:33 (UTC+8))
#34194 [Installation]: | installation | by freedom22-cyber (created: 2026-02-10 09:33 (UTC+8))
#34197 [Installation]: | installation | by freedom22-cyber (created: 2026-02-10 09:33 (UTC+8))
#34180 [Bug]: fastsafetensors distributed loading crashes due to unsorted file list | bug | by jaim12005 (created: 2026-02-10 06:53 (UTC+8))
#34156 [Bug]: AttributeError: ‘Glm46VImageProcessorFast’ object has no attribute ‘get_number_of_image_patches’ | bug | by regmibijay (created: 2026-02-10 01:56 (UTC+8))
#34178 [CI Failure]: Transformers Nightly Models, tests/models/test_initialization.py | ci-failure | by bnellnm (created: 2026-02-10 06:36 (UTC+8))
#34166 [CI Failure]: tests/integration/test_rl.py: RuntimeError: operator torchvision::nms does not exist | ci-failure | by bnellnm (created: 2026-02-10 03:26 (UTC+8))
#34164 [CI Failure]: mi325_4: LM Eval Large Models | ci-failure | by AndreasKaratzas (created: 2026-02-10 02:46 (UTC+8))
#34162 [CI Failure]: mi325_1: ROCm GPT-OSS Eval | rocm,ci-failure | by AndreasKaratzas (created: 2026-02-10 02:29 (UTC+8))
#34160 [CI Failure]: mi325_1: Entrypoints Unit Tests | ci-failure | by AndreasKaratzas (created: 2026-02-10 02:13 (UTC+8))
#34154 [Bug]: | bug,rocm | by umarinkovic (created: 2026-02-10 01:41 (UTC+8))
#34136 [Usage]: Which version of LMCache is compatible with vLLM 0.15? | usage | by yangshanjun (created: 2026-02-09 20:33 (UTC+8))
#34133 [Bug]: Mistral Large 3 (NVFP4) hangs on initialization - 8x B300 (Blackwell) - “Waiting for core engine” | bug | by nandhaece07 (created: 2026-02-09 19:41 (UTC+8))
#34132 [Bug]: Compilation Error with Clang 22+ (ROCm) due to String Literal Comparison in quant_utils.cuh | bug,rocm | by kyuz0 (created: 2026-02-09 19:28 (UTC+8))
#34129 [Bug]: Online FP8 quantization does not split MoE expert weights across GPUs with TP/EP for Qwen3NextForCausalLM | bug | by ram16g (created: 2026-02-09 16:06 (UTC+8))
#34112 [Usage]: vllm bench serve get stuck when enable --mm-encoder-tp-mode data | usage | by shen-shanshan (created: 2026-02-09 11:48 (UTC+8))
#34122 [Feature]: Add MLA attention backend for Turing | feature request | by ir1ka (created: 2026-02-09 14:38 (UTC+8))
#34118 [Feature]: [ROCm]: GPTQ INT4 MoE kernel fails on non-SiLU activations - no Marlin fallback on AMD | feature request,rocm | by ehartford (created: 2026-02-09 13:28 (UTC+8))
#34113 [Bug][ROCm] Startup hangs at ‘Waiting for 1 local, 0 remote core engine proc(s) to start’ on AI MAX+ 395 | rocm | by RagnarokChan (created: 2026-02-09 11:49 (UTC+8))

Closed Issues

#20468 [Feature]: Support EPLB for More MoE Models, e.g. Qwen 3, Llama 4 | good first issue,feature request,stale | by abmfy (closed: 2026-02-10 10:19 (UTC+8))
#21605 [Feature]: Support fairness heuristics for the batched requests | feature request,stale | by skonto (closed: 2026-02-10 10:19 (UTC+8))
#25677 [Bug]: v0.8.6 Qwen3-30B-A3B The prefill phase takes 10 times longer when chunked prefill is enabled compared to when it is disabled. | bug,stale | by yifengwang66 (closed: 2026-02-10 10:18 (UTC+8))
#25714 How to pass thinking_budget for Qwen-3 Thinking models in offline AsyncLLMEngine? | usage,stale | by dahwin (closed: 2026-02-10 10:18 (UTC+8))
#26246 [Bug]: offline inference don’t works on macOS26 | bug,stale | by zhengkezhou1 (closed: 2026-02-10 10:17 (UTC+8))
#26256 [Performance]: General Discussion on Config-based Throughput Optimization in Server Mode | performance,stale | by BiEchi (closed: 2026-02-10 10:17 (UTC+8))
#26614 [Usage]: attn_metadata.seq_lens is not equal to attn_metadata.num_actual_tokens | usage,stale | by betacatZ (closed: 2026-02-10 10:17 (UTC+8))
#26640 [Bug]: Bug Report: Engine start failure with Llama-3.1-70B-Instruct on vLLM 0.9.2 + ZenTorch 5.1 + PyTorch 2.7 | bug,stale | by Siseendri (closed: 2026-02-10 10:17 (UTC+8))
#26655 [Installation]: | installation,stale | by art39print-c (closed: 2026-02-10 10:17 (UTC+8))
#34193 [Installation]: | installation | by freedom22-cyber (closed: 2026-02-10 09:36 (UTC+8))
#34196 [Installation]: | installation | by freedom22-cyber (closed: 2026-02-10 09:35 (UTC+8))
#34195 [Installation]: test issue | installation | by freedom22-cyber (closed: 2026-02-10 09:35 (UTC+8))
#34194 [Installation]: | installation | by freedom22-cyber (closed: 2026-02-10 09:35 (UTC+8))
#34197 [Installation]: | installation | by freedom22-cyber (closed: 2026-02-10 09:34 (UTC+8))
#24946 [Bug]: The inference speed of the whisper model under the v1 engine is much slower than v0 | bug | by qaz-t (closed: 2026-02-10 05:36 (UTC+8))
#28516 [Bug]: You can only use one kind of structured outputs constraint but multiple are specified | bug | by maolixiaolin-ailab (closed: 2026-02-10 05:35 (UTC+8))
#33906 [Bug]: mxfp4 (gpt-oss moe) on AMD rocm (W7900/gfx1100) breaks | bug,rocm | by ctheune (closed: 2026-02-10 03:36 (UTC+8))
#30926 [CI Failure]: mi325_1: V1 Test entrypoints | ci-failure | by AndreasKaratzas (closed: 2026-02-10 02:47 (UTC+8))
#33596 [CI Failure]: mi325_1: Multi-Modal Accuracy Eval (Small Models) | ci-failure | by AndreasKaratzas (closed: 2026-02-10 02:26 (UTC+8))
#26366 [Feature]: Improve config validation using Pydantic | good first issue,keep-open | by hmellor (closed: 2026-02-09 19:01 (UTC+8))
#33991 [Installation]: building docker cpu image with VLLM_CPU_DISABLE_AVX512=true (or on any x86_64 CPU without AVX512) fails to compile mla_decode.cpp because BFloat16 has no AVX2 fallback | installation,cpu | by matthewfranglen (closed: 2026-02-09 16:55 (UTC+8))
#34112 [Usage]: vllm bench serve get stuck when enable --mm-encoder-tp-mode data | usage | by shen-shanshan (closed: 2026-02-09 15:20 (UTC+8))
#29403 [Bug]: fastsafetensors in tensor parallel requires too much VRAM | bug | by Jannik2099 (closed: 2026-02-09 15:15 (UTC+8))
#33813 [Bug]: llm.score() fails on batched multimodal input for qwen3-vl-reranker | bug | by JiahuiChen-GitHub (closed: 2026-02-09 14:42 (UTC+8))

New PRs

#34204 [CI/Build] Relax test_mcp_tool_call | ready | by DarkLight1337 (created: 2026-02-10 11:45 (UTC+8))
#34203 [Model] Remove InternLM | documentation,new-model,llama | by noooop (created: 2026-02-10 11:42 (UTC+8))
#34169 [CPU][Distributed] Fix Make VLLM_DIST_IDENT stable across nodes and a… | v1,cpu | by charlesashby (created: 2026-02-10 04:21 (UTC+8))
#34134 openpangu-vl support video input | multi-modality | by hujiaxin0 (created: 2026-02-09 20:12 (UTC+8))
#34139 [UX] Fix failed precompiled installation when latest commit wheels still unavailable | ci/build | by Isotr0py (created: 2026-02-09 21:43 (UTC+8))
#34181 [CI][AMD][BugFix] Use torch.testing.assert_close instead of assert torch.allclose in test_rocm_skinny_gemms.py | bug,rocm | by rasmith (created: 2026-02-10 07:03 (UTC+8))
#34123 [Frontend][CI] Consolidate instrumentator entrypoints | frontend,ready,ci/build | by noooop (created: 2026-02-09 14:47 (UTC+8))
#34202 [XPU][7/N] enable xpu fp8 moe | no labels | by zufangzhu (created: 2026-02-10 10:29 (UTC+8))
#34183 [Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching | bug,ready,v1 | by ywang96 (created: 2026-02-10 07:25 (UTC+8))
#34200 [DNM][Bugfix] Fix mamba cache dtype for Qwen3.5 | bug,qwen | by ywang96 (created: 2026-02-10 10:14 (UTC+8))
#34192 [ROCm] Enable MXFP4 MoE weight pre-shuffling on gfx950 and update aiter | rocm,ready,ci/build | by dllehr-amd (created: 2026-02-10 09:14 (UTC+8))
#34198 [Bugfix] Adopt ChunkGatedDeltaRule for Qwen3.5 | bug,ready,qwen | by ywang96 (created: 2026-02-10 09:44 (UTC+8))
#34125 [Core] Move pause and resume functions into engine | documentation,v1 | by hao-aaron (created: 2026-02-09 15:17 (UTC+8))
#34170 Extend ColBERT support to non-standard BERT backbones | documentation,new-model | by ieBoytsov (created: 2026-02-10 04:23 (UTC+8))
#34148 [Doc] Update usage of --limit-mm-per-prompt | documentation,ready | by DarkLight1337 (created: 2026-02-10 00:47 (UTC+8))
#34190 [Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 | bug,ready | by jaim12005 (created: 2026-02-10 08:50 (UTC+8))
#34175 [LMCache] Token Base IPC API | ready,kv-connector | by Oasis-Git (created: 2026-02-10 06:02 (UTC+8))
#34191 [Frontend] Add per-request stream interval control with time-based throttling | frontend,v1 | by quan-deepinfra (created: 2026-02-10 08:59 (UTC+8))
#34189 [Bugfix] Add missing runtime dependencies for Nemotron Parse 1.1 | bug,ci/build | by kamalesh0406 (created: 2026-02-10 08:41 (UTC+8))
#34152 [UX nit] Fix non-default api_server_count message | frontend | by mgoin (created: 2026-02-10 01:33 (UTC+8))
#34188 [responsesAPI] fix simpleContext streaming output_messages | frontend,ready,gpt-oss | by qandrew (created: 2026-02-10 08:21 (UTC+8))
#34187 [Bugfix] Fix DP Attention Padding in Dummy Run | bug,ready,v1 | by LucasWilkinson (created: 2026-02-10 07:40 (UTC+8))
#34185 [Kernel] [Helion] [6/N] Add num_tokens dimension to silu_mul autotuning and dispatching | no labels | by gmagogsfm (created: 2026-02-10 07:39 (UTC+8))
#34184 [Online Quantization] Support memory-efficient online quantization via layerwise loading | no labels | by kylesayrs (created: 2026-02-10 07:29 (UTC+8))
#34153 [Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available | bug,rocm,ready,gpt-oss | by gshtras (created: 2026-02-10 01:39 (UTC+8))
#34179 [Feature] Decode Context Parallel support for GPU model runner v2 | ready,v1,nvidia | by yewentao256 (created: 2026-02-10 06:48 (UTC+8))
#34171 [Feat][RL][2/2] Native Weight Syncing API: IPC | documentation,ci/build | by hao-aaron (created: 2026-02-10 04:51 (UTC+8))
#34182 [Misc] Introduce ec_both role EC (encoder cache) connector | v1 | by furionw (created: 2026-02-10 07:08 (UTC+8))
#34173 Introduce ec_both for EC connector | v1 | by furionw (created: 2026-02-10 05:04 (UTC+8))
#34128 Vllm CPU benchmark suite improvement | performance,ci/build,cpu | by louie-tsai (created: 2026-02-09 15:57 (UTC+8))
#34177 [ROCm][Kernel] Add GFX11 (RDNA3) support for wvSplitK skinny GEMM kernels | rocm | by mgehre-amd (created: 2026-02-10 06:25 (UTC+8))
#34176 [ROCm][Kernel] Add GFX11 (RDNA3/4) support for wvSplitK skinny GEMM kernels | rocm | by mgehre-amd (created: 2026-02-10 06:21 (UTC+8))
#34167 [ModelRunner V2][BugFix] Fix max_query_len calculation | bug,ready,v1,nvidia | by njhill (created: 2026-02-10 03:26 (UTC+8))
#34155 [compile] Enable AOT compile with 2.10 in trunk. | ready | by zhxchen17 (created: 2026-02-10 01:46 (UTC+8))
#34174 [Draft][ROCm] ROCm7.2 as base | rocm,ci/build | by gshtras (created: 2026-02-10 05:46 (UTC+8))
#34172 nightly-run-models | ci/build,llama | by debroy-rh (created: 2026-02-10 05:01 (UTC+8))
#34151 Revert “[MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 (#33257)” | no labels | by amitz-nv (created: 2026-02-10 01:32 (UTC+8))
#34145 Moe config backend | documentation,rocm,frontend,needs-rebase,ci/build,v1,llama,gpt-oss,kv-connector | by danichan-mkm (created: 2026-02-10 00:24 (UTC+8))
#34135 [Bugfix] Fix TorchAO bugs and add --torchao-config CLI | bug | by jwpark33 (created: 2026-02-09 20:21 (UTC+8))
#34161 add extras dict to FinishedRequestStats to enable stat logger plugins… | v1 | by crawdaddie (created: 2026-02-10 02:27 (UTC+8))
#34146 [Benchmarks] Add bimodal dataset for mixed short-chat + long-RAG work… | performance | by pbpatre (created: 2026-02-10 00:30 (UTC+8))
#34157 [ROCm] Add dynamic mxfp4 quantization for DeepSeek V2 projection layers | rocm,deepseek | by dllehr-amd (created: 2026-02-10 01:57 (UTC+8))
#34168 [DRAFT][Feature] implement online data capture/generation for eagle3 | speculative-decoding,v1 | by harryzorus (created: 2026-02-10 03:56 (UTC+8))
#34147 [KVConnector] Fix bug and clean up redundant code in KV connectors | bug,kv-connector | by hickeyma (created: 2026-02-10 00:38 (UTC+8))
#34165 Fix AttributeError: ‘Glm46VImageProcessorFast’ object has no attribut… | no labels | by VishnuVV27 (created: 2026-02-10 03:23 (UTC+8))
#34163 [Bug] Fix MLPSpeculatorConfig missing num_attention_heads attribute | bug | by Mr-Neutr0n (created: 2026-02-10 02:40 (UTC+8))
#34140 [Bugfix] Voxtral prompt/audio placeholder alignment | bug,ready | by artuskg (created: 2026-02-09 22:38 (UTC+8))
#34142 [Bugfix] Avoid duplicate k-proj weight emission in helper | bug,ready | by artuskg (created: 2026-02-09 22:38 (UTC+8))
#34149 [Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 | bug,performance | by mgehre-amd (created: 2026-02-10 01:04 (UTC+8))
#34159 [Docs] MTP Docs | documentation,v1 | by kylesayrs (created: 2026-02-10 02:10 (UTC+8))
#34158 [Bugfix] Relax TRTLLM KV cache contiguity assertion for cross-layer layout | bug,v1,nvidia | by Etelis (created: 2026-02-10 02:02 (UTC+8))
#34126 Add flagos in MiniCPM-o | no labels | by tc-mb (created: 2026-02-09 15:17 (UTC+8))
#34150 Causal Masking | v1,meta-exported,fb-exported | by omkhalil (created: 2026-02-10 01:26 (UTC+8))
#34121 [BugFix] Mistakenly passing num_reqs_padded as num_reqs in _dummy_run | bug,v1 | by Selkh (created: 2026-02-09 14:38 (UTC+8))
#34141 [Bugfix] Harden Voxtral encoder melspec and chunk handling | bug | by artuskg (created: 2026-02-09 22:38 (UTC+8))
#34130 [Perf] fused_moe: add int4_w4a16 benchmark support and tuning config | performance | by mgehre-amd (created: 2026-02-09 18:14 (UTC+8))
#34137 [Docs] Fix format error in KV load failure recovery doc | documentation | by zzaebok (created: 2026-02-09 20:50 (UTC+8))
#34144 [Misc] Clean up validation logic in input processor | ready,v1,multi-modality | by DarkLight1337 (created: 2026-02-09 23:37 (UTC+8))
#34131 [Model] Add Qwen3.5 hybrid model support | new-model,needs-rebase,qwen | by liuchen2026fly (created: 2026-02-09 19:08 (UTC+8))
#34143 [Fix] Bump lmcache minimum version to 0.3.11 | ci/build,kv-connector | by MohanKumar21 (created: 2026-02-09 22:56 (UTC+8))
#34120 [UX] Add --language-model-only for hybrid models | ready | by ywang96 (created: 2026-02-09 14:04 (UTC+8))
#34138 Add MiniMax-M2 model support to vLLM | needs-rebase | by JingqingZh (created: 2026-02-09 20:51 (UTC+8))
#34117 [XPU][6/N] add xpu scaled_mm kernel | ready,ci/build | by zufangzhu (created: 2026-02-09 13:02 (UTC+8))
#34127 fix minicpmo4.5: fix attn_mask in vit attn && fix resampler pos_emb i… | no labels | by tc-mb (created: 2026-02-09 15:42 (UTC+8))
#34124 [Model] GLM adaptation | performance,new-model,ready,deepseek | by jeejeelee (created: 2026-02-09 15:10 (UTC+8))
#34116 [CPU] Enable FP16 (Half dtype) support for s390x | cpu | by R3hankhan123 (created: 2026-02-09 12:50 (UTC+8))
#34119 [Fix Bug]num_active_loras always equals to zero | bug,v1,gpt-oss | by RunkaiTao (created: 2026-02-09 13:45 (UTC+8))
#34115 Fix kernel bugs in XPU LoRA and MOE LORA | no labels | by chaojun-zhang (created: 2026-02-09 12:35 (UTC+8))
#34114 Feat/ascend npu adapt v0.11.0 | documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality | by handsomezhuzhu (created: 2026-02-09 12:02 (UTC+8))

Merged PRs

#34175 [LMCache] Token Base IPC API | ready,kv-connector | by Oasis-Git (merged: 2026-02-10 09:18 (UTC+8))
#33233 [structured output] validate unsupported json features first | structured-output,ready,v1 | by andyxning (merged: 2026-02-10 07:49 (UTC+8))
#34153 [Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available | bug,rocm,ready,gpt-oss | by gshtras (merged: 2026-02-10 07:38 (UTC+8))
#33936 [Doc] Add DCP support to attention backend doc | documentation,ready | by mgoin (merged: 2026-02-10 07:33 (UTC+8))
#34167 [ModelRunner V2][BugFix] Fix max_query_len calculation | bug,ready,v1,nvidia | by njhill (merged: 2026-02-10 05:47 (UTC+8))
#33945 [torch.compile][Fusion] Fix attention fusion pass removing kv_udpate op. | ready,torch.compile | by charlifu (merged: 2026-02-10 05:15 (UTC+8))
#34110 [MODEL] Adding Support for Qwen3.5 Models | documentation,new-model,speculative-decoding,ready,v1,qwen | by JJJYmmm (merged: 2026-02-09 21:12 (UTC+8))
#34032 [ROCm] update triton branch to support gpt-oss models for gfx11xx devices | rocm,ready,ci/build,gpt-oss | by hongxiayang (merged: 2026-02-10 03:36 (UTC+8))
#34140 [Bugfix] Voxtral prompt/audio placeholder alignment | bug,ready | by artuskg (merged: 2026-02-10 03:30 (UTC+8))
#34142 [Bugfix] Avoid duplicate k-proj weight emission in helper | bug,ready | by artuskg (merged: 2026-02-10 03:17 (UTC+8))
#32846 [Kernel] use flashinfer for gdn prefill | performance,ready,qwen | by ZJY0516 (merged: 2026-02-10 01:17 (UTC+8))
#34087 [Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) | bug,ready,nvidia | by TomerBN-Nvidia (merged: 2026-02-10 00:44 (UTC+8))
#33985 [Kernel] FlashInfer: switch allreduce fusion to unified API | performance,ready | by mmangkad (merged: 2026-02-09 23:43 (UTC+8))
#32365 Add NUMA Core binding in nixl_connector for CPU xPyD | ready,v1,cpu,kv-connector | by ZhengHongming888 (merged: 2026-02-09 23:39 (UTC+8))
#34120 [UX] Add --language-model-only for hybrid models | ready | by ywang96 (merged: 2026-02-09 22:57 (UTC+8))
#34031 [CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 | ready,ci/build | by ProExpertProg (merged: 2026-02-09 23:05 (UTC+8))
#33810 [Misc] Fix up attention benchmarks | performance,ready,ci/build | by LucasWilkinson (merged: 2026-02-09 22:42 (UTC+8))
#34117 [XPU][6/N] add xpu scaled_mm kernel | ready,ci/build | by zufangzhu (merged: 2026-02-09 20:17 (UTC+8))
#33901 [Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul | ready,cpu | by nikhil-arm (merged: 2026-02-09 18:04 (UTC+8))
#32300 [ASR] Fix audio benchmark and add RTFx metric | documentation,performance,ready | by ekagra-ranjan (merged: 2026-02-09 18:02 (UTC+8))
#34107 [CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr | ready,multi-modality | by AndreasKaratzas (merged: 2026-02-09 17:37 (UTC+8))
#34124 [Model] GLM adaptation | performance,new-model,ready,deepseek | by jeejeelee (merged: 2026-02-09 17:32 (UTC+8))
#34052 fix(cpu): fix mla_decode compilation on x86 without AVX512 | ready,cpu | by ihb2032 (merged: 2026-02-09 16:55 (UTC+8))
#34070 [BugFix] Fix fastsafetensors TP all procs using all GPUs | bug,ready | by njhill (merged: 2026-02-09 15:15 (UTC+8))
#31127 [Frontend][last/5] Make pooling entrypoints request schema consensus. | documentation,frontend,ready,ci/build,multi-modality | by noooop (merged: 2026-02-09 14:42 (UTC+8))
#34103 [Tiny] Rename encoder budget file to more specific name | ready,v1,multi-modality | by reaganjlee (merged: 2026-02-09 11:48 (UTC+8))

PRs Closed Without Merging

#34139 [UX] Fix failed precompiled installation when latest commit wheels still unavailable | ci/build | by Isotr0py (closed: 2026-02-10 11:05 (UTC+8))
#23481 [Misc] Use req_ids instead of req_id_to_index in scheduler’s update_from_output | tpu,ready,stale,v1,kv-connector | by WoosukKwon (closed: 2026-02-10 10:18 (UTC+8))
#25440 [feat][distributed]: support per-process group configuration via pg_options | stale | by linfeng-yuan (closed: 2026-02-10 10:18 (UTC+8))
#25881 [Bugfix][Multi Modal] Fix broken frames in video input | stale,multi-modality | by Jixin10 (closed: 2026-02-10 10:18 (UTC+8))
#26205 [Bugfix][NCCL P/D] Send layer KV-cache only after layer forward | documentation,needs-rebase,stale,kv-connector | by ruisearch42 (closed: 2026-02-10 10:18 (UTC+8))
#26610 fix seed_oss_tool_call used Stream | frontend,stale,tool-calling | by CallmeZhangChenchen (closed: 2026-02-10 10:17 (UTC+8))
#26615 [CI/Build] allow setting GDRCOPY_VERSION during docker build | ci/build,stale | by Ivan8or (closed: 2026-02-10 10:17 (UTC+8))
#26657 Vllm roe | documentation,performance,structured-output,tpu,speculative-decoding,needs-rebase,ci/build,stale,v1,llama | by ShengxuanQiu (closed: 2026-02-10 10:17 (UTC+8))
#34009 [Bugfix] Fix DP Attention Padding in Dummy Run | bug,ready,v1 | by benchislett (closed: 2026-02-10 07:43 (UTC+8))
#34173 Introduce ec_both for EC connector | v1 | by furionw (closed: 2026-02-10 07:04 (UTC+8))
#34176 [ROCm][Kernel] Add GFX11 (RDNA3/4) support for wvSplitK skinny GEMM kernels | rocm | by mgehre-amd (closed: 2026-02-10 06:25 (UTC+8))
#32801 [WIP] Fix GPT-OSS prefix caching not working with EAGLE | v1,gpt-oss | by mgoin (closed: 2026-02-10 05:42 (UTC+8))
#34135 [Bugfix] Fix TorchAO bugs and add --torchao-config CLI | bug | by jwpark33 (closed: 2026-02-09 23:57 (UTC+8))
#29282 [ASR] Add script for Multi-batch long audio chunking with output streaming | documentation | by ekagra-ranjan (closed: 2026-02-10 01:42 (UTC+8))
#34131 [Model] Add Qwen3.5 hybrid model support | new-model,needs-rebase,qwen | by liuchen2026fly (closed: 2026-02-10 00:40 (UTC+8))
#30260 Support TP which is not divded for NVFP4 kernels (flashinfer-cutlass) by adding dynamic padding | needs-rebase,nvidia | by danielafrimi (closed: 2026-02-09 16:41 (UTC+8))
#31815 [Bugfix] Fix TorchAO quantization bugs and add --torchao-config CLI support | bug | by jwpark33 (closed: 2026-02-09 16:35 (UTC+8))
#34114 Feat/ascend npu adapt v0.11.0 | documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality | by handsomezhuzhu (closed: 2026-02-09 12:03 (UTC+8))