vLLM Development Activity Report - 2026-02-09
Time window: 2026-02-09 11:46 (UTC+8) ~ 2026-02-10 11:46 (UTC+8). Statistics: 24 new issues | 24 closed issues | 69 new PRs | 26 merged PRs | 18 PRs closed without merging
📊 Daily Development Summary
During the 24-hour window (February 9-10, 2026), the vLLM community remained highly active, handling 48 issues and 69 PRs in total. Development focused on AMD/ROCm platform compatibility and performance, multimodal and MoE model support, and stability fixes in the core engine and kernels. Several ROCm build and runtime problems were reported, while the AMD team submitted corresponding performance optimizations and bug-fix PRs, signaling sustained investment in the AMD ecosystem.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was very high this period, spanning everything from build errors to performance optimization.
New issues (all open and worth tracking):
- #34154 & #34132 - ROCm build errors: both issues report `-Wstring-compare` and `-Warray-compare` errors when compiling on ROCm with newer toolchains (Clang 22+/ROCm 6.3). The root cause is that macros in `quant_utils.cuh` compare string literals directly with `==`. This is a forward-compatibility problem; the code should switch to `std::string` or `strcmp` to satisfy stricter C++ standards.
- #34118 - GPTQ INT4 MoE kernel limitation: a user running the GPTQ-quantized Step 3.5-Flash MoE model on AMD MI300X hit a failure because the `moe_wna16` quantization method is hard-coded to support only the SiLU activation function. This exposes the narrow coverage of quantized MoE kernels on AMD and the lack of a Marlin-style fallback, limiting model compatibility.
- #34113 - AI MAX+ 395 startup hang: a user on the new AMD AI MAX+ 395 GPU saw vLLM wait indefinitely for the core engine after startup. This is likely a compatibility or initialization problem between this specific new hardware and the ROCm 7.2 stack, and needs further diagnosis.
- #34162, #34160, #34164 - CI test regressions: multiple test groups in the AMD CI pipeline failed after the PyTorch 2.10 upgrade broke GPT-OSS model support. A fix is already available in PR #34153, illustrating the knock-on effect of external dependency upgrades on the AMD backend.
New and merged PRs (showing proactive AMD team development):
- Merged #34153: submitted by gshtras as a rapid response to the CI failures above. On ROCm, it falls back to the old `triton_kernels` API for GPT-OSS models, for compatibility with the not-yet-updated ROCm build of Triton. A stopgap patch that keeps existing functionality working.
- Open #34192: by dllehr-amd (an AMD engineer, per the -amd suffix). Enables pre-shuffling of MXFP4 MoE weights on gfx950 (MI300) GPUs and upgrades the `aiter` dependency to pull in the required `shuffle_weight` op. A deep, hardware-specific performance optimization targeting better memory access patterns.
- Open #34181: by AMD engineer rasmith. Replaces `assert torch.allclose` with `torch.testing.assert_close` in tests, so failures report clearer error diagnostics. A developer-experience improvement that makes AMD kernels easier to debug.
- Open #34177: by AMD engineer mgehre-amd. Adds wvSplitK skinny-GEMM kernel support for the GFX11 (RDNA3) architecture, including adaptation to the wave32 execution model and related tuning. Per the PR description, it delivers roughly a 20% decode TPOT improvement for Qwen3-4B on Strix Halo, an important step in extending existing optimizations to consumer GPUs.
- Open #34157: by dllehr-amd, enabling runtime dynamic MXFP4 quantization for the DeepSeek V2 projection layers (via the `dynamic_mxfp4_quant` flag). Excluded layers can thus be quantized dynamically at model load time instead of relying on pre-quantized weights, making the Quark OCP MX quantization scheme more flexible.
Summary: this period shows the AMD team fighting on two fronts: rapidly responding to compatibility breakage caused by toolchain and external dependency upgrades (#34153), while pushing ahead with deep performance optimizations and feature extensions for MI300 and RDNA3 hardware (#34192, #34177, #34157). On the user side, feedback clustered around new compiler warnings, quantization kernel limitations, and new-hardware compatibility.
💬 High-Activity Discussions
- Issue #34180 - fastsafetensors distributed loading crashes due to unsorted file list
  - Core issue: when loading a model with `fastsafetensors` under multi-node TP, filesystem enumeration order differs across nodes, so the shards assigned to different ranks do not match, triggering an NCCL crash.
  - Views and progress:
    - Reporter jaim12005: pinned the root cause to `fastsafetensors_weights_iterator` lacking the `_natural_sort_key` sorting that `safetensors_weights_iterator` applies.
    - Maintainer mgoin: quickly pointed to an existing utility (#33491) and encouraged a PR.
  - Outcome: jaim12005 immediately submitted fix PR #34190. The discussion was efficient and pragmatic; only about two hours passed from report to PR, reflecting the community's strong collaboration and responsiveness.
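The fix in PR #34190 amounts to sorting the weight files with a natural sort key before sharding, so every rank derives the same order. As a rough sketch of what such a key looks like (an illustrative implementation, not the actual `_natural_sort_key` from vLLM):

```python
import re

def natural_sort_key(name: str):
    # Split into digit / non-digit runs so "model-2" sorts before
    # "model-10" (numeric order, not lexicographic).
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

files = [
    "model-10-of-12.safetensors",
    "model-2-of-12.safetensors",
    "model-1-of-12.safetensors",
]

# Every rank sorts identically, so shard-to-rank assignment matches
# across nodes regardless of filesystem enumeration order.
print(sorted(files, key=natural_sort_key))
```

A plain `sorted(files)` would put `model-10` before `model-2`; the key above restores the intended numeric order.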
- Issue #26366 - Improve config validation using Pydantic
  - Core issue: a long-open "good first issue" aiming to migrate the manual `__post_init__` validation logic in vLLM's config classes to proper Pydantic `Field`s and validators.
  - Views and status:
    - Initiator hmellor: provided a clear task breakdown, implementation guide, and example PR, aiming to raise code quality and maintainability.
    - Community contributors: several volunteers (e.g. vrdn-23, simondanielsson) claimed individual config files, showing real enthusiasm for contributing to project infrastructure.
    - Current status: the issue was closed this period, marking the completion (or at least a milestone) of the refactor. It shows how vLLM successfully channels community contributors into core code-quality work.
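The gist of that migration can be sketched as follows. The config class and field here are hypothetical stand-ins, not vLLM's actual config classes, and the Pydantic form is shown as a comment to keep the snippet dependency-free:

```python
from dataclasses import dataclass

# Before: manual validation crammed into __post_init__
# (hypothetical field, not vLLM's real config).
@dataclass
class SchedulerConfig:
    max_num_seqs: int = 256

    def __post_init__(self):
        if self.max_num_seqs <= 0:
            raise ValueError("max_num_seqs must be positive")

# After (sketch): the same constraint declared with Pydantic's
# standard Field API instead of hand-written checks:
#
#   from pydantic import BaseModel, Field
#
#   class SchedulerConfig(BaseModel):
#       max_num_seqs: int = Field(default=256, gt=0)

cfg = SchedulerConfig()
print(cfg.max_num_seqs)
```

The declarative form centralizes the constraint next to the field, which is what makes the refactor attractive for maintainability.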
- Issue #20468 - Support EPLB for more MoE models, e.g. Qwen 3, Llama 4
  - Core issue: a call for community help extending dynamic Expert Parallelism Load Balancing (EPLB) from DeepSeek models to other MoE architectures.
  - Views and progress:
    - Initiator abmfy: labeled it a "good first issue" and provided detailed implementation steps and code samples to lower the barrier to entry.
    - Community response: plenty of positive replies, with several developers (aladerran, ztang2370, rahul713rk, among others) requesting assignment of specific models.
    - Final status: the issue was auto-closed as stale this period. Despite the enthusiastic start, no sizable stream of follow-up PRs materialized, likely due to implementation complexity or shifting priorities; a reminder of how hard it is to sustain open-source feature work.
🔥 Hot Topics and Trends
- Fine-grained multimodal and audio model fixes: several bug-fix PRs landed for specific models (Voxtral, GLM-4V, Nemotron Parse) (#34140, #34142, #34165, #34189), covering audio placeholder alignment, image-processor compatibility, and runtime dependencies. After achieving breadth of model support, vLLM is moving into a phase of deep polishing and stability work.
- Complex interactions between quantization and MoE: Issues #34129 and #34118 surfaced new problems when online FP8 quantization and GPTQ INT4 quantization meet MoE (especially hybrid architectures like Qwen3Next), such as expert weights not being split and limited activation-function support. This is becoming a frontier challenge for high-performance inference.
- New hardware and compiler ecosystem adaptation: Issues #34132, #34154, and #33991 all involve compiler versions or specific CPU instruction sets. As LLVM/Clang iterates and heterogeneous hardware proliferates, the cost of maintaining cross-platform, forward-compatible code is rising noticeably.
- Profiling and optimization as routine: PRs #34177 (AMD wvSplitK for RDNA3), #32846 (FlashInfer for GDN), and #34130 (MoE int4 tuning) show that per-kernel, per-hardware tuning has become a regular activity for hardware vendors and contributors alike, with performance competition heating up.
🛠️ Key Technical Changes
- Merged #34175 - LMCache token-based IPC API: switches the LMCache multi-process connector's communication from "hash mode" to "token mode", passing full `token_ids` instead of block hashes. This brings the cache logic closer to request semantics and lays the groundwork for finer-grained cache management and invalidation, an important low-level step toward externalized KV caching.
- Merged #34070 - fix fastsafetensors TP memory waste: disables GDS mode in distributed setups, fixing a problem where every worker initialized a CUDA context on every GPU and wasted VRAM. This fix is a key step toward making `fastsafetensors` the default loading format and noticeably improves the multi-GPU startup experience.
- Open #34183 - fix multimodal memory leak caused by prefix caching: addresses a V1-engine problem where `Request` objects for multimodal requests formed a reference cycle through the prefix cache, so their `mm_features` could never be released. The fix is critical to the stability of long-running multimodal services, directly preventing a memory leak.
- Merged #34110 - add Qwen3.5 model support: merged support for the Qwen3.5 hybrid architecture (interleaved attention and linear-attention layers) ahead of the official model release, demonstrating vLLM's rapid response to frontier model architectures and readiness for the upcoming launch.
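The leak pattern that #34183 addresses, an object kept alive by a reference cycle until the cyclic garbage collector happens to run, can be reproduced in a few lines (the class names here are illustrative, not vLLM's actual types):

```python
import gc
import weakref

# Illustrative types only; not vLLM's actual Request/prefix-cache classes.
class Request:
    def __init__(self):
        self.mm_features = ["<large multimodal tensors>"]
        self.cache_entry = None

class CacheEntry:
    def __init__(self, request):
        self.request = request  # strong back-reference -> cycle

gc.disable()  # keep the demonstration deterministic

req = Request()
req.cache_entry = CacheEntry(req)
probe = weakref.ref(req)

del req  # drop the last external reference
# Refcounting alone cannot free the pair: Request and CacheEntry keep
# each other alive, so mm_features stays resident.
print(probe() is not None)  # True: still alive

gc.enable()
gc.collect()  # the cyclic collector eventually reclaims the pair
print(probe() is None)      # True: freed only after a GC pass
```

In a long-running server that keeps allocating such cycles, memory usage grows between GC passes; the usual fix is to break the cycle explicitly (drop the back-reference, or hold it via `weakref`).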
📈 Development Activity Observations
- Contributor diversity: beyond the core team (ywang96, DarkLight1337, mgoin, noooop, and others), AMD engineers (dllehr-amd, mgehre-amd, rasmith, gshtras) submitted several key technical PRs this period, and many individual community contributors handled everything from docs to core bugs.
- Efficient issue-to-PR loop: e.g. #34180 -> #34190, where a community member, guided by a maintainer, submitted a fix shortly after the problem was reported; the process ran smoothly.
- Large refactor completed: Issue #26366 on Pydantic config validation was closed, marking the successful conclusion of a long-running, community-driven infrastructure improvement.
💡 Issues Worth Watching
- New hardware compatibility (#34113, #34133): both the AMD AI MAX+ 395 and NVIDIA B300 (Blackwell) reported startup hangs. Initial support for new consumer and server GPUs may face fresh driver/firmware/initialization challenges, warranting dedicated testing.
- Generality of quantized MoE (#34118, #34129): current quantization schemes (especially on AMD) have clear gaps in MoE support (activation-function restrictions, broken weight splitting). This is a key technical obstacle to running large MoE models on AMD and in quantized settings.
- Compiler warnings as errors (#34132, #34154): newer compilers treat more warnings as errors, exposing latent incompatible constructs. These need systematic review and fixing to keep the codebase forward-compatible.
- Follow-up on the EPLB extension (#20468): this enthusiastically received "good first issue" was closed as stale. Maintainers may want to consider how to convert initial community enthusiasm into sustained, productive contributions, for example through finer-grained task breakdowns or more active mentoring.
📋 Appendix: Detailed Data
New Issues
- #34201 [Bug]: AttributeError: 'Parameter' object has no attribute 'weight_loader' — bug — by bingoct (created: 2026-02-10 10:23 (UTC+8))
- #34199 [Bug]: The error message reads: "'tcp://a URL that I haven't set' appears to be a URI. This might be due to a Kubernetes service discovery problem." — bug — by wzz981 (created: 2026-02-10 09:48 (UTC+8))
- #34186 [Bug]: LoRA adapters with mismatched module name prefixes silently produce base-model output — no labels — by Butanium (created: 2026-02-10 07:40 (UTC+8))
- #34193 [Installation]: — installation — by freedom22-cyber (created: 2026-02-10 09:33 (UTC+8))
- #34196 [Installation]: — installation — by freedom22-cyber (created: 2026-02-10 09:33 (UTC+8))
- #34195 [Installation]: test issue — installation — by freedom22-cyber (created: 2026-02-10 09:33 (UTC+8))
- #34194 [Installation]: — installation — by freedom22-cyber (created: 2026-02-10 09:33 (UTC+8))
- #34197 [Installation]: — installation — by freedom22-cyber (created: 2026-02-10 09:33 (UTC+8))
- #34180 [Bug]: fastsafetensors distributed loading crashes due to unsorted file list — bug — by jaim12005 (created: 2026-02-10 06:53 (UTC+8))
- #34156 [Bug]: AttributeError: 'Glm46VImageProcessorFast' object has no attribute 'get_number_of_image_patches' — bug — by regmibijay (created: 2026-02-10 01:56 (UTC+8))
- #34178 [CI Failure]: Transformers Nightly Models, tests/models/test_initialization.py — ci-failure — by bnellnm (created: 2026-02-10 06:36 (UTC+8))
- #34166 [CI Failure]: tests/integration/test_rl.py: RuntimeError: operator torchvision::nms does not exist — ci-failure — by bnellnm (created: 2026-02-10 03:26 (UTC+8))
- #34164 [CI Failure]: mi325_4: LM Eval Large Models — ci-failure — by AndreasKaratzas (created: 2026-02-10 02:46 (UTC+8))
- #34162 [CI Failure]: mi325_1: ROCm GPT-OSS Eval — rocm,ci-failure — by AndreasKaratzas (created: 2026-02-10 02:29 (UTC+8))
- #34160 [CI Failure]: mi325_1: Entrypoints Unit Tests — ci-failure — by AndreasKaratzas (created: 2026-02-10 02:13 (UTC+8))
- #34154 [Bug]: — bug,rocm — by umarinkovic (created: 2026-02-10 01:41 (UTC+8))
- #34136 [Usage]: Which version of LMCache is compatible with vLLM 0.15? — usage — by yangshanjun (created: 2026-02-09 20:33 (UTC+8))
- #34133 [Bug]: Mistral Large 3 (NVFP4) hangs on initialization - 8x B300 (Blackwell) - "Waiting for core engine" — bug — by nandhaece07 (created: 2026-02-09 19:41 (UTC+8))
- #34132 [Bug]: Compilation Error with Clang 22+ (ROCm) due to String Literal Comparison in quant_utils.cuh — bug,rocm — by kyuz0 (created: 2026-02-09 19:28 (UTC+8))
- #34129 [Bug]: Online FP8 quantization does not split MoE expert weights across GPUs with TP/EP for Qwen3NextForCausalLM — bug — by ram16g (created: 2026-02-09 16:06 (UTC+8))
- #34112 [Usage]: vllm bench serve get stuck when enable `--mm-encoder-tp-mode data` — usage — by shen-shanshan (created: 2026-02-09 11:48 (UTC+8))
- #34122 [Feature]: Add MLA attention backend for Turing — feature request — by ir1ka (created: 2026-02-09 14:38 (UTC+8))
- #34118 [Feature]: [ROCm]: GPTQ INT4 MoE kernel fails on non-SiLU activations - no Marlin fallback on AMD — feature request,rocm — by ehartford (created: 2026-02-09 13:28 (UTC+8))
- #34113 [Bug][ROCm] Startup hangs at 'Waiting for 1 local, 0 remote core engine proc(s) to start' on AI MAX+ 395 — rocm — by RagnarokChan (created: 2026-02-09 11:49 (UTC+8))
Closed Issues
- #20468 [Feature]: Support EPLB for More MoE Models, e.g. Qwen 3, Llama 4 — good first issue,feature request,stale — by abmfy (closed: 2026-02-10 10:19 (UTC+8))
- #21605 [Feature]: Support fairness heuristics for the batched requests — feature request,stale — by skonto (closed: 2026-02-10 10:19 (UTC+8))
- #25677 [Bug]: v0.8.6 Qwen3-30B-A3B The prefill phase takes 10 times longer when chunked prefill is enabled compared to when it is disabled. — bug,stale — by yifengwang66 (closed: 2026-02-10 10:18 (UTC+8))
- #25714 How to pass `thinking_budget` for Qwen-3 Thinking models in offline AsyncLLMEngine? — usage,stale — by dahwin (closed: 2026-02-10 10:18 (UTC+8))
- #26246 [Bug]: offline inference don't works on macOS26 — bug,stale — by zhengkezhou1 (closed: 2026-02-10 10:17 (UTC+8))
- #26256 [Performance]: General Discussion on Config-based Throughput Optimization in Server Mode — performance,stale — by BiEchi (closed: 2026-02-10 10:17 (UTC+8))
- #26614 [Usage]: attn_metadata.seq_lens is not equal to attn_metadata.num_actual_tokens — usage,stale — by betacatZ (closed: 2026-02-10 10:17 (UTC+8))
- #26640 [Bug]: Bug Report: Engine start failure with Llama-3.1-70B-Instruct on vLLM 0.9.2 + ZenTorch 5.1 + PyTorch 2.7 — bug,stale — by Siseendri (closed: 2026-02-10 10:17 (UTC+8))
- #26655 [Installation]: — installation,stale — by art39print-c (closed: 2026-02-10 10:17 (UTC+8))
- #34193 [Installation]: — installation — by freedom22-cyber (closed: 2026-02-10 09:36 (UTC+8))
- #34196 [Installation]: — installation — by freedom22-cyber (closed: 2026-02-10 09:35 (UTC+8))
- #34195 [Installation]: test issue — installation — by freedom22-cyber (closed: 2026-02-10 09:35 (UTC+8))
- #34194 [Installation]: — installation — by freedom22-cyber (closed: 2026-02-10 09:35 (UTC+8))
- #34197 [Installation]: — installation — by freedom22-cyber (closed: 2026-02-10 09:34 (UTC+8))
- #24946 [Bug]: The inference speed of the whisper model under the v1 engine is much slower than v0 — bug — by qaz-t (closed: 2026-02-10 05:36 (UTC+8))
- #28516 [Bug]: You can only use one kind of structured outputs constraint but multt iple are specified — bug — by maolixiaolin-ailab (closed: 2026-02-10 05:35 (UTC+8))
- #33906 [Bug]: mxfp4 (gpt-oss moe) on AMD rocm (W7900/gfx1100) breaks — bug,rocm — by ctheune (closed: 2026-02-10 03:36 (UTC+8))
- #30926 [CI Failure]: mi325_1: V1 Test entrypoints — ci-failure — by AndreasKaratzas (closed: 2026-02-10 02:47 (UTC+8))
- #33596 [CI Failure]: mi325_1: Multi-Modal Accuracy Eval (Small Models) — ci-failure — by AndreasKaratzas (closed: 2026-02-10 02:26 (UTC+8))
- #26366 [Feature]: Improve config validation using Pydantic — good first issue,keep-open — by hmellor (closed: 2026-02-09 19:01 (UTC+8))
- #33991 [Installation]: building docker cpu image with VLLM_CPU_DISABLE_AVX512=true (or on any x86_64 CPU without AVX512) fails to compile mla_decode.cpp because BFloat16 has no AVX2 fallback — installation,cpu — by matthewfranglen (closed: 2026-02-09 16:55 (UTC+8))
- #34112 [Usage]: vllm bench serve get stuck when enable `--mm-encoder-tp-mode data` — usage — by shen-shanshan (closed: 2026-02-09 15:20 (UTC+8))
- #29403 [Bug]: fastsafetensors in tensor parallel requires too much VRAM — bug — by Jannik2099 (closed: 2026-02-09 15:15 (UTC+8))
- #33813 [Bug]: llm.score() fails on batched multimodal input for qwen3-vl-reranker — bug — by JiahuiChen-GitHub (closed: 2026-02-09 14:42 (UTC+8))
New PRs
- #34204 [CI/Build] Relax `test_mcp_tool_call` — ready — by DarkLight1337 (created: 2026-02-10 11:45 (UTC+8))
- #34203 [Model] Remove InternLM — documentation,new-model,llama — by noooop (created: 2026-02-10 11:42 (UTC+8))
- #34169 [CPU][Distributed] Fix Make VLLM_DIST_IDENT stable across nodes and a… — v1,cpu — by charlesashby (created: 2026-02-10 04:21 (UTC+8))
- #34134 openpangu-vl support video input — multi-modality — by hujiaxin0 (created: 2026-02-09 20:12 (UTC+8))
- #34139 [UX] Fix failed precompiled installation when latest commit wheels still unavailable — ci/build — by Isotr0py (created: 2026-02-09 21:43 (UTC+8))
- #34181 [CI][AMD][BugFix] Use torch.testing.assert_close instead of assert torch.allclose in test_rocm_skinny_gemms.py — bug,rocm — by rasmith (created: 2026-02-10 07:03 (UTC+8))
- #34123 [Frontend][CI] Consolidate instrumentator entrypoints — frontend,ready,ci/build — by noooop (created: 2026-02-09 14:47 (UTC+8))
- #34202 [XPU][7/N] enable xpu fp8 moe — no labels — by zufangzhu (created: 2026-02-10 10:29 (UTC+8))
- #34183 [Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching — bug,ready,v1 — by ywang96 (created: 2026-02-10 07:25 (UTC+8))
- #34200 [DNM][Bugfix] Fix mamba cache dtype for Qwen3.5 — bug,qwen — by ywang96 (created: 2026-02-10 10:14 (UTC+8))
- #34192 [ROCm] Enable MXFP4 MoE weight pre-shuffling on gfx950 and update aiter — rocm,ready,ci/build — by dllehr-amd (created: 2026-02-10 09:14 (UTC+8))
- #34198 [Bugfix] Adopt `ChunkGatedDeltaRule` for Qwen3.5 — bug,ready,qwen — by ywang96 (created: 2026-02-10 09:44 (UTC+8))
- #34125 [Core] Move pause and resume functions into engine — documentation,v1 — by hao-aaron (created: 2026-02-09 15:17 (UTC+8))
- #34170 Extend ColBERT support to non-standard BERT backbones — documentation,new-model — by ieBoytsov (created: 2026-02-10 04:23 (UTC+8))
- #34148 [Doc] Update usage of `--limit-mm-per-prompt` — documentation,ready — by DarkLight1337 (created: 2026-02-10 00:47 (UTC+8))
- #34190 [Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 — bug,ready — by jaim12005 (created: 2026-02-10 08:50 (UTC+8))
- #34175 [LMCache] Token Base IPC API — ready,kv-connector — by Oasis-Git (created: 2026-02-10 06:02 (UTC+8))
- #34191 [Frontend] Add per-request stream interval control with time-based throttling — frontend,v1 — by quan-deepinfra (created: 2026-02-10 08:59 (UTC+8))
- #34189 [Bugfix] Add missing runtime dependencies for Nemotron Parse 1.1 — bug,ci/build — by kamalesh0406 (created: 2026-02-10 08:41 (UTC+8))
- #34152 [UX nit] Fix non-default api_server_count message — frontend — by mgoin (created: 2026-02-10 01:33 (UTC+8))
- #34188 [responsesAPI] fix simpleContext streaming output_messages — frontend,ready,gpt-oss — by qandrew (created: 2026-02-10 08:21 (UTC+8))
- #34187 [Bugfix] Fix DP Attention Padding in Dummy Run — bug,ready,v1 — by LucasWilkinson (created: 2026-02-10 07:40 (UTC+8))
- #34185 [Kernel] [Helion] [6/N] Add num_tokens dimension to silu_mul autotuning and dispatching — no labels — by gmagogsfm (created: 2026-02-10 07:39 (UTC+8))
- #34184 [Online Quantization] Support memory-efficient online quantization via layerwise loading — no labels — by kylesayrs (created: 2026-02-10 07:29 (UTC+8))
- #34153 [Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available — bug,rocm,ready,gpt-oss — by gshtras (created: 2026-02-10 01:39 (UTC+8))
- #34179 [Feature] Decode Context Parallel support for GPU model runner v2 — ready,v1,nvidia — by yewentao256 (created: 2026-02-10 06:48 (UTC+8))
- #34171 [Feat][RL][2/2] Native Weight Syncing API: IPC — documentation,ci/build — by hao-aaron (created: 2026-02-10 04:51 (UTC+8))
- #34182 [Misc] Introduce ec_both role EC (encoder cache) connector — v1 — by furionw (created: 2026-02-10 07:08 (UTC+8))
- #34173 Introduce ec_both for EC connector — v1 — by furionw (created: 2026-02-10 05:04 (UTC+8))
- #34128 Vllm CPU benchmark suite improvement — performance,ci/build,cpu — by louie-tsai (created: 2026-02-09 15:57 (UTC+8))
- #34177 [ROCm][Kernel] Add GFX11 (RDNA3) support for wvSplitK skinny GEMM kernels — rocm — by mgehre-amd (created: 2026-02-10 06:25 (UTC+8))
- #34176 [ROCm][Kernel] Add GFX11 (RDNA3/4) support for wvSplitK skinny GEMM kernels — rocm — by mgehre-amd (created: 2026-02-10 06:21 (UTC+8))
- #34167 [ModelRunner V2][BugFix] Fix `max_query_len` calculation — bug,ready,v1,nvidia — by njhill (created: 2026-02-10 03:26 (UTC+8))
- #34155 [compile] Enable AOT compile with 2.10 in trunk. — ready — by zhxchen17 (created: 2026-02-10 01:46 (UTC+8))
- #34174 [Draft][ROCm] ROCm7.2 as base — rocm,ci/build — by gshtras (created: 2026-02-10 05:46 (UTC+8))
- #34172 nightly-run-models — ci/build,llama — by debroy-rh (created: 2026-02-10 05:01 (UTC+8))
- #34151 Revert "[MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 (#33257)" — no labels — by amitz-nv (created: 2026-02-10 01:32 (UTC+8))
- #34145 Moe config backend — documentation,rocm,frontend,needs-rebase,ci/build,v1,llama,gpt-oss,kv-connector — by danichan-mkm (created: 2026-02-10 00:24 (UTC+8))
- #34135 [Bugfix] Fix TorchAO bugs and add `--torchao-config` CLI — bug — by jwpark33 (created: 2026-02-09 20:21 (UTC+8))
- #34161 add extras dict to FinishedRequestStats to enable stat logger plugins… — v1 — by crawdaddie (created: 2026-02-10 02:27 (UTC+8))
- #34146 [Benchmarks] Add bimodal dataset for mixed short-chat + long-RAG work… — performance — by pbpatre (created: 2026-02-10 00:30 (UTC+8))
- #34157 [ROCm] Add dynamic mxfp4 quantization for DeepSeek V2 projection layers — rocm,deepseek — by dllehr-amd (created: 2026-02-10 01:57 (UTC+8))
- #34168 [DRAFT][Feature] implement online data capture/generation for eagle3 — speculative-decoding,v1 — by harryzorus (created: 2026-02-10 03:56 (UTC+8))
- #34147 [KVConnector] Fix bug and clean up redundant code in KV connectors — bug,kv-connector — by hickeyma (created: 2026-02-10 00:38 (UTC+8))
- #34165 Fix AttributeError: 'Glm46VImageProcessorFast' object has no attribut… — no labels — by VishnuVV27 (created: 2026-02-10 03:23 (UTC+8))
- #34163 [Bug] Fix MLPSpeculatorConfig missing num_attention_heads attribute — bug — by Mr-Neutr0n (created: 2026-02-10 02:40 (UTC+8))
- #34140 [Bugfix] Voxtral prompt/audio placeholder alignment — bug,ready — by artuskg (created: 2026-02-09 22:38 (UTC+8))
- #34142 [Bugfix] Avoid duplicate k-proj weight emission in helper — bug,ready — by artuskg (created: 2026-02-09 22:38 (UTC+8))
- #34149 [Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 — bug,performance — by mgehre-amd (created: 2026-02-10 01:04 (UTC+8))
- #34159 [Docs] MTP Docs — documentation,v1 — by kylesayrs (created: 2026-02-10 02:10 (UTC+8))
- #34158 [Bugfix] Relax TRTLLM KV cache contiguity assertion for cross-layer layout — bug,v1,nvidia — by Etelis (created: 2026-02-10 02:02 (UTC+8))
- #34126 Add flagos in MiniCPM-o — no labels — by tc-mb (created: 2026-02-09 15:17 (UTC+8))
- #34150 Causal Masking — v1,meta-exported,fb-exported — by omkhalil (created: 2026-02-10 01:26 (UTC+8))
- #34121 [BugFix] Mistakenly passing num_reqs_padded as num_reqs in _dummy_run — bug,v1 — by Selkh (created: 2026-02-09 14:38 (UTC+8))
- #34141 [Bugfix] Harden Voxtral encoder melspec and chunk handling — bug — by artuskg (created: 2026-02-09 22:38 (UTC+8))
- #34130 [Perf] fused_moe: add int4_w4a16 benchmark support and tuning config — performance — by mgehre-amd (created: 2026-02-09 18:14 (UTC+8))
- #34137 [Docs] Fix format error in KV load failure recovery doc — documentation — by zzaebok (created: 2026-02-09 20:50 (UTC+8))
- #34144 [Misc] Clean up validation logic in input processor — ready,v1,multi-modality — by DarkLight1337 (created: 2026-02-09 23:37 (UTC+8))
- #34131 [Model] Add Qwen3.5 hybrid model support — new-model,needs-rebase,qwen — by liuchen2026fly (created: 2026-02-09 19:08 (UTC+8))
- #34143 [Fix] Bump lmcache minimum version to 0.3.11 — ci/build,kv-connector — by MohanKumar21 (created: 2026-02-09 22:56 (UTC+8))
- #34120 [UX] Add `--language-model-only` for hybrid models — ready — by ywang96 (created: 2026-02-09 14:04 (UTC+8))
- #34138 Add MiniMax-M2 model support to vLLM — needs-rebase — by JingqingZh (created: 2026-02-09 20:51 (UTC+8))
- #34117 [XPU][6/N] add xpu scaled_mm kernel — ready,ci/build — by zufangzhu (created: 2026-02-09 13:02 (UTC+8))
- #34127 fix minicpmo4.5: fix attn_mask in vit attn && fix resampler pos_emb i… — no labels — by tc-mb (created: 2026-02-09 15:42 (UTC+8))
- #34124 [Model] GLM adaptation — performance,new-model,ready,deepseek — by jeejeelee (created: 2026-02-09 15:10 (UTC+8))
- #34116 [CPU] Enable FP16 (Half dtype) support for s390x — cpu — by R3hankhan123 (created: 2026-02-09 12:50 (UTC+8))
- #34119 [Fix Bug] `num_active_loras` always equals to zero — bug,v1,gpt-oss — by RunkaiTao (created: 2026-02-09 13:45 (UTC+8))
- #34115 Fix kernel bugs in XPU LoRA and MOE LORA — no labels — by chaojun-zhang (created: 2026-02-09 12:35 (UTC+8))
- #34114 Feat/ascend npu adapt v0.11.0 — documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality — by handsomezhuzhu (created: 2026-02-09 12:02 (UTC+8))
Merged PRs
- #34175 [LMCache] Token Base IPC API — ready,kv-connector — by Oasis-Git (merged: 2026-02-10 09:18 (UTC+8))
- #33233 [structured output] validate unsupported json features first — structured-output,ready,v1 — by andyxning (merged: 2026-02-10 07:49 (UTC+8))
- #34153 [Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available — bug,rocm,ready,gpt-oss — by gshtras (merged: 2026-02-10 07:38 (UTC+8))
- #33936 [Doc] Add DCP support to attention backend doc — documentation,ready — by mgoin (merged: 2026-02-10 07:33 (UTC+8))
- #34167 [ModelRunner V2][BugFix] Fix `max_query_len` calculation — bug,ready,v1,nvidia — by njhill (merged: 2026-02-10 05:47 (UTC+8))
- #33945 [torch.compile][Fusion] Fix attention fusion pass removing kv_udpate op. — ready,torch.compile — by charlifu (merged: 2026-02-10 05:15 (UTC+8))
- #34110 [MODEL] Adding Support for Qwen3.5 Models — documentation,new-model,speculative-decoding,ready,v1,qwen — by JJJYmmm (merged: 2026-02-09 21:12 (UTC+8))
- #34032 [ROCm] update triton branch to support gpt-oss models for gfx11xx devices — rocm,ready,ci/build,gpt-oss — by hongxiayang (merged: 2026-02-10 03:36 (UTC+8))
- #34140 [Bugfix] Voxtral prompt/audio placeholder alignment — bug,ready — by artuskg (merged: 2026-02-10 03:30 (UTC+8))
- #34142 [Bugfix] Avoid duplicate k-proj weight emission in helper — bug,ready — by artuskg (merged: 2026-02-10 03:17 (UTC+8))
- #32846 [Kernel] use flashinfer for gdn prefill — performance,ready,qwen — by ZJY0516 (merged: 2026-02-10 01:17 (UTC+8))
- #34087 [Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) — bug,ready,nvidia — by TomerBN-Nvidia (merged: 2026-02-10 00:44 (UTC+8))
- #33985 [Kernel] FlashInfer: switch allreduce fusion to unified API — performance,ready — by mmangkad (merged: 2026-02-09 23:43 (UTC+8))
- #32365 Add NUMA Core binding in nixl_connector for CPU xPyD — ready,v1,cpu,kv-connector — by ZhengHongming888 (merged: 2026-02-09 23:39 (UTC+8))
- #34120 [UX] Add `--language-model-only` for hybrid models — ready — by ywang96 (merged: 2026-02-09 22:57 (UTC+8))
- #34031 [CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 — ready,ci/build — by ProExpertProg (merged: 2026-02-09 23:05 (UTC+8))
- #33810 [Misc] Fix up attention benchmarks — performance,ready,ci/build — by LucasWilkinson (merged: 2026-02-09 22:42 (UTC+8))
- #34117 [XPU][6/N] add xpu scaled_mm kernel — ready,ci/build — by zufangzhu (merged: 2026-02-09 20:17 (UTC+8))
- #33901 [Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul — ready,cpu — by nikhil-arm (merged: 2026-02-09 18:04 (UTC+8))
- #32300 [ASR] Fix audio benchmark and add RTFx metric — documentation,performance,ready — by ekagra-ranjan (merged: 2026-02-09 18:02 (UTC+8))
- #34107 [CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr — ready,multi-modality — by AndreasKaratzas (merged: 2026-02-09 17:37 (UTC+8))
- #34124 [Model] GLM adaptation — performance,new-model,ready,deepseek — by jeejeelee (merged: 2026-02-09 17:32 (UTC+8))
- #34052 fix(cpu): fix mla_decode compilation on x86 without AVX512 — ready,cpu — by ihb2032 (merged: 2026-02-09 16:55 (UTC+8))
- #34070 [BugFix] Fix `fastsafetensors` TP all procs using all GPUs — bug,ready — by njhill (merged: 2026-02-09 15:15 (UTC+8))
- #31127 [Frontend][last/5] Make pooling entrypoints request schema consensus. — documentation,frontend,ready,ci/build,multi-modality — by noooop (merged: 2026-02-09 14:42 (UTC+8))
- #34103 [Tiny] Rename encoder budget file to more specific name — ready,v1,multi-modality — by reaganjlee (merged: 2026-02-09 11:48 (UTC+8))
PRs Closed Without Merging
- #34139 [UX] Fix failed precompiled installation when latest commit wheels still unavailable — ci/build — by Isotr0py (closed: 2026-02-10 11:05 (UTC+8))
- #23481 [Misc] Use `req_ids` instead of `req_id_to_index` in scheduler's update_from_output — tpu,ready,stale,v1,kv-connector — by WoosukKwon (closed: 2026-02-10 10:18 (UTC+8))
- #25440 [feat][distributed]: support per-process group configuration via pg_options — stale — by linfeng-yuan (closed: 2026-02-10 10:18 (UTC+8))
- #25881 [Bugfix][Multi Modal] Fix broken frames in video input — stale,multi-modality — by Jixin10 (closed: 2026-02-10 10:18 (UTC+8))
- #26205 [Bugfix][NCCL P/D] Send layer KV-cache only after layer forward — documentation,needs-rebase,stale,kv-connector — by ruisearch42 (closed: 2026-02-10 10:18 (UTC+8))
- #26610 fix seed_oss_tool_call used Stream — frontend,stale,tool-calling — by CallmeZhangChenchen (closed: 2026-02-10 10:17 (UTC+8))
- #26615 [CI/Build] allow setting GDRCOPY_VERSION during docker build — ci/build,stale — by Ivan8or (closed: 2026-02-10 10:17 (UTC+8))
- #26657 Vllm roe — documentation,performance,structured-output,tpu,speculative-decoding,needs-rebase,ci/build,stale,v1,llama — by ShengxuanQiu (closed: 2026-02-10 10:17 (UTC+8))
- #34009 [Bugfix] Fix DP Attention Padding in Dummy Run — bug,ready,v1 — by benchislett (closed: 2026-02-10 07:43 (UTC+8))
- #34173 Introduce ec_both for EC connector — v1 — by furionw (closed: 2026-02-10 07:04 (UTC+8))
- #34176 [ROCm][Kernel] Add GFX11 (RDNA3/4) support for wvSplitK skinny GEMM kernels — rocm — by mgehre-amd (closed: 2026-02-10 06:25 (UTC+8))
- #32801 [WIP] Fix GPT-OSS prefix caching not working with EAGLE — v1,gpt-oss — by mgoin (closed: 2026-02-10 05:42 (UTC+8))
- #34135 [Bugfix] Fix TorchAO bugs and add `--torchao-config` CLI — bug — by jwpark33 (closed: 2026-02-09 23:57 (UTC+8))
- #29282 [ASR] Add script for Multi-batch long audio chunking with output streaming — documentation — by ekagra-ranjan (closed: 2026-02-10 01:42 (UTC+8))
- #34131 [Model] Add Qwen3.5 hybrid model support — new-model,needs-rebase,qwen — by liuchen2026fly (closed: 2026-02-10 00:40 (UTC+8))
- #30260 Support TP which is not divded for NVFP4 kernels (flashinfer-cutlass) by adding dynamic padding — needs-rebase,nvidia — by danielafrimi (closed: 2026-02-09 16:41 (UTC+8))
- #31815 [Bugfix] Fix TorchAO quantization bugs and add `--torchao-config` CLI support — bug — by jwpark33 (closed: 2026-02-09 16:35 (UTC+8))
- #34114 Feat/ascend npu adapt v0.11.0 — documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality — by handsomezhuzhu (closed: 2026-02-09 12:03 (UTC+8))