vLLM Development Activity Report - 2026-02-02
Time window: 2026-02-02 11:29 (UTC+8) ~ 2026-02-03 11:29 (UTC+8). Stats: 25 new issues | 14 closed issues | 78 new PRs | 30 merged PRs | 12 PRs closed without merging
📊 Daily Development Status Summary
Community activity stayed high this period (2026-02-02 to 02-03), with 25 new issues and 30 merged PRs. Development focused on performance optimization (DeepSeek R1/V3.2 and speculative decoding in particular), model support (multimodal models, new quantization formats), and key bug fixes (FP8 MoE, sparse attention, backend compatibility). In parallel, the community is actively preparing a v0.15.1 patch release to address regressions in v0.15.0.
🎯 AMD/ROCm Ecosystem Updates
AMD ecosystem activity this period centered on CI test integration and bug fixes.
- AMD CI test regressions and integration:
  - Issue #33598 ([CI Failure]: mi325_4: Qwen3-30B-A3B-FP8-block Accuracy (H100)): an FP8 block-accuracy test involving the `torch.float8_e4m3fnuz` dtype failed in AMD CI, suggesting that support for the latest FP8 dtypes on the AMD platform may have gaps, or that the test environment configuration needs updating.
  - PR #33626 ([ci] Integrate AMD tests into CI): a significant step. The PR integrates six AMD test jobs into vLLM's main CI pipeline and defines a configuration interface for AMD test images. AMD platform support is moving from standalone testing toward tighter integration with the main development flow, helping surface platform-specific problems earlier.
  - PR #33608 (Changing the gating TG group composition): submitted by Alexei-V-Ivanov-AMD; adjusts the composition of the gating test group, an internal change to AMD CI infrastructure.
  - Closed issues: several long-standing AMD CI test failures (#29466, #29529, #29516, #31507, and others) were marked resolved or "went green" this period, indicating improved stability of the AMD test suite.
- Base environment updates:
  - PR #33631 ((wip) Bump to Ubuntu 24.04): plans to upgrade the base build image from Ubuntu 22.04 to 24.04 to remediate security vulnerabilities and outstanding CVEs. The change applies to both CUDA and ROCm environments.
- Bug fixes:
  - PR #32902 (fix[ROCm]: Remove unconditional aiter import): fixed an issue on ROCm where the AITER module's JIT compilation and warnings were triggered even with `VLLM_ROCM_USE_AITER=0`, improving the startup experience.
Summary: AMD ecosystem work has shifted from basic functional support toward deeper test integration and stability. Bringing AMD tests into the main CI (PR #33626) is a key step toward more formalized support, while platform-compatibility testing for newer features such as FP8 continues.
💬 High-Engagement Discussions
- Issue #33560: lm-eval shows significant accuracy differences for RedHatAI/Qwen3-8B-NVFP4 on Turing vs. Ampere
  - Core issue: when evaluating the NVFP4-quantized model on a Turing GPU (RTX 2080 Ti), the reporter observed anomalous lm-eval accuracy (inflated to 1.0), while Ampere behaved normally. The problem may be limited to dense models with NVFP4 weights.
  - Triage and observations:
    - Reporter ir1ka narrowed the problem through systematic testing: it appears on Turing only with `dtype=float16` (the automatically selected default), while explicitly specifying `dtype=bfloat16` restores correct results. FP8 dynamically quantized models are unaffected.
    - Related analysis: the reporter suspects the change in PR #29901, which switched Turing from fp32 to fp16 accumulators, as a possible cause.
  - Current status: unresolved, but linked to the similar Issue #33461. The discussion shows how hardware architecture differences can subtly affect the numerical stability of low-precision quantized models (NVFP4 in particular), a deeper problem that needs joint attention from the quantization and hardware-backend teams.
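The fp16-accumulator behavior suspected via PR #29901 is easy to reproduce in miniature. Here is a numpy sketch (illustrative only; not the actual NVFP4 kernel code) of how an fp16 running sum silently stalls once it grows large, while fp32 accumulation stays accurate:

```python
import numpy as np

# Summing 10,000 copies of ~0.01: the true total is ~100, but an fp16
# accumulator stalls once its magnitude reaches 32, because each addend
# is then smaller than half the fp16 spacing and rounds away entirely.
vals = np.full(10_000, 0.01, dtype=np.float16)

acc16 = np.float16(0.0)
for v in vals:                          # naive fp16 accumulation
    acc16 = np.float16(acc16 + v)

acc32 = vals.astype(np.float32).sum()   # fp32 accumulation

print(float(acc16))                     # stalls near 32
print(float(acc32))                     # ~100, as expected
```

This class of rounding effect is why a wider accumulator matters for long reductions, and it is a plausible mechanism behind architecture-dependent accuracy drift of the kind reported here.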
- Issue #33543: Some FP8 MoE models fail assertions on GB200
  - Core issue: some FP8 MoE models (e.g. MiniMax-M2.1) fail on GB200, implicating the TRTLLM MoE kernel.
  - Technical discussion:
    - Analysis by mgoin: the root cause is the dtype of `router_logits`. Models like MiniMax-M2.1 cast `router_logits` to float32 (non-DeepSeek-V3-style routing), but the TRTLLM FP8 MoE kernel currently supports float32 only under DeepSeek-style routing; forcing the cast severely damages accuracy.
    - Resolution: the discussion settled on disabling the TRTLLM FP8 MoE backend in this case and falling back to a compatible backend (such as CUTLASS or Triton). mgoin submitted PR #33613 implementing the fix.
  - Takeaway: the issue highlights how subtle differences in routing strategy and dtype across MoE models impose strict constraints on high-performance kernel selection; kernel developers must adapt to specific patterns.
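The decision logic can be sketched as a small selection guard. This is a hypothetical simplification (function name and backend strings are illustrative, not vLLM's actual API) of the condition PR #33613 adds:

```python
def select_fp8_moe_backend(router_logits_dtype: str, routing_method: str) -> str:
    """Illustrative sketch of the PR #33613 guard: the TRTLLM FP8 MoE kernel
    supports float32 router logits only under DeepSeek-V3-style routing, so
    any other float32 routing configuration falls back to a compatible backend.
    """
    if router_logits_dtype == "float32" and routing_method != "DeepSeekV3":
        return "CUTLASS"  # fall back (CUTLASS or Triton per the discussion)
    return "TRTLLM"
```

Under this sketch, a MiniMax-M2.1-style configuration (float32 logits, non-DeepSeek routing) gets the fallback backend, while DeepSeek-V3 routing keeps the fast TRTLLM path.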
- Issue #33572: GPT-OSS breaks when combining CPU KV-cache offload with the FlashInfer backend
  - Core issue: with native CPU KV-cache offload enabled (using the cross-layer KV-cache layout) and the FlashInfer TRT-LLM attention backend, an assertion that the KV-cache tensor is contiguous in memory fails.
  - Technical analysis:
    - orozery points out that the FlashInfer backend assumes each layer's KV cache is contiguous, an assumption the default cross-layer layout used by CPU KV offload does not satisfy.
    - Workaround vs. root fix: mgoin's testing shows the assertion currently triggers only when CPU KV offload is enabled and the hybrid KV-cache manager is disabled at the same time. A proper fix requires FlashInfer to relax the assertion or the layout-selection logic to change.
  - Point of contention: the trade-off between peak performance (the cross-layer layout is better for CPU offload) and backend compatibility (the contiguous-memory assumption).
  - Current status: pending a fix; the developers have pinned down the core conflict.
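The layout conflict is visible directly in array strides. A numpy analogy (illustrative; vLLM's actual buffers are torch tensors with different shapes and names) of why a per-layer view of a cross-layer buffer is non-contiguous:

```python
import numpy as np

num_blocks, num_layers, block_size, head_dim = 8, 4, 16, 64

# Cross-layer layout (favored by CPU offload): one buffer indexed
# [block, layer, ...]. A single layer's cache is then a strided view
# that skips over the other layers' data between blocks.
cross_layer = np.zeros((num_blocks, num_layers, block_size, head_dim))
layer0_view = cross_layer[:, 0]
print(layer0_view.flags["C_CONTIGUOUS"])   # False: would trip the assert

# Per-layer layout: indexed [layer, block, ...], so each layer's cache
# is one contiguous slab, which is what FlashInfer assumes.
per_layer = np.zeros((num_layers, num_blocks, block_size, head_dim))
print(per_layer[0].flags["C_CONTIGUOUS"])  # True
```

Either the backend must accept the strided view (relax the assertion) or the layout must change, which is exactly the trade-off under discussion.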
- Issue #33616 / PRs #33615 & #33635: `reasoning_content` rename causes chat-template regression
  - Core issue: PR #33402 renamed the `reasoning_content` field in API responses to `reasoning` to align with the OpenAI format, but unintentionally broke chat templates that depend on the old field name (e.g. GLM and Kimi K2.5), potentially causing silent model-quality degradation.
  - Positions:
    - Reporter koush stresses backward compatibility: many models will not update their templates, and other inference engines still use the old field name.
    - Fixes: two PRs appeared quickly. #33615 ensures `reasoning_content` is preserved when a chat template needs it; #33635 focuses on keeping compatibility in the interleaved-thinking feature. Two different patching strategies for the same problem.
  - Takeaway: the community responded fast to this regression. The episode is a reminder that changes to core API data structures must be weighed against their broad impact on the downstream ecosystem, model templates in particular.
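The compatibility direction can be sketched in a few lines. This is a hypothetical simplification (function and variable names are illustrative, not vLLM's actual code) of the approach PR #33615 takes: mirror the new field back into the legacy one whenever the active template still references it:

```python
def finalize_assistant_message(message: dict, template_vars: set) -> dict:
    """Illustrative sketch of the PR #33615 compatibility direction: after the
    rename, populate the legacy `reasoning_content` key from the new
    `reasoning` key whenever the active chat template references the old name.
    """
    if "reasoning" in message and "reasoning_content" in template_vars:
        message.setdefault("reasoning_content", message["reasoning"])
    return message
```

A GLM-style template that reads `reasoning_content` keeps working, while templates written against the new name see no change at all.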
- Issue #33599: [RFC]: Improving vLLM Dependency Compatibility with Downstream Ecosystems
  - Core issue: raised by Anyscale. vLLM's strict dependency pinning (e.g. requiring `numpy>=2`) conflicts with downstream projects (e.g. Ray, which pins `numpy==1.26.4`), making integration difficult.
  - Core proposal: add CI tests in vLLM against representative downstream configurations (such as Ray's dependency lock file) to expose breaking changes early and help maintainers make better-informed trade-offs when introducing new constraints.
  - Why it matters: this is a governance and ecosystem-collaboration proposal rather than a bug report. As vLLM becomes widely integrated, its dependency-management strategy needs to weigh external compatibility more heavily.
🔥 Hot Topics and Trends
- Multimodal support keeps deepening: a large share of new issues and PRs involve multimodal models:
  - Model support: GLM-4.7-Flash (#33580), Intern-S1-Pro (#33636), PaddleOCR-VL-1.5 (#33554), DeepSeek-OCR-2 (#33165, merged).
  - Fixes: audio-video interleaving for the Qwen Omni series (#33605), the EVS (Efficient Video Sampling) implementation for Qwen3-VL (#33607), MRoPE computation for GLM-4V (#33039, merged).
  - Multimodal inference has become one of vLLM's core capabilities, and its complexity is surging (vision, audio, and video interleaving, plus specialized positional encodings).
- Performance optimization and hardware enablement:
  - Blackwell (GB/B200): a dedicated issue (#33583) tracks DeepSeek R1 throughput optimization on Blackwell, with as many as 11 related PRs. Another PR (#33540) adds automatic detection and use of Fabric memory (the MNNVL protocol) on Blackwell.
  - Kernels: PR #33568 optimizes DeepGEMM FP8 MQA performance by disabling an unnecessary `clean_logits` kernel; PR #33538 re-proposes more optimized Triton top-k/top-p sampling kernels.
- The NVFP4 challenge: multiple issues (#33560, #33544, #33598, #33333) revolve around NVFP4-quantized models hitting load failures, accuracy anomalies, or performance problems across hardware (Turing, Ampere, Blackwell) and scenarios, the growing pains of a new quantization format finding its footing in the ecosystem.
- CI/CD and release hardening: preparation for the v0.15.1 patch release produced several fix and cleanup PRs (#33618, #33602, #33621, #33619); the dependency-compatibility RFC (#33599) and AMD test integration (#33626) likewise reflect a focus on engineering quality.
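As background on what a top-k/top-p sampling kernel has to compute, here is a plain numpy reference of one common top-p formulation (illustrative; PR #33538's Triton kernels fuse and accelerate this on-GPU, and their exact formulation may differ):

```python
import numpy as np

def top_p_filter(logits: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches p; mask everything else to -inf. A token is kept if
    the probability mass of all strictly-higher-ranked tokens is below p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)              # descending by probability
    csum = np.cumsum(probs[order])
    keep = (csum - probs[order]) < p        # mass before this token < p
    out = np.full_like(logits, -np.inf)
    out[order[keep]] = logits[order[keep]]
    return out
```

The GPU kernel's job is to do this (plus top-k truncation) per sequence without sorting the full vocabulary on the host, which is where the Triton implementation earns its speedup.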
🛠️ Key Technical Changes
- PR #33624: [torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding
  - Summary: fixes compilation errors and performance problems that could arise when `torch.compile`'s "fast moe cold start" optimization and speculative decoding were enabled together, by disabling that optimization whenever speculative decoding is active.
  - Impact: improves the reliability of combining `torch.compile` with speculative decoding (e.g. EAGLE, MTP), which matters to users chasing peak inference performance.
- PR #33540: [Feature][Core] Support Fabric detection to adapt the MNNVL protocol for the GB series
  - Summary: implements automatic detection of Fabric memory on Blackwell-series GPUs, letting vLLM adaptively attempt the MNNVL protocol for high-speed peer-to-peer communication and fall back automatically on failure. Merged.
  - Impact: significantly improves communication performance for Blackwell GPUs in multi-node or prefill/decode-disaggregated deployments (the submitter reports a 3x throughput improvement), a key step toward fully exploiting the new hardware generation.
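The detect-then-fall-back flow described above follows a common probe pattern. A framework-free sketch (all names are hypothetical; the real implementation works against CUDA driver APIs, not Python callables):

```python
def pick_p2p_transport(probe_fabric_alloc) -> str:
    """Illustrative sketch of PR #33540's flow: opportunistically probe a
    Fabric (MNNVL) allocation and fall back to the default transport if the
    hardware or driver rejects it. `probe_fabric_alloc` stands in for the
    real fabric-memory allocation probe."""
    try:
        probe_fabric_alloc()
        return "MNNVL"
    except RuntimeError:
        return "default"
```

The point of the pattern is that users on non-Fabric hardware pay no configuration cost: the probe fails once at startup and the default path is used.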
- PR #33613: [Bugfix] Disable TRTLLM FP8 MoE if router_logits_dtype==float32 and routing_method!=DeepSeekV3
  - Summary: the fix for Issue #33543. Adds a more precise enablement condition for the TRTLLM FP8 MoE kernel so it is no longer selected for incompatible model configurations where it would collapse accuracy.
  - Impact: makes FP8 MoE deployments more robust by ensuring kernel selection correctly handles the diversity of MoE routing implementations.
- PR #31914: fix memory for online fp8 quantization with streaming weight load (merged)
  - Summary: fixes excessive peak memory when online FP8 quantization is combined with streaming weight load. Weights are created on the `meta` device and quantized on the fly, so the loading loop no longer holds many BF16 weight copies at once.
  - Impact: delivers the memory advantage online FP8 quantization was meant to provide, which matters for loading large models (MoE models in particular) under tight memory budgets.
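The memory effect can be modeled without any framework. In this toy accounting sketch (relative units: 2 per BF16 weight, 1 per FP8 copy; illustrative only, not vLLM code), eager loading peaks at roughly three units per weight, while streaming quantization keeps only one high-precision weight alive at a time:

```python
def peak_live_units(num_weights: int, streaming: bool) -> int:
    """Toy peak-memory model of the fix in PR #31914 (illustrative only).
    Eager: materialize all BF16 weights, then quantize -> peak ~3 units/weight.
    Streaming: quantize each weight as it loads and free the BF16 source
    immediately -> peak ~1 unit/weight plus a small transient."""
    live = peak = 0
    for _ in range(num_weights):
        live += 2                      # one BF16 weight arrives (2 units)
        peak = max(peak, live)
        if streaming:
            live += 1                  # its FP8 copy is created (1 unit)
            peak = max(peak, live)
            live -= 2                  # BF16 source freed right away
    if not streaming:
        live += num_weights            # quantize everything at once
        peak = max(peak, live)
    return peak
```

For 100 weights the model gives a peak of 102 units streaming versus 300 eager, which is the shape of the improvement the PR describes, even though the real numbers depend on tensor sizes and allocator behavior.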
- PR #33607: [Bugfix] Fix EVS implementation for Qwen3 VL
  - Summary: fixes the Efficient Video Sampling (EVS) implementation for Qwen3-VL, resolving embedding-alignment problems caused by the model's interleaved encoding of video frames and timestamp text.
  - Impact: ensures that models with complex multimodal encoding structures like Qwen3-VL run correctly in vLLM, another concrete step in deepening multimodal support.
📈 Development Activity Observations
- Contributor diversity: active contributors this period include engineers from AMD (`-amd` suffixes), Red Hat (`-redhat`), NVIDIA, Cohere (kkt-cohere), Anyscale, and Meta, plus many independent developers, a sign of a healthy project ecosystem.
- Fast bug-fix loop: several freshly reported bugs (#33546, #33543, #33616, #33630) had fix PRs created or merged within a very short window, reflecting the core maintainers' responsiveness.
- Release cadence: the cluster of cleanup and fix PRs around the v0.15.1 patch release shows the team methodically working through v0.15.0's leftover issues toward a stable release.
💡 Issues Worth Watching
- NVFP4 hardware compatibility: the accuracy problems surfaced by Issue #33560 and related reports may be systemic, requiring coordinated investigation by the quantization team and hardware backends (NVIDIA and AMD alike).
- API backward compatibility: the `reasoning_content` controversy in Issue #33616 underscores the importance of API stability during rapid development; similar changes will need fuller ecosystem-impact assessment in the future.
- AMD test integration: PR #33626 is worth tracking; it marks a new phase of AMD platform support, and its successful rollout would raise vLLM's overall quality in heterogeneous computing environments.
- Managing multimodal complexity: as supported vision, audio, and video models and their interleaving modes multiply, managing the resulting configuration complexity and potential conflicts (e.g. the audio-video interleaving in #33605) is an ongoing challenge.
- Downstream dependency conflicts: the compatibility problem raised in Issue #33599 is a growing pain every successful open-source project faces; how the community handles this RFC will shape how easily vLLM can be integrated and adopted.
📋 Appendix: Detailed Data
New Issues
- #33638 [Bug]: DeepSeekV3.1 with fp8 kvcache in v0.15.0 produces garbled output — bug — by lyg95 (created: 2026-02-03 11:26 (UTC+8))
- #33560 [Bug]: lm-eval shows significant accuracy differences on RedHatAI/Qwen3-8B-NVFP4 model (Turing vs. Ampere) — bug — by ir1ka (created: 2026-02-02 18:53 (UTC+8))
- #33632 [Doc]: When running benchmarks, the prefix cache hit rate is abnormally high (not consistent with the production environment). — documentation — by uOnePiece (created: 2026-02-03 10:12 (UTC+8))
- #33630 [CI Failure]: Test Models (Qwen OMni) — ci-failure — by robertgshaw2-redhat (created: 2026-02-03 10:07 (UTC+8))
- #33603 [Bug] [CI, Build, and Release] Lost vllm against PyTorch nightly signal since Jan 28, 2026 — bug — by atalman (created: 2026-02-03 05:48 (UTC+8))
- #33554 [New Model]: PaddleOCR-VL-1.5 Support — no labels — by ssuncheol (created: 2026-02-02 16:47 (UTC+8))
- #33628 [Bug]: Failed to run distributed inference with Gloo backend on aarch64 — bug — by wenlujon (created: 2026-02-03 09:57 (UTC+8))
- #33627 [Bug]: DeepSeek R1 with CUTLASS MLA Broken on B200 — bug — by robertgshaw2-redhat (created: 2026-02-03 09:49 (UTC+8))
- #33625 [Feature]: improve sleep mode to not break torch.cuda memory counters — feature request — by stas00 (created: 2026-02-03 09:38 (UTC+8))
- #33622 [Feature]: investigate if we can turn off the ahead-of-time autotuning block in vLLM — feature request,torch.compile — by zou3519 (created: 2026-02-03 08:55 (UTC+8))
- #33616 [Bug]: reasoning_content to reasoning rename caused regression in chat templates expecting reasoning_content — bug — by koush (created: 2026-02-03 07:38 (UTC+8))
- #33543 [Bug]: Some FP8 MoE models fail assertions on GB200 — bug — by jeejeelee (created: 2026-02-02 14:37 (UTC+8))
- #33601 vLLM replicas init failure when launching with Ray on a single instance — bug — by jiangwu300 (created: 2026-02-03 05:47 (UTC+8))
- #33599 [RFC]: Improving vLLM Dependency Compatibility with Downstream Ecosystems — RFC — by jeffreywang-anyscale (created: 2026-02-03 05:31 (UTC+8))
- #33598 [CI Failure]: mi325_4: Qwen3-30B-A3B-FP8-block Accuracy (H100) — ci-failure — by AndreasKaratzas (created: 2026-02-03 05:30 (UTC+8))
- #33596 [CI Failure]: mi325_1: Multi-Modal Accuracy Eval (Small Models) — ci-failure — by AndreasKaratzas (created: 2026-02-03 05:27 (UTC+8))
- #33572 [Bug]: GPT-OSS with CPU KV cache offload break with FlashInfer — bug — by cquil11 (created: 2026-02-03 01:37 (UTC+8))
- #33583 [Performance]: Optimize DeepSeekR1 Throughput on Blackwell — performance — by minosfuture (created: 2026-02-03 03:38 (UTC+8))
- #33580 [Model Support] GLM-4.7-Flash requires transformers from git (glm4_moe_lite architecture) — no labels — by archit-spec (created: 2026-02-03 02:47 (UTC+8))
- #33577 [Installation]: GPLv3 function included in HarmonyBrowserTool — installation — by zmckevit (created: 2026-02-03 02:30 (UTC+8))
- #33544 [Bug]: Qwen3-Next-80B-A3B-Instruct-NVFP4 can't run with 0.15.0 — bug — by nanbogong (created: 2026-02-02 14:55 (UTC+8))
- #33539 [Bug]: When using NIXL for PD separation on GB series cards, the MNNVL protocol cannot be used — bug — by kebe7jun (created: 2026-02-02 13:01 (UTC+8))
- #33549 [RFC]: Dynamic Expert Load Balance optimization from ascend — RFC — by shenchuxiaofugui (created: 2026-02-02 15:42 (UTC+8))
- #33546 [Bug][DeepSeekV32]: AttributeError: 'FlashMLASparseMetadata' object has no attribute 'num_decodes' — bug — by chaunceyjiang (created: 2026-02-02 15:20 (UTC+8))
- #33541 [Feature]: Automatic full-core utilization for TP=1 on arm multi-NUMA CPU systems — feature request,cpu — by phalani-paladugu (created: 2026-02-02 13:19 (UTC+8))
Closed Issues
- #33333 [Bug]: sm_120 -NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the deployment configuration since kernel does not support current device. — bug — by shahizat (closed: 2026-02-03 02:48 (UTC+8))
- #29467 [CI Failure]: mi325_4: 2 Node Tests (4 GPUs in total) — ci-failure — by AndreasKaratzas (closed: 2026-02-03 05:15 (UTC+8))
- #29529 [CI Failure]: mi325_2: Distributed Tests (2 GPUs) — rocm,ci-failure — by AndreasKaratzas (closed: 2026-02-03 05:14 (UTC+8))
- #29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) — ci-failure — by AndreasKaratzas (closed: 2026-02-03 05:14 (UTC+8))
- #29516 [CI Failure]: mi325_4: Distributed Tests (A100) — ci-failure — by AndreasKaratzas (closed: 2026-02-03 05:13 (UTC+8))
- #31507 [CI Failure]: mi325_1: Model Executor Test — ci-failure — by AndreasKaratzas (closed: 2026-02-03 05:13 (UTC+8))
- #33418 [Bug]: wrong error reported when len(prompt) + requested tokens > max_context_len — bug — by sducouedic (closed: 2026-02-03 04:22 (UTC+8))
- #31805 [Bug]: streaming quantization cause higher peak memory used during model loading and post process — bug — by yma11 (closed: 2026-02-03 03:17 (UTC+8))
- #33580 [Model Support] GLM-4.7-Flash requires transformers from git (glm4_moe_lite architecture) — no labels — by archit-spec (closed: 2026-02-03 02:47 (UTC+8))
- #32434 [Bug]: gpt-oss no output with TRITON_ATTN backend with spec decode on ROCm — bug,rocm — by micah-wil (closed: 2026-02-03 01:52 (UTC+8))
- #29547 [Bug]: DeepSeek R1 + MTP OOM — bug — by MatthewBonanni (closed: 2026-02-03 01:33 (UTC+8))
- #33539 [Bug]: When using NIXL for PD separation on GB series cards, the MNNVL protocol cannot be used — bug — by kebe7jun (closed: 2026-02-02 22:55 (UTC+8))
- #33519 [Bug]: GLM4.7-flash cannot load lora adapters — bug — by Otsutsukii (closed: 2026-02-02 20:29 (UTC+8))
- #33507 [Bug]: Startup of the Qwen3-VL model failed. — bug — by shell-nlp (closed: 2026-02-02 18:18 (UTC+8))
New PRs
- #33553 [CI/Build] Remove hardcoded America/Los_Angeles timezone from Dockerfiles — ready,ci/build — by carlory (created: 2026-02-02 16:43 (UTC+8))
- #33610 Allow other models to use the Harmony format — frontend,gpt-oss — by tbennun (created: 2026-02-03 06:35 (UTC+8))
- #33635 [Bugfix] Interleaved thinking keeps compatibility with reasoning_content — bug,frontend — by chaunceyjiang (created: 2026-02-03 11:05 (UTC+8))
- #33637 [Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 — bug,v1,deepseek,nvidia — by chaunceyjiang (created: 2026-02-03 11:21 (UTC+8))
- #33568 [Perf] Disable clean_logits in deepgemm fp8_mqa_logits kernel — no labels — by xyang16 (created: 2026-02-03 00:33 (UTC+8))
- #33615 [Bugfix] ensure reasoning_content exists if the chat template needs it — bug — by koush (created: 2026-02-03 07:36 (UTC+8))
- #33636 [Models] Intern-S1-Pro — documentation,new-model,qwen — by CUHKSZzxy (created: 2026-02-03 11:06 (UTC+8))
- #33563 generate API doesn't require detokenization for beam search — frontend,qwen — by gameofdimension (created: 2026-02-02 20:42 (UTC+8))
- #33604 Make directory exist ok for ray spinning up multiple replicas on a single instance — no labels — by jiangwu300 (created: 2026-02-03 05:53 (UTC+8))
- #33634 [Bugfix] Fix mm budget setting for Qwen Omni models — bug,ready,multi-modality,qwen — by ywang96 (created: 2026-02-03 10:46 (UTC+8))
- #33605 [Bugfix][Model] Fix audio-in-video support for Qwen2.5-Omni and Qwen3-Omni — bug,qwen — by linyueqian (created: 2026-02-03 06:03 (UTC+8))
- #33624 [torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding — bug,ready — by zou3519 (created: 2026-02-03 09:24 (UTC+8))
- #33579 [Bugfix] Fix sparse MLA metadata building — bug,ready — by MatthewBonanni (created: 2026-02-03 02:39 (UTC+8))
- #33620 [Bugfix] Disable RoutingMethodType.[Renormalize,RenormalizeNaive] for TRTLLM per-tensor FP8 MoE — bug,ready,nvidia — by mgoin (created: 2026-02-03 08:35 (UTC+8))
- #33633 [Bugfix] Fix incorrect variable in pooling model LoRA module name check — bug — by yiw190 (created: 2026-02-03 10:19 (UTC+8))
- #33626 [ci] Integrate AMD tests into CI — rocm,ci/build — by khluu (created: 2026-02-03 09:42 (UTC+8))
- #33576 [Voxtral models] Skip warm-up to skip confusing error message in warm-up — frontend,ready — by patrickvonplaten (created: 2026-02-03 02:17 (UTC+8))
- #33631 (wip) Bump to Ubuntu 24.04 — documentation,new-model,rocm,frontend,ci/build,v1,multi-modality,cpu,kv-connector,nvidia — by zaristei2 (created: 2026-02-03 10:10 (UTC+8))
- #33629 Save startup benchmark results as a list of values — performance — by huydhn (created: 2026-02-03 09:57 (UTC+8))
- #33613 [Bugfix] Disable TRTLLM FP8 MoE if router_logits_dtype==float32 and routing_method!=DeepSeekV3 — bug,ready,deepseek,nvidia — by mgoin (created: 2026-02-03 06:55 (UTC+8))
- #33547 [Bugfix] fixed AttributeError: 'FlashMLASparseMetadata' object has no attribute 'num_decodes' — bug,v1 — by chaunceyjiang (created: 2026-02-02 15:22 (UTC+8))
- #33621 Patch aiohttp for CVE-2025-69223 — rocm,ci/build — by zaristei2 (created: 2026-02-03 08:53 (UTC+8))
- #33623 [Frontend] Adding a callback to allow direct access to engine output — frontend — by alecsolder (created: 2026-02-03 08:57 (UTC+8))
- #33548 use ORJSONResponse when available to improve the efficiency of request process — frontend — by staugust (created: 2026-02-02 15:32 (UTC+8))
- #33593 [Perf] Optimize Python Slice operation — structured-output,ready,v1,deepseek — by yewentao256 (created: 2026-02-03 05:19 (UTC+8))
- #33540 [Feature][Core] Support Fabric detection to adapt the MNNVL protocol for the GB series — ready — by kebe7jun (created: 2026-02-02 13:18 (UTC+8))
- #33619 Patch Protobuf for CVE 2026-0994 — ci/build — by zaristei2 (created: 2026-02-03 08:34 (UTC+8))
- #33618 [Release] Fix format and cherry-pick — no labels — by zhewenl (created: 2026-02-03 08:10 (UTC+8))
- #33617 Fix Anthropic image format handling in serving endpoint — frontend — by razorback16 (created: 2026-02-03 07:53 (UTC+8))
- #33614 Fix Anthropic image format handling in serving endpoint — frontend — by razorback16 (created: 2026-02-03 07:30 (UTC+8))
- #33591 [14/n] Migrate _moe_C to torch stable ABI — rocm,needs-rebase,ci/build,cpu,nvidia — by mikaylagawarecki (created: 2026-02-03 04:03 (UTC+8))
- #33612 [Perf] Optimize spec decoding + async scheduling, 1.5% Throughput improvement — v1 — by yewentao256 (created: 2026-02-03 06:52 (UTC+8))
- #33590 [13/n] Migrate custom all reduce to libtorch stable ABI — rocm,needs-rebase,ci/build,cpu,nvidia — by mikaylagawarecki (created: 2026-02-03 04:03 (UTC+8))
- #33611 Make Longcat-Flash N-gram inputs compile-friendly and fix zero-expert MoE routing#33528 — documentation,new-model,v1 — by baonudesifeizhai (created: 2026-02-03 06:46 (UTC+8))
- #33589 [12/n] Migrate hadamard + mamba to torch stable ABI — rocm,needs-rebase,ci/build,cpu,nvidia — by mikaylagawarecki (created: 2026-02-03 04:03 (UTC+8))
- #33609 [Feature][Speculative Decoding] Consolidate EAGLE input preparation — speculative-decoding,v1 — by jaewshin (created: 2026-02-03 06:25 (UTC+8))
- #33607 [Bugfix] Fix EVS implementation for Qwen3 VL — bug,v1,multi-modality,qwen — by 2ez4bz (created: 2026-02-03 06:23 (UTC+8))
- #33575 [Performance] Add manual GC control to reduce GPU scheduling latency — v1 — by harsh543 (created: 2026-02-03 02:01 (UTC+8))
- #33608 Changing the gating TG group composition — ci/build — by Alexei-V-Ivanov-AMD (created: 2026-02-03 06:24 (UTC+8))
- #33550 Fix #26037: Skip platform detection when displaying help — frontend — by AbhiOnGithub (created: 2026-02-02 16:03 (UTC+8))
- #33606 [Core][Distributed] Add all_gather decomposition pass for generic Inductor fusion — no labels — by eellison (created: 2026-02-03 06:17 (UTC+8))
- #33586 [9/n] Migrate CUTLASS FP4, sparse, and W4A8 kernels to libtorch stable ABI — rocm,needs-rebase,ci/build,cpu,nvidia — by mikaylagawarecki (created: 2026-02-03 04:00 (UTC+8))
- #33538 [Kernel] Triton-based Top-k and Top-p sampler kernels — performance,v1 — by cakeng (created: 2026-02-02 12:50 (UTC+8))
- #33588 [11/n] Migrate machete kernels to torch stable ABI — rocm,needs-rebase,ci/build,cpu,nvidia — by mikaylagawarecki (created: 2026-02-03 04:01 (UTC+8))
- #33602 [Release] patch step3p5 attention class in v0.15.1 release — no labels — by zhewenl (created: 2026-02-03 05:47 (UTC+8))
- #33587 [10/n] Migrate CUTLASS mla kernel to torch stable ABI — rocm,needs-rebase,ci/build,cpu,nvidia — by mikaylagawarecki (created: 2026-02-03 04:01 (UTC+8))
- #33600 [WIP][Attention] Refactor `check_and_update_config` — needs-rebase,v1,nvidia — by MatthewBonanni (created: 2026-02-03 05:43 (UTC+8))
- #33585 [7/n] Migrate cache_kernels to libtorch stable ABI — rocm,ci/build,cpu,nvidia — by mikaylagawarecki (created: 2026-02-03 03:59 (UTC+8))
- #33584 [8/n] Migrated scaled_mm_entry to libtorch stable ABI — rocm,needs-rebase,ci/build,cpu,nvidia — by mikaylagawarecki (created: 2026-02-03 03:58 (UTC+8))
- #33582 [CPU][BugFix] Fix loading of w4a8int models with bias — bug,ready — by fadara01 (created: 2026-02-03 03:21 (UTC+8))
- #33592 Add batched and grouped EPLB communication — no labels — by ilmarkov (created: 2026-02-03 04:05 (UTC+8))
- #33597 [Minor] Some code simplification in `scheduler.py` — ready,v1 — by njhill (created: 2026-02-03 05:28 (UTC+8))
- #33571 [torch.compile] Document the workaround to standalone_compile failing — documentation,ready — by zou3519 (created: 2026-02-03 01:12 (UTC+8))
- #33574 [Voxtral Realtime] Introduce global log mel max — ready,multi-modality — by patrickvonplaten (created: 2026-02-03 01:56 (UTC+8))
- #33537 [Core] Add determinism warmup automation for batch invariant mode — documentation,v1 — by debo3 (created: 2026-02-02 12:44 (UTC+8))
- #33595 [MoE] Add routing simulation override for MXFP4 quantized MoE — no labels — by jaewonlee-fb (created: 2026-02-03 05:20 (UTC+8))
- #33581 [Feature] Warn about unrecognized environment variables — v1 — by gshtras (created: 2026-02-03 02:53 (UTC+8))
- #33594 [Model] Use layer_types config for sliding window selection in GPT-OSS — gpt-oss — by jaewonlee-fb (created: 2026-02-03 05:20 (UTC+8))
- #33561 [Feature] support mtp3 — documentation,speculative-decoding,needs-rebase,v1 — by mingMelody (created: 2026-02-02 20:11 (UTC+8))
- #33570 [UX] Format attention backend log line — ready,nvidia — by MatthewBonanni (created: 2026-02-03 01:09 (UTC+8))
- #33578 [compile] Clean up AOT compile bypass on evaluate_guards. — no labels — by zhxchen17 (created: 2026-02-03 02:32 (UTC+8))
- #33573 Change the type signature of MixtureOfExperts.expert_weights to MutableSequence[Sequence[Tensor]] — ready — by SageMoore (created: 2026-02-03 01:50 (UTC+8))
- #33564 Fix: Corrected timestamp if audio > 30s — frontend — by Hardik-Taneja (created: 2026-02-02 20:49 (UTC+8))
- #33559 [Refactor] Move profiling methods to MM budget — ready,v1,multi-modality — by DarkLight1337 (created: 2026-02-02 18:30 (UTC+8))
- #33567 Update huggingface-hub again — rocm,ready,ci/build — by hmellor (created: 2026-02-02 23:07 (UTC+8))
- #33565 Remove incorrect tokenizer info test — ready — by hmellor (created: 2026-02-02 22:54 (UTC+8))
- #33569 OffloadingConnector: Optimize the redundant block touches — kv-connector — by Zhao233 (created: 2026-02-03 00:53 (UTC+8))
- #33557 [Perf] Optimize the performance of structured output + reasoning — structured-output,frontend,ready,v1 — by chaunceyjiang (created: 2026-02-02 18:17 (UTC+8))
- #33566 [CI] Add DeepSeek V3.2 nightly eval — ready,deepseek — by MatthewBonanni (created: 2026-02-02 22:56 (UTC+8))
- #33536 [Misc] Remove deprecated profiler environment variables — ready — by carlory (created: 2026-02-02 11:33 (UTC+8))
- #33562 [Bugfix] Enable Kimi k25 processor test — bug,documentation,ready,multi-modality — by Isotr0py (created: 2026-02-02 20:36 (UTC+8))
- #33558 [Core] Add option to allocate CUDA Fabric memory (required for MNNVL) — v1,nvidia — by tvegas1 (created: 2026-02-02 18:24 (UTC+8))
- #33551 [Model] Use explicit types in `get_generation_prompt` — frontend,ready,qwen — by DarkLight1337 (created: 2026-02-02 16:35 (UTC+8))
- #33552 Document NixlConnector backend selection via kv_connector_extra_config — documentation,kv-connector — by KrxGu (created: 2026-02-02 16:42 (UTC+8))
- #33555 [CI][Bugfix] Fix flaky `tests/v1/kv_connector/unit/test_multi_connector.py::test_multi_example_connector_consistency` — bug,ready,v1,kv-connector — by NickLucche (created: 2026-02-02 16:50 (UTC+8))
- #33542 [Chore] Remove redundant input parsing methods — frontend,ready,v1 — by DarkLight1337 (created: 2026-02-02 14:14 (UTC+8))
- #33556 [PluggableLayer][3/N] Apply PluggableLayer to moe-related layers. — no labels — by whx-sjtu (created: 2026-02-02 17:37 (UTC+8))
- #33545 [config] fix error occurs when loading local mode(starting with /) — no labels — by RuixiangMa (created: 2026-02-02 14:55 (UTC+8))
Merged PRs
- #32032 [CI/Build] add directions for CPU image upload to Docker Hub — ready,ci/build — by nathan-weinberg (merged: 2026-02-03 10:48 (UTC+8))
- #32739 [BugFix] DPMetadata raises assert error for dense model — bug,ready,meta-exported,fb-exported — by River12 (merged: 2026-02-03 08:56 (UTC+8))
- #33540 [Feature][Core] Support Fabric detection to adapt the MNNVL protocol for the GB series — ready — by kebe7jun (merged: 2026-02-02 22:55 (UTC+8))
- #33618 [Release] Fix format and cherry-pick — no labels — by zhewenl (merged: 2026-02-03 08:19 (UTC+8))
- #33602 [Release] patch step3p5 attention class in v0.15.1 release — no labels — by zhewenl (merged: 2026-02-03 06:54 (UTC+8))
- #33574 [Voxtral Realtime] Introduce global log mel max — ready,multi-modality — by patrickvonplaten (merged: 2026-02-03 06:01 (UTC+8))
- #32224 fix cutlass_3x_gemm_fp8_blockwise on sm103a — ready,nvidia — by IwakuraRein (merged: 2026-02-03 03:47 (UTC+8))
- #31914 fix memory for online fp8 quantization with streaming weight load — ready,ci/build,quantization — by vkuzo (merged: 2026-02-03 03:17 (UTC+8))
- #33570 [UX] Format attention backend log line — ready,nvidia — by MatthewBonanni (merged: 2026-02-03 02:57 (UTC+8))
- #33486 [cohere] [misc] support arbitrary MM datasets in spec dec bench — documentation,performance,speculative-decoding,ready — by kkt-cohere (merged: 2026-02-02 16:49 (UTC+8))
- #32005 Reduce the kernel overhead when num of active loras is smaller than max loras. Multiple cuda graphs are captured for each num of active-loras. — ready,v1,nvidia — by yugong333 (merged: 2026-02-03 01:30 (UTC+8))
- #33559 [Refactor] Move profiling methods to MM budget — ready,v1,multi-modality — by DarkLight1337 (merged: 2026-02-02 23:27 (UTC+8))
- #33567 Update huggingface-hub again — rocm,ready,ci/build — by hmellor (merged: 2026-02-03 01:20 (UTC+8))
- #33565 Remove incorrect tokenizer info test — ready — by hmellor (merged: 2026-02-03 01:11 (UTC+8))
- #33039 [Model] Use mm_position to compute mrope positions for GLM-4.xV — documentation,ready — by KKSK-DON (merged: 2026-02-03 00:55 (UTC+8))
- #33566 [CI] Add DeepSeek V3.2 nightly eval — ready,deepseek — by MatthewBonanni (merged: 2026-02-03 00:10 (UTC+8))
- #33562 [Bugfix] Enable Kimi k25 processor test — bug,documentation,ready,multi-modality — by Isotr0py (merged: 2026-02-02 22:25 (UTC+8))
- #33365 move spec decode slow test to test_areas.yaml — speculative-decoding,ready,ci/build,v1 — by shanjiaz (merged: 2026-02-02 22:28 (UTC+8))
- #32790 [MoE] Enable Shared/Routed Overlap For Latent MoE (Nemotron-H) — ready — by danielafrimi (merged: 2026-02-02 22:18 (UTC+8))
- #32902 fix[ROCm]: Remove unconditional aiter import — rocm,speculative-decoding,ready,v1 — by rabi (merged: 2026-02-02 22:10 (UTC+8))
- #33551 [Model] Use explicit types in `get_generation_prompt` — frontend,ready,qwen — by DarkLight1337 (merged: 2026-02-02 20:38 (UTC+8))
- #33525 Update get_expert_mapping to include self parameter — ready — by Otsutsukii (merged: 2026-02-02 20:29 (UTC+8))
- #32686 Fix accessing hidden_act from model config — ready — by grzegorz-k-karch (merged: 2026-02-02 19:11 (UTC+8))
- #33555 [CI][Bugfix] Fix flaky `tests/v1/kv_connector/unit/test_multi_connector.py::test_multi_example_connector_consistency` — bug,ready,v1,kv-connector — by NickLucche (merged: 2026-02-02 19:01 (UTC+8))
- #33542 [Chore] Remove redundant input parsing methods — frontend,ready,v1 — by DarkLight1337 (merged: 2026-02-02 18:50 (UTC+8))
- #33494 [Doc]: update paths for Offline/Online/Others example sections — documentation,ready — by soyr-redhat (merged: 2026-02-02 11:56 (UTC+8))
- #33243 [CPU][IBM Z][Dockerfile] Fix IBM Z builds — ready,ci/build — by R3hankhan123 (merged: 2026-02-02 15:41 (UTC+8))
- #33165 [Model] Support DeepSeek-OCR-2 — documentation,new-model,ready,deepseek — by LiuLi1998 (merged: 2026-02-02 14:24 (UTC+8))
- #33521 Fix mistral sliding window parsing — ready — by andylolu2 (merged: 2026-02-02 13:08 (UTC+8))
- #33220 [Doc] add missing model entries in supported_models.md — documentation,ready — by pacoxu (merged: 2026-02-02 11:37 (UTC+8))
PRs Closed Without Merging
- #33250 feat(mla): extract KV-cache update — needs-rebase — by dw2761 (closed: 2026-02-03 10:53 (UTC+8))
- #33547 [Bugfix] fixed AttributeError: 'FlashMLASparseMetadata' object has no attribute 'num_decodes' — bug,v1 — by chaunceyjiang (closed: 2026-02-03 10:00 (UTC+8))
- #33124 [Model] GPT-OSS: Use layer_types config for sliding window selection — gpt-oss — by jaewonlee-fb (closed: 2026-02-03 05:26 (UTC+8))
- #33617 Fix Anthropic image format handling in serving endpoint — frontend — by razorback16 (closed: 2026-02-03 07:54 (UTC+8))
- #33614 Fix Anthropic image format handling in serving endpoint — frontend — by razorback16 (closed: 2026-02-03 07:31 (UTC+8))
- #33606 [Core][Distributed] Add all_gather decomposition pass for generic Inductor fusion — no labels — by eellison (closed: 2026-02-03 06:17 (UTC+8))
- #33516 [Bugfix] Add SM110/SM120 device capability checks for NVFP4 MoE backends — bug,nvidia — by Code4me2 (closed: 2026-02-03 00:44 (UTC+8))
- #27686 fix(engine): Add death_pipe monitoring to EngineCoreProc for graceful shutdown — needs-rebase,ci/build,unstale,v1 — by RishabhSaini (closed: 2026-02-03 00:56 (UTC+8))
- #33558 [Core] Add option to allocate CUDA Fabric memory (required for MNNVL) — v1,nvidia — by tvegas1 (closed: 2026-02-02 20:50 (UTC+8))
- #26504 [Spec-Decode] Add DynamicProposer for per-sequence dynamic speculative decoding — documentation,performance,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality,tool-calling,qwen — by yang926 (closed: 2026-02-02 16:33 (UTC+8))
- #28636 fix(network): prevent port allocation race condition — documentation,needs-rebase,v1,kv-connector — by haosdent (closed: 2026-02-02 15:52 (UTC+8))
- #33270 [Bugfix] Fix Hybrid KV cache hit length computation for eagle — bug,needs-rebase,v1 — by xyang16 (closed: 2026-02-02 13:06 (UTC+8))