vLLM Development Activity Report - 2026-02-13
Time window: 2026-02-13 11:23 (UTC+8) ~ 2026-02-14 11:23 (UTC+8). Stats: 19 new issues | 15 issues closed | 43 new PRs | 26 PRs merged | 7 PRs closed without merging
📊 Daily Development Summary
During the 2026-02-13 to 02-14 window, the vLLM project maintained a high level of development activity, handling 43 new PRs and 19 new issues. Core development focused on multimodal model support, performance optimization and bug fixes, and continued maturation of the AMD ROCm platform ecosystem. The community is actively landing several key features, including reasoning output support and a more efficient CPU weight offloading scheme, while triaging and tuning recently introduced major changes such as the MoE refactor and AOT compilation.
🎯 AMD/ROCm Ecosystem Updates
AMD/ROCm ecosystem progress and bug fixes were active this cycle, reflecting the community's continued investment in the platform.
- Bug fixes and resolutions:
  - Issue #34500 ([Bug]: AMD 7900 running GLM-4.7-Flash crash): a user reported crashes when running GLM-4.7-Flash on an AMD Radeon PRO W7900D. AMD engineer @vllmellm quickly stepped in, identified the problem as related to the Triton backend, and suggested setting export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE to enable Triton for AMD. The user confirmed the workaround and the issue was closed, showing that specific AMD GPU models need the right backend configuration for compatibility.
  - PR #34543 ([Bugfix] Fix ROCm UVA CPU weight offloading broken by #32993): a key fix. The earlier PR #32993 added UVA support for CPU weight offloading but checked only for the CUDA and XPU platforms at the Python layer, causing a ValueError on ROCm. This PR adds current_platform.is_rocm() to the check so that ROCm uses the same UVA code path, fixing the resulting CI test failures.
  - PR #34538 ([ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests): fixes test-collection failures in ROCm CI caused by unconditionally importing the flashinfer-dependent FlashInferMLASparseBackend. Building the test list dynamically makes the test suite more robust on ROCm.
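The workaround confirmed in Issue #34500 can be reproduced as below; only the environment variable comes from the report, the commented serve invocation is illustrative:

```shell
# Force the Triton flash-attention backend on AMD GPUs, as suggested by
# @vllmellm for the GLM-4.7-Flash crash on a Radeon PRO W7900D.
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
# vllm serve <model>   # hypothetical invocation; start the server as usual
```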
- New features and optimizations:
  - PR #34541 ([ROCM] Optimize ROCM_AITER_FA spec decode eagle performance): aims to improve the ROCm AITER FA backend's performance for speculative decoding (EAGLE). By changing cuda_graph support and removing a CPU-GPU synchronization, cudagraphs can also be used for the draft model, improving performance. The PR depends on an accuracy fix (#32877) and currently has merge conflicts to resolve.
  - PR #34515 ([Draft] amd-quark online fp8 and mxfp4 quantization): a draft PR from AMD engineer @hangy-amd integrating AMD Quark's online FP8 and MXFP4 quantization into vLLM, an important step in consolidating AMD's quantization toolchain.
- Other related activity:
  - The closed issues #33678 ([Bug]: [ROCm] Kimi-K2.5 produces incorrect results on AMD MI308X) and #28494 ([Bug][AMD gfx1100] Enabling include_usage with stream=True results in freeze after 3 iterations) show the community continuing to track down and resolve long-standing accuracy and API-stability problems on AMD hardware.
  - Contributor identification: AMD engineers active this cycle (usernames with an -amd suffix) include @hangy-amd and @vivcheng-amd (on closed issues), showing direct contributions from the AMD team to vLLM.
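The platform check corrected by PR #34543 above can be sketched as follows; class and function names here are illustrative simplifications, not vLLM's actual interfaces:

```python
# Sketch of the gating bug fixed by PR #34543: UVA-based CPU weight offloading
# was allow-listed for CUDA and XPU only, so ROCm raised a ValueError even
# though it can use the same UVA code path.
class Platform:
    """Minimal stand-in for vLLM's current_platform object (hypothetical)."""
    def __init__(self, name: str):
        self.name = name

    def is_cuda(self) -> bool:
        return self.name == "cuda"

    def is_xpu(self) -> bool:
        return self.name == "xpu"

    def is_rocm(self) -> bool:
        return self.name == "rocm"


def supports_uva_offload(platform: Platform) -> bool:
    # The fix: add is_rocm() to the allow-list alongside CUDA and XPU.
    return platform.is_cuda() or platform.is_xpu() or platform.is_rocm()
```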
💬 Hot Discussion Analysis
- Issue #32791 ([Bug]: chat.completions returns content: null for GPT-OSS multi-turn with json_object):
  - Core issue: when using GPT-OSS models for multi-turn chat combined with structured output such as json_object, the server returns content: null.
  - Discussion: multiple users confirmed the problem and found that it only triggers when structured output is used and the conversation history contains assistant messages. Several workarounds were proposed, such as downgrading vLLM, changing the role of history messages, or prepending a random string to the prompt. The community eventually traced the root cause to logic in gptoss_reasoning_parser that, in multi-turn scenarios, wrongly matched end markers inside history messages and applied the grammar mask too early.
  - Resolution: PR #34454 fixed the parser logic so the end marker is searched for only in the currently generated message, fully resolving the issue. The PR has been merged.
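The parser fix direction described above can be sketched generically; the function, its arguments, and the marker string are placeholders, not the actual vLLM implementation:

```python
# Illustrative sketch of the idea behind PR #34454: scan for the reasoning end
# marker only in the newly generated text, never in assistant messages replayed
# from the conversation history.
END_MARKER = "<|end|>"  # placeholder; the real GPT-OSS marker differs


def find_reasoning_end(full_context: str, generated: str) -> int:
    """Return the end-marker index within the *generated* text only.

    Searching `full_context` (history + generation) would match markers left
    over from earlier turns and apply the grammar mask too early -- the bug
    reported in Issue #32791 that produced content: null responses.
    """
    return generated.find(END_MARKER)
```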
- Issue #29403 ([Bug]: fastsafetensors in tensor parallel requires too much VRAM):
  - Core issue: loading models with the fastsafetensors plugin under tensor parallelism causes VRAM usage to spike, or loading to fail outright.
  - Discussion: the reporter suspected that fastsafetensors was loading all weights onto GPU 0 instead of sharding them correctly. Contributor @bbrowning confirmed this through debugging and noted it is likely a bug. The discussion covered the impact on system and GPU memory as well as a potential fix.
  - Status: the issue was closed within the data window, but the closing reason is not clear from the available data (possibly fixed by another PR, or worked around for now). It reflects the community's attention to the stability of third-party integrations.
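The kind of debugging that confirmed the GPU 0 pile-up can be done with a small per-device memory probe like the one below; this is a generic diagnostic sketch, not code from the issue:

```python
# Report allocated GPU memory per visible device, in GiB. If tensor-parallel
# loading is sharding correctly, usage should be roughly even across ranks;
# everything landing on device 0 matches the symptom in Issue #29403.
def per_device_allocated_gib() -> list[float]:
    try:
        import torch
    except ImportError:
        return []  # torch not installed; nothing to probe
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.memory_allocated(i) / 2**30
        for i in range(torch.cuda.device_count())
    ]
```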
- Issue #33804 ([RFC]: [compile] Rollout strategy for AOT Compilation.):
  - Core issue: how to roll out the AOT compilation feature in Torch 2.10 safely, without causing problems for users.
  - Positions:
    - Cautious rollout: given the blocking AOT-related problems found in testing, a conservative strategy was advised.
    - Concrete plan: keep AOT off by default in the next vLLM release, and enable it only in the release after that, once it has been thoroughly exercised on the main branch.
  - Resolution: core maintainers agreed that the upcoming vLLM 0.16 release will ship PyTorch 2.10 with AOT compilation disabled by default, then enable AOT on main immediately after the release, leaving ample time to surface and fix problems before the next release (0.17). The RFC has been closed.
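The agreed rollout can be expressed as a simple build-gated default; this is purely illustrative of the policy, vLLM's real configuration mechanism is different:

```python
# Staged-default pattern from RFC #33804 (names hypothetical): the release
# build keeps the conservative default, while the development branch flips
# the new feature on so problems surface before the next release.
def aot_compile_default(is_release_build: bool) -> bool:
    """0.16 release: PyTorch 2.10 ships but AOT stays off by default.
    On main (post-release): AOT is on, to shake out bugs before 0.17."""
    return not is_release_build
```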
🔥 Hot Topics and Trends
- Deepening multimodal model support: issues around GLM-OCR, Qwen-VL, and similar models (#34040, #34502, #34506) show users applying vLLM to complex vision-language tasks such as image-text retrieval, OCR, and visual reasoning. The community is actively tackling the resulting challenges, such as inaccurate encoder cache estimation and multimodal API extensions (e.g. #34518 requesting decoder-prefix support for the Whisper API).
- Reasoning output and tool calling: Feature Request #34536 asks for reasoning output support in offline inference, while Issues #34496 and #34498 concern robustness fixes in the reasoning and tool-calling APIs. vLLM's advanced features (structured output, reasoning, tool calling) are seeing broader use, making their stability and usability a focus.
- Performance optimization and core bug fixes:
  - MoE performance regression: PR #34279 switched strides to int64 to fix an IMA error, but caused a severe slowdown (~60x) on small-memory GPUs such as the GB10. This triggered a quick revert (PR #34530) and a targeted fix (PR #34507), underscoring how delicate performance tuning can be.
  - Cache and loading optimizations: PR #34527 fixed an LRU cache invalidation problem in multimodal processor loading, significantly speeding up preprocessing; PR #34498 optimized GDN metadata construction to avoid CPU-GPU synchronization.
- Platform compatibility and deployment: beyond AMD ROCm, discussions on macOS multiprocessing (#34534), Python 3.13 compatibility (#34504), and CPU-offload memory usage (#32993) reflect the community's ongoing effort to support diverse, cross-platform deployment scenarios.
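The sync-avoidance idea behind the GDN metadata change (PR #34498) can be sketched as follows; the function and its shape are assumed simplifications, not the PR's actual code:

```python
# Build small per-step metadata on the CPU, then move it to the device in one
# non-blocking copy, instead of constructing it from GPU tensors, where
# Python-side reads (.item(), .tolist()) force device->host synchronizations.
import torch


def build_gdn_metadata(seq_lens: list[int]) -> torch.Tensor:
    meta = torch.tensor(seq_lens, dtype=torch.int32)  # plain CPU tensor
    if torch.cuda.is_available():
        # a single async host->device transfer once the metadata is complete
        meta = meta.to("cuda", non_blocking=True)
    return meta
```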
🛠️ Key Technical Changes
- PR #34454 ([Bugfix]: Fix structured output in multi-turn gpt-oss): fixes a key defect when GPT-OSS models use structured output in multi-turn conversations. By correcting how gptoss_reasoning_parser detects the reasoning end marker, grammar constraints are applied only at the right moment, restoring the model's ability to generate valid content. Impact: significantly improves the usability and reliability of GPT-OSS models in production multi-turn scenarios.
- PR #32993 ([Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation) (merged): addresses the doubled allocation that PyTorch pinned memory can cause, introducing a UVA-based CPU offloading alternative controlled by an environment variable. Impact: makes system-memory requirements for CPU offloading of large models more predictable and efficient, which especially benefits resource-constrained environments.
- PR #33705 ([Hybrid] Enable spec decoding in mamba cache align mode) (merged): re-enables speculative decoding for Mamba models in cache-align mode, which had previously been disabled due to related problems. Impact: unlocks speculative decoding, an important performance feature, for hybrid (Mamba+Attention) models such as Jamba.
- PR #34523 ([Misc] vLLM's --enforce-eager should turn off compile and cudagraphs only) (merged): changes the --enforce-eager debug flag from setting optimization level -O0 to disabling only torch.compile and cudagraphs, keeping other optimizations. Impact: makes debugging behavior more precise, avoiding extra performance skew from disabling unrelated optimizations (such as flashinfer auto-tuning) and easing problem isolation.
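The --enforce-eager behavior change in PR #34523 can be sketched with a toy config; field names here are hypothetical, not vLLM's actual configuration schema:

```python
# Old behavior: --enforce-eager forced the whole optimization level to 0,
# disabling unrelated optimizations too. New behavior (PR #34523): turn off
# only torch.compile and CUDA graphs, leaving everything else intact.
from dataclasses import dataclass


@dataclass
class CompilationSettings:
    optimization_level: int = 3
    use_torch_compile: bool = True
    use_cudagraphs: bool = True


def apply_enforce_eager(cfg: CompilationSettings) -> CompilationSettings:
    cfg.use_torch_compile = False
    cfg.use_cudagraphs = False
    # optimization_level is intentionally left untouched
    return cfg
```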
📈 Development Activity Observations
- Efficient issue turnaround: 15 issues were closed within 24 hours, including some long-standing ones (e.g. #32791, #29403), showing maintainers actively working down the backlog.
- Cross-organization collaboration: on complex problems such as the AMD issues and the MoE performance regression, engineers from AMD, Red Hat, NVIDIA, and other community members collaborated actively on analysis and resolution.
- Review and merge cadence: 26 PRs were merged within the day, including key bug fixes and feature optimizations, indicating that the core team sustains a high review and merge throughput so that important fixes land on the main branch promptly.
💡 Issues Worth Watching
- Multimodal cache estimation: Issue #34040 and its fix PR #34483 expose a general problem: the method for estimating a multimodal encoder's cache needs (especially for a single image) can be flawed and lead to runtime errors. Other multimodal models should be audited for the same risk.
- Python 3.13 compatibility: Issue #34504 reports that vLLM crashes on Python 3.13 because the symbol _PyThreadState_UncheckedGet is undefined. This environment-compatibility problem affects users on newer Python versions and needs a prompt fix.
- Realtime API robustness: Issue #34532 reports that the server crashes when a Realtime API client disconnects abnormally. This affects production stability; the suggestion to replace assertions with runtime-error validation is worth adopting.
- Complexity of disaggregated inference: Issue #34526 reports accuracy problems when combining multiple connectors (Nixl + CPU offloading). It shows how hard debugging and correctness guarantees become in aggressive disaggregated/heterogeneous inference architectures; these features need more thorough testing and validation tooling.
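The hardening suggested for the Realtime API crash (Issue #34532) can be sketched generically; the function name and message are illustrative, not from the vLLM codebase:

```python
# Replace an assert on connection state -- which raises AssertionError and can
# tear down the server when a client drops abnormally -- with a recoverable
# runtime check that the request handler can catch and report.
def validate_connection_state(is_open: bool) -> None:
    # assert is_open  # old style: crashes the whole server process
    if not is_open:
        raise RuntimeError("client connection closed unexpectedly")
```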
📋 Appendix: Detailed Data
New Issues
- #34525 [CI Failure] LoRA TP (Distributed): lora/test_olmoe_tp.py::test_olmoe_lora — ci-failure — by LucasWilkinson (created: 2026-02-14 00:29 (UTC+8))
- #34502 [Usage]: qwen3-vl-reranker-2b deploy issue — usage — by newbecoder (created: 2026-02-13 15:37 (UTC+8))
- #34504 [Bug]: GLM-5-FP8 Crash - Engine core initialization failed — bug — by jxdn (created: 2026-02-13 16:11 (UTC+8))
- #34545 [Installation]: unrecognized arguments: --omni — installation — by HenryBao91 (created: 2026-02-14 09:17 (UTC+8))
- #34524 [Bug]: Error saving sharded state for GPT-OSS-120B - safetensors KeyError for torch.float8_e8m0fnu — bug — by dhayanesh (created: 2026-02-14 00:00 (UTC+8))
- #34536 [Feature]: Reasoning output for offline inference — feature request — by BartekKruczek (created: 2026-02-14 04:04 (UTC+8))
- #34532 [Bug]: Realtime API crashes when client terminates connection "incorrectly" — bug — by nullquery (created: 2026-02-14 02:52 (UTC+8))
- #34534 [Bug]: EngineCore exits immediately after startup when vLLM CPU is launched from multiprocessing.Process on macOS — no labels — by diegocastanibm (created: 2026-02-14 03:43 (UTC+8))
- #34526 [Bug]: accuracy issue when using multiconnector(Nixl+cpu offloading) — bug — by hsubramony (created: 2026-02-14 00:42 (UTC+8))
- #34519 [Feature]: Quality of life - expose model name / custom label in GPU process name — feature request — by Rictus (created: 2026-02-13 22:03 (UTC+8))
- #34518 [Feature]: [Whisper] Support for decoder prefix and custom task tokens in transcription API — feature request — by LouisChirol (created: 2026-02-13 22:03 (UTC+8))
- #34509 [New Model]: inclusionAI/Ring-2.5-1T — new-model — by youkaichao (created: 2026-02-13 17:31 (UTC+8))
- #34500 [Bug]: AMD 7900 running GLM-4.7-Flash crash — bug,rocm — by warlockedward (created: 2026-02-13 15:32 (UTC+8))
- #34506 [Bug]: Qwen 2.5 Omni Output text seems only load first part of mm input — bug — by tzhouam (created: 2026-02-13 16:39 (UTC+8))
- #34505 [Bug]: Qwen 2.5 Omni cuda graph has error — bug — by tzhouam (created: 2026-02-13 16:36 (UTC+8))
- #34503 [Feature]: vllm serve logs do not show the request prompt — feature request — by jiezhangGt (created: 2026-02-13 16:00 (UTC+8))
- #34486 [Feature]: Migrate MoE envs (e.g. VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8) to model flags — feature request — by elvischenv (created: 2026-02-13 11:44 (UTC+8))
- #34497 [Bug]: return logprobs with speculative decoding — bug — by xiaoxiaosuaxuan (created: 2026-02-13 15:15 (UTC+8))
- #34496 [Bug]: Responses API crashes with KeyError when reasoning input item has no content field — bug — by jeonsworld (created: 2026-02-13 15:12 (UTC+8))
Closed Issues
- #34524 [Bug]: Error saving sharded state for GPT-OSS-120B - safetensors KeyError for torch.float8_e8m0fnu — bug — by dhayanesh (closed: 2026-02-14 05:21 (UTC+8))
- #32791 [Bug]: chat.completions returns content: null for GPT-OSS multi-turn with json_object — bug — by supersteves (closed: 2026-02-14 03:12 (UTC+8))
- #29403 [Bug]: fastsafetensors in tensor parallel requires too much VRAM — bug — by Jannik2099 (closed: 2026-02-13 23:56 (UTC+8))
- #34519 [Feature]: Quality of life - expose model name / custom label in GPU process name — feature request — by Rictus (closed: 2026-02-13 22:33 (UTC+8))
- #28732 [Bug]: --pipeline-parallelism >2 results in RayChannelTimeoutError — bug,stale — by wahabk (closed: 2026-02-13 21:41 (UTC+8))
- #33804 [RFC]: [compile] Rollout strategy for AOT Compilation. — RFC,torch.compile — by zhxchen17 (closed: 2026-02-13 21:30 (UTC+8))
- #34500 [Bug]: AMD 7900 running GLM-4.7-Flash crash — bug,rocm — by warlockedward (closed: 2026-02-13 20:55 (UTC+8))
- #20963 [New Model]: Kimi-K2-Instruct — new-model,unstale — by Yaomt (closed: 2026-02-13 17:25 (UTC+8))
- #28494 [Bug][AMD gfx1100] Enabling include_usage with stream=True results in freeze after 3 iterations — bug,rocm — by vivcheng-amd (closed: 2026-02-13 17:13 (UTC+8))
- #33678 [Bug]: [ROCm] Kimi-K2.5 produces incorrect results on AMD MI308X — bug,rocm — by hiahiawei (closed: 2026-02-13 17:05 (UTC+8))
- #27706 [Bug]: OOM for Qwen2.5-VL-3B and Qwen3-VL-2B when multimodal activated on RX 7800 XT — bug,rocm — by tobing (closed: 2026-02-13 16:49 (UTC+8))
- #34463 [CI Failure]: ARM CPU Test (test_moe.py) — ci-failure,cpu — by rajkiranjoshi (closed: 2026-02-13 16:13 (UTC+8))
- #34361 [Bug]: Qwen Coder Next prefix caching — bug — by Nepherpitou (closed: 2026-02-13 16:13 (UTC+8))
- #34503 [Feature]: vllm serve logs do not show the request prompt — feature request — by jiezhangGt (closed: 2026-02-13 16:09 (UTC+8))
- #34040 [Bug]: Encoder cache underestimation for GLM-4V/GLM-OCR — documentation — by willingWill17 (closed: 2026-02-13 13:04 (UTC+8))
New PRs
- #34541 [ROCM] Optimize ROCM_AITER_FA spec decode eagle performance — rocm,needs-rebase,v1 — by jennyyyyzhen (created: 2026-02-14 05:53 (UTC+8))
- #34546 Add colqwen3 support to vLLM — documentation,new-model,multi-modality,qwen — by athrael-soju (created: 2026-02-14 10:08 (UTC+8))
- #34538 [ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests — rocm,v1 — by AndreasKaratzas (created: 2026-02-14 05:19 (UTC+8))
- #34543 [Bugfix] Fix ROCm UVA CPU weight offloading broken by #32993 — bug,rocm,ready — by AndreasKaratzas (created: 2026-02-14 07:01 (UTC+8))
- #34487 [Chore] Improve Harmony file_search tool and showcase demo — documentation,frontend,gpt-oss — by franciscojavierarceo (created: 2026-02-13 12:21 (UTC+8))
- #34492 [Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj — needs-rebase,qwen — by Isotr0py (created: 2026-02-13 13:26 (UTC+8))
- #34528 [Core] Cleanup engine pause/sleep logic — frontend,ready,v1 — by njhill (created: 2026-02-14 01:49 (UTC+8))
- #34537 [Kernels] Fix Helion GPU utils to use platform-agnostic device name API — no labels — by AndreasKaratzas (created: 2026-02-14 04:54 (UTC+8))
- #34540 [Kernel] [Helion] [8/N] Remove fake_impl usage and inference — no labels — by gmagogsfm (created: 2026-02-14 05:38 (UTC+8))
- #34544 [RL] Pause and Resume for DPEP — v1 — by hao-aaron (created: 2026-02-14 07:34 (UTC+8))
- #34542 Mxfp4 refactor cutlass experts — nvidia — by zyongye (created: 2026-02-14 06:00 (UTC+8))
- #34539 Generative Scoring — documentation,frontend,needs-rebase,v1 — by vedantjh2 (created: 2026-02-14 05:31 (UTC+8))
- #34535 [Feature][Perf] Support Selective CPU Weight Offloading — ready,v1,nvidia — by wzhao18 (created: 2026-02-14 03:44 (UTC+8))
- #34527 fix: use __annotations__ instead of get_type_hints() for dynamic kwargs detection — ready — by perone (created: 2026-02-14 01:35 (UTC+8))
- #34508 [Bugfix] Exclude language_model_only key from MM AOT compile hash but include in model one — bug,ready — by ywang96 (created: 2026-02-13 17:17 (UTC+8))
- #34533 [bug] Make sure get_modality_with_max_tokens is deterministic — bug,ready,multi-modality — by 842974287 (created: 2026-02-14 03:18 (UTC+8))
- #34531 [Docs] Add RunPod GPU deployment guide for vLLM — documentation — by lisperz (created: 2026-02-14 02:35 (UTC+8))
- #34529 [Spec Decode] Defer clearing KV connector metadata for EAGLE3 speculative decode + prefill / decode disagg setup — v1 — by zixi-qi (created: 2026-02-14 02:06 (UTC+8))
- #34507 [Bugfix] Fix fused MoE perf regression on small GPUs from int64 strides — bug — by haosdent (created: 2026-02-13 17:12 (UTC+8))
- #34530 Revert "[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides" — bug — by mgoin (created: 2026-02-14 02:33 (UTC+8))
- #34522 [Refactor][KVConnector]: Move KV Cache Events into KVConnectorWorkerMetadata — v1,kv-connector — by hickeyma (created: 2026-02-13 23:28 (UTC+8))
- #34523 [Misc] vLLM's --enforce-eager should turn off compile and cudagraphs only — ready,nvidia — by zou3519 (created: 2026-02-13 23:48 (UTC+8))
- #34488 Add FlashInfer cudnn for vit — v1,qwen,nvidia — by KKSK-DON (created: 2026-02-13 12:25 (UTC+8))
- #34511 [WIP][Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations — bug — by haosdent (created: 2026-02-13 18:15 (UTC+8))
- #34513 [DOC] Specfiy build dependency installation — documentation — by jonoillar (created: 2026-02-13 20:02 (UTC+8))
- #34510 [Renderer] Move InputPreprocessor into Renderer (1/2) — performance,frontend,ready,v1,multi-modality,llama — by DarkLight1337 (created: 2026-02-13 17:34 (UTC+8))
- #34501 [Bugfix] Add quant_config in ViT of Kimi-K2.5 — bug,ready — by LoganJane (created: 2026-02-13 15:33 (UTC+8))
- #34521 [Draft] Support model Qwen3_5/Qwen3_5_moe on NPU platform — documentation,new-model,rocm,speculative-decoding,needs-rebase,ci/build,v1,qwen,cpu — by ppppeng (created: 2026-02-13 23:28 (UTC+8))
- #34520 [EPLB] Cleanup the transfer logic for the various eplb maps — no labels — by SageMoore (created: 2026-02-13 23:17 (UTC+8))
- #34516 [bugfix] Fix critical bug when reporting for all paths where handler.create_error_response is used — bug,frontend — by kizill (created: 2026-02-13 21:16 (UTC+8))
- #34517 [Bugfix] Fix mypy errors on StructuredOutputsParams in OpenAI entrypoints — bug — by junuxyz (created: 2026-02-13 21:18 (UTC+8))
- #34512 [Misc] Port Qwen3.5 Configs — ready,qwen — by ywang96 (created: 2026-02-13 19:11 (UTC+8))
- #34515 [Draft] amd-quark online fp8 and mxfp4 quantization — rocm,needs-rebase — by hangy-amd (created: 2026-02-13 21:15 (UTC+8))
- #34514 [CI][BugFix] ShellCheck cleanup to remove baseline and preserve runtime behavior — bug,documentation,performance,structured-output,tpu,ci/build,v1,cpu,kv-connector,nvidia — by junuxyz (created: 2026-02-13 20:24 (UTC+8))
- #34499 [Bugfix] Fix KeyError in parse_response_input for reasoning items with optional content — bug,frontend,gpt-oss — by jeonsworld (created: 2026-02-13 15:31 (UTC+8))
- #34498 [GDN] Use CPU tensors to build GDN metadata — v1 — by WoosukKwon (created: 2026-02-13 15:30 (UTC+8))
- #34493 [Refactor] [2/N] Reorganize kernel abstraction directory — cpu,nvidia — by BadrBasowid (created: 2026-02-13 13:53 (UTC+8))
- #34490 [Refactor] Call renderer for online IO processor request — frontend,ready — by DarkLight1337 (created: 2026-02-13 12:43 (UTC+8))
- #34485 [Refactor] Pass full VllmConfig to Renderer — ready,v1 — by DarkLight1337 (created: 2026-02-13 11:27 (UTC+8))
- #34495 [Bugfix] Fix IndexError in Qwen3CoderToolParser streaming — bug,qwen — by R3hankhan123 (created: 2026-02-13 14:43 (UTC+8))
- #34489 [Bugfix] Fix mamba state dtype setting for Qwen3-Next and Qwen3.5 — bug,ready,qwen — by ywang96 (created: 2026-02-13 12:30 (UTC+8))
- #34491 [CI/Build] Fix CUDA re-initialization error in distributed model tests — ready,multi-modality,nvidia — by DarkLight1337 (created: 2026-02-13 12:52 (UTC+8))
- #34494 [Bugfix] Handle num_expert_group=None in flashinfer block-scale FP8 MoE — bug,nvidia — by haosdent (created: 2026-02-13 13:56 (UTC+8))
Merged PRs
- #32993 [Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation — ready,nvidia — by wzhao18 (merged: 2026-02-14 00:11 (UTC+8))
- #33705 [Hybrid] Enable spec decoding in mamba cache align mode — ready,v1 — by peakcrosser7 (merged: 2026-02-14 05:02 (UTC+8))
- #34508 [Bugfix] Exclude language_model_only key from MM AOT compile hash but include in model one — bug,ready — by ywang96 (merged: 2026-02-13 21:59 (UTC+8))
- #34454 [Bugfix]: Fix structured output in multi-turn gpt-oss — bug,structured-output,ready,v1,gpt-oss — by bbrowning (merged: 2026-02-14 03:12 (UTC+8))
- #34530 Revert "[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides" — bug — by mgoin (merged: 2026-02-14 02:35 (UTC+8))
- #34047 [ROCm][CI] Fix serving tokens test failures — rocm,ready — by AndreasKaratzas (merged: 2026-02-13 11:27 (UTC+8))
- #34523 [Misc] vLLM's --enforce-eager should turn off compile and cudagraphs only — ready,nvidia — by zou3519 (merged: 2026-02-14 01:52 (UTC+8))
- #34467 [Bugfix] Replace c10::optional with std::optional in topk kernel — bug,ready — by FloatingVertex (merged: 2026-02-14 00:30 (UTC+8))
- #34501 [Bugfix] Add quant_config in ViT of Kimi-K2.5 — bug,ready — by LoganJane (merged: 2026-02-14 00:05 (UTC+8))
- #34512 [Misc] Port Qwen3.5 Configs — ready,qwen — by ywang96 (merged: 2026-02-13 21:24 (UTC+8))
- #34440 [BugFix] Fix and optimize max_num_blocks_per_req calculation for MambaSpec — bug,ready,v1 — by peakcrosser7 (merged: 2026-02-13 16:13 (UTC+8))
- #34170 Extend ColBERT support to non-standard BERT backbones — documentation,new-model,ready — by ieBoytsov (merged: 2026-02-13 17:53 (UTC+8))
- #34498 [GDN] Use CPU tensors to build GDN metadata — v1 — by WoosukKwon (merged: 2026-02-13 17:24 (UTC+8))
- #33368 [Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement — ready,v1 — by yewentao256 (merged: 2026-02-13 16:38 (UTC+8))
- #34147 [KVConnector] Clean up redundant code in KV connectors — ready,kv-connector — by hickeyma (merged: 2026-02-13 16:14 (UTC+8))
- #34125 [Core] Move pause and resume functions into engine — documentation,ready,v1 — by hao-aaron (merged: 2026-02-13 16:15 (UTC+8))
- #34130 [Perf] fused_moe: add int4_w4a16 benchmark support and tuning config — performance,ready — by mgehre-amd (merged: 2026-02-13 16:14 (UTC+8))
- #34245 [Bugfix] fix the import path in moe test utils.py — bug,ready — by michalowski-arm (merged: 2026-02-13 16:13 (UTC+8))
- #34418 [Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode — bug,ready,v1 — by haosdent (merged: 2026-02-13 16:13 (UTC+8))
- #34426 [New Model] support new model ovis2.6 — documentation,new-model,ready,multi-modality — by myselvess (merged: 2026-02-13 16:12 (UTC+8))
- #34490 [Refactor] Call renderer for online IO processor request — frontend,ready — by DarkLight1337 (merged: 2026-02-13 14:48 (UTC+8))
- #34485 [Refactor] Pass full VllmConfig to Renderer — ready,v1 — by DarkLight1337 (merged: 2026-02-13 14:48 (UTC+8))
- #34489 [Bugfix] Fix mamba state dtype setting for Qwen3-Next and Qwen3.5 — bug,ready,qwen — by ywang96 (merged: 2026-02-13 14:48 (UTC+8))
- #34491 [CI/Build] Fix CUDA re-initialization error in distributed model tests — ready,multi-modality,nvidia — by DarkLight1337 (merged: 2026-02-13 14:43 (UTC+8))
- #34483 [Bugfix] Fix encoder cache underestimation for GLM-4V/GLM-OCR single image — bug,ready — by haosdent (merged: 2026-02-13 13:04 (UTC+8))
- #34358 [Bugfix] Standardize getting number of image patches/tokens — bug,ready,multi-modality,qwen — by DarkLight1337 (merged: 2026-02-13 12:47 (UTC+8))
PRs Closed Without Merging
- #28140 [DO NOT REVIEW/LAND] Repro for 23970 — no labels — by gmagogsfm (closed: 2026-02-14 08:33 (UTC+8))
- #33926 [BugFix] Fix small race condition when pausing generation for RL — bug,ready,v1 — by njhill (closed: 2026-02-14 04:40 (UTC+8))
- #30972 feat(metrics): Add Prometheus exemplars support for request-level met… — documentation,frontend,needs-rebase,v1 — by TheCodeWrangler (closed: 2026-02-14 04:07 (UTC+8))
- #33827 Enable torch.compile for OpenGVLab/InternVL3-2B — documentation — by tianrengao (closed: 2026-02-14 03:37 (UTC+8))
- #32991 [BugFix] Fix CPU Weight Offloading with UVA — bug,nvidia — by wzhao18 (closed: 2026-02-14 00:34 (UTC+8))
- #34161 add extras dict to FinishedRequestStats to enable stat logger plugins… — v1 — by crawdaddie (closed: 2026-02-13 23:54 (UTC+8))
- #34414 [Bugfix] Add quant_config in ViT of Kimi-K2.5 — bug — by LoganJane (closed: 2026-02-13 15:26 (UTC+8))