vLLM Development Activity Report - 2026-02-13

Time window: 2026-02-13 11:23 (UTC+8) ~ 2026-02-14 11:23 (UTC+8)
Statistics: New Issues 19 | Closed Issues 15 | New PRs 43 | Merged PRs 26 | PRs Closed Without Merging 7

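📊 Daily Development Summary

From February 13 to 14, 2026, the vLLM project remained highly active, with 43 new PRs and 19 new Issues. Development centered on multimodal model support, performance optimization and bug fixes, and continued work on the AMD ROCm platform. The community is moving several key features forward, including new reasoning output capabilities and a more efficient CPU weight offloading scheme, while troubleshooting and tuning recent major changes such as the MoE refactor and AOT compilation.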

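🎯 AMD/ROCm Ecosystem Activity

AMD/ROCm work was active this period, in both new development and bug fixes, reflecting the community's continued investment in the platform.

Bug fixes and workarounds:
- Issue #34500 ([Bug]: AMD 7900 running GLM-4.7-Flash crash): A user reported a crash when running the GLM-4.7-Flash model on an AMD Radeon PRO W7900D. AMD employee @vllmellm responded quickly, traced the problem to the Triton backend, and suggested setting the environment variable export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE to enable Triton for AMD. The user confirmed that this worked and closed the Issue. This suggests that some AMD GPU models need the backend configured explicitly to run correctly.
- PR #34543 ([Bugfix] Fix ROCm UVA CPU weight offloading broken by #32993): A key fix. The earlier PR #32993 added UVA support for CPU weight offloading, but its Python-level platform check covered only CUDA and XPU, so ROCm raised a ValueError. This PR adds current_platform.is_rocm() to the check so that ROCm takes the same UVA code path, which fixes the resulting CI test failures.
- PR #34538 ([ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests): Fixes test collection failures in ROCm CI caused by an unconditional import of FlashInferMLASparseBackend, which depends on flashinfer. The test list is now built dynamically, making the test suite more robust on ROCm.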
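The import guard described for PR #34538 can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `importable` and `collect_backends` are hypothetical helpers, and the module names in the usage lines are placeholders chosen only so the example runs anywhere.

```python
import importlib.util


def importable(module_name: str) -> bool:
    """Return True if the module can be found in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError).
        return False


def collect_backends(candidates: dict[str, str]) -> list[str]:
    """Keep only the backends whose required module is available.

    `candidates` maps a backend label to the module it depends on.
    """
    return [name for name, module in candidates.items() if importable(module)]


# Hypothetical usage: build the test parametrization list at collection time
# instead of importing every backend unconditionally at module import.
BACKENDS = collect_backends({
    "always_available": "json",
    "optional_backend": "some_optional_dependency_that_is_missing",
})
print(BACKENDS)
```

With this shape, a platform that lacks an optional dependency simply ends up with a shorter test list instead of failing during collection.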
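New features and optimizations:
- PR #34541 ([ROCM] Optimize ROCM_AITER_FA spec decode eagle performance): Aims to improve the performance of the ROCm AITER FA backend in speculative decoding (EAGLE) scenarios. It changes the cuda_graph support and removes a CPU-GPU synchronization so that cudagraph can also be used for the draft model. The PR depends on a separate accuracy fix (PR #32877) and currently has merge conflicts to resolve.
- PR #34515 ([Draft] amd-quark online fp8 and mxfp4 quantization): A draft PR from AMD employee @hangy-amd that integrates the online FP8 and MXFP4 quantization of the AMD Quark tool into vLLM. It is an important step in integrating AMD's quantization toolchain.
Other related activity:
- The closed Issues #33678 ([Bug]: [ROCm] Kimi-K2.5 produces incorrect results on AMD MI308X) and #28494 ([Bug][AMD gfx1100] Enabling include_usage with stream=True results in freeze after 3 iterations) show that the community continues to track and resolve long-standing model accuracy and API stability problems on AMD platforms.
- Contributor identification: AMD employees active this period (usernames with the -amd suffix) include @hangy-amd and @vivcheng-amd (the latter on a closed Issue), showing the AMD team's direct contributions to vLLM.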

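💬 High-Traffic Discussions

Issue #32791 ([Bug]: chat.completions returns content: null for GPT-OSS multi-turn with json_object):
- Core topic: When GPT-OSS models are used in multi-turn conversations together with structured output such as json_object, the server returns content: null.
- Views and discussion: Several users confirmed the problem and found that it triggers only when structured output is used and the conversation history contains an assistant message. Proposed workarounds included downgrading vLLM, changing the role of history messages, and prepending a random string to the prompt. The community eventually traced the root cause to logic in gptoss_reasoning_parser that, in multi-turn scenarios, wrongly matched the end marker in a history message, so the grammar mask was applied too early.
- Outcome: PR #34454 fixed the parser logic so that it looks for the end marker only in the message currently being generated, which resolved the problem. The PR has been merged.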
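The failure mode described for Issue #32791 can be illustrated with plain token lists. This is a simplified sketch under stated assumptions, not the code of gptoss_reasoning_parser or PR #34454: the function names and the marker token IDs are hypothetical, and it assumes the prompt length is known so that generated tokens can be separated from history.

```python
def find_subsequence(haystack: list[int], needle: list[int], start: int = 0) -> int:
    """Return the first index >= start where `needle` occurs, or -1."""
    last = len(haystack) - len(needle)
    for i in range(start, last + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1


def reasoning_ended_buggy(token_ids: list[int], end_marker: list[int]) -> bool:
    # Scans prompt + generated tokens, so an end marker left in the
    # conversation history by an earlier assistant turn matches immediately.
    return find_subsequence(token_ids, end_marker) != -1


def reasoning_ended_fixed(token_ids: list[int], end_marker: list[int],
                          prompt_len: int) -> bool:
    # Only tokens generated in the current turn (index >= prompt_len) count.
    return find_subsequence(token_ids, end_marker, start=prompt_len) != -1


END = [900, 901]                # hypothetical end-of-reasoning marker
history = [1, 2, 900, 901, 3]   # a prior assistant turn already contains END
generated = [10, 11]            # the current turn is still reasoning
tokens = history + generated

print(reasoning_ended_buggy(tokens, END))                # True
print(reasoning_ended_fixed(tokens, END, len(history)))  # False
```

In the buggy variant the grammar constraint would be switched on before the model has finished reasoning in the current turn, which is consistent with the empty content reported in the Issue.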
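Issue #29403 ([Bug]: fastsafetensors in tensor parallel requires too much VRAM):
- Core topic: Loading a model with the fastsafetensors plugin in tensor-parallel mode causes a surge in VRAM usage or even a load failure.
- Views and discussion: The reporter suspected that fastsafetensors loads all weights onto GPU 0 instead of distributing them correctly. Contributor @bbrowning confirmed this by debugging and noted that it is probably a bug. The discussion covered the effect on system and GPU memory and a potential fix.
- Current status: The Issue was closed during the reporting window, but the provided data does not state why (possibly fixed by another PR, or deliberately set aside for now). It reflects the community's concern about the stability of third-party dependency integrations.
Issue #33804 ([RFC]: [compile] Rollout strategy for AOT Compilation.):
- Core topic: How to roll out AOT compilation with PyTorch 2.10 safely, without causing problems for users.
- Views and positions:
  - Cautious rollout: Because testing uncovered blocking AOT-related issues, a conservative strategy was recommended.
  - Concrete proposal: Keep AOT off by default in the next vLLM release, and enable it in the release after that once it has been thoroughly tested on the main branch.
- Outcome: Core maintainers agreed to ship the upcoming vLLM 0.16 with PyTorch 2.10 but with AOT compilation off by default, and to turn AOT on in the main branch immediately after the release, leaving ample time to surface and fix problems before the next version (0.17). The RFC has been closed.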

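🔥 Hot Topics and Trends

Deeper multimodal model support: Issues around models such as GLM-OCR and Qwen-VL (#34040, #34502, #34506) show that users are applying vLLM to complex vision-language tasks such as image-text retrieval, OCR, and visual reasoning. The community is working through the resulting challenges, such as inaccurate encoder cache estimation and requests to extend multimodal APIs (for example, #34518 asks for decoder prefix support in the Whisper API).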
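Reasoning output and tool calling support: Feature request #34536 asks for reasoning output in offline inference, while Issue #34496 (with its fix, PR #34499) and PR #34495 address robustness problems in the reasoning and tool-calling APIs. This indicates that vLLM's advanced features (structured output, reasoning, tool calling) are seeing wider use, so their stability and usability are becoming a focus.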
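Performance optimization and core bug fixes:
- MoE performance regression: PR #34279 changed strides to int64 to fix an IMA (illegal memory access) error, but caused a severe slowdown (~60x) on small GPUs such as GB10. This prompted a quick revert (PR #34530) and a targeted fix (PR #34507), underscoring how delicate performance tuning can be.
- Cache and loading optimization: PR #34527 fixes an ineffective LRU cache in multimodal processor loading, significantly speeding up preprocessing; PR #34498 optimizes GDN metadata construction to avoid a CPU-GPU synchronization.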
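The tension behind the stride change can be seen with a little arithmetic. The sketch below only illustrates when 32-bit offset arithmetic would overflow. The helper names and example shapes are hypothetical, it assumes non-empty dimensions and non-negative strides measured in elements, and it is not taken from PR #34279 or PR #34507.

```python
INT32_MAX = 2**31 - 1


def max_element_offset(shape: tuple[int, ...], strides: tuple[int, ...]) -> int:
    """Largest element offset reachable in a strided tensor.

    Assumes every dimension is >= 1 and every stride is >= 0.
    """
    return sum((dim - 1) * stride for dim, stride in zip(shape, strides))


def needs_int64_offsets(shape: tuple[int, ...], strides: tuple[int, ...]) -> bool:
    """True if some offset does not fit in a signed 32-bit integer."""
    return max_element_offset(shape, strides) > INT32_MAX


# Hypothetical contiguous buffers of shape (rows, hidden).
small = ((4096, 4096), (4096, 1))
large = ((1_048_576, 4096), (4096, 1))

print(needs_int64_offsets(*small))  # False
print(needs_int64_offsets(*large))  # True
```

Using 64-bit offsets everywhere avoids the overflow but makes every index computation wider, which is one plausible source of the slowdown. One conceivable mitigation is to keep 32-bit arithmetic whenever a check like this is false; the data here does not say whether PR #34507 takes that approach.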
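Platform compatibility and deployment: Beyond AMD ROCm, discussions of a macOS multiprocessing problem (#34534), a Python 3.13 compatibility problem (#34504), and CPU offloading memory usage (PR #32993) reflect the community's continued effort to support cross-platform and diverse deployment scenarios.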

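🛠️ Key Technical Changes

PR #34454 ([Bugfix]: Fix structured output in multi-turn gpt-oss): Fixes a critical defect when GPT-OSS models use structured output in multi-turn conversations. By correcting how gptoss_reasoning_parser recognizes the end-of-reasoning marker, it ensures that grammar constraints are applied only at the right moment, so the model generates valid content again. Impact: significantly improves the usability and reliability of GPT-OSS models in multi-turn production scenarios.
PR #32993 ([Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation) (merged): Addresses the doubled memory allocation that PyTorch pinned memory can cause by introducing a UVA-based alternative for CPU offloading, controlled through an environment variable. Impact: makes the system memory required for CPU offloading of large models more predictable and efficient, which particularly helps resource-constrained environments.
PR #33705 ([Hybrid] Enable spec decoding in mamba cache align mode) (merged): Re-enables speculative decoding for Mamba models in cache align mode, which had been disabled because of related problems. Impact: unlocks speculative decoding, an important acceleration feature, for hybrid (Mamba + Attention) models such as Jamba.
PR #34523 ([Misc] vLLM’s --enforce-eager should turn off compile and cudagraphs only) (merged): Changes the --enforce-eager debugging flag from setting optimization level -O0 to turning off only torch.compile and cudagraphs, leaving other optimizations in place. Impact: makes debugging behavior more precise and avoids extra performance skew from disabling other optimizations (such as flashinfer autotuning), which makes problems easier to isolate.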
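The behavioral difference described for --enforce-eager can be sketched with a toy settings object. Everything here is hypothetical: `OptSettings` and its fields are stand-ins invented for illustration and do not mirror vLLM's actual configuration classes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OptSettings:
    """Hypothetical stand-in for an optimization config."""
    use_compile: bool = True
    use_cudagraphs: bool = True
    kernel_autotune: bool = True  # stands in for "other optimizations"


def enforce_eager_old(cfg: OptSettings) -> OptSettings:
    # Old behavior as described: equivalent to -O0, everything switched off.
    return OptSettings(use_compile=False, use_cudagraphs=False,
                       kernel_autotune=False)


def enforce_eager_new(cfg: OptSettings) -> OptSettings:
    # New behavior as described: only compile and cudagraphs are disabled.
    return replace(cfg, use_compile=False, use_cudagraphs=False)


default = OptSettings()
print(enforce_eager_old(default))
print(enforce_eager_new(default))
```

Under the new semantics, an eager run differs from a normal run only in the two features being debugged, so a performance or correctness difference is easier to attribute.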

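📈 Development Activity Observations

Efficient issue closure: 15 Issues were closed within 24 hours, including some long-standing ones (such as #32791 and #29403), showing that maintainers are actively working through the backlog.
Multi-party collaboration: On complex problems such as the AMD-related issues and the MoE performance regression, AMD employees, Red Hat, NVIDIA, and other community members interacted and analyzed the problems together to drive them to resolution.
Code review and merge cadence: 26 PRs were merged during the day, including key bug fixes and feature optimizations, indicating that the core maintainers keep review and merge throughput high so that important fixes reach the main branch promptly.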

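💡 Issues Worth Watching

Multimodal model cache estimation: Issue #34040 and its fix, PR #34483, expose a general problem: the method for estimating a multimodal encoder's cache requirement (especially for a single image) can be flawed and lead to runtime errors. Other multimodal models should be checked for similar risks.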
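Python 3.13 compatibility: Issue #34504 (filed as "[Bug]: GLM-5-FP8 Crash - Engine core initialization failed") points out that vLLM crashes on Python 3.13 because the symbol _PyThreadState_UncheckedGet is undefined. This environment compatibility problem affects users of newer Python versions and should be fixed soon.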
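Realtime API robustness: Issue #34532 reports that the server crashes when a Realtime API client disconnects abnormally. This affects the API's stability in production, and the proposal to replace the assertion with a runtime error check is worth adopting.
Complexity of disaggregated inference: Issue #34526 reports an accuracy problem when combining multiple connectors such as Nixl + CPU offloading. It shows how hard debugging and guaranteeing correctness become in aggressive disaggregated or heterogeneous inference architectures; these features need more thorough testing and validation.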
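The suggestion for the Realtime API item above (Issue #34532), replacing an assertion with a runtime error check, can be sketched as follows. This is a generic illustration with hypothetical names (`Session`, `ConnectionClosedError`, `serve_one`); it does not reproduce vLLM's Realtime API code.

```python
class ConnectionClosedError(RuntimeError):
    """Raised when a message arrives for a session that is no longer open."""


class Session:
    def __init__(self) -> None:
        self.open = True

    def handle_with_assert(self, message: str) -> str:
        # An AssertionError is not something callers normally handle, and
        # the check disappears entirely when Python runs with -O.
        assert self.open, "session closed"
        return f"ok:{message}"

    def handle_with_validation(self, message: str) -> str:
        # Explicit check: the failure is an ordinary, catchable runtime error.
        if not self.open:
            raise ConnectionClosedError("session closed")
        return f"ok:{message}"


def serve_one(session: Session, message: str) -> str:
    """Handle one message; a closed session ends only this connection."""
    try:
        return session.handle_with_validation(message)
    except ConnectionClosedError:
        return "closed"


session = Session()
print(serve_one(session, "hi"))  # ok:hi
session.open = False
print(serve_one(session, "hi"))  # closed
```

The point of the design is that an unexpected client disconnect becomes a condition the server handles per connection rather than an invariant violation that can take the process down.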

📋 Appendix: Detailed Data Lists

New Issues

#34525 [CI Failure] LoRA TP (Distributed): lora/test_olmoe_tp.py::test_olmoe_lora | ci-failure | by LucasWilkinson (created: 2026-02-14 00:29 (UTC+8))
#34502 [Usage]: qwen3-vl-reranker-2b deploy issue | usage | by newbecoder (created: 2026-02-13 15:37 (UTC+8))
#34504 [Bug]: GLM-5-FP8 Crash - Engine core initialization failed | bug | by jxdn (created: 2026-02-13 16:11 (UTC+8))
#34545 [Installation]: unrecognized arguments: --omni | installation | by HenryBao91 (created: 2026-02-14 09:17 (UTC+8))
#34524 [Bug]: Error saving sharded state for GPT-OSS-120B - safetensors KeyError for torch.float8_e8m0fnu | bug | by dhayanesh (created: 2026-02-14 00:00 (UTC+8))
#34536 [Feature]: Reasoning output for offline inference | feature request | by BartekKruczek (created: 2026-02-14 04:04 (UTC+8))
#34532 [Bug]: Realtime API crashes when client terminates connection “incorrectly” | bug | by nullquery (created: 2026-02-14 02:52 (UTC+8))
#34534 [Bug]: EngineCore exits immediately after startup when vLLM CPU is launched from multiprocessing.Process on macOS | no labels | by diegocastanibm (created: 2026-02-14 03:43 (UTC+8))
#34526 [Bug]: accuracy issue when using multiconnector(Nixl+cpu offloading) | bug | by hsubramony (created: 2026-02-14 00:42 (UTC+8))
#34519 [Feature]: Quality of life - expose model name / custom label in GPU process name | feature request | by Rictus (created: 2026-02-13 22:03 (UTC+8))
#34518 [Feature]: [Whisper] Support for decoder prefix and custom task tokens in transcription API | feature request | by LouisChirol (created: 2026-02-13 22:03 (UTC+8))
#34509 [New Model]: inclusionAI/Ring-2.5-1T | new-model | by youkaichao (created: 2026-02-13 17:31 (UTC+8))
#34500 [Bug]: AMD 7900 running GLM-4.7-Flash crash | bug,rocm | by warlockedward (created: 2026-02-13 15:32 (UTC+8))
#34506 [Bug]: Qwen 2.5 Omni Output text seems only load first part of mm input | bug | by tzhouam (created: 2026-02-13 16:39 (UTC+8))
#34505 [Bug]: Qwen 2.5 Omni cuda graph has error | bug | by tzhouam (created: 2026-02-13 16:36 (UTC+8))
#34503 [Feature]: vllm serve logs do not show the request prompt | feature request | by jiezhangGt (created: 2026-02-13 16:00 (UTC+8))
#34486 [Feature]: Migrate MoE envs(e.g. VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8) to model flags | feature request | by elvischenv (created: 2026-02-13 11:44 (UTC+8))
#34497 [Bug]: return logprobs with speculative decoding | bug | by xiaoxiaosuaxuan (created: 2026-02-13 15:15 (UTC+8))
#34496 [Bug]: Responses API crashes with KeyError when reasoning input item has no content field | bug | by jeonsworld (created: 2026-02-13 15:12 (UTC+8))

Closed Issues

#34524 [Bug]: Error saving sharded state for GPT-OSS-120B - safetensors KeyError for torch.float8_e8m0fnu | bug | by dhayanesh (closed: 2026-02-14 05:21 (UTC+8))
#32791 [Bug]: chat.completions returns content: null for GPT-OSS multi-turn with json_object | bug | by supersteves (closed: 2026-02-14 03:12 (UTC+8))
#29403 [Bug]: fastsafetensors in tensor parallel requires too much VRAM | bug | by Jannik2099 (closed: 2026-02-13 23:56 (UTC+8))
#34519 [Feature]: Quality of life - expose model name / custom label in GPU process name | feature request | by Rictus (closed: 2026-02-13 22:33 (UTC+8))
#28732 [Bug]: --pipeline-parallelism>2 results in RayChannelTimeoutError | bug,stale | by wahabk (closed: 2026-02-13 21:41 (UTC+8))
#33804 [RFC]: [compile] Rollout strategy for AOT Compilation. | RFC,torch.compile | by zhxchen17 (closed: 2026-02-13 21:30 (UTC+8))
#34500 [Bug]: AMD 7900 running GLM-4.7-Flash crash | bug,rocm | by warlockedward (closed: 2026-02-13 20:55 (UTC+8))
#20963 [New Model]: Kimi-K2-Instruct | new-model,unstale | by Yaomt (closed: 2026-02-13 17:25 (UTC+8))
#28494 [Bug][AMD gfx1100] Enabling include_usage with stream=True results in freeze after 3 iterations | bug,rocm | by vivcheng-amd (closed: 2026-02-13 17:13 (UTC+8))
#33678 [Bug]: [ROCm] Kimi-K2.5 produces incorrect results on AMD MI308X | bug,rocm | by hiahiawei (closed: 2026-02-13 17:05 (UTC+8))
#27706 [Bug]: OOM for Qwen2.5-VL-3B and Qwen3-VL-2B when multimodal activated on RX 7800 XT | bug,rocm | by tobing (closed: 2026-02-13 16:49 (UTC+8))
#34463 [CI Failure]: ARM CPU Test (test_moe.py) | ci-failure,cpu | by rajkiranjoshi (closed: 2026-02-13 16:13 (UTC+8))
#34361 [Bug]: Qwen Coder Next prefix caching | bug | by Nepherpitou (closed: 2026-02-13 16:13 (UTC+8))
#34503 [Feature]: vllm serve logs do not show the request prompt | feature request | by jiezhangGt (closed: 2026-02-13 16:09 (UTC+8))
#34040 [Bug]: Encoder cache underestimation for GLM-4V/GLM-OCR | documentation | by willingWill17 (closed: 2026-02-13 13:04 (UTC+8))

New PRs

#34541 [ROCM] Optimize ROCM_AITER_FA spec decode eagle performance | rocm,needs-rebase,v1 | by jennyyyyzhen (created: 2026-02-14 05:53 (UTC+8))
#34546 Add colqwen3 support to vLLM | documentation,new-model,multi-modality,qwen | by athrael-soju (created: 2026-02-14 10:08 (UTC+8))
#34538 [ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests | rocm,v1 | by AndreasKaratzas (created: 2026-02-14 05:19 (UTC+8))
#34543 [Bugfix] Fix ROCm UVA CPU weight offloading broken by #32993 | bug,rocm,ready | by AndreasKaratzas (created: 2026-02-14 07:01 (UTC+8))
#34487 [Chore] Improve Harmony file_search tool and showcase demo | documentation,frontend,gpt-oss | by franciscojavierarceo (created: 2026-02-13 12:21 (UTC+8))
#34492 [Models] Fuse Qwen3.5 GDN’s qkvz_proj and ba_proj | needs-rebase,qwen | by Isotr0py (created: 2026-02-13 13:26 (UTC+8))
#34528 [Core] Cleanup engine pause/sleep logic | frontend,ready,v1 | by njhill (created: 2026-02-14 01:49 (UTC+8))
#34537 [Kernels] Fix Helion GPU utils to use platform-agnostic device name API | no labels | by AndreasKaratzas (created: 2026-02-14 04:54 (UTC+8))
#34540 [Kernel] [Helion] [8/N] Remove fake_impl usage and inference | no labels | by gmagogsfm (created: 2026-02-14 05:38 (UTC+8))
#34544 [RL] Pause and Resume for DPEP | v1 | by hao-aaron (created: 2026-02-14 07:34 (UTC+8))
#34542 Mxfp4 refactor cutlass experts | nvidia | by zyongye (created: 2026-02-14 06:00 (UTC+8))
#34539 Generative Scoring | documentation,frontend,needs-rebase,v1 | by vedantjh2 (created: 2026-02-14 05:31 (UTC+8))
#34535 [Feature][Perf] Support Selective CPU Weight Offloading | ready,v1,nvidia | by wzhao18 (created: 2026-02-14 03:44 (UTC+8))
#34527 fix: use __annotations__ instead of get_type_hints() for dynamic kwargs detection | ready | by perone (created: 2026-02-14 01:35 (UTC+8))
#34508 [Bugfix] Exclude language_model_only key from MM AOT compile hash but include in model one | bug,ready | by ywang96 (created: 2026-02-13 17:17 (UTC+8))
#34533 [bug] Make sure get_modality_with_max_tokens is deterministic | bug,ready,multi-modality | by 842974287 (created: 2026-02-14 03:18 (UTC+8))
#34531 [Docs] Add RunPod GPU deployment guide for vLLM | documentation | by lisperz (created: 2026-02-14 02:35 (UTC+8))
#34529 [Spec Decode] Defer clearing KV connector metadata for EAGLE3 speculative decode + prefill / decode disagg setup | v1 | by zixi-qi (created: 2026-02-14 02:06 (UTC+8))
#34507 [Bugfix] Fix fused MoE perf regression on small GPUs from int64 strides | bug | by haosdent (created: 2026-02-13 17:12 (UTC+8))
#34530 Revert “[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides” | bug | by mgoin (created: 2026-02-14 02:33 (UTC+8))
#34522 [Refactor][KVConnector]: Move KV Cache Events into KVConnectorWorkerMetadata | v1,kv-connector | by hickeyma (created: 2026-02-13 23:28 (UTC+8))
#34523 [Misc] vLLM’s --enforce-eager should turn off compile and cudagraphs only | ready,nvidia | by zou3519 (created: 2026-02-13 23:48 (UTC+8))
#34488 Add FlashInfer cudnn for vit | v1,qwen,nvidia | by KKSK-DON (created: 2026-02-13 12:25 (UTC+8))
#34511 [WIP][Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations | bug | by haosdent (created: 2026-02-13 18:15 (UTC+8))
#34513 [DOC] Specfiy build dependency installation | documentation | by jonoillar (created: 2026-02-13 20:02 (UTC+8))
#34510 [Renderer] Move InputPreprocessor into Renderer (1/2) | performance,frontend,ready,v1,multi-modality,llama | by DarkLight1337 (created: 2026-02-13 17:34 (UTC+8))
#34501 [Bugfix] Add quant_config in ViT of Kimi-K2.5 | bug,ready | by LoganJane (created: 2026-02-13 15:33 (UTC+8))
#34521 [Draft] Support model Qwen3_5/Qwen3_5_moe on NPUplatform | documentation,new-model,rocm,speculative-decoding,needs-rebase,ci/build,v1,qwen,cpu | by ppppeng (created: 2026-02-13 23:28 (UTC+8))
#34520 [EPLB] Cleanup the transfer logic for the various eplb maps | no labels | by SageMoore (created: 2026-02-13 23:17 (UTC+8))
#34516 [bugfix] Fix critical bug when reporting for all paths where handler.create_error_response is used | bug,frontend | by kizill (created: 2026-02-13 21:16 (UTC+8))
#34517 [Bugfix] Fix mypy errors on StructuredOutputsParams in OpenAI entrypoints | bug | by junuxyz (created: 2026-02-13 21:18 (UTC+8))
#34512 [Misc] Port Qwen3.5 Configs | ready,qwen | by ywang96 (created: 2026-02-13 19:11 (UTC+8))
#34515 [Draft] amd-quark online fp8 and mxfp4 quantization | rocm,needs-rebase | by hangy-amd (created: 2026-02-13 21:15 (UTC+8))
#34514 [CI][BugFix] ShellCheck cleanup to remove baseline and preserve runtime behavior | bug,documentation,performance,structured-output,tpu,ci/build,v1,cpu,kv-connector,nvidia | by junuxyz (created: 2026-02-13 20:24 (UTC+8))
#34499 [Bugfix] Fix KeyError in parse_response_input for reasoning items with optional content | bug,frontend,gpt-oss | by jeonsworld (created: 2026-02-13 15:31 (UTC+8))
#34498 [GDN] Use CPU tensors to build GDN metadata | v1 | by WoosukKwon (created: 2026-02-13 15:30 (UTC+8))
#34493 [Refactor] [2/N] Reorganize kernel abstraction directory | cpu,nvidia | by BadrBasowid (created: 2026-02-13 13:53 (UTC+8))
#34490 [Refactor] Call renderer for online IO processor request | frontend,ready | by DarkLight1337 (created: 2026-02-13 12:43 (UTC+8))
#34485 [Refactor] Pass full VllmConfig to Renderer | ready,v1 | by DarkLight1337 (created: 2026-02-13 11:27 (UTC+8))
#34495 [Bugfix] Fix IndexError in Qwen3CoderToolParser streaming | bug,qwen | by R3hankhan123 (created: 2026-02-13 14:43 (UTC+8))
#34489 [Bugfix] Fix mamba state dtype setting for Qwen3-Next and Qwen3.5 | bug,ready,qwen | by ywang96 (created: 2026-02-13 12:30 (UTC+8))
#34491 [CI/Build] Fix CUDA re-initialization error in distributed model tests | ready,multi-modality,nvidia | by DarkLight1337 (created: 2026-02-13 12:52 (UTC+8))
#34494 [Bugfix] Handle num_expert_group=None in flashinfer block-scale FP8 MoE | bug,nvidia | by haosdent (created: 2026-02-13 13:56 (UTC+8))

Merged PRs

#32993 [Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation | ready,nvidia | by wzhao18 (merged: 2026-02-14 00:11 (UTC+8))
#33705 [Hybrid] Enable spec decoding in mamba cache align mode | ready,v1 | by peakcrosser7 (merged: 2026-02-14 05:02 (UTC+8))
#34508 [Bugfix] Exclude language_model_only key from MM AOT compile hash but include in model one | bug,ready | by ywang96 (merged: 2026-02-13 21:59 (UTC+8))
#34454 [Bugfix]: Fix structured output in multi-turn gpt-oss | bug,structured-output,ready,v1,gpt-oss | by bbrowning (merged: 2026-02-14 03:12 (UTC+8))
#34530 Revert “[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides” | bug | by mgoin (merged: 2026-02-14 02:35 (UTC+8))
#34047 [ROCm][CI] Fix serving tokens test failures | rocm,ready | by AndreasKaratzas (merged: 2026-02-13 11:27 (UTC+8))
#34523 [Misc] vLLM’s --enforce-eager should turn off compile and cudagraphs only | ready,nvidia | by zou3519 (merged: 2026-02-14 01:52 (UTC+8))
#34467 [Bugfix] Replace c10::optional with std::optional in topk kernel | bug,ready | by FloatingVertex (merged: 2026-02-14 00:30 (UTC+8))
#34501 [Bugfix] Add quant_config in ViT of Kimi-K2.5 | bug,ready | by LoganJane (merged: 2026-02-14 00:05 (UTC+8))
#34512 [Misc] Port Qwen3.5 Configs | ready,qwen | by ywang96 (merged: 2026-02-13 21:24 (UTC+8))
#34440 [BugFix] Fix and optimize max_num_blocks_per_req calculation for MambaSpec | bug,ready,v1 | by peakcrosser7 (merged: 2026-02-13 16:13 (UTC+8))
#34170 Extend ColBERT support to non-standard BERT backbones | documentation,new-model,ready | by ieBoytsov (merged: 2026-02-13 17:53 (UTC+8))
#34498 [GDN] Use CPU tensors to build GDN metadata | v1 | by WoosukKwon (merged: 2026-02-13 17:24 (UTC+8))
#33368 [Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement | ready,v1 | by yewentao256 (merged: 2026-02-13 16:38 (UTC+8))
#34147 [KVConnector] Clean up redundant code in KV connectors | ready,kv-connector | by hickeyma (merged: 2026-02-13 16:14 (UTC+8))
#34125 [Core] Move pause and resume functions into engine | documentation,ready,v1 | by hao-aaron (merged: 2026-02-13 16:15 (UTC+8))
#34130 [Perf] fused_moe: add int4_w4a16 benchmark support and tuning config | performance,ready | by mgehre-amd (merged: 2026-02-13 16:14 (UTC+8))
#34245 [Bugfix] fix the import path in moe test utils.py | bug,ready | by michalowski-arm (merged: 2026-02-13 16:13 (UTC+8))
#34418 [Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode | bug,ready,v1 | by haosdent (merged: 2026-02-13 16:13 (UTC+8))
#34426 [New Model] support new model ovis2.6 | documentation,new-model,ready,multi-modality | by myselvess (merged: 2026-02-13 16:12 (UTC+8))
#34490 [Refactor] Call renderer for online IO processor request | frontend,ready | by DarkLight1337 (merged: 2026-02-13 14:48 (UTC+8))
#34485 [Refactor] Pass full VllmConfig to Renderer | ready,v1 | by DarkLight1337 (merged: 2026-02-13 14:48 (UTC+8))
#34489 [Bugfix] Fix mamba state dtype setting for Qwen3-Next and Qwen3.5 | bug,ready,qwen | by ywang96 (merged: 2026-02-13 14:48 (UTC+8))
#34491 [CI/Build] Fix CUDA re-initialization error in distributed model tests | ready,multi-modality,nvidia | by DarkLight1337 (merged: 2026-02-13 14:43 (UTC+8))
#34483 [Bugfix] Fix encoder cache underestimation for GLM-4V/GLM-OCR single image | bug,ready | by haosdent (merged: 2026-02-13 13:04 (UTC+8))
#34358 [Bugfix] Standardize getting number of image patches/tokens | bug,ready,multi-modality,qwen | by DarkLight1337 (merged: 2026-02-13 12:47 (UTC+8))

PRs Closed Without Merging

#28140 [DO NOT REVIEW/LAND] Repro for 23970 | no labels | by gmagogsfm (closed: 2026-02-14 08:33 (UTC+8))
#33926 [BugFix] Fix small race condition when pausing generation for RL | bug,ready,v1 | by njhill (closed: 2026-02-14 04:40 (UTC+8))
#30972 feat(metrics): Add Prometheus exemplars support for request-level met… | documentation,frontend,needs-rebase,v1 | by TheCodeWrangler (closed: 2026-02-14 04:07 (UTC+8))
#33827 Enable torch.compile for OpenGVLab/InternVL3-2B | documentation | by tianrengao (closed: 2026-02-14 03:37 (UTC+8))
#32991 [BugFix] Fix CPU Weight Offloading with UVA | bug,nvidia | by wzhao18 (closed: 2026-02-14 00:34 (UTC+8))
#34161 add extras dict to FinishedRequestStats to enable stat logger plugins… | v1 | by crawdaddie (closed: 2026-02-13 23:54 (UTC+8))
#34414 [Bugfix] Add quant_config in ViT of Kimi-K2.5 | bug | by LoganJane (closed: 2026-02-13 15:26 (UTC+8))