vLLM Development Activity Report - 2026-01-18
Time window: 2026-01-18 11:02 (UTC+8) ~ 2026-01-19 11:02 (UTC+8) | Stats: 8 new issues | 39 closed issues | 26 new PRs | 18 merged PRs | 12 PRs closed without merging
📊 Daily Development Status Summary
During the 24 hours from January 18 to 19, 2026, the vLLM community maintained very high activity and throughput. 39 legacy issues were processed (mostly closed as stale) while 8 new issues were filed, indicating active technical-debt cleanup. On the code side, 26 PRs were opened and 18 merged; several involve the Model Runner V2 refactor, new model support (e.g., OpenPangu VL, OLMo 3.5 Hybrid), and core performance optimizations, showing rapid progress in both architectural evolution and feature expansion.
🎯 AMD/ROCm Ecosystem Activity
This cycle saw clear AMD-related development activity, concentrated on code refactoring and performance optimization of the ROCm backend.
- New PR #32543: refactor: extract KV cache update logic into method in RocmAttention
  - Technical details: This PR extracts the KV cache update logic in the ROCm attention backend into a standalone do_kv_cache_update method, in response to the refactoring task of unifying the attention backend interfaces (Issue #32335), reducing code duplication and improving maintainability.
  - Impact: This is a low-level architectural cleanup that brings the ROCm backend's code structure in line with other backends (e.g., FlashAttention), laying the groundwork for further feature development and performance tuning.
- New PR #32551: [Performance] Split FlexAttention attention and cache update
  - Technical details: Also in response to Issue #32335, this PR separates the FlexAttention backend's attention computation from its KV cache update, mirroring the optimization previously applied to the FlashAttention backend.
  - Impact: Although FlexAttention is a generic Triton-based attention backend, this change directly benefits ROCm users who enable FlexAttention, improving inference performance and decoupling the compute logic.
Summary: No PRs this cycle involved the Quark quantization toolkit, MI300-specific optimizations, or submissions from xxx-amd users. AMD ecosystem participation was mainly general architectural cleanup of the ROCm compute backend, indicating that AMD platform integration and maintenance remain an ongoing part of vLLM's multi-platform support roadmap.
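The "extract method" pattern behind PR #32543 can be illustrated with a minimal sketch. The method name do_kv_cache_update follows the PR description; the surrounding class is a toy stand-in invented here, not vLLM's real ROCm backend.

```python
# Illustrative sketch of the refactor in PR #32543: KV cache update logic
# is pulled out of a monolithic forward() into its own method, so every
# attention backend exposes the same interface shape (Issue #32335).
# The class below is a toy stand-in, not vLLM's actual implementation.

class ToyAttentionBackend:
    def __init__(self):
        self.kv_cache = {}  # slot -> (key, value)

    def do_kv_cache_update(self, slots, keys, values):
        """KV cache update isolated in its own method (the extracted logic)."""
        for slot, k, v in zip(slots, keys, values):
            self.kv_cache[slot] = (k, v)

    def forward(self, slots, keys, values, queries):
        # Before the refactor, this update loop lived inline in forward().
        self.do_kv_cache_update(slots, keys, values)
        # ... attention computation would follow; return cache size as a stub.
        return len(self.kv_cache)


backend = ToyAttentionBackend()
out = backend.forward([0, 1], ["k0", "k1"], ["v0", "v1"], queries=None)
print(out)  # 2 cached slots
```

Keeping the update in a named method lets each backend override or reuse it independently of the attention math, which is the maintainability win the PR cites.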
💬 High-Engagement Discussion Analysis
- Closed Issue #20860: Ray + vLLM failing to automatically release GPU memory when tensor parallelism size (tp_size) > 1
  - Core topic: When deploying a multi-GPU (TP > 1) vLLM service with Ray, GPU memory is not automatically released after the job finishes.
  - Viewpoints and debate:
    - User ShiyuNee provided a detailed environment and reproduction code, suspecting an inter-process communication problem that leaves child processes running.
    - Ray team member eicherseiji stepped in to help analyze; the logs suggested the Ray processes had exited while GPU memory remained occupied, and he suggested trying Ray Data's built-in vLLM integration stage.
  - Conclusion and status: The issue was auto-marked "stale" and closed after more than 90 days of inactivity. The discussion suggests the problem lies in an edge case of deep Ray/vLLM integration, but no definitive root cause or fix emerged; affected users may need to rely on the Ray team's latest integration approach.
- Closed Issue #22975: sometimes tool calling is not correctly parsed but remains in the plain content for qwen3 coder
  - Core topic: When Qwen3-Coder makes tool calls, the output occasionally fails to be parsed by vLLM into structured tool calls and instead remains in the returned content as raw XML tags.
  - Viewpoints and debate:
    - Several users (VivekMalipatel, mondaylord, and others) confirmed the issue, noting it occurs more often in streaming mode.
    - User sempervictus argued this may stem from Qwen3's internal use of XML tags: the instability of the output format is a model-level problem that also shows up in other inference engines.
    - Users bbartels and Ithanil pointed out that the model repository has since updated its tool-call parser and chat template, which may already fix the issue.
  - Conclusion and status: The issue was closed as stale. The discussion highlights a key challenge for tool calling: the regularity of model output versus the robustness of the parser. The fix points at updating the model files themselves (e.g., tokenizer_config.json and generation_config.json) rather than just vLLM code.
- New PR #32553: [P/D] Add KV cache queries and eliminate redundant prefill computation
  - Core topic: This PR proposes a KV cache query capability that lets prefill workers fetch already-cached KV blocks from other engines, avoiding redundant computation.
  - Viewpoints and debate:
    - Support and value: PR author snadampal described significant benefits in scenarios such as multi-turn conversation resumption and recovery after cache eviction.
    - Design concerns: Core maintainer robertgshaw2-redhat acknowledged the feature's value but raised clear concerns about the current implementation, noting that the "implicit" design of overloading the meaning of existing fields could cause confusion, and suggesting a more "explicit" design be explored.
  - Current status: The PR remains open. The maintainer's comments do not reject the feature itself but call for deeper discussion of the API design, reflecting the vLLM project's emphasis on architectural clarity and long-term maintainability.
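Returning to the Qwen3-Coder discussion (#22975) above, the failure mode can be made concrete with a toy sketch. The tag name and regex below are illustrative only; vLLM's real tool parsers are considerably more involved, especially in streaming mode.

```python
import re

# Toy illustration of the parsing problem in Issue #22975: structured tool
# calls are recovered only when the model emits well-formed tags; anything
# malformed stays in the plain content, which is exactly what users saw.
# The <tool_call> tag and regex are illustrative, not vLLM's actual parser.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def split_tool_calls(model_output: str):
    """Extract well-formed tool-call payloads; leave the rest as content."""
    tool_calls = TOOL_CALL_RE.findall(model_output)
    plain = TOOL_CALL_RE.sub("", model_output).strip()
    return tool_calls, plain

# Well-formed output: the call is parsed into a structured payload.
calls, content = split_tool_calls('ok <tool_call>{"name": "ls"}</tool_call>')
print(calls)  # ['{"name": "ls"}']

# Malformed output (missing closing tag): nothing parses, and the raw
# XML-ish text remains in the returned content, as reported in the issue.
calls2, content2 = split_tool_calls('<tool_call>{"name": "ls"}')
print(calls2, content2)
```

This is why the discussion converged on fixing the model's chat template and parser files: a stricter or looser regex on the vLLM side can only partially compensate for unstable model output.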
🔥 Hot Topics and Trend Analysis
- Continued expansion of vision-language model (VLM) support: Support was added for OpenPangu VL (PR #32566) and Step1 (PR #32511). Meanwhile, core developer WoosukKwon's PR #32546 added basic VLM support (e.g., Qwen3-VL) to Model Runner V2, indicating that VLMs are a key focus of vLLM's ongoing architectural evolution.
- Tool calling and parser improvements: Maturing the tool-calling ecosystem remains a hot topic. Beyond the Qwen3 Coder discussion above, this cycle produced multiple PRs fixing and enhancing the relevant parsers (#32536, #32537, #32538, #32539), covering Qwen3, Mistral/Devstral, Olmo3, and other models, reflecting the community's push to improve the feature's stability and model coverage.
- Ongoing Model Runner V2 refactoring: A series of PRs led by WoosukKwon (#32535, #32546, #32562, etc.) continues to advance Model Runner V2, covering Eagle inference optimization, VLM support, and a refactor of the state-update logic. This marks a steady migration of vLLM's core inference engine toward a more efficient, more unified V2 architecture.
- Performance optimization and determinism guarantees: Several PRs target low-level performance and correctness, such as improving the Triton prefill attention kernel (#32403), fixing the batch-invariance tests (#32544, #32561), and implementing Triton-based top-p/top-k sampling masks (#32558). They reflect the project's strict attention to computational determinism and result reliability alongside its pursuit of peak performance.
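As background for the top-p/top-k masking work in #32558, the reference semantics of top-p ("nucleus") masking can be sketched in plain Python. vLLM's version runs as a fused Triton kernel on the GPU; the function below is only an invented illustration of what the mask computes.

```python
# Reference semantics of top-p masking, the operation PR #32558 moves into
# a Triton kernel. This pure-Python version illustrates the math only; it
# is not vLLM's implementation.

def top_p_mask(probs, top_p):
    """Return a keep/drop mask: keep the smallest set of tokens whose
    cumulative probability reaches top_p, scanning from most likely down."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = [False] * len(probs)
    cumulative = 0.0
    for i in order:
        keep[i] = True           # always keep at least the top token
        cumulative += probs[i]
        if cumulative >= top_p:  # enough probability mass covered
            break
    return keep

probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_mask(probs, top_p=0.8))  # [True, True, False, False]
```

Fusing this sort-scan-mask sequence into one kernel avoids the extra memory round-trips of doing it in separate tensor ops, which is the usual motivation for Triton rewrites of sampling steps.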
🛠️ Key Technical Changes
- PR #32417 (merged): Fixes degenerate strides in TRTLLM query tensors. With the FlashInfer backend, a query-tensor dimension of size 1 could cause kernel execution errors. The fix improves engine startup stability under the affected configurations.
- PR #30623 (merged): Separates the mixture-of-experts (MoE) router logic into standalone object-oriented classes. This is a significant MoE refactor: different routers (TopK, grouped TopK, etc.) are created via a factory pattern, greatly improving modularity and extensibility and laying the groundwork for more complex routing strategies.
- PR #32540 (merged): Fixes a rotary position embedding (RoPE) dimension error in the GLM-ASR audio encoder that produced incorrect encoder output. A key model-specific bugfix that ensures accurate GLM-ASR inference on vLLM.
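The factory-style router structure described for PR #30623 can be sketched as follows. The class and function names here are hypothetical, chosen only to illustrate the pattern; vLLM's actual classes and routing math differ.

```python
# Illustration of the object-oriented router split described for PR #30623:
# each routing strategy becomes its own class behind a common interface, and
# a factory maps a config string to the concrete router. Names are invented.

class TopKRouter:
    def __init__(self, k):
        self.k = k

    def route(self, scores):
        # Pick the indices of the k highest-scoring experts.
        return sorted(range(len(scores)),
                      key=lambda i: scores[i], reverse=True)[: self.k]


class GroupedTopKRouter(TopKRouter):
    def __init__(self, k, group_size):
        super().__init__(k)
        self.group_size = group_size

    def route(self, scores):
        # First pick the best expert within each group, then top-k among those.
        groups = [scores[i:i + self.group_size]
                  for i in range(0, len(scores), self.group_size)]
        best = [i * self.group_size + g.index(max(g))
                for i, g in enumerate(groups)]
        return sorted(best, key=lambda i: scores[i], reverse=True)[: self.k]


def make_router(kind, **kwargs):
    """Factory: map a config string to a concrete router class."""
    registry = {"topk": TopKRouter, "grouped_topk": GroupedTopKRouter}
    return registry[kind](**kwargs)


router = make_router("grouped_topk", k=2, group_size=2)
print(router.route([0.1, 0.9, 0.8, 0.2]))  # [1, 2]
```

The benefit of the split is that a new routing strategy becomes one new class plus a registry entry, rather than another branch inside a monolithic routing function.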
📈 Development Activity Observations
- Efficient review and merging: 18 PRs merged and 39 issues processed within 24 hours, showing the core maintainers' fast cadence and effective community triage.
- Sustained output from core contributors: WoosukKwon submitted/merged multiple PRs on the Model Runner V2 refactor; Isotr0py, robertgshaw2-redhat, yewentao256, and others appeared frequently as authors and reviewers on important PRs, driving the evolution of the core architecture.
- Issue lifecycle management: The bulk auto-closing of stale issues is a healthy repository-maintenance strategy that helps the project focus on currently actionable problems. Note, however, that issues like #20860, which carry technical discussion but were never fully resolved, may deserve a more precise "unresolved" label rather than closure purely on age.
- Continued AMD ecosystem participation: Although there were no major AMD feature updates this cycle, the community-contributed PR #32543 shows that optimization of the AMD ROCm backend is still ongoing.
💡 Issues Worth Watching
- Performance regression (Issue #32547): A user reports roughly 10% lower throughput after upgrading from v0.13.0 to v0.14.0rc2 on the same hardware and model. This deserves close attention, as it may stem from default-behavior changes or a latent performance defect introduced in v0.14.0, directly affecting user experience.
- Model implementation correctness (Issue #32545): The Gemma3 vision model's custom Conv2d layer causes its vision embedding output to diverge from the reference Transformers implementation. This raises questions about the necessity of custom operators in vLLM and alignment with reference implementations, which bears directly on output accuracy.
- Sampling parameter logic defect (Issue #32557): The conversion from bad_words to _bad_words_token_ids in SamplingParams is flawed: with some tokenizers (e.g., those with special whitespace handling), banned words with a leading space are not added correctly. This bug affects the correctness of a core sampling feature.
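The tokenizer pitfall behind Issue #32557 can be demonstrated with a toy vocabulary. The vocabulary and helper below are invented for illustration; real tokenizers such as SentencePiece mark a leading space with a prefix symbol like "▁", which is what makes the two forms of a word encode to different token ids.

```python
# Toy demonstration of the bad_words pitfall from Issue #32557: a word and
# its leading-space variant map to different token ids, so a ban list that
# only encodes the bare word misses the in-sentence (space-prefixed) form.
# The vocabulary and encode() helper are invented for illustration.

VOCAB = {"cat": 7, "\u2581cat": 8}  # "\u2581cat" = "cat" preceded by a space

def encode(text):
    # Extremely simplified: treat a leading space as the "▁" prefix marker.
    return [VOCAB[text.replace(" ", "\u2581")]]

def bad_words_token_ids(bad_words):
    """A robust conversion encodes BOTH variants of each banned word."""
    ids = []
    for word in bad_words:
        ids.append(encode(word))        # word at the start of the text
        ids.append(encode(" " + word))  # word after a space, mid-sentence
    return ids

print(bad_words_token_ids(["cat"]))  # [[7], [8]] -- both forms banned
```

The reported bug is of the opposite shape: the space-prefixed variant fails to be added for certain tokenizers, so the banned word can still be generated mid-sentence.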
📋 Appendix: Detailed Data
New Issues
- #32545 [Bug]: Gemma3/SiglipVisionEmbeddings embedding output is different to transformers implementation due to custom Conv2d — bug — by jalola (created: 2026-01-18 17:40 (UTC+8))
- #32541 [Feature]: LoRa adapter support for Qwen3VLForConditionalGeneration — feature request — by giangvu-ai (created: 2026-01-18 14:39 (UTC+8))
- #32559 [Bug]: NVML initialization failure even when running the basic example in the WSL platform — bug — by jasonyanwenl (created: 2026-01-19 03:18 (UTC+8))
- #32557 [Bug]: SamplingParams bad_words to _bad_words_token_ids — bug — by zhiqihuang (created: 2026-01-19 01:22 (UTC+8))
- #32554 [Bug]: Error in inspecting model architecture ‘Gemma3ForConditionalGeneration’ — bug — by Skullians (created: 2026-01-18 23:23 (UTC+8))
- #32548 [Bug]: The final streaming response chunk is missing the finish_reason. — bug — by zhangsongqing (created: 2026-01-18 18:23 (UTC+8))
- #32547 [Performance]: Performance degradation from 0.13.0 to 0.14.0rc2 — performance — by aabbccddwasd (created: 2026-01-18 18:09 (UTC+8))
- #32534 [Bug]: Olmo-3 does not call tools even with auto tool choice enabled — bug — by chrisoutwright (created: 2026-01-18 13:00 (UTC+8))
Closed Issues
- #20860 [Bug]: Ray + vLLM failing to automatically release GPU memory when tensor parallelism size (tp_size) > 1 — bug,ray,stale — by ShiyuNee (closed: 2026-01-19 10:17 (UTC+8))
- #21274 [Bug]: nvfp4 support on sm120 — bug,stale — by qibaoyuan (closed: 2026-01-19 10:17 (UTC+8))
- #21301 [Bug]: Hermes tool parser returns invalid arguments — bug,stale — by web3-luoxi (closed: 2026-01-19 10:17 (UTC+8))
- #21873 [Usage]: Qwen 2.5 VL 7B throughput — usage,stale — by gnagarajan2 (closed: 2026-01-19 10:16 (UTC+8))
- #22515 [Bug]: gpt oss 20b; [serving_chat.py:1001] openai_harmony.HarmonyError: Unexpected token 12606 while expecting start token 200006 — bug,stale,gpt-oss — by KKKSQJ (closed: 2026-01-19 10:16 (UTC+8))
- #22975 [Bug]: sometimes tool calling is not correctly parsed but remains in the plain content for qwen3 coder — bug,stale,tool-calling — by Co1lin (closed: 2026-01-19 10:16 (UTC+8))
- #23217 [Feature]: GPT-OSS harmony format support — help wanted,good first issue,feature request,stale,gpt-oss — by heheda12345 (closed: 2026-01-19 10:16 (UTC+8))
- #23278 [Bug]: gpt-oss 120b does not work while gpt-oss 20b works on H100 — bug,stale — by QichangZheng (closed: 2026-01-19 10:16 (UTC+8))
- #23752 [Bug]: DeepSeek R1 with DP=8 on one 8xB200 node has 3x TPOT when concurrency < 8 than concurrency >= 8 — bug,stale — by smarterclayton (closed: 2026-01-19 10:16 (UTC+8))
- #24192 [New Model]: New model support stepfun-ai/Step-Audio-2-mini — feature request,stale — by pratapyash (closed: 2026-01-19 10:16 (UTC+8))
- #24325 [Bug]: Unable to use GPT-OSS-120B with speculative decoding — bug,stale,gpt-oss — by hit023 (closed: 2026-01-19 10:16 (UTC+8))
- #24535 [Bug]: 500 internal server error when BS>8 for ISL/OSL: 1K/1K, 1K/8K. — bug,stale — by lpc0220 (closed: 2026-01-19 10:16 (UTC+8))
- #24980 [Bug]: Loading spinquant Model Fails Due to Hardware Capability Check and Weight Loading Errors — bug,stale — by HelloCard (closed: 2026-01-19 10:16 (UTC+8))
- #25002 [Bug]: Empty VllmConfig causing issues with Pooler.for_embed — bug,stale — by prashantgupta24 (closed: 2026-01-19 10:15 (UTC+8))
- #25039 [Bug]: Failed to run qwen3 next fp8 with TP on Ampere GPU — bug,stale — by JaheimLee (closed: 2026-01-19 10:15 (UTC+8))
- #25051 [Feature]: Can vllm support user use token_ids as input for chat completions interface? — feature request,stale — by GGBond8488 (closed: 2026-01-19 10:15 (UTC+8))
- #25063 [Bug]: ShardedStateLoader Appears Not to Support MLA Architecture Models Like DeepSeek — bug,stale — by lirong-lirong (closed: 2026-01-19 10:15 (UTC+8))
- #25073 [Bug]: Step3 inference failure — bug,stale — by xd1073321804 (closed: 2026-01-19 10:15 (UTC+8))
- #25097 [Feature]: TPU Support for Prompt Embeds — feature request,stale — by qthequartermasterman (closed: 2026-01-19 10:15 (UTC+8))
- #25124 [Bug]: The variable orig_to_new_prefix in Qwen2.5-VL likely contains a bug. — bug,stale — by moguizhizi (closed: 2026-01-19 10:15 (UTC+8))
- #25125 [Bug]: eagle3 tensor-parallel — bug,stale — by wangyxbh (closed: 2026-01-19 10:15 (UTC+8))
- #25127 [Bug]: vLLM server won’t exit; process stuck in R state and GPU becomes unusable (H100 80GB, gpu_memory_utilization=0.6) — bug,stale — by KOOVIE (closed: 2026-01-19 10:15 (UTC+8))
- #25134 [Bug]: Eagle3 not support Qwen3-30B-A3B-FP8 — bug,stale — by wangyxbh (closed: 2026-01-19 10:15 (UTC+8))
- #25144 [RFC]: vLLM’s Path to Sustainable CI/CD — RFC,ci/build,stale — by yeqcharlotte (closed: 2026-01-19 10:15 (UTC+8))
- #25147 [Doc]: vllm performance dashboard configuration — documentation,stale — by omkarpatil6644 (closed: 2026-01-19 10:15 (UTC+8))
- #25154 [Bug]: AttributeError: ‘Parameter’ object has no attribute ‘load_column_parallel_weight’ — bug,stale — by sallyjunjun (closed: 2026-01-19 10:15 (UTC+8))
- #25158 [Bug]: An error occurred when directly deploying the Qwen3-Next 80B-A3B-Thinking model saved by the save_pretrained of transformers using vllm. However, directly deploying the original model was normal — bug,stale — by zhanlun150729 (closed: 2026-01-19 10:15 (UTC+8))
- #25169 [Doc]: Quantization support on Arm CPU missing documentation — documentation,stale — by AshokBhat (closed: 2026-01-19 10:15 (UTC+8))
- #25170 [Feature]: support chunked prefill on non-x86 platform — feature request,stale — by cyb70289 (closed: 2026-01-19 10:15 (UTC+8))
- #25176 [Bug]: Internvl3-2B/8B can’t inference with video input when combined with AsyncLLMEngine — bug,stale — by JiancongWang (closed: 2026-01-19 10:15 (UTC+8))
- #25210 [Feature]: Use deterministic hashing for KV events always — feature request,stale — by PeaBrane (closed: 2026-01-19 10:15 (UTC+8))
- #25225 [Bug]: CUDA error during start the vllm serve — bug,stale — by gameboy777x (closed: 2026-01-19 10:15 (UTC+8))
- #25229 [Bug]: Deepseek vl2 model decoding phase has a very worse performance in both online and offline inference — bug,stale — by beyondHJM (closed: 2026-01-19 10:15 (UTC+8))
- #25234 [Usage]: How to enable the thinking mode in deepseekv3.1 by default? — usage,stale — by LeeX852 (closed: 2026-01-19 10:15 (UTC+8))
- #25236 [Bug]: llm.chat does crashes when using gpt-oss 20b — bug,stale — by HTuennermann (closed: 2026-01-19 10:15 (UTC+8))
- #25316 logic on enable_chunked_prefill is bit of chaotic — stale — by xianxingg (closed: 2026-01-19 10:15 (UTC+8))
- #32521 [Bug]: vllm: error: unrecognized arguments: --enable-logging-iteration-details — bug — by ciaoyizhen (closed: 2026-01-19 08:13 (UTC+8))
- #25148 [Bug]: nccl allReduce error since v0.10.2 — bug — by wedobetter (closed: 2026-01-18 21:59 (UTC+8))
- #32445 [Bug]: GLMASR rope error — bug — by catsled (closed: 2026-01-18 19:18 (UTC+8))
New PRs
- #32566 add openPangu VL — documentation,new-model — by Emilie1001 (created: 2026-01-19 10:40 (UTC+8))
- #32565 [Bugfix] EPLB - Mistral 3 Large — bug,v1 — by varun-sundar-rabindranath (created: 2026-01-19 10:32 (UTC+8))
- #32563 benchmark refactor — performance — by Bounty-hunter (created: 2026-01-19 09:37 (UTC+8))
- #32564 imports — nvidia — by robertgshaw2-redhat (created: 2026-01-19 10:12 (UTC+8))
- #32533 [Model Runner V2] Refactor dummy_run — v1,nvidia — by WoosukKwon (created: 2026-01-18 12:53 (UTC+8))
- #32562 [Model Runner V2] Refactor update_states — v1 — by WoosukKwon (created: 2026-01-19 09:17 (UTC+8))
- #32546 [Model Runner V2] Support VLM — v1,nvidia — by WoosukKwon (created: 2026-01-18 17:55 (UTC+8))
- #32553 [P/D] Add KV cache queries and eliminate redundant prefill computation — v1,kv-connector — by snadampal (created: 2026-01-18 22:52 (UTC+8))
- #32560 [CI/Build] Fix dependency conflict between model-hosting-container-standards and starlette — ci/build — by DanielMe (created: 2026-01-19 05:49 (UTC+8))
- #32561 Disable Cascade Attention for Batch Invariance — v1 — by frankwang28 (created: 2026-01-19 05:54 (UTC+8))
- #32556 [Doc] Correct comment for _jobs dict in OffloadingConnectorWorker — kv-connector — by DemingCheng (created: 2026-01-19 00:51 (UTC+8))
- #32551 [Performance] Split FlexAttention attention and cache update — rocm,v1,nvidia — by Etelis (created: 2026-01-18 20:39 (UTC+8))
- #32558 [Perf] Triton-based top-p/top-k masking — performance,v1 — by njhill (created: 2026-01-19 02:13 (UTC+8))
- #32555 [CI] Move Distributed Tests from H200 -> H100 — ready,ci/build — by robertgshaw2-redhat (created: 2026-01-18 23:33 (UTC+8))
- #32537 [Tool Parser] Fix Devstral-Small-2507 streaming tool parsing issue — qwen — by karanb192 (created: 2026-01-18 14:22 (UTC+8))
- #32538 [Bugfix][Tool Parser] Fix whitespace content between reasoning and tool_call in streaming mode — bug,qwen — by karanb192 (created: 2026-01-18 14:22 (UTC+8))
- #32552 Ignore: try to implement parakeet — no labels — by netanel-haber (created: 2026-01-18 21:34 (UTC+8))
- #32550 [Model] Add support for OLMo 3.5 Hybrid — documentation,new-model — by yanhong-lbh (created: 2026-01-18 19:50 (UTC+8))
- #32540 [Bugfix] Fix GLM-ASR audio encoder RoPE dim — bug,documentation,ready — by Isotr0py (created: 2026-01-18 14:34 (UTC+8))
- #32549 Support heterogeneous NemotronHPuzzle model — new-model — by danielafrimi (created: 2026-01-18 18:25 (UTC+8))
- #32544 Fix batch invariance test under heterogeneous workloads #32481 — v1 — by aishwaryatambe1112 (created: 2026-01-18 17:25 (UTC+8))
- #32543 refactor: extract KV cache update logic into method in RocmAttention — rocm,v1 — by Mohit-Gaur (created: 2026-01-18 16:15 (UTC+8))
- #32539 [Bugfix] Add tool chat template for Olmo3 models — bug,documentation,tool-calling — by karanb192 (created: 2026-01-18 14:22 (UTC+8))
- #32542 Enable Eagle3 speculative decoding for Pixtral (LlavaForConditionalGeneration) — no labels — by gopalsarda (created: 2026-01-18 15:21 (UTC+8))
- #32536 [Bugfix][Tool Parser] Fix Qwen3 Coder parser to stream tool call arguments — bug,qwen — by karanb192 (created: 2026-01-18 14:19 (UTC+8))
- #32535 [Model Runner V2] Minor optimization for eagle input processing — v1 — by WoosukKwon (created: 2026-01-18 13:43 (UTC+8))
Merged PRs
- #32562 [Model Runner V2] Refactor update_states — v1 — by WoosukKwon (merged: 2026-01-19 09:32 (UTC+8))
- #32546 [Model Runner V2] Support VLM — v1,nvidia — by WoosukKwon (merged: 2026-01-19 08:58 (UTC+8))
- #32417 [BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInfer backend. Fixes issue #32353 — bug,ready,v1,nvidia — by vadiklyutiy (merged: 2026-01-19 08:57 (UTC+8))
- #32471 [Bugfix] Add OOT backend option — bug,ready — by iboiko-habana (merged: 2026-01-19 06:20 (UTC+8))
- #32047 [Refactor] Remove unused cutlass moe problem size function — ready,nvidia — by yewentao256 (merged: 2026-01-19 04:46 (UTC+8))
- #32433 [Refactor] Remove unused file pallas_kv_cache_update.py — tpu,ready,v1 — by yewentao256 (merged: 2026-01-19 04:46 (UTC+8))
- #32556 [Doc] Correct comment for _jobs dict in OffloadingConnectorWorker — kv-connector — by DemingCheng (merged: 2026-01-19 04:46 (UTC+8))
- #31531 Use the same memory for workspace13 and fused_output. — ready — by halyavin (merged: 2026-01-19 03:14 (UTC+8))
- #32555 [CI] Move Distributed Tests from H200 -> H100 — ready,ci/build — by robertgshaw2-redhat (merged: 2026-01-19 02:25 (UTC+8))
- #30623 [MoE Refactor] Separate Router into OO Classes — ready,nvidia — by bnellnm (merged: 2026-01-19 00:40 (UTC+8))
- #32486 “refactor: refactor_repeated_interfaces” — ready — by tom-zju (merged: 2026-01-18 22:07 (UTC+8))
- #32540 [Bugfix] Fix GLM-ASR audio encoder RoPE dim — bug,documentation,ready — by Isotr0py (merged: 2026-01-18 19:18 (UTC+8))
- #32523 [Model] Remove the unnecessary dtype conversion in MiniCPM — ready — by gcanlin (merged: 2026-01-18 16:07 (UTC+8))
- #32511 [Model] Support Step1 Model — documentation,new-model,ready,v1 — by randzero (merged: 2026-01-18 18:20 (UTC+8))
- #32535 [Model Runner V2] Minor optimization for eagle input processing — v1 — by WoosukKwon (merged: 2026-01-18 13:55 (UTC+8))
- #32403 [Performance] Improve Triton prefill attention kernel’s performance — ready,v1 — by Isotr0py (merged: 2026-01-18 12:19 (UTC+8))
- #32129 [MoE Refactor] Move Test Impl into Test Dirs — documentation,ready — by robertgshaw2-redhat (merged: 2026-01-18 12:16 (UTC+8))
- #32532 [Model Runner V2] Move mrope_positions buffer to MRopeState — v1,nvidia — by WoosukKwon (merged: 2026-01-18 12:09 (UTC+8))
PRs Closed Without Merging
- #19785 [Ray] v1 Change device str for platform compatibility — needs-rebase,stale,v1 — by 1StepForever (closed: 2026-01-19 10:17 (UTC+8))
- #19972 Enabling Safe KVConnector — stale,kv-connector — by prashant182 (closed: 2026-01-19 10:17 (UTC+8))
- #21549 [V1] [Kernel] Change KV cache layout to (num_blocks, 2, …) for FlashAttention backend — ready,needs-rebase,stale,v1,kv-connector — by tdoublep (closed: 2026-01-19 10:17 (UTC+8))
- #21724 [Bugfix] [AMD] [ROCm/xformers] pd disagg for rocm fix — rocm,needs-rebase,stale,kv-connector — by seungrokj (closed: 2026-01-19 10:16 (UTC+8))
- #21817 Fix kvcache mismatch issue in vllm v0 kv_connector — needs-rebase,stale,kv-connector — by billishyahao (closed: 2026-01-19 10:16 (UTC+8))
- #24775 [Bugfix] Dtype error with sequence classification model and lora. — stale — by pb-sameerreddy (closed: 2026-01-19 10:16 (UTC+8))
- #25057 [Feat] cache expert_map to accelerate weight_loader when updating weights — stale — by weixiao-huang (closed: 2026-01-19 10:15 (UTC+8))
- #25139 [GPUModelRunner] Split code related to kv cache init to a separate file — ready,needs-rebase,ci/build,stale,v1 — by heheda12345 (closed: 2026-01-19 10:15 (UTC+8))
- #25167 [GDN] cherry-pick bugfix for scaled_dot_kkt from upstream FLA. — ready,stale — by sighingnow (closed: 2026-01-19 10:15 (UTC+8))
- #32480 Strengthen batch inv tests — v1 — by frankwang28 (closed: 2026-01-19 08:43 (UTC+8))
- #31983 [Bugfix] Fix Fp8 Triton for non-gated MoE (Nemotron) — bug,needs-rebase — by danisereb (closed: 2026-01-18 15:17 (UTC+8))
- #31506 Add embedding input functionality for disabled modalities — frontend,tpu,needs-rebase,v1,multi-modality,qwen — by reaganjlee (closed: 2026-01-18 11:23 (UTC+8))