vLLM Development Activity Report - 2026-01-18
Time window: 2026-01-18 11:02 (UTC+8) ~ 2026-01-19 11:02 (UTC+8) | Stats: 8 new issues | 39 closed issues | 26 new PRs | 18 merged PRs | 12 PRs closed without merging
📊 Daily Development Status Summary
During the 24 hours from January 18 to 19, 2026, the vLLM community maintained very high activity and throughput. 39 legacy issues were processed (mostly closed as stale) while 8 new issues were filed, indicating active technical-debt cleanup. On the code side, 26 PRs were opened and 18 merged; several involve the Model Runner V2 refactor, new model support (e.g., OpenPangu VL, OLMo 3.5 Hybrid), and core performance optimizations, showing rapid progress in both architectural evolution and feature expansion.
🎯 AMD/ROCm Ecosystem Activity
This cycle saw clear AMD-related development activity, concentrated on code refactoring and performance optimization of the ROCm backend.
- New PR #32543: refactor: extract KV cache update logic into method in RocmAttention
  - Technical details: This PR extracts the KV cache update logic in the ROCm attention backend into a standalone do_kv_cache_update method, in response to the refactoring task of unifying the attention backend interfaces (Issue #32335), reducing code duplication and improving maintainability.
  - Impact: This is a low-level architectural cleanup that brings the ROCm backend's code structure in line with other backends (e.g., FlashAttention), laying the groundwork for further feature development and performance tuning.
- New PR #32551: [Performance] Split FlexAttention attention and cache update
  - Technical details: Also in response to Issue #32335, this PR separates the FlexAttention backend's attention computation from its KV cache update, mirroring the optimization previously applied to the FlashAttention backend.
  - Impact: Although FlexAttention is a generic Triton-based attention backend, this change directly benefits ROCm users who enable FlexAttention, improving inference performance and decoupling the compute logic.
Summary: No PRs this cycle involved the Quark quantization toolkit, MI300-specific optimizations, or submissions from xxx-amd users. AMD ecosystem participation was mainly general architectural cleanup of the ROCm compute backend, indicating that AMD platform integration and maintenance remain an ongoing part of vLLM's multi-platform support roadmap.
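The "extract method" pattern behind PR #32543 can be illustrated with a minimal sketch. The method name do_kv_cache_update follows the PR description; the surrounding class is a toy stand-in invented here, not vLLM's real ROCm backend.

```python
# Illustrative sketch of the refactor in PR #32543: KV cache update logic
# is pulled out of a monolithic forward() into its own method, so every
# attention backend exposes the same interface shape (Issue #32335).
# The class below is a toy stand-in, not vLLM's actual implementation.

class ToyAttentionBackend:
    def __init__(self):
        self.kv_cache = {}  # slot -> (key, value)

    def do_kv_cache_update(self, slots, keys, values):
        """KV cache update isolated in its own method (the extracted logic)."""
        for slot, k, v in zip(slots, keys, values):
            self.kv_cache[slot] = (k, v)

    def forward(self, slots, keys, values, queries):
        # Before the refactor, this update loop lived inline in forward().
        self.do_kv_cache_update(slots, keys, values)
        # ... attention computation would follow; return cache size as a stub.
        return len(self.kv_cache)


backend = ToyAttentionBackend()
out = backend.forward([0, 1], ["k0", "k1"], ["v0", "v1"], queries=None)
print(out)  # 2 cached slots
```

Keeping the update in a named method lets each backend override or reuse it independently of the attention math, which is the maintainability win the PR cites.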
💬 High-Engagement Discussion Analysis
- Closed Issue #20860: Ray + vLLM failing to automatically release GPU memory when tensor parallelism size (tp_size) > 1
  - Core topic: When deploying a multi-GPU (TP > 1) vLLM service with Ray, GPU memory is not automatically released after the job finishes.
  - Viewpoints and debate:
    - User ShiyuNee provided a detailed environment and reproduction code, suspecting an inter-process communication problem that leaves child processes running.
    - Ray team member eicherseiji stepped in to help analyze; the logs suggested the Ray processes had exited while GPU memory remained occupied, and he suggested trying Ray Data's built-in vLLM integration stage.
  - Conclusion and status: The issue was auto-marked "stale" and closed after more than 90 days of inactivity. The discussion suggests the problem lies in an edge case of deep Ray/vLLM integration, but no definitive root cause or fix emerged; affected users may need to rely on the Ray team's latest integration approach.
- Closed Issue #22975: sometimes tool calling is not correctly parsed but remains in the plain content for qwen3 coder
  - Core topic: When Qwen3-Coder makes tool calls, the output occasionally fails to be parsed by vLLM into structured tool calls and instead remains in the returned content as raw XML tags.
  - Viewpoints and debate:
    - Several users (VivekMalipatel, mondaylord, and others) confirmed the issue, noting it occurs more often in streaming mode.
    - User sempervictus argued this may stem from Qwen3's internal use of XML tags: the instability of the output format is a model-level problem that also shows up in other inference engines.
    - Users bbartels and Ithanil pointed out that the model repository has since updated its tool-call parser and chat template, which may already fix the issue.
  - Conclusion and status: The issue was closed as stale. The discussion highlights a key challenge for tool calling: the regularity of model output versus the robustness of the parser. The fix points at updating the model files themselves (e.g., tokenizer_config.json and generation_config.json) rather than just vLLM code.
- New PR #32553: [P/D] Add KV cache queries and eliminate redundant prefill computation
  - Core topic: This PR proposes a KV cache query capability that lets prefill workers fetch already-cached KV blocks from other engines, avoiding redundant computation.
  - Viewpoints and debate:
    - Support and value: PR author snadampal described significant benefits in scenarios such as multi-turn conversation resumption and recovery after cache eviction.
    - Design concerns: Core maintainer robertgshaw2-redhat acknowledged the feature's value but raised clear concerns about the current implementation, noting that the "implicit" design of overloading the meaning of existing fields could cause confusion, and suggesting a more "explicit" design be explored.
  - Current status: The PR remains open. The maintainer's comments do not reject the feature itself but call for deeper discussion of the API design, reflecting the vLLM project's emphasis on architectural clarity and long-term maintainability.
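Returning to the Qwen3-Coder discussion (#22975) above, the failure mode can be made concrete with a toy sketch. The tag name and regex below are illustrative only; vLLM's real tool parsers are considerably more involved, especially in streaming mode.

```python
import re

# Toy illustration of the parsing problem in Issue #22975: structured tool
# calls are recovered only when the model emits well-formed tags; anything
# malformed stays in the plain content, which is exactly what users saw.
# The <tool_call> tag and regex are illustrative, not vLLM's actual parser.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def split_tool_calls(model_output: str):
    """Extract well-formed tool-call payloads; leave the rest as content."""
    tool_calls = TOOL_CALL_RE.findall(model_output)
    plain = TOOL_CALL_RE.sub("", model_output).strip()
    return tool_calls, plain

# Well-formed output: the call is parsed into a structured payload.
calls, content = split_tool_calls('ok <tool_call>{"name": "ls"}</tool_call>')
print(calls)  # ['{"name": "ls"}']

# Malformed output (missing closing tag): nothing parses, and the raw
# XML-ish text remains in the returned content, as reported in the issue.
calls2, content2 = split_tool_calls('<tool_call>{"name": "ls"}')
print(calls2, content2)
```

This is why the discussion converged on fixing the model's chat template and parser files: a stricter or looser regex on the vLLM side can only partially compensate for unstable model output.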
🔥 Hot Topics and Trend Analysis
- Continued expansion of vision-language model (VLM) support: Support was added for OpenPangu VL (PR #32566) and Step1 (PR #32511). Meanwhile, core developer WoosukKwon's PR #32546 added basic VLM support (e.g., Qwen3-VL) to Model Runner V2, indicating that VLMs are a key focus of vLLM's ongoing architectural evolution.
- Tool calling and parser improvements: Maturing the tool-calling ecosystem remains a hot topic. Beyond the Qwen3 Coder discussion above, this cycle produced multiple PRs fixing and enhancing the relevant parsers (#32536, #32537, #32538, #32539), covering Qwen3, Mistral/Devstral, Olmo3, and other models, reflecting the community's push to improve the feature's stability and model coverage.
- Ongoing Model Runner V2 refactoring: A series of PRs led by WoosukKwon (#32535, #32546, #32562, etc.) continues to advance Model Runner V2, covering Eagle inference optimization, VLM support, and a refactor of the state-update logic. This marks a steady migration of vLLM's core inference engine toward a more efficient, more unified V2 architecture.
- Performance optimization and determinism guarantees: Several PRs target low-level performance and correctness, such as improving the Triton prefill attention kernel (#32403), fixing the batch-invariance tests (#32544, #32561), and implementing Triton-based top-p/top-k sampling masks (#32558). They reflect the project's strict attention to computational determinism and result reliability alongside its pursuit of peak performance.
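As background for the top-p/top-k masking work in #32558, the reference semantics of top-p ("nucleus") masking can be sketched in plain Python. vLLM's version runs as a fused Triton kernel on the GPU; the function below is only an invented illustration of what the mask computes.

```python
# Reference semantics of top-p masking, the operation PR #32558 moves into
# a Triton kernel. This pure-Python version illustrates the math only; it
# is not vLLM's implementation.

def top_p_mask(probs, top_p):
    """Return a keep/drop mask: keep the smallest set of tokens whose
    cumulative probability reaches top_p, scanning from most likely down."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = [False] * len(probs)
    cumulative = 0.0
    for i in order:
        keep[i] = True           # always keep at least the top token
        cumulative += probs[i]
        if cumulative >= top_p:  # enough probability mass covered
            break
    return keep

probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_mask(probs, top_p=0.8))  # [True, True, False, False]
```

Fusing this sort-scan-mask sequence into one kernel avoids the extra memory round-trips of doing it in separate tensor ops, which is the usual motivation for Triton rewrites of sampling steps.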
🛠️ Key Technical Changes
- PR #32417 (merged): Fixes degenerate strides in TRTLLM query tensors. With the FlashInfer backend, a query-tensor dimension of size 1 could cause kernel execution errors. The fix improves engine startup stability under the affected configurations.
- PR #30623 (merged): Separates the mixture-of-experts (MoE) router logic into standalone object-oriented classes. This is a significant MoE refactor: different routers (TopK, grouped TopK, etc.) are created via a factory pattern, greatly improving modularity and extensibility and laying the groundwork for more complex routing strategies.
- PR #32540 (merged): Fixes a rotary position embedding (RoPE) dimension error in the GLM-ASR audio encoder that produced incorrect encoder output. A key model-specific bugfix that ensures accurate GLM-ASR inference on vLLM.
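The factory-style router structure described for PR #30623 can be sketched as follows. The class and function names here are hypothetical, chosen only to illustrate the pattern; vLLM's actual classes and routing math differ.

```python
# Illustration of the object-oriented router split described for PR #30623:
# each routing strategy becomes its own class behind a common interface, and
# a factory maps a config string to the concrete router. Names are invented.

class TopKRouter:
    def __init__(self, k):
        self.k = k

    def route(self, scores):
        # Pick the indices of the k highest-scoring experts.
        return sorted(range(len(scores)),
                      key=lambda i: scores[i], reverse=True)[: self.k]


class GroupedTopKRouter(TopKRouter):
    def __init__(self, k, group_size):
        super().__init__(k)
        self.group_size = group_size

    def route(self, scores):
        # First pick the best expert within each group, then top-k among those.
        groups = [scores[i:i + self.group_size]
                  for i in range(0, len(scores), self.group_size)]
        best = [i * self.group_size + g.index(max(g))
                for i, g in enumerate(groups)]
        return sorted(best, key=lambda i: scores[i], reverse=True)[: self.k]


def make_router(kind, **kwargs):
    """Factory: map a config string to a concrete router class."""
    registry = {"topk": TopKRouter, "grouped_topk": GroupedTopKRouter}
    return registry[kind](**kwargs)


router = make_router("grouped_topk", k=2, group_size=2)
print(router.route([0.1, 0.9, 0.8, 0.2]))  # [1, 2]
```

The benefit of the split is that a new routing strategy becomes one new class plus a registry entry, rather than another branch inside a monolithic routing function.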
📈 Development Activity Observations
- Efficient review and merging: 18 PRs merged and 39 issues processed within 24 hours, showing the core maintainers' fast cadence and effective community triage.
- Sustained output from core contributors: WoosukKwon submitted/merged multiple PRs on the Model Runner V2 refactor; Isotr0py, robertgshaw2-redhat, yewentao256, and others appeared frequently as authors and reviewers on important PRs, driving the evolution of the core architecture.
- Issue lifecycle management: The bulk auto-closing of stale issues is a healthy repository-maintenance strategy that helps the project focus on currently actionable problems. Note, however, that issues like #20860, which carry technical discussion but were never fully resolved, may deserve a more precise "unresolved" label rather than closure purely on age.
- Continued AMD ecosystem participation: Although there were no major AMD feature updates this cycle, the community-contributed PR #32543 shows that optimization of the AMD ROCm backend is still ongoing.
💡 Issues Worth Watching
- Performance regression (Issue #32547): A user reports roughly 10% lower throughput after upgrading from v0.13.0 to v0.14.0rc2 on the same hardware and model. This deserves close attention, as it may stem from default-behavior changes or a latent performance defect introduced in v0.14.0, directly affecting user experience.
- Model implementation correctness (Issue #32545): The Gemma3 vision model's custom Conv2d layer causes its vision embedding output to diverge from the reference Transformers implementation. This raises questions about the necessity of custom operators in vLLM and alignment with reference implementations, which bears directly on output accuracy.
- Sampling parameter logic defect (Issue #32557): The conversion from bad_words to _bad_words_token_ids in SamplingParams is flawed: with some tokenizers (e.g., those with special whitespace handling), banned words with a leading space are not added correctly. This bug affects the correctness of a core sampling feature.
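The tokenizer pitfall behind Issue #32557 can be demonstrated with a toy vocabulary. The vocabulary and helper below are invented for illustration; real tokenizers such as SentencePiece mark a leading space with a prefix symbol like "▁", which is what makes the two forms of a word encode to different token ids.

```python
# Toy demonstration of the bad_words pitfall from Issue #32557: a word and
# its leading-space variant map to different token ids, so a ban list that
# only encodes the bare word misses the in-sentence (space-prefixed) form.
# The vocabulary and encode() helper are invented for illustration.

VOCAB = {"cat": 7, "\u2581cat": 8}  # "\u2581cat" = "cat" preceded by a space

def encode(text):
    # Extremely simplified: treat a leading space as the "▁" prefix marker.
    return [VOCAB[text.replace(" ", "\u2581")]]

def bad_words_token_ids(bad_words):
    """A robust conversion encodes BOTH variants of each banned word."""
    ids = []
    for word in bad_words:
        ids.append(encode(word))        # word at the start of the text
        ids.append(encode(" " + word))  # word after a space, mid-sentence
    return ids

print(bad_words_token_ids(["cat"]))  # [[7], [8]] -- both forms banned
```

The reported bug is of the opposite shape: the space-prefixed variant fails to be added for certain tokenizers, so the banned word can still be generated mid-sentence.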
📋 Appendix: Detailed Data
New Issues
- #32545 [Bug]: Gemma3/SiglipVisionEmbeddings embedding output is different to transformers implementation due to custom Conv2d — bug — by jalola (created: 2026-01-18 17:40 (UTC+8))
- #32541 [Feature]: LoRa adapter support for Qwen3VLForConditionalGeneration — feature request — by giangvu-ai (created: 2026-01-18 14:39 (UTC+8))
- #32559 [Bug]: NVML initialization failure even when running the basic example in the WSL platform — bug — by jasonyanwenl (created: 2026-01-19 03:18 (UTC+8))
- #32557 [Bug]: SamplingParams bad_words to _bad_words_token_ids — bug — by zhiqihuang (created: 2026-01-19 01:22 (UTC+8))
- #32554 [Bug]: Error in inspecting model architecture ‘Gemma3ForConditionalGeneration’ — bug — by Skullians (created: 2026-01-18 23:23 (UTC+8))
- #32548 [Bug]: The final streaming response chunk is missing the finish_reason. — bug — by zhangsongqing (created: 2026-01-18 18:23 (UTC+8))
- #32547 [Performance]: Performance degradation from 0.13.0 to 0.14.0rc2 — performance — by aabbccddwasd (created: 2026-01-18 18:09 (UTC+8))
- #32534 [Bug]: Olmo-3 does not call tools even with auto tool choice enabled — bug — by chrisoutwright (created: 2026-01-18 13:00 (UTC+8))
Closed Issues
- #20860 [Bug]: Ray + vLLM failing to automatically release GPU memory when tensor parallelism size (tp_size) > 1 — bug,ray,stale — by ShiyuNee (closed: 2026-01-19 10:17 (UTC+8))
- #21274 [Bug]: nvfp4 support on sm120 — bug,stale — by qibaoyuan (closed: 2026-01-19 10:17 (UTC+8))
- #21301 [Bug]: Hermes tool parser returns invalid arguments — bug,stale — by web3-luoxi (closed: 2026-01-19 10:17 (UTC+8))
- #21873 [Usage]: Qwen 2.5 VL 7B throughput — usage,stale — by gnagarajan2 (closed: 2026-01-19 10:16 (UTC+8))
- #22515 [Bug]: gpt oss 20b; [serving_chat.py:1001] openai_harmony.HarmonyError: Unexpected token 12606 while expecting start token 200006 — bug,stale,gpt-oss — by KKKSQJ (closed: 2026-01-19 10:16 (UTC+8))
- #22975 [Bug]: sometimes tool calling is not correctly parsed but remains in the plain content for qwen3 coder — bug,stale,tool-calling — by Co1lin (closed: 2026-01-19 10:16 (UTC+8))
- #23217 [Feature]: GPT-OSS harmony format support — help wanted,good first issue,feature request,stale,gpt-oss — by heheda12345 (closed: 2026-01-19 10:16 (UTC+8))
- #23278 [Bug]: gpt-oss 120b does not work while gpt-oss 20b works on H100 — bug,stale — by QichangZheng (closed: 2026-01-19 10:16 (UTC+8))
- #23752 [Bug]: DeepSeek R1 with DP=8 on one 8xB200 node has 3x TPOT when concurrency < 8 than concurrency >= 8 — bug,stale — by smarterclayton (closed: 2026-01-19 10:16 (UTC+8))
- #24192 [New Model]: New model support stepfun-ai/Step-Audio-2-mini — feature request,stale — by pratapyash (closed: 2026-01-19 10:16 (UTC+8))
- #24325 [Bug]: Unable to use GPT-OSS-120B with speculative decoding — bug,stale,gpt-oss — by hit023 (closed: 2026-01-19 10:16 (UTC+8))
- #24535 [Bug]: 500 internal server error when BS>8 for ISL/OSL: 1K/1K, 1K/8K. — bug,stale — by lpc0220 (closed: 2026-01-19 10:16 (UTC+8))
- #24980 [Bug]: Loading spinquant Model Fails Due to Hardware Capability Check and Weight Loading Errors — bug,stale — by HelloCard (closed: 2026-01-19 10:16 (UTC+8))
- #25002 [Bug]: Empty VllmConfig causing issues with Pooler.for_embed — bug,stale — by prashantgupta24 (closed: 2026-01-19 10:15 (UTC+8))
- #25039 [Bug]: Failed to run qwen3 next fp8 with TP on Ampere GPU — bug,stale — by JaheimLee (closed: 2026-01-19 10:15 (UTC+8))
- #25051 [Feature]: Can vllm support user use token_ids as input for chat completions interface? — feature request,stale — by GGBond8488 (closed: 2026-01-19 10:15 (UTC+8))
- #25063 [Bug]: ShardedStateLoader Appears Not to Support MLA Architecture Models Like DeepSeek — bug,stale — by lirong-lirong (closed: 2026-01-19 10:15 (UTC+8))
- #25073 [Bug]: Step3 inference failure — bug,stale — by xd1073321804 (closed: 2026-01-19 10:15 (UTC+8))
- #25097 [Feature]: TPU Support for Prompt Embeds — feature request,stale — by qthequartermasterman (closed: 2026-01-19 10:15 (UTC+8))
- #25124 [Bug]: The variable orig_to_new_prefix in Qwen2.5-VL likely contains a bug. — bug,stale — by moguizhizi (closed: 2026-01-19 10:15 (UTC+8))
- #25125 [Bug]: eagle3 tensor-parallel — bug,stale — by wangyxbh (closed: 2026-01-19 10:15 (UTC+8))
- #25127 [Bug]: vLLM server won’t exit; process stuck in R state and GPU becomes unusable (H100 80GB, gpu_memory_utilization=0.6) — bug,stale — by KOOVIE (closed: 2026-01-19 10:15 (UTC+8))
- #25134 [Bug]: Eagle3 not support Qwen3-30B-A3B-FP8 — bug,stale — by wangyxbh (closed: 2026-01-19 10:15 (UTC+8))
- #25144 [RFC]: vLLM’s Path to Sustainable CI/CD — RFC,ci/build,stale — by yeqcharlotte (closed: 2026-01-19 10:15 (UTC+8))
- #25147 [Doc]: vllm performance dashboard configuration — documentation,stale — by omkarpatil6644 (closed: 2026-01-19 10:15 (UTC+8))
- #25154 [Bug]: AttributeError: ‘Parameter’ object has no attribute ‘load_column_parallel_weight’ — bug,stale — by sallyjunjun (closed: 2026-01-19 10:15 (UTC+8))
- #25158 [Bug]: An error occurred when directly deploying the Qwen3-Next 80B-A3B-Thinking model saved by the save_pretrained of transformers using vllm. However, directly deploying the original model was normal — bug,stale — by zhanlun150729 (closed: 2026-01-19 10:15 (UTC+8))
- #25169 [Doc]: Quantization support on Arm CPU missing documentation — documentation,stale — by AshokBhat (closed: 2026-01-19 10:15 (UTC+8))
- #25170 [Feature]: support chunked prefill on non-x86 platform — feature request,stale — by cyb70289 (closed: 2026-01-19 10:15 (UTC+8))
- #25176 [Bug]: Internvl3-2B/8B can’t inference with video input when combined with AsyncLLMEngine — bug,stale — by JiancongWang (closed: 2026-01-19 10:15 (UTC+8))
- #25210 [Feature]: Use deterministic hashing for KV events always — feature request,stale — by PeaBrane (closed: 2026-01-19 10:15 (UTC+8))
- #25225 [Bug]: CUDA error during start the vllm serve — bug,stale — by gameboy777x (closed: 2026-01-19 10:15 (UTC+8))
- #25229 [Bug]: Deepseek vl2 model decoding phase has a very worse performance in both online and offline inference — bug,stale — by beyondHJM (closed: 2026-01-19 10:15 (UTC+8))
- #25234 [Usage]: How to enable the thinking mode in deepseekv3.1 by default? — usage,stale — by LeeX852 (closed: 2026-01-19 10:15 (UTC+8))
- #25236 [Bug]: llm.chat does crashes when using gpt-oss 20b — bug,stale — by HTuennermann (closed: 2026-01-19 10:15 (UTC+8))
- #25316 logic on enable_chunked_prefill is bit of chaotic — stale — by xianxingg (closed: 2026-01-19 10:15 (UTC+8))
- #32521 [Bug]: vllm: error: unrecognized arguments: --enable-logging-iteration-details — bug — by ciaoyizhen (closed: 2026-01-19 08:13 (UTC+8))
- #25148 [Bug]: nccl allReduce error since v0.10.2 — bug — by wedobetter (closed: 2026-01-18 21:59 (UTC+8))
- #32445 [Bug]: GLMASR rope error — bug — by catsled (closed: 2026-01-18 19:18 (UTC+8))
New PRs
- #32566 add openPangu VL — documentation,new-model — by Emilie1001 (created: 2026-01-19 10:40 (UTC+8))
- #32565 [Bugfix] EPLB - Mistral 3 Large — bug,v1 — by varun-sundar-rabindranath (created: 2026-01-19 10:32 (UTC+8))
- #32563 benchmark refactor — performance — by Bounty-hunter (created: 2026-01-19 09:37 (UTC+8))
- #32564 imports — nvidia — by robertgshaw2-redhat (created: 2026-01-19 10:12 (UTC+8))
- #32533 [Model Runner V2] Refactor dummy_run — v1,nvidia — by WoosukKwon (created: 2026-01-18 12:53 (UTC+8))
- #32562 [Model Runner V2] Refactor update_states — v1 — by WoosukKwon (created: 2026-01-19 09:17 (UTC+8))
- #32546 [Model Runner V2] Support VLM — v1,nvidia — by WoosukKwon (created: 2026-01-18 17:55 (UTC+8))
- #32553 [P/D] Add KV cache queries and eliminate redundant prefill computation — v1,kv-connector — by snadampal (created: 2026-01-18 22:52 (UTC+8))
- #32560 [CI/Build] Fix dependency conflict between model-hosting-container-standards and starlette — ci/build — by DanielMe (created: 2026-01-19 05:49 (UTC+8))
- #32561 Disable Cascade Attention for Batch Invariance — v1 — by frankwang28 (created: 2026-01-19 05:54 (UTC+8))
- #32556 [Doc] Correct comment for _jobs dict in OffloadingConnectorWorker — kv-connector — by DemingCheng (created: 2026-01-19 00:51 (UTC+8))
- #32551 [Performance] Split FlexAttention attention and cache update — rocm,v1,nvidia — by Etelis (created: 2026-01-18 20:39 (UTC+8))
- #32558 [Perf] Triton-based top-p/top-k masking — performance,v1 — by njhill (created: 2026-01-19 02:13 (UTC+8))
- #32555 [CI] Move Distributed Tests from H200 -> H100 — ready,ci/build — by robertgshaw2-redhat (created: 2026-01-18 23:33 (UTC+8))
- #32537 [Tool Parser] Fix Devstral-Small-2507 streaming tool parsing issue — qwen — by karanb192 (created: 2026-01-18 14:22 (UTC+8))
- #32538 [Bugfix][Tool Parser] Fix whitespace content between reasoning and tool_call in streaming mode — bug,qwen — by karanb192 (created: 2026-01-18 14:22 (UTC+8))
- #32552 Ignore: try to implement parakeet — no labels — by netanel-haber (created: 2026-01-18 21:34 (UTC+8))
- #32550 [Model] Add support for OLMo 3.5 Hybrid — documentation,new-model — by yanhong-lbh (created: 2026-01-18 19:50 (UTC+8))
- #32540 [Bugfix] Fix GLM-ASR audio encoder RoPE dim — bug,documentation,ready — by Isotr0py (created: 2026-01-18 14:34 (UTC+8))
- #32549 Support heterogeneous NemotronHPuzzle model — new-model — by danielafrimi (created: 2026-01-18 18:25 (UTC+8))
- #32544 Fix batch invariance test under heterogeneous workloads #32481 — v1 — by aishwaryatambe1112 (created: 2026-01-18 17:25 (UTC+8))
- #32543 refactor: extract KV cache update logic into method in RocmAttention — rocm,v1 — by Mohit-Gaur (created: 2026-01-18 16:15 (UTC+8))
- #32539 [Bugfix] Add tool chat template for Olmo3 models — bug,documentation,tool-calling — by karanb192 (created: 2026-01-18 14:22 (UTC+8))
- #32542 Enable Eagle3 speculative decoding for Pixtral (LlavaForConditionalGeneration) — no labels — by gopalsarda (created: 2026-01-18 15:21 (UTC+8))
- #32536 [Bugfix][Tool Parser] Fix Qwen3 Coder parser to stream tool call arguments — bug,qwen — by karanb192 (created: 2026-01-18 14:19 (UTC+8))
- #32535 [Model Runner V2] Minor optimization for eagle input processing — v1 — by WoosukKwon (created: 2026-01-18 13:43 (UTC+8))
Merged PRs
- #32562 [Model Runner V2] Refactor update_states — v1 — by WoosukKwon (merged: 2026-01-19 09:32 (UTC+8))
- #32546 [Model Runner V2] Support VLM — v1,nvidia — by WoosukKwon (merged: 2026-01-19 08:58 (UTC+8))
- #32417 [BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInfer backend. Fixes issue #32353 — bug,ready,v1,nvidia — by vadiklyutiy (merged: 2026-01-19 08:57 (UTC+8))
- #32471 [Bugfix] Add OOT backend option — bug,ready — by iboiko-habana (merged: 2026-01-19 06:20 (UTC+8))
- #32047 [Refactor] Remove unused cutlass moe problem size function — ready,nvidia — by yewentao256 (merged: 2026-01-19 04:46 (UTC+8))
- #32433 [Refactor] Remove unused file pallas_kv_cache_update.py — tpu,ready,v1 — by yewentao256 (merged: 2026-01-19 04:46 (UTC+8))
- #32556 [Doc] Correct comment for _jobs dict in OffloadingConnectorWorker — kv-connector — by DemingCheng (merged: 2026-01-19 04:46 (UTC+8))
- #31531 Use the same memory for workspace13 and fused_output. — ready — by halyavin (merged: 2026-01-19 03:14 (UTC+8))
- #32555 [CI] Move Distributed Tests from H200 -> H100 — ready,ci/build — by robertgshaw2-redhat (merged: 2026-01-19 02:25 (UTC+8))
- #30623 [MoE Refactor] Separate Router into OO Classes — ready,nvidia — by bnellnm (merged: 2026-01-19 00:40 (UTC+8))
- #32486 “refactor: refactor_repeated_interfaces” — ready — by tom-zju (merged: 2026-01-18 22:07 (UTC+8))
- #32540 [Bugfix] Fix GLM-ASR audio encoder RoPE dim — bug,documentation,ready — by Isotr0py (merged: 2026-01-18 19:18 (UTC+8))
- #32523 [Model] Remove the unnecessary dtype conversion in MiniCPM — ready — by gcanlin (merged: 2026-01-18 16:07 (UTC+8))
- #32511 [Model] Support Step1 Model — documentation,new-model,ready,v1 — by randzero (merged: 2026-01-18 18:20 (UTC+8))
- #32535 [Model Runner V2] Minor optimization for eagle input processing — v1 — by WoosukKwon (merged: 2026-01-18 13:55 (UTC+8))
- #32403 [Performance] Improve Triton prefill attention kernel’s performance — ready,v1 — by Isotr0py (merged: 2026-01-18 12:19 (UTC+8))
- #32129 [MoE Refactor] Move Test Impl into Test Dirs — documentation,ready — by robertgshaw2-redhat (merged: 2026-01-18 12:16 (UTC+8))
- #32532 [Model Runner V2] Move mrope_positions buffer to MRopeState — v1,nvidia — by WoosukKwon (merged: 2026-01-18 12:09 (UTC+8))
PRs Closed Without Merging
- #19785 [Ray] v1 Change device str for platform compatibility — needs-rebase,stale,v1 — by 1StepForever (closed: 2026-01-19 10:17 (UTC+8))
- #19972 Enabling Safe KVConnector — stale,kv-connector — by prashant182 (closed: 2026-01-19 10:17 (UTC+8))
- #21549 [V1] [Kernel] Change KV cache layout to (num_blocks, 2, …) for FlashAttention backend — ready,needs-rebase,stale,v1,kv-connector — by tdoublep (closed: 2026-01-19 10:17 (UTC+8))
- #21724 [Bugfix] [AMD] [ROCm/xformers] pd disagg for rocm fix — rocm,needs-rebase,stale,kv-connector — by seungrokj (closed: 2026-01-19 10:16 (UTC+8))
- #21817 Fix kvcache mismatch issue in vllm v0 kv_connector — needs-rebase,stale,kv-connector — by billishyahao (closed: 2026-01-19 10:16 (UTC+8))
- #24775 [Bugfix] Dtype error with sequence classification model and lora. — stale — by pb-sameerreddy (closed: 2026-01-19 10:16 (UTC+8))
- #25057 [Feat] cache expert_map to accelerate weight_loader when updating weights — stale — by weixiao-huang (closed: 2026-01-19 10:15 (UTC+8))
- #25139 [GPUModelRunner] Split code related to kv cache init to a separate file — ready,needs-rebase,ci/build,stale,v1 — by heheda12345 (closed: 2026-01-19 10:15 (UTC+8))
- #25167 [GDN] cherry-pick bugfix for scaled_dot_kkt from upstream FLA. — ready,stale — by sighingnow (closed: 2026-01-19 10:15 (UTC+8))
- #32480 Strengthen batch inv tests — v1 — by frankwang28 (closed: 2026-01-19 08:43 (UTC+8))
- #31983 [Bugfix] Fix Fp8 Triton for non-gated MoE (Nemotron) — bug,needs-rebase — by danisereb (closed: 2026-01-18 15:17 (UTC+8))
- #31506 Add embedding input functionality for disabled modalities — frontend,tpu,needs-rebase,v1,multi-modality,qwen — by reaganjlee (closed: 2026-01-18 11:23 (UTC+8))