vLLM Development Activity Report - 2026-01-19

Time window: 2026-01-19 10:56 (UTC+8) ~ 2026-01-20 10:56 (UTC+8)
Statistics: New issues 20 | Closed issues 21 | New PRs 43 | Merged PRs 24 | PRs closed without merging 12

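📊 Daily Development Summary

Over the past 24 hours, vLLM kept up a fast development pace: 24 PRs were merged and 41 issues saw activity (20 opened, 21 closed). Work centered on continued optimization of Model Runner V2, multimodal model support, and assorted performance improvements and bug fixes. Community attention leaned toward speculative decoding enhancements, core architecture evolution (such as quantization scheme design), and cross-platform compatibility and performance, AMD ROCm in particular.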

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity in this window centered on a quantization design discussion and CI/test fixes.

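RFC: QuantKey design (Issue #32589)
- Contributor: hangy-amd (AMD employee)
- Content: Proposes reworking the current quantization key (QuantKey) design. The RFC argues that inferring the quantization scheme (per_tensor/per_channel) implicitly from GroupShape is an unclear abstraction and handles higher-dimensional tensors poorly.
- Proposal: Add an explicit quant_scheme enum field to QuantKey so the scheme is stated directly and drives the scale-factor shape logic.
- Discussion status: Core developer ProExpertProg replied in favor of keeping GroupShape as the more general abstraction, since it represents per-tensor, per-token, and group quantization uniformly, whereas an enum may need extra metadata. The discussion is open and no consensus has been reached.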
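
The trade-off in this RFC can be illustrated with a minimal sketch. The classes below are hypothetical and are not vLLM's actual QuantKey or GroupShape; the names GroupShapeKey, SchemeKey, QuantScheme, and group_size are assumptions made for illustration. The sketch assumes 2-D weights and the convention that -1 in a group shape means "span the whole dimension".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


def _ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


@dataclass(frozen=True)
class GroupShapeKey:
    """Implicit style (hypothetical): the scheme is inferred from a
    (rows, cols) group shape, where -1 means "the whole dimension"."""
    group_shape: Tuple[int, int]

    def scale_shape(self, weight_shape: Tuple[int, int]) -> Tuple[int, int]:
        rows, cols = weight_shape
        g_rows = rows if self.group_shape[0] == -1 else self.group_shape[0]
        g_cols = cols if self.group_shape[1] == -1 else self.group_shape[1]
        return (_ceil_div(rows, g_rows), _ceil_div(cols, g_cols))


class QuantScheme(Enum):
    """Explicit style (hypothetical): member names are illustrative only."""
    PER_TENSOR = "per_tensor"
    PER_CHANNEL = "per_channel"
    PER_GROUP = "per_group"


@dataclass(frozen=True)
class SchemeKey:
    """The enum names the scheme directly, but group quantization
    needs an extra metadata field (group_size)."""
    scheme: QuantScheme
    group_size: Optional[int] = None

    def scale_shape(self, weight_shape: Tuple[int, int]) -> Tuple[int, int]:
        rows, cols = weight_shape
        if self.scheme is QuantScheme.PER_TENSOR:
            return (1, 1)
        if self.scheme is QuantScheme.PER_CHANNEL:
            return (rows, 1)
        if self.group_size is None:
            raise ValueError("PER_GROUP requires group_size")
        return (rows, _ceil_div(cols, self.group_size))


weight = (4096, 1024)
print(GroupShapeKey((-1, -1)).scale_shape(weight))  # one scale for the tensor
print(GroupShapeKey((1, 128)).scale_shape(weight))  # one scale per 128 columns
print(SchemeKey(QuantScheme.PER_GROUP, 128).scale_shape(weight))
```

Both styles yield the same scale shapes here. The difference is where the information lives: the group shape encodes everything in one tuple, while the enum reads more directly but needs the additional group_size field, which matches the concern raised in the discussion.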
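CI fix: distributed test utilities (PR #32620)
- Contributor: rjrock
- Content: Fixes a failing test in the AMD Distributed Tests (4 GPUs) group. The fix targets test_cuda_device_count_stateless in distributed/test_utils.py and resolves an AttributeError caused by how the CUDA_VISIBLE_DEVICES environment variable was handled.
- Impact: Improves AMD CI stability and ensures the affected tests run correctly on ROCm.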
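CI fix: NIXL connector configuration (PR #32570)
- Contributor: qli88
- Content: To avoid crashes on ROCm, reverts the NIXL connector test environment variable from UCX_MEM_MMAP_HOOK_MODE=none back to UCX_RCACHE_MAX_UNRELEASED=1024. This resolves test failures caused by the UCX memory-mapping hook mode being incompatible with ROCm.
- Impact: Restores the NIXL test group in AMD CI. A long-term fix has been coordinated with the NIXL team (the setting is planned to become a default).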

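💬 Most-Discussed Threads

Wrong Whisper timestamps on long audio (Issue #32588)
- Core issue: When transcribing audio with Whisper in vLLM, segment-level timestamps drift cumulatively as the audio gets longer (roughly 0.5 seconds per segment).
- Views and points of contention:
  - Reporter dr75 traced the problem to the audio chunking logic, which searches a 1-second window for a silent split point but does not compensate for the resulting offset when generating timestamps. dr75 also noted that the related documentation is inaccurate.
  - Maintainer NickLucche added the "help wanted" label to invite community contributions.
  - The reporter later said they would submit a fix themselves because they need it sooner.
  - Another contributor, aadeshupadhyay, also offered to take the issue.
- Current status: Open; dr75 plans to submit a fix soon.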
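
The drift described in this issue can be reproduced with a small sketch. This is not vLLM's chunking code; the functions naive_shift and shift_segments, the 30-second nominal chunk length, and the sample data are assumptions for illustration. The sketch assumes each split point lands 0.5 s before the nominal boundary, so offsetting by chunk index times nominal length accumulates error, while offsetting by the chunk's actual start sample does not.

```python
def naive_shift(chunks, chunk_len_s=30.0):
    """Offset each segment by chunk index * nominal chunk length.
    Drifts whenever a chunk was actually split early or late."""
    out = []
    for idx, (_start_sample, segments) in enumerate(chunks):
        offset = idx * chunk_len_s
        for start, end, text in segments:
            out.append((offset + start, offset + end, text))
    return out


def shift_segments(chunks, sample_rate=16_000):
    """Offset each segment by the chunk's real start position."""
    out = []
    for start_sample, segments in chunks:
        offset = start_sample / sample_rate
        for start, end, text in segments:
            out.append((offset + start, offset + end, text))
    return out


# Hypothetical data: (chunk start in samples, [(local start s, local end s, text)]).
# Splits were found 0.5 s before each nominal 30 s boundary.
chunks = [
    (0, [(0.0, 2.0, "a")]),
    (472_000, [(0.0, 2.0, "b")]),  # chunk really starts at 29.5 s
    (944_000, [(0.0, 2.0, "c")]),  # chunk really starts at 59.0 s
]

for naive, fixed in zip(naive_shift(chunks), shift_segments(chunks)):
    print(f"{fixed[2]}: naive start {naive[0]:.1f}s, "
          f"actual start {fixed[0]:.1f}s, drift {naive[0] - fixed[0]:.1f}s")
```

With this data the naive offsets are wrong by 0.5 s on the second chunk and 1.0 s on the third, which is the cumulative pattern the reporter describes.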
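Integrate FlashInfer's fused RMS+fp4 kernel (Issue #32612)
- Core issue: Proposes integrating FlashInfer's fused RMSNorm + FP4 quantization kernel into vLLM's existing fusion optimization path.
- Views and participation:
  - Proposer ProExpertProg described the technical approach.
  - Several contributors (sparkecho, Etelis) responded quickly and volunteered for the task.
  - ProExpertProg confirmed sparkecho as the owner and offered support.
- Current status: Open and claimed; the quick response shows strong community interest in performance work.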
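Support GPU UUIDs in CUDA_VISIBLE_DEVICES (Issue #32569)
- Core issue: vLLM does not currently accept GPU UUIDs in the CUDA_VISIBLE_DEVICES environment variable (only integer indices); using a UUID causes a crash.
- Views and discussion:
  - Reporter Zerohertz explained in detail that CUDA supports UUIDs natively, while vLLM's parsing is the limitation.
  - Contributor adityakamat24 volunteered for the task.
  - The reporter followed up to ask whether the earlier compatibility discussion about MIG (GPU instances) also applies to the general UUID case.
- Current status: Open with a volunteer; because it may interact with MIG device management, the design needs further clarification.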
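
A minimal sketch of tolerant parsing is shown below. It is not vLLM's implementation; the function name parse_visible_devices and the ("index", ...) / ("uuid", ...) tuple format are assumptions for illustration. It relies only on the fact that CUDA accepts integer indices as well as identifiers prefixed with GPU- or MIG- in CUDA_VISIBLE_DEVICES, and it classifies each entry instead of calling int() on everything.

```python
def parse_visible_devices(value: str):
    """Classify each comma-separated entry as an integer index or a UUID.

    Returns a list of ("index", int) or ("uuid", str) tuples.
    """
    entries = []
    for raw in value.split(","):
        token = raw.strip()
        if not token:
            continue  # ignore empty entries such as a trailing comma
        if token.isdigit():
            entries.append(("index", int(token)))
        elif token.startswith(("GPU-", "MIG-")):
            # Keep the identifier as a string; resolving it to a physical
            # device would need a driver query, which this sketch omits.
            entries.append(("uuid", token))
        else:
            raise ValueError(f"Unrecognized device entry: {token!r}")
    return entries


print(parse_visible_devices("0,1"))
print(parse_visible_devices("GPU-8932f937-d72c-4106-c12f-20bd9faed9f6, 2"))
```

In practice the input would come from os.environ.get("CUDA_VISIBLE_DEVICES", ""). Mapping a UUID back to a device index, and deciding how MIG instances should be counted, are the design questions the issue leaves open.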

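🔥 Hot Topics and Trends

Model and format support: The community keeps integrating new models (such as T5GEMMA2, the FireRedASR audio model, and reasoning and tool parsers for Pangu models) and rounding out support for existing model families (such as GLM-4.7 and MLA detection for GLM-4-MoE-Lite).
Performance optimization and kernel evolution: The main focus is on quantization (W4A16, W8A16_FP8, TMA alignment), attention backends (changes to MLA defaults), and speculative decoding (draft model support, EAGLE PTD). The MoE fused-layer refactor (Modular Kernel integration) also continues.
Platform compatibility and infrastructure: CI stability fixes for AMD ROCm, XPU support for new quantization schemes, and discussion of plugin architecture compatibility (aligning NPU with the GPU ModelRunner) show how complex multi-hardware support is.
API and frontend: The frontend API keeps improving, including sampling parameters for the Responses API, multimodal inputs (data_1/data_2) for the Score endpoint, and new Render endpoints for prompt preprocessing, all aimed at a better developer experience and API consistency.
Quality assurance and observability: An RFC on stronger model accuracy testing (Issue #32613) proposes a more robust, tiered accuracy test suite to catch the growing number of optimization-related regressions. Separately, a new PR (#32573) proposes token-level OTEL tracing for production observability.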

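🛠️ Key Technical Changes

PR #32624: [Model Runner V2] Initialize DP communication buffer: Fixes a critical bug in Model Runner V2 under data parallelism (DP) by making sure load_model triggers preparation of the communication buffers, so that DP combined with expert parallelism (EP) produces correct results under MRV2. This is an important step for MRV2 stability.
PR #32615 & #32529: MLA default backend change and fix: Sets FLASHINFER_MLA as the default MLA decode backend on Blackwell and TRTLLM as the default prefill backend, for better performance. The change had previously been reverted because of a test failure; with that failure fixed in #32529 (a numerical stability issue), it has been reapplied.
PR #32618: Full support for async scheduling + pipeline parallelism: Resolves the incompatibility between async scheduling and pipeline parallelism (PP) by broadcasting new_token_ids from the last pipeline stage. The PR reports a 15% end-to-end throughput gain and a 16% TPOT improvement in specific tests.
PR #32628: Parallel Token Decoding for EAGLE speculative decoding: Introduces PTD (Parallel Token Decoding), which lets the EAGLE draft model produce K draft tokens in one pass instead of sequentially. The PR claims a 6.5% to 16% throughput gain at an unchanged acceptance rate, with TPOT not growing linearly in K.
Issue #32613: RFC on more robust model accuracy testing: Proposes a systematic overhaul of the accuracy testing infrastructure to catch regressions from optimizations, new kernels, and newly added models earlier. It suggests consolidating tests, using more challenging evaluation sets, building a tiered test suite, and coordinating with the MoE refactor.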

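📈 Development Activity

Fast merging: 24 PRs were merged within 24 hours, which points to an efficient review and merge process.
Active core contributors: Core developers such as WoosukKwon (Model Runner V2), MatthewBonanni (attention backends), and robertgshaw2-redhat (MoE/quantization) keep submitting key architectural changes.
Engaged community: Several issues labeled "help wanted" and "good first issue" were claimed quickly (for example #32612 and #32588), a sign of a healthy, responsive contributor pool.
Cross-team collaboration: The AMD team (*-amd users) actively contributed code and discussion on quantization design and CI fixes, and worked well with upstream maintainers.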

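💡 Issues Worth Watching

Model accuracy regression risk (Issue #32613): As optimizations and kernels grow more complex, accuracy regressions occur frequently. The systematic testing improvements proposed in this RFC matter for vLLM's output quality; how and when they are implemented is worth watching.
Quantization architecture design (Issue #32589): This RFC on QuantKey design, opened by an AMD engineer, touches the core of the quantization abstraction. Its outcome will affect how future quantization schemes (including AMD Quark) are integrated.
Data-parallel synchronization trade-off (Issue #32140, closed): Under async scheduling, DP synchronization by default does not use NCCL (to avoid GPU synchronization), which can hurt performance for certain workloads such as pure prefill. A manual override is available as a stopgap, but adapting automatically to different workloads remains an open question.
Long-term stabilization of the plugin architecture (Issue #19161, closed): Although this RFC was closed for inactivity, the questions it raised about plugin API stability and compatibility testing remain important for attracting and sustaining third-party hardware ecosystems (such as Intel and Ascend).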

📋 Appendix: Detailed Lists

New Issues

#32626 [Bug]: TRTLLM Attention Failure with DP/EP | bug | by robertgshaw2-redhat (created: 2026-01-20 09:39 (UTC+8))
#32575 [Bug]: MinerU fails to start on DGX Spark using docker images vllm/vllm-openai:nightly-aarch64 | bug | by Essence9999 (created: 2026-01-19 15:59 (UTC+8))
#32569 [Feature]: Support GPU UUID in CUDA_VISIBLE_DEVICES | feature request | by Zerohertz (created: 2026-01-19 13:18 (UTC+8))
#32588 [Bug]: Wrong timestamps if audio > 30s | bug,help wanted,good first issue | by dr75 (created: 2026-01-19 18:27 (UTC+8))
#32612 [Feature]: Integrate RMS+fp4 fused kernel from FlashInfer | help wanted,good first issue,feature request,torch.compile | by ProExpertProg (created: 2026-01-20 01:14 (UTC+8))
#32613 [RFC]: More robust model accuracy testing with configurable and tiered coverage | RFC | by xinli-sw (created: 2026-01-20 03:39 (UTC+8))
#32611 [Bug]: CUDA driver initialization fails in forked child process due to undetected cuInit() call from pynvml | bug | by relic-yuexi (created: 2026-01-20 00:43 (UTC+8))
#32607 [CI Failure][Nightly 1-18]: Test Models Distirbuted | ci-failure | by robertgshaw2-redhat (created: 2026-01-19 23:06 (UTC+8))
#32589 [RFC]: About design of QuantKey | RFC | by hangy-amd (created: 2026-01-19 18:32 (UTC+8))
#32599 [Bug]: Serve with LoRA error “ValueError: base_model.model.lm_head.base_layer.weight is unsupported LoRA weight” | bug | by linyu09-oss (created: 2026-01-19 20:57 (UTC+8))
#32591 Speculative-Decoding Draft Model Exceeds Cuda Graph Range | bug | by tomasruizt (created: 2026-01-19 19:07 (UTC+8))
#32604 [Bug]: LMCache CPU kv offload cause decode speed degrade | bug | by aabbccddwasd (created: 2026-01-19 23:01 (UTC+8))
#32602 [Usage]: otel fastapi instrumentation doesn’t work | usage | by bondar-aleksandr (created: 2026-01-19 22:25 (UTC+8))
#32587 [Bug]: gpt-oss special tokens leak into tool names and responses despite reasoning_parser: "openai_gptoss" | bug | by shinyEazy (created: 2026-01-19 18:15 (UTC+8))
#32593 [Doc]: Any plan to make POST /reset_prefix_cache public? | documentation | by dfeigin-nv (created: 2026-01-19 19:24 (UTC+8))
#32600 [Bug]: | bug | by Scaramir (created: 2026-01-19 21:21 (UTC+8))
#32598 [Bug]: “Fatal Python error: none_dealloc” after 4 days deployment | bug | by Afanmin (created: 2026-01-19 20:43 (UTC+8))
#32595 [New Model]: Complexity (Pacific-Prime) - INL Dynamics + Token-Routed MLP | no labels | by Complexity-ML (created: 2026-01-19 19:38 (UTC+8))
#32584 [Bug]: Mistral 3.1 24B fails to load with sharded weights - “no module named ‘language_model’ in LlamaForCausalLM” error | bug | by chay1045 (created: 2026-01-19 17:59 (UTC+8))
#32568 [Bug]: ZeroDivisionError: float floor division by zero | bug | by nottydadie (created: 2026-01-19 12:53 (UTC+8))

Closed Issues

#16406 [New Model]: Multimodal Embedding Model GME. | stale | by Adenialzz (closed: 2026-01-20 10:13 (UTC+8))
#17103 [Bug]: AsyncLLM sleep then wake_up produces meaningless outputs | bug,stale | by wuxibin89 (closed: 2026-01-20 10:13 (UTC+8))
#19161 [RFC]: Enhancing vLLM Plugin Architecture | RFC,stale | by kzawora-intel (closed: 2026-01-20 10:13 (UTC+8))
#24917 [Performance]: Fuse padding onto GEMM by making the GEMM out-of-place | performance,stale | by ProExpertProg (closed: 2026-01-20 10:13 (UTC+8))
#32575 [Bug]: MinerU fails to start on DGX Spark using docker images vllm/vllm-openai:nightly-aarch64 | bug | by Essence9999 (closed: 2026-01-20 09:43 (UTC+8))
#31824 [Feature]: Remove parameter re-registration and names from kernel abstraction | help wanted,feature request | by ProExpertProg (closed: 2026-01-20 04:56 (UTC+8))
#27684 [Bug]: FlashInfer MLA prefill correctness issues | bug,nvidia | by MatthewBonanni (closed: 2026-01-20 02:51 (UTC+8))
#32223 [CI Failure]: Kernels Core Operation Test | ci-failure | by AndreasKaratzas (closed: 2026-01-20 02:23 (UTC+8))
#32221 [CI Failure]: Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy | ci-failure | by AndreasKaratzas (closed: 2026-01-20 02:22 (UTC+8))
#29529 [CI Failure]: mi325_2: Distributed Tests (2 GPUs) | rocm,ci-failure | by AndreasKaratzas (closed: 2026-01-20 02:19 (UTC+8))
#29536 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 | ci-failure | by AndreasKaratzas (closed: 2026-01-20 02:18 (UTC+8))
#29458 [CI Failure]: mi325_8: Language Models Tests (Extra Standard) %N | ci-failure | by AndreasKaratzas (closed: 2026-01-20 02:18 (UTC+8))
#29538 [CI Failure]: mi325_8: Kernels Quantization Test %N | ci-failure | by AndreasKaratzas (closed: 2026-01-20 02:17 (UTC+8))
#29510 [CI Failure]: mi325_1: V1 Test e2e + engine | ci-failure | by AndreasKaratzas (closed: 2026-01-20 02:17 (UTC+8))
#32607 [CI Failure][Nightly 1-18]: Test Models Distirbuted | ci-failure | by robertgshaw2-redhat (closed: 2026-01-19 23:58 (UTC+8))
#32591 Speculative-Decoding Draft Model Exceeds Cuda Graph Range | bug | by tomasruizt (closed: 2026-01-19 23:51 (UTC+8))
#30595 [Bug]: Unsatisfiable testing dependencies | bug | by BlankRH (closed: 2026-01-19 19:27 (UTC+8))
#32140 [Bug]: Default open async_scheduling, DP performance will deteriorate | bug | by lengrongfu (closed: 2026-01-19 17:30 (UTC+8))
#32464 [Bug]: Qwen3-VL-Reranker-8B vllm error | bug | by darvec112357 (closed: 2026-01-19 11:57 (UTC+8))
#32495 [CI Failure]: Entrypoints Integration Tests (Responses API) | help wanted,ci-failure | by robertgshaw2-redhat (closed: 2026-01-19 12:05 (UTC+8))
#32461 [Bug]: QWenBaseModel.embed_input_ids() got an unexpected keyword argument ‘multimodal_embeddings’ | bug | by honglyua-il (closed: 2026-01-19 11:06 (UTC+8))

New PRs

#32629 [Model Runner V2] Decouple temperature from penalties | v1 | by WoosukKwon (created: 2026-01-20 10:43 (UTC+8))
#32583 Pangu reasoning parser and tool parser | no labels | by Ji-Yao (created: 2026-01-19 17:58 (UTC+8))
#32628 feat: add Parallel Token Decoding for EAGLE | documentation,speculative-decoding,v1,llama | by hai-meh-cs (created: 2026-01-20 10:28 (UTC+8))
#32625 [Model Runner V2] Refactor get_cudagraph_and_dp_padding | v1,nvidia | by WoosukKwon (created: 2026-01-20 09:33 (UTC+8))
#32617 Implemented the T5GEMMA2 from Google | documentation,new-model,multi-modality | by akh64bit (created: 2026-01-20 05:07 (UTC+8))
#32627 build: remove unused FLASHINFER_AOT_COMPILE build argument | ci/build | by haitwang-cloud (created: 2026-01-20 10:10 (UTC+8))
#32574 [Frontend][2/n] Make pooling entrypoints request schema consensus | ChatRequest | documentation,frontend,qwen | by noooop (created: 2026-01-19 15:38 (UTC+8))
#32624 [Model Runner V2] Initialized communication buffer for DP | v1 | by WoosukKwon (created: 2026-01-20 09:02 (UTC+8))
#32578 [WIP][CT][XPU] Add W8A16_FP8 Linear Support | ci/build | by yiliu30 (created: 2026-01-19 16:39 (UTC+8))
#32594 [Platform] Replace hardcoded cuda by current_platform in ModelRunner | v1,nvidia | by gcanlin (created: 2026-01-19 19:31 (UTC+8))
#32615 [Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill | ready,ci/build,nvidia | by MatthewBonanni (created: 2026-01-20 04:27 (UTC+8))
#32608 [Misc] Remove unused ModelKeys | ready | by jeejeelee (created: 2026-01-19 23:10 (UTC+8))
#32623 [WIP][Attention] Abstract the MLA prefill backends | rocm,needs-rebase,v1,nvidia | by MatthewBonanni (created: 2026-01-20 06:20 (UTC+8))
#32616 Fix DeepEP high throughput for cutlass moe | nvidia | by czhu-cohere (created: 2026-01-20 04:51 (UTC+8))
#32618 [Feature] Fully support for async scheduling + PP, 15% E2E throughput improvement, 16% TPOT improvement | ready,v1 | by yewentao256 (created: 2026-01-20 05:18 (UTC+8))
#32621 [LoRA] Update LoRA expand kernel block_n calculation | no labels | by xyang16 (created: 2026-01-20 05:55 (UTC+8))
#32614 fix: Add glm4_moe_lite to MLA detection | no labels | by marksverdhei (created: 2026-01-20 04:00 (UTC+8))
#32622 [Feature] Enable flashinfer moe fp4 by default | no labels | by yewentao256 (created: 2026-01-20 06:12 (UTC+8))
#32619 [Perf] Create TMA-aligned input scale tensor for DeepGemm on Hopper | performance | by xyang16 (created: 2026-01-20 05:28 (UTC+8))
#32620 [CI/Build][AMD] Fix distributed/test_utils.py | rocm,ci/build | by rjrock (created: 2026-01-20 05:47 (UTC+8))
#32609 [Frontend] Add sampling parameters to Responses API | frontend | by DanielMe (created: 2026-01-19 23:42 (UTC+8))
#32570 [CI][amd] Revert NIXL connector change to avoid crash | rocm,ready,ci/build,kv-connector | by qli88 (created: 2026-01-19 13:42 (UTC+8))
#32580 [Doc] [ROCm] Update ROCm getting started doc | documentation,rocm | by tjtanaa (created: 2026-01-19 16:55 (UTC+8))
#32567 [MoE Refactor] Integrate Naive Prepare Finalize into MK | needs-rebase,nvidia | by robertgshaw2-redhat (created: 2026-01-19 11:27 (UTC+8))
#32610 [Refactor] Remove unused tpu files | tpu,ready,v1 | by yewentao256 (created: 2026-01-20 00:29 (UTC+8))
#32605 [Model] Use context managers for encoder- and LM-only mode | documentation,ready,v1,llama,qwen,kv-connector | by DarkLight1337 (created: 2026-01-19 23:03 (UTC+8))
#32603 [Bugfix] Fix Off-by-one error in _num_tokens_to_min_blocks calculation | bug,ready | by lingebeng (created: 2026-01-19 22:59 (UTC+8))
#32577 [Frontend] Score entrypoint support data_1 & data_2 and queries & documents as inputs | documentation,frontend,ready,qwen | by noooop (created: 2026-01-19 16:31 (UTC+8))
#32606 [CI] Test NIXL+Offloading connector | v1,kv-connector | by NickLucche (created: 2026-01-19 23:06 (UTC+8))
#32590 [Feature] Support LoRA MoE for bitsandbytes quantization | no labels | by Allyyi (created: 2026-01-19 18:54 (UTC+8))
#32601 [BugFix] Fix bad_words token conversion for tokenizers with different space encodings | bug | by ricky-chaoju (created: 2026-01-19 22:22 (UTC+8))
#32597 Add triton support for compressed_tensors GPTQ W4A16 on Tesla V100 (Volta CUDA 70) | performance,nvidia | by lapy (created: 2026-01-19 20:04 (UTC+8))
#32586 add firered audio model | new-model,frontend | by sxl1993 (created: 2026-01-19 18:14 (UTC+8))
#32596 [Bugfix][Speculative-Decoding] Extend CUDA Graph Ranges to Consider SD Tokens | bug,documentation,performance,speculative-decoding,v1,kv-connector,nvidia | by tomasruizt (created: 2026-01-19 20:00 (UTC+8))
#32592 [MTP] Delete useless op | deepseek | by DingYibin (created: 2026-01-19 19:08 (UTC+8))
#32585 [EC Connector] Optimize remote cache check in scheduler | v1 | by knlnguyen1802 (created: 2026-01-19 18:01 (UTC+8))
#32582 [Docs] Fix GitHub handle in governance process | documentation | by pacoxu (created: 2026-01-19 17:54 (UTC+8))
#32581 [BugFix] set correct media_type for streaming responses in PD separation proxy | bug,v1,kv-connector | by Sean-LL (created: 2026-01-19 17:48 (UTC+8))
#32576 Support Flashcom1 for Qwen3Next | qwen | by ZT-AIA (created: 2026-01-19 16:17 (UTC+8))
#32579 Pangu reasoning parser and tool parser | no labels | by Ji-Yao (created: 2026-01-19 16:55 (UTC+8))
#32572 Pangu reasoning parser and tool parser | no labels | by Ji-Yao (created: 2026-01-19 14:46 (UTC+8))
#32573 [Frontend][Core][Tracing] Add token-level OTEL tracing for prod observability | v1 | by zhanghaotong (created: 2026-01-19 15:29 (UTC+8))
#32571 Fix UnboundLocalError in unquantized MoE backend selection | meta-exported,fb-exported | by zengyijing (created: 2026-01-19 13:49 (UTC+8))

Merged PRs

#32625 [Model Runner V2] Refactor get_cudagraph_and_dp_padding | v1,nvidia | by WoosukKwon (merged: 2026-01-20 10:25 (UTC+8))
#31326 [Feat] allow inplace loading lora | documentation,frontend,ready,v1 | by Jackmin801 (merged: 2026-01-20 10:15 (UTC+8))
#32624 [Model Runner V2] Initialized communication buffer for DP | v1 | by WoosukKwon (merged: 2026-01-20 09:27 (UTC+8))
#32615 [Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill | ready,ci/build,nvidia | by MatthewBonanni (merged: 2026-01-20 07:41 (UTC+8))
#32608 [Misc] Remove unused ModelKeys | ready | by jeejeelee (merged: 2026-01-20 01:34 (UTC+8))
#32529 [BUGFIX] Fix test_mla_backends.py. Scale MLA projection weights to prevent numerical instability | bug,ready,v1 | by vadiklyutiy (merged: 2026-01-20 02:49 (UTC+8))
#32533 [Model Runner V2] Refactor dummy_run | v1,nvidia | by WoosukKwon (merged: 2026-01-20 06:50 (UTC+8))
#24322 feat: spec decode with draft models | documentation,performance,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1,multi-modality | by tomasruizt (merged: 2026-01-20 05:05 (UTC+8))
#28784 docs: prefix caching seems quite outdated | documentation,ready | by longregen (merged: 2026-01-20 03:49 (UTC+8))
#32349 [BugFix] Fix TRT-LLM NVFP4 DP/EP | bug,ready,nvidia | by jiahanc (merged: 2026-01-20 03:32 (UTC+8))
#32482 [CI] Add Helion as an optional dependency | ready,ci/build | by gmagogsfm (merged: 2026-01-20 03:09 (UTC+8))
#32570 [CI][amd] Revert NIXL connector change to avoid crash | rocm,ready,ci/build,kv-connector | by qli88 (merged: 2026-01-20 02:39 (UTC+8))
#32121 support dynamic resolution image encoding for Nemotron Nano VL | ready | by netanel-haber (merged: 2026-01-20 02:15 (UTC+8))
#32363 [ROCm][CI] Add ROCm attention backend support for EAGLE DP tests | rocm,ready,v1 | by AndreasKaratzas (merged: 2026-01-19 17:57 (UTC+8))
#32577 [Frontend] Score entrypoint support data_1 & data_2 and queries & documents as inputs | documentation,frontend,ready,qwen | by noooop (merged: 2026-01-19 22:07 (UTC+8))
#30802 Add support for LoRA adapters in Nemotron-H models | ready,nvidia | by danisereb (merged: 2026-01-19 22:30 (UTC+8))
#32560 [CI/Build] Fix dependency conflict between model-hosting-container-standards and starlette | ready,ci/build | by DanielMe (merged: 2026-01-19 19:27 (UTC+8))
#32340 [Nixl][Bugfix] Track nixl_num_kv_expired_reqs metric in Prometheus | bug,ready,kv-connector | by NickLucche (merged: 2026-01-19 20:28 (UTC+8))
#30385 [Core] Whisper support torch.compile | ready,v1 | by NickLucche (merged: 2026-01-19 18:02 (UTC+8))
#31386 [GLM-4.7] GLM Model support for GLM-Lite | documentation,performance,new-model,ready,tool-calling | by zRzRzRzRzRzRzR (merged: 2026-01-19 17:18 (UTC+8))
#32408 [CI][Hardware][AMD] Fix test_rotary_embedding_mla_cache_fused | rocm,ready | by mawong-amd (merged: 2026-01-19 16:25 (UTC+8))
#32473 [Frontend] Add render endpoints for prompt preprocessing | frontend,ready | by hyeongyun0916 (merged: 2026-01-19 12:21 (UTC+8))
#32531 [CI/Build] Use Common Event Map Fixture in Harmony / MCP Server Tests | ready,gpt-oss | by alex-jw-brooks (merged: 2026-01-19 12:05 (UTC+8))
#32462 [BugFix] Fix embed_input_ids argument error of QwenVLForConditionalGeneration | bug,ready,qwen | by honglyua-il (merged: 2026-01-19 11:06 (UTC+8))

PRs Closed Without Merging

#24065 [P/D]support for the v1/chat/completions interface to the disagg_proxy_server | performance,stale,kv-connector | by frankie-ys (closed: 2026-01-20 10:13 (UTC+8))
#24848 [Bugfix][Multi Modal] Fix oom bug when using sdap attention backend i… | stale,multi-modality | by DamonJiang777 (closed: 2026-01-20 10:13 (UTC+8))
#31446 Adding the support t5gemma-2 | documentation,new-model,rocm,ci/build,multi-modality | by akh64bit (closed: 2026-01-20 05:10 (UTC+8))
#31608 [OpenAI] Bound Responses API store with LRU eviction | frontend,needs-rebase | by maylikenoother (closed: 2026-01-20 03:05 (UTC+8))
#32356 [CI][AMD][BugFix][FP8] Fix test_concat_and_cache_mla_rope_fused by fixing fp8 conversion routines and using proper atol/rtol | bug,rocm,ready | by rasmith (closed: 2026-01-20 01:44 (UTC+8))
#32334 [1/x][Frontend] A graceful shutdown implementation as per RFC #24885 | frontend,needs-rebase,v1 | by wseaton (closed: 2026-01-19 23:19 (UTC+8))
#32507 [Fix] test test_function_calling_with_streaming_types about mcp | needs-rebase | by lengrongfu (closed: 2026-01-19 20:32 (UTC+8))
#32596 [Bugfix][Speculative-Decoding] Extend CUDA Graph Ranges to Consider SD Tokens | bug,documentation,performance,speculative-decoding,v1,kv-connector,nvidia | by tomasruizt (closed: 2026-01-19 20:03 (UTC+8))
#32579 Pangu reasoning parser and tool parser | no labels | by Ji-Yao (closed: 2026-01-19 17:23 (UTC+8))
#32572 Pangu reasoning parser and tool parser | no labels | by Ji-Yao (closed: 2026-01-19 16:53 (UTC+8))
#32566 add openPangu VL | documentation,new-model | by Emilie1001 (closed: 2026-01-19 15:17 (UTC+8))
#32122 Oracle improvements | documentation,rocm,ready,needs-rebase,v1,llama,gpt-oss,nvidia | by robertgshaw2-redhat (closed: 2026-01-19 11:29 (UTC+8))