vLLM Development Activity Report – 2026-01-19
Time window: 2026-01-19 10:56 (UTC+8) – 2026-01-20 10:56 (UTC+8)
Statistics: 20 new issues | 21 issues closed | 43 new PRs | 24 PRs merged | 12 PRs closed without merging
📊 Daily Development Summary
Over the past 24 hours, the vLLM project kept up a high development tempo, merging 24 PRs and processing 41 issues. Progress centered on continued optimization of Model Runner V2, multimodal model support, and assorted performance optimizations and bug fixes. Community attention leaned clearly toward speculative-decoding enhancements, core architecture evolution (e.g., quantization scheme design), and cross-platform compatibility and performance, particularly on AMD ROCm.
🎯 AMD/ROCm Ecosystem Activity
AMD-ecosystem activity in this window centered on the quantization scheme design discussion and CI/test fixes.
- RFC: QuantKey design (Issue #32589)
  - Contributor: hangy-amd (AMD employee)
  - Content: Proposes refactoring the existing QuantKey design, arguing that implicitly deriving the quantization scheme (per_tensor/per_channel) from GroupShape is an unclear abstraction and handles higher-dimensional tensors poorly.
  - Proposal: Add an explicit quant_scheme enum field to QuantKey so the scheme is stated directly, and let that field drive the scale-factor shape logic.
  - Status: Core developer ProExpertProg replied, preferring to keep GroupShape as the more general abstraction, since it uniformly represents per-tensor/per-token/per-group quantization, while an enum scheme may require extra metadata. The discussion is open with no consensus yet.
- CI fix: distributed test utilities (PR #32620)
  - Contributor: rjrock
  - Content: Fixes a test failure in the AMD distributed tests (4 GPUs) group; specifically, test_cuda_device_count_stateless in distributed/test_utils.py, resolving an AttributeError caused by CUDA_VISIBLE_DEVICES environment-variable handling.
  - Impact: Improves the stability of AMD CI and ensures the affected functional tests run correctly on ROCm.
- CI fix: NIXL connector configuration (PR #32570)
  - Contributor: qli88
  - Content: To avoid crashes on ROCm, reverts the NIXL connector tests' environment variable from UCX_MEM_MMAP_HOOK_MODE=none back to UCX_RCACHE_MAX_UNRELEASED=1024, resolving failures caused by the UCX memory-mapping hook mode being incompatible with ROCm.
  - Impact: Restores the NIXL test group in AMD CI; a long-term solution was coordinated with the NIXL team (the setting will become a default in the future).
💬 High-Traffic Discussions
- Whisper long-audio timestamp drift (Issue #32588)
  - Core issue: When transcribing audio with vLLM's Whisper model, segment-level timestamps drift cumulatively as audio length grows (about 0.5 s per segment).
  - Views and debate:
    - Reporter dr75 traced the problem to the audio-chunking logic, which searches a 1-second window for a silent split point but does not compensate for that shift when generating timestamps; they also flagged the related documentation as inaccurate.
    - Maintainer NickLucche applied the "help wanted" label to solicit a community contribution.
    - The reporter later said that, given their own timeline, they would supply the fix themselves; another contributor, aadeshupadhyay, also offered to take the issue.
  - Status: Open; dr75 plans to submit a fix soon.
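A toy model of the drift mechanism (hypothetical names; vLLM's real chunking code is more involved): if silence-aligned splits land up to about a second before each nominal 30 s boundary, but segments are stamped at exact 30 s multiples anyway, the error accumulates per chunk:

```python
# Toy illustration of the drift described in #32588; not vLLM code.
def stamped_vs_actual(split_points, nominal=30.0):
    """Pair each chunk's buggy nominal stamp with its actual start time."""
    return [(i * nominal, start) for i, start in enumerate(split_points)]

# Splits land ~0.5 s early each time, so the stamp error grows linearly:
# 0.0 s, 0.5 s, 1.0 s, ...
pairs = stamped_vs_actual([0.0, 29.5, 59.0])
drift = [stamp - actual for stamp, actual in pairs]
```

The fix sketched in the thread amounts to using the actual split points (the second element of each pair) rather than the nominal multiples.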
- Integrating FlashInfer's fused RMS+fp4 kernel (Issue #32612)
  - Core issue: Proposal to integrate FlashInfer's fused RMSNorm + FP4 quantization kernel into vLLM's existing fusion-optimization paths.
  - Views and participation:
    - Proposer ProExpertProg laid out the technical approach.
    - Several contributors (sparkecho, Etelis) responded quickly and volunteered; ProExpertProg assigned the task to sparkecho and offered support.
  - Status: Open; the task is claimed, and the community shows strong enthusiasm for performance work.
- Supporting GPU UUIDs in CUDA_VISIBLE_DEVICES (Issue #32569)
  - Core issue: vLLM currently supports only integer indices in the CUDA_VISIBLE_DEVICES environment variable; using a GPU UUID causes a crash.
  - Views and discussion:
    - Reporter Zerohertz detailed that CUDA natively supports UUIDs, while vLLM has a parsing limitation.
    - Contributor adityakamat24 volunteered to take the task.
    - The reporter followed up asking whether an earlier discussion about MIG (GPU instance) compatibility also applies to the general UUID case.
  - Status: Open and claimed, but the potential interaction with MIG device management means the design still needs clarification.
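A minimal sketch of what UUID-aware parsing might look like (the helper name and regex are assumptions for illustration, not vLLM's actual device-handling code):

```python
import re

# Hypothetical helper; vLLM's real device parsing lives elsewhere.
# CUDA accepts "GPU-<uuid>" (and "MIG-..." instances) alongside integers.
_UUID_RE = re.compile(r"^(GPU|MIG)-[0-9a-fA-F-]+$")

def parse_visible_devices(value: str) -> list:
    """Accept integer indices and GPU-/MIG- UUIDs, as the CUDA driver does."""
    devices = []
    for tok in filter(None, (t.strip() for t in value.split(","))):
        if tok.lstrip("-").isdigit():
            devices.append(int(tok))
        elif _UUID_RE.match(tok):
            devices.append(tok)
        else:
            raise ValueError(f"unrecognized CUDA_VISIBLE_DEVICES entry: {tok!r}")
    return devices
```

A real fix would also need to resolve UUIDs (and possibly MIG instances) to device ordinals, e.g. via NVML, which is where the MIG question raised in the thread comes in.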
🔥 Hot Topics and Trends
- Model and format support: The community continues to integrate new models (e.g., T5GEMMA2, the FireRedASR audio model, Pangu inference plus its reasoning/tool parsers) and to round out existing families (e.g., GLM-4.7, MLA detection for GLM-4-MoE-Lite).
- Performance and kernel evolution: Optimization around quantization (W4A16, W8A16_FP8, TMA-alignment), attention backends (MLA default changes), and speculative decoding (draft-model support, EAGLE PTD) remains the core focus; the MoE fused-layer refactor (Modular Kernel integration) also keeps advancing.
- Platform compatibility and infrastructure: ROCm CI stabilization, XPU support for new quantization schemes, and plugin-architecture compatibility discussions (aligning NPUs with the GPU ModelRunner) highlight the complexity of supporting multiple hardware ecosystems.
- API and frontend polish: The frontend keeps gaining capability, including sampling parameters for the Responses API, multimodal inputs (data_1/data_2) for the Score endpoint, and a new Render preprocessing endpoint, all aimed at developer experience and API consistency.
- Quality assurance and observability: An RFC on stronger model-accuracy testing (Issue #32613) proposes a more robust, tiered accuracy-test suite to counter increasingly frequent optimization regressions; production-grade token-level OTEL tracing also landed, improving observability.
🛠️ Key Technical Changes
- PR #32624: [Model Runner V2] Initialize communication buffer for DP: Fixes a critical Model Runner V2 bug under data parallelism (DP), ensuring load_model triggers communication-buffer setup so that DP + expert parallelism (EP) produces correct results under MRV2. An important step for MRV2 stability.
- PR #32615 & #32529: MLA default-backend change and fix: Makes FLASHINFER_MLA the default MLA decode backend and TRTLLM the default prefill backend on Blackwell for better performance. The change had previously been reverted over a test failure (#32529, a numerical-stability issue since fixed) and has now been reapplied.
- PR #32618: Full support for async scheduling + pipeline parallelism: Broadcasts new_token_ids from the last pipeline stage, resolving the incompatibility between async scheduling and pipeline parallelism (PP). Reports 15% end-to-end throughput and 16% TPOT improvements in specific tests, a significant optimization for parallel strategies.
- PR #32628: Parallel token decoding for EAGLE speculative decoding: Introduces PTD (Parallel Token Decoding), letting the EAGLE draft model generate K draft tokens in a single pass rather than sequentially. Claims 6.5%–16% throughput gains with acceptance rates preserved, and TPOT that no longer grows linearly with K.
- Issue #32613: RFC on more robust model-accuracy testing: Proposes systematically improving the accuracy-testing infrastructure to catch regressions from optimizations, new kernels, and new models earlier; suggests consolidated tests, more challenging evaluation sets, a tiered test hierarchy, and coordination with the MoE refactor.
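As a deliberately crude cost model of why PTD can keep TPOT roughly flat in K (a hypothetical simplification; real EAGLE drafting, verification, and acceptance dynamics are far more involved):

```python
# Hypothetical forward-pass count for one speculative step; not vLLM code.
def forwards_per_step(k: int, parallel: bool) -> int:
    """Passes needed to propose K draft tokens plus one target verification."""
    draft_passes = 1 if parallel else k  # PTD drafts all K tokens at once
    verify_passes = 1                    # target verifies the K drafts together
    return draft_passes + verify_passes
```

Sequential drafting scales linearly in K (K+1 passes per step) while the PTD sketch stays constant at 2, matching the claim that TPOT no longer grows linearly with K.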
📈 Development Activity Observations
- Efficient merging: 24 PRs merged within 24 hours shows an efficient core review-and-merge pipeline.
- Active core contributors: WoosukKwon (Model Runner V2), MatthewBonanni (attention backends), robertgshaw2-redhat (MoE/quantization), and other core developers continue to land key architectural changes.
- Strong community engagement: Multiple "help wanted" and "good first issue" items were claimed quickly (e.g., #32612, #32588), indicating a healthy, responsive contributor pool.
- Cross-team collaboration: The AMD team (*-amd accounts) actively contributed code and discussion on quantization design and CI fixes, interacting well with upstream maintainers.
💡 Issues Worth Watching
- Model-accuracy regression risk (Issue #32613): As optimizations and kernels grow more complex, accuracy regressions are surfacing more often. The systematic testing enhancements this RFC proposes are critical to safeguarding vLLM's output quality; how and when they are implemented is worth watching.
- Quantization architecture design (Issue #32589): The QuantKey RFC raised by an AMD engineer touches the core of the quantization abstraction; its outcome will shape how all future quantization schemes (including AMD Quark) are integrated.
- DP synchronization performance trade-off (Issue #32140, closed): Under async scheduling, NCCL is disabled by default for DP synchronization (to avoid GPU syncs), which can degrade performance on prefill-only and similar workloads. A manual override serves as a stopgap, but adapting intelligently to different workloads remains open.
- Long-term plugin-architecture stabilization (Issue #19161, closed): Although the RFC was closed for inactivity, the questions it raised about plugin API stability and compatibility testing remain important in the long run for attracting and retaining third-party hardware ecosystems (e.g., Intel, Ascend).
📋 Appendix: Detailed Data
New Issues
- #32626 [Bug]: TRTLLM Attention Failure with DP/EP — bug — by robertgshaw2-redhat (created: 2026-01-20 09:39 (UTC+8))
- #32575 [Bug]: MinerU fails to start on DGX Spark using docker images vllm/vllm-openai:nightly-aarch64 — bug — by Essence9999 (created: 2026-01-19 15:59 (UTC+8))
- #32569 [Feature]: Support GPU UUID in CUDA_VISIBLE_DEVICES — feature request — by Zerohertz (created: 2026-01-19 13:18 (UTC+8))
- #32588 [Bug]: Wrong timestamps if audio > 30s — bug, help wanted, good first issue — by dr75 (created: 2026-01-19 18:27 (UTC+8))
- #32612 [Feature]: Integrate RMS+fp4 fused kernel from FlashInfer — help wanted, good first issue, feature request, torch.compile — by ProExpertProg (created: 2026-01-20 01:14 (UTC+8))
- #32613 [RFC]: More robust model accuracy testing with configurable and tiered coverage — RFC — by xinli-sw (created: 2026-01-20 03:39 (UTC+8))
- #32611 [Bug]: CUDA driver initialization fails in forked child process due to undetected cuInit() call from pynvml — bug — by relic-yuexi (created: 2026-01-20 00:43 (UTC+8))
- #32607 [CI Failure][Nightly 1-18]: Test Models Distirbuted — ci-failure — by robertgshaw2-redhat (created: 2026-01-19 23:06 (UTC+8))
- #32589 [RFC]: About design of QuantKey — RFC — by hangy-amd (created: 2026-01-19 18:32 (UTC+8))
- #32599 [Bug]: Serve with LoRA error “ValueError: base_model.model.lm_head.base_layer.weight is unsupported LoRA weight” — bug — by linyu09-oss (created: 2026-01-19 20:57 (UTC+8))
- #32591 Speculative-Decoding Draft Model Exceeds Cuda Graph Range — bug — by tomasruizt (created: 2026-01-19 19:07 (UTC+8))
- #32604 [Bug]: LMCache CPU kv offload cause decode speed degrade — bug — by aabbccddwasd (created: 2026-01-19 23:01 (UTC+8))
- #32602 [Usage]: otel fastapi instrumentation doesn’t work — usage — by bondar-aleksandr (created: 2026-01-19 22:25 (UTC+8))
- #32587 [Bug]: gpt-oss special tokens leak into tool names and responses despite reasoning_parser: "openai_gptoss" — bug — by shinyEazy (created: 2026-01-19 18:15 (UTC+8))
- #32593 [Doc]: Any plan to make POST /reset_prefix_cache public? — documentation — by dfeigin-nv (created: 2026-01-19 19:24 (UTC+8))
- #32600 [Bug]: — bug — by Scaramir (created: 2026-01-19 21:21 (UTC+8))
- #32598 [Bug]: “Fatal Python error: none_dealloc” after 4 days deployment — bug — by Afanmin (created: 2026-01-19 20:43 (UTC+8))
- #32595 [New Model]: Complexity (Pacific-Prime) - INL Dynamics + Token-Routed MLP — no labels — by Complexity-ML (created: 2026-01-19 19:38 (UTC+8))
- #32584 [Bug]: Mistral 3.1 24B fails to load with sharded weights - “no module named ‘language_model’ in LlamaForCausalLM” error — bug — by chay1045 (created: 2026-01-19 17:59 (UTC+8))
- #32568 [Bug]: ZeroDivisionError: float floor division by zero — bug — by nottydadie (created: 2026-01-19 12:53 (UTC+8))
Closed Issues
- #16406 [New Model]: Multimodal Embedding Model GME. — stale — by Adenialzz (closed: 2026-01-20 10:13 (UTC+8))
- #17103 [Bug]: AsyncLLM sleep then wake_up produces meaningless outputs — bug, stale — by wuxibin89 (closed: 2026-01-20 10:13 (UTC+8))
- #19161 [RFC]: Enhancing vLLM Plugin Architecture — RFC, stale — by kzawora-intel (closed: 2026-01-20 10:13 (UTC+8))
- #24917 [Performance]: Fuse padding onto GEMM by making the GEMM out-of-place — performance, stale — by ProExpertProg (closed: 2026-01-20 10:13 (UTC+8))
- #32575 [Bug]: MinerU fails to start on DGX Spark using docker images vllm/vllm-openai:nightly-aarch64 — bug — by Essence9999 (closed: 2026-01-20 09:43 (UTC+8))
- #31824 [Feature]: Remove parameter re-registration and names from kernel abstraction — help wanted, feature request — by ProExpertProg (closed: 2026-01-20 04:56 (UTC+8))
- #27684 [Bug]: FlashInfer MLA prefill correctness issues — bug, nvidia — by MatthewBonanni (closed: 2026-01-20 02:51 (UTC+8))
- #32223 [CI Failure]: Kernels Core Operation Test — ci-failure — by AndreasKaratzas (closed: 2026-01-20 02:23 (UTC+8))
- #32221 [CI Failure]: Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy — ci-failure — by AndreasKaratzas (closed: 2026-01-20 02:22 (UTC+8))
- #29529 [CI Failure]: mi325_2: Distributed Tests (2 GPUs) — rocm, ci-failure — by AndreasKaratzas (closed: 2026-01-20 02:19 (UTC+8))
- #29536 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 — ci-failure — by AndreasKaratzas (closed: 2026-01-20 02:18 (UTC+8))
- #29458 [CI Failure]: mi325_8: Language Models Tests (Extra Standard) %N — ci-failure — by AndreasKaratzas (closed: 2026-01-20 02:18 (UTC+8))
- #29538 [CI Failure]: mi325_8: Kernels Quantization Test %N — ci-failure — by AndreasKaratzas (closed: 2026-01-20 02:17 (UTC+8))
- #29510 [CI Failure]: mi325_1: V1 Test e2e + engine — ci-failure — by AndreasKaratzas (closed: 2026-01-20 02:17 (UTC+8))
- #32607 [CI Failure][Nightly 1-18]: Test Models Distirbuted — ci-failure — by robertgshaw2-redhat (closed: 2026-01-19 23:58 (UTC+8))
- #32591 Speculative-Decoding Draft Model Exceeds Cuda Graph Range — bug — by tomasruizt (closed: 2026-01-19 23:51 (UTC+8))
- #30595 [Bug]: Unsatisfiable testing dependencies — bug — by BlankRH (closed: 2026-01-19 19:27 (UTC+8))
- #32140 [Bug]: Default open async_scheduling, DP performance will deteriorate — bug — by lengrongfu (closed: 2026-01-19 17:30 (UTC+8))
- #32464 [Bug]: Qwen3-VL-Reranker-8B vllm error — bug — by darvec112357 (closed: 2026-01-19 11:57 (UTC+8))
- #32495 [CI Failure]: Entrypoints Integration Tests (Responses API) — help wanted, ci-failure — by robertgshaw2-redhat (closed: 2026-01-19 12:05 (UTC+8))
- #32461 [Bug]: QWenBaseModel.embed_input_ids() got an unexpected keyword argument ‘multimodal_embeddings’ — bug — by honglyua-il (closed: 2026-01-19 11:06 (UTC+8))
New PRs
- #32629 [Model Runner V2] Decouple temperature from penalties — v1 — by WoosukKwon (created: 2026-01-20 10:43 (UTC+8))
- #32583 Pangu reasoning parser and tool parser — no labels — by Ji-Yao (created: 2026-01-19 17:58 (UTC+8))
- #32628 feat: add Parallel Token Decoding for EAGLE — documentation, speculative-decoding, v1, llama — by hai-meh-cs (created: 2026-01-20 10:28 (UTC+8))
- #32625 [Model Runner V2] Refactor get_cudagraph_and_dp_padding — v1, nvidia — by WoosukKwon (created: 2026-01-20 09:33 (UTC+8))
- #32617 Implemented the T5GEMMA2 from Google — documentation, new-model, multi-modality — by akh64bit (created: 2026-01-20 05:07 (UTC+8))
- #32627 build: remove unused FLASHINFER_AOT_COMPILE build argument — ci/build — by haitwang-cloud (created: 2026-01-20 10:10 (UTC+8))
- #32574 [Frontend][2/n] Make pooling entrypoints request schema consensus ChatRequest — documentation, frontend, qwen — by noooop (created: 2026-01-19 15:38 (UTC+8))
- #32624 [Model Runner V2] Initialized communication buffer for DP — v1 — by WoosukKwon (created: 2026-01-20 09:02 (UTC+8))
- #32578 [WIP][CT][XPU] Add W8A16_FP8 Linear Support — ci/build — by yiliu30 (created: 2026-01-19 16:39 (UTC+8))
- #32594 [Platform] Replace hardcoded cuda by current_platform in ModelRunner — v1, nvidia — by gcanlin (created: 2026-01-19 19:31 (UTC+8))
- #32615 [Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill — ready, ci/build, nvidia — by MatthewBonanni (created: 2026-01-20 04:27 (UTC+8))
- #32608 [Misc] Remove unused ModelKeys — ready — by jeejeelee (created: 2026-01-19 23:10 (UTC+8))
- #32623 [WIP][Attention] Abstract the MLA prefill backends — rocm, needs-rebase, v1, nvidia — by MatthewBonanni (created: 2026-01-20 06:20 (UTC+8))
- #32616 Fix DeepEP high throughput for cutlass moe — nvidia — by czhu-cohere (created: 2026-01-20 04:51 (UTC+8))
- #32618 [Feature] Fully support for async scheduling + PP, 15% E2E throughput improvement, 16% TPOT improvement — ready, v1 — by yewentao256 (created: 2026-01-20 05:18 (UTC+8))
- #32621 [LoRA] Update LoRA expand kernel block_n calculation — no labels — by xyang16 (created: 2026-01-20 05:55 (UTC+8))
- #32614 fix: Add glm4_moe_lite to MLA detection — no labels — by marksverdhei (created: 2026-01-20 04:00 (UTC+8))
- #32622 [Feature] Enable flashinfer moe fp4 by default — no labels — by yewentao256 (created: 2026-01-20 06:12 (UTC+8))
- #32619 [Perf] Create TMA-aligned input scale tensor for DeepGemm on Hopper — performance — by xyang16 (created: 2026-01-20 05:28 (UTC+8))
- #32620 [CI/Build][AMD] Fix distributed/test_utils.py — rocm, ci/build — by rjrock (created: 2026-01-20 05:47 (UTC+8))
- #32609 [Frontend] Add sampling parameters to Responses API — frontend — by DanielMe (created: 2026-01-19 23:42 (UTC+8))
- #32570 [CI][amd] Revert NIXL connector change to avoid crash — rocm, ready, ci/build, kv-connector — by qli88 (created: 2026-01-19 13:42 (UTC+8))
- #32580 [Doc] [ROCm] Update ROCm getting started doc — documentation, rocm — by tjtanaa (created: 2026-01-19 16:55 (UTC+8))
- #32567 [MoE Refactor] Integrate Naive Prepare Finalize into MK — needs-rebase, nvidia — by robertgshaw2-redhat (created: 2026-01-19 11:27 (UTC+8))
- #32610 [Refactor] Remove unused tpu files — tpu, ready, v1 — by yewentao256 (created: 2026-01-20 00:29 (UTC+8))
- #32605 [Model] Use context managers for encoder- and LM-only mode — documentation, ready, v1, llama, qwen, kv-connector — by DarkLight1337 (created: 2026-01-19 23:03 (UTC+8))
- #32603 [Bugfix] Fix Off-by-one error in _num_tokens_to_min_blocks calculation — bug, ready — by lingebeng (created: 2026-01-19 22:59 (UTC+8))
- #32577 [Frontend] Score entrypoint support data_1 & data_2 and queries & documents as inputs — documentation, frontend, ready, qwen — by noooop (created: 2026-01-19 16:31 (UTC+8))
- #32606 [CI] Test NIXL+Offloading connector — v1, kv-connector — by NickLucche (created: 2026-01-19 23:06 (UTC+8))
- #32590 [Feature] Support LoRA MoE for bitsandbytes quantization — no labels — by Allyyi (created: 2026-01-19 18:54 (UTC+8))
- #32601 [BugFix] Fix bad_words token conversion for tokenizers with different space encodings — bug — by ricky-chaoju (created: 2026-01-19 22:22 (UTC+8))
- #32597 Add triton support for compressed_tensors GPTQ W4A16 on Tesla V100 (Volta CUDA 70) — performance, nvidia — by lapy (created: 2026-01-19 20:04 (UTC+8))
- #32586 add firered audio model — new-model, frontend — by sxl1993 (created: 2026-01-19 18:14 (UTC+8))
- #32596 [Bugfix][Speculative-Decoding] Extend CUDA Graph Ranges to Consider SD Tokens — bug, documentation, performance, speculative-decoding, v1, kv-connector, nvidia — by tomasruizt (created: 2026-01-19 20:00 (UTC+8))
- #32592 [MTP] Delete useless op — deepseek — by DingYibin (created: 2026-01-19 19:08 (UTC+8))
- #32585 [EC Connector] Optimize remote cache check in scheduler — v1 — by knlnguyen1802 (created: 2026-01-19 18:01 (UTC+8))
- #32582 [Docs] Fix GitHub handle in governance process — documentation — by pacoxu (created: 2026-01-19 17:54 (UTC+8))
- #32581 [BugFix] set correct media_type for streaming responses in PD separation proxy — bug, v1, kv-connector — by Sean-LL (created: 2026-01-19 17:48 (UTC+8))
- #32576 Support Flashcom1 for Qwen3Next — qwen — by ZT-AIA (created: 2026-01-19 16:17 (UTC+8))
- #32579 Pangu reasoning parser and tool parser — no labels — by Ji-Yao (created: 2026-01-19 16:55 (UTC+8))
- #32572 Pangu reasoning parser and tool parser — no labels — by Ji-Yao (created: 2026-01-19 14:46 (UTC+8))
- #32573 [Frontend][Core][Tracing] Add token-level OTEL tracing for prod observability — v1 — by zhanghaotong (created: 2026-01-19 15:29 (UTC+8))
- #32571 Fix UnboundLocalError in unquantized MoE backend selection — meta-exported, fb-exported — by zengyijing (created: 2026-01-19 13:49 (UTC+8))
Merged PRs
- #32625 [Model Runner V2] Refactor get_cudagraph_and_dp_padding — v1, nvidia — by WoosukKwon (merged: 2026-01-20 10:25 (UTC+8))
- #31326 [Feat] allow inplace loading lora — documentation, frontend, ready, v1 — by Jackmin801 (merged: 2026-01-20 10:15 (UTC+8))
- #32624 [Model Runner V2] Initialized communication buffer for DP — v1 — by WoosukKwon (merged: 2026-01-20 09:27 (UTC+8))
- #32615 [Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill — ready, ci/build, nvidia — by MatthewBonanni (merged: 2026-01-20 07:41 (UTC+8))
- #32608 [Misc] Remove unused ModelKeys — ready — by jeejeelee (merged: 2026-01-20 01:34 (UTC+8))
- #32529 [BUGFIX] Fix test_mla_backends.py. Scale MLA projection weights to prevent numerical instability — bug, ready, v1 — by vadiklyutiy (merged: 2026-01-20 02:49 (UTC+8))
- #32533 [Model Runner V2] Refactor dummy_run — v1, nvidia — by WoosukKwon (merged: 2026-01-20 06:50 (UTC+8))
- #24322 feat: spec decode with draft models — documentation, performance, rocm, structured-output, frontend, speculative-decoding, ready, ci/build, v1, multi-modality — by tomasruizt (merged: 2026-01-20 05:05 (UTC+8))
- #28784 docs: prefix caching seems quite outdated — documentation, ready — by longregen (merged: 2026-01-20 03:49 (UTC+8))
- #32349 [BugFix] Fix TRT-LLM NVFP4 DP/EP — bug, ready, nvidia — by jiahanc (merged: 2026-01-20 03:32 (UTC+8))
- #32482 [CI] Add Helion as an optional dependency — ready, ci/build — by gmagogsfm (merged: 2026-01-20 03:09 (UTC+8))
- #32570 [CI][amd] Revert NIXL connector change to avoid crash — rocm, ready, ci/build, kv-connector — by qli88 (merged: 2026-01-20 02:39 (UTC+8))
- #32121 support dynamic resolution image encoding for Nemotron Nano VL — ready — by netanel-haber (merged: 2026-01-20 02:15 (UTC+8))
- #32363 [ROCm][CI] Add ROCm attention backend support for EAGLE DP tests — rocm, ready, v1 — by AndreasKaratzas (merged: 2026-01-19 17:57 (UTC+8))
- #32577 [Frontend] Score entrypoint support data_1 & data_2 and queries & documents as inputs — documentation, frontend, ready, qwen — by noooop (merged: 2026-01-19 22:07 (UTC+8))
- #30802 Add support for LoRA adapters in Nemotron-H models — ready, nvidia — by danisereb (merged: 2026-01-19 22:30 (UTC+8))
- #32560 [CI/Build] Fix dependency conflict between model-hosting-container-standards and starlette — ready, ci/build — by DanielMe (merged: 2026-01-19 19:27 (UTC+8))
- #32340 [Nixl][Bugfix] Track nixl_num_kv_expired_reqs metric in Prometheus — bug, ready, kv-connector — by NickLucche (merged: 2026-01-19 20:28 (UTC+8))
- #30385 [Core] Whisper support torch.compile — ready, v1 — by NickLucche (merged: 2026-01-19 18:02 (UTC+8))
- #31386 [GLM-4.7] GLM Model support for GLM-Lite — documentation, performance, new-model, ready, tool-calling — by zRzRzRzRzRzRzR (merged: 2026-01-19 17:18 (UTC+8))
- #32408 [CI][Hardware][AMD] Fix test_rotary_embedding_mla_cache_fused — rocm, ready — by mawong-amd (merged: 2026-01-19 16:25 (UTC+8))
- #32473 [Frontend] Add render endpoints for prompt preprocessing — frontend, ready — by hyeongyun0916 (merged: 2026-01-19 12:21 (UTC+8))
- #32531 [CI/Build] Use Common Event Map Fixture in Harmony / MCP Server Tests — ready, gpt-oss — by alex-jw-brooks (merged: 2026-01-19 12:05 (UTC+8))
- #32462 [BugFix] Fix embed_input_ids argument error of QwenVLForConditionalGeneration — bug, ready, qwen — by honglyua-il (merged: 2026-01-19 11:06 (UTC+8))
PRs Closed Without Merging
- #24065 [P/D]support for the v1/chat/completions interface to the disagg_proxy_server — performance, stale, kv-connector — by frankie-ys (closed: 2026-01-20 10:13 (UTC+8))
- #24848 [Bugfix][Multi Modal] Fix oom bug when using sdap attention backend i… — stale, multi-modality — by DamonJiang777 (closed: 2026-01-20 10:13 (UTC+8))
- #31446 Adding the support t5gemma-2 — documentation, new-model, rocm, ci/build, multi-modality — by akh64bit (closed: 2026-01-20 05:10 (UTC+8))
- #31608 [OpenAI] Bound Responses API store with LRU eviction — frontend, needs-rebase — by maylikenoother (closed: 2026-01-20 03:05 (UTC+8))
- #32356 [CI][AMD][BugFix][FP8] Fix test_concat_and_cache_mla_rope_fused by fixing fp8 conversion routines and using proper atol/rtol — bug, rocm, ready — by rasmith (closed: 2026-01-20 01:44 (UTC+8))
- #32334 [1/x][Frontend] A graceful shutdown implementation as per RFC #24885 — frontend, needs-rebase, v1 — by wseaton (closed: 2026-01-19 23:19 (UTC+8))
- #32507 [Fix] test test_function_calling_with_streaming_types about mcp — needs-rebase — by lengrongfu (closed: 2026-01-19 20:32 (UTC+8))
- #32596 [Bugfix][Speculative-Decoding] Extend CUDA Graph Ranges to Consider SD Tokens — bug, documentation, performance, speculative-decoding, v1, kv-connector, nvidia — by tomasruizt (closed: 2026-01-19 20:03 (UTC+8))
- #32579 Pangu reasoning parser and tool parser — no labels — by Ji-Yao (closed: 2026-01-19 17:23 (UTC+8))
- #32572 Pangu reasoning parser and tool parser — no labels — by Ji-Yao (closed: 2026-01-19 16:53 (UTC+8))
- #32566 add openPangu VL — documentation, new-model — by Emilie1001 (closed: 2026-01-19 15:17 (UTC+8))
- #32122 Oracle improvements — documentation, rocm, ready, needs-rebase, v1, llama, gpt-oss, nvidia — by robertgshaw2-redhat (closed: 2026-01-19 11:29 (UTC+8))