vLLM 开发动态报告 - 2026-01-16

时间窗口: 2026-01-16 10:46 (UTC+8) ~ 2026-01-17 10:46 (UTC+8) 数据统计: 新 Issue 13 | 关闭 Issue 22 | 新 PR 40 | 合并 PR 20 | 关闭未合并 PR 11

📊 每日开发状态摘要

在2026年1月16日至17日的时间窗口内，vLLM项目保持了高强度开发与问题修复节奏。整体活动围绕两大主题展开：一是针对前沿硬件（如Blackwell B200、AMD MI系列）和复杂模型（如Llama-4 MoE、DeepSeek-V3.2）的性能优化与Bug修复；二是持续推进项目架构演进，包括新的/render端点设计、AMD平台支持完善以及2026年Q1路线图的制定。社区讨论热烈，特别是在性能回归、闪存注意力（FlashInfer）集成和分布式执行方面。

🎯 AMD/ROCm 生态相关动态

本周期AMD生态相关活动活跃，涉及多项兼容性修复、性能优化和CI完善工作。

PR #32497：[Bugfix][Hardware][AMD] Fix RCCL initialization in Ray distributed executor
- 内容：修复了在ROCm平台上使用Ray后端进行数据并行（DP）执行时RCCL初始化失败的问题。
- 技术细节：根本原因是ROCm 7.1+所需的环境变量HSA_NO_SCRATCH_RECLAIM=1未设置，以及Ray的自动设备管理与ROCm环境冲突。PR通过修改ROCm平台模块，确保必要的环境变量被传递到Ray执行器中。
- 影响：显著增强了vLLM在AMD多卡集群上使用Ray进行分布式推理的稳定性和可用性。
PR #32492：[RFC][ROCM] Enable aiter attn backend for qwen3-next model
- 内容：针对Qwen3-Next模型（其块大小为544），修复了aiter注意力后端因不支持的块大小而无法工作的问题。
- 技术细节：通过PR #24486解耦内核块大小和页面块大小，使得aiter后端能够处理非标准块大小。测试显示在AMD平台上启用后，GSM8K评测结果正常。
- 影响：扩展了AMD平台上高效注意力后端对更多模型架构的支持。
PR #29843：Atomics Reduce Counting Optimization for SplitK Skinny GEMMs. (已合并)
- 用户：amd-hhashemi (AMD员工)
- 内容：针对解码阶段特定尺寸范围的GEMM运算（N=16-128, M=128/640, K=2880）进行原子操作优化，旨在提升低并发下GPT-OSS等模型的性能。
- 影响：针对AMD硬件优化核心计算内核，提升特定场景下的推理效率。
其他相关PR/Issue:
- PR #32244 (已合并)：修复ROCm平台对非门控MoE（如NemotronH模型）的支持，通过放宽平台检查并使用Triton实现作为后备方案。
- PR #32444 (已合并)：在CI中跳过test_permute_cols测试，因为该内核未在ROCm上构建。
- Issue #32455 (路线图)：由github-actions自动提及了ROCm相关维护者@hongxiayang, @tjtanaa, @vllmellm，表明路线图制定时已考虑ROCm相关特性。

💬 高热度讨论分析

Issue #32488: [Bug]: Llama4 FP8 failure with Flashinfer on B200
- 核心议题：在B200 GPU上使用FlashInfer CUTLASS MoE后端运行Llama-4 FP8模型时，因量化参数名不匹配导致内核传入nullptr而崩溃。
- 观点与讨论：
  - lfopensource 指出问题根源在于FlashInfer期望传统量化属性名（g1_alphas），而标准CompressedTensors模型使用通用名（w1_scale），并迅速提出了修复PR #32496。
  - robertgshaw2-redhat 补充说明B200上的FlashInfer内核需要静态每张量量化，而该模型使用的是通道缩放，并提到正在准备一个PR来澄清这一点。
- 当前状态：问题已由lfopensource通过PR #32496修复，该PR增加了回退机制。
PR #32496: [Bugfix] Fix Llama 4 FP8 failure with Flashinfer on B200
- 核心议题：对上述Issue #32488的修复PR。
- 讨论焦点：讨论主要集中在CI流程上（pre-commit检查失败），而非技术细节本身。贡献者lfopensource快速解决了格式问题，并@了维护者请求审查。
- 结论：修复已准备就绪，等待CI验证和合并。
Issue #32455: [Roadmap] [Draft] vLLM Roadmap Q1 2026
- 核心议题：社区发布2026年第一季度路线图草案，涉及核心引擎、大规模服务、平台集成、模型与生态系统等多个领域。
- 讨论：讨论较少，NickLucche补充了多模态（MM）会议时间信息。
- 意义：此Issue虽评论不多，但重要性极高。它公开了项目的未来方向，包括默认启用异步调度和模型运行器V2、CPU KV缓存、模型实现API稳定化等重要目标，为社区贡献提供了清晰的指引。
PR #32473: [Frontend] Add render endpoints for prompt preprocessing
- 核心议题：为RFC #22817（vLLM解耦一切）中的Renderer服务奠定基础，新增/v1/completions/render和/v1/chat/completions/render端点，将提示渲染（模板应用、分词）与文本生成分离。
- 讨论焦点：作者hyeongyun0916在PR描述中提出了一个设计疑问：新的render端点是否应包含对GPT-OSS模型的Harmony处理？他@了RFC作者robertgshaw2-redhat以确认设计意图。
- 影响：这是向解耦架构迈出的重要一步，有利于构建自定义前端和调试。
PR #32470: [feat] add num preempted output
- 核心议题：提议在API响应中添加num_preempted字段，用于强化学习（RL）指标收集。
- 争议焦点：核心维护者@njhill表达了保留意见，认为“不应在顶级响应对象字段中添加随机的每请求指标”，并提及可能已有其他统计传播机制。
- 观点对立：作者RobotGF从RL的实际应用场景（如Partial Rollout系统）出发进行辩护，认为在输出中直接获取此指标对逐样本分析至关重要。
- 状态：设计决策尚在讨论中，需要权衡API简洁性与特定使用场景的需求。

🔥 热门话题与趋势分析

性能回归与优化：多个Issue（#32494, #32488, #32481）报告了从旧版本升级、特定硬件（B200）或复杂负载下的性能下降、错误或延迟问题。社区正积极定位原因，表明项目在快速演进中面临保持向后兼容和硬件普适性的挑战。
FlashInfer 集成与挑战：FlashInfer作为高性能注意力后端，在支持新硬件（B200）和新模型（Llama-4 MoE FP8）时遇到了诸多兼容性和正确性问题（#32488, #32452, #32484）。这表明将底层内核与多样化上层模型及量化方案适配的复杂性。
AMD平台支持强化：本周期内针对AMD的修复和优化PR数量显著，覆盖了分布式执行（#32497）、特定模型支持（#32492）、核心算子（#29843）和测试套件（#32444），显示AMD生态在vLLM中的支持正走向成熟和深化。
架构演进与解耦：路线图（#32455）和PR #32473（渲染端点）均指向一个明确趋势：vLLM正在致力于内部架构的稳定化、模块化和解耦，以提升可维护性、支持更灵活的部署模式（如解耦的P/D架构）。

🛠️ 重点技术变更

PR #32496 (Bugfix): 修复Llama-4 FP8在B200上FlashInfer兼容性
- 解读：修复了因量化参数命名规范不一致导致的致命错误。通过增加从传统命名到标准命名的回退机制，确保了使用CompressedTensors等标准工具量化的FP8 MoE模型能在FlashInfer后端上正常运行。
- 影响：直接解决了社区用户在新硬件上运行前沿量化模型时遇到的障碍，提升了vLLM对最新模型生态的支持能力。
PR #32497 (Bugfix): 修复ROCm上Ray后端的RCCL初始化
- 解读：这是一个平台特定性的深度修复，涉及ROCm底层内存管理（HSA_NO_SCRATCH_RECLAIM）与分布式框架（Ray）的交互。正确设置这些环境变量是保证RCCL（ROCm的NCCL对应物）在复杂分布式环境下稳定工作的前提。
- 影响：为在大型AMD GPU集群上使用vLLM进行数据并行推理扫清了一个关键障碍。
PR #32479 (CI): 更新DeepGEMM版本 (已合并)
- 解读：更新了DeepSeek的DeepGEMM依赖至新版本。DeepGEMM是DeepSeek-V3等模型使用的核心高效矩阵乘库。
- 影响：确保vLLM能够利用DeepGEMM库的最新性能优化和稳定性改进，对支持相关模型至关重要。
PR #32418 (Bugfix): 修复EPLB可能死锁问题 (已合并)
- 解读：修复了专家并行负载均衡（EPLB）在异步模式下可能发生的死锁。关键修复点是确保move_to_workspace操作在所有权重传输实际完成前不会返回，从而维持专家组内集合操作的同步性。
- 影响：提升了大规模MoE模型在专家并行模式下的运行稳定性。
PR #32425 (Optimization): 更新LoRA扩展内核启发式算法 (已合并)
- 解读：优化了LoRA expand内核的启动参数。针对QKVParallelLinearWithLoRA等切片线性层输出尺寸较小的特点，调整了block_n大小以启动更多CUDA块，提高SM并发度。
- 影响：经过测试，该优化使特定LoRA内核性能提升约30%，端到端输出令牌吞吐量提升约1%，对多LoRA适配场景有积极影响。

📈 开发活跃度观察

贡献活跃度：在24小时内新增40个PR，合并20个，处理35个Issue（13新开，22关闭），显示社区开发、审查和合并流程非常高效。
贡献者多样性：贡献者包括来自AMD (amd-hhashemi)、NVIDIA（隐含于B200相关问题）、Red Hat (robertgshaw2-redhat) 的员工以及众多独立开发者，表明项目吸引了广泛的产业和社区参与。
代码审查与质量：许多PR（如#32496, #32490, #32448）都经历了多次pre-commit检查失败与修复的循环，体现了严格的代码规范要求。维护者（如njhill, DarkLight1337, chaunceyjiang）积极参与讨论和技术决策。

💡 值得关注的问题

性能回归（Issue #32494）：用户报告从v0.9.0 (V1引擎) 升级到v0.13.0后，TTFT（首令牌时间）在高并发下显著退化。这触及了引擎核心调度器的核心变更（如异步调度），需要核心团队深入分析并明确性能取舍。
EPLB稳定性（Issue #32478）：专家并行负载均衡在多种情况下发生挂起，用户已找到一些缓解方案（禁用DeepEp LL、禁用异步调度等）。这暴露了在混合并行（TP+EP）和高级调度功能组合下的深层复杂性。
严格连续性断言失败（Issue #32452）：在Blackwell GPU的FlashInfer TRTLLM解码路径中，新的is_strictly_contiguous检查导致断言失败。这揭示了PyTorch .contiguous() 与底层内核对内存布局期望之间的微妙差异，是跨硬件和内核栈协同的重要案例。
路线图执行：新发布的Q1 2026路线图草案（#32455）雄心勃勃，包括默认启用异步调度和V2模型运行器、稳定模型API、大规模服务优化等。这些目标的推进情况将决定vLLM未来半年的架构走向和竞争力。

📋 附录：详细数据列表

新增 Issue

#32495 [CI Failure]: Entrypoints Integration Tests (Responses API) — help wanted,ci-failure — by robertgshaw2-redhat (创建于: 2026-01-17 08:17 (UTC+8))
#32488 [Bug]: Llama4 FP8 failure with Flashinfer on B200 — bug — by mxz297 (创建于: 2026-01-17 02:47 (UTC+8))
#32494 [Performance]: Testing vLLM upgrade results in significant TTFT degradation — performance — by annabellej (创建于: 2026-01-17 07:42 (UTC+8))
#32481 [Bug]: Batch Invariance fails under more diverse workloads — bug — by frankwang28 (创建于: 2026-01-17 01:08 (UTC+8))
#32455 [Roadmap] [Draft] vLLM Roadmap Q1 2026 — 无标签 — by simon-mo (创建于: 2026-01-16 12:23 (UTC+8))
#32478 [Bug]: EPLB hangs in several cases — bug — by ilmarkov (创建于: 2026-01-16 23:57 (UTC+8))
#32472 [Bug]: Granite 4.0-H Small immediate EOS token in specific prompt combinations — bug — by ksmusz (创建于: 2026-01-16 18:47 (UTC+8))
#32469 [Bug]: Error occurs when using Eagle3: Encoder cache miss for {mm_hash} — bug — by Ericoool9614 (创建于: 2026-01-16 17:55 (UTC+8))
#32468 [Bug]: Engine core proc EngineCore_DP0 died unexpectedly, shutting down client. — bug — by simplew2011 (创建于: 2026-01-16 17:21 (UTC+8))
#32467 [Bug]: vLLM 0.13.0, GLM4.6, H20; encountered error: torch.AcceleratorError: CUDA error: an illegal memory access was encountered — bug — by AmazeQiu (创建于: 2026-01-16 16:51 (UTC+8))
#32464 [Bug]: Qwen3-VL-Reranker-8B vllm error — bug — by darvec112357 (创建于: 2026-01-16 15:34 (UTC+8))
#32461 [Bug]: QWenBaseModel.embed_input_ids() got an unexpected keyword argument ‘multimodal_embeddings’ — bug — by honglyua-il (创建于: 2026-01-16 13:38 (UTC+8))
#32452 [Bug]: is_strictly_contiguous assertion fails in FlashInfer TRTLLM decode path on Blackwell for Scout — bug — by luccafong (创建于: 2026-01-16 11:47 (UTC+8))

已关闭 Issue

#20671 [Bug]: Whisper not working on 0.9.2 docker image — bug,stale — by andrePankraz (关闭于: 2026-01-17 10:17 (UTC+8))
#22341 [Bug]: When using streaming output, the error ‘Caught handled exception, but response already started’ occurs. — bug,stale,gpt-oss — by okLLM (关闭于: 2026-01-17 10:16 (UTC+8))
#24179 [Performance]:When running vllm at 30b-a3b MOE and turning on kv quantization, the decoding speed of 40K input drops significantly from 100t to 40t. — performance,stale — by gengchaogit (关闭于: 2026-01-17 10:16 (UTC+8))
#24650 [Feature]: support messages input for classify api skywork-reward model — feature request,stale — by Snowdar (关闭于: 2026-01-17 10:16 (UTC+8))
#24714 [Bug]: Hunyuan-7B-Instruct get error output — bug,stale — by hzjane (关闭于: 2026-01-17 10:16 (UTC+8))
#24777 [Bug] / [Feature]: Determinism in E2E Spec Decode Test — bug,stale — by wwl2755 (关闭于: 2026-01-17 10:16 (UTC+8))
#24800 [Bug]: Fail to initialize DPEngineCoreProc when worldsize = 1 and P/D disaggregate — bug,stale — by nwpu-zxr (关闭于: 2026-01-17 10:16 (UTC+8))
#24802 [Usage]: A KeyError occurred while running Qwen3-235B-A22B-Q4_KM (and int4_int8mix) — usage,stale — by zxzx9898 (关闭于: 2026-01-17 10:15 (UTC+8))
#24876 [Bug]: uv pip install vllm==0.10.2 –extra-index-url https://wheels.vllm.ai/0.10.2/ –torch-backend=auto not work — bug,stale — by 108851027 (关闭于: 2026-01-17 10:15 (UTC+8))
#24879 [Bug]: Crash occurs when calling sleep while running vLLM engine in data parallel mode — bug,stale — by liyuanze (关闭于: 2026-01-17 10:15 (UTC+8))
#24894 [Doc]: vllm performance dashboard details — documentation,stale — by omkarpatil6644 (关闭于: 2026-01-17 10:15 (UTC+8))
#24943 [Bug]:Worker failed with error ‘CUDAGraphMode.FULL_AND_PIECEWISE is not supported with FlexAttentionMetadataBuilder backend (support:AttentionCGSupport.NEVER) ; please try cudagraph_mode=PIECEWISE — bug,stale — by wangbin306 (关闭于: 2026-01-17 10:15 (UTC+8))
#24950 [Usage]: AttributeError: ‘Parameter’ object has no attribute ‘load_column_parallel_weight’ — usage,stale — by sallyjunjun (关闭于: 2026-01-17 10:15 (UTC+8))
#24959 [Bug]: vllm启动qwen3-32B-AWQ后进行对话报错 — bug,stale — by hyqf98 (关闭于: 2026-01-17 10:15 (UTC+8))
#24977 [Bug]: libtpu.sdk crashes on Python 3.12 due to ABI incompatibility — bug,stale — by andre-motta (关闭于: 2026-01-17 10:15 (UTC+8))
#24992 [Bug]: Empty VllmConfig when initializing LogitsProcessor — bug,stale — by prashantgupta24 (关闭于: 2026-01-17 10:15 (UTC+8))
#25015 [Usage]: How to profile vllm with Nsight Compute CLI? — usage,stale — by jessiewiswjc (关闭于: 2026-01-17 10:15 (UTC+8))
#32364 [Bug]: Hybrid models generation slows down noticeably when DP is enabled — bug — by yury-tokpanov (关闭于: 2026-01-17 10:01 (UTC+8))
#30707 [Bug]: RTX 5080 (SM120) + NVFP4 model fails pre-flight memory check despite model fitting in VRAM — 无标签 — by Platano78 (关闭于: 2026-01-17 03:20 (UTC+8))
#29192 Tool Calling Parsers Fail to Populate tool_calls Array for Qwen2.5-Coder Models — 无标签 — by Platano78 (关闭于: 2026-01-17 02:44 (UTC+8))
#26639 [Bug]: ValueError: No valid structured output parameter found — bug,structured-output — by fish-miku (关闭于: 2026-01-16 12:23 (UTC+8))
#32172 [Bug]: DeepSeek V3.2 MTP + PD report two errors — bug — by kebe7jun (关闭于: 2026-01-16 11:21 (UTC+8))

新增 PR

#32499 [MLA] Add nvfp4 packed KV cache decode path via dequant cache op #32220 — ci/build,v1,nvidia — by baonudesifeizhai (创建于: 2026-01-17 10:42 (UTC+8))
#32498 [Docs][Governance] Add @robertshaw2-redhat to lead maintainers group — documentation — by simon-mo (创建于: 2026-01-17 10:35 (UTC+8))
#32473 [Frontend] Add render endpoints for prompt preprocessing — frontend — by hyeongyun0916 (创建于: 2026-01-16 20:37 (UTC+8))
#32486 “refactor: refactor_repeated_interfaces” — ready — by tom-zju (创建于: 2026-01-17 01:46 (UTC+8))
#32482 [CI] Add Helion as an optional dependency — ready,ci/build — by gmagogsfm (创建于: 2026-01-17 01:13 (UTC+8))
#32493 Add embedding input functionality for disabled modalities [remake] — documentation,frontend,needs-rebase,v1,multi-modality — by reaganjlee (创建于: 2026-01-17 07:28 (UTC+8))
#32496 [Bugfix] Fix Llama 4 FP8 failure with FlashInfer on B200 (Nullptr crash) — bug,llama,nvidia — by lfopensource (创建于: 2026-01-17 08:24 (UTC+8))
#32497 [Bugfix][Hardware][AMD] Fix RCCL initialization in Ray distributed executor — bug,rocm,v1 — by c0de128 (创建于: 2026-01-17 09:35 (UTC+8))
#32456 [Model] Add Eagle2.5-8B Vision-Language Model support — documentation,new-model,speculative-decoding — by George-Polya (创建于: 2026-01-16 12:28 (UTC+8))
#32492 [RFC][ROCM] Enable aiter attn backend for qwen3-next model — rocm,v1,qwen — by jennyyyyzhen (创建于: 2026-01-17 06:52 (UTC+8))
#32490 fix(reasoning): don’t check prompt_token_ids for reasoning end state — frontend — by farazshaikh (创建于: 2026-01-17 05:19 (UTC+8))
#32491 [WIP] Update FlashMLA — ready,ci/build,ready-run-all-tests — by LucasWilkinson (创建于: 2026-01-17 05:20 (UTC+8))
#32474 [Docker][Hotfix] CUDA compatibility enablement — ci/build,nvidia — by emricksini-h (创建于: 2026-01-16 21:47 (UTC+8))
#32489 [CI] Fix OOM in Hopper Fusion E2E Tests (H100) — ready,ci/build — by LucasWilkinson (创建于: 2026-01-17 04:05 (UTC+8))
#32484 Revert “[Attention][MLA] Make FLASHINFER_MLA the default MLA backen… — ready,nvidia — by MatthewBonanni (创建于: 2026-01-17 01:30 (UTC+8))
#32485 [not ready for review] extend fp8 online quant with blockwise scaling — 无标签 — by vkuzo (创建于: 2026-01-17 01:35 (UTC+8))
#32470 [feat] add num preempted output — frontend,v1 — by RobotGF (创建于: 2026-01-16 18:17 (UTC+8))
#32448 apply _validate_input to MistralTokenizer token-id chat prompts — frontend,ready — by vanshilshah97 (创建于: 2026-01-16 10:57 (UTC+8))
#32487 [CI][Attention] Add more CI dependencies for attention tests — ci/build — by MatthewBonanni (创建于: 2026-01-17 01:46 (UTC+8))
#32483 Revert #32339 — 无标签 — by MatthewBonanni (创建于: 2026-01-17 01:28 (UTC+8))
#32479 [CI] Update deepgemm to newer version — ready — by yewentao256 (创建于: 2026-01-17 00:17 (UTC+8))
#32480 Strengthen batch inv tests — v1 — by frankwang28 (创建于: 2026-01-17 00:58 (UTC+8))
#32477 [7/N][Attention][Docs] Add documentation for attention backends — documentation — by MatthewBonanni (创建于: 2026-01-16 22:25 (UTC+8))
#32460 [ROCm][CI] Skip Qwen3-30B-A3B-MXFP4A16 Eval Test On Non-CUDA Platforms — rocm,ready,qwen,nvidia — by micah-wil (创建于: 2026-01-16 13:05 (UTC+8))
#32471 [Bugfix] Add OOT backend option — bug,ready — by iboiko-habana (创建于: 2026-01-16 18:21 (UTC+8))
#32453 [BugFix] Fix is_strictly_contiguous assertion for decode_query in TRT… — bug,v1,nvidia — by luccafong (创建于: 2026-01-16 11:57 (UTC+8))
#32475 [Doc] Clarify comment regarding partial block loading in OffloadingConnector — documentation,new-model,needs-rebase,ci/build,v1,multi-modality,llama,qwen,kv-connector — by ShadowNearby (创建于: 2026-01-16 22:03 (UTC+8))
#32476 [Doc] Clarify comment regarding partial block loading — kv-connector — by ShadowNearby (创建于: 2026-01-16 22:12 (UTC+8))
#32454 docs: add version requirement note for –profiler-config flag — documentation — by abhishkh (创建于: 2026-01-16 12:08 (UTC+8))
#32462 [BugFix] Fix embed_input_ids argument error of QwenVLForConditionalGeneration — bug,ready,qwen — by honglyua-il (创建于: 2026-01-16 13:41 (UTC+8))
#32465 Pangu reasoning parser and tool parser — 无标签 — by Ji-Yao (创建于: 2026-01-16 15:36 (UTC+8))
#32459 [Chore] Replace swish with silu — ready — by DarkLight1337 (创建于: 2026-01-16 12:52 (UTC+8))
#32466 add pangu_vl test — new-model — by Emilie1001 (创建于: 2026-01-16 15:53 (UTC+8))
#32463 [Bugfix] [DP] Fix create too many placement groups — bug,v1 — by kebe7jun (创建于: 2026-01-16 14:00 (UTC+8))
#32458 [CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… — bug — by junuxyz (创建于: 2026-01-16 12:47 (UTC+8))
#32457 [Bugfix] Use a large enough aiohttp read_bufsize to avoid ContentLengthError — bug — by ApsarasX (创建于: 2026-01-16 12:45 (UTC+8))
#32449 [model] Add support for openPangu7B-VL — new-model — by hujiaxin0 (创建于: 2026-01-16 11:21 (UTC+8))
#32451 add fireredasr model — new-model — by sxl1993 (创建于: 2026-01-16 11:46 (UTC+8))
#32450 [P/D] Enable KV cache queries and eliminate redundant prefill computation — documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1 — by snadampal (创建于: 2026-01-16 11:31 (UTC+8))
#32447 [Bugfix] Fix ROCm dockerfiles — bug,rocm,ci/build — by tjtanaa (创建于: 2026-01-16 10:46 (UTC+8))

已合并 PR

#32498 [Docs][Governance] Add @robertshaw2-redhat to lead maintainers group — documentation — by simon-mo (合并于: 2026-01-17 10:35 (UTC+8))
#28506 [TPU][Core] Enable Pipeline Parallelism on TPU backend — ready,v1 — by Chenyaaang (合并于: 2026-01-17 07:29 (UTC+8))
#32489 [CI] Fix OOM in Hopper Fusion E2E Tests (H100) — ready,ci/build — by LucasWilkinson (合并于: 2026-01-17 05:27 (UTC+8))
#32383 [responsesAPI] allow tuning include_stop_str_in_output — frontend,ready — by qandrew (合并于: 2026-01-17 05:14 (UTC+8))
#32425 [LoRA] Update LoRA expand kernel heuristic — ready — by xyang16 (合并于: 2026-01-17 02:38 (UTC+8))
#29843 Atomics Reduce Counting Optimization for SplitK Skinny GEMMs. — rocm,ready — by amd-hhashemi (合并于: 2026-01-17 01:45 (UTC+8))
#32479 [CI] Update deepgemm to newer version — ready — by yewentao256 (合并于: 2026-01-17 01:18 (UTC+8))
#32460 [ROCm][CI] Skip Qwen3-30B-A3B-MXFP4A16 Eval Test On Non-CUDA Platforms — rocm,ready,qwen,nvidia — by micah-wil (合并于: 2026-01-16 16:17 (UTC+8))
#32418 [EPLB][BugFix]Possible deadlock fix — bug,ready — by ilmarkov (合并于: 2026-01-16 22:11 (UTC+8))
#32459 [Chore] Replace swish with silu — ready — by DarkLight1337 (合并于: 2026-01-16 16:22 (UTC+8))
#32444 [CI][AMD] Skip test_permute_cols since the kernel is not used and not built for ROCm — rocm,ready — by rasmith (合并于: 2026-01-16 16:22 (UTC+8))
#32244 fix(rocm): Enable non-gated MoE (is_act_and_mul=False) support on ROCm — rocm,ready — by rabi (合并于: 2026-01-16 15:31 (UTC+8))
#32306 [Bugfix] Refactor to support DP parallel in R3 — bug,ready,v1 — by xhx1022 (合并于: 2026-01-16 15:13 (UTC+8))
#30499 [CI] Breakup h200 tests — ready,ci/build — by LucasWilkinson (合并于: 2026-01-16 14:23 (UTC+8))

#32395 [Frontend][1/n] Make pooling entrypoints request schema consensus

CompletionRequest — documentation,frontend,ready,multi-modality — by noooop (合并于: 2026-01-16 14:17 (UTC+8))

#32438 [Bug] Add TPU backend option — bug,ready — by vanbasten23 (合并于: 2026-01-16 13:17 (UTC+8))
#26822 [bugfix] Fix online serving crash when text type response_format is received — bug,frontend,ready,tool-calling — by cjackal (合并于: 2026-01-16 12:23 (UTC+8))
#32175 [Bugfix] [DeepSeek-V3.2] fix sparse_attn_indexer padding — bug,ready,deepseek — by kebe7jun (合并于: 2026-01-16 11:21 (UTC+8))
#32329 [Model] Add Step3vl 10b — documentation,new-model,ready — by ltd0924 (合并于: 2026-01-16 11:04 (UTC+8))
#32447 [Bugfix] Fix ROCm dockerfiles — bug,rocm,ci/build — by tjtanaa (合并于: 2026-01-16 10:50 (UTC+8))

关闭但未合并的 PR

#23543 [Bugfix][Frontend] Fix bug where default_sampling_params.stop_token_ids is unused in chat API — frontend,stale — by n0gu-furiosa (关闭于: 2026-01-17 10:16 (UTC+8))
#24500 [Models] Prefer in-place add and multiply — speculative-decoding,needs-rebase,stale,llama,qwen,deepseek — by lgeiger (关闭于: 2026-01-17 10:16 (UTC+8))
#24932 [Metrics] Add finer-grained buckets for request_latency to support faster/embedding models — stale,v1 — by pedramr (关闭于: 2026-01-17 10:15 (UTC+8))
#24942 [XPU] refine bf16 check on xpu platform — stale,v1 — by chaojun-zhang (关闭于: 2026-01-17 10:15 (UTC+8))
#25000 [ROCm] Add some dependencies for ROCm — rocm,ci/build,stale — by Concurrensee (关闭于: 2026-01-17 10:15 (UTC+8))
#32087 refactor: refactor_repeated_interfaces — deepseek — by tom-zju (关闭于: 2026-01-17 10:02 (UTC+8))
#32483 Revert #32339 — 无标签 — by MatthewBonanni (关闭于: 2026-01-17 01:28 (UTC+8))
#32475 [Doc] Clarify comment regarding partial block loading in OffloadingConnector — documentation,new-model,needs-rebase,ci/build,v1,multi-modality,llama,qwen,kv-connector — by ShadowNearby (关闭于: 2026-01-16 22:05 (UTC+8))
#32465 Pangu reasoning parser and tool parser — 无标签 — by Ji-Yao (关闭于: 2026-01-16 16:39 (UTC+8))
#32466 add pangu_vl test — new-model — by Emilie1001 (关闭于: 2026-01-16 16:03 (UTC+8))
#32450 [P/D] Enable KV cache queries and eliminate redundant prefill computation — documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1 — by snadampal (关闭于: 2026-01-16 11:32 (UTC+8))