vLLM Development Activity Report - 2026-01-16
Time window: 2026-01-16 10:46 (UTC+8) ~ 2026-01-17 10:46 (UTC+8). Stats: 13 new issues | 22 closed issues | 40 new PRs | 20 merged PRs | 11 closed without merge
📊 Daily Development Summary
During the 2026-01-16 to 2026-01-17 window, the vLLM project kept up a high tempo of development and bug fixing. Activity centered on two themes: performance optimization and bug fixes for cutting-edge hardware (e.g. Blackwell B200, the AMD MI series) and complex models (e.g. Llama-4 MoE, DeepSeek-V3.2); and continued architectural evolution, including the new /render endpoint design, maturing AMD platform support, and drafting of the Q1 2026 roadmap. Community discussion was lively, especially around performance regressions, FlashInfer integration, and distributed execution.
🎯 AMD/ROCm Ecosystem Updates
AMD ecosystem activity was high this cycle, spanning compatibility fixes, performance optimizations, and CI improvements.
- PR #32497: [Bugfix][Hardware][AMD] Fix RCCL initialization in Ray distributed executor
  - Summary: Fixes RCCL initialization failures when running data-parallel (DP) execution with the Ray backend on ROCm.
  - Technical details: The root cause was that the environment variable HSA_NO_SCRATCH_RECLAIM=1, required on ROCm 7.1+, was not set, and Ray's automatic device management conflicted with the ROCm environment. The PR modifies the ROCm platform module to ensure the required environment variables are propagated to the Ray executors.
  - Impact: Significantly improves the stability and usability of distributed inference with Ray on AMD multi-GPU clusters.
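The shape of the fix can be sketched as follows. This is a minimal sketch: `rocm_env_overrides` and `ray_runtime_env` are illustrative helper names, not vLLM's actual API; only the HSA_NO_SCRATCH_RECLAIM=1 requirement comes from the PR discussion.

```python
def rocm_env_overrides():
    """Environment variables that must reach every Ray worker on ROCm.

    HSA_NO_SCRATCH_RECLAIM=1 is required on ROCm 7.1+; the exact set vLLM
    forwards is defined in its ROCm platform module.
    """
    return {"HSA_NO_SCRATCH_RECLAIM": "1"}


def ray_runtime_env(extra=None):
    """Build the runtime_env dict you would pass to ray.init(runtime_env=...).

    Ray applies runtime_env["env_vars"] to each worker process, which lets
    settings like this survive Ray's own device management.
    """
    env_vars = dict(rocm_env_overrides())
    if extra:
        env_vars.update(extra)
    return {"env_vars": env_vars}


cfg = ray_runtime_env({"VLLM_LOGGING_LEVEL": "DEBUG"})
print(cfg["env_vars"]["HSA_NO_SCRATCH_RECLAIM"])  # -> 1
```

The key point is that setting the variable only in the driver process is not enough; it must be injected into every worker Ray spawns.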
- PR #32492: [RFC][ROCM] Enable aiter attn backend for qwen3-next model
  - Summary: Fixes the aiter attention backend failing on the Qwen3-Next model, whose block size of 544 was previously unsupported.
  - Technical details: PR #24486 decouples the kernel block size from the page block size, allowing the aiter backend to handle non-standard block sizes. Testing shows normal GSM8K evaluation results on AMD once enabled.
  - Impact: Extends efficient attention backend coverage to more model architectures on the AMD platform.
- PR #29843: Atomics Reduce Counting Optimization for SplitK Skinny GEMMs (merged)
  - Author: amd-hhashemi (AMD)
  - Summary: Optimizes atomic operations for decode-phase GEMMs in a specific size range (N=16-128, M=128/640, K=2880), aimed at improving low-concurrency performance for models such as GPT-OSS.
  - Impact: Tunes a core compute kernel for AMD hardware, improving inference efficiency in the targeted scenarios.
- Other related PRs/Issues:
  - PR #32244 (merged): Fixes ROCm support for non-gated MoE (e.g. the NemotronH model) by relaxing the platform check and falling back to the Triton implementation.
  - PR #32444 (merged): Skips the test_permute_cols test in CI, since that kernel is not built on ROCm.
  - Issue #32455 (roadmap): github-actions automatically mentioned the ROCm maintainers @hongxiayang, @tjtanaa, and @vllmellm, indicating ROCm features were considered when drafting the roadmap.
💬 High-Engagement Discussions
- Issue #32488: [Bug]: Llama4 FP8 failure with Flashinfer on B200
  - Core issue: Running a Llama-4 FP8 model with the FlashInfer CUTLASS MoE backend on B200 GPUs crashes because a quantization parameter name mismatch causes a nullptr to be passed into the kernel.
  - Discussion: lfopensource identified the root cause: FlashInfer expects legacy quantization attribute names (g1_alphas), while standard CompressedTensors models use generic names (w1_scale), and quickly opened fix PR #32496. robertgshaw2-redhat added that the FlashInfer kernels on B200 require static per-tensor quantization, whereas this model uses per-channel scales, and mentioned a forthcoming PR to clarify this.
  - Status: Fixed by lfopensource in PR #32496, which adds a fallback mechanism.
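The fallback described above can be sketched like this. The attribute names come from the issue discussion; the resolution logic itself is illustrative, not the actual code in PR #32496.

```python
def resolve_moe_scale(layer, standard="w1_scale", legacy="g1_alphas"):
    """Return the quantization scale under either naming convention.

    Standard CompressedTensors checkpoints expose w1_scale; the FlashInfer
    CUTLASS MoE path historically looked for g1_alphas. Falling back avoids
    handing the kernel a null pointer when only one name is present.
    """
    scale = getattr(layer, standard, None)
    if scale is None:
        scale = getattr(layer, legacy, None)
    if scale is None:
        raise ValueError(f"layer has neither {standard!r} nor {legacy!r}")
    return scale


class _Layer:  # stand-in for a quantized MoE layer
    w1_scale = 0.02


print(resolve_moe_scale(_Layer()))  # -> 0.02
```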
- PR #32496: [Bugfix] Fix Llama 4 FP8 failure with Flashinfer on B200
  - Core issue: The fix for Issue #32488 above.
  - Discussion focus: Conversation centered on the CI pipeline (pre-commit check failures) rather than the technical details. lfopensource quickly resolved the formatting issues and pinged maintainers for review.
  - Conclusion: The fix is ready, pending CI validation and merge.
- Issue #32455: [Roadmap] [Draft] vLLM Roadmap Q1 2026
  - Core issue: The community published a draft Q1 2026 roadmap covering the core engine, large-scale serving, platform integrations, models, and the ecosystem.
  - Discussion: Relatively quiet; NickLucche added the multimodal (MM) meeting schedule.
  - Significance: Despite few comments, this issue matters greatly. It lays out the project's direction, including enabling async scheduling and model runner V2 by default, CPU KV cache, and stabilizing the model implementation API, giving contributors clear guidance.
- PR #32473: [Frontend] Add render endpoints for prompt preprocessing
  - Core issue: Lays the groundwork for the Renderer service from RFC #22817 ("decouple everything" in vLLM), adding /v1/completions/render and /v1/chat/completions/render endpoints that separate prompt rendering (template application, tokenization) from text generation.
  - Discussion focus: Author hyeongyun0916 raised a design question in the PR description: should the new render endpoints include Harmony handling for GPT-OSS models? They pinged RFC author robertgshaw2-redhat to confirm the design intent.
  - Impact: An important step toward a decoupled architecture, useful for building custom frontends and for debugging.
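The render/generate split can be illustrated with a toy two-stage pipeline. This is purely illustrative: the real endpoints apply the model's chat template and tokenizer, and the function names here are not vLLM's.

```python
def render(messages, template="{role}: {content}\n"):
    """Stage 1 (what the /render endpoints do): turn chat messages into a
    prompt. In vLLM this also covers tokenization; here we stop at text."""
    return "".join(template.format(**m) for m in messages)


def generate(prompt):
    """Stage 2: text generation, now independent of prompt preprocessing.
    A real server would run the model; we return a placeholder."""
    return f"<completion for {len(prompt)} prompt chars>"


msgs = [{"role": "user", "content": "hi"}]
prompt = render(msgs)   # roughly what /v1/chat/completions/render returns
out = generate(prompt)  # what the generation path consumes
print(prompt)           # -> user: hi
```

Decoupling the two stages lets a client inspect or cache the rendered prompt, or run rendering on a different host than generation.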
- PR #32470: [feat] add num preempted output
  - Core issue: Proposes adding a num_preempted field to API responses for reinforcement learning (RL) metric collection.
  - Point of contention: Core maintainer njhill expressed reservations, arguing that "random per-request metrics" should not be added as top-level response object fields, and noted that another stats-propagation mechanism may already exist.
  - Opposing view: Author RobotGF defended the proposal from practical RL use cases (e.g. partial-rollout systems), arguing that exposing the metric directly in the output is essential for per-sample analysis.
  - Status: The design decision is still under discussion, weighing API simplicity against the specific use case.
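The RL use case argued for above can be sketched as follows. Only the `num_preempted` field name comes from PR #32470; the record type and filtering policy are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RolloutSample:
    """Hypothetical per-request record in a partial-rollout RL system."""
    text: str
    num_preempted: int  # the field PR #32470 proposes exposing per request


def clean_samples(samples):
    """Drop rollouts that were preempted, as an RL trainer might do when
    preemption (and the resulting recomputation) biases a sample."""
    return [s for s in samples if s.num_preempted == 0]


batch = [RolloutSample("a", 0), RolloutSample("b", 2)]
print(len(clean_samples(batch)))  # -> 1
```

Per-sample filtering like this is only possible if the metric rides along with each response, which is the crux of the API-design debate.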
🔥 Hot Topics and Trends
- Performance regressions and optimization: Several issues (#32494, #32488, #32481) report slowdowns, errors, or added latency after version upgrades, on specific hardware (B200), or under complex workloads. The community is actively triaging, highlighting the challenge of preserving backward compatibility and broad hardware coverage amid rapid evolution.
- FlashInfer integration challenges: As a high-performance attention backend, FlashInfer hit multiple compatibility and correctness problems while supporting new hardware (B200) and new models (Llama-4 MoE FP8) (#32488, #32452, #32484), underscoring how complex it is to adapt low-level kernels to diverse models and quantization schemes.
- Strengthened AMD platform support: A notable number of AMD fixes and optimizations landed this cycle, covering distributed execution (#32497), model support (#32492), core kernels (#29843), and the test suite (#32444), signaling that AMD support in vLLM is maturing and deepening.
- Architectural evolution and decoupling: Both the roadmap (#32455) and PR #32473 (render endpoints) point to a clear trend: vLLM is stabilizing, modularizing, and decoupling its internals to improve maintainability and support more flexible deployment patterns (e.g. disaggregated P/D architectures).
🛠️ Key Technical Changes
- PR #32496 (Bugfix): Fix Llama-4 FP8 FlashInfer compatibility on B200
  - Analysis: Fixes a fatal error caused by inconsistent quantization parameter naming. A fallback from legacy names to standard names ensures FP8 MoE models quantized with standard tooling such as CompressedTensors run correctly on the FlashInfer backend.
  - Impact: Directly unblocks users running state-of-the-art quantized models on new hardware, improving vLLM's support for the latest model ecosystem.
- PR #32497 (Bugfix): Fix RCCL initialization with the Ray backend on ROCm
  - Analysis: A deep platform-specific fix involving the interaction between ROCm's low-level memory management (HSA_NO_SCRATCH_RECLAIM) and the Ray distributed framework. Setting these environment variables correctly is a prerequisite for RCCL (ROCm's NCCL counterpart) to work reliably in complex distributed environments.
  - Impact: Removes a key obstacle to data-parallel inference with vLLM on large AMD GPU clusters.
- PR #32479 (CI): Update DeepGEMM version (merged)
  - Analysis: Bumps the DeepGEMM dependency to a newer version. DeepGEMM is the core high-efficiency matrix-multiplication library used by models such as DeepSeek-V3.
  - Impact: Ensures vLLM picks up DeepGEMM's latest performance and stability improvements, which is essential for supporting those models.
- PR #32418 (Bugfix): Fix possible EPLB deadlock (merged)
  - Analysis: Fixes a deadlock that could occur in expert parallel load balancing (EPLB) in async mode. The key fix ensures that move_to_workspace does not return until all weight transfers have actually completed, preserving the synchronization of collective operations within an expert group.
  - Impact: Improves stability for large MoE models running in expert-parallel mode.
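The synchronization invariant behind the EPLB fix can be sketched with a toy async transfer. Threading stands in for device-side copies; `move_to_workspace` is the name from the PR, but everything around it is illustrative.

```python
import threading
import time


def start_transfer(done, delay=0.01):
    """Simulates an async weight copy that completes some time later."""
    def _worker():
        time.sleep(delay)
        done.set()
    threading.Thread(target=_worker, daemon=True).start()


def move_to_workspace(num_transfers=4):
    """Must not return until every transfer has finished.

    If it returned early, ranks would enter the next collective operation
    at different times with inconsistent state, which is how the expert
    group could deadlock.
    """
    events = [threading.Event() for _ in range(num_transfers)]
    for ev in events:
        start_transfer(ev)
    for ev in events:  # the fix: block here instead of returning early
        ev.wait()
    return all(ev.is_set() for ev in events)


print(move_to_workspace())  # -> True
```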
- PR #32425 (Optimization): Update LoRA expand kernel heuristic (merged)
  - Analysis: Tunes the launch parameters of the LoRA expand kernel. For sliced linear layers with small output sizes, such as QKVParallelLinearWithLoRA, the block_n size is reduced so that more CUDA blocks are launched, raising SM occupancy.
  - Impact: Testing shows roughly a 30% speedup for the affected LoRA kernel and about a 1% gain in end-to-end output-token throughput, benefiting multi-LoRA serving scenarios.
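The occupancy intuition behind that heuristic is simple arithmetic: the number of blocks launched along the N dimension grows as block_n shrinks. A sketch (the actual heuristic and thresholds live in the kernel launch code; the sizes below are made up for illustration):

```python
import math


def num_ctas(n, block_n):
    """Number of thread blocks launched along the N dimension."""
    return math.ceil(n / block_n)


# For a small sliced output (e.g. a hypothetical QKV slice of n=1024),
# shrinking block_n launches more blocks, keeping more SMs busy:
for block_n in (256, 128, 64):
    print(block_n, num_ctas(1024, block_n))  # -> 4, 8, 16 blocks
```

The trade-off is that smaller tiles do less work per block, so the heuristic only applies this when the output dimension is small enough that occupancy, not per-block efficiency, is the bottleneck.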
📈 Development Activity Observations
- Contribution volume: 40 new PRs opened and 20 merged within 24 hours, plus 35 issues handled (13 opened, 22 closed), indicating a highly efficient development, review, and merge pipeline.
- Contributor diversity: Contributors include employees of AMD (amd-hhashemi), NVIDIA (implied by the B200-related work), and Red Hat (robertgshaw2-redhat), alongside many independent developers, showing broad industry and community participation.
- Review and quality: Many PRs (e.g. #32496, #32490, #32448) went through multiple pre-commit failure-and-fix cycles, reflecting strict code standards. Maintainers (e.g. njhill, DarkLight1337, chaunceyjiang) were actively involved in discussion and technical decisions.
💡 Issues Worth Watching
- Performance regression (Issue #32494): A user reports that upgrading from v0.9.0 (V1 engine) to v0.13.0 significantly degrades TTFT (time to first token) under high concurrency. This touches core scheduler changes (such as async scheduling) and needs the core team to analyze and clarify the performance trade-offs.
- EPLB stability (Issue #32478): Expert parallel load balancing hangs in several scenarios; users have found mitigations (disabling DeepEP LL, disabling async scheduling, etc.). This exposes deep complexity in combining mixed parallelism (TP+EP) with advanced scheduling features.
- Strict contiguity assertion failure (Issue #32452): In the FlashInfer TRTLLM decode path on Blackwell GPUs, the new is_strictly_contiguous check triggers an assertion failure. This highlights a subtle mismatch between PyTorch's .contiguous() and the memory-layout expectations of the underlying kernels, an instructive case of cross-hardware and kernel-stack coordination.
- Roadmap execution: The newly published draft Q1 2026 roadmap (#32455) is ambitious, including enabling async scheduling and the V2 model runner by default, stabilizing the model API, and large-scale serving optimizations. Progress on these goals will shape vLLM's architecture and competitiveness over the next half year.
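The kind of check involved in the strict-contiguity assertion can be sketched without PyTorch: a tensor is strictly contiguous when its strides equal the canonical row-major strides of its shape. This is an illustrative reimplementation; vLLM's actual check operates on torch tensors.

```python
def row_major_strides(shape):
    """Canonical C-contiguous strides (in elements) for a shape."""
    strides, acc = [], 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))


def is_strictly_contiguous(shape, strides):
    """True only for an exact row-major layout.

    A view can satisfy torch's looser is_contiguous() semantics (which
    ignore strides of size-1 dimensions) yet still fail a strict stride
    comparison like this one, which is the mismatch behind the assertion.
    """
    return tuple(strides) == row_major_strides(tuple(shape))


print(is_strictly_contiguous((4, 8), (8, 1)))   # -> True
print(is_strictly_contiguous((4, 8), (16, 1)))  # -> False (padded rows)
```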
📋 Appendix: Detailed Data
New Issues
- #32495 [CI Failure]: Entrypoints Integration Tests (Responses API) — help wanted,ci-failure — by robertgshaw2-redhat (created: 2026-01-17 08:17 (UTC+8))
- #32488 [Bug]: Llama4 FP8 failure with Flashinfer on B200 — bug — by mxz297 (created: 2026-01-17 02:47 (UTC+8))
- #32494 [Performance]: Testing vLLM upgrade results in significant TTFT degradation — performance — by annabellej (created: 2026-01-17 07:42 (UTC+8))
- #32481 [Bug]: Batch Invariance fails under more diverse workloads — bug — by frankwang28 (created: 2026-01-17 01:08 (UTC+8))
- #32455 [Roadmap] [Draft] vLLM Roadmap Q1 2026 — no labels — by simon-mo (created: 2026-01-16 12:23 (UTC+8))
- #32478 [Bug]: EPLB hangs in several cases — bug — by ilmarkov (created: 2026-01-16 23:57 (UTC+8))
- #32472 [Bug]: Granite 4.0-H Small immediate EOS token in specific prompt combinations — bug — by ksmusz (created: 2026-01-16 18:47 (UTC+8))
- #32469 [Bug]: Error occurs when using Eagle3: Encoder cache miss for {mm_hash} — bug — by Ericoool9614 (created: 2026-01-16 17:55 (UTC+8))
- #32468 [Bug]: Engine core proc EngineCore_DP0 died unexpectedly, shutting down client. — bug — by simplew2011 (created: 2026-01-16 17:21 (UTC+8))
- #32467 [Bug]: vLLM 0.13.0, GLM4.6, H20; encountered error: torch.AcceleratorError: CUDA error: an illegal memory access was encountered — bug — by AmazeQiu (created: 2026-01-16 16:51 (UTC+8))
- #32464 [Bug]: Qwen3-VL-Reranker-8B vllm error — bug — by darvec112357 (created: 2026-01-16 15:34 (UTC+8))
- #32461 [Bug]: QWenBaseModel.embed_input_ids() got an unexpected keyword argument 'multimodal_embeddings' — bug — by honglyua-il (created: 2026-01-16 13:38 (UTC+8))
- #32452 [Bug]: is_strictly_contiguous assertion fails in FlashInfer TRTLLM decode path on Blackwell for Scout — bug — by luccafong (created: 2026-01-16 11:47 (UTC+8))
Closed Issues
- #20671 [Bug]: Whisper not working on 0.9.2 docker image — bug,stale — by andrePankraz (closed: 2026-01-17 10:17 (UTC+8))
- #22341 [Bug]: When using streaming output, the error 'Caught handled exception, but response already started' occurs. — bug,stale,gpt-oss — by okLLM (closed: 2026-01-17 10:16 (UTC+8))
- #24179 [Performance]: When running vllm at 30b-a3b MOE and turning on kv quantization, the decoding speed of 40K input drops significantly from 100t to 40t. — performance,stale — by gengchaogit (closed: 2026-01-17 10:16 (UTC+8))
- #24650 [Feature]: support messages input for classify api skywork-reward model — feature request,stale — by Snowdar (closed: 2026-01-17 10:16 (UTC+8))
- #24714 [Bug]: Hunyuan-7B-Instruct get error output — bug,stale — by hzjane (closed: 2026-01-17 10:16 (UTC+8))
- #24777 [Bug] / [Feature]: Determinism in E2E Spec Decode Test — bug,stale — by wwl2755 (closed: 2026-01-17 10:16 (UTC+8))
- #24800 [Bug]: Fail to initialize DPEngineCoreProc when worldsize = 1 and P/D disaggregate — bug,stale — by nwpu-zxr (closed: 2026-01-17 10:16 (UTC+8))
- #24802 [Usage]: A KeyError occurred while running Qwen3-235B-A22B-Q4_KM (and int4_int8mix) — usage,stale — by zxzx9898 (closed: 2026-01-17 10:15 (UTC+8))
- #24876 [Bug]: uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto not work — bug,stale — by 108851027 (closed: 2026-01-17 10:15 (UTC+8))
- #24879 [Bug]: Crash occurs when calling sleep while running vLLM engine in data parallel mode — bug,stale — by liyuanze (closed: 2026-01-17 10:15 (UTC+8))
- #24894 [Doc]: vllm performance dashboard details — documentation,stale — by omkarpatil6644 (closed: 2026-01-17 10:15 (UTC+8))
- #24943 [Bug]: Worker failed with error 'CUDAGraphMode.FULL_AND_PIECEWISE is not supported with FlexAttentionMetadataBuilder backend (support: AttentionCGSupport.NEVER); please try cudagraph_mode=PIECEWISE — bug,stale — by wangbin306 (closed: 2026-01-17 10:15 (UTC+8))
- #24950 [Usage]: AttributeError: 'Parameter' object has no attribute 'load_column_parallel_weight' — usage,stale — by sallyjunjun (closed: 2026-01-17 10:15 (UTC+8))
- #24959 [Bug]: Error during chat after launching qwen3-32B-AWQ with vllm — bug,stale — by hyqf98 (closed: 2026-01-17 10:15 (UTC+8))
- #24977 [Bug]: libtpu.sdk crashes on Python 3.12 due to ABI incompatibility — bug,stale — by andre-motta (closed: 2026-01-17 10:15 (UTC+8))
- #24992 [Bug]: Empty VllmConfig when initializing LogitsProcessor — bug,stale — by prashantgupta24 (closed: 2026-01-17 10:15 (UTC+8))
- #25015 [Usage]: How to profile vllm with Nsight Compute CLI? — usage,stale — by jessiewiswjc (closed: 2026-01-17 10:15 (UTC+8))
- #32364 [Bug]: Hybrid models generation slows down noticeably when DP is enabled — bug — by yury-tokpanov (closed: 2026-01-17 10:01 (UTC+8))
- #30707 [Bug]: RTX 5080 (SM120) + NVFP4 model fails pre-flight memory check despite model fitting in VRAM — no labels — by Platano78 (closed: 2026-01-17 03:20 (UTC+8))
- #29192 Tool Calling Parsers Fail to Populate tool_calls Array for Qwen2.5-Coder Models — no labels — by Platano78 (closed: 2026-01-17 02:44 (UTC+8))
- #26639 [Bug]: ValueError: No valid structured output parameter found — bug,structured-output — by fish-miku (closed: 2026-01-16 12:23 (UTC+8))
- #32172 [Bug]: DeepSeek V3.2 MTP + PD report two errors — bug — by kebe7jun (closed: 2026-01-16 11:21 (UTC+8))
New PRs
- #32499 [MLA] Add nvfp4 packed KV cache decode path via dequant cache op #32220 — ci/build,v1,nvidia — by baonudesifeizhai (created: 2026-01-17 10:42 (UTC+8))
- #32498 [Docs][Governance] Add @robertshaw2-redhat to lead maintainers group — documentation — by simon-mo (created: 2026-01-17 10:35 (UTC+8))
- #32473 [Frontend] Add render endpoints for prompt preprocessing — frontend — by hyeongyun0916 (created: 2026-01-16 20:37 (UTC+8))
- #32486 "refactor: refactor_repeated_interfaces" — ready — by tom-zju (created: 2026-01-17 01:46 (UTC+8))
- #32482 [CI] Add Helion as an optional dependency — ready,ci/build — by gmagogsfm (created: 2026-01-17 01:13 (UTC+8))
- #32493 Add embedding input functionality for disabled modalities [remake] — documentation,frontend,needs-rebase,v1,multi-modality — by reaganjlee (created: 2026-01-17 07:28 (UTC+8))
- #32496 [Bugfix] Fix Llama 4 FP8 failure with FlashInfer on B200 (Nullptr crash) — bug,llama,nvidia — by lfopensource (created: 2026-01-17 08:24 (UTC+8))
- #32497 [Bugfix][Hardware][AMD] Fix RCCL initialization in Ray distributed executor — bug,rocm,v1 — by c0de128 (created: 2026-01-17 09:35 (UTC+8))
- #32456 [Model] Add Eagle2.5-8B Vision-Language Model support — documentation,new-model,speculative-decoding — by George-Polya (created: 2026-01-16 12:28 (UTC+8))
- #32492 [RFC][ROCM] Enable aiter attn backend for qwen3-next model — rocm,v1,qwen — by jennyyyyzhen (created: 2026-01-17 06:52 (UTC+8))
- #32490 fix(reasoning): don't check prompt_token_ids for reasoning end state — frontend — by farazshaikh (created: 2026-01-17 05:19 (UTC+8))
- #32491 [WIP] Update FlashMLA — ready,ci/build,ready-run-all-tests — by LucasWilkinson (created: 2026-01-17 05:20 (UTC+8))
- #32474 [Docker][Hotfix] CUDA compatibility enablement — ci/build,nvidia — by emricksini-h (created: 2026-01-16 21:47 (UTC+8))
- #32489 [CI] Fix OOM in Hopper Fusion E2E Tests (H100) — ready,ci/build — by LucasWilkinson (created: 2026-01-17 04:05 (UTC+8))
- #32484 Revert "[Attention][MLA] Make FLASHINFER_MLA the default MLA backen… — ready,nvidia — by MatthewBonanni (created: 2026-01-17 01:30 (UTC+8))
- #32485 [not ready for review] extend fp8 online quant with blockwise scaling — no labels — by vkuzo (created: 2026-01-17 01:35 (UTC+8))
- #32470 [feat] add num preempted output — frontend,v1 — by RobotGF (created: 2026-01-16 18:17 (UTC+8))
- #32448 apply _validate_input to MistralTokenizer token-id chat prompts — frontend,ready — by vanshilshah97 (created: 2026-01-16 10:57 (UTC+8))
- #32487 [CI][Attention] Add more CI dependencies for attention tests — ci/build — by MatthewBonanni (created: 2026-01-17 01:46 (UTC+8))
- #32483 Revert #32339 — no labels — by MatthewBonanni (created: 2026-01-17 01:28 (UTC+8))
- #32479 [CI] Update deepgemm to newer version — ready — by yewentao256 (created: 2026-01-17 00:17 (UTC+8))
- #32480 Strengthen batch inv tests — v1 — by frankwang28 (created: 2026-01-17 00:58 (UTC+8))
- #32477 [7/N][Attention][Docs] Add documentation for attention backends — documentation — by MatthewBonanni (created: 2026-01-16 22:25 (UTC+8))
- #32460 [ROCm][CI] Skip Qwen3-30B-A3B-MXFP4A16 Eval Test On Non-CUDA Platforms — rocm,ready,qwen,nvidia — by micah-wil (created: 2026-01-16 13:05 (UTC+8))
- #32471 [Bugfix] Add OOT backend option — bug,ready — by iboiko-habana (created: 2026-01-16 18:21 (UTC+8))
- #32453 [BugFix] Fix is_strictly_contiguous assertion for decode_query in TRT… — bug,v1,nvidia — by luccafong (created: 2026-01-16 11:57 (UTC+8))
- #32475 [Doc] Clarify comment regarding partial block loading in OffloadingConnector — documentation,new-model,needs-rebase,ci/build,v1,multi-modality,llama,qwen,kv-connector — by ShadowNearby (created: 2026-01-16 22:03 (UTC+8))
- #32476 [Doc] Clarify comment regarding partial block loading — kv-connector — by ShadowNearby (created: 2026-01-16 22:12 (UTC+8))
- #32454 docs: add version requirement note for --profiler-config flag — documentation — by abhishkh (created: 2026-01-16 12:08 (UTC+8))
- #32462 [BugFix] Fix embed_input_ids argument error of QwenVLForConditionalGeneration — bug,ready,qwen — by honglyua-il (created: 2026-01-16 13:41 (UTC+8))
- #32465 Pangu reasoning parser and tool parser — no labels — by Ji-Yao (created: 2026-01-16 15:36 (UTC+8))
- #32459 [Chore] Replace swish with silu — ready — by DarkLight1337 (created: 2026-01-16 12:52 (UTC+8))
- #32466 add pangu_vl test — new-model — by Emilie1001 (created: 2026-01-16 15:53 (UTC+8))
- #32463 [Bugfix] [DP] Fix create too many placement groups — bug,v1 — by kebe7jun (created: 2026-01-16 14:00 (UTC+8))
- #32458 [CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… — bug — by junuxyz (created: 2026-01-16 12:47 (UTC+8))
- #32457 [Bugfix] Use a large enough aiohttp read_bufsize to avoid ContentLengthError — bug — by ApsarasX (created: 2026-01-16 12:45 (UTC+8))
- #32449 [model] Add support for openPangu7B-VL — new-model — by hujiaxin0 (created: 2026-01-16 11:21 (UTC+8))
- #32451 add fireredasr model — new-model — by sxl1993 (created: 2026-01-16 11:46 (UTC+8))
- #32450 [P/D] Enable KV cache queries and eliminate redundant prefill computation — documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1 — by snadampal (created: 2026-01-16 11:31 (UTC+8))
- #32447 [Bugfix] Fix ROCm dockerfiles — bug,rocm,ci/build — by tjtanaa (created: 2026-01-16 10:46 (UTC+8))
Merged PRs
- #32498 [Docs][Governance] Add @robertshaw2-redhat to lead maintainers group — documentation — by simon-mo (merged: 2026-01-17 10:35 (UTC+8))
- #28506 [TPU][Core] Enable Pipeline Parallelism on TPU backend — ready,v1 — by Chenyaaang (merged: 2026-01-17 07:29 (UTC+8))
- #32489 [CI] Fix OOM in Hopper Fusion E2E Tests (H100) — ready,ci/build — by LucasWilkinson (merged: 2026-01-17 05:27 (UTC+8))
- #32383 [responsesAPI] allow tuning include_stop_str_in_output — frontend,ready — by qandrew (merged: 2026-01-17 05:14 (UTC+8))
- #32425 [LoRA] Update LoRA expand kernel heuristic — ready — by xyang16 (merged: 2026-01-17 02:38 (UTC+8))
- #29843 Atomics Reduce Counting Optimization for SplitK Skinny GEMMs. — rocm,ready — by amd-hhashemi (merged: 2026-01-17 01:45 (UTC+8))
- #32479 [CI] Update deepgemm to newer version — ready — by yewentao256 (merged: 2026-01-17 01:18 (UTC+8))
- #32460 [ROCm][CI] Skip Qwen3-30B-A3B-MXFP4A16 Eval Test On Non-CUDA Platforms — rocm,ready,qwen,nvidia — by micah-wil (merged: 2026-01-16 16:17 (UTC+8))
- #32418 [EPLB][BugFix] Possible deadlock fix — bug,ready — by ilmarkov (merged: 2026-01-16 22:11 (UTC+8))
- #32459 [Chore] Replace swish with silu — ready — by DarkLight1337 (merged: 2026-01-16 16:22 (UTC+8))
- #32444 [CI][AMD] Skip test_permute_cols since the kernel is not used and not built for ROCm — rocm,ready — by rasmith (merged: 2026-01-16 16:22 (UTC+8))
- #32244 fix(rocm): Enable non-gated MoE (is_act_and_mul=False) support on ROCm — rocm,ready — by rabi (merged: 2026-01-16 15:31 (UTC+8))
- #32306 [Bugfix] Refactor to support DP parallel in R3 — bug,ready,v1 — by xhx1022 (merged: 2026-01-16 15:13 (UTC+8))
- #30499 [CI] Breakup h200 tests — ready,ci/build — by LucasWilkinson (merged: 2026-01-16 14:23 (UTC+8))
- #32395 [Frontend][1/n] Make pooling entrypoints request schema consensus CompletionRequest — documentation,frontend,ready,multi-modality — by noooop (merged: 2026-01-16 14:17 (UTC+8))
- #32438 [Bug] Add TPU backend option — bug,ready — by vanbasten23 (merged: 2026-01-16 13:17 (UTC+8))
- #26822 [bugfix] Fix online serving crash when text type response_format is received — bug,frontend,ready,tool-calling — by cjackal (merged: 2026-01-16 12:23 (UTC+8))
- #32175 [Bugfix] [DeepSeek-V3.2] fix sparse_attn_indexer padding — bug,ready,deepseek — by kebe7jun (merged: 2026-01-16 11:21 (UTC+8))
- #32329 [Model] Add Step3vl 10b — documentation,new-model,ready — by ltd0924 (merged: 2026-01-16 11:04 (UTC+8))
- #32447 [Bugfix] Fix ROCm dockerfiles — bug,rocm,ci/build — by tjtanaa (merged: 2026-01-16 10:50 (UTC+8))
Closed Without Merge
- #23543 [Bugfix][Frontend] Fix bug where default_sampling_params.stop_token_ids is unused in chat API — frontend,stale — by n0gu-furiosa (closed: 2026-01-17 10:16 (UTC+8))
- #24500 [Models] Prefer in-place add and multiply — speculative-decoding,needs-rebase,stale,llama,qwen,deepseek — by lgeiger (closed: 2026-01-17 10:16 (UTC+8))
- #24932 [Metrics] Add finer-grained buckets for request_latency to support faster/embedding models — stale,v1 — by pedramr (closed: 2026-01-17 10:15 (UTC+8))
- #24942 [XPU] refine bf16 check on xpu platform — stale,v1 — by chaojun-zhang (closed: 2026-01-17 10:15 (UTC+8))
- #25000 [ROCm] Add some dependencies for ROCm — rocm,ci/build,stale — by Concurrensee (closed: 2026-01-17 10:15 (UTC+8))
- #32087 refactor: refactor_repeated_interfaces — deepseek — by tom-zju (closed: 2026-01-17 10:02 (UTC+8))
- #32483 Revert #32339 — no labels — by MatthewBonanni (closed: 2026-01-17 01:28 (UTC+8))
- #32475 [Doc] Clarify comment regarding partial block loading in OffloadingConnector — documentation,new-model,needs-rebase,ci/build,v1,multi-modality,llama,qwen,kv-connector — by ShadowNearby (closed: 2026-01-16 22:05 (UTC+8))
- #32465 Pangu reasoning parser and tool parser — no labels — by Ji-Yao (closed: 2026-01-16 16:39 (UTC+8))
- #32466 add pangu_vl test — new-model — by Emilie1001 (closed: 2026-01-16 16:03 (UTC+8))
- #32450 [P/D] Enable KV cache queries and eliminate redundant prefill computation — documentation,performance,new-model,rocm,frontend,tpu,speculative-decoding,needs-rebase,ci/build,v1 — by snadampal (closed: 2026-01-16 11:32 (UTC+8))