vLLM Development Activity Report - 2026-03-26
Time window: 2026-03-26 11:45 (UTC+8) ~ 2026-03-27 11:45 (UTC+8). Stats: 37 new Issues | 33 closed Issues | 96 new PRs | 37 merged PRs | 33 PRs closed without merging
📊 Daily Development Status Summary
The vLLM community was highly active this cycle (2026-03-26 to 2026-03-27), producing 133 new Issues and PRs combined. Development focused on performance optimization (especially KV-cache management and kernel fusion), support for new models and quantization formats (e.g. Phi-4-Vision, TurboQuant), and continued maturation of the AMD/ROCm ecosystem. Key technical proposals (e.g. incremental MoE offloading) were put forward and several performance-fix PRs were merged, showing rapid iteration toward a high-efficiency inference engine.
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was very lively this cycle, spanning bug fixes, performance optimization, CI/CD, and documentation.
Issues:
- #38307: Reports a bug where AMD's MiniMax MXFP4 model still fails with trust_remote_code=True. Related to PR #37698; blocks upstream integration.
- #38304: Filed by functionstackx, pointing out that the official docs (vllm.ai) have not been updated to cover ROCm nightly Docker images and wheel installation instructions; requests documentation parity with CUDA.
- #38303: Reports that the MiniMax NVFP4 model crashes because scales are missing during loading; notes the issue is already fixed in PR #37214 and requires waiting for v0.19 or using a nightly build.
- #38246: Not directly AMD-specific, but filed by AMD employee functionstackx, requesting better log output during Flashinfer compilation so long silent periods are not mistaken for hangs.
PRs (highlighting contributions from -amd contributors):
- Performance optimization:
  - #38313 & #38299 (by khairulkabir1661): Add AITER fused-kernel support for DeepSeek MLA attention on ROCm, implementing RoPE + KV-cache fusion and RMSNorm + FP8-quantization fusion respectively, to reduce memory bandwidth and kernel-launch overhead and boost performance on AMD GPUs such as the MI300X.
  - #38296 (by divakar-amd): Refactors the AITER FP8 attention kernel to support q_scale natively, simplifying the code and improving performance.
- Bug fixes:
  - #38293 (by vecheruk-amd): Fixes an AssertionError crash when MoE models using Quark quantization (e.g. MXFP4) run speculative decoding (MTP/Eagle) and the draft model's MoE layer prefix does not match. The fix replaces the assertion with a fallback to the unquantized MoE method.
  - #38263 (by tjtanaa): Hotfix for a build failure in the freshly enabled ROCm nightly release pipeline caused by an undefined variable.
  - #37547 (by gronsti-amd): Fixes a performance issue in the Sparse MLA implementation on ROCm where misuse of lru_cache caused repeated module imports.
- Platform support & CI:
  - #38236 (by Chinmay-Kulkarni-AMD): Updates the AMD Zen CPU backend to declare supported dtypes correctly and switch to the zentorch-weekly dependency.
  - #38238 (by dhonnappa-amd): Removes host-GPU-state interaction steps from AMD hardware CI test scripts to accommodate Kubernetes environments.
  - #38165 (by AndreasKaratzas): Fixes the PYTORCH_ROCM_ARCH environment variable in CI test containers so Quark's JIT compilation targets only the current GPU architecture rather than all of them, speeding up tests.
  - #38184 (by micah-wil): Enables kernel core-operation tests on MI325 devices and mitigates related test flakiness on MI250.
- Release & ecosystem building:
  - #37283 (by tjtanaa): Merged. Implements automated publishing of ROCm nightly Docker images and wheel packages, a key step toward making the AMD ecosystem a first-class citizen in vLLM, directly addressing Issues #36703 and #36704.
Summary: AMD's contributions to vLLM are broad and deep, advancing in parallel on low-level kernel performance (AITER), quantization toolchain integration (Quark), platform support (Zen CPU), CI/CD pipelines, and end-user deliverables (nightly builds), showing a determination to build a complete, high-performance heterogeneous-compute ecosystem.
💬 High-Engagement Discussion Analysis
- Issue #38256: [RFC]: Incremental MoE Expert Offloading:
  - Core topic: Proposes a dynamic MoE expert-weight offloading architecture using a GPU cache plus an async pipeline, enabling large MoE models to run on hardware with limited VRAM.
  - Views & status: The author e1n00r lays out the design principles (the cache acts as the weight provider), a 3-PR phased implementation plan, and production data from an independent implementation, tinyserve, as supporting evidence. No comments yet, but the RFC is thorough and seeks community feedback on the architecture.
- PR #38280: [Quantization] Add TurboQuant KV cache quantization:
  - Core topic: Implements Google's TurboQuant algorithm, a quality-preserving sub-4-bit vector quantization method for the KV cache.
  - Views & status: Contributor lishunyang12 shows striking benchmarks on Qwen2.5-1.5B: 100% output match at 2/3/4-bit quantization with zero latency overhead. Community response is positive; xinyu-intel confirms zero quality loss and zero latency overhead on XPU as well. Discussion also covers comparison with another draft implementation and future test suggestions.
- Issue #38212: MiniMaxM2ReasoningParser broken for M2.5:
  - Core topic: MiniMaxM2ReasoningParser cannot correctly handle the opening tag emitted by M2.5 models, causing reasoning content to leak into the output.
  - Views & status: Reporter SilviuSavu pinpointed the cause: the extract_reasoning_streaming override does not account for the opening tag. Participant Huyueeer supplied detailed test code for verification. The main point of debate was the fix approach (remove the override vs. patch it); judging by the follow-up PR #38213, the simpler route of removing the override was taken.
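The pitfall described above can be illustrated with a minimal sketch of streaming reasoning-tag parsing. The tag names, function shape, and class-free design are assumptions for illustration; this is not vLLM's actual parser API.

```python
# Hypothetical sketch of the parser pitfall: a reasoning parser must handle
# output that begins with an explicit start tag (M2.5-style) as well as
# output that begins directly inside the reasoning block. Tag names are
# illustrative assumptions, not the actual MiniMax tags.
START_TAG = "<think>"
END_TAG = "</think>"

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, content) for an accumulated output string."""
    if text.startswith(START_TAG):
        text = text[len(START_TAG):]        # strip the optional start tag
    if END_TAG in text:
        reasoning, _, content = text.partition(END_TAG)
        return reasoning, content
    # No end tag yet: everything seen so far is still reasoning.
    return text, ""

# With the start tag and without it, the result is identical:
print(split_reasoning("<think>plan steps</think>answer"))  # ('plan steps', 'answer')
print(split_reasoning("plan steps</think>answer"))         # ('plan steps', 'answer')
```

A parser that assumes no start tag (as the broken override did) would leave the tag text in the reasoning stream, which is exactly the leak the issue reports.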
- Issue #38246: [Feature]: Better Flashinfer compilation logging:
  - Core topic: Flashinfer kernel compilation/tuning can run up to 10 minutes with no log output and is easily mistaken for a hang; the issue requests progress reporting.
  - Views & status: Filer ProExpertProg described the pain point. The community responded quickly: JINO-ROHIT claimed the issue immediately and submitted fix PR #38254. The discussion was efficient and pragmatic, squarely targeting a user-experience problem.
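One common mitigation for this kind of silent long-running step is a background heartbeat that logs elapsed time periodically. The sketch below is a generic pattern under assumed names; it is not the code from PR #38254 or Flashinfer.

```python
# Illustrative heartbeat pattern for long, silent compile steps: a background
# thread prints periodic progress so a multi-minute JIT compile is not
# mistaken for a hang. Generic sketch; not Flashinfer's actual logging.
import threading
import time

class Heartbeat:
    def __init__(self, label: str, interval: float = 1.0):
        self.label, self.interval = label, interval
        self._stop = threading.Event()
        self._ticks = 0
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait doubles as a cancellable sleep.
        while not self._stop.wait(self.interval):
            self._ticks += 1
            print(f"[{self.label}] still running... "
                  f"{self._ticks * self.interval:.2f}s elapsed")

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

with Heartbeat("flashinfer-compile", interval=0.05) as hb:
    time.sleep(0.2)   # stand-in for a long compilation step
print("ticks observed:", hb._ticks)
```

Wrapping the compile call in such a context manager costs one thread and guarantees at least one log line per interval.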
- PR #38216: [Perf] Batch KV cache swap copies:
  - Core topic: Batches the memory copies in KV-cache offload/load via cuMemcpyBatchAsync, sharply reducing driver-call overhead.
  - Views & status: Contributor Etelis provides thorough per-model benchmarks showing 3.6x to 7.4x speedups across a range of configurations. The optimization is behavior-preserving and handles both CUDA 12.8+ and older versions gracefully; a clear-win foundational performance improvement with no controversy so far.
🔥 Hot Topics & Trend Analysis
- Performance optimization in overdrive: the community's pursuit of peak performance shows at every level: KV-cache management (batched copies #38216, TurboQuant compression #38280, hybrid-model offloading #38261), kernel fusion (AMD's AITER fusions #38313, #38299), and execution scheduling (skipping empty work #38287).
- New models and modalities: continuously integrating newly released models is key to vLLM's competitiveness. This cycle saw a support issue for the Phi-4 vision-reasoning model (#38309) with a corresponding PR (#38306), plus a problem running the NVIDIA Nemotron-3-Nano hybrid model on Blackwell (#38208).
- Tool calling and reasoning streams: bug fixes and refactors continue for tool-call parsers (#38274, #38189) and reasoning ("think") stream parsers (#38212, #38213) tied to model agent capabilities, reflecting attention to stability for complex production use cases.
- Holistic AMD ecosystem building: from low-level kernels and quantization tooling (Quark) to CI/CD, release pipelines (nightly Docker/wheels), and user documentation, AMD is building a complete, usable software stack inside vLLM, challenging CUDA's ecosystem dominance.
🛠️ Key Technical Changes
- PR #37283: [Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm:
  - Technical summary: Establishes a continuous release pipeline for the ROCm backend on par with CUDA, automatically building and pushing nightly Docker images (vllm/vllm-openai-rocm:nightly) and pip wheel packages.
  - Impact: Greatly improves the developer experience on AMD platforms by making the latest features easy to consume; an important milestone for AMD ecosystem maturity. Watch for the follow-up documentation update (Issue #38304).
- PR #38280: [Quantization] Add TurboQuant KV cache quantization:
  - Technical summary: First introduction of TurboQuant, a state-of-the-art vector quantization technique for the KV cache. The implementation covers the algorithm core, engine integration, Triton kernels, and full benchmarks. Its hallmark is near-lossless compression at very low bit widths (2-4 bit) with minimal runtime overhead.
  - Impact: A powerful new tool for saving VRAM in ultra-long-context scenarios; could reshape the technology choices for KV-cache compression.
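To make the low-bit KV-cache idea concrete, here is a minimal round-trip sketch of symmetric per-vector uniform quantization at 2-4 bits. This illustrates only the generic quantize/dequantize mechanics, not the TurboQuant algorithm itself (which uses more sophisticated vector quantization).

```python
# Generic low-bit quantization round-trip for a KV-cache-shaped tensor.
# NOT the TurboQuant algorithm from PR #38280; a plain uniform-quantization
# baseline to show what "2/3/4-bit KV cache" means mechanically.
import numpy as np

def quantize(x: np.ndarray, bits: int):
    """Symmetric per-vector (last-axis) uniform quantization."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard against div-by-zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 8, 64)).astype(np.float32)  # (layers, heads, head_dim)
for bits in (2, 3, 4):
    q, s = quantize(kv, bits)
    err = np.abs(dequantize(q, s) - kv).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

Error falls as bit width rises; TurboQuant's claim is that with better codebooks the quality loss at these widths becomes negligible in end-to-end generation.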
- PR #38216: [Perf] Batch KV cache swap copies via cuMemcpyBatchAsync:
  - Technical summary: Aggregates the many fine-grained cudaMemcpyAsync calls in KV-cache offload/load into a single cuMemcpyBatchAsync call, exploiting modern CUDA drivers' batching capability to significantly cut CPU-side overhead.
  - Impact: Directly raises throughput in scenarios that use KV-cache offloading (e.g. under memory pressure or with external caches such as NIXL); a broadly applicable, efficient infrastructure optimization.
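The win here is entirely about call count, not bytes moved. The CPU-only sketch below models that: the same copies happen either way, but the batched path issues one "driver call" instead of one per block. The buffers and the `batched_copy` helper are illustrative stand-ins for device pointers and cuMemcpyBatchAsync, not real CUDA bindings.

```python
# Conceptual model of the batching idea in PR #38216: gather all (dst, src)
# pairs for a KV-cache swap and dispatch them in one batched call, instead
# of one driver call per block. NumPy arrays stand in for device buffers.
import numpy as np

def batched_copy(pairs, use_batch_api: bool) -> int:
    """Perform all copies; return the number of 'driver calls' issued."""
    if use_batch_api:
        for dst, src in pairs:      # in CUDA, this loop runs inside the driver
            np.copyto(dst, src)
        return 1                    # one cuMemcpyBatchAsync-style call
    calls = 0
    for dst, src in pairs:
        np.copyto(dst, src)         # one cudaMemcpyAsync-style call each
        calls += 1
    return calls

blocks = [np.arange(16, dtype=np.float32) + i for i in range(128)]
dsts = [np.empty(16, dtype=np.float32) for _ in blocks]
calls = batched_copy(list(zip(dsts, blocks)), use_batch_api=True)
print(f"128 block copies issued in {calls} driver call(s)")
```

With hundreds of KV blocks per swap, collapsing per-copy launch overhead into one call is where the reported 3.6x-7.4x gains come from; the fallback path (`use_batch_api=False`) mirrors the pre-12.8 behavior.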
- RFC #38256: Incremental MoE Expert Offloading:
  - Technical summary: Proposes an incremental, cache-based scheme for dynamically scheduling MoE expert weights. Unlike static offloading, it migrates expert weights between GPU memory (the cache) and CPU memory based on access frequency, letting very large MoE models run within limited VRAM.
  - Impact: A novel software approach to lowering the hardware bar for deploying huge models (e.g. hundred-billion-parameter MoE). The implementation plan (split into 3 PRs) is designed to keep review complexity down.
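A toy version of the "cache as weight provider" idea can be sketched as an LRU cache over expert weights. All names here are illustrative; the RFC's actual design additionally pipelines host-to-device copies asynchronously.

```python
# Minimal sketch of a cache-based expert provider in the spirit of RFC #38256:
# expert weights live in host memory; a fixed-capacity GPU-side cache keeps
# hot experts resident and evicts least-recently-used ones. An OrderedDict
# stands in for GPU memory; real systems overlap H2D copies with compute.
from collections import OrderedDict
import numpy as np

class ExpertCache:
    def __init__(self, host_weights: dict[int, np.ndarray], capacity: int):
        self.host = host_weights            # stand-in for CPU-pinned weights
        self.capacity = capacity
        self.gpu = OrderedDict()            # stand-in for GPU-resident copies
        self.hits = self.misses = 0

    def get(self, expert_id: int) -> np.ndarray:
        if expert_id in self.gpu:
            self.hits += 1
            self.gpu.move_to_end(expert_id)         # mark most-recently-used
        else:
            self.misses += 1
            if len(self.gpu) >= self.capacity:
                self.gpu.popitem(last=False)        # evict the LRU expert
            self.gpu[expert_id] = self.host[expert_id].copy()  # "H2D copy"
        return self.gpu[expert_id]

weights = {e: np.full(4, e, dtype=np.float32) for e in range(8)}
cache = ExpertCache(weights, capacity=2)
for e in [0, 1, 0, 2, 0, 1]:        # expert 0 is "hot" and stays resident
    cache.get(e)
print("hits:", cache.hits, "misses:", cache.misses)  # hits: 2 misses: 4
```

The routing skew typical of MoE workloads is what makes such a cache pay off: hot experts stay on-device while cold ones live in cheaper host memory.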
- PR #38313 & #38299: [ROCm] AITER fusion kernels for MLA:
  - Technical summary: Targets the specific compute patterns of DeepSeek MLA models on AMD GPUs, using the AITER library to fuse key operations: RoPE fused with the KV-cache write, and RMSNorm fused with FP8 quantization.
  - Impact: Deep performance work for a specific model-hardware combination, showing the AMD team's ability to extract MI300-series potential through hardware/software co-design and boosting competitiveness on these workloads.
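The value of a norm+quant fusion is that the normalized activations never round-trip through memory. The sketch below shows the numerics only, with int8 symmetric quantization emulating the FP8 step (NumPy has no FP8 type); it is a conceptual model, not the AITER kernel.

```python
# Numerical sketch of the RMSNorm + quantization fusion in #38299: the fused
# path computes both steps in one logical pass instead of writing normalized
# activations to memory and re-reading them. "FP8" is emulated with int8
# symmetric quantization, since NumPy lacks an FP8 dtype.
import numpy as np

def rmsnorm(x, w, eps=1e-6):
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps) * w

def quant(x, qmax=127):
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8), scale

def fused_rmsnorm_quant(x, w, eps=1e-6, qmax=127):
    # In a real fused kernel the normalized values stay in registers here.
    return quant(rmsnorm(x, w, eps), qmax)

x = np.random.default_rng(1).standard_normal((4, 64)).astype(np.float32)
w = np.ones(64, dtype=np.float32)
q1, s1 = quant(rmsnorm(x, w))           # unfused: two passes over memory
q2, s2 = fused_rmsnorm_quant(x, w)      # fused: same result, fewer trips
print("bitwise identical:", np.array_equal(q1, q2) and s1 == s2)  # True
```

Since the math is unchanged, fusion preserves results bit-for-bit; the savings are purely in memory bandwidth and kernel launches, which is exactly what these PRs target.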
📈 Development Activity Observations
- Contributor diversity: active contributors include hardware-vendor employees from AMD (-amd) and NVIDIA (-nv?), plus many community developers and researchers. functionstackx has been especially active in reporting issues and pushing AMD ecosystem completeness.
- Efficient review and merging: 37 PRs were merged this cycle, including major features (e.g. the ROCm nightly release) and performance optimizations. Many PRs carry the ready label, reflecting a mature process for code quality and merge readiness.
- High-intensity AMD involvement: AMD-related contributions span the full chain from bug fixes and performance work to infrastructure, with multiple contributors working in concert, indicating organized, deep engagement.
💡 Issues Worth Watching
- Issue #38257: Qwen3-VL-235B OOM with multi-image long multiturn inputs: a very large vision-language model OOMs on multi-image, long-conversation inputs even on an 8x H100 node. This touches the current VRAM limits of multimodal LLM inference; solutions may require more sophisticated activation or KV-cache management strategies.
- Issue #38266: tokenizing long redundant sequences causes API server deadlock: with certain tokenizers (e.g. Harmony), tokenizing long redundant strings takes so long that the API server deadlocks, affecting other concurrent requests. This exposes a potential isolation gap in server-side request handling, with real implications for production deployments.
- RFC #38256: Incremental MoE Expert Offloading: if accepted and implemented, this architecture proposal would significantly extend vLLM's ability to deploy very large MoE models; an important technical direction to track.
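One way to think about the tokenization-deadlock class of problem (Issue #38266) is request isolation with a deadline: run the slow step in a worker thread and bound how long a single request may occupy it. The sketch below is a generic pattern; `slow_tokenize` is a hypothetical stand-in for a real tokenizer, and the actual vLLM fix may take a different approach.

```python
# Sketch of a deadline-bounded tokenization call: a pathologically slow
# tokenize cannot block the caller indefinitely. `slow_tokenize` is a
# hypothetical stand-in whose cost grows with input length.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def slow_tokenize(text: str) -> list[int]:
    time.sleep(0.01 * len(text))        # pretend cost scales with input length
    return [ord(c) for c in text]

pool = ThreadPoolExecutor(max_workers=4)

def tokenize_with_deadline(text: str, deadline_s: float):
    fut = pool.submit(slow_tokenize, text)
    try:
        return fut.result(timeout=deadline_s)
    except TimeoutError:
        fut.cancel()                    # best effort; running work can't be killed
        return None                     # caller can return an error instead of hanging

print(tokenize_with_deadline("ok", deadline_s=1.0))       # [111, 107]
print(tokenize_with_deadline("x" * 100, deadline_s=0.05)) # None (deadline exceeded)
```

Note the caveat in the comment: a Python thread already running cannot be forcibly stopped, so a full fix also needs the slow path itself to be interruptible or chunked; the deadline only protects the caller.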
📋 Appendix: Detailed Data Lists
New Issues
- #38212 MiniMaxM2ReasoningParser broken for M2.5: extract_reasoning_streaming assumes no start tag — no labels — by SilviuSavu (created: 2026-03-26 16:43 (UTC+8))
- #38309 [Bug]: microsoft/Phi-4-reasoning-vision-15B Fails to startup — bug — by varun-sundar-rabindranath (created: 2026-03-27 10:22 (UTC+8))
- #38307 [Bug]: AMD's minimax mxfp4 trust_remote_code bug — bug,rocm — by functionstackx (created: 2026-03-27 10:18 (UTC+8))
- #38308 [Feature]: support affinity settings in helm chart — feature request — by utsumi-fj (created: 2026-03-27 10:20 (UTC+8))
- #38304 [Feature]: vllm.ai docs should should instructions for rocm nightly docker & rocm nightly wheel — feature request,rocm — by functionstackx (created: 2026-03-27 10:08 (UTC+8))
- #38303 [Bug]: minimax nvfp4 model crash — bug — by functionstackx (created: 2026-03-27 09:37 (UTC+8))
- #38298 Energy Efficiency: 10 Mathematical Techniques for 60-70% AI Energy Reduction (Phi6Simple, FFT-Mix, Phi MoE) — no labels — by dancinlife (created: 2026-03-27 08:35 (UTC+8))
- #38297 [Bug]: Gemma3n concurrent audio requests crash EngineCore — missing dynamic_dims on audio sequence dimension — no labels — by RushRed (created: 2026-03-27 08:28 (UTC+8))
- #38291 [Feature]: Add Rotorquant support — feature request — by Mimi8298 (created: 2026-03-27 07:48 (UTC+8))
- #38201 [Feature]: Support TurboQuant KV quant — feature request — by NiuBlibing (created: 2026-03-26 15:32 (UTC+8))
- #38286 [Feature]: Batch invariance on 3090 — feature request — by YM2132 (created: 2026-03-27 05:46 (UTC+8))
- #38282 [Feature]: Replace vanilla MaxSim with flash-maxsim for late-interaction scoring — feature request — by roipony (created: 2026-03-27 04:33 (UTC+8))
- #38279 [Feature]: PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference — feature request — by asrvastava (created: 2026-03-27 04:01 (UTC+8))
- #38266 [Bug]: tokenizing long redundant sequences causes API server deadlock (harmony and others) — bug — by Gnoale (created: 2026-03-27 02:44 (UTC+8))
- #38260 [RFC]: Multi-tier KV offloading via the vLLM offloading connector — RFC — by dannyharnik (created: 2026-03-27 01:01 (UTC+8))
- #38221 Flaky test: test_abort_during_final_step[False] fails intermittently — no labels — by markmc (created: 2026-03-26 19:28 (UTC+8))
- #38257 [Bug]: Qwen3-VL-235B OOM with multi-image long multiturn inputs — bug — by cjackal (created: 2026-03-27 00:29 (UTC+8))
- #38258 [Usage]: How to do offline inference on one rank in a distributed environment? — usage — by DragonAura (created: 2026-03-27 00:43 (UTC+8))
- #38233 [Bug]: Voxtral-Mini-4B-Realtime hangs/crashes on multiple sessions due to encoder_cache_usage saturation on 16GB GPU — bug — by sh1man (created: 2026-03-26 20:28 (UTC+8))
- #38235 [Feature]: Quantization support (AWQ / GPTQ / FP8) for mistralai/Voxtral-Mini-4B-Realtime-2602 — feature request — by sh1man (created: 2026-03-26 20:56 (UTC+8))
- #38256 [RFC]: Incremental MoE Expert Offloading — GPU Cache + Async Pipeline — no labels — by e1n00r (created: 2026-03-27 00:17 (UTC+8))
- #38246 [Feature]: Better Flashinfer compilation logging — help wanted,feature request — by ProExpertProg (created: 2026-03-26 23:22 (UTC+8))
- #38250 [Bug]: VLLM_CPU_OMP_THREADS_BIND=nobind cannot be used with tp>1 on CPU backends — bug — by kot-begemot-uk (created: 2026-03-26 23:36 (UTC+8))
- #38231 [RFC]: Support Dynamic Model Switching and Flexible Collective Communication in External Launcher Mode — RFC — by KaisennHu (created: 2026-03-26 20:11 (UTC+8))
- #38241 [Bug]: DSR1 hang on B200 — bug — by ProExpertProg (created: 2026-03-26 22:25 (UTC+8))
- #38245 [Bug]: Responses API text.format.type="json_schema" leaks schema_ in non-stream responses and breaks streaming — bug — by noobHappylife (created: 2026-03-26 23:21 (UTC+8))
- #38240 [Feature]: dflash speculator model support — feature request — by shanjiaz (created: 2026-03-26 22:18 (UTC+8))
- #38234 Test Failure: test_run_eagle_dp[FLASH_ATTN] produces non-deterministic outputs with EAGLE speculative decoding — no labels — by markmc (created: 2026-03-26 20:55 (UTC+8))
- #38230 Hybrid KV offload: MultiConnector + planner for mamba+attention models — no labels — by malaiwah (created: 2026-03-26 20:06 (UTC+8))
- #38228 [Bug]: Based on vllm 0.18.0 version, when the number of tensor parallelizations is greater than 1, an error message will be reported: [AMP ERROR] [CudaFrontend.cpp:94] [failed to call cuCtxGetDevice(&device), error code: CUDA-ERROR-INVALID — bug — by uOnePiece (created: 2026-03-26 20:01 (UTC+8))
- #38208 [Bug]: CUDA Illegal Instruction during CUDA Graph capture with Nemotron-3-Nano NVFP4 on sm_121 — bug — by dennis-lynch (created: 2026-03-26 16:21 (UTC+8))
- #38203 [Bug]: M2.5 tool call result is badcase, deploy 1p1d with nixl connector, P and D use DP8-EP-TP1 — bug — by ssbandjl (created: 2026-03-26 15:54 (UTC+8))
- #38202 [Feature]: Add apply_with_spec_decode() method to LogitBiasLogitsProcessor for speculative decoding support — feature request — by ranger2571 (created: 2026-03-26 15:48 (UTC+8))
- #38196 GDN attention backend crashes with ngram speculative decoding on mixed decode batches — no labels — by bhaktatejas922 (created: 2026-03-26 14:41 (UTC+8))
- #38197 [Bug]: Qwen3.5-dense wfp8afp8 w: per-tensor a: per-tensor Output garbled text, but in sglang is norm — bug — by kkyyxhll (created: 2026-03-26 14:42 (UTC+8))
- #38195 [Bug]: Qwen3.5-dense wfp8afp8 w: per-tensor a: per-tensor Output garbled text, but in sglang is norm — bug — by kkyyxhll (created: 2026-03-26 14:37 (UTC+8))
- #38194 [Performance]: Prefix cache hit lower on vLLM than on other inference stacks — performance — by baoskee (created: 2026-03-26 14:23 (UTC+8))
Closed Issues
- #20561 [Bug]: Bad result with parallel generation. — bug,stale — by hscspring (closed: 2026-03-27 10:18 (UTC+8))
- #22605 [RFC]: Separated CPU KV Cache Offloading/Transfer Process — RFC,stale — by ApostaC (closed: 2026-03-27 10:18 (UTC+8))
- #23534 [Bug]: An issue arises during offline_inference with data-parallel execution. — bug,stale — by ZwhElliott (closed: 2026-03-27 10:18 (UTC+8))
- #24665 [Usage]: how to get tts wav(minicpm-o) via vllm online server — usage,stale — by whk6688 (closed: 2026-03-27 10:18 (UTC+8))
- #26139 [Bug]: input thread may die silently — bug,stale — by jennyyyyzhen (closed: 2026-03-27 10:17 (UTC+8))
- #27409 [New Model]: ibm-granite/granite-4.0-h-small-FP8 — new-model,stale — by afazekas (closed: 2026-03-27 10:17 (UTC+8))
- #28003 [Usage]: — usage,stale — by amitmvyas (closed: 2026-03-27 10:17 (UTC+8))
- #28933 [Feature]: Structured request_id in logs and inclusion in error logs — feature request,stale — by JaimeArboleda (closed: 2026-03-27 10:17 (UTC+8))
- #29361 [Bug]: CUDA illegal memory access in fused_marlin_moe for Kimi-K2-Thinking on H20 4-nodes 32-ranks(DP4, TP8, EP32) — bug,stale — by TianyiZhao1437 (closed: 2026-03-27 10:16 (UTC+8))
- #29368 [Bug]: llama4 AttributeError: 'dict' object has no attribute 'model_type' — bug,stale — by win10ogod (closed: 2026-03-27 10:16 (UTC+8))
- #29374 [Bug]: vllm/vllm-openai:v0.11.0 deployment --quantization fp8 throws cuda and tensor errors — bug,stale — by sravan500 (closed: 2026-03-27 10:16 (UTC+8))
- #29378 [Bug]: vllm0.11.0 deployment --pipeline-parallel-size 4 --tensor_parallel_size 2 for Qwen3 VL 8B Returns Strange Results — bug,stale — by my462 (closed: 2026-03-27 10:16 (UTC+8))
- #29389 [Bug]: race condition in shm_broadcast.py — bug,stale — by nvjullin (closed: 2026-03-27 10:16 (UTC+8))
- #29408 [Bug]: Reasoning parser work wrong with Qwen3-VL-Thinking-30B-A3B-FP8 — bug,stale — by zhcn000000 (closed: 2026-03-27 10:16 (UTC+8))
- #29436 [Bug]: vLLM Serve with LMCache enabled produces wrong output for GPT-OSS-20B — bug,stale — by ksuma2109 (closed: 2026-03-27 10:16 (UTC+8))
- #29481 [Bug]: vLLM VRAM fragmentation leaves room for only one small model per GPU — bug,stale — by xzl12080 (closed: 2026-03-27 10:16 (UTC+8))
- #29489 [Usage]: Removing last generated token from output and kv cache — usage,stale — by josefdra (closed: 2026-03-27 10:16 (UTC+8))
- #29496 [Bug]: Qwen3-Embedding models get stuck when embedding input of max-model-len — bug,stale — by jhradil (closed: 2026-03-27 10:16 (UTC+8))
- #29508 [Feature]: Conformance test for Gateway API Inference Extension — feature request,stale — by liu-cong (closed: 2026-03-27 10:16 (UTC+8))
- #36704 [Feature]: upstream nightly rocm docker — feature request,rocm — by functionstackx (closed: 2026-03-27 10:05 (UTC+8))
- #36703 [Feature]: upstream nightly rocm vllm — feature request,rocm — by functionstackx (closed: 2026-03-27 10:04 (UTC+8))
- #38159 [Installation]: nightly builds for vllm/vllm-openai stopped three days ago — installation — by CaptainGlac1er (closed: 2026-03-27 05:09 (UTC+8))
- #33638 [Bug]: DeepSeekV3.1 with fp8 kvcache in v0.15.0 produces garbled output — bug,deepseek — by lyg95 (closed: 2026-03-27 00:13 (UTC+8))
- #38241 [Bug]: DSR1 hang on B200 — bug — by ProExpertProg (closed: 2026-03-26 23:24 (UTC+8))
- #38101 [CI Failure]: Test Eval Marlin Qwen3-30B-A3B-Fp8 — ci-failure — by ilmarkov (closed: 2026-03-26 20:27 (UTC+8))
- #37804 [Bug] DeepGemm E8M0 scale format causes accuracy degradation for Qwen3.5 FP8 on Blackwell — bug — by vadiklyutiy (closed: 2026-03-26 16:21 (UTC+8))
- #37976 [Feature]: Add /v1/chat/completions/batch endpoint for batched chat completions with structured output support — feature request — by MatejRojec (closed: 2026-03-26 16:06 (UTC+8))
- #36777 [Feature]: Kimi K2.5 Speculative Decoding — feature request — by casper-hansen (closed: 2026-03-26 15:44 (UTC+8))
- #38118 [Doc]: consider adding docstring for a lot of the methods — documentation — by JINO-ROHIT (closed: 2026-03-26 15:05 (UTC+8))
- #38195 [Bug]: Qwen3.5-dense wfp8afp8 w: per-tensor a: per-tensor Output garbled text, but in sglang is norm — bug — by kkyyxhll (closed: 2026-03-26 14:38 (UTC+8))
- #30604 [Installation]: [ARM_CPU_backend] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client. — installation — by Mengjintao (closed: 2026-03-26 14:34 (UTC+8))
- #29983 [Bug]: v0.12.0 Dockerfile.cpu fails to build on ARM server — bug,stale,cpu — by jharriga (closed: 2026-03-26 14:33 (UTC+8))
- #28052 [Bug]: rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103 has an error with flash attention — bug,rocm,stale — by capteen-hook (closed: 2026-03-26 14:21 (UTC+8))
New PRs
- #38299 [ROCm] Add AITER RMSNorm+FP8 quantization fusion for MLA — rocm — by khairulkabir1661 (created: 2026-03-27 08:50 (UTC+8))
- #38280 [Quantization] Add TurboQuant KV cache quantization (Phase 1) — v1 — by lishunyang12 (created: 2026-03-27 04:02 (UTC+8))
- #38316 [XPU] Add per-channel quantized model in compressed-tensors — no labels — by yma11 (created: 2026-03-27 11:43 (UTC+8))
- #38315 [Perf][GDN] Fuse kkt + solve_tril kernels in chunk forward — needs-rebase — by ZJY0516 (created: 2026-03-27 11:35 (UTC+8))
- #38224 [CT]FP8 WoQ Linear Refactored — no labels — by Zhenzhong1 (created: 2026-03-26 19:57 (UTC+8))
- #38314 [KVConnector]: Add kv_connector_class_name to resolve name conflict with built-in connectors — deepseek,kv-connector — by maobaolong (created: 2026-03-27 11:26 (UTC+8))
- #38301 [KVConnector]: prioritize external connector over internal registry — kv-connector — by maobaolong (created: 2026-03-27 09:28 (UTC+8))
- #38313 [ROCm] Add AITER RoPE + KV cache fusion for MLA prefill and decode — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by khairulkabir1661 (created: 2026-03-27 11:12 (UTC+8))
- #38302 [CI] Swap to smaller models for MIG slice compatibility — ci/build,v1 — by khluu (created: 2026-03-27 09:34 (UTC+8))
- #38274 [Bugfix][Tool Parser] Fix Kimi-K2 argument truncation when decoding bundles tool_call_end and tool_call_begin into a single streaming delta — bug,tool-calling — by slevental (created: 2026-03-27 03:21 (UTC+8))
- #38294 [Bugfix] Fix AuthenticationMiddleware KeyError on WebSocket scopes — bug,frontend — by russellb (created: 2026-03-27 08:02 (UTC+8))
- #38189 [Tool Parser][2/3] Use self.tools instead of request.tools in tool parsers — tool-calling,qwen,deepseek — by sfeng33 (created: 2026-03-26 12:35 (UTC+8))
- #38310 [Doc] Support configuring affinity in helm chart — documentation — by utsumi-fj (created: 2026-03-27 10:23 (UTC+8))
- #38207 [CI] Reorganize scoring tests — frontend,ready — by noooop (created: 2026-03-26 16:17 (UTC+8))
- #38311 [Model Runner V2] Rebuild attention metadata before eagle decode full… — v1 — by TheEpicDolphin (created: 2026-03-27 10:43 (UTC+8))
- #38312 Fix zmq port conflict — no labels — by liuchenbing2026 (created: 2026-03-27 10:47 (UTC+8))
- #38242 [Misc] Rename think_start_str/think_end_str to reasoning_start_str/reasoning_end_str — documentation,v1 — by chaunceyjiang (created: 2026-03-26 22:25 (UTC+8))
- #38306 [Model] Add Phi4ForCausalLMV for microsoft/Phi-4-reasoning-vision-15B — new-model,multi-modality — by varun-sundar-rabindranath (created: 2026-03-27 10:17 (UTC+8))
- #38214 [Feature] Add auto-detection for reasoning_config when only reasoning_parser is set — documentation,v1 — by chaunceyjiang (created: 2026-03-26 17:32 (UTC+8))
- #38270 [Mamba] Raise on insufficient cache blocks instead of silently capping cudagraph sizes — v1,nvidia — by NickLucche (created: 2026-03-27 03:08 (UTC+8))
- #38305 [Bugfix] Fix Gemma3n concurrent audio requests crashing EngineCore — bug — by he-yufeng (created: 2026-03-27 10:14 (UTC+8))
- #38261 Hybrid KV offload: planner, MultiConnector, and mamba alignment for hybrid models — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by malaiwah (created: 2026-03-27 01:08 (UTC+8))
- #38292 [CI][ROCm] Add gpt-oss w4a8 in CI — rocm,ready,gpt-oss — by BowenBao (created: 2026-03-27 07:53 (UTC+8))
- #38244 [CT][FP8][Marlin] refactor CompressedTensorsW8A16Fp8 to use kernel abstraction — no labels — by jikunshang (created: 2026-03-26 23:10 (UTC+8))
- #38300 [Speculative Decoding] Add DFlash speculators config parsing — new-model,speculative-decoding,v1,qwen — by ZhanqiuHu (created: 2026-03-27 08:58 (UTC+8))
- #38206 Enable multiple gpu test on xpu platform — no labels — by chaojun-zhang (created: 2026-03-26 16:12 (UTC+8))
- #38296 [ROCm] aiter_unified_attn fp8 q scale refactor — rocm,v1 — by divakar-amd (created: 2026-03-27 08:18 (UTC+8))
- #38295 [Docs] Document that --api-key has no effect on gRPC server — documentation — by russellb (created: 2026-03-27 08:18 (UTC+8))
- #38247 Various Transformers v5 config fixes — ready,deepseek — by hmellor (created: 2026-03-26 23:23 (UTC+8))
- #38293 Fix: FusedMoE AssertionError with Speculative Decoding on Quark-Quantized Models — no labels — by vecheruk-amd (created: 2026-03-27 07:53 (UTC+8))
- #38290 [CI] Pin GitHub Actions to commit hashes in macos-smoke-test.yml — ci/build — by russellb (created: 2026-03-27 07:29 (UTC+8))
- #38289 tests: reduce duplication in online pooling scoring suites — no labels — by cursor (created: 2026-03-27 06:55 (UTC+8))
- #38285 [AMD][Build] Test DeepEP offload — rocm,ci/build — by rjrock (created: 2026-03-27 05:20 (UTC+8))
- #38288 [Quant] Consolidate GPTQ: rename gptq_marlin.py to auto_gptq.py — no labels — by chengyinie (created: 2026-03-27 06:17 (UTC+8))
- #38287 [Perf] Skip kv connector empty work, 1.1% Throughput Improvement — ready,v1,kv-connector — by yewentao256 (created: 2026-03-27 05:56 (UTC+8))
- #38272 [ROCm][CI] Unsetting arch completely — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-27 03:12 (UTC+8))
- #38284 [Startup][UX] Enable CUDAGraph memory profiling by default — v1,nvidia — by MatthewBonanni (created: 2026-03-27 05:18 (UTC+8))
- #38283 [Bugfix] Remove dead unicode_escape decoding from 4 tool parsers — bug,tool-calling — by yzong-rh (created: 2026-03-27 05:06 (UTC+8))
- #38268 [Bugfix] Remove dead name-only regex fallback from DeepSeek V3.1 tool parser — bug,tool-calling,deepseek — by yzong-rh (created: 2026-03-27 02:49 (UTC+8))
- #38238 Removed GPU state confirmation and cleanup steps. — rocm,ready,ci/build — by dhonnappa-amd (created: 2026-03-26 21:46 (UTC+8))
- #38255 [Bugfix] Remove false-positive format mismatch warnings in FLA ops — bug,ready — by tdoublep (created: 2026-03-27 00:11 (UTC+8))
- #38275 fix: pin 1 unpinned action(s) — ci/build — by dagecko (created: 2026-03-27 03:25 (UTC+8))
- #38205 [ZenCPU] Make PT Backport Patch Accessible to vLLM — ready — by amd-lalithnc (created: 2026-03-26 15:58 (UTC+8))
- #38277 [Core] Add soft thinking token budget with progressive logit bias — v1 — by efortin (created: 2026-03-27 03:28 (UTC+8))
- #38273 Turbo Quant — documentation,rocm,ci/build,v1 — by asrvastava (created: 2026-03-27 03:14 (UTC+8))
- #38281 [Model Loader] fix BGE-M3 compatibility — no labels — by wkpark (created: 2026-03-27 04:11 (UTC+8))
- #38278 [Model] Use AutoWeightsLoader for InternLM2 — no labels — by javierdejesusda (created: 2026-03-27 03:33 (UTC+8))
- #38276 Enable eagle3 for MiniMax M2 — no labels — by TheEpicDolphin (created: 2026-03-27 03:27 (UTC+8))
- #38190 [responsesAPI] output failed event. — frontend — by qandrew (created: 2026-03-26 12:36 (UTC+8))
- #38271 [Bugfix][Tool Parser] Fix Kimi-K2 argument truncation when tool_call_… — bug,tool-calling — by slevental (created: 2026-03-27 03:09 (UTC+8))
- #38269 [Bugfix][Tool Parser] Fix Kimi-K2 argument truncation when tool_call_… — bug,tool-calling — by slevental (created: 2026-03-27 03:08 (UTC+8))
- #38267 tests(network_utils): add coverage for late-binding ZMQ helpers (#28498) — performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by loriscience (created: 2026-03-27 02:47 (UTC+8))
- #38254 flash infer logging — no labels — by JINO-ROHIT (created: 2026-03-27 00:06 (UTC+8))
- #38188 Bump flashinfer version to 0.6.7 — ci/build,nvidia,ready-run-all-tests — by wzhao18 (created: 2026-03-26 11:59 (UTC+8))
- #38263 [ROCm] [Bugfix] [Release] Fix nightly rocm release pipeline — bug,rocm,ready,ci/build — by tjtanaa (created: 2026-03-27 01:25 (UTC+8))
- #38265 [Refactor] Consolidate Tool type alias in tool_parsers/utils.py — tool-calling — by sfeng33 (created: 2026-03-27 02:29 (UTC+8))
- #38262 [frontend] dump openai responses type by alias — frontend — by cjackal (created: 2026-03-27 01:20 (UTC+8))
- #38264 [Mypy] Fix adjust_request typing — documentation,frontend,tool-calling — by sfeng33 (created: 2026-03-27 01:59 (UTC+8))
- #38252 [ROCm][CI/Build] ROCm 7.2.1 release version; torch 2.10; triton 3.6 — rocm,ci/build — by gshtras (created: 2026-03-26 23:41 (UTC+8))
- #38259 SiLU_mul blockwise quantized fp8 kernel in Helion — no labels — by rwtarpit (created: 2026-03-27 00:49 (UTC+8))
- #38185 [Bugfix][Backport] Backport PR #34934 to v0.13.0: Fix typos — bug,documentation,performance,rocm,ci/build,v1,tool-calling,qwen,cpu,kv-connector — by khairulkabir1661 (created: 2026-03-26 11:56 (UTC+8))
- #38248 [WIP] Prototype MM batcher interface — multi-modality — by DarkLight1337 (created: 2026-03-26 23:25 (UTC+8))
- #38253 [Bugfix][Frontend] Return 400 for corrupt/truncated image inputs instead of 500 — bug,multi-modality — by aliialsaeedii (created: 2026-03-26 23:55 (UTC+8))
- #38236 [ZenCPU] AMD Zen CPU Backend with supported dtypes via zentorch weekly — rocm,ci/build — by Chinmay-Kulkarni-AMD (created: 2026-03-26 20:57 (UTC+8))
- #38229 Revert "Various Transformers v5 fixes" (#38127) — deepseek — by zhewenl (created: 2026-03-26 20:06 (UTC+8))
- #38251 Nvfp4 cutedsl moe — nvidia,ready-run-all-tests — by zyongye (created: 2026-03-26 23:36 (UTC+8))
- #38219 [CPU] Support CT W4A16 on CPU MP kernel — ready,cpu — by bigPYJ1151 (created: 2026-03-26 18:38 (UTC+8))
- #38243 [Bugfix][Frontend] Return 400 for corrupt/truncated image inputs instead of 500 — bug,multi-modality — by aliialsaeedii (created: 2026-03-26 22:41 (UTC+8))
- #38232 [Fix] Remove unused packing_position_embedding from PaddleOCRVL for better checkpoint compatibility — ready — by zhang-prog (created: 2026-03-26 20:28 (UTC+8))
- #38249 [Misc] Organize NixlConnector into own directory — documentation,v1,kv-connector — by NickLucche (created: 2026-03-26 23:27 (UTC+8))
- #38217 [KV Offload] Update stale comment in FilterReusedOffloadingManager — v1 — by ronensc (created: 2026-03-26 18:26 (UTC+8))
- #38220 [specdec][v1] Fix sampler peak memory utilization when using specdec — ready,v1 — by Flechman (created: 2026-03-26 19:27 (UTC+8))
- #38239 [Bugfix][Core] Allow multi-dtype MambaSpec KV cache spec — bug,v1 — by MohdElgaar (created: 2026-03-26 21:57 (UTC+8))
- #38216 [Perf] Batch KV cache swap copies via cuMemcpyBatchAsync — v1 — by Etelis (created: 2026-03-26 18:11 (UTC+8))
- #38237 [Bugfix] Only disable hybrid KV cache manager when KV events are actually enabled — bug — by liulanzheng (created: 2026-03-26 21:36 (UTC+8))
- #38227 [Bugfix] [Frontend] responses api, refactored simple event streaming — bug,frontend — by bfroemel (created: 2026-03-26 20:00 (UTC+8))
- #38225 [Bugfix] Prevent IndexError on empty content in HarmonyContext tool calls — bug,frontend,gpt-oss — by dubin555 (created: 2026-03-26 19:57 (UTC+8))
- #38191 [Bugfix] Fix k_norm weight sharding in MiniMaxM2Attention when total_num_kv_heads < tp_size — bug — by wxsIcey (created: 2026-03-26 12:53 (UTC+8))
- #38218 [Renderer] Consolidate factory methods — ready — by DarkLight1337 (created: 2026-03-26 18:35 (UTC+8))
- #38223 [Bugfix] Handle reasoning_effort="none" for Harmony models instead of crashing — bug,frontend,gpt-oss — by dubin555 (created: 2026-03-26 19:50 (UTC+8))
- #38226 [Bugfix] Respect max_output_tokens in Harmony tool-call loop — bug,frontend,tool-calling,gpt-oss — by dubin555 (created: 2026-03-26 19:57 (UTC+8))
- #38222 [Bugfix] Add dimension alignment check to Marlin MoE kernel selection — bug — by he-yufeng (created: 2026-03-26 19:32 (UTC+8))
- #38213 fix(reasoning): MiniMaxM2ReasoningParser broken for M2.5 — no labels — by SilviuSavu (created: 2026-03-26 16:46 (UTC+8))
- #38204 [feature] Implement reasoning_effort — frontend — by chaunceyjiang (created: 2026-03-26 15:55 (UTC+8))
- #38215 [Fix] Fix FlashInfer CUTLASS MoE for unquantized models and single-GPU, bump FlashInfer to 0.6.8 — ci/build,nvidia — by askliar (created: 2026-03-26 17:44 (UTC+8))
- #38211 [Feature]: fused RMSNorm + fp8 block quantized kernel in Helion — no labels — by aman-coder03 (created: 2026-03-26 16:39 (UTC+8))
- #38193 [XPU] Disable xpu graph by default — ready — by jikunshang (created: 2026-03-26 14:04 (UTC+8))
- #38209 [Doc] Fix outdated reference to CUDAGraphManager — documentation,ready,nvidia — by DarkLight1337 (created: 2026-03-26 16:37 (UTC+8))
- #38210 Fix json chat_templet parser — frontend — by QiuMike (created: 2026-03-26 16:39 (UTC+8))
- #38199 [Bugfix] Include entry-point logits processor plugins in output token… — bug,v1 — by YingxuH (created: 2026-03-26 15:06 (UTC+8))
- #38200 Qwen3.5 0325 mtp — qwen — by zhewenl (created: 2026-03-26 15:19 (UTC+8))
- #38198 [DO NOT MERGE] Flatten type abstractions in Renderer — frontend,ready — by DarkLight1337 (created: 2026-03-26 15:06 (UTC+8))
- #38192 added int4 AR format cpu — no labels — by Zhenzhong1 (created: 2026-03-26 13:44 (UTC+8))
- #38184 [ROCm][CI] Run Kernels Core Operation Test On MI325 and mitigate flakiness — rocm,ci/build — by micah-wil (created: 2026-03-26 11:56 (UTC+8))
- #38187 [Frontend][Bugfix] Fix double BOS token in Responses API for models with add_bos_token — bug,frontend — by hyeongyun0916 (created: 2026-03-26 11:58 (UTC+8))
- #38186 [code clean] remove useless contextlib.suppress(Exception) — no labels — by andyxning (created: 2026-03-26 11:57 (UTC+8))
已合并 PR
- #38207 [CI] Reorganize scoring tests — frontend,ready — by noooop (合并于: 2026-03-26 20:07 (UTC+8))
- #37447 [CI/Build] enable Intel XPU test flow with prebuilt image — ready,ci/build — by wendyliu235 (合并于: 2026-03-27 09:16 (UTC+8))
- #38247 Various Transformers v5 config fixes — ready,deepseek — by hmellor (合并于: 2026-03-27 07:07 (UTC+8))
- #38162 [Bugfix] Add missing f-string prefix in xgrammar choices error message — bug,structured-output,ready,v1 — by yzong-rh (合并于: 2026-03-27 05:43 (UTC+8))
- #38045 [Model Runner V2] Enable forcing a specific acceptance rate during rejection sampling — speculative-decoding,ready,ci/build,v1 — by TheEpicDolphin (合并于: 2026-03-27 04:38 (UTC+8))
- #38136 Fix multi-node allreduce fusion — ready,ci/build,nvidia — by wzhao18 (合并于: 2026-03-27 04:24 (UTC+8))
- #37547 [Bugfix][ROCm] Fix lru_cache on paged_mqa_logits_module — bug,rocm,ready,v1 — by gronsti-amd (合并于: 2026-03-27 03:01 (UTC+8))
- #38263 [ROCm] [Bugfix] [Release] Fix nightly rocm release pipeline — bug,rocm,ready,ci/build — by tjtanaa (合并于: 2026-03-27 02:47 (UTC+8))
- #38165 [ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test containers — rocm,ready,ci/build — by AndreasKaratzas (合并于: 2026-03-27 02:33 (UTC+8))
- #37930 [ROCm][CI] Add uv pip compile workflow for rocm-test.txt lockfile — rocm,ready,ci/build — by AndreasKaratzas (合并于: 2026-03-27 01:44 (UTC+8))
- #37228 [ROCM][Bugfix] Use correct stride in cp_mha_gather_cache_kernel for hybrid model (#37228) — bug,rocm,ready,v1,meta-exported,fb-exported — by jennyyyyzhen (合并于: 2026-03-27 01:33 (UTC+8))
- #38155 [ROCm][CI] Add LM Eval Qwen3.5 Models test for MI355 — rocm,ready,ci/build,qwen — by AndreasKaratzas (合并于: 2026-03-27 00:51 (UTC+8))
- #38178 [CI] Fix conch kernel crash on 3D input by reshaping to 2D before GEMM — rocm,ready — by AndreasKaratzas (合并于: 2026-03-27 00:46 (UTC+8))
- #38137 [ROCm][CI] Fix AITER state leak in shared_fused_moe_routed_transform test — rocm,ready — by AndreasKaratzas (合并于: 2026-03-27 00:26 (UTC+8))
- #34977 [Mamba][APC] Add test case to compare apc outputs — rocm,ready — by divakar-amd (合并于: 2026-03-27 00:40 (UTC+8))
- #37283 [Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm — rocm,ready,ci/build — by tjtanaa (合并于: 2026-03-27 00:32 (UTC+8))
- #35175 [Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode — bug,ready,v1,nvidia — by haosdent (合并于: 2026-03-27 00:13 (UTC+8))
- #38161 [ROCm][CI] Fix flaky GPTQ compile correctness test — rocm,ready — by AndreasKaratzas (合并于: 2026-03-26 19:57 (UTC+8))
- #38167 [ROCm][CI] Fix wvSplitKrc mock argument order in test_rocm_unquantized_gemm — rocm,ready — by AndreasKaratzas (合并于: 2026-03-26 19:55 (UTC+8))
- #35886 [Bugfix][Minor] Fix potential NameError in mamba backend selector and misc typos — bug,rocm,ready,v1 — by ChuanLi1101 (合并于: 2026-03-26 23:59 (UTC+8))
- #38014 [CI] Add batch invariant test for b200 — ready,ci/build — by yewentao256 (合并于: 2026-03-26 23:54 (UTC+8))
- #38232 [Fix] Remove unused packing_position_embedding from PaddleOCRVL for better checkpoint compatibility — ready — by zhang-prog (合并于: 2026-03-26 23:34 (UTC+8))
- #38169 Revert “[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration” (#38050) — ci-failure,nvidia — by zhewenl (合并于: 2026-03-26 22:59 (UTC+8))
- #38125 DOC: Documentation pages fixes — ready — by mtsokol (合并于: 2026-03-26 16:55 (UTC+8))
- #38218 [Renderer] Consolidate factory methods — ready — by DarkLight1337 (合并于: 2026-03-26 20:19 (UTC+8))
- #37962 [bug-fix] GLM OCR Patch Merger context_dim — bug,ready — by JaredforReal (合并于: 2026-03-26 20:11 (UTC+8))
- #38153 [Refactor] Remove unused utils — ready — by yewentao256 (合并于: 2026-03-26 17:08 (UTC+8))
- #38193 [XPU] Disable xpu graph by default — ready — by jikunshang (合并于: 2026-03-26 16:53 (UTC+8))
- #38209 [Doc] Fix outdated reference to CUDAGraphManager — documentation,ready,nvidia — by DarkLight1337 (合并于: 2026-03-26 16:52 (UTC+8))
- #38018 [Model] Use helper function to run MM processors with token inputs (where applicable) — ready,multi-modality,qwen — by DarkLight1337 (合并于: 2026-03-26 16:44 (UTC+8))
- #38116 Relocate Encoder CUDA graph manager — ready,v1,nvidia — by WoosukKwon (合并于: 2026-03-26 11:52 (UTC+8))
- #38083 [Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell — bug,ready,qwen — by vadiklyutiy (合并于: 2026-03-26 16:21 (UTC+8))
- #38011 Add `/v1/chat/completions/batch` endpoint for batched chat completions — documentation,frontend,ready — by MatejRojec (合并于: 2026-03-26 12:13 (UTC+8))
- #37691 [cpu][ci] remove soft-fail for Arm CI and add quant model tests — ready,ci/build,cpu — by fadara01 (合并于: 2026-03-26 15:03 (UTC+8))
- #38049 [Model] Add torch.compile support for InternVL vision encoder — rocm,ready — by tianrengao (合并于: 2026-03-26 14:52 (UTC+8))
- #38082 [Bugfix] Fix benchmark_fused_collective.py — bug,performance,ready — by jeejeelee (合并于: 2026-03-26 14:51 (UTC+8))
- #38092 [Bugfix][CI] Fix Marlin FP8 Linear Kernel for Compressed Tensors Format — bug,ready,ready-run-all-tests — by BadrBasowid (合并于: 2026-03-26 12:11 (UTC+8))
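上表中 #38178(3D 输入先 reshape 为 2D 再做 GEMM)体现了一个常见的 kernel 兼容性处理模式:许多 GEMM kernel 只接受二维输入,因此 (batch, seq, hidden) 形状的激活需要先摊平为 (batch*seq, hidden),算完再恢复 batch 维。下面是一个纯 Python 的最小示意(`gemm_2d`、`gemm_3d_via_2d` 均为本文假设的示例函数名,并非 vLLM 代码库中的实现):

```python
def gemm_2d(a, b):
    """朴素二维矩阵乘: a 形状 (m, k), b 形状 (k, n)。"""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def gemm_3d_via_2d(x, w):
    """示意 #38178 描述的模式: 将 (batch, seq, hidden) 摊平为
    (batch*seq, hidden) 后调用只支持 2D 的 GEMM, 再还原 batch 维。"""
    batch, seq = len(x), len(x[0])
    flat = [row for sample in x for row in sample]   # (batch*seq, hidden)
    out = gemm_2d(flat, w)                           # (batch*seq, n)
    return [out[i * seq:(i + 1) * seq] for i in range(batch)]  # (batch, seq, n)
```

真实 kernel 中这一摊平通常只是 stride 层面的 view,不产生数据拷贝;这里用嵌套列表只为说明形状变换本身。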
关闭但未合并的 PR
- #33774 add support for telechat3 model — new-model — by 1096125073 (关闭于: 2026-03-27 11:41 (UTC+8))
- #37696 [torch.compile]: Disable Sequence Parallelism (SP) for piecewise compilation — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by SouthWest7 (关闭于: 2026-03-27 10:52 (UTC+8))
- #28631 [Frontend][3/n] Improve pooling entrypoints scoring. — documentation,frontend,needs-rebase,multi-modality — by noooop (关闭于: 2026-03-27 10:53 (UTC+8))
- #21651 Limit concurrent long partial prefills via max_long_partial_prefills — stale,v1 — by pansicheng (关闭于: 2026-03-27 10:18 (UTC+8))
- #27983 [Frontend] Make RequestIdMiddleware return the internal request_id — frontend,ready,stale — by markmc (关闭于: 2026-03-27 10:17 (UTC+8))
- #28780 Fix: align vllm bench serve ignore_eos behavior with legacy benchmark… — performance,stale — by Amitjoiya (关闭于: 2026-03-27 10:17 (UTC+8))
- #28810 v1: account for CPU offload capacity in KV cache check — needs-rebase,ci/build,stale,v1,nvidia — by m0nk111 (关闭于: 2026-03-27 10:17 (UTC+8))
- #29321 Updated AMD-CI mirror (2025-11-24) — rocm,ci/build,stale — by Alexei-V-Ivanov-AMD (关闭于: 2026-03-27 10:17 (UTC+8))
- #29455 [DNM] Split MM standard test — ready,ci/build,stale — by ywang96 (关闭于: 2026-03-27 10:16 (UTC+8))
- #29495 [Misc] Add token breakdown to throughput benchmark JSON output — performance,stale — by xxrjun (关闭于: 2026-03-27 10:16 (UTC+8))
- #35483 Add AMD AITER MLA fusion optimization for DeepSeek models — rocm,needs-rebase,deepseek — by khairulkabir1661 (关闭于: 2026-03-27 08:50 (UTC+8))
- #31105 [Bugfix][LoRA] Fix LoRA weight mapping for DeepSeek MLA attention and… — bug,needs-rebase,deepseek — by tim0120 (关闭于: 2026-03-27 07:55 (UTC+8))
- #38289 tests: reduce duplication in online pooling scoring suites — 无标签 — by cursor (关闭于: 2026-03-27 06:58 (UTC+8))
- #38275 fix: pin 1 unpinned action(s) — ci/build — by dagecko (关闭于: 2026-03-27 05:23 (UTC+8))
- #35454 Enable GLM4.7 FP8 KVCache scale loading — 无标签 — by BowenBao (关闭于: 2026-03-27 05:08 (UTC+8))
- #32597 Add triton support for compressed_tensors GPTQ W4A16 on Tesla V100 (Volta CUDA 70) — performance,nvidia — by lapy (关闭于: 2026-03-27 03:49 (UTC+8))
- #33956 [Bugfix] Fix video frame sampling for short videos and Qwen3-VL 2-frame requirement — bug,needs-rebase,multi-modality,qwen — by chengyinie (关闭于: 2026-03-27 03:13 (UTC+8))
- #38271 [Bugfix][Tool Parser] Fix Kimi-K2 argument truncation when tool_call_… — bug,tool-calling — by slevental (关闭于: 2026-03-27 03:13 (UTC+8))
- #38269 [Bugfix][Tool Parser] Fix Kimi-K2 argument truncation when tool_call_… — bug,tool-calling — by slevental (关闭于: 2026-03-27 03:08 (UTC+8))
- #38267 tests(network_utils): add coverage for late-binding ZMQ helpers (#28498) — performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by loriscience (关闭于: 2026-03-27 03:08 (UTC+8))
- #38185 [Bugfix][Backport] Backport PR #34934 to v0.13.0: Fix typos — bug,documentation,performance,rocm,ci/build,v1,tool-calling,qwen,cpu,kv-connector — by khairulkabir1661 (关闭于: 2026-03-26 15:58 (UTC+8))
- #38180 [Bugfix][Backport][ROCm] Backport PR #31380 to v0.13.0: Fix Qwen3-Next inference with non-standard block size (544) — bug,rocm,v1,qwen — by khairulkabir1661 (关闭于: 2026-03-26 16:06 (UTC+8))
- #38146 [Bugfix][Backport][Hardware][AMD] Backport PR #31282 to v0.13.0: Fix last_page_len calculation in AITER MLA decode — bug,rocm,v1 — by khairulkabir1661 (关闭于: 2026-03-26 16:06 (UTC+8))
- #38145 [Bugfix][Backport] Backport PR #31816 to v0.13.0: Fix ROCM_AITER_TRITON_MLA accuracy for DeepSeek-V3 — bug,rocm,v1,deepseek — by khairulkabir1661 (关闭于: 2026-03-26 16:06 (UTC+8))
- #38229 Revert “Various Transformers v5 fixes” (#38127) — deepseek — by zhewenl (关闭于: 2026-03-26 23:30 (UTC+8))
- #38243 [Bugfix][Frontend] Return 400 for corrupt/truncated image inputs instead of 500 — bug,multi-modality — by aliialsaeedii (关闭于: 2026-03-26 23:54 (UTC+8))
- #36797 [Bugfix][Core] Allow multi-dtype MambaSpec KV cache spec — bug,documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1 — by MohdElgaar (关闭于: 2026-03-26 21:43 (UTC+8))
- #27881 Adding render group to docker container — rocm,needs-rebase,ci/build — by dhonnappa-amd (关闭于: 2026-03-26 21:34 (UTC+8))
- #37088 [Bugfix] Add FLA_USE_TMA env var to disable TMA in FLA ops (#36973) — bug,v1,qwen — by haosdent (关闭于: 2026-03-26 18:51 (UTC+8))
- #28614 [Optimization] Fix performance regression for text-only inputs to MM models — ready,needs-rebase,multi-modality,qwen,deepseek — by DarkLight1337 (关闭于: 2026-03-26 16:47 (UTC+8))
- #37915 [Bugfix][Frontend] Pass default_chat_template_kwargs to Anthropic endpoint — bug,frontend,ready — by vinnybad (关闭于: 2026-03-26 14:54 (UTC+8))
- #36958 [Bugfix] Fix NIXL MLA notification request ID mismatch causing prefill KV cache leak — bug,v1,kv-connector — by wz1qqx (关闭于: 2026-03-26 14:42 (UTC+8))
- #37954 [Parser] Pass tools via `ToolParser.__init__` instead of reading from request — frontend,needs-rebase,tool-calling,qwen,deepseek — by sfeng33 (关闭于: 2026-03-26 12:29 (UTC+8))
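上表中 #38243(损坏/截断的图片输入应返回 400 而非 500)虽未合并,但它针对的是一个通用的 API 设计原则:客户端输入错误应映射为 4xx,仅服务端内部故障才返回 5xx。下面是这一映射模式的最小示意(`InvalidImageError`、`decode_image`、`to_http_status` 均为本文假设的示例名称,并非 vLLM 实际实现):

```python
class InvalidImageError(ValueError):
    """客户端提供的图片字节无法解码时抛出。"""

def decode_image(data: bytes) -> bytes:
    # 假设的解码器: 仅以 PNG magic header 校验作为示意。
    if not data.startswith(b"\x89PNG\r\n\x1a\n"):
        raise InvalidImageError("corrupt or truncated image")
    return data

def to_http_status(exc: Exception) -> int:
    # 客户端输入错误映射为 400; 其余异常仍视为服务端 500。
    return 400 if isinstance(exc, InvalidImageError) else 500
```

这样做的好处是调用方(及其重试逻辑、告警系统)能区分"请求本身有问题"与"服务出了故障",避免将用户上传坏图误报为服务端事故。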