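vLLM Development Activity Report - 2026-03-26

Time window: 2026-03-26 11:45 (UTC+8) to 2026-03-27 11:45 (UTC+8)
Statistics: New issues 37 | Closed issues 33 | New PRs 96 | Merged PRs 37 | PRs closed without merging 33

📊 Daily Development Status Summary

During this cycle (2026-03-26 to 2026-03-27) the vLLM community was highly active, opening 133 new issues and PRs in total (37 issues and 96 PRs). Development focused on performance optimization (especially KV cache management and kernel fusion), support for new models and quantization formats (such as Phi-4-Vision and TurboQuant), and continued work on the AMD/ROCm ecosystem. Several notable technical proposals were filed (such as incremental MoE expert offloading) and a number of performance and bug-fix PRs were merged, reflecting rapid iteration on the inference engine.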

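🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was heavy this cycle, spanning bug fixes, performance optimization, CI/CD, and documentation.

Issues:

#38307: Reports that AMD's MiniMax MXFP4 model still fails even with trust_remote_code=True. The issue is related to PR #37698 and blocks integration in upstream projects.
#38304: Filed by functionstackx. The official documentation (vllm.ai) has not been updated with install instructions for the ROCm nightly Docker image and wheel; the request is for a documentation experience on par with the CUDA builds.
#38303: Reports that the MiniMax NVFP4 model crashes because scales are not loaded. The reporter notes that PR #37214 already fixes this, so users need to wait for v0.19 or use a nightly build.
#38246: Not directly tied to AMD. Filed by ProExpertProg, it asks for better log output during FlashInfer compilation so that long stretches without any logs are not mistaken for a hang.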

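PRs (with a focus on submissions from contributors with an -amd suffix):

Performance optimization:
- #38313 & #38299 (by khairulkabir1661): Add AITER fused-kernel support for DeepSeek MLA attention on ROCm, implementing RoPE + KV cache fusion and RMSNorm + FP8 quantization fusion respectively. The goal is to cut memory bandwidth use and kernel launch overhead and improve performance on AMD GPUs such as MI300X.
- #38296 (by divakar-amd): Refactors AITER's FP8 attention kernel to support q_scale natively, simplifying the code and improving performance.
Bug fixes:
- #38293 (by vecheruk-amd): Fixes an AssertionError crash in MoE models quantized with Quark (for example MXFP4) during speculative decoding (MTP/Eagle), caused by a prefix mismatch on the draft model's MoE layers. The fix replaces the assertion with a fallback to the unquantized MoE method.
- #38263 (by tjtanaa): Hotfix for a build failure in the newly enabled ROCm nightly release pipeline, caused by an undefined variable.
- #37547 (by gronsti-amd): Fixes a performance problem in the Sparse MLA implementation on ROCm where incorrect use of lru_cache caused a module to be imported repeatedly.
Platform support and CI:
- #38236 (by Chinmay-Kulkarni-AMD): Updates the AMD Zen CPU backend to declare its supported dtypes correctly and switches to the zentorch-weekly dependency.
- #38238 (by dhonnappa-amd): Removes the steps in the AMD hardware CI test scripts that interact with host GPU state, to suit Kubernetes environments.
- #38165 (by AndreasKaratzas): Fixes the PYTORCH_ROCM_ARCH environment variable in CI test containers so that Quark's JIT compilation targets only the current GPU architecture rather than all architectures, which speeds up testing.
- #38184 (by micah-wil): Enables the Kernels Core Operation Test on MI325 devices and mitigates flakiness of the related tests on MI250.
Releases and ecosystem:
- #37283 (by tjtanaa): Merged. This PR automates publishing of ROCm nightly Docker images and wheel packages, a key step toward making AMD a "first-class citizen" in vLLM, and directly addresses Issues #36703 and #36704.

Summary: AMD's contributions to vLLM are broad and deep. Low-level kernel optimization (AITER), quantization toolchain integration (Quark), platform support (Zen CPU), CI/CD pipelines, and user-facing deliverables (nightly builds) are all advancing in parallel, which shows a commitment to building a complete, high-performance heterogeneous computing ecosystem.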
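
As a generic illustration of how lru_cache misuse can lead to repeated imports (the class of problem named for #37547), here is a minimal sketch. It is not the code from that PR: the function names and the counter are hypothetical, and the anti-pattern shown (re-creating the cached function on every call) is only one possible way such a bug can arise.

```python
import functools
import importlib

import_calls = 0  # counts how often the (simulated) expensive import runs


def _expensive_import(name: str):
    """Hypothetical stand-in for loading a kernel module; counts invocations."""
    global import_calls
    import_calls += 1
    return importlib.import_module(name)


def get_module_uncached(name: str):
    # Anti-pattern: the cached function is re-created on every call,
    # so each call starts with an empty cache and imports again.
    @functools.lru_cache(maxsize=None)
    def _load(n: str):
        return _expensive_import(n)

    return _load(name)


@functools.lru_cache(maxsize=None)
def get_module_cached(name: str):
    # The cache lives at module scope, so the import runs once per name.
    return _expensive_import(name)


for _ in range(3):
    get_module_uncached("json")
uncached_calls = import_calls

import_calls = 0
for _ in range(3):
    get_module_cached("json")
cached_calls = import_calls

print(uncached_calls, cached_calls)
```

With the cache at module scope the loader runs once for three lookups; with the cache re-created per call it runs every time.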

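💬 Analysis of High-Activity Discussions

Issue #38256: [RFC]: Incremental MoE Expert Offloading:
- Core topic: proposes a dynamic offloading architecture for MoE expert weights that combines a GPU cache with an asynchronous pipeline, so that large MoE models can run on hardware with less GPU memory.
- Positions and status: author e1n00r lays out the design principle (the cache acts as the weight provider) and a three-PR phased rollout plan, and cites production data from a standalone implementation, tinyserve, as evidence. There are no comments yet, but the RFC is detailed and aims to collect community feedback on the architecture.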
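PR #38280: [Quantization] Add TurboQuant KV cache quantization:
- Core topic: implements Google's TurboQuant algorithm, a sub-4-bit vector quantization method for the KV cache that targets lossless quality.
- Positions and status: contributor lishunyang12 reports benchmark results on Qwen2.5-1.5B in which 2-, 3-, and 4-bit quantization all reach 100% output match with zero latency overhead. Community response is positive; xinyu-intel confirms the same zero quality loss and zero latency overhead when testing on XPU. The discussion also covers a comparison with another draft implementation and suggestions for further testing.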
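Issue #38212: MiniMaxM2ReasoningParser broken for M2.5:
- Core topic: MiniMaxM2ReasoningParser does not correctly handle the reasoning start tag emitted by M2.5 models, so reasoning content leaks into the output.
- Positions and status: reporter SilviuSavu traced the cause to the overridden extract_reasoning_streaming method, which does not account for a start tag. Participant Huyueeer supplied detailed test code to verify it. The main point of debate was how to fix it (remove the override outright or modify it); judging from the follow-up PR #38213, the simpler option of removing the override was chosen.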
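
To make the failure mode in #38212 concrete, here is a minimal sketch of a splitter that separates reasoning from the final answer whether or not the model emits a start tag. It is not the vLLM parser: the ReasoningSplitter class is hypothetical, and the tag strings `<think>` and `</think>` are assumptions chosen for illustration.

```python
from dataclasses import dataclass

START_TAG = "<think>"   # assumed tag names, for illustration only
END_TAG = "</think>"


@dataclass
class ReasoningSplitter:
    """Accumulates streamed deltas and splits them into (reasoning, content)."""

    buffer: str = ""

    def feed(self, delta: str) -> None:
        self.buffer += delta

    def split(self) -> tuple[str, str]:
        text = self.buffer
        # A partial start tag may still be arriving: emit nothing yet.
        if text and len(text) < len(START_TAG) and START_TAG.startswith(text):
            return "", ""
        # Drop the start tag if the model emitted one; tolerate its absence.
        if text.startswith(START_TAG):
            text = text[len(START_TAG):]
        reasoning, sep, content = text.partition(END_TAG)
        if not sep:
            return reasoning, ""  # end tag not seen yet: all text is reasoning
        return reasoning, content


with_tag = ReasoningSplitter()
for delta in ["<thi", "nk>plan", " steps</think>", "answer"]:
    with_tag.feed(delta)

without_tag = ReasoningSplitter()
without_tag.feed("plan steps</think>answer")

print(with_tag.split(), without_tag.split())
```

Both inputs yield the same split, which is the property a parser needs when some model versions emit the start tag and others do not. The sketch does not handle an end tag that is split across deltas.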
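Issue #38246: [Feature]: Better Flashinfer compilation logging:
- Core topic: FlashInfer kernel compilation and tuning can run for up to 10 minutes without emitting any logs, which is easily mistaken for a hang. The request is to add progress messages.
- Positions and status: ProExpertProg, who filed the issue, described the pain point. The community responded quickly: JINO-ROHIT claimed the issue and submitted a fix in PR #38254. The discussion was efficient and practical, aimed squarely at user experience.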
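
One common way to add the kind of progress message requested in #38246 is a background heartbeat that logs while a long task runs. The sketch below shows that pattern only; it is not taken from PR #38254, and the heartbeat helper, the logger name, and the sleep that stands in for compilation are all hypothetical.

```python
import logging
import threading
import time
from contextlib import contextmanager

logger = logging.getLogger("compile_progress")  # hypothetical logger name


@contextmanager
def heartbeat(task: str, interval_s: float = 30.0):
    """Log a 'still running' message every interval_s seconds until exit."""
    stop = threading.Event()
    start = time.monotonic()
    beats: list[float] = []  # elapsed seconds recorded at each heartbeat

    def _run() -> None:
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            elapsed = time.monotonic() - start
            beats.append(elapsed)
            logger.info("%s still running (%.0fs elapsed)", task, elapsed)

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    try:
        yield beats
    finally:
        stop.set()
        thread.join()


with heartbeat("kernel compilation", interval_s=0.05) as beats:
    time.sleep(0.3)  # stand-in for a long compile step

print(len(beats))
```

Because the heartbeat runs in its own thread, it keeps reporting even when the main thread is blocked inside a compiler call, provided that call releases the GIL.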
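PR #38216: [Perf] Batch KV cache swap copies:
- Core topic: uses cuMemcpyBatchAsync to batch the memory copies performed during KV cache offload and load, sharply reducing driver call overhead.
- Positions and status: contributor Etelis provides detailed benchmark data across models, showing speedups of 3.6x to 7.4x in a range of configurations. The optimization does not change behavior and handles both CUDA 12.8+ and older versions, making it a basic performance improvement with a clear payoff. There is no disagreement so far.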

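🔥 Hot Topics and Trends

Intense performance work: the push for maximum performance shows up at every level: KV cache management (batched copies #38216, TurboQuant compression #38280, hybrid-model offloading #38261), kernel fusion (AMD's AITER fusions #38313, #38299), and execution scheduling (skipping empty work #38287).
New model and modality support: integrating newly released models quickly is key to keeping vLLM competitive. This cycle brought a support issue for the Phi-4 vision reasoning model (#38309) with a matching PR (#38306), and a runtime issue with NVIDIA's Nemotron-3-Nano hybrid model on Blackwell (#38208).
Tool calling and reasoning streams: bug fixes and refactors continue for tool-call parsers (#38274, #38189) and reasoning ("think") stream parsers (#38212, #38213), both tied to agent capabilities, which reflects attention to stability in complex production use cases.
Full-stack AMD ecosystem work: from low-level kernels and quantization tooling (Quark), through CI/CD and release pipelines (nightly Docker image and wheel), to user documentation, AMD is building a complete, usable software stack inside vLLM and challenging CUDA's ecosystem dominance.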

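🛠️ Key Technical Changes

PR #37283: [Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm:
- Technical summary: sets up a continuous release pipeline for the ROCm backend on par with CUDA's, automatically building and pushing a nightly Docker image (vllm/vllm-openai-rocm:nightly) and pip wheel packages.
- Impact: greatly improves the development experience for AMD users by making the latest features easy to obtain, and marks a milestone in the maturity of the AMD ecosystem. The follow-up documentation update (Issue #38304) is worth tracking.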
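PR #38280: [Quantization] Add TurboQuant KV cache quantization:
- Technical summary: introduces TurboQuant, a recent vector quantization technique for the KV cache, to vLLM for the first time. The implementation covers the core algorithm, engine integration, Triton kernels, and a full set of benchmarks. Its selling point is near-lossless compression at very low bit widths (2 to 4 bits) with very low runtime overhead.
- Impact: gives very-long-context workloads a strong new tool for saving GPU memory and may shift the choice of KV cache compression techniques.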
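
To put the memory savings of 2 to 4 bit KV cache quantization in perspective, the arithmetic can be sketched as follows. The model shape is hypothetical (it is not taken from the PR), and the calculation ignores any per-block metadata such as scales or codebooks that a real quantization scheme would also store.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: int) -> float:
    """Raw KV cache size for one sequence, ignoring quantization metadata."""
    # The factor 2 accounts for storing both keys and values.
    total_values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return total_values * bits_per_value / 8


# Hypothetical model shape and context length, chosen for round numbers.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

GIB = 2**30
baseline = kv_cache_bytes(**shape, bits_per_value=16)
sizes_gib = {bits: kv_cache_bytes(**shape, bits_per_value=bits) / GIB
             for bits in (16, 4, 3, 2)}
print(sizes_gib)
```

For this shape the 16-bit cache is 16 GiB per 128K-token sequence, and the 4-, 3-, and 2-bit variants come to 4, 3, and 2 GiB before metadata.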
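PR #38216: [Perf] Batch KV cache swap copies via cuMemcpyBatchAsync:
- Technical summary: merges the many fine-grained cudaMemcpyAsync calls issued during KV cache offload and load into a single cuMemcpyBatchAsync call, using the batching capability of recent CUDA drivers to cut CPU-side overhead significantly.
- Impact: directly raises throughput wherever KV cache offloading is used (for example when memory is short or when an external cache such as NIXL is in play). It is a broadly applicable and efficient infrastructure optimization.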
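
The idea of replacing many small copy submissions with one batched submission can be sketched in plain Python. FakeDriver below is a hypothetical stand-in that only counts submissions; it is not a CUDA binding and does not model the real cuMemcpyBatchAsync signature. The point is that both paths move the same bytes while the batched path crosses the API boundary once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    src_off: int  # byte offset in the source buffer
    dst_off: int  # byte offset in the destination buffer
    size: int     # number of bytes to copy


class FakeDriver:
    """Hypothetical stand-in for a copy API; it only counts submissions."""

    def __init__(self) -> None:
        self.calls = 0

    def copy_one(self, src: bytes, dst: bytearray, c: Copy) -> None:
        self.calls += 1  # one submission per block
        dst[c.dst_off:c.dst_off + c.size] = src[c.src_off:c.src_off + c.size]

    def copy_batch(self, src: bytes, dst: bytearray, copies: list[Copy]) -> None:
        self.calls += 1  # one submission for the whole list
        for c in copies:
            dst[c.dst_off:c.dst_off + c.size] = src[c.src_off:c.src_off + c.size]


BLOCK = 16
NUM_BLOCKS = 16
src = bytes(range(BLOCK * NUM_BLOCKS))
# Scatter block i to slot NUM_BLOCKS - 1 - i, as a stand-in for a swap plan.
plan = [Copy(i * BLOCK, (NUM_BLOCKS - 1 - i) * BLOCK, BLOCK)
        for i in range(NUM_BLOCKS)]

per_block, batched = FakeDriver(), FakeDriver()
dst_a, dst_b = bytearray(len(src)), bytearray(len(src))
for c in plan:
    per_block.copy_one(src, dst_a, c)
batched.copy_batch(src, dst_b, plan)

print(per_block.calls, batched.calls, dst_a == dst_b)
```

When the fixed cost per submission dominates the cost of moving a small block, reducing 16 submissions to 1 is where the saving comes from.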
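RFC #38256: Incremental MoE Expert Offloading:
- Technical summary: proposes an incremental, cache-based scheme for scheduling MoE expert weights dynamically. Unlike static offloading, it moves expert weights between GPU memory (the cache) and CPU memory according to how hot they are, which lets very large MoE models run within limited GPU memory.
- Impact: offers a new software approach to the hardware barrier for deploying large models (such as MoE models with hundreds of billions of parameters). The rollout plan, split into three PRs, is designed to keep review complexity down.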
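
A cache that keeps hot experts on the GPU and fetches the rest from host memory on demand can be sketched as below. This is not the RFC's design: the ExpertCache class is hypothetical, least-recently-used eviction is an assumed policy (the RFC summary only says placement follows expert hotness), and plain strings stand in for weight tensors.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy GPU-side cache of expert weights with LRU eviction (assumed policy)."""

    def __init__(self, capacity: int, cpu_store: dict):
        self.capacity = capacity
        self.cpu_store = cpu_store   # expert_id -> weights held in host memory
        self.gpu = OrderedDict()     # simulated GPU-resident experts, LRU order
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int):
        if expert_id in self.gpu:
            self.hits += 1
            self.gpu.move_to_end(expert_id)  # mark as most recently used
            return self.gpu[expert_id]
        self.misses += 1
        if len(self.gpu) >= self.capacity:
            self.gpu.popitem(last=False)     # evict the least recently used
        # A real system would issue an asynchronous host-to-device copy here.
        weights = self.cpu_store[expert_id]
        self.gpu[expert_id] = weights
        return weights


cpu_store = {i: f"weights-{i}" for i in range(4)}
cache = ExpertCache(capacity=2, cpu_store=cpu_store)
for expert_id in [0, 1, 0, 2, 0, 3, 1]:  # hypothetical routing decisions
    cache.get(expert_id)

print(cache.hits, cache.misses, list(cache.gpu))
```

With room for two of four experts, the frequently routed expert 0 stays resident for most of the trace while the others rotate through the second slot.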
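PR #38313 & #38299: [ROCm] AITER fusion kernels for MLA:
- Technical summary: targets the specific compute pattern of DeepSeek MLA models on AMD GPUs, using the AITER library to fuse several key operations into single kernels, including RoPE with the KV cache write and RMSNorm with FP8 quantization.
- Impact: this is deep optimization for a specific model and hardware pairing. It shows the AMD team's ability to extract performance from MI300-series GPUs through hardware-software co-design and should improve competitiveness on those workloads.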

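📈 Development Activity Observations

Diverse contributors: active contributors include employees of hardware vendors such as AMD (-amd suffix) and NVIDIA (possibly -nv), along with many community developers and researchers. functionstackx was especially active in filing issues and pushing AMD ecosystem improvements.
Efficient review and merging: 37 PRs were merged this cycle, including several major features (such as the ROCm nightly release) and performance optimizations. Many PRs carry the ready label, which suggests the community has a mature process for code quality and merge readiness.
Heavy AMD investment: AMD-related contributions cover the full chain from bug fixes and performance optimization to infrastructure, with several contributors working in coordination, which points to organized, in-depth participation.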

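💡 Issues Worth Watching

Issue #38257: Qwen3-VL-235B OOM with multi-image long multiturn inputs: a very large vision-language model runs out of memory on multi-image, long multi-turn inputs, even on an 8x H100 node. This hits the current memory limits of multimodal large-model inference, and a fix may require more sophisticated activation or KV cache management.
Issue #38266: tokenizing long redundant sequences causes API server deadlock: with certain tokenizers (such as Harmony), tokenizing a long, highly repetitive string takes so long that the API server deadlocks and other concurrent requests are affected. This exposes a potential weakness in request isolation on the server side and matters for production deployments.
RFC #38256: Incremental MoE Expert Offloading: if this architecture proposal is accepted and implemented, it will significantly extend vLLM's ability to deploy very large MoE models. It is an important direction to follow.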
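
The isolation concern raised by #38266 can be illustrated with a small asyncio experiment: a blocking call made directly inside a coroutine stalls every other task on the event loop, while the same call moved to a worker thread does not. This is a generic sketch, not vLLM's server code and not a proposed fix; slow_tokenize is a hypothetical stand-in that sleeps instead of tokenizing.

```python
import asyncio
import time


def slow_tokenize(text: str) -> list[str]:
    time.sleep(0.2)  # stand-in for a long, blocking tokenizer call
    return text.split()


async def count_ticks(stop: asyncio.Event) -> int:
    """Counts how often the event loop gets to run this task."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(0.01)
        ticks += 1
    return ticks


async def serve(offload: bool) -> int:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # give the ticker a chance to start
    if offload:
        await asyncio.to_thread(slow_tokenize, "a b c")  # loop stays free
    else:
        slow_tokenize("a b c")  # blocks the event loop for the whole call
    stop.set()
    return await ticker


blocked_ticks = asyncio.run(serve(offload=False))
offloaded_ticks = asyncio.run(serve(offload=True))
print(blocked_ticks, offloaded_ticks)
```

The ticker barely advances in the blocking case and keeps running in the offloaded case. A thread only helps when the blocking call releases the GIL, as time.sleep does here; CPU-bound pure-Python work would need a process pool instead.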

📋 Appendix: Detailed Data Lists

New Issues

#38212 MiniMaxM2ReasoningParser broken for M2.5: extract_reasoning_streaming assumes no start tag | no labels | by SilviuSavu (created: 2026-03-26 16:43 (UTC+8))
#38309 [Bug]: microsoft/Phi-4-reasoning-vision-15B Fails to startup | bug | by varun-sundar-rabindranath (created: 2026-03-27 10:22 (UTC+8))
#38307 [Bug]: AMD's minimax mxfp4 trust_remote_code bug | bug,rocm | by functionstackx (created: 2026-03-27 10:18 (UTC+8))
#38308 [Feature]: support affinity settings in helm chart | feature request | by utsumi-fj (created: 2026-03-27 10:20 (UTC+8))
#38304 [Feature]: vllm.ai docs should should instructions for rocm nightly docker & rocm nightly wheel | feature request,rocm | by functionstackx (created: 2026-03-27 10:08 (UTC+8))
#38303 [Bug]: minimax nvfp4 model crash | bug | by functionstackx (created: 2026-03-27 09:37 (UTC+8))
#38298 Energy Efficiency: 10 Mathematical Techniques for 60-70% AI Energy Reduction (Phi6Simple, FFT-Mix, Phi MoE) | no labels | by dancinlife (created: 2026-03-27 08:35 (UTC+8))
#38297 [Bug]: Gemma3n concurrent audio requests crash EngineCore - missing dynamic_dims on audio sequence dimension | no labels | by RushRed (created: 2026-03-27 08:28 (UTC+8))
#38291 [Feature]: Add Rotorquant support | feature request | by Mimi8298 (created: 2026-03-27 07:48 (UTC+8))
#38201 [Feature]: Support TurboQuant KV quant | feature request | by NiuBlibing (created: 2026-03-26 15:32 (UTC+8))
#38286 [Feature]: Batch invariance on 3090 | feature request | by YM2132 (created: 2026-03-27 05:46 (UTC+8))
#38282 [Feature]: Replace vanilla MaxSim with flash-maxsim for late-interaction scoring | feature request | by roipony (created: 2026-03-27 04:33 (UTC+8))
#38279 [Feature]: PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference | feature request | by asrvastava (created: 2026-03-27 04:01 (UTC+8))
#38266 [Bug]: tokenizing long redundant sequences causes API server deadlock (harmony and others) | bug | by Gnoale (created: 2026-03-27 02:44 (UTC+8))
#38260 [RFC]: Multi-tier KV offloading via the vLLM offloading connector | RFC | by dannyharnik (created: 2026-03-27 01:01 (UTC+8))
#38221 Flaky test: test_abort_during_final_step[False] fails intermittently | no labels | by markmc (created: 2026-03-26 19:28 (UTC+8))
#38257 [Bug]: Qwen3-VL-235B OOM with multi-image long multiturn inputs | bug | by cjackal (created: 2026-03-27 00:29 (UTC+8))
#38258 [Usage]: How to do offline inference on one rank in a distributed environment? | usage | by DragonAura (created: 2026-03-27 00:43 (UTC+8))
#38233 [Bug]: Voxtral-Mini-4B-Realtime hangs/crashes on multiple sessions due to encoder_cache_usage saturation on 16GB GPU | bug | by sh1man (created: 2026-03-26 20:28 (UTC+8))
#38235 [Feature]: Quantization support (AWQ / GPTQ / FP8) for mistralai/Voxtral-Mini-4B-Realtime-2602 | feature request | by sh1man (created: 2026-03-26 20:56 (UTC+8))
#38256 [RFC]: Incremental MoE Expert Offloading - GPU Cache + Async Pipeline | no labels | by e1n00r (created: 2026-03-27 00:17 (UTC+8))
#38246 [Feature]: Better Flashinfer compilation logging | help wanted,feature request | by ProExpertProg (created: 2026-03-26 23:22 (UTC+8))
#38250 [Bug]: VLLM_CPU_OMP_THREADS_BIND=nobind cannot be used with tp>1 on CPU backends | bug | by kot-begemot-uk (created: 2026-03-26 23:36 (UTC+8))
#38231 [RFC]: Support Dynamic Model Switching and Flexible Collective Communication in External Launcher Mode | RFC | by KaisennHu (created: 2026-03-26 20:11 (UTC+8))
#38241 [Bug]: DSR1 hang on B200 | bug | by ProExpertProg (created: 2026-03-26 22:25 (UTC+8))
#38245 [Bug]: Responses API text.format.type="json_schema" leaks schema_ in non-stream responses and breaks streaming | bug | by noobHappylife (created: 2026-03-26 23:21 (UTC+8))
#38240 [Feature]: dflash speculator model support | feature request | by shanjiaz (created: 2026-03-26 22:18 (UTC+8))
#38234 Test Failure: test_run_eagle_dp[FLASH_ATTN] produces non-deterministic outputs with EAGLE speculative decoding | no labels | by markmc (created: 2026-03-26 20:55 (UTC+8))
#38230 Hybrid KV offload: MultiConnector + planner for mamba+attention models | no labels | by malaiwah (created: 2026-03-26 20:06 (UTC+8))
#38228 [Bug]: Based on vllm 0.18.0 version, when the number of tensor parallelizations is greater than 1, an error message will be reported: [AMP ERROR] [CudaFrontend. cpp: 94] [failed to call cuCtxGetDevice (&device), error code: CUDA-ERROR-INVALIDFHIR TEXT | bug | by uOnePiece (created: 2026-03-26 20:01 (UTC+8))
#38208 [Bug]: CUDA Illegal Instruction during CUDA Graph capture with Nemotron-3-Nano NVFP4 on sm_121 | bug | by dennis-lynch (created: 2026-03-26 16:21 (UTC+8))
#38203 [Bug]: M2.5 tool call result is badcase, deploy 1p1d with nixl connector, P and D use DP8-EP-TP1 | bug | by ssbandjl (created: 2026-03-26 15:54 (UTC+8))
#38202 [Feature]: Add apply_with_spec_decode() method to LogitBiasLogitsProcessor for speculative decoding support | feature request | by ranger2571 (created: 2026-03-26 15:48 (UTC+8))
#38196 GDN attention backend crashes with ngram speculative decoding on mixed decode batches | no labels | by bhaktatejas922 (created: 2026-03-26 14:41 (UTC+8))
#38197 [Bug]: Qwen3.5-dense wfp8afp8 w: per-tensor a: per-tensor Output garbled text, but in sglang is norm | bug | by kkyyxhll (created: 2026-03-26 14:42 (UTC+8))
#38195 [Bug]: Qwen3.5-dense wfp8afp8 w: per-tensor a: per-tensor Output garbled text, but in sglang is norm | bug | by kkyyxhll (created: 2026-03-26 14:37 (UTC+8))
#38194 [Performance]: Prefix cache hit lower on vLLM than on other inference stacks | performance | by baoskee (created: 2026-03-26 14:23 (UTC+8))

Closed Issues

#20561 [Bug]: Bad result with parallel generation. | bug,stale | by hscspring (closed: 2026-03-27 10:18 (UTC+8))
#22605 [RFC]: Separated CPU KV Cache Offloading/Transfer Process | RFC,stale | by ApostaC (closed: 2026-03-27 10:18 (UTC+8))
#23534 [Bug]: An issue arises during offline_inference with data-parallel execution. | bug,stale | by ZwhElliott (closed: 2026-03-27 10:18 (UTC+8))
#24665 [Usage]: how to get tts wav(minicpm-o) via vllm online server | usage,stale | by whk6688 (closed: 2026-03-27 10:18 (UTC+8))
#26139 [Bug]: input thread may die silently | bug,stale | by jennyyyyzhen (closed: 2026-03-27 10:17 (UTC+8))
#27409 [New Model]: ibm-granite/granite-4.0-h-small-FP8 | new-model,stale | by afazekas (closed: 2026-03-27 10:17 (UTC+8))
#28003 [Usage]: | usage,stale | by amitmvyas (closed: 2026-03-27 10:17 (UTC+8))
#28933 [Feature]: Structured request_id in logs and inclusion in error logs | feature request,stale | by JaimeArboleda (closed: 2026-03-27 10:17 (UTC+8))
#29361 [Bug]: CUDA illegal memory access in fused_marlin_moe for Kimi-K2-Thinking on H20 4-nodes 32-ranks(DP4, TP8, EP32) | bug,stale | by TianyiZhao1437 (closed: 2026-03-27 10:16 (UTC+8))
#29368 [Bug]:llama4 AttributeError: 'dict' object has no attribute 'model_type' | bug,stale | by win10ogod (closed: 2026-03-27 10:16 (UTC+8))
#29374 [Bug]: vllm/vllm-openai:v0.11.0 deployment --quantization fp8 throws cuda and tensor errors | bug,stale | by sravan500 (closed: 2026-03-27 10:16 (UTC+8))
#29378 [Bug]: vllm0.11.0 deployment --pipeline-parallel-size 4 --tensor_parallel_size 2 for Qwen3 VL 8B Returns Strange Results | bug,stale | by my462 (closed: 2026-03-27 10:16 (UTC+8))
#29389 [Bug]: race condition in shm_broadcast.py | bug,stale | by nvjullin (closed: 2026-03-27 10:16 (UTC+8))
#29408 [Bug]:Reasoning parser work wrong with Qwen3-VL-Thinking-30B-A3B-FP8 | bug,stale | by zhcn000000 (closed: 2026-03-27 10:16 (UTC+8))
#29436 [Bug]: vLLM Serve with LMCache enabled produces wrong output for GPT-OSS-20B | bug,stale | by ksuma2109 (closed: 2026-03-27 10:16 (UTC+8))
#29481 [Bug]: vLLM GPU memory fragmentation means a single GPU can host only one small model | bug,stale | by xzl12080 (closed: 2026-03-27 10:16 (UTC+8))
#29489 [Usage]: Removing last generated token from output and kv cache | usage,stale | by josefdra (closed: 2026-03-27 10:16 (UTC+8))
#29496 [Bug]: Qwen3-Embedding models get stuck when embedding input of max-model-len | bug,stale | by jhradil (closed: 2026-03-27 10:16 (UTC+8))
#29508 [Feature]: Conformance test for Gateway API Inference Extension | feature request,stale | by liu-cong (closed: 2026-03-27 10:16 (UTC+8))
#36704 [Feature]: upstream nightly rocm docker | feature request,rocm | by functionstackx (closed: 2026-03-27 10:05 (UTC+8))
#36703 [Feature]: upstream nightly rocm vllm | feature request,rocm | by functionstackx (closed: 2026-03-27 10:04 (UTC+8))
#38159 [Installation]: nightly builds for vllm/vllm-openai stopped three days ago | installation | by CaptainGlac1er (closed: 2026-03-27 05:09 (UTC+8))
#33638 [Bug]: DeepSeekV3.1 with fp8 kvcache in v0.15.0 produces garbled output | bug,deepseek | by lyg95 (closed: 2026-03-27 00:13 (UTC+8))
#38241 [Bug]: DSR1 hang on B200 | bug | by ProExpertProg (closed: 2026-03-26 23:24 (UTC+8))
#38101 [CI Failure]: Test Eval Marlin Qwen3-30B-A3B-Fp8 | ci-failure | by ilmarkov (closed: 2026-03-26 20:27 (UTC+8))
#37804 [Bug] DeepGemm E8M0 scale format causes accuracy degradation for Qwen3.5 FP8 on Blackwell | bug | by vadiklyutiy (closed: 2026-03-26 16:21 (UTC+8))
#37976 [Feature]: Add /v1/chat/completions/batch endpoint for batched chat completions with structured output support | feature request | by MatejRojec (closed: 2026-03-26 16:06 (UTC+8))
#36777 [Feature]: Kimi K2.5 Speculative Decoding | feature request | by casper-hansen (closed: 2026-03-26 15:44 (UTC+8))
#38118 [Doc]: consider adding docstring for a lot of the methods | documentation | by JINO-ROHIT (closed: 2026-03-26 15:05 (UTC+8))
#38195 [Bug]: Qwen3.5-dense wfp8afp8 w: per-tensor a: per-tensor Output garbled text, but in sglang is norm | bug | by kkyyxhll (closed: 2026-03-26 14:38 (UTC+8))
#30604 [Installation]: [ARM_CPU_backend] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client. | installation | by Mengjintao (closed: 2026-03-26 14:34 (UTC+8))
#29983 [Bug]: v0.12.0 Dockerfile.cpu fails to build on ARM server | bug,stale,cpu | by jharriga (closed: 2026-03-26 14:33 (UTC+8))
#28052 [Bug]: rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103 has an error with flash attention | bug,rocm,stale | by capteen-hook (closed: 2026-03-26 14:21 (UTC+8))

New PRs

#38299 [ROCm] Add AITER RMSNorm+FP8 quantization fusion for MLA | rocm | by khairulkabir1661 (created: 2026-03-27 08:50 (UTC+8))
#38280 [Quantization] Add TurboQuant KV cache quantization (Phase 1) | v1 | by lishunyang12 (created: 2026-03-27 04:02 (UTC+8))
#38316 [XPU] Add per-channel quantized model in compressed-tensors | no labels | by yma11 (created: 2026-03-27 11:43 (UTC+8))
#38315 [Perf][GDN] Fuse kkt + solve_tril kernels in chunk forward | needs-rebase | by ZJY0516 (created: 2026-03-27 11:35 (UTC+8))
#38224 [CT]FP8 WoQ Linear Refactored | no labels | by Zhenzhong1 (created: 2026-03-26 19:57 (UTC+8))
#38314 [KVConnector]: Add kv_connector_class_name to resolve name conflict with built-in connectors | deepseek,kv-connector | by maobaolong (created: 2026-03-27 11:26 (UTC+8))
#38301 [KVConnector]: prioritize external connector over internal registry | kv-connector | by maobaolong (created: 2026-03-27 09:28 (UTC+8))
#38313 [ROCm] Add AITER RoPE + KV cache fusion for MLA prefill and decode | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality | by khairulkabir1661 (created: 2026-03-27 11:12 (UTC+8))
#38302 [CI] Swap to smaller models for MIG slice compatibility | ci/build,v1 | by khluu (created: 2026-03-27 09:34 (UTC+8))
#38274 [Bugfix][Tool Parser] Fix Kimi-K2 argument truncation when decoding bundles tool_call_end and tool_call_begin into a single streaming delta | bug,tool-calling | by slevental (created: 2026-03-27 03:21 (UTC+8))
#38294 [Bugfix] Fix AuthenticationMiddleware KeyError on WebSocket scopes | bug,frontend | by russellb (created: 2026-03-27 08:02 (UTC+8))
#38189 [Tool Parser][2/3] Use self.tools instead of request.tools in tool parsers | tool-calling,qwen,deepseek | by sfeng33 (created: 2026-03-26 12:35 (UTC+8))
#38310 [Doc] Support configuring affinity in helm chart | documentation | by utsumi-fj (created: 2026-03-27 10:23 (UTC+8))
#38207 [CI] Reorganize scoring tests | frontend,ready | by noooop (created: 2026-03-26 16:17 (UTC+8))
#38311 [Model Runner V2] Rebuild attention metadata before eagle decode full… | v1 | by TheEpicDolphin (created: 2026-03-27 10:43 (UTC+8))
#38312 Fix zmq port conflict | no labels | by liuchenbing2026 (created: 2026-03-27 10:47 (UTC+8))
#38242 [Misc] Rename think_start_str/think_end_str to reasoning_start_str/reasoning_end_str | documentation,v1 | by chaunceyjiang (created: 2026-03-26 22:25 (UTC+8))
#38306 [Model] Add Phi4ForCausalLMV for microsoft/Phi-4-reasoning-vision-15B | new-model,multi-modality | by varun-sundar-rabindranath (created: 2026-03-27 10:17 (UTC+8))
#38214 [Feature] Add auto-detection for reasoning_config when only reasoning_parser is set | documentation,v1 | by chaunceyjiang (created: 2026-03-26 17:32 (UTC+8))
#38270 [Mamba] Raise on insufficient cache blocks instead of silently capping cudagraph sizes | v1,nvidia | by NickLucche (created: 2026-03-27 03:08 (UTC+8))
#38305 [Bugfix] Fix Gemma3n concurrent audio requests crashing EngineCore | bug | by he-yufeng (created: 2026-03-27 10:14 (UTC+8))
#38261 Hybrid KV offload: planner, MultiConnector, and mamba alignment for hybrid models | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality | by malaiwah (created: 2026-03-27 01:08 (UTC+8))
#38292 [CI][ROCm] Add gpt-oss w4a8 in CI | rocm,ready,gpt-oss | by BowenBao (created: 2026-03-27 07:53 (UTC+8))
#38244 [CT][FP8][Marlin] refactor CompressedTensorsW8A16Fp8 to use kernel abstraction | no labels | by jikunshang (created: 2026-03-26 23:10 (UTC+8))
#38300 [Speculative Decoding] Add DFlash speculators config parsing | new-model,speculative-decoding,v1,qwen | by ZhanqiuHu (created: 2026-03-27 08:58 (UTC+8))
#38206 Enable multiple gpu test on xpu platform | no labels | by chaojun-zhang (created: 2026-03-26 16:12 (UTC+8))
#38296 [ROCm] aiter_unified_attn fp8 q scale refactor | rocm,v1 | by divakar-amd (created: 2026-03-27 08:18 (UTC+8))
#38295 [Docs] Document that --api-key has no effect on gRPC server | documentation | by russellb (created: 2026-03-27 08:18 (UTC+8))
#38247 Various Transformers v5 config fixes | ready,deepseek | by hmellor (created: 2026-03-26 23:23 (UTC+8))
#38293 Fix: FusedMoE AssertionError with Speculative Decoding on Quark-Quantized Models | no labels | by vecheruk-amd (created: 2026-03-27 07:53 (UTC+8))
#38290 [CI] Pin GitHub Actions to commit hashes in macos-smoke-test.yml | ci/build | by russellb (created: 2026-03-27 07:29 (UTC+8))
#38289 tests: reduce duplication in online pooling scoring suites | no labels | by cursor (created: 2026-03-27 06:55 (UTC+8))
#38285 [AMD][Build] Test DeepEP offload | rocm,ci/build | by rjrock (created: 2026-03-27 05:20 (UTC+8))
#38288 [Quant] Consolidate GPTQ: rename gptq_marlin.py to auto_gptq.py | no labels | by chengyinie (created: 2026-03-27 06:17 (UTC+8))
#38287 [Perf] Skip kv connector empty work, 1.1% Throughput Improvement | ready,v1,kv-connector | by yewentao256 (created: 2026-03-27 05:56 (UTC+8))
#38272 [ROCm][CI] Unsetting arch completely | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-03-27 03:12 (UTC+8))
#38284 [Startup][UX] Enable CUDAGraph memory profiling by default | v1,nvidia | by MatthewBonanni (created: 2026-03-27 05:18 (UTC+8))
#38283 [Bugfix] Remove dead unicode_escape decoding from 4 tool parsers | bug,tool-calling | by yzong-rh (created: 2026-03-27 05:06 (UTC+8))
#38268 [Bugfix] Remove dead name-only regex fallback from DeepSeek V3.1 tool parser | bug,tool-calling,deepseek | by yzong-rh (created: 2026-03-27 02:49 (UTC+8))
#38238 Removed GPU state confirmation and cleanup steps. | rocm,ready,ci/build | by dhonnappa-amd (created: 2026-03-26 21:46 (UTC+8))
#38255 [Bugfix] Remove false-positive format mismatch warnings in FLA ops | bug,ready | by tdoublep (created: 2026-03-27 00:11 (UTC+8))
#38275 fix: pin 1 unpinned action(s) | ci/build | by dagecko (created: 2026-03-27 03:25 (UTC+8))
#38205 [ZenCPU] Make PT Backport Patch Accessible to vLLM | ready | by amd-lalithnc (created: 2026-03-26 15:58 (UTC+8))
#38277 [Core] Add soft thinking token budget with progressive logit bias | v1 | by efortin (created: 2026-03-27 03:28 (UTC+8))
#38273 Turbo Quant | documentation,rocm,ci/build,v1 | by asrvastava (created: 2026-03-27 03:14 (UTC+8))
#38281 [Model Loader] fix BGE-M3 compatibility | no labels | by wkpark (created: 2026-03-27 04:11 (UTC+8))
#38278 [Model] Use AutoWeightsLoader for InternLM2 | no labels | by javierdejesusda (created: 2026-03-27 03:33 (UTC+8))
#38276 Enable eagle3 for MiniMax M2 | no labels | by TheEpicDolphin (created: 2026-03-27 03:27 (UTC+8))
#38190 [responsesAPI] output failed event. | frontend | by qandrew (created: 2026-03-26 12:36 (UTC+8))
#38271 [Bugfix][Tool Parser] Fix Kimi-K2 argument truncation when tool_call_… | bug,tool-calling | by slevental (created: 2026-03-27 03:09 (UTC+8))
#38269 [Bugfix][Tool Parser] Fix Kimi-K2 argument truncation when tool_call_… | bug,tool-calling | by slevental (created: 2026-03-27 03:08 (UTC+8))
#38267 tests(network_utils): add coverage for late-binding ZMQ helpers (#28498) | performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality,tool-calling | by loriscience (created: 2026-03-27 02:47 (UTC+8))
#38254 flash infer logging | no labels | by JINO-ROHIT (created: 2026-03-27 00:06 (UTC+8))
#38188 Bump flashinfer version to 0.6.7 | ci/build,nvidia,ready-run-all-tests | by wzhao18 (created: 2026-03-26 11:59 (UTC+8))
#38263 [ROCm] [Bugfix] [Release] Fix nightly rocm release pipeline | bug,rocm,ready,ci/build | by tjtanaa (created: 2026-03-27 01:25 (UTC+8))
#38265 [Refactor] Consolidate Tool type alias in tool_parsers/utils.py | tool-calling | by sfeng33 (created: 2026-03-27 02:29 (UTC+8))
#38262 [frontend] dump openai responses type by alias | frontend | by cjackal (created: 2026-03-27 01:20 (UTC+8))
#38264 [Mypy] Fix adjust_request typing | documentation,frontend,tool-calling | by sfeng33 (created: 2026-03-27 01:59 (UTC+8))
#38252 [ROCm][CI/Build] ROCm 7.2.1 release version; torch 2.10; triton 3.6 | rocm,ci/build | by gshtras (created: 2026-03-26 23:41 (UTC+8))
#38259 SiLU_mul blockwise quantized fp8 kernel in Helion | no labels | by rwtarpit (created: 2026-03-27 00:49 (UTC+8))
#38185 [Bugfix][Backport] Backport PR #34934 to v0.13.0: Fix typos | bug,documentation,performance,rocm,ci/build,v1,tool-calling,qwen,cpu,kv-connector | by khairulkabir1661 (created: 2026-03-26 11:56 (UTC+8))
#38248 [WIP] Prototype MM batcher interface | multi-modality | by DarkLight1337 (created: 2026-03-26 23:25 (UTC+8))
#38253 [Bugfix][Frontend] Return 400 for corrupt/truncated image inputs instead of 500 | bug,multi-modality | by aliialsaeedii (created: 2026-03-26 23:55 (UTC+8))
#38236 [ZenCPU] AMD Zen CPU Backend with supported dtypes via zentorch weekly | rocm,ci/build | by Chinmay-Kulkarni-AMD (created: 2026-03-26 20:57 (UTC+8))
#38229 Revert "Various Transformers v5 fixes" (#38127) | deepseek | by zhewenl (created: 2026-03-26 20:06 (UTC+8))
#38251 Nvfp4 cutedsl moe | nvidia,ready-run-all-tests | by zyongye (created: 2026-03-26 23:36 (UTC+8))
#38219 [CPU] Support CT W4A16 on CPU MP kernel | ready,cpu | by bigPYJ1151 (created: 2026-03-26 18:38 (UTC+8))
#38243 [Bugfix][Frontend] Return 400 for corrupt/truncated image inputs instead of 500 | bug,multi-modality | by aliialsaeedii (created: 2026-03-26 22:41 (UTC+8))
#38232 [Fix] Remove unused packing_position_embedding from PaddleOCRVL for better checkpoint compatibility | ready | by zhang-prog (created: 2026-03-26 20:28 (UTC+8))
#38249 [Misc] Organize NixlConnector into own directory | documentation,v1,kv-connector | by NickLucche (created: 2026-03-26 23:27 (UTC+8))
#38217 [KV Offload] Update stale comment in FilterReusedOffloadingManager | v1 | by ronensc (created: 2026-03-26 18:26 (UTC+8))
#38220 [specdec][v1] Fix sampler peak memory utilization when using specdec | ready,v1 | by Flechman (created: 2026-03-26 19:27 (UTC+8))
#38239 [Bugfix][Core] Allow multi-dtype MambaSpec KV cache spec | bug,v1 | by MohdElgaar (created: 2026-03-26 21:57 (UTC+8))
#38216 [Perf] Batch KV cache swap copies via cuMemcpyBatchAsync | v1 | by Etelis (created: 2026-03-26 18:11 (UTC+8))
#38237 [Bugfix] Only disable hybrid KV cache manager when KV events are actually enabled | bug | by liulanzheng (created: 2026-03-26 21:36 (UTC+8))
#38227 [Bugfix] [Frontend] responses api, refactored simple event streaming | bug,frontend | by bfroemel (created: 2026-03-26 20:00 (UTC+8))
#38225 [Bugfix] Prevent IndexError on empty content in HarmonyContext tool calls | bug,frontend,gpt-oss | by dubin555 (created: 2026-03-26 19:57 (UTC+8))
#38191 [Bugfix] Fix k_norm weight sharding in MiniMaxM2Attention when total_num_kv_heads < tp_size | bug | by wxsIcey (created: 2026-03-26 12:53 (UTC+8))
#38218 [Renderer] Consolidate factory methods | ready | by DarkLight1337 (created: 2026-03-26 18:35 (UTC+8))
#38223 [Bugfix] Handle reasoning_effort="none" for Harmony models instead of crashing | bug,frontend,gpt-oss | by dubin555 (created: 2026-03-26 19:50 (UTC+8))
#38226 [Bugfix] Respect max_output_tokens in Harmony tool-call loop | bug,frontend,tool-calling,gpt-oss | by dubin555 (created: 2026-03-26 19:57 (UTC+8))
#38222 [Bugfix] Add dimension alignment check to Marlin MoE kernel selection | bug | by he-yufeng (created: 2026-03-26 19:32 (UTC+8))
#38213 fix(reasoning): MiniMaxM2ReasoningParser broken for M2.5 | no labels | by SilviuSavu (created: 2026-03-26 16:46 (UTC+8))
#38204 [feature] Implement reasoning_effort | frontend | by chaunceyjiang (created: 2026-03-26 15:55 (UTC+8))
#38215 [Fix] Fix FlashInfer CUTLASS MoE for unquantized models and single-GPU, bump FlashInfer to 0.6.8 | ci/build,nvidia | by askliar (created: 2026-03-26 17:44 (UTC+8))
#38211 [Feature]: fused RMSNorm + fp8 block quantized kernel in Helion | no labels | by aman-coder03 (created: 2026-03-26 16:39 (UTC+8))
#38193 [XPU] Disable xpu graph by default | ready | by jikunshang (created: 2026-03-26 14:04 (UTC+8))
#38209 [Doc] Fix outdated reference to CUDAGraphManager | documentation,ready,nvidia | by DarkLight1337 (created: 2026-03-26 16:37 (UTC+8))
#38210 Fix json chat_templet parser | frontend | by QiuMike (created: 2026-03-26 16:39 (UTC+8))
#38199 [Bugfix] Include entry-point logits processor plugins in output token… | bug,v1 | by YingxuH (created: 2026-03-26 15:06 (UTC+8))
#38200 Qwen3.5 0325 mtp | qwen | by zhewenl (created: 2026-03-26 15:19 (UTC+8))
#38198 [DO NOT MERGE] Flatten type abstractions in Renderer | frontend,ready | by DarkLight1337 (created: 2026-03-26 15:06 (UTC+8))
#38192 added int4 AR format cpu | no labels | by Zhenzhong1 (created: 2026-03-26 13:44 (UTC+8))
#38184 [ROCm][CI] Run Kernels Core Operation Test On MI325 and mitigate flakiness | rocm,ci/build | by micah-wil (created: 2026-03-26 11:56 (UTC+8))
#38187 [Frontend][Bugfix] Fix double BOS token in Responses API for models with add_bos_token | bug,frontend | by hyeongyun0916 (created: 2026-03-26 11:58 (UTC+8))
#38186 [code clean] remove useless contextlib.suppress(Exception) | no labels | by andyxning (created: 2026-03-26 11:57 (UTC+8))

Merged PRs

#38207 [CI] Reorganize scoring tests | frontend,ready | by noooop (merged: 2026-03-26 20:07 (UTC+8))
#37447 [CI/Build] enable Intel XPU test flow with prebuilt image | ready,ci/build | by wendyliu235 (merged: 2026-03-27 09:16 (UTC+8))
#38247 Various Transformers v5 config fixes | ready,deepseek | by hmellor (merged: 2026-03-27 07:07 (UTC+8))
#38162 [Bugfix] Add missing f-string prefix in xgrammar choices error message | bug,structured-output,ready,v1 | by yzong-rh (merged: 2026-03-27 05:43 (UTC+8))
#38045 [Model Runner V2] Enable forcing a specific acceptance rate during rejection sampling | speculative-decoding,ready,ci/build,v1 | by TheEpicDolphin (merged: 2026-03-27 04:38 (UTC+8))
#38136 Fix multi-node allreduce fusion | ready,ci/build,nvidia | by wzhao18 (merged: 2026-03-27 04:24 (UTC+8))
#37547 [Bugfix][ROCm] Fix lru_cache on paged_mqa_logits_module | bug,rocm,ready,v1 | by gronsti-amd (merged: 2026-03-27 03:01 (UTC+8))
#38263 [ROCm] [Bugfix] [Release] Fix nightly rocm release pipeline | bug,rocm,ready,ci/build | by tjtanaa (merged: 2026-03-27 02:47 (UTC+8))
#38165 [ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test containers | rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-03-27 02:33 (UTC+8))
#37930 [ROCm][CI] Add uv pip compile workflow for rocm-test.txt lockfile | rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-03-27 01:44 (UTC+8))
#37228 [ROCM][Bugfix] Use correct stride in cp_mha_gather_cache_kernel for hybrid model (#37228) | bug,rocm,ready,v1,meta-exported,fb-exported | by jennyyyyzhen (merged: 2026-03-27 01:33 (UTC+8))
#38155 [ROCm][CI] Add LM Eval Qwen3.5 Models test for MI355 | rocm,ready,ci/build,qwen | by AndreasKaratzas (merged: 2026-03-27 00:51 (UTC+8))
#38178 [CI] Fix conch kernel crash on 3D input by reshaping to 2D before GEMM | rocm,ready | by AndreasKaratzas (merged: 2026-03-27 00:46 (UTC+8))
#38137 [ROCm][CI] Fix AITER state leak in shared_fused_moe_routed_transform test | rocm,ready | by AndreasKaratzas (merged: 2026-03-27 00:26 (UTC+8))
#34977 [Mamba][APC] Add test case to compare apc outputs | rocm,ready | by divakar-amd (merged: 2026-03-27 00:40 (UTC+8))
#37283 [Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm | rocm,ready,ci/build | by tjtanaa (merged: 2026-03-27 00:32 (UTC+8))
#35175 [Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode | bug,ready,v1,nvidia | by haosdent (merged: 2026-03-27 00:13 (UTC+8))
#38161 [ROCm][CI] Fix flaky GPTQ compile correctness test | rocm,ready | by AndreasKaratzas (merged: 2026-03-26 19:57 (UTC+8))
#38167 [ROCm][CI] Fix wvSplitKrc mock argument order in test_rocm_unquantized_gemm | rocm,ready | by AndreasKaratzas (merged: 2026-03-26 19:55 (UTC+8))
#35886 [Bugfix][Minor] Fix potential NameError in mamba backend selector and misc typos | bug,rocm,ready,v1 | by ChuanLi1101 (merged: 2026-03-26 23:59 (UTC+8))
#38014 [CI] Add batch invariant test for b200 | ready,ci/build | by yewentao256 (merged: 2026-03-26 23:54 (UTC+8))
#38232 [Fix] Remove unused packing_position_embedding from PaddleOCRVL for better checkpoint compatibility | ready | by zhang-prog (merged: 2026-03-26 23:34 (UTC+8))
#38169 Revert "[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration" (#38050) | ci-failure,nvidia | by zhewenl (merged: 2026-03-26 22:59 (UTC+8))
#38125 DOC: Documentation pages fixes | ready | by mtsokol (merged: 2026-03-26 16:55 (UTC+8))
#38218 [Renderer] Consolidate factory methods | ready | by DarkLight1337 (merged: 2026-03-26 20:19 (UTC+8))
#37962 [bug-fix] GLM OCR Patch Merger context_dim | bug,ready | by JaredforReal (merged: 2026-03-26 20:11 (UTC+8))
#38153 [Refactor] Remove unused utils | ready | by yewentao256 (merged: 2026-03-26 17:08 (UTC+8))
#38193 [XPU] Disable xpu graph by default | ready | by jikunshang (merged: 2026-03-26 16:53 (UTC+8))
#38209 [Doc] Fix outdated reference to CUDAGraphManager | documentation,ready,nvidia | by DarkLight1337 (merged: 2026-03-26 16:52 (UTC+8))
#38018 [Model] Use helper function to run MM processors with token inputs (where applicable) | ready,multi-modality,qwen | by DarkLight1337 (merged: 2026-03-26 16:44 (UTC+8))
#38116 Relocate Encoder CUDA graph manager | ready,v1,nvidia | by WoosukKwon (merged: 2026-03-26 11:52 (UTC+8))
#38083 [Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell | bug,ready,qwen | by vadiklyutiy (merged: 2026-03-26 16:21 (UTC+8))
#38011 Add /v1/chat/completions/batch endpoint for batched chat completions | documentation,frontend,ready | by MatejRojec (merged: 2026-03-26 12:13 (UTC+8))
#37691 [cpu][ci] remove soft-fail for Arm CI and add quant model tests | ready,ci/build,cpu | by fadara01 (merged: 2026-03-26 15:03 (UTC+8))
#38049 [Model] Add torch.compile support for InternVL vision encoder | rocm,ready | by tianrengao (merged: 2026-03-26 14:52 (UTC+8))
#38082 [Bugfix] Fix benchmark_fused_collective.py | bug,performance,ready | by jeejeelee (merged: 2026-03-26 14:51 (UTC+8))
#38092 [Bugfix][CI] Fix Marlin FP8 Linear Kernel for Compressed Tensors Format | bug,ready,ready-run-all-tests | by BadrBasowid (merged: 2026-03-26 12:11 (UTC+8))

PRs closed without merging

#33774 add support for telechat3 model — new-model — by 1096125073 (closed: 2026-03-27 11:41 (UTC+8))
#37696 [torch.compile]: Disable Sequence Parallelism (SP) for piecewise compilation — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by SouthWest7 (closed: 2026-03-27 10:52 (UTC+8))
#28631 [Frontend][3/n] Improve pooling entrypoints | scoring. — documentation,frontend,needs-rebase,multi-modality — by noooop (closed: 2026-03-27 10:53 (UTC+8))
#21651 Limit concurrent long partial prefills via max_long_partial_prefills — stale,v1 — by pansicheng (closed: 2026-03-27 10:18 (UTC+8))
#27983 [Frontend] Make RequestIdMiddleware return the internal request_id — frontend,ready,stale — by markmc (closed: 2026-03-27 10:17 (UTC+8))
#28780 Fix: align vllm bench serve ignore_eos behavior with legacy benchmark… — performance,stale — by Amitjoiya (closed: 2026-03-27 10:17 (UTC+8))
#28810 v1: account for CPU offload capacity in KV cache check — needs-rebase,ci/build,stale,v1,nvidia — by m0nk111 (closed: 2026-03-27 10:17 (UTC+8))
#29321 Updated AMD-CI mirror (2025-11-24) — rocm,ci/build,stale — by Alexei-V-Ivanov-AMD (closed: 2026-03-27 10:17 (UTC+8))
#29455 [DNM] Split MM standard test — ready,ci/build,stale — by ywang96 (closed: 2026-03-27 10:16 (UTC+8))
#29495 [Misc] Add token breakdown to throughput benchmark JSON output — performance,stale — by xxrjun (closed: 2026-03-27 10:16 (UTC+8))
#35483 Add AMD AITER MLA fusion optimization for DeepSeek models — rocm,needs-rebase,deepseek — by khairulkabir1661 (closed: 2026-03-27 08:50 (UTC+8))
#31105 [Bugfix][LoRA] Fix LoRA weight mapping for DeepSeek MLA attention and… — bug,needs-rebase,deepseek — by tim0120 (closed: 2026-03-27 07:55 (UTC+8))
#38289 tests: reduce duplication in online pooling scoring suites — no labels — by cursor (closed: 2026-03-27 06:58 (UTC+8))
#38275 fix: pin 1 unpinned action(s) — ci/build — by dagecko (closed: 2026-03-27 05:23 (UTC+8))
#35454 Enable GLM4.7 FP8 KVCache scale loading — no labels — by BowenBao (closed: 2026-03-27 05:08 (UTC+8))
#32597 Add triton support for compressed_tensors GPTQ W4A16 on Tesla V100 (Volta CUDA 70) — performance,nvidia — by lapy (closed: 2026-03-27 03:49 (UTC+8))
#33956 [Bugfix] Fix video frame sampling for short videos and Qwen3-VL 2-frame requirement — bug,needs-rebase,multi-modality,qwen — by chengyinie (closed: 2026-03-27 03:13 (UTC+8))
#38271 [Bugfix][Tool Parser] Fix Kimi-K2 argument truncation when tool_call_… — bug,tool-calling — by slevental (closed: 2026-03-27 03:13 (UTC+8))
#38269 [Bugfix][Tool Parser] Fix Kimi-K2 argument truncation when tool_call_… — bug,tool-calling — by slevental (closed: 2026-03-27 03:08 (UTC+8))
#38267 tests(network_utils): add coverage for late-binding ZMQ helpers (#28498) — performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by loriscience (closed: 2026-03-27 03:08 (UTC+8))
#38185 [Bugfix][Backport] Backport PR #34934 to v0.13.0: Fix typos — bug,documentation,performance,rocm,ci/build,v1,tool-calling,qwen,cpu,kv-connector — by khairulkabir1661 (closed: 2026-03-26 15:58 (UTC+8))
#38180 [Bugfix][Backport][ROCm] Backport PR #31380 to v0.13.0: Fix Qwen3-Next inference with non-standard block size (544) — bug,rocm,v1,qwen — by khairulkabir1661 (closed: 2026-03-26 16:06 (UTC+8))
#38146 [Bugfix][Backport][Hardware][AMD] Backport PR #31282 to v0.13.0: Fix last_page_len calculation in AITER MLA decode — bug,rocm,v1 — by khairulkabir1661 (closed: 2026-03-26 16:06 (UTC+8))
#38145 [Bugfix][Backport] Backport PR #31816 to v0.13.0: Fix ROCM_AITER_TRITON_MLA accuracy for DeepSeek-V3 — bug,rocm,v1,deepseek — by khairulkabir1661 (closed: 2026-03-26 16:06 (UTC+8))
#38229 Revert “Various Transformers v5 fixes” (#38127) — deepseek — by zhewenl (closed: 2026-03-26 23:30 (UTC+8))
#38243 [Bugfix][Frontend] Return 400 for corrupt/truncated image inputs instead of 500 — bug,multi-modality — by aliialsaeedii (closed: 2026-03-26 23:54 (UTC+8))
#36797 [Bugfix][Core] Allow multi-dtype MambaSpec KV cache spec — bug,documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1 — by MohdElgaar (closed: 2026-03-26 21:43 (UTC+8))
#27881 Adding render group to docker container — rocm,needs-rebase,ci/build — by dhonnappa-amd (closed: 2026-03-26 21:34 (UTC+8))
#37088 [Bugfix] Add FLA_USE_TMA env var to disable TMA in FLA ops (#36973) — bug,v1,qwen — by haosdent (closed: 2026-03-26 18:51 (UTC+8))
#28614 [Optimization] Fix performance regression for text-only inputs to MM models — ready,needs-rebase,multi-modality,qwen,deepseek — by DarkLight1337 (closed: 2026-03-26 16:47 (UTC+8))
#37915 [Bugfix][Frontend] Pass default_chat_template_kwargs to Anthropic endpoint — bug,frontend,ready — by vinnybad (closed: 2026-03-26 14:54 (UTC+8))
#36958 [Bugfix] Fix NIXL MLA notification request ID mismatch causing prefill KV cache leak — bug,v1,kv-connector — by wz1qqx (closed: 2026-03-26 14:42 (UTC+8))
#37954 [Parser] Pass tools via ToolParser.init instead of reading from request — frontend,needs-rebase,tool-calling,qwen,deepseek — by sfeng33 (closed: 2026-03-26 12:29 (UTC+8))