vLLM Development Activity Report - 2026-03-27
Time window: 2026-03-27 11:31 (UTC+8) to 2026-03-28 11:31 (UTC+8). Statistics: 18 new issues | 11 issues closed | 83 new PRs | 27 PRs merged | 29 PRs closed without merging
📊 Daily Development Summary
During this period (2026-03-27 to 2026-03-28), vLLM maintained a high development pace: 83 new PRs, 18 new issues, and 27 PRs merged. Core progress centered on performance optimization and bug fixes in the AMD/ROCm ecosystem, large-scale compatibility work for the Transformers v5 upgrade, and continued improvements to multimodal features and cache management. Community collaboration was active, with several contributors claiming the "good first issue" tasks broken out for the Transformers v5 migration.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was very high this period, concentrated on performance optimization, build-system fixes, and documentation updates.
- New issues (performance):
  - #38406 [Bug]: vllm 0.18 kimi k2.5 way worse than h200 single node: user functionstackx (a frequent filer of AMD-related reports) notes that even with VLLM_ROCM_USE_AITER=1 enabled, Kimi K2.5 on an MI325X still performs markedly worse than on an NVIDIA H200. The issue has drawn attention from AMD engineers (@chunfangamd, @andyluo7, @hongxiayang) and the community and is a current focus of ROCm performance tuning.
  - #38303 [Bug]: minimax nvfp4 model crash: the body is truncated, but the title suggests an NVFP4-quantized model, possibly related to AMD's Quark quantization toolchain.
- New and merged PRs:
  - #38367 (merged) [ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips: submitted by AMD engineer hongxiayang; adds instructions for the ROCm nightly Docker images to the official docs, improving the user experience.
  - #38365 [ROCm] patch benchmark_moe: patches benchmark_moe.py for compatibility with ray==2.54.0, which no longer uses the ROCR_VISIBLE_DEVICES environment variable.
  - #38346 [ROCM] Optmize redudent d2d copy of moe.: optimizes the MoE kernel output path for MiniMax M2.5 on MI355, eliminating redundant device-to-device (D2D) copies; a deep optimization targeting specific AMD hardware and a specific model.
  - #38337 [ROCm][Build] Fix pip install detection when build isolation installs CUDA PyTorch: fixes misdetection when building with pip on ROCm systems where the build-isolation environment installs the CUDA build of PyTorch.
  - #38334 [ROCm] Use Triton attention fallback for ViT to avoid SDPA OOM: adds a Triton fallback path for the Vision Transformer attention backend to avoid OOM when running multimodal models on ROCm devices without efficient SDPA support (e.g. gfx906).
  - #38413 [ROCm] [Release] Update ROCm variant from rocm700 to rocm721: bumps the base image from ROCm 7.0.0 to 7.2.1, a follow-up to #38252.
  - #38371 Enable building MoRI with AMD AINIC stack: updates the ROCm base-image build to optionally build MORI with the AMD AINIC (Pensando) NIC stack, targeting disaggregated/KV-offload scenarios.
  - #38396 [AMD][CI] Update DeepEP branch: updates the DeepEP branch to precompile correctly for gfx942/gfx950 and moves the test cases to MI325.
  - #38415 [ROCm][CI] Fix UV install in Dockerfile.rocm to detect curl failures and retry: fixes the UV install script in the Docker build so that curl failures are detected and retried.
- Other related PRs:
  - #38317 [ROCm][CI] Enable hybrid chunked prefill test: enables the hybrid chunked-prefill test in ROCm CI, removing the previous CUDA-only skip marker.
  - #38381 [ROCm][CI] Pin test_hybrid test to TRITON_ATTN on ROCm: pins the test_hybrid.py tests to the TRITON_ATTN backend on ROCm to reduce flakiness caused by batching differences.
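The OOM that #38334 works around follows from simple arithmetic: PyTorch's SDPA "math" backend materializes the full attention-score matrix, so memory grows quadratically with the ViT sequence length. A back-of-envelope sketch (the patch count and head count below are illustrative assumptions, not figures from the PR):

```python
def attn_matrix_bytes(seq_len: int, num_heads: int, bytes_per_el: int = 2) -> int:
    """Memory for the N x N attention-score matrix that the SDPA "math"
    backend materializes (per layer, fp16/bf16 by default)."""
    return seq_len * seq_len * num_heads * bytes_per_el

# A hypothetical high-resolution image tiled into 16384 patches with 16 heads
# needs 16384 * 16384 * 16 * 2 bytes = 8 GiB for the scores alone, which is
# why a fused (e.g. Triton) kernel that never materializes N x N is needed.
print(attn_matrix_bytes(16384, 16) / 2**30)  # 8.0
```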
Analysis: the AMD team (identifiable by the -amd username suffix) invested heavily this period, covering performance tuning (#38406, #38346), build-system robustness (#38337, #38415), CI/CD (#38317, #38381), and user experience (#38367). The emphasis on MI300-series (MI325, MI355) performance and on new hardware (AINIC) support shows AMD working to build a complete, high-performance software ecosystem on vLLM.
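The redundant-copy pattern removed in #38346 is generic: a producer computes into a scratch tensor that is then device-to-device copied into a preallocated destination. A minimal PyTorch sketch of the before/after shape of that pattern (hypothetical helper names; the actual MoE kernel plumbing in the PR differs):

```python
import torch

def gather_expert_outputs_with_copy(parts: list[torch.Tensor],
                                    out: torch.Tensor) -> torch.Tensor:
    tmp = torch.cat(parts)   # extra allocation for an intermediate buffer
    out.copy_(tmp)           # redundant D2D copy into the real destination
    return out

def gather_expert_outputs_in_place(parts: list[torch.Tensor],
                                   out: torch.Tensor) -> torch.Tensor:
    torch.cat(parts, out=out)  # producer writes straight into the destination
    return out
```

Both produce identical results; the second avoids one allocation and one copy per MoE layer, which is the kind of saving that shows up end-to-end in bandwidth-bound inference.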
💬 High-Engagement Discussions
- Issue #38374: IPC update_weights (checkpoint format): hot-swapped weights can diverge from cold load
  - Core issue: after hot-updating model weights via IPC (CUDA inter-process communication) in checkpoint format, inference results diverge from those of a server cold-started from the target checkpoint.
  - Discussion and progress: reporter kimihailv provided detailed reproduction steps and output comparisons showing the hot-swapped model behaving closer to the original model than to the target. The issue was closed about 14 minutes after creation, suggesting either a quickly diagnosed bug or one already resolved by another PR.
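A divergence report like this can be screened for mechanically: run the same input through the hot-swapped model and through a model cold-loaded from the target checkpoint, and compare outputs. A hypothetical sketch of such a check (not the reproduction script from the issue):

```python
import torch

def weights_diverge(hot_model: torch.nn.Module,
                    cold_model: torch.nn.Module,
                    sample: torch.Tensor,
                    atol: float = 1e-5) -> bool:
    """True if the hot-swapped model's output no longer matches a model
    cold-loaded from the target checkpoint on the same input."""
    with torch.no_grad():
        return not torch.allclose(hot_model(sample), cold_model(sample), atol=atol)
```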
- Issue #38375: IndexError when --renderer-num-workers + --mm-processor-cache-type shm
  - Core issue: multimodal renderer workers (--renderer-num-workers > 1) are incompatible with the shared-memory multimodal cache (--mm-processor-cache-type shm), causing an IndexError under high concurrency.
  - Discussion and debate: scyyh11 quickly identified the root cause: the SHM cache design assumes a single writer, while multiple renderer workers introduce concurrent writes. He proposed a workaround (set --renderer-num-workers=1) and asked @DarkLight1337, author of the original PR (#34789), whether the renderer_num_workers option should be dropped or the SHM cache path itself fixed. The debate centers on the trade-off between keeping the feature and changing the design.
  - Status: awaiting a maintainer decision.
- Issue #38382: [Transformers v5] Mistral multimodal models
  - Core issue: after the upgrade to Transformers v5.4, Mistral multimodal models (Pixtral, Voxtral) fail with an IndexError due to a processor refactor.
  - Discussion and progress: hmellor noted that PR #38018 was developed against v5.3, while v5.4 refactored the processors. allgather volunteered to take the issue, an example of the broad downstream impact of large upstream dependency upgrades.
  - Status: allgather has opened fix PR #38410.
- Issue #38387: [Transformers v5] HCXVisionForCausalLM
  - Core issue: the Transformers v5 upgrade breaks HCXVisionForCausalLM initialization (AttributeError: 'HCXVisionConfig' object has no attribute 'text_config').
  - Discussion and progress: HanFa pointed to a possibly related Transformers PR. hmellor suspects a defect in the model's own config code and encouraged HanFa to attempt a fix, illustrating how the community absorbs upstream changes collaboratively.
  - Status: claimed by HanFa.
- PR #38315: [Perf][GDN] Fuse kkt + solve_tril kernels in chunk forward
  - Core issue: fuse the KKT and solve_tril kernels in the Gated Delta Network (GDN) chunked forward pass to improve prefill performance.
  - Discussion and debate: the PR shows a clear TTFT improvement for Qwen3.5-397B on H200. However, contributor Nepherpitou reported corrupted output (!!!!!!!!!!!!!) when testing a Qwen 122B AWQ model on 4x3090. Author ZJY0516 could not reproduce the failure and requested more details. The debate centers on correctness and on generalization across hardware and models.
  - Status: has merge conflicts; pending resolution and further accuracy validation.
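For context, the two operations being fused compute a chunk-local triangular system. A plain PyTorch reference of the underlying math (an illustrative sketch under assumed shapes; the PR's Triton kernels and the full gated delta rule involve more terms):

```python
import torch

def chunk_forward_reference(k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Unfused reference: a 'kkt'-style step building a strictly lower-
    triangular K @ K^T for one chunk, then a 'solve_tril'-style step solving
    the triangular system (I + A) x = v without materializing an inverse."""
    n = k.shape[0]
    a = torch.tril(k @ k.T, diagonal=-1)                       # kkt step
    lhs = torch.eye(n, dtype=k.dtype) + a
    return torch.linalg.solve_triangular(lhs, v, upper=False)  # solve_tril step
```

Fusing the two launches avoids writing the n x n intermediate `a` out to global memory between kernels, which is where a prefill (TTFT) win would come from.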
🔥 Hot Topics and Trends
- Transformers v5 upgrade wave: the most prominent trend of the period. hmellor created an umbrella issue (#38379) and split off multiple sub-issues (e.g. #38386, #38387, #38389), attracting many community contributors (kasha01, HanFa, amitmodi, and others). The problems span tokenizer configuration, model initialization, and processor compatibility: a systematic engineering migration.
- Multimodal features deepening and exposing problems: several issues (#38375, #38382, #38385) and PRs (#38330, #38405) touch multimodality, covering cache concurrency, processor compatibility, and serialization protocols. As multimodal features see broader and more complex use, the robustness and extensibility of the underlying architecture are being put to the test.
- Ongoing performance optimization: deep optimization PRs targeting specific hardware (AMD MI355, NVIDIA Blackwell) and model architectures (GDN, MLA, MoE) keep arriving (e.g. #38346, #38354, #38315), reflecting vLLM's pursuit of peak inference efficiency.
- Stronger AMD ecosystem support: as described above, AMD-related fixes, optimizations, and documentation updates were unusually dense this period, showing AMD systematically improving vLLM's usability and performance on its hardware.
🛠️ Key Technical Changes
- PR #38346 [ROCM] Optmize redudent d2d copy of moe.: eliminates two unnecessary device-to-device (D2D) copies by optimizing the MoE kernel output path on MI355. A surgical optimization for specific hardware (MI300 series) and a specific model (MiniMax M2.5) that directly improves end-to-end throughput, illustrating the value of platform-specific tuning.
- PR #38334 [ROCm] Use Triton attention fallback for ViT to avoid SDPA OOM: adds a Triton fallback path for ViT attention on ROCm devices, avoiding the OOM caused by the PyTorch SDPA "math" backend allocating an N² attention matrix. Essential for running multimodal models reliably on AMD GPUs.
- PR #38337 [ROCm][Build] Fix pip install detection…: fixes a subtle but high-impact problem when building from source on ROCm systems, improving developer experience and build reliability; a key piece of AMD platform infrastructure.
- PR #34789 (merged) [Bugfix] Offload blocking tokenizer ops to shared thread pool: offloads blocking tokenizer and multimodal preprocessing work to a shared thread pool, fixing unresponsive management endpoints (e.g. /health) caused by a blocked event loop under high concurrency. An important improvement to server stability and observability.
- PR #37975 (merged) [Model] Extract GatedDeltaNetAttention into shared layer: refactors the GDN layer shared by Qwen3Next and Qwen3.5 into a standalone module, improving code reuse and maintainability and laying groundwork for custom operator support on other platforms (e.g. XPU, NPU).
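The pattern behind #34789 is the standard asyncio remedy for blocking work: push it onto an executor so the event loop keeps serving other coroutines. A minimal sketch with a stand-in tokenize function (hypothetical names; vLLM's actual code paths differ):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_tokenize(text: str) -> list[str]:
    # Stand-in for a CPU-heavy tokenizer / multimodal preprocessing call.
    return text.split()

_pool = ThreadPoolExecutor(max_workers=4)  # shared pool, created once

async def tokenize(text: str) -> list[str]:
    # Offload to the pool so coroutines like a /health handler are not
    # starved while tokenization runs.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_pool, blocking_tokenize, text)
```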
📈 Development Activity Observations
- High-intensity development: 83 new PRs in a single day reflects very high activity and community engagement.
- Efficient collaboration and merging: several PRs were merged the same day they were opened (e.g. #38367, #38322), showing an efficient review-and-merge pipeline among core maintainers.
- Community-driven work at scale: the Transformers v5 migration, via decomposed issues and the good first issue label, drew in multiple new contributors (kasha01, dharun4772, guanwei-wu, and others): a successful example of community coordination.
- Deep AMD involvement: AMD engineers not only submit code but also engage with users in issues (e.g. #38406), forming a healthy feedback loop.
💡 Issues Worth Watching
- AMD MI325X performance gap: the unresolved gap between Kimi K2.5 on MI325X and on H200 reported in issue #38406 may be a key competitiveness problem for AMD data-center GPUs and warrants sustained attention and tuning from the AMD team.
- Ripple effects of the Transformers v5 upgrade: although the community is actively fixing breakage, the compatibility surface exposed by this upgrade is broad and deep. Watch for further hidden issues and for any impact on the stability of models deployed in production.
- Fundamental tension between multimodal caching and concurrency: issue #38375 reveals a design conflict between the current multimodal cache and concurrent preprocessing. Whether the eventual fix restricts the feature or redesigns the cache will determine the concurrency ceiling of multimodal serving.
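The conflict in #38375 can be pictured with a toy model of a cache whose free-slot bookkeeping assumes exactly one writer (hypothetical code, not vLLM's implementation):

```python
class SingleWriterShmCache:
    """Toy single-writer cache: a non-atomic check-then-pop on the free list
    is safe with one writer process but racy with several, mirroring the
    single-writer assumption behind the shm processor cache."""

    def __init__(self, num_slots: int):
        self.free_slots = list(range(num_slots))
        self.store: dict = {}

    def put(self, key, value) -> int:
        # With N renderer workers, two writers can race here and one of them
        # pops from an already-emptied free list -> IndexError, matching the
        # reported symptom.
        slot = self.free_slots.pop()
        self.store[key] = (slot, value)
        return slot
```

Either candidate fix restores a single logical writer (forcing --renderer-num-workers=1) or makes slot allocation atomic across writers; that is the design decision the issue leaves to the maintainers.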
📋 Appendix: Detailed Data
New Issues
- #38386 [Transformers v5] Base model and LoRA used in test has incorrect tokenizer_config.json — help wanted, good first issue — by hmellor (created: 2026-03-28 02:43 (UTC+8))
- #38384 [Transformers v5] Distributed shutdown test timtout — help wanted, good first issue — by hmellor (created: 2026-03-28 02:36 (UTC+8))
- #38385 [Transformers v5] MiniCPMV cannot apply processor — help wanted, good first issue — by hmellor (created: 2026-03-28 02:40 (UTC+8))
- #38411 [Bug]: Vision encoder crashes on SM100 (Jetson Thor) — _vllm_fa2_C compiled for SM80-only, no override available for vision encoder — bug — by mrjbj (created: 2026-03-28 10:06 (UTC+8))
- #38406 [Bug]: vllm 0.18 kimi k2.5 way worse than h200 single node — bug, rocm — by functionstackx (created: 2026-03-28 07:34 (UTC+8))
- #38382 [Transformers v5] Mistral multimodal models — help wanted, good first issue — by hmellor (created: 2026-03-28 02:30 (UTC+8))
- #38375 [Bug]: IndexError when --renderer-num-workers + --mm-processor-cache-type shm — bug — by cjackal (created: 2026-03-28 00:22 (UTC+8))
- #38387 [Transformers v5] HCXVisionForCausalLM — help wanted, good first issue — by hmellor (created: 2026-03-28 02:52 (UTC+8))
- #38389 [Transformers v5] IsaacForConditionalGeneration — help wanted, good first issue — by hmellor (created: 2026-03-28 02:55 (UTC+8))
- #38379 Upgrade to Transformers v5 — help wanted — by hmellor (created: 2026-03-28 02:19 (UTC+8))
- #38339 [Bug] Step-3.5-Flash MTP Speculative Decoding Has Extremely Low Acceptance Rate (2.4%-4.6%) — bug, speculative-decoding — by elinx (created: 2026-03-27 15:49 (UTC+8))
- #38376 [Bug]: glm 4.7 fp8 crashes (Worker_TP3 pid=457501) ERROR 03-27 17:11:15 [multiproc_executor.py:852] AttributeError: '_OpNamespace' '_C' object has no attribute 'per_token_group_fp8_quant' — bug — by koush (created: 2026-03-28 01:33 (UTC+8))
- #38348 [Bug]: chunked prefill can't be disabled. maybe due to race condition in scheduler — bug — by Dingjifeng (created: 2026-03-27 17:22 (UTC+8))
- #38374 [Bug]: IPC update_weights (checkpoint format): hot-swapped weights can diverge from cold load of target checkpoint — bug — by kimihailv (created: 2026-03-28 00:19 (UTC+8))
- #38351 [Bug]: When use_audio_in_video is enabled in qwen3-omni, the output may exhibit issues such as empty or repetitive output. — bug — by wwxs123W (created: 2026-03-27 18:32 (UTC+8))
- #38349 [Bug]: Garbled inference output when launching vLLM via swift rollout — bug — by Aristotle-wfh (created: 2026-03-27 17:50 (UTC+8))
- #38347 [BUG]: Python API stuck after long time running — usage — by sweihub (created: 2026-03-27 17:13 (UTC+8))
- #38331 [Usage]: for the tpye: UniformTypeKVCacheSpecs, the calc of num_blocks = available_memory // kv_cache_groups[0].kv_cache_spec.page_size_bytes, no need to divide num_layers? — usage — by yangshanjun (created: 2026-03-27 15:05 (UTC+8))
Closed Issues
- #38303 [Bug]: minimax nvfp4 model crash — bug — by functionstackx (closed: 2026-03-28 10:28 (UTC+8))
- #4763 [Feature]: Support W4A8KV4 Quantization(QServe/QoQ) — feature request, stale — by bratao (closed: 2026-03-28 10:14 (UTC+8))
- #38304 [Feature]: vllm.ai docs should should instructions for rocm nightly docker & rocm nightly wheel — feature request, rocm — by functionstackx (closed: 2026-03-28 09:22 (UTC+8))
- #33560 [Bug]: lm-eval shows significant accuracy differences on RedHatAI/Qwen3-8B-NVFP4 model (Turing vs. Ampere) — bug — by ir1ka (closed: 2026-03-28 07:36 (UTC+8))
- #38348 [Bug]: chunked prefill can't be disabled. maybe due to race condition in scheduler — bug — by Dingjifeng (closed: 2026-03-28 00:43 (UTC+8))
- #38374 [Bug]: IPC update_weights (checkpoint format): hot-swapped weights can diverge from cold load of target checkpoint — bug — by kimihailv (closed: 2026-03-28 00:33 (UTC+8))
- #32932 [Bug][cpu][arm]: Failure case for BF16 dispatch on non-bf16 supported arm HW — bug, cpu — by gassan-arm (closed: 2026-03-27 17:45 (UTC+8))
- #38347 [BUG]: Python API stuck after long time running — usage — by sweihub (closed: 2026-03-27 17:43 (UTC+8))
- #38245 [Bug]: Responses API text.format.type="json_schema" leaks schema_ in non-stream responses and breaks streaming — bug — by noobHappylife (closed: 2026-03-27 13:58 (UTC+8))
- #38201 [Feature]: Support TurboQuant KV quant — feature request — by NiuBlibing (closed: 2026-03-27 13:21 (UTC+8))
- #31450 [Feature]: vLLM should apply to Docker Open Source Program for removing image pull limits — feature request — by ehfd (closed: 2026-03-27 13:08 (UTC+8))
New PRs
- #38373 [torch.compile]: Disable Sequence Parallelism (SP) for piecewise compilation — no labels — by SouthWest7 (created: 2026-03-27 23:51 (UTC+8))
- #38413 [ROCm] [Release] Update ROCm variant from rocm700 to rocm721 — rocm, ready, ci/build — by tjtanaa (created: 2026-03-28 10:27 (UTC+8))
- #38414 [Test] Fix flaky race condition in test_abort_final_step — v1 — by yzong-rh (created: 2026-03-28 10:57 (UTC+8))
- #38415 [ROCm][CI] Fix UV install in Dockerfile.rocm to detect curl failures and retry — rocm, ci/build — by AndreasKaratzas (created: 2026-03-28 10:58 (UTC+8))
- #38412 docs: Add build guide for NVIDIA Blackwell consumer GPUs (RTX 50-series) on Windows/WSL2 — documentation, nvidia — by nguyennhuanhle (created: 2026-03-28 10:21 (UTC+8))
- #38410 [Transformers v5] fix missing pixtral/voxtral multimodal dispatch — no labels — by allgather (created: 2026-03-28 09:59 (UTC+8))
- #38403 [CI][Eval] Lower Nemotron-3-Super-120B-A12B-BF16 GSM8K accuracy threshold to 0.91 — ready — by SandishKumarHN (created: 2026-03-28 06:23 (UTC+8))
- #38365 [ROCm] patch benchmark_moe — performance, rocm — by big-yellow-duck (created: 2026-03-27 22:28 (UTC+8))
- #38364 [Bugfix] Fix incorrect KV cache log output — bug, v1 — by WangErXiao (created: 2026-03-27 22:13 (UTC+8))
- #38408 Log KV cache GiB usage and warn when max_num_seqs exceeds capacity — v1 — by ashishkamra (created: 2026-03-28 07:58 (UTC+8))
- #38409 [Compile] Replace split_module with lightweight vllm-specific graph splitter — no labels — by fxdawnn (created: 2026-03-28 08:10 (UTC+8))
- #38395 Fix/mla kv offloading — v1, kv-connector — by rarepepi (created: 2026-03-28 04:59 (UTC+8))
- #38407 [Tests] Fix Transformers Nightly CI: use check_max_version=False in test_transformers.py — no labels — by SandishKumarHN (created: 2026-03-28 07:36 (UTC+8))
- #38325 [Kernel] Add swapAB support for SM120 CUTLASS blockwise FP8 GEMM — performance, ready, nvidia — by Nekofish-L (created: 2026-03-27 14:11 (UTC+8))
- #38399 [WIP][Model Runner V2] Implement flash sampling for the draft model — v1 — by TheEpicDolphin (created: 2026-03-28 05:42 (UTC+8))
- #38367 [ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips — documentation, rocm, ready — by hongxiayang (created: 2026-03-27 22:45 (UTC+8))
- #38404 [Bugfix] Fix platform detection crash when vllm is not installed as a package — bug — by SandishKumarHN (created: 2026-03-28 06:25 (UTC+8))
- #38402 [Tests][Bugfix] Fix race condition in test_abort_final_step.py — bug, v1 — by SandishKumarHN (created: 2026-03-28 06:19 (UTC+8))
- #38400 [CI/Build] Fix GH200 build: add --torch-only flag to use_existing_torch.py to preserve torchvision dependency — ci/build — by SandishKumarHN (created: 2026-03-28 06:06 (UTC+8))
- #38405 [Frontend] Add multimodal support to /inference/v1/generate endpoint — documentation, frontend, kv-connector — by nithinvc (created: 2026-03-28 07:04 (UTC+8))
- #38401 [Tests] Fix Transformers Nightly CI: use check_max_version=False in test_transformers.py — no labels — by SandishKumarHN (created: 2026-03-28 06:15 (UTC+8))
- #38360 [compile] Bug fix for _decompose_size_nodes — bug, ready, ready-run-all-tests — by anijain2305 (created: 2026-03-27 21:57 (UTC+8))
- #38397 feat: Added Dockerfile for ppc64le CUDA build — ci/build, nvidia — by Okoyl (created: 2026-03-28 05:18 (UTC+8))
- #38398 [Bugfix] Restore prepare_fp8_layer_for_marlin removed by merge conflict resolution — bug, ready — by vadiklyutiy (created: 2026-03-28 05:28 (UTC+8))
- #38394 [Chore] Remove unused logits_process.py — ready — by WoosukKwon (created: 2026-03-28 04:58 (UTC+8))
- #38361 [GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill — needs-rebase, v1 — by arpera (created: 2026-03-27 21:58 (UTC+8))
- #38396 [AMD][CI] Update DeepEP branch — rocm, ci/build — by rjrock (created: 2026-03-28 05:04 (UTC+8))
- #38369 [CI] Skip failing test — ready, multi-modality — by NickLucche (created: 2026-03-27 22:57 (UTC+8))
- #38378 [Feature] Kvcache per token — documentation, v1 — by JartX (created: 2026-03-28 02:10 (UTC+8))
- #38393 [BugFix]: use base address for zero-token experts in CUTLASS grouped GEMM to avoid over by 1 — bug, nvidia — by LucasWilkinson (created: 2026-03-28 04:16 (UTC+8))
- #38381 [ROCm][CI] Pin test_hybrid test to TRITON_ATTN on ROCm — rocm — by micah-wil (created: 2026-03-28 02:29 (UTC+8))
- #38329 [MoE] Add RoutingMethodType.Simulated to TRT-LLM FP8/NVFP4 kernel allowlists — ready, nvidia — by jaewonlee-fb (created: 2026-03-27 14:41 (UTC+8))
- #38392 [Bugfix] Fix dead-code validation bypass in structured_outputs + named tool_choice — bug, frontend, tool-calling — by yzong-rh (created: 2026-03-28 03:30 (UTC+8))
- #38390 [Model Runner v2] E/P/D disaggregation support — ready, v1, kv-connector — by yewentao256 (created: 2026-03-28 03:24 (UTC+8))
- #38391 [CI Bugfix] Pre-download missing FlashInfer headers in Docker build — bug, ready, ci/build, ci-failure — by mgoin (created: 2026-03-28 03:28 (UTC+8))
- #38358 Revert "[CI] Add batch invariant test for b200" (#38014) — ci/build — by zhewenl (created: 2026-03-27 20:06 (UTC+8))
- #38370 [CI] Fix BatchedGemmEnums.h not found CI issue — ready, ci/build, nvidia — by yewentao256 (created: 2026-03-27 22:57 (UTC+8))
- #38380 Add short flag -sc for --speculative-config argument — no labels — by mgoin (created: 2026-03-28 02:22 (UTC+8))
- #38388 [Multimodal] Fix nested_tensors_equal: add length check for lists and tuple support — multi-modality — by khairulkabir1661 (created: 2026-03-28 02:55 (UTC+8))
- #38383 [Refactor] Remove dead code in kv connector and model runner — intel-gpu, ready, v1, cpu, kv-connector — by yewentao256 (created: 2026-03-28 02:33 (UTC+8))
- #38362 [BugFix][Frontend] apply task instruction as system prompt in cohere v2/embed — bug, frontend — by walterbm (created: 2026-03-27 22:08 (UTC+8))
- #38366 [Bug Fix][CPU Profiling] Add CPU profiler summary file output on rank 0 — bug — by Elm8116 (created: 2026-03-27 22:42 (UTC+8))
- #38318 fix(harmony): run render_for_completion in thread pool to unblock event loop — frontend, gpt-oss — by ztimothy96 (created: 2026-03-27 11:58 (UTC+8))
- #38377 [Bugfix]: fix blind access to per_token_group_fp8_quant causing GLM 4.7 crash on RTX 6000 pro — bug — by koush (created: 2026-03-28 01:34 (UTC+8))
- #38321 [ROCm] Fix ROCM_ATTN cross-attention for beam search encoder-decoder models — rocm, intel-gpu, ready, v1 — by AndreasKaratzas (created: 2026-03-27 13:06 (UTC+8))
- #38317 [ROCm][CI] Enable hybrid chunked prefill test — rocm, ready, ci/build, v1 — by AndreasKaratzas (created: 2026-03-27 11:54 (UTC+8))
- #38355 feat: add TQMediaConnector plugin for TransferQueue image data support — multi-modality — by RobotGF (created: 2026-03-27 19:58 (UTC+8))
- #38372 [Hybrid] Simplify accepted token counting in spec decode for hybrid models — v1 — by fuscof-ibm (created: 2026-03-27 23:46 (UTC+8))
- #38357 Revert "[Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell" (#38083) — bug, qwen — by zhewenl (created: 2026-03-27 20:06 (UTC+8))
- #38350 Remove need for explicit \n in docstring lists for --help formatting — ready — by hmellor (created: 2026-03-27 18:09 (UTC+8))
- #38371 Enable building MoRI with AMD AINIC stack — rocm, ci/build — by ichbinblau (created: 2026-03-27 23:05 (UTC+8))
- #38346 [ROCM] Optmize redudent d2d copy of moe. — rocm — by benenzhu (created: 2026-03-27 17:11 (UTC+8))
- #38368 [FlashLinearAttention] reduce recompilations by removing unused triton kernel inputs — no labels — by lgeiger (created: 2026-03-27 22:54 (UTC+8))
- #38363 [quantization] Further qr — no labels — by lixixicommute (created: 2026-03-27 22:09 (UTC+8))
- #38359 [Bugfix] Revert "Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding" — bug, ready, v1, nvidia — by elvircrn (created: 2026-03-27 20:18 (UTC+8))
- #38322 [CI/Build] Move nightly wheel index generation to a single post-build step — ready, ci/build — by Harry-Chen (created: 2026-03-27 13:22 (UTC+8))
- #38356 [CI] don't merge modify cases XPU — intel-gpu, ci/build — by wendyliu235 (created: 2026-03-27 20:03 (UTC+8))
- #38345 [CI] don't mergify label Intel — intel-gpu, ci/build — by wendyliu235 (created: 2026-03-27 17:06 (UTC+8))
- #38353 [Bugfix] Fix Kimi K2 tool parser dropping content before tool section — bug, tool-calling — by felixmr1 (created: 2026-03-27 19:49 (UTC+8))
- #38354 [Performance] MLA decode pre-attention fused kernels for SM100 — ci/build, v1, deepseek, nvidia — by elvircrn (created: 2026-03-27 19:55 (UTC+8))
- #38352 [Bugfix] Fix int64 expert IDs in routing simulator crashing flashinfer all2all — bug, nvidia — by elvircrn (created: 2026-03-27 19:37 (UTC+8))
- #38342 [XPU] transpose moe weights — intel-gpu — by mayuyuace (created: 2026-03-27 16:38 (UTC+8))
- #38332 MiniMax-M2: add Eagle3 speculative decoding support — new-model — by liuchenbing2026 (created: 2026-03-27 15:14 (UTC+8))
- #38315 [Perf][GDN] Fuse kkt + solve_tril kernels in chunk forward — no labels — by ZJY0516 (created: 2026-03-27 11:35 (UTC+8))
- #38344 [CI] don't merge for very — ci/build — by wendyliu235 (created: 2026-03-27 16:56 (UTC+8))
- #38341 [XPU] fix lora uts — intel-gpu, gpt-oss — by chaojun-zhang (created: 2026-03-27 16:16 (UTC+8))
- #38343 [Model] Sync upstream BT=chunk_size fix for GDN chunk_fwd_kernel_o, simplify warmup to single pass — no labels — by AuYang261 (created: 2026-03-27 16:47 (UTC+8))
- #38338 [Kernel] Add Triton reference implementation of NVFP4 GEMM — performance — by Drogon4231 (created: 2026-03-27 15:35 (UTC+8))
- #38340 [Doc] fix selector/label mismatch in helm chart — documentation — by utsumi-fj (created: 2026-03-27 16:07 (UTC+8))
- #38328 [Doc] Clarify Helm chart location in deployment guide — documentation, ready — by utsumi-fj (created: 2026-03-27 14:20 (UTC+8))
- #38337 [ROCm][Build] Fix pip install detection when build isolation installs CUDA PyTorch — rocm, ci/build, nvidia — by westers (created: 2026-03-27 15:33 (UTC+8))
- #38336 [Bugfix] Detect missing shard files in quantized checkpoints — bug — by westers (created: 2026-03-27 15:31 (UTC+8))
- #38335 [ROCm][CI] Fix spawn test decorator silently skipping tests without main — rocm — by westers (created: 2026-03-27 15:30 (UTC+8))
- #38334 [ROCm] Use Triton attention fallback for ViT to avoid SDPA OOM — rocm — by westers (created: 2026-03-27 15:21 (UTC+8))
- #38333 feat(grpc): add periodic stats logging and servicer log forwarding — frontend — by CatherineSue (created: 2026-03-27 15:19 (UTC+8))
- #38330 [Core] Add score encoder cache manager — v1 — by hotTea123 (created: 2026-03-27 14:56 (UTC+8))
- #38326 [Bugfix] Fix N-gram proposer JIT warmup that never triggers compilation — bug, speculative-decoding, v1 — by redwrasse (created: 2026-03-27 14:15 (UTC+8))
- #38327 [cpu] Fix chained comparisons in static_assert breaking macOS builds — cpu — by openasocket (created: 2026-03-27 14:17 (UTC+8))
- #38324 [Bugfix] Preserve torch profiler summary output on CPU — bug, documentation, v1, cpu — by jacobzhang22 (created: 2026-03-27 14:06 (UTC+8))
- #38316 [XPU] Add per-channel quantized model in compressed-tensors — intel-gpu — by yma11 (created: 2026-03-27 11:43 (UTC+8))
- #38320 [CI] Add xpu auto-label rule for Intel GPU/XPU PRs — ready, ci/build — by wendyliu235 (created: 2026-03-27 12:32 (UTC+8))
- #38323 [Bugfix] Fix Helm chart Deployment using hardcoded labels — bug, documentation — by simpx (created: 2026-03-27 13:42 (UTC+8))
- #38319 [Security] Fix CVE-2026-4944: respect trust_remote_code in NemotronVL and KimiK25 — no labels — by Wernerina (created: 2026-03-27 12:25 (UTC+8))
Merged PRs
- #38252 [ROCm][CI/Build] ROCm 7.2.1 release version; torch 2.10; triton 3.6 — rocm, ready, ci/build — by gshtras (merged: 2026-03-28 07:03 (UTC+8))
- #38126 [NVIDIA] Fix DGX Spark logic — ready, ci/build, nvidia — by johnnynunez (merged: 2026-03-28 06:26 (UTC+8))
- #33972 [Bugfix] fix output Nan/Inf in marlin if dtype=float16 — bug, ready — by ir1ka (merged: 2026-03-28 07:36 (UTC+8))
- #37695 [Perf] Use torch compile to fuse pack topk in trtllm moe — performance, ready, nvidia — by wzhao18 (merged: 2026-03-28 07:30 (UTC+8))
- #31201 Add nvidia h800 moe config — stale, nvidia — by lengrongfu (merged: 2026-03-28 07:28 (UTC+8))
- #34789 [Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop — bug, ready, v1 — by scyyh11 (merged: 2026-03-27 13:17 (UTC+8))
- #38367 [ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips — documentation, rocm, ready — by hongxiayang (merged: 2026-03-28 07:20 (UTC+8))
- #38311 [Model Runner V2] Rebuild attention metadata before eagle decode full… — ready, v1 — by TheEpicDolphin (merged: 2026-03-28 04:46 (UTC+8))
- #38369 [CI] Skip failing test — ready, multi-modality — by NickLucche (merged: 2026-03-28 04:25 (UTC+8))
- #38032 [QeRL] Compose online quantization with quantized reloading — ready, quantization — by kylesayrs (merged: 2026-03-28 04:22 (UTC+8))
- #38380 Add short flag -sc for --speculative-config argument — no labels — by mgoin (merged: 2026-03-28 03:04 (UTC+8))
- #37453 [ROCm] Fix GPT-OSS import for triton 3.6 — rocm, ready, gpt-oss — by gshtras (merged: 2026-03-28 02:00 (UTC+8))
- #38043 [ROCm]: gpt-oss fusion/padding fixes — rocm, ready, ci/build, gpt-oss — by Rohan138 (merged: 2026-03-28 00:19 (UTC+8))
- #38350 Remove need for explicit \n in docstring lists for --help formatting — ready — by hmellor (merged: 2026-03-27 23:38 (UTC+8))
- #38262 [frontend] dump openai responses type by alias — frontend, ready — by cjackal (merged: 2026-03-27 13:58 (UTC+8))
- #34520 [EPLB] Cleanup the transfer logic for the various eplb maps — ready, ci/build — by SageMoore (merged: 2026-03-27 17:18 (UTC+8))
- #33695 enable skipping of SW attention layers when using FP8 KV cache — ready, quantization — by jmkuebler (merged: 2026-03-27 21:25 (UTC+8))
- #37952 fix(security): Add VLLM_MAX_N_SEQUENCES environment variable and enforce limit — documentation, ready — by jperezdealgaba (merged: 2026-03-27 21:02 (UTC+8))
- #38322 [CI/Build] Move nightly wheel index generation to a single post-build step — ready, ci/build — by Harry-Chen (merged: 2026-03-27 15:44 (UTC+8))
- #36946 [P/D] Mooncake: Add unit tests and minor fixes for mooncake connector — ready, v1, kv-connector — by dtcccc (merged: 2026-03-27 16:26 (UTC+8))
- #38328 [Doc] Clarify Helm chart location in deployment guide — documentation, ready — by utsumi-fj (merged: 2026-03-27 15:43 (UTC+8))
- #38168 [Bugfix] Fix Hermes tool parser when stream interval > 1 — bug, ready, tool-calling — by sfeng33 (merged: 2026-03-27 14:42 (UTC+8))
- #34285 [Refactor] Move FusedMoE hidden_size roundup to quant_method — rocm, ready, ci/build, gpt-oss, nvidia — by BowenBao (merged: 2026-03-27 14:38 (UTC+8))
- #38320 [CI] Add xpu auto-label rule for Intel GPU/XPU PRs — ready, ci/build — by wendyliu235 (merged: 2026-03-27 14:22 (UTC+8))
- #38219 [CPU] Support CT W4A16 on CPU MP kernel — ready, cpu — by bigPYJ1151 (merged: 2026-03-27 14:15 (UTC+8))
- #37975 [Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 — ready, qwen — by wxsIcey (merged: 2026-03-27 14:13 (UTC+8))
- #37853 [kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models — ready, v1, kv-connector — by orozery (merged: 2026-03-27 13:38 (UTC+8))
PRs Closed Without Merging
- #37789 [Perf] Fuse prefill conv output split into Q/K/V to reduce rearrange overhead — needs-rebase, qwen — by ZJY0516 (closed: 2026-03-28 10:50 (UTC+8))
- #38037 Fix IndexError in streaming tool calls when max_tokens is hit — frontend, ready — by joaquinhuigomez (closed: 2026-03-28 07:58 (UTC+8))
- #35458 [Frontend] Add multimodal support to /inference/v1/generate endpoint — documentation, performance, new-model, rocm, structured-output, frontend, intel-gpu, speculative-decoding, needs-rebase, ci/build — by nithinvc (closed: 2026-03-28 06:59 (UTC+8))
- #38401 [Tests] Fix Transformers Nightly CI: use check_max_version=False in test_transformers.py — no labels — by SandishKumarHN (closed: 2026-03-28 06:32 (UTC+8))
- #38398 [Bugfix] Restore prepare_fp8_layer_for_marlin removed by merge conflict resolution — bug, ready — by vadiklyutiy (closed: 2026-03-28 05:45 (UTC+8))
- #38394 [Chore] Remove unused logits_process.py — ready — by WoosukKwon (closed: 2026-03-28 05:36 (UTC+8))
- #37950 [Bugfix] Restore stats logging for multi-server mode — bug, frontend, v1 — by khairulkabir1661 (closed: 2026-03-28 05:28 (UTC+8))
- #34647 [ROCm] Add hardware detection for FP4 BMM to prevent MI300X crashes — rocm — by khairulkabir1661 (closed: 2026-03-28 05:28 (UTC+8))
- #38285 [AMD][Build] Test DeepEP offload — rocm, ci/build — by rjrock (closed: 2026-03-28 05:05 (UTC+8))
- #38370 [CI] Fix BatchedGemmEnums.h not found CI issue — ready, ci/build, nvidia — by yewentao256 (closed: 2026-03-28 03:39 (UTC+8))
- #34756 [ROCm] [Nightly Docker Release] nightly rocm docker — rocm, needs-rebase, ci/build — by hongxiayang (closed: 2026-03-28 03:21 (UTC+8))
- #38268 [Bugfix] Remove dead name-only regex fallback from DeepSeek V3.1 tool parser — bug, tool-calling, deepseek — by yzong-rh (closed: 2026-03-27 22:59 (UTC+8))
- #38283 [Bugfix] Remove dead unicode_escape decoding from 4 tool parsers — bug, needs-rebase, tool-calling — by yzong-rh (closed: 2026-03-27 22:59 (UTC+8))
- #37770 [Quant] Consolidate GPTQ: remove gptq.py, rename gptq_marlin.py to gptq.py — rocm, v1, ready-run-all-tests — by robertgshaw2-redhat (closed: 2026-03-27 22:28 (UTC+8))
- #25159 [Rocm] [quantization] support quark wmxfp4 for gptoss — rocm, needs-rebase, gpt-oss — by haoyangli0109 (closed: 2026-03-27 21:52 (UTC+8))
- #38356 [CI] don't merge modify cases XPU — intel-gpu, ci/build — by wendyliu235 (closed: 2026-03-27 20:16 (UTC+8))
- #38345 [CI] don't mergify label Intel — intel-gpu, ci/build — by wendyliu235 (closed: 2026-03-27 20:16 (UTC+8))
- #38332 MiniMax-M2: add Eagle3 speculative decoding support — new-model — by liuchenbing2026 (closed: 2026-03-27 19:16 (UTC+8))
- #38344 [CI] don't merge for very — ci/build — by wendyliu235 (closed: 2026-03-27 17:05 (UTC+8))
- #30377 adding constraint updates of cos-sin to improve mrope performance — needs-rebase, stale — by wujinyuan1 (closed: 2026-03-27 15:54 (UTC+8))
- #38181 [ROCm] Fix cp_mha_gather_cache for strided KV views — rocm, ready, needs-rebase, v1 — by yuttian1 (closed: 2026-03-27 15:09 (UTC+8))
- #31069 Fix LoRA prefix cache corruption by using lora_int_id — rocm, ci/build, stale, v1 — by westers (closed: 2026-03-27 15:05 (UTC+8))
- #31072 [ROCm][Test] Skip RTN quantization tests on ROCm — documentation, rocm, stale, v1 — by westers (closed: 2026-03-27 15:03 (UTC+8))
- #31067 [Bugfix] Fix incorrect tensor parallel size in Ray executor warning — bug, v1 — by westers (closed: 2026-03-27 14:57 (UTC+8))
- #31077 Fix ROCm CUDA graph replay synchronization bug (issue #29521) — bug, documentation, rocm, needs-rebase, ci/build, v1, nvidia — by westers (closed: 2026-03-27 14:57 (UTC+8))
- #38323 [Bugfix] Fix Helm chart Deployment using hardcoded labels — bug, documentation — by simpx (closed: 2026-03-27 13:44 (UTC+8))
- #37959 [Bugfix] Fix Helm chart Deployment using hardcoded labels instead of chart.labels — bug, documentation — by simpx (closed: 2026-03-27 13:43 (UTC+8))
- #38319 [Security] Fix CVE-2026-4944: respect trust_remote_code in NemotronVL and KimiK25 — no labels — by Wernerina (closed: 2026-03-27 12:25 (UTC+8))
- #33774 add support for telechat3 model — new-model — by 1096125073 (closed: 2026-03-27 11:41 (UTC+8))