vLLM Development Activity Report - 2026-03-18
Time window: 2026-03-18 11:35 (UTC+8) ~ 2026-03-19 11:35 (UTC+8). Stats: 22 new issues | 37 issues closed | 115 new PRs | 48 PRs merged | 29 PRs closed unmerged
📊 Daily Development Status Summary
vLLM development was highly active this cycle (2026-03-18 to 2026-03-19): 115 new PRs were opened and 48 merged, reflecting strong momentum in feature work and code integration. Issue triage was also efficient, with 37 issues closed. Focus areas included GPU hangs on the AMD ROCm platform, quantization compatibility fixes, and a number of performance optimizations and kernel improvements.
🎯 AMD/ROCm Ecosystem Activity
AMD-related activity was frequent this cycle, spanning bug reports, feature requests, and core fixes, underscoring sustained investment in supporting AMD hardware and its software stack.
Related Issue Analysis
- #37406: GPU hang on MI325/300 when serving MiniMax-M2.1-MXFP4 with TP=1
- Reporter: xuebwang-amd (AMD employee)
- Core problem: serving the amd/MiniMax-M2.1-MXFP4 model on MI300-series GPUs with TP=1 hangs the GPU. xuebwang-amd's initial analysis is that MI300 lacks native MXFP4 support, so the Quark quantization toolchain takes a dequantization path that creates excessive memory pressure.
- Technical discussion: fxmarty-amd (AMD employee) pointed out that the vLLM implementation is not a simple dequantization path, and that the hang may instead stem from a previously identified overflow issue; the root cause is still being triaged within the team.
- Impact: exposes possible gaps in memory estimation and user guidance when running MXFP4-quantized models on AMD hardware; earlier warnings or better mitigation strategies are needed.
- #37472: V1 engine hangs on encoder cache profiling on AMD gfx1151
- Core problem: on the AMD RDNA 3.5 iGPU (gfx1151), the V1 engine hangs indefinitely while initializing the vision encoder cache because MIOpen lacks a precompiled solver database for this architecture.
- Technical details: the issue is tied to AMD's newer integrated-GPU architectures, where MIOpen ecosystem support has not yet caught up. The user provided a temporary workaround that skips the profiling step by commenting out code.
- Impact: blocks multimodal inference on AMD's latest integrated graphics; the code needs detection of missing MIOpen solver DBs and graceful-degradation logic.
- #37422: Add kv_transfer_params to Responses API for PD disaggregation
- Reporter / associated PR: bongwoobak, who also submitted PR #37424.
- Core request: add kv_transfer_params support to the /v1/responses API to reach parity with the /v1/chat/completions API for PD (prefill-decode) disaggregation. This matters for distributed inference setups built on AMD GPUs (e.g. MI250) with NixlConnector.
- Status: implemented via PR #37424, showing a quick turnaround on an AMD-specific deployment need.
Related PR Analysis
- #37427: [Bugfix] Fix ROCm crash in qwen3_next multi-stream events
- Contributor: JartX
- Problem: PR #36795 introduced multi-stream parallelism for Qwen3 models, but its CUDA event creation was gated only on is_cuda() rather than is_cuda_alike(), which also covers ROCm; running on AMD GPUs therefore raised an AttributeError.
- Fix: change the event-creation gate from is_cuda() to is_cuda_alike() so events are also created correctly on ROCm.
- Impact: promptly fixes an AMD-platform regression introduced by an upstream performance optimization, preserving stability for ROCm users.
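The gating pattern behind this fix can be sketched in plain Python. The Platform class and method bodies below are illustrative stand-ins for vLLM's platform abstraction, not its actual implementation:

```python
class Platform:
    """Illustrative stand-in for vLLM's platform abstraction."""

    def __init__(self, device_type: str):
        self.device_type = device_type  # "cuda" or "rocm"

    def is_cuda(self) -> bool:
        # True only on NVIDIA GPUs.
        return self.device_type == "cuda"

    def is_cuda_alike(self) -> bool:
        # True on NVIDIA and on ROCm, which exposes a CUDA-like API via HIP.
        return self.device_type in ("cuda", "rocm")


def create_multistream_events(platform: Platform):
    # Before the fix the gate was platform.is_cuda(), so no events were
    # created on ROCm and later code hit an AttributeError when using them.
    if platform.is_cuda_alike():
        return object()  # stand-in for a torch.cuda.Event()
    return None
```

With the is_cuda() gate, create_multistream_events(Platform("rocm")) would return None and downstream event usage would fail; the is_cuda_alike() gate covers both platforms.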
- #37408: [ROCm][Quantization] fallback trust_remote_code=True in Quark config
- Contributor: xuebwang-amd (AMD employee)
- Problem: with certain Transformers versions, loading a Quark-quantized model that ships custom code (e.g. amd/MiniMax-M2.1-MXFP4) failed config validation because the trust_remote_code flag was not propagated correctly.
- Fix: add a fallback in the Quark config-loading logic that ensures trust_remote_code=True is set.
- Impact: improves the usability and compatibility of AMD Quark-quantized models in vLLM.
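A minimal sketch of the fallback idea. The loader function, its signature, and the exception type here are hypothetical; the real change lives in vLLM's Quark config handling:

```python
def load_quark_config(load_fn, model_id: str):
    """Try a strict load first; retry with trust_remote_code=True when the
    model ships custom code that the strict path cannot validate.
    Sketch only; load_fn is a hypothetical config loader."""
    try:
        return load_fn(model_id, trust_remote_code=False)
    except ValueError:
        # Fallback in the spirit of #37408: models bundling custom code
        # need trust_remote_code=True to pass config validation.
        return load_fn(model_id, trust_remote_code=True)


def fake_loader(model_id, trust_remote_code=False):
    # Hypothetical loader mimicking a model that ships custom code.
    if not trust_remote_code:
        raise ValueError("model requires trust_remote_code=True")
    return {"model": model_id, "trust_remote_code": True}
```

The two-attempt structure keeps the strict path as the default and only widens trust when loading actually fails.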
- #37390: Fix Quark OCP-MX W4A6 support: dequant dtype + apply_weights
- Contributor: vecheruk-amd (AMD employee)
- Problem: Quark W4A6 models crashed at inference time for two reasons: (1) the MXFP4 dequantization HIP kernel only supports float16 output, conflicting with vLLM's bfloat16 compute dtype; (2) apply_weights could receive uint8 inputs, crashing the downstream linear layer.
- Fix:
- In mxfp4_utils.py, force the dequantization output to float16, then convert to the target dtype.
- In quark_ocp_mx.py, hard-code bfloat16 for weight dequantization and bypass the activation-quantization QDQ path.
- Impact: a key fix for a runtime error in one specific Quark configuration (W4A6), though the approach also acknowledges limitations of the current implementation (e.g. bypassing activation QDQ).
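The two-step dtype handling can be illustrated with NumPy. The function name is illustrative, and float32 stands in for bfloat16, which NumPy does not provide:

```python
import numpy as np


def dequant_mxfp4_sketch(packed_u8: np.ndarray, scales: np.ndarray,
                         target_dtype=np.float32) -> np.ndarray:
    """Illustrative analogue of the fix: dequantize in float16 (the only
    dtype the HIP kernel can emit), then cast to the compute dtype."""
    # Step 1: compute in float16, matching the kernel's output constraint.
    fp16 = packed_u8.astype(np.float16) * scales.astype(np.float16)
    # Step 2: convert to the model's compute dtype afterwards (bfloat16 in
    # vLLM; float32 serves as the stand-in here).
    return fp16.astype(target_dtype)
```

The point is the ordering: emitting float16 from the kernel and casting afterwards avoids handing uint8 or mismatched-dtype tensors to the downstream linear layer.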
- #37418: [Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility
- Contributor: Duyi-Wang
- Problem: after a refactor, the MORI and AITER expert-parallel components became incompatible and could no longer work together with FP8-quantized models.
- Fix: make AiterExperts.expects_unquantized_inputs return False when paired with MORI, and allow MORI to skip its FP8 quantization step when defer_input_quant=True.
- Impact: fixes a complex edge case in the interaction between advanced MoE dispatch and quantization on ROCm, with real benefit for efficient inference of large MoE models.
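The boolean contract the fix establishes can be sketched as a small planning function. Only the flag names come from the PR description; the function itself is a hypothetical condensation, not vLLM's dispatch code:

```python
def plan_fp8_moe_dispatch(use_mori: bool, defer_input_quant: bool):
    """Hypothetical sketch of the compatibility contract from #37418."""
    # When MORI handles dispatch, AITER must not expect unquantized inputs
    # (mirrors AiterExperts.expects_unquantized_inputs returning False).
    aiter_expects_unquantized_inputs = not use_mori
    # MORI skips its own FP8 quantization step when quantization is deferred.
    mori_quantizes_inputs = use_mori and not defer_input_quant
    return aiter_expects_unquantized_inputs, mori_quantizes_inputs
```

The invariant is that exactly one side owns input quantization: with MORI dispatching and defer_input_quant=True, neither component re-quantizes already-quantized FP8 inputs.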
- Other AMD-related PRs:
- #37453: fixes an import error for GPT-OSS models on ROCm caused by API changes in Triton 3.6.
- #37495: adds the VLLM_ROCM_W8A8_TRITON_MAX_M environment variable on ROCm to configure the M-size threshold at which W8A8 GEMMs are routed between Triton and Composable Kernel, enabling finer-grained performance tuning.
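The routing idea behind VLLM_ROCM_W8A8_TRITON_MAX_M can be sketched as follows. Only the variable name comes from the PR; the default threshold and dispatch function are illustrative:

```python
import os


def pick_w8a8_gemm_backend(m: int, env=None) -> str:
    """Route a W8A8 GEMM by its M dimension: Triton up to the threshold,
    Composable Kernel above it. Sketch only; vLLM's real dispatch differs."""
    env = os.environ if env is None else env
    # The default of 32 is illustrative, not the value used by vLLM.
    max_m = int(env.get("VLLM_ROCM_W8A8_TRITON_MAX_M", "32"))
    return "triton" if m <= max_m else "composable_kernel"
```

Exposing the threshold as an environment variable lets deployments tune the crossover point per GPU and workload without rebuilding.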
💬 High-Engagement Discussion Analysis
- Issue #37392: [Bug]: Error during inference; the model shut down. Deployed model: Qwen3.5-122B-A10B-FP8
- Comments: 9
- Core topic: the service crashes during startup or inference when deploying the large Qwen3.5 model.
- Viewpoints and progress:
- User uekaterinauelizabethar2175-crypto provided detailed reproduction commands and logs showing the service stalling during initialization and then crashing.
- Maintainer ZJY0516 responded quickly, first suggesting the environment variable VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 for triage and, when that did not help, continuing to ask for more precise reproduction steps.
- The user tried several configurations, including disabling NCCL P2P, without success.
- Points of contention: none; this is collaborative troubleshooting.
- Current status: Open. Maintainers are trying to reproduce the issue and the user has supplied extensive information; the problem may involve expert parallelism, large-model initialization, or a deep bug specific to this hardware environment.
- Issue #37441: [Bug]: GPT OSS 120B performance regression with Triton 3.6
- Comments: 0 (but the issue body is very detailed, with a full analysis and data)
- Core topic: vLLM 0.17.1 shows roughly 20% higher latency than 0.16.0 when running GPT OSS 120B on H200.
- Viewpoints and analysis:
- Localization: using nsys profiling, the reporter pinpointed the regression to Triton 3.6's MoE kernels replacing the reduce_grouped op with _reduce, which takes 6.4x longer.
- Root cause and verification: by forcing vLLM 0.17.1 to use the legacy Triton 3.5 kernels (use_legacy_triton_kernels = True), performance fully recovered, confirming the hypothesis.
- Points of contention: none. This is a high-quality technical report with complete analysis, data, and a workaround.
- Current status: Open. Given the performance impact, this is expected to prompt core developers to review the Triton 3.6 MoE kernel changes and pursue a fix or rollback.
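The reporter's workaround can be sketched as a kernel-dispatch toggle. Only the flag name use_legacy_triton_kernels and the two op names come from the issue; the version-comparison logic is an illustrative reconstruction:

```python
def select_moe_reduce_kernel(triton_version: tuple,
                             use_legacy_triton_kernels: bool = False) -> str:
    """Illustrative sketch of the workaround reported in #37441."""
    if use_legacy_triton_kernels or triton_version < (3, 6):
        return "reduce_grouped"  # Triton 3.5-era kernel, the fast path
    # Triton 3.6 replacement, measured ~6.4x slower in this workload.
    return "_reduce"
```

The flag pins the fast path regardless of the installed Triton version, which is why flipping it on vLLM 0.17.1 restored 0.16.0-level latency.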
- Issue #37406: GPU hang on MI325/300 … (AMD)
- Comments: 2 (both from AMD team members)
- Core topic: as described above, running an MXFP4-quantized model on MI300 hangs the GPU.
- Viewpoints and analysis: xuebwang-amd believes memory pressure is the cause and suggests the Quark toolchain should warn earlier; fxmarty-amd offered a different technical view, arguing the root cause may be an overflow rather than a simple memory-path issue.
- Points of contention: the team holds differing technical judgments about the root cause.
- Current status: Open. The AMD team is conducting deeper internal investigation.
🔥 Hot Topics and Trend Analysis
- Stability of large models under complex configurations: represented by #37392 (Qwen3.5-122B crash) and #37406 (MI300 hang). As model sizes grow and deployment configurations get more complex (expert parallelism, hybrid parallelism, specific quantization schemes), deeper compatibility and stability challenges are surfacing.
- Performance regressions and deep tuning: #37441 (GPT OSS regression) shows community users with strong profiling skills who can trace a regression to a specific kernel change between versions. Several PRs also tune MoE kernels for specific targets (H200, Qwen3.5), signaling a more fine-grained phase of performance optimization.
- Multimodal and vision-encoder support: #37472 (encoder cache issue on AMD gfx1151) and #37423 (feature request to add an images field to the Completion API) show that multimodal support remains a frontier and a frequent source of issues, especially on non-mainstream or emerging hardware.
- Inference API extension and unification: #37422 / #37424 (kv_transfer_params for the Responses API) and #37473 (min_characters for SamplingParams) show vLLM's API layer steadily expanding to meet more complex production-grade needs.
🛠️ Key Technical Changes
- PR #37463: [Kernel] Add MXFP4 W4A4 CUTLASS MoE kernel for SM100
- Content: adds a native MXFP4 W4A4 CUTLASS MoE kernel for the NVIDIA SM100 (Blackwell) architecture. Unlike the existing NVFP4 path, it is optimized specifically for the MXFP4 quantization format.
- Impact: provides a new high-performance inference path for MXFP4-quantized MoE models on Blackwell GPUs. The author notes that adding SM120 support later should be straightforward, indicating good extensibility.
- PR #37442: [Bugfix] Zero-init MLA attention output buffers to prevent NaN
- Content: fixes a subtle but serious issue. In MLA attention decode with CUDA Graphs and padding, unused output slots could contain stale NaN values, which then contaminated the entire batch's valid results through the subsequent FP8 quantization (which computes a global amax).
- Impact: fundamentally resolves NaN outputs from MLA attention in certain batched-inference scenarios, improving determinism and stability; critical for production deployments.
- Issue #37441 & potential follow-up (GPT OSS Triton 3.6 regression)
- Analysis: the issue itself is a deep technical report. It shows that an upstream Triton compiler upgrade can significantly degrade key kernels for specific models, especially MoE models.
- Impact: a warning that the project needs stronger performance-regression testing, particularly around upgrades of critical dependencies such as Triton and FlashInfer. An official fix PR is expected to follow.
- PR #37427: [Bugfix] Fix ROCm crash in qwen3_next multi-stream events
- Content: a typical platform-compatibility fix, replacing the is_cuda() check with is_cuda_alike() so the code behaves consistently on ROCm (AMD) and CUDA (NVIDIA).
- Impact: a small change, but it ensures advanced vLLM features such as multi-stream parallelism benefit AMD users equally; an important patch for ecosystem health.
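The NaN-contamination mechanism described for PR #37442 can be demonstrated with NumPy. This is a pure-Python analogue of the reused CUDA-graph buffer; the scale computation is a simplified stand-in for FP8 quantization:

```python
import numpy as np


def fp8_global_scale(out_buffer: np.ndarray) -> float:
    # FP8 quantization derives one scale from the global amax of the buffer,
    # so a single NaN in any slot (even a padded, unused one) poisons it.
    return float(np.max(np.abs(out_buffer)))


valid = [1.0, -2.0, 0.5, 3.0]  # four real decode results, four padded slots

# Before the fix: the reused CUDA-graph output buffer keeps stale contents.
stale = np.full(8, np.nan, dtype=np.float32)
stale[:4] = valid
assert np.isnan(fp8_global_scale(stale))  # whole batch contaminated

# After the fix: zero-initialize the buffer before the attention kernel writes.
zeroed = np.zeros(8, dtype=np.float32)
zeroed[:4] = valid
assert fp8_global_scale(zeroed) == 3.0  # padded slots are now harmless
```

Zero-initializing is cheap relative to the kernel itself and makes the padded slots inert under any reduction that spans the full buffer.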
📈 Development Activity Observations
- Contributor diversity: this cycle's active contributors include AMD employees (recognizable by the -amd suffix), developers from the NVIDIA ecosystem, and many independent community developers, reflecting the project's healthy ecosystem appeal.
- Issue-handling efficiency: 37 issues were closed within 24 hours while 22 new ones arrived, showing the project is actively working down its backlog while still facing a steady inflow of new reports.
- High PR merge volume: 48 merged PRs indicates fast review and integration by the core maintainer team and rapid iteration. Many PRs carry the ready label, suggesting the CI/CD and review pipeline is running smoothly.
💡 Issues Worth Watching
- Stability and compatibility on MI300-series GPUs: the GPU hang in issue #37406 needs prompt resolution; it bears directly on the reliability and user experience of AMD's flagship accelerators on vLLM.
- Risk management for upstream dependency upgrades: the Triton 3.6 regression exposed by issue #37441 calls for stricter performance and functional regression testing when upgrading deep dependencies such as compilers and kernel libraries.
- Forward-looking support for new hardware architectures: issue #37472 (AMD gfx1151 iGPU) shows that when adapting to the latest products from AMD, Intel, and others, vLLM faces immature lower-level software stacks (e.g. MIOpen) and needs more compatibility logic and graceful-degradation strategies.
- Deployment complexity of large-scale MoE models: multiple issues and PRs show that deploying MoE models in the hundred-billion-parameter class (e.g. Qwen3.5-122B, GPT OSS 120B) remains challenging across expert parallelism, quantization, and architecture-specific kernel optimization; this is the current technical frontier.
📋 Appendix: Detailed Data Lists
New Issues
- #37396 [Feature]: Tree speculative decode. — feature request — by DingYibin (created: 2026-03-18 16:15 (UTC+8))
- #37419 [Bug]: redundancy_buffer_memory is Never really used — bug — by panpan0000 (created: 2026-03-18 19:44 (UTC+8))
- #37392 [Bug]: Error during inference; the model shut down. Deployed Qwen3.5-122B-A10B-FP8 model — bug — by watch-Ultra (created: 2026-03-18 14:43 (UTC+8))
- #37435 Speculative/MTP draft config appears to drop target --hf-overrides (breaks long-context YaRN/RoPE extension) — no labels — by malaiwah (created: 2026-03-18 21:34 (UTC+8))
- #37486 [Bug]: Qwen3 + DeepGEMM + dummy-load Cannot access data pointer of Tensor that doesn't have storage — bug — by baonudesifeizhai (created: 2026-03-19 06:10 (UTC+8))
- #37471 [Bug]: Accuracy issue running Model Runner V2 with Qwen3.5 — bug — by yewentao256 (created: 2026-03-19 03:13 (UTC+8))
- #37400 [Bug]: JAIS: ALiBi is applied even when position_embedding_type="learned" — bug,good first issue — by Qi-Zhan (created: 2026-03-18 16:31 (UTC+8))
- #37472 [Bug] V1 engine hangs on encoder cache profiling on AMD gfx1151 (MIOpen missing solver DB) — rocm — by 3spky5u-oss (created: 2026-03-19 03:28 (UTC+8))
- #37451 [Bug]: 0.17.1 - vllm serve deepseek-ai/DeepSeek-OCR-2 on H100 crashes during Capturing CUDA graphs (decode, FULL) — bug — by jraby (created: 2026-03-18 23:46 (UTC+8))
- #37468 [Bug]: FlashInfer allreduce fusion workspace uninitialized error — bug — by wzhao18 (created: 2026-03-19 02:28 (UTC+8))
- #37423 [Feature]: Allow passing images to CompletionRequest — feature request — by patrickvonplaten (created: 2026-03-18 20:20 (UTC+8))
- #37459 [CI Failure]: MultiModal Models Extended 2 - isaac test case OOMs — ci-failure — by varun-sundar-rabindranath (created: 2026-03-19 00:39 (UTC+8))
- #37444 Regression in nightly: AttributeError 'MergedColumnParallelLinear' has no attribute 'weight' with Qwen3.5-9B — no labels — by jhsmith409 (created: 2026-03-18 22:34 (UTC+8))
- #37441 [Bug]: GPT OSS 120B performance regression with Triton 3.6 — bug — by Dymasik (created: 2026-03-18 22:13 (UTC+8))
- #37437 [Usage]: How to use the vLLM framework to perform inference testing with Prefill-Decode (PD) Separation for the DeepSeek-R1 NVFP4 model across multiple GB300 server nodes (N Prefill nodes + M Decode nodes)? — usage — by Alan-D-Chen (created: 2026-03-18 21:42 (UTC+8))
- #37431 Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 — no labels — by ced509msn (created: 2026-03-18 21:09 (UTC+8))
- #37422 [Feature]: Add kv_transfer_params to Responses API for PD disaggregation — rocm — by bongwoobak (created: 2026-03-18 20:17 (UTC+8))
- #37402 [Bug]: _C.scaled_fp4_quant produces sticky CUDA error on SM121 (DGX Spark GB10) — contaminates CUDA context — bug — by rmagur1203 (created: 2026-03-18 16:45 (UTC+8))
- #37406 [Bug]: GPU hang on MI325/300 when serving MiniMax-M2.1-MXFP4 with TP=1 — bug,rocm — by xuebwang-amd (created: 2026-03-18 17:49 (UTC+8))
- #37404 [Bug]: AssertionError: assert num_kv_heads == 1 with CPU KV Offloading + GLM-5-FP8 — bug — by yz342 (created: 2026-03-18 17:25 (UTC+8))
- #37397 [Bug]: Kimi-K2.5 chat completion doesn't return any reasoning content — bug — by wizche (created: 2026-03-18 16:20 (UTC+8))
- #37387 [Bug]: [SM_120 / Blackwell] AWQ working with awq_marlin + TRITON_ATTN — field report — bug — by Alkevas (created: 2026-03-18 12:51 (UTC+8))
Closed Issues
- #29166 [Bug]: Granite Speech Illegal Memory Accesses in Parallel Vocab w/ LoRA — bug,stale — by alex-jw-brooks (closed: 2026-03-19 11:10 (UTC+8))
- #25937 [Bug]: CUDA driver error: invalid argument around _SymmetricMemory.rendezvous — bug,stale — by youkaichao (closed: 2026-03-19 11:08 (UTC+8))
- #26571 [Bug]: failed to build from source on sm120 machine — bug,stale — by ZJY0516 (closed: 2026-03-19 11:08 (UTC+8))
- #26989 [Bug]: Qwen 30ba3 VL Does not work — bug,stale — by bhaktatejas922 (closed: 2026-03-19 11:08 (UTC+8))
- #27573 [Installation]: Installing nightly fails with: Failed to resolve dependencies for vllm — installation,stale — by drrros (closed: 2026-03-19 11:08 (UTC+8))
- #28014 [Bug]: EngineDeadError, illegal memory access error encountered when serving qwen3-vl on h800/h20 — bug,stale — by WingEdge777 (closed: 2026-03-19 11:08 (UTC+8))
- #28089 [Bug]: Qwen3 enable_thinking is broken when continue_final_message is true — bug,stale — by Huarong (closed: 2026-03-19 11:08 (UTC+8))
- #28098 [Bug]: BF16 and INT8 dtype mismatch when running quantized model on vLLM — bug,stale — by logesh13 (closed: 2026-03-19 11:08 (UTC+8))
- #28226 [RFC]: Redesign Logprobs data structure to reduce GC cost — RFC,stale — by Jialin (closed: 2026-03-19 11:08 (UTC+8))
- #28228 [Bug]: Kimi Linear KV cache size estimation and usage not making sens — bug,stale — by johnr14 (closed: 2026-03-19 11:08 (UTC+8))
- #28505 [Feature]: Is there a plan to introduce the new feature nano-pearl, a new engineering effort in speculative reasoning. — feature request,stale — by Lexlum (closed: 2026-03-19 11:07 (UTC+8))
- #28626 [Bug]: Qwen3-VL-32B-AWQ model memory usage: 8k context limit with 40GB VRAM? — bug,stale — by maxin9966 (closed: 2026-03-19 11:07 (UTC+8))
- #28650 [Bug][XPU]: Error spam "Unsupported gpu_arch of paged_attention_vllm!!" — bug,intel-gpu,stale — by DatCaptainHorse (closed: 2026-03-19 11:07 (UTC+8))
- #28704 [Bug]: GDN model accuracy is low in DP mode — bug,stale — by ZJY0516 (closed: 2026-03-19 11:07 (UTC+8))
- #28713 [Bug]: qwen3-coder sometimes make an IndexError — bug,stale — by slipfre (closed: 2026-03-19 11:07 (UTC+8))
- #28714 [Bug]: CUDA Illegal Memory Access Error When Sleep Mode is Triggered During Request Processing — bug,stale — by cynton503 (closed: 2026-03-19 11:07 (UTC+8))
- #28770 [Bug]: Failed to launch example for Intel Arc Pro B60 — bug,intel-gpu,stale — by xyang2013 (closed: 2026-03-19 11:07 (UTC+8))
- #28830 [Bug]: When the concurrency level is greater than 2, DeepSeek-V3.2 frequently generates nonsensical or corrupted responses. — bug,stale — by Justin-12138 (closed: 2026-03-19 11:07 (UTC+8))
- #28836 Using os.sched_yield() releases the CPU so quickly that it causes 100% CPU utilization. — stale — by coderchem (closed: 2026-03-19 11:07 (UTC+8))
- #28838 [Bug]: engine monitor outputs unexpected error though the engine works well — bug,stale — by ai-easy-cpu (closed: 2026-03-19 11:07 (UTC+8))
- #28856 [Bug]: RuntimeError: Int8 not supported for this architecture — bug,stale — by fenghuohuo2001 (closed: 2026-03-19 11:07 (UTC+8))
- #28857 [Bug]: VLLM 0.11.0 with Gemma3-awq is totaly broken to start (not possible to start awq of gemma3-27b-awq — bug,stale — by delphiRo (closed: 2026-03-19 11:07 (UTC+8))
- #28876 [CI Failure]: should test_cumem.py use spawn or fork in cuda? — stale,ci-failure — by jerryzh168 (closed: 2026-03-19 11:07 (UTC+8))
- #28916 [Bug]: @create_new_process_for_each_test("spawn") succeed unconditionally and does not work correctly all usages need to be revistied. — bug,stale — by laithsakka (closed: 2026-03-19 11:07 (UTC+8))
- #28924 [Bug]: num_cpu_blocks metrics is None in cache_config_info — bug,stale — by zetxqx (closed: 2026-03-19 11:07 (UTC+8))
- #28965 [Bug]: Some compilation tests can not run in the same process due to "Cannot re-initialize CUDA in forked subprocess" — bug,stale — by gmagogsfm (closed: 2026-03-19 11:07 (UTC+8))
- #36492 [Bug]: Abnormal Output When Using FP8 KVCache for Kimi-K2.5 Inference under vLLM v0.17.0 — bug — by makabaka6338 (closed: 2026-03-19 10:36 (UTC+8))
- #36594 [Bug]: DPEngineCoreProc may re-arm DP wave while paused (START_DP_WAVE ignores pause state), causing collective timeout after pause_generation + collective_rpc — cpu — by junjzhang (closed: 2026-03-19 09:49 (UTC+8))
- #34731 [Performance]: Improve swap_states by swapping active token prefixes — performance — by pjo256 (closed: 2026-03-19 05:59 (UTC+8))
- #27890 "fatal error: Python.h: No such file or directory" upon vllm startup after a clean install — bug,unstale — by kha84 (closed: 2026-03-19 01:32 (UTC+8))
- #36623 [Bug]: OOM when --kv-offloading-size>1024 — bug — by xiejibing (closed: 2026-03-18 23:34 (UTC+8))
- #37274 [Bug]: vLLM serving cannot support video inputs with a list of base64-encoded extracted JPEG frames — bug — by Johere (closed: 2026-03-18 21:40 (UTC+8))
- #37402 [Bug]: _C.scaled_fp4_quant produces sticky CUDA error on SM121 (DGX Spark GB10) — contaminates CUDA context — bug — by rmagur1203 (closed: 2026-03-18 19:29 (UTC+8))
- #29292 [Feature]: support Nemotron parse-v1.1 — feature request — by cole-dda (closed: 2026-03-18 17:51 (UTC+8))
- #37302 [Bug]: R1 NVFP4 gsm8k drop in lm_eval — bug — by elvircrn (closed: 2026-03-18 17:10 (UTC+8))
- #37277 [Bug]: GLM47 Tool Call Bug — bug — by xi1212 (closed: 2026-03-18 16:12 (UTC+8))
- #35255 [Bug]: CUDA Error 803 on host with driver 590.48: "system has unsupported display driver / cuda driver combination" — bug — by git-jxj (closed: 2026-03-18 14:28 (UTC+8))
New PRs
- #37507 [Bugfix] Fall back to Triton/FLA when system CUDA toolkit < 12.6 for GDN prefill kernel — bug,qwen,nvidia — by yanghui1-arch (created: 2026-03-19 11:08 (UTC+8))
- #37425 [Perf] Fix slow hasattr in CUDAGraphWrapper.getattr — ready,v1,nvidia — by ZeldaHuang (created: 2026-03-18 20:27 (UTC+8))
- #37386 fix(glm47): improve tool call parsing and content normalization — ready — by karanb192 (created: 2026-03-18 12:34 (UTC+8))
- #37510 fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming — frontend — by cdpath (created: 2026-03-19 11:29 (UTC+8))
- #37508 [VLLMZ-905] fix(xpu): Clamp compile warmup sizes to model runner token capacity — v1 — by Liangyx2 (created: 2026-03-19 11:15 (UTC+8))
- #37509 fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming — frontend — by cdpath (created: 2026-03-19 11:22 (UTC+8))
- #37454 Riscv cpu support v3 — ci/build,cpu — by typer-J (created: 2026-03-19 00:18 (UTC+8))
- #37503 [4/n] Migrate sparse/FP4/W4A8 CUTLASS kernels to torch stable ABI — ci/build,nvidia — by mikaylagawarecki (created: 2026-03-19 10:42 (UTC+8))
- #37502 [Bugfix] Fix marlin nvfp4 rescaling — bug — by jinzhen-lin (created: 2026-03-19 10:23 (UTC+8))
- #37505 [KVCache] Support Pluggable KVCacheSpec — v1,deepseek — by MengqingCao (created: 2026-03-19 10:54 (UTC+8))
- #37506 fix: xgrammar structured output crash — structured-output,v1 — by wangyxbh (created: 2026-03-19 11:03 (UTC+8))
- #37447 [CI/Build] enable Intel XPU test flow with prebuilt image — ready,ci/build — by wendyliu235 (created: 2026-03-18 23:22 (UTC+8))
- #37504 [Refactor] Relocate endpoint tests to mirror serving code directory structure — ci/build — by sfeng33 (created: 2026-03-19 10:46 (UTC+8))
- #37467 [HMA]Fix corner case when hybrid page_size can not be evenly divided issue — v1 — by xuechendi (created: 2026-03-19 02:28 (UTC+8))
- #37448 Fix AttributeError in Qwen3.5 GDN layers with quantized models — ci/build,qwen — by jhsmith409 (created: 2026-03-18 23:25 (UTC+8))
- #37501 fix: clamp dA_cumsum differences to prevent Inf in Mamba2 SSD kernels — no labels — by kibitzing (created: 2026-03-19 10:01 (UTC+8))
- #37449 [bugfix][async scheduling] fix extra cuda context in device 0 with EP/DP — bug,ready,v1,nvidia — by youkaichao (created: 2026-03-18 23:40 (UTC+8))
- #37430 [Docs] Add docs for context extension using the yarn method — documentation — by labAxiaoming (created: 2026-03-18 20:58 (UTC+8))
- #37500 [Refactor] Relocate tests from tests/v1/entrypoints/ to tests/entrypoints/ — ci/build,v1 — by sfeng33 (created: 2026-03-19 09:58 (UTC+8))
- #37440 fix: apply redundancy_buffer_memory to KV cache allocation — v1 — by alvinttang (created: 2026-03-18 22:06 (UTC+8))
- #37429 [Bugfix] Fix KV cache sizing and allocation for hybrid Mamba/attention models — bug,v1 — by swtb3 (created: 2026-03-18 20:49 (UTC+8))
- #37443 [Bugfix][Core] Preserve target hf_overrides in MTP draft config — bug — by malaiwah (created: 2026-03-18 22:31 (UTC+8))
- #37480 Remove deprecated reasoning_content message field(part-2) — documentation,frontend,v1,qwen,kv-connector,nvidia — by ikaadil (created: 2026-03-19 05:23 (UTC+8))
- #37498 [gpt-oss][responsesAPI] fix TTFT — documentation,frontend,gpt-oss,meta-exported,fb-exported — by qandrew (created: 2026-03-19 08:39 (UTC+8))
- #37499 Fix DDE in group_broadcast for unbacked SymInts under torch.compile — llama,qwen — by laithsakka (created: 2026-03-19 08:47 (UTC+8))
- #37463 [Kernel] Add MXFP4 W4A4 CUTLASS MoE kernel for SM100 — ready,ci/build,nvidia,quantization — by mgoin (created: 2026-03-19 01:01 (UTC+8))
- #37452 Fix DP coordinator ZMQ port TOCTOU — v1 — by itayalroy (created: 2026-03-19 00:02 (UTC+8))
- #37442 [Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding — bug,ready,v1,nvidia — by elvircrn (created: 2026-03-18 22:28 (UTC+8))
- #37490 Pass min/max to mark unbacked. — llama,qwen — by laithsakka (created: 2026-03-19 06:31 (UTC+8))
- #37492 add shape_id specs to several models — llama,qwen — by laithsakka (created: 2026-03-19 06:40 (UTC+8))
- #37493 add unbacked handling for braodcast logic in quant_utils.py — no labels — by laithsakka (created: 2026-03-19 06:47 (UTC+8))
- #37491 [Build] Update CUTLASS revision from v4.2.1 to v4.4.2 — ready,ci/build,nvidia — by meena-at-work (created: 2026-03-19 06:32 (UTC+8))
- #37458 Don't log exc_info when vLLM tries to doenload a file that doesn't exist — no labels — by hmellor (created: 2026-03-19 00:35 (UTC+8))
- #37488 [Feature] EPLB Support for GPU Model Runner v2 — ready,v1 — by yewentao256 (created: 2026-03-19 06:24 (UTC+8))
- #37497 [release 2.11] Update torch 211 - debug — rocm,ci/build,cpu,nvidia — by atalman (created: 2026-03-19 07:42 (UTC+8))
- #37485 [Perf] Disable inductor runtime asserts by default for serving perfor… — no labels — by tianrengao (created: 2026-03-19 05:45 (UTC+8))
- #37496 [PERF] Extend NCCL symmetric memory to AllGather and ReduceScatter — nvidia — by samnordmann (created: 2026-03-19 07:10 (UTC+8))
- #37495 [ROCm] Add VLLM_ROCM_W8A8_TRITON_MAX_M env var for CK/Triton GEMM rou… — rocm — by rbrugaro-amd (created: 2026-03-19 07:05 (UTC+8))
- #37494 [Bugfix] fix AWQ layer to dispatch the correct kernel with torch.compile() — bug — by amd-xavierwang (created: 2026-03-19 06:52 (UTC+8))
- #37462 Fix NVFP4-quantized MoE checkpoint support for Step-3.5 Flash — no labels — by meenchen (created: 2026-03-19 00:58 (UTC+8))
- #37475 [BugFix] Allow qk_nope_head_dim=192 in FlashInfer MLA backend checks — bug,ready,v1,nvidia — by kjiang249 (created: 2026-03-19 04:23 (UTC+8))
- #37487 [V0 Deprecation] Refactor kv cache from list to element — rocm,ready,v1,qwen,deepseek,kv-connector — by yewentao256 (created: 2026-03-19 06:19 (UTC+8))
- #37465 [Bugfix] Remove assertion for NVFP4 scale dynamic range — bug,ready — by mgoin (created: 2026-03-19 01:22 (UTC+8))
- #37484 [Bugfix] Fix flaky entrypoint logitproc test forced to spawn - CI failures — bug,v1 — by wojciech-wais (created: 2026-03-19 05:39 (UTC+8))
- #37489 [WIP] SP+AsyncTP piecewise compilation fix + per-matmul heuristic gating — performance — by tianrengao (created: 2026-03-19 06:24 (UTC+8))
- #37483 [CI] Fix realtime WebSocket timeout deadlock and unhandled model validation errors — frontend — by AndreasKaratzas (created: 2026-03-19 05:34 (UTC+8))
- #37482 [Bugfix] Fix weight transfer tests using stale envs cache - CI test failures — bug — by wojciech-wais (created: 2026-03-19 05:26 (UTC+8))
- #37481 [XPU] enable is_act_and_mul for xpu — no labels — by xuechendi (created: 2026-03-19 05:25 (UTC+8))
- #37476 [Feat][RL] IPC weight sync optimizations: multigpu support and packed tensors — documentation,v1,nvidia — by hao-aaron (created: 2026-03-19 05:13 (UTC+8))
- #37398 Fix models which use layer_type_validation for Transformers v5 — ready — by hmellor (created: 2026-03-18 16:21 (UTC+8))
- #37432 Add API docs link if the CLI arg is a config class — ready — by hmellor (created: 2026-03-18 21:12 (UTC+8))
- #37479 [Perf] Disable inductor size_asserts by default for serving performance — no labels — by tianrengao (created: 2026-03-19 05:22 (UTC+8))
- #37416 [Kernel] Mamba support different layout for Conv state — needs-rebase — by NickLucche (created: 2026-03-18 18:55 (UTC+8))
- #37478 [CI] Update mergify tool-calling label paths — ci/build — by sfeng33 (created: 2026-03-19 05:21 (UTC+8))
- #37477 [Bugfix] Lower spec decode match threshold from 66% to 60% to increase chances of test pass on CI — bug,v1 — by wojciech-wais (created: 2026-03-19 05:19 (UTC+8))
- #37469 [perf][cpu] Accelerate BF16 GELU with LUT impl on Arm CPUs — ci/build,cpu — by fadara01 (created: 2026-03-19 03:06 (UTC+8))
- #37460 [Metrics][Core] Add PrefillStats to EngineCoreOutputs — ready,v1,kv-connector — by markmc (created: 2026-03-19 00:52 (UTC+8))
- #37470 [renderer][ez] combine render_chat_async, render_chat — no labels — by qandrew (created: 2026-03-19 03:09 (UTC+8))
- #37466 Cap the number of API servers to 1 when using Elastic EP. — frontend,ready — by SageMoore (created: 2026-03-19 01:32 (UTC+8))
- #37411 fix(jais): only apply ALiBi when position_embedding_type is 'alibi' — no labels — by jigangz (created: 2026-03-18 18:12 (UTC+8))
- #37473 [Core][V1] support min_characters for SamplingParams — frontend,v1 — by Xarbirus (created: 2026-03-19 03:33 (UTC+8))
- #37474 jais: only enable ALiBi when position_embedding_type == "alibi" — no labels — by ahmedabbas104 (created: 2026-03-19 04:21 (UTC+8))
- #37427 [Bugfix] Fix ROCm crash in qwen3_next multi-stream events (#36795) — bug,rocm,ready,qwen — by JartX (created: 2026-03-18 20:32 (UTC+8))
- #37461 [Bug] Fix FlashInfer allreduce fusion workspace uninitialized error — bug — by wzhao18 (created: 2026-03-19 00:55 (UTC+8))
- #37439 [Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation — bug,ready,qwen — by cnyvfang (created: 2026-03-18 22:06 (UTC+8))
- #37433 [Responses API] tool_choice support (auto / required / none) for GPT-OSS — frontend,gpt-oss — by will-deines (created: 2026-03-18 21:13 (UTC+8))
- #37407 [Bugfix] Fix Nemotron Parse loading — bug,rocm,ready,multi-modality — by DarkLight1337 (created: 2026-03-18 17:50 (UTC+8))
- #37380 [Bugfix] Decode prompt text from token IDs upstream in renderer — bug,frontend — by karanb192 (created: 2026-03-18 12:16 (UTC+8))
- #37456 [Model] Remove unnecessary processor definition for Nemotron Parse — ready — by DarkLight1337 (created: 2026-03-19 00:23 (UTC+8))
- #37457 [Misc] Clean up model registry — new-model,ready — by DarkLight1337 (created: 2026-03-19 00:25 (UTC+8))
- #37464 Fix NVFP4 weight scale underflow in BF16 dequantization — needs-rebase — by mgoin (created: 2026-03-19 01:17 (UTC+8))
- #37450 [Perf] Vectorize chunk_local_cumsum for GDN prefill — no labels — by ZJY0516 (created: 2026-03-18 23:43 (UTC+8))
- #37455 Revert "[Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow" — bug,ready — by mgoin (created: 2026-03-19 00:19 (UTC+8))
- #37446 support firered_aed_l model — new-model — by MengLeebin (created: 2026-03-18 22:57 (UTC+8))
- #37434 Automatically add links to API docs for matching strings in docs — documentation — by hmellor (created: 2026-03-18 21:27 (UTC+8))
- #37453 [ROCm] Fix GPT-OSS import for triton 3.6 — rocm,gpt-oss — by gshtras (created: 2026-03-19 00:13 (UTC+8))
- #37410 Fix SM121 GB10 FP4 quantization sticky CUDA error — nvidia — by xueliangyang-oeuler (created: 2026-03-18 18:12 (UTC+8))
- #37438 [Bugfix] Add Kimi-K2.5 reasoning/tool parser aliases and tool_call_id support — bug,frontend — by DorBernsohn (created: 2026-03-18 21:52 (UTC+8))
- #37417 support firered_aed_l model — new-model,needs-rebase — by MengLeebin (created: 2026-03-18 19:11 (UTC+8))
- #37436 [Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation — bug,qwen — by cnyvfang (created: 2026-03-18 21:36 (UTC+8))
- #37445 [Bugfix][Model] Fix Kimi K2 tool parser 8KB section limit and simplify streaming state — bug — by jscaldwell55 (created: 2026-03-18 22:34 (UTC+8))
- #37376 fused qknorm+rope kernel optimization for SM9.0 — no labels — by EricccYang (created: 2026-03-18 12:05 (UTC+8))
- #37421 [Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode — performance,v1,deepseek,nvidia — by LopezCastroRoberto (created: 2026-03-18 19:46 (UTC+8))
- #37401 feat: auto-calculate optimal max_num_seqs via startup decode profiling — v1 — by effortprogrammer (created: 2026-03-18 16:45 (UTC+8))
- #37405 [kv_offload+HMA][6/N]: Split offloading_connector.py — ready,v1,kv-connector — by orozery (created: 2026-03-18 17:42 (UTC+8))
- #37428 [W.I.P] fragmentation_buffer in profiling — v1 — by panpan0000 (created: 2026-03-18 20:41 (UTC+8))
- #37426 fix CUDAGraph memory being counted twice — v1,nvidia — by panpan0000 (created: 2026-03-18 20:30 (UTC+8))
- #37424 [Responses API] Add kv_transfer_params for PD disaggregation — frontend,kv-connector — by bongwoobak (created: 2026-03-18 20:20 (UTC+8))
- #37420 Fix redundancy_buffer_memory not taken account in determine_available_memory() — v1 — by panpan0000 (created: 2026-03-18 19:44 (UTC+8))
- #37418 [Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility for defer_input_quant — bug,rocm — by Duyi-Wang (created: 2026-03-18 19:23 (UTC+8))
- #37408 [ROCm][Quantization] fallback trust_remote_code=True in Quark config for some cases — rocm — by xuebwang-amd (created: 2026-03-18 17:51 (UTC+8))
- #37414 [Bugfix] enable_thinking: False, the Qwen3.5 model returns an error in the content stream. — bug,qwen — by xyDong0223 (created: 2026-03-18 18:45 (UTC+8))
- #37412 Fix Spec Decode + NCCL Illegal Memory Access — v1 — by xueliangyang-oeuler (created: 2026-03-18 18:19 (UTC+8))
- #37415 [MISC] fix pin_memory=torch.cuda.is_available(), use is_pin_memory_available — structured-output,v1,nvidia — by jikunshang (created: 2026-03-18 18:48 (UTC+8))
- #37413 [Core][Feat] Add max-waiting-queue-time parameter to reject requests — frontend,v1 — by chaunceyjiang (created: 2026-03-18 18:24 (UTC+8))
- #37409 Fix KV Offloading + MLA AssertionError in get_kv_cache_shape — v1 — by xueliangyang-oeuler (created: 2026-03-18 17:57 (UTC+8))
- #37377 [Feature] Add LoRA support for Qwen3ASRForConditionalGeneration — documentation,qwen — by karanb192 (created: 2026-03-18 12:11 (UTC+8))
- #37403 Fix tensor size mismatch in per-channel weight scale loading for MoE … — no labels — by xueliangyang-oeuler (created: 2026-03-18 16:59 (UTC+8))
- #37399 [1/N][Spec Decode] Unify propose interface — speculative-decoding,v1 — by wangxiyuan (created: 2026-03-18 16:31 (UTC+8))
- #37384 [Bugfix][Tool Parser] Fix Kimi-K2.5 parser accuracy, buffer limits, and token leaks — bug — by karanb192 (created: 2026-03-18 12:29 (UTC+8))
- #37385 [Bugfix] Fix GLM-4 tool parser double serialization in streaming — bug,documentation — by karanb192 (created: 2026-03-18 12:31 (UTC+8))
- #37382 [Bugfix] Cache tokenizer calls in tool parsers to prevent concurrent RuntimeError — bug — by karanb192 (created: 2026-03-18 12:18 (UTC+8))
- #37379 [Bugfix][Tool Parser] Remove hardcoded 8K section char limit in Kimi-K2 tool parser — bug — by karanb192 (created: 2026-03-18 12:14 (UTC+8))
- #37378 [Misc] Improve DCP error messages with actionable guidance — v1 — by karanb192 (created: 2026-03-18 12:13 (UTC+8))
- #37383 [Doc] Add comprehensive --speculative-config documentation — documentation — by karanb192 (created: 2026-03-18 12:23 (UTC+8))
- #37395 Fix piecewise CUDA graph bugs in split_graph and cuda_graph replay — nvidia — by xueliangyang-oeuler (created: 2026-03-18 16:00 (UTC+8))
- #37375 [LoRA] Make LoRA respect language_model_only — ready — by jeejeelee (created: 2026-03-18 11:40 (UTC+8))
- #37391 [Bugfix] Avoid OpenMP thread reallocation in CPU torch compile — bug,ready,cpu — by bigPYJ1151 (created: 2026-03-18 14:07 (UTC+8))
- #37394 [WIP][CT][XPU] Add W8A16 FP8 MoE Support — qwen — by Zhenzhong1 (created: 2026-03-18 15:14 (UTC+8))
- #37390 Fix Quark OCP-MX W4A6 support: dequant dtype + apply_weights — no labels — by vecheruk-amd (created: 2026-03-18 14:05 (UTC+8))
- #37393 [Bugfix] Add ENGINE_CONTEXT to get_batch_defaults() — bug — by shawnghu (created: 2026-03-18 14:44 (UTC+8))
- #37388 [Bugfix][Structured Output] Fix structural_tag bitmask not applied on reasoning models — bug,structured-output,v1 — by CatherineSue (created: 2026-03-18 13:06 (UTC+8))
- #37389 Fix placeholder_block_size undefined error in initialize_kv_cache — v1 — by xueliangyang-oeuler (created: 2026-03-18 13:18 (UTC+8))
- #37381 [Doc] Fix inconsistent hash notation in Prefix Caching diagram — documentation — by karanb192 (created: 2026-03-18 12:16 (UTC+8))
- #37374 [Perf] Optimize hidden state extraction logic — performance,speculative-decoding,v1,kv-connector — by benchislett (created: 2026-03-18 11:36 (UTC+8))
已合并 PR
- #37386 fix(glm47): improve tool call parsing and content normalization — ready — by karanb192 (merged: 2026-03-18 16:12 (UTC+8))
- #37449 [bugfix][async scheduling] fix extra cuda context in device 0 with EP/DP — bug,ready,v1,nvidia — by youkaichao (merged: 2026-03-19 02:32 (UTC+8))
- #37347 [Perf] Optimize token_embed for pooling models, 1.0% token throughput improvement — ready,v1 — by yewentao256 (merged: 2026-03-19 10:27 (UTC+8))
- #37024 [bug] Fix deadlock with pause resume and collective_rpc — bug,ready,v1 — by hao-aaron (merged: 2026-03-19 09:49 (UTC+8))
- #37237 [Model Runner V2] Spec decode rejection sampler logprobs support — ready,v1 — by TheEpicDolphin (merged: 2026-03-19 09:36 (UTC+8))
- #37334 [BUG] Exclude SKIP_TENSORS from get_layer_size() + new weight sync example for dpep — bug,documentation,ready — by hao-aaron (merged: 2026-03-19 08:45 (UTC+8))
- #36267 [EPLB] Simplify EPLB rearrange by only returning one map — ready — by SageMoore (merged: 2026-03-19 08:34 (UTC+8))
- #37238 [Model Runner V2] Spec decode rejection sampler greedy support — ready,v1 — by TheEpicDolphin (merged: 2026-03-19 06:59 (UTC+8))
- #37442 [Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding — bug,ready,v1,nvidia — by elvircrn (merged: 2026-03-19 08:28 (UTC+8))
- #36477 [XPU]Unify xpu test dependencies in dockerfile.xpu — ready,ci/build — by 1643661061leo (merged: 2026-03-19 08:12 (UTC+8))
- #37231 [Bugfix] Expand quantization method support in perf metrics — bug,ready,v1 — by thillai-c (merged: 2026-03-19 07:54 (UTC+8))
- #37054 [Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish — bug,documentation,ready,v1,nvidia — by andylolu2 (merged: 2026-03-19 07:07 (UTC+8))
- #37465 [Bugfix] Remove assertion for NVFP4 scale dynamic range — bug,ready — by mgoin (merged: 2026-03-19 06:37 (UTC+8))
- #36928 [LoRA][BugFix] Fix skipped LoRA adapters for Mistral3 — bug,ready — by WoosukKwon (merged: 2026-03-19 06:34 (UTC+8))
- #37398 Fix models which use `layer_type_validation` for Transformers v5 — ready — by hmellor (merged: 2026-03-19 02:41 (UTC+8))
- #37432 Add API docs link if the CLI arg is a config class — ready — by hmellor (merged: 2026-03-19 01:19 (UTC+8))
- #34733 fix(worker): optimize swap_states to copy only active token prefixes — ready,v1 — by pjo256 (merged: 2026-03-19 05:59 (UTC+8))
- #37195 [V0 Deprecation] Deprecate virtual engine — ready,v1,qwen,kv-connector — by yewentao256 (merged: 2026-03-19 05:30 (UTC+8))
- #36671 chunk parakeet into 30s clips to prevent OOMs on long audios — ready — by netanel-haber (merged: 2026-03-19 05:22 (UTC+8))
- #30647 [Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE — ready,ci/build,gpt-oss,nvidia — by elvischenv (merged: 2026-03-18 23:01 (UTC+8))
- #37427 [Bugfix] Fix ROCm crash in qwen3_next multi-stream events (#36795) — bug,rocm,ready,qwen — by JartX (merged: 2026-03-19 04:06 (UTC+8))
- #37439 [Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation — bug,ready,qwen — by cnyvfang (merged: 2026-03-19 02:36 (UTC+8))
- #37456 [Model] Remove unnecessary processor definition for Nemotron Parse — ready — by DarkLight1337 (merged: 2026-03-19 02:25 (UTC+8))
- #37457 [Misc] Clean up model registry — new-model,ready — by DarkLight1337 (merged: 2026-03-19 02:24 (UTC+8))
- #37340 [Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement — performance,ready,qwen — by yewentao256 (merged: 2026-03-19 02:18 (UTC+8))
- #37328 [CI] Fix PaddleOCR-VL HF test failure due to create_causal_mask API rename — ready,multi-modality — by AndreasKaratzas (merged: 2026-03-18 17:44 (UTC+8))
- #36642 [kv_offload+HMA][2/N]: Support multiple KV groups in GPULoadStoreSpec — ready,v1,kv-connector — by orozery (merged: 2026-03-19 01:26 (UTC+8))
- #36057 Adding deterministic lora benchmarking to vLLM Bench — performance,ready — by RonaldBXu (merged: 2026-03-19 00:02 (UTC+8))
- #37371 standardize load_weights using AutoWeightsLoader for kimi_linear and minimax_text_01 — ready — by XLiu-2000 (merged: 2026-03-18 23:05 (UTC+8))
- #37205 [Kernel] Add gpt-oss Router GEMM kernel — performance,ready,ci/build,gpt-oss,nvidia — by xyang16 (merged: 2026-03-18 23:15 (UTC+8))
- #37313 [Log] Reduce duplicate log — ready,v1,qwen,nvidia — by yewentao256 (merged: 2026-03-18 22:57 (UTC+8))
- #37324 [2/3] Refactor InternVL-based processors — speculative-decoding,ready,multi-modality,qwen — by DarkLight1337 (merged: 2026-03-18 22:22 (UTC+8))
- #36330 elastic_ep: Fix stateless group port races — ready,ci/build,v1 — by itayalroy (merged: 2026-03-18 22:36 (UTC+8))
- #37405 [kv_offload+HMA][6/N]: Split offloading_connector.py — ready,v1,kv-connector — by orozery (merged: 2026-03-18 21:42 (UTC+8))
- #37301 [Bugfix] Fix base64 JPEG video frames returning empty metadata — bug,ready,multi-modality — by he-yufeng (merged: 2026-03-18 21:40 (UTC+8))
- #36051 [NIXL][Bugfix] metrics & testing minor bug — bug,documentation,performance,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by andylolu2 (merged: 2026-03-18 21:39 (UTC+8))
- #31696 [Model] Enable LoRA support for tower and connector in H2OVL — documentation,ready,multi-modality — by shwetha-s-poojary (merged: 2026-03-18 21:26 (UTC+8))
- #37322 [Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy — bug,ready — by elvircrn (merged: 2026-03-18 18:30 (UTC+8))
- #32316 [Build] Bump python openai version — frontend,ready,ci/build — by chaunceyjiang (merged: 2026-03-18 18:20 (UTC+8))
- #36188 [docs] Add docs for new RL flows — documentation,ready,ci/build — by hao-aaron (merged: 2026-03-18 17:04 (UTC+8))
- #37375 [LoRA] Make LoRA respect `language_model_only` — ready — by jeejeelee (merged: 2026-03-18 15:53 (UTC+8))
- #37391 [Bugfix] Avoid OpenMP thread reallocation in CPU torch compile — bug,ready,cpu — by bigPYJ1151 (merged: 2026-03-18 15:51 (UTC+8))
- #34805 [kv_offload+HMA][0/N]: Support block-level preemption handling — ready,v1,kv-connector — by orozery (merged: 2026-03-18 14:49 (UTC+8))
- #37179 [XPU] skip unsupported ut and update test_nixl_connector — rocm,speculative-decoding,ready,ci/build,v1,tool-calling,kv-connector — by zhenwei-intel (merged: 2026-03-18 13:32 (UTC+8))
- #37130 [responsesAPI] parser.extract_response_outputs can take in token IDs — frontend,ready — by qandrew (merged: 2026-03-18 13:31 (UTC+8))
- #37335 [CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset — rocm,ready,v1 — by AndreasKaratzas (merged: 2026-03-18 13:26 (UTC+8))
- #37349 [ROCm][CI] Add ROCM_EXTRA_ARGS to audio_in_video test server fixture — rocm,ready — by AndreasKaratzas (merged: 2026-03-18 12:55 (UTC+8))
- #36924 [Hardware][TPU] Add supports_async_scheduling() method to Executor interface so that it can be extended for Executor implementations. — ready,v1,ready-run-all-tests — by gxd3 (merged: 2026-03-18 12:53 (UTC+8))
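The ROCm fix in #37427 above boils down to a platform-gating pattern: code that creates CUDA events was gated on a strict `is_cuda()` check, so ROCm (which exposes the same CUDA-like API via HIP) skipped event setup and later crashed with an `AttributeError`. A minimal sketch of that pattern, using a stand-in `Platform` class rather than vLLM's actual `current_platform` object:

```python
# Illustrative sketch of the gating pattern behind PR #37427.
# The Platform class below is a hypothetical stand-in, not vLLM code.

class Platform:
    def __init__(self, device_type: str):
        self.device_type = device_type  # e.g. "cuda" or "rocm"

    def is_cuda(self) -> bool:
        # True only on NVIDIA CUDA.
        return self.device_type == "cuda"

    def is_cuda_alike(self) -> bool:
        # True on any platform exposing the CUDA-like torch.cuda API,
        # which includes AMD ROCm/HIP.
        return self.device_type in ("cuda", "rocm")


def should_create_events(platform: Platform) -> bool:
    # Before the fix this gate used platform.is_cuda(), so ROCm skipped
    # event creation and downstream code dereferenced a missing event.
    return platform.is_cuda_alike()


assert should_create_events(Platform("cuda"))
assert should_create_events(Platform("rocm"))  # fixed: ROCm now included
```

The design takeaway is that any feature gate written for "GPU with CUDA semantics" should use the broader CUDA-alike check, or new optimizations silently regress on ROCm.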
Closed but Unmerged PRs
- #37509 fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming — frontend — by cdpath (closed: 2026-03-19 11:27 (UTC+8))
- #27352 [cmake] fix ROCm hip/clr build on platforms without GPUs attached — rocm,ci/build,stale — by evil0sheep (closed: 2026-03-19 11:08 (UTC+8))
- #27397 [FusedMoE] Remove cuda hard-code in dual stream execution — needs-rebase,stale,nvidia — by wxsIcey (closed: 2026-03-19 11:08 (UTC+8))
- #28205 [build] fix: only build FA3 for Hopper — needs-rebase,ci/build,stale — by AlpinDale (closed: 2026-03-19 11:08 (UTC+8))
- #28446 [CLI] fix unicode encode error for `vllm chat/complete` command input — frontend,stale — by Iceber (closed: 2026-03-19 11:07 (UTC+8))
- #28710 Update tpu dockerfile — structured-output,tpu,ci/build,stale,v1,nvidia — by ernie-chang (closed: 2026-03-19 11:07 (UTC+8))
- #37493 add unbacked handling for braodcast logic in quant_utils.py — no labels — by laithsakka (closed: 2026-03-19 08:10 (UTC+8))
- #37496 [PERF] Extend NCCL symmetric memory to AllGather and ReduceScatter — nvidia — by samnordmann (closed: 2026-03-19 07:12 (UTC+8))
- #37479 [Perf] Disable inductor size_asserts by default for serving performance — no labels — by tianrengao (closed: 2026-03-19 05:47 (UTC+8))
- #26947 [Compressed Tensors] Remove parameter conversion for sparse24 — no labels — by kylesayrs (closed: 2026-03-19 05:26 (UTC+8))
- #37464 Fix NVFP4 weight scale underflow in BF16 dequantization — needs-rebase — by mgoin (closed: 2026-03-19 01:19 (UTC+8))
- #37450 [Perf] Vectorize chunk_local_cumsum for GDN prefill — no labels — by ZJY0516 (closed: 2026-03-19 01:26 (UTC+8))
- #37455 Revert "[Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow" — bug,ready — by mgoin (closed: 2026-03-19 01:21 (UTC+8))
- #24725 [Bugfix] "UNKNOWN project" issue when installing vLLM in editable mode — bug,ci/build,unstale — by Shawn314 (closed: 2026-03-19 01:11 (UTC+8))
- #35674 [Hardware][Power] Add IBM POWER8 (ppc64le) CPU backend support — documentation,speculative-decoding,ci/build,v1,cpu — by Scottcjn (closed: 2026-03-19 00:45 (UTC+8))
- #36597 [CI/Build] enable Intel XPU test flow with prebuilt image — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by wendyliu235 (closed: 2026-03-18 23:06 (UTC+8))
- #37417 support firered_aed_l model — new-model,needs-rebase — by MengLeebin (closed: 2026-03-18 22:38 (UTC+8))
- #37436 [Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation — bug,qwen — by cnyvfang (closed: 2026-03-18 21:53 (UTC+8))
- #37124 [Bugfix] Fix KV cache overestimation for hybrid Mamba/attention model… — bug,documentation,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by swtb3 (closed: 2026-03-18 20:54 (UTC+8))
- #37326 [Bugfix] Fix unreachable structured_outputs + tool_choice conflict check — bug,frontend,tool-calling — by umut-polat (closed: 2026-03-18 20:25 (UTC+8))
- #35666 [Misc] Use VLLMValidationError consistently in completion and chat completion protocol — frontend — by umut-polat (closed: 2026-03-18 20:25 (UTC+8))
- #34265 [Attention][Perf][Kernel] Improve topKperRow for large context decode path - DeepSeek-V3.2 sparse attention — performance,rocm,v1,deepseek,nvidia — by LopezCastroRoberto (closed: 2026-03-18 19:47 (UTC+8))
- #32678 [Model] Qwen3-Next Splitting GDN attention calculation in mixed batches of Prefill and Decode — qwen — by xyDong0223 (closed: 2026-03-18 18:08 (UTC+8))
- #29438 [Kernels] Improve Triton fp8 block scaled kernel — performance,stale — by lgeiger (closed: 2026-03-18 18:05 (UTC+8))
- #31468 [Draft]feat: Add per-layer MLP size support for Qwen pruning — documentation,needs-rebase,qwen — by CedricHwong (closed: 2026-03-18 17:26 (UTC+8))
- #37388 [Bugfix][Structured Output] Fix structural_tag bitmask not applied on reasoning models — bug,structured-output,v1 — by CatherineSue (closed: 2026-03-18 13:46 (UTC+8))
- #37053 fused qknorm+rope kernel optimization for SM9.0 — no labels — by EricccYang (closed: 2026-03-18 12:04 (UTC+8))
- #37063 Fuse fp8 quant merge attn states vec128 — performance,v1 — by carlyou (closed: 2026-03-18 11:59 (UTC+8))
- #36384 Fix/fla triton autotune oom 34954 — qwen — by oneraghavan (closed: 2026-03-18 11:52 (UTC+8))
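Among the merged fixes, #37442 (zero-initializing MLA attention output buffers) illustrates a recurring CUDA-graph hazard: a captured graph runs on a fixed, padded batch size, so output slots beyond the real batch are never written by the kernel, and stale garbage in a reused buffer can leak NaN into downstream reductions. A minimal sketch of the failure mode, using plain Python lists as a hypothetical stand-in for the GPU buffers (not vLLM's actual code):

```python
import math

# Illustrative sketch of the padding hazard fixed by PR #37442.
# The "kernel" writes only the real tokens; padded slots keep whatever
# the reused buffer previously held.

def run_padded_step(buffer, real_tokens):
    # Kernel writes only the first `real_tokens` slots of the buffer.
    for i in range(real_tokens):
        buffer[i] = 1.0
    # A downstream op that reduces over the full (padded) buffer sees
    # the padded slots too.
    return sum(buffer)

stale = [math.nan] * 4             # reused buffer with garbage padding
assert math.isnan(run_padded_step(stale, real_tokens=2))  # NaN leaks out

zeroed = [0.0] * 4                 # zero-initialized buffer
assert run_padded_step(zeroed, real_tokens=2) == 2.0      # padding is inert
```

Zero-initializing the output buffer before each capture-replay makes the padded region contribute nothing, which is why the fix targets buffer initialization rather than the attention kernel itself.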