vLLM Development Activity Report - 2026-03-18

Time window: 2026-03-18 11:35 (UTC+8) ~ 2026-03-19 11:35 (UTC+8)
Statistics: New Issues 22 | Closed Issues 37 | New PRs 115 | Merged PRs 48 | PRs Closed Without Merging 29

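📊 Daily Development Status Summary

In this window (2026-03-18 to 2026-03-19), the vLLM project received 115 new PRs and merged 48, a high rate of feature development and code integration. Issue handling also kept pace: 37 issues were closed against 22 opened. The main areas of focus were GPU hangs on the AMD ROCm platform, quantization compatibility fixes, and several performance optimizations and kernel improvements.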

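🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was frequent this cycle, spanning bug reports, feature requests, and core fixes, and reflecting continued investment in support for AMD hardware and its software stack.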

💬 High-Traffic Discussion Analysis

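Issue #37392: [Bug]:推理时报错，模型关闭了。部署的Qwen3.5-122B-A10B-FP8模型 (Error during inference and the model shut down; deployed model is Qwen3.5-122B-A10B-FP8)
- Comments: 9
- Core topic: When a user deploys the large Qwen3.5 model, the service crashes during startup or inference.
- Views and progress:
  - User uekaterinauelizabethar2175-crypto provided detailed reproduction commands and logs showing that the service stalls during initialization and then crashes.
  - Maintainer ZJY0516 responded quickly, first suggesting the environment variable VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 as a diagnostic step; when that did not help, they kept asking for more precise reproduction steps.
  - The user tried several other configurations, including disabling NCCL P2P, and the problem persisted.
- Points of contention: None publicly; this is collaborative troubleshooting.
- Current status: Open. Maintainers are trying to reproduce the problem from the detailed information the user supplied. It may involve expert parallelism, large-model initialization, or a deeper bug specific to certain hardware environments.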
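The two diagnostic settings mentioned in this thread can be applied from a launch script. The following is a minimal sketch, not a recommended configuration: VLLM_DISABLE_SHARED_EXPERTS_STREAM=1 comes from the thread, NCCL_P2P_DISABLE=1 is an assumption about how NCCL P2P was disabled (the report does not name the variable), and the helper name build_debug_env is hypothetical.

```python
import os


def build_debug_env(base_env=None, disable_nccl_p2p=False):
    """Return a copy of the environment with diagnostic toggles applied.

    build_debug_env is a hypothetical helper, not part of vLLM.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Suggested by the maintainer in the thread as a first diagnostic step.
    env["VLLM_DISABLE_SHARED_EXPERTS_STREAM"] = "1"
    if disable_nccl_p2p:
        # Assumption: the user disabled NCCL P2P through this NCCL variable.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


debug_env = build_debug_env(base_env={"PATH": "/usr/bin"}, disable_nccl_p2p=True)
# The resulting mapping can be passed as env=... to subprocess.run when
# starting the server process.
```

Building a copy rather than mutating os.environ keeps the toggles scoped to the one process being debugged.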
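Issue #37441: [Bug]: GPT OSS 120B performance regression with Triton 3.6
- Comments: 0 (but the issue body is very detailed, with a complete analysis and data)
- Core topic: Compared with vLLM 0.16.0, vLLM 0.17.1 shows a latency regression of about 20% when running GPT OSS 120B on H200.
- Views and analysis:
  - Localization: Using nsys profiling, the reporter traced the regression to the MoE kernels in Triton 3.6, where the reduce_grouped operation was replaced by a _reduce operation whose runtime increased 6.4-fold.
  - Root cause and verification: The reporter forced vLLM 0.17.1 to use the legacy Triton 3.5 kernels (use_legacy_triton_kernels = True), and performance recovered fully, confirming the hypothesis.
- Points of contention: None. This is a high-quality technical report with a complete analysis, data, and a temporary workaround.
- Current status: Open. Given the size of the performance impact, core developers will likely review the Triton 3.6 MoE kernel change and may fix it or roll it back.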
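Issue #37406: GPU hang on MI325/300 … (AMD)
- Comments: 2 (both from AMD team members)
- Core topic: Serving the MXFP4-quantized model MiniMax-M2.1-MXFP4 with TP=1 on MI325/300 hangs the GPU.
- Views and analysis:
  - xuebwang-amd attributes the problem to memory pressure and suggests that the Quark toolchain should warn about it in advance.
  - fxmarty-amd takes a different technical view: the root cause may be an overflow in the code rather than simply a memory-path issue.
- Points of contention: The team members differ in their technical assessment of the root cause.
- Current status: Open. The AMD team is investigating the problem in more depth internally.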

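🔥 Hot Topics and Trend Analysis

Stability of large models and complex configurations: As #37392 (Qwen3.5-122B crash) and #37406 (MI300 hang) illustrate, larger models and more complex deployment configurations (expert parallelism, hybrid parallelism, specific quantization schemes) are exposing more deep-seated compatibility and stability problems.
Performance regressions and deep tuning: #37441 (GPT OSS performance regression) shows that community users can profile deeply enough to pin a regression on a specific kernel version change. In addition, several PRs tune MoE kernels for specific hardware and models (H200, Qwen3.5), indicating that performance work has entered a more fine-grained phase.
Multimodal and vision encoder support: #37472 (encoder cache problem on AMD gfx1151) and #37423 (feature request to add an images field to the Completion API) show that multimodal support remains a frontier where problems are frequent, especially on less common or emerging hardware architectures.
Extending and unifying the inference API: #37422 / #37424 (adding kv_transfer_params to the Responses API) and #37473 (adding min_characters to SamplingParams) show that the vLLM API layer keeps expanding to meet more complex production requirements.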

🛠️ Key Technical Changes

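PR #37463: [Kernel] Add MXFP4 W4A4 CUTLASS MoE kernel for SM100
- Technical content: Adds a native MXFP4 W4A4 CUTLASS MoE kernel for the NVIDIA SM100 (Blackwell) architecture. Unlike the existing NVFP4 path, it is optimized specifically for the MXFP4 quantization format.
- Impact: Gives MXFP4-quantized MoE models a new high-performance inference path on Blackwell GPUs. The author says adding SM120 support later should be relatively easy, which suggests the design extends well.
PR #37442: [Bugfix] Zero-init MLA attention output buffers to prevent NaN
- Technical content: Fixes a subtle but serious bug. In MLA attention decode with CUDA Graph and padding, unused output slots could hold leftover NaN values. Those NaNs then contaminate the valid results of the entire batch through the subsequent FP8 quantization, which computes a global amax.
- Impact: Removes the root cause of NaN outputs from MLA attention in certain batched inference scenarios, improving the determinism and stability of results, which matters for production deployments.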
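The contamination path described for PR #37442 can be illustrated without any GPU code. The sketch below is an assumption-laden toy in plain Python, not vLLM's implementation: global_amax and scale_for_fp8 are hypothetical helpers, the list stands in for an output buffer whose last slot is padding, and 448.0 is the largest finite FP8 E4M3 value.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def global_amax(values):
    """Max absolute value over the buffer; NaN propagates, as in a tensor-wide reduction."""
    result = 0.0
    for v in values:
        if math.isnan(v):
            return math.nan
        result = max(result, abs(v))
    return result


def scale_for_fp8(values):
    """Scale every element by one shared factor derived from the global amax."""
    amax = global_amax(values)
    scale = 1.0 if amax == 0.0 else amax / FP8_E4M3_MAX
    return [v / scale for v in values]


valid_rows = [1.0, -2.0, 4.0]

# Padding slot left uninitialized and holding NaN: the shared scale becomes NaN,
# so every valid row is poisoned as well.
stale = scale_for_fp8(valid_rows + [math.nan])

# Padding slot zero-initialized: the valid rows are scaled normally.
clean = scale_for_fp8(valid_rows + [0.0])
```

Because the scale is shared across the batch, a single NaN in an unused slot is enough to corrupt every row, which is why zero-initializing the buffer addresses the problem at its source.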
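Issue #37441 & potential follow-up (GPT OSS Triton 3.6 performance regression)
- Technical analysis: The issue is itself an in-depth technical analysis. It shows that upgrading the upstream Triton compiler can significantly degrade the performance of key kernels for particular models, MoE models especially.
- Impact: A warning that the project needs more thorough performance regression testing, particularly for version upgrades of key dependencies such as Triton and FlashInfer. An official fix PR is likely to follow.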
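A per-kernel regression gate of the kind suggested above could look like the following sketch. Everything in it is illustrative: find_regressions is a hypothetical helper, the kernel names are placeholders, and the timings are made-up numbers chosen only to echo the 6.4x figure from the issue, not measurements.

```python
def find_regressions(baseline_ms, candidate_ms, threshold=1.10):
    """Return {kernel: ratio} for kernels slower than threshold x baseline.

    Kernels missing from either run, or with a non-positive baseline,
    are skipped rather than guessed at.
    """
    flagged = {}
    for name, base in baseline_ms.items():
        new = candidate_ms.get(name)
        if new is None or base <= 0:
            continue
        ratio = new / base
        if ratio > threshold:
            flagged[name] = ratio
    return flagged


# Hypothetical per-kernel timings (milliseconds) before and after a dependency upgrade.
before = {"moe_reduce": 1.0, "attention": 5.0}
after = {"moe_reduce": 6.4, "attention": 5.1}

regressions = find_regressions(before, after)
```

Comparing per-kernel ratios rather than only end-to-end latency makes it easier to see which kernel changed when a dependency is upgraded.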
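PR #37427: [Bugfix] Fix ROCm crash in qwen3_next multi-stream events
- Technical content: A typical platform compatibility fix. It replaces an is_cuda() check with is_cuda_alike() so that the code behaves the same on ROCm (AMD) and CUDA (NVIDIA) platforms.
- Impact: The change is small, but it ensures that advanced vLLM features such as multi-stream parallelism are equally available to AMD users, which makes it an important patch for ecosystem health.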
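The difference between the two checks can be shown with a stand-alone toy. The Platform class and create_events function below are hypothetical and only mirror the method names from the PR description; they are not vLLM code, and the string handles stand in for real stream events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """Hypothetical platform descriptor; name is "cuda", "rocm", or "cpu"."""
    name: str

    def is_cuda(self) -> bool:
        return self.name == "cuda"

    def is_rocm(self) -> bool:
        return self.name == "rocm"

    def is_cuda_alike(self) -> bool:
        # True for any backend that exposes CUDA-style streams and events.
        return self.is_cuda() or self.is_rocm()


def create_events(platform, use_cuda_alike):
    """Return placeholder event handles, or None when the guard skips them."""
    if use_cuda_alike:
        enabled = platform.is_cuda_alike()
    else:
        enabled = platform.is_cuda()
    return ["event0", "event1"] if enabled else None


rocm = Platform("rocm")
narrow_guard = create_events(rocm, use_cuda_alike=False)  # None on ROCm
broad_guard = create_events(rocm, use_cuda_alike=True)    # handles exist on ROCm
```

With the narrow guard, later code that assumes the events exist would fail on ROCm, which is one plausible shape for the kind of crash the PR title describes.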

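📈 Development Activity Observations

Contributor diversity: Active contributors this cycle include AMD employees (usernames with the -amd suffix), developers from the NVIDIA ecosystem, and many independent community developers, which reflects the project's broad appeal.
Issue handling: 37 issues were closed within 24 hours while 22 new ones were opened, so the project is working down its backlog while still facing a steady inflow of new problems.
High merge volume: 48 merged PRs indicate that the core maintainers review and integrate code quickly and that the project iterates fast. Many PRs carry the ready label, which suggests that CI/CD and the review process are running smoothly.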

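💡 Issues Worth Watching

Stability and compatibility on MI300-series GPUs: The GPU hang reported in Issue #37406 needs to be resolved soon, since it affects the reliability and user experience of AMD's flagship compute cards on vLLM.
Risk control for upstream dependency upgrades: The Triton 3.6 performance regression exposed by Issue #37441 shows that upgrades of deep dependencies such as compilers and kernel libraries need stricter performance and functional regression testing.
Forward-looking support for new hardware architectures: Issue #37472 (AMD gfx1151 iGPU) shows that when vLLM adapts to the latest products from vendors such as AMD and Intel, it runs into immature low-level software stacks (for example MIOpen) and needs more compatibility logic and graceful-degradation strategies.
Deployment complexity of large-scale MoE models: Several issues and PRs show that deploying MoE models with more than 100 billion parameters (such as Qwen3.5-122B and GPT OSS 120B) is challenging. It involves expert parallelism, quantization, model-specific kernel tuning, and other dimensions, and it is a current technical frontier and pain point.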

📋 Appendix: Detailed Data Lists

New Issues

#37396 [Feature]: Tree speculative decode. — feature request — by DingYibin (created: 2026-03-18 16:15 (UTC+8))
#37419 [Bug]: redundancy_buffer_memory is Never really used — bug — by panpan0000 (created: 2026-03-18 19:44 (UTC+8))
#37392 [Bug]:推理时报错，模型关闭了。部署的Qwen3.5-122B-A10B-FP8模型 (Error during inference and the model shut down; deployed model is Qwen3.5-122B-A10B-FP8) — bug — by watch-Ultra (created: 2026-03-18 14:43 (UTC+8))
#37435 Speculative/MTP draft config appears to drop target --hf-overrides (breaks long-context YaRN/RoPE extension) — no labels — by malaiwah (created: 2026-03-18 21:34 (UTC+8))
#37486 [Bug]: Qwen3 + DeepGEMM + dummy-load Cannot access data pointer of Tensor that doesn’t have storage — bug — by baonudesifeizhai (created: 2026-03-19 06:10 (UTC+8))
#37471 [Bug]: Accuracy issue running Model Runner V2 with Qwen3.5 — bug — by yewentao256 (created: 2026-03-19 03:13 (UTC+8))
#37400 [Bug]: JAIS: ALiBi is applied even when position_embedding_type="learned" — bug,good first issue — by Qi-Zhan (created: 2026-03-18 16:31 (UTC+8))
#37472 [Bug] V1 engine hangs on encoder cache profiling on AMD gfx1151 (MIOpen missing solver DB) — rocm — by 3spky5u-oss (created: 2026-03-19 03:28 (UTC+8))
#37451 [Bug]: 0.17.1 - vllm serve deepseek-ai/DeepSeek-OCR-2 on H100 crashes during Capturing CUDA graphs (decode, FULL) — bug — by jraby (created: 2026-03-18 23:46 (UTC+8))
#37468 [Bug]: FlashInfer allreduce fusion workspace uninitialized error — bug — by wzhao18 (created: 2026-03-19 02:28 (UTC+8))
#37423 [Feature]: Allow passing images to CompletionRequest — feature request — by patrickvonplaten (created: 2026-03-18 20:20 (UTC+8))
#37459 [CI Failure]: MultiModal Models Extended 2 - isaac test case OOMs — ci-failure — by varun-sundar-rabindranath (created: 2026-03-19 00:39 (UTC+8))
#37444 Regression in nightly: AttributeError ‘MergedColumnParallelLinear’ has no attribute ‘weight’ with Qwen3.5-9B — no labels — by jhsmith409 (created: 2026-03-18 22:34 (UTC+8))
#37441 [Bug]: GPT OSS 120B performance regression with Triton 3.6 — bug — by Dymasik (created: 2026-03-18 22:13 (UTC+8))
#37437 [Usage]: How to use the vLLM framework to perform inference testing with Prefill-Decode (PD) Separation for the DeepSeek-R1 NVFP4 model across multiple GB300 server nodes (N Prefill nodes + M Decode nodes)? — usage — by Alan-D-Chen (created: 2026-03-18 21:42 (UTC+8))
#37431 Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 — no labels — by ced509msn (created: 2026-03-18 21:09 (UTC+8))
#37422 [Feature]: Add kv_transfer_params to Responses API for PD disaggregation — rocm — by bongwoobak (created: 2026-03-18 20:17 (UTC+8))
#37402 [Bug]: _C.scaled_fp4_quant produces sticky CUDA error on SM121 (DGX Spark GB10) — contaminates CUDA context — bug — by rmagur1203 (created: 2026-03-18 16:45 (UTC+8))
#37406 [Bug]: GPU hang on MI325/300 when serving MiniMax-M2.1-MXFP4 with TP=1 — bug,rocm — by xuebwang-amd (created: 2026-03-18 17:49 (UTC+8))
#37404 [Bug]: AssertionError: assert num_kv_heads == 1 with CPU KV Offloading + GLM-5-FP8 — bug — by yz342 (created: 2026-03-18 17:25 (UTC+8))
#37397 [Bug]: Kimi-K2.5 chat completion doesn’t return any reasoning content — bug — by wizche (created: 2026-03-18 16:20 (UTC+8))
#37387 [Bug]: [SM_120 / Blackwell] AWQ working with awq_marlin + TRITON_ATTN — field report — bug — by Alkevas (created: 2026-03-18 12:51 (UTC+8))

Closed Issues

#29166 [Bug]: Granite Speech Illegal Memory Accesses in Parallel Vocab w/ LoRA — bug,stale — by alex-jw-brooks (closed: 2026-03-19 11:10 (UTC+8))
#25937 [Bug]: CUDA driver error: invalid argument around _SymmetricMemory.rendezvous — bug,stale — by youkaichao (closed: 2026-03-19 11:08 (UTC+8))
#26571 [Bug]: failed to build from source on sm120 machine — bug,stale — by ZJY0516 (closed: 2026-03-19 11:08 (UTC+8))
#26989 [Bug]: Qwen 30ba3 VL Does not work — bug,stale — by bhaktatejas922 (closed: 2026-03-19 11:08 (UTC+8))
#27573 [Installation]: Installing nightly fails with: Failed to resolve dependencies for vllm — installation,stale — by drrros (closed: 2026-03-19 11:08 (UTC+8))
#28014 [Bug]: EngineDeadError, illegal memory access error encountered when serving qwen3-vl on h800/h20 — bug,stale — by WingEdge777 (closed: 2026-03-19 11:08 (UTC+8))
#28089 [Bug]: Qwen3 enable_thinking is broken when continue_final_message is true — bug,stale — by Huarong (closed: 2026-03-19 11:08 (UTC+8))
#28098 [Bug]: BF16 and INT8 dtype mismatch when running quantized model on vLLM — bug,stale — by logesh13 (closed: 2026-03-19 11:08 (UTC+8))
#28226 [RFC]: Redesign Logprobs data structure to reduce GC cost — RFC,stale — by Jialin (closed: 2026-03-19 11:08 (UTC+8))
#28228 [Bug]: Kimi Linear KV cache size estimation and usage not making sens — bug,stale — by johnr14 (closed: 2026-03-19 11:08 (UTC+8))
#28505 [Feature]: Is there a plan to introduce the new feature nano-pearl, a new engineering effort in speculative reasoning. — feature request,stale — by Lexlum (closed: 2026-03-19 11:07 (UTC+8))
#28626 [Bug]:Qwen3-VL-32B-AWQ model memory usage: 8k context limit with 40GB VRAM? — bug,stale — by maxin9966 (closed: 2026-03-19 11:07 (UTC+8))
#28650 [Bug][XPU]: Error spam “Unsupported gpu_arch of paged_attention_vllm!!” — bug,intel-gpu,stale — by DatCaptainHorse (closed: 2026-03-19 11:07 (UTC+8))
#28704 [Bug]: GDN model accuracy is low in DP mode — bug,stale — by ZJY0516 (closed: 2026-03-19 11:07 (UTC+8))
#28713 [Bug]: qwen3-coder sometimes make an IndexError — bug,stale — by slipfre (closed: 2026-03-19 11:07 (UTC+8))
#28714 [Bug]: CUDA Illegal Memory Access Error When Sleep Mode is Triggered During Request Processing — bug,stale — by cynton503 (closed: 2026-03-19 11:07 (UTC+8))
#28770 [Bug]: Failed to launch example for Intel Arc Pro B60 — bug,intel-gpu,stale — by xyang2013 (closed: 2026-03-19 11:07 (UTC+8))
#28830 [Bug]: When the concurrency level is greater than 2, DeepSeek-V3.2 frequently generates nonsensical or corrupted responses. — bug,stale — by Justin-12138 (closed: 2026-03-19 11:07 (UTC+8))
#28836 Using os.sched_yield() releases the CPU so quickly that it causes 100% CPU utilization. — stale — by coderchem (closed: 2026-03-19 11:07 (UTC+8))
#28838 [Bug]: engine monitor outputs unexpected error though the engine works well — bug,stale — by ai-easy-cpu (closed: 2026-03-19 11:07 (UTC+8))
#28856 [Bug]: RuntimeError: Int8 not supported for this architecture — bug,stale — by fenghuohuo2001 (closed: 2026-03-19 11:07 (UTC+8))
#28857 [Bug]: VLLM 0.11.0 with Gemma3-awq is totaly broken to start (not possible to start awq of gemma3-27b-awq — bug,stale — by delphiRo (closed: 2026-03-19 11:07 (UTC+8))
#28876 [CI Failure]: should test_cumem.py use spawn or fork in cuda? — stale,ci-failure — by jerryzh168 (closed: 2026-03-19 11:07 (UTC+8))
#28916 [Bug]: @create_new_process_for_each_test(“spawn”) succeed unconditionally and does not work correctly all usages need to be revistied. — bug,stale — by laithsakka (closed: 2026-03-19 11:07 (UTC+8))
#28924 [Bug]: num_cpu_blocks metrics is None in cache_config_info — bug,stale — by zetxqx (closed: 2026-03-19 11:07 (UTC+8))
#28965 [Bug]: Some compilation tests can not run in the same process due to “Cannot re-initialize CUDA in forked subprocess” — bug,stale — by gmagogsfm (closed: 2026-03-19 11:07 (UTC+8))
#36492 [Bug]: Abnormal Output When Using FP8 KVCache for Kimi-K2.5 Inference under vLLM v0.17.0 — bug — by makabaka6338 (closed: 2026-03-19 10:36 (UTC+8))
#36594 [Bug]: DPEngineCoreProc may re-arm DP wave while paused (START_DP_WAVE ignores pause state), causing collective timeout after pause_generation + collective_rpc — cpu — by junjzhang (closed: 2026-03-19 09:49 (UTC+8))
#34731 [Performance]: Improve swap_states by swapping active token prefixes — performance — by pjo256 (closed: 2026-03-19 05:59 (UTC+8))
#27890 “fatal error: Python.h: No such file or directory” upon vllm startup after a clean install — bug,unstale — by kha84 (closed: 2026-03-19 01:32 (UTC+8))
#36623 [Bug]: OOM when --kv-offloading-size>1024 — bug — by xiejibing (closed: 2026-03-18 23:34 (UTC+8))
#37274 [Bug]: vLLM serving cannot support video inputs with a list of base64-encoded extracted JPEG frames — bug — by Johere (closed: 2026-03-18 21:40 (UTC+8))
#37402 [Bug]: _C.scaled_fp4_quant produces sticky CUDA error on SM121 (DGX Spark GB10) — contaminates CUDA context — bug — by rmagur1203 (closed: 2026-03-18 19:29 (UTC+8))
#29292 [Feature]: support Nemotron parse-v1.1 — feature request — by cole-dda (closed: 2026-03-18 17:51 (UTC+8))
#37302 [Bug]: R1 NVFP4 gsm8k drop in lm_eval — bug — by elvircrn (closed: 2026-03-18 17:10 (UTC+8))
#37277 [Bug]: GLM47 Tool Call Bug — bug — by xi1212 (closed: 2026-03-18 16:12 (UTC+8))
#35255 [Bug]: CUDA Error 803 on host with driver 590.48: `system has unsupported display driver / cuda driver combination — bug — by git-jxj (closed: 2026-03-18 14:28 (UTC+8))

New PRs

#37507 [Bugfix] Fall back to Triton/FLA when system CUDA toolkit < 12.6 for GDN prefill kernel — bug,qwen,nvidia — by yanghui1-arch (created: 2026-03-19 11:08 (UTC+8))
#37425 [Perf] Fix slow hasattr in CUDAGraphWrapper.getattr — ready,v1,nvidia — by ZeldaHuang (created: 2026-03-18 20:27 (UTC+8))
#37386 fix(glm47): improve tool call parsing and content normalization — ready — by karanb192 (created: 2026-03-18 12:34 (UTC+8))
#37510 fix(anthropic): remove non-standard ‘data: [DONE]’ from Anthropic streaming — frontend — by cdpath (created: 2026-03-19 11:29 (UTC+8))
#37508 [VLLMZ-905] fix(xpu): Clamp compile warmup sizes to model runner token capacity — v1 — by Liangyx2 (created: 2026-03-19 11:15 (UTC+8))
#37509 fix(anthropic): remove non-standard ‘data: [DONE]’ from Anthropic streaming — frontend — by cdpath (created: 2026-03-19 11:22 (UTC+8))
#37454 Riscv cpu support v3 — ci/build,cpu — by typer-J (created: 2026-03-19 00:18 (UTC+8))
#37503 [4/n] Migrate sparse/FP4/W4A8 CUTLASS kernels to torch stable ABI — ci/build,nvidia — by mikaylagawarecki (created: 2026-03-19 10:42 (UTC+8))
#37502 [Bugfix] Fix marlin nvfp4 rescaling — bug — by jinzhen-lin (created: 2026-03-19 10:23 (UTC+8))
#37505 [KVCache] Support Pluggable KVCacheSpec — v1,deepseek — by MengqingCao (created: 2026-03-19 10:54 (UTC+8))
#37506 fix: xgrammar structured output crash — structured-output,v1 — by wangyxbh (created: 2026-03-19 11:03 (UTC+8))
#37447 [CI/Build] enable Intel XPU test flow with prebuilt image — ready,ci/build — by wendyliu235 (created: 2026-03-18 23:22 (UTC+8))
#37504 [Refactor] Relocate endpoint tests to mirror serving code directory structure — ci/build — by sfeng33 (created: 2026-03-19 10:46 (UTC+8))
#37467 [HMA]Fix corner case when hybrid page_size can not be evenly divided issue — v1 — by xuechendi (created: 2026-03-19 02:28 (UTC+8))
#37448 Fix AttributeError in Qwen3.5 GDN layers with quantized models — ci/build,qwen — by jhsmith409 (created: 2026-03-18 23:25 (UTC+8))
#37501 fix: clamp dA_cumsum differences to prevent Inf in Mamba2 SSD kernels — no labels — by kibitzing (created: 2026-03-19 10:01 (UTC+8))
#37449 [bugfix][async scheduling] fix extra cuda context in device 0 with EP/DP — bug,ready,v1,nvidia — by youkaichao (created: 2026-03-18 23:40 (UTC+8))
#37430 [Docs] Add docs for context extension using the yarn method — documentation — by labAxiaoming (created: 2026-03-18 20:58 (UTC+8))
#37500 [Refactor] Relocate tests from tests/v1/entrypoints/ to tests/entrypoints/ — ci/build,v1 — by sfeng33 (created: 2026-03-19 09:58 (UTC+8))
#37440 fix: apply redundancy_buffer_memory to KV cache allocation — v1 — by alvinttang (created: 2026-03-18 22:06 (UTC+8))
#37429 [Bugfix] Fix KV cache sizing and allocation for hybrid Mamba/attention models — bug,v1 — by swtb3 (created: 2026-03-18 20:49 (UTC+8))
#37443 [Bugfix][Core] Preserve target hf_overrides in MTP draft config — bug — by malaiwah (created: 2026-03-18 22:31 (UTC+8))
#37480 Remove deprecated reasoning_content message field(part-2) — documentation,frontend,v1,qwen,kv-connector,nvidia — by ikaadil (created: 2026-03-19 05:23 (UTC+8))
#37498 [gpt-oss][responsesAPI] fix TTFT — documentation,frontend,gpt-oss,meta-exported,fb-exported — by qandrew (created: 2026-03-19 08:39 (UTC+8))
#37499 Fix DDE in group_broadcast for unbacked SymInts under torch.compile — llama,qwen — by laithsakka (created: 2026-03-19 08:47 (UTC+8))
#37463 [Kernel] Add MXFP4 W4A4 CUTLASS MoE kernel for SM100 — ready,ci/build,nvidia,quantization — by mgoin (created: 2026-03-19 01:01 (UTC+8))
#37452 Fix DP coordinator ZMQ port TOCTOU — v1 — by itayalroy (created: 2026-03-19 00:02 (UTC+8))
#37442 [Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding — bug,ready,v1,nvidia — by elvircrn (created: 2026-03-18 22:28 (UTC+8))
#37490 Pass min/max to mark unbacked. — llama,qwen — by laithsakka (created: 2026-03-19 06:31 (UTC+8))
#37492 add shape_id specs to several models — llama,qwen — by laithsakka (created: 2026-03-19 06:40 (UTC+8))
#37493 add unbacked handling for braodcast logic in quant_utils.py — no labels — by laithsakka (created: 2026-03-19 06:47 (UTC+8))
#37491 [Build] Update CUTLASS revision from v4.2.1 to v4.4.2 — ready,ci/build,nvidia — by meena-at-work (created: 2026-03-19 06:32 (UTC+8))
#37458 Don’t log exc_info when vLLM tries to doenload a file that doesn’t exist — no labels — by hmellor (created: 2026-03-19 00:35 (UTC+8))
#37488 [Feature] EPLB Support for GPU Model Runner v2 — ready,v1 — by yewentao256 (created: 2026-03-19 06:24 (UTC+8))
#37497 [release 2.11] Update torch 211 - debug — rocm,ci/build,cpu,nvidia — by atalman (created: 2026-03-19 07:42 (UTC+8))
#37485 [Perf] Disable inductor runtime asserts by default for serving perfor… — no labels — by tianrengao (created: 2026-03-19 05:45 (UTC+8))
#37496 [PERF] Extend NCCL symmetric memory to AllGather and ReduceScatter — nvidia — by samnordmann (created: 2026-03-19 07:10 (UTC+8))
#37495 [ROCm] Add VLLM_ROCM_W8A8_TRITON_MAX_M env var for CK/Triton GEMM rou… — rocm — by rbrugaro-amd (created: 2026-03-19 07:05 (UTC+8))
#37494 [Bugfix] fix AWQ layer to dispatch the correct kernel with torch.compile() — bug — by amd-xavierwang (created: 2026-03-19 06:52 (UTC+8))
#37462 Fix NVFP4-quantized MoE checkpoint support for Step-3.5 Flash — no labels — by meenchen (created: 2026-03-19 00:58 (UTC+8))
#37475 [BugFix] Allow qk_nope_head_dim=192 in FlashInfer MLA backend checks — bug,ready,v1,nvidia — by kjiang249 (created: 2026-03-19 04:23 (UTC+8))
#37487 [V0 Deprecation] Refactor kv cache from list to element — rocm,ready,v1,qwen,deepseek,kv-connector — by yewentao256 (created: 2026-03-19 06:19 (UTC+8))
#37465 [Bugfix] Remove assertion for NVFP4 scale dynamic range — bug,ready — by mgoin (created: 2026-03-19 01:22 (UTC+8))
#37484 [Bugfix] Fix flaky entrypoint logitproc test forced to spawn - CI failures — bug,v1 — by wojciech-wais (created: 2026-03-19 05:39 (UTC+8))
#37489 [WIP] SP+AsyncTP piecewise compilation fix + per-matmul heuristic gating — performance — by tianrengao (created: 2026-03-19 06:24 (UTC+8))
#37483 [CI] Fix realtime WebSocket timeout deadlock and unhandled model validation errors — frontend — by AndreasKaratzas (created: 2026-03-19 05:34 (UTC+8))
#37482 [Bugfix] Fix weight transfer tests using stale envs cache - CI test failures — bug — by wojciech-wais (created: 2026-03-19 05:26 (UTC+8))
#37481 [XPU] enable is_act_and_mul for xpu — no labels — by xuechendi (created: 2026-03-19 05:25 (UTC+8))
#37476 [Feat][RL] IPC weight sync optimizations: multigpu support and packed tensors — documentation,v1,nvidia — by hao-aaron (created: 2026-03-19 05:13 (UTC+8))
#37398 Fix models which use layer_type_validation for Transformers v5 — ready — by hmellor (created: 2026-03-18 16:21 (UTC+8))
#37432 Add API docs link if the CLI arg is a config class — ready — by hmellor (created: 2026-03-18 21:12 (UTC+8))
#37479 [Perf] Disable inductor size_asserts by default for serving performance — no labels — by tianrengao (created: 2026-03-19 05:22 (UTC+8))
#37416 [Kernel] Mamba support different layout for Conv state — needs-rebase — by NickLucche (created: 2026-03-18 18:55 (UTC+8))
#37478 [CI] Update mergify tool-calling label paths — ci/build — by sfeng33 (created: 2026-03-19 05:21 (UTC+8))
#37477 [Bugfix] Lower spec decode match threshold from 66% to 60% to increase chances of test pass on CI — bug,v1 — by wojciech-wais (created: 2026-03-19 05:19 (UTC+8))
#37469 [perf][cpu] Accelerate BF16 GELU with LUT impl on Arm CPUs — ci/build,cpu — by fadara01 (created: 2026-03-19 03:06 (UTC+8))
#37460 [Metrics][Core] Add PrefillStats to EngineCoreOutputs — ready,v1,kv-connector — by markmc (created: 2026-03-19 00:52 (UTC+8))
#37470 [renderer][ez] combine render_chat_async, render_chat — no labels — by qandrew (created: 2026-03-19 03:09 (UTC+8))
#37466 Cap the number of API servers to 1 when using Elastic EP. — frontend,ready — by SageMoore (created: 2026-03-19 01:32 (UTC+8))
#37411 fix(jais): only apply ALiBi when position_embedding_type is ‘alibi’ — no labels — by jigangz (created: 2026-03-18 18:12 (UTC+8))
#37473 [Core][V1] support min_characters for SamplingParams — frontend,v1 — by Xarbirus (created: 2026-03-19 03:33 (UTC+8))
#37474 jais: only enable ALiBi when position_embedding_type == “alibi” — no labels — by ahmedabbas104 (created: 2026-03-19 04:21 (UTC+8))
#37427 [Bugfix] Fix ROCm crash in qwen3_next multi-stream events (#36795) — bug,rocm,ready,qwen — by JartX (created: 2026-03-18 20:32 (UTC+8))
#37461 [Bug] Fix FlashInfer allreduce fusion workspace uninitialized error — bug — by wzhao18 (created: 2026-03-19 00:55 (UTC+8))
#37439 [Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation — bug,ready,qwen — by cnyvfang (created: 2026-03-18 22:06 (UTC+8))
#37433 [Responses API] tool_choice support (auto / required / none) for GPT-OSS — frontend,gpt-oss — by will-deines (created: 2026-03-18 21:13 (UTC+8))
#37407 [Bugfix] Fix Nemotron Parse loading — bug,rocm,ready,multi-modality — by DarkLight1337 (created: 2026-03-18 17:50 (UTC+8))
#37380 [Bugfix] Decode prompt text from token IDs upstream in renderer — bug,frontend — by karanb192 (created: 2026-03-18 12:16 (UTC+8))
#37456 [Model] Remove unnecessary processor definition for Nemotron Parse — ready — by DarkLight1337 (created: 2026-03-19 00:23 (UTC+8))
#37457 [Misc] Clean up model registry — new-model,ready — by DarkLight1337 (created: 2026-03-19 00:25 (UTC+8))
#37464 Fix NVFP4 weight scale underflow in BF16 dequantization — needs-rebase — by mgoin (created: 2026-03-19 01:17 (UTC+8))
#37450 [Perf] Vectorize chunk_local_cumsum for GDN prefill — no labels — by ZJY0516 (created: 2026-03-18 23:43 (UTC+8))
#37455 Revert “[Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow” — bug,ready — by mgoin (created: 2026-03-19 00:19 (UTC+8))
#37446 support firered_aed_l model — new-model — by MengLeebin (created: 2026-03-18 22:57 (UTC+8))
#37434 Automatically add links to API docs for matching strings in docs — documentation — by hmellor (created: 2026-03-18 21:27 (UTC+8))
#37453 [ROCm] Fix GPT-OSS import for triton 3.6 — rocm,gpt-oss — by gshtras (created: 2026-03-19 00:13 (UTC+8))
#37410 Fix SM121 GB10 FP4 quantization sticky CUDA error — nvidia — by xueliangyang-oeuler (created: 2026-03-18 18:12 (UTC+8))
#37438 [Bugfix] Add Kimi-K2.5 reasoning/tool parser aliases and tool_call_id support — bug,frontend — by DorBernsohn (created: 2026-03-18 21:52 (UTC+8))
#37417 support firered_aed_l model — new-model,needs-rebase — by MengLeebin (created: 2026-03-18 19:11 (UTC+8))
#37436 [Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation — bug,qwen — by cnyvfang (created: 2026-03-18 21:36 (UTC+8))
#37445 [Bugfix][Model] Fix Kimi K2 tool parser 8KB section limit and simplify streaming state — bug — by jscaldwell55 (created: 2026-03-18 22:34 (UTC+8))
#37376 fused qknorm+rope kernel optimization for SM9.0 — no labels — by EricccYang (created: 2026-03-18 12:05 (UTC+8))
#37421 [Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode — performance,v1,deepseek,nvidia — by LopezCastroRoberto (created: 2026-03-18 19:46 (UTC+8))
#37401 feat: auto-calculate optimal max_num_seqs via startup decode profiling — v1 — by effortprogrammer (created: 2026-03-18 16:45 (UTC+8))
#37405 [kv_offload+HMA][6/N]: Split offloading_connector.py — ready,v1,kv-connector — by orozery (created: 2026-03-18 17:42 (UTC+8))
#37428 [W.I.P] fragmentation_buffer in profiling — v1 — by panpan0000 (created: 2026-03-18 20:41 (UTC+8))
#37426 fix CUDAGraph memory being counted twice — v1,nvidia — by panpan0000 (created: 2026-03-18 20:30 (UTC+8))
#37424 [Responses API] Add kv_transfer_params for PD disaggregation — frontend,kv-connector — by bongwoobak (created: 2026-03-18 20:20 (UTC+8))
#37420 Fix redundancy_buffer_memory not taken account in determine_available_memory() — v1 — by panpan0000 (created: 2026-03-18 19:44 (UTC+8))
#37418 [Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility for defer_input_quant — bug,rocm — by Duyi-Wang (created: 2026-03-18 19:23 (UTC+8))
#37408 [ROCm][Quantization] fallback trust_remote_code=True in Quark config for some cases — rocm — by xuebwang-amd (created: 2026-03-18 17:51 (UTC+8))
#37414 [Bugfix]enable_thinking: False, the Qwen3.5 model returns an error in the content stream. — bug,qwen — by xyDong0223 (created: 2026-03-18 18:45 (UTC+8))
#37412 Fix Spec Decode + NCCL Illegal Memory Access — v1 — by xueliangyang-oeuler (created: 2026-03-18 18:19 (UTC+8))
#37415 [MISC] fix pin_memory=torch.cuda.is_available(), use is_pin_memory_available — structured-output,v1,nvidia — by jikunshang (created: 2026-03-18 18:48 (UTC+8))
#37413 [Core][Feat] Add max-waiting-queue-time parameter to reject requests — frontend,v1 — by chaunceyjiang (created: 2026-03-18 18:24 (UTC+8))
#37409 Fix KV Offloading + MLA AssertionError in get_kv_cache_shape — v1 — by xueliangyang-oeuler (created: 2026-03-18 17:57 (UTC+8))
#37377 [Feature] Add LoRA support for Qwen3ASRForConditionalGeneration — documentation,qwen — by karanb192 (created: 2026-03-18 12:11 (UTC+8))
#37403 Fix tensor size mismatch in per-channel weight scale loading for MoE … — no labels — by xueliangyang-oeuler (created: 2026-03-18 16:59 (UTC+8))
#37399 [1/N][Spec Decode] Unify propose interface — speculative-decoding,v1 — by wangxiyuan (created: 2026-03-18 16:31 (UTC+8))
#37384 [Bugfix][Tool Parser] Fix Kimi-K2.5 parser accuracy, buffer limits, and token leaks — bug — by karanb192 (created: 2026-03-18 12:29 (UTC+8))
#37385 [Bugfix] Fix GLM-4 tool parser double serialization in streaming — bug,documentation — by karanb192 (created: 2026-03-18 12:31 (UTC+8))
#37382 [Bugfix] Cache tokenizer calls in tool parsers to prevent concurrent RuntimeError — bug — by karanb192 (created: 2026-03-18 12:18 (UTC+8))
#37379 [Bugfix][Tool Parser] Remove hardcoded 8K section char limit in Kimi-K2 tool parser — bug — by karanb192 (created: 2026-03-18 12:14 (UTC+8))
#37378 [Misc] Improve DCP error messages with actionable guidance — v1 — by karanb192 (created: 2026-03-18 12:13 (UTC+8))
#37383 [Doc] Add comprehensive --speculative-config documentation — documentation — by karanb192 (created: 2026-03-18 12:23 (UTC+8))
#37395 Fix piecewise CUDA graph bugs in split_graph and cuda_graph replay — nvidia — by xueliangyang-oeuler (created: 2026-03-18 16:00 (UTC+8))
#37375 [LoRA] Make LoRA respect language_model_only — ready — by jeejeelee (created: 2026-03-18 11:40 (UTC+8))
#37391 [Bugfix] Avoid OpenMP thread reallocation in CPU torch compile — bug,ready,cpu — by bigPYJ1151 (created: 2026-03-18 14:07 (UTC+8))
#37394 [WIP][CT][XPU] Add W8A16 FP8 MoE Support — qwen — by Zhenzhong1 (created: 2026-03-18 15:14 (UTC+8))
#37390 Fix Quark OCP-MX W4A6 support: dequant dtype + apply_weights — no labels — by vecheruk-amd (created: 2026-03-18 14:05 (UTC+8))
#37393 [Bugfix] Add ENGINE_CONTEXT to get_batch_defaults() — bug — by shawnghu (created: 2026-03-18 14:44 (UTC+8))
#37388 [Bugfix][Structured Output] Fix structural_tag bitmask not applied on reasoning models — bug,structured-output,v1 — by CatherineSue (created: 2026-03-18 13:06 (UTC+8))
#37389 Fix placeholder_block_size undefined error in initialize_kv_cache — v1 — by xueliangyang-oeuler (created: 2026-03-18 13:18 (UTC+8))
#37381 [Doc] Fix inconsistent hash notation in Prefix Caching diagram — documentation — by karanb192 (created: 2026-03-18 12:16 (UTC+8))
#37374 [Perf] Optimize hidden state extraction logic — performance,speculative-decoding,v1,kv-connector — by benchislett (created: 2026-03-18 11:36 (UTC+8))

Merged PRs

#37386 fix(glm47): improve tool call parsing and content normalization — ready — by karanb192 (merged: 2026-03-18 16:12 (UTC+8))
#37449 [bugfix][async scheduling] fix extra cuda context in device 0 with EP/DP — bug,ready,v1,nvidia — by youkaichao (merged: 2026-03-19 02:32 (UTC+8))
#37347 [Perf] Optimize token_embed for pooling models, 1.0% token throughput improvement — ready,v1 — by yewentao256 (merged: 2026-03-19 10:27 (UTC+8))
#37024 [bug] Fix deadlock with pause resume and collective_rpc — bug,ready,v1 — by hao-aaron (merged: 2026-03-19 09:49 (UTC+8))
#37237 [Model Runner V2] Spec decode rejection sampler logprobs support — ready,v1 — by TheEpicDolphin (merged: 2026-03-19 09:36 (UTC+8))
#37334 [BUG] Exclude SKIP_TENSORS from get_layer_size() + new weight sync example for dpep — bug,documentation,ready — by hao-aaron (merged: 2026-03-19 08:45 (UTC+8))
#36267 [EPLB] Simplify EPLB rearrange by only returning one map — ready — by SageMoore (merged: 2026-03-19 08:34 (UTC+8))
#37238 [Model Runner V2] Spec decode rejection sampler greedy support — ready,v1 — by TheEpicDolphin (merged: 2026-03-19 06:59 (UTC+8))
#37442 [Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding — bug,ready,v1,nvidia — by elvircrn (merged: 2026-03-19 08:28 (UTC+8))
#36477 [XPU]Unify xpu test dependencies in dockerfile.xpu — ready,ci/build — by 1643661061leo (merged: 2026-03-19 08:12 (UTC+8))
#37231 [Bugfix] Expand quantization method support in perf metrics — bug,ready,v1 — by thillai-c (merged: 2026-03-19 07:54 (UTC+8))
#37054 [Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype “auto” leading to gibberish — bug,documentation,ready,v1,nvidia — by andylolu2 (merged: 2026-03-19 07:07 (UTC+8))
#37465 [Bugfix] Remove assertion for NVFP4 scale dynamic range — bug,ready — by mgoin (merged: 2026-03-19 06:37 (UTC+8))
#36928 [LoRA][BugFix] Fix skipped LoRA adapters for Mistral3 — bug,ready — by WoosukKwon (merged: 2026-03-19 06:34 (UTC+8))
#37398 Fix models which use layer_type_validation for Transformers v5 — ready — by hmellor (merged: 2026-03-19 02:41 (UTC+8))
#37432 Add API docs link if the CLI arg is a config class — ready — by hmellor (merged: 2026-03-19 01:19 (UTC+8))
#34733 fix(worker): optimize swap_states to copy only active token prefixes — ready,v1 — by pjo256 (merged: 2026-03-19 05:59 (UTC+8))
#37195 [V0 Deprecation] Deprecate virtual engine — ready,v1,qwen,kv-connector — by yewentao256 (merged: 2026-03-19 05:30 (UTC+8))
#36671 chunk parakeet into 30s clips to prevent OOMs on long audios — ready — by netanel-haber (merged: 2026-03-19 05:22 (UTC+8))
#30647 [Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE — ready,ci/build,gpt-oss,nvidia — by elvischenv (merged: 2026-03-18 23:01 (UTC+8))
#37427 [Bugfix] Fix ROCm crash in qwen3_next multi-stream events (#36795) — bug,rocm,ready,qwen — by JartX (merged: 2026-03-19 04:06 (UTC+8))
#37439 [Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation — bug,ready,qwen — by cnyvfang (merged: 2026-03-19 02:36 (UTC+8))
#37456 [Model] Remove unnecessary processor definition for Nemotron Parse — ready — by DarkLight1337 (merged: 2026-03-19 02:25 (UTC+8))
#37457 [Misc] Clean up model registry — new-model,ready — by DarkLight1337 (merged: 2026-03-19 02:24 (UTC+8))
#37340 [Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement — performance,ready,qwen — by yewentao256 (merged: 2026-03-19 02:18 (UTC+8))
#37328 [CI] Fix PaddleOCR-VL HF test failure due to create_causal_mask API rename — ready,multi-modality — by AndreasKaratzas (merged: 2026-03-18 17:44 (UTC+8))
#36642 [kv_offload+HMA][2/N]: Support multiple KV groups in GPULoadStoreSpec — ready,v1,kv-connector — by orozery (merged: 2026-03-19 01:26 (UTC+8))
#36057 Adding deterministic lora benchmarking to vLLM Bench — performance,ready — by RonaldBXu (merged: 2026-03-19 00:02 (UTC+8))
#37371 standardize load_weights using AutoWeightsLoader for kimi_linear and minimax_text_01 — ready — by XLiu-2000 (merged: 2026-03-18 23:05 (UTC+8))
#37205 [Kernel] Add gpt-oss Router GEMM kernel — performance,ready,ci/build,gpt-oss,nvidia — by xyang16 (merged: 2026-03-18 23:15 (UTC+8))
#37313 [Log] Reduce duplicate log — ready,v1,qwen,nvidia — by yewentao256 (merged: 2026-03-18 22:57 (UTC+8))
#37324 [2/3] Refactor InternVL-based processors — speculative-decoding,ready,multi-modality,qwen — by DarkLight1337 (merged: 2026-03-18 22:22 (UTC+8))
#36330 elastic_ep: Fix stateless group port races — ready,ci/build,v1 — by itayalroy (merged: 2026-03-18 22:36 (UTC+8))
#37405 [kv_offload+HMA][6/N]: Split offloading_connector.py — ready,v1,kv-connector — by orozery (merged: 2026-03-18 21:42 (UTC+8))
#37301 [Bugfix] Fix base64 JPEG video frames returning empty metadata — bug,ready,multi-modality — by he-yufeng (merged: 2026-03-18 21:40 (UTC+8))
#36051 [NIXL][Bugfix] metrics & testing minor bug — bug,documentation,performance,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by andylolu2 (merged: 2026-03-18 21:39 (UTC+8))
#31696 [Model] Enable LoRA support for tower and connector in H2OVL — documentation,ready,multi-modality — by shwetha-s-poojary (merged: 2026-03-18 21:26 (UTC+8))
#37322 [Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy — bug,ready — by elvircrn (merged: 2026-03-18 18:30 (UTC+8))
#32316 [Build] Bump python openai version — frontend,ready,ci/build — by chaunceyjiang (merged: 2026-03-18 18:20 (UTC+8))
#36188 [docs] Add docs for new RL flows — documentation,ready,ci/build — by hao-aaron (merged: 2026-03-18 17:04 (UTC+8))
#37375 [LoRA] Make LoRA respect language_model_only — ready — by jeejeelee (merged: 2026-03-18 15:53 (UTC+8))
#37391 [Bugfix] Avoid OpenMP thread reallocation in CPU torch compile — bug,ready,cpu — by bigPYJ1151 (merged: 2026-03-18 15:51 (UTC+8))
#34805 [kv_offload+HMA][0/N]: Support block-level preemption handling — ready,v1,kv-connector — by orozery (merged: 2026-03-18 14:49 (UTC+8))
#37179 [XPU] skip unsupported ut and update test_nixl_connector — rocm,speculative-decoding,ready,ci/build,v1,tool-calling,kv-connector — by zhenwei-intel (merged: 2026-03-18 13:32 (UTC+8))
#37130 [responsesAPI] parser.extract_response_outputs can take in token IDs — frontend,ready — by qandrew (merged: 2026-03-18 13:31 (UTC+8))
#37335 [CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset — rocm,ready,v1 — by AndreasKaratzas (merged: 2026-03-18 13:26 (UTC+8))
#37349 [ROCm][CI] Add ROCM_EXTRA_ARGS to audio_in_video test server fixture — rocm,ready — by AndreasKaratzas (merged: 2026-03-18 12:55 (UTC+8))
#36924 [Hardware][TPU] Add supports_async_scheduling() method to Executor interface so that it can be extended for Executor implementations. — ready,v1,ready-run-all-tests — by gxd3 (合并于: 2026-03-18 12:53 (UTC+8))

PRs Closed Without Merging

#37509 fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming — frontend — by cdpath (closed: 2026-03-19 11:27 (UTC+8))
#27352 [cmake] fix ROCm hip/clr build on platforms without GPUs attached — rocm,ci/build,stale — by evil0sheep (closed: 2026-03-19 11:08 (UTC+8))
#27397 [FusedMoE] Remove cuda hard-code in dual stream execution — needs-rebase,stale,nvidia — by wxsIcey (closed: 2026-03-19 11:08 (UTC+8))
#28205 [build] fix: only build FA3 for Hopper — needs-rebase,ci/build,stale — by AlpinDale (closed: 2026-03-19 11:08 (UTC+8))
#28446 [CLI] fix unicode encode error for vllm chat/complete command input — frontend,stale — by Iceber (closed: 2026-03-19 11:07 (UTC+8))
#28710 Update tpu dockerfile — structured-output,tpu,ci/build,stale,v1,nvidia — by ernie-chang (closed: 2026-03-19 11:07 (UTC+8))
#37493 add unbacked handling for braodcast logic in quant_utils.py — no labels — by laithsakka (closed: 2026-03-19 08:10 (UTC+8))
#37496 [PERF] Extend NCCL symmetric memory to AllGather and ReduceScatter — nvidia — by samnordmann (closed: 2026-03-19 07:12 (UTC+8))
#37479 [Perf] Disable inductor size_asserts by default for serving performance — no labels — by tianrengao (closed: 2026-03-19 05:47 (UTC+8))
#26947 [Compressed Tensors] Remove parameter conversion for sparse24 — no labels — by kylesayrs (closed: 2026-03-19 05:26 (UTC+8))
#37464 Fix NVFP4 weight scale underflow in BF16 dequantization — needs-rebase — by mgoin (closed: 2026-03-19 01:19 (UTC+8))
#37450 [Perf] Vectorize chunk_local_cumsum for GDN prefill — no labels — by ZJY0516 (closed: 2026-03-19 01:26 (UTC+8))
#37455 Revert "[Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow" — bug,ready — by mgoin (closed: 2026-03-19 01:21 (UTC+8))
#24725 [Bugfix] "UNKNOWN project" issue when installing vLLM in editable mode — bug,ci/build,unstale — by Shawn314 (closed: 2026-03-19 01:11 (UTC+8))
#35674 [Hardware][Power] Add IBM POWER8 (ppc64le) CPU backend support — documentation,speculative-decoding,ci/build,v1,cpu — by Scottcjn (closed: 2026-03-19 00:45 (UTC+8))
#36597 [CI/Build] enable Intel XPU test flow with prebuilt image — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by wendyliu235 (closed: 2026-03-18 23:06 (UTC+8))
#37417 support firered_aed_l model — new-model,needs-rebase — by MengLeebin (closed: 2026-03-18 22:38 (UTC+8))
#37436 [Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation — bug,qwen — by cnyvfang (closed: 2026-03-18 21:53 (UTC+8))
#37124 [Bugfix] Fix KV cache overestimation for hybrid Mamba/attention model… — bug,documentation,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by swtb3 (closed: 2026-03-18 20:54 (UTC+8))
#37326 [Bugfix] Fix unreachable structured_outputs + tool_choice conflict check — bug,frontend,tool-calling — by umut-polat (closed: 2026-03-18 20:25 (UTC+8))
#35666 [Misc] Use VLLMValidationError consistently in completion and chat completion protocol — frontend — by umut-polat (closed: 2026-03-18 20:25 (UTC+8))
#34265 [Attention][Perf][Kernel] Improve topKperRow for large context decode path - DeepSeek-V3.2 sparse attention — performance,rocm,v1,deepseek,nvidia — by LopezCastroRoberto (closed: 2026-03-18 19:47 (UTC+8))
#32678 [Model] Qwen3-Next Splitting GDN attention calculation in mixed batches of Prefill and Decode — qwen — by xyDong0223 (closed: 2026-03-18 18:08 (UTC+8))
#29438 [Kernels] Improve Triton fp8 block scaled kernel — performance,stale — by lgeiger (closed: 2026-03-18 18:05 (UTC+8))
#31468 [Draft]feat: Add per-layer MLP size support for Qwen pruning — documentation,needs-rebase,qwen — by CedricHwong (closed: 2026-03-18 17:26 (UTC+8))
#37388 [Bugfix][Structured Output] Fix structural_tag bitmask not applied on reasoning models — bug,structured-output,v1 — by CatherineSue (closed: 2026-03-18 13:46 (UTC+8))
#37053 fused qknorm+rope kernel optimization for SM9.0 — no labels — by EricccYang (closed: 2026-03-18 12:04 (UTC+8))
#37063 Fuse fp8 quant merge attn states vec128 — performance,v1 — by carlyou (closed: 2026-03-18 11:59 (UTC+8))
#36384 Fix/fla triton autotune oom 34954 — qwen — by oneraghavan (closed: 2026-03-18 11:52 (UTC+8))

LLM Dev Highlights