vLLM Development Activity Report - 2026-03-17

Time window: 2026-03-17 11:35 (UTC+8) ~ 2026-03-18 11:35 (UTC+8)
Statistics: New issues 32 | Closed issues 10 | New PRs 87 | Merged PRs 47 | PRs closed without merging 18

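📊 Daily Development Status Summary

Over the past 24 hours the vLLM project remained very active, with a large number of PRs opened and merged, reflecting strong community contribution. Work centered on performance optimization (especially for new hardware and quantization formats), bug fixes (spanning scheduling, kernels, and the frontend API), and broader platform support, in particular continued improvements for the AMD ROCm ecosystem. Several high-traffic discussions exposed the complications of pushing for peak performance, such as latency problems caused by scheduling policy and design trade-offs in new quantization infrastructure.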

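🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was heavy in this window, reflecting continued investment in optimizing vLLM for AMD hardware:

INT4 weight inference support (PR #37352): Contributor jatseng-ai added a Triton-based W4A16 GEMM kernel for the AMD MI300 platform, with support for symmetric and asymmetric grouped quantization, extending low-precision inference on ROCm.
Nightly build and release flow (PR #37283): AMD employee tjtanaa enabled nightly Docker image and wheel releases for ROCm, an infrastructure improvement that eases development and deployment on AMD platforms.
MXFP4 and LoRA support (PR #37268): Contributor ChuanLi1101 enabled LoRA adapters for MXFP4-quantized MoE models on ROCm (MI300X/MI325X/MI355X) and added device ID mappings for CDNA4 (MI350X/MI355X), preparing for the next generation of AMD hardware.
AITER MLA decode optimization (PR #37353): The same contributor optimized the decode path for MLA models with a BF16 KV cache by skipping an unnecessary repeat_interleave when nhead < 16, with a potential 4x decode speedup for certain configurations (such as Kimi-Linear-48B at TP=8).
AITER FP8xFP8 attention support (PR #36927, merged): Contributor divakar-amd added FP8xFP8 attention support to rocm_aiter_unified_attn, adjusting scale handling so that it works with an FP8 KV cache.
ROCm-related fixes:
- Closed Issue #33666: RotaryEmbedding+KVCache ops failed to pattern match on ROCm. The reporter closed it, expecting vLLM IR to resolve the problem.
- Merged PR #36720: Fixed a worker startup OOM on ROCm caused by inaccurate CUDA graph memory profiling, by skipping the unreliable profiling path on ROCm.
- PR #37329, #37331: AMD employee mgehre-amd corrected the handling of channelwise quantization (group_size=-1). PR #37329 fixes it in ConchLinearKernel, and PR #37331 makes ExllamaLinearKernel reject it (group_size <= 0), so that quantized models behave correctly on AMD platforms.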
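
The grouped and channelwise quantization cases mentioned above can be illustrated with a minimal pure-Python sketch of W4A16-style weight dequantization. This is not the Triton kernel from PR #37352 or the code touched by PR #37329. The function names, the unsigned 4-bit storage, and the fixed zero point of 8 for the symmetric case are all assumptions made for illustration.

```python
from typing import List, Optional


def resolve_group_size(group_size: int, k: int) -> int:
    """Return the effective group length along K.

    group_size == -1 is treated as channelwise quantization: one group
    that spans the whole K dimension (assumed convention).
    """
    if group_size == -1:
        return k
    if group_size <= 0 or k % group_size != 0:
        raise ValueError(f"invalid group_size {group_size} for K={k}")
    return group_size


def dequantize_column(
    codes: List[int],
    scales: List[float],
    zeros: Optional[List[int]],
    group_size: int,
) -> List[float]:
    """Dequantize one weight column stored as unsigned 4-bit codes (0..15).

    scales holds one scale per group. zeros holds one zero point per group
    for asymmetric quantization; None selects the symmetric case, which
    here assumes a fixed zero point of 8.
    """
    g = resolve_group_size(group_size, len(codes))
    out = []
    for i, code in enumerate(codes):
        grp = i // g  # index of the group this element belongs to
        zp = 8 if zeros is None else zeros[grp]
        out.append((code - zp) * scales[grp])
    return out


# Symmetric, two groups of two elements each.
sym = dequantize_column([8, 9, 7, 15], [0.5, 2.0], None, 2)
# Asymmetric, channelwise: a single scale and zero point cover all of K.
chan = dequantize_column([0, 15, 3], [1.0], [3], -1)
```

Resolving the -1 sentinel once, before any indexing, is the point of the sketch: a kernel that divides by group_size without that step would compute group indices from a negative divisor.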

💬 High-Activity Discussion Analysis

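RFC: Extensible quantized KV cache infrastructure (Issue #37319)
- Core topic: JartX proposed a design for unified infrastructure serving KV cache formats such as INT8 and NVFP4 that need per-token/per-head quantization scales, as opposed to the existing FP8 scheme with its per-tensor scale.
- Views and disputes: Core maintainer mgoin criticized the RFC as "sloppy", saying it read as AI-generated and had not been reviewed carefully. He pointed to factual errors (for example, torch.nvfp4 does not exist) and to an underestimate of NVFP4's complexity (it needs one scale per group of 16 elements). The dispute is over whether the design is complete and whether it accounts properly for newer formats such as NVFP4.
- Current status: Discussion is open, the design has been challenged, and the proposer needs to make major revisions.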
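
To make the difference between the scale granularities in this discussion concrete, the sketch below counts how many scale values each scheme would store for one K (or V) cache tensor. It is illustrative arithmetic only, not a proposal from the RFC. The function name, the scheme labels, and the example shape are assumptions, and the per-block case models only the "one scale per 16 elements" requirement mentioned in the thread.

```python
def num_scales(num_tokens: int, num_heads: int, head_dim: int,
               scheme: str, block: int = 16) -> int:
    """Count the scale values needed for a [num_tokens, num_heads, head_dim] cache."""
    if scheme == "per_tensor":
        # One scale for the whole tensor (the existing FP8 approach).
        return 1
    if scheme == "per_token_head":
        # One scale per (token, head) pair.
        return num_tokens * num_heads
    if scheme == "per_block":
        # One scale per group of `block` consecutive elements of head_dim.
        if head_dim % block != 0:
            raise ValueError("head_dim must be a multiple of the block size")
        return num_tokens * num_heads * (head_dim // block)
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical shape: 1024 cached tokens, 8 KV heads, head_dim 128.
counts = {
    s: num_scales(1024, 8, 128, s)
    for s in ("per_tensor", "per_token_head", "per_block")
}
```

For this hypothetical shape the counts are 1, 8192, and 65536, which is why scale storage and layout become part of the cache design rather than a side detail.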
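Severe head-of-line blocking under prefix caching (Issue #37308)
- Core topic: Through fuzz testing, user Yunzez found that with prefix caching enabled, when large requests (long prompts) and tiny requests arrive concurrently, the TTFT (time to first token) of the tiny requests can degrade by up to 147x.
- Views: The reporter supplied detailed reproduction steps and performance data, and described the problem as a silent performance regression caused by a flaw in the scheduling logic. Another user, FocusMode-coder, offered to help.
- Key point: The root cause differs from that of another head-of-line blocking bug in chunked prefill (#37076), which shows how hard it is to foresee bottlenecks when caching and scheduling policies interact.
- Current status: The issue is open and needs attention from core scheduler developers.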
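
As a toy illustration of head-of-line blocking in general, the sketch below models prefills that run strictly one after another in arrival order. It is not vLLM's scheduler, and the costs are made-up units rather than measurements from the issue.

```python
from typing import List


def ttft_fifo(prefill_costs: List[int]) -> List[int]:
    """Time to first token per request when prefills are serialized in arrival order."""
    ttfts = []
    clock = 0
    for cost in prefill_costs:
        clock += cost          # this request's prefill finishes at `clock`
        ttfts.append(clock)
    return ttfts


# A tiny request (cost 1) served alone versus queued behind a long prompt (cost 200).
alone = ttft_fifo([1])[0]
behind_long = ttft_fifo([200, 1])[1]
slowdown = behind_long / alone
```

In this model the tiny request's TTFT is dominated entirely by the request ahead of it, which is the pattern the issue reports for mixed batches.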
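DeepSeek-R1 NVFP4 accuracy drop (Issue #37302)
- Core topic: User elvircrn reported a significant drop (~1.4%) in GSM8K evaluation accuracy for the DeepSeek-R1 NVFP4 model after a particular vLLM commit.
- Views and investigation: robertgshaw2-redhat suggested that the recently added kernels for shared experts and QKV projection could be the cause. elvircrn then bisected to a specific commit (821eb80c0d), and reverting it restored accuracy. This raised concerns about the correctness or numerical stability of the new kernels.
- Current status: The issue is open and the root cause is still under investigation. A kernel regression is possible.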
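
The bisection step described above follows the usual binary-search pattern, sketched here in Python. The commit names and the is_good predicate are hypothetical. In a real investigation the predicate would check out the commit and run the evaluation, and git bisect automates the same search.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the first bad commit in an oldest-to-newest history.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1   # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Hypothetical history in which the regression lands in "c3".
history = ["c0", "c1", "c2", "c3", "c4", "c5"]
culprit = first_bad(history, lambda c: c in {"c0", "c1", "c2"})
```

The search needs about log2(N) evaluations, which matters when each one is a full benchmark run.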
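Host OOM during FlashInfer JIT compilation (Issue #37279)
- Core topic: User stephenmcconnachie reported that when running vLLM on an H100, FlashInfer's compilation of the gdn_prefill_sm90 kernel launched too many concurrent nvcc/cicc processes and exhausted host memory.
- Views and resolution: The reporter traced the root cause to excessive parallelism in FlashInfer's JIT compilation and found a workaround himself: using a precompiled FlashInfer package instead of compiling at runtime, which avoids the memory spike.
- Conclusion: The user documented a working solution and closed the issue, leaving a useful reference for others who hit the same problem.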
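FP8 TRT-LLM MoE kernel regression fix (PR #37346)
- Core topic: wzhao18 fixed a regression introduced by PR #35448, which had added a check that applies only to the Monolithic MoE kernel to the base class, causing the Modular kernel path to fail.
- Discussion: While the fix was being verified, a compatibility problem between --quantization mxfp8 and DeepGEMM surfaced (a separate error). EdalatiAli explained that the DeepGEMM logic does not handle MXFP8 layers correctly (a fix was submitted as PR #37358) and gave a temporary workaround (VLLM_USE_DEEP_GEMM=0).
- Conclusion: After discussion and verification, the PR fixes the core problem, and the process exposed a separate compatibility edge case.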
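
The temporary workaround quoted in that discussion is an environment variable, so it has to be set before the vLLM process starts. A minimal sketch of building such an environment in Python follows. The helper name is hypothetical, and only the variable name and value come from the discussion.

```python
import os
from typing import Dict, Mapping, Optional


def env_without_deep_gemm(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and set VLLM_USE_DEEP_GEMM=0, the workaround from the PR thread."""
    env = dict(os.environ if base is None else base)
    env["VLLM_USE_DEEP_GEMM"] = "0"
    return env


# Example with an explicit base environment so the result is predictable.
env = env_without_deep_gemm({"PATH": "/usr/bin"})
```

The returned mapping can be passed as the env argument of subprocess.run when launching the server, which leaves the parent process environment unchanged.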

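🔥 Hot Topics and Trends

Performance optimization and bug fixes: The most active area. Many PRs target kernel performance (such as PR #37353, #36795), memory and latency optimization (PR #37281, #37347), and fixes for crashes or accuracy problems in edge cases (PR #37329, #37346, #37265).
Platform and hardware support:
- AMD ROCm: Support keeps growing, covering new kernels, new formats, and the release flow, which amounts to a systematic investment.
- New quantization formats: NVFP4 and MXFP4 are the current focus, with dense discussion around KV cache support (Issue #37319, PR #37332) and weight scale fixes (PR #34577).
Frontend and API improvements: Work on OpenAI API compatibility and usability continues, including the logic for merging tool calls with messages in the Responses API (PR #37294, #37299, #37276), the render layer refactor (PR #37287, #37266), and input validation (PR #37326).
Speculative decoding and multimodal model support: Related issues (#37273, #37295) and a PR (#37280) show that these advanced features still run into model compatibility problems and parse errors in real use.
Build and deployment experience: A Docker build failure (Issue #37284), a source installation problem (Issue #37288), and a new CLI tool proposal (PR #37355) show both the community's effort to improve the user experience and the problems it is running into.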

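🛠️ Key Technical Changes

AMD MI300 INT4 weight inference kernel (PR #37352): Adds INT4 quantized weight support on AMD platforms, implemented in Triton with grouped quantization, a key step toward wider use of AMD hardware for low-precision inference.
ROCm nightly release pipeline (PR #37283): Sets up automated, sustainable building and distribution of ROCm releases, lowering the barrier for AMD users to get the latest vLLM, with long-term value for the ecosystem.
NVFP4 KV cache support (PR #37332 and Issue #37319): PR #37332 adds NVFP4 support to reshape_and_cache_flash, the first step toward an NVFP4 KV cache. The extended debate in Issue #37319 shows how hard it is to design general infrastructure for complex quantization formats of this kind, and marks a design decision that will affect future performance and extensibility.
NVFP4 CUTLASS support for non-gated MoE (PR #37320, merged): Extends the CUTLASS NVFP4 MoE kernel to non-gated MoE models such as Nemotron-Nano, widening the range of models the kernel can serve.
Dual-stream input projection for the Qwen3 family (PR #36795, merged): Runs in_proj_qkvz and in_proj_ba on separate CUDA streams in parallel, optimizing the input projection stage of Qwen3/3.5-Next models and yielding a notable end-to-end throughput improvement.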

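📈 Development Activity Observations

Diverse contributors: Besides core maintainers, employees of AMD (mgehre-amd, tjtanaa), Intel (zhenwei-intel), and other companies took part, along with many independent developers.
Efficient merging: 47 PRs were merged within 24 hours, which indicates that code review and merging are running efficiently. Many PRs carry the ready label, which points to a disciplined process.
Fast issue response: Newly reported bugs often get a reply from a developer or maintainer (such as chaunceyjiang, BillionClaw) within a short time, sometimes with a proposed fix.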

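💡 Issues Worth Watching

Scheduler performance cliffs: Issue #37308 and #37343 describe extreme TTFT degradation triggered by specific combinations of concurrency and parameters. These are high-priority stability problems that need in-depth investigation by scheduling experts.
Quantized KV cache infrastructure design: The debate in Issue #37319 suggests that current support for formats such as INT8 and NVFP4 may lack a unified, well-considered design. The community needs to reach consensus to avoid future technical debt.
ARM CPU build error: Issue #37325 reports a compile error when serving a vision-language model on an ARM CPU, which could affect vLLM deployment on a wider range of edge devices.
Modularization and refactoring: PR #37373 (fusion pass factory) and PR #37371 (standardized weight loading), among others, show ongoing modularization and refactoring of the codebase, which matters for long-term maintainability.
Third-party dependency compatibility: Issue #37279 (FlashInfer JIT compilation) and the DeepGEMM/MXFP8 compatibility problem exposed in the PR #37346 discussion are reminders of how hard it is to manage a complex dependency chain.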

📋 Appendix: Detailed Data Lists

New Issues

#37372 Port custom ops to native Inductor multi-stream support | no labels | by xyang16 (created: 2026-03-18 11:12 (UTC+8))
#37273 [Usage]: Failed to run Qwen3 Eagle3 speculate | usage | by Rinascere0 (created: 2026-03-17 16:42 (UTC+8))
#37367 [Bug]: gcc: internal compiler error: Segmentation fault signal terminated program cc1 | bug | by 154461013 (created: 2026-03-18 10:27 (UTC+8))
#37365 [Bug]: gdn prefill kernel errors | bug | by ZTurboX (created: 2026-03-18 09:50 (UTC+8))
#37363 fix(compilation): fix piecewise CUDA graph bugs with splitting_ops | no labels | by Complexity-ML (created: 2026-03-18 09:08 (UTC+8))
#37362 Guidance structured output blocked during thinking with nemotron_v3 reasoning parser (offline LLM.generate) | no labels | by ivnle (created: 2026-03-18 09:00 (UTC+8))
#37359 Guidance backend structured output doesn't work with openai_gptoss reasoning parser (offline LLM.generate) | no labels | by ivnle (created: 2026-03-18 08:15 (UTC+8))
#37350 [Bug]: 'placeholder_block_size' is not defined | bug | by qnfm (created: 2026-03-18 06:37 (UTC+8))
#37282 [CI Failure]: Builds fail? | ci-failure | by denadai2 (created: 2026-03-17 17:59 (UTC+8))
#37319 [RFC]: Extensible Per-Token Quantized KV Cache Scale Infrastructure | RFC | by JartX (created: 2026-03-18 00:05 (UTC+8))
#37344 Pooling API: expose extra_kwargs and allow nested response data for custom poolers | no labels | by latenceainew (created: 2026-03-18 05:05 (UTC+8))
#37343 [Bug]: n_completions + logprobs Causes Significant TTFT Spike for Co-Scheduled Requests on Cold Cache | bug | by Yunzez (created: 2026-03-18 04:40 (UTC+8))
#37342 [Installation]: PyPI release blocks installation on enterprise systems: xgrammar==0.1.29 blocked by security scanners (CVE-2026-25048) | installation | by adamisaac3 (created: 2026-03-18 04:36 (UTC+8))
#37333 [Bug]: Gemma-3 specific heterogeneous TP failures with PD disagg | bug | by yzong-rh (created: 2026-03-18 02:09 (UTC+8))
#37325 [Bug][ARM CPU] Build/Runtime error: no matching function for call to 'at::vec::CPU_CAPABILITY::VecMask<long int, 4>::VecMask(int&)' when serving Qwen3-VL-8B-Instruct | bug,cpu | by micyan01 (created: 2026-03-18 01:03 (UTC+8))
#37302 [Bug]: R1 NVFP4 gsm8k drop in lm_eval | bug | by elvircrn (created: 2026-03-17 22:27 (UTC+8))
#37304 [Bug]: Language Models Test (Extended Generation) test_models[False-False-5-32-bigcode/starcoder2-3b] test issue | bug | by zou3519 (created: 2026-03-17 22:47 (UTC+8))
#37308 [Bug]: Severe Head-of-Line Blocking (147x TTFT) under Prefix Caching with Asymmetric Batches | bug | by Yunzez (created: 2026-03-17 22:54 (UTC+8))
#37305 [RFC]: Enable Intel XPU CI for vLLM | RFC | by wendyliu235 (created: 2026-03-17 22:49 (UTC+8))
#37295 [Bug] gpt-oss-120b + P-EAGLE speculative decoding causes openai_harmony parse errors and severe chat latency regression | bug | by fishingpvalues (created: 2026-03-17 20:25 (UTC+8))
#37279 [Bug]: OOM during FlashInfer JIT compile of gdn_prefill_sm90 on H100 due to many concurrent cicc processes | bug | by stephenmcconnachie (created: 2026-03-17 17:27 (UTC+8))
#37288 [Installation]: Build vllm from source fail | installation | by wyatt-wong (created: 2026-03-17 18:37 (UTC+8))
#37285 [Bug]: PD disaggregation for SSM models requires --no-async-scheduling when TP>1 | bug | by NickLucche (created: 2026-03-17 18:13 (UTC+8))
#37284 [Bug]: Docker Build Failure for Dockerfile.nightly_pytorch | bug | by mathivanansai (created: 2026-03-17 18:08 (UTC+8))
#37274 [Bug]: vLLM serving cannot support video inputs with a list of base64-encoded extracted JPEG frames | bug | by Johere (created: 2026-03-17 16:44 (UTC+8))
#37277 [Bug]: GLM47 Tool Call Bug | bug | by xi1212 (created: 2026-03-17 17:14 (UTC+8))
#37271 [Bug]: In_proj_ba of GDN in Qwen3Next use MergeColumnParallelLinear may cause accuracy decrease? | bug | by SunnyLee151064 (created: 2026-03-17 16:39 (UTC+8))
#37255 [Bug]: Failed to run distributed inference due to error list index out of range in omp_cpuids_list | bug | by wenlujon (created: 2026-03-17 13:33 (UTC+8))
#37261 [Bug]: KeyError when use lmcache layerwise mode in vllm 0.14.0 | bug | by Skyseaee (created: 2026-03-17 15:08 (UTC+8))
#37263 [RFC]: Hotness-aware multi-level KV cache management to accelerate dynamic sparse attention | RFC | by zengchuang-hw (created: 2026-03-17 15:32 (UTC+8))
#37257 [Performance]: vllm and transformer call the same Qwen3-VL-AI4TEST-V1 model, with roughly the same configuration, but the visual label accuracy is 20% lower in testing. | performance | by lky55921 (created: 2026-03-17 14:16 (UTC+8))
#37253 [Bug]: prompt is logged as None in RequestLogItem for gpt-oss-20b (Chat Completion API) | bug | by ugiugi0823 (created: 2026-03-17 13:02 (UTC+8))

Closed Issues

#37282 [CI Failure]: Builds fail? | ci-failure | by denadai2 (closed: 2026-03-18 05:39 (UTC+8))
#30579 [Bug]: CUDA Illegal Memory Access when running Qwen3-Next-80B-A3B-Instruct on 4xB200 GPUs | bug,stale | by BolinSNLHM (closed: 2026-03-18 01:46 (UTC+8))
#33899 [Bug]: DeepSeek-R1-0528 AssertionError: tokens not padded correctly on GB200 | bug | by chaunceyjiang (closed: 2026-03-17 23:16 (UTC+8))
#35686 [Bug][UX]: Unclean shutdown from ctrl-c with AR Fusion | bug,help wanted,good first issue | by robertgshaw2-redhat (closed: 2026-03-17 22:22 (UTC+8))
#37176 [Bug]: weight offloading fails: 'Attention' object has no attribute 'k_scale' | bug | by sfbemerk (closed: 2026-03-17 21:22 (UTC+8))
#37279 [Bug]: OOM during FlashInfer JIT compile of gdn_prefill_sm90 on H100 due to many concurrent cicc processes | bug | by stephenmcconnachie (closed: 2026-03-17 18:55 (UTC+8))
#36215 [Performance]: vLLM is slower than SGLang when deploying the Qwen3.5 model. | performance | by yszhli (closed: 2026-03-17 16:32 (UTC+8))
#37235 [Bug]: Non-monotonic latency at ctx=8192: concurrency 1 slower than concurrency 2 (local vLLM) | bug | by NoahLundSyrdal (closed: 2026-03-17 13:49 (UTC+8))
#33666 [Bug]: RotaryEmbedding+KVCache ops unable to pattern match for ROCmAiterTritonRopeReshapeKVCacheFusionPass | bug,rocm | by Rohan138 (closed: 2026-03-17 13:29 (UTC+8))
#35522 [Bug]: Music Flamingo ValueError: Following weights were not initialized from checkpoint: {'audio_tower.pos_emb.freqs'} | bug | by denadai2 (closed: 2026-03-17 13:24 (UTC+8))

New PRs

#37373 [torch.compile] Refactor Attention Quant Fusion Pass with make_fusion_pass Factory | documentation,needs-rebase,nvidia | by BadrBasowid (created: 2026-03-18 11:29 (UTC+8))
#37301 [Bugfix] Fix base64 JPEG video frames returning empty metadata | bug,multi-modality | by he-yufeng (created: 2026-03-17 22:20 (UTC+8))
#37310 [SSM/Mamba] Follow-up: N-1 prefill for P/D disaggregation | kv-connector | by ZhanqiuHu (created: 2026-03-17 23:02 (UTC+8))
#37360 [Bugfix] AsyncLLM: Add the ability to specify the pooling_task | bug,v1 | by Gamrix (created: 2026-03-18 08:26 (UTC+8))
#37275 [Benchmark] Add iteration benchmark with server-side step stats, trac… | performance,frontend,tpu,v1 | by YJYJLee (created: 2026-03-17 16:53 (UTC+8))
#37371 refactor: standardize load_weights using AutoWeightsLoader for kimi_linear and minimax_text_01 | no labels | by XLiu-2000 (created: 2026-03-18 10:48 (UTC+8))
#37349 [ROCm][CI] Add ROCM_EXTRA_ARGS to audio_in_video test server fixture | rocm,ready | by AndreasKaratzas (created: 2026-03-18 06:07 (UTC+8))
#37330 [ROCm][CI] Skip trtllm kvfp8 dequant tests on ROCm | rocm,ready,nvidia | by AndreasKaratzas (created: 2026-03-18 01:44 (UTC+8))
#37299 [Bugfix] Responses API: merge all tool calls with preceding assistant message | bug,frontend | by weiguangli-io (created: 2026-03-17 22:02 (UTC+8))
#37354 [BugFix] Ensure num_cached_tokens is non-negative for kv transfer failed requests | bug,v1 | by KingsleyZhang123 (created: 2026-03-18 07:10 (UTC+8))
#37370 [WIP][Model Runner V2] Add Encoder Dummy Run | v1 | by santiramos27 (created: 2026-03-18 10:41 (UTC+8))
#37368 [Misc] make test_worker_memory_snapshot platform-aware | v1 | by zhenwei-intel (created: 2026-03-18 10:29 (UTC+8))
#37339 [Core] Add register_model() to KVConnectorBase_V1 for CacheBlend | v1,kv-connector | by zbennett10 (created: 2026-03-18 03:35 (UTC+8))
#37369 fix(cpu): add null check for aligned_alloc in ScratchPadManager | cpu | by yassha (created: 2026-03-18 10:31 (UTC+8))
#37366 My version of their work | performance,frontend,v1 | by YJYJLee (created: 2026-03-18 10:15 (UTC+8))
#37264 [Bugfix] Handle truncate_prompt_tokens in Harmony (GPT-OSS) path | bug,frontend,gpt-oss | by gpwork4u (created: 2026-03-17 15:41 (UTC+8))
#37296 fix marlin fp4 kernel N-dimension alignment | needs-rebase | by flutist (created: 2026-03-17 20:25 (UTC+8))
#37364 [Model Runner V2] fix draft attention metadata generation | v1 | by TheEpicDolphin (created: 2026-03-18 09:43 (UTC+8))
#37351 [Performance] Add --enable-ep-weight-filter CLI option | ready | by esmeetu (created: 2026-03-18 06:39 (UTC+8))
#37346 [Bug] Fix fp8 trtllm MoE modular kernel supported routing methods | bug,ready,nvidia,quantization | by wzhao18 (created: 2026-03-18 05:14 (UTC+8))
#37361 fix(compilation): fix piecewise CUDA graph bugs with splitting_ops | nvidia | by Complexity-ML (created: 2026-03-18 09:00 (UTC+8))
#37340 [Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement | performance,ready,qwen | by yewentao256 (created: 2026-03-18 03:41 (UTC+8))
#37358 [Bugfix] Fix AttributeError when serving MXFP8 models with DeepGEMM installed | bug | by EdalatiAli (created: 2026-03-18 08:11 (UTC+8))
#37283 [Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm | rocm,ci/build | by tjtanaa (created: 2026-03-17 18:02 (UTC+8))
#37347 [Perf] Optimize token_embed for pooling models, 2.8% token throughput improvement | ready,v1 | by yewentao256 (created: 2026-03-18 05:50 (UTC+8))
#37322 [Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy | bug | by elvircrn (created: 2026-03-18 00:35 (UTC+8))
#37352 [Kernel][Hardware][AMD] Add TritonW4A16LinearKernel for ROCm | rocm | by jatseng-ai (created: 2026-03-18 06:50 (UTC+8))
#37355 Add local-runtime CLI, launcher install flow, and easy model management | documentation,frontend,nvidia | by FolksyPizza (created: 2026-03-18 07:12 (UTC+8))
#37357 [Bugfix] Kill Ray actors on elastic EP scale-down to prevent zombie DEALER connections that block subsequent scale-up | bug,v1 | by tzulingk (created: 2026-03-18 07:43 (UTC+8))
#37356 FlashInfer NVFP4 NaN propagation plausible fix | v1,nvidia | by varun-sundar-rabindranath (created: 2026-03-18 07:24 (UTC+8))
#37307 [Core] add option to schedule requests based on full ISL | v1 | by DanBlanaru (created: 2026-03-17 22:51 (UTC+8))
#37353 [ROCm][Perf] Skip head repeat_interleave for AITER MLA decode with BF16 KV cache | rocm,v1 | by ChuanLi1101 (created: 2026-03-18 06:56 (UTC+8))
#37332 Add nvfp4 support to reshape_and_cache_flash | documentation,ci/build,v1,nvidia | by sychen52 (created: 2026-03-18 02:01 (UTC+8))
#37303 [Attention] Support distinguishing between short extends and decodes | ready,ci/build,v1 | by LucasWilkinson (created: 2026-03-17 22:38 (UTC+8))
#37297 [LoRA] Support FP8 LoRA E2E inference-dense model | needs-rebase | by jeejeelee (created: 2026-03-17 20:39 (UTC+8))
#37348 Fix Qwen3.5-Next FP8 Weight Loading Error on TPU | qwen | by jrplatin (created: 2026-03-18 05:52 (UTC+8))
#37345 [torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict | documentation,llama,qwen | by Lucaskabela (created: 2026-03-18 05:14 (UTC+8))
#37338 [Perf] [Bugfix] Fix Triton autotuning in inference for Qwen3.5 | bug,qwen | by arpera (created: 2026-03-18 03:03 (UTC+8))
#37337 [Scheduler][WIP] Try to reduce preemption | v1 | by heheda12345 (created: 2026-03-18 03:02 (UTC+8))
#37335 [CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset | v1 | by AndreasKaratzas (created: 2026-03-18 02:35 (UTC+8))
#37334 [BUG] Exclude SKIP_TENSORS from get_layer_size() + new weight sync example for dpep | bug,documentation | by hao-aaron (created: 2026-03-18 02:18 (UTC+8))
#37320 [Kernel] Add non-gated support for NVFP4 CUTLASS MoE | ready,nvidia | by mgoin (created: 2026-03-18 00:13 (UTC+8))
#37331 [Bugfix] Reject channelwise quantization (group_size <= 0) in ExllamaLinearKernel | bug,llama | by mgehre-amd (created: 2026-03-18 01:46 (UTC+8))
#37329 [Bugfix] Fix ConchLinearKernel channelwise quantization (group_size=-1) | bug | by mgehre-amd (created: 2026-03-18 01:36 (UTC+8))
#37328 [CI] Fix PaddleOCR-VL HF test failure due to create_causal_mask API rename | multi-modality | by AndreasKaratzas (created: 2026-03-18 01:31 (UTC+8))
#37327 [Misc] Use VLLMValidationError for user-facing errors in Responses harmony | frontend,gpt-oss | by umut-polat (created: 2026-03-18 01:06 (UTC+8))
#37326 [Bugfix] Fix unreachable structured_outputs + tool_choice conflict check | bug,frontend,tool-calling | by umut-polat (created: 2026-03-18 01:05 (UTC+8))
#37324 [2/2] Refactor InternVL-based processors | speculative-decoding | by DarkLight1337 (created: 2026-03-18 00:57 (UTC+8))
#37309 [Don't Merge] Test CI with InstantTensor as default load format | ready,ready-run-all-tests | by mgoin (created: 2026-03-17 23:00 (UTC+8))
#37336 Make KV connector metadata build overridable via plugin | ready,v1,kv-connector | by sarckk (created: 2026-03-18 02:38 (UTC+8))
#37341 [EPLB] Consolidate is_unchanged/is_received_locally into RecvMetadata | no labels | by SageMoore (created: 2026-03-18 04:35 (UTC+8))
#37252 [Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache | documentation,ready,nvidia | by wzhao18 (created: 2026-03-17 12:26 (UTC+8))
#37321 [Model] Remove unused handle_oov_mm_token | ready,multi-modality,qwen | by DarkLight1337 (created: 2026-03-18 00:26 (UTC+8))
#37323 fix: use raw string for emoji example to avoid SyntaxWarning (fixes #37133) | no labels | by aayushbaluni (created: 2026-03-18 00:36 (UTC+8))
#37280 [Bugfix] Pass drafter quant_config to ParallelLMHead in Eagle3 | bug,speculative-decoding,v1,llama | by mgehre-amd (created: 2026-03-17 17:37 (UTC+8))
#37318 [Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() | v1 | by fuscof-ibm (created: 2026-03-18 00:00 (UTC+8))
#37317 [CI] Update model registry with real HF model IDs for CI testing | ready,ci/build | by mgoin (created: 2026-03-17 23:59 (UTC+8))
#37313 [Log] Reduce duplicate log | ready,v1,qwen,nvidia | by yewentao256 (created: 2026-03-17 23:19 (UTC+8))
#37290 [Chore] Replace all base64 usages with faster pybase64 package | documentation,performance,frontend,ready,multi-modality,llama,qwen | by Isotr0py (created: 2026-03-17 19:27 (UTC+8))
#37316 [Models][GDN] Reduce D2H syncs in ChunkGatedDeltaRule | no labels | by lgeiger (created: 2026-03-17 23:54 (UTC+8))
#37306 [Kernel] Add Llama4 Router GEMM kernel | performance,ci/build,llama | by xyang16 (created: 2026-03-17 22:50 (UTC+8))
#37276 fix: combine content/reasoning with tool calls in responses API (#37167) | frontend | by aayushbaluni (created: 2026-03-17 16:54 (UTC+8))
#37289 [Bugfix] Standardize custom HF Processor init | bug,ready,qwen,deepseek | by DarkLight1337 (created: 2026-03-17 18:54 (UTC+8))
#37311 Puzzle mtp | no labels | by netanel-haber (created: 2026-03-17 23:13 (UTC+8))
#37300 [BugFix] PyTorch Compilation Tests should error if any test fails | bug,ready,ci/build | by zou3519 (created: 2026-03-17 22:06 (UTC+8))
#37294 [Bugfix] [Frontend] Responses API, fix merging of message and tool call + generate correct response.output_text.done for streaming responses with tool calls | bug,frontend | by bfroemel (created: 2026-03-17 20:24 (UTC+8))
#37292 Fix Mistral yarn warning in Transformers v5 | no labels | by hmellor (created: 2026-03-17 19:59 (UTC+8))
#37298 Fix Phi3 test that fails with Transformers v5 | ready,multi-modality | by hmellor (created: 2026-03-17 20:41 (UTC+8))
#37287 [Frontend] Complete OpenAI render delegation | frontend,ready | by sagearc (created: 2026-03-17 18:27 (UTC+8))
#37256 fix(cpu_worker): Fix list index out of range in omp_cpuids_list for multi-node distributed inference | v1,cpu | by BillionClaw (created: 2026-03-17 13:46 (UTC+8))
#37272 fix(distributed): resolve inference failure in cpu_worker | v1,cpu | by BillionClaw (created: 2026-03-17 16:40 (UTC+8))
#37260 [1/2] Move InternVL-based processors | speculative-decoding,ready,multi-modality | by DarkLight1337 (created: 2026-03-17 15:04 (UTC+8))
#37293 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder | documentation,qwen | by Wangbei25 (created: 2026-03-17 20:20 (UTC+8))
#37265 [Bugfix] fix NoneType error in KV cache transfer with NCCL connector for DeepSeek | bug,deepseek,kv-connector | by leoda1 (created: 2026-03-17 15:47 (UTC+8))
#37291 [Bugfix] Handle ParallelLMHead in compressed-tensors get_quant_method | bug | by mgehre-amd (created: 2026-03-17 19:30 (UTC+8))
#37286 [Bugfix] Migrate to python3 -m build from legacy setup.py in Dockerfile.nightly_torch | bug,ci/build | by mathivanansai (created: 2026-03-17 18:25 (UTC+8))
#37281 [Bugfix][Perf]: avoid Range allocation and dict hashing in _find_range_for_shape | bug | by KrxGu (created: 2026-03-17 17:44 (UTC+8))
#37266 [Frontend] Delegate tokenization serving preprocessing to OpenAIServingRender | frontend,ready | by sagearc (created: 2026-03-17 15:52 (UTC+8))
#37259 Bugfix: prevent "selected index k out of range" in TP decode path | bug,ready | by zhejiangxiaomai (created: 2026-03-17 14:58 (UTC+8))
#37278 Optimize Fusedmoe int8_w8a8 kernel performance | no labels | by JiantaoXu (created: 2026-03-17 17:22 (UTC+8))
#37268 [ROCm] Enable MXFP4 LoRA support on MI355X and add CDNA4 device IDs | rocm | by ChuanLi1101 (created: 2026-03-17 16:13 (UTC+8))
#37267 0.17.1+librosa | performance,needs-rebase,ci/build,v1,qwen,nvidia | by zhewenl (created: 2026-03-17 16:01 (UTC+8))
#37258 [Bugfix][ResponsesAPI] Fix crash when tool_choice=required exceeds max_output_tokens | bug,ready | by chaunceyjiang (created: 2026-03-17 14:28 (UTC+8))
#37270 fix(lmcache): handle KeyError in layerwise mode | kv-connector | by BillionClaw (created: 2026-03-17 16:37 (UTC+8))
#37269 fix(lmcache): handle KeyError in layerwise storage mode | kv-connector | by BillionClaw (created: 2026-03-17 16:28 (UTC+8))
#37262 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder | documentation,qwen | by Wangbei25 (created: 2026-03-17 15:09 (UTC+8))
#37254 fix: include prompt text in RequestLogItem for gpt-oss-20b | frontend,gpt-oss | by BillionClaw (created: 2026-03-17 13:21 (UTC+8))

Merged PRs

#36795 [Perf] Enable dual stream execution of input projection for Qwen3 | ready,qwen | by xyang16 (merged: 2026-03-18 11:13 (UTC+8))
#37330 [ROCm][CI] Skip trtllm kvfp8 dequant tests on ROCm | rocm,ready,nvidia | by AndreasKaratzas (merged: 2026-03-18 11:12 (UTC+8))
#37351 [Performance] Add --enable-ep-weight-filter CLI option | ready | by esmeetu (merged: 2026-03-18 09:36 (UTC+8))
#36705 [Kernel][Helion] [16/N] Refactor register_kernel API to be more Dynamo-friendly | ready | by gmagogsfm (merged: 2026-03-18 09:23 (UTC+8))
#36927 [ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn | rocm,ready,v1 | by divakar-amd (merged: 2026-03-18 08:49 (UTC+8))
#36720 [Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling | bug,rocm,ready,v1,nvidia | by JartX (merged: 2026-03-18 05:55 (UTC+8))
#37320 [Kernel] Add non-gated support for NVFP4 CUTLASS MoE | ready,nvidia | by mgoin (merged: 2026-03-18 06:12 (UTC+8))
#35447 [Bugfix] Fix NemotronH MTP + Chunked Prefill | bug,ready,v1 | by benchislett (merged: 2026-03-17 14:07 (UTC+8))
#36846 [ROCm] Validate block_size for explicitly selected attention backends | rocm,ready | by AndreasKaratzas (merged: 2026-03-18 06:04 (UTC+8))
#37336 Make KV connector metadata build overridable via plugin | ready,v1,kv-connector | by sarckk (merged: 2026-03-18 05:29 (UTC+8))
#36887 [Model] Add ColQwen3.5 4.5B support | documentation,new-model,ready,multi-modality,qwen | by athrael-soju (merged: 2026-03-18 05:17 (UTC+8))
#35809 [Models] Cohere ASR | documentation,performance,new-model,frontend,ready,v1,multi-modality,qwen | by ekagra-ranjan (merged: 2026-03-18 05:04 (UTC+8))
#34577 [Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow | bug,ready,quantization | by ricky-chaoju (merged: 2026-03-18 04:48 (UTC+8))
#37158 [Bugfix] Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 | bug,ready | by dbari (merged: 2026-03-18 04:13 (UTC+8))
#37252 [Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache | documentation,ready,nvidia | by wzhao18 (merged: 2026-03-18 04:09 (UTC+8))
#37201 [Deprecation] Deprecate --calculate-kv-scales option | ready,quantization | by mgoin (merged: 2026-03-18 03:57 (UTC+8))
#37321 [Model] Remove unused handle_oov_mm_token | ready,multi-modality,qwen | by DarkLight1337 (merged: 2026-03-18 03:44 (UTC+8))
#36988 bump compressed-tensors version to 0.14.0.1 | ready,ci/build,quantization | by brian-dellabetta (merged: 2026-03-18 03:36 (UTC+8))
#36674 [Bug] Fix FlashInfer MNNVL socket collisions under concurrent vLLM jobs | bug,ready,nvidia | by yewentao256 (merged: 2026-03-18 03:19 (UTC+8))
#37100 [CI] Split Distributed Tests (4 GPUs) and Kernel MoE tests | ready,ci/build | by avinashsingh77 (merged: 2026-03-18 02:44 (UTC+8))
#35673 [Torch 2.11] Migrate torch._C._cpu calls to public torch.cpu API | performance,ready,v1,cpu | by atalman (merged: 2026-03-18 02:47 (UTC+8))
#37225 [Perf] Optimize top-k search in apply_top_k_top_p_triton sampler | performance,ready,v1 | by mgoin (merged: 2026-03-18 02:35 (UTC+8))
#37290 [Chore] Replace all base64 usages with faster pybase64 package | documentation,performance,frontend,ready,multi-modality,llama,qwen | by Isotr0py (merged: 2026-03-17 22:44 (UTC+8))
#37157 [openapi] remove redundant exception stack trace[4/N] | frontend,ready | by andyxning (merged: 2026-03-17 23:06 (UTC+8))
#37230 [CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in init | ready,ready-run-all-tests | by AndreasKaratzas (merged: 2026-03-18 00:08 (UTC+8))
#37289 [Bugfix] Standardize custom HF Processor init | bug,ready,qwen,deepseek | by DarkLight1337 (merged: 2026-03-17 23:38 (UTC+8))
#37300 [BugFix] PyTorch Compilation Tests should error if any test fails | bug,ready,ci/build | by zou3519 (merged: 2026-03-17 23:26 (UTC+8))
#35243 [Bugfix] Fix DP MTP Dummy Run | bug,ready,v1 | by benchislett (merged: 2026-03-17 23:16 (UTC+8))
#37224 [UltraVox] Fix output type | ready | by vasqu (merged: 2026-03-17 22:51 (UTC+8))
#34984 [Misc][LoRA] Add --lora-target-modules to restrict LoRA to specific modules | documentation,performance,new-model,rocm,frontend,speculative-decoding,ready,ci/build,v1,multi-modality | by bhoomit (merged: 2026-03-17 22:36 (UTC+8))
#37241 [Refactor] Relocate responses API tests | ready,v1 | by sfeng33 (merged: 2026-03-17 13:14 (UTC+8))
#37298 Fix Phi3 test that fails with Transformers v5 | ready,multi-modality | by hmellor (merged: 2026-03-17 22:29 (UTC+8))
#37287 [Frontend] Complete OpenAI render delegation | frontend,ready | by sagearc (merged: 2026-03-17 21:53 (UTC+8))
#36955 [Bugfix] Fix unclean shutdown crash with AllReduce Fusion workspace | bug,ready,nvidia | by siewcapital (merged: 2026-03-17 22:22 (UTC+8))
#36265 pick up tuned prefill configs for FP8 FA3 | ready,ci/build,ready-run-all-tests | by jmkuebler (merged: 2026-03-17 22:00 (UTC+8))
#36256 [Misc] Use VLLMValidationError in batch, pooling, and tokenize protocol validators | frontend,ready | by umut-polat (merged: 2026-03-17 21:52 (UTC+8))
#37260 [1/2] Move InternVL-based processors | speculative-decoding,ready,multi-modality | by DarkLight1337 (merged: 2026-03-17 21:50 (UTC+8))
#37165 [perf][connector] optimize build_connector_meta when host buffer transfer is not used | ready,kv-connector | by youkaichao (merged: 2026-03-17 19:59 (UTC+8))
#37178 Bugfix for offloading+prefetch for GLM-4.7-FP8 | bug,ready | by sfbemerk (merged: 2026-03-17 21:22 (UTC+8))
#36664 Add gigachat 3.1 tool parser + fix gigachat3 tool parser | documentation,ready,tool-calling | by ajpqs (merged: 2026-03-17 20:03 (UTC+8))
#37266 [Frontend] Delegate tokenization serving preprocessing to OpenAIServingRender | frontend,ready | by sagearc (merged: 2026-03-17 19:22 (UTC+8))
#37259 Bugfix: prevent "selected index k out of range" in TP decode path | bug,ready | by zhejiangxiaomai (merged: 2026-03-17 19:14 (UTC+8))
#35829 [Feature]: Support for multiple embedding types in a single inference call | frontend,ready,v1 | by staugust (merged: 2026-03-17 17:05 (UTC+8))
#37258 [Bugfix][ResponsesAPI] Fix crash when tool_choice=required exceeds max_output_tokens | bug,ready | by chaunceyjiang (merged: 2026-03-17 16:54 (UTC+8))
#32779 Fix infinite recursive search issue in quark.py | documentation,new-model,rocm,speculative-decoding,ready | by xiao-llm (merged: 2026-03-17 15:19 (UTC+8))
#35535 [Bugfix] Fix loading Music Flamingo | bug,ready | by NickCao (merged: 2026-03-17 13:24 (UTC+8))
#37246 [Bugfix] dtype mismatch in ngram gpu propose | bug,speculative-decoding,ready,v1 | by PatchouliTIS (merged: 2026-03-17 13:19 (UTC+8))

PRs Closed Without Merging

#37299 [Bugfix] Responses API: merge all tool calls with preceding assistant message | bug,frontend | by weiguangli-io (closed: 2026-03-18 11:07 (UTC+8))
#31257 [Speculators][Speculative Decoding] Fix Kimi K2 Eagle3 Support | deepseek | by chaunceyjiang (closed: 2026-03-18 11:01 (UTC+8))
#36231 [Misc] Add enable_log_requests parameter to RequestLogger | frontend | by chaunceyjiang (closed: 2026-03-18 11:01 (UTC+8))
#37368 [Misc] make test_worker_memory_snapshot platform-aware | v1 | by zhenwei-intel (closed: 2026-03-18 10:49 (UTC+8))
#37248 refactor: standardize kimi_linear and minimax_text_01 model weight loading to use AutoWeightsLoader | needs-rebase,v1,multi-modality,qwen,nvidia | by XLiu-2000 (closed: 2026-03-18 10:06 (UTC+8))
#35398 Andy/spec probs | speculative-decoding,needs-rebase,v1 | by andylolu2 (closed: 2026-03-18 09:12 (UTC+8))
#37142 [v1] align execute_dummy_batch with uniform decode query length | needs-rebase,v1 | by liuxingbo12138 (closed: 2026-03-18 08:54 (UTC+8))
#35690 [Bigfix] Fix padding in FULL_DECODE path when MTP is enabled for DP case | v1 | by zyongye (closed: 2026-03-18 06:34 (UTC+8))
#36555 [torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 | by Lucaskabela (closed: 2026-03-18 04:57 (UTC+8))
#37309 [Don't Merge] Test CI with InstantTensor as default load format | ready,ready-run-all-tests | by mgoin (closed: 2026-03-18 05:17 (UTC+8))
#36881 set mmencodercache to the max_seq_len for all 3 modalities | no labels | by netanel-haber (closed: 2026-03-18 04:58 (UTC+8))
#33420 [EPLB][BUGFIX] Add max_num_transfers config option to EPLB | bug,needs-rebase | by SageMoore (closed: 2026-03-18 03:56 (UTC+8))
#37323 fix: use raw string for emoji example to avoid SyntaxWarning (fixes #37133) | no labels | by aayushbaluni (closed: 2026-03-18 02:55 (UTC+8))
#30338 Fix gigachat3 parser + update tests | frontend,tool-calling | by ajpqs (closed: 2026-03-17 23:59 (UTC+8))
#34824 [Feature][Quant] Support online MXFP8 MoE quantization for SM100 serving | needs-rebase,ci/build,nvidia | by EdalatiAli (closed: 2026-03-17 23:25 (UTC+8))
#37211 fix: Add missing k_scale attribute to Attention for weight offloading | no labels | by BillionClaw (closed: 2026-03-17 21:23 (UTC+8))
#37262 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder | documentation,qwen | by Wangbei25 (closed: 2026-03-17 16:25 (UTC+8))
#36500 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder | documentation,qwen | by Wangbei25 (closed: 2026-03-17 15:01 (UTC+8))