vLLM Development Activity Report - 2026-03-17
Time window: 2026-03-17 11:35 (UTC+8) ~ 2026-03-18 11:35 (UTC+8)
Statistics: 32 new issues | 10 closed issues | 87 new PRs | 47 merged PRs | 18 PRs closed without merging
📊 Daily Development Summary
Over the past 24 hours the vLLM project maintained a very high level of development activity, opening and merging a large number of PRs on the strength of strong community contribution. The core themes were performance optimization (especially for new hardware and quantization formats), bug fixes (spanning scheduling, kernels, and the frontend API), and platform-support expansion, with continued strengthening of the AMD ROCm ecosystem in particular. Several high-activity discussions surfaced the complex challenges that come with chasing peak performance, such as latency problems caused by scheduling policy and design trade-offs in the new quantized KV cache infrastructure.
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was very strong this cycle, reflecting sustained investment in deep optimization for the platform:
- INT4 weight inference support (PR #37352): contributor jatseng-ai added a Triton-based W4A16 GEMM kernel for the AMD MI300 platform, supporting symmetric and asymmetric grouped quantization and significantly expanding low-precision inference capability on ROCm.
- Nightly build and release pipeline (PR #37283): AMD engineer tjtanaa enabled nightly Docker image and wheel releases for ROCm, an important infrastructure improvement for developer experience and ease of deployment on AMD platforms.
- MXFP4 and LoRA support (PR #37268): contributor ChuanLi1101 enabled LoRA adapters for MXFP4-quantized MoE models on ROCm (MI300X/MI325X/MI355X) and added CDNA4 (MI350X/MI355X) device-ID mappings, preparing for the next generation of AMD hardware.
- AITER MLA decode optimization (PR #37353): the same contributor optimized the decode path for MLA models with a BF16 KV cache, skipping an unnecessary repeat_interleave when nhead<16 and yielding a potential 4x decode speedup for specific configurations (e.g. Kimi-Linear-48B at TP=8).
- AITER FP8xFP8 attention support (PR #36927, merged): contributor divakar-amd added FP8xFP8 attention support to rocm_aiter_unified_attn, adjusting the scale handling so that it supports an FP8 KV cache.
- ROCm-related fixes:
  - Closed Issue #33666: RotaryEmbedding+KVCache ops failed to pattern-match on ROCm; the reporter closed it on the grounds that vLLM IR will address it.
  - Merged PR #36720: fixed a worker-startup OOM on ROCm caused by inaccurate CUDA graph memory profiling, by skipping the unreliable profiling path on ROCm.
  - PRs #37329 and #37331: AMD engineer mgehre-amd fixed incorrect handling of channelwise quantization (group_size=-1) in ConchLinearKernel and ExllamaLinearKernel, ensuring correctness of quantized models on AMD platforms.
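The repeat_interleave optimization above can be illustrated with a minimal NumPy sketch (shapes are hypothetical; this is not the actual AITER kernel code): repeating KV heads to match the query head count is a pure copy, so a kernel that indexes the KV head as q_head // group can skip the materialization entirely.

```python
import numpy as np

# Minimal sketch (hypothetical shapes, not the real AITER kernel) of the
# GQA head expansion that PR #37353 avoids.
num_q_heads, num_kv_heads, seq, dim = 8, 2, 4, 16
group = num_q_heads // num_kv_heads  # query heads per KV head

k = np.random.rand(num_kv_heads, seq, dim).astype(np.float32)

# Eager path: physically copy each KV head `group` times.
k_repeated = np.repeat(k, group, axis=0)  # (num_q_heads, seq, dim)

# Copy-free alternative: look up the shared KV head directly.
for q_head in range(num_q_heads):
    assert np.array_equal(k_repeated[q_head], k[q_head // group])
```

The equivalence is what makes the skip safe: when the downstream kernel can perform the integer-divide lookup itself, the repeated tensor never needs to exist in memory.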
💬 High-Activity Discussions
- Extensible quantized KV cache infrastructure RFC (Issue #37319)
  - Core topic: JartX proposed a design for unified infrastructure supporting KV-cache formats such as INT8 and NVFP4 that require per-token/per-head quantization scales, as distinct from the existing FP8 scheme with per-tensor scales.
  - Views and controversy: core maintainer mgoin bluntly criticized the RFC as "sloppy" and apparently AI-generated without careful review, pointing to factual errors in the design (e.g. torch.nvfp4 does not exist) and an underestimate of the NVFP4 format's complexity (it requires one scale per group of 16 elements). The dispute centers on the design's completeness and how thoroughly it accounts for emerging formats such as NVFP4.
  - Status: discussion open; the design is under challenge and needs a major revision from the proposer.
- Severe head-of-line blocking under prefix caching (Issue #37308)
  - Core topic: user Yunzez found via fuzz testing that with prefix caching enabled, when large requests (long prompts) and tiny requests arrive concurrently, the tiny requests' TTFT (time to first token) can degrade by up to 147x.
  - Views: the reporter provided detailed reproduction steps and performance data, arguing this is a silent performance regression caused by a flaw in the scheduling logic. Another user, FocusMode-coder, offered to help.
  - Point of contention: the root cause differs from another chunked-prefill head-of-line blocking bug (#37076), highlighting hard-to-foresee performance bottlenecks in the interaction between complex caching and scheduling policies.
  - Status: issue open; awaiting attention from core scheduler developers.
- DeepSeek-R1 NVFP4 accuracy drop (Issue #37302)
  - Core topic: user elvircrn reported a significant drop (~1.4%) in GSM8K evaluation accuracy for the DeepSeek-R1 FP4 model after a particular vLLM commit.
  - Triage: robertgshaw2-redhat suggested the recently added shared-expert/QKV-projection kernels as a likely cause. elvircrn then bisected to a specific commit (#821eb80c0d); reverting it restored accuracy, raising concerns about the new kernels' correctness or numerical stability.
  - Status: issue open; root cause under investigation, possibly a kernel regression.
- FlashInfer JIT compilation causing host OOM (Issue #37279)
  - Core topic: user stephenmcconnachie hit host-memory exhaustion on H100 when FlashInfer compiled the gdn_prefill_sm90 kernel and launched too many concurrent nvcc/cicc processes.
  - Resolution: the reporter traced the root cause to FlashInfer's overly high JIT compilation parallelism and found a workaround on their own: using a precompiled FlashInfer package instead of runtime compilation avoids the memory storm.
  - Outcome: the user posted a working solution and closed the issue, a valuable reference for anyone hitting the same problem.
- FP8 TRT-LLM MoE kernel regression fix (PR #37346)
  - Core topic: wzhao18 fixed a regression introduced by PR #35448, which had wrongly added a check that only applies to the monolithic MoE kernel into the base class, breaking the modular kernel path.
  - Discussion: while validating the fix, a compatibility problem between --quantization mxfp8 and DeepGEMM surfaced (triggering a separate error). EdalatiAli identified DeepGEMM's logic not handling MXFP8 layers correctly as the cause (filed as Issue #37358) and offered a temporary workaround (VLLM_USE_DEEP_GEMM=0).
  - Outcome: after discussion and validation the PR fixed the core problem, while exposing an independent compatibility edge case.
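The scale-granularity question at the heart of the KV cache RFC can be sketched with a symmetric INT8 quantizer in NumPy (function names and shapes are illustrative only; this is not vLLM's actual KV-cache code): finer-grained scales stop outlier tokens or heads from inflating everyone else's quantization step.

```python
import numpy as np

np.random.seed(0)

# Symmetric INT8 quantizer; illustrative only, not vLLM's implementation.
def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

kv = np.random.randn(32, 8, 128).astype(np.float32)  # (tokens, heads, head_dim)

# Per-tensor: one scale for the whole cache (the existing FP8-style scheme).
scale_tensor = np.abs(kv).max() / 127.0

# Per-token/per-head: one scale per (token, head) pair.
scale_tok_head = np.abs(kv).max(axis=-1, keepdims=True) / 127.0  # (32, 8, 1)

def mean_err(scale):
    q = quantize_int8(kv, scale)
    return np.abs(q.astype(np.float32) * scale - kv).mean()

err_fine, err_coarse = mean_err(scale_tok_head), mean_err(scale_tensor)
assert err_fine < err_coarse  # finer scales -> lower reconstruction error
```

NVFP4 pushes this further still, with one scale per group of 16 elements inside each head, which is why the RFC's treatment of it drew scrutiny.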
🔥 热门话题与趋势分析
- 性能优化与Bug修复:这是最活跃的领域,大量PR涉及内核性能提升(如PR #37353, #36795)、内存/延迟优化(PR #37281, #37347)以及修复各种边界情况导致的崩溃或精度问题(PR #37329, #37346, #37265)。
- 平台与硬件支持扩展:
- AMD ROCm:支持力度持续加强,从新内核、新格式到发布流程,形成系统化投入。
- 新量化格式:NVFP4、MXFP4是当前热点,围绕其KV缓存支持(Issue #37319, PR #37332)和权重加载优化(PR #34577)的讨论密集。
- 前端与API改进:围绕OpenAI API兼容性和体验的改进持续进行,包括响应API中工具调用与消息的合并逻辑(PR #37294, #37299, #37276)、渲染层重构(PR #37287, #37266)以及输入验证(PR #37326)。
- 推测解码与多模态模型支持:相关Issue(#37273, #37295)和PR(#37280)显示,这些高级功能在实际使用中仍面临模型兼容性、解析错误等复杂挑战。
- 构建与部署体验:Docker构建失败(Issue #37284)、源码安装问题(Issue #37288)以及新CLI工具尝试(PR #37355)表明社区在改善用户体验方面的努力和遇到的问题。
🛠️ Key Technical Changes
- AMD MI300 INT4 weight inference kernel (PR #37352): introduces high-performance INT4 quantized-weight support for the AMD platform, implemented in Triton with grouped quantization; a key step toward wider low-precision inference on AMD hardware.
- ROCm nightly release pipeline (PR #37283): establishes an automated, sustainable build-and-distribution pipeline for ROCm releases, lowering the barrier for AMD users to get the latest vLLM and a long-term investment in the ecosystem.
- NVFP4 KV cache support (PR #37332 & Issue #37319): PR #37332 adds NVFP4 support to reshape_and_cache_flash, a first step toward an NVFP4 KV cache, while the large discussion in Issue #37319 exposes the difficulty of designing general infrastructure for such complex quantization formats; this is a design decision that will shape future performance and extensibility.
- Non-gated MoE NVFP4 CUTLASS support (PR #37320, merged): extends the CUTLASS NVFP4 MoE kernel to non-gated MoE models such as Nemotron-Nano, broadening the applicability of this high-performance kernel.
- Dual-stream input projection for the Qwen3 family (PR #36795, merged): runs in_proj_qkvz and in_proj_ba on separate CUDA streams in parallel, optimizing the input-projection stage of Qwen3/3.5-Next and delivering a measurable end-to-end throughput gain.
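The grouped-vs-channelwise distinction that runs through the W4A16 kernel and the group_size=-1 fixes above can be sketched in NumPy (layout and names are illustrative, not the actual vLLM kernel interface): group_size controls how many input features share one quantization scale, and group_size=-1 conventionally means one scale per output channel.

```python
import numpy as np

np.random.seed(0)
# Sketch of grouped vs. channelwise weight-scale layout for a
# W4A16-style kernel; illustrative only.
out_features, in_features = 4, 256
w = np.random.randn(out_features, in_features).astype(np.float32)

def quant_scales(w, group_size):
    # group_size=-1 conventionally means "channelwise": one scale per
    # output channel, i.e. a single group spanning all input features.
    if group_size == -1:
        group_size = w.shape[1]
    groups = w.reshape(w.shape[0], -1, group_size)  # (out, n_groups, group)
    return np.abs(groups).max(axis=-1) / 7.0        # INT4 symmetric: [-7, 7]

assert quant_scales(w, 128).shape == (4, 2)  # two scale groups per row
assert quant_scales(w, -1).shape == (4, 1)   # channelwise: one per row
```

A kernel that assumes group_size is always positive will miscompute the group count for the channelwise case, which is the class of bug PRs #37329 and #37331 addressed.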
📈 Development Activity Observations
- Diverse contributors: beyond core maintainers, engineers from AMD (mgehre-amd, tjtanaa) and Intel (zhenwei-intel) as well as many independent developers participated actively.
- Efficient merging: 47 PRs merged within 24 hours indicates a fast-moving code review and merge process; many PRs carry the ready label, evidence of disciplined process control.
- Fast issue response: newly reported bugs frequently get a response, and sometimes a proposed fix, from developers or maintainers (e.g. chaunceyjiang, BillionClaw) within a short time.
💡 Issues Worth Watching
- Scheduler performance cliffs: the extreme TTFT degradation under specific concurrency and parameter combinations revealed by Issues #37308 and #37343 is a high-priority stability problem that needs deep investigation from scheduling experts.
- Quantized KV cache infrastructure design: the dispute in Issue #37319 suggests current support for formats such as INT8 and NVFP4 may lack a unified, well-considered design; the community needs to reach consensus here to avoid future technical debt.
- ARM CPU build error: Issue #37325 reports a compilation error when serving vision-language models on ARM CPUs, which could affect vLLM deployment on a wider range of edge devices.
- Modularization and refactoring: PR #37373 (fusion-pass factory) and PR #37371 (standardized weight loading) show the codebase's ongoing modularization effort, which matters for long-term maintainability.
- Third-party dependency compatibility: the FlashInfer build issue (#37284) and the DeepGEMM/MXFP8 incompatibility surfaced in PR #37346's discussion are reminders of the challenge of managing a complex dependency chain.
📋 Appendix: Detailed Data
New Issues
- #37372 Port custom ops to native Inductor multi-stream support — no labels — by xyang16 (created: 2026-03-18 11:12 (UTC+8))
- #37273 [Usage]: Failed to run Qwen3 Eagle3 speculate — usage — by Rinascere0 (created: 2026-03-17 16:42 (UTC+8))
- #37367 [Bug]: gcc: internal compiler error: Segmentation fault signal terminated program cc1 — bug — by 154461013 (created: 2026-03-18 10:27 (UTC+8))
- #37365 [Bug]: gdn prefill kernel errors — bug — by ZTurboX (created: 2026-03-18 09:50 (UTC+8))
- #37363 fix(compilation): fix piecewise CUDA graph bugs with splitting_ops — no labels — by Complexity-ML (created: 2026-03-18 09:08 (UTC+8))
- #37362 Guidance structured output blocked during thinking with nemotron_v3 reasoning parser (offline LLM.generate) — no labels — by ivnle (created: 2026-03-18 09:00 (UTC+8))
- #37359 Guidance backend structured output doesn't work with openai_gptoss reasoning parser (offline LLM.generate) — no labels — by ivnle (created: 2026-03-18 08:15 (UTC+8))
- #37350 [Bug]: 'placeholder_block_size' is not defined — bug — by qnfm (created: 2026-03-18 06:37 (UTC+8))
- #37282 [CI Failure]: Builds fail? — ci-failure — by denadai2 (created: 2026-03-17 17:59 (UTC+8))
- #37319 [RFC]: Extensible Per-Token Quantized KV Cache Scale Infrastructure — RFC — by JartX (created: 2026-03-18 00:05 (UTC+8))
- #37344 Pooling API: expose extra_kwargs and allow nested response data for custom poolers — no labels — by latenceainew (created: 2026-03-18 05:05 (UTC+8))
- #37343 [Bug]: n_completions + logprobs Causes Significant TTFT Spike for Co-Scheduled Requests on Cold Cache — bug — by Yunzez (created: 2026-03-18 04:40 (UTC+8))
- #37342 [Installation]: PyPI release blocks installation on enterprise systems: xgrammar==0.1.29 blocked by security scanners (CVE-2026-25048) — installation — by adamisaac3 (created: 2026-03-18 04:36 (UTC+8))
- #37333 [Bug]: Gemma-3 specific heterogeneous TP failures with PD disagg — bug — by yzong-rh (created: 2026-03-18 02:09 (UTC+8))
- #37325 [Bug][ARM CPU] Build/Runtime error: no matching function for call to 'at::vec::CPU_CAPABILITY::VecMask<long int, 4>::VecMask(int&)' when serving Qwen3-VL-8B-Instruct — bug,cpu — by micyan01 (created: 2026-03-18 01:03 (UTC+8))
- #37302 [Bug]: R1 NVFP4 gsm8k drop in lm_eval — bug — by elvircrn (created: 2026-03-17 22:27 (UTC+8))
- #37304 [Bug]: Language Models Test (Extended Generation) test_models[False-False-5-32-bigcode/starcoder2-3b] test issue — bug — by zou3519 (created: 2026-03-17 22:47 (UTC+8))
- #37308 [Bug]: Severe Head-of-Line Blocking (147x TTFT) under Prefix Caching with Asymmetric Batches — bug — by Yunzez (created: 2026-03-17 22:54 (UTC+8))
- #37305 [RFC]: Enable Intel XPU CI for vLLM — RFC — by wendyliu235 (created: 2026-03-17 22:49 (UTC+8))
- #37295 [Bug] gpt-oss-120b + P-EAGLE speculative decoding causes openai_harmony parse errors and severe chat latency regression — bug — by fishingpvalues (created: 2026-03-17 20:25 (UTC+8))
- #37279 [Bug]: OOM during FlashInfer JIT compile of gdn_prefill_sm90 on H100 due to many concurrent cicc processes — bug — by stephenmcconnachie (created: 2026-03-17 17:27 (UTC+8))
- #37288 [Installation]: Build vllm from source fail — installation — by wyatt-wong (created: 2026-03-17 18:37 (UTC+8))
- #37285 [Bug]: PD disaggregation for SSM models requires --no-async-scheduling when TP>1 — bug — by NickLucche (created: 2026-03-17 18:13 (UTC+8))
- #37284 [Bug]: Docker Build Failure for Dockerfile.nightly_pytorch — bug — by mathivanansai (created: 2026-03-17 18:08 (UTC+8))
- #37274 [Bug]: vLLM serving cannot support video inputs with a list of base64-encoded extracted JPEG frames — bug — by Johere (created: 2026-03-17 16:44 (UTC+8))
- #37277 [Bug]: GLM47 Tool Call Bug — bug — by xi1212 (created: 2026-03-17 17:14 (UTC+8))
- #37271 [Bug]: In_proj_ba of GDN in Qwen3Next use MergeColumnParallelLinear may cause accuracy decrease? — bug — by SunnyLee151064 (created: 2026-03-17 16:39 (UTC+8))
- #37255 [Bug]: Failed to run distributed inference due to error list index out of range in omp_cpuids_list — bug — by wenlujon (created: 2026-03-17 13:33 (UTC+8))
- #37261 [Bug]: KeyError when use lmcache layerwise mode in vllm 0.14.0 — bug — by Skyseaee (created: 2026-03-17 15:08 (UTC+8))
- #37263 [RFC]: Hotness-aware multi-level KV cache management to accelerate dynamic sparse attention — RFC — by zengchuang-hw (created: 2026-03-17 15:32 (UTC+8))
- #37257 [Performance]: vllm and transformer call the same Qwen3-VL-AI4TEST-V1 model, with roughly the same configuration, but the visual label accuracy is 20% lower in testing. — performance — by lky55921 (created: 2026-03-17 14:16 (UTC+8))
- #37253 [Bug]: prompt is logged as None in RequestLogItem for gpt-oss-20b (Chat Completion API) — bug — by ugiugi0823 (created: 2026-03-17 13:02 (UTC+8))
Closed Issues
- #37282 [CI Failure]: Builds fail? — ci-failure — by denadai2 (closed: 2026-03-18 05:39 (UTC+8))
- #30579 [Bug]: CUDA Illegal Memory Access when running Qwen3-Next-80B-A3B-Instruct on 4xB200 GPUs — bug,stale — by BolinSNLHM (closed: 2026-03-18 01:46 (UTC+8))
- #33899 [Bug]: DeepSeek-R1-0528 AssertionError: tokens not padded correctly on GB200 — bug — by chaunceyjiang (closed: 2026-03-17 23:16 (UTC+8))
- #35686 [Bug][UX]: Unclean shutdown from ctrl-c with AR Fusion — bug,help wanted,good first issue — by robertgshaw2-redhat (closed: 2026-03-17 22:22 (UTC+8))
- #37176 [Bug]: weight offloading fails: 'Attention' object has no attribute 'k_scale' — bug — by sfbemerk (closed: 2026-03-17 21:22 (UTC+8))
- #37279 [Bug]: OOM during FlashInfer JIT compile of gdn_prefill_sm90 on H100 due to many concurrent cicc processes — bug — by stephenmcconnachie (closed: 2026-03-17 18:55 (UTC+8))
- #36215 [Performance]: vLLM is slower than SGLang when deploying the Qwen3.5 model. — performance — by yszhli (closed: 2026-03-17 16:32 (UTC+8))
- #37235 [Bug]: Non-monotonic latency at ctx=8192: concurrency 1 slower than concurrency 2 (local vLLM) — bug — by NoahLundSyrdal (closed: 2026-03-17 13:49 (UTC+8))
- #33666 [Bug]: RotaryEmbedding+KVCache ops unable to pattern match for ROCmAiterTritonRopeReshapeKVCacheFusionPass — bug,rocm — by Rohan138 (closed: 2026-03-17 13:29 (UTC+8))
- #35522 [Bug]: Music Flamingo ValueError: Following weights were not initialized from checkpoint: {'audio_tower.pos_emb.freqs'} — bug — by denadai2 (closed: 2026-03-17 13:24 (UTC+8))
New PRs
- #37373 [torch.compile] Refactor Attention Quant Fusion Pass with make_fusion_pass Factory — documentation,needs-rebase,nvidia — by BadrBasowid (created: 2026-03-18 11:29 (UTC+8))
- #37301 [Bugfix] Fix base64 JPEG video frames returning empty metadata — bug,multi-modality — by he-yufeng (created: 2026-03-17 22:20 (UTC+8))
- #37310 [SSM/Mamba] Follow-up: N-1 prefill for P/D disaggregation — kv-connector — by ZhanqiuHu (created: 2026-03-17 23:02 (UTC+8))
- #37360 [Bugfix] AsyncLLM: Add the ability to specify the pooling_task — bug,v1 — by Gamrix (created: 2026-03-18 08:26 (UTC+8))
- #37275 [Benchmark] Add iteration benchmark with server-side step stats, trac… — performance,frontend,tpu,v1 — by YJYJLee (created: 2026-03-17 16:53 (UTC+8))
- #37371 refactor: standardize load_weights using AutoWeightsLoader for kimi_linear and minimax_text_01 — no labels — by XLiu-2000 (created: 2026-03-18 10:48 (UTC+8))
- #37349 [ROCm][CI] Add ROCM_EXTRA_ARGS to audio_in_video test server fixture — rocm,ready — by AndreasKaratzas (created: 2026-03-18 06:07 (UTC+8))
- #37330 [ROCm][CI] Skip trtllm kvfp8 dequant tests on ROCm — rocm,ready,nvidia — by AndreasKaratzas (created: 2026-03-18 01:44 (UTC+8))
- #37299 [Bugfix] Responses API: merge all tool calls with preceding assistant message — bug,frontend — by weiguangli-io (created: 2026-03-17 22:02 (UTC+8))
- #37354 [BugFix] Ensure num_cached_tokens is non-negative for kv transfer failed requests — bug,v1 — by KingsleyZhang123 (created: 2026-03-18 07:10 (UTC+8))
- #37370 [WIP][Model Runner V2] Add Encoder Dummy Run — v1 — by santiramos27 (created: 2026-03-18 10:41 (UTC+8))
- #37368 [Misc] make test_worker_memory_snapshot platform-aware — v1 — by zhenwei-intel (created: 2026-03-18 10:29 (UTC+8))
- #37339 [Core] Add register_model() to KVConnectorBase_V1 for CacheBlend — v1,kv-connector — by zbennett10 (created: 2026-03-18 03:35 (UTC+8))
- #37369 fix(cpu): add null check for aligned_alloc in ScratchPadManager — cpu — by yassha (created: 2026-03-18 10:31 (UTC+8))
- #37366 My version of their work — performance,frontend,v1 — by YJYJLee (created: 2026-03-18 10:15 (UTC+8))
- #37264 [Bugfix] Handle truncate_prompt_tokens in Harmony (GPT-OSS) path — bug,frontend,gpt-oss — by gpwork4u (created: 2026-03-17 15:41 (UTC+8))
- #37296 fix marlin fp4 kernel N-dimension alignment — needs-rebase — by flutist (created: 2026-03-17 20:25 (UTC+8))
- #37364 [Model Runner V2] fix draft attention metadata generation — v1 — by TheEpicDolphin (created: 2026-03-18 09:43 (UTC+8))
- #37351 [Performance] Add --enable-ep-weight-filter CLI option — ready — by esmeetu (created: 2026-03-18 06:39 (UTC+8))
- #37346 [Bug] Fix fp8 trtllm MoE modular kernel supported routing methods — bug,ready,nvidia,quantization — by wzhao18 (created: 2026-03-18 05:14 (UTC+8))
- #37361 fix(compilation): fix piecewise CUDA graph bugs with splitting_ops — nvidia — by Complexity-ML (created: 2026-03-18 09:00 (UTC+8))
- #37340 [Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement — performance,ready,qwen — by yewentao256 (created: 2026-03-18 03:41 (UTC+8))
- #37358 [Bugfix] Fix AttributeError when serving MXFP8 models with DeepGEMM installed — bug — by EdalatiAli (created: 2026-03-18 08:11 (UTC+8))
- #37283 [Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm — rocm,ci/build — by tjtanaa (created: 2026-03-17 18:02 (UTC+8))
- #37347 [Perf] Optimize token_embed for pooling models, 2.8% token throughput improvement — ready,v1 — by yewentao256 (created: 2026-03-18 05:50 (UTC+8))
- #37322 [Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy — bug — by elvircrn (created: 2026-03-18 00:35 (UTC+8))
- #37352 [Kernel][Hardware][AMD] Add TritonW4A16LinearKernel for ROCm — rocm — by jatseng-ai (created: 2026-03-18 06:50 (UTC+8))
- #37355 Add local-runtime CLI, launcher install flow, and easy model management — documentation,frontend,nvidia — by FolksyPizza (created: 2026-03-18 07:12 (UTC+8))
- #37357 [Bugfix] Kill Ray actors on elastic EP scale-down to prevent zombie DEALER connections that block subsequent scale-up — bug,v1 — by tzulingk (created: 2026-03-18 07:43 (UTC+8))
- #37356 FlashInfer NVFP4 NaN propagation plausible fix — v1,nvidia — by varun-sundar-rabindranath (created: 2026-03-18 07:24 (UTC+8))
- #37307 [Core] add option to schedule requests based on full ISL — v1 — by DanBlanaru (created: 2026-03-17 22:51 (UTC+8))
- #37353 [ROCm][Perf] Skip head repeat_interleave for AITER MLA decode with BF16 KV cache — rocm,v1 — by ChuanLi1101 (created: 2026-03-18 06:56 (UTC+8))
- #37332 Add nvfp4 support to reshape_and_cache_flash — documentation,ci/build,v1,nvidia — by sychen52 (created: 2026-03-18 02:01 (UTC+8))
- #37303 [Attention] Support distinguishing between short extends and decodes — ready,ci/build,v1 — by LucasWilkinson (created: 2026-03-17 22:38 (UTC+8))
- #37297 [LoRA] Support FP8 LoRA E2E inference-dense model — needs-rebase — by jeejeelee (created: 2026-03-17 20:39 (UTC+8))
- #37348 Fix Qwen3.5-Next FP8 Weight Loading Error on TPU — qwen — by jrplatin (created: 2026-03-18 05:52 (UTC+8))
- #37345 [torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict — documentation,llama,qwen — by Lucaskabela (created: 2026-03-18 05:14 (UTC+8))
- #37338 [Perf] [Bugfix] Fix Triton autotuning in inference for Qwen3.5 — bug,qwen — by arpera (created: 2026-03-18 03:03 (UTC+8))
- #37337 [Scheduler][WIP] Try to reduce preemption — v1 — by heheda12345 (created: 2026-03-18 03:02 (UTC+8))
- #37335 [CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset — v1 — by AndreasKaratzas (created: 2026-03-18 02:35 (UTC+8))
- #37334 [BUG] Exclude SKIP_TENSORS from get_layer_size() + new weight sync example for dpep — bug,documentation — by hao-aaron (created: 2026-03-18 02:18 (UTC+8))
- #37320 [Kernel] Add non-gated support for NVFP4 CUTLASS MoE — ready,nvidia — by mgoin (created: 2026-03-18 00:13 (UTC+8))
- #37331 [Bugfix] Reject channelwise quantization (group_size <= 0) in ExllamaLinearKernel — bug,llama — by mgehre-amd (created: 2026-03-18 01:46 (UTC+8))
- #37329 [Bugfix] Fix ConchLinearKernel channelwise quantization (group_size=-1) — bug — by mgehre-amd (created: 2026-03-18 01:36 (UTC+8))
- #37328 [CI] Fix PaddleOCR-VL HF test failure due to create_causal_mask API rename — multi-modality — by AndreasKaratzas (created: 2026-03-18 01:31 (UTC+8))
- #37327 [Misc] Use VLLMValidationError for user-facing errors in Responses harmony — frontend,gpt-oss — by umut-polat (created: 2026-03-18 01:06 (UTC+8))
- #37326 [Bugfix] Fix unreachable structured_outputs + tool_choice conflict check — bug,frontend,tool-calling — by umut-polat (created: 2026-03-18 01:05 (UTC+8))
- #37324 [2/2] Refactor InternVL-based processors — speculative-decoding — by DarkLight1337 (created: 2026-03-18 00:57 (UTC+8))
- #37309 [Don't Merge] Test CI with InstantTensor as default load format — ready,ready-run-all-tests — by mgoin (created: 2026-03-17 23:00 (UTC+8))
- #37336 Make KV connector metadata build overridable via plugin — ready,v1,kv-connector — by sarckk (created: 2026-03-18 02:38 (UTC+8))
- #37341 [EPLB] Consolidate is_unchanged/is_received_locally into RecvMetadata — no labels — by SageMoore (created: 2026-03-18 04:35 (UTC+8))
- #37252 [Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache — documentation,ready,nvidia — by wzhao18 (created: 2026-03-17 12:26 (UTC+8))
- #37321 [Model] Remove unused handle_oov_mm_token — ready,multi-modality,qwen — by DarkLight1337 (created: 2026-03-18 00:26 (UTC+8))
- #37323 fix: use raw string for emoji example to avoid SyntaxWarning (fixes #37133) — no labels — by aayushbaluni (created: 2026-03-18 00:36 (UTC+8))
- #37280 [Bugfix] Pass drafter quant_config to ParallelLMHead in Eagle3 — bug,speculative-decoding,v1,llama — by mgehre-amd (created: 2026-03-17 17:37 (UTC+8))
- #37318 [Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() — v1 — by fuscof-ibm (created: 2026-03-18 00:00 (UTC+8))
- #37317 [CI] Update model registry with real HF model IDs for CI testing — ready,ci/build — by mgoin (created: 2026-03-17 23:59 (UTC+8))
- #37313 [Log] Reduce duplicate log — ready,v1,qwen,nvidia — by yewentao256 (created: 2026-03-17 23:19 (UTC+8))
- #37290 [Chore] Replace all base64 usages with faster pybase64 package — documentation,performance,frontend,ready,multi-modality,llama,qwen — by Isotr0py (created: 2026-03-17 19:27 (UTC+8))
- #37316 [Models][GDN] Reduce D2H syncs in ChunkGatedDeltaRule — no labels — by lgeiger (created: 2026-03-17 23:54 (UTC+8))
- #37306 [Kernel] Add Llama4 Router GEMM kernel — performance,ci/build,llama — by xyang16 (created: 2026-03-17 22:50 (UTC+8))
- #37276 fix: combine content/reasoning with tool calls in responses API (#37167) — frontend — by aayushbaluni (created: 2026-03-17 16:54 (UTC+8))
- #37289 [Bugfix] Standardize custom HF Processor init — bug,ready,qwen,deepseek — by DarkLight1337 (created: 2026-03-17 18:54 (UTC+8))
- #37311 Puzzle mtp — no labels — by netanel-haber (created: 2026-03-17 23:13 (UTC+8))
- #37300 [BugFix] PyTorch Compilation Tests should error if any test fails — bug,ready,ci/build — by zou3519 (created: 2026-03-17 22:06 (UTC+8))
- #37294 [Bugfix] [Frontend] Responses API, fix merging of message and tool call + generate correct response.output_text.done for streaming responses with tool calls — bug,frontend — by bfroemel (created: 2026-03-17 20:24 (UTC+8))
- #37292 Fix Mistral yarn warning in Transformers v5 — no labels — by hmellor (created: 2026-03-17 19:59 (UTC+8))
- #37298 Fix Phi3 test that fails with Transformers v5 — ready,multi-modality — by hmellor (created: 2026-03-17 20:41 (UTC+8))
- #37287 [Frontend] Complete OpenAI render delegation — frontend,ready — by sagearc (created: 2026-03-17 18:27 (UTC+8))
- #37256 fix(cpu_worker): Fix list index out of range in omp_cpuids_list for multi-node distributed inference — v1,cpu — by BillionClaw (created: 2026-03-17 13:46 (UTC+8))
- #37272 fix(distributed): resolve inference failure in cpu_worker — v1,cpu — by BillionClaw (created: 2026-03-17 16:40 (UTC+8))
- #37260 [1/2] Move InternVL-based processors — speculative-decoding,ready,multi-modality — by DarkLight1337 (created: 2026-03-17 15:04 (UTC+8))
- #37293 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder — documentation,qwen — by Wangbei25 (created: 2026-03-17 20:20 (UTC+8))
- #37265 [Bugfix] fix NoneType error in KV cache transfer with NCCL connector for DeepSeek — bug,deepseek,kv-connector — by leoda1 (created: 2026-03-17 15:47 (UTC+8))
- #37291 [Bugfix] Handle ParallelLMHead in compressed-tensors get_quant_method — bug — by mgehre-amd (created: 2026-03-17 19:30 (UTC+8))
- #37286 [Bugfix] Migrate to python3 -m build from legacy setup.py in Dockerfile.nightly_torch — bug,ci/build — by mathivanansai (created: 2026-03-17 18:25 (UTC+8))
- #37281 [Bugfix][Perf]: avoid Range allocation and dict hashing in _find_range_for_shape — bug — by KrxGu (created: 2026-03-17 17:44 (UTC+8))
- #37266 [Frontend] Delegate tokenization serving preprocessing to OpenAIServingRender — frontend,ready — by sagearc (created: 2026-03-17 15:52 (UTC+8))
- #37259 Bugfix: prevent "selected index k out of range" in TP decode path — bug,ready — by zhejiangxiaomai (created: 2026-03-17 14:58 (UTC+8))
- #37278 Optimize Fusedmoe int8_w8a8 kernel performance — no labels — by JiantaoXu (created: 2026-03-17 17:22 (UTC+8))
- #37268 [ROCm] Enable MXFP4 LoRA support on MI355X and add CDNA4 device IDs — rocm — by ChuanLi1101 (created: 2026-03-17 16:13 (UTC+8))
- #37267 0.17.1+librosa — performance,needs-rebase,ci/build,v1,qwen,nvidia — by zhewenl (created: 2026-03-17 16:01 (UTC+8))
- #37258 [Bugfix][ResponsesAPI] Fix crash when tool_choice=required exceeds max_output_tokens — bug,ready — by chaunceyjiang (created: 2026-03-17 14:28 (UTC+8))
- #37270 fix(lmcache): handle KeyError in layerwise mode — kv-connector — by BillionClaw (created: 2026-03-17 16:37 (UTC+8))
- #37269 fix(lmcache): handle KeyError in layerwise storage mode — kv-connector — by BillionClaw (created: 2026-03-17 16:28 (UTC+8))
- #37262 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder — documentation,qwen — by Wangbei25 (created: 2026-03-17 15:09 (UTC+8))
- #37254 fix: include prompt text in RequestLogItem for gpt-oss-20b — frontend,gpt-oss — by BillionClaw (created: 2026-03-17 13:21 (UTC+8))
Merged PRs
- #36795 [Perf] Enable dual stream execution of input projection for Qwen3 — ready,qwen — by xyang16 (merged: 2026-03-18 11:13 (UTC+8))
- #37330 [ROCm][CI] Skip trtllm kvfp8 dequant tests on ROCm — rocm,ready,nvidia — by AndreasKaratzas (merged: 2026-03-18 11:12 (UTC+8))
- #37351 [Performance] Add --enable-ep-weight-filter CLI option — ready — by esmeetu (merged: 2026-03-18 09:36 (UTC+8))
- #36705 [Kernel][Helion] [16/N] Refactor register_kernel API to be more Dynamo-friendly — ready — by gmagogsfm (merged: 2026-03-18 09:23 (UTC+8))
- #36927 [ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn — rocm,ready,v1 — by divakar-amd (merged: 2026-03-18 08:49 (UTC+8))
- #36720 [Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling — bug,rocm,ready,v1,nvidia — by JartX (merged: 2026-03-18 05:55 (UTC+8))
- #37320 [Kernel] Add non-gated support for NVFP4 CUTLASS MoE — ready,nvidia — by mgoin (merged: 2026-03-18 06:12 (UTC+8))
- #35447 [Bugfix] Fix NemotronH MTP + Chunked Prefill — bug,ready,v1 — by benchislett (merged: 2026-03-17 14:07 (UTC+8))
- #36846 [ROCm] Validate block_size for explicitly selected attention backends — rocm,ready — by AndreasKaratzas (merged: 2026-03-18 06:04 (UTC+8))
- #37336 Make KV connector metadata build overridable via plugin — ready,v1,kv-connector — by sarckk (merged: 2026-03-18 05:29 (UTC+8))
- #36887 [Model] Add ColQwen3.5 4.5B support — documentation,new-model,ready,multi-modality,qwen — by athrael-soju (merged: 2026-03-18 05:17 (UTC+8))
- #35809 [Models] Cohere ASR — documentation,performance,new-model,frontend,ready,v1,multi-modality,qwen — by ekagra-ranjan (merged: 2026-03-18 05:04 (UTC+8))
- #34577 [Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow — bug,ready,quantization — by ricky-chaoju (merged: 2026-03-18 04:48 (UTC+8))
- #37158 [Bugfix] Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 — bug,ready — by dbari (merged: 2026-03-18 04:13 (UTC+8))
- #37252 [Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache — documentation,ready,nvidia — by wzhao18 (merged: 2026-03-18 04:09 (UTC+8))
- #37201 [Deprecation] Deprecate --calculate-kv-scales option — ready,quantization — by mgoin (merged: 2026-03-18 03:57 (UTC+8))
- #37321 [Model] Remove unused handle_oov_mm_token — ready,multi-modality,qwen — by DarkLight1337 (merged: 2026-03-18 03:44 (UTC+8))
- #36988 bump compressed-tensors version to 0.14.0.1 — ready,ci/build,quantization — by brian-dellabetta (merged: 2026-03-18 03:36 (UTC+8))
- #36674 [Bug] Fix FlashInfer MNNVL socket collisions under concurrent vLLM jobs — bug,ready,nvidia — by yewentao256 (merged: 2026-03-18 03:19 (UTC+8))
- #37100 [CI] Split Distributed Tests (4 GPUs) and Kernel MoE tests — ready,ci/build — by avinashsingh77 (merged: 2026-03-18 02:44 (UTC+8))
- #35673 [Torch 2.11] Migrate torch._C._cpu calls to public torch.cpu API — performance,ready,v1,cpu — by atalman (merged: 2026-03-18 02:47 (UTC+8))
- #37225 [Perf] Optimize top-k search in apply_top_k_top_p_triton sampler — performance,ready,v1 — by mgoin (merged: 2026-03-18 02:35 (UTC+8))
- #37290 [Chore] Replace all base64 usages with faster pybase64 package — documentation,performance,frontend,ready,multi-modality,llama,qwen — by Isotr0py (merged: 2026-03-17 22:44 (UTC+8))
- #37157 [openapi] remove redundant exception stack trace[4/N] — frontend,ready — by andyxning (merged: 2026-03-17 23:06 (UTC+8))
- #37230 [CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in init — ready,ready-run-all-tests — by AndreasKaratzas (merged: 2026-03-18 00:08 (UTC+8))
- #37289 [Bugfix] Standardize custom HF Processor init — bug,ready,qwen,deepseek — by DarkLight1337 (merged: 2026-03-17 23:38 (UTC+8))
- #37300 [BugFix] PyTorch Compilation Tests should error if any test fails — bug,ready,ci/build — by zou3519 (merged: 2026-03-17 23:26 (UTC+8))
- #35243 [Bugfix] Fix DP MTP Dummy Run — bug,ready,v1 — by benchislett (merged: 2026-03-17 23:16 (UTC+8))
- #37224 [UltraVox] Fix output type — ready — by vasqu (merged: 2026-03-17 22:51 (UTC+8))
- #34984 [Misc][LoRA] Add --lora-target-modules to restrict LoRA to specific modules — documentation,performance,new-model,rocm,frontend,speculative-decoding,ready,ci/build,v1,multi-modality — by bhoomit (merged: 2026-03-17 22:36 (UTC+8))
- #37241 [Refactor] Relocate responses API tests — ready,v1 — by sfeng33 (merged: 2026-03-17 13:14 (UTC+8))
- #37298 Fix Phi3 test that fails with Transformers v5 — ready,multi-modality — by hmellor (merged: 2026-03-17 22:29 (UTC+8))
- #37287 [Frontend] Complete OpenAI render delegation — frontend,ready — by sagearc (merged: 2026-03-17 21:53 (UTC+8))
- #36955 [Bugfix] Fix unclean shutdown crash with AllReduce Fusion workspace — bug,ready,nvidia — by siewcapital (merged: 2026-03-17 22:22 (UTC+8))
- #36265 pick up tuned prefill configs for FP8 FA3 — ready,ci/build,ready-run-all-tests — by jmkuebler (merged: 2026-03-17 22:00 (UTC+8))
- #36256 [Misc] Use VLLMValidationError in batch, pooling, and tokenize protocol validators — frontend,ready — by umut-polat (merged: 2026-03-17 21:52 (UTC+8))
- #37260 [1/2] Move InternVL-based processors — speculative-decoding,ready,multi-modality — by DarkLight1337 (merged: 2026-03-17 21:50 (UTC+8))
- #37165 [perf][connector] optimize build_connector_meta when host buffer transfer is not used — ready,kv-connector — by youkaichao (merged: 2026-03-17 19:59 (UTC+8))
- #37178 Bugfix for offloading+prefetch for GLM-4.7-FP8 — bug,ready — by sfbemerk (merged: 2026-03-17 21:22 (UTC+8))
- #36664 Add gigachat 3.1 tool parser + fix gigachat3 tool parser — documentation,ready,tool-calling — by ajpqs (merged: 2026-03-17 20:03 (UTC+8))
- #37266 [Frontend] Delegate tokenization serving preprocessing to OpenAIServingRender — frontend,ready — by sagearc (merged: 2026-03-17 19:22 (UTC+8))
- #37259 Bugfix: prevent "selected index k out of range" in TP decode path — bug,ready — by zhejiangxiaomai (merged: 2026-03-17 19:14 (UTC+8))
- #35829 [Feature]: Support for multiple embedding types in a single inference call — frontend,ready,v1 — by staugust (merged: 2026-03-17 17:05 (UTC+8))
- #37258 [Bugfix][ResponsesAPI] Fix crash when tool_choice=required exceeds max_output_tokens — bug,ready — by chaunceyjiang (merged: 2026-03-17 16:54 (UTC+8))
- #32779 Fix infinite recursive search issue in quark.py — documentation,new-model,rocm,speculative-decoding,ready — by xiao-llm (merged: 2026-03-17 15:19 (UTC+8))
- #35535 [Bugfix] Fix loading Music Flamingo — bug,ready — by NickCao (merged: 2026-03-17 13:24 (UTC+8))
- #37246 [Bugfix] dtype mismatch in ngram gpu propose — bug,speculative-decoding,ready,v1 — by PatchouliTIS (merged: 2026-03-17 13:19 (UTC+8))
PRs Closed Without Merging
- #37299 [Bugfix] Responses API: merge all tool calls with preceding assistant message — bug,frontend — by weiguangli-io (closed: 2026-03-18 11:07 (UTC+8))
- #31257 [Speculators][Speculative Decoding] Fix Kimi K2 Eagle3 Support — deepseek — by chaunceyjiang (closed: 2026-03-18 11:01 (UTC+8))
- #36231 [Misc] Add enable_log_requests parameter to RequestLogger — frontend — by chaunceyjiang (closed: 2026-03-18 11:01 (UTC+8))
- #37368 [Misc] make test_worker_memory_snapshot platform-aware — v1 — by zhenwei-intel (closed: 2026-03-18 10:49 (UTC+8))
- #37248 refactor: standardize kimi_linear and minimax_text_01 model weight loading to use AutoWeightsLoader — needs-rebase,v1,multi-modality,qwen,nvidia — by XLiu-2000 (closed: 2026-03-18 10:06 (UTC+8))
- #35398 Andy/spec probs — speculative-decoding,needs-rebase,v1 — by andylolu2 (closed: 2026-03-18 09:12 (UTC+8))
- #37142 [v1] align execute_dummy_batch with uniform decode query length — needs-rebase,v1 — by liuxingbo12138 (closed: 2026-03-18 08:54 (UTC+8))
- #35690 [Bigfix] Fix padding in FULL_DECODE path when MTP is enabled for DP case — v1 — by zyongye (closed: 2026-03-18 06:34 (UTC+8))
- #36555 [torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by Lucaskabela (closed: 2026-03-18 04:57 (UTC+8))
- #37309 [Don't Merge] Test CI with InstantTensor as default load format — ready,ready-run-all-tests — by mgoin (closed: 2026-03-18 05:17 (UTC+8))
- #36881 set mmencodercache to the max_seq_len for all 3 modalities — no labels — by netanel-haber (closed: 2026-03-18 04:58 (UTC+8))
- #33420 [EPLB][BUGFIX] Add max_num_transfers config option to EPLB — bug,needs-rebase — by SageMoore (closed: 2026-03-18 03:56 (UTC+8))
- #37323 fix: use raw string for emoji example to avoid SyntaxWarning (fixes #37133) — no labels — by aayushbaluni (closed: 2026-03-18 02:55 (UTC+8))
- #30338 Fix gigachat3 parser + update tests — frontend,tool-calling — by ajpqs (closed: 2026-03-17 23:59 (UTC+8))
- #34824 [Feature][Quant] Support online MXFP8 MoE quantization for SM100 serving — needs-rebase,ci/build,nvidia — by EdalatiAli (closed: 2026-03-17 23:25 (UTC+8))
- #37211 fix: Add missing k_scale attribute to Attention for weight offloading — no labels — by BillionClaw (closed: 2026-03-17 21:23 (UTC+8))
- #37262 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder — documentation,qwen — by Wangbei25 (closed: 2026-03-17 16:25 (UTC+8))
- #36500 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder — documentation,qwen — by Wangbei25 (closed: 2026-03-17 15:01 (UTC+8))