vLLM Development Activity Report - 2026-03-07
Time window: 2026-03-07 11:26 (UTC+8) – 2026-03-08 11:26 (UTC+8). Statistics: 12 new issues | 37 closed issues | 54 new PRs | 14 merged PRs | 20 PRs closed without merging
📊 Daily Development Summary
Between March 7 and 8, 2026, the vLLM community stayed highly active, handling 12 new issues and 54 new PRs, and merging 14 PRs. Development focused on AMD platform support (where a critical compatibility problem surfaced), tracking down performance and accuracy regressions, and a major push on documentation. The community actively addressed several bugs triggered by the new release (v0.17.0) or by specific hardware (Blackwell, V100), and made progress on model support (Sarvam MoE) and core features (N-Gram GPU speculative decoding with async scheduling).
🎯 AMD/ROCm Ecosystem Updates
AMD ecosystem activity was a focal point this cycle, centering on one serious bug and one important feature PR.
- Issue #36337: Kimi-K2.5-MXFP4 produces gibberish output on MI350X
  - Overview: User Slonegg reports that when serving the amd/Kimi-K2.5-MXFP4 model on 4x MI350X (gfx950) GPUs with ROCm 7.2 and vLLM 0.17.0, the model loads fine but generates pure gibberish (a jumble of Chinese and English characters and symbols).
  - Technical details: Model loading, Quark quantization detection, and the AITER MoE kernels all work correctly. The problem reproduces across multiple vLLM versions and different amd-quark versions. The user suspects a mismatch with the ROCm 7.1 environment specified on the model card (7.2 is in use).
  - Impact and status: This is a serious functional bug that blocks AMD MI350X users from running the officially released MXFP4-quantized model. A bot has tagged the ROCm maintainers (@hongxiayang, @tjtanaa, @vllmellm); the issue remains open and awaits resolution.
- PR #36320: [Quantization] Support Quark W8A8 INT8 MoE inference
  - Overview: Contributor JoursBleu submitted a PR to support inference for MoE models quantized with AMD's Quark toolkit using W8A8 INT8 (per-channel weights + per-token dynamic activations).
  - Technical details: Previously, vLLM's Quark support failed to recognize this dynamic quantization scheme and lacked a corresponding INT8 MoE method implementation, so such models failed to load. This PR adds detection and routing for dynamic W8A8, QuarkW8A8Int8MoEMethod support, and fixes assertion logic in the related quantization functions.
  - Impact: Significantly improves vLLM's compatibility with the AMD Quark quantization toolchain, enabling large MoE models quantized with Quark's ptpc_int8 scheme, such as MiniMax-M2.1, to run on vLLM. This matters for efficient model deployment on AMD platforms.
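To illustrate the quantization scheme this PR targets, here is a minimal NumPy sketch of a W8A8 GEMM with per-channel weight scales and per-token dynamic activation scales. All function names are invented for the illustration; this is not vLLM's or Quark's actual implementation.

```python
import numpy as np

def quantize_per_token(x: np.ndarray):
    """Per-token dynamic activation quantization: one scale per row (token)."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0   # shape [tokens, 1]
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_per_channel(w: np.ndarray):
    """Per-channel weight quantization: one scale per output channel (column)."""
    scale = np.abs(w).max(axis=0, keepdims=True) / 127.0   # shape [1, out_features]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_matmul(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """W8A8 GEMM: INT8 multiply accumulated in int32, then rescaled to float."""
    qx, sx = quantize_per_token(x)
    qw, sw = quantize_per_channel(w)
    acc = qx.astype(np.int32) @ qw.astype(np.int32)
    return acc * sx * sw   # broadcast per-token and per-channel scales

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w = rng.standard_normal((8, 16)).astype(np.float32)
err = np.abs(w8a8_matmul(x, w) - x @ w).max()  # small quantization error
```

The "dynamic" part is that activation scales are computed per batch at runtime, whereas weight scales are fixed offline; the detection logic added by the PR is what lets vLLM route models declaring this scheme to an INT8 MoE method.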
💬 High-Traffic Discussion Analysis
- Issues #36311 / #36117: attention sink protection, and a version regression
  - Core topics: User sahilmalik27 proposed adding "attention sink" protection to the KV cache eviction policy, then closed the issue after discussion; AGENDD reported that Qwen3-VL produces severely degraded outputs on v0.12.0 compared to v0.11.0.
  - Views and conclusions:
    - Attention sinks: Citing recent research, the proposer argued that evicting critical "sink" tokens degrades generation quality. Maintainer noooop quickly pointed out that vLLM only evicts free blocks, so tokens in active sequences are unaffected and the concern does not apply to standard PagedAttention. The proposer accepted the clarification and closed the issue.
    - Version regression: After extensive debugging, the root cause was traced to numerically unstable cuBLAS GEMM kernels in PyTorch 2.9.0 under the default mp backend in multi-GPU-visible environments. Switching to the ray backend, or upgrading to PyTorch 2.9.1 (shipped with v0.16.0) which contains the fix, resolves it. A classic case of a low-level dependency causing anomalous behavior upstream; the thread showcases the community's strong debugging capabilities.
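The free-block-only eviction behavior noted in the attention-sink discussion can be illustrated with a toy block pool. All names here are invented for the sketch; vLLM's real block manager is far more involved, but the invariant is the same: only blocks no active sequence references are eviction candidates.

```python
from collections import OrderedDict

class BlockPool:
    """Toy KV-cache block pool: only unreferenced blocks can be evicted."""

    def __init__(self):
        self.ref_count = {}            # block_id -> number of active sequences using it
        self.free_lru = OrderedDict()  # blocks with ref_count == 0, oldest first

    def allocate(self, block_id):
        if self.ref_count.get(block_id, 0) == 0:
            self.free_lru.pop(block_id, None)   # reuse a cached free block, if any
        self.ref_count[block_id] = self.ref_count.get(block_id, 0) + 1

    def release(self, block_id):
        self.ref_count[block_id] -= 1
        if self.ref_count[block_id] == 0:
            self.free_lru[block_id] = None      # now an eviction candidate

    def evict(self):
        """Evict the least-recently-freed block; never touches referenced blocks."""
        block_id, _ = self.free_lru.popitem(last=False)
        del self.ref_count[block_id]
        return block_id

pool = BlockPool()
pool.allocate("sink")   # block holding a sequence's first ("sink") tokens
pool.allocate("tail")
pool.release("tail")    # that block is no longer referenced
evicted = pool.evict()  # evicts "tail"; "sink" is still referenced and safe
```

Because a sequence's sink tokens live in blocks it still references, they can never be evicted from under it, which is why the proposed protection was unnecessary.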
- Issue #36327: NVIDIA RTX PRO 6000 Blackwell SE compatibility
  - Core topic: A user reports vLLM failing on the new Blackwell-architecture GPU while the same configuration works on an H100.
  - Views and debate: One user posted a lengthy reply containing a "hardware compatibility analysis." The definitive answer came from maintainer ZJY0516, who pointed directly to the official documentation's fix for a PTX compilation toolchain mismatch. This highlights the usual challenges of early support for new hardware architectures; the community responded efficiently with a clear resolution path.
- Issue #36313: CUDA environment mismatch for GPT-OSS on vLLM 0.17
  - Core topic: A user found that installing v0.17.0 from PyPI (a CUDA 12 build) versus from the nightly index (a CUDA 13 build) behaves differently, causing a CUBLAS_STATUS_INVALID_VALUE error when running the GPT-OSS model on an RTX 5090.
  - Views and conclusions: Maintainer ZJY0516 noted that new architectures such as Blackwell require the CUDA 13 wheels and gave the specific install commands. The discussion surfaced vLLM's release strategy of shipping separate builds per CUDA version, and the importance of users picking the right wheel for their hardware.
🔥 Hot Topics and Trends
- Model compatibility and version upgrades: Many issues concern specific models (the Qwen 3.5 series, Nemotron 3 Nano) failing on v0.17.0, with out-of-memory errors, attribute errors, and slow loading, showing that broad compatibility testing and tuning remain an ongoing need after each release.
- Multimodal and vision-language models: Several issues involve VLMs such as Qwen3-VL, including memory-profiling hangs, degraded outputs, and tool-parsing failures. Multimodal inference is complex and demands high system stability, making it a current frontier and pain point.
- New hardware architecture support: Compatibility problems cropped up on both NVIDIA Blackwell (RTX PRO 6000, RTX 5090) and older architectures (V100, RTX 2080 Ti), the former involving the PTX toolchain and the latter Triton version support. This reflects the constant pressure of supporting a GPU ecosystem spanning very old to very new hardware.
- Documentation and user experience: Contributor goingforstudying-ctrl submitted more than ten documentation PRs in one batch, covering quick start, performance tuning, troubleshooting, and a production checklist. This systematic effort to close documentation gaps lowers the barrier to entry and operational cost, a sign of growing project maturity.
🛠️ Key Technical Changes
- PR #36247: Fix compressed-tensors quantization failure for DeepSeek-R1 on MI300x
  - Analysis: Fixes a crash when loading the DeepSeek-R1 FP8 model on AMD MI300x caused by a failed class-name match. Renaming the module class from DeepSeekV2FusedQkvAProj to DeepSeekV2FusedQkvAProjLinear lets the compressed-tensors config recognize it.
  - Impact: A small change, but it keeps an important model family working on AMD and shows attention to cross-platform compatibility details.
- PR #33942: Add support for Sarvam MoE models
  - Analysis: Adds native support for Sarvam AI's MoE models (both regular MHA and MLA variants), reusing and integrating vLLM's existing MLA modules, fused MoE, and tensor-parallel infrastructure.
  - Impact: Expands vLLM's model ecosystem, brings a new model provider (Sarvam) into the community, and broadens the range of models available to users.
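For readers unfamiliar with MoE layers, here is a toy top-k routing sketch in NumPy. It is a generic illustration of the technique, not Sarvam's model or vLLM's fused implementation; all names are invented.

```python
import numpy as np

def topk_moe(x, gate_w, experts, k=2):
    """Minimal top-k MoE routing: each token is sent to its k highest-scoring
    experts and their outputs are combined with softmax-renormalized gates."""
    logits = x @ gate_w                        # [tokens, n_experts] router scores
    topk = np.argsort(logits, axis=1)[:, -k:]  # indices of the top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        g = logits[t, topk[t]]
        g = np.exp(g - g.max())
        g /= g.sum()                           # softmax over the selected gates only
        for gi, e in zip(g, topk[t]):
            out[t] += gi * experts[e](x[t])    # weighted sum of expert outputs
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
gate_w = rng.standard_normal((8, 4))
experts = [lambda v: v] * 4                    # identity experts, just for the demo
out = topk_moe(x, gate_w, experts, k=2)        # with identity experts, out == x
```

A "fused MoE" kernel computes this routing, dispatch, and combine in one GPU kernel instead of the Python loop above; that is the infrastructure the PR reuses.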
- PR #29184: N-Gram GPU implementation compatible with the async scheduler
  - Analysis: Building on earlier work, implements a GPU version of N-Gram speculative decoding and makes it compatible with the new async scheduler, introducing the vectorized NgramGPUKernel and NgramProposerGPU.
  - Impact: Moves the lightweight N-Gram speculative decoding method from CPU to GPU and integrates it into the modern async architecture, promising further inference speedups in suitable scenarios; a notable optimization in speculative decoding.
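The core n-gram drafting idea, independent of this PR's GPU kernels, is a simple lookup: find the most recent earlier occurrence of the last n tokens in the context and propose the tokens that followed it as draft tokens for verification. A simplified CPU sketch (function name invented):

```python
def ngram_propose(context, n=2, k=4):
    """Propose up to k draft tokens by matching the trailing n-gram
    against earlier positions in the context (most recent match wins)."""
    if len(context) <= n:
        return []
    pattern = context[-n:]
    # Scan right-to-left, excluding the trailing n-gram itself.
    for i in range(len(context) - n - 1, -1, -1):
        if context[i:i + n] == pattern:
            return context[i + n:i + n + k]   # tokens that followed the match
    return []

toks = [5, 1, 2, 3, 9, 1, 2]
draft = ngram_propose(toks, n=2, k=3)  # trailing bigram [1, 2] matches at index 1
```

The drafts cost no model forward pass; the target model then verifies them in one batched step, which is why moving the (embarrassingly parallel) matching onto the GPU and overlapping it with async scheduling pays off.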
- PR #30515: Account for CUDA Graphs during startup memory profiling
  - Analysis: The memory-profiling phase now estimates CUDA Graph memory usage (enabled via the VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS environment variable) by capturing a dummy CUDA Graph during profiling, yielding a more accurate computation of available KV cache space.
  - Impact: More accurate GPU memory estimates reduce the risk of runtime OOM caused by CUDA Graphs' actual memory footprint, which especially helps memory-constrained deployments.
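The accounting change can be shown with back-of-envelope arithmetic: reserving an estimate for CUDA Graphs shrinks the computed KV cache budget, trading a little capacity for fewer runtime OOMs. All numbers and the helper name below are invented for illustration.

```python
GiB = 1024 ** 3

def kv_cache_budget(total, weights, activation_peak,
                    cudagraph_estimate=0, gpu_memory_utilization=0.9):
    """Toy version of the profiling arithmetic: whatever remains of the
    usable GPU memory after weights, activations, and (optionally) the
    CUDA Graph estimate becomes the KV cache budget."""
    usable = total * gpu_memory_utilization
    return usable - weights - activation_peak - cudagraph_estimate

# 80 GiB GPU, 40 GiB of weights, 6 GiB activation peak during profiling:
without_graphs = kv_cache_budget(80 * GiB, 40 * GiB, 6 * GiB)
with_graphs = kv_cache_budget(80 * GiB, 40 * GiB, 6 * GiB,
                              cudagraph_estimate=2 * GiB)
# The 2 GiB graph reserve comes straight out of the KV cache budget,
# rather than being discovered as an OOM after startup.
```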
📈 Development Activity Observations
- Contributors: Several active contributors appeared this cycle, e.g. goingforstudying-ctrl focusing on documentation, JoursBleu contributing the important AMD quantization PR, and JWriter20, kota-row, and others fixing model-specific bugs.
- Code merges: 14 PRs were merged, including long-running features (N-Gram GPU support, in progress for over three months) and newly introduced model support (Sarvam MoE), showing the core team actively working through backlog PRs and new features.
- Issue resolution: 37 issues were closed, most of them stale issues, reflecting routine maintenance and cleanup. Meanwhile, the community responded quickly to thorny new problems (the AMD gibberish output, the version regression), with in-depth discussion already underway.
💡 Issues Worth Watching
- AMD MI350X compatibility crisis (Issue #36337): amd/Kimi-K2.5-MXFP4 producing gibberish on MI350X is a blocking problem that directly undermines user experience and confidence on AMD's newest GPUs. AMD engineers or ROCm-savvy contributors should prioritize the investigation.
- Multimodal encoder memory-profiling hang (Issue #36357): In v0.17.0, the Triton ViT attention backend hangs at startup on SM 7.0 (V100) and older GPUs. This exposes the impact of new optimization code on older hardware; --skip-mm-profiling is only a stopgap, and a proper fix or fallback mechanism is needed.
- Distributed executor backend effects (related to Issue #36117): Choosing between the mp and ray backends can produce drastically different results in some scenarios (e.g. multimodal models). Users and developers need a clearer understanding of the backends' behavioral differences and when each applies.
- Evolution of large-scale MoE quantization support: PR #36320 (Quark INT8 MoE) and PR #36307 (TRTLLM FP8 MoE modular kernel) show the community steadily strengthening inference support for large, diverse quantized MoE models, a key technical direction for serving very large models.
📋 Appendix: Detailed Data
New Issues
- #36357 [Bug]: Multimodal encoder memory profiling hangs indefinitely on V100 (SM 7.0) in v0.17 — bug — by aswinkumar1999 (created: 2026-03-08 05:19 (UTC+8))
- #36331 [Bug]: 0% acceptance rate (MTP) with Qwen 3.5 122B (NVFP4) — bug — by ccdv-ai (created: 2026-03-07 22:34 (UTC+8))
- #36355 [Bug]: NVIDIA Nemotron 3 Nano NVFP4 fails to run in 0.17.0 (out of memory, but previously worked) — bug — by pcgeek86 (created: 2026-03-08 04:59 (UTC+8))
- #36350 [Bug]: Qwen 3.5 4B fail on first request on Intel XPU (Arc Graphics B580) — bug — by Nero10578 (created: 2026-03-08 04:39 (UTC+8))
- #36337 [Bug]: Kimi-K2.5-MXFP4 produces gibberish output on MI350X (gfx950) with ROCm 7.2 — rocm — by Slonegg (created: 2026-03-07 23:54 (UTC+8))
- #36328 [Feature]: Include mm_hash and mm_transfer_params in Disaggregated Encoder response to prevent redundant data fetching — feature request — by roytman (created: 2026-03-07 21:33 (UTC+8))
- #36327 [Bug]: Compatibility issue NVIDIA RTX PRO 6000 Blackwell SE — bug — by purvalpatel (created: 2026-03-07 19:51 (UTC+8))
- #36313 [Bug]: gpt-oss vllm 0.17 — bug — by chrisqianz (created: 2026-03-07 14:21 (UTC+8))
- #36315 [Bug]: AttributeError: ‘Qwen3_5TextConfig’ object has no attribute ‘max_window_layers’ — bug — by skfeng36 (created: 2026-03-07 15:37 (UTC+8))
- #36308 [Feature]: Include kv_transfer_params in Streaming Responses to optimize TTFT in P/D Disaggregation — feature request — by RishabhSaini (created: 2026-03-07 12:27 (UTC+8))
- #36326 [Bug]: VLLM compatibility issue with NVIDIA RTX PRO 6000 Blackwell SE — documentation — by purvalpatel (created: 2026-03-07 19:49 (UTC+8))
- #36311 [Feature Request] Pluggable KV cache eviction policy with attention sink protection — no labels — by sahilmalik27 (created: 2026-03-07 13:48 (UTC+8))
Closed Issues
- #18362 [Bug]: Inference fails on Apple silicon due to (distributed) networking error? — bug,stale — by viktor-haag (closed: 2026-03-08 10:19 (UTC+8))
- #20492 [RFC]: KV-Cache Interoperability API Standardization — RFC,stale — by vMaroon (closed: 2026-03-08 10:19 (UTC+8))
- #20964 [Usage]: questions about kv cache int8 — usage,stale — by Eviannn (closed: 2026-03-08 10:19 (UTC+8))
- #21661 [Bug]: RuntimeError: NCCL error: unhandled cuda error — bug,stale — by paolovic (closed: 2026-03-08 10:19 (UTC+8))
- #22082 [RFC]: Clear and stable interface for platform in vLLM — RFC,stale — by wangxiyuan (closed: 2026-03-08 10:19 (UTC+8))
- #22246 [RFC]: Dynamic Expert Load Balance with Zero-like-overhead — RFC,stale — by raindaywhu (closed: 2026-03-08 10:19 (UTC+8))
- #23083 [Feature]: Support Persistent/Pinned Prefixes in Prefix Caching — feature request,stale — by kangyishuai (closed: 2026-03-08 10:19 (UTC+8))
- #25378 [Bug]: Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 Fails Starting — bug,stale — by ng-blip (closed: 2026-03-08 10:18 (UTC+8))
- #25571 [Bug]: Qwen3-Next-80B-A3B-Instruction pressure test error through vllm 0.10.2 — bug,stale — by oswardlx (closed: 2026-03-08 10:18 (UTC+8))
- #26147 [Feature]: Support adding custom content to activations similar to HuggingFace — feature request,stale — by Yifan-Lan (closed: 2026-03-08 10:18 (UTC+8))
- #26871 [Bug]: The system generates an error when using dual graphics cards; version 10.1.1 functions correctly, but version 10.2/11.0 triggers an error upon execution. — bug,stale — by ooodwbooo (closed: 2026-03-08 10:18 (UTC+8))
- #27491 [Bug]: NaN’s in MLA with chunked-prefill — bug,stale — by opherlieber (closed: 2026-03-08 10:17 (UTC+8))
- #27649 [Usage]: Qwen3-32B on RTX PRO 6000 (55s First Token Delay and 15t/s) — usage,stale — by yizhitangtongxue (closed: 2026-03-08 10:17 (UTC+8))
- #27782 [Usage]: The same configuration v0.11.0 will report insufficient video memory compared to v0.8.5 — usage,stale — by lan-qh (closed: 2026-03-08 10:17 (UTC+8))
- #28104 [Usage]: vllm bench serve cannot use the sharegpt dataset — usage,stale — by uOnePiece (closed: 2026-03-08 10:17 (UTC+8))
- #28158 [Feature]: extending `ChatCompletionContentPartParam` for `LLM.chat()` method — feature request,stale — by avyuga (closed: 2026-03-08 10:17 (UTC+8))
- #28160 [Feature]: Supporting tiny / base other variants through mm-processor-kwargs for deepseek-OCR — feature request,stale — by praateekmahajan (closed: 2026-03-08 10:17 (UTC+8))
- #28173 [Bug]: Tool use is not supported in Responses API without Harmony None — bug,stale — by pogevip (closed: 2026-03-08 10:17 (UTC+8))
- #28192 [RFC]: Support separate NICs for KV cache traffic and MoE traffic — RFC,stale — by JayFzh (closed: 2026-03-08 10:17 (UTC+8))
- #28212 [New Model]: support image embedding models like siglip2 — stale — by (closed: 2026-03-08 10:17 (UTC+8))
- #28213 [New Model]: https://huggingface.co/FunAudioLLM/SenseVoiceSmall — stale — by (closed: 2026-03-08 10:17 (UTC+8))
- #28215 [Bug]: Build from source with cuda 12.6 failed — bug,stale — by Luciennnnnnn (closed: 2026-03-08 10:16 (UTC+8))
- #28219 [Bug]: DeepSeek-R1-Distill-Llama-8B tool calls returned in content instead of tool_calls array — bug,stale — by AhmedSeemalK (closed: 2026-03-08 10:16 (UTC+8))
- #28233 [Usage]: LogitProcessor vLLM 0.9.1 run the same prompt 50 times with batching, apply logitprocessor independently on each — usage,stale — by jindalankush28 (closed: 2026-03-08 10:16 (UTC+8))
- #28239 [RFC]: Replace list[list[int]] to numpy arrays to mitigate GC overhead. — RFC,stale — by Jialin (closed: 2026-03-08 10:16 (UTC+8))
- #28246 [Bug]: Return Token Ids not returning Gen Token Ids for GPT-OSS-120b — bug,stale — by sophies-cerebras (closed: 2026-03-08 10:16 (UTC+8))
- #28283 [Bug]: nccl stuck issue — bug,stale — by seindum (closed: 2026-03-08 10:16 (UTC+8))
- #28300 [Feature]: model meta in openai api endpoint should include more model information — feature request,stale — by (closed: 2026-03-08 10:16 (UTC+8))
- #28316 [Performance][Feature]: make DeepGEMM swapAB available for linear gemm — performance,stale — by xuanzic (closed: 2026-03-08 10:16 (UTC+8))
- #36251 [Bug]: Qwen3.5-397B-A17B-FP8 crashes with TP=4 + Expert Parallelism - dimension mismatch in fused linear sharding — no labels — by UmutAlihan (closed: 2026-03-08 06:17 (UTC+8))
- #36355 [Bug]: NVIDIA Nemotron 3 Nano NVFP4 fails to run in 0.17.0 (out of memory, but previously worked) — bug — by pcgeek86 (closed: 2026-03-08 06:02 (UTC+8))
- #27984 [Bug]: `swap_space` parameter is unused and useless — bug,good first issue,keep-open — by mcelrath (closed: 2026-03-07 22:09 (UTC+8))
- #36326 [Bug]: VLLM compatibility issue with NVIDIA RTX PRO 6000 Blackwell SE — documentation — by purvalpatel (closed: 2026-03-07 19:50 (UTC+8))
- #28632 [Bug]: Qwen3-Next, fused_recurrent_gated_delta_rule_fwd do not support ( inplace_final_state: bool = False) — bug,stale — by xyDong0223 (closed: 2026-03-07 19:38 (UTC+8))
- #36311 [Feature Request] Pluggable KV cache eviction policy with attention sink protection — no labels — by sahilmalik27 (closed: 2026-03-07 18:41 (UTC+8))
- #35663 [Bug]: Nightly takes >3x time to load weights compared to v0.16.0 and earlier nightlies — bug — by ehfd (closed: 2026-03-07 14:57 (UTC+8))
- #36117 [Bug]: Qwen3-VL-2B-Instruct produces significantly different (degraded) outputs on v0.12.0 compared to v0.11.0 — no labels — by AGENDD (closed: 2026-03-07 14:28 (UTC+8))
New PRs
- #36371 [compile] Remove strict_autograd_cache and force_non_lazy_backward_lowering workaround — no labels — by zou3519 (created: 2026-03-08 11:17 (UTC+8))
- #36370 Fix autotuned config selection for active LoRA & LoRA CUDA Graph Memory Reduction — v1,nvidia — by RunkaiTao (created: 2026-03-08 09:19 (UTC+8))
- #36320 [Quantization] Support Quark W8A8 INT8 MoE inference — no labels — by JoursBleu (created: 2026-03-07 17:08 (UTC+8))
- #36334 [Doc]: Add README.md to examples directory — documentation,nvidia — by goingforstudying-ctrl (created: 2026-03-07 23:07 (UTC+8))
- #36333 [Doc]: Add `--speculative-config` option documentation — documentation,nvidia — by goingforstudying-ctrl (created: 2026-03-07 22:52 (UTC+8))
- #36309 [BugFix] Fix Qwen3.5 LoRA IndexError in GDN fused projections — bug,qwen — by JWriter20 (created: 2026-03-07 12:48 (UTC+8))
- #36369 [Doc]: Add model card template — documentation — by goingforstudying-ctrl (created: 2026-03-08 08:42 (UTC+8))
- #36368 [Doc]: Add common errors and solutions guide — documentation — by goingforstudying-ctrl (created: 2026-03-08 08:42 (UTC+8))
- #36367 [Doc]: Add comprehensive environment variables reference — documentation — by goingforstudying-ctrl (created: 2026-03-08 08:42 (UTC+8))
- #36366 [Doc]: Add API migration guide — documentation — by goingforstudying-ctrl (created: 2026-03-08 08:42 (UTC+8))
- #36365 [Doc]: Add production deployment checklist — documentation — by goingforstudying-ctrl (created: 2026-03-08 08:42 (UTC+8))
- #36364 [Doc]: Add glossary of terms — documentation — by goingforstudying-ctrl (created: 2026-03-08 08:42 (UTC+8))
- #36363 [Doc]: Add performance tuning guide — documentation — by goingforstudying-ctrl (created: 2026-03-08 08:41 (UTC+8))
- #36362 [Doc]: Add APC debugging tips documentation — documentation — by goingforstudying-ctrl (created: 2026-03-08 08:41 (UTC+8))
- #36351 [Doc]: Add deployment patterns guide — documentation,structured-output,tool-calling,cpu,nvidia — by goingforstudying-ctrl (created: 2026-03-08 04:40 (UTC+8))
- #36349 [Doc]: Add quick start cheat sheet — documentation,structured-output,tool-calling,cpu,nvidia — by goingforstudying-ctrl (created: 2026-03-08 04:38 (UTC+8))
- #36348 [Doc]: Add model card template — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (created: 2026-03-08 04:37 (UTC+8))
- #36347 [Doc]: Add common errors and solutions guide — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (created: 2026-03-08 04:36 (UTC+8))
- #36346 [Doc]: Add comprehensive environment variables reference — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (created: 2026-03-08 04:35 (UTC+8))
- #36345 [Doc]: Add API migration guide — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (created: 2026-03-08 04:28 (UTC+8))
- #36344 [Doc]: Add production deployment checklist — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (created: 2026-03-08 04:27 (UTC+8))
- #36343 [Doc]: Add glossary of terms — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (created: 2026-03-08 04:25 (UTC+8))
- #36342 [Doc]: Add performance tuning guide — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (created: 2026-03-08 04:21 (UTC+8))
- #36336 [Doc]: Add APC debugging tips documentation — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (created: 2026-03-07 23:12 (UTC+8))
- #36360 fix(tool_parser): fix hermes parser dropping final brace during MTP streaming — tool-calling — by giulio-leone (created: 2026-03-08 07:18 (UTC+8))
- #36361 Kimi k25 eagle3 — new-model,speculative-decoding,needs-rebase,v1,deepseek — by jhaotingc (created: 2026-03-08 07:21 (UTC+8))
- #36358 [compile] aot_compile should respect VLLM_DISABLE_COMPILE_CACHE — no labels — by zou3519 (created: 2026-03-08 06:32 (UTC+8))
- #36359 [KV Cache] Use a contiguous buffer for all layers — v1 — by heheda12345 (created: 2026-03-08 06:32 (UTC+8))
- #36324 [Bugfix] Remove incorrect assertion in causal_conv1d_update for Qwen3.5 GDN layers — bug,qwen,deepseek — by Rks2302 (created: 2026-03-07 19:21 (UTC+8))
- #36354 [Bugfix]: Fix unclean shutdown from ctrl-c with AR Fusion — bug,documentation,qwen,nvidia — by goingforstudying-ctrl (created: 2026-03-08 04:57 (UTC+8))
- #36341 [CI] fix flaky empty responses and add diagnostic assertions in vision chat tests — rocm,ready — by AndreasKaratzas (created: 2026-03-08 04:08 (UTC+8))
- #36356 [CI] Stabilize multinode DP internal LB completion tests — rocm,ready,v1 — by AndreasKaratzas (created: 2026-03-08 05:13 (UTC+8))
- #36353 [Bugfix]: Support transformers 5.x Qwen3_5MoeTextConfig — bug,documentation,qwen — by goingforstudying-ctrl (created: 2026-03-08 04:55 (UTC+8))
- #36352 [Bugfix]: Fix Qwen3.5 missing max_window_layers attribute — bug,documentation,qwen — by goingforstudying-ctrl (created: 2026-03-08 04:51 (UTC+8))
- #36330 elastic_ep: Fix stateless group port races — v1 — by itayalroy (created: 2026-03-07 22:23 (UTC+8))
- #36340 [Test] `test_async_scheduling.py` improvements — ready,v1 — by njhill (created: 2026-03-08 02:07 (UTC+8))
- #36338 Add kv_transfer_params to streaming responses — frontend — by RishabhSaini (created: 2026-03-08 00:05 (UTC+8))
- #36310 [BugFix] Download mmproj GGUF files for multimodal models in download… — bug — by reidliu41 (created: 2026-03-07 13:09 (UTC+8))
- #36319 Support online use_audio_in_video — frontend — by gty111 (created: 2026-03-07 16:52 (UTC+8))
- #36329 [Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 — bug,qwen — by sonusflow (created: 2026-03-07 21:57 (UTC+8))
- #36339 Fix missing mmproj file download for multimodal GGUF models — no labels — by 04cb (created: 2026-03-08 00:24 (UTC+8))
- #36335 [Doc]: Add debugging tip for Automatic Prefix Caching — documentation — by goingforstudying-ctrl (created: 2026-03-07 23:09 (UTC+8))
- #36332 [Bugfix] Fix Qwen3-VL video embedding path (enable_mm_embeds) — bug,qwen — by TianWZhang (created: 2026-03-07 22:38 (UTC+8))
- #36318 Disable cascade attention by default — ready,codex — by mgoin (created: 2026-03-07 16:51 (UTC+8))
- #36325 [Bugfix] Disable TMA on Blackwell GPUs to fix Triton autotuner OOM in fla/solve_tril — bug — by Rks2302 (created: 2026-03-07 19:33 (UTC+8))
- #36321 [Bugfix] Support other quantization methods in glm41v — bug — by LoganJane (created: 2026-03-07 18:08 (UTC+8))
- #36322 [Attention] Add FlashInfer FA2 MLA attention backend for SM120 — documentation,v1,nvidia — by grimulkan (created: 2026-03-07 18:15 (UTC+8))
- #36323 Block verificaition POC — v1 — by evgerher (created: 2026-03-07 18:39 (UTC+8))
- #36317 [Hybrid][KVCache] Adjust alignment block size according attn supported kernel sizes — v1 — by MengqingCao (created: 2026-03-07 16:21 (UTC+8))
- #36316 [BugFix]: add bagel to MM_PREFIX_LM_MODELS — bug,ready — by princepride (created: 2026-03-07 15:54 (UTC+8))
- #36312 [Bugfix] Fixed modelopt mixed precision quant format loading — bug — by kota-row (created: 2026-03-07 14:15 (UTC+8))
- #36314 Add script for pd disagg on XPU — v1,kv-connector — by zhenwei-intel (created: 2026-03-07 14:24 (UTC+8))
- #36306 [Attention] PCP alternative implementation — v1 — by LucasWilkinson (created: 2026-03-07 11:26 (UTC+8))
- #36307 [Perf] Add TRTLLM FP8 MoE Modular Kernel — nvidia — by wzhao18 (created: 2026-03-07 12:23 (UTC+8))
Merged PRs
- #35831 [LMCache MP Patch]: Race Condition + Duplicated Block Ids — ready,kv-connector — by sammshen (merged: 2026-03-08 05:52 (UTC+8))
- #35931 [Bugfix][LMCache][KVConnector] fix potential memory leak in LMCache multiprocess mode — bug,ready,kv-connector — by royyhuang (merged: 2026-03-08 05:52 (UTC+8))
- #36098 [compile] Split compile/warmup monitoring — ready-run-all-tests — by zou3519 (merged: 2026-03-08 05:52 (UTC+8))
- #35891 [Perf] Support FP8 KV cache for Flashinfer MLA Sparse — documentation,ready,v1,nvidia — by wzhao18 (merged: 2026-03-08 05:51 (UTC+8))
- #29184 [Core] NGram GPU Implementation compatible with Async Scheduler — speculative-decoding,ready,v1,ready-run-all-tests — by PatchouliTIS (merged: 2026-03-08 05:51 (UTC+8))
- #35224 [ROCm][CI] Accept Different But Valid Output for `test_olmoe_tp` — rocm,ready — by micah-wil (merged: 2026-03-08 05:50 (UTC+8))
- #36174 [ROCm][CI] Enable AITER for failing `test_gpt_oss` test case on MI355 — rocm,ready,gpt-oss — by micah-wil (merged: 2026-03-08 05:50 (UTC+8))
- #35416 [CI] Enable Crosslayer KV layout tests for ROCm platforms — rocm,ready,ci/build,v1,kv-connector — by qli88 (merged: 2026-03-08 05:49 (UTC+8))
- #30515 [UX][Startup] Account for CUDA graphs during memory profiling — ready,v1,nvidia,ready-run-all-tests — by MatthewBonanni (merged: 2026-03-08 05:49 (UTC+8))
- #33942 Adding support to Sarvam’s MoE models — documentation,new-model,ready — by rahul-sarvam (merged: 2026-03-08 01:16 (UTC+8))
- #36303 [Misc] Remove duplicate parser registration — ready — by taneem-ibrahim (merged: 2026-03-07 22:37 (UTC+8))
- #36216 [V0 Deprecation] Remove unused swap_space parameter — documentation,performance,frontend,ready,ci/build,v1,kv-connector — by majiayu000 (merged: 2026-03-07 22:09 (UTC+8))
- #36247 [Bugfix] Fix compressed-tensors quantization failure for DeepSeek-R1 on MI300x — bug,ready,deepseek — by vllmellm (merged: 2026-03-07 17:26 (UTC+8))
- #34778 feat: expose media_io_kwargs at runtime — frontend,ready,multi-modality — by milesial (merged: 2026-03-07 12:27 (UTC+8))
PRs Closed Without Merging
- #25545 [Perf] Disable group QuantFP8 on SM100 (use forward_native) — stale — by ElizaWszola (closed: 2026-03-08 10:18 (UTC+8))
- #27398 [CI/Build] Add aime25 eval to gpt-oss CI — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,stale,v1 — by kingsmad (closed: 2026-03-08 10:17 (UTC+8))
- #27481 [Test] Draft: Nixl fault tests — documentation,ready,needs-rebase,ci/build,stale,v1,kv-connector — by wseaton (closed: 2026-03-08 10:17 (UTC+8))
- #27959 [LoRA][FusedMoE] Introduce FusedMoEPermuteExpertsUnpermuteWithLoRA — needs-rebase,stale — by varun-sundar-rabindranath (closed: 2026-03-08 10:17 (UTC+8))
- #28183 bench(openai-chat): Use continuous usage stats for per-token ITL; correct TTFT; add chunk-gap fallback + unit tests — performance,stale — by nick-allison (closed: 2026-03-08 10:17 (UTC+8))
- #36348 [Doc]: Add model card template — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 08:07 (UTC+8))
- #36347 [Doc]: Add common errors and solutions guide — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 08:06 (UTC+8))
- #36346 [Doc]: Add comprehensive environment variables reference — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 08:06 (UTC+8))
- #36345 [Doc]: Add API migration guide — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 08:06 (UTC+8))
- #36344 [Doc]: Add production deployment checklist — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 08:06 (UTC+8))
- #36343 [Doc]: Add glossary of terms — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 08:06 (UTC+8))
- #36342 [Doc]: Add performance tuning guide — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 08:06 (UTC+8))
- #36336 [Doc]: Add APC debugging tips documentation — documentation,structured-output,needs-rebase,tool-calling,cpu,nvidia — by goingforstudying-ctrl (closed: 2026-03-08 08:06 (UTC+8))
- #27988 Remove unused swap_space parameter — documentation,frontend,tpu,stale,v1,kv-connector — by mcelrath (closed: 2026-03-07 22:09 (UTC+8))
- #36323 Block verificaition POC — v1 — by evgerher (closed: 2026-03-07 18:39 (UTC+8))
- #34927 [Bugfix][Kernel] Fix activation_kernels.cu build failure on Blackwell… — bug,ready,needs-rebase — by FloatingVertex (closed: 2026-03-07 16:35 (UTC+8))
- #31521 [Triton][CustomOp] A Triton operator dispatch mechanism through modified CustomOp — needs-rebase,qwen — by MengqingCao (closed: 2026-03-07 16:25 (UTC+8))
- #36314 Add script for pd disagg on XPU — v1,kv-connector — by zhenwei-intel (closed: 2026-03-07 14:24 (UTC+8))
- #35826 fix: preserve prior-turn analysis messages in Harmony multi-turn conversations — frontend,gpt-oss — by weiguangli-io (closed: 2026-03-07 12:38 (UTC+8))
- #36305 [Misc] Fix EPLB balancedness metric and add debug logging — no labels — by tlrmchlsmth (closed: 2026-03-07 11:46 (UTC+8))