vLLM Development Activity Report - 2026-03-31
Time window: 2026-03-31 11:57 (UTC+8) ~ 2026-04-01 11:57 (UTC+8)
Stats: 14 new issues | 19 closed issues | 60 new PRs | 51 merged PRs | 23 PRs closed without merging
📊 Daily Development Summary
Between March 31 and April 1, 2026, the vLLM project sustained a very high level of development activity, opening and merging a large volume of code. Work concentrated on performance optimization (notably for AMD platforms and new quantization methods), bug fixes (especially around the Qwen3.5 model family and FP8 quantization), and feature work (tool calling, pooling-model support). By actively triaging backlogged issues and landing several significant optimizations (vLLM IR, the MoE refactor, a compile-only mode), the community demonstrated strong iteration speed and problem-solving capacity.
🎯 AMD/ROCm Ecosystem Updates
AMD/ROCm ecosystem development was very active this period, spanning performance optimization, bug fixes, and tooling improvements.
- Performance and feature optimization
- PR #38665 [ROCm] Enable dual-stream MoE shared experts and GLM-5 MXFP4 Quark support: submitted by an AMD engineer. Aims to improve inference performance of GLM-5 MXFP4 models on AMD MI355X GPUs. Key changes: 1) extends dual-stream execution of MoE shared experts from CUDA-only to `is_cuda_alike()`, enabling ROCm/HIP streams; 2) adds the GLM-5 model architecture (`glm_moe_dsa`) to the Quark dynamic MXFP4 support list. An important step in porting high-performance optimizations from the ATOM project into vLLM.
- PR #38647 Add opt-in `--record-power` option to `vllm bench serve`: submitted by AMD engineer fxmarty-amd. Adds power and energy-efficiency (tokens/Joule) measurement to the benchmarking tool, currently supporting AMD GPUs and CPUs. This is essential for evaluating energy efficiency across precisions (dtypes) and serving configurations.
- PR #38627 [WIP] [ROCm] [Doc] Update ROCm attention selection doc and test coverage: corrects the auto-selection logic for ROCm attention backends so that, when `--attention-backend` is unspecified, the right backend (ROCM_ATTN, ROCM_AITER_FA, or ROCM_AITER_UNIFIED_ATTN) is chosen based on the VLLM_ROCM_USE_AITER environment variable and the sink_attn setting; also updates docs and tests.
- PR #38615 [ROCm] Fix aiter persistent mode mla with q/o nhead<16 for kimi-k2.5 tp8: fixes a shape mismatch in AITER MLA persistent mode when the per-GPU attention head count is small, allowing models such as Kimi-K2.5 to run at TP=8.
- Bug fixes and CI maintenance
- PR #38680 [CI][ROCm] Remove unsupported cases in test_fusion.py: removes test cases for group-quant and per-tensor-quant fusion, which are unsupported on ROCm AITER, so that CI passes.
- PR #38614 [ROCm][CI] Fix ROCm Python-only install fetching CUDA torch via build isolation: fixes the ROCm Python-only install test, where build isolation caused the CUDA build of torch to be downloaded by mistake.
- PR #37841 replace cuda_device_count_stateless() to current_platform.device_count(): a cross-platform cleanup that unifies device counting behind a platform-agnostic interface, easing future support for new accelerators such as XPU; ROCm benefits as well.
- Model support and testing
- PR #38664 [CI][ROCm] Add Qwen3.5-35B-A3B-MXFP4 model eval into CI: plans to add the AMD-released MXFP4-quantized model to CI evaluation, a sign that support for AMD-ecosystem quantized models is being standardized.
Summary: the AMD team was highly active this period, landing core changes for performance, energy efficiency, and stability while continuing to harden CI/CD, reflecting a strong investment in vLLM's maturity on AMD hardware.
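The `is_cuda_alike()` change in PR #38665 is essentially a one-predicate widening of a dispatch condition. A minimal, hypothetical sketch of that pattern follows; the names and structure here are illustrative, not the actual vLLM code:

```python
from enum import Enum

class Platform(Enum):
    CUDA = "cuda"
    ROCM = "rocm"
    CPU = "cpu"

def is_cuda_alike(platform: Platform) -> bool:
    # CUDA and ROCm/HIP both expose a stream abstraction, so both can
    # overlap shared-expert and routed-expert work on two streams.
    return platform in (Platform.CUDA, Platform.ROCM)

def moe_execution_mode(platform: Platform) -> str:
    # Before the change, the guard was effectively `platform is Platform.CUDA`;
    # widening it to is_cuda_alike() enables dual-stream execution on ROCm too.
    return "dual-stream" if is_cuda_alike(platform) else "single-stream"
```

With this gating, `moe_execution_mode(Platform.ROCM)` now selects the dual-stream path that was previously CUDA-only.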
💬 High-Engagement Discussions
- Issue #38626: [Bug]: CUDA error: an illegal memory access was encountered when deploy Qwen3.5-35B-A3B-FP8 on A100
- Core issue: deploying the FP8-quantized Qwen3.5 model on A100 GPUs (TP=2) triggers an illegal-memory-access CUDA error during the decode phase.
- Perspectives:
- Contributor CodersAcademy006 provided an in-depth technical analysis: the error occurs on rank 1 and may be related to non-native FP8 (emulated on A100), CUDA graph (CUDAGraphMode) replay, and tensor-shape changes when cached requests are resumed. He gave detailed debugging suggestions (e.g., CUDA_LAUNCH_BLOCKING=1).
- Contributor ZJY0516 suggested the reporter try the latest vLLM (0.18).
- Points of contention: none; the discussion focused on root-cause analysis.
- Status: open, awaiting the reporter's debugging results or confirmation of whether the newer version fixes the issue.
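The CUDA_LAUNCH_BLOCKING=1 suggestion above makes kernel launches synchronous, so the illegal access is reported at the launch that actually faulted instead of at a later, unrelated synchronization point. A sketch of how it is typically applied (the variable must be set before the CUDA context is created, i.e. before importing torch or vLLM; exporting it in the shell before `vllm serve` has the same effect):

```python
import os

# Must be set before the first CUDA call in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# With asynchronous launches, an illegal memory access surfaces wherever the
# runtime next synchronizes; with blocking launches, the Python traceback
# points at the kernel invocation that actually faulted.
```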
- Issue #38634: [Bug]: MLA + FP8 KV cache + CUDA Graph causes random NaN in decode phase
- Core issue: with MLA-architecture models (e.g., DeepSeek), enabling FP8 KV cache together with CUDA Graph produces random NaN values during decode.
- Perspectives:
- Maintainer gaby first pointed out that the reporter's vLLM version (0.12.0) is outdated and recommended upgrading to the latest release (0.18.1).
- User wwwjs asked whether upgrading would resolve the issue.
- Contributor RemizovDenis added a quantization-theory perspective: standard FP8 quantization tends to destabilize at the tails of the distribution, and he mentioned alternatives such as 3-bit quantization with residual scaling.
- Points of contention: none. The discussion followed the standard playbook: upgrade first, then assess whether this is a known quantization-stability problem.
- Status: open; the reporter was advised to upgrade and check whether the issue reproduces.
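RemizovDenis's point about FP8 tail instability comes down to E4M3's 3-bit mantissa: the representable grid is coarse, so rounding errors of a few percent are unavoidable and grow in absolute terms at the tails of the value distribution. A self-contained illustration follows; it assumes the OCP FP8 E4M3 variant with a maximum magnitude of 448 and is for intuition only, not vLLM's kernel code:

```python
def e4m3_grid():
    """Non-negative magnitudes representable in OCP FP8 E4M3 (max 448)."""
    vals = {0.0}
    for e in range(-6, 9):            # normal exponent range
        for m in range(8):            # 3 mantissa bits
            vals.add((1 + m / 8) * 2.0 ** e)
    for m in range(1, 8):             # subnormals
        vals.add(m / 8 * 2.0 ** -6)
    return sorted(v for v in vals if v <= 448.0)

GRID = e4m3_grid()

def quantize(x: float) -> float:
    """Snap x to the nearest representable E4M3 value (round-to-nearest)."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 448.0)          # saturate at the format maximum
    return sign * min(GRID, key=lambda g: abs(g - mag))

# Near the top of the range the grid spacing is 32, so a tail value such as
# 400 lands on 384: a 4% error on a single cache entry, before any
# accumulation across decode steps.
```

Mid-range values round with much smaller absolute error, which is why the tails, not the bulk, dominate the stability discussion.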
🔥 Hot Topics and Trends
- Qwen3.5 issues clustering: multiple issues (#38656, #38626, #38643, #38666) report startup hangs, CUDA errors, and garbled output for Qwen3.5 models under different configurations. This popular model family (especially its FP8/A3B/NVFP4 variants) is facing compatibility and stability challenges in complex setups.
- FP8 quantization and KV cache stability: beyond the Qwen3.5 reports, Issues #38634 and #38658 both point to numerical problems (NaNs) or performance regressions when FP8 KV cache is combined with specific conditions (CUDA Graph, Marlin, hardware without native FP8), making this a current technical focus.
- Tool-call parser robustness: Issue #38674 reveals that the Jamba tool parser hard-codes a specific token format and therefore fails on models using the standard Mistral format, reflecting the compatibility challenges of an increasingly diverse tool-calling ecosystem.
- Deepening compilation and performance work: several PRs (e.g., #38675, #38657, #38671, #33825) advance low-level optimizations around `torch.compile`, kernel fusion, migration to the Torch stable ABI, and the new vLLM IR, showing a sustained push for peak inference performance.
🛠️ Key Technical Changes
- PR #38675 [torch.compile] Add compile-only mode: introduces a `vllm compile` CLI command and API that let users precompile a model without running inference, decoupling compilation overhead from weight loading and enabling "warm starts". A key step toward fixing slow startup for large models.
- PR #37160 [Feat][v1] Simple yet General CPU KV Cache Offloading: implements a new, simpler, and more general CPU KV cache offloading scheme. By reusing the existing `BlockPool` and `KVCacheCoordinator`, it gets HMA, prefix caching, and LRU eviction for free; compared with the old scheme it has less code and more features.
- PR #36286 [MoE Refactor] Migrate Unquantized to Full Oracle Flow: migrates the unquantized (BF16) MoE compute path to the new modular "Oracle" flow, unifying it with the FP8/NVFP4 paths. A milestone in the modernization of the MoE kernels, improving code consistency and maintainability.
- PR #33825 [vLLM IR] 1/N Implement IR skeleton and rms_norm op: merged. The foundational PR for vLLM's intermediate representation (IR) system. It defines registration and dispatch for IR ops and kernels and implements the first op, `rms_norm`. vLLM IR aims to be a unified, extensible operator abstraction layer, critical for future compiler optimizations and multi-backend support.
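The appeal of the PR #37160 design is that a block pool with LRU eviction already gives prefix-cache-style reuse almost for free. A toy sketch of that idea, using an ordered map for recency; this is hypothetical illustration, not the actual vLLM implementation (which reuses BlockPool/KVCacheCoordinator):

```python
from collections import OrderedDict

class CPUOffloadPool:
    """Toy LRU-evicting CPU block pool keyed by content hash."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block_hash -> kv payload, oldest first

    def put(self, block_hash: str, payload: bytes) -> None:
        if block_hash in self.blocks:
            self.blocks.move_to_end(block_hash)      # refresh recency
            return
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)          # evict least recently used
        self.blocks[block_hash] = payload

    def get(self, block_hash: str):
        """Prefix-cache style lookup: a hit also refreshes recency."""
        if block_hash not in self.blocks:
            return None
        self.blocks.move_to_end(block_hash)
        return self.blocks[block_hash]
```

Because blocks are addressed by content hash, two requests sharing a prompt prefix naturally hit the same offloaded blocks, and cold blocks age out without any separate bookkeeping.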
📈 Development Activity Observations
- High output, fast merging: 60 new PRs and 51 merges within 24 hours, a merge rate of 85%, indicating highly efficient code review by the core team and a project in rapid iteration.
- Active technical-debt cleanup: 19 old issues were closed, including several long-standing ones marked "stale", showing the community is actively clearing backlog and keeping the project healthy.
- Strong AMD showing: as noted above, multiple AMD engineers contributed key PRs spanning performance, power, and compatibility, making AMD one of the most active corporate contributors this period.
- High community engagement: several hot issues drew in-depth technical analysis and debugging advice from community contributors, reflecting the healthy, collaborative culture of the vLLM developer community.
💡 Issues Worth Watching
- Widespread Qwen3.5 FP8/A3B problems: a series of issues suggests nontrivial deployment risk for this model family on certain hardware and configurations; users and developers should watch the stability fixes closely.
- MLA + FP8 KV cache compatibility: Issues #38634 and #38652 hint that combining MLA-style models with FP8 KV cache may carry inherent stability challenges and may require model- or algorithm-level treatment.
- Tool-call parser standardization: Issue #38674 exposes the current fragmentation among parsers; a more unified, configurable parsing framework may be needed to handle diverse model output formats.
- New quantization methods: PR #38662 proposes "TurboQuant", a new 2/3-bit KV cache quantization method claiming large memory savings. Promising, but its stability, real performance gains, and integration complexity need thorough community evaluation.
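For intuition on why 2/3-bit KV quantization needs extra machinery (TurboQuant's PolarQuant + QJL per the PR title, or the residual scaling mentioned in the #38634 thread), plain uniform symmetric quantization makes the trade-off visible: memory shrinks proportionally to the bit width versus FP16, but reconstruction error grows sharply at low bit widths. A hedged sketch, illustrative only and not the TurboQuant algorithm:

```python
def quantize_nbit(values, bits):
    """Uniform symmetric quantization: value ~ code * scale, code in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def reconstruction_error(values, bits):
    """Total absolute error after a quantize/dequantize round trip."""
    codes, scale = quantize_nbit(values, bits)
    return sum(abs(v - c * scale) for v, c in zip(values, codes))

# A KV-cache-like slice of activations (made-up numbers for illustration).
kv_slice = [0.07, -0.31, 0.52, -1.20, 2.40, 0.05, -0.88, 1.10]
# At 3 bits there are only 7 usable levels, so the round-trip error is far
# larger than at 8 bits; schemes like TurboQuant add transforms and residual
# terms precisely to claw this accuracy back.
```

Running `reconstruction_error(kv_slice, 3)` versus `reconstruction_error(kv_slice, 8)` shows the gap directly, which is why claimed low-bit results deserve careful benchmarking.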
📋 Appendix: Detailed Data
New Issues
- #38677 [Bug]: data_parallel_rpc_port is not robust to invalid traffic and can crash multi-node startup — bug — by JasonHe-WQ (created: 2026-04-01 10:29 (UTC+8))
- #38674 [Bug]: Jamba tool parser crashes on Mistral-style [TOOL_CALLS] models with standard HF tokenizer (e.g., Apriel-Nemotron-15b) — bug — by oromanenko-nv (created: 2026-04-01 09:37 (UTC+8))
- #38666 [Bug]: Regression can no longer load Qwen 3.5 397B nvfp4 model - CUBLAS_STATUS_NOT_INITIALIZED — bug — by bitbottrap (created: 2026-04-01 07:45 (UTC+8))
- #38656 [Bug]: qwen 3.5 model launch get stuck for quite a long time — bug — by yanan1116 (created: 2026-04-01 04:58 (UTC+8))
- #38660 [Bug]: CUDA assert in triton attention for MolmoWeb models (Molmo2 architecture with different max_position_embeddings) — no labels — by 2imi9 (created: 2026-04-01 05:35 (UTC+8))
- #38658 [Bug]: MLA attention casts activations to int32 when using Marlin FP8 on GPUs without native FP8 support (sm < 89) — bug — by marcusm117 (created: 2026-04-01 05:16 (UTC+8))
- #38634 [Bug]: MLA + FP8 KV cache + CUDA Graph causes random NaN in decode phase — bug — by wwwjs (created: 2026-03-31 20:48 (UTC+8))
- #38652 [Bug]: --kv-cache-dtype fp8 produces garbage output on MLA models (GLM-4.7-Flash) at multi-turn — no labels — by varjoranta (created: 2026-04-01 01:50 (UTC+8))
- #38651 [RFC]: Add `max_tokens_per_doc` support for rerank and scoring endpoints — RFC — by jefp (created: 2026-04-01 01:39 (UTC+8))
- #38633 [New Model]: JinaEmbeddingsV5Model — no labels — by DanielBeck93 (created: 2026-03-31 20:38 (UTC+8))
- #38643 [Bug]: Qwen3.5 (Qwen3_5ForConditionalGeneration) FLA linear attention tensor format mismatch causes gibberish output — no labels — by BANG404 (created: 2026-03-31 22:47 (UTC+8))
- #38642 [Usage]: model return value reasoning_content — usage — by hf201429 (created: 2026-03-31 22:06 (UTC+8))
- #38626 [Bug]: CUDA error: an illegal memory access was encountered when deploy Qwen3.5-35B-A3B-FP8 on A100 — bug — by zhaotyer (created: 2026-03-31 18:22 (UTC+8))
- #38619 [Bug]: Kimi-K2.5 compressed-tensors MoE Marlin repack fails with PTX toolchain error on H200 (CUDA 12.8, driver 570.133.20) — no labels — by DavidBellamy (created: 2026-03-31 16:53 (UTC+8))
Closed Issues
- #27519 [Bug]: torch._dynamo.exc.FailOnRecompileLimitHit: recompile_limit reached with fullgraph=True on 2x RTX2080Ti — bug,stale — by ir1ka (closed: 2026-04-01 10:41 (UTC+8))
- #21098 [Bug]: vllm deploy deepseek-v3 bug on attributeerror: '_OpNameSpace' '_moe_C' object has no attribute 'moe_align_block_size' — bug,stale — by controlRun (closed: 2026-04-01 10:18 (UTC+8))
- #23934 [Bug]: CPU Backend with GPT-OSS Failed — bug,stale,gpt-oss — by zzc98 (closed: 2026-04-01 10:17 (UTC+8))
- #26781 [Feature]: Better base64 to torch tenser — good first issue,feature request,stale — by noooop (closed: 2026-04-01 10:17 (UTC+8))
- #28759 [Bug]: VLLM compilation cache collision for model's whose graphs have same shape but different input — bug,torch.compile,stale — by roadr (closed: 2026-04-01 10:16 (UTC+8))
- #29085 [Bug]: RuntimeError "cancelled" when using pipeline parallelism with Qwen3-14B — bug,stale — by slwang-ustc (closed: 2026-04-01 10:16 (UTC+8))
- #29177 [Usage]: Vllm + Intervl model local infra Image preprocessing / request adding becomes bottleneck even with more CPU cores — how to accelerate? — usage,stale — by Passenger12138 (closed: 2026-04-01 10:16 (UTC+8))
- #29517 [Bug]: Duplicate registration of a fake implementation for the gptq_marlin_repack operator causing vllm serve to fail. — bug,stale — by atalhens (closed: 2026-04-01 10:16 (UTC+8))
- #29544 [Bug]: "expandable_segments: True" causes vLLM EngineCore initialization to fail when running Qwen3 VL models — bug,stale — by fesvhtr (closed: 2026-04-01 10:16 (UTC+8))
- #29643 [Usage]: Enabling Tool call in the Python SDK — usage,stale — by Madan1215 (closed: 2026-04-01 10:16 (UTC+8))
- #29763 [Bug]: GLM-4.5 reasoning parser streaming fails without tools in request - missing as_list() conversion — bug,stale — by sygenaithanos (closed: 2026-04-01 10:16 (UTC+8))
- #29831 [Bug]: v1/responses `enable_response_messages` returns blank message `content` — bug,stale — by jacobthebanana (closed: 2026-04-01 10:16 (UTC+8))
- #38085 [Bug]: Qwen3.5 LoRA module is not in model's supported LoRA target modules — bug — by wufenglailai (closed: 2026-04-01 09:59 (UTC+8))
- #31913 [Bug]: test_eagle_dp test is flaky — bug — by zou3519 (closed: 2026-04-01 00:42 (UTC+8))
- #36926 [Bug]: nemotron_h does not work with DeepEP all2all backends due to hidden dim rounding — bug — by bnellnm (closed: 2026-04-01 00:22 (UTC+8))
- #38098 [CI Failure]: LM Eval Large Models (H200) — ci-failure — by ilmarkov (closed: 2026-03-31 23:08 (UTC+8))
- #36443 [Bug]: qwen3.5-27b ValueError: Tokenizer class TokenizersBackendFast does not exist or is not currently imported. — bug — by xiaotianns (closed: 2026-03-31 19:31 (UTC+8))
- #38234 Test Failure: test_run_eagle_dp[FLASH_ATTN] produces non-deterministic outputs with EAGLE speculative decoding — no labels — by markmc (closed: 2026-03-31 18:30 (UTC+8))
- #38582 [CI Failure]: tests/models/language/pooling/test_splade_sparse_pooler.py — ci-failure — by bnellnm (closed: 2026-03-31 12:54 (UTC+8))
New PRs
- #38676 [CPU] Support head_size 512 in cpu_attn — documentation,ready,v1,cpu — by bigPYJ1151 (created: 2026-04-01 10:26 (UTC+8))
- #38663 [Core][Feat][ safely abort requests where FSM failed to advance — v1 — by walterbm (created: 2026-04-01 06:34 (UTC+8))
- #38635 [Feature] NUMA binding support for GPU workers — documentation,v1,nvidia — by Harry-Chen (created: 2026-03-31 20:57 (UTC+8))
- #38649 [Bugfix] Lazy import diskcache to avoid sqlite3/libstdc++ ImportError at startup — bug,structured-output,ready,v1 — by jeffreywang-anyscale (created: 2026-04-01 01:23 (UTC+8))
- #38682 [XPU] [Quant] rename mxfp8_e4m3_quantize and add xpu backend implementation — intel-gpu — by zufangzhu (created: 2026-04-01 11:08 (UTC+8))
- #38678 Fix llm_request trace context propagation — frontend,v1 — by will-deines (created: 2026-04-01 10:34 (UTC+8))
- #38617 [bugfix] do not add extra linebreak for score/rerank with chat template — bug,frontend,ready — by staugust (created: 2026-03-31 16:25 (UTC+8))
- #38655 Fix Nano Nemotron VL regressions — multi-modality — by netanel-haber (created: 2026-04-01 04:24 (UTC+8))
- #38665 [ROCm] Enable dual-stream MoE shared experts and GLM-5 MXFP4 Quark support — rocm,v1 — by ChuanLi1101 (created: 2026-04-01 07:00 (UTC+8))
- #38681 [CPU] Fix lscpu NUMA node regex to handle quoted - and null in containers — cpu — by Monokaix (created: 2026-04-01 11:04 (UTC+8))
- #38680 [CI][ROCm] Remove unsupported cases in test_fusion.py — rocm — by charlifu (created: 2026-04-01 11:01 (UTC+8))
- #38679 fused_moe_kernel opt — no labels — by SYChen123 (created: 2026-04-01 10:54 (UTC+8))
- #38612 [CI Failure] pin colmodernvbert revision — ready,multi-modality — by noooop (created: 2026-03-31 14:49 (UTC+8))
- #38639 [torch.compile] change auto_functionalized return structure to use indexing instead of unpack values — needs-rebase — by chaojun-zhang (created: 2026-03-31 21:44 (UTC+8))
- #38621 [Kernel fusion] QK Norm + RoPE + Cache + Quant — needs-rebase — by EricccYang (created: 2026-03-31 17:17 (UTC+8))
- #38654 [Bugfix] Fix `vllm bench serve` to count multimodal tokens in "total input tokens" — bug,performance — by mgehre-amd (created: 2026-04-01 04:07 (UTC+8))
- #38675 [torch.compile] Add compile-only mode — frontend,v1 — by zou3519 (created: 2026-04-01 10:09 (UTC+8))
- #38650 [Bugfix] Enable MTP for the official Qwen3.5 NVFP4 checkpoint — bug,qwen — by mmangkad (created: 2026-04-01 01:34 (UTC+8))
- #38623 [Doc]: improve disaggregated prefilling guide — documentation — by neweyes (created: 2026-03-31 17:36 (UTC+8))
- #38673 [Bugfix] Preserve original ImportError in gRPC server entrypoint — bug,frontend — by CatherineSue (created: 2026-04-01 09:10 (UTC+8))
- #38672 Fix llm_request trace context propagation — structured-output,frontend,v1,tool-calling,gpt-oss,nvidia — by will-deines (created: 2026-04-01 08:52 (UTC+8))
- #38669 Fix Marlin repack PTX incompatibility on H100/H200 (CUDA 12.8) — ci/build,nvidia — by DavidBellamy (created: 2026-04-01 08:01 (UTC+8))
- #38671 [5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a gemm to torch stable ABI — ci/build,nvidia — by mikaylagawarecki (created: 2026-04-01 08:14 (UTC+8))
- #38670 [Bugfix] Fix AWQ models batch invariance issues — bug,v1 — by YM2132 (created: 2026-04-01 08:03 (UTC+8))
- #38661 [2/N] Pass `model_config` to the Attention constructors — ready,llama,qwen,deepseek,gpt-oss — by MatthewBonanni (created: 2026-04-01 05:55 (UTC+8))
- #38668 add vLLM-side LMCache EC connector entrypoint — kv-connector — by benyebai (created: 2026-04-01 07:53 (UTC+8))
- #38667 Fix/lmcache ec connector module — needs-rebase,v1,kv-connector — by benyebai (created: 2026-04-01 07:48 (UTC+8))
- #38622 [Bug] Fix encoder cache miss assertion crash with MTP + multimodal — bug,v1 — by esmeetu (created: 2026-03-31 17:24 (UTC+8))
- #38657 [compile] Invoke split FX graph by codegen. — no labels — by zhxchen17 (created: 2026-04-01 05:11 (UTC+8))
- #38664 [CI][ROCm] Add Qwen3.5-35B-A3B-MXFP4 model eval into CI — rocm,qwen — by BowenBao (created: 2026-04-01 06:53 (UTC+8))
- #38659 [1/N][Cleanup] Standardize on use of `is_quantized_kv_cache` — rocm,intel-gpu,ready,v1,cpu,nvidia — by MatthewBonanni (created: 2026-04-01 05:29 (UTC+8))
- #38662 [Kernel] feat: TurboQuant KV cache quantization (PolarQuant + QJL) — ci/build,v1 — by allaspectsdev (created: 2026-04-01 06:08 (UTC+8))
- #38637 [Quantization] Consolidate dummy format logic into DummyModelLoader — ready,quantization — by Josephasafg (created: 2026-03-31 21:19 (UTC+8))
- #38644 [Refactor] Simplify FutureWrapper in MultiprocExecutor — v1 — by yzong-rh (created: 2026-04-01 00:11 (UTC+8))
- #38638 Fix Nano Nemotron VL regressions — no labels — by netanel-haber (created: 2026-03-31 21:42 (UTC+8))
- #38624 fix(scheduler): discard async tokens on all V1 preemption paths — v1 — by CodersAcademy006 (created: 2026-03-31 17:43 (UTC+8))
- #38653 [Bugfix] Use async preprocessing in pooling/embedding endpoints — bug,frontend — by chuqiwang (created: 2026-04-01 02:01 (UTC+8))
- #38647 Add opt-in `--record-power` option to `vllm bench serve` — performance,rocm — by fxmarty-amd (created: 2026-04-01 01:00 (UTC+8))
- #38640 [BugFix] Fix streaming tool call with null type or id in final chunk — bug,frontend — by yanghui1-arch (created: 2026-03-31 21:51 (UTC+8))
- #38648 [XPU] Enable group_size=-1/channel-wise for w4a16 and w4a8 — intel-gpu — by tianmu-li (created: 2026-04-01 01:00 (UTC+8))
- #38646 [compile] fuse rope and cache insertion for mla — no labels — by ZJY0516 (created: 2026-04-01 00:38 (UTC+8))
- #38645 [Hotfix] Minor polish to reduce the `key in map` calling. — v1 — by RocMarshal (created: 2026-04-01 00:37 (UTC+8))
- #38636 (security) Enforce frame limit in VideoMediaIO — ready,multi-modality — by jperezdealgaba (created: 2026-03-31 21:07 (UTC+8))
- #38630 Bugfix/multi node dp tcp placement — bug,frontend,v1 — by shaharmor98 (created: 2026-03-31 19:52 (UTC+8))
- #38627 [WIP] [ROCm] [Doc] Update ROCm attention selection doc and test coverage — bug,rocm,ready,v1 — by tjtanaa (created: 2026-03-31 18:56 (UTC+8))
- #38629 [Fix] handle PaddleOCR-VL image processor max_pixels across Transformers v4/v5 — ready — by zhang-prog (created: 2026-03-31 19:27 (UTC+8))
- #38641 [UX] Log worker exit code when process dies unexpectedly — ready,v1 — by NickCao (created: 2026-03-31 21:59 (UTC+8))
- #38628 [Docs] PD with Nixl compat matrix — documentation,ready,kv-connector — by NickLucche (created: 2026-03-31 19:16 (UTC+8))
- #38632 [CI] fix LM Eval Qwen3.5 Models (B200) — ready,qwen — by ZJY0516 (created: 2026-03-31 20:19 (UTC+8))
- #38631 Fix MLA runs when use_inductor_graph_partition=True — ready — by ElizaWszola (created: 2026-03-31 20:03 (UTC+8))
- #38625 AIFQA-409 Run VLLM on BMG70 — documentation,performance,intel-gpu,ci/build — by jklawikowski (created: 2026-03-31 18:05 (UTC+8))
- #38620 [Frontend] Re-enable running MaxSim on GPU — frontend,v1 — by noooop (created: 2026-03-31 17:14 (UTC+8))
- #38618 Fix/fp8 alignment padding — no labels — by jessiewei7 (created: 2026-03-31 16:51 (UTC+8))
- #38616 [Feature] Add completed_at field to ResponsesAPI — frontend — by chaunceyjiang (created: 2026-03-31 16:02 (UTC+8))
- #38613 [Feature]: add presence_penalty and frequency_penalty fields to Responses API — frontend,ready — by chaunceyjiang (created: 2026-03-31 15:02 (UTC+8))
- #38615 [ROCm] Fix aiter persistent mode mla with q/o nhead<16 for kimi-k2.5 tp8 — rocm,v1 — by wufann (created: 2026-03-31 15:46 (UTC+8))
- #38614 [ROCm][CI] Fix ROCm Python-only install fetching CUDA torch via build isolation — rocm,ready,ci/build,nvidia — by AndreasKaratzas (created: 2026-03-31 15:08 (UTC+8))
- #38611 [ci] Remove benchmarks job — ready,ci/build — by khluu (created: 2026-03-31 13:54 (UTC+8))
- #38609 [Bugfix] Fix streaming tool call type field defaulting to None instead of "function" — bug,frontend — by sfeng33 (created: 2026-03-31 12:04 (UTC+8))
- #38610 [Spec Decode] fix returning size mismatch on extract hidden states proposer — speculative-decoding,v1 — by zzaebok (created: 2026-03-31 12:14 (UTC+8))
Merged PRs
- #38559 [Perf] Optimize mean pooling using chunks and index_add, 5.9% E2E throughput improvement — ready — by yewentao256 (merged: 2026-04-01 11:54 (UTC+8))
- #37051 Fix priority preemption regression test in scheduler — ready,v1 — by ezylopx5 (merged: 2026-04-01 11:36 (UTC+8))
- #32914 [ROCm][perf] Shuffle KV cache to use paged_attention_common — rocm,ready,ci/build,v1 — by samutamm (merged: 2026-04-01 11:30 (UTC+8))
- #38172 [Misc] Add 20 regression tests for 11 tool parser bug fixes — bug,ready,tool-calling,qwen,deepseek — by bbrowning (merged: 2026-04-01 11:00 (UTC+8))
- #38612 [CI Failure] pin colmodernvbert revision — ready,multi-modality — by noooop (merged: 2026-03-31 18:54 (UTC+8))
- #33825 [vLLM IR] 1/N Implement IR skeleton and rms_norm op — rocm,intel-gpu,ready,torch.compile,ci/build,nvidia,ready-run-all-tests,vllm-ir — by ProExpertProg (merged: 2026-04-01 10:15 (UTC+8))
- #38148 Fix NaN from stale FP4 scale padding in create_fp4_scale_tensor — ready,v1 — by elvircrn (merged: 2026-04-01 10:15 (UTC+8))
- #38343 [Model] Sync upstream BT=chunk_size fix for GDN chunk_fwd_kernel_o, simplify warmup to single pass — ready,qwen — by AuYang261 (merged: 2026-04-01 03:03 (UTC+8))
- #37160 [Feat][v1] Simple yet General CPU KV Cache Offloading — performance,frontend,ready,v1,kv-connector,nvidia — by ivanium (merged: 2026-04-01 08:58 (UTC+8))
- #37887 [ROCm][perf] fix Aiter sparse MLA with MTP>1 — rocm,speculative-decoding,ready,v1 — by gronsti-amd (merged: 2026-04-01 07:22 (UTC+8))
- #34539 Generative Scoring — documentation,frontend,ready,v1 — by vedantjh2 (merged: 2026-04-01 07:02 (UTC+8))
- #38333 feat(grpc): add periodic stats logging and servicer log forwarding — frontend,ready — by CatherineSue (merged: 2026-04-01 06:50 (UTC+8))
- #37501 fix: clamp dA_cumsum differences to prevent Inf in Mamba2 SSD kernels — rocm,ready — by kibitzing (merged: 2026-03-31 23:35 (UTC+8))
- #38637 [Quantization] Consolidate dummy format logic into DummyModelLoader — ready,quantization — by Josephasafg (merged: 2026-04-01 06:20 (UTC+8))
- #38592 [Kernel] [Helion] [17/N] Add Helion kernel torch.compile support — ready — by gmagogsfm (merged: 2026-04-01 05:06 (UTC+8))
- #38383 [Refactor] Remove dead code in kv connector and model runner — intel-gpu,ready,v1,cpu,kv-connector — by yewentao256 (merged: 2026-04-01 05:05 (UTC+8))
- #38451 [Perf] Fix DBO overlap: capture DeepEP event before yield — ready — by czhu-cohere (merged: 2026-04-01 04:39 (UTC+8))
- #38556 [Bugfix][Async] Fix async spec decoding with hybrid models — bug,speculative-decoding,ready,v1 — by MatthewBonanni (merged: 2026-03-31 23:08 (UTC+8))
- #36286 [MoE Refactor] Migrate Unquantized to Full Oracle Flow — ready,nvidia,ready-run-all-tests — by yzong-rh (merged: 2026-04-01 03:43 (UTC+8))
- #36540 [fix] Remove trtllm ragged mla prefills — ready,v1 — by evezhier (merged: 2026-04-01 03:30 (UTC+8))
- #37373 [torch.compile] Refactor Attention Quant Fusion Pass and Remove Boilerplate — documentation,ready,v1,nvidia — by BadrBasowid (merged: 2026-04-01 02:15 (UTC+8))
- #37766 [CI/Build] Resolve a dependency deadlock when installing the test dependencies used in CI — ready,ci/build — by yurun00 (merged: 2026-04-01 02:05 (UTC+8))
- #37503 [4/n] Migrate FP4/W4A8 CUTLASS kernels to torch stable ABI — ready,ci/build,nvidia — by mikaylagawarecki (merged: 2026-04-01 01:21 (UTC+8))
- #37986 [Quantization][Autoround][XPU] Add `W4A16` Support — intel-gpu,ready,ci/build — by yiliu30 (merged: 2026-04-01 00:48 (UTC+8))
- #38584 [CI][Bugfix] Fix `test_run_eagle_dp` — bug,ready,v1 — by MatthewBonanni (merged: 2026-03-31 18:30 (UTC+8))
- #37010 [Bugfix] Fix FusedMoE weight loading with padded hidden dimensions — bug,ready — by SandishKumarHN (merged: 2026-04-01 00:22 (UTC+8))
- #38629 [Fix] handle PaddleOCR-VL image processor max_pixels across Transformers v4/v5 — ready — by zhang-prog (merged: 2026-03-31 23:50 (UTC+8))
- #38574 [Online Quant] [QeRL] Minor code cleanup — ready — by kylesayrs (merged: 2026-03-31 22:56 (UTC+8))
- #38628 [Docs] PD with Nixl compat matrix — documentation,ready,kv-connector — by NickLucche (merged: 2026-03-31 23:01 (UTC+8))
- #37841 replace cuda_device_count_stateless() to current_platform.device_count() — rocm,ready,v1,gpt-oss,nvidia — by wincent8 (merged: 2026-03-31 22:32 (UTC+8))
- #38594 [CI] Avoid concurrent docker pull in intel XPU CI runners to prevent rate limit issues — intel-gpu,ready,ci/build — by wendyliu235 (merged: 2026-03-31 22:23 (UTC+8))
- #38632 [CI] fix LM Eval Qwen3.5 Models (B200) — ready,qwen — by ZJY0516 (merged: 2026-03-31 21:20 (UTC+8))
- #38566 [Bugfix][CI] Skip flaky `test_eagle` test — bug,ready,v1 — by NickLucche (merged: 2026-03-31 21:42 (UTC+8))
- #38631 Fix MLA runs when use_inductor_graph_partition=True — ready — by ElizaWszola (merged: 2026-03-31 21:37 (UTC+8))
- #38596 [XPU]move testing dependencies from Dockerfile to xpu-test.in — intel-gpu,ready,ci/build — by 1643661061leo (merged: 2026-03-31 20:49 (UTC+8))
- #33176 [EPLB] Add alternative communication for EPLB weight exchange — ready,ci/build,v1 — by ilmarkov (merged: 2026-03-31 20:17 (UTC+8))
- #36742 [EPD] update EPD script arguments — documentation,ready,kv-connector — by zhenwei-intel (merged: 2026-03-31 20:02 (UTC+8))
- #31113 Fix document of torchrun_example.py — documentation,ready,unstale — by foreverlms (merged: 2026-03-31 18:54 (UTC+8))
- #38129 DOC: TPU mention fix — ready — by mtsokol (merged: 2026-03-31 18:27 (UTC+8))
- #38570 [Misc] Move --grpc CLI argument into make_arg_parser — frontend,ready — by CatherineSue (merged: 2026-03-31 18:24 (UTC+8))
- #38613 [Feature]: add presence_penalty and frequency_penalty fields to Responses API — frontend,ready — by chaunceyjiang (merged: 2026-03-31 16:45 (UTC+8))
- #28631 [Frontend][3/n] Improve pooling entrypoints scoring. — documentation,frontend,ready,multi-modality,llama,qwen — by noooop (merged: 2026-03-31 15:52 (UTC+8))
- #35697 [CPU] Support int8 compute mode in CPU AWQ — ready,ci/build,cpu — by yintong-lu (merged: 2026-03-31 15:27 (UTC+8))
- #38611 [ci] Remove benchmarks job — ready,ci/build — by khluu (merged: 2026-03-31 14:46 (UTC+8))
- #37989 [OOT] Add OOT support for linear kernel. — ready — by menogrey (merged: 2026-03-31 14:33 (UTC+8))
- #38189 [Tool Parser][2/3] Use self.tools instead of request.tools in tool parsers — ready,tool-calling,qwen,deepseek — by sfeng33 (merged: 2026-03-31 13:41 (UTC+8))
- #38554 [kv_offload+HMA] Fix num_blocks with different per-layer page sizes and improve assert message — ready,v1,kv-connector — by kfirtoledo (merged: 2026-03-31 14:00 (UTC+8))
- #38576 vLLM Benchmark Suite perf regression after PR#32723 — performance,ready,ci/build,cpu — by louie-tsai (merged: 2026-03-31 13:23 (UTC+8))
- #38508 [ROCm][CI] Fix Whisper translation test attention backend selection — rocm,ready — by AndreasKaratzas (merged: 2026-03-31 13:21 (UTC+8))
- #38264 [Mypy] Fix adjust_request typing — documentation,frontend,ready,tool-calling — by sfeng33 (merged: 2026-03-31 12:21 (UTC+8))
- #38546 [KVConnector] Remove redundant method KVConnectorOutput::merge() — ready,v1 — by hickeyma (merged: 2026-03-31 12:11 (UTC+8))
PRs Closed Without Merging
- #38639 [torch.compile] change auto_functionalized return structure to use indexing instead of unpack values — needs-rebase — by chaojun-zhang (closed: 2026-04-01 10:48 (UTC+8))
- #29661 Enable FlashInfer Hopper FP8 attention — v1,nvidia — by nvpohanh (closed: 2026-04-01 10:31 (UTC+8))
- #26625 [Feature] Optimize Prefill Phase: Add Hybrid Chunked Prefill Support — ci/build,stale,v1,multi-modality,llama — by Ther-LF (closed: 2026-04-01 10:17 (UTC+8))
- #28723 [Feature] Support Prefill Context Parallel (PCP) for GQA flashinfer — speculative-decoding,needs-rebase,stale,v1,nvidia — by pisceskkk (closed: 2026-04-01 10:16 (UTC+8))
- #28994 [Misc] Update disaggregated prefill examples for LMCache v0.3.7+ compatibility — documentation,stale,kv-connector — by Zerohertz (closed: 2026-04-01 10:16 (UTC+8))
- #36143 [Core] Add --hybrid-kv-cache-group-size to override KV cache grouping for hybrid attn models — v1 — by mzmssg (closed: 2026-04-01 10:12 (UTC+8))
- #38672 Fix llm_request trace context propagation — structured-output,frontend,v1,tool-calling,gpt-oss,nvidia — by will-deines (closed: 2026-04-01 09:07 (UTC+8))
- #38667 Fix/lmcache ec connector module — needs-rebase,v1,kv-connector — by benyebai (closed: 2026-04-01 07:50 (UTC+8))
- #38357 Revert "[Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell" (#38083) — bug,needs-rebase,qwen — by zhewenl (closed: 2026-04-01 05:11 (UTC+8))
- #38638 Fix Nano Nemotron VL regressions — no labels — by netanel-haber (closed: 2026-04-01 04:24 (UTC+8))
- #38653 [Bugfix] Use async preprocessing in pooling/embedding endpoints — bug,frontend — by chuqiwang (closed: 2026-04-01 02:05 (UTC+8))
- #38358 Revert "[CI] Add batch invariant test for b200" (#38014) — ci/build — by zhewenl (closed: 2026-04-01 00:36 (UTC+8))
- #34174 [Draft][ROCm] ROCm7.2 as base — rocm,needs-rebase,ci/build — by gshtras (closed: 2026-03-31 23:55 (UTC+8))
- #38419 [Bugfix] Fix backup token index in async spec decode (fixes Nemotron BF16 accuracy) — bug,speculative-decoding,v1 — by SandishKumarHN (closed: 2026-03-31 23:13 (UTC+8))
- #30558 [Core] Support multi prompt for AsyncLLM.generate() and encode() — documentation,needs-rebase,v1 — by buaazp (closed: 2026-03-31 21:00 (UTC+8))
- #30492 [Feature] add manual numa binding — ready,needs-rebase,v1,nvidia — by jasonlizhengjian (closed: 2026-03-31 20:59 (UTC+8))
- #36250 [kernel] Fix FP8 paged MQA fallback for CUDA graph capture — bug,v1,nvidia — by ZJY0516 (closed: 2026-03-31 20:36 (UTC+8))
- #38010 [Model] Fix BitsAndBytes quantization for GLM-4.1V/4.6V-Flash vision encoder — bug — by yanghui1-arch (closed: 2026-03-31 20:08 (UTC+8))
- #37507 [Bugfix] Fall back to Triton/FLA when system CUDA toolkit < 12.6 for GDN prefill kernel — bug,needs-rebase,qwen,nvidia — by yanghui1-arch (closed: 2026-03-31 20:08 (UTC+8))
- #38599 Revert "[Mamba][Bugfix] Raise on insufficient cache blocks instead of silently capping cudagraph sizes" (#38270) — bug,v1,nvidia — by vllm-agent (closed: 2026-03-31 18:11 (UTC+8))
- #38618 Fix/fp8 alignment padding — no labels — by jessiewei7 (closed: 2026-03-31 17:51 (UTC+8))
- #38525 [Docs]: comprehensive rewrite of disaggregated prefilling (PD) documentation — documentation,performance,new-model,rocm,frontend,intel-gpu,speculative-decoding,ci/build,v1,multi-modality — by neweyes (closed: 2026-03-31 17:13 (UTC+8))
- #37408 [ROCm][Quantization] fallback trust_remote_code=True in Quark config for some cases — rocm — by xuebwang-amd (closed: 2026-03-31 12:08 (UTC+8))