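vLLM Development Activity Report - 2026-03-31

Time window: 2026-03-31 11:57 (UTC+8) ~ 2026-04-01 11:57 (UTC+8)
Statistics: New issues 14 | Closed issues 19 | New PRs 60 | Merged PRs 51 | PRs closed without merging 23

📊 Daily Development Summary

Between March 31 and April 1, 2026, the vLLM project remained highly active, with a large volume of code opened and merged. Work centered on performance optimization (particularly for AMD platforms and new quantization methods), bug fixes (most often for the Qwen3.5 model family and FP8 quantization), and feature work (tool calling, pooling model support). The community also worked through a backlog of older issues and advanced several significant changes (vLLM IR, the MoE refactor, a compile-only mode), showing strong iteration and problem-solving capacity.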

🎯 AMD/ROCm Ecosystem

AMD/ROCm development was very active this cycle, spanning performance optimization, bug fixes, and tooling improvements.

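Performance and feature optimization
- PR #38665 [ROCm] Enable dual-stream MoE shared experts and GLM-5 MXFP4 Quark support: submitted by an AMD employee, this PR aims to improve inference performance of the GLM-5 MXFP4 model on AMD MI355X GPUs. The main changes are: (1) dual-stream execution of MoE shared experts is extended from CUDA-only to is_cuda_alike(), which also covers ROCm/HIP streams; (2) the GLM-5 model architecture (glm_moe_dsa) is added to the Quark dynamic MXFP4 support list. It is an important step in porting high-performance optimizations from the ATOM project to vLLM.
- PR #38647 Add opt-in --record-power option to vllm bench serve: submitted by AMD employee fxmarty-amd. It adds power and energy-efficiency (tokens/Joule) measurement to the benchmarking tool, currently for AMD GPUs and CPUs. This is essential for evaluating the energy efficiency of different precisions (dtype) and serving configurations.
- PR #38627 [WIP] [ROCm] [Doc] Update ROCm attention selection doc and test coverage: corrects the automatic selection of the ROCm attention backend so that, when --attention-backend is not specified, ROCM_ATTN, ROCM_AITER_FA, or ROCM_AITER_UNIFIED_ATTN is chosen correctly from the VLLM_ROCM_USE_AITER environment variable and the sink_attn setting. Documentation and tests are updated to match.
- PR #38615 [ROCm] Fix aiter persistent mode mla with q/o nhead<16 for kimi-k2.5 tp8: fixes a shape mismatch in AITER MLA persistent mode that occurs when each GPU holds only a few attention heads, so that models such as Kimi-K2.5 run correctly at TP=8.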
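
The tokens/Joule metric mentioned for PR #38647 can be illustrated with a small calculation. This is a minimal sketch, not the PR's implementation: it assumes power is available as timestamped samples in watts, integrates them with the trapezoidal rule to get energy, and divides the token count by that energy. The function names and the sample values are hypothetical.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    if len(timestamps_s) < 2:
        raise ValueError("need at least two samples to integrate")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def tokens_per_joule(num_tokens: int, timestamps_s: Sequence[float],
                     power_w: Sequence[float]) -> float:
    """Energy efficiency: generated tokens divided by consumed energy."""
    energy = energy_joules(timestamps_s, power_w)
    if energy <= 0:
        raise ValueError("energy must be positive")
    return num_tokens / energy


# Made-up samples: a constant 500 W draw for 10 s while 20,000 tokens are generated.
ts = [0.0, 5.0, 10.0]
watts = [500.0, 500.0, 500.0]
print(tokens_per_joule(20_000, ts, watts))  # 5000 J consumed -> 4.0 tokens/J
```

Comparing two dtypes or serving configurations then reduces to comparing this one number for the same workload.
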
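Bug fixes and CI maintenance
- PR #38680 [CI][ROCm] Remove unsupported cases in test_fusion.py: group-quant and per-tensor-quant fusion are not supported on ROCm AITER, so this PR removes the corresponding test cases so that CI passes.
- PR #38614 [ROCm][CI] Fix ROCm Python-only install fetching CUDA torch via build isolation: fixes the ROCm Python-only install test, in which build isolation caused the CUDA build of torch to be downloaded by mistake.
- PR #37841 replace cuda_device_count_stateless() to current_platform.device_count(): a cross-platform cleanup, merged this cycle, that moves device counting to a platform-agnostic interface. It eases future support for new accelerators such as XPU, and ROCm benefits as well.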
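Model support and testing
- PR #38664 [CI][ROCm] Add Qwen3.5-35B-A3B-MXFP4 model eval into CI: proposes adding the MXFP4-quantized model released by AMD to CI evaluation, a sign that support for AMD-ecosystem quantized models is moving into the standard test process.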

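Summary: AMD contributors were very active this cycle. They submitted core changes that improve performance, energy efficiency, and stability while continuing to refine the CI/CD pipeline, which reflects a strong investment in vLLM's maturity on AMD hardware.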

💬 High-Activity Discussions

Issue #38626: [Bug]: CUDA error: an illegal memory access was encountered when deploy Qwen3.5-35B-A3B-FP8 on A100
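- Core issue: deploying the FP8-quantized Qwen3.5 model on A100 GPUs (TP=2) triggers a CUDA illegal memory access error during decoding.
- Viewpoints:
  - Contributor CodersAcademy006 provided a detailed technical analysis: the error occurs on rank 1 and may be related to non-native FP8 (emulated on A100), CUDA graph (CUDAGraphMode) replay, and tensor shape changes caused by resuming cached requests. The comment also gave detailed debugging suggestions (such as CUDA_LAUNCH_BLOCKING=1).
  - Contributor ZJY0516 suggested that the user try the latest vLLM release (0.18).
- Point of contention: none. The discussion focused on root-cause analysis.
- Status: still open, pending the user's debugging results or confirmation that a newer version fixes the problem.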
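
The debugging suggestion above can be sketched as a launch helper. This is only an illustration under stated assumptions: CUDA_LAUNCH_BLOCKING=1 comes from the discussion, while adding --enforce-eager (an existing vllm serve option that disables CUDA graph capture) is an extra step assumed here to test the CUDA-graph hypothesis. The helper name and the model placeholder are hypothetical, and nothing is executed.

```python
import os
import shlex


def build_debug_launch(model: str, tensor_parallel_size: int = 2):
    """Return (env, argv) for a debug run of `vllm serve`; nothing is executed here."""
    env = dict(os.environ)
    # Make CUDA kernel launches synchronous so the failing kernel is reported
    # at its call site rather than at a later, unrelated API call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Assumption of this sketch: run without CUDA graphs to check
        # whether graph replay is involved in the crash.
        "--enforce-eager",
    ]
    return env, argv


# Placeholder model argument; substitute the checkpoint from the issue.
env, argv = build_debug_launch("<model-id-or-path>")
print("CUDA_LAUNCH_BLOCKING=" + env["CUDA_LAUNCH_BLOCKING"], shlex.join(argv))
```

The returned pair can be passed to subprocess.run(argv, env=env) once the placeholder is replaced. If the crash disappears in eager mode, that points toward the CUDA graph path rather than the FP8 kernels themselves.
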
Issue #38634: [Bug]: MLA + FP8 KV cache + CUDA Graph causes random NaN in decode phase
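- Core issue: with MLA-architecture models (such as DeepSeek), enabling the FP8 KV cache together with CUDA Graph produces NaN values during decoding.
- Viewpoints:
  - Maintainer gaby first pointed out that the user's vLLM version (0.12.0) is outdated and recommended upgrading to the latest release (0.18.1).
  - User wwwjs asked whether upgrading would resolve the problem.
  - Contributor RemizovDenis added a quantization-theory perspective: standard FP8 quantization tends to become unstable in the tails of the distribution, and alternatives exist (such as 3-bit with residual scaling).
- Point of contention: none. The discussion converged on a standard procedure: upgrade first, then consider whether this is a known quantization-stability problem.
- Status: open. The user is advised to upgrade and check whether the problem recurs.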
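
As background for the quantization remark above, the sketch below simulates FP8 E4M3 rounding in pure Python and shows one well-known sensitivity: with a single per-tensor scale, one outlier can push small values below the format's resolution. This is an illustration only, not a diagnosis of issue #38634 and not vLLM's kernel code. The function names are hypothetical, and the format constants (maximum 448, 3 mantissa bits, smallest normal exponent -6) are those of the E4M3 "FN" variant.

```python
import math

E4M3_MAX = 448.0        # largest finite E4M3 (FN) value
E4M3_MIN_EXP = -6       # exponent of the smallest normal number
E4M3_MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    # Below the smallest normal exponent the grid spacing stops shrinking
    # (subnormal range), which is where small values get flushed to zero.
    exp = max(math.floor(math.log2(mag)), E4M3_MIN_EXP)
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)
    q = round(mag / step) * step
    return math.copysign(min(q, E4M3_MAX), x)


def fake_quant(values):
    """Quantize and dequantize with one per-tensor scale chosen from the max magnitude."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0 for _ in values]
    scale = amax / E4M3_MAX
    return [round_to_e4m3(v / scale) * scale for v in values]


print(fake_quant([0.001, 0.002, 0.5]))     # small values survive with modest error
print(fake_quant([0.001, 0.002, 1000.0]))  # the outlier dominates the scale
```

In the second call the two small values come back as zero, because after scaling they fall below half of the smallest E4M3 step. Finer-grained scaling (per channel or per block) is the usual way to limit this effect.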

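🔥 Hot Topics and Trends

- Qwen3.5 issues cluster: several issues (#38656, #38626, #38643, #38666) report startup hangs, CUDA errors, and garbled output with Qwen3.5 models under different configurations. This popular model family, especially its FP8/A3B/NVFP4 variants, faces compatibility and stability challenges in complex configurations.
- Stability of FP8 quantization and the KV cache: beyond the Qwen3.5 reports above, issues #38634 and #38658 both point to numerical problems (NaN) or performance regressions when the FP8 KV cache is combined with specific conditions (CUDA Graph, Marlin, hardware without native FP8). This is a current technical focus.
- Robustness of tool-call parsers: issue #38674 shows that the Jamba tool parser hard-codes a dependency on a specific token format, so models that use the standard Mistral format are not supported. This reflects the compatibility challenges of an increasingly diverse tool-calling ecosystem.
- Compilation and performance work continues to deepen: several PRs (such as #38675, #38657, #38671, #33825) address torch.compile, kernel fusion, migration to the Torch stable ABI, and the introduction of vLLM IR, reflecting a sustained push for inference performance.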

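🛠️ Key Technical Changes

- PR #38675 [torch.compile] Add compile-only mode: opened this cycle. It adds a vllm compile CLI command and API that let users precompile a model without running inference, which decouples compilation overhead from weight loading and enables a "warm start". It is a key step toward addressing slow startup for large models.
- PR #37160 [Feat][v1] Simple yet General CPU KV Cache Offloading: merged. It implements a new, simpler, and more general CPU KV cache offloading scheme. By reusing the existing BlockPool and KVCacheCoordinator, it inherits support for HMA, prefix caching, and LRU eviction, with less code and more functionality than the previous scheme.
- PR #36286 [MoE Refactor] Migrate Unquantized to Full Oracle Flow: merged. It migrates the unquantized (BF16) MoE compute path to the new modular "Oracle" flow, unifying it with the FP8/NVFP4 paths. It is a major milestone in the MoE kernel refactor and improves code consistency and maintainability.
- PR #33825 [vLLM IR] 1/N Implement IR skeleton and rms_norm op: merged. This is the foundational PR for the vLLM intermediate representation (IR) system. It defines the registration and dispatch mechanisms for IR ops and kernels and implements rms_norm as the first op. vLLM IR is intended to be a unified, extensible operator abstraction layer that underpins future compilation optimizations and multi-backend support.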

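📈 Development Activity Observations

- High output and fast merging: 60 PRs were opened and 51 merged within 24 hours, a merged-to-opened ratio of 85%. This indicates efficient review and merging by the core team and a project in rapid iteration.
- Active handling of technical debt: 19 issues were closed, 12 of them carrying the stale label, which shows that the community is clearing its backlog and maintaining project health.
- AMD contributors stand out: as noted above, several AMD employees contributed key PRs spanning performance, power measurement, and compatibility, making AMD one of the most active corporate contributors this cycle.
- Strong community participation: several high-activity issues received in-depth technical analysis and debugging advice from community contributors, reflecting a healthy technical culture and a collaborative spirit among vLLM developers.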

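💡 Issues Worth Watching

- Widespread problems with Qwen3.5 FP8/A3B models: the series of issues indicates real deployment risk for this model on specific hardware and configurations. Users and developers should watch the stability fixes closely.
- Compatibility of the MLA architecture with the FP8 KV cache: issues #38634 and #38652 suggest that combining MLA-style models with the FP8 KV cache may carry inherent stability challenges that require model-level or algorithm-level handling.
- Standardization of tool-call parsers: issue #38674 exposes how each parser currently goes its own way. A more unified, configurable parsing framework may be needed to handle diverse model output formats.
- New quantization methods: PR #38662 proposes a new "TurboQuant" KV cache quantization method (2/3-bit) that claims to sharply reduce memory usage. Such techniques are promising, but their stability, performance gains, and integration complexity need thorough community evaluation.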
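
To put the 2/3-bit claim in perspective, the arithmetic below estimates raw KV cache size at different bit widths. It is a back-of-the-envelope sketch: the model shape is hypothetical, and the formula counts only the stored key and value elements, ignoring the scales, metadata, and padding that any real quantization scheme (including the one in PR #38662) would add.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bits_per_value: float) -> float:
    """Bytes for keys plus values, ignoring scales, metadata, and padding."""
    values = 2 * num_layers * num_kv_heads * head_dim * num_tokens  # 2 = K and V
    return values * bits_per_value / 8


# Hypothetical model shape and context length, chosen only for round numbers.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=32_768)
for bits in (16, 8, 3, 2):
    gib = kv_cache_bytes(**shape, bits_per_value=bits) / 2**30
    print(f"{bits:>2}-bit: {gib:.2f} GiB")
# 16-bit: 4.00 GiB
#  8-bit: 2.00 GiB
#  3-bit: 0.75 GiB
#  2-bit: 0.50 GiB
```

The ideal reduction from 16-bit to 3-bit is 16/3, about 5.3x. The achieved figure will be lower once per-group scales and other overheads are included, which is part of what the evaluation should measure.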

📋 Appendix: Detailed Data Lists

New Issues

#38677 [Bug]: data_parallel_rpc_port is not robust to invalid traffic and can crash multi-node startup — bug — by JasonHe-WQ (created: 2026-04-01 10:29 (UTC+8))
#38674 [Bug]: Jamba tool parser crashes on Mistral-style [TOOL_CALLS] models with standard HF tokenizer (e.g., Apriel-Nemotron-15b) — bug — by oromanenko-nv (created: 2026-04-01 09:37 (UTC+8))
#38666 [Bug]: Regression can no longer load Qwen 3.5 397B nvfp4 model - CUBLAS_STATUS_NOT_INITIALIZED — bug — by bitbottrap (created: 2026-04-01 07:45 (UTC+8))
#38656 [Bug]: qwen 3.5 model launch get stuck for quite a long time — bug — by yanan1116 (created: 2026-04-01 04:58 (UTC+8))
#38660 [Bug]: CUDA assert in triton attention for MolmoWeb models (Molmo2 architecture with different max_position_embeddings) — no labels — by 2imi9 (created: 2026-04-01 05:35 (UTC+8))
#38658 [Bug]: MLA attention casts activations to int32 when using Marlin FP8 on GPUs without native FP8 support (sm < 89) — bug — by marcusm117 (created: 2026-04-01 05:16 (UTC+8))
#38634 [Bug]: MLA + FP8 KV cache + CUDA Graph causes random NaN in decode phase — bug — by wwwjs (created: 2026-03-31 20:48 (UTC+8))
#38652 [Bug]: --kv-cache-dtype fp8 produces garbage output on MLA models (GLM-4.7-Flash) at multi-turn — no labels — by varjoranta (created: 2026-04-01 01:50 (UTC+8))
#38651 [RFC]: Add max_tokens_per_doc support for rerank and scoring endpoints — RFC — by jefp (created: 2026-04-01 01:39 (UTC+8))
#38633 [New Model]: JinaEmbeddingsV5Model — no labels — by DanielBeck93 (created: 2026-03-31 20:38 (UTC+8))
#38643 [Bug]: Qwen3.5 (Qwen3_5ForConditionalGeneration) FLA linear attention tensor format mismatch causes gibberish output — no labels — by BANG404 (created: 2026-03-31 22:47 (UTC+8))
#38642 [Usage]: Model return value reasoning_content — usage — by hf201429 (created: 2026-03-31 22:06 (UTC+8))
#38626 [Bug]: CUDA error: an illegal memory access was encountered when deploy Qwen3.5-35B-A3B-FP8 on A100 — bug — by zhaotyer (created: 2026-03-31 18:22 (UTC+8))
#38619 [Bug]: Kimi-K2.5 compressed-tensors MoE Marlin repack fails with PTX toolchain error on H200 (CUDA 12.8, driver 570.133.20) — no labels — by DavidBellamy (created: 2026-03-31 16:53 (UTC+8))

Closed Issues

#27519 [Bug]: torch._dynamo.exc.FailOnRecompileLimitHit: recompile_limit reached with fullgraph=True on 2x RTX2080Ti — bug,stale — by ir1ka (closed: 2026-04-01 10:41 (UTC+8))
#21098 [Bug]: vllm deploy deepseek-v3 bug on attributeerror: ‘_OpNameSpace’ ‘_moe_C’ object has no attribute ‘moe_align_block_size’ — bug,stale — by controlRun (closed: 2026-04-01 10:18 (UTC+8))
#23934 [Bug]: CPU Backend with GPT-OSS Failed — bug,stale,gpt-oss — by zzc98 (closed: 2026-04-01 10:17 (UTC+8))
#26781 [Feature]: Better base64 to torch tenser — good first issue,feature request,stale — by noooop (closed: 2026-04-01 10:17 (UTC+8))
#28759 [Bug]: VLLM compilation cache collision for model’s whose graphs have same shape but different input — bug,torch.compile,stale — by roadr (closed: 2026-04-01 10:16 (UTC+8))
#29085 [Bug]: RuntimeError “cancelled” when using pipeline parallelism with Qwen3-14B — bug,stale — by slwang-ustc (closed: 2026-04-01 10:16 (UTC+8))
#29177 [Usage]: Vllm + Intervl model local infra Image preprocessing / request adding becomes bottleneck even with more CPU cores — how to accelerate? — usage,stale — by Passenger12138 (closed: 2026-04-01 10:16 (UTC+8))
#29517 [Bug]: Duplicate registration of a fake implementation for the gptq_marlin_repack operator causing vllm serve to fail. — bug,stale — by atalhens (closed: 2026-04-01 10:16 (UTC+8))
#29544 [Bug]: “expandable_segments: True” causes vLLM EngineCore initialization to fail when running Qwen3 VL models — bug,stale — by fesvhtr (closed: 2026-04-01 10:16 (UTC+8))
#29643 [Usage]: Enabling Tool call in the Python SDK — usage,stale — by Madan1215 (closed: 2026-04-01 10:16 (UTC+8))
#29763 [Bug]: GLM-4.5 reasoning parser streaming fails without tools in request - missing as_list() conversion — bug,stale — by sygenaithanos (closed: 2026-04-01 10:16 (UTC+8))
#29831 [Bug]: v1/responses enable_response_messages returns blank message content — bug,stale — by jacobthebanana (closed: 2026-04-01 10:16 (UTC+8))
#38085 [Bug]: Qwen3.5 LoRA module is not in model’s supported LoRA target modules — bug — by wufenglailai (closed: 2026-04-01 09:59 (UTC+8))
#31913 [Bug]: test_eagle_dp test is flaky — bug — by zou3519 (closed: 2026-04-01 00:42 (UTC+8))
#36926 [Bug]: nemotron_h does not work with DeepEP all2all backends due to hidden dim rounding — bug — by bnellnm (closed: 2026-04-01 00:22 (UTC+8))
#38098 [CI Failure]: LM Eval Large Models (H200) — ci-failure — by ilmarkov (closed: 2026-03-31 23:08 (UTC+8))
#36443 [Bug]: qwen3.5-27b ValueError: Tokenizer class TokenizersBackendFast does not exist or is not currently imported. — bug — by xiaotianns (closed: 2026-03-31 19:31 (UTC+8))
#38234 Test Failure: test_run_eagle_dp[FLASH_ATTN] produces non-deterministic outputs with EAGLE speculative decoding — no labels — by markmc (closed: 2026-03-31 18:30 (UTC+8))
#38582 [CI Failure]: tests/models/language/pooling/test_splade_sparse_pooler.py — ci-failure — by bnellnm (closed: 2026-03-31 12:54 (UTC+8))

New PRs

#38676 [CPU] Support head_size 512 in cpu_attn — documentation,ready,v1,cpu — by bigPYJ1151 (created: 2026-04-01 10:26 (UTC+8))
#38663 [Core][Feat][ safely abort requests where FSM failed to advance — v1 — by walterbm (created: 2026-04-01 06:34 (UTC+8))
#38635 [Feature] NUMA binding support for GPU workers — documentation,v1,nvidia — by Harry-Chen (created: 2026-03-31 20:57 (UTC+8))
#38649 [Bugfix] Lazy import diskcache to avoid sqlite3/libstdc++ ImportError at startup — bug,structured-output,ready,v1 — by jeffreywang-anyscale (created: 2026-04-01 01:23 (UTC+8))
#38682 [XPU] [Quant] rename mxfp8_e4m3_quantize and add xpu backend implementation — intel-gpu — by zufangzhu (created: 2026-04-01 11:08 (UTC+8))
#38678 Fix llm_request trace context propagation — frontend,v1 — by will-deines (created: 2026-04-01 10:34 (UTC+8))
#38617 [bugfix] do not add extra linebreak for score/rerank with chat template — bug,frontend,ready — by staugust (created: 2026-03-31 16:25 (UTC+8))
#38655 Fix Nano Nemotron VL regressions — multi-modality — by netanel-haber (created: 2026-04-01 04:24 (UTC+8))
#38665 [ROCm] Enable dual-stream MoE shared experts and GLM-5 MXFP4 Quark support — rocm,v1 — by ChuanLi1101 (created: 2026-04-01 07:00 (UTC+8))
#38681 [CPU] Fix lscpu NUMA node regex to handle quoted - and null in containers — cpu — by Monokaix (created: 2026-04-01 11:04 (UTC+8))
#38680 [CI][ROCm] Remove unsupported cases in test_fusion.py — rocm — by charlifu (created: 2026-04-01 11:01 (UTC+8))
#38679 fused_moe_kernel opt — no labels — by SYChen123 (created: 2026-04-01 10:54 (UTC+8))
#38612 [CI Failure] pin colmodernvbert revision — ready,multi-modality — by noooop (created: 2026-03-31 14:49 (UTC+8))
#38639 [torch.compile] change auto_functionalized return structure to use indexing instead of unpack values — needs-rebase — by chaojun-zhang (created: 2026-03-31 21:44 (UTC+8))
#38621 [Kernel fusion] QK Norm + RoPE + Cache + Quant — needs-rebase — by EricccYang (created: 2026-03-31 17:17 (UTC+8))
#38654 [Bugfix] Fix vllm bench serve to count multimodal tokens in “total input tokens” — bug,performance — by mgehre-amd (created: 2026-04-01 04:07 (UTC+8))
#38675 [torch.compile] Add compile-only mode — frontend,v1 — by zou3519 (created: 2026-04-01 10:09 (UTC+8))
#38650 [Bugfix] Enable MTP for the official Qwen3.5 NVFP4 checkpoint — bug,qwen — by mmangkad (created: 2026-04-01 01:34 (UTC+8))
#38623 [Doc]: improve disaggregated prefilling guide — documentation — by neweyes (created: 2026-03-31 17:36 (UTC+8))
#38673 [Bugfix] Preserve original ImportError in gRPC server entrypoint — bug,frontend — by CatherineSue (created: 2026-04-01 09:10 (UTC+8))
#38672 Fix llm_request trace context propagation — structured-output,frontend,v1,tool-calling,gpt-oss,nvidia — by will-deines (created: 2026-04-01 08:52 (UTC+8))
#38669 Fix Marlin repack PTX incompatibility on H100/H200 (CUDA 12.8) — ci/build,nvidia — by DavidBellamy (created: 2026-04-01 08:01 (UTC+8))
#38671 [5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a gemm to torch stable ABI — ci/build,nvidia — by mikaylagawarecki (created: 2026-04-01 08:14 (UTC+8))
#38670 [Bugfix] Fix AWQ models batch invariance issues — bug,v1 — by YM2132 (created: 2026-04-01 08:03 (UTC+8))
#38661 [2/N] Pass model_config to the Attention constructors — ready,llama,qwen,deepseek,gpt-oss — by MatthewBonanni (created: 2026-04-01 05:55 (UTC+8))
#38668 add vLLM-side LMCache EC connector entrypoint — kv-connector — by benyebai (created: 2026-04-01 07:53 (UTC+8))
#38667 Fix/lmcache ec connector module — needs-rebase,v1,kv-connector — by benyebai (created: 2026-04-01 07:48 (UTC+8))
#38622 [Bug] Fix encoder cache miss assertion crash with MTP + multimodal — bug,v1 — by esmeetu (created: 2026-03-31 17:24 (UTC+8))
#38657 [compile] Invoke split FX graph by codegen. — no labels — by zhxchen17 (created: 2026-04-01 05:11 (UTC+8))
#38664 [CI][ROCm] Add Qwen3.5-35B-A3B-MXFP4 model eval into CI — rocm,qwen — by BowenBao (created: 2026-04-01 06:53 (UTC+8))
#38659 [1/N][Cleanup] Standardize on use of is_quantized_kv_cache — rocm,intel-gpu,ready,v1,cpu,nvidia — by MatthewBonanni (created: 2026-04-01 05:29 (UTC+8))
#38662 [Kernel] feat: TurboQuant KV cache quantization (PolarQuant + QJL) — ci/build,v1 — by allaspectsdev (created: 2026-04-01 06:08 (UTC+8))
#38637 [Quantization] Consolidate dummy format logic into DummyModelLoader — ready,quantization — by Josephasafg (created: 2026-03-31 21:19 (UTC+8))
#38644 [Refactor] Simplify FutureWrapper in MultiprocExecutor — v1 — by yzong-rh (created: 2026-04-01 00:11 (UTC+8))
#38638 Fix Nano Nemotron VL regressions — no labels — by netanel-haber (created: 2026-03-31 21:42 (UTC+8))
#38624 fix(scheduler): discard async tokens on all V1 preemption paths — v1 — by CodersAcademy006 (created: 2026-03-31 17:43 (UTC+8))
#38653 [Bugfix] Use async preprocessing in pooling/embedding endpoints — bug,frontend — by chuqiwang (created: 2026-04-01 02:01 (UTC+8))
#38647 Add opt-in --record-power option to vllm bench serve — performance,rocm — by fxmarty-amd (created: 2026-04-01 01:00 (UTC+8))
#38640 [BugFix] Fix streaming tool call with null type or id in final chunk — bug,frontend — by yanghui1-arch (created: 2026-03-31 21:51 (UTC+8))
#38648 [XPU] Enable group_size=-1/channel-wise for w4a16 and w4a8 — intel-gpu — by tianmu-li (created: 2026-04-01 01:00 (UTC+8))
#38646 [compile] fuse rope and cache insertion for mla — no labels — by ZJY0516 (created: 2026-04-01 00:38 (UTC+8))
#38645 [Hotfix] Minor polish to reduce the key in map calling. — v1 — by RocMarshal (created: 2026-04-01 00:37 (UTC+8))
#38636 (security) Enforce frame limit in VideoMediaIO — ready,multi-modality — by jperezdealgaba (created: 2026-03-31 21:07 (UTC+8))
#38630 Bugfix/multi node dp tcp placement — bug,frontend,v1 — by shaharmor98 (created: 2026-03-31 19:52 (UTC+8))
#38627 [WIP] [ROCm] [Doc] Update ROCm attention selection doc and test coverage — bug,rocm,ready,v1 — by tjtanaa (created: 2026-03-31 18:56 (UTC+8))
#38629 [Fix] handle PaddleOCR-VL image processor max_pixels across Transformers v4/v5 — ready — by zhang-prog (created: 2026-03-31 19:27 (UTC+8))
#38641 [UX] Log worker exit code when process dies unexpectedly — ready,v1 — by NickCao (created: 2026-03-31 21:59 (UTC+8))
#38628 [Docs] PD with Nixl compat matrix — documentation,ready,kv-connector — by NickLucche (created: 2026-03-31 19:16 (UTC+8))
#38632 [CI] fix LM Eval Qwen3.5 Models (B200) — ready,qwen — by ZJY0516 (created: 2026-03-31 20:19 (UTC+8))
#38631 Fix MLA runs when use_inductor_graph_partition=True — ready — by ElizaWszola (created: 2026-03-31 20:03 (UTC+8))
#38625 AIFQA-409 Run VLLM on BMG70 — documentation,performance,intel-gpu,ci/build — by jklawikowski (created: 2026-03-31 18:05 (UTC+8))
#38620 [Frontend] Re-enable running MaxSim on GPU — frontend,v1 — by noooop (created: 2026-03-31 17:14 (UTC+8))
#38618 Fix/fp8 alignment padding — no labels — by jessiewei7 (created: 2026-03-31 16:51 (UTC+8))
#38616 [Feature] Add completed_at field to ResponsesAPI — frontend — by chaunceyjiang (created: 2026-03-31 16:02 (UTC+8))
#38613 [Feature]: add presence_penalty and frequency_penalty fields to Responses API — frontend,ready — by chaunceyjiang (created: 2026-03-31 15:02 (UTC+8))
#38615 [ROCm] Fix aiter persistent mode mla with q/o nhead<16 for kimi-k2.5 tp8 — rocm,v1 — by wufann (created: 2026-03-31 15:46 (UTC+8))
#38614 [ROCm][CI] Fix ROCm Python-only install fetching CUDA torch via build isolation — rocm,ready,ci/build,nvidia — by AndreasKaratzas (created: 2026-03-31 15:08 (UTC+8))
#38611 [ci] Remove benchmarks job — ready,ci/build — by khluu (created: 2026-03-31 13:54 (UTC+8))
#38609 [Bugfix] Fix streaming tool call type field defaulting to None instead of “function” — bug,frontend — by sfeng33 (created: 2026-03-31 12:04 (UTC+8))
#38610 [Spec Decode] fix returning size mismatch on extract hidden states proposer — speculative-decoding,v1 — by zzaebok (created: 2026-03-31 12:14 (UTC+8))

Merged PRs

#38559 [Perf] Optimize mean pooling using chunks and index_add, 5.9% E2E throughput improvement — ready — by yewentao256 (merged: 2026-04-01 11:54 (UTC+8))
#37051 Fix priority preemption regression test in scheduler — ready,v1 — by ezylopx5 (merged: 2026-04-01 11:36 (UTC+8))
#32914 [ROCm][perf] Shuffle KV cache to use paged_attention_common — rocm,ready,ci/build,v1 — by samutamm (merged: 2026-04-01 11:30 (UTC+8))
#38172 [Misc] Add 20 regression tests for 11 tool parser bug fixes — bug,ready,tool-calling,qwen,deepseek — by bbrowning (merged: 2026-04-01 11:00 (UTC+8))
#38612 [CI Failure] pin colmodernvbert revision — ready,multi-modality — by noooop (merged: 2026-03-31 18:54 (UTC+8))
#33825 [vLLM IR] 1/N Implement IR skeleton and rms_norm op — rocm,intel-gpu,ready,torch.compile,ci/build,nvidia,ready-run-all-tests,vllm-ir — by ProExpertProg (merged: 2026-04-01 10:15 (UTC+8))
#38148 Fix NaN from stale FP4 scale padding in create_fp4_scale_tensor — ready,v1 — by elvircrn (merged: 2026-04-01 10:15 (UTC+8))
#38343 [Model] Sync upstream BT=chunk_size fix for GDN chunk_fwd_kernel_o, simplify warmup to single pass — ready,qwen — by AuYang261 (merged: 2026-04-01 03:03 (UTC+8))
#37160 [Feat][v1] Simple yet General CPU KV Cache Offloading — performance,frontend,ready,v1,kv-connector,nvidia — by ivanium (merged: 2026-04-01 08:58 (UTC+8))
#37887 [ROCm][perf] fix Aiter sparse MLA with MTP>1 — rocm,speculative-decoding,ready,v1 — by gronsti-amd (merged: 2026-04-01 07:22 (UTC+8))
#34539 Generative Scoring — documentation,frontend,ready,v1 — by vedantjh2 (merged: 2026-04-01 07:02 (UTC+8))
#38333 feat(grpc): add periodic stats logging and servicer log forwarding — frontend,ready — by CatherineSue (merged: 2026-04-01 06:50 (UTC+8))
#37501 fix: clamp dA_cumsum differences to prevent Inf in Mamba2 SSD kernels — rocm,ready — by kibitzing (merged: 2026-03-31 23:35 (UTC+8))
#38637 [Quantization] Consolidate dummy format logic into DummyModelLoader — ready,quantization — by Josephasafg (merged: 2026-04-01 06:20 (UTC+8))
#38592 [Kernel] [Helion] [17/N] Add Helion kernel torch.compile support — ready — by gmagogsfm (merged: 2026-04-01 05:06 (UTC+8))
#38383 [Refactor] Remove dead code in kv connector and model runner — intel-gpu,ready,v1,cpu,kv-connector — by yewentao256 (merged: 2026-04-01 05:05 (UTC+8))
#38451 [Perf] Fix DBO overlap: capture DeepEP event before yield — ready — by czhu-cohere (merged: 2026-04-01 04:39 (UTC+8))
#38556 [Bugfix][Async] Fix async spec decoding with hybrid models — bug,speculative-decoding,ready,v1 — by MatthewBonanni (merged: 2026-03-31 23:08 (UTC+8))
#36286 [MoE Refactor] Migrate Unquantized to Full Oracle Flow — ready,nvidia,ready-run-all-tests — by yzong-rh (merged: 2026-04-01 03:43 (UTC+8))
#36540 [fix] Remove trtllm ragged mla prefills — ready,v1 — by evezhier (merged: 2026-04-01 03:30 (UTC+8))
#37373 [torch.compile] Refactor Attention Quant Fusion Pass and Remove Boilerplate — documentation,ready,v1,nvidia — by BadrBasowid (merged: 2026-04-01 02:15 (UTC+8))
#37766 [CI/Build] Resolve a dependency deadlock when installing the test dependencies used in CI — ready,ci/build — by yurun00 (merged: 2026-04-01 02:05 (UTC+8))
#37503 [4/n] Migrate FP4/W4A8 CUTLASS kernels to torch stable ABI — ready,ci/build,nvidia — by mikaylagawarecki (merged: 2026-04-01 01:21 (UTC+8))
#37986 [Quantization][Autoround][XPU] Add W4A16 Support — intel-gpu,ready,ci/build — by yiliu30 (merged: 2026-04-01 00:48 (UTC+8))
#38584 [CI][Bugfix] Fix test_run_eagle_dp — bug,ready,v1 — by MatthewBonanni (merged: 2026-03-31 18:30 (UTC+8))
#37010 [Bugfix] Fix FusedMoE weight loading with padded hidden dimensions — bug,ready — by SandishKumarHN (merged: 2026-04-01 00:22 (UTC+8))
#38629 [Fix] handle PaddleOCR-VL image processor max_pixels across Transformers v4/v5 — ready — by zhang-prog (merged: 2026-03-31 23:50 (UTC+8))
#38574 [Online Quant] [QeRL] Minor code cleanup — ready — by kylesayrs (merged: 2026-03-31 22:56 (UTC+8))
#38628 [Docs] PD with Nixl compat matrix — documentation,ready,kv-connector — by NickLucche (merged: 2026-03-31 23:01 (UTC+8))
#37841 replace cuda_device_count_stateless() to current_platform.device_count() — rocm,ready,v1,gpt-oss,nvidia — by wincent8 (merged: 2026-03-31 22:32 (UTC+8))
#38594 [CI] Avoid concurrent docker pull in intel XPU CI runners to prevent rate limit issues — intel-gpu,ready,ci/build — by wendyliu235 (merged: 2026-03-31 22:23 (UTC+8))
#38632 [CI] fix LM Eval Qwen3.5 Models (B200) — ready,qwen — by ZJY0516 (merged: 2026-03-31 21:20 (UTC+8))
#38566 [Bugfix][CI] Skip flaky test_eagle test — bug,ready,v1 — by NickLucche (merged: 2026-03-31 21:42 (UTC+8))
#38631 Fix MLA runs when use_inductor_graph_partition=True — ready — by ElizaWszola (merged: 2026-03-31 21:37 (UTC+8))
#38596 [XPU]move testing dependencies from Dockerfile to xpu-test.in — intel-gpu,ready,ci/build — by 1643661061leo (merged: 2026-03-31 20:49 (UTC+8))
#33176 [EPLB] Add alternative communication for EPLB weight exchange — ready,ci/build,v1 — by ilmarkov (merged: 2026-03-31 20:17 (UTC+8))
#36742 [EPD] update EPD script arguments — documentation,ready,kv-connector — by zhenwei-intel (merged: 2026-03-31 20:02 (UTC+8))
#31113 Fix document of torchrun_example.py — documentation,ready,unstale — by foreverlms (merged: 2026-03-31 18:54 (UTC+8))
#38129 DOC: TPU mention fix — ready — by mtsokol (merged: 2026-03-31 18:27 (UTC+8))
#38570 [Misc] Move --grpc CLI argument into make_arg_parser — frontend,ready — by CatherineSue (merged: 2026-03-31 18:24 (UTC+8))
#38613 [Feature]: add presence_penalty and frequency_penalty fields to Responses API — frontend,ready — by chaunceyjiang (merged: 2026-03-31 16:45 (UTC+8))

#28631 [Frontend][3/n] Improve pooling entrypoints scoring. — documentation,frontend,ready,multi-modality,llama,qwen — by noooop (merged: 2026-03-31 15:52 (UTC+8))

#35697 [CPU] Support int8 compute mode in CPU AWQ — ready,ci/build,cpu — by yintong-lu (merged: 2026-03-31 15:27 (UTC+8))
#38611 [ci] Remove benchmarks job — ready,ci/build — by khluu (merged: 2026-03-31 14:46 (UTC+8))
#37989 [OOT] Add OOT support for linear kernel. — ready — by menogrey (merged: 2026-03-31 14:33 (UTC+8))
#38189 [Tool Parser][2/3] Use self.tools instead of request.tools in tool parsers — ready,tool-calling,qwen,deepseek — by sfeng33 (merged: 2026-03-31 13:41 (UTC+8))
#38554 [kv_offload+HMA] Fix num_blocks with different per-layer page sizes and improve assert message — ready,v1,kv-connector — by kfirtoledo (merged: 2026-03-31 14:00 (UTC+8))
#38576 vLLM Benchmark Suite perf regression after PR#32723 — performance,ready,ci/build,cpu — by louie-tsai (merged: 2026-03-31 13:23 (UTC+8))
#38508 [ROCm][CI] Fix Whisper translation test attention backend selection — rocm,ready — by AndreasKaratzas (merged: 2026-03-31 13:21 (UTC+8))
#38264 [Mypy] Fix adjust_request typing — documentation,frontend,ready,tool-calling — by sfeng33 (merged: 2026-03-31 12:21 (UTC+8))
#38546 [KVConnector] Remove redundant method KVConnectorOutput::merge() — ready,v1 — by hickeyma (merged: 2026-03-31 12:11 (UTC+8))

PRs Closed Without Merging

#38639 [torch.compile] change auto_functionalized return structure to use indexing instead of unpack values — needs-rebase — by chaojun-zhang (closed: 2026-04-01 10:48 (UTC+8))
#29661 Enable FlashInfer Hopper FP8 attention — v1,nvidia — by nvpohanh (closed: 2026-04-01 10:31 (UTC+8))
#26625 [Feature] Optimize Prefill Phase: Add Hybrid Chunked Prefill Support — ci/build,stale,v1,multi-modality,llama — by Ther-LF (closed: 2026-04-01 10:17 (UTC+8))
#28723 [Feature] Support Prefill Context Parallel (PCP) for GQA flashinfer — speculative-decoding,needs-rebase,stale,v1,nvidia — by pisceskkk (closed: 2026-04-01 10:16 (UTC+8))
#28994 [Misc] Update disaggregated prefill examples for LMCache v0.3.7+ compatibility — documentation,stale,kv-connector — by Zerohertz (closed: 2026-04-01 10:16 (UTC+8))
#36143 [Core] Add --hybrid-kv-cache-group-size to override KV cache grouping for hybrid attn models — v1 — by mzmssg (closed: 2026-04-01 10:12 (UTC+8))
#38672 Fix llm_request trace context propagation — structured-output,frontend,v1,tool-calling,gpt-oss,nvidia — by will-deines (closed: 2026-04-01 09:07 (UTC+8))
#38667 Fix/lmcache ec connector module — needs-rebase,v1,kv-connector — by benyebai (closed: 2026-04-01 07:50 (UTC+8))
#38357 Revert “[Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell” (#38083) — bug,needs-rebase,qwen — by zhewenl (closed: 2026-04-01 05:11 (UTC+8))
#38638 Fix Nano Nemotron VL regressions — no labels — by netanel-haber (closed: 2026-04-01 04:24 (UTC+8))
#38653 [Bugfix] Use async preprocessing in pooling/embedding endpoints — bug,frontend — by chuqiwang (closed: 2026-04-01 02:05 (UTC+8))
#38358 Revert “[CI] Add batch invariant test for b200” (#38014) — ci/build — by zhewenl (closed: 2026-04-01 00:36 (UTC+8))
#34174 [Draft][ROCm] ROCm7.2 as base — rocm,needs-rebase,ci/build — by gshtras (closed: 2026-03-31 23:55 (UTC+8))
#38419 [Bugfix] Fix backup token index in async spec decode (fixes Nemotron BF16 accuracy) — bug,speculative-decoding,v1 — by SandishKumarHN (closed: 2026-03-31 23:13 (UTC+8))
#30558 [Core] Support multi prompt for AsyncLLM.generate() and encode() — documentation,needs-rebase,v1 — by buaazp (closed: 2026-03-31 21:00 (UTC+8))
#30492 [Feature] add manual numa binding — ready,needs-rebase,v1,nvidia — by jasonlizhengjian (closed: 2026-03-31 20:59 (UTC+8))
#36250 [kernel] Fix FP8 paged MQA fallback for CUDA graph capture — bug,v1,nvidia — by ZJY0516 (closed: 2026-03-31 20:36 (UTC+8))
#38010 [Model] Fix BitsAndBytes quantization for GLM-4.1V/4.6V-Flash vision encoder — bug — by yanghui1-arch (closed: 2026-03-31 20:08 (UTC+8))
#37507 [Bugfix] Fall back to Triton/FLA when system CUDA toolkit < 12.6 for GDN prefill kernel — bug,needs-rebase,qwen,nvidia — by yanghui1-arch (closed: 2026-03-31 20:08 (UTC+8))
#38599 Revert “[Mamba][Bugfix] Raise on insufficient cache blocks instead of silently capping cudagraph sizes” (#38270) — bug,v1,nvidia — by vllm-agent (closed: 2026-03-31 18:11 (UTC+8))
#38618 Fix/fp8 alignment padding — no labels — by jessiewei7 (closed: 2026-03-31 17:51 (UTC+8))
#38525 [Docs]: comprehensive rewrite of disaggregated prefilling (PD) documentation — documentation,performance,new-model,rocm,frontend,intel-gpu,speculative-decoding,ci/build,v1,multi-modality — by neweyes (closed: 2026-03-31 17:13 (UTC+8))
#37408 [ROCm][Quantization] fallback trust_remote_code=True in Quark config for some cases — rocm — by xuebwang-amd (closed: 2026-03-31 12:08 (UTC+8))