vLLM Development Activity Report - 2026-03-24
Time window: 2026-03-24 11:33 (UTC+8) ~ 2026-03-25 11:33 (UTC+8)
Statistics: 26 new issues | 24 closed issues | 79 new PRs | 36 merged PRs | 17 PRs closed without merging
📊 Daily Development Status Summary
During this window, the vLLM project kept up a high development tempo, opening and merging a large number of PRs (79 opened, 36 merged), reflecting sustained community contribution and feature iteration. Attention centered on real-world deployment problems with large models such as Qwen3.5 (e.g. pipeline parallelism, ROCm compatibility) and on performance tuning for specific hardware (e.g. AMD MI300, NVIDIA L40S). In parallel, a debate over whether to maintain compatibility support for older hardware (e.g. Ampere-generation A100) drew community attention.
🎯 AMD/ROCm Ecosystem Updates
AMD-ecosystem activity was brisk this cycle, concentrated on bug fixes and kernel optimization.
- Model bugs on the ROCm platform:
    - Issue #37996: A user reported that the Qwen3.5-397B GPTQ model outputs nothing but exclamation points on ROCm; another user confirmed a similar problem. This is a critical inference-correctness issue on ROCm and was auto-assigned to the ROCm maintainer team.
    - Issue #37992 & PR #37993: AMD engineer xuebwang-amd reported that running the Qwen3.5-MoE vision model on MI325X (gfx942) crashes during profile_run because flash_attn's Triton rotary kernel fails. He promptly submitted PR #37993, adding a fallback to the native rotary embedding implementation.
    - PR #37973: Contributor vllmellm submitted an optimization enabling FP8 Query for Triton Attention on ROCm, aimed at raising performance and showing the community's continued work on ROCm inference efficiency.
- GPT-OSS fixes on ROCm:
    - PR #37787 (merged): Fixed the GPT-OSS regressions on ROCm introduced by the refactor in PR #37128, covering CK MXFP4 backend selection, alignment checks, and passing of padding parameters. An important fix for keeping GPT-OSS stable on AMD hardware.
    - PR #38043: As a follow-up, further adjusts GPT-OSS's RMSNorm fusion and padding strategy on ROCm (e.g. changing MI300's padding from 128 to 256) and re-enables the related fusion optimizations.
- Infrastructure and integration:
    - PR #37980: Primarily about DeepGEMM integration, but by packaging DeepGEMM into the wheel via CMake FetchContent it simplifies deployment of DeepGEMM-dependent models such as DeepSeek-V3 on all platforms, ROCm included.
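The fallback pattern behind PR #37993 is a common one for platform-specific kernels: try the accelerated path, and degrade to a reference implementation rather than crash. A minimal plain-Python sketch of the idea (all function names here are hypothetical illustrations, not vLLM's actual API):

```python
import math

def native_rotary_embed(x, pos, theta=10000.0):
    """Pure-Python rotary embedding on one vector: rotate each
    consecutive pair (x[i], x[i+1]) by a position-dependent angle.
    A hypothetical stand-in for a framework's native implementation."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        angle = pos / (theta ** (i / d))
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def triton_rotary_embed(x, pos):
    # Stand-in for an accelerated Triton kernel; here it always fails,
    # mimicking the unsupported-platform error seen in Issue #37992.
    raise RuntimeError("triton rotary kernel unavailable on this platform")

def rotary_embed_with_fallback(x, pos):
    """Prefer the fast kernel; fall back to the native path on failure."""
    try:
        return triton_rotary_embed(x, pos)
    except (RuntimeError, ImportError):
        return native_rotary_embed(x, pos)
```

At position 0 the rotation angle is zero, so the fallback path returns the input unchanged, which makes the degradation easy to sanity-check.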
💬 High-Traffic Discussion Analysis
- Issue #38006: Should a TRITON_MLA_SPARSE backend be implemented for Ampere GPUs to support sparse-MLA models?
    - Core question: Users request a TRITON_MLA_SPARSE backend for sm80 (A100/A800) GPUs to support models using sparse MLA, such as GLM-5 and DeepSeek V3.2.
    - Positions:
        - Requester (ehfd): Argues this is strategically valuable; just as vLLM's earlier FP8 fallback for Ampere attracted users, implementing this would bring every Ampere user of sparse-MLA models to vLLM.
        - Maintainer (youkaichao): States plainly that this is not a priority for vLLM maintainers, considers running these models on A100 "difficult / of little value" even if implemented, and suggests keeping the current fail-fast behavior, with interested community developers maintaining their own fork.
    - Point of contention: The conflict between the project's official maintenance priorities (new hardware, best performance) and community users' desire to protect existing hardware investments.
    - Current status: The issue remains open; the maintainer position is clear, and progress depends on community contribution.
- Issue #37996: Qwen3.5 397B GPTQ produces garbage output on ROCm
    - Core question: A specific large-scale model produces completely wrong output (all exclamation points) on the AMD platform.
    - Discussion: The reporter provided a detailed reproduction environment; the bot auto-requested more information and tagged the ROCm maintainers. Another user reported the same problem, suggesting it is not an isolated case.
    - Current status: Open and unresolved; a high-priority bug on the ROCm platform.
- Issue #36613 (closed): Enabling MTP for Qwen3.5-397B causes CUDA illegal memory access under high concurrency
    - Core question: A large-scale MoE model hits a fatal CUDA error when multi-token prediction (MTP) is enabled under high concurrency.
    - Discussion: Multiple users confirmed the same or similar failures. Maintainers suggested testing a nightly build and provided a reproduction script. Reducing the number of speculative tokens (e.g. from 5 to 1) was found to avoid the crash, pointing to a concurrency-safety or memory-management defect in the MTP implementation.
    - Outcome: Closed after the reporter confirmed that adjusting a timeout environment variable resolved the symptom, but the underlying MTP stability problem under high concurrency may still deserve attention.
🔥 Hot Topics & Trends
- Qwen3.5 deployment challenges: Multiple issues (#37996, #37972, #37967, #38024) center on deploying Qwen3.5 (especially the large MoE variants), spanning pipeline parallelism, tokenizer compatibility, and failures under specific configurations, suggesting the series is widely used in the vLLM ecosystem but still in a settling-in period.
- Deepening AMD platform support: Beyond the ROCm-specific issues above, PR activity shows the community actively fixing performance regressions on AMD (#37787) and pushing optimizations (#37973); support is moving from "runs at all" toward "runs fast and stably".
- Fine-grained performance optimization: Micro-optimizations for specific hardware and model shapes are becoming a trend, e.g. tuning GPT-OSS's Marlin MoE policy for L40S/SM89 (#38054), a fused concat+quant FP8 kernel for DeepSeek V3.2 (#38028), and Mamba state-handling optimizations for CPU inference (#38047).
- API and feature expansion: Frontend capabilities keep growing, e.g. a gradient-computation API (#38008), structured output with batched messages (#38011), and gRPC health checking (#38016), improving vLLM's usability and enterprise readiness.
🛠️ Key Technical Changes
- PR #37987 (merged): Fix compute_slot_mapping crash on non-Triton platforms
    - Analysis: Fixes a key compatibility bug. An earlier refactor removed the numpy fallback implementation, crashing platforms without Triton support (e.g. ppc64le or pure-CPU environments). The fix provides an alternative implementation for those platforms, keeping vLLM runnable in heterogeneous computing environments.
    - Impact: Shores up vLLM's deployability on non-standard GPU architectures and CPUs.
- PR #37980 (in progress): Integrate DeepGEMM into the vLLM wheel
    - Analysis: Uses CMake's FetchContent mechanism to compile and package the DeepGEMM library, required for DeepSeek model inference, directly into vLLM's release wheel, eliminating the manual installation step.
    - Impact: Greatly simplifies deployment of models like DeepSeek-V3 and lowers the barrier to entry; an important user-experience infrastructure improvement.
- PR #37787 (merged): Fix GPT-OSS MXFP4 regressions on ROCm
    - Analysis: Fixes the performance regressions and crashes of the GPT-OSS model on AMD caused by a code refactor, carefully restoring the per-GFX-architecture backend selection logic, alignment checks, and required padding parameter passing.
    - Impact: Ensures GPT-OSS correctness and performance on recent AMD hardware (e.g. MI300); key maintenance work for cross-platform model support.
- PR #38011 (in progress): Add batched messages support to the OpenAI API
    - Analysis: Extends the /v1/chat/completions endpoint to accept a list of message lists, processing multiple independent conversations in a single request, and composes with features such as structured output.
    - Impact: Improves API efficiency and flexibility, especially for batch-processing many prompts and extracting structured data, reducing HTTP overhead.
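To make the batched-messages change concrete, here is a sketch of what a request body under PR #38011's proposal could look like. The schema is the PR's proposal rather than a released API, and the model name is a placeholder:

```python
import json

# Sketch of a batched /v1/chat/completions request body per PR #38011's
# proposal: "messages" holds a list of message lists, one per conversation.
payload = {
    "model": "Qwen/Qwen3.5-35B",  # placeholder model name
    "messages": [
        [{"role": "user", "content": "Summarize the release notes."}],
        [{"role": "user", "content": "Extract the dates as JSON."}],
    ],
    # Per the PR description, this composes with structured output.
    "response_format": {"type": "json_object"},
}

body = json.dumps(payload)  # would be POSTed to /v1/chat/completions
```

Each inner list is an independent conversation, so the server can handle all of them in one HTTP round trip instead of one request per prompt.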
📈 Development Activity Observations
- Contributor diversity: This cycle saw contributions from AMD engineers (xuebwang-amd) as well as optimizations targeting specific hardware (e.g. L40S), showing the ecosystem reaching more hardware vendors and niche scenarios.
- Response and fix speed: For Issue #37992 reported by xuebwang-amd, the associated fix PR #37993 was opened within 6 minutes of the issue's creation, demonstrating highly responsive core contributors or internal teams.
- Merge throughput: The 36 merged PRs include several important bug fixes (e.g. #37987, #38015) and feature improvements; review and merge pipelines are running smoothly.
💡 Issues Worth Watching
- Long-term Ampere support policy: The discussion sparked by Issue #38006 needs a clear answer from the community. Whether, and how, to support new model architectures (e.g. sparse MLA) on older hardware such as the A100 will shape technology choices for part of the user base.
- Stability of large Qwen3.5 MoE models: The cluster of problems (pipeline parallelism, garbage output on ROCm, tokenizer errors) shows that full support for these very large MoE models still needs polish; this is a key area for production-deployment stability.
- Multi-platform kernel maintenance burden: As PR #37987 and the various ROCm fixes show, maintaining operator implementations and selection policies across CPU, AMD, and NVIDIA backends keeps growing in complexity and cost; managing that complexity gracefully is a long-term challenge.
- High-concurrency robustness of MTP and similar features: Although Issue #36613 is closed, the robustness of advanced features such as multi-token prediction and speculative decoding under heavy load remains a focus for continued validation when pursuing peak performance.
📋 Appendix: Detailed Data Lists
New Issues
- #38024 Tokenizer error with Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated model - TokenizersBackend not found — no labels — by ArtemSultanov-PG (created: 2026-03-25 01:44 (UTC+8))
- #38006 [Feature]: Implement TRITON_MLA_SPARSE backend for sm80 support of Sparse MLA — feature request — by ehfd (created: 2026-03-24 21:01 (UTC+8))
- #38056 [Bug]: ImportError: flash_attn.ops.triton.rotary not found on older versions (< v2.1.2) — bug — by xiaoajie738 (created: 2026-03-25 10:21 (UTC+8))
- #38051 [Bug]: Possible warm start compile time issue for Deepseek V3.2 and Kimi K2.5 — bug,torch.compile — by zou3519 (created: 2026-03-25 09:40 (UTC+8))
- #37996 [Bug]: Qwen3.5 397B GPTQ model outputs all exclamation points on ROCM — bug,rocm — by hnhyzz (created: 2026-03-24 19:04 (UTC+8))
- #38041 V2 model runner crashes on Qwen3.5 mixed attention (linear + full) — no labels — by bhaktatejas922 (created: 2026-03-25 05:44 (UTC+8))
- #38033 [Bug]: CPUFusedMOE raises IndexError when running EP across CPU-only nodes — bug — by shunzhiwen (created: 2026-03-25 04:28 (UTC+8))
- #37976 [Feature]: Support Structured Output with Batched Prompts via OpenAI-Compatible API — feature request — by MatejRojec (created: 2026-03-24 15:57 (UTC+8))
- #38022 [Bug]: Marlin MoE kernel fails with MXFP4-quantized GPT-OSS 20B - Invalid thread config for non-aligned dimensions (K=2880, N=2880) — bug — by zweack (created: 2026-03-25 01:04 (UTC+8))
- #37983 [Bug]: compute_slot_mapping crashes on non-Triton platforms (ppc64le/CPU) after PR #32951 — bug — by Akashcodes732 (created: 2026-03-24 16:37 (UTC+8))
- #38013 [Bug]: Error while serving mistralai/Mistral-7B-v0.3 on CPU systems with Docker build — bug — by azhuvath (created: 2026-03-24 22:54 (UTC+8))
- #37995 [RFC]: Prefill Context Parallel for Qwen3.5 Hybrid Attention — feature request — by Yancey0623 (created: 2026-03-24 18:57 (UTC+8))
- #38005 [Bug]: EP support for Fused MoE LoRA is not implemented yet. — bug — by Nero10578 (created: 2026-03-24 20:55 (UTC+8))
- #38004 [Bug]: Speech-to-Text endpoint may return 501 but not documented in OpenAPI — bug — by Sebastian-dong (created: 2026-03-24 20:54 (UTC+8))
- #38003 [RFC]: Integrate MineDraft speculative decoding into vLLM — RFC — by Sebastian-dong (created: 2026-03-24 20:13 (UTC+8))
- #37992 [Bug]: RuntimeError triton error during profile_run with Qwen3.5-MoE vision encoder on ROCm — bug,rocm — by xuebwang-amd (created: 2026-03-24 18:24 (UTC+8))
- #37972 [Bug]: Pipeline Parallelism >=4 errors on runtime with Qwen3.5-FP8 — bug — by ehfd (created: 2026-03-24 15:23 (UTC+8))
- #37988 [Bug]: minimax-m2.5, reasoning token result in negative values — bug — by SparrowMu (created: 2026-03-24 17:26 (UTC+8))
- #37981 [Bug]: v0.18.0 fails to run MiniCPM-o-4.5 — bug — by devalun (created: 2026-03-24 16:33 (UTC+8))
- #37967 [Bug]: TypeError: transformers.tokenization_utils_tokenizers.TokenizersBackend._patch_mistral_regex() got multiple values for keyword argument ‘fix_mistral_regex’ — bug — by tech112233445566 (created: 2026-03-24 14:16 (UTC+8))
- #37979 [RFC]: Intel Quantization Support Roadmap (H1 2026) — RFC — by yiliu30 (created: 2026-03-24 16:14 (UTC+8))
- #37982 [Bug]: chart-helm does not support configuring shared memory (/dev/shm) — bug — by utsumi-fj (created: 2026-03-24 16:34 (UTC+8))
- #37977 [Bug][Model] Eagle2.5-VL applies ImageNet normalization instead of SigLIP2 — no labels — by edwingao28 (created: 2026-03-24 16:03 (UTC+8))
- #37974 [Bug]: Kimi-K2.5 on version 0.18.0 results in a KeyError when pipeline parallelism (PP) is greater than or equal to 2 — bug — by WangTuoxytt (created: 2026-03-24 15:53 (UTC+8))
- #37971 [Feature]: Enable simultaneous generate and embed endpoints in a single vLLM instance — feature request — by biba10 (created: 2026-03-24 15:14 (UTC+8))
- #37966 [Bug]: spec decoding nonparallel path has incompatible draft/target hidden sizes; suggest supporting differing hidden sizes — bug — by sunchendd (created: 2026-03-24 14:00 (UTC+8))
Closed Issues
- #25342 [Feature]: Allow increasing the flashinfer workspace buffer size — feature request,stale — by richardhundt (closed: 2026-03-25 10:17 (UTC+8))
- #25994 [Bug]: DeepSeek-V3.1 gives garbage output — bug,stale — by nicolexin (closed: 2026-03-25 10:17 (UTC+8))
- #27572 [Bug]: chat/completions stream intermittently returns null as finish_reason — bug,stale — by shuynh2017 (closed: 2026-03-25 10:17 (UTC+8))
- #27602 [Bug]: quantized medgemma-27b-text-it producing garbage outputs — bug,stale — by kritiyer (closed: 2026-03-25 10:17 (UTC+8))
- #28085 [Feature][UX]: vLLM Kernel Configuration — feature request,stale,startup-ux — by robertgshaw2-redhat (closed: 2026-03-25 10:17 (UTC+8))
- #28641 [Feature][P0]: Optimize Dockerfile Layer Ordering — feature request,ci/build,stale — by rzabarazesh (closed: 2026-03-25 10:17 (UTC+8))
- #28648 [Feature][P1]: Use Bind Mounts for Installation Scripts — feature request,ci/build,stale — by rzabarazesh (closed: 2026-03-25 10:17 (UTC+8))
- #29151 [Bug]: VLLM Sleep on NVIDIA H100 leading to model producing slow invalid results — bug,stale — by relyt0925 (closed: 2026-03-25 10:16 (UTC+8))
- #29248 [Bug]: Encoder disaggregation example endpoint timeout (Only when PD disaggregation enabled) — bug,stale — by chungen04 (closed: 2026-03-25 10:16 (UTC+8))
- #29269 [Feature]: Option to Disable Process/Thread Log Prefixing in vLLM — feature request,stale — by idodoron11 (closed: 2026-03-25 10:16 (UTC+8))
- #29277 [Usage]: Creating and accessing per request arguments inside vLLM model — usage,stale — by minlu21 (closed: 2026-03-25 10:16 (UTC+8))
- #29283 [Feature]: Update triton_kernels with upstream triton — feature request,stale — by zyongye (closed: 2026-03-25 10:16 (UTC+8))
- #29298 [Feature]: Prefill mode for PD using cpu+gpu hybrid engine — feature request,stale — by komitydev (closed: 2026-03-25 10:16 (UTC+8))
- #29350 [Bug]: Audio transcription duplicated words between chunks — bug,stale — by Quentinchampenois (closed: 2026-03-25 10:16 (UTC+8))
- #37837 Why is an assertion used here? — no labels — by WZKIIIII (closed: 2026-03-25 09:18 (UTC+8))
- #37608 [Bug]: AttributeError: ‘Parameter’ object has no attribute ‘load_qkv_weight’ — bug — by S1ro1 (closed: 2026-03-25 03:37 (UTC+8))
- #31894 [Bug] hf_token argument to LLM in Python SDK ignored in vllm.transformer_utils.config — bug — by benglewis (closed: 2026-03-25 03:22 (UTC+8))
- #37705 [Bug]: Structured output crashes on CPU with pin_memory=True in apply_grammar_bitmask() — no labels — by wjhrdy (closed: 2026-03-25 01:44 (UTC+8))
- #17817 [RFC]: Unification of frontend parser — structured-output,RFC,unstale,tool-calling — by aarnphm (closed: 2026-03-24 23:05 (UTC+8))
- #37983 [Bug]: compute_slot_mapping crashes on non-Triton platforms (ppc64le/CPU) after PR #32951 — bug — by Akashcodes732 (closed: 2026-03-24 23:01 (UTC+8))
- #38013 [Bug]: Error while serving mistralai/Mistral-7B-v0.3 on CPU systems with Docker build — bug — by azhuvath (closed: 2026-03-24 23:00 (UTC+8))
- #37972 [Bug]: Pipeline Parallelism >=4 errors on runtime with Qwen3.5-FP8 — bug — by ehfd (closed: 2026-03-24 18:25 (UTC+8))
- #37855 [Bug]: Qwen3-VL-Embedding-8B Image embedding failed — bug — by nuclearwu (closed: 2026-03-24 14:35 (UTC+8))
- #36613 [Bug]: CUDA IMA (Illegal Memory Access) crash when enabling MTP for Qwen3.5-397B-A17B under high concurrency — bug — by xiaochengyige (closed: 2026-03-24 11:41 (UTC+8))
New PRs
- #38062 Bump helion dependency from 0.3.2 to 0.3.3 — ci/build — by gmagogsfm (created: 2026-03-25 11:29 (UTC+8))
- #38060 [Bugfix] Handle ImportError for flash_attn in ApplyRotaryEmb class — bug — by xiaoajie738 (created: 2026-03-25 10:56 (UTC+8))
- #38009 [bug] Fix remaining START_DP_WAVE pause race in _handle_client_request — bug,ready,v1 — by junjzhang (created: 2026-03-24 21:35 (UTC+8))
- #37973 [Feature] Use Fp8 Query for Triton Attention on ROCm — rocm,v1 — by vllmellm (created: 2026-03-24 15:30 (UTC+8))
- #37968 [Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function — ready,v1,nvidia — by chaunceyjiang (created: 2026-03-24 14:23 (UTC+8))
- #37975 [Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 — qwen — by wxsIcey (created: 2026-03-24 15:56 (UTC+8))
- #38061 [MM][Perf][CG] Support ViT full CUDA graph for Qwen3-VL video inference — v1,qwen,nvidia — by shen-shanshan (created: 2026-03-25 10:59 (UTC+8))
- #38059 [KVConnector]: Add OffloadingPolicy abstraction for offloading connector — v1,kv-connector — by jonathanc-n (created: 2026-03-25 10:49 (UTC+8))
- #38050 [MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration — ready,nvidia — by zyongye (created: 2026-03-25 09:23 (UTC+8))
- #38058 [Misc] reject batched completions requests in toy proxy — v1,kv-connector — by zhenwei-intel (created: 2026-03-25 10:34 (UTC+8))
- #38010 [Model] Fix BitsAndBytes quantization for GLM-4.1V/4.6V-Flash vision encoder — bug — by yanghui1-arch (created: 2026-03-24 21:57 (UTC+8))
- #38049 [Model] Add torch.compile support for InternVL vision encoder — no labels — by tianrengao (created: 2026-03-25 09:13 (UTC+8))
- #38035 Better weight tying check for multimodal models — ready — by hmellor (created: 2026-03-25 05:08 (UTC+8))
- #38057 [CI/Docs] Improve aarch64/DGX Spark support for dev setup — documentation — by bbrowning (created: 2026-03-25 10:30 (UTC+8))
- #38055 [Bugfix][CPU][v1] Fix IndexError in CPUWorker for multi-node DP/EP on CPU-only nodes — bug,v1,cpu — by philhuan (created: 2026-03-25 10:18 (UTC+8))
- #38032 Compose online quantization with quantized reloading — no labels — by kylesayrs (created: 2026-03-25 04:24 (UTC+8))
- #38031 [Model Runner V2][Minor] Simplify PP logic — ready,v1,nvidia — by njhill (created: 2026-03-25 02:55 (UTC+8))
- #38054 [MoE][GPT-OSS] Add L40S/SM89 Marlin block-size policy — v1,gpt-oss,nvidia — by will-deines (created: 2026-03-25 10:18 (UTC+8))
- #37958 [Bugfix] Fix IndexError when accessing prev_tool_call_arr in OpenAIToolParser — bug,frontend,ready,tool-calling — by chaunceyjiang (created: 2026-03-24 11:39 (UTC+8))
- #38045 [Model Runner V2] Enable forcing a specific acceptance rate during rejection sampling — v1 — by TheEpicDolphin (created: 2026-03-25 06:40 (UTC+8))
- #38053 [BugFix] Fix TypeError in MiniCPM-O audio feature unpadding — bug — by Krishnachaitanyakc (created: 2026-03-25 10:11 (UTC+8))
- #38052 [Doc] Fix Python-only build 404 fallback guidance — documentation,nvidia — by George-ao (created: 2026-03-25 09:54 (UTC+8))
- #37997 [benchmark] Clarify single-image limitation in CustomMMDataset (docs + warning) — documentation,performance — by dw2761 (created: 2026-03-24 19:08 (UTC+8))
- #38046 [compile] Add some more startup tests for top models — ready — by zou3519 (created: 2026-03-25 07:32 (UTC+8))
- #37986 [Quantization][Autoround][XPU] Add W4A16 Support — ci/build — by yiliu30 (created: 2026-03-24 16:54 (UTC+8))
- #38048 [Refactor] Rename WAITING_FOR_FSM to WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR — structured-output,ready,v1 — by yewentao256 (created: 2026-03-25 08:03 (UTC+8))
- #38047 Remove GPU/CPU syncs in GDNAttentionMetadata.build during speculative decoding — v1 — by lgeiger (created: 2026-03-25 07:42 (UTC+8))
- #38044 [release] Move the rest of release jobs to release queue — ci/build — by khluu (created: 2026-03-25 06:32 (UTC+8))
- #38043 [ROCm]: gpt-oss fusion/padding fixes — rocm,ci/build,gpt-oss — by Rohan138 (created: 2026-03-25 06:23 (UTC+8))
- #38042 Clone support — performance,ci/build,v1,qwen,nvidia — by alexandred (created: 2026-03-25 06:20 (UTC+8))
- #38014 [CI] Add batch invariant test for b200 — ready,ci/build — by yewentao256 (created: 2026-03-24 22:55 (UTC+8))
- #38039 [Bug] Fix qwen 3.5 batch invariance — bug,frontend,ready,qwen — by yewentao256 (created: 2026-03-25 05:39 (UTC+8))
- #38040 [Draft] [Fix] Invariant Check for Auto-Inferred Budgets/Max Batch Size in ViT CUDA Graph Manager — v1,nvidia — by b-mu (created: 2026-03-25 05:39 (UTC+8))
- #38034 [Perf] Optimize mean pool cumsum by adding a small id cache, 2.8% Throughput improvement — no labels — by yewentao256 (created: 2026-03-25 05:03 (UTC+8))
- #38038 Dynamic SD scheduler decides — speculative-decoding,v1 — by LucasWilkinson (created: 2026-03-25 05:21 (UTC+8))
- #38037 Fix IndexError in streaming tool calls when max_tokens is hit — frontend — by joaquinhuigomez (created: 2026-03-25 05:19 (UTC+8))
- #38036 Add 501 response to STT endpoint OpenAPI spec — frontend — by joaquinhuigomez (created: 2026-03-25 05:18 (UTC+8))
- #37960 Fused rmsnorm fp8 quant — documentation,performance,new-model,frontend,speculative-decoding,ci/build,v1,deepseek,cpu,gpt-oss — by tianrengao (created: 2026-03-24 12:56 (UTC+8))
- #37970 [Kernel] Optimize SM120 CUTLASS blockwise FP8 GEMM — performance,ready,nvidia — by Nekofish-L (created: 2026-03-24 14:56 (UTC+8))
- #38030 [MRV2] Fix for DS v3.2 — ready,v1 — by WoosukKwon (created: 2026-03-25 02:37 (UTC+8))
- #38028 [Kernel] Add indexer_concat_quant_fp8 kernel for DeepSeek V3.2 — performance,ci/build,deepseek,nvidia — by xyang16 (created: 2026-03-25 02:33 (UTC+8))
- #38025 [Metrics] Add labeled token waiting metrics for precise load balancing — v1 — by yangligt2 (created: 2026-03-25 02:18 (UTC+8))
- #38029 [Tool Parser][1/3] Pass tools to ToolParser constructor — frontend,tool-calling — by sfeng33 (created: 2026-03-25 02:37 (UTC+8))
- #38016 [gRPC] Add standard gRPC health checking (grpc.health.v1) for Kubernetes native probes — frontend — by V2arK (created: 2026-03-24 23:14 (UTC+8))
- #38011 Add batched messages support to /v1/chat/completions — documentation,frontend — by MatejRojec (created: 2026-03-24 22:16 (UTC+8))
- #38015 [BugFix] fix VLLM_USE_STANDALONE_COMPILE=0 — bug,ready — by zou3519 (created: 2026-03-24 23:01 (UTC+8))
- #38012 [BugFix] Fix order of compile logging — bug,ready — by zou3519 (created: 2026-03-24 22:19 (UTC+8))
- #38026 Use Inductor for More Quant Fusion — performance — by eellison (created: 2026-03-25 02:27 (UTC+8))
- #38027 [Nixl][PD] Lease renewal TTL KV blocks on P — documentation,v1,kv-connector — by NickLucche (created: 2026-03-25 02:28 (UTC+8))
- #38019 [Model] Add Granite 4.0 1B speech to supported models — documentation,ready,multi-modality — by NickCao (created: 2026-03-24 23:48 (UTC+8))
- #38023 [Dev] CuTile gemm FP8 — performance,nvidia — by LironKesem (created: 2026-03-25 01:38 (UTC+8))
- #37998 docs: fix broken offline inference paths in documentation — documentation,ready — by vineetatiwari27 (created: 2026-03-24 19:12 (UTC+8))
- #37964 [XPU] Support Intel XPU hardware information collection in usage stats — bug,ready — by 1643661061leo (created: 2026-03-24 13:42 (UTC+8))
- #38021 lora: add EP support for FusedMoEWithLoRA — no labels — by mrlexcoder (created: 2026-03-25 01:04 (UTC+8))
- #38020 [Optimization] Fuse mamba_get_block_table_tensor in align mode — v1 — by Jialin (created: 2026-03-25 00:56 (UTC+8))
- #37999 Update new contributor message — ready,ci/build — by hmellor (created: 2026-03-24 19:49 (UTC+8))
- #38017 [Hybrid] Introduce MambaProcessContext to simplify function signatures — v1 — by fuscof-ibm (created: 2026-03-24 23:21 (UTC+8))
- #38018 [DO NOT MERGE] Check which MM processors don’t work with token inputs — ready,multi-modality — by DarkLight1337 (created: 2026-03-24 23:44 (UTC+8))
- #37980 [UX] Integrate DeepGEMM into vLLM wheel via CMake — documentation,ready,ci/build,v1,deepseek — by mgoin (created: 2026-03-24 16:32 (UTC+8))
- #37978 [Bugfix][Model] Fix Eagle2.5-VL using ImageNet normalization instead of SigLIP2 — bug,speculative-decoding — by edwingao28 (created: 2026-03-24 16:08 (UTC+8))
- #37987 [Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU — bug,ready,ci/build,v1,cpu — by bigPYJ1151 (created: 2026-03-24 16:57 (UTC+8))
- #38008 [Core][Frontend] Add gradient computation feature with /v1/gradients API endpoint — frontend,v1 — by gereblye (created: 2026-03-24 21:32 (UTC+8))
- #38007 [SpecDecode] Add shortcut in rejection sampler for greedy sampling — v1 — by zzaebok (created: 2026-03-24 21:09 (UTC+8))
- #37994 [Bugfix] Add minimax_m2 to eagle3 supported models list — bug — by xueliangyang-oeuler (created: 2026-03-24 18:49 (UTC+8))
- #38002 [CPU][Feat] Update CPU Backend to torch 2.11.0 — ci/build,cpu — by fadara01 (created: 2026-03-24 20:05 (UTC+8))
- #38001 Revert “[Test] E2E Nemotron-3-Super tests” (#36803) — ci/build — by zhewenl (created: 2026-03-24 20:04 (UTC+8))
- #37993 [ROCm] Fall back to native rotary embedding when flash_attn triton kernel fails — rocm — by xuebwang-amd (created: 2026-03-24 18:30 (UTC+8))
- #38000 [Model] Add support for BharatGen’s Param2MoE model — documentation,new-model — by bhargav-patel-29 (created: 2026-03-24 19:49 (UTC+8))
- #37991 [Docs] Fix build — ready — by hmellor (created: 2026-03-24 17:46 (UTC+8))
- #37990 [MoE refactor] refactor GPTQMarlinMoEMethod with MK — no labels — by jikunshang (created: 2026-03-24 17:36 (UTC+8))
- #37963 [Model] Add support for BharatGen’s Param2MoE model — documentation,new-model — by bhargav-patel-29 (created: 2026-03-24 13:35 (UTC+8))
- #37984 [Bugfix][Helm] Add configurable /dev/shm shared memory mount to chart-helm — bug,documentation — by utsumi-fj (created: 2026-03-24 16:38 (UTC+8))
- #37989 [OOT] Add OOT support for linear kernel. — no labels — by menogrey (created: 2026-03-24 17:31 (UTC+8))
- #37962 [bug-fix] GLM OCR Patch Merger context_dim — bug,ready — by JaredforReal (created: 2026-03-24 13:29 (UTC+8))
- #37985 [Mamba] Speculative Decoding Mamba — speculative-decoding,v1 — by Josephasafg (created: 2026-03-24 16:51 (UTC+8))
- #37969 [Bug] Serialize xgrammar compilation for parallel structured outputs — bug,structured-output,v1 — by robellliu-dev (created: 2026-03-24 14:25 (UTC+8))
- #37957 Fix tool_parser_cls type annotation from Callable to type[ToolParser] — frontend,ready — by sfeng33 (created: 2026-03-24 11:39 (UTC+8))
- #37961 [Frontend][Core] Add hierarchical cache salting for shared prefixes across tenants — frontend,v1,multi-modality — by vpfkfl753 (created: 2026-03-24 13:19 (UTC+8))
- #37959 [Bugfix] Fix Helm chart Deployment using hardcoded labels instead of chart.labels — bug,documentation — by simpx (created: 2026-03-24 12:49 (UTC+8))
Merged PRs
- #37914 [Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc — documentation,ready,nvidia — by b-mu (merged: 2026-03-25 10:53 (UTC+8))
- #38031 [Model Runner V2][Minor] Simplify PP logic — ready,v1,nvidia — by njhill (merged: 2026-03-25 04:57 (UTC+8))
- #20859 [Feature] limit thinking tokens (hard limit) — documentation,performance,new-model,structured-output,frontend,speculative-decoding,ready,ci/build,v1,multi-modality — by llsj14 (merged: 2026-03-25 00:53 (UTC+8))
- #37787 [Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 — bug,rocm,ready,ci/build,gpt-oss — by AndreasKaratzas (merged: 2026-03-25 08:17 (UTC+8))
- #37673 [Performance] Auto-enable prefetch on NFS with RAM guard — ready — by arpera (merged: 2026-03-25 08:31 (UTC+8))
- #37924 [ROCm][CI][PD] Add Hybrid SSM integration tests to CI — rocm,ready,ci/build,kv-connector — by AndreasKaratzas (merged: 2026-03-25 07:58 (UTC+8))
- #38044 [release] Move the rest of release jobs to release queue — ci/build — by khluu (merged: 2026-03-25 07:40 (UTC+8))
- #37485 [Perf] Disable inductor runtime asserts by default for serving perfor… — documentation,ready — by tianrengao (merged: 2026-03-25 07:37 (UTC+8))
- #37903 nano_nemotron_vl: suppress readonly torch.from_numpy() warning in image and video resize paths — ready — by netanel-haber (merged: 2026-03-25 07:25 (UTC+8))
- #37926 Make microbatch optimization (DBO) work with general models — ready,v1 — by 0xjunhao (merged: 2026-03-25 05:40 (UTC+8))
- #37728 Fix Mamba state corruption from referencing stale block table entries (#37728) — ready,v1,nvidia,meta-exported,fb-exported — by minosfuture (merged: 2026-03-25 01:30 (UTC+8))
- #37233 [UX] Add flashinfer-cubin as CUDA default dep — ready,ci/build,nvidia — by mgoin (merged: 2026-03-25 05:13 (UTC+8))
- #38030 [MRV2] Fix for DS v3.2 — ready,v1 — by WoosukKwon (merged: 2026-03-25 05:03 (UTC+8))
- #35386 Add Ubuntu 24.04 support for Docker builds — ready,ci/build — by aasgaonkar (merged: 2026-03-25 04:34 (UTC+8))
- #37692 [FlexAttention] allow custom mask mod — ready,v1 — by liangel-02 (merged: 2026-03-25 04:03 (UTC+8))
- #37920 [Bugfix] Pass hf_token through config loading paths for gated model support — bug,ready — by javierdejesusda (merged: 2026-03-25 03:22 (UTC+8))
- #38015 [BugFix] fix VLLM_USE_STANDALONE_COMPILE=0 — bug,ready — by zou3519 (merged: 2026-03-25 03:08 (UTC+8))
- #38012 [BugFix] Fix order of compile logging — bug,ready — by zou3519 (merged: 2026-03-25 02:58 (UTC+8))
- #38019 [Model] Add Granite 4.0 1B speech to supported models — documentation,ready,multi-modality — by NickCao (merged: 2026-03-25 02:23 (UTC+8))
- #37706 [Bugfix] Fix structured output crash on CPU due to pin_memory=True — bug,structured-output,ready,v1 — by wjhrdy (merged: 2026-03-25 01:44 (UTC+8))
- #37998 docs: fix broken offline inference paths in documentation — documentation,ready — by vineetatiwari27 (merged: 2026-03-25 01:35 (UTC+8))
- #37923 [Bugfix] Force continuous usage stats when CLI override is enabled — bug,frontend,ready — by dsingal0 (merged: 2026-03-25 01:29 (UTC+8))
- #37964 [XPU] Support Intel XPU hardware information collection in usage stats — bug,ready — by 1643661061leo (merged: 2026-03-25 01:29 (UTC+8))
- #37904 [Mypy] Fix mypy for vllm/model_executor (except vllm/model_executor/layers) — ready — by hmellor (merged: 2026-03-25 01:14 (UTC+8))
- #37307 [Core] add option to schedule requests based on full ISL — performance,ready,v1,nvidia — by DanBlanaru (merged: 2026-03-25 01:01 (UTC+8))
- #37999 Update new contributor message — ready,ci/build — by hmellor (merged: 2026-03-25 00:01 (UTC+8))
- #37956 [Deprecate] Deprecate pooling multi task support. — documentation,frontend,ready — by noooop (merged: 2026-03-24 22:07 (UTC+8))
- #37987 [Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU — bug,ready,ci/build,v1,cpu — by bigPYJ1151 (merged: 2026-03-24 22:00 (UTC+8))
- #37911 [Bugfix] Suppress spurious CPU KV cache warning in launch render — bug,frontend,ready — by sagearc (merged: 2026-03-24 20:36 (UTC+8))
- #36271 [EPLB] Remove main waits in case of slow EPLB — ready — by ilmarkov (merged: 2026-03-24 19:50 (UTC+8))
- #37991 [Docs] Fix build — ready — by hmellor (merged: 2026-03-24 18:20 (UTC+8))
- #37957 Fix tool_parser_cls type annotation from Callable to type[ToolParser] — frontend,ready — by sfeng33 (merged: 2026-03-24 13:58 (UTC+8))
- #37874 [KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package — ready,v1 — by ronensc (merged: 2026-03-24 13:02 (UTC+8))
- #37899 [Frontend][Bugfix] Pass default_chat_template_kwargs to AnthropicServingMessages — bug,frontend,ready — by jetxa (merged: 2026-03-24 13:00 (UTC+8))
- #37783 [release] Move agent queue to Release cluster queues — ci/build — by khluu (merged: 2026-03-24 11:36 (UTC+8))
- #37913 Downsize CPU jobs to use small queue — ci/build — by khluu (merged: 2026-03-24 11:36 (UTC+8))
Closed but Unmerged PRs
- #28544 [CI] Add non-eager test-case for SharedStorageConnector — ready,stale,v1,kv-connector — by NickLucche (closed: 2026-03-25 10:17 (UTC+8))
- #28590 [Misc] allow pass in different load config for eagle draft model — speculative-decoding,ready,needs-rebase,stale,v1,llama — by 842974287 (closed: 2026-03-25 10:17 (UTC+8))
- #29206 [Perf] Early return in KVCacheManager.allocate_slots — ready,stale,v1 — by Jialin (closed: 2026-03-25 10:16 (UTC+8))
- #29236 Fix gpt oss tool parser v2 — frontend,stale,tool-calling,gpt-oss — by ShaikAbdulHafeez03 (closed: 2026-03-25 10:16 (UTC+8))
- #31075 [ROCm][CI/Build] Fix Dockerfile.rocm to set VLLM_TARGET_DEVICE=rocm — documentation,rocm,ci/build,stale,v1 — by westers (closed: 2026-03-25 10:16 (UTC+8))
- #37927 [ROCm][CI] Remove redundant common.txt from rocm-test.txt — rocm,ready,ci/build — by AndreasKaratzas (closed: 2026-03-25 09:28 (UTC+8))
- #37517 Adding DeepEP MoE Test Group. — needs-rebase,ci/build — by Alexei-V-Ivanov-AMD (closed: 2026-03-25 08:43 (UTC+8))
- #35855 [NVFP4][OCP MX] Support ahead of time weight dequantization for emulation backend for dense and MOE models — rocm,needs-rebase,nvidia — by fxmarty-amd (closed: 2026-03-25 06:30 (UTC+8))
- #38042 Clone support — performance,ci/build,v1,qwen,nvidia — by alexandred (closed: 2026-03-25 06:20 (UTC+8))
- #38034 [Perf] Optimize mean pool cumsum by adding a small id cache, 2.8% Throughput improvement — no labels — by yewentao256 (closed: 2026-03-25 05:28 (UTC+8))
- #37936 [BugFix][kv_offload] Reduce memory blocks allocated for CPU offload — bug,needs-rebase,v1 — by jonathanc-n (closed: 2026-03-24 23:55 (UTC+8))
- #37827 [Bugfix] Fix async load failures when multiple requests share same prefix — bug,v1 — by kahfizulkifli (closed: 2026-03-24 23:20 (UTC+8))
- #33105 [Feature] (grpc): add standard gRPC health checking protocol for Kubernetes native probes — rocm,frontend,needs-rebase,ci/build,v1,qwen,deepseek,nvidia — by V2arK (closed: 2026-03-24 23:10 (UTC+8))
- #38002 [CPU][Feat] Update CPU Backend to torch 2.11.0 — ci/build,cpu — by fadara01 (closed: 2026-03-24 20:11 (UTC+8))
- #37963 [Model] Add support for BharatGen’s Param2MoE model — documentation,new-model — by bhargav-patel-29 (closed: 2026-03-24 17:55 (UTC+8))
- #37838 [ROCm] Fix fused RMS norm quant test failures on gfx90a — rocm,ready — by AndreasKaratzas (closed: 2026-03-24 16:48 (UTC+8))
- #29957 [Perf][Async] Implement zero-bubble async speculative decoding — speculative-decoding,v1 — by izhuhaoran (closed: 2026-03-24 14:09 (UTC+8))