vLLM Development Activity Report - 2026-03-24
Time window: 2026-03-24 11:33 (UTC+8) ~ 2026-03-25 11:33 (UTC+8)
Statistics: 26 new issues | 24 closed issues | 79 new PRs | 36 merged PRs | 17 PRs closed without merging
📊 Daily Development Status Summary
During this window, the vLLM project kept up a high development tempo, opening and merging a large number of PRs (79 opened, 36 merged), reflecting sustained community contribution and feature iteration. Attention centered on real-world deployment problems with large models such as Qwen3.5 (e.g. pipeline parallelism, ROCm compatibility) and on performance tuning for specific hardware (e.g. AMD MI300, NVIDIA L40S). In parallel, a debate over whether to maintain compatibility support for older hardware (e.g. Ampere-generation A100) drew community attention.
🎯 AMD/ROCm Ecosystem Updates
AMD-ecosystem activity was brisk this cycle, concentrated on bug fixes and kernel optimization.
- Model bugs on the ROCm platform:
    - Issue #37996: A user reported that the Qwen3.5-397B GPTQ model outputs nothing but exclamation points on ROCm; another user confirmed a similar problem. This is a critical inference-correctness issue on ROCm and was auto-assigned to the ROCm maintainer team.
    - Issue #37992 & PR #37993: AMD engineer xuebwang-amd reported that running the Qwen3.5-MoE vision model on MI325X (gfx942) crashes during profile_run because flash_attn's Triton rotary kernel fails. He promptly submitted PR #37993, adding a fallback to the native rotary embedding implementation.
    - PR #37973: Contributor vllmellm submitted an optimization enabling FP8 Query for Triton Attention on ROCm, aimed at raising performance and showing the community's continued work on ROCm inference efficiency.
- GPT-OSS fixes on ROCm:
    - PR #37787 (merged): Fixed the GPT-OSS regressions on ROCm introduced by the refactor in PR #37128, covering CK MXFP4 backend selection, alignment checks, and passing of padding parameters. An important fix for keeping GPT-OSS stable on AMD hardware.
    - PR #38043: As a follow-up, further adjusts GPT-OSS's RMSNorm fusion and padding strategy on ROCm (e.g. changing MI300's padding from 128 to 256) and re-enables the related fusion optimizations.
- Infrastructure and integration:
    - PR #37980: Primarily about DeepGEMM integration, but by packaging DeepGEMM into the wheel via CMake FetchContent it simplifies deployment of DeepGEMM-dependent models such as DeepSeek-V3 on all platforms, ROCm included.
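The fallback pattern behind PR #37993 is a common one for platform-specific kernels: try the accelerated path, and degrade to a reference implementation rather than crash. A minimal plain-Python sketch of the idea (all function names here are hypothetical illustrations, not vLLM's actual API):

```python
import math

def native_rotary_embed(x, pos, theta=10000.0):
    """Pure-Python rotary embedding on one vector: rotate each
    consecutive pair (x[i], x[i+1]) by a position-dependent angle.
    A hypothetical stand-in for a framework's native implementation."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        angle = pos / (theta ** (i / d))
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def triton_rotary_embed(x, pos):
    # Stand-in for an accelerated Triton kernel; here it always fails,
    # mimicking the unsupported-platform error seen in Issue #37992.
    raise RuntimeError("triton rotary kernel unavailable on this platform")

def rotary_embed_with_fallback(x, pos):
    """Prefer the fast kernel; fall back to the native path on failure."""
    try:
        return triton_rotary_embed(x, pos)
    except (RuntimeError, ImportError):
        return native_rotary_embed(x, pos)
```

At position 0 the rotation angle is zero, so the fallback path returns the input unchanged, which makes the degradation easy to sanity-check.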
💬 High-Traffic Discussion Analysis
- Issue #38006: Should a TRITON_MLA_SPARSE backend be implemented for Ampere GPUs to support sparse-MLA models?
    - Core question: Users request a TRITON_MLA_SPARSE backend for sm80 (A100/A800) GPUs to support models using sparse MLA, such as GLM-5 and DeepSeek V3.2.
    - Positions:
        - Requester (ehfd): Argues this is strategically valuable; just as vLLM's earlier FP8 fallback for Ampere attracted users, implementing this would bring every Ampere user of sparse-MLA models to vLLM.
        - Maintainer (youkaichao): States plainly that this is not a priority for vLLM maintainers, considers running these models on A100 "difficult / of little value" even if implemented, and suggests keeping the current fail-fast behavior, with interested community developers maintaining their own fork.
    - Point of contention: The conflict between the project's official maintenance priorities (new hardware, best performance) and community users' desire to protect existing hardware investments.
    - Current status: The issue remains open; the maintainer position is clear, and progress depends on community contribution.
- Issue #37996: Qwen3.5 397B GPTQ produces garbage output on ROCm
    - Core question: A specific large-scale model produces completely wrong output (all exclamation points) on the AMD platform.
    - Discussion: The reporter provided a detailed reproduction environment; the bot auto-requested more information and tagged the ROCm maintainers. Another user reported the same problem, suggesting it is not an isolated case.
    - Current status: Open and unresolved; a high-priority bug on the ROCm platform.
- Issue #36613 (closed): Enabling MTP for Qwen3.5-397B causes CUDA illegal memory access under high concurrency
    - Core question: A large-scale MoE model hits a fatal CUDA error when multi-token prediction (MTP) is enabled under high concurrency.
    - Discussion: Multiple users confirmed the same or similar failures. Maintainers suggested testing a nightly build and provided a reproduction script. Reducing the number of speculative tokens (e.g. from 5 to 1) was found to avoid the crash, pointing to a concurrency-safety or memory-management defect in the MTP implementation.
    - Outcome: Closed after the reporter confirmed that adjusting a timeout environment variable resolved the symptom, but the underlying MTP stability problem under high concurrency may still deserve attention.
🔥 Hot Topics & Trends
- Qwen3.5 deployment challenges: Multiple issues (#37996, #37972, #37967, #38024) center on deploying Qwen3.5 (especially the large MoE variants), spanning pipeline parallelism, tokenizer compatibility, and failures under specific configurations, suggesting the series is widely used in the vLLM ecosystem but still in a settling-in period.
- Deepening AMD platform support: Beyond the ROCm-specific issues above, PR activity shows the community actively fixing performance regressions on AMD (#37787) and pushing optimizations (#37973); support is moving from "runs at all" toward "runs fast and stably".
- Fine-grained performance optimization: Micro-optimizations for specific hardware and model shapes are becoming a trend, e.g. tuning GPT-OSS's Marlin MoE policy for L40S/SM89 (#38054), a fused concat+quant FP8 kernel for DeepSeek V3.2 (#38028), and Mamba state-handling optimizations for CPU inference (#38047).
- API and feature expansion: Frontend capabilities keep growing, e.g. a gradient-computation API (#38008), structured output with batched messages (#38011), and gRPC health checking (#38016), improving vLLM's usability and enterprise readiness.
🛠️ Key Technical Changes
- PR #37987 (merged): Fix compute_slot_mapping crash on non-Triton platforms
    - Analysis: Fixes a key compatibility bug. An earlier refactor removed the numpy fallback implementation, crashing platforms without Triton support (e.g. ppc64le or pure-CPU environments). The fix provides an alternative implementation for those platforms, keeping vLLM runnable in heterogeneous computing environments.
    - Impact: Shores up vLLM's deployability on non-standard GPU architectures and CPUs.
- PR #37980 (in progress): Integrate DeepGEMM into the vLLM wheel
    - Analysis: Uses CMake's FetchContent mechanism to compile and package the DeepGEMM library, required for DeepSeek model inference, directly into vLLM's release wheel, eliminating the manual installation step.
    - Impact: Greatly simplifies deployment of models like DeepSeek-V3 and lowers the barrier to entry; an important user-experience infrastructure improvement.
- PR #37787 (merged): Fix GPT-OSS MXFP4 regressions on ROCm
    - Analysis: Fixes the performance regressions and crashes of the GPT-OSS model on AMD caused by a code refactor, carefully restoring the per-GFX-architecture backend selection logic, alignment checks, and required padding parameter passing.
    - Impact: Ensures GPT-OSS correctness and performance on recent AMD hardware (e.g. MI300); key maintenance work for cross-platform model support.
- PR #38011 (in progress): Add batched messages support to the OpenAI API
    - Analysis: Extends the /v1/chat/completions endpoint to accept a list of message lists, processing multiple independent conversations in a single request, and composes with features such as structured output.
    - Impact: Improves API efficiency and flexibility, especially for batch-processing many prompts and extracting structured data, reducing HTTP overhead.
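To make the batched-messages change concrete, here is a sketch of what a request body under PR #38011's proposal could look like. The schema is the PR's proposal rather than a released API, and the model name is a placeholder:

```python
import json

# Sketch of a batched /v1/chat/completions request body per PR #38011's
# proposal: "messages" holds a list of message lists, one per conversation.
payload = {
    "model": "Qwen/Qwen3.5-35B",  # placeholder model name
    "messages": [
        [{"role": "user", "content": "Summarize the release notes."}],
        [{"role": "user", "content": "Extract the dates as JSON."}],
    ],
    # Per the PR description, this composes with structured output.
    "response_format": {"type": "json_object"},
}

body = json.dumps(payload)  # would be POSTed to /v1/chat/completions
```

Each inner list is an independent conversation, so the server can handle all of them in one HTTP round trip instead of one request per prompt.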
📈 Development Activity Observations
- Contributor diversity: This cycle saw contributions from AMD engineers (xuebwang-amd) as well as optimizations targeting specific hardware (e.g. L40S), showing the ecosystem reaching more hardware vendors and niche scenarios.
- Response and fix speed: For Issue #37992 reported by xuebwang-amd, the associated fix PR #37993 was opened within 6 minutes of the issue's creation, demonstrating highly responsive core contributors or internal teams.
- Merge throughput: The 36 merged PRs include several important bug fixes (e.g. #37987, #38015) and feature improvements; review and merge pipelines are running smoothly.
💡 Issues Worth Watching
- Long-term Ampere support policy: The discussion sparked by Issue #38006 needs a clear answer from the community. Whether, and how, to support new model architectures (e.g. sparse MLA) on older hardware such as the A100 will shape technology choices for part of the user base.
- Stability of large Qwen3.5 MoE models: The cluster of problems (pipeline parallelism, garbage output on ROCm, tokenizer errors) shows that full support for these very large MoE models still needs polish; this is a key area for production-deployment stability.
- Multi-platform kernel maintenance burden: As PR #37987 and the various ROCm fixes show, maintaining operator implementations and selection policies across CPU, AMD, and NVIDIA backends keeps growing in complexity and cost; managing that complexity gracefully is a long-term challenge.
- High-concurrency robustness of MTP and similar features: Although Issue #36613 is closed, the robustness of advanced features such as multi-token prediction and speculative decoding under heavy load remains a focus for continued validation when pursuing peak performance.
📋 Appendix: Detailed Data Lists
New Issues
- #38024 Tokenizer error with Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated model - TokenizersBackend not found — no labels — by ArtemSultanov-PG (created: 2026-03-25 01:44 (UTC+8))
- #38006 [Feature]: Implement TRITON_MLA_SPARSE backend for sm80 support of Sparse MLA — feature request — by ehfd (created: 2026-03-24 21:01 (UTC+8))
- #38056 [Bug]: ImportError: flash_attn.ops.triton.rotary not found on older versions (< v2.1.2) — bug — by xiaoajie738 (created: 2026-03-25 10:21 (UTC+8))
- #38051 [Bug]: Possible warm start compile time issue for Deepseek V3.2 and Kimi K2.5 — bug,torch.compile — by zou3519 (created: 2026-03-25 09:40 (UTC+8))
- #37996 [Bug]: Qwen3.5 397B GPTQ model outputs all exclamation points on ROCM — bug,rocm — by hnhyzz (created: 2026-03-24 19:04 (UTC+8))
- #38041 V2 model runner crashes on Qwen3.5 mixed attention (linear + full) — no labels — by bhaktatejas922 (created: 2026-03-25 05:44 (UTC+8))
- #38033 [Bug]: CPUFusedMOE raises IndexError when running EP across CPU-only nodes — bug — by shunzhiwen (created: 2026-03-25 04:28 (UTC+8))
- #37976 [Feature]: Support Structured Output with Batched Prompts via OpenAI-Compatible API — feature request — by MatejRojec (created: 2026-03-24 15:57 (UTC+8))
- #38022 [Bug]: Marlin MoE kernel fails with MXFP4-quantized GPT-OSS 20B - Invalid thread config for non-aligned dimensions (K=2880, N=2880) — bug — by zweack (created: 2026-03-25 01:04 (UTC+8))
- #37983 [Bug]: compute_slot_mapping crashes on non-Triton platforms (ppc64le/CPU) after PR #32951 — bug — by Akashcodes732 (created: 2026-03-24 16:37 (UTC+8))
- #38013 [Bug]: Error while serving mistralai/Mistral-7B-v0.3 on CPU systems with Docker build — bug — by azhuvath (created: 2026-03-24 22:54 (UTC+8))
- #37995 [RFC]: Prefill Context Parallel for Qwen3.5 Hybrid Attention — feature request — by Yancey0623 (created: 2026-03-24 18:57 (UTC+8))
- #38005 [Bug]: EP support for Fused MoE LoRA is not implemented yet. — bug — by Nero10578 (created: 2026-03-24 20:55 (UTC+8))
- #38004 [Bug]: Speech-to-Text endpoint may return 501 but not documented in OpenAPI — bug — by Sebastian-dong (created: 2026-03-24 20:54 (UTC+8))
- #38003 [RFC]: Integrate MineDraft speculative decoding into vLLM — RFC — by Sebastian-dong (created: 2026-03-24 20:13 (UTC+8))
- #37992 [Bug]: RuntimeError triton error during profile_run with Qwen3.5-MoE vision encoder on ROCm — bug,rocm — by xuebwang-amd (created: 2026-03-24 18:24 (UTC+8))
- #37972 [Bug]: Pipeline Parallelism >=4 errors on runtime with Qwen3.5-FP8 — bug — by ehfd (created: 2026-03-24 15:23 (UTC+8))
- #37988 [Bug]: minimax-m2.5, reasoning token result in negative values — bug — by SparrowMu (created: 2026-03-24 17:26 (UTC+8))
- #37981 [Bug]: v0.18.0 fails to run MiniCPM-o-4.5 — bug — by devalun (created: 2026-03-24 16:33 (UTC+8))
- #37967 [Bug]: TypeError: transformers.tokenization_utils_tokenizers.TokenizersBackend._patch_mistral_regex() got multiple values for keyword argument ‘fix_mistral_regex’ — bug — by tech112233445566 (created: 2026-03-24 14:16 (UTC+8))
- #37979 [RFC]: Intel Quantization Support Roadmap (H1 2026) — RFC — by yiliu30 (created: 2026-03-24 16:14 (UTC+8))
- #37982 [Bug]: chart-helm does not support configuring shared memory (/dev/shm) — bug — by utsumi-fj (created: 2026-03-24 16:34 (UTC+8))
- #37977 [Bug][Model] Eagle2.5-VL applies ImageNet normalization instead of SigLIP2 — no labels — by edwingao28 (created: 2026-03-24 16:03 (UTC+8))
- #37974 [Bug]: Kimi-K2.5 on version 0.18.0 results in a KeyError when pipeline parallelism (PP) is greater than or equal to 2 — bug — by WangTuoxytt (created: 2026-03-24 15:53 (UTC+8))
- #37971 [Feature]: Enable simultaneous generate and embed endpoints in a single vLLM instance — feature request — by biba10 (created: 2026-03-24 15:14 (UTC+8))
- #37966 [Bug]: spec decoding nonparallel path has incompatible draft/target hidden sizes; suggest supporting differing hidden sizes — bug — by sunchendd (created: 2026-03-24 14:00 (UTC+8))
Closed Issues
- #25342 [Feature]: Allow increasing the flashinfer workspace buffer size — feature request,stale — by richardhundt (closed: 2026-03-25 10:17 (UTC+8))
- #25994 [Bug]: DeepSeek-V3.1 gives garbage output — bug,stale — by nicolexin (closed: 2026-03-25 10:17 (UTC+8))
- #27572 [Bug]: chat/completions stream intermittently returns null as finish_reason — bug,stale — by shuynh2017 (closed: 2026-03-25 10:17 (UTC+8))
- #27602 [Bug]: quantized medgemma-27b-text-it producing garbage outputs — bug,stale — by kritiyer (closed: 2026-03-25 10:17 (UTC+8))
- #28085 [Feature][UX]: vLLM Kernel Configuration — feature request,stale,startup-ux — by robertgshaw2-redhat (closed: 2026-03-25 10:17 (UTC+8))
- #28641 [Feature][P0]: Optimize Dockerfile Layer Ordering — feature request,ci/build,stale — by rzabarazesh (closed: 2026-03-25 10:17 (UTC+8))
- #28648 [Feature][P1]: Use Bind Mounts for Installation Scripts — feature request,ci/build,stale — by rzabarazesh (closed: 2026-03-25 10:17 (UTC+8))
- #29151 [Bug]: VLLM Sleep on NVIDIA H100 leading to model producing slow invalid results — bug,stale — by relyt0925 (closed: 2026-03-25 10:16 (UTC+8))
- #29248 [Bug]: Encoder disaggregation example endpoint timeout (Only when PD disaggregation enabled) — bug,stale — by chungen04 (closed: 2026-03-25 10:16 (UTC+8))
- #29269 [Feature]: Option to Disable Process/Thread Log Prefixing in vLLM — feature request,stale — by idodoron11 (closed: 2026-03-25 10:16 (UTC+8))
- #29277 [Usage]: Creating and accessing per request arguments inside vLLM model — usage,stale — by minlu21 (closed: 2026-03-25 10:16 (UTC+8))
- #29283 [Feature]: Update triton_kernels with upstream triton — feature request,stale — by zyongye (closed: 2026-03-25 10:16 (UTC+8))
- #29298 [Feature]: Prefill mode for PD using cpu+gpu hybrid engine — feature request,stale — by komitydev (closed: 2026-03-25 10:16 (UTC+8))
- #29350 [Bug]: Audio transcription duplicated words between chunks — bug,stale — by Quentinchampenois (closed: 2026-03-25 10:16 (UTC+8))
- #37837 Why is an assertion used here? — no labels — by WZKIIIII (closed: 2026-03-25 09:18 (UTC+8))
- #37608 [Bug]: AttributeError: ‘Parameter’ object has no attribute ‘load_qkv_weight’ — bug — by S1ro1 (closed: 2026-03-25 03:37 (UTC+8))
- #31894 [Bug] hf_token argument to LLM in Python SDK ignored in vllm.transformer_utils.config — bug — by benglewis (closed: 2026-03-25 03:22 (UTC+8))
- #37705 [Bug]: Structured output crashes on CPU with pin_memory=True in apply_grammar_bitmask() — no labels — by wjhrdy (closed: 2026-03-25 01:44 (UTC+8))
- #17817 [RFC]: Unification of frontend parser — structured-output,RFC,unstale,tool-calling — by aarnphm (closed: 2026-03-24 23:05 (UTC+8))
- #37983 [Bug]: compute_slot_mapping crashes on non-Triton platforms (ppc64le/CPU) after PR #32951 — bug — by Akashcodes732 (closed: 2026-03-24 23:01 (UTC+8))
- #38013 [Bug]: Error while serving mistralai/Mistral-7B-v0.3 on CPU systems with Docker build — bug — by azhuvath (closed: 2026-03-24 23:00 (UTC+8))
- #37972 [Bug]: Pipeline Parallelism >=4 errors on runtime with Qwen3.5-FP8 — bug — by ehfd (closed: 2026-03-24 18:25 (UTC+8))
- #37855 [Bug]: Qwen3-VL-Embedding-8B Image embedding failed — bug — by nuclearwu (closed: 2026-03-24 14:35 (UTC+8))
- #36613 [Bug]: CUDA IMA (Illegal Memory Access) crash when enabling MTP for Qwen3.5-397B-A17B under high concurrency — bug — by xiaochengyige (closed: 2026-03-24 11:41 (UTC+8))
New PRs
- #38062 Bump helion dependency from 0.3.2 to 0.3.3 — ci/build — by gmagogsfm (created: 2026-03-25 11:29 (UTC+8))
- #38060 [Bugfix] Handle ImportError for flash_attn in ApplyRotaryEmb class — bug — by xiaoajie738 (created: 2026-03-25 10:56 (UTC+8))
- #38009 [bug] Fix remaining START_DP_WAVE pause race in _handle_client_request — bug,ready,v1 — by junjzhang (created: 2026-03-24 21:35 (UTC+8))
- #37973 [Feature] Use Fp8 Query for Triton Attention on ROCm — rocm,v1 — by vllmellm (created: 2026-03-24 15:30 (UTC+8))
- #37968 [Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function — ready,v1,nvidia — by chaunceyjiang (created: 2026-03-24 14:23 (UTC+8))
- #37975 [Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 — qwen — by wxsIcey (created: 2026-03-24 15:56 (UTC+8))
- #38061 [MM][Perf][CG] Support ViT full CUDA graph for Qwen3-VL video inference — v1,qwen,nvidia — by shen-shanshan (created: 2026-03-25 10:59 (UTC+8))
- #38059 [KVConnector]: Add OffloadingPolicy abstraction for offloading connector — v1,kv-connector — by jonathanc-n (created: 2026-03-25 10:49 (UTC+8))
- #38050 [MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration — ready,nvidia — by zyongye (created: 2026-03-25 09:23 (UTC+8))
- #38058 [Misc] reject batched completions requests in toy proxy — v1,kv-connector — by zhenwei-intel (created: 2026-03-25 10:34 (UTC+8))
- #38010 [Model] Fix BitsAndBytes quantization for GLM-4.1V/4.6V-Flash vision encoder — bug — by yanghui1-arch (created: 2026-03-24 21:57 (UTC+8))
- #38049 [Model] Add torch.compile support for InternVL vision encoder — no labels — by tianrengao (created: 2026-03-25 09:13 (UTC+8))
- #38035 Better weight tying check for multimodal models — ready — by hmellor (created: 2026-03-25 05:08 (UTC+8))
- #38057 [CI/Docs] Improve aarch64/DGX Spark support for dev setup — documentation — by bbrowning (created: 2026-03-25 10:30 (UTC+8))
- #38055 [Bugfix][CPU][v1] Fix IndexError in CPUWorker for multi-node DP/EP on CPU-only nodes — bug,v1,cpu — by philhuan (created: 2026-03-25 10:18 (UTC+8))
- #38032 Compose online quantization with quantized reloading — no labels — by kylesayrs (created: 2026-03-25 04:24 (UTC+8))
- #38031 [Model Runner V2][Minor] Simplify PP logic — ready,v1,nvidia — by njhill (created: 2026-03-25 02:55 (UTC+8))
- #38054 [MoE][GPT-OSS] Add L40S/SM89 Marlin block-size policy — v1,gpt-oss,nvidia — by will-deines (created: 2026-03-25 10:18 (UTC+8))
- #37958 [Bugfix] Fix IndexError when accessing prev_tool_call_arr in OpenAIToolParser — bug,frontend,ready,tool-calling — by chaunceyjiang (created: 2026-03-24 11:39 (UTC+8))
- #38045 [Model Runner V2] Enable forcing a specific acceptance rate during rejection sampling — v1 — by TheEpicDolphin (created: 2026-03-25 06:40 (UTC+8))
- #38053 [BugFix] Fix TypeError in MiniCPM-O audio feature unpadding — bug — by Krishnachaitanyakc (created: 2026-03-25 10:11 (UTC+8))
- #38052 [Doc] Fix Python-only build 404 fallback guidance — documentation,nvidia — by George-ao (created: 2026-03-25 09:54 (UTC+8))
- #37997 [benchmark] Clarify single-image limitation in CustomMMDataset (docs + warning) — documentation,performance — by dw2761 (created: 2026-03-24 19:08 (UTC+8))
- #38046 [compile] Add some more startup tests for top models — ready — by zou3519 (created: 2026-03-25 07:32 (UTC+8))
- #37986 [Quantization][Autoround][XPU] Add W4A16 Support — ci/build — by yiliu30 (created: 2026-03-24 16:54 (UTC+8))
- #38048 [Refactor] Rename WAITING_FOR_FSM to WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR — structured-output,ready,v1 — by yewentao256 (created: 2026-03-25 08:03 (UTC+8))
- #38047 Remove GPU/CPU syncs in GDNAttentionMetadata.build during speculative decoding — v1 — by lgeiger (created: 2026-03-25 07:42 (UTC+8))
- #38044 [release] Move the rest of release jobs to release queue — ci/build — by khluu (created: 2026-03-25 06:32 (UTC+8))
- #38043 [ROCm]: gpt-oss fusion/padding fixes — rocm,ci/build,gpt-oss — by Rohan138 (created: 2026-03-25 06:23 (UTC+8))
- #38042 Clone support — performance,ci/build,v1,qwen,nvidia — by alexandred (created: 2026-03-25 06:20 (UTC+8))
- #38014 [CI] Add batch invariant test for b200 — ready,ci/build — by yewentao256 (created: 2026-03-24 22:55 (UTC+8))
- #38039 [Bug] Fix qwen 3.5 batch invariance — bug,frontend,ready,qwen — by yewentao256 (created: 2026-03-25 05:39 (UTC+8))
- #38040 [Draft] [Fix] Invariant Check for Auto-Inferred Budgets/Max Batch Size in ViT CUDA Graph Manager — v1,nvidia — by b-mu (created: 2026-03-25 05:39 (UTC+8))
- #38034 [Perf] Optimize mean pool cumsum by adding a small id cache, 2.8% Throughput improvement — no labels — by yewentao256 (created: 2026-03-25 05:03 (UTC+8))
- #38038 Dynamic SD scheduler decides — speculative-decoding,v1 — by LucasWilkinson (created: 2026-03-25 05:21 (UTC+8))
- #38037 Fix IndexError in streaming tool calls when max_tokens is hit — frontend — by joaquinhuigomez (created: 2026-03-25 05:19 (UTC+8))
- #38036 Add 501 response to STT endpoint OpenAPI spec — frontend — by joaquinhuigomez (created: 2026-03-25 05:18 (UTC+8))
- #37960 Fused rmsnorm fp8 quant — documentation,performance,new-model,frontend,speculative-decoding,ci/build,v1,deepseek,cpu,gpt-oss — by tianrengao (created: 2026-03-24 12:56 (UTC+8))
- #37970 [Kernel] Optimize SM120 CUTLASS blockwise FP8 GEMM — performance,ready,nvidia — by Nekofish-L (created: 2026-03-24 14:56 (UTC+8))
- #38030 [MRV2] Fix for DS v3.2 — ready,v1 — by WoosukKwon (created: 2026-03-25 02:37 (UTC+8))
- #38028 [Kernel] Add indexer_concat_quant_fp8 kernel for DeepSeek V3.2 — performance,ci/build,deepseek,nvidia — by xyang16 (created: 2026-03-25 02:33 (UTC+8))
- #38025 [Metrics] Add labeled token waiting metrics for precise load balancing — v1 — by yangligt2 (created: 2026-03-25 02:18 (UTC+8))
- #38029 [Tool Parser][1/3] Pass tools to ToolParser constructor — frontend,tool-calling — by sfeng33 (created: 2026-03-25 02:37 (UTC+8))
- #38016 [gRPC] Add standard gRPC health checking (grpc.health.v1) for Kubernetes native probes — frontend — by V2arK (created: 2026-03-24 23:14 (UTC+8))
- #38011 Add batched messages support to /v1/chat/completions — documentation,frontend — by MatejRojec (created: 2026-03-24 22:16 (UTC+8))
- #38015 [BugFix] fix VLLM_USE_STANDALONE_COMPILE=0 — bug,ready — by zou3519 (created: 2026-03-24 23:01 (UTC+8))
- #38012 [BugFix] Fix order of compile logging — bug,ready — by zou3519 (created: 2026-03-24 22:19 (UTC+8))
- #38026 Use Inductor for More Quant Fusion — performance — by eellison (created: 2026-03-25 02:27 (UTC+8))
- #38027 [Nixl][PD] Lease renewal TTL KV blocks on P — documentation,v1,kv-connector — by NickLucche (created: 2026-03-25 02:28 (UTC+8))
- #38019 [Model] Add Granite 4.0 1B speech to supported models — documentation,ready,multi-modality — by NickCao (created: 2026-03-24 23:48 (UTC+8))
- #38023 [Dev] CuTile gemm FP8 — performance,nvidia — by LironKesem (created: 2026-03-25 01:38 (UTC+8))
- #37998 docs: fix broken offline inference paths in documentation — documentation,ready — by vineetatiwari27 (created: 2026-03-24 19:12 (UTC+8))
- #37964 [XPU] Support Intel XPU hardware information collection in usage stats — bug,ready — by 1643661061leo (created: 2026-03-24 13:42 (UTC+8))
- #38021 lora: add EP support for FusedMoEWithLoRA — no labels — by mrlexcoder (created: 2026-03-25 01:04 (UTC+8))
- #38020 [Optimization] Fuse mamba_get_block_table_tensor in align mode — v1 — by Jialin (created: 2026-03-25 00:56 (UTC+8))
- #37999 Update new contributor message — ready,ci/build — by hmellor (created: 2026-03-24 19:49 (UTC+8))
- #38017 [Hybrid] Introduce MambaProcessContext to simplify function signatures — v1 — by fuscof-ibm (created: 2026-03-24 23:21 (UTC+8))
- #38018 [DO NOT MERGE] Check which MM processors don’t work with token inputs — ready,multi-modality — by DarkLight1337 (created: 2026-03-24 23:44 (UTC+8))
- #37980 [UX] Integrate DeepGEMM into vLLM wheel via CMake — documentation,ready,ci/build,v1,deepseek — by mgoin (created: 2026-03-24 16:32 (UTC+8))
- #37978 [Bugfix][Model] Fix Eagle2.5-VL using ImageNet normalization instead of SigLIP2 — bug,speculative-decoding — by edwingao28 (created: 2026-03-24 16:08 (UTC+8))
- #37987 [Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU — bug,ready,ci/build,v1,cpu — by bigPYJ1151 (created: 2026-03-24 16:57 (UTC+8))
- #38008 [Core][Frontend] Add gradient computation feature with /v1/gradients API endpoint — frontend,v1 — by gereblye (created: 2026-03-24 21:32 (UTC+8))
- #38007 [SpecDecode] Add shortcut in rejection sampler for greedy sampling — v1 — by zzaebok (created: 2026-03-24 21:09 (UTC+8))
- #37994 [Bugfix] Add minimax_m2 to eagle3 supported models list — bug — by xueliangyang-oeuler (created: 2026-03-24 18:49 (UTC+8))
- #38002 [CPU][Feat] Update CPU Backend to torch 2.11.0 — ci/build,cpu — by fadara01 (created: 2026-03-24 20:05 (UTC+8))
- #38001 Revert “[Test] E2E Nemotron-3-Super tests” (#36803) — ci/build — by zhewenl (created: 2026-03-24 20:04 (UTC+8))
- #37993 [ROCm] Fall back to native rotary embedding when flash_attn triton kernel fails — rocm — by xuebwang-amd (created: 2026-03-24 18:30 (UTC+8))
- #38000 [Model] Add support for BharatGen’s Param2MoE model — documentation,new-model — by bhargav-patel-29 (created: 2026-03-24 19:49 (UTC+8))
- #37991 [Docs] Fix build — ready — by hmellor (created: 2026-03-24 17:46 (UTC+8))
- #37990 [MoE refactor] refactor GPTQMarlinMoEMethod with MK — no labels — by jikunshang (created: 2026-03-24 17:36 (UTC+8))
- #37963 [Model] Add support for BharatGen’s Param2MoE model — documentation,new-model — by bhargav-patel-29 (created: 2026-03-24 13:35 (UTC+8))
- #37984 [Bugfix][Helm] Add configurable /dev/shm shared memory mount to chart-helm — bug,documentation — by utsumi-fj (created: 2026-03-24 16:38 (UTC+8))
- #37989 [OOT] Add OOT support for linear kernel. — no labels — by menogrey (created: 2026-03-24 17:31 (UTC+8))
- #37962 [bug-fix] GLM OCR Patch Merger context_dim — bug,ready — by JaredforReal (created: 2026-03-24 13:29 (UTC+8))
- #37985 [Mamba] Speculative Decoding Mamba — speculative-decoding,v1 — by Josephasafg (created: 2026-03-24 16:51 (UTC+8))
- #37969 [Bug] Serialize xgrammar compilation for parallel structured outputs — bug,structured-output,v1 — by robellliu-dev (created: 2026-03-24 14:25 (UTC+8))
- #37957 Fix tool_parser_cls type annotation from Callable to type[ToolParser] — frontend,ready — by sfeng33 (created: 2026-03-24 11:39 (UTC+8))
- #37961 [Frontend][Core] Add hierarchical cache salting for shared prefixes across tenants — frontend,v1,multi-modality — by vpfkfl753 (created: 2026-03-24 13:19 (UTC+8))
- #37959 [Bugfix] Fix Helm chart Deployment using hardcoded labels instead of chart.labels — bug,documentation — by simpx (created: 2026-03-24 12:49 (UTC+8))
Merged PRs
- #37914 [Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc — documentation,ready,nvidia — by b-mu (merged: 2026-03-25 10:53 (UTC+8))
- #38031 [Model Runner V2][Minor] Simplify PP logic — ready,v1,nvidia — by njhill (merged: 2026-03-25 04:57 (UTC+8))
- #20859 [Feature] limit thinking tokens (hard limit) — documentation,performance,new-model,structured-output,frontend,speculative-decoding,ready,ci/build,v1,multi-modality — by llsj14 (merged: 2026-03-25 00:53 (UTC+8))
- #37787 [Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 — bug,rocm,ready,ci/build,gpt-oss — by AndreasKaratzas (merged: 2026-03-25 08:17 (UTC+8))
- #37673 [Performance] Auto-enable prefetch on NFS with RAM guard — ready — by arpera (merged: 2026-03-25 08:31 (UTC+8))
- #37924 [ROCm][CI][PD] Add Hybrid SSM integration tests to CI — rocm,ready,ci/build,kv-connector — by AndreasKaratzas (merged: 2026-03-25 07:58 (UTC+8))
- #38044 [release] Move the rest of release jobs to release queue — ci/build — by khluu (merged: 2026-03-25 07:40 (UTC+8))
- #37485 [Perf] Disable inductor runtime asserts by default for serving perfor… — documentation,ready — by tianrengao (merged: 2026-03-25 07:37 (UTC+8))
- #37903 nano_nemotron_vl: suppress readonly torch.from_numpy() warning in image and video resize paths — ready — by netanel-haber (merged: 2026-03-25 07:25 (UTC+8))
- #37926 Make microbatch optimization (DBO) work with general models — ready,v1 — by 0xjunhao (merged: 2026-03-25 05:40 (UTC+8))
- #37728 Fix Mamba state corruption from referencing stale block table entries (#37728) — ready,v1,nvidia,meta-exported,fb-exported — by minosfuture (merged: 2026-03-25 01:30 (UTC+8))
- #37233 [UX] Add flashinfer-cubin as CUDA default dep — ready,ci/build,nvidia — by mgoin (merged: 2026-03-25 05:13 (UTC+8))
- #38030 [MRV2] Fix for DS v3.2 — ready,v1 — by WoosukKwon (merged: 2026-03-25 05:03 (UTC+8))
- #35386 Add Ubuntu 24.04 support for Docker builds — ready,ci/build — by aasgaonkar (merged: 2026-03-25 04:34 (UTC+8))
- #37692 [FlexAttention] allow custom mask mod — ready,v1 — by liangel-02 (merged: 2026-03-25 04:03 (UTC+8))
- #37920 [Bugfix] Pass hf_token through config loading paths for gated model support — bug,ready — by javierdejesusda (merged: 2026-03-25 03:22 (UTC+8))
- #38015 [BugFix] fix VLLM_USE_STANDALONE_COMPILE=0 — bug,ready — by zou3519 (merged: 2026-03-25 03:08 (UTC+8))
- #38012 [BugFix] Fix order of compile logging — bug,ready — by zou3519 (merged: 2026-03-25 02:58 (UTC+8))
- #38019 [Model] Add Granite 4.0 1B speech to supported models — documentation,ready,multi-modality — by NickCao (merged: 2026-03-25 02:23 (UTC+8))
- #37706 [Bugfix] Fix structured output crash on CPU due to pin_memory=True — bug,structured-output,ready,v1 — by wjhrdy (merged: 2026-03-25 01:44 (UTC+8))
- #37998 docs: fix broken offline inference paths in documentation — documentation,ready — by vineetatiwari27 (merged: 2026-03-25 01:35 (UTC+8))
- #37923 [Bugfix] Force continuous usage stats when CLI override is enabled — bug,frontend,ready — by dsingal0 (merged: 2026-03-25 01:29 (UTC+8))
- #37964 [XPU] Support Intel XPU hardware information collection in usage stats — bug,ready — by 1643661061leo (merged: 2026-03-25 01:29 (UTC+8))
- #37904 [Mypy] Fix mypy for vllm/model_executor (except vllm/model_executor/layers) — ready — by hmellor (merged: 2026-03-25 01:14 (UTC+8))
- #37307 [Core] add option to schedule requests based on full ISL — performance,ready,v1,nvidia — by DanBlanaru (merged: 2026-03-25 01:01 (UTC+8))
- #37999 Update new contributor message — ready,ci/build — by hmellor (merged: 2026-03-25 00:01 (UTC+8))
- #37956 [Deprecate] Deprecate pooling multi task support. — documentation,frontend,ready — by noooop (merged: 2026-03-24 22:07 (UTC+8))
- #37987 [Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU — bug,ready,ci/build,v1,cpu — by bigPYJ1151 (merged: 2026-03-24 22:00 (UTC+8))
- #37911 [Bugfix] Suppress spurious CPU KV cache warning in launch render — bug,frontend,ready — by sagearc (merged: 2026-03-24 20:36 (UTC+8))
- #36271 [EPLB] Remove main waits in case of slow EPLB — ready — by ilmarkov (merged: 2026-03-24 19:50 (UTC+8))
- #37991 [Docs] Fix build — ready — by hmellor (merged: 2026-03-24 18:20 (UTC+8))
- #37957 Fix tool_parser_cls type annotation from Callable to type[ToolParser] — frontend,ready — by sfeng33 (merged: 2026-03-24 13:58 (UTC+8))
- #37874 [KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package — ready,v1 — by ronensc (merged: 2026-03-24 13:02 (UTC+8))
- #37899 [Frontend][Bugfix] Pass default_chat_template_kwargs to AnthropicServingMessages — bug,frontend,ready — by jetxa (merged: 2026-03-24 13:00 (UTC+8))
- #37783 [release] Move agent queue to Release cluster queues — ci/build — by khluu (merged: 2026-03-24 11:36 (UTC+8))
- #37913 Downsize CPU jobs to use small queue — ci/build — by khluu (merged: 2026-03-24 11:36 (UTC+8))
Closed but Unmerged PRs
- #28544 [CI] Add non-eager test-case for SharedStorageConnector — ready,stale,v1,kv-connector — by NickLucche (closed: 2026-03-25 10:17 (UTC+8))
- #28590 [Misc] allow pass in different load config for eagle draft model — speculative-decoding,ready,needs-rebase,stale,v1,llama — by 842974287 (closed: 2026-03-25 10:17 (UTC+8))
- #29206 [Perf] Early return in KVCacheManager.allocate_slots — ready,stale,v1 — by Jialin (closed: 2026-03-25 10:16 (UTC+8))
- #29236 Fix gpt oss tool parser v2 — frontend,stale,tool-calling,gpt-oss — by ShaikAbdulHafeez03 (closed: 2026-03-25 10:16 (UTC+8))
- #31075 [ROCm][CI/Build] Fix Dockerfile.rocm to set VLLM_TARGET_DEVICE=rocm — documentation,rocm,ci/build,stale,v1 — by westers (closed: 2026-03-25 10:16 (UTC+8))
- #37927 [ROCm][CI] Remove redundant common.txt from rocm-test.txt — rocm,ready,ci/build — by AndreasKaratzas (closed: 2026-03-25 09:28 (UTC+8))
- #37517 Adding DeepEP MoE Test Group. — needs-rebase,ci/build — by Alexei-V-Ivanov-AMD (closed: 2026-03-25 08:43 (UTC+8))
- #35855 [NVFP4][OCP MX] Support ahead of time weight dequantization for emulation backend for dense and MOE models — rocm,needs-rebase,nvidia — by fxmarty-amd (closed: 2026-03-25 06:30 (UTC+8))
- #38042 Clone support — performance,ci/build,v1,qwen,nvidia — by alexandred (closed: 2026-03-25 06:20 (UTC+8))
- #38034 [Perf] Optimize mean pool cumsum by adding a small id cache, 2.8% Throughput improvement — no labels — by yewentao256 (closed: 2026-03-25 05:28 (UTC+8))
- #37936 [BugFix][kv_offload] Reduce memory blocks allocated for CPU offload — bug,needs-rebase,v1 — by jonathanc-n (closed: 2026-03-24 23:55 (UTC+8))
- #37827 [Bugfix] Fix async load failures when multiple requests share same prefix — bug,v1 — by kahfizulkifli (closed: 2026-03-24 23:20 (UTC+8))
- #33105 [Feature] (grpc): add standard gRPC health checking protocol for Kubernetes native probes — rocm,frontend,needs-rebase,ci/build,v1,qwen,deepseek,nvidia — by V2arK (closed: 2026-03-24 23:10 (UTC+8))
- #38002 [CPU][Feat] Update CPU Backend to torch 2.11.0 — ci/build,cpu — by fadara01 (closed: 2026-03-24 20:11 (UTC+8))
- #37963 [Model] Add support for BharatGen’s Param2MoE model — documentation,new-model — by bhargav-patel-29 (closed: 2026-03-24 17:55 (UTC+8))
- #37838 [ROCm] Fix fused RMS norm quant test failures on gfx90a — rocm,ready — by AndreasKaratzas (closed: 2026-03-24 16:48 (UTC+8))
- #29957 [Perf][Async] Implement zero-bubble async speculative decoding — speculative-decoding,v1 — by izhuhaoran (closed: 2026-03-24 14:09 (UTC+8))