vLLM Development Activity Report - 2026-03-04

Time window: 2026-03-04 11:24 (UTC+8) ~ 2026-03-05 11:24 (UTC+8)
Statistics: New Issues 22 | Closed Issues 44 | New PRs 113 | Merged PRs 59 | PRs Closed Without Merge 128

📊 Daily Development Status Summary

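During the window from March 4 to March 5, the vLLM project remained highly active: 113 PRs were opened and 59 were merged. Work concentrated on performance optimization (fused operators, CUDA Graph, compilation), support for new hardware (especially AMD CPUs and GPUs), and fixes for complex bugs introduced by new features (speculative decoding, multimodal processing). Community discussion was lively, particularly on topics touching the AMD ecosystem and deep optimization work such as the fusion work tracking matrix.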

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was high in this window, with notable progress on both CPU and GPU:

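AMD Zen CPU backend integration (PR #35970)
- Contributor: amd-lalithnc (AMD employee)
- Content: This is the first PR implementing RFC #35089. It introduces a dedicated platform (ZenCpuPlatform) for AMD EPYC (Zen) CPUs, routes GEMM operations through the zentorch library, supports weight pre-packing to reduce inference overhead, and adds a Docker build target (vllm-openai-amd).
- Impact: The PR is still open. If merged, it would give AMD server CPUs official, in-tree optimized support in vLLM, aimed at improving performance in CPU-only inference.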
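ROCm AITER and MoRI (Modular RDMA Interface) integration (PR #36034, #36033, #36043)
- Contributor: varun-sundar-rabindranath
- Content:
  - PR #36033/#36034: Resolve the incompatibility between MoRIPrepareAndFinalize (the MoRI A2A backend) and AiterExperts by declaring that AiterExperts supports quantized hidden states, so the two can work together.
  - PR #36043: Adds a script (install_python_libraries.sh) that installs the MoRI and AITER Python libraries on ROCm, which simplifies local development.
- Impact: Fills in the expert-parallel (Expert Parallel) software stack on AMD GPUs and makes MoE models easier to deploy, with more performance headroom, in the ROCm ecosystem.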
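AITER FlashAttention CUDA Graph crash fix (PR #36042)
- Contributor: iseeyuan
- Content: Fixes a bug in the ROCm AITER FlashAttention backend where an incorrect condition sent non-speculative-decoding workloads down the unified_attention path. That path does not support CUDA Graph capture, so capture failed.
- Impact: Keeps CUDA Graph inference on AMD GPUs stable and preserves its performance benefit.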
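CK MXFP4 MoE kernel dimension fallback (PR #35893, merged)
- Contributor: ChuanLi1101
- Content: Fixes the crash reported in Issue #35637, where AITER's CK MXFP4 MoE kernel failed under certain TP configurations (for example intermediate_size=1536 with TP=4) because the matrix dimensions did not satisfy the kernel's constraints. The fix adds dimension validation and automatically falls back to emulation mode or the Triton backend when the shapes are incompatible.
- Impact: Makes MXFP4-quantized MoE models more robust across parallel configurations. It is an important stability fix for AMD's low-precision quantization stack.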
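The dimension check described for PR #35893 can be illustrated with a minimal sketch. The alignment value (`ASSUMED_ALIGNMENT = 256`), the function names, and the backend labels are all hypothetical; this report does not state the actual constraint enforced by the CK kernel. The sketch only shows why a shape that works at one TP degree can fail at another: sharding intermediate_size=1536 across TP=4 leaves 384 columns per rank, which is not a multiple of the assumed alignment.

```python
# Hypothetical alignment requirement. The real CK MXFP4 kernel constraint
# is NOT given in the report; 256 is chosen purely for illustration.
ASSUMED_ALIGNMENT = 256


def per_rank_intermediate_size(intermediate_size: int, tp_size: int) -> int:
    """Width of the MoE intermediate dimension held by one tensor-parallel rank."""
    if intermediate_size % tp_size != 0:
        raise ValueError("intermediate_size must be divisible by tp_size")
    return intermediate_size // tp_size


def select_moe_backend(intermediate_size: int, tp_size: int,
                       alignment: int = ASSUMED_ALIGNMENT) -> str:
    """Validate the sharded shape up front and fall back instead of crashing."""
    shard = per_rank_intermediate_size(intermediate_size, tp_size)
    if shard % alignment == 0:
        return "ck_mxfp4"   # illustrative label for the fast kernel path
    return "fallback"       # illustrative label for emulation / Triton


for tp in (1, 2, 4):
    print(tp, per_rank_intermediate_size(1536, tp), select_moe_backend(1536, tp))
```

The design point is that validation happens before kernel launch, so an unsupported shape degrades to a slower path rather than aborting the server.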
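Central fusion work tracker (Issue #36066)
- Contributor: ProExpertProg
- Content: Creates a detailed matrix that tracks the support status of vLLM's major fusion passes (such as RMSNorm+Quant) across hardware targets (sm100, sm90, ROCm) and quantization configurations. AMD developer tjtanaa suggested in the comments adding ROCm AITER-specific fused operators (such as QK Norm + RoPE + Cache + Quant).
- Impact: Gives the community a clear overview of fusion optimization work and directly raises the visibility of AMD-specific optimizations and the discussion around integrating them.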

💬 Analysis of High-Traffic Discussions

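Issue #31845: [Bug]: [H200] DeepSeek V3.2 MTP > 1 run into error (FLASHMLA_SPARSE backend)
- Core topic: Limitations and workarounds when enabling multi-token prediction (MTP > 1) for sparse MLA models such as DeepSeek-V3.2.
- Viewpoints and progress:
  - Initial understanding: MatthewBonanni first believed that a limitation in the sparse MLA indexer kernel made MTP > 1 unsupported.
  - Deeper investigation: benchislett pointed out that the core problem is the fp8_paged_mqa_logits kernel, which supports only 0 or 1 speculative tokens, and analyzed how other inference engines (SGLang, TensorRT-LLM) work around it through batch expansion or by waiting for a new kernel.
  - Resolution: The community added batch expansion as an interim solution in PR #34552 and is waiting for DeepGEMM to merge a new kernel that supports MTP=3 upstream. The Issue was closed after performance validation, with follow-up work moved to a new optimization Issue (#35878).
- Points of contention: No substantive disagreement. The thread was mainly a technical discussion of the root cause and the best path to a fix.
- Current status: Closed. The interim solution is merged; the long-term fix depends on an upstream kernel update.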
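The general idea of batch expansion can be sketched as follows. This is an illustration of the technique under stated assumptions, not the implementation in PR #34552: the class and function names are hypothetical, and the sketch assumes each expanded row carries exactly one query token so that the kernel never sees more than zero speculative tokens per row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeRequest:
    request_id: str
    context_len: int      # tokens already in the KV cache before this step
    num_spec_tokens: int  # draft tokens to verify in this step (MTP depth)


@dataclass(frozen=True)
class ExpandedRow:
    request_id: str
    query_offset: int  # index of this query token within the request's step
    seq_len: int       # KV length visible to this query token (causal)


def expand_batch(requests: list[DecodeRequest]) -> list[ExpandedRow]:
    """Turn each multi-token verification step into single-token rows."""
    rows: list[ExpandedRow] = []
    for req in requests:
        num_query_tokens = 1 + req.num_spec_tokens
        for i in range(num_query_tokens):
            # Query token i attends to the cached context plus the first
            # i + 1 tokens of the current step, itself included.
            rows.append(ExpandedRow(req.request_id, i, req.context_len + i + 1))
    return rows


rows = expand_batch([DecodeRequest("a", 100, 3), DecodeRequest("b", 50, 0)])
for row in rows:
    print(row)
```

The trade-off is a larger effective batch (here 5 rows for 2 requests) in exchange for reusing a kernel that cannot handle several speculative tokens per row.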
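Issue #35637: [Bug]: mi355 minimax m2.1 arch mxfp4 rocm AITER TP4 error
- Core topic: On AMD MI355X GPUs, running the MXFP4-quantized MiniMax-M2.1 model with AITER crashes when Tensor Parallel (TP) is 4.
- Viewpoints and progress:
  - Reproduction: AMD developer hongxiayang confirmed and reproduced the problem.
  - Temporary workaround: Setting the environment variable VLLM_ROCM_USE_AITER_MOE=0 disables the AITER MoE kernels and bypasses the crash.
  - Root fix: ChuanLi1101 implemented a permanent fix in PR #35893, adding kernel dimension validation and an automatic fallback.
- Points of contention: None.
- Current status: Fixed by PR #35893 and closed.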
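For readers who still need the temporary workaround on an older build, the variable has to be present in the environment of the server process before it starts. A minimal sketch, assuming a launch through `subprocess`; the helper name is hypothetical and the model path in the commented example is a placeholder, not a value taken from the issue.

```python
import os


def aiter_moe_disabled_env(base_env=None):
    """Return a copy of the environment with the AITER MoE kernels disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_ROCM_USE_AITER_MOE"] = "0"
    return env


# Example launch (placeholder model path, TP=4 as in the issue):
# import subprocess
# subprocess.run(
#     ["vllm", "serve", "<model>", "--tensor-parallel-size", "4"],
#     env=aiter_moe_disabled_env(),
#     check=True,
# )
```

Returning a copy rather than mutating `os.environ` keeps the workaround scoped to the one process that needs it.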
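Issue #36066: [Feature]: Central fusion work tracker
- Core topic: Building and maintaining a support-status matrix for fusion optimizations across hardware platforms.
- Viewpoints and progress:
  - Initiator: ProExpertProg built a detailed matrix, but many ROCm entries are marked "unknown".
  - Community additions: AMD contributor tjtanaa offered to add ROCm AITER-specific fused operators, which shows that the AMD team is actively involved in integrating and documenting core performance features.
- Points of contention: None.
- Current status: In progress. The Issue has become the central page for tracking and coordinating cross-platform optimization work.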

🔥 Hot Topics and Trends

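Finer-grained performance optimization: Community attention is shifting from basic feature support to deep optimization. Fused operators (Issue #36066), CUDA Graph coverage of more modules (such as ViT, PR #35963), and compilation improvements (PR #36072, which explores overlapping compilation with loading) are the main hot spots.
Multimodal and complex model support: Optimization of the vision encoder (ViT, PR #35963) and integration of new multimodal models (such as FireRedASR2, PR #35727) show that vLLM keeps widening its range of applications.
Speculative decoding, maturing and still challenging: Related discussions (Issue #31845, #36031) are frequent and involve deep adaptation to specific model architectures (MLA) and low-level kernels, which suggests the feature has reached the stage of resolving complex edge cases.
AMD ecosystem work accelerating across the board: From CPU (Zen) to GPU (ROCm), from base operators (AITER) to the higher-level communication runtime (MoRI) and quantization formats (MXFP4), AMD-related contributions are systematic and span several layers of the stack.
Developer experience and infrastructure: Documentation updates (PR #36074), local installation scripts (PR #36043), and benchmark fixes (PR #35993) are "non-core" but important improvements, and their growing number reflects the project's attention to developer friendliness.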

🛠️ Key Technical Changes

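PR #35970: In-Tree AMD Zen CPU Backend via zentorch: This is an architectural addition. Rather than just pulling in a library, it creates a new hardware platform abstraction, which prepares the ground for further AMD CPU-specific optimizations (such as later fusion passes) and has strategic value for broadening vLLM's deployment scenarios.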
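PR #35963: [Feature] ViT Full CUDA Graph: Extends CUDA Graph from LLM decoding to the vision encoder. This matters for end-to-end latency in multimodal inference: using budgeted graph capture and a greedy bin-packing algorithm, the PR aims to reduce kernel launch overhead in the ViT portion.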
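Greedy bin packing in general can be sketched as below. This is a generic first-fit-decreasing packer, shown only to make the idea concrete; the report does not describe which greedy variant PR #35963 uses, and the function name, the notion of a per-graph token budget, and the example sizes are assumptions.

```python
def pack_greedy(sizes: list[int], budget: int) -> list[list[int]]:
    """First-fit decreasing: place each item in the first bin with room."""
    bins: list[list[int]] = []
    free: list[int] = []  # remaining capacity of each bin
    for size in sorted(sizes, reverse=True):
        if size > budget:
            raise ValueError(f"item of size {size} exceeds budget {budget}")
        for idx, remaining in enumerate(free):
            if size <= remaining:
                bins[idx].append(size)
                free[idx] -= size
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([size])
            free.append(budget - size)
    return bins


# Hypothetical per-image token counts packed into a 1000-token budget.
print(pack_greedy([300, 700, 500, 200, 300], budget=1000))
```

Packing variable-sized inputs into a small set of fixed budgets is what lets a limited number of pre-captured graphs cover many input shapes.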
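PR #35961: [Core] Add score mode with perplexity and KLD computation: Adds GPU-side computation of perplexity (PPL) and KL divergence (KLD) to the V1 engine. This supports model evaluation workflows that no longer need to transfer full-vocabulary logits back to the CPU, which is valuable for model quality assessment, data filtering, and similar uses.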
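As a reference for the two metrics, here is a plain-Python sketch of their standard definitions: perplexity is the exponential of the mean negative log-likelihood of the target tokens, and KLD is the mean per-position KL divergence between a reference and a test distribution. It runs on the CPU and is not the PR's GPU implementation; the function names are hypothetical.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def perplexity(step_logits: list[list[float]], target_ids: list[int]) -> float:
    """exp(mean negative log-likelihood of the target tokens)."""
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        nll -= log_softmax(logits)[target]
    return math.exp(nll / len(target_ids))


def mean_kld(ref_logits: list[list[float]],
             test_logits: list[list[float]]) -> float:
    """Mean over positions of KL(ref || test)."""
    total = 0.0
    for ref, test in zip(ref_logits, test_logits):
        log_p, log_q = log_softmax(ref), log_softmax(test)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(ref_logits)


# A uniform distribution over 4 tokens has perplexity equal to the vocab size.
print(perplexity([[0.0] * 4] * 3, [0, 1, 2]))
```

Both metrics reduce a full logits vector per position to a single scalar, which is why computing them next to the logits avoids moving vocabulary-sized tensors off the device.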
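PR #34883: [Core] Add All-to-All communication backend for DCP (merged): Adds an All-to-All communication backend for Decode Context Parallelism (DCP). Compared with the existing AllGather+ReduceScatter scheme, it reduces the NCCL calls per layer from 3 to 2, which should lower communication overhead in long-context decoding. It is a performance-oriented change at the communication layer.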

📈 Development Activity Observations

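Active contributors: AMD contributors (amd-lalithnc, ChuanLi1101, hongxiayang, tjtanaa, and others) were very active, with notable contributions to problem diagnosis, fixes, feature development, and the introduction of a new platform.
High PR throughput: 59 PRs were merged in a single day, which indicates strong review and merge capacity in the core team. Many PRs include detailed test plans and results, and the bar for code quality is high.
Effective Issue management: Twice as many Issues were closed (44) as opened (22). Many of the closed Issues carry the stale or ci-failure label, so the number mainly reflects cleanup of the historical backlog.
Collaboration pattern: In the high-traffic Issues, users, contributors, and maintainers interact efficiently. Problems are usually reproduced and localized quickly and then routed to a concrete fix PR.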

💡 Issues Worth Watching

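Issue #36031: Qwen 3.5 122B crashes with speculative decoding: A specific commit (28ef9ba) caused a large model to crash when speculative decoding is enabled. A fix PR (#36036) already exists, but the incident is a reminder that test coverage for complex feature combinations (large model plus speculative decoding) needs continued investment.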
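Issue #36010: Qwen3.5-27B batch inference is extremely slow: A user reports a large performance gap between models of similar size (Gemma 3 versus Qwen3.5). The cause may be a bottleneck in the specific model implementation, in guided decoding, or in the interaction with the scheduler, and needs further investigation.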
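PR #36072: [WIP][Proof of concept] Overlap model loading and torch.compile: A proof of concept that overlaps the slow model weight loading with torch.compile compilation to shorten startup. If it works out, it would noticeably improve the user experience, especially for services that restart often or deploy new models frequently. Its design idea (compiling against fake weights) is worth following.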
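The scheduling idea behind the overlap can be sketched with two simulated tasks. This is not the PR's code: both functions are stand-ins that only sleep, and the premise that compilation needs shapes but not real weight values is taken from the "fake weights" description above. With that premise, the two phases have no data dependency and can run concurrently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_weights(seconds: float) -> str:
    """Stand-in for reading checkpoint shards from disk (I/O bound)."""
    time.sleep(seconds)
    return "weights"


def compile_graph(seconds: float) -> str:
    """Stand-in for compiling against placeholder weights (shapes only)."""
    time.sleep(seconds)
    return "compiled_graph"


def start_sequential(load_s: float, compile_s: float) -> tuple[str, str]:
    return load_weights(load_s), compile_graph(compile_s)


def start_overlapped(load_s: float, compile_s: float) -> tuple[str, str]:
    # Neither task consumes the other's output, so they can run side by side.
    with ThreadPoolExecutor(max_workers=2) as pool:
        weights = pool.submit(load_weights, load_s)
        graph = pool.submit(compile_graph, compile_s)
        return weights.result(), graph.result()


def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


_, sequential_s = timed(start_sequential, 0.2, 0.2)
_, overlapped_s = timed(start_overlapped, 0.2, 0.2)
print(f"sequential {sequential_s:.2f}s, overlapped {overlapped_s:.2f}s")
```

In the overlapped case startup time approaches the longer of the two phases instead of their sum; a real implementation also has to swap the real weights in after compilation, which this sketch does not model.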

📋 Appendix: Detailed Data Lists

New Issues

#36075 [Bug]: Garbled rollouts with GLM5 if VLLM_USE_DEEP_GEMM is not set | bug | by S1ro1 (created: 2026-03-05 10:02 (UTC+8))
#35967 [Bug]: Value error, invalid literal for int() with base 10: '4.0' [type=value_error, input_value=ArgsKwargs((), {'model_co…transfer_config': None}), input_type=ArgsKwargs] | bug | by LuWei6896 (created: 2026-03-04 14:33 (UTC+8))
#36073 [Bug]: Prolonged Latency in Some Streaming Responses in Function Call Mode（MiniMax model） | bug | by delwen123 (created: 2026-03-05 09:37 (UTC+8))
#36077 [Bug]: FlashInfer JIT build fails when serving Qwen3.5-35B-A3B (nvcc path found but ninja build fails) | bug | by yufangbo (created: 2026-03-05 10:08 (UTC+8))
#36066 [Feature]: Central fusion work tracker | feature request | by ProExpertProg (created: 2026-03-05 08:24 (UTC+8))
#35989 [Bug]: | bug | by DonglongLiu (created: 2026-03-04 18:03 (UTC+8))
#36050 [Bug]: DeepSeek v3.2 FP8 Failure to start server | bug | by wzhao18 (created: 2026-03-05 06:17 (UTC+8))
#35998 Qwen3.5 Model Tokenizer Loading Failure | no labels | by Pg-artemsultanov (created: 2026-03-04 19:30 (UTC+8))
#36031 [Bug]: Commit 28ef9ba causing VLLM to crash when running Qwen 3.5 122B with Speculative Decoding enabled | bug | by mdierolf (created: 2026-03-05 02:15 (UTC+8))
#35950 [Bug]: ValueError: too many values to unpack (expected 2) | bug | by LuWei6896 (created: 2026-03-04 11:43 (UTC+8))
#36037 [Feature]: Supports Speculative Speculative Decoding | feature request | by celsowm (created: 2026-03-05 03:27 (UTC+8))
#35999 [New Model]: MetaCLIP-2 variants | new-model,multi-modality | by erenirmak (created: 2026-03-04 19:39 (UTC+8))
#36015 [Bug]: Realtime audio transcription (Voxtral) silently hangs after ~10 minutes due to unhandled TimeoutError in background task | bug | by sh1man (created: 2026-03-04 23:06 (UTC+8))
#36004 [Bug]: Failed to import Triton kernels. Please make sure your triton version is compatible. Error: cannot import name 'SparseMatrix' from 'triton_kernels.tensor' | bug | by edc3000 (created: 2026-03-04 20:24 (UTC+8))
#36010 [Bug]: Qwen/Qwen3.5-27B Batch Inference very slow / not working | bug | by NilsHellwig (created: 2026-03-04 21:41 (UTC+8))
#35962 [RFC]: Add PPL and KLD to VLLM | RFC | by phaelon74 (created: 2026-03-04 13:43 (UTC+8))
#36008 [Bug]: invoke_fused_moe_wna16_*_kernel calls get_moe_wna16_block_config with bad parameters | bug | by Maxwell-Lyu (created: 2026-03-04 20:59 (UTC+8))
#35992 [Doc]: Inconsistent hash notation in Prefix Caching "Time 5" diagram | documentation | by anony-mous-e (created: 2026-03-04 18:24 (UTC+8))
#35991 [Installation]: Making server with GPT-OSS-20B on vllm+openwebui rtx5080 16gb | installation | by Ed-test-s (created: 2026-03-04 18:18 (UTC+8))
#35987 [Performance]: Very slow GGUF quantized model | performance | by VAmblardPEReN (created: 2026-03-04 17:49 (UTC+8))
#35983 [Bug]: Graph Capturing reports negative memory consumption | bug | by jmkuebler (created: 2026-03-04 16:54 (UTC+8))
#35980 [Bug]: Why does deploying Qwen3-32B-AWQ via vllm:v0.10.1.1 result in different outputs for the same input? | bug | by zhushuo5 (created: 2026-03-04 16:13 (UTC+8))

Closed Issues

#18793 [New Model]: ByteDance-Seed/BAGEL-7B-MoT | new-model,stale | by XueSongTap (closed: 2026-03-05 10:17 (UTC+8))
#19623 [Feature]: Add support for multi-lora and single lora for classification tasks | feature request,stale | by power-puff-gg (closed: 2026-03-05 10:17 (UTC+8))
#23529 [Usage]: DP Support in Fully SPMD Execution for Offline Inference | usage,stale | by wjj19950828 (closed: 2026-03-05 10:17 (UTC+8))
#24805 [Feature]: More Frequently Updated Docker Images | feature request,stale | by JamesDConley (closed: 2026-03-05 10:17 (UTC+8))
#25670 [RFC]: Migration from buildkite to GHA | RFC,stale | by jeanschmidt (closed: 2026-03-05 10:17 (UTC+8))
#25842 [Usage]: Problem with concurrency in encoder-based embedder serving with V1 Engine | usage,stale | by gabinguo (closed: 2026-03-05 10:16 (UTC+8))
#26161 [Bug]: pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig | bug,stale | by lucasjinreal (closed: 2026-03-05 10:16 (UTC+8))
#26452 [Bug]: vLLM + LMCache Version Compatibility Issues Across Multiple Releases | bug,stale | by nrghosh (closed: 2026-03-05 10:16 (UTC+8))
#27825 [Bug]: Tokenize endpoint for Granite models returns malformed strings in token_strs for non-Latin characters | bug,stale | by kndtran (closed: 2026-03-05 10:16 (UTC+8))
#27982 [Usage]: How can I access or return hidden states (representations) after generation? | usage,stale | by hakbari14 (closed: 2026-03-05 10:16 (UTC+8))
#28013 [Feature]: CUDA12.6 prebuilt wheel for vllm v0.11 | feature request,stale | by Luciennnnnnn (closed: 2026-03-05 10:16 (UTC+8))
#28035 [Usage]: deepseek-ocr The output token count is too low and unstable. | usage,stale | by sixgod-666 (closed: 2026-03-05 10:16 (UTC+8))
#28038 [Bug]:During the vllm 0.10.1 v1 benchmark test, only about 100 out of 1000 requests could be processed, and then it would get stuck. | bug,stale | by Ysgg1 (closed: 2026-03-05 10:16 (UTC+8))
#28042 [Bug]: RotaryEmbedding forward_native cannot match as expected for QKNormRoPEFusionPass | bug,stale | by izhuhaoran (closed: 2026-03-05 10:16 (UTC+8))
#28061 [CI Failure]: ci-infra is broken at approximately noon Nov, 4, 2025 | stale,ci-failure | by Alexei-V-Ivanov-AMD (closed: 2026-03-05 10:16 (UTC+8))
#27225 [Tracking Issue]: Qwen3-next performance optimisations | performance | by vadiklyutiy (closed: 2026-03-05 08:48 (UTC+8))
#31845 [Bug]: [H200] DeepSeek V3.2 MTP > 1 run into error (FLASHMLA_SPARSE backend) | bug,speculative-decoding,deepseek | by jhaotingc (closed: 2026-03-05 07:49 (UTC+8))
#34034 [Bug]: vLLM-compile should not execute the decoder forward pass during compilation | bug,torch.compile | by zou3519 (closed: 2026-03-05 04:13 (UTC+8))
#33708 [Doc]: Update CPU image docs to use Docker Hub after 0.16.0 release | documentation,cpu | by nathan-weinberg (closed: 2026-03-05 02:31 (UTC+8))
#31766 [Docs] Feedback for /en/latest/contributing/profiling/ | documentation | by cyk2018 (closed: 2026-03-05 02:13 (UTC+8))
#34996 [Feature][CI]: test_functionalization.py should compare outputs | feature request | by Rohan138 (closed: 2026-03-05 02:01 (UTC+8))
#35784 [CI Failure]: mi355_1: V1 Test entrypoints | ci-failure | by AndreasKaratzas (closed: 2026-03-05 01:55 (UTC+8))
#29536 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 | ci-failure | by AndreasKaratzas (closed: 2026-03-05 01:51 (UTC+8))
#35235 [CI Failure]: mi355_1: Multi-Modal Models Test (Extended) 1 | ci-failure | by AndreasKaratzas (closed: 2026-03-05 01:51 (UTC+8))
#35128 [CI Failure]: mi355_1: Language Models Tests (Standard) | ci-failure | by AndreasKaratzas (closed: 2026-03-05 01:45 (UTC+8))
#29523 [CI Failure]: mi325_4: Distributed Tests (4 GPUs) | ci-failure | by AndreasKaratzas (closed: 2026-03-05 01:39 (UTC+8))
#34901 [Doc]: Speculators Docs Follow-ups | documentation | by kylesayrs (closed: 2026-03-05 01:23 (UTC+8))
#15498 [Doc]: Troubleshooting guide incorrect hardware script fails | documentation,unstale | by surajssd (closed: 2026-03-05 01:15 (UTC+8))
#34018 [RFC]: Helix (Context + Tensor) Parallelism for Efficient Long-Context Decoding | RFC | by sungsooha (closed: 2026-03-04 23:02 (UTC+8))
#15853 [Bug]: [V1] Testla T4 cannot work for V1 | usage,unstale | by maobaolong (closed: 2026-03-04 22:00 (UTC+8))
#35880 [Bug]: Qwen/Qwen3.5-27B-GPTQ-Int4 not working /tmp/tmpb6zywp4k/__triton_launcher.c:7:10: fatal error: Python.h: No such file or directory 7 #include | bug | by NilsHellwig (closed: 2026-03-04 21:02 (UTC+8))
#34094 [Bug]: assert num_cache_lines >= batch | bug | by vitalik (closed: 2026-03-04 18:56 (UTC+8))
#35637 [Bug]: mi355 minimax m2.1 arch mxfp4 rocm AITER TP4 error | bug,rocm | by functionstackx (closed: 2026-03-04 16:30 (UTC+8))
#35868 [Bug] Qwen3-ASR crashes without --enforce-eager: missing dynamic_arg_dims for MRoPE positions | no labels | by TheCodeWrangler (closed: 2026-03-04 16:29 (UTC+8))
#35547 [Bug]: Long weight loading results in server start failure | bug | by wzhao18 (closed: 2026-03-04 13:54 (UTC+8))
#35129 [CI Failure]: mi355_4: 2 Node Tests (4 GPUs in total) | ci-failure | by AndreasKaratzas (closed: 2026-03-04 13:38 (UTC+8))
#29520 [CI Failure]: mi325_1: Multi-Modal Models Test (Standard) | ci-failure | by AndreasKaratzas (closed: 2026-03-04 13:37 (UTC+8))
#35769 [CI Failure]: mi325_1: Quantized Models Test | ci-failure | by AndreasKaratzas (closed: 2026-03-04 13:37 (UTC+8))
#29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) | ci-failure | by AndreasKaratzas (closed: 2026-03-04 13:32 (UTC+8))
#31631 [CI Failure]: mi325_1: V1 Test others | ci-failure | by AndreasKaratzas (closed: 2026-03-04 13:29 (UTC+8))
#33812 [CI Failure]: mi325_4: LM Eval Large Models (H100) | ci-failure | by AndreasKaratzas (closed: 2026-03-04 13:29 (UTC+8))
#35836 [Bug]: Support tool_choice=none in Anthropic API | bug | by ZhongsJie (closed: 2026-03-04 13:24 (UTC+8))
#35233 [CI Failure]: mi355_1: Language Models Test (Extended Generation) | ci-failure | by AndreasKaratzas (closed: 2026-03-04 13:04 (UTC+8))
#29462 [CI Failure]: mi325_8: Language Models Tests (Hybrid) %N | ci-failure | by AndreasKaratzas (closed: 2026-03-04 13:04 (UTC+8))

New PRs

#35968 [WIP] [Performance] DeepSeek V3.2 multi-stream indexer overlap | deepseek | by haosdent (created: 2026-03-04 14:35 (UTC+8))
#35997 [BugFix][LoRA] Rename vocal_parallel_embedding to vocab_parallel_embedding with compatibility shim | bug | by ShaneTao (created: 2026-03-04 19:24 (UTC+8))
#36079 Yejin/bench sleep wake timeout | performance,frontend,tpu,needs-rebase,v1,cpu | by YJYJLee (created: 2026-03-05 10:59 (UTC+8))
#36081 Perf: Optimize DeepEP prepare/finalize for identity mapping | no labels | by xueliangyang-oeuler (created: 2026-03-05 11:11 (UTC+8))
#35954 [torch.compile] Add AsyncTP fusion on Blackwell using FlashInfer bmm_fp8 and symm_mem impl | bug,torch.compile | by AjAnubolu (created: 2026-03-04 12:57 (UTC+8))
#35994 [BUGFIX]Fix Qwen2.5-Omni model audio max_token_per_item estimation error leading to encoder_cache_size is 0 | bug,qwen | by jjmiao1 (created: 2026-03-04 18:51 (UTC+8))
#36080 [PluggableLayer][4/N] Apply PluggableLayer to remaining layers | no labels | by whx-sjtu (created: 2026-03-05 11:09 (UTC+8))
#35973 [Doc] Add Parallel Draft Models | documentation | by zihaoanllm (created: 2026-03-04 15:38 (UTC+8))
#36024 [Misc] Lazy import registered processors | deepseek | by Isotr0py (created: 2026-03-05 01:24 (UTC+8))
#36069 fix(lora): bounds-check lora_a/lora_b in MergedColumnParallelLinear.set_lora | no labels | by JackYoung27 (created: 2026-03-05 08:43 (UTC+8))
#36045 [Benchmark] Add iteration benchmark with server-side step stats and t… | performance,frontend,tpu,needs-rebase,v1,cpu | by YJYJLee (created: 2026-03-05 04:26 (UTC+8))
#36027 Rename compile_ranges_split_points to compile_ranges_endpoints | ready | by copilot-swe-agent (created: 2026-03-05 01:45 (UTC+8))
#35982 fix qwen3 streaming reasoning parser when thinking is disabled | qwen | by LiuLi1998 (created: 2026-03-04 16:45 (UTC+8))
#36078 Enable ModelRunnerV2 on XPU | v1 | by xinyu-intel (created: 2026-03-05 10:11 (UTC+8))
#36049 [CI] Fix pre-commit mypy issue in main | ready | by yewentao256 (created: 2026-03-05 05:37 (UTC+8))
#36062 [Kernel] [Helion] [11/N] Retune configs for silu_mul_fp8 | ready | by gmagogsfm (created: 2026-03-05 07:51 (UTC+8))
#35985 [BUGFIX]fix cuda memory stat by reserved memory | bug,v1,nvidia | by flutist (created: 2026-03-04 17:29 (UTC+8))
#36036 [Bugfix] Fix block_size for hybrid model MTP | bug,speculative-decoding,ready,v1 | by benchislett (created: 2026-03-05 03:03 (UTC+8))
#36072 [WIP][Proof of concept] Overlap model loading and torch.compile | frontend,v1 | by zou3519 (created: 2026-03-05 09:35 (UTC+8))
#36076 Revert "[Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache" (#30681) | documentation,performance,structured-output,v1,nvidia | by zhewenl (created: 2026-03-05 10:05 (UTC+8))
#36018 [Feature] Add response_prefix parameter to audio transcription/translation endpoints | frontend,qwen | by TheCodeWrangler (created: 2026-03-04 23:48 (UTC+8))
#36070 [Bugfix][DCP] Fix CUDA graph capture for Decode Context Parallelism | bug,v1,nvidia | by sungsooha (created: 2026-03-05 09:01 (UTC+8))
#36074 [Docs] Add doc note about building for free-threaded Python. | documentation | by nascheme (created: 2026-03-05 09:45 (UTC+8))
#35984 [XPU] bump vllm-xpu-kernels to v0.1.3 | ready,ci/build | by jikunshang (created: 2026-03-04 17:06 (UTC+8))
#36071 [CI] Fix lint: mypy union-attr errors in hermes_tool_parser | no labels | by gmagogsfm (created: 2026-03-05 09:03 (UTC+8))
#36042 Fix CUDA graph decode capture crash in AITER FlashAttention | rocm,v1,nvidia,meta-exported,fb-exported | by iseeyuan (created: 2026-03-05 04:01 (UTC+8))
#36044 [compile] Reduce log spam from compile. | ready | by zhxchen17 (created: 2026-03-05 04:18 (UTC+8))
#36047 [Feature] Add --distributed-timeout-seconds CLI option | v1 | by 842974287 (created: 2026-03-05 05:10 (UTC+8))
#36065 [Bugfix] Fix mypy errors in hermes_tool_parser | bug | by AjAnubolu (created: 2026-03-05 08:10 (UTC+8))
#36067 set VLLM_USE_BYTECODE_HOOK to 0 by default | no labels | by laithsakka (created: 2026-03-05 08:26 (UTC+8))
#36068 [Bugfix] Allow inherited_fds to be None to fix warnings when using spawn | bug,v1 | by tjohnson31415 (created: 2026-03-05 08:30 (UTC+8))
#35948 Add granite4 tool parser | documentation,ready,tool-calling | by maxdebayser (created: 2026-03-04 11:33 (UTC+8))
#36001 [Bugfix] Fix message queue initialization order for cross-node DP | bug,v1 | by jianzs (created: 2026-03-04 19:50 (UTC+8))
#36064 test Qwen/Qwen3-4B-Instruct-2507 for unbacked | qwen | by laithsakka (created: 2026-03-05 08:00 (UTC+8))
#35969 [Bugfix] Safe CG padding for FlashMLA FP8 seq_lens and block_table | bug,v1 | by Kevin-XiongC (created: 2026-03-04 14:38 (UTC+8))
#36063 [Refactor] Consolidate SupportsEagle | v1,llama,qwen,gpt-oss,kv-connector | by benchislett (created: 2026-03-05 07:54 (UTC+8))
#36061 [Bugfix] Fix DP/EP Shared Expert With Monolithic Kernels | bug | by robertgshaw2-redhat (created: 2026-03-05 07:51 (UTC+8))
#36060 fix: force prefill path for MTP drafting on SM121 (GB10 Spark) | v1,nvidia | by scottgl9 (created: 2026-03-05 07:34 (UTC+8))
#36054 [Bugfix] Fix tokenize endpoint malformed token_strs | bug,frontend | by AjAnubolu (created: 2026-03-05 06:41 (UTC+8))
#36057 Adding deterministic lora benchmarking to vLLM Bench | performance | by RonaldBXu (created: 2026-03-05 06:58 (UTC+8))
#36041 [Model Runner V2] Add initial CI tests | ready,ci/build | by njhill (created: 2026-03-05 04:01 (UTC+8))
#36058 [2/n] Migrate per_token_group_quant to torch stable ABI | ci/build,nvidia | by mikaylagawarecki (created: 2026-03-05 07:06 (UTC+8))
#36059 [BugFix] Fallback from FA4->FA2 for Batch Invariance | bug,v1 | by frankwang28 (created: 2026-03-05 07:08 (UTC+8))
#35959 [WIP][MRV2] Extensible CG dispatch rework | v1,nvidia | by LucasWilkinson (created: 2026-03-04 13:36 (UTC+8))
#36022 [Kernel] Add FlashInfer MoE A2A Kernel | documentation,performance,new-model,rocm,structured-output,frontend,ci/build,v1,multi-modality,qwen | by leo-cf-tian (created: 2026-03-05 01:01 (UTC+8))
#36053 [Bugfix] Fix safetensors metadata OOM for large model headers | bug | by AjAnubolu (created: 2026-03-05 06:40 (UTC+8))
#36056 [Bugfix] Fix Deepseekv32 tool parser when stream interval > 1 | bug,deepseek | by sfeng33 (created: 2026-03-05 06:48 (UTC+8))
#36055 [Bugfix] Fix zombie EngineCore processes after parent exit | bug,v1 | by AjAnubolu (created: 2026-03-05 06:46 (UTC+8))
#36052 [Bugfix] Fix async OffloadingConnector silently losing decode-phase blocks | bug,kv-connector | by ZhanqiuHu (created: 2026-03-05 06:38 (UTC+8))
#36051 [NIXL][Bugfix] metrics & testing minor bug | bug,v1,kv-connector | by andylolu2 (created: 2026-03-05 06:34 (UTC+8))
#36025 [ROCm][CI] Prep Tests For Change To ROCM_ATTN As New Default Backend On ROCm | rocm,ci/build,v1,multi-modality | by micah-wil (created: 2026-03-05 01:34 (UTC+8))
#35981 [Misc] Support OOT linear method registering | ready | by shen-shanshan (created: 2026-03-04 16:24 (UTC+8))
#36017 [Bugfix] Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 | bug,ready,nvidia | by amitz-nv (created: 2026-03-04 23:41 (UTC+8))
#36013 [Bugfix] Fix race in non-blocking num_accepted_tokens GPU->CPU copy | bug,ready,v1 | by tdoublep (created: 2026-03-04 22:59 (UTC+8))
#36016 Add GemmaRMSNorm + FP8 quantization fusion support | no labels | by jackzhxng (created: 2026-03-04 23:37 (UTC+8))
#36028 [torch.compile] Use Inductor Process Pool in Compilation | v1 | by eellison (created: 2026-03-05 01:56 (UTC+8))
#36048 Add unit tests on cutlass moe | nvidia | by guanxingithub (created: 2026-03-05 05:19 (UTC+8))
#36043 [ROCM][MORI] Add MoRI & Aiter installation script | rocm | by varun-sundar-rabindranath (created: 2026-03-05 04:13 (UTC+8))
#36034 [RoCM] Enable MoRI a2a + AiterExperts | rocm | by varun-sundar-rabindranath (created: 2026-03-05 02:49 (UTC+8))
#35961 [Core] Add score mode with perplexity and KLD computation | documentation,v1 | by phaelon74 (created: 2026-03-04 13:38 (UTC+8))
#36046 [Frontend] Add SubscribeKvEvents KV event streaming bridge | documentation,frontend | by smfirmin (created: 2026-03-05 04:32 (UTC+8))
#35949 [MoE Refactor] Move the shared/fused expert output sum into MoERunnerBase | needs-rebase,llama,qwen,deepseek,gpt-oss,nvidia | by bnellnm (created: 2026-03-04 11:40 (UTC+8))
#36011 [Bugfix] Fix Harmony streaming cross-channel delta accumulation | bug,frontend,gpt-oss | by will-deines (created: 2026-03-04 21:53 (UTC+8))
#36040 [CI] Add NvFP4 Blackwell Tests | ci/build | by ojhaanshika (created: 2026-03-05 03:46 (UTC+8))
#36019 [Model Runner V2] Fix pooling | v1 | by njhill (created: 2026-03-05 00:06 (UTC+8))
#35947 fix: Software E2M1 conversion for SM12x NVFP4 activation quantization | nvidia | by blake-snc (created: 2026-03-04 11:29 (UTC+8))
#36039 Add unit tests on flashinfer trtllm moe | nvidia | by guanxingithub (created: 2026-03-05 03:46 (UTC+8))
#36038 [compile][graph_partition]Add tensor size handling | no labels | by fxdawnn (created: 2026-03-05 03:43 (UTC+8))
#36035 Fix random dataset parameter defaults silently overriding user input | performance | by khairulkabir1661 (created: 2026-03-05 02:54 (UTC+8))
#35963 [Feature] ViT Full CUDA Graph | v1,multi-modality,qwen,nvidia | by b-mu (created: 2026-03-04 13:56 (UTC+8))
#36000 [Doc] Fix GPU Worker count in Process Count Summary | documentation,ready | by simone-dotolo (created: 2026-03-04 19:45 (UTC+8))
#36023 Perf: Optimize CPU detokinzation for long decodes | v1 | by amirkl94 (created: 2026-03-05 01:09 (UTC+8))
#36033 [RoCM] Enable MoRI a2a + AiterExperts | rocm | by varun-sundar-rabindranath (created: 2026-03-05 02:45 (UTC+8))
#36032 qwen3coder tool parser fix anyOf double encoded parameters | qwen | by cmunley1 (created: 2026-03-05 02:39 (UTC+8))
#35952 refactor(metrics): consolidate histogram bucket definitions into buck… | v1 | by npache (created: 2026-03-04 11:55 (UTC+8))
#36029 [SpecDecode][Benchmark] Add SPEED-bench support to benchmarking CLI | documentation,performance | by talorabr (created: 2026-03-05 02:02 (UTC+8))
#36030 [BugFix] fix max memory usage | bug,v1 | by peakcrosser7 (created: 2026-03-05 02:14 (UTC+8))
#36026 [Bugfix] Fix wrong num_experts in invoke_fused_moe_wna16 kernels | bug | by OiPunk (created: 2026-03-05 01:39 (UTC+8))
#35964 Fix phi4-mm and remove cuda binding | ready,nvidia | by yma11 (created: 2026-03-04 13:57 (UTC+8))
#36021 bump flashinfer v0.6.4 -> v0.6.5 | ci/build,nvidia | by netanel-haber (created: 2026-03-05 00:19 (UTC+8))
#36020 [Misc] Fix SyntaxWarning - invalid escape sequence '\e' | frontend,ready | by cjackal (created: 2026-03-05 00:15 (UTC+8))
#36006 [Misc] Remove deprecated items that are due for removal | frontend,ready,multi-modality | by hickeyma (created: 2026-03-04 20:27 (UTC+8))
#36014 fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests | kv-connector | by machov (created: 2026-03-04 23:03 (UTC+8))
#35986 Add support for ModelOpt MXFP8 MoE models | no labels | by danisereb (created: 2026-03-04 17:36 (UTC+8))
#36012 [Performance] Add prefetch for checkpoints to OS page cache | no labels | by arpera (created: 2026-03-04 22:32 (UTC+8))
#35996 [Bugfix] Make kaldi_native_fbank optional | bug,ready,ci/build | by DarkLight1337 (created: 2026-03-04 19:09 (UTC+8))
#35970 In-Tree AMD Zen CPU Backend via zentorch [1/N] | rocm,ci/build,cpu | by amd-lalithnc (created: 2026-03-04 14:46 (UTC+8))
#36003 [Bugfix] Fix DeepSeek V3.2 tool parser | bug,deepseek | by sergeyrid (created: 2026-03-04 20:22 (UTC+8))
#36009 [FA/Chore] Bump FA version for FP8 two-level accumulation every n steps | ci/build | by PatrykSaffer (created: 2026-03-04 21:38 (UTC+8))
#36007 fix(compilation): optimize kv cache update for faster cold start compilation | documentation | by fourierr (created: 2026-03-04 20:49 (UTC+8))
#36002 [Core] Add sharding metadata to model parameters | no labels | by PrasanaaV (created: 2026-03-04 20:13 (UTC+8))
#36005 fix(xpu): handle mem_get_info failure on XPU simulator (testing only) | no labels | by lukaszszady (created: 2026-03-04 20:25 (UTC+8))
#35976 [Bugfix] Fix RunAI streamer crash with S3-hosted model paths | bug,frontend,v1 | by AjAnubolu (created: 2026-03-04 16:10 (UTC+8))
#35995 Fix: support auto_tune to run local model | performance | by panpan0000 (created: 2026-03-04 19:04 (UTC+8))
#35990 [Core] Add contiguous block allocation mode for KV cache | v1 | by xiaguan (created: 2026-03-04 18:05 (UTC+8))
#35993 [Bugfix] Fix the batch invariance benchmark script | bug,performance | by jzakrzew (created: 2026-03-04 18:27 (UTC+8))
#35972 [Bugfix] Guard non-2D weights in dispatch_cpu_unquantized_gemm | bug | by tanmayc25 (created: 2026-03-04 15:34 (UTC+8))
#35974 [Bugfix] Qwen3.5-397B-A17B model loading with transformer=5.2 | bug,qwen | by dolphin8 (created: 2026-03-04 15:54 (UTC+8))
#35988 Add support to pass custom presence_penalty and frequency_penalty parameters in the generation config | frontend | by mobicham (created: 2026-03-04 17:54 (UTC+8))
#35971 refine vllm bench throughput --backend hf | performance | by jikunshang (created: 2026-03-04 15:11 (UTC+8))
#35977 [Bugfix] Fix DP port conflict race condition with late binding | bug,v1 | by AjAnubolu (created: 2026-03-04 16:10 (UTC+8))
#35978 [Bugfix] Fix suffix decoding crash from negative num_new_tokens | bug,v1 | by AjAnubolu (created: 2026-03-04 16:10 (UTC+8))
#35975 [Core] Skip inputs_embeds buffer allocation for text-only models | speculative-decoding,v1 | by AjAnubolu (created: 2026-03-04 16:10 (UTC+8))
#35979 [Bugfix] Fix MXFP4A16 degenerate output via compressed-tensors bump | bug,ci/build | by AjAnubolu (created: 2026-03-04 16:10 (UTC+8))
#35957 fix online fp8 for MiniCPM models | no labels | by yma11 (created: 2026-03-04 13:27 (UTC+8))
#35966 [feat] Kimi K2/DeepSeek Support eagle3 | speculative-decoding,v1,deepseek | by leihuang-sketch (created: 2026-03-04 14:26 (UTC+8))
#35965 [Core] Add KV transfer support to sparse attention indexer | no labels | by wz1qqx (created: 2026-03-04 14:16 (UTC+8))
#35958 Perf: Enable double-buffered chunked EP communication in DeepEP | no labels | by xueliangyang-oeuler (created: 2026-03-04 13:35 (UTC+8))
#35955 Perf: Implement fused sort/scan for MoE block alignment using Triton | no labels | by xueliangyang-oeuler (created: 2026-03-04 13:10 (UTC+8))
#35960 [Bugfix] Add regression test for allreduce RMS fusion with PP | bug | by robellliu-dev (created: 2026-03-04 13:37 (UTC+8))
#35956 [Bugfix] Narrow kv_cache mempool context to prevent sleep mode regressions | bug,v1 | by AjAnubolu (created: 2026-03-04 13:24 (UTC+8))
#35953 [DO NOT MERGE][Perf] Avoid tokenizing dummy text when running MM processor | ready,multi-modality | by DarkLight1337 (created: 2026-03-04 12:29 (UTC+8))
#35951 Perf: Optimize regex patterns in MiniMaxM2ToolParser | no labels | by xueliangyang-oeuler (created: 2026-03-04 11:54 (UTC+8))

Merged PRs

#29856 [Model] Add LoRA support for Whisper models | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 | by daje0601 (merged: 2026-03-05 10:38 (UTC+8))
#36049 [CI] Fix pre-commit mypy issue in main | ready | by yewentao256 (merged: 2026-03-05 10:13 (UTC+8))
#35984 [XPU] bump vllm-xpu-kernels to v0.1.3 | ready,ci/build | by jikunshang (merged: 2026-03-04 18:23 (UTC+8))
#34760 Add platform method to enable custom collective ops registration | rocm,ready,nvidia,meta-exported,fb-exported | by nkm-meta (merged: 2026-03-05 08:50 (UTC+8))
#36044 [compile] Reduce log spam from compile. | ready | by zhxchen17 (merged: 2026-03-05 08:48 (UTC+8))
#35941 [Model Runner V2] Misc code simplification | v1 | by njhill (merged: 2026-03-05 07:26 (UTC+8))
#35239 [ROCm][CI] Added MI325 mirrors (stage C) | rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-03-05 06:48 (UTC+8))
#35981 [Misc] Support OOT linear method registering | ready | by shen-shanshan (merged: 2026-03-05 06:25 (UTC+8))
#36017 [Bugfix] Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 | bug,ready,nvidia | by amitz-nv (merged: 2026-03-05 06:23 (UTC+8))
#35928 [RL] [Weight Sync] Guard IPC update-info pickle deserialization behind insecure serialization flag | ready,codex,aardvark | by simon-mo (merged: 2026-03-05 06:05 (UTC+8))
#36013 [Bugfix] Fix race in non-blocking num_accepted_tokens GPU->CPU copy | bug,ready,v1 | by tdoublep (merged: 2026-03-05 05:52 (UTC+8))
#34950 [CI] Bump mypy version | ready,v1,kv-connector,nvidia | by hmellor (merged: 2026-03-05 04:55 (UTC+8))
#35240 Add PyTorch profiler schedule support with warmup/active iterations | ready,meta-exported,fb-exported | by fenypatel99 (merged: 2026-03-05 04:53 (UTC+8))
#35139 [Docs] Document security risks of GPT-OSS Python tool | documentation,frontend,ready,gpt-oss | by russellb (merged: 2026-03-05 04:27 (UTC+8))
#35678 [UX] Remove NoOpOffloader log | ready | by robertgshaw2-redhat (merged: 2026-03-05 04:13 (UTC+8))
#35472 [torch.compile] Stop lazily compiling | ready-run-all-tests | by zou3519 (merged: 2026-03-05 04:13 (UTC+8))
#36019 [Model Runner V2] Fix pooling | v1 | by njhill (merged: 2026-03-05 02:53 (UTC+8))
#32441 [Docs] Clarify structured outputs configuration for Qwen3 reasoning mode | documentation,structured-output,ready,qwen | by davzaman (merged: 2026-03-05 03:44 (UTC+8))
#35871 [CI] Add Blackwell AsyncTP correctness test | ready,ci/build | by stecasta (merged: 2026-03-05 03:41 (UTC+8))
#36000 [Doc] Fix GPU Worker count in Process Count Summary | documentation,ready | by simone-dotolo (merged: 2026-03-05 01:03 (UTC+8))
#35397 [Kernel][Mamba] Optimize Mamba2 SSD prefill Triton kernels | ready | by tomeras91 (merged: 2026-03-05 02:47 (UTC+8))
#34551 [Frontend] Add vllm launch command for GPU-less preprocessing serving | frontend,ready,v1,cpu | by hyeongyun0916 (merged: 2026-03-05 02:41 (UTC+8))
#34882 docs: update CPU Docker images to reference Docker Hub instead of AWS ECR — documentation,ready,cpu — by cluster2600 (合并于: 2026-03-05 02:31 (UTC+8))
#32454 docs: add version requirement note for –profiler-config flag — documentation,ready — by abhishkh (合并于: 2026-03-05 02:13 (UTC+8))
#34531 [Docs] Add RunPod GPU deployment guide for vLLM — documentation,ready — by lisperz (merged: 2026-03-05 02:11 (UTC+8))
#35218 [Docs] Upgrade dynamic LoRA warning to admonition block — documentation,ready — by russellb (merged: 2026-03-05 02:01 (UTC+8))
#35481 [Feature][CI]: compare func & no_func outputs in test_functionalization.py — ready — by 11happy (merged: 2026-03-05 02:01 (UTC+8))
#30677 [Docs] Update design/multiprocessing.md — documentation,ready — by windsonsea (merged: 2026-03-05 01:59 (UTC+8))
#34127 fix minicpmo4.5: fix attn_mask in vit attn && fix resampler pos_emb i… — ready — by tc-mb (merged: 2026-03-05 01:46 (UTC+8))
#35090 docs(cpu): Clarify pre-built wheels requirement for CPU Python-only build — documentation,ready,cpu — by sagearc (merged: 2026-03-05 01:45 (UTC+8))
#35197 [Doc] Add MTP docs and update speculative decoding guidance — documentation,ready — by XingLiu1 (merged: 2026-03-05 01:23 (UTC+8))
#34784 fix(docs): use static rdzv backend in multi-node troubleshooting script — documentation,ready — by machov (merged: 2026-03-05 01:15 (UTC+8))
#35893 [ROCm][Bugfix] Fall back from CK MXFP4 MoE when GEMM dimensions are unsupported — bug,rocm,ready — by ChuanLi1101 (merged: 2026-03-04 16:30 (UTC+8))
#35933 docs: add README for logits_processor examples — documentation,ready — by mitre88 (merged: 2026-03-05 01:09 (UTC+8))
#35964 Fix phi4-mm and remove cuda binding — ready,nvidia — by yma11 (merged: 2026-03-05 01:08 (UTC+8))
#35653 Use MMEncoderAttention (=use FlashAttention) instead of torch.sdpa in radio.py — ready — by netanel-haber (merged: 2026-03-05 00:43 (UTC+8))
#35756 Split generic IO Processor plugins tests from Terratorch specific ones — ready,ci/build — by christian-pinto (merged: 2026-03-05 00:01 (UTC+8))
#35738 [Misc] Add --attention-backend auto option — ready — by NickLucche (merged: 2026-03-04 23:12 (UTC+8))
#28053 [Core] Remove busy loop from idle buffer readers — ready,v1 — by joerunde (merged: 2026-03-04 15:44 (UTC+8))
#34883 [Core] Add All-to-All communication backend for DCP — ready,v1,nvidia — by sungsooha (merged: 2026-03-04 23:01 (UTC+8))
#35996 [Bugfix] Make kaldi_native_fbank optional — bug,ready,ci/build — by DarkLight1337 (merged: 2026-03-04 22:47 (UTC+8))
#34783 [BugFix] Fix implicit and incorrect assumption on ECConnector is_producer — bug,documentation,ready,v1,kv-connector — by furionw (merged: 2026-03-04 22:01 (UTC+8))
#35656 [Bugfix][Model] Fix FP8 k_scale/v_scale not loaded for Qwen3-MoE — bug,ready,qwen — by oneraghavan (merged: 2026-03-04 21:15 (UTC+8))
#35846 [Bugfix][CPUOffloadingManager] Prevent eviction of already-stored blocks in LRU/ARC prepare_store() — bug,ready,v1 — by ronensc (merged: 2026-03-04 20:25 (UTC+8))
#35640 [MISC] fixed tool_parser mypy errors — ready — by taneem-ibrahim (merged: 2026-03-04 20:23 (UTC+8))
#35500 [Feature] Add basic metrics for /realtime endpoint — frontend,ready — by pougetat (merged: 2026-03-04 19:56 (UTC+8))
#34571 [Bugfix] Cap FULL decode cudagraph sizes for Mamba/hybrid models (#34094) — bug,ready,v1,nvidia — by haosdent (merged: 2026-03-04 18:56 (UTC+8))
#30681 [Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache — documentation,performance,rocm,structured-output,ready,v1,nvidia,ready-run-all-tests — by jikunshang (merged: 2026-03-04 17:49 (UTC+8))
#35869 [Bugfix] Add missing dynamic_arg_dims for Qwen3-ASR torch.compile — bug,ready,qwen — by TheCodeWrangler (merged: 2026-03-04 16:29 (UTC+8))
#35539 Support Audio Extraction from MP4 Video for Nemotron Nano VL — documentation,rocm,frontend,ready,ci/build,v1,multi-modality,cpu,nvidia — by askliar (merged: 2026-03-04 15:20 (UTC+8))
#35654 [cohere][fix][spec-decode]: fix crash when allowed_token_ids is set without penalties — ready,v1 — by kkt-cohere (merged: 2026-03-04 15:20 (UTC+8))
#35616 [Bugfix] Improve engine ready timeout error message — bug,ready,v1 — by lailoo (merged: 2026-03-04 13:54 (UTC+8))
#35835 [BugFix] Support tool_choice=none in the Anthropic API — bug,frontend,ready — by ZhongsJie (merged: 2026-03-04 13:24 (UTC+8))
#35913 [Rocm][CI] Fix ROCm LM Eval Large Models (8 Card) — rocm,ready,ci/build — by charlifu (merged: 2026-03-04 12:50 (UTC+8))
#35711 [Bugfix] Guard mm_token_type_ids kwarg in get_mrope_input_positions — bug,ready — by AndreasKaratzas (merged: 2026-03-04 12:12 (UTC+8))
#35883 [Chore] Remove debug code in model implementation — ready — by Isotr0py (merged: 2026-03-04 11:50 (UTC+8))
#35872 [Refactor] Clean up processor kwargs extraction — ready — by DarkLight1337 (merged: 2026-03-04 11:53 (UTC+8))
#35727 [model] support FireRedASR2 — documentation,new-model,ready,ci/build — by AllenDou (merged: 2026-03-04 11:41 (UTC+8))
#33753 [PluggableLayer][MM] Add PluggableLayer for RelPosAttention — documentation,ready — by shen-shanshan (merged: 2026-03-04 11:41 (UTC+8))

PRs closed without merging

#27360 [Feature] Implement custom pipeline parallel rank ordering for Ray workers — documentation,new-model,rocm,frontend,ci/build,stale,v1,multi-modality — by JorgenTrondsen (closed: 2026-03-05 10:16 (UTC+8))
#36071 [CI] Fix lint: mypy union-attr errors in hermes_tool_parser — no labels — by gmagogsfm (closed: 2026-03-05 09:15 (UTC+8))
#36065 [Bugfix] Fix mypy errors in hermes_tool_parser — bug — by AjAnubolu (closed: 2026-03-05 08:42 (UTC+8))
#36001 [Bugfix] Fix message queue initialization order for cross-node DP — bug,v1 — by jianzs (closed: 2026-03-05 08:15 (UTC+8))
#36046 [Frontend] Add SubscribeKvEvents KV event streaming bridge — documentation,frontend — by smfirmin (closed: 2026-03-05 04:38 (UTC+8))
#36033 [RoCM] Enable MoRI a2a + AiterExperts — rocm — by varun-sundar-rabindranath (closed: 2026-03-05 02:45 (UTC+8))
#32170 docs: fix typo in docker_run_bm.sh — tpu,ci/build — by darshan-stack (closed: 2026-03-05 02:11 (UTC+8))
#27359 [Doc] Refactor import statements for oneshot in quantization docs to support newer llm-compressor version — documentation,ready,stale — by xxrjun (closed: 2026-03-05 02:06 (UTC+8))
#20580 [V1] Correct V1 parallel sampling params — v1 — by imkero (closed: 2026-03-05 00:17 (UTC+8))
#29753 docs: add guide to reduce PyTorch profiler overhead via env vars (#29564) — documentation,stale — by kbp4154 (closed: 2026-03-05 01:54 (UTC+8))
#19858 optimze attn — needs-rebase,unstale,qwen — by momo609 (closed: 2026-03-05 01:50 (UTC+8))
#19829 Move Gemma’s stacked_params_mapping to class scope — no labels — by yhtang (closed: 2026-03-05 01:49 (UTC+8))
#19816 Introduce RayCudaCommunicator as Ray Compiled Graph communicator — needs-rebase,unstale,nvidia — by ruisearch42 (closed: 2026-03-05 01:48 (UTC+8))
#19721 [Kernel] Masked act_mul and fp8-quant Kernels for Batched MoE — needs-rebase,qwen — by varun-sundar-rabindranath (closed: 2026-03-05 01:46 (UTC+8))
#19658 [EPLB]: Support offline expert load distribution recording — frontend,needs-rebase,v1 — by jianzs (closed: 2026-03-05 01:46 (UTC+8))
#19527 Add NotImplementedError to v1 cpu runner — v1,cpu — by fred2167 (closed: 2026-03-05 01:42 (UTC+8))
#19548 [CI] bump mypy version to 1.16.0 — documentation,frontend,needs-rebase,unstale,tool-calling — by andyxning (closed: 2026-03-05 01:43 (UTC+8))
#19456 [Misc] Add logfmt8 quantization support — documentation,needs-rebase,ci/build — by anniegracehu (closed: 2026-03-05 01:41 (UTC+8))
#19443 [Kernel] Integrate IBM/Applied-AI fused moe kernels — nvidia — by varun-sundar-rabindranath (closed: 2026-03-05 01:40 (UTC+8))
#19406 qwen optimze — needs-rebase,unstale,qwen — by momo609 (closed: 2026-03-05 01:40 (UTC+8))
#19387 log process name and id — needs-rebase,unstale,v1 — by helunwencser (closed: 2026-03-05 01:39 (UTC+8))
#19345 qwen optimze — unstale,qwen — by momo609 (closed: 2026-03-05 01:38 (UTC+8))
#19080 [P/D] Exchange NIXL metadata through rank 0 — needs-rebase,unstale,kv-connector — by ptarasiewiczNV (closed: 2026-03-05 01:37 (UTC+8))
#18780 [V1][Feat] Fail request if FSM fails to advance — structured-output,needs-rebase,unstale,v1 — by atbe (closed: 2026-03-05 01:36 (UTC+8))
#19878 Add page-aligned prefill scheduling. — v1 — by py4 (closed: 2026-03-05 01:35 (UTC+8))
#35320 [CI] Add nsys profiling support with NVTX tracing — ci/build — by ojhaanshika (closed: 2026-03-05 01:13 (UTC+8))
#32217 [Draft][Kernel] Add new flashinfer A2A kernel — needs-rebase,nvidia — by hjjq (closed: 2026-03-05 01:08 (UTC+8))
#35839 docs: add comment explaining VLLM_FUSED_MOE_CHUNK_SIZE — no labels — by Jah-yee (closed: 2026-03-05 01:00 (UTC+8))
#35841 docs: add comment for VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE — no labels — by Jah-yee (closed: 2026-03-05 01:00 (UTC+8))
#35842 docs: add comment for VLLM_WORKER_MULTIPROC_METHOD — no labels — by Jah-yee (closed: 2026-03-05 01:00 (UTC+8))
#35843 docs: add comments for assets cache variables — no labels — by Jah-yee (closed: 2026-03-05 00:59 (UTC+8))
#19924 enable multiple ssm groups duplication — needs-rebase,unstale — by ilyasch2 (closed: 2026-03-05 00:50 (UTC+8))
#20194 FlashInfer generated decode kernels. — nvidia — by wenscarl (closed: 2026-03-05 00:47 (UTC+8))
#20027 enable torchao for AMD — rocm,needs-rebase,unstale — by jcaip (closed: 2026-03-05 00:47 (UTC+8))
#20229 [Core,Frontend,Doc] Trace v1 cuda start up with opentelemetry (vllm-project#19318) — documentation,new-model,frontend,needs-rebase,v1,startup-ux,nvidia — by ibl-g (closed: 2026-03-05 00:23 (UTC+8))
#35417 [Bugfix] Fix EC connector unit tests after has_cache_item API change — bug,needs-rebase,v1 — by pakkah (closed: 2026-03-05 00:24 (UTC+8))
#20239 [Frontend][Model] Qwen3Rerank API Server backward compatibility — frontend,needs-rebase,unstale,qwen — by BetterAndBetterII (closed: 2026-03-05 00:22 (UTC+8))
#20292 [Hardware][RISC-V] Add RISC-V architecture cpu inference support — needs-rebase,ci/build,stale — by huangzhengx (closed: 2026-03-05 00:20 (UTC+8))
#20473 [WIP][Hardware][CPU] testing branch for mlperf — documentation,needs-rebase,ci/build,v1,llama,cpu — by bigPYJ1151 (closed: 2026-03-05 00:19 (UTC+8))
#20471 Map Mistral-HF models back onto Mistral format on-the-fly — new-model — by sjuxax (closed: 2026-03-05 00:19 (UTC+8))
#20503 feat: Add streaming support for Mistral v11 tool format — frontend,needs-rebase,unstale,tool-calling — by sjuxax (closed: 2026-03-05 00:18 (UTC+8))
#20761 A developer friendly tool for multi-instance deployment with Ray Implementation — documentation — by Gongzq5 (closed: 2026-03-05 00:16 (UTC+8))
#20848 [WIP] Enable xpu sleep mode — v1 — by yangw1234 (closed: 2026-03-05 00:14 (UTC+8))
#20870 [EPLB] Add EPLB support for dots1 — no labels — by wenchen76 (closed: 2026-03-05 00:14 (UTC+8))
#20872 [EPLB] Add EPLB support for OLMoE — needs-rebase,unstale — by ztang2370 (closed: 2026-03-05 00:13 (UTC+8))
#20886 Add CheXAgent model integration with tests and documentation — documentation,new-model,ci/build,multi-modality — by WeiqiangLv (closed: 2026-03-05 00:12 (UTC+8))
#20982 [Kernel] DeepGemm MoE : Integrate cuda moe permute/unpermute — performance,needs-rebase,nvidia — by varun-sundar-rabindranath (closed: 2026-03-05 00:10 (UTC+8))
#21184 Some initial Vulkan boilerplate — needs-rebase,ci/build,unstale — by ericcurtin (closed: 2026-03-05 00:09 (UTC+8))
#21273 [EPLB]: Add EPLB support for Grok1 [WIP] — needs-rebase — by jennifurhe (closed: 2026-03-05 00:06 (UTC+8))
#21290 [Feature][EPLB] Add support for Qwen3 EPLB — needs-rebase,unstale,qwen — by hsliuustc (closed: 2026-03-05 00:04 (UTC+8))
#21317 adds include_thinking optional Param to Request object to preserve re… — frontend,needs-rebase,unstale — by arpitg1991 (closed: 2026-03-05 00:03 (UTC+8))
#21655 Keep reasoning content before applying chat template — frontend,needs-rebase,unstale — by lhdeng-gh (closed: 2026-03-05 00:01 (UTC+8))
#21670 [Model] [Draft PR] Add support for SmallThinker model series — documentation,new-model,needs-rebase — by SorryMaker2022 (closed: 2026-03-04 23:58 (UTC+8))
#21722 fix the mxfp4 packed qk weight loading issue for llama4 — needs-rebase,llama — by xuebwang-amd (closed: 2026-03-04 23:55 (UTC+8))
#21732 [EPLB] Dynamic EPLB Metrics — needs-rebase,unstale,v1 — by baxingpiaochong (closed: 2026-03-04 23:55 (UTC+8))
#21738 [Feat] Import KV connector module dynamically for v0 — needs-rebase,unstale,kv-connector — by JiamingMai (closed: 2026-03-04 23:54 (UTC+8))
#21962 WIP: allow model to be loaded dynamically — frontend,needs-rebase,v1 — by lionelvillard (closed: 2026-03-04 23:53 (UTC+8))
#21969 [Misc] DeepEPLLPrepareAndFinalize - Cleanup — no labels — by varun-sundar-rabindranath (closed: 2026-03-04 23:53 (UTC+8))
#22075 [V1] Enhanced Exception Handling for KV Cache Loading from Remote Store — needs-rebase,v1,kv-connector — by liuyumoye (closed: 2026-03-04 23:52 (UTC+8))
#22219 [Quant] Refactor CompressedTensorsConfig — needs-rebase — by kylesayrs (closed: 2026-03-04 23:49 (UTC+8))
#22345 [Model]Force use triton compressed_tensor_moe instead of cutlass — needs-rebase,unstale,nvidia — by access2rohit (closed: 2026-03-04 23:49 (UTC+8))
#22392 Ability to use custom-all-reduce on systems with more than 2 PCIe GPUs via env var — rocm,needs-rebase,stale — by avtc (closed: 2026-03-04 23:46 (UTC+8))
#22461 [Feature] add procese set cpu affinity current gpu device — stale,v1 — by lengrongfu (closed: 2026-03-04 23:44 (UTC+8))
#22607 [Feat][KV offload][WIP] Separated process for CPU KV cache processing — no labels — by ApostaC (closed: 2026-03-04 23:43 (UTC+8))
#22608 [Feat][KV offloading][WIP] The prototype implementation of a KV offloader used in CPU KV server — no labels — by ApostaC (closed: 2026-03-04 23:43 (UTC+8))
#22618 [feat] added the optimized config for Qwen3-30B-A3B Fp8 — qwen — by sara4dev (closed: 2026-03-04 23:42 (UTC+8))
#22687 feat(perf): Accelerate TP All-Reduce using Triton-Distributed — needs-rebase,unstale,v1 — by preminstrel (closed: 2026-03-04 23:41 (UTC+8))
#22694 [Bug Fix] Correctly parse Hermes tool calls — bug,frontend,needs-rebase,tool-calling — by minhsaco99 (closed: 2026-03-04 23:41 (UTC+8))
#22854 [Bugfix] Fix dual-stack binding when --host is empty — frontend,needs-rebase,stale,v1 — by Shrey1306 (closed: 2026-03-04 23:40 (UTC+8))
#22865 [Core] Optimize swap_states() to copy only valid tokens instead of full rows — stale,v1 — by arjunbreddy22 (closed: 2026-03-04 23:39 (UTC+8))
#22904 [Core] Implement the delay_factor parameter in the v1 scheduler — needs-rebase,stale,v1 — by wz202020 (closed: 2026-03-04 23:38 (UTC+8))
#22977 [Frontend] Complete Redesign of Tool Calling — frontend,tool-calling — by chaunceyjiang (closed: 2026-03-04 23:36 (UTC+8))
#18765 [WIP] Add a metric to track request failures — documentation,frontend — by harche (closed: 2026-03-04 23:16 (UTC+8))
#18994 Self-Speculative Decoding using LayerSkip — documentation,speculative-decoding,needs-rebase,v1,llama,qwen — by aniltolwani (closed: 2026-03-04 23:18 (UTC+8))
#18477 [Bugfix][Frontend] support webm with audioread fallback — frontend,needs-rebase,ci/build,stale — by cpwan (closed: 2026-03-04 23:16 (UTC+8))
#18475 [Misc][benchmark] add warmup; add e2el_per_concurrency and throughput; add random_output_ratio — performance,needs-rebase — by yuzho-amd (closed: 2026-03-04 23:15 (UTC+8))
#18325 Refactor: Prioritize Prefill Requests in Scheduler Output — v1 — by qiaoli31 (closed: 2026-03-04 23:15 (UTC+8))
#18298 [Don’t merge] Debug failing quantization test with input batch move — needs-rebase,v1,qwen — by heheda12345 (closed: 2026-03-04 23:14 (UTC+8))
#17563 AMD tests updated experiment — rocm,ci/build,unstale — by Concurrensee (closed: 2026-03-04 23:02 (UTC+8))
#16989 [Hardware][TPU][V1] Better tpu multilora compilation — tpu,needs-rebase,ci/build,v1 — by jdefreitas02 (closed: 2026-03-04 22:57 (UTC+8))
#16919 [Bugfix] Fix the missing ‘}’ issue for nested object parameters in stream function call. — bug,frontend,needs-rebase,unstale,tool-calling,qwen — by hewu2008 (closed: 2026-03-04 22:56 (UTC+8))
#16935 Add docker to build vllm against torch nightly — rocm,needs-rebase,ci/build — by yangw-dev (closed: 2026-03-04 22:57 (UTC+8))
#16849 [CORE] Eliminate Occasional Scheduling Delay for Parallel Sampling — v1 — by dblincoe (closed: 2026-03-04 22:56 (UTC+8))
#16844 Add initial HPU support for V1 — needs-rebase,ci/build,v1 — by kzawora-intel (closed: 2026-03-04 22:55 (UTC+8))
#16739 Add key latencies to v1 RequestMetrics instance so it can be surfaced… — needs-rebase,unstale,v1 — by Macchiato123000 (closed: 2026-03-04 22:54 (UTC+8))
#16677 [NIXL] vllm v0 nixl integration — frontend,speculative-decoding,needs-rebase,deepseek,cpu,kv-connector — by rainj-me (closed: 2026-03-04 22:51 (UTC+8))
#16315 Add initial fbgemm deps — documentation — by jianyuh (closed: 2026-03-04 22:51 (UTC+8))
#16286 [Model] Run the QK norm in a single op in Llama4 — needs-rebase,llama — by houseroad (closed: 2026-03-04 22:50 (UTC+8))
#16160 Support R-KV Cache Compression in vLLM — needs-rebase,v1 — by yeyang-zhou (closed: 2026-03-04 22:49 (UTC+8))
#15765 Add changes for cascade optimizations — documentation,frontend,needs-rebase,ci/build,v1,kv-connector,nvidia — by plops655 (closed: 2026-03-04 22:49 (UTC+8))
#15957 DeepGemm MoE expert map support — needs-rebase — by bnellnm (closed: 2026-03-04 22:49 (UTC+8))
#15643 [draft] reproduce tpu v1 correctness issue with chunked prefill. — tpu,needs-rebase,v1 — by Chenyaaang (closed: 2026-03-04 22:47 (UTC+8))
#15641 adds native FastAPI bearer auth — frontend,needs-rebase — by charlesfrye (closed: 2026-03-04 22:47 (UTC+8))
#15405 [Doc] Add multi-modal development example for encoder-decoder models — documentation,needs-rebase — by Isotr0py (closed: 2026-03-04 22:45 (UTC+8))
#15401 [TPU][V1] Guided decoding on TPU — tpu,needs-rebase,v1 — by carlesoctav (closed: 2026-03-04 22:44 (UTC+8))
#15346 Vllm v1 eagle proposer — speculative-decoding,v1 — by sroy745 (closed: 2026-03-04 22:43 (UTC+8))
#15005 feat: add custom s3 support — no labels — by warjiang (closed: 2026-03-04 22:42 (UTC+8))
#14803 [ROCm][AMD] Enable ROCm Flash Attention Backend for Encoder-Decoder Models — rocm,needs-rebase — by vllmellm (closed: 2026-03-04 22:38 (UTC+8))
#14678 [Core] Add a level 3 sleep/wake_up that offloads tensors to disk — frontend,v1 — by manoelmarques (closed: 2026-03-04 22:38 (UTC+8))
#14584 Ray named test — documentation,needs-rebase — by ruisearch42 (closed: 2026-03-04 22:36 (UTC+8))
#14631 [Quant] Add SupportsQuant and packed_modules_mapping to all models — needs-rebase,llama,qwen,deepseek — by kylesayrs (closed: 2026-03-04 22:37 (UTC+8))
#14395 [Kernel] Update cutlass FP8 blockwise to use upstream CUTLASS — needs-rebase,ci/build,nvidia — by LucasWilkinson (closed: 2026-03-04 22:36 (UTC+8))
#14455 [INTEL-HPU] Deepseek R1 model enabling for Intel Gaudi — speculative-decoding,needs-rebase,ci/build,deepseek — by xuechendi (closed: 2026-03-04 22:36 (UTC+8))
#14291 [Model] add colqwen2_vl code & inference — documentation,new-model,needs-rebase,unstale,qwen — by BloomBerry (closed: 2026-03-04 22:35 (UTC+8))
#14182 Deepseek MTP for V1 — speculative-decoding,needs-rebase,v1,deepseek — by sroy745 (closed: 2026-03-04 22:34 (UTC+8))
#13835 Support w8a8 block_fp8_matmul from generated kernel — needs-rebase,ci/build,nvidia — by wenscarl (closed: 2026-03-04 22:28 (UTC+8))
#13853 [HPU] Enable AutoGPTQ/AutoAWQ quantized model inference — needs-rebase — by maktukmak (closed: 2026-03-04 22:29 (UTC+8))
#13809 [Misc] support variable remote backend for model loader — documentation,performance,frontend,speculative-decoding,needs-rebase,unstale,v1 — by DellCurry (closed: 2026-03-04 22:27 (UTC+8))
#13805 [Model][Speculative Decoding] support k > 1 for MTP — speculative-decoding,needs-rebase,deepseek — by luccafong (closed: 2026-03-04 22:26 (UTC+8))
#13758 [WIP][Whisper] beam search for whisper — frontend,needs-rebase — by joennlae (closed: 2026-03-04 22:26 (UTC+8))
#13360 [RFC][V1] LogitsProcessor interface — RFC,tpu,needs-rebase,v1 — by njhill (closed: 2026-03-04 22:25 (UTC+8))
#12866 adding workaround for c2x/c3x initializer issue — needs-rebase,nvidia — by kushanam (closed: 2026-03-04 22:24 (UTC+8))
#13248 [core] add extra_args to RPCProcessRequest and MQLLMEngineClient — needs-rebase — by akeshet (closed: 2026-03-04 22:24 (UTC+8))
#12726 [Core] Add Additional Metrics to vLLM Server — speculative-decoding,needs-rebase,stale,v1 — by sahelib25 (closed: 2026-03-04 22:22 (UTC+8))
#11554 [Frontend] [Bugfix] Refactor tool parsers and simplify the tool parsing interface. — bug,frontend,needs-rebase,ci/build,tool-calling — by elementary-particle (closed: 2026-03-04 22:12 (UTC+8))
#12566 vllm-flash-attn build on AMD — rocm,needs-rebase,ci/build,v1,nvidia — by ProExpertProg (closed: 2026-03-04 22:11 (UTC+8))
#12117 [Feature] Support VPTQ quantization — documentation,needs-rebase,ci/build — by wejoncy (closed: 2026-03-04 22:10 (UTC+8))
#12341 FLOP counting for vLLM inference — no labels — by dianastea (closed: 2026-03-04 22:10 (UTC+8))
#12048 [MoE][CPU] Extend fused_moe_iterative for non-x86 CPU backends — needs-rebase — by mgoin (closed: 2026-03-04 22:09 (UTC+8))
#10840 Lora scheduler — documentation,needs-rebase — by Scott-Hickmann (closed: 2026-03-04 22:06 (UTC+8))
#11348 [Kernel] Add ExLlamaV2 Weight Quantization Support — needs-rebase,ci/build,llama — by AlpinDale (closed: 2026-03-04 22:07 (UTC+8))
#21189 [W.I.P]: add Lmcache metrics — unstale,v1,kv-connector — by panpan0000 (closed: 2026-03-04 21:56 (UTC+8))
#27963 [Doc][Last/N] Improve all pooling task | Refactor pooling-related documentation — documentation — by noooop (closed: 2026-03-04 21:42 (UTC+8))
#35282 [Core] Add chunking for audio over 30s on offline inference, — frontend,needs-rebase — by sangbumlikeagod (closed: 2026-03-04 20:47 (UTC+8))
#34227 Feat/support nemotron new file format — documentation,performance,new-model,needs-rebase — by shaunkotek (closed: 2026-03-04 16:02 (UTC+8))
#35965 [Core] Add KV transfer support to sparse attention indexer — no labels — by wz1qqx (closed: 2026-03-04 14:18 (UTC+8))
#35377 [Feature] Use chat template and support request_prompt for Qwen3-ASR — qwen — by pougetat (closed: 2026-03-04 14:03 (UTC+8))
#35248 [Model Runner V2] Support DP/EP for spec decoding — needs-rebase,v1 — by TheEpicDolphin (closed: 2026-03-04 12:30 (UTC+8))