vLLM Development Activity Report - 2026-03-04
Window: 2026-03-04 11:24 (UTC+8) ~ 2026-03-05 11:24 (UTC+8)
Stats: new issues 22 | closed issues 44 | new PRs 113 | merged PRs 59 | closed without merge 128
📊 Daily Development Status Summary
Within the March 4-5 window, the vLLM project maintained very high development activity, with both newly opened and merged PR counts exceeding 50, a sign of strong momentum. Development focused on performance optimization (fused operators, CUDA Graph, compilation optimizations), support for new hardware (notably AMD CPUs and GPUs), and fixing complex bugs introduced by new features (speculative decoding, multimodal processing). Community discussion was lively, particularly on topics touching the AMD ecosystem and deep optimization work such as the fusion work tracker.
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was brisk this cycle, with notable progress spanning CPU to GPU:
- AMD Zen CPU backend integration (PR #35970)
  - Contributor: amd-lalithnc (AMD employee)
  - Content: The first PR implementing RFC #35089, introducing a dedicated platform (ZenCpuPlatform) for AMD EPYC (Zen) CPUs. It routes GEMM operations through the zentorch library, supports weight pre-packing to reduce inference overhead, and adds a Docker build target (vllm-openai-amd).
  - Impact: Marks the start of official, deep optimization support in vLLM for AMD server CPUs, aimed at improving performance in CPU-only inference scenarios.
- ROCm AITER and MoRI (MoE Runtime Infrastructure) integration (PRs #36034, #36033, #36043)
  - Contributor: varun-sundar-rabindranath
  - Content:
    - PRs #36033/#36034: Resolved the incompatibility between MoRIPrepareAndFinalize (the MoRI A2A backend) and AiterExperts by declaring that AiterExperts supports quantized hidden states, allowing the two to work together.
    - PR #36043: Added a script (install_python_libraries.sh) to install the MoRI and AITER Python libraries on ROCm, easing local development.
  - Impact: Rounds out the Expert Parallel software stack on AMD GPUs, improving the deployability and performance potential of MoE models in the ROCm ecosystem.
- AITER FlashAttention CUDA Graph crash fix (PR #36042)
  - Contributor: iseeyuan
  - Content: Fixed a bug in the ROCm AITER FlashAttention backend where a faulty condition routed non-speculative-decoding workloads onto the unified_attention path, which does not support CUDA Graph capture, causing capture failures.
  - Impact: Ensures stability and performance when running CUDA Graph inference on AMD GPUs.
- CK MXFP4 MoE kernel dimension fallback (PR #35893, merged)
  - Contributor: ChuanLi1101
  - Content: Fixed the crash reported in Issue #35637, where AITER's CK MXFP4 MoE kernel failed under certain TP configurations (e.g. intermediate_size=1536 with TP=4) because the matrix dimensions violated kernel constraints. The fix adds dimension validation and automatically falls back to emulation mode or the Triton backend when the shapes are incompatible.
  - Impact: Improves the robustness of MXFP4-quantized MoE models across parallelism configurations; an important stability fix for AMD's low-precision quantization ecosystem.
- Central fusion work tracker (Issue #36066)
  - Contributor: ProExpertProg
  - Content: Created a detailed matrix tracking the support status of vLLM's major fusion passes (e.g. RMSNorm+Quant) across hardware (sm100, sm90, ROCm) and quantization configurations. AMD developer tjtanaa suggested in the comments adding ROCm AITER-specific fused ops (e.g. QK Norm + RoPE + Cache + Quant).
  - Impact: Gives the community a clear panorama of fusion optimization work and directly raised the visibility of AMD-specific optimizations and the discussion around integrating them.
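The dimension-guard-and-fallback pattern described for PR #35893 can be sketched as follows. This is illustrative only: `ck_supports`, `select_moe_backend`, and the 256-element alignment constraint are assumptions for the sketch, not vLLM's actual API or the kernel's real constraint.

```python
# Hypothetical sketch of a shape guard with automatic backend fallback.
# The alignment value (256) is an assumed kernel constraint, chosen so that
# the failing case from Issue #35637 (intermediate_size=1536, TP=4) is rejected.

def ck_supports(intermediate_size: int, tp_size: int, align: int = 256) -> bool:
    """Check the per-rank GEMM dimension against an assumed alignment constraint."""
    if intermediate_size % tp_size != 0:
        return False
    per_rank = intermediate_size // tp_size
    return per_rank % align == 0

def select_moe_backend(intermediate_size: int, tp_size: int) -> str:
    # Fall back to the Triton path when the CK kernel's shape check fails,
    # instead of crashing at kernel launch time.
    if ck_supports(intermediate_size, tp_size):
        return "ck_mxfp4"
    return "triton"
```

With these assumed constraints, `select_moe_backend(1536, 4)` routes to the Triton fallback (per-rank dimension 384 is not 256-aligned), while `select_moe_backend(2048, 4)` keeps the CK kernel.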
💬 High-Traffic Discussion Analysis
- Issue #31845: [Bug]: [H200] DeepSeek V3.2 MTP > 1 run into error
  - Core topic: Limitations and solutions when enabling multi-token prediction (MTP > 1) for sparse MLA models such as DeepSeek-V3.2.
  - Views and progress:
    - Initial understanding: MatthewBonanni initially attributed the lack of MTP > 1 support to limitations in the sparse MLA indexer kernel.
    - Deeper investigation: benchislett identified the root cause as the fp8_paged_mqa_logits kernel supporting only 0 or 1 speculative tokens, and analyzed how competing stacks (SGLang, TensorRT-LLM) work around this via batch expansion or by waiting for new kernels.
    - Resolution: The community added batch expansion via PR #34552 as an interim solution, pending upstream DeepGEMM merging a kernel that supports MTP=3. The issue was ultimately closed after performance validation and redirected to a new optimization issue (#35878).
  - Points of contention: None of substance; the thread was a technical exploration of the root cause and the best path forward.
  - Current status: Closed. The interim fix is merged; the long-term solution depends on the upstream kernel update.
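The batch-expansion idea can be illustrated with a toy sketch: when a kernel scores at most one speculative token per row, a request carrying k draft tokens is expanded into k rows, one per verified position. `expand_batch` is a hypothetical helper for illustration, not the PR #34552 implementation.

```python
# Toy illustration of "batch expansion" for speculative-token verification.
# Each (prompt, drafts) request becomes len(drafts) rows; row i sees the
# prompt plus the first i draft tokens and scores draft token i as its
# single speculative position, satisfying a kernel's "<= 1 spec token" limit.

def expand_batch(seqs):
    """seqs: list of (prompt_tokens, draft_tokens) pairs. Returns expanded rows."""
    rows = []
    for prompt, drafts in seqs:
        for i in range(len(drafts)):
            rows.append((prompt + drafts[:i], drafts[i]))
    return rows
```

The trade-off is clear from the shape: verification cost grows with the number of rows, which is why a native multi-token kernel (MTP=3) is preferred long-term.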
- Issue #35637: mi355 minimax m2.1 arch mxfp4 rocm AITER TP4 error
  - Core topic: On AMD MI355X GPUs, running the MXFP4-quantized MiniMax-M2.1 model with AITER crashes at tensor parallel (TP) size 4.
  - Views and progress:
    - Reproduction: AMD developer hongxiayang confirmed and reproduced the problem.
    - Workaround: Setting the environment variable VLLM_ROCM_USE_AITER_MOE=0 to disable the AITER MoE kernel was offered as a temporary bypass.
    - Root-cause fix: ChuanLi1101 landed a permanent fix in PR #35893, adding kernel dimension validation with an automatic fallback mechanism.
  - Points of contention: None.
  - Current status: Fixed via PR #35893 and closed.
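For reference, the workaround from the thread amounts to setting one environment variable before vLLM starts; a minimal sketch (exporting the variable in the shell before `vllm serve` works the same way):

```python
import os

# Temporary bypass from the issue thread: disable the AITER MoE kernel path
# so MoE falls back to a non-AITER implementation on ROCm. This must be in
# the environment before vLLM initializes its workers.
os.environ["VLLM_ROCM_USE_AITER_MOE"] = "0"
```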
- Issue #36066: [Feature]: Central fusion work tracker
  - Core topic: Establishing and maintaining a cross-hardware support-status matrix for fusion optimizations.
  - Views and progress:
    - Initiator: ProExpertProg built a detailed matrix, though many ROCm entries are marked "unknown".
    - Community input: AMD contributor tjtanaa volunteered to add ROCm AITER-specific fused ops, showing the AMD team's active participation in integrating and documenting core performance features.
  - Points of contention: None.
  - Current status: In progress. The issue has become the central page for tracking and coordinating cross-platform optimization work.
🔥 Hot Topics and Trends
- Fine-grained performance optimization: Community attention is shifting from basic feature support to deep optimization. Fused operators (Issue #36066), broader CUDA Graph coverage (e.g. ViT, PR #35963), and compilation optimizations (PR #36072 explores overlapping compilation with loading) are the hot spots.
- Multimodal and complex model support: Optimization around vision encoders (ViT, PR #35963) and integration of new multimodal models (e.g. FireRedASR2, PR #35727) show vLLM continuing to expand its application boundary.
- Speculative decoding maturing, with challenges: Frequent discussions (Issues #31845, #36031) involve deep adaptation to specific model architectures (MLA) and underlying kernels, indicating the feature has entered the stage of solving complex edge cases.
- AMD ecosystem building accelerating across the board: From CPU (Zen) to GPU (ROCm), from base operators (AITER) to a higher-level runtime (MoRI) to quantization formats (MXFP4), AMD-related contributions are systematic and multi-layered, suggesting the ecosystem integration has reached deep water.
- Developer experience and infrastructure: Growth in "non-core" but important improvements such as documentation updates (PR #36074), local install scripts (PR #36043), and benchmark fixes (PR #35993) reflects the project's attention to developer friendliness.
🛠️ Key Technical Changes
- PR #35970: In-Tree AMD Zen CPU Backend via zentorch: An architectural introduction. More than just adding a library, it creates a new hardware platform abstraction, paving the way for further AMD CPU customizations (such as future fusion passes) and strategically broadening vLLM's deployment scenarios.
- PR #35963: [Feature] ViT Full CUDA Graph: Extends CUDA Graph from LLM decoding to vision encoder preprocessing. This matters for end-to-end multimodal latency; via budgeted graph capture and a greedy bin-packing algorithm, it effectively cuts kernel launch overhead in the ViT stage.
- PR #35961: [Core] Add score mode with perplexity and KLD computation: Adds efficient GPU-side computation of perplexity (PPL) and KL divergence (KLD) to the V1 engine. This enables model-evaluation workflows without shipping full-vocabulary logits back to the CPU, which is valuable for model quality evaluation and data filtering.
- PR #34883: [Core] Add All-to-All communication backend for DCP (merged): Adds an All-to-All communication backend for decode context parallelism (DCP). Compared with the previous AllGather+ReduceScatter scheme, it reduces NCCL calls per layer from 3 to 2, promising lower communication overhead for long-context decoding; a performance-driven low-level communication optimization.
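As a refresher on the two metrics PR #35961 computes, here is a minimal math-only sketch. These are the textbook definitions of PPL and KLD, not vLLM's scoring API:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean log p(token)) over a sequence's per-token log-probs."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def kl_divergence(p, q):
    """KL(p || q) between two discrete next-token distributions.

    Terms with p_i == 0 contribute nothing by convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

For example, a model assigning probability 0.5 to every token of a sequence has perplexity 2, and KL divergence is zero exactly when the two distributions match.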
📈 Development Activity Observations
- Active contributors: AMD contributors (amd-lalithnc, ChuanLi1101, hongxiayang, tjtanaa, among others) were highly active, with strong contributions in problem diagnosis, fixes, feature development, and new platform bring-up.
- High PR throughput: 59 PRs merged in a single day shows the core team's strong review-and-merge capacity. Many PRs carry detailed test plans and results, reflecting strict code-quality requirements.
- Effective issue management: Closed issues (44) far outnumber new ones (22), indicating the historical backlog is being cleared effectively and the project is in healthy maintenance shape.
- Collaboration pattern: High-traffic issues show efficient interaction among users, contributors, and maintainers; problems are typically reproduced and localized quickly, then routed to concrete fix PRs.
💡 Issues Worth Watching
- Issue #36031: Qwen 3.5 122B speculative decoding crash: A specific commit caused a large model to crash with speculative decoding enabled. A fix PR (#36036) exists, but this is a reminder that test coverage for complex feature combinations (large models + speculative decoding) needs continued reinforcement.
- Issue #36010: Very slow Qwen3.5-27B batch inference: A user reports a large performance gap between similarly sized models (Gemma 3 vs. Qwen3.5). This may be a latent bottleneck in the specific model implementation, guided decoding, or scheduler interaction, and needs further investigation.
- PR #36072: [WIP][Proof of concept] Overlap model loading and torch.compile: A proof of concept that overlaps time-consuming weight loading with the torch.compile compilation phase to speed up startup. If it succeeds, it will markedly improve user experience, especially for services that frequently restart or deploy new models. Its design approach (compiling against fake weights) is worth watching.
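The overlap idea in PR #36072 can be sketched conceptually as two concurrent phases. `fake_load` and `fake_compile` below are stand-ins for checkpoint reading and `torch.compile` warm-up; this sketch shows only why overlapping helps, not the PR's actual mechanism:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_load():
    # Stand-in for streaming checkpoint shards from disk/network (I/O-bound).
    time.sleep(0.2)
    return "weights"

def fake_compile():
    # Stand-in for compiling the model graph against placeholder weights.
    time.sleep(0.2)
    return "compiled"

start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    load_f = pool.submit(fake_load)
    compile_f = pool.submit(fake_compile)
    weights, compiled = load_f.result(), compile_f.result()
elapsed = time.monotonic() - start  # ~0.2s overlapped vs ~0.4s if run sequentially
```

The overlap pays off precisely because the two phases stress different resources (I/O vs. compute); the real PR must additionally ensure the compiled graph is valid once real weights replace the placeholders.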
📋 Appendix: Detailed Data
New Issues
- #36075 [Bug]: Garbled rollouts with GLM5 if VLLM_USE_DEEP_GEMM is not set — bug — by S1ro1 (opened: 2026-03-05 10:02 (UTC+8))
- #35967 [Bug]: Value error, invalid literal for int() with base 10: '4.0' [type=value_error, input_value=ArgsKwargs((), {'model_co…transfer_config': None}), input_type=ArgsKwargs] — bug — by LuWei6896 (opened: 2026-03-04 14:33 (UTC+8))
- #36073 [Bug]: Prolonged Latency in Some Streaming Responses in Function Call Mode (MiniMax model) — bug — by delwen123 (opened: 2026-03-05 09:37 (UTC+8))
- #36077 [Bug]: FlashInfer JIT build fails when serving Qwen3.5-35B-A3B (nvcc path found but ninja build fails) — bug — by yufangbo (opened: 2026-03-05 10:08 (UTC+8))
- #36066 [Feature]: Central fusion work tracker — feature request — by ProExpertProg (opened: 2026-03-05 08:24 (UTC+8))
- #35989 [Bug]: — bug — by DonglongLiu (opened: 2026-03-04 18:03 (UTC+8))
- #36050 [Bug]: DeepSeek v3.2 FP8 Failure to start server — bug — by wzhao18 (opened: 2026-03-05 06:17 (UTC+8))
- #35998 Qwen3.5 Model Tokenizer Loading Failure — no labels — by Pg-artemsultanov (opened: 2026-03-04 19:30 (UTC+8))
- #36031 [Bug]: Commit 28ef9ba causing VLLM to crash when running Qwen 3.5 122B with Speculative Decoding enabled — bug — by mdierolf (opened: 2026-03-05 02:15 (UTC+8))
- #35950 [Bug]: ValueError: too many values to unpack (expected 2) — bug — by LuWei6896 (opened: 2026-03-04 11:43 (UTC+8))
- #36037 [Feature]: Supports Speculative Speculative Decoding — feature request — by celsowm (opened: 2026-03-05 03:27 (UTC+8))
- #35999 [New Model]: MetaCLIP-2 variants — new-model,multi-modality — by erenirmak (opened: 2026-03-04 19:39 (UTC+8))
- #36015 [Bug]: Realtime audio transcription (Voxtral) silently hangs after ~10 minutes due to unhandled TimeoutError in background task — bug — by sh1man (opened: 2026-03-04 23:06 (UTC+8))
- #36004 [Bug]: Failed to import Triton kernels. Please make sure your triton version is compatible. Error: cannot import name 'SparseMatrix' from 'triton_kernels.tensor' — bug — by edc3000 (opened: 2026-03-04 20:24 (UTC+8))
- #36010 [Bug]: Qwen/Qwen3.5-27B Batch Inference very slow / not working — bug — by NilsHellwig (opened: 2026-03-04 21:41 (UTC+8))
- #35962 [RFC]: Add PPL and KLD to VLLM — RFC — by phaelon74 (opened: 2026-03-04 13:43 (UTC+8))
- #36008 [Bug]: invoke_fused_moe_wna16_*_kernel calls get_moe_wna16_block_config with bad parameters — bug — by Maxwell-Lyu (opened: 2026-03-04 20:59 (UTC+8))
- #35992 [Doc]: Inconsistent hash notation in Prefix Caching "Time 5" diagram — documentation — by anony-mous-e (opened: 2026-03-04 18:24 (UTC+8))
- #35991 [Installation]: Making server with GPT-OSS-20B on vllm+openwebui rtx5080 16gb — installation — by Ed-test-s (opened: 2026-03-04 18:18 (UTC+8))
- #35987 [Performance]: Very slow GGUF quantized model — performance — by VAmblardPEReN (opened: 2026-03-04 17:49 (UTC+8))
- #35983 [Bug]: Graph Capturing reports negative memory consumption — bug — by jmkuebler (opened: 2026-03-04 16:54 (UTC+8))
- #35980 [Bug]: Why does deploying Qwen3-32B-AWQ via vllm:v0.10.1.1 result in different outputs for the same input? — bug — by zhushuo5 (opened: 2026-03-04 16:13 (UTC+8))
Closed Issues
- #18793 [New Model]: ByteDance-Seed/BAGEL-7B-MoT — new-model,stale — by XueSongTap (closed: 2026-03-05 10:17 (UTC+8))
- #19623 [Feature]: Add support for multi-lora and single lora for classification tasks — feature request,stale — by power-puff-gg (closed: 2026-03-05 10:17 (UTC+8))
- #23529 [Usage]: DP Support in Fully SPMD Execution for Offline Inference — usage,stale — by wjj19950828 (closed: 2026-03-05 10:17 (UTC+8))
- #24805 [Feature]: More Frequently Updated Docker Images — feature request,stale — by JamesDConley (closed: 2026-03-05 10:17 (UTC+8))
- #25670 [RFC]: Migration from buildkite to GHA — RFC,stale — by jeanschmidt (closed: 2026-03-05 10:17 (UTC+8))
- #25842 [Usage]: Problem with concurrency in encoder-based embedder serving with V1 Engine — usage,stale — by gabinguo (closed: 2026-03-05 10:16 (UTC+8))
- #26161 [Bug]: pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig — bug,stale — by lucasjinreal (closed: 2026-03-05 10:16 (UTC+8))
- #26452 [Bug]: vLLM + LMCache Version Compatibility Issues Across Multiple Releases — bug,stale — by nrghosh (closed: 2026-03-05 10:16 (UTC+8))
- #27825 [Bug]: Tokenize endpoint for Granite models returns malformed strings in token_strs for non-Latin characters — bug,stale — by kndtran (closed: 2026-03-05 10:16 (UTC+8))
- #27982 [Usage]: How can I access or return hidden states (representations) after generation? — usage,stale — by hakbari14 (closed: 2026-03-05 10:16 (UTC+8))
- #28013 [Feature]: CUDA12.6 prebuilt wheel for vllm v0.11 — feature request,stale — by Luciennnnnnn (closed: 2026-03-05 10:16 (UTC+8))
- #28035 [Usage]: deepseek-ocr The output token count is too low and unstable. — usage,stale — by sixgod-666 (closed: 2026-03-05 10:16 (UTC+8))
- #28038 [Bug]: During the vllm 0.10.1 v1 benchmark test, only about 100 out of 1000 requests could be processed, and then it would get stuck. — bug,stale — by Ysgg1 (closed: 2026-03-05 10:16 (UTC+8))
- #28042 [Bug]: RotaryEmbedding forward_native cannot match as expected for QKNormRoPEFusionPass — bug,stale — by izhuhaoran (closed: 2026-03-05 10:16 (UTC+8))
- #28061 [CI Failure]: ci-infra is broken at approximately noon Nov, 4, 2025 — stale,ci-failure — by Alexei-V-Ivanov-AMD (closed: 2026-03-05 10:16 (UTC+8))
- #27225 [Tracking Issue]: Qwen3-next performance optimisations — performance — by vadiklyutiy (closed: 2026-03-05 08:48 (UTC+8))
- #31845 [Bug]: [H200] DeepSeek V3.2 MTP > 1 run into error (FLASHMLA_SPARSE backend) — bug,speculative-decoding,deepseek — by jhaotingc (closed: 2026-03-05 07:49 (UTC+8))
- #34034 [Bug]: vLLM-compile should not execute the decoder forward pass during compilation — bug,torch.compile — by zou3519 (closed: 2026-03-05 04:13 (UTC+8))
- #33708 [Doc]: Update CPU image docs to use Docker Hub after 0.16.0 release — documentation,cpu — by nathan-weinberg (closed: 2026-03-05 02:31 (UTC+8))
- #31766 [Docs] Feedback for /en/latest/contributing/profiling/ — documentation — by cyk2018 (closed: 2026-03-05 02:13 (UTC+8))
- #34996 [Feature][CI]: test_functionalization.py should compare outputs — feature request — by Rohan138 (closed: 2026-03-05 02:01 (UTC+8))
- #35784 [CI Failure]: mi355_1: V1 Test entrypoints — ci-failure — by AndreasKaratzas (closed: 2026-03-05 01:55 (UTC+8))
- #29536 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 — ci-failure — by AndreasKaratzas (closed: 2026-03-05 01:51 (UTC+8))
- #35235 [CI Failure]: mi355_1: Multi-Modal Models Test (Extended) 1 — ci-failure — by AndreasKaratzas (closed: 2026-03-05 01:51 (UTC+8))
- #35128 [CI Failure]: mi355_1: Language Models Tests (Standard) — ci-failure — by AndreasKaratzas (closed: 2026-03-05 01:45 (UTC+8))
- #29523 [CI Failure]: mi325_4: Distributed Tests (4 GPUs) — ci-failure — by AndreasKaratzas (closed: 2026-03-05 01:39 (UTC+8))
- #34901 [Doc]: Speculators Docs Follow-ups — documentation — by kylesayrs (closed: 2026-03-05 01:23 (UTC+8))
- #15498 [Doc]: Troubleshooting guide incorrect hardware script fails — documentation,unstale — by surajssd (closed: 2026-03-05 01:15 (UTC+8))
- #34018 [RFC]: Helix (Context + Tensor) Parallelism for Efficient Long-Context Decoding — RFC — by sungsooha (closed: 2026-03-04 23:02 (UTC+8))
- #15853 [Bug]: [V1] Testla T4 cannot work for V1 — usage,unstale — by maobaolong (closed: 2026-03-04 22:00 (UTC+8))
- #35880 [Bug]: Qwen/Qwen3.5-27B-GPTQ-Int4 not working /tmp/tmpb6zywp4k/__triton_launcher.c:7:10: fatal error: Python.h: No such file or directory 7 #include — bug — by NilsHellwig (closed: 2026-03-04 21:02 (UTC+8))
- #34094 [Bug]: assert num_cache_lines >= batch — bug — by vitalik (closed: 2026-03-04 18:56 (UTC+8))
- #35637 [Bug]: mi355 minimax m2.1 arch mxfp4 rocm AITER TP4 error — bug,rocm — by functionstackx (closed: 2026-03-04 16:30 (UTC+8))
- #35868 [Bug] Qwen3-ASR crashes without --enforce-eager: missing dynamic_arg_dims for MRoPE positions — no labels — by TheCodeWrangler (closed: 2026-03-04 16:29 (UTC+8))
- #35547 [Bug]: Long weight loading results in server start failure — bug — by wzhao18 (closed: 2026-03-04 13:54 (UTC+8))
- #35129 [CI Failure]: mi355_4: 2 Node Tests (4 GPUs in total) — ci-failure — by AndreasKaratzas (closed: 2026-03-04 13:38 (UTC+8))
- #29520 [CI Failure]: mi325_1: Multi-Modal Models Test (Standard) — ci-failure — by AndreasKaratzas (closed: 2026-03-04 13:37 (UTC+8))
- #35769 [CI Failure]: mi325_1: Quantized Models Test — ci-failure — by AndreasKaratzas (closed: 2026-03-04 13:37 (UTC+8))
- #29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) — ci-failure — by AndreasKaratzas (closed: 2026-03-04 13:32 (UTC+8))
- #31631 [CI Failure]: mi325_1: V1 Test others — ci-failure — by AndreasKaratzas (closed: 2026-03-04 13:29 (UTC+8))
- #33812 [CI Failure]: mi325_4: LM Eval Large Models (H100) — ci-failure — by AndreasKaratzas (closed: 2026-03-04 13:29 (UTC+8))
- #35836 [Bug]: Support tool_choice=none in Anthropic API — bug — by ZhongsJie (closed: 2026-03-04 13:24 (UTC+8))
- #35233 [CI Failure]: mi355_1: Language Models Test (Extended Generation) — ci-failure — by AndreasKaratzas (closed: 2026-03-04 13:04 (UTC+8))
- #29462 [CI Failure]: mi325_8: Language Models Tests (Hybrid) %N — ci-failure — by AndreasKaratzas (closed: 2026-03-04 13:04 (UTC+8))
New PRs
- #35968 [WIP] [Performance] DeepSeek V3.2 multi-stream indexer overlap — deepseek — by haosdent (opened: 2026-03-04 14:35 (UTC+8))
- #35997 [BugFix][LoRA] Rename vocal_parallel_embedding to vocab_parallel_embedding with compatibility shim — bug — by ShaneTao (opened: 2026-03-04 19:24 (UTC+8))
- #36079 Yejin/bench sleep wake timeout — performance,frontend,tpu,needs-rebase,v1,cpu — by YJYJLee (opened: 2026-03-05 10:59 (UTC+8))
- #36081 Perf: Optimize DeepEP prepare/finalize for identity mapping — no labels — by xueliangyang-oeuler (opened: 2026-03-05 11:11 (UTC+8))
- #35954 [torch.compile] Add AsyncTP fusion on Blackwell using FlashInfer bmm_fp8 and symm_mem impl — bug,torch.compile — by AjAnubolu (opened: 2026-03-04 12:57 (UTC+8))
- #35994 [BUGFIX] Fix Qwen2.5-Omni model audio max_token_per_item estimation error leading to encoder_cache_size is 0 — bug,qwen — by jjmiao1 (opened: 2026-03-04 18:51 (UTC+8))
- #36080 [PluggableLayer][4/N] Apply PluggableLayer to remaining layers — no labels — by whx-sjtu (opened: 2026-03-05 11:09 (UTC+8))
- #35973 [Doc] Add Parallel Draft Models — documentation — by zihaoanllm (opened: 2026-03-04 15:38 (UTC+8))
- #36024 [Misc] Lazy import registered processors — deepseek — by Isotr0py (opened: 2026-03-05 01:24 (UTC+8))
- #36069 fix(lora): bounds-check lora_a/lora_b in MergedColumnParallelLinear.set_lora — no labels — by JackYoung27 (opened: 2026-03-05 08:43 (UTC+8))
- #36045 [Benchmark] Add iteration benchmark with server-side step stats and t… — performance,frontend,tpu,needs-rebase,v1,cpu — by YJYJLee (opened: 2026-03-05 04:26 (UTC+8))
- #36027 Rename compile_ranges_split_points to compile_ranges_endpoints — ready — by copilot-swe-agent (opened: 2026-03-05 01:45 (UTC+8))
- #35982 fix qwen3 streaming reasoning parser when thinking is disabled — qwen — by LiuLi1998 (opened: 2026-03-04 16:45 (UTC+8))
- #36078 Enable ModelRunnerV2 on XPU — v1 — by xinyu-intel (opened: 2026-03-05 10:11 (UTC+8))
- #36049 [CI] Fix pre-commit mypy issue in main — ready — by yewentao256 (opened: 2026-03-05 05:37 (UTC+8))
- #36062 [Kernel] [Helion] [11/N] Retune configs for silu_mul_fp8 — ready — by gmagogsfm (opened: 2026-03-05 07:51 (UTC+8))
- #35985 [BUGFIX] fix cuda memory stat by reserved memory — bug,v1,nvidia — by flutist (opened: 2026-03-04 17:29 (UTC+8))
- #36036 [Bugfix] Fix block_size for hybrid model MTP — bug,speculative-decoding,ready,v1 — by benchislett (opened: 2026-03-05 03:03 (UTC+8))
- #36072 [WIP][Proof of concept] Overlap model loading and torch.compile — frontend,v1 — by zou3519 (opened: 2026-03-05 09:35 (UTC+8))
- #36076 Revert "[Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache" (#30681) — documentation,performance,structured-output,v1,nvidia — by zhewenl (opened: 2026-03-05 10:05 (UTC+8))
- #36018 [Feature] Add response_prefix parameter to audio transcription/translation endpoints — frontend,qwen — by TheCodeWrangler (opened: 2026-03-04 23:48 (UTC+8))
- #36070 [Bugfix][DCP] Fix CUDA graph capture for Decode Context Parallelism — bug,v1,nvidia — by sungsooha (opened: 2026-03-05 09:01 (UTC+8))
- #36074 [Docs] Add doc note about building for free-threaded Python. — documentation — by nascheme (opened: 2026-03-05 09:45 (UTC+8))
- #35984 [XPU] bump vllm-xpu-kernels to v0.1.3 — ready,ci/build — by jikunshang (opened: 2026-03-04 17:06 (UTC+8))
- #36071 [CI] Fix lint: mypy union-attr errors in hermes_tool_parser — no labels — by gmagogsfm (opened: 2026-03-05 09:03 (UTC+8))
- #36042 Fix CUDA graph decode capture crash in AITER FlashAttention — rocm,v1,nvidia,meta-exported,fb-exported — by iseeyuan (opened: 2026-03-05 04:01 (UTC+8))
- #36044 [compile] Reduce log spam from compile. — ready — by zhxchen17 (opened: 2026-03-05 04:18 (UTC+8))
- #36047 [Feature] Add --distributed-timeout-seconds CLI option — v1 — by 842974287 (opened: 2026-03-05 05:10 (UTC+8))
- #36065 [Bugfix] Fix mypy errors in hermes_tool_parser — bug — by AjAnubolu (opened: 2026-03-05 08:10 (UTC+8))
- #36067 set VLLM_USE_BYTECODE_HOOK to 0 by default — no labels — by laithsakka (opened: 2026-03-05 08:26 (UTC+8))
- #36068 [Bugfix] Allow inherited_fds to be None to fix warnings when using spawn — bug,v1 — by tjohnson31415 (opened: 2026-03-05 08:30 (UTC+8))
- #35948 Add granite4 tool parser — documentation,ready,tool-calling — by maxdebayser (opened: 2026-03-04 11:33 (UTC+8))
- #36001 [Bugfix] Fix message queue initialization order for cross-node DP — bug,v1 — by jianzs (opened: 2026-03-04 19:50 (UTC+8))
- #36064 test Qwen/Qwen3-4B-Instruct-2507 for unbacked — qwen — by laithsakka (opened: 2026-03-05 08:00 (UTC+8))
- #35969 [Bugfix] Safe CG padding for FlashMLA FP8 seq_lens and block_table — bug,v1 — by Kevin-XiongC (opened: 2026-03-04 14:38 (UTC+8))
- #36063 [Refactor] Consolidate SupportsEagle — v1,llama,qwen,gpt-oss,kv-connector — by benchislett (opened: 2026-03-05 07:54 (UTC+8))
- #36061 [Bugfix] Fix DP/EP Shared Expert With Monolithic Kernels — bug — by robertgshaw2-redhat (opened: 2026-03-05 07:51 (UTC+8))
- #36060 fix: force prefill path for MTP drafting on SM121 (GB10 Spark) — v1,nvidia — by scottgl9 (opened: 2026-03-05 07:34 (UTC+8))
- #36054 [Bugfix] Fix tokenize endpoint malformed token_strs — bug,frontend — by AjAnubolu (opened: 2026-03-05 06:41 (UTC+8))
- #36057 Adding deterministic lora benchmarking to vLLM Bench — performance — by RonaldBXu (opened: 2026-03-05 06:58 (UTC+8))
- #36041 [Model Runner V2] Add initial CI tests — ready,ci/build — by njhill (opened: 2026-03-05 04:01 (UTC+8))
- #36058 [2/n] Migrate per_token_group_quant to torch stable ABI — ci/build,nvidia — by mikaylagawarecki (opened: 2026-03-05 07:06 (UTC+8))
- #36059 [BugFix] Fallback from FA4->FA2 for Batch Invariance — bug,v1 — by frankwang28 (opened: 2026-03-05 07:08 (UTC+8))
- #35959 [WIP][MRV2] Extensible CG dispatch rework — v1,nvidia — by LucasWilkinson (opened: 2026-03-04 13:36 (UTC+8))
- #36022 [Kernel] Add FlashInfer MoE A2A Kernel — documentation,performance,new-model,rocm,structured-output,frontend,ci/build,v1,multi-modality,qwen — by leo-cf-tian (opened: 2026-03-05 01:01 (UTC+8))
- #36053 [Bugfix] Fix safetensors metadata OOM for large model headers — bug — by AjAnubolu (opened: 2026-03-05 06:40 (UTC+8))
- #36056 [Bugfix] Fix Deepseekv32 tool parser when stream interval > 1 — bug,deepseek — by sfeng33 (opened: 2026-03-05 06:48 (UTC+8))
- #36055 [Bugfix] Fix zombie EngineCore processes after parent exit — bug,v1 — by AjAnubolu (opened: 2026-03-05 06:46 (UTC+8))
- #36052 [Bugfix] Fix async OffloadingConnector silently losing decode-phase blocks — bug,kv-connector — by ZhanqiuHu (opened: 2026-03-05 06:38 (UTC+8))
- #36051 [NIXL][Bugfix] metrics & testing minor bug — bug,v1,kv-connector — by andylolu2 (opened: 2026-03-05 06:34 (UTC+8))
- #36025 [ROCm][CI] Prep Tests For Change To ROCM_ATTN As New Default Backend On ROCm — rocm,ci/build,v1,multi-modality — by micah-wil (opened: 2026-03-05 01:34 (UTC+8))
- #35981 [Misc] Support OOT linear method registering — ready — by shen-shanshan (opened: 2026-03-04 16:24 (UTC+8))
- #36017 [Bugfix] Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 — bug,ready,nvidia — by amitz-nv (opened: 2026-03-04 23:41 (UTC+8))
- #36013 [Bugfix] Fix race in non-blocking num_accepted_tokens GPU->CPU copy — bug,ready,v1 — by tdoublep (opened: 2026-03-04 22:59 (UTC+8))
- #36016 Add GemmaRMSNorm + FP8 quantization fusion support — no labels — by jackzhxng (opened: 2026-03-04 23:37 (UTC+8))
- #36028 [torch.compile] Use Inductor Process Pool in Compilation — v1 — by eellison (opened: 2026-03-05 01:56 (UTC+8))
- #36048 Add unit tests on cutlass moe — nvidia — by guanxingithub (opened: 2026-03-05 05:19 (UTC+8))
- #36043 [ROCM][MORI] Add MoRI & Aiter installation script — rocm — by varun-sundar-rabindranath (opened: 2026-03-05 04:13 (UTC+8))
- #36034 [RoCM] Enable MoRI a2a + AiterExperts — rocm — by varun-sundar-rabindranath (opened: 2026-03-05 02:49 (UTC+8))
- #35961 [Core] Add score mode with perplexity and KLD computation — documentation,v1 — by phaelon74 (opened: 2026-03-04 13:38 (UTC+8))
- #36046 [Frontend] Add SubscribeKvEvents KV event streaming bridge — documentation,frontend — by smfirmin (opened: 2026-03-05 04:32 (UTC+8))
- #35949 [MoE Refactor] Move the shared/fused expert output sum into MoERunnerBase — needs-rebase,llama,qwen,deepseek,gpt-oss,nvidia — by bnellnm (opened: 2026-03-04 11:40 (UTC+8))
- #36011 [Bugfix] Fix Harmony streaming cross-channel delta accumulation — bug,frontend,gpt-oss — by will-deines (opened: 2026-03-04 21:53 (UTC+8))
- #36040 [CI] Add NvFP4 Blackwell Tests — ci/build — by ojhaanshika (opened: 2026-03-05 03:46 (UTC+8))
- #36019 [Model Runner V2] Fix pooling — v1 — by njhill (opened: 2026-03-05 00:06 (UTC+8))
- #35947 fix: Software E2M1 conversion for SM12x NVFP4 activation quantization — nvidia — by blake-snc (opened: 2026-03-04 11:29 (UTC+8))
- #36039 Add unit tests on flashinfer trtllm moe — nvidia — by guanxingithub (opened: 2026-03-05 03:46 (UTC+8))
- #36038 [compile][graph_partition] Add tensor size handling — no labels — by fxdawnn (opened: 2026-03-05 03:43 (UTC+8))
- #36035 Fix random dataset parameter defaults silently overriding user input — performance — by khairulkabir1661 (opened: 2026-03-05 02:54 (UTC+8))
- #35963 [Feature] ViT Full CUDA Graph — v1,multi-modality,qwen,nvidia — by b-mu (opened: 2026-03-04 13:56 (UTC+8))
- #36000 [Doc] Fix GPU Worker count in Process Count Summary — documentation,ready — by simone-dotolo (opened: 2026-03-04 19:45 (UTC+8))
- #36023 Perf: Optimize CPU detokinzation for long decodes — v1 — by amirkl94 (opened: 2026-03-05 01:09 (UTC+8))
- #36033 [RoCM] Enable MoRI a2a + AiterExperts — rocm — by varun-sundar-rabindranath (opened: 2026-03-05 02:45 (UTC+8))
- #36032 qwen3coder tool parser fix anyOf double encoded parameters — qwen — by cmunley1 (opened: 2026-03-05 02:39 (UTC+8))
- #35952 refactor(metrics): consolidate histogram bucket definitions into buck… — v1 — by npache (opened: 2026-03-04 11:55 (UTC+8))
- #36029 [SpecDecode][Benchmark] Add SPEED-bench support to benchmarking CLI — documentation,performance — by talorabr (opened: 2026-03-05 02:02 (UTC+8))
- #36030 [BugFix] fix max memory usage — bug,v1 — by peakcrosser7 (opened: 2026-03-05 02:14 (UTC+8))
- #36026 [Bugfix] Fix wrong num_experts in invoke_fused_moe_wna16 kernels — bug — by OiPunk (opened: 2026-03-05 01:39 (UTC+8))
- #35964 Fix phi4-mm and remove cuda binding — ready,nvidia — by yma11 (opened: 2026-03-04 13:57 (UTC+8))
- #36021 bump flashinfer v0.6.4 -> v0.6.5 — ci/build,nvidia — by netanel-haber (opened: 2026-03-05 00:19 (UTC+8))
- #36020 [Misc] Fix SyntaxWarning - invalid escape sequence '\e' — frontend,ready — by cjackal (opened: 2026-03-05 00:15 (UTC+8))
- #36006 [Misc] Remove deprecated items that are due for removal — frontend,ready,multi-modality — by hickeyma (opened: 2026-03-04 20:27 (UTC+8))
- #36014 fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests — kv-connector — by machov (opened: 2026-03-04 23:03 (UTC+8))
- #35986 Add support for ModelOpt MXFP8 MoE models — no labels — by danisereb (opened: 2026-03-04 17:36 (UTC+8))
- #36012 [Performance] Add prefetch for checkpoints to OS page cache — no labels — by arpera (opened: 2026-03-04 22:32 (UTC+8))
- #35996 [Bugfix] Make kaldi_native_fbank optional — bug,ready,ci/build — by DarkLight1337 (opened: 2026-03-04 19:09 (UTC+8))
- #35970 In-Tree AMD Zen CPU Backend via zentorch [1/N] — rocm,ci/build,cpu — by amd-lalithnc (opened: 2026-03-04 14:46 (UTC+8))
- #36003 [Bugfix] Fix DeepSeek V3.2 tool parser — bug,deepseek — by sergeyrid (opened: 2026-03-04 20:22 (UTC+8))
- #36009 [FA/Chore] Bump FA version for FP8 two-level accumulation every n steps — ci/build — by PatrykSaffer (opened: 2026-03-04 21:38 (UTC+8))
- #36007 fix(compilation): optimize kv cache update for faster cold start compilation — documentation — by fourierr (opened: 2026-03-04 20:49 (UTC+8))
- #36002 [Core] Add sharding metadata to model parameters — no labels — by PrasanaaV (opened: 2026-03-04 20:13 (UTC+8))
- #36005 fix(xpu): handle mem_get_info failure on XPU simulator (testing only) — no labels — by lukaszszady (opened: 2026-03-04 20:25 (UTC+8))
- #35976 [Bugfix] Fix RunAI streamer crash with S3-hosted model paths — bug,frontend,v1 — by AjAnubolu (opened: 2026-03-04 16:10 (UTC+8))
- #35995 Fix: support auto_tune to run local model — performance — by panpan0000 (opened: 2026-03-04 19:04 (UTC+8))
- #35990 [Core] Add contiguous block allocation mode for KV cache — v1 — by xiaguan (opened: 2026-03-04 18:05 (UTC+8))
- #35993 [Bugfix] Fix the batch invariance benchmark script — bug,performance — by jzakrzew (opened: 2026-03-04 18:27 (UTC+8))
- #35972 [Bugfix] Guard non-2D weights in dispatch_cpu_unquantized_gemm — bug — by tanmayc25 (opened: 2026-03-04 15:34 (UTC+8))
- #35974 [Bugfix] Qwen3.5-397B-A17B model loading with transformer=5.2 — bug,qwen — by dolphin8 (opened: 2026-03-04 15:54 (UTC+8))
- #35988 Add support to pass custom presence_penalty and frequency_penalty parameters in the generation config — frontend — by mobicham (opened: 2026-03-04 17:54 (UTC+8))
- #35971 refine vllm bench throughput --backend hf — performance — by jikunshang (opened: 2026-03-04 15:11 (UTC+8))
- #35977 [Bugfix] Fix DP port conflict race condition with late binding — bug,v1 — by AjAnubolu (opened: 2026-03-04 16:10 (UTC+8))
- #35978 [Bugfix] Fix suffix decoding crash from negative num_new_tokens — bug,v1 — by AjAnubolu (opened: 2026-03-04 16:10 (UTC+8))
- #35975 [Core] Skip inputs_embeds buffer allocation for text-only models — speculative-decoding,v1 — by AjAnubolu (opened: 2026-03-04 16:10 (UTC+8))
- #35979 [Bugfix] Fix MXFP4A16 degenerate output via compressed-tensors bump — bug,ci/build — by AjAnubolu (opened: 2026-03-04 16:10 (UTC+8))
- #35957 fix online fp8 for MiniCPM models — no labels — by yma11 (opened: 2026-03-04 13:27 (UTC+8))
- #35966 [feat] Kimi K2/DeepSeek Support eagle3 — speculative-decoding,v1,deepseek — by leihuang-sketch (opened: 2026-03-04 14:26 (UTC+8))
- #35965 [Core] Add KV transfer support to sparse attention indexer — no labels — by wz1qqx (opened: 2026-03-04 14:16 (UTC+8))
- #35958 Perf: Enable double-buffered chunked EP communication in DeepEP — no labels — by xueliangyang-oeuler (opened: 2026-03-04 13:35 (UTC+8))
- #35955 Perf: Implement fused sort/scan for MoE block alignment using Triton — no labels — by xueliangyang-oeuler (opened: 2026-03-04 13:10 (UTC+8))
- #35960 [Bugfix] Add regression test for allreduce RMS fusion with PP — bug — by robellliu-dev (opened: 2026-03-04 13:37 (UTC+8))
- #35956 [Bugfix] Narrow kv_cache mempool context to prevent sleep mode regressions — bug,v1 — by AjAnubolu (opened: 2026-03-04 13:24 (UTC+8))
- #35953 [DO NOT MERGE][Perf] Avoid tokenizing dummy text when running MM processor — ready,multi-modality — by DarkLight1337 (opened: 2026-03-04 12:29 (UTC+8))
- #35951 Perf: Optimize regex patterns in MiniMaxM2ToolParser — no labels — by xueliangyang-oeuler (opened: 2026-03-04 11:54 (UTC+8))
Merged PRs
- #29856 [Model] Add LoRA support for Whisper models — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by daje0601 (合并于: 2026-03-05 10:38 (UTC+8))
- #36049 [CI] Fix pre-commit mypy issue in main — ready — by yewentao256 (合并于: 2026-03-05 10:13 (UTC+8))
- #35984 [XPU] bump vllm-xpu-kernels to v0.1.3 — ready,ci/build — by jikunshang (合并于: 2026-03-04 18:23 (UTC+8))
- #34760 Add platform method to enable custom collective ops registration — rocm,ready,nvidia,meta-exported,fb-exported — by nkm-meta (合并于: 2026-03-05 08:50 (UTC+8))
- #36044 [compile] Reduce log spam from compile. — ready — by zhxchen17 (合并于: 2026-03-05 08:48 (UTC+8))
- #35941 [Model Runner V2] Misc code simplification — v1 — by njhill (合并于: 2026-03-05 07:26 (UTC+8))
- #35239 [ROCm][CI] Added MI325 mirrors (stage C) — rocm,ready,ci/build — by AndreasKaratzas (合并于: 2026-03-05 06:48 (UTC+8))
- #35981 [Misc] Support OOT linear method registering — ready — by shen-shanshan (合并于: 2026-03-05 06:25 (UTC+8))
- #36017 [Bugfix] Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 — bug,ready,nvidia — by amitz-nv (合并于: 2026-03-05 06:23 (UTC+8))
- #35928 [RL] [Weight Sync] Guard IPC update-info pickle deserialization behind insecure serialization flag — ready,codex,aardvark — by simon-mo (合并于: 2026-03-05 06:05 (UTC+8))
- #36013 [Bugfix] Fix race in non-blocking num_accepted_tokens GPU->CPU copy — bug,ready,v1 — by tdoublep (合并于: 2026-03-05 05:52 (UTC+8))
- #34950 [CI] Bump
mypyversion — ready,v1,kv-connector,nvidia — by hmellor (合并于: 2026-03-05 04:55 (UTC+8)) - #35240 Add PyTorch profiler schedule support with warmup/active iterations — ready,meta-exported,fb-exported — by fenypatel99 (合并于: 2026-03-05 04:53 (UTC+8))
- #35139 [Docs] Document security risks of GPT-OSS Python tool — documentation,frontend,ready,gpt-oss — by russellb (合并于: 2026-03-05 04:27 (UTC+8))
- #35678 [UX] Remove NoOpOffloader log — ready — by robertgshaw2-redhat (合并于: 2026-03-05 04:13 (UTC+8))
- #35472 [torch.compile] Stop lazily compiling — ready-run-all-tests — by zou3519 (合并于: 2026-03-05 04:13 (UTC+8))
- #36019 [Model Runner V2] Fix pooling — v1 — by njhill (合并于: 2026-03-05 02:53 (UTC+8))
- #32441 [Docs] Clarify structured outputs configuration for Qwen3 reasoning mode — documentation,structured-output,ready,qwen — by davzaman (合并于: 2026-03-05 03:44 (UTC+8))
- #35871 [CI] Add Blackwell AsyncTP correctness test — ready,ci/build — by stecasta (合并于: 2026-03-05 03:41 (UTC+8))
- #36000 [Doc] Fix GPU Worker count in Process Count Summary — documentation,ready — by simone-dotolo (合并于: 2026-03-05 01:03 (UTC+8))
- #35397 [Kernel][Mamba] Optimize Mamba2 SSD prefill Triton kernels — ready — by tomeras91 (合并于: 2026-03-05 02:47 (UTC+8))
- #34551 [Frontend] Add vllm launch command for GPU-less preprocessing serving — frontend,ready,v1,cpu — by hyeongyun0916 (合并于: 2026-03-05 02:41 (UTC+8))
- #34882 docs: update CPU Docker images to reference Docker Hub instead of AWS ECR — documentation,ready,cpu — by cluster2600 (合并于: 2026-03-05 02:31 (UTC+8))
- #32454 docs: add version requirement note for –profiler-config flag — documentation,ready — by abhishkh (合并于: 2026-03-05 02:13 (UTC+8))
- #34531 [Docs] Add RunPod GPU deployment guide for vLLM — documentation,ready — by lisperz (合并于: 2026-03-05 02:11 (UTC+8))
- #35218 [Docs] Upgrade dynamic LoRA warning to admonition block — documentation,ready — by russellb (合并于: 2026-03-05 02:01 (UTC+8))
- #35481 [Feature][CI]: compare `func` & `no_func` outputs in test_functionalization.py — ready — by 11happy (合并于: 2026-03-05 02:01 (UTC+8))
- #30677 [Docs] Update design/multiprocessing.md — documentation,ready — by windsonsea (合并于: 2026-03-05 01:59 (UTC+8))
- #34127 fix minicpmo4.5: fix attn_mask in vit attn && fix resampler pos_emb i… — ready — by tc-mb (合并于: 2026-03-05 01:46 (UTC+8))
- #35090 docs(cpu): Clarify pre-built wheels requirement for CPU Python-only build — documentation,ready,cpu — by sagearc (合并于: 2026-03-05 01:45 (UTC+8))
- #35197 [Doc] Add MTP docs and update speculative decoding guidance — documentation,ready — by XingLiu1 (合并于: 2026-03-05 01:23 (UTC+8))
- #34784 fix(docs): use static rdzv backend in multi-node troubleshooting script — documentation,ready — by machov (合并于: 2026-03-05 01:15 (UTC+8))
- #35893 [ROCm][Bugfix] Fall back from CK MXFP4 MoE when GEMM dimensions are unsupported — bug,rocm,ready — by ChuanLi1101 (合并于: 2026-03-04 16:30 (UTC+8))
- #35933 docs: add README for logits_processor examples — documentation,ready — by mitre88 (合并于: 2026-03-05 01:09 (UTC+8))
- #35964 Fix phi4-mm and remove cuda binding — ready,nvidia — by yma11 (合并于: 2026-03-05 01:08 (UTC+8))
- #35653 Use MMEncoderAttention (=use FlashAttention) instead of torch.sdpa in radio.py — ready — by netanel-haber (合并于: 2026-03-05 00:43 (UTC+8))
- #35756 Split generic IO Processor plugins tests from Terratorch specific ones — ready,ci/build — by christian-pinto (合并于: 2026-03-05 00:01 (UTC+8))
- #35738 [Misc] Add `--attention-backend auto` option — ready — by NickLucche (合并于: 2026-03-04 23:12 (UTC+8))
- #28053 [Core] Remove busy loop from idle buffer readers — ready,v1 — by joerunde (合并于: 2026-03-04 15:44 (UTC+8))
- #34883 [Core] Add All-to-All communication backend for DCP — ready,v1,nvidia — by sungsooha (合并于: 2026-03-04 23:01 (UTC+8))
- #35996 [Bugfix] Make `kaldi_native_fbank` optional — bug,ready,ci/build — by DarkLight1337 (合并于: 2026-03-04 22:47 (UTC+8))
- #34783 [BugFix] Fix implicit and incorrect assumption on ECConnector is_producer — bug,documentation,ready,v1,kv-connector — by furionw (合并于: 2026-03-04 22:01 (UTC+8))
- #35656 [Bugfix][Model] Fix FP8 k_scale/v_scale not loaded for Qwen3-MoE — bug,ready,qwen — by oneraghavan (合并于: 2026-03-04 21:15 (UTC+8))
- #35846 [Bugfix][CPUOffloadingManager] Prevent eviction of already-stored blocks in LRU/ARC `prepare_store()` — bug,ready,v1 — by ronensc (合并于: 2026-03-04 20:25 (UTC+8))
- #35640 [MISC] fixed tool_parser mypy errors — ready — by taneem-ibrahim (合并于: 2026-03-04 20:23 (UTC+8))
- #35500 [Feature] Add basic metrics for /realtime endpoint — frontend,ready — by pougetat (合并于: 2026-03-04 19:56 (UTC+8))
- #34571 [Bugfix] Cap FULL decode cudagraph sizes for Mamba/hybrid models (#34094) — bug,ready,v1,nvidia — by haosdent (合并于: 2026-03-04 18:56 (UTC+8))
- #30681 [Hardware] Replace `torch.cuda.empty_cache` with `torch.accelerator.empty_cache` — documentation,performance,rocm,structured-output,ready,v1,nvidia,ready-run-all-tests — by jikunshang (合并于: 2026-03-04 17:49 (UTC+8))
- #35869 [Bugfix] Add missing dynamic_arg_dims for Qwen3-ASR torch.compile — bug,ready,qwen — by TheCodeWrangler (合并于: 2026-03-04 16:29 (UTC+8))
- #35539 Support Audio Extraction from MP4 Video for Nemotron Nano VL — documentation,rocm,frontend,ready,ci/build,v1,multi-modality,cpu,nvidia — by askliar (合并于: 2026-03-04 15:20 (UTC+8))
- #35654 [cohere][fix][spec-decode]: fix crash when allowed_token_ids is set without penalties — ready,v1 — by kkt-cohere (合并于: 2026-03-04 15:20 (UTC+8))
- #35616 [Bugfix] Improve engine ready timeout error message — bug,ready,v1 — by lailoo (合并于: 2026-03-04 13:54 (UTC+8))
- #35835 [BugFix] Support tool_choice=none in the Anthropic API — bug,frontend,ready — by ZhongsJie (合并于: 2026-03-04 13:24 (UTC+8))
- #35913 [Rocm][CI] Fix ROCm LM Eval Large Models (8 Card) — rocm,ready,ci/build — by charlifu (合并于: 2026-03-04 12:50 (UTC+8))
- #35711 [Bugfix] Guard mm_token_type_ids kwarg in get_mrope_input_positions — bug,ready — by AndreasKaratzas (合并于: 2026-03-04 12:12 (UTC+8))
- #35883 [Chore] Remove debug code in model implementation — ready — by Isotr0py (合并于: 2026-03-04 11:50 (UTC+8))
- #35872 [Refactor] Clean up processor kwargs extraction — ready — by DarkLight1337 (合并于: 2026-03-04 11:53 (UTC+8))
- #35727 [model] support FireRedASR2 — documentation,new-model,ready,ci/build — by AllenDou (合并于: 2026-03-04 11:41 (UTC+8))
- #33753 [PluggableLayer][MM] Add PluggableLayer for RelPosAttention — documentation,ready — by shen-shanshan (合并于: 2026-03-04 11:41 (UTC+8))
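上表中 #35846 修复的是一个典型的缓存淘汰问题:LRU/ARC 淘汰逻辑可能误删已标记为"已存储"的块。下面是该模式的一个极简 Python 示意(仅作说明,类名 `LRUOffloadCache`、方法名均为假设,并非 vLLM 的实际 API):

```python
from collections import OrderedDict


class LRUOffloadCache:
    """极简 LRU 缓存示意:已标记为 stored 的块被"钉住",
    prepare_store() 腾出空间时会跳过它们,只淘汰未存储的块。"""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks: OrderedDict[str, bool] = OrderedDict()  # block_id -> 占位
        self.stored: set[str] = set()  # 不可被淘汰的块 id

    def mark_stored(self, block_id: str) -> None:
        self.stored.add(block_id)

    def prepare_store(self, block_id: str) -> None:
        """为新块腾出空间;绝不淘汰已存储(被钉住)的块。"""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # 命中则刷新 LRU 位置
            return
        while len(self.blocks) >= self.capacity:
            # 找到最久未使用且未被钉住的块作为淘汰对象
            victim = next((b for b in self.blocks if b not in self.stored), None)
            if victim is None:
                raise RuntimeError("cache is full of pinned (stored) blocks")
            del self.blocks[victim]
        self.blocks[block_id] = True


cache = LRUOffloadCache(capacity=2)
cache.prepare_store("a")
cache.mark_stored("a")      # "a" 被钉住
cache.prepare_store("b")
cache.prepare_store("c")    # 缓存已满:淘汰 "b" 而不是已存储的 "a"
```

修复前的朴素实现会无条件淘汰 LRU 头部("a"),导致已下沉到存储层的块状态不一致;修复后的思路即如上:淘汰时跳过被钉住的块。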
关闭但未合并的 PR
- #27360 [Feature] Implement custom pipeline parallel rank ordering for Ray workers — documentation,new-model,rocm,frontend,ci/build,stale,v1,multi-modality — by JorgenTrondsen (关闭于: 2026-03-05 10:16 (UTC+8))
- #36071 [CI] Fix lint: mypy union-attr errors in hermes_tool_parser — 无标签 — by gmagogsfm (关闭于: 2026-03-05 09:15 (UTC+8))
- #36065 [Bugfix] Fix mypy errors in hermes_tool_parser — bug — by AjAnubolu (关闭于: 2026-03-05 08:42 (UTC+8))
- #36001 [Bugfix] Fix message queue initialization order for cross-node DP — bug,v1 — by jianzs (关闭于: 2026-03-05 08:15 (UTC+8))
- #36046 [Frontend] Add SubscribeKvEvents KV event streaming bridge — documentation,frontend — by smfirmin (关闭于: 2026-03-05 04:38 (UTC+8))
- #36033 [RoCM] Enable MoRI a2a + AiterExperts — rocm — by varun-sundar-rabindranath (关闭于: 2026-03-05 02:45 (UTC+8))
- #32170 docs: fix typo in docker_run_bm.sh — tpu,ci/build — by darshan-stack (关闭于: 2026-03-05 02:11 (UTC+8))
- #27359 [Doc] Refactor import statements for `oneshot` in quantization docs to support newer llm-compressor version — documentation,ready,stale — by xxrjun (关闭于: 2026-03-05 02:06 (UTC+8))
- #20580 [V1] Correct V1 parallel sampling params — v1 — by imkero (关闭于: 2026-03-05 00:17 (UTC+8))
- #29753 docs: add guide to reduce PyTorch profiler overhead via env vars (#29564) — documentation,stale — by kbp4154 (关闭于: 2026-03-05 01:54 (UTC+8))
- #19858 optimze attn — needs-rebase,unstale,qwen — by momo609 (关闭于: 2026-03-05 01:50 (UTC+8))
- #19829 Move Gemma’s stacked_params_mapping to class scope — 无标签 — by yhtang (关闭于: 2026-03-05 01:49 (UTC+8))
- #19816 Introduce RayCudaCommunicator as Ray Compiled Graph communicator — needs-rebase,unstale,nvidia — by ruisearch42 (关闭于: 2026-03-05 01:48 (UTC+8))
- #19721 [Kernel] Masked act_mul and fp8-quant Kernels for Batched MoE — needs-rebase,qwen — by varun-sundar-rabindranath (关闭于: 2026-03-05 01:46 (UTC+8))
- #19658 [EPLB]: Support offline expert load distribution recording — frontend,needs-rebase,v1 — by jianzs (关闭于: 2026-03-05 01:46 (UTC+8))
- #19527 Add NotImplementedError to v1 cpu runner — v1,cpu — by fred2167 (关闭于: 2026-03-05 01:42 (UTC+8))
- #19548 [CI] bump mypy version to 1.16.0 — documentation,frontend,needs-rebase,unstale,tool-calling — by andyxning (关闭于: 2026-03-05 01:43 (UTC+8))
- #19456 [Misc] Add logfmt8 quantization support — documentation,needs-rebase,ci/build — by anniegracehu (关闭于: 2026-03-05 01:41 (UTC+8))
- #19443 [Kernel] Integrate IBM/Applied-AI fused moe kernels — nvidia — by varun-sundar-rabindranath (关闭于: 2026-03-05 01:40 (UTC+8))
- #19406 qwen optimze — needs-rebase,unstale,qwen — by momo609 (关闭于: 2026-03-05 01:40 (UTC+8))
- #19387 log process name and id — needs-rebase,unstale,v1 — by helunwencser (关闭于: 2026-03-05 01:39 (UTC+8))
- #19345 qwen optimze — unstale,qwen — by momo609 (关闭于: 2026-03-05 01:38 (UTC+8))
- #19080 [P/D] Exchange NIXL metadata through rank 0 — needs-rebase,unstale,kv-connector — by ptarasiewiczNV (关闭于: 2026-03-05 01:37 (UTC+8))
- #18780 [V1][Feat] Fail request if FSM fails to advance — structured-output,needs-rebase,unstale,v1 — by atbe (关闭于: 2026-03-05 01:36 (UTC+8))
- #19878 Add page-aligned prefill scheduling. — v1 — by py4 (关闭于: 2026-03-05 01:35 (UTC+8))
- #35320 [CI] Add nsys profiling support with NVTX tracing — ci/build — by ojhaanshika (关闭于: 2026-03-05 01:13 (UTC+8))
- #32217 [Draft][Kernel] Add new flashinfer A2A kernel — needs-rebase,nvidia — by hjjq (关闭于: 2026-03-05 01:08 (UTC+8))
- #35839 docs: add comment explaining VLLM_FUSED_MOE_CHUNK_SIZE — 无标签 — by Jah-yee (关闭于: 2026-03-05 01:00 (UTC+8))
- #35841 docs: add comment for VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE — 无标签 — by Jah-yee (关闭于: 2026-03-05 01:00 (UTC+8))
- #35842 docs: add comment for VLLM_WORKER_MULTIPROC_METHOD — 无标签 — by Jah-yee (关闭于: 2026-03-05 01:00 (UTC+8))
- #35843 docs: add comments for assets cache variables — 无标签 — by Jah-yee (关闭于: 2026-03-05 00:59 (UTC+8))
- #19924 enable multiple ssm groups duplication — needs-rebase,unstale — by ilyasch2 (关闭于: 2026-03-05 00:50 (UTC+8))
- #20194 FlashInfer generated decode kernels. — nvidia — by wenscarl (关闭于: 2026-03-05 00:47 (UTC+8))
- #20027 enable torchao for AMD — rocm,needs-rebase,unstale — by jcaip (关闭于: 2026-03-05 00:47 (UTC+8))
- #20229 [Core,Frontend,Doc] Trace v1 cuda start up with opentelemetry (vllm-project#19318) — documentation,new-model,frontend,needs-rebase,v1,startup-ux,nvidia — by ibl-g (关闭于: 2026-03-05 00:23 (UTC+8))
- #35417 [Bugfix] Fix EC connector unit tests after has_cache_item API change — bug,needs-rebase,v1 — by pakkah (关闭于: 2026-03-05 00:24 (UTC+8))
- #20239 [Frontend][Model] Qwen3Rerank API Server backward compatibility — frontend,needs-rebase,unstale,qwen — by BetterAndBetterII (关闭于: 2026-03-05 00:22 (UTC+8))
- #20292 [Hardware][RISC-V] Add RISC-V architecture cpu inference support — needs-rebase,ci/build,stale — by huangzhengx (关闭于: 2026-03-05 00:20 (UTC+8))
- #20473 [WIP][Hardware][CPU] testing branch for mlperf — documentation,needs-rebase,ci/build,v1,llama,cpu — by bigPYJ1151 (关闭于: 2026-03-05 00:19 (UTC+8))
- #20471 Map Mistral-HF models back onto Mistral format on-the-fly — new-model — by sjuxax (关闭于: 2026-03-05 00:19 (UTC+8))
- #20503 feat: Add streaming support for Mistral v11 tool format — frontend,needs-rebase,unstale,tool-calling — by sjuxax (关闭于: 2026-03-05 00:18 (UTC+8))
- #20761 A developer friendly tool for multi-instance deployment with Ray Implementation — documentation — by Gongzq5 (关闭于: 2026-03-05 00:16 (UTC+8))
- #20848 [WIP] Enable xpu sleep mode — v1 — by yangw1234 (关闭于: 2026-03-05 00:14 (UTC+8))
- #20870 [EPLB] Add EPLB support for dots1 — 无标签 — by wenchen76 (关闭于: 2026-03-05 00:14 (UTC+8))
- #20872 [EPLB] Add EPLB support for OLMoE — needs-rebase,unstale — by ztang2370 (关闭于: 2026-03-05 00:13 (UTC+8))
- #20886 Add CheXAgent model integration with tests and documentation — documentation,new-model,ci/build,multi-modality — by WeiqiangLv (关闭于: 2026-03-05 00:12 (UTC+8))
- #20982 [Kernel] DeepGemm MoE : Integrate cuda moe permute/unpermute — performance,needs-rebase,nvidia — by varun-sundar-rabindranath (关闭于: 2026-03-05 00:10 (UTC+8))
- #21184 Some initial Vulkan boilerplate — needs-rebase,ci/build,unstale — by ericcurtin (关闭于: 2026-03-05 00:09 (UTC+8))
- #21273 [EPLB]: Add EPLB support for Grok1 [WIP] — needs-rebase — by jennifurhe (关闭于: 2026-03-05 00:06 (UTC+8))
- #21290 [Feature][EPLB] Add support for Qwen3 EPLB — needs-rebase,unstale,qwen — by hsliuustc (关闭于: 2026-03-05 00:04 (UTC+8))
- #21317 adds include_thinking optional Param to Request object to preserve re… — frontend,needs-rebase,unstale — by arpitg1991 (关闭于: 2026-03-05 00:03 (UTC+8))
- #21655 Keep reasoning content before applying chat template — frontend,needs-rebase,unstale — by lhdeng-gh (关闭于: 2026-03-05 00:01 (UTC+8))
- #21670 [Model] [Draft PR] Add support for SmallThinker model series — documentation,new-model,needs-rebase — by SorryMaker2022 (关闭于: 2026-03-04 23:58 (UTC+8))
- #21722 fix the mxfp4 packed qk weight loading issue for llama4 — needs-rebase,llama — by xuebwang-amd (关闭于: 2026-03-04 23:55 (UTC+8))
- #21732 [EPLB] Dynamic EPLB Metrics — needs-rebase,unstale,v1 — by baxingpiaochong (关闭于: 2026-03-04 23:55 (UTC+8))
- #21738 [Feat] Import KV connector module dynamically for v0 — needs-rebase,unstale,kv-connector — by JiamingMai (关闭于: 2026-03-04 23:54 (UTC+8))
- #21962 WIP: allow model to be loaded dynamically — frontend,needs-rebase,v1 — by lionelvillard (关闭于: 2026-03-04 23:53 (UTC+8))
- #21969 [Misc] DeepEPLLPrepareAndFinalize - Cleanup — 无标签 — by varun-sundar-rabindranath (关闭于: 2026-03-04 23:53 (UTC+8))
- #22075 [V1] Enhanced Exception Handling for KV Cache Loading from Remote Store — needs-rebase,v1,kv-connector — by liuyumoye (关闭于: 2026-03-04 23:52 (UTC+8))
- #22219 [Quant] Refactor CompressedTensorsConfig — needs-rebase — by kylesayrs (关闭于: 2026-03-04 23:49 (UTC+8))
- #22345 [Model]Force use triton compressed_tensor_moe instead of cutlass — needs-rebase,unstale,nvidia — by access2rohit (关闭于: 2026-03-04 23:49 (UTC+8))
- #22392 Ability to use custom-all-reduce on systems with more than 2 PCIe GPUs via env var — rocm,needs-rebase,stale — by avtc (关闭于: 2026-03-04 23:46 (UTC+8))
- #22461 [Feature] add procese set cpu affinity current gpu device — stale,v1 — by lengrongfu (关闭于: 2026-03-04 23:44 (UTC+8))
- #22607 [Feat][KV offload][WIP] Separated process for CPU KV cache processing — 无标签 — by ApostaC (关闭于: 2026-03-04 23:43 (UTC+8))
- #22608 [Feat][KV offloading][WIP] The prototype implementation of a KV offloader used in CPU KV server — 无标签 — by ApostaC (关闭于: 2026-03-04 23:43 (UTC+8))
- #22618 [feat] added the optimized config for Qwen3-30B-A3B Fp8 — qwen — by sara4dev (关闭于: 2026-03-04 23:42 (UTC+8))
- #22687 feat(perf): Accelerate TP All-Reduce using Triton-Distributed — needs-rebase,unstale,v1 — by preminstrel (关闭于: 2026-03-04 23:41 (UTC+8))
- #22694 [Bug Fix] Correctly parse Hermes tool calls — bug,frontend,needs-rebase,tool-calling — by minhsaco99 (关闭于: 2026-03-04 23:41 (UTC+8))
- #22854 [Bugfix] Fix dual-stack binding when `--host` is empty — frontend,needs-rebase,stale,v1 — by Shrey1306 (关闭于: 2026-03-04 23:40 (UTC+8))
- #22865 [Core] Optimize swap_states() to copy only valid tokens instead of full rows — stale,v1 — by arjunbreddy22 (关闭于: 2026-03-04 23:39 (UTC+8))
- #22904 [Core] Implement the delay_factor parameter in the v1 scheduler — needs-rebase,stale,v1 — by wz202020 (关闭于: 2026-03-04 23:38 (UTC+8))
- #22977 [Frontend] Complete Redesign of Tool Calling — frontend,tool-calling — by chaunceyjiang (关闭于: 2026-03-04 23:36 (UTC+8))
- #18765 [WIP] Add a metric to track request failures — documentation,frontend — by harche (关闭于: 2026-03-04 23:16 (UTC+8))
- #18994 Self-Speculative Decoding using LayerSkip — documentation,speculative-decoding,needs-rebase,v1,llama,qwen — by aniltolwani (关闭于: 2026-03-04 23:18 (UTC+8))
- #18477 [Bugfix][Frontend] support webm with audioread fallback — frontend,needs-rebase,ci/build,stale — by cpwan (关闭于: 2026-03-04 23:16 (UTC+8))
- #18475 [Misc][benchmark] add warmup; add e2el_per_concurrency and throughput; add random_output_ratio — performance,needs-rebase — by yuzho-amd (关闭于: 2026-03-04 23:15 (UTC+8))
- #18325 Refactor: Prioritize Prefill Requests in Scheduler Output — v1 — by qiaoli31 (关闭于: 2026-03-04 23:15 (UTC+8))
- #18298 [Don’t merge] Debug failing quantization test with input batch move — needs-rebase,v1,qwen — by heheda12345 (关闭于: 2026-03-04 23:14 (UTC+8))
- #17563 AMD tests updated experiment — rocm,ci/build,unstale — by Concurrensee (关闭于: 2026-03-04 23:02 (UTC+8))
- #16989 [Hardware][TPU][V1] Better tpu multilora compilation — tpu,needs-rebase,ci/build,v1 — by jdefreitas02 (关闭于: 2026-03-04 22:57 (UTC+8))
- #16919 [Bugfix] Fix the missing '}' issue for nested object parameters in stream function call. — bug,frontend,needs-rebase,unstale,tool-calling,qwen — by hewu2008 (关闭于: 2026-03-04 22:56 (UTC+8))
- #16935 Add docker to build vllm against torch nightly — rocm,needs-rebase,ci/build — by yangw-dev (关闭于: 2026-03-04 22:57 (UTC+8))
- #16849 [CORE] Eliminate Occasional Scheduling Delay for Parallel Sampling — v1 — by dblincoe (关闭于: 2026-03-04 22:56 (UTC+8))
- #16844 Add initial HPU support for V1 — needs-rebase,ci/build,v1 — by kzawora-intel (关闭于: 2026-03-04 22:55 (UTC+8))
- #16739 Add key latencies to v1 RequestMetrics instance so it can be surfaced… — needs-rebase,unstale,v1 — by Macchiato123000 (关闭于: 2026-03-04 22:54 (UTC+8))
- #16677 [NIXL] vllm v0 nixl integration — frontend,speculative-decoding,needs-rebase,deepseek,cpu,kv-connector — by rainj-me (关闭于: 2026-03-04 22:51 (UTC+8))
- #16315 Add initial fbgemm deps — documentation — by jianyuh (关闭于: 2026-03-04 22:51 (UTC+8))
- #16286 [Model] Run the QK norm in a single op in Llama4 — needs-rebase,llama — by houseroad (关闭于: 2026-03-04 22:50 (UTC+8))
- #16160 Support R-KV Cache Compression in vLLM — needs-rebase,v1 — by yeyang-zhou (关闭于: 2026-03-04 22:49 (UTC+8))
- #15765 Add changes for cascade optimizations — documentation,frontend,needs-rebase,ci/build,v1,kv-connector,nvidia — by plops655 (关闭于: 2026-03-04 22:49 (UTC+8))
- #15957 DeepGemm MoE expert map support — needs-rebase — by bnellnm (关闭于: 2026-03-04 22:49 (UTC+8))
- #15643 [draft] reproduce tpu v1 correctness issue with chunked prefill. — tpu,needs-rebase,v1 — by Chenyaaang (关闭于: 2026-03-04 22:47 (UTC+8))
- #15641 adds native FastAPI bearer auth — frontend,needs-rebase — by charlesfrye (关闭于: 2026-03-04 22:47 (UTC+8))
- #15405 [Doc] Add multi-modal development example for encoder-decoder models — documentation,needs-rebase — by Isotr0py (关闭于: 2026-03-04 22:45 (UTC+8))
- #15401 [TPU][V1] Guided decoding on TPU — tpu,needs-rebase,v1 — by carlesoctav (关闭于: 2026-03-04 22:44 (UTC+8))
- #15346 Vllm v1 eagle proposer — speculative-decoding,v1 — by sroy745 (关闭于: 2026-03-04 22:43 (UTC+8))
- #15005 feat: add custom s3 support — 无标签 — by warjiang (关闭于: 2026-03-04 22:42 (UTC+8))
- #14803 [ROCm][AMD] Enable ROCm Flash Attention Backend for Encoder-Decoder Models — rocm,needs-rebase — by vllmellm (关闭于: 2026-03-04 22:38 (UTC+8))
- #14678 [Core] Add a level 3 sleep/wake_up that offloads tensors to disk — frontend,v1 — by manoelmarques (关闭于: 2026-03-04 22:38 (UTC+8))
- #14584 Ray named test — documentation,needs-rebase — by ruisearch42 (关闭于: 2026-03-04 22:36 (UTC+8))
- #14631 [Quant] Add SupportsQuant and packed_modules_mapping to all models — needs-rebase,llama,qwen,deepseek — by kylesayrs (关闭于: 2026-03-04 22:37 (UTC+8))
- #14395 [Kernel] Update cutlass FP8 blockwise to use upstream CUTLASS — needs-rebase,ci/build,nvidia — by LucasWilkinson (关闭于: 2026-03-04 22:36 (UTC+8))
- #14455 [INTEL-HPU] Deepseek R1 model enabling for Intel Gaudi — speculative-decoding,needs-rebase,ci/build,deepseek — by xuechendi (关闭于: 2026-03-04 22:36 (UTC+8))
- #14291 [Model] add colqwen2_vl code & inference — documentation,new-model,needs-rebase,unstale,qwen — by BloomBerry (关闭于: 2026-03-04 22:35 (UTC+8))
- #14182 Deepseek MTP for V1 — speculative-decoding,needs-rebase,v1,deepseek — by sroy745 (关闭于: 2026-03-04 22:34 (UTC+8))
- #13835 Support w8a8 block_fp8_matmul from generated kernel — needs-rebase,ci/build,nvidia — by wenscarl (关闭于: 2026-03-04 22:28 (UTC+8))
- #13853 [HPU] Enable AutoGPTQ/AutoAWQ quantized model inference — needs-rebase — by maktukmak (关闭于: 2026-03-04 22:29 (UTC+8))
- #13809 [Misc] support variable remote backend for model loader — documentation,performance,frontend,speculative-decoding,needs-rebase,unstale,v1 — by DellCurry (关闭于: 2026-03-04 22:27 (UTC+8))
- #13805 [Model][Speculative Decoding] support k > 1 for MTP — speculative-decoding,needs-rebase,deepseek — by luccafong (关闭于: 2026-03-04 22:26 (UTC+8))
- #13758 [WIP][Whisper] beam search for whisper — frontend,needs-rebase — by joennlae (关闭于: 2026-03-04 22:26 (UTC+8))
- #13360 [RFC][V1] `LogitsProcessor` interface — RFC,tpu,needs-rebase,v1 — by njhill (关闭于: 2026-03-04 22:25 (UTC+8))
- #12866 adding workaround for c2x/c3x initializer issue — needs-rebase,nvidia — by kushanam (关闭于: 2026-03-04 22:24 (UTC+8))
- #13248 [core] add `extra_args` to `RPCProcessRequest` and `MQLLMEngineClient` — needs-rebase — by akeshet (关闭于: 2026-03-04 22:24 (UTC+8))
- #12726 [Core] Add Additional Metrics to vLLM Server — speculative-decoding,needs-rebase,stale,v1 — by sahelib25 (关闭于: 2026-03-04 22:22 (UTC+8))
- #11554 [Frontend] [Bugfix] Refactor tool parsers and simplify the tool parsing interface. — bug,frontend,needs-rebase,ci/build,tool-calling — by elementary-particle (关闭于: 2026-03-04 22:12 (UTC+8))
- #12566 vllm-flash-attn build on AMD — rocm,needs-rebase,ci/build,v1,nvidia — by ProExpertProg (关闭于: 2026-03-04 22:11 (UTC+8))
- #12117 [Feature] Support VPTQ quantization — documentation,needs-rebase,ci/build — by wejoncy (关闭于: 2026-03-04 22:10 (UTC+8))
- #12341 FLOP counting for vLLM inference — 无标签 — by dianastea (关闭于: 2026-03-04 22:10 (UTC+8))
- #12048 [MoE][CPU] Extend fused_moe_iterative for non-x86 CPU backends — needs-rebase — by mgoin (关闭于: 2026-03-04 22:09 (UTC+8))
- #10840 Lora scheduler — documentation,needs-rebase — by Scott-Hickmann (关闭于: 2026-03-04 22:06 (UTC+8))
- #11348 [Kernel] Add ExLlamaV2 Weight Quantization Support — needs-rebase,ci/build,llama — by AlpinDale (关闭于: 2026-03-04 22:07 (UTC+8))
- #21189 [W.I.P]: add Lmcache metrics — unstale,v1,kv-connector — by panpan0000 (关闭于: 2026-03-04 21:56 (UTC+8))
- #27963 [Doc][Last/N] Improve all pooling task Refactor pooling-related documentation — documentation — by noooop (关闭于: 2026-03-04 21:42 (UTC+8))
- #35282 [Core] Add chunking for audio over 30s on offline inference — frontend,needs-rebase — by sangbumlikeagod (关闭于: 2026-03-04 20:47 (UTC+8))
- #34227 Feat/support nemotron new file format — documentation,performance,new-model,needs-rebase — by shaunkotek (关闭于: 2026-03-04 16:02 (UTC+8))
- #35965 [Core] Add KV transfer support to sparse attention indexer — 无标签 — by wz1qqx (关闭于: 2026-03-04 14:18 (UTC+8))
- #35377 [Feature] Use chat template and support request_prompt for Qwen3-ASR — qwen — by pougetat (关闭于: 2026-03-04 14:03 (UTC+8))
- #35248 [Model Runner V2] Support DP/EP for spec decoding — needs-rebase,v1 — by TheEpicDolphin (关闭于: 2026-03-04 12:30 (UTC+8))