vLLM Development Activity Report - 2026-03-20
Time window: 2026-03-20 11:14 (UTC+8) ~ 2026-03-21 11:14 (UTC+8). Statistics: new issues 31 | closed issues 38 | new PRs 66 | merged PRs 45 | PRs closed without merging 25
📊 Daily Development Status Summary
During this period (2026-03-20 to 2026-03-21), the vLLM community maintained a very high level of development activity: 45 PRs were merged, 38 issues were closed, and 31 new issues and 66 new PRs were opened. Development focused on KV-connector/heterogeneous-inference optimization, AMD (ROCm) ecosystem integration and testing, kernel performance optimization, and multimodal model support. Several production-impacting bugs (CPU offload OOM, frontend abort failing to stop generation, incorrect benchmark metrics) were reported and received rapid responses.
🎯 AMD/ROCm Ecosystem Activity
AMD-related contributions were very active this period, spanning kernel optimization, quantization support, and CI/CD.
- Kernels and compilation:
  - PR 37646 (by vllmellm): [ROCm][FEAT] AITER Fused AllReduce + RMSNorm. Introduces a fused AITER AllReduce + RMSNorm kernel for ROCm, aiming to reduce communication overhead and improve distributed training/inference performance. It reflects AMD's continued investment in integrating its compute stack (AITER) deeply with vLLM.
  - PR 37682 (by andyluo7): [Bugfix] Zero-init ROCm MLA attention output buffers for graph padding. Fixes uninitialized output buffers in the ROCm MLA attention backend on the graph-capture path; it is the ROCm counterpart of the CUDA-side fix (#37442), ensuring consistent cross-platform behavior.
  - PR 37713 (by amd-hhashemi): Readability cleanup for wvSplitK reduces. A code-readability cleanup of the reduction portion of the wvSplitK kernel, contributed by an AMD engineer.
- Quantization and model support:
  - PR 37698 (by hongxiayang): [ROCm][Bugfix] fix exception related to trust_remote_code for MiniMax-M2.1-MXFP4. Fixes an exception in the Quark quantization config when loading certain models under the amd/ namespace (such as MiniMax-M2.1-MXFP4), caused by a hard-coded trust_remote_code=False. This is a fix squarely in AMD's Quark quantization toolchain, ensuring AMD-provided quantized models load correctly.
  - PR 36232 (by xuebwang-amd, merged): [ROCm][Quantization] make quark ocp mx dtype parser robust for weight-only quantization. Further hardens Quark OCP MX quantization for weight-only scenarios.
- CI/CD and testing:
  - PR 37671 (by tjtanaa): [ROCm] [Release] Block rocm release pipeline from running at every commit and fix ECR limit issue. Manages the ROCm Docker image release pipeline and fixes an AWS ECR tag-count limit issue; infrastructure maintenance.
  - Several CI-fix PRs by AndreasKaratzas (e.g. PR 37711, PR 37611, PR 37614, PR 37619) were merged to resolve test failures on the AMD CI clusters (MI250, MI325, MI355), covering attention-backend selection, model-list updates, and test categorization. This shows the AMD team actively hardening its CI/CD to preserve compatibility and stability with upstream.
  - New Issues 37710, 37709, 37708, 37724: all AMD CI failure reports, covering Whisper accuracy, DeepEP problems, enabling new test groups, and quantization test failures, demonstrating both the breadth of test coverage and the promptness of issue tracking.
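The graph-padding hazard addressed by PR 37682 can be illustrated without any GPU code. Below is a minimal, purely illustrative Python sketch (not vLLM's implementation): with CUDA/HIP-graph-style execution, output buffers are allocated once at the maximum padded batch size and reused across replays, so rows beyond the real batch are never written and keep stale values unless the buffer is zero-initialized.

```python
# Hypothetical illustration (not vLLM's actual code): a reused, padded
# output buffer. The "kernel" only writes the first `real_batch` rows,
# as a padded attention kernel would during graph replay.

PADDED_BATCH = 4

def run_step(out_buf, real_batch, value):
    for i in range(real_batch):
        out_buf[i] = value
    return list(out_buf)

buf = [0.0] * PADDED_BATCH          # zero-init once ...
run_step(buf, 4, 7.0)               # full batch fills every row
stale = run_step(buf, 2, 1.0)       # smaller batch: rows 2-3 still hold 7.0
print(stale)                        # [1.0, 1.0, 7.0, 7.0] <- stale padding

buf = [0.0] * PADDED_BATCH          # the fix: re-zero the buffer
clean = run_step(buf, 2, 1.0)
print(clean)                        # [1.0, 1.0, 0.0, 0.0]
```

Anything downstream that reduces over the full padded buffer would silently consume the stale rows, which is why zero-initialization matters even though the "real" rows are always correct.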
💬 High-Traffic Discussion Analysis
- Issue 37658: [Bug]: Frontend Abort Fails to Stop Qwen3.5-122B Generation Loop
  - Core issue: a user reports that when an abort request is sent via a frontend (e.g. Dify), the vLLM backend engine does not stop the generation task, leaving the model in an endless loop with GPU memory nearly full.
  - Positions and open questions: the reporter (xiaolvtongxue-zt) provided detailed logs and the call pattern. Maintainer ZJY0516 repeatedly asked for the exact reproduction steps and how the requests are sent, trying to determine whether the problem lies in client request construction, server configuration, or engine internals. Discussion is currently focused on pinning down a precise reproduction; no conclusive fix yet.
  - Status: Open. The problem is serious but the reproduction path is unclear; more information is needed.
- Issue 37672: [Bug]: Prefetch CPU offload OOMs
  - Core issue: when loading a large model (GLM-4.7-FP8) with the prefetch CPU offload backend, OOM still occurs even with VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY=1 set, because the prefetch offloader does not honor that environment variable.
  - Positions and resolution: user ehfd reported the problem. Contributor he-yufeng quickly located three places in the code that fail to check the variable and proposed a fix (implemented in PR 37699). Maintainer wzhao18 expressed thanks. An efficient, technically precise discussion that led straight to a solution.
  - Status: Open, but a corresponding fix PR exists.
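The fix pattern here is straightforward. A hedged sketch (illustrative names only, not the actual code in PR 37699): every host-buffer allocation site in the offloader should consult the environment variable instead of unconditionally pinning memory.

```python
import os

# Hypothetical sketch of the fix pattern (not vLLM's actual code): gate
# host-memory pinning on VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY at
# every allocation site, instead of pinning unconditionally.

def pin_memory_enabled() -> bool:
    # "1" disables pinning; any other value keeps the default behavior.
    return os.environ.get("VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY") != "1"

def alloc_host_buffer(nbytes: int) -> dict:
    # A real implementation would pass pin_memory= to the tensor allocator;
    # here we only record the decision.
    return {"nbytes": nbytes, "pinned": pin_memory_enabled()}

os.environ["VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY"] = "1"
buf = alloc_host_buffer(1 << 20)
print(buf["pinned"])  # False: pinning disabled, avoiding pinned-memory OOM
```

The bug was precisely that some allocation paths bypassed this check, so pinned (non-swappable) memory kept accumulating regardless of the user's setting.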
- Issue 37666: [Bug]: vllm bench "Peak output token throughput" is less than "Output token throughput"
  - Core issue: in the vllm bench benchmarking tool, the "peak output token throughput" metric comes out lower than the "average output token throughput", which is impossible and indicates a logic error in the metric computation.
  - Competing approaches: user AskyJx reported the problem. Contributor howardpen9 proposed a fix (PR 37690), arguing that his approach is "substantially different" from an existing PR (#35471): it integrates actual generated token counts over overlapping decode time windows rather than using streaming-chunk event counts as a proxy, and handles edge cases correctly. This sets up a comparison of two distinct technical approaches.
  - Status: Open; two candidate fix PRs await maintainer evaluation and selection.
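To see why the peak must bound the average, here is a minimal sketch assuming peak throughput is defined as the maximum token volume observed in any fixed-width decode window (an illustration of the invariant, not the code in PR 37690):

```python
# Hypothetical sketch (not PR 37690's implementation): peak output
# throughput as the maximum token volume over fixed-width time windows.
# Computed this way, peak >= average always holds.

def throughputs(events, window=1.0):
    """events: list of (timestamp_sec, num_tokens) decode events."""
    t0 = min(t for t, _ in events)
    t1 = max(t for t, _ in events)
    total = sum(n for _, n in events)
    avg = total / (t1 - t0)
    peak = 0.0
    t = t0
    while t < t1:
        vol = sum(n for ts, n in events if t <= ts < t + window)
        peak = max(peak, vol / window)
        t += window
    return avg, peak

avg, peak = throughputs([(0.0, 10), (0.5, 30), (1.2, 20), (1.8, 40)])
assert peak >= avg  # the invariant the reported bug violated
print(avg, peak)
```

If instead the "peak" counts streaming-chunk events (each chunk may carry many tokens), a window with few large chunks can score below the token-based average, producing exactly the impossible ordering the issue describes.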
🔥 Hot Topics and Trend Analysis
- KV connectors and heterogeneous inference: one of the most active areas right now. Multiple PRs (37635, 37636, 37716, 37686) target different KV connectors (NIXL, 3FS, MoRIIO, Mooncake), optimizing advanced deployment scenarios such as prefill/decode disaggregation and cross-node KV sharing, reflecting vLLM's evolution toward more complex distributed inference architectures.
- Continued AMD ecosystem integration and testing: as noted above, a large share of activity centers on getting AMD CI green. This goes beyond simple test fixes: it shows AMD hardware (MI250/325/355) and software (ROCm, AITER, Quark) being deeply and systematically integrated into vLLM's mainline development workflow.
- Compilation and performance optimization:
  - Kernel-level: PR 37683 eliminates redundant SparseMatrix creation in MoE, and PR 37695 uses torch.compile to fuse the packed top-k operation in MoE; both target microsecond-scale latency reductions.
  - Compile configuration: PR 37696 discusses disabling sequence parallelism (SP) for piecewise compilation to resolve a compatibility problem.
- Multimodal and model support: a steady stream of PRs (37643, 37693, 37685, 37647) adds or updates support for multimodal models (AudioFlamingo3, Isaac, Kimi-K2.5, MiDashengLM, etc.), keeping alignment with the latest Hugging Face implementations and demonstrating a commitment to tracking the model ecosystem quickly.
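What operator fusion of the kind PR 37695 applies actually buys can be sketched in plain Python (a loose analogy, not the actual torch.compile/Triton code): the unfused pipeline materializes an intermediate buffer and makes two passes over the data, while the fused version computes the same result in a single pass.

```python
# Illustrative analogy for operator fusion (not PR 37695's code):
# two "kernels" with an intermediate buffer vs. one fused pass.

def unfused_sum_of_squares(xs):
    squares = [x * x for x in xs]   # pass 1: intermediate written to memory
    return sum(squares)             # pass 2: intermediate read back

def fused_sum_of_squares(xs):
    return sum(x * x for x in xs)   # one pass, no intermediate buffer

data = [1.0, 2.0, 3.0]
assert unfused_sum_of_squares(data) == fused_sum_of_squares(data) == 14.0
```

On a GPU, the analogous savings are fewer kernel launches and less global-memory traffic, which is where the microsecond-scale latency reductions come from.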
🛠️ Key Technical Changes
- PR 37646: [ROCm][FEAT] AITER Fused AllReduce + RMSNorm: introduces a fused communication-computation kernel for ROCm, a key optimization for distributed training/inference performance on AMD GPUs and a sign of deepening AMD-specific kernel work.
- PR 37683: [Perf] Eliminate redundant SparseMatrix creation in gpt_oss_triton_kernels (merged): by avoiding repeated creation of SparseMatrix objects in MoE routing, it removes redundant kernel launches and yields a 4% end-to-end throughput gain on H200; a classic low-cost, high-return performance optimization.
- PR 37605: [Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing (merged): fixes a severe routing error in the FlashInfer TRTLLM monolithic MoE kernel when handling all-negative router logits; the bug had driven Qwen3.5 FP8 accuracy to 0% under EP+DP, making this a critical stability fix.
- Issue 37658: Frontend Abort Fails: still unresolved, but it exposes potential gaps in request lifecycle management with complex clients (such as Dify) and large models; a production-stability concern that deserves close attention.
- PR 37690: fix(bench): compute peak output token throughput from token-volume decode windows: corrects the core metric computation in the benchmarking tool, essential for anyone evaluating vLLM performance.
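The optimization in PR 37683 follows a common pattern: hoist expensive object construction out of the per-step hot path and reuse the result. A hypothetical Python sketch (names invented, not the actual kernel code):

```python
# Hypothetical sketch of the pattern behind PR 37683 (not its actual code):
# build a costly helper object once per shape and reuse it, instead of
# reconstructing it on every forward step.

class RoutingMatrix:
    builds = 0                       # stands in for extra kernel launches
    def __init__(self, shape):
        RoutingMatrix.builds += 1
        self.shape = shape

_cache = {}

def get_routing_matrix(shape):
    # Before the fix: a new RoutingMatrix was constructed on every call.
    # After: build once per shape, then return the cached instance.
    if shape not in _cache:
        _cache[shape] = RoutingMatrix(shape)
    return _cache[shape]

for _ in range(3):
    get_routing_matrix((128, 8))
print(RoutingMatrix.builds)          # 1: redundant constructions avoided
```

Caching is safe here because the object depends only on the shape, which is stable across decode steps; that is the same property the real PR exploits.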
📈 Development Activity Observations
- Strong AMD team contributions: users (or team accounts) such as vllmellm, hongxiayang, xuebwang-amd, and AndreasKaratzas submitted a large volume of ROCm-related code and CI fixes this period, showing the AMD team's deep and professional engagement with vLLM development.
- Fast community response: for critical reported bugs (CPU offload OOM, structured-output crash on CPU), community contributors (he-yufeng, wjhrdy) identified root causes and posted fix PRs within a very short time, demonstrating the community's strong problem-solving capacity.
- Efficient merging: 45 PRs merged in a single day indicates the core maintainers keep code review and the merge pipeline moving quickly, and the project is advancing rapidly.
- CI/CD as the quality gate: the volume of CI-failure issues and corresponding fix PRs shows CI effectively acting as the gatekeeper for code quality and cross-platform compatibility, which is essential for multi-architecture support such as AMD's.
💡 Issues Worth Watching
- Broken request-abort mechanism (Issue 37658): a potentially serious threat to production controllability and resource safety; reproduction and root-cause analysis should be prioritized.
- KV-connector bugs: e.g. Issue 37703 (TRITON_ATTN ignoring the KV cache layout and breaking heterogeneous TP) and PR 37716 (MoRIIOConnector fix). As heterogeneous inference architectures gain adoption, the stability of these components is critical.
- Depth of AMD platform integration: keep watching the real-world impact of AMD-specific kernels such as AITER, and the compatibility of the Quark quantization toolchain with vLLM's model loader (as in PR 37698).
- Compilation and metric accuracy: the compile-configuration issue in PR 37696 and the metric computation in PR 37690/Issue 37666 bear directly on users' fundamental trust in vLLM's performance numbers and stability.
📋 Appendix: Detailed Data
New Issues
- #37730 Benchmark: Radix vs. PagedAttention Scaling (SGLang / vLLM) — no label — by glaziermag (created: 2026-03-21 09:44 (UTC+8))
- #37729 [Bug]: V1 engine core deadlocks under concurrent load (fp8 + prefix caching + Qwen3.5) — bug — by rahul003 (created: 2026-03-21 09:35 (UTC+8))
- #37697 [Bug]: openai v1/responses api instructions from prior response leak through previous_response_id — bug — by lukezTT (created: 2026-03-20 23:43 (UTC+8))
- #37710 [CI Failure]: mi355_1: Entrypoints Integration (API Server 1) — rocm,ci-failure — by AndreasKaratzas (created: 2026-03-21 01:50 (UTC+8))
- #37724 [CI Failure]: mi355_1: Quantization — rocm,ci-failure — by AndreasKaratzas (created: 2026-03-21 07:38 (UTC+8))
- #37720 [Bug]: EAGLE Autograd issue in Mix Hidden States mode — bug — by benchislett (created: 2026-03-21 06:36 (UTC+8))
- #37714 [Installation]: Blackwell SM120 + CUDA 13 pip install: 5 sequential failures before Qwen3.5 27B+ runs — installation — by Gavin-Qiao (created: 2026-03-21 04:59 (UTC+8))
- #37703 [Bug][NIXL]: TRITON_ATTN ignores VLLM_KV_CACHE_LAYOUT=HND, breaks heterogeneous TP with NIXL — bug — by ZhanqiuHu (created: 2026-03-21 01:37 (UTC+8))
- #37705 [Bug]: Structured output crashes on CPU with pin_memory=True in apply_grammar_bitmask() — no label — by wjhrdy (created: 2026-03-21 01:41 (UTC+8))
- #37709 [CI Failure]: mi325_2: Distributed Tests (2 GPUs)(H100-MI325) — rocm,ci-failure — by AndreasKaratzas (created: 2026-03-21 01:47 (UTC+8))
- #37707 [CI Failure]: mi325_1: PyTorch Compilation Passes Unit Tests — ci-failure — by AndreasKaratzas (created: 2026-03-21 01:43 (UTC+8))
- #37708 [CI Failure]: mi325_1: Quantized MoE Test (B200-MI325) — rocm,ci-failure — by AndreasKaratzas (created: 2026-03-21 01:45 (UTC+8))
- #37704 [CI Failure]: mi250_1: Kernels Core Operation Test — rocm,ci-failure — by AndreasKaratzas (created: 2026-03-21 01:37 (UTC+8))
- #37701 [Bug]: Segfault in FlashInfer autotuning for NVFP4 latency backend on Qwen3-30B-A3B-NVFP4 — bug — by baonudesifeizhai (created: 2026-03-21 00:55 (UTC+8))
- #37672 [Bug]: Prefetch CPU offload OOMs; VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY must be implemented — bug — by ehfd (created: 2026-03-20 18:37 (UTC+8))
- #37644 [Bug]: key error: 'Layer.34.mlp.experts.gate_up_proj' — bug — by zhaoqiongh (created: 2026-03-20 14:22 (UTC+8))
- #37666 [Bug]: vllm bench: "Peak output token throughput" is "less than Output token throughput" — bug — by AskyJx (created: 2026-03-20 17:39 (UTC+8))
- #37687 [Feature]: add ParoQuant quantization — feature request — by ivanbaldo (created: 2026-03-20 21:48 (UTC+8))
- #37684 [Feature]: IndexCache support for DSA models — feature request — by nejch (created: 2026-03-20 21:04 (UTC+8))
- #37680 [Bug]: deploying glm4.7 with vllm + lmcache, the service crashes under high concurrency with long contexts — bug — by feichenchina (created: 2026-03-20 20:29 (UTC+8))
- #37668 Early Stopping for Ouro models — usage — by Sirorezka (created: 2026-03-20 17:48 (UTC+8))
- #37675 [Bug]: deepgemm compile error — bug — by Speclkle (created: 2026-03-20 19:19 (UTC+8))
- #37674 [Usage]: after deploying a model with serve, the image_pad token in the text sent via chat.completions is moved to the start of the prompt — usage — by NormanWhc (created: 2026-03-20 18:59 (UTC+8))
- #37663 [Bug]: Qwen3.5-2B output content is always None on RTX5090 — bug — by yanghui1-arch (created: 2026-03-20 17:09 (UTC+8))
- #37650 [Feature]: Allow user selection of structured output (xgrammar) backend for bitmask application — feature request — by xyDong0223 (created: 2026-03-20 14:56 (UTC+8))
- #37659 [Bug]: RuntimeError when running the Qwen3-Omni model with megatron grpo: "Event device index does not match recording stream's device index" — bug — by dlnn123 (created: 2026-03-20 16:38 (UTC+8))
- #37660 Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions. — no label — by dlnn123 (created: 2026-03-20 16:41 (UTC+8))
- #37658 [Bug]: Frontend Abort Fails to Stop Qwen3.5-122B Generation Loop, vLLM Backend Runs Indefinitely with Near-Full GPU Memory Utilization — bug — by xiaolvtongxue-zt (created: 2026-03-20 16:17 (UTC+8))
- #37649 [Usage]: Improve error handling when task='classify' is used with decoder-only models (e.g., Qwen, Llama, Mistral) — usage — by jyfct356 (created: 2026-03-20 14:55 (UTC+8))
- #37648 BGE-M3 /pooling endpoint crashes with split_with_sizes error after ~50-100 requests — no label — by hslee-lunit (created: 2026-03-20 14:54 (UTC+8))
- #37638 [Tracking Issue]: Mamba Heterogeneous TP for NIXL P/D Disaggregation — no label — by ZhanqiuHu (created: 2026-03-20 11:44 (UTC+8))
Closed Issues
- #26806 [Usage]: MCP-USE with VLLM gpt-oss:20b via ChatOpenAI — usage,stale — by Tahirc1 (closed: 2026-03-21 10:29 (UTC+8))
- #27645 [Bug]: Qwen3-VL-235B-A22B-Instruct stuck with assert placeholder < len(self._out_of_band_tensors) — bug,stale — by yry0008 (closed: 2026-03-21 10:28 (UTC+8))
- #27726 [Performance]: DeepSeek-R1 performance degradation after enabling MTP — performance,stale — by zkf-qwj (closed: 2026-03-21 10:28 (UTC+8))
- #28191 [Feature]: support reranker model like bge-rerank-large with long text chunked process — feature request,stale — by leslie-x (closed: 2026-03-21 10:28 (UTC+8))
- #28234 Online Dynamic FP8 Quantization (--quantization="fp8") is slower than BF16/FP16 on RTX 5090 — bug,stale — by rEwfii (closed: 2026-03-21 10:28 (UTC+8))
- #28371 [Installation]: Recipe Installation fails due to missing linux aarch64.whl — installation,stale — by SourBhow (closed: 2026-03-21 10:28 (UTC+8))
- #28456 [Usage]: benchmark_moe Usage — usage,stale — by ekmekovski (closed: 2026-03-21 10:28 (UTC+8))
- #28669 vLLM 0.11.0 CUDA Library Mismatch on ARM64 with CUDA 13.x — installation,stale — by swatson1000000 (closed: 2026-03-21 10:28 (UTC+8))
- #28717 [Usage]: Errors running vLLM docker in a closed environment with gpt-oss-120b on RTX 6000 Pro — usage,stale — by antonkarlsson1 (closed: 2026-03-21 10:28 (UTC+8))
- #28835 [Bug]: Illegal Memory Access and Incorrect Output with Long Inputs (>5k tokens) on qweN3-next-80b-instruct — bug,stale — by mosluo (closed: 2026-03-21 10:28 (UTC+8))
- #28843 [Bug]: GPTQ quants fails to run ( RuntimeError: gpt_marlin_gemm only supports bfloat16 and float16 ) — bug,stale — by drrros (closed: 2026-03-21 10:28 (UTC+8))
- #28866 [Usage]: When is going to be the next release? — usage,stale — by Invalid-coder (closed: 2026-03-21 10:28 (UTC+8))
- #28869 [Bug]: deepseek + torch compile was broken on main (1b82fb0ad) — bug,torch.compile,stale — by sighingnow (closed: 2026-03-21 10:28 (UTC+8))
- #28940 [Bug]: eagle3 default use quant model loader — bug,stale — by Allencheng97 (closed: 2026-03-21 10:28 (UTC+8))
- #28943 [Usage]: what's the right way to run embedding model in vllm 0.11.0 — usage,stale — by neverneverendup (closed: 2026-03-21 10:28 (UTC+8))
- #28993 [Bug]: Llama 4 Scout on 2 x B200 errors during FlashInfer attention metadata build — bug,stale — by jaywonchung (closed: 2026-03-21 10:27 (UTC+8))
- #29006 [Bug]: OpenAI completion error: 500 Unable to allocate 31.6 GiB for an array with shape (65158, 65158) and data type int64 — bug,stale — by abhiram1809 (closed: 2026-03-21 10:27 (UTC+8))
- #29007 [Bug]: 0.11.1 A6000x2 120B OSS wrong generations — bug,stale — by Nokimann (closed: 2026-03-21 10:27 (UTC+8))
- #29014 [Bug]: stride mismatch when using torch compile on graphs with splitting_ops and non-standard tensor dimensions — bug,torch.compile,stale — by tomeras91 (closed: 2026-03-21 10:27 (UTC+8))
- #29030 [Bug]: vLLM 0.11.1 undefined symbol cutlass_moe_mm_sm100 on RTX 5080 (SM 12.0) with CUDA 13.0 — stale — by alkari (closed: 2026-03-21 10:27 (UTC+8))
- #29042 [Feature]: Add option to tolerate self-signed certificates to vllm bench serve — feature request,stale — by chilanti (closed: 2026-03-21 10:27 (UTC+8))
- #29045 [Bug]: Custom logits processor edge-cases — bug,stale — by afeldman-nm (closed: 2026-03-21 10:27 (UTC+8))
- #29089 [Performance]: Can we use CUDA graph to accelerate the Qwen2_5omniAudioEncoder in Qwen2.5-Omni-3B? — performance,stale — by xq25478 (closed: 2026-03-21 10:27 (UTC+8))
- #29093 [Bug]: --long-prefill-token-threshold & --max-num-batched-tokens Confilct:lead to OOM! — bug,stale — by Shun7shun (closed: 2026-03-21 10:27 (UTC+8))
- #29101 [Bug]: Pipeline parallelism issues on RTX 6000 + gpt-oss-120b — bug,stale — by austin-mccall (closed: 2026-03-21 10:27 (UTC+8))
- #37720 [Bug]: EAGLE Autograd issue in Mix Hidden States mode — bug — by benchislett (closed: 2026-03-21 06:36 (UTC+8))
- #37591 [Bug]: FlashInfer TRTLLM monolithic MoE produces 0% accuracy for Qwen3.5-35B/122B FP8 — bug,qwen,nvidia — by vadiklyutiy (closed: 2026-03-21 03:19 (UTC+8))
- #37554 [Bug] --calculate-kv-scales produces corrupted FP8 KV cache on hybrid GDN+Attention models (Qwen3.5) — no label — by daudo (closed: 2026-03-21 02:28 (UTC+8))
- #29523 [CI Failure]: mi325_4: Distributed Tests (4 GPUs) — ci-failure — by AndreasKaratzas (closed: 2026-03-21 01:56 (UTC+8))
- #29510 [CI Failure]: mi325_1: V1 Test e2e + engine — ci-failure — by AndreasKaratzas (closed: 2026-03-21 01:55 (UTC+8))
- #35783 [CI Failure]: mi355_8: V1 Test e2e + engine — ci-failure — by AndreasKaratzas (closed: 2026-03-21 01:55 (UTC+8))
- #34637 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 2) — ci-failure — by AndreasKaratzas (closed: 2026-03-21 01:55 (UTC+8))
- #29538 [CI Failure]: mi325_8: Kernels Quantization Test %N — ci-failure — by AndreasKaratzas (closed: 2026-03-21 01:54 (UTC+8))
- #29520 [CI Failure]: mi325_1: Multi-Modal Models Test (Standard) — ci-failure — by AndreasKaratzas (closed: 2026-03-21 01:53 (UTC+8))
- #37598 [Bug]: cudaErrorIllegalAddress crash when running zai-org/GLM-4.7-FP8 with --max-num-batched-tokens > default (e.g. 16K) under load — bug — by Xarbirus (closed: 2026-03-21 00:58 (UTC+8))
- #37663 [Bug]: Qwen3.5-2B output content is always None on RTX5090 — bug — by yanghui1-arch (closed: 2026-03-20 17:17 (UTC+8))
- #34229 [Bug]: [CPU Backend] Whisper W8A8 CPU utilization very low on Arm CPU — bug,cpu — by fadara01 (closed: 2026-03-20 16:57 (UTC+8))
- #37468 [Bug]: FlashInfer allreduce fusion workspace uninitialized error — bug — by wzhao18 (closed: 2026-03-20 15:09 (UTC+8))
New PRs
- #37731 Support FP8 KVCache on XPU — v1 — by xinyu-intel (created: 2026-03-21 11:10 (UTC+8))
- #37725 [Bugfix] Preserve CUDA arch suffix (a/f) for SM12x — fixes NVFP4 NaN on desktop Blackwell — bug,ready,ci/build,nvidia — by RobTand (created: 2026-03-21 07:45 (UTC+8))
- #37728 Fix Mamba state corruption from stale CUDA graph block table entries — v1,nvidia,meta-exported,fb-exported — by minosfuture (created: 2026-03-21 09:19 (UTC+8))
- #37642 [Bugfix] Fix engine crash when structured output grammar compilation fails with auto backend — bug,structured-output,frontend,v1 — by allen-ash (created: 2026-03-20 14:06 (UTC+8))
- #37727 [Bugfix] Fix Responses API instructions leaking through previous_response_id — bug,frontend — by he-yufeng (created: 2026-03-21 09:09 (UTC+8))
- #37652 Fix: Handle $ref in json-schema in qwen3_coder tool parser — tool-calling,qwen — by schoennenbeck (created: 2026-03-20 15:05 (UTC+8))
- #37664 [Feature] Feat(structured-outputs): expose xgrammar bitmask backend selection via platform interface — rocm,structured-output,v1,cpu,nvidia — by xyDong0223 (created: 2026-03-20 17:11 (UTC+8))
- #37721 [ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list (MI355) — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-21 06:55 (UTC+8))
- #37643 Fix AudioFlamingo3/MusicFlamingo HF parity and RoTE handling — documentation,multi-modality — by lashahub (created: 2026-03-20 14:14 (UTC+8))
- #37726 Revert "[Model] Deprecate the score task (this will not affect users)." (#37537) — documentation,frontend,v1 — by zhewenl (created: 2026-03-21 09:05 (UTC+8))
- #37717 [ROCm][CI] Add large_gpu_mark to test_max_tokens_none for ROCm — rocm,ready — by AndreasKaratzas (created: 2026-03-21 05:53 (UTC+8))
- #37716 [BugFix] Fix MoRIIOConnector for disaggregated P/D inference — bug,kv-connector — by raviguptaamd (created: 2026-03-21 05:41 (UTC+8))
- #37695 [Perf] Use torch compile to fuse pack topk in trtllm moe — performance,ready,nvidia — by wzhao18 (created: 2026-03-20 23:21 (UTC+8))
- #37700 [Bugfix] Fix FLA Hopper/TMA misclassification on SM12x desktop Blackwell — bug — by RobTand (created: 2026-03-21 00:31 (UTC+8))
- #37723 [ROCm][CI] Stabilize ROCm speech-to-text translation test with ROCM_EXTRA_ARGS — rocm,ready — by AndreasKaratzas (created: 2026-03-21 07:31 (UTC+8))
- #37698 [ROCm][Bugfix] fix exception related to trust_remote_code for MiniMax-M2.1-MXFP4 — bug,rocm,cpu — by hongxiayang (created: 2026-03-20 23:43 (UTC+8))
- #37722 quick fix for 37665 — no label — by xuechendi (created: 2026-03-21 07:17 (UTC+8))
- #37719 [Test] Only Run MLA model when user explicitly set for batch invariance — ready,v1 — by yewentao256 (created: 2026-03-21 06:13 (UTC+8))
- #37718 [Bug] Fix fp8 deepgemm batch invariant — bug,ready — by yewentao256 (created: 2026-03-21 06:07 (UTC+8))
- #37712 Properly enable wvSplitK fp8 path for RDNA — no label — by amd-hhashemi (created: 2026-03-21 04:35 (UTC+8))
- #37706 [Bugfix] Fix structured output crash on CPU due to pin_memory=True — bug,structured-output,ready,v1 — by wjhrdy (created: 2026-03-21 01:42 (UTC+8))
- #37715 [Bugfix] Mask padded rows for FlashInfer CUTLASS MoE — bug,nvidia — by kjiang249 (created: 2026-03-21 05:11 (UTC+8))
- #37713 Readability cleanup for wvSplitK reduces. — rocm — by amd-hhashemi (created: 2026-03-21 04:50 (UTC+8))
- #37669 WIP: [openapi] enable scaling ep only when api_server_count is 1 — frontend — by andyxning (created: 2026-03-20 18:06 (UTC+8))
- #37711 [ROCm][CI] Setting some mi325_4 tests back to optional (in parity with upstream) — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-03-21 02:25 (UTC+8))
- #37683 [Perf] Eliminate redundant SparseMatrix creation in gpt_oss_triton_kernels — performance,ready,gpt-oss — by xyang16 (created: 2026-03-20 20:51 (UTC+8))
- #37694 Add get_device_uuid for rocm — rocm,ready — by tmm77 (created: 2026-03-20 23:21 (UTC+8))
- #37636 [KVConnector] Support 3FS KVConnector — v1,kv-connector,nvidia — by ibifrost (created: 2026-03-20 11:39 (UTC+8))
- #37693 [Model] Update Kimi-K25 and Isaac processors to fit HF-style — ready — by DarkLight1337 (created: 2026-03-20 23:07 (UTC+8))
- #37692 [FlexAttention] allow custom mask mod — v1 — by liangel-02 (created: 2026-03-20 23:06 (UTC+8))
- #37645 [MRV2] Avoid recompilation of _gather_block_tables_kernel — ready,v1 — by WoosukKwon (created: 2026-03-20 14:39 (UTC+8))
- #37702 [Test] Add more unittests for CUDAGraphWrapper — v1,nvidia — by SoluMilken (created: 2026-03-21 01:12 (UTC+8))
- #37639 [Model Runner V2] Fix draft logits not populated during cudagraph replay — ready,v1,nvidia — by TheEpicDolphin (created: 2026-03-20 11:58 (UTC+8))
- #37691 [cpu][ci] remove soft-fail for Arm CI and add quant model tests — ci/build,cpu — by fadara01 (created: 2026-03-20 22:34 (UTC+8))
- #37681 Fix various config related issues for Transformers v5 — ready,deepseek — by hmellor (created: 2026-03-20 20:40 (UTC+8))
- #37689 [Bugfix] Fix bogus "Loading weights took" time in DefaultModelLoader — bug — by NickCao (created: 2026-03-20 22:02 (UTC+8))
- #37699 [Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader — bug — by he-yufeng (created: 2026-03-21 00:06 (UTC+8))
- #37685 Fix attribute error in isaac_patch_hf_runner — ready,multi-modality — by hmellor (created: 2026-03-20 21:06 (UTC+8))
- #37696 [torch.compile]: Disable Sequence Parallelism (SP) for piecewise compilation — v1 — by SouthWest7 (created: 2026-03-20 23:34 (UTC+8))
- #37673 [Performance] Auto-enable prefetch on NFS with RAM guard — no label — by arpera (created: 2026-03-20 18:42 (UTC+8))
- #37690 fix(bench): compute peak output throughput from token-volume decode windows — performance — by howardpen9 (created: 2026-03-20 22:14 (UTC+8))
- #37688 [HMA] [KVEvent] Add evicted groups field to BlockRemoved KV event — v1,kv-connector — by hickeyma (created: 2026-03-20 22:02 (UTC+8))
- #37676 [Bugfix] Fix SamplingParams bad_words tokenizer conversion for space-prefixed tokens — bug — by JMonde (created: 2026-03-20 20:12 (UTC+8))
- #37686 [P/D] [MooncakeConnector] layerwise push prototype — kv-connector — by liziyu179 (created: 2026-03-20 21:36 (UTC+8))
- #37671 [ROCm] [Release] Block rocm release pipeline from running at every commit and fix ECR limit issue — rocm,ci/build — by tjtanaa (created: 2026-03-20 18:10 (UTC+8))
- #37682 [Bugfix] Zero-init ROCm MLA attention output buffers for graph padding — bug,rocm,v1 — by andyluo7 (created: 2026-03-20 20:45 (UTC+8))
- #37679 fix: remove ambiguous KV cache layout assertion for Mamba hybrid models — v1 — by SoniCoder (created: 2026-03-20 20:28 (UTC+8))
- #37678 [Feature] Add OCI Image Annotations to container images — rocm,ci/build,cpu — by JMonde (created: 2026-03-20 20:28 (UTC+8))
- #37677 [Bugfix] Allow tensorizer load format for S3/GCS/Azure object storage — bug — by JMonde (created: 2026-03-20 20:18 (UTC+8))
- #37646 [ROCm][FEAT] AITER Fused Allreduce + RMSNorm — rocm,ci/build,nvidia — by vllmellm (created: 2026-03-20 14:45 (UTC+8))
- #37655 [Do not merge]it is for test cases statistics — ci/build,nvidia — by wincent8 (created: 2026-03-20 15:51 (UTC+8))
- #37661 [Misc] Use logger.info_once for auto tool choice log message — ready — by chaunceyjiang (created: 2026-03-20 16:43 (UTC+8))
- #37670 fix: set device for prepare_inputs_event to avoid device mismatch — v1 — by xueliangyang-oeuler (created: 2026-03-20 18:09 (UTC+8))
- #37667 [Spec Decode][Quantization] fallback quant_config to None for unquantized MTP draft model — speculative-decoding,v1 — by xuebwang-amd (created: 2026-03-20 17:47 (UTC+8))
- #37665 [Bugfix] Get actual kernel_block_size_alignment from backend — bug,ready — by NickLucche (created: 2026-03-20 17:37 (UTC+8))
- #37662 fix: handle multicasting error in FlashInfer workspace init — no label — by xueliangyang-oeuler (created: 2026-03-20 17:01 (UTC+8))
- #37656 [EPLB] EPLB algorithm added: FlashLB+SwiftBalancer — no label — by Mercykid-bash (created: 2026-03-20 15:57 (UTC+8))
- #37657 [CI][PD] Add Hybrid SSM integration tests to CI — ready,ci/build,kv-connector — by NickLucche (created: 2026-03-20 16:08 (UTC+8))
- #37651 [misc] enhance code robustness — v1 — by andyxning (created: 2026-03-20 15:03 (UTC+8))
- #37653 fix: handle multicasting error in FlashInfer workspace init — needs-rebase — by xueliangyang-oeuler (created: 2026-03-20 15:13 (UTC+8))
- #37654 [Feature] Expose xgrammar bitmask backend selection in StructuredOutputsConfig — structured-output,v1 — by chaunceyjiang (created: 2026-03-20 15:39 (UTC+8))
- #37641 [XPU] bump vllm-xpu-kernels to v0.1.4 — ready,ci/build — by jikunshang (created: 2026-03-20 14:05 (UTC+8))
- #37647 [Model] Support Qwen3 as MiDashengLM text backbone — qwen — by zhoukezi (created: 2026-03-20 14:51 (UTC+8))
- #37640 [ROCm][Test] Fix ROCM_AITER_UNIFIED_ATTN attn+quant fusion test — rocm — by vllmellm (created: 2026-03-20 13:53 (UTC+8))
- #37637 Fix: Add EAGLE/MTP slots calculation in max_num_new_slots_for_drafting — no label — by xueliangyang-oeuler (created: 2026-03-20 11:43 (UTC+8))
- #37635 [NIXL][Mamba][3/N] Heterogeneous TP: 3-read conv state transfer — kv-connector — by ZhanqiuHu (created: 2026-03-20 11:37 (UTC+8))
Merged PRs
- #37131 elastic_ep: Fix issues with repeated scale up/down cycles — ready,v1,nvidia — by itayalroy (merged: 2026-03-21 07:13 (UTC+8))
- #37475 [BugFix] Allow qk_nope_head_dim=192 in FlashInfer MLA backend checks — bug,ready,v1,nvidia — by kjiang249 (merged: 2026-03-21 06:14 (UTC+8))
- #36171 [Refactor] Remove unused dead code — frontend,ready,kv-connector — by yewentao256 (merged: 2026-03-21 06:12 (UTC+8))
- #37028 [Model Runner V2] Support Streaming Inputs — ready,v1 — by santiramos27 (merged: 2026-03-21 04:42 (UTC+8))
- #37711 [ROCm][CI] Setting some mi325_4 tests back to optional (in parity with upstream) — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-21 03:25 (UTC+8))
- #37683 [Perf] Eliminate redundant SparseMatrix creation in gpt_oss_triton_kernels — performance,ready,gpt-oss — by xyang16 (merged: 2026-03-21 01:28 (UTC+8))
- #37605 [Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing (#37591) — bug,ready,ci/build,qwen,nvidia — by vadiklyutiy (merged: 2026-03-21 03:19 (UTC+8))
- #35216 [compile] Initialize passes at VllmBackend init — ready,ready-run-all-tests — by angelayi (merged: 2026-03-21 02:40 (UTC+8))
- #37693 [Model] Update Kimi-K25 and Isaac processors to fit HF-style — ready — by DarkLight1337 (merged: 2026-03-21 02:30 (UTC+8))
- #37565 [Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention… — bug,ready — by Young-Leo (merged: 2026-03-21 02:28 (UTC+8))
- #37519 refactor: abstract deepgemm support into platform — ready,nvidia — by SherryC41 (merged: 2026-03-21 01:54 (UTC+8))
- #37303 [Attention] Support distinguishing between short extends and decodes — ready,ci/build,v1 — by LucasWilkinson (merged: 2026-03-21 01:49 (UTC+8))
- #37426 fix CUDAGraph memory being counted twice — ready,v1,nvidia — by panpan0000 (merged: 2026-03-21 01:36 (UTC+8))
- #37645 [MRV2] Avoid recompilation of _gather_block_tables_kernel — ready,v1 — by WoosukKwon (merged: 2026-03-21 01:31 (UTC+8))
- #37613 [ROCm][CI] Fix accuracy for llama-nemotron-vl pooling tests — rocm,ready,multi-modality,llama — by AndreasKaratzas (merged: 2026-03-21 01:16 (UTC+8))
- #37639 [Model Runner V2] Fix draft logits not populated during cudagraph replay — ready,v1,nvidia — by TheEpicDolphin (merged: 2026-03-20 15:43 (UTC+8))
- #37681 Fix various config related issues for Transformers v5 — ready,deepseek — by hmellor (merged: 2026-03-21 00:30 (UTC+8))
- #37571 [UX] Enable torch_profiler_with_stack — documentation,ready — by jeejeelee (merged: 2026-03-20 19:17 (UTC+8))
- #37589 [compile] Add compiled artifact counter for VLLM_USE_MEGA_AOT_ARTIFACT=1. — ready — by zhxchen17 (merged: 2026-03-21 00:22 (UTC+8))
- #33898 [Metrics] Some small refactoring for better maintainability — speculative-decoding,ready,v1,kv-connector — by hickeyma (merged: 2026-03-21 00:11 (UTC+8))
- #37604 [compile] Fix aot test failures with torch 2.12. — ready — by zhxchen17 (merged: 2026-03-21 00:06 (UTC+8))
- #37685 Fix attribute error in isaac_patch_hf_runner — ready,multi-modality — by hmellor (merged: 2026-03-20 22:49 (UTC+8))
- #37182 [Pixtral] Enable Pixtral language model support Eagle3 — ready — by Flechman (merged: 2026-03-20 23:50 (UTC+8))
- #37329 [Bugfix] Fix ConchLinearKernel channelwise quantization (group_size=-1) — bug,ready — by mgehre-amd (merged: 2026-03-20 23:32 (UTC+8))
- #37331 [Bugfix] Reject channelwise quantization (group_size <= 0) in ExllamaLinearKernel — bug,ready,llama — by mgehre-amd (merged: 2026-03-20 23:31 (UTC+8))
- #37619 [ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-20 17:06 (UTC+8))
- #37614 [ROCm][CI] Remove deepep DBO tests on gfx90a — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-20 17:07 (UTC+8))
- #37611 [ROCm][CI] Fix granite_speech test for gfx90a by selecting compatible attention backend — rocm,ready,multi-modality — by AndreasKaratzas (merged: 2026-03-20 17:07 (UTC+8))
- #37585 [CI] Removing deprecated rlhf examples reference — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-20 15:20 (UTC+8))
- #34709 [ROCm] Enable wvSplitK skinny GEMM kernel for RDNA4/gfx1x decode — new-model,rocm,ready,ci/build — by laudney (merged: 2026-03-20 23:11 (UTC+8))
- #36232 [ROCm][Quantization] make quark ocp mx dtype parser robust for weight-only quantization — rocm,ready — by xuebwang-amd (merged: 2026-03-20 23:10 (UTC+8))
- #37528 [Model] Add LFM2-ColBERT-350M support — documentation,new-model,ready — by ieBoytsov (merged: 2026-03-20 22:57 (UTC+8))
- #37500 [Refactor] Relocate tests from tests/v1/entrypoints/ to tests/entrypoints/ — rocm,structured-output,ready,ci/build,v1 — by sfeng33 (merged: 2026-03-20 17:50 (UTC+8))
- #37595 [Refactor] Move serve entrypoint tests under tests/entrypoints/serve/ — ready,ci/build — by sfeng33 (merged: 2026-03-20 17:10 (UTC+8))
- #37661 [Misc] Use logger.info_once for auto tool choice log message — ready — by chaunceyjiang (merged: 2026-03-20 18:40 (UTC+8))
- #37537 [Model] Deprecate the score task (this will not affect users). — documentation,frontend,ready,v1 — by noooop (merged: 2026-03-20 16:07 (UTC+8))
- #37461 [Bug] Fix FlashInfer allreduce fusion workspace uninitialized error — bug,ready,nvidia — by wzhao18 (merged: 2026-03-20 15:09 (UTC+8))
- #37641 [XPU] bump vllm-xpu-kernels to v0.1.4 — ready,ci/build — by jikunshang (merged: 2026-03-20 15:04 (UTC+8))
- #37293 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder — documentation,ready,qwen — by Wangbei25 (merged: 2026-03-20 14:24 (UTC+8))
- #37579 [Model] Refactor Step3-VL processor to HF style — ready — by DarkLight1337 (merged: 2026-03-20 14:05 (UTC+8))
- #37593 [Refactor] Relocate entrypoint tests to match serving code structure — ready,ci/build,multi-modality — by sfeng33 (merged: 2026-03-20 13:31 (UTC+8))
- #37634 [XPU] Automatically detect target platform as XPU in build. — ready,ci/build — by ccrhx4 (merged: 2026-03-20 13:30 (UTC+8))
- #37364 [Model Runner V2] fix draft attention metadata generation — ready,v1 — by TheEpicDolphin (merged: 2026-03-20 12:05 (UTC+8))
- #37523 fix(xpu): Re-compute compile ranges after platform-specific config updates — ready,v1 — by Liangyx2 (merged: 2026-03-20 11:52 (UTC+8))
- #37612 [V0 Deprecation] Deprecate --disable-frontend-multiprocessing — performance,frontend,ready,v1 — by sfeng33 (merged: 2026-03-20 11:31 (UTC+8))
PRs Closed Without Merging
- #26844 Guard SM100 CUTLASS MoE macro to SM100 builds — ready,needs-rebase,ci/build,stale,nvidia — by Jonahcb (closed: 2026-03-21 10:29 (UTC+8))
- #27079 [Misc] Remove unused virtual engine flag — ready,needs-rebase,stale,v1,qwen,deepseek,kv-connector — by zhuohan123 (closed: 2026-03-21 10:29 (UTC+8))
- #28607 [RL] Support weight update with multi ipc handles + zmq — documentation,stale — by knlnguyen1802 (closed: 2026-03-21 10:28 (UTC+8))
- #28833 Do not allow disabling chunked prefill for generation models — needs-rebase,stale — by 22quinn (closed: 2026-03-21 10:28 (UTC+8))
- #28900 fix tool-call-parser for Qwen3-Coder model — documentation,stale,tool-calling,qwen — by MaoJianwei (closed: 2026-03-21 10:28 (UTC+8))
- #28955 [amd] enable kimi k2 fp8 kv on amd mi300 — rocm,stale,v1 — by bradleyhd (closed: 2026-03-21 10:28 (UTC+8))
- #29003 [CI/Build] Add max_jobs docker build arg to rocm_base to prevent oom — rocm,ci/build,stale — by epipho (closed: 2026-03-21 10:27 (UTC+8))
- #29019 [CI-Build] Fixed numeric issue of test_prefix_prefill on AMD — rocm,needs-rebase,stale — by him-rh-nm (closed: 2026-03-21 10:27 (UTC+8))
- #36844 [Core] Guard mamba prefill split fragmentation — performance,needs-rebase,ci/build,v1,multi-modality,qwen,kv-connector,nvidia — by yunseoLee0343 (closed: 2026-03-21 08:12 (UTC+8))
- #36695 Create test_async_flashinfer_combinations.py — v1,nvidia — by puririshi98 (closed: 2026-03-21 06:38 (UTC+8))
- #36832 Increase Test Coverage for Distributed Comm Patterns — ci/build — by puririshi98 (closed: 2026-03-21 06:38 (UTC+8))
- #36837 Expand Speculative Decoding Coverage — speculative-decoding,ci/build,v1 — by puririshi98 (closed: 2026-03-21 06:38 (UTC+8))
- #36732 [MoE Refactor] Migrate UnquantizedFusedMoEMethod and oracle to MK flow — nvidia — by yzong-rh (closed: 2026-03-21 05:43 (UTC+8))
- #37562 [Bugfix] Rebase of #36329: Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 + torch.compile compatibility — bug,needs-rebase,qwen — by kitch2400 (closed: 2026-03-21 02:55 (UTC+8))
- #32633 Fix gpt-oss Harmony token leaks in tool names and streaming content #32587 — frontend,needs-rebase,tool-calling,gpt-oss — by baonudesifeizhai (closed: 2026-03-20 23:52 (UTC+8))
- #34101 Fix hermes tool call stream truncation when stop sequences are used — frontend,tool-calling — by maxdebayser (closed: 2026-03-20 22:51 (UTC+8))
- #25729 Support the model generating a JSON string instead of JSON in the Hermes Tool Parser — frontend,unstale,tool-calling — by maxdebayser (closed: 2026-03-20 22:50 (UTC+8))
- #36607 feat(models): enable Qwen3.5 text-only (Qwen3_5ForCausalLM) — IsHybrid, SupportsMRoPE, VL weight remapping — new-model,tool-calling,qwen — by groxaxo (closed: 2026-03-20 21:51 (UTC+8))
- #37655 [Do not merge]it is for test cases statistics — ci/build,nvidia — by wincent8 (closed: 2026-03-20 19:19 (UTC+8))
- #31791 [Bugfix] Use isinstance() instead of type() in LoRA can_replace_layer — bug,needs-rebase — by majiayu000 (closed: 2026-03-20 17:32 (UTC+8))
- #31651 [Bug Fix]: Require explicit --dataset-name to avoid migration confusion — bug,performance — by majiayu000 (closed: 2026-03-20 17:32 (UTC+8))
- #34177 [ROCm][Kernel] Add GFX11 (RDNA3) support for wvSplitK skinny GEMM kernels — rocm,needs-rebase — by mgehre-amd (closed: 2026-03-20 17:20 (UTC+8))
- #37414 [Bugfix]enable_thinking: False, the Qwen3.5 model returns an error in the content stream. — bug,qwen — by xyDong0223 (closed: 2026-03-20 16:54 (UTC+8))
- #36875 [Fix] Fix --headless mode with openai.api_server — frontend — by emricksini-h (closed: 2026-03-20 16:08 (UTC+8))
- #37631 [Bugfix] JAIS: Only apply ALiBi when position_embedding_type is alibi — no label — by simpx (closed: 2026-03-20 12:05 (UTC+8))