vLLM Development Activity Report - 2026-03-03

Time window: 2026-03-03 11:21 (UTC+8) ~ 2026-03-04 11:21 (UTC+8)
Statistics: New Issues 30 | Closed Issues 18 | New PRs 94 | Merged PRs 38 | PRs closed without merging 28

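📊 Daily Development Summary

During this period (2026-03-03 to 2026-03-04), the vLLM community remained highly active, with 94 new PRs, 30 new Issues, and 38 merged PRs. Development focused on performance optimization, bug fixes, and new feature integration. A notable trend is deep optimization and troubleshooting for specific hardware (especially the AMD ROCm platform) and specific model families (such as Qwen). Core architecture topics, including the refactoring of the distributed executor backend, the realtime streaming API, and KV cache management, also drew active discussion.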

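🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was very high this period, mainly in performance optimization, bug fixes, and toolchain support.

New Issues (AMD-related):

#35925 (Bug): User jennyyyyzhen reports that on MI300x, with AITER enabled (VLLM_ROCM_USE_AITER=1), the Qwen3.5-35B-A3B model produces corrupted responses containing repeated exclamation marks. The problem reproduces with both multimodal and text-only requests and disappears when AITER is disabled. This suggests that the AMD AITER optimization path may be incompatible with specific model layers (such as attention or MoE).
#35828 (Bug): Also reported by jennyyyyzhen. On MI300x, GSM8k accuracy for the Qwen3-Next-80B model drops from about 85% to about 50% after AITER is enabled. Investigation traced the problem to the AITER Triton RoPE implementation introduced in PR #35180; setting VLLM_ROCM_USE_AITER_TRITON_ROPE=0 restores normal accuracy. This suggests that the new RoPE implementation may have a numerical precision or logic problem.
#35845 (Bug, closed): User flutist reports that loading Qwen3.5-27B with bitsandbytes quantization fails on an NVIDIA L20, although the Issue was labeled rocm. Fixed by PR #35838.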
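To narrow down which AITER path is responsible for regressions like #35925 and #35828, one option is to run the same evaluation under each on/off combination of the relevant flags. The following is a minimal sketch that assumes only the two environment variables named in these reports. The helper `aiter_env_matrix` is hypothetical and not part of vLLM, and launching the actual server or evaluation is left to the caller.

```python
import itertools
import os

# Flag names taken from the issue reports; values are "0" or "1".
AITER_FLAGS = ("VLLM_ROCM_USE_AITER", "VLLM_ROCM_USE_AITER_TRITON_ROPE")


def aiter_env_matrix(base_env=None):
    """Return one environment dict per on/off combination of AITER_FLAGS."""
    base = dict(os.environ if base_env is None else base_env)
    matrix = []
    for values in itertools.product("01", repeat=len(AITER_FLAGS)):
        env = dict(base)
        env.update(zip(AITER_FLAGS, values))
        matrix.append(env)
    return matrix


if __name__ == "__main__":
    for env in aiter_env_matrix(base_env={}):
        print({name: env[name] for name in AITER_FLAGS})
```

Each returned dict can be passed as the `env` argument of `subprocess.run` when starting an evaluation. Combinations with the main AITER switch off may turn out to be redundant, so the matrix can be pruned once that is confirmed.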

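New PRs (AMD-related):

#35859 (Quark): Submitted by AMD employee fxmarty-amd. Adds support for loading NVFP4-quantized checkpoints produced by the AMD Quark tool into vLLM, an important step in integrating AMD's quantization toolchain with vLLM.
#35855 (NVFP4): Submitted by AMD employee fxmarty-amd. Adds ahead-of-time (AOT) dequantization of NVFP4 weights to the emulation backend (dense and MoE models), addressing the performance bottleneck of dequantizing on every forward pass under large workloads.
#35893 (Bugfix): Fixes a crash in AITER fused MoE with MXFP4 quantization when GEMM dimensions do not meet the CK library's requirements. It now falls back gracefully to emulation mode or the Triton backend.
#35850 (Feature): Extends AITER MLA (Multi-head Latent Attention) support to handle fewer than 16 attention heads per TP rank (for example, Kimi models at TP=8), and adds FP8 KV cache support.
#35923 (Bugfix): Fixes corrupted output for Qwen3.5 models on the ROCM_ATTN backend caused by a non-standard block size (1056), by adding that size to the list supported by the Triton fallback kernel.
#35840 (Feature): Submitted by AMD employee hanlin12-AMD. Adds an option to enable AITER hipBLASLt online GEMM tuning, aiming to further improve matrix computation performance on AMD GPUs.
#35897 (CI): A small fix for AMD CI.
#35887 (CI): Fixes multi-GPU test failures in AMD CI caused by an insufficient number of GPUs.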
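The fallback behavior described for #35893 follows a common pattern: check whether the fast kernel supports the problem shape, and otherwise select a slower but more general backend instead of crashing. Below is a minimal sketch of that pattern. The backend names, the alignment value of 256, and the function `select_moe_backend` are illustrative assumptions, not the actual vLLM or CK constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmShape:
    n: int  # output features
    k: int  # reduction dimension


# Hypothetical alignment requirement for the fused kernel (assumption).
FUSED_ALIGNMENT = 256


def select_moe_backend(shape: GemmShape, triton_available: bool = True) -> str:
    """Pick a backend name, falling back when the fused kernel cannot run."""
    aligned = shape.n % FUSED_ALIGNMENT == 0 and shape.k % FUSED_ALIGNMENT == 0
    if aligned:
        return "fused"
    # Fall back instead of crashing on an unsupported shape.
    return "triton" if triton_available else "emulation"
```

Doing the check once at layer setup, rather than inside the hot path, keeps the fallback decision out of every forward pass.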

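Analysis: The AMD team (represented by users with the -amd suffix) is working intensively on vLLM optimization, focusing on three areas: 1) improving the performance of core operators (attention, linear layers, MoE, RoPE) on MI-series GPUs through AITER integration; 2) deepening the integration of its Quark quantization toolchain with vLLM, supporting new data types (NVFP4) and optimized workflows (AOT dequantization); 3) continuing to fix compatibility and correctness problems exposed by complex models (the Qwen family) and configurations (high TP values). The two severe recent bugs in which AITER degrades model output quality (#35925 and #35828) deserve close attention.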
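The ahead-of-time dequantization idea from #35855 can be illustrated with a toy layer: instead of expanding quantized weights on every forward pass, expand them once at load time and reuse the result. This is a minimal sketch in plain Python that assumes a single per-tensor scale. Real NVFP4 uses packed 4-bit values with block-wise scales, and the class `ToyQuantLinear` is hypothetical.

```python
class ToyQuantLinear:
    """Toy linear layer with integer weights and a single float scale."""

    def __init__(self, q_weight, scale, ahead_of_time=False):
        self.q_weight = q_weight  # rows of small integers
        self.scale = scale
        self.dequant_calls = 0
        # AOT mode: pay the dequantization cost once, at load time.
        self.weight = self._dequantize() if ahead_of_time else None

    def _dequantize(self):
        self.dequant_calls += 1
        return [[q * self.scale for q in row] for row in self.q_weight]

    def forward(self, x):
        # Without AOT, weights are dequantized again on every call.
        w = self.weight if self.weight is not None else self._dequantize()
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
```

The trade-off is memory: the AOT variant keeps the expanded weights resident, so it exchanges a larger footprint for less repeated work.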

💬 Analysis of High-Traffic Discussions

Issue #35848: [RFC]: Revamp Ray Distributed Executor Backend (from Ray team)
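- Core topic: The Ray team proposes deprecating the current Compiled Graph based RayExecutor in favor of a new RayExecutorV2. The new design would use the same control plane (ZMQ/SHM MessageQueue) and data plane (torch.distributed NCCL) as MultiprocExecutor, while using Ray Actors for cross-node parallelism and fine-grained resource scheduling.
- Viewpoints:
  - Questioner (robertgshaw2-redhat): Asks what unique value the new backend offers if it reuses MultiprocExecutor's components, and whether vLLM really needs resource scheduling integrated internally.
  - Respondents (kouroshHakha, jeffreywang-anyscale): Lay out the advantages of deep integration: 1) Simpler deployment: users can start a cross-cluster service with a single command line, without a complex "headless + driver" setup; 2) Cooperation with the Ray orchestrator: when vLLM works alongside Ray Data, Ray Serve, and similar components, Ray can precisely control the placement of GPU worker processes, enabling resource-aware scheduling and elastic scaling, which is hard for an external orchestrator (such as K8s) to achieve.
- Current status: The discussion has reached a basic consensus that the refactoring direction is sound: it can unify code, reduce reliance on hard-to-debug Ray features (such as DAG), and improve user experience and feature support. The RFC remains open.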
PR #35852: [Scheduler] Fix KV-cache head-of-line blocking
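- Core topic: Fix head-of-line blocking in the V1 scheduler caused by insufficient KV cache, so that a large request does not unnecessarily block smaller runnable requests behind it.
- Viewpoints:
  - PR author (oneraghavan): Proposes three key changes so that, after KV cache allocation fails for one request, the scheduler keeps trying to schedule other requests instead of breaking out of the whole scheduling loop.
  - Reviewer (orozery): Agrees with the intent of fixing the specific case (one large request blocking several small ones) but worries that the change could hurt general workloads that pursue high throughput and relieve KV cache pressure by reducing the number of running sequences. In particular, changing the scheduling logic for "running requests" could leave heavy requests starved indefinitely.
- Point of contention: Balancing fairness and latency for mixed workloads against throughput and cache efficiency for general workloads. The reviewer argues that the original scheduler's "break" behavior protects cache space just freed by rescheduling so that it can be used when the original request is retried.
- Final outcome: After an in-depth discussion, the author acknowledged that the reviewer's concerns were valid, particularly that changing the scheduling logic for "running requests" could cause starvation. The PR has been closed, which suggests the community prefers a more conservative scheduling policy, or wants a more thorough evaluation before changing core scheduling logic in this way.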
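The difference between the two scheduling behaviors debated in this PR can be shown with a toy model. This is a simplified sketch, not the V1 scheduler: requests are `(id, blocks_needed)` pairs in FIFO order, and the function name and parameters are assumptions made for illustration.

```python
def schedule_waiting(waiting, free_blocks, skip_on_failure):
    """Return IDs scheduled in one pass over a FIFO queue of (id, blocks_needed)."""
    scheduled = []
    for req_id, blocks_needed in waiting:
        if blocks_needed <= free_blocks:
            free_blocks -= blocks_needed
            scheduled.append(req_id)
        elif skip_on_failure:
            continue  # proposed behavior: try the next request
        else:
            break  # existing behavior: stop at the first allocation failure
    return scheduled


queue = [("large", 10), ("small-1", 2), ("small-2", 2)]
print(schedule_waiting(queue, free_blocks=5, skip_on_failure=False))  # []
print(schedule_waiting(queue, free_blocks=5, skip_on_failure=True))   # ['small-1', 'small-2']
```

The toy also hints at the reviewer's concern: with skipping enabled, a steady stream of small requests can keep consuming freed blocks, so the large request may never accumulate enough space to run.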
PR #35930: [Model Runner V2] Use dictionary instead of tuple for execute_model_state
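- Core topic: As execute_model_state keeps growing, the PR proposes changing it from a tuple to a dictionary to improve readability and reduce errors.
- Viewpoints:
  - PR author (WoosukKwon): Considers it merely an intermediate channel between two methods (prepare_inputs and execute_model), and prefers to keep it somewhat "hacky" rather than treat it as a standard part of the model runner.
  - Reviewer (njhill): Suggests a dataclass or named tuple for strong typing and explicitness, arguing that explicitly specifying the state passed between the two stages makes the code easier to read and maintain.
- Current status: The discussion reflects a trade-off between flexibility/simplicity and type safety/maintainability. The author prefers to keep the dictionary form; the reviewer respects that but disagrees. The PR remains open.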
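The three options in this discussion (tuple, dictionary, dataclass) can be compared side by side. The sketch below is illustrative only: the field names are hypothetical placeholders and do not reflect the real contents of execute_model_state.

```python
from dataclasses import dataclass
from typing import Any


# Field names below are hypothetical placeholders, not vLLM's actual state.
@dataclass
class ExecuteModelState:
    input_batch: Any
    hidden_states: Any
    num_tokens: int


def as_tuple():
    return ("batch", "hidden", 8)


def as_dict():
    return {"input_batch": "batch", "hidden_states": "hidden", "num_tokens": 8}


def as_dataclass():
    return ExecuteModelState(input_batch="batch", hidden_states="hidden", num_tokens=8)


# Tuple: positional unpacking yields wrong values if fields are reordered.
_, _, n_tuple = as_tuple()
# Dict: readable keys, but a misspelled key only fails at runtime (KeyError).
n_dict = as_dict()["num_tokens"]
# Dataclass: named attributes that type checkers and IDEs can verify.
n_dc = as_dataclass().num_tokens
```

All three carry the same data; the difference lies in how early a mistake is caught and how much ceremony each form adds.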

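🔥 Hot Topics and Trends

Deep integration of the Qwen model family, and its problems: Qwen (especially the 3.5, 3-VL, and Next versions) is currently the most active area of model support and also the most bug-prone. This period saw several Issues: output corruption and accuracy loss caused by AITER compatibility, an AssertionError during CUDA graph capture for GatedDeltaNet (GDN) layers, a timestamp-handling bug for video input, and incompatibility between Marlin quantization and TP. These reflect the challenges vLLM faces in supporting complex, novel model architectures.
Multimedia and realtime streaming: The RFC on a model-specific realtime audio streaming abstraction (#35908) and the bug report on Qwen3-VL video input handling (#35909) show that multimodal and low-latency streaming input are becoming an important feature frontier and testing focus.
Evolution of distributed execution backends: Besides the Ray backend RFC, there is an RFC on unified engine process monitoring (#35858) with a corresponding PR (#35862), intended to lay the groundwork for elastic scaling and fault tolerance. This shows vLLM moving toward a more stable and flexible production-grade distributed system.
Responses API enhancements and pluggability: Several PRs (#35903, #35905, #35904, #35906) extend the Responses API, adding stateless multi-turn conversations, pluggable response storage backends, handling of leaked Harmony protocol tokens, and combined structured output and reasoning. This shows that OpenAI Responses API compatibility is becoming increasingly important and complex.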
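A pluggable response storage backend, as proposed in #35905, is typically expressed as a small interface with interchangeable implementations. The following is a minimal sketch of that idea. Only the name ResponseStore comes from the PR title; the method names, the in-memory backend, and the helper `load_previous_turn` are assumptions and may differ from the actual PR.

```python
from typing import Optional, Protocol


class ResponseStore(Protocol):
    """Hypothetical storage interface; method names are assumptions."""

    def put(self, response_id: str, payload: dict) -> None: ...

    def get(self, response_id: str) -> Optional[dict]: ...


class InMemoryResponseStore:
    """Simplest possible backend: a process-local dictionary."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def put(self, response_id: str, payload: dict) -> None:
        self._data[response_id] = payload

    def get(self, response_id: str) -> Optional[dict]:
        return self._data.get(response_id)


def load_previous_turn(store: ResponseStore, previous_response_id: str) -> dict:
    """Look up the prior turn, raising if the backend does not have it."""
    payload = store.get(previous_response_id)
    if payload is None:
        raise KeyError(f"unknown response id: {previous_response_id}")
    return payload
```

Because `ResponseStore` is a structural protocol, a Redis-backed or database-backed class with the same two methods could be substituted without changing the calling code.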

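🛠️ Key Technical Changes

PR #35601 (merged): [ROCm][Bugfix]: Disable AITER Triton ROPE by default. This PR partially rolls back an earlier optimization because, in some large-batch scenarios, AITER's RoPE implementation is 25% slower than the native one. It shows that performance optimizations need careful, workload-specific tuning and trade-offs; an aggressive optimization enabled by default can expose users to performance regressions.
PR #34552 (merged): [BugFix] Add support for MTP num_speculative_tokens > 1 with sparse MLA. Fixes a crash during multi-token prediction (MTP) speculative decoding in models that use sparse MLA attention, such as GLM-5. Supporting complex attention patterns in speculative decoding remains technically difficult, and this fix widens the range of models it can be used with.
PR #35838 (merged): Enable bnb for multiple indices weight. Fixes a crash in bitsandbytes quantization when loading fused weights with multiple shard indices. This improves the usability of 4-bit quantization across more models, which benefits resource-constrained deployments.
Issue #35925 (in progress): [Bug]: Qwen3.5-35B-A3B generates corrupted responses when AITER is enabled. This is a severe correctness problem indicating a conflict between AMD's AITER optimizations and some characteristic of Qwen3.5. The root cause may be a numerical error in the attention, MoE, or RoPE kernels under specific input patterns, and it warrants priority investigation by the AMD team.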

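📈 Development Activity Observations

Contributors: The core team (such as WoosukKwon, Isotr0py), hardware vendor teams (AMD's fxmarty-amd, ChuanLi1101, hanlin12-AMD), other corporate contributors (such as Red Hat's robertgshaw2-redhat), and community developers together form an active contributor ecosystem. The AMD team's contributions were especially prominent this period.
Code review and merging: 94 new PRs and 38 merged PRs indicate rapid code flow. Many PRs were flagged by CI for simple rebase conflicts or pre-commit check failures, a reminder for contributors to stay in sync with the main branch and follow code conventions. The ready label marks PRs that can enter full CI testing and is a key step in the merge process.
Issue handling: 18 Issues were closed, including some old Issues closed automatically after long inactivity. For newly reported bugs, maintainers responded quickly, usually asking for detailed environment information to enable reproduction.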

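💡 Issues Worth Watching

AITER stability and model compatibility: Issues #35925 and #35828 show that AITER optimizations can cause severe model output quality problems. This calls for an urgent fix from the AMD team and for a more complete regression test suite covering model-hardware-optimization combinations.
The future of the Ray distributed backend: The refactoring proposed in RFC #35848 is a major architectural change. Although it has initial approval, its implementation details, impact on existing users, and performance will need close tracking in follow-up PRs.
Design of the realtime streaming API: The "model-specific realtime streaming abstraction" proposed in RFC #35908 tries to solve the problem that the current protocol is too simple, which lets model logic leak into the shared layer. Its design choice (extending the protocol, adding parameters, or creating a new engine API) will affect how all future ASR/realtime audio models are integrated.
The continuing challenge of complex model architectures: From Qwen's GDN and Mamba to various sparse attention and MoE models, vLLM must keep adapting to rapidly evolving model architectures, which puts long-term pressure on its kernel implementations, scheduler, and quantization support.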

📋 Appendix: Detailed Data Lists

New Issues

#35945 [Bug]: AssertionError in causal_conv1d_update when capturing CUDA graphs for Qwen3.5/GDN layers | bug | by git-jxj (created: 2026-03-04 10:56 (UTC+8))
#35854 [Bug]: 4 * 2080ti 22G Qwen3.5-35B-A3B Fail | bug | by VIT-Valentin (created: 2026-03-03 17:55 (UTC+8))
#35935 [Bug]: DCP + CUDA graphs broken - dcp_context_kv_lens frozen during capture | bug | by LucasWilkinson (created: 2026-03-04 08:55 (UTC+8))
#35924 [Bug] Qwen3.5 GatedDeltaNet in_proj_ba fails Marlin MIN_THREAD_N=64 at TP>=4 | no label | by sonusflow (created: 2026-03-04 06:11 (UTC+8))
#35944 [Bug]: jetson agx thor报错 [jetson agx thor reports an error] | bug | by jouewer (created: 2026-03-04 10:38 (UTC+8))
#35943 [Bug]: [MooncakeConnector] Decode side requests stuck in WAITING_FOR_REMOTE_KVS indefinitely on transfer failure, causing HBM leak | bug | by relat-ivity (created: 2026-03-04 10:23 (UTC+8))
#35940 [Usage]: vllm 可以部署grouding-dino模型吗 [can vllm deploy the grouding-dino model?] | usage | by zkailinzhang (created: 2026-03-04 09:53 (UTC+8))
#35845 [Bug]: llm serve "Qwen/Qwen3.5-27B" --quantization bitsandbytes | bug,rocm | by flutist (created: 2026-03-03 15:17 (UTC+8))
#35908 [RFC]: Model-specific realtime streaming abstraction | no label | by TheCodeWrangler (created: 2026-03-04 04:32 (UTC+8))
#35909 [Bug]: Error when using Qwen3-VL/Qwen3.5 with video input and num_frames | bug | by danigarciaoca (created: 2026-03-04 04:36 (UTC+8))
#35925 [Bug]: Qwen3.5-35B-A3B generates corrupted responses when AITER is enabled | bug,rocm | by jennyyyyzhen (created: 2026-03-04 06:48 (UTC+8))
#35848 [RFC]: Revamp Ray Distributed Executor Backend (from Ray team) | RFC | by jeffreywang-anyscale (created: 2026-03-03 16:35 (UTC+8))
#35926 [Bug]: Illegal Memory Access in NemotronH MTP | bug | by benchislett (created: 2026-03-04 06:59 (UTC+8))
#35922 [Bug]: Marlin MoE kernels crash on A100 (SM80) with compressed-tensors quantized Qwen3-30B-A3B | no label | by zeryx (created: 2026-03-04 05:59 (UTC+8))
#35920 [Bug]: UMA Memory Profiling Misattributes OS Page Cache and Fails in Concurrent Deployments | bug | by EmilHaase (created: 2026-03-04 05:57 (UTC+8))
#35919 [Feature]: FA4 MLA decode | feature request | by LucasWilkinson (created: 2026-03-04 05:54 (UTC+8))
#35915 TPU: Override use_custom_op_collectives() in TpuPlatform | no label | by nkm-meta (created: 2026-03-04 05:20 (UTC+8))
#35896 [Bug]: [CPU Backend] ARM build fails with distro/downstream PyTorch that does not vendor libgomp | cpu | by wjhrdy (created: 2026-03-04 02:36 (UTC+8))
#35878 [Feature]: Support DeepGEMM MTP3 NV Kernel | feature request | by benchislett (created: 2026-03-04 00:21 (UTC+8))
#35880 [Bug]: Qwen/Qwen3.5-27B-GPTQ-Int4 not working /tmp/tmpb6zywp4k/__triton_launcher.c:7:10: fatal error: Python.h: No such file or directory 7 #include | bug | by NilsHellwig (created: 2026-03-04 00:31 (UTC+8))
#35868 [Bug] Qwen3-ASR crashes without --enforce-eager: missing dynamic_arg_dims for MRoPE positions | no label | by TheCodeWrangler (created: 2026-03-03 21:57 (UTC+8))
#35863 [Bug]: Voxtral-Realtime stops returning transcribed text starting from the 3rd concurrent session | bug | by sh1man999 (created: 2026-03-03 21:03 (UTC+8))
#35861 [Bug]: error occurs when using uv pip install -e . on v0.16.0 | bug | by GranMin (created: 2026-03-03 20:17 (UTC+8))
#35858 [RFC]: Unify engine process monitoring in engine manager and add Ray backend support | RFC | by fangyuchu (created: 2026-03-03 19:48 (UTC+8))
#35857 [Bug]: NameError: 'IntermediateTensors' is not defined in GPT-OSS model during Pipeline Parallelism | bug | by burakkilic11 (created: 2026-03-03 18:54 (UTC+8))
#35847 [Performance]: The checkpoint you are trying to load has model type qwen3_5 but Transformers does not recognize this architecture. | performance | by jamin85cheng (created: 2026-03-03 16:13 (UTC+8))
#35836 [Bug]: Support tool_choice=none in Anthropic API | bug | by ZhongsJie (created: 2026-03-03 14:22 (UTC+8))
#35832 [Bug]: prompt_logprobs ignores logprobs_mode | bug | by ar0ck (created: 2026-03-03 13:35 (UTC+8))
#35828 [Bug]: Qwen3-Next accuracy regression on AMD | bug,rocm | by jennyyyyzhen (created: 2026-03-03 12:25 (UTC+8))
#35823 [Bug]: branch v0.16.0 still rely on torch 2.9.1, not 2.10 | bug | by chamwen (created: 2026-03-03 11:23 (UTC+8))

Closed Issues

#35854 [Bug]: 4 * 2080ti 22G Qwen3.5-35B-A3B Fail | bug | by VIT-Valentin (closed: 2026-03-04 11:15 (UTC+8))
#35935 [Bug]: DCP + CUDA graphs broken - dcp_context_kv_lens frozen during capture | bug | by LucasWilkinson (closed: 2026-03-04 11:13 (UTC+8))
#30847 [Bug]: Qwen 3VL via Efficient Video Sampling (EVS) to trim video embeddings and found that the number of tokens after timestamp in the Prompt was not aligned with the actual number of tokens after pruning? | bug | by xshqhua (closed: 2026-03-04 10:18 (UTC+8))
#27590 [Bug]: Compile Integration should reuse for identical code | bug,torch.compile,stale | by Lucaskabela (closed: 2026-03-04 10:16 (UTC+8))
#27928 [Bug]: What happened to /get_world_size ? | bug,stale | by pbarker-synth (closed: 2026-03-04 10:16 (UTC+8))
#27956 [Feature]: awq marlin performs poorly when dealing with long inputs | feature request,stale | by dingjingzhen-bot (closed: 2026-03-04 10:16 (UTC+8))
#27979 [Bug]:model response repeat same sentence and never stop on v0.11.x version, v0.10.2 is ok | bug,stale | by markluofd (closed: 2026-03-04 10:16 (UTC+8))
#27984 [Bug]: swap_space parameter is unused and useless | bug,stale | by mcelrath (closed: 2026-03-04 10:16 (UTC+8))
#35845 [Bug]: llm serve "Qwen/Qwen3.5-27B" --quantization bitsandbytes | bug,rocm | by flutist (closed: 2026-03-04 09:46 (UTC+8))
#33599 [RFC]: Improving vLLM Dependency Compatibility with Downstream Ecosystems | RFC | by jeffreywang-anyscale (closed: 2026-03-04 05:35 (UTC+8))
#35772 [Bug]: FusedARRMS Hang on startup during cudagraph capture TP>1 | bug,torch.compile | by benchislett (closed: 2026-03-04 00:38 (UTC+8))
#29855 [Bug]: DSR1 fp8/nvfp4 MTP with spec num 3 load weight failure | bug,stale | by shyeh25 (closed: 2026-03-03 23:55 (UTC+8))
#34380 [Bug]: GLM-5 MTP crashes with num_speculative_tokens > 1 | bug | by mgoin (closed: 2026-03-03 23:21 (UTC+8))
#35804 [Feature]: PRISM 153-key - Legitimacy Verification Layer for Model Selection Algorithm | feature request | by Mossaab-s (closed: 2026-03-03 21:58 (UTC+8))
#35746 [Bug]: Segfault at IP=0 during model warmup on AVX512_BF16 host (AMD 7940HS) | bug,cpu | by NetWilliam (closed: 2026-03-03 21:50 (UTC+8))
#35724 [Bug] H100 PCIe: RuntimeError '[SymmDeviceMemory] Device does not support multicasting' when running Qwen3.5-122B with TP=2 | usage | by wallbreaker740 (closed: 2026-03-03 19:57 (UTC+8))
#33118 [RFC]: Hidden States Extraction | RFC | by fynnsu (closed: 2026-03-03 12:22 (UTC+8))
#35462 [Installation]: branch v0.16.0 still rely on torch 2.9.1, not 2.10 | installation | by chamwen (closed: 2026-03-03 11:23 (UTC+8))

New PRs

#35946 Fix incorrect assertion in causal_conv1d_update for indexed conv_state | no label | by git-jxj (created: 2026-03-04 11:04 (UTC+8))
#35835 [BugFix] Support tool_choice=none in the Anthropic API | bug,frontend,ready | by ZhongsJie (created: 2026-03-03 14:22 (UTC+8))
#35862 [Refactor] Unify engine process monitoring in engine manager and add Ray backend support | frontend,v1 | by fangyuchu (created: 2026-03-03 20:26 (UTC+8))
#35895 [Bugfix] Fix minimax_m2 tool parser when stream interval > 1 | bug,tool-calling | by sfeng33 (created: 2026-03-04 02:33 (UTC+8))
#35932 [Bugfix] Fix MiniMax M2 TP dimension issue in q_norm and k_norm | bug | by shengxinhu (created: 2026-03-04 08:43 (UTC+8))
#35942 [GGUF] Add support for MiniMax-M2 architecture | no label | by xueliangyang-oeuler (created: 2026-03-04 10:08 (UTC+8))
#35941 [Model Runner V2] Misc code simplification | v1 | by njhill (created: 2026-03-04 09:54 (UTC+8))
#35939 [ROCm][Quantization] Simplify activation scale passing for triton mxfp4 w4afp8 | rocm,gpt-oss | by BowenBao (created: 2026-03-04 09:50 (UTC+8))
#35838 Enable bnb for multiple indices weight | ready | by flutist (created: 2026-03-03 15:03 (UTC+8))
#35930 [Model Runner V2] Use dictionary instead of tuple for execute_model_state | v1 | by WoosukKwon (created: 2026-03-04 08:04 (UTC+8))
#35938 [Model Runner V2] Reuse constant tensors in input prep | v1 | by njhill (created: 2026-03-04 09:30 (UTC+8))
#35937 [Bugfix] Clamp out-of-bounds token IDs in MTP models | bug,qwen | by scottgl9 (created: 2026-03-04 09:16 (UTC+8))
#35934 feat: add suspend()/resume() for CRIU-safe snapshots | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,deepseek | by will-deines (created: 2026-03-04 08:49 (UTC+8))
#35936 fix: tool_choice="required" falls back to tool_parser for non-JSON formats | frontend | by voipmonitor (created: 2026-03-04 08:57 (UTC+8))
#35897 [DO NOT MERGE] Attempting to fix quotation | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-03-04 03:03 (UTC+8))
#35918 [Model Runner V2] Fix FA3 cuda graphs | v1,nvidia | by njhill (created: 2026-03-04 05:35 (UTC+8))
#35933 docs: add README for logits_processor examples | documentation | by mitre88 (created: 2026-03-04 08:44 (UTC+8))
#35931 [Bugfix][LMCache][KVConnector] fix potential memory leak in LMCache multiprocess mode | bug,kv-connector | by royyhuang (created: 2026-03-04 08:24 (UTC+8))
#35912 [CI/Build] Allow mounting AWS credentials for sccache S3 auth | ready,ci/build | by amrmahdi (created: 2026-03-04 04:58 (UTC+8))
#35927 [Refactor][FusedMoE] Move PF Methods to Folder | ready,nvidia | by robertgshaw2-redhat (created: 2026-03-04 07:01 (UTC+8))
#35929 fix(memory): resolve UMA concurrent profiling race condition and page cache leak | no label | by EmilHaase (created: 2026-03-04 07:48 (UTC+8))
#35916 [Bugfix] Fix coord_socket assertion in DPEngineCoreProc for offline DP mode | bug,ready,v1 | by jaewonlee-fb (created: 2026-03-04 05:24 (UTC+8))
#35877 [CustomOp] Make FusedRMSNormGated a CustomOp | no label | by eellison (created: 2026-03-04 00:09 (UTC+8))
#35928 [RL] [Weight Sync] Guard IPC update-info pickle deserialization behind insecure serialization flag | codex,aardvark | by simon-mo (created: 2026-03-04 07:40 (UTC+8))
#35825 [Core] Move save_tensorized_model logic to Worker | ready,v1 | by njhill (created: 2026-03-03 12:12 (UTC+8))
#35892 [Bugfix] Fix inner_dp_world initialization order for multi-node TP | bug,ready,ci/build,v1 | by zyongye (created: 2026-03-04 02:08 (UTC+8))
#35885 [Bugfix] Make prompt_logprobs respect logprobs_mode setting | bug,v1 | by wojciech-wais (created: 2026-03-04 01:01 (UTC+8))
#35923 [Bugfix] Add 1056 block_size to triton fallback on rocm Attention to support qwen3_5 | bug,documentation,rocm,v1,qwen | by JartX (created: 2026-03-04 06:11 (UTC+8))
#35893 [ROCm][Bugfix] Fall back from CK MXFP4 MoE when GEMM dimensions are unsupported | bug,rocm | by ChuanLi1101 (created: 2026-03-04 02:16 (UTC+8))
#35921 [compile] Fix extra cache save on warm start. | no label | by zhxchen17 (created: 2026-03-04 05:57 (UTC+8))
#35917 [Model Runner V2] Fix inputs_embeds=None bug for MM models | bug,v1 | by WoosukKwon (created: 2026-03-04 05:24 (UTC+8))
#35903 feat(responses): stateless multi-turn via encrypted_content state carrier (RFC #26934) | frontend | by will-deines (created: 2026-03-04 04:16 (UTC+8))
#35905 feat(responses): pluggable ResponseStore abstraction | frontend | by will-deines (created: 2026-03-04 04:17 (UTC+8))
#35914 Add CUDA-Agent: Multi-stage RL training pipeline for CUDA kernel optimization | nvidia | by windtara0619 (created: 2026-03-04 05:11 (UTC+8))
#35906 [Responses API] Sanitize leaked Harmony control tokens in tool names and recipients | frontend,gpt-oss | by will-deines (created: 2026-03-04 04:17 (UTC+8))
#35890 [Perf] Use dummy M for weight prepacking on x86 | cpu | by tianmu-li (created: 2026-03-04 01:52 (UTC+8))
#35913 [Rocm][CI] Fix ROCm LM Eval Large Models (8 Card) | rocm,ci/build | by charlifu (created: 2026-03-04 05:06 (UTC+8))
#35887 [ROCm][CI] Fix TP size issue for test_gpt_oss | rocm,ready,gpt-oss | by micah-wil (created: 2026-03-04 01:22 (UTC+8))
#35911 WIP: NCCL Weight metadata transfer | no label | by S1ro1 (created: 2026-03-04 04:54 (UTC+8))
#35910 fix(cmake): support distro PyTorch without vendored libgomp | ci/build,cpu | by wjhrdy (created: 2026-03-04 04:38 (UTC+8))
#35904 [Responses API] Structured output + reasoning via structural tag embedding | structured-output,frontend,v1,gpt-oss | by will-deines (created: 2026-03-04 04:17 (UTC+8))
#35907 [Harmony] Fix analysis-channel tool calls and preserve reasoning across turns | frontend,gpt-oss | by will-deines (created: 2026-03-04 04:17 (UTC+8))
#35901 [Responses API] Sanitize leaked Harmony control tokens in tool names and recipients | frontend,gpt-oss | by will-deines (created: 2026-03-04 04:09 (UTC+8))
#35900 feat(responses): pluggable ResponseStore abstraction | frontend | by will-deines (created: 2026-03-04 04:09 (UTC+8))
#35899 feat: adjust FA3 num_splits on Hopper for low latency in cudagraph mode | v1,nvidia | by jmkuebler (created: 2026-03-04 03:52 (UTC+8))
#35894 [Model] Qwen3-ASR realtime: SDK-style streaming with cross-segment context | frontend,v1,qwen | by TheCodeWrangler (created: 2026-03-04 02:25 (UTC+8))
#35902 [Harmony] Fix analysis-channel tool calls and preserve reasoning across turns | frontend,gpt-oss | by will-deines (created: 2026-03-04 04:09 (UTC+8))
#35891 [Perf] Support FP8 KV cache for Flashinfer MLA Sparse | documentation,ready,v1,nvidia | by wzhao18 (created: 2026-03-04 02:04 (UTC+8))
#35898 [XPU][QWEN3_NEXT] remove fla hardcode to cuda | qwen,nvidia | by xuechendi (created: 2026-03-04 03:37 (UTC+8))
#35889 [Bugfix] Fix LoRA + TP > 1 crash due to non-contiguous input tensor in lora_shrink_op.py | bug | by cmacboyd (created: 2026-03-04 01:39 (UTC+8))
#35873 [Responses API] Structured output + reasoning via structural tag embedding | structured-output,frontend,v1,gpt-oss | by will-deines (created: 2026-03-03 23:12 (UTC+8))
#35881 [Responses API] Sanitize leaked Harmony control tokens in tool names and recipients | frontend,gpt-oss | by will-deines (created: 2026-03-04 00:33 (UTC+8))
#35884 [Harmony] Fix analysis-channel tool calls and preserve reasoning across turns | frontend,gpt-oss | by will-deines (created: 2026-03-04 01:00 (UTC+8))
#35874 feat(responses): pluggable ResponseStore abstraction | frontend | by will-deines (created: 2026-03-03 23:26 (UTC+8))
#35859 [Quark] Support loading Quark NVFP4 checkpoints in vLLM | rocm,needs-rebase | by fxmarty-amd (created: 2026-03-03 19:52 (UTC+8))
#35855 [NVFP4][OCP MX] Support ahead of time weight dequantization for emulation backend for dense and MOE models | rocm,needs-rebase | by fxmarty-amd (created: 2026-03-03 18:23 (UTC+8))
#35870 [CI] Temporarily Disable Llama4 MoE Refactor Test | ready,llama | by robertgshaw2-redhat (created: 2026-03-03 22:37 (UTC+8))
#35888 [CI] Fix mypy for vllm/config | v1,nvidia | by hickeyma (created: 2026-03-04 01:37 (UTC+8))
#35869 [Bugfix] Add missing dynamic_arg_dims for Qwen3-ASR torch.compile | bug,ready,qwen | by TheCodeWrangler (created: 2026-03-03 22:03 (UTC+8))
#35882 [CI] Bump num_speculative_tokens to 3 in nightly DeepSeek tests | ready,deepseek | by MatthewBonanni (created: 2026-03-04 00:45 (UTC+8))
#35886 [Bugfix][Minor] Fix potential NameError in mamba backend selector and misc typos | bug,rocm,v1 | by ChuanLi1101 (created: 2026-03-04 01:03 (UTC+8))
#35883 [Chore] Remove debug code in model implementation | ready | by Isotr0py (created: 2026-03-04 00:55 (UTC+8))
#35872 [Refactor] Clean up processor kwargs extraction | ready | by DarkLight1337 (created: 2026-03-03 22:47 (UTC+8))
#35879 [Performance] Fuse RoPE + KV cache update for MLA backends | needs-rebase,v1 | by ElizaWszola (created: 2026-03-04 00:27 (UTC+8))
#35876 [Bugfix][NIXL] Compute NIXL throughput using timestamps instead of summed durations | bug,v1,kv-connector | by PrasanaaV (created: 2026-03-03 23:56 (UTC+8))
#35875 Update TritonExperts to reflect support for none-gated ReLU^2 | no label | by Naveassaf (created: 2026-03-03 23:45 (UTC+8))
#35871 [CI] Add Blackwell AsyncTP correctness test | ready,ci/build | by stecasta (created: 2026-03-03 22:44 (UTC+8))
#35852 [Scheduler] Fix KV-cache head-of-line blocking (#31731) | v1 | by oneraghavan (created: 2026-03-03 16:59 (UTC+8))
#35849 [Bugfix] Fix score layer quantization for sequence classification models - Qwen3 (VL) Reranker | bug,qwen | by gkswns0531 (created: 2026-03-03 16:40 (UTC+8))
#35866 Order config.py in Lexicographical order | needs-rebase,ci/build,multi-modality | by askliar (created: 2026-03-03 21:41 (UTC+8))
#35860 Optimized gemm_half_q_half_gptq_4bit_kernel for reduced memory latency | no label | by YaozhengZhang (created: 2026-03-03 20:16 (UTC+8))
#35867 [Docs] Fix typos in comments and docstrings | v1 | by abdelhadi703 (created: 2026-03-03 21:44 (UTC+8))
#35864 [Core] Add structured request_id to log records | frontend,v1 | by PrasanaaV (created: 2026-03-03 21:09 (UTC+8))
#35853 [CI] And PPL test for Qwen3.5. | ready,qwen | by noooop (created: 2026-03-03 17:06 (UTC+8))
#35865 [Bugfix] fixes wna16 quantization for dense layers. | bug,qwen | by sighingnow (created: 2026-03-03 21:10 (UTC+8))
#35846 [Bugfix][CPUOffloadingManager] Prevent eviction of already-stored blocks in LRU/ARC prepare_store() | bug,ready,v1 | by ronensc (created: 2026-03-03 15:42 (UTC+8))
#35856 Segmented spans rebased | documentation,v1,kv-connector | by almogtavor (created: 2026-03-03 18:35 (UTC+8))
#35851 Lora multistream | no label | by kfhfar (created: 2026-03-03 16:56 (UTC+8))
#35850 [ROCm] Support MLA with nhead<16 and FP8 KV cache for TP=8 (Kimi K2.5/Linear) | rocm,v1 | by ChuanLi1101 (created: 2026-03-03 16:50 (UTC+8))
#35829 [Feature]: Support for multiple embedding types in a single inference call | frontend,v1 | by staugust (created: 2026-03-03 12:37 (UTC+8))
#35840 Hanlin12 hiplaslt online tuning | documentation,frontend | by hanlin12-AMD (created: 2026-03-03 15:10 (UTC+8))
#35826 fix: preserve prior-turn analysis messages in Harmony multi-turn conversations | frontend,gpt-oss | by OiPunk (created: 2026-03-03 12:19 (UTC+8))
#35833 [Bugfix] Fix TRITON_MLA FP8 KV cache decode on Blackwell GPUs | bug,documentation,v1 | by ricky-chaoju (created: 2026-03-03 13:40 (UTC+8))
#35842 docs: add comment for VLLM_WORKER_MULTIPROC_METHOD | no label | by Jah-yee (created: 2026-03-03 15:14 (UTC+8))
#35834 add regression test | ready | by hallerite (created: 2026-03-03 13:43 (UTC+8))
#35837 [Draft] update intel gpu buildkite ci | ci/build | by 1pikachu (created: 2026-03-03 14:54 (UTC+8))
#35841 docs: add comment for VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE | no label | by Jah-yee (created: 2026-03-03 15:13 (UTC+8))
#35830 [Bugfix] Fix stream_interval > 1 for jamba tool parser | bug,needs-rebase | by sfeng33 (created: 2026-03-03 12:39 (UTC+8))
#35844 [V1] Fix illegal memory access and background thread crash during asy… | v1 | by xueliangyang-oeuler (created: 2026-03-03 15:16 (UTC+8))
#35839 docs: add comment explaining VLLM_FUSED_MOE_CHUNK_SIZE | no label | by Jah-yee (created: 2026-03-03 15:09 (UTC+8))
#35843 docs: add comments for assets cache variables | no label | by Jah-yee (created: 2026-03-03 15:15 (UTC+8))
#35824 [CI/Build] Trigger processor tests on registry update | ready,ci/build | by DarkLight1337 (created: 2026-03-03 12:06 (UTC+8))
#35831 [LMCache MP Patch]: Race Condition + Duplicated Block Ids | kv-connector | by sammshen (created: 2026-03-03 12:49 (UTC+8))
#35827 Fix issue #35820 | no label | by xueliangyang-oeuler (created: 2026-03-03 12:22 (UTC+8))

Merged PRs

#35294 [Model Runner V2] support dp & ep for spec decoding | v1,nvidia | by izhuhaoran (merged: 2026-03-04 07:19 (UTC+8))
#35754 [Bugfix] Avoid src/dst as None in irecv/isend_tensor_dict | bug,ready,ci/build,cpu | by bigPYJ1151 (merged: 2026-03-03 21:56 (UTC+8))
#33607 [Bugfix] Fix EVS implementation for Qwen3 VL | bug,ready,v1,multi-modality,qwen | by 2ez4bz (merged: 2026-03-04 10:18 (UTC+8))
#35838 Enable bnb for multiple indices weight | ready | by flutist (merged: 2026-03-04 09:46 (UTC+8))
#35710 [ROCm][CI] Support async weight transfer example with platform-aware determinism | documentation,rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-03-04 09:44 (UTC+8))
#35912 [CI/Build] Allow mounting AWS credentials for sccache S3 auth | ready,ci/build | by amrmahdi (merged: 2026-03-04 06:30 (UTC+8))
#35916 [Bugfix] Fix coord_socket assertion in DPEngineCoreProc for offline DP mode | bug,ready,v1 | by jaewonlee-fb (merged: 2026-03-04 08:10 (UTC+8))
#35825 [Core] Move save_tensorized_model logic to Worker | ready,v1 | by njhill (merged: 2026-03-04 07:31 (UTC+8))
#34552 [BugFix] Add support for MTP num_speculative_tokens > 1 with sparse MLA | bug,speculative-decoding,ready,v1 | by LucasWilkinson (merged: 2026-03-03 23:21 (UTC+8))
#35917 [Model Runner V2] Fix inputs_embeds=None bug for MM models | bug,v1 | by WoosukKwon (merged: 2026-03-04 05:47 (UTC+8))
#35813 [Bugfix] Fix misnamed parameter in compressed_tensors_moe.py | bug,ready | by bnellnm (merged: 2026-03-04 05:29 (UTC+8))
#35887 [ROCm][CI] Fix TP size issue for test_gpt_oss | rocm,ready,gpt-oss | by micah-wil (merged: 2026-03-04 04:57 (UTC+8))
#35601 [ROCm][Bugfix]: Disable AITER Triton ROPE by default | bug,rocm,ready | by Rohan138 (merged: 2026-03-04 03:43 (UTC+8))
#32564 [MoE Refactor] Create MK for TRTLLM Kernels | documentation,performance,rocm,ci/build,llama,gpt-oss,nvidia,ready-run-all-tests | by robertgshaw2-redhat (merged: 2026-03-04 02:39 (UTC+8))
#35870 [CI] Temporarily Disable Llama4 MoE Refactor Test | ready,llama | by robertgshaw2-redhat (merged: 2026-03-04 02:36 (UTC+8))
#35882 [CI] Bump num_speculative_tokens to 3 in nightly DeepSeek tests | ready,deepseek | by MatthewBonanni (merged: 2026-03-04 01:26 (UTC+8))
#34715 fix: Ensure invalid audio files return 400 error | documentation,performance,frontend,ready,ci/build,v1,multi-modality,qwen,deepseek,cpu | by jasonozuzu-cohere (merged: 2026-03-04 00:47 (UTC+8))
#34986 TRTLLM gen-full attn Test Coverage | ready,v1,nvidia | by ojhaanshika (merged: 2026-03-04 00:35 (UTC+8))
#34307 [ROCm] [CI] Add new fusion test cases that are relevant to vLLM IR Ops | rocm,ready,ci/build | by tjtanaa (merged: 2026-03-03 22:24 (UTC+8))
#35604 [Frontend][1/n] Improve pooling entrypoints classify. | frontend,ready | by noooop (merged: 2026-03-03 22:05 (UTC+8))
#35853 [CI] And PPL test for Qwen3.5. | ready,qwen | by noooop (merged: 2026-03-03 21:15 (UTC+8))
#35442 [Perf] [Hybrid] Copy num_accepted_tokens in non-blocking way when not using prefix caching | ready,v1 | by tdoublep (merged: 2026-03-03 20:15 (UTC+8))
#31025 [CI/Build][Intel] Add new performance benchmarks for Intel Gaudi 3 | performance,ready,ci/build | by simonreginis (merged: 2026-03-03 19:48 (UTC+8))
#35821 [V0 deprecation] Remove Swin model | ready | by Isotr0py (merged: 2026-03-03 12:03 (UTC+8))
#35834 add regression test | ready | by hallerite (merged: 2026-03-03 15:32 (UTC+8))
#35198 [ROCm] [Release] Change the package from aiter to amd-aiter | rocm,ready,ci/build | by tjtanaa (merged: 2026-03-03 15:13 (UTC+8))
#35645 Fix TYPE_CHECKING stub defaults in envs.py to match actual runtime defaults | ready | by lin-shh (merged: 2026-03-03 13:59 (UTC+8))
#35648 [Misc] Fix typos in comments: explict→explicit, paramaters→parameters | ready | by lin-shh (merged: 2026-03-03 13:59 (UTC+8))
#35683 [MISC] Removed unused function find_all_indices() from tool_parsers/utils.py | ready | by taneem-ibrahim (merged: 2026-03-03 13:58 (UTC+8))
#35773 [BugFix] Fix cmake based incremental install (wrong vllm install dir) | bug,ready,ci/build | by LucasWilkinson (merged: 2026-03-03 13:58 (UTC+8))
#35824 [CI/Build] Trigger processor tests on registry update | ready,ci/build | by DarkLight1337 (merged: 2026-03-03 13:55 (UTC+8))
#35806 [ROCm][CI] Fix Assertion Logic For test_gpt_oss | rocm,ready,gpt-oss | by micah-wil (merged: 2026-03-03 13:08 (UTC+8))
#35427 [Refactor] Fix maxsim cuda platform and add cli to control it | frontend,ready,nvidia | by yewentao256 (merged: 2026-03-03 12:48 (UTC+8))
#35822 [CI/Build] Automatically patch video metadata for multimodal processor test | ready,multi-modality | by Isotr0py (merged: 2026-03-03 12:27 (UTC+8))
#35451 [Core] Add optional flags to check for repetitive token patterns in engine output | frontend,ready,v1 | by aykoppol (merged: 2026-03-03 12:23 (UTC+8))
#35774 [Model Runner V2] Use ModelState.prepare_attn() for cuda graph capture [5/N] | v1,nvidia | by WoosukKwon (merged: 2026-03-03 12:06 (UTC+8))
#35819 [Docs][Model Runner V2] Add Design Docs | documentation | by WoosukKwon (merged: 2026-03-03 11:58 (UTC+8))
#35459 [ModelRunnerV2] Rename sampler functions and variables for clarity | v1 | by andylolu2 (merged: 2026-03-03 11:48 (UTC+8))

PRs Closed Without Merging

#35707 scaffold logic which ensures that conditionally imported deps exist | rocm | by wjabbour (closed: 2026-03-04 11:09 (UTC+8))
#27813 [Feature][Hybrid] Integrate prefix caching into short conv models | needs-rebase,stale,v1 | by gigit0000 (closed: 2026-03-04 10:16 (UTC+8))
#27835 [CI/Build] Set test case to run two different containers on the same host | ready,ci/build,stale | by amdfaa (closed: 2026-03-04 10:16 (UTC+8))
#27989 Update delta_text and model_output format to include newline | stale | by Bsist (closed: 2026-03-04 10:16 (UTC+8))
#34762 [Bugfix]: Improving --kv-cache-dtype behavior when checkpoint specifies kv_cache_quant_algo | bug,v1 | by WorldExplored (closed: 2026-03-04 10:07 (UTC+8))
#35914 Add CUDA-Agent: Multi-stage RL training pipeline for CUDA kernel optimization | nvidia | by windtara0619 (closed: 2026-03-04 05:19 (UTC+8))
#33806 [Test] Add env var to disable reduced precision reduction for PyTorch… | needs-rebase,multi-modality | by tianrengao (closed: 2026-03-04 04:21 (UTC+8))
#35901 [Responses API] Sanitize leaked Harmony control tokens in tool names and recipients | frontend,gpt-oss | by will-deines (closed: 2026-03-04 04:11 (UTC+8))
#35900 feat(responses): pluggable ResponseStore abstraction | frontend | by will-deines (closed: 2026-03-04 04:11 (UTC+8))
#35902 [Harmony] Fix analysis-channel tool calls and preserve reasoning across turns | frontend,gpt-oss | by will-deines (closed: 2026-03-04 04:11 (UTC+8))
#35873 [Responses API] Structured output + reasoning via structural tag embedding | structured-output,frontend,v1,gpt-oss | by will-deines (closed: 2026-03-04 03:07 (UTC+8))
#35881 [Responses API] Sanitize leaked Harmony control tokens in tool names and recipients | frontend,gpt-oss | by will-deines (closed: 2026-03-04 03:07 (UTC+8))
#35884 [Harmony] Fix analysis-channel tool calls and preserve reasoning across turns | frontend,gpt-oss | by will-deines (closed: 2026-03-04 03:07 (UTC+8))
#35740 feat(responses): stateless multi-turn via encrypted_content state carrier (RFC #26934) | frontend | by will-deines (closed: 2026-03-04 03:07 (UTC+8))
#35874 feat(responses): pluggable ResponseStore abstraction | frontend | by will-deines (closed: 2026-03-04 03:07 (UTC+8))
#31690 [Core] Add optional flags to check for repetitive token patterns in engine output | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality | by aykoppol (closed: 2026-03-04 02:03 (UTC+8))
#35816 [CI] Fix AMD: V1 Others (mi325_1) Amd CI bug | bug,rocm,ready,ci/build | by yewentao256 (closed: 2026-03-04 01:52 (UTC+8))
#35852 [Scheduler] Fix KV-cache head-of-line blocking (#31731) | v1 | by oneraghavan (closed: 2026-03-03 22:52 (UTC+8))
#34332 fp8.py online quant: reuse layerwise reloading infra, take 3 | no label | by vkuzo (closed: 2026-03-03 22:09 (UTC+8))
#34020 [wip] layerwise loading for fp8.py, take 2 | no label | by vkuzo (closed: 2026-03-03 22:09 (UTC+8))
#35860 Optimized gemm_half_q_half_gptq_4bit_kernel for reduced memory latency | no label | by YaozhengZhang (closed: 2026-03-03 21:46 (UTC+8))
#35867 [Docs] Fix typos in comments and docstrings | v1 | by abdelhadi703 (closed: 2026-03-03 21:45 (UTC+8))
#33496 [Bugfix] Fix assertion error in flashmla backend with fullgraph enabled | bug,v1 | by Kurumi5210 (closed: 2026-03-03 20:11 (UTC+8))
#35856 Segmented spans rebased | documentation,v1,kv-connector | by almogtavor (closed: 2026-03-03 18:35 (UTC+8))
#35722 Hanlin12 hip online tuning | documentation,frontend | by hanlin12-AMD (closed: 2026-03-03 14:57 (UTC+8))
#32773 [Bugfix][Quant] fix online fp8 quantization oom | bug | by CSWYF3634076 (closed: 2026-03-03 14:17 (UTC+8))
#35803 [Bugfix] Fix SM121 (DGX Spark) exclusion from Marlin/CUTLASS FP8 paths | bug,nvidia | by blake-snc (closed: 2026-03-03 13:59 (UTC+8))
#35818 Revert "[CI/Build] Enable Qwen3.5 tests on CI" (#35763) | qwen | by zhewenl (closed: 2026-03-03 12:18 (UTC+8))