vLLM Development Activity Report - 2026-03-09

Time window: 2026-03-09 11:22 (UTC+8) ~ 2026-03-10 11:22 (UTC+8)
Statistics: New issues 30 | Closed issues 25 | New PRs 114 | Merged PRs 46 | PRs closed without merging 24

📊 Daily Development Summary

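Over the past 24 hours (2026-03-09 to 2026-03-10), the vLLM project remained highly active, with 114 new PRs and 30 new issues. Community attention centered on stability problems in the Qwen3.5 model family (particularly FP8 and GPTQ quantization and the MoE variants) and on memory leaks in multimodal inference. Integration and optimization work for the AMD ecosystem also continued: several ROCm- and Quark-related PRs were submitted to improve performance and feature support on AMD hardware.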

🎯 AMD/ROCm Ecosystem Updates

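AMD-related contributions were very active this cycle, focused on performance optimization, kernel extensions, and infrastructure improvements.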

New issues (1):

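#36454: [Performance regression] v0.15.0 vs. ROCm v0.14.0 throughput comparison
- User @Spurthi-Bhat-ScalersAI reported that, in benchmarks with a Qwen3 model on 8× MI300X, upstream vLLM v0.15.0 trailed the AMD-maintained ROCm fork v0.14.0 in throughput by a factor of about 1.34.
- Core team member @robertgshaw2-redhat replied that the upstream project should not track performance regressions relative to AMD's fork, and stressed that AMD is responsible for upstreaming its optimizations promptly so that the main branch carries the same improvements. This made the community's position on the upstream/downstream relationship explicit, and the issue was then closed.
- Impact: highlights the ongoing challenge of getting AMD hardware optimizations into the mainline codebase, and puts pressure on the AMD team to speed up its upstream contribution process.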

New/open PRs (7):

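[Performance] AWQ-Marlin config support (#36505): submitted by AMD employee @mgehre-amd. This PR lets AWQMarlinConfig use the unified choose_mp_linear_kernel framework on ROCm, replacing the earlier path that called the NVIDIA-specific ops.marlin_gemm or ops.awq_gemm directly. Tested with the Qwen3-4B-AWQ model on AMD Strix Halo, it improved decode throughput by 73%.
[Kernel] AITER persistent MLA decode kernel (#36574): submitted by AMD employee @SKPsanjeevi. Adds ROCm support for AITER's persistent-mode MLA (Multi-head Latent Attention) decode kernel, which keeps the kernel resident on the GPU compute units to avoid per-batch launch overhead, with the goal of improving decode performance. It is controlled by the environment variable VLLM_ROCM_USE_AITER_MLA_PERSISTENT.
[Kernel] AITER fused MLA decode kernel (#36573): submitted by AMD employee @khairulkabir1661. Adds support for an AITER fused decode kernel for MLA attention that combines FP4/FP8 BMM, RoPE, and the KV-cache write, with the goal of improving performance.
[Bugfix] Quark emulation logic (#36486): fixes the emulation path of Quark (AMD's quantization toolkit for MXFP4/MXFP6) not taking effect when AITER is disabled. The problem was introduced by an earlier PR: even when the user disabled AITER through an environment variable, the system still tried to use the AITER native path. This is a key fix for the AMD/Quark ecosystem.
[CI/Build] New GPU architecture support (#36499): submitted by AMD employee @mgehre-amd. Adds gfx1152/gfx1153 (RDNA 3.5 architecture, e.g. "Krackan") to the list of HIP-supported architectures so that built vLLM wheels include kernels for these newer GPUs.
[CI/Test] Reducing ROCm test flakiness (#36442): adds a retry mechanism to ROCm tests and disables the "skinny GEMMs" feature to cope with nondeterministic behavior in tests. In the discussion, core contributor @gshtras noted that the long-term solution should be to fix the skinny GEMMs bug rather than disable a production feature in tests.
[Infrastructure test] (#36562): marked "DO NOT MERGE"; used to test a new CI queue.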
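
The retry approach described for #36442 can be illustrated with a small decorator. This is a minimal sketch, not the PR's implementation: the name `retry_flaky` and its parameters are hypothetical, and the example only shows the general pattern of re-running a check that fails nondeterministically.

```python
import functools


def retry_flaky(times=3, exceptions=(AssertionError,)):
    """Re-run a flaky callable up to `times` times (hypothetical helper)."""
    if times < 1:
        raise ValueError("times must be >= 1")

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(times):
                try:
                    return fn(*args, **kwargs)
                except exceptions as exc:
                    last_exc = exc  # remember the failure and try again
            raise last_exc  # every attempt failed: surface the last error

        return wrapper

    return decorator


# Example: a check that fails twice before passing, standing in for a
# test affected by batch-to-batch numerical variance.
attempts = {"count": 0}


@retry_flaky(times=3)
def sometimes_fails():
    attempts["count"] += 1
    assert attempts["count"] >= 3, "simulated nondeterministic failure"
    return "passed"


result = sometimes_fails()
```

A wrapper like this hides nondeterminism rather than removing it, which is consistent with @gshtras's point that the underlying bug still needs a fix.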

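Summary: The AMD team was busy this cycle. It fixed key bugs that had surfaced (such as the Quark emulation logic), pushed performance work forward (AWQ-Marlin support, AITER kernels), and extended support for new hardware. The discussion on the performance regression issue also made clear the community's expectation that optimizations be upstreamed promptly.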

💬 High-Activity Discussions

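Issue #36452: Qwen3.5-35B-A3B-FP8 AssertionError
- Topic: a user reported that starting a Qwen3.5 FP8 model fails with an AssertionError.
- Positions:
  - User @Indigo-Coder-github: provided a detailed stack trace and reproduction steps.
  - Core contributor @ZJY0516: quickly located the problem, pointed the user to the Qwen3.5 configuration documentation, and suggested setting max_cudagraph_capture_size=256 or enforce_eager=True.
  - Another user, @NovaShen555: suggested trying gpu_memory_utilization=1.
- Dispute/conclusion: no substantive dispute. The user confirmed that the documented workaround works, then asked why the default value causes the error. @ZJY0516 explained that the default max_cudagraph_capture_size is larger than the maximum cache size, and noted that enforce_eager hurts performance. The issue shows the community building up "best practices" for the Qwen3.5 family, and how much the documentation matters.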
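
The two workarounds from this thread can be written down as a small helper that assembles a launch command. This is a sketch under assumptions: the helper `build_serve_args` is hypothetical, the flag spelling `--max-cudagraph-capture-size` is inferred from the parameter name quoted in the issue, and the model string is a placeholder. The helper only builds the argument list and does not start a server.

```python
def build_serve_args(model, max_cudagraph_capture_size=None, enforce_eager=False):
    """Assemble a `vllm serve` argument list (hypothetical helper)."""
    args = ["vllm", "serve", model]
    if enforce_eager:
        # Eager mode skips CUDA graph capture entirely, at a performance cost.
        args.append("--enforce-eager")
    elif max_cudagraph_capture_size is not None:
        # Assumed flag name, derived from the parameter mentioned in the issue.
        args += ["--max-cudagraph-capture-size", str(max_cudagraph_capture_size)]
    return args


# Preferred workaround from the thread: cap the capture size at 256.
capped = build_serve_args("Qwen3.5-35B-A3B-FP8", max_cudagraph_capture_size=256)

# Fallback workaround: disable CUDA graphs altogether.
eager = build_serve_args("Qwen3.5-35B-A3B-FP8", enforce_eager=True)
```

Capping the capture size keeps CUDA graphs for the batch sizes that fit, which is why it was suggested ahead of eager mode.
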
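Issue #36454: performance regression between vLLM v0.15.0 and ROCm v0.14.0
- Topic: comparing the performance of the upstream release with the AMD-maintained fork.
- Positions:
  - Reporter @Spurthi-Bhat-ScalersAI: provided detailed benchmark data and configuration showing that the AMD fork performs better.
  - Core member @robertgshaw2-redhat: speaking for the project, stated that upstream does not track regressions against downstream forks and that AMD is responsible for merging its optimizations upstream.
- Point of contention: this was less a technical dispute than a clear statement about project governance and collaboration. It draws the line between the community release and vendor-customized builds and emphasizes an "upstream first" principle.
- Conclusion: the issue was closed quickly with a clear outcome: the performance gap has to be resolved by AMD through upstream PRs.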
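Issue #36456: loading from a local GGUF path fails
- Topic: when a local GGUF file is used together with --hf-config-path, a code-path problem causes model architecture detection to fail (qwen35 is not supported yet).
- Positions:
  - Reporter @shba007: analyzed the problem in depth and pointed out that, for local GGUF files, maybe_override_with_speculators bypasses the user-supplied --hf-config-path and tries to parse the GGUF file directly, while the GGUF parser in the current transformers library does not yet support the qwen35 architecture.
  - Other users: asked about workarounds (such as upgrading transformers).
- Dispute/conclusion: a clear-cut bug report about inconsistent behavior between the local loading path and the HF Hub loading path. The discussion exposed differences in vLLM's model loading flow depending on the source (HF Hub vs. local file), a case of compatibility problems in the underlying framework.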
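PR #36486: fixing the Quark emulation logic
- Topic: how to correctly fix Quark's emulation logic when AITER is disabled.
- Positions:
  - PR author @wangjiaxin99: proposed simplifying the logic by merging several conditions into a single AND check.
  - Reviewer @ChuanLi1101 (possibly from AMD): pointed out that the simplified logic may wrongly couple two independent non-emulation paths (the CK native path and the Triton MXFP4 backend path), so that emulation is still forced in some scenarios (for example, when the hardware does not support MX but a Triton backend is available). The reviewer also mentioned another pending fix, PR #36422.
- Point of contention: the completeness and correctness of the fix, that is, whether simply merging the conditions is enough or the two independent paths need finer-grained handling.
- Current status: the PR is still open and needs its logic revised according to the review so that all cases are handled correctly.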
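
The reviewer's objection is easiest to see as boolean logic. The sketch below is illustrative only: the three flags and both function names are hypothetical and do not mirror the actual conditions in the Quark code. It contrasts a single AND check with a version that treats the two native paths independently.

```python
def should_emulate_coupled(hw_supports_mx, aiter_enabled, triton_backend):
    """Single AND check: native execution requires every condition at once."""
    use_native = hw_supports_mx and aiter_enabled and triton_backend
    return not use_native


def should_emulate_independent(hw_supports_mx, aiter_enabled, triton_backend):
    """Two separate native paths: either one is enough to skip emulation."""
    ck_native = hw_supports_mx and aiter_enabled  # stand-in for the CK native path
    triton_native = triton_backend                # stand-in for the Triton MXFP4 path
    return not (ck_native or triton_native)


# Scenario from the review: no hardware MX support, but a Triton backend exists.
scenario = dict(hw_supports_mx=False, aiter_enabled=False, triton_backend=True)
coupled = should_emulate_coupled(**scenario)
independent = should_emulate_independent(**scenario)
```

In this scenario the coupled check forces emulation while the independent check does not, which is the kind of divergence the review asks the PR to rule out.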

🔥 Hot Topics and Trends

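Stability challenges in the Qwen3.5 family: several new issues (#36452, #36489, #36476, #36566, #36478) concern startup failures, random crashes, or abnormal performance of Qwen3.5 models of different sizes (35B, 122B, 397B) on v0.17.0. The max_cudagraph_capture_size parameter keeps coming up as the key to temporary workarounds, which suggests a widespread mismatch between CUDA graph capture and KV-cache configuration for this model family.
Performance regressions and optimizations in parallel: on one hand, a user reported a throughput gap between upstream and the ROCm fork (#36454); on the other, many PRs (such as #36505, #36574, #35777) introduce new fused kernels and scheduling optimizations to cut overhead. This reflects the project's usual balancing of performance and stability under rapid iteration.
Multimodal inference and memory leaks: Issue #35191 (memory growing linearly until OOM during multimodal inference with Qwen3.5 397B) was closed in this cycle and may have been fixed. Problems of this kind still show how complex cache management and memory reclamation become when serving large multimodal models, and the area deserves continued attention.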
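Ecosystem expansion:
- Apple Silicon support: PR #36523 proposes platform support for the MPS (Metal) backend on macOS, with detailed documentation, addressing a long-standing community request (#1441).
- Architecture abstraction: RFC #36459 proposes refactoring the vLLM IR kernel registration mechanism to better support out-of-tree (OOT) hardware backends, laying a cleaner architectural foundation for integrating more heterogeneous compute platforms (AMD, Intel, Apple, and others).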
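
One way to picture out-of-tree kernel registration is to let each platform object decide which kernel implementations get registered. The sketch below is a toy model of that idea and not the RFC's API: `Platform`, `load_kernels`, `build_registry`, and the kernel names are all hypothetical, and the "kernels" are plain Python functions.

```python
def rms_norm_reference(values, eps=1e-6):
    """Pure-Python RMS normalization, used here as a fallback kernel."""
    mean_sq = sum(v * v for v in values) / len(values)
    scale = (mean_sq + eps) ** -0.5
    return [v * scale for v in values]


class Platform:
    """Base class: each platform registers its own kernels (hypothetical hook)."""

    name = "base"

    def load_kernels(self, registry):
        raise NotImplementedError


class ReferencePlatform(Platform):
    name = "reference"

    def load_kernels(self, registry):
        registry["rms_norm"] = rms_norm_reference


class OutOfTreePlatform(ReferencePlatform):
    """A third-party backend: reuses the fallbacks and adds its own op."""

    name = "oot"

    def load_kernels(self, registry):
        super().load_kernels(registry)
        registry["vendor_scale"] = lambda values, factor: [v * factor for v in values]


def build_registry(platform):
    # The core never imports kernel modules itself; it asks the platform.
    registry = {}
    platform.load_kernels(registry)
    return registry


oot_registry = build_registry(OutOfTreePlatform())
normalized = oot_registry["rms_norm"]([3.0, 4.0])
```

The design point is that the core code path stays the same whether the platform ships in-tree or as a plugin.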

🛠️ Key Technical Changes

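Apple MPS platform support (#36523): an important platform extension. The PR introduces a full MPS platform implementation, including device management, an attention backend implemented in pure PyTorch, and optional Metal INT4/GGUF dequantization kernels. It lets vLLM use Apple Silicon GPUs for accelerated inference; performance trails dedicated GPUs, but the range of deployment scenarios widens considerably.
vLLM IR kernel registration RFC (#36459): an important architectural proposal. It targets the current forced import of kernel modules, which makes third-party platform integration difficult. The core idea is to delegate kernel module loading to the Platform class so that each platform (such as CUDA, ROCm, or MPS) decides which kernel implementations to load. This improves vLLM's modularity and extensibility and is a long-term benefit for AMD and other heterogeneous backend communities.
Health check improvement (#36451): for the V1 engine, adds a health_ping() mechanism that uses lightweight IPC polling to detect whether the engine core is "livelocked" (the process is alive but its main loop is stuck). This improves observability and reliability in production and complements the existing death-sentinel mechanism, which can only detect process crashes.
AnyModel heterogeneous model support (#36512): a contribution from NVIDIA that aims to support heterogeneous-architecture models produced by NAS (neural architecture search), where per-layer settings such as the number of KV heads, FFN dimension, and number of experts may differ. It is reportedly the first work to support such highly flexible model structures in a mainstream inference framework, and it opens the door to deploying recent model compression and optimization results.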
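
The idea behind the health check in #36451 is that a ping must be answered from inside the engine's own loop, so a stuck loop fails the check even though the process is still alive. The sketch below is a toy model under assumptions: it uses a thread and in-memory queues instead of real IPC, and `EngineLoop` and this `health_ping` function are hypothetical stand-ins, not vLLM code.

```python
import queue
import threading
import time


class EngineLoop:
    """Toy stand-in for an engine core busy loop (hypothetical)."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.stuck = threading.Event()  # set this to simulate a hung loop
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            if self.stuck.is_set():
                time.sleep(0.01)  # still alive, but no longer servicing the inbox
                continue
            try:
                reply_box = self.inbox.get(timeout=0.01)
            except queue.Empty:
                continue
            reply_box.put("pong")  # the reply comes from inside the loop itself


def health_ping(engine, timeout=1.0):
    """Return True only if the loop answers a ping within `timeout` seconds."""
    reply_box = queue.Queue(maxsize=1)
    engine.inbox.put(reply_box)
    try:
        return reply_box.get(timeout=timeout) == "pong"
    except queue.Empty:
        return False


engine = EngineLoop()
healthy_before = health_ping(engine)

engine.stuck.set()   # simulate the loop getting stuck
time.sleep(0.05)     # give the loop time to observe the flag
healthy_after = health_ping(engine, timeout=0.1)
thread_alive = engine.thread.is_alive()
```

A liveness check based only on `thread_alive` would still report healthy here, which is the gap that a ping answered by the loop closes.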

📈 Development Activity Observations

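Contributor activity: PRs this cycle came from contributors with varied backgrounds, including the AMD team (-amd suffix), NVIDIA researchers, Apple ecosystem developers, and many community users, which shows that vLLM attracts broad industry participation.
Merge cadence: 46 PRs were merged, spanning bug fixes, performance optimizations, and new features. Review is also moving on larger open changes, such as the Eagle3 fix for Qwen3.5 (#36527) and Apple MPS support (#36523).
Issue closure: 25 issues were closed, including some older ones (such as the multimodal memory leak in #35191 and the Qwen3.5 GPTQ crash in #36249), which indicates that the team is following up on and working through the backlog.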

💡 Issues Worth Watching

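Root cause of the multimodal memory leak: although #35191 was closed, memory growth in large multimodal models (such as Qwen-VL and Qwen3.5-VL) under sustained high concurrency needs continued monitoring to confirm that the fix is complete.
Spreading Qwen3.5 "pitfall" knowledge: many users hit similar startup problems. It is worth evaluating whether recommended values for parameters such as max_cudagraph_capture_size should be built more directly into the model loading logic or the error messages, to lower the barrier for users.
AMD upstreaming progress: following the discussion on the performance regression issue, the community will watch closely whether and how the AMD team contributes key optimizations for MI300X and similar platforms to the main branch, to avoid a repeat of this kind of performance-gap dispute in future releases.
Memory overhead on XLA backends: Issue #36537 reports that record_metadata_for_reloading causes a 2-3× host memory blow-up on XLA backends. This significantly affects users on TPUs and similar hardware, and an optimization (such as lazy recording) should be evaluated soon; PR #36543 already proposes deferring the call.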
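
"Lazy recording" in the last item refers to the general pattern of postponing an expensive computation until its result is actually needed. The sketch below shows only that generic pattern. It makes no claim about what record_metadata_for_reloading does internally, and `LazyRecord` and `expensive_metadata` are hypothetical names.

```python
class LazyRecord:
    """Defer a costly computation until the first time the value is read."""

    def __init__(self, compute):
        self._compute = compute
        self._value = None
        self._ready = False

    def get(self):
        if not self._ready:
            self._value = self._compute()
            self._ready = True
            self._compute = None  # release whatever the closure was holding
        return self._value


calls = {"count": 0}


def expensive_metadata():
    # Stand-in for work that would allocate a lot of host memory up front.
    calls["count"] += 1
    return {"shape": (1024, 1024), "dtype": "float32"}


record = LazyRecord(expensive_metadata)
calls_before_read = calls["count"]  # nothing has been computed yet
first = record.get()
second = record.get()               # cached: no second computation
```

If the value is never read, the cost is never paid, which is the property that makes deferral attractive when the eager version inflates memory.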

📋 Appendix: Detailed Data Lists

New issues

#36485 [Feature]: will support qwen3.5-eagle3 deployment? | feature request | by aidenchn (created: 2026-03-09 19:12 (UTC+8))
#36498 [Bug]: There is something wrong with the use of mtp in qwen3.5-moe model: when it is changed to 0.17.0, it is wrong to directly report CudaError: an illegal memory access was encountered when reasoning with mtp. | bug | by piekey1994 (created: 2026-03-09 20:45 (UTC+8))
#36524 [Bug]: Accuracy Issue with FlashMLA Sparse on DeepSeek V3.2 | bug | by wzhao18 (created: 2026-03-10 01:16 (UTC+8))
#36454 [Performance]: vLLM v0.15.0 throughput regression compared to ROCm vLLM v0.14.0 | performance,rocm | by Spurthi-Bhat-ScalersAI (created: 2026-03-09 13:27 (UTC+8))
#36489 [Bug]: vllm 0.17.0 crashed unexpectedly during deployment of Qwen3.5 397b-fp8 version. | bug | by monkeywl2020 (created: 2026-03-09 19:29 (UTC+8))
#36566 [Bug]:Qwen3.5-35B-A3B vllm v0.17.0 ERROR 03-10 00:52:24 [multiproc_executor.py:261] Worker proc VllmWorker-0 died unexpectedly, shutting down executor. | bug | by jieguolove (created: 2026-03-10 09:05 (UTC+8))
#36565 [Bug]: GPTBigCode scale_attn_weights config flag is ignored in vLLM | bug | by Qi-Zhan (created: 2026-03-10 09:01 (UTC+8))
#36547 [Bug] MultiConnector: NIXL transfers silently broken after HMA migration | bug | by ZhanqiuHu (created: 2026-03-10 05:08 (UTC+8))
#36533 [Bug]: Negative prompt token counter crashes engine under CPU KV offloading + high concurrency | bug | by cquil11 (created: 2026-03-10 02:56 (UTC+8))
#36537 [Bug]: record_metadata_for_reloading causes ~3x host memory regression during torch.compile on XLA backends | bug | by kmabeeTT (created: 2026-03-10 04:31 (UTC+8))
#36530 [Bug]: pip install 0.17 fails with CXXABI_1.3.15 not found | bug | by jlqibm (created: 2026-03-10 02:44 (UTC+8))
#36526 [Bug]: DeepSeek hangs with overridden num_hidden_layers | bug | by ProExpertProg (created: 2026-03-10 01:42 (UTC+8))
#36513 Analyze middleware traces: OWUI sampling profile comparison | no labels | by jsboige (created: 2026-03-09 23:12 (UTC+8))
#36502 [Feature]: Built-in debug tensor dump for intermediate activations | feature request | by MOIPA (created: 2026-03-09 21:31 (UTC+8))
#36501 [Feature]: Add support for token_adapter.trainable_tokens_delta LoRA weight | feature request | by akowalsk (created: 2026-03-09 21:21 (UTC+8))
#36492 [Bug]: Abnormal Output When Using FP8 KVCache for Kimi-K2.5 Inference under vLLM v0.17.0 | bug | by makabaka6338 (created: 2026-03-09 20:26 (UTC+8))
#36493 [Bug]: The hit rate of prefix caching in Qwen3.5 35BA3B is very low, always less than 0.1% | bug | by piekey1994 (created: 2026-03-09 20:32 (UTC+8))
#36456 [Bug]: Local GGUF path fails with “architecture qwen35 is not supported yet” even when --hf-config-path is provided | bug | by shba007 (created: 2026-03-09 13:37 (UTC+8))
#36478 [Bug]: LoRA on Qwen-3.5-2B fails to run | bug | by JasonX7 (created: 2026-03-09 17:25 (UTC+8))
#36476 [Bug]: vllm 0.17.0 fails to start Qwen3.5-122B-A10B | bug | by Lee-xeo (created: 2026-03-09 16:58 (UTC+8))
#36481 [Performance]: 2-stage custom allreduce (TP4) bandwidth lagging behind NCCL for large message sizes | performance | by shenyt-sanshui (created: 2026-03-09 17:53 (UTC+8))
#36455 [Bug]: Unable to run Qwen3.5 on RTX5090 | bug | by YiJunSachs (created: 2026-03-09 13:33 (UTC+8))
#36443 [Bug]: qwen3.5-27b ValueError: Tokenizer class TokenizersBackendFast does not exist or is not currently imported. | bug | by xiaotianns (created: 2026-03-09 11:58 (UTC+8))
#36450 [Bug]: Qwen3.5 AWQ models crash during inference on RTX 5090 (Blackwell) with Triton OOM in solve_tril despite successful model load | bug | by NEWbie0709 (created: 2026-03-09 13:16 (UTC+8))
#36452 [Bug]: Qwen3.5-35B-A3B-FP8 AssertionError | bug | by Indigo-Coder-github (created: 2026-03-09 13:23 (UTC+8))
#36459 [RFC]: vLLM IR Out-of-Tree (OOT) Kernel Registration | RFC | by wxsIcey (created: 2026-03-09 14:28 (UTC+8))
#36465 [Usage]: How to use Env in yaml | usage | by ciaoyizhen (created: 2026-03-09 14:51 (UTC+8))
#36463 [Bug]: Qwen3.5-27B fails to start with CPU KV cache offloading (--kv_offloading_backend native) while Qwen3-32B works fine | bug | by bigbear07 (created: 2026-03-09 14:36 (UTC+8))
#36441 [Bug]: prefix cache bug happens when use w4a16 for GLM5 | bug | by terenceucla (created: 2026-03-09 11:43 (UTC+8))
#36440 [Feature] Add energy consumption metrics to benchmark suite | no labels | by hongping-zh (created: 2026-03-09 11:40 (UTC+8))

Closed issues

#36407 [Bug]: KeyError in get_layers_from_vllm_config with pipeline parallelism (vLLM 0.16.0) | bug | by tusharshetty61 (closed: 2026-03-09 11:43 (UTC+8))
#24384 [RFC]: Decoupling vLLM Configuration from Hugging Face | RFC,stale | by charlotte12l (closed: 2026-03-10 10:29 (UTC+8))
#26565 [Bug]: vLLM engine loads model successfully but fails during sampler execution. | bug,stale | by thenumberouscode (closed: 2026-03-10 10:28 (UTC+8))
#26567 [Refactor][MLA]: Independently pass q_nope & q_rope | feature request,stale | by vnadathur (closed: 2026-03-10 10:28 (UTC+8))
#28186 [Bug] Cannot load qwen3-vl series with lora adapter | bug,stale | by deepNoah (closed: 2026-03-10 10:28 (UTC+8))
#28334 [Bug][LoRA]: Custom AR IMA during CG Capture with LoRA | bug,stale | by gnovack (closed: 2026-03-10 10:27 (UTC+8))
#28348 [Usage]: Does vllm support max_pixels in prompt on Qwen3-VL reasoning? | usage,stale | by leijie-ww (closed: 2026-03-10 10:27 (UTC+8))
#35504 [Bug]: qwen3-coder-next inference randomly hangs, accuracy degradation in 0.16.0+ with TP > 1 and fuse_allreduce_rms=False (H100s on PCIe) | bug | by vitush93 (closed: 2026-03-10 10:16 (UTC+8))
#36387 📋 Documentation Enhancement Suggestion | no labels | by croviatrust (closed: 2026-03-10 09:19 (UTC+8))
#36454 [Performance]: vLLM v0.15.0 throughput regression compared to ROCm vLLM v0.14.0 | performance,rocm | by Spurthi-Bhat-ScalersAI (closed: 2026-03-10 09:19 (UTC+8))
#34054 [Bug]: pd disaggregation on the same host with nixl connector can not use nvlink to transfer kv cache | bug | by ChowXu (closed: 2026-03-10 07:49 (UTC+8))
#35191 [Bug]: Qwen3.5 397B FP8 fills 1TB RAM and OOM killed with high-concurrency multimodal requests | bug | by FWao (closed: 2026-03-10 07:14 (UTC+8))
#36526 [Bug]: DeepSeek hangs with overridden num_hidden_layers | bug | by ProExpertProg (closed: 2026-03-10 01:44 (UTC+8))
#36513 Analyze middleware traces: OWUI sampling profile comparison | no labels | by jsboige (closed: 2026-03-09 23:13 (UTC+8))
#29141 [Bug]: Unable to run Qwen3 MoE NVFP4 on SM120 | bug,stale | by terryaic (closed: 2026-03-09 22:56 (UTC+8))
#28910 [Bug]: | bug,stale | by SongXiaoMao (closed: 2026-03-09 22:52 (UTC+8))
#36031 [Bug]: Commit 28ef9ba causing VLLM to crash when running Qwen 3.5 122B with Speculative Decoding enabled | bug | by mdierolf (closed: 2026-03-09 21:29 (UTC+8))
#35321 [Feature]: Encoder self-attention for RocmAttentionImpl | feature request,rocm | by micah-wil (closed: 2026-03-09 20:24 (UTC+8))
#36357 [Bug]: Multimodal encoder memory profiling hangs indefinitely on V100 (SM 7.0) in v0.17 | bug | by aswinkumar1999 (closed: 2026-03-09 18:43 (UTC+8))
#36382 [CPU] AssertionError in CPUModelRunner: device_tensor type mismatch (numpy.ndarray) | no labels | by maxwillzq (closed: 2026-03-09 17:35 (UTC+8))
#36331 [Bug]: 0% acceptance rate (MTP) with Qwen 3.5 122B (NVFP4) | bug | by ccdv-ai (closed: 2026-03-09 16:43 (UTC+8))
#36389 [Bug]: Multi-node MP backend with vllm v17 - inner dp world group is not initialized | bug | by alan-cooney-dsit (closed: 2026-03-09 16:42 (UTC+8))
#36452 [Bug]: Qwen3.5-35B-A3B-FP8 AssertionError | bug | by Indigo-Coder-github (closed: 2026-03-09 15:55 (UTC+8))
#36249 [Bug]: Qwen3.5-27B-GPTQ-Int4 crashes [vllm version v0.16.1rc1] | bug | by berkayersoyy (closed: 2026-03-09 13:39 (UTC+8))
#34736 [Refactor]: Handle OOV Multimodal Tokens Generically | documentation | by alex-jw-brooks (closed: 2026-03-09 11:30 (UTC+8))

New PRs

#36574 [ROCm] Utilize persistent MLA kernel from AITER | rocm,v1 | by SKPsanjeevi (created: 2026-03-10 10:20 (UTC+8))
#36580 [Model Runner V2] Fix _compute_slot_mappings_kernel for chunked prefill | v1 | by njhill (created: 2026-03-10 11:17 (UTC+8))
#36561 Add Blackwell DeepSeek MTP acceptance length regression test | speculative-decoding,ci/build,v1,deepseek | by qiching (created: 2026-03-10 08:24 (UTC+8))
#36576 [Feature] Add debug tensor dump for intermediate activations | documentation,v1 | by MOIPA (created: 2026-03-10 10:49 (UTC+8))
#36578 feat: add RISC-V support for CPU backend (v2) | ci/build,cpu | by typer-J (created: 2026-03-10 10:57 (UTC+8))
#36495 [Fix] always use loopback ip for UniProcExecutor | v1 | by BugenZhao (created: 2026-03-09 20:41 (UTC+8))
#36579 Guard AWQ-Marlin auto-selection on CUDA driver/toolkit mismatch | nvidia | by fede-kamel (created: 2026-03-10 11:05 (UTC+8))
#36577 Remove instance ID initialization logic | no labels | by zhuohan123 (created: 2026-03-10 10:51 (UTC+8))
#36557 [Bugfix] Fix RuntimeError: Already borrowed that degrades VLM serving throughput under concurrent load. | bug,ready | by hallerite (created: 2026-03-10 07:05 (UTC+8))
#36553 [BugFix] Remove incorrect assert in split_decodes_and_prefills | bug,ready,v1 | by WoosukKwon (created: 2026-03-10 06:24 (UTC+8))
#36575 Main branch failure triage | ci/build | by simon-mo (created: 2026-03-10 10:28 (UTC+8))
#36462 [Bugfix] Fix GDN in_proj_ba crash with GPTQ/FP8 and TP > 1 | bug,needs-rebase,qwen | by AjAnubolu (created: 2026-03-09 14:36 (UTC+8))
#36573 [ROCm] Add AITER fused decode kernel for MLA attention | rocm | by khairulkabir1661 (created: 2026-03-10 10:09 (UTC+8))
#36538 feat: add RISC-V support for CPU backend | ci/build,cpu | by typer-J (created: 2026-03-10 04:42 (UTC+8))
#36571 Fix issue 04 | v1 | by xueliangyang-oeuler (created: 2026-03-10 09:53 (UTC+8))
#36572 fix: make MiniMaxM2AppendThinkReasoningParser extract reasoning corre… | no labels | by xueliangyang-oeuler (created: 2026-03-10 09:53 (UTC+8))
#36570 [Hardware][NVIDIA] Set NCCL_P2P_DISABLE=1 on non-fully-connected CUDA topology | v1,nvidia | by pooyadavoodi (created: 2026-03-10 09:31 (UTC+8))
#36458 [XPU] Support block fp8 moe by fallback to TritonExpert on XPU | ready | by jikunshang (created: 2026-03-09 14:19 (UTC+8))
#36569 fix: prevent KeyError in harmony parser by preserving None fields (is… | frontend,gpt-oss | by xueliangyang-oeuler (created: 2026-03-10 09:30 (UTC+8))
#36556 [ci] Update atol for test_classification | no labels | by angelayi (created: 2026-03-10 07:04 (UTC+8))
#36568 fix: revert triton_kernels tag to v3.5.0 to resolve ImportError (issu… | ci/build,kv-connector | by xueliangyang-oeuler (created: 2026-03-10 09:21 (UTC+8))
#36473 [Perf] Reuse GDN attention buffer to reduce allocation overhead | qwen | by cwazai (created: 2026-03-09 16:18 (UTC+8))
#36567 fix: disable async scheduling by default due to instability (issue #3… | no labels | by xueliangyang-oeuler (created: 2026-03-10 09:07 (UTC+8))
#36562 [DO NOT MERGE] AMD Infra tests | rocm,ready,needs-rebase,ci/build,v1,kv-connector | by AndreasKaratzas (created: 2026-03-10 08:33 (UTC+8))
#36564 [Draft] Support temporal compression for videos | no labels | by collinmccarthy (created: 2026-03-10 08:45 (UTC+8))
#36521 [Core] Simplify core kv-cache blocks initialization logic | ready,v1 | by njhill (created: 2026-03-10 00:50 (UTC+8))
#36563 [Kernel] [Helion] [12/N] Use FakeTensorMode to avoid GPU allocation during config key computation | no labels | by gmagogsfm (created: 2026-03-10 08:40 (UTC+8))
#36438 [Hardware][NIXL] set default kv buffer type for different platform | ready,kv-connector | by zhenwei-intel (created: 2026-03-09 11:39 (UTC+8))
#36523 [Platform] Add MPS (Apple Metal) platform support for macOS | documentation,performance,ci/build,v1,cpu | by robtaylor (created: 2026-03-10 01:12 (UTC+8))
#36558 refactor(envs): introduce typed Envs class with lazy getattr and attribute docstrings | no labels | by nliu365 (created: 2026-03-10 07:30 (UTC+8))
#36542 Compute MM Pre-processing Timing Metrics in a Non-Blocking Way for Prod | performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality,tool-calling | by d-biswa (created: 2026-03-10 04:55 (UTC+8))
#36559 [Kernel] Add swap AB optimization to fused_moe_kernel | no labels | by xyang16 (created: 2026-03-10 07:41 (UTC+8))
#36560 [CI] Add bfcl tool call correctness eval | ci/build | by sfeng33 (created: 2026-03-10 07:49 (UTC+8))
#36475 [bugfix] fix nvlink for nixl/ucx | bug,ready,kv-connector | by youkaichao (created: 2026-03-09 16:52 (UTC+8))
#36484 fix: Responses API streaming tool call support for non-harmony models | frontend,gpt-oss | by giulio-leone (created: 2026-03-09 18:46 (UTC+8))
#36554 [Kernel] Add swap AB optimization to fused_moe_kernel | no labels | by xyang16 (created: 2026-03-10 06:25 (UTC+8))
#36555 [torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict | documentation,llama,qwen | by Lucaskabela (created: 2026-03-10 06:32 (UTC+8))
#36549 [Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors | bug,kv-connector | by ZhanqiuHu (created: 2026-03-10 05:15 (UTC+8))
#36486 [Bugfix] Fix issues in quark emulative logic | bug | by wangjiaxin99 (created: 2026-03-09 19:23 (UTC+8))
#36551 [torch.compile] Add support for fused RMSNorm + group quant | ci/build | by ProExpertProg (created: 2026-03-10 06:05 (UTC+8))
#36536 [Frontend] Split OpenAIServingModels into OpenAIModelRegistry + OpenAIServingModels | frontend | by sagearc (created: 2026-03-10 04:20 (UTC+8))
#36552 Add non-contiguous input tests for rms_norm_per_block_quant and dynamic per-token quant kernels | no labels | by copilot-swe-agent (created: 2026-03-10 06:08 (UTC+8))
#36505 [ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel | rocm | by mgehre-amd (created: 2026-03-09 21:51 (UTC+8))
#36544 [Model Runner V2] Add model_state inputs to CUDA graph capture | v1,nvidia | by WoosukKwon (created: 2026-03-10 05:00 (UTC+8))
#36550 Add GlmOcrConfig with fixes, E2E test, and generic config registry validation | meta-exported,fb-exported | by yhujia-8888 (created: 2026-03-10 05:50 (UTC+8))
#36499 [ROCm][CI/Build] Add gfx1152/gfx1153 (Krackan) to HIP supported architectures | rocm,ready,ci/build | by mgehre-amd (created: 2026-03-09 20:52 (UTC+8))
#36539 Fix prompt_logprobs to respect logprobs_mode | v1 | by fede-kamel (created: 2026-03-10 04:42 (UTC+8))
#36548 [Bugfix] Fix TypeError during AWQ Marlin MoE loading by unwrapping traced tuple | bug | by EmilHaase (created: 2026-03-10 05:14 (UTC+8))
#36546 Remove unused disable_fallback field | ready | by zhuohan123 (created: 2026-03-10 05:07 (UTC+8))
#36545 [Speculative Decoding] Add norm_before_fc for gpt-oss draft models | speculative-decoding,llama,gpt-oss | by shubhra (created: 2026-03-10 05:05 (UTC+8))
#36543 [Bug Fix] Defer record_metadata_for_reloading to avoid ~3x host memory regression | bug | by kmabeeTT (created: 2026-03-10 04:57 (UTC+8))
#36541 [MRV2] Extensible CG dispatch rework | v1,nvidia | by WoosukKwon (created: 2026-03-10 04:52 (UTC+8))
#36449 [Observability] Add scheduler preemption metrics | v1 | by lisperz (created: 2026-03-09 13:00 (UTC+8))
#36540 [fix] Remove trtllm ragged mla prefills | v1 | by evezhier (created: 2026-03-10 04:42 (UTC+8))
#36448 fix(tokenizer): handle TokenizersBackendFast class for Qwen3.5 GPTQ models | qwen | by giulio-leone (created: 2026-03-09 12:58 (UTC+8))
#36483 [Frontend] Delegate preprocessing to OpenAIServingRender | frontend,v1 | by sagearc (created: 2026-03-09 18:45 (UTC+8))
#36531 [Bugfix] Fix MTP hidden_states regression for chained-layer models (Qwen3.5) | bug,speculative-decoding,needs-rebase,v1,qwen | by voipmonitor (created: 2026-03-10 02:44 (UTC+8))
#36516 [responsesAPI] prioritize content over summary in reasoning item input | frontend | by qandrew (created: 2026-03-09 23:31 (UTC+8))
#36534 Fix LFM2 MoE test for Transformers v5 | no labels | by hmellor (created: 2026-03-10 03:41 (UTC+8))
#36535 Patch for vLLM + FlashAttention4 + torch for GRPO colocated training | no labels | by markrogersjr (created: 2026-03-10 03:44 (UTC+8))
#36508 [Misc] fix typo: homogenous-> homogeneous (2 lines change) | v1 | by SoluMilken (created: 2026-03-09 22:47 (UTC+8))
#36532 Fix Qwen2.5-VL test for Transformers v5 | documentation,qwen | by hmellor (created: 2026-03-10 02:54 (UTC+8))
#36507 [MTP][Misc] Clean up dead code | speculative-decoding,ready,v1 | by MatthewBonanni (created: 2026-03-09 22:41 (UTC+8))
#36529 [Compile] Fix compile warning in moe_permute | ready | by yewentao256 (created: 2026-03-10 02:36 (UTC+8))
#36528 [Docs] Remove the reo beacon | documentation,ready | by simon-mo (created: 2026-03-10 02:14 (UTC+8))
#36527 [Bugfix][Model] Fix Eagle3 speculative decoding for Qwen3Next-based models | bug,speculative-decoding,v1,qwen | by NikitosKh (created: 2026-03-10 02:04 (UTC+8))
#36491 [Docs] Add documentation for vllm launch render command | documentation | by sagearc (created: 2026-03-09 20:26 (UTC+8))
#36525 torch-profiler-proxy-server-disagg-support | v1,kv-connector | by etemadiamd (created: 2026-03-10 01:37 (UTC+8))
#36506 [Docs] Expand --allowed-media-domains security guidance with threat details | documentation,ready | by russellb (created: 2026-03-09 22:17 (UTC+8))
#36520 [Model Runner V2] Add dummy profile_cudagraph_memory API | documentation,v1,nvidia | by WoosukKwon (created: 2026-03-10 00:44 (UTC+8))
#36522 [Docs] Add Apple MPS (Metal) GPU installation guide | documentation,cpu | by robtaylor (created: 2026-03-10 01:00 (UTC+8))
#36518 [Kernel] Fuse FP8 output quantization into merge_attn_states | v1 | by carlyou (created: 2026-03-09 23:56 (UTC+8))
#36511 [Misc] fix typo: dependant -> dependent (2 lines change) | qwen | by SoluMilken (created: 2026-03-09 23:00 (UTC+8))
#36517 Add VLLM_USE_MONITORX to use more efficient busy polling | ci/build | by pschlan-amd (created: 2026-03-09 23:53 (UTC+8))
#36519 [Bugfix][Sparse MLA] report indexer CG support properly | bug,ready,v1 | by MatthewBonanni (created: 2026-03-10 00:33 (UTC+8))
#36515 [CI] Fix edge case that could lead to broken docs builds on main | documentation,ready | by hmellor (created: 2026-03-09 23:27 (UTC+8))
#36479 [Model] Consolidate score logic by introduce score_type | new-model,frontend,ready,qwen | by noooop (created: 2026-03-09 17:29 (UTC+8))
#36446 feat(v1): add timeout to engine core step to prevent deadlock | v1 | by xueliangyang-oeuler (created: 2026-03-09 12:31 (UTC+8))
#36509 [Build] Fix Dockerfile issues for ppc64le build | ci/build | by lucas-s-p (created: 2026-03-09 22:50 (UTC+8))
#36442 [ROCm][CI] Retrying in case of batch variance effects and reducing flakiness | rocm,ready,v1 | by AndreasKaratzas (created: 2026-03-09 11:56 (UTC+8))
#36514 Containerized segmented spans | v1,kv-connector | by almogtavor (created: 2026-03-09 23:25 (UTC+8))
#36512 [WIP][Model] Add AnyModel: generic support for NAS-optimized heterogeneous architectures | new-model,gpt-oss | by askliar (created: 2026-03-09 23:03 (UTC+8))
#36494 Fix: Re-Enable EP for trtllm MoE FP8 backend | ready,nvidia | by amirkl94 (created: 2026-03-09 20:39 (UTC+8))
#36510 Feature/anymodel clean | new-model,gpt-oss | by askliar (created: 2026-03-09 22:56 (UTC+8))
#36490 Fix image | documentation,performance,new-model,rocm,frontend,speculative-decoding,needs-rebase,ci/build,v1,multi-modality | by lucas-s-p (created: 2026-03-09 19:37 (UTC+8))
#36461 [Bugfix] Fix cpu-offload-gb assertion with non-default block sizes | bug,needs-rebase,v1 | by AjAnubolu (created: 2026-03-09 14:35 (UTC+8))
#36504 [WIP][Model] Add AnyModel: generic support for NAS-optimized heterogeneous architectures | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 | by askliar (created: 2026-03-09 21:42 (UTC+8))
#36480 bugfix(dcp, gdn): disabling DCP semantics for linear-attention KV/state groups | bug,v1,nvidia | by pisceskkk (created: 2026-03-09 17:49 (UTC+8))
#36503 feat(dcp, flashinfer): support FULL_DECODE_ONLY for flashinfer with DCP>1 | v1,nvidia | by pisceskkk (created: 2026-03-09 21:37 (UTC+8))
#36500 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder | qwen | by Wangbei25 (created: 2026-03-09 21:14 (UTC+8))
#36482 [Frontend] Move warmup into Renderer | frontend,ready | by DarkLight1337 (created: 2026-03-09 18:12 (UTC+8))
#36497 GLM5 rot-safetensors | deepseek | by Zhujiyang2 (created: 2026-03-09 20:43 (UTC+8))
#36496 [Misc] Add configuration endpoint | frontend,v1 | by hickeyma (created: 2026-03-09 20:42 (UTC+8))
#36439 [Bugfix] Fall back to TORCH_SDPA for encoder attention on SM<80 GPUs | bug,documentation,v1 | by weiguangli-io (created: 2026-03-09 11:40 (UTC+8))
#36472 [MM Encoder] Default to use TORCH_SDPA backend for ViT on Volta/Turing GPU | ready,nvidia | by Isotr0py (created: 2026-03-09 16:15 (UTC+8))
#36487 Replace OMP initialization | v1,cpu | by kot-begemot-uk (created: 2026-03-09 19:23 (UTC+8))
#36453 fix: Add SM120 capability family check for FlashInfer NVFP4 MoE backends | nvidia | by brandonmmusic-max (created: 2026-03-09 13:25 (UTC+8))
#36488 [Bugfix] Enable bitwise invariance for mxfp4 MoE matmul_ogs kernel | bug | by codearky (created: 2026-03-09 19:24 (UTC+8))
#36447 [Perf][GDN] Eliminate GPU-CPU synchronization in GDNAttentionMetadataBuilder.build() | v1 | by wanghuanjun2113 (created: 2026-03-09 12:45 (UTC+8))
#36470 [Deprecation][1/2] Remove items deprecated in v0.18 | documentation,ready,v1,multi-modality | by DarkLight1337 (created: 2026-03-09 15:48 (UTC+8))
#36471 [ci] Bound openai dependency to 2.24.0 | ready,ci/build | by khluu (created: 2026-03-09 16:13 (UTC+8))
#36468 feat: support fp8 kv cache and chunked prefill for minimax model | no labels | by xueliangyang-oeuler (created: 2026-03-09 15:14 (UTC+8))
#36477 [XPU]Unify xpu test dependencies in dockerfile.xpu | ci/build | by 1643661061leo (created: 2026-03-09 16:59 (UTC+8))
#36469 [Cleanup] Unify RMSNormGated and FusedRMSNormGated | no labels | by EdgeN8v (created: 2026-03-09 15:41 (UTC+8))
#36444 [Model][Quantization] Add GGUF support for MiniMax-M2.1 | no labels | by JoursBleu (created: 2026-03-09 12:07 (UTC+8))
#36464 [Examples][2/n] Resettle generate examples. | documentation,structured-output,ci/build,qwen | by noooop (created: 2026-03-09 14:43 (UTC+8))
#36474 [Bugfix] Fix GPTQ Marlin size_k for Qwen3.5 | bug,qwen | by Isotr0py (created: 2026-03-09 16:39 (UTC+8))
#36467 uniform batch tokens when uniform_batch_tokens is set | v1,meta-exported,fb-exported | by cgufb (created: 2026-03-09 15:11 (UTC+8))
#36466 feat(attention): extract KV-cache update from FlashAttentionDiffKV ba… | v1 | by Prathmesh234 (created: 2026-03-09 14:54 (UTC+8))
#36451 [CORE][V1] fix: alive-but-hung EngineCore not being detected by /health endpoint. | v1 | by sihyeonn (created: 2026-03-09 13:22 (UTC+8))
#36457 [Bugfix] Add regression test for allreduce RMS fusion with PP | bug,needs-rebase,qwen | by robellliu-dev (created: 2026-03-09 13:47 (UTC+8))
#36437 [Bugfix] Non-tensor member assertion in cpu_model_runner | bug,needs-rebase,v1,cpu | by PatchouliTIS (created: 2026-03-09 11:38 (UTC+8))
#36445 fix: Responses API streaming tool call support for non-harmony models | frontend,gpt-oss | by herve-ves (created: 2026-03-09 12:10 (UTC+8))
#36460 [Bugfix] Fix FP8 block-scale dimension mismatch in fused linear weight loading | bug | by AjAnubolu (created: 2026-03-09 14:35 (UTC+8))

Merged PRs

#36553 [BugFix] Remove incorrect assert in split_decodes_and_prefills — bug,ready,v1 — by WoosukKwon (合并于: 2026-03-10 11:02 (UTC+8))
#31471 [Model] Add HyperCLOVAX-SEED-Think-32B vision-language model support — documentation,new-model,frontend,ready,v1,multi-modality,qwen,cpu — by effortprogrammer (合并于: 2026-03-10 10:59 (UTC+8))
#36242 [Bugfix] Fix Qwen3-Next in_proj_ba weight sharding with TP > 1 — bug,ready,qwen — by AjAnubolu (合并于: 2026-03-10 10:16 (UTC+8))
#36179 [ROCm][CI] Fix ROCm GPT-OSS Eval test group — rocm,ready,ci/build,gpt-oss — by AndreasKaratzas (合并于: 2026-03-10 08:55 (UTC+8))
#36475 [bugfix] fix nvlink for nixl/ucx — bug,ready,kv-connector — by youkaichao (合并于: 2026-03-10 07:49 (UTC+8))
#36552 Add non-contiguous input tests for rms_norm_per_block_quant and dynamic per-token quant kernels — 无标签 — by copilot-swe-agent (合并于: 2026-03-10 06:51 (UTC+8))
#36544 [Model Runner V2] Add model_state inputs to CUDA graph capture — v1,nvidia — by WoosukKwon (合并于: 2026-03-10 06:14 (UTC+8))
#36393 add nemotron v3 reasoning parser — ready,nvidia — by shaunkotek (合并于: 2026-03-10 06:11 (UTC+8))
#35959 [MRV2] Extensible CG dispatch rework — ready,v1,nvidia — by LucasWilkinson (合并于: 2026-03-10 04:58 (UTC+8))
#36507 [MTP][Misc] Clean up dead code — speculative-decoding,ready,v1 — by MatthewBonanni (合并于: 2026-03-10 02:43 (UTC+8))
#36436 [Misc] Refactored 5 duplicate helper functions that were copied-pasted across multiple parsers — ready — by taneem-ibrahim (合并于: 2026-03-10 02:14 (UTC+8))
#36025 [ROCm][CI] Prep Tests For Change To ROCM_ATTN As New Default Backend On ROCm — rocm,ready,ci/build,v1,multi-modality — by micah-wil (合并于: 2026-03-10 02:27 (UTC+8))
#36281 [BE] Rename should_torch_compile_mm_vit to should_torch_compile_mm_encoder — documentation,ready,llama,qwen — by Lucaskabela (合并于: 2026-03-10 02:22 (UTC+8))
#35930 [Model Runner V2] Use NamedTuple for execute_model_state — documentation,v1 — by WoosukKwon (合并于: 2026-03-10 02:17 (UTC+8))
#36528 [Docs] Remove the reo beacon — documentation,ready — by simon-mo (合并于: 2026-03-10 02:16 (UTC+8))
#36027 [torch.compile] Rename compile_ranges_split_points to compile_ranges_endpoints — ready — by copilot-swe-agent (合并于: 2026-03-10 02:04 (UTC+8))
#36412 Fix/resupport nongated fused moe triton — documentation,performance,rocm,frontend,speculative-decoding,ready,ci/build,v1,multi-modality,qwen — by shaunkotek (合并于: 2026-03-10 02:01 (UTC+8))
#36506 [Docs] Expand –allowed-media-domains security guidance with threat details — documentation,ready — by russellb (合并于: 2026-03-10 01:43 (UTC+8))
#36520 [Model Runner V2] Add dummy profile_cudagraph_memory API — documentation,v1,nvidia — by WoosukKwon (merged: 2026-03-10 01:20 (UTC+8))
#36101 [ROCm][CI] Fix logprob divergence for TitanML/tiny-mixtral under AITER rms_norm — rocm,ready — by AndreasKaratzas (merged: 2026-03-10 01:07 (UTC+8))
#36292 [ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks — documentation,rocm,ready,v1 — by AndreasKaratzas (merged: 2026-03-10 01:02 (UTC+8))
#36511 [Misc] fix typo: dependant -> dependent (2 lines change) — qwen — by SoluMilken (merged: 2026-03-10 01:00 (UTC+8))
#34917 [Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 — performance,ready,ci/build,v1,deepseek,nvidia — by LopezCastroRoberto (merged: 2026-03-10 00:50 (UTC+8))
#35290 [Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 — performance,ready,deepseek,nvidia — by LopezCastroRoberto (merged: 2026-03-10 00:46 (UTC+8))
#36253 [ROCM] Optimize the fused_topk_bias to use aiter instead of fallback torch ops. — rocm,ready — by benenzhu (merged: 2026-03-10 00:30 (UTC+8))
#36515 [CI] Fix edge case that could lead to broken docs builds on main — documentation,ready — by hmellor (merged: 2026-03-10 00:06 (UTC+8))
#36416 [Bugfix] Clear stale CG keys after memory profiling — bug,ready,v1,nvidia — by MatthewBonanni (merged: 2026-03-09 23:56 (UTC+8))
#35634 [Refactor] Simplify chat_completion_full_generator for tool parsers — frontend,ready — by yewentao256 (merged: 2026-03-09 23:33 (UTC+8))
#36300 [Bug] Fix pooling model benchmark script — bug,performance,ready — by yewentao256 (merged: 2026-03-09 23:17 (UTC+8))
#35122 Reapply [Attention] Refactor check_and_update_config — rocm,speculative-decoding,ready,v1,multi-modality,cpu,kv-connector,nvidia,ready-run-all-tests — by MatthewBonanni (merged: 2026-03-09 22:17 (UTC+8))
#36319 Support online use_audio_in_video — frontend,ready,multi-modality,qwen — by gty111 (merged: 2026-03-09 22:16 (UTC+8))
#36482 [Frontend] Move warmup into Renderer — frontend,ready — by DarkLight1337 (merged: 2026-03-09 21:03 (UTC+8))
#36472 [MM Encoder] Default to use TORCH_SDPA backend for ViT on Volta/Turing GPU — ready,nvidia — by Isotr0py (merged: 2026-03-09 18:43 (UTC+8))
#36470 [Deprecation][1/2] Remove items deprecated in v0.18 — documentation,ready,v1,multi-modality — by DarkLight1337 (merged: 2026-03-09 18:43 (UTC+8))
#36471 [ci] Bound openai dependency to 2.24.0 — ready,ci/build — by khluu (merged: 2026-03-09 18:43 (UTC+8))
#35777 [Kernel] Add fused_sigmoid_gating_delta_rule_update kernel for Qwen3 Next — ready,qwen — by xyang16 (merged: 2026-03-09 14:41 (UTC+8))
#36434 [XPU] Add test script of PD disaggregation — ready,v1,kv-connector — by zhenwei-intel (merged: 2026-03-09 13:50 (UTC+8))
#36160 [Frontend] Add Support for MM Encoder/Decoder Beam Search (Online Transcriptions) — documentation,frontend,ready,v1 — by alex-jw-brooks (merged: 2026-03-09 13:46 (UTC+8))
#36430 [Bugfix] Avoid to replace non-tensor members in cpu model runner — bug,ready,v1,cpu — by bigPYJ1151 (merged: 2026-03-09 13:06 (UTC+8))
#36110 [Frontend][2/n] Improve pooling entrypoints embed. — rocm,frontend — by noooop (merged: 2026-03-09 11:42 (UTC+8))
#36263 feat(attention): extract KV-cache update from FlexAttention backend — v1 — by cong-or (merged: 2026-03-09 11:40 (UTC+8))
#36243 [Bugfix] Skip out-of-stage layers in get_layers_from_vllm_config for pipeline parallel — bug,ready,v1 — by tusharshetty61 (merged: 2026-03-09 11:40 (UTC+8))
#35953 [Misc] Move processors to transformers_utils — documentation,ready,multi-modality,qwen — by DarkLight1337 (merged: 2026-03-09 11:31 (UTC+8))
#34858 Increase Flexibility for OOV Multimodal Token Handling — speculative-decoding,ready,qwen — by alex-jw-brooks (merged: 2026-03-09 11:30 (UTC+8))
#36149 fix: Use iterator as not to store all the file loads in memory at once — ready — by shaunkotek (merged: 2026-03-09 11:25 (UTC+8))
#35579 [Examples][1/n] Resettle basic examples. — documentation,ready,ci/build,cpu — by noooop (merged: 2026-03-09 11:22 (UTC+8))

PRs closed without merging

#31547 [3/n] Migrate norm kernels to libtorch stable ABI — documentation,rocm,needs-rebase,ci/build,cpu,nvidia — by mikaylagawarecki (closed: 2026-03-10 11:11 (UTC+8))
#36575 Main branch failure triage — ci/build — by simon-mo (closed: 2026-03-10 10:28 (UTC+8))
#28361 Normalize OpenPangu RoPE scaling defaults — stale — by skyloevil (closed: 2026-03-10 10:27 (UTC+8))
#36538 feat: add RISC-V support for CPU backend — ci/build,cpu — by typer-J (closed: 2026-03-10 10:04 (UTC+8))
#36473 [Perf] Reuse GDN attention buffer to reduce allocation overhead — qwen — by cwazai (closed: 2026-03-10 09:16 (UTC+8))
#36418 Populate multimedia pre-processing performance metrics via PerfStats in the vLLM engine — v1,multi-modality,meta-exported,fb-exported — by d-biswa (closed: 2026-03-10 08:00 (UTC+8))
#36542 Compute MM Pre-processing Timing Metrics in a Non-Blocking Way for Prod — performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by d-biswa (closed: 2026-03-10 07:49 (UTC+8))
#30299 [Bugfix] Update WSL detection to check for WSL1 compatibility as WSL2… — bug — by HoneyBerries (closed: 2026-03-10 07:56 (UTC+8))
#36554 [Kernel] Add swap AB optimization to fused_moe_kernel — no labels — by xyang16 (closed: 2026-03-10 07:40 (UTC+8))
#36541 [MRV2] Extensible CG dispatch rework — v1,nvidia — by WoosukKwon (closed: 2026-03-10 04:58 (UTC+8))
#36531 [Bugfix] Fix MTP hidden_states regression for chained-layer models (Qwen3.5) — bug,speculative-decoding,needs-rebase,v1,qwen — by voipmonitor (closed: 2026-03-10 04:12 (UTC+8))
#36522 [Docs] Add Apple MPS (Metal) GPU installation guide — documentation,cpu — by robtaylor (closed: 2026-03-10 01:11 (UTC+8))
#36514 Containerized segmented spans — v1,kv-connector — by almogtavor (closed: 2026-03-09 23:26 (UTC+8))
#36510 Feature/anymodel clean — new-model,gpt-oss — by askliar (closed: 2026-03-09 22:56 (UTC+8))
#36490 Fix image — documentation,performance,new-model,rocm,frontend,speculative-decoding,needs-rebase,ci/build,v1,multi-modality — by lucas-s-p (closed: 2026-03-09 22:43 (UTC+8))
#36504 [WIP][Model] Add AnyModel: generic support for NAS-optimized heterogeneous architectures — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by askliar (closed: 2026-03-09 22:32 (UTC+8))
#36497 GLM5 rot-safetensors — deepseek — by Zhujiyang2 (closed: 2026-03-09 20:43 (UTC+8))
#36496 [Misc] Add configuration endpoint — frontend,v1 — by hickeyma (closed: 2026-03-09 20:52 (UTC+8))
#36414 [Bugfix] Add PyTorch GDN fallback for SM12.0+ (Blackwell) in Qwen3Next/Qwen3.5 — bug,performance,ci/build,qwen — by lucaspirola (closed: 2026-03-09 19:55 (UTC+8))
#32627 [CI]: remove unused FLASHINFER_AOT_COMPILE build argument — needs-rebase,ci/build — by haitwang-cloud (closed: 2026-03-09 17:31 (UTC+8))
#36420 Revert "[Core] NGram GPU Implementation compatible with Async Scheduler" (#29184) — speculative-decoding,v1 — by zhewenl (closed: 2026-03-09 14:51 (UTC+8))
#36437 [Bugfix] Non-tensor member assertion in cpu_model_runner — bug,needs-rebase,v1,cpu — by PatchouliTIS (closed: 2026-03-09 14:47 (UTC+8))
#36289 [Model] Register Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM for text-only checkpoints — new-model,qwen — by aminsamir45 (closed: 2026-03-09 14:12 (UTC+8))
#35990 [Core] Add contiguous block allocation mode for KV cache — needs-rebase,v1 — by xiaguan (closed: 2026-03-09 11:36 (UTC+8))