vLLM Development Activity Report - 2026-03-09
Time window: 2026-03-09 11:22 (UTC+8) to 2026-03-10 11:22 (UTC+8). Stats: 30 new issues | 25 issues closed | 114 new PRs | 46 PRs merged | 24 PRs closed without merging
📊 Daily Development Summary
Over the past 24 hours (2026-03-09 to 2026-03-10), the vLLM project maintained a very high level of development activity, with 114 new PRs and 30 new issues. Community attention centered on stability problems in the Qwen3.5 model family (particularly the FP8, GPTQ-quantized, and MoE variants) and on memory leaks in multimodal inference. Meanwhile, AMD ecosystem integration and optimization work continued, with multiple ROCm- and Quark-related PRs submitted to improve performance and feature support on AMD hardware.
🎯 AMD/ROCm Ecosystem Updates
AMD-related contributions were very active this cycle, concentrated on performance optimization, kernel coverage, and infrastructure improvements.
New issues (1):
- #36454: [Performance regression] v0.15.0 vs. ROCm v0.14.0 throughput comparison
- User @Spurthi-Bhat-ScalersAI reported that in benchmarks with a Qwen3 model on 8× MI300X, upstream vLLM v0.15.0 delivers roughly 1.34× lower throughput than the AMD-maintained ROCm fork at v0.14.0.
- Core team member @robertgshaw2-redhat responded that the upstream project does not track performance regressions against the AMD fork, and stressed that it is AMD's responsibility to upstream its optimizations promptly so that mainline releases carry the same improvements. This made the community's stance on the upstream/downstream relationship explicit, and the issue was subsequently closed.
- Impact: highlights the ongoing challenge of landing AMD hardware optimizations in the mainline codebase and presses the AMD team to accelerate its upstreaming process.
New/open PRs (7):
- [Perf] AWQ-Marlin config support (#36505): submitted by AMD engineer @mgehre-amd. Enables AWQMarlinConfig on ROCm through the unified choose_mp_linear_kernel framework, replacing the previous path that called the NVIDIA-specific ops.marlin_gemm or ops.awq_gemm directly. Testing Qwen3-4B-AWQ on an AMD Strix Halo showed a notable 73% decode-throughput improvement.
- [Kernel] AITER persistent MLA decode kernel (#36574): submitted by AMD engineer @SKPsanjeevi. Adds a persistent-mode AITER MLA (Multi-head Latent Attention) decode kernel for ROCm that keeps the kernel resident on the GPU compute units, avoiding per-batch launch overhead. Controlled via the VLLM_ROCM_USE_AITER_MLA_PERSISTENT environment variable.
- [Kernel] AITER fused MLA decode kernel (#36573): submitted by AMD engineer @khairulkabir1661. Adds an AITER fused decode kernel for MLA attention that combines FP4/FP8 BMM, RoPE, and KV-cache writes to improve performance.
- [Bugfix] Quark emulation logic (#36486): fixes the emulation path of Quark (AMD's MXFP4/MXFP6 quantization tool) not taking effect when AITER is disabled. The regression was introduced by an earlier PR and caused the system to keep attempting the AITER native path even after the user disabled AITER via the environment variable. A key fix for the AMD/Quark ecosystem.
- [CI/Build] New GPU architecture support (#36499): submitted by AMD engineer @mgehre-amd. Adds gfx1152/gfx1153 (RDNA 3.5, e.g. "Krackan") to the list of HIP-supported architectures, so that built vLLM wheels include kernels for these new GPUs.
- [CI/Test] Reduce ROCm test flakiness (#36442): adds a retry mechanism for ROCm tests and disables the "skinny GEMMs" feature to work around nondeterministic behavior. In the discussion, core contributor @gshtras noted that the long-term fix should be repairing the skinny-GEMM bug rather than disabling a production feature in tests.
- [Infra test] (#36562): marked "DO NOT MERGE"; used to exercise a new CI queue.
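The environment-variable gating described for the persistent MLA kernel in #36574 can be sketched as follows. Only the variable name VLLM_ROCM_USE_AITER_MLA_PERSISTENT comes from the PR; the helper function, its arguments, and the gating logic are illustrative assumptions, not vLLM's actual code.

```python
# Hedged sketch (not vLLM source): how a ROCm-only, env-gated kernel choice
# such as the AITER persistent MLA decode path might be wired up.
# VLLM_ROCM_USE_AITER_MLA_PERSISTENT is the variable named in PR #36574;
# everything else here is illustrative.
def use_aiter_persistent_mla(env: dict, is_rocm: bool) -> bool:
    """Return True when the persistent MLA decode kernel should be selected."""
    if not is_rocm:
        return False  # the kernel only exists in ROCm builds
    return env.get("VLLM_ROCM_USE_AITER_MLA_PERSISTENT", "0") == "1"
```

Defaulting the flag to off, as sketched here, keeps new kernels opt-in until they are proven stable.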
Summary: the AMD team was busy this cycle, fixing critical bugs (e.g. the Quark emulation logic), pushing performance work (AWQ-Marlin support, AITER kernels), and extending support to new hardware. The discussion on the performance-regression issue also made the community's expectation clear: optimizations must be upstreamed promptly.
💬 High-Activity Discussion Analysis
- Issue #36452: Qwen3.5-35B-A3B-FP8 AssertionError
- Topic: a user reported that launching the Qwen3.5 FP8 model fails at startup with an AssertionError.
- Viewpoints:
- User @Indigo-Coder-github provided a detailed stack trace and reproduction steps.
- Core contributor @ZJY0516 quickly triaged the problem, pointed the user to the Qwen3.5 configuration docs, and suggested setting max_cudagraph_capture_size=256 or enforce_eager=True.
- Another user, @NovaShen555, suggested trying gpu_memory_utilization=1.
- Controversy/conclusion: no real dispute. The user confirmed the documented workaround works but asked why the default value triggers the error. @ZJY0516 explained that the default max_cudagraph_capture_size exceeds the maximum cache size, and noted that enforce_eager costs performance. The issue illustrates how the community is accumulating "best practices" for the Qwen3.5 family, and the value of good documentation.
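The failing invariant and its workarounds can be sketched like this. Only max_cudagraph_capture_size, enforce_eager, and the "capture size must not exceed the maximum cache size" relationship come from the issue thread; the validator itself is an illustrative assumption, not vLLM's actual check.

```python
# Hedged sketch of the startup invariant behind issue #36452: the CUDA graph
# capture size must not exceed the maximum cache size, unless eager mode
# (enforce_eager=True) skips graph capture entirely, at a performance cost.
# This function is illustrative, not vLLM's real code.
def cudagraph_config_ok(max_cudagraph_capture_size: int,
                        max_cache_size: int,
                        enforce_eager: bool = False) -> bool:
    if enforce_eager:
        return True  # no CUDA graphs are captured, so the bound is moot
    return max_cudagraph_capture_size <= max_cache_size
```

Under this reading, lowering the capture size to 256 (as suggested in the thread) works whenever 256 fits within the model's maximum cache size.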
- Issue #36454: performance regression between vLLM v0.15.0 and ROCm v0.14.0
- Topic: comparing upstream performance against the AMD-maintained fork.
- Viewpoints:
- Reporter @Spurthi-Bhat-ScalersAI supplied thorough benchmark data and configuration showing the AMD fork performs better.
- Core member @robertgshaw2-redhat stated the project position: upstream does not track regressions against downstream forks; the responsibility for merging the optimizations upstream lies with AMD.
- Point of contention: not a technical dispute but a clear statement on project governance and collaboration, drawing the line between the community release and vendor forks and reaffirming the "upstream first" principle.
- Conclusion: the issue was closed quickly with a clear verdict: the performance gap is to be resolved by AMD via upstream PRs.
- Issue #36456: local GGUF path fails to load
- Topic: when using a local GGUF file together with --hf-config-path, a code-path issue makes architecture detection fail ("qwen35 is not supported yet").
- Viewpoints:
- Reporter @shba007 did a deep dive: the maybe_override_with_speculators function, when handling a local GGUF file, bypasses the user-supplied --hf-config-path and tries to parse the GGUF file directly, but the current transformers GGUF parser does not yet support the qwen35 architecture.
- Other users asked about workarounds (e.g. upgrading transformers).
- Controversy/conclusion: a clear-cut bug report exposing inconsistent behavior between the local-file and HF Hub loading paths. The discussion surfaced how vLLM's model-loading flow branches by source (HF Hub vs. local file), a good case study in framework compatibility.
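The loading-order problem described above can be sketched as a small config resolver. The error string and the idea of honoring --hf-config-path come from the issue thread; the function name, arguments, and return convention are illustrative assumptions, not vLLM's API.

```python
# Hedged sketch of the bug in issue #36456: for a local GGUF file, config
# resolution parses the GGUF header first and ignores a user-supplied
# --hf-config-path. Honoring the override (as below) would avoid the
# unsupported-architecture error. All names are illustrative.
def resolve_config_source(model_path, hf_config_path, gguf_supported_archs,
                          gguf_arch="qwen35"):
    if model_path.endswith(".gguf"):
        if hf_config_path is not None:
            # Proposed behavior: the explicit override wins over GGUF parsing.
            return ("hf-config", hf_config_path)
        if gguf_arch not in gguf_supported_archs:
            raise ValueError(f"architecture {gguf_arch} is not supported yet")
        return ("gguf", model_path)
    return ("hf-config", model_path)
```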
- PR #36486: fixing the Quark emulation logic
- Topic: how to correctly fix Quark's emulation logic when AITER is disabled.
- Viewpoints:
- PR author @wangjiaxin99 proposed simplifying the logic by merging several conditions into a single AND.
- Reviewer @ChuanLi1101 (likely from AMD) pointed out that this simplification may wrongly couple two independent non-emulation paths (the CK native path and the Triton MXFP4 backend path), forcing emulation in some scenarios (e.g. hardware without MX support but with a Triton backend available). He referenced another pending fix, PR #36422.
- Point of contention: completeness and correctness of the fix: simply merge the conditions, or handle the two independent paths more precisely.
- Current status: the PR remains open; the logic needs reworking per review feedback so that all cases are handled correctly.
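The reviewer's objection can be made concrete with a truth-table sketch. The two path names come from the review discussion; the function names, arguments, and exact conditions are illustrative assumptions, not the code under review.

```python
# Hedged sketch of the review concern on PR #36486: the CK native path and the
# Triton MXFP4 backend path are independent reasons NOT to emulate. Collapsing
# them into one AND couples them incorrectly. All names are illustrative.
def should_emulate_merged(aiter_enabled, hw_supports_mx, has_triton_mxfp4):
    # Over-simplified fix: emulate unless ALL native conditions hold at once.
    return not (aiter_enabled and hw_supports_mx and has_triton_mxfp4)

def should_emulate_split(aiter_enabled, hw_supports_mx, has_triton_mxfp4):
    # Reviewer's point: either native path alone should avoid emulation.
    if aiter_enabled and hw_supports_mx:
        return False  # CK native path is usable
    if has_triton_mxfp4:
        return False  # Triton MXFP4 backend path is usable
    return True
```

In the scenario the reviewer named (hardware lacks MX support, but a Triton backend exists), the merged version wrongly forces emulation while the split version does not.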
🔥 Hot Topics & Trend Analysis
- Qwen3.5 stability challenges: several new issues (#36452, #36489, #36476, #36566, #36478) concern Qwen3.5 models of various sizes (35B, 122B, 397B) failing to start, crashing randomly, or performing abnormally on v0.17.0. The max_cudagraph_capture_size parameter keeps surfacing as the key workaround, suggesting a widespread mismatch between CUDA graph compilation and KV-cache configuration for this model family.
- Regressions and optimizations in parallel: users report performance regressions after upgrades (#36454), while many PRs (e.g. #36505, #36574, #35777) introduce new fused kernels and scheduling optimizations to cut overhead. Balancing performance against stability amid rapid iteration remains the project's steady state.
- Multimodal memory leaks: issue #35191 (linear memory growth to OOM during Qwen3.5 397B multimodal inference) was closed this cycle, presumably fixed. Such problems expose the complexity of cache management and memory reclamation when serving large multimodal models, an area that warrants continued attention.
- Ecosystem expansion: platform-support PRs landed for Apple MPS (#36523) and a RISC-V CPU backend (#36538/#36578), extending vLLM's reach beyond mainstream GPUs.
🛠️ Key Technical Changes
- Apple MPS platform support (#36523): a significant platform addition. The PR introduces a complete MPS platform implementation, including device management, a pure-PyTorch attention backend, and optional Metal INT4/GGUF dequantization kernels. It lets vLLM use Apple Silicon GPUs for accelerated inference; performance trails dedicated GPUs, but it greatly broadens the deployment surface.
- vLLM IR kernel-registration RFC (#36459): an important architectural proposal. It addresses the current problem of mandatory kernel-module imports, which hinders third-party platform integration, by delegating kernel-module loading to the Platform class so that each platform (CUDA, ROCm, MPS, etc.) decides which kernel implementations to load. This improves vLLM's modularity and extensibility and is a long-term win for AMD and other heterogeneous backends.
- Health-check improvement (#36451): adds a health_ping() mechanism to the V1 engine that uses lightweight IPC polling to detect a "livelocked" engine core (process alive, but the serving loop stuck). This strengthens production observability and reliability, complementing the existing death-sentinel mechanism, which could only detect process crashes.
- AnyModel heterogeneous model support (#36512): a contribution from NVIDIA to support heterogeneous architectures produced by NAS (neural architecture search), where per-layer settings such as KV-head count, FFN width, and expert count may differ. It is the first effort in a mainstream inference framework to support such flexible model structures, opening the door to deploying cutting-edge model-compression results.
📈 Development Activity Observations
- Contributor diversity: PRs this cycle came from varied backgrounds, including the AMD team (-amd suffixes), NVIDIA researchers, Apple-ecosystem developers, and many community users, showing that vLLM attracts broad industry participation.
- Merge cadence: 46 PRs were merged, spanning bug fixes, performance work, and new features (e.g. the MPS platform). Review and merge throughput is high; larger changes such as the Qwen3.5 Eagle3 fix (#36527) and Apple MPS support (#36523) are advancing steadily.
- Issue-resolution loop: 25 issues were closed, including some long-standing ones (e.g. the #35191 multimodal memory leak and the #36249 Qwen3.5 GPTQ crash), showing the team is working through the backlog.
💡 Issues to Watch
- Root cause of the multimodal memory leak: although #35191 was closed, memory growth in large multimodal models (e.g. Qwen-VL, Qwen3.5-VL) under sustained high concurrency needs continued monitoring to confirm the fix is complete.
- Spreading Qwen3.5 "gotcha" knowledge: many users hit similar startup problems. It is worth evaluating whether recommended values for parameters like max_cudagraph_capture_size should be baked into the model-loading logic or surfaced in error messages to lower the barrier to entry.
- AMD upstreaming progress: after the performance-regression discussion, the community will watch closely whether and how the AMD team contributes its key MI300X-class optimizations to mainline, to avoid similar performance-gap disputes in future releases.
- XLA backend memory overhead: issue #36537 reports that the record_metadata_for_reloading function causes a 2-3× host-memory blowup on XLA backends. This significantly affects users of TPU-class hardware, and a mitigation (e.g. lazy recording) should be evaluated soon.
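The deferred-recording mitigation mentioned above can be sketched with a small lazy wrapper: the cost of materializing metadata is paid only if a reload actually needs it. The class and method names below are illustrative assumptions, not vLLM's API; only the defer-instead-of-eager idea comes from issue #36537 and PR #36543.

```python
# Hedged sketch of "lazy recording": instead of eagerly materializing reload
# metadata during torch.compile (which inflates host memory on XLA backends),
# store only a closure and compute the metadata on first access.
# All names here are illustrative.
class LazyMetadata:
    def __init__(self, compute_fn):
        self._compute = compute_fn  # deferred work, no memory paid yet
        self._value = None
        self._recorded = False

    def get(self):
        if not self._recorded:  # pay the cost only on first use
            self._value = self._compute()
            self._recorded = True
        return self._value
```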
📋 Appendix: Detailed Data
New Issues
- #36485 [Feature]: will support qwen3.5-eagle3 deployment? — feature request — by aidenchn (created: 2026-03-09 19:12 (UTC+8))
- #36498 [Bug]: There is something wrong with the use of mtp in qwen3.5-moe model: when it is changed to 0.17.0, it is wrong to directly report CudaError: an illegal memory access was encountered when reasoning with mtp. — bug — by piekey1994 (created: 2026-03-09 20:45 (UTC+8))
- #36524 [Bug]: Accuracy Issue with FlashMLA Sparse on DeepSeek V3.2 — bug — by wzhao18 (created: 2026-03-10 01:16 (UTC+8))
- #36454 [Performance]: vLLM v0.15.0 throughput regression compared to ROCm vLLM v0.14.0 — performance,rocm — by Spurthi-Bhat-ScalersAI (created: 2026-03-09 13:27 (UTC+8))
- #36489 [Bug]: vllm 0.17.0 crashed unexpectedly during deployment of Qwen3.5 397b-fp8 version — bug — by monkeywl2020 (created: 2026-03-09 19:29 (UTC+8))
- #36566 [Bug]: Qwen3.5-35B-A3B vllm v0.17.0 ERROR 03-10 00:52:24 [multiproc_executor.py:261] Worker proc VllmWorker-0 died unexpectedly, shutting down executor. — bug — by jieguolove (created: 2026-03-10 09:05 (UTC+8))
- #36565 [Bug]: GPTBigCode scale_attn_weights config flag is ignored in vLLM — bug — by Qi-Zhan (created: 2026-03-10 09:01 (UTC+8))
- #36547 [Bug] MultiConnector: NIXL transfers silently broken after HMA migration — bug — by ZhanqiuHu (created: 2026-03-10 05:08 (UTC+8))
- #36533 [Bug]: Negative prompt token counter crashes engine under CPU KV offloading + high concurrency — bug — by cquil11 (created: 2026-03-10 02:56 (UTC+8))
- #36537 [Bug]: record_metadata_for_reloading causes ~3x host memory regression during torch.compile on XLA backends — bug — by kmabeeTT (created: 2026-03-10 04:31 (UTC+8))
- #36530 [Bug]: pip install 0.17 fails with CXXABI_1.3.15 not found — bug — by jlqibm (created: 2026-03-10 02:44 (UTC+8))
- #36526 [Bug]: DeepSeek hangs with overridden num_hidden_layers — bug — by ProExpertProg (created: 2026-03-10 01:42 (UTC+8))
- #36513 Analyze middleware traces: OWUI sampling profile comparison — no labels — by jsboige (created: 2026-03-09 23:12 (UTC+8))
- #36502 [Feature]: Built-in debug tensor dump for intermediate activations — feature request — by MOIPA (created: 2026-03-09 21:31 (UTC+8))
- #36501 [Feature]: Add support for token_adapter.trainable_tokens_delta LoRA weight — feature request — by akowalsk (created: 2026-03-09 21:21 (UTC+8))
- #36492 [Bug]: Abnormal Output When Using FP8 KVCache for Kimi-K2.5 Inference under vLLM v0.17.0 — bug — by makabaka6338 (created: 2026-03-09 20:26 (UTC+8))
- #36493 [Bug]: The hit rate of prefix caching in Qwen3.5 35BA3B is very low, always less than 0.1% — bug — by piekey1994 (created: 2026-03-09 20:32 (UTC+8))
- #36456 [Bug]: Local GGUF path fails with "architecture qwen35 is not supported yet" even when --hf-config-path is provided — bug — by shba007 (created: 2026-03-09 13:37 (UTC+8))
- #36478 [Bug]: LoRA on Qwen-3.5-2B fails to run — bug — by JasonX7 (created: 2026-03-09 17:25 (UTC+8))
- #36476 [Bug]: vllm 0.17.0 fails to start Qwen3.5-122B-A10B — bug — by Lee-xeo (created: 2026-03-09 16:58 (UTC+8))
- #36481 [Performance]: 2-stage custom allreduce (TP4) bandwidth lagging behind NCCL for large message sizes — performance — by shenyt-sanshui (created: 2026-03-09 17:53 (UTC+8))
- #36455 [Bug]: Unable to run Qwen3.5 on RTX5090 — bug — by YiJunSachs (created: 2026-03-09 13:33 (UTC+8))
- #36443 [Bug]: qwen3.5-27b ValueError: Tokenizer class TokenizersBackendFast does not exist or is not currently imported. — bug — by xiaotianns (created: 2026-03-09 11:58 (UTC+8))
- #36450 [Bug]: Qwen3.5 AWQ models crash during inference on RTX 5090 (Blackwell) with Triton OOM in solve_tril despite successful model load — bug — by NEWbie0709 (created: 2026-03-09 13:16 (UTC+8))
- #36452 [Bug]: Qwen3.5-35B-A3B-FP8 AssertionError — bug — by Indigo-Coder-github (created: 2026-03-09 13:23 (UTC+8))
- #36459 [RFC]: vLLM IR Out-of-Tree (OOT) Kernel Registration — RFC — by wxsIcey (created: 2026-03-09 14:28 (UTC+8))
- #36465 [Usage]: How to use Env in yaml — usage — by ciaoyizhen (created: 2026-03-09 14:51 (UTC+8))
- #36463 [Bug]: Qwen3.5-27B fails to start with CPU KV cache offloading (--kv_offloading_backend native) while Qwen3-32B works fine — bug — by bigbear07 (created: 2026-03-09 14:36 (UTC+8))
- #36441 [Bug]: prefix cache bug happens when use w4a16 for GLM5 — bug — by terenceucla (created: 2026-03-09 11:43 (UTC+8))
- #36440 [Feature] Add energy consumption metrics to benchmark suite — no labels — by hongping-zh (created: 2026-03-09 11:40 (UTC+8))
Closed Issues
- #36407 [Bug]: KeyError in get_layers_from_vllm_config with pipeline parallelism (vLLM 0.16.0) — bug — by tusharshetty61 (closed: 2026-03-09 11:43 (UTC+8))
- #24384 [RFC]: Decoupling vLLM Configuration from Hugging Face — RFC,stale — by charlotte12l (closed: 2026-03-10 10:29 (UTC+8))
- #26565 [Bug]: vLLM engine loads model successfully but fails during sampler execution. — bug,stale — by thenumberouscode (closed: 2026-03-10 10:28 (UTC+8))
- #26567 [Refactor][MLA]: Independently pass q_nope & q_rope — feature request,stale — by vnadathur (closed: 2026-03-10 10:28 (UTC+8))
- #28186 [Bug] Cannot load qwen3-vl series with lora adapter — bug,stale — by deepNoah (closed: 2026-03-10 10:28 (UTC+8))
- #28334 [Bug][LoRA]: Custom AR IMA during CG Capture with LoRA — bug,stale — by gnovack (closed: 2026-03-10 10:27 (UTC+8))
- #28348 [Usage]: Does vllm support max_pixels in prompt on Qwen3-VL reasoning? — usage,stale — by leijie-ww (closed: 2026-03-10 10:27 (UTC+8))
- #35504 [Bug]: qwen3-coder-next inference randomly hangs, accuracy degradation in 0.16.0+ with TP > 1 and fuse_allreduce_rms=False (H100s on PCIe) — bug — by vitush93 (closed: 2026-03-10 10:16 (UTC+8))
- #36387 📋 Documentation Enhancement Suggestion — no labels — by croviatrust (closed: 2026-03-10 09:19 (UTC+8))
- #36454 [Performance]: vLLM v0.15.0 throughput regression compared to ROCm vLLM v0.14.0 — performance,rocm — by Spurthi-Bhat-ScalersAI (closed: 2026-03-10 09:19 (UTC+8))
- #34054 [Bug]: pd disaggregation on the same host with nixl connector can not use nvlink to transfer kv cache — bug — by ChowXu (closed: 2026-03-10 07:49 (UTC+8))
- #35191 [Bug]: Qwen3.5 397B FP8 fills 1TB RAM and OOM killed with high-concurrency multimodal requests — bug — by FWao (closed: 2026-03-10 07:14 (UTC+8))
- #36526 [Bug]: DeepSeek hangs with overridden num_hidden_layers — bug — by ProExpertProg (closed: 2026-03-10 01:44 (UTC+8))
- #36513 Analyze middleware traces: OWUI sampling profile comparison — no labels — by jsboige (closed: 2026-03-09 23:13 (UTC+8))
- #29141 [Bug]: Unable to run Qwen3 MoE NVFP4 on SM120 — bug,stale — by terryaic (closed: 2026-03-09 22:56 (UTC+8))
- #28910 [Bug]: — bug,stale — by SongXiaoMao (closed: 2026-03-09 22:52 (UTC+8))
- #36031 [Bug]: Commit 28ef9ba causing VLLM to crash when running Qwen 3.5 122B with Speculative Decoding enabled — bug — by mdierolf (closed: 2026-03-09 21:29 (UTC+8))
- #35321 [Feature]: Encoder self-attention for RocmAttentionImpl — feature request,rocm — by micah-wil (closed: 2026-03-09 20:24 (UTC+8))
- #36357 [Bug]: Multimodal encoder memory profiling hangs indefinitely on V100 (SM 7.0) in v0.17 — bug — by aswinkumar1999 (closed: 2026-03-09 18:43 (UTC+8))
- #36382 [CPU] AssertionError in CPUModelRunner: device_tensor type mismatch (numpy.ndarray) — no labels — by maxwillzq (closed: 2026-03-09 17:35 (UTC+8))
- #36331 [Bug]: 0% acceptance rate (MTP) with Qwen 3.5 122B (NVFP4) — bug — by ccdv-ai (closed: 2026-03-09 16:43 (UTC+8))
- #36389 [Bug]: Multi-node MP backend with vllm v17 - inner dp world group is not initialized — bug — by alan-cooney-dsit (closed: 2026-03-09 16:42 (UTC+8))
- #36452 [Bug]: Qwen3.5-35B-A3B-FP8 AssertionError — bug — by Indigo-Coder-github (closed: 2026-03-09 15:55 (UTC+8))
- #36249 [Bug]: Qwen3.5-27B-GPTQ-Int4 crashes [vllm version v0.16.1rc1] — bug — by berkayersoyy (closed: 2026-03-09 13:39 (UTC+8))
- #34736 [Refactor]: Handle OOV Multimodal Tokens Generically — documentation — by alex-jw-brooks (closed: 2026-03-09 11:30 (UTC+8))
New PRs
- #36574 [ROCm] Utilize persistent MLA kernel from AITER — rocm,v1 — by SKPsanjeevi (created: 2026-03-10 10:20 (UTC+8))
- #36580 [Model Runner V2] Fix _compute_slot_mappings_kernel for chunked prefill — v1 — by njhill (created: 2026-03-10 11:17 (UTC+8))
- #36561 Add Blackwell DeepSeek MTP acceptance length regression test — speculative-decoding,ci/build,v1,deepseek — by qiching (created: 2026-03-10 08:24 (UTC+8))
- #36576 [Feature] Add debug tensor dump for intermediate activations — documentation,v1 — by MOIPA (created: 2026-03-10 10:49 (UTC+8))
- #36578 feat: add RISC-V support for CPU backend (v2) — ci/build,cpu — by typer-J (created: 2026-03-10 10:57 (UTC+8))
- #36495 [Fix] always use loopback ip for UniProcExecutor — v1 — by BugenZhao (created: 2026-03-09 20:41 (UTC+8))
- #36579 Guard AWQ-Marlin auto-selection on CUDA driver/toolkit mismatch — nvidia — by fede-kamel (created: 2026-03-10 11:05 (UTC+8))
- #36577 Remove instance ID initialization logic — no labels — by zhuohan123 (created: 2026-03-10 10:51 (UTC+8))
- #36557 [Bugfix] Fix RuntimeError: Already borrowed that degrades VLM serving throughput under concurrent load. — bug,ready — by hallerite (created: 2026-03-10 07:05 (UTC+8))
- #36553 [BugFix] Remove incorrect assert in split_decodes_and_prefills — bug,ready,v1 — by WoosukKwon (created: 2026-03-10 06:24 (UTC+8))
- #36575 Main branch failure triage — ci/build — by simon-mo (created: 2026-03-10 10:28 (UTC+8))
- #36462 [Bugfix] Fix GDN in_proj_ba crash with GPTQ/FP8 and TP > 1 — bug,needs-rebase,qwen — by AjAnubolu (created: 2026-03-09 14:36 (UTC+8))
- #36573 [ROCm] Add AITER fused decode kernel for MLA attention — rocm — by khairulkabir1661 (created: 2026-03-10 10:09 (UTC+8))
- #36538 feat: add RISC-V support for CPU backend — ci/build,cpu — by typer-J (created: 2026-03-10 04:42 (UTC+8))
- #36571 Fix issue 04 — v1 — by xueliangyang-oeuler (created: 2026-03-10 09:53 (UTC+8))
- #36572 fix: make MiniMaxM2AppendThinkReasoningParser extract reasoning corre… — no labels — by xueliangyang-oeuler (created: 2026-03-10 09:53 (UTC+8))
- #36570 [Hardware][NVIDIA] Set NCCL_P2P_DISABLE=1 on non-fully-connected CUDA topology — v1,nvidia — by pooyadavoodi (created: 2026-03-10 09:31 (UTC+8))
- #36458 [XPU] Support block fp8 moe by fallback to TritonExpert on XPU — ready — by jikunshang (created: 2026-03-09 14:19 (UTC+8))
- #36569 fix: prevent KeyError in harmony parser by preserving None fields (is… — frontend,gpt-oss — by xueliangyang-oeuler (created: 2026-03-10 09:30 (UTC+8))
- #36556 [ci] Update atol for test_classification — no labels — by angelayi (created: 2026-03-10 07:04 (UTC+8))
- #36568 fix: revert triton_kernels tag to v3.5.0 to resolve ImportError (issu… — ci/build,kv-connector — by xueliangyang-oeuler (created: 2026-03-10 09:21 (UTC+8))
- #36473 [Perf] Reuse GDN attention buffer to reduce allocation overhead — qwen — by cwazai (created: 2026-03-09 16:18 (UTC+8))
- #36567 fix: disable async scheduling by default due to instability (issue #3… — no labels — by xueliangyang-oeuler (created: 2026-03-10 09:07 (UTC+8))
- #36562 [DO NOT MERGE] AMD Infra tests — rocm,ready,needs-rebase,ci/build,v1,kv-connector — by AndreasKaratzas (created: 2026-03-10 08:33 (UTC+8))
- #36564 [Draft] Support temporal compression for videos — no labels — by collinmccarthy (created: 2026-03-10 08:45 (UTC+8))
- #36521 [Core] Simplify core kv-cache blocks initialization logic — ready,v1 — by njhill (created: 2026-03-10 00:50 (UTC+8))
- #36563 [Kernel] [Helion] [12/N] Use FakeTensorMode to avoid GPU allocation during config key computation — no labels — by gmagogsfm (created: 2026-03-10 08:40 (UTC+8))
- #36438 [Hardware][NIXL] set default kv buffer type for different platform — ready,kv-connector — by zhenwei-intel (created: 2026-03-09 11:39 (UTC+8))
- #36523 [Platform] Add MPS (Apple Metal) platform support for macOS — documentation,performance,ci/build,v1,cpu — by robtaylor (created: 2026-03-10 01:12 (UTC+8))
- #36558 refactor(envs): introduce typed Envs class with lazy getattr and attribute docstrings — no labels — by nliu365 (created: 2026-03-10 07:30 (UTC+8))
- #36542 Compute MM Pre-processing Timing Metrics in a Non-Blocking Way for Prod — performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by d-biswa (created: 2026-03-10 04:55 (UTC+8))
- #36559 [Kernel] Add swap AB optimization to fused_moe_kernel — no labels — by xyang16 (created: 2026-03-10 07:41 (UTC+8))
- #36560 [CI] Add bfcl tool call correctness eval — ci/build — by sfeng33 (created: 2026-03-10 07:49 (UTC+8))
- #36475 [bugfix] fix nvlink for nixl/ucx — bug,ready,kv-connector — by youkaichao (created: 2026-03-09 16:52 (UTC+8))
- #36484 fix: Responses API streaming tool call support for non-harmony models — frontend,gpt-oss — by giulio-leone (created: 2026-03-09 18:46 (UTC+8))
- #36554 [Kernel] Add swap AB optimization to fused_moe_kernel — no labels — by xyang16 (created: 2026-03-10 06:25 (UTC+8))
- #36555 [torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict — documentation,llama,qwen — by Lucaskabela (created: 2026-03-10 06:32 (UTC+8))
- #36549 [Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors — bug,kv-connector — by ZhanqiuHu (created: 2026-03-10 05:15 (UTC+8))
- #36486 [Bugfix] Fix issues in quark emulative logic — bug — by wangjiaxin99 (created: 2026-03-09 19:23 (UTC+8))
- #36551 [torch.compile] Add support for fused RMSNorm + group quant — ci/build — by ProExpertProg (created: 2026-03-10 06:05 (UTC+8))
- #36536 [Frontend] Split OpenAIServingModels into OpenAIModelRegistry + OpenAIServingModels — frontend — by sagearc (created: 2026-03-10 04:20 (UTC+8))
- #36552 Add non-contiguous input tests for rms_norm_per_block_quant and dynamic per-token quant kernels — no labels — by copilot-swe-agent (created: 2026-03-10 06:08 (UTC+8))
- #36505 [ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel — rocm — by mgehre-amd (created: 2026-03-09 21:51 (UTC+8))
- #36544 [Model Runner V2] Add model_state inputs to CUDA graph capture — v1,nvidia — by WoosukKwon (created: 2026-03-10 05:00 (UTC+8))
- #36550 Add GlmOcrConfig with fixes, E2E test, and generic config registry validation — meta-exported,fb-exported — by yhujia-8888 (created: 2026-03-10 05:50 (UTC+8))
- #36499 [ROCm][CI/Build] Add gfx1152/gfx1153 (Krackan) to HIP supported architectures — rocm,ready,ci/build — by mgehre-amd (created: 2026-03-09 20:52 (UTC+8))
- #36539 Fix prompt_logprobs to respect logprobs_mode — v1 — by fede-kamel (created: 2026-03-10 04:42 (UTC+8))
- #36548 [Bugfix] Fix TypeError during AWQ Marlin MoE loading by unwrapping traced tuple — bug — by EmilHaase (created: 2026-03-10 05:14 (UTC+8))
- #36546 Remove unused disable_fallback field — ready — by zhuohan123 (created: 2026-03-10 05:07 (UTC+8))
- #36545 [Speculative Decoding] Add norm_before_fc for gpt-oss draft models — speculative-decoding,llama,gpt-oss — by shubhra (created: 2026-03-10 05:05 (UTC+8))
- #36543 [Bug Fix] Defer record_metadata_for_reloading to avoid ~3x host memory regression — bug — by kmabeeTT (created: 2026-03-10 04:57 (UTC+8))
- #36541 [MRV2] Extensible CG dispatch rework — v1,nvidia — by WoosukKwon (created: 2026-03-10 04:52 (UTC+8))
- #36449 [Observability] Add scheduler preemption metrics — v1 — by lisperz (created: 2026-03-09 13:00 (UTC+8))
- #36540 [fix] Remove trtllm ragged mla prefills — v1 — by evezhier (created: 2026-03-10 04:42 (UTC+8))
- #36448 fix(tokenizer): handle TokenizersBackendFast class for Qwen3.5 GPTQ models — qwen — by giulio-leone (created: 2026-03-09 12:58 (UTC+8))
- #36483 [Frontend] Delegate preprocessing to OpenAIServingRender — frontend,v1 — by sagearc (created: 2026-03-09 18:45 (UTC+8))
- #36531 [Bugfix] Fix MTP hidden_states regression for chained-layer models (Qwen3.5) — bug,speculative-decoding,needs-rebase,v1,qwen — by voipmonitor (created: 2026-03-10 02:44 (UTC+8))
- #36516 [responsesAPI] prioritize content over summary in reasoning item input — frontend — by qandrew (created: 2026-03-09 23:31 (UTC+8))
- #36534 Fix LFM2 MoE test for Transformers v5 — no labels — by hmellor (created: 2026-03-10 03:41 (UTC+8))
- #36535 Patch for vLLM + FlashAttention4 + torch for GRPO colocated training — no labels — by markrogersjr (created: 2026-03-10 03:44 (UTC+8))
- #36508 [Misc] fix typo: homogenous-> homogeneous (2 lines change) — v1 — by SoluMilken (created: 2026-03-09 22:47 (UTC+8))
- #36532 Fix Qwen2.5-VL test for Transformers v5 — documentation,qwen — by hmellor (created: 2026-03-10 02:54 (UTC+8))
- #36507 [MTP][Misc] Clean up dead code — speculative-decoding,ready,v1 — by MatthewBonanni (created: 2026-03-09 22:41 (UTC+8))
- #36529 [Compile] Fix compile warning in moe_permute — ready — by yewentao256 (created: 2026-03-10 02:36 (UTC+8))
- #36528 [Docs] Remove the reo beacon — documentation,ready — by simon-mo (created: 2026-03-10 02:14 (UTC+8))
- #36527 [Bugfix][Model] Fix Eagle3 speculative decoding for Qwen3Next-based models — bug,speculative-decoding,v1,qwen — by NikitosKh (created: 2026-03-10 02:04 (UTC+8))
- #36491 [Docs] Add documentation for vllm launch render command — documentation — by sagearc (created: 2026-03-09 20:26 (UTC+8))
- #36525 torch-profiler-proxy-server-disagg-support — v1,kv-connector — by etemadiamd (created: 2026-03-10 01:37 (UTC+8))
- #36506 [Docs] Expand --allowed-media-domains security guidance with threat details — documentation,ready — by russellb (created: 2026-03-09 22:17 (UTC+8))
- #36520 [Model Runner V2] Add dummy profile_cudagraph_memory API — documentation,v1,nvidia — by WoosukKwon (created: 2026-03-10 00:44 (UTC+8))
- #36522 [Docs] Add Apple MPS (Metal) GPU installation guide — documentation,cpu — by robtaylor (created: 2026-03-10 01:00 (UTC+8))
- #36518 [Kernel] Fuse FP8 output quantization into merge_attn_states — v1 — by carlyou (created: 2026-03-09 23:56 (UTC+8))
- #36511 [Misc] fix typo: dependant -> dependent (2 lines change) — qwen — by SoluMilken (created: 2026-03-09 23:00 (UTC+8))
- #36517 Add VLLM_USE_MONITORX to use more efficient busy polling — ci/build — by pschlan-amd (created: 2026-03-09 23:53 (UTC+8))
- #36519 [Bugfix][Sparse MLA] report indexer CG support properly — bug,ready,v1 — by MatthewBonanni (created: 2026-03-10 00:33 (UTC+8))
- #36515 [CI] Fix edge case that could lead to broken docs builds on main — documentation,ready — by hmellor (created: 2026-03-09 23:27 (UTC+8))
- #36479 [Model] Consolidate score logic by introduce score_type — new-model,frontend,ready,qwen — by noooop (created: 2026-03-09 17:29 (UTC+8))
- #36446 feat(v1): add timeout to engine core step to prevent deadlock — v1 — by xueliangyang-oeuler (created: 2026-03-09 12:31 (UTC+8))
- #36509 [Build] Fix Dockerfile issues for ppc64le build — ci/build — by lucas-s-p (created: 2026-03-09 22:50 (UTC+8))
- #36442 [ROCm][CI] Retrying in case of batch variance effects and reducing flakiness — rocm,ready,v1 — by AndreasKaratzas (created: 2026-03-09 11:56 (UTC+8))
- #36514 Containerized segmented spans — v1,kv-connector — by almogtavor (created: 2026-03-09 23:25 (UTC+8))
- #36512 [WIP][Model] Add AnyModel: generic support for NAS-optimized heterogeneous architectures — new-model,gpt-oss — by askliar (created: 2026-03-09 23:03 (UTC+8))
- #36494 Fix: Re-Enable EP for trtllm MoE FP8 backend — ready,nvidia — by amirkl94 (created: 2026-03-09 20:39 (UTC+8))
- #36510 Feature/anymodel clean — new-model,gpt-oss — by askliar (created: 2026-03-09 22:56 (UTC+8))
- #36490 Fix image — documentation,performance,new-model,rocm,frontend,speculative-decoding,needs-rebase,ci/build,v1,multi-modality — by lucas-s-p (created: 2026-03-09 19:37 (UTC+8))
- #36461 [Bugfix] Fix cpu-offload-gb assertion with non-default block sizes — bug,needs-rebase,v1 — by AjAnubolu (created: 2026-03-09 14:35 (UTC+8))
- #36504 [WIP][Model] Add AnyModel: generic support for NAS-optimized heterogeneous architectures — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by askliar (created: 2026-03-09 21:42 (UTC+8))
- #36480 bugfix(dcp, gdn): disabling DCP semantics for linear-attention KV/state groups — bug,v1,nvidia — by pisceskkk (created: 2026-03-09 17:49 (UTC+8))
- #36503 feat(dcp, flashinfer): support FULL_DECODE_ONLY for flashinfer with DCP>1 — v1,nvidia — by pisceskkk (created: 2026-03-09 21:37 (UTC+8))
- #36500 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder — qwen — by Wangbei25 (created: 2026-03-09 21:14 (UTC+8))
- #36482 [Frontend] Move warmup into Renderer — frontend,ready — by DarkLight1337 (created: 2026-03-09 18:12 (UTC+8))
- #36497 GLM5 rot-safetensors — deepseek — by Zhujiyang2 (created: 2026-03-09 20:43 (UTC+8))
- #36496 [Misc] Add configuration endpoint — frontend,v1 — by hickeyma (created: 2026-03-09 20:42 (UTC+8))
- #36439 [Bugfix] Fall back to TORCH_SDPA for encoder attention on SM<80 GPUs — bug,documentation,v1 — by weiguangli-io (created: 2026-03-09 11:40 (UTC+8))
- #36472 [MM Encoder] Default to use TORCH_SDPA backend for ViT on Volta/Turing GPU — ready,nvidia — by Isotr0py (created: 2026-03-09 16:15 (UTC+8))
- #36487 Replace OMP initialization — v1,cpu — by kot-begemot-uk (created: 2026-03-09 19:23 (UTC+8))
- #36453 fix: Add SM120 capability family check for FlashInfer NVFP4 MoE backends — nvidia — by brandonmmusic-max (created: 2026-03-09 13:25 (UTC+8))
- #36488 [Bugfix] Enable bitwise invariance for mxfp4 MoE matmul_ogs kernel — bug — by codearky (created: 2026-03-09 19:24 (UTC+8))
- #36447 [Perf][GDN] Eliminate GPU-CPU synchronization in GDNAttentionMetadataBuilder.build() — v1 — by wanghuanjun2113 (created: 2026-03-09 12:45 (UTC+8))
- #36470 [Deprecation][1/2] Remove items deprecated in v0.18 — documentation,ready,v1,multi-modality — by DarkLight1337 (created: 2026-03-09 15:48 (UTC+8))
- #36471 [ci] Bound openai dependency to 2.24.0 — ready,ci/build — by khluu (created: 2026-03-09 16:13 (UTC+8))
- #36468 feat: support fp8 kv cache and chunked prefill for minimax model — no labels — by xueliangyang-oeuler (created: 2026-03-09 15:14 (UTC+8))
- #36477 [XPU] Unify xpu test dependencies in dockerfile.xpu — ci/build — by 1643661061leo (created: 2026-03-09 16:59 (UTC+8))
- #36469 [Cleanup] Unify RMSNormGated and FusedRMSNormGated — no labels — by EdgeN8v (created: 2026-03-09 15:41 (UTC+8))
- #36444 [Model][Quantization] Add GGUF support for MiniMax-M2.1 — no labels — by JoursBleu (created: 2026-03-09 12:07 (UTC+8))
- #36464 [Examples][2/n] Resettle generate examples. — documentation,structured-output,ci/build,qwen — by noooop (created: 2026-03-09 14:43 (UTC+8))
- #36474 [Bugfix] Fix GPTQ Marlin size_k for Qwen3.5 — bug,qwen — by Isotr0py (created: 2026-03-09 16:39 (UTC+8))
- #36467 uniform batch tokens when uniform_batch_tokens is set — v1,meta-exported,fb-exported — by cgufb (created: 2026-03-09 15:11 (UTC+8))
- #36466 feat(attention): extract KV-cache update from FlashAttentionDiffKV ba… — v1 — by Prathmesh234 (created: 2026-03-09 14:54 (UTC+8))
- #36451 [CORE][V1] fix: alive-but-hung EngineCore not being detected by /health endpoint. — v1 — by sihyeonn (created: 2026-03-09 13:22 (UTC+8))
- #36457 [Bugfix] Add regression test for allreduce RMS fusion with PP — bug,needs-rebase,qwen — by robellliu-dev (created: 2026-03-09 13:47 (UTC+8))
- #36437 [Bugfix] Non-tensor member assertion in cpu_model_runner — bug,needs-rebase,v1,cpu — by PatchouliTIS (created: 2026-03-09 11:38 (UTC+8))
- #36445 fix: Responses API streaming tool call support for non-harmony models — frontend,gpt-oss — by herve-ves (created: 2026-03-09 12:10 (UTC+8))
- #36460 [Bugfix] Fix FP8 block-scale dimension mismatch in fused linear weight loading — bug — by AjAnubolu (created: 2026-03-09 14:35 (UTC+8))
Merged PRs
- #36553 [BugFix] Remove incorrect assert in split_decodes_and_prefills — bug,ready,v1 — by WoosukKwon (合并于: 2026-03-10 11:02 (UTC+8))
- #31471 [Model] Add HyperCLOVAX-SEED-Think-32B vision-language model support — documentation,new-model,frontend,ready,v1,multi-modality,qwen,cpu — by effortprogrammer (合并于: 2026-03-10 10:59 (UTC+8))
- #36242 [Bugfix] Fix Qwen3-Next in_proj_ba weight sharding with TP > 1 — bug,ready,qwen — by AjAnubolu (合并于: 2026-03-10 10:16 (UTC+8))
- #36179 [ROCm][CI] Fix ROCm GPT-OSS Eval test group — rocm,ready,ci/build,gpt-oss — by AndreasKaratzas (合并于: 2026-03-10 08:55 (UTC+8))
- #36475 [bugfix] fix nvlink for nixl/ucx — bug,ready,kv-connector — by youkaichao (合并于: 2026-03-10 07:49 (UTC+8))
- #36552 Add non-contiguous input tests for rms_norm_per_block_quant and dynamic per-token quant kernels — 无标签 — by copilot-swe-agent (合并于: 2026-03-10 06:51 (UTC+8))
- #36544 [Model Runner V2] Add model_state inputs to CUDA graph capture — v1,nvidia — by WoosukKwon (合并于: 2026-03-10 06:14 (UTC+8))
- #36393 add nemotron v3 reasoning parser — ready,nvidia — by shaunkotek (合并于: 2026-03-10 06:11 (UTC+8))
- #35959 [MRV2] Extensible CG dispatch rework — ready,v1,nvidia — by LucasWilkinson (合并于: 2026-03-10 04:58 (UTC+8))
- #36507 [MTP][Misc] Clean up dead code — speculative-decoding,ready,v1 — by MatthewBonanni (合并于: 2026-03-10 02:43 (UTC+8))
- #36436 [Misc] Refactored 5 duplicate helper functions that were copied-pasted across multiple parsers — ready — by taneem-ibrahim (合并于: 2026-03-10 02:14 (UTC+8))
- #36025 [ROCm][CI] Prep Tests For Change To ROCM_ATTN As New Default Backend On ROCm — rocm,ready,ci/build,v1,multi-modality — by micah-wil (合并于: 2026-03-10 02:27 (UTC+8))
- #36281 [BE] Rename
should_torch_compile_mm_vittoshould_torch_compile_mm_encoder— documentation,ready,llama,qwen — by Lucaskabela (合并于: 2026-03-10 02:22 (UTC+8)) - #35930 [Model Runner V2] Use NamedTuple for
execute_model_state— documentation,v1 — by WoosukKwon (合并于: 2026-03-10 02:17 (UTC+8)) - #36528 [Docs] Remove the reo beacon — documentation,ready — by simon-mo (合并于: 2026-03-10 02:16 (UTC+8))
- #36027 [torch.compile] Rename
compile_ranges_split_pointstocompile_ranges_endpoints— ready — by copilot-swe-agent (合并于: 2026-03-10 02:04 (UTC+8)) - #36412 Fix/resupport nongated fused moe triton — documentation,performance,rocm,frontend,speculative-decoding,ready,ci/build,v1,multi-modality,qwen — by shaunkotek (合并于: 2026-03-10 02:01 (UTC+8))
- #36506 [Docs] Expand `--allowed-media-domains` security guidance with threat details — documentation,ready — by russellb (merged: 2026-03-10 01:43 (UTC+8))
- #36520 [Model Runner V2] Add dummy profile_cudagraph_memory API — documentation,v1,nvidia — by WoosukKwon (merged: 2026-03-10 01:20 (UTC+8))
- #36101 [ROCm][CI] Fix logprob divergence for TitanML/tiny-mixtral under AITER rms_norm — rocm,ready — by AndreasKaratzas (merged: 2026-03-10 01:07 (UTC+8))
- #36292 [ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks — documentation,rocm,ready,v1 — by AndreasKaratzas (merged: 2026-03-10 01:02 (UTC+8))
- #36511 [Misc] fix typo: dependant -> dependent (2 lines change) — qwen — by SoluMilken (merged: 2026-03-10 01:00 (UTC+8))
- #34917 [Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 — performance,ready,ci/build,v1,deepseek,nvidia — by LopezCastroRoberto (merged: 2026-03-10 00:50 (UTC+8))
- #35290 [Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 — performance,ready,deepseek,nvidia — by LopezCastroRoberto (merged: 2026-03-10 00:46 (UTC+8))
- #36253 [ROCM] Optimize the fused_topk_bias to use aiter instead of fallback torch ops. — rocm,ready — by benenzhu (merged: 2026-03-10 00:30 (UTC+8))
- #36515 [CI] Fix edge case that could lead to broken docs builds on main — documentation,ready — by hmellor (merged: 2026-03-10 00:06 (UTC+8))
- #36416 [Bugfix] Clear stale CG keys after memory profiling — bug,ready,v1,nvidia — by MatthewBonanni (merged: 2026-03-09 23:56 (UTC+8))
- #35634 [Refactor] Simplify `chat_completion_full_generator` for tool parsers — frontend,ready — by yewentao256 (merged: 2026-03-09 23:33 (UTC+8))
- #36300 [Bug] Fix pooling model benchmark script — bug,performance,ready — by yewentao256 (merged: 2026-03-09 23:17 (UTC+8))
- #35122 Reapply [Attention] Refactor `check_and_update_config` — rocm,speculative-decoding,ready,v1,multi-modality,cpu,kv-connector,nvidia,ready-run-all-tests — by MatthewBonanni (merged: 2026-03-09 22:17 (UTC+8))
- #36319 Support online use_audio_in_video — frontend,ready,multi-modality,qwen — by gty111 (merged: 2026-03-09 22:16 (UTC+8))
- #36482 [Frontend] Move warmup into Renderer — frontend,ready — by DarkLight1337 (merged: 2026-03-09 21:03 (UTC+8))
- #36472 [MM Encoder] Default to use TORCH_SDPA backend for ViT on Volta/Turing GPU — ready,nvidia — by Isotr0py (merged: 2026-03-09 18:43 (UTC+8))
- #36470 [Deprecation][1/2] Remove items deprecated in v0.18 — documentation,ready,v1,multi-modality — by DarkLight1337 (merged: 2026-03-09 18:43 (UTC+8))
- #36471 [ci] Bound openai dependency to 2.24.0 — ready,ci/build — by khluu (merged: 2026-03-09 18:43 (UTC+8))
- #35777 [Kernel] Add fused_sigmoid_gating_delta_rule_update kernel for Qwen3 Next — ready,qwen — by xyang16 (merged: 2026-03-09 14:41 (UTC+8))
- #36434 [XPU] Add test script of PD disaggregation — ready,v1,kv-connector — by zhenwei-intel (merged: 2026-03-09 13:50 (UTC+8))
- #36160 [Frontend] Add Support for MM Encoder/Decoder Beam Search (Online Transcriptions) — documentation,frontend,ready,v1 — by alex-jw-brooks (merged: 2026-03-09 13:46 (UTC+8))
- #36430 [Bugfix] Avoid to replace non-tensor members in cpu model runner — bug,ready,v1,cpu — by bigPYJ1151 (merged: 2026-03-09 13:06 (UTC+8))
- #36110 [Frontend][2/n] Improve pooling entrypoints embed. — rocm,frontend — by noooop (merged: 2026-03-09 11:42 (UTC+8))
- #36263 feat(attention): extract KV-cache update from FlexAttention backend — v1 — by cong-or (merged: 2026-03-09 11:40 (UTC+8))
- #36243 [Bugfix] Skip out-of-stage layers in get_layers_from_vllm_config for pipeline parallel — bug,ready,v1 — by tusharshetty61 (merged: 2026-03-09 11:40 (UTC+8))
- #35953 [Misc] Move processors to `transformers_utils` — documentation,ready,multi-modality,qwen — by DarkLight1337 (merged: 2026-03-09 11:31 (UTC+8))
- #34858 Increase Flexibility for OOV Multimodal Token Handling — speculative-decoding,ready,qwen — by alex-jw-brooks (merged: 2026-03-09 11:30 (UTC+8))
- #36149 fix: Use iterator as not to store all the file loads in memory at once — ready — by shaunkotek (merged: 2026-03-09 11:25 (UTC+8))
- #35579 [Examples][1/n] Resettle basic examples. — documentation,ready,ci/build,cpu — by noooop (merged: 2026-03-09 11:22 (UTC+8))
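Several of the merged PRs above are small memory-behavior fixes. #36149 ("Use iterator as not to store all the file loads in memory at once") names a common Python pattern: replacing an eagerly built list with a generator so that only one file's contents is resident at a time. The sketch below is purely illustrative of that pattern (the function names are invented here and are not vLLM's actual code):

```python
from typing import Iterable, Iterator


def load_all_eager(paths: Iterable[str]) -> list[bytes]:
    # Before: every file's bytes coexist in memory in one list.
    return [open(p, "rb").read() for p in paths]


def load_all_lazy(paths: Iterable[str]) -> Iterator[bytes]:
    # After: a generator yields one file's bytes per iteration step,
    # so the caller can process and discard each item before the
    # next file is even opened.
    for p in paths:
        with open(p, "rb") as f:
            yield f.read()
```

A caller iterating `for data in load_all_lazy(paths): ...` keeps peak memory proportional to the largest single file rather than the sum of all files, which is the kind of saving such a PR targets.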
PRs Closed Without Merging
- #31547 [3/n] Migrate norm kernels to libtorch stable ABI — documentation,rocm,needs-rebase,ci/build,cpu,nvidia — by mikaylagawarecki (closed: 2026-03-10 11:11 (UTC+8))
- #36575 Main branch failure triage — ci/build — by simon-mo (closed: 2026-03-10 10:28 (UTC+8))
- #28361 Normalize OpenPangu RoPE scaling defaults — stale — by skyloevil (closed: 2026-03-10 10:27 (UTC+8))
- #36538 feat: add RISC-V support for CPU backend — ci/build,cpu — by typer-J (closed: 2026-03-10 10:04 (UTC+8))
- #36473 [Perf] Reuse GDN attention buffer to reduce allocation overhead — qwen — by cwazai (closed: 2026-03-10 09:16 (UTC+8))
- #36418 Populate multimedia pre-processing performance metrics via PerfStats in the vLLM engine — v1,multi-modality,meta-exported,fb-exported — by d-biswa (closed: 2026-03-10 08:00 (UTC+8))
- #30299 [Bugfix] Update WSL detection to check for WSL1 compatibility as WSL2… — bug — by HoneyBerries (closed: 2026-03-10 07:56 (UTC+8))
- #36542 Compute MM Pre-processing Timing Metrics in a Non-Blocking Way for Prod — performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by d-biswa (closed: 2026-03-10 07:49 (UTC+8))
- #36554 [Kernel] Add swap AB optimization to fused_moe_kernel — no labels — by xyang16 (closed: 2026-03-10 07:40 (UTC+8))
- #36541 [MRV2] Extensible CG dispatch rework — v1,nvidia — by WoosukKwon (closed: 2026-03-10 04:58 (UTC+8))
- #36531 [Bugfix] Fix MTP hidden_states regression for chained-layer models (Qwen3.5) — bug,speculative-decoding,needs-rebase,v1,qwen — by voipmonitor (closed: 2026-03-10 04:12 (UTC+8))
- #36522 [Docs] Add Apple MPS (Metal) GPU installation guide — documentation,cpu — by robtaylor (closed: 2026-03-10 01:11 (UTC+8))
- #36514 Containerized segmented spans — v1,kv-connector — by almogtavor (closed: 2026-03-09 23:26 (UTC+8))
- #36510 Feature/anymodel clean — new-model,gpt-oss — by askliar (closed: 2026-03-09 22:56 (UTC+8))
- #36490 Fix image — documentation,performance,new-model,rocm,frontend,speculative-decoding,needs-rebase,ci/build,v1,multi-modality — by lucas-s-p (closed: 2026-03-09 22:43 (UTC+8))
- #36504 [WIP][Model] Add AnyModel: generic support for NAS-optimized heterogeneous architectures — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,needs-rebase,ci/build,v1 — by askliar (closed: 2026-03-09 22:32 (UTC+8))
- #36496 [Misc] Add configuration endpoint — frontend,v1 — by hickeyma (closed: 2026-03-09 20:52 (UTC+8))
- #36497 GLM5 rot-safetensors — deepseek — by Zhujiyang2 (closed: 2026-03-09 20:43 (UTC+8))
- #36414 [Bugfix] Add PyTorch GDN fallback for SM12.0+ (Blackwell) in Qwen3Next/Qwen3.5 — bug,performance,ci/build,qwen — by lucaspirola (closed: 2026-03-09 19:55 (UTC+8))
- #32627 [CI]: remove unused FLASHINFER_AOT_COMPILE build argument — needs-rebase,ci/build — by haitwang-cloud (closed: 2026-03-09 17:31 (UTC+8))
- #36420 Revert "[Core] NGram GPU Implementation compatible with Async Scheduler" (#29184) — speculative-decoding,v1 — by zhewenl (closed: 2026-03-09 14:51 (UTC+8))
- #36437 [Bugfix] Non-tensor member assertion in cpu_model_runner — bug,needs-rebase,v1,cpu — by PatchouliTIS (closed: 2026-03-09 14:47 (UTC+8))
- #36289 [Model] Register Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM for text-only checkpoints — new-model,qwen — by aminsamir45 (closed: 2026-03-09 14:12 (UTC+8))
- #35990 [Core] Add contiguous block allocation mode for KV cache — needs-rebase,v1 — by xiaguan (closed: 2026-03-09 11:36 (UTC+8))
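Among the closed PRs, #30299 concerned distinguishing WSL1 from WSL2 at runtime. The commonly used heuristic inspects the kernel release string: WSL1 kernels typically report a release ending in "Microsoft" (e.g. "4.4.0-19041-Microsoft"), while WSL2 kernels embed "microsoft-standard" (e.g. "5.15.90.1-microsoft-standard-WSL2"). The sketch below illustrates that heuristic only; the function name is invented for illustration and this is not vLLM's actual detection code:

```python
def classify_wsl(kernel_release: str) -> str:
    """Classify a Linux kernel release string as "WSL1", "WSL2", or "native".

    Heuristic sketch: WSL2 releases usually contain "microsoft-standard"
    (and often a "WSL2" suffix); WSL1 releases contain "Microsoft" without
    the "-standard" marker; anything else is treated as a native kernel.
    """
    release = kernel_release.lower()
    if "microsoft" not in release:
        return "native"
    if "microsoft-standard" in release or "wsl2" in release:
        return "WSL2"
    return "WSL1"
```

In practice the string to classify would come from `platform.uname().release` or `/proc/version`; the edge cases around such markers are exactly what a PR like #30299 has to get right.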