vLLM Development Report - 2026-02-26
Window: 2026-02-26 19:38 (UTC+8) ~ 2026-02-27 19:38 (UTC+8). Stats: 27 new issues | 26 issues closed | 83 new PRs | 44 PRs merged | 20 PRs closed unmerged
📊 Daily Development Summary
During this observation window, vLLM remained highly active, adding 83 PRs and 27 issues. Development centered on performance optimization (notably MoE, attention, and memory management), multimodal and model support (with Qwen-family issues especially prominent), and continued build-out of the AMD platform ecosystem. The community also handled a steady stream of user-reported compatibility and configuration problems.
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was brisk this cycle, spanning kernel optimization, quantization support, and bug fixes.
- New PR - AMD Quark quantization support (#35491):
  - Contributor: xuebwang-amd (AMD employee)
  - Content: adds support for Qwen3.5 models quantized with AMD's Quark toolchain (MXFP4, PTPC FP8, etc.), focusing on model loading and accuracy.
  - Analysis: a key step in expanding the AMD ecosystem, wiring AMD's own quantization toolchain to a mainstream model family (Qwen) in vLLM, which should improve inference efficiency for these models on AMD hardware.
- New PRs - ROCm platform fixes and optimizations:
  - Disable full CUDA graph capture on RDNA3/4 (#35485): by haosdent; fixes crashes on RDNA3/4 caused by Triton kernels that are incompatible with HIP graph capture, by strategically downgrading cudagraph_mode from FULL_AND_PIECEWISE to PIECEWISE.
  - AITER Triton import-path update (#35453): by Rohan138; syncs vLLM with the refactored Triton operator import paths in the upstream ROCm/aiter project. A maintenance change.
  - AITER fusion for DeepSeek MLA layers (#35483): adds an AITER-based fused RMSNorm + FP8 quantization optimization to DeepSeek's MLA layers, with an expected 1.2-1.5x speedup on AMD GPUs.
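The RDNA3/4 fix is essentially a capability-gated downgrade of the graph-capture mode. A minimal sketch of that decision logic, with a hypothetical helper name (vLLM's real logic lives in its platform and compilation-config code):

```python
# Illustrative sketch: choose a CUDA/HIP graph capture mode based on the
# AMD GPU architecture string. Function name and structure are hypothetical.
def pick_cudagraph_mode(gcn_arch: str) -> str:
    # RDNA3 (gfx11xx) and RDNA4 (gfx12xx) hit Triton/HIP-graph
    # incompatibilities under full capture, so fall back to piecewise.
    if gcn_arch.startswith(("gfx11", "gfx12")):
        return "PIECEWISE"
    return "FULL_AND_PIECEWISE"
```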
- Closed issue - MI355X compatibility (#32895):
  - Status: closed. The user confirmed that the problem running the GPT-OSS model on a single MI355X (gfx950) card (TP=1) is resolved in the vLLM 0.16 image.
  - Analysis: AMD continues to work through early-support issues on its newest accelerators, and compatibility of the official images is steadily improving.
💬 High-Engagement Discussions
- Issue #35414: "4*2080ti 22g deploy Qwen3.5-35B-A3B fail: 2080 Ti does not support bfloat16"
  - Core issue: deploying Qwen3.5 on older GPUs (2080 Ti) still fails with a bfloat16-not-supported error even when `--dtype float16` is specified.
  - Viewpoints:
    - Reporter: filed the issue and shared involved workarounds (editing config.json, adding environment variables).
    - Maintainers (inferred from the follow-up discussion): the problem likely stems from implicit bfloat16 use in the model config or initialization path, interacting with newly introduced optimizations (such as the AllReduce+RMSNorm fusion) on this hardware.
  - Contention: none; this is mostly triage. It highlights the compatibility challenges that new models and optimizations pose for aging hardware.
  - Status: still open, awaiting root-cause analysis and a fix.
- Issue #35390: "Qwen3.5 (NVIDIA H200) Pointer argument (at 0) cannot be accessed from Triton"
  - Core issue: running Qwen3.5 on H200 hits a Triton kernel error because the fuse_allreduce_rms optimization is auto-enabled and then fails.
  - Viewpoints:
    - Reporter (ehfd): traced the problem to PR #35085; manually disabling the optimization works around it.
    - Fixer (haosdent): the root cause is that when the GPU lacks SymmDeviceMemory support, the optimization pass should be disabled gracefully rather than causing downstream errors. Fix PR #35424 has been merged.
  - Contention: none; an efficient defect-identification-and-fix collaboration.
  - Outcome: fixed by PR #35424; the issue is closed.
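The merged fix follows a common pattern: gate an optimization pass on a capability check up front rather than letting it fail inside a kernel. A toy sketch of that pattern (function and pass names are illustrative, combining the SymmDeviceMemory guard with the PP>1 default-off behavior from PR #35424):

```python
# Hypothetical sketch of "disable gracefully": only enable a fusion pass
# when the device supports the primitive it depends on, and (per the
# interim fix) keep it off by default under pipeline parallelism > 1.
def enabled_passes(supports_symm_mem: bool, pp_size: int) -> list:
    passes = ["base_fusion"]  # stand-in for always-safe passes
    if supports_symm_mem and pp_size == 1:
        passes.append("allreduce_rms_fusion")
    return passes
```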
- RFC #35409: "Worker Controller for warm GPU worker pooling"
  - Core issue: proposes a Worker Controller architecture that pre-warms and pools GPU worker processes to cut cold-start latency for dynamic multi-model serving.
  - Viewpoints:
    - Proposer (tangkenyi2001): decoupling the worker lifecycle shortens engine creation time (~45% in the demo), especially for workloads that switch models frequently.
    - Skeptic (robertgshaw2-redhat): for reasonably sized models, model loading is the startup bottleneck, which this proposal does not touch; the added complexity may not pay for itself.
  - Contention: whether the real-world gains justify the architectural complexity. The crux is whether the dominant cost being optimized is process initialization or model loading.
  - Status: the RFC was closed shortly after filing, as was the companion PR #35408, indicating the community has not adopted this direction for now.
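For context, the warm-pool idea can be sketched in a few lines; the class and naming below are invented for illustration, not taken from the RFC's code:

```python
import queue

# Toy sketch of the warm-pool idea from RFC #35409: keep pre-initialized
# workers around so "engine creation" skips process startup entirely.
class WarmWorkerPool:
    def __init__(self, size: int):
        self._pool = queue.SimpleQueue()
        for i in range(size):
            # Stand-in for spawning and warming a real GPU worker process.
            self._pool.put(f"worker-{i}")

    def acquire(self) -> str:
        # Reuse a warm worker; the (slow) process init already happened.
        return self._pool.get_nowait()

    def release(self, worker: str) -> None:
        self._pool.put(worker)
```

As the skeptics noted, a pool like this only amortizes process startup; model-weight loading still happens after `acquire()`, and for large models that is the dominant cost.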
🔥 Hot Topics and Trends
- Qwen-family issues cluster: multiple issues span Qwen variants (3.5-A3B, 3-Coder-Next, VL-Reranker, Omni), covering deployment errors, performance regressions, inference hangs, and multimodal problems. The Qwen family is widely adopted, but its complex feature set (mixture of experts, chain-of-thought, multimodality, tool calling) keeps stressing the vLLM integration, making it the current focus of triage.
- Deepening AMD platform support: beyond basic enablement, AMD contributors are pushing into performance optimization (MLA fusion) and the quantization ecosystem (Quark toolchain integration), signaling an effort to build a complete hardware-software solution.
- Performance vs. stability: advanced optimizations such as fuse_allreduce_rms (PR #35085) improve performance but introduce stability risks in specific configurations (PP>1, older GPUs). A high-performance inference engine must continually balance peak performance against compatibility and complexity.
- Memory management and lifecycle: issues around CuMemAllocator sleep-mode errors and zombie processes after KV-cache initialization failures show that the robustness of the memory manager and engine lifecycle remains a concern.
🛠️ Key Technical Changes
- PR #35457: "[Model Performance] Add Qwen3MoE tuned MoE configs for H200"
  - Technical summary: adds Triton MoE kernel configurations tuned for the NVIDIA H200 GPU for Qwen3MoE models (both BF16 and FP8).
  - Impact: notable gains at common batch sizes (up to 24% for BF16, up to 11% for FP8), directly improving this popular model's price/performance on the latest hardware.
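Tuned kernel-config tables of this kind are typically keyed by batch size, with lookup snapping to the nearest tuned entry. A hedged sketch with made-up config values (not the actual H200 numbers from the PR):

```python
# Illustrative tuned-config table keyed by batch size M; the entries and
# field names are invented for demonstration.
TUNED_CONFIGS = {
    1:   {"BLOCK_M": 16,  "num_warps": 4},
    16:  {"BLOCK_M": 32,  "num_warps": 4},
    64:  {"BLOCK_M": 64,  "num_warps": 8},
    256: {"BLOCK_M": 128, "num_warps": 8},
}

def get_moe_config(m: int) -> dict:
    # Pick the tuned entry whose key is closest to the actual batch size.
    best = min(TUNED_CONFIGS, key=lambda k: abs(k - m))
    return TUNED_CONFIGS[best]
```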
- PR #35121: "[Performance] Cublas Bf16 Gate with Fp32 Output"
  - Technical summary: introduces a dedicated GateLinear layer with a three-tier GEMM dispatch for the MoE gating network (dedicated DSV3 kernel -> cuBLAS BF16 with FP32 output -> generic fallback), and supports models that require FP32 compute precision (e.g. NemotronH).
  - Impact: speeds up MoE routing computation (up to 3.9x on some hardware) while preserving numerical precision; a fine-grained use of the underlying compute libraries.
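The three-tier dispatch can be sketched as a simple cascade; the tier conditions and names below are illustrative, not the PR's actual code:

```python
# Hypothetical sketch of a tiered GEMM dispatch for an MoE gate:
# prefer the most specialized kernel available, else fall back.
def pick_gate_gemm(hidden_size: int, cublas_ok: bool) -> str:
    if hidden_size == 7168:          # shape with a dedicated DSV3 kernel
        return "dsv3_kernel"
    if cublas_ok:                    # BF16 inputs, FP32 accumulate/output
        return "cublas_bf16_fp32out"
    return "fp32_fallback"           # cast up and run a plain FP32 GEMM
```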
- PR #35368: "[Bugfix] Fix Qwen2.5-Omni and Qwen3-Omni mixed-modality embed regression"
  - Technical summary: fixes a multimodal-embedding regression introduced by a refactor, restoring the super().embed_input_ids() call on the non-interleaved path.
  - Impact: resolves a critical bug where Qwen Omni models mishandled mixed audio + image + video inputs, restoring complex multimodal capability.
📈 Development Activity
- Contributor diversity: active contributors include engineers from AMD (xuebwang-amd), NVIDIA, Red Hat, and other companies, as well as independent developers. Sustained submissions from AMD employees stand out.
- Merge velocity: 44 PRs merged within 24 hours points to an efficient review-and-merge pipeline; many fixes went from creation to merge quickly (e.g. #35424, #35368).
- Issue balance: 27 issues opened vs. 26 closed keeps things in rough equilibrium, but a fair number of complex bugs remain open (deep technical issues such as AllReduce fusion under PP+TP) and will take longer to resolve.
💡 Issues to Watch
- AllReduce+RMSNorm fusion stability: Issue #35426 and PR #35468 show the optimization triggers NCCL errors under pipeline parallelism (PP>1); conditional disabling is the interim mitigation, and a root-cause fix depends on the upstream flashinfer library.
- Legacy GPU compatibility: Issue #35414 highlights how code designed for the newest models and optimizations breaks on older hardware lacking features such as bfloat16. Better hardware capability detection and graceful degradation are needed.
- The deeper torch.compile challenge: Issue #34034 and PR #35472 point to complexity caused by lazy compilation in vLLM; moving to up-front compilation is a step toward a more stable, predictable compilation lifecycle.
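The graceful degradation called for above amounts to resolving the requested dtype against the device's capabilities: bfloat16 requires NVIDIA compute capability 8.0+ (Ampere), while the 2080 Ti is SM 7.5. A hypothetical helper, not vLLM's real dtype-resolution code:

```python
# Minimal capability-check sketch: degrade bfloat16 to float16 on
# pre-Ampere GPUs instead of erroring out later. Names are illustrative.
def resolve_dtype(requested: str, compute_capability: tuple) -> str:
    if requested == "bfloat16" and compute_capability < (8, 0):
        return "float16"  # graceful fallback for e.g. Turing (SM 7.5)
    return requested
```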
📋 Appendix: Detailed Data
New Issues
- #35501 [Bug]: fp8_blockscale_gemm JIT compilation fails on vLLM Docker image — missing cublasLt.h, nvrtc.h, and -lnvrtc — bug — by onurguner (created: 2026-02-27 18:16 (UTC+8))
- #35504 [Bug]: qwen3-coder-next inference randomly hangs, accuracy degradation in 0.16.0+ with TP > 1 — bug — by vitush93 (created: 2026-02-27 19:05 (UTC+8))
- #35432 Prebuilt vLLM wheels / official images fail on RTX 50-series (Blackwell, SM120/SM121) — “no kernel image” / “sm_120 not compatible” — feature request — by zyforsure (created: 2026-02-27 02:40 (UTC+8))
- #35465 [Bug]: No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization) — bug — by DongZhaoXiong (created: 2026-02-27 10:32 (UTC+8))
- #35502 [Bug]: Server hangs indefinitely during inference with Qwen3.5-27B-FP8 (vLLM nightly) — bug — by gallery2016 (created: 2026-02-27 18:34 (UTC+8))
- #35407 [CI] DBO with DP+EP accuracy regression on GSM8K evaluation — ci-failure — by LucasWilkinson (created: 2026-02-26 23:23 (UTC+8))
- #35496 [Bug]: RPC call to sample_tokens timed out. Qwen3.5-397B-A17B — bug — by DBDXSS (created: 2026-02-27 16:47 (UTC+8))
- #35476 [CI] AttributeError: ‘RMSNormGated’ object has no attribute ‘activation’ — ci-failure — by LucasWilkinson (created: 2026-02-27 11:58 (UTC+8))
- #35477 [Usage]: How to test the performance of a encoder-only BertModel using `vllm bench serve` — usage — by duanshengliu (created: 2026-02-27 12:36 (UTC+8))
- #35412 [Bug]: Qwen3-VL-Reranker produces completely wrong relevance scores compared to native Transformers — bug — by xl2014 (created: 2026-02-26 23:35 (UTC+8))
- #35463 [Bug]: Error on sleep mode when upgrade into vllm 0.16.0 — bug — by knlnguyen1802 (created: 2026-02-27 10:06 (UTC+8))
- #35494 [Bug]: sparsemixer uses hard-coded jitter_eps instead of config — bug — by Qi-Zhan (created: 2026-02-27 16:00 (UTC+8))
- #35438 [Bug]: Invalid response_format leads in 500 errors — bug — by antonovsergey93 (created: 2026-02-27 04:07 (UTC+8))
- #35473 [Feature]: Request vllm support agent skills — feature request — by tonyaw (created: 2026-02-27 11:42 (UTC+8))
- #35426 [Bug]: AllReduceRMSFusionPass crashes with PP — bug — by ZJY0516 (created: 2026-02-27 01:33 (UTC+8))
- #35467 [Performance]: non-optimal performance of `linear` for medium batches — performance — by vadiklyutiy (created: 2026-02-27 10:37 (UTC+8))
- #35464 [Bug]: Fix GLM-OCR text_config model_type handling — bug — by iron-shaper (created: 2026-02-27 10:31 (UTC+8))
- #35414 [Bug]: 4*2080ti 22g deploy Qwen3.5-35B-A3B fail: 2080 Ti does not support bfloat16 — bug — by chuanSir123 (created: 2026-02-26 23:47 (UTC+8))
- #35462 [Installation]: branch v0.16.0 still rely on torch 2.9.1, not 2.10 — installation — by chamwen (created: 2026-02-27 09:35 (UTC+8))
- #35460 [Installation]: Can VLLM 0.16.0 release a WHL package for CUDA12? — installation — by Youmo1 (created: 2026-02-27 09:23 (UTC+8))
- #35437 [ci][cuda 13.0, torch 2.10] ValueError: When using LoRA, vocab size must be > 32000 and <= 258048 — ci-failure — by angelayi (created: 2026-02-27 03:44 (UTC+8))
- #35439 [Feature Request]: W4A8 (compressed-tensors) Kernel support for Blackwell SM100+ — bug — by zeryx (created: 2026-02-27 04:19 (UTC+8))
- #35419 [Feature]: Publish macOS wheels — feature request — by reneleonhardt (created: 2026-02-27 00:12 (UTC+8))
- #35409 [RFC]: Worker Controller for warm GPU worker pooling to reduce cold-start latency — RFC — by tangkenyi2001 (created: 2026-02-26 23:31 (UTC+8))
- #35411 [Bug]: Qwen3-VL-Reranker produces completely wrong relevance scores compared to native Transformers — bug — by xl2014 (created: 2026-02-26 23:35 (UTC+8))
- #35402 [Bug]: GPT-2 scale_attn_weights config flag is ignored in vLLM — bug — by Qi-Zhan (created: 2026-02-26 22:23 (UTC+8))
- #35403 [Bug]: Ray Compiled DAG timeout/deadlock during VLM forward pass with PP>1 and high-res images — bug — by emricksini-h (created: 2026-02-26 22:25 (UTC+8))
Closed Issues
- #35390 [Bug]: Qwen3.5 (NVIDIA H200) Pointer argument (at 0) cannot be accessed from Triton — bug — by ehfd (closed: 2026-02-27 13:42 (UTC+8))
- #35477 [Usage]: How to test the performance of a encoder-only BertModel using `vllm bench serve` — usage — by duanshengliu (closed: 2026-02-27 17:38 (UTC+8))
- #35438 [Bug]: Invalid response_format leads in 500 errors — bug — by antonovsergey93 (closed: 2026-02-27 16:01 (UTC+8))
- #35268 [Bug]: Qwen3-VL-Rerank model’s rerank API does not support query with mixed image and text inputs. — bug — by Yimi81 (closed: 2026-02-27 14:27 (UTC+8))
- #32895 [Bug]: [ROCm] [MI355X] new 0.14 upstream gptoss hard error TP=1? — bug,rocm — by functionstackx (closed: 2026-02-27 11:37 (UTC+8))
- #34705 [Bug]: Old torch compile files cause poor CPU utilisation — bug,torch.compile — by almayne (closed: 2026-02-27 10:54 (UTC+8))
- #21838 [Feature]: [V1][DCA] Support DCA in V1 — feature request,stale — by LucasWilkinson (closed: 2026-02-27 10:21 (UTC+8))
- #23088 [Feature]: Support for AIDC-AI/Ovis2.5-9B enable_thinking_budget and thinking_budget — feature request,stale — by Magmanat (closed: 2026-02-27 10:21 (UTC+8))
- #24377 [Bug]: [FP4 gemm Runner] Failed to initialize cutlass FP4 gemm. — bug,stale — by shahizat (closed: 2026-02-27 10:21 (UTC+8))
- #26118 [RFC]: Migrate from Ubuntu 20.04 as Build Base to manylinux — RFC,stale — by simon-mo (closed: 2026-02-27 10:21 (UTC+8))
- #27045 [Bug]: `Prefix` not correctly propogated through Backend — bug,stale — by Lucaskabela (closed: 2026-02-27 10:20 (UTC+8))
- #27371 [Feature][Performance][LoRA]: Run LoRA in Separate CUDA Streams — feature request,stale — by robertgshaw2-redhat (closed: 2026-02-27 10:20 (UTC+8))
- #27633 [Bug]: Multi-node mode with pplx backend fails to run on AWS EFA — bug,stale — by CalebZ9909 (closed: 2026-02-27 10:20 (UTC+8))
- #27709 [Feature]: Accelerate penalty calculation by sampler cache — feature request,stale — by xyz2606 (closed: 2026-02-27 10:20 (UTC+8))
- #27767 [RFC]: Coordinating vLLM and PyTorch Release Timelines. Starting with PyTorch Release 2.10 — RFC,stale — by atalman (closed: 2026-02-27 10:20 (UTC+8))
- #27778 [Usage]: Is DP + PP a possible way to use vLLM? — usage,stale — by oldcpple (closed: 2026-02-27 10:20 (UTC+8))
- #34224 [Installation]: vllm 0.15.1 pip install still rely on torch2.9.1 — installation — by chamwen (closed: 2026-02-27 09:03 (UTC+8))
- #35437 [ci][cuda 13.0, torch 2.10] ValueError: When using LoRA, vocab size must be > 32000 and <= 258048 — ci-failure — by angelayi (closed: 2026-02-27 08:19 (UTC+8))
- #16364 [Feature]: Support loading vision layers in VLM LoRA adapters — feature request — by pbarker (closed: 2026-02-27 07:47 (UTC+8))
- #28236 [Feature]: Implement naive prepare/finalize class to replace naive dispatching in fused_moe/layer.py — help wanted,good first issue,feature request,stale — by bnellnm (closed: 2026-02-27 05:26 (UTC+8))
- #35221 [Bug]: `qwen3` reasoning parser incorrectly parse reasoning-only output as `content` — bug — by cjackal (closed: 2026-02-27 04:30 (UTC+8))
- #35353 [CI] LoRA vocab size constraint violation during model initialization — ci-failure — by LucasWilkinson (closed: 2026-02-27 02:44 (UTC+8))
- #35409 [RFC]: Worker Controller for warm GPU worker pooling to reduce cold-start latency — RFC — by tangkenyi2001 (closed: 2026-02-26 23:41 (UTC+8))
- #35411 [Bug]: Qwen3-VL-Reranker produces completely wrong relevance scores compared to native Transformers — bug — by xl2014 (closed: 2026-02-26 23:37 (UTC+8))
- #35349 [Bug]: MiniMax-2.1: Missing `<think>` tag in the LLM response after tool calls succeed. — bug — by stingoChen (closed: 2026-02-26 22:11 (UTC+8))
- #34506 [Bug]: Qwen 2.5 Omni Output text seems only load first part of mm input — bug — by tzhouam (closed: 2026-02-26 19:58 (UTC+8))
New PRs
- #35505 [MyPy][BugFix] Check profiler is assigned before calling start() on it — bug,v1 — by hickeyma (created: 2026-02-27 19:23 (UTC+8))
- #35466 [CI/Build] CPU release supports both of AVX2 and AVX512 — ready,ci/build,v1,cpu — by majian4work (created: 2026-02-27 10:36 (UTC+8))
- #35413 [Bugfix] Fix Qwen3NextForCausalLM packed_modules_mapping — bug,ready,qwen — by jeejeelee (created: 2026-02-26 23:47 (UTC+8))
- #35503 [Bugfix] Propagate compilation_time from workers to main process for TP>1 — bug,v1,cpu — by huydhn (created: 2026-02-27 18:54 (UTC+8))
- #35424 [Bugfix] disable allreduce_rms_fusion by default when pp size > 1 — bug,ready — by ZJY0516 (created: 2026-02-27 01:01 (UTC+8))
- #35449 [Bugfix] Fix tool call arguments parsed as content/reasoning in harmony streaming — bug,frontend,gpt-oss — by jfrery (created: 2026-02-27 06:24 (UTC+8))
- #35489 [Bugfix] Fix Fabric/RDMA attribute queries poisoning global error_code in cumem allocator — bug — by haosdent (created: 2026-02-27 15:26 (UTC+8))
- #35500 [Feature] Add basic metrics for /realtime endpoint — frontend — by pougetat (created: 2026-02-27 17:54 (UTC+8))
- #35443 [BUGFIX] Replace assert with ValueError for response_format validation in chat completions endpoint — bug,frontend — by antonovsergey93 (created: 2026-02-27 05:15 (UTC+8))
- #35498 Fix indexError in moe wna16 quantization with enable-expert-parallel — no labels — by ivyilike (created: 2026-02-27 17:35 (UTC+8))
- #35499 fix(phimoe): use config router_jitter_noise instead of hardcoded jitter_eps — no labels — by stakeswky (created: 2026-02-27 17:50 (UTC+8))
- #35497 [Bugfix] Fix NVFP4 MoE scale misalignment for non-128-aligned hidden dims — bug,nvidia — by eous (created: 2026-02-27 17:25 (UTC+8))
- #35493 Grpc renderer — frontend,ci/build — by hyeongyun0916 (created: 2026-02-27 15:47 (UTC+8))
- #35405 [Fix] Avoid sending image input to other PP ranks — v1 — by emricksini-h (created: 2026-02-26 23:02 (UTC+8))
- #35484 [WIP][Bugfix] Fix multi-node PP crash with logprobs due to pinned memory serialization — bug,v1 — by haosdent (created: 2026-02-27 15:10 (UTC+8))
- #35468 [Bugfix] Fix AllReduceFusionPass crash with PP+TP configurations — bug,nvidia — by haosdent (created: 2026-02-27 11:07 (UTC+8))
- #35404 [Bugfix][Model] Fix gpt-oss batch invariance — bug,ready,gpt-oss — by jzakrzew (created: 2026-02-26 22:36 (UTC+8))
- #35490 [WIP][Bugfix] Fix Qwen3-VL-Reranker wrong relevance scores due to incorrect pooling and attention type — bug,qwen — by haosdent (created: 2026-02-27 15:27 (UTC+8))
- #35480 [perf] Use pinned memory for async H2D transfer in do_mamba_copy_block — ready,v1 — by hl475 (created: 2026-02-27 13:16 (UTC+8))
- #35487 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests — bug,ready,multi-modality,qwen — by linyueqian (created: 2026-02-27 15:18 (UTC+8))
- #35444 [Bugfix] Fix Triton attention layout when used in combination with the NIXL connector — bug,v1,kv-connector — by tlrmchlsmth (created: 2026-02-27 05:39 (UTC+8))
- #35492 feat(responses): add WebSocket mode for Responses API — frontend — by rasonyang (created: 2026-02-27 15:42 (UTC+8))
- #35399 feat(kv-transfer): add KV transfer support for DSA indexer — no labels — by wz1qqx (created: 2026-02-26 21:07 (UTC+8))
- #35495 [Misc] Bound NIXL upper bound version — ready,ci/build,kv-connector — by NickLucche (created: 2026-02-27 16:26 (UTC+8))
- #35427 [Refactor] Fix maxsim cuda platform and add env to control it — frontend,ready,nvidia — by yewentao256 (created: 2026-02-27 01:37 (UTC+8))
- #35456 [Bugfix] Replace assert with ValueError for response_format validation in completions endpoint — bug,frontend,ready — by umut-polat (created: 2026-02-27 07:58 (UTC+8))
- #35474 [Perf][DeepSeek-V3] Dynamic MLA/MHA Routing for Sub-1024 Token Prefill (~3x TTFT Speedup) — deepseek — by Joy-In-Code (created: 2026-02-27 11:44 (UTC+8))
- #35481 [Feature][CI]: compare `func` & `no_func` outputs in test_functionalization.py — no labels — by 11happy (created: 2026-02-27 14:40 (UTC+8))
- #35442 [Perf] [Hybrid] Copy num_accepted_tokens in non-blocking way when not using prefix caching — v1 — by tdoublep (created: 2026-02-27 05:08 (UTC+8))
- #35491 [ROCm][Quantization] support amd-quark quantized Qwen3.5 model — rocm,qwen — by xuebwang-amd (created: 2026-02-27 15:27 (UTC+8))
- #35488 [WIP][Bugfix] Fix CUDA OOM in sparse_attn_indexer prefill with high concurrency — bug,v1,nvidia — by haosdent (created: 2026-02-27 15:19 (UTC+8))
- #35486 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests — bug,needs-rebase,multi-modality,qwen — by linyueqian (created: 2026-02-27 15:14 (UTC+8))
- #35485 [Bugfix][ROCm] Disable full CUDA graph capture on RDNA3/RDNA4 (gfx1x) — bug,rocm,nvidia — by haosdent (created: 2026-02-27 15:11 (UTC+8))
- #35451 [Core] Add optional flags to check for repetitive token patterns in engine output — frontend,v1 — by aykoppol (created: 2026-02-27 06:52 (UTC+8))
- #35483 Add AMD AITER MLA fusion optimization for DeepSeek models — rocm,deepseek — by khairulkabir1661 (created: 2026-02-27 14:47 (UTC+8))
- #35482 Fix AttributeError in RMSNormGated by adding activation attribute and… — needs-rebase,qwen — by xueliangyang-oeuler (created: 2026-02-27 14:44 (UTC+8))
- #35458 [Frontend] Add multimodal support to /inference/v1/generate endpoint — frontend — by nithinvc (created: 2026-02-27 08:53 (UTC+8))
- #35457 [Model Performance] Add Qwen3MoE tuned MoE configs for H200 — ready,qwen — by chengyinie (created: 2026-02-27 08:41 (UTC+8))
- #35479 [BugFix/CI]: Fix AttributeError: ‘RMSNormGated’ object has no attribute ‘activation’ — bug,ready — by LucasWilkinson (created: 2026-02-27 13:07 (UTC+8))
- #35420 [Bugfix] Add monkeypatch to prevent race condition from writing — bug,ready — by Lucaskabela (created: 2026-02-27 00:12 (UTC+8))
- #35478 [BugFix] Fix engine hanging after KV cache initialization failure — bug,v1 — by 842974287 (created: 2026-02-27 12:38 (UTC+8))
- #35472 [torch.compile] Stop lazily compiling — no labels — by zou3519 (created: 2026-02-27 11:35 (UTC+8))
- #35434 [BugFix] Repo utils debug print patch — bug,ready — by pi314ever (created: 2026-02-27 03:05 (UTC+8))
- #35452 Add the option to allow left padding when needed — meta-exported,fb-exported — by cgufb (created: 2026-02-27 07:01 (UTC+8))
- #35475 [torch.compile] Undo the fast_moe_cold_start hack in torch>=2.11 — no labels — by zou3519 (created: 2026-02-27 11:45 (UTC+8))
- #35471 fix(benchmarks): correct peak output token throughput calculation for speculative decoding — performance — by hukongyi (created: 2026-02-27 11:27 (UTC+8))
- #35400 [Misc] Move `GPUModelRunner.prepare_kernel_block_sizes` to utils — ready,v1 — by NickLucche (created: 2026-02-26 21:56 (UTC+8))
- #35441 [Deprecation] Deprecate code in 0.17 as scheduled — frontend,ready,v1,multi-modality — by yewentao256 (created: 2026-02-27 05:02 (UTC+8))
- #35470 Fix MIG UUID handling in interface.py and cuda.py — nvidia — by GaneshSubhashPatil (created: 2026-02-27 11:15 (UTC+8))
- #35469 feat: add OpenTelemetry Metrics support via OTLP protocol — v1 — by RichardoMrMu (created: 2026-02-27 11:12 (UTC+8))
- #35423 [Bugfix] Add missing activation attr to RMSNormGated — bug,ready — by Tib-Gridello (created: 2026-02-27 00:52 (UTC+8))
- #35459 [ModelRunnerV2] Rename sampler functions and variables for clarity — v1 — by andylolu2 (created: 2026-02-27 09:21 (UTC+8))
- #35461 [WIP][Model Runner V2] Add probabilistic rejection sampling for spec decoding — v1 — by TheEpicDolphin (created: 2026-02-27 09:26 (UTC+8))
- #35398 Andy/spec probs — speculative-decoding,v1 — by andylolu2 (created: 2026-02-26 20:24 (UTC+8))
- #35455 [Core] Num External Cached Tokens — v1,kv-connector — by aeon-x (created: 2026-02-27 07:55 (UTC+8))
- #35430 [Bugfix] Fix KV Scale loading for MLA Models — bug,ready — by pavanimajety (created: 2026-02-27 02:12 (UTC+8))
- #35454 Enable GLM4.7 FP8 KVCache scale loading — no labels — by BowenBao (created: 2026-02-27 07:15 (UTC+8))
- #35453 [ROCm]: Update aiter triton imports to match aiter folder structure — rocm,v1 — by Rohan138 (created: 2026-02-27 07:11 (UTC+8))
- #35446 [Perf] Optimize `sampled_token_ids` using numpy and remove `tolist`, 0.9% E2E throughput improvement — ready,v1 — by yewentao256 (created: 2026-02-27 06:02 (UTC+8))
- #35450 Cutlass W4A16 (Machete) Tests — nvidia — by ojhaanshika (created: 2026-02-27 06:31 (UTC+8))
- #35428 [CI] Add macOS ARM CPU wheel build and publish — ci/build — by mgoin (created: 2026-02-27 01:39 (UTC+8))
- #35448 [Quant][Feature] Support online MXFP8 MoE quantization using trtllm_fp8_block_scale_moe — nvidia — by EdalatiAli (created: 2026-02-27 06:21 (UTC+8))
- #35429 [Bugfix] Fix MessageQueue connect_ip for cross-node data parallelism — bug,ready,v1 — by luccafong (created: 2026-02-27 02:01 (UTC+8))
- #35447 [Bugfix] Fix NemotronH MTP + Chunked Prefill — bug,v1 — by benchislett (created: 2026-02-27 06:06 (UTC+8))
- #35396 Nemotron: use per-layer config in NemotronHMLPDecoderLayer for heterogeneous models — ready,nvidia — by danielafrimi (created: 2026-02-26 19:50 (UTC+8))
- #35445 [Bugfix] Fix DeepSeek-OCR crash on small images with empty crops — bug,deepseek — by jreiml (created: 2026-02-27 05:44 (UTC+8))
- #35435 Optimize batched_moe_align_block_size_kernel with cooperative writes — meta-exported,fb-exported — by ajitmaths (created: 2026-02-27 03:07 (UTC+8))
- #35422 [Performance] Extract KV cache update op from flashinfer forward — ready,v1,nvidia — by ElizaWszola (created: 2026-02-27 00:40 (UTC+8))
- #35421 [Bugfix] Fix Qwen3Coder tool call streaming with speculative decoding — bug,frontend,qwen — by voipmonitor (created: 2026-02-27 00:17 (UTC+8))
- #35440 refactor: unify duplicated flashinfer utility modules — nvidia — by LakshmiSravyaVedantham (created: 2026-02-27 04:38 (UTC+8))
- #35410 [compile] Cleanup: Remove unnecessary +rms_norm forcing for sequence parallelism — ready — by jasonlizhengjian (created: 2026-02-26 23:33 (UTC+8))
- #35415 feat(qwen3-asr): support prompt parameter in v1/audio/transcriptions — qwen — by TheCodeWrangler (created: 2026-02-26 23:48 (UTC+8))
- #35436 [Bugfix] Add assertion for scale_attn_weights in GPT-2 — bug — by mariorch22 (created: 2026-02-27 03:15 (UTC+8))
- #35433 [Misc][Harmony] Split harmony_utils.py into domain-specific modules — frontend,ci/build,gpt-oss — by sfeng33 (created: 2026-02-27 02:55 (UTC+8))
- #35431 [Bugfix] Use null block (0) for padded block table entries in SSM backends — bug,v1 — by SandishKumarHN (created: 2026-02-27 02:37 (UTC+8))
- #35418 [Refactor] Remove dead code for attention benchmark script — performance,ready — by yewentao256 (created: 2026-02-27 00:01 (UTC+8))
- #35425 [DRAFT][Spec Decode] implement online hidden state extraction — documentation,performance,new-model,speculative-decoding,v1,llama,kv-connector — by harryzorus (created: 2026-02-27 01:25 (UTC+8))
- #35417 [Bugfix] Fix EC connector unit tests after has_cache_item API change — bug,v1 — by pakkah (created: 2026-02-26 23:51 (UTC+8))
- #35416 [CI] Enable Crosslayer KV layout tests for ROCm platforms — rocm,ci/build,v1,kv-connector — by qli88 (created: 2026-02-26 23:50 (UTC+8))
- #35406 [Test] Add some gsm8k configs for hybrid models. — no labels — by tdoublep (created: 2026-02-26 23:20 (UTC+8))
- #35408 [Core] Worker Controller: improve engine create/delete lifecycle and worker reuse — new-model,frontend,needs-rebase,ci/build,v1 — by tangkenyi2001 (created: 2026-02-26 23:27 (UTC+8))
- #35401 [DRAFT] Add triton_attn_td backend — documentation,v1 — by jikunshang (created: 2026-02-26 22:01 (UTC+8))
- #35397 [Kernel][Mamba] Optimize Mamba2 SSD prefill Triton kernels — no labels — by tomeras91 (created: 2026-02-26 20:21 (UTC+8))
Merged PRs
- #35413 [Bugfix] Fix Qwen3NextForCausalLM packed_modules_mapping — bug,ready,qwen — by jeejeelee (merged: 2026-02-27 11:46 (UTC+8))
- #35424 [Bugfix] disable allreduce_rms_fusion by default when pp size > 1 — bug,ready — by ZJY0516 (merged: 2026-02-27 12:18 (UTC+8))
- #35456 [Bugfix] Replace assert with ValueError for response_format validation in completions endpoint — bug,frontend,ready — by umut-polat (merged: 2026-02-27 16:01 (UTC+8))
- #35297 [Model] Add nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding model — documentation,new-model,ready,multi-modality,llama,nvidia — by jzakrzew (merged: 2026-02-26 22:17 (UTC+8))
- #33088 [Bugfix] Use ‘sum’ reduction instead of ‘avg’ in Async TP reduce-scatter — bug,ready — by wangxingran222 (merged: 2026-02-27 15:06 (UTC+8))
- #35457 [Model Performance] Add Qwen3MoE tuned MoE configs for H200 — ready,qwen — by chengyinie (merged: 2026-02-27 13:51 (UTC+8))
- #35369 [Bug] correct out dtype of rms_norm_gated native path — bug,ready — by zufangzhu (merged: 2026-02-27 13:19 (UTC+8))
- #35434 [BugFix] Repo utils debug print patch — bug,ready — by pi314ever (merged: 2026-02-27 11:50 (UTC+8))
- #35314 [Bug] Fix outdated links in source code — bug,documentation,ready,ci/build — by yewentao256 (merged: 2026-02-27 11:50 (UTC+8))
- #33197 use ‘max_active_experts’ for moe lora input size — ready — by gnovack (merged: 2026-02-27 11:50 (UTC+8))
- #35400 [Misc] Move `GPUModelRunner.prepare_kernel_block_sizes` to utils — ready,v1 — by NickLucche (merged: 2026-02-27 11:42 (UTC+8))
- #33012 [Core]Extract is_last_rank in Ray for tpu to override — ready,v1 — by Chenyaaang (merged: 2026-02-27 11:18 (UTC+8))
- #35119 [compile] Invalidate cache for cpu flags — ready — by angelayi (merged: 2026-02-27 10:54 (UTC+8))
- #35184 [Bugfix] Emit reasoning_part events in simple streaming path for Resp… — bug,frontend,ready — by daniel-salib (merged: 2026-02-27 09:49 (UTC+8))
- #34274 [CI] Actually run tests/kernels/quantization/test_block_fp8.py in CI — ready,ci/build,nvidia — by mgoin (merged: 2026-02-27 08:58 (UTC+8))
- #35121 [Performance] Cublas Bf16 Gate with Fp32 Output — performance,ready,ci/build,deepseek,nvidia — by roikoren755 (merged: 2026-02-27 08:51 (UTC+8))
- #34687 [Update] Use FlashInfer fast_decode_plan directly instead of replication — ready,v1,nvidia — by askliar (merged: 2026-02-27 08:31 (UTC+8))
- #35430 [Bugfix] Fix KV Scale loading for MLA Models — bug,ready — by pavanimajety (merged: 2026-02-27 07:38 (UTC+8))
- #33839 [Kernel][perf] optimize NCCL symm_mem vs custom_AR selection thresholds — performance,ready — by pkousha (merged: 2026-02-27 06:35 (UTC+8))
- #30357 [ROCm][Quantization] GPT OSS Upstream MoE wmxfp4_afp8 with static scales — rocm,ready,gpt-oss — by maleksan85 (merged: 2026-02-27 06:50 (UTC+8))
- #33724 [WideEP] Remove pplx all2all backend — documentation,ready,ci/build,nvidia — by tlrmchlsmth (merged: 2026-02-27 06:30 (UTC+8))
- #35047 add mixed precision support for modelopt — ready — by sychen52 (merged: 2026-02-27 05:56 (UTC+8))
- #35429 [Bugfix] Fix MessageQueue connect_ip for cross-node data parallelism — bug,ready,v1 — by luccafong (merged: 2026-02-27 06:08 (UTC+8))
- #35396 Nemotron: use per-layer config in NemotronHMLPDecoderLayer for heterogeneous models — ready,nvidia — by danielafrimi (merged: 2026-02-27 05:55 (UTC+8))
- #35422 [Performance] Extract KV cache update op from flashinfer forward — ready,v1,nvidia — by ElizaWszola (merged: 2026-02-27 05:29 (UTC+8))
- #35230 fix(reasoning): Qwen3ReasoningParser returns truncated output as reasoning — ready,qwen — by stakeswky (merged: 2026-02-27 04:30 (UTC+8))
- #35383 [Model Runner V2] Prepare attn metadata in ModelState [2/N] — v1,nvidia — by WoosukKwon (merged: 2026-02-27 03:47 (UTC+8))
- #35350 [Model Runner V2] Add model states [1/N] — v1,nvidia — by WoosukKwon (merged: 2026-02-27 03:20 (UTC+8))
- #35063 [Model Runner V2] Fix error-handling — ready,v1 — by njhill (merged: 2026-02-27 03:00 (UTC+8))
- #35354 [Bugfix] Remove erroneous lower bound on LoRA vocab size constraint — bug,ready — by LucasWilkinson (merged: 2026-02-27 02:44 (UTC+8))
- #35330 [Perf] Optimize maxsim scores computation for pooling models, 13.9% E2E throughput improvement — frontend,ready — by yewentao256 (merged: 2026-02-27 01:14 (UTC+8))
- #34396 [BugFix] Align fused MoE-LoRA kernel config with actual weight shapes — bug,ready — by RunkaiTao (merged: 2026-02-27 02:03 (UTC+8))
- #35418 [Refactor] Remove dead code for attention benchmark script — performance,ready — by yewentao256 (merged: 2026-02-27 01:53 (UTC+8))
- #32642 [Core] Support `min_tokens` with speculative decoding — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by qianlihuang (merged: 2026-02-27 01:31 (UTC+8))
- #34982 Add GlmOcrConfig for GLM-OCR model type recognition — ready,meta-exported,fb-exported — by hujia177 (merged: 2026-02-27 01:04 (UTC+8))
- #35125 [BugFix][kv_offload]: Fix kernel block size detection — bug,ready,v1 — by orozery (merged: 2026-02-27 00:29 (UTC+8))
- #34387 [ROCm] Update the torch version in rocm_build.txt to use the official 2.10 release — rocm,ready,ci/build — by SageMoore (merged: 2026-02-27 00:28 (UTC+8))
- #34157 [ROCm] Add dynamic mxfp4 quantization for DeepSeek V2 projection layers — rocm,ready,deepseek — by dllehr-amd (merged: 2026-02-27 00:00 (UTC+8))
- #35318 [Refactor] Remove dead or duplicate func utils or variables — performance,ready,v1,nvidia — by yewentao256 (merged: 2026-02-26 23:57 (UTC+8))
- #35352 [Bug] Fix missing `<think>` tag after tool call in MiniMax 2.1 — bug,ready — by stingoChen (merged: 2026-02-26 22:11 (UTC+8))
- #35284 [Misc] Standardize handling of `mm_processor_kwargs.size` — documentation,ready,multi-modality,qwen — by DarkLight1337 (merged: 2026-02-26 21:05 (UTC+8))
- #35275 [Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic — bug,ready — by Josephasafg (merged: 2026-02-26 20:22 (UTC+8))
- #34336 [Bugfix] fix device_name for routing replay — bug,ready — by Li-Yongwen (merged: 2026-02-26 20:18 (UTC+8))
- #35368 [Bugfix] Fix Qwen2.5-Omni and Qwen3-Omni mixed-modality embed regression — bug,ready,multi-modality,qwen — by linyueqian (merged: 2026-02-26 19:58 (UTC+8))
Closed Without Merging (PRs)
- #34881 [Bugfix] limit cudagraph capture sizes by num_blocks for GDN models — bug,v1,qwen,nvidia — by ZJY0516 (closed: 2026-02-27 18:40 (UTC+8))
- #35497 [Bugfix] Fix NVFP4 MoE scale misalignment for non-128-aligned hidden dims — bug,nvidia — by eous (closed: 2026-02-27 17:48 (UTC+8))
- #35490 [WIP][Bugfix] Fix Qwen3-VL-Reranker wrong relevance scores due to incorrect pooling and attention type — bug,qwen — by haosdent (closed: 2026-02-27 17:14 (UTC+8))
- #35486 [Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests — bug,needs-rebase,multi-modality,qwen — by linyueqian (closed: 2026-02-27 15:17 (UTC+8))
- #35479 [BugFix/CI]: Fix AttributeError: ‘RMSNormGated’ object has no attribute ‘activation’ — bug,ready — by LucasWilkinson (closed: 2026-02-27 13:11 (UTC+8))
- #32157 feat: add requires_token_ids interface for sampling params — v1 — by llsj14 (closed: 2026-02-27 12:23 (UTC+8))
- #34835 [Core] Extract is_last_rank in ray for tpu to override — v1 — by pv97 (closed: 2026-02-27 11:47 (UTC+8))
- #29191 [Models] Lfm2-VL Architecture — documentation,new-model,stale — by paulpak58 (closed: 2026-02-27 11:38 (UTC+8))
- #34511 [Bugfix] Fix AllReduceFusionPass NCCL error in DP+TP configurations — bug,needs-rebase — by haosdent (closed: 2026-02-27 11:08 (UTC+8))
- #35359 feat: add trace spans for failed API requests — frontend — by RichardoMrMu (closed: 2026-02-27 11:04 (UTC+8))
- #35296 Fix Qwen3ReasoningParser misclassifying truncated reasoning as content — needs-rebase,qwen — by sxu75374 (closed: 2026-02-27 10:35 (UTC+8))
- #26566 [Bugfix] Check flashinfer availability using flashinfer.sampling.get_sampling_module — needs-rebase,stale,v1 — by thenumberouscode (closed: 2026-02-27 10:21 (UTC+8))
- #27270 Fix: Handle both bytes and str types in async_request_openai_audio — performance,stale — by KilJaeeun (closed: 2026-02-27 10:20 (UTC+8))
- #27638 feature: support ipv6 in vllm distributed — stale — by brook-cpp (closed: 2026-02-27 10:20 (UTC+8))
- #27749 [test/dnm] do not merge: ci-infra dummy PR — stale — by dougbtv (closed: 2026-02-27 10:20 (UTC+8))
- #35455 [Core] Num External Cached Tokens — v1,kv-connector — by aeon-x (closed: 2026-02-27 08:13 (UTC+8))
- #33658 feat(mla): add default do_kv_cache_update for MLA — ready,v1 — by dw2761 (closed: 2026-02-27 04:47 (UTC+8))
- #34105 [Spec Decode] Move spec decode offline script from example/ to inside vllm for reusability — documentation,speculative-decoding,needs-rebase,v1 — by ekagra-ranjan (closed: 2026-02-27 04:03 (UTC+8))
- #34652 [AMD][CI] Skip test new_weight_syncing/rlhf.py — documentation,rocm — by rjrock (closed: 2026-02-27 00:08 (UTC+8))
- #35408 [Core] Worker Controller: improve engine create/delete lifecycle and worker reuse — new-model,frontend,needs-rebase,ci/build,v1 — by tangkenyi2001 (closed: 2026-02-26 23:41 (UTC+8))