vLLM Development Activity Report - 2026-02-15

Time window: 2026-02-15 11:35 (UTC+8) ~ 2026-02-16 11:35 (UTC+8)
Statistics: New Issues 6 | Closed Issues 27 | New PRs 21 | Merged PRs 11 | PRs Closed Without Merging 12

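📊 Daily Development Status Summary

During the 24-hour analysis window, the vLLM project was highly active in both maintenance and development: 27 older Issues were closed, either resolved or expired as stale, and 11 PRs covering new features, fixes, and optimizations were merged into the main branch. Current development focuses on multimodal support, kernel performance (notably FP8 support for Blackwell GPUs), and CI and dependency stability on the AMD ROCm platform.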

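🎯 AMD/ROCm Ecosystem Updates

AMD-related activity in this period centered on ROCm platform support and CI stability. No changes specific to the Quark quantization tool or MI300 were found.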

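New Issue: ROCm 7 runtime error (#34573)
- Overview: User NickJLange reported hitting RuntimeError: No HIP GPUs are available when running vLLM on Ubuntu 22.04 with ROCm 7.0.
- Technical details: The user provided detailed environment information (Python 3.12, PyTorch 2.9.0a0+rocm), indicating that the ROCm driver and a ROCm build of PyTorch are installed, yet vLLM does not detect any GPU. The Issue was automatically labeled, and the ROCm maintainers (@hongxiayang, @tjtanaa, @vllmellm) were notified.
- Impact: This suggests a possible problem with environment configuration, driver compatibility, or the vLLM initialization path on the latest ROCm 7.0 software stack, and it needs maintainer investigation.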
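When triaging an error like the one above, it often helps to first check whether PyTorch itself can see a GPU, independent of vLLM. The following is a minimal diagnostic sketch, not taken from the Issue: the function name rocm_visibility_report is hypothetical, and it assumes a Linux host where the ROCm kernel driver exposes /dev/kfd. It only reads standard PyTorch attributes and the HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES environment variables.

```python
import os


def rocm_visibility_report() -> dict:
    """Collect basic facts that help explain a 'No HIP GPUs are available' error.

    Hypothetical helper for illustration; it is not part of vLLM.
    """
    report = {
        # Device-masking variables; an empty or wrong value can hide every GPU.
        "HIP_VISIBLE_DEVICES": os.environ.get("HIP_VISIBLE_DEVICES"),
        "ROCR_VISIBLE_DEVICES": os.environ.get("ROCR_VISIBLE_DEVICES"),
        # Assumption: the ROCm kernel driver exposes this device node on Linux.
        "kfd_device_present": os.path.exists("/dev/kfd"),
    }
    try:
        import torch
    except ImportError:
        report["torch_installed"] = False
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # torch.version.hip is None on builds that were not compiled for ROCm.
    report["torch_hip_version"] = getattr(torch.version, "hip", None)
    # ROCm builds of PyTorch reuse the torch.cuda namespace for HIP devices.
    report["gpu_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in rocm_visibility_report().items():
        print(f"{key}: {value}")
```

If torch_hip_version is set but gpu_count is 0, the problem is more likely in the driver, device permissions, or device masking than in vLLM itself. This is a general heuristic, not a conclusion reached in the Issue.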
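New PRs: ROCm CI dependency compatibility fixes (#34596) & (#34589)
- PR #34596 (cccat6): Attempts to change the transformers requirement in requirements/common.txt from <5 to >=5.0.0 and to update the version in requirements/rocm-test.txt to 5.1.0 accordingly. The PR aims for compatibility with Axolotl workflows, but core maintainer @DarkLight1337 pointed out that CI on the vLLM main branch does not yet fully pass with Transformers v5, so lifting the upper bound is not recommended for now. This shows that the AMD CI pipeline has run into difficulties aligning with upstream dependency versions.
- PR #34589 (AndreasKaratzas): Aims to stabilize the plugin tests in ROCm CI. It resolves a dependency conflict in the terratorch plugin tests (by pinning segmentation-models-pytorch to a compatible version) and addresses a PyTorch tensor copy warning. This reflects the maintainers' ongoing effort to keep tests for various model plugins (especially remote-sensing models) stable on AMD platforms.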
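New Issues: AMD APU performance comparison (#34578, #34579)
- Overview: User kathirvel-balakrishnan compared vLLM and llama.cpp running the GPT-OSS-120B model on an AMD Strix Halo APU and found that vLLM's throughput lags significantly in the single-prompt scenario.
- Analysis: Although the issue involves AMD hardware, its core is the difference in design philosophy between vLLM, whose architecture is optimized for high-concurrency batched inference, and libraries such as llama.cpp, which are optimized for low-latency single-sequence inference. The user asked how the architecture could be improved, which opens a potential discussion on vLLM's scheduling strategy and single-sequence performance, but there has been no community reply so far.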

💬 Analysis of High-Activity Discussions

Several of the older Issues closed in this period had previously seen lively discussion.

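Issue #23724: [Bug]: vLLM server crashes with CUDA illegal memory access for specific sequence lengths on B200 (8 comments)
- Core topic: On NVIDIA B200/H100/H200 with MXFP4 precision and the FlashInfer attention backend, certain combinations of sequence length and request rate cause a CUDA illegal memory access and crash the process.
- Viewpoints:
  - Reporter and users hitting the same problem: Provided detailed reproduction steps, logs, and environment information, and noted that the problem depends on the data-parallel (DP) size (DP=1 works, DP=8 crashes).
  - User offering a solution (@ABXayn): Pointed out that setting the environment variable VLLM_USE_FLASHINFER_SAMPLER=0 resolves the problem, which is a key workaround.
- Points of contention: No substantive disagreement; users mainly collaborated on troubleshooting.
- Current status: The Issue was closed automatically after more than 90 days without new activity. The root cause may be a bug in a specific kernel under distributed execution; a workaround has been provided.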
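The workaround reported in this thread amounts to setting one environment variable before the server process starts. A minimal sketch follows; the helper name env_with_flashinfer_sampler_disabled is hypothetical, and only the variable name and value come from the Issue.

```python
import os
from typing import Mapping, Optional


def env_with_flashinfer_sampler_disabled(
    base: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of the environment with the FlashInfer sampler turned off.

    Hypothetical helper; the variable and value are the ones reported in
    Issue #23724. The input mapping is copied, not mutated.
    """
    env = dict(os.environ if base is None else base)
    env["VLLM_USE_FLASHINFER_SAMPLER"] = "0"
    return env
```

The returned mapping can be passed as the env argument of subprocess.run when launching vllm serve, or the same variable can be exported in the shell before starting the server. Either way it has to be in place before the vLLM process starts.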
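Issue #26328: [Bug]: EAGLE3 with gpt-oss-120b not working (11 comments)
- Core topic: Worker processes fail to start when EAGLE3 speculative decoding is configured for GPT-OSS-120B.
- Viewpoints:
  - Reporter: Shared the error logs and suspected that the EAGLE3 component might be usable only on Blackwell GPUs.
  - User offering a solution (@mailhost): Pointed out that a vLLM version containing a specific fix is required (referencing issue #25882).
  - Other users: Shared similar experiences on A800 and reported resolving them by setting environment variables such as VLLM_WORKER_MULTIPROC_METHOD=spawn.
- Points of contention: None; the thread consisted of users sharing experience and troubleshooting.
- Final outcome: After updating to the correct nightly version, the reporter confirmed that the problem was resolved and that performance improved significantly. The Issue was closed because it was resolved and had long been inactive.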
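Issue #27080: [RFC]: To Inductor partition or to not Inductor partition (by default in v0.11.1) (7 comments)
- Core topic: Whether Torch Inductor graph partitioning should be enabled by default in v0.11.1. The feature can improve performance and enable more fusion optimizations, but it also increases cold-start compilation time and carries some risk.
- Viewpoints:
  - Initiator (ProExpertProg): Laid out the trade-offs (better performance versus longer compilation time and potential risk) and asked the team for a decision.
  - Concerned parties (@tomasruizt, @mgoin, @tlrmchlsmth): Mainly worried that the significantly longer compilation time would hurt development and debugging efficiency, and considered it risky to add another major change to a release cycle that already included the major Torch 2.9 upgrade. They suggested collecting data on more models before deciding.
- Points of contention: The trade-off between the feature's benefit (performance) and its cost (compilation time, stability risk), as well as the pacing of release management.
- Current status: The discussion leaned toward not enabling it by default at that time (v0.11.1). The RFC Issue was later closed for inactivity.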

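🔥 Hot Topics and Trends

Performance comparison and optimization: The community continues to compare vLLM with other inference engines (such as llama.cpp), particularly in single-prompt, low-concurrency scenarios (Issue #34578, #34579). This reflects user interest in exploring the boundaries of where vLLM fits.
Installation and runtime problems: Startup failures in specific environments appeared on both NVIDIA and AMD platforms (Issue #34586, RTX 6000 in Docker; Issue #34573, ROCm 7.0). With hardware and driver environments iterating quickly, continuous system compatibility testing is essential.
LoRA adapter problems: The silent failure problem has resurfaced (Issue #34591): a LoRA adapter appears to load successfully but has no actual effect. This is a difficult problem that undermines the reliability of deploying fine-tuned models.
Speculative decoding and logprobs support: The discussion in the past Issue (#34497) and the merged PR indicate that correctly returning logprobs while supporting speculative decoding is technically difficult, and the community is gradually improving it.
Vision-language model coordinate offset: The long-standing Qwen3-VL visual grounding coordinate offset (Issue #30918) was confirmed in the latest comments to be fixed on the main branch, showing continued follow-up on the correctness of multimodal model outputs.
Multimodal scoring support: A new PR (#34574) adds multimodal input support to the /score and /rerank endpoints for the ColQwen3 model, showing that multimodal capability is expanding from generation into retrieval and reranking.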

🛠️ Key Technical Changes

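PR #34597: [Kernel] Add FP8 KV cache support to Triton MLA decode attention (grimulkan)
- Summary: Enables FP8/F8E4M3 KV cache support in the Triton MLA attention backend. Triton MLA is the only MLA backend available on Blackwell consumer GPUs (such as the RTX 5090), and its lack of FP8 support had blocked the use of FP8 KV cache on those devices.
- Impact: Significantly broadens the applicability of FP8 KV cache on the latest generation of consumer NVIDIA GPUs, helping these devices reduce memory usage and serve larger models or longer contexts.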
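PR #34593: [Bugfix] Fix MXFP4 weight loading for individual expert tensors (olka)
- Summary: Fixes the weight loading logic for MXFP4-quantized MoE layers. The original code handled only GPT-OSS-style pre-stacked 3D weight tensors, so models such as MiniMax-M2.5, which store expert weights as individual 2D tensors, failed to load.
- Impact: Extends MXFP4 compatibility to more MoE model architectures and widens the coverage of supported quantized models.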
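The two checkpoint layouts mentioned above differ only in whether the expert dimension is already stacked. The sketch below illustrates the idea with plain nested lists; it is not the code from the PR, the function name stack_expert_weights is hypothetical, and a real loader would operate on quantized tensors together with their scale factors.

```python
def stack_expert_weights(per_expert: dict) -> list:
    """Stack per-expert 2D matrices (keyed by expert index) into one 3D nested list.

    Illustrative only: plain Python lists stand in for tensors.
    """
    if not per_expert:
        return []
    num_experts = len(per_expert)
    # Expert indices must be exactly 0..num_experts-1, with no gaps.
    missing = [i for i in range(num_experts) if i not in per_expert]
    if missing:
        raise KeyError(f"missing expert indices: {missing}")
    rows, cols = len(per_expert[0]), len(per_expert[0][0])
    for idx in range(num_experts):
        weight = per_expert[idx]
        # Every expert must share the same 2D shape to be stackable.
        if len(weight) != rows or any(len(row) != cols for row in weight):
            raise ValueError(f"expert {idx} has a different shape")
    # Result shape: [num_experts][rows][cols], the pre-stacked layout.
    return [per_expert[idx] for idx in range(num_experts)]
```

A loader that accepts both layouts can check whether the checkpoint already provides the stacked 3D form and fall back to this kind of stacking when it finds individual expert tensors instead.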
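Issue #34579: [Performance]: vLLM’s throughput lags behind llama.cpp for single prompt
- Summary: The Issue itself is a problem report, but it highlights a core design trade-off in vLLM. vLLM uses PagedAttention, a sophisticated scheduler, and similar mechanisms to optimize throughput and memory efficiency for high-concurrency, variable-length requests, whereas frameworks such as llama.cpp may be heavily optimized for low-latency, single-sequence, fixed-batch execution.
- Impact: Helps the community understand which scenarios each tool suits. vLLM's optimization target is not absolute single-sequence speed but overall resource utilization and throughput in multi-user serving.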
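PR #34516: [bugfix] Fix critical bug when reporting for all paths where handler.create_error_response is used (kizill)
- Summary: Fixes a serious frontend bug: on some error paths, the API incorrectly returned HTTP status code 200 (success) instead of the proper error code.
- Impact: Improves the reliability and standards compliance of the API. The fix is essential for client applications that rely on HTTP status codes for error handling.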
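The class of bug described here typically arises when a handler returns an error object as an ordinary value and the route wraps it in a default success response. The sketch below shows the general pattern only; ErrorResponse and http_status_for are hypothetical names, not vLLM's actual types or the code changed by the PR.

```python
from dataclasses import dataclass


@dataclass
class ErrorResponse:
    """Hypothetical error payload carrying the HTTP status it should map to."""
    message: str
    code: int


def http_status_for(result: object) -> int:
    """Pick the HTTP status for a handler result.

    Without the isinstance check, an error payload would be serialized
    with the default 200 status, which is the failure mode described above.
    """
    if isinstance(result, ErrorResponse):
        return result.code
    return 200
```

A client that branches on the status code then sees the failure directly, for example http_status_for(ErrorResponse("invalid request", 400)) yields 400 rather than 200.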
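PR #34476: [BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1 (vadiklyutiy)
- Summary: Fixes a regression introduced by an earlier PR that unified the tensor-parallel (TP) code for Mamba models. The regression misaligned weights and scale factors in quantized Mamba models (such as Nemotron-3-Nano NVFP4) when TP>1, producing incorrect results.
- Impact: Ensures correctness of complex quantized models (NVFP4 + Mamba + MoE) under distributed inference, which is critical for deploying such models in production.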

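📈 Development Activity Observations

Efficient Issue handling: 27 Issues were closed versus 6 opened, showing active effort from the maintainers in clearing the backlog and auto-closing stale Issues.
Active PR merging: 11 PRs were merged, spanning kernels, the frontend, and documentation, which indicates that code review and integration are running smoothly.
High community engagement: The closed high-activity Issues show many users helping one another by sharing solutions and workarounds, reflecting an active community.
Responsive maintainers: In the new AMD ROCm Issue and in key bugfix PRs, maintainers were quickly mentioned and joined the discussion.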

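💡 Issues Worth Watching

AMD ROCm 7 runtime support (Issue #34573): Watch how maintainers resolve the No HIP GPUs error under ROCm 7.0, since it affects the upgrade path for AMD platform users.
Single-prompt performance (Issue #34578/79): Although this reflects an architectural difference, it is worth watching whether the community proposes tuning options or best practices for this scenario.
LoRA silent failure (Issue #34591): This is a persistent problem that undermines trust in fine-tuning workflows and calls for effective diagnostic tools or stricter validation.
Correctness of speculative decoding with logprobs (past Issue #34497): For applications that need both speculative decoding and token probabilities, ensuring correctness remains an ongoing technical challenge.
Compatibility of Mamba prefix caching with chunked prefill (PR #34587): The bug being fixed here exposes the complexity of integrating state space models (SSMs) with vLLM's existing caching mechanisms.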

📋 Appendix: Detailed Data Lists

New Issues

#34586 [Bug]: RTX 6000 Pro Workstation Edition isn’t supported by torch inside docker container — bug — by UncleCShark (created: 2026-02-16 04:31 (UTC+8))
#34591 [Bug]: Ministral-3 LoRA adapter fails silently — bug — by sbischl (created: 2026-02-16 07:25 (UTC+8))
#34583 [Bug] Missing Vocabulary Validation for MTP and Eagle Speculative Methods leads to potential OOB Access — bug — by amadhan882 (created: 2026-02-15 21:31 (UTC+8))
#34579 [Performance]: vLLM’s throughput lags behind llama.cpp for single prompt — performance — by kathirvel-balakrishnan (created: 2026-02-15 19:19 (UTC+8))
#34578 [Performance]: vLLM’s throughput performance for a single prompt scenario — performance — by kathirvel-balakrishnan (created: 2026-02-15 19:18 (UTC+8))
#34573 [Installation/Runtime]: Linux ROCM7 / RuntimeError: No HIP GPUs are available — installation,rocm — by NickJLange (created: 2026-02-15 13:46 (UTC+8))

Closed Issues

#30918 [Bug] Qwen3-VL Visual Grounding Offset Issue After Upgrade from v0.11.0 — bug — by Sylar-W (closed: 2026-02-16 10:20 (UTC+8))
#21130 [Bug]: Qwen3 fails parsing reasoning content — bug,stale — by dc0709 (closed: 2026-02-16 10:17 (UTC+8))
#21422 [Doc]: how to close the log ？ Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%. — documentation,stale — by Youmo1 (closed: 2026-02-16 10:17 (UTC+8))
#23724 [Bug]: vLLM server crashes with CUDA illegal memory access for specific sequence lengths on B200 — bug,stale,gpt-oss — by KeitaW (closed: 2026-02-16 10:17 (UTC+8))
#24476 [Feature]: Assign worker process titles and logging prefix earlier — good first issue,feature request,stale — by 22quinn (closed: 2026-02-16 10:17 (UTC+8))
#26328 [Bug]: EAGLE3 with gpt-oss-120b not working — bug,stale — by icsy7867 (closed: 2026-02-16 10:17 (UTC+8))
#26703 [RFC]: [gpt-oss] Supporting Any MCP Server Tools — RFC,stale — by alecsolder (closed: 2026-02-16 10:16 (UTC+8))
#26912 [Feature]: Make CompilationMode an Enum — feature request,stale — by morrison-turnansky (closed: 2026-02-16 10:16 (UTC+8))
#26914 [Usage]: 为什么在采集的profiling中看不到通信算子？ (English: Why are communication operators not visible in the collected profiling?) — usage,stale — by sheep94lion (closed: 2026-02-16 10:16 (UTC+8))
#26957 [Bug]: Python 3.9 compatibility issue in 0.11.0 — bug,stale — by danieleynis (closed: 2026-02-16 10:16 (UTC+8))
#27013 [Feature]: Reintroduce GPU memory usage control for vLLM to coexist with other GPU-light tasks — feature request,stale — by LwengGitHub (closed: 2026-02-16 10:16 (UTC+8))
#27021 [Usage]: Need guidance reproducing benchmark results from PR #25337 — results differ significantly from reported data — usage,stale — by deitxfge (closed: 2026-02-16 10:16 (UTC+8))
#27067 [Bug]: pipeline_parallel_size and max_num_batched_tokens produce CUDA error: an illegal memory access was encountered — bug,stale — by akeshet (closed: 2026-02-16 10:16 (UTC+8))
#27071 Can someone help When I upload an MP4 video, there will be an error that recognizes the video as an image — feature request,stale — by checkHup (closed: 2026-02-16 10:16 (UTC+8))
#27074 [Bug]: infinite empty scheduling when load kv async from external KV Cache with KVConnector — bug,stale — by little-walnut (closed: 2026-02-16 10:16 (UTC+8))
#27080 [RFC]: To Inductor partition or to not Inductor partition (by default in v0.11.1) — RFC,stale — by ProExpertProg (closed: 2026-02-16 10:16 (UTC+8))
#27081 [Bug]: After turning on VLLM_USE_V1=1, NixlConnector performance is worse！ — bug,stale — by tensorflowt (closed: 2026-02-16 10:16 (UTC+8))
#27093 [Feature]: Process-level PD Disaggregation within Single Instance — feature request,stale — by fanfanaaaa (closed: 2026-02-16 10:16 (UTC+8))
#27129 [Feature]: Allow picking input, output lengths and prefix overlaps from a distribution for PrefixRandom dataset — feature request,stale — by susiejojo (closed: 2026-02-16 10:16 (UTC+8))
#27153 [Feature]: Allow vllm bench serve in non-streaming mode with /completions API — feature request,stale — by susiejojo (closed: 2026-02-16 10:16 (UTC+8))
#34497 [Bug]: return logprobs with speculative decoding — bug — by xiaoxiaosuaxuan (closed: 2026-02-15 23:21 (UTC+8))
#34401 [CI Failure]: Distributed Tests (8 GPUs)(H100) — ci-failure — by wzhao18 (closed: 2026-02-15 22:33 (UTC+8))
#29606 [Bug]: Florence 2 model in 0.10.2 not working — bug — by TheMrguiller (closed: 2026-02-15 20:11 (UTC+8))
#31819 [New Model]: Support Florence-2 on Engine 1 After Engine 0 Removal — new-model — by docee (closed: 2026-02-15 20:11 (UTC+8))
#33367 [New Model]: Bart — no labels — by JingHHj (closed: 2026-02-15 20:11 (UTC+8))
#34452 [Usage]: Issue Running Nemotron 3 Nano NVFP4 on sm12x (RTX 5090 / Blackwell Pro 6000) — usage — by shahizat (closed: 2026-02-15 16:12 (UTC+8))
#34477 [Bug]: TVM_FFI_ICHECK(args->n_group != 0) << "n_group should not be zero for DeepSeekV3 routing" — bug — by jeejeelee (closed: 2026-02-15 15:25 (UTC+8))

New PRs

#34598 [Renderer] Move InputPreprocessor into Renderer (1.5/2) — ready,multi-modality — by DarkLight1337 (created: 2026-02-16 11:30 (UTC+8))
#34597 [Kernel] Add FP8 KV cache support to Triton MLA decode attention — documentation,v1 — by grimulkan (created: 2026-02-16 11:20 (UTC+8))
#34596 Dependency compatibility: transformers>=5.0.0 and resolver fix for vLLM 0.16.0 — rocm,ci/build — by cccat6 (created: 2026-02-16 09:46 (UTC+8))
#34585 [CI/Build] Enable tests for recent day-0 new models — ready,multi-modality — by Isotr0py (created: 2026-02-16 00:18 (UTC+8))
#34590 [CI][Frontend] Return 422 instead of 500 for invalid Anthropic tool_choice — frontend,ready — by AndreasKaratzas (created: 2026-02-16 05:59 (UTC+8))
#34594 Dependency compatibility: require transformers>=5.0.0 for vLLM — ci/build,nvidia — by cccat6 (created: 2026-02-16 09:06 (UTC+8))
#34595 Dependency compatibility: require transformers>=5.0.0 for vLLM (v2) — ci/build — by cccat6 (created: 2026-02-16 09:35 (UTC+8))
#34574 [Frontend] Support multimodal inputs for late-interaction scoring (ColQwen3) — documentation,frontend,qwen — by craftsangjae (created: 2026-02-15 14:10 (UTC+8))
#34589 [ROCm][CI] Fix plugins test group; updating terratorch and dependencies — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-02-16 05:30 (UTC+8))
#34593 [Bugfix] Fix MXFP4 weight loading for individual expert tensors — bug — by olka (created: 2026-02-16 08:51 (UTC+8))
#34592 [CI] Optimize Dockerfile caching with narrowed COPY and stage split — documentation,ci/build — by amrmahdi (created: 2026-02-16 08:27 (UTC+8))
#34588 [MoE Refactor] Convert mxfp4 marlin into modular kernel format — ready — by zyongye (created: 2026-02-16 05:00 (UTC+8))
#34587 [Bugfix] - Fix Mamba prefix caching corruption with chunked prefill — bug,v1 — by Josephasafg (created: 2026-02-16 04:34 (UTC+8))
#34582 [NemotronH] Do not force router to run in fp32 — nvidia — by roikoren755 (created: 2026-02-15 21:01 (UTC+8))
#34577 [Bugfix] Add stability scaling for NVFP4 Marlin dense and MoE layers — bug — by ricky-chaoju (created: 2026-02-15 19:08 (UTC+8))
#34584 [Doc] Add Mistral-7b-v0.3 model to the batch invariance validated model — documentation,ready — by banparth (created: 2026-02-15 23:57 (UTC+8))
#34580 Flashinfer cuDNN backend for Qwen3 VL ViT attention — needs-rebase,v1,qwen,nvidia — by maxyanghu (created: 2026-02-15 19:36 (UTC+8))
#34575 Fix call to moe_mk in modelopt MoE modules (required for LoRA) — ready — by danisereb (created: 2026-02-15 15:05 (UTC+8))
#34581 [Doc] Update Encoder-Decoder models support doc with Florence-2 — documentation,ready — by Isotr0py (created: 2026-02-15 20:09 (UTC+8))
#34576 [Bugfix] Fix ARC touch KeyError for non-ready T1 blocks in kv offload — bug,ready,v1 — by Vivo50E (created: 2026-02-15 16:20 (UTC+8))
#34572 Fix: Network functions read_file, download_bytes_from_url lack SSRF… — frontend — by shanecodezzz (created: 2026-02-15 12:09 (UTC+8))

Merged PRs

#34585 [CI/Build] Enable tests for recent day-0 new models — ready,multi-modality — by Isotr0py (merged: 2026-02-16 10:17 (UTC+8))
#32183 [MM Encoder] Add Triton ViT attention backend — performance,rocm,ready,v1,qwen,nvidia — by Isotr0py (merged: 2026-02-15 22:32 (UTC+8))
#34392 [torch.compile] Disable ar-rms fusion for ds3-fp4 & DP, fix CI test — ready,ready-run-all-tests — by ProExpertProg (merged: 2026-02-15 22:33 (UTC+8))
#33079 [CPU][ARM] Add ARM BF16 cross-compilation support and improve documen… — documentation,ready,ci/build,cpu — by maryamtahhan (merged: 2026-02-15 22:33 (UTC+8))
#34581 [Doc] Update Encoder-Decoder models support doc with Florence-2 — documentation,ready — by Isotr0py (merged: 2026-02-15 20:18 (UTC+8))
#34415 [KV Connector] Add temporary, off-by-default VLLM_DISABLE_REQUEST_ID_RANDOMIZATION workaround — ready,v1,kv-connector — by eicherseiji (merged: 2026-02-15 15:26 (UTC+8))
#34494 [Bugfix] Handle num_expert_group=None in flashinfer block-scale FP8 MoE — bug,ready,nvidia — by haosdent (merged: 2026-02-15 15:25 (UTC+8))
#34476 [BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1 — bug,ready,nvidia — by vadiklyutiy (merged: 2026-02-15 15:25 (UTC+8))
#34516 [bugfix] Fix critical bug when reporting for all paths where handler.create_error_response is used — bug,frontend,ready — by kizill (merged: 2026-02-15 15:24 (UTC+8))
#34468 [CI][Entrypoints] Validate detokenize token IDs to prevent int64 overflow causing 500 — frontend,ready — by AndreasKaratzas (merged: 2026-02-15 15:08 (UTC+8))
#34537 [Kernels] Fix Helion GPU utils to use platform-agnostic device name API — rocm,ready — by AndreasKaratzas (merged: 2026-02-15 12:29 (UTC+8))

PRs Closed Without Merging

#24472 [Frontend] Add OpenAI Harmony integration for responses API — frontend,needs-rebase,stale,gpt-oss — by alecsolder (closed: 2026-02-16 10:17 (UTC+8))
#26154 [Bench] Add decode-only profile support in bench serve — performance,needs-rebase,stale — by minosfuture (closed: 2026-02-16 10:17 (UTC+8))
#26430 Adding the test-amd.yaml for test definitions for the AMD backend. — rocm,needs-rebase,ci/build,stale — by Alexei-V-Ivanov-AMD (closed: 2026-02-16 10:17 (UTC+8))
#26796 [V1][performance] add multi step — stale,v1 — by chengda-wu (closed: 2026-02-16 10:16 (UTC+8))
#26829 [Bugfix] : prevent automatic URL path decoding for async client #26636 — stale — by 1994 (closed: 2026-02-16 10:16 (UTC+8))
#27051 Add warmup to vllm bench throughput — performance,stale — by mgoin (closed: 2026-02-16 10:16 (UTC+8))
#34594 Dependency compatibility: require transformers>=5.0.0 for vLLM — ci/build,nvidia — by cccat6 (closed: 2026-02-16 09:45 (UTC+8))
#34595 Dependency compatibility: require transformers>=5.0.0 for vLLM (v2) — ci/build — by cccat6 (closed: 2026-02-16 09:36 (UTC+8))
#15246 [ROCm] Get rid of RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES — rocm,ci/build,v1 — by HollowMan6 (closed: 2026-02-15 22:38 (UTC+8))
#27221 ARM64 CUDA 12.9 wheels built and uploaded to index incorrectly — ci/build,nvidia — by Gregory-Pereira (closed: 2026-02-15 23:49 (UTC+8))
#32816 [Misc] fix: GPU Worker uses allowlist for distributed_executor_backend — v1 — by HollowMan6 (closed: 2026-02-15 22:44 (UTC+8))
#33238 [Feature]: Kernels Attention Test cleanup — v1 — by SouthWest7 (closed: 2026-02-15 11:36 (UTC+8))