vLLM Development Activity Report - 2026-03-13
Time window: 2026-03-13 11:22 (UTC+8) ~ 2026-03-14 11:22 (UTC+8). Statistics: 17 new issues | 30 closed issues | 66 new PRs | 30 merged PRs | 20 PRs closed without merging
📊 Daily Development Status Summary
During this observation window, the vLLM community maintained a very high level of development activity, opening 66 PRs and merging 30. Development focused on memory/cache management optimization (e.g., KV cache leaks, warmup memory footprint), support for new model architectures and quantization schemes (e.g., GLM 4.7, MiniMax-M2.1 GGUF, AMD Quark quantization), and improvements to system reliability and operability (e.g., health-check endpoints, graceful shutdown). Compatibility and performance work in the AMD ROCm ecosystem was another major theme of this period.
🎯 AMD/ROCm Ecosystem Activity
AMD/ROCm-related activity remained brisk during this period, spanning core bug fixes, performance optimization, and new feature support.
1. New Issues (key problems)
- #36994 - [Bug]: supports_block_size wrongly rejects dynamically computed block sizes (e.g., Qwen3.5 on ROCm): a critical bug. On ROCm, a hardcoded block_size whitelist check in `attention/backend.py` prevents models that compute block sizes dynamically, such as Qwen3.5's hybrid architecture (HybridAttentionMambaModelConfig), from running at all. ROCm relies on the Triton backend, and the whitelist check preempts the kernel's own validity check. Impact: directly blocks these models on AMD hardware. Status: open; a comment suggests it may already be fixed by PR #36274.
- #36989 - [Bug]: Qwen3-0.6B hangs during `vllm bench throughput` on amd machine: reports a hang while running the throughput benchmark for Qwen3-0.6B on an AMD machine. Impact: affects benchmarking and model-compatibility validation on the AMD platform. Status: open.
2. New PRs (features and fixes)
- #37009 - [ROCm] issue management - request information for bug issues on ROCm: submitted by AMD engineer hongxiayang. Streamlines triage of ROCm-related issues by automatically requesting key information (reproduction steps, GPU architecture, installation method) when a new bug report is filed.
- #37029 - [Hardware][XPU][ROCm] Align memory usage with cuda on xpu/rocm: proposes adding an `empty_cache()` call to the memory-profiling path on ROCm (and XPU) so that profiling behavior matches CUDA, keeping cross-platform metrics comparable.
- #36993 - [CI][Bugfix][AMD] Ensure weights created when using emulating OCP MXFP4: fixes test failures on AMD MI300 when emulating OCP MXFP4 quantization. Root causes: 1) the `cuda_graph_capture_sizes` set in the test conflicted with VllmRunner's internal settings; 2) in emulation mode, weights were not actually created, crashing later processing.
- #36996 - [CI][BugFix][AMD] Don't set VLLM_ROCM_USE_AITER anymore...: removes an unnecessary environment-variable assignment from `test_rocm_aiter_topk.py` to avoid polluting other tests running in the same process.
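The cross-platform alignment idea in #37029 can be pictured with a toy allocator. This is an illustrative sketch, not vLLM's actual profiling code: if one backend counts cached-but-free blocks in its memory snapshot and another does not, the reported numbers diverge; clearing the cache before taking the snapshot restores comparability.

```python
# Toy model (names illustrative) of why an empty_cache() call before a memory
# snapshot makes profiling numbers comparable across backends.

class FakeAllocator:
    """Stand-in for a device allocator that keeps a free-block cache."""
    def __init__(self):
        self.active = 0   # bytes held by live tensors
        self.cached = 0   # bytes held in the free-block cache

    def alloc(self, n):
        self.active += n

    def free(self, n):
        # Freed blocks go back to the cache instead of the OS.
        self.active -= n
        self.cached += n

    def empty_cache(self):
        self.cached = 0

def profile_memory(allocator, clear_cache_first):
    # Clearing the cache first means cached-but-free blocks are not
    # counted as "used", matching the stricter accounting semantics.
    if clear_cache_first:
        allocator.empty_cache()
    return allocator.active + allocator.cached

alloc = FakeAllocator()
alloc.alloc(1000)
alloc.alloc(500)
alloc.free(500)          # 500 bytes now sit in the cache

without_clear = profile_memory(alloc, clear_cache_first=False)
with_clear = profile_memory(alloc, clear_cache_first=True)
print(without_clear, with_clear)  # 1500 1000
```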
3. Merged PRs (landed progress)
- #35316 - [ROCm][Quantization] add quark w4a8 mxfp4_fp8 for LinearLayer: submitted and merged by AMD engineer divakar-amd. Adds support on ROCm for the Quark quantization toolchain's w4a8 mxfp4_fp8 scheme (MXFP4 weights, FP8 activations), validated end to end with GPT-OSS-120B. Significance: substantially expands the efficient quantization options available on AMD hardware.
- #35786 - Enable RoPE+KV cache fusion for ROCm AITER FA: enables fused RoPE + KV-cache writes for the ROCm AITER FlashAttention backend (non-shuffle layout), targeting better inference performance.
Summary: during this period the AMD team not only fixed several platform-compatibility bugs (dynamic block_size, test configuration) but also pushed infrastructure improvements (issue management, memory-profiling alignment) and quantization ecosystem work (the new Quark format), showing sustained investment in the stability and completeness of the ROCm backend.
💬 High-Engagement Discussions
- Closed Issue #32186 - [Bug]: Unable to serve Qwen3-8B-FP8 with moriio kv connector: centered on RDMA errors when using the MoRIIO KV connector on AMD ROCm.
  - Core question: in a container environment, the correct Broadcom userspace RDMA driver libraries were missing, so no RDMA devices could be discovered and connections failed.
  - Views and progress:
    - User junkang1991 provided detailed error logs and `ibv_devices` output.
    - AMD contributors inkcherry and jhchouuu diagnosed the problem step by step, pinpointing the missing out-of-box (OOB) driver libraries inside the container, and supplied an installation script and solution (mount the host driver libraries; remove the conflicting built-in versions).
    - After applying the fix the user's connection worked, but a new "Cannot allocate memory" warning surfaced; the contributors followed up with backend-configuration tuning suggestions.
  - Points of contention: none of substance; a textbook support-and-triage collaboration.
  - Outcome: the root cause was containerized RDMA environment configuration. A clear community-provided solution resolved it and the issue was closed.
- Open Issue #36973 - [Bug] _warmup_prefill_kernels in qwen3_next.py leaks ~3.4 GiB GPU memory: sparked a discussion on balancing performance optimization against resource management.
  - Core question: the `_warmup_prefill_kernels()` function added to pre-warm Triton GDN kernels permanently occupies ~3.4 GB of GPU memory even after `empty_cache()`, severely squeezing KV-cache space.
  - Views and positions:
    - Reporter (jhsmith409): provided detailed before/after data showing a 66% drop in available KV cache, called it an unacceptable resource leak, and hypothesized that compiled Triton kernel binaries are left resident.
    - Maintainer (ZJY0516): first asked whether the cost occurs only on first run (with no Triton cache). After the reporter confirmed it happens on every container start, expressed puzzlement, since Triton should in theory retain only the best kernel after autotuning.
  - Points of contention: how widespread the problem is (environment-specific or not) and the root cause (expected Triton behavior, or a bug).
  - Current status: open; the maintainer has said they will investigate. The discussion underlines that chasing peak performance (kernel warmup) demands extra care with scarce resources such as memory.
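The reporter's hypothesis can be pictured with a toy model. This is a sketch of the hypothesis under discussion, not vLLM's or Triton's real memory accounting: if the tensor allocator cache and the compiled-kernel store are separate pools, `empty_cache()` reclaims only the former.

```python
# Illustrative-only model of why empty_cache() may not reclaim warmup memory:
# scratch tensors and compiled kernel binaries live in different pools, and
# empty_cache() only touches the tensor pool.

class FakeGPU:
    def __init__(self):
        self.tensor_cache_mb = 0
        self.kernel_binaries_mb = 0

    def run_warmup(self):
        # Autotuning compiles many kernel variants and allocates scratch tensors.
        self.tensor_cache_mb += 1200       # scratch buffers: reclaimable
        self.kernel_binaries_mb += 3400    # compiled binaries: not reclaimable

    def empty_cache(self):
        self.tensor_cache_mb = 0           # kernel binaries are untouched

    def used_mb(self):
        return self.tensor_cache_mb + self.kernel_binaries_mb

gpu = FakeGPU()
gpu.run_warmup()
gpu.empty_cache()
print(gpu.used_mb())  # 3400
```

The 3400 MB figure is chosen only to echo the ~3.4 GiB residue reported in the issue; whether kernel binaries are actually the culprit is exactly what the maintainers are investigating.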
- Open Issue #36994 - [Bug]: supports_block_size wrongly rejects dynamically computed block sizes: few comments, but technically deep discussion.
  - Core question: whether a hardcoded block_size whitelist is justified, and the compatibility problems caused by platform differences (NVIDIA FlashAttention vs. AMD Triton).
  - Views and positions:
    - Reporter (kyuz0): argued the hardcoded check is a design mistake and that the kernel's own `get_supported_kernel_block_sizes()` method should be trusted; the check particularly hurts hybrid-architecture models on the AMD platform.
    - Maintainer (jennyyyyzhen): responded quickly that the problem may already be fixed by another PR (#36274) and asked for verification against the latest code.
  - Current status: awaiting verification. The discussion illustrates how upstream code assumptions affect the heterogeneous-compute ecosystem, and the maintainers' familiarity with related PRs.
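The design alternative the reporter advocates can be sketched as follows. The whitelist contents and the "multiple of 16" rule below are invented for illustration; only the contrast between a static whitelist and delegating validation to the kernel reflects the issue.

```python
# Sketch of the design question in #36994: a static whitelist rejects any
# block size it has never seen, while delegating to the kernel lets
# dynamically computed sizes through when the kernel supports them.

HARDCODED_BLOCK_SIZES = {1, 16, 32, 64, 128}  # illustrative whitelist

def supports_block_size_whitelist(block_size: int) -> bool:
    return block_size in HARDCODED_BLOCK_SIZES

class TritonBackend:
    """Stand-in backend that validates block sizes itself."""
    def get_supported_kernel_block_sizes(self):
        # The hypothetical kernel accepts any multiple of 16.
        return lambda bs: bs % 16 == 0

def supports_block_size_delegated(backend, block_size: int) -> bool:
    return backend.get_supported_kernel_block_sizes()(block_size)

backend = TritonBackend()
dynamic_bs = 144  # e.g. a size computed from a hybrid model's layer layout
print(supports_block_size_whitelist(dynamic_bs))           # False: model rejected
print(supports_block_size_delegated(backend, dynamic_bs))  # True: kernel decides
```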
🔥 Hot Topics and Trends
- Ongoing memory and cache management optimization: several items revolve around memory use. Beyond the warmup memory leak in #36973, #36958 fixes a KV-cache leak in NIXL MLA mode, RFC #37003 proposes a smarter, context-aware KV-cache retention policy to replace plain LRU, and #37029 seeks cross-platform memory-profiling consistency. All of this reflects that efficient memory management is a core challenge in high-concurrency, long-context inference.
- The model-support "arms race": the community keeps integrating the newest models quickly. This period added support for the MiniMax-M2.1 GGUF format (#36965) and ERNIE-family classification models (#36385), and fixed loading or inference problems for GLM-4.7-Flash AWQ (#34695) and Qwen3.5 MoE (#36954, #36975). RFCs/PRs for Cohere's `/v2/embed` API (#37000) and Mistral Guidance (#37005) also show the push to interface with more of the ecosystem.
- Inference reliability and operational tooling: production needs are driving development. #36961 proposes a `/health/ready` endpoint that actually verifies GPU inference capability, going beyond the process-liveness check that exists today. #36666 re-introduces a graceful-shutdown timeout that lets in-flight requests finish. #36990 fixes environment-variable validation returning raw casing, which broke subsequent PyTorch API calls.
- AMD ecosystem deepening, with challenges: as covered above, AMD activity goes well beyond bug fixes into quantization support (Quark), performance optimization (fused kernels), test-framework polish, and developer experience (issue-management templates), while also exposing compatibility challenges around specific model architectures (hybrid SSM-Attention) and distributed communication (RDMA configuration) that still need sustained work.
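The readiness-versus-liveness distinction behind #36961 can be sketched in a few lines. This is a hedged illustration, not vLLM's implementation; the function names and the 200/503 status codes are the conventional probe semantics, while everything else is invented.

```python
# Sketch of the idea behind a /health/ready endpoint: liveness only proves
# the process answers, while readiness exercises the actual inference path.

def liveness() -> int:
    # Existing behaviour: the process answering at all means "healthy".
    return 200

def readiness(run_tiny_inference) -> int:
    # Proposed behaviour: run a minimal forward pass; any failure (OOM, hung
    # kernel, corrupted weights) becomes a 503 so the orchestrator can stop
    # routing traffic to this replica.
    try:
        run_tiny_inference()
        return 200
    except Exception:
        return 503

def healthy_model():
    return "ok"

def broken_model():
    raise RuntimeError("CUDA error: device-side assert")

print(liveness(), readiness(healthy_model), readiness(broken_model))  # 200 200 503
```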
🛠️ Key Technical Changes
- PR #36666 - Re-add shutdown timeout: re-introduces the graceful-shutdown timeout previously reverted over test failures. After receiving SIGTERM, the server can optionally wait a configurable interval for in-flight requests to finish rather than terminating immediately. Impact: markedly improves the controllability of vLLM deployments in production, a hallmark of server-software maturity. Its merge depended on another PR fixing a test-environment problem (#36950).
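The drain-with-deadline pattern that a shutdown timeout implies can be sketched as below. The real vLLM logic differs (it is asynchronous and signal-driven); this minimal synchronous version, with invented names, only shows the two possible outcomes.

```python
# Minimal sketch of a graceful-shutdown timeout: stop accepting new work,
# try to drain in-flight requests until a deadline, then force-exit.

import time

def shutdown(in_flight: list, timeout_s: float, step_one_request) -> str:
    deadline = time.monotonic() + timeout_s
    while in_flight and time.monotonic() < deadline:
        step_one_request(in_flight)   # make progress on pending requests
    return "drained" if not in_flight else "forced"

def fast_step(queue):
    queue.pop()                        # each request finishes quickly

print(shutdown([1, 2, 3], timeout_s=1.0, step_one_request=fast_step))  # drained

def slow_step(queue):
    time.sleep(0.05)                   # requests never finish in time

print(shutdown([1], timeout_s=0.1, step_one_request=slow_step))        # forced
```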
- PR #35316 - [ROCm][Quantization] add quark w4a8 mxfp4_fp8: merged PR introducing a new low-precision quantization format for AMD ROCm. Impact: makes 4-bit-weight / 8-bit-activation quantized models practical on AMD GPUs (e.g., MI300), lowering the memory bar for large-model deployment and raising throughput; a key step for AMD's quantization ecosystem.
- PR #36968 - [UX] Improve UX of CPU backend: improves the CPU backend on several fronts: CPU instruction-set compatibility tests (using Intel SDE), simpler cross-compilation flags, and library-dependency checks. Impact: lowers the barrier and risk of running vLLM's CPU build on non-standard x86-64 CPUs (or different ISA levels), improving stability and supportability.
- PR #36937 - fix: resolve chat template names before kwargs detection: fixes a bug where passing a template name (e.g., "tool_use") rather than a Jinja template string via `chat_template` caused key arguments such as `tools` to be silently dropped. Impact: a small change, but it repairs a serious tool-calling breakage for models such as Cohere Command-R and preserves API compatibility and usability.
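The bug class fixed here can be reduced to an ordering problem, sketched below. This is a simplification with invented names (the registry, the crude variable scan), not the actual vLLM code: kwargs are kept only if the template references them, so the variable scan must run on the resolved Jinja source, not on a bare template name.

```python
# Sketch: detecting which kwargs a chat template uses must happen AFTER
# resolving a template *name* like "tool_use" into its Jinja source.

NAMED_TEMPLATES = {  # illustrative registry mapping names to Jinja sources
    "tool_use": "{% for t in tools %}{{ t.name }}{% endfor %}",
}

def referenced_vars(template_source: str) -> set:
    # Crude stand-in for real Jinja variable detection.
    return {v for v in ("tools", "documents") if v in template_source}

def filter_kwargs(chat_template: str, kwargs: dict, resolve_names: bool) -> dict:
    source = NAMED_TEMPLATES.get(chat_template, chat_template) if resolve_names \
             else chat_template
    keep = referenced_vars(source)
    return {k: v for k, v in kwargs.items() if k in keep}

kwargs = {"tools": [{"name": "search"}]}
# Scanning the *name* finds no "tools" reference, so tools are dropped:
print(filter_kwargs("tool_use", kwargs, resolve_names=False))  # {}
# Resolving the name first preserves them:
print(filter_kwargs("tool_use", kwargs, resolve_names=True))
```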
📈 Development Activity Observations
- Contribution volume: nearly a hundred PRs (66 opened + 30 merged) were handled within 24 hours, an extremely fast development cadence. Contributors are diverse: engineers from AMD (`-amd` suffix), NVIDIA, and Cohere (walterbm), plus a wide range of independent developers.
- Review and merge efficiency: many PRs (e.g., #36950, #36952, #36959, #36962) were opened and merged the same day, showing a highly efficient review-and-merge pipeline for bug fixes, test improvements, and small features. Larger RFC-backed PRs such as #37002 (observation plugin) need longer discussion due to merge conflicts and complexity.
- Continuous CI/CD optimization: several PRs (e.g., #36945, #37016, #37014) specifically shard and parallelize CI jobs to cut total test time, reflecting how much this fast-moving project values keeping its test pipeline both high-quality and fast.
💡 Issues Worth Watching
- #36994 - block_size validation on ROCm: touches on possibly platform-centric assumptions baked into core code (hardcoded values biased toward the NVIDIA ecosystem) that block compatibility for AMD and other non-CUDA backends. How it is resolved will signal the project's depth of support for heterogeneous computing.
- #37030 - GPT-OSS-120B MXFP4 wrong output on Blackwell (SM121): reports that with MXFP4 quantization on the newest Blackwell GPUs, the first generated token is completely wrong, producing empty output. This may point to compatibility or correctness problems in specific quantization kernels (e.g., Marlin) on new-generation hardware and warrants close attention from NVIDIA and the community.
- RFC #37003 - Context-Aware KV-Cache Retention API and RFC #36998 - Observation Plugin: these two RFCs propose framework-level extensions for KV-cache retention policy and for intercepting/routing model activations, respectively. Both touch core scheduling and execution logic, aiming to support more complex agentic workflows as well as safety and interpretability needs. The discussions may steer vLLM toward a smarter, more observable inference-system architecture.
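The direction of RFC #37003 can be contrasted with plain LRU in a short sketch. All API names here are invented for illustration; the RFC's actual interface may differ. Plain LRU evicts whatever was least recently touched, while a context-aware policy can pin blocks a caller marked as likely to be reused (e.g., a long system prompt shared across an agentic workflow).

```python
# Conceptual sketch: LRU eviction vs. a context-aware policy that skips
# caller-pinned KV blocks.

from collections import OrderedDict

def evict_lru(cache: OrderedDict) -> str:
    key, _ = cache.popitem(last=False)      # oldest entry goes first
    return key

def evict_context_aware(cache: OrderedDict, pinned: set) -> str:
    for key in cache:                       # oldest-first, skipping pinned blocks
        if key not in pinned:
            del cache[key]
            return key
    return evict_lru(cache)                 # everything pinned: fall back to LRU

cache = OrderedDict([("system_prompt", 1), ("turn_1", 2), ("turn_2", 3)])
print(evict_lru(cache.copy()))                                      # system_prompt
print(evict_context_aware(cache.copy(), pinned={"system_prompt"}))  # turn_1
```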
📋 Appendix: Detailed Data
New Issues
- #37032 [Bug]: GLM 4.7-flash performance degrades when native KV cache offloading is on — bug — by yz342 (created: 2026-03-14 10:55 (UTC+8))
- #36973 [Bug] _warmup_prefill_kernels in qwen3_next.py leaks ~3.4 GiB GPU memory despite empty_cache() — no labels — by jhsmith409 (created: 2026-03-13 18:39 (UTC+8))
- #37030 [Bug]: GPT-OSS-120B gpt-oss MXFP4 on SM121 (Blackwell DGX Spark): Marlin kernel generates wrong first Harmony token, producing null content/reasoning — no labels — by ohsono (created: 2026-03-14 10:44 (UTC+8))
- #37023 [CI Failure]: GLM4 moe reasoning parser test failure — ci-failure — by sfeng33 (created: 2026-03-14 08:11 (UTC+8))
- #37022 [CI Failure]: Seedoss reasoning parser test failure — ci-failure — by sfeng33 (created: 2026-03-14 08:00 (UTC+8))
- #36998 [RFC]: Observation Plugin for Intercepting & Routing on Activations — RFC — by DDDDarrenWB (created: 2026-03-14 02:33 (UTC+8))
- #37003 [RFC]: Context-Aware KV-Cache Retention API — RFC — by vMaroon (created: 2026-03-14 03:05 (UTC+8))
- #37001 [Bug]: Crash when using PP in multi-node — bug — by ortegaalfredo (created: 2026-03-14 02:49 (UTC+8))
- #37000 [RFC]: add support for Cohere's /v2/embed HTTP API entry point — RFC — by walterbm (created: 2026-03-14 02:44 (UTC+8))
- #36999 [Bug]: CPU offloading fails with flashinfer autotuner — bug — by wzhao18 (created: 2026-03-14 02:38 (UTC+8))
- #36994 [Bug]: supports_block_size wrongly rejects dynamically computed block sizes (e.g., Qwen3.5 on ROCm) — bug,rocm — by kyuz0 (created: 2026-03-14 01:29 (UTC+8))
- #36969 [Bug]: Kimi-K2.5 output a single </think> tag in the content when "enable_thinking" is set to false — bug — by elepherai (created: 2026-03-13 17:26 (UTC+8))
- #36989 [Bug]: Qwen3-0.6B hangs during `vllm bench throughput` on amd machine — bug,rocm — by mikaylagawarecki (created: 2026-03-14 00:09 (UTC+8))
- #36986 [Bug]: whisper large v2 incorrectly transcribing — bug — by gilljon (created: 2026-03-13 23:04 (UTC+8))
- #36972 [Feature]: fused SiLU + fp8 block quantized kernel in Helion — feature request — by rwtarpit (created: 2026-03-13 18:31 (UTC+8))
- #36960 [Feature]: Add /health/ready endpoint for GPU health verification — no labels — by anencore94 (created: 2026-03-13 15:38 (UTC+8))
- #36954 [Bug]: KeyError 'layers.0.mlp.experts.w2_weight' when using speculative_config method='mtp' with Qwen3.5-122B-GPTQ-Int4 (moe_wna16) — bug — by sunyong01 (created: 2026-03-13 13:55 (UTC+8))
Closed Issues
- #12829 [Feature]: Add support for multi-lora using classification — feature request,stale — by ashwani-bhat (closed: 2026-03-14 10:18 (UTC+8))
- #26881 [Bug]: cli running but debug crashing on flex attention backend — bug,stale — by yukiy927 (closed: 2026-03-14 10:17 (UTC+8))
- #27318 [Bug]: vLLM fails to perform CPU-only-head inference in Kubernetes + Ray cluster environment — bug,ray,stale — by jasonXue653 (closed: 2026-03-14 10:17 (UTC+8))
- #27774 [RFC]: Fault-Tolerant Expert Parallelism — RFC,stale — by tzulingk (closed: 2026-03-14 10:17 (UTC+8))
- #27859 [Bug]: EVS for Qwen 2_5 Errors out when using with video + image in this order — bug,stale — by sthakur2-sc (closed: 2026-03-14 10:17 (UTC+8))
- #28084 [Feature]: Improve MoE pre-amble (router, quantization, etc.) perf for DSv3.1 — feature request,stale — by nv-dlasalle (closed: 2026-03-14 10:16 (UTC+8))
- #28317 [Bug]: Prefix caching leads to different outputs for Hermes-3-Llama-3.1-8B — bug,stale — by codeman38 (closed: 2026-03-14 10:16 (UTC+8))
- #28440 [Bug]: Failed to generate Cutlass kernels: [Errno 13] Permission denied in Docker image — bug,stale — by bbartels (closed: 2026-03-14 10:16 (UTC+8))
- #28538 [RFC]: 4-bit KV cache quantization through Hadamard transforms — RFC,stale — by mratsim (closed: 2026-03-14 10:16 (UTC+8))
- #28547 [Bug]: vLLM server not running on CPU AMD EPYC 9J14, with Python 3.12.12 — bug,stale — by chu22fr (closed: 2026-03-14 10:16 (UTC+8))
- #28612 Qwen3-VL-8B-Instruct-FP8 fails with GLIBC_2.32 not found in FlashAttention (works fine with Qwen3-VL-4B-Instruct) — bug,stale — by Passenger12138 (closed: 2026-03-14 10:16 (UTC+8))
- #28616 [Bug]: Qwen3-14B TP1 PP4 CUDA error an illegal memory access was encountered — bug,stale — by Quakso (closed: 2026-03-14 10:16 (UTC+8))
- #28622 [Bug]: Can we able to benchmark Quantized MOE models Either W8A8 or W8A16 ? — bug,stale — by logesh13 (closed: 2026-03-14 10:16 (UTC+8))
- #28629 [Usage]: TPOT per request information was not collected by vllm bench serve — usage,stale — by jlwang1996 (closed: 2026-03-14 10:16 (UTC+8))
- #28639 [Bug]: TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function aten.index(...): got RuntimeError('unknown table encoding') — bug,stale — by xiuxiuxius (closed: 2026-03-14 10:16 (UTC+8))
- #28646 [Feature][P2]: Implement CI Build Time and Size Guards — feature request,ci/build,stale — by rzabarazesh (closed: 2026-03-14 10:16 (UTC+8))
- #28647 [Feature][P2]: Move Examples/Benchmarks to Test Stage Only — feature request,ci/build,stale — by rzabarazesh (closed: 2026-03-14 10:16 (UTC+8))
- #28653 [Feature][P1]: Investigate and Implement FSx for Persistent Caching — feature request,ci/build,stale — by rzabarazesh (closed: 2026-03-14 10:16 (UTC+8))
- #28655 [Feature][P2]: Investigate Lazy Image Pull with Stargz/Nydus Snapshotters — feature request,ci/build,stale — by rzabarazesh (closed: 2026-03-14 10:16 (UTC+8))
- #35300 [Feature]: Add ISA-level smoke tests using Intel SDE to catch instruction set mismatches — cpu — by wjhrdy (closed: 2026-03-14 09:27 (UTC+8))
- #36907 [Bug]: `resolve_chat_template_kwargs` silently drops chat template kwargs when `chat_template` is passed in as chat template name instead of Jinja — bug — by xiangxl-a (closed: 2026-03-14 08:20 (UTC+8))
- #34561 [Bug]: GLM-4.7-Flash-AWQ fails with AttributeError: 'ColumnParallelLinear' object has no attribute 'weight' — bug — by eugr (closed: 2026-03-14 07:25 (UTC+8))
- #36969 [Bug]: Kimi-K2.5 output a single </think> tag in the content when "enable_thinking" is set to false — bug — by elepherai (closed: 2026-03-14 00:11 (UTC+8))
- #36986 [Bug]: whisper large v2 incorrectly transcribing — bug — by gilljon (closed: 2026-03-13 23:20 (UTC+8))
- #30487 [Bug] [CPU Backend]: Wrong L2 cache size on Arm CPU for Attention tiles — bug,cpu — by Radu2k (closed: 2026-03-13 17:40 (UTC+8))
- #36141 [Bug]: Vllm models crashes when using runai_streamer & --api-server-count > 1 — bug — by Sanches166 (closed: 2026-03-13 15:13 (UTC+8))
- #32186 [Bug]: Unable to serve Qwen3-8B-FP8 with moriio kv connector — bug,rocm — by junkang1991 (closed: 2026-03-13 14:55 (UTC+8))
- #35391 [Bug]: Value error, Model architectures ['Qwen3_5ForConditionalGeneration'] are not supported for now — bug — by Iam-a-Savage-coder (closed: 2026-03-13 14:29 (UTC+8))
- #36942 [Bug]: Race condition in AsyncPoolingOutput.get_output() — pooler_output_cpu accessed before copy_event.synchronize() — bug — by dharaghodasara (closed: 2026-03-13 12:23 (UTC+8))
- #36778 [Bug]: Using vLLM to deploy Minimax m2.5, the thinking/reasoning cannot be disable. — bug — by shuixiaoer (closed: 2026-03-13 11:54 (UTC+8))
New PRs
- #37029 [Hardware][XPU][ROCm] Align memory usage with cuda on xpu/rocm — rocm,nvidia — by jikunshang (created: 2026-03-14 10:18 (UTC+8))
- #37031 [Hardware] Replace memory related torch.cuda APIs — performance,v1,nvidia — by jikunshang (created: 2026-03-14 10:48 (UTC+8))
- #37013 [Spec Decode] Update extract_hidden_states to use deferred kv_connector clear — speculative-decoding,ready,v1,kv-connector — by fynnsu (created: 2026-03-14 05:28 (UTC+8))
- #36976 [Bugfix][LoRA] Fix Qwen35 LoRA — bug,qwen — by jeejeelee (created: 2026-03-13 19:19 (UTC+8))
- #37015 [CI] Split Distributed Tests (4 GPUs) into 3 parallel jobs — ready,ci/build — by khluu (created: 2026-03-14 06:12 (UTC+8))
- #37026 [UX]: Fix unclean shutdown from ctrl-c with AR Fusion — nvidia — by siewcapital (created: 2026-03-14 08:41 (UTC+8))
- #37028 [WIP][Model Runner V2] Support Streaming Inputs — v1 — by santiramos27 (created: 2026-03-14 09:57 (UTC+8))
- #37006 [V1] Remove pin_memory() in async_copy_to_gpu to fix sporadic stalls — ready,v1 — by sbeurnier (created: 2026-03-14 04:27 (UTC+8))
- #37027 [CI] Fix flaky tool_use chat completion tests with deterministic seed — tool-calling — by sfeng33 (created: 2026-03-14 09:07 (UTC+8))
- #36968 [UX] Improve UX of CPU backend — documentation,ready,ci/build,v1,cpu — by bigPYJ1151 (created: 2026-03-13 17:05 (UTC+8))
- #36982 [MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible — v1 — by MatthewBonanni (created: 2026-03-13 22:38 (UTC+8))
- #37025 [CI] Add reasoning parser tests to CI — ci/build — by sfeng33 (created: 2026-03-14 08:34 (UTC+8))
- #37024 [bug] fix hang dpep pause — bug,v1 — by hao-aaron (created: 2026-03-14 08:29 (UTC+8))
- #37004 [Bugfix] Fix DeepSeek-V3.2 tokenizer stripping spaces — bug,ready,deepseek — by MatthewBonanni (created: 2026-03-14 03:39 (UTC+8))
- #37016 [CI] Split V1 Others into 3 separate jobs — ready,ci/build — by khluu (created: 2026-03-14 06:13 (UTC+8))
- #37008 [BugFix] Fix "DP Coordinator receives unexpected..." messages — bug,ready,v1 — by njhill (created: 2026-03-14 04:45 (UTC+8))
- #37021 Enable in-process engine core for AsyncLLM. — v1 — by wang2yn84 (created: 2026-03-14 07:43 (UTC+8))
- #37020 [CI][Bugfix] Fix incorrect status handling with `set -e` in CI shell scripts — bug,ci/build — by gkapetanakis (created: 2026-03-14 07:33 (UTC+8))
- #37018 [BUG] Collective causing deadlock for DPEP MoE — bug,frontend,v1 — by hao-aaron (created: 2026-03-14 06:57 (UTC+8))
- #37014 [CI] Shard Multi-Modal Models (Standard) into 4 parallel jobs — ready,ci/build — by khluu (created: 2026-03-14 05:29 (UTC+8))
- #37019 fix: generalize sharded LoRA slice_lora_a for N subloras — no labels — by hallerite (created: 2026-03-14 07:16 (UTC+8))
- #37012 [CI] Pin helion version — ready,ci/build — by gmagogsfm (created: 2026-03-14 05:26 (UTC+8))
- #37017 [CI][Bugfix] Fix incorrect status handling with set -e in CI shell scripts — bug,ci/build — by gkapetanakis (created: 2026-03-14 06:57 (UTC+8))
- #37011 fix: return HTTP 413 when request exceeds max context length — frontend — by Chase-Xuu (created: 2026-03-14 05:05 (UTC+8))
- #36981 feat(riscv64): add RISC-V 64-bit pure-Python CPU support — structured-output,ci/build,v1,cpu — by gounthar (created: 2026-03-13 22:12 (UTC+8))
- #37007 [responsesAPI] move streaming to parser — frontend — by qandrew (created: 2026-03-14 04:38 (UTC+8))
- #36985 fix: Guard AllReduceFusionPass.del against Python shutdown — no labels — by wavebyrd (created: 2026-03-13 22:50 (UTC+8))
- #37010 [Bugfix] Fix FusedMoE weight loading with padded hidden dimensions — bug — by SandishKumarHN (created: 2026-03-14 05:02 (UTC+8))
- #37009 [ROCm] issue management - request information for bug issues on ROCm — bug,rocm,ci/build — by hongxiayang (created: 2026-03-14 04:50 (UTC+8))
- #36952 [Bugfix][Spec Decode] Avoid double call of Ngram CPU — bug,ready,v1 — by ekagra-ranjan (created: 2026-03-13 13:02 (UTC+8))
- #37005 Add Mistral Guidance — rocm,structured-output,frontend,ci/build,v1,nvidia — by juliendenize (created: 2026-03-14 04:09 (UTC+8))
- #36995 Make skip_special_tokens configurable on the server side — frontend — by linzebing (created: 2026-03-14 01:48 (UTC+8))
- #37002 [Core][Feature] Observation Plugin for Intercepting & Routing on Activations — documentation,needs-rebase,v1 — by DDDDarrenWB (created: 2026-03-14 02:59 (UTC+8))
- #36997 Enable loading of fused expert weights in the Transformers modelling backend — no labels — by hmellor (created: 2026-03-14 02:20 (UTC+8))
- #36996 [CI][BugFix][AMD] Don't set VLLM_ROCM_USE_AITER anymore in test_rocm_aiter_topk since its not necessary — bug,rocm — by rasmith (created: 2026-03-14 01:49 (UTC+8))
- #36967 Allow platform plugins to override gpu_memory_utilization default — frontend,v1 — by aws-navyadhara (created: 2026-03-13 16:41 (UTC+8))
- #36993 [CI][Bugfix][AMD] Ensure weights created when using emulating OCP MXFP4 — bug,rocm — by rasmith (created: 2026-03-14 01:08 (UTC+8))
- #36951 [CI] Add persistent cache mounts for all CI test downloads and media URLs — ready,ci/build,v1,multi-modality,gpt-oss,kv-connector — by AndreasKaratzas (created: 2026-03-13 12:56 (UTC+8))
- #36992 [Bugfix] accept redacted thinking blocks in Anthropic messages — bug,frontend — by bbartels (created: 2026-03-14 01:05 (UTC+8))
- #36971 Mistral common v10 — rocm,ready,ci/build — by juliendenize (created: 2026-03-13 18:28 (UTC+8))
- #36957 [PD][Nixl][WIP] Exploring support for heterogeneous TP in hybrid SSM-FA P/D disaggregation — v1,kv-connector — by ZhanqiuHu (created: 2026-03-13 14:22 (UTC+8))
- #36991 Fix `xgrammar` being locked to a version which has high vulnerabilities — ci/build — by benglewis (created: 2026-03-14 00:30 (UTC+8))
- #36990 [Bugfix] Fix env_with_choices returning raw casing when case_sensitive=False — bug — by aymuos15 (created: 2026-03-14 00:24 (UTC+8))
- #36988 bump compressed-tensors version to 0.14.0.1 — ci/build — by brian-dellabetta (created: 2026-03-14 00:00 (UTC+8))
- #36950 [Tests] Shutdown test `RemoteVLLMServer` cleanly — ready — by njhill (created: 2026-03-13 12:46 (UTC+8))
- #36980 sched/v1: use SRTF tiebreaker for preemption victim selection — documentation,new-model,rocm,frontend,ci/build,v1,multi-modality,qwen,kv-connector,nvidia — by whycoming (created: 2026-03-13 21:50 (UTC+8))
- #36984 [Test] Add unittests for CompilerManager — no labels — by SoluMilken (created: 2026-03-13 22:42 (UTC+8))
- #36979 [refactor] Refactor SpeculativeConfig for speculative method extensibility — no labels — by TQCB (created: 2026-03-13 20:59 (UTC+8))
- #36987 [FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell — v1,qwen,nvidia — by vadiklyutiy (created: 2026-03-13 23:34 (UTC+8))
- #36974 Fix issue 36969 — frontend — by xueliangyang-oeuler (created: 2026-03-13 19:01 (UTC+8))
- #36983 [Bugfix] Fix delta_text and delta_token_ids desync with stop strings — bug,v1 — by aymuos15 (created: 2026-03-13 22:41 (UTC+8))
- #36965 [Model][Quantization] Add GGUF support for MiniMax-M2.1 — no labels — by JoursBleu (created: 2026-03-13 16:28 (UTC+8))
- #36959 [Bugfix] fix paddleocr crash on some image shape — bug,ready — by MoyanZitto (created: 2026-03-13 15:24 (UTC+8))
- #36966 [Frontend] Add srt and vtt response formats for audio transcription — frontend — by majiayu000 (created: 2026-03-13 16:32 (UTC+8))
- #36978 Revert "[Bugfix] ep_scatter kernel store-load race condition" (#34991) — bug — by zhewenl (created: 2026-03-13 20:03 (UTC+8))
- #36977 chore(pre-commit): make bash hooks runnable on Windows — no labels — by xueliangyang-oeuler (created: 2026-03-13 19:21 (UTC+8))
- #36975 [Bugfix] Avoid KeyError when loading Qwen3.5 MoE expert weights — bug,qwen — by xueliangyang-oeuler (created: 2026-03-13 19:16 (UTC+8))
- #36970 [Bugfix] Avoid KeyError when loading Qwen3.5 MoE expert weights — bug,qwen — by xueliangyang-oeuler (created: 2026-03-13 17:37 (UTC+8))
- #36962 [XPU] Support LoRA via torch.compile on XPU platform — ready — by chaojun-zhang (created: 2026-03-13 15:53 (UTC+8))
- #36963 [Bugfix][Model] Fix PixtralForConditionalGeneration LoRA — bug — by jeejeelee (created: 2026-03-13 15:54 (UTC+8))
- #36964 [Feature] Abort in-flight requests and drain outputs on shutdown — v1 — by wojciech-wais (created: 2026-03-13 15:57 (UTC+8))
- #36961 [Feature] Add /health/ready endpoint for GPU health verification — frontend,v1 — by anencore94 (created: 2026-03-13 15:44 (UTC+8))
- #36953 [Bugfix] Resolve chat template names before kwargs detection — bug — by he-yufeng (created: 2026-03-13 13:15 (UTC+8))
- #36958 [Bugfix] Fix NIXL MLA notification request ID mismatch causing prefill KV cache leak — bug,v1,kv-connector — by wz1qqx (created: 2026-03-13 14:43 (UTC+8))
- #36956 [Bugfix] Fix unclean shutdown from ctrl-c with AR Fusion — bug,nvidia — by siewcapital (created: 2026-03-13 14:08 (UTC+8))
- #36955 [Bugfix] Fix unclean shutdown crash with AllReduce Fusion workspace — bug,nvidia — by siewcapital (created: 2026-03-13 14:06 (UTC+8))
Merged PRs
- #37006 [V1] Remove pin_memory() in async_copy_to_gpu to fix sporadic stalls — ready,v1 — by sbeurnier (merged: 2026-03-14 09:37 (UTC+8))
- #36968 [UX] Improve UX of CPU backend — documentation,ready,ci/build,v1,cpu — by bigPYJ1151 (merged: 2026-03-14 09:27 (UTC+8))
- #36516 [responsesAPI] prioritize content over summary in reasoning item input — frontend,ready — by qandrew (merged: 2026-03-14 09:20 (UTC+8))
- #36937 fix: resolve chat template names before kwargs detection — ready — by giulio-leone (merged: 2026-03-14 08:20 (UTC+8))
- #37004 [Bugfix] Fix DeepSeek-V3.2 tokenizer stripping spaces — bug,ready,deepseek — by MatthewBonanni (merged: 2026-03-14 06:55 (UTC+8))
- #37008 [BugFix] Fix "DP Coordinator receives unexpected..." messages — bug,ready,v1 — by njhill (merged: 2026-03-14 07:18 (UTC+8))
- #36931 [Feat][Bugfix] Enable additional dimension for Flashinfer MLA and fix routing dtype — bug,ready,v1,deepseek,nvidia — by dbari (merged: 2026-03-14 07:41 (UTC+8))
- #34695 [Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models — bug,ready — by haosdent (merged: 2026-03-14 07:25 (UTC+8))
- #36063 [Refactor] Consolidate SupportsEagle — ready,v1,llama,qwen,gpt-oss,kv-connector — by benchislett (merged: 2026-03-14 07:22 (UTC+8))
- #36945 [CI] Split V1 e2e + engine (1 GPU) into parallel jobs — ready,ci/build,v1 — by khluu (merged: 2026-03-14 05:16 (UTC+8))
- #31545 Use Transformers v5 `WeightRenaming` for Transformers modeling backend — ready,multi-modality — by hmellor (merged: 2026-03-14 04:49 (UTC+8))
- #36952 [Bugfix][Spec Decode] Avoid double call of Ngram CPU — bug,ready,v1 — by ekagra-ranjan (merged: 2026-03-14 04:33 (UTC+8))
- #35316 [ROCm][Quantization] add quark w4a8 mxfp4_fp8 for LinearLayer — rocm,ready,gpt-oss — by divakar-amd (merged: 2026-03-14 03:44 (UTC+8))
- #36666 [Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish — frontend,ready,v1 — by markmc (merged: 2026-03-14 03:10 (UTC+8))
- #36862 [Misc] Set default `kv_buffer_device` in a better way — ready,kv-connector — by hmellor (merged: 2026-03-14 03:07 (UTC+8))
- #35242 Fp8 lora dense kernel — ready — by yugong333 (merged: 2026-03-14 03:05 (UTC+8))
- #36800 [Bugfix] Fix Qwen2.5-omni/Qwen3-omni mm_processor cache for audio_in_video request — bug,ready,multi-modality,qwen — by Isotr0py (merged: 2026-03-14 02:16 (UTC+8))
- #36950 [Tests] Shutdown test `RemoteVLLMServer` cleanly — ready — by njhill (merged: 2026-03-13 15:32 (UTC+8))
- #36181 [ROCm][CI] Upgrading orchestrator to handle python pipeline markers and options — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-13 17:04 (UTC+8))
- #36711 [ROCm][CI] Corrected the GPT-OSS test root path — rocm,ready,ci/build,gpt-oss — by AndreasKaratzas (merged: 2026-03-13 15:53 (UTC+8))
- #36959 [Bugfix] fix paddleocr crash on some image shape — bug,ready — by MoyanZitto (merged: 2026-03-13 21:46 (UTC+8))
- #35627 [2/N] Elastic EP Milestone 2: Integrating NIXL-EP — ready,v1,kv-connector,nvidia — by itayalroy (merged: 2026-03-13 21:25 (UTC+8))
- #36962 [XPU] Support LoRA via torch.compile on XPU platform — ready — by chaojun-zhang (merged: 2026-03-13 18:35 (UTC+8))
- #36610 [kv_offload+HMA][1/N]: Support multiple KV groups in OffloadingSpec — ready,v1,kv-connector — by orozery (merged: 2026-03-13 16:04 (UTC+8))
- #36483 [Frontend] Delegate preprocessing to `OpenAIServingRender` — frontend,ready,v1 — by sagearc (merged: 2026-03-13 15:39 (UTC+8))
- #35786 Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) — rocm,ready,v1 — by Rohan138 (merged: 2026-03-13 15:33 (UTC+8))
- #36876 [Bugfix] Fix FlashInfer GDN warmup ValueError on SM90 GPUs — bug,ready,qwen — by tdoublep (merged: 2026-03-13 14:09 (UTC+8))
- #36379 [Frontend] Fix usage incorrectly returned with empty `stream_options` — frontend,ready,v1 — by Csrayz (merged: 2026-03-13 11:33 (UTC+8))
- #36684 fix(kv-cache): increase hybrid attention grouping threshold from 1.25 to 1.5 — ready,v1,ready-run-all-tests — by hai-meh-cs (merged: 2026-03-13 11:28 (UTC+8))
- #36385 [Model] Add support for BERT-like Chinese ERNIE pooling models — documentation,new-model,ready — by whyiug (merged: 2026-03-13 11:23 (UTC+8))
PRs Closed Without Merging
- #26650 fix(benchmark): Prevent OOV error for small tokenizers in latency test — performance,stale — by ihb2032 (closed: 2026-03-14 10:17 (UTC+8))
- #26765 [Bugfix][Core] When preemption occurs, the requests in the running set are not fully scheduled. — stale,v1 — by CLFutureX (closed: 2026-03-14 10:17 (UTC+8))
- #27585 [Build] Optimize docker layers sizes — ready,ci/build,stale — by rzabarazesh (closed: 2026-03-14 10:17 (UTC+8))
- #28057 [wip] Fix torch nightly — ready,needs-rebase,ci/build,stale — by rzabarazesh (closed: 2026-03-14 10:17 (UTC+8))
- #28258 [GPUModelRunner] initialize_kv_cache cleanup (1/N): move initialization that doesn't depend on kv cache config to load_model — needs-rebase,stale,v1 — by heheda12345 (closed: 2026-03-14 10:16 (UTC+8))
- #37017 [CI][Bugfix] Fix incorrect status handling with set -e in CI shell scripts — bug,ci/build — by gkapetanakis (closed: 2026-03-14 07:02 (UTC+8))
- #36985 fix: Guard AllReduceFusionPass.del against Python shutdown — no labels — by wavebyrd (closed: 2026-03-14 05:39 (UTC+8))
- #34363 fix: return HTTP 413 when request exceeds max context length — frontend,needs-rebase — by Chase-Xuu (closed: 2026-03-14 05:04 (UTC+8))
- #37005 Add Mistral Guidance — rocm,structured-output,frontend,ci/build,v1,nvidia — by juliendenize (closed: 2026-03-14 04:15 (UTC+8))
- #36995 Make skip_special_tokens configurable on the server side — frontend — by linzebing (closed: 2026-03-14 03:41 (UTC+8))
- #32781 [WIP] Add Mistral guidance — structured-output,frontend,v1 — by juliendenize (closed: 2026-03-14 02:30 (UTC+8))
- #29505 [Misc][Refactor] Decouple quant methods from FusedMoE — needs-rebase — by bnellnm (closed: 2026-03-14 00:55 (UTC+8))
- #25200 [Kernels] Move shared/fused expert sum and final reduce into FusedMoE layer — llama,qwen,deepseek,gpt-oss — by bnellnm (closed: 2026-03-14 00:54 (UTC+8))
- #36941 [Bugfix][Core]: Fix silent output_handler death on CancelledError — bug,documentation,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality,llama — by eugenepaniot (closed: 2026-03-13 21:49 (UTC+8))
- #36464 [Examples][2/n] Resettle generate examples. — documentation,structured-output,needs-rebase,ci/build,qwen — by noooop (closed: 2026-03-13 17:41 (UTC+8))
- #36936 fix: Resolve template name to Jinja content in resolve_chat_template to prevent dropped kwargs — no labels — by siddharthmohan619-eng (closed: 2026-03-13 15:49 (UTC+8))
- #34964 Fix LoRA adapter silently failing on Pixtral/Ministral-3 models — no labels — by timon0305 (closed: 2026-03-13 15:56 (UTC+8))
- #36953 [Bugfix] Resolve chat template names before kwargs detection — bug — by he-yufeng (closed: 2026-03-13 15:48 (UTC+8))
- #36444 [Model][Quantization] Add GGUF support for MiniMax-M2.1 — performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by JoursBleu (closed: 2026-03-13 14:55 (UTC+8))
- #36925 [Bugfix] signature match for passing `spec_step_idx` in qwen3-next and qwen3.5 — bug,qwen — by JGSweets (closed: 2026-03-13 11:26 (UTC+8))