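vLLM Development Activity Report - 2026-03-13

Time window: 2026-03-13 11:22 (UTC+8) ~ 2026-03-14 11:22 (UTC+8)
Statistics: 17 new issues | 30 closed issues | 66 new PRs | 30 merged PRs | 20 PRs closed without merging

📊 Daily Development Summary

During this observation window the vLLM community remained highly active, opening 66 new PRs and merging 30. Development focused on memory and cache management (KV cache leaks, warmup memory footprint), support for new model architectures and quantization schemes (GLM 4.7, MiniMax-M2.1 GGUF, AMD Quark quantization), and system reliability and operability (health check endpoints, graceful shutdown). Compatibility and performance work for the AMD ROCm ecosystem was another major theme.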

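🎯 AMD/ROCm Ecosystem Updates

AMD/ROCm activity stayed high during this window, covering core bug fixes, performance optimizations, and new feature support.

1. New issues (key problems)

#36994 - [Bug]: supports_block_size wrongly rejects dynamically computed block sizes (e.g., Qwen3.5 on ROCm): A critical bug. On ROCm, a hardcoded block_size whitelist check in attention/backend.py prevents models such as Qwen3.5, which use a hybrid architecture with dynamically computed block sizes (e.g., HybridAttentionMambaModelConfig), from running. ROCm relies on the Triton backend, and the whitelist rejects the block size before the kernel's own validity check can run. Impact: directly blocks specific models on AMD platforms. Status: open; a comment suggests it may already be fixed by PR #36274.
#36989 - [Bug]: Qwen3-0.6B hangs during vllm bench throughput on amd machine: Reports a hang when running the throughput benchmark for Qwen3-0.6B on an AMD machine. Impact: affects benchmarking and model compatibility validation on AMD platforms. Status: open.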

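2. New PRs (features and fixes)

#37009 - [ROCm] issue management - request information for bug issues on ROCm: Submitted by AMD employee hongxiayang. The PR streamlines the handling of ROCm-related issues by using an automated script to request key information (reproduction steps, GPU architecture, installation method) whenever a new bug report is filed, which should speed up triage.
#37029 - [Hardware][XPU][ROCm] Align memory usage with cuda on xpu/rocm: Proposes adding an empty_cache() call to the memory profiling code on ROCm (and XPU) so that memory profiling behaves the same as on CUDA and metrics are comparable across platforms.
#36993 - [CI][Bugfix][AMD] Ensure weights created when using emulating OCP MXFP4: Fixes a test failure on AMD MI300 when OCP MXFP4 quantization runs in emulation mode. Root causes: 1) the cuda_graph_capture_sizes set in the test conflicted with the value set inside VllmRunner; 2) weights were not created correctly in emulation mode, which crashed later processing.
#36996 - [CI][BugFix][AMD] Don't set VLLM_ROCM_USE_AITER anymore…: Removes an unnecessary environment variable setting from the test test_rocm_aiter_topk.py so that it no longer pollutes other tests running in the same process.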
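
The kind of cross-test pollution described for #36996 can be illustrated with a small sketch. It is not taken from the vLLM test suite: the function names are hypothetical, and only the variable name VLLM_ROCM_USE_AITER comes from the PR title. It contrasts a test that mutates os.environ permanently with one that scopes the change using the standard library's unittest.mock.patch.dict.

```python
import os
from unittest import mock

ENV_NAME = "VLLM_ROCM_USE_AITER"  # variable named in the PR title


def leaky_test() -> None:
    # Anti-pattern: mutates the process environment and never restores it,
    # so every later test in the same process sees the flag.
    os.environ[ENV_NAME] = "1"


def scoped_test() -> str:
    # patch.dict restores os.environ to its previous state on exit.
    with mock.patch.dict(os.environ, {ENV_NAME: "1"}):
        return os.environ[ENV_NAME]


def later_test_sees_flag():
    # Stand-in for an unrelated test that runs afterwards in the same process.
    return os.environ.get(ENV_NAME)
```

The PR itself takes the simpler route of not setting the variable at all; scoping is only needed when a test genuinely depends on the flag.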

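3. Merged PRs (landed progress)

#35316 - [ROCm][Quantization] add quark w4a8 mxfp4_fp8 for LinearLayer: Submitted by AMD employee divakar-amd and merged. Adds ROCm support for the w4a8 mxfp4_fp8 scheme (MXFP4 weights, FP8 activations) from the Quark quantization toolchain, verified end to end with the GPT-OSS-120B model. Significance: a meaningful expansion of the efficient quantization options available on AMD hardware.
#35786 - Enable RoPE+KV cache fusion for ROCm AITER FA: Enables fused RoPE and KV cache updates for the ROCm AITER FlashAttention backend (non-shuffle layout), aimed at improving inference performance.

Summary: In this window the AMD team worked through several platform compatibility bugs (dynamic block_size, test configuration) and also pushed infrastructure improvements (issue management, aligned memory profiling) and quantization support (the new Quark format), showing sustained investment in the stability and feature completeness of the ROCm backend.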

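💬 High-Traffic Discussion Analysis

Closed Issue #32186 - [Bug]: Unable to serve Qwen3-8B-FP8 with moriio kv connector: This issue concerns RDMA errors encountered when using the MoRIIO KV connector on AMD ROCm.
- Core topic: In a container environment, the correct Broadcom user-space RDMA driver library was missing, so no RDMA device could be discovered and the connection failed.
- Positions and progress:
  - User junkang1991 provided detailed error logs and the output of the ibv_devices command.
  - AMD contributors inkcherry and jhchouuu diagnosed the problem step by step, identified that the matching OOB (Out-of-Box) driver library was not installed inside the container, and supplied a detailed install script and fix (mount the host driver library, remove the conflicting built-in version).
  - After applying the fix the user resolved the connection problem, then asked about a new "Cannot allocate memory" warning; the contributors followed up with suggestions for adjusting the backend configuration.
- Point of contention: None of substance; this was a typical support and collaborative debugging thread.
- Outcome: The root cause was the RDMA configuration of the containerized environment. The community provided a clear fix and the issue was closed.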
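
Open Issue #36973 - [Bug] _warmup_prefill_kernels in qwen3_next.py leaks ~3.4 GiB GPU memory: This issue started a discussion about balancing performance optimization against resource management.
- Core topic: The _warmup_prefill_kernels() function, added to warm up the Triton GDN kernels, permanently holds about 3.4 GiB of GPU memory after it runs, even when empty_cache() is called, which severely squeezes the space left for the KV cache.
- Positions:
  - Reporter (jhsmith409): Provided detailed comparison data showing that the optimization cuts available KV cache by 66%, called this an unacceptable resource leak, and suspected that compiled Triton kernel binaries remain resident.
  - Maintainer (ZJY0516): First asked whether the memory is held only on the first run (with no Triton cache). After the reporter replied that it happens on every container start, the maintainer was puzzled, because in theory Triton should keep only the best kernel after autotuning.
- Point of contention: How general the problem is (whether it depends on a particular environment) and its root cause (expected Triton behavior or a bug).
- Current status: Open. The maintainer has said they will investigate. The thread shows that kernel warmup for peak performance has to be weighed carefully against scarce resources such as GPU memory.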
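
Open Issue #36994 - [Bug]: supports_block_size wrongly rejects dynamically computed block sizes: Few comments, but a technically deep discussion.
- Core topic: Whether a hardcoded block_size whitelist is reasonable, and the compatibility problems caused by platform differences (NVIDIA FlashAttention vs. AMD Triton).
- Positions:
  - Reporter (kyuz0): States that the hardcoded check is a design error and that the code should trust the kernel's own get_supported_kernel_block_sizes() method. The check particularly affects hybrid-architecture models on AMD platforms.
  - Maintainer (jennyyyyzhen): Responded quickly, noting that the problem may already be fixed by another PR (#36274) and asking the reporter to verify against the latest code.
- Current status: Awaiting verification. The thread shows how upstream code logic affects heterogeneous compute platforms, and that maintainers track how new and existing PRs relate to each other.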
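
The difference between the two validation strategies debated in #36994 can be sketched as follows. This is an illustrative toy, not vLLM's code: ToyKernel, the whitelist values, the divisibility rule, and the block size 528 are all assumptions chosen only to show why a fixed whitelist rejects a size that the kernel itself would accept.

```python
from dataclasses import dataclass

# Illustrative whitelist; the real values in vLLM may differ.
HARDCODED_BLOCK_SIZES = (16, 32, 64)


@dataclass
class ToyKernel:
    """Hypothetical kernel that accepts any positive multiple of `multiple_of`."""

    multiple_of: int = 16

    def supports(self, block_size: int) -> bool:
        return block_size > 0 and block_size % self.multiple_of == 0


def supports_block_size_whitelist(block_size: int) -> bool:
    # Strategy criticized in the issue: a fixed list decides up front.
    return block_size in HARDCODED_BLOCK_SIZES


def supports_block_size_delegated(kernel: ToyKernel, block_size: int) -> bool:
    # Strategy proposed by the reporter: defer to the kernel's own check.
    return kernel.supports(block_size)
```

With a dynamically computed size such as 528, the whitelist version returns False while the delegated version returns True, which mirrors the failure mode the reporter describes.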

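🔥 Hot Topics and Trends

Ongoing memory and cache management work: Several threads revolve around memory use. Besides the warmup memory leak in #36973, #36958 fixes a KV cache leak in NIXL MLA mode, RFC #37003 proposes a smarter, context-aware KV cache retention policy to replace plain LRU, and #37029 seeks consistent memory profiling across platforms. Efficient memory management remains a core challenge for high-concurrency, long-context inference.
A model support "arms race": The community keeps integrating new models quickly. This window added support for the MiniMax-M2.1 GGUF format (#36965) and ERNIE pooling models (#36385), and fixed loading or inference problems for GLM-4.7-Flash AWQ (#34695) and Qwen3.5 MoE (#36954, #36975), among others. The RFC and PR for the Cohere /v2/embed API (#37000) and Mistral Guidance (#37005) also show an effort to connect with more ecosystem interfaces.
Stronger inference reliability and operations tooling: Production needs are driving new features. #36961 proposes a /health/ready endpoint that verifies actual GPU inference capability, going beyond the existing health check that only confirms the process is alive. #36666 re-adds a graceful shutdown timeout that lets in-flight requests finish. #36990 fixes env_with_choices returning the raw casing of an environment variable value after validation, which caused PyTorch API calls to fail.
Deeper AMD work and remaining challenges: As covered in the AMD section, AMD activity now goes beyond bug fixes into quantization support (Quark), performance optimization (fused kernels), test framework improvements, and developer experience (issue management templates). It also exposed compatibility challenges for specific model architectures (hybrid SSM-Attention) and distributed communication (RDMA configuration) that need continued work.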
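
The graceful shutdown behavior mentioned for #36666 (wait a bounded time for in-flight requests, then stop) can be sketched with asyncio. This is a minimal illustration under stated assumptions, not the PR's implementation: drain_in_flight and fake_request are hypothetical names, and wiring the drain to a SIGTERM handler is omitted.

```python
import asyncio


async def drain_in_flight(tasks, timeout_s: float):
    """Wait up to timeout_s for in-flight tasks, then cancel the stragglers.

    Returns a (finished, cancelled) pair of counts.
    """
    if not tasks:
        return 0, 0
    done, pending = await asyncio.wait(tasks, timeout=timeout_s)
    for task in pending:
        task.cancel()
    if pending:
        # Let the cancellations propagate before reporting.
        await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def fake_request(duration_s: float) -> str:
    # Stand-in for a request that is still being processed.
    await asyncio.sleep(duration_s)
    return "ok"


async def demo():
    tasks = [
        asyncio.create_task(fake_request(0.01)),  # finishes within the timeout
        asyncio.create_task(fake_request(5.0)),   # outlives the timeout
    ]
    return await drain_in_flight(tasks, timeout_s=0.2)
```

Calling asyncio.run(demo()) lets the short request complete and cancels the long one once the timeout expires. A timeout of zero would approximate the immediate-termination behavior that the optional wait replaces.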

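🛠️ Key Technical Changes

PR #36666 - Re-add shutdown timeout: Re-adds the graceful shutdown timeout that was previously reverted because of test failures. After receiving SIGTERM, the server can optionally wait for a period so that in-flight requests complete instead of terminating immediately. Impact: makes vLLM deployments in production more controllable and better behaved on shutdown. The merge depended on another PR (#36950) that fixed a test environment problem.
PR #35316 - [ROCm][Quantization] add quark w4a8 mxfp4_fp8: This merged PR adds support for a new low-precision quantization format on AMD ROCm. Impact: makes it possible to run models with 4-bit weights and 8-bit activations on AMD GPUs (such as MI300), which helps lower the memory requirements of large-model deployment and raise throughput. It is a key step for quantization on AMD.
PR #36968 - [UX] Improve UX of CPU backend: Improves the CPU backend experience in several ways: CPU instruction set compatibility tests (using Intel SDE), simpler cross-compilation arguments, and library dependency checks. Impact: lowers the barrier and risk of running the vLLM CPU build on non-standard x86-64 CPUs (or different ISA levels), and improves stability and supportability.
PR #36937 - fix: resolve chat template names before kwargs detection: Fixes a bug where passing a template name (such as "tool_use") in the chat_template parameter, instead of a Jinja template string, caused key parameters such as tools to be dropped. Impact: a small change that fixes a serious problem with tool calling for models such as Cohere Command-R and preserves API compatibility.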
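
The ordering problem behind #36937 can be shown with a simplified sketch. It is not vLLM's implementation: NAMED_TEMPLATES, the template text, and the helper names are hypothetical, and a regex identifier scan stands in for real Jinja variable analysis. The point is only that kwargs detection must inspect the resolved template source, not the template name.

```python
import re

# Hypothetical registry mapping a template name to its Jinja source.
NAMED_TEMPLATES = {
    "tool_use": "{% for t in tools %}{{ t.name }}{% endfor %}{{ messages }}",
}


def resolve_template(chat_template: str) -> str:
    # A known name resolves to its source; anything else is treated as source.
    return NAMED_TEMPLATES.get(chat_template, chat_template)


def template_variables(template_source: str) -> set[str]:
    # Crude stand-in for Jinja analysis: collect every identifier-like token.
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", template_source))


def filter_kwargs(chat_template: str, kwargs: dict, resolve_first: bool) -> dict:
    source = resolve_template(chat_template) if resolve_first else chat_template
    used = template_variables(source)
    # Keep only the kwargs that the inspected text appears to reference.
    return {key: value for key, value in kwargs.items() if key in used}
```

With resolve_first=False the scan sees only the literal string "tool_use", so a tools argument is silently dropped; with resolve_first=True the scan sees the template body and keeps it.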

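📈 Development Activity Observations

Contributor activity: Close to a hundred PRs were handled within 24 hours (66 opened plus 30 merged), showing a very fast development pace. Contributors come from varied backgrounds, including engineers from AMD (-amd suffix), NVIDIA, and Cohere (walterbm), as well as many independent developers.
Review and merge efficiency: Many PRs (such as #36950, #36952, #36959, #36962) were submitted and merged on the same day, indicating that the core team reviews and merges bug fixes, test improvements, and small features very efficiently. Larger PRs that accompany an RFC, such as #37002 (Observation Plugin), need longer discussion because of merge conflicts and complexity.
Ongoing CI/CD optimization: Several PRs target CI test time specifically (such as #36945, #37016, #37014), using sharding and parallel jobs to shorten total test duration. This reflects how much the project values keeping a high-quality test pipeline efficient while it grows quickly.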
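
The sharding idea behind these CI PRs can be sketched in a few lines. This is a generic illustration, not the project's CI configuration: the shard function and the file names are hypothetical, and real pipelines often balance shards by measured runtime rather than by position.

```python
def shard(items, num_shards: int, shard_index: int):
    """Return the slice of `items` that one of `num_shards` parallel jobs runs.

    Sorting first makes the assignment deterministic across jobs, and the
    strided slice guarantees every item lands in exactly one shard.
    """
    if num_shards <= 0 or not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    return sorted(items)[shard_index::num_shards]


# Hypothetical test files split across three parallel jobs.
TEST_FILES = ["test_e.py", "test_a.py", "test_c.py", "test_b.py", "test_d.py"]
SHARDS = [shard(TEST_FILES, 3, i) for i in range(3)]
```

Each job runs only its own slice, so wall-clock time approaches the duration of the slowest shard instead of the sum of all tests.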

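💡 Issues Worth Watching

#36994 - block_size validation on ROCm: This issue touches a possible platform-centric assumption in core code (hardcoded values that favor the NVIDIA ecosystem), which hinders compatibility for AMD and other non-CUDA backends. How it is resolved will show how deeply the project supports heterogeneous compute.
#37030 - GPT-OSS-120B MXFP4 produces wrong output on Blackwell (SM121): Reports that with MXFP4 quantization on the latest Blackwell GPU, the model generates a completely wrong first token, which leaves the output empty. This may point to a compatibility or correctness problem in a specific quantization kernel (such as Marlin) on new hardware and needs close attention from NVIDIA and the community.
RFC #37003 - Context-Aware KV-Cache Retention API and RFC #36998 - Observation Plugin: These two RFCs propose framework-level extensions for KV cache management policy and for observing and intercepting model activations. Both involve changes to core scheduling and execution logic, and aim to support more complex agentic workflows as well as safety and interpretability needs. The discussions may steer vLLM toward a smarter, more observable inference architecture.

📋 Appendix: Detailed Data Lists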

New issues

#37032 [Bug]: GLM 4.7-flash performance degrades when native KV cache offloading is on | bug | by yz342 (created: 2026-03-14 10:55 (UTC+8))
#36973 [Bug] _warmup_prefill_kernels in qwen3_next.py leaks ~3.4 GiB GPU memory despite empty_cache() | no labels | by jhsmith409 (created: 2026-03-13 18:39 (UTC+8))
#37030 [Bug]: GPT-OSS-120B gpt-oss MXFP4 on SM121 (Blackwell DGX Spark): Marlin kernel generates wrong first Harmony token, producing null content/reasoning | no labels | by ohsono (created: 2026-03-14 10:44 (UTC+8))
#37023 [CI Failure]: GLM4 moe reasoning parser test failure | ci-failure | by sfeng33 (created: 2026-03-14 08:11 (UTC+8))
#37022 [CI Failure]: Seedoss reasoning parser test failure | ci-failure | by sfeng33 (created: 2026-03-14 08:00 (UTC+8))
#36998 [RFC]: Observation Plugin for Intercepting & Routing on Activations | RFC | by DDDDarrenWB (created: 2026-03-14 02:33 (UTC+8))
#37003 [RFC]: Context-Aware KV-Cache Retention API | RFC | by vMaroon (created: 2026-03-14 03:05 (UTC+8))
#37001 [Bug]: Crash when using PP in multi-node | bug | by ortegaalfredo (created: 2026-03-14 02:49 (UTC+8))
#37000 [RFC]: add support for Cohere's /v2/embed HTTP API entry point | RFC | by walterbm (created: 2026-03-14 02:44 (UTC+8))
#36999 [Bug]: CPU offloading fails with flashinfer autotuner | bug | by wzhao18 (created: 2026-03-14 02:38 (UTC+8))
#36994 [Bug]: supports_block_size wrongly rejects dynamically computed block sizes (e.g., Qwen3.5 on ROCm) | bug,rocm | by kyuz0 (created: 2026-03-14 01:29 (UTC+8))
#36969 [Bug]: Kimi-K2.5 output a single </think> tag in the content when "enable_thinking" is set to false | bug | by elepherai (created: 2026-03-13 17:26 (UTC+8))
#36989 [Bug]: Qwen3-0.6B hangs during vllm bench throughput on amd machine | bug,rocm | by mikaylagawarecki (created: 2026-03-14 00:09 (UTC+8))
#36986 [Bug]: whisper large v2 incorrectly transcribing | bug | by gilljon (created: 2026-03-13 23:04 (UTC+8))
#36972 [Feature]: fused SiLU + fp8 block quantized kernel in Helion | feature request | by rwtarpit (created: 2026-03-13 18:31 (UTC+8))
#36960 [Feature]: Add /health/ready endpoint for GPU health verification | no labels | by anencore94 (created: 2026-03-13 15:38 (UTC+8))
#36954 [Bug]: KeyError 'layers.0.mlp.experts.w2_weight' when using speculative_config method='mtp' with Qwen3.5-122B-GPTQ-Int4 (moe_wna16) | bug | by sunyong01 (created: 2026-03-13 13:55 (UTC+8))

Closed issues

#12829 [Feature]: Add support for multi-lora using classification | feature request,stale | by ashwani-bhat (closed: 2026-03-14 10:18 (UTC+8))
#26881 [Bug]: cli running but debug crashing on flex attention backend | bug,stale | by yukiy927 (closed: 2026-03-14 10:17 (UTC+8))
#27318 [Bug]: vLLM fails to perform CPU-only-head inference in Kubernetes + Ray cluster environment | bug,ray,stale | by jasonXue653 (closed: 2026-03-14 10:17 (UTC+8))
#27774 [RFC]: Fault-Tolerant Expert Parallelism | RFC,stale | by tzulingk (closed: 2026-03-14 10:17 (UTC+8))
#27859 [Bug]: EVS for Qwen 2_5 Errors out when using with video + image in this order | bug,stale | by sthakur2-sc (closed: 2026-03-14 10:17 (UTC+8))
#28084 [Feature]: Improve MoE pre-amble (router, quantization, etc.) perf for DSv3.1 | feature request,stale | by nv-dlasalle (closed: 2026-03-14 10:16 (UTC+8))
#28317 [Bug]: Prefix caching leads to different outputs for Hermes-3-Llama-3.1-8B | bug,stale | by codeman38 (closed: 2026-03-14 10:16 (UTC+8))
#28440 [Bug]: Failed to generate Cutlass kernels: [Errno 13] Permission denied in Docker image | bug,stale | by bbartels (closed: 2026-03-14 10:16 (UTC+8))
#28538 [RFC]: 4-bit KV cache quantization through Hadamard transforms | RFC,stale | by mratsim (closed: 2026-03-14 10:16 (UTC+8))
#28547 [Bug]: vLLM server not running on CPU AMD EPYC 9J14, with Python 3.12.12 | bug,stale | by chu22fr (closed: 2026-03-14 10:16 (UTC+8))
#28612 Qwen3-VL-8B-Instruct-FP8 fails with GLIBC_2.32 not found in FlashAttention (works fine with Qwen3-VL-4B-Instruct) | bug,stale | by Passenger12138 (closed: 2026-03-14 10:16 (UTC+8))
#28616 [Bug]: Qwen3-14B TP1 PP4 CUDA error an illegal memory access was encountered | bug,stale | by Quakso (closed: 2026-03-14 10:16 (UTC+8))
#28622 [Bug]: Can we able to benchmark Quantized MOE models Either W8A8 or W8A16 ? | bug,stale | by logesh13 (closed: 2026-03-14 10:16 (UTC+8))
#28629 [Usage]: TPOT per request information was not collected by vllm bench serve | usage,stale | by jlwang1996 (closed: 2026-03-14 10:16 (UTC+8))
#28639 [Bug]: TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function aten.index(…): got RuntimeError('unknown table encoding') | bug,stale | by xiuxiuxius (closed: 2026-03-14 10:16 (UTC+8))
#28646 [Feature][P2]: Implement CI Build Time and Size Guards | feature request,ci/build,stale | by rzabarazesh (closed: 2026-03-14 10:16 (UTC+8))
#28647 [Feature][P2]: Move Examples/Benchmarks to Test Stage Only | feature request,ci/build,stale | by rzabarazesh (closed: 2026-03-14 10:16 (UTC+8))
#28653 [Feature][P1]: Investigate and Implement FSx for Persistent Caching | feature request,ci/build,stale | by rzabarazesh (closed: 2026-03-14 10:16 (UTC+8))
#28655 [Feature][P2]: Investigate Lazy Image Pull with Stargz/Nydus Snapshotters | feature request,ci/build,stale | by rzabarazesh (closed: 2026-03-14 10:16 (UTC+8))
#35300 [Feature]: Add ISA-level smoke tests using Intel SDE to catch instruction set mismatches | cpu | by wjhrdy (closed: 2026-03-14 09:27 (UTC+8))
#36907 [Bug]: resolve_chat_template_kwargs silently drops chat template kwargs when chat_template is passed in as chat template name instead of Jinja | bug | by xiangxl-a (closed: 2026-03-14 08:20 (UTC+8))
#34561 [Bug]: GLM-4.7-Flash-AWQ fails with AttributeError: 'ColumnParallelLinear' object has no attribute 'weight' | bug | by eugr (closed: 2026-03-14 07:25 (UTC+8))
#36969 [Bug]: Kimi-K2.5 output a single </think> tag in the content when "enable_thinking" is set to false | bug | by elepherai (closed: 2026-03-14 00:11 (UTC+8))
#36986 [Bug]: whisper large v2 incorrectly transcribing | bug | by gilljon (closed: 2026-03-13 23:20 (UTC+8))
#30487 [Bug] [CPU Backend]: Wrong L2 cache size on Arm CPU for Attention tiles | bug,cpu | by Radu2k (closed: 2026-03-13 17:40 (UTC+8))
#36141 [Bug]: Vllm models crashes when using runai_streamer & --api-server-count > 1 | bug | by Sanches166 (closed: 2026-03-13 15:13 (UTC+8))
#32186 [Bug]: Unable to serve Qwen3-8B-FP8 with moriio kv connector | bug,rocm | by junkang1991 (closed: 2026-03-13 14:55 (UTC+8))
#35391 [Bug]: Value error, Model architectures ['Qwen3_5ForConditionalGeneration'] are not supported for now | bug | by Iam-a-Savage-coder (closed: 2026-03-13 14:29 (UTC+8))
#36942 [Bug]: Race condition in AsyncPoolingOutput.get_output() - pooler_output_cpu accessed before copy_event.synchronize() | bug | by dharaghodasara (closed: 2026-03-13 12:23 (UTC+8))
#36778 [Bug]: Using vLLM to deploy Minimax m2.5, the thinking/reasoning cannot be disable. | bug | by shuixiaoer (closed: 2026-03-13 11:54 (UTC+8))

New PRs

#37029 [Hardware][XPU][ROCm] Align memory usage with cuda on xpu/rocm | rocm,nvidia | by jikunshang (created: 2026-03-14 10:18 (UTC+8))
#37031 [Hardware] Replace memory related torch.cuda APIs | performance,v1,nvidia | by jikunshang (created: 2026-03-14 10:48 (UTC+8))
#37013 [Spec Decode] Update extract_hidden_states to use deferred kv_connector clear | speculative-decoding,ready,v1,kv-connector | by fynnsu (created: 2026-03-14 05:28 (UTC+8))
#36976 [Bugfix][LoRA] Fix Qwen35 LoRA | bug,qwen | by jeejeelee (created: 2026-03-13 19:19 (UTC+8))
#37015 [CI] Split Distributed Tests (4 GPUs) into 3 parallel jobs | ready,ci/build | by khluu (created: 2026-03-14 06:12 (UTC+8))
#37026 [UX]: Fix unclean shutdown from ctrl-c with AR Fusion | nvidia | by siewcapital (created: 2026-03-14 08:41 (UTC+8))
#37028 [WIP][Model Runner V2] Support Streaming Inputs | v1 | by santiramos27 (created: 2026-03-14 09:57 (UTC+8))
#37006 [V1] Remove pin_memory() in async_copy_to_gpu to fix sporadic stalls | ready,v1 | by sbeurnier (created: 2026-03-14 04:27 (UTC+8))
#37027 [CI] Fix flaky tool_use chat completion tests with deterministic seed | tool-calling | by sfeng33 (created: 2026-03-14 09:07 (UTC+8))
#36968 [UX] Improve UX of CPU backend | documentation,ready,ci/build,v1,cpu | by bigPYJ1151 (created: 2026-03-13 17:05 (UTC+8))
#36982 [MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible | v1 | by MatthewBonanni (created: 2026-03-13 22:38 (UTC+8))
#37025 [CI] Add reasoning parser tests to CI | ci/build | by sfeng33 (created: 2026-03-14 08:34 (UTC+8))
#37024 [bug] fix hang dpep pause | bug,v1 | by hao-aaron (created: 2026-03-14 08:29 (UTC+8))
#37004 [Bugfix] Fix DeepSeek-V3.2 tokenizer stripping spaces | bug,ready,deepseek | by MatthewBonanni (created: 2026-03-14 03:39 (UTC+8))
#37016 [CI] Split V1 Others into 3 separate jobs | ready,ci/build | by khluu (created: 2026-03-14 06:13 (UTC+8))
#37008 [BugFix] Fix "DP Coordinator receives unexpected…" messages | bug,ready,v1 | by njhill (created: 2026-03-14 04:45 (UTC+8))
#37021 Enable in-process engine core for AsyncLLM. | v1 | by wang2yn84 (created: 2026-03-14 07:43 (UTC+8))
#37020 [CI][Bugfix] Fix incorrect status handling with set -e in CI shell scripts | bug,ci/build | by gkapetanakis (created: 2026-03-14 07:33 (UTC+8))
#37018 [BUG] Collective causing deadlock for DPEP MoE | bug,frontend,v1 | by hao-aaron (created: 2026-03-14 06:57 (UTC+8))
#37014 [CI] Shard Multi-Modal Models (Standard) into 4 parallel jobs | ready,ci/build | by khluu (created: 2026-03-14 05:29 (UTC+8))
#37019 fix: generalize sharded LoRA slice_lora_a for N subloras | no labels | by hallerite (created: 2026-03-14 07:16 (UTC+8))
#37012 [CI] Pin helion version | ready,ci/build | by gmagogsfm (created: 2026-03-14 05:26 (UTC+8))
#37017 [CI][Bugfix] Fix incorrect status handling with set -e in CI shell scripts | bug,ci/build | by gkapetanakis (created: 2026-03-14 06:57 (UTC+8))
#37011 fix: return HTTP 413 when request exceeds max context length | frontend | by Chase-Xuu (created: 2026-03-14 05:05 (UTC+8))
#36981 feat(riscv64): add RISC-V 64-bit pure-Python CPU support | structured-output,ci/build,v1,cpu | by gounthar (created: 2026-03-13 22:12 (UTC+8))
#37007 [responsesAPI] move streaming to parser | frontend | by qandrew (created: 2026-03-14 04:38 (UTC+8))
#36985 fix: Guard AllReduceFusionPass.del against Python shutdown | no labels | by wavebyrd (created: 2026-03-13 22:50 (UTC+8))
#37010 [Bugfix] Fix FusedMoE weight loading with padded hidden dimensions | bug | by SandishKumarHN (created: 2026-03-14 05:02 (UTC+8))
#37009 [ROCm] issue management - request information for bug issues on ROCm | bug,rocm,ci/build | by hongxiayang (created: 2026-03-14 04:50 (UTC+8))
#36952 [Bugfix][Spec Decode] Avoid double call of Ngram CPU | bug,ready,v1 | by ekagra-ranjan (created: 2026-03-13 13:02 (UTC+8))
#37005 Add Mistral Guidance | rocm,structured-output,frontend,ci/build,v1,nvidia | by juliendenize (created: 2026-03-14 04:09 (UTC+8))
#36995 Make skip_special_tokens configurable on the server side | frontend | by linzebing (created: 2026-03-14 01:48 (UTC+8))
#37002 [Core][Feature] Observation Plugin for Intercepting & Routing on Activations | documentation,needs-rebase,v1 | by DDDDarrenWB (created: 2026-03-14 02:59 (UTC+8))
#36997 Enable loading of fused expert weights in the Transformers modelling backend | no labels | by hmellor (created: 2026-03-14 02:20 (UTC+8))
#36996 [CI][BugFix][AMD] Don't set VLLM_ROCM_USE_AITER anymore in test_rocm_aiter_topk since its not necessary | bug,rocm | by rasmith (created: 2026-03-14 01:49 (UTC+8))
#36967 Allow platform plugins to override gpu_memory_utilization default | frontend,v1 | by aws-navyadhara (created: 2026-03-13 16:41 (UTC+8))
#36993 [CI][Bugfix][AMD] Ensure weights created when using emulating OCP MXFP4 | bug,rocm | by rasmith (created: 2026-03-14 01:08 (UTC+8))
#36951 [CI] Add persistent cache mounts for all CI test downloads and media URLs | ready,ci/build,v1,multi-modality,gpt-oss,kv-connector | by AndreasKaratzas (created: 2026-03-13 12:56 (UTC+8))
#36992 [Bugfix] accept redacted thinking blocks in Anthropic messages | bug,frontend | by bbartels (created: 2026-03-14 01:05 (UTC+8))
#36971 Mistral common v10 | rocm,ready,ci/build | by juliendenize (created: 2026-03-13 18:28 (UTC+8))
#36957 [PD][Nixl][WIP] Exploring support for heterogeneous TP in hybrid SSM-FA P/D disaggregation | v1,kv-connector | by ZhanqiuHu (created: 2026-03-13 14:22 (UTC+8))
#36991 Fix xgrammar being locked to a version which has high vulnerabilities | ci/build | by benglewis (created: 2026-03-14 00:30 (UTC+8))
#36990 [Bugfix] Fix env_with_choices returning raw casing when case_sensitive=False | bug | by aymuos15 (created: 2026-03-14 00:24 (UTC+8))
#36988 bump compressed-tensors version to 0.14.0.1 | ci/build | by brian-dellabetta (created: 2026-03-14 00:00 (UTC+8))
#36950 [Tests] Shutdown test RemoteVLLMServer cleanly | ready | by njhill (created: 2026-03-13 12:46 (UTC+8))
#36980 sched/v1: use SRTF tiebreaker for preemption victim selection | documentation,new-model,rocm,frontend,ci/build,v1,multi-modality,qwen,kv-connector,nvidia | by whycoming (created: 2026-03-13 21:50 (UTC+8))
#36984 [Test] Add unittests for CompilerManager | no labels | by SoluMilken (created: 2026-03-13 22:42 (UTC+8))
#36979 [refactor] Refactor SpeculativeConfig for speculative method extensibility | no labels | by TQCB (created: 2026-03-13 20:59 (UTC+8))
#36987 [FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell | v1,qwen,nvidia | by vadiklyutiy (created: 2026-03-13 23:34 (UTC+8))
#36974 Fix issue 36969 | frontend | by xueliangyang-oeuler (created: 2026-03-13 19:01 (UTC+8))
#36983 [Bugfix] Fix delta_text and delta_token_ids desync with stop strings | bug,v1 | by aymuos15 (created: 2026-03-13 22:41 (UTC+8))
#36965 [Model][Quantization] Add GGUF support for MiniMax-M2.1 | no labels | by JoursBleu (created: 2026-03-13 16:28 (UTC+8))
#36959 [Bugfix] fix paddleocr crash on some image shape | bug,ready | by MoyanZitto (created: 2026-03-13 15:24 (UTC+8))
#36966 [Frontend] Add srt and vtt response formats for audio transcription | frontend | by majiayu000 (created: 2026-03-13 16:32 (UTC+8))
#36978 Revert "[Bugfix] ep_scatter kernel store-load race condition" (#34991) | bug | by zhewenl (created: 2026-03-13 20:03 (UTC+8))
#36977 chore(pre-commit): make bash hooks runnable on Windows | no labels | by xueliangyang-oeuler (created: 2026-03-13 19:21 (UTC+8))
#36975 [Bugfix] Avoid KeyError when loading Qwen3.5 MoE expert weights | bug,qwen | by xueliangyang-oeuler (created: 2026-03-13 19:16 (UTC+8))
#36970 [Bugfix] Avoid KeyError when loading Qwen3.5 MoE expert weights | bug,qwen | by xueliangyang-oeuler (created: 2026-03-13 17:37 (UTC+8))
#36962 [XPU] Support LoRA via torch.compile on XPU platform | ready | by chaojun-zhang (created: 2026-03-13 15:53 (UTC+8))
#36963 [Bugfix][Model] Fix PixtralForConditionalGeneration LoRA | bug | by jeejeelee (created: 2026-03-13 15:54 (UTC+8))
#36964 [Feature] Abort in-flight requests and drain outputs on shutdown | v1 | by wojciech-wais (created: 2026-03-13 15:57 (UTC+8))
#36961 [Feature] Add /health/ready endpoint for GPU health verification | frontend,v1 | by anencore94 (created: 2026-03-13 15:44 (UTC+8))
#36953 [Bugfix] Resolve chat template names before kwargs detection | bug | by he-yufeng (created: 2026-03-13 13:15 (UTC+8))
#36958 [Bugfix] Fix NIXL MLA notification request ID mismatch causing prefill KV cache leak | bug,v1,kv-connector | by wz1qqx (created: 2026-03-13 14:43 (UTC+8))
#36956 [Bugfix] Fix unclean shutdown from ctrl-c with AR Fusion | bug,nvidia | by siewcapital (created: 2026-03-13 14:08 (UTC+8))
#36955 [Bugfix] Fix unclean shutdown crash with AllReduce Fusion workspace | bug,nvidia | by siewcapital (created: 2026-03-13 14:06 (UTC+8))

Merged PRs

#37006 [V1] Remove pin_memory() in async_copy_to_gpu to fix sporadic stalls | ready,v1 | by sbeurnier (merged: 2026-03-14 09:37 (UTC+8))
#36968 [UX] Improve UX of CPU backend | documentation,ready,ci/build,v1,cpu | by bigPYJ1151 (merged: 2026-03-14 09:27 (UTC+8))
#36516 [responsesAPI] prioritize content over summary in reasoning item input | frontend,ready | by qandrew (merged: 2026-03-14 09:20 (UTC+8))
#36937 fix: resolve chat template names before kwargs detection | ready | by giulio-leone (merged: 2026-03-14 08:20 (UTC+8))
#37004 [Bugfix] Fix DeepSeek-V3.2 tokenizer stripping spaces | bug,ready,deepseek | by MatthewBonanni (merged: 2026-03-14 06:55 (UTC+8))
#37008 [BugFix] Fix "DP Coordinator receives unexpected…" messages | bug,ready,v1 | by njhill (merged: 2026-03-14 07:18 (UTC+8))
#36931 [Feat][Bugfix] Enable additional dimension for Flashinfer MLA and fix routing dtype | bug,ready,v1,deepseek,nvidia | by dbari (merged: 2026-03-14 07:41 (UTC+8))
#34695 [Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models | bug,ready | by haosdent (merged: 2026-03-14 07:25 (UTC+8))
#36063 [Refactor] Consolidate SupportsEagle | ready,v1,llama,qwen,gpt-oss,kv-connector | by benchislett (merged: 2026-03-14 07:22 (UTC+8))
#36945 [CI] Split V1 e2e + engine (1 GPU) into parallel jobs | ready,ci/build,v1 | by khluu (merged: 2026-03-14 05:16 (UTC+8))
#31545 Use Transformers v5 WeightRenaming for Transformers modeling backend | ready,multi-modality | by hmellor (merged: 2026-03-14 04:49 (UTC+8))
#36952 [Bugfix][Spec Decode] Avoid double call of Ngram CPU | bug,ready,v1 | by ekagra-ranjan (merged: 2026-03-14 04:33 (UTC+8))
#35316 [ROCm][Quantization] add quark w4a8 mxfp4_fp8 for LinearLayer | rocm,ready,gpt-oss | by divakar-amd (merged: 2026-03-14 03:44 (UTC+8))
#36666 [Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish | frontend,ready,v1 | by markmc (merged: 2026-03-14 03:10 (UTC+8))
#36862 [Misc] Set default kv_buffer_device in a better way | ready,kv-connector | by hmellor (merged: 2026-03-14 03:07 (UTC+8))
#35242 Fp8 lora dense kernel | ready | by yugong333 (merged: 2026-03-14 03:05 (UTC+8))
#36800 [Bugfix] Fix Qwen2.5-omni/Qwen3-omni mm_processor cache for audio_in_video request | bug,ready,multi-modality,qwen | by Isotr0py (merged: 2026-03-14 02:16 (UTC+8))
#36950 [Tests] Shutdown test RemoteVLLMServer cleanly | ready | by njhill (merged: 2026-03-13 15:32 (UTC+8))
#36181 [ROCm][CI] Upgrading orchestrator to handle python pipeline markers and options | rocm,ready,ci/build | by AndreasKaratzas (merged: 2026-03-13 17:04 (UTC+8))
#36711 [ROCm][CI] Corrected the GPT-OSS test root path | rocm,ready,ci/build,gpt-oss | by AndreasKaratzas (merged: 2026-03-13 15:53 (UTC+8))
#36959 [Bugfix] fix paddleocr crash on some image shape | bug,ready | by MoyanZitto (merged: 2026-03-13 21:46 (UTC+8))
#35627 [2/N] Elastic EP Milestone 2: Integrating NIXL-EP | ready,v1,kv-connector,nvidia | by itayalroy (merged: 2026-03-13 21:25 (UTC+8))
#36962 [XPU] Support LoRA via torch.compile on XPU platform | ready | by chaojun-zhang (merged: 2026-03-13 18:35 (UTC+8))
#36610 [kv_offload+HMA][1/N]: Support multiple KV groups in OffloadingSpec | ready,v1,kv-connector | by orozery (merged: 2026-03-13 16:04 (UTC+8))
#36483 [Frontend] Delegate preprocessing to OpenAIServingRender | frontend,ready,v1 | by sagearc (merged: 2026-03-13 15:39 (UTC+8))
#35786 Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) | rocm,ready,v1 | by Rohan138 (merged: 2026-03-13 15:33 (UTC+8))
#36876 [Bugfix] Fix FlashInfer GDN warmup ValueError on SM90 GPUs | bug,ready,qwen | by tdoublep (merged: 2026-03-13 14:09 (UTC+8))
#36379 [Frontend] Fix usage incorrectly returned with empty stream_options | frontend,ready,v1 | by Csrayz (merged: 2026-03-13 11:33 (UTC+8))
#36684 fix(kv-cache): increase hybrid attention grouping threshold from 1.25 to 1.5 | ready,v1,ready-run-all-tests | by hai-meh-cs (merged: 2026-03-13 11:28 (UTC+8))
#36385 [Model] Add support for BERT-like Chinese ERNIE pooling models | documentation,new-model,ready | by whyiug (merged: 2026-03-13 11:23 (UTC+8))

PRs closed without merging

#26650 fix(benchmark): Prevent OOV error for small tokenizers in latency test | performance,stale | by ihb2032 (closed: 2026-03-14 10:17 (UTC+8))
#26765 [Bugfix][Core] When preemption occurs, the requests in the running set are not fully scheduled. | stale,v1 | by CLFutureX (closed: 2026-03-14 10:17 (UTC+8))
#27585 [Build] Optimize docker layers sizes | ready,ci/build,stale | by rzabarazesh (closed: 2026-03-14 10:17 (UTC+8))
#28057 [wip] Fix torch nightly | ready,needs-rebase,ci/build,stale | by rzabarazesh (closed: 2026-03-14 10:17 (UTC+8))
#28258 [GPUModelRunner] initialize_kv_cache cleanup (1/N): move initialization that doesn't depend on kv cache config to load_model | needs-rebase,stale,v1 | by heheda12345 (closed: 2026-03-14 10:16 (UTC+8))
#37017 [CI][Bugfix] Fix incorrect status handling with set -e in CI shell scripts | bug,ci/build | by gkapetanakis (closed: 2026-03-14 07:02 (UTC+8))
#36985 fix: Guard AllReduceFusionPass.del against Python shutdown | no labels | by wavebyrd (closed: 2026-03-14 05:39 (UTC+8))
#34363 fix: return HTTP 413 when request exceeds max context length | frontend,needs-rebase | by Chase-Xuu (closed: 2026-03-14 05:04 (UTC+8))
#37005 Add Mistral Guidance | rocm,structured-output,frontend,ci/build,v1,nvidia | by juliendenize (closed: 2026-03-14 04:15 (UTC+8))
#36995 Make skip_special_tokens configurable on the server side | frontend | by linzebing (closed: 2026-03-14 03:41 (UTC+8))
#32781 [WIP] Add Mistral guidance | structured-output,frontend,v1 | by juliendenize (closed: 2026-03-14 02:30 (UTC+8))
#29505 [Misc][Refactor] Decouple quant methods from FusedMoE | needs-rebase | by bnellnm (closed: 2026-03-14 00:55 (UTC+8))
#25200 [Kernels] Move shared/fused expert sum and final reduce into FusedMoE layer | llama,qwen,deepseek,gpt-oss | by bnellnm (closed: 2026-03-14 00:54 (UTC+8))
#36941 [Bugfix][Core]: Fix silent output_handler death on CancelledError | bug,documentation,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality,llama | by eugenepaniot (closed: 2026-03-13 21:49 (UTC+8))
#36464 [Examples][2/n] Resettle generate examples. | documentation,structured-output,needs-rebase,ci/build,qwen | by noooop (closed: 2026-03-13 17:41 (UTC+8))
#36936 fix: Resolve template name to Jinja content in resolve_chat_template to prevent dropped kwargs | no labels | by siddharthmohan619-eng (closed: 2026-03-13 15:49 (UTC+8))
#34964 Fix LoRA adapter silently failing on Pixtral/Ministral-3 models | no labels | by timon0305 (closed: 2026-03-13 15:56 (UTC+8))
#36953 [Bugfix] Resolve chat template names before kwargs detection | bug | by he-yufeng (closed: 2026-03-13 15:48 (UTC+8))
#36444 [Model][Quantization] Add GGUF support for MiniMax-M2.1 | performance,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1,multi-modality,tool-calling | by JoursBleu (closed: 2026-03-13 14:55 (UTC+8))
#36925 [Bugfix] signature match for passing spec_step_idx in qwen3-next and qwen3.5 | bug,qwen | by JGSweets (closed: 2026-03-13 11:26 (UTC+8))