vLLM 开发动态报告 - 2026-03-08

时间窗口: 2026-03-08 11:29 (UTC+8) ~ 2026-03-09 11:29 (UTC+8) 数据统计: 新 Issue 11 | 关闭 Issue 8 | 新 PR 54 | 合并 PR 11 | 关闭未合并 PR 26

📊 每日开发状态摘要

本周期（2026年3月8日至9日）vLLM项目开发活动保持活跃，共产生11个新Issue和54个新PR，合并了11个PR。开发焦点集中在模型兼容性修复（特别是Qwen系列）、CPU模式支持以及AMD ROCm平台的深度优化。同时，关于API设计（如usage字段默认行为）和内部架构（如客户端协议分离）的讨论仍在持续。

🎯 AMD/ROCm 生态相关动态

本周期AMD生态相关活动显著，主要集中在性能优化和bug修复，由AMD员工（ChuanLi1101， hongxiayang 等）主导。

PR #36425: [ROCm] Enable AITER fused allreduce + RMSNorm compilation pass (by ChuanLi1101)
- 技术细节：新增ROCm专用的RocmAiterAllReduceFusionPass编译通道，将TP（Tensor Parallelism）通信中的allreduce操作与AITER CK的RMSNorm计算融合为单个图节点（rocm_aiter_fused_allreduce_rmsnorm）。这是NVIDIA FlashInfer对应功能的ROCm/AITER版本。
- 影响：旨在为多GPU ROCm部署（如MI300X/MI355X）减少图开销和内存流量，是提升分布式训练/推理性能的关键一步。贡献者指出此为第一阶段（图级融合），第二阶段将升级为使用AITER原生内核以实现真正的通信计算重叠。
- 讨论亮点：AMD审核人（tjtanaa）要求提供性能基准和模型精度（lm-eval）验证数据，体现了对功能质量和性能收益的严谨要求。
PR #36422: [ROCm][Bugfix] Fix MXFP4 MoE emulate fallback logic on MX-capable hardware (by ChuanLi1101)
- 技术细节：修复了QuarkOCP_MX_MoEMethod中一个布尔逻辑错误，该错误导致在支持MX指令集（如MI350X/gfx950）的硬件上，即使AITER CK内核不兼容（如ROCm版本不匹配），也无法回退到模拟模式，从而产生错误输出。
- 影响：确保了Quark量化工具链中MXFP4格式的MoE模型在AMD硬件上的鲁棒性，当高性能内核不可用时能安全降级。
PR #36374 / #36376: [Bugfix][ROCm] full graph capture on triton-attn fix (by hongxiayang)
- 技术细节：提供了两个方案修复ROCm上使用Triton注意力时全CUDA图捕获导致的“写入只读页”错误。
  - Option 1 (PR #36374)：在ROCm上，不为Triton注意力的softmax_segm_output等缓冲区分配内存，强制使用2D内核（类似AITER路径）。
  - Option 2 (PR #36376)：在ROCm上，覆盖Triton注意力的get_cudagraph_support()方法，返回NEVER，从而完全禁用全图捕获，仅使用分段图。
- 影响：解决了ROCm平台一个影响稳定性的核心问题，确保了CUDA图优化技术的可用性。
已关闭 Issue #36167: Engine initialization failure with Qwen3 Omni on Strix Halo/ROCM
- 结论：用户通过手动添加ROCm的clr库到LD_LIBRARY_PATH解决了问题，但指出这感觉像临时方案。这暴露了在某些AMD消费级平台（Ryzen AI）上依赖库路径的配置问题。

💬 高热度讨论分析

Issue #36408: [Bug]: Qwen3.5-9B (BF16/AWQ) Illegal Memory Access (6条评论)
- 核心议题：用户在WSL2/RTX 3090 Ti环境下，运行Qwen3.5-9B（包括AWQ量化版）时遇到CUDA非法内存访问错误。
- 观点整理：
  - 报告者 (d-etu)：提供了详细的复现步骤，强调问题不仅限于量化模型。
  - 参与者1 (SaxenaShiv)：表示有兴趣修复，并询问是否可在其他GPU（RTX 4060）上复现。
  - 参与者2 (hocop)：确认在双3090上遇到相同问题，并指出禁用MTP后问题解决，提供了关键线索。
  - 参与者3 (ZJY0516)：建议测试主线代码，暗示相关问题可能已修复。
  - 报告者跟进：在更新NVIDIA驱动后，AWQ版本问题解决，但基础BF16版本出现新错误，表明问题可能具多层原因。
- 争议焦点/结论：问题可能与特定驱动版本、MTP（可能是某种并行技术） 的兼容性有关，非单一原因。驱动更新解决了部分问题，但揭示了更深层的不稳定性。
PR #36425: [ROCm] Enable AITER fused allreduce + RMSNorm compilation pass (4条评论)
- 核心议题：AMD审核人对新功能提出性能验证要求。
- 观点整理：
  - 审核人 (tjtanaa)：要求提供融合前后的模型评估分数（lm-eval）和跨不同TP规模的性能增益数据，并建议在CI中启用端到端测试。
  - 贡献者 (ChuanLi1101)：解释该PR为第一阶段（图级融合），主要收益是减少中间张量和图开销；真正的性能提升（通信计算重叠）需依赖第二阶段的AITER原生融合内核。他提议在AMD CI节点上协同进行性能测试。
- 总结：讨论体现了开源协作中厂商贡献的严谨性，不仅要求功能正确，更要求有可衡量的性能收益和完整的集成测试。
Issue #36382: [CPU] AssertionError in CPUModelRunner (2条评论)
- 核心议题：在macOS (ARM64) CPU模式下，CPUModelRunner初始化时因num_tokens_no_spec属性为NumPy数组而非PyTorch Tensor导致断言失败。
- 观点整理：
  - 报告者 (maxwillzq)：提供了详细复现步骤，并直接给出了修复代码——注释掉类型断言。
  - 另一用户 (NimaSoroush)：确认该修复方法有效。
- 结论：这是一个典型的CPU/GPU代码路径差异导致的兼容性问题。用户直接提供了解决方案，体现了社区的高效互助。

🔥 热门话题与趋势分析

Qwen系列模型问题集中爆发：多个Issue报告了Qwen3.5/3在不同配置下的问题（#36408, #36411, #36372, #36432），涉及非法内存访问、LoRA加载失败、pipeline parallelism下的KeyError、聊天模板应用错误等。这表明该系列模型在vLLM中的兼容性测试面临较大压力。
CPU支持与健壮性：除了上述macOS CPU问题，已关闭的Issue #35789修复了多节点CPU部署的NUMA绑定问题，PR #36430也在修复CPU模型运行器中的非张量成员替换问题。CPU推理支持正在稳步推进但细节挑战多。
多模态与推理功能持续增强：PR #36431修复了Qwen3-Omni视频音频交错处理的多个bug，PR #36375修复了Qwen3推理模型流式与非流式输出不一致的问题，显示了对此类复杂模型支持的精耕细作。
AMD平台优化明显加速：本周期多个核心PR均围绕ROCm生态，从内核融合（#36425）、量化回退逻辑（#36422）到CUDA图兼容性（#36374/36376），表明AMD团队正在对vLLM进行系统性的深度优化和问题扫除。

🛠️ 重点技术变更

PR #36398: Allow markdownlint to run locally (已合并)
- 解读：将文档lint工具markdownlint从仅CI运行改为同时支持本地预提交钩子运行，并升级到官方markdownlint-cli2。
- 影响：提升开发者体验（DX），使文档贡献者能在提交前本地修复格式问题，降低PR失败率，体现了对项目质量维护流程的改进。
PR #36166: [Frontend] Add GPU-less render serving path (已合并)
- 解读：引入了一个纯CPU的预处理服务路径 (vllm launch render)，提供聊天模板渲染、分词等功能，无需GPU或推理引擎。
- 影响：为前后端分离架构奠定了基础，允许将耗时的预处理环节卸载到专用服务，是提升系统架构灵活性和可扩展性的重要一步。
PR #36388: [Bugfix] Fix hybrid Attention+Mamba models failing…
- 解读：修复了当混合KV缓存管理器被禁用时（如使用了--kv-events-config），Attention+Mamba混合模型（如Qwen3.5-0.8B）启动失败的问题。
- 影响：放宽了校验逻辑，允许MambaSpec与Attention Spec共存，确保新兴的混合架构模型在更多配置下可用。
PR #36170: [Dependency] Remove default ray dependency (已合并)
- 解读：将Ray从默认依赖项改为可选依赖，仅在使用Ray分布式执行器后端时才需要安装。
- 影响：解决了用户环境中因vLLM强制安装Ray而导致的版本冲突问题，减小了安装包体积和潜在安全风险，是依赖项管理的重要优化。

📈 开发活跃度观察

贡献者：AMD团队成员（ChuanLi1101, hongxiayang）在本周期异常活跃，提交了多个涉及核心优化和修复的PR。常规贡献者如lailoo、scoootscooob、esmeetu等持续在模型、调度器、前端等领域进行bug修复和功能完善。
代码审查与合并：共有11个PR被合并，涵盖文档、前端、模型支持、CPU后端和AMD优化等多个方面。一些PR（如#35815, #36170）在获得批准后等待合并，表明核心团队对合并节奏有所把控。针对AMD相关PR的审查对话显示出对性能和测试数据的严格要求。
Issue处理：关闭了8个Issue，其中包括一个历时较长的“Ray依赖”RFC讨论（#33445），表明一些架构决策已达成共识并落地。

💡 值得关注的问题

Qwen模型家族稳定性：集中出现的多个Qwen相关Bug（#36408, #36411, #36372等）需要核心团队重点关注，可能需要系统性的兼容性测试和修复。
CPU模式的健壮性：Issue #36382和PR #36430揭示的CPU/GPU代码路径不一致问题，可能会随着CPU支持功能的增加而愈发普遍，需要建立更统一的抽象或测试机制。
RFC #36406: Changing the Default Behavior of usage Field in Responses：此提议希望改变流式响应中usage字段默认不返回的行为，以更好满足生产环境计费和监控需求。这涉及对OpenAI API标准的偏离，可能引发社区广泛讨论，需要谨慎权衡兼容性与实用性。

📋 附录：详细数据列表

新增 Issue

#36411 [Bug]: vLLM not working with qwen3.5 27B — bug — by surapuramakhil (创建于: 2026-03-08 22:15 (UTC+8))
#36408 [Bug]: Qwen3.5-9B (BF16/AWQ) Illegal Memory Access in vLLM v0.17.0 (WSL2/RTX3090 Ti) — bug — by d-etu (创建于: 2026-03-08 20:52 (UTC+8))
#36435 [Bug]: Responses API streaming emits tool call XML as response.output_text.delta instead of response.function_call_arguments.delta for non-harmony models — bug — by herve-ves (创建于: 2026-03-09 11:00 (UTC+8))
#36433 [Bug]: matryoshka need gpu-memory??? — bug — by ciaoyizhen (创建于: 2026-03-09 10:56 (UTC+8))
#36432 [Bug]: ValueError: No user query found in messages QWEN 3.5 27B VLLM 0.16.0 NIGHTLY — bug — by HojaMuerta (创建于: 2026-03-09 10:54 (UTC+8))
#36382 [CPU] AssertionError in CPUModelRunner: device_tensor type mismatch (numpy.ndarray) — 无标签 — by maxwillzq (创建于: 2026-03-08 14:26 (UTC+8))
#36407 [Bug]: KeyError in get_layers_from_vllm_config with pipeline parallelism (vLLM 0.16.0) — bug — by tusharshetty61 (创建于: 2026-03-08 20:35 (UTC+8))
#36406 [RFC]: Changing the Default Behavior of usage Field in Responses — RFC — by Csrayz (创建于: 2026-03-08 20:32 (UTC+8))
#36372 [Bug]: LoRA on Qwen-3.5-27B fails to run — bug — by Nero10578 (创建于: 2026-03-08 11:59 (UTC+8))
#36389 [Bug]: Multi-node MP backend with vllm v17 - inner dp world group is not initialized — bug — by alan-cooney-dsit (创建于: 2026-03-08 16:00 (UTC+8))
#36387 📋 Documentation Enhancement Suggestion — 无标签 — by croviatrust (创建于: 2026-03-08 15:05 (UTC+8))

已关闭 Issue

#36435 [Bug]: Responses API streaming emits tool call XML as response.output_text.delta instead of response.function_call_arguments.delta for non-harmony models — bug — by herve-ves (关闭于: 2026-03-09 11:11 (UTC+8))
#35789 [Bug]: CPU OMP thread autobind fails when number of local ranks <= NUMA nodes — bug — by shunzhiwen (关闭于: 2026-03-09 11:07 (UTC+8))
#33445 [RFC] Remove mandatory ray installation — 无标签 — by yewentao256 (关闭于: 2026-03-09 11:06 (UTC+8))
#27898 [Doc]: Multi-node EP on EFA (i.e. no IBGDA/DeepEP) — documentation,stale — by nathan-az (关闭于: 2026-03-09 10:14 (UTC+8))
#35213 [RFC]: Disaggregate RendererClient from EngineClient as Foundation for Disaggregated Frontend — RFC — by sagearc (关闭于: 2026-03-09 06:29 (UTC+8))
#33911 [Bug]: Gemma-3 multimodal models (4b/12b/27b) fail with torch.compile assertion error — torch.compile — by tomasruizt (关闭于: 2026-03-09 01:20 (UTC+8))
#36167 [Bug]: Engine initialization failure with Qwen3 Omni on Strix Halo/ROCM — bug,rocm — by Eoghanmc22 (关闭于: 2026-03-09 00:23 (UTC+8))
#34028 [Usage]: Unable to load Mistral-Small-3.2-24B-Instruct-2506-FP8 due to “Value error, Found unknown quantization”, but the same configs worked for vllm v0.11.0 — usage — by gabinguo (关闭于: 2026-03-08 22:23 (UTC+8))

新增 PR

#36436 [Misc] Refactored 5 duplicate helper functions that were copied-pasted across multiple parsers — 无标签 — by taneem-ibrahim (创建于: 2026-03-09 11:13 (UTC+8))
#36385 [Model] Add support for BERT-like Chinese ERNIE pooling models — documentation,new-model,needs-rebase — by whyiug (创建于: 2026-03-08 14:55 (UTC+8))
#36434 [XPU] Add test script of PD disaggregation — ready,v1,kv-connector — by zhenwei-intel (创建于: 2026-03-09 10:56 (UTC+8))
#36431 [Bugfix] Fix Qwen3-Omni/Qwen2.5-Omni use_audio_in_video deployment issues — bug,multi-modality,qwen — by xxddccaa (创建于: 2026-03-09 10:44 (UTC+8))
#36418 Populate multimedia pre-processing performance metrics via PerfStats in the vLLM engine — v1,multi-modality,meta-exported,fb-exported — by d-biswa (创建于: 2026-03-09 02:39 (UTC+8))
#36398 Allow markdownlint to run locally — documentation,performance,ready,ci/build,cpu,kv-connector,nvidia — by hmellor (创建于: 2026-03-08 19:12 (UTC+8))
#36430 [Bugfix] Avoid to replace non-tensor members in cpu model runner — bug,ready,v1,cpu — by bigPYJ1151 (创建于: 2026-03-09 10:34 (UTC+8))
#36420 Revert “[Core] NGram GPU Implementation compatible with Async Scheduler” (#29184) — speculative-decoding,v1 — by zhewenl (创建于: 2026-03-09 03:57 (UTC+8))
#36428 [Darft][Spec decode] Proposer Interface Unification — speculative-decoding,v1 — by wangxiyuan (创建于: 2026-03-09 10:27 (UTC+8))
#36429 Bridge pad_token_id from model config to tokenizer — meta-exported,fb-exported — by cgufb (创建于: 2026-03-09 10:32 (UTC+8))
#36427 TMA Support for Fully-Sharded LoRA MoE + Tuned Config Control — 无标签 — by RunkaiTao (创建于: 2026-03-09 10:22 (UTC+8))
#36410 [Bugfix] Fix GLM4 tool parser double serialization issue — bug — by kasyoukin (创建于: 2026-03-08 21:39 (UTC+8))
#36395 fix(lora): add bounds checking for TP configurations — 无标签 — by lailoo (创建于: 2026-03-08 18:52 (UTC+8))
#36402 fix(lora): use replaced_module_name in pooling model name check — ready — by gambletan (创建于: 2026-03-08 19:33 (UTC+8))
#36426 docs: fix local testing setup in contribution guide — documentation — by eexwhyzee (创建于: 2026-03-09 09:49 (UTC+8))
#36425 [ROCm] Enable AITER fused allreduce + RMSNorm compilation pass — rocm — by ChuanLi1101 (创建于: 2026-03-09 09:15 (UTC+8))
#36424 [Refactor] Remove dead code in KV connector — ready,v1,kv-connector — by yewentao256 (创建于: 2026-03-09 09:03 (UTC+8))
#36423 [XPU] Support cpu kv offloading on XPU platform — v1 — by chaojun-zhang (创建于: 2026-03-09 08:59 (UTC+8))
#36416 [Bugfix] Clear stale CG keys after memory profiling — bug,ready,v1,nvidia — by MatthewBonanni (创建于: 2026-03-09 02:27 (UTC+8))
#36422 [ROCm][Bugfix] Fix MXFP4 MoE emulate fallback logic on MX-capable hardware — bug,rocm — by ChuanLi1101 (创建于: 2026-03-09 06:58 (UTC+8))
#36421 [chore] Refactor the commonly shared functions for the parsers under one module — 无标签 — by taneem-ibrahim (创建于: 2026-03-09 05:49 (UTC+8))
#36414 [Bugfix] Add PyTorch GDN fallback for SM12.0+ (Blackwell) in Qwen3Next/Qwen3.5 — bug,performance,ci/build,qwen — by lucaspirola (创建于: 2026-03-09 02:08 (UTC+8))
#36413 [Feat] add rmsnorm fp4 quant fusion of flashinfer — nvidia — by sparkecho (创建于: 2026-03-09 01:31 (UTC+8))
#36417 [Bugfix] Download mmproj GGUF files for multimodal models — bug — by s-zx (创建于: 2026-03-09 02:35 (UTC+8))
#36419 [Bugfix] Routed experts capturer reinitialization after KV cache profiling — bug,v1 — by HollowMan6 (创建于: 2026-03-09 02:45 (UTC+8))
#36415 Usage stats V2 design — documentation — by simon-mo (创建于: 2026-03-09 02:22 (UTC+8))
#36412 Fix/resupport nongated fused moe triton — ready,nvidia — by shaunkotek (创建于: 2026-03-08 23:21 (UTC+8))
#36409 logging: opt-in per-rank log files (Ray-friendly) (#23761) — documentation — by KOKOSde (创建于: 2026-03-08 21:35 (UTC+8))
#36380 fix: handle escaped <\think> tags in reasoning parser (closes #36207) — 无标签 — by zpassion (创建于: 2026-03-08 13:31 (UTC+8))
#36399 fix: handle model validation error in realtime WebSocket session.update — frontend — by gambletan (创建于: 2026-03-08 19:13 (UTC+8))
#36394 fix: correct timestamp drift in speech-to-text for audio > 30s — frontend — by alvinttang (创建于: 2026-03-08 18:46 (UTC+8))
#36384 Fix/fla triton autotune oom 34954 — qwen — by oneraghavan (创建于: 2026-03-08 14:35 (UTC+8))
#36404 fix(lora): fix variable shadowing in get_supported_lora_modules — 无标签 — by gambletan (创建于: 2026-03-08 19:35 (UTC+8))
#36401 fix: use byte count for realtime WebSocket audio size validation — frontend — by gambletan (创建于: 2026-03-08 19:17 (UTC+8))
#36396 fix: release VideoCapture resources and guard div-by-zero in video utils — 无标签 — by gambletan (创建于: 2026-03-08 18:56 (UTC+8))
#36392 fix: register atexit handler to prevent ImportError during FlashInfer shutdown — nvidia — by gambletan (创建于: 2026-03-08 17:51 (UTC+8))
#36403 fix(s3): paginate list_objects_v2 to return all objects — 无标签 — by alvinttang (创建于: 2026-03-08 19:34 (UTC+8))
#36400 fix: guard against ZeroDivisionError in layerwise profiler — 无标签 — by alvinttang (创建于: 2026-03-08 19:14 (UTC+8))
#36397 fix: check HTTP status in batch read_file to prevent silent failures — frontend — by alvinttang (创建于: 2026-03-08 19:12 (UTC+8))
#36393 add nemotron v3 reasoning parser — 无标签 — by shaunkotek (创建于: 2026-03-08 18:37 (UTC+8))
#36405 tests(v1): cover parallel sampling output_kind matrix for AsyncLLM and LLMEngine — v1 — by KOKOSde (创建于: 2026-03-08 19:58 (UTC+8))
#36378 fix(lora): add bounds checking in slice_lora_b for TP configurations — documentation,performance,new-model,frontend,speculative-decoding,ci/build,v1,multi-modality,deepseek,gpt-oss — by lailoo (创建于: 2026-03-08 12:57 (UTC+8))
#36391 [Core] Decouple draft-token hook from speculative decoding — v1 — by scoootscooob (创建于: 2026-03-08 16:23 (UTC+8))
#36390 [Bugfix] Fix KV-cache head-of-line blocking in scheduler — bug,v1 — by scoootscooob (创建于: 2026-03-08 16:22 (UTC+8))
#36388 [Bugfix] Fix hybrid Attention+Mamba models failing when hybrid KV cache manager is disabled — bug,v1 — by esmeetu (创建于: 2026-03-08 15:14 (UTC+8))
#36383 feat(openai): add per-request timing metrics and completion_tokens_de… — documentation,frontend — by dryu-nv (创建于: 2026-03-08 14:32 (UTC+8))
#36386 fix: clean shutdown for AR Fusion workspace on ctrl-c — nvidia — by goingforstudying-ctrl (创建于: 2026-03-08 14:56 (UTC+8))
#36381 [Doc]: Add CUDA 13.0 recommendation for GPT-OSS on Blackwell GPUs — documentation,gpt-oss,nvidia — by lailoo (创建于: 2026-03-08 13:58 (UTC+8))
#36377 fix: handle escaped <\think> tags in reasoning parser (closes #36207) — 无标签 — by zpassion (创建于: 2026-03-08 12:31 (UTC+8))
#36379 [Frontend] Fix usage incorrectly returned with empty stream_options` — frontend,v1 — by Csrayz (创建于: 2026-03-08 13:06 (UTC+8))
#36376 [Bugfix][ROCm] Memory access fault fix for full graph capture for triton-attn - Option 2 — bug,rocm,v1 — by hongxiayang (创建于: 2026-03-08 12:30 (UTC+8))
#36375 [Bugfix] Fix non-streaming/streaming inconsistency for Qwen3 reasoning when enable_thinking is not set — bug,frontend,qwen — by esmeetu (创建于: 2026-03-08 12:28 (UTC+8))
#36374 [Bugfix][ROCm] full graph capture on triton-attn fix - Option1 — bug,rocm,v1 — by hongxiayang (创建于: 2026-03-08 12:23 (UTC+8))
#36373 [Bugfix][Frontend] Do not persist load_inplace on stored LoRA requests — bug,frontend — by omkaark (创建于: 2026-03-08 12:08 (UTC+8))

已合并 PR

#36149 fix: Use iterator as not to store all the file loads in memory at once — ready — by shaunkotek (合并于: 2026-03-09 11:25 (UTC+8))
#35579 [Examples][1/n] Resettle basic examples. — documentation,ready,ci/build,cpu — by noooop (合并于: 2026-03-09 11:22 (UTC+8))
#35815 [Bugfix] Fix CPU OMP autobind assertion to use local_world_size — bug,ready,v1,cpu — by weiguangli-io (合并于: 2026-03-09 11:07 (UTC+8))
#36170 [Dependency] Remove default ray dependency — documentation,rocm,ready,ci/build,nvidia — by yewentao256 (合并于: 2026-03-09 11:06 (UTC+8))
#36398 Allow markdownlint to run locally — documentation,performance,ready,ci/build,cpu,kv-connector,nvidia — by hmellor (合并于: 2026-03-09 11:05 (UTC+8))
#36301 [XPU][Doc] update xpu document about triton dependency/conflict issue. — documentation,ready — by jikunshang (合并于: 2026-03-09 10:09 (UTC+8))
#28044 [cudagraph] fix cudagraph warning in deepseekv32 — ready,deepseek,nvidia — by ZJY0516 (合并于: 2026-03-09 08:27 (UTC+8))
#35986 Add support for ModelOpt MXFP8 MoE models — ready,nvidia,quantization — by danisereb (合并于: 2026-03-09 04:00 (UTC+8))
#36341 [CI] fix flaky empty responses and add diagnostic assertions in vision chat tests — rocm,ready — by AndreasKaratzas (合并于: 2026-03-08 14:42 (UTC+8))
#36166 [Frontend] Add GPU-less render serving path (vllm launch render) — frontend,ready,v1 — by sagearc (合并于: 2026-03-08 23:35 (UTC+8))
#35657 [Model] Nano Nemotron VL - fast media preprocessing — ready — by nvnbagrov (合并于: 2026-03-08 18:04 (UTC+8))

关闭但未合并的 PR

#31400 [wip] custom allreduce and custom unquantized_gemm — 无标签 — by wxsIcey (关闭于: 2026-03-09 10:25 (UTC+8))
#36421 [chore] Refactor the commonly shared functions for the parsers under one module — 无标签 — by taneem-ibrahim (关闭于: 2026-03-09 05:54 (UTC+8))
#36415 Usage stats V2 design — documentation — by simon-mo (关闭于: 2026-03-09 02:26 (UTC+8))
#33958 Fix double‑scaled linear RoPE cache for long‑context configs #33911 — 无标签 — by baonudesifeizhai (关闭于: 2026-03-09 01:16 (UTC+8))
#34919 [Bugfix][Model] Qwen3-Coder and Qwen3-Coder-Next: Fix qwen3_coder tool parser to handle </parameter> in parameter content — bug,qwen — by yaro-tal (关闭于: 2026-03-09 00:19 (UTC+8))
#35228 [Frontend] Disaggregate RendererClient from EngineClient [1/N] — documentation,performance,new-model,rocm,structured-output,frontend,needs-rebase,ci/build,v1,qwen — by sagearc (关闭于: 2026-03-08 21:48 (UTC+8))
#36378 fix(lora): add bounds checking in slice_lora_b for TP configurations — documentation,performance,new-model,frontend,speculative-decoding,ci/build,v1,multi-modality,deepseek,gpt-oss — by lailoo (关闭于: 2026-03-08 18:46 (UTC+8))
#36132 Handle null RoPE factor in max length derivation (+tests) — 无标签 — by siewcapital (关闭于: 2026-03-08 16:36 (UTC+8))
#36334 [Doc]: Add README.md to examples directory — documentation,nvidia — by goingforstudying-ctrl (关闭于: 2026-03-08 14:36 (UTC+8))
#36369 [Doc]: Add model card template — documentation — by goingforstudying-ctrl (关闭于: 2026-03-08 14:36 (UTC+8))
#36368 [Doc]: Add common errors and solutions guide — documentation — by goingforstudying-ctrl (关闭于: 2026-03-08 14:36 (UTC+8))
#36367 [Doc]: Add comprehensive environment variables reference — documentation — by goingforstudying-ctrl (关闭于: 2026-03-08 14:36 (UTC+8))
#36366 [Doc]: Add API migration guide — documentation — by goingforstudying-ctrl (关闭于: 2026-03-08 14:36 (UTC+8))
#36365 [Doc]: Add production deployment checklist — documentation — by goingforstudying-ctrl (关闭于: 2026-03-08 14:36 (UTC+8))
#36364 [Doc]: Add glossary of terms — documentation — by goingforstudying-ctrl (关闭于: 2026-03-08 14:36 (UTC+8))
#36363 [Doc]: Add performance tuning guide — documentation — by goingforstudying-ctrl (关闭于: 2026-03-08 14:36 (UTC+8))
#36362 [Doc]: Add APC debugging tips documentation — documentation — by goingforstudying-ctrl (关闭于: 2026-03-08 14:36 (UTC+8))
#36354 [Bugfix]: Fix unclean shutdown from ctrl-c with AR Fusion — bug,documentation,qwen,nvidia — by goingforstudying-ctrl (关闭于: 2026-03-08 14:35 (UTC+8))
#36353 [Bugfix]: Support transformers 5.x Qwen3_5MoeTextConfig — bug,documentation,qwen — by goingforstudying-ctrl (关闭于: 2026-03-08 14:35 (UTC+8))
#36352 [Bugfix]: Fix Qwen3.5 missing max_window_layers attribute — bug,documentation,qwen — by goingforstudying-ctrl (关闭于: 2026-03-08 14:35 (UTC+8))
#36351 [Doc]: Add deployment patterns guide — documentation,structured-output,tool-calling,cpu,nvidia — by goingforstudying-ctrl (关闭于: 2026-03-08 14:35 (UTC+8))
#36349 [Doc]: Add quick start cheat sheet — documentation,structured-output,tool-calling,cpu,nvidia — by goingforstudying-ctrl (关闭于: 2026-03-08 14:35 (UTC+8))
#36335 [Doc]: Add debugging tip for Automatic Prefix Caching — documentation — by goingforstudying-ctrl (关闭于: 2026-03-08 14:35 (UTC+8))
#36333 [Doc]: Add –speculative-config option documentation — documentation,nvidia — by goingforstudying-ctrl (关闭于: 2026-03-08 14:35 (UTC+8))
#36381 [Doc]: Add CUDA 13.0 recommendation for GPT-OSS on Blackwell GPUs — documentation,gpt-oss,nvidia — by lailoo (关闭于: 2026-03-08 14:08 (UTC+8))
#36377 fix: handle escaped <\think> tags in reasoning parser (closes #36207) — 无标签 — by zpassion (关闭于: 2026-03-08 13:20 (UTC+8))