vLLM Development Activity Report - 2026-03-05

Time window: 2026-03-05 11:23 (UTC+8) ~ 2026-03-06 11:23 (UTC+8)
Statistics: New issues 27 | Closed issues 31 | New PRs 96 | Merged PRs 41 | PRs closed without merging 16

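📊 Daily Development Summary

During this period (March 5 to 6, 2026), vLLM development remained highly active, with 96 new PRs and 41 merged, reflecting continued rapid iteration. Work centered on expanding multimodal model support, optimizing kernel and runtime performance (particularly CUDA graphs and torch.compile), and fixing compatibility and stability issues on the AMD ROCm platform. The community also held in-depth discussions on dLLM (diffusion language model) integration, the new hardware abstraction layer (torch.accelerator), and accuracy problems in quantized models.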

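🎯 AMD/ROCm Ecosystem Updates

AMD/ROCm was a major focus of fixes and expansion in this period, spanning multiple PRs and issues.

1. PRs: build and dependency updates

#36086 ([AMD][Build] Add DeepEP to ROCm Dockerfile): Integrates the AMD port of DeepEP, a high-performance MoE library, into the ROCm Docker image. This is an important step toward better MoE performance on AMD platforms.
#36177 ([ROCm][CI] Adding missing dependencies for Multi-modal models tests): Adds the dependencies that the multimodal model tests were missing in ROCm CI, improving test coverage.
#36179 ([ROCm][CI] Fix ROCm GPT-OSS Eval test group): Fixes the GPT-OSS evaluation test group on ROCm.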

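2. PRs: core fixes and enhancements

#35246 ([ROCm] Refactor ROCm attention backend selection logic) (merged): Refactors the ROCm attention backend selection logic to match the style used on the CUDA side, laying the groundwork for more complex backend selection in the future.
#36185 (Reenable features for ROCm attention backends): Fixes a regression from the #35246 refactor in which feature support (such as the FP8 KV cache) was mistakenly disabled for several ROCm attention backends (such as AITER_MLA).
#36100 ([ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs) and #36092 ([ROCm] Fix AITER ops fake impl and minor bugs): These two PRs fix several low-level bugs in the ROCm AITER operators, including mismatched signatures in fake implementations (fake impls), incorrect static method parameters, and typos in error messages. The fused_moe_fake signature mismatch in particular crashes torch.compile in FakeTensor mode, which affects running MXFP4-quantized models.
#36101 ([ROCm][CI] Fix logprob divergence for TitanML/tiny-mixtral under AITER rms_norm): For the untrained TitanML/tiny-mixtral model, AITER's rms_norm accumulates in bfloat16, which produces small differences from the HF reference output. The PR adds a temporary patch in the test so that CI passes, reflecting careful handling of a numerically sensitive case.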
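To make the accumulation-precision point behind #36101 concrete, the following is a minimal pure-Python sketch of how accumulating a sum of squares in bfloat16 can drift from a higher-precision accumulator. It is not AITER's kernel: the helper names (`to_bf16`, `accumulate_squares`, `rms_scale`) are hypothetical, the reduction is a naive sequential loop, and the all-ones input is a deliberately exaggerated worst case. Real kernels reduce in blocks, so the divergence reported in the PR is far smaller.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps only the top 16 bits of the float32 encoding.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate_squares(xs, in_bf16: bool) -> float:
    """Sum x*x sequentially, optionally rounding the accumulator to bfloat16."""
    acc = 0.0
    for x in xs:
        acc = acc + x * x
        if in_bf16:
            acc = to_bf16(acc)  # the accumulator never holds more than 8 significant bits
    return acc


def rms_scale(xs, in_bf16: bool, eps: float = 1e-6) -> float:
    """Return the RMSNorm scale 1 / sqrt(mean(x^2) + eps)."""
    return 1.0 / math.sqrt(accumulate_squares(xs, in_bf16) / len(xs) + eps)


if __name__ == "__main__":
    hidden = [1.0] * 4096
    print("high-precision sum:", accumulate_squares(hidden, in_bf16=False))
    print("bf16 sum:          ", accumulate_squares(hidden, in_bf16=True))
    print("scale ratio:       ", rms_scale(hidden, True) / rms_scale(hidden, False))
```

In this toy case the bfloat16 accumulator stops growing once it reaches 256, because 257 is not representable with 8 significant bits and rounds back to 256. An untrained model with unusual activation statistics makes this kind of sensitivity more visible in logprob comparisons.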

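3. Issues: user-reported compatibility and functional defects

#36193 ([Bug]: Unsupported Activation Function for Step-3.5-Flash): A user reports that the ROCm AITER quantized MoE path does not support the SwiGLU_STEP activation function used by Step-3.5-Flash, which causes a runtime error. This shows the AMD quantization stack lagging behind new model architectures.
#36105 ([ROCm][v0.16.0] Qwen3-VL-30B-A3B-Instruct-FP8 fails to start…): A user reports that this FP8 model cannot start because of the error "No FP8 MoE backend supports the deployment configuration". This highlights the missing FP8 MoE backend support on AMD platforms.
#36167 ([Bug]: Engine initialization failure with Qwen3 Omni on Strix Halo/ROCM) and #36180 ([Bug]: meta-llama/Llama-3.2-1B-Instruct Fails With ROCM_ATTN Due To Seg Fault): The first reports that the HIP runtime fails to load when running Qwen3-Omni on AMD Strix Halo (Ryzen AI) hardware. The second reports a segmentation fault when Llama-3.2-1B uses the ROCM_ATTN backend. Both are concrete examples of gaps in AMD consumer hardware and new model support.

Summary: AMD work in this period was mostly about fixing bugs and laying foundations. It resolved a series of low-level AITER operator bugs and the attention backend feature regression, and began integrating performance libraries such as DeepEP. User issues nevertheless show that compatibility challenges remain for new model architectures (Step-3.5), new data types (FP8 MoE), and specific hardware combinations.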
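The fused_moe_fake signature mismatch fixed in #36100 is the kind of bug that a simple parity check between an operator and its fake implementation can catch early. The sketch below uses only `inspect` from the standard library. The function names and parameter lists are hypothetical stand-ins, not the real AITER or vLLM signatures, and the check is an illustration rather than the approach the PR takes.

```python
import inspect


def fused_moe_real(hidden_states, w1, w2, topk_weights, topk_ids, activation="silu"):
    """Hypothetical stand-in for a real kernel entry point."""
    raise NotImplementedError


def fused_moe_fake_stale(hidden_states, w1, w2, topk_weights, topk_ids):
    """Hypothetical fake impl that was not updated when `activation` was added."""
    return hidden_states


def fused_moe_fake_fixed(hidden_states, w1, w2, topk_weights, topk_ids, activation="silu"):
    """Hypothetical fake impl whose signature matches the real op."""
    return hidden_states


def signature_mismatches(real_fn, fake_fn):
    """Return human-readable differences between two call signatures."""
    real = inspect.signature(real_fn).parameters
    fake = inspect.signature(fake_fn).parameters
    problems = []
    if list(real) != list(fake):
        problems.append(f"parameter names differ: {list(real)} vs {list(fake)}")
    for name in real:
        if name in fake and real[name].default != fake[name].default:
            problems.append(f"default for {name!r} differs")
    return problems


if __name__ == "__main__":
    for fake in (fused_moe_fake_stale, fused_moe_fake_fixed):
        print(fake.__name__, signature_mismatches(fused_moe_real, fake) or "OK")
```

A check like this could run as a unit test over every registered op and fake impl pair, so a drifted signature fails in CI instead of surfacing when tracing calls the fake impl with arguments it does not accept.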

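💬 High-Traffic Discussions

1. Issue #36155 ([RFC]: dLLM support via plugin (spec-decode path reuse))

Core question: Whether, and how, to support block diffusion language models (dLLMs) through the plugin system by reusing the data path and scheduler interfaces of speculative decoding.
Positions:
- Proposer (Red Hat): Argues that this approach achieves maximum encapsulation with minimal core changes and keeps the work aligned with the vLLM ecosystem. A concrete design has already been drafted.
- Maintainer (@benchislett): Considers the proposal reasonable overall but raises two challenges: 1) making structured outputs compatible with dLLMs, which generate multiple tokens per step, is complex and may require a large refactor or a new communication phase; 2) attention with custom masks is not equally well supported across platforms.
Point of contention: How to balance the elegance of "minimal core changes" against the feasibility of solving the hard structured-outputs problem. The discussion is still at an early technical stage.
Status: Open for discussion.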

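2. Issue #36094 ([Bug]: Qwen3.5 NVFP4 Checkpoint has poor accuracy)

Core question: The NVFP4-quantized Qwen3.5-397B checkpoint officially released by NVIDIA shows a severe accuracy drop when run on vLLM (GSM8K falls from 0.90 to about 0.35).
Positions:
- Reporter (@pavanimajety): Provides detailed test data, ties the problem to the modelopt_fp4 quantization method, and suspects the weight renormalization step.
- NVIDIA engineer (@zhewenl): States that the checkpoint's accuracy had previously been verified as matching, and asks for environment details in order to reproduce the problem.
- Other participants: Point out that the model triggers the warning "w1_weight_scale_2 must match w3_weight_scale_2" and suggest that this may affect accuracy.
Point of contention: Whether the problem is a defect in vLLM's model loading or quantization implementation, or in the weights NVIDIA published. Locating it will require both sides to work together.
Status: Under investigation. Both sides are exchanging environment and commit information.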

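3. Issue #36091 / PR #36139 ([RFC]: Add InstantTensor Support in vLLM)

Core question: Whether to integrate InstantTensor, a technique that speeds up model loading through memory mapping, as an optional load format (--load-format instanttensor).
Positions:
- Proposer: Shows that it substantially improves model loading speed and makes full use of high-speed storage bandwidth.
- Community member (@robertgshaw2-redhat): Asks whether it could become the default option and what the usage considerations are, reflecting concern about new dependencies and maintenance cost.
Point of contention: The trade-off between the performance gain of the new technique and the added dependency complexity and maintenance burden.
Status: The RFC issue is open for discussion, and the corresponding implementation PR #36139 has been submitted and awaits review.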
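For readers unfamiliar with the memory-mapping idea the RFC builds on, the sketch below shows the general technique with Python's standard `mmap` module: a slice of a weight file is read directly from the mapped pages without first loading the whole file. This is not InstantTensor's API or file format. The flat little-endian float32 layout and the helper names are assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile


def write_floats(path, values):
    """Write values as little-endian float32 (a made-up flat layout)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_floats_mmap(path, start, count):
    """Read `count` float32 values starting at element `start` via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this slice need to be faulted in.
            return list(struct.unpack_from(f"<{count}f", mm, start * 4))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "weights.bin")
        write_floats(path, [float(i) for i in range(8)])
        print(read_floats_mmap(path, start=2, count=3))
```

The appeal for model loading is that the operating system pages data in on demand and can share it across processes. Whether that should be the default is the kind of question raised in the RFC thread.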

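🔥 Hot Topics and Trends

Faster expansion of multimodal and audio model support: New PR #36127 adds support for Kimi-Audio-7B (Whisper + Qwen2), and PR #36124 adds support for MetaCLIP models. vLLM's model matrix in the multimodal space is expanding quickly.
Performance optimization in deeper waters: The focus of optimization is shifting from basic operators to more complex scheduling and composition. For example:
- PR #36182 adds CUDA graph optimization for the classify pooler.
- PR #36142 adds experimental jump decoding support for structured outputs, to speed up certain constrained generation scenarios.
- PR #36159 pushes MaxSim score computation down from the API server to the worker, reducing data transfer overhead.
Continued investment and challenges on AMD platforms: As noted above, ROCm CI stabilization, AITER kernel fixes, and new hardware support remain ongoing themes. User issues cluster around usability problems on consumer AMD GPUs and with complex quantized models.
Developer tooling and experience: PR #36135 optimizes CI so that docs are built only when a PR carries the documentation or ready label, which saves resources. This reflects finer-grained management of CI efficiency as the project scales.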
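The jump decoding idea mentioned for PR #36142 can be illustrated with a toy constrained decoder: whenever the grammar allows exactly one continuation, that continuation is emitted without asking the model. The sketch below is an assumption-laden illustration, not the PR's implementation. The grammar is a hand-written state machine, the "tokens" are plain strings, and `choose` is a hypothetical stand-in for a model forward pass plus sampling.

```python
def constrained_decode(transitions, start, accept, choose):
    """Decode under a toy grammar, skipping the model when only one token is legal.

    transitions maps state -> {token: next_state}.
    choose(options) picks one token from a sorted list of legal tokens.
    """
    state, out = start, []
    model_calls = jumped = 0
    while state != accept:
        allowed = transitions[state]
        if len(allowed) == 1:
            # Forced continuation: "jump" over it without a model call.
            token, next_state = next(iter(allowed.items()))
            jumped += 1
        else:
            token = choose(sorted(allowed))
            next_state = allowed[token]
            model_calls += 1
        out.append(token)
        state = next_state
    return "".join(out), model_calls, jumped


# Toy grammar that accepts exactly {"ok": true} or {"ok": false}.
TOY_GRAMMAR = {
    0: {'{"ok": ': 1},
    1: {"true": 2, "false": 2},
    2: {"}": 3},
}

if __name__ == "__main__":
    text, calls, jumps = constrained_decode(TOY_GRAMMAR, 0, 3, lambda opts: opts[0])
    print(text, "| model calls:", calls, "| jumped segments:", jumps)
```

In this toy grammar only the boolean value requires a decision, so two of the three segments are emitted without a model call. A real engine also has to handle the case where forced text does not align with tokenizer boundaries, which this sketch ignores.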

🛠️ Key Technical Changes

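PR #35246 ([ROCm] Refactor ROCm attention backend selection logic) (merged): An infrastructure-level refactor for the AMD ecosystem. It consolidates ROCm attention backend selection logic that may previously have been scattered, clearing the way for more backends and finer-grained feature selection (similar to CUDA). It is a key step for the long-term maintainability of the AMD platform.
PR #36162 ([Mamba] Flashinfer selective_state_update): Adds a FlashInfer backend as a new option for the selective state update kernel in Mamba-architecture models (such as Nemotron 3 Nano), and introduces MambaConfig to centralize the related settings. This broadens the backend choices for state space models (SSMs) and gives users an additional option for performance tuning.
PR #36085 and #36145 ([Hardware] Replace torch.cuda.* API): These PRs systematically replace APIs such as torch.cuda.device_count() and torch.cuda.synchronize() with their counterparts in the torch.accelerator namespace. This is foundational migration work that follows PyTorch's move toward hardware abstraction and prepares for better support of other accelerators such as XPU and NPU.
PR #35866 (Order config.py in Lexicographical order) (merged): A seemingly small change that reorders a large model configuration file alphabetically, making it easier to read and maintain. It reflects ongoing attention to code quality.
PR #36176 ([Model Runner V2] Fix warmup for very small kvcache and/or blocksizes) (merged): Fixes a warmup failure that could occur when the number of KV cache blocks or the block size is set very small (for example, to test preemption). This makes the system more robust under extreme configurations.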
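The torch.cuda to torch.accelerator migration described for #36085 and #36145 can be sketched as a small compatibility shim. The wrapper names below are hypothetical and are not vLLM code. The fallback to torch.cuda on older PyTorch builds, and the guarded import that keeps the sketch runnable without PyTorch installed, are assumptions added for illustration rather than something the PRs are known to do.

```python
try:
    import torch
except ImportError:  # keep the sketch importable on machines without PyTorch
    torch = None


def _accelerator():
    """Return the torch.accelerator module if this PyTorch build provides it."""
    return getattr(torch, "accelerator", None) if torch is not None else None


def device_count() -> int:
    """Device-agnostic replacement for torch.cuda.device_count()."""
    acc = _accelerator()
    if acc is not None:
        return acc.device_count()
    if torch is not None:
        return torch.cuda.device_count()  # older PyTorch without torch.accelerator
    return 0


def synchronize() -> None:
    """Device-agnostic replacement for torch.cuda.synchronize()."""
    acc = _accelerator()
    if acc is not None and acc.is_available():
        acc.synchronize()
    elif torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


if __name__ == "__main__":
    print("visible accelerator devices:", device_count())
    synchronize()
```

The point of the migration is that the same call sites work whether the active accelerator is CUDA, ROCm, XPU, or another backend that PyTorch exposes through torch.accelerator.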

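📈 Development Activity Observations

Contributor diversity: Beyond the core team (such as @DarkLight1337, @yewentao256, @zou3519), contributors from Red Hat (@robertgshaw2-redhat, @benchislett), contributors to AMD-related work (such as @SageMoore, @AndreasKaratzas), and many independent developers (such as @OiPunk, @AjAnubolu) were all active.
Active ROCm contributors: ROCm-focused contributors (such as @SageMoore and @AndreasKaratzas) submitted multiple PRs for ROCm fixes and CI maintenance, reflecting sustained investment in vLLM on AMD platforms.
Review and merge cadence: 41 PRs were merged in a single day, which shows strong review and merge capacity in the core team. At the same time, a number of PRs carry the needs-rebase label, which indicates that the main branch moves quickly and contributors need to sync with upstream promptly.
Cross-team collaboration: In the dLLM RFC (#36155) and the NVFP4 accuracy issue (#36094), engineers from different organizations, including Red Hat and NVIDIA, engaged in detailed technical discussion, a sign of healthy cross-company collaboration.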

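💡 Issues Worth Watching

Trade-offs in the dLLM support path: The discussion in Issue #36155 touches the extension boundary of vLLM's core scheduler. Whether and how to support models that generate multiple tokens per step is a far-reaching design decision, and the outcome will shape how well vLLM accommodates new architectures.
Accuracy and compatibility of new quantization formats such as NVFP4: Issue #36094 shows that cutting-edge quantization techniques such as NVFP4 can run into accuracy problems when deployed on vLLM. This is not only a debugging problem. It also concerns the quality assurance process for integrating vLLM with hardware vendors' quantization toolchains.
Compatibility of AMD consumer hardware with newer models: Issues #36167 and #36180 show that running models such as Qwen3-Omni and Llama-3.2-1B-Instruct on AMD platforms, including emerging consumer hardware such as Ryzen AI (Strix Halo), still hits low-level runtime errors. Solving these "last mile" compatibility problems is essential to broadening vLLM adoption in the AMD ecosystem.
System prompt caching: PR #36196 proposes caching the token IDs of system prompts to speed up repeated requests. This is a common optimization point, and its implementation is worth watching because it could have a noticeable effect in high-concurrency deployments with a fixed system prompt.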
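The system prompt caching idea from PR #36196 amounts to memoizing tokenization results for prompts that repeat across requests. The following is a minimal sketch under stated assumptions: `PromptTokenCache` is a hypothetical class, the tokenizer is a toy function, and LRU eviction is one possible policy rather than what the PR necessarily implements.

```python
from collections import OrderedDict


class PromptTokenCache:
    """Small LRU cache from system prompt text to its token IDs."""

    def __init__(self, tokenize, max_entries=128):
        self._tokenize = tokenize
        self._max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, system_prompt):
        cached = self._entries.get(system_prompt)
        if cached is not None:
            self._entries.move_to_end(system_prompt)  # mark as recently used
            self.hits += 1
            return cached
        self.misses += 1
        token_ids = tuple(self._tokenize(system_prompt))  # immutable, safe to share
        self._entries[system_prompt] = token_ids
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return token_ids


def toy_tokenize(text):
    """Toy tokenizer: one ID per whitespace-separated word (its length)."""
    return [len(word) for word in text.split()]


if __name__ == "__main__":
    cache = PromptTokenCache(toy_tokenize, max_entries=2)
    for _ in range(3):
        cache.get("You are a helpful assistant.")
    print("hits:", cache.hits, "misses:", cache.misses)
```

With a fixed system prompt, every request after the first skips tokenization of that prefix. Keying on the exact prompt text means any change to the prompt simply produces a new entry.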

📋 Appendix: Detailed Data

New Issues

#36203 [Bug]: CUDAGraphMode.FULL not supported with ChunkedLocalAttention backend for Llama4 models | bug | by huydhn (created: 2026-03-06 11:10 (UTC+8))
#36202 [Bug]: Symbol not found: __ZN3c1013MessageLoggerC1EPKcii Referenced from: <6A12389C-7A10-3CA4-BEDF-893991822933> /opt/anaconda3/envs/vllm-inference/lib/python3.11/site-packages/vllm/_C.abi3.so | bug | by LuWei6896 (created: 2026-03-06 10:59 (UTC+8))
#36116 [Bug]: Pseudo-Streaming Issue When Using Tool Parser with Qwen3-Coder and MiniMax-M2 | bug | by delwen123 (created: 2026-03-05 15:45 (UTC+8))
#36148 [Bug]: Qwen3-VL-Reranker-8B fails with shape mismatch error when loading with --quantization bitsandbytes | bug | by xl2014 (created: 2026-03-05 22:29 (UTC+8))
#36122 [Bug]: Protocol inconsistency between Scheduler and Runner when using Speculative Decoding | bug | by zengliejie (created: 2026-03-05 17:15 (UTC+8))
#36193 [Bug]: Unsupported Activation Function for Step-3.5-Flash | bug,rocm | by ColinZ22 (created: 2026-03-06 08:39 (UTC+8))
#36189 [Feature]: Per-Request Timing Headers (--enable-request-stats-headers) | feature request | by vrdn-23 (created: 2026-03-06 08:01 (UTC+8))
#36094 [Bug]: Qwen3.5 NVFP4 Checkpoint has poor accuracy | bug,qwen,nvidia | by pavanimajety (created: 2026-03-05 13:09 (UTC+8))
#36180 [Bug]: meta-llama/Llama-3.2-1B-Instruct Fails With ROCM_ATTN Due To Seg Fault | bug,rocm | by micah-wil (created: 2026-03-06 06:13 (UTC+8))
#36175 [Feature][Cleanup]: Unify RMSNormGated and FusedRMSNormGated | bug | by eellison (created: 2026-03-06 04:35 (UTC+8))
#36167 [Bug]: Engine initialization failure with Qwen3 Omni on Strix Halo/ROCM | bug,rocm | by Eoghanmc22 (created: 2026-03-06 02:21 (UTC+8))
#36155 [RFC]: dLLM support via plugin (spec-decode path reuse) | RFC | by AlonKellner-RedHat (created: 2026-03-06 00:26 (UTC+8))
#36096 [Bug]: Initialization failure when using Qwen3-Omni | bug | by Eoghanmc22 (created: 2026-03-05 13:19 (UTC+8))
#36151 [Bug]: Inconsistent PP layer indexing in EAGLE model code | bug | by benchislett (created: 2026-03-05 23:37 (UTC+8))
#36091 [RFC]: Add InstantTensor Support in vLLM | RFC | by arlo-aisys (created: 2026-03-05 12:47 (UTC+8))
#36137 [Bug]: Title: Embeddings from Qwen3-VL-Embedding-8B loaded via VLLM fail to effectively distinguish image-text pair similarities compared to native transformers | bug | by xl2014 (created: 2026-03-05 20:39 (UTC+8))
#36141 [Bug]: Vllm models crashes when using runai_streamer & --api-server-count > 1 | bug | by Sanches166 (created: 2026-03-05 21:03 (UTC+8))
#36117 [Bug]: Qwen3-VL-2B-Instruct produces significantly different (degraded) outputs on v0.12.0 compared to v0.11.0 | no labels | by AGENDD (created: 2026-03-05 15:49 (UTC+8))
#36123 [Usage]: Why does deploying the Qwen3-VL-30B-A3B MoE model with vLLM 0.14.0 drive CPU load to nearly 100% during model initialization, while deploying Qwen3-VL-32B in the same environment does not? | usage | by sanshi9523 (created: 2026-03-05 17:59 (UTC+8))
#36109 [Bug]: TypeError when rope_scaling.factor is null in model config | bug | by yuanheng-zhao (created: 2026-03-05 15:16 (UTC+8))
#36112 [Bug]: bge-large-en-v1.5 embedding model can not be use when I use vllm to deploy the embedding model. | bug | by zhujiajian98 (created: 2026-03-05 15:30 (UTC+8))
#36105 [ROCm][v0.16.0] Qwen3-VL-30B-A3B-Instruct-FP8 fails to start with NotImplementedError: No FP8 MoE backend supports the deployment configuration | bug,rocm | by gbdjxgp (created: 2026-03-05 14:38 (UTC+8))
#36103 [Bug]: torch._inductor.exc.InductorError: | bug | by crimoc-lgtm (created: 2026-03-05 14:25 (UTC+8))
#36095 [Feature]: Add fused MoE kernel tuning configs for NVIDIA GeForce RTX 5090 (int4_w4a16) | no labels | by Anoneeeemus (created: 2026-03-05 13:13 (UTC+8))
#36099 [Bug]: ImportError: libcudart.so.12: cannot open shared object file: No such file or directory | bug | by crimoc-lgtm (created: 2026-03-05 13:42 (UTC+8))
#36082 Sharing practical experience with AI inference efficiency tools | no labels | by zhuxunyu (created: 2026-03-05 11:40 (UTC+8))
#36084 [Bug]: prefix cache hit rate dropped to 0, and running seqs also gradually dropped to 0 | bug | by edc3000 (created: 2026-03-05 11:50 (UTC+8))

Closed Issues

#26093 [Installation]: non-CUDA x86 vLLM v0.10.2 wheel is CUDA dependent | installation,stale | by (closed: 2026-03-06 10:34 (UTC+8))
#27626 [Bug]: gpt-oss-120b with EAGLE3 Speculative decoding, awful benchmarks | bug,stale | by icsy7867 (closed: 2026-03-06 10:33 (UTC+8))
#28097 [Bug]: ZeroDivisionError caused by dividing by pbar.format_dict["elapsed"] in LLM._run_engine() when use_tqdm=True | bug,stale | by KnightChaser (closed: 2026-03-06 10:32 (UTC+8))
#28102 [Bug]: switching between multiple LoRAs in multimodal scenario takes long time | bug,stale | by caraxl (closed: 2026-03-06 10:32 (UTC+8))
#28107 [Bug]: llama 4 + fp4 is broke | bug,stale | by chenhao862 (closed: 2026-03-06 10:32 (UTC+8))
#28115 [Installation]: Request to include vllm==0.10.2 for cuda 11.8 | installation,stale | by ppalantir (closed: 2026-03-06 10:32 (UTC+8))
#28121 [Bug]: Cannot load llmcompressor 3-bit quantized models but can load AutoGPTQ ones | bug,stale | by mratsim (closed: 2026-03-06 10:32 (UTC+8))
#28131 [Bug]: Batch Invariant Failed when enable cudagraph on L40s (Ada) GPU | bug,stale | by fopdoodle8 (closed: 2026-03-06 10:32 (UTC+8))
#28134 [Bug]: Streaming doesnt start until the response is completed | bug,stale | by baughmann (closed: 2026-03-06 10:32 (UTC+8))
#28138 [Feature]: Support Shared Memory (DGX Spark) | feature request,stale | by swtb3-ryder (closed: 2026-03-06 10:32 (UTC+8))
#28143 [Bug]: Model not loaded on proper GPU despite setting CUDA device | bug,stale | by jordanallred (closed: 2026-03-06 10:32 (UTC+8))
#31030 [Performance]: Regression between 0.10.2 and 0.12.0 | performance | by maximegmd (closed: 2026-03-06 08:25 (UTC+8))
#35967 [Bug]: Value error, invalid literal for int() with base 10: '4.0' [type=value_error, input_value=ArgsKwargs((), {'model_co…transfer_config': None}), input_type=ArgsKwargs] | bug | by LuWei6896 (closed: 2026-03-06 05:16 (UTC+8))
#36096 [Bug]: Initialization failure when using Qwen3-Omni | bug | by Eoghanmc22 (closed: 2026-03-06 01:33 (UTC+8))
#32992 [Bug]: Batch invariance fails on NVIDIA B200 (Blackwell) with torch.compile | bug,torch.compile | by ZhanqiuHu (closed: 2026-03-06 00:01 (UTC+8))
#36075 [Bug]: Garbled rollouts with GLM5 if VLLM_USE_DEEP_GEMM is not set | bug | by S1ro1 (closed: 2026-03-05 23:26 (UTC+8))
#34205 [Bug]: Set env ROCP_TOOL_ATTACH=1 caused vllm server stopped | bug,rocm | by BigFaceBoy (closed: 2026-03-05 23:14 (UTC+8))
#36137 [Bug]: Title: Embeddings from Qwen3-VL-Embedding-8B loaded via VLLM fail to effectively distinguish image-text pair similarities compared to native transformers | bug | by xl2014 (closed: 2026-03-05 22:05 (UTC+8))
#35717 [Bug]: RunAI streamer breaks in 0.15.1 | bug | by Sanches166 (closed: 2026-03-05 20:14 (UTC+8))
#35004 [RFC]: Realtime Endpoint Metrics | RFC | by pougetat (closed: 2026-03-05 19:06 (UTC+8))
#29056 [Bug]: Cannot use Qwen3 Next autoround quant model with 0.11.1 | bug,stale | by ariable (closed: 2026-03-05 17:24 (UTC+8))
#36112 [Bug]: bge-large-en-v1.5 embedding model can not be use when I use vllm to deploy the embedding model. | bug | by zhujiajian98 (closed: 2026-03-05 16:57 (UTC+8))
#35991 [Installation]: Making server with GPT-OSS-20B on vllm+openwebui rtx5080 16gb | installation | by Ed-test-s (closed: 2026-03-05 16:48 (UTC+8))
#35926 [Bug]: Illegal Memory Access in NemotronH MTP | bug | by benchislett (closed: 2026-03-05 14:10 (UTC+8))
#36095 [Feature]: Add fused MoE kernel tuning configs for NVIDIA GeForce RTX 5090 (int4_w4a16) | no labels | by Anoneeeemus (closed: 2026-03-05 14:04 (UTC+8))
#36099 [Bug]: ImportError: libcudart.so.12: cannot open shared object file: No such file or directory | bug | by crimoc-lgtm (closed: 2026-03-05 13:49 (UTC+8))
#11905 [Feature]: Support Multiple Tasks Per Model | feature request,stale | by FurtherAI (closed: 2026-03-05 12:40 (UTC+8))
#36082 Sharing practical experience with AI inference efficiency tools | no labels | by zhuxunyu (closed: 2026-03-05 12:35 (UTC+8))
#4435 [Feature]: option to return hidden states | feature request,unstale | by zhenlan0426 (closed: 2026-03-05 12:13 (UTC+8))
#6165 [Feature]: Return hidden states (in progress?) | feature request,unstale | by Elanmarkowitz (closed: 2026-03-05 12:13 (UTC+8))
#24288 [RFC]: Support Returning Prompt Hidden States | RFC | by charlotte12l (closed: 2026-03-05 12:13 (UTC+8))

New PRs

#36204 use skip_all_guards_unsafe to drop global_state and torch_function_mode_stack guards instead of previous hacks | no labels | by laithsakka (created: 2026-03-06 11:19 (UTC+8))
#36195 [TEST ONLY]Restore AsyncTP fusion for FlashInfer FP8 BMM (need advices)#27893 | ci/build,nvidia | by baonudesifeizhai (created: 2026-03-06 10:29 (UTC+8))
#36192 [Security] Respect user trust_remote_code setting in NemotronVL and KimiK25 | ready | by russellb (created: 2026-03-06 08:31 (UTC+8))
#36199 [Bugfix] Fix Qwen3.5 Marlin TP failure for GDN in_proj_ba | bug,qwen | by AjAnubolu (created: 2026-03-06 10:39 (UTC+8))
#36196 [Frontend] Cache system prompt token IDs across requests | no labels | by AjAnubolu (created: 2026-03-06 10:36 (UTC+8))
#36144 replace with torch.cuda.device with with torch.accelerator.device_index | performance,rocm,kv-connector,nvidia | by yma11 (created: 2026-03-05 21:44 (UTC+8))
#36110 [Frontend][2/n] Improve pooling entrypoints embed. | frontend | by noooop (created: 2026-03-05 15:22 (UTC+8))
#36124 [Model] Add MetaCLIP and MetaCLIP-2 Support | documentation,new-model,multi-modality | by KeriaGuma (created: 2026-03-05 18:06 (UTC+8))
#36150 [bugfix] add api process rank in default multimodal request | bug,ready | by fake0fan (created: 2026-03-05 23:19 (UTC+8))
#36201 [openapi server] log exception in exception handler(2/N) | frontend | by andyxning (created: 2026-03-06 10:55 (UTC+8))
#36197 [Bugfix] Fix misleading context length error messages | bug,frontend | by AjAnubolu (created: 2026-03-06 10:37 (UTC+8))
#36194 Replace shape_invariants with simpler apprach in dynamic_arg_dims utilizing shape_id property. | llama,qwen | by laithsakka (created: 2026-03-06 09:22 (UTC+8))
#36191 [BugFix] avoid infinite loop with VLLM_PORT and get_open_ports_list | bug,ready | by walterbm (created: 2026-03-06 08:11 (UTC+8))
#36200 Fix multi-node WorkerProc init ordering and compilation_time None | v1,meta-exported,fb-exported | by ananyakgarg (created: 2026-03-06 10:42 (UTC+8))
#36111 [Perf] add cute dsl kernel for gdn decode | qwen | by ZJY0516 (created: 2026-03-05 15:30 (UTC+8))
#36198 [Test] Add unit tests for GDN fused recurrent kernel | no labels | by AjAnubolu (created: 2026-03-06 10:39 (UTC+8))
#36136 [Bugfix] Fix Qwen3-VL timestamp mismatch when using num_frames without fps | bug,multi-modality,qwen | by OiPunk (created: 2026-03-05 20:31 (UTC+8))
#36127 [Model] Add support for moonshotai/Kimi-Audio-7B-Instruct | documentation,new-model,frontend,multi-modality | by tunglinwood (created: 2026-03-05 18:33 (UTC+8))
#36145 [Hardware] Replace torch.cuda.device_count/current_device/set_device API | documentation,performance,speculative-decoding,ready,v1,multi-modality,kv-connector,nvidia | by jikunshang (created: 2026-03-05 21:47 (UTC+8))
#36159 [Perf] Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement | frontend,ready,v1 | by yewentao256 (created: 2026-03-06 01:32 (UTC+8))
#36186 [Bugfix] Fix WorkerProc init order for multi-node TP | bug,v1 | by 842974287 (created: 2026-03-06 07:29 (UTC+8))
#36168 [Build] Upgrade xgrammar to get a security fix | ready,ci/build | by russellb (created: 2026-03-06 02:43 (UTC+8))
#36185 Reenable features for ROCm attention backends | documentation,rocm,ready,v1 | by Rohan138 (created: 2026-03-06 07:25 (UTC+8))
#36101 [ROCm][CI] Fix logprob divergence for TitanML/tiny-mixtral under AITER rms_norm | rocm,ready | by AndreasKaratzas (created: 2026-03-05 14:06 (UTC+8))
#36177 [ROCm][CI] Adding missing dependencies for Multi-modal models tests | rocm,ready,ci/build | by AndreasKaratzas (created: 2026-03-06 06:02 (UTC+8))
#36179 [ROCm][CI] Fix ROCm GPT-OSS Eval test group | rocm,ready,ci/build,gpt-oss | by AndreasKaratzas (created: 2026-03-06 06:10 (UTC+8))
#36188 [docs] Add docs for new RL flows | documentation | by hao-aaron (created: 2026-03-06 07:58 (UTC+8))
#36088 Don't fire ray compatibility webhook when PR or branch is not provided | ready,ci/build | by jeffreywang-anyscale (created: 2026-03-05 12:22 (UTC+8))
#36183 Fix: Clone NVFP4 MoE weights on SM121 to prevent Marlin kernel NaN | no labels | by scottgl9 (created: 2026-03-06 06:51 (UTC+8))
#36190 fix(qwen3.5): prevent false gate_proj match from dropping MoE router gate weights | qwen | by scottgl9 (created: 2026-03-06 08:02 (UTC+8))
#36187 [BugFix] Ensure contiguous input tensor in LoRA shrink kernel | bug | by RunkaiTao (created: 2026-03-06 07:49 (UTC+8))
#36164 perf: add slots to KVCacheBlock | ready,v1 | by cong-or (created: 2026-03-06 01:56 (UTC+8))
#36184 Yejin/fix warmup drain | performance,frontend,tpu,needs-rebase,v1,cpu | by YJYJLee (created: 2026-03-06 07:24 (UTC+8))
#36176 [Model Runner V2] Fix warmup for very small kvcache and/or blocksizes | ready,v1 | by njhill (created: 2026-03-06 04:57 (UTC+8))
#36182 [Performance] Add CUDA graph optimization for classify pooler | v1,nvidia | by ilinam (created: 2026-03-06 06:25 (UTC+8))
#36157 [CI] Add mandatory H100 TP=2 smoke test | ci/build,v1 | by stecasta (created: 2026-03-06 01:21 (UTC+8))
#36181 [ROCm][CI] Upgrading orchestrator to handle python pipeline markers and options | rocm,ci/build | by AndreasKaratzas (created: 2026-03-06 06:18 (UTC+8))
#36178 [Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking | bug,v1 | by LucasWilkinson (created: 2026-03-06 06:02 (UTC+8))
#36174 [ROCm][CI] Enable AITER for failing test_gpt_oss test case on MI355 | rocm,gpt-oss | by micah-wil (created: 2026-03-06 04:28 (UTC+8))
#36128 [Bugfix] Fix float-like CPU env int parsing for #35967 | bug | by MatteoFari (created: 2026-03-05 18:40 (UTC+8))
#36172 [Feat] Add vllm eval CLI subcommand integrating lm_eval accuracy and perf benchmarking | documentation,performance,frontend | by AndreasKaratzas (created: 2026-03-06 03:45 (UTC+8))
#36083 Add adaptive decode chunking for SM100 fused TRTLLM path (TMP FIX)#34988 | v1,nvidia | by baonudesifeizhai (created: 2026-03-05 11:46 (UTC+8))
#36098 [compile] Split compile/warmup monitoring | needs-rebase | by zou3519 (created: 2026-03-05 13:37 (UTC+8))
#36161 Add 320 dimension size support to MLA | ready | by juliendenize (created: 2026-03-06 01:44 (UTC+8))
#36171 [Refactor] Remove unused dead code | frontend,ready,kv-connector | by yewentao256 (created: 2026-03-06 03:42 (UTC+8))
#36173 Change "following fields were present in the request but ignored" log from warn to debug | frontend | by tlrmchlsmth (created: 2026-03-06 03:52 (UTC+8))
#36170 [Dependency] Remove default ray dependency | documentation,rocm,ready,ci/build,nvidia | by yewentao256 (created: 2026-03-06 03:25 (UTC+8))
#36162 [Mamba] Flashinfer selective_state_update | ci/build,v1 | by roikoren755 (created: 2026-03-06 01:45 (UTC+8))
#36165 [Bugfix] Fix cudagraph_mode:FULL dispatch (This does not impact FULL_AND_PIECEWISE (default)) | bug,ready,v1,nvidia | by TQCB (created: 2026-03-06 01:58 (UTC+8))
#36119 [Bugfix] Guard fabric handle APIs for CUDA < 12.4 compatibility | bug,nvidia | by aimbit-ni (created: 2026-03-05 16:15 (UTC+8))
#36169 feat(grpc): extract gRPC servicer into smg-grpc-servicer package, add --grpc flag to vllm serve | rocm,frontend,ci/build | by CatherineSue (created: 2026-03-06 03:01 (UTC+8))
#36166 [Frontend] Add GPU-less render serving path (vllm launch render) | frontend,needs-rebase,v1 | by sagearc (created: 2026-03-06 02:20 (UTC+8))
#36160 [Frontend] Add Support for MM Encoder/Decoder Beam Search (Online Transcriptions) | documentation,frontend,v1 | by alex-jw-brooks (created: 2026-03-06 01:34 (UTC+8))
#36163 Add support to Mistral large 3 eagle with dense layers | speculative-decoding | by juliendenize (created: 2026-03-06 01:47 (UTC+8))
#36158 [Misc] Rename group_mm_kwargs_by_modality -> group_and_batch_mm_kwargs | ready,v1,multi-modality | by DarkLight1337 (created: 2026-03-06 01:27 (UTC+8))
#36153 [Frontend] Add Support for MM Encoder/Decoder Beam Search (Offline) | frontend,ready,multi-modality | by alex-jw-brooks (created: 2026-03-06 00:04 (UTC+8))
#36146 [Bugfix] Disable FlashInfer TRTLLM BF16 path for non-gated MoE | bug,ready,nvidia | by tomeras91 (created: 2026-03-05 22:16 (UTC+8))
#36147 cpu: aarch64: Upgrade OneDNN for aarch64 to add support for int8 matmul | ci/build,cpu | by nikhil-arm (created: 2026-03-05 22:29 (UTC+8))
#36133 ParakeetProjection.norm = RMSNorm instead of nn.LayerNorm | ready | by netanel-haber (created: 2026-03-05 20:02 (UTC+8))
#36107 [CI] Stabilize test_no_args_tool_call and add ROCm-specific server args | rocm,ready | by AndreasKaratzas (created: 2026-03-05 14:55 (UTC+8))
#36092 [ROCm] Fix AITER ops fake impl and minor bugs | rocm,needs-rebase | by ChuanLi1101 (created: 2026-03-05 12:57 (UTC+8))
#36156 [Bugfix] Fix simple Mistral-Small example | bug,documentation,ready | by DarkLight1337 (created: 2026-03-06 00:34 (UTC+8))
#36149 fix: Use iterator as not to store all the file loads in memory at once | no labels | by shaunkotek (created: 2026-03-05 22:47 (UTC+8))
#36140 [Bugfix] Fix Qwen-VL tokenizer implementation | bug,performance,ready,qwen | by DarkLight1337 (created: 2026-03-05 21:03 (UTC+8))
#36154 chunk audio clips by 30 seconds to match mcore | no labels | by netanel-haber (created: 2026-03-06 00:09 (UTC+8))
#36108 refactor funasr model. | ready,qwen | by AllenDou (created: 2026-03-05 14:57 (UTC+8))
#36152 [compile] Stop unconditionally patching constrain_to_fx_strides | no labels | by zou3519 (created: 2026-03-05 23:54 (UTC+8))
#36120 Fix Eagle3 with Transformers modelling backend | ready | by hmellor (created: 2026-03-05 16:21 (UTC+8))
#36093 [torch.compile] Use FakeTensors instead of real GPU tensors for single-size compilation | no labels | by zou3519 (created: 2026-03-05 12:58 (UTC+8))
#36138 [Bugfix] Grammar was ignored when reasoning ended within speculated tokens | bug,structured-output,v1 | by sfbemerk (created: 2026-03-05 20:43 (UTC+8))
#36130 [BugFix] Add opt-in request watchdog to abort stuck requests (#33099) | bug,v1 | by KrxGu (created: 2026-03-05 19:35 (UTC+8))
#36139 [Feature] Add support for InstantTensor | documentation,ci/build | by arlo-aisys (created: 2026-03-05 21:00 (UTC+8))
#36135 [Docs] Only build docs if documentation or ready labels are present | documentation,ready | by hmellor (created: 2026-03-05 20:24 (UTC+8))
#36106 [Bugfix] Add safety check and fallback for null scaling factor | bug | by yuanheng-zhao (created: 2026-03-05 14:54 (UTC+8))
#36142 [Feature]: Initial Implementation Jump Decoding Guidance | structured-output,v1 | by FredericOdermatt (created: 2026-03-05 21:05 (UTC+8))
#36143 [Core] Add --hybrid-kv-cache-group-size to override KV cache grouping for hybrid attn models | v1 | by mzmssg (created: 2026-03-05 21:14 (UTC+8))
#36115 [Chore] Correct MTP models test registry ordering | ready | by Isotr0py (created: 2026-03-05 15:42 (UTC+8))
#36129 [LMCache] Pass TP size in lookup for MLA multi-reader locking | kv-connector | by maobaolong (created: 2026-03-05 19:31 (UTC+8))
#36134 # [Feature] Add EPLB Support for Minimax M2 Model | no labels | by ivyilike (created: 2026-03-05 20:15 (UTC+8))
#36131 docs: add missing docstrings for SamplingParams methods | no labels | by Yuxin1999 (created: 2026-03-05 19:49 (UTC+8))
#36132 Handle null RoPE factor in max length derivation (+tests) | no labels | by siewcapital (created: 2026-03-05 19:59 (UTC+8))
#36114 [Bugfix] Fix mypy errors in hermes_tool_parser.py | bug,ready | by 842974287 (created: 2026-03-05 15:40 (UTC+8))
#36126 Fix TypeError when rope factor is null during max length derivation | no labels | by siewcapital (created: 2026-03-05 18:31 (UTC+8))
#36085 [Hardware] Replace torch.cuda.synchronize() api with torch.accelerator.synchronize | documentation,performance,ready,v1,nvidia,ready-run-all-tests | by jikunshang (created: 2026-03-05 11:57 (UTC+8))
#36125 Perf: Relax CUDA kernel condition for MoE INT4 | nvidia | by xueliangyang-oeuler (created: 2026-03-05 18:15 (UTC+8))
#36100 [ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs | rocm,ready,v1 | by ChuanLi1101 (created: 2026-03-05 13:45 (UTC+8))
#36118 Set default value for --ready-check-timeout-sec flag in benchmarking script | performance | by almaslof (created: 2026-03-05 15:56 (UTC+8))
#36121 README Sylee | documentation,kv-connector | by brianlsy98 (created: 2026-03-05 16:28 (UTC+8))
#36104 [CI] Bump mypy version to 1.19.1 | ready,v1,kv-connector | by hmellor (created: 2026-03-05 14:26 (UTC+8))
#36113 fix: handle float-like strings in int() parsing to prevent ValidationError | no labels | by OiPunk (created: 2026-03-05 15:33 (UTC+8))
#36102 [Frontend] Add gRPC server support for vllm launch render | frontend,ci/build | by hyeongyun0916 (created: 2026-03-05 14:19 (UTC+8))
#36097 [WIP][Model Runner V2] Support multi-modal embeddings for spec decode model | needs-rebase,v1 | by TheEpicDolphin (created: 2026-03-05 13:23 (UTC+8))
#36087 [CI] Don't leave docs preview comment on closed PRs | ready,ci/build | by hmellor (created: 2026-03-05 12:19 (UTC+8))
#36086 [AMD][Build] Add DeepEP to ROCm Dockerfile | rocm,ci/build | by rjrock (created: 2026-03-05 12:15 (UTC+8))
#36090 [ROCm][CI] Making some tests optional to reduce workload | rocm,ci/build | by AndreasKaratzas (created: 2026-03-05 12:46 (UTC+8))
#36089 [Bugfix] Handle TimeoutError in Voxtral buffer_realtime_audio to prevent silent hang | bug | by OiPunk (created: 2026-03-05 12:32 (UTC+8))

Merged PRs

#35136 [Release] Include source distribution (sdist) in PyPI uploads | ready,ci/build | by dougbtv (merged: 2026-03-05 17:43 (UTC+8))
#31164 [openai api] log exception in exception handler (1/N) | frontend,ready,v1 | by andyxning (merged: 2026-03-06 00:00 (UTC+8))
#36088 Don't fire ray compatibility webhook when PR or branch is not provided | ready,ci/build | by jeffreywang-anyscale (merged: 2026-03-06 08:42 (UTC+8))
#35384 [Performance] Extract KV-cache update from TreeAttention backend | ready,v1 | by dorhuri123 (merged: 2026-03-06 08:22 (UTC+8))
#36176 [Model Runner V2] Fix warmup for very small kvcache and/or blocksizes | ready,v1 | by njhill (merged: 2026-03-06 06:48 (UTC+8))
#35810 [compile] Consistent compiler config for saved/loaded vllm backends. | ready | by zhxchen17 (merged: 2026-03-06 04:08 (UTC+8))
#32550 [Model] Add support for OLMo Hybrid | documentation,new-model,ready | by yanhong-lbh (merged: 2026-03-06 03:51 (UTC+8))
#35775 [CI] Add explicit permissions to macOS smoke test workflow | ready,ci/build | by russellb (merged: 2026-03-06 03:08 (UTC+8))
#36059 [BugFix] Fallback from FA4->FA2 for Batch Invariance | bug,ready,v1 | by frankwang28 (merged: 2026-03-06 03:05 (UTC+8))
#35794 [Perf] Optimize FusedMoEModularKernel output tensor using torch.empty | ready-run-all-tests | by xyang16 (merged: 2026-03-06 02:47 (UTC+8))
#36146 [Bugfix] Disable FlashInfer TRTLLM BF16 path for non-gated MoE | bug,ready,nvidia | by tomeras91 (merged: 2026-03-06 01:49 (UTC+8))
#36133 ParakeetProjection.norm = RMSNorm instead of nn.LayerNorm | ready | by netanel-haber (merged: 2026-03-06 01:29 (UTC+8))
#36078 [XPU] Enable ModelRunnerV2 on XPU | ready,v1 | by xinyu-intel (merged: 2026-03-06 01:19 (UTC+8))
#35994 [BUGFIX]Fix Qwen-Omni models audio max_token_per_item estimation error leading to encoder_cache_size is 0 | bug,ready,qwen | by jjmiao1 (merged: 2026-03-06 01:16 (UTC+8))
#36107 [CI] Stabilize test_no_args_tool_call and add ROCm-specific server args | rocm,ready | by AndreasKaratzas (merged: 2026-03-05 21:52 (UTC+8))
#34934 [Bugfix][CI] fix typos | bug,documentation,performance,rocm,structured-output,frontend,ready,ci/build,v1,multi-modality | by 1195343015 (merged: 2026-03-06 01:05 (UTC+8))
#35246 [ROCm] Refactor ROCm attention backend selection logic | documentation,rocm,ready,v1 | by SageMoore (merged: 2026-03-06 00:51 (UTC+8))
#36140 [Bugfix] Fix Qwen-VL tokenizer implementation | bug,performance,ready,qwen | by DarkLight1337 (merged: 2026-03-06 00:07 (UTC+8))
#36108 refactor funasr model. | ready,qwen | by AllenDou (merged: 2026-03-06 00:07 (UTC+8))
#36120 Fix Eagle3 with Transformers modelling backend | ready | by hmellor (merged: 2026-03-05 21:59 (UTC+8))
#34616 [KVConnector] Scheduler: Fix num_computed_tokens after async KV load | ready,v1,kv-connector | by orozery (merged: 2026-03-05 22:25 (UTC+8))
#36135 [Docs] Only build docs if documentation or ready labels are present | documentation,ready | by hmellor (merged: 2026-03-05 21:58 (UTC+8))
#36115 [Chore] Correct MTP models test registry ordering | ready | by Isotr0py (merged: 2026-03-05 16:55 (UTC+8))
#35976 [Bugfix] Fix RunAI streamer crash with S3-hosted model paths | bug,frontend,ready,v1 | by AjAnubolu (merged: 2026-03-05 20:14 (UTC+8))
#36114 [Bugfix] Fix mypy errors in hermes_tool_parser.py | bug,ready | by 842974287 (merged: 2026-03-05 19:37 (UTC+8))
#36020 [Misc] Fix SyntaxWarning - invalid escape sequence '\e' | frontend,ready | by cjackal (merged: 2026-03-05 18:55 (UTC+8))
#36085 [Hardware] Replace torch.cuda.synchronize() api with torch.accelerator.synchronize | documentation,performance,ready,v1,nvidia,ready-run-all-tests | by jikunshang (merged: 2026-03-05 18:36 (UTC+8))
#32767 [Docs] add Dynamo/aibrix integration and kubeai/aks link | documentation,ready | by pacoxu (merged: 2026-03-05 17:39 (UTC+8))
#34083 [Docs] Update docs to include mm processor + encoder benchmarks | documentation,frontend,ready,v1,multi-modality | by reaganjlee (merged: 2026-03-05 17:38 (UTC+8))
#36032 qwen3coder tool parser fix anyOf double encoded parameters | ready,qwen | by cmunley1 (merged: 2026-03-05 17:06 (UTC+8))
#36006 [Misc] Remove deprecated items that are due for removal | frontend,ready,multi-modality | by hickeyma (merged: 2026-03-05 14:14 (UTC+8))
#35632 [Docs] Update CacheConfig block_size docstring to remove inaccurate limit when using CUDA | ready,nvidia | by eicherseiji (merged: 2026-03-05 14:24 (UTC+8))
#36036 [Bugfix] Fix block_size for hybrid model MTP | bug,speculative-decoding,ready,v1 | by benchislett (merged: 2026-03-05 14:10 (UTC+8))
#35973 [Doc] Add Parallel Draft Models | documentation,ready | by zihaoanllm (merged: 2026-03-05 13:44 (UTC+8))
#36062 [Kernel] [Helion] [11/N] Retune configs for silu_mul_fp8 | ready | by gmagogsfm (merged: 2026-03-05 13:25 (UTC+8))
#36087 [CI] Don't leave docs preview comment on closed PRs | ready,ci/build | by hmellor (merged: 2026-03-05 12:51 (UTC+8))
#35849 [Bugfix] Fix score layer quantization for sequence classification models - Qwen3 (VL) Reranker | bug,ready,qwen | by gkswns0531 (merged: 2026-03-05 12:57 (UTC+8))
#35890 [Perf] Use dummy M for weight prepacking on x86 | ready,cpu | by tianmu-li (merged: 2026-03-05 12:56 (UTC+8))
#35866 Order config.py in Lexicographical order | ready,ci/build,multi-modality | by askliar (merged: 2026-03-05 12:56 (UTC+8))
#35921 [compile] Fix extra cache save on warm start. | ready | by zhxchen17 (merged: 2026-03-05 12:56 (UTC+8))
#35328 [Core] Move ray-specific WorkerWrapperBase methods to RayWorkerWrapper | ready,v1 | by njhill (merged: 2026-03-05 12:11 (UTC+8))

PRs Closed Without Merging

#36195 [TEST ONLY]Restore AsyncTP fusion for FlashInfer FP8 BMM (need advices)#27893 | ci/build,nvidia | by baonudesifeizhai (closed: 2026-03-06 11:19 (UTC+8))
#36157 [CI] Add mandatory H100 TP=2 smoke test | ci/build,v1 | by stecasta (closed: 2026-03-06 01:24 (UTC+8))
#34926 [Core][KV Transfer] Support PD disagg + speculative decoding KV lifecycle | needs-rebase,v1,kv-connector | by ZhanqiuHu (closed: 2026-03-06 05:46 (UTC+8))
#36128 [Bugfix] Fix float-like CPU env int parsing for #35967 | bug | by MatteoFari (closed: 2026-03-06 05:18 (UTC+8))
#32509 [Refactor] Extract KV-cache update logic for FlashAttentionDiffKV backend | documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality,tool-calling | by VedantMadane (closed: 2026-03-06 04:11 (UTC+8))
#33747 feat(grpc): expose kv_connector and kv_role in GetServerInfoResponse | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality | by slin1237 (closed: 2026-03-06 03:17 (UTC+8))
#35590 Use smg-grpc-proto package for gRPC proto definitions | ci/build | by CatherineSue (closed: 2026-03-06 03:02 (UTC+8))
#33335 Parakeet avlm | frontend,needs-rebase | by netanel-haber (closed: 2026-03-06 02:08 (UTC+8))
#36154 chunk audio clips by 30 seconds to match mcore | no labels | by netanel-haber (closed: 2026-03-06 00:09 (UTC+8))
#36052 [Bugfix] Fix async OffloadingConnector silently losing decode-phase blocks | bug,kv-connector | by ZhanqiuHu (closed: 2026-03-05 22:19 (UTC+8))
#34241 [Bugfix] Grammar ignored when reasoning ends within speculated tokens | bug,structured-output,v1 | by sfbemerk (closed: 2026-03-05 20:37 (UTC+8))
#36126 Fix TypeError when rope factor is null during max length derivation | no labels | by siewcapital (closed: 2026-03-05 19:32 (UTC+8))
#33798 Add Kimi-Audio-7B ASR support for audio transcriptions and chat completions | documentation,new-model,frontend,multi-modality,gpt-oss | by tunglinwood (closed: 2026-03-05 18:47 (UTC+8))
#36121 README Sylee | documentation,kv-connector | by brianlsy98 (closed: 2026-03-05 16:28 (UTC+8))
#36113 fix: handle float-like strings in int() parsing to prevent ValidationError | no labels | by OiPunk (closed: 2026-03-05 15:50 (UTC+8))
#35484 [WIP][Bugfix] Fix multi-node PP crash with logprobs due to pinned memory serialization | bug,needs-rebase,v1 | by haosdent (closed: 2026-03-05 13:27 (UTC+8))