vLLM 开发动态报告 - 2026-03-05
时间窗口: 2026-03-05 11:23 (UTC+8) ~ 2026-03-06 11:23 (UTC+8)
数据统计: 新 Issue 27 | 关闭 Issue 31 | 新 PR 96 | 合并 PR 41 | 关闭未合并 PR 16
📊 每日开发状态摘要
本周期(2026年3月5日至6日)vLLM 项目开发活动保持高度活跃,新增 PR 96 个,合并 41 个,显示出持续的快速迭代。核心焦点集中在多模态模型支持扩展、内核与运行时性能优化(特别是 CUDA 图和 torch.compile),以及 AMD ROCm 平台的兼容性与稳定性修复。同时,社区对 dLLM(扩散语言模型)集成、新硬件抽象层(torch.accelerator) 和量化模型准确性问题展开了深入讨论。
🎯 AMD/ROCm 生态相关动态
本周期 AMD/ROCm 生态是重点修复和扩展方向,涉及多个 PR 和 Issue。
1. PR:构建与依赖更新
- #36086 ([AMD][Build] Add DeepEP to ROCm Dockerfile): 在 ROCm Docker 镜像中集成 AMD 端口的 DeepEP 高性能 MoE 库。这是增强 AMD 平台 MoE 性能的重要一步。
- #36177 ([ROCm][CI] Adding missing dependencies for Multi-modal models tests): 为 ROCm CI 补充多模态模型测试的依赖,提升测试覆盖度。
- #36179 ([ROCm][CI] Fix ROCm GPT-OSS Eval test group): 修复 ROCm 平台上 GPT-OSS 评估测试组。
2. PR:核心功能修复与增强
- #35246 ([ROCm] Refactor ROCm attention backend selection logic) (已合并): 重构了 ROCm 注意力后端选择逻辑,使其风格与 CUDA 侧统一,为未来更复杂的后端选择奠定基础。
- #36185 (Reenable features for ROCm attention backends): 修复因 #35246 重构导致 ROCm 多个注意力后端(如 AITER_MLA)的部分功能(如 FP8 KV 缓存)被错误禁用的问题。
- #36100 ([ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs) 与 #36092 ([ROCm] Fix AITER ops fake impl and minor bugs): 这两个 PR 修复了 ROCm AITER 操作符的假实现(fake impl)签名不匹配、静态方法参数错误、错误信息拼写等多个底层 bug。特别是 `fused_moe_fake` 的签名问题会直接导致 `torch.compile` 在 FakeTensor 模式下崩溃,影响 MXFP4 量化模型的运行。
- #36101 ([ROCm][CI] Fix logprob divergence for TitanML/tiny-mixtral under AITER rms_norm): 针对未训练的 `TitanML/tiny-mixtral` 模型,在 AITER 的 `rms_norm` 中使用 bfloat16 累加导致与 HF 参考输出存在微小差异的问题,在测试中加入了临时补丁,确保 CI 通过。这反映了对数值精度敏感场景的细致处理。
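上面 #36101 涉及的低精度累加漂移,可以用一个纯 Python 的类比演示(仅为示意,数值与 vLLM/AITER 的实际内核无关):同一组数,朴素逐项累加与高精度补偿累加的结果会出现偏差,bfloat16 累加相对 fp32/HF 参考产生的 logprob 微小差异与此同理。

```python
import math

def naive_sum(xs):
    """朴素逐项累加:每一步加法都可能引入舍入误差。"""
    total = 0.0
    for x in xs:
        total += x
    return total

# 构造容易暴露舍入误差的序列:大数与小数交替,部分 +1.0 会被"吞掉"
values = [1e16, 1.0, -1e16, 1.0] * 1000

approx = naive_sum(values)   # 朴素累加
exact = math.fsum(values)    # fsum 做无损补偿累加,相当于"高精度参考"

print(approx, exact)         # 两者不同:累加策略改变了归约结果
```

在真实内核里,累加精度(bfloat16 vs fp32)扮演的就是这里"朴素 vs 补偿"的角色;对未训练的随机权重模型,这类漂移更容易被 logprob 比较放大。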
3. Issue:用户反馈的兼容性与功能缺陷
- #36193 ([Bug]: Unsupported Activation Function for Step-3.5-Flash): 用户报告 ROCm AITER 量化 MoE 路径不支持 Step-3.5-Flash 模型的 `SwiGLU_STEP` 激活函数,导致运行时报错。这暴露了 AMD 量化生态对新模型架构支持的滞后。
- #36105 ([ROCm][v0.16.0] Qwen3-VL-30B-A3B-Instruct-FP8 fails to start…): 用户报告该 FP8 模型因 "No FP8 MoE backend supports the deployment configuration" 错误而无法启动,凸显了 AMD 平台对 FP8 MoE 后端支持的缺失。
- #36167 ([Bug]: Engine initialization failure with Qwen3 Omni on Strix Halo/ROCM) 与 #36180 ([Bug]: meta-llama/Llama-3.2-1B-Instruct Fails With ROCM_ATTN Due To Seg Fault): 用户报告在 AMD 硬件(Strix Halo/Ryzen AI)上运行 Qwen3-Omni 时 HIP runtime 加载失败,以及 Llama-3.2-1B 使用 `ROCM_ATTN` 后端时出现段错误。这些是 AMD 消费级硬件和新模型支持上的具体问题。
总结: AMD 生态本周期的工作以 “修复”和“打基础” 为主,重点解决了 AITER 操作符的一系列底层 bug 和注意力后端功能回归问题,并开始集成 DeepEP 等性能库。然而,用户 Issues 反映出在新模型架构(Step-3.5)、新数据类型(FP8 MoE)和特定硬件组合上的兼容性挑战依然存在。
💬 高热度讨论分析
1. Issue #36155 ([RFC]: dLLM support via plugin (spec-decode path reuse))
- 核心议题: 是否及如何通过插件系统复用推测解码(Speculative Decoding)的数据路径和调度器接口,来支持块扩散语言模型(dLLM)。
- 各方观点:
- 提案方 (Red Hat): 认为此举能以最小核心变更实现最大封装,保持 vLLM 生态对齐,并已设计出具体方案。
- 维护者 (@benchislett): 总体上认为方案合理,但提出两大挑战:1) 结构化输出(Structured Outputs) 与 dLLM 单步多令牌生成的兼容性复杂,可能需大规模重构或引入新通信阶段;2) 自定义掩码注意力 的跨平台支持度问题。
- 争议焦点: 如何在保持"最小核心变更"的优雅性的同时,解决结构化输出这一硬核技术挑战。讨论仍处早期技术探讨阶段。
- 当前状态: 开放讨论中。
2. Issue #36094 ([Bug]: Qwen3.5 NVFP4 Checkpoint has poor accuracy)
- 核心议题: NVIDIA 官方发布的 NVFP4 量化版 Qwen3.5-397B 模型在 vLLM 上运行精度严重下降(GSM8K 从 0.90 降至 ~0.35)。
- 各方观点:
- 报告者 (@pavanimajety): 提供详细测试数据,指出问题与 `modelopt_fp4` 量化方法相关,怀疑权重重归一化步骤有问题。
- NVIDIA 工程师 (@zhewenl): 表示之前验证过该检查点精度对齐,要求提供环境信息以便复现。
- 其他参与者: 指出模型存在 `w1_weight_scale_2 must match w3_weight_scale_2` 的警告,推测这可能影响精度。
- 争议焦点: 问题是 vLLM 的模型加载/量化实现缺陷,还是 NVIDIA 发布的权重本身存在问题?需要双方协作定位。
- 当前状态: 开放调查中,双方正在交换环境与 commit 信息。
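上面提到的 `w1_weight_scale_2 must match w3_weight_scale_2` 警告,背后是融合 gate/up(w1/w3)投影时对二级量化 scale 一致性的假设。下面是一个假设性的检查函数示意(函数名与数据均为示例,并非 vLLM 实际实现):若内核假定二者相等而检查点中并不相等,不一致的 scale 就可能成为精度损失的来源之一。

```python
def check_fused_scales(w1_weight_scale_2, w3_weight_scale_2, rtol=1e-6):
    """逐专家比较两组二级 scale,返回不匹配的下标列表(示例实现)。"""
    mismatched = []
    for i, (a, b) in enumerate(zip(w1_weight_scale_2, w3_weight_scale_2)):
        denom = max(abs(a), abs(b), 1e-12)  # 避免除零
        if abs(a - b) / denom > rtol:
            mismatched.append(i)
    return mismatched

# 示例数据:第 1 个专家的 w1/w3 scale 不一致
w1 = [0.5, 0.25, 0.125]
w3 = [0.5, 0.30, 0.125]
print(check_fused_scales(w1, w3))  # → [1]
```

加载期做这类显式校验(而不仅是打印警告),有助于把"权重本身的问题"与"推理实现的问题"尽早区分开,这也正是该 Issue 中双方需要协作定位的关键。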
3. Issue #36091 / PR #36139 ([RFC]: Add InstantTensor Support in vLLM)
- 核心议题: 是否集成 InstantTensor(一种通过内存映射加速模型加载的技术)作为可选的加载格式(`--load-format instanttensor`)。
- 各方观点:
- 提案方: 展示其能大幅提升模型加载速度,充分利用高速存储带宽。
- 社区成员 (@robertgshaw2-redhat): 询问是否可设为默认选项及其使用考量,反映出对引入新依赖和维护成本的关切。
- 争议焦点: 新技术带来的性能收益与项目依赖复杂度、维护负担之间的权衡。
- 当前状态: RFC Issue 开放讨论,对应的实现 PR #36139 已提交,等待审查。
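内存映射加载的基本思路可以用标准库 `mmap` 做一个极简示意(文件格式为本示例虚构的连续 float64 数组,与 InstantTensor 的实际格式无关):映射后按偏移读取所需切片,而不必先把整个权重文件拷贝进用户态缓冲区。

```python
import mmap
import os
import struct
import tempfile

def write_weights(path, values):
    """示例:把一组 float64 连续写入文件,模拟一个权重文件。"""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}d", *values))

def load_weight_slice(path, start, count):
    """内存映射整个文件,只按偏移解码需要的切片(不整体复制)。"""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = start * 8  # float64 占 8 字节
            return list(struct.unpack_from(f"<{count}d", mm, offset))

path = os.path.join(tempfile.mkdtemp(), "weights.bin")
write_weights(path, [float(i) for i in range(1024)])
print(load_weight_slice(path, 100, 4))  # → [100.0, 101.0, 102.0, 103.0]
```

真实的加载器还要处理张量元数据、对齐与设备拷贝,但"映射 + 按需读取"正是此类技术吃满高速存储带宽、缩短冷启动时间的基础。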
🔥 热门话题与趋势分析
- 多模态与音频模型支持加速: 新增 PR #36127 为 Kimi-Audio-7B (Whisper + Qwen2) 添加支持,PR #36124 增加 MetaCLIP 模型支持。这表明 vLLM 在多模态领域的模型矩阵正在快速扩展。
- 性能优化深水区: 优化重点从基础算子转向更复杂的调度和组合。例如:
- PR #36182 为分类池器(classify pooler)添加 CUDA 图优化。
- PR #36142 实验性地为结构化输出添加跳转解码(Jump Decoding)支持,以加速某些约束生成场景。
- PR #36159 将 MaxSim 评分计算从 API 服务器下推至 Worker,减少数据传输开销。
- AMD 平台持续投入与挑战: 如前所述,围绕 ROCm 的 CI 稳定化、AITER 内核修复、新硬件支持是持续热点。用户 Issues 密集反映了在消费级 AMD GPU 和复杂量化模型上的可用性问题。
- 开发工具链与体验: PR #36135 优化 CI,仅当 PR 带有 `documentation` 或 `ready` 标签时才构建文档,以节约资源。这反映了项目在规模化后对 CI 效率的精细化管理。
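PR #36142 所涉及的跳转解码,其核心思想可以用一个简化示意说明(这里用一个"前缀表"代替真实的语法/FSM,与 vLLM 的实际实现无关):当约束使下一个 token 被唯一确定时,直接成批追加强制 token,跳过对应的模型前向。

```python
def forced_continuation(prefix, grammar):
    """返回当前前缀下被语法唯一确定的后续 token 序列(示意实现)。

    grammar 把"已生成前缀"映射到"允许的下一 token 集合";
    只要集合大小为 1,就无须模型采样,可直接追加。
    """
    forced = []
    while True:
        allowed = grammar.get(tuple(prefix + forced))
        if allowed is None or len(allowed) != 1:
            return forced  # 出现分支(或走出语法),交还给模型采样
        forced.append(next(iter(allowed)))

# 示例语法:生成 {"a": <数字>} 时,'{' 之后到数字之前的 token 全部被唯一确定
grammar = {
    (): {"{"},
    ("{",): {'"a"'},
    ("{", '"a"'): {":"},
    ("{", '"a"', ":"): {"1", "2"},  # 数字处出现分支,需要模型决策
}
print(forced_continuation([], grammar))  # → ['{', '"a"', ':']
```

对 JSON 这类结构性很强的输出,被唯一确定的 token(引号、冒号、固定字段名)占比可观,跳过这些前向就是该优化的收益来源。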
🛠️ 重点技术变更
- PR #35246 ([ROCm] Refactor ROCm attention backend selection logic) (已合并): 这是 AMD 生态的基础设施性重构。将原本可能散落的 ROCm 注意力后端选择逻辑规范化,为后续引入更多后端和更精细化的功能选择策略(类似 CUDA)铺平了道路,是提升 AMD 平台长期可维护性的关键一步。
- PR #36162 ([Mamba] Flashinfer selective_state_update): 为 Mamba 架构模型(如 Nemotron 3 Nano)增加了 FlashInfer 后端作为选择性状态更新内核的新选项,并引入 `MambaConfig` 来集中相关配置。这丰富了状态空间模型(SSM)的后端生态,为用户提供了更多性能调优选择。
- PR #36085 和 #36145 ([Hardware] Replace torch.cuda.* API): 这些 PR 系统性地将 `torch.cuda.device_count`、`torch.cuda.synchronize()` 等 API 替换为 `torch.accelerator` 命名空间下的新 API。这是响应 PyTorch 硬件抽象化趋势、为未来更好地支持 XPU、NPU 等更多加速器做准备的基础性迁移工作。
- PR #35866 (Order config.py in Lexicographical order) (已合并): 一个看似微小的改动,将庞大的模型配置文件按字母顺序重新排序,显著提升了大规模配置文件的可读性和可维护性,体现了对代码质量的持续关注。
- PR #36176 ([Model Runner V2] Fix warmup for very small kvcache and/or blocksizes) (已合并): 修复了在 KV 缓存块数或块大小被设置得非常小(例如用于测试抢占)时预热逻辑可能出错的问题,增强了系统在极端配置下的鲁棒性。
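#36085/#36145 的迁移方向可以用一个假设性的封装示意:优先走 PyTorch 2.6+ 的 `torch.accelerator` 设备无关 API,在老版本或无加速器的环境回退到 `torch.cuda`。这只是迁移模式的演示,并非 vLLM 中的实际封装;为了在未安装 torch 的环境也能运行,这里显式做了导入守卫。

```python
import importlib.util

def accelerator_synchronize() -> bool:
    """同步当前加速器;返回是否真正执行了同步(示意实现)。"""
    if importlib.util.find_spec("torch") is None:
        return False  # 本环境未安装 torch,直接跳过
    import torch
    if hasattr(torch, "accelerator") and torch.accelerator.is_available():
        torch.accelerator.synchronize()   # 新的设备无关 API(2.6+)
        return True
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # 旧的 CUDA 专用路径作为回退
        return True
    return False

print(accelerator_synchronize())
```

这种"新 API 优先、旧 API 兜底"的写法,也是此类大规模 API 迁移在过渡期常见的兼容策略。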
📈 开发活跃度观察
- 贡献者多样性: 除了核心团队(如 @DarkLight1337, @yewentao256, @zou3519),Red Hat (@robertgshaw2-redhat, @benchislett)、AMD 相关贡献者(如 @SageMoore, @AndreasKaratzas)以及许多独立开发者(如 @OiPunk, @AjAnubolu)都非常活跃。
- AMD 贡献者活跃: 用户名中包含 “-amd” 后缀或明显属于 AMD 的贡献者(如 @SageMoore)在 ROCm 相关的修复和 CI 维护上贡献了多个 PR,显示了 AMD 对 vLLM 生态的持续投入。
- 代码审查与合并节奏: 单日合并 41 个 PR,显示核心团队有很强的代码审查和合并能力。同时,大量 PR 处于 “needs-rebase” 状态,也表明开发分支迭代非常迅速,贡献者需及时同步上游变更。
- 跨团队协作: 在 dLLM RFC (#36155) 和 NVFP4 精度问题 (#36094) 的讨论中,可以看到来自 Red Hat、NVIDIA 等不同机构的工程师进行深度技术对话,展现了健康的跨公司协作生态。
💡 值得关注的问题
- dLLM 支持路径的权衡: Issue #36155 中的讨论触及了 vLLM 核心调度器的扩展边界。是否以及如何支持单步多令牌生成的模型,是一个影响深远的设计决策,其结论将决定 vLLM 对未来新型架构的包容性。
- NVFP4 等新量化格式的准确性与兼容性: Issue #36094 暴露了前沿量化技术(如 NVFP4)在落地 vLLM 时可能存在的准确性问题。这不仅是一个技术调试问题,也关乎 vLLM 与硬件厂商量化工具链的集成质量保证流程。
- AMD 消费级硬件与复杂模型的兼容性: Issues #36167, #36180 显示,在 AMD Ryzen AI 等新兴消费级平台上运行视觉语言大模型(VLMs)时,仍会遇到低级运行时错误。扩大 vLLM 在 AMD 生态的普及度,解决这些“最后一公里”的兼容性问题至关重要。
- 系统提示词(System Prompt)缓存优化: PR #36196 提出了缓存系统提示词的 Token IDs 以优化重复请求。这是一个常见的性能优化点,其实现方式值得关注,可能对高并发、固定系统提示的应用场景产生显著影响。
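PR #36196 所描述的"缓存系统提示词 Token IDs"优化,其最小形态可以用 `functools.lru_cache` 示意(`_tokenize` 为示例桩函数,并非 vLLM 的 API;真实实现需调用模型的 tokenizer 并考虑缓存失效):相同的系统提示只分词一次,后续请求直接复用。

```python
from functools import lru_cache

def _tokenize(text: str):
    """桩实现:用字符码点模拟分词,真实场景应调用模型的 tokenizer。"""
    return [ord(c) for c in text]

@lru_cache(maxsize=256)
def cached_system_prompt_ids(system_prompt: str) -> tuple:
    # 返回不可变 tuple,避免缓存内容被调用方原地修改
    return tuple(_tokenize(system_prompt))

ids1 = cached_system_prompt_ids("You are a helpful assistant.")
ids2 = cached_system_prompt_ids("You are a helpful assistant.")
print(cached_system_prompt_ids.cache_info().hits)  # → 1,第二次请求命中缓存
```

对高并发、系统提示固定的服务,这类缓存把每请求的分词开销压缩为一次字典查找,收益与提示长度和 QPS 成正比。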
📋 附录:详细数据列表
新增 Issue
- #36203 [Bug]: CUDAGraphMode.FULL not supported with ChunkedLocalAttention backend for Llama4 models — bug — by huydhn (创建于: 2026-03-06 11:10 (UTC+8))
- #36202 [Bug]: Symbol not found: __ZN3c1013MessageLoggerC1EPKcii\n Referenced from: <6A12389C-7A10-3CA4-BEDF-893991822933> /opt/anaconda3/envs/vllm-inference/lib/python3.11/site-packages/vllm/_C.abi3.so\n — bug — by LuWei6896 (创建于: 2026-03-06 10:59 (UTC+8))
- #36116 [Bug]: Pseudo-Streaming Issue When Using Tool Parser with Qwen3-Coder and MiniMax-M2 — bug — by delwen123 (创建于: 2026-03-05 15:45 (UTC+8))
- #36148 [Bug]: Qwen3-VL-Reranker-8B fails with shape mismatch error when loading with --quantization bitsandbytes — bug — by xl2014 (创建于: 2026-03-05 22:29 (UTC+8))
- #36122 [Bug]: Protocol inconsistency between Scheduler and Runner when using Speculative Decoding — bug — by zengliejie (创建于: 2026-03-05 17:15 (UTC+8))
- #36193 [Bug]: Unsupported Activation Function for Step-3.5-Flash — bug,rocm — by ColinZ22 (创建于: 2026-03-06 08:39 (UTC+8))
- #36189 [Feature]: Per-Request Timing Headers (`--enable-request-stats-headers`) — feature request — by vrdn-23 (创建于: 2026-03-06 08:01 (UTC+8))
- #36094 [Bug]: Qwen3.5 NVFP4 Checkpoint has poor accuracy — bug,qwen,nvidia — by pavanimajety (创建于: 2026-03-05 13:09 (UTC+8))
- #36180 [Bug]: meta-llama/Llama-3.2-1B-Instruct Fails With ROCM_ATTN Due To Seg Fault — bug,rocm — by micah-wil (创建于: 2026-03-06 06:13 (UTC+8))
- #36175 [Feature][Cleanup]: Unify RMSNormGated and FusedRMSNormGated — bug — by eellison (创建于: 2026-03-06 04:35 (UTC+8))
- #36167 [Bug]: Engine initialization failure with Qwen3 Omni on Strix Halo/ROCM — bug,rocm — by Eoghanmc22 (创建于: 2026-03-06 02:21 (UTC+8))
- #36155 [RFC]: dLLM support via plugin (spec-decode path reuse) — RFC — by AlonKellner-RedHat (创建于: 2026-03-06 00:26 (UTC+8))
- #36096 [Bug]: Initialization failure when using Qwen3-Omni — bug — by Eoghanmc22 (创建于: 2026-03-05 13:19 (UTC+8))
- #36151 [Bug]: Inconsistent PP layer indexing in EAGLE model code — bug — by benchislett (创建于: 2026-03-05 23:37 (UTC+8))
- #36091 [RFC]: Add InstantTensor Support in vLLM — RFC — by arlo-aisys (创建于: 2026-03-05 12:47 (UTC+8))
- #36137 [Bug]: Title: Embeddings from Qwen3-VL-Embedding-8B loaded via VLLM fail to effectively distinguish image-text pair similarities compared to native transformers — bug — by xl2014 (创建于: 2026-03-05 20:39 (UTC+8))
- #36141 [Bug]: Vllm models crashes when using runai_streamer & --api-server-count > 1 — bug — by Sanches166 (创建于: 2026-03-05 21:03 (UTC+8))
- #36117 [Bug]: Qwen3-VL-2B-Instruct produces significantly different (degraded) outputs on v0.12.0 compared to v0.11.0 — 无标签 — by AGENDD (创建于: 2026-03-05 15:49 (UTC+8))
- #36123 [Usage]: 为什么vllm0.14.0部署Qwen3-VL-30B-A3B的MOE模型时,模型在初始化时CPU负载几乎100%,相同的环境部署Qwen3-VL-32B的模型时,不会出现CPU负载100%的情况 — usage — by sanshi9523 (创建于: 2026-03-05 17:59 (UTC+8))
- #36109 [Bug]: TypeError when rope_scaling.factor is null in model config — bug — by yuanheng-zhao (创建于: 2026-03-05 15:16 (UTC+8))
- #36112 [Bug]: bge-large-en-v1.5 embedding model can not be use when I use vllm to deploy the embedding model. — bug — by zhujiajian98 (创建于: 2026-03-05 15:30 (UTC+8))
- #36105 [ROCm][v0.16.0] Qwen3-VL-30B-A3B-Instruct-FP8 fails to start with NotImplementedError: No FP8 MoE backend supports the deployment configuration — bug,rocm — by gbdjxgp (创建于: 2026-03-05 14:38 (UTC+8))
- #36103 [Bug]: torch._inductor.exc.InductorError: — bug — by crimoc-lgtm (创建于: 2026-03-05 14:25 (UTC+8))
- #36095 [Feature]: Add fused MoE kernel tuning configs for NVIDIA GeForce RTX 5090 (int4_w4a16) — 无标签 — by Anoneeeemus (创建于: 2026-03-05 13:13 (UTC+8))
- #36099 [Bug]: ImportError: libcudart.so.12: cannot open shared object file: No such file or directory — bug — by crimoc-lgtm (创建于: 2026-03-05 13:42 (UTC+8))
- #36082 AI 推理效率工具实践分享 — 无标签 — by zhuxunyu (创建于: 2026-03-05 11:40 (UTC+8))
- #36084 [Bug]: prefix cach hit降到了0,running的seq也是逐步下降到0 — bug — by edc3000 (创建于: 2026-03-05 11:50 (UTC+8))
已关闭 Issue
- #26093 [Installation]: non-CUDA x86 vLLM v0.10.2 wheel is CUDA dependent — installation,stale — by (关闭于: 2026-03-06 10:34 (UTC+8))
- #27626 [Bug]: gpt-oss-120b with EAGLE3 Speculative decoding, awful benchmarks — bug,stale — by icsy7867 (关闭于: 2026-03-06 10:33 (UTC+8))
- #28097 [Bug]: ZeroDivisionError caused by dividing by pbar.format_dict["elapsed"] in LLM._run_engine() when use_tqdm=True — bug,stale — by KnightChaser (关闭于: 2026-03-06 10:32 (UTC+8))
- #28102 [Bug]: switching between multiple LoRAs in multimodal scenario takes long time — bug,stale — by caraxl (关闭于: 2026-03-06 10:32 (UTC+8))
- #28107 [Bug]: llama 4 + fp4 is broke — bug,stale — by chenhao862 (关闭于: 2026-03-06 10:32 (UTC+8))
- #28115 [Installation]: Request to include vllm==0.10.2 for cuda 11.8 — installation,stale — by ppalantir (关闭于: 2026-03-06 10:32 (UTC+8))
- #28121 [Bug]: Cannot load llmcompressor 3-bit quantized models but can load AutoGPTQ ones — bug,stale — by mratsim (关闭于: 2026-03-06 10:32 (UTC+8))
- #28131 [Bug]: Batch Invariant Failed when enable cudagraph on L40s (Ada) GPU — bug,stale — by fopdoodle8 (关闭于: 2026-03-06 10:32 (UTC+8))
- #28134 [Bug]: Streaming doesnt start until the response is completed — bug,stale — by baughmann (关闭于: 2026-03-06 10:32 (UTC+8))
- #28138 [Feature]: Support Shared Memory (DGX Spark) — feature request,stale — by swtb3-ryder (关闭于: 2026-03-06 10:32 (UTC+8))
- #28143 [Bug]: Model not loaded on proper GPU despite setting CUDA device — bug,stale — by jordanallred (关闭于: 2026-03-06 10:32 (UTC+8))
- #31030 [Performance]: Regression between 0.10.2 and 0.12.0 — performance — by maximegmd (关闭于: 2026-03-06 08:25 (UTC+8))
- #35967 [Bug]: Value error, invalid literal for int() with base 10: '4.0' [type=value_error, input_value=ArgsKwargs((), {'model_co…transfer_config': None}), input_type=ArgsKwargs] — bug — by LuWei6896 (关闭于: 2026-03-06 05:16 (UTC+8))
- #36096 [Bug]: Initialization failure when using Qwen3-Omni — bug — by Eoghanmc22 (关闭于: 2026-03-06 01:33 (UTC+8))
- #32992 [Bug]: Batch invariance fails on NVIDIA B200 (Blackwell) with torch.compile — bug,torch.compile — by ZhanqiuHu (关闭于: 2026-03-06 00:01 (UTC+8))
- #36075 [Bug]: Garbled rollouts with GLM5 if VLLM_USE_DEEP_GEMM is not set — bug — by S1ro1 (关闭于: 2026-03-05 23:26 (UTC+8))
- #34205 [Bug]: Set env ROCP_TOOL_ATTACH=1 caused vllm server stopped — bug,rocm — by BigFaceBoy (关闭于: 2026-03-05 23:14 (UTC+8))
- #36137 [Bug]: Title: Embeddings from Qwen3-VL-Embedding-8B loaded via VLLM fail to effectively distinguish image-text pair similarities compared to native transformers — bug — by xl2014 (关闭于: 2026-03-05 22:05 (UTC+8))
- #35717 [Bug]: RunAI streamer breaks in 0.15.1 — bug — by Sanches166 (关闭于: 2026-03-05 20:14 (UTC+8))
- #35004 [RFC]: Realtime Endpoint Metrics — RFC — by pougetat (关闭于: 2026-03-05 19:06 (UTC+8))
- #29056 [Bug]: Cannot use Qwen3 Next autoround quant model with 0.11.1 — bug,stale — by ariable (关闭于: 2026-03-05 17:24 (UTC+8))
- #36112 [Bug]: bge-large-en-v1.5 embedding model can not be use when I use vllm to deploy the embedding model. — bug — by zhujiajian98 (关闭于: 2026-03-05 16:57 (UTC+8))
- #35991 [Installation]: Making server with GPT-OSS-20B on vllm+openwebui rtx5080 16gb — installation — by Ed-test-s (关闭于: 2026-03-05 16:48 (UTC+8))
- #35926 [Bug]: Illegal Memory Access in NemotronH MTP — bug — by benchislett (关闭于: 2026-03-05 14:10 (UTC+8))
- #36095 [Feature]: Add fused MoE kernel tuning configs for NVIDIA GeForce RTX 5090 (int4_w4a16) — 无标签 — by Anoneeeemus (关闭于: 2026-03-05 14:04 (UTC+8))
- #36099 [Bug]: ImportError: libcudart.so.12: cannot open shared object file: No such file or directory — bug — by crimoc-lgtm (关闭于: 2026-03-05 13:49 (UTC+8))
- #11905 [Feature]: Support Multiple Tasks Per Model — feature request,stale — by FurtherAI (关闭于: 2026-03-05 12:40 (UTC+8))
- #36082 AI 推理效率工具实践分享 — 无标签 — by zhuxunyu (关闭于: 2026-03-05 12:35 (UTC+8))
- #4435 [Feature]: option to return hidden states — feature request,unstale — by zhenlan0426 (关闭于: 2026-03-05 12:13 (UTC+8))
- #6165 [Feature]: Return hidden states (in progress?) — feature request,unstale — by Elanmarkowitz (关闭于: 2026-03-05 12:13 (UTC+8))
- #24288 [RFC]: Support Returning Prompt Hidden States — RFC — by charlotte12l (关闭于: 2026-03-05 12:13 (UTC+8))
新增 PR
- #36204 use skip_all_guards_unsafe to drop global_state and torch_function_mode_stack guards instead of previous hacks — 无标签 — by laithsakka (创建于: 2026-03-06 11:19 (UTC+8))
- #36195 [TEST ONLY]Restore AsyncTP fusion for FlashInfer FP8 BMM (need advices)#27893 — ci/build,nvidia — by baonudesifeizhai (创建于: 2026-03-06 10:29 (UTC+8))
- #36192 [Security] Respect user trust_remote_code setting in NemotronVL and KimiK25 — ready — by russellb (创建于: 2026-03-06 08:31 (UTC+8))
- #36199 [Bugfix] Fix Qwen3.5 Marlin TP failure for GDN in_proj_ba — bug,qwen — by AjAnubolu (创建于: 2026-03-06 10:39 (UTC+8))
- #36196 [Frontend] Cache system prompt token IDs across requests — 无标签 — by AjAnubolu (创建于: 2026-03-06 10:36 (UTC+8))
- #36144 replace `with torch.cuda.device` with `with torch.accelerator.device_index` — performance,rocm,kv-connector,nvidia — by yma11 (创建于: 2026-03-05 21:44 (UTC+8))
- #36110 [Frontend][2/n] Improve pooling entrypoints embed. — frontend — by noooop (创建于: 2026-03-05 15:22 (UTC+8))
- #36124 [Model] Add MetaCLIP and MetaCLIP-2 Support — documentation,new-model,multi-modality — by KeriaGuma (创建于: 2026-03-05 18:06 (UTC+8))
- #36150 [bugfix] add api process rank in default multimodal request — bug,ready — by fake0fan (创建于: 2026-03-05 23:19 (UTC+8))
- #36201 [openapi server] log exception in exception handler(2/N) — frontend — by andyxning (创建于: 2026-03-06 10:55 (UTC+8))
- #36197 [Bugfix] Fix misleading context length error messages — bug,frontend — by AjAnubolu (创建于: 2026-03-06 10:37 (UTC+8))
- #36194 Replace shape_invariants with simpler apprach in dynamic_arg_dims utilizing shape_id property. — llama,qwen — by laithsakka (创建于: 2026-03-06 09:22 (UTC+8))
- #36191 [BugFix] avoid infinite loop with VLLM_PORT and get_open_ports_list — bug,ready — by walterbm (创建于: 2026-03-06 08:11 (UTC+8))
- #36200 Fix multi-node WorkerProc init ordering and compilation_time None — v1,meta-exported,fb-exported — by ananyakgarg (创建于: 2026-03-06 10:42 (UTC+8))
- #36111 [Perf] add cute dsl kernel for gdn decode — qwen — by ZJY0516 (创建于: 2026-03-05 15:30 (UTC+8))
- #36198 [Test] Add unit tests for GDN fused recurrent kernel — 无标签 — by AjAnubolu (创建于: 2026-03-06 10:39 (UTC+8))
- #36136 [Bugfix] Fix Qwen3-VL timestamp mismatch when using num_frames without fps — bug,multi-modality,qwen — by OiPunk (创建于: 2026-03-05 20:31 (UTC+8))
- #36127 [Model] Add support for moonshotai/Kimi-Audio-7B-Instruct — documentation,new-model,frontend,multi-modality — by tunglinwood (创建于: 2026-03-05 18:33 (UTC+8))
- #36145 [Hardware] Replace torch.cuda.device_count/current_device/set_device API — documentation,performance,speculative-decoding,ready,v1,multi-modality,kv-connector,nvidia — by jikunshang (创建于: 2026-03-05 21:47 (UTC+8))
- #36159 [Perf] Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement — frontend,ready,v1 — by yewentao256 (创建于: 2026-03-06 01:32 (UTC+8))
- #36186 [Bugfix] Fix WorkerProc init order for multi-node TP — bug,v1 — by 842974287 (创建于: 2026-03-06 07:29 (UTC+8))
- #36168 [Build] Upgrade xgrammar to get a security fix — ready,ci/build — by russellb (创建于: 2026-03-06 02:43 (UTC+8))
- #36185 Reenable features for ROCm attention backends — documentation,rocm,ready,v1 — by Rohan138 (创建于: 2026-03-06 07:25 (UTC+8))
- #36101 [ROCm][CI] Fix logprob divergence for TitanML/tiny-mixtral under AITER rms_norm — rocm,ready — by AndreasKaratzas (创建于: 2026-03-05 14:06 (UTC+8))
- #36177 [ROCm][CI] Adding missing dependencies for Multi-modal models tests — rocm,ready,ci/build — by AndreasKaratzas (创建于: 2026-03-06 06:02 (UTC+8))
- #36179 [ROCm][CI] Fix ROCm GPT-OSS Eval test group — rocm,ready,ci/build,gpt-oss — by AndreasKaratzas (创建于: 2026-03-06 06:10 (UTC+8))
- #36188 [docs] Add docs for new RL flows — documentation — by hao-aaron (创建于: 2026-03-06 07:58 (UTC+8))
- #36088 Don’t fire ray compatibility webhook when PR or branch is not provided — ready,ci/build — by jeffreywang-anyscale (创建于: 2026-03-05 12:22 (UTC+8))
- #36183 Fix: Clone NVFP4 MoE weights on SM121 to prevent Marlin kernel NaN — 无标签 — by scottgl9 (创建于: 2026-03-06 06:51 (UTC+8))
- #36190 fix(qwen3.5): prevent false gate_proj match from dropping MoE router gate weights — qwen — by scottgl9 (创建于: 2026-03-06 08:02 (UTC+8))
- #36187 [BugFix] Ensure contiguous input tensor in LoRA shrink kernel — bug — by RunkaiTao (创建于: 2026-03-06 07:49 (UTC+8))
- #36164 perf: add slots to KVCacheBlock — ready,v1 — by cong-or (创建于: 2026-03-06 01:56 (UTC+8))
- #36184 Yejin/fix warmup drain — performance,frontend,tpu,needs-rebase,v1,cpu — by YJYJLee (创建于: 2026-03-06 07:24 (UTC+8))
- #36176 [Model Runner V2] Fix warmup for very small kvcache and/or blocksizes — ready,v1 — by njhill (创建于: 2026-03-06 04:57 (UTC+8))
- #36182 [Performance] Add CUDA graph optimization for classify pooler — v1,nvidia — by ilinam (创建于: 2026-03-06 06:25 (UTC+8))
- #36157 [CI] Add mandatory H100 TP=2 smoke test — ci/build,v1 — by stecasta (创建于: 2026-03-06 01:21 (UTC+8))
- #36181 [ROCm][CI] Upgrading orchestrator to handle python pipeline markers and options — rocm,ci/build — by AndreasKaratzas (创建于: 2026-03-06 06:18 (UTC+8))
- #36178 [Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking — bug,v1 — by LucasWilkinson (创建于: 2026-03-06 06:02 (UTC+8))
- #36174 [ROCm][CI] Enable AITER for failing `test_gpt_oss` test case on MI355 — rocm,gpt-oss — by micah-wil (创建于: 2026-03-06 04:28 (UTC+8))
- #36128 [Bugfix] Fix float-like CPU env int parsing for #35967 — bug — by MatteoFari (创建于: 2026-03-05 18:40 (UTC+8))
- #36172 [Feat] Add vllm eval CLI subcommand integrating lm_eval accuracy and perf benchmarking — documentation,performance,frontend — by AndreasKaratzas (创建于: 2026-03-06 03:45 (UTC+8))
- #36083 Add adaptive decode chunking for SM100 fused TRTLLM path (TMP FIX)#34988 — v1,nvidia — by baonudesifeizhai (创建于: 2026-03-05 11:46 (UTC+8))
- #36098 [compile] Split compile/warmup monitoring — needs-rebase — by zou3519 (创建于: 2026-03-05 13:37 (UTC+8))
- #36161 Add 320 dimension size support to MLA — ready — by juliendenize (创建于: 2026-03-06 01:44 (UTC+8))
- #36171 [Refactor] Remove unused dead code — frontend,ready,kv-connector — by yewentao256 (创建于: 2026-03-06 03:42 (UTC+8))
- #36173 Change “following fields were present in the request but ignored” log from warn to debug — frontend — by tlrmchlsmth (创建于: 2026-03-06 03:52 (UTC+8))
- #36170 [Dependency] Remove default ray dependency — documentation,rocm,ready,ci/build,nvidia — by yewentao256 (创建于: 2026-03-06 03:25 (UTC+8))
- #36162 [Mamba] Flashinfer selective_state_update — ci/build,v1 — by roikoren755 (创建于: 2026-03-06 01:45 (UTC+8))
- #36165 [Bugfix] Fix `cudagraph_mode: FULL` dispatch (This does not impact `FULL_AND_PIECEWISE` (default)) — bug,ready,v1,nvidia — by TQCB (创建于: 2026-03-06 01:58 (UTC+8))
- #36119 [Bugfix] Guard fabric handle APIs for CUDA < 12.4 compatibility — bug,nvidia — by aimbit-ni (创建于: 2026-03-05 16:15 (UTC+8))
- #36169 feat(grpc): extract gRPC servicer into smg-grpc-servicer package, add –grpc flag to vllm serve — rocm,frontend,ci/build — by CatherineSue (创建于: 2026-03-06 03:01 (UTC+8))
- #36166 [Frontend] Add GPU-less render serving path (`vllm launch render`) — frontend,needs-rebase,v1 — by sagearc (创建于: 2026-03-06 02:20 (UTC+8))
- #36160 [Frontend] Add Support for MM Encoder/Decoder Beam Search (Online Transcriptions) — documentation,frontend,v1 — by alex-jw-brooks (创建于: 2026-03-06 01:34 (UTC+8))
- #36163 Add support to Mistral large 3 eagle with dense layers — speculative-decoding — by juliendenize (创建于: 2026-03-06 01:47 (UTC+8))
- #36158 [Misc] Rename `group_mm_kwargs_by_modality` -> `group_and_batch_mm_kwargs` — ready,v1,multi-modality — by DarkLight1337 (创建于: 2026-03-06 01:27 (UTC+8))
- #36153 [Frontend] Add Support for MM Encoder/Decoder Beam Search (Offline) — frontend,ready,multi-modality — by alex-jw-brooks (创建于: 2026-03-06 00:04 (UTC+8))
- #36146 [Bugfix] Disable FlashInfer TRTLLM BF16 path for non-gated MoE — bug,ready,nvidia — by tomeras91 (创建于: 2026-03-05 22:16 (UTC+8))
- #36147 cpu: aarch64: Upgrade OneDNN for aarch64 to add support for int8 matmul — ci/build,cpu — by nikhil-arm (创建于: 2026-03-05 22:29 (UTC+8))
- #36133 ParakeetProjection.norm = RMSNorm instead of nn.LayerNorm — ready — by netanel-haber (创建于: 2026-03-05 20:02 (UTC+8))
- #36107 [CI] Stabilize test_no_args_tool_call and add ROCm-specific server args — rocm,ready — by AndreasKaratzas (创建于: 2026-03-05 14:55 (UTC+8))
- #36092 [ROCm] Fix AITER ops fake impl and minor bugs — rocm,needs-rebase — by ChuanLi1101 (创建于: 2026-03-05 12:57 (UTC+8))
- #36156 [Bugfix] Fix simple Mistral-Small example — bug,documentation,ready — by DarkLight1337 (创建于: 2026-03-06 00:34 (UTC+8))
- #36149 fix: Use iterator as not to store all the file loads in memory at once — 无标签 — by shaunkotek (创建于: 2026-03-05 22:47 (UTC+8))
- #36140 [Bugfix] Fix Qwen-VL tokenizer implementation — bug,performance,ready,qwen — by DarkLight1337 (创建于: 2026-03-05 21:03 (UTC+8))
- #36154 chunk audio clips by 30 seconds to match mcore — 无标签 — by netanel-haber (创建于: 2026-03-06 00:09 (UTC+8))
- #36108 refactor funasr model. — ready,qwen — by AllenDou (创建于: 2026-03-05 14:57 (UTC+8))
- #36152 [compile] Stop unconditionally patching constrain_to_fx_strides — 无标签 — by zou3519 (创建于: 2026-03-05 23:54 (UTC+8))
- #36120 Fix Eagle3 with Transformers modelling backend — ready — by hmellor (创建于: 2026-03-05 16:21 (UTC+8))
- #36093 [torch.compile] Use FakeTensors instead of real GPU tensors for single-size compilation — 无标签 — by zou3519 (创建于: 2026-03-05 12:58 (UTC+8))
- #36138 [Bugfix] Grammar was ignored when reasoning ended within speculated tokens — bug,structured-output,v1 — by sfbemerk (创建于: 2026-03-05 20:43 (UTC+8))
- #36130 [BugFix] Add opt-in request watchdog to abort stuck requests (#33099) — bug,v1 — by KrxGu (创建于: 2026-03-05 19:35 (UTC+8))
- #36139 [Feature] Add support for InstantTensor — documentation,ci/build — by arlo-aisys (创建于: 2026-03-05 21:00 (UTC+8))
- #36135 [Docs] Only build docs if `documentation` or `ready` labels are present — documentation,ready — by hmellor (创建于: 2026-03-05 20:24 (UTC+8))
- #36106 [Bugfix] Add safety check and fallback for null scaling factor — bug — by yuanheng-zhao (创建于: 2026-03-05 14:54 (UTC+8))
- #36142 [Feature]: Initial Implementation Jump Decoding Guidance — structured-output,v1 — by FredericOdermatt (创建于: 2026-03-05 21:05 (UTC+8))
- #36143 [Core] Add –hybrid-kv-cache-group-size to override KV cache grouping for hybrid attn models — v1 — by mzmssg (创建于: 2026-03-05 21:14 (UTC+8))
- #36115 [Chore] Correct MTP models test registry ordering — ready — by Isotr0py (创建于: 2026-03-05 15:42 (UTC+8))
- #36129 [LMCache] Pass TP size in lookup for MLA multi-reader locking — kv-connector — by maobaolong (创建于: 2026-03-05 19:31 (UTC+8))
- #36134 # [Feature] Add EPLB Support for Minimax M2 Model — 无标签 — by ivyilike (创建于: 2026-03-05 20:15 (UTC+8))
- #36131 docs: add missing docstrings for SamplingParams methods — 无标签 — by Yuxin1999 (创建于: 2026-03-05 19:49 (UTC+8))
- #36132 Handle null RoPE factor in max length derivation (+tests) — 无标签 — by siewcapital (创建于: 2026-03-05 19:59 (UTC+8))
- #36114 [Bugfix] Fix mypy errors in hermes_tool_parser.py — bug,ready — by 842974287 (创建于: 2026-03-05 15:40 (UTC+8))
- #36126 Fix TypeError when rope factor is null during max length derivation — 无标签 — by siewcapital (创建于: 2026-03-05 18:31 (UTC+8))
- #36085 [Hardware] Replace `torch.cuda.synchronize()` api with `torch.accelerator.synchronize` — documentation,performance,ready,v1,nvidia,ready-run-all-tests — by jikunshang (创建于: 2026-03-05 11:57 (UTC+8))
- #36125 Perf: Relax CUDA kernel condition for MoE INT4 — nvidia — by xueliangyang-oeuler (创建于: 2026-03-05 18:15 (UTC+8))
- #36100 [ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs — rocm,ready,v1 — by ChuanLi1101 (创建于: 2026-03-05 13:45 (UTC+8))
- #36118 Set default value for --ready-check-timeout-sec flag in benchmarking script — performance — by almaslof (创建于: 2026-03-05 15:56 (UTC+8))
- #36121 README Sylee — documentation,kv-connector — by brianlsy98 (创建于: 2026-03-05 16:28 (UTC+8))
- #36104 [CI] Bump `mypy` version to 1.19.1 — ready,v1,kv-connector — by hmellor (创建于: 2026-03-05 14:26 (UTC+8))
- #36113 fix: handle float-like strings in int() parsing to prevent ValidationError — 无标签 — by OiPunk (创建于: 2026-03-05 15:33 (UTC+8))
- #36102 [Frontend] Add gRPC server support for `vllm launch render` — frontend,ci/build — by hyeongyun0916 (创建于: 2026-03-05 14:19 (UTC+8))
- #36097 [WIP][Model Runner V2] Support multi-modal embeddings for spec decode model — needs-rebase,v1 — by TheEpicDolphin (创建于: 2026-03-05 13:23 (UTC+8))
- #36087 [CI] Don’t leave docs preview comment on closed PRs — ready,ci/build — by hmellor (创建于: 2026-03-05 12:19 (UTC+8))
- #36086 [AMD][Build] Add DeepEP to ROCm Dockerfile — rocm,ci/build — by rjrock (创建于: 2026-03-05 12:15 (UTC+8))
- #36090 [ROCm][CI] Making some tests optional to reduce workload — rocm,ci/build — by AndreasKaratzas (创建于: 2026-03-05 12:46 (UTC+8))
- #36089 [Bugfix] Handle TimeoutError in Voxtral buffer_realtime_audio to prevent silent hang — bug — by OiPunk (创建于: 2026-03-05 12:32 (UTC+8))
已合并 PR
- #35136 [Release] Include source distribution (sdist) in PyPI uploads — ready,ci/build — by dougbtv (合并于: 2026-03-05 17:43 (UTC+8))
- #31164 [openai api] log exception in exception handler (1/N) — frontend,ready,v1 — by andyxning (合并于: 2026-03-06 00:00 (UTC+8))
- #36088 Don’t fire ray compatibility webhook when PR or branch is not provided — ready,ci/build — by jeffreywang-anyscale (合并于: 2026-03-06 08:42 (UTC+8))
- #35384 [Performance] Extract KV-cache update from TreeAttention backend — ready,v1 — by dorhuri123 (合并于: 2026-03-06 08:22 (UTC+8))
- #36176 [Model Runner V2] Fix warmup for very small kvcache and/or blocksizes — ready,v1 — by njhill (合并于: 2026-03-06 06:48 (UTC+8))
- #35810 [compile] Consistent compiler config for saved/loaded vllm backends. — ready — by zhxchen17 (合并于: 2026-03-06 04:08 (UTC+8))
- #32550 [Model] Add support for OLMo Hybrid — documentation,new-model,ready — by yanhong-lbh (合并于: 2026-03-06 03:51 (UTC+8))
- #35775 [CI] Add explicit permissions to macOS smoke test workflow — ready,ci/build — by russellb (合并于: 2026-03-06 03:08 (UTC+8))
- #36059 [BugFix] Fallback from FA4->FA2 for Batch Invariance — bug,ready,v1 — by frankwang28 (合并于: 2026-03-06 03:05 (UTC+8))
- #35794 [Perf] Optimize FusedMoEModularKernel output tensor using torch.empty — ready-run-all-tests — by xyang16 (合并于: 2026-03-06 02:47 (UTC+8))
- #36146 [Bugfix] Disable FlashInfer TRTLLM BF16 path for non-gated MoE — bug,ready,nvidia — by tomeras91 (合并于: 2026-03-06 01:49 (UTC+8))
- #36133 ParakeetProjection.norm = RMSNorm instead of nn.LayerNorm — ready — by netanel-haber (合并于: 2026-03-06 01:29 (UTC+8))
- #36078 [XPU] Enable ModelRunnerV2 on XPU — ready,v1 — by xinyu-intel (合并于: 2026-03-06 01:19 (UTC+8))
- #35994 [BUGFIX]Fix Qwen-Omni models audio max_token_per_item estimation error leading to encoder_cache_size is 0 — bug,ready,qwen — by jjmiao1 (合并于: 2026-03-06 01:16 (UTC+8))
- #36107 [CI] Stabilize test_no_args_tool_call and add ROCm-specific server args — rocm,ready — by AndreasKaratzas (合并于: 2026-03-05 21:52 (UTC+8))
- #34934 [Bugfix][CI] fix typos — bug,documentation,performance,rocm,structured-output,frontend,ready,ci/build,v1,multi-modality — by 1195343015 (合并于: 2026-03-06 01:05 (UTC+8))
- #35246 [ROCm] Refactor ROCm attention backend selection logic — documentation,rocm,ready,v1 — by SageMoore (合并于: 2026-03-06 00:51 (UTC+8))
- #36140 [Bugfix] Fix Qwen-VL tokenizer implementation — bug,performance,ready,qwen — by DarkLight1337 (合并于: 2026-03-06 00:07 (UTC+8))
- #36108 refactor funasr model. — ready,qwen — by AllenDou (合并于: 2026-03-06 00:07 (UTC+8))
- #36120 Fix Eagle3 with Transformers modelling backend — ready — by hmellor (合并于: 2026-03-05 21:59 (UTC+8))
- #34616 [KVConnector] Scheduler: Fix num_computed_tokens after async KV load — ready,v1,kv-connector — by orozery (合并于: 2026-03-05 22:25 (UTC+8))
- #36135 [Docs] Only build docs if `documentation` or `ready` labels are present — documentation,ready — by hmellor (合并于: 2026-03-05 21:58 (UTC+8))
- #36115 [Chore] Correct MTP models test registry ordering — ready — by Isotr0py (合并于: 2026-03-05 16:55 (UTC+8))
- #35976 [Bugfix] Fix RunAI streamer crash with S3-hosted model paths — bug,frontend,ready,v1 — by AjAnubolu (合并于: 2026-03-05 20:14 (UTC+8))
- #36114 [Bugfix] Fix mypy errors in hermes_tool_parser.py — bug,ready — by 842974287 (合并于: 2026-03-05 19:37 (UTC+8))
- #36020 [Misc] Fix SyntaxWarning - invalid escape sequence ‘\e’ — frontend,ready — by cjackal (合并于: 2026-03-05 18:55 (UTC+8))
- #36085 [Hardware] Replace `torch.cuda.synchronize()` api with `torch.accelerator.synchronize` — documentation,performance,ready,v1,nvidia,ready-run-all-tests — by jikunshang (合并于: 2026-03-05 18:36 (UTC+8))
- #32767 [Docs] add Dynamo/aibrix integration and kubeai/aks link — documentation,ready — by pacoxu (合并于: 2026-03-05 17:39 (UTC+8))
- #34083 [Docs] Update docs to include mm processor + encoder benchmarks — documentation,frontend,ready,v1,multi-modality — by reaganjlee (合并于: 2026-03-05 17:38 (UTC+8))
- #36032 qwen3coder tool parser fix anyOf double encoded parameters — ready,qwen — by cmunley1 (合并于: 2026-03-05 17:06 (UTC+8))
- #36006 [Misc] Remove deprecated items that are due for removal — frontend,ready,multi-modality — by hickeyma (合并于: 2026-03-05 14:14 (UTC+8))
- #35632 [Docs] Update `CacheConfig` block_size docstring to remove inaccurate limit when using CUDA — ready,nvidia — by eicherseiji (合并于: 2026-03-05 14:24 (UTC+8))
- #36036 [Bugfix] Fix block_size for hybrid model MTP — bug,speculative-decoding,ready,v1 — by benchislett (合并于: 2026-03-05 14:10 (UTC+8))
- #35973 [Doc] Add Parallel Draft Models — documentation,ready — by zihaoanllm (合并于: 2026-03-05 13:44 (UTC+8))
- #36062 [Kernel] [Helion] [11/N] Retune configs for silu_mul_fp8 — ready — by gmagogsfm (合并于: 2026-03-05 13:25 (UTC+8))
- #36087 [CI] Don’t leave docs preview comment on closed PRs — ready,ci/build — by hmellor (合并于: 2026-03-05 12:51 (UTC+8))
- #35849 [Bugfix] Fix score layer quantization for sequence classification models - Qwen3 (VL) Reranker — bug,ready,qwen — by gkswns0531 (合并于: 2026-03-05 12:57 (UTC+8))
- #35890 [Perf] Use dummy M for weight prepacking on x86 — ready,cpu — by tianmu-li (合并于: 2026-03-05 12:56 (UTC+8))
- #35866 Order `config.py` in Lexicographical order — ready,ci/build,multi-modality — by askliar (合并于: 2026-03-05 12:56 (UTC+8))
- #35921 [compile] Fix extra cache save on warm start. — ready — by zhxchen17 (合并于: 2026-03-05 12:56 (UTC+8))
- #35328 [Core] Move ray-specific WorkerWrapperBase methods to RayWorkerWrapper — ready,v1 — by njhill (合并于: 2026-03-05 12:11 (UTC+8))
关闭但未合并的 PR
- #36195 [TEST ONLY]Restore AsyncTP fusion for FlashInfer FP8 BMM (need advices)#27893 — ci/build,nvidia — by baonudesifeizhai (关闭于: 2026-03-06 11:19 (UTC+8))
- #36157 [CI] Add mandatory H100 TP=2 smoke test — ci/build,v1 — by stecasta (关闭于: 2026-03-06 01:24 (UTC+8))
- #34926 [Core][KV Transfer] Support PD disagg + speculative decoding KV lifecycle — needs-rebase,v1,kv-connector — by ZhanqiuHu (关闭于: 2026-03-06 05:46 (UTC+8))
- #36128 [Bugfix] Fix float-like CPU env int parsing for #35967 — bug — by MatteoFari (关闭于: 2026-03-06 05:18 (UTC+8))
- #32509 [Refactor] Extract KV-cache update logic for FlashAttentionDiffKV backend — documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality,tool-calling — by VedantMadane (关闭于: 2026-03-06 04:11 (UTC+8))
- #33747 feat(grpc): expose kv_connector and kv_role in GetServerInfoResponse — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by slin1237 (关闭于: 2026-03-06 03:17 (UTC+8))
- #35590 Use smg-grpc-proto package for gRPC proto definitions — ci/build — by CatherineSue (关闭于: 2026-03-06 03:02 (UTC+8))
- #33335 Parakeet avlm — frontend,needs-rebase — by netanel-haber (关闭于: 2026-03-06 02:08 (UTC+8))
- #36154 chunk audio clips by 30 seconds to match mcore — 无标签 — by netanel-haber (关闭于: 2026-03-06 00:09 (UTC+8))
- #36052 [Bugfix] Fix async OffloadingConnector silently losing decode-phase blocks — bug,kv-connector — by ZhanqiuHu (关闭于: 2026-03-05 22:19 (UTC+8))
- #34241 [Bugfix] Grammar ignored when reasoning ends within speculated tokens — bug,structured-output,v1 — by sfbemerk (关闭于: 2026-03-05 20:37 (UTC+8))
- #36126 Fix TypeError when rope factor is null during max length derivation — 无标签 — by siewcapital (关闭于: 2026-03-05 19:32 (UTC+8))
- #33798 Add Kimi-Audio-7B ASR support for audio transcriptions and chat completions — documentation,new-model,frontend,multi-modality,gpt-oss — by tunglinwood (关闭于: 2026-03-05 18:47 (UTC+8))
- #36121 README Sylee — documentation,kv-connector — by brianlsy98 (关闭于: 2026-03-05 16:28 (UTC+8))
- #36113 fix: handle float-like strings in int() parsing to prevent ValidationError — 无标签 — by OiPunk (关闭于: 2026-03-05 15:50 (UTC+8))
- #35484 [WIP][Bugfix] Fix multi-node PP crash with logprobs due to pinned memory serialization — bug,needs-rebase,v1 — by haosdent (关闭于: 2026-03-05 13:27 (UTC+8))