vLLM Development Activity Report - 2025-12-29

Time window: 2025-12-29 10:51 (UTC+8) ~ 2025-12-30 10:51 (UTC+8)
Statistics: New Issues 17 | Closed Issues 19 | New PRs 35 | Merged PRs 20 | Closed Unmerged PRs 43

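📊 Daily Development Summary

In the 24-hour window from December 29 to 30, 2025, the vLLM project remained highly active, handling more than 70 Issues and PRs. Work focused on extending multimodal model support (LoRA adaptation in particular), performance optimizations for specific models (such as MiniMax-M2 and DeepSeek-V3.2), and continued fixes for test and stability problems across platforms (especially AMD/ROCm). Several key optimizations and features were merged into main, including enabling async scheduling by default and the unified replacement of convolution layers, reflecting the project's continued push for high performance and broad compatibility.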

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was brisk in this window, centered on fixing performance problems and improving test stability.

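Key performance issue report (#31475):
- Issue: User ehartford reports that when running MoE models such as GLM-4.7 and MiniMax-M2.1 on 8×MI300X, FP8 inference is significantly slower than BF16, contrary to expectations (for example, GLM-4.7 reaches ~30 TPS in FP8 versus ~60 TPS in BF16). The problem reproduces on both vLLM and sglang, which suggests a general performance problem in the FP8 path on the ROCm platform.
- Impact: This directly challenges the assumption that FP8 gives efficient inference on AMD hardware, and it matters a great deal to MI300X users deploying large MoE models. The issue has been labeled bug and rocm, and the relevant AMD maintainers (@hongxiayang and others) were notified automatically.
ROCm bug fixes and optimizations (merged PRs):
- PR #31502: Fixes a ROCm FP8 static quantization problem introduced by a recent refactor, ensuring quantization parameters are replaced and loaded correctly.
- PR #31499: As part of the MoE refactor, turns prepare_moe_fp8_layer_for_marlin into a pure function, improving compatibility with the quantized weight reloading mechanism.
- PR #31460 / #31187: Fix several failures in ROCm CI tests, including replacing NIXL with RIXL and correcting the attention tests.
- PR #30719: Fixes numerical errors in the GPTQ GEMM kernel caused by a race condition when zeroing the output tensor; the problem is more likely to trigger when input_size > 128.
Attempted optimization (unsuccessful) (#31487):
- A contributor tried to add a fused Triton LayerNorm kernel for ROCm, aiming to reduce kernel launches and HBM accesses to speed up BERT/ViT-style architectures. The PR was closed quickly with no reason given, but it shows community interest in performance work on AMD.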

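Summary: The central AMD development is that a serious FP8 performance problem has been raised explicitly, while the development team continues to fix core functionality and stabilize the test suite. FP8 underperforming BF16 on MI300X needs a joint in-depth investigation by the AMD and vLLM teams.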

💬 Analysis of High-Traffic Discussions

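Issue #31475: FP8 slower than BF16 on MI300X
- Core topic: FP8 inference on AMD MI300X GPUs behaves abnormally, running slower than the higher-precision BF16.
- Positions and progress:
  - Reporter (ehartford): Provided a detailed test environment, models, and reproducible data, and noted that the problem appears in both vLLM and sglang, hinting at a shared cause in the underlying ROCm implementation.
  - Community/automation: The issue was labeled automatically and the ROCm maintainers were notified. The reporter further verified and confirmed the same behavior in the sglang project, which strengthens the view that the root cause lies in the ROCm platform.
- Current status: Open, awaiting in-depth technical analysis from AMD or the core development team. This is a potentially critical problem for the competitiveness of the AMD platform.
Issue #31479: Enable LoRA support for more multimodal models
- Core topic: Currently only some multimodal models (Qwen VL, idefics3) support LoRA adapters on the vision tower and the connector; this needs to be extended to all multimodal models.
- Positions and collaboration:
  - Initiator (jeejeelee): Clearly explained the technical reason why the two functions get_num_mm_encoder_tokens and get_num_mm_connector_tokens need to be implemented.
  - Community contributors (Zyyeric, B-201, Anexdeus): Several people volunteered to take on the work, showing a strong willingness to collaborate.
  - Coordination: The initiator suggested that contributors agree on a division of labor among themselves to avoid duplicated work.
- Current status: Open and labeled help wanted. A related PR, #31513 (adding support for LLaVA), has been submitted, so work is under way.
Issue #31501: stream-interval > 1 causes tool call arguments to be lost
- Core topic: When the --stream-interval parameter is set to a value greater than 1, the arguments of tool calls are lost from the streamed output.
- Positions and troubleshooting:
  - Reporter (ktsaou): Described the symptom in detail (arguments become an empty {}) and provided the full server launch command and a client reproduction script.
  - Maintainer question (chaunceyjiang): Asked for client reproduction code to pinpoint the problem.
  - Reporter follow-up: Quickly provided a reproduction script based on the OpenAI client.
- Point of contention: No substantive disagreement; this is a clear bug. The discussion centered on how to reproduce and locate it efficiently.
- Current status: Open, with a complete reproduction path available; awaiting a fix.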
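
To make the symptom in #31501 concrete, here is a minimal sketch of how a client might reassemble tool-call arguments from streamed chunks and flag calls whose arguments end up empty. It assumes chunks shaped like OpenAI-style chat completion deltas (a `tool_calls` list whose entries carry an `index` and argument fragments). The helper names and the sample chunks are hypothetical and hand-written for illustration; they are not taken from the issue's reproduction script or from vLLM output.

```python
import json


def accumulate_tool_calls(chunks):
    """Merge streamed tool-call deltas into complete calls, keyed by index."""
    calls = {}
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            for tc in delta.get("tool_calls") or []:
                call = calls.setdefault(
                    tc["index"], {"id": None, "name": None, "arguments": ""}
                )
                if tc.get("id"):
                    call["id"] = tc["id"]
                fn = tc.get("function") or {}
                if fn.get("name"):
                    call["name"] = fn["name"]
                # Argument text arrives in fragments and is concatenated in order.
                call["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


def empty_argument_calls(calls):
    """Return the names of calls whose arguments parse to an empty object."""
    return [c["name"] for c in calls if not json.loads(c["arguments"] or "{}")]


# Hand-written sample chunks (illustrative only).
sample_chunks = [
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "id": "call_1", "type": "function",
         "function": {"name": "get_weather", "arguments": ""}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": "{\"city\": "}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": "\"Paris\"}"}}]}}]},
    {"choices": [{"delta": {}}]},
]

merged = accumulate_tool_calls(sample_chunks)
suspicious = empty_argument_calls(merged)
```

Running the same check against streams captured with different --stream-interval values would show whether the argument fragments are missing from the stream itself or lost during client-side reassembly.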

🔥 Hot Topics and Trends

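Deepening multimodal model support: The discussion around #31479 and the work in PR #31513 show that the community is carrying advanced features such as LoRA over from language models to multimodal models, increasing their fine-tuning flexibility.
Model-specific performance tuning: Both a performance tracking issue for a specific model (#31473 for DeepSeek-V3.2) and a concrete optimization PR (#31493 for MiniMax-M2/M2.1) appeared, indicating that optimization work is moving from general-purpose kernels toward specializations for popular model architectures.
Scheduler evolution: Merging PR #27614, "Enable async scheduling by default", is an important milestone. It signals that, after long iteration and validation, async scheduling is now considered a stable and recommended production configuration.
Deployment and maintainability: Several PRs aim to improve developer experience and maintainability, such as unified version management (#31492), refactoring duplicated code (#31476), documentation link fixes (#31494), and offline API documentation support (#30184).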

🛠️ Key Technical Changes

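PR #31502 & #31499: ROCm quantization and MoE refactor fixes
- Content: Fix FP8 static quantization failing on ROCm because parameters were replaced incorrectly, and make the related function pure to improve compatibility.
- Impact: Ensures AMD users can load and use quantized models normally; a key patch for keeping the AMD platform functionally complete.
PR #31493: Optimize QKNorm for MiniMax-M2/M2.1
- Content: Uses a fused operation to reduce the number of all_reduce communications that QKNorm needs under tensor parallelism to one.
- Impact: Directly lowers communication overhead for this model family in distributed inference and raises throughput (per the PR description, decode throughput improves by about 7% at TP=4).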
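
The report does not show how PR #31493 performs the fusion. As a rough illustration of the general idea only, the sketch below assumes QKNorm is an RMS-style normalization over a dimension that is sharded across tensor-parallel ranks, so each rank holds a partial sum of squares for Q and for K. Packing both partial sums into one buffer lets a single all_reduce replace two. `FakeGroup` and both helper functions are hypothetical, pure-Python stand-ins, not vLLM or torch.distributed APIs.

```python
import math


class FakeGroup:
    """Toy stand-in for a tensor-parallel group that counts all_reduce calls."""

    def __init__(self):
        self.calls = 0

    def all_reduce(self, per_rank):
        # per_rank holds one equal-length list of floats per rank;
        # the result is their element-wise sum, as a sum all-reduce would give.
        self.calls += 1
        return [sum(vals) for vals in zip(*per_rank)]


def sum_sq(xs):
    return sum(x * x for x in xs)


def _scale(total, n, eps):
    # Reciprocal RMS over the full (unsharded) dimension.
    return 1.0 / math.sqrt(total / n + eps)


def rms_scales_separate(group, q_shards, k_shards, eps=1e-6):
    """Baseline: one all_reduce for Q and another for K."""
    q_total = group.all_reduce([[sum_sq(q)] for q in q_shards])[0]
    k_total = group.all_reduce([[sum_sq(k)] for k in k_shards])[0]
    nq = sum(len(q) for q in q_shards)
    nk = sum(len(k) for k in k_shards)
    return _scale(q_total, nq, eps), _scale(k_total, nk, eps)


def rms_scales_fused(group, q_shards, k_shards, eps=1e-6):
    """Fused: pack both partial sums per rank and reduce them together."""
    packed = [[sum_sq(q), sum_sq(k)] for q, k in zip(q_shards, k_shards)]
    q_total, k_total = group.all_reduce(packed)
    nq = sum(len(q) for q in q_shards)
    nk = sum(len(k) for k in k_shards)
    return _scale(q_total, nq, eps), _scale(k_total, nk, eps)


# Two simulated ranks, each holding a shard of Q and K.
q_shards = [[1.0, 2.0], [3.0, 4.0]]
k_shards = [[0.5, 0.5], [1.5, 2.5]]
g_separate, g_fused = FakeGroup(), FakeGroup()
scales_separate = rms_scales_separate(g_separate, q_shards, k_shards)
scales_fused = rms_scales_fused(g_fused, q_shards, k_shards)
```

Both variants produce the same scales; only the number of collective calls differs, which is where the communication saving would come from.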
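PR #31498: Transformers backend uses vLLM ConvNdLayer throughout
- Content: nn.Conv2d/nn.Conv3d modules in models loaded through the Transformers modeling backend are automatically replaced with vLLM's ConvNdLayer.
- Impact: Extends the benefit of the custom convolution operator to more models; for non-CUDA hardware (OOT devices) in particular, it may bring a general performance improvement.
PR #27614: Enable async scheduling by default
- Content: Makes --async-scheduling on by default.
- Impact: A major change to the scheduling system. Async scheduling helps raise GPU utilization, especially when handling many concurrent short requests. The change indicates the feature is considered stable enough and is intended to give most users better out-of-the-box performance.
PR #30184: Offline FastAPI documentation support
- Content: Bundles static assets so that the interactive /docs API documentation works in isolated environments without internet access.
- Impact: Makes vLLM easier to deploy, integrate, and debug in security-sensitive or intranet environments.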

📈 Development Activity Observations

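High throughput: 35 new PRs and 20 merges indicate an efficient code review and merge process.
Diverse contributors: Core maintainers (such as njhill, robertgshaw2-redhat, hmellor) are driving architectural improvements and fixing deep bugs, while community contributors are active across model support, documentation, and test fixes.
Healthy collaboration: In Issues such as #31479, community members volunteer for tasks and coordinate the division of labor, a sign of a healthy project ecosystem.
Sustained platform coverage: Development and testing for CPU, NVIDIA, AMD, TPU, and other platforms proceed in parallel, reflecting the project's broad support for heterogeneous hardware.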

💡 Issues Worth Watching

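MI300X FP8 performance bottleneck (#31475): If this is confirmed to be a limitation in the underlying ROCm stack, it will seriously hurt the competitiveness of AMD's latest hardware for efficient inference. AMD's official response and any fix should be watched closely.
LoRA support for more multimodal models (#31479): A well-defined, community-driven feature extension. Its progress will affect the fine-tuning options of many multimodal application developers.
Speculative decoding optimization proposal (RFC #31483): Proposes an adaptive rejection sampling algorithm named EARS, aimed at making speculative decoding more efficient on generation tasks with high uncertainty. Whether the RFC is adopted and implemented may affect vLLM's future performance on creative generation tasks.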
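
The report does not describe how EARS works. For context only, the sketch below shows the standard, non-adaptive accept/reject step of speculative sampling, which is presumably the baseline that an adaptive variant would modify; it is not EARS and not vLLM's implementation. The function name and the toy distributions are hypothetical.

```python
import random


def verify_draft_token(p, q, draft_token, rng):
    """Standard speculative-sampling check for a single drafted token.

    p: target-model probabilities, q: draft-model probabilities (same length).
    draft_token must have been sampled from q, so q[draft_token] > 0.
    Returns (token, accepted).
    """
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q), renormalized
    # implicitly by random.choices. The residual has positive mass whenever
    # a rejection is possible, because then p[draft_token] < q[draft_token].
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual)[0]
    return token, False


# Toy distributions: the draft model is overconfident in token 0.
p = [0.5, 0.5]
q = [0.9, 0.1]

rng = random.Random(0)
trials = 20000
count_token0 = 0
for _ in range(trials):
    drafted = rng.choices(range(len(q)), weights=q)[0]
    token, _accepted = verify_draft_token(p, q, drafted, rng)
    count_token0 += token == 0
freq_token0 = count_token0 / trials
```

The point of the accept/reject rule is that the emitted tokens follow the target distribution p even though they were drafted from q, so the empirical frequency of token 0 should land near 0.5 here. When the draft and target distributions disagree often, more drafts are rejected, which is the regime the RFC says it targets.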

📋 Appendix: Detailed Data Lists

New Issues

#31518 [Bug]: NUMA node validation incorrectly compares against CPU IDs instead of NUMA node IDs — bug — by SameerAsal (created: 2025-12-30 10:45 (UTC+8))
#31484 [Usage]: RuntimeError when running Qwen2.5-VL-7B-Instruct with vllm: Potential version incompatibility — usage — by puyuan1996 (created: 2025-12-29 16:36 (UTC+8))
#31486 [Feature]: GLM 4.7 vocab padding feature — feature request — by H100-H200-B200 (created: 2025-12-29 17:30 (UTC+8))
#31515 [Feature]: need scheduler solution with high priority to process prefill — feature request — by 184603418 (created: 2025-12-30 10:09 (UTC+8))
#31503 [Bug]: Value of aborted requests of vllm:request_success_total metrics is always zero even if some requests were actually aborted — bug — by xTayEx (created: 2025-12-30 00:34 (UTC+8))
#31479 [Feature]: Enable LoRA support for tower and connector in more MM models — help wanted,feature request — by jeejeelee (created: 2025-12-29 15:28 (UTC+8))
#31507 [CI Failure]: mi325_1: Model Executor Test — ci-failure — by AndreasKaratzas (created: 2025-12-30 05:28 (UTC+8))
#31475 [Bug]: FP8 inference much slower than BF16 on MI300X for GLM-4.7 and MiniMax-M2.1 — bug,rocm — by ehartford (created: 2025-12-29 13:11 (UTC+8))
#31501 [Bug]: --stream-interval > 1 causes tool call arguments to be empty/lost — no labels — by ktsaou (created: 2025-12-29 22:46 (UTC+8))
#31495 [Bug]: GPT-OSS-120b giving incomplete responses with gibberish \xa0\u200b ending characters — bug — by rhajou (created: 2025-12-29 20:26 (UTC+8))
#31490 [Performance]: The prefill phase in the disaggregated_serving_p2p_nccl_xpyd example exhibits poor performance — performance — by leefty-1024 (created: 2025-12-29 18:03 (UTC+8))
#31474 [Feature]: GLM 4.7 vocab padding feature — feature request — by H100-H200-B200 (created: 2025-12-29 12:55 (UTC+8))
#31485 [Feature]: Support Tool Calling with transformers 5.x for GLM-4.6V Models — feature request — by tamthaihoangminh (created: 2025-12-29 17:14 (UTC+8))
#31483 [RFC]: a efficient adaptive rejection sampling for accelerating speculative decoding. — RFC — by sunchendd (created: 2025-12-29 16:31 (UTC+8))
#31480 [Usage]: run deepseek v3.2 failed — usage — by ljwps (created: 2025-12-29 15:33 (UTC+8))
#31469 [Feature]: Optimize the definition of the fake function in the code. — feature request — by lengrongfu (created: 2025-12-29 11:14 (UTC+8))
#31473 [Performance]: DeepSeek-V3.2 Performance Optimization Tracking — performance — by LucasWilkinson (created: 2025-12-29 12:12 (UTC+8))

Closed Issues

#17604 [Bug]: size_k must divisible by BLOCK_SIZE_K error when using tensor parallelism with AWQ-quantized MoE models — bug,stale — by ehartford (closed: 2025-12-30 10:31 (UTC+8))
#19424 [Bug]: InternVL3 FP8 missing module/parameter on model load — bug,stale — by bbeiler-ridgeline (closed: 2025-12-30 10:30 (UTC+8))
#25864 [Bug]: FlexAttention does not support sliding window yet !!! — bug,stale — by therohit777 (closed: 2025-12-30 10:30 (UTC+8))
#20627 [Bug]: After wake_up sleeping model in OpenAI API server the model generate gibberish output — bug,stale — by ekmekovski (closed: 2025-12-30 10:30 (UTC+8))
#20995 [Bug]: Unable to use Qwen/Qwen2.5-Omni-7B with --mm-processor-kwargs — bug,stale — by arpitgupta14 (closed: 2025-12-30 10:30 (UTC+8))
#22044 [RFC]: Optimize Input Media Processing in vLLM — RFC,stale,multi-modality — by huachenheli (closed: 2025-12-30 10:30 (UTC+8))
#22702 [Bug]: Unexpected CUDA OOM with larger TP size — bug,stale — by Farrrrland (closed: 2025-12-30 10:30 (UTC+8))
#23662 [Bug]: RuntimeError: operator _C::marlin_qqq_gemm does not exist — bug,stale — by zxue2 (closed: 2025-12-30 10:29 (UTC+8))
#23879 [MM Encoder] Investigate heuristic for enabling encoder DP by default — help wanted,feature request,stale — by ywang96 (closed: 2025-12-30 10:29 (UTC+8))
#23985 [Bug]: Non-output-rank Workers Fail to Report Runtime Errors, Causing MultiProcExecutor to Wait for RPC Timeout — bug,stale — by fangyuchu (closed: 2025-12-30 10:29 (UTC+8))
#29535 [CI Failure]: mi325_2: Distributed Tests (H200) — ci-failure — by AndreasKaratzas (closed: 2025-12-30 05:30 (UTC+8))
#29528 [CI Failure]: mi325_8: Distributed Tests (8 GPUs) — ci-failure — by AndreasKaratzas (closed: 2025-12-30 05:29 (UTC+8))
#29520 [CI Failure]: mi325_1: Multi-Modal Models Test (Standard) — ci-failure — by AndreasKaratzas (closed: 2025-12-30 05:21 (UTC+8))
#31244 [CI Failure]: mi325_2: Plugin Tests (2 GPUs) — ci-failure — by AndreasKaratzas (closed: 2025-12-30 05:21 (UTC+8))
#27679 Async Scheduling Plan — feature request — by njhill (closed: 2025-12-30 04:20 (UTC+8))
#29847 [Feature]: enable FastAPI documention (/docs) for air-gaped host — feature request — by runyournode (closed: 2025-12-30 00:22 (UTC+8))
#31437 [Bug]: Streaming tool calls missing id/type/name in finish chunk — no labels — by amittell (closed: 2025-12-29 21:10 (UTC+8))
#31474 [Feature]: GLM 4.7 vocab padding feature — feature request — by H100-H200-B200 (closed: 2025-12-29 17:28 (UTC+8))
#26823 [Feature]: support jina multimodal reranker api request format — feature request — by tongda (closed: 2025-12-29 17:13 (UTC+8))

New PRs

#31519 [BugFix] Fix NUMA node validation in CPU platform — no labels — by SameerAsal (created: 2025-12-30 10:51 (UTC+8))
#31517 [Core] Remove unused num_tokens parameter from _init_model_kwargs — v1 — by maang-h (created: 2025-12-30 10:36 (UTC+8))
#31510 [Core] Deduplicate generate/encode logic in AsyncLLM — frontend,ready,v1 — by njhill (created: 2025-12-30 06:47 (UTC+8))
#31497 [Frontend] add continue_final_message parameter to /embeddings endpoint — frontend,ready — by kevin-pw (created: 2025-12-29 21:38 (UTC+8))
#31514 FP8 KV cache append optimized with precomputed inverse scale — no labels — by tfpre (created: 2025-12-30 09:44 (UTC+8))
#31516 Extract activations — needs-rebase,ci/build,v1,cpu — by hanneshapke (created: 2025-12-30 10:10 (UTC+8))
#31513 [Model] Enable LoRA support for tower and connector in LLaVA — documentation — by jayhemnani9910 (created: 2025-12-30 08:48 (UTC+8))
#31482 [log] enable max_log_len trim only when needed — frontend — by andyxning (created: 2025-12-29 16:13 (UTC+8))
#31476 FlashInferUnification — nvidia — by rajanyadav0307 (created: 2025-12-29 13:14 (UTC+8))
#31506 Add embedding input functionality for disabled modalities — tpu,v1,multi-modality — by reaganjlee (created: 2025-12-30 05:11 (UTC+8))
#31512 [Hardware][Power] Add IBM MASS + NUMA optimizations for POWER8 — documentation,cpu — by Scottcjn (created: 2025-12-30 08:26 (UTC+8))
#31500 Migrate meetups & sponsors [2/N] — documentation — by esmeetu (created: 2025-12-29 22:32 (UTC+8))
#31511 [Doc] Fix README image rendering on GitHub — documentation — by happyahluwalia (created: 2025-12-30 07:17 (UTC+8))
#31508 [Minor] Various small code cleanups/simplifications — structured-output,frontend,ready,v1,multi-modality — by njhill (created: 2025-12-30 05:45 (UTC+8))
#31509 [2/n] Migrate kernels to libtorch stable ABI — ci/build,cpu,nvidia — by mikaylagawarecki (created: 2025-12-30 06:34 (UTC+8))
#31492 [CI/Build][Docker] Add centralized version manifest for Docker builds — ci/build — by mritunjaysharma394 (created: 2025-12-29 18:51 (UTC+8))
#31502 [Bugfix][ROCm] Fix Static Quant Issue — rocm — by robertgshaw2-redhat (created: 2025-12-30 00:03 (UTC+8))
#31499 [MoE Refactor][12/N] Marlin Fp8 MoE Pure Function — rocm,ready — by robertgshaw2-redhat (created: 2025-12-29 22:31 (UTC+8))
#31505 [Bugfix]Fix pooling model always disabled due to incorrect PP rank check — v1 — by vintipandey (created: 2025-12-30 03:45 (UTC+8))
#31489 [LoRA] Hide lora_init_id in request.py — documentation,performance,frontend,v1,llama — by jeejeelee (created: 2025-12-29 17:57 (UTC+8))
#31504 [MoE Refactor] Explicit construct mk for flashinfer bf16 kernel — no labels — by zyongye (created: 2025-12-30 03:01 (UTC+8))
#31498 Replace nn.ConvNd with vLLM’s ConvNdLayer for Transformers modeling backend — ready — by hmellor (created: 2025-12-29 21:48 (UTC+8))
#31493 Optimize QKNorm for MiniMax-M2/M2.1 — ready — by rogeryoungh (created: 2025-12-29 18:57 (UTC+8))
#31496 Migrate doc to website: Hardware Plugins (1/N) — documentation,ready — by esmeetu (created: 2025-12-29 21:20 (UTC+8))
#31494 [Docs] Use relative md links instead of absolute html links for cross referencing — documentation,ready,cpu — by hmellor (created: 2025-12-29 19:07 (UTC+8))
#31470 Remove patched_fused_scaled_matmul_reduce_scatter torch is 2.9.1 — no labels — by lengrongfu (created: 2025-12-29 11:38 (UTC+8))
#31488 [CI] fix test_chat_truncation_content_not_null test — ready,gpt-oss — by chaunceyjiang (created: 2025-12-29 17:53 (UTC+8))
#31491 [CI][NIXL] Split DPEP tests — ready,ci/build,v1,kv-connector — by NickLucche (created: 2025-12-29 18:16 (UTC+8))
#31487 [Hardware][AMD] Add Fused Triton LayerNorm kernel for ROCm — rocm — by sayakmondal321 (created: 2025-12-29 17:49 (UTC+8))
#31477 Add docker buildx bake configuration — no labels — by amrmahdi (created: 2025-12-29 14:42 (UTC+8))
#31481 [CI/Build] Add source build test to catch build failures early — ci/build — by mhetrerajat (created: 2025-12-29 15:33 (UTC+8))
#31478 Optimize Top-K Sigmoid Routing and QKNorm for MiniMax-M2/M2.1 — ci/build — by rogeryoungh (created: 2025-12-29 15:23 (UTC+8))
#31468 feat: Add per-layer MLP size support for Qwen pruning — documentation,qwen — by CedricHwong (created: 2025-12-29 11:10 (UTC+8))
#31471 [Model] Add HyperCLOVAX-SEED-Think-32B vision-language model support — new-model — by effortprogrammer (created: 2025-12-29 11:49 (UTC+8))
#31472 [model] llama MTP — llama — by sun-chao-hahaha (created: 2025-12-29 11:53 (UTC+8))

Merged PRs

#31510 [Core] Deduplicate generate/encode logic in AsyncLLM — frontend,ready,v1 — by njhill (merged: 2025-12-30 10:42 (UTC+8))
#31207 fix: update kimi k2 tool parser logic — ready — by wangln19 (merged: 2025-12-30 10:01 (UTC+8))
#27577 [Prefix Cache] Include lora_name in BlockStored event for deterministic KV-cache reconstruction — documentation,ready,v1,kv-connector — by sagearc (merged: 2025-12-30 08:17 (UTC+8))
#30184 [Feature] Add offline FastAPI documentation support for air-gapped environments — frontend,ready,ci/build — by rickychen-infinirc (merged: 2025-12-30 00:22 (UTC+8))
#31460 [CI]Test Group ‘NixlConnector PD accuracy tests’ is fixed — documentation,rocm,ready,ci/build,kv-connector — by qli88 (merged: 2025-12-30 07:48 (UTC+8))
#31187 [CI/ROCm] Fixing “V1 Test attention (H100)” test group. — rocm,ready,v1 — by Alexei-V-Ivanov-AMD (merged: 2025-12-30 05:53 (UTC+8))
#31502 [Bugfix][ROCm] Fix Static Quant Issue — rocm — by robertgshaw2-redhat (merged: 2025-12-30 05:27 (UTC+8))
#31499 [MoE Refactor][12/N] Marlin Fp8 MoE Pure Function — rocm,ready — by robertgshaw2-redhat (merged: 2025-12-30 05:27 (UTC+8))
#27614 [Core] Enable async scheduling by default — structured-output,frontend,ready,v1,gpt-oss,suppress-bc-linter,kv-connector,ready-run-all-tests — by njhill (merged: 2025-12-30 04:20 (UTC+8))
#31397 implements register kv caches in lmcache connector — ready,kv-connector — by chunxiaozheng (merged: 2025-12-30 03:13 (UTC+8))
#31462 [ROCm][CI] Skip DeepGemm-dependent test on ROCm platform — rocm,ready — by AndreasKaratzas (merged: 2025-12-29 15:31 (UTC+8))
#30719 [ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition — bug,rocm,ready — by AndreasKaratzas (merged: 2025-12-29 17:13 (UTC+8))
#31498 Replace nn.ConvNd with vLLM’s ConvNdLayer for Transformers modeling backend — ready — by hmellor (merged: 2025-12-30 00:20 (UTC+8))
#31493 Optimize QKNorm for MiniMax-M2/M2.1 — ready — by rogeryoungh (merged: 2025-12-30 00:30 (UTC+8))
#31496 Migrate doc to website: Hardware Plugins (1/N) — documentation,ready — by esmeetu (merged: 2025-12-29 23:55 (UTC+8))
#31438 [Bugfix] Preserve tool call id/type/name in streaming finish chunk — frontend,ready — by amittell (merged: 2025-12-29 21:10 (UTC+8))
#31494 [Docs] Use relative md links instead of absolute html links for cross referencing — documentation,ready,cpu — by hmellor (merged: 2025-12-29 21:33 (UTC+8))
#31488 [CI] fix test_chat_truncation_content_not_null test — ready,gpt-oss — by chaunceyjiang (merged: 2025-12-29 20:47 (UTC+8))
#31445 [Bugfix][Frontend] Fix Jina reranker multimodal input compatibility — frontend,ready,multi-modality — by twjww (merged: 2025-12-29 17:13 (UTC+8))
#31466 [CI/Build][CPU] Update CPU CI test cases — ready,ci/build,cpu — by bigPYJ1151 (merged: 2025-12-29 14:17 (UTC+8))

Closed but Unmerged PRs

#19428 [Bugfix]: fix JSON decode error when tool call argument is empty — frontend,stale — by my-git9 (closed: 2025-12-30 10:30 (UTC+8))
#21532 [Bugfix] Handle None case for dt_bias and D in selective_state_update — stale — by Knarf04 (closed: 2025-12-30 10:30 (UTC+8))
#23043 [V1] check request priority if scheduler policy is fcfs — frontend,stale,v1 — by calvin0327 (closed: 2025-12-30 10:30 (UTC+8))
#23602 Synchronize TYPE_CHECKING section with environment_variables dictionary in envs.py — needs-rebase,stale — by copilot-swe-agent (closed: 2025-12-30 10:29 (UTC+8))
#31516 Extract activations — needs-rebase,ci/build,v1,cpu — by hanneshapke (closed: 2025-12-30 10:11 (UTC+8))
#31369 [ROCm][CI] Fix rocm attention backends selection on ROCm — rocm,needs-rebase,v1 — by zejunchen-zejun (closed: 2025-12-30 09:44 (UTC+8))
#31511 [Doc] Fix README image rendering on GitHub — documentation — by happyahluwalia (closed: 2025-12-30 07:28 (UTC+8))
#31457 Fix missing tool call fields in streaming responses — frontend — by ssam18 (closed: 2025-12-30 04:45 (UTC+8))
#31321 [MoE Refactor] AITER Mixtral Fix — rocm — by robertgshaw2-redhat (closed: 2025-12-30 01:00 (UTC+8))
#31232 [Feature]: Integrate Sonic MoE kernel for Hopper GPUs — no labels — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
#31233 [Benchmark] Auto-infer dataset name from path for backward compatibility — performance — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
#31231 [Bug] Fix Qwen3-VL 2:4 sparsity shape mismatch during decompression — qwen — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
#31230 [Bug]: Fix port race condition in distributed initialization — needs-rebase — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
#30936 [v1][CP] Improve DCP/PCP/MTP error messages with actionable guidance — v1 — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
#30925 [Multimodal] Add FIPS 140-3 compliant hashing support — multi-modality — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
#31432 Add named constant for continuous usage report interval — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31431 Add INT32_BITS constant to replace magic number in quant_utils.py — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31430 Consolidate duplicate exception handling in ray/lazy_utils.py — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31428 [Bug] Fix GLM4 tool parser TypeError with empty arguments — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31427 [Feature] Add --disable-metrics-access-log to filter monitoring logs — frontend — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31426 [UX] Improve DCP/PCP/MTP error messages with backend suggestions — v1 — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31417 [Reasoning] Add GLM-4.7 reasoning parser for template-injected tag — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31416 [Feature][Cleanup] Unify flashinfer utils into package structure — nvidia — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31384 [Bugfix] Fix GLM47 tool parser TypeError for empty parameter tools — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31383 [Bugfix] Add GLM-4 append-think reasoning parser for missing token — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31364 fix(ray): correct misleading GPU warning for multi-node clusters — v1 — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31360 style: add return type annotations to config and tokenizer utils — ready — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31357 [UX] Display model name in Swagger documentation title — frontend — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31355 Add missing return type annotations to config, weight_utils, and deep_gemm modules — ready — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31318 [Code Quality] Add missing return type annotations to core modules — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31314 [Bugfix] Suppress UserWarning for non-writable buffer in binary2tensor — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31301 fix(ray): correct misleading warning message for multi-node clusters — v1 — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31292 fix: reset FlashInfer wrappers after sleep mode — v1,nvidia — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31288 fix(spec_decode): sync ngram draft tokens across TP ranks — speculative-decoding,v1 — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31287 [Bugfix] Preserve original tokenizer class name for transformers compatibility — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31284 [Doc] Add GPT-OSS (openai) tool parser documentation — documentation,tool-calling,gpt-oss — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31283 [Feature] Make EngineCore shutdown timeout configurable via environment variable — v1 — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31279 [Bugfix] Disable FlashInfer MoE in batch invariant mode for determinism — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31238 Refactor aiter_shared_expert_fusion logic into helper class — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31237 fix(models): Handle weight prefix mapping for Mamba-Codestral — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
#31487 [Hardware][AMD] Add Fused Triton LayerNorm kernel for ROCm — rocm — by sayakmondal321 (closed: 2025-12-29 17:52 (UTC+8))
#20787 [WIP] [Feature]: LoRA for vision modules — v1,qwen — by prashanth058 (closed: 2025-12-29 15:22 (UTC+8))
#31472 [model] llama MTP — llama — by sun-chao-hahaha (closed: 2025-12-29 12:31 (UTC+8))