vLLM Development Activity Report - 2025-12-29
Time window: 2025-12-29 10:51 (UTC+8) ~ 2025-12-30 10:51 (UTC+8)
Stats: 17 new issues | 19 closed issues | 35 new PRs | 20 merged PRs | 43 PRs closed without merging
📊 Daily Development Status Summary
During the 24-hour window from December 29 to 30, 2025, the vLLM project maintained a very high level of development activity, processing more than 70 issues and PRs in total. Development focused on expanding multimodal model support (LoRA adaptation in particular), performance optimizations for specific models (e.g. MiniMax-M2, DeepSeek-V3.2), and ongoing fixes for test and stability issues across platforms (especially AMD/ROCm). Several key optimizations and features (e.g. enabling async scheduling by default, unified convolution-layer replacement) were merged into the main branch, reflecting continued progress toward both high performance and broad compatibility.
🎯 AMD/ROCm Ecosystem Updates
AMD-related activity was brisk this cycle, centered on performance bug fixes and test-stability improvements.
- Key performance issue report (#31475):
  - Issue: user ehartford reports that on 8×MI300X, FP8 inference for MoE models such as GLM-4.7 and MiniMax-M2.1 is significantly slower than BF16, contrary to expectations (e.g. GLM-4.7: FP8 ~30 TPS vs. BF16 ~60 TPS). The problem reproduces on both vLLM and sglang, suggesting a general performance issue in the FP8 implementation path on the ROCm platform.
  - Impact: this directly challenges the assumption that FP8 enables efficient inference on AMD hardware, with major implications for MI300X users deploying large MoE models. The issue has been labeled `bug` and `rocm`, and the relevant AMD maintainers (@hongxiayang et al.) were auto-notified.
- ROCm bug fixes and optimizations (merged PRs):
  - PR #31502: fixes ROCm FP8 static quantization, broken by a recent refactor, ensuring quantization parameters are correctly replaced and loaded.
  - PR #31499: as part of the MoE refactor, turns `prepare_moe_fp8_layer_for_marlin` into a pure function, improving compatibility with the quantized-weight reload mechanism.
  - PR #31460 / #31187: fix several ROCm CI test failures, including replacing NIXL with RIXL and correcting attention tests.
  - PR #30719: fixes a numerical error in the GPTQ GEMM kernel caused by a race condition in output-tensor zeroing; the bug is more easily triggered when `input_size > 128`.
- Attempted optimization (unsuccessful) (#31487):
  - A contributor tried to add a fused Triton LayerNorm kernel for ROCm, aiming to cut kernel launches and HBM traffic and speed up BERT/ViT-style architectures. The PR was closed quickly with no stated reason, but it reflects community interest in AMD platform performance.
Summary: the headline AMD-ecosystem item is a serious FP8 performance regression being clearly reported, while the development team continues to fix core functionality and stabilize the test suite. The FP8-slower-than-BF16 behavior on MI300X warrants a joint deep dive by AMD and the vLLM team.
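For a sense of scale, here is the arithmetic behind the regression, using only the throughput figures quoted in issue #31475 (the numbers are the reporter's, not independently verified):

```python
# Illustrative arithmetic only, based on the TPS figures reported in
# issue #31475 (GLM-4.7 on 8x MI300X). Not a benchmark.

def slowdown(fp8_tps: float, bf16_tps: float) -> float:
    """Relative slowdown of FP8 vs BF16; > 1.0 means FP8 is slower."""
    return bf16_tps / fp8_tps

glm47 = slowdown(fp8_tps=30.0, bf16_tps=60.0)
print(f"GLM-4.7: FP8 is {glm47:.1f}x slower than BF16")  # 2.0x
# The expectation is the opposite: FP8 halves weight/activation bandwidth,
# so it should be at least as fast as BF16 -- hence the regression report.
```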
💬 High-Traffic Discussion Analysis
- Issue #31475: FP8 slower than BF16 on MI300X
  - Core issue: abnormal FP8 inference performance on AMD MI300X GPUs, which comes out slower than the higher-precision BF16 path.
  - Views and progress:
    - Reporter (ehartford): provided the full test environment, models, and reproducible numbers, and pointed out that the problem exists in both vLLM and sglang, implying a shared cause in the underlying ROCm implementation.
    - Community/automation: the issue was auto-labeled and the ROCm maintainer team notified. The reporter further reproduced and confirmed the same behavior in the sglang project, strengthening the case that the root cause lies in the ROCm platform.
  - Status: open, awaiting deep technical analysis from AMD or the core developers. Potentially a key issue for the competitiveness of the AMD platform.
- Issue #31479: Enable LoRA support for more multimodal models
  - Core issue: only some multimodal models (Qwen VL, idefics3) currently support attaching LoRA adapters to the vision tower and connector; support needs to be extended to all multimodal models.
  - Views and collaboration:
    - Author (jeejeelee): clearly explained the technical reason for implementing the `get_num_mm_encoder_tokens` and `get_num_mm_connector_tokens` functions.
    - Community contributors (Zyyeric, B-201, Anexdeus): several people volunteered to take on the work, showing strong willingness to collaborate.
    - Coordination: the author suggested the contributors divide the work among themselves to avoid duplicated effort.
  - Status: open, labeled `help wanted`. A related PR, #31513 (adding support for LLaVA), has already been submitted, showing that work is underway.
- Issue #31501: stream-interval > 1 causes tool-call arguments to be lost
  - Core issue: when `--stream-interval` is set to a value greater than 1, tool-call argument content goes missing from the streamed output.
  - Views and triage:
    - Reporter (ktsaou): described the symptom in detail (arguments come back as an empty `{}`) and provided the full server launch command plus a client-side reproduction script.
    - Maintainer request (chaunceyjiang): asked for client reproduction code to pinpoint the problem.
    - Reporter follow-up: promptly supplied a reproduction script based on the OpenAI client.
  - Points of contention: none of substance; this is a clear-cut bug. Discussion focused on reproducing and localizing it efficiently.
  - Status: open, with a complete reproduction path in place, awaiting a fix.
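The failure mode behind a bug like #31501 is easy to picture: streamed tool-call arguments arrive as incremental text deltas, and any layer that emits only every N-th chunk while discarding, rather than merging, the intermediate deltas will lose argument text. A hypothetical sketch of that mechanism (function names are illustrative, not vLLM internals):

```python
# Hypothetical illustration of how tool-call argument deltas can be lost
# when chunks are emitted at an interval. Names are illustrative only.

def accumulate_all(deltas: list[str]) -> str:
    """Correct: merge every argument delta, whatever the emit interval."""
    return "".join(deltas)

def emit_every_nth(deltas: list[str], interval: int) -> str:
    """Buggy: forward only every interval-th delta, dropping the rest."""
    return "".join(d for i, d in enumerate(deltas) if i % interval == 0)

# Tool-call arguments stream as small JSON fragments:
deltas = ['{"locat', 'ion": ', '"Paris', '"}']
assert accumulate_all(deltas) == '{"location": "Paris"}'
print(emit_every_nth(deltas, interval=2))  # corrupted: '{"locat"Paris'
```

The fix in such cases is to buffer and merge the skipped deltas before emitting, not to drop them.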
🔥 Hot Topics and Trends
- Deepening multimodal support: the discussion around #31479 and the execution in PR #31513 show the community porting advanced features such as LoRA from language models over to multimodal models, improving their fine-tuning flexibility.
- Model-specific performance tuning: a per-model optimization tracker (#31473 for DeepSeek-V3.2) and a concrete optimization PR (#31493 for MiniMax-M2.1) indicate that optimization work is moving from generic kernels toward specializations for popular model architectures.
- Scheduler evolution: the merge of PR #27614, "enable async scheduling by default", is a milestone, marking async scheduling as stable and recommended for production after long iteration and validation.
- Deployment and maintainability: several PRs improve developer experience and maintainability, e.g. centralized version management (#31492), refactoring duplicated code (#31476), documentation link fixes (#31494), and offline API documentation support (#30184).
🛠️ Key Technical Changes
- PR #31502 & #31499: ROCm quantization and MoE-refactor fixes
  - What: fixes ROCm FP8 static quantization, which failed because parameters were replaced incorrectly, and makes the related function pure to improve compatibility.
  - Impact: ensures AMD users can load and use quantized models normally; a key patch for keeping the AMD platform functionally complete.
- PR #31493: Optimize QKNorm for MiniMax-M2/M2.1
  - What: through operation fusion, reduces the `all_reduce` communication needed by QKNorm under tensor parallelism to a single call.
  - Impact: directly lowers communication overhead for this model family in distributed inference, improving throughput (per the PR description, decode throughput improves by roughly 7% at TP=4).
- PR #31498: Transformers backend unified on vLLM ConvNdLayer
  - What: models loaded through the Transformers modeling backend now have `nn.Conv2d`/`nn.Conv3d` automatically replaced with vLLM's `ConvNdLayer`.
  - Impact: extends the benefits of the custom convolution operator, with potential general performance gains especially on non-CUDA (out-of-tree) devices.
- PR #27614: Enable async scheduling by default
  - What: makes `--async-scheduling` the default.
  - Impact: a major change to the scheduling system. Async scheduling helps raise GPU utilization, particularly under heavy concurrency with short requests. The switch signals that the feature is stable enough to give most users better out-of-the-box performance.
- PR #30184: Offline FastAPI documentation support
  - What: bundles the static assets needed for the interactive `/docs` API documentation, so it works in isolated environments without external network access.
  - Impact: improves deployment friendliness and integration-debugging convenience for vLLM in security-sensitive or intranet environments.
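The communication saving in a change like PR #31493 follows a general fusion pattern: two separate cross-rank reductions (one for the Q statistics, one for K) are replaced by a single `all_reduce` over a concatenated buffer. A minimal pure-Python sketch of the pattern, with `all_reduce` simulated as an elementwise sum across per-rank buffers (this illustrates the generic idea, not vLLM's actual kernel):

```python
# Generic sketch of fusing two all_reduce calls into one; all_reduce is
# simulated here as an elementwise SUM over per-rank buffers. Not vLLM code.

CALLS = {"n": 0}

def all_reduce(per_rank: list[list[float]]) -> list[float]:
    """Simulated all_reduce(SUM): elementwise sum of each rank's buffer."""
    CALLS["n"] += 1
    return [sum(vals) for vals in zip(*per_rank)]

ranks_q = [[1.0, 2.0], [3.0, 4.0]]   # per-rank partial Q statistics (2 ranks)
ranks_k = [[5.0, 6.0], [7.0, 8.0]]   # per-rank partial K statistics

# Unfused: one collective per tensor -> 2 calls.
q1, k1 = all_reduce(ranks_q), all_reduce(ranks_k)

# Fused: concatenate Q and K per rank, reduce once, then split -> 1 call.
fused = all_reduce([q + k for q, k in zip(ranks_q, ranks_k)])
q2, k2 = fused[:2], fused[2:]

assert (q1, k1) == (q2, k2)             # identical results either way
print("all_reduce calls:", CALLS["n"])  # 3 total: 2 unfused + 1 fused
```

The payoff is one collective's worth of latency per layer, which matters most in decode, where kernels are small and launch/communication overhead dominates.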
📈 Development Activity Observations
- High throughput: 35 new PRs and 20 merged, showing an efficient review-and-merge pipeline.
- Diverse contributors: core maintainers (e.g. njhill, robertgshaw2-redhat, hmellor) drove architectural improvements and deep bug fixes, while community contributors were active across model support, documentation, and test fixes.
- Healthy collaboration: in issues such as #31479, community members volunteered for tasks and coordinated the division of work, a sign of a healthy project ecosystem.
- Broad platform coverage: development and testing for CPU, NVIDIA, AMD, TPU, and other platforms proceeded in parallel, reflecting the project's wide support for heterogeneous compute.
💡 Issues Worth Watching
- MI300X FP8 performance bottleneck (#31475): if confirmed as a ROCm-level limitation, this would seriously hurt the competitiveness of AMD's latest hardware for efficient inference. Watch closely for AMD's response or a fix.
- Extending multimodal LoRA support (#31479): a clearly scoped, community-driven feature task; its completion will affect fine-tuning capabilities for many multimodal application developers.
- Speculative decoding proposal (RFC #31483): proposes an adaptive rejection sampling algorithm named EARS, aimed at improving speculative decoding efficiency on high-uncertainty generation tasks. Whether the RFC is accepted and implemented may affect vLLM's future performance on creative-generation workloads.
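For context on RFC #31483: standard speculative decoding accepts each draft token with probability min(1, p_target/p_draft) and resamples from a corrected distribution on rejection; EARS proposes adapting this acceptance step (details are in the RFC). Below is a sketch of the standard acceptance rule only, for orientation, not the EARS algorithm itself:

```python
import random

def accept_draft(p_target: float, p_draft: float, rng: random.Random) -> bool:
    """Standard speculative-decoding acceptance test: keep the draft token
    with probability min(1, p_target / p_draft). This is the baseline rule
    that adaptive schemes such as the EARS proposal in RFC #31483 refine."""
    return rng.random() < min(1.0, p_target / p_draft)

rng = random.Random(0)
trials = 100_000
# Draft model is overconfident in a token: target gives it 0.3, draft 0.6,
# so the expected acceptance rate is 0.3 / 0.6 = 0.5.
accepted = sum(accept_draft(0.3, 0.6, rng) for _ in range(trials))
print(f"empirical acceptance rate: {accepted / trials:.3f}")  # ~0.5
```

Low acceptance rates on high-entropy ("creative") outputs are exactly what waste draft work, which is the inefficiency an adaptive rejection scheme targets.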
📋 Appendix: Detailed Data
New Issues
- #31518 [Bug]: NUMA node validation incorrectly compares against CPU IDs instead of NUMA node IDs — bug — by SameerAsal (created: 2025-12-30 10:45 (UTC+8))
- #31484 [Usage]: RuntimeError when running Qwen2.5-VL-7B-Instruct with vllm: Potential version incompatibility — usage — by puyuan1996 (created: 2025-12-29 16:36 (UTC+8))
- #31486 [Feature]: GLM 4.7 vocab padding feature — feature request — by H100-H200-B200 (created: 2025-12-29 17:30 (UTC+8))
- #31515 [Feature]: need scheduler solution with high priority to process prefill — feature request — by 184603418 (created: 2025-12-30 10:09 (UTC+8))
- #31503 [Bug]: Value of aborted requests of `vllm:request_success_total` metrics is always zero even if some requests were actually aborted — bug — by xTayEx (created: 2025-12-30 00:34 (UTC+8))
- #31479 [Feature]: Enable LoRA support for tower and connector in more MM models — help wanted,feature request — by jeejeelee (created: 2025-12-29 15:28 (UTC+8))
- #31507 [CI Failure]: mi325_1: Model Executor Test — ci-failure — by AndreasKaratzas (created: 2025-12-30 05:28 (UTC+8))
- #31475 [Bug]: FP8 inference much slower than BF16 on MI300X for GLM-4.7 and MiniMax-M2.1 — bug,rocm — by ehartford (created: 2025-12-29 13:11 (UTC+8))
- #31501 [Bug]: --stream-interval > 1 causes tool call arguments to be empty/lost — no labels — by ktsaou (created: 2025-12-29 22:46 (UTC+8))
- #31495 [Bug]: GPT-OSS-120b giving incomplete responses with gibberish \xa0\u200b ending characters — bug — by rhajou (created: 2025-12-29 20:26 (UTC+8))
- #31490 [Performance]: The prefill phase in the disaggregated_serving_p2p_nccl_xpyd example exhibits poor performance — performance — by leefty-1024 (created: 2025-12-29 18:03 (UTC+8))
- #31474 [Feature]: GLM 4.7 vocab padding feature — feature request — by H100-H200-B200 (created: 2025-12-29 12:55 (UTC+8))
- #31485 [Feature]: Support Tool Calling with transformers 5.x for GLM-4.6V Models — feature request — by tamthaihoangminh (created: 2025-12-29 17:14 (UTC+8))
- #31483 [RFC]: a efficient adaptive rejection sampling for accelerating speculative decoding. — RFC — by sunchendd (created: 2025-12-29 16:31 (UTC+8))
- #31480 [Usage]: run deepseek v3.2 failed — usage — by ljwps (created: 2025-12-29 15:33 (UTC+8))
- #31469 [Feature]: Optimize the definition of the fake function in the code. — feature request — by lengrongfu (created: 2025-12-29 11:14 (UTC+8))
- #31473 [Performance]: DeepSeek-V3.2 Performance Optimization Tracking — performance — by LucasWilkinson (created: 2025-12-29 12:12 (UTC+8))
Closed Issues
- #17604 [Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models — bug,stale — by ehartford (closed: 2025-12-30 10:31 (UTC+8))
- #19424 [Bug]: InternVL3 FP8 missing module/parameter on model load — bug,stale — by bbeiler-ridgeline (closed: 2025-12-30 10:30 (UTC+8))
- #25864 [Bug]: FlexAttention does not support sliding window yet !!! — bug,stale — by therohit777 (closed: 2025-12-30 10:30 (UTC+8))
- #20627 [Bug]: After wake_up sleeping model in OpenAI API server the model generate gibberish output — bug,stale — by ekmekovski (closed: 2025-12-30 10:30 (UTC+8))
- #20995 [Bug]: Unable to use Qwen/Qwen2.5-Omni-7B with --mm-processor-kwargs — bug,stale — by arpitgupta14 (closed: 2025-12-30 10:30 (UTC+8))
- #22044 [RFC]: Optimize Input Media Processing in vLLM — RFC,stale,multi-modality — by huachenheli (closed: 2025-12-30 10:30 (UTC+8))
- #22702 [Bug]: Unexpected CUDA OOM with larger TP size — bug,stale — by Farrrrland (closed: 2025-12-30 10:30 (UTC+8))
- #23662 [Bug]: RuntimeError: operator _C::marlin_qqq_gemm does not exist — bug,stale — by zxue2 (closed: 2025-12-30 10:29 (UTC+8))
- #23879 [MM Encoder] Investigate heuristic for enabling encoder DP by default — help wanted,feature request,stale — by ywang96 (closed: 2025-12-30 10:29 (UTC+8))
- #23985 [Bug]: Non-output-rank Workers Fail to Report Runtime Errors, Causing MultiProcExecutor to Wait for RPC Timeout — bug,stale — by fangyuchu (closed: 2025-12-30 10:29 (UTC+8))
- #29535 [CI Failure]: mi325_2: Distributed Tests (H200) — ci-failure — by AndreasKaratzas (closed: 2025-12-30 05:30 (UTC+8))
- #29528 [CI Failure]: mi325_8: Distributed Tests (8 GPUs) — ci-failure — by AndreasKaratzas (closed: 2025-12-30 05:29 (UTC+8))
- #29520 [CI Failure]: mi325_1: Multi-Modal Models Test (Standard) — ci-failure — by AndreasKaratzas (closed: 2025-12-30 05:21 (UTC+8))
- #31244 [CI Failure]: mi325_2: Plugin Tests (2 GPUs) — ci-failure — by AndreasKaratzas (closed: 2025-12-30 05:21 (UTC+8))
- #27679 Async Scheduling Plan — feature request — by njhill (closed: 2025-12-30 04:20 (UTC+8))
- #29847 [Feature]: enable FastAPI documention (`/docs`) for air-gaped host — feature request — by runyournode (closed: 2025-12-30 00:22 (UTC+8))
- #31437 [Bug]: Streaming tool calls missing id/type/name in finish chunk — no labels — by amittell (closed: 2025-12-29 21:10 (UTC+8))
- #31474 [Feature]: GLM 4.7 vocab padding feature — feature request — by H100-H200-B200 (closed: 2025-12-29 17:28 (UTC+8))
- #26823 [Feature]: support jina multimodal reranker api request format — feature request — by tongda (closed: 2025-12-29 17:13 (UTC+8))
New PRs
- #31519 [BugFix] Fix NUMA node validation in CPU platform — no labels — by SameerAsal (created: 2025-12-30 10:51 (UTC+8))
- #31517 [Core] Remove unused `num_tokens` parameter from `_init_model_kwargs` — v1 — by maang-h (created: 2025-12-30 10:36 (UTC+8))
- #31510 [Core] Deduplicate generate/encode logic in `AsyncLLM` — frontend,ready,v1 — by njhill (created: 2025-12-30 06:47 (UTC+8))
- #31497 [Frontend] add continue_final_message parameter to /embeddings endpoint — frontend,ready — by kevin-pw (created: 2025-12-29 21:38 (UTC+8))
- #31514 FP8 KV cache append optimized with precomputed inverse scale — no labels — by tfpre (created: 2025-12-30 09:44 (UTC+8))
- #31516 Extract activations — needs-rebase,ci/build,v1,cpu — by hanneshapke (created: 2025-12-30 10:10 (UTC+8))
- #31513 [Model] Enable LoRA support for tower and connector in LLaVA — documentation — by jayhemnani9910 (created: 2025-12-30 08:48 (UTC+8))
- #31482 [log] enable max_log_len trim only when needed — frontend — by andyxning (created: 2025-12-29 16:13 (UTC+8))
- #31476 FlashInferUnification — nvidia — by rajanyadav0307 (created: 2025-12-29 13:14 (UTC+8))
- #31506 Add embedding input functionality for disabled modalities — tpu,v1,multi-modality — by reaganjlee (created: 2025-12-30 05:11 (UTC+8))
- #31512 [Hardware][Power] Add IBM MASS + NUMA optimizations for POWER8 — documentation,cpu — by Scottcjn (created: 2025-12-30 08:26 (UTC+8))
- #31500 Migrate meetups & sponsors [2/N] — documentation — by esmeetu (created: 2025-12-29 22:32 (UTC+8))
- #31511 [Doc] Fix README image rendering on GitHub — documentation — by happyahluwalia (created: 2025-12-30 07:17 (UTC+8))
- #31508 [Minor] Various small code cleanups/simplifications — structured-output,frontend,ready,v1,multi-modality — by njhill (created: 2025-12-30 05:45 (UTC+8))
- #31509 [2/n] Migrate kernels to libtorch stable ABI — ci/build,cpu,nvidia — by mikaylagawarecki (created: 2025-12-30 06:34 (UTC+8))
- #31492 [CI/Build][Docker] Add centralized version manifest for Docker builds — ci/build — by mritunjaysharma394 (created: 2025-12-29 18:51 (UTC+8))
- #31502 [Bugfix][ROCm] Fix Static Quant Issue — rocm — by robertgshaw2-redhat (created: 2025-12-30 00:03 (UTC+8))
- #31499 [MoE Refactor][12/N] Marlin Fp8 MoE Pure Function — rocm,ready — by robertgshaw2-redhat (created: 2025-12-29 22:31 (UTC+8))
- #31505 [Bugfix] Fix pooling model always disabled due to incorrect PP rank check — v1 — by vintipandey (created: 2025-12-30 03:45 (UTC+8))
- #31489 [LoRA] Hide lora_init_id in request.py — documentation,performance,frontend,v1,llama — by jeejeelee (created: 2025-12-29 17:57 (UTC+8))
- #31504 [MoE Refactor] Explicit construct mk for flashinfer bf16 kernel — no labels — by zyongye (created: 2025-12-30 03:01 (UTC+8))
- #31498 Replace `nn.ConvNd` with vLLM's `ConvNdLayer` for Transformers modeling backend — ready — by hmellor (created: 2025-12-29 21:48 (UTC+8))
- #31493 Optimize QKNorm for MiniMax-M2/M2.1 — ready — by rogeryoungh (created: 2025-12-29 18:57 (UTC+8))
- #31496 Migrate doc to website: Hardware Plugins (1/N) — documentation,ready — by esmeetu (created: 2025-12-29 21:20 (UTC+8))
- #31494 [Docs] Use relative `md` links instead of absolute `html` links for cross referencing — documentation,ready,cpu — by hmellor (created: 2025-12-29 19:07 (UTC+8))
- #31470 Remove patched_fused_scaled_matmul_reduce_scatter torch is 2.9.1 — no labels — by lengrongfu (created: 2025-12-29 11:38 (UTC+8))
- #31488 [CI] fix test_chat_truncation_content_not_null test — ready,gpt-oss — by chaunceyjiang (created: 2025-12-29 17:53 (UTC+8))
- #31491 [CI][NIXL] Split DPEP tests — ready,ci/build,v1,kv-connector — by NickLucche (created: 2025-12-29 18:16 (UTC+8))
- #31487 [Hardware][AMD] Add Fused Triton LayerNorm kernel for ROCm — rocm — by sayakmondal321 (created: 2025-12-29 17:49 (UTC+8))
- #31477 Add docker buildx bake configuration — no labels — by amrmahdi (created: 2025-12-29 14:42 (UTC+8))
- #31481 [CI/Build] Add source build test to catch build failures early — ci/build — by mhetrerajat (created: 2025-12-29 15:33 (UTC+8))
- #31478 Optimize Top-K Sigmoid Routing and QKNorm for MiniMax-M2/M2.1 — ci/build — by rogeryoungh (created: 2025-12-29 15:23 (UTC+8))
- #31468 feat: Add per-layer MLP size support for Qwen pruning — documentation,qwen — by CedricHwong (created: 2025-12-29 11:10 (UTC+8))
- #31471 [Model] Add HyperCLOVAX-SEED-Think-32B vision-language model support — new-model — by effortprogrammer (created: 2025-12-29 11:49 (UTC+8))
- #31472 [model] llama MTP — llama — by sun-chao-hahaha (created: 2025-12-29 11:53 (UTC+8))
Merged PRs
- #31510 [Core] Deduplicate generate/encode logic in `AsyncLLM` — frontend,ready,v1 — by njhill (merged: 2025-12-30 10:42 (UTC+8))
- #31207 fix: update kimi k2 tool parser logic — ready — by wangln19 (merged: 2025-12-30 10:01 (UTC+8))
- #27577 [Prefix Cache] Include lora_name in BlockStored event for deterministic KV-cache reconstruction — documentation,ready,v1,kv-connector — by sagearc (merged: 2025-12-30 08:17 (UTC+8))
- #30184 [Feature] Add offline FastAPI documentation support for air-gapped environments — frontend,ready,ci/build — by rickychen-infinirc (merged: 2025-12-30 00:22 (UTC+8))
- #31460 [CI] Test Group 'NixlConnector PD accuracy tests' is fixed — documentation,rocm,ready,ci/build,kv-connector — by qli88 (merged: 2025-12-30 07:48 (UTC+8))
- #31187 [CI/ROCm] Fixing "V1 Test attention (H100)" test group. — rocm,ready,v1 — by Alexei-V-Ivanov-AMD (merged: 2025-12-30 05:53 (UTC+8))
- #31502 [Bugfix][ROCm] Fix Static Quant Issue — rocm — by robertgshaw2-redhat (merged: 2025-12-30 05:27 (UTC+8))
- #31499 [MoE Refactor][12/N] Marlin Fp8 MoE Pure Function — rocm,ready — by robertgshaw2-redhat (merged: 2025-12-30 05:27 (UTC+8))
- #27614 [Core] Enable async scheduling by default — structured-output,frontend,ready,v1,gpt-oss,suppress-bc-linter,kv-connector,ready-run-all-tests — by njhill (merged: 2025-12-30 04:20 (UTC+8))
- #31397 implements register kv caches in lmcache connector — ready,kv-connector — by chunxiaozheng (merged: 2025-12-30 03:13 (UTC+8))
- #31462 [ROCm][CI] Skip DeepGemm-dependent test on ROCm platform — rocm,ready — by AndreasKaratzas (merged: 2025-12-29 15:31 (UTC+8))
- #30719 [ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition — bug,rocm,ready — by AndreasKaratzas (merged: 2025-12-29 17:13 (UTC+8))
- #31498 Replace `nn.ConvNd` with vLLM's `ConvNdLayer` for Transformers modeling backend — ready — by hmellor (merged: 2025-12-30 00:20 (UTC+8))
- #31493 Optimize QKNorm for MiniMax-M2/M2.1 — ready — by rogeryoungh (merged: 2025-12-30 00:30 (UTC+8))
- #31496 Migrate doc to website: Hardware Plugins (1/N) — documentation,ready — by esmeetu (merged: 2025-12-29 23:55 (UTC+8))
- #31438 [Bugfix] Preserve tool call id/type/name in streaming finish chunk — frontend,ready — by amittell (merged: 2025-12-29 21:10 (UTC+8))
- #31494 [Docs] Use relative `md` links instead of absolute `html` links for cross referencing — documentation,ready,cpu — by hmellor (merged: 2025-12-29 21:33 (UTC+8))
- #31488 [CI] fix test_chat_truncation_content_not_null test — ready,gpt-oss — by chaunceyjiang (merged: 2025-12-29 20:47 (UTC+8))
- #31445 [Bugfix][Frontend] Fix Jina reranker multimodal input compatibility — frontend,ready,multi-modality — by twjww (merged: 2025-12-29 17:13 (UTC+8))
- #31466 [CI/Build][CPU] Update CPU CI test cases — ready,ci/build,cpu — by bigPYJ1151 (merged: 2025-12-29 14:17 (UTC+8))
PRs Closed Without Merging
- #19428 [Bugfix]: fix JSON decode error when tool call argument is empty — frontend,stale — by my-git9 (closed: 2025-12-30 10:30 (UTC+8))
- #21532 [Bugfix] Handle None case for dt_bias and D in selective_state_update — stale — by Knarf04 (closed: 2025-12-30 10:30 (UTC+8))
- #23043 [V1] check request priority if scheduler policy is fcfs — frontend,stale,v1 — by calvin0327 (closed: 2025-12-30 10:30 (UTC+8))
- #23602 Synchronize TYPE_CHECKING section with environment_variables dictionary in envs.py — needs-rebase,stale — by copilot-swe-agent (closed: 2025-12-30 10:29 (UTC+8))
- #31516 Extract activations — needs-rebase,ci/build,v1,cpu — by hanneshapke (closed: 2025-12-30 10:11 (UTC+8))
- #31369 [ROCm][CI] Fix rocm attention backends selection on ROCm — rocm,needs-rebase,v1 — by zejunchen-zejun (closed: 2025-12-30 09:44 (UTC+8))
- #31511 [Doc] Fix README image rendering on GitHub — documentation — by happyahluwalia (closed: 2025-12-30 07:28 (UTC+8))
- #31457 Fix missing tool call fields in streaming responses — frontend — by ssam18 (closed: 2025-12-30 04:45 (UTC+8))
- #31321 [MoE Refactor] AITER Mixtral Fix — rocm — by robertgshaw2-redhat (closed: 2025-12-30 01:00 (UTC+8))
- #31232 [Feature]: Integrate Sonic MoE kernel for Hopper GPUs — no labels — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
- #31233 [Benchmark] Auto-infer dataset name from path for backward compatibility — performance — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
- #31231 [Bug] Fix Qwen3-VL 2:4 sparsity shape mismatch during decompression — qwen — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
- #31230 [Bug]: Fix port race condition in distributed initialization — needs-rebase — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
- #30936 [v1][CP] Improve DCP/PCP/MTP error messages with actionable guidance — v1 — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
- #30925 [Multimodal] Add FIPS 140-3 compliant hashing support — multi-modality — by yurekami (closed: 2025-12-29 20:51 (UTC+8))
- #31432 Add named constant for continuous usage report interval — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31431 Add INT32_BITS constant to replace magic number in quant_utils.py — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31430 Consolidate duplicate exception handling in ray/lazy_utils.py — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31428 [Bug] Fix GLM4 tool parser TypeError with empty arguments — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31427 [Feature] Add --disable-metrics-access-log to filter monitoring logs — frontend — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31426 [UX] Improve DCP/PCP/MTP error messages with backend suggestions — v1 — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31417 [Reasoning] Add GLM-4.7 reasoning parser for template-injected tag — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31416 [Feature][Cleanup] Unify flashinfer utils into package structure — nvidia — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31384 [Bugfix] Fix GLM47 tool parser TypeError for empty parameter tools — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31383 [Bugfix] Add GLM-4 append-think reasoning parser for missing token — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31364 fix(ray): correct misleading GPU warning for multi-node clusters — v1 — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31360 style: add return type annotations to config and tokenizer utils — ready — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31357 [UX] Display model name in Swagger documentation title — frontend — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31355 Add missing return type annotations to config, weight_utils, and deep_gemm modules — ready — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31318 [Code Quality] Add missing return type annotations to core modules — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31314 [Bugfix] Suppress UserWarning for non-writable buffer in binary2tensor — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31301 fix(ray): correct misleading warning message for multi-node clusters — v1 — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31292 fix: reset FlashInfer wrappers after sleep mode — v1,nvidia — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31288 fix(spec_decode): sync ngram draft tokens across TP ranks — speculative-decoding,v1 — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31287 [Bugfix] Preserve original tokenizer class name for transformers compatibility — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31284 [Doc] Add GPT-OSS (openai) tool parser documentation — documentation,tool-calling,gpt-oss — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31283 [Feature] Make EngineCore shutdown timeout configurable via environment variable — v1 — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31279 [Bugfix] Disable FlashInfer MoE in batch invariant mode for determinism — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31238 Refactor aiter_shared_expert_fusion logic into helper class — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31237 fix(models): Handle weight prefix mapping for Mamba-Codestral — no labels — by yurekami (closed: 2025-12-29 20:50 (UTC+8))
- #31487 [Hardware][AMD] Add Fused Triton LayerNorm kernel for ROCm — rocm — by sayakmondal321 (closed: 2025-12-29 17:52 (UTC+8))
- #20787 [WIP] [Feature]: LoRA for vision modules — v1,qwen — by prashanth058 (closed: 2025-12-29 15:22 (UTC+8))
- #31472 [model] llama MTP — llama — by sun-chao-hahaha (closed: 2025-12-29 12:31 (UTC+8))