vLLM Development Activity Report - 2025-12-25

Time window: 2025-12-25 10:49 (UTC+8) ~ 2025-12-26 10:49 (UTC+8)
Statistics: New issues 12 | Closed issues 28 | New PRs 29 | Merged PRs 9 | PRs closed without merging 18

📊 Daily Development Summary

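During this period (December 25-26), vLLM development remained active: 40 issues were handled (12 opened, 28 closed), 29 new PRs were opened, and 9 PRs were merged. Work centered on performance optimization and bug fixes, with particular attention to inference performance regressions, kernel optimization (such as MoE LoRA), and AMD platform support. The community also continued code-quality work (type annotations) and frontend enhancements.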

🎯 AMD/ROCm Ecosystem Updates

This period saw clear AMD-related activity, covering an issue discussion and a performance-tuning PR.

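Bug report: ROCm Docker image compatibility (#31333)
- Summary: A user reports that the ROCm 7.0-complete base Docker image used by vLLM does not support the GFX1150 architecture (e.g. Radeon GPUs), and suggests upgrading to 7.0.2 or later.
- Analysis: The issue points to a possible base-environment problem in the official vLLM Docker image when running on AMD consumer GPUs (the Radeon series). Keeping the base image current is essential for widening AMD hardware compatibility.
- Status: Open; labeled rocm, and the relevant maintainers have been notified.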
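PR: Tune the fused_moe configuration for AMD MI308X (#31328)
- Summary: Contributor yuttian1 submitted a PR that tunes fused_moe kernel behavior for the AMD Instinct MI308X GPU and adds a dedicated int4 w4a16 configuration.
- Analysis: This is targeted performance tuning for the AMD MI308X. By adjusting kernel configurations and adding support for an additional data type, it aims to improve MoE inference efficiency on AMD's high-end compute cards. It reflects the continued, deeper optimization of compute kernels in the AMD ecosystem.
- Status: Open; waiting to pass CI checks.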
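Older issue revisited: Too many ROCm environment flags (#21138) - closed
- Summary: This long-running discussion was closed automatically during this period due to inactivity. The user had complained that getting the best ROCm performance requires setting too many environment variables (such as VLLM_USE_TRITON_FLASH_ATTN=0, VLLM_ROCM_USE_AITER_*, and others), which makes for a poor experience.
- Key discussion points:
  - User view: The proliferation of flags adds complexity, and there is no documentation of the best configuration. The user asked for a single switch (e.g. VLLM_ROCM_USE_BEST_CONFIG=1) or automatic configuration based on GPU model.
  - AMD engineer's response (tjtanaa): Some flags (such as the AITER-related ones) are off by default because they are experimental, hardware-specific, or have compatibility implications. For MI-series cards, many performance flags (such as VLLM_USE_TRITON_FLASH_ATTN) are already enabled by default.
  - Conclusion and current state: The AMD team acknowledged that the number of parameters needs to be reduced and is already auditing the environment variables. The current advice is that, apart from VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1, users should not set other ROCm performance flags, because most are already tuned by default for the MI series. The AITER flags stay off by default because they are experimental.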
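
The advice above can be sketched as a small launcher-side helper. This is only an illustration: the flag name comes from the discussion, while the helper functions `build_rocm_env` and `manually_set_rocm_flags` are hypothetical and are not part of vLLM. Whether the flag is still needed depends on the vLLM version in use.

```python
import os

# Flag name taken from the discussion above; the helper names are hypothetical.
RECOMMENDED_FLAG = "VLLM_V1_USE_PREFILL_DECODE_ATTENTION"


def build_rocm_env(base_env):
    """Return a copy of base_env with the one recommended flag set,
    without overriding a value the user has already chosen."""
    env = dict(base_env)
    env.setdefault(RECOMMENDED_FLAG, "1")
    return env


def manually_set_rocm_flags(env):
    """List ROCm performance flags that were set by hand, so they can be
    reviewed against the 'leave the defaults alone' advice."""
    prefixes = ("VLLM_ROCM_", "VLLM_USE_TRITON_FLASH_ATTN")
    return sorted(name for name in env if name.startswith(prefixes))


if __name__ == "__main__":
    env = build_rocm_env(os.environ)
    print(RECOMMENDED_FLAG, "=", env[RECOMMENDED_FLAG])
    print("Manually set ROCm flags:", manually_set_rocm_flags(env))
```

The resulting dictionary could be passed as `env=` to `subprocess.run` when launching the server, which keeps the shell environment untouched.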

💬 Analysis of High-Traffic Discussions

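RFC: Lazy CUDA graph capture (#20098) - closed
- Core question: Should CUDA graph capture move from engine startup (eager) to on-demand capture at runtime (lazy) in order to cut startup time significantly?
- Positions:
  - In favor (the proposer and some users): Startup time matters for user experience and for autoscaling (scale-from-zero) scenarios. Many of the 67 graphs currently captured may never be used, wasting time and memory.
  - Opposed or concerned:
    - Runtime performance jitter: Lazy capture adds latency the first time a new shape is seen, making performance unpredictable.
    - Harder memory planning: CUDA graphs consume GPU memory, and lazy capture makes peak memory usage hard to estimate at startup, which could lead to OOM at runtime.
    - Alternatives: Reduce the number of captured graphs, capture only the batch sizes that actually matter, or use full-model graphs rather than piecewise graphs.
  - Related discussion: More complex cases such as LoRA (many graphs) and prompt-embedding support.
- Point of contention: The trade-off between faster startup and consistent, predictable runtime performance.
- Final status: The RFC was not implemented after the discussion; during this period it was marked stale and closed automatically after a long stretch without progress. This suggests the community currently prefers to refine the existing capture strategy rather than take on the complexity of lazy capture.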
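
The trade-off debated in the RFC can be illustrated with a toy cache. This is a minimal sketch in plain Python, not vLLM code: `GraphCache` and `fake_capture` are hypothetical names, and "capturing a graph" is simulated by building a closure. Eager mode pays for every batch size at startup, while lazy mode pays only for the sizes that actually show up, at the cost of a slower first request for each new size.

```python
class GraphCache:
    """Toy model of eager vs. lazy capture, keyed by batch size."""

    def __init__(self, capture_fn, sizes=(), lazy=False):
        self.capture_fn = capture_fn
        self.graphs = {}
        self.capture_log = []  # order in which sizes were captured
        if not lazy:
            # Eager: capture every configured size up front (slow startup).
            for size in sizes:
                self._capture(size)

    def _capture(self, size):
        self.graphs[size] = self.capture_fn(size)
        self.capture_log.append(size)

    def run(self, size):
        if size not in self.graphs:
            # Lazy: the first request with a new size pays the capture cost.
            self._capture(size)
        return self.graphs[size]()


def fake_capture(size):
    """Stand-in for an expensive capture; returns a replayable callable."""
    return lambda: f"replayed graph for batch size {size}"


if __name__ == "__main__":
    eager = GraphCache(fake_capture, sizes=(1, 2, 4, 8), lazy=False)
    lazy = GraphCache(fake_capture, sizes=(1, 2, 4, 8), lazy=True)
    lazy.run(2)
    lazy.run(2)
    print("eager captured at startup:", eager.capture_log)
    print("lazy captured on demand:", lazy.capture_log)
```

The sketch also makes the memory-planning concern visible: in lazy mode the size of `graphs` is unknown at startup, so the memory it will eventually hold cannot be budgeted in advance.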
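Issue: Large performance drop with DeepSeek-V3.2 (#31368)
- Core question: A user reports that on 8*H200, DeepSeek-V3.2 shows significantly worse throughput, TTFT, and TPOT than V3.1, and attaches detailed test scripts and results.
- Analysis: This is a newly opened, high-quality performance report. It has no comments yet, but it is detailed and points directly at a performance problem with a specific model version. Possible factors include vLLM's optimization support for new model architectures (such as MLA), kernel selection, or scheduling strategy. Reports of this kind usually draw close attention from core developers.
- Status: Open; a tracking point for a potentially major performance problem.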
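Issue: Numerical precision when decomposing TP communication (#31361)
- Core question: A user tried to optimize tensor parallelism by decomposing the all-reduce into reduce-scatter + all-gather, but saw significant numerical error that accumulates with network depth, even though the dummy run at engine startup matched exactly.
- Analysis:
  - The user's question: Why does the dummy run (used for warmup and profiling) behave differently, numerically, from real request execution? Where does the precision difference come from?
  - Possible technical factors: The issue touches on the numerical stability of communication primitives in distributed training and inference, graph-level optimizations (such as CUDA graphs), and the order of computation on different execution paths (with or without cache).
- Status: Open; a complex problem that needs in-depth technical diagnosis.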
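
As background for this issue, the sketch below simulates the two collectives in plain Python, with each "rank" represented by a list of floats. It is not vLLM or NCCL code, and it assumes the tensor length is divisible by the number of ranks. With the same summation order the decomposition matches all-reduce exactly; the last lines show that floating-point addition is not associative, which is one general way a different reduction order can produce small differences. Whether that is the cause in #31361 is not established.

```python
def all_reduce(shards):
    """Every rank ends up with the element-wise sum over all ranks."""
    n = len(shards[0])
    total = [sum(rank[i] for rank in shards) for i in range(n)]
    return [list(total) for _ in shards]


def reduce_scatter(shards):
    """Rank r ends up with the summed r-th chunk only."""
    world = len(shards)
    chunk = len(shards[0]) // world  # assumes length divisible by world size
    out = []
    for r in range(world):
        lo = r * chunk
        out.append([sum(rank[i] for rank in shards) for i in range(lo, lo + chunk)])
    return out


def all_gather(chunks):
    """Every rank ends up with the concatenation of all chunks."""
    full = [x for chunk in chunks for x in chunk]
    return [list(full) for _ in chunks]


# Floating-point addition is not associative, so reduction order matters.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.125, 8.0]]
    print(all_reduce(data) == all_gather(reduce_scatter(data)))
    print(left_to_right == right_to_left, left_to_right, right_to_left)
```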

🔥 Hot Topics and Trends

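Performance regressions and optimization: the most prominent theme of this period.
- Model-specific performance: DeepSeek-V3.2 (#31368), the earlier TTFT increase between versions (#21983), and the RMSNorm performance loss (#30043) show that new model architectures and vLLM version upgrades often bring performance challenges.
- Kernel-level optimization: The PR optimizing the fused MoE LoRA kernel (#31354) and the AMD fused_moe tuning PR (#31328) show the community fine-tuning specific compute patterns for efficiency.
- Async scheduling optimization: PR #31336 aims to remove the synchronization points in speculative decoding that are caused by the CPU querying dynamic shapes on the GPU, in order to improve async scheduling performance.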
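Bugs and stability: many issues reflect problems encountered in production environments.
- KV cache anomalies: including continuous growth (#31353) and an offloading synchronization problem (#31341).
- CUDA/ROCm environment: including an NCCL symmetric-memory error (#28901), the ROCm image (#31333), and driver compatibility (#21979).
- API and behavior changes: such as the default compilation configuration changing after a version upgrade (#31352) and a tensor-parallel parameter for the draft model not taking effect (#31345).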
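Feature enhancements and model support:
- Frontend and user experience: a new --default-chat-template-kwargs argument (#31343), a fix for the Ray GPU count warning (#31364), and an improved Swagger documentation title (#31357).
- Model support: Isaac-0.1 (#28367) and the FunctionGemma tool parser (#31218) were merged, a CWM tool parser PR (#31340) was opened, and more models are gaining score templates for reranking (#31335).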

🛠️ Key Technical Changes

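RFC: Lazy CUDA graph capture (#20098): No corresponding PR appeared in this period, but the in-depth discussion on this RFC is an important indicator of technical direction. It captures a fundamental tension in large-model serving systems between startup optimization and runtime determinism, along with the complexity of memory management.
PR: Optimize the fused MoE LoRA Triton op (#31354): In-kernel accumulation and workspace reuse eliminate extra global-memory traffic and multiple kernel launches. This is especially useful for small-batch (decode) workloads and for reducing latency under CUDA graph capture, and is a good example of production-oriented micro-optimization.
PR: Tune the fused_moe configuration for AMD MI308X (#31328): Tunes kernel parameters for specific hardware (AMD MI308X) and adds support for an additional quantization format (int4 w4a16), reflecting ongoing hardware-specific adaptation work.
PR: Wait for compute to finish before offloading KV to CPU (#31341): Fixes a synchronization bug in KV cache offloading in OffloadingConnector, ensuring the DMA copy runs only after GPU compute has completed, and adjusts the offload timing to avoid contending with the copy of sampled tokens. This is an important low-level fix for the stability of features such as prefill/decode (P/D) disaggregation.
PR: Fix token counting with async scheduling + reasoning + structured output (#31332): Fixes a token-counting error in StructuredOutputManager under async scheduling. It is a targeted fix for the V1 engine where reasoning and structured output interact, ensuring that this combination of features behaves correctly.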
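
The ordering requirement behind the KV offloading fix (#31341) can be shown with a CPU-only analogy. This is a sketch under stated assumptions, not the OffloadingConnector implementation: a `threading.Event` stands in for a GPU completion event, a Python list stands in for the KV buffer, and `offload_after_compute` is a hypothetical name.

```python
import threading


def offload_after_compute(gpu_buffer, compute, cpu_store):
    """Run compute asynchronously, then copy only after it signals completion."""
    done = threading.Event()

    def worker():
        compute(gpu_buffer)  # fills the buffer in place
        done.set()           # analogous to recording a completion event

    thread = threading.Thread(target=worker)
    thread.start()
    done.wait()              # without this wait, the copy could read stale data
    cpu_store.extend(gpu_buffer)
    thread.join()
    return cpu_store


def fill_kv(buffer):
    """Stand-in for the compute step that produces KV values."""
    for i in range(len(buffer)):
        buffer[i] = i * 2


if __name__ == "__main__":
    print(offload_after_compute([0] * 4, fill_kv, []))
```

On a real GPU the same idea would be expressed with stream or event synchronization rather than threads; the sketch only shows why the copy must be ordered after the producer.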

📈 Development Activity Observations

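Contributor activity: Dozens of distinct contributors submitted PRs and issues this period, spanning bug reports, performance testing, feature development, and code-quality improvements. Community activity is high.
Code review and merging: 9 PRs were merged, including new model support, documentation updates, bug fixes, and performance optimizations, at a steady pace. Notably, many new PRs (such as #31328, #31329, and #31354) are blocked by failing pre-commit checks, mostly on code formatting and type annotations, which shows that the project enforces its code standards strictly.
Issue cleanup: 28 issues were closed, most of them automatically after being marked stale for long inactivity, which shows the maintenance workflow is running normally.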

💡 Issues Worth Watching

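DeepSeek-V3.2 performance regression (#31368): Needs core developers to determine whether it is caused by characteristics of the new model or by vLLM optimizations that do not yet cover it.
Numerical precision of TP communication decomposition (#31361): Concerns low-level consistency between distributed computation and graph execution. The diagnosis may be valuable for understanding vLLM's internal execution mechanisms.
ROCm Docker image compatibility (#31333): Affects AMD Radeon GPU users. Watch for the maintainers' fix.
Default compilation configuration change (#31352): Between v0.11.2 and v0.12.0, the default compilation mode and the list of custom ops changed significantly, which may affect performance and behavior after upgrading. Checking the relevant release notes is recommended.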

📋 Appendix: Detailed Lists

New issues

#31368 [Performance]: DeepSeek-v3.2 throughput&TTFT&TPOT is much slower than DeepSeek-v3.1 on 8*H200 — performance — by princepride (created: 2025-12-26 09:51 (UTC+8))
#31361 [Usage]: Question about the dummy run。It seems the dummy run use different precision? — usage — by Dingjifeng (created: 2025-12-26 00:38 (UTC+8))
#31331 [New Model]: Support molmo2 — new-model — by tunglinwood (created: 2025-12-25 15:25 (UTC+8))
#31342 [Bug]: vllm profiler cannot working — bug — by lengrongfu (created: 2025-12-25 17:43 (UTC+8))
#31353 [Bug]: KV Cache grows continuously with just one chat completion request using meta-llama/Llama-3.2-1B on L40 GPU with Flash Attention and finally completed after 10 minutes — bug — by aravilli (created: 2025-12-25 21:56 (UTC+8))
#31352 [Bug]: Behavior change 0.11.2 vs 0.12 (and up) — bug — by vince62s (created: 2025-12-25 21:35 (UTC+8))
#31344 [Usage]: how to pass param logits_processors in AsyncEngineArgs? — usage — by cqray1990 (created: 2025-12-25 18:12 (UTC+8))
#31351 [Bug]: run test case in local failed: moe_expert_num = len(expert_map) TypeError: object of type ‘NoneType’ has no len() — bug — by leo-pony (created: 2025-12-25 21:10 (UTC+8))
#31345 [Bug]: The ‘’draft_tensor_parallel_size‘’ parameter of the Eagel3 draft model does not take effect — bug — by zhaomingyu13 (created: 2025-12-25 18:59 (UTC+8))
#31337 [Bug]: Failed to capture CUDA kernel data and GPU memory data when running vllm with tensor_parallel_size=1 — bug — by EmperorKaiser (created: 2025-12-25 17:13 (UTC+8))
#31333 [Bug]: The base image used by rocm Docker is 7.0-complete, and rocm 7.0 does not support GFX1150. — bug,rocm — by Uhao-P (created: 2025-12-25 15:51 (UTC+8))
#31334 [Bug]: vllm+ray：vllm serve DeepSeek-V3.2V-AWQ ERROR. — bug — by F0undLinks (created: 2025-12-25 15:51 (UTC+8))

Closed issues

#20098 [RFC]: Lazy CUDA Graph capture — RFC,stale,startup-ux — by lionelvillard (closed: 2025-12-26 10:18 (UTC+8))
#20778 [Usage]: How am I suppose to pass images to /tokenize? — good first issue,usage,stale — by dan-jacobson (closed: 2025-12-26 10:18 (UTC+8))
#21027 [Performance]: The GPU memory usage of vllm v0.9.2 is significantly higher than that of v0.9.1. Why is this? How can it be improved? — performance,stale — by dengcao (closed: 2025-12-26 10:18 (UTC+8))
#21138 [ROCm]: There is too many opt-in ROCm specific flags — feature request,stale — by functionstackx (closed: 2025-12-26 10:18 (UTC+8))
#21979 [Bug]: cuda version of vllm/vllm-openai:latest older than k8s node cuda 12.9 Incompatibility error — bug,stale — by kev-iotairx (closed: 2025-12-26 10:18 (UTC+8))
#21983 [Bug]: TTFT increased especially in some Distill Models with small BatchSize in v0.10.0 compared to v0.9.2 — bug,stale — by AiKiAi-stack (closed: 2025-12-26 10:18 (UTC+8))
#22126 [Feature]: Tensor parallelism for GLM-4.5 — feature request,stale — by fernandaspets (closed: 2025-12-26 10:18 (UTC+8))
#22442 [Usage]: Whether to support benchmark service stress testing of embeddings model？ — usage,stale — by tensorflowt (closed: 2025-12-26 10:18 (UTC+8))
#22625 [Bug]: Generation quality decrease while batch size > 1 (Qwen3-235B-A22B-Instruct) — bug,stale — by edgeinfinity1 (closed: 2025-12-26 10:18 (UTC+8))
#23062 [Feature]: Supports torch 2.9 — feature request,stale — by Tame21 (closed: 2025-12-26 10:17 (UTC+8))
#23342 [Bug]: when user MTP method will cause UnboundLocalError — bug,stale — by xyxinyang (closed: 2025-12-26 10:17 (UTC+8))
#23538 [Bug]: Qwen2.5-VL get an error when attatch image file — bug,stale — by foxmale007 (closed: 2025-12-26 10:17 (UTC+8))
#23554 [Bug]: [V1][V0.9.1][PD disaggregation] Infinite waiting loop when kv cache is only capable for one long text request — bug,stale — by JC-ut0 (closed: 2025-12-26 10:17 (UTC+8))
#23572 [Bug]: meta-llama/Llama-4-Maverick-17B-128E-Instruct confuses about image inputs when tool choice is enabled — bug,stale — by erkintelnyx (closed: 2025-12-26 10:17 (UTC+8))
#23587 [Bug]: NIXL Crashes if P/D Protocol is off — bug,stale — by robertgshaw2-redhat (closed: 2025-12-26 10:17 (UTC+8))
#23645 [Bug]: Qwen2.5-VL GPTQ W4A16 quantized with llm-compressor breaks with 0.10.1/0.10.1.1 (was working with 0.10.0) — bug,stale — by SorenDreano (closed: 2025-12-26 10:17 (UTC+8))
#23668 [CI]: V1 Tests cleanup — ci/build,stale — by njhill (closed: 2025-12-26 10:17 (UTC+8))
#23698 [Usage]: How to limit the size of input images in commands — usage,stale — by Youmo1 (closed: 2025-12-26 10:17 (UTC+8))
#23744 [Bug]: from vllm import LLM works but vllm serve raises OSError(‘source code not available’) — bug,stale — by jzju (closed: 2025-12-26 10:17 (UTC+8))
#23782 [Feature]: Aiter MLA to support FP8 KV Cache — performance,feature request,rocm,stale,v1 — by HAIAI (closed: 2025-12-26 10:17 (UTC+8))
#21822 [Bug]: vllm==0.10.0 + flashinfer, MultiLevelCascadeAttentionWrapper.plan() got an unexpected keyword argument ‘kv_data_type’ — bug,stale — by celidos (closed: 2025-12-26 09:58 (UTC+8))
#30043 [Performance]: big perf loss between 0.11.2 and 0.12.0 on rms_norm — performance — by vince62s (closed: 2025-12-25 21:35 (UTC+8))
#31351 [Bug]: run test case in local failed: moe_expert_num = len(expert_map) TypeError: object of type ‘NoneType’ has no len() — bug — by leo-pony (closed: 2025-12-25 21:11 (UTC+8))
#28901 [Bug]: nccl symmem causes nccl error on nccl 2.28+ — bug — by leepoly (closed: 2025-12-25 19:10 (UTC+8))
#31211 [Doc]: Add missing GPT-OSS tool calling instructions — documentation — by amithkk (closed: 2025-12-25 13:29 (UTC+8))
#29116 [AMD][CI Failure]: Tracking V1 Test others failing tests — rocm,ci-failure — by zhewenl (closed: 2025-12-25 13:20 (UTC+8))
#29114 [AMD][CI Failure]: Tracking V1 Test e2e + engine failing tests — rocm,ci-failure — by zhewenl (closed: 2025-12-25 13:19 (UTC+8))
#28273 [Bug]: IOProcessor pre_process_async return type not handled correctly in serving_pooling.py — bug — by sjkim2322 (closed: 2025-12-25 12:01 (UTC+8))

New PRs

#31348 resolve pydantic error in startup benchmark — performance — by andyxning (created: 2025-12-25 20:51 (UTC+8))
#31354 [Kernel] Speed up fused MoE LoRA triton op (in-kernel accumulation + workspace reuse)1.[Kernel] Optimize fused MoE LoRA triton op (in-kernel accumulation … — no labels — by cwazai (created: 2025-12-25 22:32 (UTC+8))
#31367 Fix ZMQ port bind race condition in shm_broadcast — no labels — by j-lojek (created: 2025-12-26 08:17 (UTC+8))
#31332 [BugFix] Fix async scheduling + reasoning with struct output — bug,structured-output,ready,v1 — by njhill (created: 2025-12-25 15:27 (UTC+8))
#31330 [WIP][Core] Concurrent partial prefills for V1 — v1 — by ppppqp (created: 2025-12-25 14:46 (UTC+8))
#31366 feat(frontend): early-fail tokenization guard for user requests — frontend — by scratch-ml (created: 2025-12-26 03:13 (UTC+8))
#31365 sys: synchronization of systemic consistency (AKH-093-SR) — documentation — by ColdSilence989 (created: 2025-12-26 02:57 (UTC+8))
#31356 style(repo_utils): add return type annotations — no labels — by yurekami (created: 2025-12-25 23:12 (UTC+8))
#31357 [UX] Display model name in Swagger documentation title — frontend — by yurekami (created: 2025-12-25 23:14 (UTC+8))
#31355 Add missing return type annotations to config, weight_utils, and deep_gemm modules — no labels — by yurekami (created: 2025-12-25 23:01 (UTC+8))
#31364 fix(ray): correct misleading GPU warning for multi-node clusters — v1 — by yurekami (created: 2025-12-26 01:06 (UTC+8))
#31363 style: add return type annotations to import_utils and profiling — no labels — by yurekami (created: 2025-12-26 00:53 (UTC+8))
#31362 style: add return type annotations to async_utils and envs — no labels — by yurekami (created: 2025-12-26 00:42 (UTC+8))
#31360 style: add return type annotations to config and tokenizer utils — no labels — by yurekami (created: 2025-12-26 00:14 (UTC+8))
#31359 style(torch_utils): add return type annotations — no labels — by yurekami (created: 2025-12-25 23:23 (UTC+8))
#31358 style(utils): add return type annotations — no labels — by yurekami (created: 2025-12-25 23:19 (UTC+8))
#31340 cwm tool parser — no labels — by ErezSC42 (created: 2025-12-25 17:26 (UTC+8))
#31350 Add missing return type annotations to ray, distributed, and attention modules — no labels — by yurekami (created: 2025-12-25 21:09 (UTC+8))
#31346 [Bugfix] Remove spurious NVFP4 ‘GPU does not support FP4’ warning — v1 — by yurekami (created: 2025-12-25 20:40 (UTC+8))
#31349 Add missing return type annotations to usage, logging, and arg_utils modules — no labels — by yurekami (created: 2025-12-25 20:56 (UTC+8))
#31338 [Doc] Add troubleshooting for Triton PTX error about undefined gpu-name — documentation,ready — by Isotr0py (created: 2025-12-25 17:19 (UTC+8))
#31347 [Code Quality] Add missing return type annotations to compilation and metrics modules — v1 — by yurekami (created: 2025-12-25 20:41 (UTC+8))
#31343 feat(frontend): add --default-chat-template-kwargs CLI argument — frontend — by effortprogrammer (created: 2025-12-25 18:10 (UTC+8))
#31339 [Misc] Fix Qwen2-MoE shared_expert_gate — qwen — by jeejeelee (created: 2025-12-25 17:19 (UTC+8))
#31335 [Model] Let more models to support the score template. — documentation,qwen — by noooop (created: 2025-12-25 15:52 (UTC+8))
#31341 [BugFix] Wait for compute before offloading KV to CPU — v1,kv-connector — by orozery (created: 2025-12-25 17:42 (UTC+8))
#31336 [perf][async] support non cpu sync get logprob tensors for spec — v1 — by izhuhaoran (created: 2025-12-25 16:21 (UTC+8))
#31329 [benchmark] use model card root instead of id — performance,ready — by andyxning (created: 2025-12-25 12:10 (UTC+8))
#31328 [ROCm][Perf] Tune fused_moe and add int4 w4a16 config for amd — rocm — by yuttian1 (created: 2025-12-25 11:16 (UTC+8))

Merged PRs

#28367 Feature/isaac 0.1 — documentation,new-model,ready,ci/build,multi-modality — by oscardev256 (merged: 2025-12-26 10:49 (UTC+8))
#31332 [BugFix] Fix async scheduling + reasoning with struct output — bug,structured-output,ready,v1 — by njhill (merged: 2025-12-26 07:01 (UTC+8))
#28047 [Hybrid] Mamba2 prefix cache blocks freeing for running requests — ready,v1 — by s3woz (merged: 2025-12-26 04:54 (UTC+8))
#31274 [Model][Ernie4.5-VL] Support video metadata for timestamp rendering — ready,multi-modality — by Tiiiktak (merged: 2025-12-25 22:07 (UTC+8))
#31338 [Doc] Add troubleshooting for Triton PTX error about undefined gpu-name — documentation,ready — by Isotr0py (merged: 2025-12-25 18:26 (UTC+8))
#29207 use the same stream for cuda graph catpure and replay for NCCL — performance,ready,v1,nvidia — by Amir-19 (merged: 2025-12-25 19:10 (UTC+8))
#30994 [Benchmark Suite] improve cpu Benchmark Suite tests and comparison report for 0.12.0 — documentation,performance,ready,ci/build,cpu — by louie-tsai (merged: 2025-12-25 16:51 (UTC+8))
#31218 [Frontend] add FunctionGemma tool parser support — documentation,ready,tool-calling — by gateremark (merged: 2025-12-25 15:29 (UTC+8))
#31212 [Doc] Add tool call parser documentation for GPT-OSS models — documentation,ready,tool-calling,gpt-oss — by amithkk (merged: 2025-12-25 13:29 (UTC+8))

PRs closed without merging

#20474 [Bugfix] improve regex for hermes tool detection — documentation,frontend,stale,tool-calling — by XciD (closed: 2025-12-26 10:18 (UTC+8))
#20751 [1/N] Refactor platform API to reduce torch.cuda call — rocm,needs-rebase,stale,v1 — by jikunshang (closed: 2025-12-26 10:18 (UTC+8))
#21809 [Bugfix] Scheduler: only schedule prefill chunks when entire context fits — stale,v1 — by jmkuebler (closed: 2025-12-26 10:18 (UTC+8))
#22321 [V1] Test cases for parallel sampling with all output_kind options — stale,v1 — by gigit0000 (closed: 2025-12-26 10:18 (UTC+8))
#22953 [Bugfix]Fix EEP scale-up functionality — frontend,stale,v1,qwen,deepseek — by wuhang2014 (closed: 2025-12-26 10:17 (UTC+8))
#23136 [benchmark] Make the initial test attempt faster in bench serve — performance,ready,needs-rebase,stale — by 842974287 (closed: 2025-12-26 10:17 (UTC+8))
#23143 DO_NOT_COMMIT: benchmark fa2 vs tree attention — speculative-decoding,needs-rebase,ci/build,stale,v1 — by foolusion (closed: 2025-12-26 10:17 (UTC+8))
#23166 [Misc] Triton implementation of NVFP4 GEMM — performance,stale — by LA-10 (closed: 2025-12-26 10:17 (UTC+8))
#23333 [Bugfix] fix unsolved id in LoRARequest — stale,v1 — by xleoken (closed: 2025-12-26 10:17 (UTC+8))
#23700 [Bugfix] Use mk.prepare_finalize.topk_indices_dtype() in profile_modular_kernel — stale — by shcho1118 (closed: 2025-12-26 10:17 (UTC+8))
#23740 Adapting Qwen3-32B to Eagle3 mode to resolve head dimension mismatch issues — stale,v1,qwen — by coder-fny (closed: 2025-12-26 10:17 (UTC+8))
#30226 [Misc] Pass kwargs to get attn_backend_cls — no labels — by Potabk (closed: 2025-12-26 09:56 (UTC+8))
#31365 sys: synchronization of systemic consistency (AKH-093-SR) — documentation — by ColdSilence989 (closed: 2025-12-26 03:09 (UTC+8))
#31152 cwm tool parser — no labels — by ErezSC42 (closed: 2025-12-25 17:27 (UTC+8))
#30891 [Kernel] Configure BlockReduce block size in RMSNorm kernel — no labels — by xyang16 (closed: 2025-12-25 13:14 (UTC+8))
#31290 fix: handle None tokenizer in multimodal processor initialization — multi-modality — by yurekami (closed: 2025-12-25 13:02 (UTC+8))
#31291 fix(config): validate skip_tokenizer_init is not used with multimodal models — ready — by yurekami (closed: 2025-12-25 13:02 (UTC+8))
#30998 [Perf][ROCm][AWQ] Improve performance of fused MoE GPTQ-AWQ and AWQ dequant kernels — rocm — by yuttian1 (closed: 2025-12-25 10:52 (UTC+8))