vLLM Development Activity Report - 2025-12-25
Time window: 2025-12-25 10:49 (UTC+8) ~ 2025-12-26 10:49 (UTC+8). Statistics: new issues 12 | closed issues 28 | new PRs 29 | merged PRs 9 | closed without merging 18
📊 Daily Development Summary
Development activity stayed brisk this cycle (Dec 25-26): 40 issues were processed (12 opened, 28 closed) and 29 PRs were filed (9 merged). Work centered on performance optimization and bug fixing, with particular attention to inference performance regressions, kernel optimization (e.g., MoE LoRA), and AMD platform support. The community also continued code-quality work (type annotations) and frontend enhancements.
🎯 AMD/ROCm Ecosystem Activity
This cycle saw clear AMD-related activity, spanning issue discussions and performance-optimization PRs.
- Bug report: ROCm Docker image compatibility (#31333)
  - Content: A user reports that the ROCm 7.0-complete base Docker image used by vLLM does not support the GFX1150 architecture (e.g., Radeon GPUs) and suggests upgrading to 7.0.2 or later.
  - Analysis: The issue exposes a potential base-environment gap in vLLM's official Docker images for AMD consumer GPUs (the Radeon series). Keeping the image current is essential to broadening AMD hardware compatibility.
  - Status: Open; tagged with the rocm label and relevant maintainers notified.
- PR: Tune fused_moe config for AMD MI308X (#31328)
  - Content: Contributor yuttian1 submitted a PR tuning the fused_moe kernel for the AMD Instinct MI308X GPU and adding a dedicated int4 w4a16 configuration.
  - Analysis: Targeted performance tuning for a new-generation AMD GPU (MI308X). Adjusting kernel configs and adding support for a new data type aims to raise MoE inference efficiency on AMD's high-end accelerators, continuing the trend of deep compute-kernel optimization in the AMD ecosystem.
  - Status: Open, pending CI checks.
- Historical issue revisited: Too many opt-in ROCm flags (#21138) - closed
  - Content: This long-running discussion was auto-closed this cycle for inactivity. Users had complained that getting best ROCm performance required setting too many environment variables (e.g., VLLM_USE_TRITON_FLASH_ATTN=0, VLLM_ROCM_USE_AITER_*), making for a poor experience.
  - Key discussion:
    - User view: flag sprawl adds complexity, the best configuration is undocumented, and a unified switch (e.g., VLLM_ROCM_USE_BEST_CONFIG=1) or per-GPU auto-configuration would be preferable.
    - AMD engineer response (tjtanaa): some flags (e.g., the AITER-related ones) default to off because they are experimental, hardware-specific, or have compatibility caveats; for MI-series cards, many performance flags (e.g., VLLM_USE_TRITON_FLASH_ATTN) are already enabled by default.
    - Conclusion and current state: the AMD team acknowledged the need to reduce the number of knobs and has an environment-variable audit underway. The current recommendation is that, apart from VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1, users should not set other ROCm performance flags, since most are already tuned by default for MI-series GPUs; the AITER flags remain off by default due to their experimental status.
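The thread's closing recommendation fits in a couple of lines. This is a sketch of the conclusion reported above, not official AMD or vLLM guidance, and flag defaults may change between releases:

```python
import os

# Per the conclusion of #21138: on MI-series GPUs most ROCm performance
# flags are already tuned by default, so the thread recommends setting
# only this one before launching vLLM (illustrative, not official docs):
os.environ["VLLM_V1_USE_PREFILL_DECODE_ATTENTION"] = "1"

# The experimental AITER flags (VLLM_ROCM_USE_AITER_*) are deliberately
# left unset, i.e. at their default-off values.
print(os.environ["VLLM_V1_USE_PREFILL_DECODE_ATTENTION"])
```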
💬 High-Engagement Discussions
- RFC: Lazy CUDA graph capture (#20098) - closed
  - Core question: whether CUDA graph capture should move from engine startup (eager) to on-demand at runtime (lazy), to significantly cut startup time.
  - Positions:
    - For (the proposer and some users): startup time is critical for user experience and autoscaling (scale-from-zero) scenarios; capturing 67 graphs up front, many of which may never be used, wastes time and memory.
    - Against / concerned:
      - Runtime jitter: lazy capture adds latency the first time a new shape is seen, making performance unpredictable.
      - Memory planning: CUDA graphs consume GPU memory; with lazy capture, peak memory use cannot be estimated at startup, risking runtime OOM.
      - Alternatives preferred: capture fewer graphs, capture only the batch sizes that actually matter, or use full-model graphs instead of piecewise ones.
    - Extended discussion: touched on LoRA scenarios (large numbers of graphs) and prompt-embedding support.
  - Point of contention: the trade-off between startup-time optimization and runtime performance consistency/predictability.
  - Final status: the RFC was not implemented after discussion; it went stale and was auto-closed this cycle, suggesting the community currently prefers refining the existing capture strategy over taking on the complexity of lazy loading.
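The eager-vs-lazy trade-off debated in the RFC can be illustrated with a toy capture cache keyed by batch size. All names here are hypothetical; real CUDA graph capture in vLLM involves memory pools, warmup, and piecewise graphs, none of which is modeled:

```python
def expensive_capture(batch_size: int) -> str:
    """Stand-in for CUDA graph capture; the real thing costs time and GPU memory."""
    return f"graph[bs={batch_size}]"

class LazyGraphRunner:
    """Toy lazy strategy: capture a graph the first time a batch size is seen.

    Pays nothing at startup, but the first request of each new shape
    absorbs the capture latency -- exactly the jitter concern raised
    against the RFC.
    """
    def __init__(self):
        self._graphs = {}

    def run(self, batch_size: int) -> str:
        if batch_size not in self._graphs:      # first time: capture (slow)
            self._graphs[batch_size] = expensive_capture(batch_size)
        return self._graphs[batch_size]         # afterwards: replay (fast)

class EagerGraphRunner:
    """Toy eager strategy: capture every candidate size at startup."""
    def __init__(self, candidate_sizes):
        # Startup is slow, but memory use is known up front -- the
        # predictability argument made against lazy capture.
        self._graphs = {bs: expensive_capture(bs) for bs in candidate_sizes}

    def run(self, batch_size: int) -> str:
        return self._graphs[batch_size]

lazy = LazyGraphRunner()
eager = EagerGraphRunner(candidate_sizes=range(1, 68))  # the "67 graphs" from the RFC
print(lazy.run(8), eager.run(8))  # both return "graph[bs=8]"
```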
- Issue: Large performance regression for DeepSeek-V3.2 (#31368)
  - Core question: a user reports that on 8x H200, DeepSeek-V3.2 throughput, TTFT, and TPOT are all markedly worse than V3.1, with detailed test scripts and results attached.
  - Analysis: a freshly opened, high-quality performance-regression report. Though it has no comments yet, its thorough content points directly at a model-version-specific performance problem, possibly involving vLLM's support for the new architecture (e.g., MLA), kernel selection, or scheduling policy. Reports like this typically draw close attention from core developers.
  - Status: Open; a potential major performance-tracking item.
- Issue: Numerical precision problem when decomposing TP communication (#31361)
  - Core question: a user tried decomposing the all-reduce in tensor parallelism into reduce-scatter + all-gather as an optimization, but observed significant numerical error accumulating across network layers, while the dummy run at engine startup matched exactly.
  - Analysis:
    - User's confusion: why does the dummy run (used for warmup and profiling) behave numerically differently from real-request execution, and where does the precision gap come from?
    - Underlying technical points: the problem touches on the numerical stability of communication primitives in distributed training/inference, compiler optimizations (e.g., CUDA graphs), and differing computation order between execution paths (cached vs. uncached).
  - Status: Open; a complex problem requiring deep technical diagnosis.
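The layer-by-layer error the reporter describes is consistent with a basic property of floating point: all-reduce and reduce-scatter + all-gather are mathematically equivalent, but they may sum partial results in different orders, and float addition is not associative, so per-layer differences compound. A minimal stdlib-only illustration (no vLLM or NCCL involved):

```python
# Floating-point addition is not associative, so two mathematically
# equivalent reductions can differ in the last bits -- and a per-layer
# difference compounds over many layers.
vals = [0.1, 1e16, -1e16, 0.1]

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # 0.1 is absorbed by 1e16
pairwise      = (vals[0] + vals[3]) + (vals[1] + vals[2])  # the two 0.1s survive

print(left_to_right)  # 0.1
print(pairwise)       # 0.2
assert left_to_right != pairwise
```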
🔥 Hot Topics and Trends
- Performance regressions and optimization: the most prominent theme this cycle.
  - Model-specific performance: DeepSeek-V3.2 (#31368), earlier cross-version TTFT growth (#21983), and the RMSNorm performance loss (#30043) show that new model architectures and vLLM upgrades often bring performance challenges.
  - Kernel-level optimization: the fused MoE LoRA kernel PR (#31354) and the AMD fused_moe tuning PR (#31328) show the community drilling into specific compute patterns for efficiency.
  - Async scheduling: PR #31336 aims to remove a synchronization point in speculative decoding caused by the CPU querying dynamic GPU shapes, improving async-scheduling performance.
- Bugs and stability: many issues reflect problems hit in production.
  - KV cache anomalies: continuous growth (#31353) and offload synchronization (#31341).
  - CUDA/ROCm environment: CUDA graph capture (#28901), the ROCm image (#31333), driver compatibility (#21979).
  - API and behavior changes: default compilation config changed after upgrade (#31352), tool parameters not taking effect (#31345).
- Features and model support:
  - Frontend and UX: new --default-chat-template-kwargs argument (#31343), a fix for the misleading Ray GPU-count warning (#31364), and an improved Swagger documentation title (#31357).
  - Model support: added Isaac-0.1 (#28367), CWM tool parsing (#31340), and FunctionGemma tool parsing (#31218), plus continued expansion of reranker-model templates (#31335).
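The semantics one would expect from a server-level defaults flag like #31343's --default-chat-template-kwargs is "server defaults, overridable per request", i.e. an ordinary dict merge. The function and key names below are illustrative assumptions; consult the PR itself for the actual behavior:

```python
def resolve_chat_template_kwargs(server_defaults, request_kwargs):
    """Merge server-level default chat-template kwargs with per-request ones.

    Request-supplied values win over server defaults -- the behavior one
    would expect from a --default-chat-template-kwargs style flag
    (illustrative sketch; see PR #31343 for the real semantics).
    """
    return {**(server_defaults or {}), **(request_kwargs or {})}

# Server launched with, e.g., a default for a thinking-mode template switch:
server_defaults = {"enable_thinking": False}

print(resolve_chat_template_kwargs(server_defaults, None))
# {'enable_thinking': False}
print(resolve_chat_template_kwargs(server_defaults, {"enable_thinking": True}))
# {'enable_thinking': True}
```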
🛠️ Key Technical Changes
- RFC: Lazy CUDA graph capture (follow-up on #20098): no PR landed this cycle, but the depth of this RFC discussion makes it an important signal. It crystallizes the fundamental tension in large-model serving between startup optimization and runtime determinism, along with the complexity of memory management.
- PR: Optimize the fused MoE LoRA Triton op (#31354): in-kernel accumulation and workspace reuse eliminate extra global-memory traffic and multiple kernel launches. This particularly benefits small-batch (decode) scenarios and CUDA graph capture latency; a model example of production-oriented micro-optimization.
- PR: Tune fused_moe config for AMD MI308X (#31328): kernel-parameter tuning for specific hardware (AMD MI308X) plus support for a new quantization format (int4 w4a16), reflecting the vendor's effort to deeply adapt its latest architecture.
- PR: Wait for compute before offloading KV to CPU (#31341): fixes a synchronization bug in OffloadingConnector's KV-cache offload, ensuring DMA copies happen only after GPU compute completes and adjusting offload timing to avoid racing with sampled-token copies. An important low-level fix for the stability of features such as P/D disaggregation.
- PR: Fix token counting under async scheduling + reasoning + structured output (#31332): fixes a token-counting error in StructuredOutputManager under async scheduling. A precise fix for the V1 engine where reasoning and structured output intersect, ensuring this advanced feature combination behaves correctly.
📈 Development Activity Observations
- Contributor activity: dozens of distinct contributors filed PRs and issues this cycle, spanning bug reports, performance testing, feature development, and code-quality work; community activity remains high.
- Review and merging: 9 PRs merged, covering new model support, documentation, bug fixes, and performance work, at a steady cadence. Notably, many new PRs (e.g., #31328, #31329, #31354) are blocked on pre-commit failures, mostly code formatting and type annotations, reflecting the project's strict code standards.
- Issue cleanup: 28 older issues were closed, most auto-closed as stale, showing the maintenance workflow running normally.
💡 Issues Worth Watching
- DeepSeek-V3.2 performance regression (#31368): needs core-developer triage to determine whether new model features or missing vLLM optimizations are responsible.
- Numerical precision in TP communication decomposition (#31361): touches low-level consistency of distributed compute and graph execution; the diagnosis could be valuable for understanding vLLM's internal execution paths.
- ROCm Docker image compatibility (#31333): affects AMD Radeon users; watch for maintainer progress on a fix.
- Default compilation config change (#31352): from v0.11.2 to v0.12.0 the default compilation mode and custom-op list changed significantly, which may affect performance and behavior after upgrading; consult the release notes.
📋 Appendix: Detailed Data
New Issues
- #31368 [Performance]: DeepSeek-v3.2 throughput&TTFT&TPOT is much slower than DeepSeek-v3.1 on 8*H200 — performance — by princepride (created: 2025-12-26 09:51 (UTC+8))
- #31361 [Usage]: Question about the dummy run。It seems the dummy run use different precision? — usage — by Dingjifeng (created: 2025-12-26 00:38 (UTC+8))
- #31331 [New Model]: Support molmo2 — new-model — by tunglinwood (created: 2025-12-25 15:25 (UTC+8))
- #31342 [Bug]: vllm profiler cannot working — bug — by lengrongfu (created: 2025-12-25 17:43 (UTC+8))
- #31353 [Bug]: KV Cache grows continuously with just one chat completion request using meta-llama/Llama-3.2-1B on L40 GPU with Flash Attention and finally completed after 10 minutes — bug — by aravilli (created: 2025-12-25 21:56 (UTC+8))
- #31352 [Bug]: Behavior change 0.11.2 vs 0.12 (and up) — bug — by vince62s (created: 2025-12-25 21:35 (UTC+8))
- #31344 [Usage]: how to pass param logits_processors in AsyncEngineArgs? — usage — by cqray1990 (created: 2025-12-25 18:12 (UTC+8))
- #31351 [Bug]: run test case in local failed: moe_expert_num = len(expert_map) TypeError: object of type 'NoneType' has no len() — bug — by leo-pony (created: 2025-12-25 21:10 (UTC+8))
- #31345 [Bug]: The 'draft_tensor_parallel_size' parameter of the Eagel3 draft model does not take effect — bug — by zhaomingyu13 (created: 2025-12-25 18:59 (UTC+8))
- #31337 [Bug]: Failed to capture CUDA kernel data and GPU memory data when running vllm with tensor_parallel_size=1 — bug — by EmperorKaiser (created: 2025-12-25 17:13 (UTC+8))
- #31333 [Bug]: The base image used by rocm Docker is 7.0-complete, and rocm 7.0 does not support GFX1150. — bug,rocm — by Uhao-P (created: 2025-12-25 15:51 (UTC+8))
- #31334 [Bug]: vllm+ray:vllm serve DeepSeek-V3.2V-AWQ ERROR. — bug — by F0undLinks (created: 2025-12-25 15:51 (UTC+8))
Closed Issues
- #20098 [RFC]: Lazy CUDA Graph capture — RFC,stale,startup-ux — by lionelvillard (closed: 2025-12-26 10:18 (UTC+8))
- #20778 [Usage]: How am I suppose to pass images to /tokenize? — good first issue,usage,stale — by dan-jacobson (closed: 2025-12-26 10:18 (UTC+8))
- #21027 [Performance]: The GPU memory usage of vllm v0.9.2 is significantly higher than that of v0.9.1. Why is this? How can it be improved? — performance,stale — by dengcao (closed: 2025-12-26 10:18 (UTC+8))
- #21138 [ROCm]: There is too many opt-in ROCm specific flags — feature request,stale — by functionstackx (closed: 2025-12-26 10:18 (UTC+8))
- #21979 [Bug]: cuda version of vllm/vllm-openai:latest older than k8s node cuda 12.9 Incompatibility error — bug,stale — by kev-iotairx (closed: 2025-12-26 10:18 (UTC+8))
- #21983 [Bug]: TTFT increased especially in some Distill Models with small BatchSize in v0.10.0 compared to v0.9.2 — bug,stale — by AiKiAi-stack (closed: 2025-12-26 10:18 (UTC+8))
- #22126 [Feature]: Tensor parallelism for GLM-4.5 — feature request,stale — by fernandaspets (closed: 2025-12-26 10:18 (UTC+8))
- #22442 [Usage]: Whether to support benchmark service stress testing of embeddings model? — usage,stale — by tensorflowt (closed: 2025-12-26 10:18 (UTC+8))
- #22625 [Bug]: Generation quality decrease while batch size > 1 (Qwen3-235B-A22B-Instruct) — bug,stale — by edgeinfinity1 (closed: 2025-12-26 10:18 (UTC+8))
- #23062 [Feature]: Supports torch 2.9 — feature request,stale — by Tame21 (closed: 2025-12-26 10:17 (UTC+8))
- #23342 [Bug]: when user MTP method will cause UnboundLocalError — bug,stale — by xyxinyang (closed: 2025-12-26 10:17 (UTC+8))
- #23538 [Bug]: Qwen2.5-VL get an error when attatch image file — bug,stale — by foxmale007 (closed: 2025-12-26 10:17 (UTC+8))
- #23554 [Bug]: [V1][V0.9.1][PD disaggregation] Infinite waiting loop when kv cache is only capable for one long text request — bug,stale — by JC-ut0 (closed: 2025-12-26 10:17 (UTC+8))
- #23572 [Bug]: meta-llama/Llama-4-Maverick-17B-128E-Instruct confuses about image inputs when tool choice is enabled — bug,stale — by erkintelnyx (closed: 2025-12-26 10:17 (UTC+8))
- #23587 [Bug]: NIXL Crashes if P/D Protocol is off — bug,stale — by robertgshaw2-redhat (closed: 2025-12-26 10:17 (UTC+8))
- #23645 [Bug]: Qwen2.5-VL GPTQ W4A16 quantized with llm-compressor breaks with 0.10.1/0.10.1.1 (was working with 0.10.0) — bug,stale — by SorenDreano (closed: 2025-12-26 10:17 (UTC+8))
- #23668 [CI]: V1 Tests cleanup — ci/build,stale — by njhill (closed: 2025-12-26 10:17 (UTC+8))
- #23698 [Usage]: How to limit the size of input images in commands — usage,stale — by Youmo1 (closed: 2025-12-26 10:17 (UTC+8))
- #23744 [Bug]: from vllm import LLM works but vllm serve raises OSError('source code not available') — bug,stale — by jzju (closed: 2025-12-26 10:17 (UTC+8))
- #23782 [Feature]: Aiter MLA to support FP8 KV Cache — performance,feature request,rocm,stale,v1 — by HAIAI (closed: 2025-12-26 10:17 (UTC+8))
- #21822 [Bug]: vllm==0.10.0 + flashinfer, MultiLevelCascadeAttentionWrapper.plan() got an unexpected keyword argument 'kv_data_type' — bug,stale — by celidos (closed: 2025-12-26 09:58 (UTC+8))
- #30043 [Performance]: big perf loss between 0.11.2 and 0.12.0 on rms_norm — performance — by vince62s (closed: 2025-12-25 21:35 (UTC+8))
- #31351 [Bug]: run test case in local failed: moe_expert_num = len(expert_map) TypeError: object of type 'NoneType' has no len() — bug — by leo-pony (closed: 2025-12-25 21:11 (UTC+8))
- #28901 [Bug]: nccl symmem causes nccl error on nccl 2.28+ — bug — by leepoly (closed: 2025-12-25 19:10 (UTC+8))
- #31211 [Doc]: Add missing GPT-OSS tool calling instructions — documentation — by amithkk (closed: 2025-12-25 13:29 (UTC+8))
- #29116 [AMD][CI Failure]: Tracking V1 Test others failing tests — rocm,ci-failure — by zhewenl (closed: 2025-12-25 13:20 (UTC+8))
- #29114 [AMD][CI Failure]: Tracking V1 Test e2e + engine failing tests — rocm,ci-failure — by zhewenl (closed: 2025-12-25 13:19 (UTC+8))
- #28273 [Bug]: IOProcessor pre_process_async return type not handled correctly in serving_pooling.py — bug — by sjkim2322 (closed: 2025-12-25 12:01 (UTC+8))
New PRs
- #31348 resolve pydantic error in startup benchmark — performance — by andyxning (created: 2025-12-25 20:51 (UTC+8))
- #31354 [Kernel] Speed up fused MoE LoRA triton op (in-kernel accumulation + workspace reuse) — no labels — by cwazai (created: 2025-12-25 22:32 (UTC+8))
- #31367 Fix ZMQ port bind race condition in shm_broadcast — no labels — by j-lojek (created: 2025-12-26 08:17 (UTC+8))
- #31332 [BugFix] Fix async scheduling + reasoning with struct output — bug,structured-output,ready,v1 — by njhill (created: 2025-12-25 15:27 (UTC+8))
- #31330 [WIP][Core] Concurrent partial prefills for V1 — v1 — by ppppqp (created: 2025-12-25 14:46 (UTC+8))
- #31366 feat(frontend): early-fail tokenization guard for user requests — frontend — by scratch-ml (created: 2025-12-26 03:13 (UTC+8))
- #31365 sys: synchronization of systemic consistency (AKH-093-SR) — documentation — by ColdSilence989 (created: 2025-12-26 02:57 (UTC+8))
- #31356 style(repo_utils): add return type annotations — no labels — by yurekami (created: 2025-12-25 23:12 (UTC+8))
- #31357 [UX] Display model name in Swagger documentation title — frontend — by yurekami (created: 2025-12-25 23:14 (UTC+8))
- #31355 Add missing return type annotations to config, weight_utils, and deep_gemm modules — no labels — by yurekami (created: 2025-12-25 23:01 (UTC+8))
- #31364 fix(ray): correct misleading GPU warning for multi-node clusters — v1 — by yurekami (created: 2025-12-26 01:06 (UTC+8))
- #31363 style: add return type annotations to import_utils and profiling — no labels — by yurekami (created: 2025-12-26 00:53 (UTC+8))
- #31362 style: add return type annotations to async_utils and envs — no labels — by yurekami (created: 2025-12-26 00:42 (UTC+8))
- #31360 style: add return type annotations to config and tokenizer utils — no labels — by yurekami (created: 2025-12-26 00:14 (UTC+8))
- #31359 style(torch_utils): add return type annotations — no labels — by yurekami (created: 2025-12-25 23:23 (UTC+8))
- #31358 style(utils): add return type annotations — no labels — by yurekami (created: 2025-12-25 23:19 (UTC+8))
- #31340 cwm tool parser — no labels — by ErezSC42 (created: 2025-12-25 17:26 (UTC+8))
- #31350 Add missing return type annotations to ray, distributed, and attention modules — no labels — by yurekami (created: 2025-12-25 21:09 (UTC+8))
- #31346 [Bugfix] Remove spurious NVFP4 'GPU does not support FP4' warning — v1 — by yurekami (created: 2025-12-25 20:40 (UTC+8))
- #31349 Add missing return type annotations to usage, logging, and arg_utils modules — no labels — by yurekami (created: 2025-12-25 20:56 (UTC+8))
- #31338 [Doc] Add troubleshooting for Triton PTX error about undefined gpu-name — documentation,ready — by Isotr0py (created: 2025-12-25 17:19 (UTC+8))
- #31347 [Code Quality] Add missing return type annotations to compilation and metrics modules — v1 — by yurekami (created: 2025-12-25 20:41 (UTC+8))
- #31343 feat(frontend): add --default-chat-template-kwargs CLI argument — frontend — by effortprogrammer (created: 2025-12-25 18:10 (UTC+8))
- #31339 [Misc] Fix Qwen2-MoE shared_expert_gate — qwen — by jeejeelee (created: 2025-12-25 17:19 (UTC+8))
- #31335 [Model] Let more models to support the score template. — documentation,qwen — by noooop (created: 2025-12-25 15:52 (UTC+8))
- #31341 [BugFix] Wait for compute before offloading KV to CPU — v1,kv-connector — by orozery (created: 2025-12-25 17:42 (UTC+8))
- #31336 [perf][async] support non cpu sync get logprob tensors for spec — v1 — by izhuhaoran (created: 2025-12-25 16:21 (UTC+8))
- #31329 [benchmark] use model card root instead of id — performance,ready — by andyxning (created: 2025-12-25 12:10 (UTC+8))
- #31328 [ROCm][Perf] Tune fused_moe and add int4 w4a16 config for amd — rocm — by yuttian1 (created: 2025-12-25 11:16 (UTC+8))
Merged PRs
- #28367 Feature/isaac 0.1 — documentation,new-model,ready,ci/build,multi-modality — by oscardev256 (merged: 2025-12-26 10:49 (UTC+8))
- #31332 [BugFix] Fix async scheduling + reasoning with struct output — bug,structured-output,ready,v1 — by njhill (merged: 2025-12-26 07:01 (UTC+8))
- #28047 [Hybrid] Mamba2 prefix cache blocks freeing for running requests — ready,v1 — by s3woz (merged: 2025-12-26 04:54 (UTC+8))
- #31274 [Model][Ernie4.5-VL] Support video metadata for timestamp rendering — ready,multi-modality — by Tiiiktak (merged: 2025-12-25 22:07 (UTC+8))
- #31338 [Doc] Add troubleshooting for Triton PTX error about undefined gpu-name — documentation,ready — by Isotr0py (merged: 2025-12-25 18:26 (UTC+8))
- #29207 use the same stream for cuda graph catpure and replay for NCCL — performance,ready,v1,nvidia — by Amir-19 (merged: 2025-12-25 19:10 (UTC+8))
- #30994 [Benchmark Suite] improve cpu Benchmark Suite tests and comparison report for 0.12.0 — documentation,performance,ready,ci/build,cpu — by louie-tsai (merged: 2025-12-25 16:51 (UTC+8))
- #31218 [Frontend] add FunctionGemma tool parser support — documentation,ready,tool-calling — by gateremark (merged: 2025-12-25 15:29 (UTC+8))
- #31212 [Doc] Add tool call parser documentation for GPT-OSS models — documentation,ready,tool-calling,gpt-oss — by amithkk (merged: 2025-12-25 13:29 (UTC+8))
Closed Without Merging
- #20474 [Bugfix] improve regex for hermes tool detection — documentation,frontend,stale,tool-calling — by XciD (closed: 2025-12-26 10:18 (UTC+8))
- #20751 [1/N] Refactor platform API to reduce torch.cuda call — rocm,needs-rebase,stale,v1 — by jikunshang (closed: 2025-12-26 10:18 (UTC+8))
- #21809 [Bugfix] Scheduler: only schedule prefill chunks when entire context fits — stale,v1 — by jmkuebler (closed: 2025-12-26 10:18 (UTC+8))
- #22321 [V1] Test cases for parallel sampling with all output_kind options — stale,v1 — by gigit0000 (closed: 2025-12-26 10:18 (UTC+8))
- #22953 [Bugfix]Fix EEP scale-up functionality — frontend,stale,v1,qwen,deepseek — by wuhang2014 (closed: 2025-12-26 10:17 (UTC+8))
- #23136 [benchmark] Make the initial test attempt faster in bench serve — performance,ready,needs-rebase,stale — by 842974287 (closed: 2025-12-26 10:17 (UTC+8))
- #23143 DO_NOT_COMMIT: benchmark fa2 vs tree attention — speculative-decoding,needs-rebase,ci/build,stale,v1 — by foolusion (closed: 2025-12-26 10:17 (UTC+8))
- #23166 [Misc] Triton implementation of NVFP4 GEMM — performance,stale — by LA-10 (closed: 2025-12-26 10:17 (UTC+8))
- #23333 [Bugfix] fix unsolved id in LoRARequest — stale,v1 — by xleoken (closed: 2025-12-26 10:17 (UTC+8))
- #23700 [Bugfix] Use mk.prepare_finalize.topk_indices_dtype() in profile_modular_kernel — stale — by shcho1118 (closed: 2025-12-26 10:17 (UTC+8))
- #23740 Adapting Qwen3-32B to Eagle3 mode to resolve head dimension mismatch issues — stale,v1,qwen — by coder-fny (closed: 2025-12-26 10:17 (UTC+8))
- #30226 [Misc] Pass kwargs to get attn_backend_cls — no labels — by Potabk (closed: 2025-12-26 09:56 (UTC+8))
- #31365 sys: synchronization of systemic consistency (AKH-093-SR) — documentation — by ColdSilence989 (closed: 2025-12-26 03:09 (UTC+8))
- #31152 cwm tool parser — no labels — by ErezSC42 (closed: 2025-12-25 17:27 (UTC+8))
- #30891 [Kernel] Configure BlockReduce block size in RMSNorm kernel — no labels — by xyang16 (closed: 2025-12-25 13:14 (UTC+8))
- #31290 fix: handle None tokenizer in multimodal processor initialization — multi-modality — by yurekami (closed: 2025-12-25 13:02 (UTC+8))
- #31291 fix(config): validate skip_tokenizer_init is not used with multimodal models — ready — by yurekami (closed: 2025-12-25 13:02 (UTC+8))
- #30998 [Perf][ROCm][AWQ] Improve performance of fused MoE GPTQ-AWQ and AWQ dequant kernels — rocm — by yuttian1 (closed: 2025-12-25 10:52 (UTC+8))