vLLM Development Activity Report - 2026-01-10

Time window: 2026-01-10 11:03 (UTC+8) ~ 2026-01-11 11:03 (UTC+8)
Statistics: New issues 5 | Closed issues 21 | New PRs 26 | Merged PRs 22 | Closed unmerged PRs 7

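📊 Daily Development Summary

Between January 10 and 11, 2026, the vLLM project remained highly active. Work centered on performance optimization, multi-model and multimodal support, and evolution of the core architecture. Throughput was high: 26 PRs were opened and 22 were merged in the window, which indicates that core feature work and bug fixes are landing quickly. The community also continued to focus on AMD ROCm platform compatibility and on maturing newer features such as speculative decoding.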

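🎯 AMD/ROCm Ecosystem Updates

Several important AMD-related fixes and optimizations landed or were proposed in this window, reflecting vLLM's continued investment in AMD hardware support.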

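PR #31295 (merged): [Bugfix][Hardware][AMD] Use dynamic WARP_SIZE in sampler vectorized_process
- Technical details: Fixes a WARP_SIZE that was hardcoded to 32 in the sampler kernel by switching to the dynamic macro defined in cuda_compat.h. This is a key compatibility fix for running correctly on both AMD Wave64 architectures (e.g. MI300X/gfx942) and Wave32 architectures (e.g. the upcoming Strix Halo/gfx1151).
- Impact: Resolves potential performance and correctness problems and lays the groundwork for supporting different AMD GPU architectures. Contributor c0de128 provided detailed hardware validation (tests passed on MI300X).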
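To illustrate why a hardcoded warp width is a correctness hazard, the following Python sketch simulates a shuffle-down tree reduction across the lanes of one wavefront. It only models the failure mode and is not the vLLM sampler kernel; `warp_reduce_sum` and its parameters are hypothetical names introduced for this example.

```python
def warp_reduce_sum(lane_values, assumed_warp_size):
    """Simulate a shuffle-down tree reduction over one wavefront.

    lane_values: one value per hardware lane (its length is the real width).
    assumed_warp_size: the width the kernel code believes it is running with.
    Returns the value held by lane 0 after the reduction.
    """
    vals = list(lane_values)
    offset = assumed_warp_size // 2
    while offset > 0:
        # Each lane reads its partner's value from the previous step.
        prev = list(vals)
        for lane in range(len(vals)):
            partner = lane + offset
            if partner < len(prev):
                vals[lane] = prev[lane] + prev[partner]
        offset //= 2
    return vals[0]


wave64 = [1] * 64  # a Wave64 wavefront in which every lane contributes 1

hardcoded = warp_reduce_sum(wave64, assumed_warp_size=32)
dynamic = warp_reduce_sum(wave64, assumed_warp_size=64)
print(hardcoded)  # 32: lanes 32..63 never reach lane 0
print(dynamic)    # 64: every lane is folded in
```

With the width assumed to be 32, the first shuffle offset is 16, so lane 0 only ever accumulates lanes 0 to 31 and half of a 64-lane wavefront is silently dropped.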
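PR #32099 (new): [ROCm][Bugfix] Fix Mamba batched decode producing incorrect output
- Technical details: Fixes incorrect output from Mamba-based models (e.g. Jamba) during batched decode on ROCm. The root cause is that causal_conv1d_update returns a non-contiguous tensor view, which makes the subsequent linear layer compute incorrect results in ROCm's GEMM kernels.
- Solution: Adds .contiguous() calls on the critical path so that the tensors passed to the linear layer are contiguous. The fix applies only on ROCm, which avoids extra overhead on CUDA.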
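The contiguity condition behind this fix can be sketched without any tensor library by reasoning about shapes and strides. The helper names below (`contiguous_strides`, `is_contiguous`, `strides_for_linear_input`) and the example shapes are hypothetical; the sketch only models what a platform-guarded `.contiguous()` call achieves and is not the code from the PR.

```python
def contiguous_strides(shape):
    """Element strides of a dense row-major tensor with this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """True if the strides describe a dense row-major layout.

    Dimensions of size 1 are ignored, since their stride is never used.
    """
    expected = contiguous_strides(shape)
    return all(d == 1 or s == e for d, s, e in zip(shape, strides, expected))


def strides_for_linear_input(shape, strides, is_rocm):
    """Hypothetical guard: densify only on ROCm, and only when needed."""
    if is_rocm and not is_contiguous(shape, strides):
        return contiguous_strides(shape)  # models the effect of .contiguous()
    return strides


# A dense (batch=4, dim=8, width=3) state has strides (24, 3, 1).
state_strides = contiguous_strides((4, 8, 3))
# Selecting one position along the last axis gives a (4, 8) view
# that keeps the outer strides (24, 3), so it is not dense.
view_shape, view_strides = (4, 8), state_strides[:2]

print(is_contiguous(view_shape, view_strides))
print(strides_for_linear_input(view_shape, view_strides, is_rocm=True))
print(strides_for_linear_input(view_shape, view_strides, is_rocm=False))
```

Guarding the copy on the platform flag keeps the CUDA path free of the extra memory traffic, which matches the trade-off described above.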
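PR #32084 (new): [ROCm][Bugfix] Fix AITER speculative decoding accuracy issue
- Technical details: Fixes speculative decoding accuracy (query_len > 1) dropping to 0% when the ROCm AITER FA backend is used. The cause is that AITER's paged_attention_v1 kernel does not support multi-token queries.
- Solution: When speculative decoding is detected (max_query_len > 1), fall back to the Triton context_attention_fwd kernel, which supports both paged KV cache and multi-token queries. This ensures that users of the AITER backend still get correct results with speculative decoding enabled.
- Discussion focus: Reviewer tjtanaa asked whether the change brings a significant performance gain and expressed a wish to keep the AITER backend pure. Contributor c0de128 explained that this is primarily a correctness fix: the common single-token decode path still uses the AITER-optimized kernel, and only the speculative decoding case falls back, with the goal of a better user experience.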
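The fallback rule described above amounts to a small dispatch on the maximum query length. The sketch below is a hypothetical illustration: the kernel names are taken from the PR description, while `DecodeKernel` and `select_decode_kernel` are invented for this example and are not vLLM APIs.

```python
from enum import Enum


class DecodeKernel(Enum):
    # Kernel names as given in the PR description.
    AITER_PAGED_ATTENTION = "paged_attention_v1"
    TRITON_CONTEXT_ATTENTION = "context_attention_fwd"


def select_decode_kernel(max_query_len: int) -> DecodeKernel:
    """Pick a decode kernel based on the longest query in the batch."""
    if max_query_len < 1:
        raise ValueError("max_query_len must be at least 1")
    if max_query_len > 1:
        # Multi-token queries (speculative decoding): the AITER paged
        # attention kernel cannot handle them, so fall back to Triton.
        return DecodeKernel.TRITON_CONTEXT_ATTENTION
    # Ordinary single-token decode stays on the optimized AITER path.
    return DecodeKernel.AITER_PAGED_ATTENTION


print(select_decode_kernel(1).value)
print(select_decode_kernel(4).value)
```

Keeping the check at the dispatch point means the single-token path is untouched, which is the argument the contributor made in review.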

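Summary: The AMD-related PRs in this window are all correctness fixes. They address runtime problems on ROCm for a specific model family (Mamba) and for an advanced feature (speculative decoding), and reflect continued work on the stability and feature completeness of the AMD ecosystem.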

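💬 Most Active Discussions

Issue #23670 (closed): "[CI]: Speed up Models Tests"
- Core topic: How to speed up vLLM's time-consuming model tests in order to accelerate CI.
- Views and progress:
  - Initiators/maintainers (njhill, robertgshaw2-redhat): Treated this as a priority task and assigned an owner (afeldman-nm).
  - Contributor (afeldman-nm): Carried out a detailed analysis, grouped the tests into categories (Basic Models, Language Models Test), identified the slowest tests (for example, the initialization tests take 49 minutes), and proposed selecting which test suites to run based on the scope of the code change (PR #24253).
  - Community participant (zhengkezhou1): Expressed interest in contributing and was encouraged to collaborate.
- Outcome: The issue was automatically marked stale and closed after a long period of inactivity, but its core task has been taken over by PR #24253 and continues there.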
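Issue #18975 (closed): "[Feature]: Colocating multiple LLM engines in the same process with sleep mode."
- Core topic: Users want to create multiple LLM engines with sleep mode enabled in the same process, for scenarios such as RL training, but hit an assertion error in the version they were using.
- Views and positions:
  - Users: Several users (abhiram1809, wangzhixin-ai, and others) reported the same problem and strongly requested the feature.
  - Solution provider (paxlunae): Pointed out that upgrading to the V1 engine resolves the problem.
  - Maintainer (njhill): Confirmed when closing the issue that the problem should be resolved in the latest version (v0.14.0).
- Points of contention: No substantive disagreement; the thread focused on finding and confirming a solution.
- Current status: The issue is closed, and users are advised to upgrade to the V1 engine or the latest version.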
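PR #31341 (merged): "[BugFix] Wait for compute before offloading KV to CPU"
- Core topic: How to offload the KV cache from GPU to CPU safely and efficiently without contending with the compute stream or the sampling output stream (sample_tokens stream).
- Views and discussion:
  - Contributor (orozery): Proposed making the offload stream wait for the compute stream and deferring the offload to the start of the next engine step, so that it does not contend with the DMA copy of sample_tokens.
  - Core maintainer (njhill): Gave a detailed technical review and pointed out that, with async scheduling enabled, the sample_tokens copy may still be in flight. He suggested that a better approach is for the offload stream to wait on the sample_tokens stream (async_output_copy_stream), and discussed passing that stream through ForwardContext.
  - Technical discussion: The two went through several rounds on the FIFO behavior of CUDA copy engines, the effect of stream priorities, and the cleanest way to obtain async_output_copy_stream.
- Outcome: The PR was merged after in-depth technical discussion and refinement of the approach, reflecting the high bar applied to core low-level mechanisms.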
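PR #32084 (new): "[ROCm][Bugfix] Fix AITER speculative decoding accuracy issue" (discussion)
- Core topic: When fixing functionality on the AMD platform, how to weigh code purity against user experience and feature completeness.
- Views and positions:
  - Reviewer (tjtanaa): Prefers that the AITER backend contain only kernels from the AITER library itself, to keep it simple. If the performance gain is not significant, a hybrid solution that brings in a Triton kernel may not be accepted.
  - Contributor (c0de128): Stressed that the primary goal of the PR is to fix a functional defect (0% accuracy), not to improve performance. Letting AITER backend users run speculative decoding correctly matters more than keeping the backend "pure". This is a pragmatic, user-experience-driven position.
- Points of contention: A difference in architectural philosophy: strictly isolating backend implementations versus allowing a moderate amount of mixing for the sake of feature completeness.
- Current status: Discussion is ongoing; the contributor has triggered CI to obtain more complete test validation.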

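🔥 Hot Topics and Trends

Model support and bug fixes: This is the most active area. Many issues and PRs concern loading and runtime problems for specific models (e.g. Qwen3-Reranker, GLM-4.7 FP8, Nemotron Nano, Gemma3), which shows how quickly the vLLM community responds to and adapts to a large number of new models.
Performance optimization and benchmarking: Performance work runs throughout. It includes micro-optimizations of specific kernels (e.g. sliding window attention, MoE LoRA) as well as improvements to benchmark tooling such as the SLA search algorithm (PR #32075, #32095).
Multimodal and vision-language models (VLMs): Interest remains high. Related PRs cover model loading (#32089), an audio processing fix (#31657), and updates to example documentation (#32085), which shows that vLLM is strengthening its role as a multimodal inference engine.
Continued AMD ROCm improvements: As noted above, several PRs in this window specifically fix issues on ROCm, showing sustained investment in this platform.
Core architecture evolution: Refactoring around the V1 engine and Model Runner V2 is one of the main threads. PR #32083 aims to remove the error-prone async_barrier and introduce designs such as double buffering, a significant change to the underlying data flow.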

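🛠️ Key Technical Changes

PR #32102 (new): "[Model Runner V2] Support structured outputs + spec decoding"
- Analysis: Adds the ability for structured outputs (such as a JSON grammar) to work together with speculative decoding in Model Runner V2. A new StructuredOutputsWorker and a dedicated Triton kernel apply grammar bitmaps in place across the multiple logits rows produced under speculative decoding.
- Impact: Improves how well advanced features are integrated in the V2 engine, so that complex output constraints and speculative acceleration can be active at the same time.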
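The idea of applying a grammar bitmap in place to several logits rows can be sketched in plain Python. This is an illustration only, not the Triton kernel from the PR: the function name `apply_grammar_bitmask_inplace` and the bit layout (bit `i` of 32-bit word `w` marks token `32 * w + i` as allowed) are assumptions made for the example.

```python
import math


def apply_grammar_bitmask_inplace(logits, bitmasks):
    """Mask disallowed tokens in place, one bitmask row per logits row.

    logits: list of rows (one per draft position), each a list of floats.
    bitmasks: list of rows of 32-bit words; a set bit means "token allowed"
              (assumed layout: bit i of word w covers token 32 * w + i).
    """
    for row, words in zip(logits, bitmasks):
        for token_id in range(len(row)):
            word = words[token_id // 32]
            allowed = (word >> (token_id % 32)) & 1
            if not allowed:
                row[token_id] = -math.inf


# Two draft positions over a 5-token vocabulary.
logits = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.0, 2.0, 3.0, 4.0, 5.0],
]
bitmasks = [
    [0b00101],  # position 0: only tokens 0 and 2 are grammatical
    [0b10000],  # position 1: only token 4 is grammatical
]

apply_grammar_bitmask_inplace(logits, bitmasks)
print(logits)
```

Because each speculative position has its own grammar state, each logits row needs its own bitmask row; masking in place avoids allocating a second logits buffer.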
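PR #32083 (new): "[Model Runner V2] Remove async barrier"
- Analysis: A significant architectural refactor. It aims to eliminate the hard-to-reason-about async_barrier mechanism by redesigning the CPU-GPU data flow around double buffering and UVA-backed staged writes, avoiding race conditions at the root and reducing GPU memory pressure.
- Impact: A key step toward better stability and maintainability of the V2 engine. The change is large, but it is meant to solve the underlying problem by design.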
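As a rough illustration of the double-buffering idea, the sketch below alternates between two staging buffers so that preparing the next step never overwrites data a consumer may still be reading. The `DoubleBuffer` class is hypothetical and is not taken from the PR, whose actual design is not described in this report.

```python
class DoubleBuffer:
    """Two fixed-size staging buffers used alternately (hypothetical sketch)."""

    def __init__(self, size):
        self._buffers = [[0] * size, [0] * size]
        self._write_index = 0

    def stage(self, values):
        """Write the next step's data into the current back buffer."""
        buf = self._buffers[self._write_index]
        buf[:len(values)] = values

    def swap(self):
        """Publish the staged buffer and direct future writes to the other one."""
        published = self._buffers[self._write_index]
        self._write_index ^= 1
        return published


db = DoubleBuffer(size=3)

db.stage([1, 2, 3])
in_flight = db.swap()      # step N is handed to the consumer

db.stage([4, 5, 6])        # step N+1 is prepared in the other buffer
print(in_flight)           # step N's data is untouched
next_step = db.swap()
print(next_step)
```

With only two buffers, staging step N+2 reuses step N's buffer, so this scheme relies on the consumer finishing with a buffer within one step.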
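PR #32089 (merged): "[Bugfix] Fix Qwen3-VL-Reranker model loading for sequence classification"
- Analysis: Fixes a load failure for the vision-language (VL) variant of Qwen3-Reranker in sequence classification mode. The key is to recognize and handle the nested module structure of VL models (model.language_model.lm_head) instead of assuming the flat structure of a plain language model.
- Impact: Ensures that an important class of models (VL rerankers) runs correctly in vLLM and rounds out multimodal task support.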
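The difference between a flat and a nested head location can be shown with a small attribute-path lookup. This is a hypothetical sketch: `resolve_attr`, `find_lm_head`, and the stand-in model objects are invented for illustration, and only the path `model.language_model.lm_head` comes from the PR description.

```python
from types import SimpleNamespace


def resolve_attr(root, dotted_path):
    """Follow a dotted attribute path; return None if any part is missing."""
    obj = root
    for name in dotted_path.split("."):
        obj = getattr(obj, name, None)
        if obj is None:
            return None
    return obj


def find_lm_head(model, candidates=("lm_head", "model.language_model.lm_head")):
    """Return (path, head) for the first candidate path that exists."""
    for path in candidates:
        head = resolve_attr(model, path)
        if head is not None:
            return path, head
    raise AttributeError("no lm_head found under: " + ", ".join(candidates))


# Stand-ins for a text-only model (flat) and a VL model (nested).
text_model = SimpleNamespace(lm_head="text-head")
vl_model = SimpleNamespace(
    model=SimpleNamespace(language_model=SimpleNamespace(lm_head="vl-head"))
)

print(find_lm_head(text_model))
print(find_lm_head(vl_model))
```

A loader that only checks the flat location would miss the head of the VL model, which is the kind of failure the fix addresses.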
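PR #32101 (new): "[MTP][GLM][Bugfix] Fixed .weight_scale loading logic that dropped MTP prediction accuracy with fp8+mtp"
- Analysis: Fixes a collapse in MTP (multi-token prediction) accuracy, from about 90% to about 1%, for GLM-4.x MTP models loaded from FP8-quantized checkpoints. The cause was faulty logic for skipping .weight_scale tensors.
- Impact: Protects correctness when quantized models are combined with advanced speculative decoding, which matters for performance-sensitive and cost-sensitive deployments.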
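PR #32075 (merged): "[Benchmark][1/2] Generalize SLA criterion validation from binary flags to margins"
- Analysis: Refactors the SLA (service-level agreement) validation logic in the benchmarks, replacing a simple pass/fail binary check with a continuous, margin-based evaluation.
- Impact: Paves the way for smarter search algorithms (such as spline interpolation) and should make automated performance tuning faster and more precise.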
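The benefit of a margin over a binary flag is that it tells a search procedure how far a probe is from the SLA boundary. The sketch below is a minimal illustration with hypothetical function names and made-up numbers; it uses plain linear interpolation as a simple stand-in for the spline interpolation mentioned above, and is not the benchmark code from the PR.

```python
def sla_margin(observed, target, lower_is_better=True):
    """Signed slack against the SLA: positive means met, negative means violated."""
    return (target - observed) if lower_is_better else (observed - target)


def sla_passed(observed, target, lower_is_better=True):
    """The old binary view, recovered from the margin."""
    return sla_margin(observed, target, lower_is_better) >= 0


def interpolate_zero_crossing(x0, m0, x1, m1):
    """Linear estimate of the setting at which the margin crosses zero."""
    if m0 == m1:
        raise ValueError("margins must differ to interpolate")
    return x0 - m0 * (x1 - x0) / (m1 - m0)


# Hypothetical probes: latency target of 100 ms (lower is better).
target_ms = 100.0
m_low = sla_margin(observed=80.0, target=target_ms)    # request rate 10
m_high = sla_margin(observed=140.0, target=target_ms)  # request rate 20

print(m_low, sla_passed(80.0, target_ms))
print(m_high, sla_passed(140.0, target_ms))

estimate = interpolate_zero_crossing(10.0, m_low, 20.0, m_high)
print(estimate)
```

A binary check on the same two probes would only say that the boundary lies somewhere between rates 10 and 20, whereas the margins allow a direct estimate of where to probe next.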

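📈 Development Activity Observations

Efficient merging: 22 PRs were merged within 24 hours, which shows that code review and integration are running very efficiently.
Active AMD contributors: Contributors such as c0de128 and AndreasKaratzas submitted several high-quality ROCm-related fixes, indicating deep AMD involvement in the community.
Core team focused on architecture: Core contributors including WoosukKwon, DarkLight1337, Isotr0py, and njhill continue to drive the evolution of core modules such as the V1 engine, the benchmarks, and the model loader.
Broad community participation: From model support and documentation fixes to a security fix (PR #32098), community members from different backgrounds are actively contributing, and the ecosystem is healthy.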
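PR #32098 is listed as replacing an unsafe eval() with a safe math evaluator in the xLAM tool examples. One common way to build such an evaluator is to parse the expression with Python's `ast` module and allow only numeric literals and basic arithmetic. The sketch below shows that general technique under that assumption; it is not the code from the PR, and `safe_eval_math` is a hypothetical name.

```python
import ast
import operator

# Only these operators are allowed. Exponentiation is deliberately left out
# so that inputs such as 9**9**9 cannot exhaust CPU or memory.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {
    ast.UAdd: operator.pos,
    ast.USub: operator.neg,
}


def _eval_node(node):
    """Recursively evaluate a whitelisted expression node."""
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def safe_eval_math(expression):
    """Evaluate basic arithmetic without executing arbitrary code."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


print(safe_eval_math("2 * (3 + 4) - 1"))

try:
    safe_eval_math("__import__('os').system('echo unsafe')")
except ValueError as exc:
    print("rejected:", exc)
```

Anything outside the whitelist, including names, calls, and attribute access, is rejected before evaluation, so model-generated tool arguments cannot trigger code execution.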

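💡 Issues Worth Watching

New feature request, DFlash integration (Issue #32094): A user asked whether there are plans to integrate DFlash, described in the issue as a new efficient attention algorithm. Core member njhill replied that @benchislett is working on it. This suggests a new performance feature may be added in the future.
Stability under heavy load (Issue #32090): A user reported that vLLM crashed while load testing the GLM-4.7 FP8 model on 8xH100. There is no reply yet. Because it involves high-end hardware and a quantized model, it deserves close attention.
API evolution and deprecation (Issue #32072): This is a tracking issue that records progress on deprecating the CPU-related properties of CommonAttentionMetadata. Cleanup of low-level APIs like this affects the long-term health of the codebase, so its scope of impact is worth following.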

📋 Appendix: Detailed Data

New Issues

#32094 [Feature]: DFlash implementation | feature request | by shahizat (created: 2026-01-11 00:45 (UTC+8))
#32086 [Bug]: qwen3-vl-reranker-8b run failed in nightly vllm | bug | by zhcn000000 (created: 2026-01-10 16:52 (UTC+8))
#32093 [Bug]: Nemotron Nano V3 FP16 on Jetson THOR | bug | by mcr-ksh (created: 2026-01-11 00:14 (UTC+8))
#32090 [Bug]: vLLM crashed when load testing GLM 4.7 FP8 on H100 | bug | by raihan0824 (created: 2026-01-10 23:20 (UTC+8))
#32072 [Tracking]: Deprecate CPU seqlen related CommonAttentionMetadata properties | documentation | by LucasWilkinson (created: 2026-01-10 12:36 (UTC+8))

Closed Issues

#18975 [Feature]: Colocating multiple LLM engines in the same process with sleep mode. | feature request,stale | by zixinwen98 (closed: 2026-01-11 10:17 (UTC+8))
#19215 [Bug]: RuntimeError: PassManager::run failed | bug,stale | by daisyden (closed: 2026-01-11 10:17 (UTC+8))
#22419 [Bug]: Potential Integer Overflow and Out-of-bounds in layernorm_kernels | bug,stale | by molly-ting (closed: 2026-01-11 10:16 (UTC+8))
#23670 [CI]: Speed up Models Tests | ci/build,stale | by njhill (closed: 2026-01-11 10:16 (UTC+8))
#24302 [Feature]: Implement SRT generation for audio transcription in vLLM | feature request,stale | by tarukumar (closed: 2026-01-11 10:16 (UTC+8))
#24327 [Bug]: vLLM engine v1 with Qwen/Qwen3-Reranker-0.6B crashes with long input | bug,stale | by lpapavassiliou (closed: 2026-01-11 10:16 (UTC+8))
#24416 [Bug]: the port number after --port didn't take effect; it's still using the default port 8000 | bug,good first issue,stale | by liyuerich (closed: 2026-01-11 10:16 (UTC+8))
#24495 [Bug]: v0.10.1.1 deployed Minicpm-V 4.5. During requests, it probabilistically returns: {"error":{"message":"EngineCore encountered an issue. See stack trace (above) for the root cause.","type":"Internal Server Error","param":null,"code":500}} | bug,stale | by zzj-otw (closed: 2026-01-11 10:16 (UTC+8))
#24657 [RFC]: Support asymmetric tp/pp in p2pncclconnector | RFC,stale | by ZhongsJie (closed: 2026-01-11 10:16 (UTC+8))
#24681 [Bug]: ModuleNotFoundError: No module named 'vllm._C' | bug,stale | by jo-pillar (closed: 2026-01-11 10:16 (UTC+8))
#24704 [Bug]: Qwen3-Reranker: Process Hang with /score Endpoint for Specific Data | bug,stale | by TURX (closed: 2026-01-11 10:15 (UTC+8))
#24715 [Installation]:i use cuda11.8 and torch=2.4.0+cu118 install vllm,but find this problem. | installation,stale | by hkaiwen6-eng (closed: 2026-01-11 10:15 (UTC+8))
#24723 [Bug]: can't run the reranker model service on a GPU machine using only the CPU | bug,stale | by zhz292 (closed: 2026-01-11 10:15 (UTC+8))
#24756 [Bug]: Occasional ZMQ port conflict | bug,stale,ci-failure | by njhill (closed: 2026-01-11 10:15 (UTC+8))
#24760 [Bug]: CUDA illegal memory access when benchmarking gemma-3-4b-it | bug,stale | by dupavankumar (closed: 2026-01-11 10:15 (UTC+8))
#24782 [Bug]: Potential Integer Overflow and Out-of-bounds in awq_gemm | bug,stale | by molly-ting (closed: 2026-01-11 10:15 (UTC+8))
#19063 [Bug]: 'FutureWrapper' object has no attribute 'sampled_token_ids' when using ray to perform pipeline parallelism | bug,ray,stale | by havever (closed: 2026-01-11 05:12 (UTC+8))
#18431 [Bug]: Engine stuck with requests are blocked, running/waiting request count and KV cache usage remain constant. | bug,stale | by to-jiawen (closed: 2026-01-11 05:10 (UTC+8))
#32086 [Bug]: qwen3-vl-reranker-8b run failed in nightly vllm | bug | by zhcn000000 (closed: 2026-01-11 04:40 (UTC+8))
#31599 [Bug]: AssertionError in per_token_quant_int8 for long input | bug | by cjackal (closed: 2026-01-11 04:40 (UTC+8))
#32067 [Bug]: Cannot Run on B200s | bug | by yzeng58 (closed: 2026-01-10 22:29 (UTC+8))

New PRs

#32102 [Model Runner V2] Support structured outputs + spec decoding | v1 | by WoosukKwon (created: 2026-01-11 10:22 (UTC+8))
#32083 [Model Runner V2] Remove async barrier | v1 | by WoosukKwon (created: 2026-01-10 16:17 (UTC+8))
#32100 [responseAPI] support partial message generation | frontend | by qandrew (created: 2026-01-11 07:47 (UTC+8))
#32101 [MTP][GLM][Bugfix] Fixed .weight_scale loading logic that dropped MTP prediction accuracy with fp8+mtp | no labels | by andyl98 (created: 2026-01-11 08:32 (UTC+8))
#32089 [Bugfix] Fix Qwen3-VL-Reranker model loading for sequence classification | ready,qwen | by ricky-chaoju (created: 2026-01-10 22:17 (UTC+8))
#32099 [ROCm][Bugfix] Fix Mamba batched decode producing incorrect output | rocm | by AndreasKaratzas (created: 2026-01-11 07:23 (UTC+8))
#32098 fix(examples): replace unsafe eval() with safe math evaluator in xLAM tool examples | documentation | by deosha (created: 2026-01-11 06:54 (UTC+8))
#32095 [Benchmark][2/2] Use spline interpolation to tune SLA variables | documentation,performance,ready | by DarkLight1337 (created: 2026-01-11 01:08 (UTC+8))
#32096 [Misc] Make scipy as optional audio/benchmark dependency | ready,ci/build,multi-modality | by Isotr0py (created: 2026-01-11 02:21 (UTC+8))
#32097 [Bugfix] Fix GLM-4.7 tool parser for tool call without arguments | no labels | by steinfurt (created: 2026-01-11 02:42 (UTC+8))
#32084 [ROCm][Bugfix] Fix AITER speculative decoding accuracy issue | rocm,v1 | by c0de128 (created: 2026-01-10 16:21 (UTC+8))
#32077 [Cleanup] Removed unused KVConnectorModelRunnerMixin methods | ready,v1,kv-connector | by njhill (created: 2026-01-10 13:29 (UTC+8))
#32080 Add support for compressed-tensors NVFP4 in non-gated MoE layers #31782 | no labels | by baonudesifeizhai (created: 2026-01-10 14:43 (UTC+8))
#32091 [Tracing] Support OTEL_TRACES_EXPORTER env var and multiple exporters | ci/build,v1 | by minimAluminiumalism (created: 2026-01-10 23:40 (UTC+8))
#32085 [Model] Improve multimodal pooling examples | documentation | by noooop (created: 2026-01-10 16:43 (UTC+8))
#32092 [cpu][bench] Add Fused MoE Micro Benchmark for CPU Backend | performance,cpu | by andikarachman (created: 2026-01-10 23:54 (UTC+8))
#32088 fix offline inference chat response prompt | documentation,speculative-decoding | by andyxning (created: 2026-01-10 21:20 (UTC+8))
#32087 refactor: refactor_repeated_interfaces | deepseek | by tom-zju (created: 2026-01-10 20:27 (UTC+8))
#32076 [Bugfix] fix offline chat output prompt | documentation,ready | by andyxning (created: 2026-01-10 13:17 (UTC+8))
#32082 [Models] Add SharedFusedMoE support to Qwen3MoE | qwen | by Isotr0py (created: 2026-01-10 15:54 (UTC+8))
#32075 [Benchmark][1/2] Generalize SLA criterion validation from binary flags to margins | performance,ready | by DarkLight1337 (created: 2026-01-10 13:02 (UTC+8))
#32081 [Bugfix] Fix ModelOpt Llama-4 slow loading via tensor contiguity | llama | by ishrith-gowda (created: 2026-01-10 15:04 (UTC+8))
#32078 [EPLB] Replace async handshake flags with TransferPhase state machine | no labels | by Anri-Lombard (created: 2026-01-10 14:05 (UTC+8))
#32079 [Speculators][Speculative Decoding] Add EAGLE-3 Speculative Decoding Support for DeepSeek V3/V3.2 | deepseek | by qianlihuang (created: 2026-01-10 14:08 (UTC+8))
#32074 [Misc] Delay deprecation of CommonAttentionMetadata properties | v1 | by LucasWilkinson (created: 2026-01-10 12:55 (UTC+8))
#32073 [Attention][4/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties | speculative-decoding,v1 | by LucasWilkinson (created: 2026-01-10 12:52 (UTC+8))

Merged PRs

#32089 [Bugfix] Fix Qwen3-VL-Reranker model loading for sequence classification | ready,qwen | by ricky-chaoju (merged: 2026-01-11 04:40 (UTC+8))
#32063 [ROCm][CI] Fix flaky test_function_calling_with_stream and reduce schema test examples | rocm,ready,gpt-oss | by AndreasKaratzas (merged: 2026-01-10 13:02 (UTC+8))
#31341 [BugFix] Wait for compute before offloading KV to CPU | ready,v1,kv-connector | by orozery (merged: 2026-01-11 06:25 (UTC+8))
#32008 [MISC] Add strict contiguity check for FlashInfer attention tensors | ready,v1,nvidia | by vadiklyutiy (merged: 2026-01-11 04:40 (UTC+8))
#31637 [Bugfix][Quantization] Ensure input contiguity in per_token_quant_int8 | ready | by Flink-ddd (merged: 2026-01-11 04:40 (UTC+8))
#31617 Revert "[Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#308… | ready,nvidia | by shyeh25 (merged: 2026-01-11 04:39 (UTC+8))
#31583 [BugFix] scheduler: Fix resuming of preempted requests after async load | ready,v1 | by orozery (merged: 2026-01-11 04:39 (UTC+8))
#32002 fused_moe_kernel - cast accumulator after applying router weights | ready,ci-failure | by gnovack (merged: 2026-01-11 04:36 (UTC+8))
#32019 [LoRA][Perf] Improve FusedMoE LoRA performance for small rank | ready | by xyang16 (merged: 2026-01-11 03:04 (UTC+8))
#31984 [Kernel] Optimize Sliding Window Attention in 3D Triton Kernel | ready,v1 | by jvlunteren (merged: 2026-01-11 02:13 (UTC+8))
#31926 [Quant] Support MXFP4 W4A16 for compressed-tensors dense models | ready,nvidia | by mgoin (merged: 2026-01-10 22:44 (UTC+8))
#32076 [Bugfix] fix offline chat output prompt | documentation,ready | by andyxning (merged: 2026-01-10 15:50 (UTC+8))
#31657 [Bugfix] Fix integer overflow in Gemma3n audio processing | ready,multi-modality | by jeremyteboul (merged: 2026-01-10 17:52 (UTC+8))
#32075 [Benchmark][1/2] Generalize SLA criterion validation from binary flags to margins | performance,ready | by DarkLight1337 (merged: 2026-01-10 15:11 (UTC+8))
#31857 [Bugfix] fix encoder cache leak of waiting requests in scheduler to solve stuck in CPU scheduling | ready,v1 | by frelam (merged: 2026-01-10 14:27 (UTC+8))
#32074 [Misc] Delay deprecation of CommonAttentionMetadata properties | v1 | by LucasWilkinson (merged: 2026-01-10 13:06 (UTC+8))
#31895 Update modelopt KV cache quantization resolution to new scheme | ready | by roikoren755 (merged: 2026-01-10 12:54 (UTC+8))
#32026 [Refactor] Separate sequence and token pooling types | documentation,new-model,frontend,ready,qwen | by DarkLight1337 (merged: 2026-01-10 12:53 (UTC+8))
#31939 [Misc] Refactor ColumnParallelLinear: remove unused parameter and optimize forward | ready | by maang-h (merged: 2026-01-10 12:19 (UTC+8))
#31295 [Bugfix][Hardware][AMD] Use dynamic WARP_SIZE in sampler vectorized_process | rocm,ready | by c0de128 (merged: 2026-01-10 11:57 (UTC+8))
#25774 Fuse RoPE and MLA KV-cache write | ready,ci/build,v1 | by PatrykSaffer (merged: 2026-01-10 11:18 (UTC+8))
#31550 feature/issac 0.2 | ready,multi-modality | by AkshatSh (merged: 2026-01-10 11:18 (UTC+8))

PRs Closed Without Merging

#17232 [Model]Remove Dropout Layers | needs-rebase,stale,llama | by alex-jw-brooks (closed: 2026-01-11 10:17 (UTC+8))
#22309 [Feat] Allow custom comm_group in ParallelLinear layers | stale | by jianzs (closed: 2026-01-11 10:16 (UTC+8))
#24552 packed_modules_mapping for Command A Vision model | stale | by dongluw (closed: 2026-01-11 10:16 (UTC+8))
#31733 [Bugfix][Hardware][AMD] Use platform device type in compilation fusion helpers | rocm | by c0de128 (closed: 2026-01-11 01:56 (UTC+8))
#31184 [Bugfix][Hardware][AMD] Fix FP8 support detection on gfx11x architectures | rocm | by c0de128 (closed: 2026-01-11 01:26 (UTC+8))
#31958 [Bugfix] Keep all tensors to be on the same device | v1 | by wjunLu (closed: 2026-01-10 16:58 (UTC+8))
#25052 [Kernel] Add prefix arg to RMSNorm and adapt existing models | speculative-decoding,stale,llama,qwen,deepseek,gpt-oss | by 22dimensions (closed: 2026-01-10 11:17 (UTC+8))