vLLM Development Activity Report - 2026-01-02

Time window: 2026-01-02 10:45 (UTC+8) ~ 2026-01-03 10:45 (UTC+8)
Statistics: New Issues 8 | Closed Issues 30 | New PRs 15 | Merged PRs 19 | PRs Closed Without Merge 9

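📊 Daily Development Status Summary

During the 24-hour window from January 2 to January 3, 2026, the vLLM community stayed highly active: 19 PRs were merged against 15 newly opened, which suggests that review and integration are keeping pace with incoming work. Development focused on feature work and fixes for the AMD ROCm platform, continued optimization and refactoring of the MoE (Mixture of Experts) architecture, and refinement of newer capabilities such as multi-modal support and speculative decoding. Several long-standing issues were also closed, notably the data-parallel performance problem (#24461, #30655), a meaningful step for the project's stability and performance.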

🎯 AMD/ROCm Ecosystem Updates

AMD-related activity was high this cycle, covering feature enablement, bug fixes, CI/CD improvements, and troubleshooting.

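New PR - Enable the FlashAttention backend on ROCm (#31614)
- Contributor: ehartford
- Summary: The PR aims to unblock use of the upstream FlashAttention (FA) backend on ROCm. Key changes: precomputing cu_seqlens_k to stay compatible with CUDA Graph; requiring the KV cache block size to be a multiple of 128; adding related tests.
- Discussion and impact: This is the most significant AMD item of the cycle, and the discussion was lively. robertgshaw2-redhat pointed out that in earlier evaluations FA performed markedly worse on ROCm than the AITER and Triton backends, and that considerable effort had gone into building the V1 Triton Attention. He asked for a clear performance advantage to be demonstrated before merging. The PR author, ehartford, argued that users should be given the choice rather than being blocked outright. If merged, the PR would give AMD users more backend options, but its real-world performance still needs to be validated by the community.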
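To make the two constraints described for #31614 concrete, here is a minimal sketch. It assumes the usual variable-length attention convention in which cu_seqlens_k is the prefix sum of per-request key lengths with a leading zero; the helper names are hypothetical and are not taken from the PR.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Return cumulative sequence lengths with a leading zero.

    For key lengths [5, 3, 8] the result has len(seq_lens) + 1 entries,
    and entry i marks where request i starts in the packed key tensor.
    """
    return [0, *accumulate(seq_lens)]


def is_supported_block_size(block_size, multiple=128):
    """Hypothetical check mirroring the 'multiple of 128' requirement."""
    return block_size > 0 and block_size % multiple == 0


cu_seqlens_k = build_cu_seqlens([5, 3, 8])
print(cu_seqlens_k)
print(is_supported_block_size(256), is_supported_block_size(64))
```

Building this array ahead of time, rather than inside the captured region, is one plausible reading of "precomputing for CUDA Graph compatibility"; the PR itself should be consulted for the actual mechanism.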
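New Issue - Speculative decoding accuracy problem with the ROCm AITER backend (#31625)
- Reporter: vllmellm
- Summary: When running speculative decoding (EAGLE) with VLLM_ATTENTION_BACKEND=ROCM_AITER_FA, GSM8K accuracy is very poor (dropping as low as 0). The root cause has been identified: the ROCm AITER backend currently lacks CUDA Graph support for UNIFORM_BATCH and cannot correctly handle query_len > 1, which speculative decoding requires.
- Impact: This directly affects the usability and accuracy of speculative decoding on AMD hardware and is a key blocker for feature completeness in the AMD ecosystem.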
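New Issue - AMD CI test failure (#31631)
- Reporter: AndreasKaratzas
- Summary: Tests on the mi325_1 node in AMD CI fail because NIXL is unavailable. The underlying cause is that a new ROCm base Docker image has not yet been released (related to PR #31460).
- Impact: This reflects dependency management and version rollout in the AMD CI infrastructure and is a routine continuous-integration issue.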
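New PR - Update the ROCm Docker base image to 7.1 (#31615)
- Contributor: c0de128
- Summary: Upgrades the base image from rocm/dev-ubuntu-22.04:7.0-complete to the 7.1 release in order to officially support Strix Point/Halo APUs (gfx1150 and gfx1151 architectures).
- Impact: Ensures vLLM can be built and run on the new generation of AMD Ryzen AI hardware, widening hardware coverage.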
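Merged PRs - Assorted ROCm fixes
- #31282: Fixes the paged_kv_last_page_len calculation in the AITER MLA decode path, avoiding a potential out-of-bounds memory access.
- #31596: Fixes the output_shape calculation in the Attention layer, which wrongly assumed the shape of 3D query inputs; this resolves problems on certain code paths for models such as DeepSeek-V2.
- #31612: Fixes the ModernBERT test failure on ROCm by forcing the HuggingFace reference implementation to use eager attention, sidestepping numerical differences in its FlashAttention path on ROCm.
- #31553, #31565, #31630: A series of CI test fixes addressing stability problems caused by the test sharding strategy, long-prompt token counting logic, and similar issues.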
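For readers unfamiliar with the quantity fixed in #31282, the sketch below shows the common paged-KV convention for the number of valid tokens in a sequence's last page. It is an illustration of the general idea only; the function is hypothetical and does not reproduce the code changed by the PR.

```python
def last_page_len(seq_len, page_size):
    """Number of valid tokens in the final KV cache page of a sequence.

    A sequence that exactly fills its pages has a full last page
    (page_size), not an empty one; an empty sequence has no tokens.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if seq_len <= 0:
        return 0
    remainder = seq_len % page_size
    return page_size if remainder == 0 else remainder


for seq_len in (1, 16, 17, 32):
    print(seq_len, last_page_len(seq_len, page_size=16))
```

Getting the exact-multiple case wrong (returning 0 or a value larger than the page) is the kind of off-by-one that can lead a kernel to read past the valid region of a page.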

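💬 Analysis of High-Traffic Discussions

New PR #31614: Enable the FlashAttention backend on ROCm
- Core question: Should the restriction on the upstream FlashAttention backend be lifted on ROCm?
- Positions:
  - Author ehartford: Since upstream already supports ROCm, vLLM should not block it artificially and should leave the choice to users.
  - Maintainer robertgshaw2-redhat: Recalled past performance evaluations in which FA did poorly on AMD hardware, and noted that the team has already built an optimized Triton Attention for V1. He is cautious about merging and asks for a clear performance benefit.
  - Community user JartX: Reported a CUDA Graph capture error on an RDNA3 GPU after applying the related patch and provided a detailed Dockerfile for debugging.
- Point of contention: The trade-off between feature availability and the best-performing option, that is, whether to prioritize offering more choices or to ensure that the default or recommended path has been thoroughly optimized and validated.
- Current status: The PR is open and under discussion; further technical evaluation and performance data are needed.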
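New Issue #31624: ModelOpt Llama-4 checkpoints load too slowly
- Core question: Llama-4 FP8 models optimized with NVIDIA ModelOpt take more than 5 minutes to load.
- Positions:
  - Reporter robertgshaw2-redhat: Identified the root cause as ModelOpt's weight-loading logic producing non-contiguous CPU tensors, which makes the CPU-to-GPU copy abnormally slow. He proposed a "GPU-first" approach that moves the weights to the GPU before transposing them, giving a 2.1x speedup, but is concerned that it would break the current weight loader's design guarantee of not using extra GPU memory.
  - Participant pixelsoccupied: Confirmed experimentally that the "GPU-first" approach works and asked what the specific concerns are.
  - Other participants: youneedgreg and mgarner3 both offered to take on the issue.
- Point of contention: The design trade-off between maximum loading speed on one side and safe memory management and multi-GPU compatibility on the other.
- Current status: The issue is open and awaiting a solution. mgarner3 said they will investigate and submit a fix.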
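Closed Issue #24124: Remove CUDA 11.8 support
- Core question: With PyTorch versions moving forward, should builds, tests, and references for the aging CUDA 11.8 be removed?
- Positions:
  - Initiator simon-mo: Considered this a good time to clean up old versions and simplify the maintenance matrix.
  - Community participant ayushsatyam146: Volunteered to take on the work.
- Summary: This is a routine technical-debt cleanup. The community is positive about this kind of maintenance simplification, and there was no notable disagreement.
- Final status: Automatically marked as stale and closed after more than 90 days of inactivity, which reflects the project's automated handling of issues that make no progress.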

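🔥 Trending Topics and Analysis

Stronger multi-modal support: A multi-modal processor benchmark tool (#29105) was merged this cycle, and a related PR (#31627) adds documentation for torch.compile support in multi-modal encoders. This suggests multi-modal inference is moving from feature enablement toward performance evaluation and optimization.
Deep MoE optimization: MoE-related PRs were dense, including explicit construction of modular kernels (#31504), a kernel split that prepares for future architectural cleanup (#31050), and a fix for an MoE FP8 problem caused by an incorrect output shape calculation in the Attention layer (#31596). MoE is a core model architecture, and its performance, correctness, and maintainability remain a sustained investment.
Deeper AMD ROCm work: As described above, the work ranges from enabling a new backend and fixing core bugs to updating the base image and CI, showing that the vLLM community is systematically improving the depth and stability of AMD hardware support.
Speculative decoding and performance: Besides the speculative decoding problem on AMD (#31625), a PR fixed the block size used in EAGLE slot mapping (#31540). Speculative decoding is a key throughput technique, and its correctness across models and hardware is a focus.
Infrastructure and code quality: Items include narrowing overly broad exception handling (#31616), cleanup of old CUDA versions (the closed Issue), and the model config parser refactor (#28454). They show that the project values robustness and long-term maintainability alongside fast iteration.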
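The EAGLE slot-mapping fix (#31540) noted above concerns which block size is used when a token position is translated into a KV cache slot. The sketch below shows the standard paged-cache mapping under assumed names (block_table, slot_for_position); it is not the code from the PR, only an illustration of why a mismatched block size yields the wrong slot.

```python
def slot_for_position(block_table, position, block_size):
    """Map a token position to a flat KV cache slot index.

    block_table[i] is the physical block holding logical block i, so the
    slot is (physical block * block_size) + offset within the block.
    """
    logical_block, offset = divmod(position, block_size)
    return block_table[logical_block] * block_size + offset


block_table = [7, 2, 9]  # hypothetical physical block ids for one request

# Using the block size the cache was allocated with.
correct = slot_for_position(block_table, position=17, block_size=16)

# Using a different block size silently selects another block and offset.
mismatched = slot_for_position(block_table, position=17, block_size=8)

print(correct, mismatched)
```

Because the mismatched call still returns a valid-looking integer, this class of bug tends to show up as degraded output quality rather than as a crash.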

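🛠️ Key Technical Changes

#30739 [BugFix] Support online data parallelism (DP) for dense models without extra overhead:
- Technical notes: Previously, enabling DP on ordinary non-MoE models still triggered unnecessary cross-rank synchronization and "idle-rank dummy forward" coordination, causing significant overhead. This PR changes the worker-level parallel configuration of non-MoE models to be equivalent to DP=1, so each rank works independently and the coordinator runs only to propagate statistics when load balancing is needed.
- Impact: Significantly improves the performance of non-MoE models in DP mode and resolves a long-standing pain point (related Issues #24461, #30655). Benchmarks show close to linear scaling, making this an important optimization for DP mode.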
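The idea behind #30739 can be sketched as a small configuration rewrite: for dense models, each worker sees settings equivalent to DP=1, while MoE models keep their real DP topology. The dataclass and function names below are hypothetical and only illustrate the described behavior; vLLM's actual configuration objects differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ParallelSettings:
    # Hypothetical stand-in for a worker's view of data parallelism.
    data_parallel_size: int
    data_parallel_rank: int


def worker_settings(settings, is_moe):
    """Return the parallel settings a worker should use.

    MoE models need cross-rank coordination, so their settings are left
    untouched. Dense models are treated as independent DP=1 replicas.
    """
    if is_moe:
        return settings
    return replace(settings, data_parallel_size=1, data_parallel_rank=0)


engine_view = ParallelSettings(data_parallel_size=4, data_parallel_rank=2)
print(worker_settings(engine_view, is_moe=False))
print(worker_settings(engine_view, is_moe=True))
```

Keeping the original settings object intact (the sketch returns a copy) reflects the point that the outer layer still knows the real DP size, for example for load-balancing statistics.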
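#28454 [Core] Parse model architecture config from hf_config:
- Technical notes: This is a large refactoring PR. It introduces model_arch_config to define explicitly all the normalized fields that the vLLM runtime needs, and uses a parser to read and normalize them from the HuggingFace config. The goal is eventually to pull the normalization logic out of the existing code and form a cleaner, more maintainable configuration layer.
- Impact: An important infrastructure improvement that lays the groundwork for supporting more models, unifying configuration management, and enabling more flexible backend dispatch (for example, choosing the attention class per layer based on configuration).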
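A minimal sketch of the normalization idea behind #28454 follows. The input keys (hidden_size, num_attention_heads, num_key_value_heads, head_dim) are common HuggingFace config fields; the ArchSketch class, the parse_arch function, and the fallback rules are assumptions for illustration and do not mirror the fields or logic of the real model_arch_config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchSketch:
    # Hypothetical normalized view consumed by the runtime.
    hidden_size: int
    num_attention_heads: int
    num_kv_heads: int
    head_dim: int


def parse_arch(hf_config):
    """Normalize a HuggingFace-style config dict into explicit fields.

    Missing optional keys fall back to conventional defaults so that
    downstream code never has to probe the raw config again.
    """
    hidden = hf_config["hidden_size"]
    heads = hf_config["num_attention_heads"]
    kv_heads = hf_config.get("num_key_value_heads") or heads
    head_dim = hf_config.get("head_dim") or hidden // heads
    return ArchSketch(hidden, heads, kv_heads, head_dim)


gqa = parse_arch({"hidden_size": 4096, "num_attention_heads": 32,
                  "num_key_value_heads": 8})
mha = parse_arch({"hidden_size": 768, "num_attention_heads": 12})
print(gqa)
print(mha)
```

Concentrating these fallbacks in one parser is what allows the rest of the code to depend on a fixed set of fields instead of model-specific config quirks.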
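#29105 Add a multi-modal processor benchmark:
- Technical notes: Introduces a new command-line tool, vllm bench multimodal-processor, for end-to-end benchmarking of the vision encoder (processor) portion of multi-modal models (such as image-text models) in a single-GPU, single-instance setting. It measures encoding latency, throughput, and other key metrics.
- Impact: Fills a gap in performance tooling for multi-modal models, letting developers and users quantitatively evaluate and compare encoder performance across models or configurations, which helps locate bottlenecks and guide optimization.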

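📈 Development Activity Observations

Efficient merging: Merged PRs (19) outnumbered new PRs (15), indicating a smooth review process and steady reduction of the backlog.
Diverse contributors: Contributors include AMD employees (focused on ROCm), Red Hat engineers (focused on ModelOpt and MoE), and many community developers (submitting model support, documentation, and bug fixes). The "help wanted" and "good first issue" labels appear frequently on issues, and developers are actively claiming them, a sign of healthy community participation.
High closure rate: 30 old issues were closed within 24 hours, many of them dating back months or more than a year, showing that the team keeps cleaning up the issue tracker to maintain project health.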

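💡 Items Worth Watching

AMD FlashAttention backend performance comparison (#31614): The merge decision for this PR should rest on rigorous performance data. The community should watch for follow-up benchmark reports comparing the FA, AITER, and Triton attention backends on AMD hardware.
A general fix for ModelOpt loading performance (#31624): This issue exposes a weakness in how the current weight loader handles non-contiguous tensors. Its solution may also inform faster loading for other complex quantized or optimized model formats.
Robustness improvements to the Sleep/Wake_up API (#31618, #31613, #31619): Users of this new feature have proposed edge-case improvements (such as waking up with invalid tags, and checking for unfinished requests before sleeping), reflecting the high bar that production environments set for API robustness. These changes will make the feature safer and easier to use.
Full speculative decoding support on AMD (#31625): Speculative decoding support in the ROCm AITER backend is a functional gap that should be closed soon to improve the competitiveness of the AMD ecosystem.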
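The two Sleep/Wake_up robustness ideas listed above (#31613, #31619) can be illustrated with a toy state machine. Everything here is hypothetical: the class, the tag names, and the choice to raise on unknown tags are assumptions made for the sketch, and the actual PRs may handle these cases differently.

```python
class SleepableEngineSketch:
    """Toy model of an engine with sleep and wake_up operations."""

    KNOWN_TAGS = frozenset({"weights", "kv_cache"})  # assumed tag names

    def __init__(self):
        self.sleeping_tags = set()
        self.unfinished_requests = 0

    def sleep(self):
        # Refuse to sleep while requests are still in flight.
        if self.unfinished_requests > 0:
            raise RuntimeError("cannot sleep with unfinished requests")
        self.sleeping_tags = set(self.KNOWN_TAGS)

    def wake_up(self, tags=None):
        requested = set(self.KNOWN_TAGS if tags is None else tags)
        unknown = requested - self.KNOWN_TAGS
        if unknown:
            # Reject invalid tags before touching any state.
            raise ValueError(f"unknown wake_up tags: {sorted(unknown)}")
        # Set difference makes repeated calls harmless (idempotent).
        self.sleeping_tags -= requested


engine = SleepableEngineSketch()
engine.sleep()
engine.wake_up(["weights"])
engine.wake_up(["weights"])  # second call is a no-op
print(sorted(engine.sleeping_tags))
```

Validating tags before mutating state means a bad call leaves the engine exactly as it was, which is the property that makes retries safe.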

📋 Appendix: Detailed Data Lists

New Issues

#31634 [RFC]: Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA Compatibility | RFC | by Lumosis (created: 2026-01-03 08:24 (UTC+8))
#31631 [CI Failure]: mi325_1: V1 Test others | ci-failure | by AndreasKaratzas (created: 2026-01-03 06:59 (UTC+8))
#31624 [Bug]: ModelOpt Llama-4 Checkpoints Take 5+ minutes to load | bug,help wanted,good first issue,feature request | by robertgshaw2-redhat (created: 2026-01-02 23:18 (UTC+8))
#31628 [Bug][ModelOpt]: Llama4 DP/EP FlashInfer Cutlass Is Broken | bug,nvidia | by robertgshaw2-redhat (created: 2026-01-03 05:14 (UTC+8))
#31626 [Bug]: Gibberish output on CPU backend when --enforce-eager is enabled (Qwen3-0.6B) | bug,cpu | by kzwrime (created: 2026-01-03 00:19 (UTC+8))
#31625 [Bug][ROCm][AITER]: Speculative Decoding Accuracy Issue with VLLM_ATTENTION_BACKEND=ROCM_AITER_FA | bug,rocm | by vllmellm (created: 2026-01-03 00:02 (UTC+8))
#31623 [Bug]: Minimax-M2.1 Error loading QKV for partial quantized version | bug | by mratsim (created: 2026-01-02 18:25 (UTC+8))
#31618 [Bug]: Potential improvements and fixes for sleep/wake_up API | bug | by danielhumanmod (created: 2026-01-02 13:19 (UTC+8))

Closed Issues

#8177 [Bug]: watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered | bug,stale | by DreamGenX (closed: 2026-01-03 10:15 (UTC+8))
#11037 [Installation]: no version of pip install vllm works - Failed to initialize NumPy: No Module named ‘numpy’ | installation,stale | by cephdon (closed: 2026-01-03 10:15 (UTC+8))
#21051 [Performance]: Long startup delay due to plugin loading and subprocess spawning | performance,stale,startup-ux | by Huzifa1 (closed: 2026-01-03 10:14 (UTC+8))
#21887 [Feature]: Multimodal Benchmarking Support (MMLM) | feature request,stale | by knlnguyen1802 (closed: 2026-01-03 10:14 (UTC+8))
#22412 [Bug]: When accessing the API with the ‘stop’ parameter, the ‘qwen3-reasoning-parser’ fails to function correctly. | bug,stale | by wyy007 (closed: 2026-01-03 10:14 (UTC+8))
#22480 [Feature]: Add Moving Average Statistics for Better Performance Monitoring | feature request,stale | by NumberWan (closed: 2026-01-03 10:14 (UTC+8))
#22657 [Bug]: Qwen 3 2507 update models use deepseek_r1 reasoning parser - suggest renaming | bug,stale | by sethkimmel3 (closed: 2026-01-03 10:14 (UTC+8))
#23192 [Usage]: How to run model - RedHatAI/Mixtral-8x7B-Instruct-v0.1-FP8 | usage,stale | by mrtpk (closed: 2026-01-03 10:14 (UTC+8))
#23609 [Usage]: Unable to see more than 20% improvement on b200 for vllm | usage,stale | by shaamil101-etched (closed: 2026-01-03 10:14 (UTC+8))
#23611 [Usage]: Which dataset do you recommend using for the ngram spec decoding method? | usage,stale | by shaamil101-etched (closed: 2026-01-03 10:14 (UTC+8))
#23834 [Usage]: Load Qwen3 Moe model error when starting the vllm server on TPU | usage,stale | by Prayer3th (closed: 2026-01-03 10:14 (UTC+8))
#24096 [Feature]: Support extendable configuration files | feature request,stale | by DaividFrank (closed: 2026-01-03 10:14 (UTC+8))
#24116 [Feature]: Optimize EPLB Rearrange Experts | feature request,stale | by robertgshaw2-redhat (closed: 2026-01-03 10:14 (UTC+8))
#24124 Remove CUDA 11.8 | feature request,stale | by simon-mo (closed: 2026-01-03 10:14 (UTC+8))
#24139 [Bug]: B200 hang on flashinfer fa2 prefill | bug,stale | by haydn-jones (closed: 2026-01-03 10:14 (UTC+8))
#24164 [Bug]: Qwen2-VL-7B-Instruct model`s output of the 1st inference is different with the subsequent inferences. | bug,stale | by suluner (closed: 2026-01-03 10:14 (UTC+8))
#24166 Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions. | stale | by justinSmileDate (closed: 2026-01-03 10:13 (UTC+8))
#24169 [Performance]: MoE FP8 and Gemm FP8 for CPU | performance,stale | by duanzhui (closed: 2026-01-03 10:13 (UTC+8))
#24182 FastAPI Swagger Documentation Name to be Updated to the Model Name | feature request,stale | by sssaha (closed: 2026-01-03 10:13 (UTC+8))
#24186 [Usage]: how does v1 engine perform the model parameter hot update? | usage,stale | by LEON-gittech (closed: 2026-01-03 10:13 (UTC+8))
#24206 [Installation]: fail to install in cuda 118 with v100. | installation,stale | by betasecond (closed: 2026-01-03 10:13 (UTC+8))
#24232 [Feature]: When the tool choice configuration parameters are invalid, the API server returns a 500 error code. | feature request,stale | by wangxshen (closed: 2026-01-03 10:13 (UTC+8))
#24244 [Improvement]: The fixed “language_model” prefix issue in multimodal models | bug,stale | by wenba0 (closed: 2026-01-03 10:13 (UTC+8))
#24246 [Installation]: Warning on char conversion on aarch64 | installation,stale | by green-br (closed: 2026-01-03 10:13 (UTC+8))
#25822 [RFC][Plugin]: support loading kernels from other place | RFC,stale | by ILikeIneine (closed: 2026-01-03 03:13 (UTC+8))
#23925 [Bug]: C.abi3.so: undefined symbol: _Z24silu_and_mul_nvfp4_quantRN2at6TensorES1_S1_S1 | bug,stale | by Rio-Allen (closed: 2026-01-03 03:01 (UTC+8))
#30655 [Performance]: VLLM with DP performing worst | performance | by akasshdeep (closed: 2026-01-02 23:36 (UTC+8))
#24461 [BugFix]: Avoid unnecessary coordination for non-MoE data parallel | bug,help wanted | by njhill (closed: 2026-01-02 23:36 (UTC+8))
#30727 [Performance] Optimization through caching the moe modular kernels | no labels | by yewentao256 (closed: 2026-01-02 23:05 (UTC+8))
#24171 [MM processor]: Benchmark mm processor’s performance | help wanted,good first issue,feature request,multi-modality | by Isotr0py (closed: 2026-01-02 11:26 (UTC+8))

New PRs

#31635 Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA Compatibility. | v1 | by Lumosis (created: 2026-01-03 08:53 (UTC+8))
#31633 [Misc]ModelConfig use architecture rather than archiectures | new-model | by charlotte12l (created: 2026-01-03 08:07 (UTC+8))
#31632 [CI] Skip Phi-MoE test due to old API util | ci/build | by AndreasKaratzas (created: 2026-01-03 07:36 (UTC+8))
#31616 [Bugfix] Narrow broad exceptions in compilation backends | no labels | by c0de128 (created: 2026-01-02 12:49 (UTC+8))
#31630 [CI][Bugfix] Fix token counting in chunked prefill compl test | no labels | by AndreasKaratzas (created: 2026-01-03 05:52 (UTC+8))
#31629 Move skipping of unused GPTQ bias to AutoWeightsLoader | speculative-decoding,ready,llama,qwen,deepseek | by hmellor (created: 2026-01-03 05:23 (UTC+8))
#31627 [Documentation][torch.compile] Add documentation for torch.compile + multimodal encoders | documentation | by Lucaskabela (created: 2026-01-03 03:30 (UTC+8))
#31614 [ROCm][Attention] Enable FlashAttention backend on ROCm (graph‑safe cu_seqlens_k) | rocm,speculative-decoding,v1 | by ehartford (created: 2026-01-02 12:41 (UTC+8))
#31617 Revert “[Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#308… | nvidia | by shyeh25 (created: 2026-01-02 12:50 (UTC+8))
#31621 Add K-EXAONE-236B-A23B | documentation,new-model | by lkm2835 (created: 2026-01-02 17:19 (UTC+8))
#31622 Fix GLM-4.6v flash tool calling in transformers 5.x | documentation,tool-calling | by baonudesifeizhai (created: 2026-01-02 17:21 (UTC+8))
#31620 [Model] Enable LoRA support for BLIP2 | documentation | by ppppqp (created: 2026-01-02 15:10 (UTC+8))
#31613 [Bugfix] Make executor wake_up idempotent and robust to invalid tags | v1 | by danielhumanmod (created: 2026-01-02 12:11 (UTC+8))
#31619 [Bugfix] Disallow sleep call if there are unfinished requests | frontend,v1 | by danielhumanmod (created: 2026-01-02 13:26 (UTC+8))
#31615 [Docker][ROCm] Update base image to ROCm 7.1 for GFX1150/1151 support | rocm,ci/build | by c0de128 (created: 2026-01-02 12:45 (UTC+8))

Merged PRs

#30739 [BugFix] Support online dense model DP without overhead | ready,v1,kv-connector | by njhill (merged: 2026-01-02 23:36 (UTC+8))
#28454 [Core] Parse vLLM engine required fields from hf_config to model_arch_config | documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 | by charlotte12l (merged: 2026-01-03 07:13 (UTC+8))
#31604 [Benchmark] Fix OOM during MoE kernel tuning for large models | performance,ready | by massif-01 (merged: 2026-01-03 06:24 (UTC+8))
#31504 [MoE Refactor] Explicit construct mk for flashinfer bf16 kernel | ready | by zyongye (merged: 2026-01-03 05:54 (UTC+8))
#31050 [MoE Refactor] Split invoke_fused_moe_kernel | ready | by zyongye (merged: 2026-01-03 05:47 (UTC+8))
#31553 [ROCm][CI] Fix failure in Language Models Tests (Extra Standard) by reducing agent pool size | rocm,ci/build | by AndreasKaratzas (merged: 2026-01-02 11:29 (UTC+8))
#31612 [ROCm][CI] Fix ModernBERT token classification test | rocm,ready | by AndreasKaratzas (merged: 2026-01-02 12:19 (UTC+8))
#31596 [MoE] Fix output_shape calculation in Attention layer to handle 3D query inputs | rocm,ready | by AndreasKaratzas (merged: 2026-01-02 23:46 (UTC+8))
#31601 Add multimodal input method in the documentation | documentation,ready | by labAxiaoming (merged: 2026-01-02 20:43 (UTC+8))
#31530 CustomOp: test forward dispatch for grouped_topk | ready | by xinyu-intel (merged: 2026-01-02 23:04 (UTC+8))
#31282 [Bugfix][Hardware][AMD] Fix last_page_len calculation in AITER MLA decode | rocm,ready,v1 | by c0de128 (merged: 2026-01-02 13:14 (UTC+8))
#31103 [Bugfix] Fix weight_loader v1 block scale | ready | by kyuyeunk (merged: 2026-01-02 13:14 (UTC+8))
#31549 Remove unused use_marlin variable in Mxfp4MoEMethod | documentation,new-model,ready,ci/build,v1,multi-modality,cpu,gpt-oss | by vsourirajan (merged: 2026-01-02 13:13 (UTC+8))
#31572 [Bugfix] Fix activation quantization for compressed-tensors W4A16 | ready | by Tmn07 (merged: 2026-01-02 13:13 (UTC+8))
#31513 [Model] Enable LoRA support for tower and connector in LLaVA | documentation,ready | by jayhemnani9910 (merged: 2026-01-02 11:32 (UTC+8))
#31540 [Bugfix] Fix block size used in EAGLE slot mapping | bug,speculative-decoding,ready,v1 | by benchislett (merged: 2026-01-02 11:32 (UTC+8))
#31569 feat: support LoRA for DeepSeek-OCR(Language Model part) | documentation,ready,deepseek | by zhima771 (merged: 2026-01-02 11:32 (UTC+8))
#31590 [Bugfix] Replace BaseException with specific exceptions in FLA utils | ready | by c0de128 (merged: 2026-01-02 11:27 (UTC+8))
#29105 Add Multimodal Processor Benchmark | documentation,performance,frontend,ready,ci/build,v1,multi-modality | by reaganjlee (merged: 2026-01-02 11:26 (UTC+8))

PRs Closed Without Merge

#22869 [FIXBUG] Add stop and stop_token_ids to BeamSearchParams | frontend,stale | by hoangvictor (closed: 2026-01-03 10:14 (UTC+8))
#23618 Fix regex patterns in DeepSeekV31ToolParser to use non-greedy matching | frontend,stale,tool-calling,deepseek | by eric8810 (closed: 2026-01-03 10:14 (UTC+8))
#23788 [benchmark] add random and common prefix usage | performance,stale | by panpan0000 (closed: 2026-01-03 10:14 (UTC+8))
#24162 [Log] Per Rank Log | stale,v1 | by ZJY0516 (closed: 2026-01-03 10:14 (UTC+8))
#24174 Optimize detokenizer performance for long-generation sequences | stale | by GITHUBear (closed: 2026-01-03 10:13 (UTC+8))
#31629 Move skipping of unused GPTQ bias to AutoWeightsLoader | speculative-decoding,ready,llama,qwen,deepseek | by hmellor (closed: 2026-01-03 05:42 (UTC+8))
#30304 Fix incomplete response generation for tool call outputs | frontend,deepseek,meta-exported | by qandrew (closed: 2026-01-02 23:12 (UTC+8))
#25761 [Core] Fix torch.dynamo compatibility for Qwen models on vllm-gaudi | frontend,v1,tool-calling,qwen | by pawel-olejniczak (closed: 2026-01-02 16:36 (UTC+8))
#31576 feat: add vllm.utils.device_utils module | no labels | by codebasecomprehension (closed: 2026-01-02 11:19 (UTC+8))