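vLLM Development Activity Report - 2026-01-06

Time window: 2026-01-06 10:53 (UTC+8) ~ 2026-01-07 10:53 (UTC+8)
Statistics: New Issues 28 | Closed Issues 16 | New PRs 65 | Merged PRs 46 | PRs closed without merging 16

📊 Daily Development Status Summary

During the 24 hours from January 6 to January 7, 2026, the vLLM project remained highly active: 28 new Issues and 65 new PRs were opened, and 46 PRs were merged. Development focused on performance optimization (particularly MoE kernels), platform compatibility (especially stability fixes for the AMD ROCm CI), and broader multimodal model support. The community also responded to and fixed bugs related to quantization (FP8/NVFP4) and MTP (multi-token prediction) inference quickly.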

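🎯 AMD/ROCm Ecosystem Updates

In this period, AMD/ROCm work focused on CI stability fixes, a more complete profiling toolchain, and accuracy fixes for platform-specific kernels.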

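Feature request: ROCTx support (Issue #31831)
- Content: A user requested ROCTx support on the AMD ROCm platform (analogous to NVIDIA's NVTX) so that the PyTorch Profiler can capture kernel shape data and produce more detailed profiling reports.
- Analysis: This indicates that community users are deploying vLLM on AMD GPUs and expect profiling and debugging tooling on par with the NVIDIA platform, a sign of the AMD ecosystem's growing maturity.
Bug fix: AITER MLA backend accuracy regression (PR #31816)
- Content: Fixes a severe accuracy drop when running DeepSeek-V3 with the ROCM_AITER_TRITON_MLA attention backend on AMD ROCm (GSM8K accuracy fell from ~96% to nearly zero).
- Analysis: The fix is critical for the inference accuracy of AITER, AMD's own attention backend implementation. The root cause was an incorrect inheritance relationship: the backend did not inherit from the correct base class (AiterMLABackend). This is a key bug in a platform-specific implementation.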
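The failure mode behind this kind of bug can be illustrated with a toy sketch. All class and method names below are hypothetical and are not vLLM's actual classes. The only point is that a backend which subclasses the wrong parent silently inherits the wrong implementation instead of raising an error, which is why such a bug shows up as an accuracy regression rather than a crash.

```python
class GenericBackend:
    """Hypothetical generic backend (illustrative only)."""

    @staticmethod
    def get_impl_name() -> str:
        return "generic_impl"


class PlatformBackend(GenericBackend):
    """Hypothetical platform-specific backend that overrides the implementation."""

    @staticmethod
    def get_impl_name() -> str:
        return "platform_impl"


class BrokenVariantBackend(GenericBackend):
    """Bug: derives from the generic base, so the platform override is skipped."""

    @staticmethod
    def get_name() -> str:
        return "VARIANT"


class FixedVariantBackend(PlatformBackend):
    """Fix: derive from the platform-specific base so its override is inherited."""

    @staticmethod
    def get_name() -> str:
        return "VARIANT"
```

Both variants report the same backend name, so nothing looks wrong from the outside; only the inherited implementation differs.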
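CI/Infrastructure: multiple ROCm test fixes (PR #31829, #31835, #31820)
- Content:
  - Removed the VLLM_FLOAT32_MATMUL_PRECISION environment variable setting, which could cause ROCm test failures, and pinned the albumentations library version.
  - Pinned the timm library version to resolve an import error in the Nemotron multimodal model tests.
  - Relaxed the tolerance threshold (rtol) of the ModernBERT token classification test on ROCm to accommodate inherent numerical differences between platforms.
- Analysis: This work mainly keeps the AMD CI pipeline running stably. Frequent dependency pinning and tolerance adjustments suggest that keeping tests consistent across platforms (CUDA vs ROCm) is an ongoing challenge and a necessary cost of multi-architecture support.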
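To make the rtol adjustment concrete, here is a minimal sketch of a relative-tolerance comparison using only the standard library. The values and thresholds are invented for illustration and are not taken from the ModernBERT test. Note also that `math.isclose` uses a symmetric definition of relative tolerance, which differs slightly from the `allclose`-style checks in NumPy and PyTorch.

```python
import math


def all_close(actual, expected, rtol, atol=0.0):
    """Return True when every pair of values agrees within the tolerances."""
    return len(actual) == len(expected) and all(
        math.isclose(a, e, rel_tol=rtol, abs_tol=atol)
        for a, e in zip(actual, expected)
    )


# Hypothetical outputs from a reference platform and from a second platform.
reference = [0.5000, -1.2500, 2.0000]
other_platform = [0.5004, -1.2510, 2.0015]

# The same outputs fail a strict threshold and pass a relaxed one.
strict = all_close(other_platform, reference, rtol=1e-4)
relaxed = all_close(other_platform, reference, rtol=1e-3)
```

The trade-off is that a looser threshold also hides small genuine regressions, so such changes are usually scoped to the affected platform and test.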
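Feature implementation: ROCTx support (PR #31187) (merged)
- Content: Implements ROCTx range annotations on AMD GPUs.
- Analysis: This PR addresses the need raised in the Issue above and fills a gap in the AMD profiling toolchain, helping developers optimize vLLM performance on MI-series GPUs.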

💬 Analysis of High-Traffic Discussions

Issue #31792: VLLM 0.13.0 / Mistral-Small-3.2-24B-Instruct-2506: MistralTokenizer object has no attribute ‘convert_tokens_to_ids’
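- Core issue: A user reported that, when running a specific Mistral model, vLLM's custom tokenizer wrapper class lacks the convert_tokens_to_ids method, causing incompatibility with the Hugging Face Transformers library (in particular its multimodal processors).
- Views and progress:
  - User: Provided a detailed stack trace showing that the error occurs during PixtralProcessor initialization.
  - Maintainer (@DarkLight1337): Quickly submitted PR #31796 to add the missing method to the TokenizerLike interface. A deeper compatibility issue tied to the Transformers version (v4.57 vs v5) then surfaced, and a second PR, #31817, was submitted to handle the tokenizer type correctly in multimodal processors.
- Summary: The discussion highlights the compatibility challenges vLLM faces when wrapping the HF ecosystem, especially with multimodal models and customized tokenizers. The maintainer resolved the problem through rapid iterative fixes.
- Current status: The Issue is closed, fixed by the two linked PRs.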
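The class of error in this Issue (a wrapper object missing a method that downstream code expects) can be sketched with a structural interface. The names `ToyTokenizerLike`, `ToyTokenizer`, and `IncompleteTokenizer` are hypothetical stand-ins and do not reproduce vLLM's TokenizerLike or the Hugging Face API; the sketch only shows how declaring the method on the interface lets the omission be detected up front.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ToyTokenizerLike(Protocol):
    """Hypothetical minimal interface that downstream code relies on."""

    def encode(self, text: str) -> list[int]: ...

    def convert_tokens_to_ids(self, tokens: list[str]) -> list[int]: ...


class ToyTokenizer:
    """Hypothetical wrapper around a custom vocabulary."""

    def __init__(self, vocab: dict[str, int], unk_id: int = 0) -> None:
        self._vocab = vocab
        self._unk_id = unk_id

    def encode(self, text: str) -> list[int]:
        # Whitespace splitting stands in for a real tokenization algorithm.
        return self.convert_tokens_to_ids(text.split())

    def convert_tokens_to_ids(self, tokens: list[str]) -> list[int]:
        # Unknown tokens map to the unknown-token id.
        return [self._vocab.get(tok, self._unk_id) for tok in tokens]


class IncompleteTokenizer:
    """Wrapper that forgot convert_tokens_to_ids, as in the reported error."""

    def encode(self, text: str) -> list[int]:
        return [0 for _ in text.split()]
```

With `runtime_checkable`, `isinstance(obj, ToyTokenizerLike)` checks only that the methods exist, not their signatures, so it is a cheap guard rather than a full compatibility test.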
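Issue #31319: GLM-4.7-FP8 missing beginning tag (closed in this period)
- Core issue: When the thinking (reasoning) feature is used with the GLM-4.7 model, the opening <think> tag is missing from the streamed output, which breaks front-end parsing.
- Views and progress:
  - Community users: Provided a temporary patch and discussed whether the deepseek_r1 or step3 parser could serve as an alternative.
  - Maintainers: Ultimately adopted the approach in PR #31788: point the reasoning parser configuration for GLM-4.5/4.7 to the deepseek_v3 parser and set a default enable_thinking parameter.
- Point of contention: Whether to create a new dedicated parser (such as Glm47AppendThinkReasoningParser) or to reuse an existing parser with adjusted configuration.
- Final conclusion: The configuration-reuse approach was adopted. The root cause is considered to be a faulty implementation of the GLM-4.5 reasoning parser, and the current approach is a stopgap.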
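One common way to tolerate a missing opening tag is to treat everything before the closing tag as reasoning whenever thinking is enabled. The sketch below illustrates that idea on a complete (non-streamed) string. The function name and the `thinking_enabled` flag are assumptions made for illustration; this is not the code of any vLLM reasoning parser.

```python
def split_reasoning(text, thinking_enabled=True,
                    start_tag="<think>", end_tag="</think>"):
    """Split model output into (reasoning, content).

    Hypothetical helper: tolerates output whose opening tag is absent.
    """
    if not thinking_enabled:
        # With thinking disabled, the whole output is the final answer.
        return "", text
    if text.startswith(start_tag):
        # Drop the opening tag when the model did emit it.
        text = text[len(start_tag):]
    if end_tag not in text:
        # No closing tag yet: everything so far is still reasoning.
        return text, ""
    reasoning, _, content = text.partition(end_tag)
    return reasoning, content
```

The flag matters because, without an opening tag, plain output and unfinished reasoning look identical; the caller has to say which mode the request used.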

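🔥 Hot Topics and Trends

Quantization accuracy and performance: Several Issues (#31843 SM90 FlashInfer FP8 compilation, #31844 DeepGEMM E8M0 accuracy, #31840 NVFP4 MoE numerical accuracy) point to compatibility and accuracy challenges with new hardware (H200, B200) and new quantization formats (FP8, NVFP4), indicating that cutting-edge quantization still needs hardening in complex scenarios such as MoE.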
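Deep MoE (Mixture of Experts) optimization: Many PRs target the MoE layer, including kernel performance improvements (PR #31830, #31832, #31837: fusion and computation optimizations for cutlass_moe), the ongoing MoE refactor (PR #31827), and adding the FlashInfer TRTLLM MoE path to the modular kernel (PR #31770). MoE models are clearly the current focus of performance work.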
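Expanding multimodal support: The community is actively extending LoRA support for multimodal models (PR #31656, #31812, #31825) and refactoring the audio model implementation (PR #31779). Video processing efficiency is also getting attention, with an RFC on video frame sparsification (Issue #31803).
Infrastructure and CI/CD: AMD CI stability fixes are a clear theme. There is also work on building CUDA 13 images (PR #31822) and updating the release process documentation (PR #31799), reflecting the project's effort to maintain multi-version, multi-platform support.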

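🛠️ Key Technical Changes

PR #31187: [ROCm] Add ROCTx range annotations (merged)
- Technical summary: Adds ROCTx (ROCm Tracer eXtensions) annotations to key kernels and operations in vLLM on the AMD ROCm platform.
- Impact: Lets AMD GPU users profile at a finer granularity with tools such as rocprof and Omniperf, including capturing kernel argument information, which substantially improves performance debugging and optimization on AMD platforms.
PR #31816: [ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend (pending merge)
- Technical summary: Fixes a severe accuracy problem in the AMD AITER Triton MLA backend caused by incorrect inheritance. Making the backend inherit from the correct base class (AiterMLABackend) restores computational accuracy.
- Impact: Safeguards output quality when using AMD's own high-performance attention backend, which matters greatly to AMD users who rely on it for best performance.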
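PR #31830: [Perf] Optimize cutlass moe problem size calculation (pending merge)
- Technical summary: Adds a dedicated kernel that quickly computes the problem sizes of the MoE operation, replacing the previous serial logic.
- Impact: Delivers about a 5.3% end-to-end throughput improvement and a 2.2% reduction in TTFT (time to first token), a notable optimization on the MoE inference path.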
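For readers unfamiliar with the term, the following is a serial reference sketch of what a per-expert "problem size" computation can look like. It assumes that each expert runs one GEMM of shape (M_e, N, K), where M_e is the number of token slots routed to expert e. That interpretation, the function name, and the parameters are assumptions based on how grouped GEMMs are usually parameterized; this is not the PR's kernel, which would compute the same counts in parallel on the GPU.

```python
from collections import Counter


def moe_problem_sizes(topk_ids, num_experts, n, k):
    """Return one (M_e, N, K) tuple per expert.

    topk_ids: per-token lists of selected expert ids (hypothetical layout).
    n, k:     output and reduction dimensions shared by all experts.
    """
    # Count how many token slots were routed to each expert.
    counts = Counter(expert for row in topk_ids for expert in row)
    # Experts that received no tokens get M_e = 0.
    return [(counts.get(expert, 0), n, k) for expert in range(num_experts)]


# Three tokens, top-2 routing, four experts.
sizes = moe_problem_sizes([[0, 2], [2, 1], [2, 0]], num_experts=4, n=8, k=16)
```

Because the counts depend on routing decisions made at run time, they have to be recomputed on every forward pass, which is why moving this step off a serial path can pay off.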
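PR #31739: [Spec Decode][UX] Add acceptance stats to vllm bench serve report (merged)
- Technical summary: When the server has speculative decoding enabled, the vllm bench serve benchmark report automatically gains a "Speculative Decoding" section showing key metrics such as the acceptance rate and per-position acceptance rates.
- Impact: Improves the user experience and the efficiency of performance evaluation with speculative decoding. Users can assess draft model quality and the benefit of speculative decoding without extra tools, making performance analysis more complete.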
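As a rough illustration of how such metrics can be derived from raw counters, consider the sketch below. The counter names and the exact definitions (in particular, counting one always-emitted token per verification step in the mean acceptance length) are assumptions for illustration and may not match what the report prints.

```python
def acceptance_stats(num_drafts, num_draft_tokens, accepted_per_pos):
    """Derive summary metrics from hypothetical speculative-decoding counters.

    num_drafts:       number of draft-and-verify steps
    num_draft_tokens: total draft tokens proposed across all steps
    accepted_per_pos: accepted-token count for each draft position
    """
    if num_drafts == 0 or num_draft_tokens == 0:
        return {"acceptance_rate": 0.0, "per_position": [],
                "mean_acceptance_length": 0.0}
    num_accepted = sum(accepted_per_pos)
    return {
        # Fraction of all proposed draft tokens that were accepted.
        "acceptance_rate": num_accepted / num_draft_tokens,
        # How often the draft token at each position was accepted.
        "per_position": [a / num_drafts for a in accepted_per_pos],
        # Tokens emitted per step: accepted drafts plus one verified token.
        "mean_acceptance_length": 1 + num_accepted / num_drafts,
    }


stats = acceptance_stats(num_drafts=100, num_draft_tokens=300,
                         accepted_per_pos=[80, 50, 20])
```

Per-position rates typically fall off with position, since a draft token can only be accepted if all earlier ones in the same step were accepted.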

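📈 Development Activity Observations

Contribution activity: The numbers of new PRs (65) and merged PRs (46) in a single day are very high, indicating a fast pace of community contribution and of review and merging by the core team.
Active AMD contributors: Several PRs in this period came from AMD-related contributors (such as AndreasKaratzas, vllmellm, rjrock), mainly fixing CI tests and functional bugs on the ROCm platform, which shows the AMD team's continued investment in platform compatibility.
Multimodal contributions: There were many PRs around LoRA support for multimodal models and audio model refactoring, showing community activity in extending vLLM's range of applications.
Efficient issue turnaround: 16 Issues were closed (including some long-standing ones) while 28 were opened, indicating a smooth issue-handling process.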

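💡 Issues Worth Watching

GLM-4.5-Air + MTP structured output errors (Issue #31858): Enabling MTP (multi-token prediction) together with structured output (JSON Schema) causes a conflict that can produce malformed output. It involves the interaction between speculative sampling and constrained decoding, making it a complex bug that affects user experience.
DeepSeek V3.2 MTP > 1 runtime error (Issue #31845): Serving DeepSeek-V3.2 with MTP > 1 on H200 fails. DeepSeek-V3.2 is an important new model, so the stability of its MTP support deserves attention.
Streaming quantization memory issue (Issue #31805): Streaming quantization raises peak memory usage during model loading and post-processing, which affects resource efficiency in large-scale model deployment.
DeepGEMM E8M0 accuracy issue on B200 (Issue #31844): Accuracy drops when using DeepGEMM with the E8M0 format on the Blackwell architecture (B200), a hint that adapting new compute libraries to new hardware may carry hidden risks.
Video frame sparsification RFC (Issue #31803): Proposes an adaptive keyframe extraction algorithm for long video inputs to address overly long sequences in multimodal long-video inference. It is a forward-looking proposal aimed at a real application pain point.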
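The report does not describe the RFC's algorithm, so the following is only a generic sketch of pixel-level frame sparsification: a frame is kept when its mean absolute pixel difference from the last kept frame exceeds a threshold. The function name, the flat-list frame representation, and the threshold are all assumptions for illustration.

```python
def sparsify_frames(frames, threshold):
    """Return indices of frames to keep.

    frames:    list of equal-length, non-empty sequences of pixel intensities
    threshold: minimum mean absolute difference from the last kept frame
    """
    if not frames:
        return []
    kept = [0]            # Always keep the first frame.
    reference = frames[0]
    for index in range(1, len(frames)):
        frame = frames[index]
        diff = sum(abs(a - b) for a, b in zip(frame, reference)) / len(frame)
        if diff > threshold:
            kept.append(index)
            reference = frame  # Compare later frames against this one.
    return kept


frames = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],      # nearly identical to frame 0
    [10, 10, 10, 10],  # scene change
    [10, 10, 10, 11],  # nearly identical to frame 2
    [0, 0, 0, 0],      # scene change back
]
kept_indices = sparsify_frames(frames, threshold=2.0)
```

Comparing against the last kept frame rather than the previous frame prevents slow drift from accumulating unnoticed across many near-duplicate frames.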

📋 Appendix: Detailed Data Lists

New Issues

#31858 [Bug]: GLM-4.5-Air + MTP causes structural output errors — bug — by ruokee (created: 2026-01-07 10:23 (UTC+8))
#31856 [Bug]: Text degeneration / repetition loops with MiniMax-M2.1-NVFP4 on v0.14.0rc1.dev308 — no labels — by ktsaou (created: 2026-01-07 09:36 (UTC+8))
#31843 [Bug]: SM90 FlashInfer with FP8 kv cache fails to compile — bug — by mgoin (created: 2026-01-07 07:20 (UTC+8))
#31848 [RFC]: Native Weight Syncing APIs — RFC — by ahao-anyscale (created: 2026-01-07 08:37 (UTC+8))
#31845 [Bug]: [H200] DeepSeek V3.2 MTP > 1 run into error (FLASHMLA_SPARSE backend) — bug — by jhaotingc (created: 2026-01-07 07:39 (UTC+8))
#31844 [Bug]: DeepEP LL Accuracy Issue with DeepGEMM E8M0 on B200 — bug — by robertgshaw2-redhat (created: 2026-01-07 07:27 (UTC+8))
#31840 [Bug]: NVFP4 Flashinfer CuteDSL MoE + DeepEP ll + VLLM_MOE_DP_CHUNK_SIZE=1024 numerical accuracy issue — bug,nvidia — by mxz297 (created: 2026-01-07 06:54 (UTC+8))
#31805 [Bug]: streaming quantization cause higher peak memory used during model loading and post process — bug — by yma11 (created: 2026-01-06 20:00 (UTC+8))
#31831 [Feature]: Add support for ROCTx — feature request — by AidenPetersen (created: 2026-01-07 04:28 (UTC+8))
#31792 [Bug]: VLLM 0.13.0 / Mistral-Small-3.2-24B-Instruct-2506: MistralTokenizer object has no attribute ‘convert_tokens_to_ids’ — bug — by dgouju (created: 2026-01-06 16:47 (UTC+8))
#31824 [Feature]: Remove parameter re-registration and names from kernel abstraction — help wanted,feature request — by ProExpertProg (created: 2026-01-07 02:55 (UTC+8))
#31819 [New Model]: Support Florence-2 on Engine 1 After Engine 0 Removal — new-model — by docee (created: 2026-01-07 00:46 (UTC+8))
#31823 [Feature]: Use kernel abstraction for fp4 scaled mm — help wanted,feature request — by ProExpertProg (created: 2026-01-07 01:16 (UTC+8))
#31821 [Feature]: Change min_capability to is_supported in the kernel abstraction — feature request — by ProExpertProg (created: 2026-01-07 01:13 (UTC+8))
#31818 [Feature]: Refactor W8A8 Block Linear to use the kernel abstraction — feature request — by ProExpertProg (created: 2026-01-07 00:42 (UTC+8))
#31810 [CI Failure]: — ci-failure — by robertgshaw2-redhat (created: 2026-01-06 22:38 (UTC+8))
#31809 [Feature]: Add a default logger select write to file — feature request — by lengrongfu (created: 2026-01-06 22:31 (UTC+8))
#31794 [Feature]: Feasibility of Layer-wise Intervention (e.g., KV Transfer) compatible with V1 CUDA Graph architecture? — feature request — by leepeter008 (created: 2026-01-06 16:54 (UTC+8))
#31803 [RFC]: Video Frame Sparsification via Pixel-Level Similarity for Efficient Multimodal Long-Video Inference — RFC — by rayleeku (created: 2026-01-06 19:38 (UTC+8))
#31782 [Feature]: Support compressed-tensors NVFP4 quantization for MoE models (Nemotron-H non-gated MoE) — feature request — by Firworksyt (created: 2026-01-06 14:43 (UTC+8))
#31804 [Bug]: Pydantic model’s description is not taken into account when generating structured outputs — bug — by venki-lfc (created: 2026-01-06 19:39 (UTC+8))
#31766 [Docs] Feedback for /en/latest/contributing/profiling/ — documentation — by cyk2018 (created: 2026-01-06 11:15 (UTC+8))
#31798 [Bug]: The Qwen3-VL-30B-A3B-Thinking model deployed by vllm is not responding to requests. — bug — by haifzhu (created: 2026-01-06 17:20 (UTC+8))
#31789 [Installation]: ModuleNotFoundError: No module named ‘vllm._C’ when run demo basic.py — installation — by sunbinbin1991 (created: 2026-01-06 15:54 (UTC+8))
#31787 [Usage]: How to set different attention backend for prefill and decode phases? — usage — by stormchasingg (created: 2026-01-06 15:33 (UTC+8))
#31767 [Bug]: Lora Bug : Is it more appropriate to replace type() with isinstance() in the “can_replace_layer” class method — bug — by ZT-AIA (created: 2026-01-06 11:49 (UTC+8))
#31771 [Performance]: VLLM performance degradation when comparing with Tensor RT — performance — by akasshdeep (created: 2026-01-06 12:45 (UTC+8))
#31768 [Bug]: Run RedHatAI/Qwen3-30B-A3B-FP8-block failed — bug — by jessiewiswjc (created: 2026-01-06 11:55 (UTC+8))

Closed Issues

#31319 [Bug]: GLM-4.7-FP8 missing beginning tag — bug — by Nemo-G (closed: 2026-01-06 21:53 (UTC+8))
#19407 [RFC]: Support automatic max context length via --max-model-len auto — RFC,stale — by yeqcharlotte (closed: 2026-01-07 09:02 (UTC+8))
#31044 [CI Failure]: Blackwell Fusion Tests — help wanted,torch.compile,ci-failure — by robertgshaw2-redhat (closed: 2026-01-07 08:31 (UTC+8))
#29109 [Bug]: Speculative decoding fails with Llama4-eagle — bug — by divakar-amd (closed: 2026-01-07 05:47 (UTC+8))
#30818 [Bug]: Unpickling MediaWithBytes dataclass leads to RecursionError — bug — by eicherseiji (closed: 2026-01-07 04:11 (UTC+8))
#31792 [Bug]: VLLM 0.13.0 / Mistral-Small-3.2-24B-Instruct-2506: MistralTokenizer object has no attribute ‘convert_tokens_to_ids’ — bug — by dgouju (closed: 2026-01-06 20:08 (UTC+8))
#27300 [Bug]: vLLM produces invalid UTF-8 tokens and “�” replacement characters when returning logprobs — bug — by yinggeh (closed: 2026-01-07 01:07 (UTC+8))
#30571 [Bug]: DeepGEMM MoE generating tons of warnings — bug — by MatthewBonanni (closed: 2026-01-07 00:25 (UTC+8))
#31810 [CI Failure]: — ci-failure — by robertgshaw2-redhat (closed: 2026-01-06 22:39 (UTC+8))
#31794 [Feature]: Feasibility of Layer-wise Intervention (e.g., KV Transfer) compatible with V1 CUDA Graph architecture? — feature request — by leepeter008 (closed: 2026-01-06 22:22 (UTC+8))
#26605 [Bug]: DeepSeek v3.2 hits IMA on DP/EP setup — bug — by MatthewBonanni (closed: 2026-01-06 22:18 (UTC+8))
#31648 [Bug][CPU Backend]: torch.compile fails for MoE models on CPU backend with -dp 2 — bug,cpu — by kzwrime (closed: 2026-01-06 20:06 (UTC+8))
#30374 [Feature][CPU Backend]: Add Paged Attention Benchmarks for CPU backend — feature request,cpu — by fadara01 (closed: 2026-01-06 18:57 (UTC+8))
#28912 [RFC]: Support Context Pipeline Parallelism(CPP). — RFC — by pisceskkk (closed: 2026-01-06 17:37 (UTC+8))
#31789 [Installation]: ModuleNotFoundError: No module named ‘vllm._C’ when run demo basic.py — installation — by sunbinbin1991 (closed: 2026-01-06 16:04 (UTC+8))
#31768 [Bug]: Run RedHatAI/Qwen3-30B-A3B-FP8-block failed — bug — by jessiewiswjc (closed: 2026-01-06 12:15 (UTC+8))

New PRs

#31854 refactor: refatcor find_loaded_library — nvidia — by tom-zju (created: 2026-01-07 09:17 (UTC+8))
#31857 [Bugfix] manully free encode cache of waiting request to avoid potential dead … — v1 — by frelam (created: 2026-01-07 10:21 (UTC+8))
#31826 [Perf] Slight improvement of ITL with multiple GPUs — no labels — by access2rohit (created: 2026-01-07 03:13 (UTC+8))
#31779 [Refactor] GLM-ASR Modeling — no labels — by JaredforReal (created: 2026-01-06 14:26 (UTC+8))
#31855 Change warning in get_current_vllm_config to report caller’s line number — ready — by tlrmchlsmth (created: 2026-01-07 09:28 (UTC+8))
#31837 [Perf] Fuse stride preparation for NVFP4 cutlass_moe — performance,ready,nvidia — by mgoin (created: 2026-01-07 05:50 (UTC+8))
#31838 [1/2][lmcache connector] clean up lmcache multi-process adapter — ready,kv-connector — by ApostaC (created: 2026-01-07 06:09 (UTC+8))
#31778 [BugFix] fixing stream interval > 1 will cause tool call bug — v1 — by MrIceCreamMan (created: 2026-01-06 14:09 (UTC+8))
#31791 [Bugfix] Use isinstance() instead of type() in LoRA can_replace_layer — no labels — by majiayu000 (created: 2026-01-06 16:40 (UTC+8))
#31814 [Bugfix] Inject JSON schema descriptions into prompt for structured outputs — frontend — by ricky-chaoju (created: 2026-01-06 23:29 (UTC+8))
#31853 Add GPU memory usage warning system — documentation,v1 — by Dedulus (created: 2026-01-07 09:16 (UTC+8))
#31841 [Bugfix] Fix race condition in async-scheduling for vlm model — v1 — by tianshu-Michael-yu (created: 2026-01-07 06:58 (UTC+8))
#31852 [Attention] Full CG support for llama4 and remove use of deprecated properties — v1,llama — by LucasWilkinson (created: 2026-01-07 09:15 (UTC+8))
#31851 [Attention] Full CG support for llama4 and remove use of deprecated properties — documentation,performance,new-model,rocm,structured-output,frontend,tpu,speculative-decoding,needs-rebase,ci/build — by LucasWilkinson (created: 2026-01-07 09:12 (UTC+8))
#31847 [Model] Add Grok-2 — documentation — by dangoldbj (created: 2026-01-07 08:27 (UTC+8))
#31829 [ROCm][CI] Fix plugin tests (2 GPUs) failures on ROCm and removing VLLM_FLOAT32_MATMUL_PRECISION from all ROCm tests — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-01-07 04:12 (UTC+8))
#31850 [Attention][3/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties — ready,v1 — by LucasWilkinson (created: 2026-01-07 08:56 (UTC+8))
#31849 [CI] Fix weight mapping test for transformers v5 tied weights — multi-modality — by AndreasKaratzas (created: 2026-01-07 08:44 (UTC+8))
#31799 [Doc] Update release docs — documentation — by DarkLight1337 (created: 2026-01-06 17:34 (UTC+8))
#31846 Optimize graph capture size — no labels — by jiahanc (created: 2026-01-07 08:08 (UTC+8))
#31817 [Bugfix] Handle mistral tokenizer in get_hf_processor — ready,v1,multi-modality — by DarkLight1337 (created: 2026-01-07 00:40 (UTC+8))
#31785 [Fix] Use torch.empty for output in attention+quant fusion — ready — by elvischenv (created: 2026-01-06 15:11 (UTC+8))
#31835 [ROCm][CI] Pinning timm lib version to fix ImportError in Multi-Modal Tests (Nemotron) — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-01-07 05:36 (UTC+8))
#31820 [ROCm][CI] Fix ModernBERT token classification test numerical accuracy on ROCm — rocm,ready — by AndreasKaratzas (created: 2026-01-07 00:52 (UTC+8))
#31842 [4/n] Migrate pos_encoding sampler and fused_qknorm_rope to libtorch stable ABI — ci/build,cpu,nvidia — by mikaylagawarecki (created: 2026-01-07 07:13 (UTC+8))
#31781 [Kernel] Support bias type in grouped_topk kernel — performance,deepseek,nvidia — by xyang16 (created: 2026-01-06 14:31 (UTC+8))
#31828 [Perf] Add opt-in SM100 Oink RMSNorm custom-op path — no labels — by Laurawly (created: 2026-01-07 04:00 (UTC+8))
#31832 [Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe — performance,nvidia — by mgoin (created: 2026-01-07 04:52 (UTC+8))
#31839 [responsesAPI] get reasoning token metrics for simpleContext — frontend,gpt-oss — by qandrew (created: 2026-01-07 06:18 (UTC+8))
#31833 [ROCm][CI] v1 cpu offloading attention backend fix — rocm,v1 — by AndreasKaratzas (created: 2026-01-07 05:08 (UTC+8))
#31836 [responsesAPI] fix incomplete_messages for simple/parsable context — frontend — by qandrew (created: 2026-01-07 05:37 (UTC+8))
#31834 [CI/Build] Enable test_kv_cache_events_dp for AMD — rocm,v1 — by rjrock (created: 2026-01-07 05:19 (UTC+8))
#31770 [Feat] Add Flashinfer TRTLLM MOE to Modular Kernel — needs-rebase,v1,nvidia — by jiahanc (created: 2026-01-06 12:29 (UTC+8))
#31830 [Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement — ready,nvidia — by yewentao256 (created: 2026-01-07 04:20 (UTC+8))
#31827 [MoE Refactor][17/N] Apply Refactor to Bf16 — no labels — by zyongye (created: 2026-01-07 03:46 (UTC+8))
#31808 Report error log after vllm bench serve — performance,ready — by elvircrn (created: 2026-01-06 21:16 (UTC+8))
#31825 [Model] Enable LoRA support for tower and connector in DotsOCR — documentation — by ShaanveerS (created: 2026-01-07 03:10 (UTC+8))
#31801 [Quantization][Refactor] Move CPU GPTQ kernel into MP linear — ready,cpu — by bigPYJ1151 (created: 2026-01-06 18:47 (UTC+8))
#31807 [NemotronH] Use ReplicatedLinear for fc1_latent_proj — ready — by roikoren755 (created: 2026-01-06 20:53 (UTC+8))
#31775 [docker] A follow-up patch to fix #30913: [docker] install cuda13 version of lmcache and nixl — ci/build,kv-connector,nvidia — by wangshangsam (created: 2026-01-06 13:32 (UTC+8))
#31812 Enable LoRA support for tower and connector in Mistral and Voxtral — documentation,frontend,qwen,deepseek — by Anexdeus (created: 2026-01-06 22:52 (UTC+8))
#31774 [Attention][2/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties — ready,v1 — by LucasWilkinson (created: 2026-01-06 12:57 (UTC+8))
#31822 [CI] Add CUDA 13 nightly containers — ci/build,nvidia — by csahithi (created: 2026-01-07 01:14 (UTC+8))
#31816 [ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend — rocm,v1 — by vllmellm (created: 2026-01-07 00:03 (UTC+8))
#31784 [Model] rename use_pad_token to use_sep_token — documentation,frontend,ready — by noooop (created: 2026-01-06 14:57 (UTC+8))
#31786 [Chore] Try remove init_cached_hf_modules — tpu,v1,ready-run-all-tests — by DarkLight1337 (created: 2026-01-06 15:28 (UTC+8))
#31815 [Bugfix] Fix TorchAO quantization bugs and add --torchao-config CLI support — no labels — by jwpark33 (created: 2026-01-06 23:34 (UTC+8))
#31811 [KVConnector] Handle KV events from multiple connectors — v1,kv-connector — by hickeyma (created: 2026-01-06 22:44 (UTC+8))
#31813 [Bugfix] Add TorchAO CLI Support and Fix Tensor Metadata Preservation — documentation,new-model,ci/build,v1,multi-modality,llama,qwen — by jwpark33 (created: 2026-01-06 23:23 (UTC+8))
#31806 [Bugfix]: Fix cross attention backend selection for Turing GPU — ready — by Isotr0py (created: 2026-01-06 20:45 (UTC+8))
#31777 [LoRA]Disable linear LoRA kernel PDL — documentation,ready — by jeejeelee (created: 2026-01-06 13:58 (UTC+8))
#31788 [Frontend] Support GLM-4.5 / GLM-4.7 with enable_thinking: false — ready,deepseek — by chaunceyjiang (created: 2026-01-06 15:51 (UTC+8))
#31790 [Bugfix]: avoid overriding audio/text kwargs (Qwen3-Omni) — ready,qwen — by Jzz1943 (created: 2026-01-06 16:32 (UTC+8))
#31802 [Core][NIXL] Support HMA+NixlConnector — documentation,performance,structured-output,frontend,tpu,needs-rebase,ci/build,v1,multi-modality,tool-calling — by NickLucche (created: 2026-01-06 19:04 (UTC+8))
#31796 [Misc] Implement TokenizerLike.convert_tokens_to_ids — ready — by DarkLight1337 (created: 2026-01-06 17:05 (UTC+8))
#31773 [Attention][1/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties — ready,v1,nvidia — by LucasWilkinson (created: 2026-01-06 12:53 (UTC+8))
#31793 [Chore] Cleanup mem_utils.py — ready — by DarkLight1337 (created: 2026-01-06 16:48 (UTC+8))
#31776 [Bugfix][CI/Build] Fix failing pooling models test due to Triton kernel accuracy diff — ready — by Isotr0py (created: 2026-01-06 13:56 (UTC+8))
#31800 [Doc] Fix format of multimodal_inputs.md — documentation,ready — by BlankRH (created: 2026-01-06 17:45 (UTC+8))
#31797 [CI] Increase the MTEB_EMBED_TOL threshold to 5e-4. — ready — by noooop (created: 2026-01-06 17:12 (UTC+8))
#31780 [Misc] Use deprecated for seed_everything — documentation,performance,ready — by DarkLight1337 (created: 2026-01-06 14:29 (UTC+8))
#31783 [Chore] Remove more V0 dead code from sequence.py — ready — by DarkLight1337 (created: 2026-01-06 14:54 (UTC+8))
#31795 [ci] fix timm import error — ci/build — by andyxning (created: 2026-01-06 16:54 (UTC+8))
#31772 [Model] Enable LoRA support for LLaVA family — documentation — by ppppqp (created: 2026-01-06 12:50 (UTC+8))
#31769 Add more ci for moe refactor b200 — no labels — by robertgshaw2-redhat (created: 2026-01-06 12:05 (UTC+8))

Merged PRs

#29493 [Log] add log about gpu worker init snapshot and requested memory — ready,ci/build,v1 — by andyxning (merged: 2026-01-07 01:32 (UTC+8))
#31656 [Model] Enable LoRA support for PaliGemma — documentation,ready — by A1c0r-Z (merged: 2026-01-07 10:09 (UTC+8))
#31838 [1/2][lmcache connector] clean up lmcache multi-process adapter — ready,kv-connector — by ApostaC (merged: 2026-01-07 10:02 (UTC+8))
#31554 [Misc][BE] Type coverage for vllm/compilation [1/3] — ready,nvidia — by Lucaskabela (merged: 2026-01-07 09:37 (UTC+8))
#29197 [Frontend] Implement robust video frame recovery for corrupted videos — documentation,performance,ready,multi-modality — by vSeamar (merged: 2026-01-07 09:13 (UTC+8))
#31829 [ROCm][CI] Fix plugin tests (2 GPUs) failures on ROCm and removing VLLM_FLOAT32_MATMUL_PRECISION from all ROCm tests — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-01-07 09:12 (UTC+8))
#31183 [CI] Add warmup run in test_fusion_attn — ready — by angelayi (merged: 2026-01-07 08:31 (UTC+8))
#31817 [Bugfix] Handle mistral tokenizer in get_hf_processor — ready,v1,multi-modality — by DarkLight1337 (merged: 2026-01-07 07:46 (UTC+8))
#31835 [ROCm][CI] Pinning timm lib version to fix ImportError in Multi-Modal Tests (Nemotron) — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-01-07 07:23 (UTC+8))
#31820 [ROCm][CI] Fix ModernBERT token classification test numerical accuracy on ROCm — rocm,ready — by AndreasKaratzas (merged: 2026-01-07 07:21 (UTC+8))
#31739 [Spec Decode][UX] Add acceptance stats to vllm bench serve report — performance,ready — by MatthewBonanni (merged: 2026-01-07 05:21 (UTC+8))
#31808 Report error log after vllm bench serve — performance,ready — by elvircrn (merged: 2026-01-07 04:24 (UTC+8))
#31191 Fix RecursionError in MediaWithBytes unpickling — ready,multi-modality — by nrghosh (merged: 2026-01-07 04:11 (UTC+8))
#31801 [Quantization][Refactor] Move CPU GPTQ kernel into MP linear — ready,cpu — by bigPYJ1151 (merged: 2026-01-07 03:10 (UTC+8))
#31807 [NemotronH] Use ReplicatedLinear for fc1_latent_proj — ready — by roikoren755 (merged: 2026-01-07 00:00 (UTC+8))
#28895 [ROCm][CI] Fix tests/compile unit tests — rocm,ready,nvidia — by charlifu (merged: 2026-01-07 02:50 (UTC+8))
#29821 [Perf] Async Scheduling + Speculative Decoding + Structured Outputs — structured-output,ready,v1 — by benchislett (merged: 2026-01-07 02:50 (UTC+8))
#31055 [Bugfix] Fix GLM-4 MoE router logits dtype for data parallel chunking — ready — by ReinforcedKnowledge (merged: 2026-01-07 01:57 (UTC+8))
#20610 make 500: InternalServerError more informative — ready,stale,v1 — by guicho271828 (merged: 2026-01-07 01:36 (UTC+8))
#31722 [PERF] Speed-up of GDN attention decode part (Qwen3-Next) — performance,ready,qwen — by vadiklyutiy (merged: 2026-01-07 01:32 (UTC+8))
#31774 [Attention][2/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties — ready,v1 — by LucasWilkinson (merged: 2026-01-07 01:32 (UTC+8))
#31571 [Quantization][MoE] remove unused ep logic from moe marlin — ready — by jinzhen-lin (merged: 2026-01-07 01:07 (UTC+8))
#31784 [Model] rename use_pad_token to use_sep_token — documentation,frontend,ready — by noooop (merged: 2026-01-06 22:16 (UTC+8))
#31593 [MoE Refactor][14/N] Clean Up FI Quant Config Smuggling — ready,llama,nvidia — by robertgshaw2-redhat (merged: 2026-01-06 23:47 (UTC+8))
#31759 [MoE Refactor] Add Temporary Integration Tests - H100/B200 — ready,ci/build,nvidia — by robertgshaw2-redhat (merged: 2026-01-06 23:34 (UTC+8))
#31806 [Bugfix]: Fix cross attention backend selection for Turing GPU — ready — by Isotr0py (merged: 2026-01-06 23:15 (UTC+8))
#31777 [LoRA]Disable linear LoRA kernel PDL — documentation,ready — by jeejeelee (merged: 2026-01-06 23:12 (UTC+8))
#30890 [UX] Add -ep shorthand for --enable-expert-parallel — ready,v1 — by mgoin (merged: 2026-01-06 11:13 (UTC+8))
#31788 [Frontend] Support GLM-4.5 / GLM-4.7 with enable_thinking: false — ready,deepseek — by chaunceyjiang (merged: 2026-01-06 21:53 (UTC+8))
#31738 [Models]: Use MMEncoderAttention for MoonViT — ready — by Isotr0py (merged: 2026-01-06 16:00 (UTC+8))
#31790 [Bugfix]: avoid overriding audio/text kwargs (Qwen3-Omni) — ready,qwen — by Jzz1943 (merged: 2026-01-06 20:59 (UTC+8))
#31796 [Misc] Implement TokenizerLike.convert_tokens_to_ids — ready — by DarkLight1337 (merged: 2026-01-06 20:08 (UTC+8))
#31650 [Bugfix] Fix torch.compile error for DP + MoE on CPU Backend — ready — by kzwrime (merged: 2026-01-06 20:06 (UTC+8))
#31773 [Attention][1/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties — ready,v1,nvidia — by LucasWilkinson (merged: 2026-01-06 20:05 (UTC+8))
#31793 [Chore] Cleanup mem_utils.py — ready — by DarkLight1337 (merged: 2026-01-06 19:56 (UTC+8))
#31776 [Bugfix][CI/Build] Fix failing pooling models test due to Triton kernel accuracy diff — ready — by Isotr0py (merged: 2026-01-06 16:44 (UTC+8))
#31800 [Doc] Fix format of multimodal_inputs.md — documentation,ready — by BlankRH (merged: 2026-01-06 19:30 (UTC+8))
#31797 [CI] Increase the MTEB_EMBED_TOL threshold to 5e-4. — ready — by noooop (merged: 2026-01-06 19:30 (UTC+8))
#31780 [Misc] Use deprecated for seed_everything — documentation,performance,ready — by DarkLight1337 (merged: 2026-01-06 19:29 (UTC+8))
#31720 [cpu][bench] Add CPU paged attention benchmarks — performance,ready,cpu — by fadara01 (merged: 2026-01-06 18:57 (UTC+8))
#31783 [Chore] Remove more V0 dead code from sequence.py — ready — by DarkLight1337 (merged: 2026-01-06 18:25 (UTC+8))
#31714 [Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in eagle.py — rocm,speculative-decoding,ready,v1 — by vllmellm (merged: 2026-01-06 15:53 (UTC+8))
#30837 [Doc] Show that use_audio_in_video is supported in docs — documentation,ready,qwen — by DarkLight1337 (merged: 2026-01-06 15:27 (UTC+8))
#31177 [Bugfix][Hardware][AMD] Fix exception types in AITER MLA FP8 check — rocm,ready — by c0de128 (merged: 2026-01-06 14:11 (UTC+8))
#31764 [CI] Fix CPU MM Processor Test — ready — by robertgshaw2-redhat (merged: 2026-01-06 12:22 (UTC+8))
#31042 [Bugfix] Add init_workspace_manager to moe kernel benchmarks — performance,nvidia — by mgoin (merged: 2026-01-06 11:14 (UTC+8))

PRs Closed Without Merging

#23076 Update supported models — documentation,stale — by bvrockwell (closed: 2026-01-07 10:32 (UTC+8))
#21975 [Build] Add FlashInfer wheel build to release pipeline — documentation,needs-rebase,ci/build,stale — by mgoin (closed: 2026-01-07 10:28 (UTC+8))
#31851 [Attention] Full CG support for llama4 and remove use of deprecated properties — documentation,performance,new-model,rocm,structured-output,frontend,tpu,speculative-decoding,needs-rebase,ci/build — by LucasWilkinson (closed: 2026-01-07 09:13 (UTC+8))
#31673 Test1 — no labels — by AlexLI-hub (closed: 2026-01-07 07:46 (UTC+8))
#30146 [WIP] Nightly signal on torch — rocm,needs-rebase,ci/build,cpu,nvidia — by atalman (closed: 2026-01-07 07:26 (UTC+8))
#31243 [WIP] Adopt Dockerfile to build nightly version — ci/build — by atalman (closed: 2026-01-07 07:25 (UTC+8))
#26782 [Fix] Avoid UserWarning when creating tensors from base64 embeddings — documentation — by mmangkad (closed: 2026-01-07 03:12 (UTC+8))
#27501 [Refactor][MLA] Move prefill/decode implementation into MLAAttention layer — rocm,ready,needs-rebase,v1,nvidia — by therealnaveenkamal (closed: 2026-01-07 01:34 (UTC+8))
#31543 [Model] Add support for Nemotron-Flash-3B — documentation,new-model,v1 — by hhzhang16 (closed: 2026-01-07 01:23 (UTC+8))
#31723 [Core] Optimize expensive deepcopy in GPU model runner — v1 — by GOavi101 (closed: 2026-01-07 01:04 (UTC+8))
#31813 [Bugfix] Add TorchAO CLI Support and Fix Tensor Metadata Preservation — documentation,new-model,ci/build,v1,multi-modality,llama,qwen — by jwpark33 (closed: 2026-01-06 23:23 (UTC+8))
#31795 [ci] fix timm import error — ci/build — by andyxning (closed: 2026-01-06 17:37 (UTC+8))
#31765 [BUGFIX] free encode inputs in _update_after_schedule to avoid dead lock in sc… — v1 — by frelam (closed: 2026-01-06 14:06 (UTC+8))
#31769 Add more ci for moe refactor b200 — no labels — by robertgshaw2-redhat (closed: 2026-01-06 12:05 (UTC+8))
#30192 [Frontend] Add MCP tool streaming support to Responses API — frontend,gpt-oss — by daniel-salib (closed: 2026-01-06 12:38 (UTC+8))
#31717 [Perf] GLM ASR — no labels — by JaredforReal (closed: 2026-01-06 10:58 (UTC+8))