vLLM Development Activity Report - 2026-01-06
Time window: 2026-01-06 10:53 (UTC+8) to 2026-01-07 10:53 (UTC+8). Statistics: 28 new issues | 16 closed issues | 65 new PRs | 46 merged PRs | 16 PRs closed without merging
📊 Daily Development Status Summary
Over the 24 hours from January 6 to 7, 2026, the vLLM project maintained a very high level of development activity, with 28 new issues and 65 new PRs opened and 46 PRs merged. Development focused on performance optimization (especially MoE kernels), platform compatibility (notably AMD ROCm CI stability fixes), and expanding multimodal model support. The community also responded and shipped fixes quickly for bugs in quantization (FP8/NVFP4) and multi-token prediction (MTP).
🎯 AMD/ROCm Ecosystem Updates
This cycle's AMD/ROCm work focused on CI stability fixes, rounding out the profiling toolchain, and accuracy fixes in platform-specific kernels.
- Feature request: ROCTx support (Issue #31831)
  - Content: A user requested ROCTx support (the AMD analogue of NVIDIA's NVTX) on the ROCm platform, so that the PyTorch Profiler can capture kernel shape data and produce more detailed profiling reports.
  - Analysis: This shows that community users are deploying vLLM on AMD GPUs and expect a profiling and debugging toolchain on par with the NVIDIA platform, a sign of the AMD ecosystem's growing maturity.
- Bug fix: AITER MLA backend accuracy regression (PR #31816)
  - Content: Fixes a severe accuracy drop (GSM8K accuracy falling from ~96% to nearly zero) when running DeepSeek-V3 with the `ROCM_AITER_TRITON_MLA` attention backend on AMD ROCm.
  - Analysis: This fix is critical for the correctness of AMD's own attention backend implementation (AITER). The root cause was an inheritance error: the backend did not inherit from the correct base class (`AiterMLABackend`), a key bug in a platform-specific implementation.
- CI/infrastructure: multiple ROCm test fixes (PR #31829, #31835, #31820)
  - Content:
    - Removed the `VLLM_FLOAT32_MATMUL_PRECISION` environment variable setting that could cause ROCm test failures, and pinned the `albumentations` library version.
    - Pinned the `timm` library version to resolve an import error in the Nemotron multimodal model tests.
    - Relaxed the accuracy tolerance (`rtol`) of the ModernBERT token classification test on ROCm to accommodate inherent numerical differences between platforms.
  - Analysis: This work keeps the AMD CI pipeline running stably. The frequent dependency pinning and tolerance adjustments show that keeping tests consistent across platforms (CUDA vs. ROCm) is an ongoing challenge and a necessary cost of multi-architecture support.
- Feature implementation: ROCTx support (PR #31187) (merged)
  - Content: Implements `ROCTx` range annotations on AMD GPUs.
  - Analysis: This PR directly addresses the issue above, closing a gap in the AMD profiling toolchain and helping developers optimize vLLM performance on MI-series GPUs.
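What range annotations like NVTX/ROCTx buy you can be illustrated with PyTorch's generic profiler API. This is a hand-written sketch; the `attention_stub` function is a toy stand-in, not vLLM code, and the real PR annotates actual ROCm kernels:

```python
import torch
from torch.profiler import profile, record_function

def attention_stub(q, k):
    # Toy stand-in for a kernel; the real change annotates vLLM's ROCm kernels.
    return q @ k.transpose(-1, -2)

q = torch.randn(2, 8, 16)
k = torch.randn(2, 8, 16)

with profile(record_shapes=True) as prof:
    # record_function opens a named range, analogous to an NVTX/ROCTx range,
    # so the profiler can attribute time and input shapes to this region.
    with record_function("attention_stub"):
        attention_stub(q, k)

names = [evt.key for evt in prof.key_averages()]
print("attention_stub" in names)
```

With `record_shapes=True`, `prof.key_averages().table()` then reports input shapes per named range, which is exactly the capability the ROCTx request is after on AMD.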
💬 High-Activity Discussion Analysis
- Issue #31792: VLLM 0.13.0 / Mistral-Small-3.2-24B-Instruct-2506: MistralTokenizer object has no attribute ‘convert_tokens_to_ids’
  - Core issue: When running a particular Mistral model, vLLM's custom tokenizer wrapper class lacked the `convert_tokens_to_ids` method, breaking compatibility with the Hugging Face Transformers library (especially multimodal processors).
  - Summary: The discussion highlights the compatibility challenges vLLM faces when wrapping the HF ecosystem, particularly for multimodal and customized tokenizer scenarios. Maintainers resolved the problem through quick iterative fixes.
  - Current status: Closed, fixed via two associated PRs.
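The shape of the fix, exposing an HF-style method on a wrapper so HF processors can call it, can be sketched minimally. The `TokenizerWrapper` class and its vocabulary here are hypothetical; the actual change (PR #31796) implements `convert_tokens_to_ids` on vLLM's `TokenizerLike` interface:

```python
class TokenizerWrapper:
    """Hypothetical tokenizer wrapper exposing an HF-compatible method.

    HF multimodal processors call convert_tokens_to_ids on whatever
    tokenizer object they receive, so the wrapper must provide it.
    """

    def __init__(self, vocab):
        self._vocab = vocab  # token string -> token id

    def convert_tokens_to_ids(self, tokens):
        # HF accepts either a single token or a list of tokens.
        if isinstance(tokens, str):
            return self._vocab[tokens]
        return [self._vocab[t] for t in tokens]

tok = TokenizerWrapper({"<s>": 1, "hello": 42})
print(tok.convert_tokens_to_ids("hello"))           # 42
print(tok.convert_tokens_to_ids(["<s>", "hello"]))  # [1, 42]
```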
- **Issue #31319: GLM-4.7-FP8 missing beginning `<think>` tag** (closed this cycle)
  - Core issue: With the thinking (reasoning) feature enabled, GLM-4.7's streamed output was missing the opening `<think>` tag, causing front-end parsing errors.
  - Views and progress:
    - Community users: Provided a temporary patch and discussed the possibility of using the `deepseek_r1` or `step3` parser as an alternative.
    - Maintainers: Ultimately landed PR #31788, which points the GLM-4.5/4.7 reasoning parser configuration at the `deepseek_v3` parser and sets a default `enable_thinking` parameter.
  - Point of contention: Whether to create a dedicated new parser (e.g., `Glm47AppendThinkReasoningParser`) or to reuse an existing parser with adjusted configuration.
  - Conclusion: The configuration-reuse approach was adopted; since the root cause is a flaw in the GLM-4.5 reasoning parser implementation, the current fix is considered a stopgap.
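The kind of temporary client-side patch discussed can be imagined as a small stream filter that re-inserts the missing opening tag before the chunks reach the front-end parser. This is a hypothetical sketch; `ensure_think_prefix` is not taken from the actual patch or from PR #31788:

```python
def ensure_think_prefix(chunks):
    """Prepend a missing <think> tag to a streamed reasoning response.

    Hypothetical workaround for a model that emits reasoning ending in
    </think> but omits the opening tag, which breaks downstream parsers.
    """
    first = True
    for chunk in chunks:
        if first:
            first = False
            if not chunk.lstrip().startswith("<think>"):
                yield "<think>" + chunk
                continue
        yield chunk

stream = ["reasoning step 1 ", "step 2</think>", "final answer"]
print("".join(ensure_think_prefix(stream)))
# <think>reasoning step 1 step 2</think>final answer
```

The maintainers' configuration-level fix is preferable to such stream patching, since it repairs the output at the parser rather than in every client.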
🔥 Hot Topics and Trend Analysis
- Quantization accuracy and performance: Several issues (#31843 SM90 FlashInfer FP8 compilation, #31844 DeepGEMM E8M0 accuracy, #31840 NVFP4 MoE numerical accuracy) point to compatibility and accuracy challenges on new hardware (H200, B200) and new quantization formats (FP8, NVFP4), indicating that cutting-edge quantization still needs hardening in complex scenarios such as MoE.
- Deep MoE (Mixture of Experts) optimization: A large number of PRs target the MoE layer, including kernel performance work (PR #31830, #31832, #31837 on `cutlass_moe` fusion and computation), kernel abstraction refactoring (PR #31827), and support for more quantization formats (PR #31770). MoE models are clearly the current focus of performance work.
- Expanding multimodal support: The community is actively extending LoRA support for multimodal models (PR #31656, #31812, #31825) and refactoring audio model implementations (PR #31779). Video processing efficiency is also drawing attention, with an RFC on video frame sparsification (Issue #31803).
- Infrastructure and CI/CD: AMD CI stability fixes are a clear theme. There is also work on building CUDA 13 images (PR #31822) and updating release process documentation (PR #31799), reflecting the effort of maintaining multi-version, multi-platform support.
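To see why an exponent-only scale format such as E8M0 is numerically delicate, here is a toy block-quantization sketch with a power-of-two scale. This is an illustration of the format's rounding constraint under simplified assumptions, not DeepGEMM's actual implementation:

```python
import math

def quantize_block_e8m0_style(block, bits=8):
    """Toy block quantization with a power-of-two (E8M0-style) scale.

    E8M0 stores only an exponent, so the per-block scale must be a power
    of two; this adds a rounding step compared to a free-floating scale.
    Simplified sketch, not the DeepGEMM kernel's logic.
    """
    amax = max(abs(x) for x in block)
    qmax = 2 ** (bits - 1) - 1  # symmetric integer range, e.g. 127
    if amax == 0:
        return list(block), 1.0
    scale = amax / qmax
    # E8M0 constraint: round the scale UP to the nearest power of two
    # so quantized values never overflow the integer range.
    scale_e8m0 = 2.0 ** math.ceil(math.log2(scale))
    q = [round(x / scale_e8m0) for x in block]
    return [v * scale_e8m0 for v in q], scale_e8m0

deq, scale = quantize_block_e8m0_style([0.5, -1.0, 0.25, 0.75])
print(scale)  # a power of two, e.g. 0.015625
```

Because the scale is rounded up to a power of two, part of the integer range goes unused, which is one plausible source of the accuracy gaps reported on new hardware/library combinations.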
🛠️ Key Technical Changes
- PR #31187: [ROCm] Add ROCTx range annotations (merged)
  - Technical summary: Adds `ROCTx` (ROCm Tracer eXtensions) annotations to vLLM's key kernels and operations on the AMD ROCm platform.
  - Impact: AMD GPU users can now do finer-grained profiling with tools such as `rocprof` and Omniperf, including capturing kernel parameter information, greatly improving performance debugging and tuning on AMD.
- PR #31816: [ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend (pending merge)
  - Technical summary: Fixes a severe accuracy problem in the AMD AITER Triton MLA backend caused by an inheritance error; making the backend inherit from the correct base class (`AiterMLABackend`) restores computational accuracy.
  - Impact: Guarantees output quality when using AMD's own high-performance attention backend, which is critical for AMD platform users who rely on it for best performance.
- PR #31830: [Perf] Optimize cutlass moe problem size calculation (pending merge)
  - Technical summary: Adds a dedicated kernel to compute the MoE problem sizes quickly, replacing the previous serial logic.
  - Impact: Delivers roughly a 5.3% end-to-end throughput improvement and a 2.2% reduction in TTFT (time to first token), a notable optimization on the MoE inference path.
- PR #31739: [Spec Decode][UX] Add acceptance stats to `vllm bench serve` report (merged)
  - Technical summary: When the server has speculative decoding enabled, the `vllm bench serve` benchmark report automatically adds a "Speculative Decoding" section showing key metrics such as the acceptance rate and per-position acceptance rates.
  - Impact: Greatly improves the user experience and evaluation efficiency with speculative decoding. Users can gauge draft model quality and the gains from speculative decoding directly, without extra tools, making performance analysis more complete.
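The acceptance metrics such a report shows can be derived from per-step accepted-token counts. The input format below is a hypothetical assumption for illustration; the actual bookkeeping lives inside `vllm bench serve`:

```python
def per_position_acceptance(accept_counts, num_spec_tokens):
    """Derive speculative-decoding acceptance stats.

    accept_counts: for each drafting step, how many of the proposed
    draft tokens were accepted (0..num_spec_tokens). Hypothetical input
    format, used only to illustrate the metrics.
    """
    rates = []
    for pos in range(num_spec_tokens):
        # Draft position `pos` counts as accepted in a step iff at least
        # pos + 1 of that step's draft tokens were accepted.
        hits = sum(1 for c in accept_counts if c > pos)
        rates.append(hits / len(accept_counts))
    mean_accepted = sum(accept_counts) / len(accept_counts)
    return mean_accepted, rates

mean_acc, rates = per_position_acceptance([2, 3, 1, 3], num_spec_tokens=3)
print(mean_acc)  # 2.25
print(rates)     # [1.0, 0.75, 0.5]
```

A mean accepted length near `num_spec_tokens` means the draft model tracks the target model well; sharply decaying per-position rates suggest drafting fewer tokens per step.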
📈 Development Activity Observations
- Contribution volume: 65 new PRs and 46 merged PRs in a single day is very high, showing fast community contribution and a quick review-and-merge cadence from the core team.
- Active AMD contributors: Multiple PRs this cycle came from AMD-affiliated contributors (e.g., AndreasKaratzas, vllmellm, rjrock), mostly fixing ROCm CI tests and functional bugs, showing the AMD team's continued investment in keeping its platform compatible.
- Multimodal contributions: Many PRs around multimodal LoRA support and audio model refactoring show an active community expanding the boundaries of vLLM's applications.
- Efficient issue turnaround: Alongside 28 new issues, 16 were closed (including some long-standing ones), indicating a smooth triage process.
💡 Issues Worth Watching
- GLM-4.5-Air + MTP structured output errors (Issue #31858): Enabling MTP (multi-token prediction) together with structured output (JSON Schema) causes a conflict that can produce malformed output. The interaction between speculative sampling and constrained decoding makes this a complex, experience-affecting bug.
- DeepSeek V3.2 MTP > 1 failure (Issue #31845): Serving DeepSeek-V3.2 with MTP > 1 on H200 fails. DeepSeek-V3.2 is an important new model, so the stability of its MTP support deserves attention.
- Streaming quantization memory issue (Issue #31805): Streaming quantization raises peak memory usage during model loading and post-processing, which affects resource efficiency in large-scale deployments.
- DeepGEMM E8M0 accuracy on B200 (Issue #31844): Accuracy drops when using DeepGEMM with the E8M0 format on the Blackwell architecture (B200), hinting at latent issues when pairing new hardware with new compute libraries.
- Video frame sparsification RFC (Issue #31803): Proposes an algorithm for adaptive keyframe extraction from long video inputs, addressing overly long sequences in multimodal long-video inference. A forward-looking feature proposal targeting a real application pain point.
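The keyframe-extraction idea in the RFC can be sketched as a pixel-difference filter over consecutive frames. The threshold and the mean-absolute-difference metric here are illustrative guesses, not the RFC's actual proposal:

```python
import numpy as np

def sparsify_frames(frames, threshold=0.1):
    """Keep a frame only if it differs enough from the last kept frame.

    Toy sketch of the keyframe-extraction idea: near-duplicate frames are
    dropped so the multimodal token sequence stays short. The similarity
    metric and threshold are assumptions for illustration.
    """
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        prev = frames[kept[-1]].astype(np.float32)
        cur = frames[i].astype(np.float32)
        # Mean absolute pixel difference, normalized to [0, 1] for uint8 data.
        diff = np.abs(cur - prev).mean() / 255.0
        if diff >= threshold:
            kept.append(i)
    return kept

frames = [np.zeros((4, 4), dtype=np.uint8),
          np.zeros((4, 4), dtype=np.uint8),      # near-duplicate, dropped
          np.full((4, 4), 255, dtype=np.uint8)]  # large change, kept
print(sparsify_frames(frames))  # [0, 2]
```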
📋 Appendix: Detailed Data Lists
New Issues
- #31858 [Bug]: GLM-4.5-Air + MTP causes structural output errors — bug — by ruokee (created: 2026-01-07 10:23 (UTC+8))
- #31856 [Bug]: Text degeneration / repetition loops with MiniMax-M2.1-NVFP4 on v0.14.0rc1.dev308 — no labels — by ktsaou (created: 2026-01-07 09:36 (UTC+8))
- #31843 [Bug]: SM90 FlashInfer with FP8 kv cache fails to compile — bug — by mgoin (created: 2026-01-07 07:20 (UTC+8))
- #31848 [RFC]: Native Weight Syncing APIs — RFC — by ahao-anyscale (created: 2026-01-07 08:37 (UTC+8))
- #31845 [Bug]: [H200] DeepSeek V3.2 MTP > 1 run into error (FLASHMLA_SPARSE backend) — bug — by jhaotingc (created: 2026-01-07 07:39 (UTC+8))
- #31844 [Bug]: DeepEP LL Accuracy Issue with DeepGEMM E8M0 on B200 — bug — by robertgshaw2-redhat (created: 2026-01-07 07:27 (UTC+8))
- #31840 [Bug]: NVFP4 Flashinfer CuteDSL MoE + DeepEP ll + VLLM_MOE_DP_CHUNK_SIZE=1024 numerical accuracy issue — bug,nvidia — by mxz297 (created: 2026-01-07 06:54 (UTC+8))
- #31805 [Bug]: streaming quantization cause higher peak memory used during model loading and post process — bug — by yma11 (created: 2026-01-06 20:00 (UTC+8))
- #31831 [Feature]: Add support for ROCTx — feature request — by AidenPetersen (created: 2026-01-07 04:28 (UTC+8))
- #31792 [Bug]: VLLM 0.13.0 / Mistral-Small-3.2-24B-Instruct-2506: MistralTokenizer object has no attribute ‘convert_tokens_to_ids’ — bug — by dgouju (created: 2026-01-06 16:47 (UTC+8))
- #31824 [Feature]: Remove parameter re-registration and names from kernel abstraction — help wanted,feature request — by ProExpertProg (created: 2026-01-07 02:55 (UTC+8))
- #31819 [New Model]: Support Florence-2 on Engine 1 After Engine 0 Removal — new-model — by docee (created: 2026-01-07 00:46 (UTC+8))
- #31823 [Feature]: Use kernel abstraction for fp4 scaled mm — help wanted,feature request — by ProExpertProg (created: 2026-01-07 01:16 (UTC+8))
- #31821 [Feature]: Change min_capability to is_supported in the kernel abstraction — feature request — by ProExpertProg (created: 2026-01-07 01:13 (UTC+8))
- #31818 [Feature]: Refactor W8A8 Block Linear to use the kernel abstraction — feature request — by ProExpertProg (created: 2026-01-07 00:42 (UTC+8))
- #31810 [CI Failure]: — ci-failure — by robertgshaw2-redhat (created: 2026-01-06 22:38 (UTC+8))
- #31809 [Feature]: Add a default logger select write to file — feature request — by lengrongfu (created: 2026-01-06 22:31 (UTC+8))
- #31794 [Feature]: Feasibility of Layer-wise Intervention (e.g., KV Transfer) compatible with V1 CUDA Graph architecture? — feature request — by leepeter008 (created: 2026-01-06 16:54 (UTC+8))
- #31803 [RFC]: Video Frame Sparsification via Pixel-Level Similarity for Efficient Multimodal Long-Video Inference — RFC — by rayleeku (created: 2026-01-06 19:38 (UTC+8))
- #31782 [Feature]: Support compressed-tensors NVFP4 quantization for MoE models (Nemotron-H non-gated MoE) — feature request — by Firworksyt (created: 2026-01-06 14:43 (UTC+8))
- #31804 [Bug]: Pydantic model’s description is not taken into account when generating structured outputs — bug — by venki-lfc (created: 2026-01-06 19:39 (UTC+8))
- #31766 [Docs] Feedback for `/en/latest/contributing/profiling/` — documentation — by cyk2018 (created: 2026-01-06 11:15 (UTC+8))
- #31798 [Bug]: The Qwen3-VL-30B-A3B-Thinking model deployed by vllm is not responding to requests. — bug — by haifzhu (created: 2026-01-06 17:20 (UTC+8))
- #31789 [Installation]: ModuleNotFoundError: No module named ‘vllm._C’ when run demo basic.py — installation — by sunbinbin1991 (created: 2026-01-06 15:54 (UTC+8))
- #31787 [Usage]: How to set different attention backend for prefill and decode phases? — usage — by stormchasingg (created: 2026-01-06 15:33 (UTC+8))
- #31767 [Bug]: LoRA Bug: Is it more appropriate to replace type() with isinstance() in the “can_replace_layer” class method — bug — by ZT-AIA (created: 2026-01-06 11:49 (UTC+8))
- #31771 [Performance]: VLLM performance degradation when comparing with Tensor RT — performance — by akasshdeep (created: 2026-01-06 12:45 (UTC+8))
- #31768 [Bug]: Run RedHatAI/Qwen3-30B-A3B-FP8-block failed — bug — by jessiewiswjc (created: 2026-01-06 11:55 (UTC+8))
Closed Issues
- #31319 [Bug]: GLM-4.7-FP8 missing beginning `<think>` tag — bug — by Nemo-G (closed: 2026-01-06 21:53 (UTC+8))
- #19407 [RFC]: Support automatic max context length via `--max-model-len auto` — RFC,stale — by yeqcharlotte (closed: 2026-01-07 09:02 (UTC+8))
- #31044 [CI Failure]: Blackwell Fusion Tests — help wanted,torch.compile,ci-failure — by robertgshaw2-redhat (closed: 2026-01-07 08:31 (UTC+8))
- #29109 [Bug]: Speculative decoding fails with Llama4-eagle — bug — by divakar-amd (closed: 2026-01-07 05:47 (UTC+8))
- #30818 [Bug]: Unpickling MediaWithBytes dataclass leads to RecursionError — bug — by eicherseiji (closed: 2026-01-07 04:11 (UTC+8))
- #31792 [Bug]: VLLM 0.13.0 / Mistral-Small-3.2-24B-Instruct-2506: MistralTokenizer object has no attribute ‘convert_tokens_to_ids’ — bug — by dgouju (closed: 2026-01-06 20:08 (UTC+8))
- #27300 [Bug]: vLLM produces invalid UTF-8 tokens and “�” replacement characters when returning logprobs — bug — by yinggeh (closed: 2026-01-07 01:07 (UTC+8))
- #30571 [Bug]: DeepGEMM MoE generating tons of warnings — bug — by MatthewBonanni (closed: 2026-01-07 00:25 (UTC+8))
- #31810 [CI Failure]: — ci-failure — by robertgshaw2-redhat (closed: 2026-01-06 22:39 (UTC+8))
- #31794 [Feature]: Feasibility of Layer-wise Intervention (e.g., KV Transfer) compatible with V1 CUDA Graph architecture? — feature request — by leepeter008 (closed: 2026-01-06 22:22 (UTC+8))
- #26605 [Bug]: DeepSeek v3.2 hits IMA on DP/EP setup — bug — by MatthewBonanni (closed: 2026-01-06 22:18 (UTC+8))
- #31648 [Bug][CPU Backend]: torch.compile fails for MoE models on CPU backend with `-dp 2` — bug,cpu — by kzwrime (closed: 2026-01-06 20:06 (UTC+8))
- #30374 [Feature][CPU Backend]: Add Paged Attention Benchmarks for CPU backend — feature request,cpu — by fadara01 (closed: 2026-01-06 18:57 (UTC+8))
- #28912 [RFC]: Support Context Pipeline Parallelism (CPP) — RFC — by pisceskkk (closed: 2026-01-06 17:37 (UTC+8))
- #31789 [Installation]: ModuleNotFoundError: No module named ‘vllm._C’ when run demo basic.py — installation — by sunbinbin1991 (closed: 2026-01-06 16:04 (UTC+8))
- #31768 [Bug]: Run RedHatAI/Qwen3-30B-A3B-FP8-block failed — bug — by jessiewiswjc (closed: 2026-01-06 12:15 (UTC+8))
New PRs
- #31854 refactor: refatcor find_loaded_library — nvidia — by tom-zju (created: 2026-01-07 09:17 (UTC+8))
- #31857 [Bugfix] manully free encode cache of waiting request to avoid potential dead … — v1 — by frelam (created: 2026-01-07 10:21 (UTC+8))
- #31826 [Perf] Slight improvement of ITL with multiple GPUs — no labels — by access2rohit (created: 2026-01-07 03:13 (UTC+8))
- #31779 [Refactor] GLM-ASR Modeling — no labels — by JaredforReal (created: 2026-01-06 14:26 (UTC+8))
- #31855 Change warning in get_current_vllm_config to report caller’s line number — ready — by tlrmchlsmth (created: 2026-01-07 09:28 (UTC+8))
- #31837 [Perf] Fuse stride preparation for NVFP4 cutlass_moe — performance,ready,nvidia — by mgoin (created: 2026-01-07 05:50 (UTC+8))
- #31838 [1/2][lmcache connector] clean up lmcache multi-process adapter — ready,kv-connector — by ApostaC (created: 2026-01-07 06:09 (UTC+8))
- #31778 [BugFix] fixing stream interval > 1 will cause tool call bug — v1 — by MrIceCreamMan (created: 2026-01-06 14:09 (UTC+8))
- #31791 [Bugfix] Use isinstance() instead of type() in LoRA can_replace_layer — no labels — by majiayu000 (created: 2026-01-06 16:40 (UTC+8))
- #31814 [Bugfix] Inject JSON schema descriptions into prompt for structured outputs — frontend — by ricky-chaoju (created: 2026-01-06 23:29 (UTC+8))
- #31853 Add GPU memory usage warning system — documentation,v1 — by Dedulus (created: 2026-01-07 09:16 (UTC+8))
- #31841 [Bugfix] Fix race condition in async-scheduling for vlm model — v1 — by tianshu-Michael-yu (created: 2026-01-07 06:58 (UTC+8))
- #31852 [Attention] Full CG support for llama4 and remove use of deprecated properties — v1,llama — by LucasWilkinson (created: 2026-01-07 09:15 (UTC+8))
- #31851 [Attention] Full CG support for llama4 and remove use of deprecated properties — documentation,performance,new-model,rocm,structured-output,frontend,tpu,speculative-decoding,needs-rebase,ci/build — by LucasWilkinson (created: 2026-01-07 09:12 (UTC+8))
- #31847 [Model] Add Grok-2 — documentation — by dangoldbj (created: 2026-01-07 08:27 (UTC+8))
- #31829 [ROCm][CI] Fix plugin tests (2 GPUs) failures on ROCm and removing `VLLM_FLOAT32_MATMUL_PRECISION` from all ROCm tests — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-01-07 04:12 (UTC+8))
- #31850 [Attention][3/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties — ready,v1 — by LucasWilkinson (created: 2026-01-07 08:56 (UTC+8))
- #31849 [CI] Fix weight mapping test for transformers v5 tied weights — multi-modality — by AndreasKaratzas (created: 2026-01-07 08:44 (UTC+8))
- #31799 [Doc] Update release docs — documentation — by DarkLight1337 (created: 2026-01-06 17:34 (UTC+8))
- #31846 Optimize graph capture size — no labels — by jiahanc (created: 2026-01-07 08:08 (UTC+8))
- #31817 [Bugfix] Handle mistral tokenizer in get_hf_processor — ready,v1,multi-modality — by DarkLight1337 (created: 2026-01-07 00:40 (UTC+8))
- #31785 [Fix] Use torch.empty for output in attention+quant fusion — ready — by elvischenv (created: 2026-01-06 15:11 (UTC+8))
- #31835 [ROCm][CI] Pinning timm lib version to fix ImportError in Multi-Modal Tests (Nemotron) — rocm,ready,ci/build — by AndreasKaratzas (created: 2026-01-07 05:36 (UTC+8))
- #31820 [ROCm][CI] Fix ModernBERT token classification test numerical accuracy on ROCm — rocm,ready — by AndreasKaratzas (created: 2026-01-07 00:52 (UTC+8))
- #31842 [4/n] Migrate pos_encoding sampler and fused_qknorm_rope to libtorch stable ABI — ci/build,cpu,nvidia — by mikaylagawarecki (created: 2026-01-07 07:13 (UTC+8))
- #31781 [Kernel] Support bias type in grouped_topk kernel — performance,deepseek,nvidia — by xyang16 (created: 2026-01-06 14:31 (UTC+8))
- #31828 [Perf] Add opt-in SM100 Oink RMSNorm custom-op path — no labels — by Laurawly (created: 2026-01-07 04:00 (UTC+8))
- #31832 [Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe — performance,nvidia — by mgoin (created: 2026-01-07 04:52 (UTC+8))
- #31839 [responsesAPI] get reasoning token metrics for simpleContext — frontend,gpt-oss — by qandrew (created: 2026-01-07 06:18 (UTC+8))
- #31833 [ROCm][CI] v1 cpu offloading attention backend fix — rocm,v1 — by AndreasKaratzas (created: 2026-01-07 05:08 (UTC+8))
- #31836 [responsesAPI] fix incomplete_messages for simple/parsable context — frontend — by qandrew (created: 2026-01-07 05:37 (UTC+8))
- #31834 [CI/Build] Enable test_kv_cache_events_dp for AMD — rocm,v1 — by rjrock (created: 2026-01-07 05:19 (UTC+8))
- #31770 [Feat] Add Flashinfer TRTLLM MOE to Modular Kernel — needs-rebase,v1,nvidia — by jiahanc (created: 2026-01-06 12:29 (UTC+8))
- #31830 [Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement — ready,nvidia — by yewentao256 (created: 2026-01-07 04:20 (UTC+8))
- #31827 [MoE Refactor][17/N] Apply Refactor to Bf16 — no labels — by zyongye (created: 2026-01-07 03:46 (UTC+8))
- #31808 Report error log after vllm bench serve — performance,ready — by elvircrn (created: 2026-01-06 21:16 (UTC+8))
- #31825 [Model] Enable LoRA support for tower and connector in DotsOCR — documentation — by ShaanveerS (created: 2026-01-07 03:10 (UTC+8))
- #31801 [Quantization][Refactor] Move CPU GPTQ kernel into MP linear — ready,cpu — by bigPYJ1151 (created: 2026-01-06 18:47 (UTC+8))
- #31807 [NemotronH] Use ReplicatedLinear for fc1_latent_proj — ready — by roikoren755 (created: 2026-01-06 20:53 (UTC+8))
- #31775 [docker] A follow-up patch to fix #30913: `[docker] install cuda13 version of lmcache and nixl` — ci/build,kv-connector,nvidia — by wangshangsam (created: 2026-01-06 13:32 (UTC+8))
- #31812 Enable LoRA support for tower and connector in Mistral and Voxtral — documentation,frontend,qwen,deepseek — by Anexdeus (created: 2026-01-06 22:52 (UTC+8))
- #31774 [Attention][2/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties — ready,v1 — by LucasWilkinson (created: 2026-01-06 12:57 (UTC+8))
- #31822 [CI] Add CUDA 13 nightly containers — ci/build,nvidia — by csahithi (created: 2026-01-07 01:14 (UTC+8))
- #31816 [ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend — rocm,v1 — by vllmellm (created: 2026-01-07 00:03 (UTC+8))
- #31784 [Model] rename use_pad_token to use_sep_token — documentation,frontend,ready — by noooop (created: 2026-01-06 14:57 (UTC+8))
- #31786 [Chore] Try remove `init_cached_hf_modules` — tpu,v1,ready-run-all-tests — by DarkLight1337 (created: 2026-01-06 15:28 (UTC+8))
- #31815 [Bugfix] Fix TorchAO quantization bugs and add `--torchao-config` CLI support — no labels — by jwpark33 (created: 2026-01-06 23:34 (UTC+8))
- #31811 [KVConnector] Handle KV events from multiple connectors — v1,kv-connector — by hickeyma (created: 2026-01-06 22:44 (UTC+8))
- #31813 [Bugfix] Add TorchAO CLI Support and Fix Tensor Metadata Preservation — documentation,new-model,ci/build,v1,multi-modality,llama,qwen — by jwpark33 (created: 2026-01-06 23:23 (UTC+8))
- #31806 [Bugfix]: Fix cross attention backend selection for Turing GPU — ready — by Isotr0py (created: 2026-01-06 20:45 (UTC+8))
- #31777 [LoRA] Disable linear LoRA kernel PDL — documentation,ready — by jeejeelee (created: 2026-01-06 13:58 (UTC+8))
- #31788 [Frontend] Support GLM-4.5 / GLM-4.7 with enable_thinking: false — ready,deepseek — by chaunceyjiang (created: 2026-01-06 15:51 (UTC+8))
- #31790 [Bugfix]: avoid overriding audio/text kwargs (Qwen3-Omni) — ready,qwen — by Jzz1943 (created: 2026-01-06 16:32 (UTC+8))
- #31802 [Core][NIXL] Support HMA+NixlConnector — documentation,performance,structured-output,frontend,tpu,needs-rebase,ci/build,v1,multi-modality,tool-calling — by NickLucche (created: 2026-01-06 19:04 (UTC+8))
- #31796 [Misc] Implement `TokenizerLike.convert_tokens_to_ids` — ready — by DarkLight1337 (created: 2026-01-06 17:05 (UTC+8))
- #31773 [Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties — ready,v1,nvidia — by LucasWilkinson (created: 2026-01-06 12:53 (UTC+8))
- #31793 [Chore] Cleanup `mem_utils.py` — ready — by DarkLight1337 (created: 2026-01-06 16:48 (UTC+8))
- #31776 [Bugfix][CI/Build] Fix failing pooling models test due to Triton kernel accuracy diff — ready — by Isotr0py (created: 2026-01-06 13:56 (UTC+8))
- #31800 [Doc] Fix format of multimodal_inputs.md — documentation,ready — by BlankRH (created: 2026-01-06 17:45 (UTC+8))
- #31797 [CI] Increase the MTEB_EMBED_TOL threshold to 5e-4. — ready — by noooop (created: 2026-01-06 17:12 (UTC+8))
- #31780 [Misc] Use `deprecated` for `seed_everything` — documentation,performance,ready — by DarkLight1337 (created: 2026-01-06 14:29 (UTC+8))
- #31783 [Chore] Remove more V0 dead code from `sequence.py` — ready — by DarkLight1337 (created: 2026-01-06 14:54 (UTC+8))
- #31795 [ci] fix timm import error — ci/build — by andyxning (created: 2026-01-06 16:54 (UTC+8))
- #31772 [Model] Enable LoRA support for LLaVA family — documentation — by ppppqp (created: 2026-01-06 12:50 (UTC+8))
- #31769 Add more ci for moe refactor b200 — no labels — by robertgshaw2-redhat (created: 2026-01-06 12:05 (UTC+8))
Merged PRs
- #29493 [Log] add log about gpu worker init snapshot and requested memory — ready,ci/build,v1 — by andyxning (merged: 2026-01-07 01:32 (UTC+8))
- #31656 [Model] Enable LoRA support for PaliGemma — documentation,ready — by A1c0r-Z (merged: 2026-01-07 10:09 (UTC+8))
- #31838 [1/2][lmcache connector] clean up lmcache multi-process adapter — ready,kv-connector — by ApostaC (merged: 2026-01-07 10:02 (UTC+8))
- #31554 [Misc][BE] Type coverage for vllm/compilation [1/3] — ready,nvidia — by Lucaskabela (merged: 2026-01-07 09:37 (UTC+8))
- #29197 [Frontend] Implement robust video frame recovery for corrupted videos — documentation,performance,ready,multi-modality — by vSeamar (merged: 2026-01-07 09:13 (UTC+8))
- #31829 [ROCm][CI] Fix plugin tests (2 GPUs) failures on ROCm and removing `VLLM_FLOAT32_MATMUL_PRECISION` from all ROCm tests — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-01-07 09:12 (UTC+8))
- #31183 [CI] Add warmup run in test_fusion_attn — ready — by angelayi (merged: 2026-01-07 08:31 (UTC+8))
- #31817 [Bugfix] Handle mistral tokenizer in get_hf_processor — ready,v1,multi-modality — by DarkLight1337 (merged: 2026-01-07 07:46 (UTC+8))
- #31835 [ROCm][CI] Pinning timm lib version to fix ImportError in Multi-Modal Tests (Nemotron) — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-01-07 07:23 (UTC+8))
- #31820 [ROCm][CI] Fix ModernBERT token classification test numerical accuracy on ROCm — rocm,ready — by AndreasKaratzas (merged: 2026-01-07 07:21 (UTC+8))
- #31739 [Spec Decode][UX] Add acceptance stats to `vllm bench serve` report — performance,ready — by MatthewBonanni (merged: 2026-01-07 05:21 (UTC+8))
- #31808 Report error log after vllm bench serve — performance,ready — by elvircrn (merged: 2026-01-07 04:24 (UTC+8))
- #31191 Fix RecursionError in MediaWithBytes unpickling — ready,multi-modality — by nrghosh (merged: 2026-01-07 04:11 (UTC+8))
- #31801 [Quantization][Refactor] Move CPU GPTQ kernel into MP linear — ready,cpu — by bigPYJ1151 (merged: 2026-01-07 03:10 (UTC+8))
- #31807 [NemotronH] Use ReplicatedLinear for fc1_latent_proj — ready — by roikoren755 (merged: 2026-01-07 00:00 (UTC+8))
- #28895 [ROCm][CI] Fix tests/compile unit tests — rocm,ready,nvidia — by charlifu (merged: 2026-01-07 02:50 (UTC+8))
- #29821 [Perf] Async Scheduling + Speculative Decoding + Structured Outputs — structured-output,ready,v1 — by benchislett (merged: 2026-01-07 02:50 (UTC+8))
- #31055 [Bugfix] Fix GLM-4 MoE router logits dtype for data parallel chunking — ready — by ReinforcedKnowledge (merged: 2026-01-07 01:57 (UTC+8))
- #20610 make 500: InternalServerError more informative — ready,stale,v1 — by guicho271828 (merged: 2026-01-07 01:36 (UTC+8))
- #31722 [PERF] Speed-up of GDN attention decode part (Qwen3-Next) — performance,ready,qwen — by vadiklyutiy (merged: 2026-01-07 01:32 (UTC+8))
- #31774 [Attention][2/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties — ready,v1 — by LucasWilkinson (merged: 2026-01-07 01:32 (UTC+8))
- #31571 [Quantization][MoE] remove unused ep logic from moe marlin — ready — by jinzhen-lin (merged: 2026-01-07 01:07 (UTC+8))
- #31784 [Model] rename use_pad_token to use_sep_token — documentation,frontend,ready — by noooop (merged: 2026-01-06 22:16 (UTC+8))
- #31593 [MoE Refactor][14/N] Clean Up FI Quant Config Smuggling — ready,llama,nvidia — by robertgshaw2-redhat (merged: 2026-01-06 23:47 (UTC+8))
- #31759 [MoE Refactor] Add Temporary Integration Tests - H100/B200 — ready,ci/build,nvidia — by robertgshaw2-redhat (merged: 2026-01-06 23:34 (UTC+8))
- #31806 [Bugfix]: Fix cross attention backend selection for Turing GPU — ready — by Isotr0py (merged: 2026-01-06 23:15 (UTC+8))
- #31777 [LoRA] Disable linear LoRA kernel PDL — documentation,ready — by jeejeelee (merged: 2026-01-06 23:12 (UTC+8))
- #30890 [UX] Add `-ep` shorthand for `--enable-expert-parallel` — ready,v1 — by mgoin (merged: 2026-01-06 11:13 (UTC+8))
- #31788 [Frontend] Support GLM-4.5 / GLM-4.7 with enable_thinking: false — ready,deepseek — by chaunceyjiang (merged: 2026-01-06 21:53 (UTC+8))
- #31738 [Models]: Use `MMEncoderAttention` for MoonViT — ready — by Isotr0py (merged: 2026-01-06 16:00 (UTC+8))
- #31790 [Bugfix]: avoid overriding audio/text kwargs (Qwen3-Omni) — ready,qwen — by Jzz1943 (merged: 2026-01-06 20:59 (UTC+8))
- #31796 [Misc] Implement `TokenizerLike.convert_tokens_to_ids` — ready — by DarkLight1337 (merged: 2026-01-06 20:08 (UTC+8))
- #31650 [Bugfix] Fix torch.compile error for DP + MoE on CPU Backend — ready — by kzwrime (merged: 2026-01-06 20:06 (UTC+8))
- #31773 [Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties — ready,v1,nvidia — by LucasWilkinson (merged: 2026-01-06 20:05 (UTC+8))
- #31793 [Chore] Cleanup `mem_utils.py` — ready — by DarkLight1337 (merged: 2026-01-06 19:56 (UTC+8))
- #31776 [Bugfix][CI/Build] Fix failing pooling models test due to Triton kernel accuracy diff — ready — by Isotr0py (merged: 2026-01-06 16:44 (UTC+8))
- #31800 [Doc] Fix format of multimodal_inputs.md — documentation,ready — by BlankRH (merged: 2026-01-06 19:30 (UTC+8))
- #31797 [CI] Increase the MTEB_EMBED_TOL threshold to 5e-4. — ready — by noooop (merged: 2026-01-06 19:30 (UTC+8))
- #31780 [Misc] Use `deprecated` for `seed_everything` — documentation,performance,ready — by DarkLight1337 (merged: 2026-01-06 19:29 (UTC+8))
- #31720 [cpu][bench] Add CPU paged attention benchmarks — performance,ready,cpu — by fadara01 (merged: 2026-01-06 18:57 (UTC+8))
- #31783 [Chore] Remove more V0 dead code from `sequence.py` — ready — by DarkLight1337 (merged: 2026-01-06 18:25 (UTC+8))
- #31714 [Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in `eagle.py` — rocm,speculative-decoding,ready,v1 — by vllmellm (merged: 2026-01-06 15:53 (UTC+8))
- #30837 [Doc] Show that `use_audio_in_video` is supported in docs — documentation,ready,qwen — by DarkLight1337 (merged: 2026-01-06 15:27 (UTC+8))
- #31177 [Bugfix][Hardware][AMD] Fix exception types in AITER MLA FP8 check — rocm,ready — by c0de128 (merged: 2026-01-06 14:11 (UTC+8))
- #31764 [CI] Fix CPU MM Processor Test — ready — by robertgshaw2-redhat (merged: 2026-01-06 12:22 (UTC+8))
- #31042 [Bugfix] Add init_workspace_manager to moe kernel benchmarks — performance,nvidia — by mgoin (merged: 2026-01-06 11:14 (UTC+8))
PRs Closed Without Merging
- #23076 Update supported models — documentation,stale — by bvrockwell (closed: 2026-01-07 10:32 (UTC+8))
- #21975 [Build] Add FlashInfer wheel build to release pipeline — documentation,needs-rebase,ci/build,stale — by mgoin (closed: 2026-01-07 10:28 (UTC+8))
- #31851 [Attention] Full CG support for llama4 and remove use of deprecated properties — documentation,performance,new-model,rocm,structured-output,frontend,tpu,speculative-decoding,needs-rebase,ci/build — by LucasWilkinson (closed: 2026-01-07 09:13 (UTC+8))
- #31673 Test1 — no labels — by AlexLI-hub (closed: 2026-01-07 07:46 (UTC+8))
- #30146 [WIP] Nightly signal on torch — rocm,needs-rebase,ci/build,cpu,nvidia — by atalman (closed: 2026-01-07 07:26 (UTC+8))
- #31243 [WIP] Adopt Dockerfile to build nightly version — ci/build — by atalman (closed: 2026-01-07 07:25 (UTC+8))
- #26782 [Fix] Avoid UserWarning when creating tensors from base64 embeddings — documentation — by mmangkad (closed: 2026-01-07 03:12 (UTC+8))
- #27501 [Refactor][MLA] Move prefill/decode implementation into MLAAttention layer — rocm,ready,needs-rebase,v1,nvidia — by therealnaveenkamal (closed: 2026-01-07 01:34 (UTC+8))
- #31543 [Model] Add support for Nemotron-Flash-3B — documentation,new-model,v1 — by hhzhang16 (closed: 2026-01-07 01:23 (UTC+8))
- #31723 [Core] Optimize expensive deepcopy in GPU model runner — v1 — by GOavi101 (closed: 2026-01-07 01:04 (UTC+8))
- #31813 [Bugfix] Add TorchAO CLI Support and Fix Tensor Metadata Preservation — documentation,new-model,ci/build,v1,multi-modality,llama,qwen — by jwpark33 (closed: 2026-01-06 23:23 (UTC+8))
- #31795 [ci] fix timm import error — ci/build — by andyxning (closed: 2026-01-06 17:37 (UTC+8))
- #31765 [BUGFIX] free encode inputs in _update_after_schedule to avoid dead lock in sc… — v1 — by frelam (closed: 2026-01-06 14:06 (UTC+8))
- #31769 Add more ci for moe refactor b200 — no labels — by robertgshaw2-redhat (closed: 2026-01-06 12:05 (UTC+8))
- #30192 [Frontend] Add MCP tool streaming support to Responses API — frontend,gpt-oss — by daniel-salib (closed: 2026-01-06 12:38 (UTC+8))
- #31717 [Perf] GLM ASR — no labels — by JaredforReal (closed: 2026-01-06 10:58 (UTC+8))