vLLM Development Activity Report - 2025-12-16

Time window: 2025-12-16 10:45 (UTC+8) ~ 2025-12-17 10:45 (UTC+8)
Statistics: New Issues 25 | Closed Issues 7 | New PRs 70 | Merged PRs 37 | PRs Closed Without Merging 15

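📊 Daily Development Status Summary

The vLLM project remained highly active during the reporting window, with 25 new issues opened and 37 PRs merged. Development focused on performance optimization (notably for GPT-OSS, MoE, and LoRA workloads), broader support for multimodal inference (such as Gemma3 and Qwen-Omni), compilation-system upgrades (in preparation for Torch 2.10), and the evolution of the distributed execution architecture (such as P/D disaggregation and Dual Batch Overlap). The community kept resolving user-reported bugs while advancing several far-reaching new features.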

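🎯 AMD/ROCm Ecosystem Updates

Activity directly related to the AMD ecosystem was light in this window, but two items stand out:

Issue #30801 ([Bug]: distilbert/distilgpt2 with graph_mode on ROCm platform generates garbage output):
- Description: A user reports that on ROCm, enabling graph_mode (enforce_eager=False) for the distilbert/distilgpt2 model produces garbage output (such as "!!!!!!!!!!!!!!!") regardless of the prompt.
- Analysis: This is a newly found bug specific to ROCm and graph execution mode. The bot has labeled the issue and notified the ROCm maintainers (@hongxiayang, @tjtanaa, @vllmellm). It points to a possible compatibility or correctness problem in the graph-compilation path on ROCm that needs further investigation.
PR #30811 ([ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32):
- Description: Submitted by AMD employee micah-wil. The PR addresses the flakiness of the test_async_scheduling test on ROCm. After switching the test's default precision from fp16 back to float32 and pinning the ROCM_ATTN backend, the test passed 35 consecutive local runs, which should make CI noticeably more reliable.
- Analysis: This reflects the AMD team's continued investment in keeping ROCm tests stable. Fixing flaky tests is essential groundwork for the quality of ROCm support.

Summary: AMD-related updates in this window were mostly bug fixes and test-stability improvements. No major features were introduced involving Quark quantization, MI300, or other new hardware or toolchains.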

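💬 Hot Discussion Analysis

Issue #30830 ([Bug]: accuracy issue on MoE online fp8 quantization):
- Core topic: User yma11 reports accuracy problems when using online FP8 quantization on MoE models and directly @-mentions core developer vkuzo for insight.
- Views and analysis: The issue was just filed and no other developer has commented yet. The direct mention of a core developer suggests the problem may lie in the low-level quantization kernels or scheduling logic, and the impact could be significant because it touches both MoE and FP8, two performance-critical features. It warrants close attention from the core team.
RFC #30786 ([RFC]: FlashMask Attention Backend for PrefixLM Models):
- Core topic: Proposes a FlashMask Attention backend to fix the poor performance of the mixed attention pattern (PrefixLM) required by multimodal models such as Gemma3, with the goal of approaching FlashAttention-2 performance.
- Main views:
  - Proposer (lucianommartins): provides detailed performance data (the current approach is 2-4x slower) and an implementation plan, stressing the strategic value for multimodal performance.
  - Maintainer (LucasWilkinson): acknowledges the value of the proposal, but suggests also considering more modern options such as FlashAttention-3 CuTE DSL FlexAttention, and states clearly that vLLM should not ship the backward-pass kernels needed for training. He also asks to see the existing Triton code to better understand the requirements.
  - Participant (Bhanu068): expresses support and willingness to collaborate.
- Point of contention: the choice of technical path (FlashMask vs. FA3 FlexAttention) and the implementation scope (whether to include the backward pass).
- Current status: discussion is ongoing. The proposer has responded to the maintainer's concerns, explaining that FlashMask was chosen for broader platform coverage, and has agreed to drop backward-pass support.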
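
To make the "mixed attention pattern" concrete, here is a minimal sketch of a PrefixLM mask: positions inside the prefix attend to each other bidirectionally, while all later positions attend causally. It assumes a single contiguous prefix at the start of the sequence, which is a simplification; the function name `prefix_lm_mask` is hypothetical and is not taken from the RFC or from vLLM.

```python
def prefix_lm_mask(seq_len: int, prefix_len: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Assumption: one contiguous prefix occupying positions [0, prefix_len).
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    return [
        # Causal rule (k <= q) everywhere, plus full visibility inside the prefix.
        [k <= q or (q < prefix_len and k < prefix_len) for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = prefix_lm_mask(seq_len=5, prefix_len=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxx..
# xxx..
# xxx..
# xxxx.
# xxxxx
```

A dense boolean mask like this is only for illustration. The point of a dedicated backend is to exploit the block structure of the mask instead of materializing it.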
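Issue #30778 ([RFC] DBO fallout tracking metrics):
- Core topic: Proposes monitoring metrics for Dual Batch Overlap (DBO) that track when it falls out and why. Today DBO falls out silently, which makes throughput fluctuations in production hard to diagnose.
- Main views:
  - Proposer (markmc and others): describes in detail how silent DBO fallout hampers performance diagnosis in production, proposes concrete metrics (such as vllm:dbo_active and vllm:dbo_fallout_total), and links the related Slack discussion to show how pressing the need is.
- Point of contention: none; this is a clear production requirement.
- Current status: the issue is newly opened and aims to gather feedback and drive implementation. The proposal has drawn attention from several core developers, including @robertgshaw2-redhat.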
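
The proposed metrics can be pictured with a small, dependency-free sketch. Only the metric names vllm:dbo_active and vllm:dbo_fallout_total come from the issue; the `DboFalloutTracker` class, the `reason` label, and the example reason string are hypothetical and do not reflect how vLLM would actually implement them.

```python
from collections import Counter
from typing import Optional


class DboFalloutTracker:
    """Toy tracker: a gauge for the current DBO state plus a per-reason fallout counter."""

    def __init__(self) -> None:
        self.active = 0                  # would back the vllm:dbo_active gauge
        self.fallout_total = Counter()   # would back vllm:dbo_fallout_total (hypothetical reason label)

    def record_step(self, used_dbo: bool, reason: Optional[str] = None) -> None:
        self.active = 1 if used_dbo else 0
        if not used_dbo:
            self.fallout_total[reason or "unknown"] += 1

    def render(self) -> str:
        """Render the current values in a Prometheus-like text form."""
        lines = [f"vllm:dbo_active {self.active}"]
        for reason, count in sorted(self.fallout_total.items()):
            lines.append(f'vllm:dbo_fallout_total{{reason="{reason}"}} {count}')
        return "\n".join(lines)


tracker = DboFalloutTracker()
tracker.record_step(used_dbo=True)
tracker.record_step(used_dbo=False, reason="batch_too_small")  # hypothetical reason
print(tracker.render())
```

Splitting the counter by reason is what would turn a silent fallback into something an operator can alert on and attribute.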

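🔥 Hot Topics and Trend Analysis

Performance optimization and performance regression reports:
- Proactive optimization: several trackers and proposals focus on performance gains, such as Issue #30758 (GPT-OSS B200/GB200 performance optimization tracker), PR #29512 (vectorized activation kernels), and PR #30014 (moving MoE FP4 quantization ahead of the all-gather to reduce communication volume).
- Performance regressions: users are actively reporting slowdowns, for example Issue #30741 (about a 20% performance drop when using a LoRA adapter, with the MoE layers identified as the bottleneck) and Issue #30757 (the point at which AsyncGPUModelRunnerOutput is returned in async scheduling may affect batch overlap).
Deeper support for multimodal and vision-language models (VLMs):
- Many issues and PRs revolve around VLMs, covering support and fixes for Gemma3, Qwen2.5/3-VL, Qwen3-Omni, and others (such as #30779, #30777, #30819).
- Core optimizations are under way, such as PR #30475 (optimizes encoder cache memory usage and substantially reduces memory consumption for Qwen3-VL video inference) and RFC #30786 (improves attention performance for PrefixLM models).
Compilation-system upgrades and hardening:
- In preparation for PyTorch 2.10, several compilation-related fixes and tests are in progress (such as PR #30790, #30743, #30810).
- Compilation config management is becoming stricter, which requires tests to adapt (for example, PR #30817 fixes a test failure caused by a changed config option).
Evolution of the distributed execution architecture:
- P/D (Prefill/Decode) disaggregation: PR #30794 implements NCCL-based asynchronous KV cache loading for P/D, to address head-of-line blocking under network jitter.
- KV connectors and cache management: several PRs improve the functionality and stability of KV connectors (such as #30753, #30745, #30814).
- Dual Batch Overlap (DBO): observability for DBO is now on the agenda (Issue #30778), while the code generalizing it to XBO has been merged (PR #30120).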
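
The head-of-line blocking that PR #30794 targets can be illustrated with a toy asyncio sketch: when loads run concurrently, one slow transfer no longer delays requests whose KV data arrives quickly. This is only an analogy under stated assumptions; `load_kv` and `load_all` are hypothetical names, `asyncio.sleep` stands in for a network transfer, and nothing here reflects the PR's NCCL-based implementation.

```python
import asyncio


async def load_kv(request_id: str, transfer_s: float, done: list[str]) -> None:
    """Simulate one KV cache transfer and record when it completes."""
    await asyncio.sleep(transfer_s)  # stands in for a network KV transfer
    done.append(request_id)


async def load_all(transfers: dict[str, float]) -> list[str]:
    """Start every transfer concurrently and return request ids in completion order."""
    done: list[str] = []
    await asyncio.gather(*(load_kv(rid, t, done) for rid, t in transfers.items()))
    return done


# The slow request is issued first but does not hold up the fast ones.
order = asyncio.run(load_all({"slow": 0.2, "fast-1": 0.01, "fast-2": 0.05}))
print(order)
```

With blocking, in-order loading, the two fast requests would have waited behind the slow one. With asynchronous loading they become ready as soon as their own data lands.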

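🛠️ Key Technical Changes

PR #30738 ([Metrics] Model FLOPs Utilization estimation):
- Summary: Implements optional Model FLOPs Utilization (MFU) statistics. By analyzing scheduler output and the model config, it estimates per-GPU average compute and memory-bandwidth performance at very low overhead and logs the result periodically.
- Impact: Gives operators and developers a key tool for monitoring hardware efficiency, which helps with bottleneck analysis and resource-utilization tuning. The feature is off by default and is enabled with the --mfu-metrics flag.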
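
As a rough illustration of what an MFU figure means, the sketch below divides achieved FLOP/s by the hardware peak. It uses the common approximation of about 2 FLOPs per parameter per token for a dense decoder forward pass and ignores attention cost. This is an assumption made for the example, not the estimator from the PR, and the function name and all numbers are hypothetical.

```python
def estimate_mfu(num_params: float, tokens_processed: int,
                 elapsed_s: float, peak_flops_per_s: float) -> float:
    """Approximate MFU = achieved FLOP/s divided by peak FLOP/s.

    Assumption: forward pass costs ~2 FLOPs per parameter per token.
    """
    if elapsed_s <= 0 or peak_flops_per_s <= 0:
        raise ValueError("elapsed_s and peak_flops_per_s must be positive")
    achieved_flops_per_s = 2.0 * num_params * tokens_processed / elapsed_s
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical numbers: an 8B-parameter model processing 10,000 tokens in 1 s
# on a GPU with a 1 PFLOP/s peak gives an MFU of about 16%.
mfu = estimate_mfu(num_params=8e9, tokens_processed=10_000,
                   elapsed_s=1.0, peak_flops_per_s=1e15)
print(f"MFU: {mfu:.1%}")
```

A low value on a decode-heavy workload is expected, since decode tends to be limited by memory bandwidth rather than compute. That is presumably why the PR reports bandwidth alongside FLOPs.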
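PR #30756 ([MM] Pass FA version in ViT Attn) & PR #30789 ([ROCm] [Bugfix] Fix torch sdpa hallucination):
- Summary: The two PRs fix non-causal ViT attention problems on different backends. The former ensures the correct FlashAttention version number is passed to ViT attention. The latter fixes hallucinated output from the Torch SDPA backend on ROCm, caused by an earlier PR removing a necessary step.
- Impact: Together they improve the stability and correctness of the vision encoder in multimodal models, which underpins support for all kinds of VLMs.
PR #30475 ([Core][MM] Optimize encoder cache manager by operating with embeddings only):
- Summary: Refactors the underlying logic of the encoder cache manager so that it stores and handles only the embedding vectors themselves, rather than the whole contiguous token range that also contains special placeholders.
- Impact: For models such as Qwen3-VL, whose video inputs contain many non-embedding special tokens (timestamps), this substantially reduces memory usage (in the example given, available KV cache grows from 7.99 GiB to 12.68 GiB). It is a key optimization for long-context video understanding.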
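
The saving comes from no longer charging cache space for positions that carry no embedding. A minimal sketch of that accounting follows, assuming each video frame contributes a few timestamp tokens followed by its embedding tokens. The helper names and token counts are hypothetical and are not taken from the PR.

```python
def build_is_embed(num_frames: int, timestamp_tokens: int, embed_tokens: int) -> list[bool]:
    """Mark which positions of a multimodal placeholder range hold real embeddings."""
    mask: list[bool] = []
    for _ in range(num_frames):
        mask.extend([False] * timestamp_tokens)  # special tokens, no encoder output
        mask.extend([True] * embed_tokens)       # positions backed by encoder output
    return mask


def cache_slots(is_embed: list[bool], embeddings_only: bool) -> int:
    """Slots charged to the encoder cache under each accounting scheme."""
    return sum(is_embed) if embeddings_only else len(is_embed)


is_embed = build_is_embed(num_frames=4, timestamp_tokens=2, embed_tokens=6)
print(cache_slots(is_embed, embeddings_only=False))  # whole range: 32
print(cache_slots(is_embed, embeddings_only=True))   # embeddings only: 24
```

The more frames a video has, the more timestamp tokens it carries, so the gap between the two accounting schemes grows with video length.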
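PR #30815 (Update model-hosting-container-standards to 0.1.10):
- Summary: Bumps the container dependency. The new version no longer creates the /dev/shm/sagemaker_sessions folder by default, which avoids potential permission errors.
- Impact: Improves the compatibility and stability of the vLLM Docker image in specific deployment environments such as SageMaker.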

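📈 Development Activity Observations

Active contributors: the window saw many contributions from different organizations and individuals, including employees of NVIDIA, AMD, and Red Hat as well as independent developers. AMD employee micah-wil submitted a PR targeting ROCm test stability.
Code review and merge throughput: 37 PRs were merged in the window against 70 newly opened, a ratio of about 53%. The two sets only partly overlap, since many of the merged PRs (for example #29512 and #29627) were opened before the window, so the figure indicates throughput rather than a merge rate for the new PRs. Many PRs carry the ready label, meaning they have passed initial review and are awaiting CI validation or final merge.
Community engagement: users actively file bug reports and usage questions and take part in discussions (such as the in-depth exchange between the reporter and maintainer jeejeelee in Issue #30741), a sign of a healthy community.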

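💡 Issues Worth Watching

MoE online FP8 quantization accuracy (Issue #30830): The issue involves MoE and FP8 quantization, two cutting-edge performance features. Its root cause and scope are still unclear, so core developers need to investigate soon; it may affect deployments of large MoE models.
NoneType error when deploying DeepSeek-V3.1 (Issue #30736): A DCP/DBO-related error appears when deploying a complex new model such as DeepSeek-V3.1, suggesting that the distributed compilation and execution paths may have compatibility problems with new model architectures.
Introduction of the FlashMask Attention backend (RFC #30786): The discussion on whether and how to adopt FlashMask bears on the ceiling of multimodal inference performance in vLLM and is an important technical-direction decision.
Performance overhead of LoRA combined with MoE (Issue #30741): The user measured and traced the main bottleneck to LoRA being applied to the MoE layers. Although the maintainer said this is "expected and being optimized" and the issue was closed within the window, it directly affects how practical LoRA is on large MoE models, so progress on the optimization is worth following.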

📋 Appendix: Detailed Data Lists

New Issues

#30830 [Bug]: accuracy issue on MoE online fp8 quantization — bug — by yma11 (created: 2025-12-17 10:40 (UTC+8))
#30828 [Bug]: The tokenizer you are loading from ‘/models/GLM-4.6V-Flash’ with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. — bug — by jieguolove (created: 2025-12-17 09:51 (UTC+8))
#30766 [Bug]: Qwen3-32B with MTP, run failed. — bug — by wbj1995 (created: 2025-12-16 18:01 (UTC+8))
#30736 [Bug] DCP/DBO: ‘NoneType’ error building attention_metadata during DeepSeek-V3.1 deployment dummy run — bug,help wanted — by Butterfingrz (created: 2025-12-16 11:07 (UTC+8))
#30759 [Feature]: proper logging for ParsableContext — feature request — by qandrew (created: 2025-12-16 16:33 (UTC+8))
#30758 [Tracking Issue][Performance] GPT-OSS B200/GB200 performance optimization tracker — performance — by nvpohanh (created: 2025-12-16 16:26 (UTC+8))
#30824 [Feature]: xinference has xllamacpp, can you add something in vllm for gguf files — feature request — by aynig (created: 2025-12-17 08:24 (UTC+8))
#30798 [Usage]: vllm offline server lora model — usage — by zapqqqwe (created: 2025-12-17 00:38 (UTC+8))
#30819 [Bug]: vLLM inference stuck when requesting video description on VLM models — bug — by sidezrw (created: 2025-12-17 06:53 (UTC+8))
#30818 [Bug]: Unpickling MediaWithBytes dataclass leads to RecursionError — bug — by eicherseiji (created: 2025-12-17 06:27 (UTC+8))
#30786 [RFC]: FlashMask Attention Backend for PrefixLM Models — RFC — by lucianommartins (created: 2025-12-16 22:47 (UTC+8))
#30813 [Bug]: compilation_config.backend="eager" doesn’t work aot mode (yet). — bug,torch.compile — by zhxchen17 (created: 2025-12-17 03:55 (UTC+8))
#30779 [Bug]: v0.11.2 can not support Qwen2.5-Omni- — bug — by GoGo-UpUp (created: 2025-12-16 21:11 (UTC+8))
#30805 [docker] kv_connectors (lmcache) not installed on arm64 release images — installation — by amrmahdi (created: 2025-12-17 01:59 (UTC+8))
#30801 [Bug]: distilbert/distilgpt2 with graph_mode on ROCm platform generates garbage output — bug,rocm — by qli88 (created: 2025-12-17 00:59 (UTC+8))
#30776 [Usage]: Qwen3-omni’s offline usage — usage — by Auraithm (created: 2025-12-16 20:30 (UTC+8))
#30777 [Bug]: whisper-large-v3-turbo have accuracy problem on nightly build — bug — by cyysky (created: 2025-12-16 20:40 (UTC+8))
#30778 [RFC] DBO fallout tracking metrics — feature request — by markmc (created: 2025-12-16 21:07 (UTC+8))
#30763 [Feature]: Optionally Return LogSumExp in Attention — feature request — by hypnopump (created: 2025-12-16 17:18 (UTC+8))
#30752 [Bug]: ` leaked shared_memory objects to clean up at shutdown` cause offline mode start failed. — bug — by xsank (created: 2025-12-16 15:11 (UTC+8))
#30762 [Feature][P1]: docker image / cache reuse between CI and release pipelines — feature request — by Harry-Chen (created: 2025-12-16 17:06 (UTC+8))
#30757 [Performance]: Async sched: Why return AsyncGPUModelRunnerOutput util func sample_tokens — performance — by iwzbi (created: 2025-12-16 16:26 (UTC+8))
#30741 [Performance]: Performance Drops by ~20% When Using LoRA Adapter — performance — by happyfanta00 (created: 2025-12-16 12:34 (UTC+8))
#30755 [Feature][P1]: Separation of building and final base images on more variants — feature request — by Harry-Chen (created: 2025-12-16 16:08 (UTC+8))
#30750 [Bug]: assert part_size_n % scales.nelement() == 0 (Marlin FP8 Kernel) — bug — by milizhang (created: 2025-12-16 14:53 (UTC+8))

Closed Issues

#22621 [Bug]: Latency spikes Issue — bug,stale — by saifulislam79 (closed: 2025-12-17 10:11 (UTC+8))
#25903 [MM]: Optimize encoder cache memory consumption by storing encoder outputs only — feature request — by ywang96 (closed: 2025-12-17 06:18 (UTC+8))
#25187 [Docs] Feedback for /en/latest/getting_started/installation/gpu.html — documentation,rocm — by ajaxdude (closed: 2025-12-17 05:54 (UTC+8))
#30741 [Performance]: Performance Drops by ~20% When Using LoRA Adapter — performance — by happyfanta00 (closed: 2025-12-16 16:11 (UTC+8))
#25759 [Usage]: run Qwen2.5-VL with PD disaggregation raise ValueError(“Insufficient memory”) — usage — by bingchen-mi (closed: 2025-12-16 15:08 (UTC+8))
#30105 [Feature]: Extend Dual Batch Overlap (DBO) to Multi-Batch Overlap (XBO) — feature request — by jiangkuaixue123 (closed: 2025-12-16 14:23 (UTC+8))
#30726 [Feature]: Suggestion: make lazy_imports.py faster — feature request — by zou3519 (closed: 2025-12-16 11:12 (UTC+8))

New PRs

#30829 [Bugfix][CPU] Fix CPU backend ROPE dispatch for VL models — ready — by bigPYJ1151 (created: 2025-12-17 10:36 (UTC+8))
#30760 [WIP]verify_is_multiproc — no labels — by 1643661061leo (created: 2025-12-16 16:50 (UTC+8))
#30811 [ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32 — rocm,v1 — by micah-wil (created: 2025-12-17 03:39 (UTC+8))
#30825 [FusedMoE Refactor][2/N] Use Modular Kernels for Fp8 — no labels — by robertgshaw2-redhat (created: 2025-12-17 09:03 (UTC+8))
#30827 [Model] Gemma3: Support untied word embeddings — no labels — by www-spam (created: 2025-12-17 09:40 (UTC+8))
#30826 [Model] Gemma3: Support untied word embeddings — no labels — by www-spam (created: 2025-12-17 09:38 (UTC+8))
#30815 Update model-hosting-container-standards to 0.1.10 — ci/build — by mgoin (created: 2025-12-17 05:51 (UTC+8))
#30823 [Bug] Fix AttributeError: ‘ColumnParallelLinear’ object has no attribute weight_scale_inv — ready — by yewentao256 (created: 2025-12-17 07:56 (UTC+8))
#30748 [Docs] fix function name — documentation — by lengrongfu (created: 2025-12-16 14:51 (UTC+8))
#30808 [Refactor][TPU] Remove torch_xla path and use tpu-inference — documentation,performance,new-model,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1 — by weiyu0824 (created: 2025-12-17 02:32 (UTC+8))
#30767 Algo — no labels — by Mercykid-bash (created: 2025-12-16 18:02 (UTC+8))
#30795 Fix nemotron_nas intermediate_size computation — bug,ready,nvidia — by grzegorz-k-karch (created: 2025-12-16 23:49 (UTC+8))
#30753 [Core][KVConnector] Propagate block hashes in SchedulerOutput — tpu,ready,v1 — by QierLi (created: 2025-12-16 15:25 (UTC+8))
#30739 [BugFix] Support online dense model DP without overhead — ready,v1,kv-connector — by njhill (created: 2025-12-16 12:08 (UTC+8))
#30806 [docker] Allow kv_connectors install to fail on arm64 — ci/build — by amrmahdi (created: 2025-12-17 02:03 (UTC+8))
#30738 [Metrics] Model FLOPs Utilization estimation — ready,v1 — by SungMinCho (created: 2025-12-16 11:27 (UTC+8))
#30746 [SM100] Enable fp8 compute for prefill MLA — v1 — by pavanimajety (created: 2025-12-16 14:31 (UTC+8))
#30787 [CI/Build] Fix compatibility between #30244 and #30396 — ready,needs-rebase — by DarkLight1337 (created: 2025-12-16 22:49 (UTC+8))
#30756 [MM] Pass FA version in ViT Attn — ready — by NickLucche (created: 2025-12-16 16:20 (UTC+8))
#30820 [Bug] Fix compressed tensor not using deepgemm — ready — by yewentao256 (created: 2025-12-17 07:33 (UTC+8))
#30817 Replace deprecated enable_fusion with fuse_norm_quant in test_rms_group_quant — ready,ci-failure — by mgoin (created: 2025-12-17 06:05 (UTC+8))
#30822 [Bugfix][torch2.10] Fix test_qwen2_5_vl_compilation with 2.10 RC — qwen — by Lucaskabela (created: 2025-12-17 07:36 (UTC+8))
#30821 [Kernel] Enable fused_qknorm_rope_kernel supports partial rope — no labels — by jeejeelee (created: 2025-12-17 07:36 (UTC+8))
#30789 [ROCm] [Bugfix] Fix torch sdpa hallucination — rocm,ready — by tjtanaa (created: 2025-12-16 23:16 (UTC+8))
#30783 AWQ: Evaluate fused vs unfused GEMM on actual shape — no labels — by mgehre-amd (created: 2025-12-16 22:18 (UTC+8))
#30796 [BugFix][Async] clear spec tokens for preempted or resumed reqs in async — v1 — by izhuhaoran (created: 2025-12-17 00:00 (UTC+8))
#30804 [CI] Skip ci failure test — ready — by yewentao256 (created: 2025-12-17 01:33 (UTC+8))
#30745 [BugFix]Reclaim resources to prevent memory leaks when use LMCacheMPConnector — kv-connector — by wz1qqx (created: 2025-12-16 14:17 (UTC+8))
#30754 [Bugfix][DSV32] Fix overflow in topk. — bug,ready,deepseek — by dcampora (created: 2025-12-16 15:39 (UTC+8))
#30816 [UX] Make vllm bench serve discover model by default and use --input-len — performance,perf-benchmarks,ready — by mgoin (created: 2025-12-17 05:52 (UTC+8))
#30747 [Fix] uniform decode batch check — bug,ready,v1 — by Jialin (created: 2025-12-16 14:32 (UTC+8))
#30799 bump up compressed tensors version to 0.13.0 — quantization,ready,ci/build — by shanjiaz (created: 2025-12-17 00:41 (UTC+8))
#30814 [KV connector][LMCache] Only record the cuda event when there are request to store/load — kv-connector,nvidia — by ApostaC (created: 2025-12-17 05:25 (UTC+8))
#30782 [CI/Build] Skip broken ViT backend functionality test tempoarily — ready,multi-modality — by Isotr0py (created: 2025-12-16 22:02 (UTC+8))
#30810 [compile] Disable aot when eager backend is used. — ready — by zhxchen17 (created: 2025-12-17 02:54 (UTC+8))
#30743 [compile] Recompile graph module during Dynamo cache loading. — ready — by zhxchen17 (created: 2025-12-16 12:54 (UTC+8))
#30809 [compile] Ignore VLLM_FORCE_AOT_LOAD from cache factors — ready — by zhxchen17 (created: 2025-12-17 02:52 (UTC+8))
#30812 Add back support of tokenizer_mode == custom — fb-exported,meta-exported — by henryoier (created: 2025-12-17 03:41 (UTC+8))
#30769 [Frontend] Add max-completion-token option to transcription/translation endpoints — frontend,ready — by NickLucche (created: 2025-12-16 18:36 (UTC+8))
#30807 [compile] Disable aot when eager backend is used. — no labels — by zhxchen17 (created: 2025-12-17 02:17 (UTC+8))
#30790 [Release 2.10] Test Torch 2.10 RC - with skipped test — rocm,ci/build,v1,nvidia — by atalman (created: 2025-12-16 23:20 (UTC+8))
#30784 [Improvement] Persist CUDA compat libraries paths to prevent reset on apt-get — ci/build,multi-modality,nvidia — by emricksini-h (created: 2025-12-16 22:45 (UTC+8))
#30800 [PERF] Add interleaved memory allocation to NUMA module — no labels — by skaraban3807 (created: 2025-12-17 00:56 (UTC+8))
#30803 RayLLM Bugfix - Preserve obj store URL for multi engine_config creation — no labels — by omer-dayan (created: 2025-12-17 01:05 (UTC+8))
#30802 Add support for LoRA adapters in Nemotron-H models — nvidia — by danisereb (created: 2025-12-17 01:05 (UTC+8))
#30797 [Bugfix] Fix DeepSeekV32 tool parser incorrect type conversion for array/object parameters — deepseek — by fangtaosong (created: 2025-12-17 00:17 (UTC+8))
#30794 [P/D] p2p_nccl: implement async KV loading for decode stage — v1,kv-connector — by dongbo910220 (created: 2025-12-16 23:45 (UTC+8))
#30793 Tinyquant — ci/build — by Kirill-Kuznetsov-git (created: 2025-12-16 23:37 (UTC+8))
#30792 Tinyquant — ci/build — by Kirill-Kuznetsov-git (created: 2025-12-16 23:36 (UTC+8))
#30791 [Doc] Add Chinese documentation for vLLM — documentation — by 5qol255 (created: 2025-12-16 23:20 (UTC+8))
#30761 [KVConnector]: Enable Cross-layers KV cache layout for MultiConnector — kv-connector — by kfirtoledo (created: 2025-12-16 16:59 (UTC+8))
#30774 [Docs][API] Remove warning about LoRARequest being internal-only — ready — by markmc (created: 2025-12-16 19:38 (UTC+8))
#30771 Update where bytes_to_unicode is imported from — structured-output,ready,v1 — by hmellor (created: 2025-12-16 19:14 (UTC+8))
#30764 Remove head_mask from Ultravox and Swin — ready — by hmellor (created: 2025-12-16 17:33 (UTC+8))
#30768 Fix instantiation of HfHubHTTPError in LoRA test — ready — by hmellor (created: 2025-12-16 18:20 (UTC+8))
#30788 [refactor] Add prefix support to embed_tokens in DeepSeek MTP — deepseek — by zzhx1 (created: 2025-12-16 23:06 (UTC+8))
#30785 [Fix]Load kv-cache dtype from hf_quant_config.json automatically (fix for reverted PR) — ready — by danielafrimi (created: 2025-12-16 22:46 (UTC+8))
#30744 [BugFix] Fix memory spike in workspace allocation — ready,ci/build,v1 — by LucasWilkinson (created: 2025-12-16 13:09 (UTC+8))
#30773 [BugFix] Fix initialization bug in openpangu.py — no labels — by JeffLee1874 (created: 2025-12-16 19:28 (UTC+8))
#30781 [CI] add polling for precompiled wheel in python_only_compile.sh — no labels — by Harry-Chen (created: 2025-12-16 21:46 (UTC+8))
#30772 [Bugfix] Whisper fix number of allocated CrossAttn blocks per-request — ready,v1 — by NickLucche (created: 2025-12-16 19:25 (UTC+8))
#30780 Optimize workspace memory in DeepGEMM. — no labels — by halyavin (created: 2025-12-16 21:19 (UTC+8))
#30770 Don’t assume position_embedding_type will be present for BERT and RoBERTa models — ready — by hmellor (created: 2025-12-16 18:53 (UTC+8))
#30775 [Feature]: Implement naive prepare/finalize class to replace naive dispatching in fused_moe/layer.py — no labels — by teddygood (created: 2025-12-16 20:06 (UTC+8))
#30765 [Doc][CPU] Update CPU doc — documentation,ci/build — by bigPYJ1151 (created: 2025-12-16 17:55 (UTC+8))
#30749 [Refactor] [4/N] Move VLLM_SERVER_DEV endpoints into the serve directory — frontend,ci/build — by chaunceyjiang (created: 2025-12-16 14:53 (UTC+8))
#30751 [Bugfix]: prevent leaking tokens in crash log — v1 — by dr75 (created: 2025-12-16 15:07 (UTC+8))
#30742 Support LoRA of PLaMo 2/3 — documentation — by Alnusjaponica (created: 2025-12-16 12:50 (UTC+8))
#30740 Properly handle packed_modules_mapping of PLaMo2 — no labels — by Alnusjaponica (created: 2025-12-16 12:30 (UTC+8))
#30737 [Metrics] Model FLOPs Utilization estimation — v1,nvidia — by SungMinCho (created: 2025-12-16 11:14 (UTC+8))

Merged PRs

#30815 Update model-hosting-container-standards to 0.1.10 — ci/build — by mgoin (merged: 2025-12-17 09:52 (UTC+8))
#30795 Fix nemotron_nas intermediate_size computation — bug,ready,nvidia — by grzegorz-k-karch (merged: 2025-12-17 09:06 (UTC+8))
#30806 [docker] Allow kv_connectors install to fail on arm64 — ci/build — by amrmahdi (merged: 2025-12-17 08:41 (UTC+8))
#30756 [MM] Pass FA version in ViT Attn — ready — by NickLucche (merged: 2025-12-17 07:54 (UTC+8))
#30817 Replace deprecated enable_fusion with fuse_norm_quant in test_rms_group_quant — ready,ci-failure — by mgoin (merged: 2025-12-17 07:40 (UTC+8))
#30789 [ROCm] [Bugfix] Fix torch sdpa hallucination — rocm,ready — by tjtanaa (merged: 2025-12-17 07:32 (UTC+8))
#29512 [Perf][Kernels] Vectorize csrc/activations_kernels.cu — performance,kernel,moe,ready — by mgoin (merged: 2025-12-17 06:56 (UTC+8))
#30804 [CI] Skip ci failure test — ready — by yewentao256 (merged: 2025-12-17 06:47 (UTC+8))
#30731 [Bugfix] Fix broken ViT attention selection for Blackwell device — ready,nvidia — by Isotr0py (merged: 2025-12-16 13:24 (UTC+8))
#29901 [Kernel][Quantization][MoE] add marlin kernel support for turing (sm75) — quantization,kernel,ready,ci/build — by jinzhen-lin (merged: 2025-12-17 06:35 (UTC+8))
#30475 [Core][MM] Optimize encoder cache manager by operating with embeddings only — ready,v1,multi-modality,llama,qwen,ready-run-all-tests — by ywang96 (merged: 2025-12-17 06:18 (UTC+8))
#30754 [Bugfix][DSV32] Fix overflow in topk. — bug,ready,deepseek — by dcampora (merged: 2025-12-17 06:21 (UTC+8))
#29627 [Attention] Cache attention metadata builds across hybrid KV-cache groups — ready,v1 — by LucasWilkinson (merged: 2025-12-17 06:10 (UTC+8))
#30782 [CI/Build] Skip broken ViT backend functionality test tempoarily — ready,multi-modality — by Isotr0py (merged: 2025-12-16 22:45 (UTC+8))
#30014 [Perf] Do FP4 quant before All gather on flashinfer trtllmgen MOE — performance,moe,ready,nvidia — by jiahanc (merged: 2025-12-17 05:01 (UTC+8))
#30562 [Refactor] Small refactor for group topk — ready,v1 — by yewentao256 (merged: 2025-12-17 03:50 (UTC+8))
#30769 [Frontend] Add max-completion-token option to transcription/translation endpoints — frontend,ready — by NickLucche (merged: 2025-12-17 03:36 (UTC+8))
#30723 [CI] Generalize gsm8k test args and add Qwen3-Next MTP B200 test — ready,ci/build,qwen,nvidia — by mgoin (merged: 2025-12-17 03:28 (UTC+8))
#30774 [Docs][API] Remove warning about LoRARequest being internal-only — ready — by markmc (merged: 2025-12-17 00:35 (UTC+8))
#30771 Update where bytes_to_unicode is imported from — structured-output,ready,v1 — by hmellor (merged: 2025-12-17 00:05 (UTC+8))
#30764 Remove head_mask from Ultravox and Swin — ready — by hmellor (merged: 2025-12-17 00:02 (UTC+8))
#30768 Fix instantiation of HfHubHTTPError in LoRA test — ready — by hmellor (merged: 2025-12-17 00:02 (UTC+8))
#30713 [TRTLLM] Remove the MoE GEMM weight name change — bug,ready,nvidia — by minosfuture (merged: 2025-12-17 00:01 (UTC+8))
#30559 [Feat] Enable eplb with default all2all backend — performance,moe,ready — by yewentao256 (merged: 2025-12-16 23:33 (UTC+8))
#30744 [BugFix] Fix memory spike in workspace allocation — ready,ci/build,v1 — by LucasWilkinson (merged: 2025-12-16 22:46 (UTC+8))
#30772 [Bugfix] Whisper fix number of allocated CrossAttn blocks per-request — ready,v1 — by NickLucche (merged: 2025-12-16 22:20 (UTC+8))
#28624 [ROCm][MTP] Support MTP for AITER MLA backend — rocm,ready,v1 — by ganyi1996ppo (merged: 2025-12-16 22:10 (UTC+8))
#30728 update piecewise cudagraph warning when splitting_ops=[] — nvidia — by BoyuanFeng (merged: 2025-12-16 22:09 (UTC+8))
#30586 [ROCm] [AITER] [DOC] Add usage description about check functions in _aiter_ops — rocm,ready — by tjtanaa (merged: 2025-12-16 21:50 (UTC+8))
#30770 Don’t assume position_embedding_type will be present for BERT and RoBERTa models — ready — by hmellor (merged: 2025-12-16 21:40 (UTC+8))
#30636 [Bugfix] Fix RequestOutput miss lora_request — ready,v1,llama,gpt-oss — by jeejeelee (merged: 2025-12-16 17:36 (UTC+8))
#29663 [Bugfix] Fix prefix_repetition routing in bench throughput — performance,ready — by jr-shen (merged: 2025-12-16 17:37 (UTC+8))
#30158 [responsesAPI][8] input/output messages for ResponsesParser — documentation,frontend,ready,gpt-oss — by qandrew (merged: 2025-12-16 13:55 (UTC+8))
#30120 [feature] extend DBO to XBO — documentation,ready,v1 — by jiangkuaixue123 (merged: 2025-12-16 13:04 (UTC+8))
#30733 improve lazy import test — ready — by BoyuanFeng (merged: 2025-12-16 11:12 (UTC+8))
#29873 [CustomOp] Extract ApplyRotaryEmb as CustomOp and unify the dispatch logic — documentation,rocm,ready,qwen — by shen-shanshan (merged: 2025-12-16 11:08 (UTC+8))
#30626 [docker] Restructure Dockerfile for more efficient and cache-friendly builds — documentation,ready,ci/build — by amrmahdi (merged: 2025-12-16 10:52 (UTC+8))

PRs Closed Without Merging

#30826 [Model] Gemma3: Support untied word embeddings — no labels — by www-spam (closed: 2025-12-17 09:40 (UTC+8))
#29167 [BugFix] Call Base Layer Directly if LoRA A/B in Parallel Vocab are 0 — no labels — by alex-jw-brooks (closed: 2025-12-17 06:43 (UTC+8))
#30536 encoder cache optimization budget alignment — v1,multi-modality,qwen — by sunYtokki (closed: 2025-12-17 06:33 (UTC+8))
#30807 [compile] Disable aot when eager backend is used. — no labels — by zhxchen17 (closed: 2025-12-17 02:54 (UTC+8))
#30793 Tinyquant — ci/build — by Kirill-Kuznetsov-git (closed: 2025-12-16 23:38 (UTC+8))
#30792 Tinyquant — ci/build — by Kirill-Kuznetsov-git (closed: 2025-12-16 23:36 (UTC+8))
#30791 [Doc] Add Chinese documentation for vLLM — documentation — by 5qol255 (closed: 2025-12-16 23:24 (UTC+8))
#30773 [BugFix] Fix initialization bug in openpangu.py — no labels — by JeffLee1874 (closed: 2025-12-16 22:32 (UTC+8))
#30427 fix(gguf): Extract attn_logit_softcapping from GGUF metadata — no labels — by kitaekatt (closed: 2025-12-16 20:34 (UTC+8))
#29889 [Bugfix] respect user-defined flash attention version in ViT attentions — no labels — by cjackal (closed: 2025-12-16 16:43 (UTC+8))
#28570 [Feature]: Implement naive prepare/finalize class to replace naive dispatching in fused_moe/layer.py — documentation,performance,new-model,rocm,structured-output,frontend,tpu,needs-rebase,ci/build,v1 — by teddygood (closed: 2025-12-16 12:34 (UTC+8))
#30667 Support GPU tensors in tensor_data() to enable GPU-accelerated multimodal preprocessing — v1 — by storyicon (closed: 2025-12-16 11:38 (UTC+8))
#30737 [Metrics] Model FLOPs Utilization estimation — v1,nvidia — by SungMinCho (closed: 2025-12-16 11:28 (UTC+8))
#28859 [Metrics] Model FLOPs Utilization estimation — v1,fb-exported,meta-exported — by SungMinCho (closed: 2025-12-16 11:14 (UTC+8))
#24936 [WIP][performance] DP for ViT in Intern S1 model — stale — by hsliuustc0106 (closed: 2025-12-16 10:45 (UTC+8))