vLLM Development Activity Report - 2025-12-16
Time window: 2025-12-16 10:45 (UTC+8) ~ 2025-12-17 10:45 (UTC+8). Stats: 25 new issues | 7 issues closed | 70 new PRs | 37 PRs merged | 15 PRs closed without merging
📊 Daily Development Status Summary
vLLM maintained a very high level of development activity during the reporting period, with 25 new issues filed and 37 PRs merged. Work centered on performance optimization (notably for GPT-OSS, MoE, and LoRA scenarios), maturing multimodal inference support (e.g. Gemma3, Qwen-Omni), compilation-system upgrades in preparation for Torch 2.10, and the evolution of the distributed computing architecture (e.g. P/D disaggregation, Dual Batch Overlap). The community continued to fix user-reported bugs while advancing several far-reaching new features.
🎯 AMD/ROCm Ecosystem Activity
Directly AMD-related activity was light this period, but two items are worth noting:
- Issue #30801 ([Bug]: distilbert/distilgpt2 with graph_mode on ROCm platform generates garbage output):
  - Description: A user reports that with graph mode enabled (enforce_eager=False) on the ROCm platform, distilbert/distilgpt2 produces garbage output (e.g. "!!!!!!!!!!!!!!!") regardless of the prompt.
  - Analysis: This is a newly discovered bug specific to ROCm and the graph execution mode. A bot has tagged the ROCm maintainers (@hongxiayang, @tjtanaa, @vllmellm). It points to a possible compatibility or correctness problem in ROCm's graph-compilation path that needs further investigation.
- PR #30811 ([ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32):
  - Description: Submitted by an AMD engineer (micah-wil). The PR tackles flakiness in the test_async_scheduling test on ROCm: switching the test's default precision from fp16 back to float32 and pinning the ROCM_ATTN backend let it pass 35 consecutive local runs, markedly improving CI reliability.
  - Analysis: This reflects AMD's sustained investment in ROCm test stability, foundational work for keeping the quality of ROCm support high.
Summary: AMD-related updates this period were mainly bug fixes and test-stability improvements; no major features landed for Quark quantization, MI300-class hardware, or the toolchain.
💬 High-Engagement Discussions
- Issue #30830 ([Bug]: accuracy issue on MoE online fp8 quantization):
  - Core question: User yma11 reports an accuracy problem when using online FP8 quantization on MoE models and directly pings core developer vkuzo for insight.
  - Analysis: The issue is brand new with no discussion yet, but pinging a core developer by name suggests it may involve low-level quantization kernels or dispatch logic, with significant potential impact (it touches both MoE and FP8, two performance-critical features). It warrants close attention from the core team.
- RFC #30786 ([RFC]: FlashMask Attention Backend for PrefixLM Models):
  - Core question: Proposes a FlashMask attention backend to fix the currently poor performance of the hybrid attention patterns (PrefixLM) required by multimodal models such as Gemma3, targeting performance close to FlashAttention-2.
  - Main positions:
    - Proposer (lucianommartins): Provided detailed performance data (the current path is 2-4x slower) and an implementation plan, stressing its strategic value for multimodal performance.
    - Maintainer (LucasWilkinson): Affirmed the proposal's value, but suggested considering more modern options such as the FlashAttention-3 CuTE DSL or FlexAttention, and stated clearly that vLLM should not carry backward-pass kernels needed only for training. He also asked to see the existing Triton code to better understand the requirements.
    - Participant (Bhanu068): Expressed support and willingness to collaborate.
  - Points of contention: the choice of technical path (FlashMask vs. FA3/FlexAttention) and the implementation scope (whether to include the backward pass).
  - Status: Discussion ongoing; the proposer addressed the maintainer's concerns, explaining that FlashMask was chosen for broader platform coverage, and agreed to drop backward-pass support.
- Issue #30778 ([RFC] DBO fallout tracking metrics):
  - Core question: Proposes adding monitoring metrics for Dual Batch Overlap (DBO) to track when and why it falls out. Today DBO fails silently, making production throughput fluctuations hard to diagnose.
  - Main positions:
    - Proposer (markmc et al.): Laid out the diagnostic pain caused by silent DBO fallout in production, proposed concrete metrics (e.g. vllm:dbo_active, vllm:dbo_fallout_total), and linked related Slack discussions to show the urgency of the need.
  - Points of contention: none; this is a clear production requirement.
  - Status: Newly opened to gather feedback and drive implementation; it has drawn attention from several core developers, including @robertgshaw2-redhat.
🔥 Hot Topics and Trends
- Performance optimization and regression reports:
  - Proactive optimization: several trackers and proposals target performance gains, e.g. Issue #30758 (GPT-OSS B200/GB200 performance optimization tracker), PR #29512 (vectorizing the activation kernels), and PR #30014 (moving MoE FP4 quantization before the all-gather to cut communication volume).
  - Regressions: users actively report slowdowns, e.g. Issue #30741 (~20% performance drop with a LoRA adapter, traced to the MoE layers as the bottleneck) and Issue #30757 (the timing of returning AsyncGPUModelRunnerOutput in async scheduling may hurt batch overlap).
- Deepening multimodal / vision-language model (VLM) support:
  - Many issues and PRs revolve around VLMs, covering support and fixes for Gemma3, Qwen2.5/3-VL, Qwen3-Omni, and others (e.g. #30779, #30777, #30819).
  - Core optimizations are underway, such as PR #30475 (optimizing encoder-cache memory use, sharply reducing memory for Qwen3-VL video inference) and RFC #30786 (raising attention performance for PrefixLM models).
- Compilation-system upgrades and hardening:
  - Multiple compile-related fixes and tests are in flight ahead of PyTorch 2.10 (e.g. PRs #30790, #30743, #30810).
  - Compilation-config management is becoming stricter, forcing test adaptations (e.g. PR #30817 fixes a test failure caused by a changed config option).
- Distributed-architecture evolution:
  - P/D (Prefill/Decode) disaggregation: PR #30794 implements NCCL-based async KV-cache loading for the decode stage to relieve head-of-line blocking under network jitter.
  - KV connectors and cache management: several PRs improve KV-connector functionality and stability (e.g. #30753, #30745, #30814).
  - Dual Batch Overlap (DBO): observability is now on the agenda (Issue #30778), while its generalization (XBO) has been merged (PR #30120).
🛠️ Key Technical Changes
- PR #30738 ([Metrics] Model FLOPs Utilization estimation):
  - Summary: Adds an optional Model FLOPs Utilization statistic. By analyzing scheduler output and the model config, it estimates and periodically logs each GPU's average compute and memory-bandwidth performance at very low overhead.
  - Impact: Gives operators and developers a key hardware-efficiency monitoring tool for bottleneck analysis and utilization tuning. Off by default; enabled via the --mfu-metrics flag.
- PR #30756 ([MM] Pass FA version in ViT Attn) & PR #30789 ([ROCm] [Bugfix] Fix torch sdpa hallucination):
  - Summary: These two PRs fix non-causal ViT attention problems on different backends: the former ensures the correct FlashAttention version number is passed for ViT attention; the latter fixes hallucinations from the Torch SDPA backend on ROCm, caused by an earlier PR removing a necessary step.
  - Impact: Together they improve the stability and correctness of the vision-encoder path in multimodal models, a foundation for all VLM support.
- PR #30475 ([Core][MM] Optimize encoder cache manager by operating with embeddings only):
  - Summary: Reworks the encoder cache manager's underlying logic to store and operate on the embedding vectors themselves rather than the whole contiguous token range, placeholder tokens included.
  - Impact: For models like Qwen3-VL whose video inputs contain many non-embedding special tokens (timestamps), this sharply cuts memory use (in the example, available KV cache grew from 7.99 GiB to 12.68 GiB), a key optimization for long-context video understanding.
- PR #30815 (Update model-hosting-container-standards to 0.1.10):
  - Summary: Bumps the container dependency package; the new version no longer creates the /dev/shm/sagemaker_sessions directory by default, avoiding a potential permissions error.
  - Impact: Improves the compatibility and stability of vLLM Docker images in specific deployment environments (e.g. SageMaker).
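To illustrate the idea behind the MFU metric in PR #30738 above: MFU compares sustained FLOP/s against the accelerator's peak. A back-of-the-envelope sketch, assuming the common ~2 FLOPs-per-parameter-per-token approximation for a dense decoder forward pass (the PR's actual estimator works from scheduler output and the model config, not this formula; numbers below are illustrative):

```python
def estimate_mfu(tokens_per_s: float, num_params: float, peak_flops: float) -> float:
    """Rough Model FLOPs Utilization: achieved FLOP/s over peak FLOP/s.

    Assumes ~2 * num_params FLOPs per generated token (dense forward pass,
    ignoring attention FLOPs) -- an illustrative simplification.
    """
    achieved_flops = 2.0 * num_params * tokens_per_s
    return achieved_flops / peak_flops


# e.g. a 7B-parameter dense model sustaining 10,000 tok/s on a ~1 PFLOP/s GPU:
mfu = estimate_mfu(tokens_per_s=10_000, num_params=7e9, peak_flops=1e15)
# 2 * 7e9 * 1e4 / 1e15 = 0.14, i.e. 14% (decode is memory-bound, so low MFU is normal)
```

Logged periodically per GPU, even a coarse number like this makes it obvious when a deployment is running far below the hardware's realistic ceiling.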
📈 Development Activity Observations
- Contributor activity: Contributions came from many organizations and individuals, including employees of NVIDIA, AMD, and Red Hat as well as independent developers. An AMD engineer (micah-wil) submitted the ROCm test-stability PR.
- Review and merge throughput: 37 of the 70 new PRs were merged, a merge rate above 50%, indicating an efficient review-and-merge pipeline. Many PRs carry the ready label, meaning they have passed initial review and await CI verification or final merge.
- Community participation: Users actively file bug reports and usage questions and join discussions (e.g. the in-depth exchange between a user and maintainer jeejeelee in Issue #30741); the community ecosystem is healthy.
💡 Issues Worth Watching
- MoE online FP8 quantization accuracy (Issue #30830): touches two cutting-edge performance features (MoE and FP8); the root cause and blast radius are still unknown, so core developers should investigate soon, as it could affect large-scale MoE deployments.
- NoneType error when deploying DeepSeek-V3.1 (Issue #30736): a DCP/DBO-related error during the dummy run while deploying this complex new model hints at compatibility gaps between the distributed compile/execute path and new model architectures.
- Introduction of a FlashMask attention backend (RFC #30786): the discussion over whether and how to adopt FlashMask bears on the future inference-performance ceiling for multimodal models in vLLM; it is an important technical-direction decision.
- LoRA-on-MoE overhead (Issue #30741): the user measured and localized the main bottleneck to applying LoRA on the MoE layers. Maintainers call this "expected and being optimized", but it directly affects LoRA's practicality on large MoE models, so the optimization's progress is worth tracking.
📋 Appendix: Detailed Data
New Issues
- #30830 [Bug]: accuracy issue on MoE online fp8 quantization — bug — by yma11 (created: 2025-12-17 10:40 (UTC+8))
- #30828 [Bug]: The tokenizer you are loading from '/models/GLM-4.6V-Flash' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. — bug — by jieguolove (created: 2025-12-17 09:51 (UTC+8))
- #30766 [Bug]: Qwen3-32B with MTP, run failed. — bug — by wbj1995 (created: 2025-12-16 18:01 (UTC+8))
- #30736 [Bug] DCP/DBO: 'NoneType' error building attention_metadata during DeepSeek-V3.1 deployment dummy run — bug,help wanted — by Butterfingrz (created: 2025-12-16 11:07 (UTC+8))
- #30759 [Feature]: proper logging for ParsableContext — feature request — by qandrew (created: 2025-12-16 16:33 (UTC+8))
- #30758 [Tracking Issue][Performance] GPT-OSS B200/GB200 performance optimization tracker — performance — by nvpohanh (created: 2025-12-16 16:26 (UTC+8))
- #30824 [Feature]: xinference has xllamacpp, can you add something in vllm for gguf files — feature request — by aynig (created: 2025-12-17 08:24 (UTC+8))
- #30798 [Usage]: vllm offline server lora model — usage — by zapqqqwe (created: 2025-12-17 00:38 (UTC+8))
- #30819 [Bug]: vLLM inference stuck when requesting video description on VLM models — bug — by sidezrw (created: 2025-12-17 06:53 (UTC+8))
- #30818 [Bug]: Unpickling MediaWithBytes dataclass leads to RecursionError — bug — by eicherseiji (created: 2025-12-17 06:27 (UTC+8))
- #30786 [RFC]: FlashMask Attention Backend for PrefixLM Models — RFC — by lucianommartins (created: 2025-12-16 22:47 (UTC+8))
- #30813 [Bug]: compilation_config.backend="eager" doesn't work aot mode (yet). — bug,torch.compile — by zhxchen17 (created: 2025-12-17 03:55 (UTC+8))
- #30779 [Bug]: v0.11.2 can not support Qwen2.5-Omni- — bug — by GoGo-UpUp (created: 2025-12-16 21:11 (UTC+8))
- #30805 [docker] kv_connectors (lmcache) not installed on arm64 release images — installation — by amrmahdi (created: 2025-12-17 01:59 (UTC+8))
- #30801 [Bug]: distilbert/distilgpt2 with graph_mode on ROCm platform generates garbage output — bug,rocm — by qli88 (created: 2025-12-17 00:59 (UTC+8))
- #30776 [Usage]: Qwen3-omni's offline usage — usage — by Auraithm (created: 2025-12-16 20:30 (UTC+8))
- #30777 [Bug]: whisper-large-v3-turbo have accuracy problem on nightly build — bug — by cyysky (created: 2025-12-16 20:40 (UTC+8))
- #30778 [RFC] DBO fallout tracking metrics — feature request — by markmc (created: 2025-12-16 21:07 (UTC+8))
- #30763 [Feature]: Optionally Return LogSumExp in Attention — feature request — by hypnopump (created: 2025-12-16 17:18 (UTC+8))
- #30752 [Bug]: ` leaked shared_memory objects to clean up at shutdown` cause offline mode start failed. — bug — by xsank (created: 2025-12-16 15:11 (UTC+8))
- #30762 [Feature][P1]: docker image / cache reuse between CI and release pipelines — feature request — by Harry-Chen (created: 2025-12-16 17:06 (UTC+8))
- #30757 [Performance]: Async sched: Why return AsyncGPUModelRunnerOutput util func sample_tokens — performance — by iwzbi (created: 2025-12-16 16:26 (UTC+8))
- #30741 [Performance]: Performance Drops by ~20% When Using LoRA Adapter — performance — by happyfanta00 (created: 2025-12-16 12:34 (UTC+8))
- #30755 [Feature][P1]: Separation of building and final base images on more variants — feature request — by Harry-Chen (created: 2025-12-16 16:08 (UTC+8))
- #30750 [Bug]: assert part_size_n % scales.nelement() == 0 (Marlin FP8 Kernel) — bug — by milizhang (created: 2025-12-16 14:53 (UTC+8))
Closed Issues
- #22621 [Bug]: Latency spikes Issue — bug,stale — by saifulislam79 (closed: 2025-12-17 10:11 (UTC+8))
- #25903 [MM]: Optimize encoder cache memory consumption by storing encoder outputs only — feature request — by ywang96 (closed: 2025-12-17 06:18 (UTC+8))
- #25187 [Docs] Feedback for `/en/latest/getting_started/installation/gpu.html` — documentation,rocm — by ajaxdude (closed: 2025-12-17 05:54 (UTC+8))
- #30741 [Performance]: Performance Drops by ~20% When Using LoRA Adapter — performance — by happyfanta00 (closed: 2025-12-16 16:11 (UTC+8))
- #25759 [Usage]: run Qwen2.5-VL with PD disaggregation raise ValueError("Insufficient memory") — usage — by bingchen-mi (closed: 2025-12-16 15:08 (UTC+8))
- #30105 [Feature]: Extend Dual Batch Overlap (DBO) to Multi-Batch Overlap (XBO) — feature request — by jiangkuaixue123 (closed: 2025-12-16 14:23 (UTC+8))
- #30726 [Feature]: Suggestion: make lazy_imports.py faster — feature request — by zou3519 (closed: 2025-12-16 11:12 (UTC+8))
New PRs
- #30829 [Bugfix][CPU] Fix CPU backend ROPE dispatch for VL models — ready — by bigPYJ1151 (created: 2025-12-17 10:36 (UTC+8))
- #30760 [WIP]verify_is_multiproc — no labels — by 1643661061leo (created: 2025-12-16 16:50 (UTC+8))
- #30811 [ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32 — rocm,v1 — by micah-wil (created: 2025-12-17 03:39 (UTC+8))
- #30825 [FusedMoE Refactor][2/N] Use Modular Kernels for Fp8 — no labels — by robertgshaw2-redhat (created: 2025-12-17 09:03 (UTC+8))
- #30827 [Model] Gemma3: Support untied word embeddings — no labels — by www-spam (created: 2025-12-17 09:40 (UTC+8))
- #30826 [Model] Gemma3: Support untied word embeddings — no labels — by www-spam (created: 2025-12-17 09:38 (UTC+8))
- #30815 Update model-hosting-container-standards to 0.1.10 — ci/build — by mgoin (created: 2025-12-17 05:51 (UTC+8))
- #30823 [Bug] Fix AttributeError: 'ColumnParallelLinear' object has no attribute `weight_scale_inv` — ready — by yewentao256 (created: 2025-12-17 07:56 (UTC+8))
- #30748 [Docs] fix function name — documentation — by lengrongfu (created: 2025-12-16 14:51 (UTC+8))
- #30808 [Refactor][TPU] Remove torch_xla path and use tpu-inference — documentation,performance,new-model,rocm,structured-output,frontend,tpu,speculative-decoding,ci/build,v1 — by weiyu0824 (created: 2025-12-17 02:32 (UTC+8))
- #30767 Algo — no labels — by Mercykid-bash (created: 2025-12-16 18:02 (UTC+8))
- #30795 Fix nemotron_nas intermediate_size computation — bug,ready,nvidia — by grzegorz-k-karch (created: 2025-12-16 23:49 (UTC+8))
- #30753 [Core][KVConnector] Propagate block hashes in SchedulerOutput — tpu,ready,v1 — by QierLi (created: 2025-12-16 15:25 (UTC+8))
- #30739 [BugFix] Support online dense model DP without overhead — ready,v1,kv-connector — by njhill (created: 2025-12-16 12:08 (UTC+8))
- #30806 [docker] Allow kv_connectors install to fail on arm64 — ci/build — by amrmahdi (created: 2025-12-17 02:03 (UTC+8))
- #30738 [Metrics] Model FLOPs Utilization estimation — ready,v1 — by SungMinCho (created: 2025-12-16 11:27 (UTC+8))
- #30746 [SM100] Enable fp8 compute for prefill MLA — v1 — by pavanimajety (created: 2025-12-16 14:31 (UTC+8))
- #30787 [CI/Build] Fix compatibility between #30244 and #30396 — ready,needs-rebase — by DarkLight1337 (created: 2025-12-16 22:49 (UTC+8))
- #30756 [MM] Pass FA version in ViT Attn — ready — by NickLucche (created: 2025-12-16 16:20 (UTC+8))
- #30820 [Bug] Fix compressed tensor not using deepgemm — ready — by yewentao256 (created: 2025-12-17 07:33 (UTC+8))
- #30817 Replace deprecated enable_fusion with fuse_norm_quant in test_rms_group_quant — ready,ci-failure — by mgoin (created: 2025-12-17 06:05 (UTC+8))
- #30822 [Bugfix][torch2.10] Fix test_qwen2_5_vl_compilation with 2.10 RC — qwen — by Lucaskabela (created: 2025-12-17 07:36 (UTC+8))
- #30821 [Kernel] Enable fused_qknorm_rope_kernel supports partial rope — no labels — by jeejeelee (created: 2025-12-17 07:36 (UTC+8))
- #30789 [ROCm] [Bugfix] Fix torch sdpa hallucination — rocm,ready — by tjtanaa (created: 2025-12-16 23:16 (UTC+8))
- #30783 AWQ: Evaluate fused vs unfused GEMM on actual shape — no labels — by mgehre-amd (created: 2025-12-16 22:18 (UTC+8))
- #30796 [BugFix][Async] clear spec tokens for preempted or resumed reqs in async — v1 — by izhuhaoran (created: 2025-12-17 00:00 (UTC+8))
- #30804 [CI] Skip ci failure test — ready — by yewentao256 (created: 2025-12-17 01:33 (UTC+8))
- #30745 [BugFix]Reclaim resources to prevent memory leaks when use LMCacheMPConnector — kv-connector — by wz1qqx (created: 2025-12-16 14:17 (UTC+8))
- #30754 [Bugfix][DSV32] Fix overflow in topk. — bug,ready,deepseek — by dcampora (created: 2025-12-16 15:39 (UTC+8))
- #30816 [UX] Make `vllm bench serve` discover model by default and use --input-len — performance,perf-benchmarks,ready — by mgoin (created: 2025-12-17 05:52 (UTC+8))
- #30747 [Fix] uniform decode batch check — bug,ready,v1 — by Jialin (created: 2025-12-16 14:32 (UTC+8))
- #30799 bump up compressed tensors version to 0.13.0 — quantization,ready,ci/build — by shanjiaz (created: 2025-12-17 00:41 (UTC+8))
- #30814 [KV connector][LMCache] Only record the cuda event when there are request to store/load — kv-connector,nvidia — by ApostaC (created: 2025-12-17 05:25 (UTC+8))
- #30782 [CI/Build] Skip broken ViT backend functionality test tempoarily — ready,multi-modality — by Isotr0py (created: 2025-12-16 22:02 (UTC+8))
- #30810 [compile] Disable aot when eager backend is used. — ready — by zhxchen17 (created: 2025-12-17 02:54 (UTC+8))
- #30743 [compile] Recompile graph module during Dynamo cache loading. — ready — by zhxchen17 (created: 2025-12-16 12:54 (UTC+8))
- #30809 [compile] Ignore VLLM_FORCE_AOT_LOAD from cache factors — ready — by zhxchen17 (created: 2025-12-17 02:52 (UTC+8))
- #30812 Add back support of tokenizer_mode == custom — fb-exported,meta-exported — by henryoier (created: 2025-12-17 03:41 (UTC+8))
- #30769 [Frontend] Add `max-completion-token` option to transcription/translation endpoints — frontend,ready — by NickLucche (created: 2025-12-16 18:36 (UTC+8))
- #30807 [compile] Disable aot when eager backend is used. — no labels — by zhxchen17 (created: 2025-12-17 02:17 (UTC+8))
- #30790 [Release 2.10] Test Torch 2.10 RC - with skipped test — rocm,ci/build,v1,nvidia — by atalman (created: 2025-12-16 23:20 (UTC+8))
- #30784 [Improvement] Persist CUDA compat libraries paths to prevent reset on `apt-get` — ci/build,multi-modality,nvidia — by emricksini-h (created: 2025-12-16 22:45 (UTC+8))
- #30800 [PERF] Add interleaved memory allocation to NUMA module — no labels — by skaraban3807 (created: 2025-12-17 00:56 (UTC+8))
- #30803 RayLLM Bugfix - Preserve obj store URL for multi engine_config creation — no labels — by omer-dayan (created: 2025-12-17 01:05 (UTC+8))
- #30802 Add support for LoRA adapters in Nemotron-H models — nvidia — by danisereb (created: 2025-12-17 01:05 (UTC+8))
- #30797 [Bugfix] Fix DeepSeekV32 tool parser incorrect type conversion for array/object parameters — deepseek — by fangtaosong (created: 2025-12-17 00:17 (UTC+8))
- #30794 [P/D] p2p_nccl: implement async KV loading for decode stage — v1,kv-connector — by dongbo910220 (created: 2025-12-16 23:45 (UTC+8))
- #30793 Tinyquant — ci/build — by Kirill-Kuznetsov-git (created: 2025-12-16 23:37 (UTC+8))
- #30792 Tinyquant — ci/build — by Kirill-Kuznetsov-git (created: 2025-12-16 23:36 (UTC+8))
- #30791 [Doc] Add Chinese documentation for vLLM — documentation — by 5qol255 (created: 2025-12-16 23:20 (UTC+8))
- #30761 [KVConnector]: Enable Cross-layers KV cache layout for MultiConnector — kv-connector — by kfirtoledo (created: 2025-12-16 16:59 (UTC+8))
- #30774 [Docs][API] Remove warning about LoRARequest being internal-only — ready — by markmc (created: 2025-12-16 19:38 (UTC+8))
- #30771 Update where `bytes_to_unicode` is imported from — structured-output,ready,v1 — by hmellor (created: 2025-12-16 19:14 (UTC+8))
- #30764 Remove `head_mask` from Ultravox and Swin — ready — by hmellor (created: 2025-12-16 17:33 (UTC+8))
- #30768 Fix instantiation of `HfHubHTTPError` in LoRA test — ready — by hmellor (created: 2025-12-16 18:20 (UTC+8))
- #30788 [refactor] Add prefix support to embed_tokens in DeepSeek MTP — deepseek — by zzhx1 (created: 2025-12-16 23:06 (UTC+8))
- #30785 [Fix]Load kv-cache dtype from hf_quant_config.json automatically (fix for reverted PR) — ready — by danielafrimi (created: 2025-12-16 22:46 (UTC+8))
- #30744 [BugFix] Fix memory spike in workspace allocation — ready,ci/build,v1 — by LucasWilkinson (created: 2025-12-16 13:09 (UTC+8))
- #30773 [BugFix] Fix initialization bug in openpangu.py — no labels — by JeffLee1874 (created: 2025-12-16 19:28 (UTC+8))
- #30781 [CI] add polling for precompiled wheel in python_only_compile.sh — no labels — by Harry-Chen (created: 2025-12-16 21:46 (UTC+8))
- #30772 [Bugfix] Whisper fix number of allocated CrossAttn blocks per-request — ready,v1 — by NickLucche (created: 2025-12-16 19:25 (UTC+8))
- #30780 Optimize workspace memory in DeepGEMM. — no labels — by halyavin (created: 2025-12-16 21:19 (UTC+8))
- #30770 Don't assume `position_embedding_type` will be present for BERT and RoBERTa models — ready — by hmellor (created: 2025-12-16 18:53 (UTC+8))
- #30775 [Feature]: Implement naive prepare/finalize class to replace naive dispatching in fused_moe/layer.py — no labels — by teddygood (created: 2025-12-16 20:06 (UTC+8))
- #30765 [Doc][CPU] Update CPU doc — documentation,ci/build — by bigPYJ1151 (created: 2025-12-16 17:55 (UTC+8))
- #30749 [Refactor] [4/N] Move VLLM_SERVER_DEV endpoints into the serve directory — frontend,ci/build — by chaunceyjiang (created: 2025-12-16 14:53 (UTC+8))
- #30751 [Bugfix]: prevent leaking tokens in crash log — v1 — by dr75 (created: 2025-12-16 15:07 (UTC+8))
- #30742 Support LoRA of PLaMo 2/3 — documentation — by Alnusjaponica (created: 2025-12-16 12:50 (UTC+8))
- #30740 Properly handle `packed_modules_mapping` of PLaMo2 — no labels — by Alnusjaponica (created: 2025-12-16 12:30 (UTC+8))
- #30737 [Metrics] Model FLOPs Utilization estimation — v1,nvidia — by SungMinCho (created: 2025-12-16 11:14 (UTC+8))
Merged PRs
- #30815 Update model-hosting-container-standards to 0.1.10 — ci/build — by mgoin (merged: 2025-12-17 09:52 (UTC+8))
- #30795 Fix nemotron_nas intermediate_size computation — bug,ready,nvidia — by grzegorz-k-karch (merged: 2025-12-17 09:06 (UTC+8))
- #30806 [docker] Allow kv_connectors install to fail on arm64 — ci/build — by amrmahdi (merged: 2025-12-17 08:41 (UTC+8))
- #30756 [MM] Pass FA version in ViT Attn — ready — by NickLucche (merged: 2025-12-17 07:54 (UTC+8))
- #30817 Replace deprecated enable_fusion with fuse_norm_quant in test_rms_group_quant — ready,ci-failure — by mgoin (merged: 2025-12-17 07:40 (UTC+8))
- #30789 [ROCm] [Bugfix] Fix torch sdpa hallucination — rocm,ready — by tjtanaa (merged: 2025-12-17 07:32 (UTC+8))
- #29512 [Perf][Kernels] Vectorize `csrc/activations_kernels.cu` — performance,kernel,moe,ready — by mgoin (merged: 2025-12-17 06:56 (UTC+8))
- #30804 [CI] Skip ci failure test — ready — by yewentao256 (merged: 2025-12-17 06:47 (UTC+8))
- #30731 [Bugfix] Fix broken ViT attention selection for Blackwell device — ready,nvidia — by Isotr0py (merged: 2025-12-16 13:24 (UTC+8))
- #29901 [Kernel][Quantization][MoE] add marlin kernel support for turing (sm75) — quantization,kernel,ready,ci/build — by jinzhen-lin (merged: 2025-12-17 06:35 (UTC+8))
- #30475 [Core][MM] Optimize encoder cache manager by operating with embeddings only — ready,v1,multi-modality,llama,qwen,ready-run-all-tests — by ywang96 (merged: 2025-12-17 06:18 (UTC+8))
- #30754 [Bugfix][DSV32] Fix overflow in topk. — bug,ready,deepseek — by dcampora (merged: 2025-12-17 06:21 (UTC+8))
- #29627 [Attention] Cache attention metadata builds across hybrid KV-cache groups — ready,v1 — by LucasWilkinson (merged: 2025-12-17 06:10 (UTC+8))
- #30782 [CI/Build] Skip broken ViT backend functionality test tempoarily — ready,multi-modality — by Isotr0py (merged: 2025-12-16 22:45 (UTC+8))
- #30014 [Perf] Do FP4 quant before All gather on flashinfer trtllmgen MOE — performance,moe,ready,nvidia — by jiahanc (merged: 2025-12-17 05:01 (UTC+8))
- #30562 [Refactor] Small refactor for group topk — ready,v1 — by yewentao256 (merged: 2025-12-17 03:50 (UTC+8))
- #30769 [Frontend] Add `max-completion-token` option to transcription/translation endpoints — frontend,ready — by NickLucche (merged: 2025-12-17 03:36 (UTC+8))
- #30723 [CI] Generalize gsm8k test args and add Qwen3-Next MTP B200 test — ready,ci/build,qwen,nvidia — by mgoin (merged: 2025-12-17 03:28 (UTC+8))
- #30774 [Docs][API] Remove warning about LoRARequest being internal-only — ready — by markmc (merged: 2025-12-17 00:35 (UTC+8))
- #30771 Update where `bytes_to_unicode` is imported from — structured-output,ready,v1 — by hmellor (merged: 2025-12-17 00:05 (UTC+8))
- #30764 Remove `head_mask` from Ultravox and Swin — ready — by hmellor (merged: 2025-12-17 00:02 (UTC+8))
- #30768 Fix instantiation of `HfHubHTTPError` in LoRA test — ready — by hmellor (merged: 2025-12-17 00:02 (UTC+8))
- #30713 [TRTLLM] Remove the MoE GEMM weight name change — bug,ready,nvidia — by minosfuture (merged: 2025-12-17 00:01 (UTC+8))
- #30559 [Feat] Enable eplb with default all2all backend — performance,moe,ready — by yewentao256 (merged: 2025-12-16 23:33 (UTC+8))
- #30744 [BugFix] Fix memory spike in workspace allocation — ready,ci/build,v1 — by LucasWilkinson (merged: 2025-12-16 22:46 (UTC+8))
- #30772 [Bugfix] Whisper fix number of allocated CrossAttn blocks per-request — ready,v1 — by NickLucche (merged: 2025-12-16 22:20 (UTC+8))
- #28624 [ROCm][MTP] Support MTP for AITER MLA backend — rocm,ready,v1 — by ganyi1996ppo (merged: 2025-12-16 22:10 (UTC+8))
- #30728 update piecewise cudagraph warning when splitting_ops=[] — nvidia — by BoyuanFeng (merged: 2025-12-16 22:09 (UTC+8))
- #30586 [ROCm] [AITER] [DOC] Add usage description about check functions in `_aiter_ops` — rocm,ready — by tjtanaa (merged: 2025-12-16 21:50 (UTC+8))
- #30770 Don't assume `position_embedding_type` will be present for BERT and RoBERTa models — ready — by hmellor (merged: 2025-12-16 21:40 (UTC+8))
- #30636 [Bugfix] Fix RequestOutput miss lora_request — ready,v1,llama,gpt-oss — by jeejeelee (merged: 2025-12-16 17:36 (UTC+8))
- #29663 [Bugfix] Fix prefix_repetition routing in bench throughput — performance,ready — by jr-shen (merged: 2025-12-16 17:37 (UTC+8))
- #30158 [responsesAPI][8] input/output messages for ResponsesParser — documentation,frontend,ready,gpt-oss — by qandrew (merged: 2025-12-16 13:55 (UTC+8))
- #30120 [feature] extend DBO to XBO — documentation,ready,v1 — by jiangkuaixue123 (merged: 2025-12-16 13:04 (UTC+8))
- #30733 improve lazy import test — ready — by BoyuanFeng (merged: 2025-12-16 11:12 (UTC+8))
- #29873 [CustomOp] Extract ApplyRotaryEmb as CustomOp and unify the dispatch logic — documentation,rocm,ready,qwen — by shen-shanshan (merged: 2025-12-16 11:08 (UTC+8))
- #30626 [docker] Restructure Dockerfile for more efficient and cache-friendly builds — documentation,ready,ci/build — by amrmahdi (merged: 2025-12-16 10:52 (UTC+8))
PRs Closed Without Merging
- #30826 [Model] Gemma3: Support untied word embeddings — no labels — by www-spam (closed: 2025-12-17 09:40 (UTC+8))
- #29167 [BugFix] Call Base Layer Directly if LoRA A/B in Parallel Vocab are 0 — no labels — by alex-jw-brooks (closed: 2025-12-17 06:43 (UTC+8))
- #30536 encoder cache optimization budget alignment — v1,multi-modality,qwen — by sunYtokki (closed: 2025-12-17 06:33 (UTC+8))
- #30807 [compile] Disable aot when eager backend is used. — no labels — by zhxchen17 (closed: 2025-12-17 02:54 (UTC+8))
- #30793 Tinyquant — ci/build — by Kirill-Kuznetsov-git (closed: 2025-12-16 23:38 (UTC+8))
- #30792 Tinyquant — ci/build — by Kirill-Kuznetsov-git (closed: 2025-12-16 23:36 (UTC+8))
- #30791 [Doc] Add Chinese documentation for vLLM — documentation — by 5qol255 (closed: 2025-12-16 23:24 (UTC+8))
- #30773 [BugFix] Fix initialization bug in openpangu.py — no labels — by JeffLee1874 (closed: 2025-12-16 22:32 (UTC+8))
- #30427 fix(gguf): Extract attn_logit_softcapping from GGUF metadata — no labels — by kitaekatt (closed: 2025-12-16 20:34 (UTC+8))
- #29889 [Bugfix] respect user-defined flash attention version in ViT attentions — no labels — by cjackal (closed: 2025-12-16 16:43 (UTC+8))
- #28570 [Feature]: Implement naive prepare/finalize class to replace naive dispatching in fused_moe/layer.py — documentation,performance,new-model,rocm,structured-output,frontend,tpu,needs-rebase,ci/build,v1 — by teddygood (closed: 2025-12-16 12:34 (UTC+8))
- #30667 Support GPU tensors in tensor_data() to enable GPU-accelerated multimodal preprocessing — v1 — by storyicon (closed: 2025-12-16 11:38 (UTC+8))
- #30737 [Metrics] Model FLOPs Utilization estimation — v1,nvidia — by SungMinCho (closed: 2025-12-16 11:28 (UTC+8))
- #28859 [Metrics] Model FLOPs Utilization estimation — v1,fb-exported,meta-exported — by SungMinCho (closed: 2025-12-16 11:14 (UTC+8))
- #24936 [WIP][performance] DP for ViT in Intern S1 model — stale — by hsliuustc0106 (closed: 2025-12-16 10:45 (UTC+8))