vLLM Development Activity Report - 2026-02-21
Time window: 2026-02-21 11:31 (UTC+8) ~ 2026-02-22 11:31 (UTC+8)
Statistics: 6 new Issues | 21 closed Issues | 32 new PRs | 26 merged PRs | 11 PRs closed without merging
📊 Daily Development Status Summary
During this reporting window (2026-02-21 to 2026-02-22), the vLLM project remained highly active, handling 32 new PRs and closing 21 Issues in a steady stream of feature work and bug fixes. Development focused on the evolution of the Model Runner V2 architecture, enhancements to multimodal and reasoning features, and compatibility and performance work across hardware platforms (NVIDIA/AMD/XPU). Meanwhile, several bug reports around large sparse models (e.g., GLM-5, DeepSeek-R1) underline the community's pressing demand for support of new models and advanced quantization techniques.
🎯 AMD/ROCm Ecosystem Activity
Directly AMD-related activity was light this cycle, but several indirectly related changes and ongoing discussions are worth noting:
- CI/test stability and compatibility fixes:
  - PR #35008 ([CI] Stabilizing ROCm amd-ci signal and minor name fix in upstream): stabilizes the ROCm CI pipeline signal by removing the blocking label from a test group that produced false failures due to an external HTTP request issue (Qwen font files), and fixes a naming problem in the upstream CI definitions. This helps keep AMD-platform CI reliable.
  - Closed Issue #27464 ([Bug]: Crashing when use PIECEWISE compilation): a crash when using PIECEWISE compilation on ROCm, closed somewhat earlier for inactivity. It suggests the ROCm-specific compilation path has had historical problems.
- Cross-platform hardware abstraction progress:
  - PR #35042 ([Platform]Add current_platform.get_num_compute_units interface): proposes a unified get_num_compute_units interface to replace the many direct calls to torch.cuda.get_device_properties().multi_processor_count scattered through the codebase. The abstraction better supports non-CUDA hardware (e.g., XPU, NPU) and also paves the way for unified compute-unit accounting on AMD ROCm. The discussion confirmed that "Compute Unit" is a vendor-neutral GPU term.
- Non-CUDA hardware support issues (indirectly related):
  - PR #35011 ([XPU] Avoid use top_k_top_p_triton kernel for xpu) and PR #35030 (Fix apply_top_k_top_p_triton called by non-cuda logits Tensor): both fix failures of the new Triton sampling kernel introduced in PR #33538 when invoked on non-CUDA devices (XPU). They illustrate the challenges of supporting a diverse hardware ecosystem and serve as reference cases for keeping similar kernels working on AMD ROCm. The discussion notes that the proper fix is to remove the kernel's internal is_cuda assertion so it can run generically.
Summary: no major AMD-specific feature work (e.g., Quark, MI300) landed this cycle. Activity centered on keeping the ROCm CI stable and advancing the underlying hardware abstraction layer that benefits all non-CUDA platforms, AMD included. This is foundational work toward a broader hardware ecosystem.
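The abstraction direction of PR #35042 can be sketched as follows. This is a minimal illustration, not vLLM's actual class hierarchy: the `Platform` classes and `pick_block_count` helper are hypothetical, and the hard-coded compute-unit counts stand in for the real device queries so the sketch runs anywhere.

```python
# Sketch of a vendor-neutral compute-unit query, modeled on the interface
# proposed in PR #35042. Classes and numbers here are illustrative stand-ins.

class Platform:
    """Base interface: each backend reports its own compute-unit count."""

    def get_num_compute_units(self, device_id: int = 0) -> int:
        raise NotImplementedError

class CudaPlatform(Platform):
    def get_num_compute_units(self, device_id: int = 0) -> int:
        # On a real system this would be
        # torch.cuda.get_device_properties(device_id).multi_processor_count;
        # hard-coded here so the sketch is self-contained.
        return 108  # e.g., an A100 has 108 SMs

class RocmPlatform(Platform):
    def get_num_compute_units(self, device_id: int = 0) -> int:
        # AMD reports Compute Units (CUs); 304 matches an MI300X.
        return 304

def pick_block_count(platform: Platform) -> int:
    # Call sites query the platform object instead of torch.cuda directly,
    # so non-CUDA backends need no special-casing.
    return 4 * platform.get_num_compute_units()
```

Under these placeholder numbers, `pick_block_count(RocmPlatform())` yields 1216, while the CUDA path yields 432; the point is that the call site is identical for both.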
💬 High-Engagement Discussions
- Issue #24706 ([Feature]: automatically use pre-compiled wheels when installed from github) (16 comments)
  - Core issue: installing via pip install git+https://github.com/vllm-project/vllm.git triggers a full source build even when a pre-compiled wheel is available, making installation painfully slow.
  - Viewpoints:
    - Proposer/supporters: the installer should detect and reuse pre-compiled wheels automatically; they suggested probing https://wheels.vllm.ai/$commit/ to check whether a wheel exists.
    - Implementation discussion: contributors debated how to fall back to the most recent commit that has a wheel, control via the VLLM_USE_PRECOMPILED environment variable, and whether the behavior should be on by default.
  - Points of contention: mostly implementation details (robustness of the detection logic, integration with existing configuration), with no fundamental disagreement.
  - Current status: auto-closed after 90 days of inactivity, but the underlying need is real and may be re-raised or addressed by another PR.
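The detection idea discussed above can be sketched as below. Only the https://wheels.vllm.ai/$commit/ URL pattern and the VLLM_USE_PRECOMPILED variable come from the issue; the function names and the injectable probe are illustrative so the sketch can be exercised offline.

```python
# Sketch of the wheel-detection idea from Issue #24706: before building from
# source, probe the wheel index for the current commit and fall back to
# compilation only when no wheel exists. The probe is injectable so the
# decision logic is testable without network access.
import os
import urllib.request
from typing import Callable

WHEEL_INDEX = "https://wheels.vllm.ai"

def wheel_url_for(commit: str) -> str:
    return f"{WHEEL_INDEX}/{commit}/"

def http_exists(url: str) -> bool:
    # Real probe: a HEAD request, treating any 2xx response as "wheel exists".
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False

def should_use_precompiled(commit: str,
                           probe: Callable[[str], bool] = http_exists) -> bool:
    # An explicit env-var override, as discussed in the issue.
    override = os.environ.get("VLLM_USE_PRECOMPILED")
    if override is not None:
        return override == "1"
    return probe(wheel_url_for(commit))
```

The fallback search for the nearest ancestor commit that has a wheel, also discussed in the issue, would sit on top of this: walk `git log` and call the probe per commit until it returns true.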
- Issue #25301 ([Feature]: Add tokenizer process pool to eliminate preprocessing bottleneck) (15 comments)
  - Core issue: under high concurrency with long-context requests, single-threaded tokenizer preprocessing becomes a performance bottleneck, badly hurting TTFT and throughput.
  - Viewpoints:
    - Reporter and supporters: performance traces demonstrate the bottleneck, especially for small models (e.g., embedding models), where the GPU sits severely underutilized waiting for tokenization; they propose parallelizing with a multi-process tokenizer pool.
    - Core maintainers: acknowledge the problem; for small models the CPU is the bottleneck, which increasing --api-server-count can mitigate. They also note that the cleaner code paths in Model Runner V2 may be a better home for this kind of optimization.
    - Other users: shared their own encounters with the bottleneck and pointed out that repeated processor re-initialization triggered by minor changes in multimodal parameters is another potential bottleneck.
  - Points of contention: whether and how to introduce a process pool; the discussion leans toward "how" rather than "whether", with consensus that the bottleneck is real and needs fixing.
  - Current status: closed for inactivity, but PR #27180 had previously been opened to address the problem.
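The pooling idea above can be sketched as follows. The issue proposes a process pool; a thread pool stands in here so the sketch stays portable (HF "fast" tokenizers release the GIL, so threads already help, while a process pool sidesteps the GIL entirely). The `TokenizerPool` class and the toy `tokenize()` function are illustrative, not vLLM code.

```python
# Sketch of the tokenizer-pool idea from Issue #25301: fan incoming prompts
# out to a pool of workers instead of tokenizing serially on the serving
# thread. The toy tokenize() is a placeholder for a real tokenizer call.
from concurrent.futures import ThreadPoolExecutor

def tokenize(prompt: str) -> list[int]:
    # Placeholder: map each whitespace-separated token to its length.
    return [len(tok) for tok in prompt.split()]

class TokenizerPool:
    def __init__(self, num_workers: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=num_workers)

    def encode_batch(self, prompts: list[str]) -> list[list[int]]:
        # Executor.map() preserves input order, so results line up
        # with the requests that produced them.
        return list(self._pool.map(tokenize, prompts))

pool = TokenizerPool(num_workers=2)
ids = pool.encode_batch(["hello world", "vLLM tokenizer pool"])
```

A real implementation also has to keep tokenizer state (chat templates, special tokens) consistent across workers, which is part of why the maintainers pointed at the cleaner Model Runner V2 code paths.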
- PR #34466 ([CI/Build] Add opentelemetry libs in default vllm build) (14 comments)
  - Core issue: whether the OpenTelemetry (observability) libraries should ship as default vLLM dependencies, sparing users manual installs and the startup failures caused by missing packages.
  - Viewpoints:
    - Proposer: observability is essential in production; making it a default dependency reduces user error with negligible impact on image size.
    - Discussion: centered on specifics such as version constraints and conflicts with existing CI test environments (the dependency update had to merge before the manual install steps could be removed from tests), plus mild concern over whether every user needs the dependency.
  - Current status: merged. A sign that vLLM increasingly values production readiness, treating observability infrastructure as a standard component.
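The failure mode this PR removes is the classic optional-import pattern, sketched below. This is illustrative only, not vLLM's actual tracing code; with the libraries in the default dependency set, the `except ImportError` branch simply stops firing in practice.

```python
# Sketch of tracing guarded by an optional OpenTelemetry import, degrading
# to a bare call when the libraries are absent. Illustrative only.
try:
    from opentelemetry import trace
    _tracer = trace.get_tracer("vllm.sketch")
    OTEL_AVAILABLE = True
except ImportError:
    _tracer = None
    OTEL_AVAILABLE = False

def traced(name: str, fn, *args, **kwargs):
    """Run fn under a span when tracing is available, bare otherwise."""
    if _tracer is None:
        return fn(*args, **kwargs)
    with _tracer.start_as_current_span(name):
        return fn(*args, **kwargs)

result = traced("tokenize", lambda s: s.split(), "a b c")
```

The drawback of this pattern, and the motivation for the PR, is that code paths which *require* the library (e.g., `--otlp-traces-endpoint`) fail at startup when it is missing, which guarded no-ops cannot fix.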
🔥 Hot Topics and Trends
- Model support and compatibility bugs: several new Issues reflect the challenge of supporting newly released or complex models.
  - GLM-5 (DSA): Issue #35021 reports that GLM-5 and other models using DeepSeek sparse attention crash on sm80 GPUs (A100) because of a hard dependency on sm90+ DeepGemm TMA instructions; a fallback path, or at least a clear error message, is urgently needed.
  - Qwen3.5 FP8: Issue #35015 reports CUDA illegal-access errors in CUTLASS GEMM under heavy workloads.
  - DeepSeek-R1: the related performance PR #34900 was merged, showing sustained community investment in running such reasoning-heavy models efficiently.
- Inference performance and optimization:
  - Speculative decoding: very active discussion around Eagle3 support (PRs #35029, #35040), MTP bug fixes (Issue #35031, PR #35041), and performance-regression analysis (closed Issue #27379).
  - Memory and compute optimization: performance polish across the whole stack, from low-level kernels (PR #34974 optimizing the rejection sampler) to memory management (Issue #35023 proposing lazy allocation of the embedding buffer) to scheduling (PR #35006 optimizing sliding-window cache lookup).
- Feature and API expansion:
  - Multimodal and speech: Qwen3-ASR realtime streaming support (PR #34613) and automatic language detection for Whisper (PR #34342) were added.
  - API and protocol: one PR extends the Anthropic Messages API with reasoning/thinking output (PR #35035), and another adds attention capture to the OpenAI API (PR #35014, which sparked debate over design complexity).
  - Structured output: PR #35022 adds structured output (e.g., JSON grammar) support to beam search, rounding out the constrained-generation capabilities.
- Compliance and productionization: Issue #35005 requests a data-governance and EU AI Act compliance guide for vLLM deployments, reflecting the project's growing adoption in enterprise and regulated environments.
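The lazy-allocation proposal in Issue #35023 can be sketched as a deferred buffer behind a property. Plain Python lists stand in for GPU tensors, and the `RunnerState` class and sizes are illustrative, not vLLM's gpu_model_runner.

```python
# Sketch of the lazy-allocation idea in Issue #35023: defer the embedding
# buffer until a request actually needs it, so runners serving models that
# never consume it pay no memory cost at init.
class RunnerState:
    def __init__(self, max_tokens: int, hidden_size: int):
        self._shape = (max_tokens, hidden_size)
        self._embed_buffer = None  # nothing allocated at init

    @property
    def embed_buffer(self):
        if self._embed_buffer is None:
            # First touch triggers allocation (torch.zeros(...) on the
            # target device in a real runner).
            rows, cols = self._shape
            self._embed_buffer = [[0.0] * cols for _ in range(rows)]
        return self._embed_buffer

state = RunnerState(max_tokens=4, hidden_size=8)
allocated_at_init = state._embed_buffer is not None  # stays False
buf = state.embed_buffer  # first access allocates
```

The trade-off is the usual one for lazy allocation: init-time memory savings versus a one-time allocation cost (and possible fragmentation) on the first request that needs the buffer.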
🛠️ Key Technical Changes
- PR #34900 ([Model Bash][DSR1] Add selective dynamic shape marking for CustomOp): marks specific dimensions of CustomOps (e.g., _DecodeConcatQuantFP8) as dynamic, avoiding the performance loss of torch.compile treating every dimension as symbolic. The optimization yields a 19.5% output-token throughput gain on DeepSeek-R1 and is a model example of tuning custom ops with dynamic-compilation techniques.
- PR #34342 ([Frontend] Add automatic language detection for Whisper transcription): when the user omits the language parameter, the model runs a short detection step first. This noticeably improves API usability and shows how an abstract interface (SupportsExplicitLanguageDetection) can elegantly add a new capability to models.
- PR #34574 ([Frontend] Support multimodal inputs for late-interaction scoring (ColQwen3) + NewModel): fixes the /score and /rerank endpoints' lack of support for image and other multimedia inputs with late-interaction models such as ColQwen3, bringing cross-modal retrieval scoring and reranking to vLLM for the first time. It also adds native support for NVIDIA's Nemotron-Colembed model, expanding the retrieval-model ecosystem.
- PR #35036 ([Model Runner V2] Support attention group) & PR #35029 ([Model Runner V2] Support Eagle3): these two merged PRs mark the continued evolution of the Model Runner V2 architecture. V2 is making progress toward unified, optimized model execution paths and now supports more complex attention arrangements and the Eagle3 speculative-decoding scheme, laying groundwork for future performance and feature convergence.
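The intuition behind PR #34900's selective dynamic-dim marking can be shown with a toy shape-specializing "compiler". Marking a dimension dynamic wildcards it in the specialization key, so varying batch sizes stop triggering recompiles while the remaining static dims stay fully optimizable. This is purely illustrative; torch.compile's real machinery (symbolic shapes, guards) is far richer.

```python
# Toy model of shape specialization: each unseen key triggers an
# (expensive) compile. Dynamic dims are erased from the key.
class ShapeSpecializingCompiler:
    def __init__(self, dynamic_dims: frozenset[int] = frozenset()):
        self.dynamic_dims = dynamic_dims
        self.cache: dict[tuple, object] = {}
        self.compile_count = 0

    def _key(self, shape: tuple[int, ...]) -> tuple:
        # Dynamic dims become wildcards; static dims specialize the code.
        return tuple(None if i in self.dynamic_dims else d
                     for i, d in enumerate(shape))

    def run(self, shape: tuple[int, ...]) -> None:
        key = self._key(shape)
        if key not in self.cache:
            self.compile_count += 1  # stand-in for an expensive compile
            self.cache[key] = object()

fully_static = ShapeSpecializingCompiler()
batch_dynamic = ShapeSpecializingCompiler(dynamic_dims=frozenset({0}))
for batch in (1, 2, 4, 8):
    fully_static.run((batch, 4096))   # every new batch size recompiles
    batch_dynamic.run((batch, 4096))  # dim 0 dynamic: one compile total
```

PR #34900 addresses the opposite excess, torch.compile marking *all* dims symbolic, but the mechanism is the same: choosing per-dimension which axes specialize and which stay dynamic.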
📈 Development Activity Observations
- Efficient merging: 26 PRs merged within 24 hours, evidence of a very efficient review-and-merge pipeline. Many PRs (e.g., several Model Runner V2 PRs) were merged the same day they were opened.
- Contributor diversity: beyond the core maintainers (@WoosukKwon, @DarkLight1337, @njhill), engineers from NVIDIA, Meta, and elsewhere contributed, along with many independent contributors fixing specific models or features.
- Issue management: 21 stale Issues were closed, mostly by automation after long inactivity, showing the project working down its backlog. New Issues, meanwhile, are high quality, with detailed reproduction steps and analysis.
- AMD-related contributions: commits led by AMD engineers (e.g., @AndreasKaratzas) this cycle centered on CI stability and test fixes; no large PRs pushed AMD-specific new features.
💡 Issues Worth Watching
- Issue #35031 ([Bug]: MTP Speculative Decoding with NVFP4: Weight Shape Mismatch): a weight-shape mismatch when MTP speculative decoding is combined with NVFP4 quantization. PR #35041 proposes a fix that replaces nn.Linear layers not wired into the quantization framework with ReplicatedLinear. The bug exposes layer-definition consistency problems that surface when complex speculative decoding meets new quantization formats.
- Issue #35021 ([Bug]: GLM-5 (Sparse MLA / DSA models) cannot run on sm80 GPUs (A100/A800)): highlights GLM-5's hard dependency on newer hardware instructions (sm90+ TMA), with no fallback path on the still widely deployed A100. This is not only a vLLM problem; it reflects a gap between frontier model algorithms and mainstream hardware deployments that the community and hardware vendors will need to close together.
- PR #35014 ([Feature] Add per-request attention capture to the OpenAI-compatible API): proposes exposing attention scores through the API for interpretability research. Core maintainer @DarkLight1337 cautioned that returning large tensors directly through the API brings IPC overhead and design complexity, and suggested drawing on the KV-cache-connector design instead. The discussion touches the balance between exposing deep debugging information from the core inference engine and keeping the API lean and efficient.
📋 Appendix: Detailed Data Lists
New Issues
- #35031 [Bug]: MTP Speculative Decoding with NVFP4: Weight Shape Mismatch — bug — by eleqtrizit (created: 2026-02-22 03:33 (UTC+8))
- #35028 [Bug]: RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmEx — bug — by shahizat (created: 2026-02-22 01:55 (UTC+8))
- #35023 [Feature]: Remove embedding initialization in cases where embedding is not needed in gpu_model_runner init — feature request — by labAxiaoming (created: 2026-02-21 21:22 (UTC+8))
- #35021 [Bug]: GLM-5 (Sparse MLA / DSA models) fail to run on sm80 GPUs (A100/A800) due to a hard DeepGemm dependency with no fallback — bug — by qjxjy123 (created: 2026-02-21 20:02 (UTC+8))
- #35015 [Bug]: Qwen3.5 FP8 CUDA Illegal Access in CUTLASS GEMM — bug — by kimbochen (created: 2026-02-21 17:13 (UTC+8))
- #35005 Data Governance & Compliance Guide for vLLM Deployments (EU AI Act Article 6) — no labels — by desiorac (created: 2026-02-21 11:44 (UTC+8))
Closed Issues
- #21453 [Feature]: sm_120 support — feature request,stale — by shahizat (closed: 2026-02-22 10:17 (UTC+8))
- #21653 [Bug]: Voxtral on Docker not launching — bug,stale — by Fhrozen (closed: 2026-02-22 10:17 (UTC+8))
- #22851 [Usage]: Custom position ids for prompts and generations — usage,stale — by shivamag125 (closed: 2026-02-22 10:17 (UTC+8))
- #24570 [Usage]: Phi-4-multimodal-instruct: Model Does Not Support Transcriptions API Error — usage,stale — by BugsBuggy (closed: 2026-02-22 10:16 (UTC+8))
- #24706 [Feature]: automatically use pre-compiled wheels when installed from github — good first issue,feature request,stale — by youkaichao (closed: 2026-02-22 10:16 (UTC+8))
- #25301 [Feature]: Add tokenizer process pool to eliminate preprocessing bottleneck — feature request,stale — by Zhathw (closed: 2026-02-22 10:16 (UTC+8))
- #25381 [Usage]: NCCL ERROR — usage,stale — by Tag-b (closed: 2026-02-22 10:16 (UTC+8))
- #26525 [Bug]: v0.11.0 New default VLLM_ALLREDUCE_USE_SYMM_MEM=1 prevent tensor-parallel on gpt-oss-120b — bug,stale — by Weihrauch (closed: 2026-02-22 10:16 (UTC+8))
- #26770 [Bug]: Inference fails after the model loads successfully with the vLLM v1 engine on 4x 2080 Ti — bug,stale — by desktop4025 (closed: 2026-02-22 10:16 (UTC+8))
- #27379 [Bug]: [Spec Decode] ngram speculation performance regression at higher concurrency — bug,stale — by PramuPerera (closed: 2026-02-22 10:16 (UTC+8))
- #27389 [Bug]: api_server.py: error: unrecognized arguments: /root/models/Qwen3-8B — bug,stale — by jiangxinufo (closed: 2026-02-22 10:16 (UTC+8))
- #27401 [Bug]: Deepseek3.2 exp awq seems to not work on new vllm — bug,stale — by Eggwardhan (closed: 2026-02-22 10:16 (UTC+8))
- #27425 [Installation]: FlashMLA still uses old cutlass 3.9.0 — installation,stale — by titanous (closed: 2026-02-22 10:16 (UTC+8))
- #27428 [Bug]: ldconfig segfaults (SIGSEGV 11) in vllm-openai images, making them unusable for non-root Singularity users — bug,stale — by whynotkimhari (closed: 2026-02-22 10:16 (UTC+8))
- #27448 [Usage]: how to pass multi turn multimode messages to Vllm? — usage,stale — by cqray1990 (closed: 2026-02-22 10:16 (UTC+8))
- #27454 [Usage]: How to set the expert id on each EP by myself after setting EP in Deepseek (how to reorder experts?) — usage,stale — by HameWu (closed: 2026-02-22 10:16 (UTC+8))
- #27462 [Bug]: --max-num-seqs limits total running sequences instead of scheduled sequences, causing severe underutilization in PP — bug,stale — by King-ty (closed: 2026-02-22 10:16 (UTC+8))
- #27464 [Bug]: Crashing when use PIECEWISE compilation — bug,rocm,stale — by Rus-P (closed: 2026-02-22 10:15 (UTC+8))
- #27471 [Bug]: NVFP4A16 spurious warning that GPU doesn’t support Fp4 — bug,stale — by mratsim (closed: 2026-02-22 10:15 (UTC+8))
- #27473 [Feature]: GLM4MOE gruff support — feature request,stale — by Rane2021 (closed: 2026-02-22 10:15 (UTC+8))
- #27479 [Bug]: Low GPU utilization with Embedding Model — bug,stale — by JhaceLam (closed: 2026-02-22 10:15 (UTC+8))
New PRs
- #35042 [Platform]Add current_platform.get_num_compute_units interface — performance,rocm,v1,nvidia — by jikunshang (created: 2026-02-22 10:09 (UTC+8))
- #35008 [CI] Stabilizing ROCm amd-ci signal and minor name fix in upstream — rocm,ci/build — by AndreasKaratzas (created: 2026-02-21 13:14 (UTC+8))
- #35009 update-pillow-version to mitigate CVE-2026-25990 — ci/build — by to-curiosity (created: 2026-02-21 13:27 (UTC+8))
- #35013 [CI/Build] Fix gRPC version mismatch — rocm,ready,ci/build,ci-failure — by DarkLight1337 (created: 2026-02-21 16:45 (UTC+8))
- #35039 [Model Runner V2][Minor] Remove redundant do_spec_decode field — ready,v1 — by njhill (created: 2026-02-22 09:42 (UTC+8))
- #35041 [Bug Fix] MTP Speculative Decoding with NVFP4: Weight Shape Mismatch — bug,deepseek — by jagguvarma15 (created: 2026-02-22 10:08 (UTC+8))
- #35011 [XPU] Avoid use top_k_top_p_triton kernel for xpu — ready,v1 — by jikunshang (created: 2026-02-21 15:15 (UTC+8))
- #35040 [Model Runner V2] Enable CUDA graph for Eagle3 — v1,nvidia — by WoosukKwon (created: 2026-02-22 09:52 (UTC+8))
- #35030 Fix apply_top_k_top_p_triton called by non-cuda logits Tensor — ready,v1,nvidia,meta-exported,fb-exported — by xli (created: 2026-02-22 03:25 (UTC+8))
- #35034 fix(tool_parsers): cache tokenizer encode/decode to prevent concurrent access panic — no labels — by timon0305 (created: 2026-02-22 06:39 (UTC+8))
- #35035 feat: add reasoning/thinking support to Anthropic /v1/messages endpoint — frontend — by timon0305 (created: 2026-02-22 08:07 (UTC+8))
- #35038 [BugFix] Refactor add max_num_tokens_per_forward_pass to account for drafting — bug,rocm,speculative-decoding,v1,nvidia — by LucasWilkinson (created: 2026-02-22 09:04 (UTC+8))
- #35037 feat: add reasoning_tokens to completion_tokens_details in chat completion usage — frontend — by timon0305 (created: 2026-02-22 09:00 (UTC+8))
- #35036 [Model Runner V2] Support attention group — v1,nvidia — by WoosukKwon (created: 2026-02-22 08:25 (UTC+8))
- #35033 [Llama4,CI] Bring back Llama-4 bug fixes, and also fix Maverick tests — bug,ready,multi-modality,llama — by eldarkurtic (created: 2026-02-22 06:12 (UTC+8))
- #35032 [CI/Build] Remove redundant OpenTelemetry pip install from CI configs — ci/build — by vladmihailescu (created: 2026-02-22 05:12 (UTC+8))
- #35029 [Model Runner V2] Support Eagle3 (no CUDA graph) — v1,nvidia — by WoosukKwon (created: 2026-02-22 03:22 (UTC+8))
- #35022 [Core] Support structured outputs for beam search — frontend — by guan404ming (created: 2026-02-21 21:12 (UTC+8))
- #35007 [Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning — bug — by WindChimeRan (created: 2026-02-21 12:33 (UTC+8))
- #35025 [Refactor] Simplify dummy data generation — documentation,speculative-decoding,ready,multi-modality,llama,qwen,deepseek — by DarkLight1337 (created: 2026-02-21 22:40 (UTC+8))
- #35027 [Benchmark] Use sns.relplot for plotting — performance,ready — by DarkLight1337 (created: 2026-02-22 00:19 (UTC+8))
- #35026 fix(models): apply embedding_multiplier to inputs_embeds in GraniteMoeHybrid — no labels — by nightcityblade (created: 2026-02-21 23:08 (UTC+8))
- #35024 [Deprecation] Remove old locations of get_tokenizer and resolve_hf_chat_template — frontend,ready — by DarkLight1337 (created: 2026-02-21 21:57 (UTC+8))
- #35014 [Feature] Add per-request attention capture to the OpenAI-compatible API — documentation,frontend,v1 — by Parkprogrammer (created: 2026-02-21 16:57 (UTC+8))
- #35010 [XPU] allow TORCH_SDPA/TRITON_ATTN as XPU vit Backend — no labels — by yma11 (created: 2026-02-21 13:52 (UTC+8))
- #35020 [Model] Add GGUF loading support for Qwen3-Next — qwen — by laudney (created: 2026-02-21 19:47 (UTC+8))
- #35018 perf(v1): optimize InputBatch.swap_states by swapping active token prefixes — v1 — by VedantMadane (created: 2026-02-21 18:16 (UTC+8))
- #35019 [GGUF] Fix loading of fused/shard-less quantized weights — no labels — by laudney (created: 2026-02-21 19:46 (UTC+8))
- #35012 [Benchmark] Improve benchmarks — performance,ready — by DarkLight1337 (created: 2026-02-21 16:32 (UTC+8))
- #35017 Add support for DeepSeek Attention replay on CUDA — deepseek,nvidia,meta-exported,fb-exported — by maazmusameta (created: 2026-02-21 17:34 (UTC+8))
- #35016 Add env variable VLLM_DISABLE_FLASHINFER_CONCAT_MLA_K to disable FlashInfer concat_mla_k — meta-exported,fb-exported — by maazmusameta (created: 2026-02-21 17:29 (UTC+8))
- #35006 [Core]Optimize SlidingWindowManager.find_longest_cache_hit by skipping positions on cache miss — v1 — by lichuang (created: 2026-02-21 12:12 (UTC+8))
Merged PRs
- #35013 [CI/Build] Fix gRPC version mismatch — rocm,ready,ci/build,ci-failure — by DarkLight1337 (merged: 2026-02-22 03:14 (UTC+8))
- #35036 [Model Runner V2] Support attention group — v1,nvidia — by WoosukKwon (merged: 2026-02-22 08:42 (UTC+8))
- #34900 [Model Bash][DSR1] Add selective dynamic shape marking for CustomOp — performance,ready,v1,deepseek,nvidia — by vadiklyutiy (merged: 2026-02-22 08:28 (UTC+8))
- #35029 [Model Runner V2] Support Eagle3 (no CUDA graph) — v1,nvidia — by WoosukKwon (merged: 2026-02-22 04:55 (UTC+8))
- #34466 [CI/Build] Add opentelemetry libs in default vllm build (requirements/common.txt) — documentation,ready,ci/build — by vladmihailescu (merged: 2026-02-21 11:54 (UTC+8))
- #34342 [Frontend] Add automatic language detection for Whisper transcription — frontend,ready,multi-modality — by spacecheck (merged: 2026-02-21 20:49 (UTC+8))
- #34791 [Bugfix] Gate 256-bit instructions to CUDA 12.9+ — bug,ready,nvidia — by huydhn (merged: 2026-02-21 20:48 (UTC+8))
- #35012 [Benchmark] Improve benchmarks — performance,ready — by DarkLight1337 (merged: 2026-02-21 18:31 (UTC+8))
- #34960 [Doc] Fix example of eagle3 — documentation,ready — by petrpechman (merged: 2026-02-21 17:57 (UTC+8))
- #34765 [Core] Minor structured-output related scheduler optimization — ready,v1 — by njhill (merged: 2026-02-21 17:38 (UTC+8))
- #34896 [PD] Change kv_load_failure_policy Default from “recompute” to “fail” — documentation,ready,v1,kv-connector — by NickLucche (merged: 2026-02-21 17:34 (UTC+8))
- #34688 [ROCm] Enable bitsandbytes quantization support on ROCm — documentation,rocm,ready,ci/build — by Abdennacer-Badaoui (merged: 2026-02-21 16:34 (UTC+8))
- #34636 [ROCm][Bugfix]: Only save unpadded sizes for shared_experts in MoERunner to fix rmsnorm pad fusion — bug,rocm,ready — by Rohan138 (merged: 2026-02-21 11:56 (UTC+8))
- #34541 [ROCM] Optimize ROCM_AITER_FA spec decode eagle performance — rocm,ready,v1 — by jennyyyyzhen (merged: 2026-02-21 12:32 (UTC+8))
- #34570 [ROCm][AITER] Fix aiter paged_attention_v1 decode for sliding window and head_size < 64 — rocm,ready,v1 — by AndreasKaratzas (merged: 2026-02-21 12:25 (UTC+8))
- #34599 [ROCm][CI] Fix spec decode logprobs flakiness and parametrize tree attention backends — rocm,speculative-decoding,ready,v1 — by AndreasKaratzas (merged: 2026-02-21 12:25 (UTC+8))
- #34567 [CI] Fix ColBERT HF comparison tests on AMD CI + refactor — rocm,ready — by AndreasKaratzas (merged: 2026-02-21 12:12 (UTC+8))
- #33949 [CI][MCP][Harmony] Heavy refactoring Harmony & MCP response tests and stabilizing with deterministic test infrastructure — frontend,ready,gpt-oss — by AndreasKaratzas (merged: 2026-02-21 12:03 (UTC+8))
- #33304 [feat] Add per-block extra_keys to KV events — documentation,ready,v1,multi-modality — by zhongdaor-nv (merged: 2026-02-21 12:11 (UTC+8))
- #34574 [Frontend] Support multimodal inputs for late-interaction scoring (ColQwen3) + NewModel: nvidia/nemotron-colembed — documentation,new-model,rocm,frontend,ready,multi-modality,qwen,nvidia — by craftsangjae (merged: 2026-02-21 12:01 (UTC+8))
- #34613 [Realtime] Add Qwen3-ASR realtime streaming support — new-model,frontend,ready,qwen — by pougetat (merged: 2026-02-21 11:59 (UTC+8))
- #34974 [Kernel] Optimize sample_recovered_tokens_kernel — performance,ready,v1 — by xyang16 (merged: 2026-02-21 11:59 (UTC+8))
- #34904 Support prompt_embeds for pooling requests in output processor — ready,v1 — by laviier (merged: 2026-02-21 11:57 (UTC+8))
- #34959 [Misc] Fix mypy errors in vllm/profiler and remove from exclude list — ready — by taneem-ibrahim (merged: 2026-02-21 11:56 (UTC+8))
- #34928 [Kernel] [Helion] [9/N] Canonicalize GPU variant names to base model names — ready — by gmagogsfm (merged: 2026-02-21 11:55 (UTC+8))
- #30286 [LoRA] Support Quantized Adapters — ready — by yugong333 (merged: 2026-02-21 11:54 (UTC+8))
PRs Closed Without Merging
- #24491 [V1] add generate optional in health api — frontend,needs-rebase,stale,v1 — by lengrongfu (closed: 2026-02-22 10:16 (UTC+8))
- #26730 granite-4.0-h: fix prefix naming and add AWQ compatibility — stale — by toncao (closed: 2026-02-22 10:16 (UTC+8))
- #27248 [MISC] Add prefix cache reset to LMCache CPU offload example — documentation,stale,kv-connector — by sakunkun (closed: 2026-02-22 10:16 (UTC+8))
- #27310 [Frontend][gpt-oss] Allow system message to overwrite model identity — frontend,stale,gpt-oss — by lacora (closed: 2026-02-22 10:16 (UTC+8))
- #34925 [Bug] Fix pooling unit test run_mteb_rerank — bug,ready — by yewentao256 (closed: 2026-02-22 01:43 (UTC+8))
- #35026 fix(models): apply embedding_multiplier to inputs_embeds in GraniteMoeHybrid — no labels — by nightcityblade (closed: 2026-02-22 00:02 (UTC+8))
- #27109 Add missing opentelemetry dependency to base docker image — ci/build,unstale — by Aymendje (closed: 2026-02-21 22:43 (UTC+8))
- #35020 [Model] Add GGUF loading support for Qwen3-Next — qwen — by laudney (closed: 2026-02-21 20:19 (UTC+8))
- #34693 [Bugfix] Fix mypy errors for StructuredOutputsParams by using stdlib dataclass — bug,ready — by hyeongyun0916 (closed: 2026-02-21 20:02 (UTC+8))
- #27993 [Kernel] Remove Redundant Prefill Support From 3D Triton Attention Kernel — needs-rebase,unstale — by jvlunteren (closed: 2026-02-21 16:45 (UTC+8))
- #34972 Fix different embeddings produced by LLM and AsyncLLM — v1 — by guodongxiaren (closed: 2026-02-21 11:39 (UTC+8))