[vLLM GitHub Development Digest] 2026-01-22
[Overview]
- Time window: 2026-01-22 10:55 (UTC+8) ~ 2026-01-23 10:55 (UTC+8)
- New issues: 31 (label distribution: bug:19, feature request:5, cpu:4, usage:4, rocm:3)
- Closed issues: 45
- New PRs: 45 (label distribution: bug:17, ready:10, v1:10, rocm:7, ci/build:5)
- Merged PRs: 42
- PRs closed without merging: 16
[New issues]
-
#32901 [Feature]: make thinking parameter consistent with openai api — feature request — by ltm920716 (created: 2026-01-23 10:52 (UTC+8)) ### 🚀 The feature, motivation and pitch
Hello, why is the "thinking" parameter inconsistent with the OpenAI API? Is there any consideration behind this? Thank you.
### Alternatives
No response
### Additional context …
-
#32900 [Feature]: Container image WORKDIR consistency — feature request,cpu — by qhaas (created: 2026-01-23 10:43 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
The CPU image uses `/workspace` while the GPU image uses `/vllm-workspace`. While this is a minor distinction, having the same WORKDIR path would simplify the logic for those building derived container images that support both variants.
### Alternatives
No response …
-
#32858 [Bug]: Use vllm to obtain Qwen3VL last token hidden status — bug — by JasonLeeUT (created: 2026-01-22 22:15 (UTC+8)) [💬1] ### Your current environment
from hdfs_io import hopen, hexists, hlist_files,hmv from qwen_vl_utils.vision_process import smart_resize
import torch, base64, io from dataclasses import asdict import json import os import argparse …
-
#32898 [Bug]: Incorrect/Incoherent model output in Disaggregated Serving (PD Separation) on v0.14.0 — bug — by JianDan0212 (created: 2026-01-23 10:00 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
Since I'm in a completely isolated intranet environment, I'm unable to copy the parameter information out. I'm using vLLM version 0.14.0 with an A100 GPU. …
-
#32897 [Bug]: PaddleOCR-VL crash — bug — by NiuBlibing (created: 2026-01-23 09:53 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32862 [Bug]: llama4-fp8 tp=2 ep=2 doesn't work on b200 — bug — by ProExpertProg (created: 2026-01-22 23:16 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... uv is set ============================== System Info ...
-
#32841 [Bug]: The environment variable `VLLM_USE_MODELSCOPE=1` is ineffective when downloading LoRA models — bug — by AuYang261 (created: 2026-01-22 17:45 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...
-
#32895 [Bug]: [ROCm] [MI355X] new 0.14 upstream gptoss hard error TP=1? — bug,rocm — by functionstackx (created: 2026-01-23 08:43 (UTC+8)) [💬2] ### Your current environment
Platform: AMD MI355X Docker Image: vllm/vllm-openai-rocm:v0.14.0
### 🐛 Describe the bug
hi @powderluv @chunfangamd
Full logs of this bug available at: https://github.com/InferenceMAX/InferenceMAX/actions/runs/21260856946 …
-
#32896 [Installation]: How to use v0.10.x with pytorch2.9? — installation — by Wesley-Jzy (created: 2026-01-23 08:43 (UTC+8)) ### Your current environment
As the title describes, how can I get a vllm 0.10.x for pytorch 2.9 and cuda 12.8 environment?
### How you are installing vllm
pip install -vvv vllm…
-
#32889 [ROCm] MI355X CI parity is missing — feature request,rocm — by functionstackx (created: 2026-01-23 06:32 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
hi @powderluv @chunfangamd
There is already a good amount of B200 CI test enablement, but MI355 coverage is extremely lacking; there is close to zero MI355 CI.
It is not tracked in the vllm project board either: https://github.com/orgs/vllm-project/projects/39
What are the plans around MI355 vLLM upstream enablement? Thanks in advance for your time on this! ### Alternatives …
-
#32840 [Bug]: [CPU Backend] Engine crashed due to error on flashinfer op registration — bug,cpu — by fadara01 (created: 2026-01-22 17:20 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32874 [Bug]: DeepSeek V3.2 returns Internal Server Error instead of Bad Request in reasoning mode — bug — by artvl (created: 2026-01-23 01:48 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...
-
#32850 [RFC]: Clarify policy for Open Responses API extensions in vLLM — RFC — by DanielMe (created: 2026-01-22 20:11 (UTC+8)) [💬2] ### Motivation.
vLLM recently added support for the OpenAI-compatible `/v1/responses` API, which aligns with the public Open Responses specification. The specification explicitly allows implementations to extend existing schemas with implementation-specific fields, as long as core semantics remain unchanged. At the same time, such extensions carry a known risk: future vers…
-
#32870 [Bug][CPU Backend] [Arm]: FAILED tests/kernels/moe/test_moe.py::test_cpu_fused_moe_basic[silu-False-dtype0-2-8-128-128-1] — bug,cpu — by focusunsink (created: 2026-01-23 00:43 (UTC+8)) [💬5] ### Your current environment
The output of `python collect_env.py`:
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (aarch64) ...
-
#32871 [Bug]: — bug — by focusunsink (created: 2026-01-23 00:44 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#32866 wrong issue, need delete — bug,cpu — by yugetnow (created: 2026-01-23 00:16 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (aarch64) ...
-
#32864 [Bug] [ROCm]: Loading Qwen3-MoE-MXFP4 Weights in v0.14. — bug,rocm — by tjtanaa (created: 2026-01-22 23:31 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... uv is set ============================== ...
-
#32838 [Bug]: CompressedTensorsW4A16Fp4 is not supported on Turing — bug — by ir1ka (created: 2026-01-22 16:00 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32857 [Bug]: openai/gpt-oss-120b resolved architecture Qwen3ForCausalLM — bug — by LorenRd (created: 2026-01-22 21:28 (UTC+8)) ### Your current environment
Running vLLM (checkout v0.14.0) from Docker with ROCm, built with:
`DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .` If I try to serve openai/gpt-oss-120b, the resolved architecture is Qwen3ForCausalLM:
…
-
#32856 [Usage]: How to use VLLM to infer the MXFP4A16 model exported by LLM Compressor — usage — by lijixing0425 (created: 2026-01-22 21:05 (UTC+8)) ### Your current environment
Package versions: accelerate 1.12.0, aiohappyeyeballs 2.6.1, aiohttp 3.13.3, aiosignal 1.4.0, annotated-doc 0.0.4, annotated-types 0.7.0 …
-
#32854 [Usage]: — usage — by lijixing0425 (created: 2026-01-22 21:01 (UTC+8)) ### Your current environment
Package versions: accelerate 1.12.0, aiohappyeyeballs 2.6.1, aiohttp 3.13.3, aiosignal 1.4.0, annotated-doc 0.0.4, annotated-types 0.7.0 …
-
#32853 [Usage]: How to run a compressed model in MXFP4A16 format — usage — by lijixing0425 (created: 2026-01-22 20:57 (UTC+8)) ### Your current environment
First, I exported an LLM model in MXFP4A16 format using LLM Compressor. Then, I tried to infer this model based on VLLM, but I found that the inference result was incorrect.
### export MXFP4A16 from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot from llmcompressor.modifiers.quantization import QuantizationModifier from llmcompressor.utils import dispatch_for_generation …
-
#32848 [Usage]: Structured output — usage — by danielwit-lb (created: 2026-01-22 19:21 (UTC+8)) ### Your current environment
```text Collecting environment information… uv is set ============================== System Info ============================== OS : Amazon Linux 2023.9.20251208 (x86_64) GCC version : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5) …
-
#32843 [Feature]: `kv_cache_dtype="auto"` should resolve to the actual `kv_cache_dtype` picked by vllm, and be displayed to users — feature request — by fxmarty-amd (created: 2026-01-22 18:05 (UTC+8)) ### 🚀 The feature, motivation and pitch
As per title. At the moment, even in `TritonAttentionImpl.forward`, if the argument `--kv-cache-dtype` is not specified, it remains `kv_cache_dtype="auto"` even at the forward level. Although the documentation specifies `"auto": U…
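The requested behavior can be sketched with a tiny resolution helper. This is purely illustrative: `resolve_kv_cache_dtype` and its fallback-to-model-dtype rule are assumptions for the sketch, not vLLM's actual logic.

```python
# Hypothetical sketch of resolving kv_cache_dtype="auto" eagerly at config
# time, so the concrete dtype can be logged and shown to users. The function
# name and the fallback rule are illustrative assumptions, not vLLM code.
def resolve_kv_cache_dtype(requested: str, model_dtype: str) -> str:
    # "auto" falls back to the model's weight dtype; anything else is explicit.
    return model_dtype if requested == "auto" else requested

print(resolve_kv_cache_dtype("auto", "bfloat16"))  # bfloat16
print(resolve_kv_cache_dtype("fp8", "bfloat16"))   # fp8
```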
-
#32839 [Bug] AssertionError loading Unsloth-optimized Qwen3-VL-2B-4bit with bitsandbytes in vLLM 0.14.0 — bug — by mesteruh (created: 2026-01-22 16:55 (UTC+8)) ### Your current environment
Collecting environment information... ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 Clang version : Could not collect …
-
#32827 [Bug]: OpenAI tool call crashes when parameter description contains parentheses with examples (e.g. "(e.g. ls -la)") — bug — by Lcng-H (created: 2026-01-22 11:39 (UTC+8)) ### Your current environment
============================== Environment Variables ============================== NVIDIA_VISIBLE_DEVICES=all NVIDIA_REQUIRE_CUDA=cuda NVIDIA_DRIVER_CAPABILITIES=compute,utility VLLM_USE_MODELSCOPE=true ... -
#32834 [Bug]: CUDA illegal memory access caused by CUDA Graph with awq_marlin — bug — by jason-webcomm (created: 2026-01-22 13:35 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32832 [Performance]: H200 Kimi K2 TP8 Triton MoE min-latency — performance — by jhaotingc (created: 2026-01-22 12:57 (UTC+8)) ### Proposal to improve performance
Here's an nsys screenshot of DeepSeek-V3.1 TP=8, conc=16 gen step: 2 triton MoE kernels take about 109us.
The nsys screenshot of Kimi-K2-Instruct TP=8, conc=16, gen step: 2 triton MoE kernels take 234us. [screenshot attachment] …
-
#32828 [Feature]: Qwen3-Next dual-stream execution in_proj_qkvz in_proj_ba — feature request — by jhaotingc (created: 2026-01-22 12:01 (UTC+8)) ### 🚀 The feature, motivation and pitch
In TensorRT LLM, Qwen3-Next `in_proj_qkvz` and `in_proj_ba` are executed in dual stream ([code ref](https://github.com/NVIDIA/TensorRT-LLM/blob/0434db5bf75dfd01fe575a79c27d9260b597f167/tensorrt_llm/_torch/models/modeling_qwen3_next.py#L777C26-L777C30)). See the TensorRT LLM nsys screenshot below: the two GEMMs are executed in 2 streams in parallel (Qwen3-Next-80B-A3B TP2 generation step BS=8). [screenshot attachment] …
-
#32829 [Bug]: During GLM-4.7 function calling (fc), the function output is not streamed. — bug — by zhangsongqing (created: 2026-01-22 12:16 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#32826 [Bug]: MiniMax-M2.1 NVFP4 fails on RTX PRO 6000 Blackwell (SM120) with expert parallel — bug — by gittb (created: 2026-01-22 10:59 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
[Closed issues]
-
#27413 [Usage]: how to request a qwen2.5-VL-7B classify model served by vllm using openai SDK? — good first issue,usage,stale — by muziyongshixin (closed: 2026-01-23 10:22 (UTC+8)) [💬13] ### Your current environment
The output of `python collect_env.py` ### How would you like to use vllm
I launch a server with the following command to serve a Qwen2.5-VL-7B model finetuned for sequence classification (this model replaced the lm_head with a 2-class score_head). …
-
#18811 [Bug]: python sampler is faster than flashinfer sampler — bug,stale — by ErykCh (closed: 2026-01-23 10:17 (UTC+8)) [💬7] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#18890 [Bug]: Passing base64 values via OpenAI-format requests fails after starting the model with vllm — bug,stale — by wqw0806 (closed: 2026-01-23 10:17 (UTC+8)) [💬6] ### Your current environment
A qwen2.5vl-7b model deployed with vllm 0.7.3, requested through the OpenAI-format interface. Method 1: reading a local file in code, converting it to base64, and passing it to the model works fine. Method 2: passing the local image's pre-converted base64 directly to the model raises an error. Below is the image I passed in; what is the problem?
### 🐛 Describe the bug …
-
#19131 [Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' vllm@0.9.0.1 — usage,stale — by rbgo404 (closed: 2026-01-23 10:17 (UTC+8)) [💬22] ### Your current environment
Trying to install vllm; it got installed and then I got this error.
``` torch : 2.7.0+cu128 cuda : 12.8 python: 3.12.9 Linux-6.8.0-1013-nvidia-64k-aarch64-with-glibc2.35 …
-
#19342 [Usage]: How to get the router logits for an MoE model? — usage,stale — by zhenqincn (closed: 2026-01-23 10:17 (UTC+8)) [💬6] ### How would you like to use vllm
I would like to run a Qwen/Qwen3-30B-A3B model and need to store the router logits for each prompt in order to calculate the usage balance among experts. Is there a recommended or efficient way to implement this?
-
#20060 [Bug]: Sampling discrepancy between ollama and vLLM for gemma-3-27b-it et al. — bug,stale — by NilsHellwig (closed: 2026-01-23 10:17 (UTC+8)) [💬41] ### 🐛 Describe the bug
This is what the code looks like:
```py from vllm import LLM, SamplingParams from vllm.sampling_params import GuidedDecodingParams from pydantic import BaseModel, Field, create_model from typing import Literal, List, Optional import json …
-
#20776 [Feature]: Add LFM2 models — new-model,feature request,stale — by egorsmkv (closed: 2026-01-23 10:17 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Recently LiquidAI released LFM2 models.
URL: https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38
Would love to see vLLM run them.
### Alternatives
…
-
#21207 [Feature]: Support xformers on ARM GPU machines including GB200. — feature request,stale — by kathyyu-google (closed: 2026-01-23 10:17 (UTC+8)) [💬4] ### 🚀 The feature, motivation and pitch
The official Dockerfile and requirements file don't include building `xformers` for ARM GPU machines such as GB200 machines. Would it be possible to support `xformers`? This dependency is needed for many models, including meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8. Thank you!
### Alternatives
…
-
#21708 [Bug]: vLLM Server Crash with CUDA Memory Error when serving `gemma-3-27b-it-FP8-Dynamic` — bug,stale — by hnt2601 (closed: 2026-01-23 10:16 (UTC+8)) [💬13] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#22532 [Bug]: NIXL disaggregation example does not work — bug,stale — by piotr-topnotch (closed: 2026-01-23 10:16 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#22578 [Bug]: [gpt-oss-120b] Chat Completions endpoint tool_call support is not working — bug,stale,gpt-oss — by gaopeijie (closed: 2026-01-23 10:16 (UTC+8)) [💬13] ### Your current environment
version: '3.8'
services: vllm: container_name: vllm-gpt image: vllm/vllm-openai:gptoss restart: unless-stopped runtime: nvidia …
-
#22750 [Bug]: Non-functional max_capture_size of gpt-oss causes OOM — bug,stale — by QwertyJack (closed: 2026-01-23 10:16 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#23292 [Feature][Chat Completion] Support builtin tools of gpt-oss — feature request,stale — by heheda12345 (closed: 2026-01-23 10:16 (UTC+8)) [💬9] ### 🚀 The feature, motivation and pitch
Based on a recent bug report, gpt-oss can generate builtin tool calls even if users don't ask for them, so adding proper support for them is a very high priority.
### Alternatives
No response
### Additional context
…
-
#23590 [CI]: Audit use of fixtures across tests to minimize server starts — ci/build,stale — by njhill (closed: 2026-01-23 10:16 (UTC+8)) [💬4] Many tests use a server with the same configuration, or are testing something where it wouldn't matter if the server configuration was changed to be the same as other test(s).
We should make sure that the fixtures are configured such that for these we re-use the running server as much as possible and don’t stop/start a new vLLM instance for each test.
In particular we should try to use package-scoped fixtures in preference to module-scoped, for example in https://github.com/vllm-project/vllm/t…
-
#23627 [Usage]: How to start deepseek with pd disaggregation through vllm p2pncclcommunicator? — usage,stale — by LJL36 (closed: 2026-01-23 10:16 (UTC+8)) [💬10] ### Your current environment
```text Collecting environment information… ============================== System Info ============================== OS : TencentOS Server 4.4 (x86_64) GCC version : (Tencent Compiler 12.3.1.4) 12.3.1 20230912 (TencentOS 12.3.1.4-2) Clang version : Could not collect …
-
#25310 [Feature]: Proactive, scheduler-bypass pull-KV path to reduce KV-transfer latency — feature request,stale — by Zaragoto (closed: 2026-01-23 10:15 (UTC+8)) [💬4] ### 🚀 The feature, motivation and pitch
In the current vLLM implementation, when the Decoder side receives a new request, it must wait for the current scheduling iteration of EngineCore to complete. Only in the next iteration can the scheduler allocate KV blocks and construct kv_connector_metadata, after which the model_runner dispatches the Pull KV task.
This waiting introduces latency averaging roughly half of the TPOT, leaving room for performance improvements.
### Alternatives
Introduce …
-
#25371 [Feature]: Distribute API servers across multiple nodes — feature request,stale — by Zhathw (closed: 2026-01-23 10:15 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
The current `--api-server-count` parameter scales the API servers on a single node.
from utils.py, APIServerProcessManager ``` … for i, in_addr, out_addr in zip(range(num_servers), input_addresses, output_addresses): client_config = { …
-
#25402 [Bug]: Speculative decoding has poor UX — bug,stale — by MatthewBonanni (closed: 2026-01-23 10:15 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#25437 [Bug][Qwen3-Next][DP]: Assert when serving Qwen3-Next using DP — bug,stale — by tlrmchlsmth (closed: 2026-01-23 10:15 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... uv is set ============================== System Info ============================== ...
-
#25452 [Usage]: Autoscaling vLLM with kuberay (pipeline parallel) — usage,stale — by akash-syook (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### Your current environment
Hello Team, a newbie here in K8s, Kuberay, and vLLM. Apologies if this question was already asked or is too simple. I am using the attached tutorial for "Pipeline parallelism with Kuberay" and it works successfully: https://docs.vllm.ai/projects/production-stack/en/latest/use_cases/pipeline-parallelism-kuberay.html I am moving to the next stage of autoscaling (HPA) and found an awesome tutorial: https://github.com/vllm-project/production-stack/blob/main/tutorials/10-horizontal-…
-
#25453 [Feature]: MXFP4 support for the SM75 device. — feature request,stale — by wang824892540 (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
There are widely available 20xx models and Tesla T4s on the market.
### Alternatives
No response
### Additional context
…
-
#25460 [Bug]: MTP assert error when num_speculative_tokens > 1 — bug,stale — by mpjlu (closed: 2026-01-23 10:15 (UTC+8)) [💬4] ### Your current environment
vllm serve ${MODEL_PATH}
--trust-remote-code
--block-size 64
--served-model-name deepseek-r1
--max-model-len 32184
--max-num-seqs 64
--gpu-memory-utilization 0.9
-tp 8
… -
#25484 [Feature][WideEP]: More "masked-m" GEMM and quant kernels for use with DeepEP LowLatency + PPLX All2Alls — feature request,stale — by tlrmchlsmth (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
The `dispatch` operation in DeepEP LowLatency and PPLX-kernels is a `hidden_states` tensor with shape `[num_local_experts, max_num_tokens_per_expert, hidden_size]`. `max_num_tokens_per_expert` may be much larger than the actual number of tokens for each expert. To avoid useless work, we need efficient "masked-m" kernels that only work on the relevant parts of the tensors. Currently we use DeepGEMM and a fused-silu-mul-quant kernel for blocked fp8 forma…
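The "masked-m" idea can be illustrated with a toy pure-Python sketch (illustrative only; the real kernels operate on GPU tensors): only the first `masked_m[e]` rows of each expert's tile hold real tokens, so a masked kernel processes those and skips the padded remainder.

```python
# Toy illustration of "masked-m" computation: tiles has shape
# [num_local_experts][max_num_tokens_per_expert][hidden_size], but only the
# first masked_m[e] rows of expert e are real tokens; the rest is padding.
def masked_row_sums(tiles, masked_m):
    out = []
    for e, tile in enumerate(tiles):
        # process only the real rows, skipping padded ones entirely
        out.append([sum(row) for row in tile[:masked_m[e]]])
    return out

tiles = [
    [[1, 2], [3, 4], [9, 9]],  # expert 0: 2 real rows, 1 padded
    [[5, 5], [9, 9], [9, 9]],  # expert 1: 1 real row, 2 padded
]
print(masked_row_sums(tiles, [2, 1]))  # [[3, 7], [10]]
```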
-
#25529 [Chore]: Get rid of annoying pre-commit error on MacOS — bug,stale — by panpan0000 (closed: 2026-01-23 10:15 (UTC+8)) [💬3] ### Your current environment
vLLM branch main with latest code.
My MacOS env:
…
-
#25536 [Bug]: Inconsistent P90 TTFT Results with Identical Benchmark Parameters — bug,stale — by jinqinn (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### Your current environment
### Environment
- vLLM Version: [v0.8.0]
- Model: DeepSeek-R1
- Backend: OpenAI Chat API
- Hardware: [H20]
- OS: Linux
### 🐛 Describe the bug …
-
#25595 [RFC]: Support openai HarmonyContext plugin — RFC,stale — by frank-wei (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### Motivation.
In some gpt-oss use cases, especially internal scenarios, we need to customize the HarmonyContext class. However, the current implementation does not allow this in OSS. I believe other users may also want similar flexibility to define their own class methods. This PR introduces custom plugin registration for HarmonyContext, making it more extensible by allowing users to register and load their own implementations.
Example:
from vllm.entrypoints.harmony_utils import register_c…
-
#25600 [Bug]: Partial TPU chip usage hangs indefinitely with MP backend — bug,stale — by thameem-abbas (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
root@t1v-n-95832fcc-w-0:/workspace/vllm# python3.12 collect_env.py Collecting environment information... ============================== System Info ...
-
#32841 [Bug]: The environment variable `VLLM_USE_MODELSCOPE=1` is ineffective when downloading LoRA models — bug — by AuYang261 (closed: 2026-01-23 09:14 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...
-
#29095 [Performance]: gpt-oss-120b runs 4 times slower on v0.11.2 than on v0.11.0 — performance — by mosalov (closed: 2026-01-23 08:39 (UTC+8)) [💬7] ### Proposal to improve performance
No response
### Report of performance regression
I have been experimenting with running `gpt-oss-120b` on 4xH100 with the latest versions of vLLM, and I have noticed that after vLLM moved on from v0.11.0 the performance dropped significantly, roughly by 4 times. I wonder if anyone else noticed that or if it is a local issue on my side.
…
-
#32840 [Bug]: [CPU Backend] Engine crashed due to error on flashinfer op registration — bug,cpu — by fadara01 (closed: 2026-01-23 06:27 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32786 [Bug]: [CPU Backend] [Arm] Slow model-parallel inference across NUMA domains — bug,cpu — by fadara01 (closed: 2026-01-23 02:55 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#30569 [Feature]: Add --ssl-ciphers to CLI arguments — feature request — by GraceMoreau (closed: 2026-01-23 01:53 (UTC+8)) ### 🚀 The feature, motivation and pitch
Add support for an --ssl-ciphers CLI argument.
This flag would be passed directly to uvicorn's --ssl-ciphers kwarg inside the existing serve_http command, parallel to how the --ssl-certfile and --ssl-keyfile flags are handled.
Motivation: vLLM already exposes several uvicorn SSL-related parameters, enabling users to run the vLLM server with managed TLS. However, currently there is no way to specify the allowed SSL/TLS cipher suites. Adding this CLI opti…
-
#31245 [CI Failure]: mi325_4: DeepSeek V2-Lite Async EPLB Accuracy — ci-failure — by AndreasKaratzas (closed: 2026-01-23 01:35 (UTC+8)) [💬1] ### Name of failing test
bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_async_eplb.sh 0.25 1319 8030 ### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#32871 [Bug]: — bug — by focusunsink (closed: 2026-01-23 00:51 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#32866 wrong issue, need delete — bug,cpu — by yugetnow (closed: 2026-01-23 00:22 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (aarch64) ...
-
#13609 [Usage]: How can I get the sparse embedding from OpenAI Embedding Client? — usage — by shawnhaoxy (closed: 2026-01-22 23:52 (UTC+8)) [💬15] I want to run an embedding model like BGE-m3 for online serving. I can get the dense embedding, but how can I get its sparse embedding?
-
#29992 [Bug]: vLLM cold start on MOE models not optimal — bug,torch.compile,startup-ux — by zou3519 (closed: 2026-01-22 23:44 (UTC+8)) [💬6] ### Your current environment
main
### 🐛 Describe the bug
Previously, for dense models with FX graph splitting, vLLM produced 3 unique graphs (the model is split at the attention operator). The graph split ends up producing ~50 graphs, but we only needed to compile 3 unique graphs out of the 50.
Looking at a tlparse for llama4 maverick, which is an MOE model:
- every other layer has a MOE (instead of an nn.Linear feedforward) …
-
#32854 [Usage]: — usage — by lijixing0425 (closed: 2026-01-22 21:02 (UTC+8)) ### Your current environment
Package versions: accelerate 1.12.0, aiohappyeyeballs 2.6.1, aiohttp 3.13.3, aiosignal 1.4.0, annotated-doc 0.0.4, annotated-types 0.7.0 …
-
#32853 [Usage]: How to run a compressed model in MXFP4A16 format — usage — by lijixing0425 (closed: 2026-01-22 20:58 (UTC+8)) ### Your current environment
First, I exported an LLM model in MXFP4A16 format using LLM Compressor. Then, I tried to infer this model based on VLLM, but I found that the inference result was incorrect.
### export MXFP4A16 from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot from llmcompressor.modifiers.quantization import QuantizationModifier from llmcompressor.utils import dispatch_for_generation …
-
#31577 [Bug]: Memory leak in serving Whisper — bug — by parssky (closed: 2026-01-22 18:50 (UTC+8)) [💬11] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32631 [Tracker]: Initialize MM components in context managers — multi-modality — by DarkLight1337 (closed: 2026-01-22 16:20 (UTC+8)) [💬5] ### Purpose
Apply #32605 to all applicable models to enable encoder-only and LM-only mode.
- #32632
- #32641
- #32650
- #32663
- #32695
- #32691 …
-
#32666 [Feature]: Save the start time of the benchmark request — feature request — by kebe7jun (closed: 2026-01-22 15:10 (UTC+8)) ### 🚀 The feature, motivation and pitch
The current `vllm bench serve --save-detailed` stores data such as `ttfts`, `itls`, and `tpots`, but does not save the actual request start time, which makes it difficult to accurately trace the entire test process. Therefore, it is desirable to also save the `start_time`. ### Alternatives
No response
### Additional context
…
-
#29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) — ci-failure — by AndreasKaratzas (closed: 2026-01-22 14:03 (UTC+8)) [💬6] ### Name of failing test
pytest -v -s entrypoints/openai/test_collective_rpc.py; pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/; pytest -v -s entrypoints/test_chat_utils.py ### Basic information
- Fla…
-
#29510 [CI Failure]: mi325_1: V1 Test e2e + engine — ci-failure — by AndreasKaratzas (closed: 2026-01-22 13:47 (UTC+8)) [💬6] ### Name of failing test
pytest -v -s v1/e2e && pytest -v -s v1/engine ### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#32822 [Bug]: When the temperature is set to 0 in qwen3-235B-A22B-Instruct-2507, there is still randomness. — bug — by xidiancpy (closed: 2026-01-22 10:56 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
When the temperature is set to 0 in qwen3-235b, there is still randomness. Collecting environment information... ============================== ...
[New PRs]
- #32878 [Bugfix][Hardware][AMD] Auto-enable AITER MoE on supported ROCm devices — bug,rocm — by c0de128 (created: 2026-01-23 02:27 (UTC+8)) [💬7 | +18/-2, 3 files | commented:1]
## Summary
- Auto-enable AITER MoE on supported ROCm platforms (MI300X/gfx9) without requiring `VLLM_ROCM_USE_AITER=1`
- Previously, FP8 MoE models fell back to slower Triton kernels by default
## Changes
- Modify `is_fused_moe_enabled()` to auto-enable when AITER is found and supported
- Respect explicit user settings (`VLLM_ROCM_USE_AITER=0` or `VLLM_ROCM_USE_AITER_MOE=0` to disable)
## Test plan On MI300X: …
- #32899 [WIP][CT][XPU] Add W8A16_FP8 MoE Support — no labels — by Zhenzhong1 (created: 2026-01-23 10:20 (UTC+8)) [💬1 | +247/-15, 3 files | commented:2 | 📝 draft] ## Test Plan ```python # quantization scheme: quant_stage: quant_modifiers: QuantizationModifier: ignore: ["lm_head", "re:.self_attn.q_proj$", "re:.self_attn.k_proj$", "re:.self_attn.v_proj$", "re:.self_attn.o_proj$", "re:.mlp.gate$", "re:.mlp.shared_expert_gate$"] config_groups: group_0: targets: ["Linear"] …
-
#32852 [perf] v1/spec_decode: skip softmax for all-greedy rejection sampling — ready,v1 — by caozuoba (created: 2026-01-22 20:47 (UTC+8)) [💬2 | +6/-1, 1 files | commented:1 approved:1]
## Purpose This PR avoids computing a full-vocabulary softmax in the v1 speculative decoding rejection sampler when the entire batch is greedy (sampling_metadata.all_greedy).
For all-greedy decoding, the rejection sampler only needs argmax(target_logits); a dense softmax is unnecessary work. Since argmax(softmax(logits)) == argmax(logits), this change is behavior-preserving for the greedy path while reducing compute/memory overhead.
## Test Result …
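The identity the PR relies on, argmax(softmax(logits)) == argmax(logits), holds because softmax is strictly increasing and therefore order-preserving. A small pure-Python check (illustrative only, not the PR's code):

```python
import math

def softmax(logits):
    # numerically stable softmax over a 1-D list
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

logits = [2.0, -1.0, 5.5, 0.3]
# softmax preserves ordering, so the greedy (argmax) token is unchanged
assert argmax(softmax(logits)) == argmax(logits)
print(argmax(logits))  # 2
```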
-
#32863 [Frontend] Use new Renderer for Completions and Tokenize API — frontend — by DarkLight1337 (created: 2026-01-22 23:25 (UTC+8)) [💬1 | +962/-1189, 32 files | commented:4 | 📝 draft] ## Purpose
- Add `render_completions` and `tokenize_prompt` to the Renderer API.
- Add `render_completions_async` and `tokenize_prompt_async` to the Renderer API. These are backed by `AsyncMicrobatchTokenizer`, which is internally managed by the Renderer.
- Remove legacy `CompletionRenderer` and related code from the serving engine.
- Add `_preprocess_completion` to the serving engine, which has the same function as `_preprocess_chat`.
- Introduce `ChatParser` and `TokenizeParams`, which are passed to the Renderer to…
-
#32884 [BugFix] deepseek_v32_encoding: Replace asserts with proper exceptions — bug,ready,deepseek — by RishabhSaini (创建于: 2026-01-23 05:06 (UTC+8)) [💬2 | +39/-28, 1 files | commented:1 approved:1] Resolves: https://github.com/vllm-project/vllm/issues/32874
Replace validation asserts with ValueError and parsing asserts with RuntimeError to return 400 Bad Request instead of 500 Internal Server Error for invalid input
-
#32836 [BugFix] Add env variable to control PDL in LoRA — bug — by jeejeelee (创建于: 2026-01-22 15:40 (UTC+8)) [+10/-1, 2 files | commented:1]
## Purpose On SM100 GPUs, enabling PDL support for LoRA causes Triton compilation to fail. Therefore, an environment variable(
VLLM_LORA_DISABLE_PDL) is added as a temporary workaround to avoid this error. FIX #30872 FIX https://github.com/vllm-project/vllm/issues/32424 ## Test Plan## Test Result
…
-
#32885 [CI][Models] Add VLM Support for Sequence Classification Conversion — v1 — by AndreasKaratzas (创建于: 2026-01-23 05:56 (UTC+8)) [💬1 | +155/-39, 3 files | commented:2] This PR enables Vision-Language Models (VLMs) like Gemma 3 to be converted to sequence classifiers using the
no_post_processingandfrom_2_way_softmaxmethods. Additionally, it fixes two PyTorch compiler warnings that were causing noise in the logs.## Changes
### 1.
vllm/model_executor/models/adapters.py- Added
_get_language_model_for_seq_cls()helper function to correctly retrieve the inner language model component from VLMs - Updated
load_weights_no_post_processing()and `load_w…
- Added
- #32879 [Bugfix][Hardware][AMD] Fix MXFP4 weight loading for Quark OCP_MX MoE models — bug,rocm — by c0de128 (创建于: 2026-01-23 03:00 (UTC+8)) [💬3 | +20/-4, 1 files | commented:1]
## Summary
- Fix MXFP4 weight loading for Quark OCP_MX quantized MoE models
- The weight loader check for MXFP4 style loading only triggered for
quant_config.get_name() == "mxfp4", but Quark config returns"quark" - This caused dimension mismatch errors when loading Quark OCP_MX (MXFP4) quantized models with tensor/expert parallelism
## Changes
- Extend the MXFP4-style weight loading check to also handle
QuarkOCP_MX_MoEMethod - Both native MXFP4 and Quark OCP_MX use the same packed weight…
- #32891 [ROCm][CI] Add TORCH_NCCL_BLOCKING_WAIT For Distributed Tests (A100) — rocm,ci/build — by micah-wil (创建于: 2026-01-23 06:56 (UTC+8)) [+3/-0, 1 files | commented:4] We have noticed some flaky behavior on AMD CI for Distributed Tests (A100) (exemplified here: https://buildkite.com/vllm/amd-ci/builds/3155/steps/canvas?jid=019bc705-99de-4131-b7ef-24111fd273cc). After investigation, it is due to the same HIP runtime bug that afflicts the other distributed tests groups, for which the same workaround proposed in this PR was applied in https://github.com/vllm-project/vllm/pull/31259. The HIP bug is being tracked here https://github.co…
-
#32859 [Bugfix] Remap FP8 scale names for Mistral3/Pixtral models — bug — by ricky-chaoju (创建于: 2026-01-22 23:07 (UTC+8)) [+6/-0, 1 files | commented:1] ## Summary Fix loading FP8 quantized Mistral3/Pixtral models by remapping HuggingFace scale parameter names to vLLM format.
Changes:
activation_scale→input_scaleweight_scale_inv→weight_scale
## Testing Tested with FP8 quantized Mistral3 model (e.g.,
RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic).…
-
#32894 [Intel GPU] refine xpu worker — ready,ci/build,v1 — by jikunshang (创建于: 2026-01-23 08:35 (UTC+8)) [+26/-86, 3 files | commented:2]
## Purpose There are some legacy code in xpu platform(xpu_worker.py, xpu platform init) this PR refine these file to align with gpu. also disable some failed UT for now. will further check ut status later.
## Test Plan CI
## Test Result …
-
#32888 [UX] Glm4MoeModelToolParser - true tool-call streaming — 无标签 — by bigBrain1901 (创建于: 2026-01-23 06:18 (UTC+8)) [+294/-95, 1 files | commented:3 | 📝草稿]
## Purpose Instead of buffers + regex to identify tool calls, maintain a state machine to have true streaming deltas for tool-calls as the engine generates these (#32829)
## Test Plan A serving with this –tool-parser works.
## Test Result — will fill this up soon …
-
#32892 [Perf] Optimize
moe_permutefor CUTLASS FP8 - 40%+ performance improvement — nvidia — by yewentao256 (创建于: 2026-01-23 07:54 (UTC+8)) [💬2 | +47/-44, 3 files | commented:1] ## PurposeOptimize
moe_permutekernel usingaligned_expert_first_token_offset- Reduce calculation
- Reduce shared memory usage
## Test
### Acc …
-
#32893 Fix/glm4 moe mla detection — v1,nvidia — by mgoin (创建于: 2026-01-23 08:06 (UTC+8)) [+75/-11, 4 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... - #32886 [Bugfix] Fix FP8 MoE EP Weight Loading for ModelOpt Llama4 — bug,ready,llama — by baonudesifeizhai (创建于: 2026-01-23 06:05 (UTC+8)) [💬1 | +21/-1, 1 files | approved:1 commented:3]
## Purpose
#32862
Add a version-guarded fallback in Llama4 MoE weight loading to avoid CPU FP8 indexing on older PyTorch releases. For torch < 2.11, weights are temporarily cast to FP16 for indexing and then cast back, preventing index_cpu errors while leaving newer versions unchanged.
## Test Plan
```
VLLM_DISABLED_KERNELS=FlashInferFP8ScaledMMLinearKernel
VLLM_USE_FLASHINFER_MOE_FP8=0
vllm bench throughput –model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 -tp=2 –enable-expert-parallel … -
#32872 [Bugfix][MoE] Fix grouped topk when num_experts == num_expert_groups — bug — by bnellnm (创建于: 2026-01-23 01:09 (UTC+8)) [+74/-79, 3 files | commented:6] ## Purpose Fix fallback logic when
num_experts==num_expert_groups. The old code wasn’t handling the case where the scoring function was not softmax.Added new tests to cover all the different combinations. Simplified the bias parameters in existing tests.
## Test Plan CI tests Added new routing tests for all grouped routing combinations.
…
- #32890 Testing vllm/issues/32718 — 无标签 — by RishabhSaini (创建于: 2026-01-23 06:43 (UTC+8)) [+222/-0, 2 files | commented:1 | 📝草稿] Testing https://github.com/vllm-project/vllm/issues/32718
-
#32887 [Spec Decode] Unified Parallel Drafting — documentation,speculative-decoding,v1,nvidia — by benchislett (创建于: 2026-01-23 06:16 (UTC+8)) [💬1 | +467/-281, 7 files | commented:1 | 📝草稿] ## Purpose
Preliminary work. Meant to combine Parallel Decoding for EAGLE and for Draft Models, while also simplifying the implementation for non-EAGLE draft model support.
## Test Plan
Local tests for the main triton kernel are implemented (and passing) but not yet committed.
## Test Result
…
-
#32855 [BugFix] Fix invalid flashinfer_fused_moe_blockscale_fp8 op registration — bug,ready,nvidia — by fadara01 (创建于: 2026-01-22 21:04 (UTC+8)) [💬3 | +1/-1, 1 files | commented:1 approved:2] [BugFix] Fix invalid numeric default value in flashinfer_fused_moe_blockscale_fp8 op registration
Fixes: #32840
## Purpose
Fix invalid numeric default value in flashinfer_fused_moe_blockscale_fp8 op registration The default value has to be an int an not an enum as the error indicates: ``` invalid numeric default value: …
-
#32876 [CPU Backend][BugFix] Fix failing CPU MoE test — bug — by fadara01 (创建于: 2026-01-23 02:23 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1 approved:1] ## Purpose
Fixes: #32870
## Test Plan
Ran reproducer in #32870 locally
## Test Result
…
-
#32882 Set splitk=1 for fused-moe-lora expand kernel — 无标签 — by dcmaddix (创建于: 2026-01-23 04:16 (UTC+8)) [+1/-1, 1 files | commented:2]
## Purpose The splitK optimization is useful only for the LoRA shrink kernels where the inner K dimension in the GEMM is the significantly largest dimension. For the expand kernel, we do not need to tune it and can set it to 1 as done in the dense LoRA expand: https://github.com/vllm-project/vllm/blob/main/vllm/lora/ops/triton_ops/kernel_utils.py#L195.
## Test Plan
## Test Result
…
-
#32849 [Docs] Adding links and intro to Speculators and LLM Compressor — documentation — by aireilly (创建于: 2026-01-22 19:47 (UTC+8)) [💬4 | +75/-0, 3 files | commented:1 changes:1] Small docs update that adds a new “Related Projects” section for LLM Compressor and Speculators
Updated docs/.nav.yml to add a new section: - Related Projects: - usage/llm_compressor.md - usage/speculators.md
Signed-off-by: Aidan Reilly aireilly@redhat.com
-
#32883 deepseek_v32_encoding: Replace asserts with proper exceptions — deepseek — by RishabhSaini (创建于: 2026-01-23 04:57 (UTC+8)) [💬1 | +38/-31, 1 files] Resolves: https://github.com/vllm-project/vllm/issues/32874
Replace validation asserts with ValueError and parsing asserts with RuntimeError to return 400 Bad Request instead of 500 Internal Server Error for invalid input
-
#32880 Add inside_opaque_custom_op context — 无标签 — by mgoin (创建于: 2026-01-23 03:57 (UTC+8)) [+67/-3, 2 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#32881 [BugFix] Async Eplb fix potential race condition — bug — by ilmarkov (创建于: 2026-01-23 04:14 (UTC+8)) [+18/-0, 2 files | commented:1 | 📝草稿]
## Purpose Fixes potential race condition in async EPLB. The scenario is following. EPLB successfully transfers layer 1, saves the weights in intermediate buffers, signals main thread that the buffer is ready. The main thread schedules weight copy on main stream when communication on EPLB stream is done and notifies the async thread that buffer is ready. In this scenario it might happen that copying the weights takes too long and async thread will start a weight t…
-
#32860 [BugFix] Fix async EPLB hang with DeepEP LL all2all backend — bug,v1 — by ilmarkov (创建于: 2026-01-22 23:13 (UTC+8)) [💬1 | +54/-0, 2 files | commented:2] Fix hang when async EPLB is used with DeepEP LL.
The hang happens because of cooperative launch in DeepEP LL. When Async EPLB in async thread launches NCCL kernels for weight exchange with large number of SMs it blocks DeepEP kernels. When these kernels are launched on different ranks out-of-order we hit a deadlock.
See https://github.com/deepseek-ai/DeepEP/issues/496.
We resolve it by limiting number of CTAs in NCCL. In DP/EP mode NCCL is not used in the hot path so it doesn’t directly affec…
-
#32875 Add the missing router_logits_dtype for nemotron_h — 无标签 — by cjluo-nv (创建于: 2026-01-23 02:08 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1]
## Purpose
Nemotron H uses FP32 linear for gate and the router logits are FP32 instead of the default BF16.
This fixes the error triggered
``` (EngineCore_DP2 pid=147) return self.forward_impl_chunked( …
-
#32861 [Voxtral] Add new streaming arch — multi-modality,llama — by patrickvonplaten (创建于: 2026-01-22 23:16 (UTC+8)) [💬1 +767/-313, 9 files commented:9] - Moves causal whisper logic into own file
- Adapt voxtral streaming arch to improved architecture
- Add tests that is skipped for now
- #32877 [Bugfix][Hardware][AMD] Fix ROCM_AITER_FA speculative decoding support — bug,rocm,v1 — by c0de128 (创建于: 2026-01-23 02:27 (UTC+8)) [+8/-3, 1 files | commented:1]
## Summary
- Fix ROCM_AITER_FA attention backend to support speculative decoding (multi-token decode)
- The decode path was hardcoding
max_seqlen_q=1, causing incorrect results with speculative decoding
## Changes
- Extract actual
max_query_lenfrom decode metadata - Route multi-token decode queries to
unified_attentioninstead ofpaged_attention_v1 - Use actual
max_seqlen_qinstead of hardcoded1 - Update assertion message to reflect both sliding window and speculative decoding con…
-
#32831 [CI][AMD][BugFix] Update wvSplitK (and other skinny_gemm wrappers) to ensure tensors passed will be made contiguous for the kernel — bug,rocm — by rasmith (创建于: 2026-01-22 12:53 (UTC+8)) [+23/-0, 2 files | commented:8] ## Purpose The
wvSplitKQ_hf_smlkernel is not able to properly handle non-contiguous tensors. This is resulting in failures in AMD CI. This is possibly resulting in other failures or inaccuracies. In the long term, it might be better to update the kernel to be able to properly handle non-contiguous tensors, since this trades incorrectness for performance loss. We can then work on fix to the kernels in skinny_gems.cu to handle non-contiguous kernels.In addition, all of the kernels in `ski…
-
#32873 Add config for selective_state_update with B200 — 无标签 — by danisereb (创建于: 2026-01-23 01:26 (UTC+8)) [+6/-1, 1 files | commented:1 | 📝草稿] ## Purpose
Tune config of
selective_state_updatefor Nemotron Nano NVFP4 on B200 (TP1).## Test Plan
Compare vLLM performance with main branch and this branch.
## Test Result
…
-
#32845 [Model] Video Frame Sparsification via Pixel-Level Similarity for Efficient Multimodal Long-Video Inference for Qwen2.5_vl and Qwen3_vl — multi-modality,qwen — by rayleeku (创建于: 2026-01-22 18:29 (UTC+8)) [💬2 | +374/-1, 5 files | commented:2] ## Purpose
RFC: https://github.com/vllm-project/vllm/issues/31803
The exponential growth of video content and the rising demand for long-video intelligence (e.g., video summarization, multimodal QA, video generation) have made long-video inference a standard requirement for multimodal large models. Current models suffer from critical efficiency bottlenecks and face two key challenges: temporal and spatial redundancy of videos and input sequence overflow, thus necessitating a lightweight, adapt…
- #32868 [MISC] Add .cursor to .gitignore — ready — by vadiklyutiy (创建于: 2026-01-23 00:34 (UTC+8)) [+3/-0, 1 files | commented:1 approved:1]
Add
.cursorfolder to.gitignore -
#32837 [Hardware][AMD][CI][Bugfix] Fix regressions from deprecated env vars — bug,rocm,ready,ci/build,v1,kv-connector — by mawong-amd (创建于: 2026-01-22 15:44 (UTC+8)) [💬1 | +14/-6, 3 files | commented:1 approved:1] ## Purpose This PR fixes various issues on AMD ROCm caused by the deprecation of environment variables in https://github.com/vllm-project/vllm/pull/32812.
Firstly, the deprecation of
VLLM_V1_USE_PREFILL_DECODE_ATTENTIONmeans thatget_current_vllm_configis now called as a replacement inRocmPlatform.get_attn_backend_clsto determine if theROCM_ATTNbackend should be used. However, there are instances where the current vLLM config is not yet set, which causes the above function to erro… - #32867 [Misc] Postpone VLLM_ATTENTION_BACKEND deprecation — 无标签 — by NickLucche (创建于: 2026-01-23 00:32 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:2] I think we can safely postpone this deprecation schedule. Updating warning accordingly.
-
#32869 [CPU][Feat] Update PyTorch to v2.10 for Arm CPUs — ci/build,cpu — by fadara01 (创建于: 2026-01-23 00:36 (UTC+8)) [💬1 | +4/-3, 2 files | commented:2] ## Purpose
Update PyTorch to v2.10 for Arm CPUs.
A lot of nice features, improvements and bug fixes have gone into PyTorch v2.10 for Arm CPUs and we should capitalize on that in vLLM!
Here’s a con-exhaustive list:
- Enable mimalloc on AArch64: Switched the c10 system allocator to mimalloc on AArch64 to improve large-allocation behavior and overall performance. This delivers broad wins across dtypes, including up to 60% faster DenseNet121 (FP32) and up to 40% faster GPT2-Large (BF16) [#pull/…
-
#32844 [Bugfix] ModelScope is supported when downloading LORA models. — bug,ready — by AuYang261 (创建于: 2026-01-22 18:05 (UTC+8)) [💬1 | +21/-6, 1 files | commented:2 approved:1]
## Purpose Fix the bug that the environment variable
VLLM_USE_MODELSCOPE=1is ineffective when downloading LoRA models, which is described in #32841 .## Test Plan After setting the environment variables
VLLM_USE_MODELSCOPE=1, this command tests downloading the LoRA model from ModelScope.vllm serve LLM-Research/Llama-3.2-3B-Instruct --enable-lora --lora-modules sql-lora=vllm-ascend/llama32-3b-text2sql-spider…
-
#32865 [Chore]: simplify cuda device count with
torch.cuda.device_count— ready,nvidia — by Isotr0py (创建于: 2026-01-22 23:32 (UTC+8)) [+4/-42, 1 files | commented:1]## Purpose
- Remove
_cuda_device_count_stateless, we can usetorch.cuda.device_countdirectly now.
## Test Plan
## Test Result
…
- Remove
-
#32846 [Kernel] use flashinfer for gdn prefill — qwen — by ZJY0516 (创建于: 2026-01-22 18:41 (UTC+8)) [+62/-1, 1 files | commented:3] ## Purpose flashinfer introduced prefill gdn kernel in https://github.com/flashinfer-ai/flashinfer/pull/2276
TODO: accuracy problem
## Test Plan ``` vllm bench serve –model Qwen/Qwen3-Next-80B-A3B-Instruct –dataset-name random
–num-prompts 512
–random-input-len 4096
… -
#32851 Fix mkdocs_autorefs error — needs-rebase — by aireilly (创建于: 2026-01-22 20:34 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1 | 📝草稿] Fixes this docs build error:
``` WARNING - mkdocs_autorefs: api/vllm/model_executor/models/index.md: from /home/docs/checkouts/readthedocs.org/user_builds/vllm/checkouts/32849/vllm/model_executor/models/interfaces_base.py:167: (vllm.model_executor.models.VllmModelForPooling.default_pooling_type) Could not find cross-reference target ‘vllm.model_executor.layers.pooler.PoolerConfig.pooling_type’ WARNING - mkdocs_autorefs: api/vllm/model_executor/models/interfaces_base.md: from /home/docs/checko…
-
#32847 [BugFix] Fix AttributeError in MooncakeConnector logging for non-tensor KV caches — bug,kv-connector — by EheinWang (创建于: 2026-01-22 18:58 (UTC+8)) [💬2 | +23/-3, 1 files | commented:3] ## Purpose
This PR fixes an
AttributeError: 'tuple' object has no attribute 'shape'that occurs inMooncakeConnector.register_kv_cacheswhen using hardware backends (like Ascend NPU) that represent KV caches as tuples (separate Key/Value tensors) rather than a single tensor.The existing logging implementation assumed that
kv_cachesvalues always possessed a.shapeattribute. When running with thevllm-ascendplugin, this assumption failed, causing the engine initialization to crash…. -
#32833 [CI] refactor release pipeline config into groups — ready,ci/build — by Harry-Chen (创建于: 2026-01-22 13:25 (UTC+8)) [💬1 | +277/-276, 1 files | commented:2 approved:2] ## Purpose
The current release pipeline has too many steps and is a little bit hard to read and maintain. This PR puts steps into some groups without changing the actual jobs.
Specifically, it also removes the block before the CUDA 13.0 release image build steps, as requested by @wangshangsam ( https://github.com/vllm-project/vllm/pull/32522#issuecomment-3776281837).
## Test Plan
Tested by CI.
…
- #32842 [Bugfix]: resolve torch.compile cache conflict between mm_encoder_tp_modes — bug — by HirokenOvo (创建于: 2026-01-22 17:45 (UTC+8)) [+8/-1, 2 files]
## Purpose
PR #23207 introduced torch compile support for the ViT part of Qwen2.5-VL. This PR addresses an issue where enabling
torch.compilefor the vision encoder (--compilation-config '{"compile_mm_encoder": true}') caused crashes when switching between--mm-encoder-tp-mode "weights"and--mm-encoder-tp-mode "data". ### The Problem: vLLM usesVllmConfig.compute_hash()to identify unique configurations for caching compiled graphs. However,mm_encoder_tp_modewas missing from this … -
#32835 [ROCm][CI][Docs] Add comment explaining TRITON_ATTN fallback for ROCm — rocm,speculative-decoding,v1 — by AndreasKaratzas (创建于: 2026-01-22 14:09 (UTC+8)) [+2/-0, 1 files | commented:1 approved:1] This PR adds a clarifying comment as requested in #32787 by @DarkLight1337.
The comment explains why
TRITON_ATTNis used as the fallback attention backend for ROCm platforms. -
#32830 fix — needs-rebase,v1 — by aidendle94 (创建于: 2026-01-22 12:28 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
...
[已合并 PR]
-
#30937 [Feature] Add –ssl-ciphers CLI argument for TLS cipher control — frontend,ready — by ricky-chaoju (合并于: 2026-01-23 01:53 (UTC+8)) [💬3 | +4/-0, 2 files | commented:2 approved:2] ## Summary This PR adds support for the
--ssl-ciphersCLI argument, enabling users to specify allowed SSL/TLS cipher suites for fine-grained security control.Fixes #30569
## Changes
vllm/entrypoints/openai/cli_args.py: Addedssl_ciphersparameter toFrontendArgsdataclassvllm/entrypoints/openai/api_server.py: Passssl_ciphersargument toserve_http()function
## Motivation …
-
#31996 [MoE Refactor] Move
select_expertsfromFusedMoEQuantMethod->FusedMoE— documentation,ready,nvidia — by bnellnm (合并于: 2026-01-23 07:21 (UTC+8)) [💬9 | +495/-530, 22 files | commented:9 approved:1] ## PurposeMove (almost all) the
select_expertscalls to theFusedMoElayer instead of eachFusedMoEMethodBase.applymethod. Change theapplysignature to taketopk_weightsandtopk_idsinstead ofrouter_logits. For kernels that have combined routing + gemms we callapply_monolithicwhich takes hidden_states androuter_logits.I made both
applyandapply_monolithicnon-abstract so that most FusedMoEMethods would not need to override them with dummy/error implementations… -
#32723 [Benchmark] Don’t default to
temperature==0invllm bench serve— performance,ready — by njhill (合并于: 2026-01-22 18:03 (UTC+8)) [💬3 | +7/-6, 2 files | commented:1 approved:1]vllm bench servecurrent sets temperature to 0 (greedy) in all requests if it’s not explicitly provided in the benchmark command.Thus results will not reflect the true default performance of a given deployment - where default sampling parameters are determined based on the model’s configuration metadata and/or the API server defaults.
The difference can be substantial given the current overhead of parameters such as top-k and top-p (see https://github.com/vllm-project/vllm/pull/32558).
So …
-
#32855 [BugFix] Fix invalid flashinfer_fused_moe_blockscale_fp8 op registration — bug,ready,nvidia — by fadara01 (合并于: 2026-01-23 06:27 (UTC+8)) [💬3 | +1/-1, 1 files | commented:1 approved:2] [BugFix] Fix invalid numeric default value in flashinfer_fused_moe_blockscale_fp8 op registration
Fixes: #32840
## Purpose
Fix invalid numeric default value in flashinfer_fused_moe_blockscale_fp8 op registration The default value has to be an int an not an enum as the error indicates: ``` invalid numeric default value: …
-
#32619 [Perf] Create TMA-aligned input scale tensor for DeepGemm on Hopper — performance,ready,deepseek — by xyang16 (合并于: 2026-01-23 04:47 (UTC+8)) [💬1 | +75/-17, 7 files | commented:5 approved:1] ## Purpose
This PR is to create TMA-aligned input scale tensor for DeepGemm on Hopper.
During profiling on H200 I noticed
deep_gemm::transpose_fp32was launched beforedeep_gemm::sm90_fp8_gemm_1d2d_implevery time, because deepgemm requires input scale tensor to be column major TMA-aligned layout on Hopper, see here. If not in the required layout, deepgemm will launch transpose_fp32 kernel to transpose tens… -
#32610 [Refactor] Remove unused tpu files — tpu,ready,v1 — by yewentao256 (合并于: 2026-01-23 04:35 (UTC+8)) [💬1 | +0/-353, 4 files | commented:1 approved:1] ## Purpose
Seems introduced in https://github.com/vllm-project/vllm/pull/14227
But these functions and files now are not used anywhere, not sure if we can delete it
CC: @yaochengji @NickLucche
-
#30141 Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) — documentation,speculative-decoding,ready,v1,llama — by eldarkurtic (合并于: 2026-01-23 04:29 (UTC+8)) [💬12 | +538/-243, 18 files | commented:10] TLDR: this PR adds support to load and run
llm-compressormodels with FP8 KV-cache and attention quantization. In addition to the standard “per-tensor” quantization, it adds support for “per-attention-head” quantization.## Summary
- enable using the existing pathway of “per-tensor” KV-cache (and query) FP8 quantization with scales calibrated through
llm-compressor - Flash Attention v3 backend supports “finer-grained” scales, i.e. one scale per attention head. This is currently not su…
- enable using the existing pathway of “per-tensor” KV-cache (and query) FP8 quantization with scales calibrated through
-
#32795 [Bugfix][Attention] Explicitly report support for kv_cache_dtype bfloat16 — bug,ready,v1,cpu,nvidia — by MatthewBonanni (合并于: 2026-01-23 03:05 (UTC+8)) [+31/-11, 13 files | commented:3 approved:2] ## Purpose Attention backends all support bfloat16, but only report support for
auto. In cases where the automatically resolved dtype is not bfloat16, this can lead to failures.Related issue: #32732 - setting
--kv-cache-dtype=bfloat16should work as a workaround, but does not, because TRITON_MLA does not report support for bfloat16.## Test Plan
vllm serve nvidia/DeepSeek-R1-0528-NVFP4 --attention-backend=TRITON_MLA --kv-cache-dtype=bfloat16## Test Result main: …
-
#32792 [CPU Backend] [Perf] Accelerate tensor-parallel/data-parallel inference across NUMA domains on Arm — ready,ci/build,v1,cpu — by fadara01 (合并于: 2026-01-23 02:55 (UTC+8)) [💬2 | +164/-6, 6 files | commented:1 approved:1] [CPU Backend] [Perf] Accelerate tensor-parallel/data-parallel inference across NUMA domains on Arm
Fixes: #32786
## Purpose
This PR addresses #32786 by enabling vLLM’s custom shared memory communicator for Arm, in favor of torch.distributed which has very high synchronization costs. It also enables auto thread binding for best OOB performance.
## Benchmarks
…
-
#32487 [CI][Attention] Add more CI dependencies for attention tests — ready,ci/build — by MatthewBonanni (合并于: 2026-01-23 02:44 (UTC+8)) [+12/-0, 3 files | commented:4 approved:1]
## Purpose #32339 broke
V1 Tests attention (B200)by changing defaults, and this should have been caught in CI on the PR. This PR adds more source file dependencies for these tests so they will be covered next time.## Test Plan
## Test Result
…
-
#32393 Support custom URI schemes and trace handlers for profiler — ready,v1,meta-exported,fb-exported — by diviramon (合并于: 2026-01-23 01:45 (UTC+8)) [💬5 | +74/-19, 3 files | commented:1 approved:2] Summary: Enable platforms to use custom URI schemes (e.g., gs://, s3://, hdfs://) for profiler trace output without path normalization, and allow injection of custom trace handlers for platform-specific profiling backends.
This generalizes the existing gs:// handling to support any URI scheme by detecting the scheme:// pattern, and adds an optional on_trace_ready parameter to TorchProfilerWrapper for custom trace export logic.
Test Plan: Existing profiler tests continue to pass. Manual verific…
-
#32525 [UX] Default api_server_count to dp_size if not specified — frontend,ready — by tlrmchlsmth (合并于: 2026-01-23 01:35 (UTC+8)) [💬3 | +60/-8, 2 files | commented:4 approved:1] When using data parallelism, a good rule of thumb is to set the api_server_count to the dp_size in order to avoid the api server becoming a bottleneck.
Use this as the default when
--api-server-countis not specified. - #32868 [MISC] Add .cursor to .gitignore — ready — by vadiklyutiy (合并于: 2026-01-23 01:27 (UTC+8)) [+3/-0, 1 files | commented:1 approved:1]
Add
.cursorfolder to.gitignore -
#32837 [Hardware][AMD][CI][Bugfix] Fix regressions from deprecated env vars — bug,rocm,ready,ci/build,v1,kv-connector — by mawong-amd (合并于: 2026-01-23 00:59 (UTC+8)) [💬1 | +14/-6, 3 files | commented:1 approved:1] ## Purpose This PR fixes various issues on AMD ROCm caused by the deprecation of environment variables in https://github.com/vllm-project/vllm/pull/32812.
Firstly, the deprecation of
VLLM_V1_USE_PREFILL_DECODE_ATTENTIONmeans thatget_current_vllm_configis now called as a replacement inRocmPlatform.get_attn_backend_clsto determine if theROCM_ATTNbackend should be used. However, there are instances where the current vLLM config is not yet set, which causes the above function to erro… -
#32844 [Bugfix] ModelScope is supported when downloading LORA models. — bug,ready — by AuYang261 (合并于: 2026-01-23 00:33 (UTC+8)) [💬1 | +21/-6, 1 files | commented:2 approved:1]
## Purpose Fix the bug that the environment variable
VLLM_USE_MODELSCOPE=1is ineffective when downloading LoRA models, which is described in #32841 .## Test Plan After setting the environment variables
VLLM_USE_MODELSCOPE=1, this command tests downloading the LoRA model from ModelScope.vllm serve LLM-Research/Llama-3.2-3B-Instruct --enable-lora --lora-modules sql-lora=vllm-ascend/llama32-3b-text2sql-spider…
-
#30200 [Frontend] Introduce Renderer for processing chat messages (using
ModelConfig) — documentation,performance,structured-output,frontend,ready,ci/build,v1,multi-modality,tool-calling,gpt-oss — by DarkLight1337 (合并于: 2026-01-22 20:44 (UTC+8)) [💬29 | +2141/-1585, 48 files | commented:10] Ready for review.## Purpose
- Prototype an interface,
vllm.renderers.RendererLike, to process chat messages into engine inputs. - Introduce
RENDERER_REGISTRYwhich lazily registers renderers to avoid circular import problem. - Move implementation-specific chat utils to the corresponding renderer in
vllm.renderers. - Initialize the renderer in
InputPreprocessor, replacing the tokenizer initialization insideLLMEngineandAsyncLLM. - Replace
EngineClient.get_tokenizer()with `Engi…
- Prototype an interface,
-
#14526 Support bge-m3 sparse embeddings and colbert embeddings — documentation,new-model,frontend,ready — by maxdebayser (合并于: 2026-01-22 23:52 (UTC+8)) [💬29 | +393/-19, 9 files | commented:7 approved:1] FIX #13609 FIX #15384 FIX #18469
Here I’m loading the extra
sparse_linear.ptfile using the secondary_weights loading introduced in the ultravox model when I detect that the model name isBAAI/bge-m3. It’s a bit ugly but I don’t know if there is a more generic way to do this.Currently, since the only permissible pooling return type is torch.tensor, I’m just returning the token weights tensor directly. If the user wants to match tokens to the weights they have to call
tokenizeand remove… -
#32668 [Misc] Bump opencv-python dependecy version to 4.13 — ready,ci/build — by Isotr0py (合并于: 2026-01-22 23:51 (UTC+8)) [💬4 | +35/-16, 5 files | commented:4 approved:1]
## Purpose
- Update opencv-python version in dependencies to get up-to-date security fix.
## Test Plan
## Test Result
…
-
#32706 [Cleanup] Move scheduler
get_routed_expertslogic to separate method — ready,v1 — by njhill (合并于: 2026-01-22 23:46 (UTC+8)) [+31/-23, 1 files | commented:2 approved:1] Follow-on from https://github.com/vllm-project/vllm/pull/28284 which I didn’t get time to fully review.It harms readability to have all of this special-case logic in the main scheduler flow, better to move it to a separate method.
Also includes small change to ensure that we always return the final output for a request, even if otherwise empty.
-
#32805 [torch.compile] Improve Cold Start for MoEs — ready — by zou3519 (合并于: 2026-01-22 23:44 (UTC+8)) [💬2 | +71/-9, 3 files | commented:3 approved:2] ## Purpose
Fixes https://github.com/vllm-project/vllm/issues/29992
For torch.compile cold start times, we need to avoid hard-coding any strings into the graph. Right now, the vllm.moe_forward and vllm.moe_forward_shared custom operators hard-code strings into the graph.
The workaround is to store a list of the strings that each of those custom ops needs, in reverse order, in the ForwardContext. The ForwardContext object is alive for the duration of the forward pass. When the custom op needs t…
-
#31756 [Misc][BE] Turn on strict type coverage for vllm/compilation — ready,nvidia — by Lucaskabela (合并于: 2026-01-22 23:12 (UTC+8)) [💬2 | +121/-68, 11 files | commented:9 approved:1] ## Purpose This PR follows up * #31748 * #31744 * #31554 And adds strict type checking to our linter/pyproject.toml, turning it on in the
compilationrepo as a test pilot## Test Plan ``` mypy vllm/compilation …
-
#30692 OffloadingConnector: Support kernel_block_size != block_size — ready,v1 — by orozery (合并于: 2026-01-22 20:30 (UTC+8)) [💬6 | +156/-114, 7 files | commented:9 approved:1] This PR enables the offloading connector to work for cases where the kernel block size is different than vLLM’s logical block size.
To efficiently copy blocks when the logical block size is greater than the block size, we add an additional parameter to the swap_blocks function that determines the size per each contiguous copy.
This fix addresses the same issue that was fixed in the NixlConnector in #28677.
…
-
#32824 [Frontend] add prompt_cache_key for openresponses — frontend,ready — by chaunceyjiang (合并于: 2026-01-22 19:34 (UTC+8)) [💬1 | +8/-0, 1 files | commented:5 approved:1] ## Purpose supports https://www.openresponses.org/reference
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32833 [CI] refactor release pipeline config into groups — ready,ci/build — by Harry-Chen (merged: 2026-01-22 19:27 (UTC+8)) [💬1 | +277/-276, 1 files | commented:2 approved:2] ## Purpose
The current release pipeline has too many steps and is somewhat hard to read and maintain. This PR puts the steps into groups without changing the actual jobs.
Specifically, it also removes the block before the CUDA 13.0 release image build steps, as requested by @wangshangsam ( https://github.com/vllm-project/vllm/pull/32522#issuecomment-3776281837).
## Test Plan
Tested by CI.
…
-
#32574 [Frontend][2/n] Make pooling entrypoints request schema consensus | ChatRequest — documentation,frontend,ready,qwen — by noooop (merged: 2026-01-22 18:32 (UTC+8)) [💬4 | +456/-205, 24 files | commented:1 approved:1]
## Purpose
Split out the following RequestMixin
- ChatRequestMixin
Add tests
- test_chat_request
…
-
#32789 [Bugfix] Fix Whisper/encoder-decoder GPU memory leak — bug,ready,v1,multi-modality — by NickLucche (merged: 2026-01-22 18:50 (UTC+8)) [💬2 | +54/-5, 2 files | commented:2 approved:1] Fix https://github.com/vllm-project/vllm/issues/31577.
# Overview For encoder-decoder models (e.g., Whisper), the encoder cache manager was returning newly allocated entries in
`get_freed_mm_hashes()` during the same scheduling step they were allocated. Since the runner frees cache entries before model execution, attempts to free encoder outputs missed, leading to an ever-growing encoder cache. Flow: `EncoderCacheManager.get_freed_mm_hashes` -> runner "free mm hashes" -> runner._execute_m… -
#30207 Enable Cross layers KV cache layout at NIXL Connector — documentation,ready,v1,kv-connector — by liranschour (merged: 2026-01-22 18:12 (UTC+8)) [💬16 | +308/-89, 5 files | commented:9 changes:1] ## Purpose Enable the NIXL Connector to use the new continuous cross-layer KV cache layout described in RFC #27742 and implemented in #27743. Demonstrates a performance improvement of more than 2x in tok/sec and TTFT due to a dramatic reduction in fragmentation of transfer buffers.
Tested with P!=D with run_accuracy_test.sh P=1 D=2
Tested on H100 1P1D Qwen/Qwen3-0.6B:
b… -
#32746 [Misc] Replace urllib's `urlparse` with urllib3's `parse_url` — ready,multi-modality — by Isotr0py (merged: 2026-01-22 16:37 (UTC+8)) [💬1 | +23/-16, 4 files | commented:2 approved:1] ## Purpose
- Replace urllib's `urlparse` with urllib3's `parse_url` for safer and more standard URL parsing.
## Test Plan
## Test Result
…
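For context on why a stricter parser helps: the stdlib `urlparse` is purely syntactic and silently yields empty components for scheme-less input. This stdlib-only sketch shows the behavior; urllib3's `parse_url`, which the PR adopts, is not shown here:

```python
# urllib.parse.urlparse splits a URL without validating it.
from urllib.parse import urlparse

ok = urlparse("http://example.com:8080/path?q=1")
print(ok.scheme, ok.netloc, ok.path)  # http example.com:8080 /path

# A scheme-less URL is silently treated as a relative path, not an error:
odd = urlparse("example.com/path")
print(repr(odd.scheme), repr(odd.netloc), repr(odd.path))  # '' '' 'example.com/path'
```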
-
#28664 [AMD][ROCm] MoRI EP: a high-performance all2all backend — documentation,rocm,ready,ci/build,deepseek,nvidia — by alexsun07 (merged: 2026-01-22 16:33 (UTC+8)) [💬17 | +397/-9, 16 files | commented:10] ## Purpose
This PR integrates MoRI-EP, a high-performance all2all comm kernel, into vLLM as an all2all backend. See the MoRI project here. MoRI also supports CUDA graph.
This PR follows the design of vLLM’s Fused MoE Modular Kernel. The Fused MoE Modular Kernel is composed of following components: [Router] → [Quantize-Dispatch] → [Permute-Experts-Unpermute] → [Combine]
For MoRI+AITER path, w…
-
#32757 [Model] Extend `collect_children` and `no_init_weights` contexts — ready,llama,qwen — by DarkLight1337 (merged: 2026-01-22 16:20 (UTC+8)) [💬3 | +442/-255, 20 files | commented:4 approved:1] ## Purpose Further generalize
`_mark_language_model` and `_mark_tower_model` to handle the case where both MM and LM components are in the same child module, by introducing a `target` argument. For convenience, add a wrapper `_mark_composite_model` which enters the context of both `_mark_language_model` and `_mark_tower_model`. Also, merge `LMMissingLayer` and `TowerMissingLayer` into `StageMissingLayer`, and use this to replace generation-specific modules in pooling adapters as well. cc @noo… -
#32667 [bench] add start_times field to vllm bench serve json result — performance,ready — by kebe7jun (merged: 2026-01-22 15:10 (UTC+8)) [+2/-0, 1 files | commented:1 approved:1] ## Purpose
Fix https://github.com/vllm-project/vllm/issues/32666
## Test Plan
vllm bench serve --save-detailed --trust-remote-code --base-url http://127.0.0.1:17021 --model /home/ubuntu/DeepSeek-V3.2 --seed 4632803 --dataset-name random --model /home/ubuntu/DeepSeek-V3.2 --num-prompts 512 --save-result --ignore-eos --max-concurrency 64 --random-input-len 6000 --random-output-len 500 --request-rate 2.4 --result-filename bench-result.json…
- #32810 [FlashMLA] Update FlashMLA to expose new arguments — ci/build,v1,ready-run-all-tests — by LucasWilkinson (merged: 2026-01-22 13:02 (UTC+8)) [💬1 | +133/-217, 8 files | commented:1 approved:1] The FlashMLA update contains new arguments; update the interface to expose them.
-
#32835 [ROCm][CI][Docs] Add comment explaining TRITON_ATTN fallback for ROCm — rocm,speculative-decoding,v1 — by AndreasKaratzas (merged: 2026-01-22 14:11 (UTC+8)) [+2/-0, 1 files | commented:1 approved:1] This PR adds a clarifying comment as requested in #32787 by @DarkLight1337.
The comment explains why `TRITON_ATTN` is used as the fallback attention backend for ROCm platforms. -
#32346 [ROCm][CI] Fix AITER test flakiness by using explicit attention backend — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-01-22 13:55 (UTC+8)) [💬5 | +10/-6, 3 files | commented:6 approved:1] ## Summary
This PR fixes test failures for `TitanML/tiny-mixtral` when running AITER tests on ROCm as part of the full test suite. ## Problem
When running the test suite as a group, the following tests were failing:
`test_models[True-True-5-32-TitanML/tiny-mixtral]` and `test_models[False-True-5-32-TitanML/tiny-mixtral]`
…
-
#32787 [ROCm][CI] fix get_valid_backends — rocm,speculative-decoding,ready,v1 — by divakar-amd (merged: 2026-01-22 12:27 (UTC+8)) [+8/-2, 1 files | commented:4 approved:1] This PR fixes the "get_valid_backends" check.
Details: `__getattr__` is overridden in interface.py and returns
`None` if an attribute is not found, instead of raising an `AttributeError`. `pytest -s -v tests/v1/spec_decode/test_acceptance_length.py` Note: There is more debugging ongoing; another PR may be needed to fully fix this test. Preliminary analysis hints towards th…
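A minimal reproduction of the failure mode described above, using illustrative class names rather than vLLM's actual `Platform`: a `__getattr__` that returns `None` makes both `hasattr()` checks and "is the method implemented?" checks unreliable.

```python
# A __getattr__ that swallows missing attributes vs. the conventional behavior.
class LenientPlatform:
    def __getattr__(self, name):
        return None  # never raises, so every attribute "exists"


class StrictPlatform:
    def __getattr__(self, name):
        raise AttributeError(name)  # conventional: missing attributes raise


lenient, strict = LenientPlatform(), StrictPlatform()
print(hasattr(lenient, "get_valid_backends"))  # True (misleading!)
print(lenient.get_valid_backends)              # None, so calling it fails later
print(hasattr(strict, "get_valid_backends"))   # False (as expected)
```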
- #32731 [ROCm][CI] Lower Acceptance Len Threshold For test_draft_model_quantization — rocm,ready,v1 — by micah-wil (merged: 2026-01-22 13:47 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
https://github.com/vllm-project/vllm/pull/24322 introduced tests for spec decoding with draft models in the
`V1 Test e2e + engine` test group. One test case fails on AMD CI: `pytest -v -s tests/v1/e2e/test_spec_decode.py::test_draft_model_quantization[True-target_quantized]` The test fails in an assertion: ```
assert acceptance_len >= args.expected_acceptance_len -- ... -
#32287 Upgrade transformers-4.57.5 — ready,ci/build,multi-modality,ready-run-all-tests — by huydhn (merged: 2026-01-22 13:19 (UTC+8)) [💬6 | +25/-3, 4 files | commented:1 approved:3] ## Purpose
https://pypi.org/project/transformers/4.57.5 is out today. Although there is already a PR to upgrade to 5.x (https://github.com/vllm-project/vllm/pull/30566), I'm looking to move to
`4.57.5` earlier to address this CI issue: https://github.com/huggingface/transformers/issues/42886 ## Test Plan
CI run
cc @hmellor
- #32780 [Llama.py -> mistral.py] Extract mistral-only relevant code into separate file — new-model,ready,llama — by patrickvonplaten (merged: 2026-01-22 13:14 (UTC+8)) [💬3 | +248/-115, 3 files | commented:5 approved:1] We're adding more and more mistral-only code to the llama.py class, which makes it harder to read and creates possible unwanted future dependencies. E.g., if other models depend on the llama.py class, one might think that mistral-only code is also relevant for those classes, making vLLM too rigid.
- #32775 [Docs] Remove outdated async_scheduling limitation with speculative decoding — ready — by ikaadil (merged: 2026-01-22 12:19 (UTC+8)) [💬3 | +0/-3, 1 files | commented:4 approved:2 changes:1]
Remove the outdated limitation from the
`async_scheduling` docstring. Async scheduling now works with speculative decoding (enabled in #31998). -
#32788 Cleanup some huggingface_hub-related stuff — ready — by Wauplin (merged: 2026-01-22 11:38 (UTC+8)) [💬1 | +9/-50, 3 files | commented:1 approved:2] Hi there! Maintainer of the
`huggingface_hub` Python SDK here :wave: I've looked at how `huggingface_hub` is used in vLLM and found a few places that could be improved/cleaned up:
- in lora utils: no need to catch
`EntryNotFoundError` and `RepositoryNotFoundError` since both are subclasses of `HfHubHTTPError` (which is already caught) - removed the
`_get_hf_token` private utility => the only thing it does is fetch the `HF_TOKEN` env variable and clean it. Since `token` is then passed only to `tra…
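The first bullet relies on a basic property of Python exception handling: catching a base class also catches its subclasses. The stand-in classes below merely mirror the names of huggingface_hub's hierarchy and are not the real library:

```python
# Local stand-ins mimicking huggingface_hub's exception hierarchy:
# EntryNotFoundError and RepositoryNotFoundError both derive from HfHubHTTPError.
class HfHubHTTPError(Exception): ...
class EntryNotFoundError(HfHubHTTPError): ...
class RepositoryNotFoundError(HfHubHTTPError): ...


def classify(exc: Exception) -> str:
    try:
        raise exc
    except HfHubHTTPError:   # one handler covers the base class and subclasses
        return "hub error"
    except Exception:
        return "other"


print(classify(EntryNotFoundError()))       # hub error
print(classify(RepositoryNotFoundError()))  # hub error
print(classify(ValueError()))               # other
```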
-
#32585 [EC Connector] Optimize remote cache check in scheduler — ready,v1 — by knlnguyen1802 (merged: 2026-01-22 11:30 (UTC+8)) [💬3 | +76/-57, 5 files | commented:2 approved:1]
## Purpose Currently for the EC Connector, the scheduler checks whether the remote cache exists at every step, even when all requests are in the decode phase. This hurts TPOT.
## What changes in this PR Modify
`_try_schedule_encoder_input` to check for remote cache entries only when the current multimodal media is planned to be scheduled. ## Test Result…
-
#32818 [Bugfix] Fix potential EAGLE spec decode segfault during graph capture — bug,speculative-decoding,ready,v1 — by mawong-amd (merged: 2026-01-22 11:11 (UTC+8)) [+8/-4, 1 files | commented:1 approved:1] ## Purpose This PR restores the original logic in
`SpecDecodeBaseProposer`'s `dummy_run` function, which was changed by the refactor in https://github.com/vllm-project/vllm/pull/30143. Credits also to @micah-wil for the investigation and fix. After the refactor, the
`use_cudagraphs` parameter is left unused, which incorrectly leads to CUDA graph dispatch where it previously would not have occurred. We are seeing this manifest on MI300X as a segfault during CUDA…
[Closed unmerged PRs]
- #19927 [V1] Solve potential deadlock issue in v1 engine core client internally — needs-rebase,stale,v1 — by Isotr0py (closed: 2026-01-23 10:17 (UTC+8)) [💬5 | +75/-79, 6 files | commented:3]
## Essential Elements of an Effective PR Description Checklist
- The purpose of the PR, such as “Fix some issue (link existing issues this PR will resolve)”.
- The test plan, such as providing test command.
- The test results, such as pasting the results comparison before and after, or e2e results
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
## Purpose
- Following PR for https://github.com/vllm-project/v…
-
#19999 [Models] Remove GPU-CPU sync when `do_pan_and_scan=false` in Gemma3 — needs-rebase,stale — by lgeiger (closed: 2026-01-23 10:17 (UTC+8)) [💬5 | +20/-5, 1 files | commented:4] When `do_pan_and_scan=false` we currently still do a bit of extra preprocessing work, pass around a `num_crops` tensor, and call `num_patches.tolist()`. This is not necessary, and the latter causes GPU-host synchronization, preventing some CPU operations from being overlapped with GPU work: This PR makes the
`num_patches` tensor optional and prevents sp… - #22752 [V1] implement tree sampler for draft token acceptance — speculative-decoding,ready,needs-rebase,stale,v1,llama — by TheEpicDolphin (closed: 2026-01-23 10:16 (UTC+8)) [💬14 | +1486/-254, 13 files | commented:10] # Purpose Continuing my work in PR #20401, where I added support for drafting a tree of speculative tokens that are then validated by the target model. In this PR, I add the class that performs rejection sampling for those draft tokens, so that they conform to the target model's output distribution. This class is based on RejectionSampler, but with some key differences necessary to support a tree structure for drafted tokens. I added some tests for this new class to verify it's …
-
#24258 [ci/testing]: ensure GPU memory is cleaned when exiting the remote OpenAI server — stale — by AzizCode92 (closed: 2026-01-23 10:16 (UTC+8)) [💬2 | +16/-0, 1 files | commented:2] ## Purpose
The
`__exit__` method of `RemoteOpenAIServer` is not sufficient to clean up GPU memory. This creates a race condition between the different tests in CI. This PR ensures we properly clean up GPU memory when exiting the OpenAI remote server. Solves: https://github.com/vllm-project/vllm/issues/24144
## Test Plan
…
-
#24911 Download model and add configs. — stale — by chenxi-yang (closed: 2026-01-23 10:15 (UTC+8)) [💬7 | +648/-0, 4 files | commented:1] Summary: Add kernel configs for glm.
Rollback Plan:
Reviewed By: zzh142857
Differential Revision: D82469889
-
#25173 TokenizerBase#convert_ids_to_tokens adds support for single id conversion — stale — by usberkeley (closed: 2026-01-23 10:15 (UTC+8)) [💬5 | +49/-19, 5 files] ## Purpose
`TokenizerBase#convert_ids_to_tokens` adds support for single-ID conversion, ensuring consistency with `PreTrainedTokenizer` and `PreTrainedTokenizerFast`. This has the following advantages:
1) Maintains interface consistency: a single id can be safely converted to a token. For example, the `detokenize_incrementally` function no longer needs to convert `new_token_id` into a list: `new_tokens = tokenizer.convert_ids_to_tokens(new_token_id, skip_special_tokens=skip_special_tokens)`…
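A sketch of the proposed overload. The vocabulary and free function here are hypothetical; the real change touches `TokenizerBase`:

```python
# Accept either a single id or a list of ids, mirroring the behavior of
# transformers' PreTrainedTokenizer.convert_ids_to_tokens.
from typing import List, Union

VOCAB = {0: "<s>", 1: "hello", 2: "world"}


def convert_ids_to_tokens(ids: Union[int, List[int]]) -> Union[str, List[str]]:
    if isinstance(ids, int):
        return VOCAB[ids]           # single id -> single token
    return [VOCAB[i] for i in ids]  # list of ids -> list of tokens


print(convert_ids_to_tokens(1))       # hello
print(convert_ids_to_tokens([1, 2]))  # ['hello', 'world']
```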
-
#25441 add str format for proxy port — stale,kv-connector — by yangqinghao-cmss (closed: 2026-01-23 10:15 (UTC+8)) [💬3 | +1/-1, 1 files | commented:1]
## Purpose When the
`proxy_port` in the `kv_connector_extra_config` parameter of `vllm serve` is an integer, it triggers `TypeError: can only concatenate str (not "int") to str`. This is because no type check is performed before initializing `P2pNcclEngine`. It is therefore recommended that `proxy_port` be formatted as a string. ## Test Plan ## Test Result
... -
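A minimal reproduction of the TypeError and the proposed normalization (an illustrative helper, not `P2pNcclEngine` itself):

```python
# str + int raises TypeError; casting the port with str() makes both int and
# str inputs safe before concatenation.
def build_proxy_addr(host: str, proxy_port) -> str:
    return host + ":" + str(proxy_port)


try:
    "127.0.0.1" + ":" + 8080  # what happens without the cast
except TypeError as e:
    print(e)  # can only concatenate str (not "int") to str

print(build_proxy_addr("127.0.0.1", 8080))    # 127.0.0.1:8080
print(build_proxy_addr("127.0.0.1", "8080"))  # 127.0.0.1:8080
```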
- #30108 Add config flag for VLLM_DISABLE_COMPILE_CACHE — documentation,llama — by elizabetht (closed: 2026-01-23 04:34 (UTC+8)) [💬11 | commented:9 approved:1]
## Purpose Add config flag for VLLM_DISABLE_COMPILE_CACHE which should be overridden if env var is set
## Test Plan
Scenario | Config Value | Env Var | Test
– | – | – | –
Default (nothing set) | False | Not set | ✅ test_disable_compile_cache_config_without_env_var …
-
#32883 deepseek_v32_encoding: Replace asserts with proper exceptions — deepseek — by RishabhSaini (closed: 2026-01-23 05:05 (UTC+8)) [💬1 | +38/-31, 1 files] Resolves: https://github.com/vllm-project/vllm/issues/32874
Replace validation asserts with ValueError and parsing asserts with RuntimeError, returning 400 Bad Request instead of 500 Internal Server Error for invalid input.
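A sketch of the swap described above (illustrative function, not the actual deepseek_v32 code): validation via ValueError, which a web layer can map to 400 Bad Request, instead of assert, which surfaces as a generic 500 and disappears entirely under `python -O`.

```python
# Before: assert all(0 <= t < vocab_size for t in tokens)
# After: raise ValueError with a descriptive message for invalid input.
from typing import List


def validate_tokens(tokens: List[int], vocab_size: int) -> List[int]:
    bad = [t for t in tokens if not 0 <= t < vocab_size]
    if bad:
        raise ValueError(f"token ids out of range: {bad}")
    return tokens


print(validate_tokens([1, 2, 3], vocab_size=10))  # [1, 2, 3]
try:
    validate_tokens([1, 99], vocab_size=10)
except ValueError as e:
    print(e)  # token ids out of range: [99]
```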
-
#32880 Add inside_opaque_custom_op context — no labels — by mgoin (closed: 2026-01-23 04:17 (UTC+8)) [+67/-3, 2 files | commented:1 | 📝draft]
## Purpose
## Test Plan
## Test Result
... -
- #32875 Add the missing router_logits_dtype for nemotron_h — no labels — by cjluo-nv (closed: 2026-01-23 02:59 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1]
## Purpose
Nemotron H uses an FP32 linear layer for the gate, so the router logits are FP32 rather than the default BF16.
This fixes the triggered error:
``` (EngineCore_DP2 pid=147) return self.forward_impl_chunked( …
-
#32851 Fix mkdocs_autorefs error — needs-rebase — by aireilly (closed: 2026-01-22 20:42 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1 | 📝draft] Fixes this docs build error:
``` WARNING - mkdocs_autorefs: api/vllm/model_executor/models/index.md: from /home/docs/checkouts/readthedocs.org/user_builds/vllm/checkouts/32849/vllm/model_executor/models/interfaces_base.py:167: (vllm.model_executor.models.VllmModelForPooling.default_pooling_type) Could not find cross-reference target ‘vllm.model_executor.layers.pooler.PoolerConfig.pooling_type’ WARNING - mkdocs_autorefs: api/vllm/model_executor/models/interfaces_base.md: from /home/docs/checko…
-
#31705 [BugFix] Support setting tp=1 for the Eagle draft model to take effect — bug,speculative-decoding,v1 — by zhaomingyu13 (closed: 2026-01-22 16:40 (UTC+8)) [💬6 | commented:2 changes:1] ## Purpose According to the official documentation, the parameter
`draft_tensor_parallel_size: 1` is supposed to be applied to the Eagle model. However, actual debugging found that the tensor parallel size (tp) of the Eagle model always matches that of the target model; the tp setting for the draft model did not take effect as expected. Note: This feature has not been combined and tested with
`sp` and `dp`. It will be adapted later. ## Test Plan ```python imp… -
#32813 [Hardware][AMD][Bugfix] Fix PTPC FP8 quantization — bug,rocm — by mawong-amd (closed: 2026-01-22 11:39 (UTC+8)) [💬1 | +2/-2, 1 files | commented:1] ## Purpose Fixes PTPC FP8 quantization, and thus AMD Quantization Tests, after the refactoring done in https://github.com/vllm-project/vllm/pull/32189.
`PTPCFP8LinearMethod` should now inherit from `FP8OnlineLinearMethod` rather than `FP8LinearMethod`. ## Test Plan
`pytest -sv quantization/test_ptpc_fp8.py` The above is implicitly run as part of AMD CI's Quantization Tests group. ## Test Result The test and test group both pass.
…
-
#32825 [Hardware][AMD][Bugfix] Fix spec decode acceptance length test — bug,rocm,speculative-decoding,v1 — by mawong-amd (closed: 2026-01-22 11:37 (UTC+8)) [💬2 | +13/-4, 1 files | commented:1] ## Purpose This PR fixes the speculative decoding acceptance length tests, first implemented in https://github.com/vllm-project/vllm/pull/32030, to run on ROCm.
The current implementation fails because it erroneously calls
`current_platform.get_valid_backends`, which is not implemented on non-CUDA platforms. It guards this call behind `hasattr(current_platform, "get_valid_backends")`, but this does not work because the `Platform` class implements a `__getattr__` met… -
- #32830 fix — needs-rebase,v1 — by aidendle94 (closed: 2026-01-22 12:30 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
...