[vLLM GitHub Development Digest] 2026-01-22
[Overview]
- Time window: 2026-01-22 10:55 (UTC+8) ~ 2026-01-23 10:55 (UTC+8)
- New issues: 31 (label distribution: bug:19, feature request:5, cpu:4, usage:4, rocm:3)
- Closed issues: 45
- New PRs: 45 (label distribution: bug:17, ready:10, v1:10, rocm:7, ci/build:5)
- Merged PRs: 42
- PRs closed without merging: 16
[New issues]
-
#32901 [Feature]: make thinking parameter consistent with openai api — feature request — by ltm920716 (created: 2026-01-23 10:52 (UTC+8)) ### 🚀 The feature, motivation and pitch
Hello, why is the "thinking" parameter inconsistent with the OpenAI API? Is there any consideration behind this? Thank you.
### Alternatives
No response
### Additional context …
-
#32900 [Feature]: Container image WORKDIR consistency — feature request,cpu — by qhaas (created: 2026-01-23 10:43 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
The CPU image uses `/workspace` while the GPU image uses `/vllm-workspace`. While this is a minor distinction, having the same WORKDIR path would simplify the logic for those building derived container images that support both variants.
### Alternatives
No response …
-
#32858 [Bug]: Use vllm to obtain Qwen3VL last token hidden status — bug — by JasonLeeUT (created: 2026-01-22 22:15 (UTC+8)) [💬1] ### Your current environment
from hdfs_io import hopen, hexists, hlist_files,hmv from qwen_vl_utils.vision_process import smart_resize
import torch, base64, io from dataclasses import asdict import json import os import argparse …
-
#32898 [Bug]: Incorrect/Incoherent model output in Disaggregated Serving (PD Separation) on v0.14.0 — bug — by JianDan0212 (created: 2026-01-23 10:00 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
Since I'm in a completely isolated intranet environment, I'm unable to copy the parameter information out. I'm using vLLM version 0.14.0 with an A100 GPU. …
-
#32897 [Bug]: PaddleOCR-VL crash — bug — by NiuBlibing (created: 2026-01-23 09:53 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32862 [Bug]: llama4-fp8 tp=2 ep=2 doesn't work on b200 — bug — by ProExpertProg (created: 2026-01-22 23:16 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... uv is set ============================== System Info ...
-
#32841 [Bug]: The environment variable `VLLM_USE_MODELSCOPE=1` is ineffective when downloading LoRA models — bug — by AuYang261 (created: 2026-01-22 17:45 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...
-
#32895 [Bug]: [ROCm] [MI355X] new 0.14 upstream gptoss hard error TP=1? — bug,rocm — by functionstackx (created: 2026-01-23 08:43 (UTC+8)) [💬2] ### Your current environment
Platform: AMD MI355X Docker Image: vllm/vllm-openai-rocm:v0.14.0
### 🐛 Describe the bug
hi @powderluv @chunfangamd
Full logs of this bug available at: https://github.com/InferenceMAX/InferenceMAX/actions/runs/21260856946 …
-
#32896 [Installation]: How to use v0.10.x with pytorch2.9? — installation — by Wesley-Jzy (created: 2026-01-23 08:43 (UTC+8)) ### Your current environment
As the title describes, how can I get a vllm 0.10.x for pytorch 2.9 and cuda 12.8 environment?
### How you are installing vllm
pip install -vvv vllm…
-
#32889 [ROCm] MI355X CI parity is missing — feature request,rocm — by functionstackx (created: 2026-01-23 06:32 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
hi @powderluv @chunfangamd
There is already a good amount of B200 CI test enablement, but MI355 coverage is extremely lacking; there is close to zero MI355 CI.
It is not tracked in the vllm project board either: https://github.com/orgs/vllm-project/projects/39
What are the plans around MI355 vLLM upstream enablement? Thanks in advance for your time on this! ### Alternatives …
-
#32840 [Bug]: [CPU Backend] Engine crashed due to error on flashinfer op registration — bug,cpu — by fadara01 (created: 2026-01-22 17:20 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32874 [Bug]: DeepSeek V3.2 returns Internal Server Error instead of Bad Request in reasoning mode — bug — by artvl (created: 2026-01-23 01:48 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...
-
#32850 [RFC]: Clarify policy for Open Responses API extensions in vLLM — RFC — by DanielMe (created: 2026-01-22 20:11 (UTC+8)) [💬2] ### Motivation.
vLLM recently added support for the OpenAI-compatible `/v1/responses` API, which aligns with the public Open Responses specification. The specification explicitly allows implementations to extend existing schemas with implementation-specific fields, as long as core semantics remain unchanged. At the same time, such extensions carry a known risk: future vers…
-
#32870 [Bug][CPU Backend] [Arm]: FAILED tests/kernels/moe/test_moe.py::test_cpu_fused_moe_basic[silu-False-dtype0-2-8-128-128-1] — bug,cpu — by focusunsink (created: 2026-01-23 00:43 (UTC+8)) [💬5] ### Your current environment
The output of `python collect_env.py`:
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (aarch64) ...
-
#32871 [Bug]: — bug — by focusunsink (created: 2026-01-23 00:44 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#32866 wrong issue, need delete — bug,cpu — by yugetnow (created: 2026-01-23 00:16 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (aarch64) ...
-
#32864 [Bug] [ROCm]: Loading Qwen3-MoE-MXFP4 Weights in v0.14. — bug,rocm — by tjtanaa (created: 2026-01-22 23:31 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... uv is set ============================== ...
-
#32838 [Bug]: CompressedTensorsW4A16Fp4 is not supported on Turing — bug — by ir1ka (created: 2026-01-22 16:00 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32857 [Bug]: openai/gpt-oss-120b resolved architecture Qwen3ForCausalLM — bug — by LorenRd (created: 2026-01-22 21:28 (UTC+8)) ### Your current environment
Running vLLM (checkout v0.14.0) from Docker with ROCm, built with:
`DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .` If I try to serve openai/gpt-oss-120b, the resolved architecture is Qwen3ForCausalLM:
…
-
#32856 [Usage]: How to use VLLM to infer the MXFP4A16 model exported by LLM Compressor — usage — by lijixing0425 (created: 2026-01-22 21:05 (UTC+8)) ### Your current environment
Package versions: accelerate 1.12.0, aiohappyeyeballs 2.6.1, aiohttp 3.13.3, aiosignal 1.4.0, annotated-doc 0.0.4, annotated-types 0.7.0 …
-
#32854 [Usage]: — usage — by lijixing0425 (created: 2026-01-22 21:01 (UTC+8)) ### Your current environment
Package versions: accelerate 1.12.0, aiohappyeyeballs 2.6.1, aiohttp 3.13.3, aiosignal 1.4.0, annotated-doc 0.0.4, annotated-types 0.7.0 …
-
#32853 [Usage]: How to run a compressed model in MXFP4A16 format — usage — by lijixing0425 (created: 2026-01-22 20:57 (UTC+8)) ### Your current environment
First, I exported an LLM model in MXFP4A16 format using LLM Compressor. Then, I tried to infer this model based on VLLM, but I found that the inference result was incorrect.
### export MXFP4A16 from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot from llmcompressor.modifiers.quantization import QuantizationModifier from llmcompressor.utils import dispatch_for_generation …
-
#32848 [Usage]: Structured output — usage — by danielwit-lb (created: 2026-01-22 19:21 (UTC+8)) ### Your current environment
```text Collecting environment information… uv is set ============================== System Info ============================== OS : Amazon Linux 2023.9.20251208 (x86_64) GCC version : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5) …
-
#32843 [Feature]: `kv_cache_dtype="auto"` should resolve to the actual `kv_cache_dtype` picked by vllm, and be displayed to users — feature request — by fxmarty-amd (created: 2026-01-22 18:05 (UTC+8)) ### 🚀 The feature, motivation and pitch
As per title. At the moment, even in `TritonAttentionImpl.forward`, if the argument `--kv-cache-dtype` is not specified, it remains `kv_cache_dtype="auto"` even at the forward level. Although the documentation specifies `"auto": U…
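The requested behavior can be sketched with a tiny resolution helper. This is purely illustrative: `resolve_kv_cache_dtype` and its fallback-to-model-dtype rule are assumptions for the sketch, not vLLM's actual logic.

```python
# Hypothetical sketch of resolving kv_cache_dtype="auto" eagerly at config
# time, so the concrete dtype can be logged and shown to users. The function
# name and the fallback rule are illustrative assumptions, not vLLM code.
def resolve_kv_cache_dtype(requested: str, model_dtype: str) -> str:
    # "auto" falls back to the model's weight dtype; anything else is explicit.
    return model_dtype if requested == "auto" else requested

print(resolve_kv_cache_dtype("auto", "bfloat16"))  # bfloat16
print(resolve_kv_cache_dtype("fp8", "bfloat16"))   # fp8
```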
-
#32839 [Bug] AssertionError loading Unsloth-optimized Qwen3-VL-2B-4bit with bitsandbytes in vLLM 0.14.0 — bug — by mesteruh (created: 2026-01-22 16:55 (UTC+8)) ### Your current environment
Collecting environment information... ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 Clang version : Could not collect …
-
#32827 [Bug]: OpenAI tool call crashes when parameter description contains parentheses with examples (e.g. "(e.g. ls -la)") — bug — by Lcng-H (created: 2026-01-22 11:39 (UTC+8)) ### Your current environment
============================== Environment Variables ============================== NVIDIA_VISIBLE_DEVICES=all NVIDIA_REQUIRE_CUDA=cuda NVIDIA_DRIVER_CAPABILITIES=compute,utility VLLM_USE_MODELSCOPE=true ... -
#32834 [Bug]: CUDA illegal memory access caused by CUDA Graph with awq_marlin — bug — by jason-webcomm (created: 2026-01-22 13:35 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32832 [Performance]: H200 Kimi K2 TP8 Triton MoE min-latency — performance — by jhaotingc (created: 2026-01-22 12:57 (UTC+8)) ### Proposal to improve performance
Here's an nsys screenshot of DeepSeek-V3.1 TP=8, conc=16 gen step: 2 triton MoE kernels take about 109us.
The nsys screenshot of Kimi-K2-Instruct TP=8, conc=16, gen step: 2 triton MoE kernels take 234us. [screenshot attachment] …
-
#32828 [Feature]: Qwen3-Next dual-stream execution in_proj_qkvz in_proj_ba — feature request — by jhaotingc (created: 2026-01-22 12:01 (UTC+8)) ### 🚀 The feature, motivation and pitch
In TensorRT LLM, Qwen3-Next `in_proj_qkvz` and `in_proj_ba` are executed in dual stream ([code ref](https://github.com/NVIDIA/TensorRT-LLM/blob/0434db5bf75dfd01fe575a79c27d9260b597f167/tensorrt_llm/_torch/models/modeling_qwen3_next.py#L777C26-L777C30)). See the TensorRT LLM nsys screenshot below: the two GEMMs are executed in 2 streams in parallel (Qwen3-Next-80B-A3B TP2 generation step BS=8). [screenshot attachment] …
-
#32829 [Bug]: During GLM-4.7 function calling (fc), the function output is not streamed. — bug — by zhangsongqing (created: 2026-01-22 12:16 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#32826 [Bug]: MiniMax-M2.1 NVFP4 fails on RTX PRO 6000 Blackwell (SM120) with expert parallel — bug — by gittb (created: 2026-01-22 10:59 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
[Closed issues]
-
#27413 [Usage]: how to request a qwen2.5-VL-7B classify model served by vllm using openai SDK? — good first issue,usage,stale — by muziyongshixin (closed: 2026-01-23 10:22 (UTC+8)) [💬13] ### Your current environment
The output of `python collect_env.py` ### How would you like to use vllm
I launch a server with the following command to serve a Qwen2.5-VL-7B model finetuned for sequence classification (this model replaced the lm_head with a 2-class score_head). …
-
#18811 [Bug]: python sampler is faster than flashinfer sampler — bug,stale — by ErykCh (closed: 2026-01-23 10:17 (UTC+8)) [💬7] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#18890 [Bug]: Passing base64 values via OpenAI-format requests fails after starting the model with vllm — bug,stale — by wqw0806 (closed: 2026-01-23 10:17 (UTC+8)) [💬6] ### Your current environment
A qwen2.5vl-7b model deployed with vllm 0.7.3, requested through the OpenAI-format interface. Method 1: reading a local file in code, converting it to base64, and passing it to the model works fine. Method 2: passing the local image's pre-converted base64 directly to the model raises an error. Below is the image I passed in; what is the problem?
### 🐛 Describe the bug …
-
#19131 [Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' vllm@0.9.0.1 — usage,stale — by rbgo404 (closed: 2026-01-23 10:17 (UTC+8)) [💬22] ### Your current environment
Trying to install vllm; it got installed and then I got this error.
``` torch : 2.7.0+cu128 cuda : 12.8 python: 3.12.9 Linux-6.8.0-1013-nvidia-64k-aarch64-with-glibc2.35 …
-
#19342 [Usage]: How to get the router logits for an MoE model? — usage,stale — by zhenqincn (closed: 2026-01-23 10:17 (UTC+8)) [💬6] ### How would you like to use vllm
I would like to run a Qwen/Qwen3-30B-A3B model and need to store the router logits for each prompt in order to calculate the usage balance among experts. Is there a recommended or efficient way to implement this?
-
#20060 [Bug]: Sampling discrepancy between ollama and vLLM for gemma-3-27b-it et al. — bug,stale — by NilsHellwig (closed: 2026-01-23 10:17 (UTC+8)) [💬41] ### 🐛 Describe the bug
This is what the code looks like:
```py from vllm import LLM, SamplingParams from vllm.sampling_params import GuidedDecodingParams from pydantic import BaseModel, Field, create_model from typing import Literal, List, Optional import json …
-
#20776 [Feature]: Add LFM2 models — new-model,feature request,stale — by egorsmkv (closed: 2026-01-23 10:17 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Recently LiquidAI released LFM2 models.
URL: https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38
Would love to see vLLM run them.
### Alternatives
…
-
#21207 [Feature]: Support xformers on ARM GPU machines including GB200. — feature request,stale — by kathyyu-google (closed: 2026-01-23 10:17 (UTC+8)) [💬4] ### 🚀 The feature, motivation and pitch
The official Dockerfile and requirements file don't include building `xformers` for ARM GPU machines such as GB200 machines. Would it be possible to support `xformers`? This dependency is needed for many models, including meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8. Thank you!
### Alternatives
…
-
#21708 [Bug]: vLLM Server Crash with CUDA Memory Error when serving `gemma-3-27b-it-FP8-Dynamic` — bug,stale — by hnt2601 (closed: 2026-01-23 10:16 (UTC+8)) [💬13] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#22532 [Bug]: NIXL disaggregation example does not work — bug,stale — by piotr-topnotch (closed: 2026-01-23 10:16 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#22578 [Bug]: [gpt-oss-120b] Chat Completions endpoint tool_call support is not working — bug,stale,gpt-oss — by gaopeijie (closed: 2026-01-23 10:16 (UTC+8)) [💬13] ### Your current environment
version: '3.8'
services: vllm: container_name: vllm-gpt image: vllm/vllm-openai:gptoss restart: unless-stopped runtime: nvidia …
-
#22750 [Bug]: Non-functional max_capture_size of gpt-oss causes OOM — bug,stale — by QwertyJack (closed: 2026-01-23 10:16 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#23292 [Feature][Chat Completion] Support builtin tools of gpt-oss — feature request,stale — by heheda12345 (closed: 2026-01-23 10:16 (UTC+8)) [💬9] ### 🚀 The feature, motivation and pitch
Based on a recent bug report, gpt-oss can generate builtin tool calls even if users don't ask for them, so adding proper support for them is a very high priority.
### Alternatives
No response
### Additional context
…
-
#23590 [CI]: Audit use of fixtures across tests to minimize server starts — ci/build,stale — by njhill (closed: 2026-01-23 10:16 (UTC+8)) [💬4] Many tests use a server with the same configuration, or are testing something where it wouldn't matter if the server configuration was changed to be the same as other test(s).
We should make sure that the fixtures are configured such that for these we re-use the running server as much as possible and don’t stop/start a new vLLM instance for each test.
In particular we should try to use package-scoped fixtures in preference to module-scoped, for example in https://github.com/vllm-project/vllm/t…
-
#23627 [Usage]: How to start deepseek with pd disaggregation through vllm p2pncclcommunicator? — usage,stale — by LJL36 (closed: 2026-01-23 10:16 (UTC+8)) [💬10] ### Your current environment
```text Collecting environment information… ============================== System Info ============================== OS : TencentOS Server 4.4 (x86_64) GCC version : (Tencent Compiler 12.3.1.4) 12.3.1 20230912 (TencentOS 12.3.1.4-2) Clang version : Could not collect …
-
#25310 [Feature]: Proactive, scheduler-bypass pull-KV path to reduce KV-transfer latency — feature request,stale — by Zaragoto (closed: 2026-01-23 10:15 (UTC+8)) [💬4] ### 🚀 The feature, motivation and pitch
In the current vLLM implementation, when the Decoder side receives a new request, it must wait for the current scheduling iteration of EngineCore to complete. Only in the next iteration can the scheduler allocate KV blocks and construct kv_connector_metadata, after which the model_runner dispatches the Pull KV task.
This waiting introduces latency averaging roughly half of the TPOT, leaving room for performance improvements.
### Alternatives
Introduce …
-
#25371 [Feature]: Distribute API servers across multiple nodes — feature request,stale — by Zhathw (closed: 2026-01-23 10:15 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
The current `--api-server-count` parameter scales the API servers on a single node.
from utils.py, APIServerProcessManager ``` … for i, in_addr, out_addr in zip(range(num_servers), input_addresses, output_addresses): client_config = { …
-
#25402 [Bug]: Speculative decoding has poor UX — bug,stale — by MatthewBonanni (closed: 2026-01-23 10:15 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#25437 [Bug][Qwen3-Next][DP]: Assert when serving Qwen3-Next using DP — bug,stale — by tlrmchlsmth (closed: 2026-01-23 10:15 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... uv is set ============================== System Info ============================== ...
-
#25452 [Usage]: Autoscaling vLLM with kuberay (pipeline parallel) — usage,stale — by akash-syook (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### Your current environment
Hello Team, a newbie here in K8s, Kuberay, and vLLM. Apologies if this question was already asked or is too simple. I am using the attached tutorial for "Pipeline parallelism with Kuberay" and it works successfully: https://docs.vllm.ai/projects/production-stack/en/latest/use_cases/pipeline-parallelism-kuberay.html I am moving to the next stage of autoscaling (HPA) and found an awesome tutorial: https://github.com/vllm-project/production-stack/blob/main/tutorials/10-horizontal-…
-
#25453 [Feature]: MXFP4 support for the SM75 device. — feature request,stale — by wang824892540 (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
There are widely available 20xx models and Tesla T4s on the market.
### Alternatives
No response
### Additional context
…
-
#25460 [Bug]: MTP assert error when num_speculative_tokens > 1 — bug,stale — by mpjlu (closed: 2026-01-23 10:15 (UTC+8)) [💬4] ### Your current environment
vllm serve ${MODEL_PATH}
--trust-remote-code
--block-size 64
--served-model-name deepseek-r1
--max-model-len 32184
--max-num-seqs 64
--gpu-memory-utilization 0.9
-tp 8
… -
#25484 [Feature][WideEP]: More "masked-m" GEMM and quant kernels for use with DeepEP LowLatency + PPLX All2Alls — feature request,stale — by tlrmchlsmth (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
The `dispatch` operation in DeepEP LowLatency and PPLX-kernels is a `hidden_states` tensor with shape `[num_local_experts, max_num_tokens_per_expert, hidden_size]`. `max_num_tokens_per_expert` may be much larger than the actual number of tokens for each expert. To avoid useless work, we need efficient "masked-m" kernels that only work on the relevant parts of the tensors. Currently we use DeepGEMM and a fused-silu-mul-quant kernel for blocked fp8 forma…
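The "masked-m" idea can be illustrated with a toy pure-Python sketch (illustrative only; the real kernels operate on GPU tensors): only the first `masked_m[e]` rows of each expert's tile hold real tokens, so a masked kernel processes those and skips the padded remainder.

```python
# Toy illustration of "masked-m" computation: tiles has shape
# [num_local_experts][max_num_tokens_per_expert][hidden_size], but only the
# first masked_m[e] rows of expert e are real tokens; the rest is padding.
def masked_row_sums(tiles, masked_m):
    out = []
    for e, tile in enumerate(tiles):
        # process only the real rows, skipping padded ones entirely
        out.append([sum(row) for row in tile[:masked_m[e]]])
    return out

tiles = [
    [[1, 2], [3, 4], [9, 9]],  # expert 0: 2 real rows, 1 padded
    [[5, 5], [9, 9], [9, 9]],  # expert 1: 1 real row, 2 padded
]
print(masked_row_sums(tiles, [2, 1]))  # [[3, 7], [10]]
```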
-
#25529 [Chore]: Get rid of annoying pre-commit error on MacOS — bug,stale — by panpan0000 (closed: 2026-01-23 10:15 (UTC+8)) [💬3] ### Your current environment
vLLM branch main with latest code.
My MacOS env:
…
-
#25536 [Bug]: Inconsistent P90 TTFT Results with Identical Benchmark Parameters — bug,stale — by jinqinn (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### Your current environment
### Environment
- vLLM Version: [v0.8.0]
- Model: DeepSeek-R1
- Backend: OpenAI Chat API
- Hardware: [H20]
- OS: Linux
### 🐛 Describe the bug …
-
#25595 [RFC]: Support openai HarmonyContext plugin — RFC,stale — by frank-wei (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### Motivation.
In some gpt-oss use cases, especially internal scenarios, we need to customize the HarmonyContext class. However, the current implementation does not allow this in OSS. I believe other users may also want similar flexibility to define their own class methods. This PR introduces custom plugin registration for HarmonyContext, making it more extensible by allowing users to register and load their own implementations.
Example:
from vllm.entrypoints.harmony_utils import register_c…
-
#25600 [Bug]: Partial TPU chip usage hangs indefinitely with MP backend — bug,stale — by thameem-abbas (closed: 2026-01-23 10:15 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
root@t1v-n-95832fcc-w-0:/workspace/vllm# python3.12 collect_env.py Collecting environment information... ============================== System Info ...
-
#32841 [Bug]: The environment variable `VLLM_USE_MODELSCOPE=1` is ineffective when downloading LoRA models — bug — by AuYang261 (closed: 2026-01-23 09:14 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...
-
#29095 [Performance]: gpt-oss-120b runs 4 times slower on v0.11.2 than on v0.11.0 — performance — by mosalov (closed: 2026-01-23 08:39 (UTC+8)) [💬7] ### Proposal to improve performance
No response
### Report of performance regression
I have been experimenting with running `gpt-oss-120b` on 4xH100 with the latest versions of vLLM, and I have noticed that after vLLM moved on from v0.11.0 the performance dropped significantly, roughly by 4 times. I wonder if anyone else noticed that or if it is a local issue on my side.
…
-
#32840 [Bug]: [CPU Backend] Engine crashed due to error on flashinfer op registration — bug,cpu — by fadara01 (closed: 2026-01-23 06:27 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32786 [Bug]: [CPU Backend] [Arm] Slow model-parallel inference across NUMA domains — bug,cpu — by fadara01 (closed: 2026-01-23 02:55 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#30569 [Feature]: Add --ssl-ciphers to CLI arguments — feature request — by GraceMoreau (closed: 2026-01-23 01:53 (UTC+8)) ### 🚀 The feature, motivation and pitch
Add support for an --ssl-ciphers CLI argument.
This flag would be passed directly to uvicorn's --ssl-ciphers kwarg inside the existing serve_http command, parallel to how the --ssl-certfile and --ssl-keyfile flags are handled.
Motivation: vLLM already exposes several uvicorn SSL-related parameters, enabling users to run the vLLM server with managed TLS. However, currently there is no way to specify the allowed SSL/TLS cipher suites. Adding this CLI opti…
-
#31245 [CI Failure]: mi325_4: DeepSeek V2-Lite Async EPLB Accuracy — ci-failure — by AndreasKaratzas (closed: 2026-01-23 01:35 (UTC+8)) [💬1] ### Name of failing test
bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_async_eplb.sh 0.25 1319 8030 ### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#32871 [Bug]: — bug — by focusunsink (closed: 2026-01-23 00:51 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
Your output of `python collect_env.py` here …
-
#32866 wrong issue, need delete — bug,cpu — by yugetnow (closed: 2026-01-23 00:22 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (aarch64) ...
-
#13609 [Usage]: How can I get the sparse embedding from OpenAI Embedding Client? — usage — by shawnhaoxy (closed: 2026-01-22 23:52 (UTC+8)) [💬15] I want to run an embedding model like BGE-m3 for online serving. I can get the dense embedding, but how can I get its sparse embedding?
-
#29992 [Bug]: vLLM cold start on MOE models not optimal — bug,torch.compile,startup-ux — by zou3519 (closed: 2026-01-22 23:44 (UTC+8)) [💬6] ### Your current environment
main
### 🐛 Describe the bug
Previously, for dense models with FX graph splitting, vLLM produced 3 unique graphs (the model is split at the attention operator). The graph split ends up producing ~50 graphs, but we only needed to compile 3 unique graphs out of the 50.
Looking at a tlparse for llama4 maverick, which is an MOE model:
- every other layer has a MOE (instead of an nn.Linear feedforward) …
-
#32854 [Usage]: — usage — by lijixing0425 (closed: 2026-01-22 21:02 (UTC+8)) ### Your current environment
Package versions: accelerate 1.12.0, aiohappyeyeballs 2.6.1, aiohttp 3.13.3, aiosignal 1.4.0, annotated-doc 0.0.4, annotated-types 0.7.0 …
-
#32853 [Usage]: How to run a compressed model in MXFP4A16 format — usage — by lijixing0425 (closed: 2026-01-22 20:58 (UTC+8)) ### Your current environment
First, I exported an LLM model in MXFP4A16 format using LLM Compressor. Then, I tried to infer this model based on VLLM, but I found that the inference result was incorrect.
### export MXFP4A16 from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot from llmcompressor.modifiers.quantization import QuantizationModifier from llmcompressor.utils import dispatch_for_generation …
-
#31577 [Bug]: Memory leak in serving Whisper — bug — by parssky (closed: 2026-01-22 18:50 (UTC+8)) [💬11] ### Your current environment
The output of `python collect_env.py`:
Collecting environment information... ============================== System Info ============================== ...
-
#32631 [Tracker]: Initialize MM components in context managers — multi-modality — by DarkLight1337 (closed: 2026-01-22 16:20 (UTC+8)) [💬5] ### Purpose
Apply #32605 to all applicable models to enable encoder-only and LM-only mode.
- #32632
- #32641
- #32650
- #32663
- #32695
- #32691 …
-
#32666 [Feature]: Save the start time of the benchmark request — feature request — by kebe7jun (closed: 2026-01-22 15:10 (UTC+8)) ### 🚀 The feature, motivation and pitch
The current `vllm bench serve --save-detailed` stores data such as `ttfts`, `itls`, and `tpots`, but does not save the actual request start time, which makes it difficult to accurately trace the entire test process. Therefore, it is desirable to also save the `start_time`. ### Alternatives
No response
### Additional context
…
-
#29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) — ci-failure — by AndreasKaratzas (closed: 2026-01-22 14:03 (UTC+8)) [💬6] ### Name of failing test
pytest -v -s entrypoints/openai/test_collective_rpc.py; pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/; pytest -v -s entrypoints/test_chat_utils.py ### Basic information
- Fla…
-
#29510 [CI Failure]: mi325_1: V1 Test e2e + engine — ci-failure — by AndreasKaratzas (closed: 2026-01-22 13:47 (UTC+8)) [💬6] ### Name of failing test
pytest -v -s v1/e2e && pytest -v -s v1/engine ### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#32822 [Bug]: When the temperature is set to 0 in qwen3-235B-A22B-Instruct-2507, there is still randomness. — bug — by xidiancpy (closed: 2026-01-22 10:56 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
When the temperature is set to 0 in qwen3-235b, there is still randomness. Collecting environment information... ============================== ...
[New PRs]
- #32878 [Bugfix][Hardware][AMD] Auto-enable AITER MoE on supported ROCm devices — bug,rocm — by c0de128 (created: 2026-01-23 02:27 (UTC+8)) [💬7 | +18/-2, 3 files | commented:1]
## Summary
- Auto-enable AITER MoE on supported ROCm platforms (MI300X/gfx9) without requiring `VLLM_ROCM_USE_AITER=1`
- Previously, FP8 MoE models fell back to slower Triton kernels by default
## Changes
- Modify `is_fused_moe_enabled()` to auto-enable when AITER is found and supported
- Respect explicit user settings (`VLLM_ROCM_USE_AITER=0` or `VLLM_ROCM_USE_AITER_MOE=0` to disable)
## Test plan On MI300X: …
- #32899 [WIP][CT][XPU] Add W8A16_FP8 MoE Support — no labels — by Zhenzhong1 (created: 2026-01-23 10:20 (UTC+8)) [💬1 | +247/-15, 3 files | commented:2 | 📝 draft] ## Test Plan ```python # quantization scheme: quant_stage: quant_modifiers: QuantizationModifier: ignore: ["lm_head", "re:.self_attn.q_proj$", "re:.self_attn.k_proj$", "re:.self_attn.v_proj$", "re:.self_attn.o_proj$", "re:.mlp.gate$", "re:.mlp.shared_expert_gate$"] config_groups: group_0: targets: ["Linear"] …
-
#32852 [perf] v1/spec_decode: skip softmax for all-greedy rejection sampling — ready,v1 — by caozuoba (created: 2026-01-22 20:47 (UTC+8)) [💬2 | +6/-1, 1 files | commented:1 approved:1]
## Purpose This PR avoids computing a full-vocabulary softmax in the v1 speculative decoding rejection sampler when the entire batch is greedy (sampling_metadata.all_greedy).
For all-greedy decoding, the rejection sampler only needs argmax(target_logits); a dense softmax is unnecessary work. Since argmax(softmax(logits)) == argmax(logits), this change is behavior-preserving for the greedy path while reducing compute/memory overhead.
## Test Result …
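The identity the PR relies on, argmax(softmax(logits)) == argmax(logits), holds because softmax is strictly increasing and therefore order-preserving. A small pure-Python check (illustrative only, not the PR's code):

```python
import math

def softmax(logits):
    # numerically stable softmax over a 1-D list
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

logits = [2.0, -1.0, 5.5, 0.3]
# softmax preserves ordering, so the greedy (argmax) token is unchanged
assert argmax(softmax(logits)) == argmax(logits)
print(argmax(logits))  # 2
```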
-
#32863 [Frontend] Use new Renderer for Completions and Tokenize API — frontend — by DarkLight1337 (created: 2026-01-22 23:25 (UTC+8)) [💬1 | +962/-1189, 32 files | commented:4 | 📝 draft] ## Purpose
- Add `render_completions` and `tokenize_prompt` to the Renderer API.
- Add `render_completions_async` and `tokenize_prompt_async` to the Renderer API. These are backed by `AsyncMicrobatchTokenizer`, which is internally managed by the Renderer.
- Remove legacy `CompletionRenderer` and related code from the serving engine.
- Add `_preprocess_completion` to the serving engine, which has the same function as `_preprocess_chat`.
- Introduce `ChatParser` and `TokenizeParams`, which are passed to the Renderer to…
-
#32884 [BugFix] deepseek_v32_encoding: Replace asserts with proper exceptions — bug,ready,deepseek — by RishabhSaini (创建于: 2026-01-23 05:06 (UTC+8)) [💬2 | +39/-28, 1 files | commented:1 approved:1] Resolves: https://github.com/vllm-project/vllm/issues/32874
Replace validation asserts with ValueError and parsing asserts with RuntimeError to return 400 Bad Request instead of 500 Internal Server Error for invalid input
-
#32836 [BugFix] Add env variable to control PDL in LoRA — bug — by jeejeelee (创建于: 2026-01-22 15:40 (UTC+8)) [+10/-1, 2 files | commented:1]
## Purpose On SM100 GPUs, enabling PDL support for LoRA causes Triton compilation to fail. Therefore, an environment variable(
VLLM_LORA_DISABLE_PDL) is added as a temporary workaround to avoid this error. FIX #30872 FIX https://github.com/vllm-project/vllm/issues/32424 ## Test Plan## Test Result
…
-
#32885 [CI][Models] Add VLM Support for Sequence Classification Conversion — v1 — by AndreasKaratzas (创建于: 2026-01-23 05:56 (UTC+8)) [💬1 | +155/-39, 3 files | commented:2] This PR enables Vision-Language Models (VLMs) like Gemma 3 to be converted to sequence classifiers using the
no_post_processingandfrom_2_way_softmaxmethods. Additionally, it fixes two PyTorch compiler warnings that were causing noise in the logs.## Changes
### 1.
vllm/model_executor/models/adapters.py- Added
_get_language_model_for_seq_cls()helper function to correctly retrieve the inner language model component from VLMs - Updated
load_weights_no_post_processing()and `load_w…
- Added
- #32879 [Bugfix][Hardware][AMD] Fix MXFP4 weight loading for Quark OCP_MX MoE models — bug,rocm — by c0de128 (创建于: 2026-01-23 03:00 (UTC+8)) [💬3 | +20/-4, 1 files | commented:1]
## Summary
- Fix MXFP4 weight loading for Quark OCP_MX quantized MoE models
- The weight loader check for MXFP4 style loading only triggered for
quant_config.get_name() == "mxfp4", but Quark config returns"quark" - This caused dimension mismatch errors when loading Quark OCP_MX (MXFP4) quantized models with tensor/expert parallelism
## Changes
- Extend the MXFP4-style weight loading check to also handle
QuarkOCP_MX_MoEMethod - Both native MXFP4 and Quark OCP_MX use the same packed weight…
- #32891 [ROCm][CI] Add TORCH_NCCL_BLOCKING_WAIT For Distributed Tests (A100) — rocm,ci/build — by micah-wil (创建于: 2026-01-23 06:56 (UTC+8)) [+3/-0, 1 files | commented:4] We have noticed some flaky behavior on AMD CI for Distributed Tests (A100) (exemplified here: https://buildkite.com/vllm/amd-ci/builds/3155/steps/canvas?jid=019bc705-99de-4131-b7ef-24111fd273cc). After investigation, it is due to the same HIP runtime bug that afflicts the other distributed tests groups, for which the same workaround proposed in this PR was applied in https://github.com/vllm-project/vllm/pull/31259. The HIP bug is being tracked here https://github.co…
-
#32859 [Bugfix] Remap FP8 scale names for Mistral3/Pixtral models — bug — by ricky-chaoju (创建于: 2026-01-22 23:07 (UTC+8)) [+6/-0, 1 files | commented:1] ## Summary Fix loading FP8 quantized Mistral3/Pixtral models by remapping HuggingFace scale parameter names to vLLM format.
Changes:
activation_scale→input_scaleweight_scale_inv→weight_scale
## Testing Tested with FP8 quantized Mistral3 model (e.g.,
RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic).…
-
#32894 [Intel GPU] refine xpu worker — ready,ci/build,v1 — by jikunshang (创建于: 2026-01-23 08:35 (UTC+8)) [+26/-86, 3 files | commented:2]
## Purpose There are some legacy code in xpu platform(xpu_worker.py, xpu platform init) this PR refine these file to align with gpu. also disable some failed UT for now. will further check ut status later.
## Test Plan CI
## Test Result …
-
#32888 [UX] Glm4MoeModelToolParser - true tool-call streaming — 无标签 — by bigBrain1901 (创建于: 2026-01-23 06:18 (UTC+8)) [+294/-95, 1 files | commented:3 | 📝草稿]
## Purpose Instead of buffers + regex to identify tool calls, maintain a state machine to have true streaming deltas for tool-calls as the engine generates these (#32829)
## Test Plan A serving with this –tool-parser works.
## Test Result — will fill this up soon …
-
#32892 [Perf] Optimize
moe_permutefor CUTLASS FP8 - 40%+ performance improvement — nvidia — by yewentao256 (创建于: 2026-01-23 07:54 (UTC+8)) [💬2 | +47/-44, 3 files | commented:1] ## PurposeOptimize
moe_permutekernel usingaligned_expert_first_token_offset- Reduce calculation
- Reduce shared memory usage
## Test
### Acc …
-
#32893 Fix/glm4 moe mla detection — v1,nvidia — by mgoin (创建于: 2026-01-23 08:06 (UTC+8)) [+75/-11, 4 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... - #32886 [Bugfix] Fix FP8 MoE EP Weight Loading for ModelOpt Llama4 — bug,ready,llama — by baonudesifeizhai (创建于: 2026-01-23 06:05 (UTC+8)) [💬1 | +21/-1, 1 files | approved:1 commented:3]
## Purpose
#32862
Add a version-guarded fallback in Llama4 MoE weight loading to avoid CPU FP8 indexing on older PyTorch releases. For torch < 2.11, weights are temporarily cast to FP16 for indexing and then cast back, preventing index_cpu errors while leaving newer versions unchanged.
## Test Plan
```
VLLM_DISABLED_KERNELS=FlashInferFP8ScaledMMLinearKernel
VLLM_USE_FLASHINFER_MOE_FP8=0
vllm bench throughput –model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 -tp=2 –enable-expert-parallel … -
#32872 [Bugfix][MoE] Fix grouped topk when num_experts == num_expert_groups — bug — by bnellnm (创建于: 2026-01-23 01:09 (UTC+8)) [+74/-79, 3 files | commented:6] ## Purpose Fix fallback logic when
num_experts==num_expert_groups. The old code wasn’t handling the case where the scoring function was not softmax.Added new tests to cover all the different combinations. Simplified the bias parameters in existing tests.
## Test Plan CI tests Added new routing tests for all grouped routing combinations.
…
- #32890 Testing vllm/issues/32718 — 无标签 — by RishabhSaini (创建于: 2026-01-23 06:43 (UTC+8)) [+222/-0, 2 files | commented:1 | 📝草稿] Testing https://github.com/vllm-project/vllm/issues/32718
-
#32887 [Spec Decode] Unified Parallel Drafting — documentation,speculative-decoding,v1,nvidia — by benchislett (创建于: 2026-01-23 06:16 (UTC+8)) [💬1 | +467/-281, 7 files | commented:1 | 📝草稿] ## Purpose
Preliminary work. Meant to combine Parallel Decoding for EAGLE and for Draft Models, while also simplifying the implementation for non-EAGLE draft model support.
## Test Plan
Local tests for the main triton kernel are implemented (and passing) but not yet committed.
## Test Result
…
-
#32855 [BugFix] Fix invalid flashinfer_fused_moe_blockscale_fp8 op registration — bug,ready,nvidia — by fadara01 (创建于: 2026-01-22 21:04 (UTC+8)) [💬3 | +1/-1, 1 files | commented:1 approved:2] [BugFix] Fix invalid numeric default value in flashinfer_fused_moe_blockscale_fp8 op registration
Fixes: #32840
## Purpose
Fix invalid numeric default value in flashinfer_fused_moe_blockscale_fp8 op registration The default value has to be an int an not an enum as the error indicates: ``` invalid numeric default value: …
-
#32876 [CPU Backend][BugFix] Fix failing CPU MoE test — bug — by fadara01 (创建于: 2026-01-23 02:23 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1 approved:1] ## Purpose
Fixes: #32870
## Test Plan
Ran reproducer in #32870 locally
## Test Result
…
-
#32882 Set splitk=1 for fused-moe-lora expand kernel — 无标签 — by dcmaddix (创建于: 2026-01-23 04:16 (UTC+8)) [+1/-1, 1 files | commented:2]
## Purpose The splitK optimization is useful only for the LoRA shrink kernels where the inner K dimension in the GEMM is the significantly largest dimension. For the expand kernel, we do not need to tune it and can set it to 1 as done in the dense LoRA expand: https://github.com/vllm-project/vllm/blob/main/vllm/lora/ops/triton_ops/kernel_utils.py#L195.
## Test Plan
## Test Result
…
-
#32849 [Docs] Adding links and intro to Speculators and LLM Compressor — documentation — by aireilly (创建于: 2026-01-22 19:47 (UTC+8)) [💬4 | +75/-0, 3 files | commented:1 changes:1] Small docs update that adds a new “Related Projects” section for LLM Compressor and Speculators
Updated docs/.nav.yml to add a new section: - Related Projects: - usage/llm_compressor.md - usage/speculators.md
Signed-off-by: Aidan Reilly aireilly@redhat.com
-
#32883 deepseek_v32_encoding: Replace asserts with proper exceptions — deepseek — by RishabhSaini (创建于: 2026-01-23 04:57 (UTC+8)) [💬1 | +38/-31, 1 files] Resolves: https://github.com/vllm-project/vllm/issues/32874
Replace validation asserts with ValueError and parsing asserts with RuntimeError to return 400 Bad Request instead of 500 Internal Server Error for invalid input
-
#32880 Add inside_opaque_custom_op context — 无标签 — by mgoin (创建于: 2026-01-23 03:57 (UTC+8)) [+67/-3, 2 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#32881 [BugFix] Async Eplb fix potential race condition — bug — by ilmarkov (创建于: 2026-01-23 04:14 (UTC+8)) [+18/-0, 2 files | commented:1 | 📝草稿]
## Purpose Fixes potential race condition in async EPLB. The scenario is following. EPLB successfully transfers layer 1, saves the weights in intermediate buffers, signals main thread that the buffer is ready. The main thread schedules weight copy on main stream when communication on EPLB stream is done and notifies the async thread that buffer is ready. In this scenario it might happen that copying the weights takes too long and async thread will start a weight t…
-
#32860 [BugFix] Fix async EPLB hang with DeepEP LL all2all backend — bug,v1 — by ilmarkov (创建于: 2026-01-22 23:13 (UTC+8)) [💬1 | +54/-0, 2 files | commented:2] Fix hang when async EPLB is used with DeepEP LL.
The hang happens because of cooperative launch in DeepEP LL. When Async EPLB in async thread launches NCCL kernels for weight exchange with large number of SMs it blocks DeepEP kernels. When these kernels are launched on different ranks out-of-order we hit a deadlock.
See https://github.com/deepseek-ai/DeepEP/issues/496.
We resolve it by limiting number of CTAs in NCCL. In DP/EP mode NCCL is not used in the hot path so it doesn’t directly affec…
-
#32875 Add the missing router_logits_dtype for nemotron_h — 无标签 — by cjluo-nv (创建于: 2026-01-23 02:08 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1]
## Purpose
Nemotron H uses FP32 linear for gate and the router logits are FP32 instead of the default BF16.
This fixes the error triggered
``` (EngineCore_DP2 pid=147) return self.forward_impl_chunked( …
-
#32861 [Voxtral] Add new streaming arch — multi-modality,llama — by patrickvonplaten (创建于: 2026-01-22 23:16 (UTC+8)) [💬1 +767/-313, 9 files commented:9] - Moves causal whisper logic into own file
- Adapt voxtral streaming arch to improved architecture
- Add tests that is skipped for now
- #32877 [Bugfix][Hardware][AMD] Fix ROCM_AITER_FA speculative decoding support — bug,rocm,v1 — by c0de128 (创建于: 2026-01-23 02:27 (UTC+8)) [+8/-3, 1 files | commented:1]
## Summary
- Fix ROCM_AITER_FA attention backend to support speculative decoding (multi-token decode)
- The decode path was hardcoding
max_seqlen_q=1, causing incorrect results with speculative decoding
## Changes
- Extract actual
max_query_lenfrom decode metadata - Route multi-token decode queries to
unified_attentioninstead ofpaged_attention_v1 - Use actual
max_seqlen_qinstead of hardcoded1 - Update assertion message to reflect both sliding window and speculative decoding con…
-
#32831 [CI][AMD][BugFix] Update wvSplitK (and other skinny_gemm wrappers) to ensure tensors passed will be made contiguous for the kernel — bug,rocm — by rasmith (创建于: 2026-01-22 12:53 (UTC+8)) [+23/-0, 2 files | commented:8] ## Purpose The
wvSplitKQ_hf_smlkernel is not able to properly handle non-contiguous tensors. This is resulting in failures in AMD CI. This is possibly resulting in other failures or inaccuracies. In the long term, it might be better to update the kernel to be able to properly handle non-contiguous tensors, since this trades incorrectness for performance loss. We can then work on fix to the kernels in skinny_gems.cu to handle non-contiguous kernels.In addition, all of the kernels in `ski…
-
#32873 Add config for selective_state_update with B200 — 无标签 — by danisereb (创建于: 2026-01-23 01:26 (UTC+8)) [+6/-1, 1 files | commented:1 | 📝草稿] ## Purpose
Tune config of
selective_state_updatefor Nemotron Nano NVFP4 on B200 (TP1).## Test Plan
Compare vLLM performance with main branch and this branch.
## Test Result
…
-
#32845 [Model] Video Frame Sparsification via Pixel-Level Similarity for Efficient Multimodal Long-Video Inference for Qwen2.5_vl and Qwen3_vl — multi-modality,qwen — by rayleeku (创建于: 2026-01-22 18:29 (UTC+8)) [💬2 | +374/-1, 5 files | commented:2] ## Purpose
RFC: https://github.com/vllm-project/vllm/issues/31803
The exponential growth of video content and the rising demand for long-video intelligence (e.g., video summarization, multimodal QA, video generation) have made long-video inference a standard requirement for multimodal large models. Current models suffer from critical efficiency bottlenecks and face two key challenges: temporal and spatial redundancy of videos and input sequence overflow, thus necessitating a lightweight, adapt…
- #32868 [MISC] Add .cursor to .gitignore — ready — by vadiklyutiy (创建于: 2026-01-23 00:34 (UTC+8)) [+3/-0, 1 files | commented:1 approved:1]
Add
.cursorfolder to.gitignore -
#32837 [Hardware][AMD][CI][Bugfix] Fix regressions from deprecated env vars — bug,rocm,ready,ci/build,v1,kv-connector — by mawong-amd (创建于: 2026-01-22 15:44 (UTC+8)) [💬1 | +14/-6, 3 files | commented:1 approved:1] ## Purpose This PR fixes various issues on AMD ROCm caused by the deprecation of environment variables in https://github.com/vllm-project/vllm/pull/32812.
Firstly, the deprecation of
VLLM_V1_USE_PREFILL_DECODE_ATTENTIONmeans thatget_current_vllm_configis now called as a replacement inRocmPlatform.get_attn_backend_clsto determine if theROCM_ATTNbackend should be used. However, there are instances where the current vLLM config is not yet set, which causes the above function to erro… - #32867 [Misc] Postpone VLLM_ATTENTION_BACKEND deprecation — 无标签 — by NickLucche (创建于: 2026-01-23 00:32 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:2] I think we can safely postpone this deprecation schedule. Updating warning accordingly.
-
#32869 [CPU][Feat] Update PyTorch to v2.10 for Arm CPUs — ci/build,cpu — by fadara01 (创建于: 2026-01-23 00:36 (UTC+8)) [💬1 | +4/-3, 2 files | commented:2] ## Purpose
Update PyTorch to v2.10 for Arm CPUs.
A lot of nice features, improvements and bug fixes have gone into PyTorch v2.10 for Arm CPUs and we should capitalize on that in vLLM!
Here’s a con-exhaustive list:
- Enable mimalloc on AArch64: Switched the c10 system allocator to mimalloc on AArch64 to improve large-allocation behavior and overall performance. This delivers broad wins across dtypes, including up to 60% faster DenseNet121 (FP32) and up to 40% faster GPT2-Large (BF16) [#pull/…
-
#32844 [Bugfix] ModelScope is supported when downloading LORA models. — bug,ready — by AuYang261 (创建于: 2026-01-22 18:05 (UTC+8)) [💬1 | +21/-6, 1 files | commented:2 approved:1]
## Purpose Fix the bug that the environment variable
VLLM_USE_MODELSCOPE=1is ineffective when downloading LoRA models, which is described in #32841 .## Test Plan After setting the environment variables
VLLM_USE_MODELSCOPE=1, this command tests downloading the LoRA model from ModelScope.vllm serve LLM-Research/Llama-3.2-3B-Instruct --enable-lora --lora-modules sql-lora=vllm-ascend/llama32-3b-text2sql-spider…
-
#32865 [Chore]: simplify cuda device count with
torch.cuda.device_count— ready,nvidia — by Isotr0py (创建于: 2026-01-22 23:32 (UTC+8)) [+4/-42, 1 files | commented:1]## Purpose
- Remove
_cuda_device_count_stateless, we can usetorch.cuda.device_countdirectly now.
## Test Plan
## Test Result
…
- Remove
-
#32846 [Kernel] use flashinfer for gdn prefill — qwen — by ZJY0516 (创建于: 2026-01-22 18:41 (UTC+8)) [+62/-1, 1 files | commented:3] ## Purpose flashinfer introduced prefill gdn kernel in https://github.com/flashinfer-ai/flashinfer/pull/2276
TODO: accuracy problem
## Test Plan ``` vllm bench serve –model Qwen/Qwen3-Next-80B-A3B-Instruct –dataset-name random
–num-prompts 512
–random-input-len 4096
… -
#32851 Fix mkdocs_autorefs error — needs-rebase — by aireilly (创建于: 2026-01-22 20:34 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1 | 📝草稿] Fixes this docs build error:
``` WARNING - mkdocs_autorefs: api/vllm/model_executor/models/index.md: from /home/docs/checkouts/readthedocs.org/user_builds/vllm/checkouts/32849/vllm/model_executor/models/interfaces_base.py:167: (vllm.model_executor.models.VllmModelForPooling.default_pooling_type) Could not find cross-reference target ‘vllm.model_executor.layers.pooler.PoolerConfig.pooling_type’ WARNING - mkdocs_autorefs: api/vllm/model_executor/models/interfaces_base.md: from /home/docs/checko…
-
#32847 [BugFix] Fix AttributeError in MooncakeConnector logging for non-tensor KV caches — bug,kv-connector — by EheinWang (创建于: 2026-01-22 18:58 (UTC+8)) [💬2 | +23/-3, 1 files | commented:3] ## Purpose
This PR fixes an
AttributeError: 'tuple' object has no attribute 'shape'that occurs inMooncakeConnector.register_kv_cacheswhen using hardware backends (like Ascend NPU) that represent KV caches as tuples (separate Key/Value tensors) rather than a single tensor.The existing logging implementation assumed that
kv_cachesvalues always possessed a.shapeattribute. When running with thevllm-ascendplugin, this assumption failed, causing the engine initialization to crash…. -
#32833 [CI] refactor release pipeline config into groups — ready,ci/build — by Harry-Chen (创建于: 2026-01-22 13:25 (UTC+8)) [💬1 | +277/-276, 1 files | commented:2 approved:2] ## Purpose
The current release pipeline has too many steps and is a little bit hard to read and maintain. This PR puts steps into some groups without changing the actual jobs.
Specifically, it also removes the block before the CUDA 13.0 release image build steps, as requested by @wangshangsam ( https://github.com/vllm-project/vllm/pull/32522#issuecomment-3776281837).
## Test Plan
Tested by CI.
…
- #32842 [Bugfix]: resolve torch.compile cache conflict between mm_encoder_tp_modes — bug — by HirokenOvo (创建于: 2026-01-22 17:45 (UTC+8)) [+8/-1, 2 files]
## Purpose
PR #23207 introduced torch compile support for the ViT part of Qwen2.5-VL. This PR addresses an issue where enabling
torch.compilefor the vision encoder (--compilation-config '{"compile_mm_encoder": true}') caused crashes when switching between--mm-encoder-tp-mode "weights"and--mm-encoder-tp-mode "data". ### The Problem: vLLM usesVllmConfig.compute_hash()to identify unique configurations for caching compiled graphs. However,mm_encoder_tp_modewas missing from this … -
#32835 [ROCm][CI][Docs] Add comment explaining TRITON_ATTN fallback for ROCm — rocm,speculative-decoding,v1 — by AndreasKaratzas (创建于: 2026-01-22 14:09 (UTC+8)) [+2/-0, 1 files | commented:1 approved:1] This PR adds a clarifying comment as requested in #32787 by @DarkLight1337.
The comment explains why
TRITON_ATTNis used as the fallback attention backend for ROCm platforms. -
#32830 fix — needs-rebase,v1 — by aidendle94 (创建于: 2026-01-22 12:28 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
...
[已合并 PR]
-
#30937 [Feature] Add –ssl-ciphers CLI argument for TLS cipher control — frontend,ready — by ricky-chaoju (合并于: 2026-01-23 01:53 (UTC+8)) [💬3 | +4/-0, 2 files | commented:2 approved:2] ## Summary This PR adds support for the
--ssl-ciphersCLI argument, enabling users to specify allowed SSL/TLS cipher suites for fine-grained security control.Fixes #30569
## Changes
vllm/entrypoints/openai/cli_args.py: Addedssl_ciphersparameter toFrontendArgsdataclassvllm/entrypoints/openai/api_server.py: Passssl_ciphersargument toserve_http()function
## Motivation …
-
#31996 [MoE Refactor] Move
select_expertsfromFusedMoEQuantMethod->FusedMoE— documentation,ready,nvidia — by bnellnm (合并于: 2026-01-23 07:21 (UTC+8)) [💬9 | +495/-530, 22 files | commented:9 approved:1] ## PurposeMove (almost all) the
select_expertscalls to theFusedMoElayer instead of eachFusedMoEMethodBase.applymethod. Change theapplysignature to taketopk_weightsandtopk_idsinstead ofrouter_logits. For kernels that have combined routing + gemms we callapply_monolithicwhich takes hidden_states androuter_logits.I made both
applyandapply_monolithicnon-abstract so that most FusedMoEMethods would not need to override them with dummy/error implementations… -
#32723 [Benchmark] Don’t default to
temperature==0invllm bench serve— performance,ready — by njhill (合并于: 2026-01-22 18:03 (UTC+8)) [💬3 | +7/-6, 2 files | commented:1 approved:1]vllm bench servecurrent sets temperature to 0 (greedy) in all requests if it’s not explicitly provided in the benchmark command.Thus results will not reflect the true default performance of a given deployment - where default sampling parameters are determined based on the model’s configuration metadata and/or the API server defaults.
The difference can be substantial given the current overhead of parameters such as top-k and top-p (see https://github.com/vllm-project/vllm/pull/32558).
So …
-
#32855 [BugFix] Fix invalid flashinfer_fused_moe_blockscale_fp8 op registration — bug,ready,nvidia — by fadara01 (合并于: 2026-01-23 06:27 (UTC+8)) [💬3 | +1/-1, 1 files | commented:1 approved:2] [BugFix] Fix invalid numeric default value in flashinfer_fused_moe_blockscale_fp8 op registration
Fixes: #32840
## Purpose
Fix invalid numeric default value in flashinfer_fused_moe_blockscale_fp8 op registration The default value has to be an int an not an enum as the error indicates: ``` invalid numeric default value: …
-
#32619 [Perf] Create TMA-aligned input scale tensor for DeepGemm on Hopper — performance,ready,deepseek — by xyang16 (合并于: 2026-01-23 04:47 (UTC+8)) [💬1 | +75/-17, 7 files | commented:5 approved:1] ## Purpose
This PR is to create TMA-aligned input scale tensor for DeepGemm on Hopper.
During profiling on H200 I noticed
deep_gemm::transpose_fp32was launched beforedeep_gemm::sm90_fp8_gemm_1d2d_implevery time, because deepgemm requires input scale tensor to be column major TMA-aligned layout on Hopper, see here. If not in the required layout, deepgemm will launch transpose_fp32 kernel to transpose tens… -
#32610 [Refactor] Remove unused tpu files — tpu,ready,v1 — by yewentao256 (合并于: 2026-01-23 04:35 (UTC+8)) [💬1 | +0/-353, 4 files | commented:1 approved:1] ## Purpose
Seems introduced in https://github.com/vllm-project/vllm/pull/14227
But these functions and files now are not used anywhere, not sure if we can delete it
CC: @yaochengji @NickLucche
-
#30141 Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) — documentation,speculative-decoding,ready,v1,llama — by eldarkurtic (合并于: 2026-01-23 04:29 (UTC+8)) [💬12 | +538/-243, 18 files | commented:10] TLDR: this PR adds support to load and run
llm-compressormodels with FP8 KV-cache and attention quantization. In addition to the standard “per-tensor” quantization, it adds support for “per-attention-head” quantization.## Summary
- enable using the existing pathway of “per-tensor” KV-cache (and query) FP8 quantization with scales calibrated through
llm-compressor - Flash Attention v3 backend supports “finer-grained” scales, i.e. one scale per attention head. This is currently not su…
- enable using the existing pathway of “per-tensor” KV-cache (and query) FP8 quantization with scales calibrated through
-
#32795 [Bugfix][Attention] Explicitly report support for kv_cache_dtype bfloat16 — bug,ready,v1,cpu,nvidia — by MatthewBonanni (合并于: 2026-01-23 03:05 (UTC+8)) [+31/-11, 13 files | commented:3 approved:2] ## Purpose Attention backends all support bfloat16, but only report support for
auto. In cases where the automatically resolved dtype is not bfloat16, this can lead to failures.Related issue: #32732 - setting
--kv-cache-dtype=bfloat16should work as a workaround, but does not, because TRITON_MLA does not report support for bfloat16.## Test Plan
vllm serve nvidia/DeepSeek-R1-0528-NVFP4 --attention-backend=TRITON_MLA --kv-cache-dtype=bfloat16## Test Result main: …
-
#32792 [CPU Backend] [Perf] Accelerate tensor-parallel/data-parallel inference across NUMA domains on Arm — ready,ci/build,v1,cpu — by fadara01 (合并于: 2026-01-23 02:55 (UTC+8)) [💬2 | +164/-6, 6 files | commented:1 approved:1] [CPU Backend] [Perf] Accelerate tensor-parallel/data-parallel inference across NUMA domains on Arm
Fixes: #32786
## Purpose
This PR addresses #32786 by enabling vLLM’s custom shared memory communicator for Arm, in favor of torch.distributed which has very high synchronization costs. It also enables auto thread binding for best OOB performance.
## Benchmarks
…
-
#32487 [CI][Attention] Add more CI dependencies for attention tests — ready,ci/build — by MatthewBonanni (合并于: 2026-01-23 02:44 (UTC+8)) [+12/-0, 3 files | commented:4 approved:1]
## Purpose #32339 broke
V1 Tests attention (B200)by changing defaults, and this should have been caught in CI on the PR. This PR adds more source file dependencies for these tests so they will be covered next time.## Test Plan
## Test Result
…
-
#32393 Support custom URI schemes and trace handlers for profiler — ready,v1,meta-exported,fb-exported — by diviramon (合并于: 2026-01-23 01:45 (UTC+8)) [💬5 | +74/-19, 3 files | commented:1 approved:2] Summary: Enable platforms to use custom URI schemes (e.g., gs://, s3://, hdfs://) for profiler trace output without path normalization, and allow injection of custom trace handlers for platform-specific profiling backends.
This generalizes the existing gs:// handling to support any URI scheme by detecting the scheme:// pattern, and adds an optional on_trace_ready parameter to TorchProfilerWrapper for custom trace export logic.
Test Plan: Existing profiler tests continue to pass. Manual verific…
-
#32525 [UX] Default api_server_count to dp_size if not specified — frontend,ready — by tlrmchlsmth (合并于: 2026-01-23 01:35 (UTC+8)) [💬3 | +60/-8, 2 files | commented:4 approved:1] When using data parallelism, a good rule of thumb is to set the api_server_count to the dp_size in order to avoid the api server becoming a bottleneck.
Use this as the default when
--api-server-countis not specified. - #32868 [MISC] Add .cursor to .gitignore — ready — by vadiklyutiy (合并于: 2026-01-23 01:27 (UTC+8)) [+3/-0, 1 files | commented:1 approved:1]
Add
.cursorfolder to.gitignore -
#32837 [Hardware][AMD][CI][Bugfix] Fix regressions from deprecated env vars — bug,rocm,ready,ci/build,v1,kv-connector — by mawong-amd (合并于: 2026-01-23 00:59 (UTC+8)) [💬1 | +14/-6, 3 files | commented:1 approved:1] ## Purpose This PR fixes various issues on AMD ROCm caused by the deprecation of environment variables in https://github.com/vllm-project/vllm/pull/32812.
Firstly, the deprecation of
VLLM_V1_USE_PREFILL_DECODE_ATTENTIONmeans thatget_current_vllm_configis now called as a replacement inRocmPlatform.get_attn_backend_clsto determine if theROCM_ATTNbackend should be used. However, there are instances where the current vLLM config is not yet set, which causes the above function to erro… -
#32844 [Bugfix] ModelScope is supported when downloading LORA models. — bug,ready — by AuYang261 (合并于: 2026-01-23 00:33 (UTC+8)) [💬1 | +21/-6, 1 files | commented:2 approved:1]
## Purpose Fix the bug that the environment variable
VLLM_USE_MODELSCOPE=1is ineffective when downloading LoRA models, which is described in #32841 .## Test Plan After setting the environment variables
VLLM_USE_MODELSCOPE=1, this command tests downloading the LoRA model from ModelScope.vllm serve LLM-Research/Llama-3.2-3B-Instruct --enable-lora --lora-modules sql-lora=vllm-ascend/llama32-3b-text2sql-spider…
-
#30200 [Frontend] Introduce Renderer for processing chat messages (using
ModelConfig) — documentation,performance,structured-output,frontend,ready,ci/build,v1,multi-modality,tool-calling,gpt-oss — by DarkLight1337 (合并于: 2026-01-22 20:44 (UTC+8)) [💬29 | +2141/-1585, 48 files | commented:10] Ready for review.## Purpose
- Prototype an interface,
vllm.renderers.RendererLike, to process chat messages into engine inputs. - Introduce
RENDERER_REGISTRYwhich lazily registers renderers to avoid circular import problem. - Move implementation-specific chat utils to the corresponding renderer in
vllm.renderers. - Initialize the renderer in
InputPreprocessor, replacing the tokenizer initialization insideLLMEngineandAsyncLLM. - Replace
EngineClient.get_tokenizer()with `Engi…
- Prototype an interface,
-
#14526 Support bge-m3 sparse embeddings and colbert embeddings — documentation,new-model,frontend,ready — by maxdebayser (合并于: 2026-01-22 23:52 (UTC+8)) [💬29 | +393/-19, 9 files | commented:7 approved:1] FIX #13609 FIX #15384 FIX #18469
Here I’m loading the extra
sparse_linear.ptfile using the secondary_weights loading introduced in the ultravox model when I detect that the model name isBAAI/bge-m3. It’s a bit ugly but I don’t know if there is a more generic way to do this.Currently, since the only permissible pooling return type is torch.tensor, I’m just returning the token weights tensor directly. If the user wants to match tokens to the weights they have to call
tokenizeand remove… -
#32668 [Misc] Bump opencv-python dependecy version to 4.13 — ready,ci/build — by Isotr0py (合并于: 2026-01-22 23:51 (UTC+8)) [💬4 | +35/-16, 5 files | commented:4 approved:1]
## Purpose
- Update opencv-python version in dependencies to get up-to-date security fix.
## Test Plan
## Test Result
…
-
#32706 [Cleanup] Move scheduler
get_routed_expertslogic to separate method — ready,v1 — by njhill (合并于: 2026-01-22 23:46 (UTC+8)) [+31/-23, 1 files | commented:2 approved:1] Follow-on from https://github.com/vllm-project/vllm/pull/28284 which I didn’t get time to fully review.It harms readability to have all of this special-case logic in the main scheduler flow, better to move it to a separate method.
Also includes small change to ensure that we always return the final output for a request, even if otherwise empty.
-
#32805 [torch.compile] Improve Cold Start for MoEs — ready — by zou3519 (合并于: 2026-01-22 23:44 (UTC+8)) [💬2 | +71/-9, 3 files | commented:3 approved:2] ## Purpose
Fixes https://github.com/vllm-project/vllm/issues/29992
For torch.compile cold start times, we need to avoid hard-coding any strings into the graph. Right now, the vllm.moe_forward and vllm.moe_forward_shared custom operators hard-code strings into the graph.
The workaround is to store a list of the strings that each of those custom ops needs, in reverse order, in the ForwardContext. The ForwardContext object is alive for the duration of the forward pass. When the custom op needs t…
-
#31756 [Misc][BE] Turn on strict type coverage for vllm/compilation — ready,nvidia — by Lucaskabela (合并于: 2026-01-22 23:12 (UTC+8)) [💬2 | +121/-68, 11 files | commented:9 approved:1] ## Purpose This PR follows up * #31748 * #31744 * #31554 And adds strict type checking to our linter/pyproject.toml, turning it on in the
compilationrepo as a test pilot## Test Plan ``` mypy vllm/compilation …
-
#30692 OffloadingConnector: Support kernel_block_size != block_size — ready,v1 — by orozery (合并于: 2026-01-22 20:30 (UTC+8)) [💬6 | +156/-114, 7 files | commented:9 approved:1] This PR enables the offloading connector to work for cases where the kernel block size is different than vLLM’s logical block size.
To efficiently copy blocks when the logical block size is greater than the block size, we add an additional parameter to the swap_blocks function that determines the size per each contiguous copy.
This fix addresses the same issue that was fixed in the NixlConnector in #28677.
…
-
#32824 [Frontend] add prompt_cache_key for openresponses — frontend,ready — by chaunceyjiang (合并于: 2026-01-22 19:34 (UTC+8)) [💬1 | +8/-0, 1 files | commented:5 approved:1] ## Purpose supports https://www.openresponses.org/reference
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32833 [CI] refactor release pipeline config into groups — ready,ci/build — by Harry-Chen (merged: 2026-01-22 19:27 (UTC+8)) [💬1 | +277/-276, 1 files | commented:2 approved:2] ## Purpose
The current release pipeline has too many steps and is somewhat hard to read and maintain. This PR puts the steps into groups without changing the actual jobs.
Specifically, it also removes the block before the CUDA 13.0 release image build steps, as requested by @wangshangsam ( https://github.com/vllm-project/vllm/pull/32522#issuecomment-3776281837).
## Test Plan
Tested by CI.
…
-
#32574 [Frontend][2/n] Make pooling entrypoints request schema consensus | ChatRequest — documentation,frontend,ready,qwen — by noooop (merged: 2026-01-22 18:32 (UTC+8)) [💬4 | +456/-205, 24 files | commented:1 approved:1]
## Purpose
Split out the following RequestMixin
- ChatRequestMixin
Add tests
- test_chat_request
…
-
#32789 [Bugfix] Fix Whisper/encoder-decoder GPU memory leak — bug,ready,v1,multi-modality — by NickLucche (merged: 2026-01-22 18:50 (UTC+8)) [💬2 | +54/-5, 2 files | commented:2 approved:1] Fix https://github.com/vllm-project/vllm/issues/31577.
# Overview For encoder-decoder models (e.g., Whisper), the encoder cache manager was returning newly allocated entries in
`get_freed_mm_hashes()` during the same scheduling step they were allocated. Since the runner frees cache entries before model execution, attempts to free encoder outputs missed, leading to an ever-growing encoder cache. Flow: `EncoderCacheManager.get_freed_mm_hashes` -> runner "free mm hashes" -> runner._execute_m… -
#30207 Enable Cross layers KV cache layout at NIXL Connector — documentation,ready,v1,kv-connector — by liranschour (merged: 2026-01-22 18:12 (UTC+8)) [💬16 | +308/-89, 5 files | commented:9 changes:1] ## Purpose Enable the NIXL Connector to use the new continuous cross-layer KV cache layout described in RFC #27742 and implemented in #27743. Demonstrates a performance improvement of more than 2x in tok/sec and TTFT due to a dramatic reduction in fragmentation of transfer buffers.
Tested with P!=D with run_accuracy_test.sh P=1 D=2
Tested on H100 1P1D Qwen/Qwen3-0.6B:
b… -
#32746 [Misc] Replace urllib's `urlparse` with urllib3's `parse_url` — ready,multi-modality — by Isotr0py (merged: 2026-01-22 16:37 (UTC+8)) [💬1 | +23/-16, 4 files | commented:2 approved:1] ## Purpose
- Replace urllib's `urlparse` with urllib3's `parse_url` for safer and more standard URL parsing.
## Test Plan
## Test Result
…
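For context on why a stricter parser helps: the stdlib `urlparse` is purely syntactic and silently yields empty components for scheme-less input. This stdlib-only sketch shows the behavior; urllib3's `parse_url`, which the PR adopts, is not shown here:

```python
# urllib.parse.urlparse splits a URL without validating it.
from urllib.parse import urlparse

ok = urlparse("http://example.com:8080/path?q=1")
print(ok.scheme, ok.netloc, ok.path)  # http example.com:8080 /path

# A scheme-less URL is silently treated as a relative path, not an error:
odd = urlparse("example.com/path")
print(repr(odd.scheme), repr(odd.netloc), repr(odd.path))  # '' '' 'example.com/path'
```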
-
#28664 [AMD][ROCm] MoRI EP: a high-performance all2all backend — documentation,rocm,ready,ci/build,deepseek,nvidia — by alexsun07 (merged: 2026-01-22 16:33 (UTC+8)) [💬17 | +397/-9, 16 files | commented:10] ## Purpose
This PR integrates MoRI-EP, a high-performance all2all comm kernel, into vLLM as an all2all backend. See the MoRI project here. MoRI also supports CUDA graph.
This PR follows the design of vLLM’s Fused MoE Modular Kernel. The Fused MoE Modular Kernel is composed of following components: [Router] → [Quantize-Dispatch] → [Permute-Experts-Unpermute] → [Combine]
For MoRI+AITER path, w…
-
#32757 [Model] Extend `collect_children` and `no_init_weights` contexts — ready,llama,qwen — by DarkLight1337 (merged: 2026-01-22 16:20 (UTC+8)) [💬3 | +442/-255, 20 files | commented:4 approved:1] ## Purpose Further generalize
`_mark_language_model` and `_mark_tower_model` to handle the case where both MM and LM components are in the same child module, by introducing a `target` argument. For convenience, add a wrapper `_mark_composite_model` which enters the context of both `_mark_language_model` and `_mark_tower_model`. Also, merge `LMMissingLayer` and `TowerMissingLayer` into `StageMissingLayer`, and use this to replace generation-specific modules in pooling adapters as well. cc @noo… -
#32667 [bench] add start_times field to vllm bench serve json result — performance,ready — by kebe7jun (merged: 2026-01-22 15:10 (UTC+8)) [+2/-0, 1 files | commented:1 approved:1] ## Purpose
Fix https://github.com/vllm-project/vllm/issues/32666
## Test Plan
vllm bench serve --save-detailed --trust-remote-code --base-url http://127.0.0.1:17021 --model /home/ubuntu/DeepSeek-V3.2 --seed 4632803 --dataset-name random --model /home/ubuntu/DeepSeek-V3.2 --num-prompts 512 --save-result --ignore-eos --max-concurrency 64 --random-input-len 6000 --random-output-len 500 --request-rate 2.4 --result-filename bench-result.json…
- #32810 [FlashMLA] Update FlashMLA to expose new arguments — ci/build,v1,ready-run-all-tests — by LucasWilkinson (merged: 2026-01-22 13:02 (UTC+8)) [💬1 | +133/-217, 8 files | commented:1 approved:1] The FlashMLA update contains new arguments; update the interface to expose them.
-
#32835 [ROCm][CI][Docs] Add comment explaining TRITON_ATTN fallback for ROCm — rocm,speculative-decoding,v1 — by AndreasKaratzas (merged: 2026-01-22 14:11 (UTC+8)) [+2/-0, 1 files | commented:1 approved:1] This PR adds a clarifying comment as requested in #32787 by @DarkLight1337.
The comment explains why `TRITON_ATTN` is used as the fallback attention backend for ROCm platforms. -
#32346 [ROCm][CI] Fix AITER test flakiness by using explicit attention backend — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-01-22 13:55 (UTC+8)) [💬5 | +10/-6, 3 files | commented:6 approved:1] ## Summary
This PR fixes test failures for `TitanML/tiny-mixtral` when running AITER tests on ROCm as part of the full test suite. ## Problem
When running the test suite as a group, the following tests were failing:
`test_models[True-True-5-32-TitanML/tiny-mixtral]` and `test_models[False-True-5-32-TitanML/tiny-mixtral]`
…
-
#32787 [ROCm][CI] fix get_valid_backends — rocm,speculative-decoding,ready,v1 — by divakar-amd (merged: 2026-01-22 12:27 (UTC+8)) [+8/-2, 1 files | commented:4 approved:1] This PR fixes the "get_valid_backends" check.
Details: `__getattr__` is overridden in interface.py and returns
`None` if an attribute is not found, instead of raising an `AttributeError`. `pytest -s -v tests/v1/spec_decode/test_acceptance_length.py` Note: There is more debugging ongoing; another PR may be needed to fully fix this test. Preliminary analysis hints towards th…
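A minimal reproduction of the failure mode described above, using illustrative class names rather than vLLM's actual `Platform`: a `__getattr__` that returns `None` makes both `hasattr()` checks and "is the method implemented?" checks unreliable.

```python
# A __getattr__ that swallows missing attributes vs. the conventional behavior.
class LenientPlatform:
    def __getattr__(self, name):
        return None  # never raises, so every attribute "exists"


class StrictPlatform:
    def __getattr__(self, name):
        raise AttributeError(name)  # conventional: missing attributes raise


lenient, strict = LenientPlatform(), StrictPlatform()
print(hasattr(lenient, "get_valid_backends"))  # True (misleading!)
print(lenient.get_valid_backends)              # None, so calling it fails later
print(hasattr(strict, "get_valid_backends"))   # False (as expected)
```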
- #32731 [ROCm][CI] Lower Acceptance Len Threshold For test_draft_model_quantization — rocm,ready,v1 — by micah-wil (merged: 2026-01-22 13:47 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
https://github.com/vllm-project/vllm/pull/24322 introduced tests for spec decoding with draft models in the
`V1 Test e2e + engine` test group. One test case fails on AMD CI: `pytest -v -s tests/v1/e2e/test_spec_decode.py::test_draft_model_quantization[True-target_quantized]` The test fails in an assertion: ```
assert acceptance_len >= args.expected_acceptance_len -- ... -
#32287 Upgrade transformers-4.57.5 — ready,ci/build,multi-modality,ready-run-all-tests — by huydhn (merged: 2026-01-22 13:19 (UTC+8)) [💬6 | +25/-3, 4 files | commented:1 approved:3] ## Purpose
https://pypi.org/project/transformers/4.57.5 is out today. Although there is already a PR to upgrade to 5.x (https://github.com/vllm-project/vllm/pull/30566), I'm looking to move to
`4.57.5` earlier to address this CI issue: https://github.com/huggingface/transformers/issues/42886 ## Test Plan
CI run
cc @hmellor
- #32780 [Llama.py -> mistral.py] Extract mistral-only relevant code into separate file — new-model,ready,llama — by patrickvonplaten (merged: 2026-01-22 13:14 (UTC+8)) [💬3 | +248/-115, 3 files | commented:5 approved:1] We're adding more and more mistral-only code to the llama.py class, which makes it harder to read and creates possible unwanted future dependencies. E.g., if other models depend on the llama.py class, one might think that mistral-only code is also relevant for those classes, making vLLM too rigid.
- #32775 [Docs] Remove outdated async_scheduling limitation with speculative decoding — ready — by ikaadil (merged: 2026-01-22 12:19 (UTC+8)) [💬3 | +0/-3, 1 files | commented:4 approved:2 changes:1]
Remove the outdated limitation from the
`async_scheduling` docstring. Async scheduling now works with speculative decoding (enabled in #31998). -
#32788 Cleanup some huggingface_hub-related stuff — ready — by Wauplin (merged: 2026-01-22 11:38 (UTC+8)) [💬1 | +9/-50, 3 files | commented:1 approved:2] Hi there! Maintainer of the
`huggingface_hub` Python SDK here :wave: I've looked at how `huggingface_hub` is used in vLLM and found a few places that could be improved/cleaned up:
- in lora utils: no need to catch
`EntryNotFoundError` and `RepositoryNotFoundError` since both are subclasses of `HfHubHTTPError` (which is already caught) - removed the
`_get_hf_token` private utility => the only thing it does is fetch the `HF_TOKEN` env variable and clean it. Since `token` is then passed only to `tra…
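The first bullet relies on a basic property of Python exception handling: catching a base class also catches its subclasses. The stand-in classes below merely mirror the names of huggingface_hub's hierarchy and are not the real library:

```python
# Local stand-ins mimicking huggingface_hub's exception hierarchy:
# EntryNotFoundError and RepositoryNotFoundError both derive from HfHubHTTPError.
class HfHubHTTPError(Exception): ...
class EntryNotFoundError(HfHubHTTPError): ...
class RepositoryNotFoundError(HfHubHTTPError): ...


def classify(exc: Exception) -> str:
    try:
        raise exc
    except HfHubHTTPError:   # one handler covers the base class and subclasses
        return "hub error"
    except Exception:
        return "other"


print(classify(EntryNotFoundError()))       # hub error
print(classify(RepositoryNotFoundError()))  # hub error
print(classify(ValueError()))               # other
```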
-
#32585 [EC Connector] Optimize remote cache check in scheduler — ready,v1 — by knlnguyen1802 (merged: 2026-01-22 11:30 (UTC+8)) [💬3 | +76/-57, 5 files | commented:2 approved:1]
## Purpose Currently for the EC Connector, the scheduler checks whether the remote cache exists at every step, even when all requests are in the decode phase. This hurts TPOT.
## What changes in this PR Modify
`_try_schedule_encoder_input` to check for remote cache entries only when the current multimodal media is planned to be scheduled. ## Test Result…
-
#32818 [Bugfix] Fix potential EAGLE spec decode segfault during graph capture — bug,speculative-decoding,ready,v1 — by mawong-amd (merged: 2026-01-22 11:11 (UTC+8)) [+8/-4, 1 files | commented:1 approved:1] ## Purpose This PR restores the original logic in
`SpecDecodeBaseProposer`'s `dummy_run` function, which was changed by the refactor in https://github.com/vllm-project/vllm/pull/30143. Credits also to @micah-wil for the investigation and fix. After the refactor, the
`use_cudagraphs` parameter is left unused, which incorrectly leads to CUDA graph dispatch where it previously would not have occurred. We are seeing this manifest on MI300X as a segfault during CUDA…
[Closed unmerged PRs]
- #19927 [V1] Solve potential deadlock issue in v1 engine core client internally — needs-rebase,stale,v1 — by Isotr0py (closed: 2026-01-23 10:17 (UTC+8)) [💬5 | +75/-79, 6 files | commented:3]
## Essential Elements of an Effective PR Description Checklist
- The purpose of the PR, such as “Fix some issue (link existing issues this PR will resolve)”.
- The test plan, such as providing test command.
- The test results, such as pasting the results comparison before and after, or e2e results
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
## Purpose
- Following PR for https://github.com/vllm-project/v…
-
#19999 [Models] Remove GPU-CPU sync when `do_pan_and_scan=false` in Gemma3 — needs-rebase,stale — by lgeiger (closed: 2026-01-23 10:17 (UTC+8)) [💬5 | +20/-5, 1 files | commented:4] When `do_pan_and_scan=false` we currently still do a bit of extra preprocessing work, pass around a `num_crops` tensor, and call `num_patches.tolist()`. This is not necessary, and the latter causes GPU-host synchronization, preventing some CPU operations from being overlapped with GPU work: This PR makes the
`num_patches` tensor optional and prevents sp… - #22752 [V1] implement tree sampler for draft token acceptance — speculative-decoding,ready,needs-rebase,stale,v1,llama — by TheEpicDolphin (closed: 2026-01-23 10:16 (UTC+8)) [💬14 | +1486/-254, 13 files | commented:10] # Purpose Continuing my work in PR #20401, where I added support for drafting a tree of speculative tokens that are then validated by the target model. In this PR, I add the class that performs rejection sampling for those draft tokens, so that they conform to the target model's output distribution. This class is based on RejectionSampler, but with some key differences necessary to support a tree structure for drafted tokens. I added some tests for this new class to verify it's …
-
#24258 [ci/testing]: ensure GPU memory is cleaned when exiting the remote OpenAI server — stale — by AzizCode92 (closed: 2026-01-23 10:16 (UTC+8)) [💬2 | +16/-0, 1 files | commented:2] ## Purpose
The
`__exit__` method of `RemoteOpenAIServer` is not sufficient to clean up GPU memory. This creates a race condition between the different tests in CI. This PR ensures we properly clean up GPU memory when exiting the OpenAI remote server. Solves: https://github.com/vllm-project/vllm/issues/24144
## Test Plan
…
-
#24911 Download model and add configs. — stale — by chenxi-yang (closed: 2026-01-23 10:15 (UTC+8)) [💬7 | +648/-0, 4 files | commented:1] Summary: Add kernel configs for glm.
Rollback Plan:
Reviewed By: zzh142857
Differential Revision: D82469889
-
#25173 TokenizerBase#convert_ids_to_tokens adds support for single id conversion — stale — by usberkeley (closed: 2026-01-23 10:15 (UTC+8)) [💬5 | +49/-19, 5 files] ## Purpose
`TokenizerBase#convert_ids_to_tokens` adds support for single-ID conversion, ensuring consistency with `PreTrainedTokenizer` and `PreTrainedTokenizerFast`. This has the following advantages:
1) Maintains interface consistency: a single id can be safely converted to a token. For example, the `detokenize_incrementally` function no longer needs to convert `new_token_id` into a list: `new_tokens = tokenizer.convert_ids_to_tokens(new_token_id, skip_special_tokens=skip_special_tokens)`…
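A sketch of the proposed overload. The vocabulary and free function here are hypothetical; the real change touches `TokenizerBase`:

```python
# Accept either a single id or a list of ids, mirroring the behavior of
# transformers' PreTrainedTokenizer.convert_ids_to_tokens.
from typing import List, Union

VOCAB = {0: "<s>", 1: "hello", 2: "world"}


def convert_ids_to_tokens(ids: Union[int, List[int]]) -> Union[str, List[str]]:
    if isinstance(ids, int):
        return VOCAB[ids]           # single id -> single token
    return [VOCAB[i] for i in ids]  # list of ids -> list of tokens


print(convert_ids_to_tokens(1))       # hello
print(convert_ids_to_tokens([1, 2]))  # ['hello', 'world']
```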
-
#25441 add str format for proxy port — stale,kv-connector — by yangqinghao-cmss (closed: 2026-01-23 10:15 (UTC+8)) [💬3 | +1/-1, 1 files | commented:1]
## Purpose When the
`proxy_port` in the `kv_connector_extra_config` parameter of `vllm serve` is an integer, it triggers `TypeError: can only concatenate str (not "int") to str`. This is because no type check is performed before initializing `P2pNcclEngine`. It is therefore recommended that `proxy_port` be formatted as a string. ## Test Plan ## Test Result
... -
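A minimal reproduction of the TypeError and the proposed normalization (an illustrative helper, not `P2pNcclEngine` itself):

```python
# str + int raises TypeError; casting the port with str() makes both int and
# str inputs safe before concatenation.
def build_proxy_addr(host: str, proxy_port) -> str:
    return host + ":" + str(proxy_port)


try:
    "127.0.0.1" + ":" + 8080  # what happens without the cast
except TypeError as e:
    print(e)  # can only concatenate str (not "int") to str

print(build_proxy_addr("127.0.0.1", 8080))    # 127.0.0.1:8080
print(build_proxy_addr("127.0.0.1", "8080"))  # 127.0.0.1:8080
```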
- #30108 Add config flag for VLLM_DISABLE_COMPILE_CACHE — documentation,llama — by elizabetht (closed: 2026-01-23 04:34 (UTC+8)) [💬11 | commented:9 approved:1]
## Purpose Add config flag for VLLM_DISABLE_COMPILE_CACHE which should be overridden if env var is set
## Test Plan
Scenario | Config Value | Env Var | Test
– | – | – | –
Default (nothing set) | False | Not set | ✅ test_disable_compile_cache_config_without_env_var …
-
#32883 deepseek_v32_encoding: Replace asserts with proper exceptions — deepseek — by RishabhSaini (closed: 2026-01-23 05:05 (UTC+8)) [💬1 | +38/-31, 1 files] Resolves: https://github.com/vllm-project/vllm/issues/32874
Replace validation asserts with ValueError and parsing asserts with RuntimeError, returning 400 Bad Request instead of 500 Internal Server Error for invalid input.
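A sketch of the swap described above (illustrative function, not the actual deepseek_v32 code): validation via ValueError, which a web layer can map to 400 Bad Request, instead of assert, which surfaces as a generic 500 and disappears entirely under `python -O`.

```python
# Before: assert all(0 <= t < vocab_size for t in tokens)
# After: raise ValueError with a descriptive message for invalid input.
from typing import List


def validate_tokens(tokens: List[int], vocab_size: int) -> List[int]:
    bad = [t for t in tokens if not 0 <= t < vocab_size]
    if bad:
        raise ValueError(f"token ids out of range: {bad}")
    return tokens


print(validate_tokens([1, 2, 3], vocab_size=10))  # [1, 2, 3]
try:
    validate_tokens([1, 99], vocab_size=10)
except ValueError as e:
    print(e)  # token ids out of range: [99]
```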
-
#32880 Add inside_opaque_custom_op context — no labels — by mgoin (closed: 2026-01-23 04:17 (UTC+8)) [+67/-3, 2 files | commented:1 | 📝draft]
## Purpose
## Test Plan
## Test Result
... -
- #32875 Add the missing router_logits_dtype for nemotron_h — no labels — by cjluo-nv (closed: 2026-01-23 02:59 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1]
## Purpose
Nemotron H uses an FP32 linear layer for the gate, so the router logits are FP32 rather than the default BF16.
This fixes the triggered error:
``` (EngineCore_DP2 pid=147) return self.forward_impl_chunked( …
-
#32851 Fix mkdocs_autorefs error — needs-rebase — by aireilly (closed: 2026-01-22 20:42 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1 | 📝draft] Fixes this docs build error:
``` WARNING - mkdocs_autorefs: api/vllm/model_executor/models/index.md: from /home/docs/checkouts/readthedocs.org/user_builds/vllm/checkouts/32849/vllm/model_executor/models/interfaces_base.py:167: (vllm.model_executor.models.VllmModelForPooling.default_pooling_type) Could not find cross-reference target ‘vllm.model_executor.layers.pooler.PoolerConfig.pooling_type’ WARNING - mkdocs_autorefs: api/vllm/model_executor/models/interfaces_base.md: from /home/docs/checko…
-
#31705 [BugFix] Support setting tp=1 for the Eagle draft model to take effect — bug,speculative-decoding,v1 — by zhaomingyu13 (closed: 2026-01-22 16:40 (UTC+8)) [💬6 | commented:2 changes:1] ## Purpose According to the official documentation, the parameter
`draft_tensor_parallel_size: 1` is supposed to be applied to the Eagle model. However, actual debugging found that the tensor parallel size (tp) of the Eagle model always matches that of the target model; the tp setting for the draft model did not take effect as expected. Note: This feature has not been combined and tested with
`sp` and `dp`. It will be adapted later. ## Test Plan ```python imp… -
#32813 [Hardware][AMD][Bugfix] Fix PTPC FP8 quantization — bug,rocm — by mawong-amd (closed: 2026-01-22 11:39 (UTC+8)) [💬1 | +2/-2, 1 files | commented:1] ## Purpose Fixes PTPC FP8 quantization, and thus AMD Quantization Tests, after the refactoring done in https://github.com/vllm-project/vllm/pull/32189.
`PTPCFP8LinearMethod` should now inherit from `FP8OnlineLinearMethod` rather than `FP8LinearMethod`. ## Test Plan
`pytest -sv quantization/test_ptpc_fp8.py` The above is implicitly run as part of AMD CI's Quantization Tests group. ## Test Result The test and test group both pass.
…
-
#32825 [Hardware][AMD][Bugfix] Fix spec decode acceptance length test — bug,rocm,speculative-decoding,v1 — by mawong-amd (closed: 2026-01-22 11:37 (UTC+8)) [💬2 | +13/-4, 1 files | commented:1] ## Purpose This PR fixes the speculative decoding acceptance length tests, first implemented in https://github.com/vllm-project/vllm/pull/32030, to run on ROCm.
The current implementation fails because it erroneously calls
`current_platform.get_valid_backends`, which is not implemented on non-CUDA platforms. It guards this call behind `hasattr(current_platform, "get_valid_backends")`, but this does not work because the `Platform` class implements a `__getattr__` met… -
- #32830 fix — needs-rebase,v1 — by aidendle94 (closed: 2026-01-22 12:30 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
...