[vLLM GitHub Development Digest] 2026-01-23
[Overview]
- Time window: 2026-01-23 10:50 (UTC+8) ~ 2026-01-24 10:50 (UTC+8)
- New issues: 15 (label distribution: bug:10, feature request:2, performance:1, ci-failure:1, cpu:1)
- Closed issues: 29
- New PRs: 75 (label distribution: ready:21, v1:18, bug:17, documentation:16, ci/build:13)
- Merged PRs: 40
- PRs closed without merging: 16
[New issues]
-
#32962 [Performance]: Custom Helion Kernels — performance — by xiaohongchen1991 (created: 2026-01-24 03:40 (UTC+8)) [💬1] ### Proposal to improve performance
This is a sub-issue of the vLLM Helion Integration Project https://github.com/vllm-project/vllm/issues/32219 proposed by @gmagogsfm. This issue lists vLLM kernels that are good candidates to explore, implement, and benchmark with Helion. Progress and results will be tracked here.
### Scope ###
- This Issue focuses on kernels integrated into vLLM via the CustomOp framework (potentially getting r…
-
#32972 [CI Failure]: mi325_8: Kernels Attention Test %N — ci-failure — by AndreasKaratzas (创建于: 2026-01-24 06:06 (UTC+8)) [💬1] ### Name of failing test
pytest -v -s kernels/attention### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#32925 [Bug]: tensorize_vllm_model double gpu — bug — by lyc1995452 (created: 2026-01-23 17:18 (UTC+8)) [💬1] ### Your current environment
v0.14 or v0.9.2
### 🐛 Describe the bug
docker run --gpus '"device=3"' --shm-size=4g -p 38954:38954 --name tensorize-deploy -v /storage/leo/workspace/common-ie/asset_script/serialize_model/vllm/model/checkpoint-1350-merged:/storage/model -v /storage/leo/workspace/train_project/train_somefile/chat_template:/storage/chat_template vllm-openai-tensorizer:v0.9.2 --served-model-name K-GPT-V --model /storage/model --port 38954 --load-for…
-
#32959 [Bug]: KeyError: `merging_layer.weight` when loading Mistral/vision-enabled checkpoints after PR #32780 refactor — bug — by ms1design (created: 2026-01-24 03:16 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
```text Collecting environment information... uv is set ============================== System Info ... ```
-
#32932 [Bug][cpu][arm]: Failure case for BF16 dispatch on non-bf16 supported arm HW — bug,cpu — by gassan-arm (创建于: 2026-01-23 19:00 (UTC+8)) [💬2] ### Your current environment
The output of
```text ============================== Versions of relevant libraries ============================== [pip3] flashinfer-python==0.5.3 ...python collect_env.py -
#32919 [Bug]: Memory fault access when serving DeepSeek-R1-0528 with mori-ep + concurrency of 128 / 256 — bug,rocm — by junkang1991 (创建于: 2026-01-23 15:30 (UTC+8)) [💬1] ### Your current environment
The output of
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...python collect_env.py -
#32926 [Feature]: Add dedicated tool parser for Qwen2.5-Coder models — feature request — by hanXen (created: 2026-01-23 18:01 (UTC+8)) ### 🚀 The feature, motivation and pitch Qwen2.5-Coder models have no working tool call parser in vLLM. The vLLM documentation recommends
`--tool-call-parser hermes` for Qwen2.5, but Qwen2.5-Coder does not follow the hermes format — it outputs `json` code blocks or plain JSON instead of `<tool_call>` tags. Output format analysis (Qwen2.5-Coder-7B-Instruct,
`temperature=0`): System prompt strategy / Model output format / vLLM …
-
#32938 [Bug]: illegal memory access while running Qwen3-30B-A3B-Instruct-2507 on multi node with DeepEP backend — bug — by llsj14 (创建于: 2026-01-23 21:27 (UTC+8)) ### Your current environment
- H200 hardware
- vLLM v0.13.0 (upstream image)
### 🐛 Describe the bug
Server Script ``` # Node 1 (Primary - handles incoming requests) …
-
#32901 [Feature]: make thinking parameter consistent with openai api — feature request — by ltm920716 (创建于: 2026-01-23 10:52 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
hello, Why is the “thinking” parameter inconsistent with the OpenAI API? Is there any consideration behind this? Thank you
### Alternatives
No response
### Additional context …
-
#32920 [Bug]: disagg_proxy_demo.py: Method logic error in ‘remove_instance_endpoint’ — bug — by ChenqianCao (创建于: 2026-01-23 15:40 (UTC+8)) ### Your current environment
The output of
```text Your output of `python collect_env.py` here ```python collect_env.py…
-
#32911 [Bug]: “ValueError: No tokenizer file found in directory” when serving Qwen3-Omni — bug — by CHNtentes (创建于: 2026-01-23 13:50 (UTC+8)) [💬3] ### Your current environment
The output of
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...python collect_env.py -
#32921 [Bug]: gpt-oss-20b streaming last reasoning content part into content — bug — by DeoLeung (创建于: 2026-01-23 15:57 (UTC+8)) ### Your current environment
The output of
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...python collect_env.py -
#32915 [Bug]: Enable Lora cause OOM — bug — by bi1101 (创建于: 2026-01-23 14:55 (UTC+8)) ### Your current environment
The output of
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...python collect_env.py -
#32910 [Doc]: Having default of “None” in docs for serve params is not helpful, because there is actually a default — documentation — by jcowles (创建于: 2026-01-23 13:49 (UTC+8)) [💬1] ### 📚 The doc issue
What is the actual default? It’s not ‘None’.
### Suggest a potential alternative/fix
No response
…
-
#32903 [Bug]: FlashInfer error when running vLLM throughput bench on B200 — bug — by baonudesifeizhai (创建于: 2026-01-23 11:47 (UTC+8)) ### Your current environment
### 🐛 Describe the bug
running ``` VLLM_WORKER_MULTIPROC_METHOD=spawn
VLLM_USE_FLASHINFER_MOE_FP8=1
VLLM_FLASHINFER_MOE_BACKEND=throughput
…
[Closed issues]
-
#14040 [Bug]: RuntimeError: (‘Quantization scheme is not supported for ‘, ‘the current GPU. Min capability: 80. ‘, ‘Current capability: 75.’) — bug,stale — by Bakht-Ullah (关闭于: 2026-01-24 10:16 (UTC+8)) [💬9] ### Your current environment
I am using google colab T4 GPU
### 🐛 Describe the bug
from vllm.assets.audio import AudioAsset from vllm import LLM, SamplingParams
# Load your local audio file …
-
#18582 [Usage]: How to use the appropriate –gpu-memory-utilization — usage,stale — by ChEnYoNgB (关闭于: 2026-01-24 10:15 (UTC+8)) [💬9] ### Your current environment
```text
============================== System Info ============================== OS : Ubuntu 22.04.4 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version : Could not collect …
-
#28312 [Bug]: DeepSeek MTP IMA (with `num_spec_tokens=2`) — bug,speculative-decoding — by robertgshaw2-redhat (closed: 2026-01-24 02:16 (UTC+8)) [💬6] ### Your current environment
The output of `python collect_env.py`:
```text H200 ```…
-
#29463 [CI Failure]: mi325_1: Language Models Tests (Standard) — ci-failure — by AndreasKaratzas (关闭于: 2026-01-24 05:58 (UTC+8)) [💬8] ### Name of failing test
pytest -v -s models/language -m 'core_model and (not slow_test)'### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#32751 [CI Failure]: mi325_1 Entrypoints Integration Test (Responses API) — ci-failure — by AndreasKaratzas (关闭于: 2026-01-24 05:58 (UTC+8)) [💬3] ### Name of failing test
pytest -v -s tests/entrypoints/openai/responses### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#32896 [Installation]: How to use v0.10.x with pytorch2.9? — installation — by Wesley-Jzy (关闭于: 2026-01-24 02:42 (UTC+8)) [💬1] ### Your current environment
As the title describes, how can I get a vllm 0.10.x for pytorch 2.9 and cuda 12.8 environment?
### How you are installing vllm
pip install -vvv vllm…
-
#30654 [Feature][Attention][UX]: Incorporate Features into Attention Selection — help wanted,good first issue,feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 02:34 (UTC+8)) [💬12] ### 🚀 The feature, motivation and pitch
SUMMARY:
- we have default attention backends by priority and a notion of which backend supports what hw
- however, certain features are not considered in this (e.g. fp8 kv cache, e.g. attention sinks)
Recent example, we had test failures because we updated the logic to load kv cache quantization from the model config. But since CUTLASS_MLA is the default backend on B200, we started seeing test failures (since CUTLASS MLA does not support fp8 kv cache) b…
-
#27668 [Feature]: WideEP CI Testing — feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 02:30 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
We have seen many regressions related to WideEP in
`llm-d` due to frequent changes in vLLM, which is creating significant issues in developer velocity. We need to get some automated testing in place that can run on a nightly basis, including the following:
- DP/EP with deepep ll
- DP/EP with deepep ht
- EPLB
- DBO
- Async Scheduling …
-
#22992 [RFC]: Refactor CI/CD — help wanted,good first issue,RFC — by robertgshaw2-redhat (关闭于: 2026-01-24 02:29 (UTC+8)) [💬17] ### Motivation.
vLLM’s CI/CD has grown in a less than ideal way as it has built up over the years.
We have the following problems:
- CI takes very long, especially on a per commit cycle
- CI has failures that cannot be reproduced on every machine due to numerics
- CI has failures on models that are not the 80-20 of our usage — which runs per commit
- CI failures in early tests often lead to vLLM not cleaning up properly — which creates failures across many tests that makes it hard to ident…
-
#22918 [Feature][Tools]: Complete Redesign of Tool Calling — help wanted,feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 02:29 (UTC+8)) [💬26] ### 🚀 The feature, motivation and pitch
- We currently have a patchwork of support for tools in vLLM
- We currently have regexes in tools which can cause noisy-neighbor issues in vLLM
- We would welcome a contributor to redesign the system, improve our CI coverage, and work together with the ecosystem to ensure vLLM’s support for tool calling is elite
### Alternatives
No response
…
-
#22294 [Feature]: Tune Triton Configs for Qwen3-30B-A3-Fp8 and Bf16 — good first issue,feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 02:28 (UTC+8)) [💬9] ### 🚀 The feature, motivation and pitch
Hardware Cases:
- H100, H200, B200
Configurations:
- TP=1
- TP=2
- TP=4
- TP=8 …
-
#14530 [Feature]: Audit and Update Examples To Use `VLLM_USE_V1=1` — good first issue,feature request,unstale — by robertgshaw2-redhat (closed: 2026-01-24 02:28 (UTC+8)) [💬6] ### 🚀 The feature, motivation and pitch
Many of the examples leverage V0 internals.
We should:
- raise `NotImplementedError` if `envs.VLLM_USE_V1` with these
- convert them to use V1 if we can
### Alternatives
…
-
#23963 Model Performance Bash! — help wanted,performance,feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 02:28 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Performance continues to be a top priority in the vLLM project.
We have recently seen an explosion of models launched in the last few months. We are seeking help to ensure vLLM runs these models as efficiently as possible.
To achieve this goal, we are creating a Model Performance Bash! The goal of this bash is to analyze the performance of vLLM’s model execution in various configurations to identify opportunities to improve the model side execution. We …
-
#28152 [Feature]: Factor out `zero_expert_num` from `FusedMoE` — help wanted,feature request — by robertgshaw2-redhat (closed: 2026-01-24 02:21 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
We have many special cases in `FusedMoE` for `zero_expert_num`. This parameter is used exclusively for
LongCatFlash. We should factor this out of `FusedMoE` and put the complexity into the model file.
### Alternatives
No response
…
-
#28470 [CI Failure]: Entrypoints Test (API Server) — ci-failure — by robertgshaw2-redhat (关闭于: 2026-01-24 02:21 (UTC+8)) [💬2] ### Name of failing test
tests/entrypoints/openai/test_response_api_with_harmony.py::test_function_call_with_previous_input_messages### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#28402 [CI Failure]: Nightly MM Models Extended (3) — ci-failure — by robertgshaw2-redhat (关闭于: 2026-01-24 02:21 (UTC+8)) [💬1] ### Name of failing test
[2025-11-09T06:15:40Z] FAILED models/multimodal/generation/test_common.py::test_custom_inputs_models[llava_onevision-multiple-images-test_case5] - AssertionError: Test0:
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#28401 [CI Failure]: Nightly B200 LM Eval Failure — ci-failure — by robertgshaw2-redhat (关闭于: 2026-01-24 02:21 (UTC+8)) [💬1] ### Name of failing test
FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness_param[Qwen1.5-MoE-W4A16-CT-tp1] - AssertionError: Accuracy too low: 0.000 < 0.450 - 0.080
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#28423 [Feature][Kernels]: Integrate FlashInfer MoE Fused Finalize — feature request,torch.compile — by robertgshaw2-redhat (关闭于: 2026-01-24 01:18 (UTC+8)) ### 🚀 The feature, motivation and pitch
We are working on fusing all small ops in DSR1 and other popular models.
There are open work streams on a couple of these:
- RMSNorm + BlockFP8
- ROPE+KV Insert
- All Reduce + RMSNorm
One other one that is possible is fusing the MoE finalize reduction. Here is an op in FlashInfer: …
-
#31039 [Feature]: Integrate Sonic MoE — help wanted,good first issue,feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 01:18 (UTC+8)) [💬6] ### 🚀 The feature, motivation and pitch
https://x.com/wentaoguo7/status/2001773245318541324?s=46&t=jLcDgQXDbYe6HgFmTNYgpg https://github.com/Dao-AILab/sonic-moe
Curious to see benchmarks!
### Alternatives
No response …
-
#28460 [Bug]: rocm crash AMD Ryzen AI 9 HX PRO 370 w/ Radeon 890M - docker/podman — bug,feature request,rocm,gpt-oss — by hargathor (关闭于: 2026-01-23 23:00 (UTC+8)) [💬20] ### Your current environment
The output of
```text Collecting environment information... ============================== System Info ============================== ...python collect_env.py -
#28403 [CI Failure]: Nightly H200 Distributed Test Failure — ci-failure — by robertgshaw2-redhat (关闭于: 2026-01-24 00:06 (UTC+8)) ### Name of failing test
tests/compile/test_sequence_parallelism.py### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#28427 [Feature][Kernel]: DeepSeek-R1 KV Proj is Too Slow for TP — feature request — by robertgshaw2-redhat (关闭于: 2026-01-24 00:06 (UTC+8)) [💬9] ### 🚀 The feature, motivation and pitch TRT-LLM has a SwapAB kernel for KV proj for DSR1. We should integrate this by collaborating with the FlashInfer team
Current situation: we run CUTLASS Block Fp8 for KV proj because DeepGEMM upstream does not support it
- the CUTLASS Block Fp8 kernels are ~1/2 the speed of DeepGEMM for other Linear layers
- the CUTLASS Block Fp8 kernels have padding overhead
Example Trace: <img width=”1303” height=”352” alt=”Image” src=”https://github.com/user-attachments…
-
#32848 [Usage]: Structured output — usage — by danielwit-lb (关闭于: 2026-01-23 22:25 (UTC+8)) ### Your current environment
```text Collecting environment information… uv is set ============================== System Info ============================== OS : Amazon Linux 2023.9.20251208 (x86_64) GCC version : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5) …
-
#32870 [Bug][CPU Backend] [Arm]: FAILED tests/kernels/moe/test_moe.py::test_cpu_fused_moe_basic[silu-False-dtype0-2-8-128-128-1] — bug,cpu — by focusunsink (关闭于: 2026-01-23 20:06 (UTC+8)) [💬5] ### Your current environment
The output of
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (aarch64) ...python collect_env.py -
#32874 [Bug]: DeepSeek V3.2 return Internal Server Errors instead of Bad Request in reasoning mode — bug — by artvl (关闭于: 2026-01-23 11:44 (UTC+8)) [💬2] ### Your current environment
The output of
Collecting environment information... ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...python collect_env.py -
#32368 [Bug]: _CPU_MOE_ACT in cpu_fused_moe_torch cause AssertionError: Current vLLM config is not set — bug — by kzwrime (关闭于: 2026-01-23 16:22 (UTC+8)) ### Your current environment
The output of
```text Collecting environment information... ============================== System Info ============================== ...python collect_env.py -
#29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) — ci-failure — by AndreasKaratzas (关闭于: 2026-01-23 14:55 (UTC+8)) [💬13] ### Name of failing test
pytest -v -s models/language/pooling -m 'not core_model'### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in
transformers)
…
-
#32718 [Bug]: `reload_weights` and `_get_weights_iterator` return cached/stale weights instead of re-reading from disk — bug,rl — by RobotSail (closed: 2026-01-23 13:50 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py`:
```text ============================== System Info ============================== OS : CentOS Stream 9 (x86_64) ... ```
-
#32059 [Feature]: Use platform op if inside a custom op — feature request — by robertgshaw2-redhat (关闭于: 2026-01-23 11:52 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
We have the `CustomOp` for `SiluAndMul`:
``` @staticmethod def forward_native(x: torch.Tensor) -> torch.Tensor: """PyTorch-native implementation equivalent to forward().""" d = x.shape[-1] // 2 return F.silu(x[..., :d]) * x[..., d:] …
[New PRs]
-
#32953 [UX] Deduplicate sampling parameter startup logs — frontend,ready — by DarkLight1337 (创建于: 2026-01-24 01:13 (UTC+8)) [+14/-34, 4 files | commented:3 approved:1]
## Purpose
Since multiple
`OpenAIServing*` classes are instantiated by `vllm serve` (even for a single process), the startup log gets spammed like so: ``` (EngineCore_DP0 pid=2378269) INFO 01-23 15:32:21 [core.py:272] init engine (profile, create kv cache, warmup model) took 83.04 seconds (EngineCore_DP0 pid=2378269) INFO 01-23 15:32:21 [vllm.py:618] Asynchronous scheduling is enabled. (APIServer pid=2377843) INFO 01-23 15:32:22 [api_server.py:664] Supported task…
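For illustration, a minimal sketch of one way to deduplicate repeated startup messages across several serving classes (this is not the PR's actual implementation; names here are made up):

```python
# Hypothetical sketch: emit a given startup log line only once per process,
# no matter how many OpenAIServing* instances initialize.
import logging

logger = logging.getLogger("vllm.entrypoints")
_logged_once: set[str] = set()

def log_startup_once(message: str) -> None:
    """Log `message` at INFO level only the first time it is seen."""
    if message in _logged_once:
        return
    _logged_once.add(message)
    logger.info(message)

# Each serving class could call this during __init__ without spamming the log:
# log_startup_once(f"Default sampling params: {default_params}")
```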
-
#32990 Indicate compile mode in the benchmark results — performance — by huydhn (创建于: 2026-01-24 09:47 (UTC+8)) [+13/-0, 1 files | commented:1 approved:1] ## Purpose
After https://github.com/pytorch/pytorch-integration-testing/pull/135, the PyTorch team is now running the vLLM benchmark in both eager and compile mode to compare their performance. In eager mode, vLLM is run with
`compilation_config.mode` set to 0. This PR checks for that config and several other conditions like `enforce_eager` or the output filenames to denote whether the benchmark is run with compile (default) or not. The field [extra_info.use_compile](https://github.com/pytorch/test-infra/…
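A rough sketch of the kind of check described, purely for illustration (field names and the helper are assumptions, not the benchmark script's exact code):

```python
# Illustrative only: decide whether a benchmark run used torch.compile,
# based on the engine arguments described above.
def infer_use_compile(compilation_mode: int | None, enforce_eager: bool) -> bool:
    """Return True when the run is expected to go through torch.compile."""
    if enforce_eager:
        return False
    # compilation_config.mode == 0 corresponds to eager / no compilation.
    if compilation_mode == 0:
        return False
    return True

# Example: record the flag alongside the benchmark results.
extra_info = {"use_compile": infer_use_compile(compilation_mode=0, enforce_eager=False)}
```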
-
#32931 [Feature] Add `qwen2_5_coder` tool parser for Qwen2.5-Coder models — documentation,tool-calling,qwen — by hanXen (created: 2026-01-23 18:59 (UTC+8)) [💬3 | +386/-0, 4 files | commented:2] ## Purpose
Add a dedicated
`qwen2_5_coder` tool call parser for Qwen2.5-Coder models. These models do not follow the hermes `<tool_call>` format (they output code blocks instead), but follow the `<tools>` tag format with 100% compliance when prompted with few-shot examples. Fixes #32926
Usage: ```bash …
-
#32944 [ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 — rocm,ready — by monajafi-amd (创建于: 2026-01-23 23:34 (UTC+8)) [💬3 | +34/-1, 1 files | commented:1 approved:1] ## Purpose
Flash Attention’s CK backend only supports CDNA GPUs (gfx90a/gfx942/gfx950). On RDNA3/RDNA4, vLLM falls back to PyTorch SDPA for vision attention, which is significantly slower. Flash Attention 2.8.3+ includes a Triton backend that supports RDNA architectures.
## Changes
- Added `flash_attn_triton_available()` in `vllm/platforms/rocm.py` to detect RDNA GPUs and verify Triton backend availability
- Updated `get_vit_attn_backend()` to use Flash Attention when available on RDNA GPUs
-…
-
#32965 [Core][Bugfix] allow graceful worker termination — bug,ready,v1 — by joerunde (创建于: 2026-01-24 04:28 (UTC+8)) [+18/-5, 1 files | commented:1 approved:1] ## Purpose
This PR fixes a small bug where workers are immediately sent a SIGTERM on shutdown.
(This was extracted from #28053)
When using the MultiprocExecutor, the parent process signals the local worker processes to shut down by closing a pipe (`self.death_pipe`). The worker processes each have a thread waiting for this pipe to close, and will gracefully shut themselves down when it does. However, the parent process immediately sends all workers a `SIGTERM` right after closing its death p…
-
#32902 fix[ROCm]: Remove unconditional aiter import — rocm,speculative-decoding,v1 — by rabi (创建于: 2026-01-23 11:25 (UTC+8)) [💬2 | +18/-7, 3 files | commented:10] ## Purpose AITER modules were being imported unconditionally at module load time, triggering JIT compilation and warnings even when VLLM_ROCM_USE_AITER=0 (the default).
[aiter] WARNING: NUMA balancing is enabled... [aiter] import [module_aiter_enum] under .../aiter/jit/...This occurred because several code paths imported aiter without first checking the VLLM_ROCM_USE_AITER environment variable.
## Test Plan Tested locally …
-
#32989 [Misc]Consolidate RoPE-related parsing into ModelArchitectureConfig — 无标签 — by charlotte12l (创建于: 2026-01-24 08:49 (UTC+8)) [+142/-53, 3 files | commented:1 | 📝草稿]
## Purpose
We are introducing model_arch_config (https://github.com/vllm-project/vllm/pull/28454), which explicitly defines what information the vLLM engine needs from the Hugging Face config / user-defined config, so we can avoid passing hf_config / getattr(hf_config, xxx) around in the engine.
Before this PR,
`_get_and_verify_max_len` in model.py directly accessed multiple HuggingFace config fields:
- rope_parameters
- original_max_position_embeddings …
-
#32988 Auth_token added in documentation as it is required — documentation — by ruizcrp (创建于: 2026-01-24 08:26 (UTC+8)) [💬2 | +2/-0, 1 files | commented:1 approved:1]
## Purpose The documentation for Claude Code is currently not working as described. After some trying, I found out that ANTHROPIC_AUTH_TOKEN is required. At least in my case, this was what finally made it work. This PR simply adds this information.
## Test Plan Does not affect code.
## Test Result Does not affect code …
-
#32908 [Bugfix][TPU] Return a Default fp8 MoE Backend — bug — by vanbasten23 (创建于: 2026-01-23 12:35 (UTC+8)) [💬5 | +1/-3, 1 files | commented:2] ## Purpose
A commit caused TPU to fail at
`def select_fp8_moe_backend` (error). This PR intends to fix this error for TPU. ## Test Plan
TPU CI: ``` USE_MOE_EP_KERNEL=1 MODEL_IMPL_TYPE=vllm vllm serve --seed=42 --model=BCCard/Qwen3-Coder-480B-A35B-Instruct-FP8-Dynamic --max-model-len=10240 --max-num-batched-tokens=8192 --max-num-seqs=512 --no-enable-prefix-caching --disab…
-
#32983 [Perf] Cache xpu_get_mem_info() result to avoid duplicate calls — v1 — by sjhddh (created: 2026-01-24 07:18 (UTC+8)) [💬2 | +2/-1, 1 files | commented:1] ## Summary Cache the result of
`xpu_get_mem_info()` in a local variable instead of calling it twice in the same expression. Before:
`total_allocated_bytes = self.xpu_get_mem_info()[1] - self.xpu_get_mem_info()[0]` After: ```python …
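The change is the classic "call once, reuse" pattern; a hedged sketch (the real method lives on the XPU worker, and the stand-in below is only for illustration):

```python
# Sketch of the caching pattern described above. get_mem_info() is a stand-in
# for the worker's xpu_get_mem_info(), returning placeholder values.
def get_mem_info() -> tuple[int, int]:
    return (1 << 30, 4 << 30)  # placeholder values for the sketch

# Before: get_mem_info()[1] - get_mem_info()[0] queries the device twice.
# After: cache the tuple in a local variable and index it.
mem_info = get_mem_info()
total_allocated_bytes = mem_info[1] - mem_info[0]
print(total_allocated_bytes)
```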
-
#32982 [Tests] Standardize RNG seed utility across test files — v1 — by sjhddh (created: 2026-01-24 07:18 (UTC+8)) [💬3 | +13/-20, 4 files | commented:2] ## Summary Use the centralized
`set_random_seed` from `vllm.utils.torch_utils` instead of scattered custom seed functions:
- Remove the custom `set_seed()` function from `tests/kernels/test_flex_attention.py`
- Replace `random.seed()` with `set_random_seed()` in `tests/v1/logits_processors/test_custom_offline.py`
- Re-export `set_random_seed` from `tests/utils.py` for convenience
## Motivation
- Ensures consistent seeding of `random`, `numpy`, and `torch` RNGs
- Reduces code duplication …
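For reference, a generic "seed everything" helper of the kind the PR centralizes (a sketch covering the three RNGs mentioned above, not vLLM's exact set_random_seed):

```python
# Hedged sketch of a centralized seeding helper; vLLM's own implementation
# may differ in details.
import random

import numpy as np
import torch

def set_random_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_random_seed(42)
```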
-
#32951 [WIP][Async][Spec Decoding] Zero-bubble async scheduling + spec decoding — speculative-decoding,v1 — by MatthewBonanni (创建于: 2026-01-24 01:01 (UTC+8)) [💬1 | +300/-71, 6 files | commented:1 | 📝草稿] co-authored by @izhuhaoran
## Purpose This is a refactor of #29957
## Test Plan On 8x H200: ``` vllm serve deepseek-ai/DeepSeek-R1
-tp 8 -ep
… -
#32981 [Tests] Clarify pytest skip reasons with actionable context — v1 — by sjhddh (创建于: 2026-01-24 07:16 (UTC+8)) [💬3 | +7/-3, 2 files | commented:1] ## Summary Replace vague FIXME comments in test skip markers with more descriptive reasons:
`tests/v1/sample/test_topk_topp_sampler.py`: Changed from "FIXME: This test is failing right now" to explain the FlashInfer top-k/top-p renorm comparison issue. `tests/samplers/test_beam_search.py`: Changed from "FIXME: This fails on V1 right now" to clarify that the V1 engine does not yet support beam search
## Test Plan
- Run `pytest --collect-only` on the affected test files to verify no syntax errors…
-
#32985 [Fix] Include list index in multimodal validation error messages — frontend — by sjhddh (创建于: 2026-01-24 07:19 (UTC+8)) [+12/-18, 1 files | commented:1] ## Summary Include the list index in multimodal validation error messages to help users identify which specific item has the issue.
Before:
Multi-modal data for image is None but UUID is not providedAfter: ``` …
- #32987 [Docs] Update README with uv recommendation and Python version requirements — documentation — by sjhddh (创建于: 2026-01-24 07:20 (UTC+8)) [💬1 | +7/-1, 1 files | commented:1]
## Summary
Update the README.md “Getting Started” section to:
- Add Python version requirements (3.10 - 3.13) prominently
- Recommend `uv` as the preferred installation method
- Keep `pip` as an alternative for users who prefer it
- Update installation guide link to the general installation page
## Motivation
- `uv` is significantly faster than `pip` for installing vLLM
- The docs already recommend `uv`, so the README should be consistent …
-
#32986 [Tests] Replace flaky sleep with polling in test_background_cancel — v1 — by sjhddh (created: 2026-01-24 07:20 (UTC+8)) [+19/-6, 1 files | commented:1] ## Summary Replace the fixed 0.5s sleep (marked with
`# FIXME: This test can be flaky`) with a proper polling loop that waits for the response status to change before attempting to cancel. ## Changes
- Poll with 0.1s intervals until status changes from “queued” (max 5s)
- Use proper assertions in the post-cancel verification loop
- Remove FIXME comment as the flakiness is addressed
## Motivation
- Fixed sleeps are unreliable: too short on slow machines, wasteful on fast ones …
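The polling idea in miniature, as a sketch with a hypothetical status source rather than the test's actual code:

```python
# Sketch of replacing a fixed sleep with bounded polling. get_status() stands
# in for whatever call the test uses to read the response state.
import time

def wait_until_not_queued(get_status, timeout_s: float = 5.0, interval_s: float = 0.1) -> str:
    """Poll until the status changes from "queued" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status == "queued" and time.monotonic() < deadline:
        time.sleep(interval_s)
        status = get_status()
    return status

# Example with a trivial stand-in status source:
assert wait_until_not_queued(lambda: "in_progress") == "in_progress"
```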
- #32980 [CI] Add pip and pre-commit caching to pre-commit workflow — ci/build — by sjhddh (创建于: 2026-01-24 07:16 (UTC+8)) [💬1 | +7/-0, 1 files | commented:1]
## Summary
- Add `cache: 'pip'` to speed up pip dependency installation
- Add `actions/cache` for `~/.cache/pre-commit` to cache pre-commit hook environments
- Cache key is based on the `.pre-commit-config.yaml` hash for proper invalidation
## Test Plan
- Run the pre-commit workflow on a test PR
- Verify cache is created on first run (check logs for “Cache saved”)
- Verify cache is restored on subsequent runs (check logs for “Cache restored”)
- Confirm workflow time decreases by ~30-60 seconds on c…
-
#32945 [Bugfix] Fix empty response under concurrent requests — bug,frontend,v1 — by vasia123 (created: 2026-01-23 23:48 (UTC+8)) [+341/-1, 5 files | commented:2] The
`with_cancellation` decorator returned None when the disconnect listener completed before the handler, causing FastAPI to send HTTP 200 with an empty body. It now raises asyncio.CancelledError instead, so the ASGI framework properly handles the disconnected client. Also adds defensive
`ready.set()` calls in the RequestOutputCollector.put() merge/replace branches to prevent latent lost-wakeup race conditions. Related to: https://github.com/vllm-project/vllm/issues/3209
## Purpose
Fix intermitten…
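A simplified sketch of the behavior change described (illustrative only, not vLLM's actual with_cancellation; the disconnect watcher is a placeholder): when the disconnect side of the race wins, raise asyncio.CancelledError instead of returning None so the framework never serializes an empty 200 response.

```python
# Sketch: race the handler against a disconnect watcher and propagate
# cancellation instead of silently returning None.
import asyncio
import functools

async def watch_disconnect(request) -> None:
    # Placeholder: in the real server this resolves once the client disconnects.
    await asyncio.Event().wait()

def with_cancellation(handler):
    @functools.wraps(handler)
    async def wrapper(request, *args, **kwargs):
        handler_task = asyncio.create_task(handler(request, *args, **kwargs))
        disconnect_task = asyncio.create_task(watch_disconnect(request))
        done, _ = await asyncio.wait(
            {handler_task, disconnect_task}, return_when=asyncio.FIRST_COMPLETED
        )
        if handler_task in done:
            disconnect_task.cancel()
            return handler_task.result()
        handler_task.cancel()
        # Client went away: raise instead of returning None so the ASGI
        # framework handles the disconnect rather than sending an empty 200.
        raise asyncio.CancelledError()
    return wrapper
```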
-
#32976 [AMD][Kernel][BugFix] Use correct scale in concat_and_cache_ds_mla_kernel when on gfx942 — bug,rocm — by rasmith (创建于: 2026-01-24 06:51 (UTC+8)) [+9/-7, 1 files | commented:2] ## Purpose This PR updates
`concat_and_cache_ds_mla_kernel` to use the correct scale divisor when running on `gfx942` architectures on ROCm. The `tile_size` divisor was `448.0`, which does not work on AMD platforms with arch `gfx942`, e.g. MI300, MI325. This PR updates the divisor to `224.0` on `gfx942`. Additionally, I consolidated the scale divisor into a constexpr float. ## Test Plan Use lm_eval to check accuracy on the
`DeepSeek-R1` model. Command to run: `…
- #32954 [NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe — nvidia — by Linda-Stadter (创建于: 2026-01-24 01:15 (UTC+8)) [💬3 | +216/-14, 5 files | commented:2]
## Purpose
- Integrate flashinfer trtllm-gen BF16 moe to supported models
- This is a rebased version of PR 28238 by @jiahanc and includes adaptation to the latest moe refactoring changes. I have further verified that the accuracy issues discussed in 28238 are solved.
## Test Plan
``VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct …
-
#32966 Add TUI Monitor: Real-time Terminal Dashboard for vLLM Metrics — documentation — by sjhddh (创建于: 2026-01-24 04:48 (UTC+8)) [💬4 | +448/-0, 1 files | commented:2] ## Summary This PR introduces
vllm_tui_monitor.py, a standalone Terminal User Interface (TUI) tool for monitoring vLLM instances in real-time.Why this is useful: Debugging performance issues or monitoring long-running serving instances often requires setting up a full Prometheus+Grafana stack. This tool provides an immediate, lightweight visualization of key metrics (KV cache usage, throughput, queue depth) directly in the terminal over SSH.
Features:
- **“Reactor Core” Visualizat…
-
#32984 [Perf] Cache exc.errors() result in validation exception handler — frontend — by sjhddh (创建于: 2026-01-24 07:19 (UTC+8)) [+4/-3, 1 files | commented:1] ## Summary Cache the result of
`exc.errors()` to avoid calling it multiple times in the `validation_exception_handler`. Before:
`exc.errors()` called 3 times. After: result cached in an `errors` variable, used 3 times. ## Motivation
- Avoids redundant method calls
- Improves code readability
- Error handling paths should be efficient …
- #32979 [CI] Add pip caching to cleanup_pr_body workflow — ci/build — by sjhddh (创建于: 2026-01-24 07:15 (UTC+8)) [💬1 | +1/-0, 1 files]
## Summary
- Add `cache: 'pip'` to the `actions/setup-python` step in the cleanup_pr_body workflow
- Reduces pip dependency installation time on cache hits
## Test Plan
- Open/edit a PR to trigger the workflow
- Check workflow logs for “Cache restored” message on subsequent runs
- Verify workflow completes successfully
## Risk …
- #32978 [Docs] Sync quantization list between README files — documentation — by sjhddh (创建于: 2026-01-24 07:15 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1]
## Summary
- Add AutoRound to the quantization list in `docs/README.md`
- Ensures consistency with the root `README.md`, which already lists AutoRound
## Test Plan
- Build documentation with `mkdocs build` to verify no rendering issues
- Visual inspection of the docs landing page
## Risk
- Very low risk: documentation-only change, adding one line
- #32977 [Docs] Fix Apple silicon include path in CPU installation docs — documentation,cpu — by sjhddh (创建于: 2026-01-24 07:14 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1]
## Summary
- Fix incorrect include path in the CPU installation documentation
- The "Build image from source" section for Apple silicon was referencing `cpu.arm.inc.md` instead of `cpu.apple.inc.md`
## Test Plan
- Build the documentation locally with `mkdocs build` or `mkdocs serve`
- Verify the Apple silicon section renders correctly without include errors
## Risk
- Very low risk: documentation-only change …
- #32912 [fix] add VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME to compile factors — ready — by dolpm (创建于: 2026-01-23 14:12 (UTC+8)) [💬2 | +38/-0, 2 files | commented:1 approved:3]
random env var…
"VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME": get_env_or_set_default( "VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME", lambda: f"VLLM_OBJECT_STORAGE_SHM_BUFFER_{uuid.uuid4().hex}", ),will cache miss every time :) —
... -
#32964 [Kernel] [Helion] Helion kernel wrapper — 无标签 — by gmagogsfm (创建于: 2026-01-24 03:45 (UTC+8)) [💬2 | +2347/-0, 8 files | commented:3] THIS IS A MANUALLY STACKED PR, PLEASE ONLY REVIEW TOP COMMIT, lower commits are being reviewed separately in an earlier PR.
This PR adds two basic cases to help Helion kernels with compilation and runtime dispatching.
- HelionKernelWrapper would be constructed by vllm.helion.register() in following PRs. It is responsible for adding Helion kernel to registry and partially specify Helion kernel according to GPU platform and model config. As …
- #32974 [Attention][WIP] FA4 integration — ci/build,v1,nvidia — by LucasWilkinson (创建于: 2026-01-24 06:30 (UTC+8)) [+611/-37, 7 files | commented:2] Integrate upstream FA4
-
#32973 [Perf] Overlap workspace_buffer.fill_(0) with compute in MLA attention — 无标签 — by robertgshaw2-redhat (创建于: 2026-01-24 06:19 (UTC+8)) [💬1 | +167/-1, 2 files | commented:2] Move workspace_buffer.fill_(0) for TRT-LLM ragged attention to run in a separate CUDA stream (aux_stream) so it can overlap with the compute operations that precede the attention kernel:
- gather_and_maybe_dequant_cache (or cp_gather_cache for context parallel)
- kv_b_proj
- _concat_k_nope_k_pe
The fill operation is launched at the start of each loop iteration in _compute_prefill_context and _context_parallel_compute_prefill_context, allowing it to execute concurrently with these compute opera…
-
#32975 [Perf] Optimize detokenizer python logic — ready,v1 — by yewentao256 (创建于: 2026-01-24 06:45 (UTC+8)) [+26/-9, 2 files | commented:1] ## Purpose
- adding a `num_output_tokens` function, to avoid `len(self.output_token_ids)`, which might do a slice in SlowIncrementalDetokenizer
```py class SlowIncrementalDetokenizer(BaseIncrementalDetokenizer): @property def output_token_ids(self) -> list[int]: return ( self.token_ids …
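The optimization in outline, as a hedged sketch (assuming the property builds a new list on every access, as the snippet above suggests; this is not the PR's actual class):

```python
# Sketch: len(self.output_token_ids) would build a fresh list via the property
# below, while num_output_tokens just computes a count.
class SlowIncrementalDetokenizerSketch:
    def __init__(self) -> None:
        self.token_ids: list[int] = []
        self._prompt_len = 0

    @property
    def output_token_ids(self) -> list[int]:
        # Constructs a slice (a new list) on every access.
        return self.token_ids[self._prompt_len:]

    @property
    def num_output_tokens(self) -> int:
        # Cheap count, no intermediate list.
        return len(self.token_ids) - self._prompt_len
```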
- #32950 [MLA] Fuse cat and qaunt for fp8 kv-cache — documentation,performance,ready,deepseek — by LucasWilkinson (创建于: 2026-01-24 00:48 (UTC+8)) [💬7 | +41/-20, 1 files | commented:2 approved:3]
Main:
PR:
-
#32971 [CI] fix version comparsion and exclusion patterns in upload-release-wheels.sh — ready,ci/build — by Harry-Chen (创建于: 2026-01-24 05:39 (UTC+8)) [+6/-5, 1 files | commented:4 approved:1] ## Purpose
@khluu found some issues when trying this script. This PR fixes them.
## Test Plan
Normal CI will not trigger this step. Needs to be tested on the next release.
## Test Result
…
-
#32935 [Bugfix] Fix missing is_layer_skipped check for FusedMoE in AWQConfig — bug,ready — by joninco (创建于: 2026-01-23 21:10 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1 approved:1] ## Purpose
Fix missing
`is_layer_skipped` check for FusedMoE layers in `AWQConfig.get_quant_method`. This caused models with
`modules_to_not_convert` (e.g., MTP layers stored as bfloat16) to incorrectly receive quantized methods, leading to weight loading failures when the code expected quantized weights but found bfloat16. Two bugs fixed:
- MoeWNA16 fallback config was missing `modules_to_not_convert`, so `MoeWNA16Config` couldn't properly skip layers
- `AWQMarlinMoEMethod` was return…
-
#32968 [Misc] Add run one batch script that supports profiling — documentation — by LucasWilkinson (创建于: 2026-01-24 05:28 (UTC+8)) [💬2 | +112/-0, 1 files | commented:1] add a script that runs one batch and can optionally profile it; e.g.
python examples/offline_inference/run_one_batch.py --model deepseek-ai/DeepSeek-V2-Lite --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.9 --kv-cache-dtype fp8 --trust-remote-code --profile both --profile-dir ./profiles/main -
#32952 [Refactor] Rename `gptq_marlin` to `marlin` to match MoE — performance,ready,ci/build — by mgoin (created: 2026-01-24 01:12 (UTC+8)) [💬1 | +40/-40, 24 files | commented:2 approved:1] ## Purpose
- Moved `csrc/quantization/gptq_marlin/` to `csrc/quantization/marlin/`
- Renamed `gptq_marlin.cu` to `marlin.cu`
- Renamed `gptq_marlin_gemm` to `marlin_gemm`
## Test Plan
## Test Result
…
-
#32969 [Bugfix][VLM] Fix transformers backend embed_multimodal for Qwen2.5-VL profiling — bug,qwen — by AndreasKaratzas (创建于: 2026-01-24 05:33 (UTC+8)) [💬1 | +29/-12, 1 files | commented:3] Fixes
`RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1 (input tensor's size at dimension 0), but got split_sizes=[4900, 4900]` when running Qwen2.5-VL with the transformers backend. ## Root Cause
During memory profiling, vLLM creates dummy encoder outputs with minimal size (shape
`[1, hidden_dim]`). The original `embed_multimodal` code called `unsqueeze(0)` on 2D tensors, then attempted `torch.split()` with sizes derived from `num_image_patches` (e.g., `[4900, 4900]`), c…
-
#32913 [fix] CPUDNNLGEMMHandler pointer baked into inductor artifact — ready,cpu — by dolpm (创建于: 2026-01-23 14:23 (UTC+8)) [💬1 | +38/-23, 4 files | commented:3 changes:1 approved:2] fixes https://github.com/vllm-project/vllm/issues/32033
pointers shouldn’t be passed around as integers. either you should maintain some lookup table on the cpp side and deal with deterministic keys in python-land, or throw them into a tensor that will cooperate through the graph. i’ve opted for the latter. they will be treated as constants, and will be folded as such. not fun.
e.g.,
`torch.ops._C.onednn_mm.default(buf14, buf11, None, 251002016)` ## Purpose …
-
#32970 [release] Fix upload release wheel script — ci/build — by khluu (创建于: 2026-01-24 05:37 (UTC+8)) [+2/-2, 1 files commented:1] - Cut out the extra
`v` when comparing the tagged version and the expected version
- Use `rc[0-9]` instead of `rc` so as not to accidentally exclude `aarch64` wheels
- #32967 [Frontend] Use init_app_state and FrontendArgs from api_server in run_batch — frontend — by pooyadavoodi (创建于: 2026-01-24 05:20 (UTC+8)) [+186/-132, 3 files | commented:4]
## Purpose
- Adding support for more features such as tool calling to run_batch.
- This is achieved by using `init_app_state` and `FrontendArgs` from `vllm/entrypoints/openai/api_server.py`.
- The approach taken here also removes some code duplication between api_server and run_batch.
- Due to an args conflict over
`--port` between FrontendArgs and the existing run_batch options, we rename the options from `--port` and `--url` to `--metrics-port` and `--metrics-url` and provide a backward c…
-
#32949 [Bug] Fix benchmark script `moe_permute_unpermute` — bug,performance,ready — by yewentao256 (created: 2026-01-24 00:32 (UTC+8)) [+3/-5, 1 files | commented:2 approved:1] ## Purpose
`python benchmarks/kernels/benchmark_moe_permute_unpermute.py`. Two bugs fixed:
```bash 93, ip=10.243.64.20, actor_id=4b4963ca6aa94bfc7fb40cf101000000, repr=<benchmark_moe_permute_unpermute.BenchmarkWorker object at 0x7f18ef275790>) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ …
-
#32957 [WIP][Feat] Add RMSNorm NvFp4 Quant Operator (#32612) — ci/build — by sparkecho (created: 2026-01-24 02:53 (UTC+8)) [💬4 | +392/-2, 8 files | commented:5] ## Purpose This commit implements rmsnorm + fp4 quant fusion and integrates it into the rmsnorm + quant fusion pass, fixing #32612
## Test Plan In progress.
## Test Result TODO
-
#32961 [Bugfix][Kernel] Rename NoEP to NoDPEP and fix NvFP4 MoE expert routing — bug,documentation,performance,nvidia — by aviralgarg05 (创建于: 2026-01-24 03:32 (UTC+8)) [💬3 | +66/-48, 23 files | commented:1] # [Bugfix][Kernel] Rename NoEP to NoDPEP and fix NvFP4 MoE expert routing
Fixes #32009
## Purpose
This PR addresses feedback regarding the naming and logic of MoE preparation strategies, specifically for NVFP4/FlashInfer backends.
### 1. Rename NoEP to NoDPEP As pointed out in PR feedback,
`MoEPrepareAndFinalizeNoEP` was a misleading name because it does support Expert Parallelism (EP) via redundant computation and AllReduce. It specifically does not support the Data Parallelism + Expert P…
-
#32946 [cudagraphs] Refactor cudagraph capture loop — ready,v1,nvidia — by LucasWilkinson (created: 2026-01-23 23:55 (UTC+8)) [+117/-59, 3 files | commented:2 approved:1] Refactor cudagraph capture loop to pave the way for different PIECEWISE and FULL sizes and dynamic spec-decode sizes
-
#32963 Update CPU doc according to feedback — documentation,cpu — by louie-tsai (创建于: 2026-01-24 03:41 (UTC+8)) [💬1 | +4/-4, 3 files | commented:1] ## Purpose
Change "supported model" to "recommended model", following the TPU doc and feedback. Overall, we may functionally support all officially supported vLLM models, but we test the recommended models for functionality and performance.
## Test Plan NA ## Test Result NA — …
-
#32960 [compile][cuda_graph]Add sym_size handling by folding them to constant — needs-rebase,nvidia — by fxdawnn (创建于: 2026-01-24 03:23 (UTC+8)) [💬1 | +169/-3, 4 files | commented:2 | 📝草稿]
## Purpose Fix #31043
## Test Plan
- Compare against the repro of the issue after reverting the previous temporary fix
- Adding local test
## Test Result …
-
#32958 glm 4.6 fused tuned inference config for B200 — 无标签 — by navmarri14 (创建于: 2026-01-24 03:02 (UTC+8)) [+163/-0, 1 files | commented:1] This PR adds a tuned fused MoE kernel configuration for the GLM-4.6 MoE architecture on NVIDIA B200 GPUs using FP8 quantization.
Specifically, it targets the configuration:
Experts (E): 160 Sharded size N=384 for TP=4 Device: NVIDIA B200 Dtype: fp8_w8a8 Previously, vLLM lacked a static configuration for these shapes on B200, causing it to fallback to heuristics or require JIT tuning during startup. This config improves startup time and ensures optimal kernel param…
-
#32948 [Dev UX] Add auto-detection for VLLM_PRECOMPILED_WHEEL_VARIANT during install — documentation,ready,ci/build,nvidia — by mgoin (创建于: 2026-01-24 00:30 (UTC+8)) [💬1 | +51/-6, 2 files | commented:2 approved:4]
## Purpose
Inspired by the
`--torch-backend=auto` we use from uv to get the right CUDA 12.x or 13.x binary for dev. We first respect VLLM_MAIN_CUDA_VERSION if set, then check
`torch.version.cuda` if a user already has torch installed for some reason (we assume it works for them), and last we try nvidia-smi to extract the version. Then we map that to the two wheel variants we have on wheels.vllm.ai, cu129 or cu130, based on 12.x or 13.x. On my dev system with `CUDA…
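A hedged sketch of the detection order described above (environment override, then an installed torch, then nvidia-smi), mapping a 12.x/13.x major version to the cu129/cu130 wheel variants; the helper names and regex are assumptions, not the installer's actual code:

```python
# Illustrative sketch of the detection cascade described above.
import os
import re
import subprocess

def detect_cuda_major() -> str | None:
    # 1. Explicit override wins.
    override = os.environ.get("VLLM_MAIN_CUDA_VERSION")
    if override:
        return override.split(".")[0]
    # 2. Ask an already-installed torch, if any.
    try:
        import torch
        if torch.version.cuda:
            return torch.version.cuda.split(".")[0]
    except ImportError:
        pass
    # 3. Fall back to parsing nvidia-smi output.
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True).stdout
        match = re.search(r"CUDA Version:\s*(\d+)\.", out)
        if match:
            return match.group(1)
    except (OSError, subprocess.CalledProcessError):
        pass
    return None

def wheel_variant(cuda_major: str | None) -> str:
    # Map 12.x -> cu129 wheels, 13.x -> cu130 wheels (per the PR description).
    return "cu130" if cuda_major == "13" else "cu129"
```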
-
#32956 [Bugfix][CI] Fix pre-commit — bug,v1 — by MatthewBonanni (创建于: 2026-01-24 02:19 (UTC+8)) [+1/-2, 1 files | commented:1 approved:1]
## Purpose Fixes a pre-commit issue caused by #30877. The issue shouldn't break CI because
`check_triton_import` only checks the diff, but it shows up when merging main into PR branches. ## Test Plan
## Test Result
…
-
#32955 [Refactor] Use data parser for matching data items to multi-modal UUIDs — frontend,v1 — by DarkLight1337 (创建于: 2026-01-24 01:29 (UTC+8)) [💬2 | +50/-96, 3 files | commented:1 | 📝草稿] ## Purpose
Move the code related to UUID validation from the
`LLM` class to `InputProcessor` to improve readability. Also fixed a mismatch in item counts between `_validate_multi_modal_uuids` and `_maybe_build_mm_uuids` by using the data parser to get the correct number of items in both cases. ## Test Plan
## Test Result
... -
#32934 [BugFix] Fix tool parser crash with parentheses in descriptions — bug,tool-calling — by majiayu000 (创建于: 2026-01-23 19:50 (UTC+8)) [💬3 | +244/-27, 2 files | commented:2] ## Purpose
Fixes #32827
When using OpenAI-compatible function calling, if a parameter's
`description` contains example commands like `(e.g. ls -la, ssh user@host, cat file.txt)`, the vLLM server crashes with "Connection error". This happens because the MiniMax-M2 model sometimes echoes the description back as additional attributes in the XML tags, causing the regex parsing to fail. ## Test Plan
```bash pytest tests/tool_use/test_minimax_m2_tool_parser.py -v …
-
#32924 Create 52012 — 无标签 — by AKB0700 (创建于: 2026-01-23 16:49 (UTC+8)) [💬2 | +1/-0, 1 files | commented:3]
## Purpose
## Test Plan
## Test Result
...
-
#32940 [torch.compile][CI] Add back attn fusion on hopper/ada — ready — by ProExpertProg (created: 2026-01-23 22:58 (UTC+8)) [+4/-5, 1 files | commented:1 approved:2] We were not running attention quant fusion because the unit tests were using llama4 for the model name, which is too big for a single GPU (but the model wasn't used in the test apart from the name). Not technically a problem on B200, but we still don't want to be skipping
-
#32947 [Bugfix] Fix weights offloading for sleep mode — bug,v1 — by jseppanen (创建于: 2026-01-24 00:20 (UTC+8)) [💬1 | +4/-3, 1 files | commented:1] ## Purpose
Fix weights offloading bug.
Before:
[gpu_worker.py:137] Sleep mode freed 24.08 GiB memory, 33.62 GiB memory is still in use.…
-
#32904 [Hardware][AMD][CI][Bugfix] Fix Kernels Attention Cache test — bug,rocm,ready — by mawong-amd (创建于: 2026-01-23 11:50 (UTC+8)) [+1/-1, 1 files | commented:1 approved:2] ## Purpose This PR fixes a failing test (
`kernels/attention/test_cache.py::test_reshape_and_cache_flash`) caused by https://github.com/vllm-project/vllm/pull/30141. The reference dequantization implementation used in the test assumes the FP8 data format is `e4m3fn`, but on AMD `gfx942`-series cards, the FP8 data type used is `e4m3fnuz` instead. ## Test Plan Run
`pytest -sv tests/kernels/attention/test_cache.py -k test_reshape_and_cache_flash` on a MI325X as part of…
-
#32939 [ROCm][PD] Remove unused moriio connector proxy code — documentation,rocm,ready,kv-connector — by markmc (创建于: 2026-01-23 22:16 (UTC+8)) [💬1 +0/-22, 1 files commented:1 approved:1] - send_request_to_decode() is not called anywhere
- extract_ip_port_fast() for P is called twice
-
#32943 Adding optional speculator tests for larger models — speculative-decoding,ci/build,v1 — by shanjiaz (创建于: 2026-01-23 23:15 (UTC+8)) [+26/-1, 2 files | commented:3 | 📝草稿] ## Purpose We just enabled speculative decoding for qwen3 moe vision language pathway. Adding a test for vision language speculator model.
- Add support for testing Qwen3-30B-MOE-VL-Eagle3 speculator model
- Create separate optional CI job for large speculator model tests on A100 GPUs …
- #32942 [Bugfix]: Fix display errors in TORCH_CHECK messages — bug,cpu — by lingebeng (创建于: 2026-01-23 23:14 (UTC+8)) [+8/-8, 6 files | commented:1] Fix display errors in TORCH_CHECK messages.
-
#32941 draft: removed a priori unneeded fills, use optimized copy kernel for DSV3 — 无标签 — by hypdeb (创建于: 2026-01-23 23:11 (UTC+8)) [+17/-6, 1 files | commented:2 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#32936 [Model Runner V2] support cudagraph check based on attn backend — v1,nvidia — by izhuhaoran (创建于: 2026-01-23 21:14 (UTC+8)) [💬1 | +177/-113, 6 files] ## Purpose
A follow-up PR of #32771 and #32820.
After #32820 we can select any attention backend, but some of them have limitations for CUDA-graph. This PR, like Model-Runner V1, adds a CUDA-graph check that adjusts the cudagraph mode & capture_sizes according to the attention backend. For example, with FLASHINFER + spec decode, a user-specified FULL_AND_PIECEWISE is automatically resolved to PIECEWISE.
``` [WARNING 01-23 20:32:46 [compilation.py:1148] CUDAGraphMode.FULL_AND_PIECEWISE is not…
-
#32906 [ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by monajafi-amd (创建于: 2026-01-23 12:07 (UTC+8)) [💬2 | +16257/-10390, 438 files | commented:1 | 📝草稿] ## Purpose
Flash Attention’s CK backend only supports CDNA GPUs (gfx90a/gfx942/gfx950). On RDNA3/RDNA4, vLLM falls back to PyTorch SDPA for vision attention, which is significantly slower. Flash Attention 2.8.3+ includes a Triton backend that supports RDNA architectures.
## Changes
- Added `flash_attn_triton_available()` in `vllm/platforms/rocm.py` to detect RDNA GPUs and verify Triton backend availability
- Updated `get_vit_attn_backend()` to use Flash Attention when available on RDNA GPUs
-…
-
#32916 [Bugfix] Disable tma_aligned_scales in test_fusions_e2e — bug,ready — by xyang16 (创建于: 2026-01-23 15:12 (UTC+8)) [+9/-1, 3 files | commented:1 approved:1]
## Purpose
This PR is to disable tma_aligned_scales in test_fusions_e2e to unblock CI.
## Test Plan
``` pytest -s -v tests/compile/distributed/test_fusions_e2e.py …
-
#32933 [Bugfix] Fix getting vision features in Transformer Multimodal backend — bug,ready — by zucchini-nlp (创建于: 2026-01-23 19:16 (UTC+8)) [💬1 | +9/-0, 1 files | commented:2 approved:1] Makes sure that transformers multimodal backend keeps working after v5 release.
PR https://github.com/huggingface/transformers/pull/42564 changed the output of
`self.model.get_image_features` to a `tuple | dict` format. Previously we expected the output to always be a single tensor or a list of tensors for non-homogeneous image sizes. A simple check if the output is a `tuple`. The default output format currently depends on
`model.config.return_dict`, so I added both formats. cc @hmellor
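A small sketch of handling both return formats (illustrative only; the key and tuple position chosen here are assumptions, and the backend's actual unpacking may differ):

```python
# Sketch: normalize the possible return types described above into a tensor
# (or list of tensors).
import torch

def unpack_image_features(output):
    if isinstance(output, dict):
        # Assume the features live under a known key in the dict format.
        return output.get("image_features", next(iter(output.values())))
    if isinstance(output, tuple):
        # Assume the first element carries the features in the tuple format.
        return output[0]
    return output  # already a tensor or list of tensors

print(unpack_image_features((torch.zeros(2, 4), None)).shape)
```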
-
#32937 [DO NOT MERGE][ROCM][PD] Fix MoRIIO connector with transfer_id for P/D coordination — documentation,rocm,kv-connector — by markmc (创建于: 2026-01-23 21:20 (UTC+8)) [💬1 | +514/-72, 5 files | commented:2 | 📝草稿] After #27987 introduced random suffixes to internal request_ids, the MoRIIO connector broke because Prefill and Decode instances now have different internal request_ids (e.g., “cmpl-uuid-abc” vs “cmpl-uuid-def”). The MoRIIO connector’s parallel dispatch model requires a stable, shared identifier for P/D coordination.
This commit introduces a
transfer_id(format: “xfer-{uuid}”) that is generated by the proxy and shared between Prefill and Decode for KV transfer coordination. The connector work… -
#32905 [Frontend][3/n] Make pooling entrypoints request schema consensus | EmbedRequest & ClassifyRequest — documentation,frontend,ready — by noooop (创建于: 2026-01-23 11:54 (UTC+8)) [💬2 | +326/-261, 11 files | commented:2 approved:1]
## Purpose
Split out the following RequestMixin
- EncodingRequestMixin
- EmbedRequestMixin
- ClassifyRequestMixin
address https://github.com/vllm-project/vllm/pull/31784#discussion_r2716425132 …
-
#32927 [Benchmark][Bugfix] Fix race condtion when starting server for sweep benchmark — bug,performance,ready — by Isotr0py (创建于: 2026-01-23 18:07 (UTC+8)) [+37/-0, 2 files | commented:3 approved:1]
## Purpose
- Currently, there is a race condition where the bench command is executed too early, before the server becomes ready, for
`vllm bench sweep serve`: ``` [BEGIN SERVER] Server overrides: {'mm_encoder_attn_backend': 'FLASH_ATTN'} Server command: ['vllm', 'serve', '/home/mozf/LLM/Qwen3-VL-4B-Instruct/', '--enforce-eager', '--max-model-len', '32768', '--mm-processor-cache-gb=0', '--media-io-kwargs.video.num_frames', '-1', '--mm-encoder-attn-backend', 'FLASH_ATTN'] [BEGI…
-
#32923 [StepVL] support close img patch — 无标签 — by ltd0924 (创建于: 2026-01-23 16:32 (UTC+8)) [💬3 | +8/-2, 2 files | commented:3]
## Purpose The default behavior of the StepVL model involves patch operations for image computation. A new environment variable,
`VLLM_ENABLE_STEP_VL_IMG_PATCH`, has been introduced to allow disabling patch computation, thus supporting image-only computation without patch operations (i.e., calculating embeddings for the image alone). ## Test Plan
## Test Result
…
-
#32930 [MoE] Add FusedMoE configs for RTX PRO 6000 Blackwell Max-Q — 无标签 — by ErikDeBruijn (创建于: 2026-01-23 18:43 (UTC+8)) [💬2 | +400/-0, 8 files | commented:2] ## Summary
Add FusedMoE kernel configurations for NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition covering popular MoE architectures.
### Configurations Added (8 files)
| E | N | Use Case | Latency (M=16) |
|---|---|----------|----------------|
| 8 | 14336 | Mixtral 8x7B/8x22B | 1.58ms |
| 8 | 7168 | Large 8-expert | 0.92ms |
…
-
#32907 [CI/Build][CPU] Fix failed pooling tests and macos smoke test — ready,cpu — by bigPYJ1151 (创建于: 2026-01-23 12:29 (UTC+8)) [💬1 | +7/-1, 2 files | commented:6 approved:1]
## Purpose
- Skip post weight processing for missing layers
- Exclude unsupported op for macos
## Test Plan
CI tests …
-
#32928 [MoE] Add tuned config for RTX PRO 6000 Blackwell Max-Q (E=64, N=1024) — 无标签 — by ErikDeBruijn (创建于: 2026-01-23 18:27 (UTC+8)) [💬2 | +720/-0, 8 files | commented:1] ## Summary
Add FusedMoE kernel configurations for NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition covering the most popular MoE architectures.
### Configurations Added (8 files)
| E | N | Use Case |
|---|---|----------|
| 8 | 14336 | Mixtral 8x7B/8x22B |
| 8 | 7168 | Large 8-expert models |
…
-
#32929 [Draft][FP8]add FP8 WoQ kernel abstraction. — nvidia — by jikunshang (创建于: 2026-01-23 18:27 (UTC+8)) [💬1 | +329/-42, 6 files | commented:2]
## Purpose add abstraction for Wfp8A16 kernel.
## Test Plan
## Test Result
…
-
#32922 [Bugfix] Fix ‘remove_instance_endpoint’ method logic in disagg_proxy_demo — bug,documentation,kv-connector — by ChenqianCao (创建于: 2026-01-23 15:58 (UTC+8)) [💬2 | +2/-2, 1 files | commented:1]
## Purpose
Link to the issue: Fixes #32920
Fixed a simple logic bug in the
`remove_instance_endpoint` (vllm\examples\online_serving\disaggregated_serving\disagg_proxy_demo.py) method where the logic for removing prefill instances was checking against the wrong instance list and cycling the wrong iterator. ## Problem
…
-
#32914 [ROCm][perf] Shuffle KV cache to use paged_attention_common — rocm,ci/build,v1 — by samutamm (创建于: 2026-01-23 14:37 (UTC+8)) [💬4 | +40/-5, 2 files | commented:1] ## Purpose For Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 model, currently
`VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1` performs worse at small concurrencies compared to `VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=0`. This PR fixes the issue using `paged_attention_common` from aiter (see https://github.com/ROCm/aiter/pull/1821). ## Test Plan For input and output lengths of 1k and 8k and concurrencies of 8, 18, 32, 64, 128, compare the current main branch with and without VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT (_vllm_ma…
-
#32917 Sm120 130 — ci/build,nvidia — by pacoxu (创建于: 2026-01-23 15:29 (UTC+8)) [+186/-3, 3 files | commented:1 | 📝草稿]
https://github.com/vllm-project/vllm/pull/19757 https://github.com/vllm-project/vllm/pull/32237/
## Purpose
## Test Plan
## Test Result …
-
#32918 [Fix] Update CUTLASS_REVISION to v4.3.5 — ci/build,nvidia — by pacoxu (创建于: 2026-01-23 15:29 (UTC+8)) [+1/-1, 1 files | commented:1] ## Purpose
https://github.com/NVIDIA/cutlass?tab=readme-ov-file#whats-new-in-cutlass-43 https://docs.nvidia.com/cutlass/latest/CHANGELOG.html
CUTLASS 4.3.5 vs 4.2.1: it includes many Blackwell SM100/SM103/SM120/SM130 support features and bugfixes.
- https://github.com/vllm-project/vllm/pull/12702 is an old PR to update CUTLASS (that PR updated it to 3.8) and CUDA_SUPPORTED_ARCHS (the CUDA arches have already been updated on main). #12702 is outdated and should be closed.
This may benefit PR…
-
#32909 [CI][Pooling] Stabilize ModernBERT test — no labels — by AndreasKaratzas (created: 2026-01-23 13:05 (UTC+8)) [💬2 | +27/-0, 1 files | commented:10] This PR attempts to stabilize the
Language Models Test (Extended Pooling). It marks it as flaky, enables rerunning the test, and logs an informative message regarding the possibly inaccurate results, based on the conversation here: https://github.com/vllm-project/vllm/pull/32403#pullrequestreview-3695341925 ## Related
- Addresses flaky test: `test_modernbert_models[float-disham993/electrical-ner-ModernBERT-base]`
- Related to #32403 which increased test tolerance from `atol=1.2e-2` to `atol=3…
[Merged PRs]
-
#32944 [ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 — rocm,ready — by monajafi-amd (合并于: 2026-01-24 10:03 (UTC+8)) [💬3 | +34/-1, 1 files | commented:1 approved:1] ## Purpose
Flash Attention’s CK backend only supports CDNA GPUs (gfx90a/gfx942/gfx950). On RDNA3/RDNA4, vLLM falls back to PyTorch SDPA for vision attention, which is significantly slower. Flash Attention 2.8.3+ includes a Triton backend that supports RDNA architectures.
## Changes
- Added `flash_attn_triton_available()` in `vllm/platforms/rocm.py` to detect RDNA GPUs and verify Triton backend availability
- Updated `get_vit_attn_backend()` to use Flash Attention when available on RDNA GPUs
-…
-
#32279 [Bugfix] Fix FusedMoE LoRA kernel offs_token out of bound value — bug,ready — by xyang16 (合并于: 2026-01-24 09:41 (UTC+8)) [💬6 | +4/-2, 1 files | commented:2 approved:1] ## Purpose
This PR is a fix to correct the out of bound value in FusedMoE LoRA kernel.
The `offs_token` out-of-bound value should be the `num_valid_tokens` sentinel instead of 0. This avoids reading token 0 for the out-of-bound range. Results show an accuracy improvement and a slight performance gain.
- And a minor fix of the `moe_weight` out-of-bound value to 0.0, because it is an fp type.
## Test Plan
`pytest -s -v tests/lora/test_fused_moe_lora_kernel.py` …
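To illustrate the sentinel idea from #32279 outside of Triton, here is a small PyTorch sketch (hypothetical helper, not the kernel itself): out-of-bound offsets are redirected to a dedicated padding row at index `num_valid_tokens` instead of falling back to token 0.
```python
# Out-of-range token offsets point at an appended padding row rather than at
# token 0, so masked lanes never read real token data.
import torch

def gather_tokens_with_sentinel(tokens: torch.Tensor,
                                offs_token: torch.Tensor,
                                num_valid_tokens: int) -> torch.Tensor:
    """tokens: [num_valid_tokens, hidden]; offs_token may contain padding slots."""
    hidden = tokens.shape[1]
    # Append one zero "sentinel" row at index num_valid_tokens.
    padded = torch.cat([tokens, tokens.new_zeros(1, hidden)], dim=0)
    # Redirect out-of-bound offsets to the sentinel instead of clamping to 0.
    safe_offs = torch.where(offs_token < num_valid_tokens,
                            offs_token,
                            torch.full_like(offs_token, num_valid_tokens))
    return padded[safe_offs]
```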
-
#32965 [Core][Bugfix] allow graceful worker termination — bug,ready,v1 — by joerunde (合并于: 2026-01-24 09:28 (UTC+8)) [+18/-5, 1 files | commented:1 approved:1] ## Purpose
This PR fixes a small bug where workers are immediately sent a SIGTERM on shutdown.
(This was extracted from #28053)
When using the MultiprocExecutor, the parent process signals the local worker processes to shut down by closing a pipe (`self.death_pipe`). The worker processes each have a thread waiting for this pipe to close, and will gracefully shut themselves down when it does. However, the parent process immediately sends all workers a `SIGTERM` right after closing its death p…
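A simplified, self-contained sketch of the death-pipe pattern #32965 describes (names and timings are illustrative, not the MultiprocExecutor code): the parent closes the pipe, waits for a graceful exit, and only escalates to SIGTERM if the worker does not stop in time.
```python
# The parent closes its write end to request shutdown, waits a grace period,
# and only then escalates to SIGTERM. "spawn" is used so the child does not
# inherit the parent's write end of the pipe.
import multiprocessing as mp
import os
import signal
import threading
import time

def worker(death_pipe_r):
    stop = threading.Event()

    def watch():
        try:
            death_pipe_r.recv()   # raises EOFError once the parent closes the write end
        except EOFError:
            pass
        stop.set()

    threading.Thread(target=watch, daemon=True).start()
    while not stop.is_set():
        time.sleep(0.1)           # ... real work would happen here ...
    print(f"worker {os.getpid()} exiting gracefully")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    r, w = ctx.Pipe(duplex=False)
    p = ctx.Process(target=worker, args=(r,))
    p.start()
    r.close()                     # parent keeps only the write end
    time.sleep(1)
    w.close()                     # signal shutdown by closing the pipe ...
    p.join(timeout=5)             # ... and give the worker time to exit on its own
    if p.is_alive():
        os.kill(p.pid, signal.SIGTERM)  # escalate only if graceful shutdown timed out
```
-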
#25954 [Performance] Split FlashAttn attention and cache update — documentation,speculative-decoding,ready,v1,qwen,kv-connector,nvidia,ready-run-all-tests — by ElizaWszola (合并于: 2026-01-24 09:28 (UTC+8)) [💬36 | +458/-68, 21 files | commented:10] This PR creates codepaths for separating KV Cache update and Attention forward op. It also implements this split for FlashAttn backend. This separation facilitates future unwrapping.
#### E2E tests: ran inference on a Blackwell machine with
`llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)` and both `VLLM_MLA_DISABLE=1` (to test the split) and `VLLM_MLA_DISABLE=0` (to test that this PR does not affect backends that don’t do the split). Some lm_eval results (flash infer…
- #32742 [Model Runner V2] Add KV Connector support — ready,v1,kv-connector — by njhill (合并于: 2026-01-24 02:49 (UTC+8)) [💬1 | +159/-16, 4 files | commented:4 approved:1] Tested with cpu offloading and NIXL P/D.
- #32912 [fix] add VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME to compile factors — ready — by dolpm (合并于: 2026-01-24 06:53 (UTC+8)) [💬2 | +38/-0, 2 files | commented:1 approved:3]
random env var…
"VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME": get_env_or_set_default( "VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME", lambda: f"VLLM_OBJECT_STORAGE_SHM_BUFFER_{uuid.uuid4().hex}", ),will cache miss every time :) —
... -
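A toy illustration (not vLLM code) of why a uuid-based default in a compile factor causes a miss on every run: any factor value that changes per process changes the cache key.
```python
# The factor value changes on every process start, so the hash never matches a
# previous run; a fixed value keeps the key stable.
import hashlib
import json
import uuid

def cache_key(factors: dict) -> str:
    return hashlib.sha256(json.dumps(factors, sort_keys=True).encode()).hexdigest()

stable = cache_key({"SHM_BUFFER_NAME": "fixed-name"})
random_a = cache_key({"SHM_BUFFER_NAME": f"buf_{uuid.uuid4().hex}"})
random_b = cache_key({"SHM_BUFFER_NAME": f"buf_{uuid.uuid4().hex}"})

assert cache_key({"SHM_BUFFER_NAME": "fixed-name"}) == stable   # stable value -> cache hit
assert random_a != random_b                                     # random default -> miss every time
```
-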
#32971 [CI] fix version comparison and exclusion patterns in upload-release-wheels.sh — ready,ci/build — by Harry-Chen (合并于: 2026-01-24 06:21 (UTC+8)) [+6/-5, 1 files | commented:4 approved:1] ## Purpose
@khluu found some issues when trying this script. This PR fixes them.
## Test Plan
Normal CI will not trigger this step. Needs to be tested on the next release.
## Test Result
…
-
#32935 [Bugfix] Fix missing is_layer_skipped check for FusedMoE in AWQConfig — bug,ready — by joninco (合并于: 2026-01-24 06:19 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1 approved:1] ## Purpose
Fix missing
`is_layer_skipped` check for FusedMoE layers in `AWQConfig.get_quant_method`. This caused models with `modules_to_not_convert` (e.g., MTP layers stored as bfloat16) to incorrectly receive quantized methods, leading to weight loading failures when the code expected quantized weights but found bfloat16.
Two bugs fixed:
- MoeWNA16 fallback config was missing `modules_to_not_convert`, so `MoeWNA16Config` couldn’t properly skip layers
- `AWQMarlinMoEMethod` was return…
-
#32692 [Refactor] Clean up unused variables & func — rocm,frontend,ready — by yewentao256 (合并于: 2026-01-24 06:04 (UTC+8)) [+0/-30, 5 files | commented:1 approved:1] ## Purpose
Clean up unused variables & func
-
#32869 [CPU][Feat] Update PyTorch to v2.10 for CPU Backend — ready,ci/build,cpu — by fadara01 (合并于: 2026-01-23 21:13 (UTC+8)) [💬6 | +5/-6, 3 files | commented:2 approved:1] ## Purpose
Update PyTorch to v2.10 for CPU Backend.
A lot of nice features, improvements and bug fixes have gone into PyTorch v2.10 for Arm CPUs and we should capitalize on that in vLLM!
Here’s a non-exhaustive list:
- Enable mimalloc on AArch64: Switched the c10 system allocator to mimalloc on AArch64 to improve large-allocation behavior and overall performance. This delivers broad wins across dtypes, including up to 60% faster DenseNet121 (FP32) and up to 40% faster GPT2-Large (BF16) [#pu…
-
#32952 [Refactor] Rename
`gptq_marlin` to `marlin` to match MoE — performance,ready,ci/build — by mgoin (合并于: 2026-01-24 05:48 (UTC+8)) [💬1 | +40/-40, 24 files | commented:2 approved:1] ## Purpose
- Moved `csrc/quantization/gptq_marlin/` to `csrc/quantization/marlin/`
- Renamed `gptq_marlin.cu` to `marlin.cu`
- Renamed `gptq_marlin_gemm` to `marlin_gemm`
## Test Plan
## Test Result
…
-
#32831 [CI][AMD][BugFix] Update wvSplitK (and other skinny_gemm wrappers) to ensure tensors passed will be made contiguous for the kernel — bug,rocm,ready — by rasmith (合并于: 2026-01-24 05:35 (UTC+8)) [+23/-0, 2 files | commented:8 approved:1] ## Purpose The
`wvSplitKQ_hf_sml` kernel is not able to properly handle non-contiguous tensors. This is causing failures in AMD CI, and possibly other failures or inaccuracies. In the long term, it might be better to update the kernel to handle non-contiguous tensors properly, since the current approach trades some performance for correctness. We can then work on a fix to the kernels in skinny_gems.cu to handle non-contiguous tensors. In addition, all of the kernels in `ski…
-
#32949 [Bug] Fix benchmark script
`moe_permute_unpermute` — bug,performance,ready — by yewentao256 (合并于: 2026-01-24 05:18 (UTC+8)) [+3/-5, 1 files | commented:2 approved:1] ## Purpose
`python benchmarks/kernels/benchmark_moe_permute_unpermute.py`
Two bugs fixed
```bash 93, ip=10.243.64.20, actor_id=4b4963ca6aa94bfc7fb40cf101000000, repr=<benchmark_moe_permute_unpermute.BenchmarkWorker object at 0x7f18ef275790>) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ …
-
#32614 fix: Add glm4_moe_lite to MLA detection — ready,v1,nvidia — by marksverdhei (合并于: 2026-01-24 04:38 (UTC+8)) [💬3 | +70/-10, 4 files | commented:4 approved:4 changes:1] ## Summary
- Add `glm4_moe_lite` and `glm4_moe_lite_mtp` to the `is_deepseek_mla()` check in `model_arch_config_convertor.py`
GLM-4.7-Flash (`glm4_moe_lite`) uses Multi-head Latent Attention (MLA) via `Glm4MoeLiteMLAAttention` (which inherits from `DeepseekV2MLAAttention`) but was missing from the MLA detection. Without this fix, vLLM falls back to standard KV caching instead of efficient MLA caching, resulting in ~4x higher KV cache memory usage.
Note: `glm4_moe` is intentionally NOT included a…
- #32946 [cudagraphs] Refactor cudagraph capture loop — ready,v1,nvidia — by LucasWilkinson (合并于: 2026-01-24 04:22 (UTC+8)) [+117/-59, 3 files | commented:2 approved:1] Refactor cudagraph capture loop to pave the way for different PIECEWISE and FULL sizes and dynamic spec-decode sizes
-
#32956 [Bugfix][CI] Fix pre-commit — bug,v1 — by MatthewBonanni (合并于: 2026-01-24 02:26 (UTC+8)) [+1/-2, 1 files | commented:1 approved:1]
## Purpose Fixes a pre-commit issue caused by #30877. The issue shouldn’t break CI because `check_triton_import` only checks the diff, but it shows up when merging main into PR branches.
## Test Plan
## Test Result
…
-
#30877 [V1][Hybrid] Mamba Prefix Caching with align mode — documentation,ready,v1,qwen — by peakcrosser7 (合并于: 2026-01-24 01:56 (UTC+8)) [💬22 | +1775/-129, 42 files | commented:10] The cleaned-up version of #29272
## Purpose
This PR enhances the design of #28176, adopting the same memory layout as FullAttention while adding support for decode caching and speculative decoding.
The core idea of this Mamba Prefix-Caching implementation (referred to as LPC) is to directly cache Mamba states through block-aligned scheduling. This approach enables rapid support for Prefix-caching in Mamba models without modifications to the underlying kernel code. Furthermore, it ma…
-
#30443 [CI][torch nightlies] Use main Dockerfile with flags for nightly torch tests — documentation,ready,ci/build,ready-run-all-tests — by orionr (合并于: 2026-01-24 02:22 (UTC+8)) [💬19 | +203/-32, 4 files | commented:9 approved:1] Use standard Docker image instead of torch_nightly image for PyTorch nightlies testing and CI runs.
Moving this from https://github.com/vllm-project/ci-infra/pull/239 to a branch on upstream for testing purposes outlined at https://github.com/vllm-project/ci-infra?tab=readme-ov-file#how-to-test-changes-in-this-repo
Tests to confirm:
- Baseline (my `vllm` fork matching HEAD, no `ci-infra` changes) at https://buildkite.com/vllm/ci/builds/42874/steps/canvas. Allowed 5 test runs to move forward. …
-
#32885 [CI][Models] Add VLM Support for Sequence Classification Conversion — ready,v1 — by AndreasKaratzas (合并于: 2026-01-23 16:22 (UTC+8)) [💬4 | +155/-39, 3 files | commented:2 approved:1] This PR enables Vision-Language Models (VLMs) like Gemma 3 to be converted to sequence classifiers using the
`no_post_processing` and `from_2_way_softmax` methods. Additionally, it fixes two PyTorch compiler warnings that were causing noise in the logs.
## Changes
### 1. `vllm/model_executor/models/adapters.py`
- Added `_get_language_model_for_seq_cls()` helper function to correctly retrieve the inner language model component from VLMs
- Updated `load_weights_no_post_processing()` and `load_w…
-
#32397 [Model] Enable LoRA support for internvl2 — ready — by MatteoFari (合并于: 2026-01-24 01:39 (UTC+8)) [💬3 | +16/-3, 1 files | commented:1 approved:1]
## Purpose Enable dynamic LoRA adapters on InternVL2 vision tower + connector (part of #31479) by implementing:
- `get_num_mm_encoder_tokens()`
- `get_num_mm_connector_tokens()`
## Technical Details
InternVL2 has a CLS token and pixel-shuffle downsampling (downsample_ratio), so LM tokens (post-downsample) don’t match tower tokens (pre-downsample). These helpers map token counts correctly between LM and tower/connector. ...
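A back-of-the-envelope sketch of the token-count mapping that #32397 describes, assuming an InternVL-style 448-pixel image, patch size 14, a single CLS token, and downsample_ratio=0.5 (values are illustrative):
```python
# Tower tokens are counted pre-downsample (including CLS); LM tokens are what
# remains after pixel-shuffle downsampling drops the CLS and merges patches.
def tower_tokens(image_size: int, patch_size: int) -> int:
    side = image_size // patch_size
    return side * side + 1                       # patch tokens + 1 CLS token

def lm_tokens(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    side = image_size // patch_size
    side = int(side * downsample_ratio)          # pixel shuffle at ratio 0.5 merges 2x2 patches
    return side * side                           # CLS token is dropped before the connector

print(tower_tokens(448, 14))                     # 1025 tokens in the vision tower
print(lm_tokens(448, 14, 0.5))                   # 256 tokens fed to the LM
```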
-
#32796 [Misc] Log vLLM logo when starting server — frontend,ready — by njhill (合并于: 2026-01-23 11:15 (UTC+8)) [💬5 | +50/-4, 5 files | commented:3 approved:1] TBD if folks like this :)
- #32940 [torch.compile][CI] Add back attn fusion on hopper/ada — ready — by ProExpertProg (合并于: 2026-01-24 00:49 (UTC+8)) [+4/-5, 1 files | commented:1 approved:2] We were not running attention quant fusion because the unit tests were using llama4 as the model name, which is too big for a single GPU (but the model wasn’t actually used in the test apart from the name). Not technically a problem on B200, but we still don’t want to be skipping
-
#31059 [Frontend] add logprob, compression_rate to ‘verbose_json’ features — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ready,ci/build,v1 — by sangbumlikeagod (合并于: 2026-01-24 00:35 (UTC+8)) [💬4 | +32/-14, 4 files | commented:6 approved:1] ## Purpose
In an earlier PR I added verbose_json options to Whisper inference, but left out some fields such as avg_logprob and compression_ratio. This PR adds those fields to verbose_json segments!
https://github.com/vllm-project/vllm/pull/24209
## Test Plan
curl --location 'http://domain:port/v1/audio/transcriptions'
--header 'Authorization: ••••••'
… -
#32904 [Hardware][AMD][CI][Bugfix] Fix Kernels Attention Cache test — bug,rocm,ready — by mawong-amd (合并于: 2026-01-24 00:24 (UTC+8)) [+1/-1, 1 files | commented:1 approved:2] ## Purpose This PR fixes a failing test (
`kernels/attention/test_cache.py::test_reshape_and_cache_flash`) caused by https://github.com/vllm-project/vllm/pull/30141. The reference dequantization implementation used in the test assumes the FP8 data format is `e4m3fn`, but on AMD `gfx942`-series cards, the FP8 data type used is `e4m3fnuz` instead.
## Test Plan
Run `pytest -sv tests/kernels/attention/test_cache.py -k test_reshape_and_cache_flash` on a MI325X as part of…
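A minimal sketch of the platform-dependent FP8 format choice behind #32904 (the arch-prefix check below is an assumption for illustration, not vLLM's actual dispatch code):
```python
# AMD gfx94x-series devices use the "fnuz" FP8 variant; most other targets use
# the OCP e4m3fn format, which is what the test's reference path assumed.
import torch

def fp8_dtype_for(arch_name: str) -> torch.dtype:
    # e.g. torch.cuda.get_device_properties(0).gcnArchName -> "gfx942"
    if arch_name.startswith("gfx94"):
        return torch.float8_e4m3fnuz
    return torch.float8_e4m3fn
```
-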
#32939 [ROCm][PD] Remove unused moriio connector proxy code — documentation,rocm,ready,kv-connector — by markmc (合并于: 2026-01-23 23:59 (UTC+8)) [💬1 +0/-22, 1 files commented:1 approved:1] - send_request_to_decode() is not called anywhere
- extract_ip_port_fast() for P is called twice
- #32886 [Bugfix] Fix FP8 MoE EP Weight Loading for ModelOpt Llama4 — bug,ready,llama — by baonudesifeizhai (合并于: 2026-01-23 23:31 (UTC+8)) [💬1 | +21/-1, 1 files | approved:1 commented:3]
## Purpose
#32862
Add a version-guarded fallback in Llama4 MoE weight loading to avoid CPU FP8 indexing on older PyTorch releases. For torch < 2.11, weights are temporarily cast to FP16 for indexing and then cast back, preventing index_cpu errors while leaving newer versions unchanged.
## Test Plan
```
VLLM_DISABLED_KERNELS=FlashInferFP8ScaledMMLinearKernel
VLLM_USE_FLASHINFER_MOE_FP8=0
vllm bench throughput --model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 -tp=2 --enable-expert-parallel …
```
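A hedged sketch of the version-guarded fallback described in #32886 (illustrative, not the PR's exact code): on PyTorch older than 2.11, round-trip through FP16 just for the indexing step.
```python
# Older torch can fail when indexing FP8 weights on CPU, so cast to FP16 for
# the gather and cast back; newer torch indexes FP8 directly.
import torch

def index_fp8_weight(weight: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    major, minor = (int(v) for v in torch.__version__.split(".")[:2])
    if (major, minor) >= (2, 11):
        return weight[idx]
    return weight.to(torch.float16)[idx].to(weight.dtype)
```
-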
#32876 [CPU Backend][BugFix] Fix failing CPU MoE test — bug,ready — by fadara01 (合并于: 2026-01-23 20:06 (UTC+8)) [💬3 | +1/-0, 1 files | commented:1 approved:1] ## Purpose
Fixes: #32870
## Test Plan
Ran reproducer in #32870 locally
## Test Result
…
- #32867 [Misc] Postpone torch_profiler deprecation — ready — by NickLucche (合并于: 2026-01-23 22:39 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:2] I think we can safely postpone this deprecation schedule. Updating warning accordingly.
-
#32916 [Bugfix] Disable tma_aligned_scales in test_fusions_e2e — bug,ready — by xyang16 (合并于: 2026-01-23 22:34 (UTC+8)) [+9/-1, 3 files | commented:1 approved:1]
## Purpose
This PR is to disable tma_aligned_scales in test_fusions_e2e to unblock CI.
## Test Plan
`pytest -s -v tests/compile/distributed/test_fusions_e2e.py` …
-
#32933 [Bugfix] Fix getting vision features in Transformer Multimodal backend — bug,ready — by zucchini-nlp (合并于: 2026-01-23 21:34 (UTC+8)) [💬1 | +9/-0, 1 files | commented:2 approved:1] Makes sure that transformers multimodal backend keeps working after v5 release.
PR https://github.com/huggingface/transformers/pull/42564 changed the output of `self.model.get_image_features` to a `tuple | dict` format. Previously we expected the output to always be a single tensor, or a list of tensors for non-homogeneous image sizes. A simple check for whether the output is a `tuple` handles this.
The default output format currently depends on `model.config.return_dict`, so I added handling for both formats. cc @hmellor
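A minimal compatibility shim in the spirit of #32933 (the dict handling below is an assumption for illustration; the PR adds a similar container-type check):
```python
# Accept the old single-tensor/list return as well as the new tuple/dict return.
def extract_image_features(out):
    if isinstance(out, dict):
        return next(iter(out.values()))  # assumed: features stored as the first value
    if isinstance(out, tuple):
        return out[0]                    # new tuple format: first element holds features
    return out                           # old behavior: tensor or list of tensors
```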
-
#32815 [Feature]: Remove DtoH Copy for lfm2_vl On Default Stream — ready,v1 — by tianshu-Michael-yu (合并于: 2026-01-23 21:20 (UTC+8)) [+260/-158, 5 files | commented:10] Purpose This PR removes all CUDA Device-to-Host (DtoH) memcpy/syncs observed during lfm2_vl preprocess on the default compute stream.
Main changes:
- Run LFM2-VL’s SigLIP2 vision path fully packed (unpadded) end-to-end so we don’t trigger tiny syncs from CUDA nonzero / padding logic.
- Avoid host sync in the vision attention path by keeping max_seqlen on CPU (prevent .item()-style sync in the FA wrapper).
- Remove remaining preprocess DtoH in the ShortConv/Mamba metadata path by using CPU-side query_sta…
-
#32905 [Frontend][3/n] Make pooling entrypoints request schema consensus | EmbedRequest & ClassifyRequest — documentation,frontend,ready — by noooop (合并于: 2026-01-23 20:03 (UTC+8)) [💬2 | +326/-261, 11 files | commented:2 approved:1]
## Purpose
Split out the following RequestMixin
- EncodingRequestMixin
- EmbedRequestMixin
- ClassifyRequestMixin
address https://github.com/vllm-project/vllm/pull/31784#discussion_r2716425132 …
-
#32927 [Benchmark][Bugfix] Fix race condition when starting server for sweep benchmark — bug,performance,ready — by Isotr0py (合并于: 2026-01-23 20:11 (UTC+8)) [+37/-0, 2 files | commented:3 approved:1]
## Purpose
- Currently, there is a race condition where the bench command is executed too early, before the server becomes ready, for `vllm bench sweep serve`:
[BEGIN SERVER] Server overrides: {'mm_encoder_attn_backend': 'FLASH_ATTN'} Server command: ['vllm', 'serve', '/home/mozf/LLM/Qwen3-VL-4B-Instruct/', '--enforce-eager', '--max-model-len', '32768', '--mm-processor-cache-gb=0', '--media-io-kwargs.video.num_frames', '-1', '--mm-encoder-attn-backend', 'FLASH_ATTN'] [BEGI…
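One way to close this kind of race is to poll the server's health endpoint before launching the bench command; the helper below is a hypothetical sketch, not the benchmark's actual implementation:
```python
# Poll until the health endpoint responds, then return; raise if the server
# never becomes ready within the timeout.
import time
import urllib.error
import urllib.request

def wait_for_server(url: str = "http://localhost:8000/health",
                    timeout_s: float = 300.0) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(2)
    raise TimeoutError(f"server at {url} not ready after {timeout_s}s")
```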
-
#32861 [Voxtral] Add new streaming arch — ready,multi-modality,llama — by patrickvonplaten (合并于: 2026-01-23 19:41 (UTC+8)) [💬1 +768/-314, 9 files commented:9 approved:1] - Moves causal whisper logic into own file
- Adapt voxtral streaming arch to improved architecture
- Add tests that are skipped for now
-
#32907 [CI/Build][CPU] Fix failed pooling tests and macos smoke test — ready,cpu — by bigPYJ1151 (合并于: 2026-01-23 18:48 (UTC+8)) [💬1 | +7/-1, 2 files | commented:6 approved:1]
## Purpose
- Skip post weight processing for missing layers
- Exclude unsupported op for macos
## Test Plan
CI tests …
-
#32698 [Misc] Add
`get_name` to missing AttentionBackends — ready,v1 — by NickLucche (合并于: 2026-01-23 18:35 (UTC+8)) [💬5 | +28/-1, 6 files | commented:2 approved:1] There are a few attention backends that are missing this meta-method, most notably the Mamba ones. Although the `get_name()` descriptor isn’t a functional one, I would expect every backend to define a simple unique tag for identification, adhering to the new structure of AttentionBackends. I don’t have a strong opinion on how this should be carried out (I am also fine with the base class providing a default), but I tried adding a basic decorator util to make the process less verbose in terms of lines of…
- #32777 [Bugfix] Fix _CPU_MOE_ACT AssertionError when vLLM config not set — bug,ready,cpu — by karanb192 (合并于: 2026-01-23 16:22 (UTC+8)) [+26/-26, 2 files | commented:1 approved:1]
## Summary
- Fixes the `AssertionError: Current vLLM config is not set` in `cpu_fused_moe_torch` caused by `_LazyActivationDict` triggering `CustomOp.__init__()` during the forward pass
- Replaces `_LazyActivationDict` with direct function references (`_CPU_MOE_ACT_FN`) to avoid instantiating `CustomOp` classes
- Uses `SiluAndMul.forward_native` directly (it’s a `@staticmethod`) and a standalone `_swigluoai_forward_native` function for `SwigluOAIAndMul`
Fixes #32368
## Problem The `_LazyActivation…
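A simplified sketch of the idea behind #32777: map activation names straight to plain functions instead of instantiating `CustomOp` subclasses during the forward pass (the silu-and-mul math below mirrors `SiluAndMul.forward_native`, but the snippet is illustrative):
```python
# Plain functions can be looked up and called without constructing CustomOp
# instances, so no vLLM config context is needed at call time.
import torch
import torch.nn.functional as F

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Split the last dim in half: silu(gate) * up
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

_CPU_MOE_ACT_FN = {
    "silu": silu_and_mul,   # direct function reference, no CustomOp instantiation
}
```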
-
#32722 [CI] Fix mypy for
`vllm/v1/structured_output` — rocm,structured-output,ready,v1,deepseek — by yewentao256 (合并于: 2026-01-23 11:55 (UTC+8)) [💬1 | +32/-25, 18 files | commented:3 approved:1] ## Purpose
Part of https://github.com/vllm-project/vllm/issues/26533
## Test
At first:
`(yewentao256) [yewentao256@nma-h200-isolated-0-preserve vllm-source]$ pre-commit run --hook-stage manual mypy-3.10 -a` …
-
#32806 [torch.compile] Compile
`CustomOp.forward_native` for `SiluAndMul` and `QuantFP8` to avoid raw torch ops inside opaque custom ops — ready,torch.compile,ready-run-all-tests — by ProExpertProg (合并于: 2026-01-23 11:52 (UTC+8)) [💬2 | +52/-13, 7 files | commented:5 approved:2] ## Purpose
When `CustomOp.forward_native` is invoked from within another opaque torch custom op (e.g. `fused_moe`, `unified_attention`), it is hidden from the model-level torch.compile. In this case, executing `forward_native` directly executes eager PyTorch ops, which can significantly hurt performance.
In this PR, we compile forward_native manually to avoid this scenario. Existing `CustomOp` invocations visible to model-level compilation are not affected as `torch.compile` is ignored when nes…
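A toy sketch of the approach in #32806, assuming a plain function standing in for `forward_native`: compile it once so callers inside opaque custom ops don't fall back to raw eager ops.
```python
# Compile the native path separately, since the model-level torch.compile
# cannot see code executed inside an opaque custom op.
import torch

def silu_and_mul_native(x: torch.Tensor) -> torch.Tensor:
    d = x.shape[-1] // 2
    return torch.nn.functional.silu(x[..., :d]) * x[..., d:]

# Compiled once; callers inside custom ops use this instead of the eager function.
silu_and_mul_compiled = torch.compile(silu_and_mul_native)
```
-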
#32884 [BugFix] deepseek_v32_encoding: Replace asserts with proper exceptions — bug,ready,deepseek — by RishabhSaini (合并于: 2026-01-23 11:44 (UTC+8)) [💬2 | +39/-28, 1 files | commented:1 approved:1] Resolves: https://github.com/vllm-project/vllm/issues/32874
Replace validation asserts with ValueError and parsing asserts with RuntimeError to return 400 Bad Request instead of 500 Internal Server Error for invalid input
[关闭未合并 PR]
-
#31514 FP8 KV cache append optimized with precomputed inverse scale — rocm — by tfpre (关闭于: 2026-01-24 10:38 (UTC+8)) [💬4 | +193/-152, 3 files | commented:1]
## Purpose
FP8 KV cache append (`reshape_and_cache_flash`) runs every layer, every decode step. The FP8 quantization path was doing per-element division to convert FP16→FP8:
`// Before: division per element (expensive) __nv_cvt_float_to_fp8(half_to_float(a) / scale, ...)` …
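The optimization idea in #31514, sketched at the tensor level for illustration (the actual change is inside the CUDA kernel): precompute `1/scale` once and multiply per element instead of dividing.
```python
# Division per element vs. one reciprocal plus cheaper multiplies.
import torch

def quantize_fp8_div(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (x / scale).to(torch.float8_e4m3fn)          # divide every element

def quantize_fp8_inv(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    inv_scale = 1.0 / scale                              # computed once per layer
    return (x * inv_scale).to(torch.float8_e4m3fn)       # cheaper multiply per element
```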
-
#25824 [Kernel] Triton-based Top-k and Top-p sampler kernels — performance,needs-rebase,v1 — by cakeng (关闭于: 2026-01-24 08:58 (UTC+8)) [💬4 | +674/-1, 2 files | commented:5]
## Purpose The current method for top-k and top-p sampling utilizes a sorting-based approach on PyTorch operations, which is suboptimal.
This PR introduces pivot-search-based top-k and top-p sampling kernels using Triton, which achieves a speedup of up to 10 times compared to the native method.
Set the env variable “VLLM_USE_TRITON_SAMPLER=1” to enable the Triton-based top-k and top-p sampler.
## Test Plan …
-
#32970 [release] Fix upload release wheel script — ci/build — by khluu (关闭于: 2026-01-24 05:40 (UTC+8)) [+2/-2, 1 files commented:1]
- Cut out the extra `v` when comparing tagged version and expected version
- Use `rc[0-9]` instead of `rc` to not exclude `aaRCh64` wheels accidentally
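A quick Python illustration of the exclusion-pattern fix in #32970: a bare `rc` substring check also matches `aarch64` (…a-a-r-c-h…), while requiring a trailing digit only matches real release-candidate wheels.
```python
# Over-broad substring match vs. a pattern that requires a digit after "rc".
import re

wheels = [
    "vllm-0.14.0-cp312-cp312-manylinux_aarch64.whl",
    "vllm-0.14.0rc1-cp312-cp312-manylinux_x86_64.whl",
]

too_broad = [w for w in wheels if "rc" in w]                  # also drops the aarch64 wheel
precise = [w for w in wheels if re.search(r"rc[0-9]", w)]     # only the rc1 wheel matches

print(too_broad)  # both wheels match
print(precise)    # only the release-candidate wheel matches
```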
-
#31938 moe-offload-fixed — documentation,needs-rebase,ci/build,v1 — by wangyxbh (关闭于: 2026-01-24 02:40 (UTC+8)) [💬5 | +3275/-97, 28 files | commented:2] # PR: CPU Offload for Mixture-of-Experts (MoE) Inference in vLLM
## Summary
This PR proposes a CPU Offload Module for MoE inference in vLLM, enabling a large portion of expert weights and computation to be dynamically offloaded to the CPU while keeping only a small, hot subset of experts cached on GPU.
The design supports:
- Hybrid GPU–CPU execution
- Pinned-memory–based weight streaming
- Asynchronous GPU ↔ CPU interaction via callback …
-
#32147 Add descriptive error message for missing tools. — frontend,needs-rebase,gpt-oss,meta-exported,fb-exported — by madongfly (关闭于: 2026-01-24 02:13 (UTC+8)) [💬3 | +85/-8, 2 files | commented:1] Summary: Add descriptive error message for missing tools.
Test Plan: Unit test
Differential Revision: D90477073
[!NOTE] …
-
#32893 Fix/glm4 moe mla detection — v1,nvidia — by mgoin (关闭于: 2026-01-24 01:56 (UTC+8)) [💬2 | +75/-11, 4 files | commented:1 approved:1 | 📝草稿]
## Purpose
Validated that glm-4.7-flash works by default on B200
## Test Plan
## Test Result
…
-
#32759 [Bugfix] Add glm4_moe_lite to MLA model list to fix excessive KV cache memory usage — bug — by yhfgyyf (关闭于: 2026-01-24 01:50 (UTC+8)) [💬2 | +1/-0, 1 files | commented:2] ## Purpose
Add
`glm4_moe_lite` architecture to the MLA supported model list to enable MLA attention for GLM-4.7-Flash.
GLM-4.7-Flash uses the `glm4_moe_lite` architecture with MLA attention parameters (`kv_lora_rank=512`, `q_lora_rank=768`), but was not recognized by `is_deepseek_mla()`, preventing MLA attention from being enabled and causing excessive KV cache memory usage.
#27603 [ROCm][Docs] Update ROCm installation docs for ROCm 7.0 — documentation,rocm,needs-rebase — by Rohan138 (关闭于: 2026-01-24 01:43 (UTC+8)) [💬7 | +56/-55, 1 files | commented:10]
## Purpose
- Update docs in https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#amd-rocm for ROCm 7.0
- Add and document MI355 support
- Update standalone install instructions for PyTorch, Triton, Flash Attention, and AITER
## Test Plan
…
-
#27200 [Feature] Added Prefix Repetition with Random Lengths dataset — performance,stale — by susiejojo (关闭于: 2026-01-24 01:26 (UTC+8)) [💬4 | +169/-0, 1 files | commented:1] ## Purpose
Fixes issue: #27129
vLLM’s bench serve currently supports a PrefixRepetitionRandomDataset for generating prompts with fixed-length prefix and suffix lengths. It also has support for a RandomDataset which samples input and expected output lengths fr…
-
#32924 Create 52012 — 无标签 — by AKB0700 (关闭于: 2026-01-24 01:06 (UTC+8)) [💬2 | +1/-0, 1 files | commented:3]
## Purpose
## Test Plan
## Test Result
... -
#32748 [Misc] Add index url for torch==2.9.1+cpu — ready,needs-rebase,ci/build,cpu — by lk-chen (关闭于: 2026-01-24 00:39 (UTC+8)) [💬3 | +1/-1, 1 files | commented:2] Added extra index URL for CPU version of PyTorch.
Otherwise I hit “not found” error https://vllm-dev.slack.com/archives/C07QP347J4D/p1768948741022109
## Purpose
## Test Plan
…
-
#32906 [ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 — documentation,performance,new-model,rocm,structured-output,frontend,speculative-decoding,ci/build,v1,multi-modality — by monajafi-amd (关闭于: 2026-01-23 23:04 (UTC+8)) [💬2 | +16257/-10390, 438 files | commented:1 | 📝草稿] ## Purpose
Flash Attention’s CK backend only supports CDNA GPUs (gfx90a/gfx942/gfx950). On RDNA3/RDNA4, vLLM falls back to PyTorch SDPA for vision attention, which is significantly slower. Flash Attention 2.8.3+ includes a Triton backend that supports RDNA architectures.
## Changes
- Added `flash_attn_triton_available()` in `vllm/platforms/rocm.py` to detect RDNA GPUs and verify Triton backend availability
- Updated `get_vit_attn_backend()` to use Flash Attention when available on RDNA GPUs
- …
-
#32928 [MoE] Add tuned config for RTX PRO 6000 Blackwell Max-Q (E=64, N=1024) — 无标签 — by ErikDeBruijn (关闭于: 2026-01-23 18:39 (UTC+8)) [💬2 | +720/-0, 8 files | commented:1] ## Summary
Add FusedMoE kernel configurations for NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition covering the most popular MoE architectures.
### Configurations Added (8 files)
| E | N | Use Case |
|---|---|----------|
| 8 | 14336 | Mixtral 8x7B/8x22B |
| 8 | 7168 | Large 8-expert models |
…
-
#31478 Optimize Top-K Sigmoid Routing and QKNorm for MiniMax-M2/M2.1 — needs-rebase,ci/build — by rogeryoungh (关闭于: 2026-01-23 16:40 (UTC+8)) [💬1 | +906/-70, 12 files | commented:2]
## Purpose
This PR introduces two performance optimizations targeting MiniMax-M2 / M2.1 inference:
- Adds a `topk_sigmoid` CUDA kernel to support sigmoid-based expert routing used by MiniMax-M2. Previously, this was implemented via a workaround using grouped_topk with group_size=1, which incurs unnecessary overhead.
- Fuses QKNorm in tensor-parallel…
-
#32909 [CI][Pooling] Stabilize ModernBERT test — 无标签 — by AndreasKaratzas (关闭于: 2026-01-23 14:53 (UTC+8)) [💬2 | +27/-0, 1 files | commented:10] This PR attempts to stabilize the
`Language Models Test (Extended Pooling)`. It marks the test as flaky, enables rerunning it, and logs an informative message about the possibly inaccurate results, based on the conversation here: https://github.com/vllm-project/vllm/pull/32403#pullrequestreview-3695341925
## Related
- Addresses flaky test: `test_modernbert_models[float-disham993/electrical-ner-ModernBERT-base]`
- Related to #32403 which increased test tolerance from `atol=1.2e-2` to `atol=3…
-
#29046 [Feature]: Disable logging /metrics — frontend — by WorldExplored (关闭于: 2026-01-23 12:43 (UTC+8)) [💬5 +76/-0, 3 files commented:1]
- This PR adds a metrics-specific logging control to stop `/metrics` access-log spam while preserving normal request logs, as requested in #29023.
- It introduces a new frontend flag `disable_uvicorn_metrics_access_log` (CLI: `--disable-uvicorn-metrics-access-log`) alongside the existing `--disable-uvicorn-access-log`.
- When the new flag is `False`, behavior is unchanged and Uvicorn access logs are controlled by current global options.
…
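For reference, one way to suppress `/metrics` access-log lines is a plain logging filter; this is a hypothetical sketch, while the PR itself adds a dedicated CLI flag instead:
```python
# Drop uvicorn access-log records whose formatted message mentions /metrics,
# leaving all other request logs untouched.
import logging

class DropMetricsAccessLog(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        return "/metrics" not in record.getMessage()

logging.getLogger("uvicorn.access").addFilter(DropMetricsAccessLog())
```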