[vLLM GitHub Development Digest] 2025-12-18
[Overview]
- Time window: 2025-12-18 10:48 (UTC+8) – 2025-12-19 10:48 (UTC+8)
- New issues: 30 (label distribution: bug: 16, feature request: 4, ci-failure: 4, RFC: 2, installation: 1)
- Closed issues: 36
- New PRs: 51 (label distribution: v1: 17, documentation: 13, ready: 10, frontend: 7, nvidia: 7)
- Merged PRs: 50
- PRs closed without merging: 10
[New issues]
- #30999 [Feature] GLM45 tool parser: Stream tool name before full arguments — no label — by so2liu (created: 2025-12-19 10:35 (UTC+8)) ## Feature Request
### Problem Description
When using GLM-4.5 with tool calling (`--tool-call-parser glm45`), the current parser waits for the complete tool call structure before returning anything to the client. For long-running tool calls (e.g., generating articles with a `write_article` tool), this creates a poor user experience:
- Users see nothing during generation
- Users don't know what the model is doing
- No way to provide early feedback about which tool was selected …
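The requested behavior can be pictured with a minimal sketch (illustrative only, not vLLM's parser code; `stream_tool_call` and its input format are invented for the example): emit an OpenAI-style delta carrying the tool name as soon as it is parsed, then stream argument fragments as they arrive.

```python
# Illustrative sketch: stream the tool name first, then argument fragments.
import json

def stream_tool_call(chunks):
    """Yield OpenAI-style tool-call deltas from an incremental parser.

    `chunks` is assumed to be an iterator of (kind, value) pairs:
    ("name", str) once, followed by ("args", str) fragments.
    """
    for kind, value in chunks:
        if kind == "name":
            # First delta carries only the function name -> early feedback.
            yield {"function": {"name": value, "arguments": ""}}
        else:
            yield {"function": {"arguments": value}}

deltas = list(stream_tool_call([
    ("name", "write_article"),
    ("args", '{"topic": '),
    ("args", '"vLLM"}'),
]))
assert deltas[0]["function"]["name"] == "write_article"
assert json.loads("".join(d["function"]["arguments"] for d in deltas)) == {"topic": "vLLM"}
```

The client can display the tool name from the first delta while arguments are still being generated, which is exactly the early feedback the issue asks for.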
- #30996 [Bug]: vLLM 0.12.0 + LMCache outputs ERROR: PrometheusLogger instance already created with different metadata. — bug — by keivenchang (created: 2025-12-19 09:59 (UTC+8)) ### Your current environment
`System Info: OS : Ubuntu 24.04.2 LTS (x86_64); GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0; Clang version : 18.1.3 (1ubuntu1); CMake version : version 4.2.0` …
- #30956 [Feature]: could output the given format logger ? — feature request — by ucas010 (created: 2025-12-18 17:35 (UTC+8)) [💬5] ### 🚀 The feature, motivation and pitch
Hi, dear, I have defined the logger in a py script, e.g. logger_utils.py; could I run the command from the shell with that logger, such as `vllm serve qwen3-embedding-0.6b --logger_file logger_utils.py`? thx …
- #30995 [Bug]: Fused MoE errors without safe serialization — bug — by ojh31 (created: 2025-12-19 09:35 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... uv is set ... System Info ...` (truncated)
- #30959 [Installation]: AssertionError: CUDA_HOME is not set — installation — by shahasim (created: 2025-12-18 18:54 (UTC+8)) [💬1] ### Your current environment
Hi,
I am trying to build a Docker image for `vllm==0.12.0` and I am unable to build because the setup fails with `AssertionError: CUDA_HOME is not set`; there is a numpy error in the logs as well. I tried the chatbot and downgraded numpy, but it does not work. The same Dockerfile works if I install `v0.11.2`. Could I get some help with this issue? `STEP 14/19: RUN pip install "numpy<2.0.0" Collecting numpy<2.0.0 Downloading https://www.artifactreposit…`
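For context on this failure mode, here is a minimal sketch of how a build step typically resolves `CUDA_HOME`, plus the usual workaround of exporting it explicitly. This is not vLLM's actual build code; `find_cuda_home` is a hypothetical helper, and `/usr/local/cuda` is the conventional toolkit path assumed for the example.

```python
# Illustrative sketch: resolve CUDA_HOME from the environment, else infer it
# from the location of `nvcc` on PATH; the usual Dockerfile fix is to set
# ENV CUDA_HOME=/usr/local/cuda explicitly.
import os
import shutil

def find_cuda_home():
    """Return CUDA_HOME from the environment, else infer it from nvcc's path."""
    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    if cuda_home:
        return cuda_home
    nvcc = shutil.which("nvcc")
    if nvcc:
        # e.g. /usr/local/cuda/bin/nvcc -> /usr/local/cuda
        return os.path.dirname(os.path.dirname(nvcc))
    return None  # build aborts here with "CUDA_HOME is not set"

os.environ["CUDA_HOME"] = "/usr/local/cuda"  # the explicit-export workaround
assert find_cuda_home() == "/usr/local/cuda"
```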
- #30985 [RFC]: DRY Dependencies for Better Hardware-Specific vLLM Dependency Management — RFC — by wjhrdy (created: 2025-12-19 04:06 (UTC+8)) [💬2] ### Motivation.
vLLM currently maintains 16 separate `requirements/*.txt` files (cuda.txt, rocm.txt, tpu.txt, xpu.txt, cpu.txt, plus build/test variants) with no centralized specification for common dependencies. This creates three critical issues:
#### 1. No Single Source of Truth Creates CI Fragility
- Each hardware target has a divergent torch version:
  - CUDA: `torch==2.9.0`
  - XPU: `torch==2.9.0+xpu` with custom index URLs
  - TPU: completely different dependency …
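The "single source of truth" idea in this RFC can be sketched in a few lines: derive the dependency set shared by all hardware targets so only the deltas live in per-target files. The file names mirror vLLM's `requirements/` layout, but the pinned contents here are made up for the example.

```python
# Illustrative sketch: compute the common dependency set across per-hardware
# requirement lists, leaving only target-specific pins as overrides.
reqs = {
    "cuda.txt": {"torch==2.9.0", "numpy", "transformers"},
    "rocm.txt": {"torch==2.9.0+rocm", "numpy", "transformers"},
    "xpu.txt":  {"torch==2.9.0+xpu", "numpy", "transformers"},
}

common = set.intersection(*reqs.values())                 # -> shared file
deltas = {f: pins - common for f, pins in reqs.items()}   # -> per-target files

assert common == {"numpy", "transformers"}
assert deltas["cuda.txt"] == {"torch==2.9.0"}
```

With this split, bumping a shared dependency touches one file instead of sixteen, which is the CI-fragility point the RFC makes.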
- #30929 [CI Failure]: mi325_4: DeepSeek V2-Lite Async EPLB Accuracy — ci-failure — by AndreasKaratzas (created: 2025-12-18 13:32 (UTC+8)) [💬2] ### Name of failing test
`bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_async_eplb.sh 0.25 1319 8030`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30926 [CI Failure]: mi325_1: V1 Test entrypoints — ci-failure — by AndreasKaratzas (created: 2025-12-18 13:11 (UTC+8)) [💬2] ### Name of failing test
`pytest -v -s v1/entrypoints`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30928 [CI Failure]: mi325_1: LM Eval Small Models (1 Card) — ci-failure — by AndreasKaratzas (created: 2025-12-18 13:28 (UTC+8)) [💬2] ### Name of failing test
`pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30927 [CI Failure]: mi325_1: LM Eval Small Models — ci-failure — by AndreasKaratzas (created: 2025-12-18 13:20 (UTC+8)) [💬2] ### Name of failing test
`pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30989 [Bug]: CUTLASS BLOCK SCALE FP8 IMA on B200 — bug — by robertgshaw2-redhat (created: 2025-12-19 05:49 (UTC+8)) ### Your current environment
B200
### 🐛 Describe the bug
Currently, CUTLASS BLOCK SCALE FP8 is broken on main. There are two problems.
- A) it is impossible to use cutlass block scale fp8
…
- #30933 [Usage]: What is the latest instruction to run DeepSeek V3.2? — usage — by IKACE (created: 2025-12-18 14:18 (UTC+8)) [💬1] ### Your current environment
vLLM 0.12.0
### How would you like to use vllm
I am following the guidelines at https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html for running DeepSeek V3.2. Following the instructions, I installed vLLM 0.12.0 on my H200 node. However, when I try to run it with `vllm serve deepseek-ai/DeepSeek-V3.2 --tensor-parallel-size 8 --tokenizer-mode deepseek_v32`, it gives an error: `(APIServer pid=816209) ValueError: No tokenizer regis…`
- #30970 [Bug]: MPClient sends queue message but EngineCoreProc does not receive, causing timeout. — bug — by YuYun329 (created: 2025-12-18 22:54 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #30969 [Bug]: SmolLM3-3B FP8 models fail to load in v0.11.1 with "Unable to find matching target" error in compressed-tensors config — bug — by GauthierRoy (created: 2025-12-18 22:36 (UTC+8)) ### Your current environment
Running in the official Docker image vllm/vllm-openai:v0.11.1; GPU: NVIDIA L4 (GCP g2-standard-8); `| NVIDIA-SMI 570.195.03 Driver Version: 570.195.03 CUDA Version: 12.9 |`; vLLM version: 0.11.1 (truncated `python collect_env.py` output)
- #30964 [Bug]: semaphore leak for Qwen3-Next-80B-A3B-Instruct on long data inference — bug — by AIR-hl (created: 2025-12-18 20:50 (UTC+8)) ### Your current environment
`System Info: OS : Ubuntu 24.04.2 LTS (x86_64); GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0; Clang version : could not collect; CMake version : version 4.2.0` …
- #30939 [Bug]: The same input produces outputs of different lengths in the same batch. — bug — by sorenwu (created: 2025-12-18 15:29 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #30963 [Bug]: AssertionError: Failed to apply prompt replacement for mm_items['image'][0] — bug — by liaomingg (created: 2025-12-18 20:45 (UTC+8)) [💬1] ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #30958 [Bug]: Vllm Server stuck and automatically shutdown. — bug — by tzjtatata (created: 2025-12-18 18:34 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #30934 [Bug]: Qwen3-VL-235B-A22B-Thinking-FP8 not divisible by weight quantization block_n = 128. — bug — by serser (created: 2025-12-18 14:39 (UTC+8)) [💬2] ### Your current environment
I am trying to run the FP8 version on 8*A100 GPUs but I encountered `ValueError: The output_size of gate's and up's weight = 192 is not divisible by weight quantization block_n = 128.` `Collecting environment information...` (truncated)
- #30942 [Bug]: Scale Elastic EP is Pending on EngineCore — bug — by xeonliu (created: 2025-12-18 15:41 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #30943 [Bug]: memory leak (nvidia v100, mineru, dp*8) — bug — by ChrisKimZHT (created: 2025-12-18 15:44 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #30941 [Performance]: Why Does Latency Remain Unchanged in vLLM 0.11.0 When Input Token Count Decreases for qwen3-vl-30b-a3b? — performance — by Hormoney (created: 2025-12-18 15:40 (UTC+8)) ### Proposal to improve performance
No response
### Report of performance regression
No response
### Misc discussion on performance
…
- #30940 [Feature]: The issue of being unable to achieve streaming output in function call inference — feature request — by cluo991 (created: 2025-12-18 15:35 (UTC+8)) ### 🚀 The feature, motivation and pitch
The issue of being unable to achieve streaming output in function call inference
### Alternatives
No response
### Additional context
…
- #30938 [Feature]: Support NVIDIA ModelOpt FP8 PTQ variants (FP8_PER_CHANNEL_PER_TOKEN / FP8_PB_WO) in vLLM — feature request — by CedricHwong (created: 2025-12-18 15:11 (UTC+8)) ### 🚀 The feature, motivation and pitch
### Current behavior
vLLM's built-in modelopt quantization support only recognizes a limited set of ModelOpt checkpoint formats (e.g., quant_algo: "FP8" and NVFP4). When loading newer ModelOpt FP8 PTQ exports such as:
- FP8_PER_CHANNEL_PER_TOKEN (per-channel weight scale + per-token dynamic activation quantization)
- fp8_pb_wo / FP8_PB_WO (block-scaled FP8 weight-only with 4D block scales)
vLLM fails early during config parsing / quantization...
- #30931 [Bug]: Prefix Cache Corruption with LoRA with the same name but different id — bug — by Butanium (created: 2025-12-18 14:01 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... uv is set ... System Info ...` (truncated)
- #30930 [Draft] [RFC]: vLLM + UCC Backend — RFC — by lengrongfu (created: 2025-12-18 13:45 (UTC+8)) ### Motivation.
Integrating UCC (Unified Collective Communication, https://github.com/openucx/ucc) into vLLM as an optional communication backend.
### Strengthening vLLM's Backend Abstraction
- UCC supports a wide range of hardware backends,
- It decouples vLLM's distributed engine from any single communication API.
- It allows developers to register new communication backends through UCC's standardized API.
- This modularity supports future frameworks, accelerators, or transport technologies w…
- #30922 [Bug]: Use the offical doucment vllm online method deploy DeepSeek-OCR,the result is very bad . but I ust the offline method the result is normal. why ? — bug — by git-liweichao (created: 2025-12-18 12:08 (UTC+8)) [💬1] ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #30923 [Bug]: Use the offical doucment vllm online method deploy DeepSeek-OCR,the result is very bad . but I ust the offline method the result is normal. why ? — bug — by git-liweichao (created: 2025-12-18 12:14 (UTC+8)) ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #30921 [Feature]: Models with different quantization for different layers/blocks — feature request — by Datta0 (created: 2025-12-18 11:37 (UTC+8)) ### 🚀 The feature, motivation and pitch
We at Unsloth have dynamically quantized bnb 4-bit models, which we do to preserve accuracy, keeping the important layers in mind: we leave them in 16-bit while quantizing the rest to 4-bit.
For example, if you see unsloth/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit, all of the vision layers are left unquantized, while having only a [few language la…
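The per-layer scheme described above can be sketched as a pattern-to-scheme map (the keys, the `"bnb-4bit"` label, and the `scheme_for` helper are hypothetical, not vLLM's or Unsloth's config schema): sensitive modules map to no quantization, everything else to 4-bit.

```python
# Illustrative sketch: route each layer name to a quantization scheme via
# glob patterns; None means "leave in 16-bit".
import fnmatch

layer_quant = {
    "visual.*": None,              # vision tower stays unquantized
    "model.layers.*": "bnb-4bit",  # language layers quantized to 4-bit
}

def scheme_for(layer_name):
    """Return the first matching scheme, or None (unquantized) by default."""
    for pattern, scheme in layer_quant.items():
        if fnmatch.fnmatch(layer_name, pattern):
            return scheme
    return None

assert scheme_for("visual.blocks.0.attn") is None
assert scheme_for("model.layers.10.mlp") == "bnb-4bit"
```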
- #30919 [Bug]: GPT-OSS-120B NotImplementedError: FlashInfer backend currently does not support attention sinks — bug — by shyeh25 (created: 2025-12-18 11:16 (UTC+8)) [💬1] ### Your current environment
(collect_env.py output not provided; issue template placeholder)
[Closed issues]
- #3583 Supporting RWKV models — feature request, stale — by RonanKMcGovern (closed: 2025-12-19 10:14 (UTC+8)) [💬13] ### 🚀 The feature, motivation and pitch
Linear attention allows for longer context and faster inference. The eagle model has a 2T checkpoint soon.
### Alternatives
NA
### Additional context
…
- #11655 [Feature]: Support Inflight quantization: load as 8bit quantization. — feature request, stale — by ShelterWFF (closed: 2025-12-19 10:14 (UTC+8)) [💬24] ### 🚀 The feature, motivation and pitch
vLLM supports 4-bit in-flight quantization but does not support 8-bit; 8-bit is faster than 4-bit, so please add support for it.
### Alternatives
No response
### Additional context …
- #13047 [Bug]: undefined symbol: `_ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE` when running `0.7.3.dev57+g2ae88905.precompiled` on A100 — bug, stale — by imangohari1 (closed: 2025-12-19 10:14 (UTC+8)) [💬15] ### Your current environment
The output of `python collect_env.py`: `INFO 02-10 17:07:03 __init__.py:190] Automatically detected platform cuda. Collecting environment information... PyTorch version: 2.5.1+cu124 Is debug build: False ...` (truncated)
- #14193 [Bug]: when I use disaggregated_prefill, if I don't input anything, KV receiving thread will report time out — bug, stale — by mengli0 (closed: 2025-12-19 10:14 (UTC+8)) [💬9] ### Your current environment
PyTorch version: 2.5.1+cu124; Is debug build: False; CUDA used to build PyTorch: 12.4; ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64); GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0; Clang version: could not collect …
- #16467 [Bug]: Grammar error: Pointer '/$defs/xxxxx' does not exist — bug, stale — by ItzAmirreza (closed: 2025-12-19 10:14 (UTC+8)) [💬8] ### Your current environment
The output of `python collect_env.py`: `root@c2069cb9db81:/vllm-workspace# python3 collect_env.py INFO 04-11 01:50:21 [__init__.py:239] Automatically detected platform cuda. Collecting environment information... PyTorch version: 2.6.0+cu124 ...` (truncated)
- #17618 [Bug]: Engine Core initialization failed. See root cause above — bug, stale — by darkness8i8 (closed: 2025-12-19 10:14 (UTC+8)) [💬44] ### Your current environment
Colab notebooks, A100
### 🐛 Describe the bug
I have no idea what's wrong. This model works with a normal pipeline but fails with vLLM. It was saved to 16-bit and built off Unsloth/Llama3.1 8b Instruct.
This works -> from transformers import pipeline …
- #17813 [Bug]: 0.8.5 post1 cuda error — bug, stale — by gengchaogit (closed: 2025-12-19 10:14 (UTC+8)) [💬18] ### Your current environment
env: 5070ti 16G + 5080TI 16G; model: Qwen3-32B-AWQ; command: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True VLLM_USE_TRITON_FLASH_ATTN=1 vllm serve xunlei_model/Qwen3-32B-AWQ --port 8099 --max-model-len 25000 --served-model-name vllm --enforce_eager --tensor-parallel-size 2 --gpu_memory_utilization 0.95`
A few days ago I installed a relatively new vLLM, which allowed me to run tensor parallelism across two graphics cards. A lot of concurrent tests were done at that time …
- #18231 [Bug]: SamplingParams() use_beam_search error — bug, stale — by michellewheatleyxsite (closed: 2025-12-19 10:14 (UTC+8)) [💬7] ### Your current environment
(collect_env.py output not provided) I am using the following packages/versions: ...
- #18489 [Bug]: SharedStorageConnector does not load KV cache — bug, stale — by KoyamaSohei (closed: 2025-12-19 10:14 (UTC+8)) [💬9] ### Your current environment
The output of `python collect_env.py`: `System Info: OS : Ubuntu 24.04.2 LTS (aarch64) ...` (truncated)
- #18572 [Bug]: ValueError: Attempted to assign 25 + 25 + 25 = 75 multimodal tokens to 147 placeholders — bug, stale — by Mypainismorethanyours (closed: 2025-12-19 10:14 (UTC+8)) [💬10] ### Your current environment
vllm==0.7.3 transformers==4.49.0
### 🐛 Describe the bug
`import os; os.environ["HF_HOME"] = "/gz-data/.cache/huggingface"` …
- #18725 [Performance]: Unexpected: B200 GPU Performance Similar to H200 for Qwen/QwQ-32B, Expected B200 to be Significantly Faster — performance, stale — by awterman (closed: 2025-12-19 10:14 (UTC+8)) [💬7] ### Proposal to improve performance
We are observing that the B200 GPU performs similarly to the H200 GPU when running inference with the `Qwen/QwQ-32B` model using vLLM. We expect the B200 to have significantly better performance.
Hardware Information:
- CPU: 192 x vCPU
- Memory: 1585 GB
Benchmark Script: …
- #18730 [Bug]: RuntimeError: Engine core initialization failed. — bug, stale — by YasinFu (closed: 2025-12-19 10:14 (UTC+8)) [💬8] ### Your current environment
The output of `python collect_env.py`: `(vllm) llm@aitt:/data_a/llm$ python collect_env.py INFO 05-23 10:07:58 [__init__.py:239] Automatically detected platform cuda. Collecting environment information... ...` (truncated)
- #23072 [Bug]: eagle3 draft model len > 2048 will be broken — bug, stale — by fan-niu (closed: 2025-12-19 10:13 (UTC+8)) [💬2] ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #23074 [Bug]: Can the thinking mode of qwen3 be used simultaneously with the guided_json of vllm — bug, stale — by ZYJ-3721 (closed: 2025-12-19 10:13 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #25287 [Bug]: CrashLoopBackOff on AMD GPU (MI300x) with rocm/vllm Container on DigitalOcean Kubernetes — bug, rocm — by kev-pebble (closed: 2025-12-19 09:39 (UTC+8)) [💬6] Attempting to deploy the rocm/vllm container on a DigitalOcean Kubernetes (DOKS) cluster with AMD MI300x GPUs and encountering a persistent CrashLoopBackOff error. We are currently unable to get application logs due to a platform-level issue, but we have isolated the problem to the vllm serve command itself.
Environment: * Platform: DigitalOcean Kubernetes (DOKS) * Kubernetes Version: v1.33.1 * GPU Node: DigitalOcean gpu-mi300x1-192gb-devcloud (AMD MI300x) * Container Image: doc…
- #30843 [Bug]: vllm serve GLM-4.6V-Flash error — bug — by F0undLinks (closed: 2025-12-19 09:29 (UTC+8)) [💬7] ### Your current environment
docker image: vllm/vllm-openai:nightly (sha256:b40b770900bfb2b4a66bc04e888141830e20fd732c79e07ab3e3d6186d0ed437)
vllm version: 0.13.0rc2.dev118+g29f7d9771; transformers version: 4.57.3
### 🐛 Describe the bug
…
- #29520 [CI Failure]: mi325_1: Multi-Modal Models Test (Standard) — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:04 (UTC+8)) [💬6] ### Name of failing test
`pip install git+https://github.com/TIGER-AI-Lab/Mantis.git && pip freeze | grep -E 'torch' && pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing && cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug i…
- #30929 [CI Failure]: mi325_4: DeepSeek V2-Lite Async EPLB Accuracy — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:03 (UTC+8)) [💬2] ### Name of failing test
`bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_async_eplb.sh 0.25 1319 8030`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29519 [CI Failure]: mi325_1: Examples Test — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:02 (UTC+8)) [💬2] ### Name of failing test
`pip install tensorizer && python3 offline_inference/basic/generate.py --model facebook/opt-125m && python3 offline_inference/basic/generate.py --model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10 && python3 offline_inference/basic/chat.py && python3 offline_inference/prefix_caching.py && python3 offline_inference/llm_engine_example.py && python3 offline_inference/audio_language.py --seed 0 && python3 offline_inference/vision_language.py --seed 0 && python3 offlin…
- #29538 [CI Failure]: mi325_8: Kernels Quantization Test %N — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:01 (UTC+8)) [💬1] ### Name of failing test
`pytest -v -s kernels/quantization --shard-id=$BUILDKITE_PARALLEL_JOB --num-shards=$BUILDKITE_PARALLEL_JOB_COUNT`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29468 [CI Failure]: mi325_1: Basic Models Tests (Other) — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:00 (UTC+8)) [💬4] ### Name of failing test
`pytest -v -s models/test_transformers.py models/test_registry.py`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30926 [CI Failure]: mi325_1: V1 Test entrypoints — ci-failure — by AndreasKaratzas (closed: 2025-12-19 07:00 (UTC+8)) [💬2] ### Name of failing test
`pytest -v -s v1/entrypoints`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29526 [CI Failure]: mi325_1: Entrypoints Integration Test (Pooling) — ci-failure — by AndreasKaratzas (closed: 2025-12-19 06:59 (UTC+8)) [💬5] ### Name of failing test
`export VLLM_WORKER_MULTIPROC_METHOD=spawn && pytest -v -s entrypoints/pooling`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29534 [CI Failure]: mi325_8: LoRA Test %N — ci-failure — by AndreasKaratzas (closed: 2025-12-19 06:59 (UTC+8)) [💬3] ### Name of failing test
`pytest -v -s lora --shard-id=$BUILDKITE_PARALLEL_JOB --num-shards=$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_llm_with_multi_loras.py --ignore=lora/test_olmoe_tp.py --ignore=lora/test_deepseekv2_tp.py --ignore=lora/test_gptoss_tp.py --ignore=lora/test_qwen3moe_tp.py`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30928 [CI Failure]: mi325_1: LM Eval Small Models (1 Card) — ci-failure — by AndreasKaratzas (closed: 2025-12-19 06:59 (UTC+8)) [💬2] ### Name of failing test
`pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30927 [CI Failure]: mi325_1: LM Eval Small Models — ci-failure — by AndreasKaratzas (closed: 2025-12-19 06:58 (UTC+8)) [💬2] ### Name of failing test
`pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30939 [Bug]: The same input produces outputs of different lengths in the same batch. — bug — by sorenwu (closed: 2025-12-18 21:14 (UTC+8)) ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #25718 [Bug][ROCm]: large max_num_seqs hurts performance on AMD — bug, rocm — by draftbk (closed: 2025-12-18 16:54 (UTC+8)) [💬7] ### Your current environment
The output of `python collect_env.py`: `Collecting environment information... System Info ...` (truncated)
- #22245 [Bug]: VLLM_ROCM_USE_AITER=1 hit device_gemm with the specified compilation parameters does not support this GEMM problem for Qwen3-235B-A22B — bug, rocm, stale — by billishyahao (closed: 2025-12-18 16:27 (UTC+8)) [💬8] ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #30934 [Bug]: Qwen3-VL-235B-A22B-Thinking-FP8 not divisible by weight quantization block_n = 128. — bug — by serser (closed: 2025-12-18 16:15 (UTC+8)) [💬2] ### Your current environment
I am trying to run the FP8 version on 8*A100 GPUs but I encountered `ValueError: The output_size of gate's and up's weight = 192 is not divisible by weight quantization block_n = 128.` `Collecting environment information...` (truncated)
- #21504 [RFC] [ROCm] [AITER]: Propose a `_aiter_ops` class like `_custom_ops` and `_ipex_ops` — rocm, RFC, unstale — by tjtanaa (closed: 2025-12-18 14:55 (UTC+8)) [💬5] ### Motivation.
This RFC proposes the creation of a new module, `vllm/_aiter_ops.py`, to centralize the management, conditional loading, and registration of AITER kernels for ROCm. This new module will be analogous to the existing `_custom_ops.py` and `_ipex_ops.py`, providing a single, authoritative source for all AITER-related operations. This change will improve code organization, prevent circular dependencies, simplify the developer experience, and streamline testing. As the integration of…
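The centralization pattern #21504 describes can be sketched in miniature (this is illustrative, not the actual vLLM module; `register_op`, `get_op`, and `AITER_AVAILABLE` are invented names): ops register themselves conditionally, and call sites resolve kernels through one lookup with a generic fallback.

```python
# Illustrative sketch: one registry for conditionally available backend ops,
# so call sites never import the backend (here, AITER) directly.
_REGISTRY = {}

def register_op(name, available):
    """Register the decorated fn under `name` only if the backend is usable."""
    def deco(fn):
        if available:
            _REGISTRY[name] = fn
        return fn
    return deco

AITER_AVAILABLE = False  # in real code: probe for a working ROCm/AITER install

@register_op("rms_norm", available=AITER_AVAILABLE)
def _aiter_rms_norm(x):
    raise NotImplementedError  # would dispatch to the AITER kernel

def get_op(name, fallback):
    """Resolve the registered kernel, falling back to the generic one."""
    return _REGISTRY.get(name, fallback)

generic = lambda x: x  # stand-in for the default implementation
assert get_op("rms_norm", generic) is generic  # AITER unavailable -> fallback
```

Keeping registration in one module is what prevents the circular imports and scattered `if aiter:` checks the RFC complains about.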
- #25102 [Bug]: Crash when using CUTLASS_MLA on B200 with flashinfer 0.3.1: `UnboundLocalError: cannot access local variable 'returned_lse' where it is not associated with a value` — bug, stale — by smarterclayton (closed: 2025-12-18 13:56 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` from the decode worker: `Collecting environment information... System Info ...` (truncated)
- #29522 [CI Failure]: mi325_1: PyTorch Fullgraph Test — ci-failure — by AndreasKaratzas (closed: 2025-12-18 13:35 (UTC+8)) [💬2] ### Name of failing test
`pytest -v -s compile/fullgraph/test_full_graph.py -k 'not test_fp8_kv_scale_compile' && pytest -v -s compile/distributed/test_fusions_e2e.py -k 'TRITON and not +quant_fp8 and not Llama-4'`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29443 [CI Failure]: mi325_1: Python-only Installation Test — ci-failure — by AndreasKaratzas (closed: 2025-12-18 13:25 (UTC+8)) [💬5] ### Name of failing test
`bash standalone_tests/python_only_compile.sh`
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30923 [Bug]: Use the offical doucment vllm online method deploy DeepSeek-OCR,the result is very bad . but I ust the offline method the result is normal. why ? — bug — by git-liweichao (closed: 2025-12-18 12:25 (UTC+8)) ### Your current environment
(collect_env.py output not provided; issue template placeholder)
- #30834 [Bug]: vllm0.12.0 h100 PTX was compiled with an unsupported toolchain — bug — by Qingyuncookie (closed: 2025-12-18 11:54 (UTC+8)) [💬4] ### Your current environment
System Info …
[New PRs]
- #30944 [XPU] enable fp8 online streaming quantization — no label — by yma11 (created: 2025-12-18 15:44 (UTC+8)) [💬1 | +32/-107, 2 files | commented: 4] ## Purpose
This PR enables fp8 online streaming quantization on the XPU path for other linear and MoE layers.
## Test Plan
`VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen3-30B-A3B --enforce-eager --dtype=float16 --max-model-len=4096 --quantization=fp8 -tp=4`
## Test Result
`Processed prompts: 100%|██████████████████| 4/4 [00:00<00:00, 4.21it/s, est. speed input: 23.18 toks/s, output: 67.43 toks/s]` …
- #30998 [Perf][ROCm][AWQ] Improve performance of fused MoE GPTQ-AWQ and AWQ dequant kernels — rocm — by yuttian1 (created: 2025-12-19 10:27 (UTC+8)) [💬1 | +206/-356, 3 files | commented: 2] Summary
This PR optimizes and tunes the following AWQ-related kernels on AMD ROCm platforms:
- fused_moe_kernel_gptq_awq (optimized and tuned by cuichang@amd.com)
- awq_dequant
The changes focus on reducing memory traffic, improving arithmetic efficiency, and updating tuned configurations for better performance on large-scale MoE workloads (e.g. Qwen3-235B-VL-AWQ).
…
- #30997 Add Molmo2 multimodal model support — documentation, new-model, multi-modality — by sangho-vision (created: 2025-12-19 10:22 (UTC+8)) [💬2 | +3044/-1, 7 files | commented: 2]
## Purpose
This PR adds support for Molmo2 to vLLM. Molmo2 is a family of open vision-language models developed by the Allen Institute for AI (Ai2) that support image, video and multi-image understanding and grounding.
This change:
- Introduces a new Molmo2 model implementation for vLLM
- Registers Molmo2 in the multimodal model registry …
- #30965 [Perf][ROCm][AWQ] Improve performance of fused MoE GPTQ-AWQ and AWQ dequant kernels — rocm, needs-rebase — by yuttian1 (created: 2025-12-18 21:05 (UTC+8)) [💬1 | +206/-356, 3 files | commented: 2] Summary
This PR optimizes and tunes the following AWQ-related kernels on AMD ROCm platforms:
- fused_moe_kernel_gptq_awq (optimized and tuned by cuichang@amd.com)
- awq_dequant
The changes focus on reducing memory traffic, improving arithmetic efficiency, and updating tuned configurations for better performance on large-scale MoE workloads (e.g. Qwen3-235B-VL-AWQ).
…
- #30957 [Feature]: Support NVIDIA ModelOpt HF FP8 variants FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO in vLLM — documentation, frontend, nvidia — by CedricHwong (created: 2025-12-18 17:46 (UTC+8)) [💬3 | +435/-16, 6 files | commented: 4]
## Purpose
This PR adds support for two NVIDIA ModelOpt-exported FP8 HuggingFace checkpoint variants that currently cannot be loaded by vLLM:
- quant_algo=FP8_PER_CHANNEL_PER_TOKEN (aka hf_fp8_pc_pt)
- quant_algo=FP8_PB_WO (ModelOpt may emit fp8_pb_wo, handled case-insensitively)
It keeps strict quant_algo matching (only known ModelOpt algos map to quantization="modelopt") to avoid ...
- #30967 [Misc] Remove unused custom ops `copy_blocks` and `copy_blocks_mla` — kernel, ready — by lengrongfu (created: 2025-12-18 22:02 (UTC+8)) [+0/-286, 6 files | commented: 1] ## Purpose
Found that some defined ops are no longer in use, so they should be removed.
## Test Plan
## Test Result
...
- #30991 [ROCm][CI/Build] Update ROCm dockerfiles — rocm, ci/build — by gshtras (created: 2025-12-19 06:50 (UTC+8)) [+11/-5, 2 files | commented: 1, approved: 1]
Dockerfile.rocm: override the torch profiler setting to allow full traces; write extra info to versions.txt
Dockerfile.rocm_base: revert to ROCm 7.0 due to a performance regression in 7.1 that is being investigated internally; update the torch commit to the ROCm 2.9.0 release; update the triton commit to match the new torch commit; update AITER to a more recent commit
- #30953 Hotfix cpu container build — documentation, ci/build, cpu — by maryamtahhan (created: 2025-12-18 17:03 (UTC+8)) [💬3 | +125/-11, 3 files | commented: 3] ## Purpose
This PR improves the CPU Docker build process for better compatibility with Podman and adds build configuration tracking through OCI labels.
Changes:
- Podman Compatibility: Replace bind mounts with COPY for requirements files in `docker/Dockerfile.cpu` to improve compatibility with Podman
- OCI Labels: Add standard OCI image labels and custom build configuration labels to track CPU instruction set flags (AVX2, AVX512, etc.) in built images
- **Cross-platform Build Sup…
- #30974 [Bugfix] Fix incorrect tiles creation for mm prefix triton attention — ready — by Isotr0py (created: 2025-12-19 01:07 (UTC+8)) [+10/-4, 1 file | commented: 3]
## Purpose
- Actually, #30386's implementation is still incorrect, because the tiles are still selected casually, so the hidden_states is not fully converged.
- This PR fixes the issue to make sure all valid tiles are correctly selected for computation
## Test Plan
`python examples/offline_inference/vision_language.py -m gemma3 --num-prompts 1` …
-
#30994 [WIP] Improve cpu Benchmark Suite tests for 0.12.0 — performance,ci/build,cpu — by louie-tsai (created: 2025-12-19 08:25 (UTC+8)) [+76/-21, 1 files | commented:1] ## Purpose
Improve CPU benchmark tests for 0.12.0
## Test Plan
## Test Result
... -
#30987 Eagle 3 fix sometimes when you don't set architectures you get "Model architectures ['EagleLlamaModel'] are not supported for now." — new-model,llama — by aidando73 (created: 2025-12-19 04:11 (UTC+8)) [💬2 | +2/-0, 1 files | commented:2 changes:1]
## Purpose
If you don't set the model architecture in the config, I think transformers pulls out Eagle3LlamaModel or something similar, which isn't mapped in vLLM.
## Test Plan
You can repro with this config:
…
-
#30993 Bump Flashinfer to v0.6.0rc1 — ci/build,nvidia — by elvischenv (created: 2025-12-19 07:50 (UTC+8)) [+6/-67, 9 files | commented:1]
## Purpose Bump Flashinfer to v0.6.0rc1. API change: argument `tile_tokens_dim` has been removed from all TRTLLM MoE kernels.
## Test Plan
## Test Result
…
-
#30992 [Misc] Remove deprecated metric vllm:time_per_output_token_seconds for v0.13 release — v1 — by jliu9515 (created: 2025-12-19 07:18 (UTC+8)) [+0/-39, 1 files | commented:2] This metric was deprecated in v0.11 and renamed to vllm:inter_token_latency_seconds. The TODO comment indicated it should be removed in v0.13.0.
Follows up on #30396 which removed other deprecated items for v0.13.
## Purpose
- This metric was deprecated in v0.11 and renamed to `vllm:inter_token_latency_seconds`
- The TODO comment indicated it should be removed in v0.13.0
- Follows up on #30396 which removed other deprecated items for v0.13
##…
-
#30973 [Bugfix] Remove `tile_size=64` for mm_prefix triton attention — ready — by Isotr0py (created: 2025-12-19 00:18 (UTC+8)) [+0/-7, 1 files | commented:1 approved:2]
## Purpose
- Fix https://github.com/vllm-project/vllm/pull/30386#discussion_r2631541204
cc @lucianommartins
## Test Plan
## Test Result …
- #30990 [MoE Refactor][3/N] Deprecate cutlass block quant fp8 (b200) — ready,nvidia — by robertgshaw2-redhat (created: 2025-12-19 06:31 (UTC+8)) [💬3 | +3/-685, 7 files | commented:2 approved:2]
## Purpose
- per https://github.com/vllm-project/vllm/issues/30989, CUTLASS Block quant FP8 does not run on main. When modifying vLLM to force it to run, it gets an IMA (illegal memory access) immediately
- per discussion with NVIDIA, FlashInfer kernels are better for TP DSR1 anyways
- rather than fixing the IMA, we will just remove this kernel
- remove to simplify code
## Test Plan
- ci to ensure nothing broke
## Test Result …
-
#30979 [MoE Refactor] Add mk for cutlass fp8 block — v1,nvidia — by robertgshaw2-redhat (created: 2025-12-19 02:15 (UTC+8)) [+191/-32, 5 files | commented:1 | draft] NOTE: #30989 [Bug]: CUTLASS BLOCK SCALE FP8 IMA on B200
The kernel is broken on main
-
#30983 Check for truthy `rope_parameters` not the existence of it — bug,ready — by hmellor (created: 2025-12-19 03:21 (UTC+8)) [+5/-5, 1 files | commented:1 approved:1] Fixes the following situation:
- Model explicitly sets `rope_parameters=None` (new name, Transformers v5) or `rope_scaling=None` (old name, Transformers v4)
- User is using Transformers v4 with vLLM
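The difference between the old existence check and the new truthiness check can be sketched as follows (hypothetical minimal config object for illustration; not vLLM's actual code):

```python
class Config:
    # Model explicitly sets the new-name key to None (Transformers v5 style)
    rope_parameters = None

config = Config()

# Existence check: True even though the value is None, so the wrong branch runs
exists = hasattr(config, "rope_parameters")

# Truthiness check: correctly skips rope setup when the value is None/empty
truthy = bool(getattr(config, "rope_parameters", None))

print(exists, truthy)  # True False
```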
-
#30984 Grid construction based on num_active_lora and support CUDA graph capture across various num_active_lora — v1,nvidia — by yugong333 (created: 2025-12-19 03:53 (UTC+8)) [+201/-41, 9 files | commented:2]
## Purpose
Reduce overhead of idle kernel launch due to max-loras in grid construction
## Test Plan
## Test Result
…
-
#30988 [P/D] Add utility to slow down server for decode benchmarking — frontend,v1,kv-connector — by hjjq (created: 2025-12-19 05:10 (UTC+8)) [+82/-0, 10 files | commented:1 | draft] https://github.com/vllm-project/vllm/pull/25986 works but does not support real data. This feature allows benchmarking with real data by slowing down the decode node to wait until enough requests are completed and sent from the prefill node.
Usage:
`VLLM_SERVER_DEV_MODE=1 vllm serve ... --slowdown-threshold N` where N is the number of requests at which slowdown will be turned off. If the number of scheduled requests is less than N, the scheduler will sleep 10s for each step. This threshold can… -
#30986 Eagle 3 fix sometimes when you don't set architectures you get "Model architectures ['EagleLlamaModel'] are not supported for now." — documentation,performance,new-model,rocm,frontend,tpu,ci/build,v1,tool-calling,llama — by aidando73 (created: 2025-12-19 04:08 (UTC+8)) [💬3 | +20070/-1312, 118 files | commented:2]
## Purpose
As per title
## Test Plan
## Test Result
…
-
#30980 [Do not merge][Async] Asynchronous DP coordination — v1 — by MatthewBonanni (created: 2025-12-19 02:40 (UTC+8)) [💬1 | +253/-42, 2 files | commented:2 | draft] ## Purpose When using async scheduling with MTP and DP, `_get_valid_sampled_token_count` causes a sync point. Following this sync point, DP coordination causes a GPU bubble. This PR creates and employs an asynchronous replacement for `coordinate_batch_across_dp`, allowing this communication to overlap with execution.
## Test Plan
```
vllm serve deepseek-ai/DeepSeek-R1
-dp 8
--enable-expert-parallel
--no-enable-prefix-caching
--trust-remote-code
…
```
- #30982 Update PyTorch version update docs — documentation — by atalman (created: 2025-12-19 03:18 (UTC+8)) [💬1 | +9/-14, 1 files | commented:1] This updates the PyTorch runbook
- #30981 blackwell — frontend,v1 — by dtunikov (created: 2025-12-19 02:49 (UTC+8)) [💬2 | +80/-0, 6 files | commented:2] enforce
-
#30935 [XPU] allow custom workers (e.g. vllm-omni workers) to be used on XPU — no labels — by faaany (created: 2025-12-18 14:56 (UTC+8)) [💬1 | +4/-1, 1 files | commented:1 approved:1] ## Purpose Currently, the worker class on XPU is hard-coded to "vllm.v1.worker.xpu_worker.XPUWorker". This PR improves the logic to only override if `worker_cls` is still the default "auto". This allows custom workers (e.g. vllm-omni workers) to be used on XPU.
See CUDA as a reference: https://github.com/vllm-project/vllm/blob/main/vllm/platforms/cuda.py#L156
Essential Elements of an Effective PR Description Checklist
- [x] The purpose of the PR, such as "F... -
#30977 Docs: add OpenAI SDK example for Qwen2.5-VL classification — documentation,qwen — by Dhruv-80 (created: 2025-12-19 01:55 (UTC+8)) [💬2 | +39/-0, 1 files | commented:3] ## Purpose
Add a usage example demonstrating how to call Qwen2.5-VL classification models served by vLLM using the OpenAI-compatible SDK.
This clarifies the required structured multimodal input format and helps users avoid
400 Bad Requesterrors caused by passing raw text or media URLs directly in requests.Fixes #27413 …
-
#30978 Add positional embedding and kv_cache fusion for llama and gpt-oss — v1,llama,gpt-oss — by dllehr-amd (created: 2025-12-19 01:57 (UTC+8)) [+255/-70, 6 files | commented:1 | draft]
## Purpose
## Test Plan
## Test Result
... -
#30976 Use aiter triton fused_add_rmsnorm_pad for gpt-oss — gpt-oss — by Rohan138 (created: 2025-12-19 01:39 (UTC+8)) [+39/-2, 3 files | commented:1 | draft]
## Purpose
Adds fused padding op before router GEMM on ROCm, eliminating this unfused pad after the GEMM before the fused_moe: https://github.com/ROCm/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#1603
Before:
After: <img width="1462" height="34" alt="image" src="https://github.com/user-attachments/assets/e5979f88-03c0-…
-
#30975 [Misc] Disable default `--ready-check-timeout-sec` extra call in vllm bench — performance — by NickLucche (created: 2025-12-19 01:37 (UTC+8)) [+2/-3, 1 files | commented:1] As per brief offline discussion with @mgoin, the current `vllm bench serve` implementation will default to sending (at least) one extra request to probe for server up status. I suppose this is due to a non-uniform backend/healthcheck API, so we've been defaulting to sending the same test request https://github.com/vllm-project/vllm/blob/686cbaac643c3412036728dd5e6bc29d6cce1a9f/vllm/benchmarks/serve.py#L596
I believe this is largely misleading for most use-cases as it results in an ambiguous "wa…
-
#30971 GB200 DeepSeek R1-NVFP4 Updates — needs-rebase,v1,deepseek,nvidia — by elvircrn (created: 2025-12-18 23:23 (UTC+8)) [💬1 | +379/-60, 14 files | commented:2 | draft] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... - #30924 [BugFix] Fix TypeError: unhashable type: 'dict' when serving deepseek32 — ready,v1,deepseek — by LucasWilkinson (created: 2025-12-18 12:16 (UTC+8)) [💬1 | +11/-6, 1 files | commented:5] FIX: https://github.com/vllm-project/vllm/issues/30861
-
#30972 feat(metrics): Add Prometheus exemplars support for request-level met… — documentation,frontend,v1 — by TheCodeWrangler (created: 2025-12-18 23:30 (UTC+8)) [💬2 | +205/-29, 10 files | commented:1] ## Purpose
Add opt-in support for Prometheus exemplars on per-request histogram metrics. Exemplars attach request IDs to metric observations, enabling correlation between metrics and individual requests for debugging and tracing.
This is useful for:
- Debugging slow requests by correlating latency spikes with specific request IDs
- Tracing individual requests through distributed systems
- Investigating outliers in histogram distributions
## Key Implementation Notes …
-
#30949 [Doc] Add Sophgo TPU Support — documentation,ready — by wzyrrr (created: 2025-12-18 16:30 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1 approved:1] vllm-tpu based on the Sophgo TPU is open-sourced, and we contributed documentation support to the vLLM community.
## Purpose We implemented a vllm plugin based on Sophgo TPU according to the RFC https://github.com/vllm-project/vllm/issues/11162 (Hardware Out-Of-Tree Plugin) standard, and this plugin can achieve high-efficiency inference of LLMs on Sophgo TPUs using the vllm framework.
Users can download the source code and install this plugin via pip install -e . …
- #30962 [Quantization] support logical_widths for fp8 marlin — no labels — by jinzhen-lin (created: 2025-12-18 20:36 (UTC+8)) [💬3 | +10/-0, 1 files | commented:6] fix https://github.com/vllm-project/vllm/issues/30750
-
#30968 Add hidden dimension validation for multimodal embedding inputs — multi-modality — by wenqiglantz (created: 2025-12-18 22:08 (UTC+8)) [💬1 | +836/-10, 4 files | commented:4] The original fix for https://github.com/advisories/GHSA-pmqf-x6x8-p7qw added options that required opting in to image and text embeds inputs. https://github.com/vllm-project/vllm/pull/27204
This PR adds hidden dimension validation for multimodal embedding inputs when the feature is turned back on.
Already reviewed via a private security report by @DarkLight1337 @Isotr0py
-
#30966 Migrate some old models to Transformers modelling backend — documentation,new-model — by hmellor (created: 2025-12-18 21:51 (UTC+8)) [💬1 | +8/-1455, 6 files | commented:2 | draft] Migrate support of some older models to use the Transformers modeling backend instead of having duplicated implementations in vLLM.
These architectures are ~2 years old and do not feature in the top 100 most used architectures in vLLM (with the exception of GPT2).
This migration:
- Reduces the number of model implementations maintained in vLLM
- Adds support for the following features that the Transformers modelling backend base class supports (these were not all added to the original model i…
-
#30960 Migrate from `mypy` to `ty` — ci/build — by hmellor (created: 2025-12-18 20:04 (UTC+8)) [💬1 | +20/-207, 5 files | commented:3 | draft] Depends on:
- vllm-project/vllm/issues/26533
- Replace all `mypy` references with `ty`
`pre-commit` setup:
- Switch to only running on the lower bound Python version instead of testing all supported versions …
-
#30954 [Kernel][AWQ] Optimize awq_gemm: K-group reuse for scales/zeros, increase pipeline stages, and simplify dequant math — needs-rebase — by yuttian1 (created: 2025-12-18 17:32 (UTC+8)) [💬1 | +201/-245, 3 files | commented:2] What this PR changes: 1) K-group optimization (reuse scales/zeros within AWQ group)
AWQ group size is 128 while the kernel K tile is 32.
Within a group, scales and zero-points are identical across multiple K tiles, so we reuse the same scales/zeros for the 4 consecutive K tiles inside the group.
This reduces global memory reads for scales and zero-points.
2) More pipeline stages …
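The reuse factor follows directly from the sizes quoted above; a quick sketch of the tile-to-group mapping (illustrative helper, not the kernel code):

```python
GROUP_SIZE = 128  # AWQ quantization group size along K
K_TILE = 32       # kernel K tile size

def group_of_tile(tile_idx: int) -> int:
    # Each K tile falls entirely inside one quantization group, so its
    # scales/zero-points are determined by integer division.
    return (tile_idx * K_TILE) // GROUP_SIZE

# 128 // 32 = 4 consecutive K tiles share one set of scales/zeros,
# so they can be loaded from global memory once and reused.
tiles_per_group = GROUP_SIZE // K_TILE
print(tiles_per_group)                       # 4
print([group_of_tile(t) for t in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```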
-
#30961 [Quantization] fix marlin w8a8 check — no labels — by jinzhen-lin (created: 2025-12-18 20:21 (UTC+8)) [💬1 | +3/-6, 1 files | commented:1] Marlin W8A8 is currently disabled; we should raise an error for this case.
fix https://github.com/vllm-project/vllm/issues/30904
-
#30955 [Bugfix][CPU] Fix Mac CPU build — cpu — by bigPYJ1151 (created: 2025-12-18 17:32 (UTC+8)) [💬2 | +1/-1, 1 files | approved:1] ## Purpose
Fix mac CPU build issue after #30531
## Test Plan
Mac smoke test
## Test Result
…
-
#30952 [ROCm] Serving Fails on Radeon Due to AITER Dtype Import — rocm,ready — by vllmellm (created: 2025-12-18 16:48 (UTC+8)) [+18/-12, 1 files | commented:2 approved:1] ## Purpose The AITER method get_gfx_custom_op_core doesn't handle the gfx1201 architecture yet, causing issues when running vLLM even when AITER is not enabled. Currently AITER is not yet supported on gfx1201, so we need to prevent its usage until proper support is added to the AITER library.
## Test Plan Run test_basic_correctness.py on gfx1201 hardware (skipping meta-llama/Llama-3.2-1B-Instruct tests as they are time consuming). ## Test Result Before ``` ERROR test_basic_correctness.py - Runtim…
-
#30950 [Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) — documentation,ready,v1 — by markmc (created: 2025-12-18 16:41 (UTC+8)) [💬2 | +109/-1, 5 files | commented:2] See #30738 - this is a follow-on to export these metrics via Prometheus in addition to the console logging
The metrics are only calculated and available with `--enable-mfu-metrics` -
#30936 [v1][CP] Improve DCP/PCP/MTP error messages with actionable guidance — v1 — by yurekami (created: 2025-12-18 14:59 (UTC+8)) [💬3 | +49/-15, 1 files | commented:1] ## Summary
This PR improves error messages when context parallel features (DCP/PCP/MTP) are used with incompatible attention backends.
Problem (Fixes #28407):
- Current error messages use raw `AssertionError` with technical jargon
- No explanation of what DCP (Decode Context Parallel), PCP (Prefill Context Parallel), or MTP are
- No list of compatible backends
- No actionable fix suggestions
…
-
#30925 [Multimodal] Add FIPS 140-3 compliant hashing support — multi-modality — by yurekami (created: 2025-12-18 13:07 (UTC+8)) [💬2 | +220/-5, 2 files | commented:2] ## Summary
This PR adds FIPS 140-3 compliant SHA-256 hashing as an alternative to blake3 for multimodal content hashing in vLLM. This enables vLLM deployment in regulated environments (government, healthcare, finance) that require FIPS-approved cryptographic algorithms.
### Changes
- Add `_Sha256Hasher` wrapper class providing FIPS-compliant SHA-256 hashing
- Add `_Blake3Hasher` wrapper class maintaining consistent interface
- Add `_create_hasher()` factory function for automatic hasher select…
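The factory pattern described can be sketched with the standard library; `create_hasher` and the env toggle below are illustrative stand-ins, not vLLM's actual internals:

```python
import hashlib
import os

def create_hasher(fips_mode=None):
    """Pick a hash for content hashing: SHA-256 when FIPS compliance is
    required, blake3 otherwise (sketch of the factory the PR describes)."""
    if fips_mode is None:
        # Hypothetical env toggle, for illustration only
        fips_mode = os.environ.get("VLLM_USE_FIPS_HASH", "0") == "1"
    if fips_mode:
        return hashlib.sha256()
    try:
        import blake3  # third-party; faster, but not FIPS-approved
        return blake3.blake3()
    except ImportError:
        return hashlib.sha256()  # safe fallback

h = create_hasher(fips_mode=True)
h.update(b"multimodal content bytes")
digest = h.hexdigest()
print(len(digest))  # 64 hex chars for SHA-256
```

Both hashers expose the same `update()`/`hexdigest()` interface, which is what lets the rest of the code stay backend-agnostic.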
-
#30946 [Core] Improve DP synchronization error messages — v1 — by yurekami (created: 2025-12-18 16:10 (UTC+8)) [+26/-3, 1 files | commented:1] ## Summary
Replace bare `assert` statements with informative `RuntimeError` exceptions in `dp_utils.py` to improve the debugging experience for Data Parallel users. Changes:
- Replace `assert num_tokens_padded >= num_tokens_unpadded` with descriptive error showing actual values
- Replace `assert should_attempt_dp_padding == should_dp_pad` with error explaining DP rank synchronization mismatch
- Replace `assert uniform_decode is not None` with error explaining microbatching parameter requiremen…
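The general pattern (a bare assert replaced by a `RuntimeError` carrying the actual values) looks roughly like this sketch; the function name and message wording are illustrative, not the PR's exact code:

```python
def check_padding(num_tokens_padded: int, num_tokens_unpadded: int) -> None:
    # Instead of: assert num_tokens_padded >= num_tokens_unpadded
    if num_tokens_padded < num_tokens_unpadded:
        raise RuntimeError(
            f"DP padding invariant violated: num_tokens_padded "
            f"({num_tokens_padded}) must be >= num_tokens_unpadded "
            f"({num_tokens_unpadded}). This usually indicates a batch "
            f"construction bug across DP ranks."
        )

check_padding(16, 12)  # ok, returns None
try:
    check_padding(8, 12)
except RuntimeError as e:
    print("caught:", e)  # message includes both offending values
```

Unlike `assert`, this check also survives `python -O`, which strips assert statements entirely.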
-
#30945 [Feature] adapt step3 with eager mode — documentation,frontend,v1,deepseek — by InhabitancyCocoon (created: 2025-12-18 16:06 (UTC+8)) [💬1 | +2981/-118, 31 files | commented:2]
## Purpose
## Test Plan
## Test Result
... -
#30951 [Misc] Improve worker error messages for better debugging — v1,cpu — by yurekami (created: 2025-12-18 16:42 (UTC+8)) [+36/-6, 3 files | commented:1] ## Summary
Replace bare `assert` statements with informative `RuntimeError`/`TypeError` exceptions in v1 worker modules. This provides users with actionable error messages instead of cryptic `AssertionError` exceptions.
## Changes
### `vllm/v1/worker/gpu/dp_utils.py`
- Replace `assert dp_size > 1` with RuntimeError explaining Data Parallel configuration requirements
### `vllm/v1/worker/cpu_model_runner.py` …
-
#30947 [RFC][docs] Add lightweight AI assisted contribution policy — documentation — by markmc (created: 2025-12-18 16:22 (UTC+8)) [💬2 | +27/-0, 2 files | commented:3 | draft] Adds AI assisted contribution sections to `governance/process.md` and `contributing/README.md`, establishing clear guidelines for using AI tools while maintaining code quality and attribution standards. 🤖 Generated with Claude Code
-
#30948 fix: Suppress torch.frombuffer UserWarning for non-writable buffers — no labels — by yurekami (created: 2025-12-18 16:25 (UTC+8)) [💬1 | +26/-2, 3 files | commented:1] ## Summary Fixes #26781 - Suppress UserWarning from `torch.frombuffer()` on non-writable buffers
## Problem
`torch.frombuffer()` emits a `UserWarning` when given a non-writable buffer (e.g., from `pickle.dumps()` or shared memory): UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor...…
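The standard-library mechanism for this kind of targeted suppression is a scoped warnings filter; a sketch (torch is stood in by a helper emitting the same warning category, and the PR's exact change may differ):

```python
import warnings

def frombuffer_standin():
    # Stand-in for torch.frombuffer() on a read-only buffer: emits the
    # same warning category with a similar message (torch not imported here).
    warnings.warn("The given buffer is not writable", UserWarning)

# Targeted suppression: only this message pattern, only inside the context
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore", message=".*buffer is not writable.*", category=UserWarning
    )
    frombuffer_standin()  # silenced

# Without the filter the warning fires as usual
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    frombuffer_standin()
print(len(caught))  # 1
```

Scoping the filter to the call site avoids hiding unrelated `UserWarning`s elsewhere in the process.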
-
#30937 [Feature] Add --ssl-ciphers CLI argument for TLS cipher control — frontend — by rickychen-infinirc (created: 2025-12-18 15:03 (UTC+8)) [+4/-0, 2 files | commented:1] ## Summary This PR adds support for the `--ssl-ciphers` CLI argument, enabling users to specify allowed SSL/TLS cipher suites for fine-grained security control. Fixes #30569
## Changes
- `vllm/entrypoints/openai/cli_args.py`: Added `ssl_ciphers` parameter to `FrontendArgs` dataclass
- `vllm/entrypoints/openai/api_server.py`: Pass `ssl_ciphers` argument to `serve_http()` function
## Motivation …
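What such a flag ultimately configures is an OpenSSL cipher string applied to the server's `ssl.SSLContext`; a standard-library sketch (the plumbing through `serve_http()` is simplified away, and the cipher string is just an example value a user might pass):

```python
import ssl

# Example restrictive OpenSSL cipher string, like one passed via --ssl-ciphers
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.set_ciphers("ECDHE+AESGCM:!aNULL:!MD5")  # restricts the TLS <= 1.2 suites

# A valid string leaves a non-empty enabled-cipher list...
print(len(ctx.get_ciphers()) > 0)

# ...while a bogus string is rejected up front with ssl.SSLError
try:
    ctx.set_ciphers("NO-SUCH-CIPHER")
except ssl.SSLError:
    print("invalid cipher string rejected")
```

Note that on OpenSSL 1.1.1+ the TLS 1.3 suites are controlled separately and are unaffected by `set_ciphers()`.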
-
#30920 [Bugfix] Fix Unicode issues in GLM-4 tool calling — ready — by chaunceyjiang (created: 2025-12-18 11:26 (UTC+8)) [+2/-1, 1 files | commented:1 approved:1] ## Purpose
## Test Plan
main:
`[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-a206a0a9cae7c285', function=Function(arguments='{"city": "\\u5317\\u4eac", "date": "2024-06-27"}', name='get_weather'), type='function'), ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-9b1f3dc7e2f41093', function=Function(arguments='{"city": "\\u4e0a\\u6d77", "date": "2024-06-27"}', name='get_weather'), type='function')]`
this PR: …
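The escaped `\u5317\u4eac` in the "main" output above is ordinary JSON ASCII escaping; the usual mechanism behind fixing it (a sketch of the pattern, not necessarily the PR's exact change) is serializing with `ensure_ascii=False`:

```python
import json

args = {"city": "北京", "date": "2024-06-27"}

# Default: non-ASCII characters are escaped, matching the "main" output above
print(json.dumps(args))  # {"city": "\u5317\u4eac", "date": "2024-06-27"}

# ensure_ascii=False keeps the characters readable; both forms decode
# back to the same dict, so this is purely a presentation change.
readable = json.dumps(args, ensure_ascii=False)
print(readable)  # {"city": "北京", "date": "2024-06-27"}
```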
- #30932 enable vllm ut for Intel GPU — documentation,rocm,speculative-decoding,v1,nvidia — by wincent8 (created: 2025-12-18 14:06 (UTC+8)) [💬2 | +156/-59, 16 files | commented:1 | draft] enable vllm unit tests for Intel GPU
[Merged PRs]
- #30270 [ROCm][CI][Bugfix] Multi-Modal Model Support Fixes and Attention Backend Improvements — rocm,ready,ci/build,multi-modality,qwen — by AndreasKaratzas (merged: 2025-12-19 10:17 (UTC+8)) [💬5 | +109/-47, 7 files | commented:10]
This PR addresses several ROCm-specific issues with multi-modal/vision-language models and improves attention backend dispatching for encoder-only self-attention models. It renders green the following test groups on ROCm:
- Multi-Modal Models Test (Standard)
- Multi-Modal Models Test (Extended) 1
- Multi-Modal Models Test (Extended) 2
- Multi-Modal Models Test (Extended) 3
#### Key Changes
Attention Backend Selection (`vllm/platforms/rocm.py`):
- Added dtype validation (fp16/bf16 o…
-
#30895 [BugFix] Handle errors when preprocessing added requests — bug,ready,ci/build,v1 — by njhill (merged: 2025-12-19 09:29 (UTC+8)) [💬4 | +93/-3, 3 files | commented:2 approved:1]
`preprocess_add_request()` runs in the engine core input socket processing thread. It's not expected to raise exceptions, but if it does, they aren't caught or logged - the input processing thread exits and the engine hangs silently :( This PR catches and logs request-scoped preprocessing errors, and returns an output to fail the request in question.
I'll open another PR to deal with any other unexpected exceptions occurring in the input socket processing thread (which should be considered fata…
- #30867 [Bugfix] Fix tool_choice="none" being ignored by GPT-OSS/harmony models — frontend,ready,gpt-oss — by HaloWorld (merged: 2025-12-19 09:34 (UTC+8)) [💬5 | +86/-4, 2 files | commented:6 approved:1]
GPT-OSS models using the harmony format were ignoring the `tool_choice="none"` parameter and could still trigger tool calls when tools were provided in the request. This issue arose because the `_make_request_with_harmony` method only checked for the existence of `request.tools`, without accounting for the `tool_choice` setting or the `exclude_tools_when_tool_choice_none` flag. This fix ensures that harmony models respect the `exclude_tools_when_tool_choice_none` flag, aligning their behavior wi… -
#30684 [MM Encoder]: Migrate legacy ViT `MultiHeadAttention` to new `MMEncoderAttention` interface — tpu,ready,v1,llama — by Isotr0py (merged: 2025-12-19 02:04 (UTC+8)) [💬6 | +182/-266, 20 files | commented:2 approved:1]
## Purpose
- Following PR for #30125
- Migrate `MultiHeadAttention` usage to new `MMEncoderAttention`
## Test Plan
`pytest -s -v tests/kernels/attention/test_attention.py`…
-
#30973 [Bugfix] Remove `tile_size=64` for mm_prefix triton attention — ready — by Isotr0py (merged: 2025-12-19 03:42 (UTC+8)) [+0/-7, 1 files | commented:1 approved:2]
## Purpose
- Fix https://github.com/vllm-project/vllm/pull/30386#discussion_r2631541204
cc @lucianommartins
## Test Plan
## Test Result …
-
#29128 [Cleanup] Refactor FlashInferMetadataBuilder — ready,v1,nvidia — by benchislett (merged: 2025-12-19 06:45 (UTC+8)) [💬1 | +403/-269, 1 files | commented:9 approved:1] ## Purpose
Reorganize the FlashInferMetadata into clear `prefill` and `decode` sections that either belong to FlashInfer or TRTLLM execution pathways. This separation is desirable because it allows us to make explicit which metadata needs to be prepared for each backend, and therefore which computations can be omitted when a certain backend is not used. As such, I refactor (and make skip-able) some computations related to the paged kv indices, see `_compute_paged_kv_indices`.
## Test Plan
I …
-
#30907 [Bug] Fix batch invariant in torch 2.10 — ready — by yewentao256 (merged: 2025-12-18 23:27 (UTC+8)) [💬3 | +20/-24, 1 files | commented:5 approved:3] ## Purpose
Fixes https://github.com/pytorch/pytorch/issues/170490
## Test
Originally:
```bash … …
-
#30419 [NIXL][BUG FIX] Fix both failing issue and accuracy issue with nixl + host_buffer on CUDA — ready,v1,kv-connector,nvidia — by xuechendi (merged: 2025-12-19 06:10 (UTC+8)) [💬13 | +53/-36, 3 files | commented:9 dismissed:1] ## Purpose
This PR builds on https://github.com/vllm-project/vllm/pull/30420; that one should be merged first.
Two issues detected and resolved in this PR:
- Fix a bug after #29665 for running PD with cpu host buffer
- Fix accuracy issue for running PD with cpu host buffer, described in https://github.com/vllm-project/vllm/issues/30358
…
-
#30983 Check for truthy `rope_parameters` not the existence of it — bug,ready — by hmellor (merged: 2025-12-19 05:59 (UTC+8)) [+5/-5, 1 files | commented:1 approved:1] Fixes the following situation:
- Model explicitly sets `rope_parameters=None` (new name, Transformers v5) or `rope_scaling=None` (old name, Transformers v4)
- User is using Transformers v4 with vLLM
-
#30916 [BugFix] Fix spec decode + structured outputs + preemption edge case — bug,ready,v1 — by njhill (merged: 2025-12-19 04:59 (UTC+8)) [+5/-1, 1 files | commented:1 approved:1] Fix an edge case that can be triggered when using spec decode with structured outputs.
There is a sequence of preemption and drafting being skipped that can result in the scheduler's `request.spec_token_ids` being stale, which can then fail bitmask generation because they will be out of sync with the grammar. When triggered, vLLM crashes with an error like this: ``` (EngineCore_DP0 pid=1549249) File "/home/nickhill/workspace/vllm2/vllm/vllm/v1/engine/core.py", line 920, in _process_engine_ste…
-
#30652 Strengthen input validation and tests for 'parse_raw_prompts'. — ready — by mivehk (merged: 2025-12-19 03:51 (UTC+8)) [💬3 | +11/-10, 1 files | commented:9 approved:1]
## Purpose 'parse_raw_prompts' can benefit from stricter input validation in the array-of-arrays input path: when the input is a list of lists, we now verify that every nested inner list is a non-null list of integers, not just the first one.
## Test Plan Added pytest coverage for invalid and edge-case inputs of mixed-type nested lists for parse_raw_prompts. Updated the array-of-arrays input condition of parse_raw_prompts to strict…
-
#30319 [BugFix] Spec decode with VLLM_ENABLE_V1_MULTIPROCESSING=0 — ready,v1 — by heheda12345 (merged: 2025-12-19 03:47 (UTC+8)) [💬1 | +2/-1, 1 files | commented:1 approved:1]
## Purpose We currently forget to pass the draft tokens from the model runner to the scheduler. This PR fixes it.
## Test Plan
`VLLM_ENABLE_V1_MULTIPROCESSING=0 python3 examples/offline_inference/spec_decode.py --test`
## Test Result …
-
#30363 Add removal version for all2all backend env var — documentation,ready,ci/build,v1,nvidia — by elizabetht (merged: 2025-12-19 03:46 (UTC+8)) [💬8 | +40/-43, 12 files | commented:4 changes:1 approved:2] ## Purpose Convert VLLM_ALL2ALL_BACKEND envvar to config value
## Test Plan CI run should be sufficient
## Test Result
... -
#30820 [Bug] Fix compressed tensor not using deepgemm — ready — by yewentao256 (merged: 2025-12-19 03:45 (UTC+8)) [💬2 | +10/-1, 2 files | commented:2 approved:1] ## Purpose
A fix for https://github.com/vllm-project/vllm/pull/30718#discussion_r2624920272
CT (compressed-tensors) should use deepgemm when available; this PR fixes that.
## Test
export MODEL="RedHatAI/Qwen3-30B-A3B-FP8-block"…
-
#30629 tuned fused configs for B300 — no labels — by navmarri14 (merged: 2025-12-19 03:41 (UTC+8)) [💬4 | +441/-0, 3 files | commented:1 approved:1]
## Purpose This PR adds a tuned fused MoE kernel configuration for the GLM-4.6 MoE architecture on NVIDIA B300 GPUs using FP8 quantization.
Specifically, it targets the configuration:
Experts (E): 160; sharded size N=192 for TP=8, N=384 for TP=4, N=768 for TP=2 …
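The three sharded sizes above are consistent with a single per-expert dimension of 1536 split evenly across tensor-parallel ranks; 1536 is inferred here from N×TP being constant across the listed configs (a sketch, not taken from the PR):

```python
# 192*8 == 384*4 == 768*2 == 1536, so the unsharded dimension is inferred
FULL_N = 1536  # assumption derived from the listed (N, TP) pairs

def sharded_n(tp: int) -> int:
    # Each TP rank holds an equal slice of the per-expert dimension
    assert FULL_N % tp == 0, "dimension must split evenly across ranks"
    return FULL_N // tp

print([sharded_n(tp) for tp in (8, 4, 2)])  # [192, 384, 768]
```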
- #30729 [Perf] enable flashinfer rotary_embedding custom ops in DeepSeek rotary — ready,deepseek — by jiahanc (merged: 2025-12-19 03:31 (UTC+8)) [💬6 | +24/-2, 2 files | commented:7 approved:1]
## Purpose
- Enable flashinfer rotary_embedding custom ops in DeepSeek
To use the custom op, add `--compilation_config.custom_ops+=+rotary_embedding` in vllm config
## Test Plan
```
VLLM_USE_FLASHINFER_MOE_FP4=1 python3 -m vllm.entrypoints.openai.api_server --model nvidia/DeepSeek-R1-0528-FP4-v2 --tokenizer nvidia/DeepSeek-R1-0528-FP4-v2 --dtype auto --kv-cache-dtype fp8 --tensor-parallel-size 1 --pipeline-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel --swap-space 16 --max-…
```
-
#30745 [BugFix] Reclaim resources to prevent memory leaks when using LMCacheMPConnector — ready,kv-connector — by wz1qqx (merged: 2025-12-19 03:09 (UTC+8)) [💬2 | +24/-0, 2 files | commented:7 approved:1] ## Purpose This is from the Novita.AI Team
Clean up to prevent memory leaks ## Test Plan
## Test Result
... -
#30935 [XPU] allow custom workers (e.g. vllm-omni workers) to be used on XPU — no labels — by faaany (merged: 2025-12-19 02:16 (UTC+8)) [💬1 | +4/-1, 1 files | commented:1 approved:1] ## Purpose Currently, the worker class on XPU is hard-coded to "vllm.v1.worker.xpu_worker.XPUWorker". This PR improves the logic to only override if `worker_cls` is still the default "auto". This allows custom workers (e.g. vllm-omni workers) to be used on XPU.
See CUDA as a reference: https://github.com/vllm-project/vllm/blob/main/vllm/platforms/cuda.py#L156
Essential Elements of an Effective PR Description Checklist
- [x] The purpose of the PR, such as "F... -
#29476 [BugFix] Add sleep to fix tight loop and release GIL — ready,v1 — by alec-flowers (merged: 2025-12-19 01:52 (UTC+8)) [💬3 | +7/-0, 1 files | commented:2 approved:1] Potential fix for https://github.com/vllm-project/vllm/issues/29369
While not very elegant, it does do the job of releasing the GIL.
## Purpose
## Test Plan
## Test Result
…
-
#29002 [Bugfix] fix DP-aware routing in OpenAI API requests — frontend,ready,ci/build,v1,nvidia — by inkcherry (merged: 2025-12-19 01:50 (UTC+8)) [💬2 | +68/-0, 7 files | commented:5 dismissed:1 approved:1] ## Purpose fix https://github.com/vllm-project/vllm/pull/24945
In `add_request`, duplicate initialization is skipped, but during the previous `self.processor.process_inputs`, `data_parallel_rank` is not initialized. Using `-H 'X-data-parallel-rank'` to specify the data parallel rank would be invalid in this case. cc @njhill
https://github.com/vllm-project/vllm/blob/d69062c67af46a2e624be92162e9db585eef329b/vllm/v1/engine/async_llm.py#L283-L302
## Test Plan
## Test Result
…
- #30218 [Cleanup] Remove unused ModelRunner V1 `InputBatch.num_tokens` field — tpu,ready,v1 — by njhill (merged: 2025-12-19 01:17 (UTC+8)) [+12/-36, 4 files | commented:4 approved:1] Even though we'll be transitioning to GPU ModelRunner V2, it makes sense to remove this in the meantime.
- #30909 [ROCm][Bugfix] Fix `fa_version` argument error in `flash_attn_maxseqlen_wrapper` for ROCm without aiter — rocm,ready — by AndreasKaratzas (merged: 2025-12-18 16:45 (UTC+8)) [+5/-4, 1 files | commented:2 approved:1] ### Problem On ROCm platforms with AITER either uninstalled or disabled, `flash_attn_varlen_func` fails with: TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'fa_version'
This occurs because `is_rocm_aiter` is False when aiter is disabled, causing the code to fall through to the else branch which unconditionally passes `fa_version` to `flash_attn_varlen_func`. However, the ROCm version of Flash Attention (via `vllm.attention.utils.fa_utils`) does not support… -
#30730 [ROCm][Bugfix] fix(structured_output): Skip guidance backend for schemas with patternProperties — rocm,structured-output,ready,v1 — by AndreasKaratzas (merged: 2025-12-18 15:04 (UTC+8)) [💬7 | +45/-2, 2 files | commented:2 approved:1] ## Summary
This PR fixes a structured output generation failure when using `backend="auto"` with JSON schemas containing `patternProperties`. The fix adds detection for `patternProperties` in the auto-mode fallback logic and routes such schemas to the `outlines` backend instead of `guidance`.
## Problem
When using `structured_outputs` with `backend="auto"` and a JSON schema containing `patternProperties`, the `llguidance` library produces malformed output consisting of `{` followed by endless… -
#30900 fix fp8 online quantization streaming with tp > 1 — ready — by vkuzo (merged: 2025-12-19 00:45 (UTC+8)) [💬4 | +33/-8, 1 files | commented:1 approved:2] Summary:
Fix for https://github.com/vllm-project/vllm/issues/30830
When we added online fp8 quant with streaming weight post-processing in https://github.com/vllm-project/vllm/pull/29196, a bug was introduced where the TP>1 case was not always handled correctly. Specifically:
- https://github.com/vllm-project/vllm/pull/29196 assumed that `weight_loader` copies `loaded_weight` to `param` directly
- this is not always true, as `weight_loader` can call arbitrary logic on both `param` and `loaded_wei…
-
#30598 [LoRA] Set default MXFP4 LoRA backend to Marlin — ready — by xyang16 (merged: 2025-12-19 00:42 (UTC+8)) [💬4 | +5/-5, 1 files | commented:2 approved:1] ## Purpose
This PR sets the default MXFP4 LoRA backend to Marlin because Triton has accuracy issues and Marlin has slightly better performance.
- Use Triton only if Marlin is disabled (set `VLLM_MXFP4_USE_MARLIN=0` explicitly) and `triton_kernels` is supported.
- Use Marlin by default:
  - if `VLLM_MXFP4_USE_MARLIN` is not set
  - if `VLLM_MXFP4_USE_MARLIN=1`
  - if `triton_kernels` is not supported
## Benchmarking …
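The selection precedence described can be sketched as a small decision function; the helper name is illustrative (not vLLM's actual function), while the env var name comes from the PR:

```python
import os

def choose_mxfp4_lora_backend(triton_kernels_supported: bool) -> str:
    """Sketch of the backend-selection precedence described above."""
    env = os.environ.get("VLLM_MXFP4_USE_MARLIN")  # env var name per the PR
    if env == "0" and triton_kernels_supported:
        # Only path to Triton: explicit opt-out AND triton_kernels available
        return "triton"
    # Default in every other case: unset, "1", or triton_kernels unsupported
    return "marlin"

os.environ.pop("VLLM_MXFP4_USE_MARLIN", None)  # start from an unset state
print(choose_mxfp4_lora_backend(triton_kernels_supported=True))   # marlin
os.environ["VLLM_MXFP4_USE_MARLIN"] = "0"
print(choose_mxfp4_lora_backend(triton_kernels_supported=True))   # triton
```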
-
#30949 [Doc] Add Sophgo TPU Support — documentation,ready — by wzyrrr (merged: 2025-12-19 00:29 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1 approved:1] vllm-tpu based on the Sophgo TPU is open-sourced, and we contributed documentation support to the vLLM community.
## Purpose We implemented a vllm plugin based on Sophgo TPU according to the RFC https://github.com/vllm-project/vllm/issues/11162 (Hardware Out-Of-Tree Plugin) standard, and this plugin can achieve high-efficiency inference of LLMs on Sophgo TPUs using the vllm framework.
Users can download the source code and install this plugin via pip install -e . …
-
#30822 [Bugfix][torch2.10] Fix test_qwen2_5_vl_compilation with 2.10 RC — ready,qwen — by Lucaskabela (合并于: 2025-12-19 00:23 (UTC+8)) [💬1 | +13/-7, 3 files | commented:4 approved:1]
## Purpose See https://github.com/pytorch/pytorch/issues/170568
In torch 2.10, we will rely more heavily on aot_precompile; part of this upgrade leads to a successful load and re-instantiation of the `VLLMBackend` in caching.py. However, for multimodal encoders that previously relied on the `set_model_tag` context manager to forward `is_encoder` information to compile ranges, this field is no longer set (since aot loading happens at the callsite). Therefore, we need to …
-
#30188 [Model] adds jais 2 support — documentation,new-model,ready — by sarathc-cerebras (合并于: 2025-12-18 23:46 (UTC+8)) [💬8 | +534/-0, 4 files | commented:10] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#30915 [Fix][FlexAttention] return max logical block index to handle reused blocks — ready,v1 — by ivanium (合并于: 2025-12-18 14:42 (UTC+8)) [+42/-4, 2 files | commented:2 approved:1]
## Purpose
For FlexAttention, we need to build `physical_to_logical_mapping` to reversely map physical block ids to logical block ids. This process previously assumed logical block ids are always unique, which is not true for some attention types such as sliding window attention, where some blocks may be released and reused later, causing the same physical block id to appear multiple times in a row of `block_table` at different logical block indices. As a result, …
-
#29776 [Misc] support nsys profile for bench latency — performance,ready — by izhuhaoran (合并于: 2025-12-18 22:52 (UTC+8)) [💬7 | +14/-12, 1 files | commented:4 approved:1] ## Purpose As titled, this PR supports nsys profiling for bench latency.
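The reverse block mapping for FlexAttention described above can be sketched in a few lines: when a physical block id repeats within a `block_table` row (a released and reused block), keep the maximum, i.e. latest, logical index. This is a simplified illustration, not the actual vLLM implementation.

```python
def physical_to_logical(block_table_row):
    """Map each physical block id to its max logical index in the row."""
    mapping = {}
    for logical_idx, physical_id in enumerate(block_table_row):
        # later occurrences overwrite earlier ones -> max logical index wins
        mapping[physical_id] = logical_idx
    return mapping

row = [7, 3, 7, 5]                # physical block 7 was released and reused
print(physical_to_logical(row))   # → {7: 2, 3: 1, 5: 3}
```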
-
#30537 Filter safetensors files to download if .safetensors.index.json exists — ready — by mgoin (合并于: 2025-12-18 22:51 (UTC+8)) [💬1 | +29/-6, 1 files | commented:1 changes:1 approved:1] ## Purpose
I noticed when running `vllm serve openai/gpt-oss-20b` on a new machine that it not only downloaded the base model files, but also included the `original/` directory, which I assume was picked up because it has files ending in `*.safetensors`: model-00000-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████| 4.79G/4.79G [06:24<00:00, 12.5MB/s] model-00001-of-00002.safetensors: 100%|███████████████████████████████████████████████████…
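The filtering idea can be sketched as follows: when a `.safetensors.index.json` is present, only download the shard files its `weight_map` actually references, skipping stray `*.safetensors` elsewhere in the repo (a simplified illustration; the function name is hypothetical).

```python
import json

def files_to_download(index_json_text, repo_files):
    """Keep only *.safetensors shards referenced by the index's weight_map."""
    index = json.loads(index_json_text)
    needed = set(index["weight_map"].values())  # shard filenames actually used
    return [f for f in repo_files if f in needed]

index_text = json.dumps({"weight_map": {
    "layer.0.weight": "model-00000-of-00002.safetensors",
    "layer.1.weight": "model-00001-of-00002.safetensors",
}})
repo = ["model-00000-of-00002.safetensors",
        "model-00001-of-00002.safetensors",
        "original/model.safetensors"]        # extra copy we want to skip
print(files_to_download(index_text, repo))   # only the two indexed shards
```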
-
#30910 [BugFix] Partial revert of #29558 (DeepEP HT + PIECEWISE CG support) — ready — by LucasWilkinson (合并于: 2025-12-18 15:50 (UTC+8)) [💬2 | +14/-74, 2 files | commented:1 approved:2] Partially revert https://github.com/vllm-project/vllm/pull/29558 as this broke H200 tests
https://buildkite.com/vllm/ci/builds/43863#019b29e9-5c1b-4eff-83f7-c8304f774aa7
i.e. `VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048`…
-
#30955 [Bugfix][CPU] Fix Mac CPU build — cpu — by bigPYJ1151 (合并于: 2025-12-18 17:38 (UTC+8)) [💬2 | +1/-1, 1 files | approved:1] ## Purpose
Fix mac CPU build issue after #30531
## Test Plan
Mac smoke test
## Test Result
…
-
#30952 [ROCm] Serving Fails on Radeon Due to AITER Dtype Import — rocm,ready — by vllmellm (合并于: 2025-12-18 19:47 (UTC+8)) [+18/-12, 1 files | commented:2 approved:1] ## Purpose The AITER method `get_gfx_custom_op_core` doesn’t handle the gfx1201 architecture yet, causing issues when running vLLM even when AITER is not enabled. Currently AITER is not yet supported on gfx1201, so we need to prevent its usage until proper support is added to the AITER library.
## Test Plan Run test_basic_correctness.py on gfx1201 hardware (skipping meta-llama/Llama-3.2-1B-Instruct tests as they are time consuming). ## Test Result Before ``` ERROR test_basic_correctness.py - Runtim…
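The guard described above amounts to gating AITER usage on the GPU architecture so unsupported targets such as gfx1201 fall back to non-AITER paths. A minimal sketch, with a purely illustrative allow-list (not the PR's actual supported-arch set):

```python
# Illustrative allow-list only; the real supported set lives in vLLM/AITER.
AITER_SUPPORTED_GFX = {"gfx942", "gfx950"}

def aiter_enabled(gfx_arch: str, user_requested: bool) -> bool:
    """AITER is used only when requested AND the arch is known to support it."""
    return user_requested and gfx_arch in AITER_SUPPORTED_GFX

print(aiter_enabled("gfx942", True))    # → True
print(aiter_enabled("gfx1201", True))   # → False: not yet supported
print(aiter_enabled("gfx1201", False))  # → False
```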
-
#29935 [moe] Use enable_chunking func (to support disabling chunking) — ready — by minosfuture (合并于: 2025-12-18 17:02 (UTC+8)) [💬1 | +2/-2, 1 files | commented:1 approved:1]
## Purpose
`VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING` wasn’t used. This PR uses it and allows disabling activation chunking, which is good for throughput performance.
## Test Plan
tested VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING
…
-
#30883 [Chore] Remove v0 dead code for Qwen2.5-omni — ready,qwen — by Isotr0py (合并于: 2025-12-18 11:54 (UTC+8)) [💬1 | +0/-22, 1 files | commented:1 approved:1]
## Purpose
- Just found we missed Qwen2.5-omni’s `embed_multimodal_v0` when removing `embed_multimodal_v0` from models.
## Test Plan
## Test Result
…
-
#30920 [Bugfix] Fix Unicode issues in GLM-4 tool calling — ready — by chaunceyjiang (合并于: 2025-12-18 15:12 (UTC+8)) [+2/-1, 1 files | commented:1 approved:1] ## Purpose
## Test Plan main
[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-a206a0a9cae7c285', function=Function(arguments='{"city": "\\u5317\\u4eac", "date": "2024-06-27"}', name='get_weather'), type='function'), ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-9b1f3dc7e2f41093', function=Function(arguments='{"city": "\\u4e0a\\u6d77", "date": "2024-06-27"}', name='get_weather'), type='function')]
this PR …
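The `\uXXXX`-escaped arguments shown above are what `json.dumps` produces by default; serializing with `ensure_ascii=False` keeps CJK characters readable. A sketch of that Unicode-handling difference (not the PR's actual code):

```python
import json

args = {"city": "北京", "date": "2024-06-27"}
print(json.dumps(args))                      # escaped: {"city": "\u5317\u4eac", ...}
print(json.dumps(args, ensure_ascii=False))  # readable: {"city": "北京", ...}
```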
-
#30531 [CPU] Refactor CPU fused MOE — ready,ci/build,cpu — by bigPYJ1151 (合并于: 2025-12-18 14:36 (UTC+8)) [💬4 | +1394/-206, 23 files | commented:8 changes:2]
## Purpose
Refactor CPU fused MOE by optimizing the tile schedule and enabling torch.compile.
Part of #29580
main: ```log …
-
#30225 [Platform] Let EPD work with non-cuda platform — ready,nvidia — by wangxiyuan (合并于: 2025-12-18 14:45 (UTC+8)) [💬2 | +4/-1, 1 files | commented:1 approved:1] ## Purpose Simple change to make EPD work with non-CUDA platforms; removes the CUDA hardcode logic. ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#30706 fix: add warmup for audio preprocessing — frontend,ready — by TheCodeWrangler (合并于: 2025-12-18 14:12 (UTC+8)) [💬4 | +126/-1, 1 files | commented:6 approved:1] # Audio Preprocessing Warmup for Whisper Models
## Purpose
Fixes first-request latency issue for Whisper transcription models where the first request can take significantly longer (10-20x) than subsequent requests.
Problem: The first Whisper transcription request experiences substantial latency overhead due to lazy initialization, causing poor user experience and potential timeout issues in production.
Root Cause: Investigation revealed two lazy initialization bottlenecks:
- **Libro…
-
#30903 [UX] Reduce DeepGEMM warmup log output to single progress bar — ready,startup-ux — by MatthewBonanni (合并于: 2025-12-18 12:21 (UTC+8)) [+99/-42, 1 files | commented:2 approved:1] ## Purpose Showing a progress bar for each shape during DeepGEMM warmup is unnecessary. In the interest of reducing log clutter, this PR reduces the output to a single progress bar, only shown on rank 0.
## Test Plan
`vllm serve deepseek-ai/DeepSeek-R1 -dp 8 --enable-expert-parallel`
## Test Result …
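The logging pattern described above — one aggregate progress indicator over all warmup shapes, emitted only on rank 0 — can be sketched as (names hypothetical, plain callback instead of an actual progress-bar library):

```python
def warmup_progress(shapes, rank, report):
    """Run warmup over all shapes; only rank 0 reports aggregate progress."""
    total = len(shapes)
    for i, shape in enumerate(shapes, 1):
        # ... warm up `shape` here ...
        if rank == 0:                      # non-zero ranks stay silent
            report(f"DeepGEMM warmup {i}/{total}")

lines = []
warmup_progress([(128, 256), (256, 256), (512, 1024)], rank=0,
                report=lines.append)
print(lines)       # → three aggregate progress lines, e.g. 'DeepGEMM warmup 1/3'
warmup_progress([(128, 256)], rank=1, report=lines.append)
print(len(lines))  # → 3: rank 1 reported nothing
```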
-
#30814 [KV connector][LMCache] Only record the cuda event when there are request to store/load — ready,kv-connector,nvidia — by ApostaC (合并于: 2025-12-18 13:31 (UTC+8)) [💬2 | +40/-17, 2 files | commented:1 approved:1]
## Purpose
This is a small refactor/optimization so that we only record CUDA events when there are requests to store or load.
## Test Plan N/A
## Test Result …
-
#30911 [AMD][CI] fix lm eval ci arg — rocm,ready,ci/build — by divakar-amd (合并于: 2025-12-18 13:18 (UTC+8)) [+3/-3, 1 files | commented:1 approved:1] Fixing args for AMD-CI run in accordance with https://github.com/vllm-project/vllm/pull/30723
This fixes the following tests for AMD-CI:
LM Eval Small Models LM Eval Small Models (1 Card)
- #29553 [PERF] Qwen3-next. Add fp8 cutlass MoE tuned configs. `chmod -x *MI308X.json` — ready,qwen,nvidia — by vadiklyutiy (合并于: 2025-12-18 13:16 (UTC+8)) [💬3 | +882/-0, 7 files | commented:1 approved:1] ## Purpose
- For Qwen3-next-FP8, add tuned MoE configs for Blackwell.
- Remove execution permission for `E=128,N=768,device_name=AMD_Instinct_MI308X.json`.
-
#30765 [Doc][CPU] Update CPU doc — documentation,ready,ci/build,cpu — by bigPYJ1151 (合并于: 2025-12-18 12:59 (UTC+8)) [💬2 | +106/-16, 5 files | commented:9 approved:1]
## Purpose
Update CPU doc for wheel installation
## Test Plan
## Test Result
…
-
#30788 [refactor] Add prefix support to embed_tokens in DeepSeek MTP — ready,deepseek — by zzhx1 (合并于: 2025-12-18 12:45 (UTC+8)) [💬1 | +1/-0, 1 files | commented:1 approved:1]
## Purpose This PR adds proper weight prefixing to the embed_tokens embedding layer in DeepSeekMultiTokenPredictor using `maybe_prefix(prefix, "embed_tokens")`.
## Test Plan None ## Test Result None …
-
#30902 [compile] Fix CI for test_gpt2_cache_hit — ready — by zhxchen17 (合并于: 2025-12-18 12:22 (UTC+8)) [💬1 | +15/-6, 2 files | commented:1 approved:1] Signed-off-by: zhxchen17 zhxchen17@fb.com
## Purpose
Fixing torch 2.10 release CI issues from https://github.com/pytorch/pytorch/issues/170549
## Test Plan
pytest tests/compile/test_aot_compile.py …
-
#30071 [Quantization] Support Quark int4-fp8 w4a8 for MoE — rocm,ready — by BowenBao (合并于: 2025-12-18 12:20 (UTC+8)) [💬4 | +201/-2, 2 files | commented:6 approved:1] This PR extends support of Quark quantized model for int4-fp8 w4a8 quantization spec.
Specifically:
- Weights are double quantized: first at INT4 per channel, then further at FP8 per tensor.
- Activations are quantized dynamically in FP8 per tensor.
We observe large performance uplift and near full accuracy with Kimi-K2-Thinking quantized by Quark with int4-fp8 w4a8 on MI300x.
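The double-quantization idea above can be illustrated with a toy per-channel symmetric INT4 step (values clamped to [-8, 7]); this pure-Python sketch is only an illustration of the scheme, not Quark's implementation, and omits the subsequent FP8 per-tensor step.

```python
def quant_int4_per_channel(rows):
    """Quantize each channel (row) to symmetric INT4 with its own scale."""
    out = []
    for row in rows:
        scale = max(abs(v) for v in row) / 7.0   # per-channel scale (nonzero rows)
        q = [max(-8, min(7, round(v / scale))) for v in row]
        out.append((q, scale))
    return out

def dequant(q_rows):
    return [[q * s for q in row] for row, s in q_rows]

w = [[0.5, -1.0, 0.25], [2.0, 4.0, -3.5]]
qw = quant_int4_per_channel(w)
print(dequant(qw))   # approximate reconstruction of w
```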
-
#30716 fused_moe_lora PDL improvements — ready — by gnovack (合并于: 2025-12-18 11:55 (UTC+8)) [💬1 | +18/-12, 1 files | commented:3 approved:1] ## Purpose
The existing implementation of PDL for the `fused_moe_lora` kernel has a couple of issues which don’t allow us to realize the full gains of PDL:
### 1. `torch.zeros` between shrink and expand calls
The call to `torch.zeros` to initialize the intermediate cache occurs between the shrink and expand calls to `fused_moe_lora`. This means we cannot overlap the two `fused_moe_lora` calls via PDL.
-
[💬10 | +556/-212, 4 files | commented:8 approved:2] ## Overview
This PR addresses the following case, P tensor-parallel-size > D tensor-parallel-size.
I think it helps to differentiate two main cases
### MLA
For MLA model, the workflow is easier: each D worker reads from some other single P worker (fan-out reads to avoid all reading from same remote), as MLA cache is duplicated. Some P workers will not be read from at all.
…
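The MLA fan-out described above — each D worker reads from a single P worker, spread out so they do not all hit the same remote, with some P workers serving no reader — can be sketched with a simple round-robin policy (the policy and names are illustrative, not the PR's actual assignment logic):

```python
def assign_p_for_d(d_rank: int, num_p: int) -> int:
    """Pick the single P worker this D worker reads its MLA cache from."""
    return d_rank % num_p   # round-robin fan-out avoids one hot remote

assignments = [assign_p_for_d(d, num_p=8) for d in range(4)]
print(assignments)   # → [0, 1, 2, 3]: P workers 4..7 are never read from
```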
[关闭未合并 PR]
-
#18365 [Hardware][Intel-Gaudi] enable text embedding for Intel-Gaudi backend — needs-rebase,stale — by libinta (关闭于: 2025-12-19 10:14 (UTC+8)) [💬5 | +558/-215, 6 files] This PR adds the initial text embedding implementation on the Intel-Gaudi backend.
-
#18373 Adding basic dynasor functionality to vllm v1 scheduler — needs-rebase,stale,v1 — by PratishthaGaur (关闭于: 2025-12-19 10:14 (UTC+8)) [💬5 | +2404/-7, 4 files] This PR introduces initial support for Dynasor-style early exit in reasoning models within the vLLM v1 scheduler. It enables the system to dynamically insert probe requests that encourage the model to finalize its reasoning and produce an answer when it appears confident.
Key Behavior When the model has generated a multiple of 128 tokens, a probe request is created. The probe request appends a token to the current context, prompting the model to safely exit and produce a final answer if it is re…
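The probe trigger described above — a probe whenever the generated-token count hits a multiple of 128 — can be sketched as (helper name hypothetical):

```python
PROBE_INTERVAL = 128  # per the PR description: probe every 128 generated tokens

def should_probe(num_generated_tokens: int) -> bool:
    """True when a Dynasor-style probe request should be inserted."""
    return num_generated_tokens > 0 and num_generated_tokens % PROBE_INTERVAL == 0

print([n for n in (64, 128, 200, 256, 384) if should_probe(n)])  # → [128, 256, 384]
```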
-
#30965 [Perf][ROCm][AWQ] Improve performance of fused MoE GPTQ-AWQ and AWQ dequant kernels — rocm,needs-rebase — by yuttian1 (关闭于: 2025-12-19 10:08 (UTC+8)) [💬1 | +206/-356, 3 files | commented:2] Summary
This PR optimizes and tunes the following AWQ-related kernels on AMD ROCm platforms:
- `fused_moe_kernel_gptq_awq` (optimized and tuned by cuichang@amd.com)
- `awq_dequant`
The changes focus on reducing memory traffic, improving arithmetic efficiency, and updating tuned configurations for better performance on large-scale MoE workloads (e.g. Qwen3-235B-VL-AWQ).
…
-
#30986 Eagle 3 fix sometimes when you don’t set architectures you get “Model architectures [‘EagleLlamaModel’] are not supported for now. “ — documentation,performance,new-model,rocm,frontend,tpu,ci/build,v1,tool-calling,llama — by aidando73 (关闭于: 2025-12-19 04:08 (UTC+8)) [💬3 | +20070/-1312, 118 files | commented:2]
## Purpose
As per title
## Test Plan
## Test Result
…
-
#30894 Improve DCP error message with actionable guidance — documentation,v1 — by Dhruv-80 (关闭于: 2025-12-19 02:05 (UTC+8)) [💬6 | +107/-27, 5 files | commented:1] ## Purpose
Improve the error message shown when Decode Context Parallelism (DCP) is enabled with an attention backend that does not support returning softmax LSE values during decode.
Previously, this case raised a generic assertion error that did not provide actionable guidance to users. This change replaces the assertion with a user-facing exception that clearly explains the limitation and suggests how to resolve it (e.g., by selecting a different attention backend or disabling DCP). …
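The change described above — replacing a bare assertion with an exception that names the limitation and suggests remedies — follows a common pattern; a sketch with my own wording and a hypothetical function name:

```python
def check_dcp_support(backend_name: str, supports_decode_lse: bool) -> None:
    """Raise an actionable error instead of a bare assert (sketch)."""
    if not supports_decode_lse:
        raise ValueError(
            f"Decode Context Parallelism requires an attention backend that "
            f"returns softmax LSE during decode, but {backend_name!r} does "
            f"not. Select a different attention backend or disable DCP.")

try:
    check_dcp_support("SOME_BACKEND", supports_decode_lse=False)
except ValueError as e:
    print(e)   # user sees the limitation and two ways to resolve it
```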
-
#29755 Fix KV cache sync issue during CUDA graph replay — v1,kv-connector,nvidia — by yashwantbezawada (关闭于: 2025-12-18 21:22 (UTC+8)) [💬8 | +13/-0, 2 files | commented:8 changes:1 approved:1] Fixes #29608
I ran into this while looking at the async KV transfer code. When using connectors like LMCache with CUDA graphs, there’s a race condition during graph replay.
The issue is that `wait_for_layer_load()` calls happen inside the `@maybe_transfer_kv_layer` decorator on attention functions. During graph capture, these calls execute normally. But during replay, only the GPU operations are replayed - the Python decorator code gets skipped entirely. So the async KV loads might not be comp… -
#30954 [Kernel][AWQ] Optimize awq_gemm: K-group reuse for scales/zeros, increase pipeline stages, and simplify dequant math — needs-rebase — by yuttian1 (关闭于: 2025-12-18 20:43 (UTC+8)) [💬1 | +201/-245, 3 files | commented:2] What this PR changes 1) K-group optimization (reuse scales/zeros within AWQ group)
AWQ group size is 128 while the kernel K tile is 32.
Within a group, scales and zero-points are identical across multiple K tiles, so we reuse the same scales/zeros for the 4 consecutive K tiles inside the group.
This reduces global memory reads for scales and zero-points.
2) More pipeline stages …
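The reuse arithmetic above is simple: with an AWQ group size of 128 and a kernel K tile of 32, the 4 consecutive K tiles inside one group share the same scales/zero-points, so they can be loaded once per group.

```python
GROUP_SIZE, K_TILE = 128, 32
tiles_per_group = GROUP_SIZE // K_TILE
print(tiles_per_group)                       # → 4 K tiles share one scale/zero load

def group_of_tile(k_tile_idx: int) -> int:
    """AWQ group index that a given K tile falls into."""
    return (k_tile_idx * K_TILE) // GROUP_SIZE

print([group_of_tile(t) for t in range(8)])  # → [0, 0, 0, 0, 1, 1, 1, 1]
```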
-
#30796 [BugFix][Async] Clear spec tokens for requests re-entering input batch — v1 — by izhuhaoran (关闭于: 2025-12-18 17:59 (UTC+8)) [💬5 | +16/-4, 1 files | commented:1 | 📝草稿] ## Purpose
In async scheduling + spec, requests (re-entering the input batch) do not have pre-step draft tokens (since they were not running in the previous step). Therefore, any `scheduled_spec_decode_tokens` assigned by the scheduler for these requests are essentially invalid placeholders. Retaining them leads to unnecessary computation and potential unexpected behavior. -
#30945 [Feature] adpat step3 with eager mode — documentation,frontend,v1,deepseek — by InhabitancyCocoon (关闭于: 2025-12-18 16:07 (UTC+8)) [💬1 | +2981/-118, 31 files | commented:2]
## Purpose
## Test Plan
## Test Result
... -
#30643 Auto-rebase PRs older than 40 commits compared to main — ci/build — by khluu (关闭于: 2025-12-18 12:03 (UTC+8)) [💬2 | +7/-0, 1 files | commented:1] Closes https://github.com/vllm-project/vllm/issues/28609