vLLM Daily Report

[vLLM GitHub 开发动态] 2026-02-25

[概览]

时间窗口: 2026-02-25 11:28 (UTC+8) ~ 2026-02-26 11:28 (UTC+8)
新 issue: 25 (label 分布: bug:15, feature request:7, rocm:1, ci-failure:1)
关闭 issue: 25
新 PR: 71 (label 分布: v1:26, bug:16, ready:16, ci/build:12, nvidia:11)
合并 PR: 45
关闭未合并 PR: 28

[新 issue]

#35321 [Feature]: Encoder self-attention for RocmAttentionImpl — feature request,rocm — by micah-wil (创建于: 2026-02-26 02:44 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch

Encoder self-attention is not implemented for ROCM_ATTN. For example, when running the following test with ROCM_ATTN enabled instead of TRITON_ATTN: pytest -v -s entrypoints/openai/test_run_batch.py::test_empty_file

this error is raised: NotImplementedError: Encoder self-attention is not implemented for RocmAttentionImpl

### Alternatives

…
#35344 [Bug]: Value error, Model architectures [‘Qwen3_5MoeForConditionalGeneration’] are not supported for now — bug — by Dong09 (创建于: 2026-02-26 09:19 (UTC+8)) [💬6] ### Your current environment

The output of Collecting environment information... uv is set ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 ...
#35349 [Bug]: MiniMax-2.1: Missing <think> tag in the LLM response after tool calls succeed. — bug — by stingoChen (创建于: 2026-02-26 10:39 (UTC+8)) ### Your current environment

The output of python collect_env.py
```text (vllm-test) root@d9bc88d37413:/opt/vllm-test/vllm# python collect_env.py Collecting environment information... uv is set ============================== ...
#35337 [CI] AttributeError in SMControlContextManager: torch.cuda.current_device() returns int, not device object — ci-failure — by LucasWilkinson (创建于: 2026-02-26 06:19 (UTC+8)) ## Name of failing test
- v1/distributed/test_dbo.py::test_dbo_dp_ep_gsm8k[deepep_low_latency]
- v1/distributed/test_dbo.py::test_dbo_dp_ep_gsm8k[deepep_high_throughput]
## Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries …
#35300 [Feature]: Add ISA-level smoke tests using Intel SDE to catch instruction set mismatches — 无标签 — by wjhrdy (创建于: 2026-02-25 22:47 (UTC+8)) [💬2] ### Motivation

vLLM’s CPU backend compiles ISA-specific C kernels (AVX2, AVX512, AVX512BF16, AMX, NEON, VSX) selected at build time via environment variables. The default CI image is built with AVX512BF16 + AVX512VNNI + AMXBF16 enabled (see .buildkite/image_build/image_build_cpu.sh).

No test currently verifies the binary works on CPUs with different ISA levels. Running the resulting bina…
#35310 [Feature]: Qwen-ASR Forced Aligner — feature request — by jppm99 (创建于: 2026-02-26 01:27 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch

Will vLLM support Qwen3-ASR Forced-Aligner, allowing to timestamp audio transcription segments?

### Alternatives

No response

### Additional context

…

#35329 [Bug]: quantization=”mxfp4” produces incorrect results / hangs for MoE models at tensor_parallel_size=1 — bug — by zeryx (创建于: 2026-02-26 03:53 (UTC+8)) [💬1] ### Your current environment

- vLLM: 0.15.1 (`vllm/vllm-openai:latest`)
- Model: `Qwen/Qwen3-30B-A3B` (128 experts, all-MoE layers)
- GPU: 2x NVIDIA RTX 5880 Ada (SM89, 49 GB each)
- Also tested with a faithful tiny Qwen3-MoE replica (~190M params, same 128-expert architecture)

### 🐛 Describe the bug

## Bug Description   ...

#35336 [Refactor]: Make SSM backends use the null block (0) for padded requests instead of -1 — feature request — by LucasWilkinson (创建于: 2026-02-26 06:15 (UTC+8)) Currently we fill the block table with -1’s in the GPU model runner for padded requests for mamba/ssm backends:

https://github.com/vllm-project/vllm/blob/c97234c08b42326cf1e5ef024d9ac8441e0848b1/vllm/v1/worker/gpu_model_runner.py#L1778-L1780

To match the fact that the mamba kernels use PAD_SLOT_ID to indicate an unused block in the block table (for padded requests).

h…
#35335 [Feature]: Performance Optimization for Model Runner V2 — feature request — by yewentao256 (创建于: 2026-02-26 06:01 (UTC+8)) ### 🚀 The feature, motivation and pitch

Tasks:
- https://github.com/vllm-project/vllm/pull/35333
- https://github.com/vllm-project/vllm/pull/35214
- sorted + fromiter optimization
- More perf optimization exploration
### Alternatives

…
#35332 [Feature]: Profiler num_steps — feature request — by Oseltamivir (创建于: 2026-02-26 05:20 (UTC+8)) ### 🚀 The feature, motivation and pitch

/start_profile and /stop_profile are simple toggles with no parameters, some extra switchs (num_steps, merge_profiles, etc.) would be useful to profile a running server.

### Alternatives

No response

### Additional context

…
#35324 [Bug]: FusedMoE.weight_loader MXFP4 branch crashes on standard HuggingFace MoE checkpoints (per-expert 2-D weights) — bug — by zeryx (创建于: 2026-02-26 02:57 (UTC+8)) ### Your current environment
```
vLLM: 0.15.1 (`vllm/vllm-openai:latest` as of 2025-02-25)
Model: `Qwen/Qwen3-30B-A3B` (128 experts, per-expert weight layout)
GPU: NVIDIA RTX 5880 Ada (SM89), also reproducible on any GPU
```
### 🐛 Describe the bug

…
#35287 [Bug]: An error occurred when Qwen3.5 started the model using –quantization fp8 — bug — by Alibaba-HZY (创建于: 2026-02-25 19:44 (UTC+8)) [💬1] ### Your current environment

The output of python collect_env.py
```text Your output of `python collect_env.py` here ```

…
#35319 [Bug]: Multi-Node inference with PP > 1 crashes after processing completions request with non-None logprobs parameter. — bug — by JeanPaulShapo (创建于: 2026-02-26 02:31 (UTC+8)) [💬1] ### Your current environment

The output of python collect_env.py on the first node
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...
#35313 [Bug]: vLLM startup memory check mis-detects available VRAM (reclaimable OS memory) on UMA Systems — bug — by sunil1511 (创建于: 2026-02-26 01:51 (UTC+8)) ### 🐛 Describe the bug

## Issue Observed on UMA GPUs (DGX Spark and GH200)

I have observed this bug on both DGX Spark and GH200 systems.

### Problem Description

vLLM defaults to requesting 90% GPU memory utilization (--gpu-memory-utilization=0.9). While it’s typically easier to achieve 90% free memory on non-UMA systems, on UMA systems the memory can be occupied by paged cache and buffers which are reclaimable by the OS.

…
#35293 [Bug] GLM-5-FP8 Crash: CUDAGraph Replay Segmentation Fault — bug — by janssen-llm (创建于: 2026-02-25 21:00 (UTC+8)) [💬3] ### Your current environment

8*H200
```
  vLLM version：0.16.0rc2.dev123+gec12d39d4 + GLM5-FP8
```
```Bash vllm serve /path/to/GLM-5-FP8
–served-model-name GLM5
–tensor-parallel-size 8
…
#35268 [Bug]: Qwen3-VL-Rerank model’s rerank API does not support query with mixed image and text inputs. — bug — by Yimi81 (创建于: 2026-02-25 15:05 (UTC+8)) [💬2] ### Your current environment

vllm 0.14.0

### 🐛 Describe the bug

https://github.com/vllm-project/vllm/issues/32378 https://github.com/vllm-project/vllm/issues/33986

…
#35303 [Bug] CompressedTensorsWNA16MarlinMoEMethod registers g_idx params unconditionally, crashes with actorder=null AWQ MoE models — 无标签 — by jhsmith409 (创建于: 2026-02-25 23:47 (UTC+8)) ## Your current environment
- vLLM version: 0.16.0rc2.dev472+gee59a7c61 (nightly wheel from wheels.vllm.ai)
- PyTorch version: 2.10.0+cu130
- GPU: NVIDIA GeForce RTX 5090 (Blackwell, sm_120)
- NVIDIA driver: 590.48
- CUDA (container): 13.0
- OS: Ubuntu (Docker container based on vllm/vllm-openai:latest-cu130)
## Model …
#35285 [Bug]: Large Video Request cause vLLM Progress Core Dump — bug — by rhluo (创建于: 2026-02-25 19:12 (UTC+8)) [💬3] ### Your current environment

vllm 0.15.1 </code>

### 🐛 Describe the bug

...
#35295 [Bugfix Proposal]: Support MIG UUIDs in CUDA_VISIBLE_DEVICES (int() conversion and NVML handle resolution) — bug — by GanyBaba (创建于: 2026-02-25 21:25 (UTC+8)) ### Your current environment

5.14.0-611.27.1.el9_7.x86_64 / RHEL 9.7 (Plow)

Python 3.9.25

### 🐛 Describe the bug

Hi team, …
#35286 [Bug]: Qwen3.5-MoE failed with enable_lora — bug — by hjh0119 (创建于: 2026-02-25 19:42 (UTC+8)) ### Your current environment

The output of python collect_env.py
```text Collecting environment information... ============================== System Info ============================== ...
#35288 [Bug]: MTP speculative decoding produces corrupted output at concurrency >= 4 (V1 engine) — bug — by dkremez (创建于: 2026-02-25 19:52 (UTC+8)) ### Your current environment

The output of python collect_env.py
```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...
#35272 [Feature]: Make the Qwen3-ASR use request_prompt from user input — feature request — by minh-nguyenhoang (创建于: 2026-02-25 16:22 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch

Current implementation of Qwen3-ASR for completely ignore user prompt, which can easily be done by just append the request_prompt (maybe add validation so that the request_prompt follow Qwen3-ASR format, e.g. language {lang}<asr_text>{text} or if language is already provided (via to_lang), strip the language part from…
#35276 [Bug]: OpenAI transcribe prompt parameter with whisper return hallucinated transcription — bug — by Matht-j (创建于: 2026-02-25 17:19 (UTC+8)) [💬1] ### Your current environment

vllm 0.13.0

### 🐛 Describe the bug

When i prompt a short list (~10 elements) of Cities to Whisper to improve his spelling, i get a transcription full of hallucination (Repetitive word or syllable). I followed the guideline of prompting but still got no good result. I tried multiple kind of prompting (with and without <startofstranscript, , etc).

[Transcription without prompt.txt](ht…
#35266 [Bug]: Missing opening brace for Qwen3.5 streaming tool calls — bug — by AsterisMono (创建于: 2026-02-25 14:35 (UTC+8)) [💬2] ### Your current environment

The output of python collect_env.py
```text (vllm) root@oem:~# python collect_env.py Collecting environment information... uv is set ============================== ...
#35267 [Feature]: Integrate RadixMLP into vLLM — feature request — by Zyyeric (创建于: 2026-02-25 14:45 (UTC+8)) ### 🚀 The feature, motivation and pitch

I’m wondering whether vLLM would consider supporting RadixMLP-style intra-batch deduplication for the prefill path.

The high-level idea of RadixMLP is that it deduplicates position-wise computations (e.g., norms / linear projections / MLP-related activations) for tokens that have identical prefix, while running attention on the original expanded layout.

This shall be complimentary to vLLM’s existing prefix caching / KV-cache mechanisms.

That said, a…

[已关闭 issue]

#33369 [Bug]: Serving model in 0.15.0 Docker container hangs - 0.14.1 worked fine — bug — by sininspira2 (关闭于: 2026-02-26 10:15 (UTC+8)) [💬17] ### Your current environment

The output of python collect_env.py
I don't have all of the python libraries to run on the host, and running the script inside the docker container isn't working. I'm running vllm/vllm-openai:latest-cu130 The container has access to 2x RTX 6000 Pro Server Edition (Blackwell) ...
#32373 [Bug]: Fail to load vLLM on new NVIDIA driver — bug — by huydhn (关闭于: 2026-02-26 10:15 (UTC+8)) [💬40] ### Your current environment

I can’t run collect_env.py because importing PyTorch itself is failing.

### 🐛 Describe the bug

After https://github.com/vllm-project/vllm/pull/30784, I start seeing vLLM failing to load on its benchmark job. I suspect that the change doesn’t work with newer NVIDIA driver that the job is using to be compatible with CUDA 13.0

``` +—————————————————————————————–+ …
#29098 [Installation]: Failed building wheel for vllm — installation,stale — by aizyler (关闭于: 2026-02-26 10:46 (UTC+8)) [💬2] ### Your current environment

CMake Generate step failed. Build files cannot be regenerated correctly. Traceback (most recent call last): File “/root/anaconda3/envs/vllm/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py”, line 389, in main() File "/root/anaconda3/envs/vllm/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 373, in main json_out["return_val"] = hook(**hook_input["kwargs"]) ...
#25538 [Bug]: performance regression caused by frequently preempting and resuming a request — bug,stale — by ronny1996 (关闭于: 2026-02-26 10:16 (UTC+8)) [💬3] ### Your current environment

The output of python collect_env.py
```text Your output of `python collect_env.py` here ```

…
#26584 [Bug]: gpt-oss-20B output quality issues (I suspect a bug) — bug,stale — by br3no (关闭于: 2026-02-26 10:16 (UTC+8)) [💬7] ### Your current environment

The output of python collect_env.py
```text /home/xxxxx/src/py/vllm/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm._version' from vllm.version import __version__ as VLLM_VERSION Collecting environment information... ...
#27634 [Usage]: how to use –quantization option of vllm serve？ — usage,stale — by Septemberlemon (关闭于: 2026-02-26 10:16 (UTC+8)) [💬5] ### Your current environment

```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 Clang version : Could not collect CMake version : Could not collect …
#27636 [Usage]: vllm如何保留qwen3-vl中的special token — usage,stale — by qfs666 (关闭于: 2026-02-26 10:15 (UTC+8)) [💬2] ### Your current environment

我微调过的qwen3-vl模型的grounding格式为：<|object_ref_start|>图片<|object_ref_end|><|box_start|>(x1,y1),(x2,y2)<|box_end|> 使用vllm serve推理的格式是：图片(460,66),(683,252)，这个是直接忽略了special token么，是否有方法可以保留。

### How would you like to use vllm

I want to run inference of a specific model. I don’t know how to integrate it with vllm.

…
#27646 [Usage]: How to use vllm bench serve to bench remote deployed vllm models (can’t bench when ep enabled!!!) — usage,stale — by Valerianding (关闭于: 2026-02-26 10:15 (UTC+8)) [💬5] ### Your current environment

I deployed dpskv3 in a remote server using:
```
  export VLLM_USE_V1=1
  export VLLM_ALL2ALL_BACKEND=deepep_low_latency
  vllm serve /models/hf/models--deepseek-ai--DeepSeek-V3 --tensor-parallel-size 1 --data-parallel-size 8 --enable-expert-parallel --no-enforce-eager --load-format dummy
```
And on another server: …
#27656 [Bug]: Potential out-of-bounds access in paged_attention_v1.cu and paged_attention_v2.cu — bug,stale — by molly-ting (关闭于: 2026-02-26 10:15 (UTC+8)) [💬3] ### Your current environment

The output of python collect_env.py
```text Your output of `python collect_env.py` here ```

…
#27662 [Bug]: Decode ITL performance issue with DBO at batch size ~200 — bug,stale — by dagrayvid (关闭于: 2026-02-26 10:15 (UTC+8)) [💬2] ### Your current environment

The output of python collect_env.py
```text /opt/vllm/lib64/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import] Collecting environment information... ===========================...
#27667 [Usage]: DeepseekOCR on CPU missing implementation for fused_topk — usage,stale — by brainlag (关闭于: 2026-02-26 10:15 (UTC+8)) [💬2] ### Your current environment

Try to test if it is possible to run DeepseekOCR on CPU using current git main branch.

Fails because there is no implementation of fused_topk for CPU.

``` INFO 10-28 15:41:18 [v1/worker/cpu_model_runner.py:77] Warming up model for the compilation… ERROR: Traceback (most recent call last): File “/opt/venv/lib/python3.12/site-packages/starlette/routing.py”, line 677, in lifespan …
#27692 it run on rtx 5060 ti 16 gb — usage,stale — by bokkob556644-coder (关闭于: 2026-02-26 10:15 (UTC+8)) [💬3] ### Your current environment

https://github.com/bokkob556644-coder/suc-vllm-rtx-5060-ti-16-gb/blob/main/suc_vllm.txt

### How would you like to use vllm

I want to run inference of a [specific model](put link here). I don’t know how to integrate it with vllm.

…
#27696 [Bug]: OpenTelemetry Error on V1 — bug,stale — by chandlj (关闭于: 2026-02-26 10:15 (UTC+8)) [💬2] ### Your current environment

The output of python collect_env.py
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 Clang version : Could not collect ...
#35337 [CI] AttributeError in SMControlContextManager: torch.cuda.current_device() returns int, not device object — ci-failure — by LucasWilkinson (关闭于: 2026-02-26 10:01 (UTC+8)) ## Name of failing test
- v1/distributed/test_dbo.py::test_dbo_dp_ep_gsm8k[deepep_low_latency]
- v1/distributed/test_dbo.py::test_dbo_dp_ep_gsm8k[deepep_high_throughput]
## Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries …
#34395 [Bug]: AR+rms+fp4 fusion results in total accuracy collapse for DSV3-fp4 — bug,torch.compile,nvidia — by ProExpertProg (关闭于: 2026-02-26 09:59 (UTC+8)) [💬6] ### Your current environment

The output of python collect_env.py
```text Collecting environment information... uv is set ============================== System Info ...
#34259 [Bug]: gpt-oss triton_kernels crashes during startup with EP enabled (sm90 default) — bug — by mgoin (关闭于: 2026-02-26 05:33 (UTC+8)) [💬2] ### Your current environment

The output of python collect_env.py
```text Collecting environment information... uv is set ============================== System Info ...
#34526 [Bug]: accuracy issue when using multiconnector(Nixl+cpu offloading) — bug,kv-connector — by hsubramony (关闭于: 2026-02-25 17:26 (UTC+8)) [💬4] ### Your current environment

The output of python collect_env.py
```text Your output of `python collect_env.py` here ```

…
#34891 [Bug]: RuntimeError: [SymmDeviceMemory] Device does not support multicasting. — bug — by RocketRider (关闭于: 2026-02-25 23:31 (UTC+8)) [💬15] ### Trying to load Qwen3.5-397B-A17B-FP8 causes “RuntimeError: [SymmDeviceMemory] Device does not support multicasting” on 4xH200 GPUs

Startparams: “–trust_remote_code”, “–enable-auto-tool-choice”, “–tool-call-parser”, “qwen3_coder”, “–reasoning-parser”, “qwen3”, “–limit_mm_per_prompt.image”, “4”, “–limit_mm_per_prompt.video”, “0” ,”–tensor-parallel-size”, “4”

Log
...
#35167 [CI] EAGLE speculative decoding accuracy below 60% threshold — ci-failure — by LucasWilkinson (关闭于: 2026-02-25 20:51 (UTC+8)) ## Name of failing test
- v1/e2e/test_spec_decode.py::test_eagle_correctness[FLASH_ATTN-llama3_eagle3]
## Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries
…
#35168 [CI] Ngram speculative decoding accuracy below 66% threshold — ci-failure — by LucasWilkinson (关闭于: 2026-02-25 20:51 (UTC+8)) [💬1] ## Name of failing test
- v1/e2e/test_spec_decode.py::test_ngram_and_suffix_correctness[speculative_config0]
## Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries
…
#35140 [CI] Ultravox audio model HuggingFace reference produces invalid output with NaN logprobs — ci-failure — by LucasWilkinson (关闭于: 2026-02-25 20:06 (UTC+8)) [💬4] ## Name of failing test
- models/multimodal/generation/test_common.py::test_audio_models[ultravox-test_case0]
## Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries
…
#27821 [RFC]: Reorganizing ViT Abstraction and Attention Selection Logic — RFC,stale — by tjtanaa (关闭于: 2026-02-25 14:38 (UTC+8)) [💬3] ### Motivation.

This RFC is aimed to address the following issues:
1. The ViT right now is still pretty coupled with Text backbone attention. This RFC will further the effort to decouple the ViT and the text backbone attention.
2. Another pain point is that the overriding of the ViT logic is scattered all around the places. We should avoid doing ViT logic overriding in model definition classes. The platform class should define the logic of what ViT is supported and how it should be overwritten…
#34941 [CI Failure]: Intel GPU Test : examples/offline_inference/basic/generate.py — ci-failure — by varun-sundar-rabindranath (关闭于: 2026-02-25 14:22 (UTC+8)) [💬4] ### Name of failing test

python3 examples/offline_inference/basic/generate.py –model facebook/opt-125m –block-size 64 –enforce-eager

### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in transformers)
…
#35252 [Installation]: No module named ‘vllm.entrypoints.anthropic.serving_messages’ — installation — by timtimyim (关闭于: 2026-02-25 12:15 (UTC+8)) [💬1] ### Your current environment

```============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version : Could not collect CMake version : Could not collect Libc version : glibc-2.35 …
#34296 [Usage]: vllm/vllm-openai:v0.15.1 No CUDA GPUs are available,0.10.1.1 is ok — usage — by ooodwbooo (关闭于: 2026-02-25 11:39 (UTC+8)) [💬11] ### Your current environment

wsl nvidia-smi ``` root@DESKTOP-G31254H:/mnt/c/Users/CTOS# nvidia-smi Wed Feb 11 09:04:50 2026 +—————————————————————————————–+ | NVIDIA-SMI 575.51.02 Driver Version: 576.02 CUDA Version: 12.9 | |—————————————–+————————+———————-+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | …

[新 PR]

#35350 [Model Runner V2] Add model states [1/N] — v1,nvidia — by WoosukKwon (创建于: 2026-02-26 10:47 (UTC+8)) [💬2

+124/-101, 4 files

commented:1]

#35352 [Bug] Fix missing tag after tool call in MiniMax 2.1 — bug — by stingoChen (创建于: 2026-02-26 11:19 (UTC+8)) [💬1 | +6/-1, 1 files | commented:1] Fix: #35349 ## Purpose This PR improves the reasoning end detection logic in MiniMaxM2AppendThinkReasoningParser.is_reasoning_end. After a successful tool call, the following line incorrectly evaluates to `True`, which causes the subsequent `` tag to be missing.

https://github.com/vllm-project/vllm/blob/86c3b5a808506e325fd7e59d86d83170fc98c93c/vllm/entrypoints/openai/chat_completion/serving.py#L816

## Test Plan ```shell CUDA_VISIBLE_DEVICES=3,4,5,6 python3 -m vllm.entrypoints.openai.a…
#35346 Cpu dispatcher — documentation,performance,rocm,needs-rebase,ci/build,v1,cpu,nvidia — by majian4work (创建于: 2026-02-26 10:32 (UTC+8)) [💬5 | +2700/-516, 37 files | commented:1] Rebase https://github.com/dtrifiro/vllm/tree/cpu-build-dispatcher-cleanup
#35274 Remove bc-lint — ready,ci/build,v1 — by hmellor (创建于: 2026-02-25 16:53 (UTC+8)) [+0/-118, 5 files | commented:1 approved:2] The bc-lint.yml GitHub Actions workflow was added to vLLM as a trial. It was only ever added to the following classes in vLLM:
- NewRequestData
- CachedRequestData
- SchedulerOutput
Therefore, it’s coverage is very low. Also, it is consuming significant GitHub Actions resources. If we look at https://github.com/vllm-project/vllm/actions/metrics/usage we see that it is consuming:
- ~35% of all workflow runs
- ~15% of all workflow minutes …
#35351 [XPU] special handle for pooler models w8a16 gemm — 无标签 — by yma11 (创建于: 2026-02-26 11:05 (UTC+8)) [+4/-0, 1 files | commented:1] ## Purpose In case of pooler models, pooled_data may change to head_dtype like float which fp8_gemm_w8a16 doesn’t support, so we make a force convert in this PR.

## Test Plan

## Test Result

Essential Elements of an Effective PR Description Checklist
...
#35302 [Bugfix][Hardware][AMD] Support all MoE activations in WNA16 quantization on ROCm — bug,rocm — by c0de128 (创建于: 2026-02-25 23:36 (UTC+8)) [💬2 | +1/-5, 1 files | commented:1] ## Summary

Removes the SiLU-only assertion in MoeWNA16Method.apply() and passes layer.activation through to fused_experts(), enabling GPTQ INT4 MoE models with non-SiLU activations (e.g., SWIGLUSTEP) to run on ROCm.

On ROCm, Marlin MoE is disabled (check_moe_marlin_supports_layer returns False), so all GPTQ INT4 MoE models go through the WNA16 path. The WNA16 path had a hardcoded assert layer.activation == MoEActivation.SILU that blocked models using other activations, even though …
#35308 [compile] Fix caching error over pytree slice node. — 无标签 — by zhxchen17 (创建于: 2026-02-26 01:13 (UTC+8)) [+19/-2, 1 files | commented:2] Summary:

During compile caching we will have error pickling graph module containing node like op(node, slice(0, symint_0), node) because we use pytree to internally tree map nodes to _NodePickleData, so slice() is treated as leaf node. Patching vllm to handle slice() as non-leaf before we fix it in torch 2.12.

Test Plan: vllm serve tiny-random/qwen3-next-moe –enable-expert-parallel –data-parallel-size 2 –compilation-config ‘{“cudagraph_mode”:”FULL”}’ –disable-log-requests –port …
#35348 Fix stale self._positions during piecewise CUDA graph replay in MLA — frontend,v1,nvidia — by tlrmchlsmth (创建于: 2026-02-26 10:37 (UTC+8)) [+380/-45, 14 files | commented:1 | 📝草稿] ## Summary
- Fixes a shape mismatch crash in rotary embedding (RuntimeError: The size of tensor a (8192) must match the size of tensor b (475)) when running DeepSeek-R1 with FP8 KV cache, piecewise CUDA graphs, and the fused RoPE+quant path
- Root cause: self._positions = positions in MLAAttention.forward() is a Python-level side effect inside a piecewise-compiled graph. During CUDA graph replay, this assignment is not re-executed, leaving self._positions pointing to a stale view from a…
#35347 Fix Qwen 3.5 tool calling problem — qwen — by sunqingn7 (创建于: 2026-02-26 10:34 (UTC+8)) [💬2 | +482/-0, 2 files | commented:1]

## Purpose Fix Qwen 3.5 tool calling problem, which complains the json malformated

## Test Plan for example run vllm serve Qwen/Qwen3.5-27B –port 8000 –tensor-parallel-size 1 –max-model-len 262144 –reasoning-parser qwen3 –speculative-config ‘{“method”:”qwen3_next_mtp”,”num_speculative_tokens”:2}’ –enable-auto-tool-choice –tool-call-parser qwen35_coder and use tool calling

…
#35339 [Misc][Harmony] Move Responses API only harmony utils to responses/harmony.py — frontend,ready,gpt-oss — by sfeng33 (创建于: 2026-02-26 06:59 (UTC+8)) [💬2 | +1040/-990, 6 files | commented:1 approved:1] ## Purpose parser/harmony_utils.py had grown into a mixed bag — it contained both shared Chat Completion parsing logic and Responses API-specific conversion code (input→harmony, harmony→response output items). This meant the parser/ module had direct dependencies on Responses API types (ResponseFunctionToolCall, ResponseOutputMessage, McpCall, etc.) that don’t belong there.

### This PR
- Extracts Responses API-specific Harmony conversion logic from parser/harmony_utils.py into a d…
#35257 [Bugfix] Fix placeholder detection when tokenizer merges boundary tokens (#29863) — bug,multi-modality,qwen — by haosdent (创建于: 2026-02-25 11:28 (UTC+8)) [💬2 | +219/-1, 2 files | commented:8] ## Purpose

Fixes #29863.

When a Qwen3-VL prompt ends with punctuation (e.g., . or :) immediately before a video placeholder, the HF processor expands the placeholder into timestamp markers like <0.3 seconds><|vision_start|>.... The < in the timestamp text merges with the preceding punctuation (e.g., .< becomes a single token), breaking exact token matching in _find_mm_placeholders.

### Approach

Add a _find_mm_placeholders override in Qwen3VLMultiModalProcessor following the s…
#35271 [Feat] Add CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function — v1,nvidia — by chaunceyjiang (创建于: 2026-02-25 16:19 (UTC+8)) [💬1 | +175/-28, 3 files | commented:4] ## Purpose

FIX https://github.com/vllm-project/vllm/issues/35021

## Test Plan ``` # pip uninstall deep_gemm # vllm serve /mnt/data4/models/deepseek-ai/DeepSeek-V3___2 -tp=8 –tokenizer-mode deepseek_v32 –enable-auto-tool-choice –tool-call-parser deepseek_v32 –reasoning-parser deepseek_v3 –enforce-eager … …
#35260 [Frontend] Prepare diarized transcription using OpenAI contract with diarized_json and stream events — frontend — by lucaschr21 (创建于: 2026-02-25 11:57 (UTC+8)) [💬1 | +483/-20, 6 files | commented:1] ## Purpose

This PR prepares vLLM’s OpenAI-compatible speech-to-text frontend for diarized transcription outputs by adding support for response_format="diarized_json" and its related streaming event contract.

The goal is to make diarized transcription behavior ready and standardized so that models that can transcribe and diarize can plug into this path and return responses/events aligned with OpenAI’s contract.
For example, models such as vibevoice-asr (which support transcription +…
#35338 [Bugfix] Fix AttributeError in SMControlContextManager — bug,ready,v1 — by LucasWilkinson (创建于: 2026-02-26 06:33 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:2] FIX #35337

## Summary

Fixes AttributeError: 'int' object has no attribute 'index' in SMControlContextManager.__init__() when running with data parallel and expert parallel enabled (DBO tests).

## Root Cause

PR #35042 introduced a bug where torch.cuda.current_device().index is called, but torch.cuda.current_device() returns an int directly, not a device object with an .index attribute.

…
#35289 [Bugfix] [Qwen3.5]Fix Qwen3.5 FP8 quantization: tuple shard_id weight loading — bug,qwen — by Alibaba-HZY (创建于: 2026-02-25 19:57 (UTC+8)) [💬2 | +23/-4, 2 files | commented:1] ## Purpose Fix #35287 Fix online FP8 quantization (--quantization fp8) for Qwen3.5 models. ### Bug — Weight loading crash: NotImplementedError for tuple shard_id

Qwen3.5’s GDN block uses stacked_params_mapping with a tuple shard_id (e.g. ("in_proj_qkvz", "in_proj_qkv", (0, 1, 2))), indicating that the checkpoint weight covers multiple output shards simultaneously.

MergedColumnParallelLinear.weight_loader does not support tuple shard_id …
#35314 [Bug] Fix outdated links in source code — bug,documentation,ready,ci/build — by yewentao256 (创建于: 2026-02-26 01:56 (UTC+8)) [💬1 | +11/-7, 5 files | commented:1 approved:1] ## Purpose

Fix outdated links in source code, replaced with correct ones
#35345 Fix MoE models in EP mode on Ascend — 无标签 — by ylyhyqsl (创建于: 2026-02-26 09:39 (UTC+8)) [+10/-6, 2 files | commented:1] Fix MoE models using bare tensor_model_parallel_all_reduce produce garbage output in EP mode on Ascend#6710 ## Purpose MoE models that use FusedMoE with reduce_results=False and then manually call tensor_model_parallel_all_reduce() in their MoE forward method produce garbage output when running with –enable-expert-parallel on Ascend NPU. ## Suggested fix The fix should be applied in the upstream vLLM model files. Replace:

tensor_model_parallel_all_reduce(final_hidden_states) with:

self.expert…
#35277 [Bugfix] [DeepSeek-V3.2] fix sparse_attn_indexer weights padding — bug,deepseek — by kebe7jun (创建于: 2026-02-25 17:22 (UTC+8)) [💬1 | +7/-1, 1 files | commented:1] ## Purpose

This PR https://github.com/vllm-project/vllm/pull/29287 does not include https://github.com/vllm-project/vllm/pull/32175, so, it will cause batch_size_next_n == batch_size * next_n

cc @ganyi1996ppo

## Test Plan

``` # Prefill …

#35343 feat(kv-offload): Strategy B — AdaptiveOffloadingPolicy + non-blocking loads — v1,kv-connector — by Srinivasoo7 (创建于: 2026-02-26 08:59 (UTC+8)) [💬2 | +482/-15, 4 files | commented:1] Reduces TTFT regression from CPU KV cache loading via two mechanisms:

AdaptiveOffloadingPolicy (scheduler-side):

Measures TTFT during a warm-up phase (no offloading) to capture a clean baseline.
After warm-up, monitors rolling P50 TTFT over a sliding window.
Auto-pauses offloading if TTFT regresses > overhead_threshold_pct (default 5%) above baseline; auto-resumes when it clears.

Propagates effective_load_mode (‘blocking’

‘async_with_fallback’) to the worker each step via…

#35342 feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency — v1 — by Srinivasoo7 (创建于: 2026-02-26 08:58 (UTC+8)) [💬2 | +630/-2, 6 files | commented:1] Adds a decorator OffloadingManager that transparently wraps any backing manager (LRU, ARC) and suppresses GPU->CPU stores for blocks whose hash has not been seen at least store_threshold (default: 2) times.

New files:
- vllm/v1/kv_offload/reuse_tracker.py – BlockReuseTracker (O(1) LRU gate)
- vllm/v1/kv_offload/reuse_manager.py – StoreReusedOffloadingManager
Modified:
- vllm/v1/kv_offload/cpu.py – get_manager() wraps backing with StoreR…
#35261 [MIX] More accurate processing of quantization of mlp.gate — qwen — by vadiklyutiy (创建于: 2026-02-25 12:55 (UTC+8)) [💬4 | +15/-1, 2 files | commented:1] Addon to #35156.

Mark mlp.gate as non-quant for ModelOpt models only.
#35259 Feature/zero cost kv offload — v1,kv-connector — by Srinivasoo7 (创建于: 2026-02-25 11:42 (UTC+8)) [💬11 | +1092/-17, 9 files | commented:1] ## Purpose

Reduce CPU KV cache offloading overhead to near-zero on workloads where blocks are not reused, while preserving full offload benefits for prefix-cache-friendly workloads.

The existing OffloadingConnector unconditionally stores every evicted GPU block to CPU and blocks the scheduling step waiting for PCIe transfers to complete. On one-shot (non-reuse) traffic this wastes PCIe bandwidth with zero benefit. This PR introduces three coordinated strategies:

### Strategy A (P0) — Skip U…
#35270 [XPU][NIXL] Add GPUDirect RDMA support for XPU — ci/build,kv-connector — by zhenwei-intel (创建于: 2026-02-25 15:52 (UTC+8)) [💬3 | +18/-7, 4 files | commented:8]

## Purpose

Add GPUDirect RDMA support for XPU in NIXL connector.

Requirements：
- UCX must include the fix from https://github.com/openucx/ucx/pull/11187.
Limitations: …
#35341 [Bug] FA2 is not supported for NVIDIA Blackwell architecture — bug,v1,nvidia — by olka (创建于: 2026-02-26 07:53 (UTC+8)) [💬1 | +3/-0, 1 files | commented:1] ## Purpose At the moment FA2 support in Blackwell is broken: ``` Cannot use FA version 2 is not supported due to FA2 is unavaible due to: SM 12.x not supported by flash-attn FA2 Cannot use FA version 2 is not supported due to FA2 is unavaible due to: SM 12.x not supported by flash-attn FA2 Cannot use FA version 2 is not supported due to FA2 is unavaible due to: SM 12.x not supported by flash-attn FA2 Cannot use FA version 2 is not supported due to FA2 is unavaible due to: SM 12.x not supported …
#35278 [Bugfix] Fix broken online quantization due to weights loading tracking — bug,ready,needs-rebase — by Isotr0py (创建于: 2026-02-25 17:28 (UTC+8)) [💬1 | +7/-3, 1 files | commented:1 approved:1] ## Purpose
- Fix https://github.com/vllm-project/vllm/pull/35074#issuecomment-3957466170
- Also checking if there’re other breakpoints not covered by default CI run.
## Test Plan
```
  python3 examples/offline_inference/basic/generate.py --model /mnt/data0/LLM/Qwen3-0.6B --quantization fp8
```
## Test Result …
#35340 [Feature] Add num_steps and delay_steps parameters to /start_profile endpoint — frontend,v1,cpu — by sarathbrp (创建于: 2026-02-26 07:14 (UTC+8)) [💬2 | +225/-23, 10 files | commented:1] ## Purpose

Adds runtime-configurable num_steps and delay_steps parameters to the /start_profile API endpoint, allowing users to override the server’s max_iterations and delay_iterations profiler config on a per-session basis without restarting the server.

Closes https://github.com/vllm-project/vllm/issues/35332

## Test Plan

```bash pytest tests/v1/worker/test_gpu_profiler.py -v …
#35322 [ROCm][CI] Amending deletion of AMD mirror — rocm,ready,ci/build — by AndreasKaratzas (创建于: 2026-02-26 02:47 (UTC+8)) [+5/-0, 1 files | commented:1 approved:1] This mirror was deleted accidentally from: https://github.com/vllm-project/vllm/pull/34923

This PR reverts that.
#35265 [ROCm][CI] Extending attention backend coverage for Eagle spec decode tests — rocm,ready,ci/build,v1 — by AndreasKaratzas (创建于: 2026-02-25 14:19 (UTC+8)) [💬1 | +314/-150, 4 files | commented:4 approved:1] This PR fixes the attention backend handling in test_eagle_correctness for ROCm platforms:

Previously, the test would skip entirely when running Llama-4-Scout with FLASH_ATTN on ROCm. Instead of skipping, we now fall back to the FLEX_ATTENTION backend explicitly, allowing the test to actually run on ROCm.

Also previously, the test skipped DeepSeek models when using ROCM_AITER_FA on ROCm. Now we enable the AITER path (VLLM_ROCM_USE_AITER=1), unset VLLM_MLA_DISABLE, and switch to the…
#35333 [Perf] Optimize model runner v2 prepare_inputs copy logic, 6.1% E2E throughput improvement — ready,v1,v2 — by yewentao256 (创建于: 2026-02-26 05:58 (UTC+8)) [+47/-20, 1 files | commented:1] ## Purpose

Part of https://github.com/vllm-project/vllm/issues/35335
- Pre-allocate the gpu/cpu memory
- Optimize copy logic (temple pin in async_copy_to_gpu is expensive)
## Test

```bash …
#35334 [ROCm] Enabling encoder and encoder-decoder on ROCm and AITER unified backends — documentation,rocm,v1 — by gshtras (创建于: 2026-02-26 05:59 (UTC+8)) [💬1 | +106/-7, 3 files | commented:1] Another prerequisite to #33271 With this unit tests that use encoder attention such as test_run_batch.py and encoder-decoder models such as examples/offline_inference/audio_language.py -m whisper now work with ROCM_ATTN and ROCM_AITER_UNIFIED_ATTN
#35328 [Core] Move ray-specific WorkerWrapperBase methods to RayWorkerWrapper — ready,v1 — by njhill (创建于: 2026-02-26 03:45 (UTC+8)) [+24/-29, 2 files | commented:1] Noticed this while working on other items.

#35325 [Model Runner V2] Add coding style guide — ready,v1 — by WoosukKwon (创建于: 2026-02-26 03:00 (UTC+8)) [+17/-0, 1 files

commented:1 approved:2]

#35331 [Bugfix] Fix crash with EPLB + NVFP4 — bug,nvidia — by jasonlizhengjian (创建于: 2026-02-26 04:50 (UTC+8)) [💬1 | +3/-3, 2 files | commented:1 | 📝草稿]

## Purpose

### Bug 1: expand() creates non-contiguous input scale tensors, breaking EPLB
```
In `prepare_nvfp4_moe_layer_for_fi_or_cutlass` (`flashinfer_fp4_moe.py:514-515`), input scales are broadcast via `expand()`:

```python
a13_scale = a13_scale.max().to(torch.float32).expand(num_experts)   ...
```
#35330 [Perf] Optimize maxsim scores computation for pooling models, 13.9% E2E throughput improvement — frontend,ready — by yewentao256 (创建于: 2026-02-26 04:41 (UTC+8)) [+123/-11, 3 files | commented:2] ## Purpose

Optimize maxsim scores computation for pooling models

Originally it is calculated in CPU, now we calculate it in GPU and using the batched version, so we get a lot of performance improvement

## Test

### Acc

…
#35275 [Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic — bug,ready — by Josephasafg (创建于: 2026-02-25 16:54 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose Fixed SSM state corruption for Mamba models under high concurrency.

SSMParamsBase::index_t was uint32_t, causing cache_index * ssm_states_batch_stride to overflow for large cache indices. This made the kernel write SSM state to wrong memory addresses, leaving the intended slots zeroed. Subsequent decode steps read stale/zero state, producing garbage/incorrect output.

The fix changes index_t from uint32_t to size_t (64-bit). ## Test Plan

## Test Result …
#35298 [BugFix][XPU] Fix speculative decoding on Intel XPU due to bug with IGC_ForceOCLSIMDWidth=16 — bug — by ofirzaf (创建于: 2026-02-25 21:36 (UTC+8)) [💬2 | +0/-3, 1 files | commented:1 approved:1] ## Purpose The environment variable setting IGC_ForceOCLSIMDWidth=16 was introduced in #30538 to reduce memory usage when compiling sample_recovered_tokens_kernel.

With the introduction of copy_and_expand_eagle_inputs_kernel in #32887 speculative decoding with draft model stopped working on XPU.

Furthermore, speculative decoding with sampling (T>0) didn’t work with either Eagle/draft model, and Qwen3 + SD didn’t work due to the same issue of OOM in scratchpad.

It turns out that there i…
#35327 Fix deprecated v1 config tests — 无标签 — by jcaip (创建于: 2026-02-26 03:41 (UTC+8)) [+1/-17, 1 files | commented:1 approved:1] Updated the model name in test_pre_quantized_model and removed the test_opt_125m_int4wo_model_per_module_quant function.

## Purpose

We recently deprecated the v1 configs for Int4WeightOnlyConfig and Float8DynamicActivationFloat8WeightConfig, but there are two model checkpoints still using the v1 configs in the tests:
- drisspg/fp8-opt-125m
- jerryzh168/opt-125m-int4wo-per-module
This PR changes drisspg/fp8-opt-125m to use [torchao-testing/opt-125m-Float8WeightOnlyConfig-v2-0.15.0](https:…
#35317 Add @BoyuanFeng to CODEOWNERS — ci/build — by BoyuanFeng (创建于: 2026-02-26 02:19 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:3] I am happy to maintain vLLM-torch.compile integration and review PRs.

Thank you all for the opportunity! Special thanks to @zou3519 @ProExpertProg @youkaichao @LucasWilkinson @robertgshaw2-redhat @yewentao256 @mgoin @hmellor @njhill @tlrmchlsmth @bigPYJ1151 @heheda12345 @houseroad @ywang96, and many more for their collaboration and support!
#35320 [CI] Add nsys profiling support with NVTX tracing for Buildkite CI — needs-rebase,ci/build — by ojhaanshika (创建于: 2026-02-26 02:40 (UTC+8)) [💬1 | +102/-0, 4 files | commented:1 | 📝草稿] ## Purpose

Add nsys profiling support for Buildkite CI, allowing developers to profile specific test steps with NVIDIA Nsight Systems.

This introduces:
- VLLM_ENABLE_LAYERWISE_NVTX_TRACING environment variable in vllm/envs.py to enable layerwise NVTX tracing without requiring the --enable-layerwise-nvtx-tracing CLI flag
- A model validator in ObservabilityConfig that automatically applies the env var override
- .buildkite/scripts/run-with-profiling.sh — a wrapper script that conditio…
#35316 [ROCm][Quantization] add quark w4a8 for LinearLayer — rocm,needs-rebase,gpt-oss — by divakar-amd (创建于: 2026-02-26 02:11 (UTC+8)) [💬1 | +303/-3, 5 files | commented:3 approved:1 | 📝草稿] weights: mxfp4 with static scales activations: fp8
- Temporary hack to be removed: Force MoE to use un-quantized path instead of Quark at this location.
#35264 [KV Connector]: Support KV push from Prefill to Decode node using Nixl KV Connector — v1,kv-connector — by snadampal (创建于: 2026-02-25 13:44 (UTC+8)) [💬1 | +830/-13, 4 files | commented:1]

Implemented KV push feature where Prefill node pushes its KV blocks to Decode node as soon as the model executor completes the forward pass and finishes request

The implemenation covers both the scenarios: Scenario 1: D registers blocks with P before P finishes generating KV Scenario 2: P has the KV ready before D registers

## Purpose To improve TTFT for P-D disaggregated inference deployment. …
#35326 [MoE Refactor] Initial split of DefaultMoERunner class — nvidia — by bnellnm (创建于: 2026-02-26 03:39 (UTC+8)) [+1305/-823, 33 files | commented:1 | 📝草稿] ## Purpose Split DefaultMoERunner into two more classes:
- MoERunnerBase that contains common code for all runners.
- ChunkedMoERunner a runner that handles DP chunking. Inherits from MoERunnerBase
DefaultMoERunner inherits from MoERunnerBase and only handles the non-chunked/naive execution path.

Based off https://github.com/vllm-project/vllm/pull/35153

## Test Plan …
#35311 [Spec Decode] Finite state machine speculative decoding — documentation,speculative-decoding,v1 — by yeonsw (创建于: 2026-02-26 01:36 (UTC+8)) [💬4 | +1163/-5, 11 files | commented:4]

## Purpose

Add FSM (Finite State Machine) speculative decoding support to vLLM v1. This enables fast-forwarding through deterministic token sequences by using a state machine to propose draft tokens, reducing latency for structured outputs like JSON or templated text.

Key features:
- FSM-based draft token proposal with no rejection for deterministic paths
- Support for branching (multiple valid next tokens) and wildcards (any token allowed)
- Custom FSM definitio…
#35299 [Bugfix] Fix weight loading tracking for Marlin MoE quantized models — bug,needs-rebase — by alexbi29 (创建于: 2026-02-25 21:47 (UTC+8)) [💬2 | +47/-6, 5 files | commented:3] ## Summary

PR #35074 introduced track_weights_loading() to validate all registered parameters are loaded from checkpoint. This broke loading of GPTQ/AWQ MoE models using the Marlin backend with the following error:

``` ValueError: Following weights were not initialized from checkpoint: {‘model.layers.0.mlp.experts.w13_g_idx_sort_indices’, ‘model.layers.0.mlp.experts.w2_g_idx_sort_indices’, ‘model.layers.0.mlp.experts.w13_weight_g_idx’, ‘model.layers.0.mlp.experts.w2_weight_g_idx’, …} …
#35306 [Benchmark] Simplify SLA scan — documentation,performance — by DarkLight1337 (创建于: 2026-02-26 00:16 (UTC+8)) [💬3 | +239/-795, 7 files | commented:2] ## Purpose

Based on the logic of https://github.com/vllm-project/guidellm/blob/v0.5.3/src/guidellm/benchmark/profiles.py#L575, we can greatly simplify the logic of SLA scan. Instead of trying to match each SLA threshold directly, we can just scan through the space of latency vs. throughput uniformly and let users plot the figures themselves to get the SLA thresholds for whatever metrics they want.
- Removed --sla-params, and replace with --sla-iters to control how to sweep through the spac…
#35323 Add Qwen3.5-MoE (qwen3_5_moe) support via Qwen3-Next backend — documentation,new-model,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,qwen,kv-connector — by blake-snc (创建于: 2026-02-26 02:56 (UTC+8)) [💬4 | +2597/-363, 38 files | commented:1] ## Summary

Adds support for Qwen3.5-MoE models (model_type: qwen3_5_moe, Qwen3_5MoeForConditionalGeneration) which use the same hybrid GDN (GatedDeltaNet) architecture as Qwen3-Next but ship with a different HuggingFace config layout and checkpoint weight format.

Key changes:
- Config: Qwen3_5MoeConfig that flattens the outer text_config wrapper and delegates to Qwen3NextConfig
- Registry: Register qwen3_5_moe model type and both ForConditionalGeneration / `ForCausal…
#35318 [Refactor] Remove dead or duplicate func utils or variables — performance,ready,v1,nvidia — by yewentao256 (创建于: 2026-02-26 02:21 (UTC+8)) [+0/-199, 9 files | commented:1] ## Purpose

Remove dead or duplicate func utils or variables
#35315 [draft] cuda13 CI update — ci/build,llama,nvidia — by atalman (创建于: 2026-02-26 02:08 (UTC+8)) [💬1 | +92/-60, 14 files | commented:1]

## Purpose

## Test Plan

## Test Result

...
#35312 [Core] Fix gpu_worker.py pre-commit errors — ready,v1 — by njhill (创建于: 2026-02-26 01:41 (UTC+8)) [+6/-2, 1 files | commented:1 approved:1] Errors appeared when making some unrelated modifications to the file.
#35309 Revert “[Misc] Enable weights loading tracking for quantized models” — ready — by LucasWilkinson (创建于: 2026-02-26 01:16 (UTC+8)) [+4/-15, 1 files | commented:1 approved:1] Reverts vllm-project/vllm#35074, this PR broke a ton of quantized models weight loading in the nightly https://buildkite.com/vllm/ci/builds/53087/ .

We should re-land this with ready-run-all-tests

cc @Isotr0py @DarkLight1337
#35297 [Model] Add nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding model — documentation,new-model,multi-modality,llama,nvidia — by jzakrzew (创建于: 2026-02-25 21:34 (UTC+8)) [💬4 | +545/-31, 8 files | commented:4] ## Purpose Add support for the nvidia/llama-nemotron-embed-vl-1b-v2 embedding model. The model is quite similar to already implemented LlamaNemotronVLChatModel, but not exactly compatible. I tried to reuse as much of the existing code as possible, modifying the code in nemotron_vl.py to facilitate that.

## Test Plan ``` pytest tests/models/multimodal/pooling/test_llama_nemotron_vl_embed.py tests/mod…
#35307 [CI][HPU] Pin vllm to last-good-commit compatible with vllm-gaudi — ci/build — by PatrykWo (创建于: 2026-02-26 00:21 (UTC+8)) [💬3 | +13/-1, 1 files | commented:1] The HPU CI test installs the vllm-gaudi plugin on top of the vllm upstream checkout. Recent upstream changes (e.g. kernel abstraction reorganisation in #34055) can break the plugin before it has been updated to match.

Fix: fetch VLLM_STABLE_COMMIT from the vllm-gaudi repository’s tracking branch and check out that commit inside the Docker image. This pins the vllm baseline to a version vllm-gaudi is known to work with, while the CI commit still reflects the triggered build.

```bash VLLM_…
#35290 [Attention][Perf][WIP] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 — performance,deepseek — by LopezCastroRoberto (创建于: 2026-02-25 20:14 (UTC+8)) [+444/-75, 3 files | commented:1 | 📝草稿] ``` ====================================================================== CP_GATHER_AND_UPCONVERT_FP8_KV_CACHE BENCHMARKS ====================================================================== Cache entry: 656 bytes (512 FP8 + 16 scales + 128 RoPE) Output row: 576 BF16 = 1152 bytes Per token: 1808 bytes (read + write) Block size: 64 tokens/block ======================================================================

…
#35305 [BugFix] Fix MoE g_idx params causing ValueError with actorder=null AWQ models — bug — by jhsmith409 (创建于: 2026-02-25 23:52 (UTC+8)) [💬1 | +64/-97, 1 files | commented:1] ## Summary
- Fix CompressedTensorsWNA16MarlinMoEMethod.create_weights() to only register g_idx parameters when self.actorder is set, preventing ValueError from track_weights_loading() when loading compressed-tensors AWQ MoE models with actorder=null
- Fix CompressedTensorsWNA16MoEMethod.create_weights() to remove unreachable g_idx registration (class asserts actorder != "group")
- Fix process_weights_after_loading() to derive num_experts/device from w13_weight_packed i…
#35304 [Bugfix][Hardware][AMD] Fix startup hang on ROCm gfx1151 in MinTokensLogitsProcessor — bug,rocm,v1 — by c0de128 (创建于: 2026-02-25 23:49 (UTC+8)) [💬1 | +2/-2, 1 files | commented:1] ## Summary

Fixes an indefinite hang during vLLM startup on AMD Ryzen AI MAX+ 395 (gfx1151) by creating neg_inf_tensor on CPU first and transferring to device with non_blocking=True.

In MinTokensLogitsProcessor.__init__(), the neg_inf_tensor scalar is created directly on the ROCm device via torch.tensor(..., device=self.device). On gfx1151/gfx1150, this triggers HIP kernel compilation that hangs indefinitely — confirmed by py-spy traces showing 100% CPU time stuck at this line.

The f…
#35301 [speculative decoding] Add dynamic speculation length with confidence-threshold early exit — performance,speculative-decoding,v1 — by jmamou (创建于: 2026-02-25 23:32 (UTC+8)) [💬1 | +549/-11, 10 files | commented:1 | 📝草稿] ## Purpose

This PR introduces confidence-threshold based Dynamic Speculative Length (DSL) support for speculative decoding and adds end-to-end DSL observability.

### Functional changes
- Adds draft_confidence_threshold to speculative config (default 0.0, disabled by default).
- Implements confidence-based early-exit in speculative decoding drafting when DSL is enabled.
- Propagates DSL metrics through scheduler/output/benchmark paths:
  - total proposals
  - early exits …
#35296 Fix Qwen3ReasoningParser misclassifying truncated reasoning as content — qwen — by sxu75374 (创建于: 2026-02-25 21:31 (UTC+8)) [💬1 | +14/-2, 1 files | commented:2] ## Purpose

Fixes #35221.

When max_completion_tokens cuts output before </think> is generated, extract_reasoning() assumes thinking is disabled and returns all tokens as content. This is incorrect — the model was reasoning, it just got truncated.

The fix reads enable_thinking from chat_template_kwargs (mirroring the approach used by DeepSeekV3ReasoningParser) and uses it to determine behavior when </think> is absent:
- enable_thinking=True (default): truncated output → `…
#35292 [Perf] Optimize gdn decode and mtp — qwen — by ZJY0516 (创建于: 2026-02-25 20:54 (UTC+8)) [+301/-30, 1 files | commented:1 | 📝草稿] ## Purpose

Do not review it

## Test Plan

## Test Result

...
#35258 [Reasoning] Fix Qwen3 parser misclassifying truncated reasoning as content — qwen — by sxu75374 (创建于: 2026-02-25 11:31 (UTC+8)) [💬5 | +14/-2, 1 files | commented:2 approved:1] ## Purpose

Fix Qwen3ReasoningParser.extract_reasoning() returning truncated reasoning tokens as content instead of reasoning when max_completion_tokens cuts off the output before </think> is generated.

When thinking is enabled and the output is truncated, the current code sees no </think> and assumes thinking was disabled — returning (None, model_output). This means the API response has content filled with raw chain-of-thought and empty reasoning_content, which is wrong. …
#35294 [Model Runner V2] support dp & ep for spec decoding — v1,nvidia — by izhuhaoran (创建于: 2026-02-25 21:03 (UTC+8)) [+119/-57, 4 files | commented:1] Further improve based on #35248

The key change is to trigger speculator.propose() in _dummy_run (called by execute_dummy_batch on idle DP ranks). Without this, when one DP rank has requests and enters speculative decoding, the idle rank only runs the target model forward in its dummy batch — the speculator’s EP communication has no counterpart, causing hang.

Additionally, num_tokens_across_dp in the speculator must be properly synced across DP ranks via all-reduce. When the target mode…
#35284 [Misc] Standardize handling of mm_processor_kwargs.size — documentation,ready,multi-modality,qwen — by DarkLight1337 (创建于: 2026-02-25 18:50 (UTC+8)) [💬1 | +128/-47, 15 files | commented:4 approved:1] ## Purpose
- Update example mm_processor_kwargs to use "size" which is more preferred.
- Add back-compatibility for passing min_pixels and max_pixels instead of size.shortest_edge and size.longest_edge.
CLOSE https://github.com/vllm-project/vllm/pull/35269#pullrequestreview-3852541054

## Test Plan

## Test Result …

#35291 [Bugfix] Enable stats logging with multiple API servers (DP>1) — bug,v1 — by LZZhang-HW (创建于: 2026-02-25 20:43 (UTC+8)) [💬2 | +2/-5, 1 files | commented:1] ## Summary

When `data_parallel_size > 1`, `api_server_count` defaults to `data_parallel_size`, causing `client_count > 1`. Previously this disabled **all** stats logging with a
warning:

> "AsyncLLM created with api_server_count more than 1; disabling stats logging to avoid incomplete stats."

This means prefix cache hit rate, speculative decoding metrics (acceptance rate, mean acceptance length), and throughput stats are completely invisible in DP>1 deployments.

## Fix   ...

#35283 [Test] Add tests for n parameter in chat completions API — 无标签 — by KrxGu (创建于: 2026-02-25 18:43 (UTC+8)) [💬2 | +228/-0, 1 files | commented:8] ## Summary

This PR adds comprehensive tests for the n parameter in the chat completions API to verify that multiple completion choices are correctly returned (#34305).

## Details
- Integration tests (test_chat_n_parameter.py): Full end-to-end tests with server
  - Non-streaming with n=3
  - Streaming with n=2
  - Different seeds producing diverse outputs …
#35269 [PaddleOCR-VL] improve image_processor size compatibility for Transformers v5.0 — 无标签 — by zhang-prog (创建于: 2026-02-25 15:35 (UTC+8)) [💬4 | +2/-2, 1 files | commented:2] In Transformers v5.0, the keys in PaddleOCRVLImageProcessor.size have been updated from min_pixels/max_pixels to shortest_edge/longest_edge.

This PR enhances the compatibility of the size parameter to ensure a smooth transition to v5.0 while maintaining backward compatibility.
#35281 Doc link typo — documentation — by gante (创建于: 2026-02-25 17:50 (UTC+8)) [💬2 | +3/-3, 1 files | commented:2 approved:1] ## Purpose

Fixes a minor typo in the docs.

We can see that the first link in the table here is broken :)

Also applied a suggestion by the gemini bot.

## Test Plan

…
#35282 [Core] Add chunking for audio over 30s on offline inference, — frontend,needs-rebase — by sangbumlikeagod (创建于: 2026-02-25 18:00 (UTC+8)) [💬1 | +130/-4, 2 files | commented:1 | 📝草稿]

## Purpose

Currently, vLLM supports ASR (Automatic Speech Recognition) inference for audio segments exceeding 30 seconds, but this is not yet supported for offline inference.

https://github.com/vllm-project/vllm/issues/25750#issuecomment-3886442288

I have been working on a PR to address this (see my comment here: [link]), but in the meantime, another PR (#34628) covering similar ground was merged. …
#35280 custom dataset img support base64 — performance — by flutist (创建于: 2026-02-25 17:47 (UTC+8)) [💬4 | +8/-6, 1 files | commented:1]

## Purpose custom dataset img support base64 ## Test Plan ## Test Result pass —

Essential Elements of an Effective PR Description Checklist
...
#35279 [Perf][Benchmark] Add uniform top-k scalar fast path and profiling co… — performance,v1 — by MatteoFari (创建于: 2026-02-25 17:28 (UTC+8)) [+154/-27, 6 files | commented:1] ## Purpose Part of #32455

This PR speeds up top-k sampling for a common case and adds better benchmark reporting. It uses a new top_k_scalar path when top-p is off and batch top-k is uniform, while keeping outputs the same. ## Test Plan

Run: ```bash Commit-to-commit microbenchmark validation python /content/bench_topk_commit_compare.py …
#35273 fix(xpu): handle mem_get_info RuntimeError on TBX simulated devices — 无标签 — by lukaszszady (创建于: 2026-02-25 16:40 (UTC+8)) [+21/-0, 1 files | commented:1 | 📝草稿] ## Summary

On Intel XPU TBX (Translation Based eXecution) simulated devices, torch.xpu.mem_get_info() raises a RuntimeError because the simulated device does not support querying available free memory:
```
  RuntimeError: The device (Intel(R) Graphics [0x674c]) doesn't support querying the available free memory.
```
This causes vLLM to crash during worker initialization when MemorySnapshot calls current_platform.mem_get_info(device).

…
#35263 fix(docs): update broken perf.vllm.ai link to PyTorch HUD — performance,ci/build — by machov (创建于: 2026-02-25 13:31 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1] Fixes #31710

The URL referenced in is no longer accessible and returns connection errors.

## Changes
- Updated broken link to point to the correct vLLM Performance Dashboard:
- Made the link text consistent with other documentation in the repository that already uses the correct PyTorch HUD URL
## Testing Verified that the new URL is accessible and shows the vLLM performance benchmarks. This URL format is already used in other documentation files throughout the repository (docs/benchmarki…
#35262 [Frontend] Add metrics-only uvicorn access log flag — frontend — by YuxSong (创建于: 2026-02-25 13:22 (UTC+8)) [💬1 | +81/-7, 4 files | commented:1] ## Summary
- Add a new --disable-uvicorn-metrics-access-log CLI flag to suppress access logs only for /metrics.
- Merge this flag with existing --disable-access-log-for-endpoints handling and deduplicate excluded paths.
- Add tests for CLI parsing and get_uvicorn_log_config behavior, including log_config_file precedence.
## Test plan
- `python3 -m py_compile vllm/entrypoints/openai/cli_args.py vllm/entrypoints/openai/server_utils.py tests/entrypoints/openai/test_cli_args.py tests/…

[已合并 PR]

#34109 [Kernel] Refactor FlashInfer allreduce for mnnvl backend — performance,ready,nvidia — by hjjq (合并于: 2026-02-26 11:17 (UTC+8)) [💬21 | +592/-179, 7 files | commented:7 approved:3]

## Purpose Flashinfer has a new allreduce API: https://github.com/flashinfer-ai/flashinfer/pull/2130 and a mnnvl backend for allreduce optimized for multi-node NVLink cases (https://github.com/flashinfer-ai/flashinfer/pull/1213). The mnnvl backend performs same/better than the old trtllm backend that vLLM currently uses. The new …
#33992 [Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs — bug,documentation,ready,ci/build,nvidia — by ehfd (合并于: 2026-02-26 10:15 (UTC+8)) [💬36 | +334/-5, 6 files | commented:6 approved:2] ## Purpose

Fix #32373 Fix #33369 Closes #34226

Fixes the core problem in https://github.com/vllm-project/vllm/issues/32373 and https://github.com/vllm-project/vllm/issues/33369, introduced from https://github.com/vllm-project/vllm/pull/30784 and https://github.com/vllm-project/vllm/pull/33116.

The issue with the CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs: (1) Consumer NVIDIA GPUs must NOT have the CUDA compatibility libraries inside LD_LIBRARY_PATH. …
#34134 openpangu-vl support video input — ready,multi-modality — by hujiaxin0 (合并于: 2026-02-26 11:08 (UTC+8)) [💬8 | +87/-0, 1 files | commented:9 approved:1]

## Purpose This is a method for calculating the Pangu video sampling frame, which is different from the existing method. total_frames_num indicates the total number of frames. It is more reasonable to start the calculation from frame 0 corresponding to 0 seconds. Assume that total_num_frames = 10 and fps = 1. The timestamp of frame 0 is 0.0. The timestamp of frame 1 is 1.0. The timestamp of frame 9 (the last frame) should be 9.0, that is, (total_frames_num – 1) / o…
#35210 [BugFix] Fix fp4 quant kernel on CUDA 12.8 — bug,ready,nvidia,quantization — by LopezCastroRoberto (合并于: 2026-02-26 10:32 (UTC+8)) [+11/-7, 2 files | commented:1 approved:1] ## Summary Fixes a bug in the non-swizzled FP4 quantization path (cvt_fp16_to_fp4_sf_major) and silu_mul_cvt_fp16_to_fp4 when CVT_FP4_ELTS_PER_THREAD == 8, leading to failures in test_quantize_to_fp4_padded_no_sf_swizzled and test_silu_mul_nvfp4_quant.py (e.g. on sm120 / CUDA 12.8).

This patch introduces no additional performance overhead for the CUDA 12.8 code path. The CUDA 12.9+ path remains unaffected by this bug fix.

## Root cause The kernel and host launch used sf_n_unpadded …
#35338 [Bugfix] Fix AttributeError in SMControlContextManager — bug,ready,v1 — by LucasWilkinson (合并于: 2026-02-26 10:01 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:2] FIX #35337

## Summary

Fixes AttributeError: 'int' object has no attribute 'index' in SMControlContextManager.__init__() when running with data parallel and expert parallel enabled (DBO tests).

## Root Cause

PR #35042 introduced a bug where torch.cuda.current_device().index is called, but torch.cuda.current_device() returns an int directly, not a device object with an .index attribute.

…
#33807 [UX] Add --moe-backend arg for explicit kernel selection — ready,nvidia — by mgoin (合并于: 2026-02-26 09:44 (UTC+8)) [💬4 | +260/-140, 37 files | commented:8]

## Purpose

Adds --moe-backend argument for explicit MoE kernel selection, allowing users to override the automatic backend selection logic (e.g., --moe-backend triton, --moe-backend marlin, --moe-backend flashinfer_trtllm)

Supports all three oracle paths currently implemented: unquantized, FP8, and NVFP4 If MoEBackend is specified by the user and isn’t valid for the given quantization format, it will error. Currently it doesn’t include CPU, XPU, etc…
#34542 [MoE Refactor] MXFP4 Cutlass Experts to MK — ready,ci/build,gpt-oss,nvidia — by zyongye (合并于: 2026-02-26 09:32 (UTC+8)) [💬8 | +453/-168, 19 files | commented:10] ## Purpose Refactor MXFP4 cutlass backend for ongoing moe refactor

Also adding testing infrastructure.

## Test Plan Test GPQA benchmarks, with medium reasoning effort

gpt-oss-120b TP=2 on gb200 with tested kernel on ```bash …
#34936 [UX] Add --performance-mode {balanced,interactivity,throughput} — performance,ready — by mgoin (合并于: 2026-02-26 09:28 (UTC+8)) [💬3 | +37/-5, 2 files | commented:6 approved:4] ## Purpose

Adds a new flag to specify user intent for workload, --performance-mode {balanced,interactivity,throughput}. Default is balanced which preserves existing behavior. To start and just demonstrate the feature, setting interactivity mode captures fine-grained CUDA graphs at small batch sizes (step-1 up to 32) and throughput mode doubles default batch size limits.

In the future we can use this flag to:
- change scheduler behavior to specialize batch size or smoothness of batch …
#29941 [offloader] v2: Hide weight onloading latency via prefetching — frontend,ready,ci/build,v1,deepseek,nvidia — by minosfuture (合并于: 2026-02-26 09:20 (UTC+8)) [💬11 | +1550/-131, 20 files | commented:10]

## Purpose

This PR adds CPU weight offloader that hides weight onloading latency by prefetching weights. This saves the performance cost of zero-copy UVA access. This technique was first developed in SGLang for GB200: https://lmsys.org/blog/2025-09-25-gb200-part-2/, and now adapted to support torch.compile and CUDA graph within vLLM in this PR.

Also refactors the offloading to be extensible.

Demonstrated in the trace:
- H2D is for prefetching next offloaded we…
#35042 [Platform] Add current_platform.num_compute_units interface — performance,rocm,ready,v1,nvidia — by jikunshang (合并于: 2026-02-25 14:22 (UTC+8)) [💬14 | +72/-52, 24 files | commented:10] ## Purpose there are some torch.cuda.get_device_properties().multi_processor_count across vllm code base. we can unify it into current_platform.num_compute_units interface to make it clean and extensible for non-cuda hardware like xpu and npu.

## Test Plan

## Test Result

Essential Elements of an Effective PR Description Checklist
...
#35322 [ROCm][CI] Amending deletion of AMD mirror — rocm,ready,ci/build — by AndreasKaratzas (合并于: 2026-02-26 06:17 (UTC+8)) [+5/-0, 1 files | commented:1 approved:1] This mirror was deleted accidentally from: https://github.com/vllm-project/vllm/pull/34923

This PR reverts that.
#35265 [ROCm][CI] Extending attention backend coverage for Eagle spec decode tests — rocm,ready,ci/build,v1 — by AndreasKaratzas (合并于: 2026-02-26 06:16 (UTC+8)) [💬1 | +314/-150, 4 files | commented:4 approved:1] This PR fixes the attention backend handling in test_eagle_correctness for ROCm platforms:

Previously, the test would skip entirely when running Llama-4-Scout with FLASH_ATTN on ROCm. Instead of skipping, we now fall back to the FLEX_ATTENTION backend explicitly, allowing the test to actually run on ROCm.

Also previously, the test skipped DeepSeek models when using ROCM_AITER_FA on ROCm. Now we enable the AITER path (VLLM_ROCM_USE_AITER=1), unset VLLM_MLA_DISABLE, and switch to the…
#34270 fix(mxfp4): Disable monolithic path for TRITON backend with EP — ready,gpt-oss — by elizabetht (合并于: 2026-02-26 05:33 (UTC+8)) [💬4 | +225/-5, 2 files | commented:9 approved:1] ## Problem

When running MXFP4 models with the TRITON backend and expert parallelism (EP) enabled (e.g. vllm serve openai/gpt-oss-20b -tp=2 -ep), the server crashes with an illegal memory access.

triton_kernel_moe_forward passes expert_map through to triton_kernel_fused_experts but never actually applies it. As a result, legacy_routing produces routing data using global expert IDs that don’t correspond to the local weight indices on each EP rank.

## Fix

When expert_map is …
#34985 [CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor — bug,rocm,ready,nvidia — by rasmith (合并于: 2026-02-26 03:18 (UTC+8)) [+2/-0, 1 files | commented:1 approved:1] ## Purpose The tests in test_punica_ops can fail if a previous test ran and changed the device that kernels execute in, in particular Triton kernels, by using torch.cuda.set_device, e.g. lora/test_layers.py. This can cause subsequent tests to also fail, since the system seems to go into a bad state once this happens. ## Test Plan ``` pytest -v -s lora
–ignore=lora/test_chatglm3_tp.py
–ignore=lora/test_llama_tp.py
–ignore=lora/test_llm_with_mult…
#35049 [ROCm][CI] Disable skinny GEMMs in multimodal tests to fix non-deterministic results — rocm,ready,multi-modality — by AndreasKaratzas (合并于: 2026-02-26 00:48 (UTC+8)) [💬1 | +18/-0, 1 files | commented:2 approved:1] This PR disables VLLM_ROCM_USE_SKINNY_GEMM for ROCm multimodal tests by setting it to 0 in pytest_configure, which runs before test collection and module imports.

Purpose

The wvSplitKrc skinny GEMM kernel introduced in #33493 uses atomicAdd for cross-wave reduction, which produces non-deterministic results across runs due to floating-point non-associativity. This causes test failures in accuracy-sensitive tests (e.g., test_logprobs.py, test_serving_tokens.py).

Setting `VLLM_…
#35074 [Misc] Enable weights loading tracking for quantized models — ready — by Isotr0py (合并于: 2026-02-25 14:11 (UTC+8)) [💬4 | +15/-4, 1 files | commented:1 approved:1]

## Purpose
- See if we can also enable weights loading tracking for quantized models
## Test Plan

## Test Result

…
#35309 Revert “[Misc] Enable weights loading tracking for quantized models” — ready — by LucasWilkinson (合并于: 2026-02-26 01:20 (UTC+8)) [+4/-15, 1 files | commented:1 approved:1] Reverts vllm-project/vllm#35074, this PR broke a ton of quantized models weight loading in the nightly https://buildkite.com/vllm/ci/builds/53087/ .

We should re-land this with ready-run-all-tests

cc @Isotr0py @DarkLight1337
#32114 [Bugfix] Fix Harmony preamble visibility in Responses API — bug,frontend,ready,gpt-oss — by thepushkarp (合并于: 2026-02-26 00:08 (UTC+8)) [💬11 | +341/-63, 10 files | commented:3 approved:1] ## Summary

Per the Harmony spec, preambles (commentary channel messages with no recipient) are “intended to be shown to end-users”. vLLM was incorrectly treating them as hidden reasoning across multiple code paths.

This PR fixes preamble visibility in 6 code paths across 3 files:

Parser (harmony_utils.py):
- parse_output_message(): route preambles to ResponseOutputMessage instead of ResponseReasoningItem
- `parse_…
#34528 [Core] Cleanup engine pause/sleep logic — frontend,ready,v1 — by njhill (合并于: 2026-02-25 11:33 (UTC+8)) [💬9 | +302/-198, 10 files | commented:9 approved:1] This is a follow-on from https://github.com/vllm-project/vllm/pull/33195 which I didn’t get a chance to review, and the subsequent merging with https://github.com/vllm-project/vllm/pull/34125
- Always pause scheduler prior to sleeping (regardless of level)
- Centralize cache resetting
- Also support sleep/pause/resume with inline-engine cases (VLLM_ENABLE_V1_MULTIPROCESSING=0 / external launcher mode) - i.e. move from EngineCoreProc to EngineCore
- Deduplicate the added LLM.enqueue logi…
#35085 [Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support — bug,ready — by haosdent (合并于: 2026-02-25 23:31 (UTC+8)) [+20/-8, 1 files | commented:9 approved:1] ## Purpose

Fixes #34891: On GPUs without NVSwitch (e.g., H200/H100 with NVLink bridge-only or PCIe topologies), AllReduceFusionPass.__init__() crashes with RuntimeError: [SymmDeviceMemory] Device does not support multicasting when flashinfer_comm.create_allreduce_fusion_workspace() tries to create a SymmDeviceMemory that requires multicast support.

This PR wraps the workspace creation call in a try/except RuntimeError that logs a warning and returns early, leaving the pass in its def…
#35249 [XPU]Fix for Qwen-OMNI crash — ready,qwen — by xuechendi (合并于: 2026-02-25 23:31 (UTC+8)) [+2/-1, 1 files | commented:7 approved:2]

## Purpose

The error is detected with vLLM-omni project when running v0.16.0 on XPU https://github.com/vllm-project/vllm-omni/pull/1416

Two issues are detected:
1. Due to xpu does not support cudagraph yet. vLLM main repo set max_cudagraph_capture_size=None. This leads to a crash in …
#34773 [Misc][LoRA] Increase max vocab size limit to 258048 in logits processor — rocm,ready — by bhoomit (合并于: 2026-02-25 23:30 (UTC+8)) [💬8 | +34/-9, 3 files | commented:5 approved:3] Increases the LoRA logits processor vocab size upper bound from 257024 to 258048 to support models with larger vocabularies (e.g. 258048 vocab size). Adds test coverage for the new limit.
#34211 [Bugfix] Fix step3p5 reasoning with interleaved thinking — bug,ready — by mariohong128 (合并于: 2026-02-25 23:13 (UTC+8)) [💬8 | +387/-14, 2 files | commented:1 approved:1] ## Purpose When there are multiple rounds of conversation, the prompt contains </think> from the previous round, and the step3p5 reasoning parser failed to correctly determine the end of reasoning.

## Test Plan Add tests: tests/reasoning/test_step3p5_reasoning_parser.py

## Test Result

All passed. …
#34772 [Tests] Add GSM8k check to SpecDec E2E tests — ready,v1 — by benchislett (合并于: 2026-02-25 20:51 (UTC+8)) [💬3 | +237/-143, 2 files | commented:1 approved:1] FIX https://github.com/vllm-project/vllm/issues/35168 FIX https://github.com/vllm-project/vllm/issues/35167

## Purpose

Existing testing for speculative decoding is ad-hoc and unreliable. Specifically, when we select which sample prompts to use for the test case we will select from a few ‘kinds’ of prompt, with the ‘sentence’ kind of prompt having a lot of ambiguity and leading to flaky CI tests.

This PR replaces the ‘sentence’ type with some prompts selected from GSM8k. In addition, we now al…
#35281 Doc link typo — documentation — by gante (合并于: 2026-02-25 19:00 (UTC+8)) [💬2 | +3/-3, 1 files | commented:2 approved:1] ## Purpose

Fixes a minor typo in the docs.

We can see that the first link in the table here is broken :)

Also applied a suggestion by the gemini bot.

## Test Plan

…
#35107 Fix custom processors that use deleted behaviour for Transformers v5 — ready — by hmellor (合并于: 2026-02-25 18:36 (UTC+8)) [💬3 | +32/-0, 1 files | commented:3 approved:1] Some remote code processors may define optional_attributes in their ProcessorMixin subclass, and then pass these arbitrary attributes directly to ProcessorMixin.__init__, which is no longer allowed in Transformers v5. For backward compatibility, we intercept these optional attributes and set them on the processor instance before calling the original ProcessorMixin.__init__.

In vLLM CI, one architecture which does this is Molmo2ForConditionalGeneration. If we upstream this architecture…
#34677 [Bugfix][CPU] Fix basic unit tests failing in CPU platforms — bug,ready,nvidia — by jasonyanwenl (合并于: 2026-02-25 16:36 (UTC+8)) [💬6 | +15/-6, 1 files | commented:4 changes:1 approved:1] ## Purpose

This PR is to fix issues reported in: https://github.com/vllm-project/vllm/issues/32730 (see this comment)

Specifically, fixed below CPU test failures:
- test_vllm_config_defaults: make optimization-level cudagraph mode defaults platform-aware via _platform_aware_cudagraph(), so non-GPU platforms (e.g., CPU) correctly get CUDAGraphMode.NONE instead of GPU-specific modes like PIECEWISE / `FULL_AND_P…
#33069 [Doc] Suggest “–managed-python” flag when installing python using uv — documentation,ready,cpu — by jasonyanwenl (合并于: 2026-02-25 16:19 (UTC+8)) [💬10 | +1/-1, 1 files | commented:1 approved:1] ## Purpose

Build the vllm from source with CPU-only failed:

full error message
```text (vllm) ubuntu@ip-172-31-4-43:~/vllm$ VLLM_TARGET_DEVICE=cpu uv pip install -e . --no-build-isolation Resolved 136 packages in 2.94s × Failed to build `vllm @ file:///home/ubuntu/vllm` ...
#34513 [DOC][BugFix] Specfiy build dependency installation — bug,documentation,ready — by jonoillar (合并于: 2026-02-25 16:04 (UTC+8)) [💬5 | +7/-1, 1 files | commented:2 approved:2] ## Purpose When setting up the development environment with Python and CUDA/C++ code, we need to install manually the build dependencies (cmake, setuptools, etc)

That’s why I added the instructions to install those build dependencies.

This is an update to the steps in documentation as when installing vLLM without setting up a new environment (–no-build-isolation flag), you need to install the build dependencies explicitly. ## Test Plan

## Test Result

…
#34679 [Docs]Fix documentation formatting in architecture overview — documentation,ready — by lichuang (合并于: 2026-02-25 16:00 (UTC+8)) [💬3 | +1/-3, 1 files | commented:2 changes:1 approved:1] ## Purpose

This PR fixes several documentation formatting issues in docs/design/arch_overview.md:
1. Fixed code block markers - Removed the ??? prefix from code block annotations (??? code → code) to ensure proper render of code examples in the LLM class usage and model constructor sections.
2. Simplified image embedding - Replaced the verbose HTML
  tag with standard Markdown image syntax () for better compatibility and cleaner markup.
These changes improve the read…
#35225 docs: document committer proposal process in governance — documentation,ready — by simon-mo (合并于: 2026-02-25 15:58 (UTC+8)) [💬1 | +7/-5, 1 files | commented:4 approved:4] Documents the committer proposal process in docs/governance/process.md:
- Nominate: Committer emails the committer group to nominate a candidate, highlighting contributions
- Discuss and vote: Group discusses, votes, voices concerns; lead maintainers decide
- Feedback period: After 2-week feedback, if moving forward, nominator invites candidate to open PR to update code ownership (CODEOWNERS, committers list)
- Permissions and onboarding: Lead maintainers assign GitHub permiss…
#35203 Remove requirement to use --hf-overrides for DeepseekVLV2ForCausalLM — documentation,ready,v1,deepseek,kv-connector — by hmellor (合并于: 2026-02-25 14:00 (UTC+8)) [💬3 | +9/-56, 5 files | commented:1 approved:1] Since we implement the config class ourselves, there’s no need to enforce that users pass architectures themselves. We can just add it if it’s missing from the checkpoint.

#31828 [Perf] Add opt-in SM100 Oink RMSNorm custom-op path — ready — by Laurawly (合并于: 2026-02-25 15:01 (UTC+8)) [💬3 | +331/-0, 4 files | commented:4 approved:1]

## Purpose

Add an opt-in integration path for external SM100 (Blackwell) RMSNorm kernels (CuTeDSL / Oink) without introducing a hard dependency in vLLM.

When enabled, vLLM routes eligible CUDA RMSNorm calls to externally-registered torch.ops.oink.* kernels (and otherwise falls back to the existing vLLM CUDA implementation). This is primarily intended for torch.compile + CUDA graph execution.

## What This PR Changes

…

#34424 [Perf] Optimize FP8 gemm of sm120. — performance,ready,nvidia — by wenshuai-xiaomi (合并于: 2026-02-25 14:25 (UTC+8)) [💬2 | +133/-1, 1 files | commented:3 approved:2] ## Purpose Optimize FP8 gemm of sm120 at any input shape.

## Test Plan benchmarks/kernels# python3 bench_fp8_gemm.py run bench_fp8_gemm.py and compare the results with the patch or not.

## Test Result the current restut without the patch ``` …
#34482 [XPU]Support CUDAGraph on XPU Platform — ready,v1,nvidia — by xinyu-intel (合并于: 2026-02-25 14:22 (UTC+8)) [💬1 | +45/-4, 3 files | commented:2 approved:2] ## Purpose

Enable CUDAGraph functionality support on the XPU platform leveraging features available in the nightly PyTorch XPU build.
- Validated with torch==2.11.0.dev20260216+xpu from nightly channel and torch==2.11.0+xpu from test channel
- distributed is not supported
- For FLASH_ATTN backend, only PIECEWISE mode is supported
- For TRITON_ATTN backend, all modes are supported
## Test Plan

…
#35011 remove cuda check in top_k_top_p_triton kernel — ready,v1,nvidia — by jikunshang (合并于: 2026-02-25 14:22 (UTC+8)) [💬8 | +3/-4, 2 files | commented:2 approved:2] ## Purpose fix #34941 https://github.com/vllm-project/vllm/pull/33538 introduce triton based top_k_top_p kernel and have some cuda hard code in triton kernel code. this may break non-cuda-like device like xpu and npu.

cc @Yikun @wangxiyuan ## Test Plan CI ## Test Result

…
#35055 [Misc] Add shard_id validation for MergedColumnLinear — ready — by Isotr0py (合并于: 2026-02-25 14:12 (UTC+8)) [💬1 | +67/-7, 1 files | commented:2 approved:1]

## Purpose
- Update shard_id validation to avoid unexpected shard_id for MergedColumnLinear
- Update annotations for linear utility functions
## Test Plan

## Test Result

…
#35231 [Responses][CI] Filter negative token IDs in schema fuzz test to avoid 500 errors — frontend,ready — by AndreasKaratzas (合并于: 2026-02-25 13:52 (UTC+8)) [+27/-1, 2 files | commented:2 approved:1] Adds server-side validation to reject negative token IDs in POST /v1/completions, which previously caused an unhandled 500 Internal Server Error.

## Problem

When the prompt field contains negative token IDs (e.g. {"prompt": [[-1]]}), the server crashes with a 500 instead of returning a proper 400 validation error. This was discovered via schemathesis fuzz testing in test_openapi_stateless.

``` curl -X POST -H ‘Content-Type: application/json’
-d ‘{“prompt”: [[-1]]}’
http://loca…
#35115 [compile] Improve error message during artifacts load failure. — ready — by zhxchen17 (合并于: 2026-02-25 14:01 (UTC+8)) [+7/-2, 1 files | commented:1 approved:1] Summary:

When a load failure happens for AOT, we just fallback to compiling again. Instead of mentioning “AOT compile” in the message, change the wording to notify user we’re simply compiling again. Add a common path when cache file is corrupted with EOFError so that it’s more clear what compiler is doing.

Test Plan:

Reviewers: …
#35157 [Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache — bug,ready,v1 — by heheda12345 (合并于: 2026-02-25 14:00 (UTC+8)) [💬1 | +74/-1, 2 files | commented:1 approved:1]

## Purpose We need to clear mamba_state_idx for resumed requests. When requests are force-preempted (e.g., during reset_prefix_cache / KV cache flush), they appear in resumed_req_ids without a corresponding entry in preempted_req_ids, leaving stale mamba_state_idx entries that can point to block indices beyond the new (smaller) block allocation. ## Test Plan

See the new unit test

## Test Result

…

#35237 [Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest — bug,frontend,ready — by pks (合并于: 2026-02-25 14:00 (UTC+8)) [+18/-2, 2 files | commented:1 approved:1] ## Summary

The `check_structured_outputs_count` model validator on `CompletionRequest` and `ChatCompletionRequest` uses `mode="before"` and calls `.get()` on the `structured_outputs` value, assuming it is always a dict. However, the field is typed as `StructuredOutputsParams | None`, so passing a dataclass instance is a valid usage but crashes with:   ```
AttributeError: 'StructuredOutputsParams' object has no attribute 'get'   ```
This PR handles both dict and `StructuredOutputsParams` input...

#34933 [FIX] fused moe with lora shared expert dual stream (1.07x otps) — performance,ready — by jhaotingc (合并于: 2026-02-25 12:40 (UTC+8)) [💬6 | +6/-2, 1 files | commented:1 approved:2] ## Purpose MoE model with shared expert has dual stream overlap in TP when there’s no LoRA. But the overlapping disappears when adding LoRA module.

Tested with nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 and dasereb/Nemotron_3_Nano_BF16_r16_zeros

Before: <img width=”1597” height=”605” alt=”image” src=”https://github.com/user-attachments/assets/beac041a-b41a-4f62-…
#35180 [ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE — rocm,ready — by Rohan138 (合并于: 2026-02-25 12:36 (UTC+8)) [💬3 | +139/-67, 9 files | commented:3 approved:1] ## Purpose Follow-up to #25135 and #33443 to fix and enable the AITER RoPE custom op by default, and enable RoPE+KVCache fusion. During prefill (q.shape[0] > 256), we thus use the AITER unfused RoPE kernel instead of the vllm native custom op, which gives another 1% uplift on gpt-oss.

This is also a mild prerequisite for MLA RoPE+KVCache fusion on ROCm, since there currently isn’t a vllm native custom op for DeepseekScalingRotaryEmbedding.

## Test Plan Before: ``` ============ Serving Benc…
#35148 [Responses] Decouple SSE event helpers from Harmony context — frontend,ready,gpt-oss — by sfeng33 (合并于: 2026-02-25 12:05 (UTC+8)) [💬7 | +527/-565, 5 files | commented:8 approved:2] ## Purpose
- Refactor SSE streaming event helpers to decouple shared event-building logic from Harmony-specific context
- Fix a bug for build-in python tool to generate code interpreter events instead of mcp events (according to https://community.openai.com/t/responses-api-streaming-the-simple-guide-to-events/1363122)
- Enhance unit test to validate streaming events
### Architecture: Two-layer design in streaming_events.py The core refactor splits streaming_events.py into two layers: dispatcher…
#32967 [Frontend] Use init_app_state and FrontendArgs in run_batch — frontend,ready — by pooyadavoodi (合并于: 2026-02-25 11:40 (UTC+8)) [💬6 | +632/-330, 4 files | commented:8 approved:1] ## Purpose
- Adding support for more features such as tool calling to run_batch.
- This is achieved by using init_app_state from vllm/entrypoints/openai/api_server.py
- run_batch now uses a subset of frontend args from vllm/entrypoints/openai/cli_args.py (this is achieved by adding a new base class for FrontendArgs and letting BatchFrontendArgs derive from that).
- The approach taken here also removes some code duplication between api_server and run_batch.
## Test Plan Adding a new test…

[关闭未合并 PR]

#34547 [Docs] Fix NCCL typo — documentation — by jiangkuaixue123 (关闭于: 2026-02-26 11:11 (UTC+8)) [💬3 | +1/-1, 1 files | commented:2] ## Purpose Fix UCCL -> NCCL
#34226 [build] fix priority of cuda-compat libraries in ld loading — ci/build,nvidia — by youkaichao (关闭于: 2026-02-26 10:15 (UTC+8)) [💬3 | +4/-4, 1 files | commented:1]

## Purpose

People still report Error 803: system has unsupported display driver / cuda driver combination after https://github.com/vllm-project/vllm/pull/33116 . It is because some docker images specify the driver so file in nvidia.conf, which gets suppressed by cuda-compat.conf.

This PR adds zzz- to cuda-compat.conf so that it’s ordered lastly.

<img width=”970” height=”594” alt=”a119df1a223395c6b8d6145b8913cd2c” src=”https://github.com/user-attachment…
#35348 Fix stale self._positions during piecewise CUDA graph replay in MLA — frontend,v1,nvidia — by tlrmchlsmth (关闭于: 2026-02-26 10:37 (UTC+8)) [+380/-45, 14 files | commented:1 | 📝草稿] ## Summary
- Fixes a shape mismatch crash in rotary embedding (RuntimeError: The size of tensor a (8192) must match the size of tensor b (475)) when running DeepSeek-R1 with FP8 KV cache, piecewise CUDA graphs, and the fused RoPE+quant path
- Root cause: self._positions = positions in MLAAttention.forward() is a Python-level side effect inside a piecewise-compiled graph. During CUDA graph replay, this assignment is not re-executed, leaving self._positions pointing to a stale view from a…
#26235 [V1] Mamba2 SSD kernel integration — stale,v1 — by bohnstingl (关闭于: 2026-02-26 10:16 (UTC+8)) [💬3 | +500/-100, 3 files | commented:1] ## Purpose

This PR reduces overhead from the pure PyTorch implementation of the SSD initial state extraction and state caching by using two additional Triton kernels. It is based on https://github.com/vllm-project/vllm/pull/26222

cc @tdoublep @s3woz

## Test Plan

The behavior is transparent to the user and thus the existing tests can be reused. …
#27429 [DOC] fix the rocm installation doc — documentation,rocm,needs-rebase,stale — by billishyahao (关闭于: 2026-02-26 10:16 (UTC+8)) [💬5 | +38/-22, 1 files | commented:1]

## Purpose

Fix the ROCm related installation guide according to the dockerfile:

https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base#L4-L11

## Test Plan …
#27617 [Model] make the inv_freq in Qwen2_5_VisionRotaryEmbedding device-agnostic — stale,qwen — by yangulei (关闭于: 2026-02-26 10:16 (UTC+8)) [💬5 | +1/-3, 1 files | commented:4] ## Purpose Currently the inv_freq in Qwen2_5_VisionRotaryEmbedding is set to ‘cpu’ explicitly. https://github.com/vllm-project/vllm/blob/a8d2e326ecb70876aa73dce70dbe2434c64b710a/vllm/model_executor/models/qwen2_5_vl.py#L610-L612 This PR makes it device-agnostic.
#34425 [Bug Fix] Re-add the DCP/PCP compatibility check for CUDA platform — bug,needs-rebase,nvidia — by Siritao (关闭于: 2026-02-26 09:53 (UTC+8)) [💬2 | +21/-0, 1 files | commented:1] Background:
- PR #29952 moved DCP/PCP + full CUDA graphs compatibility check to platform layer
- PR #29558 inadvertently removed this check during conflict resolution (see commit diff)
- ROCm platform still has the check, but CUDA is missing it
- Impact: Users with DCP/PCP enabled cras…
#34420 Fix MoE models in EP mode on Ascend — 无标签 — by ylyhyqsl (关闭于: 2026-02-26 09:41 (UTC+8)) [💬7 | +11/-4, 2 files | commented:1] Fix MoE models using bare tensor_model_parallel_all_reduce produce garbage output in EP mode on Ascend#6710

## Purpose MoE models that use FusedMoE with reduce_results=False and then manually call tensor_model_parallel_all_reduce() in their MoE forward method produce garbage output when running with –enable-expert-parallel on Ascend NPU.

## Suggested fix The fix should be applied in the upstream vLLM model files. Replace:

tensor_model_parallel_all_reduce(final_h…
#34987 [CI Bugfix] Add pytest.mark.flaky to tests/v1/e2e/test_spec_decode.py — bug,ready,v1,ci-failure — by mgoin (关闭于: 2026-02-26 09:34 (UTC+8)) [💬2 | +4/-0, 1 files | commented:1 approved:1]

## Purpose

The hope is this will help the flaky failures of “V1 e2e + engine” since I identified these tests are often the culprit. They really should be reworked but when I ran locally it really did vary so much run to run.

FIX https://github.com/vllm-project/vllm/issues/34617 https://github.com/vllm-project/vllm/issues/34797

https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&period=7days&query=tests%2Fv1%2Fe2e%2Ftest_spec_decode.p…
#34774 [Build] Add BUILDER_CUDA_VERSION to decouple build and runtime/PyTorch CUDA versions — ready,ci/build,nvidia — by tlrmchlsmth (关闭于: 2026-02-26 09:27 (UTC+8)) [💬1 | +39/-26, 5 files | commented:1] Addresses the CUDA 12.8/12.9 compatibility issue affecting v0.16.0: the official PyTorch PyPI wheel is built on CUDA 12.8, but vLLM wheels have been built on CUDA 12.9. When users pip install vllm, PyTorch pins nvidia-cublas-cu12 to 12.8, which causes runtime failures (pytorch/pytorch#174949).

By decoupling the versions, we can build the vLLM wheel against CUDA 12.8 PyTorch (matching what users get from PyPI) while still compiling CUDA kernels with a 12.9, allowing FlashMLA to still be u…
#34852 [Build] Downgrade to CUDA 12.8 but use nvcc 12.9 to build csrc — ci/build,nvidia — by tlrmchlsmth (关闭于: 2026-02-26 09:27 (UTC+8)) [💬2 | +89/-20, 4 files | commented:2] Take 2 of https://github.com/vllm-project/vllm/pull/34774

Addresses the CUDA 12.8/12.9 compatibility issue affecting v0.16.0: the official PyTorch PyPI wheel is built on CUDA 12.8, but vLLM wheels have been built on CUDA 12.9. When users pip install vllm, PyTorch pins nvidia-cublas-cu12 to 12.8, which causes runtime failures (https://github.com/pytorch/pytorch/issues/174949).

By decoupling the versions, we can build the vLLM wheel against CUDA 12.8 PyTorch (matching what users get from P…
#34517 [Bugfix] Fix mypy errors on StructuredOutputsParams in OpenAI entrypoints — bug — by junuxyz (关闭于: 2026-02-26 09:20 (UTC+8)) [💬2 | +7/-2, 1 files | commented:1] ## Purpose

This PR fixes mypy errors in OpenAI entrypoint code paths using StructuredOutputsParams.

Root cause: StructuredOutputsParams uses pydantic.dataclasses.dataclass, and mypy does not always treat it like a stdlib dataclass in these paths (replace(...), StructuredOutputsParams(json=...)). Fix: use stdlib dataclass under TYPE_CHECKING for mypy, while keeping pydantic.dataclasses.dataclass at runtime.

``` …
#27996 Add the NV-ModelOPT FP8 & FP4 quantization E2E accuracy test case — 无标签 — by jingyu-ml (关闭于: 2026-02-26 09:18 (UTC+8)) [💬5 | +132/-1, 2 files | commented:2 approved:1] ## Purpose

Add GSM8K accuracy tests for ModelOpt FP8 quantized models to ensure quantization quality is maintained. This PR adds automated testing for:
- nvidia/Llama-3.1-8B-Instruct-FP8
- nvidia/Qwen3-8B-FP8
The tests verify that ModelOpt FP8 quantization achieves expected accuracy benchmarks on the GSM8K mathematical reasoning task.

## Test Plan …
#34540 [Kernel] [Helion] [8/N] Remove fake_impl usage and inference — rocm,ready — by gmagogsfm (关闭于: 2026-02-26 05:56 (UTC+8)) [💬13 | +292/-197, 5 files | commented:4 approved:1] NOTE: this is a manually stacked PR, each commit is reviewed separately. For this PR, please only review the top commit: Replace fake_impl with helion’s infer_output_spec in HelionKernelWrapper

Since we uses HOP to represent Helion kernel calls now, and Helion has internal support to compute output tensor shape, we no longer need to infer fake_impl. This PR removes all relevant logic to fake_impl
#35331 [Bugfix] Fix crash with EPLB + NVFP4 — bug,nvidia — by jasonlizhengjian (关闭于: 2026-02-26 05:18 (UTC+8)) [💬1 | +3/-3, 2 files | commented:1 | 📝草稿]

## Purpose

### Bug 1: expand() creates non-contiguous input scale tensors, breaking EPLB
```
In `prepare_nvfp4_moe_layer_for_fi_or_cutlass` (`flashinfer_fp4_moe.py:514-515`), input scales are broadcast via `expand()`:

```python
a13_scale = a13_scale.max().to(torch.float32).expand(num_experts)   ...
```
#35299 [Bugfix] Fix weight loading tracking for Marlin MoE quantized models — bug,needs-rebase — by alexbi29 (关闭于: 2026-02-26 03:08 (UTC+8)) [💬2 | +47/-6, 5 files | commented:3] ## Summary

PR #35074 introduced track_weights_loading() to validate all registered parameters are loaded from checkpoint. This broke loading of GPTQ/AWQ MoE models using the Marlin backend with the following error:

``` ValueError: Following weights were not initialized from checkpoint: {‘model.layers.0.mlp.experts.w13_g_idx_sort_indices’, ‘model.layers.0.mlp.experts.w2_g_idx_sort_indices’, ‘model.layers.0.mlp.experts.w13_weight_g_idx’, ‘model.layers.0.mlp.experts.w2_weight_g_idx’, …} …
#35323 Add Qwen3.5-MoE (qwen3_5_moe) support via Qwen3-Next backend — documentation,new-model,rocm,frontend,needs-rebase,ci/build,v1,multi-modality,qwen,kv-connector — by blake-snc (关闭于: 2026-02-26 03:04 (UTC+8)) [💬4 | +2597/-363, 38 files | commented:1] ## Summary

Adds support for Qwen3.5-MoE models (model_type: qwen3_5_moe, Qwen3_5MoeForConditionalGeneration) which use the same hybrid GDN (GatedDeltaNet) architecture as Qwen3-Next but ship with a different HuggingFace config layout and checkpoint weight format.

Key changes:
- Config: Qwen3_5MoeConfig that flattens the outer text_config wrapper and delegates to Qwen3NextConfig
- Registry: Register qwen3_5_moe model type and both ForConditionalGeneration / `ForCausal…
#35258 [Reasoning] Fix Qwen3 parser misclassifying truncated reasoning as content — qwen — by sxu75374 (关闭于: 2026-02-25 15:30 (UTC+8)) [💬5 | +14/-2, 1 files | commented:2 approved:1] ## Purpose

Fix Qwen3ReasoningParser.extract_reasoning() returning truncated reasoning tokens as content instead of reasoning when max_completion_tokens cuts off the output before </think> is generated.

When thinking is enabled and the output is truncated, the current code sees no </think> and assumes thinking was disabled — returning (None, model_output). This means the API response has content filled with raw chain-of-thought and empty reasoning_content, which is wrong. …
#34787 [Core] Num External Cached Tokens — frontend,v1 — by aeon-x (关闭于: 2026-02-25 20:09 (UTC+8)) [💬3 | +39/-4, 11 files | commented:1] ## Purpose Add Num External Cached Tokens to responses to helper developer of external cache system such as LMCache to track external cache usage in each request.

## Test Plan
- Install both vllm and lmcache from source code
- Run vllm with lmcache
- Test Completion, Chat Completion and Response API
## Test Result num_external_cached_tokens works as expected, only shows num of token cached by lmcache. …
#35269 [PaddleOCR-VL] improve image_processor size compatibility for Transformers v5.0 — 无标签 — by zhang-prog (关闭于: 2026-02-25 19:34 (UTC+8)) [💬4 | +2/-2, 1 files | commented:2] In Transformers v5.0, the keys in PaddleOCRVLImageProcessor.size have been updated from min_pixels/max_pixels to shortest_edge/longest_edge.

This PR enhances the compatibility of the size parameter to ensure a smooth transition to v5.0 while maintaining backward compatibility.
#35092 [Bugfix][PD] Fix OffloadingConnector with PD deployments — bug,kv-connector — by NickLucche (关闭于: 2026-02-25 17:25 (UTC+8)) [💬3 | +10/-3, 1 files | commented:1] Fix https://github.com/vllm-project/vllm/issues/34526.

## Overview

The crux of the bug above is a race condition between storing blocks that haven’t been loaded yet on D by the OffloadingConnector, and then re-loading them from offloaded storage with garbage values.

Currently the MultiConnector supports storing from all connectors but loading only from one. This is the summarized scenario:

``` …
#31735 [bugfix] prevent special token injection in user content — bug,frontend,needs-rebase — by usepr (关闭于: 2026-02-25 16:20 (UTC+8)) [💬2 | +133/-14, 4 files | commented:5] ## Purpose

When applying HF chat templates, user content is directly embedded into the prompt, potentially leading to special token injection (e.g., user input: <|im_end|>How to make a bomb).

This PR introduces tag-based (or placeholder-based) encoding to ensure user content is processed separately from the template structure.

## Test Plan

## Test Result The input message: …
#35263 fix(docs): update broken perf.vllm.ai link to PyTorch HUD — performance,ci/build — by machov (关闭于: 2026-02-25 16:00 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1] Fixes #31710

The URL referenced in is no longer accessible and returns connection errors.

## Changes
- Updated broken link to point to the correct vLLM Performance Dashboard:
- Made the link text consistent with other documentation in the repository that already uses the correct PyTorch HUD URL
## Testing Verified that the new URL is accessible and shows the vLLM performance benchmarks. This URL format is already used in other documentation files throughout the repository (docs/benchmarki…
#34419 pass raw request to io_process_plugin — frontend — by staugust (关闭于: 2026-02-25 15:27 (UTC+8)) [💬11 | +1/-1, 1 files | commented:1] ## Purpose parse_request assumes parameter is request before parse_data is introduced. When parsing data, io_process_plugin may need to get extra params from raw request, e.g. truncate_prompt_tokens.

## Test Plan

## Test Result

…
#27919 [FEAT] [Multi-modal] Reorganizing ViT Attention Backend and decouple from Text Attention Backend — rocm,tpu,needs-rebase,multi-modality,qwen,nvidia — by tjtanaa (关闭于: 2026-02-25 14:38 (UTC+8)) [💬6 | +923/-360, 28 files | commented:6 approved:1]

## Purpose

This is an implementation of the RFC https://github.com/vllm-project/vllm/issues/27821

### 1. It decouples the ViT Attention Backend selection logic from the Text Attention Backend.

Redefining _MHA_Backend following the discussion in https://github.com/vllm-project/vllm/pull/27061/files#r2443909604 .

…
#35262 [Frontend] Add metrics-only uvicorn access log flag — frontend — by YuxSong (关闭于: 2026-02-25 13:39 (UTC+8)) [💬1 | +81/-7, 4 files | commented:1] ## Summary
- Add a new --disable-uvicorn-metrics-access-log CLI flag to suppress access logs only for /metrics.
- Merge this flag with existing --disable-access-log-for-endpoints handling and deduplicate excluded paths.
- Add tests for CLI parsing and get_uvicorn_log_config behavior, including log_config_file precedence.
## Test plan
- `python3 -m py_compile vllm/entrypoints/openai/cli_args.py vllm/entrypoints/openai/server_utils.py tests/entrypoints/openai/test_cli_args.py tests/…
#31140 [Feature] Support weight-shape-unaligned block-scale fp8 models — 无标签 — by Wanli-Jiang (关闭于: 2026-02-25 13:14 (UTC+8)) [💬6 | +348/-84, 5 files | commented:4] ## Purpose
- Support Nemotron-3-Nano block-scale fp8 models for now.
## Test Plan
- Step 1: Generate block-scale fp8 model for Nemotron-3-Nano ``` (using modelopt to quantize bf16 model.
…
#31619 [Bugfix] Disallow sleep call if there are unfinished requests — bug,frontend,v1 — by danielhumanmod (关闭于: 2026-02-25 11:33 (UTC+8)) [💬2 | +205/-0, 4 files | commented:2] ## Purpose

Fix #31618

This PR prevents the engine from entering sleep mode while there are active/unfinished requests.

Calling sleep() during in-flight generation can put the engine into an invalid state (e.g., weights offloaded / caches reset while requests are still being processed), which can lead to failures or confusing behavior. To make the API safer and more predictable, this PR explicitly block sleep() when there are unfinished requests.

## Test Plan

…