[vLLM GitHub Development Digest] 2026-02-05
[Overview]
- Time window: 2026-02-05 11:27 (UTC+8) ~ 2026-02-06 11:27 (UTC+8)
- New issues: 36 (label distribution: bug:18, feature request:9, usage:4, RFC:2, rocm:2)
- Closed issues: 17
- New PRs: 66 (label distribution: ready:17, bug:16, v1:14, ci/build:10, documentation:7)
- Merged PRs: 40
- PRs closed without merging: 12
[New issues]
- #33885 [Bug]: Serving Qwen3-VL-Reranker with vLLM, the rerank endpoint returns very low scores (only ~0.5) for any image — bug — by RC-Qiao (created: 2026-02-05 16:32 (UTC+8)) [💬1] ### Your current environment
  docker run --gpus '"device=7"' --entrypoint "" -v /dataset/models/Qwen/Qwen3-VL-Reranker-8B:/model -p 9091:8000 --shm-size=8g vllm/vllm-openai:v0.15.1-cu130 vllm serve /model --runner pooling --max-model-len 16384 --gpu-memory-utilization 0.5 --dtype bfloat16 --hf-overrides '{"architectures": ["Qwen3VLForSequenceClassification"], "classifier_from_token": ["no", "yes"], "is_original_qwen3_reranker": true}' --chat-template /model/qwen…
- #33882 [Feature]: Improve sparse embedding pooling output format for better efficiency and usability — feature request — by staugust (created: 2026-02-05 16:01 (UTC+8)) [💬4] ### 🚀 The feature, motivation and pitch
  This issue concerns the current output format of the sparse embedding pooling interface and how it impacts performance and usability. For models that support sparse embeddings, we typically need to store, for each document, the mapping from token → weight so we can use it later during retrieval. Currently, the vLLM `pooling` API returns only the per-token weights for each position in the document. This leads to two potential issues: - **Duplicate…
- #33914 [RFC]: Support Chunked Pipeline Parallel (CPP) with dynamic chunk size — RFC — by pisceskkk (created: 2026-02-05 21:36 (UTC+8)) ### Motivation
  Currently, vLLM supports DCP and is planning to support PCP to improve long-sequence inference capability and overall inference efficiency. However, in scenarios with heterogeneous sequence lengths, Context Parallelism can perform suboptimally and may even lead to overall performance degradation due to reduced efficiency on short sequences. In addition, to better support and optimize ultra-long sequence inputs (e.g., reaching 1M tokens or more), introducing a new partitioning di…
- #33911 [Bug]: Gemma-3 multimodal models (4b/12b/27b) fail with torch.compile assertion error — no label — by tomasruizt (created: 2026-02-05 20:27 (UTC+8)) [💬1] Gemma-3 multimodal models fail during initialization with a torch.compile assertion error. Text-only Gemma-3 models (270m, 1b) work fine.
  Reproduce: `vllm serve google/gemma-3-4b-it --max-model-len 4096`
  Error: ``` AssertionError: expected size 1048576==131072, stride 256==256 at dim=0 …
- #33951 [Feature]: Implement `get_kv_cache_stride_order` for all classes — feature request — by Rohan138 (created: 2026-02-06 07:49 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
  Currently, `AttentionBackend.get_kv_cache_stride_order` raises a `NotImplementedError` by default: https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backend.py#L90 This forces awkward usage in vLLM, e.g. in https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py#L5864:
  ``` try: kv_cache_stride_order = attn_backend.get_kv_cache_stride_order() except (AttributeError, NotImplementedErro…
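  A minimal sketch of what the issue asks for, assuming an identity ordering as the fallback (the default body here is an assumption, not vLLM's actual implementation):

  ```python
  # Sketch: give the base class a sensible default instead of raising, so
  # callers can drop the try/except shown above.
  class AttentionBackend:
      @staticmethod
      def get_kv_cache_stride_order(num_dims: int = 5) -> tuple[int, ...]:
          # Default: natural (identity) stride order; backends with a
          # custom KV-cache layout would override this.
          return tuple(range(num_dims))

  print(AttentionBackend.get_kv_cache_stride_order())  # -> (0, 1, 2, 3, 4)
  ```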
- #33954 [Bug]: `Qwen/Qwen3-VL-Embedding-8B` embedding quality declines significantly sometime after vLLM version `0.14.0rc2.dev199+gc80f92c14` — bug — by kevin-pw (created: 2026-02-06 08:25 (UTC+8)) ### Your current environment
  I am posting below the output of `python collect_env.py` for vLLM version `0.14.0rc2.dev199+gc80f92c14` (where embeddings are high quality) and vLLM version `0.15.2.dev0+g1892993bc` (where embeddings are low quality). Note that the output of `python collect_env.py` shows that all relevant environment parameters are identical between the vLLM versions that produce different embedding qualities. Output of pyt...
- #33950 [Bug]: — bug — by FrancescoSaverioZuppichini (created: 2026-02-06 07:30 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text Collecting environment information... ============================== System Info ============================== ... ```
- #33930 [Feature]: GPU Memory Snapshotting to reduce cold starts — feature request — by Volko61 (created: 2026-02-06 00:53 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
  I noticed that Modal (https://modal.com/blog/gpu-mem-snapshots) and InferX (https://inferx.net/) have implemented the CUDA checkpoint/restore API to drastically reduce cold starts.
  I tried both services and they seem to work extremely well. InferX told me they built it on top of vLLM, so it definitely seems possible. Right now, I'm not aware of any open-source implementation of this technique.
  I would love to see this feature implemented in vLLM as…
- #33917 [Feature]: Up-to-date docker images — feature request — by lee-b (created: 2026-02-05 22:17 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
  The official vllm-openai docker image appears to be significantly outdated, even the nightly image!
  I was already having to build on top of the docker image and install a later transformers version to run some of the models that vLLM claims to support. However, after setting up a new host with CUDA 13.1, I found that the vLLM image won't run even with that update, due to host NVIDIA driver and CUDA library version incompatibilities.
  I'm having to apply…
- #33938 [Bug][ROCm] Platform detection initializes CUDA prematurely, breaking Ray multi-GPU allocation — bug,rocm — by kouroshHakha (created: 2026-02-06 03:33 (UTC+8)) [💬1] ### Your current environment
  The output of `python collect_env.py`:
  ```text Collecting environment information... ============================== System Info ============================== ... ```
- #33872 [Feature]: Community Help Wanted: Migrate Remaining Linear Methods into Kernel Abstraction — feature request — by BadrBasowid (created: 2026-02-05 14:39 (UTC+8)) [💬5] ### 🚀 The feature, motivation and pitch
  This issue is a continuation of "[Refactor] Make FP8 Linear Ops use kernel abstraction" #27814, which moved the FP8 scaled-mm linear path onto the `ScaledMMLinearKernel` interface and improved consistency across FP8 execution paths. The next step is to extend the same kernel-abstraction approach to the remaining linear methods/schemes by routing them through the kernel abstraction inside their existing `LinearMethod` implementations so that (once everyth…
#33888 [Feature]: when support torch 2.10.0 — feature request — by chamwen (创建于: 2026-02-05 16:51 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
when support torch 2.10.0
### Alternatives
No response
### Additional context
…
- #33925 [Bug]: OpenAI API: system message accepts images — bug — by jonoillar (created: 2026-02-06 00:15 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text Your output of `python collect_env.py` here ```…
- #33923 [Feature]: Report cached tokens in the Anthropic `/v1/messages` API — feature request — by msanft (created: 2026-02-05 23:59 (UTC+8)) ### 🚀 The feature, motivation and pitch
  vLLM has implemented the Anthropic `/v1/messages` API since #22627. However, as implemented in that PR, the API does not report cached tokens, while the upstream Anthropic API does. The cached tokens should be reported via the already-available fields:
  ```py class AnthropicUsage(BaseModel): """Token usage information"""
  …
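  A minimal sketch of the idea, using a field name that mirrors Anthropic's public usage schema (`cache_read_input_tokens`); the exact model and field vLLM would use are assumptions here, not taken from the issue:

  ```python
  from dataclasses import dataclass

  @dataclass
  class AnthropicUsage:
      # Hypothetical stand-in for vLLM's pydantic model; the cache field
      # name mirrors Anthropic's usage object and is an assumption.
      input_tokens: int
      output_tokens: int
      cache_read_input_tokens: int = 0  # tokens served from the prompt cache

  usage = AnthropicUsage(input_tokens=100, output_tokens=20,
                         cache_read_input_tokens=80)
  print(usage.cache_read_input_tokens)  # -> 80
  ```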
- #33921 [Bug]: Qwen3 Instruct context is limited to 8K — bug — by mobicham (created: 2026-02-05 23:21 (UTC+8)) ### Your current environment
  Docker: vllm/vllm-openai:v0.15.1, GPU: H100 PCIe 80GB
  ### 🐛 Describe the bug
  …
- #33920 [Bug]: GLM-4.7-Flash OOM during sampler warmup with tensor parallelism on RTX 4090 — bug — by VecherVhatuX (created: 2026-02-05 22:54 (UTC+8)) ### Your current environment
  ### System Info - **vLLM version**: 0.15.0rc2.dev40+g2e8de8677.cu130 - **GPU**: 2x NVIDIA GeForce RTX 4090 (24564 MiB each, compute capability 8.9) - **NVIDIA Driver**: 580.126.09 - **CUDA**: 13.0 ...
- #33906 [Bug]: mxfp4 (gpt-oss moe) on AMD ROCm (W7900/gfx1100) breaks — bug,rocm — by ctheune (created: 2026-02-05 19:41 (UTC+8)) [💬2] ### Your current environment
  The output of `python collect_env.py`:
  ```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ... ```
- #33916 [Bug] IndexError: list index out of range in chat_completion_stream_generator with --tool-call-parser=mistral during streaming tool calls — bug — by dfischer-mw (created: 2026-02-05 22:06 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text ============================== System Info ============================== ... ```
- #33915 [Feature]: Support `include_reasoning` request parameter for non-harmony models — feature request — by cjackal (created: 2026-02-05 21:57 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
  The OpenAI responses API supports an `include_reasoning` parameter to filter reasoning content out of the model response. While setting `include_reasoning=false` in a request does not skip the model's reasoning phase in the inference engine or reduce the token generation cost, it helps reduce network traffic (crucial on some network-limited devices, as the reasoning text often overwhelms the user-facing answer) and is considered inevitable …
- #33877 [Bug]: GLM47 Tool Call Bug — bug — by junzhiL (created: 2026-02-05 15:28 (UTC+8)) [💬1] ### Your current environment
  The output of `python collect_env.py`:
  ```text Your output of `python collect_env.py` here ```…
- #33913 [Usage]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions — usage — by 870223666 (created: 2026-02-05 20:38 (UTC+8)) ### Your current environment
  Please analyze this for me: Traceback (most recent call last): File "/mnt/bn/liweilai-dct1201/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop output = func(*args, **kwargs) File "/mnt/bn/liweilai-dct1201/miniconda3/envs/infer/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context return func(*args, **kwargs) File "/mnt/bn/liweilai-dct1201/miniconda3/envs/infer/lib/py…
- #33910 [Usage]: AssertionError: collective_rpc should not be called on follower node — usage — by PhdShi (created: 2026-02-05 20:26 (UTC+8)) [💬1] ### Your current environment
  ```text Is CUDA available : True CUDA runtime version : 13.1.80 CUDA_MODULE_LOADING set to : LAZY GPU models and configuration : GPU 0: NVIDIA H20 GPU 1: NVIDIA H20 GPU 2: NVIDIA H20 …
- #33880 [Installation]: Can't install vllm 0.15.0 on Windows & Python 3.12 — installation — by ImSo3K (created: 2026-02-05 15:48 (UTC+8)) [💬5] ### Your current environment
  I'm trying to install the latest version but I get: ```text $ pip install vllm==0.15.0 ERROR: Ignored the following yanked versions: 0.2.1 ERROR: Could not find a version that satisfies the requirement vllm==0.15.0 (from versions: 0.0.1, 0.1.0, 0.1.1, 0.1.2, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.2.0, 0.2.1.post1, 0.2.2, 0.2.3, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.3.0, 0.3.1, 0.3.3, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.5.0.post1, 0.5.1, 0.5.2, 0.5.3, 0.5.3.post1, 0.5.4, 0.5.5, 0.6.0…
- #33908 [Usage]: vLLM is an excellent project - thank you to the team! — usage — by Rainlin007 (created: 2026-02-05 19:54 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`
  ### How would you like to use vllm
  This is not really a usage question, but I wanted to take a moment to express my appreciation for the vLLM project. …
- #33873 [Bug]: BGE-M3 pooling endpoint fails with short inputs (< 4 tokens) in token_classify task — no label — by eveningcafe (created: 2026-02-05 14:46 (UTC+8)) [💬1] ### Your current environment
- vLLM Version: v0.15.0
- Model: BAAI/bge-m3
- Task: token_classify
- Endpoint: /pooling
### 🐛 Describe the bug
The `/pooling` endpoint with `task: token_classify` returns a 400 error for inputs that tokenize to fewer than 4 tokens (including the CLS and SEP special tokens). The backend returns a single float value instead of a list of floats, causing Pydantic validation to fail. …
- #33899 [Bug]: DeepSeek-R1-0528 AssertionError: tokens not padded correctly on GB200 — bug — by chaunceyjiang (created: 2026-02-05 18:38 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text Your output of `python collect_env.py` here ```…
- #33895 [Bug]: /tokenize hangs after many requests — bug — by JakubCerven (created: 2026-02-05 18:05 (UTC+8)) ### Your current environment
  n/a
  ### 🐛 Describe the bug
  Running vLLM using `vllm/vllm-openai:v0.15.0` with `QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ`: after calling a large number of `/tokenize` requests, the endpoint becomes unresponsive and hangs forever…
- #33894 [Feature]: Support setting tool-call-parser to auto — feature request — by shell-nlp (created: 2026-02-05 17:36 (UTC+8)) ### 🚀 The feature, motivation and pitch
  It's cumbersome to configure tool-call-parser every time, and it's unclear which parameter to use; I'd suggest supporting setting tool-call-parser to 'auto'.
  ### Alternatives
  No response
  ### Additional context …
- #33871 [Bug]: Local deployment achieves about 30% higher accuracy than the server deployment — bug — by charliess123 (created: 2026-02-05 14:34 (UTC+8)) ### Your current environment
  Collecting environment information... ============================== System Info ============================== OS : Ubuntu 24.04.2 LTS (x86_64) GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Clang version : Could not collect CMake version : Could not collect ...
- #33889 [Usage]: RuntimeError: CUDA error: invalid device ordinal — usage — by Athe-kunal (created: 2026-02-05 17:04 (UTC+8)) [💬1] ### Your current environment
  I am running this tutorial; my script is the same except for my ray init:
  ```python ray.init( runtime_env={ "excludes": [".venv/", ".git/", "*.pyc", "__pycache__/"], "env_vars": { "VIRTUAL_ENV": "", # Unset VIRTUAL_ENV for Ray workers …
- #33887 [Bug]: DeepSeek-R1-0528-NVFP4 Missing TRTLLM-GEN kernel (decode) on GB200 — bug — by chaunceyjiang (created: 2026-02-05 16:42 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text Your output of `python collect_env.py` here ```…
- #33883 [Bug]: DeepSeek-V3.2 NVFP4 with fp8 kvcache reports `src_cache must be uint8` — bug — by kebe7jun (created: 2026-02-05 16:10 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text GB200 main branch ```
- #33864 Bug: CPU KV cache offloading fails for blocks formed during decode — bug,kv-connector — by sriumcp (created: 2026-02-05 12:33 (UTC+8)) [💬1] ## Summary
  When using CPU KV cache offloading (`--kv-offloading-size`), blocks that complete during the decode phase are never offloaded to CPU. Only blocks that complete during prefill are correctly offloaded.
  ## Root Cause
  In `vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py`, the `_get_reqs_to_store()` method calculates the number of blocks to store using: ```python new_tokens = scheduler_output.num_scheduled_tokens[req_id] …
- #33865 [Bug]: OpenAI-compatible Embeddings API intermittently crashes with multimodal cache assertion (`Expected a cached item for mm_hash`) on Qwen3-VL-Embedding-8B — bug — by ojipadeson (created: 2026-02-05 12:45 (UTC+8)) [💬1] ### 🐛 Describe the bug
  Using `vllm serve` with the OpenAI-compatible Embeddings API to run Qwen3-VL-Embedding-8B, the server intermittently crashes while pre-processing embedding requests. The crash is an assertion inside the multimodal feature cache: `AssertionError: Expected a cached item for mm_hash='...'` After that, the API server hits a second assertion:
  …
- #33874 [Bug]: GLM-OCR POST bug — bug — by F0undLinks (created: 2026-02-05 14:46 (UTC+8)) ### Your current environment
  vLLM version: 0.16.0tc1.dev78-gb6bb2842c.cu130, Transformers version: 5.0.1.dev0, GPU: 2080Ti * 4, CUDA version: 13.0…
- #33869 [RFC]: DeepSeek-R1 MoE offload — RFC — by wangyxbh (created: 2026-02-05 13:54 (UTC+8)) # RFC: CPU Offload for Mixture-of-Experts (MoE) Inference in vLLM PR: https://github.com/vllm-project/vllm/pull/31938
## Summary
This PR proposes a CPU Offload Module for MoE inference in vLLM, enabling a large portion of expert weights and computation to be dynamically offloaded to the CPU while keeping only a small, hot subset of experts cached on GPU.
The design supports:
- Hybrid GPU–CPU execution
- Pinned-memory–based weight streaming …
[Closed issues]
- #18324 [Bug]: Clarification regarding bug inside vllm-flash-attn vision module — bug,stale — by vrdn-23 (closed: 2026-02-06 10:17 (UTC+8)) [💬17] ### Your current environment
  N/A
  ### 🐛 Describe the bug
  Hi, I am looking for clarification on the warning message that pops up when trying to load a Molmo Vision model in vLLM. It seems the warning was first introduced in this PR and carried along through many refactors, but I coul…
- #15254 [RFC]: Better support for weight updating while waking up from sleep mode for RLHF — RFC,stale — by erictang000 (closed: 2026-02-06 10:17 (UTC+8)) [💬14] ### Motivation
  Currently, when using sleep mode and wake_up() for RLHF, as in verl, we can run into issues like here, where waking up the vLLM engine OOMs due to additional copies of the model living on the same GPU.
  The issue in verl specifically lives in the [fsdp_vllm sharding manager](https://github.com/volcengine/verl/blob/main/verl/workers/sharding_manager/fsdp_vllm.py#L94…
- #19097 [RFC]: Response format extensions for structured outputs — structured-output,RFC,stale,v1 — by aarnphm (closed: 2026-02-06 10:17 (UTC+8)) [💬16] ### Motivation
  Currently, users can provide additional constraint formats via `extra_body` in the OpenAI client: ```python from enum import Enum from pydantic import BaseModel from openai import OpenAI
  simplified_sql_grammar = """ …
- #20581 [Bug]: Llama4 always goes OOM when using LLama4ForCausalLM — bug,stale — by Mirco-Ramo (closed: 2026-02-06 10:17 (UTC+8)) [💬9] ### Your current environment
  vllm==0.9.1, python==3.10, 8xA100 80GB, Ubuntu 22.04.5 LTS
  INFO 07-07 15:24:09 [__init__.py:244] Automatically detected platform cuda. Collecting environment information... ============================== System Info ============================== ...
- #26014 [Bug]: Bugs in the new logprobs_mode — bug,stale — by YunruiZhang (closed: 2026-02-06 10:16 (UTC+8)) [💬6] ### 🐛 Describe the bug
  The new logprobs_mode has many bugs. When set to logprobs_mode = 'processed_logprobs', it returns the same values as logprobs_mode = 'raw_logprobs'. Additionally, when set to 'processed_logits' or 'raw_logits', it does not produce any meaningful output: the logprobs are all -inf except for rank = 1, which gives 0. I am using MODEL_ID = "Qwen/Qwen3-30B-A3B-Thinking-2507-FP8"
- #26288 [Bug]: `schema` field becomes `None` in Responses API when `stream=True` — bug,stale — by WoutDeRijck (closed: 2026-02-06 10:16 (UTC+8)) [💬2] ### Your current environment
  vllm: 0.10.2, openai: 1.108.0
  ### 🐛 Describe the bug
  #### 1. Client Request Arrives ```python stream = await client.responses.create( …
- #26349 [Bug]: Qwen-3-VL-30B-A3B… is really slow on H100 — bug,stale — by sanakspock (closed: 2026-02-06 10:16 (UTC+8)) [💬2] ### Your current environment
  The output of `python collect_env.py`:
  ```text /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import] Collecting environment information... ... ```
- #33831 [Bug]: DeepSeek V3.2 benchmark failure "TypeError: argument 'tokens': 'NoneType' object" — bug — by wzhao18 (closed: 2026-02-06 07:50 (UTC+8)) [💬4] ### Your current environment
  The output of `python collect_env.py`:
  ```text Collecting environment information... ============================== ... ```
- #33917 [Feature]: Up-to-date docker images — feature request — by lee-b (closed: 2026-02-06 03:55 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
  The official vllm-openai docker image appears to be significantly outdated, even the nightly image!
  I was already having to build on top of the docker image and install a later transformers version to run some of the models that vLLM claims to support. However, after setting up a new host with CUDA 13.1, I found that the vLLM image won't run even with that update, due to host NVIDIA driver and CUDA library version incompatibilities.
  I'm having to apply…
- #33859 [Bug]: DeepSeek V3.2-NVFP4 with flashinfer moe reports `q must have dtype torch::kBFloat16` — bug — by kebe7jun (closed: 2026-02-06 03:22 (UTC+8)) [💬1] ### Your current environment
  The output of `python collect_env.py`:
  ```text GB200 on main branch ```…
- #33802 [CI Failure]: Distributed 2xH100 tests — torch.compile,ci-failure,needs reproduction — by ProExpertProg (closed: 2026-02-05 20:15 (UTC+8)) [💬5] ### Name of failing test
tests/compile/distributed/test_async_tp.py::test_async_tp_pass_correctness
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #33908 [Usage]: vLLM is an excellent project - thank you to the team! — usage — by Rainlin007 (closed: 2026-02-05 19:55 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`
  ### How would you like to use vllm
  This is not really a usage question, but I wanted to take a moment to express my appreciation for the vLLM project. …
- #33873 [Bug]: BGE-M3 pooling endpoint fails with short inputs (< 4 tokens) in token_classify task — no label — by eveningcafe (closed: 2026-02-05 18:51 (UTC+8)) [💬1] ### Your current environment
- vLLM Version: v0.15.0
- Model: BAAI/bge-m3
- Task: token_classify
- Endpoint: /pooling
### 🐛 Describe the bug
The `/pooling` endpoint with `task: token_classify` returns a 400 error for inputs that tokenize to fewer than 4 tokens (including the CLS and SEP special tokens). The backend returns a single float value instead of a list of floats, causing Pydantic validation to fail. …
- #33401 [Bug]: [GML-4.5-Air-Fp8] RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {} — bug — by twilighgt (closed: 2026-02-05 15:37 (UTC+8)) [💬3] ### Your current environment
  vllm 0.12.0
  root@ubuntu:/vllm-workspace# python collect_env.py bash: python: command not found root@ubuntu:/vllm-workspace# python3 collect_env.py Collecting environment information... ============================== System Info …
- #33696 [Bug]: [CPU Backend] Whisper W8A8 failure — bug,cpu — by aditew01 (closed: 2026-02-05 14:26 (UTC+8)) [💬2] ### 🐛 Describe the bug
  Running the W8A8 quantized whisper model RedHatAI/whisper-large-v3-quantized.w8a8 results in the failure:
  ```(EngineCore_DP0 pid=35539) output = self.model_runner.execute_model( (EngineCore_DP0 pid=35539) File "/home/aditew01/envs/tvllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3452, in execute_model (EngineCore_DP0 pid=35539) ) = self._preprocess( (EngineCore_DP0 pid=…
- #33816 [CI Failure]: Quantized Models Test in `tests/models/quantization/test_gptq_marlin.py::test_models[5-32-bfloat16-model2]` — ci-failure — by mgoin (closed: 2026-02-05 13:11 (UTC+8)) [💬2] ### Name of failing test
  tests/models/quantization/test_gptq_marlin.py::test_models[5-32-bfloat16-model2]
  ### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #27669 [Bug]: AssertionError in lora_shrink_op.py during profile_run when serving Qwen3-VL-8B-Instruct with LoRA on v0.11.0 — bug,stale — by CaiNiaoLucifer (closed: 2026-02-05 11:29 (UTC+8)) [💬3] ### Your current environment
  The output of `python collect_env.py`:
  ```text Collecting environment information... ============================== System Info ============================== ... ```
[New PRs]
- #33961 Elastic AFD — documentation,frontend,ci/build,v1,deepseek — by gaidandawang-afk (created: 2026-02-06 11:14 (UTC+8)) [💬3 | +6714/-235, 57 files | commented:1]
  ## Purpose
  Basic Elastic AFD implementation based on #28296 and #29772.
  ## Test Plan
  ## Test Result
  ...
- #33960 [Core][WIP] Pipeline Parallel support for Model Runner V2 — v1 — by ZhanqiuHu (created: 2026-02-06 10:35 (UTC+8)) [💬1 | +402/-17, 2 files | commented:1 | 📝 draft] ## Summary
  This PR adds Pipeline Parallel (PP) support to the experimental Model Runner V2. It introduces a modular `PPHandler` class that encapsulates all PP-related logic, keeping the main model runner code clean. The current implementation uses blocking token synchronization and produces correct output verified against the Model Runner V1 baseline.
  ## Purpose
  Add Pipeline Parallel (PP) support to the experimental Model Runner V2 (`vllm/v1/worker/gpu/model_runner.py`). Related: #32455 (Q1 …
- #33949 [CI][MCP][Harmony] Heavy refactoring of Harmony & MCP response tests, stabilized with deterministic test infrastructure — frontend,gpt-oss — by AndreasKaratzas (created: 2026-02-06 07:26 (UTC+8)) [+980/-802, 8 files | commented:2]
  This PR eliminates systemic test flakiness in the Harmony and MCP Responses API integration tests by addressing root causes in both the test infrastructure and the source code. The core problem was that tests asserted on non-deterministic LLM output using `@pytest.mark.flaky`, which corrupted server fixture lifecycles and masked real failures. This PR replaces that pattern with deterministic infrastructure: pinned system prompts, API-level retries, and a clear separation between server invariants (h…
- #33937 [Bugfix] Fix mamba cache mode null-block padding — bug,v1 — by tianshu-Michael-yu (created: 2026-02-06 03:32 (UTC+8)) [💬1 | +52/-2, 2 files | commented:1] ## Purpose
  In `mamba_get_block_table_tensor`, block id `0` is reserved for `BlockPool.null_block` (never allocated) but can appear in block tables as a placeholder (e.g. mamba align mode). Mamba kernels treat `PAD_SLOT_ID` (-1) as padding; if we pass block id `0` through, kernels can read/write state for the shared null block, causing cross-request state corruption. This PR maps block id `0` to `PAD_SLOT_ID` for all mamba cache modes.
  ## Test Plan
  - `python -m pytest tests/v1/attention/tes…
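  The mapping the PR describes can be sketched as a toy illustration (`mask_null_blocks` is a hypothetical helper, not vLLM code):

  ```python
  PAD_SLOT_ID = -1   # sentinel mamba kernels treat as padding (per the PR)
  NULL_BLOCK_ID = 0  # block id reserved for BlockPool.null_block

  def mask_null_blocks(block_table: list[int]) -> list[int]:
      # Replace the reserved null-block id with the padding sentinel so
      # kernels never read or write state for the shared null block.
      return [PAD_SLOT_ID if b == NULL_BLOCK_ID else b for b in block_table]

  print(mask_null_blocks([3, 0, 7, 0]))  # -> [3, -1, 7, -1]
  ```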
- #33935 Update DeepGEMM version pin in Dockerfile to match #32479 — ci/build — by zifeitong (created: 2026-02-06 03:00 (UTC+8)) [💬2 | +2/-2, 2 files | commented:1] #32479 updated the version in `install_deepgemm.sh` but not the Dockerfile.
- #33912 Optimize popleft_n free-list traversal for KV cache blocks — v1 — by junxiangxiaoxiang (created: 2026-02-05 20:34 (UTC+8)) [💬3 | +9/-10, 1 file | commented:2] ## Purpose
  Optimize the popleft_n implementation used by the KV cache free list to reduce overhead when popping multiple blocks at once. The new implementation:
  - Traverses the free list once while collecting n blocks.
  - Updates the head pointer and neighboring links in a batched manner.
  - Avoids per-node pointer rewiring inside the loop.
  This keeps the semantics unchanged (same set and order of returned blocks, same list-structure invariants) while reducing Python overhead in the hot path.
  …
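  A self-contained sketch of the batched pop on a doubly linked free list (toy classes; this is not vLLM's actual free-list implementation):

  ```python
  class Block:
      def __init__(self, bid):
          self.bid = bid
          self.prev = None
          self.next = None

  class FreeList:
      def __init__(self, blocks):
          self.head = blocks[0] if blocks else None
          for a, b in zip(blocks, blocks[1:]):
              a.next, b.prev = b, a

      def popleft_n(self, n):
          # Single traversal: collect n nodes, then fix the head pointer
          # once instead of rewiring prev/next for every popped node.
          out, node = [], self.head
          for _ in range(n):
              out.append(node)
              node = node.next
          self.head = node
          if node is not None:
              node.prev = None
          for b in out:  # detach popped nodes in one batch pass
              b.prev = b.next = None
          return out

  fl = FreeList([Block(i) for i in range(5)])
  print([b.bid for b in fl.popleft_n(3)], fl.head.bid)  # -> [0, 1, 2] 3
  ```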
- #33909 [Models] Consolidate Deepseek-OCR2 processor — ready,deepseek — by Isotr0py (created: 2026-02-05 20:20 (UTC+8)) [+52/-336, 5 files | commented:1 approved:1]
  ## Purpose
  - The only processor difference between ds-ocr and ds-ocr2 is the `image_size` and the image token count calculation.
  - This PR consolidates their implementations into one processor to avoid duplicated code.
  ## Test Plan
  python examples/offline_inference/vision_language.py -m deepseek_ocr…
- #33959 Make voxtral compile-friendly — no label — by tugsbayasgalan (created: 2026-02-06 09:46 (UTC+8)) [💬1 | +9/-2, 1 file | commented:1]
  ## Purpose
  This PR makes sure the voxtral model can run with torch.compile by working around the torch.compile issue listed here (https://github.com/pytorch/pytorch/issues/174299). This is a temporary fix that will be gone starting with torch 2.11.
  ## Test Plan
  ## Test Result
  ...
- #33878 [CMake] Switch vllm-flash-attn to ExternalProject for separate scope (#9129) — ready,ci/build — by jungledesh (created: 2026-02-05 15:33 (UTC+8)) [💬4 | +33/-1, 1 file | commented:6]
  ## Purpose
  Implements item 4 from #9129: replace `include(cmake/external_projects/vllm_flash_attn.cmake)` with `ExternalProject_Add` so that `vllm-flash-attn` is built in its own isolated CMake scope and process. This eliminates shared-scope "footguns" such as variable overwrites, flag pollution, dependency mismatches, and duplicate-target issues between the main vLLM build and the external flash-attn repository.
  ## Changes
  …
- #33897 Fix nixl connector num blocks check logic — kv-connector — by Rugu7 (created: 2026-02-05 18:24 (UTC+8)) [💬2 | +2/-1, 1 file | commented:1]
  - Cause: When creating a PD split instance, the P and D instances calculated inconsistent `num_blocks` values. This inconsistency caused the API call to fail, triggering an assertion failure in the Decode split service and resulting in the service hanging.
  - Modification: For the same instance in the NIXL connector, the `num_blocks` value is set to …
- #33928 [Refactor] Move sequence normalization and enc-dec parsing to renderer — frontend,ready,needs-rebase,v1,multi-modality — by DarkLight1337 (created: 2026-02-06 00:43 (UTC+8)) [💬3 | +520/-491, 19 files | commented:3] ## Purpose
  - Introduce `*DictPrompt` classes to represent prompts that have been normalized into dictionaries, replacing the `Parsed*Prompt` classes.
  - Update and improve documentation of various input schemas.
  - Move the logic of normalizing a single prompt/conversation, or a sequence of them, into the Renderer.
  - Move the logic of rendering encoder-decoder inputs into the Renderer. (Tokenization will be moved in another PR.)
  - Remove dead code related to encoder-decoder models: …
- #33941 [Bugfix] [ROCm] Fix premature CUDA initialization in platform detection — bug,rocm,ready,ci/build,nvidia — by kouroshHakha (created: 2026-02-06 04:37 (UTC+8)) [💬5 | +133/-6, 6 files | commented:4] ## Summary
  On ROCm, importing `vllm.platforms` triggers `torch.cuda.get_device_properties()` at module load time, which initializes CUDA before Ray workers can set `CUDA_VISIBLE_DEVICES`. This locks `device_count()` to the total number of GPUs, causing all workers to incorrectly use GPU 0 instead of their assigned GPUs. Closes https://github.com/vllm-project/vllm/issues/33938
  ## Problem
  ```python # vllm/platforms/rocm.py (before fix) …
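  The general fix pattern (defer device queries until first use rather than at import time) can be illustrated with a toy stand-in; `_init_driver` and this `get_device_properties` are hypothetical, not the PR's actual code:

  ```python
  import functools

  _initialized = False  # stands in for "CUDA context created"

  def _init_driver():
      global _initialized
      _initialized = True

  @functools.lru_cache(maxsize=None)
  def get_device_properties(device_id: int) -> dict:
      # Lazy: the driver is initialized only on first call, so workers can
      # still set CUDA_VISIBLE_DEVICES between import and first use.
      _init_driver()
      return {"id": device_id}

  print(_initialized)  # importing/defining alone does not init the driver
  get_device_properties(0)
  print(_initialized)  # first use does
  ```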
- #33958 Fix double-scaled linear RoPE cache for long-context configs #33911 — no label — by baonudesifeizhai (created: 2026-02-06 09:42 (UTC+8)) [+17/-1, 1 file | commented:1]
  ## Purpose
  #33911: Gemma-3 configs report a scaled max_position_embeddings (128K) but still include rope_scaling.factor=8, which makes linear RoPE caches scale twice and breaks torch.compile guards. This change adjusts linear RoPE construction to avoid double scaling when original_max_position_embeddings is absent and the context is very large, preventing the 128K→1,048,576 cache blow-up.
  ## Test Plan
  ``` vllm serve google/gemma-3-27b-it --max-model-len 4096 ... ```
- #33957 [Docs] Add reo analytics — documentation — by simon-mo (created: 2026-02-06 09:28 (UTC+8)) [💬1 | +4/-0, 2 files | commented:1] For vLLM community analytics
-
#33952 [CI][BugFix][AMD] Add check for model_config being None and update conftest.py to load AITER of available to fix Kernels MoE Test % N — bug,rocm — by rasmith (创建于: 2026-02-06 08:06 (UTC+8)) [💬1 | +22/-4, 3 files | commented:4]
## Purpose This PR broke many tests (over 30) and this PR fixed one test in the `Kernels MoE Test %N` group, but when the tests are run as a group using `pytest -sv kernels/moe`, the first test that runs does not load AITER ops, and when subsequent tests run, they will also not have AITER ops loaded.
This PR loads the ops in `vllm._aiter_ops` but then ensures tha…
-
#33956 [Bugfix] Fix video frame sampling for short videos and Qwen3-VL 2-frame requirement — bug,multi-modality,qwen — by chengyinie (创建于: 2026-02-06 09:05 (UTC+8)) [💬1 | +11/-1, 2 files | commented:1]
## Purpose Fixes video processing failures for very short videos (< 1 second) when using low `fps` settings. Problem:
- When `fps * duration < 1`, the frame sampling formula `int(math.floor(duration * fps))` returns 0 frames
- This causes `Qwen3VLProcessor` to fail with empty video arrays
- Additionally, Qwen3-VL requires a minimum of 2 frames, so even 1-frame videos fail
Solution: …
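The failure mode and a clamped fix can be sketched as follows (the guarded form is an assumed shape of the solution, not the literal patch):

```python
import math

# Sketch of the failure mode described above and a guarded fix.
def sampled_frames_buggy(duration: float, fps: float) -> int:
    # Sub-second clips with low fps floor to 0 frames -> empty video array.
    return int(math.floor(duration * fps))

def sampled_frames_fixed(duration: float, fps: float, min_frames: int = 2) -> int:
    # Clamp to the model's minimum so short clips still yield 2 frames
    # (Qwen3-VL requires at least 2).
    return max(min_frames, int(math.floor(duration * fps)))

print(sampled_frames_buggy(0.5, 1.0))  # prints: 0
print(sampled_frames_fixed(0.5, 1.0))  # prints: 2
```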
-
#33934 [Frontend]Add support for transcriptions and translations to run_batch — frontend,needs-rebase — by pooyadavoodi (创建于: 2026-02-06 02:46 (UTC+8)) [💬2 | +555/-101, 3 files | commented:4] ## Purpose Adding support for transcriptions and translations in the vllm OpenAI batch API (i.e. run_batch.py).
Note that the OpenAI Batch API doesn’t support transcriptions or translations. This will be a vllm feature only AFAIK.
## Test Plan Adding new unit tests to tests/entrypoints/openai/test_run_batch.py
## Test Result pytest -v tests/entrypoints/openai/test_run_batch.py …
-
#33955 Rocm/fused short seq attention — rocm — by MohamedSayedFathy (创建于: 2026-02-06 08:26 (UTC+8)) [💬1 | +263/-54, 1 files | commented:1]
## Purpose
Optimize PagedAttention v2 decode performance on ROCm for short sequences (≤256 tokens) by eliminating the reduce kernel when only a single partition is needed.
When `max_num_partitions == 1`, the QKV kernel writes directly to `final_out` instead of `tmp_out`, skipping:
- The `tmp_out` intermediate buffer write (~8 KB per seq/layer)
- The `tmp_out` read-back in the reduce kernel (~8 KB per seq/layer)
- The `exp_sums`/`max_logits` metadata writes (~256 …
-
#33919 Fix RoutingMethodType logic — 无标签 — by dbari (创建于: 2026-02-05 22:51 (UTC+8)) [💬1 | +47/-11, 4 files | commented:1 | 📝草稿]
## Purpose
This PR contains two fixes for #33792:
- Fix the selection logic for `RoutingMethodType` in `fused_topk_bias_router.py` and `fused_topk_router.py`
- When `use_grouped_topk=True` but the `GroupedTopKRouter` does not find any valid routing method and there is only one group, fall back to the non-grouped routers
The latter point covers Mistral Large 3, which has `n_group=1` and `topk_group=1` but uses `Renormalize` instead of `DeepSeekV3` routing, handled by…
-
#33953 Regular FP8 LoRA kernels — 无标签 — by yugong333 (创建于: 2026-02-06 08:12 (UTC+8)) [+1395/-0, 4 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#33948 [Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache — bug,rocm — by Rohan138 (创建于: 2026-02-06 07:25 (UTC+8)) [💬2 | +17/-53, 1 files | commented:2 approved:1]
Reuse `AttentionBackend` utils to initialize the dummy KV cache with the required shape and stride. Also, it turns out this UT is being skipped on ROCm because of a typo/missing rename, so `BACKENDS` -> `BACKENDS_FP8`.
## Purpose
## Test Plan
## Test Result
…
-
#33945 [torch.compile][Fusion] Fix attention fusion pass removing kv_update op. — ready,torch.compile — by charlifu (创建于: 2026-02-06 06:20 (UTC+8)) [💬1 | +5/-0, 1 files | commented:2 approved:1] Add a `kv_cache_dummy_dep` parameter to the attention fusion pass to make sure the resulting IR keeps this parameter. Otherwise the `kv_cache_update` op will be removed by the cleanup pass.
-
#33936 [Doc] Add DCP support to attention backend doc — documentation — by mgoin (创建于: 2026-02-06 03:24 (UTC+8)) [💬2 | +769/-644, 2 files | commented:1] ## Purpose
Also refactor table rendering to avoid duplication, and extract FA/FI variant expansion and compute-capability parsing
## Test Plan
## Test Result
... -
#33947 [WIP][CI] Add integration tests for P2pNccl and LMCache connectors — tpu,ci/build,v1,kv-connector — by eicherseiji (创建于: 2026-02-06 06:25 (UTC+8)) [+32/-6, 10 files | commented:1 | 📝草稿] ## Summary
Add CI coverage for P2pNcclConnector and LMCacheConnectorV1 - currently only NixlConnector has integration tests in CI.
## Changes
- Parameterize `run_accuracy_test.sh`: accept a `KV_CONNECTOR` environment variable (defaults to `NixlConnector` for backwards compatibility)
- Add P2pNccl Connector test: 2 GPUs, 20 min timeout
…
-
#33940 [Docs] Add sections on vLLM’s process count and minimum CPU resources — documentation,ready — by mgoin (创建于: 2026-02-06 04:31 (UTC+8)) [💬3 | +113/-0, 4 files | commented:2]
## Purpose
It seems users can be confused about vLLM’s performance when running with very small amounts of CPU cores available. We are missing a clear overview of what vLLM’s process architecture is, so I added this along with some diagrams in arch_overview.md, and included a section on CPU resource recommendations in optimization.md
Here are some example diagrams I generated with help from Opus 4.6
<img width="2816" height="1536" alt="Generated_image1" src="htt…
-
#33946 Add Video stats to response metadata — frontend — by zakariaelh (创建于: 2026-02-06 06:23 (UTC+8)) [💬1 | +57/-13, 9 files | commented:1] -
#33943 move checks out of `unified_kv_cache_update` custom op — rocm,v1 — by Rohan138 (创建于: 2026-02-06 05:56 (UTC+8)) [+79/-100, 7 files | commented:3]
## Purpose Move checks for k, v, and kv_sharing_target_layer_name up from `unified_kv_cache_update` into `Attention.forward`. Note that the other PRs in #32335 should follow this pattern as well; I think after all those PRs are merged, we can safely remove the `kv_sharing_target_layer_name` arg from the `AttentionImpl` class entirely.
## Test Plan
## Test Result …
-
#33944 [Log] Optimize duplicate startup log — v1 — by yewentao256 (创建于: 2026-02-06 06:07 (UTC+8)) [+10/-7, 3 files | commented:1] ## Purpose
Optimize logs like
```
INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner
INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner
INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner
INFO 02-05 21:47:09 [gpu_worker.py:122] Using V2 Model Runner
```
-
#33942 Adding support to Sarvam’s MoE models — new-model — by rahul-sarvam (创建于: 2026-02-06 04:54 (UTC+8)) [💬2 | +717/-0, 2 files | commented:1] This PR adds support for Sarvam MoE model executors in vLLM.
SarvamMoEForCausalLM implements a standard MoE architecture with conventional MHA, while SarvamMLAForCausalLM implements an MoE variant using MLA. The implementation reuses existing vLLM primitives (MLA modules, fused MoE, TP/PP support, and quantization paths), with minimal extensions for Sarvam-specific routing, expert bias normalization, and weight-loading compatibility.
- #33939 Enable Eagle3 speculative decoding for Mistral3ForConditionalGeneration to support eagle3 — 无标签 — by TundeAtSN (创建于: 2026-02-06 03:53 (UTC+8)) [💬1 | +9/-1, 1 files | commented:1] This PR adds support for Eagle3 spec decoding for Mistral3ForConditionalGeneration model. Changes were tested with a locally trained speculator model, and observed reasonable acceptance rates.
-
#33926 [BugFix] Fix small race condition when pausing generation for RL — bug,ready,v1 — by njhill (创建于: 2026-02-06 00:29 (UTC+8)) [+10/-11, 2 files | commented:3] Related to https://github.com/vllm-project/vllm/pull/28037.
Without this, it’s technically possible for a request to slip in after pausing.
There’s also an issue with multiple API servers, but a check for that will be added in https://github.com/vllm-project/vllm/pull/32351.
cc @SamitHuang @hao-aaron
-
#33866 fix: Qwen3ReasoningParser - handle prompt prefix format for Thinking models — qwen — by seli-equinix (创建于: 2026-02-05 13:34 (UTC+8)) [💬5 | +16/-22, 1 files | commented:4] ## Summary
Fixes `Qwen3ReasoningParser` to correctly extract reasoning content from Qwen3-Thinking models (e.g., `Qwen/Qwen3-Next-80B-A3B-Thinking-FP8`).
## Problem
The current parser requires both `<think>` and `</think>` tags in the model output:
```python
if self.start_token not in model_output or self.end_token not in model_output:
    …
```
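The relaxed parsing described here can be sketched as follows: Thinking models may have the opening `<think>` consumed by the prompt prefix, so only `</think>` appears in the output. The helper below is hypothetical, not vLLM's exact parser code:

```python
# Hypothetical sketch of prefix-tolerant reasoning extraction: accept output
# where only the closing </think> tag is present.
def split_reasoning(output: str, start: str = "<think>", end: str = "</think>"):
    if end not in output:
        return None, output  # no reasoning section present
    head, _, rest = output.partition(end)
    if head.startswith(start):
        head = head[len(start):]  # strip the opening tag when it is present
    return head.strip(), rest.strip()

print(split_reasoning("some thoughts</think>final answer"))
# prints: ('some thoughts', 'final answer') -- works without the opening tag
```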
- #33881 [BugFix][kv_offload] Fix offloading decodes with async scheduling — bug,ready,kv-connector — by orozery (创建于: 2026-02-05 15:53 (UTC+8)) [💬1 | +7/-2, 1 files | commented:1] This PR fixes OffloadingConnector to offload also decoded tokens, which were previously skipped as their block hash was not available. This issue only occurs when async scheduling is enabled.
- #33933 [WIP] part 2: helion integration into vllm gemm+fp8+all_gather — 无标签 — by LironKesem (创建于: 2026-02-06 02:28 (UTC+8)) [+573/-10, 3 files | commented:1 | 📝草稿]
Current state:
- I was able to integrate it -> the test passes, but I have a leak; haven't looked into it yet.
```
TORCH_LOGS="+dynamic" NCCL_DEBUG=TRACE USE_HELION_BACKEND=1 VLLM_LOGGING_LEVEL=DEBUG VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_async_tp.py -k "TestHelionAGScaledMMModel" --log-cli-level=DEBUG
```
TODOs (in this PR):
- move kernel registration to `collective_fusion.py`
- move the kernel implementation into `vllm/vllm/kernels/helion/distributed…
-
#33932 [Bugfix] Fix DSV3.2 NVFP4 — bug,ready — by MatthewBonanni (创建于: 2026-02-06 01:18 (UTC+8)) [💬1 | +4/-2, 1 files | commented:1 approved:1] ## Purpose Fix https://github.com/vllm-project/vllm/issues/33859
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#33904 [Misc] Rename `translations` to `speech_to_text` for OAI serving component — frontend,ready — by NickLucche (创建于: 2026-02-05 19:20 (UTC+8)) [💬1 | +11/-11, 7 files | approved:1 commented:3] Follow up to https://github.com/vllm-project/vllm/pull/32313/ as expressed in this comment https://github.com/vllm-project/vllm/pull/32313/#issuecomment-3769015464; translations is likely too reductive for the ASR endpoints. cc @DarkLight1337
- #33902 Fix tokenizer test for renamed attr on Transformers v5 — ready — by hmellor (创建于: 2026-02-05 18:58 (UTC+8)) [+9/-1, 1 files | commented:1 approved:1] This attr has been renamed in Transformers v5. This PR updates the test to check both locations.
-
#33890 [kernel] use fused_topk of flashinfer — 无标签 — by ZJY0516 (创建于: 2026-02-05 17:05 (UTC+8)) [💬1 | +55/-10, 1 files | commented:2] ## Purpose Use `fused_topk` of flashinfer. TODO: I don't know when to use this for other models
## Test Plan
```
vllm serve deepseek-ai/DeepSeek-V3 \
  --trust-remote-code \
  …
```
-
#33896 [Bugfix] Fix illegal memory access in AWQ-Marlin with CUDA graphs (Fixes #32834) — bug,needs-rebase,nvidia — by KrxGu (创建于: 2026-02-05 18:10 (UTC+8)) [💬3 | +3648/-3444, 4 files | commented:4] This PR fixes issue #32834 where AWQ-Marlin quantization causes illegal memory access when used with CUDA graphs.
## Problem AWQ-Marlin kernels use workspace buffers allocated during model loading. The pointers to these buffers are captured during CUDA graph creation, but the buffers themselves can be reallocated or moved, making the captured pointers stale. When the CUDA graph replays, it uses these stale pointers, resulting in illegal memory access errors.
## Solution Automatically enable `c…
-
#33868 support view_from_cpu_tensor on XPU — 无标签 — by xinyu-intel (创建于: 2026-02-05 13:41 (UTC+8)) [+5/-1, 1 files | commented:2]
## Purpose
support get_cuda_view_from_cpu_tensor on XPU. This will be used for ModelRunnerV2
## Test Plan
## Test Result
…
- #33922 Support benchmarking of Geospatial models — performance,multi-modality — by mgazz (创建于: 2026-02-05 23:58 (UTC+8)) [+106/-50, 3 files | commented:3]
This PR adds the following features necessary for Geospatial models like Prithvi:
- Ability to run benchmarks when the tokenizer is not initialised, via a new argument `--skip-tokenizer-init`. In this case the benchmark does not print token-related metrics.
- A new benchmark backend that supports sending requests to the /pooling endpoint.
The benchmark can be run as follows:
Step 1: start the server
```
…
```
-
#33931 [Misc] Add debug logs — kv-connector — by NickLucche (创建于: 2026-02-06 00:58 (UTC+8)) [+5/-0, 2 files | commented:1 approved:1] Nixl is currently missing some minor yet vital debug logs that I find particularly useful when trying to debug issues without access to the machine. This PR logs kv cache shapes to address that.
cc @orozery
-
#33879 [BugFix] Fix LoRA Fp8 — bug,ready,ci/build — by danisereb (创建于: 2026-02-05 15:43 (UTC+8)) [💬2 | +14/-8, 1 files | commented:2 approved:1] ## Purpose
When running Nemotron Nano FP8: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
Using LoRA adapters fails:
`ValueError: ModelOptFp8MoEMethod uses the new modular kernel initialization logic. This function should not be called.`…
-
#33929 [Bugfix][Docker] Install CUDA dev packages for JIT compilation headers — bug,ci/build,nvidia — by jasonlizhengjian (创建于: 2026-02-06 00:50 (UTC+8)) [💬1 | +3/-3, 1 files | commented:1 | 📝草稿] Install cuda-cudart-dev, cuda-nvrtc-dev, and libcublas-dev instead of runtime-only packages to provide headers (cuda.h, cuda_runtime.h, nvrtc.h, cublasLt.h) needed for FlashInfer JIT compilation of fp8_blockscale_gemm_sm90 kernels.
Fixes #33833
## Purpose
## Test Plan
…
-
#33927 [Bugfix] Fix tokenizer model_max_length incorrectly constraining user-specified max_model_len — bug — by soyr-redhat (创建于: 2026-02-06 00:41 (UTC+8)) [💬1 | +73/-3, 2 files | commented:1] This fixes a bug where the tokenizer's model_max_length would override the user's explicit `--max-model-len` parameter, preventing users from utilizing extended context lengths even when the model supports them.
For example, with Qwen/Qwen3-4B-Instruct-2507-FP8:
- Model supports 256K tokens (max_position_embeddings = 262144)
- Tokenizer config has model_max_length = 8192
- User specifies `--max-model-len 262144`
- Before: Server would cap at 8192 and reject 10K token requests
- After: Server correc…
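The intended precedence can be sketched as follows. This is an assumed shape of the logic inferred from the description (the function name and signature are hypothetical, not vLLM's code):

```python
# Hypothetical sketch: an explicit user --max-model-len should win over the
# tokenizer's model_max_length, bounded only by max_position_embeddings.
def resolve_max_model_len(user_max, tokenizer_max, model_max):
    if user_max is not None:
        return min(user_max, model_max)   # honor the user's explicit choice
    return min(tokenizer_max, model_max)  # otherwise fall back to tokenizer cap

# Numbers from the Qwen3-4B-Instruct-2507-FP8 example above:
print(resolve_max_model_len(262144, 8192, 262144))  # prints: 262144
print(resolve_max_model_len(None, 8192, 262144))    # prints: 8192
```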
-
#33924 Add test for Nemotron Nano FP8 with LoRA adapters — 无标签 — by danisereb (创建于: 2026-02-06 00:14 (UTC+8)) [+72/-0, 2 files | commented:1 | 📝草稿]
## Purpose
Verify that Nemotron Nano FP8 works with LoRA adapters.
Added following this bugfix (test works only with this fix): https://github.com/vllm-project/vllm/pull/33879
## Test Plan …
-
#33901 [Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul — cpu — by nikhil-arm (创建于: 2026-02-05 18:48 (UTC+8)) [💬2 | +10/-2, 1 files | commented:2 approved:1] Improvements on Neoverse v1:
| Metric | Before | After | Speedup |
|---|---|---|---|
| Throughput (rps) | 0.30 | 0.55 | +83% |
| Total tok/s | 348.70 | 632.84 | +81% |
| Out tok/s | 38.74 | 70.32 | +82% |
fix formatting issues …
-
#33918 [WIP][Kernel] Add Helion kernel for static_scaled_fp8_quant — 无标签 — by xiaohongchen1991 (创建于: 2026-02-05 22:46 (UTC+8)) [+218/-0, 2 files | commented:1] ## Purpose This PR adds a Helion kernel for the `static_scaled_fp8_quant` operation. It follows the implementation from the vLLM C version. This is a subtask for https://github.com/vllm-project/vllm/issues/32962.
## Test Plan
- Test correctness
  - Added unit test to cover its correctness
- Benchmark performance at kernel level with different kernel imple…
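The math behind a static (precomputed-scale) FP8 quantization can be sketched in a few lines; this pure-Python illustration only shows the scale-and-clamp step, not the actual Helion or C kernel:

```python
# Minimal sketch of static scaled FP8 quantization: divide by a precomputed
# scale, then clamp to the float8_e4m3 finite range. The real kernel operates
# on GPU tensors; this just illustrates the arithmetic.
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3 value

def static_scaled_fp8_quant(xs, scale):
    return [max(-FP8_E4M3_MAX, min(x / scale, FP8_E4M3_MAX)) for x in xs]

print(static_scaled_fp8_quant([100.0, -1000.0, 50000.0], scale=100.0))
# prints: [1.0, -10.0, 448.0] -- the out-of-range value is clamped
```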
- #33903 [NIXL] Version bump — ci/build,kv-connector — by NickLucche (创建于: 2026-02-05 19:03 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1] Regular version bump for NIXL.
-
#33898 [Metrics] Some small refactoring for better maintainability — speculative-decoding,v1,kv-connector — by hickeyma (创建于: 2026-02-05 18:25 (UTC+8)) [+117/-122, 6 files | commented:1 changes:1] ## Purpose
When reviewing #30950, I identified some parts of the code that if refactored would benefit the code base from a maintainability and bugs pov.
The issues identified are as follows:
- Use suffix `Prom` instead of `Prometheus` for consistency. See `KVConnectorPrometheus` in vllm/distributed/kv_transfer/kv_connector/v1/metrics.py
- Consolidate the duplicated helper function `make_per_engine` into one function instead of duplicates
- Rename `make_per_engine` -> `create_met…
-
#33905 [Docs] Add bart-plugin to docs — documentation,ready — by NickLucche (创建于: 2026-02-05 19:38 (UTC+8)) [💬1 | +17/-4, 3 files | commented:2 approved:1] Add bart-plugin https://github.com/vllm-project/bart-plugin to docs, mention it as a reference for OOT Model implementation, and add `BartForConditionalGeneration` to supported models. cc @DarkLight1337
-
#33907 [Bugfix] Fix Random Dataset Prefix Length Inaccuracy — bug,performance — by frankwang28 (创建于: 2026-02-05 19:45 (UTC+8)) [+29/-10, 1 files | commented:1]
## Purpose This PR fixes the `RandomDataset` prefix generation so the shared prefix token length is adjusted via decode-encode tokenization, similar to the way the input prompt is generated.
## Test Plan Observe prefix cache hit rate before and after. For testing, a prefix length of 4096 with an input length of 1024 is used. The expected prefix cache hit rate should thus be around 80% (4096/(4096+1024) = 0.8).
## Test Result Before: …
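The expected-hit-rate arithmetic from the test plan, reproduced:

```python
# Each request shares a 4096-token prefix and appends 1024 unique tokens, so
# with a warm prefix cache the expected hit rate is prefix / (prefix + input).
prefix_len, input_len = 4096, 1024
expected_hit_rate = prefix_len / (prefix_len + input_len)
print(expected_hit_rate)  # prints: 0.8
```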
-
#33886 [Bugfix] Fix corner case of sparse embedding — bug,ready — by noooop (创建于: 2026-02-05 16:35 (UTC+8)) [💬1 | +11/-1, 2 files | commented:1 approved:1]
## Purpose
Fix #33873
## Test Plan
tests/models/language/pooling/test_bge_m3.py
…
-
#33875 Update kv_cache_utils.py — v1,tool-calling — by junxiangxiaoxiang (创建于: 2026-02-05 15:04 (UTC+8)) [💬7 | +194/-195, 28 files | commented:1] Optimize popleft_n by pre-allocating list
## Purpose
This PR optimizes the `popleft_n` method in `FreeKVCacheBlockQueue` to improve performance when popping multiple free KV cache blocks from the head of the free list. The original implementation used repeated `list.append()` and reset the `prev_free_block`/`next_free_block` pointers of each popped block inside the loop, which introduces unnecessary overhead. The optimized version applies two key improvements:
- **Pre-allocate the result li…
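The pre-allocation idea can be sketched on a simplified singly linked free list. vLLM's `FreeKVCacheBlockQueue` is doubly linked and does more bookkeeping (pointer resets are omitted here); this only illustrates replacing repeated `append()` with a pre-sized result list:

```python
# Simplified sketch of the pre-allocation optimization described above.
class Block:
    __slots__ = ("id", "next")
    def __init__(self, id):
        self.id = id
        self.next = None

def popleft_n(head: Block, n: int):
    out = [None] * n  # pre-allocate instead of repeated list.append()
    cur = head
    for i in range(n):
        out[i] = cur
        cur = cur.next
    return out, cur  # popped blocks and the new head

# Build the free list 1 -> 2 -> 3 -> 4 and pop two blocks from the head.
blocks = [Block(i) for i in range(1, 5)]
for a, b in zip(blocks, blocks[1:]):
    a.next = b
popped, new_head = popleft_n(blocks[0], 2)
print([b.id for b in popped], new_head.id)  # prints: [1, 2] 3
```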
-
#33876 [Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading — bug,ready,deepseek — by Isotr0py (创建于: 2026-02-05 15:24 (UTC+8)) [+15/-5, 2 files | commented:1 approved:1]
## Purpose
`nvidia/Kimi-K2.5-NVFP4` is quantized based on the legacy model layout (`language_model.layers.*`), which was refactored in #33346. Since v0.15.0 has been released, this PR adds backward compatibility to load this NVFP4 checkpoint.
## Test Plan Run with `nvidia/Kimi-K2.5-NVFP4`:
```
python examples/offline_inference/vision_language.py -m kimi_k25 --modality vision_chunk …
```
-
#33900 [Misc] Update code for encoder-decoder models — ready,v1,multi-modality — by DarkLight1337 (创建于: 2026-02-05 18:47 (UTC+8)) [+9/-3, 2 files | approved:1 commented:1]
## Purpose
FIX https://github.com/vllm-project/vllm/pull/33559#discussion_r2754749999 FIX https://github.com/vllm-project/vllm/pull/33559#discussion_r2754759311
## Test Plan
## Test Result …
-
#33892 [W8A8 Block Linear Refactor][3/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections. — performance,nvidia — by maralbahari (创建于: 2026-02-05 17:32 (UTC+8)) [+1617/-760, 20 files | commented:2 | 📝草稿]
## Purpose This PR refactors the FP8 block scaled linear kernel integration in vLLM to improve code clarity, maintainability, and path consistency.
Removes the `W8A8Fp8BlockLinearOp` class and updates all code paths and files that use this class.
## Test Plan CUDA platform: run CI/CD tests. …
-
#33891 [W8A8 Block Linear Refactor][2/N] Make Fp8 block linear Op use kernel abstraction. — nvidia — by maralbahari (创建于: 2026-02-05 17:29 (UTC+8)) [+1236/-15, 11 files | commented:1]
## Purpose This PR refactors the FP8 block scaled linear kernel integration in vLLM to improve code clarity, maintainability, and path consistency.
This PR Extracts only block-scaled kernels into the kernel abstraction
## Test Plan …
-
#33893 [W8A8 Block Linear Refactor][4/N] Make all scaled MM kernels inherit from common generic base. — performance,needs-rebase,cpu,nvidia — by maralbahari (创建于: 2026-02-05 17:34 (UTC+8)) [💬1 | +1881/-1018, 23 files | commented:1 | 📝草稿]
## Purpose This PR unifies the inheritance of all types of ScaledMM kernels under the same superclass.
Applies base inheritance to the remaining ScaledMM kernels for consistent code and improved maintainability of linear kernel classes.
## Test Plan
## Test Result …
-
#33884 [Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast — bug,deepseek — by kebe7jun (创建于: 2026-02-05 16:15 (UTC+8)) [+16/-4, 1 files | commented:1]
## Purpose
Fix https://github.com/vllm-project/vllm/issues/33883
cc @LucasWilkinson
## Test Plan
…
-
#33870 [CI/Build] Fix CPU CI test case title — ready,ci/build,v1,cpu — by bigPYJ1151 (创建于: 2026-02-05 14:30 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1]
## Purpose
`CPU-TP/DP/PP Tests` may violate Buildkite's key naming rule.
## Test Plan
## Test Result
…
-
#33860 Add FlashAttention v2.8.3 scaling benchmark on Mistral-7B (H100) — performance — by RegularJoe-CEO (创建于: 2026-02-05 12:00 (UTC+8)) [💬4 | +34/-0, 1 files | commented:1] ## Purpose
Add baseline FlashAttention v2.8.3 scaling benchmark data on NVIDIA H100 for reference and validation purposes. This documents attention mechanism performance across sequence lengths from 1K to 32K tokens on Mistral-7B-v0.1.
## Test Plan
```
cd ~/vllm/benchmarks/attention_benchmarks
python vllm_longcontext_benchmark.py 2>&1 | tee mistral_flash_scaling.log
```
…
-
#33867 [WIP][Kernel] Add Helion kernel for dynamic_per_tensor_scaled_fp8_quant — 无标签 — by xiaohongchen1991 (创建于: 2026-02-05 13:35 (UTC+8)) [+120/-0, 2 files | commented:1] ## Purpose This PR adds a Helion kernel for the `dynamic_per_tensor_scaled_fp8_quant` operation. It follows the implementation from the vLLM C version. This is a subtask for https://github.com/vllm-project/vllm/issues/32962.
## Test Plan
- Test correctness
  - Added unit test to cover its correctness
- Benchmark performance at kernel level with different …
-
#33862 Waller Operator: Constant 14ms attention latency across 512-524K tokens (24.5x faster than FlashAttention at 32K) — performance,ci/build,v1 — by RegularJoe-CEO (创建于: 2026-02-05 12:12 (UTC+8)) [💬4 | +42/-13, 4 files | commented:1] ## Purpose
Introduce the Waller Operator attention mechanism demonstrating constant O(N log N) memory complexity and O(1) latency characteristics. This PR provides head-to-head benchmark comparison against FlashAttention v2.8.3 baseline (PR #33860).
Key Results:
- Constant latency: 14.168-14.305ms across 1000x sequence length increase (512 → 524K tokens)
- 24.5x faster than FlashAttention at 32K tokens
- Zero throughput degradation vs FlashAttention’s 76% loss
- **Scalability:** …
-
#33863 [docs] fix unintentional misspellings — documentation — by rinbaro (创建于: 2026-02-05 12:17 (UTC+8)) [💬1 | +3/-3, 3 files | commented:1 approved:1] ### Motivation
These misspellings looked unintentional.
-
#33861 [docs] fix unintentional misspellings — documentation — by rinbaro (创建于: 2026-02-05 12:09 (UTC+8)) [💬2 | +3/-3, 3 files | commented:1] ## Motivation make docs great again.
These 3 looked unintentional.
[已合并 PR]
-
#33511 fix(ROCm): Make flash_attn import optional in MLA attention — rocm,ready — by rabi (合并于: 2026-02-06 10:22 (UTC+8)) [💬2 | +19/-3, 1 files | commented:5 approved:2] ## Purpose
On ROCm, models that don't use MLA were failing to load because attention/__init__.py eagerly imported MLAAttention, which in turn tried to import flash_attn unconditionally.
- Makes flash_attn import optional in mla_attention.py with try/except
- Adds a clear error message when MLA is used without flash_attn
This allows non-MLA models to work on ROCm without needing flash_attn installed.
## Test Plan …
-
#33909 [Models] Consolidate Deepseek-OCR2 processor — ready,deepseek — by Isotr0py (合并于: 2026-02-06 02:29 (UTC+8)) [+52/-336, 5 files | commented:1 approved:1]
## Purpose
- Actually, the processor difference between ds-ocr and ds-ocr2 is only the `image_size` and image token num calculation.
- This PR consolidates their implementation into one processor to avoid duplicate code.
## Test Plan
python examples/offline_inference/vision_language.py -m deepseek_ocr…
- #33957 [Docs] Add reo analytics — documentation — by simon-mo (合并于: 2026-02-06 09:42 (UTC+8)) [💬1 | +4/-0, 2 files | commented:1] For vLLM community analytics
-
#33568 [Perf] Disable clean_logits in deepgemm fp8_mqa_logits kernel — ready — by xyang16 (合并于: 2026-02-06 09:34 (UTC+8)) [💬2 | +61/-27, 4 files | commented:4 approved:2] ## Purpose
During profiling of DeepSeek V3.2, I noticed in sparse_attn_indexer that the `deep_gemm::smxx_clean_logits` kernel was launched following `deep_gemm::sm90_fp8_mqa_logits`, because `clean_logits` is set to true in https://github.com/vllm-project/vllm/blob/v0.16.0rc0/vllm/utils/deep_gemm.py#L331. The purpose of the `deep_gemm::smxx_clean_logits` kernel is to fill the padding values of the MQA logits with -inf, see https://github.com/deepseek-ai/DeepGEMM/blob/v2.1.1/README.md?plain=1#L124. However, I foun…
-
#31162 [Feature] OTEL tracing during loading — frontend,ready,ci/build,v1,cpu — by emricksini-h (合并于: 2026-02-06 08:59 (UTC+8)) [💬16 | +873/-280, 29 files | commented:7 approved:1] ## Purpose
This PR instruments the vLLM loading process with OpenTelemetry (OTel) tracing to provide precise, span-based observability into initialization performance.
## Why we need it Currently, performance insights for vLLM’s startup phase rely on parsing unstructured logs. Tools like llm-d-benchmark depend on specific log strings to calculate durations [see here](https://github.com/llm-d/llm-d-benchmark/blob/main/workload/harnesses/nop-llm-d-benc…
-
#33832 [Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue — bug,ready,deepseek — by wzhao18 (合并于: 2026-02-06 07:50 (UTC+8)) [💬2 | +4/-0, 1 files | commented:6 approved:1]
## Purpose Fix #33831
## Test Plan
```
lm_eval --model local-completions --model_args "model=deepseek-ai/DeepSeek-V3.2,base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5
```
…
-
#33527 Adds padding and perf improvements to wvSplitK_fp8 — rocm,ready — by amd-hhashemi (合并于: 2026-02-06 06:16 (UTC+8)) [💬8 | +169/-229, 3 files | commented:2 approved:1] Adds activation padding support to wvSplitKQ. Additionally improves bias and dpp reduce perf. Expands test scenarios.
## Purpose
## Test Plan
## Test Result
... -
#33491 [Minor] Sort safetensors files to ensure deterministic loading order — ready — by Lumosis (合并于: 2026-02-06 06:05 (UTC+8)) [💬7 | +11/-2, 1 files | commented:1 approved:1] This PR explicitly sorts hf_weights_files by filename within the safetensors_weights_iterator
## Purpose
We are refactoring the weight loading logic to better support TPU execution. To optimize memory usage and initialization on TPU, we need to shard each layer to TPU HBM immediately after its weights are available.
By sorting the safetensors files (which typically follow a sequential naming convention like model-00001-of-00005.safetensors), we ensure that weights are loaded in a deterministi…
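The determinism argument is simple to demonstrate: directory listing order is filesystem-dependent, and an explicit sort restores the sequential shard order implied by the naming convention:

```python
# Filenames follow the sequential shard convention mentioned above; a plain
# lexicographic sort recovers the intended loading order regardless of how
# the filesystem enumerated them.
files = [
    "model-00003-of-00005.safetensors",
    "model-00001-of-00005.safetensors",
    "model-00002-of-00005.safetensors",
]
ordered = sorted(files)
print(ordered[0])  # prints: model-00001-of-00005.safetensors
```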
-
#33817 [Bugfix] Make MM batching more robust — bug,ready,ci/build,multi-modality — by DarkLight1337 (合并于: 2026-02-06 04:40 (UTC+8)) [💬6 | +625/-428, 13 files | commented:4 approved:1] ## Purpose
FIX https://github.com/vllm-project/vllm/pull/32955#issuecomment-3849069453
This adds a bit of hashing overhead, but should be negligible since shared fields normally contain single integers (e.g. image/video token ID) or booleans (e.g. `use_audio_in_video`), with the notable exception of Terratorch models. cc @christian-pinto
Related changes:
…
-
#33932 [Bugfix] Fix DSV3.2 NVFP4 — bug,ready — by MatthewBonanni (合并于: 2026-02-06 03:22 (UTC+8)) [💬1 | +4/-2, 1 files | commented:1 approved:1] ## Purpose Fix https://github.com/vllm-project/vllm/issues/33859
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#33904 [Misc] Rename `translations` to `speech_to_text` for OAI serving component — frontend,ready — by NickLucche (合并于: 2026-02-06 03:16 (UTC+8)) [💬1 | +11/-11, 7 files | approved:1 commented:3] Follow up to https://github.com/vllm-project/vllm/pull/32313/ as expressed in this comment https://github.com/vllm-project/vllm/pull/32313/#issuecomment-3769015464; translations is likely too reductive for the ASR endpoints. cc @DarkLight1337
- #33902 Fix tokenizer test for renamed attr on Transformers v5 — ready — by hmellor (合并于: 2026-02-06 03:16 (UTC+8)) [+9/-1, 1 files | commented:1 approved:1] This attr has been renamed in Transformers v5. This PR updates the test to check both locations.
-
#29714 [Bugfix] Suppress non-TTY color output on the process name part of the log — bug,ready — by a4lg (合并于: 2026-02-06 02:47 (UTC+8)) [💬5 | +6/-1, 1 files | commented:4 approved:1] ## Purpose
Regular logs are not decorated with colors when non-TTY stdout/stderr is selected as the logging output (c.f. the `_use_color` function in `vllm/logger.py`). However, the process-based decoration part in `vllm/utils/system_utils.py` did not implement color output suppression like the regular logger. This resulted in excess escape sequences in redirected output (e.g. `log.txt` after running `vllm serve ... >log.txt 2>&1`) when all conditions are met:
- Neither `NO…
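The standard pattern behind such a fix can be sketched as follows; the helper names are illustrative, not vLLM's exact functions:

```python
import io
import sys

# Common pattern for suppressing ANSI colors on non-TTY outputs, similar in
# spirit to the check described above.
def use_color(stream) -> bool:
    # Only decorate when the stream is an interactive terminal.
    return hasattr(stream, "isatty") and stream.isatty()

def decorate(name: str, stream=sys.stderr) -> str:
    if use_color(stream):
        return f"\033[36m{name}\033[0m"  # cyan when attached to a terminal
    return name  # plain text when redirected to a file or pipe

# A StringIO stands in for a redirected log file: isatty() is False.
print(decorate("EngineCore_0", io.StringIO()))  # prints: EngineCore_0
```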
-
#33375 [Moe Refactor] Make Inplace Flag for FusedMoEModularKernel part of the constructor — ready,nvidia — by bnellnm (合并于: 2026-02-06 02:07 (UTC+8)) [💬1 | +132/-109, 37 files | commented:10] ## Purpose Pass the inplace flag to the FusedMoEModularKernel constructor. Knowing the inplace behavior (or being able to disable it) ahead of time will help simplify some of the runtime logic in layer.py related to running shared experts and making clones of the `hidden_states`.
## Test Plan CI + MoE refactoring tests
## Test Result
cc @robertgshaw2-redhat , @ProExpertProg
…
-
#32887 [Spec Decode] Unified Parallel Drafting — documentation,performance,speculative-decoding,ready,v1,llama,nvidia — by benchislett (合并于: 2026-02-06 01:37 (UTC+8)) [💬8 | +1085/-392, 14 files | commented:10] ## Purpose
This PR implements a single input preparation kernel for draft model support, and parallel drafting both with and without hidden states from the target model. As such we now have support for AMD’s PARD, which proposes parallel drafting for fine-tuned external draft models, and AWS’ P-EAGLE which implements parallel prediction for EAGLE3. Both of these are benchmarked as part of this PR effort.
## Testing
E2E tests for parallel drafting and unit tests for the input preparation logic…
- #33795 [Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path — bug,ready,llama,kv-connector — by zackyoray (合并于: 2026-02-06 01:51 (UTC+8)) [💬1 | +3/-3, 1 files | commented:1 approved:3]
The Llama 4 local attention optimization code path in `_read_blocks` had the engine_ids swapped when calling `_get_block_descs_ids`:
- `layer_local_desc_ids` was incorrectly using `dst_engine_id`
- `layer_remote_desc_ids` was incorrectly using `self.engine_id`
This caused the wrong `num_blocks` to be used when computing descriptor IDs, resulting in "remote index out of range" errors during NIXL KV cache transfers with heterogeneous tensor parallelism (e.g., prefill TP=2, decode TP=4). The fix …
-
#33931 [Misc] Add debug logs — kv-connector — by NickLucche (合并于: 2026-02-06 01:42 (UTC+8)) [+5/-0, 2 files | commented:1 approved:1] Nixl is currently missing some minor yet vital debug logs that I find particularly useful when trying to debug issues without access to the machine. This PR logs kv cache shapes to address that.
cc @orozery
-
#33879 [BugFix] Fix LoRA Fp8 — bug,ready,ci/build — by danisereb (合并于: 2026-02-06 01:25 (UTC+8)) [💬2 | +14/-8, 1 files | commented:2 approved:1] ## Purpose
When running Nemotron Nano FP8: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
Using LoRA adapters fails:
ValueError: ModelOptFp8MoEMethod uses the new modular kernel initialization logic. This function should not be called.…
-
#31943 [Feat][RL][1/2] Native Weight Syncing API: NCCL — documentation,frontend,ready,ci/build,v1 — by hao-aaron (合并于: 2026-02-06 01:13 (UTC+8)) [💬23 | +2974/-2, 27 files | commented:10] ## Purpose
This PR introduces native weight syncing APIs for vLLM to support reinforcement learning post-training workflows (RLHF, PPO, etc.).
Currently, open-source projects like SkyRL, VeRL, and TRL must implement their own weight syncing infrastructure to use vLLM as an inference server during training. This leads to duplicated effort and requires users to version-lock to specific implementations. See RFC #31848 for full motivation. …
-
#32710 [ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) — bug,rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-02-05 18:01 (UTC+8)) [💬4 | +11/-0, 2 files | commented:10] Fixes Mamba-1 models producing garbage output on ROCm.
## Problem
On ROCm, Mamba models (e.g., `state-spaces/mamba-130m-hf`) produce completely incorrect output:
- Expected: “The LLM is a high-performance, scalable…”
- Actual: “fprintf NdEx INDIRECT roidism oneliness”
## Root Cause
…
- #33690 [Bugfix] Fix step3p5 parser when using mtp — bug,ready — by mariohong128 (merged: 2026-02-06 00:04 (UTC+8)) [💬2 | +1455/-5, 2 files | commented:1 approved:1]
## Purpose
Fix step3.5 parser when using mtp.
If the model outputs
`</tool_call><tool_call><` (using MTP greatly increases the likelihood of this), the parser will incorrectly start a new empty tool call. ``` {"error":{"message":"1 validation error for ValidatorIterator\\n1.function.name\\n Field required [type=missing, input_value={\'arguments\': \'{}\'}, input_type=dict]\\n For further information visit https://errors.pydantic.dev/2.11/v/missing None","type":"BadRequestError… -
#33905 [Docs] Add bart-plugin to docs — documentation,ready — by NickLucche (merged: 2026-02-05 20:20 (UTC+8)) [💬1 | +17/-4, 3 files | commented:2 approved:1] Add bart-plugin https://github.com/vllm-project/bart-plugin to the docs, mention it as a reference for an OOT model implementation, and add
`BartForConditionalGeneration` to supported models. cc @DarkLight1337
-
#33886 [Bugfix] Fix corner case of sparse embedding — bug,ready — by noooop (merged: 2026-02-05 18:51 (UTC+8)) [💬1 | +11/-1, 2 files | commented:1 approved:1]
## Purpose
Fix #33873
## Test Plan
tests/models/language/pooling/test_bge_m3.py
…
-
#33876 [Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading — bug,ready,deepseek — by Isotr0py (merged: 2026-02-05 18:29 (UTC+8)) [+15/-5, 2 files | commented:1 approved:1]
## Purpose
`nvidia/Kimi-K2.5-NVFP4` is quantized based on the legacy model layout (`language_model.layers.*`), which was refactored in #33346. Since v0.15.0 has been released, this PR adds backward compatibility to load this NVFP4 checkpoint.
## Test Plan
Run with `nvidia/Kimi-K2.5-NVFP4`: ``` python examples/offline_inference/vision_language.py -m kimi_k25 --modality vision_chunk … -
#33687 [Refactor] Clean up input preprocessing — ready,multi-modality — by DarkLight1337 (merged: 2026-02-05 18:43 (UTC+8)) [💬1 | +91/-204, 4 files | commented:3 approved:1] ## Purpose
Remove code specific to text-only encoder-decoder models, since we now implement them as MM models to simplify the code.
## Test Plan
## Test Result
... -
#31171 [perf] Integrate flashinfer concat_mla_k — ready,v1,nvidia — by jiahanc (merged: 2026-02-05 18:23 (UTC+8)) [💬10 | +64/-3, 2 files | commented:5 approved:2] ## Purpose Integrate FlashInfer concat_mla_k kernel
## Test Plan
## Test Result ``` local-completions (base_url=http://0.0.0.0:8087/v1/completions,pretrained=/ds-models/DeepSeek-R1-0528-FP4-v2,model=/ds-models/DeepSeek-R1-0528-FP4-v2,add_bos_token=True,tokenized_requests=False,tokenizer_backend=None,num_concurrent=1024,timeout=60000,max_retries=5,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto |Tasks|Version| Filter |n-shot| Metric | |…
-
#33339 Enable Cross layers KV cache layout at NIXL Connector V2 — documentation,ready,v1,kv-connector — by liranschour (merged: 2026-02-05 18:17 (UTC+8)) [💬1 | +339/-89, 6 files | commented:9 changes:1] ## Purpose Enable the NIXL Connector to use the new contiguous cross-layer KV cache layout described in the RFC and implemented in #27743
Demonstrates a performance improvement of more than 2x in tok/s and TTFT due to a dramatic reduction in fragmentation of transfer buffers.
Tested with P!=D with run_accuracy_test.sh P=1 D=2
branch num reqs input len TTFT ITL tok/s Desc/transfer – – -… -
#33796 [Refactor] Move `task` outside of `PoolingParams.verify` — frontend,ready,ci/build,v1 — by DarkLight1337 (merged: 2026-02-05 17:33 (UTC+8)) [💬1 | +186/-216, 24 files | commented:1 approved:1] ## Purpose
The task assignment only needs to be done in `LLM.encode` and the `/pooling` endpoint. This change allows verification to take place in a single place inside `InputProcessor`. Also fixed `test_pooling_params.py` not being run in CI.
## Test Plan
…
-
#33858 [Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. — bug,ready,deepseek — by pavanimajety (merged: 2026-02-05 17:32 (UTC+8)) [💬1 | +3/-11, 1 files | commented:1 approved:2] ## Purpose This PR fixes a bug introduced in PR #33174 that sets the values of n_group and topk_group to None when they are (1, 1) respectively. While this fixes Kimi-K2, it may introduce an error with Mistral. @dbari Please confirm whether this fix is good or the values need to be passed differently.
The Marlin path works because it doesn’t have a monolithic kernel for routing + MoE, unlike the INT4 TRTLLM MoE kernels.
## Test Plan GSM8k before and after.
## Test Result Main
Kimi-K2-Thinking(… -
#33557 [Perf] Optimize the performance of structured output + reasoning — structured-output,frontend,ready,v1 — by chaunceyjiang (merged: 2026-02-05 15:45 (UTC+8)) [💬7 | +51/-61, 4 files | commented:2 approved:1] ## Purpose
- Optimize the performance of structured output + reasoning
- Move `reasoner.is_reasoning_end(request.prompt_token_ids or [])` from the core engine to the frontend.
- Fix https://github.com/vllm-project/vllm/issues/33215
The root cause of #33215 is the use of the parameter `"chat_template_kwargs": {"thinking": true, "enable_thinking": true}`. …
-
#30522 [KV Connector][Metrics] Do not count local prefix cache hits in connector queries — ready,v1,kv-connector — by markmc (merged: 2026-02-05 15:57 (UTC+8)) [💬4 | +115/-31, 3 files | commented:4 approved:1] Somewhat related to #28585
The `vllm:external_prefix_cache_queries` metric was double-counting queries by recording the total prompt tokens instead of only the tokens actually queried from the KV connector. Example scenario:
- Request with 1000 prompt tokens
- Local cache finds 600 tokens
- External cache finds 200 of the remaining 400 tokens
- Computed: 200 tokens
…
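The example scenario above can be written out as a few lines of arithmetic (variable names are illustrative): only the tokens missed by the local prefix cache should count as connector queries.

```python
# Numbers from the scenario above.
prompt_tokens = 1000
local_cache_hits = 600      # tokens found in the local prefix cache
external_cache_hits = 200   # tokens found by the KV connector

external_queries_buggy = prompt_tokens                     # double-counts local hits
external_queries_fixed = prompt_tokens - local_cache_hits  # 400 tokens actually queried

print(external_queries_fixed)                        # 400
print(external_cache_hits / external_queries_fixed)  # 0.5 external hit rate
```

With the buggy accounting, the apparent external hit rate would be 200/1000 = 20% instead of the true 50%.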
-
#33870 [CI/Build] Fix CPU CI test case title — ready,ci/build,v1,cpu — by bigPYJ1151 (merged: 2026-02-05 15:07 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1]
## Purpose
`CPU-TP/DP/PP Tests` may violate the key naming rule of Buildkite.
## Test Plan
## Test Result
…
-
#33727 [CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs — bug,ready,cpu — by fadara01 (merged: 2026-02-05 14:26 (UTC+8)) [💬2 | +3/-1, 1 files | commented:2 approved:1] ## Purpose
This enables us to run RedHatAI/whisper-large-v3-quantized.w8a8 on CPU.
Fixes: #33696
## Test Plan
Reproducer in #33696. Will enable this in CI once we accelerate the w8a8 GEMMs in oneDNN …
-
#33837 [Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result — bug,frontend,ready,multi-modality — by AndreasKaratzas (merged: 2026-02-05 14:17 (UTC+8)) [💬8 | +21/-44, 1 files | commented:1 approved:1] PR #33060 introduced a regression where passing a single `ScoreMultiModalParam` with multiple content items (e.g., 2 images as documents) to `LLM.score()` returns only 1 result instead of 2.
## Root Cause
Found via git bisect: commit `1b8fe6f7c` (“[Frontend][4/n] Make pooling entrypoints request schema consensus `ScoreRequest`”). In `validate_score_input()`, when a single `ScoreMultiModalParam(content=[item1, item2])` is passed, the new code wraps it as `[ScoreMultiModalParam(...)]` (length… -
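A toy reproduction of the regression described above (`validate_score_input` here is a simplified stand-in, not vLLM's actual function): a single multi-document param must be expanded into its content items rather than wrapped as a one-element list.

```python
from dataclasses import dataclass

@dataclass
class ScoreMultiModalParam:
    content: list  # one entry per document to score

def validate_score_input_buggy(param: ScoreMultiModalParam) -> list:
    return [param]  # length 1 -> only one score is produced downstream

def validate_score_input_fixed(param: ScoreMultiModalParam) -> list:
    return list(param.content)  # one entry per document -> one score each

param = ScoreMultiModalParam(content=["image1", "image2"])
print(len(validate_score_input_buggy(param)))  # 1
print(len(validate_score_input_fixed(param)))  # 2
```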
#33778 [CI/Build] Parallelize CPU CI tests — ready,ci/build,v1,cpu — by bigPYJ1151 (merged: 2026-02-05 13:53 (UTC+8)) [💬1 | +157/-130, 6 files | commented:1 approved:1] ## Purpose
- Split CPU CI tests for parallel execution
- Add source dependencies to test cases
## Test Plan
CI tests
## Test Result …
-
#33281 [2/N] move responses/serving _make_response_output_items logic to parser — frontend,ready — by qandrew (merged: 2026-02-05 13:46 (UTC+8)) [💬14 | +241/-99, 2 files | commented:8 approved:1]
## Purpose
As titled. By moving the reasoning / tool-calling logic into the parser, each model can implement its own interpretation of how to parse. We can then migrate GPT-OSS into the parser, and models that support parallel tool calling / interleaved reasoning can use the parser structure.
stacking on top of https://github.com/vllm-project/vllm/pull/32712
## Test Plan
…
-
#33840 [CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly — bug,rocm,ready — by rasmith (merged: 2026-02-05 13:05 (UTC+8)) [+11/-5, 1 files | commented:1 approved:1]
## Purpose A prior change broke many tests, including `test_rocm_aiter_topk.py`. Either an environment variable or a backend must be selected for AITER ops to be properly imported.
This PR updates the test to set VLLM_ROCM_USE_AITER to 1 so loading of AITER ops will occur and the test can check that the ops were correctly loaded.
## Test Plan …
-
#33863 [docs] fix unintentional misspellings — documentation — by rinbaro (merged: 2026-02-05 12:50 (UTC+8)) [💬1 | +3/-3, 3 files | commented:1 approved:1] ### Motivation
These misspellings looked unintentional.
- #33856 [Minor] Include `StreamingInput` in inputs package — ready,v1 — by njhill (merged: 2026-02-05 12:38 (UTC+8)) [+6/-4, 4 files | commented:1 approved:1] Follow-on from the recently introduced streaming inputs support. -
#33841 Revert “[Attention][FA3] Update FA3 to include new swizzle optimization” — ready,ci/build,v1 — by ProExpertProg (merged: 2026-02-05 11:54 (UTC+8)) [💬2 | +3/-13, 3 files | commented:1 approved:1] Reverts vllm-project/vllm#23465
As described in #33802, #23465 broke the Distributed Tests 2 GPUs (H100).
- CI run for #23465: https://buildkite.com/vllm/ci/builds/49808/steps/canvas?sid=019c244e-f4b2-46e4-bdea-4e8c0c0d8bb9&tab=output
- CI run for #31034 (previous commit): https://buildkite.com/vllm/ci/builds/49807/steps/canvas?sid=019c244e-cc62-4d6c-a7d6-9c2922ed648f&tab=output
Note that the tests have been slightly refactored, so the failing test (`test_async_tp.py::test_async_pass_corr…
[Closed unmerged PRs]
- #14148 [V1][Metrics] Add model_load_time as a log for CUDA devices — needs-rebase,stale,v1 — by sahelib25 (closed: 2026-02-06 10:18 (UTC+8)) [💬21 | +13/-3, 3 files | commented:10] This PR adds the following metric to V1:
| Metric Name | Type | Unit |
| – | – | – |
| model_load_time | Provided as a log for CUDA devices | Seconds |
-
#26184 [Perf] Non blocking cpu->gpu for spec decoding — ready,stale,v1 — by chelsea0x3b (closed: 2026-02-06 10:16 (UTC+8)) [💬9 | +14/-21, 2 files | commented:4 approved:1] ## Purpose
Speculative decoding previously did a blocking gpu->cpu data transfer, which you can see in this nsight profile:
After this PR it looks like this:
Summary of changes:
- Replaced call to `torch.cat` & `torch.full` with a `torch…
- #28238 [NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe — documentation,ready,ci/build,qwen,nvidia — by jiahanc (closed: 2026-02-06 08:26 (UTC+8)) [💬24 | +194/-4, 4 files | commented:7 approved:2 | 📝draft]
## Purpose
- Integrate flashinfer trtllm-gen BF16 moe to supported models
- Fixed flashinfer trtllm-gen moe SM check ## Test Plan `VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --max-num-batched-tokens 8192 --max-model-len 32768 --no-enable-prefix-caching --async-scheduling --compilation_config.pass_config.enable_fi_allreduce_fusion true --compilation_config.pass_config.enable_noop true --compilation_config.custom_ops+=+rms_n…
-
#31255 [LoRA] Simplify gpt-oss w13_lora_b weight loading — gpt-oss — by xyang16 (closed: 2026-02-06 07:29 (UTC+8)) [💬4 | +1/-8, 1 files | commented:1]
## Purpose
This PR simplifies the gpt-oss `w13_lora_b` weight-loading code.
The gpt-oss `lora_b` weights are in an interleaved layout `[w1_0, w3_0, w1_1, w3_1, ...]`. The current `_slice_w13_b` code does de-interleaving, slicing, and re-interleaving, but it can be simplified by directly returning a contiguous slice of the original interleaved dimension. Below is a short example of the original code:
…
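The layout argument above can be illustrated with plain Python lists (a toy sketch, not the actual vLLM code): because each `w1_i` sits next to its `w3_i` in the interleaved layout, a per-shard slice is already contiguous, so de-interleave/slice/re-interleave is unnecessary.

```python
# Interleaved layout: [w1_0, w3_0, w1_1, w3_1, ...]
interleaved = ["w1_0", "w3_0", "w1_1", "w3_1", "w1_2", "w3_2", "w1_3", "w3_3"]

def slice_old(w13: list, start: int, end: int) -> list:
    w1 = w13[0::2]                      # de-interleave
    w3 = w13[1::2]
    w1s, w3s = w1[start:end], w3[start:end]
    out = []
    for a, b in zip(w1s, w3s):          # re-interleave
        out += [a, b]
    return out

def slice_new(w13: list, start: int, end: int) -> list:
    # A contiguous slice of the interleaved dim keeps the (w1_i, w3_i) pairs intact.
    return w13[2 * start : 2 * end]

print(slice_new(interleaved, 1, 3))  # ['w1_1', 'w3_1', 'w1_2', 'w3_2']
```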
-
- #33946 Add Video stats to response metadata — frontend — by zakariaelh (closed: 2026-02-06 06:23 (UTC+8)) [💬1 | +57/-13, 9 files | commented:1]
- #33024 [V1][Hybrid] Enable spec decode and optimize block-aligned split in mamba cache align mode — ready,needs-rebase,v1 — by peakcrosser7 (closed: 2026-02-06 01:17 (UTC+8)) [💬5 | +39/-24, 3 files | commented:5]
## Purpose
- Re-enabled spec decoding: Previously disabled in #30877, speculative decoding is now re-enabled as the related issues have been confirmed as fixed.
- Optimized block-aligned splitting for resumed requests: Refined the logic to ensure that Mamba states for resumed requests are also cached in a block-aligned fashion, maintaining consistency across prefill phases.
## Test Plan
## Test Result
... -
- #33927 [Bugfix] Fix tokenizer model_max_length incorrectly constraining user-specified max_model_len — bug — by soyr-redhat (closed: 2026-02-06 00:46 (UTC+8)) [💬1 | +73/-3, 2 files | commented:1] This fixes a bug where the tokenizer’s model_max_length would override the user’s explicit --max-model-len parameter, preventing users from utilizing extended context lengths even when the model supports them.
For example, with Qwen/Qwen3-4B-Instruct-2507-FP8:
- Model supports 256K tokens (max_position_embeddings = 262144)
- Tokenizer config has model_max_length = 8192
- User specifies --max-model-len 262144
- Before: Server would cap at 8192 and reject 10K token requests
- After: Server correc…
-
#30906 [Feature]: support serving nvfp4 W4A16 moe models using Marlin — no labels — by EdalatiAli (closed: 2026-02-06 00:36 (UTC+8)) [💬3 | +220/-12, 2 files | commented:2]
## Purpose This PR enables serving nvfp4 W4A16 compressed-tensors quantized MoE models by adding `CompressedTensorsW4A16Nvfp4MoeMethod`. Weight-only nvfp4 quantization improves quality at the expense of higher latency at large concurrencies. As the provided results show, the nvfp4 W4A16 quantized version of `Qwen/Qwen3-30B-A3B` significantly outperforms the nvfp4 W4A4 variant.
## Test Plan …
- #33903 [NIXL] Version bump — ci/build,kv-connector — by NickLucche (closed: 2026-02-05 21:58 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1] Regular version bump for NIXL.
-
#33787 [Fix]: Prepack weights for w8a8 oneDNN matmul — cpu — by nikhil-arm (closed: 2026-02-05 18:46 (UTC+8)) [💬2 | +3/-2, 1 files | commented:2] Improvements on Neoverse V1:
| Metric | Before | After | Speedup |
| – | – | – | – |
| Throughput (rps) | 0.30 | 0.55 | +83% |
| Total tok/s | 348.70 | 632.84 | +81% |
| Out tok/s | 38.74 | 70.32 | +82% |
## Purpose …
-
#33875 Update kv_cache_utils.py — v1,tool-calling — by junxiangxiaoxiang (closed: 2026-02-05 19:01 (UTC+8)) [💬7 | +194/-195, 28 files | commented:1] Optimize popleft_n by pre-allocating list
## Purpose
This PR optimizes the `popleft_n` method in `FreeKVCacheBlockQueue` to improve performance when popping multiple free KV cache blocks from the head of the free list.
The original implementation used repeated `list.append()` and reset the `prev_free_block`/`next_free_block` pointers of each popped block inside the loop, which introduces unnecessary overhead.
The optimized version applies two key improvements:
- **Pre-allocate the result li…
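The optimization described above can be sketched with a toy doubly-linked free list (`Block`, `make_free_list`, and this `popleft_n` are simplified stand-ins, not vLLM's `FreeKVCacheBlockQueue` code):

```python
class Block:
    def __init__(self, bid: int):
        self.bid = bid
        self.prev_free_block = None
        self.next_free_block = None

def make_free_list(n: int) -> Block:
    # Build a doubly-linked list of n free blocks and return its head.
    blocks = [Block(i) for i in range(n)]
    for a, b in zip(blocks, blocks[1:]):
        a.next_free_block, b.prev_free_block = b, a
    return blocks[0]

def popleft_n(head: Block, n: int):
    ret = [None] * n                 # pre-allocate instead of repeated append()
    curr = head
    for i in range(n):               # walk the list without touching pointers
        ret[i] = curr
        curr = curr.next_free_block
    for blk in ret:                  # reset pointers once, after the walk
        blk.prev_free_block = blk.next_free_block = None
    if curr is not None:
        curr.prev_free_block = None
    return ret, curr                 # popped blocks and the new head

head = make_free_list(5)
popped, head = popleft_n(head, 3)
print([b.bid for b in popped], head.bid)  # [0, 1, 2] 3
```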
-
#33861 [docs] fix unintentional misspellings — documentation — by rinbaro (closed: 2026-02-05 12:12 (UTC+8)) [💬2 | +3/-3, 3 files | commented:1] ## Motivation Make docs great again.
These 3 looked unintentional.