[vLLM GitHub Development Digest] 2026-01-29
[Overview]
- Time window: 2026-01-29 11:23 (UTC+8) ~ 2026-01-30 11:23 (UTC+8)
- New issues: 24 (label distribution: bug:11, feature request:7, RFC:2, usage:2)
- Closed issues: 33
- New PRs: 62 (label distribution: ready:24, v1:22, bug:15, documentation:11, ci/build:9)
- Merged PRs: 41
- PRs closed without merging: 13
[New issues]
- #33386 [Feature]: Add support for local image file paths in vision reranker API — feature request — by RC-Qiao (created: 2026-01-30 11:18 (UTC+8))
### 🚀 The feature, motivation and pitch
Currently, the /rerank API (examples/pooling/score/vision_rerank_api_online.py) only supports:
- Text documents
- Image URLs
However, there is no way to directly provide a local file path (e.g., /Users/xxx/Desktop/photo.jpg) to the API.
### Alternatives
No response …
- #33384 [Bug]: vLLM serve crashes after processing requests with FlashMLA + DP on DeepSeek — bug — by Mandaluoren (created: 2026-01-30 10:56 (UTC+8))
### Your current environment
Environment
- GPU: NVIDIA H20
- Model: DeepSeek-R1
- Backend: FlashMLA
- TP: 4
- DP: 2
…
- #33361 [Bug]: Different embeddings produced by LLM and AsyncLLM — bug — by kabell (created: 2026-01-30 01:52 (UTC+8)) [💬1]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== System Info ============================== ...
```
- #33319 [Bug]: Requests Stuck in Waiting Queue with Zero Running — bug — by yptheangel (created: 2026-01-29 16:01 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py`:
```text
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import] Collecting environment information... ============================...
```
- #33369 [Bug]: Serving model in 0.15.0 Docker container hangs - 0.14.1 worked fine — bug — by sininspira2 (created: 2026-01-30 05:42 (UTC+8)) [💬3]
### Your current environment
I don't have all of the python libraries to run on the host, and running the script inside the docker container isn't working. I'm running vllm/vllm-openai:latest-cu130. The container has access to 2x RTX 6000 Pro Server Edition (Blackwell) ...
- #33381 [RFC]: Align with the openresponses spec. — RFC — by chaunceyjiang (created: 2026-01-30 09:24 (UTC+8))
### Motivation.
Open Responses is an open-source specification for multi-provider, interoperable LLM interfaces inspired by the OpenAI Responses API. It defines a shared request/response model, streaming semantics, and tool invocation patterns so clients and providers can exchange structured inputs and outputs in a consistent shape.
ref: https://www.openresponses.org
### Proposed Change.
…
- #33333 [Bug]: sm_120 - NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the deployment configuration since kernel does not support current device. — bug — by shahizat (created: 2026-01-29 19:16 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... uv is set ============================== System Info ============================== ...
```
- #33356 [Feature]: Pipeline Parallel Features & Performance Optimizations — feature request — by yewentao256 (created: 2026-01-30 01:03 (UTC+8))
### 🚀 The feature, motivation and pitch
We can further optimize pipeline parallelism in vLLM for better performance.
Tasks:
- Enable async scheduling with Pipeline Parallel
  - https://github.com/vllm-project/vllm/pull/32359
  - Under review: https://github.com/vllm-project/vllm/pull/32618
- Use isend/irecv for async communication: https://github.com/vllm-project/vllm/pull/33368
- Chunked Pipeline Prefill for long context prefill (idea introduced by https://lmsys.org…
- #33367 [New Model]: Bart — no labels — by JingHHj (created: 2026-01-30 04:58 (UTC+8))
### The model to consider.
I am working on supporting Florence-2 (https://github.com/vllm-project/vllm/issues/31819) on the newest version; it stopped being supported after v0.10.0. However, its language model fully uses Bart (https://huggingface.co/facebook/bart-large), which has also stopped being supported in current versions. So I am planning to start with supporting Bart first, before I do Florence-2.
### The closest model vllm already supports.
No response
…
- #33364 [Bug]: Qwen3-Next speculative decoding (DeepSeek MTP) fails for num_speculative_tokens>=3 on H100 with FlashInfer + FP8 KV cache (GMMA operator JIT compile error) — bug — by mingi001025 (created: 2026-01-30 02:43 (UTC+8))
### Your current environment
The output of `python collect_env.py`:
```text
--2026-01-29 10:35:20-- https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ... ...
```
- #33363 [TESTS] Unit tests for GDN attn — no labels — by vadiklyutiy (created: 2026-01-30 02:43 (UTC+8)) [💬1]
In #33326 a serious error was found in the GDN attention kernel. @benchislett proposed: "Please add a unit test for this case. Otherwise, lgtm"
Originally posted by @benchislett in https://github.com/vllm-project/vllm/pull/33326#pullrequestreview-3723804936
- #33341 [Feature][Performance][Speculative Decoding]: Support Full CUDA Graph for the drafter — feature request — by tomasruizt (created: 2026-01-29 20:25 (UTC+8)) [💬1]
### 🚀 The feature, motivation and pitch
Currently, the drafter code in eagle.py only supports piecewise CUDA graphs. Integrating full CUDA graphs would speed up drafter execution and reduce the overall runtime, e.g. of speculative decoding with method=draft_model. This should be possible in practice because we already run these models in full CUDA graph mode when they run standalone (without speculative decoding). I made some measurements to forecast how much faster SD would run with full CUDA…
- #33347 Bad issue, please ignore — feature request — by stisiTT (created: 2026-01-29 23:45 (UTC+8)) [💬1]
Error
- #33348 [Bug] GLM-4.7 uses wrong reasoning parser (should use deepseek_r1 instead of glm45) — no labels — by QwertyJack (created: 2026-01-29 23:47 (UTC+8))
## Summary
GLM-4.7 models have a different chat template than GLM-4.5/4.6 models. The key difference is:
- GLM-4.5/4.6: the <think> token is NOT included in the generation prompt. The model generates <think> as part of its output.
- GLM-4.7: the <think> token IS included in the generation prompt. The model starts generating reasoning content directly without outputting <think> first.
This means GLM-4.7 should use the deepseek_r1 reasoning parser (which handles the case where `<t…
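The parser mismatch described above can be sketched in a few lines. This is an illustrative stand-in, not vLLM's actual parser classes (deepseek_r1 and glm45 name real vLLM parsers, but the logic here is deliberately simplified):

```python
# Illustrative sketch: why <think> placement changes which reasoning
# parser GLM-4.7 needs. These functions are simplified stand-ins, not
# vLLM's real parser implementations.

def parse_glm45_style(output: str) -> tuple[str, str]:
    """GLM-4.5/4.6 style: the model emits <think> itself, so the parser
    looks for a complete <think>...</think> block in the output."""
    start = output.find("<think>")
    end = output.find("</think>")
    if start != -1 and end != -1:
        reasoning = output[start + len("<think>"):end]
        return reasoning, output[end + len("</think>"):]
    return "", output  # no reasoning block found

def parse_deepseek_r1_style(output: str) -> tuple[str, str]:
    """GLM-4.7 / DeepSeek-R1 style: <think> is already in the prompt, so
    the output starts mid-reasoning and contains only the closing tag."""
    end = output.find("</think>")
    if end != -1:
        return output[:end], output[end + len("</think>"):]
    return output, ""

# A GLM-4.7 output never repeats <think>, so a glm45-style parser finds
# no reasoning block and misclassifies the whole output as content.
glm47_output = "The user asks about X.</think>The answer is Y."
```

With this output, the deepseek_r1-style parser recovers ("The user asks about X.", "The answer is Y."), while the glm45-style parser returns no reasoning at all, which is the bug being reported.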
- #33340 [Feature]: start DPEngineCoreProc asynchronously — feature request — by zhaozx-cn (created: 2026-01-29 20:08 (UTC+8))
### 🚀 The feature, motivation and pitch
### Motivation
Currently, vLLM launches each DPEngineCoreProc synchronously. This issue proposes starting DPEngineCoreProc asynchronously, which would improve service startup speed.
### Proposed Implementation Details
utils.py:
- Add start_async to execute all DPEngineCoreProc process starts
- Add _enginecore_bootstrap to set the device id and call EngineCoreProc.run_engine_core
- Use ThreadPoolExecutor to run start_async …
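The proposed startup flow can be sketched with the stdlib. start_async and _enginecore_bootstrap are the names from the issue, but the bodies below are stand-ins for the real device setup and EngineCoreProc.run_engine_core call:

```python
# Minimal sketch of the proposal: start all DP engine-core bootstraps
# concurrently so startup cost is roughly max(t_i) instead of sum(t_i).
# Function names follow the issue; the bodies are illustrative stand-ins.
from concurrent.futures import ThreadPoolExecutor

def _enginecore_bootstrap(dp_rank: int) -> int:
    # Stand-in for: set the device id for this rank, then call
    # EngineCoreProc.run_engine_core (which would block until ready).
    return dp_rank

def start_async(dp_size: int) -> list[int]:
    # Launch every DPEngineCoreProc bootstrap in parallel instead of
    # one after another; map() preserves rank order in the results.
    with ThreadPoolExecutor(max_workers=dp_size) as pool:
        return list(pool.map(_enginecore_bootstrap, range(dp_size)))
```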
- #33338 [Bug]: Crash when using presence_penalty with Qwen3-VL in v0.11.0 — bug — by Lrcx (created: 2026-01-29 19:54 (UTC+8))
### Your current environment
The output of `python collect_env.py`:
```text
============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ...
```
- #33336 [Bug]: AttributeError: 'Step3VLProcessor' object has no attribute '_get_num_multimodal_tokens' — bug — by Dineshkumar-Anandan-ZS0367 (created: 2026-01-29 19:32 (UTC+8))
### Your current environment
HF link: https://huggingface.co/stepfun-ai/Step3-VL-10B
Issue: AttributeError: 'Step3VLProcessor' object has no attribute '_get_num_multimodal_tokens'
### 🐛 Describe the bug
Discussions: …
- #33330 [Usage]: Why does the acceptance rate of the second token drop drastically when the Eagle3 draft model has two layers? — usage — by xinanjiao (created: 2026-01-29 18:13 (UTC+8))
### Your current environment
```text
Collecting environment information… ============================== System Info ============================== OS : Ubuntu 22.04.3 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version : Could not collect …
```
- #33329 [Bug]: Eagle3 not support Qwen3-Next-80B-A3B-Instruct-FP8 — bug — by Martion-z (created: 2026-01-29 17:53 (UTC+8))
### Your current environment
The output of `python collect_env.py` …
- #33314 [RFC] [Feature]: Apply RFC #8913 to Quantized Linear Methods (Decouple Kernels from Checkpoint Layout) — feature request,RFC — by BadrBasowid (created: 2026-01-29 15:01 (UTC+8)) [💬1]
### 🚀 The feature, motivation and pitch
### Context
RFC #8913 proposes decoupling checkpoint formats from kernel integrations by standardizing how weights are represented at runtime and allowing kernel-specific repacking/preprocessing as a separate concern. This RFC scopes that same idea specifically to quantized linear methods (e.g., GPTQ/AWQ/FP8/FP4 linear integrations) where kernel selection is currently constrained by checkpoint packing/layout assumptions.
### Problem
Many quantized line…
- #33311 [Feature]: support pixel_values_videos input for VLM — feature request — by wuxibin89 (created: 2026-01-29 14:05 (UTC+8))
### 🚀 The feature, motivation and pitch
In RL training, to avoid retokenization drift, RL frameworks choose token-in-token-out for generation. While this works perfectly for text/image, there is still some mismatch in video processing. In verl, we apply_chat_template to messages with video: https://github.com/verl-project/verl/blob/main/verl/experimental/agent_loop/agent_loop.py#L286-L293 Then pass the prompt_token_ids and **raw vide…
- #33313 [Bug]: cuda 13 docker image error "ptxas fatal : Value 'sm_121a' is not defined for option 'gpu-name'" — bug — by openzeka-birol-kuyumcu (created: 2026-01-29 14:10 (UTC+8))
For the docker image vllm/vllm-openai:v0.14.1-cu130 on DGX Spark, Triton raises "ptxas fatal : Value 'sm_121a' is not defined for option 'gpu-name'", which matches the closed bug https://github.com/triton-lang/triton/issues/8539
- #33307 [Bug]: Latency spikes at input_len=1024 with batch_size=16 (TP1 & TP2) — bug — by mingi001025 (created: 2026-01-29 13:40 (UTC+8))
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ...
```
- #33302 [Usage]: What would happen if the offloading KV cache size is larger than config the max_local_cpu_size ? — usage — by fantasy520 (created: 2026-01-29 11:30 (UTC+8))
### Your current environment
The output of `python collect_env.py`
### How would you like to use vllm
I want to run inference of a specific model. I don't know how to integrate it with vllm. …
[Closed issues]
- #17812 [Benchmark][V1][Spec Decode][EAGLE] Tracking benchmark for V1 EAGLE — performance,stale — by ekagra-ranjan (closed: 2026-01-30 10:17 (UTC+8)) [💬6]
We have been doing perf bench on MTBench so that e2e speedup and AL are comparable with other setups and academic papers. Thanks to @luyuzhe111 and others for the discussion and helping with measuring the gaps!
## llama 3 8b
During model weight loading:
- https://github.com/vllm-project/vllm/pull/16035#issuecomment-2790985075
- SGL correction: https://github.com/vllm-project/vllm/pull/16035#issuecomment-2803265232
- SGL setup: https://docs.google.com/document/d/18ETJLsnxR88Qq3VDk5Mq-Hb7vuE…
- #17931 invalid conversion from 'int' to 'CUresult' {aka 'cudaError_enum'} — installation,stale — by xxm1668 (closed: 2026-01-30 10:17 (UTC+8)) [💬7]
### Your current environment
```text
'DevicePtrInfo getPointer(PyObject*, int)': /tmp/tmpp8zry39f/main.c:118:20: error: invalid conversion from 'int' to 'CUresult' {aka 'cudaError_enum'} [-fpermissive] 118 | CUDA_CHECK(status); // Catch any other cuda API errors | ^ | | | int /tmp/tmpp8zry39f/main.c:24:38: note: in definition of macro 'CUDA_CHECK' 24 | #define CUDA_CHECK(ans) { gpuAssert((ans), FILE, LINE);…
```
- #25927 [Feature]: Support EAGLE3 for Qwen3-30B-A3B — feature request,stale — by lionsheep0724 (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### 🚀 The feature, motivation and pitch
As far as I know, there's no support for the Qwen3 MoE model. When I tried to load a draft model trained via Specforge, the error below is returned:
RuntimeError: Model does not support EAGLE3 interface but aux_hidden_state_outputs was requested
Similar issue: https://github.com/vllm-project/vllm/issues/25134
### Alternatives
No response …
- #18372 [Bug]: Dynamic loading LoRA is not working properly — bug,stale — by DonghoKang (closed: 2026-01-30 10:17 (UTC+8)) [💬6]
### Your current environment
The output of:
```shell
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
python collect_env.py
```
### 🐛 Describe the bug …
- #19942 [Usage]: how to invoke KVCache save in one node deployment development enviroment — usage,stale — by gaowayne (closed: 2026-01-30 10:17 (UTC+8)) [💬5]
### Your current environment
The output of `python collect_env.py`
### How would you like to use vllm
I would like to have the method below run more in a single-node deployment development environment …
- #22098 [Bug]: Warning: The matcher has terminated after accepting the stop token, but is trying to accept new token with id 198. — bug,stale — by JakubCerven (closed: 2026-01-30 10:17 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py` …
- #23431 [RFC]: Continuous profiling and regression prevention for vLLM — RFC,stale — by linzebing (closed: 2026-01-30 10:17 (UTC+8)) [💬3]
### Motivation.
While we continue to push the limit of vLLM performance, an important aspect is to make sure that these wins will stay. It’s quite common in practice that commits may just silently regress engine performance. For instance, we observed a downward trend of engine performance in the past month:
[vLLM Llama 8b shareGPT benchmark dashboard](https://hud…
- #24039 [Usage]: how to disable thinking for different model — usage,stale — by jiangix-paper (closed: 2026-01-30 10:17 (UTC+8)) [💬4]
### Your current environment
vllm 0.10.1
### How would you like to use vllm
I want to call the model without thinking using the v1/chat/completions interface. For the glm-4.5 model, the body is: { "model": "glm-4.5", …
- #24807 [Installation]: Errors when compiling VLLM from source | pytorch-cuda12.6-cudnn9 — installation,stale — by Lyken17 (closed: 2026-01-30 10:16 (UTC+8)) [💬6]
### Your current environment
Click to expand sys environments
```text
Collecting environment information... ============================== System Info ...
```
- #24869 [Installation]: Assertion Error while serving vllm due to this error : assert param_data.shape == loaded_weight.shape — installation,stale — by VarunUpadhyayShorthillsAI (closed: 2026-01-30 10:16 (UTC+8)) [💬4]
### Your current environment
```text
python3 collect_env.py Collecting environment information… ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version : Could not collect …
```
- #25333 [Bug]: Accuracy Discrepancy in Qwen 4B Embeddings: vLLM vs. Transformers — bug,stale — by Alireza3242 (closed: 2026-01-30 10:16 (UTC+8)) [💬11]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== System Info ============================== ...
```
- #25462 [Usage]: How Do I use Ray to deploy a multi node TP Disaggregated Prefilling ? — usage,ray,stale — by yangqinghao-cmss (closed: 2026-01-30 10:16 (UTC+8)) [💬7]
### Your current environment
The output of `python collect_env.py`
### How would you like to use vllm
I want to run inference for a 1P1D setup, where both P and D utilize only Tensor Parallelism (TP) across multiple nodes. However, I'm unsure how to integrate this with vllm. I've tried using Ray with p2pNcclConnector and the lmcache+nixl approach, but neither works. …
- #25797 [Bug]: Qwen3-Next-80B-A3B-Instruct-AWQ-4bit KeyError: 'layers.45.mlp.shared_expert.down_proj.weight' — bug,stale — by xibaoning (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py` …
- #25815 [Performance]: PD Disaggregate deployment performance issue — performance,stale — by wangxiyuan (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### Issue
In the Prefill/Decode disaggregated case, on the decode node, if a request's KV cache fills HBM, the scheduler triggers recalculation, puts the request back into the waiting queue, and releases its KV cache. This causes prefill computation to be performed in the next iteration on the decode node, resulting in serious performance degradation.
### Solution
Currently, we have 3 possible ways:
- Port the V0 scheduler swap queue to V1, though it may break the design of the V1 schedule…
- #25836 [Usage]: how to deploy my own multimodal model — usage,stale — by AdvancedBot (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information… ============================== System Info ============================== OS : Ubuntu 22.04.3 LTS (x86_64) …
```
- #25865 [Bug]: vLLM recognize the bge-m3-korean (embedding model) max length, 512 tokens. — bug,stale — by bnuazz15 (closed: 2026-01-30 10:16 (UTC+8)) [💬9]
### Your current environment
vLLM 0.10.2
### 🐛 Describe the bug
bge-m3-korean config.json:
```json
{ "architectures": [ …
```
- #25898 [Usage]: How to disable the use of chat_template in vllm serve? — usage,stale — by noforit (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### Your current environment
```text
Collecting environment information… ============================== System Info ============================== OS : Rocky Linux 8.5 (Green Obsidian) (x86_64) GCC version : (GCC) 9.4.0 Clang version : Could not collect …
```
- #25907 [Bug]: qwen3-next CUDA illegal memory access — bug,stale — by felixzhu555 (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== System Info ============================== ...
```
#25915 [Performance]: Limited improvements when compared Qwen3 / Qwen3-FP8 — performance,stale — by Lyken17 (关闭于: 2026-01-30 10:16 (UTC+8)) [💬2] ### Proposal to improve performance
On the latest
vllm/vllm-openai:v0.8.5with h100 hardware### Report of performance regression
vllm bench throughput --model Qwen/Qwen3-8B --input-len 1024 --output-len 2048bf16 …
- #25919 [Feature]: Upstream some unsloth patches if possible — feature request,stale — by vadimkantorov (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
Unsloth adds vLLM patches in https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_zoo/vllm_utils.py, which looks extremely fragile w.r.t. vLLM updates.
Here are some of the things in there (mostly related to 4-bit / LoRA, quite pertinent in light of https://thinkingmachines.ai/blog/lora/, but also standby mode, which is quite useful):
```text
patch_vllm_enable_sleep_mode()
patch_vllm_graph_capture()
patch_vllm_set_inductor_config()
…
```
- #25920 [Bug]: Running qwen3-235b in full graph mode with dp4tp4 and disable chunked_prefill will cause inference hang — bug,stale — by wjx-xin (closed: 2026-01-30 10:16 (UTC+8)) [💬5]
### Your current environment
The output of `python collect_env.py`:
```text
vllm 0.10.2 vllm-ascend (recent versions, higher than 0.10.2rc1, used unmerged PR)
```
- #25934 [Feature]: Jet nemotron models — feature request,stale — by arunpatala (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
Hi,
I was wondering if Jet Nemotron models can be supported in vLLM. They have a new attention implementation called JetBlock, which is linear/more efficient. Is this planned to be supported, or is it not in scope? I am just wondering, because I would like to train my custom Jet Nemotron models and I am using vLLM currently.
https://github.com/NVlabs/Jet-Nemotron
Thanks
…
- #25939 [RFC][P/D]: Support hybrid and flexible KV Cache transfer timing & pathway at request-level — RFC,stale — by Shirley1Huang (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### Motivation.
In current prefill-decode (PD) disaggregated architectures, a static strategy determining KV Cache transfer timing and method is adopted, which is uniformly applied to all requests. This inflexible approach fails to account for key trade-offs: between latency and throughput when choosing when to start KV Cache transfer (during versus after prefill), and between latency and scalability when choosing how to transfer it (via P2P communication or through a centralized sto…
- #25940 [Feature]: image2text supports SVG image — feature request,stale — by jfcherng (closed: 2026-01-30 10:16 (UTC+8)) [💬3]
### 🚀 The feature, motivation and pitch
I am using an InternVL3 image2text model. The model card says it supports SVG images, but vLLM failed to recognize an SVG image.
The following is my test payload.
```json
{ "model": "OpenGVLab/InternVL3_5-1B", "messages": [ …
```
- #25957 [Feature]: Upgrade as many vLLM built-in logits processors to the new logits processors programming model as possible — feature request,stale — by afeldman-nm (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
vLLM v1 is intended to handle logits processors differently than vLLM v0. Logits processors should subclass LogitsProcessor; built-in logits processors should be implemented in vllm/v1/sample/logits_processor/builtin.py and added to the BUILTIN_LOGITS_PROCESSORS list in vllm/v1/sample/logits_processor/__init__.py. However, in actuality only a few of the vLLM built-in logits processors (Min-P, logits bias, min tokens) have been converted to work …
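As an illustration of the programming model described above, here is a minimal "min tokens"-style processor. This is a sketch with plain Python lists standing in for tensors; the real base-class interface in vllm/v1/sample/logits_processor differs:

```python
# Illustrative sketch of the v1 programming model: subclass a
# LogitsProcessor base and mutate logits in apply(). Plain lists stand
# in for tensors; this is NOT vLLM's actual interface.

class LogitsProcessor:
    def apply(self, logits: list[float]) -> list[float]:
        raise NotImplementedError

class MinTokensLogitsProcessor(LogitsProcessor):
    """Block EOS until at least min_tokens tokens have been generated,
    in the spirit of the built-in 'min tokens' processor."""

    def __init__(self, eos_token_id: int, min_tokens: int):
        self.eos_token_id = eos_token_id
        self.min_tokens = min_tokens
        self.num_generated = 0

    def apply(self, logits: list[float]) -> list[float]:
        if self.num_generated < self.min_tokens:
            logits = list(logits)
            logits[self.eos_token_id] = float("-inf")  # EOS cannot be sampled
        self.num_generated += 1
        return logits
```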
- #25979 [Bug]: EAGLE3 gpt-oss eagle3 failed on high concurrencies — bug,stale — by jiahanc (closed: 2026-01-30 10:16 (UTC+8)) [💬2]
### Your current environment
Followed standard installation steps on main branch https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#pre-built-images
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== ...
```
- #25985 [Bug]: IMA with the prefix_prefill::context_attention_fwd kernel when capturing full cudagraphs — bug,rocm,stale — by SageMoore (closed: 2026-01-30 10:16 (UTC+8)) [💬4]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== System Info ============================== ...
```
- #33364 [Bug]: Qwen3-Next speculative decoding (DeepSeek MTP) fails for num_speculative_tokens>=3 on H100 with FlashInfer + FP8 KV cache (GMMA operator JIT compile error) — bug — by mingi001025 (closed: 2026-01-30 02:48 (UTC+8))
### Your current environment
The output of `python collect_env.py`:
```text
--2026-01-29 10:35:20-- https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ... ...
```
- #31186 [Bug]: Qwen3-Next MTP Crash — bug — by benchislett (closed: 2026-01-30 02:40 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information... ============================== System Info ============================== ...
```
- #28129 [Feature]: DGX Spark sm121a support — feature request — by swtb3-ryder (closed: 2026-01-29 23:54 (UTC+8)) [💬6]
### 🚀 The feature, motivation and pitch
Does vLLM support the DGX Spark hardware? I know NVIDIA provides a vLLM container with the support packaged, but I believe it uses vLLM version 0.10.2
### Alternatives
No response
### Additional context
…
- #32701 [Feature]: Async Scheduling + Pipeline Parallel Support — feature request — by yewentao256 (closed: 2026-01-30 00:54 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
Supporting async scheduling + pp
Tasks
- https://github.com/vllm-project/vllm/pull/32359
- Under review: https://github.com/vllm-project/vllm/pull/32618
### Alternatives
…
- #33347 Bad issue, please ignore — feature request — by stisiTT (closed: 2026-01-30 00:17 (UTC+8)) [💬1]
Error
- #31658 [Bug] Audio batching fails with variable-length inputs in Ultravox model — no labels — by AndriiPasternak31 (closed: 2026-01-29 18:43 (UTC+8))
### Your current environment
- vLLM version: 0.13.0
- GPU: NVIDIA RTX 3090 (24GB)
- PyTorch: 2.9.0+cu129
- Model: fixie-ai/ultravox-v0_6-llama-3_1-8b
### 🐛 Describe the bug
When sending concurrent audio transcription requests to the vLLM server with audio samples of different durations, the engine crashes with: …
[New PRs]
- #33387 feat(pooling/score): Enhance RerankRequest with optional instruction field and multimodal input support — frontend — by akaihaoshuai (created: 2026-01-30 11:18 (UTC+8)) [+442/-21, 5 files | commented:1]
## Purpose
This PR enhances multimodal support in both the /v1/embeddings and /rerank endpoints by:
- Adding an optional instruction field to the RerankRequest schema to enable instruction-aware reranking.
- Supporting flexible multimodal input formats for embedding and reranking, including:
  - Mixed lists of strings and image URLs or Data URLs (e.g., ["text", "https://...", "data:image/..."])
  - Structured objects with explicit {"text": ...} or `{"i…
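A hypothetical request body illustrating the input shapes listed above. Apart from the instruction field and the document formats quoted from the PR description, all field values and the model name below are assumptions, not the merged schema:

```python
# Hypothetical /rerank payload sketch based on the formats the PR
# description lists. The model name and document contents are made up;
# only "instruction" and the mixed document formats come from the PR text.
payload = {
    "model": "some-vision-reranker",  # placeholder model name
    "instruction": "Rank by relevance to the product photo.",
    "query": "red running shoes",
    "documents": [
        "A pair of red sneakers on a white background",  # plain text
        "https://example.com/shoe.jpg",                  # image URL
        "data:image/png;base64,iVBORw0KGgo...",          # Data URL
        {"text": "Catalog entry for trail boots"},       # structured object
    ],
}
```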
- #33385 [Bugfix] Fix inconsistent embeddings between AsyncLLM and LLM — bug,v1 — by MengAiDev (created: 2026-01-30 11:00 (UTC+8)) [+12/-0, 1 files | commented:1]
The pooling_params.verify() method was only called in the LLM class path, not in the AsyncLLM (v1) path. This caused default values like use_activation=True to not be set, resulting in different embeddings.
Call verify() in InputProcessor.process_inputs() to ensure consistent behavior across both code paths.
Fixes #33361
- #33332 [Misc] Replace Optional[X] with X | None syntax — rocm,frontend,needs-rebase,v1,multi-modality,cpu,kv-connector,nvidia — by carlory (created: 2026-01-29 19:08 (UTC+8)) [💬3 | +129/-135, 56 files | commented:1 approved:1]
## Purpose
Modernize type annotations by replacing Optional[X] with the more readable X | None union type syntax (PEP 604), which has been available since Python 3.10.
Changes:
- Replace Optional[X] with X | None across 55 files
- Remove unused Optional imports from typing
## Test Plan
…
- #33379 [Draft][XPU] deprecate ipex and switch to vllm-xpu-kernels — ci/build,v1 — by jikunshang (created: 2026-01-30 08:45 (UTC+8)) [+139/-455, 12 files | commented:1]
## Purpose
[1/N] https://github.com/vllm-project/vllm/issues/33214 Starting from this PR, we will switch to a vllm-xpu-kernels based kernel impl for Intel GPU. This PR also upgrades the dependencies to oneAPI 2025.3 and PyTorch 2.10 for the XPU platform.
## Test Plan
## Test Result …
- #33326 [Bugfix][Kernel] Fix negative memory offset in GDN Triton kernel — bug,ready — by CarstyYou (created: 2026-01-29 17:26 (UTC+8)) [💬5 | +18/-15, 1 files | commented:1 approved:3 changes:1]
## Purpose
Fix illegal memory access (cudaErrorIllegalAddress) when running Qwen3-Next models with MTP speculative decoding and CUDA Graphs enabled. Fixes #31186. cc @vadiklyutiy
### Root Cause
The Triton kernel fused_recurrent_gated_delta_rule_fwd_kernel in GDN attention did not check for PAD_SLOT_ID = -1 values in ssm_state_indices. When CUDA Graph pads unused slots with -1, the kernel computes negative memory offsets, causing illegal memory access. …
- #33305 [CI][AMD] Skip 4 GPUs testgroup ray tests — documentation,rocm,ready,ci/build,v1 — by rjrock (created: 2026-01-29 12:27 (UTC+8)) [💬4 | +8/-0, 2 files | commented:1 approved:1]
## Purpose
To fix the below tests for AMD CI.
- offline_inference/rlhf_colocate.py
- distributed/test_utils.py
## Test Plan
python rlhf_colocate.py
pytest distributed/test_utils.py
## Test Result …
- #33378 support return prompt token ids in responses — frontend — by cmunley1 (created: 2026-01-30 08:26 (UTC+8)) [💬5 | +44/-0, 3 files | commented:2]
## Purpose
Support returning prompt token ids in the Responses API.
## Test Plan
## Test Result
…
- #33372 Explicitly set return_dict for apply_chat_template — documentation,ready — by hmellor (created: 2026-01-30 06:17 (UTC+8)) [💬1 | +22/-10, 11 files | commented:1 approved:1]
In Transformers v5 the default value of return_dict for apply_chat_template was changed to True to match other tokenizer methods. This PR explicitly sets:
- return_dict=True outside the vllm/ dir so that users get used to the new/expected behaviour of Transformers
- return_dict=False inside the vllm/ dir so that the internal interfaces of vLLM do not need to change to adapt to the new default
- #33376 fix: allow LFM2 MoE prefix caching (align) — ready — by tianshu-Michael-yu (created: 2026-01-30 06:57 (UTC+8)) [+11/-3, 2 files | commented:1 approved:1]
## Summary
- allow LFM2 MoE prefix caching in align mode
- add LFM2-VL Mamba state copy-func for prefix cache
- keep mamba-cache-mode=all restriction
## Testing
```shell
./run_lfm_vllm_benchmark.sh --wait 150 --prewarm-bench --work-dir eval_outputs/outputs_run_1769727159 \
  --model LiquidAI/LFM2-VL-8B-A1B-3096689 --served-model lfm2-vl-8b-a1b \
  --vllm-arg --enable-prefix-caching --vllm-arg --mamba-cache-mode --vllm-arg align \
  --vllm-arg --mm-processor-cache-type --vllm-arg shm --vllm-arg --a…
```
- #33383 [Bugfix] Disable torch.compile for batch invariance on Blackwell to ensure determinism — bug — by ZhanqiuHu (created: 2026-01-30 10:34 (UTC+8)) [+31/-0, 2 files | commented:1]
## Purpose
Fix batch invariance failing on Blackwell (B200) GPUs when torch.compile is enabled. Fixes #32992
### Observation
- Batch invariance works on Blackwell with enforce_eager=True
- Batch invariance fails on Blackwell with torch.compile enabled …
- #33357 [BugFix] Fix cold start compilation time — bug,ready — by zou3519 (created: 2026-01-30 01:23 (UTC+8)) [💬5 | +48/-2, 3 files | commented:3 approved:1 | 📝 draft]
https://github.com/vllm-project/vllm/pull/25954 caused a regression in cold start compilation times in all models (MoE and dense). This PR fixes it. The regression is that #25954 causes strings to appear in the graph and messes with the number of total unique subgraphs that need to be compiled. In general a model has some number of layers and should have 3-5 unique subgraphs; the change made it so that there is approximately one subgraph per layer.
Benchmarks:
```text
# vLLM 0.14.1 gpt-oss-120b
# torch.co…
```
- #33349 [Reasoning] Add glm47 reasoning parser for GLM-4.7 models — documentation — by QwertyJack (created: 2026-01-29 23:49 (UTC+8)) [💬2 | +306/-1, 3 files | commented:3]
## Summary
GLM-4.7 models have a different chat template than GLM-4.5/4.6 models:
- GLM-4.5/4.6: <think> is NOT included in the generation prompt
- GLM-4.7: <think> IS included in the generation prompt
This means GLM-4.7 should use DeepSeekR1ReasoningParser instead of the DeepSeekV3ReasoningWithThinkingParser used by GLM-4.5/4.6.
## Changes
…
- #33382 [BugFix] Fix DeepGEMM Warmup Logic — bug — by robertgshaw2-redhat (created: 2026-01-30 09:39 (UTC+8)) [+5/-2, 1 files | commented:1]
## Purpose
- a recent PR (not in 0.15) broke the logic for DeepGEMM by checking the wrong attr
- this fixes it
Essential Elements of an Effective PR Description Checklist
...
- #33359 Fix tie_word_embeddings for multimodal models in Transformers v5 — ready — by hmellor (created: 2026-01-30 01:50 (UTC+8)) [+24/-0, 1 files | commented:2 approved:1]
In Transformers v5, tie_word_embeddings belongs to the config of the class that can see both layers to be tied. For example:
```py
SomeVLModel:
    self.language_model = SomeLanguageModel()
    self.vision_model = SomeVisionModel()
SomeVLModelForMultimodalLM:
    self.model = SomeVLModel()
    self.lm_head = nn.Linear()
…
```
-
#33380 [Feat][SM90] Add cutlass int4 x bf16 MoE — ci/build,nvidia — by IwakuraRein (创建于: 2026-01-30 09:20 (UTC+8)) [+1635/-36, 10 files | commented:1 | 📝草稿]
## Purpose
- Add cutlass int4 x bf16 MoE by referencing #29691.
## Test Plan
- `pytest tests/kernels/quantization/test_cutlass_w4a16_moe.py`
- `pytest tests/test_cutlass_w4a16_moe_mm.py`
…
- #33366 [Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 — bug,rocm,ready — by gshtras (创建于: 2026-01-30 04:57 (UTC+8)) [💬4 | +18/-18, 4 files | commented:1]
Fixing 2 issues from the previous PR
- When the wvsplitk kernel is not applicable, the fallback should be hipblaslt through torch._scaled_mm, and not cloning and modifying the input tensor. This fixes the +100% performance regression on amd/LLama-3.1-*-FP8-KV models when running on smaller batch sizes
- Weights can be non-contiguous, and often are in the FP8 case where we explicitly pad them to a multiple of 256. So the condition needs to only be applied to the activation tensor
- cc @rasmith
-
#33375 [Moe Refactor] Provide inplace flag at FusedMoEModularKernel construction time. — ready,nvidia — by bnellnm (创建于: 2026-01-30 06:56 (UTC+8)) [+131/-109, 37 files | commented:5] ## Purpose Pass the inplace flag to the FusedMoEModularKernel constructor. Knowing the inplace (or being able to disable) behavior ahead of time will help simplify some of the runtime logic in layer.py related to running shared experts and making clones of the
`hidden_states`.
## Test Plan
CI + MoE refactoring tests
## Test Result
cc @robertgshaw2-redhat , @ProExpertProg
…
-
#33358 [BUGFIX][XPU] fix memory check after XPU reuse GPU_worker — bug,v1 — by xuechendi (创建于: 2026-01-30 01:50 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
## Purpose
Fix a bug as below: ``` File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 333, in determine_available_memory (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] assert self.init_snapshot.free_memory > free_gpu_memory, ( (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] Assertion…
- #33374 [ModelRunner V2] Support spec decode with structured outputs — v1 — by njhill (创建于: 2026-01-30 06:35 (UTC+8)) [+59/-1, 3 files | commented:1] Tested for correctness, works as expected.
-
#33377 [Bugfix][Async][LMCache] avoid vllm-side double free during async scheduling + LMCache — bug,kv-connector — by KuntaiDu (创建于: 2026-01-30 07:58 (UTC+8)) [💬1 | +27/-4, 1 files | commented:1]
## Purpose
This PR fixes a vLLM-side double-request-free bug in async scheduling.
Key reason: in async scheduling, when a request is aborted, the `get_finished` API might be called twice, resulting in vLLM freeing the same request twice. This change makes sure that even when `get_finished` is called twice for one request, LMCache only returns the request once and ignores the second call.
Design graph:
…
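The dedup behavior described above can be sketched roughly like this (class and method names are hypothetical, not LMCache's actual API): track which request IDs have already been surfaced, so a second `get_finished` report of the same request is ignored.

```python
# Hypothetical sketch of "return each finished request at most once".
class FinishedDeduper:
    def __init__(self) -> None:
        self._already_returned: set[str] = set()

    def get_finished(self, finished_ids: list[str]) -> list[str]:
        # Surface only the request IDs not seen before; remember them
        # so a duplicate call (e.g. after an abort) returns nothing.
        fresh = [rid for rid in finished_ids if rid not in self._already_returned]
        self._already_returned.update(fresh)
        return fresh
```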
-
#33371 [UX] Use gguf `repo_id:quant_type` syntax for examples and docs — documentation,ready,quantization — by mgoin (创建于: 2026-01-30 05:52 (UTC+8)) [💬1 | +79/-28, 4 files | commented:1 approved:1] ## Purpose
- Fixed the `repo_id:quant_type` syntax to support extended naming conventions
- These are recognized by checking if the base type (e.g., Q4_K) is a valid GGMLQuantizationType and the suffix is one of _M, _S, _L, _XL, _XS, _XXS
- Updated docs, examples, tests
## Test Plan
…
-
#33351 [Dependency] Remove mandatory ray installation — documentation,rocm,ci/build,v1,nvidia — by yewentao256 (创建于: 2026-01-30 00:31 (UTC+8)) [💬1 | +5/-3, 5 files | commented:8 changes:1] ## Purpose
- vLLM v1 PP can run via the multiprocessing backend; Ray is only required when users explicitly choose the Ray executor backend. Keeping Ray as a default dependency on CUDA/ROCm causes confusion and unnecessarily pulls Ray into environments that don’t use it.
So this PR makes Ray an optional dependency.
-
#33373 [Kernel] [Helion] [4/N] Add silu_mul_fp8 Helion kernel — 无标签 — by gmagogsfm (创建于: 2026-01-30 06:24 (UTC+8)) [💬1 | +2369/-0, 11 files | commented:4] NOTE: this is a manually stacked PR, each commit is reviewed separately. For this PR, please only review the top commit: Migrate silu_mul_fp8 kernel to new registration system
This PR implements silu_mul with fp8 quantization Helion op. This is the first OP built with new vLLM+Helion integration framework.
- Added configs for H100, H200 GPUs
-
#33370 [Models]: lfm2_siglip2 return intermediate encoder layers — 无标签 — by lalo (创建于: 2026-01-30 05:43 (UTC+8)) [💬2 | +52/-4, 1 files | commented:1] ## Purpose
Adds select_layer parameter to Siglip2Model, Siglip2VisionTransformer, and Siglip2Encoder to allow returning hidden states from any encoder layer. Useful for VLM architectures that use intermediate vision encoder outputs instead of the final layer.
- select_layer=-1: last layer with post_layernorm (default)
- select_layer=-2: second-to-last without post_layernorm
- Positive indices also supported for direct layer selection
No breaking change - sane defaults.
…
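A minimal sketch of the `select_layer` semantics described above, assuming list-style indexing over collected hidden states (pure-Python stand-ins, not the actual Siglip2 code):

```python
# Illustrative encoder that keeps every layer's hidden state and returns
# the one chosen by select_layer: -1 = last layer (with post-norm),
# -2 = second-to-last (raw), positive indices select a layer directly.
def encoder_output(x, layers, select_layer=-1, post_layernorm=None):
    hidden_states = [x]
    for layer in layers:
        x = layer(x)
        hidden_states.append(x)
    if select_layer == -1 and post_layernorm is not None:
        # Default: final layer output with post_layernorm applied.
        return post_layernorm(hidden_states[-1])
    # Intermediate layers are returned raw, without post_layernorm.
    return hidden_states[select_layer]

layers = [lambda v: v + 1, lambda v: v * 2]  # toy "encoder layers"
print(encoder_output(0, layers, select_layer=-2))  # second-to-last hidden state
```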
-
#33368 [Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement — ready,v1 — by yewentao256 (创建于: 2026-01-30 05:41 (UTC+8)) [+332/-13, 3 files | commented:1] ## Purpose
Part of the https://github.com/vllm-project/vllm/issues/33356
Enable async
`send/recv` for better performance
## Test
export MODEL="Qwen/Qwen3-30B-A3B-Thinking-2507-FP8"…
-
#33327 [Quant] Fix FP8 block quantization for non-aligned dimensions — documentation — by Etelis (创建于: 2026-01-29 17:32 (UTC+8)) [💬1 | +70/-12, 2 files | commented:1] ## Summary
Fixes FP8 block quantization for models with dimensions not divisible by block size (128).
Tested with
`bdellabe/DeepSeek-V2-Lite-FP8-BLOCK` where `intermediate_size=10944`.
## Changes
- Validation fix - Skip partition alignment check for merged weights when TP=1
- Scale shard fix - Compute scale boundaries correctly using
`ceil(offset+size) - ceil(offset)`…
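The boundary arithmetic mentioned above can be sketched as follows. This is only an illustration of the quoted `ceil(offset+size) - ceil(offset)` formula, not the PR's actual code; it assumes offsets are measured in elements with one scale per 128-element block.

```python
import math

BLOCK = 128  # FP8 block-quantization block size

def scale_shard(offset: int, size: int, block: int = BLOCK) -> tuple[int, int]:
    """Return (start, count) of scale entries covering the weight shard
    [offset, offset+size). Ceiling both boundaries handles dimensions
    that are not multiples of the block size (e.g. 10944)."""
    start = math.ceil(offset / block)
    count = math.ceil((offset + size) / block) - start
    return start, count

print(scale_shard(0, 10944))  # 10944 / 128 = 85.5, so 86 scale entries
```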
-
#33360 [BugFix] Fix whisper FA2 + full cudagraphs — bug,ready,v1,nvidia — by LucasWilkinson (创建于: 2026-01-30 01:51 (UTC+8)) [💬1 | +12/-12, 2 files | commented:1 approved:2] Fix: https://github.com/vllm-project/vllm/issues/33091
`CrossAttentionBuilder.build()` overrides `max_seq_len` with encoder_seq_lens (`new_metadata.max_seq_len = max(encoder_seq_lens_cpu)`); this leads to a CG capture with `max_seq_len == 0` and an incorrect graph -
#33362 [Deprecation] Deprecate
`seed_everything` and `scatter_mm_placeholders` in v0.15 — ready,v1 — by yewentao256 (创建于: 2026-01-30 02:35 (UTC+8)) [+0/-75, 3 files | commented:1] ## Purpose
Deprecated as scheduled
-
#33345 Feat/add nemotron nano v3 tests — ready,ci/build — by shaharmor98 (创建于: 2026-01-29 23:06 (UTC+8)) [+54/-0, 6 files | commented:1 approved:1] This PR adds support and testing configurations for the NVIDIA Nemotron-3 Nano 30B models (both BF16 and FP8 variants) to the
`lm-eval-harness` CI pipeline.
## Test Plan
Verified locally using `pytest` with the `lm-eval-harness` test suite.
Essential Elements of an Effective PR Description Checklist
... -
#33365 move spec decode slow test to test_areas.yaml — speculative-decoding,ready,ci/build,v1 — by shanjiaz (创建于: 2026-01-30 02:53 (UTC+8)) [+17/-3, 2 files | commented:1 approved:1] ## Purpose Moved spec decode slow test to `test_areas.yaml`. The added test was previously approved in a former PR. ## Test Plan Tested the command locally.
`pytest tests/v1/spec_decode/test_acceptance_length.py -m 'slow_test'`
## Test Result ``` ============================================================ test session starts ============================================================= …
-
#33352 [BugFix] Disable async scheduling for Mamba prefix caching — bug,ready — by peakcrosser7 (创建于: 2026-01-30 00:34 (UTC+8)) [+12/-0, 1 files | commented:2 approved:2]
## Purpose Async scheduling has been enabled by default since v0.14.0. Currently, Mamba prefix caching ("all" and "align" modes) does not support it. Therefore, this PR disables async scheduling when prefix-caching is enabled for Mamba models.
## Test Plan
## Test Result
…
-
#33318 Drafter Supports Multiple KVCache Groups — speculative-decoding,v1 — by tomasruizt (创建于: 2026-01-29 15:33 (UTC+8)) [💬1 | +311/-146, 6 files | commented:2 | 📝草稿] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#33355 [torch.compile] Avoid graph fragmentation in unified_kv_cache_update — 无标签 — by algebra-MCX (创建于: 2026-01-30 00:48 (UTC+8)) [💬3 | +255/-5, 3 files | commented:1] ## Purpose
Fixes #33267 Motivation:
`unified_kv_cache_update` appears in piecewise cudagraph regions. Previously, each layer used a unique name, forcing each to be compiled separately. This increases cold start compilation time with Dynamo partition because the graphs can no longer be reused.
Changes:
This PR optimizes `unified_kv_cache_update` to use a generic name ("from_forward_context") during compilation. The actual layer is resolved dynamically at runtime using `ForwardContex… -
#33342 start dp engine core asynchronously — v1 — by zhaozx-cn (创建于: 2026-01-29 21:46 (UTC+8)) [💬3 | +55/-23, 1 files | commented:1]
## Purpose start dp engine core asynchronously ## Test Plan
## Test Result
... -
#33354 [Feature] Add VLLM_TRITON_AUTOTUNE environment variable — 无标签 — by monajafi-amd (创建于: 2026-01-30 00:40 (UTC+8)) [+6/-0, 1 files | commented:1 | 📝草稿] Implements #33279
Add environment variable `VLLM_TRITON_AUTOTUNE` to control Triton kernel autotuning behavior.
- Default: Disabled (`0`) for deterministic behavior and faster startup
- Enabled (`1`): for benchmarking or calibrating kernel configurations
## Motivation Triton kernel parameters are not one-size-fits-all across hardware platforms. Currently, autotuning behavior is implicit and not user-controllable, making it difficult…
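The described default/opt-in behavior would read roughly like this (a hypothetical sketch, not necessarily how vLLM actually parses the flag):

```python
import os

def triton_autotune_enabled() -> bool:
    """Sketch of the proposed flag: disabled ("0") by default for
    deterministic behavior and faster startup; set VLLM_TRITON_AUTOTUNE=1
    to enable autotuning for benchmarking or calibration runs."""
    return os.environ.get("VLLM_TRITON_AUTOTUNE", "0") == "1"
```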
-
#33324 [Chore] Move
`MediaConnector` to `vllm.multimodal.media` — frontend,ready,multi-modality — by DarkLight1337 (创建于: 2026-01-29 17:17 (UTC+8)) [💬1 | +381/-350, 8 files | commented:2 approved:1] ## Purpose
Missed this in #32406; further resolves circular import issues.
Also fix missing `AttributeError` in `vllm.transformers_utils.tokenizer.__getattr__`.
## Test Plan
## Test Result
…
- #33353 [KVConnector] Allow connector to protect GPU blocks from eviction — v1,kv-connector — by orozery (创建于: 2026-01-30 00:36 (UTC+8)) [+336/-44, 10 files | commented:1] This PR introduces a new connector API allowing the scheduler-side connector to increase GPU blocks ref-counts, in order to prevent them from evicting. In particular, this is necessary for the case of async offloading of sliding window KV data, as it is automatically freed by the KV cache manager as the window progresses.
-
#33350 [Bugfix] Fix broken GLM-OCR initialization — bug — by Isotr0py (创建于: 2026-01-29 23:51 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1]
## Purpose
- The GLM-OCR model is actually broken, because `GlmOcrVisionConfig` is only imported at type-checking time https://github.com/vllm-project/vllm/blob/c6e7404cc5713a926e8b6c187b5f197a5436e9ff/vllm/model_executor/models/glm_ocr.py#L36-L39
NameError: name 'GlmOcrVisionConfig' is not defined. Did you mean: 'GlmOcrVisionMLP'?
## Test Plan …
-
#33346 [Models] Refactor Kimi-K2.5 weight loading — ready — by Isotr0py (创建于: 2026-01-29 23:39 (UTC+8)) [+40/-176, 2 files | commented:1 approved:1]
## Purpose
- Refactor Kimi-K2.5 model interface usage to catch up with previous refactoring
## Test Plan
`vllm serve moonshotai/Kimi-K2.5 --enforce-eager -tp 4 --trust-remote-code --mm-encoder-tp-mode data` …
- #33315 feat: add max tokens per doc in rerank request — frontend — by hustxiayang (创建于: 2026-01-29 15:12 (UTC+8)) [+240/-7, 4 files | commented:1 | 📝草稿] ## Purpose add max_tokens_per_doc in rerank request parameter, which is in cohere rerank api https://docs.cohere.com/reference/rerank and jina rerank api https://jina.ai/reranker/
-
#33343 Mamba multistream — 无标签 — by TomerBN-Nvidia (创建于: 2026-01-29 21:56 (UTC+8)) [💬1 | +539/-54, 10 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#33344 Fix: Validate placeholder tokens don’t exceed batch length in chunked… — v1 — by GOavi101 (创建于: 2026-01-29 22:30 (UTC+8)) [+29/-3, 1 files | commented:1] ## Summary
This PR implements validation to ensure multimodal placeholder tokens are not truncated by chunked prefill. Placeholder tokens must be processed in a single chunk to ensure proper embedding replacement.
## Problem
When chunked prefill is enabled with
`--max-num-batched-tokens`, multimodal placeholder tokens could be split across chunks. This breaks embedding replacement because placeholder tokens need to be processed atomically - if they're split, the embeddings cannot be properly … -
#33331 [Multimodal] Simplify MM input definitions — ready,v1,multi-modality — by DarkLight1337 (创建于: 2026-01-29 18:39 (UTC+8)) [💬1 | +142/-164, 17 files | commented:2 approved:1] ## Purpose
Remove unnecessary fields in
`MultiModalFieldElem`; instead, always pass dictionaries to construct `MultiModalKwargsItem` and `MultiModalKwargsItems` to make their roles clearer. As `group_mm_kwargs_by_modality` still requires modality information, it has been updated to receive a list of `(modality, MultiModalKwargsItem)` instead of just a list of `MultiModalKwargsItem`.
Also, improve related docstrings.
## Test Plan
## Test Result
…
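The `(modality, item)` pairing described above lends itself to a standard consecutive-run groupby; a minimal sketch with plain values standing in for the actual `MultiModalKwargsItem` objects:

```python
from itertools import groupby

def group_by_modality(pairs):
    """pairs: iterable of (modality, item). Yields (modality, [items])
    for each consecutive run sharing the same modality. Note that
    itertools.groupby only merges adjacent entries."""
    for modality, grp in groupby(pairs, key=lambda p: p[0]):
        yield modality, [item for _, item in grp]
```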
-
#33339 Enable Cross layers KV cache layout at NIXL Connector V2 — documentation,v1,kv-connector — by liranschour (创建于: 2026-01-29 20:06 (UTC+8)) [💬1 | +324/-88, 5 files | commented:1] ## Purpose Enable NIXL Connector to use the new continuous cross-layer KV cache layout described in the RFC and implemented in #27743
Demonstrates a performance improvement of more than 2x in tok/sec and TTFT due to a dramatic reduction in fragmentation of transfer buffers.
Tested with P!=D with run_accuracy_test.sh P=1 D=2
branch num reqs input len TTFT ITL tok/s Desc/transfer – – -… -
#33335 Parakeet avlm — needs-rebase — by netanel-haber (创建于: 2026-01-29 19:26 (UTC+8)) [💬1 | +376/-23, 2 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#33310 [Chore] Remove
`use_data_parallel` kwargs from ViT implementation — ready,llama — by Isotr0py (创建于: 2026-01-29 13:59 (UTC+8)) [💬1 | +36/-89, 9 files | commented:2 approved:1] ## Purpose
- Clean up `use_data_parallel` from ViT model implementation
## Test Plan
## Test Result
…
-
#33337 [BUGFIX] Make HF input preprocessing async to prevent frontend event loop blocking — bug,v1 — by yzhu802 (创建于: 2026-01-29 19:42 (UTC+8)) [💬3 | +52/-3, 1 files | commented:1]
## Purpose
Mitigate the issue where the frontend process's event loop becomes blocked under medium-to-high concurrency with multimodal requests, due to preprocessing in the Hugging Face (HF) processor. This improves the performance of the vLLM framework in multimodal scenarios.
In addition, this change is critical for concurrent testing in the downstream framework vllm-omni. In multi-stage orchestration, the frontend process is responsible for forwa…
-
#33320 [Backport] [Kimi-K2.5] Replace torch.cuda with current_platform for d… — ready,nvidia — by flyrae (创建于: 2026-01-29 16:30 (UTC+8)) [💬4 | +2/-1, 1 files | commented:1 approved:2] commit msg: Replaced the hardcoded
`torch.cuda.current_device()` with `current_platform.current_device()` in the `KimiK25ForConditionalGeneration` initialization. This change enhances platform compatibility, allowing the model to run on non-CUDA devices (e.g., NPU, ROCm) supported by vLLM. ## Purpose
Update `KimiK25ForConditionalGeneration` to use `current_platform.current_device()` instead of the hardcoded `torch.cuda.current_device()`.
-
#33334 [Feature] Add rich request snapshot stream for step-level observability (PR #5) — documentation,frontend,needs-rebase,ci/build,v1 — by sriumcp (创建于: 2026-01-29 19:18 (UTC+8)) [💬3 | +13599/-1130, 56 files | commented:1] ## Summary
This PR implements PR #5: Rich Request Snapshot Stream from the step-level tracing plan, adding per-request observability events within sampled scheduler steps.
Part of the step-level tracing series:
- PR #3: Step-level batch summary tracing ✅ (merged)
- PR #4: KV cache metrics utilities ✅ (merged)
- PR #5: Rich request snapshot stream ⬅️ this PR
## What This PR Adds …
-
#33322 [Bugfix] Fix SP compilation shape mismatch errors for multimodal models and prompt embeds — bug — by wangxingran222 (创建于: 2026-01-29 17:10 (UTC+8)) [💬1 | +131/-3, 2 files | commented:1]
## Purpose
When deploying multimodal models on the current vLLM main branch, or deploying text models with the
`--enable-prompt-embeds` option enabled, Sequence Parallelism (SP) cannot be enabled due to torch compile errors. Specifically, two issues occur:
- Shape mismatch in `aten.copy_` operations: When torch compile detects that the forward function modifies input parameters (e.g., `inputs_embeds`), the compiler inserts `aten.copy_` nodes to write modifie…
-
#33312 [Models] Qwen3-ASR — documentation,new-model,ready,qwen — by ywang96 (创建于: 2026-01-29 14:08 (UTC+8)) [💬1 | +1269/-0, 9 files | approved:2 commented:4]
## Purpose
Add support for Qwen3-ASR model series - see recipe at https://github.com/vllm-project/recipes/blob/main/Qwen/Qwen3-ASR.md
## Test Plan
## Test Result
…
-
#33317 [Bugfix][CPU] Fix thread num for shared memory communication — bug,ready,cpu — by bigPYJ1151 (创建于: 2026-01-29 15:30 (UTC+8)) [+25/-10, 3 files | commented:1 approved:1]
## Purpose
Some CPUs have different thread counts across memory nodes, which leads to errors in the CPU shared memory communicator. This PR fixes the issue.
## Test Plan
Locally verified.
…
-
#33308 [rocm] unset RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES in ROCM images — rocm,ready,ci/build — by kouroshHakha (创建于: 2026-01-29 13:53 (UTC+8)) [💬2 | +0/-2, 1 files | commented:1 | 📝草稿] vLLM's Ray integration relies on automatically setting the visible devices when scheduling model executor Ray workers.
### Testing:
The command below would fail on rocm/v0.14.0 before this pr, but works after this pr:
vllm serve EmbeddedLLM/deepseek-r1-FP8-Dynamic -dp 8 --load-format dummy --distributed-executor-backend ray -cc '{"cudagraph_mode": "FULL_DECODE_ONLY"}' -
#33328 [Quant] Skip CUTLASS block FP8 on Blackwell GPUs — nvidia — by Etelis (创建于: 2026-01-29 17:36 (UTC+8)) [+50/-13, 2 files | commented:1] ## Summary
Skip CUTLASS block FP8 kernels on Blackwell GPUs (sm_120+) and fall back to Triton.
CUTLASS block FP8 is not stable on Blackwell, causing “Invalid status” runtime errors.
## Changes
Added compute capability check in
`is_cutlass_block_fp8_supported()` to exclude sm_120+.
…
- #33321 [ROCm] make rocm_aiter_fa support qwen3-next, remove multiple 16 block size support — documentation,rocm,v1,qwen — by ganyi1996ppo (创建于: 2026-01-29 17:05 (UTC+8)) [💬1 | +2/-3, 2 files | commented:2]
## Purpose
Reset
`get_supported_kernel_block_sizes` from `multiple(16)` to `[1, 16, 32]`. `multiple(16)` might cause an unsupported block size like 144 when serving qwen3-next. ## Test Plan gsm8k ## Test Result ```
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8597|± |0.0096|
| | |strict-match | 5|exact_match|↑ |0.8400|± |… -
#33306 [Frontend] Add tool_choice=required support for GPT-OSS Harmony models — frontend,tool-calling,gpt-oss — by gkswns0531 (创建于: 2026-01-29 13:22 (UTC+8)) [💬2 | +242/-14, 3 files | commented:1] ## Summary
This PR adds `tool_choice="required"` support for GPT-OSS Harmony models. When `tool_choice="required"` is set, the model is forced to generate tool calls instead of plain text responses.### Problem
GPT-OSS models use the Harmony chat format, which differs from standard models in its response generation behavior. Even when
`tool_choice="required"` is set, these models tend to generate direct text responses instead of tool calls, resulting in only a 91% success rate for tool call g… -
#33325 [ut] enhance structured output ut — structured-output,v1 — by andyxning (创建于: 2026-01-29 17:23 (UTC+8)) [+2/-2, 1 files | commented:1]
## Purpose Enhance structured output ut. ## Test Plan NA ## Test Result NA —
... -
#33323 [Misc] Clean up HIDDEN_DEPRECATED_METRICS after metric removal — 无标签 — by carlory (创建于: 2026-01-29 17:16 (UTC+8)) [+1/-5, 1 files | commented:1] ## Purpose
Clean up the
`HIDDEN_DEPRECATED_METRICS` list in the metrics test file after the deprecated metrics were removed in #29330. The following metrics were removed in #29330:
`vllm:gpu_cache_usage_perc`, `vllm:gpu_prefix_cache_queries`, `vllm:gpu_prefix_cache_hits`
Since these metrics are now fully removed, they no longer need to be tracked in `HIDDEN_DEPRECATED_METRICS`. … -
#33316 [Release] [ROCm] Remove old build step — rocm,ready,ci/build — by tjtanaa (创建于: 2026-01-29 15:23 (UTC+8)) [+2/-19, 1 files | commented:1 approved:1]
## Purpose
While rebasing this PR https://github.com/vllm-project/vllm/pull/33156 , I missed removing the old ROCm Image Build step.
The current pipeline graph will look like this https://buildkite.com/vllm/release-pipeline-shadow/builds/1693/steps/canvas
## Test Plan …
-
#33309 [Bugfix] Make tensorizer encrypted model output tests deterministic — bug — by Zaire404 (创建于: 2026-01-29 13:58 (UTC+8)) [+2/-0, 1 files | commented:1] ## Purpose Fix nondeterministic failures in
`tests/model_executor/model_loader/tensorizer_loader/test_tensorizer.py::test_deserialized_encrypted_vllm_model_with_tp_has_same_outputs` by making generation deterministic. This test compares generated outputs between the original model and the deserialized encrypted model; without explicitly setting sampling parameters, generation can be nondeterministic and lead to intermittent mismatches.
Add deterministic sampling params: `sampling_params = Sampl…
- #33304 [feat] Add per-block extra_keys to KV events — documentation,v1 — by zhongdaor-nv (创建于: 2026-01-29 12:15 (UTC+8)) [💬4 | +37/-3, 3 files | commented:1]
## Summary
- Add
`extra_keys` field to `BlockStored` events that exposes the extra hash keys (MM identifiers, LoRA name, cache_salt, prompt embeddings, etc.) used in block hash computation
- Each element in the list corresponds to one block in `block_hashes`, enabling external KV cache consumers to reconstruct block hashes
## Changes
- Add
`extra_keys` field to `BlockStored` in `kv_events.py`
- Generate extra_keys for each block individually during `cache_full_blocks` in `block_pool.py`
- Updat…
-
#33303 Support Pipeline and Data Parallelism for MiniMax-M2 — 无标签 — by rogeryoungh (创建于: 2026-01-29 11:49 (UTC+8)) [+21/-6, 1 files | commented:1]
## Purpose
Adds Pipeline Parallelism (PP) and Data Parallelism (DP) for
`minimax_m2`. Currently, enabling both PP+DP simultaneously results in character encoding issues. However, both PP and DP work correctly when used individually.
## Test Plan
## Test Result
…
[已合并 PR]
-
#33326 [Bugfix][Kernel] Fix negative memory offset in GDN Triton kernel — bug,ready — by CarstyYou (合并于: 2026-01-30 02:40 (UTC+8)) [💬5 | +18/-15, 1 files | commented:1 approved:3 changes:1] ## Purpose
Fix illegal memory access (
`cudaErrorIllegalAddress`) when running Qwen3-Next models with MTP speculative decoding and CUDA Graphs enabled.
Fixes #31186
cc @vadiklyutiy ### Root Cause
The Triton kernel
`fused_recurrent_gated_delta_rule_fwd_kernel` in GDN attention did not check for `PAD_SLOT_ID = -1` values in `ssm_state_indices`. When CUDA Graph pads unused slots with `-1`, the kernel computes negative memory offsets, causing illegal memory access. …
- #32696 [Model][Multimodal] Add explicit MusicFlamingo adapter — documentation,new-model,ready — by WangHaoyuuu (合并于: 2026-01-30 11:01 (UTC+8)) [💬5 | +115/-2, 6 files | commented:6 approved:1]
## Summary
- Add an explicit MusicFlamingo adapter that reuses AudioFlamingo3 while accepting MusicFlamingo config/processor with a safe fallback.
- Register MusicFlamingoForConditionalGeneration in the multimodal registry.
- Fix AudioFlamingo3 encoder compatibility for MusicFlamingo checkpoints and Qwen2Audio layer signature.
## Motivation MusicFlamingo shares the AudioFlamingo3 architecture but uses
`model_type=musicflamingo` and `MusicFlamingoProcessor`. Without an explicit adapter, type che…
#33358 [BUGFIX][XPU] fix memory check after XPU reuse GPU_worker — bug,v1 — by xuechendi (合并于: 2026-01-30 01:56 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
## Purpose
Fix a bug as below: ``` File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 333, in determine_available_memory (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] assert self.init_snapshot.free_memory > free_gpu_memory, ( (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=418) ERROR 01-29 15:38:21 [core.py:946] Assertion…
- #32849 [Docs] Adding links and intro to Speculators and LLM Compressor — documentation,ready — by aireilly (合并于: 2026-01-30 06:12 (UTC+8)) [💬7 | +73/-7, 5 files | commented:4 changes:1 approved:2] Small docs update for the Features section that adds overview docs for LLM Compressor and Speculators projects.
-
#33300 [Bugfix] Enable Triton MoE for FP8 per-tensor dynamic — bug,ready — by mgoin (合并于: 2026-01-30 04:15 (UTC+8)) [+3/-0, 2 files | commented:1 approved:1] ## Purpose
Without this change, when running an MoE model with FP8 per-tensor dynamic scales, the Triton MoE backend will not be selected and Marlin would be chosen in the end.
(EngineCore_DP0 pid=15286) DEBUG 01-28 21:25:48 [model_executor/.../oracle/fp8.py:332] FP8 MoE backend TRITON does not support the deployment configuration since kernel does not support quantization scheme.
This is simply a support registration mistake.
## Test Plan
…
-
#33205 Make
`mypy` opt-out instead of opt-in — ready,v1 — by hmellor (合并于: 2026-01-29 17:12 (UTC+8)) [+35/-57, 11 files | commented:3 approved:1] The opt-in nature of the `mypy.py` script meant that as new directories were added, they were not automatically tested with `mypy`. This PR:
- Removes the requirement to add new directories/files to the `FILES` list
- Adds the currently untested directories/files to the `EXCLUDE` list (this is what they effectively are currently)
- Performs some simple fixes for some of the previously untested files
-
#33129 [release] Minor fixes to release annotation and wheel upload — ready,ci/build — by khluu (合并于: 2026-01-30 04:09 (UTC+8)) [💬2 +51/-63, 3 files commented:10] - Add commands for CUDA 13.0 Docker images
- Use
`twine upload <args>` instead of `twine <args> upload`. The latter caused errors.
- Use `vllm-openai-rocm` repo instead of `vllm-openai:rocm` tag
- Remove the GitHub release part from the wheel upload script. This is currently done manually, and if it needs to be automated, we should separate it into a separate job/script.
- Rename release wheel script to be PyPI only.
-
#33298 [Bugfix] Fix Qwen3-VL-Reranker load. — bug,documentation,ready,qwen — by noooop (合并于: 2026-01-29 16:42 (UTC+8)) [💬4 | +234/-112, 6 files | commented:2 approved:1]
## Purpose
Fix Qwen3-VL-Reranker load.
## Test Plan
tests/entrypoints/pooling/score/test_online_score_vision.py
…
-
#32804 Add Triton fused MoE config for B200 (Nemotron Nano) — ready — by danisereb (合并于: 2026-01-30 03:21 (UTC+8)) [💬2 | +139/-0, 1 files | commented:1 approved:1] ## Purpose
When running Nemotron Nano on B200 the following warning appears:
Using default MoE config. Performance might be sub-optimal! Config file not found at .../vllm/model_executor/layers/fused_moe/configs/E=128,N=1856,device_name=NVIDIA_B200.json
I used the `benchmark_moe.py` to create a JSON file for this use-case: ``` …
- #32954 [NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe — ready,nvidia — by Linda-Stadter (合并于: 2026-01-30 02:00 (UTC+8)) [💬5 | +290/-17, 5 files | commented:10]
## Purpose
- Integrate flashinfer trtllm-gen BF16 moe to supported models
- This is a rebased version of PR 28238 by @jiahanc and includes adaptation to the latest moe refactoring changes. I have further verified that the accuracy issues discussed in 28238 are solved.
## Test Plan
`VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct …
-
#33324 [Chore] Move
`MediaConnector` to `vllm.multimodal.media` — frontend,ready,multi-modality — by DarkLight1337 (合并于: 2026-01-30 00:54 (UTC+8)) [💬1 | +381/-350, 8 files | commented:2 approved:1] ## Purpose
Missed this in #32406; further resolves circular import issues.
Also fix missing `AttributeError` in `vllm.transformers_utils.tokenizer.__getattr__`.
## Test Plan
## Test Result
…
- #33287 [ez] Delete torch25_custom_graph_pass — ready — by angelayi (合并于: 2026-01-30 00:47 (UTC+8)) [+0/-44, 1 files | commented:1 approved:1] Addresses https://github.com/vllm-project/vllm/pull/33209/changes/BASE..8ed781422f12b84be4ea004a4039b75839795ab0#diff-f19ed1cccb2fec69f4510e30750fb43450207d3d6c15ba4e0d6aad194ee70fed
-
#33350 [Bugfix] Fix broken GLM-OCR initialization — bug — by Isotr0py (合并于: 2026-01-29 23:56 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1]
## Purpose
- The GLM-OCR model is actually broken, because `GlmOcrVisionConfig` is only imported at type-checking time https://github.com/vllm-project/vllm/blob/c6e7404cc5713a926e8b6c187b5f197a5436e9ff/vllm/model_executor/models/glm_ocr.py#L36-L39
NameError: name 'GlmOcrVisionConfig' is not defined. Did you mean: 'GlmOcrVisionMLP'?
## Test Plan …
-
#33157 [Misc] Cleanup Kimi-K2.5’s vision chunk modality entrypoints — documentation,frontend,ready,multi-modality — by Isotr0py (合并于: 2026-01-29 17:46 (UTC+8)) [💬1 | +733/-204, 7 files | commented:3 approved:1]
## Purpose Kimi-K2.5's vision chunk implementation is a bit messy since it was a bit urgent to catch up with day-0 support.
This PR does some remained cleanup for it:
- Add missing tests for vision chunk modality
- Move uuids reconstruction functions to render.
## Test Plan …
-
#33331 [Multimodal] Simplify MM input definitions — ready,v1,multi-modality — by DarkLight1337 (合并于: 2026-01-29 21:32 (UTC+8)) [💬1 | +142/-164, 17 files | commented:2 approved:1] ## Purpose
Remove unnecessary fields in
`MultiModalFieldElem`; instead, always pass dictionaries to construct `MultiModalKwargsItem` and `MultiModalKwargsItems` to make their roles clearer. As `group_mm_kwargs_by_modality` still requires modality information, it has been updated to receive a list of `(modality, MultiModalKwargsItem)` instead of just a list of `MultiModalKwargsItem`.
Also, improve related docstrings.
## Test Plan
## Test Result
…
-
#33310 [Chore] Remove
`use_data_parallel` kwargs from ViT implementation — ready,llama — by Isotr0py (合并于: 2026-01-29 18:20 (UTC+8)) [💬1 | +36/-89, 9 files | commented:2 approved:1] ## Purpose
- Clean up `use_data_parallel` from ViT model implementation
## Test Plan
## Test Result
…
-
#33320 [Backport] [Kimi-K2.5] Replace torch.cuda with current_platform for d… — ready,nvidia — by flyrae (合并于: 2026-01-29 20:29 (UTC+8)) [💬4 | +2/-1, 1 files | commented:1 approved:2] commit msg: Replaced the hardcoded
`torch.cuda.current_device()` with `current_platform.current_device()` in the `KimiK25ForConditionalGeneration` initialization. This change enhances platform compatibility, allowing the model to run on non-CUDA devices (e.g., NPU, ROCm) supported by vLLM. ## Purpose
Update `KimiK25ForConditionalGeneration` to use `current_platform.current_device()` instead of the hardcoded `torch.cuda.current_device()`.
-
#32894 [Intel GPU] refine xpu worker — ready,ci/build,v1 — by jikunshang (合并于: 2026-01-29 20:26 (UTC+8)) [+27/-90, 2 files | commented:2 approved:2]
## Purpose There is some legacy code in the XPU platform (xpu_worker.py, XPU platform init); this PR refines these files to align with the GPU worker. Also disables some failed UTs for now; will further check UT status later.
## Test Plan CI
## Test Result …
-
#33312 [Models] Qwen3-ASR — documentation,new-model,ready,qwen — by ywang96 (合并于: 2026-01-29 19:27 (UTC+8)) [💬1 | +1269/-0, 9 files | approved:2 commented:4]
## Purpose
Add support for Qwen3-ASR model series - see recipe at https://github.com/vllm-project/recipes/blob/main/Qwen/Qwen3-ASR.md
## Test Plan
## Test Result
…
- #33317 [Bugfix][CPU] Fix thread num for shared memory communication — bug,ready,cpu — by bigPYJ1151 (merged: 2026-01-29 19:26 (UTC+8)) [+25/-10, 3 files | commented:1 approved:1]
## Purpose
Some CPUs have different thread counts across memory nodes, which leads to an error in the CPU shared-memory communicator. This PR fixes the issue.
## Test Plan
Locally verified.
…
- #33042 [Voxtral] Streaming example — documentation,ready,ci/build,multi-modality — by patrickvonplaten (merged: 2026-01-29 19:22 (UTC+8)) [💬8 | +207/-46, 5 files | commented:7 approved:1] This PR adds a test for the new streaming generator API (https://github.com/vllm-project/vllm/pull/28973), which works nicely!
- #33130 [Quantization][Refactor] use platform dict to choose kernel — ready — by zufangzhu (merged: 2026-01-29 18:44 (UTC+8)) [💬2 | +23/-13, 1 file | commented:6 approved:2]
- #31751 [Bug Fix] Handle variable-length tensors in MultiModalFlatField batching — bug,documentation,new-model,ready,ci/build,v1,multi-modality,llama,qwen — by AndriiPasternak31 (merged: 2026-01-29 18:43 (UTC+8)) [💬9 | +86/-0, 2 files | commented:5 approved:2] ## Summary
Fix audio batching crash with variable-length inputs in the Ultravox model.
Fixes #31658
## Problem
When batching audio tensors with different time dimensions (e.g., `[80, 325]` vs `[80, 666]`), `MultiModalFlatField._reduce_data()` incorrectly flattens tensors into individual rows instead of returning them as a list.
…
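The rule the fix enforces can be sketched as follows (a simplified, hypothetical `reduce_batch` using nested lists in place of tensors; not vLLM's actual `_reduce_data` implementation):

```python
# Minimal sketch of the batching rule: stack only when every tensor shares
# the same shape, and fall back to returning a per-item list when shapes
# differ (e.g. variable-length audio features).

def reduce_batch(tensors):
    """tensors: list of nested lists standing in for 2-D tensors."""
    def shape(t):
        return (len(t), len(t[0]) if t else 0)

    first = shape(tensors[0])
    if all(shape(t) == first for t in tensors):
        # Uniform shapes: safe to "stack" into one batched array.
        return {"stacked": True, "batch": tensors}
    # Variable lengths ([80, 325] vs [80, 666]): keep per-item tensors so
    # nothing is flattened into individual rows.
    return {"stacked": False, "batch": list(tensors)}

a = [[0.0] * 325 for _ in range(80)]   # shape (80, 325)
b = [[0.0] * 666 for _ in range(80)]   # shape (80, 666)
print(reduce_batch([a, b])["stacked"])  # variable lengths -> False
print(reduce_batch([a, a])["stacked"])  # uniform shapes   -> True
```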
- #32881 [BugFix] Async EPLB: fix potential race condition — bug,ready — by ilmarkov (merged: 2026-01-29 18:31 (UTC+8)) [+18/-0, 2 files | commented:2 approved:2]
## Purpose Fixes a potential race condition in async EPLB. The scenario is as follows: EPLB successfully transfers layer 1, saves the weights in intermediate buffers, and signals the main thread that the buffer is ready. The main thread schedules the weight copy on the main stream once communication on the EPLB stream is done, and notifies the async thread that the buffer is ready. In this scenario it might happen that copying the weights takes too long and the async thread will start a weight t…
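This class of buffer-reuse race, and the extra handshake that closes it, can be sketched with `threading.Event` (a generic, hypothetical illustration; the names and control flow are assumptions, not vLLM's code):

```python
# Sketch of a producer/consumer handshake over a shared staging buffer:
# the async thread must not reuse the buffer until the main thread has
# finished copying out of it — otherwise the next transfer can overwrite
# weights that are still being read.
import threading

buffer_ready = threading.Event()   # async thread -> main thread
copy_done = threading.Event()      # main thread -> async thread

staged = []      # intermediate buffer
committed = []   # main thread's copy

def async_transfer_thread():
    for layer in range(3):
        staged.append(f"weights-{layer}")  # "transfer" one layer
        buffer_ready.set()
        copy_done.wait()   # the fix: wait before reusing the buffer
        copy_done.clear()
        staged.clear()

def main_thread():
    for _ in range(3):
        buffer_ready.wait()
        buffer_ready.clear()
        committed.append(staged[0])  # copy weights out of the buffer
        copy_done.set()

t = threading.Thread(target=async_transfer_thread)
t.start()
main_thread()
t.join()
print(committed)  # ['weights-0', 'weights-1', 'weights-2']
```

Without the `copy_done` wait, the async thread could clear and refill `staged` while the main thread is still reading it — the overwrite described in the PR.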
- #32769 [fix] test mcp_tool_calling_streaming with a more complex math question — ready — by daniel-salib (merged: 2026-01-29 18:25 (UTC+8)) [+1/-1, 1 file | approved:2] ## Purpose
The python tool is not always triggered if the math question is too simple. When asked to compute larger numbers, the python tool triggers consistently.
## Test Plan pytest tests/entrypoints/openai/responses/test_mcp_tools.py::test_mcp_tool_calling_streaming_types
## Test Result
passes consistently
- #32669 Bugfix: Pass router logits dtype in nemotron shared experts — bug,ready — by amirkl94 (merged: 2026-01-29 17:36 (UTC+8)) [💬1 | +3/-1, 1 file | commented:1 approved:2] ## Purpose A change introduced in an earlier PR requires passing `router_logits_dtype` to the MoE layer.
When running with `dp > 1` and the FlashInfer cutlass MoE kernel in NVFP4, the following error happens: ``` assert self.batched_router_logits.dtype == full_router_logits.dtype, ( ERROR 01-19 05:53:49 [multiproc_executor.py:839] Assert… ```
- #33116 [CI/Build][BugFix] fix cuda/compat loading order issue in docker build — bug,ready,ci/build,nvidia — by wpc (merged: 2026-01-29 16:19 (UTC+8)) [💬1 | +2/-2, 1 file | commented:2 approved:1] ## Purpose In PR #30784 we persisted the CUDA compat lib path with ldconfig. This caused a CUDA init issue when the docker image runs with a new driver that does not require the compatibility libs. ``` $ nvidia-smi Mon Jan 26 13:09:50 2026 | NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 | … ```
- #33283 [Misc] Remove missed `pad_for_cudagraph` — ready,nvidia — by LucasWilkinson (merged: 2026-01-29 17:12 (UTC+8)) [+0/-7, 1 file | commented:1 approved:2] Remove the `pad_for_cudagraph` missed in https://github.com/vllm-project/vllm/pull/30143
- #32660 [Doc] Update outdated link to Ray documentation — documentation,ready — by graftim (merged: 2026-01-29 16:56 (UTC+8)) [💬2 | +1/-1, 1 file | approved:1] The link to the Ray Serve documentation had become outdated, so I updated it to point to the same page in the current Ray documentation.
## Purpose
## Test Plan
## Test Result
…
- #32943 Adding optional speculator tests for larger models — speculative-decoding,ready,ci/build,v1 — by shanjiaz (merged: 2026-01-29 16:54 (UTC+8)) [💬1 | +45/-4, 2 files | commented:7 approved:1] ## Purpose We just enabled speculative decoding for the Qwen3 MoE vision-language pathway. This adds a test for a vision-language speculator model.
- Add support for testing the Qwen3-30B-MOE-VL-Eagle3 speculator model
- Create a separate optional CI job for large speculator model tests on A100 GPUs …
- #33152 [PluggableLayer][2/N] Apply PluggableLayer to linear layers — ready,ready-run-all-tests — by whx-sjtu (merged: 2026-01-29 16:53 (UTC+8)) [+5/-5, 1 file | commented:1 approved:1] ## Purpose As a task in https://github.com/vllm-project/vllm/issues/32676, this PR applies `PluggableLayer` to linear layers. ## Test Plan All tests should pass. ## Test Result All tests should pass.
- #33212 support returning token ids in responses api — frontend,ready — by cmunley1 (merged: 2026-01-29 16:52 (UTC+8)) [+50/-9, 3 files | commented:10]
## Purpose Support returning token IDs from the Responses API
## Test Plan
## Test Result
…
- #33262 [BugFix] Fix EPLB fail for MoeFP4 model with Marlin backend — bug,ready — by ilmarkov (merged: 2026-01-29 16:52 (UTC+8)) [+10/-2, 1 file | commented:2 approved:2]
## Purpose This PR fixes a crash of `nvidia/DeepSeek-R1-0528-FP4-v2` when EPLB is enabled and FlashInfer MoE FP4 is disabled. In the Marlin MoE FP4 backend we don't use activation scales, so we replace them with None via the `replace_parameter` util. But that doesn't set the parameter to None; it creates an empty parameter on CPU, which triggers an EPLB error: ``` (EngineCore_DP1 pid=3244402) File "/home/sagemoore/git/nm-vllm/vllm/distributed/eplb/eplb_state.py", line 551, … ```
- #33256 [Doc]: fixing multiple typos in diverse files — documentation,frontend,ready,ci/build,v1,cpu — by didier-durand (merged: 2026-01-29 16:52 (UTC+8)) [💬3 | +20/-20, 14 files | commented:1 approved:1] ## Purpose
Improve quality of the repo by fixing typos
## Test Plan
No tests changed or added
## Test Result
…
- #33316 [Release] [ROCm] Remove old build step — rocm,ready,ci/build — by tjtanaa (merged: 2026-01-29 15:35 (UTC+8)) [+2/-19, 1 file | commented:1 approved:1]
## Purpose
While rebasing PR https://github.com/vllm-project/vllm/pull/33156, I missed removing the old ROCm image build step.
The current pipeline graph looks like this: https://buildkite.com/vllm/release-pipeline-shadow/builds/1693/steps/canvas
## Test Plan …
- #33189 [Misc][Build] Lazy load cv2 in nemotron_parse.py — bug,ready — by kiersten-stokes (merged: 2026-01-29 14:55 (UTC+8)) [💬1 | +4/-1, 1 file | commented:1 approved:1] ## Purpose Lazily load the cv2 module to avoid import errors in certain environments where it is not needed.
Error trace:
``` Traceback (most recent call last): File "", line 1, in File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main ... ```
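The lazy-import pattern itself can be sketched generically (a hypothetical `load_image_gray` helper, not the PR's actual diff):

```python
# Lazy-import pattern: defer `import cv2` into the function that needs it,
# so merely importing this module never fails in environments without OpenCV.

def load_image_gray(path):
    import cv2  # only evaluated when this code path actually runs
    return cv2.imread(path, cv2.IMREAD_GRAYSCALE)

# Importing the enclosing module (and defining the function) is now safe
# even when OpenCV is absent; an ImportError can only surface inside the call.
```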
- #33156 [Release] [CI] Optim release pipeline — rocm,ready,ci/build — by tjtanaa (merged: 2026-01-29 14:45 (UTC+8)) [💬1 | +399/-24, 5 files | commented:8 approved:2]
## Purpose
### Changes
- The Docker image release pipeline is now unified with the vLLM ROCm wheel release pipeline; it uses the content cache and sccache in both cases.
- sccache is always removed at the end of the Docker image creation, so that the image is readily usable as a local development environment.
- The test I carried out is as follows: docker pull the image to a local machine and reinstall a fresh vLL…
- #33141 Fix tool call indexing double-counting — frontend,ready — by wangln19 (merged: 2026-01-29 13:57 (UTC+8)) [+4/-3, 1 file | commented:2 approved:1] ## Purpose Fixes a bug in vllm/entrypoints/openai/chat_completion/serving.py where tool call IDs were generated with non-sequential indices due to double-counting. Related to PR "fix: preserve native tool call ID in multi-turn tool calling" #32768.
In the original code, `history_tool_call_cnt` was incremented inside the loop, but the index calculation also used `idx` from `enumerate`: ```python # Old logic: idx = history_tool_call_cnt + idx # Double counting effect history_tool_ca… ```
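The double-counting effect and the corresponding fix can be illustrated with hypothetical helpers (`assign_indices_buggy` and `assign_indices_fixed` are not vLLM's code):

```python
# When assigning sequential tool-call indices, combining a counter that is
# incremented inside the loop with enumerate()'s own index advances the
# result twice per item.

def assign_indices_buggy(history_tool_call_cnt, new_calls):
    indices = []
    for idx, _call in enumerate(new_calls):
        indices.append(history_tool_call_cnt + idx)
        history_tool_call_cnt += 1  # bug: idx already advances per item
    return indices

def assign_indices_fixed(history_tool_call_cnt, new_calls):
    # Offset by the history count once; enumerate() supplies the step.
    return [history_tool_call_cnt + idx for idx, _ in enumerate(new_calls)]

print(assign_indices_buggy(2, ["a", "b", "c"]))  # [2, 4, 6] -- gaps
print(assign_indices_fixed(2, ["a", "b", "c"]))  # [2, 3, 4] -- sequential
```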
- #33260 [Refactor] Define MM data parser in processing info instead of processor itself — ready,multi-modality,llama,qwen — by DarkLight1337 (merged: 2026-01-29 13:55 (UTC+8)) [+399/-347, 34 files | commented:2 approved:1] ## Purpose
Gradually separate the data-parsing logic out of the processor class. Also fix some instances where `_get_expected_hidden_size` is not used for data parsing.
## Test Plan
## Test Result
...
- #33288 [ez] Delete more torch version checks <= 2.8 — ready — by angelayi (merged: 2026-01-29 13:28 (UTC+8)) [+22/-70, 2 files | commented:1 approved:1]
I missed the cases where it uses `torch.__version__` instead of `is_torch_equal_xx` 😅
- #33227 [Misc] Add orozery to CODEOWNERS (core, kv_transfer, kv_offload) — ready,ci/build — by orozery (merged: 2026-01-29 12:24 (UTC+8)) [+8/-6, 1 file | commented:1 approved:1]
[PRs closed without merging]
- #19507 deps: Update torch and deps to 2.7.1 — rocm,needs-rebase,ci/build,stale — by seemethere (closed: 2026-01-30 10:17 (UTC+8)) [💬8 | +28/-28, 10 files | commented:2] ## Purpose
Updates PyTorch and all of its downstream dependencies to their respective 2.7.1 versions
## Test Plan
CI
## Test Result
…
- #20250 [Bugfix] enable logging in v1 offline inference — needs-rebase,stale,v1 — by MingzhenHan (closed: 2026-01-30 10:17 (UTC+8)) [💬5 | +51/-9, 1 file | commented:2] ## Purpose Fixes the issue of metrics not being logged when running offline inference with vLLM v1.
Related issue: https://github.com/vllm-project/vllm/issues/17382
## Test Result
-
#25828 Fix num tokens when sp moe enabled — ready,stale — by wuxun-zhang (closed: 2026-01-30 10:16 (UTC+8)) [💬6 | +1/-1, 1 file | commented:1 approved:1]
For some backends like Gaudi, the input shape of the first MoE layer can be 3-D (bs, seqlen, hdim). So when SP MoE is enabled (https://github.com/vllm-project/vllm/pull/24982), `num_tokens` may get a wrong value.
@tlrmchlsmth
## Purpose
## Test Plan …
- #25930 [Misc] Allow toggling prefix cache during tuning — performance,stale — by wdhongtw (closed: 2026-01-30 10:16 (UTC+8)) [💬2 | +15/-1, 1 file | commented:1] Using the prefix cache introduces more non-determinism during tuning, as the prefill/decode ratio may change according to the prefix-cache hit rate.
Add a switch that allows turning off the prefix cache during tuning, while keeping the default behavior as before (prefix cache enabled).
## Purpose
Depending on the dataset, input length, and random sampling, the prefix-cache hit rate varies across invocations. This PR provides a switch to optionally turn off the prefix cache, for more stable experime…
- #19434 [WIP][FP8] ScaledMM refactor — needs-rebase,nvidia — by ProExpertProg (closed: 2026-01-30 09:29 (UTC+8)) [💬3 | +184/-107, 7 files | commented:2 | 📝 draft]
## Purpose This PR will refactor the fp8 scaled mm kernels to u…
- #26738 [DO NOT MERGE] 2.9, Inductor partition, standalone compile, monkeypatch fix(es) — documentation,rocm,ready,ci/build,llama — by ProExpertProg (closed: 2026-01-30 09:29 (UTC+8)) [💬11 | +1/-1, 1 file | commented:2] In-progress PR to test inductor partitioning in CI.
Includes:
- turning on inductor partitioning by default
Past fixes in this PR now in main:
- #26956 AOT caching issue fix
- #26735 monkeypatch
- #26878 monkeypatch for memory plan output naming …
- #32115 Fix select_gemm_impl in ModelOptFp8MoEMethod — ready,needs-rebase — by danisereb (closed: 2026-01-29 22:48 (UTC+8)) [💬3 | +15/-4, 1 file | commented:3] ## Purpose Add support for Triton as MoE backend, required for LoRA support.
In the init of `ModelOptFp8MoEMethod` the function `select_fp8_moe_backend` is called. If it returns `Fp8MoeBackend.TRITON`, handling for this case is missing in `select_gemm_impl`. The code in this PR is similar to the if-else block in `Fp8MoEMethod.select_gemm_impl` (but with only Triton and Cutlass as options).
This PR is blocked by another PR: …
- #30618 [BugFix][Hybrid] Fix prefill chunk incorrectly including draft tokens — bug,ready,needs-rebase,v1 — by peakcrosser7 (closed: 2026-01-29 22:16 (UTC+8)) [💬22 | +24/-5, 1 file | commented:4]
## Purpose For the Hybrid model, the tokens scheduled during the prefill phase must not include draft tokens. If draft tokens are included, Mamba will incorrectly save a state with a length of `prompt_len + draft_tokens` instead of the correct length of `prompt_len`, leading to wrong output. ## Test Plan test script: ```python from vllm import LLM, SamplingParams … ```
- #30553 [Bug][CPU Backend]: Improve L2 cache size detection and usage on aarch64 — bug,needs-rebase,cpu — by Radu2k (closed: 2026-01-29 21:46 (UTC+8)) [💬4 | +34/-0, 1 file | commented:6 changes:1]
## Purpose
- Add /sys/devices/system/cpu/cpu0/cache/index2/size parsing for the L2 cache on aarch64
- Fall back to 1 MB with a warning if the sysfs file cannot be opened
- Use 70% of the detected L2 cache for attention scheduling, to leave enough room for system processes
## Test Plan
`l2_cache_size` gets the right value.
## Test Result
`l2_cache_size` is set to 2048 * 1024 * 0.7 bytes on the c8g instance type, which is the correct value given that it has 2 MB of L2 cache per core and we want to use 70%. …
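The detection-and-budget logic described in this entry can be sketched as follows (a hypothetical `detect_l2_budget` helper, not the PR's code; the "2048K" sysfs value format is an assumption):

```python
# Sketch of aarch64 L2 detection: parse the sysfs cache size, fall back
# to 1 MB with a warning if unreadable, then budget 70% of it for
# attention scheduling.
import warnings

def detect_l2_budget(sysfs_path="/sys/devices/system/cpu/cpu0/cache/index2/size",
                     fallback_bytes=1024 * 1024, fraction=0.7):
    try:
        with open(sysfs_path) as f:
            text = f.read().strip()          # e.g. "2048K"
        if text.endswith("K"):
            l2_bytes = int(text[:-1]) * 1024
        elif text.endswith("M"):
            l2_bytes = int(text[:-1]) * 1024 * 1024
        else:
            l2_bytes = int(text)
    except OSError:
        warnings.warn("could not read L2 cache size; assuming 1 MB")
        l2_bytes = fallback_bytes
    return int(l2_bytes * fraction)

# With a 2 MB per-core L2 (as on c8g), the 70% budget is
# int(2048 * 1024 * 0.7) bytes; with the 1 MB fallback it is
# int(1024 * 1024 * 0.7) bytes.
```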
- #33334 [Feature] Add rich request snapshot stream for step-level observability (PR #5) — documentation,frontend,needs-rebase,ci/build,v1 — by sriumcp (closed: 2026-01-29 19:30 (UTC+8)) [💬3 | +13599/-1130, 56 files | commented:1] ## Summary
This PR implements PR #5, the rich request snapshot stream from the step-level tracing plan, adding per-request observability events within sampled scheduler steps.
Part of the step-level tracing series:
- PR #3: Step-level batch summary tracing ✅ (merged)
- PR #4: KV cache metrics utilities ✅ (merged)
- PR #5: Rich request snapshot stream ⬅️ this PR
## What This PR Adds …
- #32552 Ignore: try to implement parakeet — needs-rebase — by netanel-haber (closed: 2026-01-29 19:25 (UTC+8)) [💬1 | +370/-21, 2 files | commented:1 | 📝 draft] ## Purpose
## Test Plan
## Test Result
...
-
#32620 [CI/Build][AMD] Fix distributed/test_utils.py — rocm,ci/build — by rjrock (closed: 2026-01-29 12:29 (UTC+8)) [💬4 | +10/-1, 2 files | commented:1 changes:1] ## Purpose To improve the health of AMD's Distributed Tests (4 GPUs) test group.
Ray uses `HIP_VISIBLE_DEVICES` for GPU allocation when on a ROCm platform.
## Test Plan `RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="" pytest -v -s distributed/test_utils.py` ## Test Result Before: …
- #31189 [CI/Build] Fix rlhf_colocate.py for AMD — rocm,ci/build — by rjrock (closed: 2026-01-29 12:28 (UTC+8)) [💬12 | +15/-2, 2 files | commented:7 changes:1] ## Purpose To get rlhf_colocate.py to pass on AMD's CI.
## Test Plan
`RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES="" VLLM_ALLOW_INSECURE_SERIALIZATION=1 RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py` ## Test Result Before: an import error on `from vllm.platforms import current_platform`, due to Ray spawning an Actor with no GPUs allocated. After: exit code of 0. …