[vLLM GitHub Development Digest] 2026-02-06
[Overview]
- Time window: 2026-02-06 11:19 (UTC+8) to 2026-02-07 11:19 (UTC+8)
- New issues: 21 (label breakdown: bug:7, feature request:4, usage:3, RFC:3, rocm:3)
- Closed issues: 23
- New PRs: 52 (label breakdown: ready:23, bug:17, v1:11, documentation:9, frontend:9)
- Merged PRs: 43
- PRs closed without merging: 31
[New issues]
-
#34034 [Bug]: vLLM-compile should not execute the decoder forward pass during compilation — bug,torch.compile — by zou3519 (created: 2026-02-07 11:07 (UTC+8)) ### Your current environment
main
### 🐛 Describe the bug
During cold-start compilation, vLLM-compile executes the text decoder forward pass at the max_num_batched_tokens size. I’m not completely sure how long a text decoder forward pass takes, but if it is O(1s) then this execution is unnecessary and skipping it will save compile time.
There’s one complication in that I don’t know when autotuning will happen. I hope that vLLM does warmup of a size before the first cudagraph capture (but also we do compile-time…
-
#34002 [Bug]: Recent PR breaks Whisper inference — bug — by almayne (created: 2026-02-07 00:17 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (truncated)
-
#34028 [Usage]: How to load FP8 Quantised Model - Value error, Found unknown quantization, same worked for v0.11.0 — usage — by gabinguo (created: 2026-02-07 07:29 (UTC+8)) ### Your current environment
wget transcript fetching https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py (truncated)
-
#34005 [Bug]: Qwen3-1.7B apparently not respecting max-model-len — bug — by kdu4108 (created: 2026-02-07 00:59 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (truncated)
-
#34018 [RFC]: Helix (Context + Tensor) Parallelism for Efficient Long-Context Decoding — RFC — by sungsooha (created: 2026-02-07 04:23 (UTC+8)) ### Motivation.
This RFC proposes adding Helix (Context + Tensor) Parallelism to vLLM, based on NVIDIA’s Helix paper (July 2025). Helix enables efficient long-context decoding by combining Context Parallelism (sequence sharding) with Tensor Parallelism (head sharding), eliminating KV cache duplication that occurs in traditional Tensor Parallelism when
`TP > num_kv_heads`. Paper results (NVIDIA):
- Up to 1.5× decode latency reduction at fixed batc…
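The KV-duplication claim above can be illustrated with back-of-envelope arithmetic (the head count and TP degree below are assumed for illustration, not taken from the RFC):

```python
# Illustrative arithmetic only (values assumed, not from the RFC): estimate
# how many times each KV head is replicated under plain tensor parallelism.
num_kv_heads = 8   # a GQA model with 8 KV heads
tp = 32            # tensor-parallel degree

# With TP > num_kv_heads, each KV head must live on tp / num_kv_heads ranks,
# so its KV cache is stored that many times across the TP group.
kv_replication = max(1, tp // num_kv_heads)
print(kv_replication)  # 4
```

Helix avoids this replication by sharding the sequence (context parallelism) instead of over-sharding the heads.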
-
#34016 [Bug]: vLLM crashes when trying to load Devstral-2-123B-Instruct-2512 directly from S3 — bug — by moorglade (created: 2026-02-07 04:21 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (truncated)
-
#33996 [RFC]: Add an option to use NCCL-based symmetric memory when pytorch symmetric memory is applied — RFC — by ilmarkov (created: 2026-02-06 22:40 (UTC+8)) [💬1] ### Motivation.
PyTorch now supports different backends for symmetric memory, including 'CUDA', 'NVSHMEM', and 'NCCL', which can be controlled with `TORCH_SYMMMEM`.
### Proposed Change.
- We need to find out if NCCL-based pytorch symmetric memory replicates NCCL symmetric memory we use in vLLM.
- We need to refactor our usage of symmetric memory so that it has a single entry point.
- Benchmark the collectives we use in vLLM to pick the best version of symmetric memory on different hardware.
…
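The backend selection mentioned in the motivation can be sketched as follows (a minimal illustration; the accepted values are the ones quoted in the RFC):

```python
import os

# Select the NCCL backend for PyTorch symmetric memory before any
# symmetric-memory allocation happens; the RFC lists 'CUDA', 'NVSHMEM',
# and 'NCCL' as possible TORCH_SYMMMEM values.
os.environ.setdefault("TORCH_SYMMMEM", "NCCL")
print(os.environ["TORCH_SYMMMEM"])
```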
-
#34004 [Bug]: docker ROCm build fails — bug,rocm — by markg85 (created: 2026-02-07 00:56 (UTC+8)) [💬4] ### Your current environment
Irrelevant because it’s inside docker, not on my host. I’m building the docker container.
### 🐛 Describe the bug
I followed this description: https://docs.vllm.ai/en/v0.6.5/getting_started/amd-installation.html I went for option 1 and I meet the criteria exactly. I have an AMD 7900XT, so I went with this docker build line:
…
-
#34012 [Feature]: MXFP8 GEMM / Grouped GEMM Kernels for AMD — feature request,rocm — by EdalatiAli (created: 2026-02-07 02:34 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
vLLM does not currently support serving MXFP8 quantized dense/MoE models on AMD GPUs. It would be great to add the required MXFP8 GEMM / Grouped GEMM kernels to enable serving MXFP8 dense and MoE models on AMD GPUs.
### Alternatives
No response
### Additional context
…
-
#34008 [Feature]: W4A16 (int4) GEMM / Grouped GEMM Kernels for AMD — feature request,rocm — by ZewenShen-Cohere (created: 2026-02-07 02:03 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
vLLM currently lacks performant W4A16 (int4 weight) GEMM support on AMD ROCm/AITER backends. W4A16 Grouped GEMM exists only as a Triton kernel, which is not optimal.
Would it be possible to add AMD-optimized native kernels for W4A16 GEMM and W4A16 grouped GEMM?
### Alternatives
No response
…
-
#34001 [RFC] Change the directory layout from `scaled_mm/` and `mixed_precision/` to backend-first — no labels — by tjtanaa (created: 2026-02-06 23:46 (UTC+8)) [💬2] After implementing https://github.com/vllm-project/vllm/issues/33872. Before starting to snowball the Kernel Abstraction, we should align the semantics of the directory to make sure they are easy to follow and use.
We would like to propose the following directory structure to organize the kernel abstraction of linear gemm based on the following pattern:
- based on providers if the kernels are imported from third party library
- if the kernels are implemented in `vllm/csrc`, they are organiz…
-
#34000 Move the Python kernel abstraction package (selectors + interfaces) under `vllm/model_executor` so unquantized linear kernels don’t need to be referenced from a quantization-specific directory. — no labels — by tjtanaa (created: 2026-02-06 23:45 (UTC+8)) Move the `kernels` from `quantization` into `model_executor`. Final directory layout ``` vllm/ └── vllm/ └── model_executor/ └── kernels/ ├── __init__.py ├── mixed_precision …
-
#33995 [Bug] [torch 2.10] test_routed_input_transform_inside_vs_outside failing — bug — by atalman (created: 2026-02-06 22:37 (UTC+8)) ### Your current environment
Buildkite CI: https://buildkite.com/vllm/ci/builds/50239
### 🐛 Describe the bug
Related PR for the torch 2.10 update: https://github.com/vllm-project/vllm/pull/30525 It looks like https://github.com/vllm-project/vllm/pull/32790 introduced this test; however, it is failing on the torch 2.10 update.
…
-
#33970 [Usage]: how to serve quantized Qwen3-Reranker-8B — usage — by jiiihyeonnn (created: 2026-02-06 14:05 (UTC+8)) [💬1] ### Your current environment
GPU == NVIDIA RTX A5000, vLLM == vllm/vllm-openai:v0.14.0
### How would you like to use vllm
…
-
#33991 [Installation]: building docker cpu image with VLLM_CPU_DISABLE_AVX512=true (or on any x86_64 CPU without AVX512) fails to compile mla_decode.cpp because BFloat16 has no AVX2 fallback — installation,cpu — by matthewfranglen (created: 2026-02-06 19:39 (UTC+8)) ### Your current environment
Environment (truncated): OS: Ubuntu 24.04.3 LTS (x86_64); GCC: 13.3.0
-
#33987 [Installation]: vllm as submodule results in unbuildable docker images due to setuptools-scm and docker context — installation — by matthewfranglen (created: 2026-02-06 18:53 (UTC+8)) ### Your current environment
Environment (truncated): OS: Ubuntu 24.04.3 LTS (x86_64); GCC: 13.3.0
-
#33986 [Tracking issue]: Issue Tracker for Qwen/Qwen3-VL-Embedding & Qwen/Qwen3-VL-Reranker — usage — by noooop (created: 2026-02-06 18:07 (UTC+8)) [💬2] ### vLLM examples for Qwen3-VL-Embedding:
- https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_offline.py
- https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_online.py
### vLLM examples for Qwen3-VL-Reranker:
- https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_rerank_api_online.py
- https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_score_api_online.py
- https://githu…
-
#33980 [RFC]: Sparse attention KV cache offloading to support longer sequence length — RFC — by zhangsicheng5 (created: 2026-02-06 17:09 (UTC+8)) ## Motivation.
In long-sequence inference scenarios, KV cache size has become one of the inference bottlenecks. To save GPU memory usage of the KV cache and support longer sequence lengths, we proposed a layerwise KV cache offloading approach in RFC #33398. However, during development we found that the number of offloadable layers is limited by the loading speed: based on a rough estimation, the loading time is $\frac{kv\_cache\_size}…
-
#33974 [Feature]: We propose the official development and maintenance of a vLLM integration or plugin within Dify. — feature request — by ooodwbooo (created: 2026-02-06 15:32 (UTC+8)) ### 🚀 The feature, motivation and pitch
We propose the official development and maintenance of a vLLM integration or plugin within Dify. Currently, the vLLM plugin in Dify only supports text-only LLMs. We require support for VL, Embedding, Rerank, VL-Embedding, and VL-Rerank, and hope that the official team can take over the maintenance of the vLLM plugin in Dify.
https://github.com/vllm-project/vllm/issues/17454
https://github.com/yangyaofei/dify-vllm-provider/issues/4#issuecomment-…
-
#33966 [Feature]: Add tool_choice="required" support for GPT-OSS Harmony models — feature request — by gkswns0531 (created: 2026-02-06 13:20 (UTC+8)) ### 🚀 The feature, motivation and pitch
GPT-OSS models use the Harmony chat format, which differs from standard models in its response generation behavior. Even when tool_choice="required" is set, these models tend to generate direct text responses instead of tool calls, resulting in only a 91% success rate for tool-call generation. Harmony format models select one of three channels (final, analysis, commentary) after completing their internal processing. Tool calls are only generated ...
-
#33962 [Bug]: Realtime transcription completes but server never sends transcription.done when final commit is received during slow streaming — bug — by pjs102793 (created: 2026-02-06 11:30 (UTC+8)) ### Your current environment
Environment (truncated): OS: Ubuntu 24.04.1 LTS (x86_64); GCC: 13.3.0
[Closed issues]
-
#34002 [Bug]: Recent PR breaks Whisper inference — bug — by almayne (closed: 2026-02-07 10:42 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (truncated)
-
#33295 [Bug]: QKNorm+RoPE fusion broken for qwen3-fp8 on B200 — bug,help wanted,torch.compile — by ProExpertProg (closed: 2026-02-07 10:27 (UTC+8)) [💬12] ### Your current environment
The output of `python collect_env.py` (truncated)
-
#11905 [Feature]: Support Multiple Tasks Per Model — feature request,stale — by FurtherAI (closed: 2026-02-07 10:18 (UTC+8)) [💬20] ### 🚀 The feature, motivation and pitch
Requesting this for V1 #11862
The idea is pretty simple, it would be nice to be able to, e.g., get generations and embeddings out of a single model. An example use case is when you have a LoRA for generation and a LoRA for embedding on top of the same base model. Deploying two vLLM servers is really inefficient for accomplishing this.
### Alternatives
A lesser feature would be one task per LoRA, but it’s better to be general if possible. …
-
#19415 [Bug]: [V1][gpu_model_runner.py] CUDA memory error — bug,stale — by rajagond (closed: 2026-02-07 10:18 (UTC+8)) [💬12] ### Your current environment
I am using vllm docker image: docker pull vllm/vllm-openai:v0.9.1rc1
### 🐛 Describe the bug
Hi,
I am making some changes to vllm/v1/worker/gpu_model_runner.py and trying to initialize some custom tensors inside the GPUModelRunner class. Specifically, I have added the following code:
…
-
#24737 [Bug]: Qwen3 Reranker 500 error when submitting longer query — bug,stale — by HadiSDev (closed: 2026-02-07 10:17 (UTC+8)) [💬6] logs (1).txt
### Your current environment
--host 0.0.0.0 --port 8000 --model tomaarsen/Qwen3-Reranker-8B-seq-cls --max-model-len 8128
I use docker to host vllm for reranking qwen3, but when I send a query that is a bit longer than average I get this exception (see attached logs) ...
-
#25910 [Bug]: Audio usage reporting inconsistent between streaming and blocking — bug,stale — by alugowski (closed: 2026-02-07 10:16 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py`: irrelevant
-
#26369 [Performance]: Use int over list[int] as output_tokens to reduce GC overhead — performance,stale — by Jialin (closed: 2026-02-07 10:16 (UTC+8)) [💬8] ### Proposal to improve performance
Currently, we consistently use list[int] to represent output_tokens in ModelRunnerOutput, which is very inefficient from a GC perspective.
The default setup of GC is (700, 10, 10) which means
- if allocated_obj-deallocated_obj>=700 in generation 0, GC0 will be triggered
- GC1 is triggered after 10 GC0
- GC2 is triggered after 10 GC1. In large-batch scenarios (small models), each batch could be as large as 1024, which means GC0 will be tri…
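The generation thresholds described above can be inspected directly; a minimal sketch (the 1024-object batch size is the figure quoted in the proposal):

```python
import gc

# CPython's collection thresholds; the proposal quotes the default (700, 10, 10):
# gen0 collects once net new allocations reach 700, every 10th gen0 run
# triggers gen1, and every 10th gen1 run triggers gen2.
g0, g1, g2 = gc.get_threshold()

# A single batch of 1024 per-request list[int] objects already exceeds the
# default gen0 threshold of 700, so every such batch can trigger a collection.
batch = [[token_id] for token_id in range(1024)]
print(len(batch) > 700)  # True
```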
-
#26398 [Docs] flashinfer missing from `/en/latest/configuration/env_vars.html` — documentation,stale — by MikeSpreitzer (closed: 2026-02-07 10:16 (UTC+8)) [💬3] ### 📚 The doc issue
I am using vLLM release 0.10.2 and got a nasty surprise about some undocumented environment variables used by flashinfer. I eventually found https://github.com/flashinfer-ai/flashinfer/blob/v0.3.0/flashinfer/jit/env.py, which I hope provides the answers I need.
### Suggest a potential alternative/fix
Document FLASHINFER_WORKSPACE_BASE, FLASHINFER_CUBIN_DIR.
FYI, https://github.com/vllm-project/vllm/pull/21972 will affect this.
…
-
#26418 [Feature]: Resume guided/JSON generation after early termination — feature request,stale — by WoutDeRijck (closed: 2026-02-07 10:16 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
## Summary
Add support for continuing constrained (guided / JSON-schema) generation from a partial output after an abort (e.g., when stopping an “infinite list” mid-stream). This would let users truncate and resume without restarting decoding from scratch.
## Problem
When streaming structured JSON with guided decoding, the model can enter an infinite or repetitive list loop. …
-
#26421 [Feature]: Return reason why a JSON schema is unsupported by xgrammar — feature request,stale — by WardLT (closed: 2026-02-07 10:16 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
The error message for why a JSON schema is unsupported for structured output is vague. It only says the schema has an unsupported feature but not why, which is a particular problem given the rapid pace of xgrammar development: the reason a schema fails varies considerably by version.
### Alternatives
Haven’t figured…
-
#26432 [Doc]: vLLM setup with Triton/ROCm seems to have stale content. — documentation,rocm,stale — by vishnoianil (closed: 2026-02-07 10:16 (UTC+8)) [💬5] ### 📚 The doc issue
Noticed two issues on the GPU installation page for vLLM https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=cuda#build-wheel-from-source
- AMD ROCm installation instruction says “Install ROCm’s Triton (the default triton-mlir branch) following the instructions from ROCm/triton”, but the instruction is using triton-lang/triton ``` git clone https://github.com/triton-lang/triton.git cd tri…
-
#33598 [CI Failure]: mi325_4: Qwen3-30B-A3B-FP8-block Accuracy (H100) — ci-failure — by AndreasKaratzas (closed: 2026-02-07 09:19 (UTC+8)) [💬3] ### Name of failing test
bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29461 [CI Failure]: mi325_1: Language Models Test (PPL) — ci-failure — by AndreasKaratzas (closed: 2026-02-07 09:18 (UTC+8)) [💬5] ### Name of failing test
pytest -v -s models/language/generation_ppl_test
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) — ci-failure — by AndreasKaratzas (closed: 2026-02-07 09:18 (UTC+8)) [💬17] ### Name of failing test
pytest -v -s models/language/pooling -m 'not core_model'
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29459 [CI Failure]: mi325_1: Language Models Test (Extended Generation) — ci-failure — by AndreasKaratzas (closed: 2026-02-07 09:17 (UTC+8)) [💬7] ### Name of failing test
pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29462 [CI Failure]: mi325_8: Language Models Tests (Hybrid) %N — ci-failure — by AndreasKaratzas (closed: 2026-02-07 09:17 (UTC+8)) [💬7] ### Name of failing test
uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5' && uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2' && pytest -v -s models/language/generation -m hybrid_model --num-shards=$BUILDKITE_PARALLEL_JOB_COUNT --shard-id=$BUILDKITE_PARALLEL_JOB
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#33938 [Bug][ROCm] Platform detection initializes CUDA prematurely, breaking Ray multi-GPU allocation — bug,rocm — by kouroshHakha (closed: 2026-02-07 06:17 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (truncated)
-
#34004 [Bug]: docker ROCm build fails — bug,rocm — by markg85 (closed: 2026-02-07 03:52 (UTC+8)) [💬4] ### Your current environment
Irrelevant because it’s inside docker, not on my host. I’m building the docker container.
### 🐛 Describe the bug
I followed this description: https://docs.vllm.ai/en/v0.6.5/getting_started/amd-installation.html I went for option 1 and I meet the criteria exactly. I have an AMD 7900XT, so I went with this docker build line:
…
-
#33541 [Feature]: Automatic full-core utilization for TP=1 on arm multi-NUMA CPU systems — feature request,cpu — by phalani-paladugu (closed: 2026-02-06 20:26 (UTC+8)) [💬7] ### 🚀 The feature, motivation and pitch
On arm servers with two NUMA nodes, when tensor parallelism (TP) is set to 1, the runtime currently utilizes cores from only a single NUMA node unless VLLM_CPU_OMP_THREADS_BIND is explicitly configured. For example, on a two NUMA system with 64 cores, only cores 0–31 are active by default.
With TP=2, cores are correctly partitioned across NUMA nodes without requiring any manual configuration.
Requested enhancement: When TP=1, the runtime should automati…
-
#31467 [RFC]: A Triton operator dispatch mechanism through modified `CustomOp` — RFC — by MengqingCao (closed: 2026-02-06 20:07 (UTC+8)) [💬14] ### Motivation.
Triton is becoming increasingly important in vLLM, and we’ve noticed its use in many models, quantization processes, and general workflows. Meanwhile, vLLM supports various backends. Typically, to achieve high performance, different implementations of the Triton kernels are used on different hardware, such as Ascend NPU. However, we’ve observed that vLLM currently lacks an effective operator dispatch mechanism for Triton to ensure that various backends can implement their ow…
-
#33478 [RFC]: Introduce ATOM as model implementation backend for AMD GPU — rocm,RFC — by zejunchen-zejun (closed: 2026-02-06 16:13 (UTC+8)) [💬8] ### Motivation
ATOM is a foundational component of AMD’s AI inference strategy. It is implemented by ROCm developers and serves as the model implementation backend for high-performance inference on AMD GPUs. It is built by integrating optimizations from ROCm’s high-performance operator library aiter and high-performance communication library mori into the model execution path.
Th…
-
#33885 [Bug]: Scores returned by a vLLM-served Qwen3VLReranker endpoint are very low (only around 0.5) when reranking any image — bug — by RC-Qiao (closed: 2026-02-06 14:42 (UTC+8)) [💬1] ### Your current environment
docker run --gpus '"device=7"' --entrypoint "" -v /dataset/models/Qwen/Qwen3-VL-Reranker-8B:/model -p 9091:8000 --shm-size=8g vllm/vllm-openai:v0.15.1-cu130 vllm serve /model --runner pooling --max-model-len 16384 --gpu-memory-utilization 0.5 --dtype bfloat16 --hf-overrides '{"architectures": ["Qwen3VLForSequenceClassification"], "classifier_from_token": ["no", "yes"], "is_original_qwen3_reranker": true}' --chat-template /model/qwen…
-
#31229 [RFC]: Early-Fail Tokenization Guard for Completions or Chat Completions — RFC — by scratch-ml (closed: 2026-02-06 11:38 (UTC+8)) [💬4] ## Motivation.
### Problem Extremely long chat inputs (e.g., hundreds of millions of tokens) are fully tokenized before length checks, causing CPU OOM (e.g., ~300GB per vllm instance) and hanging the single-threaded async tokenizer, blocking all other requests.
More concretely, in RL scenarios, many request inputs come from agent–environment interactions and can sometimes become huge. This should ideally be mitigated by client-side validation before sending the request, but in practice calle…
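A minimal sketch of the kind of guard the RFC proposes (the function name and the characters-per-token bound are assumptions for illustration, not vLLM's actual code): bound the possible token count by a cheap character-count check before tokenizing at all.

```python
def early_length_guard(text: str, max_model_len: int,
                       max_chars_per_token: int = 16) -> None:
    """Reject inputs that cannot possibly fit, before tokenizing them.

    Assumes a single token rarely spans more than `max_chars_per_token`
    characters, so len(text) / max_chars_per_token lower-bounds the token
    count: anything longer than max_model_len * max_chars_per_token chars
    cannot fit in max_model_len tokens.
    """
    if len(text) > max_model_len * max_chars_per_token:
        raise ValueError(
            f"input of {len(text)} chars cannot fit in {max_model_len} tokens")

early_length_guard("hello world", max_model_len=8192)  # passes cheaply
try:
    early_length_guard("x" * 200_000, max_model_len=8192)
except ValueError as exc:
    print("rejected before tokenization:", exc)
```

The O(1) length check runs before the O(n) tokenizer, so a pathological request fails fast instead of occupying the single-threaded async tokenizer.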
[New PRs]
-
#33992 [Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs — bug,documentation,ci/build,nvidia — by ehfd (created: 2026-02-06 20:35 (UTC+8)) [💬3 | +82/-5, 4 files | commented:6] ## Purpose
Fix #32373 Fix #33369
Fixes the core problem in https://github.com/vllm-project/vllm/issues/32373 and https://github.com/vllm-project/vllm/issues/33369, introduced from https://github.com/vllm-project/vllm/pull/30784 and https://github.com/vllm-project/vllm/pull/33116.
The issue with the CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs: (1) Consumer NVIDIA GPUs must NOT have the CUDA compatibility libraries inside `LD_LIBRARY_PATH`. (2) Professional and …
-
#34027 [bug-fix] supported_tasks is breaking backward compatibility at init_app_state — bug,frontend,ready — by kouroshHakha (created: 2026-02-07 07:26 (UTC+8)) [+3/-1, 1 files | commented:3] init_app_state now takes supported_tasks as a mandatory input, which breaks applications that integrate with vllm at this layer.
engine_client is the source of truth for supported_tasks and is already a parameter to the init_app_state helper. This PR makes the supported_tasks param optional and, when it is None, retrieves it from engine_client.
-
#34017 [ModelRunner V2] Revert token rank comparison difference for now — ready,v1 — by njhill (created: 2026-02-07 04:22 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] We think it makes more sense to use a strict inequality for the token rank computation, but it would be best to propose this in a separate PR since it involves a behavior change and is orthogonal to MRV2.
Making this consistent with V1 will help with output equivalence tests. We can subsequently change it in both places if needed (or make configurable if that’s really a problem).
-
#34033 [BugFix] Fix mm_encoder_only init for qwen3 vl moe model — bug,qwen — by shepark (created: 2026-02-07 11:04 (UTC+8)) [💬2 | +27/-1, 1 files | commented:1] Fixes mm_encoder_only for qwen3-vl-moe by adding get_language_model_spec() and guarding MoE init when the language model isn’t instantiated. This prevents an AttributeError during encoder-only loading, avoids an unnecessary full language-model init, and saves memory.
## Test Plan python test.py ``` from vllm import LLM MODEL="Qwen/Qwen3-VL-30B-A3B-Instruct" if __name__ == '__main__': llm = LLM( …
-
#33967 [Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 — bug,ready — by ikchifo (created: 2026-02-06 13:33 (UTC+8)) [💬6 | +154/-6, 5 files | commented:3 approved:1] ## Purpose
Fixes #33295
On B200 with FP8-quantized models (e.g., Qwen/Qwen3-30B-A3B-FP8), PyTorch Inductor’s CSE fails to merge identical `split_with_sizes` calls. The QK Norm+RoPE pattern matcher expects one split node with 3 `getitem` users (q/k/v), but the graph contains 3 separate split nodes with 1 user each, so fusion fails silently. This PR adds `coalesce_equivalent_splits()` to merge duplicate splits before matching. It groups splits by `(args, kwargs)` signature and merges duplicat…
-
#34015 [Misc] Add backward-compatible import aliases for renamed translations module — frontend,ready — by kouroshHakha (created: 2026-02-07 03:32 (UTC+8)) [+69/-0, 5 files | commented:1 approved:1] ## Summary
- PR #33904 renamed `vllm.entrypoints.openai.translations` to `vllm.entrypoints.openai.speech_to_text`, which breaks any downstream code importing from the old path.
- This PR restores the old `vllm.entrypoints.openai.translations` package as thin backward-compatible stubs that re-export everything from the new `speech_to_text` location while emitting a `DeprecationWarning` on import.
- The deprecation warnings instruct users to update their imports and are scheduled for removal in v…
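The stub-module pattern this PR uses can be demonstrated self-containedly (toy module names below, not the PR's actual code): the old import path re-exports the new module's contents and warns on access.

```python
import sys
import types
import warnings

# Toy "new" module standing in for the relocated package.
new_mod = types.ModuleType("speech_to_text_demo")
new_mod.transcribe = lambda audio: f"text for {audio}"

def make_alias(old_name: str, target: types.ModuleType) -> types.ModuleType:
    """Build a stub module that forwards attribute access to `target`."""
    alias = types.ModuleType(old_name)
    def __getattr__(name):  # PEP 562 module-level __getattr__
        warnings.warn(f"{old_name} is deprecated; use {target.__name__}",
                      DeprecationWarning, stacklevel=2)
        return getattr(target, name)
    alias.__getattr__ = __getattr__
    return alias

# Register the old path so `import translations_demo` resolves to the stub.
sys.modules["translations_demo"] = make_alias("translations_demo", new_mod)

import translations_demo
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(translations_demo.transcribe("a.wav"))  # text for a.wav
print(caught[0].category is DeprecationWarning)   # True
```

Warning on attribute access (rather than once at import) keeps the stub cheap and points each offending call site at the new location.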
-
#34032 [ROCm] update triton branch to support gpt-oss models for gfx11xx devices — rocm,ci/build,gpt-oss — by hongxiayang (created: 2026-02-07 10:00 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1] Fix https://github.com/vllm-project/vllm/issues/33906 Fix https://github.com/vllm-project/vllm/issues/33143
To support serving gpt-oss models (20b, for example) on gfx11xx devices, we need to update the triton branch in Dockerfile.rocm_base. The needed support is in upstream triton, but not in the branch used by the vllm base docker. The patch to rocm/triton was done in the associated PR https://github.com/ROCm/triton/pull/923.
…
-
#34007 [CI][AMD][Bugfix] Check that model_config is not None in enable_norm_pad_fusion — bug,rocm,ready — by rasmith (created: 2026-02-07 01:15 (UTC+8)) [+1/-0, 1 files | commented:1 approved:3] ## Purpose
If `cfg.model_config` is `None`, then `vllm.enable_norm_pad_fusion` will crash. This is causing the following failures in CI:
``` FAILED kernels/moe/test_routing_simulator.py::test_routing_strategy_integration - AttributeError: 'NoneType' object has no attribute 'get_hidden_size' FAILED kernels/moe/test_shared_fused_moe_routed_transform.py::test_routed_input_transform_inside_vs_outside[dtype0-256-128-1] - AttributeError: 'NoneType' object has no attrib…
-
#34011 [Bugfix] Fix Whisper tokenization — bug,ready — by NickLucche (created: 2026-02-07 02:32 (UTC+8)) [+8/-0, 1 files | commented:1 approved:1] Fix https://github.com/vllm-project/vllm/issues/34002.
Since HF forwards **kwargs to both tokenizer and feature extractor, WhisperFeatureExtractor truncates audio to N raw samples (~0.03s), producing empty output. Strip these kwargs when audio data is present.
cc @DarkLight1337
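The fix described above can be sketched like this (the helper name and kwarg list are assumptions for illustration, not the PR's code):

```python
# Hypothetical helper: when audio data is present, drop kwargs that are meant
# for the tokenizer, so the feature extractor never sees e.g. truncation
# limits that would clip the raw audio samples to near-zero length.
TOKENIZER_ONLY_KWARGS = ("truncation", "max_length")

def kwargs_for_feature_extractor(kwargs: dict, audio_present: bool) -> dict:
    if not audio_present:
        return kwargs
    return {k: v for k, v in kwargs.items() if k not in TOKENIZER_ONLY_KWARGS}

print(kwargs_for_feature_extractor(
    {"truncation": True, "max_length": 448, "sampling_rate": 16000},
    audio_present=True))  # {'sampling_rate': 16000}
```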
-
#33983 [CPU][PPC64] Fix bf16 path in mla_decode.cpp — cpu — by Akashcodes732 (created: 2026-02-06 17:32 (UTC+8)) [💬2 | +9/-1, 1 files | commented:1] ## Purpose
Add BF16 kernel types for ppc64 in `mla_decode.cpp`. A similar issue was seen in #33788 and #30329.
-
#34031 [CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 — ready,ci/build — by ProExpertProg (created: 2026-02-07 09:56 (UTC+8)) [+8/-10, 1 files | commented:1]
## Purpose The final version of the E2E fusion overhaul in #33293 ended up slashing some tests that should have remained. This PR fixes that by improving the filtering logic. Only for B200 tests ## Test Plan CI
## Test Result TBD
-
#34026 add --insecure arg to the vllm bench to skip TLS — performance — by fanyang-real (created: 2026-02-07 07:14 (UTC+8)) [💬1 | +138/-4, 2 files | commented:1] ## Purpose
Add an `--insecure` flag to `vllm bench serve` to skip TLS certificate verification when connecting to servers with self-signed certificates.
This is useful when benchmarking vLLM servers that use self-signed SSL certificates, where the default strict certificate verification would fail.
## Test Plan
```bash pytest tests/benchmarks/test_serve_cli.py -v -m benchmark …
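What such an insecure switch boils down to can be sketched with the standard library (an illustration of the TLS setting itself, not the PR's implementation):

```python
import ssl

# Build a client SSL context that skips certificate verification, as an
# --insecure style flag would request for servers with self-signed certs.
ctx = ssl.create_default_context()
ctx.check_hostname = False          # must be disabled before verify_mode
ctx.verify_mode = ssl.CERT_NONE     # accept any certificate
print(ctx.verify_mode == ssl.CERT_NONE)  # True
```

Note the ordering: `check_hostname` must be disabled first, since Python refuses to set `CERT_NONE` while hostname checking is still on.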
-
#34030 [Bugfix] Add reasoning_content backward compat to DeltaMessage for streaming — bug,frontend — by cradonn (created: 2026-02-07 08:58 (UTC+8)) [💬1 | +89/-0, 2 files | commented:1]
## Summary
Adds a `reasoning_content` field to the `DeltaMessage` class to maintain backward compatibility for streaming responses, completing the work started in commit bf001da.
## Context
- RFC #27755 renamed `reasoning_content` to `reasoning` to align with the OpenAI API
- Commit bf001da added backward compat for input message parsing in `chat_utils.py` …
-
#34029 [Perf] Optimize async scheduling redundant copy, 0.9% E2E throughput improvement — ready,v1 — by yewentao256 (created: 2026-02-07 08:35 (UTC+8)) [+22/-12, 2 files | commented:2] ## Purpose
Originally, we did a `req.all_token_ids.copy()` unconditionally; this PR optimizes the logic:
- sync case: no copy (not needed since we already sync)
- async scheduling: only copy the `output_token_ids`
## Test
### Unit test …
-
#34025 [Kernel] [Helion] [5/N] Add Helion Autotuning infrastructure — no labels — by gmagogsfm (created: 2026-02-07 07:08 (UTC+8)) [💬1 | +1649/-14, 6 files | commented:3] NOTE: this is a manually stacked PR; each commit is reviewed separately. For this PR, please only review the top commit: Add Helion autotuning infra
This PR adds autotuning support to vLLM-Helion. Each Helion kernel can register an input generator that returns a dictionary of `(helion_config_key, representative_input)` pairs. The autotuning infra runs autotuning against these inputs and generates one config per `helion_config_key`.
-
#34024 [Core] Add Helix (Context + Tensor) Parallelism — documentation,v1,llama,nvidia — by sungsooha (created: 2026-02-07 06:44 (UTC+8)) [💬6 | +1485/-86, 18 files | commented:1]
## Purpose
Implements Helix (Context + Tensor) Parallelism based on NVIDIA’s Helix paper (arXiv:2507.07120).
RFC: #34018 Helix enables efficient long-context decoding by:
- Eliminating KV cache duplication when `TP > num_kv_heads`
- Removing the `TP <= num_kv_heads` constraint for GQA models with context parallelism …
-
#34022 [Misc][Spec Decode] support different load config for draft model — speculative-decoding,v1 — by ZhengkaiZ (created: 2026-02-07 06:00 (UTC+8)) [+8/-1, 3 files | commented:1 | 📝 draft] [Misc][Spec Decode] support different load config for draft model
Summary:
Sometimes, to achieve better draft-model performance or to use a different checkpoint format, we customize these parts; adding the freedom to specify a different load config lets our privately trained draft model work.
Test Plan: Command:
``` …
-
#34006 [Kernel] Add KernelConfig flag to enable/disable FlashInfer autotune — ready — by mmangkad (created: 2026-02-07 01:03 (UTC+8)) [💬1 | +104/-1, 5 files | approved:2 commented:8] ## Purpose
Support enabling/disabling FlashInfer autotune.
cc @ProExpertProg
## Test Plan
## Test Result
…
-
#34023 [Bugfix] Fix RAW hazard and optimize stores in EP Scatter Kernel — bug — by Manikvsin (created: 2026-02-07 06:17 (UTC+8)) [💬1 | +11/-4, 1 files | commented:1] Title: [Bugfix] Eliminate RAW hazard and redundant stores in EP Scatter Kernel
Description: This PR fixes a Read-After-Write (RAW) hazard and optimizes global memory access in the EP Scatter Kernel.
## Purpose Previously, the kernel followed a “Store-then-Load” pattern:
Every thread/block computed the cumsum (prefix sum) of tokens per expert.
…
-
#34021 [Bugfix] Fix Worker.load_model context-manager composition for sleep mode — bug,v1 — by tianshu-Michael-yu (created: 2026-02-07 05:40 (UTC+8)) [💬1 | +4/-3, 1 files | commented:1 approved:1] ## What this PR does
Fixes a context-manager composition bug in
vllm/v1/worker/gpu_worker.py:- Before:
with A and B: - After:
with A, B:
In Python,
with A and B:evaluates to a single context manager (BwhenAis truthy), soAis never entered. Here that means_maybe_get_memory_pool_context(tag="weights")is skipped, so weight allocations are not tracked by cuMem sleep mode.## Why this matters …
- Before:
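The pitfall is easy to reproduce; a minimal sketch (the `track` helper is illustrative, not vLLM code):

```python
from contextlib import contextmanager

entered = []

@contextmanager
def track(name):
    # This body runs only when the with statement calls __enter__.
    entered.append(name)
    yield

# Buggy form: `A and B` is one expression that evaluates to B
# (since A is truthy), so A.__enter__ is never called.
with track("A") and track("B"):
    pass
assert entered == ["B"]

# Fixed form: two context managers, both entered in order.
entered.clear()
with track("A"), track("B"):
    pass
assert entered == ["A", "B"]
```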
-
#34003 [torch.compile] Stop compiling identical artifacts — ready,ready-run-all-tests — by zou3519 (创建于: 2026-02-07 00:44 (UTC+8)) [💬3 | +64/-11, 2 files | commented:1 approved:1] ## Purpose
This is a resubmission of https://github.com/vllm-project/vllm/pull/33641 that is actually correct.
Previously, the vLLM-compile compilation:
- does a full dynamo trace to get a full graph
- splits the graph on attention ops into N subgraphs
- compiles each of the N subgraphs via standalone_compile.
- If any of the subgraphs are the same, we rely on aotautograd and inductor to cache-hit, but we still end up producing N artifacts total. Usually there are 3 unique graphs in vLLM model…
-
#34020 [wip] layerwise loading for fp8.py, take 2 — 无标签 — by vkuzo (创建于: 2026-02-07 05:23 (UTC+8)) [+221/-63, 4 files | commented:2 | 📝草稿] Summary:
TODO write me
Test Plan:
TODO write me
…
-
#34019 [Quantization][Refactor] Clean up GPTQ + AWQ quantization — 无标签 — by mu-hashmi (创建于: 2026-02-07 04:51 (UTC+8)) [+40/-36, 3 files | commented:1 | 📝草稿] ## Purpose Addresses #31689. Building on `gptq_marlin.py` per @robertgshaw2-redhat’s guidance to consolidate GPTQ/AWQ quantization and remove the legacy code paths. So far: widened `GPTQMarlinConfig` to handle all GPTQ bit-widths and symmetry configs through the `MPLinearKernel` abstraction, added GPTQv2 checkpoint format support, and enabled 2/3-bit types in ExllamaLinearKernel. Still TODO: validate all kernel backends end-to-end, remove `gptq.py`, convert `moe_wna16.py` into a modular kernel…
- #33988 [Bugfix][Frontend] Fix IndexError in Mistral tool parser during streaming tool calls — bug,frontend — by veeceey (创建于: 2026-02-06 19:13 (UTC+8)) [💬3 | +171/-11, 3 files | commented:4]
## Summary
- Fixes `IndexError: list index out of range` in `chat_completion_stream_generator` when using `--tool-call-parser=mistral` with streaming tool calls
- The root cause was that the Mistral tool parser never populated `streamed_args_for_tool`, but `serving.py` unconditionally accessed `tool_parser.streamed_args_for_tool[index]` at stream completion
- Adds a defensive bounds check in `serving.py` before indexing `streamed_args_for_tool` (protects all tool parsers)
- Updates both v11+ and…
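The defensive check amounts to bounds-guarding the list access; an illustrative helper (not the actual `serving.py` code):

```python
def get_streamed_args(streamed_args_for_tool: list[str], index: int) -> str:
    """Return streamed arguments for a tool call, or "" when the
    parser never populated that slot (avoids IndexError)."""
    if 0 <= index < len(streamed_args_for_tool):
        return streamed_args_for_tool[index]
    return ""

assert get_streamed_args(['{"a": 1}'], 0) == '{"a": 1}'
assert get_streamed_args([], 0) == ""  # previously raised IndexError
```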
-
#34010 [Fix] Fix `logprobs=0` handling for `/inference/v1/generate` endpoint — frontend,ready — by SumanthRH (创建于: 2026-02-07 02:24 (UTC+8)) [+29/-2, 2 files | commented:1 approved:1] ## Purpose
The `/inference/v1/generate` endpoint differs from the `/v1/completions` endpoint and the Engine APIs in `logprobs=0` handling: passing `logprobs=0` returns no logprobs, while it is supposed to return chosen-token logprobs (same as `logprobs=1`). This PR makes the behaviour consistent with the completions endpoint.
## Test Plan
Added a new test for logprobs handling in the generate endpoint with logprobs values 0, 1, and 5
…
-
#34014 [Model] Introduce first-class DFlash speculative decoding in vLLM V1 — documentation,new-model,speculative-decoding,v1,qwen — by dangoldbj (创建于: 2026-02-07 03:21 (UTC+8)) [💬2 | +2163/-22, 18 files | commented:1]
## Purpose This PR introduces first-class DFlash speculative decoding support in vLLM V1.
This is the initial DFlash implementation in vLLM. Prior to this change, vLLM did not support DFlash speculative decoding. This PR establishes an explicit, config-driven DFlash execution path with correctness and maintainability guarantees.
Key changes:
- Adds explicit `method="dflash"` support and full DFlash model/config plumbing.
- Establishes a dedicated `DFlash…
-
#34009 [Bugfix] Fix DP Attention Padding in Dummy Run — bug,v1 — by benchislett (创建于: 2026-02-07 02:16 (UTC+8)) [+3/-0, 1 files | commented:3] ## Purpose
FIX #32626 FIX #33450
Problem: TRTLLM attention requires that num_decode_tokens be divisible by num_requests. However, during DP we sometimes do a dummy run on one of the workers so they don’t get out of sync: in such cases, we pad for attention but the attention metadata builder receives the padded number of requests and the unpadded number of tokens.
This leads to a crash where we see `num_decode_tokens == 1` but `num_requests == 2`. I tracked this down to `_build_attention_metadata` on… -
#33997 [Hybrid] Enable mamba prefix cache “align” mode with async scheduling — v1 — by tdoublep (创建于: 2026-02-06 22:44 (UTC+8)) [+79/-41, 5 files | commented:3] ## Purpose
On main, mamba prefix caching “align” mode does not currently work with async scheduling when speculative decoding is enabled. The core reason is that the scheduler does not have an accurate measurement of `num_computed_tokens`, so we don’t actually know the exact number of computed tokens in the `MambaManager` when we are allocating and freeing blocks. Specifically, the value of `num_computed_tokens` in the scheduler is an over-estimate of the real number, since it can… -
#34013 Threshold fix wvSplitk for occasional CI fails — rocm — by amd-hhashemi (创建于: 2026-02-07 03:12 (UTC+8)) [+2/-0, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#33993 [Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) — bug,ready,deepseek — by xuebwang-amd (创建于: 2026-02-06 21:33 (UTC+8)) [💬4 | +6/-1, 2 files | commented:1 approved:2] ## Purpose
Model: deepseek-ai/DeepSeek-V3.1 This PR aims to fix an error: ``` Traceback (most recent call last): File "/workspace/xuebwang/vllm/vllm/v1/executor/multiproc_executor.py", line 754, in worker_main worker = WorkerProc(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/workspace/xuebwang/vllm/vllm/v1/executor/multiproc_executor.py", line 580, in init …
-
#33971 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,ready,nvidia — by vllmellm (创建于: 2026-02-06 14:17 (UTC+8)) [💬3 | +237/-235, 4 files | commented:1 approved:1]
## Purpose Update the Deployment > Docker docs with Docker information for the ROCm/CUDA/Intel platforms. Docker deployment is linked as snippets from the installations/docker section.
## Test Plan
## Test Result
…
- #33964 [Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 — bug,ready — by chaunceyjiang (创建于: 2026-02-06 12:39 (UTC+8)) [💬1 | +12/-0, 1 files | commented:1 approved:1]
## Purpose
<|DSML|function_calls> <|DSML|invoke name="get_weather"> <|DSML|parameter name="location" string="true">杭州</|DSML|parameter> <|DSML|parameter name="date" string="true">2024-01-16</|DSML|parameter> </|DSML|invoke>…
-
#33963 [Bugfix] send None sentinel on final commit so server properly sends transcription.done — bug,frontend — by pjs102793 (创建于: 2026-02-06 11:40 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1] ## Purpose
Fixes https://github.com/vllm-project/vllm/issues/33962
Fix server not sending `transcription.done` when client sends `input_audio_buffer.commit` with `final: true` during real-time audio streaming. Transcription works correctly — `transcription.delta` events are sent as expected. However, the `final` commit handler sets `_is_input_finished = True` but does not send the `None` sentinel to `audio_queue`, leaving `audio_stream_generator` blocked on `queue.get()` forever. This deadlock…
- #33994 Bump `lm-eval` version for Transformers v5 compatibility — documentation,rocm,ready,ci/build — by hmellor (创建于: 2026-02-06 21:53 (UTC+8)) [💬3 | +19/-31, 14 files | approved:1 commented:1] Update lm-eval pin to 0.4.10 so that EleutherAI/lm-evaluation-harness@384118d is included (the previous version imported a class that no longer exists in Transformers v5)
#33998 [Revert] Add util `handle_deprecated` back — ready — by yewentao256 (创建于: 2026-02-06 23:32 (UTC+8)) [+25/-0, 1 files | commented:1 approved:2] ## Purpose
Fixes https://github.com/vllm-project/vllm/pull/33718#discussion_r2773564375
CC: @hmellor
- #33973 [XPU][5/N] add wna16 xpu kernel — ready,ci/build — by zufangzhu (创建于: 2026-02-06 15:27 (UTC+8)) [💬2 | +58/-66, 2 files | commented:1 approved:1] This PR is [5/N] of https://github.com/vllm-project/vllm/issues/33214 add compressed tensor wna16 support for xpu
-
#33999 fix description in plugin_system.md — documentation — by guodongxiaren (创建于: 2026-02-06 23:32 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1]
## Purpose Because vllm_add_dummy_model is `packages` not `py_modules` in this example. ## Test Plan ## Test Result
... - #33978 Fix spelling errors — ready,v1 — by sleepcoo (创建于: 2026-02-06 16:40 (UTC+8)) [+10/-10, 5 files | commented:1 approved:1]
## Summary
- Fix `Intialize` → `Initialize` in test files
- Fix `is_availble` → `is_available` in flashmla.py
- Fix `AVAILBLE_BACKENDS` → `AVAILABLE_BACKENDS` in fp8.py
- Fix `NO_REASONING_QUICK_THROUGHT` → `NO_REASONING_QUICK_THOUGHT` in test file
## Test plan
- Changes are typo fixes only (comments and variable names)
- No functional code changes …
-
#33989 Update `WeightTransferConfig` to be more standard like the others — ready — by hmellor (创建于: 2026-02-06 19:17 (UTC+8)) [+4/-6, 2 files | commented:1 approved:1]
- `@config` is a dataclass transform now
- The default shouldn’t be duplicated in `EngineArgs`
-
#33990 Pass modality information in embed_multimodal — speculative-decoding,needs-rebase,v1,qwen — by reaganjlee (创建于: 2026-02-06 19:31 (UTC+8)) [💬1 | +43/-14, 4 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#33979 [Frontend] Add --disable-log-prefix flag and VLLM_DISABLE_LOG_PREFIX env var — frontend,v1 — by veeceey (创建于: 2026-02-06 16:53 (UTC+8)) [💬3 | +133/-9, 9 files | commented:5] ## Summary Add an option to disable the process/thread log prefix (e.g.
`(APIServer pid=12345)`) that vLLM adds to stdout/stderr via `decorate_logs()`. Motivation: Users with custom logging configs or log aggregation systems cannot control this prefix because it’s injected before Python’s logging system. The `VLLM_LOGGING_CONFIG_PATH` config has no effect on it. ## Changes
- `vllm/envs.py`: Add `VLLM_DISABLE_LOG_PREFIX` environment variable
- `vllm/entrypoints/openai/cli_args.py`: Add…
-
#33977 [Bugfix] Fix models and tests for transformers v5 — bug,documentation,ready,multi-modality,qwen — by zucchini-nlp (创建于: 2026-02-06 16:29 (UTC+8)) [💬2 | +70/-55, 12 files | commented:8 approved:1] As per title, related to https://github.com/vllm-project/vllm/pull/30566. All changes are v4-compatible as well.
@hmellor
-
#33985 [Kernel] FlashInfer: switch allreduce fusion to unified API — performance — by mmangkad (创建于: 2026-02-06 18:06 (UTC+8)) [+80/-114, 3 files | commented:1]
## Purpose
- Migrate vLLM’s FlashInfer allreduce fusion to the unified `flashinfer.comm.allreduce_fusion` API and workspace creation.
- Update the fused collective benchmark and fusion test to the new API and workspace lifecycle.
Dependency note: Requires `flashinfer-python >= 0.6.3` (unified allreduce API). ## Test Plan …
- #33984 Bump HF Hub client to get bug fix — bug,rocm,ready,ci/build — by hmellor (创建于: 2026-02-06 17:35 (UTC+8)) [+2/-2, 2 files | approved:1 commented:1] Bug fix in question is https://github.com/huggingface/huggingface_hub/pull/3778
-
#33976 [PaddleOCR-VL] Add BC for transformers 5.0 config — ready — by zhang-prog (创建于: 2026-02-06 16:22 (UTC+8)) [+7/-0, 1 files commented:1 approved:2] -
#33982 Consolidate and fix forbidden import `pre-commit` checks — ready — by hmellor (创建于: 2026-02-06 17:21 (UTC+8)) [+145/-300, 5 files | commented:2 approved:1] Previously the regex and triton checks didn’t actually work in CI because they did not use the file list passed by `pre-commit` and instead relied on the git diff. This means that the check would miss violations in CI because the git diff would be empty.
This PR consolidates the three forbidden import checks into a single check which does consume `pre-commit`’s file list. -
#33981 fix: reject non-text content in system/developer messages — frontend — by veeceey (创建于: 2026-02-06 17:21 (UTC+8)) [💬2 | +211/-0, 2 files | commented:4] ## Summary
Fixes #33925
Per the OpenAI API specification, `system` and `developer` role messages should only accept the `text` content type. Previously, vLLM allowed multimodal content (e.g. `image_url`, `input_audio`, `video_url`) in system messages without any validation, which diverges from the OpenAI API behavior. ### Changes
`vllm/entrypoints/chat_utils.py`: Added a `_validate_text_only_c…
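The validation itself is a small type check over message parts; a sketch assuming OpenAI-style content lists (the helper name is illustrative, not the actual vLLM function):

```python
def validate_text_only(role: str, content) -> None:
    """Reject non-text parts in system/developer messages."""
    if role not in ("system", "developer") or isinstance(content, str):
        return  # other roles, or plain-string content, are fine
    for part in content:
        if part.get("type") != "text":
            raise ValueError(f"{role} messages only accept text content")

validate_text_only("system", "you are helpful")       # ok
validate_text_only("user", [{"type": "image_url"}])   # ok: not system/developer
try:
    validate_text_only("system", [{"type": "image_url"}])
except ValueError:
    pass
else:
    raise AssertionError("should have rejected image_url")
```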
-
#33969 [Frontend] Add --disable-uvicorn-metrics-access-log shorthand flag — documentation,frontend — by veeceey (创建于: 2026-02-06 13:56 (UTC+8)) [💬3 | +146/-11, 4 files | commented:1] ## Summary
Adds a convenience boolean CLI flag `--disable-uvicorn-metrics-access-log` that suppresses uvicorn access logs for the `/health` and `/metrics` endpoints. This is the flag originally requested in #29023.
- New CLI argument: `--disable-uvicorn-metrics-access-log` (boolean, default `False`)
- Reuses existing infrastructure: builds on the `UvicornAccessLogFilter` and `--disable-access-log-for-endpoints` machinery from #30011, no new filter classes or duplicate logic
- **Composabl…
-
#33972 Scale input before applying Marlin operator — 无标签 — by ir1ka (创建于: 2026-02-06 15:21 (UTC+8)) [+59/-1, 4 files | commented:3 | 📝草稿] When `dtype=float16`, due to the smaller dynamic range of float16, data overflow can easily occur during GEMM, leading to inference errors. Therefore, we dynamically scale the input before GEMM to prevent data overflow. Fix: https://github.com/vllm-project/vllm/issues/33560 https://github.com/vllm-project/vllm/issues/33461
## Purpose
The nvfp4 model supports the use of `dtype=float16`. …
- #33975 Fix `main` pre-commit — 无标签 — by hmellor (创建于: 2026-02-06 16:07 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] This one slipped through in https://github.com/vllm-project/vllm/pull/33720
-
#33968 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,nvidia — by vllmellm (创建于: 2026-02-06 13:48 (UTC+8)) [💬3 | +231/-192, 5 files]
## Purpose Update the Deployment > Docker with docker information for ROCm/CUDA/Intel platform. Linked docker deployment as a snippets from installations/docker section.
## Test Plan
## Test Result
…
-
#33965 [Bugfix] Fix Qwen3-Coder tool call streaming for duplicate names and param parsing — bug,qwen — by alexbi29 (创建于: 2026-02-06 12:51 (UTC+8)) [💬2 | +260/-264, 1 files | commented:1] ## Summary
Fixes multiple issues in the Qwen3-Coder-Next tool parser’s streaming (with opencode) mode that caused tool call arguments to be lost or corrupted:
- Duplicate tool names broken: When the model made multiple calls to the same tool (e.g. two `read` calls), the name-based lookup in `prev_tool_call_arr` prevented the second call from being added. Clients received `undefined` for the second call’s arguments. Fixed by replacing all name-based lookups with index-based ones.
- **Body p…
[已合并 PR]
-
#34017 [ModelRunner V2] Revert token rank comparison difference for now — ready,v1 — by njhill (合并于: 2026-02-07 11:11 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] We think it makes more sense to use a strict inequality for the token rank computation, but would be best to propose this in a separate PR since it involves a behavior change and is orthogonal to MRV2.
Making this consistent with V1 will help with output equivalence tests. We can subsequently change it in both places if needed (or make configurable if that’s really a problem).
-
#33967 [Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 — bug,ready — by ikchifo (合并于: 2026-02-07 10:27 (UTC+8)) [💬6 | +154/-6, 5 files | commented:3 approved:1] ## Purpose
Fixes #33295
On B200 with FP8-quantized models (e.g., `Qwen/Qwen3-30B-A3B-FP8`), PyTorch Inductor’s CSE fails to merge identical `split_with_sizes` calls. The QK Norm+RoPE pattern matcher expects one split node with 3 `getitem` users (q/k/v), but the graph contains 3 separate split nodes with 1 user each, so fusion fails silently. This PR adds `coalesce_equivalent_splits()` to merge duplicate splits before matching. It groups splits by `(args, kwargs)` signature and merges duplicat… -
#34015 [Misc] Add backward-compatible import aliases for renamed translations module — frontend,ready — by kouroshHakha (合并于: 2026-02-07 11:01 (UTC+8)) [+69/-0, 5 files | commented:1 approved:1] ## Summary
- PR #33904 renamed `vllm.entrypoints.openai.translations` to `vllm.entrypoints.openai.speech_to_text`, which breaks any downstream code importing from the old path.
- This PR restores the old `vllm.entrypoints.openai.translations` package as thin backward-compatible stubs that re-export everything from the new `speech_to_text` location while emitting a `DeprecationWarning` on import.
- The deprecation warnings instruct users to update their imports and are scheduled for removal in v…
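The stub pattern can be sketched in a few lines (illustrative; the real stubs re-export the actual `speech_to_text` names):

```python
import types
import warnings

def deprecated_alias(old_name: str, new_module: types.ModuleType):
    """Warn on use of the old import path, then hand back the new module."""
    warnings.warn(
        f"{old_name} is deprecated; import from {new_module.__name__} instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_module

# Usage sketch: the stub module would call this at import time.
new_mod = types.ModuleType("speech_to_text")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mod = deprecated_alias("translations", new_mod)
assert mod is new_mod
assert caught and issubclass(caught[0].category, DeprecationWarning)
```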
-
#33821 [Bugfix] Fix _fused_moe_lora_expand signature mismatch — bug,ready — by xyang16 (合并于: 2026-02-07 10:45 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1]
## Purpose
This PR fixes the signature mismatch between `_fused_moe_lora_expand` and `_fused_moe_lora_expand_fake`. The parameter `b_intermediate_cache1` was removed in `_fused_moe_lora_expand`, so it should be removed in the fake function too.
... -
#34007 [CI][AMD][Bugfix] Check that model_config is not None in enable_norm_pad_fusion — bug,rocm,ready — by rasmith (合并于: 2026-02-07 10:43 (UTC+8)) [+1/-0, 1 files | commented:1 approved:3] ## Purpose
If `cfg.model_config` is `None`, then `vllm.enable_norm_pad_fusion` will crash. This is causing the following failures in CI: ``` FAILED kernels/moe/test_routing_simulator.py::test_routing_strategy_integration - AttributeError: 'NoneType' object has no attribute 'get_hidden_size' FAILED kernels/moe/test_shared_fused_moe_routed_transform.py::test_routed_input_transform_inside_vs_outside[dtype0-256-128-1] - AttributeError: 'NoneType' object has no attrib…
-
#34011 [Bugfix] Fix Whisper tokenization — bug,ready — by NickLucche (合并于: 2026-02-07 10:42 (UTC+8)) [+8/-0, 1 files | commented:1 approved:1] Fix https://github.com/vllm-project/vllm/issues/34002.
Since HF forwards **kwargs to both tokenizer and feature extractor, WhisperFeatureExtractor truncates audio to N raw samples (~0.03s), producing empty output. Strip these kwargs when audio data is present.
cc @DarkLight1337
- #32351 [Feat][RL] Pause and Resume with keep requests for single engine — documentation,frontend,ready,v1 — by hao-aaron (合并于: 2026-02-07 08:08 (UTC+8)) [💬10 | +537/-31, 8 files | commented:9 approved:1]
## Purpose
Completes part 1 of https://github.com/vllm-project/vllm/issues/32103
We introduce a new “mode” parameter to the pause and resume APIs, allowing for the following behaviors:
- `mode="abort"`: all inflight requests are immediately aborted and partially generated sequences are returned to callers
- `mode="wait"`: before pausing, we wait for all inflight requests to finish first
- NEW: `mode="keep"`: we stop all inflight requests asap, but do not return. From the perspective of the caller, it will…
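The three modes differ only in what happens to inflight requests; an illustrative dispatch (hypothetical function, not the vLLM API surface):

```python
def pause_effect(mode: str, inflight: list[str]) -> dict:
    """Summarize each pause mode's effect on inflight requests."""
    if mode == "abort":
        # partial outputs handed back to callers immediately
        return {"returned": inflight, "resumable": []}
    if mode == "wait":
        # requests drain normally before the engine pauses
        return {"returned": [], "resumable": []}
    if mode == "keep":
        # stop asap, retain request state so resume can continue them
        return {"returned": [], "resumable": inflight}
    raise ValueError(f"unknown mode: {mode}")

assert pause_effect("abort", ["r1"])["returned"] == ["r1"]
assert pause_effect("keep", ["r1"])["resumable"] == ["r1"]
```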
-
#33941 [bugfix] [ROCm] Fix premature CUDA initialization in platform detection — bug,rocm,ready,ci/build,nvidia — by kouroshHakha (合并于: 2026-02-07 06:17 (UTC+8)) [💬9 | +133/-6, 6 files | commented:4 approved:1] ## Summary
On ROCm, importing `vllm.platforms` triggers `torch.cuda.get_device_properties()` at module load time, which initializes CUDA before Ray workers can set `CUDA_VISIBLE_DEVICES`. This locks `device_count()` to the total number of GPUs, causing all workers to incorrectly use GPU 0 instead of their assigned GPUs. Closes https://github.com/vllm-project/vllm/issues/33938
## Problem
`# vllm/platforms/rocm.py (before fix)` …
-
#33919 Fix RoutingMethodType logic — ready,ci/build,nvidia,ready-run-all-tests — by dbari (合并于: 2026-02-07 06:03 (UTC+8)) [💬6 | +61/-20, 9 files | commented:1]
## Purpose
This PR contains two fixes for #33792:
- Fix the selection logic for `RoutingMethodType` in `fused_topk_bias_router.py` and `fused_topk_router.py`
- When `use_grouped_topk=True` but the `GroupedTopKRouter` does not find any valid routing method and there is only one group, fall back to the non-grouped routers
The latter point covers Mistral Large 3, which has `n_group=1` and `topk_group=1` but uses `Renormalize` instead of `DeepSeekV3` routing, handled by…
-
#34010 [Fix] Fix `logprobs=0` handling for `/inference/v1/generate` endpoint — frontend,ready — by SumanthRH (合并于: 2026-02-07 04:33 (UTC+8)) [+29/-2, 2 files | commented:1 approved:1] ## Purpose
The `/inference/v1/generate` endpoint differs from the `/v1/completions` endpoint and the Engine APIs in `logprobs=0` handling: passing `logprobs=0` returns no logprobs, while it is supposed to return chosen-token logprobs (same as `logprobs=1`). This PR makes the behaviour consistent with the completions endpoint.
## Test Plan
Added a new test for logprobs handling in the generate endpoint with logprobs values 0, 1, and 5
…
-
#33993 [Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) — bug,ready,deepseek — by xuebwang-amd (合并于: 2026-02-07 03:11 (UTC+8)) [💬4 | +6/-1, 2 files | commented:1 approved:2] ## Purpose
Model: deepseek-ai/DeepSeek-V3.1 This PR aims to fix an error: ``` Traceback (most recent call last): File "/workspace/xuebwang/vllm/vllm/v1/executor/multiproc_executor.py", line 754, in worker_main worker = WorkerProc(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/workspace/xuebwang/vllm/vllm/v1/executor/multiproc_executor.py", line 580, in init …
-
#33734 [Rocm][Bugfix] Fix dtype not same for gemm_a4w4 op — bug,rocm,ready — by charlifu (合并于: 2026-02-07 03:09 (UTC+8)) [+6/-1, 1 files | commented:1 approved:2] The Aiter op per_1x32_f4_quant_hip returns the tensor as the `torch.float4_e2m1fn_x2` type, which causes the `gemm_a4w4` op to raise an error because the weight type is still `torch.uint8`. This PR fixes this issue.
-
#33449 [Refactor] Remove align block size logic in `moe_permute` — performance,ready — by yewentao256 (合并于: 2026-02-07 02:57 (UTC+8)) [+38/-297, 8 files | commented:1 approved:1] ## Purpose
We have very complicated logic in the CUDA kernel related to the optional `align_block_size` used by DeepGEMM, but this has not been called at all on the main branch for a very long time, so it is time to clean it up. It is even slower than the current implementation, so we are safe to remove it.
And we can also get a small performance improvement from deleting the `torch.full` ## Test
…
-
#33251 [Model Runner V2] support apply penalty for spec decode — v1 — by izhuhaoran (合并于: 2026-02-07 02:56 (UTC+8)) [💬5 | +91/-14, 4 files | commented:4 approved:1] ## Purpose
As titled, this PR supports applying penalties for spec decode in MRV2
-
#33971 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,ready,nvidia — by vllmellm (合并于: 2026-02-07 02:05 (UTC+8)) [💬3 | +237/-235, 4 files | commented:1 approved:1]
## Purpose Update the Deployment > Docker docs with Docker information for the ROCm/CUDA/Intel platforms. Docker deployment is linked as snippets from the installations/docker section.
## Test Plan
## Test Result
…
-
#33292 [KV Connector] Add missing method overrides to MultiConnector — documentation,new-model,ready,ci/build,v1,multi-modality,llama,qwen,kv-connector — by eicherseiji (合并于: 2026-02-07 01:58 (UTC+8)) [💬6 | +177/-11, 2 files | commented:4 approved:2] ## Summary
`MultiConnector` was missing several method overrides that are required when wrapping other KV connectors. This PR adds the missing methods and a test to prevent future regressions. ## Missing Methods Fixed
Handshake methods (for NixlConnector support):
- `get_handshake_metadata()` - returns the first non-None metadata from sub-connectors
- `set_xfer_handshake_metadata()` - delegates to all sub-connectors
…
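The two delegation styles named above (first non-None result vs. fan-out to all children) look roughly like this (illustrative classes, not the vLLM implementation):

```python
class SubConnector:
    """Minimal stand-in for a wrapped KV connector."""
    def __init__(self, metadata=None):
        self.metadata = metadata
        self.received = None

class Multi:
    def __init__(self, subs):
        self.subs = subs

    def get_handshake_metadata(self):
        # return the first non-None metadata from sub-connectors
        return next(
            (s.metadata for s in self.subs if s.metadata is not None), None
        )

    def set_xfer_handshake_metadata(self, md):
        # delegate to all sub-connectors
        for s in self.subs:
            s.received = md

m = Multi([SubConnector(), SubConnector("hs")])
assert m.get_handshake_metadata() == "hs"
m.set_xfer_handshake_metadata("x")
assert all(s.received == "x" for s in m.subs)
```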
-
#33944 [Log] Optimize duplicate startup log — ready,v1 — by yewentao256 (合并于: 2026-02-07 01:49 (UTC+8)) [+10/-7, 3 files | commented:1 approved:1] ## Purpose
Optimize logs like
INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner INFO 02-05 21:47:09 [gpu_worker.py:122] Using V2 Model Runner -
#33749 [ROCm][AITER] Fix AITER import regression for explicit backend selection — rocm,speculative-decoding,ready,v1 — by AndreasKaratzas (合并于: 2026-02-06 23:08 (UTC+8)) [💬5 | +262/-66, 5 files | commented:8 approved:2] A regression was introduced that broke explicit AITER backend selection on ROCm when `VLLM_ROCM_USE_AITER=0` (or unset). Users could not explicitly select the AITER backend via `attention_config={"backend": "ROCM_AITER_FA"}` even though the backend was available. Error observed: `AttributeError: 'builtin_function_or_method' object has no attribute 'flash_attn_varlen_func'` This occurred because:
- `is_aiter_found_and_supported…
- #33964 [Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 — bug,ready — by chaunceyjiang (合并于: 2026-02-07 01:23 (UTC+8)) [💬1 | +12/-0, 1 files | commented:1 approved:1]
## Purpose
<|DSML|function_calls> <|DSML|invoke name="get_weather"> <|DSML|parameter name="location" string="true">杭州</|DSML|parameter> <|DSML|parameter name="date" string="true">2024-01-16</|DSML|parameter> </|DSML|invoke>…
-
#33254 [Docs] Update link to Benchmark CLI documentation — performance,ready — by eldarkurtic (合并于: 2026-02-07 00:01 (UTC+8)) [+1/-1, 1 files commented:1 approved:1] - #33973 [XPU][5/N] add wna16 xpu kernel — ready,ci/build — by zufangzhu (合并于: 2026-02-06 23:59 (UTC+8)) [💬2 | +58/-66, 2 files | commented:1 approved:1] This PR is [5/N] of https://github.com/vllm-project/vllm/issues/33214 add compressed tensor wna16 support for xpu
-
#33928 [Refactor] Consolidate sequence normalization and enc-dec parsing — frontend,ready,v1,multi-modality — by DarkLight1337 (合并于: 2026-02-06 23:43 (UTC+8)) [💬18 | +1256/-848, 38 files | commented:4 approved:1] ## Purpose
- Introduce `*DictPrompt` classes to represent prompts that have been standardized into dictionaries. This replaces the `Parsed*Prompt` classes.
- Update and improve documentation of various input schemas.
- Move the logic for normalizing a single prompt/conversation, or a sequence of them, into a sequence to the new module `vllm.renderer.inputs.parse`.
- Allow renderer to handle encoder-decoder inputs.
- Renamed `render_completion(s)` to `render_prompt(s)`.
- Move the logic of r…
-
#33431 [Model] Support MiniCPM-o 4.5 — ready — by tc-mb (合并于: 2026-02-06 23:29 (UTC+8)) [💬1 | +86/-7, 1 files | commented:10] ## Summary
This PR adds support for MiniCPM-o 4.5, a new generation of Omni model, which has excellent performance.
### Changes
- Added `MiniCPMO4_5` class with Qwen3 backbone support
- Refactored `MiniCPMO` to use a factory pattern for automatic version dispatch
- Made `audio_pool_step` configurable (v4.5 uses 5, v2.6 uses 2)
- Added auto version detection for backward compatibility …
-
#33940 [Docs] Add sections on process architecture and minimum CPU resources — documentation,ready — by mgoin (合并于: 2026-02-06 23:26 (UTC+8)) [💬3 | +113/-0, 4 files | commented:3 approved:1] ## Purpose
It seems users can be confused about vLLM’s performance when running with very small amounts of CPU cores available. We are missing a clear overview of what vLLM’s process architecture is, so I added this along with some diagrams in arch_overview.md, and included a section on CPU resource recommendations in optimization.md
Here are some example diagrams I generated with help from Opus 4.6
<img width="2816" height="1536" alt="Generated_image1" src="https://github.com/user-attachment…
-
#33509 [FIX] guidance: use max(vocab_size, len(tokenizer)) for n_vocab — structured-output,ready,v1 — by FredericOdermatt (合并于: 2026-02-06 22:23 (UTC+8)) [💬2 | +1/-1, 1 files | commented:5 changes:1 approved:1] Hi there, thanks for the amazing work on guided decoding in VLLM @russellb @jeejeelee @hmellor @yewentao256 @mgoin and many more.
## Purpose
I’ve observed the following: Models like Gemma 3 1B have `len(tokenizer) > config.vocab_size` due to added special tokens not counted in the config. `GuidanceBackend.__post_init__` passes `self.vocab_size` (from the model config) as `n_vocab` to [llguidance_hf.from_tokenizer()](https://github.com/vllm-project/vllm/blob/b5f8c3092d1e1466b2b9c516fb39e5b2c15e7…
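The one-line fix follows directly from that observation; a sketch with illustrative numbers (not Gemma's actual sizes):

```python
# Situation described above: the tokenizer grew past the config vocab
# size via added special tokens (values illustrative).
config_vocab_size = 262144
tokenizer_len = 262145          # len(tokenizer) after added special tokens

# Before: n_vocab = config_vocab_size, so llguidance misses the extra ids.
# After: take the larger of the two so every token id is covered.
n_vocab = max(config_vocab_size, tokenizer_len)
assert n_vocab == 262145
```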
-
#33989 Update `WeightTransferConfig` to be more standard like the others — ready — by hmellor (合并于: 2026-02-06 21:15 (UTC+8)) [+4/-6, 2 files | commented:1 approved:1]
- `@config` is a dataclass transform now
- The default shouldn’t be duplicated in `EngineArgs`
-
#33977 [Bugfix] Fix models and tests for transformers v5 — bug,documentation,ready,multi-modality,qwen — by zucchini-nlp (合并于: 2026-02-06 21:47 (UTC+8)) [💬2 | +70/-55, 12 files | commented:8 approved:1] As per title, related to https://github.com/vllm-project/vllm/pull/30566. All changes are v4-compatible as well.
@hmellor
- #33799 [Docs] Improve documentation — documentation,frontend,ready — by SorenDreano (合并于: 2026-02-06 20:57 (UTC+8)) [💬2 | +12/-12, 4 files | commented:1 approved:2]
## Purpose
Update vLLM documentation to reflect current Hugging Face tooling and paths:
- Replace deprecated `huggingface_cli` usage with the recommended `hf` command across docs, per Hugging Face migration guidance:
  - https://huggingface.co/docs/huggingface_hub/concepts/migration
  - https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-download
- Update the Hugging Face token path reference from the old `~/.huggingface` location to the current `…
-
#29816 [Bugfix][Model] Support LoRA on Qwen3 Output Embedding — bug,ready,qwen — by klshuster (合并于: 2026-02-06 20:25 (UTC+8)) [💬6 | +132/-13, 6 files | commented:10] ## Purpose This PR adds support for LoRA on the embed/unembed layers for Qwen3 dense/MoE models. It is a simplified version of #26115 that removes the changes for supporting zero-padded vocab (which was addressed in #28545).
(cc @jeejeelee, I accidentally did some merges into the other PR and forgot to sign off on the commits, and the rebase was very unwieldy, so I am just opening a new PR)
## Test Plan I’ve added a comprehensive test suite in `tests/lora/test_qwen3_unembed.py` ## Test Result…
-
#33731 [torch.compile] Reorganize vllm/compilation and tests/compile (0/N for vLLM IR) — ready,torch.compile,ci/build,ready-run-all-tests — by ProExpertProg (合并于: 2026-02-06 20:19 (UTC+8)) [💬2 | +717/-651, 47 files | commented:7 approved:2] ## Purpose
Pure code movement and renaming to prepare for vLLM IR, which will add its own passes and tests. Will wait for #33293 before landing.
## Test Plan CI
## Test Result TBD
…
-
#33582 [CPU][BugFix] Fix loading of w4a8int models with bias — bug,ready — by fadara01 (合并于: 2026-02-06 19:59 (UTC+8)) [💬4 | +8/-3, 1 files | commented:2 approved:2] ## Purpose
Fix loading of w4a8int models with bias. The current code triggers "TypeError: cannot assign 'torch.FloatTensor' as parameter 'bias' (torch.nn.Parameter or None expected)" for models that have bias in their MLPs
## Test Plan
tested on w8a8int quantized whisper model (has bias in its MLPs) following quantization recipe from here: https://learn.arm.com/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model/
## Test Result …
- #33984 Bump HF Hub client to get bug fix — bug,rocm,ready,ci/build — by hmellor (合并于: 2026-02-06 19:25 (UTC+8)) [+2/-2, 2 files | approved:1 commented:1] Bug fix in question is https://github.com/huggingface/huggingface_hub/pull/3778
-
#33976 [PaddleOCR-VL] Add BC for transformers 5.0 config — ready — by zhang-prog (合并于: 2026-02-06 18:33 (UTC+8)) [+7/-0, 1 files commented:1 approved:2] -
#33982 Consolidate and fix forbidden import `pre-commit` checks — ready — by hmellor (合并于: 2026-02-06 17:47 (UTC+8)) [+145/-300, 5 files | commented:2 approved:1] Previously the regex and triton checks didn’t actually work in CI because they did not use the file list passed by `pre-commit` and instead relied on the git diff. This means that the checks would miss violations in CI because the git diff would be empty.
This PR consolidates the three forbidden import checks into a single check which does consume `pre-commit`’s file list. -
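The consolidated check above hinges on consuming the file list that `pre-commit` passes as arguments, rather than shelling out to `git diff` (which can be empty in CI). A minimal sketch of that pattern; the forbidden-module regex and messages here are hypothetical, not vLLM’s actual rules:

```python
import re

# Hypothetical rule; vLLM's real hook has its own list of forbidden imports.
FORBIDDEN = re.compile(r"^\s*(import|from)\s+triton\b", re.MULTILINE)

def check_files(paths):
    """Return the subset of paths containing a forbidden import."""
    bad = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            if FORBIDDEN.search(f.read()):
                bad.append(path)
    return bad

def main(argv):
    # pre-commit invokes the hook with the staged file names as argv,
    # so consuming them (rather than `git diff`) also works in CI.
    failures = check_files(argv)
    for path in failures:
        print(f"{path}: forbidden import found")
    return 1 if failures else 0
```

Registered in `.pre-commit-config.yaml`, such a hook receives every staged file name on its command line, so the same code path runs locally and in CI.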
#33868 support view_from_cpu_tensor on XPU — ready,v1 — by xinyu-intel (合并于: 2026-02-06 16:34 (UTC+8)) [+13/-9, 4 files | commented:5 approved:1]
## Purpose
Support `get_cuda_view_from_cpu_tensor` on XPU. This will be used for ModelRunnerV2.
## Test Plan
## Test Result
…
- #33975 Fix `main` pre-commit — 无标签 — by hmellor (合并于: 2026-02-06 16:08 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] This one slipped through in https://github.com/vllm-project/vllm/pull/33720 -
#32263 [cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation — ready,v1,cpu — by gassan-arm (合并于: 2026-02-06 15:01 (UTC+8)) [💬9 | +704/-4, 4 files | commented:8 changes:2] ## Purpose
CPU Paged Attention NEON BFMMLA BF16 Implementation
Co-authored-by: GitHub Copilot
## Test Results
Using: https://github.com/vllm-project/vllm/pull/31720 Benchmark Suite, …
-
#33720 Onboard voyage-4-nano — documentation,new-model,ready,qwen — by chengchengpei (合并于: 2026-02-06 14:23 (UTC+8)) [💬8 | +216/-2, 8 files | commented:10] ## Purpose
Onboard voyage-4-nano
## Test Plan
Run ``` from vllm import LLM from vllm.config import PoolerConfig …
- #31112 [XPU]Replace pip in docker.xpu with uv pip — ready,ci/build — by 1643661061leo (合并于: 2026-02-06 14:02 (UTC+8)) [💬1 | +46/-28, 1 files | commented:10] ## Purpose We replaced pip in docker.xpu with uv pip to improve the speed of building images. We created a virtual environment at /opt/venv and by default install Python packages via uv pip. Wheels such as torch/intel-extension and vllm-openai are also installed in the same location, most using /root/.cache/uv uniformly. Specifically, since the NIXL script is not suitable for uv, we use the Python in the virtual environment for installation to ensure consistency, while driver…
-
#33679 [XPU][4/N] add mxfp4 moe model support — ready — by jikunshang (合并于: 2026-02-06 13:03 (UTC+8)) [💬1 | +53/-31, 1 files | commented:1 approved:1] ## Purpose [4/N] of https://github.com/vllm-project/vllm/issues/33214: add mxfp4 MoE support. We can also refactor the XPU part once the mxfp4 apply kernel is abstracted.
## Test Plan
python3 examples/offline_inference/basic/generate.py --model openai/gpt-oss-20b --temperature 0 --enforce-eager
## Test Result
…
- #33788 [CPU] Add BF16 Kernel type for s390x — ready,cpu — by R3hankhan123 (合并于: 2026-02-06 12:57 (UTC+8)) [💬1 | +9/-0, 1 files | commented:1 approved:1]
## Purpose
Add BF16 kernel types for s390x in `mla_decode.cpp`.
## Test Plan
- Build the image
- Run inference
## Test Result ``` [root@b314lp81 vllm]# docker run --rm -p 8000:8000 local:test ibm-granite/granite-4.0-micro --port=8000 INFO 02-04 11:03:05 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available. …
-
#33900 [Misc] Update code for encoder-decoder models — ready,v1,multi-modality — by DarkLight1337 (合并于: 2026-02-06 11:38 (UTC+8)) [+9/-3, 2 files | approved:1 commented:1]
## Purpose
FIX https://github.com/vllm-project/vllm/pull/33559#discussion_r2754749999 FIX https://github.com/vllm-project/vllm/pull/33559#discussion_r2754759311
## Test Plan
## Test Result …
-
#31366 feat(frontend): early-fail tokenization guard for user requests — frontend,ready — by scratch-ml (合并于: 2026-02-06 11:38 (UTC+8)) [💬12 | +308/-202, 7 files | commented:9 changes:1] ## Purpose To advance the early-fail truncation design proposed in vllm#31229, these tests ensure default tokenization applies protective truncation and raises an exception immediately on overlength inputs, preventing the async tokenizer from being monopolized.
Default behavior: tokenization runs with protective truncation (`truncation=True`, `max_length=max_model_len+1`); if the encoded length hits the limit, it raises `ValueError` immediately …
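The guard described in #31366 can be sketched as follows. `guarded_encode` and the tokenizer call signature are hypothetical stand-ins, not vLLM’s actual API; the point is the `max_model_len + 1` cap, which lets a truncated encode prove the raw input was overlength:

```python
def guarded_encode(tokenizer, text, max_model_len):
    """Tokenize with protective truncation; fail fast on overlength input.

    Encoding is capped at max_model_len + 1 tokens: if the capped result
    still reaches that length, the raw input must exceed the model limit,
    so we raise immediately instead of tying up the async tokenizer.
    """
    limit = max_model_len + 1
    token_ids = tokenizer(text, truncation=True, max_length=limit)
    if len(token_ids) >= limit:
        raise ValueError(
            f"Input exceeds max_model_len={max_model_len}; rejecting early.")
    return token_ids
```

Because the tokenizer never materializes more than `max_model_len + 1` tokens, even pathological inputs cost a bounded amount of work before rejection.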
[关闭未合并 PR]
- #21241 [bugfix] Remove the attribute ‘version’ from docker compose — documentation,stale — by 1195343015 (关闭于: 2026-02-07 10:18 (UTC+8)) [💬4 | +0/-1, 1 files | commented:1]
## Purpose
The attribute `version` is obsolete and will be ignored; remove it to avoid potential confusion. - #21502 [Bugfix] Fix retrieve_process not ending normally and resources not being released properly — documentation,stale,kv-connector — by Foreverythin (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +2/-2, 1 files | commented:1]
## Essential Elements of an Effective PR Description Checklist
- The purpose of the PR, such as “Fix some issue (link existing issues this PR will resolve)”.
- The test plan, such as providing test command.
- The test results, such as pasting the results comparison before and after, or e2e results
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
## Purpose ### Bug Description In `kv_cache_sharing_lmcache_v1….
- #21535 [Bugfix] Add startup probe and fix disable extraInit container in online deploy helm chart — documentation,stale — by vladmirtxrx (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +23/-3, 3 files | commented:1]
## Purpose Enhance online deploy helm chart: add startup probe …
-
#22262 [V1][Spec Decode] Async scheduling integration with spec decode — documentation,speculative-decoding,stale,v1 — by zixi-qi (关闭于: 2026-02-07 10:17 (UTC+8)) [💬7 | +108/-25, 6 files | commented:1]
## Purpose Support async scheduling with speculative decoding w…
-
#22406 [feature] add all_reduce registry for custom backends — documentation,needs-rebase,stale — by draftbk (关闭于: 2026-02-07 10:17 (UTC+8)) [💬8 | +245/-0, 4 files | commented:1]
## Purpose
…
-
#22413 [XPU] Fix OOM when manually specifying ZE_AFFINITY_MASK with Ray distributed executor on XPU — documentation,stale,v1 — by chaojun-zhang (关闭于: 2026-02-07 10:17 (UTC+8)) [💬8 | +107/-28, 6 files | commented:10]
## Purpose When using Intel GPUs with Ray, the device_control_e…
-
#23450 Add Predicted Outputs API — documentation,frontend,speculative-decoding,ci/build,stale,v1 — by bwasti (关闭于: 2026-02-07 10:17 (UTC+8)) [💬9 | +461/-4, 15 files | commented:2] ## Purpose
As referenced in this issue: #23276, predicted outputs is a nice API from OpenAI that allows user-provided external token predictions to hit the speculative decoding backend.
This PR is the first of a couple to properly thread the API through the stack (looking for an initial review on this).
The key API is the addition of `--speculative-config '{"method": "predicted", "num_speculative_tokens": 4}'` at serving time and an exact replication of the OpenAI API at request: ``` { …
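The verify-and-accept rule that predicted outputs relies on can be sketched generically. This is the standard greedy speculative-decoding acceptance logic, not vLLM’s exact implementation, and the function name is hypothetical:

```python
def accept_predicted(predicted, verified):
    """Return the prefix of `predicted` confirmed by the target model.

    With predicted outputs, user-supplied draft tokens are verified in one
    batched forward pass; generation keeps the longest matching prefix of
    the draft, plus the model's own token at the first divergence.
    """
    accepted = []
    for draft, actual in zip(predicted, verified):
        if draft != actual:
            break
        accepted.append(draft)
    # The target model's token at the first mismatch (or one past the
    # match) is always kept, so a step never yields fewer than one token.
    if len(accepted) < len(verified):
        accepted.append(verified[len(accepted)])
    return accepted
```

When the user’s prediction is mostly right (e.g. an edit of a known document), many tokens are accepted per forward pass, which is where the speedup comes from.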
-
#24168 [logging] Refine PyNcclConnector Proxy logging — documentation,stale,kv-connector — by panpan0000 (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +17/-1, 1 files | commented:1]
## Purpose
Make logging and responses clearer.
## Test Plan
## Test Result
…
-
#24187 [Spec Decode][Model]Add qwen2-eagle — documentation,performance,new-model,speculative-decoding,needs-rebase,stale,v1,qwen — by wwl2755 (关闭于: 2026-02-07 10:17 (UTC+8)) [💬7 | +325/-18, 9 files | commented:7]
## Purpose
Add qwen2-eagle (and qwen2.5-eagle) support.
According to Slack, Qwen2 is not supported yet because the target and draft models have different numbers of KV heads. This PR addresses this by aligning the KV cache space with the larger of the two (though it can cause some waste).
Another thing this PR addresses is that some eagle model places “…
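The KV-cache alignment described above is simple arithmetic: size the shared cache for the larger head count so both models can use one layout. A hedged sketch of that sizing (function and parameter names are hypothetical, not vLLM’s internals):

```python
def aligned_kv_bytes(num_kv_heads_target, num_kv_heads_draft,
                     head_dim, num_layers, block_tokens, dtype_bytes=2):
    """Per-block KV cache size when target and draft share one allocation.

    Aligning to the larger head count lets both models index the same
    cache layout, at the cost of wasted space for the smaller model.
    """
    heads = max(num_kv_heads_target, num_kv_heads_draft)
    # K and V each hold heads * head_dim values per token, per layer.
    return 2 * heads * head_dim * block_tokens * num_layers * dtype_bytes
```

For example, a target with 8 KV heads and a draft with 2 both get blocks sized for 8 heads; the draft simply leaves three quarters of each block unused.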
-
#24216 [Misc] fix lmcache cpu offload example — documentation,stale,kv-connector — by zhenwei-intel (关闭于: 2026-02-07 10:17 (UTC+8)) [💬3 | +163/-112, 1 files | commented:1] The current example is outdated. It uses the prefix cache and will not invoke CPU offload.
Update the example from the lmcache documentation here: https://docs.lmcache.ai/getting_started/quickstart/offload_kv_cache.html
# Running the Example
First, run the script without LMCache:
``` python cpu-offloading.py …
-
#24287 [CI/Build] Fix cmake incremental build when running "pip install --no-build-isolation -e ." — documentation,ready,ci/build,stale — by xli (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +34/-12, 2 files | approved:1 commented:3] Fix cmake incremental build when running "pip install --no-build-isolation -e ." (uv pip install works the same).
Current c extensions cmake incremental build requires some extra steps to enable (https://docs.vllm.ai/en/v0.9.2/contributing/incremental_build.html)
By changing the cmake working directory in the setup.py, we can enable cmake incremental build without generating CMakeUserPresets.json file and following specific runbook.
Not sure why current implementation uses self.build_temp directo…
-
#24878 add default value for max_tokens and max_num_batched_tokens to basic chat example so that the example can be run out of box — documentation,stale,v1 — by chenfengjin (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +5/-0, 1 files | commented:5] ## Purpose
Add default values for max_tokens and max_num_batched_tokens to the basic chat example so that it can be run out of the box. Without these defaults, users struggle to run the basic example:
- for `max_tokens`, the default value is 16, which results in meaningless generated output
- for `max_num_batched_tokens`, the default conflicts with model parameters, which results in a runtime error
The following are errors with framework default values.
### with default max_num_batched_tokens as 4096
…
-
#24965 add dtype and max_num_batched_tokens to classify example so that it can be run out of box — documentation,stale — by chenfengjin (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +2/-0, 1 files | commented:3] ## Purpose the example basic/classify.py failed to generate probabilities ```console Processed prompts: 100%|█████████████████████| 4/4 [00:00<00:00, 11.52it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Generated Outputs: ———————————————————— Prompt: ‘Hello, my name is’ Class Probabilities: [nan, nan] (size=2) ———————————————————— …
-
#24972 [Model] Deepseek-V3.1 reasoning parser — documentation,frontend,needs-rebase,stale,qwen,deepseek — by taohui (关闭于: 2026-02-07 10:17 (UTC+8)) [💬6 | +214/-11, 15 files | commented:6]
## Purpose
This PR adds a new reasoning parser for the DeepSeek-V3.1 model, named deepseek_v3. Unlike previous models such as deepseek_r1, the reasoning parser for DeepSeek-V3.1 is deterministic. Specifically:
When a request includes `"chat_template_kwargs": {"thinking": True}`, the model uses the deepseek_r1 reasoning parser.
Otherwise, it uses a new IdentityReasoningParser, which implements the ReasoningParser interface but does not perform actual reasoning, effe…
-
#25356 [Feature] OTEL Tracing for Individual Model Steps — documentation,stale,v1 — by tomasruizt (关闭于: 2026-02-07 10:16 (UTC+8)) [💬8 | +205/-14, 7 files | commented:1] ## Purpose Enable fine-grained OpenTelemetry (OTEL) tracing for model steps like `Preprocess`, `Forward`, etc. at the level of the individual request, in a visual way.
## Why Do We Need This? Currently, we can use the PyTorch profiler to track the runtime of each model step in vLLM. However, the profiler overhead is large, and it slows down vLLM execution, distorting the actual runtimes of steps. The profiler also cannot separate different requests, so in a workload where multiple requests part…
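To keep this digest self-contained, here is a stdlib-only sketch of the idea behind per-step, per-request tracing; the real PR uses OpenTelemetry spans, and `StepTracer` plus the step names are illustrative stand-ins:

```python
import time
from contextlib import contextmanager

class StepTracer:
    """Toy per-request tracer recording (step, duration) pairs.

    Stands in for OTEL spans: unlike a full profiler, it only timestamps
    step boundaries, so overhead stays negligible and each request keeps
    its own span list.
    """
    def __init__(self, request_id):
        self.request_id = request_id
        self.spans = []

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append((name, time.perf_counter() - start))

# Usage: wrap each model step of one request in a span.
tracer = StepTracer("req-0")
with tracer.span("Preprocess"):
    tokens = list(range(4))
with tracer.span("Forward"):
    logits = [t * 2 for t in tokens]
```

A real OTEL exporter would ship these spans to a collector for visualization; the key property is that spans nest per request, so concurrent requests no longer blur together as they do under a whole-process profiler.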
-
#25367 [Docs] wheel larger than limit — documentation,stale — by pfk-beta (关闭于: 2026-02-07 10:16 (UTC+8)) [💬4 | +4/-0, 1 files | commented:1] ## Purpose Based on my experience with building Docker, the docs should mention the wheel exceeding the size limit. Related to #18786
## Test Plan N/A
## Test Result N/A
-
#25376 [V0 Deprecation][KVConnector] Remove KVConnector v1/v0 differentiation — documentation,tpu,ready,needs-rebase,ci/build,stale,v1,kv-connector — by NickLucche (关闭于: 2026-02-07 10:16 (UTC+8)) [💬9 | +481/-530, 47 files | commented:2 approved:1 changes:1] This PR completes the KVConnector V0 deprecation we started here https://github.com/vllm-project/vllm/issues/21785. It mostly moves stuff from `v1/kv_connector` to `kv_connector` to clarify there is no other "vX" connector (since removed in the above PR). Takes care of v1-related naming + warnings as well (e.g. checking v1 is enabled). List of changes:
- `KVConnectorBase_V1` -> `KVConnectorBase`
- ~`KVConnectorBaseType = KVConnectorBase_V1`~
- `vllm/distributed/kv_transfer/v1/kv_connector/` -> `vl…
-
#25574 [Misc] Add presence_penalty to default generation config — frontend,stale — by noiji (关闭于: 2026-02-07 10:16 (UTC+8)) [💬3 | +17/-4, 2 files | commented:1]
## Purpose Add presence_penalty to default generation config
## Test Plan
## Test Result
…
-
#25728 [Bugfix] fixing streaming issues and tool call output for gpt-oss (#22704) — frontend,needs-rebase,stale,gpt-oss — by noelkelias (关闭于: 2026-02-07 10:16 (UTC+8)) [💬5 | +51/-2, 1 files] ## Purpose Closely related to bugs associated with streaming for gpt-oss response API (#22704):
Harmony’s streaming didn’t support `functions.*` tool calls at all. We handled reasoning on `analysis`, final text on `final`, and built-ins like code/web, but `"commentary"` messages with a `recipient="functions.NAME"` had a no-op branch.
## Test Plan
First, tested GPT-OSS with a basic set of instructions, few-shot examples, a function definition, and a prompt to find the weather in San Francisco, …
- #25952 [Bugfix] message loop — documentation,stale,tool-calling — by muazem7 (关闭于: 2026-02-07 10:16 (UTC+8)) [💬6 | +1/-1, 1 files | commented:1 approved:1] ## Purpose Fix bug
-
#26009 [Core] NIXL/Segment-tree encoder cache based EPD-disaggregation — documentation,performance,needs-rebase,ci/build,stale,v1,kv-connector — by MerHS (关闭于: 2026-02-07 10:16 (UTC+8)) [💬8 | +2111/-68, 14 files | commented:3] # Purpose
Recently, several multi-node inference/orchestration engines such as llm-d and dynamo have emerged, and they natively support Prefill-Decode disaggregation. To improve them, we can imagine the Encode-Prefill-Decode disaggregation introduced in several RFCs and PRs: https://github.com/vllm-project/vllm/issues/20799, https://github.com/hsliuustc0106/vllm/issues/18, https://github.com/vllm-project/vllm/pull/25233.
However, llm-d and dynamo rely on the NIXL communicator, which wraps…
- #26389 [Misc] Cleanup fp8.py logic cleanup — stale — by andoorve (关闭于: 2026-02-07 10:16 (UTC+8)) [💬2 | +6/-14, 1 files | commented:1] Very minor refactor I noticed when reading this file. Removes some redundant asserts.
-
#32051 [Misc] support separate draft model loading config from base model — speculative-decoding,v1,meta-exported,fb-exported — by ZhengkaiZ (关闭于: 2026-02-07 06:40 (UTC+8)) [💬4 | +10/-1, 3 files | commented:2 | 📝草稿] [Misc] support separate draft model loading config from base model
Summary: Users can build a custom model loading format. Since the base model and draft model may not share the same model loading config, this PR supports that use case.
An example command would be
``` vllm serve /home/zzhengkai/vllm_models/Llama-3.1-8B-Instruct --disable-log-requests --port 9001 --speculative_config '{"method": "eagle3", "model": "zzhengkai/custom_draft_mdoel_for_llama3", "num_speculative_tokens":…
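The intent of #32051 can be sketched as a fallback rule: the draft model uses its own loading config when one is given, otherwise it inherits the base model’s. Function and field names below are hypothetical, not vLLM’s actual config keys:

```python
def resolve_draft_load_config(base_load_config, speculative_config):
    """Pick the draft model's load config, falling back to the base model's.

    Mirrors the PR's intent: the draft model may declare its own loading
    config inside the speculative config; any keys it leaves unspecified
    still come from the base model.
    """
    draft_overrides = speculative_config.get("load_config")
    if draft_overrides is None:
        return dict(base_load_config)
    # Merge: draft-specific keys win, base keys fill the gaps.
    return {**base_load_config, **draft_overrides}
```

This keeps the common case (shared loading format) zero-config while letting a custom draft checkpoint declare a different format.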
-
#33836 Fix RoutingMethodType.from_topk softmax+renormalize mapping #33792 — nvidia — by baonudesifeizhai (关闭于: 2026-02-07 03:46 (UTC+8)) [💬5 | +48/-11, 6 files | commented:4] ## Purpose #33792 Align from_topk with routing semantics: softmax + renormalize=True now maps to RoutingMethodType.Renormalize (not RenormalizeNaive). Add a small unit test to cover from_topk mapping and invalid scoring functions. ## Test Plan
## Test Result `python -m pytest tests/model_executor/test_routed_experts_capture.py -v`
passed… -
#34014 [Model] Introduce first-class DFlash speculative decoding in vLLM V1 — documentation,new-model,speculative-decoding,v1,qwen — by dangoldbj (关闭于: 2026-02-07 04:33 (UTC+8)) [💬2 | +2163/-22, 18 files | commented:1]
## Purpose This PR introduces first-class DFlash speculative decoding support in vLLM V1.
This is the initial DFlash implementation in vLLM. Prior to this change, vLLM did not support DFlash speculative decoding. This PR establishes an explicit, config-driven DFlash execution path with correctness and maintainability guarantees.
Key changes:
- Adds explicit `method="dflash"` support and full DFlash model/config plumbing.
- Establishes a dedicated `DFlash…
-
#33952 [CI][BugFix][AMD] Add check for model_config being None and update conftest.py to load AITER if available to fix Kernels MoE Test %N — bug,rocm — by rasmith (关闭于: 2026-02-07 01:17 (UTC+8)) [💬3 | +22/-4, 3 files | commented:4]
## Purpose This PR broke many tests (over 30) and this PR fixed one test in the `Kernels MoE Test %N` group, but when the test is run as a group using `pytest -sv kernels/moe`, the first test that runs does not load AITER ops, and when subsequent tests run, they will also not have AITER ops loaded.
This PR loads the ops in `vllm._aiter_ops` but then ensures tha… -
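The conftest fix above relies on loading custom ops exactly once, no matter which test module runs first. A stdlib sketch of that idempotent-loading pattern; the names are hypothetical, and the real fix registers AITER ops:

```python
_OPS_LOADED = False

def ensure_ops_loaded(register):
    """Run the op-registration callback exactly once per process.

    When pytest collects many test modules into one run, only the first
    import path may trigger registration; guarding with a module-level
    flag makes every entry point safe to call, in any order.
    """
    global _OPS_LOADED
    if not _OPS_LOADED:
        register()
        _OPS_LOADED = True
```

In a `conftest.py`, a session-scoped fixture would call `ensure_ops_loaded` so that both standalone and grouped test runs see the ops registered.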
#31875 [Feature] Add flag to disable FlashInfer autotune — 无标签 — by mmangkad (关闭于: 2026-02-06 17:02 (UTC+8)) [💬4 | +14/-1, 3 files | commented:2 changes:1 | 📝草稿]
## Purpose
FlashInfer autotuning can sometimes take a long time to complete during initialization. This PR introduces a flag to disable it, allowing users to bypass this step if they are okay with skipping optimization to speed up startup.
## Test Plan
## Test Result
…
-
#33146 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,ready,nvidia — by vllmellm (关闭于: 2026-02-07 00:40 (UTC+8)) [💬8 | +231/-192, 5 files | commented:7 changes:1 approved:2]
## Purpose
Update the Deployment > Docker page with the latest docker information for the ROCm platform.
## Test Plan
## Test Result
…
-
#33929 [Bugfix][Docker] Install CUDA dev packages for JIT compilation headers — bug,ci/build,nvidia — by jasonlizhengjian (关闭于: 2026-02-07 00:22 (UTC+8)) [💬2 | +3/-3, 1 files | commented:1 | 📝草稿] Install cuda-cudart-dev, cuda-nvrtc-dev, and libcublas-dev instead of runtime-only packages to provide headers (cuda.h, cuda_runtime.h, nvrtc.h, cublasLt.h) needed for FlashInfer JIT compilation of fp8_blockscale_gemm_sm90 kernels.
Fixes #33833
## Purpose
## Test Plan
…
-
#32280 Bump triton_kernels to v3.5.1 for version consistency — ci/build — by mmangkad (关闭于: 2026-02-06 17:02 (UTC+8)) [+1/-1, 1 files | commented:2]
## Purpose
Align triton_kernels fetch tag (v3.5.0 → v3.5.1) with the Triton package version used in requirements, which ships with PyTorch 2.9.1.
## Test Plan
## Test Result
…
-
#33968 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,nvidia — by vllmellm (关闭于: 2026-02-06 14:10 (UTC+8)) [💬3 | +231/-192, 5 files]
## Purpose Update the Deployment > Docker page with docker information for the ROCm/CUDA/Intel platforms. Linked docker deployment as snippets from the installations/docker section.
## Test Plan
## Test Result
…