[vLLM GitHub Development Digest] 2026-02-05
[Overview]
- Time window: 2026-02-05 11:27 (UTC+8) ~ 2026-02-06 11:27 (UTC+8)
- New issues: 36 (label distribution: bug:18, feature request:9, usage:4, RFC:2, rocm:2)
- Closed issues: 17
- New PRs: 66 (label distribution: ready:17, bug:16, v1:14, ci/build:10, documentation:7)
- Merged PRs: 40
- PRs closed without merging: 12
[New issues]
- #33885 [Bug]: Serving Qwen3-VL-Reranker with vLLM, the rerank endpoint returns very low scores (only ~0.5) for any image — bug — by RC-Qiao (created: 2026-02-05 16:32 (UTC+8)) [💬1] ### Your current environment
  docker run --gpus '"device=7"' --entrypoint "" -v /dataset/models/Qwen/Qwen3-VL-Reranker-8B:/model -p 9091:8000 --shm-size=8g vllm/vllm-openai:v0.15.1-cu130 vllm serve /model --runner pooling --max-model-len 16384 --gpu-memory-utilization 0.5 --dtype bfloat16 --hf-overrides '{"architectures": ["Qwen3VLForSequenceClassification"], "classifier_from_token": ["no", "yes"], "is_original_qwen3_reranker": true}' --chat-template /model/qwen…
- #33882 [Feature]: Improve sparse embedding pooling output format for better efficiency and usability — feature request — by staugust (created: 2026-02-05 16:01 (UTC+8)) [💬4] ### 🚀 The feature, motivation and pitch
  This issue concerns the current output format of the sparse embedding pooling interface and how it impacts performance and usability. For models that support sparse embeddings, we typically need to store, for each document, the mapping from token → weight so we can use it later during retrieval. Currently, the vLLM `pooling` API returns only the per-token weights for each position in the document. This leads to two potential issues: - **Duplicate…
- #33914 [RFC]: Support Chunked Pipeline Parallel (CPP) with dynamic chunk size — RFC — by pisceskkk (created: 2026-02-05 21:36 (UTC+8)) ### Motivation
  Currently, vLLM supports DCP and is planning to support PCP to improve long-sequence inference capability and overall inference efficiency. However, in scenarios with heterogeneous sequence lengths, Context Parallelism can perform suboptimally and may even lead to overall performance degradation due to reduced efficiency on short sequences. In addition, to better support and optimize ultra-long sequence inputs (e.g., reaching 1M tokens or more), introducing a new partitioning di…
- #33911 [Bug]: Gemma-3 multimodal models (4b/12b/27b) fail with torch.compile assertion error — no label — by tomasruizt (created: 2026-02-05 20:27 (UTC+8)) [💬1] Gemma-3 multimodal models fail during initialization with a torch.compile assertion error. Text-only Gemma-3 models (270m, 1b) work fine.
  Reproduce: `vllm serve google/gemma-3-4b-it --max-model-len 4096`
  Error: ``` AssertionError: expected size 1048576==131072, stride 256==256 at dim=0 …
- #33951 [Feature]: Implement `get_kv_cache_stride_order` for all classes — feature request — by Rohan138 (created: 2026-02-06 07:49 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
  Currently, `AttentionBackend.get_kv_cache_stride_order` raises a `NotImplementedError` by default: https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backend.py#L90 This forces awkward usage in vLLM, e.g. in https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py#L5864:
  ``` try: kv_cache_stride_order = attn_backend.get_kv_cache_stride_order() except (AttributeError, NotImplementedErro…
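  A minimal sketch of what the issue asks for, assuming an identity ordering as the fallback (the default body here is an assumption, not vLLM's actual implementation):

  ```python
  # Sketch: give the base class a sensible default instead of raising, so
  # callers can drop the try/except shown above.
  class AttentionBackend:
      @staticmethod
      def get_kv_cache_stride_order(num_dims: int = 5) -> tuple[int, ...]:
          # Default: natural (identity) stride order; backends with a
          # custom KV-cache layout would override this.
          return tuple(range(num_dims))

  print(AttentionBackend.get_kv_cache_stride_order())  # -> (0, 1, 2, 3, 4)
  ```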
- #33954 [Bug]: `Qwen/Qwen3-VL-Embedding-8B` embedding quality declines significantly sometime after vLLM version `0.14.0rc2.dev199+gc80f92c14` — bug — by kevin-pw (created: 2026-02-06 08:25 (UTC+8)) ### Your current environment
  I am posting below the output of `python collect_env.py` for vLLM version `0.14.0rc2.dev199+gc80f92c14` (where embeddings are high quality) and vLLM version `0.15.2.dev0+g1892993bc` (where embeddings are low quality). Note that the output of `python collect_env.py` shows that all relevant environment parameters are identical between the vLLM versions that produce different embedding qualities. Output of pyt...
- #33950 [Bug]: — bug — by FrancescoSaverioZuppichini (created: 2026-02-06 07:30 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text Collecting environment information... ============================== System Info ============================== ... ```
- #33930 [Feature]: GPU Memory Snapshotting to reduce cold starts — feature request — by Volko61 (created: 2026-02-06 00:53 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
  I noticed that Modal (https://modal.com/blog/gpu-mem-snapshots) and InferX (https://inferx.net/) have implemented the CUDA checkpoint/restore API to drastically reduce cold starts.
  I tried both services and they seem to work extremely well. InferX told me they built it on top of vLLM, so it definitely seems possible. Right now, I'm not aware of any open-source implementation of this technique.
  I would love to see this feature implemented in vLLM as…
- #33917 [Feature]: Up-to-date docker images — feature request — by lee-b (created: 2026-02-05 22:17 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
  The official vllm-openai docker image appears to be significantly outdated, even the nightly image!
  I was already having to build on top of the docker image and install a later transformers version to run some of the models that vLLM claims to support. However, after setting up a new host with CUDA 13.1, I found that the vLLM image won't run even with that update, due to host NVIDIA driver and CUDA library version incompatibilities.
  I'm having to apply…
- #33938 [Bug][ROCm] Platform detection initializes CUDA prematurely, breaking Ray multi-GPU allocation — bug,rocm — by kouroshHakha (created: 2026-02-06 03:33 (UTC+8)) [💬1] ### Your current environment
  The output of `python collect_env.py`:
  ```text Collecting environment information... ============================== System Info ============================== ... ```
- #33872 [Feature]: Community Help Wanted: Migrate Remaining Linear Methods into Kernel Abstraction — feature request — by BadrBasowid (created: 2026-02-05 14:39 (UTC+8)) [💬5] ### 🚀 The feature, motivation and pitch
  This issue is a continuation of "[Refactor] Make FP8 Linear Ops use kernel abstraction" #27814, which moved the FP8 scaled-mm linear path onto the `ScaledMMLinearKernel` interface and improved consistency across FP8 execution paths. The next step is to extend the same kernel-abstraction approach to the remaining linear methods/schemes by routing them through the kernel abstraction inside their existing `LinearMethod` implementations so that (once everyth…
#33888 [Feature]: when support torch 2.10.0 — feature request — by chamwen (创建于: 2026-02-05 16:51 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
when support torch 2.10.0
### Alternatives
No response
### Additional context
…
- #33925 [Bug]: OpenAI API: system message accepts images — bug — by jonoillar (created: 2026-02-06 00:15 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text Your output of `python collect_env.py` here ```…
- #33923 [Feature]: Report cached tokens in the Anthropic `/v1/messages` API — feature request — by msanft (created: 2026-02-05 23:59 (UTC+8)) ### 🚀 The feature, motivation and pitch
  vLLM has implemented the Anthropic `/v1/messages` API since #22627. However, as implemented in that PR, the API does not report cached tokens, while the upstream Anthropic API does. The cached tokens should be reported via the already-available fields:
  ```py class AnthropicUsage(BaseModel): """Token usage information"""
  …
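  A minimal sketch of the idea, using a field name that mirrors Anthropic's public usage schema (`cache_read_input_tokens`); the exact model and field vLLM would use are assumptions here, not taken from the issue:

  ```python
  from dataclasses import dataclass

  @dataclass
  class AnthropicUsage:
      # Hypothetical stand-in for vLLM's pydantic model; the cache field
      # name mirrors Anthropic's usage object and is an assumption.
      input_tokens: int
      output_tokens: int
      cache_read_input_tokens: int = 0  # tokens served from the prompt cache

  usage = AnthropicUsage(input_tokens=100, output_tokens=20,
                         cache_read_input_tokens=80)
  print(usage.cache_read_input_tokens)  # -> 80
  ```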
- #33921 [Bug]: Qwen3 Instruct context is limited to 8K — bug — by mobicham (created: 2026-02-05 23:21 (UTC+8)) ### Your current environment
  Docker: vllm/vllm-openai:v0.15.1, GPU: H100 PCIe 80GB
  ### 🐛 Describe the bug
  …
- #33920 [Bug]: GLM-4.7-Flash OOM during sampler warmup with tensor parallelism on RTX 4090 — bug — by VecherVhatuX (created: 2026-02-05 22:54 (UTC+8)) ### Your current environment
  ### System Info - **vLLM version**: 0.15.0rc2.dev40+g2e8de8677.cu130 - **GPU**: 2x NVIDIA GeForce RTX 4090 (24564 MiB each, compute capability 8.9) - **NVIDIA Driver**: 580.126.09 - **CUDA**: 13.0 ...
- #33906 [Bug]: mxfp4 (gpt-oss moe) on AMD ROCm (W7900/gfx1100) breaks — bug,rocm — by ctheune (created: 2026-02-05 19:41 (UTC+8)) [💬2] ### Your current environment
  The output of `python collect_env.py`:
  ```text ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) ... ```
- #33916 [Bug] IndexError: list index out of range in chat_completion_stream_generator with --tool-call-parser=mistral during streaming tool calls — bug — by dfischer-mw (created: 2026-02-05 22:06 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text ============================== System Info ============================== ... ```
- #33915 [Feature]: Support `include_reasoning` request parameter for non-harmony models — feature request — by cjackal (created: 2026-02-05 21:57 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
  The OpenAI responses API supports an `include_reasoning` parameter to filter reasoning content out of the model response. While setting `include_reasoning=false` in a request does not skip the model's reasoning phase in the inference engine or reduce the token generation cost, it helps reduce network traffic (crucial on some network-limited devices, as the reasoning text often overwhelms the user-facing answer) and is considered inevitable …
- #33877 [Bug]: GLM47 Tool Call Bug — bug — by junzhiL (created: 2026-02-05 15:28 (UTC+8)) [💬1] ### Your current environment
  The output of `python collect_env.py`:
  ```text Your output of `python collect_env.py` here ```…
- #33913 [Usage]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions — usage — by 870223666 (created: 2026-02-05 20:38 (UTC+8)) ### Your current environment
  Please analyze this for me: Traceback (most recent call last): File "/mnt/bn/liweilai-dct1201/miniconda3/envs/infer/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop output = func(*args, **kwargs) File "/mnt/bn/liweilai-dct1201/miniconda3/envs/infer/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context return func(*args, **kwargs) File "/mnt/bn/liweilai-dct1201/miniconda3/envs/infer/lib/py…
- #33910 [Usage]: AssertionError: collective_rpc should not be called on follower node — usage — by PhdShi (created: 2026-02-05 20:26 (UTC+8)) [💬1] ### Your current environment
  ```text Is CUDA available : True CUDA runtime version : 13.1.80 CUDA_MODULE_LOADING set to : LAZY GPU models and configuration : GPU 0: NVIDIA H20 GPU 1: NVIDIA H20 GPU 2: NVIDIA H20 …
- #33880 [Installation]: Can't install vllm 0.15.0 on Windows & Python 3.12 — installation — by ImSo3K (created: 2026-02-05 15:48 (UTC+8)) [💬5] ### Your current environment
  I'm trying to install the latest version but I get: ```text $ pip install vllm==0.15.0 ERROR: Ignored the following yanked versions: 0.2.1 ERROR: Could not find a version that satisfies the requirement vllm==0.15.0 (from versions: 0.0.1, 0.1.0, 0.1.1, 0.1.2, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.2.0, 0.2.1.post1, 0.2.2, 0.2.3, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.3.0, 0.3.1, 0.3.3, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.5.0.post1, 0.5.1, 0.5.2, 0.5.3, 0.5.3.post1, 0.5.4, 0.5.5, 0.6.0…
- #33908 [Usage]: vLLM is an excellent project - thank you to the team! — usage — by Rainlin007 (created: 2026-02-05 19:54 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`
  ### How would you like to use vllm
  This is not really a usage question, but I wanted to take a moment to express my appreciation for the vLLM project. …
- #33873 [Bug]: BGE-M3 pooling endpoint fails with short inputs (< 4 tokens) in token_classify task — no label — by eveningcafe (created: 2026-02-05 14:46 (UTC+8)) [💬1] ### Your current environment
- vLLM Version: v0.15.0
- Model: BAAI/bge-m3
- Task: token_classify
- Endpoint: /pooling
### 🐛 Describe the bug
The `/pooling` endpoint with `task: token_classify` returns a 400 error for inputs that tokenize to fewer than 4 tokens (including the CLS and SEP special tokens). The backend returns a single float value instead of a list of floats, causing Pydantic validation to fail. …
- #33899 [Bug]: DeepSeek-R1-0528 AssertionError: tokens not padded correctly on GB200 — bug — by chaunceyjiang (created: 2026-02-05 18:38 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text Your output of `python collect_env.py` here ```…
- #33895 [Bug]: /tokenize hangs after many requests — bug — by JakubCerven (created: 2026-02-05 18:05 (UTC+8)) ### Your current environment
  n/a
  ### 🐛 Describe the bug
  Running vLLM using `vllm/vllm-openai:v0.15.0` with `QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ`: after calling a large number of `/tokenize` requests, the endpoint becomes unresponsive and hangs forever…
- #33894 [Feature]: Support setting tool-call-parser to auto — feature request — by shell-nlp (created: 2026-02-05 17:36 (UTC+8)) ### 🚀 The feature, motivation and pitch
  It's cumbersome to configure tool-call-parser every time, and it's unclear which parameter to use; I'd suggest supporting setting tool-call-parser to 'auto'.
  ### Alternatives
  No response
  ### Additional context …
- #33871 [Bug]: Local deployment achieves about 30% higher accuracy than the server deployment — bug — by charliess123 (created: 2026-02-05 14:34 (UTC+8)) ### Your current environment
  Collecting environment information... ============================== System Info ============================== OS : Ubuntu 24.04.2 LTS (x86_64) GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Clang version : Could not collect CMake version : Could not collect ...
- #33889 [Usage]: RuntimeError: CUDA error: invalid device ordinal — usage — by Athe-kunal (created: 2026-02-05 17:04 (UTC+8)) [💬1] ### Your current environment
  I am running this tutorial; my script is the same except for my ray init:
  ```python ray.init( runtime_env={ "excludes": [".venv/", ".git/", "*.pyc", "__pycache__/"], "env_vars": { "VIRTUAL_ENV": "", # Unset VIRTUAL_ENV for Ray workers …
- #33887 [Bug]: DeepSeek-R1-0528-NVFP4 Missing TRTLLM-GEN kernel (decode) on GB200 — bug — by chaunceyjiang (created: 2026-02-05 16:42 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text Your output of `python collect_env.py` here ```…
- #33883 [Bug]: DeepSeek-V3.2 NVFP4 with fp8 kvcache reports `src_cache must be uint8` — bug — by kebe7jun (created: 2026-02-05 16:10 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`:
  ```text GB200 main branch ```
- #33864 Bug: CPU KV cache offloading fails for blocks formed during decode — bug,kv-connector — by sriumcp (created: 2026-02-05 12:33 (UTC+8)) [💬1] ## Summary
  When using CPU KV cache offloading (`--kv-offloading-size`), blocks that complete during the decode phase are never offloaded to CPU. Only blocks that complete during prefill are correctly offloaded.
  ## Root Cause
  In `vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py`, the `_get_reqs_to_store()` method calculates the number of blocks to store using: ```python new_tokens = scheduler_output.num_scheduled_tokens[req_id] …
- #33865 [Bug]: OpenAI-compatible Embeddings API intermittently crashes with multimodal cache assertion (`Expected a cached item for mm_hash`) on Qwen3-VL-Embedding-8B — bug — by ojipadeson (created: 2026-02-05 12:45 (UTC+8)) [💬1] ### 🐛 Describe the bug
  Using `vllm serve` with the OpenAI-compatible Embeddings API to run Qwen3-VL-Embedding-8B, the server intermittently crashes while pre-processing embedding requests. The crash is an assertion inside the multimodal feature cache: `AssertionError: Expected a cached item for mm_hash='...'` After that, the API server hits a second assertion:
  …
- #33874 [Bug]: GLM-OCR POST bug — bug — by F0undLinks (created: 2026-02-05 14:46 (UTC+8)) ### Your current environment
  vLLM version: 0.16.0tc1.dev78-gb6bb2842c.cu130, Transformers version: 5.0.1.dev0, GPU: 2080Ti * 4, CUDA version: 13.0…
- #33869 [RFC]: DeepSeek-R1 MoE offload — RFC — by wangyxbh (created: 2026-02-05 13:54 (UTC+8)) # RFC: CPU Offload for Mixture-of-Experts (MoE) Inference in vLLM PR: https://github.com/vllm-project/vllm/pull/31938
## Summary
This PR proposes a CPU Offload Module for MoE inference in vLLM, enabling a large portion of expert weights and computation to be dynamically offloaded to the CPU while keeping only a small, hot subset of experts cached on GPU.
The design supports:
- Hybrid GPU–CPU execution
- Pinned-memory–based weight streaming …
[Closed issues]
- #18324 [Bug]: Clarification regarding bug inside vllm-flash-attn vision module — bug,stale — by vrdn-23 (closed: 2026-02-06 10:17 (UTC+8)) [💬17] ### Your current environment
  N/A
  ### 🐛 Describe the bug
  Hi, I am looking for clarification on the warning message that pops up when trying to load a Molmo Vision model in vLLM. It seems the warning was first introduced in this PR and carried along through many refactors, but I coul…
- #15254 [RFC]: Better support for weight updating while waking up from sleep mode for RLHF — RFC,stale — by erictang000 (closed: 2026-02-06 10:17 (UTC+8)) [💬14] ### Motivation
  Currently, when using sleep mode and wake_up() for RLHF, as in verl, we can run into issues like here, where waking up the vLLM engine OOMs due to additional copies of the model living on the same GPU.
  The issue in verl specifically lives in the [fsdp_vllm sharding manager](https://github.com/volcengine/verl/blob/main/verl/workers/sharding_manager/fsdp_vllm.py#L94…
- #19097 [RFC]: Response format extensions for structured outputs — structured-output,RFC,stale,v1 — by aarnphm (closed: 2026-02-06 10:17 (UTC+8)) [💬16] ### Motivation
  Currently, users can provide additional constraint formats via `extra_body` in the OpenAI client: ```python from enum import Enum from pydantic import BaseModel from openai import OpenAI
  simplified_sql_grammar = """ …
- #20581 [Bug]: Llama4 always goes OOM when using LLama4ForCausalLM — bug,stale — by Mirco-Ramo (closed: 2026-02-06 10:17 (UTC+8)) [💬9] ### Your current environment
  vllm==0.9.1, python==3.10, 8xA100 80GB, Ubuntu 22.04.5 LTS
  INFO 07-07 15:24:09 [__init__.py:244] Automatically detected platform cuda. Collecting environment information... ============================== System Info ============================== ...
- #26014 [Bug]: Bugs in the new logprobs_mode — bug,stale — by YunruiZhang (closed: 2026-02-06 10:16 (UTC+8)) [💬6] ### 🐛 Describe the bug
  The new logprobs_mode has many bugs. When set to logprobs_mode = 'processed_logprobs', it returns the same values as logprobs_mode = 'raw_logprobs'. Additionally, when set to 'processed_logits' or 'raw_logits', it does not produce any meaningful output: the logprobs are all -inf except for rank = 1, which gives 0. I am using MODEL_ID = "Qwen/Qwen3-30B-A3B-Thinking-2507-FP8"
- #26288 [Bug]: `schema` field becomes `None` in Responses API when `stream=True` — bug,stale — by WoutDeRijck (closed: 2026-02-06 10:16 (UTC+8)) [💬2] ### Your current environment
  vllm: 0.10.2, openai: 1.108.0
  ### 🐛 Describe the bug
  #### 1. Client Request Arrives ```python stream = await client.responses.create( …
- #26349 [Bug]: Qwen-3-VL-30B-A3B… is really slow on H100 — bug,stale — by sanakspock (closed: 2026-02-06 10:16 (UTC+8)) [💬2] ### Your current environment
  The output of `python collect_env.py`:
  ```text /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import] Collecting environment information... ... ```
- #33831 [Bug]: DeepSeek V3.2 benchmark failure "TypeError: argument 'tokens': 'NoneType' object" — bug — by wzhao18 (closed: 2026-02-06 07:50 (UTC+8)) [💬4] ### Your current environment
  The output of `python collect_env.py`:
  ```text Collecting environment information... ============================== ... ```
- #33917 [Feature]: Up-to-date docker images — feature request — by lee-b (closed: 2026-02-06 03:55 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
  The official vllm-openai docker image appears to be significantly outdated, even the nightly image!
  I was already having to build on top of the docker image and install a later transformers version to run some of the models that vLLM claims to support. However, after setting up a new host with CUDA 13.1, I found that the vLLM image won't run even with that update, due to host NVIDIA driver and CUDA library version incompatibilities.
  I'm having to apply…
- #33859 [Bug]: DeepSeek V3.2-NVFP4 with flashinfer moe reports `q must have dtype torch::kBFloat16` — bug — by kebe7jun (closed: 2026-02-06 03:22 (UTC+8)) [💬1] ### Your current environment
  The output of `python collect_env.py`:
  ```text GB200 on main branch ```…
- #33802 [CI Failure]: Distributed 2xH100 tests — torch.compile,ci-failure,needs reproduction — by ProExpertProg (closed: 2026-02-05 20:15 (UTC+8)) [💬5] ### Name of failing test
tests/compile/distributed/test_async_tp.py::test_async_tp_pass_correctness
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #33908 [Usage]: vLLM is an excellent project - thank you to the team! — usage — by Rainlin007 (closed: 2026-02-05 19:55 (UTC+8)) ### Your current environment
  The output of `python collect_env.py`
  ### How would you like to use vllm
  This is not really a usage question, but I wanted to take a moment to express my appreciation for the vLLM project. …
- #33873 [Bug]: BGE-M3 pooling endpoint fails with short inputs (< 4 tokens) in token_classify task — no label — by eveningcafe (closed: 2026-02-05 18:51 (UTC+8)) [💬1] ### Your current environment
- vLLM Version: v0.15.0
- Model: BAAI/bge-m3
- Task: token_classify
- Endpoint: /pooling
### 🐛 Describe the bug
The `/pooling` endpoint with `task: token_classify` returns a 400 error for inputs that tokenize to fewer than 4 tokens (including the CLS and SEP special tokens). The backend returns a single float value instead of a list of floats, causing Pydantic validation to fail. …
- #33401 [Bug]: [GML-4.5-Air-Fp8] RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {} — bug — by twilighgt (closed: 2026-02-05 15:37 (UTC+8)) [💬3] ### Your current environment
  vllm 0.12.0
  root@ubuntu:/vllm-workspace# python collect_env.py bash: python: command not found root@ubuntu:/vllm-workspace# python3 collect_env.py Collecting environment information... ============================== System Info …
- #33696 [Bug]: [CPU Backend] Whisper W8A8 failure — bug,cpu — by aditew01 (closed: 2026-02-05 14:26 (UTC+8)) [💬2] ### 🐛 Describe the bug
  Running the W8A8 quantized whisper model RedHatAI/whisper-large-v3-quantized.w8a8 results in the failure:
  ```(EngineCore_DP0 pid=35539) output = self.model_runner.execute_model( (EngineCore_DP0 pid=35539) File "/home/aditew01/envs/tvllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3452, in execute_model (EngineCore_DP0 pid=35539) ) = self._preprocess( (EngineCore_DP0 pid=…
- #33816 [CI Failure]: Quantized Models Test in `tests/models/quantization/test_gptq_marlin.py::test_models[5-32-bfloat16-model2]` — ci-failure — by mgoin (closed: 2026-02-05 13:11 (UTC+8)) [💬2] ### Name of failing test
  tests/models/quantization/test_gptq_marlin.py::test_models[5-32-bfloat16-model2]
  ### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #27669 [Bug]: AssertionError in lora_shrink_op.py during profile_run when serving Qwen3-VL-8B-Instruct with LoRA on v0.11.0 — bug,stale — by CaiNiaoLucifer (closed: 2026-02-05 11:29 (UTC+8)) [💬3] ### Your current environment
  The output of `python collect_env.py`:
  ```text Collecting environment information... ============================== System Info ============================== ... ```
[New PRs]
- #33961 Elastic AFD — documentation,frontend,ci/build,v1,deepseek — by gaidandawang-afk (created: 2026-02-06 11:14 (UTC+8)) [💬3 | +6714/-235, 57 files | commented:1]
  ## Purpose
  Basic Elastic AFD implementation based on #28296 and #29772.
  ## Test Plan
  ## Test Result
  ...
- #33960 [Core][WIP] Pipeline Parallel support for Model Runner V2 — v1 — by ZhanqiuHu (created: 2026-02-06 10:35 (UTC+8)) [💬1 | +402/-17, 2 files | commented:1 | 📝 draft] ## Summary
  This PR adds Pipeline Parallel (PP) support to the experimental Model Runner V2. It introduces a modular `PPHandler` class that encapsulates all PP-related logic, keeping the main model runner code clean. The current implementation uses blocking token synchronization and produces correct output verified against the Model Runner V1 baseline.
  ## Purpose
  Add Pipeline Parallel (PP) support to the experimental Model Runner V2 (`vllm/v1/worker/gpu/model_runner.py`). Related: #32455 (Q1 …
- #33949 [CI][MCP][Harmony] Heavy refactoring of Harmony & MCP response tests, stabilized with deterministic test infrastructure — frontend,gpt-oss — by AndreasKaratzas (created: 2026-02-06 07:26 (UTC+8)) [+980/-802, 8 files | commented:2]
  This PR eliminates systemic test flakiness in the Harmony and MCP Responses API integration tests by addressing root causes in both the test infrastructure and the source code. The core problem was that tests asserted on non-deterministic LLM output using `@pytest.mark.flaky`, which corrupted server fixture lifecycles and masked real failures. This PR replaces that pattern with deterministic infrastructure: pinned system prompts, API-level retries, and a clear separation between server invariants (h…
- #33937 [Bugfix] Fix mamba cache mode null-block padding — bug,v1 — by tianshu-Michael-yu (created: 2026-02-06 03:32 (UTC+8)) [💬1 | +52/-2, 2 files | commented:1] ## Purpose
  In `mamba_get_block_table_tensor`, block id `0` is reserved for `BlockPool.null_block` (never allocated) but can appear in block tables as a placeholder (e.g. mamba align mode). Mamba kernels treat `PAD_SLOT_ID` (-1) as padding; if we pass block id `0` through, kernels can read/write state for the shared null block, causing cross-request state corruption. This PR maps block id `0` to `PAD_SLOT_ID` for all mamba cache modes.
  ## Test Plan
  - `python -m pytest tests/v1/attention/tes…
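  The mapping the PR describes can be sketched as a toy illustration (`mask_null_blocks` is a hypothetical helper, not vLLM code):

  ```python
  PAD_SLOT_ID = -1   # sentinel mamba kernels treat as padding (per the PR)
  NULL_BLOCK_ID = 0  # block id reserved for BlockPool.null_block

  def mask_null_blocks(block_table: list[int]) -> list[int]:
      # Replace the reserved null-block id with the padding sentinel so
      # kernels never read or write state for the shared null block.
      return [PAD_SLOT_ID if b == NULL_BLOCK_ID else b for b in block_table]

  print(mask_null_blocks([3, 0, 7, 0]))  # -> [3, -1, 7, -1]
  ```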
- #33935 Update DeepGEMM version pin in Dockerfile to match #32479 — ci/build — by zifeitong (created: 2026-02-06 03:00 (UTC+8)) [💬2 | +2/-2, 2 files | commented:1] #32479 updated the version in `install_deepgemm.sh` but not the Dockerfile.
- #33912 Optimize popleft_n free-list traversal for KV cache blocks — v1 — by junxiangxiaoxiang (created: 2026-02-05 20:34 (UTC+8)) [💬3 | +9/-10, 1 file | commented:2] ## Purpose
  Optimize the popleft_n implementation used by the KV cache free list to reduce overhead when popping multiple blocks at once. The new implementation:
  - Traverses the free list once while collecting n blocks.
  - Updates the head pointer and neighboring links in a batched manner.
  - Avoids per-node pointer rewiring inside the loop.
  This keeps the semantics unchanged (same set and order of returned blocks, same list-structure invariants) while reducing Python overhead in the hot path.
  …
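  A self-contained sketch of the batched pop on a doubly linked free list (toy classes; this is not vLLM's actual free-list implementation):

  ```python
  class Block:
      def __init__(self, bid):
          self.bid = bid
          self.prev = None
          self.next = None

  class FreeList:
      def __init__(self, blocks):
          self.head = blocks[0] if blocks else None
          for a, b in zip(blocks, blocks[1:]):
              a.next, b.prev = b, a

      def popleft_n(self, n):
          # Single traversal: collect n nodes, then fix the head pointer
          # once instead of rewiring prev/next for every popped node.
          out, node = [], self.head
          for _ in range(n):
              out.append(node)
              node = node.next
          self.head = node
          if node is not None:
              node.prev = None
          for b in out:  # detach popped nodes in one batch pass
              b.prev = b.next = None
          return out

  fl = FreeList([Block(i) for i in range(5)])
  print([b.bid for b in fl.popleft_n(3)], fl.head.bid)  # -> [0, 1, 2] 3
  ```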
- #33909 [Models] Consolidate Deepseek-OCR2 processor — ready,deepseek — by Isotr0py (created: 2026-02-05 20:20 (UTC+8)) [+52/-336, 5 files | commented:1 approved:1]
  ## Purpose
  - The only processor difference between ds-ocr and ds-ocr2 is the `image_size` and the image token count calculation.
  - This PR consolidates their implementations into one processor to avoid duplicated code.
  ## Test Plan
  python examples/offline_inference/vision_language.py -m deepseek_ocr…
- #33959 Make voxtral compile-friendly — no label — by tugsbayasgalan (created: 2026-02-06 09:46 (UTC+8)) [💬1 | +9/-2, 1 file | commented:1]
  ## Purpose
  This PR makes sure the voxtral model can run with torch.compile by working around the torch.compile issue listed here (https://github.com/pytorch/pytorch/issues/174299). This is a temporary fix that will be gone starting with torch 2.11.
  ## Test Plan
  ## Test Result
  ...
- #33878 [CMake] Switch vllm-flash-attn to ExternalProject for separate scope (#9129) — ready,ci/build — by jungledesh (created: 2026-02-05 15:33 (UTC+8)) [💬4 | +33/-1, 1 file | commented:6]
  ## Purpose
  Implements item 4 from #9129: replace `include(cmake/external_projects/vllm_flash_attn.cmake)` with `ExternalProject_Add` so that `vllm-flash-attn` is built in its own isolated CMake scope and process. This eliminates shared-scope "footguns" such as variable overwrites, flag pollution, dependency mismatches, and duplicate-target issues between the main vLLM build and the external flash-attn repository.
  ## Changes
  …
- #33897 Fix nixl connector num blocks check logic — kv-connector — by Rugu7 (created: 2026-02-05 18:24 (UTC+8)) [💬2 | +2/-1, 1 file | commented:1]
  - Cause: When creating a PD split instance, the P and D instances calculated inconsistent `num_blocks` values. This inconsistency caused the API call to fail, triggering an assertion failure in the Decode split service and resulting in the service hanging.
  - Modification: For the same instance in the NIXL connector, the `num_blocks` value is set to …
- #33928 [Refactor] Move sequence normalization and enc-dec parsing to renderer — frontend,ready,needs-rebase,v1,multi-modality — by DarkLight1337 (created: 2026-02-06 00:43 (UTC+8)) [💬3 | +520/-491, 19 files | commented:3] ## Purpose
  - Introduce `*DictPrompt` classes to represent prompts that have been normalized into dictionaries, replacing the `Parsed*Prompt` classes.
  - Update and improve documentation of various input schemas.
  - Move the logic of normalizing a single prompt/conversation, or a sequence of them, into the Renderer.
  - Move the logic of rendering encoder-decoder inputs into the Renderer. (Tokenization will be moved in another PR.)
  - Remove dead code related to encoder-decoder models: …
- #33941 [Bugfix] [ROCm] Fix premature CUDA initialization in platform detection — bug,rocm,ready,ci/build,nvidia — by kouroshHakha (created: 2026-02-06 04:37 (UTC+8)) [💬5 | +133/-6, 6 files | commented:4] ## Summary
  On ROCm, importing `vllm.platforms` triggers `torch.cuda.get_device_properties()` at module load time, which initializes CUDA before Ray workers can set `CUDA_VISIBLE_DEVICES`. This locks `device_count()` to the total number of GPUs, causing all workers to incorrectly use GPU 0 instead of their assigned GPUs. Closes https://github.com/vllm-project/vllm/issues/33938
  ## Problem
  ```python # vllm/platforms/rocm.py (before fix) …
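  The general fix pattern (defer device queries until first use rather than at import time) can be illustrated with a toy stand-in; `_init_driver` and this `get_device_properties` are hypothetical, not the PR's actual code:

  ```python
  import functools

  _initialized = False  # stands in for "CUDA context created"

  def _init_driver():
      global _initialized
      _initialized = True

  @functools.lru_cache(maxsize=None)
  def get_device_properties(device_id: int) -> dict:
      # Lazy: the driver is initialized only on first call, so workers can
      # still set CUDA_VISIBLE_DEVICES between import and first use.
      _init_driver()
      return {"id": device_id}

  print(_initialized)  # importing/defining alone does not init the driver
  get_device_properties(0)
  print(_initialized)  # first use does
  ```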
- #33958 Fix double-scaled linear RoPE cache for long-context configs #33911 — no label — by baonudesifeizhai (created: 2026-02-06 09:42 (UTC+8)) [+17/-1, 1 file | commented:1]
  ## Purpose
  #33911: Gemma-3 configs report a scaled max_position_embeddings (128K) but still include rope_scaling.factor=8, which makes linear RoPE caches scale twice and breaks torch.compile guards. This change adjusts linear RoPE construction to avoid double scaling when original_max_position_embeddings is absent and the context is very large, preventing the 128K→1,048,576 cache blow-up.
  ## Test Plan
  ``` vllm serve google/gemma-3-27b-it --max-model-len 4096 ... ```
- #33957 [Docs] Add reo analytics — documentation — by simon-mo (created: 2026-02-06 09:28 (UTC+8)) [💬1 | +4/-0, 2 files | commented:1] For vLLM community analytics
-
#33952 [CI][BugFix][AMD] Add check for model_config being None and update conftest.py to load AITER of available to fix Kernels MoE Test % N — bug,rocm — by rasmith (创建于: 2026-02-06 08:06 (UTC+8)) [💬1 | +22/-4, 3 files | commented:4]
## Purpose This PR broke many tests (over 30) and this PR fixed one test in the `Kernels MoE Test %N` group, but when the tests are run as a group using `pytest -sv kernels/moe`, the first test that runs does not load AITER ops, and when subsequent tests run, they will also not have AITER ops loaded.
This PR loads the ops in `vllm._aiter_ops` but then ensures tha…
-
#33956 [Bugfix] Fix video frame sampling for short videos and Qwen3-VL 2-frame requirement — bug,multi-modality,qwen — by chengyinie (创建于: 2026-02-06 09:05 (UTC+8)) [💬1 | +11/-1, 2 files | commented:1]
## Purpose Fixes video processing failures for very short videos (< 1 second) when using low `fps` settings. Problem:
- When `fps * duration < 1`, the frame sampling formula `int(math.floor(duration * fps))` returns 0 frames
- This causes `Qwen3VLProcessor` to fail with empty video arrays
- Additionally, Qwen3-VL requires a minimum of 2 frames, so even 1-frame videos fail
Solution: …
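The failure mode and a clamped fix can be sketched as follows (the guarded form is an assumed shape of the solution, not the literal patch):

```python
import math

# Sketch of the failure mode described above and a guarded fix.
def sampled_frames_buggy(duration: float, fps: float) -> int:
    # Sub-second clips with low fps floor to 0 frames -> empty video array.
    return int(math.floor(duration * fps))

def sampled_frames_fixed(duration: float, fps: float, min_frames: int = 2) -> int:
    # Clamp to the model's minimum so short clips still yield 2 frames
    # (Qwen3-VL requires at least 2).
    return max(min_frames, int(math.floor(duration * fps)))

print(sampled_frames_buggy(0.5, 1.0))  # prints: 0
print(sampled_frames_fixed(0.5, 1.0))  # prints: 2
```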
-
#33934 [Frontend]Add support for transcriptions and translations to run_batch — frontend,needs-rebase — by pooyadavoodi (创建于: 2026-02-06 02:46 (UTC+8)) [💬2 | +555/-101, 3 files | commented:4] ## Purpose Adding support for transcriptions and translations in the vllm OpenAI batch API (i.e. run_batch.py).
Note that the OpenAI Batch API doesn’t support transcriptions or translations. This will be a vllm feature only AFAIK.
## Test Plan Adding new unit tests to tests/entrypoints/openai/test_run_batch.py
## Test Result pytest -v tests/entrypoints/openai/test_run_batch.py …
-
#33955 Rocm/fused short seq attention — rocm — by MohamedSayedFathy (创建于: 2026-02-06 08:26 (UTC+8)) [💬1 | +263/-54, 1 files | commented:1]
## Purpose
Optimize PagedAttention v2 decode performance on ROCm for short sequences (≤256 tokens) by eliminating the reduce kernel when only a single partition is needed.
When `max_num_partitions == 1`, the QKV kernel writes directly to `final_out` instead of `tmp_out`, skipping:
- The `tmp_out` intermediate buffer write (~8 KB per seq/layer)
- The `tmp_out` read-back in the reduce kernel (~8 KB per seq/layer)
- The `exp_sums`/`max_logits` metadata writes (~256 …
-
#33919 Fix RoutingMethodType logic — 无标签 — by dbari (创建于: 2026-02-05 22:51 (UTC+8)) [💬1 | +47/-11, 4 files | commented:1 | 📝草稿]
## Purpose
This PR contains two fixes for #33792:
- Fix the selection logic for `RoutingMethodType` in `fused_topk_bias_router.py` and `fused_topk_router.py`
- When `use_grouped_topk=True` but the `GroupedTopKRouter` does not find any valid routing method and there is only one group, fall back to the non-grouped routers
The latter point covers Mistral Large 3, which has `n_group=1` and `topk_group=1` but uses `Renormalize` instead of `DeepSeekV3` routing, handled by…
-
#33953 Regular FP8 LoRA kernels — 无标签 — by yugong333 (创建于: 2026-02-06 08:12 (UTC+8)) [+1395/-0, 4 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#33948 [Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache — bug,rocm — by Rohan138 (创建于: 2026-02-06 07:25 (UTC+8)) [💬2 | +17/-53, 1 files | commented:2 approved:1]
Reuse `AttentionBackend` utils to initialize the dummy KV cache with the required shape and stride. Also, it turns out this UT is being skipped on ROCm because of a typo/missing rename, so `BACKENDS` -> `BACKENDS_FP8`.
## Purpose
## Test Plan
## Test Result
…
-
#33945 [torch.compile][Fusion] Fix attention fusion pass removing kv_update op. — ready,torch.compile — by charlifu (创建于: 2026-02-06 06:20 (UTC+8)) [💬1 | +5/-0, 1 files | commented:2 approved:1] Add a `kv_cache_dummy_dep` parameter to the attention fusion pass to make sure the resulting IR keeps this parameter. Otherwise the `kv_cache_update` op will be removed by the cleanup pass.
-
#33936 [Doc] Add DCP support to attention backend doc — documentation — by mgoin (创建于: 2026-02-06 03:24 (UTC+8)) [💬2 | +769/-644, 2 files | commented:1] ## Purpose
Also refactor table rendering to avoid duplication, and extract FA/FI variant expansion and compute-capability parsing
## Test Plan
## Test Result
... -
#33947 [WIP][CI] Add integration tests for P2pNccl and LMCache connectors — tpu,ci/build,v1,kv-connector — by eicherseiji (创建于: 2026-02-06 06:25 (UTC+8)) [+32/-6, 10 files | commented:1 | 📝草稿] ## Summary
Add CI coverage for P2pNcclConnector and LMCacheConnectorV1 - currently only NixlConnector has integration tests in CI.
## Changes
- Parameterize `run_accuracy_test.sh`: accept a `KV_CONNECTOR` environment variable (defaults to `NixlConnector` for backwards compatibility)
- Add P2pNccl Connector test: 2 GPUs, 20 min timeout
…
-
#33940 [Docs] Add sections on vLLM’s process count and minimum CPU resources — documentation,ready — by mgoin (创建于: 2026-02-06 04:31 (UTC+8)) [💬3 | +113/-0, 4 files | commented:2]
## Purpose
It seems users can be confused about vLLM’s performance when running with very small amounts of CPU cores available. We are missing a clear overview of what vLLM’s process architecture is, so I added this along with some diagrams in arch_overview.md, and included a section on CPU resource recommendations in optimization.md
Here are some example diagrams I generated with help from Opus 4.6
<img width="2816" height="1536" alt="Generated_image1" src="htt…
-
#33946 Add Video stats to response metadata — frontend — by zakariaelh (创建于: 2026-02-06 06:23 (UTC+8)) [💬1 | +57/-13, 9 files | commented:1] -
#33943 move checks out of `unified_kv_cache_update` custom op — rocm,v1 — by Rohan138 (创建于: 2026-02-06 05:56 (UTC+8)) [+79/-100, 7 files | commented:3]
## Purpose Move checks for k, v, and kv_sharing_target_layer_name up from `unified_kv_cache_update` into `Attention.forward`. Note that the other PRs in #32335 should follow this pattern as well; I think after all those PRs are merged, we can safely remove the `kv_sharing_target_layer_name` arg from the `AttentionImpl` class entirely.
## Test Plan
## Test Result …
-
#33944 [Log] Optimize duplicate startup log — v1 — by yewentao256 (创建于: 2026-02-06 06:07 (UTC+8)) [+10/-7, 3 files | commented:1] ## Purpose
Optimize logs like
```
INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner
INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner
INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner
INFO 02-05 21:47:09 [gpu_worker.py:122] Using V2 Model Runner
```
-
#33942 Adding support to Sarvam’s MoE models — new-model — by rahul-sarvam (创建于: 2026-02-06 04:54 (UTC+8)) [💬2 | +717/-0, 2 files | commented:1] This PR adds support for Sarvam MoE model executors in vLLM.
SarvamMoEForCausalLM implements a standard MoE architecture with conventional MHA, while SarvamMLAForCausalLM implements an MoE variant using MLA. The implementation reuses existing vLLM primitives (MLA modules, fused MoE, TP/PP support, and quantization paths), with minimal extensions for Sarvam-specific routing, expert bias normalization, and weight-loading compatibility.
- #33939 Enable Eagle3 speculative decoding for Mistral3ForConditionalGeneration to support eagle3 — 无标签 — by TundeAtSN (创建于: 2026-02-06 03:53 (UTC+8)) [💬1 | +9/-1, 1 files | commented:1] This PR adds support for Eagle3 spec decoding for Mistral3ForConditionalGeneration model. Changes were tested with a locally trained speculator model, and observed reasonable acceptance rates.
-
#33926 [BugFix] Fix small race condition when pausing generation for RL — bug,ready,v1 — by njhill (创建于: 2026-02-06 00:29 (UTC+8)) [+10/-11, 2 files | commented:3] Related to https://github.com/vllm-project/vllm/pull/28037.
Without this, it’s technically possible for a request to slip in after pausing.
There’s also an issue with multiple API servers, but a check for that will be added in https://github.com/vllm-project/vllm/pull/32351.
cc @SamitHuang @hao-aaron
-
#33866 fix: Qwen3ReasoningParser - handle prompt prefix format for Thinking models — qwen — by seli-equinix (创建于: 2026-02-05 13:34 (UTC+8)) [💬5 | +16/-22, 1 files | commented:4] ## Summary
Fixes `Qwen3ReasoningParser` to correctly extract reasoning content from Qwen3-Thinking models (e.g., `Qwen/Qwen3-Next-80B-A3B-Thinking-FP8`).
## Problem
The current parser requires both `<think>` and `</think>` tags in the model output:
```python
if self.start_token not in model_output or self.end_token not in model_output:
    …
```
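The relaxed parsing described here can be sketched as follows: Thinking models may have the opening `<think>` consumed by the prompt prefix, so only `</think>` appears in the output. The helper below is hypothetical, not vLLM's exact parser code:

```python
# Hypothetical sketch of prefix-tolerant reasoning extraction: accept output
# where only the closing </think> tag is present.
def split_reasoning(output: str, start: str = "<think>", end: str = "</think>"):
    if end not in output:
        return None, output  # no reasoning section present
    head, _, rest = output.partition(end)
    if head.startswith(start):
        head = head[len(start):]  # strip the opening tag when it is present
    return head.strip(), rest.strip()

print(split_reasoning("some thoughts</think>final answer"))
# prints: ('some thoughts', 'final answer') -- works without the opening tag
```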
- #33881 [BugFix][kv_offload] Fix offloading decodes with async scheduling — bug,ready,kv-connector — by orozery (创建于: 2026-02-05 15:53 (UTC+8)) [💬1 | +7/-2, 1 files | commented:1] This PR fixes OffloadingConnector to offload also decoded tokens, which were previously skipped as their block hash was not available. This issue only occurs when async scheduling is enabled.
- #33933 [WIP] part 2: helion integration into vllm gemm+fp8+all_gather — 无标签 — by LironKesem (创建于: 2026-02-06 02:28 (UTC+8)) [+573/-10, 3 files | commented:1 | 📝草稿]
Current state:
- I was able to integrate it -> the test passes, but I have a leak; haven't looked into it yet.
```
TORCH_LOGS="+dynamic" NCCL_DEBUG=TRACE USE_HELION_BACKEND=1 VLLM_LOGGING_LEVEL=DEBUG VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_async_tp.py -k "TestHelionAGScaledMMModel" --log-cli-level=DEBUG
```
TODOs (in this PR):
- move kernel registration to `collective_fusion.py`
- move the kernel implementation into `vllm/vllm/kernels/helion/distributed…
-
#33932 [Bugfix] Fix DSV3.2 NVFP4 — bug,ready — by MatthewBonanni (创建于: 2026-02-06 01:18 (UTC+8)) [💬1 | +4/-2, 1 files | commented:1 approved:1] ## Purpose Fix https://github.com/vllm-project/vllm/issues/33859
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#33904 [Misc] Rename `translations` to `speech_to_text` for OAI serving component — frontend,ready — by NickLucche (创建于: 2026-02-05 19:20 (UTC+8)) [💬1 | +11/-11, 7 files | approved:1 commented:3] Follow up to https://github.com/vllm-project/vllm/pull/32313/ as expressed in this comment https://github.com/vllm-project/vllm/pull/32313/#issuecomment-3769015464; translations is likely too reductive for the ASR endpoints. cc @DarkLight1337
- #33902 Fix tokenizer test for renamed attr on Transformers v5 — ready — by hmellor (创建于: 2026-02-05 18:58 (UTC+8)) [+9/-1, 1 files | commented:1 approved:1] This attr has been renamed in Transformers v5. This PR updates the test to check both locations.
-
#33890 [kernel] use fused_topk of flashinfer — 无标签 — by ZJY0516 (创建于: 2026-02-05 17:05 (UTC+8)) [💬1 | +55/-10, 1 files | commented:2] ## Purpose Use `fused_topk` of flashinfer. TODO: I don't know when to use this for other models
## Test Plan
```
vllm serve deepseek-ai/DeepSeek-V3 \
  --trust-remote-code \
  …
```
-
#33896 [Bugfix] Fix illegal memory access in AWQ-Marlin with CUDA graphs (Fixes #32834) — bug,needs-rebase,nvidia — by KrxGu (创建于: 2026-02-05 18:10 (UTC+8)) [💬3 | +3648/-3444, 4 files | commented:4] This PR fixes issue #32834 where AWQ-Marlin quantization causes illegal memory access when used with CUDA graphs.
## Problem AWQ-Marlin kernels use workspace buffers allocated during model loading. The pointers to these buffers are captured during CUDA graph creation, but the buffers themselves can be reallocated or moved, making the captured pointers stale. When the CUDA graph replays, it uses these stale pointers, resulting in illegal memory access errors.
## Solution Automatically enable `c…
-
#33868 support view_from_cpu_tensor on XPU — 无标签 — by xinyu-intel (创建于: 2026-02-05 13:41 (UTC+8)) [+5/-1, 1 files | commented:2]
## Purpose
support get_cuda_view_from_cpu_tensor on XPU. This will be used for ModelRunnerV2
## Test Plan
## Test Result
…
- #33922 Support benchmarking of Geospatial models — performance,multi-modality — by mgazz (创建于: 2026-02-05 23:58 (UTC+8)) [+106/-50, 3 files | commented:3]
This PR adds the following features necessary for Geospatial models like Prithvi:
- Ability to run benchmarks when the tokenizer is not initialised, via a new argument `--skip-tokenizer-init`. In this case the benchmark does not print token-related metrics.
- A new benchmark backend that supports sending requests to the /pooling endpoint.
The benchmark can be run as follows:
Step 1: start the server
```
…
```
-
#33931 [Misc] Add debug logs — kv-connector — by NickLucche (创建于: 2026-02-06 00:58 (UTC+8)) [+5/-0, 2 files | commented:1 approved:1] Nixl is currently missing some minor yet vital debug logs that I find particularly useful when trying to debug issues without access to the machine. This PR logs kv cache shapes to address that.
cc @orozery
-
#33879 [BugFix] Fix LoRA Fp8 — bug,ready,ci/build — by danisereb (创建于: 2026-02-05 15:43 (UTC+8)) [💬2 | +14/-8, 1 files | commented:2 approved:1] ## Purpose
When running Nemotron Nano FP8: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
Using LoRA adapters fails:
`ValueError: ModelOptFp8MoEMethod uses the new modular kernel initialization logic. This function should not be called.`…
-
#33929 [Bugfix][Docker] Install CUDA dev packages for JIT compilation headers — bug,ci/build,nvidia — by jasonlizhengjian (创建于: 2026-02-06 00:50 (UTC+8)) [💬1 | +3/-3, 1 files | commented:1 | 📝草稿] Install cuda-cudart-dev, cuda-nvrtc-dev, and libcublas-dev instead of runtime-only packages to provide headers (cuda.h, cuda_runtime.h, nvrtc.h, cublasLt.h) needed for FlashInfer JIT compilation of fp8_blockscale_gemm_sm90 kernels.
Fixes #33833
## Purpose
## Test Plan
…
-
#33927 [Bugfix] Fix tokenizer model_max_length incorrectly constraining user-specified max_model_len — bug — by soyr-redhat (创建于: 2026-02-06 00:41 (UTC+8)) [💬1 | +73/-3, 2 files | commented:1] This fixes a bug where the tokenizer's model_max_length would override the user's explicit `--max-model-len` parameter, preventing users from utilizing extended context lengths even when the model supports them.
For example, with Qwen/Qwen3-4B-Instruct-2507-FP8:
- Model supports 256K tokens (max_position_embeddings = 262144)
- Tokenizer config has model_max_length = 8192
- User specifies `--max-model-len 262144`
- Before: Server would cap at 8192 and reject 10K token requests
- After: Server correc…
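The intended precedence can be sketched as follows. This is an assumed shape of the logic inferred from the description (the function name and signature are hypothetical, not vLLM's code):

```python
# Hypothetical sketch: an explicit user --max-model-len should win over the
# tokenizer's model_max_length, bounded only by max_position_embeddings.
def resolve_max_model_len(user_max, tokenizer_max, model_max):
    if user_max is not None:
        return min(user_max, model_max)   # honor the user's explicit choice
    return min(tokenizer_max, model_max)  # otherwise fall back to tokenizer cap

# Numbers from the Qwen3-4B-Instruct-2507-FP8 example above:
print(resolve_max_model_len(262144, 8192, 262144))  # prints: 262144
print(resolve_max_model_len(None, 8192, 262144))    # prints: 8192
```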
-
#33924 Add test for Nemotron Nano FP8 with LoRA adapters — 无标签 — by danisereb (创建于: 2026-02-06 00:14 (UTC+8)) [+72/-0, 2 files | commented:1 | 📝草稿]
## Purpose
Verify that Nemotron Nano FP8 works with LoRA adapters.
Added following this bugfix (test works only with this fix): https://github.com/vllm-project/vllm/pull/33879
## Test Plan …
-
#33901 [Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul — cpu — by nikhil-arm (创建于: 2026-02-05 18:48 (UTC+8)) [💬2 | +10/-2, 1 files | commented:2 approved:1] Improvements on Neoverse v1:
| Metric | Before | After | Speedup |
|---|---|---|---|
| Throughput (rps) | 0.30 | 0.55 | +83% |
| Total tok/s | 348.70 | 632.84 | +81% |
| Out tok/s | 38.74 | 70.32 | +82% |
fix formatting issues …
-
#33918 [WIP][Kernel] Add Helion kernel for static_scaled_fp8_quant — 无标签 — by xiaohongchen1991 (创建于: 2026-02-05 22:46 (UTC+8)) [+218/-0, 2 files | commented:1] ## Purpose This PR adds a Helion kernel for the `static_scaled_fp8_quant` operation. It follows the implementation from the vLLM C version. This is a subtask for https://github.com/vllm-project/vllm/issues/32962.
## Test Plan
- Test correctness
  - Added unit test to cover its correctness
- Benchmark performance at kernel level with different kernel imple…
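The math behind a static (precomputed-scale) FP8 quantization can be sketched in a few lines; this pure-Python illustration only shows the scale-and-clamp step, not the actual Helion or C kernel:

```python
# Minimal sketch of static scaled FP8 quantization: divide by a precomputed
# scale, then clamp to the float8_e4m3 finite range. The real kernel operates
# on GPU tensors; this just illustrates the arithmetic.
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3 value

def static_scaled_fp8_quant(xs, scale):
    return [max(-FP8_E4M3_MAX, min(x / scale, FP8_E4M3_MAX)) for x in xs]

print(static_scaled_fp8_quant([100.0, -1000.0, 50000.0], scale=100.0))
# prints: [1.0, -10.0, 448.0] -- the out-of-range value is clamped
```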
- #33903 [NIXL] Version bump — ci/build,kv-connector — by NickLucche (创建于: 2026-02-05 19:03 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1] Regular version bump for NIXL.
-
#33898 [Metrics] Some small refactoring for better maintainability — speculative-decoding,v1,kv-connector — by hickeyma (创建于: 2026-02-05 18:25 (UTC+8)) [+117/-122, 6 files | commented:1 changes:1] ## Purpose
When reviewing #30950, I identified some parts of the code that if refactored would benefit the code base from a maintainability and bugs pov.
The issues identified are as follows:
- Use suffix `Prom` instead of `Prometheus` for consistency. See `KVConnectorPrometheus` in vllm/distributed/kv_transfer/kv_connector/v1/metrics.py
- Consolidate the duplicated helper function `make_per_engine` into one function instead of duplicates
- Rename `make_per_engine` -> `create_met…
-
#33905 [Docs] Add bart-plugin to docs — documentation,ready — by NickLucche (创建于: 2026-02-05 19:38 (UTC+8)) [💬1 | +17/-4, 3 files | commented:2 approved:1] Add bart-plugin https://github.com/vllm-project/bart-plugin to docs, mention it as a reference for OOT Model implementation, and add `BartForConditionalGeneration` to supported models. cc @DarkLight1337
-
#33907 [Bugfix] Fix Random Dataset Prefix Length Inaccuracy — bug,performance — by frankwang28 (创建于: 2026-02-05 19:45 (UTC+8)) [+29/-10, 1 files | commented:1]
## Purpose This PR fixes the `RandomDataset` prefix generation so the shared prefix token length is adjusted via decode-encode tokenization, similar to the way the input prompt is generated.
## Test Plan Observe prefix cache hit rate before and after. For testing, a prefix length of 4096 with an input length of 1024 is used. The expected prefix cache hit rate should thus be around 80% (4096/(4096+1024) = 0.8).
## Test Result Before: …
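The expected-hit-rate arithmetic from the test plan, reproduced:

```python
# Each request shares a 4096-token prefix and appends 1024 unique tokens, so
# with a warm prefix cache the expected hit rate is prefix / (prefix + input).
prefix_len, input_len = 4096, 1024
expected_hit_rate = prefix_len / (prefix_len + input_len)
print(expected_hit_rate)  # prints: 0.8
```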
-
#33886 [Bugfix] Fix corner case of sparse embedding — bug,ready — by noooop (创建于: 2026-02-05 16:35 (UTC+8)) [💬1 | +11/-1, 2 files | commented:1 approved:1]
## Purpose
Fix #33873
## Test Plan
tests/models/language/pooling/test_bge_m3.py
…
-
#33875 Update kv_cache_utils.py — v1,tool-calling — by junxiangxiaoxiang (创建于: 2026-02-05 15:04 (UTC+8)) [💬7 | +194/-195, 28 files | commented:1] Optimize popleft_n by pre-allocating list
## Purpose
This PR optimizes the `popleft_n` method in `FreeKVCacheBlockQueue` to improve performance when popping multiple free KV cache blocks from the head of the free list. The original implementation used repeated `list.append()` and reset the `prev_free_block`/`next_free_block` pointers of each popped block inside the loop, which introduces unnecessary overhead. The optimized version applies two key improvements:
- **Pre-allocate the result li…
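The pre-allocation idea can be sketched on a simplified singly linked free list. vLLM's `FreeKVCacheBlockQueue` is doubly linked and does more bookkeeping (pointer resets are omitted here); this only illustrates replacing repeated `append()` with a pre-sized result list:

```python
# Simplified sketch of the pre-allocation optimization described above.
class Block:
    __slots__ = ("id", "next")
    def __init__(self, id):
        self.id = id
        self.next = None

def popleft_n(head: Block, n: int):
    out = [None] * n  # pre-allocate instead of repeated list.append()
    cur = head
    for i in range(n):
        out[i] = cur
        cur = cur.next
    return out, cur  # popped blocks and the new head

# Build the free list 1 -> 2 -> 3 -> 4 and pop two blocks from the head.
blocks = [Block(i) for i in range(1, 5)]
for a, b in zip(blocks, blocks[1:]):
    a.next = b
popped, new_head = popleft_n(blocks[0], 2)
print([b.id for b in popped], new_head.id)  # prints: [1, 2] 3
```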
-
#33876 [Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading — bug,ready,deepseek — by Isotr0py (创建于: 2026-02-05 15:24 (UTC+8)) [+15/-5, 2 files | commented:1 approved:1]
## Purpose
`nvidia/Kimi-K2.5-NVFP4` is quantized based on the legacy model layout (`language_model.layers.*`), which was refactored in #33346. Since v0.15.0 has been released, this PR adds backward compatibility to load this NVFP4 checkpoint.
## Test Plan Run with `nvidia/Kimi-K2.5-NVFP4`:
```
python examples/offline_inference/vision_language.py -m kimi_k25 --modality vision_chunk …
```
-
#33900 [Misc] Update code for encoder-decoder models — ready,v1,multi-modality — by DarkLight1337 (创建于: 2026-02-05 18:47 (UTC+8)) [+9/-3, 2 files | approved:1 commented:1]
## Purpose
FIX https://github.com/vllm-project/vllm/pull/33559#discussion_r2754749999 FIX https://github.com/vllm-project/vllm/pull/33559#discussion_r2754759311
## Test Plan
## Test Result …
-
#33892 [W8A8 Block Linear Refactor][3/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections. — performance,nvidia — by maralbahari (创建于: 2026-02-05 17:32 (UTC+8)) [+1617/-760, 20 files | commented:2 | 📝草稿]
## Purpose This PR refactors the FP8 block scaled linear kernel integration in vLLM to improve code clarity, maintainability, and path consistency.
Removes the `W8A8Fp8BlockLinearOp` class and updates all code paths and files that use this class.
## Test Plan CUDA platform: run CI/CD tests. …
-
#33891 [W8A8 Block Linear Refactor][2/N] Make Fp8 block linear Op use kernel abstraction. — nvidia — by maralbahari (创建于: 2026-02-05 17:29 (UTC+8)) [+1236/-15, 11 files | commented:1]
## Purpose This PR refactors the FP8 block scaled linear kernel integration in vLLM to improve code clarity, maintainability, and path consistency.
This PR Extracts only block-scaled kernels into the kernel abstraction
## Test Plan …
-
#33893 [W8A8 Block Linear Refactor][4/N] Make all scaled MM kernels inherit from common generic base. — performance,needs-rebase,cpu,nvidia — by maralbahari (创建于: 2026-02-05 17:34 (UTC+8)) [💬1 | +1881/-1018, 23 files | commented:1 | 📝草稿]
## Purpose This PR unifies the inheritance of all types of ScaledMM kernels under the same superclass.
Applies base inheritance to the remaining ScaledMM kernels for consistent code and improved maintainability of linear kernel classes.
## Test Plan
## Test Result …
-
#33884 [Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast — bug,deepseek — by kebe7jun (创建于: 2026-02-05 16:15 (UTC+8)) [+16/-4, 1 files | commented:1]
## Purpose
Fix https://github.com/vllm-project/vllm/issues/33883
cc @LucasWilkinson
## Test Plan
…
-
#33870 [CI/Build] Fix CPU CI test case title — ready,ci/build,v1,cpu — by bigPYJ1151 (创建于: 2026-02-05 14:30 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1]
## Purpose
`CPU-TP/DP/PP Tests` may violate Buildkite's key naming rule.
## Test Plan
## Test Result
…
-
#33860 Add FlashAttention v2.8.3 scaling benchmark on Mistral-7B (H100) — performance — by RegularJoe-CEO (创建于: 2026-02-05 12:00 (UTC+8)) [💬4 | +34/-0, 1 files | commented:1] ## Purpose
Add baseline FlashAttention v2.8.3 scaling benchmark data on NVIDIA H100 for reference and validation purposes. This documents attention mechanism performance across sequence lengths from 1K to 32K tokens on Mistral-7B-v0.1.
## Test Plan
```
cd ~/vllm/benchmarks/attention_benchmarks
python vllm_longcontext_benchmark.py 2>&1 | tee mistral_flash_scaling.log
```
…
-
#33867 [WIP][Kernel] Add Helion kernel for dynamic_per_tensor_scaled_fp8_quant — 无标签 — by xiaohongchen1991 (创建于: 2026-02-05 13:35 (UTC+8)) [+120/-0, 2 files | commented:1] ## Purpose This PR adds a Helion kernel for the `dynamic_per_tensor_scaled_fp8_quant` operation. It follows the implementation from the vLLM C version. This is a subtask for https://github.com/vllm-project/vllm/issues/32962.
## Test Plan
- Test correctness
  - Added unit test to cover its correctness
- Benchmark performance at kernel level with different …
-
#33862 Waller Operator: Constant 14ms attention latency across 512-524K tokens (24.5x faster than FlashAttention at 32K) — performance,ci/build,v1 — by RegularJoe-CEO (创建于: 2026-02-05 12:12 (UTC+8)) [💬4 | +42/-13, 4 files | commented:1] ## Purpose
Introduce the Waller Operator attention mechanism demonstrating constant O(N log N) memory complexity and O(1) latency characteristics. This PR provides head-to-head benchmark comparison against FlashAttention v2.8.3 baseline (PR #33860).
Key Results:
- Constant latency: 14.168-14.305ms across 1000x sequence length increase (512 → 524K tokens)
- 24.5x faster than FlashAttention at 32K tokens
- Zero throughput degradation vs FlashAttention’s 76% loss
- **Scalability:** …
-
#33863 [docs] fix unintentional misspellings — documentation — by rinbaro (创建于: 2026-02-05 12:17 (UTC+8)) [💬1 | +3/-3, 3 files | commented:1 approved:1] ### Motivation
These misspellings looked unintentional.
-
#33861 [docs] fix unintentional misspellings — documentation — by rinbaro (创建于: 2026-02-05 12:09 (UTC+8)) [💬2 | +3/-3, 3 files | commented:1] ## Motivation make docs great again.
These 3 looked unintentional.
[已合并 PR]
-
#33511 fix(ROCm): Make flash_attn import optional in MLA attention — rocm,ready — by rabi (合并于: 2026-02-06 10:22 (UTC+8)) [💬2 | +19/-3, 1 files | commented:5 approved:2] ## Purpose
On ROCm, models that don't use MLA were failing to load because attention/__init__.py eagerly imported MLAAttention, which in turn tried to import flash_attn unconditionally.
- Makes flash_attn import optional in mla_attention.py with try/except
- Adds a clear error message when MLA is used without flash_attn
This allows non-MLA models to work on ROCm without needing flash_attn installed.
## Test Plan …
-
#33909 [Models] Consolidate Deepseek-OCR2 processor — ready,deepseek — by Isotr0py (合并于: 2026-02-06 02:29 (UTC+8)) [+52/-336, 5 files | commented:1 approved:1]
## Purpose
- Actually, the processor difference between ds-ocr and ds-ocr2 is only the `image_size` and image token num calculation.
- This PR consolidates their implementation into one processor to avoid duplicate code.
## Test Plan
python examples/offline_inference/vision_language.py -m deepseek_ocr…
- #33957 [Docs] Add reo analytics — documentation — by simon-mo (合并于: 2026-02-06 09:42 (UTC+8)) [💬1 | +4/-0, 2 files | commented:1] For vLLM community analytics
-
#33568 [Perf] Disable clean_logits in deepgemm fp8_mqa_logits kernel — ready — by xyang16 (合并于: 2026-02-06 09:34 (UTC+8)) [💬2 | +61/-27, 4 files | commented:4 approved:2] ## Purpose
During profiling of DeepSeek V3.2, I noticed in sparse_attn_indexer that the `deep_gemm::smxx_clean_logits` kernel was launched following `deep_gemm::sm90_fp8_mqa_logits`, because `clean_logits` is set to true in https://github.com/vllm-project/vllm/blob/v0.16.0rc0/vllm/utils/deep_gemm.py#L331. The purpose of the `deep_gemm::smxx_clean_logits` kernel is to fill the padding values of the MQA logits with -inf, see https://github.com/deepseek-ai/DeepGEMM/blob/v2.1.1/README.md?plain=1#L124. However, I foun…
-
#31162 [Feature] OTEL tracing during loading — frontend,ready,ci/build,v1,cpu — by emricksini-h (合并于: 2026-02-06 08:59 (UTC+8)) [💬16 | +873/-280, 29 files | commented:7 approved:1] ## Purpose
This PR instruments the vLLM loading process with OpenTelemetry (OTel) tracing to provide precise, span-based observability into initialization performance.
## Why we need it Currently, performance insights for vLLM’s startup phase rely on parsing unstructured logs. Tools like llm-d-benchmark depend on specific log strings to calculate durations [see here](https://github.com/llm-d/llm-d-benchmark/blob/main/workload/harnesses/nop-llm-d-benc…
-
#33832 [Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue — bug,ready,deepseek — by wzhao18 (合并于: 2026-02-06 07:50 (UTC+8)) [💬2 | +4/-0, 1 files | commented:6 approved:1]
## Purpose Fix #33831
## Test Plan
```
lm_eval --model local-completions --model_args "model=deepseek-ai/DeepSeek-V3.2,base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5
```
…
-
#33527 Adds padding and perf improvements to wvSplitK_fp8 — rocm,ready — by amd-hhashemi (合并于: 2026-02-06 06:16 (UTC+8)) [💬8 | +169/-229, 3 files | commented:2 approved:1] Adds activation padding support to wvSplitKQ. Additionally improves bias and dpp reduce perf. Expands test scenarios.
## Purpose
## Test Plan
## Test Result
... -
#33491 [Minor] Sort safetensors files to ensure deterministic loading order — ready — by Lumosis (合并于: 2026-02-06 06:05 (UTC+8)) [💬7 | +11/-2, 1 files | commented:1 approved:1] This PR explicitly sorts hf_weights_files by filename within the safetensors_weights_iterator
## Purpose
We are refactoring the weight loading logic to better support TPU execution. To optimize memory usage and initialization on TPU, we need to shard each layer to TPU HBM immediately after its weights are available.
By sorting the safetensors files (which typically follow a sequential naming convention like model-00001-of-00005.safetensors), we ensure that weights are loaded in a deterministi…
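The determinism argument is simple to demonstrate: directory listing order is filesystem-dependent, and an explicit sort restores the sequential shard order implied by the naming convention:

```python
# Filenames follow the sequential shard convention mentioned above; a plain
# lexicographic sort recovers the intended loading order regardless of how
# the filesystem enumerated them.
files = [
    "model-00003-of-00005.safetensors",
    "model-00001-of-00005.safetensors",
    "model-00002-of-00005.safetensors",
]
ordered = sorted(files)
print(ordered[0])  # prints: model-00001-of-00005.safetensors
```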
-
#33817 [Bugfix] Make MM batching more robust — bug,ready,ci/build,multi-modality — by DarkLight1337 (合并于: 2026-02-06 04:40 (UTC+8)) [💬6 | +625/-428, 13 files | commented:4 approved:1] ## Purpose
FIX https://github.com/vllm-project/vllm/pull/32955#issuecomment-3849069453
This adds a bit of hashing overhead, but should be negligible since shared fields normally contain single integers (e.g. image/video token ID) or booleans (e.g. `use_audio_in_video`), with the notable exception of Terratorch models. cc @christian-pinto
Related changes:
…
-
#33932 [Bugfix] Fix DSV3.2 NVFP4 — bug,ready — by MatthewBonanni (合并于: 2026-02-06 03:22 (UTC+8)) [💬1 | +4/-2, 1 files | commented:1 approved:1] ## Purpose Fix https://github.com/vllm-project/vllm/issues/33859
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#33904 [Misc] Rename `translations` to `speech_to_text` for OAI serving component — frontend,ready — by NickLucche (合并于: 2026-02-06 03:16 (UTC+8)) [💬1 | +11/-11, 7 files | approved:1 commented:3] Follow up to https://github.com/vllm-project/vllm/pull/32313/ as expressed in this comment https://github.com/vllm-project/vllm/pull/32313/#issuecomment-3769015464; translations is likely too reductive for the ASR endpoints. cc @DarkLight1337
- #33902 Fix tokenizer test for renamed attr on Transformers v5 — ready — by hmellor (合并于: 2026-02-06 03:16 (UTC+8)) [+9/-1, 1 files | commented:1 approved:1] This attr has been renamed in Transformers v5. This PR updates the test to check both locations.
-
#29714 [Bugfix] Suppress non-TTY color output on the process name part of the log — bug,ready — by a4lg (合并于: 2026-02-06 02:47 (UTC+8)) [💬5 | +6/-1, 1 files | commented:4 approved:1] ## Purpose
Regular logs are not decorated with colors when non-TTY stdout/stderr is selected as the logging output (c.f. the `_use_color` function in `vllm/logger.py`). However, the process-based decoration part in `vllm/utils/system_utils.py` did not implement color output suppression like the regular logger. This resulted in excess escape sequences in redirected output (e.g. `log.txt` after running `vllm serve ... >log.txt 2>&1`) when all conditions are met:
- Neither `NO…
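The standard pattern behind such a fix can be sketched as follows; the helper names are illustrative, not vLLM's exact functions:

```python
import io
import sys

# Common pattern for suppressing ANSI colors on non-TTY outputs, similar in
# spirit to the check described above.
def use_color(stream) -> bool:
    # Only decorate when the stream is an interactive terminal.
    return hasattr(stream, "isatty") and stream.isatty()

def decorate(name: str, stream=sys.stderr) -> str:
    if use_color(stream):
        return f"\033[36m{name}\033[0m"  # cyan when attached to a terminal
    return name  # plain text when redirected to a file or pipe

# A StringIO stands in for a redirected log file: isatty() is False.
print(decorate("EngineCore_0", io.StringIO()))  # prints: EngineCore_0
```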
-
#33375 [Moe Refactor] Make Inplace Flag for FusedMoEModularKernel part of the constructor — ready,nvidia — by bnellnm (合并于: 2026-02-06 02:07 (UTC+8)) [💬1 | +132/-109, 37 files | commented:10] ## Purpose Pass the inplace flag to the FusedMoEModularKernel constructor. Knowing the inplace behavior (or being able to disable it) ahead of time will help simplify some of the runtime logic in layer.py related to running shared experts and making clones of the `hidden_states`.
## Test Plan CI + MoE refactoring tests
## Test Result
cc @robertgshaw2-redhat , @ProExpertProg
…
-
#32887 [Spec Decode] Unified Parallel Drafting — documentation,performance,speculative-decoding,ready,v1,llama,nvidia — by benchislett (合并于: 2026-02-06 01:37 (UTC+8)) [💬8 | +1085/-392, 14 files | commented:10] ## Purpose
This PR implements a single input preparation kernel for draft model support, and parallel drafting both with and without hidden states from the target model. As such we now have support for AMD’s PARD, which proposes parallel drafting for fine-tuned external draft models, and AWS’ P-EAGLE which implements parallel prediction for EAGLE3. Both of these are benchmarked as part of this PR effort.
## Testing
E2E tests for parallel drafting and unit tests for the input preparation logic…
- #33795 [Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path — bug,ready,llama,kv-connector — by zackyoray (合并于: 2026-02-06 01:51 (UTC+8)) [💬1 | +3/-3, 1 files | commented:1 approved:3]
The Llama 4 local attention optimization code path in `_read_blocks` had the engine_ids swapped when calling `_get_block_descs_ids`:
- `layer_local_desc_ids` was incorrectly using `dst_engine_id`
- `layer_remote_desc_ids` was incorrectly using `self.engine_id`
This caused the wrong `num_blocks` to be used when computing descriptor IDs, resulting in "remote index out of range" errors during NIXL KV cache transfers with heterogeneous tensor parallelism (e.g., prefill TP=2, decode TP=4). The fix …
-
#33931 [Misc] Add debug logs — kv-connector — by NickLucche (合并于: 2026-02-06 01:42 (UTC+8)) [+5/-0, 2 files | commented:1 approved:1] Nixl is currently missing some minor yet vital debug logs that I find particularly useful when trying to debug issues without access to the machine. This PR logs kv cache shapes to address that.
cc @orozery
-
#33879 [BugFix] Fix LoRA Fp8 — bug,ready,ci/build — by danisereb (合并于: 2026-02-06 01:25 (UTC+8)) [💬2 | +14/-8, 1 files | commented:2 approved:1] ## Purpose
When running Nemotron Nano FP8: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
Using LoRA adapters fails:
ValueError: ModelOptFp8MoEMethod uses the new modular kernel initialization logic. This function should not be called.…
-
#31943 [Feat][RL][1/2] Native Weight Syncing API: NCCL — documentation,frontend,ready,ci/build,v1 — by hao-aaron (合并于: 2026-02-06 01:13 (UTC+8)) [💬23 | +2974/-2, 27 files | commented:10] ## Purpose
This PR introduces native weight syncing APIs for vLLM to support reinforcement learning post-training workflows (RLHF, PPO, etc.).
Currently, open-source projects like SkyRL, VeRL, and TRL must implement their own weight syncing infrastructure to use vLLM as an inference server during training. This leads to duplicated effort and requires users to version-lock to specific implementations. See RFC #31848 for full motivation. …
-
#32710 [ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) — bug,rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-02-05 18:01 (UTC+8)) [💬4 | +11/-0, 2 files | commented:10] Fixes Mamba-1 models producing garbage output on ROCm.
## Problem
On ROCm, Mamba models (e.g., `state-spaces/mamba-130m-hf`) produce completely incorrect output:
- Expected: “The LLM is a high-performance, scalable…”
- Actual: “fprintf NdEx INDIRECT roidism oneliness”
## Root Cause
…
- #33690 [Bugfix] Fix step3p5 parser when using mtp — bug,ready — by mariohong128 (merged: 2026-02-06 00:04 (UTC+8)) [💬2 | +1455/-5, 2 files | commented:1 approved:1]
## Purpose
Fix step3.5 parser when using mtp.
If the model outputs
`</tool_call><tool_call><` (using MTP greatly increases the likelihood of this), the parser will incorrectly start a new empty tool call. ``` {"error":{"message":"1 validation error for ValidatorIterator\\n1.function.name\\n Field required [type=missing, input_value={\'arguments\': \'{}\'}, input_type=dict]\\n For further information visit https://errors.pydantic.dev/2.11/v/missing None","type":"BadRequestError… -
#33905 [Docs] Add bart-plugin to docs — documentation,ready — by NickLucche (merged: 2026-02-05 20:20 (UTC+8)) [💬1 | +17/-4, 3 files | commented:2 approved:1] Add bart-plugin https://github.com/vllm-project/bart-plugin to the docs, mention it as a reference for an OOT model implementation, and add
`BartForConditionalGeneration` to supported models. cc @DarkLight1337
-
#33886 [Bugfix] Fix corner case of sparse embedding — bug,ready — by noooop (merged: 2026-02-05 18:51 (UTC+8)) [💬1 | +11/-1, 2 files | commented:1 approved:1]
## Purpose
Fix #33873
## Test Plan
tests/models/language/pooling/test_bge_m3.py
…
-
#33876 [Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading — bug,ready,deepseek — by Isotr0py (merged: 2026-02-05 18:29 (UTC+8)) [+15/-5, 2 files | commented:1 approved:1]
## Purpose
`nvidia/Kimi-K2.5-NVFP4` is quantized based on the legacy model layout (`language_model.layers.*`), which was refactored in #33346. Since v0.15.0 has been released, this PR adds backward compatibility to load this NVFP4 checkpoint.
## Test Plan
Run with `nvidia/Kimi-K2.5-NVFP4`: ``` python examples/offline_inference/vision_language.py -m kimi_k25 --modality vision_chunk … -
#33687 [Refactor] Clean up input preprocessing — ready,multi-modality — by DarkLight1337 (merged: 2026-02-05 18:43 (UTC+8)) [💬1 | +91/-204, 4 files | commented:3 approved:1] ## Purpose
Remove code specific to text-only encoder-decoder models, since we now implement them as MM models to simplify the code.
## Test Plan
## Test Result
... -
#31171 [perf] Integrate flashinfer concat_mla_k — ready,v1,nvidia — by jiahanc (merged: 2026-02-05 18:23 (UTC+8)) [💬10 | +64/-3, 2 files | commented:5 approved:2] ## Purpose Integrate FlashInfer concat_mla_k kernel
## Test Plan
## Test Result ``` local-completions (base_url=http://0.0.0.0:8087/v1/completions,pretrained=/ds-models/DeepSeek-R1-0528-FP4-v2,model=/ds-models/DeepSeek-R1-0528-FP4-v2,add_bos_token=True,tokenized_requests=False,tokenizer_backend=None,num_concurrent=1024,timeout=60000,max_retries=5,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto |Tasks|Version| Filter |n-shot| Metric | |…
-
#33339 Enable Cross layers KV cache layout at NIXL Connector V2 — documentation,ready,v1,kv-connector — by liranschour (merged: 2026-02-05 18:17 (UTC+8)) [💬1 | +339/-89, 6 files | commented:9 changes:1] ## Purpose Enable the NIXL Connector to use the new contiguous cross-layer KV cache layout described in the RFC and implemented in #27743
Demonstrates a performance improvement of more than 2x in tok/s and TTFT due to a dramatic reduction in fragmentation of transfer buffers.
Tested with P!=D with run_accuracy_test.sh P=1 D=2
branch num reqs input len TTFT ITL tok/s Desc/transfer – – -… -
#33796 [Refactor] Move `task` outside of `PoolingParams.verify` — frontend,ready,ci/build,v1 — by DarkLight1337 (merged: 2026-02-05 17:33 (UTC+8)) [💬1 | +186/-216, 24 files | commented:1 approved:1] ## Purpose
The task assignment only needs to be done in `LLM.encode` and the `/pooling` endpoint. This change allows verification to take place in a single place inside `InputProcessor`. Also fixed `test_pooling_params.py` not being run in CI.
## Test Plan
…
-
#33858 [Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. — bug,ready,deepseek — by pavanimajety (merged: 2026-02-05 17:32 (UTC+8)) [💬1 | +3/-11, 1 files | commented:1 approved:2] ## Purpose This PR fixes a bug introduced in PR #33174 that sets the values of n_group and topk_group to None when they are (1, 1) respectively. While this fixes Kimi-K2, it may introduce an error with Mistral. @dbari Please confirm whether this fix is good or the values need to be passed differently.
The Marlin path works because it doesn’t have a monolithic kernel for routing + MoE, unlike the INT4 TRTLLM MoE kernels.
## Test Plan GSM8k before and after.
## Test Result Main
Kimi-K2-Thinking(… -
#33557 [Perf] Optimize the performance of structured output + reasoning — structured-output,frontend,ready,v1 — by chaunceyjiang (merged: 2026-02-05 15:45 (UTC+8)) [💬7 | +51/-61, 4 files | commented:2 approved:1] ## Purpose
- Optimize the performance of structured output + reasoning
- Move `reasoner.is_reasoning_end(request.prompt_token_ids or [])` from the core engine to the frontend.
- Fix https://github.com/vllm-project/vllm/issues/33215
The root cause of #33215 is the use of the parameter `"chat_template_kwargs": {"thinking": true, "enable_thinking": true}`. …
-
#30522 [KV Connector][Metrics] Do not count local prefix cache hits in connector queries — ready,v1,kv-connector — by markmc (merged: 2026-02-05 15:57 (UTC+8)) [💬4 | +115/-31, 3 files | commented:4 approved:1] Somewhat related to #28585
The `vllm:external_prefix_cache_queries` metric was double-counting queries by recording the total prompt tokens instead of only the tokens actually queried from the KV connector. Example scenario:
- Request with 1000 prompt tokens
- Local cache finds 600 tokens
- External cache finds 200 of the remaining 400 tokens
- Computed: 200 tokens
…
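The example scenario above can be written out as a few lines of arithmetic (variable names are illustrative): only the tokens missed by the local prefix cache should count as connector queries.

```python
# Numbers from the scenario above.
prompt_tokens = 1000
local_cache_hits = 600      # tokens found in the local prefix cache
external_cache_hits = 200   # tokens found by the KV connector

external_queries_buggy = prompt_tokens                     # double-counts local hits
external_queries_fixed = prompt_tokens - local_cache_hits  # 400 tokens actually queried

print(external_queries_fixed)                        # 400
print(external_cache_hits / external_queries_fixed)  # 0.5 external hit rate
```

With the buggy accounting, the apparent external hit rate would be 200/1000 = 20% instead of the true 50%.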
-
#33870 [CI/Build] Fix CPU CI test case title — ready,ci/build,v1,cpu — by bigPYJ1151 (merged: 2026-02-05 15:07 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1]
## Purpose
`CPU-TP/DP/PP Tests` may violate the key naming rule of Buildkite.
## Test Plan
## Test Result
…
-
#33727 [CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs — bug,ready,cpu — by fadara01 (merged: 2026-02-05 14:26 (UTC+8)) [💬2 | +3/-1, 1 files | commented:2 approved:1] ## Purpose
This enables us to run RedHatAI/whisper-large-v3-quantized.w8a8 on CPU.
Fixes: #33696
## Test Plan
Reproducer in #33696. Will enable this in CI once we accelerate the w8a8 GEMMs in oneDNN …
-
#33837 [Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result — bug,frontend,ready,multi-modality — by AndreasKaratzas (merged: 2026-02-05 14:17 (UTC+8)) [💬8 | +21/-44, 1 files | commented:1 approved:1] PR #33060 introduced a regression where passing a single `ScoreMultiModalParam` with multiple content items (e.g., 2 images as documents) to `LLM.score()` returns only 1 result instead of 2.
## Root Cause
Found via git bisect: commit `1b8fe6f7c` (“[Frontend][4/n] Make pooling entrypoints request schema consensus `ScoreRequest`”). In `validate_score_input()`, when a single `ScoreMultiModalParam(content=[item1, item2])` is passed, the new code wraps it as `[ScoreMultiModalParam(...)]` (length… -
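A toy reproduction of the regression described above (`validate_score_input` here is a simplified stand-in, not vLLM's actual function): a single multi-document param must be expanded into its content items rather than wrapped as a one-element list.

```python
from dataclasses import dataclass

@dataclass
class ScoreMultiModalParam:
    content: list  # one entry per document to score

def validate_score_input_buggy(param: ScoreMultiModalParam) -> list:
    return [param]  # length 1 -> only one score is produced downstream

def validate_score_input_fixed(param: ScoreMultiModalParam) -> list:
    return list(param.content)  # one entry per document -> one score each

param = ScoreMultiModalParam(content=["image1", "image2"])
print(len(validate_score_input_buggy(param)))  # 1
print(len(validate_score_input_fixed(param)))  # 2
```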
#33778 [CI/Build] Parallelize CPU CI tests — ready,ci/build,v1,cpu — by bigPYJ1151 (merged: 2026-02-05 13:53 (UTC+8)) [💬1 | +157/-130, 6 files | commented:1 approved:1] ## Purpose
- Split CPU CI tests for parallel execution
- Add source dependencies to test cases
## Test Plan
CI tests
## Test Result …
-
#33281 [2/N] move responses/serving _make_response_output_items logic to parser — frontend,ready — by qandrew (merged: 2026-02-05 13:46 (UTC+8)) [💬14 | +241/-99, 2 files | commented:8 approved:1]
## Purpose
As titled. By moving the reasoning / tool-calling logic into the parser, each model can implement its own interpretation of how to parse. We can then migrate GPT-OSS into the parser, and models that support parallel tool calling / interleaved reasoning can use the parser structure.
stacking on top of https://github.com/vllm-project/vllm/pull/32712
## Test Plan
…
-
#33840 [CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly — bug,rocm,ready — by rasmith (merged: 2026-02-05 13:05 (UTC+8)) [+11/-5, 1 files | commented:1 approved:1]
## Purpose A prior change broke many tests, including `test_rocm_aiter_topk.py`. Either an environment variable or a backend must be selected for AITER ops to be properly imported.
This PR updates the test to set VLLM_ROCM_USE_AITER to 1 so loading of AITER ops will occur and the test can check that the ops were correctly loaded.
## Test Plan …
-
#33863 [docs] fix unintentional misspellings — documentation — by rinbaro (merged: 2026-02-05 12:50 (UTC+8)) [💬1 | +3/-3, 3 files | commented:1 approved:1] ### Motivation
These misspellings looked unintentional.
- #33856 [Minor] Include `StreamingInput` in inputs package — ready,v1 — by njhill (merged: 2026-02-05 12:38 (UTC+8)) [+6/-4, 4 files | commented:1 approved:1] Follow-on from the recently introduced streaming inputs support. -
#33841 Revert “[Attention][FA3] Update FA3 to include new swizzle optimization” — ready,ci/build,v1 — by ProExpertProg (merged: 2026-02-05 11:54 (UTC+8)) [💬2 | +3/-13, 3 files | commented:1 approved:1] Reverts vllm-project/vllm#23465
As described in #33802, #23465 broke the Distributed Tests 2 GPUs (H100).
- CI run for #23465: https://buildkite.com/vllm/ci/builds/49808/steps/canvas?sid=019c244e-f4b2-46e4-bdea-4e8c0c0d8bb9&tab=output
- CI run for #31034 (previous commit): https://buildkite.com/vllm/ci/builds/49807/steps/canvas?sid=019c244e-cc62-4d6c-a7d6-9c2922ed648f&tab=output
Note that the tests have been slightly refactored, so the failing test (`test_async_tp.py::test_async_pass_corr…
[Closed unmerged PRs]
- #14148 [V1][Metrics] Add model_load_time as a log for CUDA devices — needs-rebase,stale,v1 — by sahelib25 (closed: 2026-02-06 10:18 (UTC+8)) [💬21 | +13/-3, 3 files | commented:10] This PR adds the following metric to V1:
| Metric Name | Type | Unit |
| – | – | – |
| model_load_time | Provided as a log for CUDA devices | Seconds |
-
#26184 [Perf] Non blocking cpu->gpu for spec decoding — ready,stale,v1 — by chelsea0x3b (closed: 2026-02-06 10:16 (UTC+8)) [💬9 | +14/-21, 2 files | commented:4 approved:1] ## Purpose
Speculative decoding previously did a blocking gpu->cpu data transfer, which you can see in this nsight profile:
After this PR it looks like this:
Summary of changes:
- Replaced call to `torch.cat` & `torch.full` with a `torch…
- #28238 [NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe — documentation,ready,ci/build,qwen,nvidia — by jiahanc (closed: 2026-02-06 08:26 (UTC+8)) [💬24 | +194/-4, 4 files | commented:7 approved:2 | 📝draft]
## Purpose
- Integrate flashinfer trtllm-gen BF16 moe to supported models
- Fixed flashinfer trtllm-gen moe SM check ## Test Plan `VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --max-num-batched-tokens 8192 --max-model-len 32768 --no-enable-prefix-caching --async-scheduling --compilation_config.pass_config.enable_fi_allreduce_fusion true --compilation_config.pass_config.enable_noop true --compilation_config.custom_ops+=+rms_n…
-
#31255 [LoRA] Simplify gpt-oss w13_lora_b weight loading — gpt-oss — by xyang16 (closed: 2026-02-06 07:29 (UTC+8)) [💬4 | +1/-8, 1 files | commented:1]
## Purpose
This PR simplifies the gpt-oss `w13_lora_b` weight-loading code.
The gpt-oss `lora_b` weights are in an interleaved layout `[w1_0, w3_0, w1_1, w3_1, ...]`. The current `_slice_w13_b` code does de-interleaving, slicing, and re-interleaving, but it can be simplified by directly returning a contiguous slice of the original interleaved dimension. Below is a short example of the original code:
…
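The layout argument above can be illustrated with plain Python lists (a toy sketch, not the actual vLLM code): because each `w1_i` sits next to its `w3_i` in the interleaved layout, a per-shard slice is already contiguous, so de-interleave/slice/re-interleave is unnecessary.

```python
# Interleaved layout: [w1_0, w3_0, w1_1, w3_1, ...]
interleaved = ["w1_0", "w3_0", "w1_1", "w3_1", "w1_2", "w3_2", "w1_3", "w3_3"]

def slice_old(w13: list, start: int, end: int) -> list:
    w1 = w13[0::2]                      # de-interleave
    w3 = w13[1::2]
    w1s, w3s = w1[start:end], w3[start:end]
    out = []
    for a, b in zip(w1s, w3s):          # re-interleave
        out += [a, b]
    return out

def slice_new(w13: list, start: int, end: int) -> list:
    # A contiguous slice of the interleaved dim keeps the (w1_i, w3_i) pairs intact.
    return w13[2 * start : 2 * end]

print(slice_new(interleaved, 1, 3))  # ['w1_1', 'w3_1', 'w1_2', 'w3_2']
```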
-
- #33946 Add Video stats to response metadata — frontend — by zakariaelh (closed: 2026-02-06 06:23 (UTC+8)) [💬1 | +57/-13, 9 files | commented:1]
- #33024 [V1][Hybrid] Enable spec decode and optimize block-aligned split in mamba cache align mode — ready,needs-rebase,v1 — by peakcrosser7 (closed: 2026-02-06 01:17 (UTC+8)) [💬5 | +39/-24, 3 files | commented:5]
## Purpose
- Re-enabled spec decoding: Previously disabled in #30877, speculative decoding is now re-enabled as the related issues have been confirmed as fixed.
- Optimized block-aligned splitting for resumed requests: Refined the logic to ensure that Mamba states for resumed requests are also cached in a block-aligned fashion, maintaining consistency across prefill phases.
## Test Plan
## Test Result
... -
- #33927 [Bugfix] Fix tokenizer model_max_length incorrectly constraining user-specified max_model_len — bug — by soyr-redhat (closed: 2026-02-06 00:46 (UTC+8)) [💬1 | +73/-3, 2 files | commented:1] This fixes a bug where the tokenizer’s model_max_length would override the user’s explicit --max-model-len parameter, preventing users from utilizing extended context lengths even when the model supports them.
For example, with Qwen/Qwen3-4B-Instruct-2507-FP8:
- Model supports 256K tokens (max_position_embeddings = 262144)
- Tokenizer config has model_max_length = 8192
- User specifies --max-model-len 262144
- Before: Server would cap at 8192 and reject 10K token requests
- After: Server correc…
-
#30906 [Feature]: support serving nvfp4 W4A16 moe models using Marlin — no labels — by EdalatiAli (closed: 2026-02-06 00:36 (UTC+8)) [💬3 | +220/-12, 2 files | commented:2]
## Purpose This PR enables serving nvfp4 W4A16 compressed-tensors quantized MoE models by adding `CompressedTensorsW4A16Nvfp4MoeMethod`. Weight-only nvfp4 quantization improves quality at the expense of higher latency at large concurrencies. As the provided results show, the nvfp4 W4A16 quantized version of `Qwen/Qwen3-30B-A3B` significantly outperforms the nvfp4 W4A4 variant.
## Test Plan …
- #33903 [NIXL] Version bump — ci/build,kv-connector — by NickLucche (closed: 2026-02-05 21:58 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1] Regular version bump for NIXL.
-
#33787 [Fix]: Prepack weights for w8a8 oneDNN matmul — cpu — by nikhil-arm (closed: 2026-02-05 18:46 (UTC+8)) [💬2 | +3/-2, 1 files | commented:2] Improvements on Neoverse V1:
| Metric | Before | After | Speedup |
| – | – | – | – |
| Throughput (rps) | 0.30 | 0.55 | +83% |
| Total tok/s | 348.70 | 632.84 | +81% |
| Out tok/s | 38.74 | 70.32 | +82% |
## Purpose …
-
#33875 Update kv_cache_utils.py — v1,tool-calling — by junxiangxiaoxiang (closed: 2026-02-05 19:01 (UTC+8)) [💬7 | +194/-195, 28 files | commented:1] Optimize popleft_n by pre-allocating list
## Purpose
This PR optimizes the `popleft_n` method in `FreeKVCacheBlockQueue` to improve performance when popping multiple free KV cache blocks from the head of the free list.
The original implementation used repeated `list.append()` and reset the `prev_free_block`/`next_free_block` pointers of each popped block inside the loop, which introduces unnecessary overhead.
The optimized version applies two key improvements:
- **Pre-allocate the result li…
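The optimization described above can be sketched with a toy doubly-linked free list (`Block`, `make_free_list`, and this `popleft_n` are simplified stand-ins, not vLLM's `FreeKVCacheBlockQueue` code):

```python
class Block:
    def __init__(self, bid: int):
        self.bid = bid
        self.prev_free_block = None
        self.next_free_block = None

def make_free_list(n: int) -> Block:
    # Build a doubly-linked list of n free blocks and return its head.
    blocks = [Block(i) for i in range(n)]
    for a, b in zip(blocks, blocks[1:]):
        a.next_free_block, b.prev_free_block = b, a
    return blocks[0]

def popleft_n(head: Block, n: int):
    ret = [None] * n                 # pre-allocate instead of repeated append()
    curr = head
    for i in range(n):               # walk the list without touching pointers
        ret[i] = curr
        curr = curr.next_free_block
    for blk in ret:                  # reset pointers once, after the walk
        blk.prev_free_block = blk.next_free_block = None
    if curr is not None:
        curr.prev_free_block = None
    return ret, curr                 # popped blocks and the new head

head = make_free_list(5)
popped, head = popleft_n(head, 3)
print([b.bid for b in popped], head.bid)  # [0, 1, 2] 3
```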
-
#33861 [docs] fix unintentional misspellings — documentation — by rinbaro (closed: 2026-02-05 12:12 (UTC+8)) [💬2 | +3/-3, 3 files | commented:1] ## Motivation Make docs great again.
These 3 looked unintentional.