[vLLM GitHub Development Digest] 2026-02-06
[Overview]
- Time window: 2026-02-06 11:19 (UTC+8) to 2026-02-07 11:19 (UTC+8)
- New issues: 21 (label breakdown: bug:7, feature request:4, usage:3, RFC:3, rocm:3)
- Closed issues: 23
- New PRs: 52 (label breakdown: ready:23, bug:17, v1:11, documentation:9, frontend:9)
- Merged PRs: 43
- PRs closed without merging: 31
[New issues]
-
#34034 [Bug]: vLLM-compile should not execute the decoder forward pass during compilation — bug,torch.compile — by zou3519 (created: 2026-02-07 11:07 (UTC+8)) ### Your current environment
main
### 🐛 Describe the bug
During cold-start compilation, vLLM-compile executes the text decoder forward pass at the max_num_batched_tokens size. I’m not completely sure how long a text decoder forward pass takes, but if it is O(1s) then this execution is unnecessary and skipping it will save compile time.
There’s one complication in that I don’t know when autotuning will happen. I hope that vLLM does warmup of a size before the first cudagraph capture (but also we do compile-time…
-
#34002 [Bug]: Recent PR breaks Whisper inference — bug — by almayne (created: 2026-02-07 00:17 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (truncated)
-
#34028 [Usage]: How to load FP8 Quantised Model - Value error, Found unknown quantization, same worked for v0.11.0 — usage — by gabinguo (created: 2026-02-07 07:29 (UTC+8)) ### Your current environment
wget transcript fetching https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py (truncated)
-
#34005 [Bug]: Qwen3-1.7B apparently not respecting max-model-len — bug — by kdu4108 (created: 2026-02-07 00:59 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (truncated)
-
#34018 [RFC]: Helix (Context + Tensor) Parallelism for Efficient Long-Context Decoding — RFC — by sungsooha (created: 2026-02-07 04:23 (UTC+8)) ### Motivation.
This RFC proposes adding Helix (Context + Tensor) Parallelism to vLLM, based on NVIDIA’s Helix paper (July 2025). Helix enables efficient long-context decoding by combining Context Parallelism (sequence sharding) with Tensor Parallelism (head sharding), eliminating KV cache duplication that occurs in traditional Tensor Parallelism when
`TP > num_kv_heads`. Paper results (NVIDIA):
- Up to 1.5× decode latency reduction at fixed batc…
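The KV-duplication claim above can be illustrated with back-of-envelope arithmetic (the head count and TP degree below are assumed for illustration, not taken from the RFC):

```python
# Illustrative arithmetic only (values assumed, not from the RFC): estimate
# how many times each KV head is replicated under plain tensor parallelism.
num_kv_heads = 8   # a GQA model with 8 KV heads
tp = 32            # tensor-parallel degree

# With TP > num_kv_heads, each KV head must live on tp / num_kv_heads ranks,
# so its KV cache is stored that many times across the TP group.
kv_replication = max(1, tp // num_kv_heads)
print(kv_replication)  # 4
```

Helix avoids this replication by sharding the sequence (context parallelism) instead of over-sharding the heads.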
-
#34016 [Bug]: vLLM crashes when trying to load Devstral-2-123B-Instruct-2512 directly from S3 — bug — by moorglade (created: 2026-02-07 04:21 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (truncated)
-
#33996 [RFC]: Add an option to use NCCL-based symmetric memory when pytorch symmetric memory is applied — RFC — by ilmarkov (created: 2026-02-06 22:40 (UTC+8)) [💬1] ### Motivation.
PyTorch now supports different backends for symmetric memory, including 'CUDA', 'NVSHMEM', and 'NCCL', which can be controlled with `TORCH_SYMMMEM`.
### Proposed Change.
- We need to find out if NCCL-based pytorch symmetric memory replicates NCCL symmetric memory we use in vLLM.
- We need to refactor our usage of symmetric memory so that it has a single entry point.
- Benchmark the collectives we use in vLLM to pick the best version of symmetric memory on different hardware.
…
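The backend selection mentioned in the motivation can be sketched as follows (a minimal illustration; the accepted values are the ones quoted in the RFC):

```python
import os

# Select the NCCL backend for PyTorch symmetric memory before any
# symmetric-memory allocation happens; the RFC lists 'CUDA', 'NVSHMEM',
# and 'NCCL' as possible TORCH_SYMMMEM values.
os.environ.setdefault("TORCH_SYMMMEM", "NCCL")
print(os.environ["TORCH_SYMMMEM"])
```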
-
#34004 [Bug]: docker ROCm build fails — bug,rocm — by markg85 (created: 2026-02-07 00:56 (UTC+8)) [💬4] ### Your current environment
Irrelevant because it’s inside docker, not on my host. I’m building the docker container.
### 🐛 Describe the bug
I followed this description: https://docs.vllm.ai/en/v0.6.5/getting_started/amd-installation.html I went for option 1 and I meet the criteria exactly. I have an AMD 7900XT, so I went with this docker build line:
…
-
#34012 [Feature]: MXFP8 GEMM / Grouped GEMM Kernels for AMD — feature request,rocm — by EdalatiAli (created: 2026-02-07 02:34 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
vLLM does not currently support serving MXFP8 quantized dense/MoE models on AMD GPUs. It would be great to add the required MXFP8 GEMM / Grouped GEMM kernels to enable serving MXFP8 dense and MoE models on AMD GPUs.
### Alternatives
No response
### Additional context
…
-
#34008 [Feature]: W4A16 (int4) GEMM / Grouped GEMM Kernels for AMD — feature request,rocm — by ZewenShen-Cohere (created: 2026-02-07 02:03 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
vLLM currently lacks performant W4A16 (int4 weight) GEMM support on AMD ROCm/AITER backends. W4A16 Grouped GEMM exists only as a Triton kernel, which is not optimal.
Would it be possible to add AMD-optimized native kernels for W4A16 GEMM and W4A16 grouped GEMM?
### Alternatives
No response
…
-
#34001 [RFC] Change the directory layout from `scaled_mm/` and `mixed_precision/` to backend-first — no labels — by tjtanaa (created: 2026-02-06 23:46 (UTC+8)) [💬2] After implementing https://github.com/vllm-project/vllm/issues/33872. Before starting to snowball the Kernel Abstraction, we should align the semantics of the directory to make sure they are easy to follow and use.
We would like to propose the following directory structure to organize the kernel abstraction of linear gemm based on the following pattern:
- based on providers if the kernels are imported from third party library
- if the kernels are implemented in `vllm/csrc`, they are organiz…
-
#34000 Move the Python kernel abstraction package (selectors + interfaces) under `vllm/model_executor` so unquantized linear kernels don’t need to be referenced from a quantization-specific directory. — no labels — by tjtanaa (created: 2026-02-06 23:45 (UTC+8)) Move the `kernels` from `quantization` into `model_executor`. Final directory layout ``` vllm/ └── vllm/ └── model_executor/ └── kernels/ ├── __init__.py ├── mixed_precision …
-
#33995 [Bug] [torch 2.10] test_routed_input_transform_inside_vs_outside failing — bug — by atalman (created: 2026-02-06 22:37 (UTC+8)) ### Your current environment
Buildkite CI: https://buildkite.com/vllm/ci/builds/50239
### 🐛 Describe the bug
Related PR for the torch 2.10 update: https://github.com/vllm-project/vllm/pull/30525 It looks like https://github.com/vllm-project/vllm/pull/32790 introduced this test; however, it is failing on the torch 2.10 update.
…
-
#33970 [Usage]: how to serve quantized Qwen3-Reranker-8B — usage — by jiiihyeonnn (created: 2026-02-06 14:05 (UTC+8)) [💬1] ### Your current environment
GPU == NVIDIA RTX A5000, vLLM == vllm/vllm-openai:v0.14.0
### How would you like to use vllm
…
-
#33991 [Installation]: building docker cpu image with VLLM_CPU_DISABLE_AVX512=true (or on any x86_64 CPU without AVX512) fails to compile mla_decode.cpp because BFloat16 has no AVX2 fallback — installation,cpu — by matthewfranglen (created: 2026-02-06 19:39 (UTC+8)) ### Your current environment
Environment (truncated): OS: Ubuntu 24.04.3 LTS (x86_64); GCC: 13.3.0
-
#33987 [Installation]: vllm as submodule results in unbuildable docker images due to setuptools-scm and docker context — installation — by matthewfranglen (created: 2026-02-06 18:53 (UTC+8)) ### Your current environment
Environment (truncated): OS: Ubuntu 24.04.3 LTS (x86_64); GCC: 13.3.0
-
#33986 [Tracking issue]: Issue Tracker for Qwen/Qwen3-VL-Embedding & Qwen/Qwen3-VL-Reranker — usage — by noooop (created: 2026-02-06 18:07 (UTC+8)) [💬2] ### vLLM examples for Qwen3-VL-Embedding:
- https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_offline.py
- https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_online.py
### vLLM examples for Qwen3-VL-Reranker:
- https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_rerank_api_online.py
- https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_score_api_online.py
- https://githu…
-
#33980 [RFC]: Sparse attention KV cache offloading to support longer sequence length — RFC — by zhangsicheng5 (created: 2026-02-06 17:09 (UTC+8)) ## Motivation.
In long-sequence inference scenarios, KV cache size has become one of the inference bottlenecks. To save GPU memory usage of the KV cache and support longer sequence lengths, we proposed a layerwise KV cache offloading approach in RFC #33398. However, during development we found that the number of offloadable layers is limited by the loading speed: based on a rough estimation, the loading time is $\frac{kv\_cache\_size}…
-
#33974 [Feature]: We propose the official development and maintenance of a vLLM integration or plugin within Dify. — feature request — by ooodwbooo (created: 2026-02-06 15:32 (UTC+8)) ### 🚀 The feature, motivation and pitch
We propose the official development and maintenance of a vLLM integration or plugin within Dify. Currently, the vLLM plugin in Dify only supports text-only LLMs. We require support for VL, Embedding, Rerank, VL-Embedding, and VL-Rerank, and hope that the official team can take over the maintenance of the vLLM plugin in Dify.
https://github.com/vllm-project/vllm/issues/17454
https://github.com/yangyaofei/dify-vllm-provider/issues/4#issuecomment-…
-
#33966 [Feature]: Add tool_choice="required" support for GPT-OSS Harmony models — feature request — by gkswns0531 (created: 2026-02-06 13:20 (UTC+8)) ### 🚀 The feature, motivation and pitch
GPT-OSS models use the Harmony chat format, which differs from standard models in its response generation behavior. Even when tool_choice="required" is set, these models tend to generate direct text responses instead of tool calls, resulting in only a 91% success rate for tool-call generation. Harmony format models select one of three channels (final, analysis, commentary) after completing their internal processing. Tool calls are only generated ...
-
#33962 [Bug]: Realtime transcription completes but server never sends transcription.done when final commit is received during slow streaming — bug — by pjs102793 (created: 2026-02-06 11:30 (UTC+8)) ### Your current environment
Environment (truncated): OS: Ubuntu 24.04.1 LTS (x86_64); GCC: 13.3.0
[Closed issues]
-
#34002 [Bug]: Recent PR breaks Whisper inference — bug — by almayne (closed: 2026-02-07 10:42 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (truncated)
-
#33295 [Bug]: QKNorm+RoPE fusion broken for qwen3-fp8 on B200 — bug,help wanted,torch.compile — by ProExpertProg (closed: 2026-02-07 10:27 (UTC+8)) [💬12] ### Your current environment
The output of `python collect_env.py` (truncated)
-
#11905 [Feature]: Support Multiple Tasks Per Model — feature request,stale — by FurtherAI (closed: 2026-02-07 10:18 (UTC+8)) [💬20] ### 🚀 The feature, motivation and pitch
Requesting this for V1 #11862
The idea is pretty simple, it would be nice to be able to, e.g., get generations and embeddings out of a single model. An example use case is when you have a LoRA for generation and a LoRA for embedding on top of the same base model. Deploying two vLLM servers is really inefficient for accomplishing this.
### Alternatives
A lesser feature would be one task per LoRA, but it’s better to be general if possible. …
-
#19415 [Bug]: [V1][gpu_model_runner.py] CUDA memory error — bug,stale — by rajagond (closed: 2026-02-07 10:18 (UTC+8)) [💬12] ### Your current environment
I am using vllm docker image: docker pull vllm/vllm-openai:v0.9.1rc1
### 🐛 Describe the bug
Hi,
I am making some changes to vllm/v1/worker/gpu_model_runner.py and trying to initialize some custom tensors inside the GPUModelRunner class. Specifically, I have added the following code:
…
-
#24737 [Bug]: Qwen3 Reranker 500 error when submitting longer query — bug,stale — by HadiSDev (closed: 2026-02-07 10:17 (UTC+8)) [💬6] logs (1).txt
### Your current environment
--host 0.0.0.0 --port 8000 --model tomaarsen/Qwen3-Reranker-8B-seq-cls --max-model-len 8128
I use docker to host vllm for reranking qwen3, but when I send a query that is a bit longer than average I get this exception (see attached logs) ...
-
#25910 [Bug]: Audio usage reporting inconsistent between streaming and blocking — bug,stale — by alugowski (closed: 2026-02-07 10:16 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py`: irrelevant
-
#26369 [Performance]: Use int over list[int] as output_tokens to reduce GC overhead — performance,stale — by Jialin (closed: 2026-02-07 10:16 (UTC+8)) [💬8] ### Proposal to improve performance
Currently, we consistently use list[int] to represent output_tokens in ModelRunnerOutput, which is very inefficient from a GC perspective.
The default setup of GC is (700, 10, 10) which means
- if allocated_obj-deallocated_obj>=700 in generation 0, GC0 will be triggered
- GC1 is triggered after 10 GC0
- GC2 is triggered after 10 GC1. In large-batch scenarios (small models), each batch could be as large as 1024, which means GC0 will be tri…
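The generation thresholds described above can be inspected directly; a minimal sketch (the 1024-object batch size is the figure quoted in the proposal):

```python
import gc

# CPython's collection thresholds; the proposal quotes the default (700, 10, 10):
# gen0 collects once net new allocations reach 700, every 10th gen0 run
# triggers gen1, and every 10th gen1 run triggers gen2.
g0, g1, g2 = gc.get_threshold()

# A single batch of 1024 per-request list[int] objects already exceeds the
# default gen0 threshold of 700, so every such batch can trigger a collection.
batch = [[token_id] for token_id in range(1024)]
print(len(batch) > 700)  # True
```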
-
#26398 [Docs] flashinfer missing from `/en/latest/configuration/env_vars.html` — documentation,stale — by MikeSpreitzer (closed: 2026-02-07 10:16 (UTC+8)) [💬3] ### 📚 The doc issue
I am using vLLM release 0.10.2 and got a nasty surprise about some undocumented environment variables used by flashinfer. I eventually found https://github.com/flashinfer-ai/flashinfer/blob/v0.3.0/flashinfer/jit/env.py, which I hope provides the answers I need.
### Suggest a potential alternative/fix
Document FLASHINFER_WORKSPACE_BASE, FLASHINFER_CUBIN_DIR.
FYI, https://github.com/vllm-project/vllm/pull/21972 will affect this.
…
-
#26418 [Feature]: Resume guided/JSON generation after early termination — feature request,stale — by WoutDeRijck (closed: 2026-02-07 10:16 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
## Summary
Add support for continuing constrained (guided / JSON-schema) generation from a partial output after an abort (e.g., when stopping an “infinite list” mid-stream). This would let users truncate and resume without restarting decoding from scratch.
## Problem
When streaming structured JSON with guided decoding, the model can enter an infinite or repetitive list loop. …
-
#26421 [Feature]: Return reason why a JSON schema is unsupported by xgrammar — feature request,stale — by WardLT (closed: 2026-02-07 10:16 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
The error message for why a JSON schema is unsupported for structured output is vague. It only says the schema has an unsupported feature but not why, which is a particular problem given the rapid pace of xgrammar development: the reason a schema fails varies considerably by version.
### Alternatives
Haven’t figured…
-
#26432 [Doc]: vLLM setup with Triton/ROCm seems to have stale content. — documentation,rocm,stale — by vishnoianil (closed: 2026-02-07 10:16 (UTC+8)) [💬5] ### 📚 The doc issue
Noticed two issues on the GPU installation page for vLLM https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=cuda#build-wheel-from-source
- AMD ROCm installation instruction says “Install ROCm’s Triton (the default triton-mlir branch) following the instructions from ROCm/triton”, but the instruction is using triton-lang/triton ``` git clone https://github.com/triton-lang/triton.git cd tri…
-
#33598 [CI Failure]: mi325_4: Qwen3-30B-A3B-FP8-block Accuracy (H100) — ci-failure — by AndreasKaratzas (closed: 2026-02-07 09:19 (UTC+8)) [💬3] ### Name of failing test
bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29461 [CI Failure]: mi325_1: Language Models Test (PPL) — ci-failure — by AndreasKaratzas (closed: 2026-02-07 09:18 (UTC+8)) [💬5] ### Name of failing test
pytest -v -s models/language/generation_ppl_test
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) — ci-failure — by AndreasKaratzas (closed: 2026-02-07 09:18 (UTC+8)) [💬17] ### Name of failing test
pytest -v -s models/language/pooling -m 'not core_model'
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29459 [CI Failure]: mi325_1: Language Models Test (Extended Generation) — ci-failure — by AndreasKaratzas (closed: 2026-02-07 09:17 (UTC+8)) [💬7] ### Name of failing test
pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#29462 [CI Failure]: mi325_8: Language Models Tests (Hybrid) %N — ci-failure — by AndreasKaratzas (closed: 2026-02-07 09:17 (UTC+8)) [💬7] ### Name of failing test
uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5' && uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2' && pytest -v -s models/language/generation -m hybrid_model --num-shards=$BUILDKITE_PARALLEL_JOB_COUNT --shard-id=$BUILDKITE_PARALLEL_JOB
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#33938 [Bug][ROCm] Platform detection initializes CUDA prematurely, breaking Ray multi-GPU allocation — bug,rocm — by kouroshHakha (closed: 2026-02-07 06:17 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (truncated)
-
#34004 [Bug]: docker ROCm build fails — bug,rocm — by markg85 (closed: 2026-02-07 03:52 (UTC+8)) [💬4] ### Your current environment
Irrelevant because it’s inside docker, not on my host. I’m building the docker container.
### 🐛 Describe the bug
I followed this description: https://docs.vllm.ai/en/v0.6.5/getting_started/amd-installation.html I went for option 1 and I meet the criteria exactly. I have an AMD 7900XT, so I went with this docker build line:
…
-
#33541 [Feature]: Automatic full-core utilization for TP=1 on arm multi-NUMA CPU systems — feature request,cpu — by phalani-paladugu (closed: 2026-02-06 20:26 (UTC+8)) [💬7] ### 🚀 The feature, motivation and pitch
On arm servers with two NUMA nodes, when tensor parallelism (TP) is set to 1, the runtime currently utilizes cores from only a single NUMA node unless VLLM_CPU_OMP_THREADS_BIND is explicitly configured. For example, on a two NUMA system with 64 cores, only cores 0–31 are active by default.
With TP=2, cores are correctly partitioned across NUMA nodes without requiring any manual configuration.
Requested enhancement: When TP=1, the runtime should automati…
-
#31467 [RFC]: A Triton operator dispatch mechanism through modified `CustomOp` — RFC — by MengqingCao (closed: 2026-02-06 20:07 (UTC+8)) [💬14] ### Motivation.
Triton is becoming increasingly important in vLLM, and we’ve noticed its use in many models, quantization processes, and general workflows. Meanwhile, vLLM supports various backends. Typically, to achieve high performance, different implementations of the Triton kernels are used on different hardware, such as Ascend NPU. However, we’ve observed that vLLM currently lacks an effective operator dispatch mechanism for Triton to ensure that various backends can implement their ow…
-
#33478 [RFC]: Introduce ATOM as model implementation backend for AMD GPU — rocm,RFC — by zejunchen-zejun (closed: 2026-02-06 16:13 (UTC+8)) [💬8] ### Motivation
ATOM is a foundational component of AMD’s AI inference strategy. It is implemented by ROCm developers and serves as the model implementation backend for high-performance inference on AMD GPUs. It is built by integrating optimizations from ROCm’s high-performance operator library aiter and high-performance communication library mori into the model execution path.
Th…
-
#33885 [Bug]: Scores returned by a vLLM-served Qwen3VLReranker endpoint are very low (only around 0.5) when reranking any image — bug — by RC-Qiao (closed: 2026-02-06 14:42 (UTC+8)) [💬1] ### Your current environment
docker run --gpus '"device=7"' --entrypoint "" -v /dataset/models/Qwen/Qwen3-VL-Reranker-8B:/model -p 9091:8000 --shm-size=8g vllm/vllm-openai:v0.15.1-cu130 vllm serve /model --runner pooling --max-model-len 16384 --gpu-memory-utilization 0.5 --dtype bfloat16 --hf-overrides '{"architectures": ["Qwen3VLForSequenceClassification"], "classifier_from_token": ["no", "yes"], "is_original_qwen3_reranker": true}' --chat-template /model/qwen…
-
#31229 [RFC]: Early-Fail Tokenization Guard for Completions or Chat Completions — RFC — by scratch-ml (closed: 2026-02-06 11:38 (UTC+8)) [💬4] ## Motivation.
### Problem Extremely long chat inputs (e.g., hundreds of millions of tokens) are fully tokenized before length checks, causing CPU OOM (e.g., ~300GB per vllm instance) and hanging the single-threaded async tokenizer, blocking all other requests.
More concretely, in RL scenarios, many request inputs come from agent–environment interactions and can sometimes become huge. This should ideally be mitigated by client-side validation before sending the request, but in practice calle…
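A minimal sketch of the kind of guard the RFC proposes (the function name and the characters-per-token bound are assumptions for illustration, not vLLM's actual code): bound the possible token count by a cheap character-count check before tokenizing at all.

```python
def early_length_guard(text: str, max_model_len: int,
                       max_chars_per_token: int = 16) -> None:
    """Reject inputs that cannot possibly fit, before tokenizing them.

    Assumes a single token rarely spans more than `max_chars_per_token`
    characters, so len(text) / max_chars_per_token lower-bounds the token
    count: anything longer than max_model_len * max_chars_per_token chars
    cannot fit in max_model_len tokens.
    """
    if len(text) > max_model_len * max_chars_per_token:
        raise ValueError(
            f"input of {len(text)} chars cannot fit in {max_model_len} tokens")

early_length_guard("hello world", max_model_len=8192)  # passes cheaply
try:
    early_length_guard("x" * 200_000, max_model_len=8192)
except ValueError as exc:
    print("rejected before tokenization:", exc)
```

The O(1) length check runs before the O(n) tokenizer, so a pathological request fails fast instead of occupying the single-threaded async tokenizer.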
[New PRs]
-
#33992 [Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs — bug,documentation,ci/build,nvidia — by ehfd (created: 2026-02-06 20:35 (UTC+8)) [💬3 | +82/-5, 4 files | commented:6] ## Purpose
Fix #32373 Fix #33369
Fixes the core problem in https://github.com/vllm-project/vllm/issues/32373 and https://github.com/vllm-project/vllm/issues/33369, introduced from https://github.com/vllm-project/vllm/pull/30784 and https://github.com/vllm-project/vllm/pull/33116.
The issue with the CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs: (1) Consumer NVIDIA GPUs must NOT have the CUDA compatibility libraries inside `LD_LIBRARY_PATH`. (2) Professional and …
-
#34027 [bug-fix] supported_tasks is breaking backward compatibility at init_app_state — bug,frontend,ready — by kouroshHakha (created: 2026-02-07 07:26 (UTC+8)) [+3/-1, 1 files | commented:3] init_app_state now takes supported_tasks as a mandatory input, which breaks applications that integrate with vllm at this layer.
engine_client is the source of truth for supported_tasks and is already a parameter to the init_app_state helper. This PR makes the supported_tasks param optional and, when it is None, retrieves it from engine_client.
-
#34017 [ModelRunner V2] Revert token rank comparison difference for now — ready,v1 — by njhill (created: 2026-02-07 04:22 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] We think it makes more sense to use a strict inequality for the token rank computation, but it would be best to propose this in a separate PR since it involves a behavior change and is orthogonal to MRV2.
Making this consistent with V1 will help with output equivalence tests. We can subsequently change it in both places if needed (or make configurable if that’s really a problem).
-
#34033 [BugFix] Fix mm_encoder_only init for qwen3 vl moe model — bug,qwen — by shepark (created: 2026-02-07 11:04 (UTC+8)) [💬2 | +27/-1, 1 files | commented:1] Fixes mm_encoder_only for qwen3-vl-moe by adding get_language_model_spec() and guarding MoE init when the language model isn’t instantiated. This prevents an AttributeError during encoder-only loading, avoids an unnecessary full language-model init, and saves memory.
## Test Plan python test.py ``` from vllm import LLM MODEL="Qwen/Qwen3-VL-30B-A3B-Instruct" if __name__ == '__main__': llm = LLM( …
-
#33967 [Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 — bug,ready — by ikchifo (created: 2026-02-06 13:33 (UTC+8)) [💬6 | +154/-6, 5 files | commented:3 approved:1] ## Purpose
Fixes #33295
On B200 with FP8-quantized models (e.g., Qwen/Qwen3-30B-A3B-FP8), PyTorch Inductor’s CSE fails to merge identical `split_with_sizes` calls. The QK Norm+RoPE pattern matcher expects one split node with 3 `getitem` users (q/k/v), but the graph contains 3 separate split nodes with 1 user each, so fusion fails silently. This PR adds `coalesce_equivalent_splits()` to merge duplicate splits before matching. It groups splits by `(args, kwargs)` signature and merges duplicat…
-
#34015 [Misc] Add backward-compatible import aliases for renamed translations module — frontend,ready — by kouroshHakha (created: 2026-02-07 03:32 (UTC+8)) [+69/-0, 5 files | commented:1 approved:1] ## Summary
- PR #33904 renamed `vllm.entrypoints.openai.translations` to `vllm.entrypoints.openai.speech_to_text`, which breaks any downstream code importing from the old path.
- This PR restores the old `vllm.entrypoints.openai.translations` package as thin backward-compatible stubs that re-export everything from the new `speech_to_text` location while emitting a `DeprecationWarning` on import.
- The deprecation warnings instruct users to update their imports and are scheduled for removal in v…
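The stub-module pattern this PR uses can be demonstrated self-containedly (toy module names below, not the PR's actual code): the old import path re-exports the new module's contents and warns on access.

```python
import sys
import types
import warnings

# Toy "new" module standing in for the relocated package.
new_mod = types.ModuleType("speech_to_text_demo")
new_mod.transcribe = lambda audio: f"text for {audio}"

def make_alias(old_name: str, target: types.ModuleType) -> types.ModuleType:
    """Build a stub module that forwards attribute access to `target`."""
    alias = types.ModuleType(old_name)
    def __getattr__(name):  # PEP 562 module-level __getattr__
        warnings.warn(f"{old_name} is deprecated; use {target.__name__}",
                      DeprecationWarning, stacklevel=2)
        return getattr(target, name)
    alias.__getattr__ = __getattr__
    return alias

# Register the old path so `import translations_demo` resolves to the stub.
sys.modules["translations_demo"] = make_alias("translations_demo", new_mod)

import translations_demo
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(translations_demo.transcribe("a.wav"))  # text for a.wav
print(caught[0].category is DeprecationWarning)   # True
```

Warning on attribute access (rather than once at import) keeps the stub cheap and points each offending call site at the new location.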
-
#34032 [ROCm] update triton branch to support gpt-oss models for gfx11xx devices — rocm,ci/build,gpt-oss — by hongxiayang (created: 2026-02-07 10:00 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1] Fix https://github.com/vllm-project/vllm/issues/33906 Fix https://github.com/vllm-project/vllm/issues/33143
To support serving gpt-oss models (20b, for example) on gfx11xx devices, we need to update the triton branch in Dockerfile.rocm_base. The needed support is in upstream triton, but not in the branch used by the vllm base docker. The patch to rocm/triton was done in the associated PR https://github.com/ROCm/triton/pull/923.
…
-
#34007 [CI][AMD][Bugfix] Check that model_config is not None in enable_norm_pad_fusion — bug,rocm,ready — by rasmith (created: 2026-02-07 01:15 (UTC+8)) [+1/-0, 1 files | commented:1 approved:3] ## Purpose
If `cfg.model_config` is `None`, then `vllm.enable_norm_pad_fusion` will crash. This is causing the following failures in CI:
``` FAILED kernels/moe/test_routing_simulator.py::test_routing_strategy_integration - AttributeError: 'NoneType' object has no attribute 'get_hidden_size' FAILED kernels/moe/test_shared_fused_moe_routed_transform.py::test_routed_input_transform_inside_vs_outside[dtype0-256-128-1] - AttributeError: 'NoneType' object has no attrib…
-
#34011 [Bugfix] Fix Whisper tokenization — bug,ready — by NickLucche (created: 2026-02-07 02:32 (UTC+8)) [+8/-0, 1 files | commented:1 approved:1] Fix https://github.com/vllm-project/vllm/issues/34002.
Since HF forwards **kwargs to both tokenizer and feature extractor, WhisperFeatureExtractor truncates audio to N raw samples (~0.03s), producing empty output. Strip these kwargs when audio data is present.
cc @DarkLight1337
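The fix described above can be sketched like this (the helper name and kwarg list are assumptions for illustration, not the PR's code):

```python
# Hypothetical helper: when audio data is present, drop kwargs that are meant
# for the tokenizer, so the feature extractor never sees e.g. truncation
# limits that would clip the raw audio samples to near-zero length.
TOKENIZER_ONLY_KWARGS = ("truncation", "max_length")

def kwargs_for_feature_extractor(kwargs: dict, audio_present: bool) -> dict:
    if not audio_present:
        return kwargs
    return {k: v for k, v in kwargs.items() if k not in TOKENIZER_ONLY_KWARGS}

print(kwargs_for_feature_extractor(
    {"truncation": True, "max_length": 448, "sampling_rate": 16000},
    audio_present=True))  # {'sampling_rate': 16000}
```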
-
#33983 [CPU][PPC64] Fix bf16 path in mla_decode.cpp — cpu — by Akashcodes732 (created: 2026-02-06 17:32 (UTC+8)) [💬2 | +9/-1, 1 files | commented:1] ## Purpose
Add BF16 kernel types for ppc64 in `mla_decode.cpp`. A similar issue was seen in #33788 and #30329.
-
#34031 [CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 — ready,ci/build — by ProExpertProg (created: 2026-02-07 09:56 (UTC+8)) [+8/-10, 1 files | commented:1]
## Purpose The final version of the E2E fusion overhaul in #33293 ended up slashing some tests that should have remained. This PR fixes that by improving the filtering logic. Only for B200 tests ## Test Plan CI
## Test Result TBD
-
#34026 add --insecure arg to the vllm bench to skip TLS — performance — by fanyang-real (created: 2026-02-07 07:14 (UTC+8)) [💬1 | +138/-4, 2 files | commented:1] ## Purpose
Add an `--insecure` flag to `vllm bench serve` to skip TLS certificate verification when connecting to servers with self-signed certificates.
This is useful when benchmarking vLLM servers that use self-signed SSL certificates, where the default strict certificate verification would fail.
## Test Plan
```bash pytest tests/benchmarks/test_serve_cli.py -v -m benchmark …
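What such an insecure switch boils down to can be sketched with the standard library (an illustration of the TLS setting itself, not the PR's implementation):

```python
import ssl

# Build a client SSL context that skips certificate verification, as an
# --insecure style flag would request for servers with self-signed certs.
ctx = ssl.create_default_context()
ctx.check_hostname = False          # must be disabled before verify_mode
ctx.verify_mode = ssl.CERT_NONE     # accept any certificate
print(ctx.verify_mode == ssl.CERT_NONE)  # True
```

Note the ordering: `check_hostname` must be disabled first, since Python refuses to set `CERT_NONE` while hostname checking is still on.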
-
#34030 [Bugfix] Add reasoning_content backward compat to DeltaMessage for streaming — bug,frontend — by cradonn (created: 2026-02-07 08:58 (UTC+8)) [💬1 | +89/-0, 2 files | commented:1]
## Summary
Adds a `reasoning_content` field to the `DeltaMessage` class to maintain backward compatibility for streaming responses, completing the work started in commit bf001da.
## Context
- RFC #27755 renamed `reasoning_content` to `reasoning` to align with the OpenAI API
- Commit bf001da added backward compat for input message parsing in `chat_utils.py` …
-
#34029 [Perf] Optimize async scheduling redundant copy, 0.9% E2E throughput improvement — ready,v1 — by yewentao256 (created: 2026-02-07 08:35 (UTC+8)) [+22/-12, 2 files | commented:2] ## Purpose
Originally, we did a `req.all_token_ids.copy()` unconditionally; this PR optimizes the logic:
- sync case: no copy (not needed since we already sync)
- async scheduling: only copy the `output_token_ids`
## Test
### Unit test …
-
#34025 [Kernel] [Helion] [5/N] Add Helion Autotuning infrastructure — no labels — by gmagogsfm (created: 2026-02-07 07:08 (UTC+8)) [💬1 | +1649/-14, 6 files | commented:3] NOTE: this is a manually stacked PR; each commit is reviewed separately. For this PR, please only review the top commit: Add Helion autotuning infra
This PR adds autotuning support to vLLM-Helion. Each Helion kernel can register an input generator that returns a dictionary of `(helion_config_key, representative_input)` pairs. The autotuning infra runs autotuning against these inputs and generates one config per `helion_config_key`.
-
#34024 [Core] Add Helix (Context + Tensor) Parallelism — documentation,v1,llama,nvidia — by sungsooha (created: 2026-02-07 06:44 (UTC+8)) [💬6 | +1485/-86, 18 files | commented:1]
## Purpose
Implements Helix (Context + Tensor) Parallelism based on NVIDIA’s Helix paper (arXiv:2507.07120).
RFC: #34018 Helix enables efficient long-context decoding by:
- Eliminating KV cache duplication when `TP > num_kv_heads`
- Removing the `TP <= num_kv_heads` constraint for GQA models with context parallelism …
-
#34022 [Misc][Spec Decode] support different load config for draft model — speculative-decoding,v1 — by ZhengkaiZ (created: 2026-02-07 06:00 (UTC+8)) [+8/-1, 3 files | commented:1 | 📝 draft] [Misc][Spec Decode] support different load config for draft model
Summary:
Sometimes, to achieve better draft-model performance or to use a different checkpoint format, we customize these parts; adding the freedom to specify a different load config lets our privately trained draft model work.
Test Plan: Command:
``` …
-
#34006 [Kernel] Add KernelConfig flag to enable/disable FlashInfer autotune — ready — by mmangkad (created: 2026-02-07 01:03 (UTC+8)) [💬1 | +104/-1, 5 files | approved:2 commented:8] ## Purpose
Support enabling/disabling FlashInfer autotune.
cc @ProExpertProg
## Test Plan
## Test Result
…
-
#34023 [Bugfix] Fix RAW hazard and optimize stores in EP Scatter Kernel — bug — by Manikvsin (created: 2026-02-07 06:17 (UTC+8)) [💬1 | +11/-4, 1 files | commented:1] Title: [Bugfix] Eliminate RAW hazard and redundant stores in EP Scatter Kernel
Description: This PR fixes a Read-After-Write (RAW) hazard and optimizes global memory access in the EP Scatter Kernel.
## Purpose Previously, the kernel followed a “Store-then-Load” pattern:
Every thread/block computed the cumsum (prefix sum) of tokens per expert.
…
-
#34021 [Bugfix] Fix Worker.load_model context-manager composition for sleep mode — bug,v1 — by tianshu-Michael-yu (created: 2026-02-07 05:40 (UTC+8)) [💬1 | +4/-3, 1 files | commented:1 approved:1] ## What this PR does
Fixes a context-manager composition bug in
vllm/v1/worker/gpu_worker.py:- Before:
with A and B: - After:
with A, B:
In Python,
with A and B:evaluates to a single context manager (BwhenAis truthy), soAis never entered. Here that means_maybe_get_memory_pool_context(tag="weights")is skipped, so weight allocations are not tracked by cuMem sleep mode.## Why this matters …
- Before:
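The pitfall is easy to reproduce; a minimal sketch (the `track` helper is illustrative, not vLLM code):

```python
from contextlib import contextmanager

entered = []

@contextmanager
def track(name):
    # This body runs only when the with statement calls __enter__.
    entered.append(name)
    yield

# Buggy form: `A and B` is one expression that evaluates to B
# (since A is truthy), so A.__enter__ is never called.
with track("A") and track("B"):
    pass
assert entered == ["B"]

# Fixed form: two context managers, both entered in order.
entered.clear()
with track("A"), track("B"):
    pass
assert entered == ["A", "B"]
```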
-
#34003 [torch.compile] Stop compiling identical artifacts — ready,ready-run-all-tests — by zou3519 (创建于: 2026-02-07 00:44 (UTC+8)) [💬3 | +64/-11, 2 files | commented:1 approved:1] ## Purpose
This is a resubmission of https://github.com/vllm-project/vllm/pull/33641 that is actually correct.
Previously, the vLLM-compile compilation:
- does a full dynamo trace to get a full graph
- splits the graph on attention ops into N subgraphs
- compiles each of the N subgraphs via standalone_compile.
- If any of the subgraphs are the same, we rely on aotautograd and inductor to cache-hit, but we still end up producing N artifacts total. Usually there are 3 unique graphs in vLLM model…
-
#34020 [wip] layerwise loading for fp8.py, take 2 — 无标签 — by vkuzo (创建于: 2026-02-07 05:23 (UTC+8)) [+221/-63, 4 files | commented:2 | 📝草稿] Summary:
TODO write me
Test Plan:
TODO write me
…
-
#34019 [Quantization][Refactor] Clean up GPTQ + AWQ quantization — 无标签 — by mu-hashmi (创建于: 2026-02-07 04:51 (UTC+8)) [+40/-36, 3 files | commented:1 | 📝草稿] ## Purpose Addresses #31689. Building on `gptq_marlin.py` per @robertgshaw2-redhat’s guidance to consolidate GPTQ/AWQ quantization and remove the legacy code paths. So far: widened `GPTQMarlinConfig` to handle all GPTQ bit-widths and symmetry configs through the `MPLinearKernel` abstraction, added GPTQv2 checkpoint format support, and enabled 2/3-bit types in ExllamaLinearKernel. Still TODO: validate all kernel backends end-to-end, remove `gptq.py`, convert `moe_wna16.py` into a modular kernel…
- #33988 [Bugfix][Frontend] Fix IndexError in Mistral tool parser during streaming tool calls — bug,frontend — by veeceey (创建于: 2026-02-06 19:13 (UTC+8)) [💬3 | +171/-11, 3 files | commented:4]
## Summary
- Fixes `IndexError: list index out of range` in `chat_completion_stream_generator` when using `--tool-call-parser=mistral` with streaming tool calls
- The root cause was that the Mistral tool parser never populated `streamed_args_for_tool`, but `serving.py` unconditionally accessed `tool_parser.streamed_args_for_tool[index]` at stream completion
- Adds a defensive bounds check in `serving.py` before indexing `streamed_args_for_tool` (protects all tool parsers)
- Updates both v11+ and…
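The defensive check amounts to bounds-guarding the list access; an illustrative helper (not the actual `serving.py` code):

```python
def get_streamed_args(streamed_args_for_tool: list[str], index: int) -> str:
    """Return streamed arguments for a tool call, or "" when the
    parser never populated that slot (avoids IndexError)."""
    if 0 <= index < len(streamed_args_for_tool):
        return streamed_args_for_tool[index]
    return ""

assert get_streamed_args(['{"a": 1}'], 0) == '{"a": 1}'
assert get_streamed_args([], 0) == ""  # previously raised IndexError
```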
-
#34010 [Fix] Fix `logprobs=0` handling for `/inference/v1/generate` endpoint — frontend,ready — by SumanthRH (创建于: 2026-02-07 02:24 (UTC+8)) [+29/-2, 2 files | commented:1 approved:1] ## Purpose
The `/inference/v1/generate` endpoint differs from the `/v1/completions` endpoint and the Engine APIs in `logprobs=0` handling: passing `logprobs=0` returns no logprobs, while it is supposed to return chosen-token logprobs (same as `logprobs=1`). This PR makes the behaviour consistent with the completions endpoint.
## Test Plan
Added a new test for logprobs handling in the generate endpoint with logprobs values 0, 1, and 5
…
-
#34014 [Model] Introduce first-class DFlash speculative decoding in vLLM V1 — documentation,new-model,speculative-decoding,v1,qwen — by dangoldbj (创建于: 2026-02-07 03:21 (UTC+8)) [💬2 | +2163/-22, 18 files | commented:1]
## Purpose This PR introduces first-class DFlash speculative decoding support in vLLM V1.
This is the initial DFlash implementation in vLLM. Prior to this change, vLLM did not support DFlash speculative decoding. This PR establishes an explicit, config-driven DFlash execution path with correctness and maintainability guarantees.
Key changes:
- Adds explicit `method="dflash"` support and full DFlash model/config plumbing.
- Establishes a dedicated `DFlash…
-
#34009 [Bugfix] Fix DP Attention Padding in Dummy Run — bug,v1 — by benchislett (创建于: 2026-02-07 02:16 (UTC+8)) [+3/-0, 1 files | commented:3] ## Purpose
FIX #32626 FIX #33450
Problem: TRTLLM attention requires that num_decode_tokens be divisible by num_requests. However, during DP we sometimes do a dummy run on one of the workers so they don’t get out of sync: in such cases, we pad for attention but the attention metadata builder receives the padded number of requests and the unpadded number of tokens.
This leads to a crash where we see `num_decode_tokens == 1` but `num_requests == 2`. I tracked this down to `_build_attention_metadata` on… -
#33997 [Hybrid] Enable mamba prefix cache “align” mode with async scheduling — v1 — by tdoublep (创建于: 2026-02-06 22:44 (UTC+8)) [+79/-41, 5 files | commented:3] ## Purpose
On main, mamba prefix caching “align” mode does not currently work with async scheduling when speculative decoding is enabled. The core reason is that the scheduler does not have an accurate measurement of `num_computed_tokens`, so we don’t actually know the exact number of computed tokens in the `MambaManager` when we are allocating and freeing blocks. Specifically, the value of `num_computed_tokens` in the scheduler is an over-estimate of the real number, since it can… -
#34013 Threshold fix wvSplitk for occasional CI fails — rocm — by amd-hhashemi (创建于: 2026-02-07 03:12 (UTC+8)) [+2/-0, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#33993 [Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) — bug,ready,deepseek — by xuebwang-amd (创建于: 2026-02-06 21:33 (UTC+8)) [💬4 | +6/-1, 2 files | commented:1 approved:2] ## Purpose
Model: deepseek-ai/DeepSeek-V3.1 This PR aims to fix an error: ``` Traceback (most recent call last): File "/workspace/xuebwang/vllm/vllm/v1/executor/multiproc_executor.py", line 754, in worker_main worker = WorkerProc(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/workspace/xuebwang/vllm/vllm/v1/executor/multiproc_executor.py", line 580, in init …
-
#33971 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,ready,nvidia — by vllmellm (创建于: 2026-02-06 14:17 (UTC+8)) [💬3 | +237/-235, 4 files | commented:1 approved:1]
## Purpose Update the Deployment > Docker docs with Docker information for the ROCm/CUDA/Intel platforms. Docker deployment is linked as snippets from the installations/docker section.
## Test Plan
## Test Result
…
- #33964 [Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 — bug,ready — by chaunceyjiang (创建于: 2026-02-06 12:39 (UTC+8)) [💬1 | +12/-0, 1 files | commented:1 approved:1]
## Purpose
<|DSML|function_calls> <|DSML|invoke name="get_weather"> <|DSML|parameter name="location" string="true">杭州</|DSML|parameter> <|DSML|parameter name="date" string="true">2024-01-16</|DSML|parameter> </|DSML|invoke>…
-
#33963 [Bugfix] send None sentinel on final commit so server properly sends transcription.done — bug,frontend — by pjs102793 (创建于: 2026-02-06 11:40 (UTC+8)) [💬2 | +1/-0, 1 files | commented:1] ## Purpose
Fixes https://github.com/vllm-project/vllm/issues/33962
Fix server not sending `transcription.done` when client sends `input_audio_buffer.commit` with `final: true` during real-time audio streaming. Transcription works correctly — `transcription.delta` events are sent as expected. However, the `final` commit handler sets `_is_input_finished = True` but does not send the `None` sentinel to `audio_queue`, leaving `audio_stream_generator` blocked on `queue.get()` forever. This deadlock…
- #33994 Bump `lm-eval` version for Transformers v5 compatibility — documentation,rocm,ready,ci/build — by hmellor (创建于: 2026-02-06 21:53 (UTC+8)) [💬3 | +19/-31, 14 files | approved:1 commented:1] Update lm-eval pin to 0.4.10 so that EleutherAI/lm-evaluation-harness@384118d is included (the previous version imported a class that no longer exists in Transformers v5)
#33998 [Revert] Add util `handle_deprecated` back — ready — by yewentao256 (创建于: 2026-02-06 23:32 (UTC+8)) [+25/-0, 1 files | commented:1 approved:2] ## Purpose
Fixes https://github.com/vllm-project/vllm/pull/33718#discussion_r2773564375
CC: @hmellor
- #33973 [XPU][5/N] add wna16 xpu kernel — ready,ci/build — by zufangzhu (创建于: 2026-02-06 15:27 (UTC+8)) [💬2 | +58/-66, 2 files | commented:1 approved:1] This PR is [5/N] of https://github.com/vllm-project/vllm/issues/33214 add compressed tensor wna16 support for xpu
-
#33999 fix description in plugin_system.md — documentation — by guodongxiaren (创建于: 2026-02-06 23:32 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1]
## Purpose Because vllm_add_dummy_model is `packages` not `py_modules` in this example. ## Test Plan ## Test Result
... - #33978 Fix spelling errors — ready,v1 — by sleepcoo (创建于: 2026-02-06 16:40 (UTC+8)) [+10/-10, 5 files | commented:1 approved:1]
## Summary
- Fix `Intialize` → `Initialize` in test files
- Fix `is_availble` → `is_available` in flashmla.py
- Fix `AVAILBLE_BACKENDS` → `AVAILABLE_BACKENDS` in fp8.py
- Fix `NO_REASONING_QUICK_THROUGHT` → `NO_REASONING_QUICK_THOUGHT` in test file
## Test plan
- Changes are typo fixes only (comments and variable names)
- No functional code changes …
-
#33989 Update `WeightTransferConfig` to be more standard like the others — ready — by hmellor (创建于: 2026-02-06 19:17 (UTC+8)) [+4/-6, 2 files | commented:1 approved:1]
- `@config` is a dataclass transform now
- The default shouldn’t be duplicated in `EngineArgs`
-
#33990 Pass modality information in embed_multimodal — speculative-decoding,needs-rebase,v1,qwen — by reaganjlee (创建于: 2026-02-06 19:31 (UTC+8)) [💬1 | +43/-14, 4 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#33979 [Frontend] Add --disable-log-prefix flag and VLLM_DISABLE_LOG_PREFIX env var — frontend,v1 — by veeceey (创建于: 2026-02-06 16:53 (UTC+8)) [💬3 | +133/-9, 9 files | commented:5] ## Summary Add an option to disable the process/thread log prefix (e.g.
`(APIServer pid=12345)`) that vLLM adds to stdout/stderr via `decorate_logs()`. Motivation: Users with custom logging configs or log aggregation systems cannot control this prefix because it’s injected before Python’s logging system. The `VLLM_LOGGING_CONFIG_PATH` config has no effect on it. ## Changes
- `vllm/envs.py`: Add `VLLM_DISABLE_LOG_PREFIX` environment variable
- `vllm/entrypoints/openai/cli_args.py`: Add…
-
#33977 [Bugfix] Fix models and tests for transformers v5 — bug,documentation,ready,multi-modality,qwen — by zucchini-nlp (创建于: 2026-02-06 16:29 (UTC+8)) [💬2 | +70/-55, 12 files | commented:8 approved:1] As per title, related to https://github.com/vllm-project/vllm/pull/30566. All changes are v4-compatible as well.
@hmellor
-
#33985 [Kernel] FlashInfer: switch allreduce fusion to unified API — performance — by mmangkad (创建于: 2026-02-06 18:06 (UTC+8)) [+80/-114, 3 files | commented:1]
## Purpose
- Migrate vLLM’s FlashInfer allreduce fusion to the unified `flashinfer.comm.allreduce_fusion` API and workspace creation.
- Update the fused collective benchmark and fusion test to the new API and workspace lifecycle.
Dependency note: Requires `flashinfer-python >= 0.6.3` (unified allreduce API). ## Test Plan …
- #33984 Bump HF Hub client to get bug fix — bug,rocm,ready,ci/build — by hmellor (创建于: 2026-02-06 17:35 (UTC+8)) [+2/-2, 2 files | approved:1 commented:1] Bug fix in question is https://github.com/huggingface/huggingface_hub/pull/3778
-
#33976 [PaddleOCR-VL] Add BC for transformers 5.0 config — ready — by zhang-prog (创建于: 2026-02-06 16:22 (UTC+8)) [+7/-0, 1 files commented:1 approved:2] -
#33982 Consolidate and fix forbidden import `pre-commit` checks — ready — by hmellor (创建于: 2026-02-06 17:21 (UTC+8)) [+145/-300, 5 files | commented:2 approved:1] Previously the regex and triton checks didn’t actually work in CI because they did not use the file list passed by `pre-commit` and instead relied on the git diff. This means that the check would miss violations in CI because the git diff would be empty.
This PR consolidates the three forbidden import checks into a single check which does consume `pre-commit`’s file list. -
#33981 fix: reject non-text content in system/developer messages — frontend — by veeceey (创建于: 2026-02-06 17:21 (UTC+8)) [💬2 | +211/-0, 2 files | commented:4] ## Summary
Fixes #33925
Per the OpenAI API specification, `system` and `developer` role messages should only accept the `text` content type. Previously, vLLM allowed multimodal content (e.g. `image_url`, `input_audio`, `video_url`) in system messages without any validation, which diverges from the OpenAI API behavior. ### Changes
`vllm/entrypoints/chat_utils.py`: Added a `_validate_text_only_c…
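The validation itself is a small type check over message parts; a sketch assuming OpenAI-style content lists (the helper name is illustrative, not the actual vLLM function):

```python
def validate_text_only(role: str, content) -> None:
    """Reject non-text parts in system/developer messages."""
    if role not in ("system", "developer") or isinstance(content, str):
        return  # other roles, or plain-string content, are fine
    for part in content:
        if part.get("type") != "text":
            raise ValueError(f"{role} messages only accept text content")

validate_text_only("system", "you are helpful")       # ok
validate_text_only("user", [{"type": "image_url"}])   # ok: not system/developer
try:
    validate_text_only("system", [{"type": "image_url"}])
except ValueError:
    pass
else:
    raise AssertionError("should have rejected image_url")
```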
-
#33969 [Frontend] Add --disable-uvicorn-metrics-access-log shorthand flag — documentation,frontend — by veeceey (创建于: 2026-02-06 13:56 (UTC+8)) [💬3 | +146/-11, 4 files | commented:1] ## Summary
Adds a convenience boolean CLI flag `--disable-uvicorn-metrics-access-log` that suppresses uvicorn access logs for the `/health` and `/metrics` endpoints. This is the flag originally requested in #29023.
- New CLI argument: `--disable-uvicorn-metrics-access-log` (boolean, default `False`)
- Reuses existing infrastructure: builds on the `UvicornAccessLogFilter` and `--disable-access-log-for-endpoints` machinery from #30011, no new filter classes or duplicate logic
- **Composabl…
-
#33972 Scale input before applying Marlin operator — 无标签 — by ir1ka (创建于: 2026-02-06 15:21 (UTC+8)) [+59/-1, 4 files | commented:3 | 📝草稿] When `dtype=float16`, due to the smaller dynamic range of float16, data overflow can easily occur during GEMM, leading to inference errors. Therefore, we dynamically scale the input before GEMM to prevent data overflow. Fix: https://github.com/vllm-project/vllm/issues/33560 https://github.com/vllm-project/vllm/issues/33461
## Purpose
The nvfp4 model supports the use of `dtype=float16`. …
- #33975 Fix `main` pre-commit — 无标签 — by hmellor (创建于: 2026-02-06 16:07 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] This one slipped through in https://github.com/vllm-project/vllm/pull/33720
-
#33968 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,nvidia — by vllmellm (创建于: 2026-02-06 13:48 (UTC+8)) [💬3 | +231/-192, 5 files]
## Purpose Update the Deployment > Docker with docker information for ROCm/CUDA/Intel platform. Linked docker deployment as a snippets from installations/docker section.
## Test Plan
## Test Result
…
-
#33965 [Bugfix] Fix Qwen3-Coder tool call streaming for duplicate names and param parsing — bug,qwen — by alexbi29 (创建于: 2026-02-06 12:51 (UTC+8)) [💬2 | +260/-264, 1 files | commented:1] ## Summary
Fixes multiple issues in the Qwen3-Coder-Next tool parser’s streaming (with opencode) mode that caused tool call arguments to be lost or corrupted:
- Duplicate tool names broken: When the model made multiple calls to the same tool (e.g. two `read` calls), the name-based lookup in `prev_tool_call_arr` prevented the second call from being added. Clients received `undefined` for the second call’s arguments. Fixed by replacing all name-based lookups with index-based ones.
- **Body p…
[已合并 PR]
-
#34017 [ModelRunner V2] Revert token rank comparison difference for now — ready,v1 — by njhill (合并于: 2026-02-07 11:11 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] We think it makes more sense to use a strict inequality for the token rank computation, but would be best to propose this in a separate PR since it involves a behavior change and is orthogonal to MRV2.
Making this consistent with V1 will help with output equivalence tests. We can subsequently change it in both places if needed (or make configurable if that’s really a problem).
-
#33967 [Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 — bug,ready — by ikchifo (合并于: 2026-02-07 10:27 (UTC+8)) [💬6 | +154/-6, 5 files | commented:3 approved:1] ## Purpose
Fixes #33295
On B200 with FP8-quantized models (e.g., `Qwen/Qwen3-30B-A3B-FP8`), PyTorch Inductor’s CSE fails to merge identical `split_with_sizes` calls. The QK Norm+RoPE pattern matcher expects one split node with 3 `getitem` users (q/k/v), but the graph contains 3 separate split nodes with 1 user each, so fusion fails silently. This PR adds `coalesce_equivalent_splits()` to merge duplicate splits before matching. It groups splits by `(args, kwargs)` signature and merges duplicat… -
#34015 [Misc] Add backward-compatible import aliases for renamed translations module — frontend,ready — by kouroshHakha (合并于: 2026-02-07 11:01 (UTC+8)) [+69/-0, 5 files | commented:1 approved:1] ## Summary
- PR #33904 renamed `vllm.entrypoints.openai.translations` to `vllm.entrypoints.openai.speech_to_text`, which breaks any downstream code importing from the old path.
- This PR restores the old `vllm.entrypoints.openai.translations` package as thin backward-compatible stubs that re-export everything from the new `speech_to_text` location while emitting a `DeprecationWarning` on import.
- The deprecation warnings instruct users to update their imports and are scheduled for removal in v…
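The stub pattern can be sketched in a few lines (illustrative; the real stubs re-export the actual `speech_to_text` names):

```python
import types
import warnings

def deprecated_alias(old_name: str, new_module: types.ModuleType):
    """Warn on use of the old import path, then hand back the new module."""
    warnings.warn(
        f"{old_name} is deprecated; import from {new_module.__name__} instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_module

# Usage sketch: the stub module would call this at import time.
new_mod = types.ModuleType("speech_to_text")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mod = deprecated_alias("translations", new_mod)
assert mod is new_mod
assert caught and issubclass(caught[0].category, DeprecationWarning)
```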
-
#33821 [Bugfix] Fix _fused_moe_lora_expand signature mismatch — bug,ready — by xyang16 (合并于: 2026-02-07 10:45 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1]
## Purpose
This PR fixes the signature mismatch between `_fused_moe_lora_expand` and `_fused_moe_lora_expand_fake`. The parameter `b_intermediate_cache1` was removed in `_fused_moe_lora_expand`, so it should be removed in the fake function too.
... -
#34007 [CI][AMD][Bugfix] Check that model_config is not None in enable_norm_pad_fusion — bug,rocm,ready — by rasmith (合并于: 2026-02-07 10:43 (UTC+8)) [+1/-0, 1 files | commented:1 approved:3] ## Purpose
If `cfg.model_config` is `None`, then `vllm.enable_norm_pad_fusion` will crash. This is causing the following failures in CI: ``` FAILED kernels/moe/test_routing_simulator.py::test_routing_strategy_integration - AttributeError: 'NoneType' object has no attribute 'get_hidden_size' FAILED kernels/moe/test_shared_fused_moe_routed_transform.py::test_routed_input_transform_inside_vs_outside[dtype0-256-128-1] - AttributeError: 'NoneType' object has no attrib…
-
#34011 [Bugfix] Fix Whisper tokenization — bug,ready — by NickLucche (合并于: 2026-02-07 10:42 (UTC+8)) [+8/-0, 1 files | commented:1 approved:1] Fix https://github.com/vllm-project/vllm/issues/34002.
Since HF forwards **kwargs to both tokenizer and feature extractor, WhisperFeatureExtractor truncates audio to N raw samples (~0.03s), producing empty output. Strip these kwargs when audio data is present.
cc @DarkLight1337
- #32351 [Feat][RL] Pause and Resume with keep requests for single engine — documentation,frontend,ready,v1 — by hao-aaron (合并于: 2026-02-07 08:08 (UTC+8)) [💬10 | +537/-31, 8 files | commented:9 approved:1]
## Purpose
Completes part 1 of https://github.com/vllm-project/vllm/issues/32103
We introduce a new “mode” parameter to the pause and resume APIs, allowing for the following behaviors:
- `mode="abort"`: all inflight requests are immediately aborted and partially generated sequences are returned to callers
- `mode="wait"`: before pausing, we wait for all inflight requests to finish first
- NEW: `mode="keep"`: we stop all inflight requests asap, but do not return. From the perspective of the caller, it will…
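The three modes differ only in what happens to inflight requests; an illustrative dispatch (hypothetical function, not the vLLM API surface):

```python
def pause_effect(mode: str, inflight: list[str]) -> dict:
    """Summarize each pause mode's effect on inflight requests."""
    if mode == "abort":
        # partial outputs handed back to callers immediately
        return {"returned": inflight, "resumable": []}
    if mode == "wait":
        # requests drain normally before the engine pauses
        return {"returned": [], "resumable": []}
    if mode == "keep":
        # stop asap, retain request state so resume can continue them
        return {"returned": [], "resumable": inflight}
    raise ValueError(f"unknown mode: {mode}")

assert pause_effect("abort", ["r1"])["returned"] == ["r1"]
assert pause_effect("keep", ["r1"])["resumable"] == ["r1"]
```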
-
#33941 [bugfix] [ROCm] Fix premature CUDA initialization in platform detection — bug,rocm,ready,ci/build,nvidia — by kouroshHakha (合并于: 2026-02-07 06:17 (UTC+8)) [💬9 | +133/-6, 6 files | commented:4 approved:1] ## Summary
On ROCm, importing `vllm.platforms` triggers `torch.cuda.get_device_properties()` at module load time, which initializes CUDA before Ray workers can set `CUDA_VISIBLE_DEVICES`. This locks `device_count()` to the total number of GPUs, causing all workers to incorrectly use GPU 0 instead of their assigned GPUs. Closes https://github.com/vllm-project/vllm/issues/33938
## Problem
`# vllm/platforms/rocm.py (before fix)` …
-
#33919 Fix RoutingMethodType logic — ready,ci/build,nvidia,ready-run-all-tests — by dbari (合并于: 2026-02-07 06:03 (UTC+8)) [💬6 | +61/-20, 9 files | commented:1]
## Purpose
This PR contains two fixes for #33792:
- Fix the selection logic for `RoutingMethodType` in `fused_topk_bias_router.py` and `fused_topk_router.py`
- When `use_grouped_topk=True` but the `GroupedTopKRouter` does not find any valid routing method and there is only one group, fall back to the non-grouped routers
The latter point covers Mistral Large 3, which has `n_group=1` and `topk_group=1` but uses `Renormalize` instead of `DeepSeekV3` routing, handled by…
-
#34010 [Fix] Fix `logprobs=0` handling for `/inference/v1/generate` endpoint — frontend,ready — by SumanthRH (合并于: 2026-02-07 04:33 (UTC+8)) [+29/-2, 2 files | commented:1 approved:1] ## Purpose
The `/inference/v1/generate` endpoint differs from the `/v1/completions` endpoint and the Engine APIs in `logprobs=0` handling: passing `logprobs=0` returns no logprobs, while it is supposed to return chosen-token logprobs (same as `logprobs=1`). This PR makes the behaviour consistent with the completions endpoint.
## Test Plan
Added a new test for logprobs handling in the generate endpoint with logprobs values 0, 1, and 5
…
-
#33993 [Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) — bug,ready,deepseek — by xuebwang-amd (合并于: 2026-02-07 03:11 (UTC+8)) [💬4 | +6/-1, 2 files | commented:1 approved:2] ## Purpose
Model: deepseek-ai/DeepSeek-V3.1 This PR aims to fix an error: ``` Traceback (most recent call last): File "/workspace/xuebwang/vllm/vllm/v1/executor/multiproc_executor.py", line 754, in worker_main worker = WorkerProc(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/workspace/xuebwang/vllm/vllm/v1/executor/multiproc_executor.py", line 580, in init …
-
#33734 [Rocm][Bugfix] Fix dtype not same for gemm_a4w4 op — bug,rocm,ready — by charlifu (合并于: 2026-02-07 03:09 (UTC+8)) [+6/-1, 1 files | commented:1 approved:2] The Aiter op per_1x32_f4_quant_hip returns the tensor as the `torch.float4_e2m1fn_x2` type, which causes the `gemm_a4w4` op to raise an error because the weight type is still `torch.uint8`. This PR fixes this issue.
-
#33449 [Refactor] Remove align block size logic in `moe_permute` — performance,ready — by yewentao256 (合并于: 2026-02-07 02:57 (UTC+8)) [+38/-297, 8 files | commented:1 approved:1] ## Purpose
We have very complicated logic in the CUDA kernel related to the optional `align_block_size` used by DeepGEMM, but this has not been called at all on the main branch for a very long time, so it is time to clean it up. It is even slower than the current implementation, so we are safe to remove it.
And we can also get a small performance improvement from deleting the `torch.full` ## Test
…
-
#33251 [Model Runner V2] support apply penalty for spec decode — v1 — by izhuhaoran (合并于: 2026-02-07 02:56 (UTC+8)) [💬5 | +91/-14, 4 files | commented:4 approved:1] ## Purpose
As titled, this PR supports applying penalties for spec decode in MRV2
-
#33971 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,ready,nvidia — by vllmellm (合并于: 2026-02-07 02:05 (UTC+8)) [💬3 | +237/-235, 4 files | commented:1 approved:1]
## Purpose Update the Deployment > Docker docs with Docker information for the ROCm/CUDA/Intel platforms. Docker deployment is linked as snippets from the installations/docker section.
## Test Plan
## Test Result
…
-
#33292 [KV Connector] Add missing method overrides to MultiConnector — documentation,new-model,ready,ci/build,v1,multi-modality,llama,qwen,kv-connector — by eicherseiji (合并于: 2026-02-07 01:58 (UTC+8)) [💬6 | +177/-11, 2 files | commented:4 approved:2] ## Summary
`MultiConnector` was missing several method overrides that are required when wrapping other KV connectors. This PR adds the missing methods and a test to prevent future regressions. ## Missing Methods Fixed
Handshake methods (for NixlConnector support):
- `get_handshake_metadata()` - returns the first non-None metadata from sub-connectors
- `set_xfer_handshake_metadata()` - delegates to all sub-connectors
…
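The two delegation styles named above (first non-None result vs. fan-out to all children) look roughly like this (illustrative classes, not the vLLM implementation):

```python
class SubConnector:
    """Minimal stand-in for a wrapped KV connector."""
    def __init__(self, metadata=None):
        self.metadata = metadata
        self.received = None

class Multi:
    def __init__(self, subs):
        self.subs = subs

    def get_handshake_metadata(self):
        # return the first non-None metadata from sub-connectors
        return next(
            (s.metadata for s in self.subs if s.metadata is not None), None
        )

    def set_xfer_handshake_metadata(self, md):
        # delegate to all sub-connectors
        for s in self.subs:
            s.received = md

m = Multi([SubConnector(), SubConnector("hs")])
assert m.get_handshake_metadata() == "hs"
m.set_xfer_handshake_metadata("x")
assert all(s.received == "x" for s in m.subs)
```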
-
#33944 [Log] Optimize duplicate startup log — ready,v1 — by yewentao256 (合并于: 2026-02-07 01:49 (UTC+8)) [+10/-7, 3 files | commented:1 approved:1] ## Purpose
Optimize logs like
INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner INFO 02-05 21:47:08 [gpu_worker.py:122] Using V2 Model Runner INFO 02-05 21:47:09 [gpu_worker.py:122] Using V2 Model Runner -
#33749 [ROCm][AITER] Fix AITER import regression for explicit backend selection — rocm,speculative-decoding,ready,v1 — by AndreasKaratzas (合并于: 2026-02-06 23:08 (UTC+8)) [💬5 | +262/-66, 5 files | commented:8 approved:2] A regression was introduced that broke explicit AITER backend selection on ROCm when `VLLM_ROCM_USE_AITER=0` (or unset). Users could not explicitly select the AITER backend via `attention_config={"backend": "ROCM_AITER_FA"}` even though the backend was available. Error observed: `AttributeError: 'builtin_function_or_method' object has no attribute 'flash_attn_varlen_func'` This occurred because:
- `is_aiter_found_and_supported…
- #33964 [Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 — bug,ready — by chaunceyjiang (合并于: 2026-02-07 01:23 (UTC+8)) [💬1 | +12/-0, 1 files | commented:1 approved:1]
## Purpose
<|DSML|function_calls> <|DSML|invoke name="get_weather"> <|DSML|parameter name="location" string="true">杭州</|DSML|parameter> <|DSML|parameter name="date" string="true">2024-01-16</|DSML|parameter> </|DSML|invoke>…
-
#33254 [Docs] Update link to Benchmark CLI documentation — performance,ready — by eldarkurtic (合并于: 2026-02-07 00:01 (UTC+8)) [+1/-1, 1 files commented:1 approved:1] - #33973 [XPU][5/N] add wna16 xpu kernel — ready,ci/build — by zufangzhu (合并于: 2026-02-06 23:59 (UTC+8)) [💬2 | +58/-66, 2 files | commented:1 approved:1] This PR is [5/N] of https://github.com/vllm-project/vllm/issues/33214 add compressed tensor wna16 support for xpu
-
#33928 [Refactor] Consolidate sequence normalization and enc-dec parsing — frontend,ready,v1,multi-modality — by DarkLight1337 (合并于: 2026-02-06 23:43 (UTC+8)) [💬18 | +1256/-848, 38 files | commented:4 approved:1] ## Purpose
- Introduce `*DictPrompt` classes to represent prompts that have been standardized into dictionaries. This replaces the `Parsed*Prompt` classes.
- Update and improve documentation of various input schemas.
- Move the logic for normalizing a single prompt/conversation, or a sequence of them, into a sequence to the new module `vllm.renderer.inputs.parse`.
- Allow renderer to handle encoder-decoder inputs.
- Renamed `render_completion(s)` to `render_prompt(s)`.
- Move the logic of r…
-
#33431 [Model] Support MiniCPM-o 4.5 — ready — by tc-mb (合并于: 2026-02-06 23:29 (UTC+8)) [💬1 | +86/-7, 1 files | commented:10] ## Summary
This PR adds support for MiniCPM-o 4.5, a new generation of Omni model, which has excellent performance.
### Changes
- Added `MiniCPMO4_5` class with Qwen3 backbone support
- Refactored `MiniCPMO` to use a factory pattern for automatic version dispatch
- Made `audio_pool_step` configurable (v4.5 uses 5, v2.6 uses 2)
- Added auto version detection for backward compatibility …
-
#33940 [Docs] Add sections on process architecture and minimum CPU resources — documentation,ready — by mgoin (合并于: 2026-02-06 23:26 (UTC+8)) [💬3 | +113/-0, 4 files | commented:3 approved:1] ## Purpose
It seems users can be confused about vLLM’s performance when running with very small amounts of CPU cores available. We are missing a clear overview of what vLLM’s process architecture is, so I added this along with some diagrams in arch_overview.md, and included a section on CPU resource recommendations in optimization.md
Here are some example diagrams I generated with help from Opus 4.6
<img width="2816" height="1536" alt="Generated_image1" src="https://github.com/user-attachment…
-
#33509 [FIX] guidance: use max(vocab_size, len(tokenizer)) for n_vocab — structured-output,ready,v1 — by FredericOdermatt (合并于: 2026-02-06 22:23 (UTC+8)) [💬2 | +1/-1, 1 files | commented:5 changes:1 approved:1] Hi there, thanks for the amazing work on guided decoding in VLLM @russellb @jeejeelee @hmellor @yewentao256 @mgoin and many more.
## Purpose
I’ve observed the following: Models like Gemma 3 1B have `len(tokenizer) > config.vocab_size` due to added special tokens not counted in the config. `GuidanceBackend.__post_init__` passes `self.vocab_size` (from the model config) as `n_vocab` to [llguidance_hf.from_tokenizer()](https://github.com/vllm-project/vllm/blob/b5f8c3092d1e1466b2b9c516fb39e5b2c15e7…
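The one-line fix follows directly from that observation; a sketch with illustrative numbers (not Gemma's actual sizes):

```python
# Situation described above: the tokenizer grew past the config vocab
# size via added special tokens (values illustrative).
config_vocab_size = 262144
tokenizer_len = 262145          # len(tokenizer) after added special tokens

# Before: n_vocab = config_vocab_size, so llguidance misses the extra ids.
# After: take the larger of the two so every token id is covered.
n_vocab = max(config_vocab_size, tokenizer_len)
assert n_vocab == 262145
```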
-
#33989 Update `WeightTransferConfig` to be more standard like the others — ready — by hmellor (合并于: 2026-02-06 21:15 (UTC+8)) [+4/-6, 2 files | commented:1 approved:1]
- `@config` is a dataclass transform now
- The default shouldn’t be duplicated in `EngineArgs`
-
#33977 [Bugfix] Fix models and tests for transformers v5 — bug,documentation,ready,multi-modality,qwen — by zucchini-nlp (合并于: 2026-02-06 21:47 (UTC+8)) [💬2 | +70/-55, 12 files | commented:8 approved:1] As per title, related to https://github.com/vllm-project/vllm/pull/30566. All changes are v4-compatible as well.
@hmellor
- #33799 [Docs] Improve documentation — documentation,frontend,ready — by SorenDreano (合并于: 2026-02-06 20:57 (UTC+8)) [💬2 | +12/-12, 4 files | commented:1 approved:2]
## Purpose
Update vLLM documentation to reflect current Hugging Face tooling and paths:
- Replace deprecated `huggingface_cli` usage with the recommended `hf` command across docs, per Hugging Face migration guidance:
  - https://huggingface.co/docs/huggingface_hub/concepts/migration
  - https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-download
- Update the Hugging Face token path reference from the old `~/.huggingface` location to the current `…
-
#29816 [Bugfix][Model] Support LoRA on Qwen3 Output Embedding — bug,ready,qwen — by klshuster (合并于: 2026-02-06 20:25 (UTC+8)) [💬6 | +132/-13, 6 files | commented:10] ## Purpose This PR adds support for LoRA on the embed/unembed layers for Qwen3 dense/MoE models. It is a simplified version of #26115 that removes the changes for supporting zero-padded vocab (which was addressed in #28545).
(cc @jeejeelee, I accidentally did some merges into the other PR and forgot to sign off on the commits, and the rebase was very unwieldy, so I am just opening a new PR)
## Test Plan I’ve added a comprehensive test suite in `tests/lora/test_qwen3_unembed.py` ## Test Result…
-
#33731 [torch.compile] Reorganize vllm/compilation and tests/compile (0/N for vLLM IR) — ready,torch.compile,ci/build,ready-run-all-tests — by ProExpertProg (合并于: 2026-02-06 20:19 (UTC+8)) [💬2 | +717/-651, 47 files | commented:7 approved:2] ## Purpose
Pure code movement and renaming to prepare for vLLM IR, which will add its own passes and tests. Will wait for #33293 before landing.
## Test Plan CI
## Test Result TBD
…
-
#33582 [CPU][BugFix] Fix loading of w4a8int models with bias — bug,ready — by fadara01 (合并于: 2026-02-06 19:59 (UTC+8)) [💬4 | +8/-3, 1 files | commented:2 approved:2] ## Purpose
Fix loading of w4a8int models with bias. The current code triggers "TypeError: cannot assign 'torch.FloatTensor' as parameter 'bias' (torch.nn.Parameter or None expected)" for models that have bias in their MLPs
## Test Plan
tested on w8a8int quantized whisper model (has bias in its MLPs) following quantization recipe from here: https://learn.arm.com/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model/
## Test Result …
- #33984 Bump HF Hub client to get bug fix — bug,rocm,ready,ci/build — by hmellor (合并于: 2026-02-06 19:25 (UTC+8)) [+2/-2, 2 files | approved:1 commented:1] Bug fix in question is https://github.com/huggingface/huggingface_hub/pull/3778
-
#33976 [PaddleOCR-VL] Add BC for transformers 5.0 config — ready — by zhang-prog (合并于: 2026-02-06 18:33 (UTC+8)) [+7/-0, 1 files commented:1 approved:2] -
#33982 Consolidate and fix forbidden import `pre-commit` checks — ready — by hmellor (合并于: 2026-02-06 17:47 (UTC+8)) [+145/-300, 5 files | commented:2 approved:1] Previously the regex and triton checks didn’t actually work in CI because they did not use the file list passed by `pre-commit` and instead relied on the git diff. This means that the checks would miss violations in CI because the git diff would be empty.
This PR consolidates the three forbidden import checks into a single check which does consume `pre-commit`’s file list. -
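The consolidated check above hinges on consuming the file list that `pre-commit` passes as arguments, rather than shelling out to `git diff` (which can be empty in CI). A minimal sketch of that pattern; the forbidden-module regex and messages here are hypothetical, not vLLM’s actual rules:

```python
import re

# Hypothetical rule; vLLM's real hook has its own list of forbidden imports.
FORBIDDEN = re.compile(r"^\s*(import|from)\s+triton\b", re.MULTILINE)

def check_files(paths):
    """Return the subset of paths containing a forbidden import."""
    bad = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            if FORBIDDEN.search(f.read()):
                bad.append(path)
    return bad

def main(argv):
    # pre-commit invokes the hook with the staged file names as argv,
    # so consuming them (rather than `git diff`) also works in CI.
    failures = check_files(argv)
    for path in failures:
        print(f"{path}: forbidden import found")
    return 1 if failures else 0
```

Registered in `.pre-commit-config.yaml`, such a hook receives every staged file name on its command line, so the same code path runs locally and in CI.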
#33868 support view_from_cpu_tensor on XPU — ready,v1 — by xinyu-intel (合并于: 2026-02-06 16:34 (UTC+8)) [+13/-9, 4 files | commented:5 approved:1]
## Purpose
Support `get_cuda_view_from_cpu_tensor` on XPU. This will be used for ModelRunnerV2.
## Test Plan
## Test Result
…
- #33975 Fix `main` pre-commit — 无标签 — by hmellor (合并于: 2026-02-06 16:08 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] This one slipped through in https://github.com/vllm-project/vllm/pull/33720 -
#32263 [cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation — ready,v1,cpu — by gassan-arm (合并于: 2026-02-06 15:01 (UTC+8)) [💬9 | +704/-4, 4 files | commented:8 changes:2] ## Purpose
CPU Paged Attention NEON BFMMLA BF16 Implementation
Co-authored-by: GitHub Copilot
## Test Results
Using: https://github.com/vllm-project/vllm/pull/31720 Benchmark Suite, …
-
#33720 Onboard voyage-4-nano — documentation,new-model,ready,qwen — by chengchengpei (合并于: 2026-02-06 14:23 (UTC+8)) [💬8 | +216/-2, 8 files | commented:10] ## Purpose
Onboard voyage-4-nano
## Test Plan
Run ``` from vllm import LLM from vllm.config import PoolerConfig …
- #31112 [XPU]Replace pip in docker.xpu with uv pip — ready,ci/build — by 1643661061leo (合并于: 2026-02-06 14:02 (UTC+8)) [💬1 | +46/-28, 1 files | commented:10] ## Purpose We replaced pip in docker.xpu with uv pip to improve the speed of building images. We created a virtual environment at /opt/venv and by default install Python packages via uv pip. Wheels such as torch/intel-extension and vllm-openai are also installed in the same location, most using /root/.cache/uv uniformly. Specifically, since the NIXL script is not suitable for uv, we use the Python in the virtual environment for installation to ensure consistency, while driver…
-
#33679 [XPU][4/N] add mxfp4 moe model support — ready — by jikunshang (合并于: 2026-02-06 13:03 (UTC+8)) [💬1 | +53/-31, 1 files | commented:1 approved:1] ## Purpose [4/N] of https://github.com/vllm-project/vllm/issues/33214: add mxfp4 MoE support. We can also refactor the XPU part once the mxfp4 apply kernel is abstracted.
## Test Plan
python3 examples/offline_inference/basic/generate.py --model openai/gpt-oss-20b --temperature 0 --enforce-eager
## Test Result
…
- #33788 [CPU] Add BF16 Kernel type for s390x — ready,cpu — by R3hankhan123 (合并于: 2026-02-06 12:57 (UTC+8)) [💬1 | +9/-0, 1 files | commented:1 approved:1]
## Purpose
Add BF16 kernel types for s390x in `mla_decode.cpp`.
## Test Plan
- Build the image
- Run inference
## Test Result ``` [root@b314lp81 vllm]# docker run --rm -p 8000:8000 local:test ibm-granite/granite-4.0-micro --port=8000 INFO 02-04 11:03:05 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available. …
-
#33900 [Misc] Update code for encoder-decoder models — ready,v1,multi-modality — by DarkLight1337 (合并于: 2026-02-06 11:38 (UTC+8)) [+9/-3, 2 files | approved:1 commented:1]
## Purpose
FIX https://github.com/vllm-project/vllm/pull/33559#discussion_r2754749999 FIX https://github.com/vllm-project/vllm/pull/33559#discussion_r2754759311
## Test Plan
## Test Result …
-
#31366 feat(frontend): early-fail tokenization guard for user requests — frontend,ready — by scratch-ml (合并于: 2026-02-06 11:38 (UTC+8)) [💬12 | +308/-202, 7 files | commented:9 changes:1] ## Purpose To advance the early-fail truncation design proposed in vllm#31229, these tests ensure default tokenization applies protective truncation and raises an exception immediately on overlength inputs, preventing the async tokenizer from being monopolized.
Default behavior: tokenization runs with protective truncation (`truncation=True`, `max_length=max_model_len+1`); if the encoded length hits the limit, it raises `ValueError` immediately …
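The guard described in #31366 can be sketched as follows. `guarded_encode` and the tokenizer call signature are hypothetical stand-ins, not vLLM’s actual API; the point is the `max_model_len + 1` cap, which lets a truncated encode prove the raw input was overlength:

```python
def guarded_encode(tokenizer, text, max_model_len):
    """Tokenize with protective truncation; fail fast on overlength input.

    Encoding is capped at max_model_len + 1 tokens: if the capped result
    still reaches that length, the raw input must exceed the model limit,
    so we raise immediately instead of tying up the async tokenizer.
    """
    limit = max_model_len + 1
    token_ids = tokenizer(text, truncation=True, max_length=limit)
    if len(token_ids) >= limit:
        raise ValueError(
            f"Input exceeds max_model_len={max_model_len}; rejecting early.")
    return token_ids
```

Because the tokenizer never materializes more than `max_model_len + 1` tokens, even pathological inputs cost a bounded amount of work before rejection.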
[关闭未合并 PR]
- #21241 [bugfix] Remove the attribute ‘version’ from docker compose — documentation,stale — by 1195343015 (关闭于: 2026-02-07 10:18 (UTC+8)) [💬4 | +0/-1, 1 files | commented:1]
## Purpose
The attribute `version` is obsolete and will be ignored; remove it to avoid potential confusion. - #21502 [Bugfix] Fix retrieve_process not ending normally and resources not being released properly — documentation,stale,kv-connector — by Foreverythin (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +2/-2, 1 files | commented:1]
## Essential Elements of an Effective PR Description Checklist
- The purpose of the PR, such as “Fix some issue (link existing issues this PR will resolve)”.
- The test plan, such as providing test command.
- The test results, such as pasting the results comparison before and after, or e2e results
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
## Purpose ### Bug Description In `kv_cache_sharing_lmcache_v1….
- #21535 [Bugfix] Add startup probe and fix disable extraInit container in online deploy helm chart — documentation,stale — by vladmirtxrx (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +23/-3, 3 files | commented:1]
## Purpose Enhance online deploy helm chart: add startup probe …
-
#22262 [V1][Spec Decode] Async scheduling integration with spec decode — documentation,speculative-decoding,stale,v1 — by zixi-qi (关闭于: 2026-02-07 10:17 (UTC+8)) [💬7 | +108/-25, 6 files | commented:1]
## Purpose Support async scheduling with speculative decoding w…
-
#22406 [feature] add all_reduce registry for custom backends — documentation,needs-rebase,stale — by draftbk (关闭于: 2026-02-07 10:17 (UTC+8)) [💬8 | +245/-0, 4 files | commented:1]
## Purpose
…
-
#22413 [XPU] Fix OOM when manually specifying ZE_AFFINITY_MASK with Ray distributed executor on XPU — documentation,stale,v1 — by chaojun-zhang (关闭于: 2026-02-07 10:17 (UTC+8)) [💬8 | +107/-28, 6 files | commented:10]
## Purpose When using Intel GPUs with Ray, the device_control_e…
-
#23450 Add Predicted Outputs API — documentation,frontend,speculative-decoding,ci/build,stale,v1 — by bwasti (关闭于: 2026-02-07 10:17 (UTC+8)) [💬9 | +461/-4, 15 files | commented:2] ## Purpose
As referenced in this issue: #23276, predicted outputs is a nice API from OpenAI that allows user-provided external token predictions to hit the speculative decoding backend.
This PR is the first of a couple to properly thread the API through the stack (looking for an initial review on this).
The key API is the addition of `--speculative-config '{"method": "predicted", "num_speculative_tokens": 4}'` at serving time and an exact replication of the OpenAI API at request: ``` { …
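The verify-and-accept rule that predicted outputs relies on can be sketched generically. This is the standard greedy speculative-decoding acceptance logic, not vLLM’s exact implementation, and the function name is hypothetical:

```python
def accept_predicted(predicted, verified):
    """Return the prefix of `predicted` confirmed by the target model.

    With predicted outputs, user-supplied draft tokens are verified in one
    batched forward pass; generation keeps the longest matching prefix of
    the draft, plus the model's own token at the first divergence.
    """
    accepted = []
    for draft, actual in zip(predicted, verified):
        if draft != actual:
            break
        accepted.append(draft)
    # The target model's token at the first mismatch (or one past the
    # match) is always kept, so a step never yields fewer than one token.
    if len(accepted) < len(verified):
        accepted.append(verified[len(accepted)])
    return accepted
```

When the user’s prediction is mostly right (e.g. an edit of a known document), many tokens are accepted per forward pass, which is where the speedup comes from.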
-
#24168 [logging] Refine PyNcclConnector Proxy logging — documentation,stale,kv-connector — by panpan0000 (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +17/-1, 1 files | commented:1]
## Purpose
Make logging and responses clearer.
## Test Plan
## Test Result
…
-
#24187 [Spec Decode][Model]Add qwen2-eagle — documentation,performance,new-model,speculative-decoding,needs-rebase,stale,v1,qwen — by wwl2755 (关闭于: 2026-02-07 10:17 (UTC+8)) [💬7 | +325/-18, 9 files | commented:7]
## Purpose
Add qwen2-eagle (and qwen2.5-eagle) support.
According to Slack, Qwen2 is not supported yet because the target and draft models have different numbers of KV heads. This PR addresses this by aligning the KV cache space with the larger of the two (though it can cause some waste).
Another thing this PR addresses is that some eagle model places “…
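The KV-cache alignment described above is simple arithmetic: size the shared cache for the larger head count so both models can use one layout. A hedged sketch of that sizing (function and parameter names are hypothetical, not vLLM’s internals):

```python
def aligned_kv_bytes(num_kv_heads_target, num_kv_heads_draft,
                     head_dim, num_layers, block_tokens, dtype_bytes=2):
    """Per-block KV cache size when target and draft share one allocation.

    Aligning to the larger head count lets both models index the same
    cache layout, at the cost of wasted space for the smaller model.
    """
    heads = max(num_kv_heads_target, num_kv_heads_draft)
    # K and V each hold heads * head_dim values per token, per layer.
    return 2 * heads * head_dim * block_tokens * num_layers * dtype_bytes
```

For example, a target with 8 KV heads and a draft with 2 both get blocks sized for 8 heads; the draft simply leaves three quarters of each block unused.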
-
#24216 [Misc] fix lmcache cpu offload example — documentation,stale,kv-connector — by zhenwei-intel (关闭于: 2026-02-07 10:17 (UTC+8)) [💬3 | +163/-112, 1 files | commented:1] The current example is outdated. It uses the prefix cache and will not invoke CPU offload.
Update the example from the lmcache documentation here: https://docs.lmcache.ai/getting_started/quickstart/offload_kv_cache.html
# Running the Example
First, run the script without LMCache:
``` python cpu-offloading.py …
-
#24287 [CI/Build] Fix cmake incremental build when running "pip install --no-build-isolation -e ." — documentation,ready,ci/build,stale — by xli (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +34/-12, 2 files | approved:1 commented:3] Fix cmake incremental build when running "pip install --no-build-isolation -e ." (uv pip install works the same).
Current c extensions cmake incremental build requires some extra steps to enable (https://docs.vllm.ai/en/v0.9.2/contributing/incremental_build.html)
By changing the cmake working directory in the setup.py, we can enable cmake incremental build without generating CMakeUserPresets.json file and following specific runbook.
Not sure why current implementation uses self.build_temp directo…
-
#24878 add default value for max_tokens and max_num_batched_tokens to basic chat example so that the example can be run out of box — documentation,stale,v1 — by chenfengjin (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +5/-0, 1 files | commented:5] ## Purpose
Add default values for max_tokens and max_num_batched_tokens to the basic chat example so that it can be run out of the box. Without these defaults, users struggle to run the basic example:
- for `max_tokens`, the default value is 16, which results in meaningless generated output
- for `max_num_batched_tokens`, the default conflicts with model parameters, which results in a runtime error
The following are errors with framework default values.
### with default max_num_batched_tokens as 4096
…
-
#24965 add dtype and max_num_batched_tokens to classify example so that it can be run out of box — documentation,stale — by chenfengjin (关闭于: 2026-02-07 10:17 (UTC+8)) [💬4 | +2/-0, 1 files | commented:3] ## Purpose the example basic/classify.py failed to generate probabilities ```console Processed prompts: 100%|█████████████████████| 4/4 [00:00<00:00, 11.52it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Generated Outputs: ———————————————————— Prompt: ‘Hello, my name is’ Class Probabilities: [nan, nan] (size=2) ———————————————————— …
-
#24972 [Model] Deepseek-V3.1 reasoning parser — documentation,frontend,needs-rebase,stale,qwen,deepseek — by taohui (关闭于: 2026-02-07 10:17 (UTC+8)) [💬6 | +214/-11, 15 files | commented:6]
## Purpose
This PR adds a new reasoning parser for the DeepSeek-V3.1 model, named deepseek_v3. Unlike previous models such as deepseek_r1, the reasoning parser for DeepSeek-V3.1 is deterministic. Specifically:
When a request includes `"chat_template_kwargs": {"thinking": True}`, the model uses the deepseek_r1 reasoning parser.
Otherwise, it uses a new IdentityReasoningParser, which implements the ReasoningParser interface but does not perform actual reasoning, effe…
-
#25356 [Feature] OTEL Tracing for Individual Model Steps — documentation,stale,v1 — by tomasruizt (关闭于: 2026-02-07 10:16 (UTC+8)) [💬8 | +205/-14, 7 files | commented:1] ## Purpose Enable fine-grained OpenTelemetry (OTEL) tracing for model steps like `Preprocess`, `Forward`, etc. at the level of the individual request, in a visual way.
## Why Do We Need This? Currently, we can use the PyTorch profiler to track the runtime of each model step in vLLM. However, the profiler overhead is large, and it slows down vLLM execution, distorting the actual runtimes of steps. The profiler also cannot separate different requests, so in a workload where multiple requests part…
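To keep this digest self-contained, here is a stdlib-only sketch of the idea behind per-step, per-request tracing; the real PR uses OpenTelemetry spans, and `StepTracer` plus the step names are illustrative stand-ins:

```python
import time
from contextlib import contextmanager

class StepTracer:
    """Toy per-request tracer recording (step, duration) pairs.

    Stands in for OTEL spans: unlike a full profiler, it only timestamps
    step boundaries, so overhead stays negligible and each request keeps
    its own span list.
    """
    def __init__(self, request_id):
        self.request_id = request_id
        self.spans = []

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append((name, time.perf_counter() - start))

# Usage: wrap each model step of one request in a span.
tracer = StepTracer("req-0")
with tracer.span("Preprocess"):
    tokens = list(range(4))
with tracer.span("Forward"):
    logits = [t * 2 for t in tokens]
```

A real OTEL exporter would ship these spans to a collector for visualization; the key property is that spans nest per request, so concurrent requests no longer blur together as they do under a whole-process profiler.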
-
#25367 [Docs] wheel larger than limit — documentation,stale — by pfk-beta (关闭于: 2026-02-07 10:16 (UTC+8)) [💬4 | +4/-0, 1 files | commented:1] ## Purpose Based on my experience with building Docker, the docs should mention the wheel exceeding the size limit. Related to #18786
## Test Plan N/A
## Test Result N/A
-
#25376 [V0 Deprecation][KVConnector] Remove KVConnector v1/v0 differentiation — documentation,tpu,ready,needs-rebase,ci/build,stale,v1,kv-connector — by NickLucche (关闭于: 2026-02-07 10:16 (UTC+8)) [💬9 | +481/-530, 47 files | commented:2 approved:1 changes:1] This PR completes the KVConnector V0 deprecation we started here https://github.com/vllm-project/vllm/issues/21785. It mostly moves stuff from `v1/kv_connector` to `kv_connector` to clarify there is no other "vX" connector (since removed in the above PR). Takes care of v1-related naming + warnings as well (e.g. checking v1 is enabled). List of changes:
- `KVConnectorBase_V1` -> `KVConnectorBase`
- ~`KVConnectorBaseType = KVConnectorBase_V1`~
- `vllm/distributed/kv_transfer/v1/kv_connector/` -> `vl…
-
#25574 [Misc] Add presence_penalty to default generation config — frontend,stale — by noiji (关闭于: 2026-02-07 10:16 (UTC+8)) [💬3 | +17/-4, 2 files | commented:1]
## Purpose Add presence_penalty to default generation config
## Test Plan
## Test Result
…
-
#25728 [Bugfix] fixing streaming issues and tool call output for gpt-oss (#22704) — frontend,needs-rebase,stale,gpt-oss — by noelkelias (关闭于: 2026-02-07 10:16 (UTC+8)) [💬5 | +51/-2, 1 files] ## Purpose Closely related to bugs associated with streaming for gpt-oss response API (#22704):
Harmony’s streaming didn’t support `functions.*` tool calls at all. We handled reasoning on `analysis`, final text on `final`, and built-ins like code/web, but `"commentary"` messages with a `recipient="functions.NAME"` had a no-op branch.
## Test Plan
First, tested GPT-OSS with a basic set of instructions, few-shot examples, a function definition, and a prompt to find the weather in San Francisco, …
- #25952 [Bugfix] message loop — documentation,stale,tool-calling — by muazem7 (关闭于: 2026-02-07 10:16 (UTC+8)) [💬6 | +1/-1, 1 files | commented:1 approved:1] ## Purpose Fix bug
-
#26009 [Core] NIXL/Segment-tree encoder cache based EPD-disaggregation — documentation,performance,needs-rebase,ci/build,stale,v1,kv-connector — by MerHS (关闭于: 2026-02-07 10:16 (UTC+8)) [💬8 | +2111/-68, 14 files | commented:3] # Purpose
Recently, several multi-node inference/orchestration engines such as llm-d and dynamo have emerged, and they natively support Prefill-Decode disaggregation. To improve them, we can imagine the Encode-Prefill-Decode disaggregation introduced in several RFCs and PRs: https://github.com/vllm-project/vllm/issues/20799, https://github.com/hsliuustc0106/vllm/issues/18, https://github.com/vllm-project/vllm/pull/25233.
However, llm-d and dynamo rely on the NIXL communicator, which wraps…
- #26389 [Misc] Cleanup fp8.py logic cleanup — stale — by andoorve (关闭于: 2026-02-07 10:16 (UTC+8)) [💬2 | +6/-14, 1 files | commented:1] Very minor refactor I noticed when reading this file. Removes some redundant asserts.
-
#32051 [Misc] support separate draft model loading config from base model — speculative-decoding,v1,meta-exported,fb-exported — by ZhengkaiZ (关闭于: 2026-02-07 06:40 (UTC+8)) [💬4 | +10/-1, 3 files | commented:2 | 📝草稿] [Misc] support separate draft model loading config from base model
Summary: Users can build a custom model loading format. Since the base model and draft model may not share the same model loading config, this PR supports that use case.
An example command would be
``` vllm serve /home/zzhengkai/vllm_models/Llama-3.1-8B-Instruct --disable-log-requests --port 9001 --speculative_config '{"method": "eagle3", "model": "zzhengkai/custom_draft_mdoel_for_llama3", "num_speculative_tokens":…
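The intent of #32051 can be sketched as a fallback rule: the draft model uses its own loading config when one is given, otherwise it inherits the base model’s. Function and field names below are hypothetical, not vLLM’s actual config keys:

```python
def resolve_draft_load_config(base_load_config, speculative_config):
    """Pick the draft model's load config, falling back to the base model's.

    Mirrors the PR's intent: the draft model may declare its own loading
    config inside the speculative config; any keys it leaves unspecified
    still come from the base model.
    """
    draft_overrides = speculative_config.get("load_config")
    if draft_overrides is None:
        return dict(base_load_config)
    # Merge: draft-specific keys win, base keys fill the gaps.
    return {**base_load_config, **draft_overrides}
```

This keeps the common case (shared loading format) zero-config while letting a custom draft checkpoint declare a different format.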
-
#33836 Fix RoutingMethodType.from_topk softmax+renormalize mapping #33792 — nvidia — by baonudesifeizhai (关闭于: 2026-02-07 03:46 (UTC+8)) [💬5 | +48/-11, 6 files | commented:4] ## Purpose #33792 Align from_topk with routing semantics: softmax + renormalize=True now maps to RoutingMethodType.Renormalize (not RenormalizeNaive). Add a small unit test to cover from_topk mapping and invalid scoring functions. ## Test Plan
## Test Result `python -m pytest tests/model_executor/test_routed_experts_capture.py -v`
passed… -
#34014 [Model] Introduce first-class DFlash speculative decoding in vLLM V1 — documentation,new-model,speculative-decoding,v1,qwen — by dangoldbj (关闭于: 2026-02-07 04:33 (UTC+8)) [💬2 | +2163/-22, 18 files | commented:1]
## Purpose This PR introduces first-class DFlash speculative decoding support in vLLM V1.
This is the initial DFlash implementation in vLLM. Prior to this change, vLLM did not support DFlash speculative decoding. This PR establishes an explicit, config-driven DFlash execution path with correctness and maintainability guarantees.
Key changes:
- Adds explicit `method="dflash"` support and full DFlash model/config plumbing.
- Establishes a dedicated `DFlash…
-
#33952 [CI][BugFix][AMD] Add check for model_config being None and update conftest.py to load AITER if available to fix Kernels MoE Test %N — bug,rocm — by rasmith (关闭于: 2026-02-07 01:17 (UTC+8)) [💬3 | +22/-4, 3 files | commented:4]
## Purpose This PR broke many tests (over 30) and this PR fixed one test in the `Kernels MoE Test %N` group, but when the test is run as a group using `pytest -sv kernels/moe`, the first test that runs does not load AITER ops, and when subsequent tests run, they will also not have AITER ops loaded.
This PR loads the ops in `vllm._aiter_ops` but then ensures tha… -
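The conftest fix above relies on loading custom ops exactly once, no matter which test module runs first. A stdlib sketch of that idempotent-loading pattern; the names are hypothetical, and the real fix registers AITER ops:

```python
_OPS_LOADED = False

def ensure_ops_loaded(register):
    """Run the op-registration callback exactly once per process.

    When pytest collects many test modules into one run, only the first
    import path may trigger registration; guarding with a module-level
    flag makes every entry point safe to call, in any order.
    """
    global _OPS_LOADED
    if not _OPS_LOADED:
        register()
        _OPS_LOADED = True
```

In a `conftest.py`, a session-scoped fixture would call `ensure_ops_loaded` so that both standalone and grouped test runs see the ops registered.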
#31875 [Feature] Add flag to disable FlashInfer autotune — 无标签 — by mmangkad (关闭于: 2026-02-06 17:02 (UTC+8)) [💬4 | +14/-1, 3 files | commented:2 changes:1 | 📝草稿]
## Purpose
FlashInfer autotuning can sometimes take a long time to complete during initialization. This PR introduces a flag to disable it, allowing users to bypass this step if they are okay with skipping optimization to speed up startup.
## Test Plan
## Test Result
…
-
#33146 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,ready,nvidia — by vllmellm (关闭于: 2026-02-07 00:40 (UTC+8)) [💬8 | +231/-192, 5 files | commented:7 changes:1 approved:2]
## Purpose
Update the Deployment > Docker page with the latest docker information for the ROCm platform.
## Test Plan
## Test Result
…
-
#33929 [Bugfix][Docker] Install CUDA dev packages for JIT compilation headers — bug,ci/build,nvidia — by jasonlizhengjian (关闭于: 2026-02-07 00:22 (UTC+8)) [💬2 | +3/-3, 1 files | commented:1 | 📝草稿] Install cuda-cudart-dev, cuda-nvrtc-dev, and libcublas-dev instead of runtime-only packages to provide headers (cuda.h, cuda_runtime.h, nvrtc.h, cublasLt.h) needed for FlashInfer JIT compilation of fp8_blockscale_gemm_sm90 kernels.
Fixes #33833
## Purpose
## Test Plan
…
-
#32280 Bump triton_kernels to v3.5.1 for version consistency — ci/build — by mmangkad (关闭于: 2026-02-06 17:02 (UTC+8)) [+1/-1, 1 files | commented:2]
## Purpose
Align triton_kernels fetch tag (v3.5.0 → v3.5.1) with the Triton package version used in requirements, which ships with PyTorch 2.9.1.
## Test Plan
## Test Result
…
-
#33968 [DOC] [ROCm] Update docker deployment doc — documentation,rocm,nvidia — by vllmellm (关闭于: 2026-02-06 14:10 (UTC+8)) [💬3 | +231/-192, 5 files]
## Purpose Update the Deployment > Docker page with docker information for the ROCm/CUDA/Intel platforms. Linked docker deployment as snippets from the installations/docker section.
## Test Plan
## Test Result
…