[vLLM GitHub Development Digest] 2026-02-20
[Overview]
- Time window: 2026-02-20 11:15 (UTC+8) ~ 2026-02-21 11:15 (UTC+8)
- New issues: 18 (label distribution: ci-failure:8, bug:4, RFC:3, feature request:2, rocm:2)
- Closed issues: 41
- New PRs: 49 (label distribution: v1:13, ready:13, nvidia:9, bug:9, ci/build:7)
- Merged PRs: 27
- PRs closed without merging: 20
[New issues]
- #34994 [Feature]: Infrastructure Improvements for ROCm CI — feature request,rocm — by AndreasKaratzas (created: 2026-02-21 06:35 (UTC+8)) [💬1]
## General
- Enable `grade: Blocking` for all tests on the amd-ci pipeline
- Mirror all tests on the upstream CI pipeline
- Tailor the "V1 Test attention" test group to MI325 and MI355 respectively
## Distributed Tests
### Distributed Tests (2 GPUs)
- Remove `TORCH_NCCL_BLOCKING_WAIT=1` when HIP bug ROCm/hip#3876 is fixed in a future ROCm release. …
- Enable …
- #34988 [Bug]: FlashInfer attn-fp4 fused kernel performs worse than unfused — bug,performance,torch.compile,needs reproduction,nvidia — by ProExpertProg (created: 2026-02-21 05:29 (UTC+8)) [💬1]
### Your current environment
The output of `python collect_env.py` (template placeholder, not filled in) …
- #34995 [CI] Maverick model QKV weight shape mismatch during load_weights with expert parallelism — ci-failure — by LucasWilkinson (created: 2026-02-21 06:36 (UTC+8))
## Name of failing test
models/multimodal/generation/test_maverick.py::test_dummy_maverick[2-True-False-meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8-4-4-2]
models/multimodal/generation/test_maverick.py::test_dummy_maverick[2-True-True-meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8-4-4-2]
## Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries …
- #35004 [RFC]: Realtime Endpoint Metrics — RFC — by pougetat (created: 2026-02-21 08:40 (UTC+8))
### Motivation.
With the recent pre-release of v0.16.0, vLLM now exposes a /v1/realtime endpoint that allows clients to progressively submit prompts without having to make separate HTTP requests to the /chat/completions endpoint.
This endpoint requires clients to establish a websocket connection to vLLM to stream the prompts over.
The /chat/completions HTTP endpoint currently benefits from the out-of-the-box metrics provided by the Prometheus client that part of the stack relies on.
No such metrics are…
- #34993 [CI] GDN attention backend assertion failure with MTP speculative decoding — ci-failure — by LucasWilkinson (created: 2026-02-21 06:29 (UTC+8)) [💬2]
## Name of failing test
evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[Qwen3-Next-80B-A3B-NVFP4-EP2]
## Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries
…
- #34948 [Bug]: Qwen3.5 CUDA Illegal Memory Access — bug — by kimbochen (created: 2026-02-20 16:48 (UTC+8)) [💬5]
### Your current environment
vLLM Nightly on H200
PyTorch version: 2.10.0+cu130, CUDA used to build PyTorch: 13.0, CUDA runtime version: 13.0.88, Python version: 3.12.12, vLLM Version: 0.16.0rc2.dev294+g6a7b85d94 (git sha: 6a7b85d94) …
- #34954 [Bug]: Triton Error [CUDA]: out of memory when received query — bug — by kwonmha (created: 2026-02-20 21:46 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py`: System Info: OS: Ubuntu 22.04.4 LTS (x86_64) ...
- #34996 [Feature][CI]: `test_functionalization.py` should compare outputs — feature request — by Rohan138 (created: 2026-02-21 06:38 (UTC+8))
### 🚀 The feature, motivation and pitch
https://github.com/vllm-project/vllm/blob/main/tests/compile/passes/test_functionalization.py should compare the test outputs from the functionalized model and the defunctionalized model, which are expected to be identical. I tried doing this as part of #33443, but ran into issues with the RMSNorm tests, e.g. `TestFusedAddRMSNorm` having significantly different output values. This is probably a good first issue; can we have it marked as such for a communi…
- #34989 [Bug]: Qwen3-Next-FP8 failure — bug — by tdoublep (created: 2026-02-21 05:30 (UTC+8)) [💬1]
### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info ...
- #34969 [CI Failure]: LM Eval Small Models (B200) - DeepSeek and Qwen3 Next — ci-failure — by mgoin (created: 2026-02-21 00:54 (UTC+8))
### Name of failing test
evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[DeepSeek-V2-Lite-Instruct-FP8]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34943 [CI Failure]: AMD Samplers Test (mi325_1) — rocm,ci-failure — by varun-sundar-rabindranath (created: 2026-02-20 13:48 (UTC+8)) [💬2]
### Name of failing test
distributed/test_ca_buffer_sharing.py, distributed/test_multiproc_executor.py, distributed/test_torchrun_example.py, distributed/test_torchrun_example_moe.py, kernels/helion/test_utils.py, plugins_tests/test_stats_logger_plugins.py, v1/kv_connector/nixl_integration/test_edge_cases.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34963 [RFC]: Adding Support for Single Batch Overlap (SBO) With FlashInfer DeepEP LL NVFP4 — RFC — by elvircrn (created: 2026-02-20 23:54 (UTC+8))
# Background
DeepEP's combine_v2 kernel supports overlapping the combine all-to-all with GEMM2 by polling per-expert completion signals instead of waiting for all experts to finish before sending. SGLang implements this, but vLLM does not.
# DeepEP Support
https://github.com/fzyzcjy/DeepEP/tree/gb200_blog_part_2/ (as per https://lmsys.org/blog/2025-09-25-gb200-part-2/) implements combine_v2, but https://github.com/elvircrn/DeepEP/tree/gb200_blog might be more convenient to use on CUDA 13 as …
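The polling idea behind #34963 can be sketched in plain Python threads: rather than waiting for all experts to finish before combining, a consumer polls per-expert completion flags and combines each expert's output as soon as it is ready. Everything here (`expert_gemm`, `combine_overlapped`, the flag layout) is an illustrative stand-in, not DeepEP or vLLM code.

```python
import threading
import time

# Per-expert completion signals and result slots; a stand-in for the
# per-expert flags that combine_v2 polls instead of a global barrier.
NUM_EXPERTS = 4
done = [threading.Event() for _ in range(NUM_EXPERTS)]
outputs = [None] * NUM_EXPERTS

def expert_gemm(i: int) -> None:
    time.sleep(0.01 * (NUM_EXPERTS - i))  # experts finish at different times
    outputs[i] = i * 10                    # stand-in for a GEMM result
    done[i].set()                          # publish completion signal

def combine_overlapped() -> list:
    combined, pending = [], set(range(NUM_EXPERTS))
    while pending:                         # poll per-expert signals
        for i in list(pending):
            if done[i].is_set():
                combined.append(outputs[i])
                pending.discard(i)
    return combined

threads = [threading.Thread(target=expert_gemm, args=(i,)) for i in range(NUM_EXPERTS)]
for t in threads:
    t.start()
result = combine_overlapped()  # runs while experts are still finishing
for t in threads:
    t.join()
print(sorted(result))  # → [0, 10, 20, 30]
```

The combine step starts consuming early finishers immediately, which is the overlap the RFC wants between the all-to-all and GEMM2.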
- #34956 [RFC]: Overlap weight loading with torch.compile compilation — RFC,torch.compile — by zou3519 (created: 2026-02-20 22:30 (UTC+8)) [💬1]
### Motivation.
torch.compile cold starts take some time - O(10s - 1m). We could hide much of the compilation time behind weight loading.
NB: this doesn't mean that we won't make torch.compile faster; in general it's rare that there is a situation where it is possible to overlap compilation with something else.
### Proposed Change.
Here's an initial design:
- develop a way for vLLM to load a model using FakeTensors (e.g., Tensors with no storage) and to run the torch.compile over it…
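The overlap proposed in #34956 can be sketched with a background thread: compilation (which only needs shapes, hence FakeTensors) runs concurrently with weight loading, so wall time approaches the max of the two costs rather than their sum. `compile_model` and `load_weights` below are placeholders, not vLLM APIs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compile_model() -> str:
    time.sleep(0.2)  # stand-in for torch.compile warm-up on shape-only FakeTensors
    return "compiled"

def load_weights() -> str:
    time.sleep(0.2)  # stand-in for reading checkpoint shards from disk
    return "loaded"

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as pool:
    compile_future = pool.submit(compile_model)  # compilation in the background
    weights = load_weights()                     # weight loading in the foreground
    compiled = compile_future.result()
elapsed = time.perf_counter() - start

# Overlapped, wall time is close to max(0.2, 0.2) rather than 0.2 + 0.2.
print(f"{weights} {compiled} in {elapsed:.2f}s")
```

The real design needs FakeTensors precisely because compilation must start before the real weights exist in memory.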
- #34947 Feature: Audit-grade request logging for EU AI Act compliance (Article 12) — no labels — by desiorac (created: 2026-02-20 16:05 (UTC+8))
## Context
The EU AI Act (Regulation 2024/1689) requires high-risk AI systems to have automatic logging capabilities (Article 12) that record events throughout the system's operation for traceability purposes. For LLM serving infrastructure like vLLM, this translates to structured, audit-grade request logging.
## Current State
vLLM provides excellent performance metrics and basic request logging via its OpenAI-compatible API server. However, the current logging is focused on **operational…
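A minimal sketch of what one structured, audit-grade (JSON-lines) request record could look like; the field names are illustrative only, not a vLLM or Article 12 schema. Hashing the prompt keeps the record traceable without retaining user content verbatim.

```python
import hashlib
import json
import time

def audit_record(request_id: str, model: str, prompt: str, status: str) -> str:
    """Serialize one request event as a JSON-lines audit record (hypothetical schema)."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "request_id": request_id,
        "model": model,
        # Store a digest, not the prompt itself, for traceability without retention.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "status": status,
    }
    return json.dumps(record, sort_keys=True)

line = audit_record("req-001", "facebook/opt-125m", "hello", "completed")
print(line)
```

Appending such lines to an append-only file is one common way to get the tamper-evident event trail Article 12-style logging asks for.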
- #34945 [CI Failure]: V1 Others : test_custom_logitsprocs[CustomLogitprocSource.LOGITPROC_SOURCE_ENTRYPOINT] — ci-failure — by varun-sundar-rabindranath (created: 2026-02-20 14:01 (UTC+8))
### Name of failing test
v1/logits_processors/test_custom_offline.py::test_custom_logitsprocs[CustomLogitprocSource.LOGITPROC_SOURCE_ENTRYPOINT]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34942 [CI Failure]: Intel HPU Test - examples/offline_inference/basic/generate.py — ci-failure — by varun-sundar-rabindranath (created: 2026-02-20 13:38 (UTC+8))
### Name of failing test
examples/offline_inference/basic/generate.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34941 [CI Failure]: Intel GPU Test : examples/offline_inference/basic/generate.py — ci-failure — by varun-sundar-rabindranath (created: 2026-02-20 13:28 (UTC+8))
### Name of failing test
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34939 [CI Failure]: V1 e2e + engine : Cannot re-initialize CUDA in forked subprocess — ci-failure — by varun-sundar-rabindranath (created: 2026-02-20 13:11 (UTC+8))
### Name of failing test
v1/e2e/test_kv_sharing_fast_prefill.py::test_kv_sharing_fast_prefill[*, *], v1/e2e/test_mamba_prefix_cache.py::test_mamba_prefix_cache
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
[Closed issues]
- #14289 [Feature]: Chat inputs to AsyncLLMEngine — feature request,stale — by sfc-gh-mkrubinski (closed: 2026-02-21 10:17 (UTC+8)) [💬7]
### 🚀 The feature, motivation and pitch
Currently, only the `LLM` class meant for offline inference supports the `chat` method. Are there any plans to implement a similar method for `AsyncLLMEngine`, besides the existing `generate`? Alternatively, is there any work on extending the `PromptType` acceptable by `generate` to include more prompt variants, such as chat conversations?
### Alternatives
No response
…
- #22008 [Installation]: with latest vllm source code installation done, but failed to run vllm server — installation,stale — by gaowayne (closed: 2026-02-21 10:17 (UTC+8)) [💬16]
### Your current environment
The output of `python collect_env.py`; the server hangs now
### How you are installing vllm
…
- #22704 [Bug]: Using GPT-OSS with the streaming option degrades the quality of responses. — bug,stale,gpt-oss — by ashgold (closed: 2026-02-21 10:16 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py`: System Info: OS: Ubuntu 22.04.5 LTS (x86_64) ...
- #26490 [Bug]: `PPLXAll2AllManager` fails to init on pplx-kernels latest — bug,stale — by wseaton (closed: 2026-02-21 10:16 (UTC+8)) [💬3]
### 🐛 Describe the bug
The pplx-kernels all2all backend fails to initialize when using the latest pplx-kernels on main due to an interface change following the nvshmem4py integration: https://github.com/vllm-project/vllm/blob/main/vllm/distributed/device_communicators/all2all.py#L158-L162
Last working commit: https://github.com/perplexityai/pplx-kernels/blob/12cecfda252e4e646417ac263d96e994d476ee5d/src/pplx_kernels/nvshmem.py
(APIServer pid=317) (EngineCore_DP7 pid=606) self._target(…
- #26636 [Bug]: AsyncHttpClient incorrectly decodes URLs via aiohttp, breaking signed URLs (e.g., S3) — bug,stale — by 1994 (closed: 2026-02-21 10:16 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py` (template placeholder). vLLM Version: 0.9.2 …
- #27004 [New Model]: support tencent/Hunyuan-MT-Chimera-7B-fp8 — new-model,stale — by xinliu9451 (closed: 2026-02-21 10:16 (UTC+8)) [💬3]
### The model to consider.
When will support for the tencent/Hunyuan-MT-Chimera-7B-fp8 model be available? Thank you very much.
### The closest model vllm already supports.
No response
### What's your difficulty of supporting the model you want?
…
- #27179 [Bug]: KVConnectorLogging initialization fails in single-process mode with LMCache due to has_kv_transfer_group assertion error — bug,stale,kv-connector — by sakunkun (closed: 2026-02-21 10:16 (UTC+8)) [💬5]
### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info ...
- #27222 [Performance][Qwen3-next] Decrease huge CPU overhead — performance,stale — by vadiklyutiy (closed: 2026-02-21 10:16 (UTC+8)) [💬6]
### Proposal to improve performance
Qwen3-next has substantial CPU overhead even with pretty big batch sizes (more than 512, the upper bound for cudagraph size).
I took batch size 1024 for demonstration, using `vllm bench serve --backend vllm --model Qwen/Qwen3-Next-80B-A3B-Instruct --endpoint /v1/completions --dataset-name random --random-input 32 --random-output 1024 --max-concurrency 1024 --num-prompt 1024 --ignore-eos`, where most of the time is spent decoding 1024 requests in 1…
- #27233 gguf run good — usage,stale — by kmnnmk212-source (closed: 2026-02-21 10:16 (UTC+8)) [💬14]
### Your current environment
from vllm import LLM, SamplingParams
gguf_path = "/home/m/Desktop/vllm/vllm/examples/offline_inference/basic/Qwen3-1.7B-GGUF/Qwen3-1.7B-Q6_K.gguf"
llm = LLM(gguf_path, tokenizer="Qwen/Qwen3-1.7B") …
- #27239 [Feature]: Heterogeneous TP per Pipeline Stage (uneven TP across PP) — feature request,stale — by Mankeerat (closed: 2026-02-21 10:16 (UTC+8)) [💬3]
### 🚀 The feature, motivation and pitch
Goal: Enable different tensor-parallel (TP) sizes per pipeline-parallel (PP) stage so mixed accelerators can run in one pipeline (e.g., per-stage TP = 4,1,2,1). Behavior remains unchanged unless explicitly enabled.
Motivation: A lot of available hardware is heterogeneous (spot pools, older nodes, new accelerators). Today vLLM assumes uniform TP across all PP stages, which limits deployment on mixed GPU fleets. Supporting per-stage TP would let users combi…
- #27251 [Bug]: Deploying the model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF using docker vllm/vllm-openai:v0.10.2 and vllm/vllm-openai:v0.11.0 failed. — bug,stale — by suntao2015005848 (closed: 2026-02-21 10:16 (UTC+8)) [💬2]
### Your current environment
NVIDIA: 8 * NVIDIA L20; docker: vllm/vllm-openai:v0.10.2 and vllm/vllm-openai:v0.11.0 …
- #27252 [Usage]: Does the `@app.post("/generate")` API support qwen2_vl or not? — usage,stale — by wwkww (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### Your current environment
I want to know whether the `@app.post("/generate")` API supports qwen2_vl or not.
### How would you like to use vllm
I want to run inference of a specific model. I don't know how to integrate it with vllm.
…
- #27268 [Usage]: failed to infer device type on GCP COS despite nvidia container toolkit installed — usage,stale — by forrestbao (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### Your current environment
I failed to run this script on GCP COS.
### How would you like to use vllm
I was trying to use vLLM on a Google Cloud (GCP) Container-Optimized OS (COS) instance via Docker.
I followed GCP's documentation to install the nvidia driver, including mapping nvidia driver-related dirs to the Docker container. All tests worked fine.
…
- #27278 args.hf_split is overridden even when set, causing some datasets to not actually be supported — stale — by Delaunay (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
https://github.com/vllm-project/vllm/blob/aa1356ec53a65a79a0027e8c265c76d84de8c046/vllm/benchmarks/datasets.py#L1680
Using the dataset openslr/librispeech_asr raises: ValueError: Bad split: train. Available splits: ['test.clean', 'test.other', 'train.clean.100', 'train.clean.360', 'train.other.500', 'validation.clean', 'validation.other']
- #27287 [Bug]: Speculative Decoding Issue with VLLM_ENABLE_V1_MULTIPROCESSING=0 — bug,stale — by shadowpa0327 (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py`: Collecting environment information... uv is set. System Info ...
- #27315 [Feature]: Does this model support fine-tuning? — feature request,stale — by yaoyun1 (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
Does this model support fine-tuning?
### Alternatives
No response
### Additional context
…
- #27325 [Bug]: When spec_tokens count is greater than 1, the adaptation of cuda_graph_sizes causes the decoding process to fall back to eager mode. — bug,stale — by zouyida2052 (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py` (template placeholder) …
- #27333 [Bug]: When I run qwen3-vl-30b-a3b-fp8, it only generates part of the content — bug,stale — by bigbro13 (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info ...
- #27356 [Feature]: Support usage of server_url as a tool parameter to dynamically connect to MCP servers — feature request,stale — by mandos21 (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
To fully flesh out the functionality of the responses API and tool calling, I think that this should be supported as well. More info in the additional context section.
### Alternatives
No response
### Additional context
…
- #27358 [Feature]: Support for custom node order of pipeline stages. — feature request,stale — by JorgenTrondsen (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
I am doing a master's thesis using vllm with ray. I need control over which nodes receive which parts of a pipeline stage (partitioned layers).
### Alternatives
No response
### Additional context
…
- #34995 [CI] Maverick model QKV weight shape mismatch during load_weights with expert parallelism — ci-failure — by LucasWilkinson (closed: 2026-02-21 09:19 (UTC+8))
## Name of failing test
models/multimodal/generation/test_maverick.py::test_dummy_maverick[2-True-False-meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8-4-4-2]
models/multimodal/generation/test_maverick.py::test_dummy_maverick[2-True-True-meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8-4-4-2]
## Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries …
- #34234 [Feature]: Enable CUDA graph capture for Eagle speculator prefill — feature request — by HamzaElshafie (closed: 2026-02-21 08:46 (UTC+8)) [💬1]
## 🚀 The feature, motivation and pitch
Eagle's speculator currently runs prefill in eager mode, while decode can use CUDA graphs.
There is an explicit TODO in the codebase to support CUDA graphs for the prefill path. I'm interested in working on this as my first contribution and wanted to check alignment before starting any implementation.
Currently, the Eagle speculator contains the following TODO:
- Location: vllm/v1/worker/gpu/spec_decode/eagle.py (around line 229) - **Current state…
- #34989 [Bug]: Qwen3-Next-FP8 failure — bug — by tdoublep (closed: 2026-02-21 05:31 (UTC+8)) [💬1]
### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info ...
- #34969 [CI Failure]: LM Eval Small Models (B200) - DeepSeek and Qwen3 Next — ci-failure — by mgoin (closed: 2026-02-21 05:25 (UTC+8))
### Name of failing test
evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[DeepSeek-V2-Lite-Instruct-FP8]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34929 [Feature]: Get latency and tpot per request in vLLM benchmark — feature request — by snova-rodrigom (closed: 2026-02-21 05:15 (UTC+8)) [💬1]
### 🚀 The feature, motivation and pitch
Today TTFT and ITLs are provided, but I think some users would prefer to have the end-to-end latency of the request for a more realistic number in their environments. Also, TPOT per request should give a better idea of the performance of a particular request. It's cool to have it for the whole run, but per-request numbers bring a more granular perspective and can help debugging.
### Alternatives
I haven't seen options to get the requested…
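One common definition of the per-request TPOT asked for in #34929 is (end-to-end latency - TTFT) / (output tokens - 1), i.e. the average time per output token after the first. A hypothetical sketch (the function name is illustrative, not a vLLM benchmark API):

```python
def request_e2e_tpot(ttft_s: float, e2e_latency_s: float, num_output_tokens: int) -> float:
    """Per-request time per output token after the first, in seconds."""
    if num_output_tokens < 2:
        return 0.0  # TPOT is undefined with fewer than two output tokens
    return (e2e_latency_s - ttft_s) / (num_output_tokens - 1)

# Example: 0.5 s to first token, 2.5 s total, 101 output tokens
# → (2.5 - 0.5) / 100 = 0.02 s per output token.
print(request_e2e_tpot(0.5, 2.5, 101))
```

Computing this per request, rather than only as a run-wide aggregate, is exactly the granularity the issue argues helps debugging.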
- #29534 [CI Failure]: mi325_8: LoRA Test %N — ci-failure — by AndreasKaratzas (closed: 2026-02-21 02:41 (UTC+8)) [💬10]
### Name of failing test
pytest -v -s lora --shard-id=$BUILDKITE_PARALLEL_JOB --num-shards=$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_llm_with_multi_loras.py --ignore=lora/test_olmoe_tp.py --ignore=lora/test_deepseekv2_tp.py --ignore=lora/test_gptoss_tp.py --ignore=lora/test_qwen3moe_tp.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #32219 [RFC]: Add Helion integration in vLLM — RFC — by gmagogsfm (closed: 2026-02-21 03:09 (UTC+8)) [💬5]
### Motivation.
(with significant input from @zou3519, @ProExpertProg, @mgoin, @xiaohongchen1991)
Helion is PyTorch's latest innovation in authoring custom kernels, featuring simple and familiar syntax, good developer experience, and superior performance.
This RFC proposes a developer-friendly framework for integrating Helion kernels into vLLM, making custom ops in vLLM more efficient, enjoyable to write, and performant in production.
The proposed integration is [prototyped here](https://gi…
- #29511 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 1 — ci-failure — by AndreasKaratzas (closed: 2026-02-21 02:42 (UTC+8)) [💬14]
### Name of failing test
pip install git+https://github.com/TIGER-AI-Lab/Mantis.git && pytest -v -s models/multimodal -m 'not core_model' --ignore models/multimodal/generation/test_common.py --ignore models/multimodal/processing
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29521 [CI Failure]: mi325_1: Samplers Test — ci-failure — by AndreasKaratzas (closed: 2026-02-21 02:40 (UTC+8)) [💬12]
### Name of failing test
pytest -v -s samplers && VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #33809 [CI Failure]: Kernels MoE Test %N — ci-failure — by AndreasKaratzas (closed: 2026-02-21 02:40 (UTC+8)) [💬7]
### Name of failing test
pytest -v -s kernels/moe
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29463 [CI Failure]: mi325_1: Language Models Tests (Standard) — ci-failure — by AndreasKaratzas (closed: 2026-02-21 02:39 (UTC+8)) [💬10]
### Name of failing test
pytest -v -s models/language -m 'core_model and (not slow_test)'
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34943 [CI Failure]: AMD Samplers Test (mi325_1) — rocm,ci-failure — by varun-sundar-rabindranath (closed: 2026-02-21 01:18 (UTC+8)) [💬2]
### Name of failing test
distributed/test_ca_buffer_sharing.py, distributed/test_multiproc_executor.py, distributed/test_torchrun_example.py, distributed/test_torchrun_example_moe.py, kernels/helion/test_utils.py, plugins_tests/test_stats_logger_plugins.py, v1/kv_connector/nixl_integration/test_edge_cases.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) — ci-failure — by AndreasKaratzas (closed: 2026-02-21 01:17 (UTC+8)) [💬16]
### Name of failing test
pytest -v -s entrypoints/openai/test_collective_rpc.py; pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/; pytest -v -s entrypoints/test_chat_utils.py
### Basic information
- Fla…
- #34619 [Bug]: Qwen3.5 illegal memory access — bug — by vadiklyutiy (closed: 2026-02-20 21:54 (UTC+8)) [💬10]
### Your current environment
The output of `python collect_env.py`: uv is set. System Info ...
- #31625 [Feature][ROCm][AITER]: Speculative Decoding Accuracy Issue with VLLM_ATTENTION_BACKEND=ROCM_AITER_FA — bug,feature request,rocm — by vllmellm (closed: 2026-02-20 14:25 (UTC+8)) [💬2]
### Your current environment
rocm/vllm-dev:nightly
### 🐛 Describe the bug
## Problem Description
When running speculative decoding using VLLM_ATTENTION_BACKEND=ROCM_AITER_FA, the model produces extremely poor accuracy results on the GSM8K benchmark.
### Command Used …
- #34935 [Usage]: TypeError: '>' not supported between instances of 'str' and 'int' — usage — by lottopotato (closed: 2026-02-20 13:32 (UTC+8)) [💬3]
If you're encountering a TypeError: '>' not supported between instances of 'str' and 'int' with the `transformers` tokenizer in `vllm.chat()`, this issue likely stems from `transformers` versions 5.x. I'm not sure this change was applied consistently, but the tokenizer's `return_dict` argument was changed to `True` by default in these versions.
- [Tokenizer] Change default value of return_dict to True in doc string for…
- #33925 [Bug]: OpenAI API: system message accepts image — bug — by jonoillar (closed: 2026-02-20 13:30 (UTC+8))
### Your current environment
The output of `python collect_env.py` (template placeholder) …
- #34249 [Bug]: [Fp8] [MoE] 'FLASHINFER_CUTLASS' is auto-selected as MoE backend instead of 'DEEPGEMM' on hopper — bug — by cjackal (closed: 2026-02-20 13:29 (UTC+8)) [💬4]
### Your current environment
The output of `python collect_env.py` (template placeholder) …
- #34819 [CI Failure]: models/test_initialization.py::test_can_initialize_small_subset[Llama4ForConditionalGeneration] — ci-failure — by ilmarkov (closed: 2026-02-20 11:51 (UTC+8)) [💬1]
### Name of failing test
models/test_initialization.py::test_can_initialize_small_subset[Llama4ForConditionalGeneration]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34814 [CI Failure]: models/test_initialization.py::test_can_initialize_large_subset[InternS1ProForConditionalGeneration] — ci-failure — by ilmarkov (closed: 2026-02-20 11:51 (UTC+8)) [💬1]
### Name of failing test
models/test_initialization.py::test_can_initialize_large_subset[InternS1ProForConditionalGeneration]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34810 [CI Failure]: models/test_initialization.py::test_can_initialize_large_subset[H2OVLChatModel] — ci-failure — by ilmarkov (closed: 2026-02-20 11:51 (UTC+8)) [💬1]
### Name of failing test
models/test_initialization.py::test_can_initialize_large_subset[H2OVLChatModel]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
[New PRs]
- #34946 Remove redundant uniform=False in mixed_mode PIECEWISE branch — v1,nvidia — by oops-oom (created: 2026-02-20 15:01 (UTC+8)) [💬1 | +1/-1, 1 file | commented:1]
In `initialize_cudagraph_keys`, the `mixed_mode()` branch always calls `_create_padded_batch_descriptor` with `uniform_decode=False`. This is by design — mixed mode handles general (non-uniform) batches, so `uniform` is guaranteed to be `False` in the returned `BatchDescriptor`.
Therefore, the subsequent `replace(batch_desc, num_reqs=None, uniform=False)` in the PIECEWISE sub-branch redundantly sets `uniform=False`. This PR simplifies it to `replace(batch_desc, num_reqs=None)`.
- #34962 [Build] Bump FlashInfer version from v0.6.3 to v0.6.4 — ci/build,nvidia — by vadiklyutiy (created: 2026-02-20 23:53 (UTC+8)) [💬1 | +5/-5, 4 files | commented:1 | 📝 draft]
## Summary
Promote the FlashInfer dependency from v0.6.3 to v0.6.4 across all build artifacts.
The main goal is to use the GDN decode kernel for Qwen3.5.
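The `dataclasses.replace` behavior behind #34946 above is easy to demonstrate: fields not named keep their current values, so re-passing a field with its existing value is a no-op. The `BatchDescriptor` below is a stand-in with illustrative fields, not vLLM's actual class:

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class BatchDescriptor:  # illustrative stand-in, not vLLM's BatchDescriptor
    num_reqs: Optional[int]
    uniform: bool

desc = BatchDescriptor(num_reqs=8, uniform=False)

# `replace` keeps any field you don't name, so when `uniform` is already
# False, passing uniform=False changes nothing:
a = replace(desc, num_reqs=None, uniform=False)
b = replace(desc, num_reqs=None)  # the simplified form from the PR
print(a == b)  # → True
```

Since mixed mode always produces `uniform=False` descriptors, the two calls are provably equivalent, which is why the PR can drop the argument.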
- #35003 Fix: [Bug]: FlashInfer attn-fp4 fused kernel performs w… — bug — by darshjme-codes (created: 2026-02-21 07:36 (UTC+8)) [💬4 | +50/-0, 1 file | commented:1]
The fused attention quant kernel for fp4 was observed to be slower than the unfused version. The fix disables the fusion pass when fp4 quantization is active, ensuring the higher-performance unfused kernel is used instead. This resolves the performance regression without altering the underlying kernel implementation.
Fixes #34988
Changes:
vllm/flash_attn/flash_attn_interface.py: Disabled the fused attention quantization pass when fp4 is enabled, preventing the slower fused kernel…
- #34997 Revert "[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers" — ready,ci-failure,llama — by LucasWilkinson (created: 2026-02-21 06:41 (UTC+8)) [💬1 | +68/-29, 1 file | commented:1 approved:1]
Reverts vllm-project/vllm#34471 to fix CI failures: https://github.com/vllm-project/vllm/issues/34995
FIX: https://github.com/vllm-project/vllm/issues/34995
cc @eldarkurtic
- #35002 Fixing matcher to enable the 2-node-tests-4-gpus-in-total on MI355 — ci/build — by Alexei-V-Ivanov-AMD (created: 2026-02-21 07:36 (UTC+8)) [💬1 | +1/-1, 1 file | commented:1]
Fixing the matcher to enable the 2-node-tests-4-gpus-in-total job on MI355.
- #34999 [CI] Fix `tests/evals/gsm8k/test_gsm8k_correctness.py` for `Qwen3-Next-80B-A3B-NVFP4-EP2` — v1,qwen — by LucasWilkinson (created: 2026-02-21 07:08 (UTC+8)) [💬2 | +0/-6, 1 file | commented:1]
FIX: https://github.com/vllm-project/vllm/issues/34993
https://github.com/vllm-project/vllm/pull/34077 broke `pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py -k "Qwen3-Next-80B-A3B-NVFP4-EP2" --config-list-file=tests/evals/gsm8k/configs/models-blackwell.txt` with what appears to be an overly conservative assert.
Test Plan: pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py -k "Qwen3-Next-80B-A3B-NVFP4-EP2" --config-list-file=tests/evals/gsm8k/configs/models-blackwell.txt …
- #34977 [Mamba][APC] Add test case to compare apc outputs — no labels — by divakar-amd (created: 2026-02-21 02:21 (UTC+8)) [💬2 | +55/-0, 1 file | commented:1]
(Merge after https://github.com/vllm-project/vllm/pull/34798)
- Added `test_same_mamba_output_apc_on_vs_off` to `test_hybrid.py` to check that output texts are identical whether prefix caching is enabled or not for the Mamba model.
https://github.com/vllm-project/vllm/pull/34798 fixes a bug in `csrc/mamba/mamba_ssm/selective_scan_fwd.cu` where index calculation was incorrect with prefix caching enabled for mamba, leading to incorrect outputs. This PR builds upon https://github.com/…
- #34978 fix: DeepSeek-R1 structured-output reasoning end detection (scheduler + parser) — structured-output,v1,deepseek — by nbethala (created: 2026-02-21 02:32 (UTC+8)) [💬3 | +33/-18, 3 files | commented:1]
UPDATE (Feb 20): Rewrote the summary for clarity. The PR fixes a signature mismatch in StructuredOutputManager.should_advance() that prevented correct advancement after reasoning ended.
Summary
This PR completes DeepSeek-R1 structured-output integration by fixing a signature mismatch in StructuredOutputManager.should_advance() that prevented correct advancement after reasoning ended.
Problem: DeepSeek-R1 must detect the `</think>` token to transition from free-form reasoning into JSON-const…
- #34987 [CI Bugfix] Add pytest.mark.flaky to tests/v1/e2e/test_spec_decode.py — bug,ready,v1,ci-failure — by mgoin (created: 2026-02-21 05:23 (UTC+8)) [💬1 | +4/-0, 1 file | commented:1 approved:1]
## Purpose
The hope is that this will help with the flaky failures of "V1 e2e + engine", since I identified these tests are often the culprit. They really should be reworked, but when I ran them locally the results really did vary from run to run.
FIX https://github.com/vllm-project/vllm/issues/34617 https://github.com/vllm-project/vllm/issues/34797
https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&period=7days&query=tests%2Fv1%2Fe2e%2Ftest_spec_decode.p…
-
#35001 feat(model): add embed_sparse task for BGE-M3 server-side sparse aggr… — 无标签 — by joeqzzuo (创建于: 2026-02-21 07:33 (UTC+8)) [💬3 | +180/-5, 3 files | commented:1] …egation
## Purpose
Add
embed_sparsepooling task forBgeM3EmbeddingModelto enable server-side sparse vector aggregation.Currently, BGE-M3 sparse retrieval requires a cumbersome 2-step client workflow:
- Call
/tokenizeto get token IDs …
- Call
-
#35000 Simplify output assignment condition — v1 — by zhuohan123 (创建于: 2026-02-21 07:28 (UTC+8)) [💬1 | +1/-3, 1 files | commented:1] ## Purpose Refactor conditional check for output assignment.
-
#34990 [CI] Skip Responses API — ready,ci-failure — by robertgshaw2-redhat (创建于: 2026-02-21 05:56 (UTC+8)) [💬2 | +4/-0, 1 files | commented:1 approved:2]
## Purpose
- Skip Responses API tests, which are flaky.
- They need further investigation, but this will help stabilize the CI.
- #34998 [ROCm] Check that AITER MHA is not selected with sinks — rocm — by gshtras (创建于: 2026-02-21 07:05 (UTC+8)) [💬1 | +13/-2, 1 files | commented:1]
AITER MHA does not support sinks, so its selection is now gated accordingly.
This should unbreak
`VLLM_ROCM_USE_AITER=1 vllm serve openai/gpt-oss-120b` -
#34992 [RL] Validation for pause_mode='keep' — documentation,ready — by hao-aaron (创建于: 2026-02-21 06:02 (UTC+8)) [💬2 | +179/-105, 1 files | commented:6 approved:2] ## Purpose
Changed the example `rlhf_async_new_apis.py` to validate tokens generated after a weight update, by comparing them against tokens generated by a fresh vLLM engine loaded with the new weights.
relevant RFC: https://github.com/vllm-project/vllm/issues/32103
-
#34991 [Bugfix] ep_scatter kernel store-load race condition — bug — by ivanium (创建于: 2026-02-21 05:58 (UTC+8)) [💬4 | +7/-3, 1 files | commented:1] ## Purpose
Fix an intra-CTA data race in `_fwd_kernel_ep_scatter_1` that causes sporadic “Warp Illegal Address” crashes during DeepGEMM MoE permutation (observed on DeepSeek V3 PD decode, but possibly affects any EP path using this kernel).
### The bug
The kernel launches 256 blocks × 256 threads (8 warps). Each block computes an exclusive prefix sum and writes all 256 entries of `expert_start_loc` to global memory via a vector `tl.store`, then loads back one entry via `tl.load(expert_sta… -
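For context, the quantity the kernel computes before the racy load-back — `expert_start_loc` as an exclusive prefix sum over per-expert token counts — can be written as a plain-Python reference (illustrative only, not the Triton kernel):

```python
from itertools import accumulate

def expert_start_loc(tokens_per_expert: list[int]) -> list[int]:
    """Exclusive prefix sum: the start offset of each expert's token range.

    This is the array the ep_scatter kernel writes to global memory and then
    reads back entries from; the store-load pair is what raced inside a CTA.
    """
    return [0] + list(accumulate(tokens_per_expert))[:-1]
```

In the Triton kernel, the fix would amount to ensuring the store is visible before the load within the block; this reference only pins down the intended result.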
#34984 [Misc][LoRA] Add --lora-target-modules to restrict LoRA to specific modules — documentation — by bhoomit (创建于: 2026-02-21 03:47 (UTC+8)) [💬7 | +183/-2, 6 files | commented:1] ## Purpose
Add deployment-time control over which model modules have LoRA applied via a new `--lora-target-modules` CLI parameter and `LoRAConfig.target_modules` field. This accepts module suffixes (e.g., `o_proj`, `qkv_proj`) and restricts LoRA application to only those modules, useful for performance tuning. When not specified, all supported LoRA modules are used (existing behavior).
## Usage
```bash vllm serve model --enable-lora --lora-target-modules o_proj qkv_proj …
-
#34974 [Kernel] Optimize sample_recovered_tokens_kernel — performance,ready,v1 — by xyang16 (创建于: 2026-02-21 01:51 (UTC+8)) [💬1 | +178/-33, 2 files | commented:3 approved:1]
## Purpose
This PR optimizes the `sample_recovered_tokens_kernel` in the rejection sampler.
- Use tiled reduction over vocab. Instead of creating one massive `tl.arange(0, PADDED_VOCAB_SIZE)` vector and loading the whole vocab in one program, load it in chunks of BLOCK_SIZE. This gives lower register pressure and better occupancy.
- Replace the `score = prob / q` division with `score = prob * inv_q`, because division is slower than multiplication.
- Kernel level more than 5…
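The tiled-reduction idea can be sketched in plain Python (an illustrative analogy to the Triton change, not the kernel itself): scan a long score vector in fixed-size chunks while tracking a running best, which matches a single full pass.

```python
def argmax_tiled(scores, block_size):
    """Argmax over a long vector computed chunk by chunk.

    Analogy for the kernel change: instead of materializing the whole
    (padded) vocab at once, process it in BLOCK_SIZE tiles and keep a
    running maximum, which yields the same result.
    """
    best_idx, best_val = -1, float("-inf")
    for start in range(0, len(scores), block_size):
        for i, v in enumerate(scores[start:start + block_size], start):
            if v > best_val:  # strict '>' keeps the first occurrence
                best_idx, best_val = i, v
    return best_idx
```

The multiply-for-divide swap is the usual strength reduction: precompute `inv_q = 1.0 / q` once and use `prob * inv_q` in the hot loop instead of `prob / q`.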
- #34979 [CI] Revert PRs 34818 and 33600 — speculative-decoding,ready,v1,multi-modality,nvidia — by LucasWilkinson (创建于: 2026-02-21 02:34 (UTC+8)) [💬1 | +242/-294, 16 files | commented:1 approved:1] https://github.com/vllm-project/vllm/pull/33600 broke the basic models test, https://github.com/vllm-project/vllm/pull/34818 tried to fix it but caused accuracy issues for ``` FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[DeepSeek-V2-Lite-Instruct-FP8] - AssertionError: GSM8K metric too low: 0.0114 < 0.7200 - 0.0800 = 0.6400 assert np.float64(0.011372251705837756) >= (0.72 - 0.08) FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[Qwen3-Next-80B-A3B-NVFP4…
-
#34980 [not for land] Change VLLM_USE_MEGA_AOT_ARTIFACT default to '1' — ready — by zhxchen17 (创建于: 2026-02-21 03:05 (UTC+8)) [+1/-1, 1 files | commented:1] ## Purpose
## Test Plan
## Test Result
-
#34986 Trtllm gen full attn test coverage — v1,nvidia — by ojhaanshika (创建于: 2026-02-21 04:27 (UTC+8)) [💬2 | +510/-0, 2 files | commented:1 | 📝草稿] ## Purpose
Add test coverage for the TRTLLM gen-full attention pipeline in vLLM. The existing kernel-level tests (`test_flashinfer_trtllm_attention.py`) call the FlashInfer C++ kernels directly, bypassing the vLLM integration layer entirely. This leaves the decision logic, metadata construction, and forward dispatch at 0% coverage.
This PR adds two new test files:
- Unit tests for attention decision functions (`tests/kernels/attention/test_use_trtllm_attention.py`) — 22 tests covering al…
-
#34955 [Bugfix] kimi k2.5 tool call tokens leaking into reasoning_content — bug — by felixmr1 (创建于: 2026-02-20 22:19 (UTC+8)) [💬1 | +476/-3, 3 files | commented:2] ## Purpose
When Kimi K2.5 skips `</think>` and jumps directly to a tool call, the reasoning parser had no mechanism to stop routing subsequent tokens to `reasoning_content`. This caused raw tool call tokens to appear in the thinking block instead of being handled by the tool call parser.
This PR introduces `KimiK25ReasoningParser` as a thin subclass of `DeepSeekV3ReasoningWithThinkingParser` that:
- Forces `thinking=True` (Kimi K2.5 always reasons)
- Tracks a `_tool_call_started` flag
- Return…
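The flag-based routing described above can be sketched as a toy router (the marker string and class are illustrative; the real parser subclasses vLLM's reasoning parser and uses the model's actual tool-call tokens):

```python
TOOL_CALL_START = "<tool_call>"  # illustrative marker, not Kimi's real token

class ReasoningRouter:
    """Toy sketch of the fix: once a tool-call marker appears, stop routing
    tokens into reasoning_content and hand them to the tool side instead."""

    def __init__(self):
        self.reasoning_content = []
        self.tool_content = []
        self._tool_call_started = False

    def feed(self, token: str) -> None:
        if token == TOOL_CALL_START:
            self._tool_call_started = True
        if self._tool_call_started:
            self.tool_content.append(token)  # handled by the tool call parser
        else:
            self.reasoning_content.append(token)
```

Without the flag, every token after a skipped `</think>` would keep landing in `reasoning_content`, which is exactly the leak the PR fixes.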
-
#34983 [MoE Refactor] Convert mxfp4 moe quant method into oracle — ci/build,gpt-oss,nvidia — by zyongye (创建于: 2026-02-21 03:44 (UTC+8)) [💬1 | +1495/-930, 25 files | commented:1 | 📝草稿]
## Purpose Ongoing MXFP4 MoE refactor
This PR can be greatly improved once #32564 is merged.
## Test Plan gpt-oss-20b testing with various kernels. [WIP]
…
- #34985 [CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor — bug,rocm,nvidia — by rasmith (创建于: 2026-02-21 04:17 (UTC+8)) [+2/-0, 1 files | commented:1]
## Purpose
The tests in `test_punica_ops` can fail if a previous test ran and changed the device that kernels execute on (in particular Triton kernels) by using `torch.cuda.set_device`, e.g. `lora/test_layers.py`. This can cause subsequent tests to also fail, since the system seems to go into a bad state once this happens. ## Test Plan ``` pytest -v -s lora
--ignore=lora/test_chatglm3_tp.py
--ignore=lora/test_llama_tp.py
--ignore=lora/test_llm_with_mult… -
#34981 Add retry logic to LoRA adapter config loading — meta-exported,fb-exported — by mgrange1998 (创建于: 2026-02-21 03:11 (UTC+8)) [💬4 | +241/-2, 3 files | commented:1] Summary: In `PEFTHelper.from_local_dir()`, the non-tensorizer path does a bare `open() + json.load()` to read `adapter_config.json`. If the file hasn’t finished being written or downloaded (race condition with Manifold, HF Hub, or NFS), this raises `FileNotFoundError` and crashes the vLLM worker.
This diff adds a `_load_lora_config()` static method with exponential backoff retry (50ms → 1s cap, 15s default timeout). It catches `FileNotFoundError` (file not yet created) and `json.JSONDecodeError…
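A minimal sketch of that retry pattern (function name and parameters are illustrative; the PR's actual method lives on `PEFTHelper`):

```python
import json
import time

def load_config_with_retry(path, timeout=15.0, base_delay=0.05, max_delay=1.0):
    """Read a JSON config that may still be mid-write/mid-download.

    Retries FileNotFoundError (file not yet created) and JSONDecodeError
    (file partially written) with exponential backoff, capped at max_delay,
    until the timeout deadline is reached.
    """
    deadline = time.monotonic() + timeout
    delay = base_delay
    while True:
        try:
            with open(path) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            if time.monotonic() >= deadline:
                raise  # give up: surface the last error to the caller
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```

Re-raising the last exception at the deadline keeps the failure mode identical to the non-retrying code, just delayed.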
#34982 Add GlmOcrConfig for GLM-OCR model type recognition — meta-exported,fb-exported — by hujia177 (创建于: 2026-02-21 03:12 (UTC+8)) [💬2 | +94/-0, 3 files | commented:1] Summary: The GLM-OCR model (`zai-org/GLM-OCR`) uses `model_type: "glm_ocr"` in its config.json, which is not recognized by the bundled HuggingFace Transformers library (requires transformers >= 5.1.0). This causes a `ValidationError` during `ModelConfig` creation when vLLM tries to load the model via `AutoConfig.from_pretrained()`.
Add a custom `GlmOcrConfig` (extending `PretrainedConfig`) and `GlmOcrVisionConfig` to bridge this gap, following the same pattern as `DotsOCRConfig`. Register the c…
#34959 [Misc] Fix mypy errors in vllm/profiler and remove from exclude list — ready — by taneem-ibrahim (创建于: 2026-02-20 23:39 (UTC+8)) [+37/-30, 2 files | commented:1 approved:1]
## Purpose
- This PR removes vllm/profiler from the mypy EXCLUDE list and fixes all resulting type errors in vllm/profiler/layerwise_profile.py.
- Part of the ongoing effort in #26533 to enable mypy checking across the codebase.
## Test Plan
- Run the pre-commit hook: `pre-commit run --hook-stage manual mypy-3.10 -a`
## Test Result
- mypy checks pass cleanly except for the pre-existing nixl_connector check er…
-
#34965 [Test] Improve flashinfer cutlass moe test coverage — ci/build,nvidia — by jasonlizhengjian (创建于: 2026-02-21 00:20 (UTC+8)) [+453/-2, 4 files | commented:4]
## Purpose
Improve flashinfer cutlass moe test coverage. In particular, Hopper coverage was lacking. Also `fp8_block_scale` was not tested at all previously.
## Test Plan
``` …
-
#34972 Fix different embeddings produced by LLM and AsyncLLM — v1 — by guodongxiaren (创建于: 2026-02-21 01:27 (UTC+8)) [💬2 | +4/-0, 1 files | commented:1]
## Purpose fix issue: #33361
## Test Plan this code from issue: #33361 ```python import uuid
…
-
#34976 Allow _dummy_run to use _model_forward for hardware backends with compiled execution — v1,meta-exported,fb-exported — by jumpinghawk (创建于: 2026-02-21 02:17 (UTC+8)) [💬1 | +4/-1, 1 files | commented:1] Summary:
_dummy_run currently calls self.model(...) directly, bypassing the _model_forward() override hook. This means hardware backends that override _model_forward for compiled/graph-mode execution cannot control the execution path during dummy batch runs. This matters for DP with expert parallelism: execute_dummy_batch uses _dummy_run to keep ... -
#34975 dcp prefill -> non-dcp decode prototype — kv-connector — by aws-bowencc (创建于: 2026-02-21 02:00 (UTC+8)) [💬2 | +192/-95, 2 files | commented:1] # DCP Prefill → Non-DCP Decode: KV Cache Transfer Design
## Background
With Disaggregated Context Parallelism (DCP), the prefill node splits its TP workers into DCP groups that each process a portion of the input sequence. When `interleave_size == block_size`, consecutive blocks are assigned round-robin across DCP ranks within a TP group. The prefill scheduler inflates its logical block size by `dcp_size` (e.g. `block_size=16, dcp=2` → scheduler uses `block_size=32`), so a single scheduler blo… -
#34971 [Bugfix][Model] Fix Ultravox silent weight overwrite for diff checkpoints — bug — by AndriiPasternak31 (创建于: 2026-02-21 01:12 (UTC+8)) [💬1 | +43/-1, 1 files | commented:1] ## Summary
When serving a fine-tuned Ultravox diff checkpoint (one that contains only `audio_tower` and `multi_modal_projector` weights, not the full model), the fine-tuned weights are silently overwritten by base Whisper weights during model loading. This causes the model to produce degraded outputs with no error or warning.
### Root cause
`DefaultModelLoader.get_all_weights()` yields weights in order:
- Primary checkpoint weights (fine-tuned `audio_tower.*` + `multi_modal_proje…
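The load-order problem reduces to "weights seen first must win". A toy sketch of that rule (names illustrative; the actual loader yields tensors, not dict items):

```python
def iter_weights(primary: dict, fallback: dict):
    """Yield fine-tuned (primary) weights first, then only those fallback
    (base-model) weights whose names were not already provided.

    Sketch of the described fix: the fine-tuned audio_tower /
    multi_modal_projector weights are never overwritten by base weights.
    """
    seen = set()
    for name, w in primary.items():
        seen.add(name)
        yield name, w
    for name, w in fallback.items():
        if name not in seen:  # skip base weights that would clobber fine-tuned ones
            yield name, w
```

Under the buggy ordering the fallback weights came later and silently replaced the fine-tuned ones; tracking the already-yielded names makes the overwrite impossible.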
-
#34973 Enforce that `model` is the first positional arg when `--served-model-name` is used — 无标签 — by hmellor (创建于: 2026-02-21 01:37 (UTC+8)) [+14/-4, 1 files | commented:1] Closes https://github.com/vllm-project/vllm/issues/34326
Prevents users from passing `--served-model-name` without actually specifying the model. -
#34970 [Bugfix] Fix block_size mismatch for MLA models after #34818 — bug,ready,v1 — by mgoin (创建于: 2026-02-21 01:11 (UTC+8)) [+21/-5, 1 files | commented:1] ## Purpose
#34818 deferred block_size selection to after model loading, but `InputBatch` is created before model loading with a placeholder `block_size=16`. When `update_block_size_for_backend` later sets the real block_size (e.g. 32 for `FLASHINFER_MLA`), `may_reinitialize_input_batch` compared the kv_cache sizes against the already-updated `cache_config.block_size` (both 32) and skipped re-initialization. The `InputBatch` kept using `block_size=16` for slot mappings while the KV cache us… -
#34952 [Build] Fix DSV3_FUSED_A_GEMM_ARCHS to include SM120 on CUDA 13.0 — ci/build,nvidia — by aabbccddwasd (创建于: 2026-02-20 20:21 (UTC+8)) [💬1 | +2/-2, 1 files | commented:1] ## Summary
Fix `dsv3_fused_a_gemm` kernel linking failure on SM120 (Blackwell) GPUs with CUDA 13.0.
## Problem
When building vLLM on systems with:
- CUDA compiler version >= 13.0
- SM120 (Blackwell) GPUs (e.g., RTX PRO 6000 Blackwell)
…
-
#34968 fix(tool_parsers): remove kimi k2 8k section char limit that truncates large tool call arguments — frontend — by timon0305 (创建于: 2026-02-21 00:38 (UTC+8)) [+49/-37, 4 files | commented:1] ## Purpose
Fix #34442
The Kimi K2 tool parser had a hard `max_section_chars = 8192` limit that force-exited the tool section when total content exceeded 8 KiB. This silently truncated large tool-call arguments (e.g. when the model generates a source file via a tool call), making the parser unusable for coding use cases.
The parser also had a `buffer_max_size = 1024` limit for the marker-detection buffer that was too small and could flush content containing partial section markers during large…
#34966 fix(cli): prevent --served-model-name from consuming positional model argument — frontend — by timon0305 (创建于: 2026-02-21 00:28 (UTC+8)) [💬2 | +202/-3, 5 files | commented:1] ## Purpose
Fix #34326
When running `vllm serve --served-model-name ASR Qwen/Qwen3-ASR-1.7B`, the `--served-model-name` flag (which uses `nargs='+'`) greedily consumes both `ASR` and `Qwen/Qwen3-ASR-1.7B` as its values, leaving the positional `model_tag` unset. This causes `args.model` to fall back to the default `Qwen/Qwen3-0.6B`, resulting in:
- The banner showing the wrong model name
- The `/v1/audio/transcriptions` endpoint disappearing (wrong model capabilities detected)
### Changes
…
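The argparse pitfall behind this bug is easy to reproduce in isolation (a stripped-down parser, not vLLM's actual CLI): an option declared with `nargs='+'` greedily consumes every following non-option token, so the positional argument never gets a value.

```python
import argparse

# Minimal reproduction of the described pitfall: --served-model-name with
# nargs='+' swallows the trailing positional model argument.
parser = argparse.ArgumentParser()
parser.add_argument("model_tag", nargs="?", default=None)
parser.add_argument("--served-model-name", nargs="+")

# The user intends "ASR" as the served name and the Qwen path as the model,
# but both land in --served-model-name; model_tag stays at its default.
args = parser.parse_args(["--served-model-name", "ASR", "Qwen/Qwen3-ASR-1.7B"])
```

Putting the model first (`vllm serve MODEL --served-model-name ASR`) avoids the ambiguity, which is why a companion change enforces that ordering.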
-
#34967 Fix: Add shard validation for quantized models to prevent silent failures — 无标签 — by paipeline (创建于: 2026-02-21 00:34 (UTC+8)) [💬1 | +167/-0, 2 files | commented:1] Fixes: #34859
## Problem Loading quantized checkpoints with missing shards fails silently, which can lead to incorrect model initialization and inference results. This happens because the current validation logic only checks parameter completeness for non-quantized models.
## Solution
Added shard validation specifically for quantized models that:
- ✅ Validates safetensors shards: Reads the index file and ensures all referenced shard files exist
- ✅ Validates .pt/.bin files: Ensu…
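The safetensors part of that check can be sketched as follows (a simplified standalone version under the assumption of the standard HF index layout, `{"weight_map": {param_name: shard_file}}`; not the PR's exact code):

```python
import json
import os

def find_missing_shards(model_dir: str) -> list[str]:
    """Report shard files referenced by model.safetensors.index.json
    that are absent on disk, so a partial download fails loudly instead
    of silently loading an incomplete checkpoint."""
    index_path = os.path.join(model_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]
    referenced = sorted(set(weight_map.values()))
    return [s for s in referenced
            if not os.path.exists(os.path.join(model_dir, s))]
```

A non-empty return value would be turned into a hard error before model initialization.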
-
#34961 [CI] Bump mteb version to `mteb[bm25s]>=2, <3` for pooling model unit tests — rocm,ready,ci/build — by yewentao256 (创建于: 2026-02-20 23:45 (UTC+8)) [💬1 | +3/-3, 3 files | commented:2 approved:1] ## Purpose
This PR should be merged after https://github.com/vllm-project/vllm/pull/34925; it is kept separate so that, if any issue arises, the version bump can easily be reverted.
## Test
Covered by CI
CC: @DarkLight1337 @noooop
-
#34964 Fix LoRA adapter silently failing on Pixtral/Ministral-3 models — 无标签 — by timon0305 (创建于: 2026-02-21 00:10 (UTC+8)) [+39/-0, 3 files | commented:1] ## Purpose
Fix LoRA adapters silently having no effect when loaded on Pixtral-based models (e.g. `mistralai/Ministral-3-8B-Instruct-2512`).
The root cause is that `PixtralForConditionalGeneration` is missing the `hf_to_vllm_mapper` attribute that other multimodal models (like `Mistral3ForConditionalGeneration`) define. Without this mapper, LoRA weight names from HuggingFace checkpoints (e.g. `model.language_model.model.layers.…`) are not translated to vLLM’s internal names (e.g. `language_mod…
#34960 [Doc] Fix example of eagle3 — documentation — by petrpechman (创建于: 2026-02-20 23:44 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1] ## Purpose Update the example in the documentation to use the new speculative method `eagle3`.
## Test Plan No code changes were made.
## Test Result No code changes were made.
…
-
#34958 Ensure that MkDocs v2 does not get installed — ready,ci/build — by hmellor (创建于: 2026-02-20 23:12 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] We do this for compatibility reasons with the plugins that we use.
A more likely upgrade path is to Zensical, which is more actively maintained by the creators of MkDocs Material.
-
#34938 [Bug] Use FlashAttention for NemotronH — bug — by benchislett (创建于: 2026-02-20 12:32 (UTC+8)) [💬1 | +11/-1, 1 files | commented:1] ## Purpose
For some time, we have been observing accuracy issues with FlashInfer attention for NemotronH models (Super internal checkpoints), and use FlashAttention for all evals. While we investigate this issue, this temporary fix defaults an unspecified backend to Flash Attention for NemotronH on Blackwell just to be safe.
-
#34957 [Bugfix] Use getattr fallback for weight_loader in stacked-param branch — bug,speculative-decoding,llama,qwen,deepseek,gpt-oss — by wojciech-wais (创建于: 2026-02-20 23:05 (UTC+8)) [💬2 | +361/-177, 123 files | commented:1] Fixes #34201.
After sleep mode level 2 discards model weights and `reload_weights()` is called, fresh `nn.Parameter` objects may no longer carry the custom `weight_loader` attribute that is set during normal model initialisation. Model `load_weights` implementations accessed `param.weight_loader` with a bare attribute lookup in the stacked-parameter loading branch, causing: `AttributeError: 'Parameter' object has no attribute 'weight_loader'`
The `else` branch of the same functions already …
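The fix pattern itself is a one-line `getattr` fallback. A toy illustration (the `FakeParam` class and loader names are stand-ins, not vLLM's actual types):

```python
class FakeParam:
    """Stand-in for nn.Parameter; a fresh instance carries no weight_loader."""
    def __init__(self):
        self.data = None

def default_weight_loader(param, weight):
    """Fallback loader: plain copy (illustrative stand-in)."""
    param.data = weight

def load_stacked_weight(param, weight):
    # The fix in one line: fall back to the default loader when the fresh
    # parameter created after a sleep-mode reload lost its custom attribute.
    loader = getattr(param, "weight_loader", default_weight_loader)
    loader(param, weight)
```

Parameters that still carry a custom `weight_loader` behave exactly as before; only the previously-crashing case changes.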
#34950 [CI] Bump mypy version — v1,kv-connector,nvidia — by hmellor (创建于: 2026-02-20 19:47 (UTC+8)) [💬3 | +48/-25, 10 files | commented:3 approved:1 | 📝草稿]
- Bump `mypy` from 1.11.1 (Jul 30, 2024) to 1.15.0 (Feb 5, 2025) to get better checking and performance
- 1.13.0 added a `faster-cache` optional extra to improve caching performance. We add this in case it helps
- We don’t bump all the way because newer versions catch things that older versions missed. So we step a few versions at a time to keep these PRs smaller and more manageable
-
#34944 [V0 Deprecation] Remove unused MM placeholders in request output — ready — by DarkLight1337 (创建于: 2026-02-20 13:56 (UTC+8)) [+1/-5, 1 files | commented:1 approved:1]
## Purpose
This field was only used in V0 (#10407). We can remove this now.
## Test Plan
## Test Result
…
-
#34953 Fix kv_cache_dtype auto resolution with checkpoint quantization — 无标签 — by paipeline (创建于: 2026-02-20 20:35 (UTC+8)) [💬2 | +334/-1, 3 files | commented:1] Fixes issue #34752 - improve kv-cache-dtype behavior when checkpoint specifies kv_cache_quant_algo.
Problem: When using `--kv-cache-dtype auto` with models that specify kv_cache_quant_algo in their checkpoint, the system was falling back to model_config.dtype instead of using the checkpoint-specified FP8 quantization.
Solution: Modified kv_cache_dtype_str_to_dtype() to properly use resolve_kv_cache_dtype_string() when kv_cache_dtype is auto. This ensures checkpoint-specified quantization algorit…
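The resolution order being described can be condensed into a small sketch (function name, the `"FP8"` algo string, and the dtype strings are illustrative; vLLM's actual helpers are `kv_cache_dtype_str_to_dtype()` / `resolve_kv_cache_dtype_string()`):

```python
def resolve_kv_cache_dtype(kv_cache_dtype, checkpoint_quant_algo, model_dtype):
    """Sketch of the described precedence:
    1. an explicit --kv-cache-dtype always wins;
    2. under "auto", a checkpoint-declared kv_cache_quant_algo (e.g. FP8)
       is honored;
    3. otherwise fall back to the model dtype (the old buggy behavior
       skipped step 2)."""
    if kv_cache_dtype != "auto":
        return kv_cache_dtype
    if checkpoint_quant_algo == "FP8":
        return "fp8"
    return model_dtype
```

The bug was that step 2 never ran, so FP8-quantized checkpoints silently got a bf16/fp16 KV cache.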
-
#34951 [Frontend] Merge developer instructions into tool block for Harmony prompt rendering — frontend,gpt-oss — by beom115 (创建于: 2026-02-20 20:19 (UTC+8)) [+12/-3, 1 files | commented:1]
## Purpose
Current Harmony chat implementation renders the system/developer instructions and tool definitions in two separate `<|start|>developer` blocks. However, models that follow the OpenAI `gpt-oss-20b/120b` chat template (based on the chat_template.jinja) expect a single developer block containing both `# Instructions` and `# Tools` sections. This PR modifies `_make_request…
-
#34949 [Frontend] Merge developer instructions into tool block for Harmony prompt rendering — frontend,gpt-oss — by beom115 (创建于: 2026-02-20 19:39 (UTC+8)) [💬2 | +12/-5, 1 files | commented:1]
## Purpose
Current Harmony chat implementation renders the system/developer instructions and tool definitions in two separate `<|start|>developer` blocks. However, models that follow the OpenAI `gpt-oss-20b/120b` chat template (based on the chat_template.jinja) expect a single developer block containing both `# Instructions` and `# Tools` sections. This PR modifies `_make_request…
- #34940 [CI] Remove DBO xfail on Blackwell — ready,v1 — by LucasWilkinson (创建于: 2026-02-20 13:23 (UTC+8)) [+0/-19, 1 files | commented:1] It seems to be passing locally, so try removing the xfail.
[已合并 PR]
-
#34997 Revert “[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers” — ready,ci-failure,llama — by LucasWilkinson (合并于: 2026-02-21 09:19 (UTC+8)) [💬1 | +68/-29, 1 files | commented:1 approved:1] Reverts vllm-project/vllm#34471 to fix CI failures: https://github.com/vllm-project/vllm/issues/34995
FIX: https://github.com/vllm-project/vllm/issues/34995
cc @eldarkurtic
-
#34899 Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion — ready,ci/build,deepseek,nvidia,ready-run-all-tests — by wzhao18 (合并于: 2026-02-21 05:37 (UTC+8)) [💬8 | +6/-29, 5 files | commented:1 approved:1]
## Purpose The patch for fixing the Deepseek V3 accuracy issue with AR+rms+fp4 fusion (#34395) is included in flashinfer 0.6.4. This PR bumps flashinfer version and re-enables the fusion pass by default.
## Test Plan
serve nvidia/DeepSeek-V3.1-NVFP4 -tp=4 -cc.pass_config.fuse_allreduce_rms=True…
-
#34735 [AMD][CI] Fix test_custom_allreduce for A100 testgroup — rocm,ready — by rjrock (合并于: 2026-02-21 05:33 (UTC+8)) [💬2 | +2/-0, 1 files | commented:1 approved:1] ## Purpose To fix the `test_custom_allreduce` test for the AMD A100 testgroup.
For ROCm, ray modifies `HIP_VISIBLE_DEVICES` when allocating a GPU (`@ray.remote(num_gpus=1)`), so we need to remove `HIP_VISIBLE_DEVICES` instead of `CUDA_VISIBLE_DEVICES`.
## Test Plan
`pytest -v -s distributed/test_custom_all_reduce.py`
## Test Result Before …
- #34979 [CI] Revert PRs 34818 and 33600 — speculative-decoding,ready,v1,multi-modality,nvidia — by LucasWilkinson (合并于: 2026-02-21 05:25 (UTC+8)) [💬1 | +242/-294, 16 files | commented:1 approved:1] https://github.com/vllm-project/vllm/pull/33600 broke the basic models test, https://github.com/vllm-project/vllm/pull/34818 tried to fix it but caused accuracy issues for ``` FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[DeepSeek-V2-Lite-Instruct-FP8] - AssertionError: GSM8K metric too low: 0.0114 < 0.7200 - 0.0800 = 0.6400 assert np.float64(0.011372251705837756) >= (0.72 - 0.08) FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[Qwen3-Next-80B-A3B-NVFP4…
-
#34473 [Test] Add FP8 KV Cache Testing for MLA Backends — documentation,performance,new-model,rocm,structured-output,frontend,ready,ci/build,v1,multi-modality — by wzhao18 (合并于: 2026-02-21 02:51 (UTC+8)) [💬5 | +68/-27, 1 files | commented:2 approved:6]
## Purpose This PR improves the MLA backend test coverage to include fp8 kv cache testing.
## Test Plan Tested `tests/v1/attention/`
## Test Result
…
-
#34843 [CI] Remove failing prime-rl integration test — ready,ci/build — by mgoin (合并于: 2026-02-21 02:17 (UTC+8)) [+0/-106, 3 files | commented:1]
## Purpose
This test has been failing for several weeks, so let’s just remove it. We can add it back if someone wants to revive it.
## Test Plan
## Test Result
…
-
#34912 [compile] Fix torch.compile time discrepancy in logging. — ready — by zhxchen17 (合并于: 2026-02-21 00:47 (UTC+8)) [+17/-8, 3 files | commented:1 approved:1] Summary: Fixing the issue reported in https://github.com/vllm-project/vllm/issues/27815 .
The discrepancy occurs because we didn’t measure the end-to-end wall time elapsed between critical points of the program. Instead, in the old behavior we simply summed the time spent in dynamo and inductor, leaving everything else out of the metrics.
This caused confusion because users saw an inconsistency between the real time that passed between log lines and the time reported by vLLM: …
-
#34831 [compile] Move torch_aot_compile directory under torch_compile_cache — ready — by zhxchen17 (合并于: 2026-02-21 00:46 (UTC+8)) [+5/-4, 1 files | commented:1 approved:2] Summary:
Per discussion with @zou3519, people sometimes delete torch_compile_cache/ only and assume doing so invalidates all caches. It turns out to be confusing to have two directories managing orthogonal cache files on disk. This PR consolidates them into one.
Test Plan: CI
Reviewers: …
-
#34185 [Kernel] [Helion] [6/N] Add num_tokens dimension to silu_mul autotuning and dispatching — ready — by gmagogsfm (合并于: 2026-02-21 00:36 (UTC+8)) [💬2 | +55199/-200, 3 files | commented:4 approved:2] NOTE: this is a manually stacked PR, each commit is reviewed separately. For this PR, please only review the top commit: [Helion] Add num_tokens dimension to silu_mul_fp8 autotuning
Previous version of the silu_mul kernel was only tuned for a constant `num_tokens=256`, which underperforms for smaller `num_tokens` per @xiaohongchen1991’s study. This PR adds another dimension of `num_tokens` to the config key of silu_mul so that its autotuning and dispatching are now optimized for different values …
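Keying configs on a runtime dimension usually means bucketing it and looking up a pre-tuned entry per bucket. A toy dispatcher in that spirit (bucket boundaries and configs are invented for illustration; Helion's real autotuner works differently):

```python
import bisect

# Pre-tuned configs keyed by a num_tokens bucket instead of a single
# constant. Values here are made up; real ones come from autotuning.
BUCKETS = [64, 256, 1024]
CONFIGS = {64: {"block": 32}, 256: {"block": 128}, 1024: {"block": 256}}

def pick_config(num_tokens: int) -> dict:
    """Round num_tokens up to the nearest bucket (clamping at the largest)
    and return that bucket's tuned config."""
    i = bisect.bisect_left(BUCKETS, num_tokens)
    key = BUCKETS[min(i, len(BUCKETS) - 1)]
    return CONFIGS[key]
```

The point of the PR is that small-batch shapes stop paying for a config tuned at `num_tokens=256`.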
#34958 Ensure that MkDocs v2 does not get installed — ready,ci/build — by hmellor (合并于: 2026-02-20 23:38 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] We do this for compatibility reasons with the plugins that we use.
A more likely upgrade path is to Zensical, which is more actively maintained by the creators of MkDocs Material.
-
#34870 [perf] Avoid dtype promotion sync in mamba_get_block_table_tensor — ready,v1 — by hl475 (合并于: 2026-02-20 22:21 (UTC+8)) [💬2 | +6/-2, 1 files | commented:1 approved:1] ## Purpose
From the perf profiling on a model using linear attention, one thing that draws attention is an `aten::to` from somewhere in the vLLM gpu_model_runner preprocessing.
For this `aten::to`, we can tell it is converting with dtype=3 (int64)…
-
#34909 [Refactor] Extract Harmony streaming SSE event builders into streaming_events.py — frontend,ready,gpt-oss — by sfeng33 (合并于: 2026-02-20 22:20 (UTC+8)) [💬2 | +907/-874, 2 files | commented:4 approved:2]
## Purpose
`serving.py` is ~2800 lines. The `_emit_*` methods are pure functions (state + data → events) that don’t use instance state, yet they live on the class. Extracting them:
- Reduces `serving.py` by ~800 lines
- Makes event builders independently testable
- Prepares for `StreamingParsableContext` support — the upcoming streaming parsable implementation will reuse these same event builders instead of duplicating them (as `_process_simple_streaming_eve…
-
#34944 [V0 Deprecation] Remove unused MM placeholders in request output — ready — by DarkLight1337 (合并于: 2026-02-20 22:19 (UTC+8)) [+1/-5, 1 files | commented:1 approved:1]
## Purpose
This field was only used in V0 (#10407). We can remove this now.
## Test Plan
## Test Result
…
-
#34866 [BUGFIX] Fix `_dummy_run` missing `prepare_inputs_event` synchronization — bug,ready,v1 — by vadiklyutiy (合并于: 2026-02-20 21:54 (UTC+8)) [💬10 | +31/-26, 1 files | commented:6 approved:1] Fixes #34619
### Root cause
With async scheduling, `execute_model` protects shared pinned CPU buffers (`seq_lens`, `query_start_loc`, etc.) from reuse races via `prepare_inputs_event`: it synchronizes before overwriting them and records after enqueuing `non_blocking` H2D DMAs. `_dummy_run` — used by idle DP workers for EP coordination — writes to the same pinned buffers and enqueues the same H2D DMAs but never participates in this event protocol. Back-to-back dummy steps (or a dummy s… -
#34206 [Kernel] Optimize grouped topk kernel — performance,ready,deepseek,model-bash — by xyang16 (合并于: 2026-02-20 17:34 (UTC+8)) [💬2 | +643/-100, 3 files | commented:7 approved:1] ## Purpose
This PR adds a grouped topk kernel optimized for small expert counts (num_experts<=512 for a single group, or num_experts<=256 for multiple groups), which covers DeepSeek (num_experts=256, n_group=8), Kimi-K2.5 (num_experts=384, n_group=1), Nemotron (num_experts=512, n_group=1) etc.
For large expert counts, fall back to the existing `grouped_topk_fused_kernel`.
## Test Plan
``` pytest -s -v tests/kernels/moe/test_grouped_topk.py …
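For readers unfamiliar with grouped top-k routing, here is a pure-Python reference of the general scheme (a simplified sketch; real implementations score groups by the sum of the top-2 experts or similar, and return weights as well):

```python
def grouped_topk(scores, n_group, topk_group, topk):
    """Grouped top-k expert selection, simplified:
    1. split the experts into n_group equal groups;
    2. keep the topk_group groups with the highest per-group max score;
    3. take the global top-k experts among the kept groups.
    Returns the selected expert indices in ascending order."""
    group_size = len(scores) // n_group
    groups = [scores[g * group_size:(g + 1) * group_size]
              for g in range(n_group)]
    kept = sorted(range(n_group), key=lambda g: max(groups[g]),
                  reverse=True)[:topk_group]
    candidates = [(scores[i], i) for g in kept
                  for i in range(g * group_size, (g + 1) * group_size)]
    return sorted(i for _, i in sorted(candidates, reverse=True)[:topk])
```

The group sizes in the PR (num_experts=256 with n_group=8, etc.) determine how much of this fits in registers, which is what the small-expert-count specialization exploits.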
- #32877 [Bugfix][Hardware][AMD] Fix ROCM_AITER_FA speculative decoding support — bug,rocm,ready,v1 — by c0de128 (合并于: 2026-02-20 14:25 (UTC+8)) [💬3 | +35/-2, 1 files | commented:1 approved:1]
## Summary
- Fix ROCM_AITER_FA attention backend to support speculative decoding (multi-token decode)
- The decode path was hardcoding `max_seqlen_q=1`, causing incorrect results with speculative decoding
## Changes
- Extract actual `max_query_len` from decode metadata
- Route multi-token decode queries to `unified_attention` instead of `paged_attention_v1`
- Use actual `max_seqlen_q` instead of hardcoded `1`
- Update assertion message to reflect both sliding window and speculative decoding con…
-
#34916 [Minor] Add logging when using MXFP4 MXFP8 TRTLLM backend — ready — by frankwang28 (合并于: 2026-02-20 14:07 (UTC+8)) [+3/-0, 1 files | commented:2 approved:1]
## Purpose When setting `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1`, it was noticed that there was no logging that MXFP4 MXFP8 TRTLLM was the selected backend. Other backends such as `SM100_FI_MXFP4_BF16` and `SM100_FI_MXFP4_MXFP8_CUTLASS` do perform that logging: ``` VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 vllm serve openai/gpt-oss-20b … (EngineCore_DP0 pid=1494838) INFO 02-19 20:06:21 [gpu_model_runner.py:4128] Starting to load model openai/gpt-oss-20b… (En…
#34921 [Models] LFM2: Support LoRA — ready — by tianshu-Michael-yu (合并于: 2026-02-20 14:07 (UTC+8)) [+36/-16, 2 files | commented:3 approved:1] ## Summary This PR enables LoRA for LFM2 and LFM2-MoE models.
## Changes
- Rename fused MLP projection module from `w1` to `w13` in `vllm/model_executor/models/lfm2.py` and `vllm/model_executor/models/lfm2_moe.py`.
- Update `packed_modules_mapping` from `w1 -> [w1, w3]` to `w13 -> [w1, w3]`.
- Add `hf_to_vllm_mapper = WeightsMapper({'.conv.': '.short_conv.'})` so HF LoRA target names map correctly to vLLM modules.
- Fix stacked weight-name matching to segment-boundary matching (`weight_name + ‘…
-
#34922 [ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout — rocm,ready — by micah-wil (合并于: 2026-02-20 13:37 (UTC+8)) [+2/-2, 2 files | commented:1 approved:1] We’ve noticed some flakiness in amd-ci related to tests timing out in Entrypoints Integration Test (API Server 1). Here is one example from a recent nightly build: https://buildkite.com/vllm/amd-ci/builds/4972/summary?jid=019c6f8c-8d3b-47b7-977b-ab61ac8730ce&tab=output#019c6f8c-8d3b-47b7-977b-ab61ac8730ce/L6775
In this example, the following test times out after the 240 second server init timeout defined in RemoteOpenAIServer:
``` pytest -v -s entrypoints/openai/…
-
#33677 [Misc] Add deprecated environment variable utilities — ready — by carlory (合并于: 2026-02-20 13:33 (UTC+8)) [💬3 | +65/-0, 1 files | commented:2 approved:1] ## Purpose
Add general-purpose utilities for handling deprecated environment variables with deprecation warnings. These functions can be reused across the codebase when deprecating environment variables in favor of CLI arguments or config options.
This addresses the suggestion from @hmellor in PR #33536 to add general versions of the removed `_get_from_env_if_set` and `_set_from_env_if_set` methods to `utils.py` for reuse in future deprecations. **New functions added to `vllm/utils/system_uti…
-
#33739 [CI][AMD][BugFix][P/D] Add default_vllm_config to test_moriio_connector.py so tests pass — bug,rocm,ready,v1,kv-connector — by rasmith (合并于: 2026-02-20 13:32 (UTC+8)) [💬8 | +8/-5, 1 files | commented:3 approved:2] ## Purpose
`test_moriio_handshake_returns_metadata` and `test_register_kv_caches` do not make use of the `set_vllm_config` fixture, and both result in this error when run: ``` def get_current_vllm_config() -> VllmConfig: if _current_vllm_config is None:
raise AssertionError( "Current vLLM config is not set. This typically means " "get_current_vllm_config() was called outside of a " ... -
#34072 Add validation to reject non-text content in system messages — frontend,ready — by veeceey (合并于: 2026-02-20 13:30 (UTC+8)) [💬12 | +187/-1, 2 files | commented:7 approved:3] ## Summary Fixes #33925
According to OpenAI API specification, system messages can only contain text content. This PR adds validation to warn on non-text content types (images, audio, video) in system messages.
## Problem vLLM was accepting images and other multimodal content in system messages without any indication, which deviates from the OpenAI API specification. System messages should only su…
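A sketch of the kind of check this PR adds (hypothetical helper name; the only assumption is the OpenAI-style content-part dicts):

```python
def find_non_text_parts(content) -> list[str]:
    """Return the types of any non-text parts in a chat message's content.

    `content` follows the OpenAI chat format: either a plain string or a
    list of parts such as {"type": "text", "text": "..."}. Hypothetical
    helper for illustration only.
    """
    if isinstance(content, str):
        return []  # a bare string is always text
    return [part.get("type", "?") for part in content
            if part.get("type") != "text"]
```

A server could warn (or reject, per the stricter follow-up #33981) whenever this list is non-empty for a system message.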
-
#34260 [Model Bash]: Improve FP8 Oracle for Config Specific Kernel Selection — ready — by elizabetht (合并于: 2026-02-20 13:29 (UTC+8)) [💬5 | +46/-13, 1 files | commented:3 approved:1] On Hopper (SM90), FLASHINFER_CUTLASS was auto-selected as the FP8 MoE backend over DEEPGEMM due to having higher priority in the selection order. FlashInfer kernels are primarily optimized for Blackwell and do not work well with features like chunked prefill on Hopper.
This change makes the backend priority order architecture-aware:
- Blackwell (SM100+): FlashInfer backends preferred over DeepGEMM
- Hopper and older: DeepGEMM preferred over FlashInfer backends
## Purpose Fix [34249](https://g…
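The architecture-aware ordering can be sketched as a small selection function (backend names and the SM-version threshold are illustrative of the description above, not vLLM's actual enums):

```python
def fp8_moe_backend_priority(sm_version: int) -> list[str]:
    """Return FP8 MoE backend candidates in preference order.

    Illustrative sketch: Blackwell (SM100+) prefers FlashInfer backends,
    while Hopper (SM90) and older prefer DeepGEMM.
    """
    if sm_version >= 100:
        return ["FLASHINFER_CUTLASS", "DEEPGEMM"]
    return ["DEEPGEMM", "FLASHINFER_CUTLASS"]
```

Keying the priority list on the compute capability avoids picking a backend that is poorly optimized for the running GPU.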
-
#34335 [Bugfix] Add regression test for MoE quant_config under torch.compile — bug,ready — by mgehre-amd (合并于: 2026-02-20 13:27 (UTC+8)) [💬4 | +23/-0, 1 files | commented:2 approved:2] Update: The fix landed separately via #34371 and this PR only adds the regression test.
## Summary
After the MoE Refactor (#32344), w4a16 models fail with
`AssertionError: Hidden size mismatch 2048 != 1024` under torch.compile. This is because `ensure_moe_quant_config_init()` is called in `FusedMoE.forward_native()`. When torch.compile is active, `forward_native` is traced by Dynamo, but the side effect of setting `self.quant_method.moe_quant_config` (an attribute mutation) is not replayed at…
- #34386 [Quark] Fix MoE fp8 activation scale handling on mi300 — ready — by BowenBao (合并于: 2026-02-20 13:27 (UTC+8)) [💬1 | +3/-3, 1 files | commented:2 approved:1] Small fix follow-up after #29008 for running gpt-oss on mi300. Also ensure the fp8 scale conversion only runs when activation is fp8 quantized.
- #34915 [ci] Use the right tag for CPU arm64 image — ci/build — by khluu (合并于: 2026-02-20 11:59 (UTC+8)) [+3/-3, 1 files | commented:1 approved:1] During the migration, the tag was changed by mistake; this fixes it back to the correct tag.
-
#34794 [Refactor] Implement output type check in LLM — frontend,ready,v1 — by DarkLight1337 (合并于: 2026-02-20 11:57 (UTC+8)) [💬4 | +58/-38, 2 files | commented:2 approved:1]
## Purpose
`LLMEngine.validate_outputs` actually does nothing. Apply the output type check inside `LLM` instead.
## Test Plan
## Test Result
…
[关闭未合并 PR]
-
#34962 [Build] Bump FlashInfer version from v0.6.3 to v0.6.4 — ci/build,nvidia — by vadiklyutiy (关闭于: 2026-02-21 10:18 (UTC+8)) [💬1 | +5/-5, 4 files | commented:1 | 📝草稿] ## Summary
Promote FlashInfer dependency from v0.6.3 to v0.6.4 across all build artifacts.
The main goal is to use the GDN decode kernel for Qwen3.5.
-
#24388 [gpt-oss]Support lazy init mcp session — frontend,needs-rebase,stale,gpt-oss — by wuhang2014 (关闭于: 2026-02-21 10:16 (UTC+8)) [💬14 | +99/-62, 3 files | commented:2 changes:1] ## Purpose
Currently, we initialize all MCP sessions specified by the user request before the generation process, regardless of whether the tools are actually used. This PR implements lazy initialization for MCP sessions: the major change is to initialize a session with the tool server only right before its tool is called.
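The lazy-initialization idea can be sketched with a generic wrapper (a pattern illustration, not the PR's actual MCP session code):

```python
class LazySession:
    """Open the underlying session only when a tool is first called."""

    def __init__(self, connect):
        self._connect = connect  # factory that opens the real session
        self._session = None

    def call_tool(self, name: str, args: dict):
        if self._session is None:  # first tool call triggers the connect
            self._session = self._connect()
        return self._session(name, args)
```

A request that never calls a tool never pays the connection cost, which is exactly the behavior the PR aims for.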
## Test Plan With a Python MCP tool server, test 2 requests with a tool of `code_interpreter` type in background mode using the Responses API. One request does not use the tool while anothe… -
#24796 [Core] Set process titles early in init for better debugging — tpu,needs-rebase,stale,v1 — by Mohit-Gaur (关闭于: 2026-02-21 10:16 (UTC+8)) [💬5 | +942/-71, 6 files | commented:1 changes:1]
## Purpose Fixes #24476 - Assign worker process titles and logging prefix earlier
-
#35003 Fix: [Bug]: FlashInfer attn-fp4 fused kernel performs w — bug — by darshjme-codes (关闭于: 2026-02-21 09:38 (UTC+8)) [💬4 | +50/-0, 1 files | commented:1] The fused attention quant kernel for fp4 was observed to be slower than the unfused version. The fix disables the fusion pass when fp4 quantization is active, ensuring the higher‑performance unfused kernel is used instead. This resolves the performance regression without altering the underlying kernel implementation.
Fixes #34988
Changes:
vllm/flash_attn/flash_attn_interface.py: Disabled the fused attention quantization pass when fp4 is enabled, preventing the slower fused kernel…
-
#35000 Simplify output assignment condition — v1 — by zhuohan123 (关闭于: 2026-02-21 07:29 (UTC+8)) [💬1 | +1/-3, 1 files | commented:1] ## Purpose Refactor conditional check for output assignment.
Essential Elements of an Effective PR Description Checklist
- [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [ ] The test plan, such as providing test command.
- [ ] The test results, such as pasting the results comparison before and after, or e2e results
... -
#26278 [1/N] Elastic EP Milestone 2 — rocm,ready,ci/build,v1,cpu,nvidia — by libertyeagle (关闭于: 2026-02-21 07:10 (UTC+8)) [💬32 | +3448/-964, 42 files | commented:10] ## Purpose
PR #20775 introduces initial support of elastic expert parallelism. This PR adds further optimizations towards Milestone 2 in #20323. Key features include:
- Break down the scale up/down logic into a state machine of multiple stages, with their execution controlled in `vllm/distributed/elastic_ep/elastic_state.py` and `vllm/distributed/elastic_ep/elastic_execute.py`.
- Newly started workers receive all weights (non-MoE modules and expert weights) from peer GPUs.
- We no longer need t…
-
#34980 [not for land] Change VLLM_USE_MEGA_AOT_ARTIFACT default to '1' — ready — by zhxchen17 (关闭于: 2026-02-21 05:16 (UTC+8)) [+1/-1, 1 files | commented:1] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#29214 [LMCache KVConnector] Integrate LMCache observability to vLLM’s KV connector metrics — kv-connector — by aeon-x (关闭于: 2026-02-21 05:08 (UTC+8)) [💬6 | +272/-0, 1 files | commented:5 approved:1] ## Purpose Fix issue https://github.com/LMCache/LMCache/issues/1914
### Integration test by Grafana
### Implementation
…
-
#34821 [Bugfix] Make CUDA compat library loading opt-in to fix consumer GPUs — bug,ci/build,nvidia — by 88plug (关闭于: 2026-02-21 04:33 (UTC+8)) [💬7 | +274/-4, 3 files | commented:1] ## Purpose
Fix CUDA forward compatibility library loading that causes Error 803 (`CUDA_ERROR_SYSTEM_DRIVER_MISMATCH`) in Docker containers. The persistent `cuda-compat.conf` in `/etc/ld.so.conf.d/` unconditionally loads the container's CUDA compat libs, which shadow the host-mounted driver when the host has a newer CUDA version than the container's toolkit (e.g., host driver 580.x with CUDA 13.0 support, container built with CUDA 12.x). This was originally reported on an NVIDIA B200 (datacente…
-
#34168 [DRAFT][Feature] implement online data capture/generation for eagle3 — speculative-decoding,v1 — by harryzorus (关闭于: 2026-02-21 02:06 (UTC+8)) [💬3 | +3393/-1, 17 files] PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
## Purpose
vLLM today has the excellent speculators library that implements offline generation of logits and hidden states for training EAGLE heads from scratch using a custom offline worker using the…
-
#32628 feat: add parallel token drafting for EAGLE — documentation,speculative-decoding,needs-rebase,v1,llama — by hai-meh-cs (关闭于: 2026-02-21 01:46 (UTC+8)) [💬11 | +1335/-16, 9 files | commented:9 changes:1] ## Purpose
PTD (Parallel Token Decoding) generates K draft tokens in a single forward pass instead of K sequential passes, reducing draft overhead for EAGLE speculative decoding.
```bash
vllm serve openai/gpt-oss-120b \
  --speculative-config '{"model": "", "method": "eagle3-ptd", "num_speculative_tokens": 4}' \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  ...
```
-
#34671 Update max_num_tokens value when specdec is enabled — ready,v1 — by shaharmor98 (关闭于: 2026-02-21 01:15 (UTC+8)) [💬4 | +16/-6, 1 files | commented:2 approved:2] ## Purpose
Following the Unified Parallel Drafting PR, the
`max_num_tokens` field in `gpu_model_runner.py` hasn't been adjusted.
When speculative decoding is enabled (draft model or Eagle), the scheduler increases `max_num_batched_tokens` to account for speculative tokens, but `GPUModelRunner.max_num_tokens` was not reflecting this adjustment. This could cause mismatches between what the scheduler sends and what the model runner expects. …
#34966 fix(cli): prevent --served-model-name from consuming positional model argument — frontend — by timon0305 (关闭于: 2026-02-21 00:39 (UTC+8)) [💬2 | +202/-3, 5 files | commented:1] ## Purpose
Fix #34326
When running `vllm serve --served-model-name ASR Qwen/Qwen3-ASR-1.7B`, the `--served-model-name` flag (which uses `nargs='+'`) greedily consumes both `ASR` and `Qwen/Qwen3-ASR-1.7B` as its values, leaving the positional `model_tag` unset. This causes `args.model` to fall back to the default `Qwen/Qwen3-0.6B`, resulting in:
- The banner showing the wrong model name
- The `/v1/audio/transcriptions` endpoint disappearing (wrong model capabilities detected)
### Changes
…
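The greedy-consumption behavior is reproducible with plain argparse (a minimal repro of the pitfall, not vLLM's actual CLI definition):

```python
import argparse

parser = argparse.ArgumentParser()
# Optional positional, mirroring how the model tag can be omitted.
parser.add_argument("model_tag", nargs="?", default=None)
# nargs='+' makes the flag consume every following non-flag token.
parser.add_argument("--served-model-name", nargs="+")

args = parser.parse_args(["--served-model-name", "ASR", "Qwen/Qwen3-ASR-1.7B"])
print(args.served_model_name)  # ['ASR', 'Qwen/Qwen3-ASR-1.7B']
print(args.model_tag)          # None: the positional was swallowed
```

Because both tokens follow the flag and neither starts with `-`, argparse assigns them all to `--served-model-name` and the positional falls through to its default.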
-
#28453 [Feature]: Implement naive prepare/finalize class to replace naive dispatching in fused_moe/layer.py — needs-rebase — by WorldExplored (关闭于: 2026-02-21 00:18 (UTC+8)) [💬12 | +301/-65, 9 files | commented:4 changes:1] Reference Issue: https://github.com/vllm-project/vllm/issues/28236
Add a FusedMoEPrepareAndFinalize subclass.
cc @bnellnm
-
#34938 [Bug] Use FlashAttention for NemotronH — bug — by benchislett (关闭于: 2026-02-20 23:32 (UTC+8)) [💬1 | +11/-1, 1 files | commented:1] ## Purpose
For some time, we have been observing accuracy issues with FlashInfer attention for NemotronH models (Super internal checkpoints), and use FlashAttention for all evals. While we investigate this issue, this temporary fix defaults an unspecified backend to Flash Attention for NemotronH on Blackwell just to be safe.
-
#34949 [Frontend] Merge developer instructions into tool block for Harmony prompt rendering — frontend,gpt-oss — by beom115 (关闭于: 2026-02-20 19:58 (UTC+8)) [💬2 | +12/-5, 1 files | commented:1]
## Purpose
Current Harmony chat implementation renders the system/developer instructions and tool definitions in two separate
`<|start|>developer` blocks. However, models that follow the OpenAI `gpt-oss-20b/120b` chat template (based on the chat_template.jinja) expect a single developer block containing both `# Instructions` and `# Tools` sections. This PR modifies `_make_request…
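A minimal sketch of the single-developer-block layout described above (the renderer function is hypothetical; vLLM's actual Harmony rendering code differs):

```python
def render_developer_block(instructions: str, tool_sections: list[str]) -> str:
    """Render one developer block holding both Instructions and Tools.

    Hypothetical renderer for illustration, using Harmony-style
    <|start|>/<|message|>/<|end|> delimiters.
    """
    lines = ["<|start|>developer<|message|># Instructions", instructions]
    if tool_sections:
        lines.append("# Tools")
        lines.extend(tool_sections)
    lines.append("<|end|>")
    return "\n".join(lines)
```

The key property is that a prompt rendered this way contains exactly one developer block, matching what the chat template expects.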
-
#30669 [Doc] Add AI Badgr framework integration documentation — documentation — by cartersaundersx (关闭于: 2026-02-20 19:56 (UTC+8)) [💬4 | +83/-0, 1 files | commented:1] ## Purpose
Add documentation for AI Badgr, an OpenAI-compatible LLM provider, in the framework integrations section. AI Badgr uses tier-based model naming (basic/normal/premium) and can work with vLLM as a backend or be accessed as a hosted service.
## Test Plan
```bash
# Verify markdown linting
pre-commit run --hook-stage manual markdownlint --files docs/deployment/frameworks/aibadgr.md
```
…
-
#34623 [Bugfix] Fix `tests/compile/correctness_e2e/test_sequence_parallel.py::test_tp_sp_generation` — bug — by NickLucche (关闭于: 2026-02-20 19:05 (UTC+8)) [💬1 | +20/-12, 2 files | commented:1 | 📝草稿] Fix `tests/compile/correctness_e2e/test_sequence_parallel.py::test_tp_sp_generation` following up on https://github.com/vllm-project/vllm/pull/34523 changes. Interpreting the original PR's intent, I am trying to address quirks of handling `--enforce-eager` + `compilation-config` in the same start command. cc @ProExpertProg
-
#34937 [Bugfix]: Qwen3 reasoning parser now handles `<think>` in prompt prefix — bug,qwen — by babyplutokurt (关闭于: 2026-02-20 15:17 (UTC+8)) [💬4 | +131/-22, 2 files | commented:1] [Bugfix]: Qwen3 reasoning parser now handles `<think>` in prompt prefix. Fixes #21130
## Purpose
Newer Qwen3 thinking models (e.g.,
`Qwen3-4B-Thinking-2507`) support only thinking mode: `enable_thinking=True` is always in effect. Their default chat template automatically appends `<think>` to the assistant turn prefix inside the prompt. As a result, the model's generated output starts directly with reasoning text and ends with `</think>` followed by content; the opening `<think>` tag does not appear… -
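The parsing rule can be sketched as follows (a simplified illustration of the behavior described above, not the actual Qwen3 reasoning parser): since the chat template already placed `<think>` in the prompt, generation starts inside the reasoning block, so everything before `</think>` is reasoning even though no opening tag appears in the output.

```python
def split_reasoning(output: str) -> tuple[str, str]:
    """Split model output into (reasoning, content), assuming the opening
    <think> tag was already emitted by the chat template in the prompt."""
    reasoning, sep, content = output.partition("</think>")
    if not sep:
        return output, ""  # no closing tag yet: everything is reasoning
    return reasoning, content.lstrip()
```

During streaming, the no-closing-tag branch keeps classifying partial output as reasoning until `</think>` arrives.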
#33981 fix: reject non-text content in system/developer messages — frontend — by veeceey (关闭于: 2026-02-20 14:34 (UTC+8)) [💬10 | +230/-0, 2 files | commented:6] ## Summary
Fixes #33925
Per the OpenAI API specification,
`system` and `developer` role messages should only accept `text` content type. Previously, vLLM allowed multimodal content (e.g. `image_url`, `input_audio`, `video_url`) in system messages without any validation, which diverges from the OpenAI API behavior.
### Changes
`vllm/entrypoints/chat_utils.py`: Added a `_validate_text_only_c…