[vLLM GitHub Development Digest] 2026-02-20
[Overview]
- Time window: 2026-02-20 11:15 (UTC+8) ~ 2026-02-21 11:15 (UTC+8)
- New issues: 18 (label distribution: ci-failure:8, bug:4, RFC:3, feature request:2, rocm:2)
- Closed issues: 41
- New PRs: 49 (label distribution: v1:13, ready:13, nvidia:9, bug:9, ci/build:7)
- Merged PRs: 27
- PRs closed without merging: 20
[New issues]
- #34994 [Feature]: Infrastructure Improvements for ROCm CI — feature request,rocm — by AndreasKaratzas (created: 2026-02-21 06:35 (UTC+8)) [💬1]
## General
- Enable `grade: Blocking` for all tests on the amd-ci pipeline
- Mirror all tests on the upstream CI pipeline
- Tailor the "V1 Test attention" test group to MI325 and MI355 respectively
## Distributed Tests
### Distributed Tests (2 GPUs)
- Remove `TORCH_NCCL_BLOCKING_WAIT=1` when HIP bug ROCm/hip#3876 is fixed in a future ROCm release. …
- Enable …
- #34988 [Bug]: FlashInfer attn-fp4 fused kernel performs worse than unfused — bug,performance,torch.compile,needs reproduction,nvidia — by ProExpertProg (created: 2026-02-21 05:29 (UTC+8)) [💬1]
### Your current environment
The output of `python collect_env.py` (template placeholder, not filled in) …
- #34995 [CI] Maverick model QKV weight shape mismatch during load_weights with expert parallelism — ci-failure — by LucasWilkinson (created: 2026-02-21 06:36 (UTC+8))
## Name of failing test
models/multimodal/generation/test_maverick.py::test_dummy_maverick[2-True-False-meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8-4-4-2]
models/multimodal/generation/test_maverick.py::test_dummy_maverick[2-True-True-meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8-4-4-2]
## Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries …
- #35004 [RFC]: Realtime Endpoint Metrics — RFC — by pougetat (created: 2026-02-21 08:40 (UTC+8))
### Motivation.
With the recent pre-release of v0.16.0, vLLM now exposes a /v1/realtime endpoint that allows clients to progressively submit prompts without having to make separate HTTP requests to the /chat/completions endpoint.
This endpoint requires clients to establish a websocket connection to vLLM to stream the prompts over.
The /chat/completions HTTP endpoint currently benefits from the out-of-the-box metrics provided by the Prometheus client that part of the stack relies on.
No such metrics are…
- #34993 [CI] GDN attention backend assertion failure with MTP speculative decoding — ci-failure — by LucasWilkinson (created: 2026-02-21 06:29 (UTC+8)) [💬2]
## Name of failing test
evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[Qwen3-Next-80B-A3B-NVFP4-EP2]
## Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries
…
- #34948 [Bug]: Qwen3.5 CUDA Illegal Memory Access — bug — by kimbochen (created: 2026-02-20 16:48 (UTC+8)) [💬5]
### Your current environment
vLLM Nightly on H200
PyTorch version: 2.10.0+cu130, CUDA used to build PyTorch: 13.0, CUDA runtime version: 13.0.88, Python version: 3.12.12, vLLM Version: 0.16.0rc2.dev294+g6a7b85d94 (git sha: 6a7b85d94) …
- #34954 [Bug]: Triton Error [CUDA]: out of memory when received query — bug — by kwonmha (created: 2026-02-20 21:46 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py`: System Info: OS: Ubuntu 22.04.4 LTS (x86_64) ...
- #34996 [Feature][CI]: `test_functionalization.py` should compare outputs — feature request — by Rohan138 (created: 2026-02-21 06:38 (UTC+8))
### 🚀 The feature, motivation and pitch
https://github.com/vllm-project/vllm/blob/main/tests/compile/passes/test_functionalization.py should compare the test outputs from the functionalized model and the defunctionalized model, which are expected to be identical. I tried doing this as part of #33443, but ran into issues with the RMSNorm tests, e.g. `TestFusedAddRMSNorm` having significantly different output values. This is probably a good first issue; can we have it marked as such for a communi…
- #34989 [Bug]: Qwen3-Next-FP8 failure — bug — by tdoublep (created: 2026-02-21 05:30 (UTC+8)) [💬1]
### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info ...
- #34969 [CI Failure]: LM Eval Small Models (B200) - DeepSeek and Qwen3 Next — ci-failure — by mgoin (created: 2026-02-21 00:54 (UTC+8))
### Name of failing test
evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[DeepSeek-V2-Lite-Instruct-FP8]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34943 [CI Failure]: AMD Samplers Test (mi325_1) — rocm,ci-failure — by varun-sundar-rabindranath (created: 2026-02-20 13:48 (UTC+8)) [💬2]
### Name of failing test
distributed/test_ca_buffer_sharing.py, distributed/test_multiproc_executor.py, distributed/test_torchrun_example.py, distributed/test_torchrun_example_moe.py, kernels/helion/test_utils.py, plugins_tests/test_stats_logger_plugins.py, v1/kv_connector/nixl_integration/test_edge_cases.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34963 [RFC]: Adding Support for Single Batch Overlap (SBO) With FlashInfer DeepEP LL NVFP4 — RFC — by elvircrn (created: 2026-02-20 23:54 (UTC+8))
# Background
DeepEP's combine_v2 kernel supports overlapping the combine all-to-all with GEMM2 by polling per-expert completion signals instead of waiting for all experts to finish before sending. SGLang implements this, but vLLM does not.
# DeepEP Support
https://github.com/fzyzcjy/DeepEP/tree/gb200_blog_part_2/ (as per https://lmsys.org/blog/2025-09-25-gb200-part-2/) implements combine_v2, but https://github.com/elvircrn/DeepEP/tree/gb200_blog might be more convenient to use on CUDA 13 as …
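The polling idea behind #34963 can be sketched in plain Python threads: rather than waiting for all experts to finish before combining, a consumer polls per-expert completion flags and combines each expert's output as soon as it is ready. Everything here (`expert_gemm`, `combine_overlapped`, the flag layout) is an illustrative stand-in, not DeepEP or vLLM code.

```python
import threading
import time

# Per-expert completion signals and result slots; a stand-in for the
# per-expert flags that combine_v2 polls instead of a global barrier.
NUM_EXPERTS = 4
done = [threading.Event() for _ in range(NUM_EXPERTS)]
outputs = [None] * NUM_EXPERTS

def expert_gemm(i: int) -> None:
    time.sleep(0.01 * (NUM_EXPERTS - i))  # experts finish at different times
    outputs[i] = i * 10                    # stand-in for a GEMM result
    done[i].set()                          # publish completion signal

def combine_overlapped() -> list:
    combined, pending = [], set(range(NUM_EXPERTS))
    while pending:                         # poll per-expert signals
        for i in list(pending):
            if done[i].is_set():
                combined.append(outputs[i])
                pending.discard(i)
    return combined

threads = [threading.Thread(target=expert_gemm, args=(i,)) for i in range(NUM_EXPERTS)]
for t in threads:
    t.start()
result = combine_overlapped()  # runs while experts are still finishing
for t in threads:
    t.join()
print(sorted(result))  # → [0, 10, 20, 30]
```

The combine step starts consuming early finishers immediately, which is the overlap the RFC wants between the all-to-all and GEMM2.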
- #34956 [RFC]: Overlap weight loading with torch.compile compilation — RFC,torch.compile — by zou3519 (created: 2026-02-20 22:30 (UTC+8)) [💬1]
### Motivation.
torch.compile cold starts take some time - O(10s - 1m). We could hide much of the compilation time behind weight loading.
NB: this doesn't mean that we won't make torch.compile faster; in general it's rare that there is a situation where it is possible to overlap compilation with something else.
### Proposed Change.
Here's an initial design:
- develop a way for vLLM to load a model using FakeTensors (e.g., Tensors with no storage) and to run the torch.compile over it…
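The overlap proposed in #34956 can be sketched with a background thread: compilation (which only needs shapes, hence FakeTensors) runs concurrently with weight loading, so wall time approaches the max of the two costs rather than their sum. `compile_model` and `load_weights` below are placeholders, not vLLM APIs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compile_model() -> str:
    time.sleep(0.2)  # stand-in for torch.compile warm-up on shape-only FakeTensors
    return "compiled"

def load_weights() -> str:
    time.sleep(0.2)  # stand-in for reading checkpoint shards from disk
    return "loaded"

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as pool:
    compile_future = pool.submit(compile_model)  # compilation in the background
    weights = load_weights()                     # weight loading in the foreground
    compiled = compile_future.result()
elapsed = time.perf_counter() - start

# Overlapped, wall time is close to max(0.2, 0.2) rather than 0.2 + 0.2.
print(f"{weights} {compiled} in {elapsed:.2f}s")
```

The real design needs FakeTensors precisely because compilation must start before the real weights exist in memory.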
- #34947 Feature: Audit-grade request logging for EU AI Act compliance (Article 12) — no labels — by desiorac (created: 2026-02-20 16:05 (UTC+8))
## Context
The EU AI Act (Regulation 2024/1689) requires high-risk AI systems to have automatic logging capabilities (Article 12) that record events throughout the system's operation for traceability purposes. For LLM serving infrastructure like vLLM, this translates to structured, audit-grade request logging.
## Current State
vLLM provides excellent performance metrics and basic request logging via its OpenAI-compatible API server. However, the current logging is focused on **operational…
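A minimal sketch of what one structured, audit-grade (JSON-lines) request record could look like; the field names are illustrative only, not a vLLM or Article 12 schema. Hashing the prompt keeps the record traceable without retaining user content verbatim.

```python
import hashlib
import json
import time

def audit_record(request_id: str, model: str, prompt: str, status: str) -> str:
    """Serialize one request event as a JSON-lines audit record (hypothetical schema)."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "request_id": request_id,
        "model": model,
        # Store a digest, not the prompt itself, for traceability without retention.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "status": status,
    }
    return json.dumps(record, sort_keys=True)

line = audit_record("req-001", "facebook/opt-125m", "hello", "completed")
print(line)
```

Appending such lines to an append-only file is one common way to get the tamper-evident event trail Article 12-style logging asks for.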
- #34945 [CI Failure]: V1 Others : test_custom_logitsprocs[CustomLogitprocSource.LOGITPROC_SOURCE_ENTRYPOINT] — ci-failure — by varun-sundar-rabindranath (created: 2026-02-20 14:01 (UTC+8))
### Name of failing test
v1/logits_processors/test_custom_offline.py::test_custom_logitsprocs[CustomLogitprocSource.LOGITPROC_SOURCE_ENTRYPOINT]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34942 [CI Failure]: Intel HPU Test - examples/offline_inference/basic/generate.py — ci-failure — by varun-sundar-rabindranath (created: 2026-02-20 13:38 (UTC+8))
### Name of failing test
examples/offline_inference/basic/generate.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34941 [CI Failure]: Intel GPU Test : examples/offline_inference/basic/generate.py — ci-failure — by varun-sundar-rabindranath (created: 2026-02-20 13:28 (UTC+8))
### Name of failing test
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34939 [CI Failure]: V1 e2e + engine : Cannot re-initialize CUDA in forked subprocess — ci-failure — by varun-sundar-rabindranath (created: 2026-02-20 13:11 (UTC+8))
### Name of failing test
v1/e2e/test_kv_sharing_fast_prefill.py::test_kv_sharing_fast_prefill[*, *], v1/e2e/test_mamba_prefix_cache.py::test_mamba_prefix_cache
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
[Closed issues]
- #14289 [Feature]: Chat inputs to AsyncLLMEngine — feature request,stale — by sfc-gh-mkrubinski (closed: 2026-02-21 10:17 (UTC+8)) [💬7]
### 🚀 The feature, motivation and pitch
Currently, only the `LLM` class meant for offline inference supports the `chat` method. Are there any plans to implement a similar method for `AsyncLLMEngine`, besides the existing `generate`? Alternatively, is there any work on extending the `PromptType` acceptable by `generate` to include more prompt variants, such as chat conversations?
### Alternatives
No response
…
- #22008 [Installation]: with latest vllm source code installation done, but failed to run vllm server — installation,stale — by gaowayne (closed: 2026-02-21 10:17 (UTC+8)) [💬16]
### Your current environment
The output of `python collect_env.py`; the server hangs now
### How you are installing vllm
…
- #22704 [Bug]: Using GPT-OSS with the streaming option degrades the quality of responses. — bug,stale,gpt-oss — by ashgold (closed: 2026-02-21 10:16 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py`: System Info: OS: Ubuntu 22.04.5 LTS (x86_64) ...
- #26490 [Bug]: `PPLXAll2AllManager` fails to init on pplx-kernels latest — bug,stale — by wseaton (closed: 2026-02-21 10:16 (UTC+8)) [💬3]
### 🐛 Describe the bug
The pplx-kernels all2all backend fails to initialize when using the latest pplx-kernels on main due to an interface change following the nvshmem4py integration: https://github.com/vllm-project/vllm/blob/main/vllm/distributed/device_communicators/all2all.py#L158-L162
Last working commit: https://github.com/perplexityai/pplx-kernels/blob/12cecfda252e4e646417ac263d96e994d476ee5d/src/pplx_kernels/nvshmem.py
(APIServer pid=317) (EngineCore_DP7 pid=606) self._target(…
- #26636 [Bug]: AsyncHttpClient incorrectly decodes URLs via aiohttp, breaking signed URLs (e.g., S3) — bug,stale — by 1994 (closed: 2026-02-21 10:16 (UTC+8)) [💬3]
### Your current environment
The output of `python collect_env.py` (template placeholder). vLLM Version: 0.9.2 …
- #27004 [New Model]: support tencent/Hunyuan-MT-Chimera-7B-fp8 — new-model,stale — by xinliu9451 (closed: 2026-02-21 10:16 (UTC+8)) [💬3]
### The model to consider.
When will support for the tencent/Hunyuan-MT-Chimera-7B-fp8 model be available? Thank you very much.
### The closest model vllm already supports.
No response
### What's your difficulty of supporting the model you want?
…
- #27179 [Bug]: KVConnectorLogging initialization fails in single-process mode with LMCache due to has_kv_transfer_group assertion error — bug,stale,kv-connector — by sakunkun (closed: 2026-02-21 10:16 (UTC+8)) [💬5]
### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info ...
- #27222 [Performance][Qwen3-next] Decrease huge CPU overhead — performance,stale — by vadiklyutiy (closed: 2026-02-21 10:16 (UTC+8)) [💬6]
### Proposal to improve performance
Qwen3-next has substantial CPU overhead even with pretty big batch sizes (more than 512, the upper bound for cudagraph size).
I took batch size 1024 for demonstration, using `vllm bench serve --backend vllm --model Qwen/Qwen3-Next-80B-A3B-Instruct --endpoint /v1/completions --dataset-name random --random-input 32 --random-output 1024 --max-concurrency 1024 --num-prompt 1024 --ignore-eos`, where most of the time is spent decoding 1024 requests in 1…
- #27233 gguf run good — usage,stale — by kmnnmk212-source (closed: 2026-02-21 10:16 (UTC+8)) [💬14]
### Your current environment
from vllm import LLM, SamplingParams
gguf_path = "/home/m/Desktop/vllm/vllm/examples/offline_inference/basic/Qwen3-1.7B-GGUF/Qwen3-1.7B-Q6_K.gguf"
llm = LLM(gguf_path, tokenizer="Qwen/Qwen3-1.7B") …
- #27239 [Feature]: Heterogeneous TP per Pipeline Stage (uneven TP across PP) — feature request,stale — by Mankeerat (closed: 2026-02-21 10:16 (UTC+8)) [💬3]
### 🚀 The feature, motivation and pitch
Goal: Enable different tensor-parallel (TP) sizes per pipeline-parallel (PP) stage so mixed accelerators can run in one pipeline (e.g., per-stage TP = 4,1,2,1). Behavior remains unchanged unless explicitly enabled.
Motivation: A lot of available hardware is heterogeneous (spot pools, older nodes, new accelerators). Today vLLM assumes uniform TP across all PP stages, which limits deployment on mixed GPU fleets. Supporting per-stage TP would let users combi…
- #27251 [Bug]: Deploying the model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF using docker vllm/vllm-openai:v0.10.2 and vllm/vllm-openai:v0.11.0 failed. — bug,stale — by suntao2015005848 (closed: 2026-02-21 10:16 (UTC+8)) [💬2]
### Your current environment
NVIDIA: 8 * NVIDIA L20; docker: vllm/vllm-openai:v0.10.2 and vllm/vllm-openai:v0.11.0 …
- #27252 [Usage]: Does the `@app.post("/generate")` API support qwen2_vl or not? — usage,stale — by wwkww (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### Your current environment
I want to know whether the `@app.post("/generate")` API supports qwen2_vl or not.
### How would you like to use vllm
I want to run inference of a specific model. I don't know how to integrate it with vllm.
…
- #27268 [Usage]: failed to infer device type on GCP COS despite nvidia container toolkit installed — usage,stale — by forrestbao (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### Your current environment
I failed to run this script on GCP COS.
### How would you like to use vllm
I was trying to use vLLM on a Google Cloud (GCP) Container-Optimized OS (COS) instance via Docker.
I followed GCP's documentation to install the nvidia driver, including mapping nvidia driver-related dirs to the Docker container. All tests worked fine.
…
- #27278 args.hf_split is overridden even when set, causing some datasets to not actually be supported — stale — by Delaunay (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
https://github.com/vllm-project/vllm/blob/aa1356ec53a65a79a0027e8c265c76d84de8c046/vllm/benchmarks/datasets.py#L1680
Using the dataset openslr/librispeech_asr raises: ValueError: Bad split: train. Available splits: ['test.clean', 'test.other', 'train.clean.100', 'train.clean.360', 'train.other.500', 'validation.clean', 'validation.other']
- #27287 [Bug]: Speculative Decoding Issue with VLLM_ENABLE_V1_MULTIPROCESSING=0 — bug,stale — by shadowpa0327 (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py`: Collecting environment information... uv is set. System Info ...
- #27315 [Feature]: Does this model support fine-tuning? — feature request,stale — by yaoyun1 (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
Does this model support fine-tuning?
### Alternatives
No response
### Additional context
…
- #27325 [Bug]: When spec_tokens count is greater than 1, the adaptation of cuda_graph_sizes causes the decoding process to fall back to eager mode. — bug,stale — by zouyida2052 (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py` (template placeholder) …
- #27333 [Bug]: When I run qwen3-vl-30b-a3b-fp8, it only generates part of the content — bug,stale — by bigbro13 (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info ...
- #27356 [Feature]: Support usage of server_url as a tool parameter to dynamically connect to MCP servers — feature request,stale — by mandos21 (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
To fully flesh out the functionality of the responses API and tool calling, I think that this should be supported as well. More info in the additional context section.
### Alternatives
No response
### Additional context
…
- #27358 [Feature]: Support for custom node order of pipeline stages. — feature request,stale — by JorgenTrondsen (closed: 2026-02-21 10:15 (UTC+8)) [💬2]
### 🚀 The feature, motivation and pitch
I am doing a master's thesis using vllm with ray. I need control over which nodes receive which parts of a pipeline stage (partitioned layers).
### Alternatives
No response
### Additional context
…
- #34995 [CI] Maverick model QKV weight shape mismatch during load_weights with expert parallelism — ci-failure — by LucasWilkinson (closed: 2026-02-21 09:19 (UTC+8))
## Name of failing test
models/multimodal/generation/test_maverick.py::test_dummy_maverick[2-True-False-meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8-4-4-2]
models/multimodal/generation/test_maverick.py::test_dummy_maverick[2-True-True-meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8-4-4-2]
## Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries …
- #34234 [Feature]: Enable CUDA graph capture for Eagle speculator prefill — feature request — by HamzaElshafie (closed: 2026-02-21 08:46 (UTC+8)) [💬1]
## 🚀 The feature, motivation and pitch
Eagle's speculator currently runs prefill in eager mode, while decode can use CUDA graphs.
There is an explicit TODO in the codebase to support CUDA graphs for the prefill path. I'm interested in working on this as my first contribution and wanted to check alignment before starting any implementation.
Currently, the Eagle speculator contains the following TODO:
- Location: vllm/v1/worker/gpu/spec_decode/eagle.py (around line 229) - **Current state…
- #34989 [Bug]: Qwen3-Next-FP8 failure — bug — by tdoublep (closed: 2026-02-21 05:31 (UTC+8)) [💬1]
### Your current environment
The output of `python collect_env.py`: Collecting environment information... System Info ...
- #34969 [CI Failure]: LM Eval Small Models (B200) - DeepSeek and Qwen3 Next — ci-failure — by mgoin (closed: 2026-02-21 05:25 (UTC+8))
### Name of failing test
evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[DeepSeek-V2-Lite-Instruct-FP8]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34929 [Feature]: Get latency and tpot per request in vLLM benchmark — feature request — by snova-rodrigom (closed: 2026-02-21 05:15 (UTC+8)) [💬1]
### 🚀 The feature, motivation and pitch
Today TTFT and ITLs are provided, but I think some users would prefer to have the end-to-end latency of the request for a more realistic number in their environments. Also, TPOT per request should give a better idea of the performance of a particular request. It's cool to have it for the whole run, but per-request numbers bring a more granular perspective and can help debugging.
### Alternatives
I haven't seen options to get the requested…
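One common definition of the per-request TPOT asked for in #34929 is (end-to-end latency - TTFT) / (output tokens - 1), i.e. the average time per output token after the first. A hypothetical sketch (the function name is illustrative, not a vLLM benchmark API):

```python
def request_e2e_tpot(ttft_s: float, e2e_latency_s: float, num_output_tokens: int) -> float:
    """Per-request time per output token after the first, in seconds."""
    if num_output_tokens < 2:
        return 0.0  # TPOT is undefined with fewer than two output tokens
    return (e2e_latency_s - ttft_s) / (num_output_tokens - 1)

# Example: 0.5 s to first token, 2.5 s total, 101 output tokens
# → (2.5 - 0.5) / 100 = 0.02 s per output token.
print(request_e2e_tpot(0.5, 2.5, 101))
```

Computing this per request, rather than only as a run-wide aggregate, is exactly the granularity the issue argues helps debugging.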
- #29534 [CI Failure]: mi325_8: LoRA Test %N — ci-failure — by AndreasKaratzas (closed: 2026-02-21 02:41 (UTC+8)) [💬10]
### Name of failing test
pytest -v -s lora --shard-id=$BUILDKITE_PARALLEL_JOB --num-shards=$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_llm_with_multi_loras.py --ignore=lora/test_olmoe_tp.py --ignore=lora/test_deepseekv2_tp.py --ignore=lora/test_gptoss_tp.py --ignore=lora/test_qwen3moe_tp.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #32219 [RFC]: Add Helion integration in vLLM — RFC — by gmagogsfm (closed: 2026-02-21 03:09 (UTC+8)) [💬5]
### Motivation.
(with significant input from @zou3519, @ProExpertProg, @mgoin, @xiaohongchen1991)
Helion is PyTorch's latest innovation in authoring custom kernels, featuring simple and familiar syntax, good developer experience, and superior performance.
This RFC proposes a developer-friendly framework for integrating Helion kernels into vLLM, making custom ops in vLLM more efficient, enjoyable to write, and performant in production.
The proposed integration is [prototyped here](https://gi…
- #29511 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 1 — ci-failure — by AndreasKaratzas (closed: 2026-02-21 02:42 (UTC+8)) [💬14]
### Name of failing test
pip install git+https://github.com/TIGER-AI-Lab/Mantis.git && pytest -v -s models/multimodal -m 'not core_model' --ignore models/multimodal/generation/test_common.py --ignore models/multimodal/processing
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29521 [CI Failure]: mi325_1: Samplers Test — ci-failure — by AndreasKaratzas (closed: 2026-02-21 02:40 (UTC+8)) [💬12]
### Name of failing test
pytest -v -s samplers && VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #33809 [CI Failure]: Kernels MoE Test %N — ci-failure — by AndreasKaratzas (closed: 2026-02-21 02:40 (UTC+8)) [💬7]
### Name of failing test
pytest -v -s kernels/moe
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29463 [CI Failure]: mi325_1: Language Models Tests (Standard) — ci-failure — by AndreasKaratzas (closed: 2026-02-21 02:39 (UTC+8)) [💬10]
### Name of failing test
pytest -v -s models/language -m 'core_model and (not slow_test)'
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34943 [CI Failure]: AMD Samplers Test (mi325_1) — rocm,ci-failure — by varun-sundar-rabindranath (closed: 2026-02-21 01:18 (UTC+8)) [💬2]
### Name of failing test
distributed/test_ca_buffer_sharing.py, distributed/test_multiproc_executor.py, distributed/test_torchrun_example.py, distributed/test_torchrun_example_moe.py, kernels/helion/test_utils.py, plugins_tests/test_stats_logger_plugins.py, v1/kv_connector/nixl_integration/test_edge_cases.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) — ci-failure — by AndreasKaratzas (closed: 2026-02-21 01:17 (UTC+8)) [💬16]
### Name of failing test
pytest -v -s entrypoints/openai/test_collective_rpc.py; pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/; pytest -v -s entrypoints/test_chat_utils.py
### Basic information
- Fla…
- #34619 [Bug]: Qwen3.5 illegal memory access — bug — by vadiklyutiy (closed: 2026-02-20 21:54 (UTC+8)) [💬10]
### Your current environment
The output of `python collect_env.py`: uv is set. System Info ...
- #31625 [Feature][ROCm][AITER]: Speculative Decoding Accuracy Issue with VLLM_ATTENTION_BACKEND=ROCM_AITER_FA — bug,feature request,rocm — by vllmellm (closed: 2026-02-20 14:25 (UTC+8)) [💬2]
### Your current environment
rocm/vllm-dev:nightly
### 🐛 Describe the bug
## Problem Description
When running speculative decoding using VLLM_ATTENTION_BACKEND=ROCM_AITER_FA, the model produces extremely poor accuracy results on the GSM8K benchmark.
### Command Used …
- #34935 [Usage]: TypeError: '>' not supported between instances of 'str' and 'int' — usage — by lottopotato (closed: 2026-02-20 13:32 (UTC+8)) [💬3]
If you're encountering a TypeError: '>' not supported between instances of 'str' and 'int' with the `transformers` tokenizer in `vllm.chat()`, this issue likely stems from `transformers` versions 5.x. I'm not sure this change was applied consistently, but the tokenizer's `return_dict` argument was changed to `True` by default in these versions.
- [Tokenizer] Change default value of return_dict to True in doc string for…
- #33925 [Bug]: OpenAI API: system message accepts image — bug — by jonoillar (closed: 2026-02-20 13:30 (UTC+8))
### Your current environment
The output of `python collect_env.py` (template placeholder) …
- #34249 [Bug]: [Fp8] [MoE] 'FLASHINFER_CUTLASS' is auto-selected as MoE backend instead of 'DEEPGEMM' on hopper — bug — by cjackal (closed: 2026-02-20 13:29 (UTC+8)) [💬4]
### Your current environment
The output of `python collect_env.py` (template placeholder) …
- #34819 [CI Failure]: models/test_initialization.py::test_can_initialize_small_subset[Llama4ForConditionalGeneration] — ci-failure — by ilmarkov (closed: 2026-02-20 11:51 (UTC+8)) [💬1]
### Name of failing test
models/test_initialization.py::test_can_initialize_small_subset[Llama4ForConditionalGeneration]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34814 [CI Failure]: models/test_initialization.py::test_can_initialize_large_subset[InternS1ProForConditionalGeneration] — ci-failure — by ilmarkov (closed: 2026-02-20 11:51 (UTC+8)) [💬1]
### Name of failing test
models/test_initialization.py::test_can_initialize_large_subset[InternS1ProForConditionalGeneration]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #34810 [CI Failure]: models/test_initialization.py::test_can_initialize_large_subset[H2OVLChatModel] — ci-failure — by ilmarkov (closed: 2026-02-20 11:51 (UTC+8)) [💬1]
### Name of failing test
models/test_initialization.py::test_can_initialize_large_subset[H2OVLChatModel]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
[New PRs]
- #34946 Remove redundant uniform=False in mixed_mode PIECEWISE branch — v1,nvidia — by oops-oom (created: 2026-02-20 15:01 (UTC+8)) [💬1 | +1/-1, 1 file | commented:1]
In `initialize_cudagraph_keys`, the `mixed_mode()` branch always calls `_create_padded_batch_descriptor` with `uniform_decode=False`. This is by design — mixed mode handles general (non-uniform) batches, so `uniform` is guaranteed to be `False` in the returned `BatchDescriptor`.
Therefore, the subsequent `replace(batch_desc, num_reqs=None, uniform=False)` in the PIECEWISE sub-branch redundantly sets `uniform=False`. This PR simplifies it to `replace(batch_desc, num_reqs=None)`.
- #34962 [Build] Bump FlashInfer version from v0.6.3 to v0.6.4 — ci/build,nvidia — by vadiklyutiy (created: 2026-02-20 23:53 (UTC+8)) [💬1 | +5/-5, 4 files | commented:1 | 📝 draft]
## Summary
Promote the FlashInfer dependency from v0.6.3 to v0.6.4 across all build artifacts.
The main goal is to use the GDN decode kernel for Qwen3.5.
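The `dataclasses.replace` behavior behind #34946 above is easy to demonstrate: fields not named keep their current values, so re-passing a field with its existing value is a no-op. The `BatchDescriptor` below is a stand-in with illustrative fields, not vLLM's actual class:

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class BatchDescriptor:  # illustrative stand-in, not vLLM's BatchDescriptor
    num_reqs: Optional[int]
    uniform: bool

desc = BatchDescriptor(num_reqs=8, uniform=False)

# `replace` keeps any field you don't name, so when `uniform` is already
# False, passing uniform=False changes nothing:
a = replace(desc, num_reqs=None, uniform=False)
b = replace(desc, num_reqs=None)  # the simplified form from the PR
print(a == b)  # → True
```

Since mixed mode always produces `uniform=False` descriptors, the two calls are provably equivalent, which is why the PR can drop the argument.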
- #35003 Fix: [Bug]: FlashInfer attn-fp4 fused kernel performs w… — bug — by darshjme-codes (created: 2026-02-21 07:36 (UTC+8)) [💬4 | +50/-0, 1 file | commented:1]
The fused attention quant kernel for fp4 was observed to be slower than the unfused version. The fix disables the fusion pass when fp4 quantization is active, ensuring the higher-performance unfused kernel is used instead. This resolves the performance regression without altering the underlying kernel implementation.
Fixes #34988
Changes:
vllm/flash_attn/flash_attn_interface.py: Disabled the fused attention quantization pass when fp4 is enabled, preventing the slower fused kernel…
- #34997 Revert "[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers" — ready,ci-failure,llama — by LucasWilkinson (created: 2026-02-21 06:41 (UTC+8)) [💬1 | +68/-29, 1 file | commented:1 approved:1]
Reverts vllm-project/vllm#34471 to fix CI failures: https://github.com/vllm-project/vllm/issues/34995
FIX: https://github.com/vllm-project/vllm/issues/34995
cc @eldarkurtic
- #35002 Fixing matcher to enable the 2-node-tests-4-gpus-in-total on MI355 — ci/build — by Alexei-V-Ivanov-AMD (created: 2026-02-21 07:36 (UTC+8)) [💬1 | +1/-1, 1 file | commented:1]
Fixing the matcher to enable the 2-node-tests-4-gpus-in-total job on MI355.
- #34999 [CI] Fix `tests/evals/gsm8k/test_gsm8k_correctness.py` for `Qwen3-Next-80B-A3B-NVFP4-EP2` — v1,qwen — by LucasWilkinson (created: 2026-02-21 07:08 (UTC+8)) [💬2 | +0/-6, 1 file | commented:1]
FIX: https://github.com/vllm-project/vllm/issues/34993
https://github.com/vllm-project/vllm/pull/34077 broke `pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py -k "Qwen3-Next-80B-A3B-NVFP4-EP2" --config-list-file=tests/evals/gsm8k/configs/models-blackwell.txt` with what appears to be an overly conservative assert.
Test Plan: pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py -k "Qwen3-Next-80B-A3B-NVFP4-EP2" --config-list-file=tests/evals/gsm8k/configs/models-blackwell.txt …
- #34977 [Mamba][APC] Add test case to compare apc outputs — no labels — by divakar-amd (created: 2026-02-21 02:21 (UTC+8)) [💬2 | +55/-0, 1 file | commented:1]
(Merge after https://github.com/vllm-project/vllm/pull/34798)
- Added `test_same_mamba_output_apc_on_vs_off` to `test_hybrid.py` to check that output texts are identical whether prefix caching is enabled or not for the Mamba model.
https://github.com/vllm-project/vllm/pull/34798 fixes a bug in `csrc/mamba/mamba_ssm/selective_scan_fwd.cu` where index calculation was incorrect with prefix caching enabled for mamba, leading to incorrect outputs. This PR builds upon https://github.com/…
- #34978 fix: DeepSeek-R1 structured-output reasoning end detection (scheduler + parser) — structured-output,v1,deepseek — by nbethala (created: 2026-02-21 02:32 (UTC+8)) [💬3 | +33/-18, 3 files | commented:1]
UPDATE (Feb 20): Rewrote the summary for clarity. The PR fixes a signature mismatch in StructuredOutputManager.should_advance() that prevented correct advancement after reasoning ended.
Summary
This PR completes DeepSeek-R1 structured-output integration by fixing a signature mismatch in StructuredOutputManager.should_advance() that prevented correct advancement after reasoning ended.
Problem: DeepSeek-R1 must detect the `</think>` token to transition from free-form reasoning into JSON-const…
- #34987 [CI Bugfix] Add pytest.mark.flaky to tests/v1/e2e/test_spec_decode.py — bug,ready,v1,ci-failure — by mgoin (created: 2026-02-21 05:23 (UTC+8)) [💬1 | +4/-0, 1 file | commented:1 approved:1]
## Purpose
The hope is that this will help with the flaky failures of "V1 e2e + engine", since I identified these tests are often the culprit. They really should be reworked, but when I ran them locally the results really did vary from run to run.
FIX https://github.com/vllm-project/vllm/issues/34617 https://github.com/vllm-project/vllm/issues/34797
https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&period=7days&query=tests%2Fv1%2Fe2e%2Ftest_spec_decode.p…
-
#35001 feat(model): add embed_sparse task for BGE-M3 server-side sparse aggr… — 无标签 — by joeqzzuo (创建于: 2026-02-21 07:33 (UTC+8)) [💬3 | +180/-5, 3 files | commented:1] …egation
## Purpose
Add
embed_sparsepooling task forBgeM3EmbeddingModelto enable server-side sparse vector aggregation.Currently, BGE-M3 sparse retrieval requires a cumbersome 2-step client workflow:
- Call
/tokenizeto get token IDs …
- Call
-
#35000 Simplify output assignment condition — v1 — by zhuohan123 (创建于: 2026-02-21 07:28 (UTC+8)) [💬1 | +1/-3, 1 files | commented:1] ## Purpose Refactor conditional check for output assignment.
-
#34990 [CI] Skip Responses API — ready,ci-failure — by robertgshaw2-redhat (创建于: 2026-02-21 05:56 (UTC+8)) [💬2 | +4/-0, 1 files | commented:1 approved:2]
## Purpose
- Skip Responses API tests, which are flaky.
- They need further investigation, but this will help stabilize the CI.
- #34998 [ROCm] Check that AITER MHA is not selected with sinks — rocm — by gshtras (创建于: 2026-02-21 07:05 (UTC+8)) [💬1 | +13/-2, 1 files | commented:1]
AITER MHA does not support sinks, so its selection is now gated accordingly.
This should unbreak
`VLLM_ROCM_USE_AITER=1 vllm serve openai/gpt-oss-120b` -
#34992 [RL] Validation for pause_mode='keep' — documentation,ready — by hao-aaron (创建于: 2026-02-21 06:02 (UTC+8)) [💬2 | +179/-105, 1 files | commented:6 approved:2] ## Purpose
Changed the example `rlhf_async_new_apis.py` to validate tokens generated after a weight update, by comparing them against tokens generated by a fresh vLLM engine loaded with the new weights.
relevant RFC: https://github.com/vllm-project/vllm/issues/32103
-
#34991 [Bugfix] ep_scatter kernel store-load race condition — bug — by ivanium (创建于: 2026-02-21 05:58 (UTC+8)) [💬4 | +7/-3, 1 files | commented:1] ## Purpose
Fix an intra-CTA data race in `_fwd_kernel_ep_scatter_1` that causes sporadic “Warp Illegal Address” crashes during DeepGEMM MoE permutation (observed on DeepSeek V3 PD decode, but possibly affects any EP path using this kernel).
### The bug
The kernel launches 256 blocks × 256 threads (8 warps). Each block computes an exclusive prefix sum and writes all 256 entries of `expert_start_loc` to global memory via a vector `tl.store`, then loads back one entry via `tl.load(expert_sta… -
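For context, the quantity the kernel computes before the racy load-back — `expert_start_loc` as an exclusive prefix sum over per-expert token counts — can be written as a plain-Python reference (illustrative only, not the Triton kernel):

```python
from itertools import accumulate

def expert_start_loc(tokens_per_expert: list[int]) -> list[int]:
    """Exclusive prefix sum: the start offset of each expert's token range.

    This is the array the ep_scatter kernel writes to global memory and then
    reads back entries from; the store-load pair is what raced inside a CTA.
    """
    return [0] + list(accumulate(tokens_per_expert))[:-1]
```

In the Triton kernel, the fix would amount to ensuring the store is visible before the load within the block; this reference only pins down the intended result.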
#34984 [Misc][LoRA] Add --lora-target-modules to restrict LoRA to specific modules — documentation — by bhoomit (创建于: 2026-02-21 03:47 (UTC+8)) [💬7 | +183/-2, 6 files | commented:1] ## Purpose
Add deployment-time control over which model modules have LoRA applied via a new `--lora-target-modules` CLI parameter and `LoRAConfig.target_modules` field. This accepts module suffixes (e.g., `o_proj`, `qkv_proj`) and restricts LoRA application to only those modules, useful for performance tuning. When not specified, all supported LoRA modules are used (existing behavior).
## Usage
```bash vllm serve model --enable-lora --lora-target-modules o_proj qkv_proj …
-
#34974 [Kernel] Optimize sample_recovered_tokens_kernel — performance,ready,v1 — by xyang16 (创建于: 2026-02-21 01:51 (UTC+8)) [💬1 | +178/-33, 2 files | commented:3 approved:1]
## Purpose
This PR optimizes the `sample_recovered_tokens_kernel` in the rejection sampler.
- Use tiled reduction over vocab. Instead of creating one massive `tl.arange(0, PADDED_VOCAB_SIZE)` vector and loading the whole vocab in one program, load it in chunks of BLOCK_SIZE. This gives lower register pressure and better occupancy.
- Replace the `score = prob / q` division with `score = prob * inv_q`, because division is slower than multiplication.
- Kernel level more than 5…
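The tiled-reduction idea can be sketched in plain Python (an illustrative analogy to the Triton change, not the kernel itself): scan a long score vector in fixed-size chunks while tracking a running best, which matches a single full pass.

```python
def argmax_tiled(scores, block_size):
    """Argmax over a long vector computed chunk by chunk.

    Analogy for the kernel change: instead of materializing the whole
    (padded) vocab at once, process it in BLOCK_SIZE tiles and keep a
    running maximum, which yields the same result.
    """
    best_idx, best_val = -1, float("-inf")
    for start in range(0, len(scores), block_size):
        for i, v in enumerate(scores[start:start + block_size], start):
            if v > best_val:  # strict '>' keeps the first occurrence
                best_idx, best_val = i, v
    return best_idx
```

The multiply-for-divide swap is the usual strength reduction: precompute `inv_q = 1.0 / q` once and use `prob * inv_q` in the hot loop instead of `prob / q`.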
- #34979 [CI] Revert PRs 34818 and 33600 — speculative-decoding,ready,v1,multi-modality,nvidia — by LucasWilkinson (创建于: 2026-02-21 02:34 (UTC+8)) [💬1 | +242/-294, 16 files | commented:1 approved:1] https://github.com/vllm-project/vllm/pull/33600 broke the basic models test, https://github.com/vllm-project/vllm/pull/34818 tried to fix it but caused accuracy issues for ``` FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[DeepSeek-V2-Lite-Instruct-FP8] - AssertionError: GSM8K metric too low: 0.0114 < 0.7200 - 0.0800 = 0.6400 assert np.float64(0.011372251705837756) >= (0.72 - 0.08) FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[Qwen3-Next-80B-A3B-NVFP4…
-
#34980 [not for land] Change VLLM_USE_MEGA_AOT_ARTIFACT default to '1' — ready — by zhxchen17 (创建于: 2026-02-21 03:05 (UTC+8)) [+1/-1, 1 files | commented:1] ## Purpose
## Test Plan
## Test Result
-
#34986 Trtllm gen full attn test coverage — v1,nvidia — by ojhaanshika (创建于: 2026-02-21 04:27 (UTC+8)) [💬2 | +510/-0, 2 files | commented:1 | 📝草稿] ## Purpose
Add test coverage for the TRTLLM gen-full attention pipeline in vLLM. The existing kernel-level tests (`test_flashinfer_trtllm_attention.py`) call the FlashInfer C++ kernels directly, bypassing the vLLM integration layer entirely. This leaves the decision logic, metadata construction, and forward dispatch at 0% coverage.
This PR adds two new test files:
- Unit tests for attention decision functions (`tests/kernels/attention/test_use_trtllm_attention.py`) — 22 tests covering al…
-
#34955 [Bugfix] kimi k2.5 tool call tokens leaking into reasoning_content — bug — by felixmr1 (创建于: 2026-02-20 22:19 (UTC+8)) [💬1 | +476/-3, 3 files | commented:2] ## Purpose
When Kimi K2.5 skips `</think>` and jumps directly to a tool call, the reasoning parser had no mechanism to stop routing subsequent tokens to `reasoning_content`. This caused raw tool call tokens to appear in the thinking block instead of being handled by the tool call parser.
This PR introduces `KimiK25ReasoningParser` as a thin subclass of `DeepSeekV3ReasoningWithThinkingParser` that:
- Forces `thinking=True` (Kimi K2.5 always reasons)
- Tracks a `_tool_call_started` flag
- Return…
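The flag-based routing described above can be sketched as a toy router (the marker string and class are illustrative; the real parser subclasses vLLM's reasoning parser and uses the model's actual tool-call tokens):

```python
TOOL_CALL_START = "<tool_call>"  # illustrative marker, not Kimi's real token

class ReasoningRouter:
    """Toy sketch of the fix: once a tool-call marker appears, stop routing
    tokens into reasoning_content and hand them to the tool side instead."""

    def __init__(self):
        self.reasoning_content = []
        self.tool_content = []
        self._tool_call_started = False

    def feed(self, token: str) -> None:
        if token == TOOL_CALL_START:
            self._tool_call_started = True
        if self._tool_call_started:
            self.tool_content.append(token)  # handled by the tool call parser
        else:
            self.reasoning_content.append(token)
```

Without the flag, every token after a skipped `</think>` would keep landing in `reasoning_content`, which is exactly the leak the PR fixes.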
-
#34983 [MoE Refactor] Convert mxfp4 moe quant method into oracle — ci/build,gpt-oss,nvidia — by zyongye (创建于: 2026-02-21 03:44 (UTC+8)) [💬1 | +1495/-930, 25 files | commented:1 | 📝草稿]
## Purpose Ongoing MXFP4 MoE refactor
This PR can be greatly improved once #32564 is merged.
## Test Plan gpt-oss-20b testing with various kernels. [WIP]
…
- #34985 [CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor — bug,rocm,nvidia — by rasmith (创建于: 2026-02-21 04:17 (UTC+8)) [+2/-0, 1 files | commented:1]
## Purpose
The tests in `test_punica_ops` can fail if a previous test ran and changed the device that kernels execute on (in particular Triton kernels) by using `torch.cuda.set_device`, e.g. `lora/test_layers.py`. This can cause subsequent tests to also fail, since the system seems to go into a bad state once this happens. ## Test Plan ``` pytest -v -s lora
--ignore=lora/test_chatglm3_tp.py
--ignore=lora/test_llama_tp.py
--ignore=lora/test_llm_with_mult… -
#34981 Add retry logic to LoRA adapter config loading — meta-exported,fb-exported — by mgrange1998 (创建于: 2026-02-21 03:11 (UTC+8)) [💬4 | +241/-2, 3 files | commented:1] Summary: In `PEFTHelper.from_local_dir()`, the non-tensorizer path does a bare `open() + json.load()` to read `adapter_config.json`. If the file hasn’t finished being written or downloaded (race condition with Manifold, HF Hub, or NFS), this raises `FileNotFoundError` and crashes the vLLM worker.
This diff adds a `_load_lora_config()` static method with exponential backoff retry (50ms → 1s cap, 15s default timeout). It catches `FileNotFoundError` (file not yet created) and `json.JSONDecodeError…
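A minimal sketch of that retry pattern (function name and parameters are illustrative; the PR's actual method lives on `PEFTHelper`):

```python
import json
import time

def load_config_with_retry(path, timeout=15.0, base_delay=0.05, max_delay=1.0):
    """Read a JSON config that may still be mid-write/mid-download.

    Retries FileNotFoundError (file not yet created) and JSONDecodeError
    (file partially written) with exponential backoff, capped at max_delay,
    until the timeout deadline is reached.
    """
    deadline = time.monotonic() + timeout
    delay = base_delay
    while True:
        try:
            with open(path) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            if time.monotonic() >= deadline:
                raise  # give up: surface the last error to the caller
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```

Re-raising the last exception at the deadline keeps the failure mode identical to the non-retrying code, just delayed.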
#34982 Add GlmOcrConfig for GLM-OCR model type recognition — meta-exported,fb-exported — by hujia177 (创建于: 2026-02-21 03:12 (UTC+8)) [💬2 | +94/-0, 3 files | commented:1] Summary: The GLM-OCR model (`zai-org/GLM-OCR`) uses `model_type: "glm_ocr"` in its config.json, which is not recognized by the bundled HuggingFace Transformers library (requires transformers >= 5.1.0). This causes a `ValidationError` during `ModelConfig` creation when vLLM tries to load the model via `AutoConfig.from_pretrained()`.
Add a custom `GlmOcrConfig` (extending `PretrainedConfig`) and `GlmOcrVisionConfig` to bridge this gap, following the same pattern as `DotsOCRConfig`. Register the c…
#34959 [Misc] Fix mypy errors in vllm/profiler and remove from exclude list — ready — by taneem-ibrahim (创建于: 2026-02-20 23:39 (UTC+8)) [+37/-30, 2 files | commented:1 approved:1]
## Purpose
- This PR removes vllm/profiler from the mypy EXCLUDE list and fixes all resulting type errors in vllm/profiler/layerwise_profile.py.
- Part of the ongoing effort in #26533 to enable mypy checking across the codebase.
## Test Plan
- Run the pre-commit hook: `pre-commit run --hook-stage manual mypy-3.10 -a`
## Test Result
- mypy checks pass cleanly except for the pre-existing nixl_connector check er…
-
#34965 [Test] Improve flashinfer cutlass moe test coverage — ci/build,nvidia — by jasonlizhengjian (创建于: 2026-02-21 00:20 (UTC+8)) [+453/-2, 4 files | commented:4]
## Purpose
Improve flashinfer cutlass moe test coverage. In particular, Hopper coverage was lacking. Also `fp8_block_scale` was not tested at all previously.
## Test Plan
``` …
-
#34972 Fix different embeddings produced by LLM and AsyncLLM — v1 — by guodongxiaren (创建于: 2026-02-21 01:27 (UTC+8)) [💬2 | +4/-0, 1 files | commented:1]
## Purpose fix issue: #33361
## Test Plan this code from issue: #33361 ```python import uuid
…
-
#34976 Allow _dummy_run to use _model_forward for hardware backends with compiled execution — v1,meta-exported,fb-exported — by jumpinghawk (创建于: 2026-02-21 02:17 (UTC+8)) [💬1 | +4/-1, 1 files | commented:1] Summary:
_dummy_run currently calls self.model(...) directly, bypassing the _model_forward() override hook. This means hardware backends that override _model_forward for compiled/graph-mode execution cannot control the execution path during dummy batch runs. This matters for DP with expert parallelism: execute_dummy_batch uses _dummy_run to keep ... -
#34975 dcp prefill -> non-dcp decode prototype — kv-connector — by aws-bowencc (创建于: 2026-02-21 02:00 (UTC+8)) [💬2 | +192/-95, 2 files | commented:1] # DCP Prefill → Non-DCP Decode: KV Cache Transfer Design
## Background
With Disaggregated Context Parallelism (DCP), the prefill node splits its TP workers into DCP groups that each process a portion of the input sequence. When `interleave_size == block_size`, consecutive blocks are assigned round-robin across DCP ranks within a TP group. The prefill scheduler inflates its logical block size by `dcp_size` (e.g. `block_size=16, dcp=2` → scheduler uses `block_size=32`), so a single scheduler blo… -
#34971 [Bugfix][Model] Fix Ultravox silent weight overwrite for diff checkpoints — bug — by AndriiPasternak31 (创建于: 2026-02-21 01:12 (UTC+8)) [💬1 | +43/-1, 1 files | commented:1] ## Summary
When serving a fine-tuned Ultravox diff checkpoint (one that contains only `audio_tower` and `multi_modal_projector` weights, not the full model), the fine-tuned weights are silently overwritten by base Whisper weights during model loading. This causes the model to produce degraded outputs with no error or warning.
### Root cause
`DefaultModelLoader.get_all_weights()` yields weights in order:
- Primary checkpoint weights (fine-tuned `audio_tower.*` + `multi_modal_proje…
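The load-order problem reduces to "weights seen first must win". A toy sketch of that rule (names illustrative; the actual loader yields tensors, not dict items):

```python
def iter_weights(primary: dict, fallback: dict):
    """Yield fine-tuned (primary) weights first, then only those fallback
    (base-model) weights whose names were not already provided.

    Sketch of the described fix: the fine-tuned audio_tower /
    multi_modal_projector weights are never overwritten by base weights.
    """
    seen = set()
    for name, w in primary.items():
        seen.add(name)
        yield name, w
    for name, w in fallback.items():
        if name not in seen:  # skip base weights that would clobber fine-tuned ones
            yield name, w
```

Under the buggy ordering the fallback weights came later and silently replaced the fine-tuned ones; tracking the already-yielded names makes the overwrite impossible.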
-
#34973 Enforce that `model` is the first positional arg when `--served-model-name` is used — 无标签 — by hmellor (创建于: 2026-02-21 01:37 (UTC+8)) [+14/-4, 1 files | commented:1] Closes https://github.com/vllm-project/vllm/issues/34326
Prevents users from passing `--served-model-name` without actually specifying the model. -
#34970 [Bugfix] Fix block_size mismatch for MLA models after #34818 — bug,ready,v1 — by mgoin (创建于: 2026-02-21 01:11 (UTC+8)) [+21/-5, 1 files | commented:1] ## Purpose
#34818 deferred block_size selection to after model loading, but `InputBatch` is created before model loading with a placeholder `block_size=16`. When `update_block_size_for_backend` later sets the real block_size (e.g. 32 for `FLASHINFER_MLA`), `may_reinitialize_input_batch` compared the kv_cache sizes against the already-updated `cache_config.block_size` (both 32) and skipped re-initialization. The `InputBatch` kept using `block_size=16` for slot mappings while the KV cache us… -
#34952 [Build] Fix DSV3_FUSED_A_GEMM_ARCHS to include SM120 on CUDA 13.0 — ci/build,nvidia — by aabbccddwasd (创建于: 2026-02-20 20:21 (UTC+8)) [💬1 | +2/-2, 1 files | commented:1] ## Summary
Fix `dsv3_fused_a_gemm` kernel linking failure on SM120 (Blackwell) GPUs with CUDA 13.0.
## Problem
When building vLLM on systems with:
- CUDA compiler version >= 13.0
- SM120 (Blackwell) GPUs (e.g., RTX PRO 6000 Blackwell)
…
-
#34968 fix(tool_parsers): remove kimi k2 8k section char limit that truncates large tool call arguments — frontend — by timon0305 (创建于: 2026-02-21 00:38 (UTC+8)) [+49/-37, 4 files | commented:1] ## Purpose
Fix #34442
The Kimi K2 tool parser had a hard `max_section_chars = 8192` limit that force-exited the tool section when total content exceeded 8 KiB. This silently truncated large tool-call arguments (e.g. when the model generates a source file via a tool call), making the parser unusable for coding use cases.
The parser also had a `buffer_max_size = 1024` limit for the marker-detection buffer that was too small and could flush content containing partial section markers during large…
#34966 fix(cli): prevent --served-model-name from consuming positional model argument — frontend — by timon0305 (创建于: 2026-02-21 00:28 (UTC+8)) [💬2 | +202/-3, 5 files | commented:1] ## Purpose
Fix #34326
When running `vllm serve --served-model-name ASR Qwen/Qwen3-ASR-1.7B`, the `--served-model-name` flag (which uses `nargs='+'`) greedily consumes both `ASR` and `Qwen/Qwen3-ASR-1.7B` as its values, leaving the positional `model_tag` unset. This causes `args.model` to fall back to the default `Qwen/Qwen3-0.6B`, resulting in:
- The banner showing the wrong model name
- The `/v1/audio/transcriptions` endpoint disappearing (wrong model capabilities detected)
### Changes
…
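The argparse pitfall behind this bug is easy to reproduce in isolation (a stripped-down parser, not vLLM's actual CLI): an option declared with `nargs='+'` greedily consumes every following non-option token, so the positional argument never gets a value.

```python
import argparse

# Minimal reproduction of the described pitfall: --served-model-name with
# nargs='+' swallows the trailing positional model argument.
parser = argparse.ArgumentParser()
parser.add_argument("model_tag", nargs="?", default=None)
parser.add_argument("--served-model-name", nargs="+")

# The user intends "ASR" as the served name and the Qwen path as the model,
# but both land in --served-model-name; model_tag stays at its default.
args = parser.parse_args(["--served-model-name", "ASR", "Qwen/Qwen3-ASR-1.7B"])
```

Putting the model first (`vllm serve MODEL --served-model-name ASR`) avoids the ambiguity, which is why a companion change enforces that ordering.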
-
#34967 Fix: Add shard validation for quantized models to prevent silent failures — 无标签 — by paipeline (创建于: 2026-02-21 00:34 (UTC+8)) [💬1 | +167/-0, 2 files | commented:1] Fixes: #34859
## Problem Loading quantized checkpoints with missing shards fails silently, which can lead to incorrect model initialization and inference results. This happens because the current validation logic only checks parameter completeness for non-quantized models.
## Solution
Added shard validation specifically for quantized models that:
- ✅ Validates safetensors shards: Reads the index file and ensures all referenced shard files exist
- ✅ Validates .pt/.bin files: Ensu…
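The safetensors part of that check can be sketched as follows (a simplified standalone version under the assumption of the standard HF index layout, `{"weight_map": {param_name: shard_file}}`; not the PR's exact code):

```python
import json
import os

def find_missing_shards(model_dir: str) -> list[str]:
    """Report shard files referenced by model.safetensors.index.json
    that are absent on disk, so a partial download fails loudly instead
    of silently loading an incomplete checkpoint."""
    index_path = os.path.join(model_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]
    referenced = sorted(set(weight_map.values()))
    return [s for s in referenced
            if not os.path.exists(os.path.join(model_dir, s))]
```

A non-empty return value would be turned into a hard error before model initialization.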
-
#34961 [CI] Bump mteb version to `mteb[bm25s]>=2, <3` for pooling model unit tests — rocm,ready,ci/build — by yewentao256 (创建于: 2026-02-20 23:45 (UTC+8)) [💬1 | +3/-3, 3 files | commented:2 approved:1] ## Purpose
This PR should be merged after https://github.com/vllm-project/vllm/pull/34925; it is kept separate so that, if any issue arises, the version bump can easily be reverted.
## Test
Covered by CI
CC: @DarkLight1337 @noooop
-
#34964 Fix LoRA adapter silently failing on Pixtral/Ministral-3 models — 无标签 — by timon0305 (创建于: 2026-02-21 00:10 (UTC+8)) [+39/-0, 3 files | commented:1] ## Purpose
Fix LoRA adapters silently having no effect when loaded on Pixtral-based models (e.g. `mistralai/Ministral-3-8B-Instruct-2512`).
The root cause is that `PixtralForConditionalGeneration` is missing the `hf_to_vllm_mapper` attribute that other multimodal models (like `Mistral3ForConditionalGeneration`) define. Without this mapper, LoRA weight names from HuggingFace checkpoints (e.g. `model.language_model.model.layers.…`) are not translated to vLLM’s internal names (e.g. `language_mod…
#34960 [Doc] Fix example of eagle3 — documentation — by petrpechman (创建于: 2026-02-20 23:44 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1] ## Purpose Update the example in the documentation to use the new speculative method `eagle3`.
## Test Plan No code changes were made.
## Test Result No code changes were made.
…
-
#34958 Ensure that MkDocs v2 does not get installed — ready,ci/build — by hmellor (创建于: 2026-02-20 23:12 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] We do this for compatibility reasons with the plugins that we use.
A more likely upgrade path is to Zensical, which is more actively maintained by the creators of MkDocs Material.
-
#34938 [Bug] Use FlashAttention for NemotronH — bug — by benchislett (创建于: 2026-02-20 12:32 (UTC+8)) [💬1 | +11/-1, 1 files | commented:1] ## Purpose
For some time, we have been observing accuracy issues with FlashInfer attention for NemotronH models (Super internal checkpoints), and use FlashAttention for all evals. While we investigate this issue, this temporary fix defaults an unspecified backend to Flash Attention for NemotronH on Blackwell just to be safe.
-
#34957 [Bugfix] Use getattr fallback for weight_loader in stacked-param branch — bug,speculative-decoding,llama,qwen,deepseek,gpt-oss — by wojciech-wais (创建于: 2026-02-20 23:05 (UTC+8)) [💬2 | +361/-177, 123 files | commented:1] Fixes #34201.
After sleep mode level 2 discards model weights and `reload_weights()` is called, fresh `nn.Parameter` objects may no longer carry the custom `weight_loader` attribute that is set during normal model initialisation. Model `load_weights` implementations accessed `param.weight_loader` with a bare attribute lookup in the stacked-parameter loading branch, causing: `AttributeError: 'Parameter' object has no attribute 'weight_loader'`
The `else` branch of the same functions already …
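The fix pattern itself is a one-line `getattr` fallback. A toy illustration (the `FakeParam` class and loader names are stand-ins, not vLLM's actual types):

```python
class FakeParam:
    """Stand-in for nn.Parameter; a fresh instance carries no weight_loader."""
    def __init__(self):
        self.data = None

def default_weight_loader(param, weight):
    """Fallback loader: plain copy (illustrative stand-in)."""
    param.data = weight

def load_stacked_weight(param, weight):
    # The fix in one line: fall back to the default loader when the fresh
    # parameter created after a sleep-mode reload lost its custom attribute.
    loader = getattr(param, "weight_loader", default_weight_loader)
    loader(param, weight)
```

Parameters that still carry a custom `weight_loader` behave exactly as before; only the previously-crashing case changes.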
#34950 [CI] Bump mypy version — v1,kv-connector,nvidia — by hmellor (创建于: 2026-02-20 19:47 (UTC+8)) [💬3 | +48/-25, 10 files | commented:3 approved:1 | 📝草稿]
- Bump `mypy` from 1.11.1 (Jul 30, 2024) to 1.15.0 (Feb 5, 2025) to get better checking and performance
- 1.13.0 added a `faster-cache` optional extra to improve caching performance. We add this in case it helps
- We don’t bump all the way because newer versions catch things that older versions missed. So we step a few versions at a time to keep these PRs smaller and more manageable
-
#34944 [V0 Deprecation] Remove unused MM placeholders in request output — ready — by DarkLight1337 (创建于: 2026-02-20 13:56 (UTC+8)) [+1/-5, 1 files | commented:1 approved:1]
## Purpose
This field was only used in V0 (#10407). We can remove this now.
## Test Plan
## Test Result
…
-
#34953 Fix kv_cache_dtype auto resolution with checkpoint quantization — 无标签 — by paipeline (创建于: 2026-02-20 20:35 (UTC+8)) [💬2 | +334/-1, 3 files | commented:1] Fixes issue #34752 - improve kv-cache-dtype behavior when checkpoint specifies kv_cache_quant_algo.
Problem: When using `--kv-cache-dtype auto` with models that specify kv_cache_quant_algo in their checkpoint, the system was falling back to model_config.dtype instead of using the checkpoint-specified FP8 quantization.
Solution: Modified kv_cache_dtype_str_to_dtype() to properly use resolve_kv_cache_dtype_string() when kv_cache_dtype is auto. This ensures checkpoint-specified quantization algorit…
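The resolution order being described can be condensed into a small sketch (function name, the `"FP8"` algo string, and the dtype strings are illustrative; vLLM's actual helpers are `kv_cache_dtype_str_to_dtype()` / `resolve_kv_cache_dtype_string()`):

```python
def resolve_kv_cache_dtype(kv_cache_dtype, checkpoint_quant_algo, model_dtype):
    """Sketch of the described precedence:
    1. an explicit --kv-cache-dtype always wins;
    2. under "auto", a checkpoint-declared kv_cache_quant_algo (e.g. FP8)
       is honored;
    3. otherwise fall back to the model dtype (the old buggy behavior
       skipped step 2)."""
    if kv_cache_dtype != "auto":
        return kv_cache_dtype
    if checkpoint_quant_algo == "FP8":
        return "fp8"
    return model_dtype
```

The bug was that step 2 never ran, so FP8-quantized checkpoints silently got a bf16/fp16 KV cache.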
-
#34951 [Frontend] Merge developer instructions into tool block for Harmony prompt rendering — frontend,gpt-oss — by beom115 (创建于: 2026-02-20 20:19 (UTC+8)) [+12/-3, 1 files | commented:1]
## Purpose
Current Harmony chat implementation renders the system/developer instructions and tool definitions in two separate `<|start|>developer` blocks. However, models that follow the OpenAI `gpt-oss-20b/120b` chat template (based on the chat_template.jinja) expect a single developer block containing both `# Instructions` and `# Tools` sections. This PR modifies `_make_request…
-
#34949 [Frontend] Merge developer instructions into tool block for Harmony prompt rendering — frontend,gpt-oss — by beom115 (创建于: 2026-02-20 19:39 (UTC+8)) [💬2 | +12/-5, 1 files | commented:1]
## Purpose
Current Harmony chat implementation renders the system/developer instructions and tool definitions in two separate `<|start|>developer` blocks. However, models that follow the OpenAI `gpt-oss-20b/120b` chat template (based on the chat_template.jinja) expect a single developer block containing both `# Instructions` and `# Tools` sections. This PR modifies `_make_request…
- #34940 [CI] Remove DBO xfail on Blackwell — ready,v1 — by LucasWilkinson (创建于: 2026-02-20 13:23 (UTC+8)) [+0/-19, 1 files | commented:1] It seems to be passing locally, so try removing the xfail.
[已合并 PR]
-
#34997 Revert “[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers” — ready,ci-failure,llama — by LucasWilkinson (合并于: 2026-02-21 09:19 (UTC+8)) [💬1 | +68/-29, 1 files | commented:1 approved:1] Reverts vllm-project/vllm#34471 to fix CI failures: https://github.com/vllm-project/vllm/issues/34995
FIX: https://github.com/vllm-project/vllm/issues/34995
cc @eldarkurtic
-
#34899 Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion — ready,ci/build,deepseek,nvidia,ready-run-all-tests — by wzhao18 (合并于: 2026-02-21 05:37 (UTC+8)) [💬8 | +6/-29, 5 files | commented:1 approved:1]
## Purpose The patch for fixing the Deepseek V3 accuracy issue with AR+rms+fp4 fusion (#34395) is included in flashinfer 0.6.4. This PR bumps flashinfer version and re-enables the fusion pass by default.
## Test Plan
serve nvidia/DeepSeek-V3.1-NVFP4 -tp=4 -cc.pass_config.fuse_allreduce_rms=True…
-
#34735 [AMD][CI] Fix test_custom_allreduce for A100 testgroup — rocm,ready — by rjrock (合并于: 2026-02-21 05:33 (UTC+8)) [💬2 | +2/-0, 1 files | commented:1 approved:1] ## Purpose To fix the `test_custom_allreduce` test for the AMD A100 testgroup.
For ROCm, ray modifies `HIP_VISIBLE_DEVICES` when allocating a GPU (`@ray.remote(num_gpus=1)`), so we need to remove `HIP_VISIBLE_DEVICES` instead of `CUDA_VISIBLE_DEVICES`.
## Test Plan
`pytest -v -s distributed/test_custom_all_reduce.py`
## Test Result Before …
- #34979 [CI] Revert PRs 34818 and 33600 — speculative-decoding,ready,v1,multi-modality,nvidia — by LucasWilkinson (合并于: 2026-02-21 05:25 (UTC+8)) [💬1 | +242/-294, 16 files | commented:1 approved:1] https://github.com/vllm-project/vllm/pull/33600 broke the basic models test, https://github.com/vllm-project/vllm/pull/34818 tried to fix it but caused accuracy issues for ``` FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[DeepSeek-V2-Lite-Instruct-FP8] - AssertionError: GSM8K metric too low: 0.0114 < 0.7200 - 0.0800 = 0.6400 assert np.float64(0.011372251705837756) >= (0.72 - 0.08) FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[Qwen3-Next-80B-A3B-NVFP4…
-
#34473 [Test] Add FP8 KV Cache Testing for MLA Backends — documentation,performance,new-model,rocm,structured-output,frontend,ready,ci/build,v1,multi-modality — by wzhao18 (合并于: 2026-02-21 02:51 (UTC+8)) [💬5 | +68/-27, 1 files | commented:2 approved:6]
## Purpose This PR improves the MLA backend test coverage to include fp8 kv cache testing.
## Test Plan Tested `tests/v1/attention/`
## Test Result
…
-
#34843 [CI] Remove failing prime-rl integration test — ready,ci/build — by mgoin (合并于: 2026-02-21 02:17 (UTC+8)) [+0/-106, 3 files | commented:1]
## Purpose
This test has been failing for several weeks, so let’s just remove it. We can add it back if someone wants to revive it.
## Test Plan
## Test Result
…
-
#34912 [compile] Fix torch.compile time discrepancy in logging. — ready — by zhxchen17 (合并于: 2026-02-21 00:47 (UTC+8)) [+17/-8, 3 files | commented:1 approved:1] Summary: Fixing the issue reported in https://github.com/vllm-project/vllm/issues/27815 .
The discrepancy occurs because we didn’t measure the end-to-end wall time elapsed between critical points of the program. Instead, in the old behavior we simply summed the time spent in dynamo and inductor, leaving everything else out of the metrics.
This caused confusion because users saw an inconsistency between the real time that passed between log lines and the time reported by vLLM: …
-
#34831 [compile] Move torch_aot_compile directory under torch_compile_cache — ready — by zhxchen17 (合并于: 2026-02-21 00:46 (UTC+8)) [+5/-4, 1 files | commented:1 approved:2] Summary:
Per discussion with @zou3519, people sometimes delete torch_compile_cache/ only and assume doing so invalidates all caches. It turns out to be confusing to have two directories managing orthogonal cache files on disk. This PR consolidates them into one.
Test Plan: CI
Reviewers: …
-
#34185 [Kernel] [Helion] [6/N] Add num_tokens dimension to silu_mul autotuning and dispatching — ready — by gmagogsfm (合并于: 2026-02-21 00:36 (UTC+8)) [💬2 | +55199/-200, 3 files | commented:4 approved:2] NOTE: this is a manually stacked PR, each commit is reviewed separately. For this PR, please only review the top commit: [Helion] Add num_tokens dimension to silu_mul_fp8 autotuning
Previous version of the silu_mul kernel was only tuned for a constant `num_tokens=256`, which underperforms for smaller `num_tokens` per @xiaohongchen1991’s study. This PR adds another dimension of `num_tokens` to the config key of silu_mul so that its autotuning and dispatching are now optimized for different values …
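Keying configs on a runtime dimension usually means bucketing it and looking up a pre-tuned entry per bucket. A toy dispatcher in that spirit (bucket boundaries and configs are invented for illustration; Helion's real autotuner works differently):

```python
import bisect

# Pre-tuned configs keyed by a num_tokens bucket instead of a single
# constant. Values here are made up; real ones come from autotuning.
BUCKETS = [64, 256, 1024]
CONFIGS = {64: {"block": 32}, 256: {"block": 128}, 1024: {"block": 256}}

def pick_config(num_tokens: int) -> dict:
    """Round num_tokens up to the nearest bucket (clamping at the largest)
    and return that bucket's tuned config."""
    i = bisect.bisect_left(BUCKETS, num_tokens)
    key = BUCKETS[min(i, len(BUCKETS) - 1)]
    return CONFIGS[key]
```

The point of the PR is that small-batch shapes stop paying for a config tuned at `num_tokens=256`.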
#34958 Ensure that MkDocs v2 does not get installed — ready,ci/build — by hmellor (合并于: 2026-02-20 23:38 (UTC+8)) [+1/-1, 1 files | approved:1 commented:1] We do this for compatibility reasons with the plugins that we use.
A more likely upgrade path is to Zensical, which is more actively maintained by the creators of MkDocs Material.
-
#34870 [perf] Avoid dtype promotion sync in mamba_get_block_table_tensor — ready,v1 — by hl475 (合并于: 2026-02-20 22:21 (UTC+8)) [💬2 | +6/-2, 1 files | commented:1 approved:1] ## Purpose
From the perf profiling on a model using linear attention, one thing that draws attention is an `aten::to` from somewhere in the vLLM gpu_model_runner preprocessing.
For this `aten::to`, we can tell it is converting with dtype=3 (int64)…
-
#34909 [Refactor] Extract Harmony streaming SSE event builders into streaming_events.py — frontend,ready,gpt-oss — by sfeng33 (合并于: 2026-02-20 22:20 (UTC+8)) [💬2 | +907/-874, 2 files | commented:4 approved:2]
## Purpose
`serving.py` is ~2800 lines. The `_emit_*` methods are pure functions (state + data → events) that don’t use instance state, yet they live on the class. Extracting them:
- Reduces `serving.py` by ~800 lines
- Makes event builders independently testable
- Prepares for `StreamingParsableContext` support — the upcoming streaming parsable implementation will reuse these same event builders instead of duplicating them (as `_process_simple_streaming_eve…
-
#34944 [V0 Deprecation] Remove unused MM placeholders in request output — ready — by DarkLight1337 (合并于: 2026-02-20 22:19 (UTC+8)) [+1/-5, 1 files | commented:1 approved:1]
## Purpose
This field was only used in V0 (#10407). We can remove this now.
## Test Plan
## Test Result
…
-
#34866 [BUGFIX] Fix `_dummy_run` missing `prepare_inputs_event` synchronization — bug,ready,v1 — by vadiklyutiy (合并于: 2026-02-20 21:54 (UTC+8)) [💬10 | +31/-26, 1 files | commented:6 approved:1] Fixes #34619
### Root cause
With async scheduling, `execute_model` protects shared pinned CPU buffers (`seq_lens`, `query_start_loc`, etc.) from reuse races via `prepare_inputs_event`: it synchronizes before overwriting them and records after enqueuing `non_blocking` H2D DMAs. `_dummy_run` — used by idle DP workers for EP coordination — writes to the same pinned buffers and enqueues the same H2D DMAs but never participates in this event protocol. Back-to-back dummy steps (or a dummy s… -
#34206 [Kernel] Optimize grouped topk kernel — performance,ready,deepseek,model-bash — by xyang16 (合并于: 2026-02-20 17:34 (UTC+8)) [💬2 | +643/-100, 3 files | commented:7 approved:1] ## Purpose
This PR adds a grouped topk kernel optimized for small expert counts (num_experts<=512 for a single group, or num_experts<=256 for multiple groups), which covers DeepSeek (num_experts=256, n_group=8), Kimi-K2.5 (num_experts=384, n_group=1), Nemotron (num_experts=512, n_group=1) etc.
For large expert counts, fall back to the existing `grouped_topk_fused_kernel`.
## Test Plan
``` pytest -s -v tests/kernels/moe/test_grouped_topk.py …
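For readers unfamiliar with grouped top-k routing, here is a pure-Python reference of the general scheme (a simplified sketch; real implementations score groups by the sum of the top-2 experts or similar, and return weights as well):

```python
def grouped_topk(scores, n_group, topk_group, topk):
    """Grouped top-k expert selection, simplified:
    1. split the experts into n_group equal groups;
    2. keep the topk_group groups with the highest per-group max score;
    3. take the global top-k experts among the kept groups.
    Returns the selected expert indices in ascending order."""
    group_size = len(scores) // n_group
    groups = [scores[g * group_size:(g + 1) * group_size]
              for g in range(n_group)]
    kept = sorted(range(n_group), key=lambda g: max(groups[g]),
                  reverse=True)[:topk_group]
    candidates = [(scores[i], i) for g in kept
                  for i in range(g * group_size, (g + 1) * group_size)]
    return sorted(i for _, i in sorted(candidates, reverse=True)[:topk])
```

The group sizes in the PR (num_experts=256 with n_group=8, etc.) determine how much of this fits in registers, which is what the small-expert-count specialization exploits.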
- #32877 [Bugfix][Hardware][AMD] Fix ROCM_AITER_FA speculative decoding support — bug,rocm,ready,v1 — by c0de128 (合并于: 2026-02-20 14:25 (UTC+8)) [💬3 | +35/-2, 1 files | commented:1 approved:1]
## Summary
- Fix ROCM_AITER_FA attention backend to support speculative decoding (multi-token decode)
- The decode path was hardcoding `max_seqlen_q=1`, causing incorrect results with speculative decoding
## Changes
- Extract actual `max_query_len` from decode metadata
- Route multi-token decode queries to `unified_attention` instead of `paged_attention_v1`
- Use actual `max_seqlen_q` instead of hardcoded `1`
- Update assertion message to reflect both sliding window and speculative decoding con…
-
#34916 [Minor] Add logging when using MXFP4 MXFP8 TRTLLM backend — ready — by frankwang28 (合并于: 2026-02-20 14:07 (UTC+8)) [+3/-0, 1 files | commented:2 approved:1]
## Purpose When setting `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1`, it was noticed that there was no logging that MXFP4 MXFP8 TRTLLM was the selected backend. Other backends such as `SM100_FI_MXFP4_BF16` and `SM100_FI_MXFP4_MXFP8_CUTLASS` do perform that logging: ``` VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 vllm serve openai/gpt-oss-20b … (EngineCore_DP0 pid=1494838) INFO 02-19 20:06:21 [gpu_model_runner.py:4128] Starting to load model openai/gpt-oss-20b… (En…
#34921 [Models] LFM2: Support LoRA — ready — by tianshu-Michael-yu (合并于: 2026-02-20 14:07 (UTC+8)) [+36/-16, 2 files | commented:3 approved:1] ## Summary This PR enables LoRA for LFM2 and LFM2-MoE models.
## Changes
- Rename fused MLP projection module from `w1` to `w13` in `vllm/model_executor/models/lfm2.py` and `vllm/model_executor/models/lfm2_moe.py`.
- Update `packed_modules_mapping` from `w1 -> [w1, w3]` to `w13 -> [w1, w3]`.
- Add `hf_to_vllm_mapper = WeightsMapper({'.conv.': '.short_conv.'})` so HF LoRA target names map correctly to vLLM modules.
- Fix stacked weight-name matching to segment-boundary matching (`weight_name + ‘…
-
#34922 [ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout — rocm,ready — by micah-wil (合并于: 2026-02-20 13:37 (UTC+8)) [+2/-2, 2 files | commented:1 approved:1] We’ve noticed some flakiness in amd-ci related to tests timing out in Entrypoints Integration Test (API Server 1). Here is one example from a recent nightly build: https://buildkite.com/vllm/amd-ci/builds/4972/summary?jid=019c6f8c-8d3b-47b7-977b-ab61ac8730ce&tab=output#019c6f8c-8d3b-47b7-977b-ab61ac8730ce/L6775
In this example, the following test times out after the 240 second server init timeout defined in RemoteOpenAIServer:
``` pytest -v -s entrypoints/openai/…
-
#33677 [Misc] Add deprecated environment variable utilities — ready — by carlory (合并于: 2026-02-20 13:33 (UTC+8)) [💬3 | +65/-0, 1 files | commented:2 approved:1] ## Purpose
Add general-purpose utilities for handling deprecated environment variables with deprecation warnings. These functions can be reused across the codebase when deprecating environment variables in favor of CLI arguments or config options.
This addresses the suggestion from @hmellor in PR #33536 to add general versions of the removed `_get_from_env_if_set` and `_set_from_env_if_set` methods to `utils.py` for reuse in future deprecations. **New functions added to `vllm/utils/system_uti…
-
#33739 [CI][AMD][BugFix][P/D] Add default_vllm_config to test_moriio_connector.py so tests pass — bug,rocm,ready,v1,kv-connector — by rasmith (合并于: 2026-02-20 13:32 (UTC+8)) [💬8 | +8/-5, 1 files | commented:3 approved:2] ## Purpose
`test_moriio_handshake_returns_metadata` and `test_register_kv_caches` do not make use of the `set_vllm_config` fixture, and both result in this error when run: ``` def get_current_vllm_config() -> VllmConfig: if _current_vllm_config is None:
raise AssertionError( "Current vLLM config is not set. This typically means " "get_current_vllm_config() was called outside of a " ... -
#34072 Add validation to reject non-text content in system messages — frontend,ready — by veeceey (合并于: 2026-02-20 13:30 (UTC+8)) [💬12 | +187/-1, 2 files | commented:7 approved:3] ## Summary Fixes #33925
According to OpenAI API specification, system messages can only contain text content. This PR adds validation to warn on non-text content types (images, audio, video) in system messages.
## Problem vLLM was accepting images and other multimodal content in system messages without any indication, which deviates from the OpenAI API specification. System messages should only su…
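A sketch of the kind of check this PR adds (hypothetical helper name; the only assumption is the OpenAI-style content-part dicts):

```python
def find_non_text_parts(content) -> list[str]:
    """Return the types of any non-text parts in a chat message's content.

    `content` follows the OpenAI chat format: either a plain string or a
    list of parts such as {"type": "text", "text": "..."}. Hypothetical
    helper for illustration only.
    """
    if isinstance(content, str):
        return []  # a bare string is always text
    return [part.get("type", "?") for part in content
            if part.get("type") != "text"]
```

A server could warn (or reject, per the stricter follow-up #33981) whenever this list is non-empty for a system message.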
-
#34260 [Model Bash]: Improve FP8 Oracle for Config Specific Kernel Selection — ready — by elizabetht (合并于: 2026-02-20 13:29 (UTC+8)) [💬5 | +46/-13, 1 files | commented:3 approved:1] On Hopper (SM90), FLASHINFER_CUTLASS was auto-selected as the FP8 MoE backend over DEEPGEMM due to having higher priority in the selection order. FlashInfer kernels are primarily optimized for Blackwell and do not work well with features like chunked prefill on Hopper.
This change makes the backend priority order architecture-aware:
- Blackwell (SM100+): FlashInfer backends preferred over DeepGEMM
- Hopper and older: DeepGEMM preferred over FlashInfer backends
## Purpose Fix [34249](https://g…
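The architecture-aware ordering can be sketched as a small selection function (backend names and the SM-version threshold are illustrative of the description above, not vLLM's actual enums):

```python
def fp8_moe_backend_priority(sm_version: int) -> list[str]:
    """Return FP8 MoE backend candidates in preference order.

    Illustrative sketch: Blackwell (SM100+) prefers FlashInfer backends,
    while Hopper (SM90) and older prefer DeepGEMM.
    """
    if sm_version >= 100:
        return ["FLASHINFER_CUTLASS", "DEEPGEMM"]
    return ["DEEPGEMM", "FLASHINFER_CUTLASS"]
```

Keying the priority list on the compute capability avoids picking a backend that is poorly optimized for the running GPU.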
-
#34335 [Bugfix] Add regression test for MoE quant_config under torch.compile — bug,ready — by mgehre-amd (合并于: 2026-02-20 13:27 (UTC+8)) [💬4 | +23/-0, 1 files | commented:2 approved:2] Update: The fix landed separately via #34371 and this PR only adds the regression test.
## Summary
After the MoE Refactor (#32344), w4a16 models fail with
`AssertionError: Hidden size mismatch 2048 != 1024` under torch.compile. This is because `ensure_moe_quant_config_init()` is called in `FusedMoE.forward_native()`. When torch.compile is active, `forward_native` is traced by Dynamo, but the side effect of setting `self.quant_method.moe_quant_config` (an attribute mutation) is not replayed at…
- #34386 [Quark] Fix MoE fp8 activation scale handling on mi300 — ready — by BowenBao (合并于: 2026-02-20 13:27 (UTC+8)) [💬1 | +3/-3, 1 files | commented:2 approved:1] Small fix follow-up after #29008 for running gpt-oss on mi300. Also ensure the fp8 scale conversion only runs when activation is fp8 quantized.
- #34915 [ci] Use the right tag for CPU arm64 image — ci/build — by khluu (合并于: 2026-02-20 11:59 (UTC+8)) [+3/-3, 1 files | commented:1 approved:1] During the migration, the tag was changed by mistake; this fixes it back to the correct tag.
-
#34794 [Refactor] Implement output type check in LLM — frontend,ready,v1 — by DarkLight1337 (合并于: 2026-02-20 11:57 (UTC+8)) [💬4 | +58/-38, 2 files | commented:2 approved:1]
## Purpose
`LLMEngine.validate_outputs` actually does nothing. Apply the output type check inside `LLM` instead.
## Test Plan
## Test Result
…
[关闭未合并 PR]
-
#34962 [Build] Bump FlashInfer version from v0.6.3 to v0.6.4 — ci/build,nvidia — by vadiklyutiy (关闭于: 2026-02-21 10:18 (UTC+8)) [💬1 | +5/-5, 4 files | commented:1 | 📝草稿] ## Summary
Promote FlashInfer dependency from v0.6.3 to v0.6.4 across all build artifacts.
The main goal is to use the GDN decode kernel for Qwen3.5.
-
#24388 [gpt-oss]Support lazy init mcp session — frontend,needs-rebase,stale,gpt-oss — by wuhang2014 (关闭于: 2026-02-21 10:16 (UTC+8)) [💬14 | +99/-62, 3 files | commented:2 changes:1] ## Purpose
Currently, we initialize all MCP sessions specified by the user request before the generation process, regardless of whether the tools are actually used. This PR implements lazy initialization for MCP sessions: the major change is to initialize a session with the tool server only right before its tool is called.
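The lazy-initialization idea can be sketched with a generic wrapper (a pattern illustration, not the PR's actual MCP session code):

```python
class LazySession:
    """Open the underlying session only when a tool is first called."""

    def __init__(self, connect):
        self._connect = connect  # factory that opens the real session
        self._session = None

    def call_tool(self, name: str, args: dict):
        if self._session is None:  # first tool call triggers the connect
            self._session = self._connect()
        return self._session(name, args)
```

A request that never calls a tool never pays the connection cost, which is exactly the behavior the PR aims for.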
## Test Plan With a Python MCP tool server, test 2 requests with a tool of `code_interpreter` type in background mode using the Responses API. One request does not use the tool while anothe… -
#24796 [Core] Set process titles early in init for better debugging — tpu,needs-rebase,stale,v1 — by Mohit-Gaur (关闭于: 2026-02-21 10:16 (UTC+8)) [💬5 | +942/-71, 6 files | commented:1 changes:1]
## Purpose Fixes #24476 - Assign worker process titles and logging prefix earlier
-
#35003 Fix: [Bug]: FlashInfer attn-fp4 fused kernel performs w — bug — by darshjme-codes (关闭于: 2026-02-21 09:38 (UTC+8)) [💬4 | +50/-0, 1 files | commented:1] The fused attention quant kernel for fp4 was observed to be slower than the unfused version. The fix disables the fusion pass when fp4 quantization is active, ensuring the higher‑performance unfused kernel is used instead. This resolves the performance regression without altering the underlying kernel implementation.
Fixes #34988
Changes:
vllm/flash_attn/flash_attn_interface.py: Disabled the fused attention quantization pass when fp4 is enabled, preventing the slower fused kernel…
-
#35000 Simplify output assignment condition — v1 — by zhuohan123 (关闭于: 2026-02-21 07:29 (UTC+8)) [💬1 | +1/-3, 1 files | commented:1] ## Purpose Refactor conditional check for output assignment.
Essential Elements of an Effective PR Description Checklist
- [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [ ] The test plan, such as providing test command.
- [ ] The test results, such as pasting the results comparison before and after, or e2e results
... -
#26278 [1/N] Elastic EP Milestone 2 — rocm,ready,ci/build,v1,cpu,nvidia — by libertyeagle (关闭于: 2026-02-21 07:10 (UTC+8)) [💬32 | +3448/-964, 42 files | commented:10] ## Purpose
PR #20775 introduces initial support of elastic expert parallelism. This PR adds further optimizations towards Milestone 2 in #20323. Key features include:
- Break down the scale up/down logic into a state machine of multiple stages, with their execution controlled in `vllm/distributed/elastic_ep/elastic_state.py` and `vllm/distributed/elastic_ep/elastic_execute.py`.
- Newly started workers receive all weights (non-MoE modules and expert weights) from peer GPUs.
- We no longer need t…
-
#34980 [not for land] Change VLLM_USE_MEGA_AOT_ARTIFACT default to '1' — ready — by zhxchen17 (关闭于: 2026-02-21 05:16 (UTC+8)) [+1/-1, 1 files | commented:1] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#29214 [LMCache KVConnector] Integrate LMCache observability to vLLM’s KV connector metrics — kv-connector — by aeon-x (关闭于: 2026-02-21 05:08 (UTC+8)) [💬6 | +272/-0, 1 files | commented:5 approved:1] ## Purpose Fix issue https://github.com/LMCache/LMCache/issues/1914
### Integration test by Grafana
### Implementation
…
-
#34821 [Bugfix] Make CUDA compat library loading opt-in to fix consumer GPUs — bug,ci/build,nvidia — by 88plug (关闭于: 2026-02-21 04:33 (UTC+8)) [💬7 | +274/-4, 3 files | commented:1] ## Purpose
Fix CUDA forward compatibility library loading that causes Error 803 (`CUDA_ERROR_SYSTEM_DRIVER_MISMATCH`) in Docker containers. The persistent `cuda-compat.conf` in `/etc/ld.so.conf.d/` unconditionally loads the container's CUDA compat libs, which shadow the host-mounted driver when the host has a newer CUDA version than the container's toolkit (e.g., host driver 580.x with CUDA 13.0 support, container built with CUDA 12.x). This was originally reported on an NVIDIA B200 (datacente…
-
#34168 [DRAFT][Feature] implement online data capture/generation for eagle3 — speculative-decoding,v1 — by harryzorus (关闭于: 2026-02-21 02:06 (UTC+8)) [💬3 | +3393/-1, 17 files] PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
## Purpose
vLLM today has the excellent speculators library that implements offline generation of logits and hidden states for training EAGLE heads from scratch using a custom offline worker using the…
-
#32628 feat: add parallel token drafting for EAGLE — documentation,speculative-decoding,needs-rebase,v1,llama — by hai-meh-cs (关闭于: 2026-02-21 01:46 (UTC+8)) [💬11 | +1335/-16, 9 files | commented:9 changes:1] ## Purpose
PTD (Parallel Token Decoding) generates K draft tokens in a single forward pass instead of K sequential passes, reducing draft overhead for EAGLE speculative decoding.
```bash
vllm serve openai/gpt-oss-120b \
  --speculative-config '{"model": "", "method": "eagle3-ptd", "num_speculative_tokens": 4}' \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  ...
```
-
#34671 Update max_num_tokens value when specdec is enabled — ready,v1 — by shaharmor98 (关闭于: 2026-02-21 01:15 (UTC+8)) [💬4 | +16/-6, 1 files | commented:2 approved:2] ## Purpose
Following the Unified Parallel Drafting PR, the
`max_num_tokens` field in `gpu_model_runner.py` hasn't been adjusted.
When speculative decoding is enabled (draft model or Eagle), the scheduler increases `max_num_batched_tokens` to account for speculative tokens, but `GPUModelRunner.max_num_tokens` was not reflecting this adjustment. This could cause mismatches between what the scheduler sends and what the model runner expects. …
#34966 fix(cli): prevent --served-model-name from consuming positional model argument — frontend — by timon0305 (关闭于: 2026-02-21 00:39 (UTC+8)) [💬2 | +202/-3, 5 files | commented:1] ## Purpose
Fix #34326
When running `vllm serve --served-model-name ASR Qwen/Qwen3-ASR-1.7B`, the `--served-model-name` flag (which uses `nargs='+'`) greedily consumes both `ASR` and `Qwen/Qwen3-ASR-1.7B` as its values, leaving the positional `model_tag` unset. This causes `args.model` to fall back to the default `Qwen/Qwen3-0.6B`, resulting in:
- The banner showing the wrong model name
- The `/v1/audio/transcriptions` endpoint disappearing (wrong model capabilities detected)
### Changes
…
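The greedy-consumption behavior is reproducible with plain argparse (a minimal repro of the pitfall, not vLLM's actual CLI definition):

```python
import argparse

parser = argparse.ArgumentParser()
# Optional positional, mirroring how the model tag can be omitted.
parser.add_argument("model_tag", nargs="?", default=None)
# nargs='+' makes the flag consume every following non-flag token.
parser.add_argument("--served-model-name", nargs="+")

args = parser.parse_args(["--served-model-name", "ASR", "Qwen/Qwen3-ASR-1.7B"])
print(args.served_model_name)  # ['ASR', 'Qwen/Qwen3-ASR-1.7B']
print(args.model_tag)          # None: the positional was swallowed
```

Because both tokens follow the flag and neither starts with `-`, argparse assigns them all to `--served-model-name` and the positional falls through to its default.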
-
#28453 [Feature]: Implement naive prepare/finalize class to replace naive dispatching in fused_moe/layer.py — needs-rebase — by WorldExplored (关闭于: 2026-02-21 00:18 (UTC+8)) [💬12 | +301/-65, 9 files | commented:4 changes:1] Reference Issue: https://github.com/vllm-project/vllm/issues/28236
Add a FusedMoEPrepareAndFinalize subclass.
cc @bnellnm
-
#34938 [Bug] Use FlashAttention for NemotronH — bug — by benchislett (关闭于: 2026-02-20 23:32 (UTC+8)) [💬1 | +11/-1, 1 files | commented:1] ## Purpose
For some time, we have been observing accuracy issues with FlashInfer attention for NemotronH models (Super internal checkpoints), and use FlashAttention for all evals. While we investigate this issue, this temporary fix defaults an unspecified backend to Flash Attention for NemotronH on Blackwell just to be safe.
-
#34949 [Frontend] Merge developer instructions into tool block for Harmony prompt rendering — frontend,gpt-oss — by beom115 (关闭于: 2026-02-20 19:58 (UTC+8)) [💬2 | +12/-5, 1 files | commented:1]
## Purpose
Current Harmony chat implementation renders the system/developer instructions and tool definitions in two separate
`<|start|>developer` blocks. However, models that follow the OpenAI `gpt-oss-20b/120b` chat template (based on the chat_template.jinja) expect a single developer block containing both `# Instructions` and `# Tools` sections. This PR modifies `_make_request…
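A minimal sketch of the single-developer-block layout described above (the renderer function is hypothetical; vLLM's actual Harmony rendering code differs):

```python
def render_developer_block(instructions: str, tool_sections: list[str]) -> str:
    """Render one developer block holding both Instructions and Tools.

    Hypothetical renderer for illustration, using Harmony-style
    <|start|>/<|message|>/<|end|> delimiters.
    """
    lines = ["<|start|>developer<|message|># Instructions", instructions]
    if tool_sections:
        lines.append("# Tools")
        lines.extend(tool_sections)
    lines.append("<|end|>")
    return "\n".join(lines)
```

The key property is that a prompt rendered this way contains exactly one developer block, matching what the chat template expects.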
-
#30669 [Doc] Add AI Badgr framework integration documentation — documentation — by cartersaundersx (关闭于: 2026-02-20 19:56 (UTC+8)) [💬4 | +83/-0, 1 files | commented:1] ## Purpose
Add documentation for AI Badgr, an OpenAI-compatible LLM provider, in the framework integrations section. AI Badgr uses tier-based model naming (basic/normal/premium) and can work with vLLM as a backend or be accessed as a hosted service.
## Test Plan
```bash
# Verify markdown linting
pre-commit run --hook-stage manual markdownlint --files docs/deployment/frameworks/aibadgr.md
```
…
-
#34623 [Bugfix] Fix `tests/compile/correctness_e2e/test_sequence_parallel.py::test_tp_sp_generation` — bug — by NickLucche (关闭于: 2026-02-20 19:05 (UTC+8)) [💬1 | +20/-12, 2 files | commented:1 | 📝草稿] Fix `tests/compile/correctness_e2e/test_sequence_parallel.py::test_tp_sp_generation` following up on https://github.com/vllm-project/vllm/pull/34523 changes. Interpreting the original PR's intent, I am trying to address quirks of handling `--enforce-eager` + `compilation-config` in the same start command. cc @ProExpertProg
-
#34937 [Bugfix]: Qwen3 reasoning parser now handles `<think>` in prompt prefix — bug,qwen — by babyplutokurt (关闭于: 2026-02-20 15:17 (UTC+8)) [💬4 | +131/-22, 2 files | commented:1] [Bugfix]: Qwen3 reasoning parser now handles `<think>` in prompt prefix. Fixes #21130
## Purpose
Newer Qwen3 thinking models (e.g.,
`Qwen3-4B-Thinking-2507`) support only thinking mode: `enable_thinking=True` is always in effect. Their default chat template automatically appends `<think>` to the assistant turn prefix inside the prompt. As a result, the model's generated output starts directly with reasoning text and ends with `</think>` followed by content; the opening `<think>` tag does not appear… -
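The parsing rule can be sketched as follows (a simplified illustration of the behavior described above, not the actual Qwen3 reasoning parser): since the chat template already placed `<think>` in the prompt, generation starts inside the reasoning block, so everything before `</think>` is reasoning even though no opening tag appears in the output.

```python
def split_reasoning(output: str) -> tuple[str, str]:
    """Split model output into (reasoning, content), assuming the opening
    <think> tag was already emitted by the chat template in the prompt."""
    reasoning, sep, content = output.partition("</think>")
    if not sep:
        return output, ""  # no closing tag yet: everything is reasoning
    return reasoning, content.lstrip()
```

During streaming, the no-closing-tag branch keeps classifying partial output as reasoning until `</think>` arrives.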
#33981 fix: reject non-text content in system/developer messages — frontend — by veeceey (关闭于: 2026-02-20 14:34 (UTC+8)) [💬10 | +230/-0, 2 files | commented:6] ## Summary
Fixes #33925
Per the OpenAI API specification,
`system` and `developer` role messages should only accept `text` content type. Previously, vLLM allowed multimodal content (e.g. `image_url`, `input_audio`, `video_url`) in system messages without any validation, which diverges from the OpenAI API behavior.
### Changes
`vllm/entrypoints/chat_utils.py`: Added a `_validate_text_only_c…