[vLLM GitHub Development Digest] 2026-02-11
[Overview]
- Time window: 2026-02-11 11:37 (UTC+8) ~ 2026-02-12 11:37 (UTC+8)
- New issues: 23 (label distribution: bug:13, ci-failure:3, feature request:3, torch.compile:2, rocm:2)
- Closed issues: 36
- New PRs: 71 (label distribution: ready:26, bug:23, v1:16, ci/build:9, documentation:8)
- Merged PRs: 44
- PRs closed without merging: 10
[New issues]
- #34395 [Bug]: AR+rms+fp4 fusion results in total accuracy collapse for DSV3-fp4 — bug,torch.compile,nvidia — by ProExpertProg (created: 2026-02-12 10:39 UTC+8) [💬2]
- #34365 [CI Failure]: mi325_1: Async Engine, Inputs, Utils, Worker, Config Test (CPU) — ci-failure — by AndreasKaratzas (created: 2026-02-12 02:16 UTC+8) [💬2]
  Failing test: `pytest -s -v tests/config/test_config_generation.py::test_unrecognized_env` …
- #34391 [Performance]: qknorm+rope fusion slower than unfused on H100 — help wanted,performance,torch.compile — by ProExpertProg (created: 2026-02-12 08:56 UTC+8)
  Running `vllm bench sweep serve` for `-cc.pass_config.enable_qknorm_rope_fusion` in {True, False} gives the following results … (no performance-regression report attached)
- #34356 [Bug]: FP8 MoE Backend Regression: Nemotron-3 Models Fail in vLLM 0.15.0/0.15.1 — bug — by jordidelatorre (created: 2026-02-12 00:40 UTC+8) [💬1]
  Nemotron-3 models with FP8 quantization fail to load in vLLM 0.15.0 and 0.15.1 but work correctly in vLLM 0.14.1. Environment: Rocky Linux 8.10 (Green Obsidian), kernel 4.18.0-553.51.1.el8_10.x86_64 …
- #34315 [CI Failure]: Flashinfer Kernel tests missing argument: 'num_logical_experts' — ci-failure — by wzhao18 (created: 2026-02-11 13:34 UTC+8)
  Failing test: `tests/kernels/moe/test_flashinfer.py` …
- #34380 [Bug]: GLM-5 MTP crashes with num_speculative_tokens > 1 — bug — by mgoin (created: 2026-02-12 05:55 UTC+8)
- #34357 [Bug]: MoE models with shared experts do not support DP+EP — bug — by jeejeelee (created: 2026-02-12 00:49 UTC+8) [💬2]
- #34376 [Bug]: #33731 is causing Triton build fails with gptoss — bug,rocm,gpt-oss — by amd-hhashemi (created: 2026-02-12 04:46 UTC+8) [💬1]
  Environment: Ubuntu 22.04.5 LTS (x86_64), GCC 11.4.0, Clang 20.0.0git (ROCm llvm-project roc-7.0.0) …
- #34377 [Bug]: FP8 speed regression in version 0.16.0rc2.dev87+g0b20469c6 (latest nightly) — bug — by laterbreh (created: 2026-02-12 04:50 UTC+8)
  Image: vllm/vllm-openai:nightly (commit-tagged alias: nightly-0b20469c627e94060d1015170b186d19de1db583). Substantial drop in speed on FP8 images on: …
- #34370 [Feature]: Make Prometheus histogram buckets configurable — feature request — by npache (created: 2026-02-12 03:18 UTC+8)
  Allow configuring Prometheus histogram buckets for vLLM metrics (e.g., `vllm:e2e_request_latency_seconds`) via a CLI flag or environment variable; the buckets are currently statically defined in the source code. …
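  The request amounts to turning hard-coded bucket boundaries into configuration. A minimal sketch of the idea (illustrative only: the variable name `VLLM_LATENCY_BUCKETS` and the default boundaries below are hypothetical, not vLLM's):

```python
import os

# Hypothetical defaults, in seconds; vLLM's real bucket boundaries differ.
DEFAULT_BUCKETS = (0.3, 0.5, 0.8, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0)

def latency_buckets(env_var: str = "VLLM_LATENCY_BUCKETS") -> tuple:
    """Parse histogram bucket boundaries from a comma-separated env var,
    falling back to the defaults. Boundaries are de-duplicated and sorted,
    since Prometheus requires increasing bucket bounds."""
    raw = os.environ.get(env_var)
    if not raw:
        return DEFAULT_BUCKETS
    return tuple(sorted({float(p) for p in raw.split(",") if p.strip()}))
```

  The resulting tuple would then be passed as the `buckets=` argument when constructing a `prometheus_client` Histogram.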
- #34355 [Feature]: Detect langue for Whisper — feature request — by warichet (created: 2026-02-12 00:40 UTC+8)
  Adding language detection to the Whisper model would let it automatically identify the language of the audio input before transcription. This is especially useful in multilingual environments where the input language is not always known in advance; integrating detection directly into the transcription pipeline would streamline the process and make it more efficient for…
- #34361 [Bug]: Qwen Coder Next prefix caching — bug — by Nepherpitou (created: 2026-02-12 01:42 UTC+8)
- #34351 [Installation]: MAC M1 installation fails because of bits-and-bytes — installation — by jonoillar (created: 2026-02-11 23:49 UTC+8)
  Environment: macOS 15.6.1 (arm64), Clang 17.0.0 (clang-1700.0.13.5) …
- #34345 [Bug]: Since version >=0.14.0, the key-value cache of heterogeneous graphics cards cannot be allocated correctly. — bug — by gengchaogit (created: 2026-02-11 22:59 UTC+8) [💬5]
- #34340 [Bug]: request larger than max_tokens should return http code 413 instead of 400 — bug — by qin-nz (created: 2026-02-11 22:07 UTC+8) [💬3]
  When a user sends an overly large request, vLLM currently returns HTTP 400: {"detail":"Upstream[decode] error, b'{\"error\":{\"message\":\"'max_tokens' or 'max_completion_tokens' is too large: 65536. This model's maximum context length is 262144 tokens and your request has 203462 input tokens (65536 > 262144 - 203462). None\",\"type\":\"BadRequestError\",\"param\":null,\"code\":400}}'"} …
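  The distinction being requested is between "malformed request" (400) and "request entity too large" (413). A sketch of the proposed behaviour (generic code, not vLLM's actual validation path; the numbers come from the error message above):

```python
def check_token_budget(input_tokens: int, max_tokens: int, context_len: int):
    """Return (http_status, message) for a completion request.

    Returns 413 when the request exceeds the model's context window,
    as the issue proposes, instead of the generic 400 used today."""
    budget = context_len - input_tokens
    if max_tokens > budget:
        return (
            413,
            f"'max_tokens' is too large: {max_tokens}. This model's maximum "
            f"context length is {context_len} tokens and your request has "
            f"{input_tokens} input tokens ({max_tokens} > {budget}).",
        )
    return 200, "ok"
```

  With the figures from the report, `check_token_budget(203462, 65536, 262144)` yields status 413, since 65536 > 262144 - 203462.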
- #34323 [CI Failure]: Spawned tests can fail silently — ci-failure — by almayne (created: 2026-02-11 16:50 UTC+8) [💬4]
  Failing test: `tests/models/multimodal/generation/test_whisper.py::test_models` …
- #34331 [RFC]: Ahead of time dequantization of weights for quantization emulation (OCP MX, NVFP4) on unsupported devices — rocm,RFC — by fxmarty-amd (created: 2026-02-11 18:26 UTC+8) [💬1]
  vLLM currently has some support for quantization simulation of MXFP4/MXFP6/NVFP4 models on devices that do not support these dtypes: Quark OCP MX (dense & MoE) at https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/quark/schemes/quark_ocp_mx.py, and compressed-tensors NVFP4 (e.g. https://huggingface.co/RedHatAI/Qwen3-8B-NVFP4; no MoE emulation support at the moment). This is useful, for example, for resear…
- #34311 [Model Bash][DeepSeek R1]: Enable AR Fusion By Default — feature request,model-bash — by robertgshaw2-redhat (created: 2026-02-11 13:02 UTC+8) [💬1]
  See previous issue in the series. …
- #34333 [RFC]: Automated baselining and degradation detection — RFC — by jmkuebler (created: 2026-02-11 19:22 UTC+8)
  TL;DR: the authors published a method to detect accuracy degradations with statistical guarantees; this RFC is a platform to discuss whether and how it could help vLLM's model-onboarding and CI processes. When onboarding a new model or optimizing performance, an end-to-end accuracy check against a baseline is very useful for catching degradations; such a baseline could be the implementation in `transformers` (in case of…
- #34305 [Bug]: gpt-oss-120b Multiple generations not working — bug — by dhineshkumar-r (created: 2026-02-11 11:42 UTC+8) [💬1]
  Generating multiple assistant completions with tool calls for the same user message and system prompt: with `n` = 2 and 3, only one completion is returned. Model: `gpt-oss-120b`, deployed in NVIDIA Dynamo with the vLLM backend (runtime image: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime?version=0.7.0). `curl -X POST http://localhost:${SVC_PORT}/v1/chat/completions` ...
- #34326 [Bug]: --served-model-name causes model detection issues — bug — by ssendev (created: 2026-02-11 17:21 UTC+8)
  Environment: `FROM vllm/vllm-openai:nightly` with `uv pip install --system --no-cache-dir mistral-common[soundfile]` and `uv pip install --system 'vllm[audio]'` …
- #34322 [Bug]: Qwen3-Coder-Next with the qwen3_coder tool parser raises IndexError: list index out of range — bug — by MaoJianwei (created: 2026-02-11 16:42 UTC+8) [💬1]
  On v0.15.0: `vllm serve /llm_data2/Qwen3-Coder-Next/ --tensor-parallel-size 4 --served-model-name Qwen3-Coder-Next --trust-remote-code --enable-auto-tool-choice --tool-call-parser qwen3_coder --max-model-len auto --max-num-batched-tokens 4096 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' --gpu-memory-utilization 0.8` …
- #34317 [Bug]: can not install nightly wheel — bug — by JaheimLee (created: 2026-02-11 14:22 UTC+8)
[Closed issues]
- #33242 [Bug]: Qwen3-VL-30B (MoE): CUDA error 'illegal memory access' when max_model_len > 128k — bug — by Austin-Liang (closed: 2026-02-12 11:20 UTC+8) [💬3]
- #9964 [Installation]: build on arm64 meet a error — installation,stale — by gongchangsui (closed: 2026-02-12 10:19 UTC+8) [💬15]
  `collect_env.py` warns: Failed to import from vllm._C with ModuleNotFoundError ("No module named 'vllm._C'"); Triton not installed or not compatible. PyTorch 2.6.0.dev20241101+cu124 …
- #23160 [Usage]: How to get the generated_tokens before current_token in the new logits processor. — usage,stale — by johnking0099 (closed: 2026-02-12 10:18 UTC+8) [💬4]
  In version 0.10.1 (just released) the V1 engine supports logits processors, but there is no apparent way to access all tokens generated so far; is it possible? …
- #26558 CUDA illegal memory access in fused_marlin_moe with GLM-4.6-GPTQ MoE model (V1 engine, TP=4) — stale — by chriswritescode-dev (closed: 2026-02-12 10:17 UTC+8) [💬6]
  CUDA illegal memory access during model initialization with the GLM-4.6-GPTQ-Int4-Int8Mix MoE model when using the V1 engine with tensor parallelism. vLLM 0.11.0rc2.dev371+gaafb99a4d.d20251010; CUDA 13 (also tried 12.8, 12.9); 4x RTX 6000 Pro; Python 3.12 …
- #26762 [Usage]: about curl http://ip:8000/metrics — usage,stale — by Renoshen (closed: 2026-02-12 10:17 UTC+8) [💬2]
  Running the command returns the standard Python GC metrics (e.g. `python_gc_objects_collected_total{generation="0"} 12286.0`, `python_gc_objects_uncollectable_tot…`)
- #26774 [Usage]: how to use vllm on CUDA 12.9 — usage,stale — by Mrpingdan (closed: 2026-02-12 10:17 UTC+8) [💬3]
  `collect_env.py` itself fails with a traceback in `get_pretty_env_info()` (/vllm-workspace/collect_env.py, line 799) ...
- #26797 [Bug]: Token id 5279552648203111001 is out of vocabulary — bug,stale — by xuelei123-xgb (closed: 2026-02-12 10:17 UTC+8) [💬3]
- #26812 [Bug]: Inconsistent Results with Seed Setting in MambaMixer2 — bug,stale — by XuanofXXX (closed: 2026-02-12 10:17 UTC+8) [💬2]
- #26825 [Performance]: QuantFP8 `forward_native()` is on par or slower than `forward_cuda()` on a B200 GPU — bug,stale — by ElizaWszola (closed: 2026-02-12 10:17 UTC+8) [💬2]
  Environment: Ubuntu 24.04.3 LTS (x86_64) …
- #26843 [Bug]: Undefined symbol cutlass_moe_mm_sm100 on SM120 CUDA builds (macro enabled, grouped_mm_c3x_sm100.cu not compiled) — bug,stale — by Jonahcb (closed: 2026-02-12 10:17 UTC+8) [💬2]
  Environment: Ubuntu 24.04.3 LTS (x86_64), GCC 13.3.0 …
- #26850 [Feature]: Add new stats metrics for available_kv_cache_memory — feature request,stale — by MML-coder (closed: 2026-02-12 10:17 UTC+8) [💬2]
  vLLM logs "Available KV cache memory: XX.XX GiB" during engine initialization, but this value is not exposed as a Prometheus metric. The log-only approach gives point-in-time visibility at startup and does not enable continuous monitoring, which makes it difficult to do capacity planning when users want to figure out the actual GPU…
- #32391 [CI Failure]: shellcheck hook fails with find syntax error and causes silent failures — bug — by junuxyz (closed: 2026-02-12 08:56 UTC+8) [💬1]
  Not environment-specific: the `shellcheck` script fails because of incorrect usage of `.git` in the `find` command syntax (`pre-commit run --all-files`) …
- #34315 [CI Failure]: Flashinfer Kernel tests missing argument: 'num_logical_experts' — ci-failure — by wzhao18 (closed: 2026-02-12 06:17 UTC+8)
  Failing test: `tests/kernels/moe/test_flashinfer.py` …
- #33962 [Bug]: Realtime transcription completes but server never sends transcription.done when final commit is received during slow streaming — bug — by pjs102793 (closed: 2026-02-12 05:01 UTC+8)
  Environment: Ubuntu 24.04.1 LTS (x86_64), GCC 13.3.0 …
- #33252 [Feature]: Add Deepseek OCR 2 support to the latest version — feature request — by ItzAmirreza (closed: 2026-02-12 04:51 UTC+8) [💬2]
  https://huggingface.co/deepseek-ai/DeepSeek-OCR-2 …
- #30933 [Usage]: What is the latest instruction to run DeepSeek V3.2? — usage — by IKACE (closed: 2026-02-12 04:51 UTC+8) [💬2]
  Following https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html with vLLM 0.12.0 installed on an H200 node, `vllm serve deepseek-ai/DeepSeek-V3.2 --tensor-parallel-size 8 --tokenizer-mode deepseek_v32` fails with `(APIServer pid=816209) ValueError: No tokenizer regis…`
- #27075 [Bug]: gpt-oss poor performance — bug,stale — by seindum (closed: 2026-02-12 04:30 UTC+8) [💬2]
- #33895 [Bug]: /tokenize hangs after large request — bug — by JakubCerven (closed: 2026-02-12 02:36 UTC+8) [💬4]
  Running `vllm/vllm-openai:v0.15.0` with QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ: after a large number of `/tokenize` requests, the endpoint becomes unresponsive and hangs forever …
- #34164 [CI Failure]: mi325_4: LM Eval Large Models — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:17 UTC+8) [💬1]
  Failing test: `pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4` …
- #29529 [CI Failure]: mi325_2: Distributed Tests (2 GPUs) — rocm,ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:16 UTC+8) [💬11]
  Failing test: `TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py && TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py && DP_SIZE=2 pytest -v -s v1/entrypoints/openai/test_multi_api_servers.py && pytest -v -s entrypoints/llm/test_collective_rpc.py && pytest -v -s ./compile/fullgraph/test_basic_correctness.py && pytest -v -s ./compile/test_wrapper.py && VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same n…`
- #29460 [CI Failure]: mi325_1: Language Models Test (MTEB) — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:10 UTC+8) [💬4]
  Failing test: `pytest -v -s models/language/pooling_mteb_test` …
- #29509 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 3 — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:10 UTC+8) [💬10]
  Failing test: `pip install git+https://github.com/TIGER-AI-Lab/Mantis.git && pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=1) and not core_model'` …
- #29536 [CI Failure]: mi325_1: Multi-Modal Models Test (Extended) 2 — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:10 UTC+8) [💬13]
  Failing test: `pip install git+https://github.com/TIGER-AI-Lab/Mantis.git; pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=0) and not core_model'` …
- #29466 [CI Failure]: mi325_1: Language Models Test (Extended Pooling) — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:10 UTC+8) [💬19]
  Failing test: `pytest -v -s models/language/pooling -m 'not core_model'` …
- #29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:10 UTC+8) [💬10]
  Failing test: `pytest -v -s entrypoints/openai/test_collective_rpc.py; pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/; pytest -v -s entrypoints/test_chat_utils.py` …
- #34160 [CI Failure]: mi325_1: Entrypoints Unit Tests — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:09 UTC+8) [💬1]
  Failing test: `pytest -s -v tests/entrypoints/openai/tool_parsers/test_openai_tool_parser.py` …
- #33809 [CI Failure]: Kernels MoE Test %N — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:09 UTC+8) [💬3]
  Failing test: `pytest -v -s kernels/moe` …
- #29520 [CI Failure]: mi325_1: Multi-Modal Models Test (Standard) — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:09 UTC+8) [💬10]
  Failing test: `pip install git+https://github.com/TIGER-AI-Lab/Mantis.git && pip freeze | grep -E 'torch' && pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing && cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model` …
- #32751 [CI Failure]: mi325_1 Entrypoints Integration Test (Responses API) — ci-failure — by AndreasKaratzas (closed: 2026-02-12 02:07 UTC+8) [💬5]
  Failing test: `pytest -v -s tests/entrypoints/openai/responses` …
- #34355 [Feature]: Detect langue for Whisper — feature request — by warichet (closed: 2026-02-12 02:00 UTC+8)
  Same request as the new-issue entry above: add automatic language detection to the Whisper transcription pipeline. …
- #34156 [Bug]: AttributeError: 'Glm46VImageProcessorFast' object has no attribute 'get_number_of_image_patches' — bug — by regmibijay (closed: 2026-02-11 22:09 UTC+8) [💬1]
- #34237 [Bug]: Responses API with Harmony is broken for structured content inputs — bug — by Kimahriman (closed: 2026-02-11 21:14 UTC+8)
- #34311 [Model Bash][DeepSeek R1]: Enable AR Fusion By Default — feature request,model-bash — by robertgshaw2-redhat (closed: 2026-02-11 19:59 UTC+8) [💬1]
- #34317 [Bug]: can not install nightly wheel — bug — by JaheimLee (closed: 2026-02-11 14:28 UTC+8)
- #34288 [Model Bash][DeepSeek]: Remove Bias Casts in DSR1 NVFP4 TRTLLM — feature request,model-bash — by robertgshaw2-redhat (closed: 2026-02-11 13:01 UTC+8) [💬1]
  See the main issue for more details. …
- #28278 [Bug]: vLLM v0.11.0 container in k8s pod Fails to Load GLM-4.6-FP8 Model During CUDA Graph Capture, But v0.10.2 is OK. — bug,stale — by AndrewTsao (closed: 2026-02-11 12:49 UTC+8) [💬2]
[New PRs]
- #34397 [bugfix] refactor FunASR's _get_data_parser — bug,ready — by AllenDou (created: 2026-02-12 10:47 UTC+8) [+7/-11, 1 file | commented:1 approved:1]
  Moves `_get_data_parser` from `FunASRMultiModalProcessor` to `FunASRProcessingInfo`, because `BaseMultiModalProcessor._get_data_parser` has been moved to `BaseProcessingInfo.build_data_parser` in v0.16. Also deletes `skip_prompt_length_check` (which returned True), since it is better to check prompt_len. @DarkLight1337 please take a look, thanks.
- #34362 [Refactor] Move validation to params definitions — ready,v1 — by DarkLight1337 (created: 2026-02-12 01:50 UTC+8) [💬2 | +264/-245, 3 files | commented:2 approved:1]
  Moves `SamplingParams` validation from `InputProcessor` to `SamplingParams`, similar to how `PoolingParams` does it.
- #34398 [new model] add COLQwen3 code & Inference — documentation,new-model,multi-modality,qwen — by craftsangjae (created: 2026-02-12 10:57 UTC+8) [💬3 | +864/-0, 9 files | commented:2]
  Adds native support for ColQwen3 multi-modal late-interaction models in vLLM. ColPali (arXiv:2407.01449) introduced a ColBERT-style multi-vector retrieval approach for vision-language models; ColQwen3 extends Qwen3-VL with a linear projection head that produces per-token L2-normalized embeddings, enabling MaxSim late-interaction scoring for document retrieval and reranking across both text and image inputs. …
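  MaxSim late-interaction scoring, which this model family relies on, sums each query token's best match over all document tokens. A dependency-free sketch (illustrative only; it assumes embeddings are already L2-normalized, as the PR description states):

```python
def maxsim_score(query_emb, doc_emb):
    """ColBERT-style MaxSim: for each query token vector, take the maximum
    dot product over all document token vectors, then sum over query tokens.
    Both arguments are lists of equal-length float vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_emb) for q in query_emb)

query = [[1.0, 0.0], [0.0, 1.0]]
relevant = [[1.0, 0.0], [0.0, 1.0]]     # matches both query tokens
unrelated = [[-1.0, 0.0], [0.0, -1.0]]  # matches neither
```

  `maxsim_score(query, relevant)` scores strictly higher than `maxsim_score(query, unrelated)`, which is what makes the per-token multi-vector representation useful for retrieval and reranking.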
- #34353 [Bugfix] fix default is_neox_style to be True for deepseekv3.2 — bug,ready,deepseek — by xyDong0223 (created: 2026-02-12 00:10 UTC+8) [+1/-1, 1 file | commented:1 approved:2]
  Resolves the default style issue with index rope: `is_neox_style` should be True when not present in the config.
- #34307 [ROCm] [CI] Add new fusion test cases that are relevant to vLLM IR Ops — rocm,ready,ci/build — by tjtanaa (created: 2026-02-11 12:17 UTC+8) [💬5 | +312/-105, 11 files | commented:2]
  Split off from https://github.com/vllm-project/vllm/pull/34244 to focus on the fusion pass only; tests were triggered on AMD CI. …
- #34354 [Bugfix] Fix step3p5 tool parser and unnecessary unstreamed tool args in serving. — bug,frontend — by mariohong128 (created: 2026-02-12 00:26 UTC+8) [💬1 | +1177/-1425, 5 files | commented:1 | draft]
  Reworks the step3.5 tool parser and its test cases for a more stable parser. In `vllm/entrypoints/openai/chat_completion/serving.py`, some parsers do not require checking unstreamed tool arguments: `qwen3coder_tool_parser` does not maintain the `streamed_args_for_tool` variables and may cause an index-out-of-range error; `qwen3xml_tool_parser` has bugs that cause duplicate parameter sending. **step3…
- #34327 TRITON_ATTN backend support int8 kvcache dtype — documentation,v1 — by EricccYang (created: 2026-02-11 17:22 UTC+8) [💬5 | +111/-38, 6 files | commented:4]
  Test plan (A100-PCIE-40G, Qwen3-4B): `vllm bench latency --model ../Qwen3-4B/ --batch-size BATCH_SIZE --input-len 512 --output-len OUT_LEN --max_model_len 2000 --num-iters 1 --gpu-memory-utilization 0.90 --num-iters-warmup 1 --attention-backend TRITON_ATTN --kv-cache-dtype KV_CACHE_TYPE` …
- #34385 [Bugfix] Fix MTP accuracy for GLM-5 — bug,speculative-decoding,ready,v1 — by mgoin (created: 2026-02-12 07:30 UTC+8) [💬1 | +18/-0, 1 file | commented:1 approved:1]
  Fixes MTP producing NaN logits for models (e.g. GLM-5) whose checkpoints don't store a duplicate `shared_head.head` weight in the MTP layer (like DeepSeek V3.2 does). The existing `_maybe_share_lm_head` only sets `self.model.lm_head`, but MTP's `compute_logits` uses `shared_head.head` inside each MTP layer, which was left uninitialized when the checkpoint omits it. This patch explicitly shares the target model's `lm_head` with each MTP layer's `shared_head.head`, matching what DeepSeek-V3.2 …
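  The fix pattern described above — alias the target model's weight wherever a checkpoint omits its own copy — can be sketched generically (plain-Python illustration; the key names follow the PR text, everything else is hypothetical):

```python
def share_missing_heads(lm_head_weight, mtp_layers):
    """For each MTP layer (modeled here as a dict), install the target
    model's lm_head weight under 'shared_head.head' when the checkpoint
    did not provide one; layers with their own copy are left untouched."""
    for layer in mtp_layers:
        if layer.get("shared_head.head") is None:
            # Alias the tensor rather than copying, mirroring how the PR
            # shares the target model's lm_head with each MTP layer.
            layer["shared_head.head"] = lm_head_weight
    return mtp_layers

layers = [{"shared_head.head": None}, {"shared_head.head": "own_copy"}]
share_missing_heads("lm_head_w", layers)
```

  Without such a fallback, the uninitialized head projects hidden states through garbage weights, which is exactly the NaN-logits symptom the PR fixes.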
- #34392 [torch.compile] Disable ar-rms fusion for ds3-fp4 — ready — by ProExpertProg (created: 2026-02-12 09:50 UTC+8) [💬1 | +42/-3, 3 files | commented:2]
  #34299 enabled AR+rms fusion by default, which causes accuracy issues on DS3-fp4; other fp4 and non-fp4 DeepSeek models appear unaffected. This disables the fusion by default for those cases. Tested in CI and locally. …
- #34396 [BugFix] Align fused MoE-LoRA kernel config with actual weight shapes — bug — by RunkaiTao (created: 2026-02-12 10:44 UTC+8) [+5/-1, 1 file | commented:1]
  The config key is derived from the LoRA weight shapes; for the GPT-OSS gate-up kernel, a mismatch occurs because the w1 and w3 expand weights are concatenated. Before this change, vLLM used the key `max_loras-1-m-rank-3072-2994` for gate-up kernels; after this change, the correct key becomes: …
- #34393 Unify MLA cache gather into gather_cache and remove split ops — v1 — by yashnib (created: 2026-02-12 09:51 UTC+8) [💬5 | +400/-281, 8 files | commented:1]
  Closes #33669. Unifies `gather_and_maybe_dequant_cache` and `cp_gather_cache` into a single kernel and torch op `gather_cache`, as requested in the feature discussion. The unified implementation handles both token-major (with optional FP8 dequant) and batch-major (CP copy-only) traversal internally, removing Python-level branching and eliminating duplicate kernel implementations.
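  The gather being unified here walks a paged KV cache and copies the selected tokens' entries into a contiguous buffer. A dependency-free sketch of a token-major gather (illustrative only; the block size and cache layout are made up, and the real op is a CUDA kernel with optional FP8 dequantization):

```python
def gather_cache(paged_cache, block_table, seq_len, block_size=4):
    """Collect the first seq_len token entries of one sequence from a
    paged cache into a contiguous list, walking token by token.

    paged_cache: list of blocks, each holding block_size entries.
    block_table: physical block index for each logical block."""
    out = []
    for token in range(seq_len):
        physical = block_table[token // block_size]
        out.append(paged_cache[physical][token % block_size])
    return out

# A sequence whose logical blocks live at physical blocks 2 then 0.
cache = [["b0_0", "b0_1", "b0_2", "b0_3"], None,
         ["b2_0", "b2_1", "b2_2", "b2_3"]]
```

  `gather_cache(cache, [2, 0], 6)` returns the six entries in logical order even though they sit in non-adjacent physical blocks, which is the core of what the fused kernel does per sequence.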
- #34394 [torch.compile] Remove duplicated split_with_sizes after RoPE — no labels — by elvischenv (created: 2026-02-12 10:10 UTC+8) [+70/-3, 2 files | commented:2]
  Consider the following ops around RoPE:
  ```
  qkv = self.qkv_proj(hidden_states)
  q, k, v = qkv.split([self.hidden_size, self.hidden_size, self.hidden_size], dim=-1)
  q, k = self.rotary_emb(positions, q, k)
  ```
  …
- #34390 [Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching — no labels — by gmagogsfm (created: 2026-02-12 08:43 UTC+8) [💬1 | +1860/-121, 10 files | commented:2]
  NOTE: a manually stacked PR; each commit is reviewed separately, and only the top commit ("Add Helion autotuning infra") should be reviewed here. Uses a PyTorch HigherOrderOp `helion_kernel_wrapper_mutation` to represent a Helion kernel call without specializing to a particular config yet, which is useful for pattern matching and graph-based optimizations. In the future, this op can be further lowered into another HOP after specializing to a particular config; the new HOP …
- #34337 [GPT-OSS] Remove unnecessary contiguous — ready,gpt-oss — by elvischenv (created: 2026-02-11 21:14 UTC+8) [+0/-1, 1 file | commented:1 approved:1]
  Removes an unnecessary `contiguous` for GPT-OSS. Test result: `[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-medium_temp1.0_20260211_121731', 'metric': 0.7266414141414141}]` …
-
#34379 [BugFix] Fix DP chunking — bug,ready — by LucasWilkinson (创建于: 2026-02-12 05:23 (UTC+8)) [💬2 | +12/-2, 1 files | commented:9 approved:1] https://github.com/vllm-project/vllm/pull/32790 broke DP chunking for models with shared experts by not chunking the shared expert input
causing:
(EngineCore_DP0 pid=2405) RuntimeError: The size of tensor a (256) must match the size of tensor b (4096) at non-singleton dimension 0in
tests/v1/distributed/test_dbo.py::test_dbo_dp_ep_gsm8k[deepep_low_latency][Distributed Tests (2 GPUs)(H100) ](https://buildkite.com/vllm/ci/builds/51065/steps/canvas?sid=019c4b81-18e7-4893-bfec-cb087afbc… - #34386 [Quark] Fix MoE fp8 activation scale handling on mi300 — 无标签 — by BowenBao (创建于: 2026-02-12 07:38 (UTC+8)) [💬1 | +3/-3, 1 files | commented:2] Small fix follow-up after #29008 for running gpt-oss on mi300. Also ensure the fp8 scale conversion only runs when activation is fp8 quantized.
-
#34382 [BUG] Reset running requests when clearing cache for pause/resume — bug,ready,v1 — by hao-aaron (创建于: 2026-02-12 06:35 (UTC+8)) [+1/-1, 1 files | commented:1 approved:2] ## Purpose Addressing https://github.com/vllm-project/vllm/pull/32351#issuecomment-3870210338: in-progress requests block clearing the prefix cache in keep mode.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#34389 [Custom Ops] Add functional + out variant for scaled_fp4_quant — 无标签 — by tianrengao (创建于: 2026-02-12 08:30 (UTC+8)) [+160/-77, 6 files | commented:1 | 📝草稿] Add PyTorch standard functional + out-variant pair for scaled_fp4_quant, following the same pattern as silu_and_mul.
Old schema: `scaled_fp4_quant(Tensor! output, Tensor input, Tensor! output_scale, Tensor input_scale, bool is_sf_swizzled_layout) -> ()`
New schemas: `scaled_fp4_quant(Tensor input, Tensor input_scale, bool is_sf_swizzled_layout) -> Tensor[]` and `scaled_fp4_quant.out(Tensor input, Tensor input_scale, …
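The schema change above follows the common functional/out-variant pairing for custom ops (as with `silu_and_mul`). A toy sketch of the pattern, using plain Python lists in place of tensors and illustrative names rather than vLLM's actual API:

```python
# Hypothetical sketch of a functional / out-variant op pair.
# The functional form allocates its output; the .out form writes into a
# caller-provided buffer, which lets graph passes avoid extra allocations.

def scaled_quant(values: list[float], scale: float) -> list[int]:
    """Functional variant: allocates and returns a new output."""
    out = [0] * len(values)
    scaled_quant_out(values, scale, out)
    return out

def scaled_quant_out(values: list[float], scale: float, out: list[int]) -> None:
    """Out variant: writes into a caller-provided buffer."""
    for i, v in enumerate(values):
        out[i] = round(v / scale)
```

Defining the functional form in terms of the out variant keeps the two behaviorally identical, which is what schema-level rewrites rely on.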
-
#34374 [Bugfix] Enforce DeepGEMM when using sparse_attn_indexer on CUDA — bug,ready,nvidia — by mgoin (创建于: 2026-02-12 04:23 (UTC+8)) [+5/-0, 1 files | commented:1 approved:1]
## Purpose
If you start dsv3.2 or glm-5 on Blackwell, DeepGEMM is not required for the linear/MoE kernels, but it is required for
`fp8_paged_mqa_logits`, so the server will crash at first inference if DeepGEMM is not installed. We should enforce this at init time.
## Test Plan
## Test Result
…
-
#34335 [Bugfix] Fix MoE quant_config not initialized under torch.compile — bug — by mgehre-amd (创建于: 2026-02-11 20:07 (UTC+8)) [💬1 | +30/-0, 2 files | commented:2] ## Summary
After the MoE Refactor (#32344), w4a16 models fail with
`AssertionError: Hidden size mismatch 2048 != 1024` under torch.compile. This is because `ensure_moe_quant_config_init()` is called in `FusedMoE.forward_native()`. When torch.compile is active, `forward_native` is traced by Dynamo, but the side effect of setting `self.quant_method.moe_quant_config` (an attribute mutation) is not replayed at runtime. This causes `moe_quant_config` to remain `None` when `DefaultMoERunner.forward_i…
-
#34388 [WIP][Spec Decode] P/D disaggregation: transfer hidden states for EAGLE warm-up — documentation,speculative-decoding,v1,kv-connector — by ZhanqiuHu (创建于: 2026-02-12 07:47 (UTC+8)) [💬1 | +1705/-16, 7 files | commented:1 | 📝草稿] ## Summary
Transfer full-prefix hidden states from the prefill instance to the decode instance via RDMA (NixlConnectorWithAux) so that EAGLE’s draft model can “prefill” its own KV cache on the first decode step, matching single-card acceptance rates.
Changes:
- NixlConnectorWithAux: subclass of NixlConnector that manages auxiliary dense tensor (hidden states) transfer alongside KV cache transfer, with slab-based GPU buffer management and backpressure.
- gpu_model_runner.py (v1): `_…
- #34384 [ROCm][CI] Revert Test Groups From mi325_8 to mi325_1 Agent Pool In AMD CI — rocm,ready,ci/build — by micah-wil (创建于: 2026-02-12 07:29 (UTC+8)) [💬3 | +7/-7, 1 files | commented:1 approved:1] https://github.com/vllm-project/vllm/pull/33731 mistakenly added some test groups to the 8 GPU agent pool that only require 1 GPU in AMD CI. This PR reverts those changes.
- #34387 [ROCm] Update the torch version in rocm_build.txt to use the official 2.10 release — rocm,ci/build — by SageMoore (创建于: 2026-02-12 07:42 (UTC+8)) [+2/-2, 1 files | commented:1]
## Purpose
After this change I’m able to build vLLM on an MI300X machine running ROCm 7.1 with the following commands.
When choosing a new amdsmi version, I just picked the most recent release on PyPI (https://pypi.org/project/amdsmi/#history)
`uv pip install -r requirements/rocm-build.txt --index-strategy unsafe-best-match` then `python setup.py develop`
## Test Plan This change should only impact developers who are building locally. I ran a simple lm_eval check with `deepseek-ai/DeepSeek-V2-Li…
-
#34383 [Ray] Propagate third-party env vars to Ray workers via prefix matching — ready,ci/build,v1 — by kouroshHakha (创建于: 2026-02-12 06:49 (UTC+8)) [💬1 | +279/-36, 5 files | commented:2] ## Problem
`get_env_vars_to_copy()` only forwarded env vars registered in `vllm.envs.environment_variables` (VLLM_* vars) plus a small hardcoded set (`HF_TOKEN`, `HUGGING_FACE_HUB_TOKEN`). Third-party env vars needed by KV connector integrations (`LMCACHE_*`, `NCCL_*`, `UCX_*`) and `PYTHONHASHSEED` were silently dropped when propagating from the EngineCore to Ray GPU workers. This caused subtle failures — e.g., LMCache workers using a different hash algorithm than the scheduler, resulting in 0…
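The prefix-matching idea can be sketched in a few lines; the prefix tuple and function name here are assumptions for illustration, not vLLM's actual implementation:

```python
# Select env vars to forward from the EngineCore to Ray workers, either by
# prefix (third-party integrations) or by an explicit allowlist.
THIRD_PARTY_PREFIXES = ("VLLM_", "LMCACHE_", "NCCL_", "UCX_")
EXTRA_VARS = {"HF_TOKEN", "HUGGING_FACE_HUB_TOKEN", "PYTHONHASHSEED"}

def env_vars_to_copy(environ: dict[str, str]) -> dict[str, str]:
    """Return the subset of environ that should be propagated to workers."""
    return {
        k: v
        for k, v in environ.items()
        if k.startswith(THIRD_PARTY_PREFIXES) or k in EXTRA_VARS
    }
```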
-
#34371 [Bugfix] Fix some issues with MoERunner PR #32344 — bug,ready — by bnellnm (创建于: 2026-02-12 03:24 (UTC+8)) [💬1 | +6/-3, 2 files | commented:2 approved:2] ## Purpose
Move the `ensure_moe_quant_config` call from `FusedMoE.forward_native` into `_moe_forward` and `_moe_forward_shared`. This is closer to how it was before, when it was hidden inside the custom op. It should avoid torch.compile issues.
Fix handling of `gate`. The `use_overlapped` flag should have been checked before returning `_gate`.
Possible fix for https://github.com/vllm-project/vllm/issues/34357
## Test Plan
…
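The core of the fix, moving lazy initialization out of the Dynamo-traced method into an eager wrapper, can be illustrated with a simplified stand-in (class and method names mirror the PR description, but the bodies are invented):

```python
# Simplified analogue: attribute mutations inside a torch.compile-traced method
# may only run at trace time, so the lazy init is done in the eager wrapper.
class FusedMoE:
    def __init__(self):
        self.moe_quant_config = None  # populated lazily on first forward

    def _ensure_moe_quant_config(self):
        if self.moe_quant_config is None:
            self.moe_quant_config = {"hidden_size": 2048}  # placeholder config

    def forward_native(self, x):
        # Traced by Dynamo under torch.compile; kept free of side effects.
        return [v * 2 for v in x]

    def _moe_forward(self, x):
        # Eager wrapper: a safe place for the attribute mutation.
        self._ensure_moe_quant_config()
        return self.forward_native(x)
```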
-
#34375 Fix #32588 — frontend — by mainnebula (创建于: 2026-02-12 04:30 (UTC+8)) [💬2 | +218/-17, 2 files | commented:1 | 📝草稿] ## What Addresses https://github.com/vllm-project/vllm/issues/32588
## Context This contribution was created by Token Steward — an initiative that redirects unused AI compute (from Claude Max subscription plans) toward open-source contributions before weekly token limits reset.
The code was generated by Claude (Anthropic) and should be reviewed like any other contribution. AI-assisted, human-reviewed.
## How to get involved
- If you maintain an o…
-
#34316 Fix CI failure - Flashinfer Kernel tests — ready,nvidia — by wzhao18 (创建于: 2026-02-11 13:36 (UTC+8)) [💬4 | +3/-0, 3 files | commented:1 approved:4]
## Purpose Fix #34315
## Test Plan Tested locally with
pytest -v -s tests/kernels/moe/test_flashinfer.py…
-
#34381 [Kernel] adopt mxfp8 grouped_gemm and grouped_quant kernel — ci/build — by EdalatiAli (创建于: 2026-02-12 06:09 (UTC+8)) [💬1 | +1283/-0, 10 files | commented:1 | 📝草稿]
## Purpose Integrate SGLang’s SM100+ expert-specialization MXFP8 blockscaled grouped kernels into vLLM so they are built, registered, importable, and test-covered in the vLLM codebase. Here is the source PR for the adopted kernels.
This PR:
- Adds `es_sm100_mxfp8_blockscaled_grouped_mm` and `es_sm100_mxfp8_blockscaled_grouped_quant` kernel sources into vLLM’s `_C` build path (CUDA-gated for SM100-compatible targets)….
-
#34378 Use paged_attention_v1 for sliding window decode in rocm_aiter_fa — rocm,v1,meta-exported,fb-exported — by iseeyuan (创建于: 2026-02-12 05:23 (UTC+8)) [+2/-29, 1 files | commented:1] Summary: Replace unified_attention (Triton) with paged_attention_v1 for the sliding window decode path in AiterFlashAttentionImpl. paged_attention_v1 already supports sliding window natively via its sliding_window parameter, so this unifies the NHD decode path for both sliding window and non-sliding window cases. The sliding window value is recovered from the flash-attn convention (self.sliding_window[0] + 1), which yields 0 (disabled) when no sliding window is configured.
Test Plan: Requires R…
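The window-size recovery mentioned above is a one-liner; a hedged sketch of the convention (the helper name is illustrative):

```python
# flash-attn stores the window as (left, right) with -1 meaning "disabled";
# paged_attention_v1 takes a single size where 0 means "no sliding window".
def recover_sliding_window(sliding_window: tuple[int, int]) -> int:
    """Convert a flash-attn style (left, right) pair to a kernel window size."""
    return sliding_window[0] + 1  # -1 (disabled) maps to 0
```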
-
#34334 [Bugfix] Fix more multimodal tests for transformers V5 — bug,ready,multi-modality,qwen — by zucchini-nlp (创建于: 2026-02-11 19:53 (UTC+8)) [💬1 | +18/-11, 5 files | commented:2 approved:2] There’s also an idefics test failing:
`tests/models/multimodal/processing/test_idefics3.py::test_processor_override[False-1-mm_processor_kwargs1-845-HuggingFaceM4/Idefics3-8B-Llama3]` I see the root issue in
`Idefics3Processor._get_prompt_updates`, which creates a processor but doesn’t use the provided `kwargs`. The kwargs get filtered out due to `cached_get_processor_without_dynamic_kwargs` and the number of placeholders is calculated incorrectly. It can be fixed by reverting https://github.com/vllm-pr… - #34349 [Benchmarks] Reduce ready checker log verbosity — performance,ready — by tomasruizt (创建于: 2026-02-11 23:43 (UTC+8)) [+2/-1, 1 files | commented:1 approved:1]
## Summary
- The `wait_for_endpoint()` ready checker used to log the full traceback of the error on every retry (every 5 seconds), making the output very noisy and hard to read.
- Now it logs only the last line of the error (the actual exception message), which is much more compact and informative.
Before (repeated every 5s): ``` WARNING [ready_checker.py:69] Endpoint is not ready. Error=’Traceback (most recent call last): File “…/aiohttp/connector.py”, line 1268, in _wrap_creat…
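Extracting only the exception message from a traceback string, as the PR describes, can be done with a small helper (name illustrative):

```python
def compact_error(traceback_text: str) -> str:
    """Return the last non-empty line of a traceback (the exception message)."""
    lines = [ln for ln in traceback_text.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""
```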
-
#34373 [Feature] Enable uniform KV cache allocation for multi-group HMA models — v1,kv-connector — by Etelis (创建于: 2026-02-12 04:06 (UTC+8)) [+206/-25, 2 files | commented:1]
`use_uniform_kv_cache()` currently rejects any model with more than one KV cache group, which means hybrid-attention models (alternating full + sliding-window layers) cannot use the contiguous cross-layer layout for efficient KV transfers. This PR relaxes the single-group gate: instead of requiring exactly one group, we loop over all groups and check that they share the same backend shape and stride order.
## Test Plan
Unit tests (
tests/v1/kv_connector/unit/test_uniform_kv_cache.py) — … -
#34372 [ROCm] [CI] fix test_unrecognized_env (#34350) — rocm — by vagabond2522 (创建于: 2026-02-12 03:41 (UTC+8)) [+7/-0, 1 files | commented:1] Signed-off-by: tjtanaa tunjian.tan@embeddedllm.com
## Purpose
## Test Plan
## Test Result
…
-
#34369 Add devcontainer configuration file — 无标签 — by vagabond2522 (创建于: 2026-02-12 02:59 (UTC+8)) [+7/-0, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#34332 fp8.py online quant: reuse layerwise reloading infra, take 3 — 无标签 — by vkuzo (创建于: 2026-02-11 19:07 (UTC+8)) [💬1 | +77/-154, 5 files | commented:1 | 📝草稿] Summary:
Copy of https://github.com/vllm-project/vllm/pull/34184
TODO write me up
Test Plan: TODO
…
-
#34366 Add language detection feature to Whisper — 无标签 — by warichet (创建于: 2026-02-12 02:35 (UTC+8)) [💬2 | +125/-20, 1 files | commented:1]
## Purpose This PR introduces an automatic language detection feature for the Whisper model. When the
`language` field is not specified in the request, the model will automatically detect the language of the audio input before transcription begins. This is achieved by predicting the most likely language token using the `<|startoftranscript|>` token in the decoder prompt. If language detection fails, the model defaults to English ("en"). The purpo…
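The detection-and-fallback flow described above can be sketched as follows; the logits dict stands in for a real decoder forward pass, and the function name is illustrative:

```python
def detect_language(lang_token_logits: dict[str, float], default: str = "en") -> str:
    """Pick the language token with the highest logit, falling back to English."""
    if not lang_token_logits:
        return default  # detection failed -> default to "en"
    return max(lang_token_logits, key=lang_token_logits.get)
```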
-
#34368 Add initial devcontainer configuration — 无标签 — by vagabond2522 (创建于: 2026-02-12 02:51 (UTC+8)) [+4/-0, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#34367 [BugFix] Skip null blocks when adding cached blocks in current step — bug,v1 — by peakcrosser7 (创建于: 2026-02-12 02:37 (UTC+8)) [+2/-0, 1 files | commented:1 | 📝草稿] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#34350 [ROCm] [CI] fix test_unrecognized_env — rocm,ready — by tjtanaa (创建于: 2026-02-11 23:46 (UTC+8)) [+10/-3, 1 files | commented:1 approved:1]
## Purpose
Fix https://buildkite.com/vllm/amd-ci/builds/4609/steps/canvas?sid=019c4d1a-4084-4c15-b40f-c0ef00b0ffd7&tab=output
```
... -
#34359 [CI] Add GPT-OSS Eval job for H100 — ready,ci/build,gpt-oss — by mgoin (创建于: 2026-02-12 01:15 (UTC+8)) [+13/-0, 1 files | commented:2]
## Purpose
Previously we only evaled gpt-oss on B200, and we’ve missed issues with triton_kernels since it isn’t the default for Blackwell. Running on Hopper will trigger the default case of triton_kernels.
## Test Plan
## Test Result
…
-
#34358 [Bugfix] Standardize getting number of image patches/tokens — bug,ready,multi-modality — by DarkLight1337 (创建于: 2026-02-12 00:58 (UTC+8)) [💬1 | +164/-226, 17 files | commented:3 approved:1] ## Purpose
- Consider `mm_kwargs` when determining the number of image tokens.
- Disallow passing `processor=None` to simplify the code.
- Fix Idefics3 and SmolVLM tests not passing `mm_kwargs` to the reference processor call.
FIX Idefics3 test in #34334
## Test Plan
…
-
#34342 [Frontend] Add automatic language detection for Whisper transcription — frontend,multi-modality — by spacecheck (创建于: 2026-02-11 22:24 (UTC+8)) [💬4 | +237/-11, 6 files | commented:2] adds features from #14174, #25750
## Purpose Add automatic language detection for Whisper transcription when no `language` parameter is specified.
- Whisper auto-detects the spoken language by running a single-token generation with `<|startoftranscript|>` as the decoder prompt and parsing the predicted... - #34364 [Fix] Fix tracing test race condition by adding server readiness check — 无标签 — by emricksini-h (创建于: 2026-02-12 02:02 (UTC+8)) [+23/-0, 1 files | commented:1] Trying to fix #34284
-
#34348 [Docs] Fix typo (“defult”) and double spacing — ready — by SorenDreano (创建于: 2026-02-11 23:23 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose Update vLLM documentation to fix two small typos (`defult` -> `default` and `-02` -> `-O2`) and a double space. vLLM will not start with `-02`.
## Test Plan Documentation-only change.
## Test Result N/A (documentation-only change; no runtime behavior affected). …
-
#34363 fix: return HTTP 413 when request exceeds max context length — frontend — by Chase-Xuu (创建于: 2026-02-12 01:58 (UTC+8)) [💬2 | +10/-2, 2 files | commented:1] ## Summary
Fixes #34340
Returns HTTP 413 (Request Entity Too Large) instead of 400 (Bad Request) when a request exceeds the model’s maximum context length.
## Changes
- Add optional `http_status` field to the `VLLMValidationError` exception class
- Return 413 when input tokens exceed `max_model_len`
…
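The status-code plumbing can be sketched as follows; the class and field names follow the PR description, but the bodies are simplified assumptions:

```python
# A validation error that carries the HTTP status the handler should return.
class VLLMValidationError(ValueError):
    def __init__(self, message: str, http_status: int = 400):
        super().__init__(message)
        self.http_status = http_status

def validate_prompt(num_tokens: int, max_model_len: int) -> None:
    """Raise 413 (Request Entity Too Large) for over-length prompts."""
    if num_tokens > max_model_len:
        raise VLLMValidationError(
            f"Prompt has {num_tokens} tokens but max_model_len is {max_model_len}",
            http_status=413,
        )
```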
-
#34347 [DRAFT] Async Allgather — ci/build,v1 — by elvircrn (创建于: 2026-02-11 23:22 (UTC+8)) [+422/-58, 3 files | commented:1 | 📝草稿] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#34346 Add group quantization support to fused FP8 RMSNorm quant kernels — 无标签 — by Bias92 (创建于: 2026-02-11 23:15 (UTC+8)) [💬2 | +276/-26, 4 files | commented:1] Add group quantization to fused FP8 RMSNorm quant kernels
Closes #24629
- Add `group_size` (default `0`) to fused FP8 RMSNorm quant ops. `group_size > 0`: per-group scaling along the last dimension; `group_size == 0`: unchanged per-tensor path.
- Add 144 tests; all passed.
Tests:
tests/kernels/core/test_rms_norm_group_quant.py…
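The `group_size` semantics can be illustrated with a pure-Python toy (max-abs FP8-style scaling is an assumption here; the real kernel fuses this with RMSNorm):

```python
FP8_MAX = 448.0  # max representable magnitude of float8_e4m3

def compute_scales(row: list[float], group_size: int) -> list[float]:
    """group_size == 0: one per-tensor scale; > 0: one scale per group."""
    if group_size == 0:
        return [max(abs(v) for v in row) / FP8_MAX]
    assert len(row) % group_size == 0, "row length must divide evenly"
    return [
        max(abs(v) for v in row[i : i + group_size]) / FP8_MAX
        for i in range(0, len(row), group_size)
    ]
```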
- #34360 Revert “[MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner” — documentation,v1 — by robertgshaw2-redhat (创建于: 2026-02-12 01:16 (UTC+8)) [💬1 | +753/-913, 25 files | commented:1] Reverts vllm-project/vllm#32344
- #34344 [Model Bash][DeepSeekR1] Remove Shared Expert Clone — performance,ready,deepseek — by robertgshaw2-redhat (创建于: 2026-02-11 22:59 (UTC+8)) [+4/-8, 2 files | commented:1]
## Purpose
- previously, we required cloning the shared expert input because some models used inplace. For instance, in a previous PR fixing a similar issue (https://github.com/vllm-project/vllm/pull/28942/changes), we had seen certain methods applying inplace unilaterally (https://github.com/zhyajie/vllm/blob/2a16631527f17bb550bc2137ae189ae3ddb74283/vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py#L404-L416)
- we now have systematically disabled inplace if there is a shared …
-
#34330 [Multimodal] Expose `mm_processor_kwargs` for `DummyInputsBuilder` — ready,multi-modality,llama,qwen,deepseek — by Isotr0py (创建于: 2026-02-11 18:01 (UTC+8)) [💬3 | +131/-27, 72 files | commented:4 approved:1] ## Purpose
- Currently, to enable/disable long video comprehension in Qwen3-VL-style models, users have to increase/decrease `longest_edge` for the hf_processor with `mm_processor_kwargs`.
- However, these fields won’t be reflected in dummy inputs during profiling, which can cause inaccurate profiling results.
- This PR exposes `mm_processor_kwargs` for `DummyInputsBuilder` to reflect the config.
## Test Plan ``` vllm serve Qwen/Qwen3-VL-4B-Instruct --enforce-eager --mm-pro…
-
#34312 kv connector: Add mooncake store connector — kv-connector — by snadampal (创建于: 2026-02-11 13:04 (UTC+8)) [💬1 | +2217/-0, 11 files | commented:1 | 📝草稿]
This PR adds a mooncake store connector to vLLM’s KV connectors. The store connector acts as an extension to the local cache and co-exists with nixl for p2p. This has been ported from the vllm-ascend repo and the connector has been made generic for any mooncake transfer engine transport protocol.
ported from: https://github.com/vllm-project/vllm-ascend/tree/main/vllm_ascend/distributed/kv_transfer/kv_pool
## Purpose To support KV cache offloading from vllm node to external c…
-
#34339 [Bugfix] Fix kv_load_failure_recovery example in sync mode — bug,documentation,kv-connector — by sykzhong (创建于: 2026-02-11 21:57 (UTC+8)) [💬3 | +1/-0, 1 files | commented:1]
## Purpose
Fix the kv_load_failure_recovery example to work correctly in sync mode. When async_scheduling is not explicitly set, vLLM automatically enables async scheduling by default (unless incompatible configurations are detected). However, async scheduling is incompatible with the sync mode of kv_load_failure_recovery, causing the example to fail.
### Changes
Explicitly set async_scheduling=False in sync mode to prevent automatic async scheduling activation….
-
#34336 [Bugfix] fix device_name for routing replay — bug — by Li-Yongwen (创建于: 2026-02-11 20:52 (UTC+8)) [+2/-1, 1 files | commented:1]
## Purpose
[Bugfix] fix device_name for routing replay
## Test Plan
## Test Result
…
-
#34352 Don’t try and run GLM-ASR with remote code — 无标签 — by hmellor (创建于: 2026-02-11 23:54 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1] GLM-ASR has been upstreamed to Transformers in https://github.com/huggingface/transformers/commit/a7f29523361b2cc12e51c1f5133d95f122f6f45c which was first released in
`v5.0.0rc2`. Therefore we should remove this kwarg from our test. (The test will not actually run in this PR because it’s still gated on Transformers v5.)
-
#34329 [Refactor] Pass Renderer to Input Processor — frontend,ready,v1 — by DarkLight1337 (创建于: 2026-02-11 17:48 (UTC+8)) [💬2 | +107/-106, 20 files | commented:3 approved:1] ## Purpose
Prepare for next Renderer refactor.
Also improve consistency between how attributes are being accessed between offline and online APIs.
- Importantly, this removes `max_model_len` from online serving; it should be accessed via `model_config.max_model_len`. I think this makes the code a bit clearer.
## Test Plan
## Test Result …
- #34343 Lazy imports for model loader classes in model_loader/init.py — meta-exported,fb-exported — by klintqinami (创建于: 2026-02-11 22:26 (UTC+8)) [+128/-29, 1 files | commented:1]
Summary:
`model_loader/__init__.py` eagerly imports all 7 loader classes (`BitsAndBytesModelLoader`, `DefaultModelLoader`, `DummyModelLoader`, `GGUFModelLoader`, `RunaiModelStreamerLoader`, `ShardedStateLoader`, `TensorizerLoader`) at module scope. These are only needed when `get_model_loader()` is called with a matching format, but the eager imports mean that any code importing from the `model_loader` package (e.g. `from vllm.model_executor.model_loader.weight_utils import default_weight_loader…
-
#34341 Make has_triton_kernels() handle import failures gracefully — meta-exported,fb-exported — by klintqinami (创建于: 2026-02-11 22:08 (UTC+8)) [+4/-1, 1 files | commented:1] Summary:
`has_triton_kernels()` uses `find_spec()` to check if the `triton_kernels` package exists, then eagerly calls `import_triton_kernels()`. On MTIA, the package is findable (it exists in the build) but fails to import because it depends on GPU-specific Triton modules like `triton.language.target_info` that don’t exist in MTIA Triton. This causes a crash instead of the function returning `False`. The fix wraps the
`import_triton_kernels()` call in a try/except so the function correctly rep…
-
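The described fix amounts to treating "findable but not importable" the same as "absent"; a self-contained sketch (the import helper is stubbed here, the real one imports the package):

```python
from importlib.util import find_spec

def import_triton_kernels():
    import triton_kernels  # may raise on platforms like MTIA
    return triton_kernels

def has_triton_kernels() -> bool:
    """Feature-detect triton_kernels without crashing on broken installs."""
    if find_spec("triton_kernels") is None:
        return False
    try:
        import_triton_kernels()
    except Exception:
        return False  # package present but not importable on this platform
    return True
```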
#34309 Add new sections to CODEOWNERS — ci/build — by DarkLight1337 (创建于: 2026-02-11 12:40 (UTC+8)) [💬1 | +27/-11, 1 files | commented:1] ## Purpose
- Create new section in CODEOWNERS for API-related and IO Processing-related code
- Split up `entrypoints/` and add various people that I think are suitable
- Add `tokenizers/` and `renderers/` to CODEOWNERS, assigning myself and @njhill
- Add `sampling_params.py` to CODEOWNERS to be consistent with `pooling_params.py`, and assigned @njhill
- Fixed some outdated directories in CODEOWNERS
## Test Plan
…
- #34338 [Bugfix][CI] Fix spawned tests — bug — by NickLucche (创建于: 2026-02-11 21:45 (UTC+8)) [+21/-4, 1 files | commented:1] Attempt to fix https://github.com/vllm-project/vllm/issues/34323 more generally. Do not merge until a full CI run
-
#34324 Fixed whisper CPU test that does not spawn properly. — multi-modality — by almayne (创建于: 2026-02-11 17:00 (UTC+8)) [💬5 | +0/-1, 1 files | commented:1 approved:1]
Partially addresses #34323
## Purpose
Fix a test that isn’t being run properly. This issue describes the problem with using the annotation
`@create_new_process_for_each_test("spawn")`. This change will prevent spawning or forking of the test.
## Test Plan
…
-
#34325 [Bugfix] fix device_name for routing replay — bug — by Li-Yongwen (创建于: 2026-02-11 17:08 (UTC+8)) [💬1 | +2/-1, 1 files | commented:1]
## Purpose
[Bugfix] fix device_name for routing replay
## Test Plan
## Test Result
…
-
#34321 [Bugfix][CPU] Fix llama4 inference on CPU — bug,ready,v1,llama,cpu — by bigPYJ1151 (创建于: 2026-02-11 16:30 (UTC+8)) [💬3 | +60/-18, 6 files | commented:1 approved:1]
## Purpose
- Fix llama4 inference on CPU
## Test Plan
## Test Result
…
-
#34328 [KV Connector] Support using FlexKV as KV Cache Offloading option. — documentation,kv-connector — by feiqiangs (创建于: 2026-02-11 17:38 (UTC+8)) [💬2 | +445/-0, 4 files | commented:2] FlexKV is a distributed KV store and multi-level cache management system developed by Tencent Cloud’s TACO team in collaboration with the community, designed for large-scale LLM inference scenarios. FlexKV leverages multi-level caching and a distributed KVCache pool to enable inference engines to achieve higher throughput and lower latency.
In our case, when integrated with FlexKV, we can achieve the following improvement:
ISL=21K, OSL=1K, batch_size=8: TTFT decreases by 60%, TPOT increases by…
-
#34306 [CI][Bugfix] add regression test for GDN fused_recurrent kernel — bug,v1,qwen — by CarstyYou (创建于: 2026-02-11 11:59 (UTC+8)) [💬2 | +542/-8, 3 files | commented:1]
## Purpose
Fix illegal memory access in the
`fused_recurrent` kernel when handling `PAD_SLOT_ID (-1)` for padded sequences (Issue #31186). Root Cause: The
`fused_recurrent_gated_delta_rule_fwd_kernel` was not properly handling `PAD_SLOT_ID (-1)` when storing final states in continuous batching mode. When processing padded sequences in CUDA Graph scenarios (especially with Qwen3-Next MTP), the kernel would attem… - #34319 [Doc] Update Marlin support matrix for Turing — documentation,ready — by iori2333 (创建于: 2026-02-11 16:16 (UTC+8)) [💬1 | +5/-4, 2 files | commented:1 approved:2] #29901 adds Marlin support for SM75 devices. This PR updates the support matrix of Marlin documentation.
-
#34308 [Chore] Move `BaseRenderer` to `base.py` — ready — by DarkLight1337 (创建于: 2026-02-11 12:24 (UTC+8)) [+7/-7, 8 files | commented:1 approved:1] ## Purpose
It’s not a protocol anymore.
## Test Plan
## Test Result
…
-
#34320 [Bugfix] Fix Dynamo unexpected keyword argument — bug — by samutamm (创建于: 2026-02-11 16:27 (UTC+8)) [+3/-3, 1 files | commented:1] ## Purpose
Fix QuantFP8 with torch.compile on ROCm when the CustomOp
`quant_fp8` is disabled with `--compilation-config '{"custom_ops": ["-quant_fp8"]}'`. Current main branch raises error:
``` (EngineCore_DP0 pid=565) File “/app/vllm/vllm/v1/executor/multiproc_executor.py”, line 375, in collective_rpc (EngineCore_DP0 pid=565) return aggregate(get_response()) (EngineCore_DP0 pid=565) ^^^^^^^^^^^^^^ …
-
#34318 [bugfix] Fix Dynamo unexpected keyword argument — bug,documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality — by samutamm (创建于: 2026-02-11 15:42 (UTC+8)) [💬3 | +6279/-1686, 114 files | commented:1] ## Purpose
Fix QuantFP8 with torch.compile on ROCm when the CustomOp
`quant_fp8` is disabled with `--compilation-config '{"custom_ops": ["-quant_fp8"]}'`. Current main branch raises error:
``` (EngineCore_DP0 pid=565) File “/app/vllm/vllm/v1/executor/multiproc_executor.py”, line 375, in collective_rpc (EngineCore_DP0 pid=565) return aggregate(get_response()) (EngineCore_DP0 pid=565) ^^^^^^^^^^^^^^ …
-
#34313 [Bugfix] Fix weight naming in Qwen3.5 — bug,qwen — by ywang96 (创建于: 2026-02-11 13:04 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
## Purpose
## Test Plan
## Test Result
... -
#34314 [Bugfix] Fix Fp8OnlineMoEMethod crash when weights are on CPU — bug — by aarkharov (创建于: 2026-02-11 13:15 (UTC+8)) [💬3 | +138/-9, 2 files | commented:1] ## Summary
When an external plugin (e.g.
`online_fp8`) creates MoE weight tensors on CPU, the streaming quantization path in `Fp8OnlineMoEMethod.patched_weight_loader` crashes because it calls `ops.scaled_fp8_quant` — a CUDA kernel — on CPU tensors. Root cause:
`patched_weight_loader` unconditionally calls `self.process_weights_after_loading(layer)` when all weight chunks are loaded, and sets `_already_called_process_weights_after_loading = True`. When this happens: Large MoE model ma…
-
#34310 fix: correct chunk start times for verbose transcription timestamps — frontend — by saikrishnarallabandi (创建于: 2026-02-11 12:42 (UTC+8)) [💬2 | +220/-14, 2 files | commented:1] ## Summary
Fix timestamp drift in chunked audio transcription by using actual chunk boundary times instead of nominal fixed offsets.
Fixes #29350
## Problem
When transcribing long audio (>30s), segment timestamps progressively drift because:
…
[已合并 PR]
-
#34362 [Refactor] Move validation to params definitions — ready,v1 — by DarkLight1337 (合并于: 2026-02-12 11:33 (UTC+8)) [💬2 | +264/-245, 3 files | commented:2 approved:1] ## Purpose
Move
`SamplingParams` validation from `InputProcessor` to `SamplingParams`, similar to how `PoolingParams` does it.
## Test Plan
## Test Result
... -
#33848 [Bug Fix] Fix
`naive_block_assignment` always defaulting to False due to arg misalignment — bug,ready — by RunkaiTao (合并于: 2026-02-12 11:30 (UTC+8)) [💬1 | +2/-1, 2 files | commented:1 approved:2] ## Purpose Fix a bug where `naive_block_assignment` always defaulted to False due to arg misalignment.
## Test Result gpt-oss 120b max_loras=8, concurrency=1
before ``` ============ Serving Benchmark Result ============ Successful requests: 40
… -
#34353 [Bugfix] fix default is_neox_style to be True for deepseekv3.2 — bug,ready,deepseek — by xyDong0223 (合并于: 2026-02-12 02:20 (UTC+8)) [+1/-1, 1 files | commented:1 approved:2] ## Purpose To resolve the default style issue with index rope, neox should be True when it’s not present in config. ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#34385 [Bugfix] Fix MTP accuracy for GLM-5 — bug,speculative-decoding,ready,v1 — by mgoin (合并于: 2026-02-12 11:08 (UTC+8)) [💬1 | +18/-0, 1 files | commented:1 approved:1] ## Purpose
Fix MTP producing NaN logits for models (e.g. GLM-5) whose checkpoints don’t store a duplicate
`shared_head.head` weight in the MTP layer (like DeepSeek V3.2). The existing `_maybe_share_lm_head` only sets `self.model.lm_head`, but MTP’s `compute_logits` uses `shared_head.head` inside each MTP layer. This left it uninitialized when the checkpoint omits it. This patch explicitly shares the target model’s `lm_head` with each MTP layer’s `shared_head.head`, matching what DeepSeek-V3.2 …
#33963 [Bugfix] send None sentinel on final commit so server properly sends transcription.done — bug,frontend,ready — by pjs102793 (合并于: 2026-02-12 05:01 (UTC+8)) [💬6 | +2/-8, 2 files | commented:1 approved:2 changes:1] ## Purpose
Fixes https://github.com/vllm-project/vllm/issues/33962
Fix server not sending
`transcription.done` when the client sends `input_audio_buffer.commit` with `final: true` during real-time audio streaming. Transcription works correctly —
`transcription.delta` events are sent as expected. However, the `final` commit handler sets `_is_input_finished = True` but does not send the `None` sentinel to `audio_queue`, leaving `audio_stream_generator` blocked on `queue.get()` forever. This deadloc…
#34337 [GPT-OSS] Remove unnecessary contiguous — ready,gpt-oss — by elvischenv (合并于: 2026-02-12 04:29 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1]
## Purpose Remove unnecessary contiguous for GPT-OSS.
Test Result PR:
[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-medium_temp1.0_20260211_121731', 'metric': 0.7266414141414141}]…
-
#33843 [Refactor] Replace
`activation: str` with `MoEActivation` enum — performance,rocm,ready,cpu,gpt-oss,nvidia,ready-run-all-tests — by mgoin (合并于: 2026-02-12 09:29 (UTC+8)) [💬5 | +474/-282, 48 files | commented:10] ## Purpose
We have had `activation` defined, validated, and passed around as a raw string forever. Now that we have popular models that don’t just use the `silu` default and also need to support non-gated MoEs and their activation functions, we need to have a single source of truth for all the activation functions that exist in vLLM and which MoE kernels support which functions. I want to start with MoE since that is where we have the most divergence in support due to fused kernels. This PR in…
-
#33626 [ci] Integrate AMD tests into CI — rocm,ready,ci/build — by khluu (合并于: 2026-02-12 08:54 (UTC+8)) [💬3 | +25/-10, 6 files | commented:2 approved:2] - Enable 6 AMD test jobs into the CI pipeline as the starting point.
- Define the config interface for AMD mirror in test definition.
- #34384 [ROCm][CI] Revert Test Groups From mi325_8 to mi325_1 Agent Pool In AMD CI — rocm,ready,ci/build — by micah-wil (合并于: 2026-02-12 07:52 (UTC+8)) [💬3 | +7/-7, 1 files | commented:1 approved:1] https://github.com/vllm-project/vllm/pull/33731 mistakenly added some test groups to the 8 GPU agent pool that only require 1 GPU in AMD CI. This PR reverts those changes.
-
#34371 [Bugfix] Fix some issues with MoERunner PR #32344 — bug,ready — by bnellnm (合并于: 2026-02-12 06:33 (UTC+8)) [💬1 | +6/-3, 2 files | commented:2 approved:2] ## Purpose
Move the `ensure_moe_quant_config` call from `FusedMoE.forward_native` into `_moe_forward` and `_moe_forward_shared`. This is closer to how it was before, when it was hidden inside the custom op. It should avoid torch.compile issues.
Fix handling of `gate`. The `use_overlapped` flag should have been checked before returning `_gate`.
Possible fix for https://github.com/vllm-project/vllm/issues/34357
## Test Plan
…
-
#34316 Fix CI failure - Flashinfer Kernel tests — ready,nvidia — by wzhao18 (合并于: 2026-02-12 06:17 (UTC+8)) [💬4 | +3/-0, 3 files | commented:1 approved:4]
## Purpose Fix #34315
## Test Plan Tested locally with
pytest -v -s tests/kernels/moe/test_flashinfer.py…
-
#34299 [torch.compile] Enable AR+rms fusion by default available for
`-O2` — performance,ready,torch.compile — by ProExpertProg (合并于: 2026-02-11 16:30 (UTC+8)) [💬1 | +16/-8, 2 files | commented:4 approved:1] ## Purpose Turns out AR+RMS fusion and compiling for a second range only marginally increases warm/cold start compile time, and the benefits are 5-22% (see #24252). This PR enables this fusion by default.
## Test Plan Startup, bench, eval, CI
## Test Result
### Startup ``` …
-
#34334 [Bugfix] Fix more multimodal tests for transformers V5 — bug,ready,multi-modality,qwen — by zucchini-nlp (合并于: 2026-02-12 05:02 (UTC+8)) [💬1 | +18/-11, 5 files | commented:2 approved:2] There’s also an idefics test failing:
`tests/models/multimodal/processing/test_idefics3.py::test_processor_override[False-1-mm_processor_kwargs1-845-HuggingFaceM4/Idefics3-8B-Llama3]` I see the root issue in
`Idefics3Processor._get_prompt_updates`, which creates a processor but doesn’t use the provided `kwargs`. The kwargs get filtered out due to `cached_get_processor_without_dynamic_kwargs` and the number of placeholders is calculated incorrectly. It can be fixed by reverting https://github.com/vllm-pr… - #34349 [Benchmarks] Reduce ready checker log verbosity — performance,ready — by tomasruizt (合并于: 2026-02-12 04:57 (UTC+8)) [+2/-1, 1 files | commented:1 approved:1]
## Summary
- The `wait_for_endpoint()` ready checker used to log the full traceback of the error on every retry (every 5 seconds), making the output very noisy and hard to read.
- Now it logs only the last line of the error (the actual exception message), which is much more compact and informative.
Before (repeated every 5s): ``` WARNING [ready_checker.py:69] Endpoint is not ready. Error=’Traceback (most recent call last): File “…/aiohttp/connector.py”, line 1268, in _wrap_creat…
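The compaction described above (keeping only a traceback's final line) can be sketched as follows; the helper name is illustrative, not vLLM's actual ready-checker code.

```python
import traceback

def last_error_line(exc: BaseException) -> str:
    # A formatted traceback ends with "ExceptionType: message";
    # keep only that final line for compact retry logging.
    formatted = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return formatted.strip().splitlines()[-1]

try:
    raise ConnectionError("Cannot connect to host localhost:8000")
except ConnectionError as e:
    print(last_error_line(e))  # ConnectionError: Cannot connect to host localhost:8000
```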
-
#34279 [Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides — bug,ready — by tlrmchlsmth (合并于: 2026-02-11 13:15 (UTC+8)) [💬1 | +27/-27, 1 files | approved:2 commented:1] ## Summary
- Annotate all stride parameters in `fused_moe_kernel` and `fused_moe_kernel_gptq_awq` as `tl.int64` to prevent int32 overflow in pointer arithmetic for large tensors
- This follows the same pattern already used in `fused_batched_moe.py`
- Without this fix, large problem sizes cause `illegal memory access` crashes when chunking is disabled, because the C tensor offset exceeds int32 max (~2.1 billion)
- This is a prerequisite for removing the chunking workaround (`VLLM_FUSED_MOE_CHUNK…
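To see why int64 strides matter, here is a back-of-the-envelope check (the shapes are hypothetical, not taken from the PR): a plausible fused-MoE intermediate already pushes element offsets past the int32 limit.

```python
INT32_MAX = 2**31 - 1  # ~2.1 billion

# Hypothetical fused-MoE output block: tokens x top-k experts x hidden size.
num_tokens, topk, hidden = 65_536, 8, 7_168
last_offset = num_tokens * topk * hidden  # offset just past the final element

print(last_offset)              # 3758096384
print(last_offset > INT32_MAX)  # True -> int32 pointer arithmetic would wrap
```

With int32 strides the computed pointer wraps around, producing the illegal-memory-access crashes described above; widening the stride arguments to int64 keeps the arithmetic exact.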
-
#34350 [ROCm] [CI] fix test_unrecognized_env — rocm,ready — by tjtanaa (合并于: 2026-02-12 02:50 (UTC+8)) [+10/-3, 1 files | commented:1 approved:1]
## Purpose
Fix https://buildkite.com/vllm/amd-ci/builds/4609/steps/canvas?sid=019c4d1a-4084-4c15-b40f-c0ef00b0ffd7&tab=output
```
... -
#34243 [Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) — bug,ready,llama — by eldarkurtic (合并于: 2026-02-12 02:24 (UTC+8)) [💬1 | +29/-5, 1 files | commented:5 approved:1] Llama-4 weights of
`q/k_proj` are permuted during model loading to prepare the model for interleaved/gpt-neox rope. The same permutation needs to be applied to the quantization weight scales as well.
## Purpose So far, all quantized Llama-4 models used to skip attention quantization because the model's accuracy would collapse. This was mistakenly interpreted as high sensitivity of the attention layers to quantization. Luckily, this is not the case, and the accuracy collapse was …
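The interleaved-to-gpt-neox reordering, and the point that per-output-channel scales must follow the same row permutation as the weights, can be sketched like this (a simplified stand-in, not vLLM's loader code):

```python
def interleaved_to_neox_perm(n_rows: int, head_dim: int) -> list[int]:
    # Reorder (x0, y0, x1, y1, ...) rope channels within each head into
    # the two contiguous halves (x0, x1, ..., y0, y1, ...) expected by
    # gpt-neox style rotary embeddings.
    perm = []
    for head_start in range(0, n_rows, head_dim):
        perm += [head_start + 2 * i for i in range(head_dim // 2)]      # even rows
        perm += [head_start + 2 * i + 1 for i in range(head_dim // 2)]  # odd rows
    return perm

perm = interleaved_to_neox_perm(8, 4)  # two heads with head_dim 4
weight_rows = [f"w{i}" for i in range(8)]
scale_rows = [f"s{i}" for i in range(8)]
# The fix: apply the SAME permutation to per-channel scales as to weights,
# so each scale stays paired with the channel it was calibrated for.
permuted_w = [weight_rows[i] for i in perm]
permuted_s = [scale_rows[i] for i in perm]
print(perm)  # [0, 2, 1, 3, 4, 6, 5, 7]
```

If only the weights are permuted, every scale multiplies the wrong channel, which matches the accuracy collapse the PR describes.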
-
#34348 [Docs] Fix typo (“defult”) and double spacing — ready — by SorenDreano (合并于: 2026-02-12 01:02 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose Update vLLM documentation to fix two small typos (
`defult` -> `default` and `-02` -> `-O2`) and a double space. vLLM will not start with `-02`.
## Test Plan Documentation-only change.
## Test Result N/A (documentation-only change; no runtime behavior affected). …
-
#34330 [Multimodal] Expose
`mm_processor_kwargs` for `DummyInputsBuilder` — ready,multi-modality,llama,qwen,deepseek — by Isotr0py (合并于: 2026-02-12 01:37 (UTC+8)) [💬3 | +131/-27, 72 files | commented:4 approved:1] ## Purpose
- Currently, to enable/disable long video comprehension in Qwen3-VL-style models, users have to increase/decrease `longest_edge` for the hf_processor via `mm_processor_kwargs`.
- However, these fields are not reflected in the dummy inputs used during profiling, which can cause inaccurate profiling results.
- This PR exposes `mm_processor_kwargs` to `DummyInputsBuilder` so that the dummy inputs reflect the config.
## Test Plan ``` vllm serve Qwen/Qwen3-VL-4B-Instruct --enforce-eager --mm-pro…
-
#33217 [Model Runner V2] Init cuda graph pool when necessary — v1,nvidia — by xinyu-intel (合并于: 2026-02-12 01:12 (UTC+8)) [+6/-2, 2 files | commented:3 approved:1]
## Purpose
Init the CUDA graph pool only when cudagraph_mode is not NONE. This helps on platforms where torch.cuda.graph_pool_handle() is not available.
## Test Plan
## Test Result
…
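A minimal sketch of the described guard (the enum and function names are illustrative, not the exact vLLM API):

```python
import enum

class CUDAGraphMode(enum.Enum):
    NONE = "none"
    PIECEWISE = "piecewise"
    FULL = "full"

def maybe_init_graph_pool(mode: CUDAGraphMode, create_pool):
    # Only touch the CUDA-graph pool API when graphs are actually used,
    # so platforms lacking torch.cuda.graph_pool_handle() still start up.
    if mode is CUDAGraphMode.NONE:
        return None
    return create_pool()

print(maybe_init_graph_pool(CUDAGraphMode.NONE, lambda: "pool"))  # None
print(maybe_init_graph_pool(CUDAGraphMode.FULL, lambda: "pool"))  # pool
```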
-
#32458 [CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… — bug,ready — by junuxyz (合并于: 2026-02-12 01:03 (UTC+8)) [💬3 | +126/-2, 2 files | commented:4 approved:1] related issue: https://github.com/vllm-project/vllm/issues/32391
## Purpose CI was not reliably running shellcheck due to an invalid `find` invocation, so issues could slip through unnoticed. This initially started as a small one-line fix, but after correcting the `find` command, CI began failing due to pre-existing shellcheck errors in the affected scripts. This PR restores a working signal by fixing the CI invocation and introducing a baseline, so CI only fails on newly introduced warnings.
## Summary
- …
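The baseline idea (fail only on newly introduced warnings) reduces to a set difference; a Python sketch of the concept, since the actual hook is a shell script:

```python
def newly_introduced(current: set[str], baseline: set[str]) -> set[str]:
    # CI fails only on warnings absent from the recorded baseline,
    # so pre-existing issues don't block unrelated changes.
    return current - baseline

baseline = {"script_a.sh:SC2086", "script_b.sh:SC2046"}
current = {"script_a.sh:SC2086", "script_c.sh:SC2164"}
print(sorted(newly_introduced(current, baseline)))  # ['script_c.sh:SC2164']
```

The trade-off of a baseline is that pre-existing warnings stay unfixed until someone burns them down deliberately, but the CI signal for new code is immediately trustworthy.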
-
#34273 [Misc] Bump
`fastsafetensors` version for latest fixes — ready,ci/build — by njhill (合并于: 2026-02-11 16:30 (UTC+8)) [+4/-5, 4 files | commented:1 approved:1] This includes in particular https://github.com/foundation-model-stack/fastsafetensors/pull/46, which is needed to avoid unnecessary memory overhead on rank 0 in multi-GPU deployments. See https://github.com/vllm-project/vllm/pull/34070.
cc @takeshi-yoshimura
-
#34217 [Frontend] Exploit tokenizers “new stream” in FastIncrementalDetokenizer — ready,v1 — by njhill (合并于: 2026-02-11 18:03 (UTC+8)) [💬1 | +17/-31, 1 files | commented:2 approved:1] Finally remembered to revisit https://github.com/vllm-project/vllm/pull/18840.
Faster incremental detokenizer initialization.
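For context, the classic incremental-detokenization approach diffs a full decode against the previously emitted prefix; the sketch below shows that baseline technique (not the tokenizers stream API this PR adopts):

```python
class PrefixDiffDetokenizer:
    # Baseline incremental detokenization: re-decode all ids and emit
    # only the new suffix. Stream-based decoders avoid this re-decode,
    # which is what makes them faster to initialize and run.
    def __init__(self, decode_fn):
        self.decode_fn = decode_fn
        self.ids: list[int] = []
        self.emitted = ""

    def step(self, token_id: int) -> str:
        self.ids.append(token_id)
        text = self.decode_fn(self.ids)
        new_text = text[len(self.emitted):]
        self.emitted = text
        return new_text

vocab = {0: "Hel", 1: "lo", 2: " world"}
detok = PrefixDiffDetokenizer(lambda ids: "".join(vocab[i] for i in ids))
print([detok.step(i) for i in (0, 1, 2)])  # ['Hel', 'lo', ' world']
```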
Includes some unrelated minor code simplifications in `detokenizer.py`.
-
#33681 [ROCm] [aiter] Split KV cache update for AiterFlashAttention — rocm,ready,v1 — by kliuae (合并于: 2026-02-12 00:26 (UTC+8)) [💬1 | +68/-40, 1 files | commented:3 approved:1]
## Purpose Supporting #32335, this PR extracts KV cache update from the attention forward pass in AiterFlashAttention.
ROCM_AITER_FA supports both flash and shuffled KV cache layouts. This PR covers both of them and uses the same flag to control the respective KV cache layouts.
## Test Plan Accuracy test with lm_eval
…
-
#33948 [Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache — bug,rocm,ready — by Rohan138 (合并于: 2026-02-12 00:12 (UTC+8)) [💬7 | +27/-52, 1 files | commented:2 approved:3]
Reuse
`AttentionBackend` utils to initialize the dummy KV cache with the required shape and stride. Also, it turns out this UT was being skipped on ROCm because of a typo/missing rename, so `BACKENDS` -> `BACKENDS_FP8`.
## Purpose
## Test Plan
## Test Result
…
-
#34352 Don’t try and run GLM-ASR with remote code — 无标签 — by hmellor (合并于: 2026-02-12 00:09 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1] GLM-ASR has been upstreamed to Transformers in https://github.com/huggingface/transformers/commit/a7f29523361b2cc12e51c1f5133d95f122f6f45c which was first released in
v5.0.0rc2. Therefore we should remove this kwarg from our test. (The test will not actually run in this PR because it's still gated on Transformers v5.)
- #34043 Reapply [Attention][FA3] Update FA3 to include new swizzle optimization — performance,ready,ci/build,v1,nvidia — by LucasWilkinson (合并于: 2026-02-11 23:07 (UTC+8)) [💬1 | +60/-44, 6 files | commented:1 approved:1] Reapply https://github.com/vllm-project/vllm/pull/23465 after revert in https://github.com/vllm-project/vllm/pull/33841 but with correct metadata sizes
-
#34264 Make JAIS compatible with Transformers v5 — ready — by hmellor (合并于: 2026-02-11 20:30 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1]
`add_cross_attention` was an attribute of `PreTrainedConfig` that was removed in https://github.com/huggingface/transformers/pull/41541. This attribute was never explicitly in the JAIS config, and the checkpoints we use for testing in CI do not set it.
-
#34268 Responses harmony system message structured — frontend,ready,gpt-oss — by Kimahriman (合并于: 2026-02-11 21:14 (UTC+8)) [💬1 | +43/-6, 2 files | commented:5 approved:2] ## Purpose Resolves https://github.com/vllm-project/vllm/issues/34237
Fix an issue where the Responses API fails when structured system message content is passed.
## Test Plan New test verifying the behavior.
## Test Result Not able to get tests to run locally. …
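Structured message content is a list of typed parts rather than a plain string; a hedged sketch of normalizing both forms (the helper name is illustrative, not vLLM's actual code):

```python
def system_content_to_text(content) -> str:
    # OpenAI-style message content may be a plain string or a list of
    # structured parts like {"type": "text", "text": "..."}. Code that
    # assumes a plain string fails on the structured form.
    if isinstance(content, str):
        return content
    return "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )

print(system_content_to_text("be brief"))                            # be brief
print(system_content_to_text([{"type": "text", "text": "be brief"}]))  # be brief
```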
-
#33715 [NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE — ready,nvidia — by Linda-Stadter (合并于: 2026-02-11 20:38 (UTC+8)) [+296/-1, 7 files | commented:5 approved:2 changes:1] ## Purpose Adding tests for trtllm bf16 moe backend added in PR [NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe #32954
## Test Plan
- unit and integration test for the new moe backend
- unit tests for utility functions
- Change E2E tests to use fi cutlass, because the E2E tests triggered an intermittent issue with flashinfer
## Test Result
…
-
#34262 Make Qwen3VL compatible with Transformers v5 — ready,qwen — by hmellor (合并于: 2026-02-11 20:13 (UTC+8)) [💬4 | +25/-37, 2 files | commented:2 approved:1] The location of
`tie_word_embeddings` has moved in Transformers v5 for multimodal models to a more sensible place (see https://github.com/vllm-project/vllm/pull/33359 for the full explanation). This PR updates the Qwen3VL implementation to leverage the fix made in https://github.com/vllm-project/vllm/pull/33359 and moves the `vision_config`-specific check that was taking place in `Qwen3LLMModel` to `Qwen3VLForConditionalGeneration`, where the vision config is available.
-
#34321 [Bugfix][CPU] Fix llama4 inference on CPU — bug,ready,v1,llama,cpu — by bigPYJ1151 (合并于: 2026-02-11 19:07 (UTC+8)) [💬3 | +60/-18, 6 files | commented:1 approved:1]
## Purpose
- Fix llama4 inference on CPU
## Test Plan
## Test Result
…
-
#34255 [Docs] Reduce time spent generating API docs — rocm,ready,v1,multi-modality,cpu,gpt-oss,nvidia — by hmellor (合并于: 2026-02-11 18:56 (UTC+8)) [💬3 | +50/-2, 25 files | commented:1 approved:1] All time reductions are measured in my local environment which takes 520s to build the docs normally.
Changes:
- Update `git-revision-date-localized.exclude`: `argparse/*` -> `generated/*` (these were moved in the past)
  - Add `api/*` - these are always generated at build time, so the revision dates always fall back to the build date, which is meaningless
  - This saves ~20s
- Remove `show_if_no_docstring: true`:
  - "if at least one of its direct or indirect members (lower in the tree) ha…
-
#34253 Patch protobuf for CVE-2026-0994 — ready,ci/build — by eicherseiji (合并于: 2026-02-11 18:25 (UTC+8)) [+2/-2, 2 files | commented:5 approved:2] ## Purpose
Pin protobuf to exclude versions vulnerable to CVE-2026-0994.
#33619 patched this for the v0.15.1 release branch but main is still unpinned.
## Changes
- `requirements/build.txt`: pin `protobuf >= 5.29.6` and exclude vulnerable 6.30.0–6.33.4
- `requirements/common.txt`: same …
- #34319 [Doc] Update Marlin support matrix for Turing — documentation,ready — by iori2333 (合并于: 2026-02-11 17:03 (UTC+8)) [💬1 | +5/-4, 2 files | commented:1 approved:2] #29901 adds Marlin support for SM75 devices. This PR updates the support matrix of Marlin documentation.
-
#34308 [Chore] Move
`BaseRenderer` to `base.py` — ready — by DarkLight1337 (合并于: 2026-02-11 16:29 (UTC+8)) [+7/-7, 8 files | commented:1 approved:1] ## Purpose
It’s not a protocol anymore.
## Test Plan
## Test Result
…
-
#34111 [XPU][9/N] clean up existing ipex code/doc — documentation,ready,ci/build,v1,cpu — by jikunshang (合并于: 2026-02-11 16:27 (UTC+8)) [💬2 | +16/-50, 10 files | commented:1 approved:1] ## Purpose Part of https://github.com/vllm-project/vllm/issues/33214: clean up ipex and use xpu_ops instead. Also update the XPU documents.
## Test Plan CI
## Test Result
…
-
#33247 [model] support FunASR model — documentation,new-model,ready,qwen — by AllenDou (合并于: 2026-02-11 15:37 (UTC+8)) [💬9 | +1585/-3, 7 files | commented:10] Hi @WoosukKwon @DarkLight1337 , this PR adds support for the FunASR model. Could you please take a look?
server: `vllm serve allendou/Fun-ASR-Nano-2512-vllm -tp=2 --dtype=float32` (use `--dtype=float32` to achieve the highest accuracy)
client: `python3 openai_transcription_client.py --repetition_penalty=1.0` …
-
#34116 [CPU] Enable FP16 (Half dtype) support for s390x — ready,cpu — by R3hankhan123 (合并于: 2026-02-11 14:41 (UTC+8)) [💬1 | +244/-9, 3 files | commented:1 approved:1] ## Purpose Adds FP16 model inference support for the s390x (IBM Z) architecture using vectorized bit manipulation. FP16 was previously disabled on s390x, limiting users to BF16 or FP32.
## Test Plan
- Build docker image and run the server with
dtype=half - Send an inference request to the server ## Test Result
``` [root@b314lp81 vllm]# docker run --rm -p 8000:8000 local:test ibm-granite/granite-4.0-micro --port=8000 --dtype=half …
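The half-precision support rests on bit-level fp16-to-fp32 conversion; here is a scalar Python sketch of the idea (normal numbers only; the real kernels are vectorized s390x code):

```python
import struct

def fp16_bits_to_float(h: int) -> float:
    # Decode IEEE-754 binary16 fields and rebuild a binary32 value.
    # Covers normal numbers only, for brevity (no subnormals/inf/nan).
    sign = (h >> 15) & 0x1
    exponent = (h >> 10) & 0x1F
    mantissa = h & 0x3FF
    # Re-bias the exponent (15 -> 127) and widen the mantissa (10 -> 23 bits).
    f32_bits = (sign << 31) | ((exponent - 15 + 127) << 23) | (mantissa << 13)
    return struct.unpack("<f", struct.pack("<I", f32_bits))[0]

print(fp16_bits_to_float(0x3C00))  # 1.0
print(fp16_bits_to_float(0xC000))  # -2.0
```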
-
#34313 [Bugfix] Fix weight naming in Qwen3.5 — bug,qwen — by ywang96 (合并于: 2026-02-11 13:37 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
## Purpose
## Test Plan
## Test Result
... - #34298 [ModelBash][DSR1 NVFp4] Avoid Bf16 Bias Cast — performance,ready,deepseek,nvidia — by robertgshaw2-redhat (合并于: 2026-02-11 13:00 (UTC+8)) [+4/-7, 1 files | commented:4 approved:2]
## Purpose
- avoid bf16 bias conversion on the hotpath
## Test Plan
- lm eval
## Test Result ``` |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr| |—–|——:|—————-|—–:|———–|—|—-:|—|—–:| …
-
#34013 Threshold fix wvSplitk for occasional CI fails — rocm,ready,ci/build — by amd-hhashemi (合并于: 2026-02-11 11:59 (UTC+8)) [💬2 | +5/-2, 1 files | commented:2 approved:2]
## Purpose
## Test Plan
## Test Result
... -
#34149 [Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 — bug,performance,ready — by mgehre-amd (合并于: 2026-02-11 11:58 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1 approved:1] ## Purpose
Fix
`benchmarks/kernels/benchmark_moe.py` crashing with an `AssertionError` on torch >= 2.9.
After #33375 moved the `disable_inplace()` guard from inside `fused_experts_impl`/`dispatch_fused_experts_func` to an assertion at the `fused_experts` entry point, the benchmark was not updated and still unconditionally passes `inplace=True`. This triggers: `assert not inplace or not disable_inplace()` -> `AssertionError` …
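The interplay can be sketched as follows; `disable_inplace` here is a hardcoded stand-in for vLLM's guard, and conditioning `inplace` on it is one way to honor the assertion, not necessarily the PR's exact change:

```python
def disable_inplace() -> bool:
    # Stand-in for the guard described above (active on torch >= 2.9
    # in the real code); hardcoded here for illustration.
    return True

def fused_experts(x: list, inplace: bool = False) -> list:
    # The entry-point assertion from the PR description: in-place
    # execution may only be requested when the guard allows it.
    assert not inplace or not disable_inplace()
    return [v * 2 for v in x]

# Before the fix, the benchmark unconditionally passed inplace=True,
# tripping the assertion. Honoring the guard avoids it:
out = fused_experts([1, 2], inplace=not disable_inplace())
print(out)  # [2, 4]
```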
-
#34236 [Plugin] Simplify IO Processor Plugin interface — documentation,frontend,ready — by DarkLight1337 (合并于: 2026-02-11 11:47 (UTC+8)) [💬8 | +167/-151, 9 files | commented:6 approved:1]
## Purpose
- Rename `parse_request` -> `parse_data` and pass `request.data` to it so there is no need to distinguish between offline and online APIs.
- For the same reason, deprecate `output_to_response`, since we can construct the response automatically from the output data and the provided request ID.
- Split up `validate_or_generate_params` into `merge_sampling_params` and `merge_pooling_params` to make type annotation easier.
All changes are back-compatible until v…
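A rename that stays back-compatible typically follows the standard deprecation-shim pattern; a hedged sketch (class and field names are illustrative, not the actual plugin interface):

```python
import warnings
from dataclasses import dataclass

@dataclass
class Request:
    data: dict

class IOProcessor:
    def parse_data(self, data: dict) -> dict:
        # New entry point: takes the data directly, so offline and
        # online callers no longer need separate code paths.
        return {"parsed": data}

    def parse_request(self, request: Request) -> dict:
        # Deprecated shim kept for back-compatibility until removal.
        warnings.warn(
            "parse_request is deprecated; use parse_data", DeprecationWarning
        )
        return self.parse_data(request.data)

proc = IOProcessor()
print(proc.parse_data({"x": 1}))  # {'parsed': {'x': 1}}
```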
[关闭未合并 PR]
-
#26805 Fix seed reproducibility issue by adding output.copy_(out) — stale — by XuanofXXX (关闭于: 2026-02-12 10:17 (UTC+8)) [💬4 | +1/-0, 1 files | commented:1]
## Purpose This PR fixes a seed reproducibility issue in vLLM's MambaMixer2. Despite setting the seed in the script, the results were not consistent across multiple runs. The issue was traced to `mamba_mixer2.py`, where the missing `output.copy_(out)` line prevented the correct output from being written to the buffer, causing unstable results.
## Test Plan Added the line `output.copy_(out)` to the V1 configuration section in `mamba_mixer2.py` to ensure stable se…
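The failure mode (results never written into the caller-provided buffer) can be illustrated in plain Python; `output.copy_(out)` is the torch analogue of the in-place write shown here:

```python
def mixer_forward(output: list, compute) -> None:
    out = compute()
    # The fix: write results into the caller-provided buffer in place.
    # Without this line, `output` keeps whatever stale values it held,
    # which is the source of the nondeterminism described above.
    output[:] = out  # analogous to output.copy_(out) on torch tensors

buffer = [0.0, 0.0, 0.0]  # caller-provided, possibly uninitialized
mixer_forward(buffer, lambda: [1.0, 2.0, 3.0])
print(buffer)  # [1.0, 2.0, 3.0]
```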
-
#33518 [Bugfix] Fix NVFP4 MoE weight shapes for non-gated MLPs (Nemotron-Nano) — bug,ready,nvidia — by Code4me2 (关闭于: 2026-02-12 03:01 (UTC+8)) [💬10 | +249/-18, 5 files | commented:5 approved:1] ## Summary
Fix weight shape calculation in
`prepare_static_weights_for_trtllm_fp4_moe()` for models using non-gated MLPs like Nemotron-Nano.
## Problem
Loading
`nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4` fails with: `RuntimeError: shape '[64, 8192, 2048]' is invalid for input of size 536870912` …
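The size mismatch in that error is exactly a factor of two, which is consistent with a gated-MLP shape (fused gate+up, i.e. 2x the intermediate rows) being applied to a non-gated model; a quick check, with the numbers taken from the error message and the interpretation hedged:

```python
# Expected view shape from the error: [64 experts, 8192, 2048].
gated_numel = 64 * 8192 * 2048   # assumes fused gate+up (2 * intermediate)
actual_numel = 536_870_912       # the tensor size the loader actually saw

print(gated_numel)                  # 1073741824
print(gated_numel // actual_numel)  # 2 -> non-gated MLP has half the rows
```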
-
#34368 Add initial devcontainer configuration — 无标签 — by vagabond2522 (关闭于: 2026-02-12 02:57 (UTC+8)) [+4/-0, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... - #34360 Revert “[MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner” — documentation,v1 — by robertgshaw2-redhat (关闭于: 2026-02-12 01:51 (UTC+8)) [💬1 | +753/-913, 25 files | commented:1] Reverts vllm-project/vllm#32344
- #34184 [Online Quantization] Support memory-efficient online quantization via layerwise loading — 无标签 — by kylesayrs (关闭于: 2026-02-12 00:40 (UTC+8)) [+77/-185, 5 files | commented:1 | 📝草稿]
## Purpose ##
- Support online quantization in a more maintainable way by integrating with existing layerwise processing functionality
## Changes ##
- Change layerwise logic to only copy and re-place into kernel tensors if reloading
## Weights with initialized values ## Handling weights which require values placed at init time is a little tricky. One example is rotary embeddings, whose values are created at init time and are not loaded from disk. In order to avoid overwriting these values wi…
-
#33061 [KV MultiConnector]: Add out-of-band handshake metadata get/set functions — kv-connector — by snadampal (关闭于: 2026-02-12 00:13 (UTC+8)) [💬5 | +20/-0, 1 files | commented:1] ## Purpose A MultiConnector with NixlConnector as one of the sub-connectors was failing to complete handshake between two nodes due to (1) missing xfer metadata for the NixlConnector and (2) Nixl handshake listener thread not started. This commit adds those missing get/set xfer metadata functions for kv multiconnector to propagate the calls to its sub-connectors.
## Test Plan Tested vllm P-D disaggregated inference with KV multiconnector with Nixl and LMCache sub-connectors
## Test Result P-D …
-
#34325 [Bugfix] fix device_name for routing replay — bug — by Li-Yongwen (关闭于: 2026-02-11 20:38 (UTC+8)) [💬1 | +2/-1, 1 files | commented:1]
## Purpose
[Bugfix] fix device_name for routing replay
## Test Plan
## Test Result
…
-
#30917 [Feature] Support using FlexKV as another KV Cache Offloading option — documentation,v1,kv-connector — by axxx03 (关闭于: 2026-02-11 17:49 (UTC+8)) [💬8 | +383/-0, 3 files | commented:9 changes:1] # Description FlexKV is a distributed KV store and multi-level cache management system developed by Tencent Cloud's TACO team in collaboration with the community, designed for large-scale LLM inference scenarios. FlexKV leverages multi-level caching to enable inference engines to achieve higher throughput and lower latency.
In our case, when integrated with FlexKV, we can achieve the following improvements:
- ISL=21K, OSL=1K, batch_size=8: TTFT decreases by 60%, TPOT increases by 13%, and QPM i…
-
#34318 [bugfix] Fix Dynamo unexpected keyword argument — bug,documentation,performance,new-model,rocm,frontend,speculative-decoding,ci/build,v1,multi-modality — by samutamm (关闭于: 2026-02-11 16:15 (UTC+8)) [💬3 | +6279/-1686, 114 files | commented:1] ## Purpose
Fix QuantFP8 with torch.compile on ROCm when CustomOP
`quant_fp8` is disabled with `--compilation-config '{"custom_ops": ["-quant_fp8"]}'`. The current main branch raises an error:
``` (EngineCore_DP0 pid=565) File "/app/vllm/vllm/v1/executor/multiproc_executor.py", line 375, in collective_rpc (EngineCore_DP0 pid=565) return aggregate(get_response()) (EngineCore_DP0 pid=565) ^^^^^^^^^^^^^^ …
-
#22599 [Feature] Improve logging for error messages — documentation,needs-rebase,unstale,v1 — by elizabetht (关闭于: 2026-02-11 12:00 (UTC+8)) [💬5 | +169/-15, 1 files | commented:4] # Essential Elements of an Effective PR Description Checklist
- The purpose of the PR, such as “Fix some issue (link existing issues this PR will resolve)”.
- The test plan, such as providing test command.
- The test results, such as pasting the results comparison before and after, or e2e results
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
## Purpose
…