[vLLM GitHub Development Digest] 2026-01-08
[Overview]
- Time window: 2026-01-08 10:53 (UTC+8) ~ 2026-01-09 10:53 (UTC+8)
- New issues: 13 (label distribution: bug:6, feature request:3, torch.compile:1, performance:1, usage:1)
- Closed issues: 37
- New PRs: 57 (label distribution: ready:23, v1:20, rocm:8, nvidia:6, ci/build:6)
- Merged PRs: 58
- PRs closed without merging: 24
[New issues]
-
#32006 [Bug]: vLLM hangs on specific request each time (qwen-coder-480b-fp8) — bug — by Yanpas (created: 2026-01-09 09:52 (UTC+8)) ### Your current environment
Can't do it. It's RH OpenShift, GPU H200, CUDA 12.9. Model: Qwen Coder 480B FP8. vLLM params: --enable-expert-parallel --data-parallel-size=8 ... -
#32007 [Feature]: Speculating with a draft model — feature request — by liudl85 (created: 2026-01-09 09:55 (UTC+8)) ### 🚀 The feature, motivation and pitch
In docs, I found this Speculating with a draft model
While in codes, I also found this https://github.com/vllm-project/vllm/blob/main/vllm/config/speculative.py#L379-L384
I’m really confused about whether vLLM supports speculation with a d…
-
#32004 [Bug]: When running the model on an RTX 5090 GPU, the model starts successfully, but the CPU usage remains at 100%. — bug — by ljwps (created: 2026-01-09 09:21 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (environment dump truncated). -
#31963 [Bug]: Inference results of Qwen3 VL 32B deployed with vLLM are inconsistent with transformers — bug — by yy19941128 (created: 2026-01-08 18:54 (UTC+8)) [💬4] ### Your current environment
Both the offline and online vLLM deployments serve a fine-tuned Qwen3 VL 32B model, but the inference results differ from those produced by transformers. After investigation, the prompts after chat-template alignment are identical, and the token_ids after tokenization are also identical. Could the vision encoder behave differently? The vLLM version is 0.11.0 and the transformers version is 4.57.0. The VL task is OCR; for many images one or two characters of the OCR result differ, with transformers mostly correct and both the offline and online vLLM versions wrong.
### 🐛 Describe the bug
(same description as above) …
-
#31966 [Feature]: Proposal: Optional Admin Control Plane API + CLI for Safe Production Operations — feature request — by khajamoddin (created: 2026-01-08 20:33 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
Introduce a minimal, optional admin control plane for vLLM, consisting of:
A small HTTP Admin API for inspecting runtime state and performing safe lifecycle operations
A thin CLI reference client (vllm admin …) built on top of this API
This control plane is strictly for production operations, not inference, and is:
…
-
#31990 [Bug]: [H200] Qwen3-Next-80B-A3B-Instruct-FP8 TP1 DP4 EP4 CUDA illegal memory error — bug — by jhaotingc (created: 2026-01-09 03:34 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py` (environment dump truncated). -
#31985 [Feature]: Unwrap FusedMoE custom op — feature request,torch.compile — by ProExpertProg (created: 2026-01-09 01:26 (UTC+8)) ### 🚀 The feature, motivation and pitch
Currently, the whole `FusedMoE` layer is wrapped in the `fused_moe`/`fused_moe_shared` custom ops. This is mostly due to chunking, but it prevents torch.compile-based optimizations, increases CPU overhead, and makes it more likely for inefficiencies (copies, sequences of torch ops, etc.) to pop up unexpectedly. It also increases complexity in many cases, as we have to write custom kernels for code that could otherwise be implemented in native torch and be effi… -
#31975 [New Model]: Qwen3Guard Stream — no label — by Wildshire (created: 2026-01-08 22:33 (UTC+8)) ### The model to consider.
Hello
I was wondering if the Qwen3Guard stream family is going to be supported by vLLM. Closest issue / PR related to this topic I found in the repo is this PR #25463
HF models:
-
#31957 [Bug]: Nemotron Nano V3 FP8 - Expected 'silu' activation but got relu2_no_mul — bug — by danisereb (created: 2026-01-08 17:30 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (uv is set; environment dump truncated). -
#31961 [Performance]: EPD Disaggregation Performance Testing Scripts — performance — by Adenialzz (created: 2026-01-08 18:25 (UTC+8)) ### Proposal to improve performance
Hi, I noticed that epd disaggregation is now ready in vLLM. And the blog shows encouraging results. But we failed to get the expected performance improvement and want to troubleshoot the potential errors. Could you please provide complete testing scripts for reproduction? Thanks in advance.
### Report of performance regression
No response
### Misc discussion on performance
…
- #31959 [Usage]: Cannot Run vLLM Docker Container as Non-Root User — usage — by tahmid-therap (created: 2026-01-08 17:54 (UTC+8)) I am trying to build a docker image from the vllm/vllm-openai:v0.12.0 image which will allow me to run the vllm server as a defined non-root user and write the log file under that user's ownership. However, I keep getting errors during startup regarding library file access inside the container. I have attempted a brute-force approach by changing ownership and permissions of directories like /usr or /root, but I still get similar errors. The following is my raw dockerfile, apo…
-
#31954 [Bug] OLMo3 reasoning parser fails to detect </think> end tag, preventing GCD activation — no label — by ivnle (created: 2026-01-08 16:11 (UTC+8)) [💬1] ### Description
The OLMo3 reasoning parser (`olmo3_reasoning_parser.py`) fails to correctly detect when thinking ends, causing all generated tokens to be classified as thinking tokens. This prevents grammar-constrained decoding (GCD) from activating for the post-thinking structured output.
### Evidence
From experiments with OLMo3-7B-Think + GCD + thinking=high on the math500 task:
Question 0 (Polar coordinates conversion): (truncated JSON output) …
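The failure mode described above (no end-tag match, so every token counts as thinking) can be sketched with a minimal hypothetical helper; this is illustrative only, not the actual `olmo3_reasoning_parser.py` code:

```python
def split_reasoning(text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split model output into (reasoning, answer) at the end tag.

    If the parser never matches end_tag, everything stays classified as
    reasoning, which is the failure the issue describes: GCD for the
    post-thinking structured output then never activates.
    """
    if end_tag in text:
        reasoning, _, answer = text.partition(end_tag)
        return reasoning, answer
    # No end tag seen: all tokens are still treated as "thinking"
    return text, ""
```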
-
#31940 [Bug]: — bug — by EgoistaCercis (created: 2026-01-08 11:22 (UTC+8)) ### Your current environment
The output of `python collect_env.py` (template placeholder; not filled in)…
[Closed issues]
-
#20170 [Bug]: RuntimeError: CUDA error: an illegal memory access was encountered — bug,stale — by xiaocode337317439 (closed: 2026-01-09 10:27 (UTC+8)) [💬7] ### Your current environment
vllm/vllm-openai:v0.9.1
The output of `python collect_env.py` (environment dump truncated). -
#21047 [RFC]: Migrate vLLM Docs back to Sphinx for Localization — RFC,stale — by hwhsu1231 (closed: 2026-01-09 10:27 (UTC+8)) [💬9] ### Motivation.
To localize the vLLM Documentation.
Recently, I officially published my 1st localization project, `cmake-docs-l10n`, on GitHub:
- Preview: https://localizethedocs.github.io/cmake-docs-l10n
- Crowdin: https://localizethedocs.crowdin.com/cmake-docs-l10n
- GitHub: https://github.com/localizethedocs/cmake-docs-l10n
…
-
#21466 [Bug]: FP8 model crashes with EngineDeadError and CUDA illegal memory access on H100 (CUDA 12.8) — bug,stale — by marcin-brzezanski (closed: 2026-01-09 10:27 (UTC+8)) [💬12] ### Your current environment
The output of `python collect_env.py` (truncated; OS: Ubuntu 22.04.5 LTS, x86_64). -
#21481 [Feature]: Multiple models one server — feature request,stale — by botirk38 (closed: 2026-01-09 10:27 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Hey, I'm working on a project and want to run my own inference infra for small models. I just want one inference server running multiple models; to do this I had to set up my own model management, loading and unloading models into memory when appropriate. It would be cool if vLLM had this; I would be happy to make a PR myself if you are OK with it.
### Alternatives
No response
### Additional context
…
-
#22093 [Bug]: update the kv_connector from v0 to v1 in example — bug,stale — by 1mujue (closed: 2026-01-09 10:27 (UTC+8)) [💬2] ### Your current environment
============================== System Info ============================== OS : Ubuntu 24.04.1 LTS (x86_64) GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Clang version : 18.1.3 (1ubuntu1) CMake version : version 4.0.3 Libc version : glibc-2.39 …
-
#23009 [Feature]: Plugin framework for out-of-tree custom checkpoint loading — feature request,stale,rl — by 22quinn (closed: 2026-01-09 10:26 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
### Motivation Currently, vLLM primarily supports HuggingFace-format checkpoints along with a few built-in custom loaders such as mistral. However, some proprietary reinforcement learning (RL) systems use custom checkpoint and weight formats along with unique loading mechanisms, which creates challenges for adopting vLLM in these environments. Support is required in two key areas (they share some common path):
- Initial custom weight loading during rollo…
-
#23227 [Feature][Responses API] Support tool_choice other than "auto" — feature request,stale — by heheda12345 (closed: 2026-01-09 10:26 (UTC+8)) [💬13] ### 🚀 The feature, motivation and pitch
The current responses API only supports tool_choice="auto". For other tool_choice values, e.g. tool_choice="required", we need to properly enable structured output. See the OpenAI API doc for tool_choice here: https://platform.openai.com/docs/guides/function-calling#tool-choice
### Alternatives
No response
### Additional context
…
-
#23582 [Bug]: vLLM server timeout due to multiprocessing communication error — bug,stale — by shaamil101-etched (closed: 2026-01-09 10:26 (UTC+8)) [💬11] ### Your current environment
I ran
sudo docker pull vllm/vllm-openai:latest
``` sudo docker run -d
--gpus all
--name vllm-8b-bf16-b200
… -
#24142 [Feature]: support Hunyuan-MT-Chimera-7B and HunYuanDenseV1ForCausalLM — feature request,stale — by devops724 (closed: 2026-01-09 10:26 (UTC+8)) [💬8] ### 🚀 The feature, motivation and pitch
Hi, Tencent recently released https://github.com/Tencent-Hunyuan/Hunyuan-MT; it would be great if vLLM supported this model.
### Alternatives
No response …
-
#24237 [Bug]: Xformers is not available, falling back, even though I have Xformers installed — bug,stale — by Magmanat (closed: 2026-01-09 10:26 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py` (truncated; Python 3.10.12, accelerate 1.10.0, …). -
#24272 [Bug]: Multi-node DeepSeek-V3-0324 errors out with CUDA Illegal Memory Access — bug,stale — by amanshanbhag (closed: 2026-01-09 10:26 (UTC+8)) [💬5] ### Your current environment
Since this is a Slurm environment, I ran: `srun -N 1 --container-image /fsx/ubuntu/vLLM-testing/vllm-ep.sqsh bash -c "wget https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py && python3 collect_env.py"` (output truncated). -
#24304 [Bug]: inference with Qwen 2.5 Omni error: vllm.worker.model_runner_base.InputProcessingError: Failed to prepare inputs for sequence group with request id: 0, Error: index 0 is out of bounds for dimension 0 with size 0 — bug,stale — by qq31415926 (closed: 2026-01-09 10:25 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py` (truncated; OS: Ubuntu 20.04.6 LTS, x86_64). -
#24436 [Bug]: V1 backend crash when using Qwen3 series model — bug,stale — by nkh0472 (closed: 2026-01-09 10:25 (UTC+8)) [💬10] ### Your current environment
The output of `python collect_env.py` (truncated; OS: Ubuntu 20.04.6 LTS, x86_64). -
#24439 [Feature]: Support using their own max_num_seqs in prefill and decode stages in pd hybrid scenario — feature request,stale — by Liccol (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
vLLM currently uses a single max_num_seqs parameter to control the batch size for both prefill and decode stages. Maybe support for PD hybrid scenarios because:
- Prefill stage: Requires smaller batches due to higher computational intensity and memory usage per request
- Decode stage: Can handle larger batches because of lower computational requirements per request
- Resource utilization: Fixed batch size leads to either underutilization (decode) or overl…
-
#24449 [Usage]: How to quantize a huggingface model to mxfp4 format and infer via vllm — usage,stale — by charyang-ai (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (not provided).
### How would you like to use vllm
I want to run inference of a specific model. I don't know how to integrate it with vllm. …
-
#24459 [Bug]: Vllm Deepseek-R1 Reasoning Parser error — bug,stale — by access2rohit (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (uv is set; environment dump truncated). -
#24475 [Bug]: Some benchmark results for model whisper-large-v3-turbo are zero — bug,stale — by hnt2601 (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (template placeholder; not filled in)…
-
#24479 [Feature]: responses API conversation feature — feature request,stale — by questcollector (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
When you call the responses API with the store feature enabled, the msg content is stored in msg_store. Currently msg_store stores content items based on response id key, but it seems that if we manage items by generating conversation id, we can maintain the same context for multiple responses API requests.
- Create a conversation ID and add it to the response body
- Change the key of msg_store to the conversation ID from response_id.
### Alternatives
…
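The proposed re-keying could be sketched roughly as follows. This is a hypothetical in-memory illustration only; the names `create_conversation` and `append_response` are invented here, and the real `msg_store` lives inside vLLM's responses API implementation:

```python
import uuid

# Hypothetical store keyed by conversation id instead of response id
msg_store: dict[str, list[dict]] = {}

def create_conversation() -> str:
    """Create a conversation id (proposed to be returned in the response body)."""
    conv_id = f"conv_{uuid.uuid4().hex}"
    msg_store[conv_id] = []
    return conv_id

def append_response(conv_id: str, items: list[dict]) -> None:
    # Successive responses API calls with the same conversation id
    # would then share the same stored context.
    msg_store[conv_id].extend(items)
```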
-
#24494 [Feature]: Integration of enable_thinking parameter for Async LLM Generation for dynamic generation — feature request,stale — by Dammerzone (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
Hi guys,
I found the possible use of the extra_body arg "enable_thinking" in V1 with the llm.chat() method.
However, this feature doesn't seem to be supported by the generate() method of the async generation API (according to the bot). For now it needs to be passed through a tokenizer parameter at engine initialization.
It would be a great improvement to support dynamic thinking based on a request parameter instead of engine initialization.
…
-
#24507 [Bug]: vllm/vllm-openai image has a stdlib with many CVEs, update stdlib — bug,stale — by rubenporras (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
vllm/vllm-openai image has a stdlib with many CVEs.
Please upgrade stdlib.
### 🐛 Describe the bug …
-
#24508 [Bug]: vllm/vllm-openai image has a linux-libc-dev with many CVEs, update linux-libc-dev — bug,stale — by rubenporras (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
vllm/vllm-openai image has a linux-libc-dev with many CVEs;
please update linux-libc-dev…
-
#24550 [Bug]: the LongCat-Flash model is still not supported — bug,stale — by QiyaoHuang (closed: 2026-01-09 10:25 (UTC+8)) [💬4] ### Your current environment
The output of `python collect_env.py` (template placeholder; not filled in)…
-
#24555 [Feature]: Add New Goodput Metrics in benchmark_serving.py — feature request,stale — by xiaoqun2011 (closed: 2026-01-09 10:25 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Suggested Content:
Currently, the output only includes Request Goodput (requests per second), which limits the ability to evaluate Service-Level Objective (SLO) performance on specific GPU systems. To enhance the granularity of performance analysis, we propose adding the following metrics:
- Output Good Throughput: measures the number of good tokens generated per second.
- Total Good Throughput: measures the total number of good tok…
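The proposed metrics could be computed along these lines. This is a hedged sketch: the field names, the SLO definition (a request is "good" if it met both latency targets), and the function itself are assumptions for illustration, not benchmark_serving.py's actual code:

```python
def goodput_metrics(requests, ttft_slo_s, tpot_slo_s, duration_s):
    """Compute request goodput plus the proposed token-level goodput metrics.

    A request counts as "good" if it met both latency SLOs (assumed definition).
    """
    good = [r for r in requests
            if r["ttft"] <= ttft_slo_s and r["tpot"] <= tpot_slo_s]
    return {
        "request_goodput": len(good) / duration_s,  # existing metric (req/s)
        # Proposed: good output tokens per second
        "output_good_throughput": sum(r["output_tokens"] for r in good) / duration_s,
        # Proposed: total good tokens (input + output) per second
        "total_good_throughput": sum(r["input_tokens"] + r["output_tokens"]
                                     for r in good) / duration_s,
    }
```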
-
#24557 [Bug]: Intermittent Service Downtime Issue with Magistral-Small-2506 Model on GPU VM — bug,stale — by Jimmy888-333 (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (environment dump truncated). -
#24558 [Bug]: Unexpected service down of Llama-4-Scout-17B-16E-Instruct model on GPU-enabled VM — bug,stale — by Jimmy888-333 (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py` (environment dump truncated). -
#24587 [Bug]: using disagg_example_p2p_nccl_xpyd.sh, only one of the disaggregated P and D instances starts successfully each time; the other fails with the error below — bug,stale — by ldh127 (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Your current environment
@KuntaiDu please take a look. [Bug]: using disagg_example_p2p_nccl_xpyd.sh, only one of the disaggregated P and D instances starts successfully each time; the other fails with the error below.
INFO 09-10 21:20:09 [init.py:1152] Found nccl from library libnccl.so.2 Traceback (most recent call last): File "/opt/ac2/bin/vllm", line 8, in
sys.exit(main()) ^^^^^^ ... -
#24590 [CI Failure]: CUTLASS MLA decode is flaky — stale,ci-failure — by MatthewBonanni (closed: 2026-01-09 10:25 (UTC+8)) [💬2] ### Name of failing test
tests/kernels/test_cutlass_mla_decode.py::test_cutlass_mla_decode[torch_dtype1-False-True-64-512-576-1-16-4096-1-128]
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#24591 [Bug]: `vllm bench serve --random-input-length=1` generates an empty prompt — bug,stale — by smarterclayton (closed: 2026-01-09 10:25 (UTC+8)) [💬3] ### Your current environment
Since at least the last few weeks (mid August), but probably much further back.
### 🐛 Describe the bug
While benchmarking P/D I realized that `--random-input-len=1` results in an empty prompt for some models, specifically DeepSeek-R1. This is due to this chunk in the random dataset: …
-
#30859 [Bug]: set_current_vllm_config() is only done during the initialization stage but not the runtime stage — bug — by nvpohanh (closed: 2026-01-09 07:20 (UTC+8)) [💬7] ### Your current environment
Any env
### 🐛 Describe the bug
# Issue Statement
Currently, `set_current_vllm_config()` is only done during the initialization stage but not the runtime stage. If the code tries to call `get_current_vllm_config()`, vLLM prints a warning "Current vLLM config is not set." and returns a default config.…
-
#26110 [RFC]: CI metrics dashboard improvements — RFC,stale — by rzabarazesh (closed: 2026-01-09 04:29 (UTC+8)) [💬2] ### Motivation.
Our CI metrics are currently fragmented and we utilize multiple dashboards. We also lack several key metrics such as % of force merges or PR cycle times. We currently have a dashboard tracking some important CI metrics as well as buildkite UI. We seek to improve upon that by integrating vLLM with the comprehensive set of features offered by [pytorch HUD](https://hud.pytorch.org…
-
#31912 MoE LoRA Kernel Correctness Verification Report — no label — by qywu (closed: 2026-01-09 02:52 (UTC+8)) ## Summary
This issue documents the comprehensive correctness verification of the fused MoE LoRA Triton kernel (`vllm/lora/ops/triton_ops/fused_moe_lora_op.py`).
## Test Methodology
### 1. Kernel-Level Testing Compared the fused MoE LoRA kernel output against a reference implementation that computes: ```python # For each token with lora_idx and expert_id: …
-
#31876 [CI Failure]: backend_xgrammar.py: Failed to advance FSM for request — ci-failure — by BlankRH (closed: 2026-01-09 00:40 (UTC+8)) [💬4] ### Name of failing test
v1/e2e/test_async_scheduling.py::test_with_spec_decoding
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
-
#23557 [New Model]: Grok 2 — good first issue,new-model — by mgoin (closed: 2026-01-08 23:57 (UTC+8)) [💬18] ### The model to consider.
https://huggingface.co/xai-org/grok-2
### The closest model vllm already supports.
Grok 1 has already been supported (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/grok1.py) and it seems Grok 2 is a few modifications on top of it - still listing its architecture as
`Grok1ForCausalLM` in the config.json. You can reference this sglang PR that implements the changes to the Grok1 d…
-
#31859 [Bug]: Intermittent 500 Internal Server Error When Stress-Testing Qwen3-VL-2B-Instruct with vLLM on H20 — bug — by AlpacaKnight (closed: 2026-01-08 21:28 (UTC+8)) [💬7] ### Your current environment
The service is started with the following command:
``` vllm serve --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --enable-log-requests --enable-log-outputs --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --served-model-name model ``` ... -
#31439 [Bug]: aot_compile disables inductor graph partition — bug — by BoyuanFeng (closed: 2026-01-08 17:27 (UTC+8)) ### Your current environment
Is CUDA available : True CUDA runtime version : 12.9.86 CUDA_MODULE_LOADING set to : GPU models and configuration : GPU 0: NVIDIA B200 GPU 1: NVIDIA B200 GPU 2: NVIDIA B200 GPU 3: NVIDIA B200 …
-
#31954 [Bug] OLMo3 reasoning parser fails to detect </think> end tag, preventing GCD activation — no label — by ivnle (closed: 2026-01-08 17:18 (UTC+8)) [💬1] ### Description
The OLMo3 reasoning parser (`olmo3_reasoning_parser.py`) fails to correctly detect when thinking ends, causing all generated tokens to be classified as thinking tokens. This prevents grammar-constrained decoding (GCD) from activating for the post-thinking structured output.
### Evidence
From experiments with OLMo3-7B-Think + GCD + thinking=high on the math500 task:
Question 0 (Polar coordinates conversion): (truncated JSON output) …
-
#29574 [Performance]: Using vLLM to accelerate VLM models, does the vision encoding part currently support parallel processing, or is it still being processed serially? — performance — by NewZxy (closed: 2026-01-08 14:21 (UTC+8)) [💬2] ### Proposal to improve performance
I found that currently, images of different sizes are processed sequentially, which significantly slows down the processing speed. How can we adapt to parallel processing? Should we resize or pad all images to the same size for batch processing, or can we run multiple encoder models in parallel? Thank you.
### Report of performance regression
No response
### Misc discussion on performance
…
[New PRs]
-
#31968 [CPU] Add head sizes 80 and 112 with vec16 fallback — v1,cpu — by R3hankhan123 (created: 2026-01-08 20:50 (UTC+8)) [+33/-4, 2 files | commented:2]
## Purpose Reintroduce support for head dimensions 80 and 112 in the CPU attention backend, which were previously removed in #27954. These head dimensions are commonly used by Granite models deployed on Z archs, but they are not friendly to the Intel AMX instruction set, so the implementation now falls back to vec16. ## Test Plan Build the Docker image and test using the
`ibm-granite/granite-3b-code-base-2k` model, which has a head size of 80. ## Test Result Server Logs ```… -
#31998 [Misc] Enable async scheduling by default with spec decoding — ready — by njhill (created: 2026-01-09 07:31 (UTC+8)) [+20/-19, 1 files | commented:1] Now that all of the gaps have been addressed in async scheduling + spec decoding support, we can enable it by default in this case too.
It will still be disabled implicitly for non-EAGLE/MTP types or when padded drafter batch is disabled.
This should only be merged once https://github.com/vllm-project/vllm/pull/30495 is merged.
[!NOTE] Enables async scheduling by default when using compatible speculative decoding, with clearer gating and messaging. …
-
#31993 [ROCm][CI] Fix test_token_classification.py::test_bert_models — rocm — by divakar-amd (created: 2026-01-09 04:53 (UTC+8)) [💬1 | +12/-4, 1 files | commented:1 approved:2] This PR fixes mi325_1: Language Models Test (Extended Pooling):
models/language/pooling/test_token_classification.py::test_bert_models[float-boltuix/NeuroBERT-NER]
The issues were caused by reasons similar to those mentioned in https://github.com/vllm-project/vllm/pull/31612
[!NOTE] Addresses ROCm-specific correctness in token classification tests.
- For `test_bert_models` and `test_modernbert_models`, pass `model_kwargs={'attn_implementation': 'eager'}` to HF `AutoModelForToken…
- For
-
#31992 [Bugfix] Fix typo in FusedMoE LoRA reshape comment — no label — by xyang16 (created: 2026-01-09 04:36 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose
This PR is a minor typo fix in the lora_b reshape comment: the rank and num_experts positions need to be switched.
## Test Plan
## Test Result
... -
#31987 [Frontend] Improve error message — frontend,ready — by DarkLight1337 (created: 2026-01-09 01:50 (UTC+8)) [+20/-4, 3 files | commented:1 approved:1]
## Purpose
Improve the error message that is returned to the client
## Test Plan
## Test Result
…
-
#32008 [MISC] Add strict contiguity check for FlashInfer attention tensors — v1,nvidia — by vadiklyutiy (created: 2026-01-09 10:21 (UTC+8)) [+41/-10, 2 files | commented:2] Early check of a potential error as in #30842. See also #31617, https://github.com/flashinfer-ai/flashinfer/issues/2232
Updates the FlashInfer attention path to use a stricter contiguity check, preventing potential CUDA kernel memory access issues. Introduces an
`is_strictly_contiguous()` utility to detect tensors with degenerate strides that PyTorch's `is_contiguous()` reports as contiguous.
[!NOTE] Strengthens memory layout validation to prevent degenerate-stride tensors from reaching CUDA k…
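The degenerate-stride idea can be illustrated in pure Python on (shape, strides) pairs. This is only a sketch of the concept under the assumption that "strict" means strides must exactly match the canonical row-major layout; the PR's actual `is_strictly_contiguous()` operates on torch tensors:

```python
def canonical_strides(shape: tuple[int, ...]) -> tuple[int, ...]:
    """Row-major strides a truly contiguous tensor of this shape would have."""
    strides, acc = [], 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

def is_strictly_contiguous(shape, strides) -> bool:
    # Stricter than PyTorch's is_contiguous(), which ignores the strides of
    # size-1 dimensions and can therefore accept degenerate layouts
    # (e.g. stride 0 from an expanded/broadcast dimension).
    return tuple(strides) == canonical_strides(tuple(shape))
```

For example, shape `(1, 4)` with strides `(0, 1)` passes PyTorch's ordinary contiguity check but fails the strict one.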
-
#31994 fix lora moe sharding when rank < max_lora_rank — ready,gpt-oss — by gnovack (created: 2026-01-09 05:41 (UTC+8)) [💬1 | +6/-8, 2 files | commented:1 approved:1] ## Purpose This PR fixes a bug in the implementation of `fully_sharded` for MoE LoRA adapters. Currently, when a LoRA adapter is loaded whose rank is less than `max_lora_rank`, the LoRA A W13 weights are split into shards of size `current_lora_rank // tp_degree`.
This causes problems when we allgather the LoRA A output along the rank dim, since the resulting tensor will be interspersed with 0s, as opposed to right-hand-padded with 0s (like the LoRA B W13 weights).
e.g. (truncated example: LoRA A output w…)
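A toy pure-Python illustration of the layout problem described (numbers and padding scheme are made up for illustration; the real code shards torch tensors across TP ranks):

```python
def allgather(shards):
    # Concatenate per-rank shards along the rank dimension
    return [row for shard in shards for row in shard]

# Two TP ranks, adapter rank 2, max_lora_rank 4; each rank holds one valid row.
valid_rows = [[1.0], [2.0]]

# Problematic layout: each rank pads its own shard up to max_lora_rank // tp,
# so after allgather the zeros end up interspersed between valid rows.
interspersed = allgather([[valid_rows[0], [0.0]], [valid_rows[1], [0.0]]])

# Intended layout: gather only the valid rows, then pad once on the right,
# matching how the LoRA B W13 weights are laid out.
right_padded = allgather([[valid_rows[0]], [valid_rows[1]]]) + [[0.0], [0.0]]
```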
-
#31970 [CI] [ROCm] Fix `tests/entrypoints/test_grpc_server.py` on ROCm — rocm,ready,ci/build — by tjtanaa (created: 2026-01-08 21:08 (UTC+8)) [+18/-4, 5 files | commented:3 approved:1] ## Purpose
Fix the `tests/entrypoints/test_grpc_server.py` CI issue https://buildkite.com/vllm/amd-ci/builds/2524/steps/canvas?sid=019b9d04-9972-4807-9712-b517956d17b3 ``` ==================================== ERRORS ==================================== -- ____ ERROR collecting tests/entrypoints/test_grpc_server.py ____ ImportError while importing test module '/vllm-workspace/tests/entrypoints/test_grpc_server.py'. …
-
#32000 [ROCm][CI][V1] Fix `nixl_connector` test failure and achieve CUDA parity in `test_async_scheduling` — rocm,speculative-decoding,v1,kv-connector,nvidia — by AndreasKaratzas (created: 2026-01-09 08:03 (UTC+8)) [💬3 | +28/-39, 3 files | commented:1 approved:1] This PR adds FlexAttention backend support for ROCm in the EAGLE speculative decoding proposer, removing platform-specific attention backend restrictions and merging with the CUDA data flow of this test. It also fixes `test_abort_timeout_on_prefiller[ray]` failing on ROCm platforms.
## Changes
- Added `FlexAttentionMetadata` to the allowed attention types for ROCm in `eagle.py`
- Removed ROCm-specific backend overrides that were workarounds for missing FlexAttention support
- Modify `_make_fak…
-
#31949 [Bugfix] Fix FusedMoE LoRA w2_output_size — ready — by xyang16 (created: 2026-01-08 14:35 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1 approved:1] ## Purpose
This PR fixes FusedMoE LoRA `w2_output_size`. `w2_output_size` is currently set to `w2_lora_a_stacked[0].shape[-2]` incorrectly, making it become the rank.
## Test Plan
pytest -s -v tests/lora/test_gptoss_tp.py
## Test Result
…
- #31948 fix: remove duplicate engine_id check in nixl_connector — ready,kv-connector — by xbfs (created: 2026-01-08 14:28 (UTC+8)) [+0/-7, 1 files | commented:1 approved:1]
[!NOTE] Removes duplicate engine ID validation in `nixl_connector.py` during `_nixl_handshake`, keeping a single check after decoding `NixlAgentMetadata`.
- Simplifies control flow; no functional change expected aside from eliminating redundant error raising
-
#31943 Weight transfer — documentation,frontend,v1 — by ahao-anyscale (created: 2026-01-08 12:37 (UTC+8)) [💬2 | +2483/-10, 26 files | commented:2 | 📝draft] ## Purpose
This PR introduces native weight syncing APIs for vLLM to support reinforcement learning post-training workflows (RLHF, PPO, etc.).
Currently, open-source projects like SkyRL, VeRL, and TRL must implement their own weight syncing infrastructure to use vLLM as an inference server during training. This leads to duplicated effort and requires users to version-lock to specific implementations. See RFC #31848 for full motivation. …
-
#32005 Reduce kernel overhead when the number of active LoRAs is smaller than max LoRAs; multiple CUDA graphs are captured per number of active LoRAs — v1,nvidia — by yugong333 (created: 2026-01-09 09:41 (UTC+8)) [+212/-42, 12 files | commented:2] ## Purpose
## Test Plan Using llmperf to benchmark concurrency = 1, 2, 4, 8 when max-loras = 4 ## Test Result
... - #32003 [Cleanup] Remove obsolete spec decoding compatibility logic — performance,speculative-decoding,ready,v1 — by njhill (created: 2026-01-09 09:19 (UTC+8)) [+45/-75, 8 files | commented:1]
This is logic which was included with the original V1 spec decoding impl (ngram only), prior to various other changes which have made it obsolete including:
- Adding other spec decoding methods
- Adding support for logprobs and penalty parameters with spec decoding
Note that there are still some parameters which don’t work with spec decoding (min_p, min_tokens, logit_bias). Requests with these params will now fail when spec decoding is enabled (see https://github.com/vllm-project/vllm/pull/3198…
-
#31956 [Frontend] Add `reasoning_effort` to `OpenAIServing._preprocess_chat()` — frontend — by sanghoon-yn (created: 2026-01-08 17:04 (UTC+8)) [+8/-1, 3 files | commented:4] ## Purpose I suggest adding `reasoning_effort` to `OpenAIServing._preprocess_chat()` so that it can be used in the chat template.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31952 [Frontend] Add `reasoning_effort` and `parallel_tool_calls` to `extra_args` of `SamplingParams` — frontend,gpt-oss — by sanghoon-yn (created: 2026-01-08 15:25 (UTC+8)) [💬5 | +14/-2, 2 files | commented:7] ## Purpose
I suggest adding two fields, `reasoning_effort` and `parallel_tool_calls`, to `extra_args` of `SamplingParams` to enable logits processors to support the OpenAI-compatible parameters by controlling token generation.
## Test Plan
## Test Result
…
-
#31999 Fix type error — frontend,ready,meta-exported,fb-exported — by Adolfo-Karim (created: 2026-01-09 07:36 (UTC+8)) [💬1 | +3/-1, 1 files | commented:3 approved:2] Summary: id can be None, so we need to check it.
Test Plan: Updated local instance, the error is gone.
Differential Revision: D90353429
[!NOTE] …
-
#32002 fused_moe_kernel - cast accumulator after applying router weights — no label — by gnovack (created: 2026-01-09 08:54 (UTC+8)) [+5/-8, 1 files | commented:1] ## Purpose
The `tests/lora/test_olmoe_tp.py test_olmoe_lora_mixed` test case has been failing since https://github.com/vllm-project/vllm/pull/31676 was merged. Previously the application of `moe_weight` (router weights) to the `accumulator` was performed in `float32`, but after https://github.com/vllm-project/vllm/pull/31676, this computation is done after the `accumulator` has been cast to `compute_type`. This change in behavior caused the above test case to begin failing, and introduced a sl…
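The numerical effect of the ordering change can be sketched with a toy low-precision cast. This is purely illustrative: `cast_low` is an invented stand-in for the fp32-to-`compute_type` cast, and the numbers are made up:

```python
def cast_low(x: float, bits: int = 8) -> float:
    # Toy stand-in for casting a value to a lower-precision dtype
    scale = float(2 ** bits)
    return round(x * scale) / scale

acc, router_w = 0.12345678, 0.3333  # hypothetical accumulator value and router weight

# Order restored by the fix: apply the router weight in high precision,
# cast the accumulator only afterwards.
weighted_then_cast = cast_low(acc * router_w)

# Order introduced by the regression: cast first, then apply the weight,
# losing precision before the multiply.
cast_then_weighted = cast_low(acc) * router_w
```

The two orderings generally produce slightly different results, which is the kind of drift that made the test start failing.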
- #32001 [fix] add cutedsl to global sf — nvidia — by jiahanc (创建于: 2026-01-09 08:39 (UTC+8)) [+1/-0, 1 files | commented:1 approved:1] ## Purpose Add flashinfer cutedsl to global sf list, fixes https://github.com/vllm-project/vllm/issues/31918 ## Test Plan ``` VLLM_DEEPEPLL_NVFP4_DISPATCH=1 VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_USE_STANDALONE_COMPILE=0 VLLM_FLASHINFER_MOE_BACKEND="masked_gemm" VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ALL2ALL_BACKEND="deepep_low_latency" lm_eval --model vllm --model_args pretrained=nvidia/DeepSeek-R1-0528-FP4-v2,data_parallel_size=4,enable_expert_parallel=True,tensor_parallel_size=1,enforce_eager=Tr…
- #31977 [Bugfix] Fix Typo from NVFP4 Refactor — ready,nvidia — by robertgshaw2-redhat (创建于: 2026-01-08 23:36 (UTC+8)) [+30/-3, 6 files | commented:2 approved:2]
## Purpose
- broke cutedsl with typo
- this fixes it and adds testing in the ci/cd
## Test Plan
- ci, running cutedsl
## Test Result
…
-
#31982 [BugFix] Add spec-decode-incompatible request param validation — ready,v1 — by njhill (创建于: 2026-01-09 01:08 (UTC+8)) [+12/-3, 2 files | commented:2 approved:1] Some sampling parameters are not yet supported with spec decoding. We log a warning during startup but then effectively silently ignore the parameters if included in requests at runtime.
This adds request-scoped validation so that such requests fail rather than being silently ignored.
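The fail-fast behavior can be sketched as follows (the names are illustrative, not vLLM's actual code; the parameter set is the one noted earlier in this digest as unsupported with spec decoding):

```python
# Hypothetical stand-ins for sampling parameters that are incompatible
# with speculative decoding.
UNSUPPORTED_WITH_SPEC_DECODE = {"min_p", "min_tokens", "logit_bias"}

def validate_sampling_params(params: dict, spec_decode_enabled: bool) -> None:
    """Reject a request up front instead of silently ignoring parameters."""
    if not spec_decode_enabled:
        return
    # Only treat a parameter as "used" when it has a truthy (non-default) value.
    bad = sorted(k for k, v in params.items()
                 if k in UNSUPPORTED_WITH_SPEC_DECODE and v)
    if bad:
        raise ValueError(
            f"Sampling parameters {bad} are not supported "
            "with speculative decoding")
```

Rejecting at request scope gives the client an actionable error instead of results that quietly ignore the requested sampling behavior.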
-
#31996 [MoE Refactor] Move `select_experts` from `FusedMoEQuantMethod` -> `FusedMoE` — needs-rebase — by bnellnm (创建于: 2026-01-09 06:10 (UTC+8)) [💬2 | +1180/-574, 30 files | commented:1 | 📝草稿] ## Purpose ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31980 Add mergify label job for “bug” match — ci/build — by mgoin (创建于: 2026-01-09 00:59 (UTC+8)) [💬1 | +12/-0, 1 files | commented:1]
## Purpose
## Test Plan
## Test Result
... -
#31986 [Feature][#29390]: Add timeout support to MultiprocExecutor.collective_rpc and FutureWrapper — v1 — by SandishKumarHN (创建于: 2026-01-09 01:31 (UTC+8)) [💬3 | +33/-9, 1 files | commented:1] ## Purpose
- Add support for timeouts when calling vllm.v1.executor.multiproc_executor.MultiprocExecutor.collective_rpc and when awaiting futures returned by non-blocking RPCs.
- Treat a provided timeout as an overall deadline across response MessageQueues when collecting responses.
- Add FutureWrapper.result(timeout=…) and propagate the timeout into FutureWrapper.wait_for_response(get_response, timeout=…) so user code can wait on the Future with an overall timeout.
- This change makes debu…
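The overall-deadline behavior described above can be sketched like this (a simplified stand-in using `queue.Queue`; the real executor collects from its response `MessageQueue`s, and all names here are hypothetical):

```python
import queue
import time

def collect_responses(queues, timeout: float):
    """Collect one response per queue, treating `timeout` as a single
    overall deadline rather than a fresh timeout for each queue."""
    deadline = time.monotonic() + timeout
    responses = []
    for q in queues:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("RPC deadline exceeded")
        try:
            responses.append(q.get(timeout=remaining))
        except queue.Empty:
            raise TimeoutError("RPC deadline exceeded") from None
    return responses
```

Recomputing `remaining` before each blocking wait is what makes the timeout a true deadline: time spent waiting on earlier queues is charged against later ones.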
-
#31997 [CI/Build][Hardware][AMD] Fix test_forward_error — rocm,v1 — by rjrock (创建于: 2026-01-09 07:03 (UTC+8)) [+15/-1, 1 files | commented:1 | 📝草稿] ## Purpose To get `CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown` to pass on AMD CI. Since ROCm uses fork as the subprocess init method, we need to use `sitecustomize.py` to propagate the `evil_forward` function. ## Test Plan `CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown`. ## Test Result Before: Failed to raise error …
-
#31989 Update VOLKANURALTR.md — documentation — by volkTRhacV (创建于: 2026-01-09 02:53 (UTC+8)) [💬4 | +4/-4, 1 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#31991 [wip] expose online quantization for linear for compressed-tensors path — 无标签 — by vkuzo (创建于: 2026-01-09 03:52 (UTC+8)) [💬1 | +126/-4, 2 files | commented:2] Summary:
WIP and not ready for review yet
High level, I’d like to work towards enabling online quantization for all the quantization recipes in compressed-tensors. Ideally we work through a POC for a single recipe and get alignment from vllm team, and then we can parallelize further work if needed.
For now, tested e2e on a tiny dense model (facebook/opt-125m) and float8 rowwise scaling.
tl;dr;
- enables `CompressedTensorsConfig.from_config_file`, and provides an example config file for flo…
-
#31941 LoRA Per Request Loading Pipelining Support — v1 — by kfhfar (创建于: 2026-01-08 11:27 (UTC+8)) [+301/-108, 11 files | commented:10 | 📝草稿] ## Purpose Currently LoRA Loading for Per Request needs to first sequentially load all the LoRA layers before beginning Pre-Fill. This adds significant overhead in TTFT.
However, Pre-Fill is a layer-by-layer process, and for large enough inputs (> 1000 tokens) it takes longer than LoRA loading. If we run **Lo…
-
#31995 [ROCM] Add ROCm image build to release pipeline — rocm,ci/build — by dllehr-amd (创建于: 2026-01-09 05:56 (UTC+8)) [+22/-2, 3 files | commented:1] - Added build-release-image-rocm step to build ROCm release images
- Builds Dockerfile.rocm_base first, then Dockerfile.rocm with vllm-openai target
- Tags and publishes image as vllm-release-repo:$BUILDKITE_COMMIT-rocm
- Updated annotate-release.sh script to include ROCm image pull/tag/push instructions
## Purpose
## Test Plan …
- #31978 Revert “feat(moe): Add is_act_and_mul=False support for Triton MoE kernels” — rocm,ready — by mgoin (创建于: 2026-01-09 00:17 (UTC+8)) [+9/-191, 7 files | commented:1 approved:1] Reverts vllm-project/vllm#31645, my reasoning is here https://github.com/vllm-project/vllm/pull/31645#issuecomment-3724590758
- #31988 [Misc][PD] Fix `get_attn_backend` usage in transfer connectors — kv-connector — by NickLucche (创建于: 2026-01-09 02:20 (UTC+8)) [💬1 | +17/-20, 2 files | commented:4 approved:1] This PR fixes the use of `get_attn_backend` in the context discussed here https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1767810349936949. In particular, it turns out that modifications to the interface of this shared function can cause unintended backend retrieval (as a partial configuration was passed in), leading to cases such as ``` VLLM_LOGGING_LEVEL=DEBUG vllm serve google/gemma-3-4b-it --port 8004 --enforce-eager --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'… -
#31965 [Model Runner V2] Simplify BlockTables with UVA — v1 — by WoosukKwon (创建于: 2026-01-08 19:16 (UTC+8)) [+52/-154, 2 files] This PR simplifies the `append_block_ids` method in `BlockTables` by using a UVA buffer to store the block tables. Previously, the block table was stored on the GPU, and we sent the "diff" to the GPU every step. While efficient, this approach complicated the code quite a bit, since the "diff"s had to be packed into contiguous GPU tensors and async-copied.
Using the UVA tensor eliminates the need for such packing and async copies, at the cost of transferring the block tables from CPU to GPU every step.
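A toy model of the simplified scheme (plain Python lists standing in for the shared buffer; names are illustrative, not vLLM's actual code): with a single buffer that both CPU and GPU address, appending block IDs is just an in-place write, with no diff packing or async copy step:

```python
class BlockTables:
    """Toy model of block tables kept in one shared (UVA-like) buffer."""

    def __init__(self, max_num_reqs: int, max_blocks_per_req: int):
        # With a UVA buffer, the CPU writes here and the GPU reads the
        # same memory on the next step; there is no separate GPU copy
        # that must be kept in sync via packed diffs.
        self.table = [[0] * max_blocks_per_req for _ in range(max_num_reqs)]
        self.num_blocks = [0] * max_num_reqs

    def append_block_ids(self, req_idx: int, block_ids: list[int]) -> None:
        # Just write in place; the whole table is (re)read by the GPU
        # each step, trading bandwidth for much simpler bookkeeping.
        start = self.num_blocks[req_idx]
        self.table[req_idx][start:start + len(block_ids)] = block_ids
        self.num_blocks[req_idx] += len(block_ids)
```

The tradeoff mirrors the PR description: the diff-based design minimized CPU-to-GPU traffic but required packing logic; the shared-buffer design transfers more per step but removes that code entirely.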
-
#31962 [Kernel][MoE] fix computation order of MoE weight multiplication and improve flow — 无标签 — by xuebwang-amd (创建于: 2026-01-08 18:44 (UTC+8)) [💬2 | +33/-18, 1 files | commented:1] ## Purpose
# Background: This PR is continued work on two previous PRs:
- PR #31676
  - key contribution: move bias adding after dequantization
  - computation order: to(compute_type) -> HAS_BIAS (bias adding) -> MUL_ROUTED_WEIGHT; this is the closest to the right order, except that the MoE weight multiplication is not in float32.
- PR #31931
  - key contribution: preserving router weight s…
- #31983 [Bugfix] Fix Fp8 Triton for non-gated MoE (Nemotron) — 无标签 — by danisereb (创建于: 2026-01-09 01:09 (UTC+8)) [💬1 | +7/-0, 3 files | commented:1]
## Purpose
vLLM serve command that fails:
export MODEL_PATH=/my_home/hf_models/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/ export VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve $MODEL_PATH --served-model-name my_model \ --trust-remote-code --async-scheduling --kv-cache-dtype fp8 --tensor-parallel-size 1…
- #31944 [BugFix] Fix spec decoding edge case bugs — ready,v1 — by njhill (创建于: 2026-01-08 12:53 (UTC+8)) [+49/-46, 3 files | commented:2 approved:1]
There are a couple of edge case issues:
- Since https://github.com/vllm-project/vllm/pull/29821, it's again possible to hit the issue described in https://github.com/vllm-project/vllm/pull/30916. Our test does cover it, but it's not guaranteed to trigger, so it was missed and is now manifesting as a flake. The fix here is to clear the request's `spec_token_ids` in the scheduler when it is preempted.
- There's also an issue related to a case where the request can be excluded from the batch temporarily…
-
#31984 [Kernel] Optimize Sliding Window Attention in 3D Triton Kernel — 无标签 — by jvlunteren (创建于: 2026-01-09 01:18 (UTC+8)) [+26/-3, 1 files | commented:1] ## Purpose
This pull request improves the efficiency of sliding window attention in the 3D kernel by applying the same optimization used in the 2D kernel. It ensures that only tiles within the defined window are processed, resulting in better performance and reduced computational overhead.
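The tile-skipping idea can be illustrated with the index arithmetic alone (a simplified sketch; tile sizes and names are illustrative, not the kernel's actual parameters):

```python
def key_tile_range(q_tile: int, tile_size: int, window: int) -> tuple[int, int]:
    """Half-open range [lo, hi) of key tiles a query tile must visit
    under causal attention with a sliding window of `window` positions."""
    q_first = q_tile * tile_size        # first query position in the tile
    q_last = q_first + tile_size - 1    # last query position in the tile
    # Earliest key any query in this tile can see is q_first - window + 1;
    # the causal bound caps visible keys at q_last.
    lo = max(0, q_first - window + 1) // tile_size
    hi = q_last // tile_size + 1
    return lo, hi
```

For example, with 64-wide tiles and a 128-token window, query tile 4 (positions 256-319) only needs key tiles 2 through 4, whereas plain causal attention would visit tiles 0 through 4; the skipped tiles are where the savings come from.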
## Performance The following results were obtained for `openai/gpt-oss-20b` on an NVIDIA H100 GPU, by running `$ VLLM_ATTENTION_BACKEND=TRITON_ATTN vllm bench latency \` ... -
#31979 [Doc] Improve MM models LoRA notes — documentation,ready — by jeejeelee (创建于: 2026-01-09 00:21 (UTC+8)) [💬1 | +1/-23, 1 files | approved:1 commented:1]
## Purpose
## Test Plan
## Test Result
... -
#31981 [Misc] Clean up world_size > avail_gpu warning for ray — v1 — by ruisearch42 (创建于: 2026-01-09 01:05 (UTC+8)) [+0/-17, 1 files | commented:1] ## Purpose
The same check is in `ParallelConfig.__post_init__()` and should be called before `initialize_ray_cluster()`: https://github.com/vllm-project/vllm/blob/b8112c1d85442b060e37df90e9db343cbc7b000c/vllm/config/parallel.py#L592
FIX #31005
## Test Plan N/A
…
-
#31973 [Model] Reorganize pooling layers — documentation,ready,v1,qwen — by DarkLight1337 (创建于: 2026-01-08 21:27 (UTC+8)) [💬2 | +1241/-1130, 32 files | commented:9 approved:1] ## Purpose `pooler.py` is getting really bloated, so let's split everything up:
- Pooler activations (`vllm.model_executor.layers.pooler.activations`)
- Common code (`vllm.model_executor.layers.pooler.common`)
- Abstract pooler (`vllm.model_executor.layers.pooler.abstract`)
- Special poolers (`vllm.model_executor.layers.pooler.special`)
- Poolers that apply to the whole batch of requests at once (`vllm.model_executor.layers.pooler.batch`)
- Poolers that apply to one request at a time (`vllm.mo…
-
#31955 [BugFix] Clear spec_token_ids for preempted req to prevent grammar conflicts on resumption — v1 — by izhuhaoran (创建于: 2026-01-08 16:51 (UTC+8)) [💬4 | +9/-0, 1 files | commented:1] ### Purpose Fixes #31876
The unit test `v1/e2e/test_async_scheduling.py::test_with_spec_decoding` fails intermittently. The failure occurs with the following configuration: ```python test_sampling_params = [ dict(structured_outputs=struct_outputs), ]
…
-
#31960 [Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 — ready — by danisereb (创建于: 2026-01-08 17:54 (UTC+8)) [+4/-3, 1 files | commented:6 approved:1] ## Purpose Fix bug: https://github.com/vllm-project/vllm/issues/31957
The fix in file `fp8_utils.py` is required to fix this error (raised from `flashinfer_cutlass_fused_moe`): `RuntimeError: Check failed: fc1_dequant.ndim() == 1 (0 vs. 1) : fc1_dequant must be a 1D tensor` ## Test Plan …
- #31953 [CI] fix ROCM_HOME: unbound variable — rocm,ready,ci/build — by LucasWilkinson (创建于: 2026-01-08 15:59 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1]
https://github.com/vllm-project/vllm/pull/31922 broke the “2 Node Tests (4 GPUs in total)” with `[2026-01-08T07:32:12Z] ./.buildkite/scripts/run-multi-node-test.sh: line 10: ROCM_HOME: unbound variable` https://buildkite.com/organizations/vllm/pipelines/ci/builds/46074/jobs/019b9c75-6389-40dc-a23b-5ecd3683d951/log
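The failure mode is standard bash `set -u` behavior: referencing an unset variable aborts the script. The usual guard (illustrative, not necessarily the exact patch in the PR) is a default-value expansion:

```shell
#!/usr/bin/env bash
set -u  # strict mode: an unset variable becomes a hard error

unset ROCM_HOME          # simulate a node where ROCm is not configured
# A bare "$ROCM_HOME" here would abort with "ROCM_HOME: unbound variable";
# the :- expansion substitutes an empty default instead of failing.
ROCM_HOME="${ROCM_HOME:-}"

if [ -n "$ROCM_HOME" ]; then
    echo "using ROCM_HOME=$ROCM_HOME"
else
    echo "ROCM_HOME not set; skipping ROCm-specific setup"
fi
```

`${VAR:-default}` is safe under `set -u` even when `VAR` has never been assigned, which is why it is the standard fix for CI scripts that must run on both ROCm and non-ROCm nodes.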
-
#31976 [WIP] Add: EAGLE3 Acceptance Length Regression Tests — speculative-decoding,v1 — by rahul-tuli (创建于: 2026-01-08 23:21 (UTC+8)) [+208/-0, 1 files | commented:1 | 📝草稿] Adds parameterized pytest tests to detect acceptance length regressions in EAGLE3 speculative decoding. These tests ensure that new commits do not degrade speculative decoding performance.
## Changes
tests/v1/spec_decode/test_acceptance_length.py: New test file with parameterized tests for EAGLE3 model pairs
## Models Tested
| Verifier | Drafter |
|----------|---------| …
-
#31969 [Bugfix]: Fix Step3ReasoningParser missing is_reasoning_end_streaming — ready — by chaunceyjiang (创建于: 2026-01-08 20:53 (UTC+8)) [+6/-0, 1 files | commented:1 approved:1] ## Purpose Follow up https://github.com/vllm-project/vllm/issues/30056
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31971 [Docs]: update claude code url — documentation,ready — by chaunceyjiang (创建于: 2026-01-08 21:10 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1] ## Purpose Follow up #31188 ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31974 Fix bug: `hf_token` argument to LLM in Python SDK ignored in `vllm.transformer_utils.config` — 无标签 — by benglewis (创建于: 2026-01-08 21:54 (UTC+8)) [💬1 | +36/-2, 3 files | commented:1 | 📝草稿] ## Purpose Fix the `hf_token` argument not being passed through to `transformers`'s `AutoConfig` correctly. Fixes #31894 ## Test Plan Try passing an `hf_token` and loading a gated model which requires an `hf_token` via the vLLM Python code. ## Test Result WIP
…
-
#31967 [CI] [Bugfix] Fix unbounded variable in `run-multi-node-test.sh` — rocm,ready,ci/build — by tjtanaa (创建于: 2026-01-08 20:36 (UTC+8)) [+2/-1, 2 files | commented:1 approved:1] ## Purpose Fix the issue `./.buildkite/scripts/run-multi-node-test.sh: line 10: ROCM_HOME: unbound variable` after PR https://github.com/vllm-project/vllm/pull/31922. https://buildkite.com/vllm/ci/builds/46115/steps/canvas?sid=019b9d2c-bfc3-4c47-b214-3bf7def2f86a
Add `.buildkite/scripts/run-multi-node-test.sh` into `test-pipeline.yml` to trigger during CI. ## Test Plan
…
-
#31972 [Models]: Make Multimodal config implicit in ViT implementation — qwen — by Isotr0py (创建于: 2026-01-08 21:20 (UTC+8)) [+41/-52, 13 files | commented:1 | 📝草稿]
## Purpose
- Currently, we are passing `MultimodalConfig` through the whole MMEncoder only to enable data parallel and attention backend overrides
- This PR makes it implicit through `get_current_vllm_config`.
## Test Plan
## Test Result
…
-
#31946 Feat/feat/support nemotron h mtp functional — new-model,speculative-decoding,v1 — by shaharmor98 (创建于: 2026-01-08 14:05 (UTC+8)) [💬2 | +1452/-127, 12 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#31964 [KVConnector] Support worker -> scheduler metadata — v1,kv-connector — by orozery (创建于: 2026-01-08 19:04 (UTC+8)) [💬1 | +270/-27, 6 files | commented:1] This PR introduces a new `build_worker_connector_meta` KV connector API, allowing workers to send arbitrary metadata back to the scheduler-side connector. This aligns with the already existing `build_connector_metadata` API, which allows for the same in the opposite direction (scheduler -> worker).
In particular, this API is needed for the OffloadingConnector to be able to notify the scheduler-side on offloaded blocks, even before a request is finished.
-
#31950 [MM Encoder]: Make `MMEncoderAttention`'s `scale` take effect properly — ready,qwen — by Isotr0py (创建于: 2026-01-08 14:47 (UTC+8)) [+34/-8, 13 files | approved:1 commented:1] ## Purpose
- `scale` in `MMEncoderAttention` doesn't actually take effect, because we previously didn't pass it to the ViT wrapper ops.
- This PR fixes it, and also standardizes Qwen-VL-style `MMEncoderAttention` usage to pass `scale` even when it is `scale=head_dim**-0.5`
## Test Plan
## Test Result
…
-
#31947 [Model] Standardize common vision encoders — ready,deepseek — by DarkLight1337 (创建于: 2026-01-08 14:26 (UTC+8)) [💬1 | +254/-174, 19 files | commented:3 approved:1]
## Purpose
- Accept a `multimodal_config` argument in the common vision encoders used by `init_vision_tower_for_llava` (CLIP, SigLIP, PixtralHF), and forward it to `MMEncoderAttention`. This also enables handling of `mm_encoder_tp_mode`.
- Update `SiglipVisionEmbeddings` according to current Transformers code.
- Merge `Phi3ImageEmbeddingBase` into `Phi3HDImageEmbedding` since it's the only subclass.
- Fix some instances where `self.scale` failed to be passed to `M…
-
#31951 [Chore] Further cleanup pooler — ready,v1 — by DarkLight1337 (创建于: 2026-01-08 14:55 (UTC+8)) [+47/-62, 7 files | commented:1 approved:1] ## Purpose
- Rename `Tokens*` -> `Tokenwise*` to be clearer
- Remove redundant `PoolingType` and replace it with string literals.
- Avoid redundant operations in `AllPool.forward`.
## Test Plan
## Test Result
…
- #31958 [Bugfix] Keep all tensors to be on the same device — v1 — by wjunLu (创建于: 2026-01-08 17:35 (UTC+8)) [💬1 | +1/-1, 1 files] When running on Ascend NPU, I met the following error ``` (Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2988, in _torch_cuda_wrapper (Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] yield (Worker_TP0 pid=41296) ERROR 01-08 03:39:41 [multiproc_executor.py:822] File "/__w/vllm-ascend/vllm-ascend/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", lin…
-
#31942 [platform] add dp_metadata arg to set_additional_forward_context — ready — by Ronald1995 (创建于: 2026-01-08 11:34 (UTC+8)) [💬1 | +1/-0, 1 files | commented:3 approved:3]
## Purpose #31674 added `set_additional_forward_context` in `set_forward_context` so that other platforms can add extra fields to `forward_context`, but `set_additional_forward_context` lacks the `dp_metadata` argument. Other platforms will use `num_tokens_across_dp`, but if there is no `dp_metadata`, they will call `coordinate_batch_across_dp` again, which causes some extra communication. ## Test Plan ## Test Result
…
-
#31945 [5/n] Migrate non-cutlass part of csrc/quantization/w8a8 to libtorch stable ABI — ci/build,cpu,nvidia — by mikaylagawarecki (创建于: 2026-01-08 12:57 (UTC+8)) [+1526/-1044, 30 files | commented:1 | 📝草稿] Stacked on https://github.com/vllm-project/vllm/pull/31842/files
## Purpose
## Test Plan pytest tests/kernels/quantization/test_fp8_quant.py -v pytest tests/kernels/quantization/test_fp8_quant_group.py -v pytest tests/kernels/quantization/test_int8_quant.py -v pytest tests/kernels/quantization/test_int8_kernel.py -v …
- #31939 [Core] Refactor ColumnParallelLinear: remove unused parameter and optimize forward — 无标签 — by maang-h (创建于: 2026-01-08 10:57 (UTC+8)) [+4/-10, 1 files | commented:1]
## Purpose
Remove unused `output_sizes` parameter and optimize `output_bias` computation in `ColumnParallelLinear`. ### Changes
- Remove unused `output_sizes` parameter from `ColumnParallelLinear.__init__()`:
  - The subclass `MergedColumnParallelLinear` sets `self.output_sizes` as an instance attribute before calling `super().__init__()`, making the parameter unnecessary.
  - Also removed dead code (lines 492-493) and updated the docstring.
- Optimize `output_bias` computation in `forward()`…
[已合并 PR]
-
#30495 [Async][Feat] support apply penalty or bad_words for async + spec — ready,v1 — by izhuhaoran (合并于: 2026-01-09 10:31 (UTC+8)) [💬16 | +70/-34, 4 files | commented:8 approved:1] ## Purpose
A follow-up to PR #30122
This PR enables support for sampling penalties (frequency, presence, repetition) and `bad_words` when using async scheduling combined with speculative decoding.
Key Changes:
`GPUModelRunner`: Implemented a mechanism to asynchronously copy draft token IDs from GPU to CPU (`_copy_draft_token_ids`, `_get_draft_token_ids_cpu`). `InputBatch`: Added `update_async_spec_token_ids` to populate `sampling_metadata.spec_token_ids` with the retrieve…
-
#31992 [Bugfix] Fix typo in FusedMoE LoRA reshape comment — 无标签 — by xyang16 (合并于: 2026-01-09 10:46 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] ## Purpose
This PR is a minor fix of a typo in the `lora_b` reshape comment: the rank and num_experts positions need to be switched.
## Test Plan
## Test Result
... -
#31987 [Frontend] Improve error message — frontend,ready — by DarkLight1337 (合并于: 2026-01-09 04:07 (UTC+8)) [+20/-4, 3 files | commented:1 approved:1]
## Purpose
Improve the error message that is returned to the client
## Test Plan
## Test Result
…
-
#31761 [Frontend] Add MCP tool streaming support to Responses API — frontend,ready,gpt-oss — by daniel-salib (合并于: 2026-01-09 09:19 (UTC+8)) [💬9 | +1387/-629, 3 files | commented:2 approved:2] ## Purpose This change enables streaming support for MCP tools when using GPT OSS. It extends the harmony utilities and response serving infrastructure to handle tool streaming, allowing tool calls and their results to be incrementally streamed back to clients rather than returned as a single batch.
This PR is taken over from #30301
## Test Plan
``` VLLM_ENABLE_RESPONSES_API_STORE=1 VLLM_GPT_OSS_SYSTEM_TOOL_MCP_LABELS=web_search_preview,conta…
- #31977 [Bugfix] Fix Typo from NVFP4 Refactor — ready,nvidia — by robertgshaw2-redhat (合并于: 2026-01-09 08:18 (UTC+8)) [+30/-3, 6 files | commented:2 approved:2]
## Purpose
- broke cutedsl with typo
- this fixes it and adds testing in the ci/cd
## Test Plan
- ci, running cutedsl
## Test Result
…
- #31193 [Feature] Add iteration level logging and enhance nvtx marker — ready,v1,nvidia — by maxyanghu (合并于: 2026-01-09 08:13 (UTC+8)) [💬5 | +138/-10, 6 files | commented:10]
## Purpose
This PR adds iteration-level logging for each scheduled iteration. It computes iteration details like the number of context/generation requests and the number of context/generation tokens. The definition of a context request is any request that is still being processed in the prefill phase and for which no output token has been generated. It also logs the elapsed time of iterations and keeps an index of the current iteration. As the index of the iteration is recorded per `EngineCore`, it is kept … -
#31982 [BugFix] Add spec-decode-incompatible request param validation — ready,v1 — by njhill (合并于: 2026-01-09 08:08 (UTC+8)) [+12/-3, 2 files | commented:2 approved:1] Some sampling parameters are not yet supported with spec decoding. We log a warning during startup but then effectively silently ignore the parameters if included in requests at runtime.
This adds request-scoped validation so that such requests fail rather than being silently ignored.
-
#31688 [Quantization] Deprecate Long Tail of Schemes — ready — by robertgshaw2-redhat (合并于: 2026-01-09 08:07 (UTC+8)) [💬1 | +61/-5, 8 files | commented:4 approved:1]
### SUMMARY
- start deprecation process for long tail of quantization schemes
- we will have 1 release with the ability to enable the deprecated stuff, then remove completely
Essential Elements of an Effective PR Description Checklist
... - #31752 [MoE Refactoring][Bugfix]Wrap WNA16 Triton kernel into mk and change compressed tensor kernel selection — rocm,ready — by zyongye (合并于: 2026-01-09 08:01 (UTC+8)) [💬4 | +148/-4, 3 files | commented:3 approved:2]
Refactor the WNA16 Triton kernel into the modular kernel format.
Previously, we factored out WNA16 from `TritonExperts`, which caused the WNA16 LoRA module to fail to select the correct kernel (see issue). This PR fixes the problem. - #31747 [Misc] Fix `Current vLLM config is not set.` warnings, assert to avoid issues in the future — ready,v1,multi-modality,cpu,kv-connector,nvidia,ready-run-all-tests — by LucasWilkinson (合并于: 2026-01-09 07:20 (UTC+8)) [💬7 | +380/-240, 48 files | commented:7 approved:3] https://github.com/vllm-project/vllm/pull/30531 and https://github.com/vllm-project/vllm/pull/29575 introduced accesses to `get_current_vllm_config` on boot that are outside of `set_current_vllm_config` contexts, leading to repeated logs ``` (EngineCore_DP0 pid=1647618) WARNING 01-05 16:41:49 [vllm.py:1447] Current vLLM config is not set. (EngineCore_DP0 pid=1647618) INFO 01-05 16:41:49 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048. (EngineCore_DP0 pid=1647618) I… - #30881 [Compressed-Tensors] Simplify NVFP4 Conditions, enable marlin support for NVFP4A16 MoEs — ready — by dsikka (合并于: 2026-01-09 06:45 (UTC+8)) [💬2 | +42/-42, 2 files | commented:8 approved:1]
## Purpose
- Clean-up the conditions used to select the Nvfp4 schemes
- Enable running Nvfp4A16 MoEs through the marlin pathway so that we can support NVFP4A16 MoEs
## Test Plan
- Smoke testing Qwen3 MoE NVFP4 and NVFP4A16 models
- Validate on SM100 and SM90
## Test Result
- Both models generate coherent outputs on both platforms …
-
#30519 [Misc][Refactor] Add FusedMoERouter object — ready — by bnellnm (合并于: 2026-01-09 04:52 (UTC+8)) [💬12 | +165/-36, 20 files | commented:4 approved:1] ## Purpose
Add an abstract `FusedMoERouter` class that provides `select_experts`. There's a concrete subclass that wraps `FusedMoE._select_experts` and is passed to all the quantization methods instead of calling `FusedMoE._select_experts` directly. This is a step toward teasing out the routing logic from `FusedMoE.apply`. See https://github.com/vllm-project/vllm/issues/28408
## Test Plan Existing CI
## Test Result …
-
#31627 [Documentation][torch.compile] Add documentation for torch.compile + multimodal encoders — documentation,ready — by Lucaskabela (合并于: 2026-01-09 03:33 (UTC+8)) [💬5 | +111/-0, 1 files | commented:1 approved:2] ## Purpose See title - this PR adds documentation describing torch.compile for multimodal encoders, including how to use and apply to new modules
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... - #31978 Revert “feat(moe): Add is_act_and_mul=False support for Triton MoE kernels” — rocm,ready — by mgoin (合并于: 2026-01-09 03:31 (UTC+8)) [+9/-191, 7 files | commented:1 approved:1] Reverts vllm-project/vllm#31645, my reasoning is here https://github.com/vllm-project/vllm/pull/31645#issuecomment-3724590758
-
#31965 [Model Runner V2] Simplify BlockTables with UVA — v1 — by WoosukKwon (合并于: 2026-01-09 02:24 (UTC+8)) [+52/-154, 2 files] This PR simplifies the `append_block_ids` method in `BlockTables` by using a UVA buffer to store the block tables. Previously, the block table was stored on the GPU, and we sent the "diff" to the GPU every step. While efficient, this approach complicated the code quite a bit, since the "diff"s had to be packed into contiguous GPU tensors and async-copied.
Using the UVA tensor eliminates the need for such packing and async copies, at the cost of transferring the block tables from CPU to GPU every step.
- #31728 [CI][ROCm] Fix NIXL tests on ROCm — rocm,ready,ci/build,kv-connector — by NickLucche (合并于: 2026-01-09 01:34 (UTC+8)) [💬4 | +20/-6, 3 files | commented:1 approved:1] Follow up to https://github.com/vllm-project/vllm/pull/31491 for AMD mirror pipeline + timeout update from latest nightly run.
- #31944 [BugFix] Fix spec decoding edge case bugs — ready,v1 — by njhill (合并于: 2026-01-08 15:31 (UTC+8)) [+49/-46, 3 files | commented:2 approved:1]
There are a couple of edge case issues:
- Since https://github.com/vllm-project/vllm/pull/29821, it's again possible to hit the issue described in https://github.com/vllm-project/vllm/pull/30916. Our test does cover it, but it's not guaranteed to trigger, so it was missed and is now manifesting as a flake. The fix here is to clear the request's `spec_token_ids` in the scheduler when it is preempted.
- There's also an issue related to a case where the request can be excluded from the batch temporarily…
-
#31702 Fix ijson build for Power. — ready,ci/build — by npanpaliya (合并于: 2026-01-09 01:12 (UTC+8)) [💬2 | +5/-5, 1 files | commented:1 approved:1]
## Purpose
## Test Plan
## Test Result
... -
#31979 [Doc] Improve MM models LoRA notes — documentation,ready — by jeejeelee (合并于: 2026-01-09 00:55 (UTC+8)) [💬1 | +1/-23, 1 files | approved:1 commented:1]
## Purpose
## Test Plan
## Test Result
... - #31591 [Misc] Tidy up some spec decode logic in GPUModelRunner — ready,v1 — by njhill (合并于: 2026-01-09 01:10 (UTC+8)) [💬3 | +51/-50, 3 files | commented:4 approved:1]
Simplify messy top-level logic in `GPUModelRunner.sample_tokens`, avoid computing `effective_drafter_max_model_len` every step, and only execute this spec-decoding-specific logic when spec decoding is actually enabled.
#31928 [ROCm][CI] Fix attention backend test flakiness from uninitialized KV cache memory — rocm,ready,v1 — by AndreasKaratzas (合并于: 2026-01-08 12:35 (UTC+8)) [💬3 | +1/-1, 1 files | commented:1 approved:1] Fixes intermittent test failures in `test_attention_backends.py` caused by uninitialized memory in the KV cache. ## Problem
The `create_and_prepopulate_kv_cache` function uses `torch.empty()` to allocate the KV cache, which leaves memory uninitialized. Only the blocks actually used by test sequences are populated with valid data. This manifests as intermittent failures, particularly on ROCm, where uninitialized GPU memory is less likely to contain “safe” values compared to CUDA.
## Fix
…
- #31833 [ROCm][CI] v1 cpu offloading attention backend fix — rocm,ready,v1 — by AndreasKaratzas (合并于: 2026-01-08 14:37 (UTC+8)) [💬6 | +4/-2, 1 files | commented:3 approved:1]
This PR fixes a regression caused by https://github.com/vllm-project/vllm/pull/30687 on ROCm. The underlying cause is that `FLASH_ATTN` is not supported on ROCm.
#31960 [Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 — ready — by danisereb (合并于: 2026-01-09 00:08 (UTC+8)) [+4/-3, 1 files | commented:6 approved:1] ## Purpose Fix bug: https://github.com/vllm-project/vllm/issues/31957
The fix in file `fp8_utils.py` is required to fix this error (raised from `flashinfer_cutlass_fused_moe`): `RuntimeError: Check failed: fc1_dequant.ndim() == 1 (0 vs. 1) : fc1_dequant must be a 1D tensor` ## Test Plan …
-
#31969 [Bugfix]: Fix Step3ReasoningParser missing is_reasoning_end_streaming — ready — by chaunceyjiang (合并于: 2026-01-08 23:28 (UTC+8)) [+6/-0, 1 files | commented:1 approved:1] ## Purpose Follow up https://github.com/vllm-project/vllm/issues/30056
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31758 [Model] Add LFM2-VL model support — documentation,new-model,ready,v1 — by tianshu-Michael-yu (合并于: 2026-01-08 21:00 (UTC+8)) [💬4 | +1266/-1, 6 files | commented:10] ## Purpose
Add support for the LFM2-VL (Liquid Foundation Model 2 Vision-Language) model family from LiquidAI.
Changes:
- Add `Lfm2VLForConditionalGeneration` model implementation with full multimodal processing pipeline
- Add SigLIP2 vision encoder (`siglip2.py`) used by LFM2-VL
- Register model in the multimodal model registry
- Fix `max_seqlen` type hint in `MMEncoderAttention` to accept `int | torch.Tensor`
- Add double-buffering for `is_mm_embed` in GPU model runner to avoid race cond…
-
#31575 [Model] Support IQuestCoder model — documentation,new-model,ready — by yxing-bj (合并于: 2026-01-08 22:42 (UTC+8)) [💬15 | +605/-0, 4 files | commented:10] ## Purpose IQuest-Coder-V1 is a new family of code large language models (LLMs) designed to advance autonomous software engineering and code intelligence. We built a repo about IQuestCoder.
We have uploaded these models to Hugging Face, including IQuestCoder and IQuestLoopCoder. To make them easier for everyone to use, we support these models on the vLLM platform.
## Test Plan
Firstly, we start to launch a vLLM …
-
#31971 [Docs]: update claude code url — documentation,ready — by chaunceyjiang (合并于: 2026-01-08 22:04 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1] ## Purpose Follow up #31188 ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31967 [CI] [Bugfix] Fix unbounded variable in `run-multi-node-test.sh` — rocm,ready,ci/build — by tjtanaa (合并于: 2026-01-08 21:42 (UTC+8)) [+2/-1, 2 files | commented:1 approved:1] ## Purpose Fix the issue `./.buildkite/scripts/run-multi-node-test.sh: line 10: ROCM_HOME: unbound variable` after PR https://github.com/vllm-project/vllm/pull/31922. https://buildkite.com/vllm/ci/builds/46115/steps/canvas?sid=019b9d2c-bfc3-4c47-b214-3bf7def2f86a
Add `.buildkite/scripts/run-multi-node-test.sh` into `test-pipeline.yml` to trigger during CI. ## Test Plan
…
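The `ROCM_HOME: unbound variable` failure above is the classic interaction of bash's `set -u` with an optional environment variable. A minimal sketch of the standard fix, the `${VAR:-default}` expansion, follows; the function name is illustrative, not the actual CI script:

```shell
# set -u aborts on any reference to an unset variable; this is what
# produced "ROCM_HOME: unbound variable" in the CI script.
set -u

# Fix: ${VAR:-default} expands safely even when VAR was never exported
# (e.g. on a CUDA-only node). Hypothetical helper for illustration.
detect_platform() {
  if [ -n "${ROCM_HOME:-}" ]; then
    echo rocm
  else
    echo cuda
  fi
}

detect_platform   # prints "cuda" unless ROCM_HOME is exported
```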
-
#31610 [OpenAI] Fix tool_choice=required streaming when output has trailing extra data — frontend,ready,tool-calling — by maylikenoother (合并于: 2026-01-08 21:01 (UTC+8)) [💬1 | +41/-2, 2 files | commented:1 approved:1] ### Problem: When using `tool_choice="required"`, `OpenAIServingChat.extract_tool_call_required_streaming` parses the partial stream as JSON. If the model emits a valid JSON tool-call array and then continues with trailing tokens (e.g. "\nDONE"), parsing can hit a JSON "Extra data" case and stop/behave incorrectly.
### Fix:
- Parse required-tool streaming output using partial_json_loads(…, Allow.ALL) so we can safely parse the first JSON value while tolerating trailing extra data.
- Continue to …
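The tolerant-parsing idea behind this fix can be sketched with the standard library's `json.JSONDecoder.raw_decode`, which returns the first complete JSON value plus an end offset instead of raising the "Extra data" error that plain `json.loads` hits on trailing tokens. vLLM's actual helper is `partial_json_loads`; this stdlib version is only an analogue:

```python
import json

def parse_first_json(buffer: str):
    """Parse the leading JSON value from a stream buffer, tolerating
    trailing extra data such as a model emitting DONE afterwards."""
    decoder = json.JSONDecoder()
    stripped = buffer.lstrip()
    # raw_decode returns (value, end_index) rather than raising
    # "Extra data" the way json.loads does on trailing tokens.
    value, end = decoder.raw_decode(stripped)
    return value, stripped[end:]

calls, trailing = parse_first_json('[{"name": "get_weather", "arguments": {}}]\nDONE')
```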
-
#31724 [Model] Enable LoRA support for Pixtral — documentation,ready — by A1c0r-Z (合并于: 2026-01-08 21:00 (UTC+8)) [💬4 | +30/-3, 2 files | commented:1 approved:1] ## Purpose Enable LoRA adapters on the vision tower and connector components of Pixtral models (Part of #31479).
## Technical Details
- Implement `get_mm_mapping()`, `get_num_mm_encoder_tokens()`, and `get_num_mm_connector_tokens()`.
- Add `SupportsLoRA` to `PixtralForConditionalGeneration` inheritance.
- Patch Merge Logic: Account for the $S \times S$ (typically $2 \times 2$) patch merging in Pixtral. `get_num_mm_encoder_tokens` scales up the LLM budget by $S^2$ to match the Vision …
-
#31847 [Model] Add Grok-2 — documentation,new-model,rocm,frontend,ready,v1,nvidia — by dangoldbj (合并于: 2026-01-08 20:59 (UTC+8)) [💬12 | +777/-20, 8 files | commented:10] ## Purpose Implements https://github.com/vllm-project/vllm/issues/23557
- Add the Grok2 tokenizer that understands `.tok.json`, ties into the tokenizer registry, and supports Grok-2 chat templates.
- Grok1/Grok2 model machinery (MoE routing, residual path, tokenizer docs, and test coverage) so Grok-2 configs load correctly.
- Document the new support entry in `docs/models/supported_models.md`.
- Preserve Grok-1 behavior while adding a Grok-2 renormalization override and documenting tokenizer r…
-
#31652 [Model] Enable LoRA support for tower and connector in GLM4-V — ready — by Zyyeric (合并于: 2026-01-08 15:45 (UTC+8)) [+14/-0, 1 files | commented:1 approved:1]
## Purpose Issue: #31479 Add GLM4-V token-count helpers so LoRA can size tower/connector inputs correctly across the spatial merge.
## Technical Detail GLM4-V’s vision tower produces patch-level tokens that are spatially merged by spatial_merge_size before visual.merger runs.
`get_num_mm_encoder_tokens` expands LM-side merged tokens back to pre-merge patch count (× merge_size²), while `get_num_mm_connector_tokens` converts tower (pre-merge) tokens to the connector…
-
#31931 [ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling — rocm,ready — by AndreasKaratzas (合并于: 2026-01-08 12:17 (UTC+8)) [💬9 | +8/-4, 1 files | commented:1 approved:1] Fixes MoE accuracy regression on ROCm introduced in #31676.
## Problem
PR #31676 reordered post-accumulation operations to add bias after dequantization. However, this also moved `MUL_ROUTED_WEIGHT` after the `.to(compute_type)` precision conversion, causing router weight multiplication to occur in bf16/fp16 instead of float32. This precision loss causes different outputs on ROCm due to differences in mixed-precision handling between ROCm and CUDA Triton backends. The issue manifests as non-d…
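Why operation order matters here can be shown with a toy example. Emulating low precision via IEEE half floats from the stdlib (fp16 standing in for bf16; the values and helper are illustrative, not the kernel's code), multiplying after the downcast gives a measurably worse result than multiplying first and downcasting once:

```python
import struct

def fp16(x: float) -> float:
    """Round-trip through IEEE half precision to emulate low-precision math."""
    return struct.unpack('e', struct.pack('e', x))[0]

routed_weight = 0.1   # router weight (toy value)
expert_out = 3.0      # dequantized expert output (toy value)

# Correct order: multiply in high precision, downcast once at the end.
good = fp16(routed_weight * expert_out)

# Regressed order: downcast first, then multiply in low precision.
bad = fp16(fp16(routed_weight) * fp16(expert_out))
```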
-
#30190 [grpc] Support gRPC server entrypoint — frontend,ready,ci/build — by CatherineSue (合并于: 2026-01-08 15:24 (UTC+8)) [💬11 | +1357/-6, 12 files | commented:10]
## Purpose
Add gRPC server support to vLLM, enabling the community to integrate vLLM via gRPC protocol for any upstream application or routing layer.
Key Benefits:
- Native gRPC Protocol Support
- Enables upstream applications to connect via gRPC/Protobuf instead of HTTP/JSON …
-
#31388 [Voxtral] Fix speech transcription api — frontend,ready — by patrickvonplaten (合并于: 2026-01-08 18:34 (UTC+8)) [💬4 | +114/-27, 5 files | commented:5 approved:1] Make sure the speech transcription API works.
Allows standard `vllm serve` + the speech-to-text transcription API.
-
#31950 [MM Encoder]: Make MMEncoderAttention's `scale` take effect properly — ready,qwen — by Isotr0py (合并于: 2026-01-08 18:33 (UTC+8)) [+34/-8, 13 files | approved:1 commented:1] ## Purpose
- `scale` in `MMEncoderAttention` doesn't actually take effect, because it was not passed to the ViT wrapper ops before.
- This PR fixes that, and also standardizes Qwen-VL-style MMEncoderAttention usage to pass `scale` even when it equals `head_dim**-0.5`.
## Test Plan
## Test Result
…
-
#31947 [Model] Standardize common vision encoders — ready,deepseek — by DarkLight1337 (合并于: 2026-01-08 18:33 (UTC+8)) [💬1 | +254/-174, 19 files | commented:3 approved:1]
## Purpose
- Accept `multimodal_config` argument to the common vision encoders used by `init_vision_tower_for_llava` (CLIP, SigLIP, PixtralHF), and forward it to `MMEncoderAttention`. This also enables handling of `mm_encoder_tp_mode`.
- Update `SiglipVisionEmbeddings` according to current Transformers code.
- Merge `Phi3ImageEmbeddingBase` into `Phi3HDImageEmbedding` since it's the only subclass.
- Fix some instances where `self.scale` failed to be passed to `M…
-
#31951 [Chore] Further cleanup pooler — ready,v1 — by DarkLight1337 (合并于: 2026-01-08 18:16 (UTC+8)) [+47/-62, 7 files | commented:1 approved:1] ## Purpose
- Rename `Tokens*` -> `Tokenwise*` to be clearer
- Remove redundant `PoolingType` and replace it with string literals.
- Avoid redundant operations in `AllPool.forward`.
## Test Plan
## Test Result
…
-
#30803 RayLLM Bugfix - Preserve obj store URL for multi engine_config creation — ready — by omer-dayan (合并于: 2026-01-08 18:00 (UTC+8)) [💬4 | +15/-4, 4 files | commented:2 approved:2] Continuing the discussion from this PR: https://github.com/vllm-project/vllm/pull/30154
We suggest here a solution that works for both RayLLM + vllm serve.
The root cause of the bug is that the object store path is cloned to a local dir, and `self.model` is changed to that local dir while `self.model_weights` keeps the object store path. But if you run `create_engine_config` once and then re-run it, `self.model` now equals the local dir and `self.model_weights` is not preserved. The solutio…
-
#31890 [Models] Allow converting Qwen3-VL into Reranker model — documentation,frontend,ready,qwen — by Isotr0py (合并于: 2026-01-08 16:10 (UTC+8)) [💬1 | +287/-13, 8 files | commented:3 approved:3]
## Purpose
- Enable reranker support for Qwen3-VL
## Test Plan
python examples/pooling/score/vision_language_reranker.py -m qwen3_vl_reranker…
- #31719 [Misc] Support qwen3-next lora — ready,qwen — by BJWang-ant (合并于: 2026-01-08 17:27 (UTC+8)) [💬1 | +7/-1, 1 files | commented:3 approved:1] Replace the torch.nn.Linear of shared_expert_gate in the qwen3-next model with ReplicatedLinear, so that lora can correctly identify shared_expert_gate.
-
#31536 fix(compile): apply partition wrapper when loading AOT cached functions — ready — by devbyteai (合并于: 2026-01-08 17:27 (UTC+8)) [💬4 | +103/-3, 2 files | commented:4 approved:2] ## Summary Fixes #31439 where VLLM_USE_AOT_COMPILE=1 causes 2x latency regression when use_inductor_graph_partition=True and cudagraph_mode=PIECEWISE.
## Root Cause The bug is in vllm/compilation/decorators.py. When loading AOT-compiled functions from cache, the code bypasses maybe_use_cudagraph_partition_wrapper at two locations: 1. Line 374-376: Early return when aot_compiled_fn already set (subsequent calls) 2. Line 431-438: Early return after loading from AOT cache (first call a... -
#31834 [CI/Build] Enable test_kv_cache_events_dp for AMD — rocm,ready,v1 — by rjrock (合并于: 2026-01-08 17:00 (UTC+8)) [💬1 | +5/-2, 1 files | commented:1 approved:1] ## Purpose To have the command
`pytest -v -s v1/engine/test_engine_core_client.py::test_kv_cache_events_dp` pass on AMD CI.
## Test Plan
`pytest -v -s v1/engine/test_engine_core_client.py::test_kv_cache_events_dp`
## Test Result Before: ```text ERROR: found no collectors for vllm/tests/v1/engine/test_engine_core_client.py::test_kv_cache_events_dp …
-
#31635 Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA Compatibility. — ready,v1,kv-connector — by Lumosis (合并于: 2026-01-08 17:00 (UTC+8)) [💬4 | +75/-20, 6 files | commented:10] ## Purpose
RFC: https://github.com/vllm-project/vllm/issues/31634
This change refactors AttentionSpec in vLLM V1 to allow explicit setting of page_size_bytes. This is required for backends like TPU that use Ragged Paged Attention (RPA), where physical memory padding is necessary for alignment. The current reliance on num_gpu_blocks_override is not only a design “hack” but also breaks multi-host inference using the Ray executor.
-
#31905 [ROCm]Skip test_torchao.py::test_pre_quantized_model on CDNA3 arch — rocm,ready — by ZhiweiYan-96 (合并于: 2026-01-08 15:47 (UTC+8)) [💬2 | +6/-0, 1 files | commented:2 approved:1] ## Purpose
Skip UT `test_torchao.py::test_pre_quantized_model` on CDNA3 arch. On CDNA3, only fp8_e4m3_fnuz is supported, but the tested checkpoints use fp8_e4m3.
## Test Plan
`pytest -sv tests/quantization/test_torchao.py::test_pre_quantized_model` …
-
#31775 [docker] A follow-up patch to fix #30913:
`[docker] install cuda13 version of lmcache and nixl` — ready,ci/build,kv-connector,nvidia — by wangshangsam (合并于: 2026-01-08 15:47 (UTC+8)) [💬1 | +2/-0, 1 files | commented:3 approved:1] This PR fixes an issue in #30913: `lmcache` needs `TORCH_CUDA_ARCH_LIST`, which was removed in #30913.
## Purpose
Currently, when you try to build a cuda 13 image on `main` (I was doing it on an arm64 machine, but I'd presume the same error would happen for x86 too), you would encounter this error: https://github.com/vllm-project/vllm/pull/30913#discussion_r2663626084 . This PR fixes it so that the build succeeds.
## Test Plan
On a GH200 node (or any sort of arm64 machines): ```bash …
-
#31712 fix(rocm): Add get_supported_kernel_block_sizes() to ROCM_ATTN — documentation,rocm,ready,v1 — by rabi (合并于: 2026-01-08 15:46 (UTC+8)) [💬6 | +8/-0, 1 files | commented:1 approved:1] ## Purpose
ROCM_ATTN kernel only supports block sizes 16 and 32. Previously, the backend used the default which claims any block size is supported, causing select_common_block_size() to return large framework block sizes (e.g., 256 for Mamba alignment) that the kernel cannot handle.
This fix enables the kernel block size mechanism to correctly select 32 as the kernel block size for hybrid models like Nemotron-H.
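The selection mechanism can be sketched as intersecting framework-proposed block sizes with the kernel's supported set. This is a simplified stand-in for `select_common_block_size()`; the fallback policy here is illustrative:

```python
def select_common_block_size(framework_candidates, kernel_supported):
    """Pick the largest framework-proposed block size the kernel supports.
    Raises if no candidate is compatible."""
    for size in sorted(framework_candidates, reverse=True):
        if size in kernel_supported:
            return size
    raise ValueError("no compatible block size")

# ROCM_ATTN only supports block sizes 16 and 32 (per the PR); a hybrid
# model proposing 256 (Mamba alignment) must fall back to 32 rather than
# handing the kernel a block size it cannot execute.
chosen = select_common_block_size([256, 32, 16], {16, 32})
```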
-
#31745 [Bugfix] Remove the num_hidden_layers override for glm4_moe — ready — by andyl98 (合并于: 2026-01-08 15:45 (UTC+8)) [+0/-1, 1 files | commented:1 approved:1] ## Purpose Fixes an issue where we override
`num_hidden_layers: 0` for GLM4-MOE MTP models.
The Issue:
- The override set `num_hidden_layers` to 0 for `Glm4MoeMTPModel`, whereas the actual checkpoint weights for GLM4-MOE MTP are indexed after the base model's last layer (e.g., `model.layers.92...` for a 92-layer model).
-
#31927 [Fix] Enable mm_processor_cache with vision LoRA — ready,v1,multi-modality — by prashanth058 (合并于: 2026-01-08 15:31 (UTC+8)) [+52/-16, 5 files | commented:2 approved:1] ## Purpose Enables multi-modal processor cache when
`enable_tower_connector_lora` is active. #26674 prefixes `identifier` with the LoRA hash to avoid incorrect encoder cache hits (https://github.com/vllm-project/vllm/pull/26674#discussion_r2636937770), but the processor cache should be shared across LoRAs.
Solution:
- Add `mm_hash` field to `MultiModalFeatureSpec` to store the base hash (without LoRA prefix)
- Processor cache uses `mm_hash` for cache lookups (shared across LoRAs)
- Encoder cac…
-
#30460 [chore] Update FA commit — ready,ci/build — by LucasWilkinson (合并于: 2026-01-08 15:24 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1] Update vLLM-FA commit to include:
https://github.com/vllm-project/flash-attention/pull/110 https://github.com/vllm-project/flash-attention/pull/112
-
#31942 [platform] add dp_metadata arg to set_additional_forward_context — ready — by Ronald1995 (合并于: 2026-01-08 14:56 (UTC+8)) [💬1 | +1/-0, 1 files | commented:3 approved:3]
## Purpose #31674 added `set_additional_forward_context` in `set_forward_context`, so other platforms can add extra fields to `forward_context`. However, `set_additional_forward_context` lacks the `dp_metadata` argument; other platforms then fall back to `num_tokens_across_dp`, and if there is no `dp_metadata` they call `coordinate_batch_across_dp` again, which incurs extra communication.
## Test Plan
## Test Result
…
- #31825 [Model] Enable LoRA support for tower and connector in DotsOCR — documentation,ready — by ShaanveerS (合并于: 2026-01-08 14:50 (UTC+8)) [💬4 | +9/-1, 2 files | commented:1 approved:1]
## Purpose
Enable dynamic LoRA adapters on the multimodal vision tower and connector for
`DotsOCRForCausalLM` by implementing `get_num_mm_encoder_tokens()` and `get_num_mm_connector_tokens()`.
## Technical Details DotsOCR's vision path uses spatial merging (`spatial_merge_size`) via `vision_tower.merger`, which changes the effective token count between:
- LM-side "image tokens" (post-merge)
- vision tower tokens before the merger (pre-merge) …
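The pre-merge/post-merge conversion these helpers perform (here for DotsOCR, and analogously for GLM4-V and Pixtral) reduces to scaling by `merge_size**2`. A toy sketch with hypothetical function names:

```python
def encoder_tokens_sketch(num_lm_image_tokens: int, merge_size: int) -> int:
    """LM-side (post-merge) image tokens -> vision tower (pre-merge) patch count."""
    return num_lm_image_tokens * merge_size ** 2

def connector_tokens_sketch(num_tower_tokens: int, merge_size: int) -> int:
    """Vision tower (pre-merge) patch tokens -> connector (post-merge) tokens."""
    return num_tower_tokens // merge_size ** 2

# With spatial_merge_size = 2, one LM-side image token corresponds to
# a 2x2 block of 4 pre-merge patches.
patches = encoder_tokens_sketch(64, 2)
```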
-
#31188 [Doc] Add Claude code usage example — documentation,ready — by mgoin (合并于: 2026-01-08 13:50 (UTC+8)) [💬2 | +75/-0, 2 files | commented:4 approved:1]
## Purpose
A simple example of how to use gpt-oss-120b with Claude Code
## Test Plan
## Test Result
…
-
#31873 [CI][BugFix][AMD] Actually skip tests marked @pytest.mark.skip_v1 — rocm,ready,ci/build — by rasmith (合并于: 2026-01-08 13:06 (UTC+8)) [💬2 | +1/-2, 1 files | commented:4 approved:1] There are some tests in tests/samplers/test_beam_search.py that are marked @pytest.mark.skip_v1 and are meant to be skipped in v1 due to unreliable behavior from the beam search. However, they are not skipped, so I added a `-m "not skip_v1"` flag which causes the tests to actually be skipped.
These tests do not seem to run in the main CI either, but it might be a configuration in the main CI. They do run right now on AMD CI and fail intermittently.
We can try to find a permanent fix in the fut…
- #31922 [ROCm][CI] Add rocm support for run-multi-node-test.sh — rocm,ready,ci/build — by charlifu (合并于: 2026-01-08 12:36 (UTC+8)) [💬1 | +21/-3, 1 files | commented:1 approved:1]
Currently run-multi-node-test.sh only works for CUDA; this PR adds ROCm support by:
- checking whether we are on ROCm or not
- using different flags for ROCm to control the GPU devices used inside the CI image.
-
#31915 [BugFix] Fix flakiness in test_eagle_dp for PyTorch 2.10 — ready,v1 — by zou3519 (合并于: 2026-01-08 12:04 (UTC+8)) [+3/-1, 1 files | approved:2 commented:1] ## Purpose
Fix flakiness in this test for PyTorch 2.10. The test can fail in PyTorch 2.9 too, see https://github.com/vllm-project/vllm/issues/31913 for explanation.
## Test Plan
Ran `TP_SIZE=2 DP_SIZE=2 pytest tests/v1/distributed/test_eagle_dp.py::test_run_eagle_dp` on 2x L4 and verified it passed.
## Test Result
…
-
#31692 [MoE Refactor][16/N] Apply Refactor to NVFP4 — documentation,performance,ready,llama,nvidia — by robertgshaw2-redhat (合并于: 2026-01-08 11:46 (UTC+8)) [💬5 | +780/-684, 15 files | commented:5 approved:3] ## Purpose
Apply refactor to nvfp4 integrations, key steps:
- support nvfp4 in `MarlinExperts` for NVFP4
- use mks for all kernels except for `trtllm` kernels
- create oracle for centralized kernel selection
- factor out process_weights_after_loading for sharing between ct and modelopt
- create kernel in process_weights_after_loading + call modular kernel in apply
## Test Plan …
-
#31932 [CI] Skip Qwen-VL in multimodal processing tests due to flaky external dependency — ready,multi-modality,qwen — by AndreasKaratzas (合并于: 2026-01-08 10:58 (UTC+8)) [💬1 | +5/-0, 1 files | commented:1 approved:1] The Qwen-VL and Qwen-VL-Chat models are being skipped in
`test_processing_correctness` due to intermittent CI failures.
The tokenizer for these models attempts to download a font file from Alibaba's China servers (`qianwen-res.oss-cn-beijing.aliyuncs.com`), which sometimes times out or refuses connections in CI environments: `urllib3.exceptions.ConnectTimeoutError: Connection to qianwen-res.oss-cn-beijing.aliyuncs.com timed out`
Discussed with @ywang96 who confirmed these models can be …
[Closed unmerged PRs]
-
#19983 [Feature][Kernel] Blocked FP8 CUTLASS MoE for Hopper — performance,needs-rebase,ci/build,stale — by ElizaWszola (关闭于: 2026-01-09 10:27 (UTC+8)) [💬22 | +3386/-400, 30 files | commented:10] Add support for blocked fp8 CUTLASS MoE for SM90.
# Testing: Single grouped multiply unit tests:
`pytest tests/kernels/quantization/test_cutlass_scaled_mm.py -k test_cutlass_fp8_group_gemm`
Fused experts op unit tests: ``` pytest tests/kernels/moe/test_cutlass_moe.py -k test_blocked_cutlass_moe_8 …
-
#21930 Fix #21840. Fix tool_call parsing edge cases — frontend,stale,tool-calling — by okamiRvS (关闭于: 2026-01-09 10:27 (UTC+8)) [💬5 | +4/-3, 1 files | commented:3]
## Purpose
…
-
#22003 [Bugfix]: Fix Possible Output Corruption in Cascade Attention Caused by Non-Contiguous LSE Tensor — stale,v1 — by griii (关闭于: 2026-01-09 10:27 (UTC+8)) [💬8 | +3/-0, 1 files | commented:1] ## Purpose Fix an output corruption issue when using Cascade Attention. The
`flash_attn_varlen_func` operator with flash-attn2 may return a non-contiguous LSE tensor (especially `suffix_lse`) in some cases. Passing a non-contiguous LSE tensor to `merge_attn_states` can cause incorrect outputs. This PR fixes the issue by making sure the LSE tensor is contiguous before further processing.
## Test Plan This issue can be consistently reproduced by serving the Qwen2.5-32B-Instruct model (or any large …
-
#23799 valley-eagle-7b (not finished yet) — new-model,needs-rebase,stale — by oswardlx (关闭于: 2026-01-09 10:26 (UTC+8)) [💬18 | +3504/-255, 3 files | commented:1]
## Purpose
This PR adds vLLM support for the Valley-Eagle-7B multi-modal large language model. Valley is a multi-vision tower model that combines SigLIP and Qwen2VL vision encoders with Qwen2 architecture for enhanced visual understanding capabilities.
Key Features Added:
- Complete vLLM adaptation for Valley-Eagle-7B model
- Support for dual vision towers (SigLIP + Qwen2VL)
- Multi-modal input processing with image and text integration …
-
#23949 fit the qwen3 moe’s awq quantization for 2080Ti. — stale,qwen — by Readon (关闭于: 2026-01-09 10:26 (UTC+8)) [💬9 | +36/-8, 2 files | commented:1]
## Purpose
Run Qwen3 A30B and Coder-A30B for awq quantization on 2080Ti.
## Test Plan
run by: `serve billy800/Qwen3-30B-A3B-Instruct-2507-AWQ --max-model-len 8192 --tensor-parallel-size=4 --gpu-memory-utilization=0.6 --quantization=awq --host 0.0.0.0 --port 40030 --served-model-name qwen3-coder-30b` …
-
#23968 [rocm] update pytorch rocm from 6.3 to 6.4 — rocm,ci/build,stale — by draftbk (关闭于: 2026-01-09 10:26 (UTC+8)) [💬7 | +1/-1, 1 files | commented:7 approved:1] ## Purpose Update pytorch rocm from 6.3 to 6.4
Initial purpose for myself: trying to test TorchAO FP8 on MI300X, and from the doc, it is tested with ROCm 6.4.
## Test Plan & Test Result ``` # MI300X lm_eval
--model vllm
… -
#24224 [bugfix] fix returned chunk too large bug — performance,stale — by qibaoyuan (关闭于: 2026-01-09 10:26 (UTC+8)) [💬3 | +20/-10, 1 files | commented:1] ## Purpose
When the returned chunk is too large, the benchmark fails. This patch solves the problem by using `iter_any` to read the response stream instead.
after applying this patch, everything goes well:
…
-
#24225 [Benchmarks]Accelerate random dataset generation — performance,stale — by wuhang2014 (关闭于: 2026-01-09 10:26 (UTC+8)) [💬5 | +86/-11, 2 files | commented:4] ## Purpose
Fix #24058 using backend_tokenizer’s
`decode_batch` and `encode_batch_fast` methods to run without the GIL as much as possible.
## Test Plan
`--num-prompts` is in range of 1, 10, 100, 1000, 10000, 100000
``` vllm bench serve --base-url http://127.0.0.1:8000 --model /home/models/Qwen3-0.6B --dataset-name random --random-input-len 2000 --random-output-len 1000 --max-concurrency 10000 --num-prompts 100000 --seed $(date +%s) --ignore-eos …
-
#24445 [Do not merge] Test pytest-cov — ready,needs-rebase,ci/build,stale — by rzabarazesh (关闭于: 2026-01-09 10:25 (UTC+8)) [💬3 | +295/-155, 4 files] This is just to test the CI with a simple pytest-cov configuration. This will not be merged.
## Purpose
## Test Plan
## Test Result
…
-
#31236 Construting grid using num of active lora in lora kernels — documentation,performance,new-model,rocm,structured-output,frontend,tpu,speculative-decoding,needs-rebase,ci/build — by yugong333 (关闭于: 2026-01-09 09:13 (UTC+8)) [💬7 | +24694/-14818, 711 files | commented:4] ## Purpose
Reduce the kernel overhead when the number of active LoRAs is smaller than max LoRAs. Multiple CUDA graphs are captured, one for each number of active LoRAs.
## Test Plan
Using llmperf to benchmark concurrency = 1, 2, 4, 8 when max-loras = 4
## Test Result
…
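The dispatch idea, capture one CUDA graph per active-LoRA count and route each batch to the smallest captured bucket that fits, can be sketched as follows (the function name and bucketing policy are illustrative, not the PR's code):

```python
import bisect

def pick_captured_graph(num_active_loras: int, captured_sizes: list[int]) -> int:
    """Return the smallest captured graph bucket that can serve the batch.
    Hypothetical helper illustrating per-active-LoRA graph dispatch."""
    captured_sizes = sorted(captured_sizes)
    i = bisect.bisect_left(captured_sizes, num_active_loras)
    if i == len(captured_sizes):
        raise ValueError("more active LoRAs than the largest captured graph")
    return captured_sizes[i]

# Graphs captured for 1, 2, and 4 active LoRAs (max-loras = 4, as in the
# benchmark); a request batch with 3 active LoRAs runs on the size-4 graph.
bucket = pick_captured_graph(3, [1, 2, 4])
```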
-
#25427 [Perf] Optimize torch native input group quant — performance,stale — by mgoin (关闭于: 2026-01-09 06:42 (UTC+8)) [💬1 | +16/-12, 1 files | commented:3]
## Purpose
Benchmark command
python benchmarks/kernels/bench_per_token_quant_fp8.py --dtype bfloat16 --group-sizes 128 \ --hidden-sizes 896 1024 2048 4096 7168 \ --batch-sizes 1 16 128 512 1024…
-
#31989 Update VOLKANURALTR.md — documentation — by volkTRhacV (关闭于: 2026-01-09 06:39 (UTC+8)) [💬4 | +4/-4, 1 files | commented:1 | 📝草稿]
## Purpose
## Test Plan
## Test Result
... -
#31991 [wip] expose online quantization for linear for compressed-tensors path — 无标签 — by vkuzo (关闭于: 2026-01-09 06:31 (UTC+8)) [💬1 | +126/-4, 2 files | commented:2] Summary:
WIP and not ready for review yet
High level, I’d like to work towards enabling online quantization for all the quantization recipes in compressed-tensors. Ideally we work through a POC for a single recipe and get alignment from vllm team, and then we can parallelize further work if needed.
For now, tested e2e on a tiny dense model (facebook/opt-125m) and float8 rowwise scaling.
tl;dr;
- enables `CompressedTensorsConfig.from_config_file`, and provides an example config file for flo…
-
#31934 [CI] Remove torch nightly checks — ci/build — by simon-mo (关闭于: 2026-01-09 01:41 (UTC+8)) [💬8 | +1/-490, 10 files | commented:1 changes:1] I would like to remove the torch nightly tests, as they have been failing for a while (2+ months) and we don't have a good way to track them. They are unmaintained and didn't help us shorten the time to the latest torch release.
Given this, we can both reduce our codebase complexity and testing cost.
In the future, if we do want to bring it back, I think we should create…
-
#12919 [CI/Build][v1] vLLM v1 automatic benchmarking — performance,ready,needs-rebase,ci/build,unstale,kv-connector — by Shaoting-Feng (关闭于: 2026-01-09 01:35 (UTC+8)) [💬9 | +87/-24, 3 files] This PR extends the performance benchmark to include both v0 and v1. The latency, throughput, and fixed-QPS serving tests will first run with v0 and then with v1. The results for v1 will be recorded and processed in the same way as v0. The filenames will have
`_v1` appended as a suffix. The file structure will look like this:
```bash results/ |-- benchmark_results.json |-- benchmark_results.md |-- benchmark_results_v1.json |-- benchmark_results_v1.md …
-
#29733 [Feature][#29390]: Add timeout support to MultiprocExecutor.collective_rpc and FutureWrapper — v1 — by SandishKumarHN (关闭于: 2026-01-09 01:19 (UTC+8)) [💬3 | commented:1] ## Purpose
- Add support for timeouts when calling vllm.v1.executor.multiproc_executor.MultiprocExecutor.collective_rpc and when awaiting futures returned by non-blocking RPCs.
- Treat a provided timeout as an overall deadline across response MessageQueues when collecting responses.
- Add FutureWrapper.result(timeout=…) and propagate the timeout into FutureWrapper.wait_for_response(get_response, timeout=…) so user code can wait on the Future with an overall timeout.
- This change makes debu…
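Treating the timeout as an overall deadline means recomputing the remaining budget before each queue wait, rather than granting every queue the full timeout. A stdlib sketch of that pattern (not the actual `MultiprocExecutor` code):

```python
import queue
import time

def collect_with_deadline(queues, timeout: float):
    """Collect one response from each queue, treating `timeout` as an
    overall deadline across all waits rather than a per-queue budget."""
    deadline = time.monotonic() + timeout
    results = []
    for q in queues:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("overall deadline exceeded")
        # q.get raises queue.Empty if this wait exhausts the remaining budget.
        results.append(q.get(timeout=remaining))
    return results

q1, q2 = queue.Queue(), queue.Queue()
q1.put("worker-0")
q2.put("worker-1")
out = collect_with_deadline([q1, q2], timeout=1.0)
```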
-
#30301 [ResponsesAPI] Add GPTOSS MCP tool streaming — frontend,ready,needs-rebase,gpt-oss — by qandrew (关闭于: 2026-01-09 00:39 (UTC+8)) [💬3 | +1110/-431, 3 files | commented:3] ## Purpose This change enables streaming support for MCP tools when using GPT OSS. It extends the harmony utilities and response serving infrastructure to handle tool streaming, allowing tool calls and their results to be incrementally streamed back to clients rather than returned as a single batch.
taken over from #30192, builds on top of https://github.com/vllm-project/vllm/pull/30054
## Test Plan
``` VLLM_GPT_OSS_SYSTEM_TOOL_MCP_LABELS=web_search_preview,container,code_interpreter CUDA_VIS…
-
#31955 [BugFix] Clear spec_token_ids for preempted req to prevent grammar conflicts on resumption — v1 — by izhuhaoran (关闭于: 2026-01-09 00:20 (UTC+8)) [💬4 | +9/-0, 1 files | commented:1] ### Purpose Fixes #31876
The unit test
`v1/e2e/test_async_scheduling.py::test_with_spec_decoding` fails intermittently. The failure occurs with the following configuration:
```python test_sampling_params = [ dict(structured_outputs=struct_outputs), ]
…
- #31953 [CI] fix ROCM_HOME: unbound variable — rocm,ready,ci/build — by LucasWilkinson (关闭于: 2026-01-08 23:44 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1]
https://github.com/vllm-project/vllm/pull/31922 broke the “2 Node Tests (4 GPUs in total)” with
`[2026-01-08T07:32:12Z] ./.buildkite/scripts/run-multi-node-test.sh: line 10: ROCM_HOME: unbound variable`
https://buildkite.com/organizations/vllm/pipelines/ci/builds/46074/jobs/019b9c75-6389-40dc-a23b-5ecd3683d951/log
-
#31976 [WIP] Add: EAGLE3 Acceptance Length Regression Tests — speculative-decoding,v1 — by rahul-tuli (关闭于: 2026-01-08 23:22 (UTC+8)) [+208/-0, 1 files | commented:1 | 📝草稿] Adds parameterized pytest tests to detect acceptance length regressions in EAGLE3 speculative decoding. These tests ensure that new commits do not degrade speculative decoding performance.
## Changes
tests/v1/spec_decode/test_acceptance_length.py: New test file with parameterized tests for EAGLE3 model pairs
## Models Tested
| Verifier | Drafter |
|----------|---------|
…
-
#31814 [Bugfix] Inject JSON schema descriptions into prompt for structured outputs — frontend — by ricky-chaoju (关闭于: 2026-01-08 21:29 (UTC+8)) [💬2 | +156/-1, 2 files | commented:1] Fixes #31804
## Problem When using `response_format` with `json_schema`, field descriptions (e.g., `"Country in UPPERCASE"`) were ignored because they were only passed to the grammar constraint (xgrammar) but not to the model itself.
## Solution Inject the JSON schema into the prompt as a system message so the model can understand and follow field-level constraints specified in descriptions.
## Changes
- Added `_build_json_schema_prompt()` helper function in `serving_chat.py` ...
-
#31727 Fix versions for vulnerable packages — ci/build — by adobrzyn (关闭于: 2026-01-08 15:44 (UTC+8)) [+2/-2, 2 files | commented:1]
## Purpose Fix packages versions to resolve vulnerabilities found.
- cbor2:
- CVE-2025-68131
- CBORDecoder reuse can leak shareable values across decode calls
- Patched versions 5.8.0 …
-
#29844 [KV Transfer] Add optional int8 quantization for KV cache transfer to P2pNcclEngine — kv-connector — by xbfs (关闭于: 2026-01-08 14:01 (UTC+8)) [💬4 | +43/-2, 1 files | commented:3] ## Summary
This PR adds optional int8 quantization support for KV cache transfer in `P2pNcclEngine` to reduce network bandwidth usage and improve transfer throughput, especially in bandwidth-constrained scenarios.
## Motivation
When transferring KV cache across nodes in distributed inference setups, network bandwidth can become a bottleneck. By quantizing tensors from float16/float32 to int8 during transfer, we can reduce the data size by approximately 50-75%, significantly improving transfer…
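A generic symmetric int8 round-trip illustrates the bandwidth/precision trade-off (a toy sketch, not the PR's transfer path): fp16 payloads shrink by half and fp32 payloads by three quarters, consistent with the ~50-75% reduction cited, at the cost of quantization error bounded by the scale.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: x ~= q * scale."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Restore approximate floats on the receiving side."""
    return [v * scale for v in q]

# Toy stand-in for a KV cache tile being sent over the wire.
kv = [0.5, -1.25, 3.0, 0.0]
q, scale = quantize_int8(kv)
restored = dequantize_int8(q, scale)
```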
-
#30646 fix: unsatisfiable testing dependencies caused by a version conflict — needs-rebase,ci/build — by leejianwoo-collab (关闭于: 2026-01-08 13:40 (UTC+8)) [💬2 | +2/-2, 1 files | commented:2] # Fix: Resolve dependency conflict with model-hosting-container-standards
## Problem Issue #30595 reported unsatisfiable testing dependencies caused by a version conflict between:
- `model-hosting-container-standards >= 0.1.9` (requires `starlette >= 0.49.1`)
- `fastapi[standard] >= 0.115.0` (some versions require `starlette < 0.48.0`)
This conflict prevented successful dependency resolution during testing and development.
## Root Cause Analysis …