[vLLM GitHub Development Digest] 2026-01-09
[Overview]
- Time window: 2026-01-09 10:47 (UTC+8) ~ 2026-01-10 10:47 (UTC+8)
- New issues: 20 (label distribution: bug:9, feature request:5, RFC:3, help wanted:2, usage:1)
- Closed issues: 21
- New PRs: 43 (label distribution: ready:13, v1:13, rocm:8, documentation:5, speculative-decoding:5)
- Merged PRs: 44
- PRs closed without merging: 10
[New issues]
- #32055 [Feature]: Copy Ops in FP8 MLA — feature request — by robertgshaw2-redhat (created: 2026-01-10 06:11 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
https://github.com/vllm-project/vllm/blob/a4ec0c559521c055519eeabddd8279c83eb4e936/vllm/v1/attention/backends/mla/common.py#L2117
These copy ops add ~20us out of ~300us per layer.
### Alternatives
We can address this by unwrapping MLA so that torch.compile will handle it.
…
- #32069 [Usage]: — usage — by xiazi-yu (created: 2026-01-10 09:02 (UTC+8)) Comparison testing between Qwen3-VL-Embedding-2B on vLLM 13.0 and Qwen3VLEmbedder revealed inconsistencies in accuracy.
- #32068 [Bug]: Recompile in LLama model — bug — by Lucaskabela (created: 2026-01-10 08:43 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
...
```
- #32067 [Bug]: Cannot Run on B200s — bug — by yzeng58 (created: 2026-01-10 07:55 (UTC+8)) ### 🐛 Describe the bug
I try to run vllm on B200 but it gives me a weird error.
The following is the minimal code to reproduce the error.
```python
# test.py
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"
```
…
- #32066 [RFC]: Expose Sharding Metadata from vLLM Model Layers — RFC,rl — by chunzhao (created: 2026-01-10 07:35 (UTC+8)) [💬3] ### Motivation.
Currently, downstream systems that need to understand how vLLM model weights are sharded must infer sharding patterns from tensor names using heuristics:
```python
# Current approach: Pattern matching on tensor names
def infer_sharding(tensor_name: str) -> ShardingType:
    if "gate_proj" in tensor_name or "up_proj" in tensor_name:
        return COLUMN_PARALLEL  # [tp_size, 1]
```
…
- #32053 [Model Bash]: DeepSeek-R1-NVFP4 B200 — feature request — by robertgshaw2-redhat (created: 2026-01-10 05:20 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
# DeepSeek-R1 NVFP4 Model Bash
```
launch_trtllm_moe_trtllm_attn_fused_ar_rope_fp8_kv_ep_spec_decode_mpt3: VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND="latency" CUDA_VISIBLE_DEVICES=1,3,0,5 vllm serve -tp --port
--attention-config.use_trtllm_ragge…
```
- #32059 [Feature]: Use platform op if inside a custom op — feature request — by robertgshaw2-redhat (created: 2026-01-10 06:19 (UTC+8)) [💬1] ### 🚀 The feature, motivation and pitch
We have the `CustomOp` for `SiluAndMul`:
```python
@staticmethod
def forward_native(x: torch.Tensor) -> torch.Tensor:
    """PyTorch-native implementation equivalent to forward()."""
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]
```
…
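The gated-activation pattern above can be checked against a tiny pure-Python sketch (stdlib only; `silu_and_mul` is an illustrative name, not vLLM's actual API): split the last dimension in half, apply SiLU to the first half, and multiply elementwise with the second.

```python
import math

def silu(v: float) -> float:
    # SiLU(x) = x * sigmoid(x)
    return v / (1.0 + math.exp(-v))

def silu_and_mul(x: list) -> list:
    # First half is SiLU-gated and multiplied elementwise with the second
    # half, mirroring F.silu(x[..., :d]) * x[..., d:] with d = len(x) // 2.
    d = len(x) // 2
    return [silu(a) * b for a, b in zip(x[:d], x[d:])]
```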
- #32062 [Bug]: Dynamic shapes config applied in decorators not set for conditional range compilation — bug — by laithsakka (created: 2026-01-10 06:56 (UTC+8)) ### Your current environment
NA
### 🐛 Describe the bug
Namely: ``` dynamo_config_patches = {} try: …
- #32057 [Feature][DSR1 NVFP4 Model Bash]: FlashInfer Quantize Op — feature request — by robertgshaw2-redhat (created: 2026-01-10 06:15 (UTC+8)) ### 🚀 The feature, motivation and pitch
We currently run a flashinfer quantize op
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py#L290-L294
This is much slower than the vllm native op
### Alternatives
…
- #32046 Google MedASR — no labels — by Jeevi10 (created: 2026-01-10 04:17 (UTC+8)) ### The model to consider.
https://huggingface.co/google/medasr
### The closest model vllm already supports.
Whisper
### What’s your difficulty of supporting the model you want?
…
- #32023 [Bug]: Fail to get response from a Qwen2-audio-7B-Instruct Service — bug — by moonlightian (created: 2026-01-09 18:20 (UTC+8)) [💬1] ### Name of failing test
None
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #32033 [Bug]: [CPU Backend]: Engine crashes on model re-run with torch=2.10.0 on CPU — bug,cpu — by fadara01 (created: 2026-01-10 00:37 (UTC+8)) [💬9] ### Your current environment
The output of `python collect_env.py`:
```text
Your output of `python collect_env.py` here
```
…
- #32038 KV-cache / long-context boundary request (minimal repro + metric) — 7-day receipts eval — performance — by StanByriukov02 (created: 2026-01-10 01:27 (UTC+8)) ### Proposal to improve performance
I’m looking for the smallest vLLM boundary where KV-cache / long-context becomes the bottleneck (memory traffic + tail latency). I want a runnable repro + the metric you care about, then I can return a receipts-backed before/after within 7 days.
The candidate improvement class is topology: replace scan-like access patterns with structured traversal / layout on a declared boundary (with SHA256-indexed receipts), but I won’t speculate until we lock the minimal…
- #32028 [RFC]: EPLB Implementation Refactoring — RFC — by ilmarkov (created: 2026-01-09 22:12 (UTC+8)) [💬1] ### Motivation.
### Current Pain Points
The existing EPLB implementation, while functionally correct, suffers from several issues that impact development velocity and system reliability:
- Coordination Complexity: The async EPLB uses multiple variables (`rebalanced`, `ep_buffer_ready`, `pending_global_ready_check`, `layer_to_transfer`) to coordinate between threads, making it difficult to reason about system state and leading to debugging challenges.
- Testing Challenges: Race cond…
- #32029 [RFC]: Online Quantization Roadmap — RFC — by vkuzo (created: 2026-01-09 22:32 (UTC+8)) [💬1] ### Motivation.
Online quantization (weights passed to vLLM in high precision, quantization of weights done inside of vLLM) is emerging as an important use case for quick experimentation and RL. Today vLLM supports online quantization with a single recipe (float8 per-tensor scaling). I wanted to align on the high level future direction of the specifics of how online quantization is implemented in vLLM as we extend this to more recipes / more hardware types / more use cases.
### Proposed Chan…
- #32017 [Bug]: [lm_eval crashed] lm eval accuracy test crashed using VLLM MAIN branch but v0.14.0rc0 works — bug,rocm — by danielhua23 (created: 2026-01-09 15:32 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
...
```
- #32012 [Bug]: Qwen3-VL-MoE fails loading when using bitsandbytes quantization — bug — by Datta0 (created: 2026-01-09 13:08 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
...
```
- #32009 [Bug]: 0% Acceptance rate with FI Cutlass DeepSeekR1 NVFP4 with mtp ep — bug,help wanted — by robertgshaw2-redhat (created: 2026-01-09 12:09 (UTC+8)) ### Your current environment
b200
### 🐛 Describe the bug
```
launch_cutlass_moe_trtllm_attn_fused_ar_rope_fp8_kv_ep_spec_decode: VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND="throughput" CUDA_VISIBLE_DEVICES=1,3,0,5 vllm serve -tp --port
--attention-config.use_trtllm_ragged_deepseek_prefill=True --attention-backend FLASHINFER_MLA
```
…
- #32011 [Bug]: "Re-running CMake" loop in pip install — bug — by zhaohaixu (created: 2026-01-09 12:45 (UTC+8)) ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
...
```
- #32010 [Feature]: Support DCP with FP8 KV Cache — help wanted,feature request — by robertgshaw2-redhat (created: 2026-01-09 12:31 (UTC+8)) ### 🚀 The feature, motivation and pitch
DCP is great. FP8 KV cache is great. We should support both at the same time:
(EngineCore_DP0 pid=2734260) ERROR 01-08 23:29:29 [core.py:900] RuntimeError: Worker failed with error 'DCP not support fp8 kvcache now.', please check the stack trace above for the root cause…
[Closed issues]
- #11416 [Feature]: QTIP Quantization — feature request,stale — by ehtom (closed: 2026-01-10 10:14 (UTC+8)) [💬7] ### 🚀 The feature, motivation and pitch
Over the last year there have been several exciting new low-bit quantization algorithms proposed. These include AQLM (which is now in VLLM) and QuIP (which is in aphrodite engine, a vllm relative). QTIP is a new algorithm which has almost lossless performance even at 2-bits. There is code available implementing the technique here:
https://github.com/Cornell-RelaxML/qtip
I have run this code and experimented with the 8B Llama 3.1 model quantized dow…
- #17759 [Bug]: Inconsistent outputs with deterministic sampling (temperature=0) when serving Qwen3-32B with vllm-0.8.5 — bug,stale — by YoctoHan (closed: 2026-01-10 10:13 (UTC+8)) [💬15] ### Your current environment
The output of `python collect_env.py`:
```text
INFO 05-06 19:57:58 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
...
```
- #18660 [Doc]: CPU Docker Image Requires AVX512 — documentation,stale — by zsxsoft (closed: 2026-01-10 10:13 (UTC+8)) [💬4] ### 📚 The doc issue
I'm looking for a way to run vLLM on CPU without compilation. Following the documentation at https://docs.vllm.ai/en/stable/getting_started/installation/cpu.html#troubleshooting, I used the Docker image
`public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5.post1`. However, when running vLLM directly, it throws a SIGILL error. After debugging with GDB, I confirmed there is an AVX512 instruction,
`vinserti64x4`. But the documentation only mentions "AVX512 (optional, recommended)", the …
- #21583 [Bug]: [P/D] P/D is incompatible with spec decoding — bug,stale — by robertgshaw2-redhat (closed: 2026-01-10 10:13 (UTC+8)) [💬5] ### Your current environment
unsure, user report
### 🐛 Describe the bug
For dev, we have had asserts in various places. We have one here that is breaking spec decoding
@NickLucche - could you take a look?
…
- #23641 [RFC]: Frequency and Cost Aware Eviction Policy for Prefix Caching — feature request,stale — by luolun (closed: 2026-01-10 10:13 (UTC+8)) [💬2] ### 🚀 The feature, motivation and pitch
Key Idea:
For a prefix cache entry of size `size` and access frequency `freq`, define the Retention Benefit of a prefix (over time T) as `T * freq * compute_cost`, where `compute_cost = cost_factor * cost_func(size)`…
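The proposed scoring can be sketched in a few lines (illustrative only; the RFC leaves `cost_func` abstract, so a linear stand-in is used here):

```python
def retention_benefit(T, freq, size, cost_factor=1.0, cost_func=lambda s: s):
    # Retention Benefit over time T: T * freq * compute_cost,
    # where compute_cost = cost_factor * cost_func(size).
    compute_cost = cost_factor * cost_func(size)
    return T * freq * compute_cost

# A frequently hit, expensive-to-recompute prefix scores higher than a
# cold, cheap one, so under this policy it would be evicted later.
hot = retention_benefit(T=10, freq=5, size=4)
cold = retention_benefit(T=10, freq=1, size=2)
```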
- #24394 [Performance]: Improve Prefix Cache Hit Rate and Reduce Dirty Cache Impact — performance,stale — by CLFutureX (closed: 2026-01-10 10:13 (UTC+8)) [💬2] ### Proposal to improve performance
## Optimization Proposal: Improve Prefix Cache Hit Rate and Reduce Dirty Cache Impact
### Current Background Currently, vLLM uses prefix caching for matching. The earlier a request is processed, the higher its hit rate. Therefore, when evicting and releasing blocks, they are reversed before being added to the free list, implementing eviction from the tail to improve block reuse rate.
However, between multiple adjacent requests, the overall block distributio…
- #24398 [Bug]: Compatibility Issue with Compressed-Tensors Sparse-Only Models (Llama4 MoE) — bug,stale — by ibrahimnd2000 (closed: 2026-01-10 10:13 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`:
```text
============================== System Info ==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
...
```
- #24399 [Usage]: — usage,stale — by parth0774 (closed: 2026-01-10 10:13 (UTC+8)) [💬2] ### Your current environment
The output of `python collect_env.py`
### How would you like to use vllm
I have been running a llama model on my machine using vllm serve meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8081 --tensor-parallel-size 4 --enable-auto-tool-choice --tool-call-parser llama3_json …
- #24408 [Usage]: Does the qwen3 model extrapolate automatically, or do I need to set the rope_scaling parameter? — usage,stale — by yuanjie-ai (closed: 2026-01-10 10:13 (UTC+8)) [💬3] ### Your current environment
The output of `python collect_env.py`
### How would you like to use vllm
I want to run inference of a specific model. I don’t know how to integrate it with vllm. …
- #24429 [Feature]: Does vLLM support Seed OSS-36B Instruct quantized in int4/int8 format? — feature request,stale — by xiaotianns (closed: 2026-01-10 10:13 (UTC+8)) [💬3] ### 🚀 The feature, motivation and pitch
Does vLLM support Seed OSS-36B Instruct quantized in int4/int8 format?
### Alternatives
No response
### Additional context
…
- #26379 [Bug][torch.compile]: piecewise graphs not cached for LLaMa-4-Scout — bug,torch.compile,stale — by ProExpertProg (closed: 2026-01-10 08:36 (UTC+8)) [💬6] ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
uv is set
============================== System Info ==============================
...
```
- #31918 [Bug]: nvidia/DeepSeek-R1-NVFP4-v2 accuracy issue with NVFP4 dispatch (CUTEDSL MoE + DeepEP LL) — bug — by minosfuture (closed: 2026-01-10 04:03 (UTC+8)) [💬5] ### Your current environment
The output of `python collect_env.py`:
```text
Your output of `python collect_env.py` here
```
…
- #30879 [Doc]: Add some documentation about encoder compilation — documentation,torch.compile — by zou3519 (closed: 2026-01-10 03:48 (UTC+8)) [💬2] ### 📚 The doc issue
I want something like a design doc for encoder compilation. For example:
- It uses `support_torch_compile` and `set_model_tag` to avoid cache collisions
- It supports or doesn't support the following features that VllmBackend does: cudagraphs, compile_ranges, and a high-level explanation for how these are turned off or on.
- It inherits from compilation_config (or maybe it doesn't)
- here’s how to turn it on/off
I’m having a difficult time thinking through the edge cases in htt…
- #26473 [Bug][ROCm]: Failed to send request to Qwen3-Next — bug,feature request,rocm — by zhentaocc (closed: 2026-01-09 19:28 (UTC+8)) [💬4]
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
...
```
- #29541 [CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) — ci-failure — by AndreasKaratzas (closed: 2026-01-09 14:05 (UTC+8)) [💬6] ### Name of failing test
pytest -v -s entrypoints/openai/test_collective_rpc.py; pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_collective_rpc.py --ignore=entrypoints/openai/tool_parsers/; pytest -v -s entrypoints/test_chat_utils.py
### Basic information
- Fla…
- #29543 [CI Failure]: mi325_2: Distributed Model Tests (2 GPUs) — ci-failure — by AndreasKaratzas (closed: 2026-01-09 14:03 (UTC+8)) [💬3] ### Name of failing test
`TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)' && CUDA_VISIBLE_DEVICES=0,1 pytest -v -s model_executor/model_loader/test_sharded_state_loader.py && pytest models/test_transformers.py -v -s -m 'distributed(num_gpus=2)' && pytest models/language -v -s -m 'distributed(num_gpus=2)' && pytest models/multimodal -v -s -m 'distributed(num_gpus=2)' --ignore models/multimodal/generation/test_whisper.py && VLLM_WORKER_MULTIPROC_METHOD=spawn pyte…
- #29516 [CI Failure]: mi325_4: Distributed Tests (A100) — ci-failure — by AndreasKaratzas (closed: 2026-01-09 14:03 (UTC+8)) [💬5] ### Name of failing test
pytest -v -s distributed/test_custom_all_reduce.py && torchrun --nproc_per_node=2 distributed/test_ca_buffer_sharing.py && TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)' && pytest -v -s -x lora/test_mixtral.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #29521 [CI Failure]: mi325_1: Samplers Test — ci-failure — by AndreasKaratzas (closed: 2026-01-09 13:58 (UTC+8)) [💬8] ### Name of failing test
pytest -v -s samplers && VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in `transformers`)
…
- #30099 [Bug]: Tokens are missing in streaming mode. — bug — by Ri0S (closed: 2026-01-09 12:07 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
============================== System Info ==============================
...
```
- #28635 [Bug]: Streaming=True causes missing or scrambled tokens with GPT-OSS 120B on vLLM v0.11.0 — bug — by cxz1418 (closed: 2026-01-09 11:59 (UTC+8)) [💬26] ### Your current environment
NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 - vLLM versions: v0.10.2 and v0.11.0
- Model: GPT-OSS 120B
- Streaming: True
- --enforce-eager: Tested
### 🐛 Describe the bug …
- #26458 [Bug][Deepseek-v32-exp]: import `DeepseekV32IndexerCache` from `model registry` in gpu_model_runner — bug,stale — by ILikeIneine (closed: 2026-01-09 11:05 (UTC+8)) [💬1] ### Your current environment
The output of `python collect_env.py`:
```text
vllm-metax plugin backend
```
…
[New PRs]
- #32019 [LoRA][Perf] Improve FusedMoE LoRA performance for small rank — no labels — by xyang16 (created: 2026-01-09 17:52 (UTC+8)) [💬2 | +14/-2, 1 files | commented:5] ## Purpose
This PR improves the performance of FusedMoE LoRA for smaller ranks.
- #32041 [Doc] Update NVIDIA PyTorch Docker image in gpu.cuda.inc.md since it seems outdated — documentation,nvidia — by rasmith (created: 2026-01-10 03:10 (UTC+8)) [💬1 | +1/-1, 1 files | commented:2] The NVIDIA PyTorch Docker image link in
`gpu.cuda.inc.md` seems outdated: using it together with `pip install -e .` doesn't work and results in a `setuptools` error, but when switching to the more recent version, everything works.
[!NOTE] Updates the recommended NVIDIA PyTorch Docker image in `gpu.cuda.inc.md` to a newer tag for build troubleshooting.
- Changes the `docker run` example to use `nvcr.io/nvidia/pytorch:25.12-py3` instead of `23.10-py3`
Written by [C…
- #32065 [CI] Allow Deprecated Quantization For LM Eval Tests — ready,ci/build,ci-failure — by micah-wil (created: 2026-01-10 07:07 (UTC+8)) [+1/-0, 1 files | commented:2 approved:1]
Ever since https://github.com/vllm-project/vllm/pull/31688 began the process of deprecating some quantization schemes, the `LM Eval Large Models` test group has been failing on main, as seen here for example: https://buildkite.com/vllm/ci/builds/46314/steps/canvas?sid=019ba18f-0a1a-4482-82d8-b8c102fd92c0. The PR updated other relevant tests to allow deprecated quantization types, but skipped `test_lm_eval_correctness.py`, so the `Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform` test case is failing with t…
- #32070 [RFC] Improve environment variable declaration and handling (#31249) — documentation — by nliu365 (created: 2026-01-10 09:06 (UTC+8)) [💬2 | +1400/-0, 5 files | commented:2 | 📝 draft] This PR implements the refactoring proposed in issue #31249 to improve environment variable handling in vLLM by consolidating declarations into a single source of truth with automatic type conversion.
## Summary
Replaces the duplicated type definitions and getter dictionary pattern with a unified system that:
- Declares each variable once with type annotations and defaults
- Automatically converts types based on type hints
- Supports custom parsing, lazy defaults, and validated choices
- Provi…
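A single-source-of-truth declaration of this kind could look roughly like the following stdlib-only sketch (names like `EnvVar` and `REGISTRY` are illustrative, not the PR's actual API):

```python
import os
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class EnvVar:
    # One declaration per variable: name, target type, default,
    # and an optional custom parser.
    name: str
    type_: type
    default: Any
    parser: Optional[Callable[[str], Any]] = None

    def get(self) -> Any:
        raw = os.environ.get(self.name)
        if raw is None:
            return self.default
        if self.parser is not None:
            return self.parser(raw)
        if self.type_ is bool:
            # bools need explicit handling: bool("0") would be True
            return raw.lower() in ("1", "true", "yes")
        return self.type_(raw)

# Type conversion follows the declared type automatically.
REGISTRY = {
    "MYAPP_PORT": EnvVar("MYAPP_PORT", int, 8000),
    "MYAPP_DEBUG": EnvVar("MYAPP_DEBUG", bool, False),
}
```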
- #32071 Use inference_mode() for torchao weight quantization — ci/build — by jerryzh168 (created: 2026-01-10 09:16 (UTC+8)) [💬2 | +27/-63, 3 files | commented:1] Summary: Trying to resolve issues: https://github.com/pytorch/pytorch/issues/170419, https://github.com/pytorch/pytorch/issues/164872, https://github.com/pytorch/ao/issues/3487
The main thing is that in vllm we didn't create the model under inference_mode(), but inference runs in inference mode, which causes a runtime error:
`RuntimeError: Cannot set version_counter for inference tensor`. And there is no easy workaround for the use case: https://github.com/pytorch/pytorch/issues/170419#issuecomment-3656315408
so …
- #32058 [Perf] Optimize grouped topk kernel, 1.2% E2E Throughput improvement — ready — by yewentao256 (created: 2026-01-10 06:18 (UTC+8)) [💬1 | +186/-283, 2 files | commented:3] ## Purpose
Part of https://github.com/vllm-project/vllm/issues/31755
Here we combine the two kernels, reducing global writes and reads.
## Test
…
- #32056 [Perf] Optimize async scheduling placeholder using empty — ready,v1 — by yewentao256 (created: 2026-01-10 06:13 (UTC+8)) [💬2 | +4/-1, 1 files | commented:1 approved:1] ## Purpose
For a placeholder tensor there is no need to use `rand`; `empty` is good enough.
[!NOTE] Cursor Bugbot is generating a summary for commit 3462afd4a5afeda39c35c85d50e394211106a38f. Configure here.
…
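The rationale, as a minimal sketch (assumes PyTorch is installed; shapes are arbitrary): `torch.empty` only allocates, skipping the random-number generation that `torch.rand` performs, which is all a placeholder that will be overwritten needs.

```python
import torch

# A placeholder whose initial contents are never read can use
# uninitialized memory from torch.empty, avoiding the RNG cost
# of torch.rand.
placeholder = torch.empty(128, 8, dtype=torch.int32)
placeholder.fill_(0)  # stand-in for the real write that follows
```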
- #32049 [CI/Build][Hardware][AMD] Fix test_engine_core_client.py — rocm,v1 — by rjrock (created: 2026-01-10 04:42 (UTC+8)) [+23/-1, 1 files | commented:1 | 📝 draft] ## Purpose
To get ROCm to pass the tests in test_engine_core_client.py, as run in AMD's V1 Test e2e + engine test group.
## Test Plan
pytest -v -s v1/engine/test_engine_core_client.py
## Test Result
Still some failures, draft atm.
Essential Elements of an Effective PR Description Checklist
...
- #32045 [Core] Use weights_only=True with torch.load — ready — by russellb (created: 2026-01-10 04:08 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] This code neglected to use
`weights_only=True` with `torch.load()`. This is unsafe if used on potentially untrusted model data. We should always be using `weights_only=True` unless necessary. It turns out that the default changed to
`True` as of PyTorch 2.6, so this isn't a security vulnerability, but we explicitly set it to `True` everywhere else, so change it here for consistency. This also prevents it from getting flagged by security scanners, which is how I became aware of it via a private re…
- #32052 [2/N][Attention] Fix pre-commit errors — ready,v1 — by MatthewBonanni (created: 2026-01-10 05:19 (UTC+8)) [+6/-16, 3 files | commented:1 approved:1] ## Purpose
Another step in the attention restructuring, #31919. This PR fixes pre-commit errors which arose when moving files into regions of higher mypy scrutiny.
## Test Plan Pre-commit should pass
## Test Result
…
- #32044 make assume_32_bit_indexing configurable — no labels — by laithsakka (created: 2026-01-10 04:01 (UTC+8)) [+9/-1, 2 files | commented:8 approved:1] Some Meta-internal models require 64-bit indexing for compilation; this allows them to override the default config. Also addresses the compilation config hashing issue reported by AI.
[!NOTE] Cursor Bugbot is generating a summary for commit 106dbf229af2a5584e9968aaa238e0b3a8f7e636. Configure here.
…
- #32054 [3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py — rocm,speculative-decoding,ready,v1,cpu,nvidia,ready-run-all-tests — by MatthewBonanni (created: 2026-01-10 06:03 (UTC+8)) [+371/-370, 36 files | commented:3 approved:1]
## Purpose
Step 3 of #31919. Moves chunks of code from utils.py to backend.py (unchanged) and updates imports accordingly. The following objects are moved:
`CommonAttentionMetadata`, `AttentionMetadataBuilder`, `AttentionCGSupport`
## Test Plan CI (run all tests)
## Test Result …
- #32060 [4/N][Attention] Move MLA common to model_executor — rocm,speculative-decoding,ready,v1,kv-connector,nvidia,ready-run-all-tests — by MatthewBonanni (created: 2026-01-10 06:45 (UTC+8)) [+48/-42, 14 files | commented:1 approved:1] ## Purpose
Step 4 of #31919. Moves `vllm/v1/attention/backends/mla/common.py` to `vllm/model_executor/layers/attention/mla_attention.py` and updates imports accordingly.
## Test Plan
CI
## Test Result
...
- #32061 [ROCm][CI] Fix engine core client tests for ROCm spawn multiprocessing — rocm,v1 — by AndreasKaratzas (created: 2026-01-10 06:50 (UTC+8)) [+181/-70, 1 files | commented:2] Follow-up from: https://github.com/vllm-project/vllm/pull/32049. Fixes `test_engine_core_client.py` utility method tests failing on ROCm.
## Problem
ROCm requires `spawn` as the multiprocessing start method (NVIDIA uses `fork`). With `spawn`, child processes start a fresh Python interpreter and re-import all modules, which means monkey-patches applied in the parent process are not inherited. This caused utility method tests to fail with: ``` AttributeError: 'EngineCoreProc' object has no attr…
- #32064 [5/N][Attention] Finish eliminating `vllm/attention` folder — no labels — by MatthewBonanni (created: 2026-01-10 07:03 (UTC+8)) [+777/-733, 4 files | commented:1 | 📝 draft] ## Purpose
Step 5 of #31919: This PR finishes eliminating the `vllm/attention` folder by doing the following:
- Split `vllm/attention/layer.py` into `vllm/model_executor/layers/attention/mla_attention.py` (`MLAAttention`, `unified_mla_attention`) and `vllm/model_executor/layers/attention/attention.py` (`Attention`, `unified_attention`)
- Move `vllm/attention/utils/kv_sharing_utils.py` content into `vllm/model_executor/layers/attention/attention.py`
- Move `vllm/att…
- #32063 [ROCm][CI] Fix flaky `test_function_calling_with_stream` and reduce schema test examples — rocm,gpt-oss — by AndreasKaratzas (created: 2026-01-10 06:56 (UTC+8)) [+14/-10, 2 files | commented:1] This PR fixes two test failures in the Harmony and OpenAPI schema test suites:
- `test_function_calling_with_stream`: Fixed an `UnboundLocalError` caused by accessing an uninitialized variable when the model doesn't return the expected function call.
- `test_openapi_stateless`: Reduced test examples to avoid timeout failures caused by schemathesis generating malformed requests that the server processes slowly.
## Changes
### 1. `test_function_calling_with_stream` fix…
- #32036 [responsesAPI] add unit test for optional function tool call id — no labels — by qandrew (created: 2026-01-10 01:06 (UTC+8)) [💬1 | +118/-0, 1 files | commented:3] ## Purpose
## Test Plan
Follow-up from #31999
## Test Result
unit tests pass
…
- #32048 Added qwen3 vision language moe support for speculative decoding — speculative-decoding,v1,qwen — by shanjiaz (created: 2026-01-10 04:41 (UTC+8)) [+17/-1, 2 files | commented:1 | 📝 draft]
## Purpose
## Test Plan
## Test Result
...
- #32051 separate draft and target model loading config — speculative-decoding,v1,meta-exported,fb-exported — by ZhengkaiZ (created: 2026-01-10 05:17 (UTC+8)) [💬1 | +14/-3, 3 files | commented:2] Summary: separate draft and target model loading config
Test Plan: run on top of D86736723: sh vllm/fb/scripts/llama4/start_l4_maverick_spec_decoding_h200.sh; server log: P2108209888 I0105 12:21:20.634000 2686673 time_request_logger.py:217] Successfully completed request d990167cb3574d75a0e1b44f26178180 after 0.805 seconds INFO 01-05 12:21:23 [loggers.py:257] 1 Engines Aggregated: Avg prompt throughput: 9.2 tokens/s, Avg generation throughput: 34.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV…
- #32040 [ROCm][CI] Handle pytest status code 5 when a shard isn't allocated any tests — rocm,ci/build — by divakar-amd (created: 2026-01-10 02:24 (UTC+8)) [+11/-2, 2 files | commented:2] This PR ensures that test shards exiting with status 5 (indicating "no tests collected") are handled gracefully in AMD CI. In a multi-GPU setting, when tests are sharded among GPUs, it is possible that a particular shard is not allocated any tests at all. In such cases, pytest returns status 5.
>> Shard 0: collected 312 items / 37 deselected / 275 selected
>> Shard 0: Running 0 items in this shard
- Updated `run-amd-test.sh` to treat shard exit status 5 ("no tests collecte…
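The handling can be sketched as a small wrapper (illustrative, not the actual `run-amd-test.sh` change): pytest exit status 5 means "no tests collected", which is expected for an empty shard and should not fail CI.

```python
import subprocess
import sys

def run_shard(cmd):
    # Run one shard's test command; pytest exit code 5 means
    # "no tests collected", which is expected for an empty shard,
    # so it is remapped to success.
    status = subprocess.run(cmd).returncode
    if status == 5:
        print("shard collected no tests; treating as success")
        status = 0
    return status

# Simulate an empty shard: a process exiting with status 5.
empty = run_shard([sys.executable, "-c", "raise SystemExit(5)"])
```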
- #32050 [EPLB][Cleanup] Remove `is_async_enabled` from `EplbModelState` — no labels — by SageMoore (created: 2026-01-10 04:43 (UTC+8)) [+12/-20, 2 files | commented:1 | 📝 draft] ## Purpose
See above. This PR just removes the `is_async_enabled` instance variable from `EplbModelState` and refactors the code to use `is_async` in `EplbState`. Having two variables to determine whether we are running async EPLB invites the question: "When can these two variables be different?" They really shouldn't be different. This is a bit of unnecessary complexity and opens us up to bugs further down the line, where we update one and not the other. Bonus: the linter was complaining about th…
- #32047 [Refactor] Remove unused cutlass moe problem size function — nvidia — by yewentao256 (created: 2026-01-10 04:24 (UTC+8)) [+0/-75, 4 files | commented:1] ## Purpose
Remove the unused `get_cutlass_moe_mm_problem_sizes` function. No test needed.
[!NOTE] Cursor Bugbot is generating a summary for commit 6d738cb321a6d1f197e1bc05064a7b2f8ca012a6. Configure here. …
- #32043 [AOT compilation] cached inductor artifacts benchmark — performance — by dolpm (created: 2026-01-10 04:01 (UTC+8)) [+682/-0, 1 files | commented:3] ## Purpose
split out of https://github.com/vllm-project/vllm/pull/25205
## Test Plan
## Test Result
...
- #32042 [AMD][QWEN3-NEXT] FP8 Tunings — performance,rocm,qwen — by draftbk (created: 2026-01-10 03:50 (UTC+8)) [+586/-2, 5 files | commented:1] ## Purpose
Add pre-tuned FP8 block-scaled matmul kernel configs for Qwen3-Next-80B on 1 MI300X.
**Configs added:**
- `N=12288,K=2048,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128,128].json`
- `N=2048,K=4096,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128,128].json`
- `N=1024,K=2048,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128,128].json`
- `N=9216,K=2048,device_name=AMD_Instinct_MI300X,dtype=fp8_w8a8,block_shape=[128,128].json`
Tu…
- #32034 [Refactor] Remove numpy split in async scheduling — ready,v1 — by yewentao256 (created: 2026-01-10 00:39 (UTC+8)) [💬1 | +6/-12, 1 files | commented:2 approved:1] ## Purpose
Remove the numpy split in async scheduling for a small performance improvement.
## Test
export MODEL="zai-org/GLM-4.7-FP8"
vllm serve $MODEL -tp 8 --port 9256 --enable-expert-parallel --max_num_seqs 128…
- #32035 [Docs] Add docs about OOT Quantization Plugins — documentation,ready — by mgoin (created: 2026-01-10 00:48 (UTC+8)) [💬1 | +158/-0, 2 files | commented:1] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
...
- #32039 AMD CI Test - unskip moe_sum test and moe_align_block_size tests — rocm — by hongxiayang (created: 2026-01-10 01:32 (UTC+8)) [💬1 | +11/-4, 3 files | commented:1] ## Purpose
To test AMD CI by enabling one skipped test.
Summary of Changes
- Fixed test_moe.py and test_moe_align_block_size.py to unskip them on ROCm: removed the unnecessary `@pytest.mark.skipif(current_platform.is_rocm())` from test_moe_sum, test_moe_align_block_size, and test_moe_align_block_size_with_expert_map
…
- #32026 [Refactor] Separate sequence and token pooling types — documentation,new-model,frontend,ready,qwen — by DarkLight1337 (created: 2026-01-09 19:57 (UTC+8)) [💬5 | +324/-204, 42 files | approved:1 commented:7] ## Purpose
This refactor is necessary to cleanly enable both sequence and token pooling tasks at the same time, allowing us to support CLS/LAST/MEAN pooling together with tokenwise pooling methods other than ALL.
## Test Plan
## Test Result
...
- #32037 [CPU][BugFix] Disable AOT Compile for CPU — no labels — by fadara01 (created: 2026-01-10 01:19 (UTC+8)) [💬1 | +6/-1, 1 files | commented:1 approved:1] ## Purpose
Disable AOT Compile for CPU. Fixes: #32033
## Test Plan
Reproducer in #32033
## Test Result …
- #32022 [Bugfix] fix memory inconsistency in cross-process shared memory — no labels — by slippersss (created: 2026-01-09 18:18 (UTC+8)) [💬2 | +6/-0, 1 files | commented:1 | 📝 draft] ## Purpose
This PR aims to fix a potential memory inconsistency in the current cross-process shared memory. The related issue is described in #27858. It is probably due to the weak memory ordering of some CPU architectures: readers may read
`metadata_buffer[0] = 1` before `buf` is fully written. We use `memory_fence()` to enforce the order of the buffer (i.e. `buf`) and flag (i.e. `metadata_buffer[0]`) writes.
## Test Plan
This issue occurs intermittently, so we need long-running high-concurrency te…
- #32031 [NIXL][Bugfix] Failure logging overhaul + early metadata free on failure — v1,kv-connector — by NickLucche (created: 2026-01-09 23:12 (UTC+8)) [💬2 | +234/-28, 2 files | commented:6] ## Overview
This PR is aimed at improving logging to more easily identify failures during runs. It does so by providing a shared util function with more comprehensive context of the failure.
Change result: ``` # before ERROR 01-09 12:04:06 [nixl_connector.py:2281] NIXL transfer setup/initiation failed for request test_transfer_setup_failed_req. Marking blocks as invalid.
…
-
#32032 [CI/Build] Publish CPU images for each release — ci/build — by nathan-weinberg (创建于: 2026-01-09 23:24 (UTC+8)) [+101/-14, 2 files | commented:2] ## Purpose It was suggested to me in https://github.com/vllm-project/vllm/pull/31749 to extend any CPU images off a base, but currently no such base exists as far as I can tell. The only published CPU images I’ve been able to find are ECR images with a rather old vLLM version (more details in https://github.com/vllm-project/vllm/issues/23681)
This PR updates the Buildkite automation to build both x86_64 and arm64 CPU images and publish them to Docker Hub. Since this is not a “first-class” useca…
-
#32020 Rename `--exclude-log-deltas` to `--enable-log-deltas` — frontend,ready — by Catacomba (创建于: 2026-01-09 17:55 (UTC+8)) [+9/-10, 3 files | commented:1 approved:2] ## Purpose
Rename
`--exclude-log-deltas`, implemented in #30322, to the standard `--enable`/`--no-enable` model. Remove the validation check for `--enable-log-deltas`/`--no-enable-log-deltas`, since strictly speaking, changing the flag does not require `--enable-log-outputs`. ## Test Plan
-
Run vllm serve with any model, with
`--enable-log-requests` and `--enable-log-outputs` enabled: `vllm serve model --enable-log-requests --enable-log-outputs`…
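The standard `--enable`/`--no-enable` flag pair can be illustrated with argparse's built-in `BooleanOptionalAction` (a sketch only; vLLM's actual CLI wiring is more involved):

```python
# Sketch of the --enable-X / --no-enable-X flag pattern: BooleanOptionalAction
# automatically generates the paired --no-enable-log-deltas negation flag.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable-log-deltas",
    action=argparse.BooleanOptionalAction,  # adds --no-enable-log-deltas too
    default=False,
)

args = parser.parse_args(["--enable-log-deltas"])
print(args.enable_log_deltas)  # True
```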
-
-
#32030 Add: acceptance length tests — speculative-decoding,v1 — by rahul-tuli (创建于: 2026-01-09 22:44 (UTC+8)) [💬1 | +266/-0, 2 files | commented:2] Adds parameterized pytest tests to detect acceptance length regressions in EAGLE3 speculative decoding. These tests ensure that new commits do not degrade speculative decoding performance.
## Changes
tests/v1/spec_decode/test_acceptance_length.py: New test file with parameterized tests for EAGLE3 model pairs
## Models Tested
| Verifier | Drafter | Mean AL | Per-Position AL |
|----------|---------|---------|-----------------|
…
- #32027 [Doc] Remove hardcoded Whisper in example openai translation client — documentation,ready — by Isotr0py (创建于: 2026-01-09 21:42 (UTC+8)) [💬1 | +12/-6, 1 files | commented:1 approved:1]
## Purpose
- Avoid hardcoded
`openai/whisper-large-v3` in `openai_translation_client.py`, so that we can test other models' translation APIs easily.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
...
-
#32016 [Model] Remove redundant None check in DeepSeekOCR image input processing — deepseek — by maang-h (创建于: 2026-01-09 14:02 (UTC+8)) [+10/-13, 1 files | commented:1 approved:1] ## Purpose Remove redundant
`if pixel_values is not None` check and unreachable assertion in `DeepseekOCRImagePixelInputs._parse_image_pixel_values`. Lines 437-438 already handle the
`pixel_values is None` case with an early return. This cleanup improves code readability and removes unnecessary branching.
## Test Plan
…
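The redundancy pattern being removed can be shown with a toy parser (hypothetical code, not the DeepSeek-OCR implementation):

```python
# Toy illustration: once the None case returns early, any later
# `if pixel_values is not None:` branch can never be False, so it is redundant.
def parse_pixel_values(pixel_values):
    if pixel_values is None:
        return None  # early return handles the None case (cf. lines 437-438)
    # a subsequent `is not None` check here would be unreachable in its
    # False branch, hence dead code
    return [v * 2 for v in pixel_values]  # stand-in for the real processing
```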
-
#32025 [Misc] Hash keys only when value is
`None` in kwargs — ready,multi-modality — by ywang96 (创建于: 2026-01-09 18:56 (UTC+8)) [+3/-0, 1 files | approved:1 commented:5] ## Purpose It's not guaranteed that multimodal kwargs always have non-`None` values. When this happens, every request produces a serialization warning, which isn't helpful. This PR modifies the hasher to hash only the keys for key-value pairs whose value is `None`. ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
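A sketch of the hashing behavior described for #32025 (assumed logic, not vLLM's actual hasher code):

```python
# Sketch: when a kwarg's value is None, fold only the key into the hash
# instead of attempting to serialize the None value (which triggers the
# unhelpful serialization warning the PR describes).
import hashlib
import pickle

def hash_mm_kwargs(kwargs: dict) -> str:
    h = hashlib.sha256()
    for key, value in sorted(kwargs.items()):
        h.update(key.encode())
        if value is not None:
            h.update(pickle.dumps(value))  # serialize only real values
    return h.hexdigest()
```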
#32024 [Frontend] Try multiple prompts in benchmark initial test run — performance — by MatteoFari (创建于: 2026-01-09 18:31 (UTC+8)) [💬3 | +99/-36, 2 files | commented:1 approved:1] ## Purpose Added num_test_prompts argument (defaulting to 5) to improve robustness of benchmark, fixes #31902 ## Test Plan Added a new test case test_bench_serve_num_test_prompts in tests/benchmarks/test_serve_cli.py to ensure the new CLI argument is correctly parsed and passed to the benchmark function. ## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32021 [WIP] Optimize greedy sample. — tpu,v1 — by whx-sjtu (创建于: 2026-01-09 18:09 (UTC+8)) [+61/-6, 3 files | commented:1 | 📝草稿] ## Purpose Implement https://github.com/vllm-project/vllm/issues/31216
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32018 [Frontend]
`finish_reason` must be `tool_call` whenever a tool is called — frontend — by sanghoon-yn (创建于: 2026-01-09 16:55 (UTC+8)) [+8/-8, 1 files | commented:2] ## Purpose Currently, when `tool_choice` is a function name, `OpenAIServingChat` responds with `finish_reason: "stop"`. The OpenAI API now emits an explicit
`tool_call` instead. This PR updates
`OpenAIServingChat`'s chat serving logic to align with the latest OpenAI API behavior for tool invocation. ## Test Plan
## Test Result …
-
#32013 [Fix] Qwen3-VL-MoE bitsandbytes 4 bit quant — qwen — by Datta0 (创建于: 2026-01-09 13:09 (UTC+8)) [💬2 | +257/-43, 2 files | commented:3] ## Purpose Starting the vLLM server with Qwen/Qwen3-VL-30B-A3B-Instruct using the bitsandbytes quantization format fails while loading the weights due to quantization issues.
FIX https://github.com/vllm-project/vllm/issues/32012 ## Test Plan
## Test Result
... -
#32015 test — 无标签 — by 1643661061leo (创建于: 2026-01-09 13:20 (UTC+8)) [💬1 | +199/-250, 1 files | commented:1] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#32014 [MLA] Support DCP + FP8 — v1 — by LucasWilkinson (创建于: 2026-01-09 13:19 (UTC+8)) [+11/-7, 1 files | commented:1 | 📝草稿] FIX https://github.com/vllm-project/vllm/issues/32010
Signed-off-by: Lucas Wilkinson lwilkins@redhat.com
[已合并 PR]
- #31348 resolve pydantic error in startup benchmark — performance,ready — by andyxning (合并于: 2026-01-10 10:41 (UTC+8)) [💬4 | +21/-7, 2 files | commented:6 approved:1]
## Purpose
Avoid pydantic error about CompilationConfig in
`vllm bench startup`: ``` RuntimeError: Subprocess failed: 7 validation errors for CompilationConfig local_cache_dir Unexpected keyword argument [type=unexpected_keyword_argument, input_value=None, input_type=NoneType] For further information visit https://errors.pydantic.dev/2.12/v/unexpected_keyword_argument bs_to_padded_graph_size Unexpected keyword argument [type=unexpected_keyword_argument, input_value=None, input_type=NoneTy… -
#31616 [Bugfix] Narrow broad exceptions in compilation backends — ready — by c0de128 (合并于: 2026-01-10 10:39 (UTC+8)) [💬15 | +1/-1, 1 files | commented:2 approved:2] ## Summary
Narrows broad
`except Exception:` handlers in the compilation cache initialization to specific exception types for better debuggability. ## Changes
### 1. File reading exception (line 609) ```python # Before: except Exception: …
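The pattern in question, sketched below; the exact exception types chosen by the PR may differ from the ones assumed here:

```python
# Sketch: narrow a blanket `except Exception:` around a cache-file read to the
# specific failures such a read can legitimately raise (types assumed).
def read_cache_file(path: str):
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except (OSError, UnicodeDecodeError):
        # expected failures only: missing file, permissions, corrupt encoding;
        # genuine bugs (TypeError, AttributeError, ...) now propagate loudly
        return None
```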
- #32065 [CI] Allow Deprecated Quantization For LM Eval Tests — ready,ci/build,ci-failure — by micah-wil (合并于: 2026-01-10 10:10 (UTC+8)) [+1/-0, 1 files | commented:2 approved:1]
Ever since https://github.com/vllm-project/vllm/pull/31688 began the process of deprecating some quantization schemes, the
`LM Eval Large Models` test group is failing on main, as seen here for example: https://buildkite.com/vllm/ci/builds/46314/steps/canvas?sid=019ba18f-0a1a-4482-82d8-b8c102fd92c0. The PR updated other relevant tests to allow deprecated quantization types, but skipped `test_lm_eval_correctness.py`, so the `Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform` test case is failing with t… -
#32056 [Perf] Optimize async scheduling placeholder using empty — ready,v1 — by yewentao256 (合并于: 2026-01-10 08:46 (UTC+8)) [💬2 | +4/-1, 1 files | commented:1 approved:1] ## Purpose
For a placeholder tensor, there is no need to use
`rand`; `empty` is good enough.
…
-
#32045 [Core] Use weights_only=True with torch.load — ready — by russellb (合并于: 2026-01-10 08:28 (UTC+8)) [+1/-1, 1 files | commented:1 approved:1] This code neglected to use
`weights_only=True` with `torch.load()`. This is unsafe if used on potentially untrusted model data. We should always use `weights_only=True` unless necessary. It turns out that the default changed to
`True` as of PyTorch 2.6, so this isn't a security vulnerability, but we explicitly set it to `True` everywhere else, so change it here for consistency. This also prevents it from getting flagged by security scanners, which is how I became aware of it via a private re…
#32052 [2/N][Attention] Fix pre-commit errors — ready,v1 — by MatthewBonanni (合并于: 2026-01-10 08:27 (UTC+8)) [+6/-16, 3 files | commented:1 approved:1] ## Purpose Another step in the attention restructuring, #31919 This PR fixes pre-commit errors which arose when moving files into regions of higher mypy scrutiny
## Test Plan Pre-commit should pass
## Test Result
…
-
#31744 [Misc][BE] Type coverage for vllm/compilation [2/3] — ready,nvidia — by Lucaskabela (合并于: 2026-01-10 07:30 (UTC+8)) [+161/-91, 12 files | commented:3 approved:2] ## Purpose We want to provide better type hint coverage in vllm/compilation to improve maintainability, readability, and reduce silent errors
This PR should be applied on top of #31554
## Test Plan mypy vllm/compilation
## Test Result
…
-
#31998 [Misc] Enable async scheduling by default with spec decoding — ready — by njhill (合并于: 2026-01-10 07:09 (UTC+8)) [💬2 | +27/-22, 1 files | commented:2 approved:3] Now that all of the gaps have been addressed in async scheduling + spec decoding support, we can enable it by default in this case too.
It will still be disabled implicitly for non-EAGLE/MTP types or when padded drafter batch is disabled.
This should only be merged once https://github.com/vllm-project/vllm/pull/30495 is merged.
[!NOTE] Enables async scheduling by default when using compatible speculative decoding, with clearer gating and messaging. …
-
#31336 [perf][async] support non cpu sync get logprob tensors for spec — ready,v1 — by izhuhaoran (合并于: 2026-01-10 05:24 (UTC+8)) [💬8 | +48/-29, 3 files | commented:10] ## Description:
This PR eliminates implicit CPU-GPU synchronization points in the
RejectionSamplerwhen computing logprobs during speculative decoding, specifically for the async scheduling path.Problem Previously, the rejection sampling logic used boolean masking to identify accepted tokens: https://github.com/vllm-project/vllm/blob/ba25a6599212f01e6b877e22549612333e95b2c1/vllm/v1/sample/rejection_sampler.py#L191-L194 https://github.com/vllm-project/vllm/blob/ba25a6599212f01e6b877e2254…
-
#30275 [NIXL] refine decoder side post process for heterogeneous BlockSize and kv_layout — ready,v1,kv-connector — by xuechendi (合并于: 2026-01-10 05:22 (UTC+8)) [💬7 | +139/-87, 2 files | commented:10] ## Purpose
We already support heterogeneous BlockSize and kv_layout via separate post-process methods. This PR cleans them up so that a single method handles post-processing for all cases.
## What is changed in this PR:
I removed
`permute_device_kv` and `blocksize_post_process`, and moved the logic into `post_process_device_kv_on_receive` as a single post-process function with 3 options: ``` …
-
#31916 [1/N][Attention] Restructure attention: move files — documentation,performance,rocm,tpu,speculative-decoding,ready,ci/build,v1,multi-modality,llama — by MatthewBonanni (合并于: 2026-01-10 05:10 (UTC+8)) [💬12 | +425/-395, 195 files | commented:1 approved:3] ## Purpose Implement step 1 of #31919. This PR consists solely of file renaming and movement, and the necessary updates to imports.
- Move vllm/attention/layers to vllm/model_executor/layers/attention
- Move vllm/attention/backends/abstract.py to vllm/v1/attention/backend.py
- Move vllm/attention/backends/registry.py to vllm/v1/attention/backends/registry.py
- Eliminate vllm/attention/backends folder
- Move vllm/attention/utils/fa_utils.py to vllm/v1/attention/backends/fa_utils.py
- Move vll…
-
#31830 [Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement — ready,nvidia — by yewentao256 (合并于: 2026-01-10 03:13 (UTC+8)) [💬3 | +172/-63, 6 files | commented:9 approved:1] ## Purpose
Part of the https://github.com/vllm-project/vllm/issues/31755
Here we add a kernel for faster calculation of problem size
## Test
export MODEL="zai-org/GLM-4.7-FP8"…
-
#31836 [responsesAPI] fix incomplete_messages for simple/parsable context — frontend,ready — by qandrew (合并于: 2026-01-10 05:00 (UTC+8)) [💬1 | +44/-0, 4 files | commented:1 approved:1] ## Purpose
Similar to https://github.com/vllm-project/vllm/pull/24561 (which was done only for GPT-OSS), we want to output the incomplete reason for non-gpt-oss models.
We make sure
`assert response.status == "incomplete"` and `assert response.incomplete_details.reason == "max_output_tokens"`…
-
#30833 [Quant] Make static quant support all group shapes — ready — by LucasWilkinson (合并于: 2026-01-10 04:49 (UTC+8)) [💬6 | +339/-47, 7 files | commented:5 approved:2] Preparatory pr for https://github.com/vllm-project/vllm/pull/30141 (per-head quant)
- #32001 [fix] add cutedsl to global sf — ready,nvidia — by jiahanc (合并于: 2026-01-10 04:03 (UTC+8)) [+1/-0, 1 files | commented:1 approved:3] ## Purpose Add flashinfer cutedsl to global sf list, fixes https://github.com/vllm-project/vllm/issues/31918 ## Test Plan ``` VLLM_DEEPEPLL_NVFP4_DISPATCH=1 VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_USE_STANDALONE_COMPILE=0 VLLM_FLASHINFER_MOE_BACKEND="masked_gemm" VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ALL2ALL_BACKEND="deepep_low_latency" lm_eval --model vllm --model_args pretrained=nvidia/DeepSeek-R1-0528-FP4-v2,data_parallel_size=4,enable_expert_parallel=True,tensor_parallel_size=1,enforce_eager=Tr…
-
#29354 Add unpermute-aware fused MoE path and small-batch fallback — performance,ready — by RunkaiTao (合并于: 2026-01-10 03:58 (UTC+8)) [💬9 | +176/-21, 2 files | commented:10] ## Purpose This PR enhances the fused MoE implementation by adding support for the unpermute execution path. It introduces a small-batch fallback path designed for models with large numbers of experts, where the regular kernel becomes inefficient. For these regimes, the new path provides a speed-up by reducing unnecessary data movement and improving kernel efficiency for low-token workloads.
## Test Result
### Benchmark: Llama-4-Maverick-17B-128E (TP = 8)
Command: ```bash python3 benchmarks/k…
-
#31595 [Fix] Introduce audio channels spec — documentation,performance,new-model,rocm,frontend,speculative-decoding,ready,ci/build,v1,multi-modality — by jeremyteboul (合并于: 2026-01-10 03:34 (UTC+8)) [💬10 | +717/-189, 9 files | commented:8 approved:1] Add generic AudioSpec framework for audio channel normalization This PR introduces an extensible AudioSpec-based framework for normalizing audio channels in vLLM multimodal models. The framework addresses the stereo-to-mono conversion issue where torchaudio returns [channels, time] format but Whisper-based models expect [time] (mono).
Key changes:
- Add AudioSpec dataclass and ChannelReduction enum for flexible audio format specification
- Add get_audio_spec() to detect model requirements fro…
-
#31707 [Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator — ready,v1 — by ivanium (合并于: 2026-01-10 02:53 (UTC+8)) [💬4 | +324/-127, 2 files | commented:7 approved:1] ## Purpose
The current hybrid KV cache coordinator supports at most two attention types (full attention + another sliding-window/mamba attention). However, emerging models may need more flexible support. For example, full attention + sliding window attn with various window sizes, etc., as required in #31592 and #30263.
Since prefix caching for sliding window and mamba does not have the monotonic prefix cache hit property, viz., a cache hit at position i does not imply a cache hit at position j…
-
#32034 [Refactor] Remove numpy split in async scheduling — ready,v1 — by yewentao256 (合并于: 2026-01-10 03:09 (UTC+8)) [💬1 | +6/-12, 1 files | commented:2 approved:1] ## Purpose
Remove the numpy split in async scheduling, yielding a small performance improvement.
## Test
export MODEL="zai-org/GLM-4.7-FP8"
vllm serve $MODEL -tp 8 --port 9256 --enable-expert-parallel --max_num_seqs 128…
-
#31737 [Frontend][gpt-oss] Allow system message to overwrite model identity — frontend,ready,gpt-oss — by qandrew (合并于: 2026-01-10 03:03 (UTC+8)) [+111/-9, 3 files | commented:2 approved:1] ## Purpose
taking over from https://github.com/vllm-project/vllm/pull/27310 We have use cases where we want to override the system prompt (and not have it fed to developer prompt).
## Test Plan
CUDA_VISIBLE_DEVICES=2,3 with-proxy vllm serve "openai/gpt-oss-120b" -tp 2 --trust-remote-code…
-
#30585 [Bugfix] Fix Triton FusedMoE LoRA — ready,v1,gpt-oss — by xyang16 (合并于: 2026-01-09 19:46 (UTC+8)) [💬11 | +53/-37, 3 files | commented:3 approved:2] ## Purpose
This PR is to fix Triton fused_moe_lora.
- Reorder the rows of
`intermediate_cache1` to make sure LoRA weights are added to the correct rows here. - This will fix
`test_gptoss_tp.py` for the Triton backend.
## Test Plan
``` pytest -s -v tests/lora/test_gptoss_tp.py …
-
#31921 [Bugfix] Fix OpenAPI schema test failures — frontend,ready — by AndreasKaratzas (合并于: 2026-01-09 18:56 (UTC+8)) [+11/-3, 2 files | commented:2 approved:1] This PR fixes issues causing OpenAPI schema test failures:
## Changes
### 1. Add upper bound validation to
`truncate_prompt_tokens` (protocol.py). The
`truncate_prompt_tokens` field only had a lower bound (ge=-1) but no upper bound. When schemathesis generates extremely large integers, the server crashed instead of returning a proper 400 validation error. Added a
`le=_LONG_INFO.max` constraint to match the `seed` field's validation.…
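The bound fix can be sketched with pydantic. This is simplified: the real field lives in vLLM's protocol.py and is optional, and `_LONG_INFO.max` is modeled here as a 64-bit signed max.

```python
# Sketch of adding an upper bound so absurdly large integers fail validation
# (a 400 from the server) instead of crashing downstream code.
from pydantic import BaseModel, Field, ValidationError

LONG_MAX = 2**63 - 1  # stand-in for _LONG_INFO.max

class SamplingParamsSketch(BaseModel):
    # ge=-1 was already present; le=LONG_MAX is the new upper bound
    truncate_prompt_tokens: int = Field(default=-1, ge=-1, le=LONG_MAX)

try:
    SamplingParamsSketch(truncate_prompt_tokens=2**80)  # schemathesis-style input
    rejected = False
except ValidationError:
    rejected = True
```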
-
#32000 [ROCm][CI][V1] Fix
`nixl_connector` test failure and achieve CUDA parity in `test_async_scheduling` — rocm,speculative-decoding,ready,v1,kv-connector,nvidia — by AndreasKaratzas (合并于: 2026-01-09 20:48 (UTC+8)) [💬5 | +28/-39, 3 files | commented:1 approved:1] This PR adds FlexAttention backend support for ROCm in the EAGLE speculative decoding proposer, removing platform-specific attention backend restrictions and merging with the CUDA data flow of this test. It also fixes `test_abort_timeout_on_prefiller[ray]` failing on ROCm platforms. ## Changes
- Added
`FlexAttentionMetadata` to the allowed attention types for ROCm in `eagle.py` - Removed ROCm-specific backend overrides that were workarounds for missing FlexAttention support
- Modify `_make_fak…
-
#29450 [UX] Add vLLM model inspection view — frontend,ready,v1 — by mgoin (合并于: 2026-01-10 01:12 (UTC+8)) [💬1 | +180/-1, 6 files | commented:6 approved:2] ## Purpose
Initial start to the kernels view request in https://github.com/vllm-project/vllm/issues/28085 by making a general modules view where we can see attention backends and quant_methods used at first.
You can see the model inspection if you add
`VLLM_LOG_MODEL_INSPECTION=1` during engine creation, or by simply printing the `LLM` object. ## Test Result
#### LLM class
…
-
#30886 [Doc] Add developer guide for CustomOp — documentation,rocm,ready — by shen-shanshan (合并于: 2026-01-10 00:21 (UTC+8)) [💬10 | +441/-5, 24 files | commented:10] ## Purpose
Currently, more and more `CustomOp`s are defined both in vLLM and in out-of-tree (OOT) device plugins. Following https://github.com/vllm-project/vllm/pull/30125#issuecomment-3648916991, I have written a doc about the principles and usage of `CustomOp`. ## Test Plan
## Test Result
... -
#32020 Rename `--exclude-log-deltas` to `--enable-log-deltas` — frontend,ready — by Catacomba (合并于: 2026-01-09 23:30 (UTC+8)) [+9/-10, 3 files | commented:1 approved:2] ## Purpose
Rename
`--exclude-log-deltas`, implemented in #30322, to the standard `--enable`/`--no-enable` model. Remove the validation check for `--enable-log-deltas`/`--no-enable-log-deltas`, since strictly speaking, changing the flag does not require `--enable-log-outputs`. ## Test Plan
-
Run vllm serve with any model, with
`--enable-log-requests` and `--enable-log-outputs` enabled: `vllm serve model --enable-log-requests --enable-log-outputs`…
-
- #32027 [Doc] Remove hardcoded Whisper in example openai translation client — documentation,ready — by Isotr0py (合并于: 2026-01-09 22:44 (UTC+8)) [💬1 | +12/-6, 1 files | commented:1 approved:1]
## Purpose
- Avoid hardcoded
`openai/whisper-large-v3` in `openai_translation_client.py`, so that we can test other models' translation APIs easily.
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
...
-
#31832 [Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe — performance,ready,nvidia — by mgoin (合并于: 2026-01-09 22:40 (UTC+8)) [+278/-95, 8 files | commented:5 approved:1] ## Purpose
We can easily fuse the silu_and_mul operation into the nvfp4 quantization operation for moe, which we already do for dense nvfp4 as of https://github.com/vllm-project/vllm/pull/23671. This PR just generalizes some utils to expose the same functionality for expert quantization. This seems to improve latency by ~2% and throughput by ~4%
Before: <img width="1631" height="659" alt="Screenshot 2026-01-06 at 4 00 44 PM" src="https://github.com/user-attachments/assets/bd52135a-21aa-4438-99…
- #31968 [CPU] Add head sizes 80 and 112 with vec16 fallback — ready,v1,cpu — by R3hankhan123 (合并于: 2026-01-09 22:14 (UTC+8)) [💬2 | +12/-5, 4 files | commented:5 approved:1]
## Purpose
Reintroduce support for head dimensions 80 and 112 in the CPU attention backend, which were previously removed in #27954; these head dimensions are commonly used by Granite models deployed on Z architectures. Since these head sizes are not friendly to the Intel AMX instruction set, the implementation now falls back to vec16.
## Test Plan
Build Docker image and test using
the `ibm-granite/granite-3b-code-base-2k` model, which has a head size of 80. ## Test Result Server Logs ``` docker run --rm -it -p 8000:8… - #32003 [Cleanup] Remove obsolete spec decoding compatibility logic — performance,speculative-decoding,ready,v1 — by njhill (合并于: 2026-01-09 13:44 (UTC+8)) [+45/-75, 8 files | commented:1 approved:1]
This is logic which was included with the original V1 spec decoding impl (ngram only), prior to various other changes which have made it obsolete including:
- Adding other spec decoding methods
- Adding support for logprobs and penalty parameters with spec decoding
Note that there are still some parameters which don’t work with spec decoding (min_p, min_tokens, logit_bias). Requests with these params will now fail when spec decoding is enabled (see https://github.com/vllm-project/vllm/pull/3198…
-
#32016 [Model] Remove redundant None check in DeepSeekOCR image input processing — deepseek — by maang-h (合并于: 2026-01-09 22:12 (UTC+8)) [+10/-13, 1 files | commented:1 approved:1] ## Purpose Remove redundant
`if pixel_values is not None` check and unreachable assertion in `DeepseekOCRImagePixelInputs._parse_image_pixel_values`. Lines 437-438 already handle the
`pixel_values is None` case with an early return. This cleanup improves code readability and removes unnecessary branching.
## Test Plan
…
-
#31999 Fix type error — frontend,ready,meta-exported,fb-exported — by Adolfo-Karim (合并于: 2026-01-09 22:03 (UTC+8)) [💬1 | +3/-1, 1 files | commented:3 approved:2] Summary: id can be None, so we need to check it.
Test Plan: Updated local instance, the error is gone.
Differential Revision: D90353429
-
#29304 [ROCm][PD] add moriio kv connector. — documentation,rocm,frontend,ready,ci/build,v1,kv-connector — by inkcherry (合并于: 2026-01-09 22:01 (UTC+8)) [💬10 | +3369/-3, 10 files | commented:10] ## Purpose
This PR introduces the mori-io KV connector for AMD devices. Built on top of the MORI project, the mori-io connector supports both PULL and PUSH modes for KV Cache transfer. Key features include:
- Mori backend integration.
- Mori-related components (buffer merge & session cache management & batch IO).
- PULL mode (Serial interaction of prefill and decode).
- PUSH mode (Parallel interaction of prefill and decode, with non-blocking layer-wise transfer…
-
#32025 [Misc] Hash keys only when value is
`None` in kwargs — ready,multi-modality — by ywang96 (合并于: 2026-01-09 21:20 (UTC+8)) [+3/-0, 1 files | approved:1 commented:5] ## Purpose It's not guaranteed that multimodal kwargs always have non-`None` values. When this happens, every request produces a serialization warning, which isn't helpful. This PR modifies the hasher to hash only the keys for key-value pairs whose value is `None`. ## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
... -
#31881 [Feature][Benchmarks] Custom dataset: read output length from dataset — performance,ready — by sducouedic (合并于: 2026-01-09 20:40 (UTC+8)) [+26/-5, 1 files | commented:1 approved:1] For custom datasets only – these changes allow reading the output length from the custom dataset through the optional field
`"output_tokens"` in the jsonl file. The value stored in the
`custom-output-len` argument is still used by default and overrides whatever is loaded from the dataset. The value from the dataset is used only if `custom-output-len` is `None` or set to -1.
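The described precedence can be sketched as a small helper (the function name is hypothetical; the rule comes from the PR text):

```python
# Sketch: the custom-output-len CLI value wins by default; the per-sample
# "output_tokens" value from the jsonl dataset is used only when the CLI
# value is None or -1.
def resolve_output_len(cli_value, dataset_value):
    if cli_value is None or cli_value == -1:
        return dataset_value  # fall back to the dataset's output_tokens
    return cli_value          # CLI argument overrides the dataset
```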
-
#31380 [Bugfix][ROCm]Fix Qwen3-Next-80B-A3B-Thinking inference and optimize non-standard block size (544) support under rocm_atten — rocm,ready,v1,qwen — by vllmellm (合并于: 2026-01-09 19:28 (UTC+8)) [💬8 | +283/-84, 5 files | commented:10] ## Purpose
Fixes #26473
This PR refactors the rocm_attn backend kernels to support models with non-power-of-2 block sizes, specifically the Qwen/Qwen3-Next-80B-A3B-Thinking model.
The core of this update is a Dynamic Dispatching Mechanism:
- Standard Path ($2^n$): For models with power-of-2 block sizes (16, 32, 64, 128, etc.), the kernel retains the legacy bitwise-optimization logic to ensure maximum performance and zero regression.
- Generalized Path (non-$2^n$): For non-standard models (e.g., Qwe…
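The dispatch criterion can be sketched in a few lines (helper names assumed; only the power-of-2 rule comes from the PR):

```python
# Sketch: power-of-2 block sizes take the legacy bitwise-optimized path,
# while non-standard sizes such as Qwen3-Next's 544 take the generalized path.
def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0  # a power of 2 has exactly one set bit

def pick_kernel_path(block_size: int) -> str:
    return "standard" if is_power_of_two(block_size) else "generalized"
```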
- #31948 fix: remove duplicate engine_id check in nixl_connector — ready,kv-connector — by xbfs (合并于: 2026-01-09 20:13 (UTC+8)) [+0/-7, 1 files | commented:1 approved:1]
[!NOTE] Eliminates a duplicate
`engine_id` mismatch check in `_nixl_handshake` within `nixl_connector.py`, keeping a single validation after decoding `NixlAgentMetadata`. - Simplifies handshake logic; no functional behavior change expected
-
#31973 [Model] Reorganize pooling layers — documentation,ready,ci/build,v1,qwen — by DarkLight1337 (合并于: 2026-01-09 19:02 (UTC+8)) [💬4 | +1290/-1143, 34 files | commented:9 approved:1] ## Purpose
`pooler.py` is getting really bloated, so let's split everything up:
- Pooler activations (`vllm.model_executor.layers.pooler.activations`)
- Common code (`vllm.model_executor.layers.pooler.common`)
- Abstract pooler (`vllm.model_executor.layers.pooler.abstract`)
- Special poolers (`vllm.model_executor.layers.pooler.special`)
- Poolers that aggregate the whole sequence (`vllm.model_executor.layers.pooler.seqwise`)
- Poolers that apply to each token in the sequence (`vllm.model_execu…
- #31906 [Bugfix] Fix Var Length Batched Padding in Granite Speech — ready — by alex-jw-brooks (合并于: 2026-01-09 18:28 (UTC+8)) [+8/-3, 1 files | commented:2 approved:1]
## Purpose
Fixes a bug in granite speech padding - the features are variable length, so we pad tensors to be
`[bsz, longest_feature, 160]`, but when the multimodal inputs are batched, they are provided as a list of dim `[feat_len, 160]`, which breaks the pad call expecting a 3D tensor. ``` (EngineCore_DP0 pid=2881328) File "/u/abrooks9944/vllm/vllm/model_executor/models/granite_speech.py", line 675, in parse_and_validate_audio_input (EngineCore_DP0 pid=2881328) input_features = self._pad… -
#31994 fix lora moe sharding when rank < max_lora_rank — ready,gpt-oss — by gnovack (合并于: 2026-01-09 14:43 (UTC+8)) [💬1 | +6/-8, 2 files | commented:1 approved:1] ## Purpose This PR fixes a bug in the implementation of
`fully_sharded` for MoE LoRA adapters. Currently, when a LoRA adapter is loaded whose rank is less than `max_lora_rank`, the LoRA A W13 weights are split into shards of size `current_lora_rank // tp_degree`. This causes problems when we allgather the LoRA A output along the rank dim, since the resulting tensor will be interspersed with 0s, as opposed to right-hand-padded with 0s (like the LoRA B W13 weights).
e.g. ``` LoRA A output (w…
-
#31949 [Bugfix] Fix FusedMoE LoRA w2_output_size — ready — by xyang16 (合并于: 2026-01-09 13:54 (UTC+8)) [💬2 | +1/-1, 1 files | commented:1 approved:1] ## Purpose
This PR fixes FusedMoE LoRA
`w2_output_size`. `w2_output_size` is currently incorrectly set to `w2_lora_a_stacked[0].shape[-2]`, making it become the rank. ## Test Plan
`pytest -s -v tests/lora/test_gptoss_tp.py` ## Test Result
…
-
#31970 [CI] [ROCm] Fix
`tests/entrypoints/test_grpc_server.py` on ROCm — rocm,ready,ci/build — by tjtanaa (合并于: 2026-01-09 12:54 (UTC+8)) [💬1 | +18/-4, 5 files | commented:4 approved:1] ## Purpose
Fix the `tests/entrypoints/test_grpc_server.py` CI issue https://buildkite.com/vllm/amd-ci/builds/2524/steps/canvas?sid=019b9d04-9972-4807-9712-b517956d17b3 ``` ==================================== ERRORS ==================================== – ____ ERROR collecting tests/entrypoints/test_grpc_server.py ____ ImportError while importing test module '/vllm-workspace/tests/entrypoints/test_grpc_server.py'. …
-
#31993 [ROCm][CI] Fix test_token_classification.py::test_bert_models — rocm,ready — by divakar-amd (合并于: 2026-01-09 12:04 (UTC+8)) [💬1 | +12/-4, 1 files | commented:1 approved:2] This PR fixes mi325_1: Language Models Test (Extended Pooling):
`models/language/pooling/test_token_classification.py::test_bert_models[float-boltuix/NeuroBERT-NER]`. The issues were caused by reasons similar to those mentioned in https://github.com/vllm-project/vllm/pull/31612
[!NOTE] Addresses ROCm-specific correctness in token classification tests.
- For
`test_bert_models` and `test_modernbert_models`, pass `model_kwargs={'attn_implementation': 'eager'}` to HF `AutoModelForToken…
-
#30437 [Bugfix] missing tokens occur in harmony streaming — frontend,ready,gpt-oss — by Ri0S (合并于: 2026-01-09 11:59 (UTC+8)) [💬8 | +13/-7, 2 files | commented:1 approved:2] ## Purpose Fixed an issue where in harmony streaming mode, when the engine yields more than one token at a time, only the last token is used.
FIX #28635 #30099
## Test Plan
uv run api_server.py --model openai/gpt-oss-120b --gpu-memory-utilization 0.95 --port 8000 --served-model-name gptoss120b --disable-log-request --tool-call-parser openai --enable-auto-tool-choice```python …
[关闭未合并 PR]
- #19469 [V1][Metrics] Add instance_id (hostname) label for prometheus metrics — needs-rebase,stale,v1 — by reidliu41 (关闭于: 2026-01-10 10:13 (UTC+8)) [💬7 | +10/-3, 2 files]
## Essential Elements of an Effective PR Description Checklist
- The purpose of the PR, such as “Fix some issue (link existing issues this PR will resolve)”.
- The test plan, such as providing test command.
- The test results, such as pasting the results comparison before and after, or e2e results
- (Optional) The necessary documentation update, such as updating
`supported_models.md` and `examples` for a new model.
## Purpose
fixes #17029 …
-
#32049 [CI/Build][Hardware][AMD] Fix test_engine_core_client.py — rocm,v1 — by rjrock (关闭于: 2026-01-10 08:29 (UTC+8)) [+23/-1, 1 files | commented:1 | 📝草稿] ## Purpose To get ROCm to pass on tests in test_engine_core_client.py, as run in AMD’s V1 Test e2e + engine testgroup. ## Test Plan
`pytest -v -s v1/engine/test_engine_core_client.py` ## Test Result Still some failures; draft at the moment.
Essential Elements of an Effective PR Description Checklist
...
- #30573 [Misc][Refactor] Separate router from FusedMoE class — needs-rebase — by bnellnm (closed at: 2026-01-10 08:17 (UTC+8)) [💬4 | +428/-248, 22 files | commented:1 | 📝 draft] ## Purpose Separate the existing router logic into a standalone class, `DefaultFusedMoERouter`. Needs https://github.com/vllm-project/vllm/pull/30519
## Test Plan
## Test Result
…
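What separating the router might look like in outline (a pure-Python sketch of a softmax top-k gate; the class and method names are illustrative, not the PR's actual `DefaultFusedMoERouter` API):

```python
import math

class RouterSketch:
    """Standalone top-k softmax router, decoupled from the fused MoE kernel."""

    def __init__(self, top_k):
        self.top_k = top_k

    def route(self, gate_logits):
        # numerically stable softmax over expert logits
        m = max(gate_logits)
        exps = [math.exp(x - m) for x in gate_logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        # pick the top_k experts and renormalize their weights
        top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[: self.top_k]
        norm = sum(probs[i] for i in top)
        return [(i, probs[i] / norm) for i in top]
```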
- #31896 [Frontend] extract tool calls before reasoning to prevent marker abso… — frontend — by daniel-salib (closed at: 2026-01-10 07:26 (UTC+8)) [+156/-9, 2 files | commented:3] ## Purpose
Fix sporadic issue with Kimi K2 model where tool calls are lost and tool markers appear in reasoning content.
Kimi K2 can output in two modes: 1. With
tags: reasoning <|tool_calls_section_begin|>... 2. Without tags: reasoning text <|tool_calls_section_begin|>... …
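The extraction-before-reasoning idea can be sketched as follows (the marker string comes from the PR description; the function itself is illustrative):

```python
TOOL_MARKER = "<|tool_calls_section_begin|>"

# Split the raw model output at the tool-call marker *before* reasoning
# parsing, so the marker can never be absorbed into reasoning content.
def split_tool_calls(text):
    idx = text.find(TOOL_MARKER)
    if idx == -1:
        return text, None
    return text[:idx], text[idx + len(TOOL_MARKER):]
```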
- #31929 [ROCm][CI] Fix test script to respect Buildkite parallelism settings — rocm,ci/build — by AndreasKaratzas (closed at: 2026-01-10 03:28 (UTC+8)) [💬1 | +64/-37, 1 file | commented:2] Fixes the ROCm test runner script to respect Buildkite's native parallelism configuration instead of always overriding shard settings.
## Problem
When `parallelism: N` is set in the pipeline YAML, Buildkite spawns N separate jobs and sets `BUILDKITE_PARALLEL_JOB_COUNT=N` and `BUILDKITE_PARALLEL_JOB=0..N-1`.
The test commands use `$$BUILDKITE_PARALLEL_JOB_COUNT` and `$$BUILDKITE_PARALLEL_JOB`, which get substituted correctly. However, the `run-rocm-test.sh` script was: …
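Reading Buildkite's native sharding variables instead of overriding them can be sketched as (the env var names are Buildkite's; the returned keys are illustrative):

```python
import os

def shard_settings(env=os.environ):
    # Buildkite sets these when `parallelism: N` is configured; default to
    # a single unsharded job when they are absent.
    count = int(env.get("BUILDKITE_PARALLEL_JOB_COUNT", "1"))
    index = int(env.get("BUILDKITE_PARALLEL_JOB", "0"))
    return {"num_shards": count, "shard_id": index}
```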
- #31537 [Bugfix] Record metrics for aborted requests — needs-rebase,v1 — by jayhemnani9910 (closed at: 2026-01-10 02:23 (UTC+8)) [💬4 | +100/-26, 6 files | commented:7] When requests are aborted, the abort path bypasses the normal metrics recording flow. This causes `vllm:request_success_total{finished_reason="abort"}` to always show 0.
Fix by collecting abort statistics in OutputProcessor.abort_requests() and having LLMEngine/AsyncLLM record them via logger_manager.record().
Changes:
- Add AbortedRequestStats dataclass to capture abort stats
- Modify OutputProcessor.abort_requests() to optionally collect stats
- Add _record_abort_metrics() helper to LLMEngine …
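The change list above might look roughly like this (a sketch; `AbortedRequestStats` is named in the PR, the rest is illustrative):

```python
from dataclasses import dataclass

@dataclass
class AbortedRequestStats:
    """Stats captured on the abort path, which bypasses normal recording."""
    num_aborted: int = 0

def abort_requests(request_ids, collect_stats=True):
    # optionally collect stats so the caller can forward them to the
    # metrics logger, mirroring the described OutputProcessor change
    return AbortedRequestStats(num_aborted=len(request_ids)) if collect_stats else None
```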
- #30914 [Bug] Fix torch inductor issue (shape passing through sub-graphs) — ready,qwen,nvidia — by yewentao256 (closed at: 2026-01-09 23:26 (UTC+8)) [💬7 | +22/-12, 4 files | commented:5 changes:2] ## Purpose
Context: https://vllm-dev.slack.com/archives/C08U97ZRC0J/p1765934670913979
`VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL="debug" python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2`
There are two issues we found:
- the shape passed through different sub-graphs, which is described in an attached image …
- #31910 [Feature][Benchmarks] Test run: try different prompts until success — performance — by sducouedic (closed at: 2026-01-09 19:06 (UTC+8)) [💬1 | +55/-30, 2 files | commented:2] When running benchmarks with a test request, the request may fail for various reasons related to the request itself (for example, if the prompt is too long and does not fit the maximum context length). These changes allow retrying with other prompts from the dataset before failing definitively.
Closes #31881
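The retry behavior can be sketched as follows (names are illustrative, not the benchmark code's actual API):

```python
def first_successful_prompt(prompts, send):
    # try prompts in order; only fail once every candidate has failed
    last_err = None
    for prompt in prompts:
        try:
            return prompt, send(prompt)
        except Exception as err:  # e.g. prompt exceeds the max context length
            last_err = err
    raise RuntimeError("all test prompts failed") from last_err
```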
- #32015 test — no labels — by 1643661061leo (closed at: 2026-01-09 13:37 (UTC+8)) [💬1 | +199/-250, 1 file | commented:1] ## Purpose
## Test Plan
## Test Result
Essential Elements of an Effective PR Description Checklist
...
- #31563 [Model] Support SentenceTransformers V6 reranker config — documentation,frontend,needs-rebase — by noooop (closed at: 2026-01-09 11:19 (UTC+8)) [💬2 | commented:1 | 📝 draft] ## Purpose
Following #30550 #31335
Users can use the latest powerful rerank models without manually setting any hf_overrides or loading any templates.
e.g. - BAAI/bge-reranker-v2-gemma - Qwen/Qwen3-Reranker-0.6B …