[vLLM GitHub Development Activity] 2026-03-21
[Overview]
- Window: 2026-03-21 11:34 (UTC+8) to 2026-03-22 11:34 (UTC+8)
- New issues: 12 (labels: bug: 5, feature request: 3, help wanted: 2, rocm: 2, good first issue: 1)
- Closed issues: 10
- New PRs: 43 (labels: ready: 15, rocm: 13, ci/build: 9, v1: 6, frontend: 6)
- Merged PRs: 18
- PRs closed without merging: 10
[New issues]
-
#37749 [Bug]: Qwen 3.5 stops working after upgrade to v0.18.0 — bug — by Pinockel (created: 2026-03-21 22:44 (UTC+8)) [💬1]
### Your current environment
Model: cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 + QuantTrio/Qwen3.5-27B-AWQ. Inference framework: vLLM 0.18.0. GPU hardware: multiple A100 40GB (one model per card). Deployment mode: vLLM as Docker.
Parameters: `--gpu-memory-utilization 0.90 --reasoning-parser qwen3` …
-
#37765 [Feature]: Consolidate GPTQ Quantization — help wanted, feature request — by robertgshaw2-redhat (created: 2026-03-22 04:45 (UTC+8)) [💬3]
### 🚀 The feature, motivation and pitch
We currently have two files, `gptq.py` and `gptq_marlin.py`. These are not both needed: we have now decoupled the quantization format integration from the kernels. We should delete `gptq.py` and consolidate everything into `gptq_marlin.py` …
-
#37777 [Bug]: [OOM] DeepSeek-R1 Out of Memory — bug — by hjjq (created: 2026-03-22 06:30 (UTC+8)) [💬1]
### Your current environment
4x GB200, vLLM nightly, cu130
### 🐛 Describe the bug
Likely caused by https://github.com/vllm-project/vllm/pull/37442#issuecomment-4104485267. Creating an issue in case someone else is also searching for this. …
-
#37773 [Bug]: EAGLE-3 acceptance rate collapses to 0% with Kimi-K2.5 at max_model_len=262144 — no labels — by ccgibson (created: 2026-03-22 05:50 (UTC+8)) [💬1]
### Your current environment
- vLLM v0.18.0 (release, not nightly)
- 8x NVIDIA B200 (141 GB HBM each)
- Model: moonshotai/Kimi-K2.5 (1T MoE, 32B active, compressed-tensors 4-bit)
- CUDA 13.0, Driver 580.126.09
### 🐛 Describe the bug
EAGLE-3 speculative decoding acceptance rate progressively collapses to 0% during generation with `--max-model-len 262144`. This causes the model to produce repetitive/degenerate output until `max_tokens` is hit. The bug is 100% reproducible and occurs withi…
-
#37758 [Bug]: FLASHINFER_CUTLASS and FLASHINFER_TRTLLM do not work for Qwen3.5 BF16 DP/EP — bug — by robertgshaw2-redhat (created: 2026-03-22 03:19 (UTC+8)) [💬1]
### Your current environment
B200, main
### 🐛 Describe the bug
Both of these fail:
`VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency chg run --gpus 2 -- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-qwen35-blackwell.txt` …
-
#37754 [Bug] FlashInfer + MTP speculative decoding crashes on SM121 (DGX Spark) with GQA=16 model — no labels — by TrevorS (created: 2026-03-22 01:44 (UTC+8)) [💬1]
## Summary
FlashInfer attention backend + MTP speculative decoding (`num_speculative_tokens=2`) crashes with "illegal memory access" on NVIDIA GB10 (SM121 / DGX Spark) when serving Nemotron-3-Super-120B-A12B-NVFP4 (GQA ratio = 16). The Triton attention backend works correctly.
## Reproduction
```bash
# Works (Triton attention):
vllm serve /models/nemotron-3-super --attention-backend triton_attn
```
…
-
#37753 [Feature]: Unify MoE "Oracles" with Class Structure — help wanted, good first issue, feature request — by robertgshaw2-redhat (created: 2026-03-22 01:29 (UTC+8)) [💬5]
### 🚀 The feature, motivation and pitch
We currently have the following MoE "oracles", which select the right MoE kernel for each model, in model_executor/layers/fused_moe/oracle:
- fp8
- nvfp4
- mxfp8
- unquantized …
-
#37736 [CI Failure]: Gemma3 OOMs with transformers backend — rocm, ci-failure — by AndreasKaratzas (created: 2026-03-21 13:32 (UTC+8)) [💬2]
### Test group
mi250_1: Multi-Modal Models (Standard) 2: qwen3 + gemma
### Describe the failing test
This is not exactly a test failure, but it has been recommended to investigate further the OOM event for Gemma3, which is a 4B model. The intuition is that the fake tensor used for profiling is large enough to exceed the 64 GB of the MI250 GPUs; however, it has been suggested that this is still odd.
### 📝 History of failing test
…
-
#37748 [Feature]: Is there docker image support vllm and rocm 7.1+ — feature request, rocm — by BigFaceBoy (created: 2026-03-21 22:08 (UTC+8)) [💬1]
### 🚀 The feature, motivation and pitch
As the title says.
### Alternatives
No response
### Additional context
…
-
#37746 [Bug] prompt_logprobs causes livelock with IsHybrid models (Qwen3.5) in DP mode — no labels — by dengoswei (created: 2026-03-21 21:01 (UTC+8))
## Bug: `prompt_logprobs` causes livelock with `IsHybrid` models (Qwen3.5) in DP mode
### Your current environment
- vLLM version: 0.17.0 (V1 engine, default)
- GPU: H100-SXM-80GB × 8
- Python: 3.11.2
- OS: Linux 5.15
- CUDA: 12.x
…
-
#37745 [Bug]: Mooncake Connector: Decode nodes stuck in WAITING_FOR_REMOTE_KVS after Prefill node restart — bug — by SSmallMonster (created: 2026-03-21 20:21 (UTC+8))
### Your current environment
The output of `python collect_env.py`: OS: Ubuntu 22.04.5 LTS (x86_64) ...
-
#37737 [Bug]: Missing logprobs for `<tool_call>` in streaming chat completions — bug — by sdpkjc (created: 2026-03-21 14:44 (UTC+8))
### Your current environment
- vLLM version: v0.15.0, v0.17.1
- Serving mode: OpenAI-compatible API server
- Model: Qwen/Qwen3-VL-8B-Thinking
- Request mode: streaming chat completions with logprobs=true
### 🐛 Describe the bug
…
[Closed issues]
-
#37773 [Bug]: EAGLE-3 acceptance rate collapses to 0% with Kimi-K2.5 at max_model_len=262144 — no labels — by ccgibson (closed: 2026-03-22 05:52 (UTC+8)) [💬1]
### Your current environment
- vLLM v0.18.0 (release, not nightly)
- 8x NVIDIA B200 (141 GB HBM each)
- Model: moonshotai/Kimi-K2.5 (1T MoE, 32B active, compressed-tensors 4-bit)
- CUDA 13.0, Driver 580.126.09
### 🐛 Describe the bug
EAGLE-3 speculative decoding acceptance rate progressively collapses to 0% during generation with `--max-model-len 262144`. This causes the model to produce repetitive/degenerate output until `max_tokens` is hit. The bug is 100% reproducible and occurs withi…
-
#31689 [Feature][Quantization][Help Wanted]: Clean up GPTQ + AWQ Quantization — help wanted, feature request — by robertgshaw2-redhat (closed: 2026-03-22 04:41 (UTC+8)) [💬9]
### 🚀 The feature, motivation and pitch
We are in the process of cleaning up the quantization integrations in vLLM (see the FusedMoE refactor PRs I am working on).
In general, this means we are trying to separate the concerns of the quantization INTEGRATION (on-disk format, responsible for weight loading) from the quantization KERNEL (runtime format, responsible for executing at runtime).
For GPTQ/AWQ, we have tech debt in that we have different quantization integrations (`gptq.py`, `gptq_marlin`…
-
#34256 [Model Performance SIG]: Improve MoE Oracle — feature request, rocm, model-bash — by robertgshaw2-redhat (closed: 2026-03-22 04:39 (UTC+8)) [💬5]
### 🚀 The feature, motivation and pitch
We recently created a programmatic interface for selecting NVFP4 and FP8 MoE kernels.
Currently, we have a single "ordering" of kernels for all hardware, device configs, etc. We should make this oracle smarter by having different defaults for different devices/models.
We should do this for:
- DeepSeekV3
- Qwen3MoE
- GLM …
-
#35688 [CI Failure][Tool Calling]: — ci-failure — by robertgshaw2-redhat (closed: 2026-03-22 04:39 (UTC+8))
### Name of failing test
entrypoints/openai/test_completion_with_function_calling.py
### Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. a bug in transformers)
…
-
#37618 [Bug]: FP8 KV cache on B200 with Qwen3.5 has degraded accuracy — bug — by robertgshaw2-redhat (closed: 2026-03-22 04:38 (UTC+8)) [💬10]
### Your current environment
See the run here:
- https://buildkite.com/vllm/ci/builds/57133#019d0808-caa0-41a2-88d2-9a4bd389efdf
Removing the FP8 KV cache allows this to pass.
### 🐛 Describe the bug
…
-
#37701 [Bug]: Segfault in FlashInfer autotuning for NVFP4 latency backend on Qwen3-30B-A3B-NVFP4 — bug — by baonudesifeizhai (closed: 2026-03-22 04:01 (UTC+8)) [💬2]
### Your current environment
sm100
### 🐛 Describe the bug
### Description …
-
#37730 Benchmark: Radix vs. PagedAttention Scaling (SGLang / vLLM) — no labels — by glaziermag (closed: 2026-03-22 01:41 (UTC+8))
### Problem Description
This issue documents a benchmark comparing SGLang (RadixAttention) and vLLM (PagedAttention) to observe the "Scaling Zero-Sum" trade-off. SGLang's radix tree optimizes prefix sharing across requests but uses Python-based routing, which can be vulnerable to Python Global Interpreter Lock (GIL) contention under high concurrency. vLLM mitigates the Python GIL bottleneck by offloading its PagedAttention im…
-
#36094 [Bug]: Qwen3.5 NVFP4 Checkpoint has poor accuracy — bug, qwen, nvidia — by pavanimajety (closed: 2026-03-21 22:14 (UTC+8)) [💬27]
### Your current environment
The output of `python collect_env.py` …
-
#35286 [Bug]: Qwen3.5-MoE failed with enable_lora — bug — by hjh0119 (closed: 2026-03-21 19:30 (UTC+8)) [💬7]
### Your current environment
The output of `python collect_env.py`: Collecting environment information...
-
#37422 [Feature]: Add kv_transfer_params to Responses API for PD disaggregation — rocm — by bongwoobak (closed: 2026-03-21 13:48 (UTC+8)) [💬1]
### 🚀 The feature, motivation and pitch
The Chat Completions API (`/v1/chat/completions`) supports `kv_transfer_params` on both request and response, enabling PD disaggregation. However, the Responses API (`/v1/responses`) does not have this field, so PD disaggregation cannot be used through it. Since the Responses API is becoming more widely adopted, it would be valuable to add `kv_transfer_params` support for feature parity with Chat Completions. The change is straightforward: add the field…
[New PRs]
-
#37775 [Bugfix] Fix pooling non-determinism from pinned prompt_lens aliasing — bug, rocm, ready, v1 — by AndreasKaratzas (created: 2026-03-22 06:25 (UTC+8)) [💬6 | +63/-1, 2 files | commented: 2, approved: 1]
- PR #37303 changed `num_prompt_tokens` in `InputBatch` from a plain `np.zeros()` array to a pinned-memory-backed numpy view (`torch.zeros(..., pin_memory=True).numpy()`). `get_pooling_metadata()` calls `torch.from_numpy(self.num_prompt_tokens[:self.num_reqs])`, which creates a tensor that shares the underlying pinned buffer rather than copying the data.
- Because pinned memory is used for async GPU transfers, the shared buffer can be modified between the time `prompt_lens` is created and wh…
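The aliasing failure mode described above can be illustrated without torch at all: a view into a reused buffer observes later writes, while an explicit copy does not. A minimal stand-in sketch (bytearray/memoryview in place of the pinned torch buffer; names and values are illustrative):

```python
# A reused buffer holding per-request prompt lengths (stand-in for the
# pinned-memory array in InputBatch; values are illustrative).
buffer = bytearray([3, 5, 2, 7, 0, 0, 0, 0])

view = memoryview(buffer)[:4]   # shares storage, like torch.from_numpy(...)
snapshot = bytes(buffer[:4])    # defensive copy: stable regardless of reuse

buffer[0] = 99                  # buffer gets overwritten for a later batch

print(view[0], snapshot[0])     # the view sees 99; the copy still holds 3
```

The fix's effect is the `snapshot` behavior: metadata derived from the buffer no longer changes under it mid-request.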
-
#37787 [Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 — bug, rocm, ready, gpt-oss — by AndreasKaratzas (created: 2026-03-22 10:52 (UTC+8)) [💬1 | +83/-13, 8 files | commented: 1]
Fixes several issues introduced by #37128 that broke gpt-oss on ROCm and LoRA on NVIDIA.
- Restore the gfx950 gate for CK mxfp4 backend selection. The old code only picked CK on gfx950 via on_gfx950(); the refactor dropped this and let CK get selected on gfx942, where it crashes.
- Restore the CK_MXFP4_MOE_DIM_ALIGNMENT (256) check. Models with intermediate_size not aligned to 256 (like gpt-oss-20b at 2880) hit a reshape error in aiter shuffle_scale_a16w4. Added is_supported_config to AiterExperts so…
-
#37786 Revert "[MoE Refactor] Mxfp4 oracle rebased" (#37128) — documentation, rocm, gpt-oss, nvidia — by zhewenl (created: 2026-03-22 09:06 (UTC+8)) [💬2 | +1369/-1695, 18 files | commented: 1 | 📝 draft]
## Revert of #37128
This reverts commit 87bd91892f8c63ed9aeb2ad2e701472ead8be84c (PR https://github.com/vllm-project/vllm/pull/37128).
### Reason
CI build #57431 detected 1 new failure linked to this PR:
- LoRA TP (Distributed): `test_gpt_oss_lora_tp2[True-False]` fails with the LoRA adapter producing garbage output (`!!!...`) instead of the expected SQL query. This PR changed `gpt_oss_triton_kernels_moe.py` and MoE layer code which is exercised by thi…
-
#37782 [Bugfix] Handle libsndfile sf_error(NULL) race condition in audio fallback — bug, rocm, ready, multi-modality — by AndreasKaratzas (created: 2026-03-22 08:02 (UTC+8)) [+5/-1, 1 file | commented: 2]
Fixes a regression after https://github.com/vllm-project/vllm/pull/37058. When multiple threads concurrently fail sf_open_virtual() on unsupported formats (e.g. MP4 bytes passed as audio), sf_error(NULL) may return 0 instead of the actual error code due to a global-state race in libsndfile. This caused a code=0 LibsndfileError ("Garbled error message from libsndfile") to bypass the pyav fallback in load_audio(), crashing requests that send video bytes as audio_url (e.g. audio-in-video). In this P…
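The shape of the guard can be sketched with stub decoders standing in for soundfile/pyav (the `LibsndfileError` class and `decode_*` helpers below are illustrative, not vLLM's actual code):

```python
class LibsndfileError(Exception):
    def __init__(self, code):
        super().__init__(f"libsndfile error {code}")
        self.code = code

def decode_with_soundfile(data: bytes):
    # Simulate the race: a concurrent failure zeroed libsndfile's global
    # error state, so the raised error carries code 0 ("garbled message").
    raise LibsndfileError(0)

def decode_with_pyav(data: bytes):
    return "decoded-by-pyav"

def load_audio(data: bytes):
    try:
        return decode_with_soundfile(data)
    except LibsndfileError:
        # Fixed behavior: fall back for ANY soundfile failure, including the
        # untrustworthy code == 0 produced by the global-state race.
        return decode_with_pyav(data)

print(load_audio(b"mp4-bytes"))
```

The key design point is that by the time the handler runs, sf_open_virtual() has already failed, so even a "no error" code cannot be trusted as a reason to re-raise.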
-
#37781 [CI] Skip ISAAC multimodal tests due to broken upstream HF model weights — rocm, ready, multi-modality — by AndreasKaratzas (created: 2026-03-22 07:29 (UTC+8)) [+5/-1, 1 file | commented: 2]
- Skip isaac multimodal generation tests.
- The upstream HF repo consolidated weights from 3 shards into a single `model.safetensors` but left a stale `model.safetensors.index.json` referencing the old shard filenames, causing weight loading to fail.
- This was introduced by upstream repo commits on 2026-03-20 ("5.0 structure fixes"), not a vLLM code change.
## Test plan
pytest -s -v tests/models/multimodal/generation/test_common.py::test_single_image_models[isaac-test_case35]
cc @ke…
- #37780 [ROCm][CI] Make some duplicated tests optional so that they are only evaluated in our nightly — rocm, ready, ci/build — by AndreasKaratzas (created: 2026-03-22 07:16 (UTC+8)) [+14/-0, 1 file | commented: 1] The labeled tests stay mandatory for MI250. We are making them optional for MI325 and MI355 since they are already covered on MI250; this takes some unnecessary workload off AMD CI.
-
#37778 [ROCm][CI] Added missing resampy dependency for MM audio tests — rocm, ready, ci/build — by AndreasKaratzas (created: 2026-03-22 06:34 (UTC+8)) [+2/-0, 1 file | commented: 2]
Follow-up on: https://github.com/vllm-project/vllm/pull/37058
Addresses: Multi-Modal Models (Standard) 3: llava + qwen2_vl
Related: https://buildkite.com/vllm/amd-ci/builds/6757/steps/canvas?sid=019d0efa-d61c-4621-915a-c06a2f968717&tab=output
cc @kenroche
-
#37774 [ROCm][CI] Close missing quote in kernels/moe block in run-amd-test.sh — rocm, ready, ci/build — by AndreasKaratzas (created: 2026-03-22 06:19 (UTC+8)) [+2/-2, 1 file | commented: 1, approved: 1]
Addresses a regression from #32700. Fixes a syntax error in run-amd-test.sh where the kernels/moe ignore block inside apply_rocm_test_overrides was missing the closing `"` on the cmds string assignment.
cc @kenroche
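This class of bug is easy to reproduce: an unterminated double quote in a string assignment is a shell syntax error. A hypothetical reconstruction of the fixed pattern (the real variable contents and flags live in run-amd-test.sh):

```shell
# Append an ignore flag to a command string; the regression dropped the
# closing quote on an assignment like the second one, breaking the script.
cmds="pytest -x"
cmds="${cmds} --ignore=kernels/moe"
echo "${cmds}"
```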
- #37783 [do not merge][release] Move agent queue to Release cluster queues — ci/build — by khluu (创建于: 2026-03-22 09:00 (UTC+8)) [💬1 | +24/-24, 1 files | commented:1] for better isolation & secret protection
-
#37785 Revert "[Frontend] Remove librosa from audio dependency" (#37058) — performance, frontend, ci/build, multi-modality — by zhewenl (created: 2026-03-22 09:05 (UTC+8)) [+188/-247, 18 files | commented: 1 | 📝 draft]
## Revert of #37058
This reverts commit c7f98b4d0a63b32ed939e2b6dfaa8a626e9b46c4 (PR https://github.com/vllm-project/vllm/pull/37058).
### Reason
CI build #57431 detected 1 new failure linked to this PR:
- Entrypoints Integration (API Server 1): `test_online_audio_in_video_interleaved` fails with `Error opening <_io.BytesIO object>: (Garbled error message from libsndfile)` after librosa was removed from the audio dependencies.
### Auto-generated T…
-
#37776 [MoE] Unify MoE oracles with class structure — needs-rebase — by Zijun9 (created: 2026-03-22 06:26 (UTC+8)) [💬5 | +2217/-1547, 8 files | commented: 1]
## Purpose
Resolves #37753. Introduces `MoEKernelOracle(ABC, Generic[BackendT])` as a base class for all MoE kernel selection oracles. Each oracle (FP8, NvFP4, MXFP4, MXFP8, Unquantized) now inherits from this base class, standardizing the 4 core operations:
- `select_backend`: choose the best kernel backend
- `convert_to_kernel_format`: shuffle weights for a backend
- `make_quant_config`: build a `FusedMoEQuantConfig`
- `make_kernel`: construct the `FusedMoEKernel` …
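A minimal sketch of that base-class shape, assuming only the four method names from the summary (signatures, return types, and the toy FP8 subclass are illustrative, not the PR's actual code):

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

BackendT = TypeVar("BackendT")

class MoEKernelOracle(ABC, Generic[BackendT]):
    """Base class standardizing the 4 core oracle operations."""
    @abstractmethod
    def select_backend(self) -> BackendT: ...
    @abstractmethod
    def convert_to_kernel_format(self, weights): ...
    @abstractmethod
    def make_quant_config(self): ...
    @abstractmethod
    def make_kernel(self): ...

class Fp8Oracle(MoEKernelOracle[str]):
    """Toy subclass: backend is just a string, config just a dict."""
    def select_backend(self) -> str:
        return "triton"                  # stand-in backend choice
    def convert_to_kernel_format(self, weights):
        return weights                   # no-op shuffle in this sketch
    def make_quant_config(self):
        return {"dtype": "fp8"}
    def make_kernel(self):
        return (self.select_backend(), self.make_quant_config())

print(Fp8Oracle().make_kernel())
```

The benefit of the ABC is that every per-format oracle exposes the same four entry points, so call sites no longer special-case each quantization format.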
-
#37784 [XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle — no labels — by jikunshang (created: 2026-03-22 09:04 (UTC+8)) [💬1 | +54/-101, 3 files | commented: 1]
## Purpose
Follow-up to https://github.com/vllm-project/vllm/pull/37128; moves xpu mxfp4 support into the oracle as well.
## Test Plan
## Test Result
...
-
#37759 [MoE] Move FlashInfer CuteDSL experts into fused_moe/experts/ — ready, nvidia — by robertgshaw2-redhat (created: 2026-03-22 03:43 (UTC+8)) [+2/-2, 3 files | commented: 1, approved: 1]
## Summary
- Moves `flashinfer_cutedsl_moe.py` from the `fused_moe/` root into the `fused_moe/experts/` subdirectory, consistent with the ongoing migration of expert kernel files (e.g. `trtllm_nvfp4_moe.py`, `trtllm_fp8_moe.py`, `trtllm_mxfp4_moe.py`)
- Updates the import in `oracle/nvfp4.py` to point to the new location
- Updates the import in `tests/kernels/moe/test_cutedsl_moe.py` to point to the new location
This is not duplicating an existing PR — it is part of an in-progress migration of `vllm/model…
-
#37779 [Perf] Optimize glm4.xv VIT — no labels — by KKSK-DON (created: 2026-03-22 06:49 (UTC+8)) [+1/-2, 1 file | commented: 1]
`Glm4vVisionTransformer.forward()` has 24 unnecessary CUDA D2H synchronizations caused by `max_seqlen.item()` being called on a …
-
#37757 [UX] Logging - Improve Startup Error Logs — ready, v1 — by Waknis (created: 2026-03-22 03:14 (UTC+8)) [💬4 | +721/-83, 6 files | commented: 5]
Fixes #31683. Also addresses #37714.
## Summary
This PR scopes #31683 to startup failures only, matching the guidance in the issue thread.
- Propagate structured startup failures from the worker ready pipe through the engine startup handshake.
- Preserve the innermost child exception type, message, source, and traceback in the surfaced startup error.
- Improve failed-process summaries during startup and add focused regression coverage for the worker and engine startup paths.
## Why this…
-
#37770 [Quant] Consolidate GPTQ: remove gptq.py, rename gptq_marlin.py to gptq.py — rocm, v1, ready-run-all-tests — by robertgshaw2-redhat (created: 2026-03-22 05:35 (UTC+8)) [💬1 | +58/-557, 17 files | commented: 1]
## Why this is not a duplicate
Checked open PRs for GPTQ consolidation — none found: `gh pr list --repo vllm-project/vllm --state open --search "gptq consolidate"`
## Summary
The Marlin kernels (gptq_marlin.py) support all practical GPTQ use cases (4-bit and 8-bit symmetric quantization), making the legacy exllama-based `gptq.py` redundant. This PR consolidates them into a single file. …
-
#37772 Consolidate AWQ quantization into single awq_marlin.py file — no labels — by robertgshaw2-redhat (created: 2026-03-22 05:49 (UTC+8)) [+253/-279, 3 files | commented: 1 | 📝 draft]
## Summary
- Merges `awq.py` and `awq_marlin.py` into a single file, eliminating the circular import between them.
- `awq.py` becomes a backward-compat shim that re-exports `AWQConfig` and `AWQLinearMethod` from `awq_marlin.py`.
- `__init__.py` is updated to import `AWQConfig` directly from `awq_marlin`.
- Follows the same consolidation structure as `gptq_marlin.py`.
Why this is not a duplicate: the previous attempt (#37768) was reverted due to an accidental push, not a correctness issue. This PR re-…
-
#37766 [CI/Build] Resolve a dependency deadlock when installing the test dependencies used in CI — ci/build — by yurun00 (created: 2026-03-22 05:01 (UTC+8)) [💬1 | +28/-7, 2 files | commented: 2]
## Purpose
This PR resolves a dependency deadlock when installing the test dependencies used in CI (CUDA only). It downgrades `opentelemetry-proto` from 1.36.0 to 1.35.0 and upgrades `protobuf` from 6.33.2 to 6.33.6 in `requirements/test.txt`.
Reasoning: currently, `uv pip install -r requirements/common.txt -r requirements/dev.txt --torch-backend=auto` fails because `protobuf` is pinned to 6.33.2 and `opentelemetry-sdk` is pinned to 1.35.0, which is incompatible…
-
#37771 [Quant] Consolidate AWQ quantization into a single file — documentation, performance, rocm, needs-rebase, ci/build, gpt-oss, nvidia — by robertgshaw2-redhat (created: 2026-03-22 05:38 (UTC+8)) [💬2 | +3132/-2511, 83 files | commented: 1 | 📝 draft]
## Summary
- Merges `awq.py` and `awq_marlin.py` into a single consolidated module (`awq_marlin.py`), following the same structure as `gptq_marlin.py`.
- Adds `AWQMarlinLinearKernel` in `vllm/model_executor/kernels/linear/mixed_precision/awq_marlin.py`, so `AWQMarlinLinearMethod` now delegates weight repacking and inference through the kernel abstraction instead of calling marlin utils directly.
- Extracts shared AWQ weight-parameter creation (`_create_awq_weight_params`) used by both the legacy a…
- Merges
-
#37769 [Quantization] Merge gptq_marlin into gptq, remove slow GEMM kernel — performance, rocm — by robertgshaw2-redhat (created: 2026-03-22 05:22 (UTC+8)) [+753/-1185, 25 files | commented: 1]
## Summary
- `gptq.py` is replaced by the former `gptq_marlin.py` (Marlin kernel). `GPTQConfig.get_name()` now returns `"gptq"`, and `override_quantization_method` is removed (no longer needed since gptq IS Marlin). `gptq_marlin.py` is reduced to a thin backward-compat re-export shim. `--quantization gptq_marlin` auto-converts to `gptq` with a deprecation warning in `_verify_quantization`.
- The old exllama/GEMM-based `GPTQConfig`/`GPTQLinearMethod` are removed; backward-compa…
- #37768 Revert "Consolidate AWQ quantization into single awq_marlin.py file" — no labels — by robertgshaw2-redhat (created: 2026-03-22 05:15 (UTC+8)) [+279/-253, 3 files | commented: 1] Revert of an accidental push.
-
#37767 [Quantization] Merge gptq_marlin into gptq, deprecate gptq_marlin — performance — by robertgshaw2-redhat (created: 2026-03-22 05:05 (UTC+8)) [+730/-1176, 21 files | commented: 1]
## Summary
- `gptq_marlin.py` is a strict superset of `gptq.py` — the marlin kernel handles all GPTQ-compatible checkpoints on sm75+, and the 4-bit gptq_gemm fallback was already marked buggy in the old code. This PR merges them.
- Deletes `gptq_marlin.py` and rewrites `gptq.py` with the marlin-based implementation. `GPTQMarlinConfig` → `GPTQConfig` (`get_name()` returns `"gptq"`); backward-compat aliases are preserved for any external callers (`GPTQMarlinConfig`, `GPTQMarlinLinearMethod`, `GPT…
-
#37760 [MoE] Move GPT OSS Triton kernel experts into fused_moe/experts/ — ready, gpt-oss — by robertgshaw2-redhat (created: 2026-03-22 03:57 (UTC+8)) [+12/-12, 7 files | commented: 1]
## Summary
- Moves `gpt_oss_triton_kernels_moe.py` from the `fused_moe/` root into `fused_moe/experts/`, consistent with the ongoing migration of expert kernel files (e.g. `trtllm_nvfp4_moe.py`, `trtllm_fp8_moe.py`, `trtllm_mxfp4_moe.py`)
- Updates all import sites: `oracle/mxfp4.py` (TRITON and TRITON_UNFUSED backends), `vllm/lora/layers/fused_moe.py`, `vllm/model_executor/layers/quantization/quark/quark_moe.py`, `tests/kernels/moe/test_modular_oai_triton_moe.py`, `tests/kernels/mo…
- Moves
-
#37761 [MoE] Move DEEP_GEMM into experts/ subdirectory — documentation, performance, ready — by robertgshaw2-redhat (created: 2026-03-22 04:01 (UTC+8)) [💬1 | +24/-20, 14 files | commented: 1]
## Summary
Part of the ongoing migration to consolidate MoE kernel implementations under `vllm/model_executor/layers/fused_moe/experts/`.
- Moves `fused_moe/deep_gemm_moe.py` → `fused_moe/experts/deep_gemm_moe.py`
- Moves `fused_moe/batched_deep_gemm_moe.py` → `fused_moe/experts/batched_deep_gemm_moe.py`
- Updates all import sites in `vllm/`, `benchmarks/`, and `tests/`
This is not a duplicate of any existing PR: #35927 moves prepare/finalize methods; this PR moves the DeepGemm expert kernel …
- Moves
-
#37764 [ROCm][CI] get_cu_count was renamed to num_compute_units in #35042 — rocm, ready — by AndreasKaratzas (created: 2026-03-22 04:40 (UTC+8)) [+3/-3, 1 file | commented: 1]
- Fixes 3 failing tests in `test_rocm_unquantized_gemm.py` introduced in #34709: `get_cu_count` was renamed to `num_compute_units` in #35042, but the test PR was authored against the old name.
cc @kenroche
- Fix 3 failing tests in
-
#37763 [ROCm][CI] Fix MEGA_AOT_ARTIFACT fallback when PyTorch < 2.10.0 lacks AOT support — rocm, ready — by AndreasKaratzas (created: 2026-03-22 04:18 (UTC+8)) [+5/-1, 1 file | commented: 1]
When `VLLM_USE_MEGA_AOT_ARTIFACT=1` is set on PyTorch < 2.10.0 (e.g. ROCm), `standalone_compile` correctly falls back to non-AOT mode, but `set_functorch_config()` still leaves `bundled_autograd_cache` enabled because it only checks the env var, not the PyTorch version. This causes `test_moe_startup[1]` to fail: `assert counters["aot_autograd"]["total"] == 0` (got 30). In this PR, we gate on both `VLLM_USE_MEGA_AOT_ARTIFACT` and `is_torch_equal_or_newer("2.10.0")` befor…
-
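The combined gate can be sketched as a pure function (names mirror `VLLM_USE_MEGA_AOT_ARTIFACT` and `is_torch_equal_or_newer`, but this is an illustrative stand-in, not the PR's code):

```python
def mega_aot_enabled(env_value: str, torch_version: str) -> bool:
    """Enable the AOT cache only when BOTH conditions hold."""
    # Crude major.minor comparison standing in for is_torch_equal_or_newer.
    at_least_2_10 = tuple(int(p) for p in torch_version.split(".")[:2]) >= (2, 10)
    return env_value == "1" and at_least_2_10

print(mega_aot_enabled("1", "2.10.0"))  # True
print(mega_aot_enabled("1", "2.9.1"))   # False: torch < 2.10 falls back
print(mega_aot_enabled("0", "2.10.0"))  # False: env var unset
```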
#37762 [MoE] Move nixl_ep and mori prepare/finalize into fused_moe/prepare_finalize/ — kv-connector — by robertgshaw2-redhat (created: 2026-03-22 04:02 (UTC+8)) [+3/-3, 4 files | commented: 1]
## Summary
- Moves `nixl_ep_prepare_finalize.py` and `mori_prepare_finalize.py` from the `fused_moe/` root into `fused_moe/prepare_finalize/`, alongside the existing `naive_dp_ep.py` and `no_dp_ep.py`
- Updates all import sites: `fused_moe/all2all_utils.py` (relative imports updated to `.prepare_finalize.*`), `tests/kernels/moe/modular_kernel_tools/mk_objects.py`
This is not duplicating an existing PR — it is part of an in-progress migration of `vllm/model_executor/layers/fused_moe/` into s…
-
#37756 [Perf] Add SM 10.3 (B300/GB300) all-reduce communicator tuning — ready — by mmangkad (created: 2026-03-22 01:48 (UTC+8)) [💬1 | +23/-0, 3 files | commented: 1, approved: 2]
## Summary
Adds benchmarked optimal all-reduce communicator config values for SM 10.3 (B300/GB300). Depends on #37755 for allreduce fusion to be auto-enabled on SM 10.3.
## Config Values
| Config | ws=2 | ws=4 | ws=6 | ws=8 |
|--------|------|------|------|------|
…
-
#37755 [Core] Enable allreduce fusion by default for SM 10.3 (B300/GB300) — ready — by mmangkad (created: 2026-03-22 01:46 (UTC+8)) [💬1 | +1/-1, 1 file | commented: 1, approved: 1]
## Summary
Enables allreduce fusion by default for SM 10.3 (B300/GB300) by using `is_device_capability_family(100)` instead of `is_device_capability(100)` in `enable_allreduce_rms_fusion`.
## Accuracy Test
```
vllm serve nvidia/DeepSeek-V3.2-NVFP4 --tensor-parallel-size 2 --tokenizer-mode deepseek_v32
```
…
-
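The difference between the exact check and the family check can be illustrated with a toy helper (this is a stand-in for vLLM's implementation, assuming only that "family 100" means all SM 10.x compute capabilities):

```python
def is_capability_family(major: int, minor: int, family: int) -> bool:
    """Toy family check: SM major.minor belongs to family `family`
    when the major version matches family // 10 (any minor revision)."""
    return major == family // 10

print(is_capability_family(10, 0, 100))  # B200, SM 10.0 -> True
print(is_capability_family(10, 3, 100))  # B300/GB300, SM 10.3 -> True
print(is_capability_family(9, 0, 100))   # H100, SM 9.0 -> False
```

An exact `is_device_capability(100)`-style check would only match SM 10.0, which is why B300/GB300 (SM 10.3) previously missed the fusion default.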
#37752 [Test] Add unit tests for `make_copy_and_call` — no labels — by SoluMilken (created: 2026-03-22 01:25 (UTC+8)) [+136/-11, 2 files | commented: 2]
## Purpose
- Add unit tests for `make_copy_and_call` in `tests/compile/test_backends.py`, covering:
  - sym tensor is copied into the pre-allocated buffer; callable_fn receives a buffer view
  - oversize input raises before callable_fn is reached
  - a second call overwrites the buffer; out-of-range rows are left untouched
  - non-contiguous indices with a static passthrough arg
- Refactor `make_copy_and_call`: remove lazy buffer initialization — callers always pass a pre-allocated buf…
-
#37751 [Quantization] Route SM120 GPUs to CUTLASS MXFP4 MoE backend — needs-rebase, nvidia — by Tib-Gridello (created: 2026-03-21 23:43 (UTC+8)) [💬2 | +3/-2, 1 file | commented: 1]
## Summary
SM120 GPUs (RTX PRO 6000 Blackwell, RTX 5090) fall back to the slow Marlin backend for MXFP4 MoE models because `get_mxfp4_backend()` only checks `is_device_capability_family(100)`. SM120 is family 120, not 100, so it is excluded. FlashInfer already ships SM120 CUTLASS kernel infrastructure (`gen_cutlass_fused_moe_sm120_module`, `CutlassTileConfigSM120`, `COMPILE_BLACKWELL_SM120_TMA_GROUPED_GEMMS`). This 1-line change routes SM120 to that existing path.
## Change
Add `is_device_cap…
-
#37750 fix: specify device for torch.Event to prevent multi-GPU issues — v1 — by xueliangyang-oeuler (created: 2026-03-21 22:50 (UTC+8)) [+11/-9, 2 files | commented: 1]
When torch.Event() is used without a device parameter, the event may be created on a different device than intended in multi-GPU scenarios. This can lead to deadlocks in CUDA graph execution paths, especially with async scheduling and prefix caching. This fix adds the device parameter to all torch.Event() calls in:
- gpu_model_runner.py: prepare_inputs_event, transfer_event, draft_token_ids_event, num_accepted_tokens_event, valid_sampled_token_count_event, asyn…
-
#37735 [Feature]: IndexCache support for DSA models — deepseek — by chaunceyjiang (created: 2026-03-21 13:09 (UTC+8)) [+73/-11, 4 files | commented: 3]
## Purpose
Fixes https://github.com/vllm-project/vllm/issues/37684
## Test
```
export VLLM_INDEXCACHE_ENABLE=1
vllm serve /mnt/data4/models/deepseek-ai/DeepSeek-V3___2 -tp=8 --tokenizer-mode deepseek_v32 --enable-auto-tool-choice --tool-call-parser deepseek_v32 --reasoning-parser deepseek_v3 --hf-overrides '{"index_topk_freq": 4}'
```
…
-
#37747 fix: preserve logprobs for control tokens in streaming tool calls — frontend — by xueliangyang-oeuler (created: 2026-03-21 21:30 (UTC+8)) [+1/-3, 1 file | commented: 1]
When streaming tool_calls chunks with control tokens (e.g. `<tool_call>`), the delta_message may be None but logprobs should still be preserved. Previously, chunks with a None delta_message were skipped entirely when return_token_ids was False, causing logprobs to be lost. Now we also check whether logprobs were requested before skipping the chunk, ensuring control-token logprobs are included in the streaming response.
Fixes #37737
## Purpose …
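The skip decision can be sketched as a small predicate (function name and signature are illustrative, not vLLM's actual streaming code):

```python
def should_skip_chunk(delta_message, logprobs, return_token_ids=False):
    """Decide whether a streaming chunk can be dropped entirely."""
    if delta_message is not None:
        return False          # there is visible delta content to emit
    if return_token_ids:
        return False          # token ids must reach the client
    # Fixed behavior: a None delta is only skippable when it carries
    # no logprobs either; previously it was dropped unconditionally.
    return logprobs is None

print(should_skip_chunk(None, logprobs={"<tool_call>": -0.1}))  # False: kept
print(should_skip_chunk(None, logprobs=None))                   # True: dropped
```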
-
#37741 [Responses API] Fix ParsableContext: defer parsing to end of generation — frontend — by esmeetu (created: 2026-03-21 16:01 (UTC+8)) [💬3 | +233/-5, 3 files | commented: 2]
## Summary
- Bug: `ParsableContext.append_output()` called `parser.process()` on every streaming delta, but `ResponsesParser.process()` internally calls `extract_reasoning()` and `extract_tool_calls()`, which expect complete output text. This caused incorrect parsing when reasoning content (`<think>...</think>`) was split across multiple deltas.
- Fix: Accumulate text across `append_output()` calls and run a single full parse via `_ensure_final_parse()` when results are first accessed. …
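The accumulate-then-parse pattern behind the fix can be sketched as follows (class and method names are illustrative; the toy "parser" just splits out a think block):

```python
class DeferredParser:
    """Accumulate streaming deltas; parse once when results are read."""
    def __init__(self):
        self._text = []
        self._parsed = None

    def append_output(self, delta: str):
        self._text.append(delta)      # no per-delta parser.process() call
        self._parsed = None           # invalidate any earlier parse

    def result(self):
        if self._parsed is None:      # lazy, _ensure_final_parse-style
            full = "".join(self._text)
            think, _, rest = full.partition("</think>")
            self._parsed = (think.removeprefix("<think>"), rest)
        return self._parsed

p = DeferredParser()
# The <think> tag arrives split across deltas, which would confuse a
# per-delta parser; the deferred parse sees the complete text.
for delta in ["<th", "ink>plan", "</think>", "answer"]:
    p.append_output(delta)
print(p.result())
```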
-
#37740 [Responses API] Fix tool_choice=required: WebSearch crash, parallel tool merge, JSON fallback — frontend, tool-calling — by esmeetu (created: 2026-03-21 16:01 (UTC+8)) [💬3 | +556/-2, 6 files | commented: 2]
## Summary
Three independent bug fixes for `tool_choice=required` in the Responses API:
- WebSearchTool crash (`tool_parsers/utils.py`): `_get_json_schema_from_tools` crashes when the tools list includes non-function types (e.g. `WebSearchTool`) because they lack `.name`/`.parameters`. Now filters to only `FunctionTool`/`ChatCompletionToolsParam` before schema generation.
- Consecutive function_call merge (`responses/utils.py`): consecutive `ResponseFunctionToolCall` items in the input (pa…
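The first fix is a type filter before schema generation; a minimal sketch with dataclass stand-ins for the real request types (names mirror the summary but the shapes are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class FunctionTool:
    name: str
    parameters: dict = field(default_factory=dict)

@dataclass
class WebSearchTool:
    # Deliberately has no .name / .parameters, like the crashing case.
    max_results: int = 5

def function_tools_only(tools):
    """Keep only function-style tools before building a JSON schema."""
    return [t for t in tools if isinstance(t, FunctionTool)]

tools = [FunctionTool("lookup"), WebSearchTool(), FunctionTool("calc")]
print([t.name for t in function_tools_only(tools)])
```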
-
#37744 [Bugfix] Fix PyTorch stable ABI compatibility for permute_cols — bug, ci/build — by kilork (created: 2026-03-21 16:53 (UTC+8)) [💬1 | +6/-4, 3 files | commented: 1]
## Purpose
Fixes a build failure when combining PR #37491 (CUTLASS upgrade to v4.4.2) with commit 8b10e4fb (migrate permute_cols to the libtorch stable ABI). The build fails with:
`static_assert(std::is_trivially_copyable_v<T>); error: non-static data member 'torch::stable::detail::ToImpl<const torch::stable::Tensor&>::call(...)::Result::t' in a union may not have reference type`
## Root Cause …
-
#37742 [Responses API] Add ToolChoiceRequiredLogitsProcessor for thinking models — v1, tool-calling — by esmeetu (created: 2026-03-21 16:02 (UTC+8)) [💬3 | +617/-8, 5 files | commented: 2]
## Summary
- Problem: `tool_choice=required` with thinking models (QwQ, Kimi K2, DeepSeek R1, etc.) fails because guided generation corrupts `<think>` reasoning tokens. The model either produces garbled reasoning or terminates before generating any tool calls.
- Solution: Add a `ToolChoiceRequiredLogitsProcessor` that suppresses all stop/EOS tokens from the start of generation, forcing the model to produce tool calls. Activated automatically when `ToolParser.adjust_request()` detects a th…
-
#37743 [CI] replace shellcheck script with shellcheck-py hook — ci/build, v1, kv-connector — by SoluMilken (created: 2026-03-21 16:50 (UTC+8)) [+57/-79, 6 files | commented: 2]
## Purpose
Related to #36977 (no activity since 2026-03-13). Replaces the custom `tools/pre_commit/shellcheck.sh` with the `shellcheck-py` pre-commit hook. The old script only worked on Linux x86_64; `shellcheck-py` ships shellcheck as a Python wheel and is cross-platform. Also fixes real issues surfaced by the new hook: SC2089/SC2090 in `run-multi-node-test.sh` (string-quoted GPU args → bash array), SC2048 in `spec_decode_acceptance_test…
-
#37738 Fix: preserve streaming logprobs — frontend — by sdpkjc (created: 2026-03-21 14:50 (UTC+8)) [+15/-7, 1 files | commented:5] ## Purpose
Fixes #37737.
When a streaming tool parser suppresses a control-token chunk by returning `None`, the chat streaming path can skip the whole chunk even if it still carries logprobs. This change keeps emitting an empty delta chunk when per-token metadata must be preserved.
## Test Plan
Run the streaming chat completion flow with `logprobs=true` and reproduce the tool-call path with `Qwen/Qwen3-VL-8B-Thinking`. Confirm that `<tool_call>` logprobs are preserved even when the p… -
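A minimal sketch of the preserved-chunk behavior; the chunk shape is a hypothetical simplification of the chat streaming payload, not vLLM's actual types:

```python
# Sketch of the described behavior: when the tool parser suppresses a chunk's
# text by returning None, still emit a chunk with an empty delta if the token
# carried logprobs, instead of dropping the chunk (and its metadata) entirely.
def build_stream_chunk(parsed_text, logprobs):
    if parsed_text is None and logprobs is None:
        return None  # truly nothing to emit, skip the chunk
    return {
        "delta": {"content": parsed_text or ""},
        "logprobs": logprobs,  # per-token metadata survives suppression
    }


# A suppressed control token (e.g. <tool_call>) with logprobs is not dropped:
chunk = build_stream_chunk(parsed_text=None, logprobs={"<tool_call>": -0.02})
assert chunk == {"delta": {"content": ""}, "logprobs": {"<tool_call>": -0.02}}
```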
#37739 [Frontend] Fix default_chat_template_kwargs handling in Responses API — documentation,frontend — by sidsaha-ai (created: 2026-03-21 15:39 (UTC+8)) [💬2 | +313/-15, 10 files | commented:2] ## Summary
`--default-chat-template-kwargs` was already available in the shared render stack, but the `/v1/responses` serving path still dropped those defaults when building prompts and when instantiating the reasoning parser used to post-process non-streaming responses. This meant Responses API requests could still behave as if Qwen3 thinking was enabled even when the server was started with `--default-chat-template-kwargs '{"enable_thinking": false}'`, which in turn could leave `output_text`… -
#37734 [Feature] Add LoRA support for Qwen3ASRForConditionalGeneration — documentation,qwen — by simpx (created: 2026-03-21 12:59 (UTC+8)) [💬2 | +1248/-2, 3 files | commented:1] ## What does this PR do?
This PR adds LoRA fine-tuning support for the Qwen3-ASR model by implementing the `SupportsLoRA` interface.
## Why is this change needed?
Currently, attempting to serve Qwen/Qwen3-ASR with LoRA enabled fails because the model does not declare LoRA support. This change enables users to fine-tune Qwen3-ASR models with LoRA adapters.
## How was it tested?
…
-
#37733 Revert “[compile] Initialize passes at VllmBackend init” — no labels — by simon-mo (created: 2026-03-21 12:35 (UTC+8)) [+5/-19, 3 files | commented:2] Reverts vllm-project/vllm#35216
see #37732
[Merged PRs]
-
#37775 [Bugfix] Fix pooling non-determinism from pinned prompt_lens aliasing — bug,rocm,ready,v1 — by AndreasKaratzas (merged: 2026-03-22 11:22 (UTC+8)) [💬6 | +63/-1, 2 files | commented:2 approved:1]
- PR #37303 changed `num_prompt_tokens` in `InputBatch` from a plain `np.zeros()` array to a pinned-memory-backed numpy view (`torch.zeros(..., pin_memory=True).numpy()`). `get_pooling_metadata()` calls `torch.from_numpy(self.num_prompt_tokens[:self.num_reqs])`, which creates a tensor that shares the underlying pinned buffer rather than copying the data.
- Because pinned memory is used for async GPU transfers, the shared buffer can be modified between the time `prompt_lens` is created and wh…
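The aliasing hazard is easy to reproduce with plain numpy, since `torch.from_numpy` shares memory the same way a numpy slice does; the buffer name below is illustrative, not vLLM's:

```python
# View-vs-copy aliasing, the core of the bug described above: a view of a
# reused buffer changes under you, while an explicit copy stays stable.
import numpy as np

buf = np.zeros(4, dtype=np.int64)  # stand-in for the pinned staging buffer
buf[:2] = [7, 9]                   # prompt lengths for the current batch

view = buf[:2]                     # shares the underlying buffer (the bug)
snapshot = buf[:2].copy()          # defensive copy (the shape of the fix)

buf[0] = 123                       # buffer reused, e.g. for the next batch

assert view[0] == 123              # the view silently changed
assert snapshot[0] == 7            # the copy stays deterministic
```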
-
#37128 [MoE Refactor] Mxfp4 oracle rebased — documentation,rocm,ready,gpt-oss,nvidia — by zyongye (merged: 2026-03-21 11:37 (UTC+8)) [💬5 | +1695/-1369, 18 files | commented:9 approved:1] ## Purpose
Rebased and improved version of #34983. Ongoing MXFP4 MoE refactor:
- Refactor MXFP4 MoE from a monolithic 1299-line `Mxfp4MoEMethod` class to the oracle pattern used by FP8 and NvFP4
- Create `oracle/mxfp4.py` with backend selection, weight conversion, quant config, and kernel assembly
- Create `TrtLlmMxfp4ExpertsMonolithic` wrapping `trtllm_fp4_block_scale_moe()` (both BF16 and MXFP8 input modes)
- Create `OAITritonMxfp4ExpertsMonolithic` wrapping `triton_kernel_moe_forward()`…
-
#37774 [ROCm][CI] close missing quote in kernels/moe block in run-amd-test.sh — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-22 09:42 (UTC+8)) [+2/-2, 1 files | commented:1 approved:1] Addresses regression from: #32700
Fix a syntax error in `run-amd-test.sh` where the `kernels/moe` ignore block inside `apply_rocm_test_overrides` was missing the closing `"` on the `cmds` string assignment.
cc @kenroche
-
#37759 [MoE] Move FlashInfer CuteDSL experts into fused_moe/experts/ — ready,nvidia — by robertgshaw2-redhat (merged: 2026-03-22 07:15 (UTC+8)) [+2/-2, 3 files | commented:1 approved:1] ## Summary
- Moves `flashinfer_cutedsl_moe.py` from the `fused_moe/` root into the `fused_moe/experts/` subdirectory, consistent with the ongoing migration of expert kernel files (e.g. `trtllm_nvfp4_moe.py`, `trtllm_fp8_moe.py`, `trtllm_mxfp4_moe.py`)
- Updates the import in `oracle/nvfp4.py` to point to the new location
- Updates the import in `tests/kernels/moe/test_cutedsl_moe.py` to point to the new location
This is not duplicating an existing PR — it is part of an in-progress migration of `vllm/model…
-
#32700 [Quantization][Deprecation] Remove PTPC FP8 — rocm,ready,ci/build — by robertgshaw2-redhat (merged: 2026-03-22 06:10 (UTC+8)) [💬3 | +1/-196, 5 files | commented:2 approved:2]
## Purpose
- Now that 0.14 is out with the deprecation notice, remove PTPC FP8 completely in 0.15
-
#37768 Revert “Consolidate AWQ quantization into single awq_marlin.py file” — no labels — by robertgshaw2-redhat (merged: 2026-03-22 05:20 (UTC+8)) [+279/-253, 3 files | commented:1] REVERT: accidental push
-
#32104 Add tensor IPC transfer mechanism for multimodal data — performance,frontend,ready,v1,multi-modality — by brandonpelfrey (merged: 2026-03-22 04:10 (UTC+8)) [💬34 | +1430/-25, 13 files | commented:10] Introduce Multimodal Content Tensor IPC/SHMEM Data Path
Following on from a request to break down the RFC/PR in https://github.com/vllm-project/vllm/pull/31925, this PR introduces an IPC/SHMEM pathway for sending multimodal content from API Server -> CoreEngine processes via multiprocessing Queues. Part of the intention of this change is to reduce the number of changes in the original PR and introduce easier-to-review components which are required for the complete solution.
Note that this pat…
-
#37756 [Perf] Add SM 10.3 (B300/GB300) all-reduce communicator tuning — ready — by mmangkad (merged: 2026-03-22 03:43 (UTC+8)) [💬1 | +23/-0, 3 files | commented:1 approved:2] ## Summary
Add benchmarked optimal all-reduce communicator config values for SM 10.3 (B300/GB300).
Depends on #37755 for allreduce fusion to be auto-enabled on SM 10.3.
## Config Values
| Config | ws=2 | ws=4 | ws=6 | ws=8 |
|--------|------|------|------|------|
…
-
#37755 [Core] Enable allreduce fusion by default for SM 10.3 (B300/GB300) — ready — by mmangkad (merged: 2026-03-22 03:40 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1] ## Summary
Enable allreduce fusion by default for SM 10.3 (B300/GB300) by using `is_device_capability_family(100)` instead of `is_device_capability(100)` in `enable_allreduce_rms_fusion`.
## Accuracy Test
```
vllm serve nvidia/DeepSeek-V3.2-NVFP4
  --tensor-parallel-size 2
  --tokenizer-mode deepseek_v32
  …
```
-
#37721 [ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list (MI355) — rocm,ready,ci/build — by AndreasKaratzas (merged: 2026-03-21 15:27 (UTC+8)) [💬1 | +1/-1, 1 files | commented:1 approved:1]
Follow-up for:
- #34839
Updated the GSM8K correctness eval in the AMD CI pipeline to use the `models-mi3xx-fp8-and-mixed.txt` config list instead of `models-mi3xx-fp8.txt`. Addresses failure in `mi355_2: LM Eval Small Models (B200-MI325)`.
Motivation: https://buildkite.com/vllm/amd-ci/builds/6721/steps/canvas?sid=019d09d4-713e-4e07-bcb9-9b38689611a0&tab=output
cc @kenroche
-
#37722 quick fix for 37665 — no labels — by xuechendi (merged: 2026-03-21 21:08 (UTC+8)) [+6/-4, 1 files | commented:1 approved:1] ## Purpose
Missing `set_current_vllm_config`, so we catch the error below:
```
File "/opt/venv/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__ s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s) pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig Assertion failed, Current vLLM config is not set. This typically means get_current_vllm_config() was called outside of a set_current_vllm_co…
```
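A minimal sketch of the current-config pattern involved, with names modeled on (but not identical to) vLLM's `set_current_vllm_config`/`get_current_vllm_config`; the failure mode is exactly the assertion quoted in the traceback above:

```python
# Hedged sketch of the "current config" context pattern: code that needs the
# config must run inside the context manager; calling the getter outside it
# trips an assertion, which is the error this PR fixed by adding the
# missing set_current_vllm_config wrapper.
from contextlib import contextmanager

_current_config = None


@contextmanager
def set_current_config(cfg):
    global _current_config
    prev, _current_config = _current_config, cfg
    try:
        yield
    finally:
        _current_config = prev  # restore on exit, supports nesting


def get_current_config():
    assert _current_config is not None, "Current config is not set"
    return _current_config


with set_current_config({"model": "demo"}):
    assert get_current_config()["model"] == "demo"

# Outside the context, the getter raises, mirroring the reported error:
failed = False
try:
    get_current_config()
except AssertionError:
    failed = True
assert failed
```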
-
#37318 [Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() — ready,v1 — by fuscof-ibm (merged: 2026-03-21 17:29 (UTC+8)) [+11/-5, 2 files | commented:1 approved:1] ## Purpose
`get_mamba_groups()` is now called only once during `MambaCopyBuffers.create()`, and the result is reused in both `preprocess_mamba()` and `postprocess_mamba()` rather than being recomputed on every batch.
## Test Plan
python -m pytest tests/v1/worker/test_mamba_utils.py -v
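The compute-once pattern can be sketched generically; the class and function names below are stand-ins shaped after the PR description, not vLLM's implementation:

```python
# Sketch of the optimization: the expensive lookup runs once in create(),
# and both the pre- and post-processing steps reuse the stored result
# instead of recomputing it per batch.
class MambaCopyBuffers:
    def __init__(self, groups):
        self.groups = groups  # cached once, reused below

    @classmethod
    def create(cls, get_groups):
        return cls(groups=get_groups())  # the single call site

    def preprocess(self):
        return len(self.groups)  # reuses the cached result

    def postprocess(self):
        return len(self.groups)  # reuses the cached result


calls = []


def get_mamba_groups():  # stand-in for the real (expensive) lookup
    calls.append(1)
    return ["g0", "g1"]


bufs = MambaCopyBuffers.create(get_mamba_groups)
for _ in range(3):  # many batches...
    bufs.preprocess()
    bufs.postprocess()
assert len(calls) == 1  # ...but only one lookup ever happens
```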
-
#34692 [ROCm] Enable DeepEP ROCm as all2all backend for AMD GPUs — documentation,rocm,ready,ci/build — by lcskrishna (merged: 2026-03-21 15:32 (UTC+8)) [💬26 | +68/-29, 7 files | commented:4 approved:2 changes:2] This PR integrates the changes required to run DeepEP as an all2all backend on AMD GPUs.
Co-authored by @itej89.
The following changes are performed:
- The current codebase is modified to run DeepEP backend OOB on AMD GPUs without any issues.
- FusedBatchedMoE kernels can run on AMD GPUs with float8_e4m3fnuz format added.
Related steps: …
-
#37424 [Responses API] Add kv_transfer_params for PD disaggregation — frontend,ready,kv-connector — by bongwoobak (merged: 2026-03-21 13:48 (UTC+8)) [💬4 | +29/-2, 3 files | commented:4 approved:1] ## Purpose
Add `kv_transfer_params` to the Responses API (`/v1/responses`) for PD disaggregation support. The Chat Completions API already supports this, but the Responses API does not, preventing PD disaggregation through it.
Follows the same pattern as `ChatCompletionRequest`/`ChatCompletionResponse`:
- `ResponsesRequest`: add `kv_transfer_params` field, inject into `SamplingParams.extra_args`
- `ResponsesResponse`: add `kv_transfer_params` field, populate from engine output
- All 4 context …
- #37610 [ROCm][CI] Mark gemma3 as large GPU test to avoid OOM on MI250 — rocm,ready,ci/build,multi-modality — by AndreasKaratzas (merged: 2026-03-21 12:57 (UTC+8)) [💬6 | +19/-15, 2 files | commented:2 approved:1]
Follow-up for:
- #34839
Fixes OOM in `mi250_1: Multi-Modal Models (Standard) 2: qwen3 + gemma`.
Motivation: https://buildkite.com/vllm/amd-ci/builds/6701/steps/canvas?sid=019d07a7-1a19-4174-b4a1-c9bbfff0c164&tab=output
@kenroche
-
#37733 Revert “[compile] Initialize passes at VllmBackend init” — no labels — by simon-mo (merged: 2026-03-21 12:35 (UTC+8)) [+5/-19, 3 files | commented:2] Reverts vllm-project/vllm#35216
see #37732
- #37617 [ROCm][CI] Guard CudaPlatform/RocmPlatform imports to fix test collection on cross-platform builds — rocm,ready,nvidia — by AndreasKaratzas (merged: 2026-03-21 11:58 (UTC+8)) [💬2 | +35/-3, 1 files | commented:1 approved:1]
Follow-up for:
- #34839
Fixes test collection that unconditionally imports `CudaPlatform` and `RocmPlatform` at module level. Addresses failure in `mi355_1: Kernels (B200-MI355)`.
Motivation: https://buildkite.com/vllm/amd-ci/builds/6701/steps/canvas?sid=019d07a7-1ac3-4614-993f-cc80ab9b2c57&tab=output
cc @kenroche
- #37058 [Frontend] Remove librosa from audio dependency — performance,rocm,frontend,ready,ci/build,multi-modality — by Isotr0py (merged: 2026-03-21 11:36 (UTC+8)) [💬4 | +247/-188, 18 files | commented:3 approved:1]
## Purpose
Related issues:
- https://github.com/vllm-project/vllm-omni/issues/1013
- https://github.com/vllm-project/vllm-omni/issues/1725
Remove `librosa` from the audio dependency, so downstream projects that need audio support, like `vllm-omni`, can avoid being affected by the LGPL.
## Test Plan ``` pytest -s -v tests/multimodal/test_audio.py pytest -s -v tests/multimodal/meida/test_audio.py …
[Closed unmerged PRs]
-
#37669 WIP: [openapi] enable scaling ep only when api_server_count is 1 — frontend — by andyxning (closed: 2026-03-22 10:58 (UTC+8)) [💬4 | +21/-12, 1 files | commented:1] ## Purpose
Scaling EP cannot be enabled when the API server count is above 1, because there is an in-memory status `_scaling_elastic_ep` per API server process. This is similar to dynamic LoRA updating, which also only works with an API server count of 1. https://github.com/vllm-project/vllm/blob/b4c1aef21c1a4cb252e7a440b3f9b0baebefbbef/vllm/entrypoints/cli/serve.py#L273-L276
Suppose an API server count of 2: if `/scale_elastic_ep` is served by API server process 1, then when the `/is_scali…
- #28813 build: align CUDA 12.1 xformers wheel pin — ci/build,stale,nvidia — by m0nk111 (closed: 2026-03-22 10:14 (UTC+8)) [💬6 | +2/-1, 1 files | commented:3]
## Summary
- Bump the `xformers` requirement on CUDA 12.1 builds to the internally built `0.0.33+5d4b92a5.d20251106` wheel
- Add a comment pointing to the upstream release tag we use for the build
This isolates the build-only change that was previously bundled inside #28241.
## Testing
- not run (dependency pin update only)
-
#37726 Revert “[Model] Deprecate the score task (this will not affect users).” (#37537) — documentation,frontend,v1 — by zhewenl (closed: 2026-03-22 10:08 (UTC+8)) [💬4 | +163/-184, 22 files | commented:1 | 📝 draft] ## Revert of #37537
This reverts https://github.com/vllm-project/vllm/pull/37537 (merge commit ed359c497a728f08b5b41456c07a688ccd510fbc).
Reason: This PR is linked to 1 new CI failure in build #57332:
- Language Models Test (MTEB) — `nvidia/llama-nemotron-rerank-1b-v2` rerank MTEB score dropped marginally (diff=0.0023 vs atol=0.002), causing `test_rerank_models_mteb[model_info0]` to fail.
The PR changed pooler heads, activations, and scori…
-
#37771 [Quant] Consolidate AWQ quantization into a single file — documentation,performance,rocm,needs-rebase,ci/build,gpt-oss,nvidia — by robertgshaw2-redhat (closed: 2026-03-22 05:45 (UTC+8)) [💬2 | +3132/-2511, 83 files | commented:1 | 📝 draft] ## Summary
- Merges `awq.py` and `awq_marlin.py` into a single consolidated module (`awq_marlin.py`), following the same structure as `gptq_marlin.py`
- Adds `AWQMarlinLinearKernel` in `vllm/model_executor/kernels/linear/mixed_precision/awq_marlin.py`, so `AWQMarlinLinearMethod` now delegates weight-repacking and inference through the kernel abstraction instead of calling marlin utils directly
- Extracts shared AWQ weight-parameter creation (`_create_awq_weight_params`) used by both the legacy a…
-
#37769 [Quantization] Merge gptq_marlin into gptq, remove slow GEMM kernel — performance,rocm — by robertgshaw2-redhat (closed: 2026-03-22 05:26 (UTC+8)) [+753/-1185, 25 files | commented:1] ## Summary
- `gptq.py` is replaced by the former `gptq_marlin.py` (Marlin kernel). `GPTQConfig.get_name()` now returns `"gptq"`, and `override_quantization_method` is removed (no longer needed since gptq IS Marlin). `gptq_marlin.py` is reduced to a thin backward-compat re-export shim. `--quantization gptq_marlin` auto-converts to `gptq` with a deprecation warning in `_verify_quantization`.
- The old exllama/GEMM-based `GPTQConfig`/`GPTQLinearMethod` are removed; backward-compa…
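The "thin backward-compat re-export shim" pattern described above can be sketched generically; the module layout and names here are illustrative, not vLLM's actual files:

```python
# Generic sketch of a deprecation shim: the old entry point warns and
# forwards to the consolidated implementation, so external callers keep
# working while being nudged to the new name.
import warnings


# --- consolidated module (standing in for the merged gptq implementation) ---
class GPTQConfig:
    @staticmethod
    def get_name():
        return "gptq"


# --- shim standing in for the old module's entry point ---
def load_legacy_config():
    warnings.warn(
        "'gptq_marlin' is deprecated; use 'gptq' instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return GPTQConfig  # re-export of the consolidated class


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cfg = load_legacy_config()

assert cfg.get_name() == "gptq"                          # same behavior
assert issubclass(caught[0].category, DeprecationWarning)  # but with a warning
```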
-
#37767 [Quantization] Merge gptq_marlin into gptq, deprecate gptq_marlin — performance — by robertgshaw2-redhat (closed: 2026-03-22 05:06 (UTC+8)) [+730/-1176, 21 files | commented:1] ## Summary
- `gptq_marlin.py` is a strict superset of `gptq.py` — the marlin kernel handles all GPTQ-compatible checkpoints on sm75+, and the 4-bit gptq_gemm fallback was already marked buggy in the old code. This PR merges them.
- Delete `gptq_marlin.py`; rewrite `gptq.py` with the marlin-based implementation. `GPTQMarlinConfig` → `GPTQConfig` (`get_name()` returns `"gptq"`); backward-compat aliases preserved for any external callers (`GPTQMarlinConfig`, `GPTQMarlinLinearMethod`, `GPT…
-
#34019 [Quantization][Refactor] Clean up GPTQ + AWQ quantization — no labels — by mu-hashmi (closed: 2026-03-22 04:40 (UTC+8)) [💬1 | +40/-36, 3 files | commented:1 | 📝 draft] ## Purpose
Addresses #31689. Building on `gptq_marlin.py` per @robertgshaw2-redhat's guidance to consolidate GPTQ/AWQ quantization and remove the legacy code paths.
So far: widened `GPTQMarlinConfig` to handle all GPTQ bit-widths and symmetry configs through the `MPLinearKernel` abstraction, added GPTQv2 checkpoint format support, and enabled 2/3-bit types in `ExllamaLinearKernel`.
Still TODO: validate all kernel backends end-to-end, remove `gptq.py`, convert `moe_wna16.py` into a modular kernel…
#37751 [Quantization] Route SM120 GPUs to CUTLASS MXFP4 MoE backend — needs-rebase,nvidia — by Tib-Gridello (closed: 2026-03-21 23:49 (UTC+8)) [💬2 | +3/-2, 1 files | commented:1] ## Summary
SM120 GPUs (RTX PRO 6000 Blackwell, RTX 5090) fall back to the slow Marlin backend for MXFP4 MoE models because `get_mxfp4_backend()` only checks `is_device_capability_family(100)`. SM120 is family 120, not 100, so it's excluded.
FlashInfer already ships SM120 CUTLASS kernel infrastructure (`gen_cutlass_fused_moe_sm120_module`, `CutlassTileConfigSM120`, `COMPILE_BLACKWELL_SM120_TMA_GROUPED_GEMMS`). This 1-line change routes SM120 to that existing path.
## Change
Add `is_device_cap…
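The capability-family logic at issue can be sketched as follows, assuming compute capability is encoded as major*10+minor (an assumption for illustration, not vLLM's exact code):

```python
# Sketch of why the family check matters: a family-100 check matches
# SM 10.0/10.3 (B200/B300) but not SM 12.0, so SM120 GPUs need their own
# family added to the backend-selection condition.
def capability_family(capability: int) -> int:
    # 103 -> 100, 120 -> 120: the family is the major version * 10.
    return (capability // 10) * 10


def is_device_capability_family(capability: int, family: int) -> bool:
    return capability_family(capability) == family


def get_mxfp4_backend(capability: int) -> str:
    if is_device_capability_family(capability, 100) or \
       is_device_capability_family(capability, 120):  # the 1-line addition
        return "cutlass"
    return "marlin"  # slow fallback for everything else


assert get_mxfp4_backend(103) == "cutlass"  # SM 10.3 (GB300) family
assert get_mxfp4_backend(120) == "cutlass"  # SM 12.0 now routed to CUTLASS
assert get_mxfp4_backend(90) == "marlin"    # Hopper keeps the fallback
```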
-
#37377 [Feature] Add LoRA support for Qwen3ASRForConditionalGeneration — documentation,qwen — by karanb192 (closed: 2026-03-21 14:01 (UTC+8)) [💬2 | +20/-1, 2 files | commented:1] ## Summary
- Adds the `SupportsLoRA` mixin to `Qwen3ASRForConditionalGeneration`, enabling `--enable-lora` for `Qwen/Qwen3-ASR-0.6B` and similar models
- Adds `packed_modules_mapping` and `embedding_modules` class attributes, following the same pattern as `Qwen3ForCausalLM`, `Qwen2_5OmniThinkerForConditionalGeneration`, and `Qwen3VLForConditionalGeneration`
- The model already has `get_mm_mapping()`, which correctly identifies language model and tower model prefixes for LoRA weight routing
- Marks L…
-
#37734 [Feature] Add LoRA support for Qwen3ASRForConditionalGeneration — documentation,qwen — by simpx (closed: 2026-03-21 14:01 (UTC+8)) [💬2 | +1248/-2, 3 files | commented:1] ## What does this PR do?
This PR adds LoRA fine-tuning support for the Qwen3-ASR model by implementing the `SupportsLoRA` interface.
## Why is this change needed?
Currently, attempting to serve Qwen/Qwen3-ASR with LoRA enabled fails because the model does not declare LoRA support. This change enables users to fine-tune Qwen3-ASR models with LoRA adapters.
## How was it tested?
…