
[Bug]: Hidden Size mismatch when enabling VLLM_MOE_PADDING and VLLM_MOE_SHUFFLE with Mixtral-8x7B-Instruct-v0.1 FP8 Quant #415

Open
tjtanaa opened this issue Feb 11, 2025 · 1 comment
Labels: bug (Something isn't working), Under Investigation

Comments


tjtanaa commented Feb 11, 2025

Your current environment

The output of `python collect_env.py`
INFO 02-11 09:36:43 __init__.py:186] Automatically detected platform rocm.                                                                                                             
WARNING 02-11 09:36:43 rocm.py:36] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.                                              
Collecting environment information...                                                                                                                                                  
PyTorch version: 2.7.0a0+git3a58512                                                                                                                                                    
Is debug build: False                                                                                                                                                                  
CUDA used to build PyTorch: N/A                                                                                                                                                        
ROCM used to build PyTorch: 6.3.42133-1b9c17779                                                                                                                                        
                                                                                                                                                                                       
OS: Ubuntu 22.04.5 LTS (x86_64)                                                                                                                                                        
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0                                                                                                                                     
Clang version: 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.3.1 24491 1e0fda770a2079fbd71e4b70974d74f62fd3af10)                                                  
CMake version: version 3.31.4                                                                                                                                                          
Libc version: glibc-2.35                                                                                                                                                               
                                                                                                                                                                                       
Python version: 3.12.8 (main, Dec  4 2024, 08:54:12) [GCC 11.4.0] (64-bit runtime)                                                                                                     
Python platform: Linux-5.15.0-116-generic-x86_64-with-glibc2.35                                                                                                                        
Is CUDA available: True                                                                                                                                                                
CUDA runtime version: Could not collect                                                                                                                                                
CUDA_MODULE_LOADING set to: LAZY                                                                                                                                                       
GPU models and configuration: AMD Instinct MI300X (gfx942:sramecc+:xnack-)                                                                                                             
Nvidia driver version: Could not collect                                                                                                                                               
cuDNN version: Could not collect                                                                                                                                                       
HIP runtime version: 6.3.42133                                                                                                                                                         
MIOpen runtime version: 3.3.0                                                                                                                                                          
Is XNNPACK available: True                                                                                                                                                             
                                                                                                                                                                                       
CPU:                                                                                                                                                                                   
Architecture:                         x86_64                                                                                                                                           
CPU op-mode(s):                       32-bit, 64-bit                                                                                                                                   
Address sizes:                        52 bits physical, 57 bits virtual                                                                                                                
Byte Order:                           Little Endian                                                                                                                                    
CPU(s):                               192                                                                                                                                              
On-line CPU(s) list:                  0-191                                                                                                                                            
Vendor ID:                            AuthenticAMD                                                                                                                                     
Model name:                           AMD EPYC 9654 96-Core Processor                                                                                                                  
CPU family:                           25                                                                                                                                               
Model:                                17                                                                                                                                               
Thread(s) per core:                   1                                                                                                                                                
Core(s) per socket:                   96                                                                                                                                               
Socket(s):                            2                                                                                                                                                
Stepping:                             1                                                                                                                                                
Frequency boost:                      enabled                                                                                                                                          
CPU max MHz:                          3707.8120                                                                                                                                        
CPU min MHz:                          1500.0000                                                                                                                                        
BogoMIPS:                             4792.72                                                                                                                                          
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm
 constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand
 cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_singl
e hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha
_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv 
svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmu
lqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d                                                                                   
Virtualization:                       AMD-V                                                                                                                                            
L1d cache:                            6 MiB (192 instances)                                                                                                                            
L1i cache:                            6 MiB (192 instances)                                                                                                                            
L2 cache:                             192 MiB (192 instances)                                                                                                                          
L3 cache:                             768 MiB (24 instances)                                                                                                                           
NUMA node(s):                         2                                                                                                                                                
NUMA node0 CPU(s):                    0-95                                                                                                                                             
NUMA node1 CPU(s):                    96-191                                                                                                                                           
Vulnerability Gather data sampling:   Not affected                                                                                                                                     
Vulnerability Itlb multihit:          Not affected                                                                                                                                     
Vulnerability L1tf:                   Not affected                                                                                                                                     
Vulnerability Mds:                    Not affected                                                                                                                                     
Vulnerability Meltdown:               Not affected                                                                                                                                     
Vulnerability Mmio stale data:        Not affected                                                                                                                                     
Vulnerability Reg file data sampling: Not affected                                                                                                                                     
Vulnerability Retbleed:               Not affected                                                                                                                                     
Vulnerability Spec rstack overflow:   Mitigation; safe RET                                                                                                                             
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp                                                                              
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization                                                                             
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected                 
Vulnerability Srbds:                  Not affected                                                                                                                                     
Vulnerability Tsx async abort:        Not affected                                                                                                                                     
                                                                                                                                                                                       
Versions of relevant libraries:                                                                                                                                                        
[pip3] numpy==1.26.4                                                                                                                                                                   
[pip3] pyzmq==26.2.1                                                                                                                                                                   
[pip3] torch==2.7.0a0+git3a58512                                                                                                                                                       
[pip3] torchvision==0.19.1a0+6194369                                                                                                                                                   
[pip3] transformers==4.48.2                                                                                                                                                            
[pip3] triton==3.2.0+gite5be006a                                                                                                                                                       
[conda] Could not collect                                                                                                                                                              
ROCM Version: 6.3.42133-1b9c17779                                                                                                                                                      
Neuron SDK Version: N/A                                                                                                                                                                
vLLM Version: 0.7.1.dev105+g29499bb1                                                                                                                                                   
vLLM Build Flags:                                                                                                                                                                      
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled                                                                                                                                  
GPU Topology:                                                                                                                                                                          
============================ ROCm System Management Interface ============================                                                                                             
================================ Weight between two GPUs =================================                                                                                             
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7                                                                                 
GPU0   0            15           15           15           15           15           15           15                                                                                   
GPU1   15           0            15           15           15           15           15           15                                                                                   
GPU2   15           15           0            15           15           15           15           15                                                                                   
GPU3   15           15           15           0            15           15           15           15                                                                                   
GPU4   15           15           15           15           0            15           15           15  
GPU5   15           15           15           15           15           0            15           15           
GPU6   15           15           15           15           15           15           0            15           
GPU7   15           15           15           15           15           15           15           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            1            1            1            1            1            1            1            
GPU1   1            0            1            1            1            1            1            1            
GPU2   1            1            0            1            1            1            1            1            
GPU3   1            1            1            0            1            1            1            1            
GPU4   1            1            1            1            0            1            1            1            
GPU5   1            1            1            1            1            0            1            1            
GPU6   1            1            1            1            1            1            0            1            
GPU7   1            1            1            1            1            1            1            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU1   XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU2   XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         
GPU3   XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         
GPU4   XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         
GPU5   XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         
GPU6   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         
GPU7   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: 0
GPU[1]          : (Topology) Numa Node: 0
GPU[1]          : (Topology) Numa Affinity: 0
GPU[2]          : (Topology) Numa Node: 0
GPU[2]          : (Topology) Numa Affinity: 0
GPU[3]          : (Topology) Numa Node: 0
GPU[3]          : (Topology) Numa Affinity: 0
GPU[4]          : (Topology) Numa Node: 1
GPU[4]          : (Topology) Numa Affinity: 1
GPU[5]          : (Topology) Numa Node: 1
GPU[5]          : (Topology) Numa Affinity: 1
GPU[6]          : (Topology) Numa Node: 1
GPU[6]          : (Topology) Numa Affinity: 1
GPU[7]          : (Topology) Numa Node: 1
GPU[7]          : (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================

PYTORCH_ROCM_ARCH=gfx942
LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
VLLM_WORKER_MULTIPROC_METHOD=spawn
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY


Model Input Dumps

No response

🐛 Describe the bug

When running the following command with VLLM_MOE_PADDING=1 and VLLM_MOE_SHUFFLING=1 enabled,

NCCL_MIN_NCHANNELS=112 RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1 TRITON_HIP_USE_NEW_STREAM_PIPELINE=1 VLLM_MOE_PADDING=1 VLLM_MOE_SHUFFLING=1 HIP_FORCE_DEV_KERNARG=1 TORCH_BLAS_PREFER_HIPBLASLT=1 VLLM_SCHED_PREFILL_COUNT=0 VLLM_USE_ROCM_CUSTOM_PAGED_ATTN=1 VLLM_USE_TRITON_FLASH_ATTN=0 HIP_VISIBLE_DEVICES=7 HF_TOKEN= vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 -tp 1 --quantization fp8 --kv_cache_dtype fp8_e4m3

it throws the following error (only a partial snippet of the error trace is included):

ERROR 02-11 07:31:20 engine.py:389]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-11 07:31:20 engine.py:389]   File "/app/vllmmaincheckmoe/vllm/compilation/decorators.py", line 172, in __call__                                                               
ERROR 02-11 07:31:20 engine.py:389]     return self.forward(*args, **kwargs)                                                                                                           
ERROR 02-11 07:31:20 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                           
ERROR 02-11 07:31:20 engine.py:389]   File "/app/vllmmaincheckmoe/vllm/model_executor/models/mixtral.py", line 311, in forward                                                         
ERROR 02-11 07:31:20 engine.py:389]     hidden_states, residual = layer(positions, hidden_states,                                                                                      
ERROR 02-11 07:31:20 engine.py:389]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                      
ERROR 02-11 07:31:20 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl                                      
ERROR 02-11 07:31:20 engine.py:389]     return self._call_impl(*args, **kwargs)                                                                                                        
ERROR 02-11 07:31:20 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                        
ERROR 02-11 07:31:20 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1760, in _call_impl                                              
ERROR 02-11 07:31:20 engine.py:389]     return forward_call(*args, **kwargs)                                                                                                           
ERROR 02-11 07:31:20 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                           
ERROR 02-11 07:31:20 engine.py:389]   File "/app/vllmmaincheckmoe/vllm/model_executor/models/mixtral.py", line 248, in forward                                                         
ERROR 02-11 07:31:20 engine.py:389]     hidden_states = self.block_sparse_moe(hidden_states)                                                                                           
ERROR 02-11 07:31:20 engine.py:389]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                           
ERROR 02-11 07:31:20 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl                                      
ERROR 02-11 07:31:20 engine.py:389]     return self._call_impl(*args, **kwargs)                                                                                                        
ERROR 02-11 07:31:20 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                        
ERROR 02-11 07:31:20 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1760, in _call_impl                                              
ERROR 02-11 07:31:20 engine.py:389]     return forward_call(*args, **kwargs)                                                                                                           
ERROR 02-11 07:31:20 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                           
ERROR 02-11 07:31:20 engine.py:389]   File "/app/vllmmaincheckmoe/vllm/model_executor/models/mixtral.py", line 104, in forward                                                         
ERROR 02-11 07:31:20 engine.py:389]     final_hidden_states = self.experts(hidden_states, router_logits)                                                                               
ERROR 02-11 07:31:20 engine.py:389]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-11 07:31:20 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
ERROR 02-11 07:31:20 engine.py:389]     return self._call_impl(*args, **kwargs)
ERROR 02-11 07:31:20 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-11 07:31:20 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1760, in _call_impl
ERROR 02-11 07:31:20 engine.py:389]     return forward_call(*args, **kwargs)
ERROR 02-11 07:31:20 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-11 07:31:20 engine.py:389]   File "/app/vllmmaincheckmoe/vllm/model_executor/layers/fused_moe/layer.py", line 599, in forward
ERROR 02-11 07:31:20 engine.py:389]     final_hidden_states = self.quant_method.apply(
ERROR 02-11 07:31:20 engine.py:389]                           ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-11 07:31:20 engine.py:389]   File "/app/vllmmaincheckmoe/vllm/model_executor/layers/quantization/fp8.py", line 713, in apply
ERROR 02-11 07:31:20 engine.py:389]     return fused_experts(
ERROR 02-11 07:31:20 engine.py:389]            ^^^^^^^^^^^^^^
ERROR 02-11 07:31:20 engine.py:389]   File "/app/vllmmaincheckmoe/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1106, in fused_experts
ERROR 02-11 07:31:20 engine.py:389]     torch.ops.vllm.inplace_fused_experts(hidden_states, w1, w2,
ERROR 02-11 07:31:20 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1149, in __call__
ERROR 02-11 07:31:20 engine.py:389]     return self._op(*args, **(kwargs or {}))
ERROR 02-11 07:31:20 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-11 07:31:20 engine.py:389]   File "/app/vllmmaincheckmoe/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1008, in inplace_fused_experts
ERROR 02-11 07:31:20 engine.py:389]     fused_experts_impl(hidden_states, w1, w2, topk_weights, topk_ids, True,
ERROR 02-11 07:31:20 engine.py:389]   File "/app/vllmmaincheckmoe/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1141, in fused_experts_impl
ERROR 02-11 07:31:20 engine.py:389]     assert hidden_states.shape[
ERROR 02-11 07:31:20 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^
ERROR 02-11 07:31:20 engine.py:389] AssertionError: Hidden size mismatch
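
For reference, the failing assertion is the shape check in fused_experts_impl (fused_moe.py line 1141 in the trace) that raises "Hidden size mismatch". The snippet below is a minimal, self-contained sketch, not vLLM's actual code, of how a padding-unaware check of that kind breaks once the expert weights are padded along the hidden dimension; the shapes and pad size are illustrative assumptions, scaled down far below Mixtral's real dimensions.

import torch

# Scaled-down, illustrative shapes (assumptions, not the real Mixtral-8x7B sizes).
num_experts, hidden_size, intermediate_size = 8, 64, 224
pad = 16  # stand-in for whatever padding VLLM_MOE_PADDING appends to the weights' hidden dim

hidden_states = torch.randn(4, hidden_size)  # [num_tokens, hidden_size]
# Fused gate/up expert weights, padded along the hidden (last) dimension.
w1 = torch.randn(num_experts, 2 * intermediate_size, hidden_size + pad)

# A shape check that expects the unpadded hidden size, like the assertion in the
# traceback, now fails because w1.shape[2] == hidden_size + pad.
assert hidden_states.shape[1] == w1.shape[2], "Hidden size mismatch"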

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
tjtanaa added the bug (Something isn't working) label on Feb 11, 2025
@ppanchad-amd

Hi @tjtanaa. Internal ticket has been created to investigate this issue. Thanks!
