Commit 33bb948

Committed Jun 21, 2024
add document for quant_lm_head
Signed-off-by: xin3he <xin3.he@intel.com>
1 parent 1104f44

File tree

1 file changed: +1 -0 lines changed

docs/3x/PT_WeightOnlyQuant.md

@@ -76,6 +76,7 @@ Notes:

- *group_size = -1* refers to **per output channel quantization**. Taking a linear layer (input channel = $C_{in}$, output channel = $C_{out}$) as an example, when *group_size = -1*, quantization computes $C_{out}$ quantization parameters in total. Otherwise, when *group_size = gs*, quantization parameters are computed for every $gs$ elements along the input channel, giving $C_{out} \times (C_{in} / gs)$ quantization parameters in total (see the counting sketch after this list).
- 4-bit NormalFloat (NF4) is proposed in QLoRA[7]. 'fp4' includes [fp4_e2m1](../../neural_compressor/adaptor/torch_utils/weight_only.py#L37) and [fp4_e2m1_bnb](https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/functional.py#L735). By default, fp4 refers to fp4_e2m1_bnb (a look-up-table sketch follows this list).
- *quant_lm_head* defaults to False, so the last layer of transformer models, which lies outside the transformer blocks and may be named "lm_head", "output_layer", or "embed_out", is not quantized by default (see the config sketch after this list).
- Only RTN and GPTQ support double quant.
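
For illustration, here is a minimal sketch of the parameter counting described in the *group_size* note above. The helper name and shapes are hypothetical; it only counts how many scale/zero-point sets would be produced and does not perform any quantization:

```python
import torch

def count_quant_params(weight: torch.Tensor, group_size: int) -> int:
    """Count the (scale, zero-point) sets weight-only quantization would
    produce for a 2-D linear weight of shape (C_out, C_in)."""
    c_out, c_in = weight.shape
    if group_size == -1:
        # Per output channel: one parameter set per output channel.
        return c_out
    # Group-wise: one parameter set per group of `group_size` elements
    # along the input channel.
    assert c_in % group_size == 0, "C_in must be divisible by group_size"
    return c_out * (c_in // group_size)

w = torch.randn(4096, 4096)
print(count_quant_params(w, -1))   # 4096 parameter sets (per output channel)
print(count_quant_params(w, 128))  # 4096 * (4096 / 128) = 131072 parameter sets
```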
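
NF4 and fp4 are both look-up-table formats: weights are scaled into the codebook's range and each element is mapped to the nearest of 16 fixed values. The sketch below only illustrates that idea; the codebook here is an evenly spaced placeholder, not the real NF4/fp4 tables, which are defined in the files linked in the note above:

```python
import torch

def lut_quantize(weight: torch.Tensor, codebook: torch.Tensor):
    """Map each element to the nearest codebook entry after per-tensor
    absmax scaling, and return indices, scale, and the dequantized tensor."""
    scale = weight.abs().max()
    normalized = weight / scale                          # roughly in [-1, 1]
    diffs = (normalized.unsqueeze(-1) - codebook).abs()  # distance to each entry
    idx = diffs.argmin(dim=-1)                           # nearest-entry index
    dequantized = codebook[idx] * scale                  # back to original range
    return idx.to(torch.uint8), scale, dequantized

# Placeholder 16-entry codebook (NOT the actual NF4 or fp4 values).
codebook = torch.linspace(-1.0, 1.0, 16)
w = torch.randn(8, 8)
idx, scale, w_hat = lut_quantize(w, codebook)
print((w - w_hat).abs().max())  # worst-case quantization error
```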
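
To quantize the last layer as well, *quant_lm_head* can be enabled on the quantization config. Below is a minimal sketch assuming the 3.x PyTorch flow (`RTNConfig`, `prepare`, `convert`) and a Hugging Face causal-LM model; the model name and argument values are placeholders and should be checked against the current release:

```python
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# quant_lm_head=True also quantizes the final layer
# (e.g. "lm_head", "output_layer", or "embed_out").
quant_config = RTNConfig(bits=4, group_size=32, quant_lm_head=True)

model = prepare(model, quant_config)
model = convert(model)
```
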
#### RTN
