
Commit be866d4

update paragraph

1 parent e54dcd2 · commit be866d4

2 files changed: +2 -2 lines changed


docs/source/inference.mdx (+1 -1)

````diff
@@ -99,7 +99,7 @@ tokenizer.save_pretrained(save_directory)
 
 ### Weight-only quantization
 
-You can also apply fp16, 8-bit or 4-bit weight compression on the linear and embedding layers when exporting your model with the CLI by setting `--weight-format` to respectively `fp16`, `int8` or `int4`:
+You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when exporting your model with the CLI by setting `--weight-format` to respectively `fp16`, `int8` or `int4`:
 
 ```bash
 optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
````
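
For reference, the other two `--weight-format` values named in the changed paragraph work with the same command. A minimal sketch; the `ov_model_int4` and `ov_model_fp16` output directory names are illustrative, not from the commit:

```bash
# 4-bit weight compression of the Linear, Convolutional and Embedding layers
optimum-cli export openvino --model gpt2 --weight-format int4 ov_model_int4

# fp16 weight compression
optimum-cli export openvino --model gpt2 --weight-format fp16 ov_model_fp16
```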

docs/source/optimization_ov.mdx (+1 -1)

````diff
@@ -25,7 +25,7 @@ Quantization is a technique to reduce the computational and memory costs of runn
 
 ### Weight-only quantization
 
-Quantization can be applied on the model's linear and embedding layers, enabling the loading of large models on memory-limited devices. For example, when applying 8-bit quantization, the resulting model will be x4 smaller than its fp32 counterpart. For 4-bit quantization, the reduction in memory could theoretically reach x8, but is closer to x6 in practice.
+Quantization can be applied on the model's Linear, Convolutional and Embedding layers, enabling the loading of large models on memory-limited devices. For example, when applying 8-bit quantization, the resulting model will be x4 smaller than its fp32 counterpart. For 4-bit quantization, the reduction in memory could theoretically reach x8, but is closer to x6 in practice.
 
 
 #### 8-bit
````
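
The compression ratios in the changed paragraph follow from per-weight storage: fp32 uses 32 bits per weight, so 8-bit weights give 32/8 = x4 and 4-bit weights give 32/4 = x8 in theory; the quantization scales and zero-points, plus any layers typically kept in higher precision, bring the practical 4-bit figure closer to x6. One way to see the reduction on disk is to export the same model twice and compare sizes; a minimal sketch reusing the CLI command from the other changed file (the output directory names are illustrative):

```bash
# Export the same model with 8-bit and 4-bit weight compression,
# then compare the on-disk size of the two exported models.
optimum-cli export openvino --model gpt2 --weight-format int8 ov_model_int8
optimum-cli export openvino --model gpt2 --weight-format int4 ov_model_int4
du -sh ov_model_int8 ov_model_int4
```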
