[Movement Pruning (Sanh et al., 2020)](https://arxiv.org/pdf/2005.07683.pdf) is an effective learning-based unstructured sparsification algorithm, especially for Transformer-based models in a transfer learning setup. [Lagunas et al., 2021](https://arxiv.org/pdf/2109.04838.pdf) extends the algorithm to sparsify at a block grain size, enabling structured sparsity that can achieve device-agnostic inference acceleration.

NNCF implements both unstructured and structured movement sparsification. The implementation is designed with a minimal set of configuration options for ease of use. The algorithm can be applied in conjunction with other NNCF algorithms, e.g. quantization-aware training and knowledge distillation. The optimized model can be deployed and accelerated via the [OpenVINO](https://docs.openvino.ai/2025/index.html) toolchain.

To explain how the algorithm is used, let's start with the example configuration below, which targets BERT models.
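As an illustrative sketch only, with scope patterns and values as assumptions rather than recommended settings, a movement sparsity configuration section might look like the following (the parameter names match those discussed in the rest of this section):

```json
{
    "compression": {
        "algorithm": "movement_sparsity",
        "params": {
            "warmup_start_epoch": 1,
            "warmup_end_epoch": 4,
            "importance_regularization_factor": 0.01,
            "enable_structured_masking": true
        },
        "sparse_structure_by_scopes": [
            {"mode": "block", "sparse_factors": [32, 32], "target_scopes": "{re}.*BertAttention.*"},
            {"mode": "per_dim", "axis": 0, "target_scopes": "{re}.*BertIntermediate.*"},
            {"mode": "per_dim", "axis": 1, "target_scopes": "{re}.*BertOutput.*"}
        ]
    }
}
```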
This diagram shows the sparsity level of a BERT-base model over the optimization lifecycle, which consists of two stages:

1. **Unstructured sparsification**: In the first stage, model weights are gradually sparsified at the grain size specified by `sparse_structure_by_scopes`. In this example, the _BertAttention_ layers (multi-head self-attention) are sparsified in 32 by 32 blocks, while the _BertIntermediate_ and _BertOutput_ layers (feed-forward network) are sparsified by row and column, respectively. The sparsification follows a predefined warmup schedule: users only need to specify its start epoch (`warmup_start_epoch`), its end epoch (`warmup_end_epoch`), and the sparsification strength, which is proportional to `importance_regularization_factor`. Some heuristics may be needed to find a satisfactory trade-off between sparsity and task performance. For more details on how movement sparsification works, please refer to the original papers [1, 2].

2. **Structured masking and fine-tuning**: At the end of the first stage, i.e. at `warmup_end_epoch`, the sparsified model cannot be accelerated without tailored hardware/software support, but some sparse structures can be discarded from the model entirely to save compute and memory footprint. NNCF provides a mechanism to achieve this via `"enable_structured_masking": true`: it automatically resolves the structured masks between dependent layers and rewinds those sparsified parameters that do not contribute to acceleration so they remain available for task modeling. In the example above, the sparsity level drops after `warmup_end_epoch` due to structured masking, and the model continues fine-tuning thereafter. Currently, the automatic structured masking feature has been tested on the **_BERT, DistilBERT, RoBERTa, MobileBERT, Wav2Vec2, Swin, ViT, CLIPVisual_** architectures defined in [Hugging Face's transformers](https://huggingface.co/docs/transformers/index); support for other architectures is not guaranteed. Users can disable this feature by setting `"enable_structured_masking": false`, in which case the sparse structures at the end of the first stage are frozen and training/fine-tuning continues on the unmasked parameters. Please refer to the next section to realize model inference acceleration with the [OpenVINO](https://docs.openvino.ai/2025/index.html) toolchain. A minimal training-loop sketch follows this list.
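As a minimal, hypothetical sketch of how the two stages plug into a standard PyTorch fine-tuning loop (the config file name, data loader, and hyperparameters are placeholders; the NNCF calls follow the usual `create_compressed_model` workflow):

```python
# A minimal sketch: fine-tune a BERT model with movement sparsity enabled.
# "movement_sparsity_config.json" is a placeholder for a config like the sketch above.
import torch
from nncf import NNCFConfig
from nncf.torch import create_compressed_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
nncf_config = NNCFConfig.from_json("movement_sparsity_config.json")

# Wrap the model; the controller exposes the sparsity scheduler and loss.
compression_controller, model = create_compressed_model(model, nncf_config)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

for epoch in range(5):                      # epochs should cover the warmup window
    compression_controller.scheduler.epoch_step()
    for batch in train_dataloader:          # train_dataloader: your tokenized DataLoader
        compression_controller.scheduler.step()
        outputs = model(**batch)
        # Task loss plus the movement importance regularization term.
        loss = outputs.loss + compression_controller.loss()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The scheduler's `epoch_step()`/`step()` calls advance the warmup schedule of stage 1; with `"enable_structured_masking": true`, the structured masking of stage 2 is applied at `warmup_end_epoch` as described above.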
## Inference Acceleration via [OpenVINO](https://docs.openvino.ai/2025/index.html)
Optimized models are compatible with the OpenVINO toolchain. Use `compression_controller.export_model("movement_sparsified_model.onnx")` to export the model in ONNX format; in the exported model, the sparsified parameters are set to zero. Structured sparse structures can be discarded during the translation from ONNX to OpenVINO IR using [Model Conversion](https://docs.openvino.ai/2025/openvino-workflow/model-preparation/convert-model-to-ir.html) with the [pruning transformation](https://docs.openvino.ai/2025/openvino-workflow/model-optimization-guide/compressing-models-during-training/filter-pruning.html). The resulting IR is compressed and deployable with [OpenVINO Runtime](https://docs.openvino.ai/2025/openvino-workflow/running-inference.html). To quantify the inference performance improvement, both the ONNX model and the IR can be profiled using the [Benchmark Tool](https://docs.openvino.ai/2025/get-started/learn-openvino/openvino-samples/benchmark-tool.html), as in the sketch below.
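As a rough sketch of this flow (file names are placeholders; `compression_controller` is the object returned by `create_compressed_model` in the earlier sketch, and the conversion uses OpenVINO's generic Python API rather than a movement-sparsity-specific one):

```python
# A minimal sketch: export the sparsified model to ONNX, then convert to IR.
import openvino as ov

# Sparsified parameters are zero in the exported ONNX model.
compression_controller.export_model("movement_sparsified_model.onnx")

# Convert ONNX to OpenVINO IR. Enabling the pruning transformation that
# discards the structured sparse structures is described in the Model
# Conversion documentation linked above.
ov_model = ov.convert_model("movement_sparsified_model.onnx")
ov.save_model(ov_model, "movement_sparsified_model.xml")
```

Either artifact can then be profiled from the command line, e.g. `benchmark_app -m movement_sparsified_model.xml`.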
## Getting Started