src/common/snippets/docs/debug_capabilities/perf_count.md
+1-1
Original file line number
Diff line number
Diff line change
@@ -5,4 +5,4 @@ Subgraph in snippets could be very large. Sometimes developers are interested th
5
5
There are two perf count modes.
6
6
- `Chrono` : Perf count via chrono call. This is a universal method, and support multi-threads scenario to print perf count data for each thread.
7
7
- `BackendSpecific` : Perf count provided by backend. This is for device specific requirement. For example, for sake of more light overhead and more accurate result, x86 or x86-64 CPU specific mode via reading RDTSC register is implemented. At current this x86 or x86-64 CPU BackendSpecific mode only support single thread.
8
-
One can select prefered mode by setting `perf_count_mode` default value in [snippets Config](../../include/snippets/utils/debug_caps.hpp)
8
+
One can select prefered mode by setting `perf_count_mode` default value in [snippets Config](../../include/snippets/utils/debug_caps_config.hpp)
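Note on the `Chrono` mode described in this file: the idea of timing a region with a chrono call and keeping separate data per thread can be sketched as below. This is only an illustration under stated assumptions. `ChronoPerfCounter` and `this_thread_counter` are hypothetical names, not the Snippets implementation, and the choice of `std::chrono::steady_clock` is an assumption.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical accumulator (not the Snippets class): sums the elapsed time
// between start()/stop() pairs and counts how many pairs were recorded.
class ChronoPerfCounter {
public:
    void start() { m_start = std::chrono::steady_clock::now(); }

    void stop() {
        const auto end = std::chrono::steady_clock::now();
        m_total_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(end - m_start).count();
        ++m_calls;
    }

    std::uint64_t calls() const { return m_calls; }

    // Average duration of one measured region in nanoseconds (0 if nothing was measured).
    double average_ns() const {
        return m_calls != 0 ? static_cast<double>(m_total_ns) / static_cast<double>(m_calls) : 0.0;
    }

private:
    std::chrono::steady_clock::time_point m_start{};
    std::int64_t m_total_ns = 0;
    std::uint64_t m_calls = 0;
};

// Hypothetical accessor: every thread gets its own counter, so the data
// can be reported per thread without locking.
inline ChronoPerfCounter& this_thread_counter() {
    thread_local ChronoPerfCounter counter;
    return counter;
}
```

A measured region would call `this_thread_counter().start()` before the work and `stop()` after it. A register-based counter such as the RDTSC one mentioned above trades this portability for lower overhead.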
src/common/snippets/docs/mha_optimization_guide.md
+1-1
@@ -129,7 +129,7 @@ The heuristics for determining the optimal block sizes can be found in [BrgemmCP
129
129
130
130
### Blocking Order
131
131
132
-
The lowered pass [BrgemmBlocking](../../../plugins/intel_cpu/src/transformations/snippets/x64/pass/lowered/brgemm_blocking.cpp) performs blocking loops creation on LinearIR.
132
+
The lowered pass [BrgemmBlocking](../../../common/snippets/src/lowered/pass/brgemm_blocking.cpp) performs blocking loops creation on LinearIR.
133
133
Currently, the order of blocking loops is following (from outer to inner): `M->N->K`.
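Note on the `M->N->K` order stated above: with `M` as the outermost blocking loop and `K` as the innermost, the block origins are visited as in the sketch below. This is an illustration only. `blocking_order` and the block-size parameters are hypothetical and do not come from `BrgemmBlocking`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: lists the (m, n, k) origin of every block in the
// order the blocking loops visit them, with M outermost and K innermost.
inline std::vector<std::array<std::size_t, 3>> blocking_order(std::size_t M, std::size_t N, std::size_t K,
                                                              std::size_t m_blk, std::size_t n_blk,
                                                              std::size_t k_blk) {
    assert(m_blk > 0 && n_blk > 0 && k_blk > 0);  // zero-sized blocks would never advance
    std::vector<std::array<std::size_t, 3>> order;
    for (std::size_t m = 0; m < M; m += m_blk) {          // outer loop over M
        for (std::size_t n = 0; n < N; n += n_blk) {      // middle loop over N
            for (std::size_t k = 0; k < K; k += k_blk) {  // inner loop over K
                order.push_back({m, n, k});
            }
        }
    }
    return order;
}
```

For a 4x4x4 problem with 2x2x2 blocks this visits eight blocks, and the `k` origin changes fastest while the `m` origin changes slowest.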
src/common/snippets/docs/snippets_design_guide.md
+5-5
@@ -638,23 +638,23 @@ Consequently, all the ports connected to the same `PortConnector` will have the
638
638
In other words, when all the `Expressions` that required input data in a certain register are evaluated, the register may be reused to hold another `Expression's` output.
639
639
`AssignRegisters` also supports two types of registers: general-purpose and vector ones.
640
640
Different types of registers are managed and assigned independently, and a particular register type required by an `Expression` is provided by the `ov::snippets::Generator` (or a derived generator for target-specific `Ops`).
641
-
2. `InsertTailLoop` injects tail-processing section after a loop body if needed.
641
+
2. `InsertSpecificIterations` injects initialization section before a loop body and tail-processing section after a loop body if needed.
642
642
Note that every loop has two parameters that specify how its body is evaluated: `work_amount` and `increment` The `work_amount` indicates how much of the data needs to be processed, it often equals to the dimension's size the loop is working on.
643
643
The `increment` defines how many data entries are processed on every loop iteration (it usually equals to vector size for the innermost loops of elementwise subgraph).
644
644
So if a loop's `work_amount` is not evenly divisible by its `increment`, it means that a tail processing is required.
645
-
`InsertTailLoop` duplicates the body of such a loop, rescales pointer increments and load/store masks appropriately, and injects these `Ops` immediately after the processed loop.
645
+
`InsertSpecificIterations` duplicates the body of such a loop, rescales pointer increments and load/store masks appropriately, and injects these `Ops` immediately after the processed loop.
646
646
3. `CleanupLoopOffsets` "fuses" the finalization offsets of loop with an outer loop's pointer increments and zeroes the offsets before `Result` operations.
647
647
4. `OptimizeLoopSingleEvaluation` moves all pointer arithmetic to finalization offsets in `LoopEnd`, and marks the loops that will be executed only once.
648
648
This information will be used during code emission to eliminate redundant instructions.
649
649
650
-
Please see [assign_registers.cpp](../src/lowered/pass/assign_registers.cpp) and [insert_tail_loop.cpp](../src/lowered/pass/insert_tail_loop.cpp) for more info regarding the main passes in the `Preparation` stage.
650
+
Please see [assign_registers.cpp](../src/lowered/pass/assign_registers.cpp) and [insert_specific_iterations.cpp](../src/lowered/pass/insert_specific_iterations.cpp) for more info regarding the main passes in the `Preparation` stage.
651
651
When the `Preparation` is finished, the `Generator` constructs target-specific emitters by calling `init_emitter(target)` method for every `Expression` in the `LinearIR`, where the `target` is a `TargetMachine` instance.
652
652
653
653
The `TargetMachine` is a class that provides generator with target-specific information, such as supported instruction sets, vector register size etc.
654
654
`TargetMachine` also maps the OpenVINO's `DiscreteTypeInfo` (stored in the `Expression`) to the emitter that actually implements the operation.
655
655
The mapping is done using the `jitters` map defined in [target_machine.hpp](../include/snippets/target_machine.hpp).
656
656
In order for this mechanism to work, every `Snippets'` code generation backend should create emitter implementations derived from the `Emitter` base class defined in [emitter.hpp](../include/snippets/emitter.hpp).
657
-
The backend then should create its own target machine class (derived from the common `TargetMachine`) and populate the `jitters` map, see the [cpu_generator.cpp](../../../plugins/intel_cpu/src/emitters/x64/cpu_generator.cpp) for an implementation example.
657
+
The backend then should create its own target machine class (derived from the common `TargetMachine`) and populate the `jitters` map, see the [cpu_generator.cpp](../../../plugins/intel_cpu/src/emitters/snippets/x64/cpu_generator.cpp) for an implementation example.
658
658
659
659
Note that `init_emitters(...)` only initializes the appropriate emitters, but do not actually emit any code.
660
660
To perform code emission, a `snippets::op::Kernel` operation is constructed (see [generator.cpp](../src/generator.cpp)), its constructor takes the `IR` with all the initialized emitters as an only input argument.
@@ -663,7 +663,7 @@ Finally, the `kernel->emit_code({}, {})` command initiates the code emission.
663
663
Note that the `emit_code(...)` is called only for the `KernelEmitter`, and the emitter is responsible for calling the same method for the rest of the expressions in the `IR` This encapsulation is needed because the `KernelEmitter` performs mapping of the assigned abstract registers to physical registers available on a particular platform.
664
664
Another important function of the `KernelEmitter` is to calculate input/output data offsets based on dimension indices provided in runtime, and to shift corresponding data-handling registers accordingly.
665
665
Keep in mind however, that the required functionality of the `KernelEmitter` depends on how the rest of the emitters are implemented (particularly for `Load`/`Store` `Ops`).
666
-
We've discussed above how the emitters for the `intel_cpu` plugin are implemented (see [jit_snippets_emitters.cpp](../../../plugins/intel_cpu/src/emitters/x64/jit_snippets_emitters.cpp) for more details), but a different backend might require a different approach depending on hardware specifics.
666
+
We've discussed above how the emitters for the `intel_cpu` plugin are implemented (see [jit_snippets_emitters.cpp](../../../plugins/intel_cpu/src/emitters/snippets/x64/jit_snippets_emitters.cpp) for more details), but a different backend might require a different approach depending on hardware specifics.
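Note on the tail-processing rule in the `snippets_design_guide.md` hunk above: a loop needs a tail when its `work_amount` is not evenly divisible by its `increment`. The arithmetic can be sketched as below. `LoopSplit`, `split_loop` and `needs_tail` are hypothetical helpers for illustration, not part of `InsertSpecificIterations`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: how a loop's work divides into full iterations and a tail.
struct LoopSplit {
    std::size_t main_iterations;   // iterations that each process `increment` entries
    std::size_t tail_work_amount;  // leftover entries for the tail section (0 if none)
};

inline LoopSplit split_loop(std::size_t work_amount, std::size_t increment) {
    assert(increment > 0);  // a loop must advance on every iteration
    return {work_amount / increment, work_amount % increment};
}

// A tail is required exactly when work_amount is not evenly divisible by increment.
inline bool needs_tail(std::size_t work_amount, std::size_t increment) {
    return split_loop(work_amount, increment).tail_work_amount != 0;
}
```

For example, `work_amount = 35` with `increment = 16` gives two full iterations and a tail of 3 entries, while `work_amount = 32` needs no tail.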