The algorithms are taken from the books:
- David A. Patterson, John L. Hennessy, "Computer Organization and Design: The Hardware/Software Interface, RISC-V Edition",
- David A. Patterson, John L. Hennessy, "Computer Organization and Design: The Hardware/Software Interface, MIPS Edition".
The following implementations are provided:
- Basic, unoptimized: dgemm_basic.cpp
- AVX 256-bit intrinsics: dgemm_avx256.cpp
- AVX 512-bit intrinsics: dgemm_avx512.cpp
- AVX 512-bit intrinsics with loop unrolling: dgemm_unrolled.cpp
- Blocked version, unoptimized: dgemm_basic_blocked.cpp
- Blocked version with AVX 512-bit intrinsics and loop unrolling: dgemm_blocked.cpp
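To make the difference between the basic and the blocked variants concrete, here is a minimal sketch of both. It is not the code from this repository. It follows the well-known textbook formulation, and it assumes column-major storage and square n-by-n matrices. The function names and the `block` parameter are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Basic triple loop computing C += A * B.
// Assumption: column-major storage, so element (i, j) is at index i + j * n.
void dgemm_basic_sketch(std::size_t n, const double* A, const double* B, double* C)
{
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double cij = C[i + j * n];
            for (std::size_t k = 0; k < n; ++k) {
                cij += A[i + k * n] * B[k + j * n];
            }
            C[i + j * n] = cij;
        }
    }
}

// Blocked (tiled) variant: the same arithmetic, but performed tile by tile,
// so that the tiles of A, B and C being worked on can stay in cache.
// `block` is a hypothetical tuning parameter (tile edge length).
void dgemm_blocked_sketch(std::size_t n, std::size_t block,
                          const double* A, const double* B, double* C)
{
    assert(block > 0);  // a zero tile size would never advance the loops
    for (std::size_t sj = 0; sj < n; sj += block) {
        const std::size_t j_end = std::min(sj + block, n);
        for (std::size_t si = 0; si < n; si += block) {
            const std::size_t i_end = std::min(si + block, n);
            for (std::size_t sk = 0; sk < n; sk += block) {
                const std::size_t k_end = std::min(sk + block, n);
                // Multiply one tile of A by one tile of B into one tile of C.
                for (std::size_t i = si; i < i_end; ++i) {
                    for (std::size_t j = sj; j < j_end; ++j) {
                        double cij = C[i + j * n];
                        for (std::size_t k = sk; k < k_end; ++k) {
                            cij += A[i + k * n] * B[k + j * n];
                        }
                        C[i + j * n] = cij;
                    }
                }
            }
        }
    }
}
```

Both functions produce the same result. The blocked one only changes the order in which memory is visited, which is why its advantage is expected to grow once the matrices no longer fit in cache.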
To build the system, execute the following commands:
- git clone https://github.com/romz-pl/matrix-matrix-multiply
- cd matrix-matrix-multiply
- mkdir build
- cd build
- cmake ..
- make
- ./src/dgemm
The last command, ./src/dgemm, runs the program.
- For a Core i7 CPU with matrix size equal to 128, I obtained the following results, averaged over 1000 randomly generated matrices:
dgemm_basic: elapsed-time= 1661
dgemm_basic_blocked: elapsed-time= 1260 speed-up= 1.31
dgemm_avx256: elapsed-time= 443 speed-up= 3.74
dgemm_avx512: elapsed-time= 233 speed-up= 7.12
dgemm_unrolled: elapsed-time= 106 speed-up= 15.66
dgemm_blocked: elapsed-time= 100 speed-up= 16.61
- For a Core i7 CPU with matrix size equal to 640, I obtained the following results, averaged over 10 randomly generated matrices:
dgemm_basic: elapsed-time= 241958
dgemm_basic_blocked: elapsed-time= 162224 speed-up= 1.49
dgemm_avx256: elapsed-time= 66246 speed-up= 3.65
dgemm_avx512: elapsed-time= 35604 speed-up= 6.79
dgemm_unrolled: elapsed-time= 16634 speed-up= 14.54
dgemm_blocked: elapsed-time= 12981 speed-up= 18.63
- For a Core i7 CPU with matrix size equal to 1280, I obtained the following results, averaged over 5 randomly generated matrices:
dgemm_basic: elapsed-time= 4592295
dgemm_basic_blocked: elapsed-time= 1626700 speed-up= 2.82
dgemm_avx256: elapsed-time= 1227037 speed-up= 3.74
dgemm_avx512: elapsed-time= 637091 speed-up= 7.20
dgemm_unrolled: elapsed-time= 558080 speed-up= 8.22
dgemm_blocked: elapsed-time= 181634 speed-up= 25.28
- For a Core i7 CPU with matrix size equal to 2560, I obtained the following results for one randomly generated matrix:
dgemm_basic: elapsed-time= 62731813
dgemm_basic_blocked: elapsed-time= 16474759 speed-up= 3.80
dgemm_avx256: elapsed-time= 17050012 speed-up= 3.67
dgemm_avx512: elapsed-time= 9012450 speed-up= 6.96
dgemm_unrolled: elapsed-time= 5958033 speed-up= 10.52
dgemm_blocked: elapsed-time= 1837494 speed-up= 34.13
- For a Core i7 CPU with matrix size equal to 5120, I obtained the following results for one randomly generated matrix:
dgemm_basic: elapsed-time= 1154120417
dgemm_basic_blocked: elapsed-time= 137582063 speed-up= 8.38
dgemm_avx256: elapsed-time= 297156247 speed-up= 3.88
dgemm_avx512: elapsed-time= 144941094 speed-up= 7.96
dgemm_unrolled: elapsed-time= 97428303 speed-up= 11.84
dgemm_blocked: elapsed-time= 18558107 speed-up= 62.18
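The speed-up values above appear to be the elapsed time of dgemm_basic divided by the elapsed time of each variant (for example, 1661 / 100 = 16.61 for size 128). The sketch below shows one way such numbers could be collected. It is not the repository's benchmark code. The helper names are hypothetical, and the time unit (microseconds) is an assumption, since the output does not state one.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Signature assumed for every kernel under test: computes C += A * B.
using Kernel = std::function<void(std::size_t, const double*, const double*, double*)>;

// Hypothetical helper: average wall-clock time of `kernel` over `repeats`
// runs, each on freshly generated random n-by-n matrices.
// Assumption: the result is reported in microseconds.
double average_elapsed_us(std::size_t n, int repeats, const Kernel& kernel)
{
    assert(repeats > 0);
    std::mt19937_64 gen(12345);  // fixed seed so runs are reproducible
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double total_us = 0.0;
    for (int r = 0; r < repeats; ++r) {
        std::vector<double> A(n * n), B(n * n), C(n * n, 0.0);
        for (double& x : A) { x = dist(gen); }
        for (double& x : B) { x = dist(gen); }
        // Time only the multiplication, not the matrix generation.
        const auto t0 = std::chrono::steady_clock::now();
        kernel(n, A.data(), B.data(), C.data());
        const auto t1 = std::chrono::steady_clock::now();
        total_us += std::chrono::duration<double, std::micro>(t1 - t0).count();
    }
    return total_us / repeats;
}

// Speed-up of an optimized variant relative to the baseline.
double speed_up(double baseline_time, double optimized_time)
{
    return baseline_time / optimized_time;
}
```

A steady clock is used here because it is monotonic, so the measured interval cannot be distorted by system clock adjustments during a long run.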