Matrix Multiplication
Xudong Han, Peiyu Luo, Jinqi Wei, Xingze Li, Chunfeng Hu, and Adam Ghandar*
Abstract—Matrix multiplication is one of the basic algorithms in computer science. Many algorithms for matrix multiplication have been developed in pursuit of faster speed and better quality. In this project, we compare the brute-force algorithm, Strassen's algorithm, and optimized variants of Strassen's algorithm, and develop an adaptive algorithm that chooses the appropriate method according to the matrix size.
Index Terms—matrix multiplication, Strassen’s algorithm, adaptive method, running time
I. INTRODUCTION
Matrix multiplication (MM) is a basic method of data calculation. The natural (or brute-force) matrix multiplication algorithm follows directly from the definition. Its time complexity is O(n^3), and when matrices become large, the running time is so long that the algorithm is hard to use. Researchers therefore tried to find faster and more effective algorithms. A major research direction is to reduce the number of arithmetic multiplications used in MM, rather than to exploit properties such as matrix density.
The earliest optimized algorithm was developed by Volker Strassen in 1969 and is named Strassen's algorithm. The basic idea is to reduce the number of multiplications by replacing some of them with additions and subtractions. Because a matrix multiplication costs far more running time than a matrix addition or subtraction, this trade is effective. The details of Strassen's algorithm are discussed in Section IV.
Later scholars continued to study and develop faster algorithms. Pan's algorithm, proposed in 1981, decreased the time complexity to O(n^2.494). Andrew Stothers (2010) proposed a new algorithm whose time complexity is O(n^2.374). To date, the matrix multiplication algorithm with the lowest known time complexity, O(n^2.3728639), is a refinement of the Coppersmith–Winograd approach by François Le Gall (2014).
In this paper, we focus on Strassen's algorithm and how to optimize it. Three optimized algorithms were developed: a data-processing-optimized Strassen's algorithm, a multithreaded Strassen's algorithm, and an adaptive algorithm. We applied them to matrices ranging from small to large to compare their performance. The crossover point was also measured to guide the algorithm optimization.
II. BACKGROUND
The central role of the conventional matrix multiplication algorithm as a building block in algebra and scientific computation has generated a significant amount of research into techniques for improving its performance. Since 1960, many fast matrix multiplication algorithms have been discovered, but their cost estimates often yield very complex algorithms that are impractical for matrices of reasonable size. The conventional algorithm has order O(n^3), and a way forward is to reduce this order to O(n^α) where α < 3. Strassen's divide-and-conquer matrix multiplication algorithm was developed and found to be of order O(n^lg 7) ≈ O(n^2.81), a significant gain of 0.19 (about 19%) in the exponent. As a consequence, for sufficiently large values of n (in the thousands), Strassen's algorithm runs faster than the conventional algorithm on the one hand, and offers a significant improvement in cost efficiency on the other.

Adaptive Strassen's matrix multiplication combines a novel implementation of Strassen's ideas with the matrix multiplication of the Automatically Tuned Linear Algebra Software (ATLAS) or GotoBLAS. The authors studied the interaction between the effective performance of Strassen's algorithm and the memory hierarchy, demonstrated the correct combination of Strassen and ATLAS/GotoBLAS to take full advantage of Strassen's potential across different architectures, and showed that their approach can achieve up to a 30%/22% improvement compared with using ATLAS/GotoBLAS alone. The implementation logic of the adaptive algorithm is as follows: at the top level, a cache-oblivious algorithm reduces the problem to nearly square matrices; at the middle level, Strassen's algorithm is deployed to reduce the amount of computation; at the lowest level, ATLAS or GotoBLAS is deployed to exploit the architecture.
The results of this article quantify how the interaction between the kernels of an application and the underlying architecture can be highly significant in helping design complex but portable codes. Such metrics can improve algorithm design and can be used as the basis for fully automated methods.
III. THEORETICAL ANALYSIS
Strassen discovered that the original recursive algorithm of complexity O(n^3) can be reorganized in such a way that one computationally expensive recursive matrix multiplication step is replaced with 18 cheaper matrix additions. Strassen's algorithm embodies two different locality properties because its two basic computations exploit different data locality: matrix multiplication has spatial and temporal locality, while matrix addition has only spatial locality. Now suppose the matrix A is of size i × k, the matrix B of size k × j, and the matrix C of size i × j, equal to the product AB. In Strassen's matrix multiplication, A, B, and C are each composed of four sub-matrices.
Considering the operand matrix A with σ(A) = i · k, it is logically composed of four sub-matrices, and (taking the dimensions as even, for simplicity) the operand count of every part can be expressed as

\[ \sigma(A_{11}) = \sigma(A_{12}) = \sigma(A_{21}) = \sigma(A_{22}) = \frac{i}{2}\cdot\frac{k}{2}. \]

Similarly, B is composed of four sub-matrices, and the operand count of every part can be expressed as

\[ \sigma(B_{11}) = \sigma(B_{12}) = \sigma(B_{21}) = \sigma(B_{22}) = \frac{k}{2}\cdot\frac{j}{2}. \]
During Strassen's matrix multiplication, the time of the process for the seven half-size multiplications is

\[ T_\times = 7\,\pi\,\frac{i}{2}\cdot\frac{k}{2}\cdot\frac{j}{2} = \frac{7}{8}\,\pi\,ikj, \tag{1} \]

where π is the efficiency of matrix multiplication (the time per floating-point operation), so that 1/π is simply the floating-point operations per second (FLOPS) of the computation.
And the time of the process for the 22 matrix additions (18 matrix additions and 4 matrix copies: 5 on sub-matrices of A, 5 on sub-matrices of B, and 12 on sub-matrices of C) is

\[ T_+ = \alpha\left(5\,\frac{ik}{4} + 5\,\frac{kj}{4} + 12\,\frac{ij}{4}\right), \tag{2} \]

where α is the corresponding time per element for matrix addition.
Thus, we find that the problem size [i, k, j] at which to yield control to the ATLAS/GotoBLAS algorithm is when Eq. 3 below is satisfied, i.e., when the time saved by the one avoided multiplication no longer exceeds the cost of the extra additions:

\[ \frac{\pi}{8}\,ikj \;\le\; \alpha\,\frac{5ik + 5kj + 12ij}{4}. \tag{3} \]
We assume that π and α are functions of the matrix size only, as we explain in the following. So for a square matrix, i = k = j = n, Eq. 3 becomes (π/8)n³ = α(22/4)n² at the boundary, and we find the crossover point:

\[ n_0 = 44\,\frac{\alpha}{\pi}. \tag{4} \]
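For illustration (our numbers, not measurements): if additions cost the same per element as the multiply kernel's operations (α = π), Eq. 4 predicts a crossover at n₀ = 44; if they are effectively eight times more expensive relative to a highly tuned multiply kernel (α = 8π), it predicts n₀ = 352, the same order as the empirical default of 350 used later in Section V.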
Fig. 1. Theoretical comparison of the complexity of the conventional matrix multiplication algorithm (Θ(n³), green curve) and Strassen's algorithm (Θ(n^lg 7) ≈ Θ(n^2.807), red curve).
IV. METHODOLOGY
The most important method of matrix multiplication is the general matrix product. It is defined only when the number of columns of the first matrix equals the number of rows of the second. If A is an i × k matrix and B is a k × j matrix, then their product AB is an i × j matrix whose elements are obtained as follows:

\[ C[x,y] \;=\; \sum_{z=1}^{k} A[x,z]\,B[z,y], \qquad 1 \le x \le i,\; 1 \le y \le j. \tag{5} \]
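For example, for 2 × 2 matrices (i = k = j = 2):

\[ \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 1\cdot 5 + 2\cdot 7 & 1\cdot 6 + 2\cdot 8 \\ 3\cdot 5 + 4\cdot 7 & 3\cdot 6 + 4\cdot 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}. \]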
For the natural (or brute-force) matrix multiplication, the operation needs three nested for loops, and the time complexity is O(n³). The pseudocode of the naive matrix multiplication is as follows:
SQUARE-MATRIX-MULTIPLY(A, B)
  n = A.rows
  let C be a new n × n matrix
  for i = 1 to n
    for j = 1 to n
      C[i, j] = 0
      for k = 1 to n
        C[i, j] += A[i, k] * B[k, j]
  return C
In 1969, Volker Strassen proposed the first matrix multiplication algorithm whose time complexity is lower than Θ(n³); its complexity is Θ(n^lg 7). As Fig. 1 shows, Strassen's algorithm has a great performance advantage for matrices of large dimension, since it removes a large share of the multiplication work.
We now show the principle of Strassen's matrix multiplication. Suppose that both the matrix A and the matrix B are square matrices of size n × n (with n a power of 2), and we want C = AB. Each matrix is partitioned into four n/2 × n/2 blocks:

\[ A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \qquad C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}. \]

The matrix C can then be obtained by the following formulas:
C11 = A11 · B11 + A12 · B21
C12 = A11 · B12 + A12 · B22
C21 = A21 · B11 + A22 · B21
C22 = A21 · B12 + A22 · B22
From the above expressions, we can conclude that calculating the product of two n × n matrices requires 8 multiplications and 4 additions of n/2 × n/2 matrices. Using T(n) for the time complexity of n × n matrix multiplication, the above decomposition gives

\[ T(n) = 8\,T\!\left(\frac{n}{2}\right) + \Theta(n^2), \tag{6} \]

whose solution is still Θ(n³), so the plain divide-and-conquer gains nothing. Strassen's idea is to get by with only 7 recursive multiplications at the price of extra additions.
The algorithm first partitions the input matrices (which takes Θ(1) time using index calculations) and then creates ten n/2 × n/2 matrices S1, S2, …, S10 (which takes Θ(n²) time), as follows:
S1 = B12 − B22
S2 = A11 + A12
S3 = A21 + A22
S4 = B21 − B11
S5 = A11 + A22
S6 = B11 + B22
S7 = A12 − A22
S8 = B21 + B22
S9 = A11 − A21
S10 = B11 + B12
Then recursively compute the seven matrix products P1, P2, …, P7, where each Pi is an n/2 × n/2 matrix:
P1 = A11 · S1
P2 = S2 · B22
P3 = S3 · B11
P4 = A22 · S4
P5 = S5 · S6
P6 = S7 · S8
P7 = S9 · S10
Then we can calculate the product C in four parts, C11, C12, C21, C22, which takes Θ(n²) time:
C11 = P5 + P4 − P2 + P6
C12 = P1 + P2
C21 = P3 + P4
C22 = P5 + P1 − P3 − P7
From the above process, it is not difficult to derive the recurrence for the time complexity of Strassen's algorithm, Eq. 7:

\[ T(n) = 7\,T\!\left(\frac{n}{2}\right) + \Theta(n^2). \tag{7} \]
We have traded off one matrix multiplication for a constant number of matrix additions. By the master method, Eq. 7 has the solution
T(n) = Θ(n^lg 7). (8)
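As a quick check, the master-method reasoning (standard textbook material, not specific to our implementation) goes as follows:

\[ a = 7,\quad b = 2,\quad f(n) = \Theta(n^2), \qquad n^{\log_b a} = n^{\lg 7} \approx n^{2.807}. \]

Since f(n) = O(n^{\lg 7 - \varepsilon}) for, say, ε = 0.8, case 1 of the master theorem applies and yields T(n) = Θ(n^{\lg 7}).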
The pseudocode of Strassen’s algorithm is shown below.
STRASSEN(A, B)
  n = A.rows
  if n == 1
    return A[1, 1] * B[1, 1]
  let C be a new n × n matrix
  // partition A and B into four n/2 × n/2 sub-matrices
  A[1,1] = A[1..n/2][1..n/2]
  A[1,2] = A[1..n/2][n/2+1..n]
  A[2,1] = A[n/2+1..n][1..n/2]
  A[2,2] = A[n/2+1..n][n/2+1..n]
  B[1,1] = B[1..n/2][1..n/2]
  B[1,2] = B[1..n/2][n/2+1..n]
  B[2,1] = B[n/2+1..n][1..n/2]
  B[2,2] = B[n/2+1..n][n/2+1..n]
  // ten matrix additions/subtractions, Θ(n²) each
  S[1] = B[1,2] - B[2,2]
  S[2] = A[1,1] + A[1,2]
  S[3] = A[2,1] + A[2,2]
  S[4] = B[2,1] - B[1,1]
  S[5] = A[1,1] + A[2,2]
  S[6] = B[1,1] + B[2,2]
  S[7] = A[1,2] - A[2,2]
  S[8] = B[2,1] + B[2,2]
  S[9] = A[1,1] - A[2,1]
  S[10] = B[1,1] + B[1,2]
  // seven recursive products
  P[1] = STRASSEN(A[1,1], S[1])
  P[2] = STRASSEN(S[2], B[2,2])
  P[3] = STRASSEN(S[3], B[1,1])
  P[4] = STRASSEN(A[2,2], S[4])
  P[5] = STRASSEN(S[5], S[6])
  P[6] = STRASSEN(S[7], S[8])
  P[7] = STRASSEN(S[9], S[10])
  // combine into the four quadrants of C
  C[1..n/2][1..n/2] = P[5] + P[4] - P[2] + P[6]
  C[1..n/2][n/2+1..n] = P[1] + P[2]
  C[n/2+1..n][1..n/2] = P[3] + P[4]
  C[n/2+1..n][n/2+1..n] = P[5] + P[1] - P[3] - P[7]
  return C
V. EXPERIMENT DESIGN
In this section, we introduce each part of our work to implement matrix multiplication, presented in the order of the modules within the code.
A. Class Matrix
The class named Matrix represents a matrix in “row-major” order, with the parameters “rows” and “columns”. Some basic operations of the matrix are also implemented in Matrix, as listed below (a skeleton sketch follows the list).
• “getValueAt” and “setValueAt”: are to get and set the element value at the specified position.
• “getMatrixAt”: is to obtain the small matrix of the specified area in the matrix.
• “partitionMatrix”: is to divide the matrix into four pieces.
• “mergeMatrix”: is to merge the four small matrices into one large matrix.
• “plus” and “minus”: are to add and subtract matrices.
• “toString” and “show”: are to convert the matrix to a string (and print it on the screen).
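A minimal sketch of such a class is shown below. The method names come from the list above; the flat row-major double[] storage and the accessor details are our assumptions, not the project's actual code.

public class Matrix {
    private final int rows;
    private final int columns;
    private final double[] data; // row-major: element (r, c) at index r * columns + c

    public Matrix(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
        this.data = new double[rows * columns];
    }

    public int getRows() { return rows; }
    public int getColumns() { return columns; }

    public double getValueAt(int r, int c) { return data[r * columns + c]; }
    public void setValueAt(int r, int c, double v) { data[r * columns + c] = v; }

    // Element-wise sum; assumes the two matrices have equal dimensions.
    public Matrix plus(Matrix other) {
        Matrix result = new Matrix(rows, columns);
        for (int i = 0; i < data.length; i++)
            result.data[i] = this.data[i] + other.data[i];
        return result;
    }

    // Element-wise difference; assumes equal dimensions.
    public Matrix minus(Matrix other) {
        Matrix result = new Matrix(rows, columns);
        for (int i = 0; i < data.length; i++)
            result.data[i] = this.data[i] - other.data[i];
        return result;
    }
}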
B. Square Matrix Multiply
This method uses the natural (or brute-force) mathematical approach to calculate the value of each element of the matrix using three nested for loops. The pseudocode was shown in Section IV and is quite easy to understand.
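A direct Java transcription of that pseudocode, written against the hypothetical Matrix sketch above:

// Brute-force O(n^3) product of two n × n matrices.
public static Matrix squareMatrixMultiply(Matrix a, Matrix b, int n) {
    Matrix c = new Matrix(n, n);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a.getValueAt(i, k) * b.getValueAt(k, j); // dot product of row i and column j
            }
            c.setValueAt(i, j, sum);
        }
    }
    return c;
}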
C. Strassen matrix multiply
Strassen's algorithm is originally designed for even-order matrices, but adapting it to matrices of any order is important. To make the algorithm handle square matrices of any size, the matrix is divided into blocks as follows: an even-order matrix divides naturally into four equal blocks, while an odd-order matrix is divided using blocks of order "half of the odd order plus one". With this recursion, a matrix of any order can be divided and operated on. All matrices in the operation are represented as Matrix objects.
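One way to realize the "half of the odd order plus one" split is to copy each quadrant into a zero-padded ⌈n/2⌉ × ⌈n/2⌉ block before recursing. This is our sketch of the idea, not the project's partitionMatrix code:

// Copy the sub-block of src starting at (rowOff, colOff) into a
// zero-padded h × h matrix, where h = (n + 1) / 2 = ceil(n / 2).
// Entries beyond the edge of src stay zero, so odd-order operands
// can be fed into the even-order Strassen step unchanged.
public static Matrix paddedBlock(Matrix src, int rowOff, int colOff, int h) {
    Matrix block = new Matrix(h, h); // new arrays are zero-initialized in Java
    for (int r = 0; r < h; r++) {
        for (int c = 0; c < h; c++) {
            int sr = rowOff + r, sc = colOff + c;
            if (sr < src.getRows() && sc < src.getColumns()) {
                block.setValueAt(r, c, src.getValueAt(sr, sc));
            }
        }
    }
    return block;
}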
To optimize the running time of partitioning and merging matrices, and to avoid the large space and time cost of recursively creating and returning new objects, we changed the function-calling convention of the matrix multiplication. The original functions returned each result as a new Matrix object. The new functions operate directly on the one-dimensional array backing the matrix, avoiding repeated disassembly and reassembly of the data: the array that stores the result matrix is passed into the function as an input parameter, and its values are modified in place inside the arithmetic function. This is equivalent to passing a pointer to the result, which removes the cost of creating intermediate objects. In addition, for chains of additions and subtractions in the matrix formulas, we wrote special fused functions for continuous addition and subtraction instead of repeatedly calling the single add or subtract function, which reduces function-call overhead and saves the space for storing intermediate results.
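A sketch of this fused in-place style (names and signatures are ours, not the project's actual code): the destination array is passed in and written directly, and a single routine evaluates a signed chain such as P5 + P4 − P2 + P6 in one pass over the data, with no temporary Matrix objects.

// out[i] = signs[0]*terms[0][i] + signs[1]*terms[1][i] + ... in one pass.
public static void fusedAddSub(double[] out, double[][] terms, int[] signs) {
    for (int i = 0; i < out.length; i++) {
        double acc = 0.0;
        for (int t = 0; t < terms.length; t++) {
            acc += signs[t] * terms[t][i];
        }
        out[i] = acc; // written straight into the caller's result array
    }
}

// Example: C11 = P5 + P4 - P2 + P6 without intermediate objects:
// fusedAddSub(c11, new double[][]{p5, p4, p2, p6}, new int[]{1, 1, -1, 1});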
Another optimization idea is to use a parallel algorithm. We used multithreading to optimize the code: the intermediate matrices in Strassen's algorithm are computed in parallel, which greatly reduces the time consumed by that stage.
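A minimal version of this idea using java.util.concurrent (the project's actual threading code may be structured differently; strassen(...) stands in for the sequential routine):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Submit the seven Strassen products P1..P7 to a thread pool and wait
// for all of them; pairs[i] holds the two operands of product i.
public static Matrix[] parallelProducts(Matrix[][] pairs) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
    try {
        List<Future<Matrix>> futures = new ArrayList<>();
        for (Matrix[] pair : pairs) {
            final Matrix left = pair[0], right = pair[1];
            futures.add(pool.submit(() -> strassen(left, right)));
        }
        Matrix[] p = new Matrix[pairs.length];
        for (int i = 0; i < p.length; i++) {
            p[i] = futures.get(i).get(); // blocks until product i is done
        }
        return p;
    } finally {
        pool.shutdown();
    }
}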
Of course, both the optimized function-calling convention and the multithreading are modifications built on top of the standard matrix multiplication implemented with the Matrix object. Controlled variables are used in Section VI to examine the effect of each optimization method on running speed.
D. Crossover Point
To find the effective crossover point between the running times of Strassen's algorithm and the brute-force algorithm, we tested and continuously recorded ten consecutive sets of running-time data. When the running time of Strassen's algorithm is larger than that of the brute-force algorithm for the first five data points and smaller for the last five, the middle point is taken as the crossover point. This avoids jitter caused by fluctuations in the operating environment. If the data fluctuates severely under special circumstances, a fallback rule applies: when the running time of Strassen's algorithm is smaller than that of the brute-force algorithm in ten consecutive sets of data, the starting point is taken as the crossover point. The upper limit of the crossover-finding test is 4098. If no crossover is found before the upper limit is reached, the default value of 350 is used, an empirical value based on our experiments.
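The detection rule can be sketched as follows. We phrase it over pre-measured running times (the signature is our choice; the project measures and decides inline):

// Scan a sliding 10-point window over measured times: Strassen slower
// on the first 5 sizes and faster on the last 5 means the middle size
// is the crossover; 10 consecutive "faster" points trigger the
// fallback rule for noisy data.
public static int findCrossoverPoint(int[] sizes, long[] strassenMs, long[] bruteMs) {
    final int WINDOW = 10, DEFAULT_POINT = 350; // empirical default
    for (int s = 0; s + WINDOW <= sizes.length; s++) {
        boolean firstFiveSlower = true, lastFiveFaster = true, allFaster = true;
        for (int i = 0; i < WINDOW; i++) {
            boolean faster = strassenMs[s + i] < bruteMs[s + i];
            if (i < 5) firstFiveSlower &= !faster;
            else lastFiveFaster &= faster;
            allFaster &= faster;
        }
        if (firstFiveSlower && lastFiveFaster) return sizes[s + WINDOW / 2];
        if (allFaster) return sizes[s]; // severe fluctuation: take the start
    }
    return DEFAULT_POINT; // nothing found up to the test limit (4098)
}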
E. Generate Random Matrix
To test and verify the matrix multiplication algorithms, there must be a method to generate random matrices. To produce a random number in the interval [−1, 1], we first generated an integer in the interval [0, 201), then divided the result by 100, and randomly attached a sign to it. The class DecimalFormat is used to keep exactly two decimal places, eliminating the excess decimal digits introduced by the sign handling. The generated random numbers are filled into the matrix element by element to produce a random matrix of any size for testing.
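Our reading of that procedure as a Java sketch (note that an integer from [0, 201) divided by 100 spans [0, 2.00], so the exact scaling in the project code may differ):

import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;
import java.util.Random;

// Fill an n × n matrix with signed random values rounded to two
// decimal places via DecimalFormat, as described above.
public static Matrix randomMatrix(int n) {
    Random rng = new Random();
    DecimalFormat twoPlaces =
            new DecimalFormat("0.00", DecimalFormatSymbols.getInstance(Locale.US));
    Matrix m = new Matrix(n, n);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double value = rng.nextInt(201) / 100.0; // integer in [0, 201) scaled down
            if (rng.nextBoolean()) value = -value;   // random sign
            m.setValueAt(i, j, Double.parseDouble(twoPlaces.format(value)));
        }
    }
    return m;
}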
VI. EMPIRICAL ANALYSIS
First of all, we compared the running times of the brute-force algorithm and Strassen's algorithm without any optimization. The original Strassen's algorithm always took longer than the brute-force algorithm, regardless of the matrix size, mainly because of the time taken to split the matrix (Fig. 2). Therefore, optimization of Strassen's algorithm is necessary. We developed the data-processing-optimized Strassen's algorithm, the multithreaded Strassen's algorithm, and the adaptive algorithm.
We now discuss the brute-force algorithm and the four optimized algorithms and compare their running times. They are described in detail below.
Fig. 2. Comparison between the brute-force algorithm and the original Strassen's algorithm.

• Brute-force algorithm: the Θ(n³) matrix multiplication algorithm. This is the simple and natural method that follows from the definition. Its pseudocode was shown earlier as SQUARE-MATRIX-MULTIPLY(A, B).
• Data-processing-optimized Strassen's algorithm: based on Strassen's algorithm as described above, with the data handling improved by changing the function-calling convention from data copying to pointer passing, which makes results easier to store and modify and increases speed. The pseudocode is the same as STRASSEN(A, B) above.
• Multithreaded Strassen's algorithm: Strassen's algorithm with parallel computation applied, which greatly improves processing efficiency. Here we only use nested parallelism, and the parallelism of the multithreaded Strassen's algorithm is Θ(n^lg 7 / lg² n). The pseudocode is shown below as P-STRASSEN(A, B).
P-STRASSEN(A, B)
  n = A.rows
  if n == 1
    return A[1, 1] * B[1, 1]
  let C be a new n × n matrix
  parallel  // partition A and B into four n/2 × n/2 sub-matrices
    A[1,1] = A[1..n/2][1..n/2]
    A[1,2] = A[1..n/2][n/2+1..n]
    A[2,1] = A[n/2+1..n][1..n/2]
    A[2,2] = A[n/2+1..n][n/2+1..n]
    B[1,1] = B[1..n/2][1..n/2]
    B[1,2] = B[1..n/2][n/2+1..n]
    B[2,1] = B[n/2+1..n][1..n/2]
    B[2,2] = B[n/2+1..n][n/2+1..n]
  parallel  // ten matrix additions/subtractions
    S[1] = B[1,2] - B[2,2]
    S[2] = A[1,1] + A[1,2]
    S[3] = A[2,1] + A[2,2]
    S[4] = B[2,1] - B[1,1]
    S[5] = A[1,1] + A[2,2]
    S[6] = B[1,1] + B[2,2]
    S[7] = A[1,2] - A[2,2]
    S[8] = B[2,1] + B[2,2]
    S[9] = A[1,1] - A[2,1]
    S[10] = B[1,1] + B[1,2]
  parallel  // seven recursive products, computed concurrently
    P[1] = P-STRASSEN(A[1,1], S[1])
    P[2] = P-STRASSEN(S[2], B[2,2])
    P[3] = P-STRASSEN(S[3], B[1,1])
    P[4] = P-STRASSEN(A[2,2], S[4])
    P[5] = P-STRASSEN(S[5], S[6])
    P[6] = P-STRASSEN(S[7], S[8])
    P[7] = P-STRASSEN(S[9], S[10])
  parallel  // combine into the four quadrants of C
    C[1..n/2][1..n/2] = P[5] + P[4] - P[2] + P[6]
    C[1..n/2][n/2+1..n] = P[1] + P[2]
    C[n/2+1..n][1..n/2] = P[3] + P[4]
    C[n/2+1..n][n/2+1..n] = P[5] + P[1] - P[3] - P[7]
  return C
• Adaptive 16 algorithm: a hybrid algorithm that uses SQUARE-MATRIX-MULTIPLY(A, B) when the size is at most 16 and the Strassen decomposition when the size is above 16. The pseudocode is shown below as ADAPTIVE-MATRIX-MULTIPLY(A, B).
ADAPTIVE-MATRIX-MULTIPLY(A, B)
  n = A.rows
  if n <= 16
    return SQUARE-MATRIX-MULTIPLY(A, B)  // also covers the n == 1 base case
  let C be a new n × n matrix
  A[1,1] = A[1..n/2][1..n/2]
  A[1,2] = A[1..n/2][n/2+1..n]
  A[2,1] = A[n/2+1..n][1..n/2]
  A[2,2] = A[n/2+1..n][n/2+1..n]
  B[1,1] = B[1..n/2][1..n/2]
  B[1,2] = B[1..n/2][n/2+1..n]
  B[2,1] = B[n/2+1..n][1..n/2]
  B[2,2] = B[n/2+1..n][n/2+1..n]
  S[1] = B[1,2] - B[2,2]
  S[2] = A[1,1] + A[1,2]
  S[3] = A[2,1] + A[2,2]
  S[4] = B[2,1] - B[1,1]
  S[5] = A[1,1] + A[2,2]
  S[6] = B[1,1] + B[2,2]
  S[7] = A[1,2] - A[2,2]
  S[8] = B[2,1] + B[2,2]
  S[9] = A[1,1] - A[2,1]
  S[10] = B[1,1] + B[1,2]
  // recursive calls re-enter the adaptive routine, so small
  // sub-problems fall back to the brute-force method
  P[1] = ADAPTIVE-MATRIX-MULTIPLY(A[1,1], S[1])
  P[2] = ADAPTIVE-MATRIX-MULTIPLY(S[2], B[2,2])
  P[3] = ADAPTIVE-MATRIX-MULTIPLY(S[3], B[1,1])
  P[4] = ADAPTIVE-MATRIX-MULTIPLY(A[2,2], S[4])
  P[5] = ADAPTIVE-MATRIX-MULTIPLY(S[5], S[6])
  P[6] = ADAPTIVE-MATRIX-MULTIPLY(S[7], S[8])
  P[7] = ADAPTIVE-MATRIX-MULTIPLY(S[9], S[10])
  C[1..n/2][1..n/2] = P[5] + P[4] - P[2] + P[6]
  C[1..n/2][n/2+1..n] = P[1] + P[2]
  C[n/2+1..n][1..n/2] = P[3] + P[4]
  C[n/2+1..n][n/2+1..n] = P[5] + P[1] - P[3] - P[7]
  return C

Fig. 3. Comparison between the brute-force algorithm (Brute-force), the data-processing-optimized Strassen's algorithm (DP-Optimized Strassen), the multithreaded Strassen's algorithm (Multithreaded Strassen), the adaptive 16 algorithm (Adaptive 16), the adaptive 128 algorithm (Adaptive 128), and two fitted curves: one for the brute-force algorithm (Fitted Curve n^3) and one for the optimized Strassen's algorithm (Fitted Curve n^2.81).
• Adaptive 128 algorithm: a hybrid algorithm that uses SQUARE-MATRIX-MULTIPLY(A, B) when the size is at most 128 and the Strassen decomposition when the size is above 128. The pseudocode closely parallels the adaptive 16 algorithm and is not repeated; a compact Java sketch of the dispatch pattern follows.
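The following sketch shows the threshold dispatch common to both adaptive variants. The signature and the strassenStep parameter are our illustration (the project code dispatches inline), built on the hypothetical Matrix and squareMatrixMultiply sketches from Section V:

import java.util.function.BinaryOperator;

// Threshold dispatch of the adaptive algorithm: brute force below the
// threshold, otherwise one Strassen step whose seven recursive products
// are expected to call back into adaptiveMultiply. Sketch only.
public static Matrix adaptiveMultiply(Matrix a, Matrix b, int threshold,
                                      BinaryOperator<Matrix> strassenStep) {
    int n = a.getRows();
    if (n <= threshold) {
        return squareMatrixMultiply(a, b, n); // small sizes: direct method
    }
    return strassenStep.apply(a, b);          // large sizes: Strassen step
}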
We applied all five algorithms to matrices with sizes from 2 × 2 to 2081 × 2081. The testing platform is an Intel Core i7-7500U @ 2.70 GHz with 16 GB of RAM. We recorded all running times and plotted the data as a line graph. From Fig. 3 we obtain the following results.
a) Implementation of optimization: The data-processing-optimized Strassen's algorithm becomes faster than the brute-force algorithm after n = 1793. The multithreaded Strassen's algorithm becomes faster than the brute-force algorithm after n = 353. And by design, the adaptive algorithm is never slower than the brute-force algorithm. So the optimization of the matrix multiplication algorithm was successfully implemented.
TABLE I
CROSSOVER POINT EXPERIMENT FOR THE DATA-PROCESSING-OPTIMIZED STRASSEN'S ALGORITHM AND THE MULTITHREADED STRASSEN'S ALGORITHM AGAINST THE BRUTE-FORCE ALGORITHM

Serial Number   DP-Optimized   Multithreaded
1               1780           350
2               1791           332
3               1776           330
4               1785           345
5               1788           343
Average         1784           340
b) Advantage of the adaptive algorithm: The adaptive algorithms run faster than the purely Strassen-based variants. This is due to the fact that Strassen's algorithm is no longer used for small-size matrix multiplication; instead, the brute-force algorithm, with fewer steps and more direct operations, is adopted in that regime, which speeds up all small-size matrix multiplications and thus yields a faster overall running speed.
c) Fitting accuracy: The optimized Strassen's algorithm is essentially consistent with the curve n^2.81, and the brute-force algorithm with the curve n^3. This confirms the time-complexity analysis in the earlier sections.
d) Step effect: Discontinuities appear in the running time of Strassen's algorithm. When the order of the matrix lies between 2^m and 2^(m+1), the recursive partitioning behaves as if the order were 2^(m+1), so the time consumed by matrix partitioning jumps at powers of two, producing a step effect.
For the crossover point, we conducted an experiment to find the point between the data-processing-optimized Strassen's algorithm and the brute-force algorithm, and between the multithreaded Strassen's algorithm and the brute-force algorithm, respectively. The results (Table I) show that the crossover point is stable within a certain interval. Of course, based on the theoretical analysis in Section III, the exact value still depends on the testing platform.
VII. CONCLUSION