Naive matrix multiplication

Definition

Naive matrix multiplication is the direct algorithm for multiplying two matrices: each entry of the product is computed, straight from the definition, as a sum of products of entries of the two factors.

Explicitly, suppose $A=(a_{ij})_{1\leq i\leq m,1\leq j\leq n}$ is an $m\times n$ matrix and $B=(b_{jk})_{1\leq j\leq n,1\leq k\leq p}$ is an $n\times p$ matrix, and denote by $C=(c_{ik})_{1\leq i\leq m,1\leq k\leq p}$ their product $AB$, which is an $m\times p$ matrix. We then have the following formula:

$c_{ik}=\sum _{j=1}^{n}a_{ij}b_{jk}$

In other words, each entry of the product is computed as a sum of $n$ pairwise products.
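
The formula above translates directly into three nested loops. The following is a minimal sketch in Python, assuming matrices are stored as non-empty lists of rows; the function name `naive_matmul` is chosen here for illustration and does not come from any library.

```python
def naive_matmul(A, B):
    """Multiply an m x n matrix A by an n x p matrix B (lists of rows)."""
    m, n, p = len(A), len(B), len(B[0])
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions do not match")
    C = [[0] * p for _ in range(m)]
    for i in range(m):
        for k in range(p):
            # c_ik is the sum over j of a_ij * b_jk
            C[i][k] = sum(A[i][j] * B[j][k] for j in range(n))
    return C
```

For example, `naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`.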

Arithmetic complexity

Complexity for the serial algorithm without parallelization

The total number of multiplications is $mnp$: there are $n$ multiplications required to calculate each entry, and there are $mp$ entries to be calculated.
Assuming that numbers can only be added two at a time, the total number of additions required is $m(n-1)p$: to compute each entry, we need to add $n$ products, which requires $n-1$ additions (we add the first two, then add the third to that sum, and so on). There are $mp$ entries in total, so we get a total addition count of $m(n-1)p$.
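
These counts can be checked by tallying operations in the same loop structure as the serial algorithm. The sketch below assumes $n\geq 1$ and starts each running sum from the first product, so that exactly $n-1$ additions are charged per entry; `count_operations` is a hypothetical helper name used only for this illustration.

```python
def count_operations(m, n, p):
    """Count scalar multiplications and additions in the serial naive algorithm."""
    mults = 0
    adds = 0
    for i in range(m):
        for k in range(p):
            # The first product initializes the running sum: one multiplication, no addition.
            mults += 1
            for j in range(1, n):
                # Each further term costs one multiplication and one addition.
                mults += 1
                adds += 1
    return mults, adds
```

For $m=2$, $n=3$, $p=4$ this gives 24 multiplications and 16 additions, matching $mnp$ and $m(n-1)p$.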

Complexity allowing for parallelization

The arithmetic time complexity allowing parallelization (ignoring communication complexity issues entirely) is $\lceil \log _{2}n\rceil$ rounds of addition, preceded by a single round in which all the multiplications are performed. The reasoning is as follows:

The computations of all the matrix entries can be done in parallel, because the computations for the entries do not depend on each other.
For computing a given entry, all the computations of the products being added can be done in parallel. Combined with the preceding observation, we obtain that all the $mnp$ multiplications can be performed in parallel.
The only thing that cannot be completely parallelized is the addition step. However, this too can be partly parallelized. We can divide the list of terms to be added into two halves, sum each half independently, and then add the two partial sums. By iterating this process of dividing in halves and adding, we can encode the parallelized addition using a binary tree, in which all additions at the same level can be performed simultaneously. The arithmetic time complexity is then given by the depth of the tree, which is $\lceil \log _{2}n\rceil$.
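
The depth of the addition tree can be illustrated by simulating the rounds serially. The sketch below pairs up adjacent terms in each round, a bottom-up variant of the halving described above that yields a tree of the same depth. It assumes a non-empty list of terms, and `tree_sum` is a name introduced here for illustration. The additions within one round are independent of each other, so on an idealized parallel machine each round would take one time step.

```python
def tree_sum(terms):
    """Sum terms in pairwise rounds; return (total, number_of_rounds)."""
    level = list(terms)
    rounds = 0
    while len(level) > 1:
        # Every addition in this round uses disjoint inputs, so all could run in parallel.
        next_level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            # An unpaired last term is carried over to the next round unchanged.
            next_level.append(level[-1])
        level = next_level
        rounds += 1
    return level[0], rounds
```

With five terms the list shrinks from 5 to 3 to 2 to 1, so `tree_sum([1, 2, 3, 4, 5])` returns `(15, 3)`, and $3=\lceil \log _{2}5\rceil$.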

In practice, no matrix multiplication algorithm would be that fast, because of communication complexity issues: it is unrealistic to expect that all the parallel processors will be able to read and write the main data at zero cost.