Matrix Multiply Problem

To start our engines we look at our first problem: 2D matrix multiplication. We will begin with a short introduction to the problem and how to solve it sequentially in C.

Problem

We need to multiply two matrices, A (3x3) and B (3x2). The first thing to verify is that the number of columns of A matches the number of rows of B. The output matrix C will be (MxN), where M is the number of rows of A and N is the number of columns of B.

For each element of C we multiply the elements of the corresponding row of A with the elements of the corresponding column of B and accumulate the products. Describing this in C:

void matrix_2d_mul(float *A, float *B, float *C, int num_rows_A, int num_cols_A, int num_cols_B) {
  int num_cols_C = num_cols_B;
  // Iterate over each row of A
  for (int i = 0; i < num_rows_A; i++) {
    // Iterate over each column of B
    for (int k = 0; k < num_cols_B; k++) {
      float sum = 0;
      // Multiply-add the row of A with the column of B
      for (int j = 0; j < num_cols_A; j++) {
        // A[i][j] == A[i*num_cols_A+j]
        // B[j][k] == B[j*num_cols_B+k]
        // sum += A[i][j]*B[j][k];
        sum += A[i*num_cols_A+j] * B[j*num_cols_B+k];
      }
      // C[i][k] == C[i*num_cols_C+k]
      C[i*num_cols_C+k] = sum;
    }
  }
}

So for each element of C we traverse a complete row of A and a complete column of B, multiplying the elements pairwise and summing the products.

Just remember that in C/C++ matrices are stored in row-major order in memory, in contrast with other languages like Matlab/Fortran. Pay attention to this detail, because you can also get performance penalties due to cache misses. In the end there is a reason why some languages chose column-major order (memory coalescing).
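
As a quick illustration, here is a minimal sketch (the values are arbitrary) of how a 2x3 matrix is laid out linearly in row-major order, and why walking a column is less cache friendly than walking a row:

// Row-major: element (i,j) of a matrix with num_cols columns
// lives at index i*num_cols + j of the linear array.
float A[2*3] = {
  1, 2, 3,   // row 0 -> A[0], A[1], A[2]
  4, 5, 6    // row 1 -> A[3], A[4], A[5]
};
// A[1][2] == A[1*3 + 2] == A[5] == 6
// Walking a row touches contiguous addresses (cache friendly);
// walking a column jumps num_cols elements at a time (cache unfriendly).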

Our complete program takes 140 ms to run.

Our Main loop

Our program is quite simple: its main loop basically copies some inputs (A, B) from somewhere into memory and then calculates the matrix multiplication.
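
The source of the main loop is not reproduced here, but a minimal sketch of what it could look like follows. The 999,999-iteration count matches the call count reported by Callgrind below; the fill_inputs helper is a hypothetical stand-in for "copy some inputs from somewhere":

#include <stdlib.h>

// Sequential multiply defined earlier in this chapter.
void matrix_2d_mul(float *A, float *B, float *C, int num_rows_A, int num_cols_A, int num_cols_B);

// Hypothetical stand-in for "copy some inputs (A, B) from somewhere".
static void fill_inputs(float *A, float *B) {
  for (int i = 0; i < 3*3; i++) A[i] = (float)rand() / RAND_MAX;
  for (int i = 0; i < 3*2; i++) B[i] = (float)rand() / RAND_MAX;
}

int main(void) {
  float A[3*3], B[3*2], C[3*2];
  // Repeat the work many times so the profilers have something to measure.
  for (int iter = 0; iter < 999999; iter++) {
    fill_inputs(A, B);
    matrix_2d_mul(A, B, C, 3, 3, 2);  // (3x3) * (3x2) -> (3x2)
  }
  return 0;
}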

Profiling the sequential algorithm

Considering that your sequential algorithm works, we should start profiling the code. We want to know the following things:

  • What percentage of its time the program spends executing this function

  • How much time, in seconds, the function takes (a minimal manual-timing sketch is shown right after this list)
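
Before reaching for the profilers, note that the second question can also be answered by hand. The sketch below (not part of the original text) wraps the call with clock_gettime and reports the average seconds per call; time_matrix_mul is a hypothetical helper and assumes the matrix_2d_mul above is linked in:

#define _POSIX_C_SOURCE 199309L  // for clock_gettime with strict C standard modes
#include <time.h>

void matrix_2d_mul(float *A, float *B, float *C, int num_rows_A, int num_cols_A, int num_cols_B);

// Average wall-clock time of one matrix_2d_mul call, in seconds.
double time_matrix_mul(float *A, float *B, float *C, int iterations) {
  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);
  for (int i = 0; i < iterations; i++)
    matrix_2d_mul(A, B, C, 3, 3, 2);
  clock_gettime(CLOCK_MONOTONIC, &end);
  double elapsed = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
  return elapsed / iterations;
}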

Using Callgrind

The first question can be answered with a tool called Callgrind (part of the Valgrind suite). Typically you run valgrind --tool=callgrind ./your_program and then inspect the generated callgrind.out file with a viewer such as kcachegrind.

Here is the result annotated against the source code.

Here is a call map relating the relative time spent in each function.

From this we can now make the following observations:

  • Our matrix_2d_mul is called 999,999 times and is responsible for 91.29% of the total program time

  • Inside our loop, the instructions related to memory movement take about 5% of the time

This is a very good indicator that our function needs to be accelerated, but is it possible to do so on the GPU?

Using Gprof

Gprof is an older tool used to measure the time in seconds spent in each function. In order to use gprof you need to recompile your program with the -pg option, run it once to generate the gmon.out profile, and then pass the binary and the gmon.out file to gprof.

Your output would be something like this:

Here we are interested in the time spent in the matrix_2d_mul function, which is approximately 110.53 nanoseconds per call.

This gives us a good reference point for speeding up this function, and also some important constraints.

Using perf

For more accurate results we can use the Linux tool called perf. This tool collects events using the kernel and the hardware performance counters, so it is more precise than gprof.

Running perf stat on our program shows that it executed in 141 ms, which corresponds to 247,027,242 cycles at 1.742 GHz. Now let's get some information about the hot spots; for that we must first record events with perf record and then inspect them with perf report.

This gives basically the same information as Callgrind, but it is more accurate because it is based on actual execution time and cycle counts.

Can we speed up on the GPU?

We mentioned in the previous chapter that even if the GPU took 0 seconds to calculate the matrix multiplication, we could not escape the CPU/GPU transfer latency. Now we know that the time to transfer matrices A and B to the GPU, plus the time to transfer the result matrix back to the CPU, must be less than 110.53 nanoseconds.

In this particular problem matrix A has 36 bytes (3x3 floats), matrix B has 24 bytes (3x2 floats), and the result matrix has 24 bytes. So we need to calculate the time to send 60 bytes and receive 24 bytes.

So consider the following CUDA program (I wrote it in CUDA because it is easier to profile on my Nvidia GPU and also because it needs less code than OpenCL, but the conclusions would be the same).
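
The program itself is not reproduced in this text, so the listing below is only a minimal sketch of what it could look like (the kernel launch configuration and the test values are assumptions for illustration); profiling it with nvprof/nvvp is what produces the transfer timings discussed next:

#include <cuda_runtime.h>
#include <stdio.h>

// Naive kernel: one thread computes one element of C.
__global__ void matrix_2d_mul_kernel(const float *A, const float *B, float *C,
                                     int num_rows_A, int num_cols_A, int num_cols_B) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
  int k = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
  if (i < num_rows_A && k < num_cols_B) {
    float sum = 0.0f;
    for (int j = 0; j < num_cols_A; j++)
      sum += A[i*num_cols_A + j] * B[j*num_cols_B + k];
    C[i*num_cols_B + k] = sum;
  }
}

int main(void) {
  float A[3*3] = {1, 2, 3, 4, 5, 6, 7, 8, 9};  // 36 bytes
  float B[3*2] = {1, 2, 3, 4, 5, 6};           // 24 bytes
  float C[3*2] = {0};                          // 24 bytes

  float *d_A, *d_B, *d_C;
  cudaMalloc((void **)&d_A, sizeof(A));
  cudaMalloc((void **)&d_B, sizeof(B));
  cudaMalloc((void **)&d_C, sizeof(C));

  // Host -> device transfers (the 60 bytes we need to send).
  cudaMemcpy(d_A, A, sizeof(A), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, B, sizeof(B), cudaMemcpyHostToDevice);

  dim3 block(2, 3);  // 2 columns x 3 rows covers the whole 3x2 output
  matrix_2d_mul_kernel<<<1, block>>>(d_A, d_B, d_C, 3, 3, 2);

  // Device -> host transfer (the 24 bytes we need to receive).
  cudaMemcpy(C, d_C, sizeof(C), cudaMemcpyDeviceToHost);

  for (int i = 0; i < 3*2; i++) printf("%f ", C[i]);
  printf("\n");

  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
  return 0;
}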

Now, using the Nvidia profiling tools nvprof and nvvp, we get the following results.

Here we have the following information:

  • It takes 2,100 nanoseconds to send the data

  • It takes 1,184 nanoseconds to receive the data

  • Our cudaMemcpy calls take 52,521 nanoseconds in total to complete

This means that even if the GPU computation itself took 0 seconds, our speedup would be $speedUp = \frac{110}{52521} \approx 0.002$, which means we would end up much slower.

So unless we can play with other variables, such as trading latency for throughput (delivering the results a bit later in order to batch more work), we cannot use the GPU for this problem.

Can we speed up on the CPU?

Another type of speed-up can be achieved by using the available cores of the CPU itself; there we have the advantage that the memory latency is much smaller.

Let's change the matrix multiplication function by adding the OpenMP compiler directive #pragma omp parallel for. This instructs the compiler to distribute the iterations over the rows across threads, as sketched below.
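
The modified function is not shown in the original text; a minimal sketch of what the change could look like is below, with the pragma placed on the outer loop so that the rows of C are distributed across the available cores:

#include <omp.h>

void matrix_2d_mul(float *A, float *B, float *C, int num_rows_A, int num_cols_A, int num_cols_B) {
  // Split the iterations over the rows of C across the CPU cores.
  #pragma omp parallel for
  for (int i = 0; i < num_rows_A; i++) {
    for (int k = 0; k < num_cols_B; k++) {
      float sum = 0;
      for (int j = 0; j < num_cols_A; j++)
        sum += A[i*num_cols_A + j] * B[j*num_cols_B + k];
      C[i*num_cols_B + k] = sum;
    }
  }
}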

You now just need to enable OpenMP at compilation time (with gcc, the -fopenmp flag).

Now, running perf again, we have the following results:

OK, now the total time increased to 2.5 seconds, which is much worse. The issue here is that we waste a lot of time spawning and synchronizing threads, which is again an unavoidable sequential part.

What to do in this case?

If you know for sure that the sequential algorithm is already the smartest way to do the job, you can still play with compiler optimizations (for example, the -O2/-O3 flags).

So we improved from 140 ms to 61 ms, which gives a speedup of $\frac{140}{61} \approx 2.3\times$.

Takeaways from this chapter

In this chapter we saw that we could not speed up our matrix multiplication example by parallelizing it. If you remember the earlier discussion of Amdahl's law, what is happening here is that the work needed to enable parallelization adds sequential time that cannot be removed.

  • If you don't have enough data to process, parallelization is not worth it.

  • Check whether it's possible to trade latency for throughput.

  • The amount of data needed depends on your system's characteristics.

  • Use profiling tools to help you decide; don't assume you know beforehand how much data is needed to make parallelism worthwhile.

  • Also check compiler optimizations.
