To start our engines we look at the first problem: 2D matrix multiplication. We will begin with a short introduction to the problem and how to solve it sequentially in C.
Problem
So we need to multiply two matrices, A (3x3) and B (3x2). The first thing we need to verify is that the number of columns of A matches the number of rows of B. The output matrix C will be (MxN), where M is the number of rows of A and N is the number of columns of B.
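In formula form, each element of C is the dot product of a row of A with a column of B, where K is the number of columns of A (equal to the number of rows of B):

C_{ik} = \sum_{j=0}^{K-1} A_{ij} \, B_{jk}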
So, for each element of C, we multiply each element of the corresponding row of A with each element of the corresponding column of B and accumulate the products. Describing this in C:
void matrix_2d_mul_float(float *A, float *B, float *C, int num_rows_A, int num_cols_A, int num_cols_B) {
    float sum = 0;
    int num_rows_C = num_rows_A;
    int num_cols_C = num_cols_B;
    // Iterate on each row of A
    for (int i = 0; i < num_rows_A; i++) {
        // Iterate on each column of B
        for (int k = 0; k < num_cols_B; k++) {
            sum = 0;
            // Do the multiply-add between a row of A and a column of B
            for (int j = 0; j < num_cols_A; j++) {
                // A[i][j] == A[i*num_cols_A+j]
                // B[j][k] == B[j*num_cols_B+k]
                // sum += A[i][j]*B[j][k];
                sum += A[i*num_cols_A + j] * B[j*num_cols_B + k];
            }
            // C[i][k] == C[i*num_cols_C+k]
            C[i*num_cols_C + k] = sum;
        }
    }
}
So for each element of C we iterate over a complete row of A and a complete column of B, multiplying the corresponding elements and summing them.
Just to remember: in C/C++, matrices are stored in row-major order in memory, in contrast to other languages like Matlab/Fortran. Pay attention to this detail, because you can also get performance penalties due to cache misses. In the end there is a reason why some languages choose column-major order (memory coalescing).
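To make the layout concrete, here is a minimal sketch (the matrix values are just illustrative) showing that the same element (i, j) lives at a different linear offset depending on the storage order:

#include <stdio.h>

int main(void) {
    // The same 2x3 matrix stored in row-major (C/C++) and
    // column-major (Fortran/Matlab) order.
    int rows = 2, cols = 3;
    float row_major[]    = {1, 2, 3,  4, 5, 6};   // rows are contiguous
    float column_major[] = {1, 4,  2, 5,  3, 6};  // columns are contiguous
    int i = 1, j = 2;  // element (1,2), which holds the value 6
    printf("row-major:    %.0f\n", row_major[i * cols + j]);
    printf("column-major: %.0f\n", column_major[j * rows + i]);
    return 0;
}

In the multiplication above, the inner loop walks a row of A contiguously, but jumps through B with a stride of num_cols_B, which is one source of cache misses.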
Our complete program takes 140 ms to run.
Our program is quite simple: its main loop basically copies some inputs (A, B) from somewhere into memory and then calculates the matrix multiplication.
// ****************** Main loop *******************
for (int idxLoop = 1; idxLoop < 1000000; idxLoop++) {
    // Get input to some memory
    memcpy(A, A_ref, numBytesA);
    memcpy(B, B_ref, numBytesB);
    // Call our sequential algorithm
    matrix_2d_mul_float(A, B, C, num_rows_A, num_cols_A, num_cols_B);
}
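For reference, the 140 ms figure can be measured by wrapping this loop with a wall-clock timer. Here is a minimal sketch, assuming matrix_2d_mul_float is the function defined above; the A_ref/B_ref values are only illustrative, since the original inputs are not shown:

#include <stdio.h>
#include <string.h>
#include <time.h>

// Defined earlier in this section; linked together with this file.
void matrix_2d_mul_float(float *A, float *B, float *C,
                         int num_rows_A, int num_cols_A, int num_cols_B);

int main(void) {
    float A_ref[3 * 3] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    float B_ref[3 * 2] = {1, 2, 3, 4, 5, 6};
    float A[3 * 3], B[3 * 2], C[3 * 2];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int idxLoop = 1; idxLoop < 1000000; idxLoop++) {
        memcpy(A, A_ref, sizeof(A));
        memcpy(B, B_ref, sizeof(B));
        matrix_2d_mul_float(A, B, C, 3, 3, 2);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed_ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                        (t1.tv_nsec - t0.tv_nsec) / 1.0e6;
    printf("Elapsed: %.1f ms\n", elapsed_ms);
    return 0;
}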
Profiling the sequential algorithm
So, considering that your sequential algorithm works, we should start profiling your code. We want to know the following things:
What percentage of its time your program spends executing this function
How much time, in seconds, your function takes
Using Callgrind
The first question can be answered using a tool called Callgrind. Here are the instructions to use it:
# Compile your program
g++ -g seq_matrix_mul.c -o seq_matrix_mul
# Execute and get profile information
valgrind --tool=callgrind ./seq_matrix_mul
# Display
kcachegrind callgrind.out.26375
Here is the result related to the source code.
Here is a map relating the relative times of all your functions.
Basically, we can now make the following observations:
Our matrix_2d_mul_float is called 999,999 times and is responsible for 91.29% of the total program time
Inside our loop the instructions related to memory movement take 5% of the time
This is a very good indicator that our function needs to be accelerated. But is it possible to do it on the GPU?
Using Gprof
Gprof is an old tool used to measure the time in seconds spent in each function. In order to use gprof you need to recompile your program with the -pg option:
# Compile your program
g++ -pg seq_matrix_mul.c -o seq_matrix_mul
# Execute program to generate the gmon.out file
./seq_matrix_mul
# Create a text file
gprof ./seq_matrix_mul gmon.out > timeProf.txt
Your output would be something like this:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 92.11      0.11     0.11   999999   110.53   110.53  matrix_2d_mul_float(float*, float*, float*, int, int, int)
  8.37      0.12     0.01                             main
  0.00      0.12     0.00        2     0.00     0.00  displayMatrix2d(float*, int, int)
  0.00      0.12     0.00        1     0.00     0.00  _GLOBAL__sub_I_num_rows_A
  0.00      0.12     0.00        1     0.00     0.00  displayVec1d(float*, int, char*)
  0.00      0.12     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
Here we are interested in the time spent in the matrix_2d_mul_float function, which is approximately 110.53 nanoseconds per call.
This now gives us a good target for how much to speed up this function, and also some important constraints.
Using perf
For more accurate results we can use the Linux tool called perf. This tool collects events using the kernel and the hardware performance counters, so it's more precise than gprof.
Running the program under perf stat (perf stat ./seq_matrix_mul) basically shows that our program executed in 141 ms, which corresponds to 247,027,242 cycles at 1.742 GHz. Now let's get some information about the hot spots. For that we must first record events:
sudo perf record -g ./seq_matrix_mul
sudo perf report -g
This basically gives the same information as Callgrind, but the information is more accurate because it is based on actual execution time and cycle counts.
Can we speed up on the GPU?
We mentioned in the previous chapter that even if the GPU took 0 seconds to calculate the matrix multiplication, we cannot escape the CPU/GPU transfer latency. Now we know that the transfer time of matrices A and B to the GPU, plus the transfer time of the result matrix back to the CPU, must be less than 110.53 nanoseconds.
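In other words, even assuming the GPU compute time itself is zero, the round trip must satisfy:

t_{\text{send}} + t_{\text{receive}} < 110.53\ \text{ns}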
In this particular problem, matrix A takes 36 bytes (3x3 floats, 4 bytes each), matrix B takes 24 bytes (3x2 floats), and the result matrix takes 24 bytes. So we need to calculate the time to send 60 bytes and receive 24 bytes.
So consider the following CUDA program (I wrote it in CUDA because it's easier to profile on my Nvidia GPU and also because it's less code compared to OpenCL, but the results should be similar):
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const unsigned int BytesToSend = 60;
    const unsigned int BytesToReceive = 24;
    // Allocate memory on CPU
    char *hostArray = (char*)malloc(BytesToSend);
    char *deviceArray;
    // Allocate memory on GPU
    cudaMalloc((char**)&deviceArray, BytesToSend);
    memset(hostArray, 0, BytesToSend);
    // Transfer hostArray from CPU to GPU
    cudaMemcpy(deviceArray, hostArray, BytesToSend, cudaMemcpyHostToDevice);
    // Get hostArray from GPU to CPU
    cudaMemcpy(hostArray, deviceArray, BytesToReceive, cudaMemcpyDeviceToHost);
    // Release memory from GPU
    cudaFree(deviceArray);
    free(hostArray);
    return 0;
}
Now, when using the NVIDIA profiling tools nvprof and nvvp, we have the following results:
==18788== NVPROF is profiling process 18788, command: ./checkTransferMatMul
==18788== Profiling application: ./checkTransferMatMul
==18788== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 64.08%  2.1120us         1  2.1120us  2.1120us  2.1120us  [CUDA memcpy DtoH]
 35.92%  1.1840us         1  1.1840us  1.1840us  1.1840us  [CUDA memcpy HtoD]

==18788== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.74%  85.226ms         1  85.226ms  85.226ms  85.226ms  cudaMalloc
  0.92%  795.35us        83  9.5820us     838ns  345.44us  cuDeviceGetAttribute
  0.15%  132.42us         1  132.42us  132.42us  132.42us  cudaFree
  0.06%  54.336us         1  54.336us  54.336us  54.336us  cuDeviceTotalMem
  0.06%  52.521us         2  26.260us  23.607us  28.914us  cudaMemcpy
  0.05%  44.629us         1  44.629us  44.629us  44.629us  cuDeviceGetName
  0.00%  2.5850us         2  1.2920us     978ns  1.6070us  cuDeviceGetCount
  0.00%  2.0260us         2  1.0130us     838ns  1.1880us  cuDeviceGet
Here we have the following information:
It takes 1,184 nanoseconds to send the data to the GPU (memcpy HtoD)
It takes 2,112 nanoseconds to receive the result back (memcpy DtoH)
Our two cudaMemcpy API calls take in total 52,521 nanoseconds to complete
So this means that, unless we can play with other variables (for example, trading latency for throughput by delivering the results some time later), we cannot use the GPU.
Can we speed up on the CPU?
Another type of speed-up can be achieved by using the available cores on the CPU itself; there we have the advantage that the memory latency is much smaller.
Let's change the matrix multiplication function by adding the OpenMP compiler directive #pragma omp parallel for. This will instruct the compiler to run the rows in parallel across threads:
void matrix_2d_mul_float(float *A, float *B, float *C, int num_rows_A, int num_cols_A, int num_cols_B) {
    int num_rows_C = num_rows_A;
    int num_cols_C = num_cols_B;
    // Iterate on each row of A (rows are distributed across threads)
    #pragma omp parallel for
    for (int i = 0; i < num_rows_A; i++) {
        // Iterate on each column of B
        for (int k = 0; k < num_cols_B; k++) {
            // sum is declared inside the parallel loop so each thread has its
            // own copy (a single shared accumulator would be a data race)
            float sum = 0;
            // Do the multiply-add between a row of A and a column of B
            for (int j = 0; j < num_cols_A; j++) {
                // A[i][j] == A[i*num_cols_A+j]
                // B[j][k] == B[j*num_cols_B+k]
                // sum += A[i][j]*B[j][k];
                sum += A[i*num_cols_A + j] * B[j*num_cols_B + k];
            }
            // C[i][k] == C[i*num_cols_C+k]
            C[i*num_cols_C + k] = sum;
        }
    }
}
You now just need to enable OpenMP in the compilation (with g++, add the -fopenmp flag).
OK, now the total time increased to 2.5 seconds, which is much worse. The issue here is that we're wasting a lot of time creating and synchronizing the threads, and that overhead is again an inevitable sequential part.
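To get a feeling for that overhead, here is a rough sketch (my own measurement idea, not part of the original program; numbers will vary per machine) that measures the average cost of just forking and joining a parallel region with as little work as our 3-row loop. Compile it with the -fopenmp flag:

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int reps = 1000000;
    long long acc = 0;
    double start = omp_get_wtime();
    for (int r = 0; r < reps; r++) {
        // A parallel region with almost no work, comparable to our tiny
        // 3x3 by 3x2 product; the reduction keeps the body from being removed.
        #pragma omp parallel for reduction(+ : acc)
        for (int i = 0; i < 3; i++) {
            acc += i;
        }
    }
    double end = omp_get_wtime();
    printf("acc=%lld, average fork/join overhead: %.0f ns\n",
           acc, (end - start) / reps * 1e9);
    return 0;
}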
What to do in this case?
If you know for sure that the sequential algorithm is already the smartest way to do the job, you can still play with compiler optimization (for example, the -O3 flag in g++).
From this chapter we saw that we could not speed up our matrix multiplication example. If you remember the first talk about Amdahl's law, what is happening here is that the work needed to enable parallelization adds sequential time that cannot be removed.
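As a reminder, Amdahl's law bounds the overall speedup by the fraction p of the work that can run in parallel over n workers:

S(n) = \frac{1}{(1 - p) + \frac{p}{n}}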
If you don't have enough data to process, it's not worth parallelizing.
Check if it's possible to exchange latency for throughput.
This amount depends on your system characteristics.
Use profiling tools to help you decide; don't assume that you know beforehand the amount of data necessary to make parallelism worthwhile.
Also check compiler optimization.
Remember that even if the GPU took 0 seconds to compute, the transfer time alone would give a speedup of 110.53/52,521 ≈ 0.002, which means much slower. On the other hand, simply enabling compiler optimization improved our program from 140 ms to 61 ms, a speedup of 140/61 ≈ 2.2x.