To start our engines we look at our first problem: 2D matrix multiplication. We begin with a short introduction to the problem and how to solve it sequentially in C.
Problem
We need to multiply two matrices: A (3x3) and B (3x2). The first thing to verify is that the number of columns of A matches the number of rows of B. The output matrix C will be (MxN), where M is the number of rows of A and N the number of columns of B.
For each element of C we take a row of A and a column of B, multiply them element by element and sum the products: C[i][k] = sum over j of A[i][j] * B[j][k]. Describing this in C:
void matrix_2d_mul_float(float *A, float *B, float *C, int num_rows_A, int num_cols_A, int num_cols_B) {
  float sum = 0;
  int num_rows_C = num_rows_A;
  int num_cols_C = num_cols_B;
  // Iterate on each row of A
  for (int i=0; i<num_rows_A; i++) {
    // Iterate on each column of B
    for (int k=0; k<num_cols_B; k++) {
      sum = 0;
      // Do the multiply-add between row i of A and column k of B
      for (int j=0; j<num_cols_A; j++) {
        // A[i][j] == A[i*num_cols_A+j]
        // B[j][k] == B[j*num_cols_B+k]
        //sum += A[i][j]*B[j][k];
        sum += A[i*num_cols_A+j]*B[j*num_cols_B+k];
      }
      // C[i][k] == C[i*num_cols_C+k]
      C[i*num_cols_C+k]=sum;
    }
  }
}
So for each element of C we traverse a complete row of A and a complete column of B, multiplying the elements pairwise and summing the products.
Just to remember: in C/C++ matrices are stored in row-major order in memory, in contrast with languages like Matlab/Fortran, which use column-major order. Pay attention to this detail, because accessing memory against the storage order can also cost you performance due to cache misses. In the end there is a reason why some languages choose column-major order (memory coalescing).
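To make the layout concrete, here is a small illustration (with made-up values) of how the flat index used in the code maps onto a row-major matrix:
// A 2x3 matrix M stored in row-major order:
// M = | 1 2 3 |
//     | 4 5 6 |
float M[2*3] = {1, 2, 3, 4, 5, 6};
// Element M[i][j] lives at the flat index M[i*3 + j] (i*numCols + j in general),
// e.g. M[1][2] == M[1*3 + 2] == M[5] == 6.
// In column-major order (Matlab/Fortran) the same matrix would be
// stored as {1, 4, 2, 5, 3, 6}.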
Our complete program takes about 140 ms to run:
leo@leodev:~/work/learningOpenCl/chap1$ time ./seq_matrix_mul
Size in bytes A: 36
Size in bytes B: 24
Vector A size: 9 { 1.00, 2.00, 3.00, 4.00, 5.00, 6.00, 1.00, 3.00, 2.00,}
Reference result for C
Matrix 3x2
| 30.00 33.00 |
| 87.00 93.00 |
| 35.00 24.00 |
Calculated result for C
Matrix 3x2
| 30.00 33.00 |
| 87.00 93.00 |
| 35.00 34.00 |
real 0m0.143s
user 0m0.140s
sys 0m0.000s
Our Main loop
Our program is quite simple: its main loop basically copies the inputs (A, B) from somewhere into memory and then calculates the matrix multiplication.
// ****************** Main loop *******************
for (int idxLoop=1; idxLoop < 1000000; idxLoop++) {
  // Get input to some memory
  memcpy(A,A_ref,numBytesA);
  memcpy(B,B_ref,numBytesB);
  // Call our sequential algorithm
  matrix_2d_mul_float(A,B,C,num_rows_A,num_cols_A,num_cols_B);
}
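For completeness, here is a minimal sketch of how the driver around this loop could look. It reuses matrix_2d_mul_float from above and the A values printed in the output; the B_ref contents are only placeholders, since the chapter does not show them.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// matrix_2d_mul_float(...) as defined earlier in this chapter

int main() {
  const int num_rows_A = 3, num_cols_A = 3, num_cols_B = 2;
  const int numBytesA = num_rows_A*num_cols_A*sizeof(float); // 36 bytes
  const int numBytesB = num_cols_A*num_cols_B*sizeof(float); // 24 bytes
  const int numBytesC = num_rows_A*num_cols_B*sizeof(float); // 24 bytes

  // Reference inputs: A matches the output shown above,
  // the B values here are placeholders for illustration only
  float A_ref[] = {1, 2, 3, 4, 5, 6, 1, 3, 2};
  float B_ref[] = {1, 2, 3, 4, 5, 6};

  // Working buffers used by the main loop
  float *A = (float*)malloc(numBytesA);
  float *B = (float*)malloc(numBytesB);
  float *C = (float*)malloc(numBytesC);

  // ****************** Main loop *******************
  for (int idxLoop=1; idxLoop < 1000000; idxLoop++) {
    memcpy(A, A_ref, numBytesA);
    memcpy(B, B_ref, numBytesB);
    matrix_2d_mul_float(A, B, C, num_rows_A, num_cols_A, num_cols_B);
  }

  printf("C[0]=%f\n", C[0]);
  free(A); free(B); free(C);
  return 0;
}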
Profiling sequential algorithm
Considering that the sequential algorithm works, we should start profiling the code. We want to know the following things:
What percentage of its time the program spends executing this function
How much time, in seconds, the function takes
Using Callgrind
The first question can be answered using a tool called Callgrind. Here are the instructions to use it:
# Compile your program
g++ -g seq_matrix_mul.c -o seq_matrix_mul
# Execute and get profile information
valgrind --tool=callgrind ./seq_matrix_mul
# Display
kcachegrind callgrind.out.26375
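If you don't have a GUI available, the text tool that ships with Valgrind prints roughly the same information directly in the terminal:
# Text-mode alternative to kcachegrind
callgrind_annotate callgrind.out.26375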
Here is the result related to the source code.
Here is a map relating the relative time of each function.
Basically we can now make the following observations:
Our matrix_2d_mul_float is called 999999 times and is responsible for 91.29% of the total program time
Inside our loop, the instructions related to memory movement take 5% of the time
This is a very good indicator that our function needs to be accelerated, but can it be done on the GPU?
Using Gprof
Gprof is an older tool used to measure the time in seconds spent in each function. In order to use gprof you need to recompile your program with the -pg option:
# Compile your program
g++ -pg seq_matrix_mul.c -o seq_matrix_mul
# Execute program to generate the gmon.out file
./seq_matrix_mul
# Create a text file
gprof ./seq_matrix_mul gmon.out > timeProf.txt
Here we are interested in the time spent in the matrix_2d_mul_float function, which is approximately 110.53 nanoseconds per call (which adds up to roughly 110 ms over the ~10^6 calls of the main loop).
Now this gives us a good baseline for how much we can speed up this function, and also an important constraint: any parallel version has to beat those 110.53 nanoseconds per call.
Using perf
For more accurate results we can use the Linux tool called perf. This tool collects events using the kernel and hardware performance counters, so it is more precise than gprof.
leo@leodev:~/work/learningOpenCl/chap1$ perf stat ./seq_matrix_mul
Size in bytes A: 36
Size in bytes B: 24
Vector A size: 9 { 1.00, 2.00, 3.00, 4.00, 5.00, 6.00, 1.00, 3.00, 2.00,}
Reference result for C
Matrix 3x2
| 30.00 33.00 |
| 87.00 93.00 |
| 35.00 24.00 |
Calculated result for C
Matrix 3x2
| 30.00 33.00 |
| 87.00 93.00 |
| 35.00 34.00 |
Performance counter stats for './seq_matrix_mul':
141.792356 task-clock (msec) # 0.989 CPUs utilized
42 context-switches # 0.296 K/sec
4 cpu-migrations # 0.028 K/sec
55 page-faults # 0.388 K/sec
247,027,242 cycles # 1.742 GHz
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
750,140,592 instructions # 3.04 insns per cycle
63,222,441 branches # 445.880 M/sec
8,935 branch-misses # 0.01% of all branches
0.143309844 seconds time elapsed
This basically means that our program executed in roughly 141 ms, which corresponds to 247,027,242 cycles at 1.742 GHz. Now let's get some information about the hot-spots. For that we must first record events.
sudo perf record -g ./seq_matrix_mul
sudo perf report -g
This basically gives the same information as Callgrind, but the information is more accurate because it is based on real execution time and cycle counts.
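One extra tip (not used in the measurements above): perf stat can repeat the run several times and report the variation between runs, which helps when the timings are noisy.
# Run the program 10 times and report mean and standard deviation
perf stat -r 10 ./seq_matrix_mul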
Can we speed-up on the GPU?
We mentioned in the previous chapter that even if the GPU took 0 seconds to calculate the matrix multiplication, we cannot run away from the CPU/GPU transfer latency. Now we know that the time to transfer matrices A and B to the GPU plus the time to transfer the result matrix back to the CPU must be less than 110.53 nanoseconds.
In this particular problem matrix A has 36 bytes, matrix B has 24 bytes and the result matrix has 24 bytes. So we need to measure the time to send 60 bytes and receive 24 bytes.
So consider the following CUDA program (I wrote it in CUDA because it's easier to profile on my Nvidia GPU and also because it's less code compared to OpenCL, but the transfer times should be the same):
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
  const unsigned int BytesToSend=60;
  const unsigned int BytesToReceive=24;
  // Allocate memory on CPU
  char *hostArray= (char*)malloc(BytesToSend);
  char *deviceArray;
  // Allocate memory on GPU
  cudaMalloc((void**)&deviceArray,BytesToSend);
  memset(hostArray,0,BytesToSend);
  // Transfer hostArray from CPU to GPU
  cudaMemcpy(deviceArray,hostArray,BytesToSend,cudaMemcpyHostToDevice);
  // Get hostArray from GPU to CPU
  cudaMemcpy(hostArray,deviceArray,BytesToReceive,cudaMemcpyDeviceToHost);
  // Release memory from GPU
  cudaFree(deviceArray);
  // Release memory from CPU
  free(hostArray);
  return 0;
}
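Assuming the source file is named checkTransferMatMul.cu (to match the process name in the profiler output below), building and profiling it could look like this:
# Compile with the NVIDIA compiler
nvcc checkTransferMatMul.cu -o checkTransferMatMul
# Collect the profile from the command line
nvprof ./checkTransferMatMul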
Now, using the NVIDIA profiling tools nvprof and nvvp, we have the following results.
==18788== NVPROF is profiling process 18788, command: ./checkTransferMatMul
==18788== Profiling application: ./checkTransferMatMul
==18788== Profiling result:
Time(%) Time Calls Avg Min Max Name
64.08% 2.1120us 1 2.1120us 2.1120us 2.1120us [CUDA memcpy DtoH]
35.92% 1.1840us 1 1.1840us 1.1840us 1.1840us [CUDA memcpy HtoD]
==18788== API calls:
Time(%) Time Calls Avg Min Max Name
98.74% 85.226ms 1 85.226ms 85.226ms 85.226ms cudaMalloc
0.92% 795.35us 83 9.5820us 838ns 345.44us cuDeviceGetAttribute
0.15% 132.42us 1 132.42us 132.42us 132.42us cudaFree
0.06% 54.336us 1 54.336us 54.336us 54.336us cuDeviceTotalMem
0.06% 52.521us 2 26.260us 23.607us 28.914us cudaMemcpy
0.05% 44.629us 1 44.629us 44.629us 44.629us cuDeviceGetName
0.00% 2.5850us 2 1.2920us 978ns 1.6070us cuDeviceGetCount
0.00% 2.0260us 2 1.0130us 838ns 1.1880us cuDeviceGet
Here we have the following information:
It takes about 1184 nanoseconds to send the data (host to device)
It takes about 2112 nanoseconds to receive the data (device to host)
Our two cudaMemcpy calls take in total 52521 nanoseconds to complete (the API overhead dominates the raw copy time)
This means that even if the GPU took 0 seconds to compute, our speed-up would be speedUp = 110 / 52521 ≈ 0.002, which means much slower.
So this means that unless we can play with other variables, for example trading latency for throughput by delivering the results a bit later, we cannot use the GPU for this problem.
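Just to illustrate the latency-versus-throughput idea (this is not part of the measurements above): instead of paying the cudaMemcpy overhead once per tiny matrix, we could accumulate a hypothetical batch of problems on the host and pay it only once per batch. A minimal sketch:
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
  const unsigned int BytesPerProblem = 60;   // one A (36 bytes) + one B (24 bytes)
  const unsigned int BatchSize       = 1000; // hypothetical batch size
  const unsigned int BytesToSend     = BytesPerProblem * BatchSize;

  // Pack many small problems into one host buffer
  char *hostBatch = (char*)malloc(BytesToSend);
  memset(hostBatch, 0, BytesToSend);

  char *deviceBatch;
  cudaMalloc((void**)&deviceBatch, BytesToSend);

  // A single transfer pays the per-call overhead once for the whole
  // batch, instead of once per 60-byte problem
  cudaMemcpy(deviceBatch, hostBatch, BytesToSend, cudaMemcpyHostToDevice);

  // ... one kernel would then process all the matrices of the batch ...
  // ... and all the results would come back in a single cudaMemcpy ...

  cudaFree(deviceBatch);
  free(hostBatch);
  return 0;
}
Of course this only helps if the application can tolerate the extra latency of waiting for a full batch before getting any result.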
Can we speed-up on the CPU?
Another type of speed-up can be achieved by using the available cores on the CPU itself; there we have the advantage that the memory latency will be smaller.
Let's change the matrix multiplication function by adding the OpenMP compiler directive #pragma omp parallel for. This will instruct the compiler to distribute the iterations over the rows across multiple threads.
void matrix_2d_mul_float(float *A, float *B, float *C, int num_rows_A, int num_cols_A, int num_cols_B) {
  int num_rows_C = num_rows_A;
  int num_cols_C = num_cols_B;
  // Iterate on each row of A, distributing the rows across threads
  #pragma omp parallel for
  for(int i=0; i<num_rows_A; i++) {
    // Iterate on each column of B
    for (int k=0; k<num_cols_B; k++) {
      // sum is declared inside the parallel loop so each thread gets
      // its own private copy (a shared sum would be a data race)
      float sum = 0;
      // Do the multiply-add between row i of A and column k of B
      for (int j=0; j<num_cols_A; j++){
        // A[i][j] == A[i*num_cols_A+j]
        // B[j][k] == B[j*num_cols_B+k]
        //sum += A[i][j]*B[j][k];
        sum += A[i*num_cols_A+j]*B[j*num_cols_B+k];
      }
      // C[i][k] == C[i*num_cols_C+k]
      C[i*num_cols_C+k]=sum;
    }
  }
}
You now just need to enable OpenMP when compiling.
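With g++ this means adding the -fopenmp flag, for example:
# Compile with OpenMP support enabled
g++ -g -fopenmp seq_matrix_mul.c -o seq_matrix_mul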
Now, running perf again, we have the following results:
leo@leodev:~/work/learningOpenCl/chap1$ sudo perf stat ./seq_matrix_mul
Size in bytes A: 36
Size in bytes B: 24
Vector A size: 9 { 1.00, 2.00, 3.00, 4.00, 5.00, 6.00, 1.00, 3.00, 2.00,}
Reference result for C
Matrix 3x2
| 30.00 33.00 |
| 87.00 93.00 |
| 35.00 24.00 |
Calculated result for C
Matrix 3x2
| 30.00 33.00 |
| 87.00 93.00 |
| 35.00 34.00 |
Performance counter stats for './seq_matrix_mul':
20012.558926 task-clock (msec) # 7.898 CPUs utilized
4,777 context-switches # 0.239 K/sec
37 cpu-migrations # 0.002 K/sec
84 page-faults # 0.004 K/sec
58,413,342,405 cycles # 2.919 GHz
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
14,470,208,799 instructions # 0.25 insns per cycle
3,757,468,603 branches # 187.756 M/sec
17,796,067 branch-misses # 0.47% of all branches
2.534001217 seconds time elapsed
Ok, now the total time increased to 2.5 seconds, which is much worse. The issue here is that we're wasting a lot of time creating and synchronizing threads for only 3 rows of work, and that overhead is again an inevitable sequential part.
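A quick experiment (not shown in the measurements above) to confirm that the overhead comes from thread management is to limit the number of OpenMP threads and watch how the time changes:
# Run the OpenMP build with only 2 threads
OMP_NUM_THREADS=2 ./seq_matrix_mul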
What to do in this case?
If you know for sure that the sequential algorithm is already the smartest way to do the job, you can still play with the compiler optimization flags.
leo@leodev:~/work/learningOpenCl/chap1$ g++ -g -O3 seq_matrix_mul.c -o seq_matrix_mul
leo@leodev:~/work/learningOpenCl/chap1$ perf stat ./seq_matrix_mul
Size in bytes A: 36
Size in bytes B: 24
Vector A size: 9 { 1.00, 2.00, 3.00, 4.00, 5.00, 6.00, 1.00, 3.00, 2.00,}
Reference result for C
Matrix 3x2
| 30.00 33.00 |
| 87.00 93.00 |
| 35.00 24.00 |
Calculated result for C
Matrix 3x2
| 30.00 33.00 |
| 87.00 93.00 |
| 35.00 34.00 |
Performance counter stats for './seq_matrix_mul':
60.987616 task-clock (msec) # 0.984 CPUs utilized
24 context-switches # 0.394 K/sec
4 cpu-migrations # 0.066 K/sec
56 page-faults # 0.918 K/sec
84,728,846 cycles # 1.389 GHz
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
285,011,535 instructions # 3.36 insns per cycle
51,199,227 branches # 839.502 M/sec
8,497 branch-misses # 0.02% of all branches
0.061996921 seconds time elapsed
So we improved from 140 ms to 61 ms, which gives a speedup = 140/61 ≈ 2.3x.
Take away for the chapter
In this chapter we saw that we could not speed up our matrix multiplication example through parallelization. If you remember the earlier discussion of Amdahl's law, what is happening here is that the work needed to enable parallelization (data transfers, thread management) adds sequential time that cannot be removed.
If you don't have enough data to process, it's not worth parallelizing.
Check whether it's possible to trade latency for throughput.
The amount of data needed depends on your system's characteristics.
Use profiling tools to help you decide; don't assume that you know beforehand the amount of data necessary for parallelism to pay off.