Bigger Matrix Multiply problem
Now we have to multiply two matrices, A[2000,2000] and B[2000,600]. The result will be a matrix C[2000,600]. In bytes (4 bytes per float) we have:
Matrix A: 2000 x 2000 x 4 bytes = 16 MB
Matrix B/C: 2000 x 600 x 4 bytes = 4.8 MB
Basically, our code multiplies those matrices 10 times.
/*
    Now we make the matrix much bigger
    g++ -pg seq_matrix_big_mul.c -o seq_matrix_big_mul
*/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

int num_rows_A = 2000; int num_rows_B = 2000; int num_rows_C = 2000;
int num_cols_A = 2000; int num_cols_B = 600; int num_cols_C = 600;
// I'm forcing a malloc because I want to include the malloc time in the measurements
float *A = (float*) malloc(sizeof(float) * num_rows_A * num_cols_A);
float *B = (float*) malloc(sizeof(float) * num_rows_B * num_cols_B);
float *C = (float*) malloc(sizeof(float) * num_rows_C * num_cols_C);
void matrix_2d_mul_float(float *A, float *B, float *C, int num_rows_A, int num_cols_A, int num_cols_B) {
    float sum = 0;
    int num_rows_C = num_rows_A;
    int num_cols_C = num_cols_B;
    // Iterate on each row of A
    //#pragma omp parallel for schedule(dynamic,1) collapse(2)
    for (int i = 0; i < num_rows_A; i++) {
        // Iterate on each column of B
        for (int k = 0; k < num_cols_B; k++) {
            sum = 0;
            // Do the multiply-add between a row of A and a column of B
            for (int j = 0; j < num_cols_A; j++) {
                // A[i][j] == A[i*num_cols_A+j]
                // B[j][k] == B[j*num_cols_B+k]
                //sum += A[i][j]*B[j][k];
                sum += A[i*num_cols_A+j] * B[j*num_cols_B+k];
            }
            // C[i][k] == C[i*num_cols_C+k]
            C[i*num_cols_C+k] = sum;
        }
    }
}
void fillRand(float *vec, int minValue, int maxValue, int sizeVec) {
    srand(time(NULL));
    for (int idx = 0; idx < sizeVec; idx++) {
        vec[idx] = rand() % maxValue + minValue;
    }
}
int main() {
    // Get size in bytes for our vectors
    int numBytesA = sizeof(float) * num_rows_A * num_cols_A;
    int numBytesB = sizeof(float) * num_rows_B * num_cols_B;
    printf("Size in bytes A: %d\n", numBytesA);
    printf("Size in bytes B: %d\n", numBytesB);
    // Fill arrays
    fillRand(A, 1, 100, num_rows_A * num_cols_A);
    fillRand(B, 1, 100, num_rows_B * num_cols_B);
    // Call sequential function
    //ProfilerStart("nameOfProfile.log");
    for (int idxLoop = 0; idxLoop < 10; idxLoop++) {
        // Multiply the matrices allocated on the heap
        matrix_2d_mul_float(A, B, C, num_rows_A, num_cols_A, num_cols_B);
        printf("Matrix multiplication done %d\n", idxLoop);
    }
    // Free memory
    free(A); free(B); free(C);
    return 0;
}

Executing sequentially
Executing this sequentially is slow: around 30 seconds to complete the 10 matrix multiplications.
Taking measurements with gprof, each call takes 9.5 seconds.
Executing on our CPU cores (7 cores)
Now we change our matrix_2d_mul_float to run the outer loops in parallel with #pragma omp parallel for schedule(dynamic,1) collapse(2). Now we have a 2x speed-up. This is cool, but could we do better on the GPU?
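As a reference, here is a minimal sketch of how the parallel version of matrix_2d_mul_float could look (the _omp suffix is just for illustration). Note that sum must be declared inside the loops so each thread gets its own private copy, and the file has to be compiled with g++ -fopenmp:

// Sketch of the OpenMP version: the (i,k) iteration space is collapsed and
// distributed dynamically over the available threads.
void matrix_2d_mul_float_omp(float *A, float *B, float *C,
                             int num_rows_A, int num_cols_A, int num_cols_B) {
    #pragma omp parallel for schedule(dynamic,1) collapse(2)
    for (int i = 0; i < num_rows_A; i++) {
        for (int k = 0; k < num_cols_B; k++) {
            float sum = 0;   // private to each thread
            for (int j = 0; j < num_cols_A; j++) {
                sum += A[i*num_cols_A+j] * B[j*num_cols_B+k];
            }
            C[i*num_cols_B+k] = sum;
        }
    }
}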
First check the Transfer time
Basically we need to know how long it takes to send 20.8 MB (matrices A and B) to the GPU and to receive the 4.8 MB result matrix back.
From the results, the whole transfer takes 8 ms on a system with a Quadro GPU; on a system with a Titan X (Maxwell) it takes 2.6 ms. So if our GPU computes each matrix multiplication in less than 9 seconds (the sequential time), we have a speed-up.
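As a reference, a minimal sketch of how this transfer time can be measured with CUDA events (the device pointer names and numBytesC are illustrative; the fragment lives in the host code of a .cu file, reusing A, B, C and numBytesA/numBytesB from the listing above):

// Allocate device buffers and time the host<->device copies with CUDA events.
int numBytesC = sizeof(float) * num_rows_C * num_cols_C;
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, numBytesA);
cudaMalloc(&d_B, numBytesB);
cudaMalloc(&d_C, numBytesC);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(d_A, A, numBytesA, cudaMemcpyHostToDevice);  // send A (16 MB)
cudaMemcpy(d_B, B, numBytesB, cudaMemcpyHostToDevice);  // send B (4.8 MB)
cudaMemcpy(C, d_C, numBytesC, cudaMemcpyDeviceToHost);  // receive C (4.8 MB)
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float transfer_ms = 0;
cudaEventElapsedTime(&transfer_ms, start, stop);
printf("Transfer time: %f ms\n", transfer_ms);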
Naive way on the GPU (using global memory)
So we have less than 9 seconds to multiply the matrices A[2000,2000] and B[2000,600] on the GPU. Let's start with the naive implementation. First, let's take another look at the CPU code.
You can observe that there is no communication between the loop iterations, so each GPU thread would execute something like this:
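A sketch of the work of a single thread, computing one C(i,k) entry (i and k would come from the block and thread indices):

float sum = 0;
for (int j = 0; j < num_cols_A; j++) {
    sum += A[i*num_cols_A+j] * B[j*num_cols_B+k];
}
C[i*num_cols_B+k] = sum;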
As there is no communication between the threads, the only issue is that each thread reads global memory roughly 2*num_cols_A times: once for each element of a row of A, once for each element of a column of B, plus one write to a position of C. The problem is that accessing global memory is expensive in terms of time, especially when multiple threads access it at the same time; the resulting congestion makes the kernel slower.
CUDA
In this case each thread creates one entry of the matrix C(i,k) by reading a row from A (A(i,j)) and a column from B (B(j,k)).
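A minimal sketch of how the naive kernel and its launch could look (the kernel name, the 16x16 block size and the boundary checks are illustrative choices):

// Naive kernel: one thread per output element C(i,k), reading A and B
// directly from global memory.
__global__ void matrix_2d_mul_float_gpu(const float *A, const float *B, float *C,
                                        int num_rows_A, int num_cols_A, int num_cols_B) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (i < num_rows_A && k < num_cols_B) {
        float sum = 0;
        for (int j = 0; j < num_cols_A; j++) {
            sum += A[i*num_cols_A+j] * B[j*num_cols_B+k];
        }
        C[i*num_cols_B+k] = sum;
    }
}

// Illustrative launch: 16x16 threads per block, enough blocks to cover C.
dim3 block(16, 16);
dim3 grid((num_cols_B + block.x - 1) / block.x, (num_rows_A + block.y - 1) / block.y);
matrix_2d_mul_float_gpu<<<grid, block>>>(d_A, d_B, d_C, num_rows_A, num_cols_A, num_cols_B);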

OK, let's check how the naive version performed.
Each matrix multiplication on the GPU took 23 ms (kernel only); adding the CPU/GPU transfer time we have a total of 48 ms. This means a speed-up of 9.5 s / 48 ms ≈ 199x.
So even with the naive implementation we are roughly 199 times faster than the original sequential version.
Now let's compare with the multi-threaded CPU version (which took a total of 14.6 seconds).
Compared with the multi-threaded version we have a speed-up of 14.6/0.3, which is roughly 48x.
This looks good. If we also compare the execution timeline with the NVIDIA Visual Profiler, the kernel execution time is larger than the CPU/GPU transfer time, and we're still faster than the CPU.

So what's the problem with the naive implementation? Let's use the NVIDIA Visual Profiler to also analyse the "Kernel Performance Limiter" and the kernel latency.
This shows that we're doing more memory operations than actual computation. The title also says "Kernel performance is bounded by memory bandwidth".

Now let's check what is stalling our kernel.

Here our biggest "Stall Reason" is "Execution Dependency", which means that our threads are stalled waiting for data produced by previous instructions (in this case, the loads from global memory).
Making it faster by tiling
In the case of matrix multiplication, we can divide our matrices into tiles and solve all the tiles in parallel. While doing this, we bring each tile into shared memory, which is much faster to read than global memory (accessed through the kernel pointers, e.g. A[i,j]). The tile of C can be kept in register memory (a local variable), which is really fast.

In the end we just add the partial results of the tile pairs to get each entry of C.

Inside each pair of tiles the calculation remains the same: a multiply-add between a row of the A tile and a column of the B tile.

This is the new version
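Below is a minimal sketch of such a tiled kernel (the TILE_WIDTH of 16, the kernel name, and the zero-padding of out-of-range elements are illustrative choices):

#define TILE_WIDTH 16

// Tiled kernel sketch: each block computes one TILE_WIDTH x TILE_WIDTH tile of C.
// Tiles of A and B are staged in shared memory; the partial sum lives in a register.
__global__ void matrix_2d_mul_float_gpu_tiled(const float *A, const float *B, float *C,
                                              int num_rows_A, int num_cols_A, int num_cols_B) {
    __shared__ float tile_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float tile_B[TILE_WIDTH][TILE_WIDTH];

    int i = blockIdx.y * TILE_WIDTH + threadIdx.y;   // row of C
    int k = blockIdx.x * TILE_WIDTH + threadIdx.x;   // column of C
    float sum = 0;                                   // accumulated in a register

    int num_tiles = (num_cols_A + TILE_WIDTH - 1) / TILE_WIDTH;
    for (int t = 0; t < num_tiles; t++) {
        int a_col = t * TILE_WIDTH + threadIdx.x;
        int b_row = t * TILE_WIDTH + threadIdx.y;
        // Load one element of each tile into shared memory (0 when outside the matrix).
        tile_A[threadIdx.y][threadIdx.x] =
            (i < num_rows_A && a_col < num_cols_A) ? A[i*num_cols_A + a_col] : 0.0f;
        tile_B[threadIdx.y][threadIdx.x] =
            (b_row < num_cols_A && k < num_cols_B) ? B[b_row*num_cols_B + k] : 0.0f;
        __syncthreads();                             // wait until the whole tile is loaded

        // Multiply-add inside the tile; the math is the same as before.
        for (int j = 0; j < TILE_WIDTH; j++) {
            sum += tile_A[threadIdx.y][j] * tile_B[j][threadIdx.x];
        }
        __syncthreads();                             // wait before the tiles are overwritten
    }
    if (i < num_rows_A && k < num_cols_B) {
        C[i*num_cols_B+k] = sum;
    }
}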
Now if we compare the kernel execution time we're faster by a factor of roughly 4.5 (the kernel drops from 23 ms to about 5 ms). Now let's run our profiler again to check the balance between computation and memory.

Our kernel duration dropped to 5 ms, but we're still bounded by memory bandwidth; now, however, the limiter is not the L2 memory but the shared memory, which is faster.
Get more information
There is a compile option, -lineinfo, that allows the NVIDIA profiler to annotate your source code and show where you are losing time (or stalling on memory).
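For example, assuming the GPU source file is called matrix_mul_gpu.cu:

nvcc -lineinfo -O3 matrix_mul_gpu.cu -o matrix_mul_gpu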
Below we can see with "Kernel Profile PC Sampling" that the selected instruction (line 51) spends most of its time on "Execution dependency".
Line 44 spends most of its time waiting for data.

The bad news is that this tool works only with CUDA. For more information on using the NVIDIA Visual Profiler you may refer here and to the code, which shows a simple transpose algorithm being optimized.