Now we change our `matrix_2d_mul_float` to run all rows in parallel with `#pragma omp parallel for schedule(dynamic,1) collapse(2)`; a minimal sketch of the parallelized loop is shown below. With this we get a 2x speed-up. This is cool, but could we have done better on a GPU?
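Here is a minimal sketch of how the pragma could be applied to the CPU loop (the name `matrix_2d_mul_float_omp` is mine, and the accumulator is moved inside the loops so that `collapse(2)` is safe; the original presumably modifies `matrix_2d_mul_float` in place):

```c
// Sketch of the OpenMP-parallelized CPU version. Compile with -fopenmp.
// Each (row, column) pair of C is independent, so the two outer loops can be
// collapsed and their iterations handed out dynamically, one at a time.
void matrix_2d_mul_float_omp(float *A, float *B, float *C,
                             int num_rows_A, int num_cols_A, int num_cols_B) {
  #pragma omp parallel for schedule(dynamic, 1) collapse(2)
  for (int i = 0; i < num_rows_A; i++) {      // rows of A (and C)
    for (int k = 0; k < num_cols_B; k++) {    // columns of B (and C)
      float sum = 0;                          // private to each thread
      for (int j = 0; j < num_cols_A; j++)
        sum += A[i * num_cols_A + j] * B[j * num_cols_B + k];
      C[i * num_cols_B + k] = sum;            // C[i][k]
    }
  }
}
```

With `schedule(dynamic,1)` each thread grabs a new (row, column) pair as soon as it finishes the previous one, which keeps the cores evenly loaded.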
To answer that, we first need to know how long it takes to send the 20.8 MB of inputs (matrices A and B, 16 MB + 4.8 MB of floats) to the GPU and to receive the 4.8 MB result matrix back.
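A minimal sketch of what such a transfer check could look like is below (the buffer sizes match the text; the variable names and the use of cudaEvent timing are my own choices, not necessarily what the profiled `checkTransferMatMul` program does):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Rough sketch: time a host->device copy of the inputs (20.8 MB)
// and a device->host copy of the result (4.8 MB).
int main() {
  const size_t bytes_in  = 20800000;  // matrices A and B
  const size_t bytes_out = 4800000;   // result matrix C

  char *h_in  = (char*)malloc(bytes_in);
  char *h_out = (char*)malloc(bytes_out);
  char *d_buf = nullptr;
  cudaMalloc(&d_buf, bytes_in);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  cudaMemcpy(d_buf, h_in, bytes_in, cudaMemcpyHostToDevice);    // send A and B
  cudaMemcpy(h_out, d_buf, bytes_out, cudaMemcpyDeviceToHost);  // receive C
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0;
  cudaEventElapsedTime(&ms, start, stop);
  printf("Transfer took %f ms\n", ms);

  cudaFree(d_buf);
  free(h_in);
  free(h_out);
  return 0;
}
```

Running it under nvprof on the Quadro system reports: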
```bash
leo@leodev:~/work/learningOpenCl/chap1$ nvprof ./checkTransferMatMul 20800000 4800000
Checking GPU transfer...
Sending 20800000 bytes
Sending 4800000 bytes
==4451== NVPROF is profiling process 4451, command: ./checkTransferMatMul 20800000 4800000
==4451== Profiling application: ./checkTransferMatMul 20800000 4800000
==4451== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 82.27%  6.7186ms      1  6.7186ms  6.7186ms  6.7186ms  [CUDA memcpy HtoD]
 17.73%  1.4480ms      1  1.4480ms  1.4480ms  1.4480ms  [CUDA memcpy DtoH]

==4451== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 90.92%  96.428ms      1  96.428ms  96.428ms  96.428ms  cudaMalloc
  8.04%  8.5233ms      2  4.2617ms  2.0535ms  6.4698ms  cudaMemcpy
  0.76%  805.41us     83  9.7030us     838ns  350.05us  cuDeviceGetAttribute
  0.19%  201.07us      1  201.07us  201.07us  201.07us  cudaFree
  0.05%  56.781us      1  56.781us  56.781us  56.781us  cuDeviceTotalMem
  0.04%  44.070us      1  44.070us  44.070us  44.070us  cuDeviceGetName
  0.00%  2.9340us      2  1.4670us  1.0480us  1.8860us  cuDeviceGetCount
  0.00%  1.9560us      2     978ns     908ns  1.0480us  cuDeviceGet
```
From these results the whole transfer takes about 8 ms (on a system with a Quadro GPU). On a system with a Titan X (Maxwell) the same transfer takes 2.6 ms. So as long as the GPU computes each matrix multiplication (transfers included) in less than the roughly 9 seconds the sequential version takes, we get a speed-up.
Naive way on the GPU (using global memory)
So we have less than 9 seconds to multiply the two matrices A[2000,2000] and B[2000,600] on the GPU. Let's start with the naive implementation. First, let's take another look at the CPU code.
```c
void matrix_2d_mul_float(float *A, float *B, float *C,
                         int num_rows_A, int num_cols_A, int num_cols_B) {
  float sum = 0;
  int num_rows_C = num_rows_A;
  int num_cols_C = num_cols_B;
  // Iterate on each row of A
  for (int i = 0; i < num_rows_A; i++) {
    // Iterate on each column of B
    for (int k = 0; k < num_cols_B; k++) {
      sum = 0;
      // Do the "multiply add" between the row of A and the column of B
      for (int j = 0; j < num_cols_A; j++) {
        // A[i][j] == A[i*num_cols_A+j]
        // B[j][k] == B[j*num_cols_B+k]
        // sum += A[i][j] * B[j][k];
        sum += A[i * num_cols_A + j] * B[j * num_cols_B + k];
      }
      // C[i][k] == C[i*num_cols_C+k]
      C[i * num_cols_C + k] = sum;
    }
  }
}
```
You can observe that there is no communication between the loop iterations, so each GPU thread would execute something like this:
```c
sum = 0;
// Do the "multiply add" between the row of A and the column of B
for (int j = 0; j < num_cols_A; j++) {
  // A[i][j] == A[i*num_cols_A+j]
  // B[j][k] == B[j*num_cols_B+k]
  // sum += A[i][j] * B[j][k];
  sum += A[i * num_cols_A + j] * B[j * num_cols_B + k];
}
// C[i][k] == C[i*num_cols_C+k]
C[i * num_cols_C + k] = sum;
```
As there is no communication between the threads, the only issue is that each thread reads global memory at least 2*num_cols_A times (once across a row of A and once down a column of B) and then writes one position of C. The problem is that accessing global memory is expensive in terms of time, especially when many threads access it at the same time: the resulting congestion makes the kernel slower.
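To see how much redundant traffic this is, here is a rough back-of-the-envelope count (my own estimate, derived only from the loop structure above): each of the `num_rows_A * num_cols_B` threads performs `2 * num_cols_A` global loads, so

$$
\#\text{loads}_{naive} = (\text{num\_rows\_A} \times \text{num\_cols\_B}) \times 2\,\text{num\_cols\_A}
$$

For the matrices used here that is billions of loads for only a few million distinct matrix elements, i.e. every element of A and B ends up being fetched from global memory hundreds of times.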
CUDA
```cuda
__global__ void matrix_2d_mul_float_gpu(float *A, float *B, float *C,
                                        int num_rows_A, int num_cols_A, int num_cols_B) {
  // Same indexing code for all 2d kernels
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  // Discard threads that fall outside the matrix
  if (i >= num_rows_A || k >= num_cols_B) return;

  // Sum is on a register (local to each thread)
  float sum = 0;

  // This iterates a lot on the global memory: 2*num_cols_A loads per thread
  for (int j = 0; j < num_cols_A; j++) {
    // A[i][j] == A[i*num_cols_A+j]
    // B[j][k] == B[j*num_cols_B+k]
    // sum += A[i][j] * B[j][k];
    sum += A[i * num_cols_A + j] * B[j * num_cols_B + k];
  }

  // And now one more global memory access to store the result
  // C[i][k] == C[i*num_cols_B+k]
  C[i * num_cols_B + k] = sum;
}
```
OpenCL

```c
__kernel void matrix_2d_mul_float_gpu(__global float *A, __global float *B, __global float *C,
                                      int num_rows_A, int num_cols_A, int num_cols_B) {
  int i = get_global_id(0);
  int k = get_global_id(1);
  // Discard work-items that fall outside the matrix
  if (i >= num_rows_A || k >= num_cols_B) return;

  // Sum is on a register (local to each work-item)
  float sum = 0;

  // This iterates a lot on the global memory: 2*num_cols_A loads per work-item
  for (int j = 0; j < num_cols_A; j++) {
    // A[i][j] == A[i*num_cols_A+j]
    // B[j][k] == B[j*num_cols_B+k]
    // sum += A[i][j] * B[j][k];
    sum += A[i * num_cols_A + j] * B[j * num_cols_B + k];
  }

  // And now one more global memory access to store the result
  C[i * num_cols_B + k] = sum;
}
```
In this case each thread computes one entry of the matrix, C(i,k), by reading one row of A (A(i,j)) and one column of B (B(j,k)).
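To make the mapping concrete, here is a sketch of how the CUDA version could be launched so that (i, k) covers the whole C matrix; the 16x16 block size, the helper name and the device pointers `d_A`, `d_B`, `d_C` are my own assumptions, not taken from the text:

```cuda
// Hypothetical launch of the naive kernel; d_A, d_B and d_C are device buffers
// assumed to be already allocated and filled with A, B (and space for C).
void launch_naive_mul(float *d_A, float *d_B, float *d_C,
                      int num_rows_A, int num_cols_A, int num_cols_B) {
  dim3 threads(16, 16);  // arbitrary block size
  // i comes from the y dimension (rows of A/C), k from the x dimension (columns of B/C),
  // so the grid is sized to cover num_cols_B x num_rows_A threads (rounding up).
  dim3 grid((num_cols_B + threads.x - 1) / threads.x,
            (num_rows_A + threads.y - 1) / threads.y);
  matrix_2d_mul_float_gpu<<<grid, threads>>>(d_A, d_B, d_C,
                                             num_rows_A, num_cols_A, num_cols_B);
  cudaDeviceSynchronize();  // wait for the kernel so errors and timings are visible
}
```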
OK, let's check how the naive version performed.
```bash
==2543== Profiling application: ./matrix_big_mul
==2543== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 89.60%  226.14ms     10  22.614ms  20.433ms  23.861ms  matrix_2d_mul_float_gpu(float*, float*, float*, int, int, int)
  8.20%  20.684ms     20  1.0342ms  456.25us  1.6279ms  [CUDA memcpy HtoD]
  2.20%  5.5601ms     10  556.01us  457.15us  982.74us  [CUDA memcpy DtoH]
```
So even with the naive implementation we are about 199 times faster than the original sequential version.
Now let's compare with the multi-threaded CPU version, which took a total of 14.6 seconds.
Comparing with the multi-threaded version we get a speed-up of 14.6/0.3, roughly 48x.
This looks good, and the execution timeline in the Nvidia Visual Profiler confirms it: the kernel execution time is larger than the CPU/GPU transfer time, and we are still faster than the CPU.
So what's the problem with the naive implementation? Let's use the Nvidia Visual Profiler to also analyze the "Kernel Performance Limiter" and the kernel latency.
This shows that we're doing more memory operations than actual computation; the chart title also says "Kernel performance is bounded by memory bandwidth".
Now let's check what is stalling our kernel.
Here our biggest "Stall Reason" is "Execution Dependency", which means that our threads are mostly just waiting for data.
Making it faster with tiling
In the case of matrix multiplication we can divide our matrices into tiles and solve all the tiles in parallel. While doing this we bring each tile into shared memory, which is much faster than reading directly from global memory (through the kernel pointers, e.g. A[i][j]). The accumulator for the C tile can live in register memory (a local variable), which is faster still.
In the end we can just add the results.
Also, inside the tiles the calculation remains the same.
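Why does this help? Here is a rough count under the same assumptions as before, plus my reading of the kernel below (square `N_THREADS x N_THREADS` tiles, dimensions divisible by the tile size): each element of A and B is now read from global memory once per block that needs it instead of once per thread, so

$$
\#\text{loads}_{tiled} \approx \frac{(\text{num\_rows\_A} \times \text{num\_cols\_B}) \times 2\,\text{num\_cols\_A}}{N\_THREADS}
$$

In other words the global memory traffic shrinks by roughly a factor of the tile size, and the repeated reads inside the inner loop are served from shared memory instead.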
This is the new version:
```cuda
__global__ void matrix_2d_mul_float_gpu(float *A, float *B, float *C,
                                        int num_rows_A, int num_cols_A, int num_cols_B) {
  // Create shared variables (available to all threads on the same block)
  __shared__ float A_tile[N_THREADS][N_THREADS];
  __shared__ float B_tile[N_THREADS][N_THREADS];

  // Block index
  int bx = blockIdx.x;
  int by = blockIdx.y;
  // Thread index
  int tx = threadIdx.x;
  int ty = threadIdx.y;

  // Index of the first sub-matrix of A processed by the block
  int aBegin = num_cols_A * N_THREADS * by;
  // Index of the last sub-matrix of A processed by the block
  int aEnd = aBegin + num_cols_A - 1;
  // Index of the first sub-matrix of B processed by the block
  int bBegin = N_THREADS * bx;
  int bStep = N_THREADS * num_cols_B;
  int aStep = N_THREADS;

  float sum = 0;
  for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
    A_tile[ty][tx] = A[a + num_cols_A * ty + tx];
    B_tile[tx][ty] = B[b + num_cols_B * tx + ty];
    // Synchronize to make sure the tiles are loaded
    __syncthreads();

    for (int k = 0; k < N_THREADS; ++k)
      sum += A_tile[ty][k] * B_tile[k][tx];

    // Wait for the other threads to finish with their sub-matrices
    __syncthreads();
  }

  // Write the block sub-matrix to device memory;
  // each thread writes one element
  int c = num_cols_B * N_THREADS * by + N_THREADS * bx;
  C[c + num_cols_B * ty + tx] = sum;
}
```
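One detail the listing leaves implicit is the tile size: `N_THREADS` is a compile-time constant, the thread block must be `N_THREADS x N_THREADS`, and the code as written assumes the matrix dimensions are multiples of the tile size (otherwise it would need padding or bounds checks). A possible launch sketch, with an assumed tile size of 16 and my own helper and pointer names:

```cuda
#define N_THREADS 16  // assumed tile size; defined before the kernel and dividing the matrix dimensions

// Hypothetical launch of the tiled kernel: one block per N_THREADS x N_THREADS tile of C.
void launch_tiled_mul(float *d_A, float *d_B, float *d_C,
                      int num_rows_A, int num_cols_A, int num_cols_B) {
  dim3 threads(N_THREADS, N_THREADS);
  dim3 grid(num_cols_B / N_THREADS, num_rows_A / N_THREADS);
  matrix_2d_mul_float_gpu<<<grid, threads>>>(d_A, d_B, d_C,
                                             num_rows_A, num_cols_A, num_cols_B);
  cudaDeviceSynchronize();
}
```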
Now if we compare the kernel execution time we're faster by a factor of ..... Let's run the profiler again to check the balance between computation and memory.
Our kernel duration has dropped to 5 ms. We are still bounded by memory bandwidth, but the bottleneck is now shared memory rather than L2, and shared memory is much faster.
Get more information
There is a compile option, -lineinfo, that lets the Nvidia profiler annotate your source code and give an idea of where you are losing time (or stalling on memory).
Below we can see, with "Kernel Profile PC Sampling", that the selected instruction (line 51) spends most of its time on "Execution dependency".
Line 44, in turn, spends most of its time waiting for data.
The bad news is that this tool works only with CUDA. For more information on using the Nvidia Visual Profiler you may refer to the links here and to the example code, which shows a simple transpose algorithm being optimized.
To recap the naive-kernel numbers: each matrix multiplication took about 23 ms on the GPU (kernel only); adding the CPU/GPU transfer time brings it to roughly 48 ms per multiplication, which is where the speed-up of 9560/48 ≈ 199 over the sequential version comes from.