GPU Considerations

GPU Compute Unit

Work-group Size

As a rule of thumb, make your work-group size the same as (or a multiple of) the wavefront size (AMD) or warp size (NVIDIA), so that no hardware lanes are left idle.
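
On NVIDIA hardware, for example, you can query the warp size at runtime with the CUDA runtime API. A minimal sketch (it assumes device 0 exists):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // Query the properties of device 0 (assumed present)
    cudaGetDeviceProperties(&prop, 0);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}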

Global Memory

What you may think is that global memory behaves like one flat address space with a uniform access cost.

Actually the reality is different. Global memory is accessed through channels, and depending on the size of the chunk that you read/write, and on its alignment, you may see a performance impact. You also want to avoid having multiple compute units use the same memory channel, because that will serialize their accesses.

To minimize this problem we should try to make work-items access adjacent memory locations (coalesced access).
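
As an illustration, here is a minimal CUDA sketch (hypothetical kernels, not from the original text) contrasting a coalesced copy, where adjacent work-items touch adjacent addresses, with a strided copy that scatters the accesses:

// Coalesced: thread i reads/writes element i, so each warp touches
// one contiguous chunk of global memory per instruction.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: adjacent threads touch addresses `stride` elements apart,
// so each warp access is split across many memory transactions.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}

On most GPUs the strided version gets slower as the stride grows, because fewer useful bytes arrive with each memory transaction.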

GPU/CPU Transfer

Consider the following CUDA program. It simply allocates 1 GB of memory and transfers it from the CPU to the GPU and back; the commented-out values of X show the times measured for other sizes.

#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
    //const unsigned int X=1; //1 Bytes (2us/1us)
    //const unsigned int X=10; //10 Bytes (2us/1us)
    //const unsigned int X=100; //100 Bytes (2us/1us)
    //const unsigned int X=1000; //1k Bytes (2us/1us)
    //const unsigned int X=10000; //10k Bytes (2.7us/2us)
    //const unsigned int X=100000; //100k Bytes (10us/10us)
    //const unsigned int X=1000000; //1 Megabyte (80us/79us)
    //const unsigned int X=10000000; //10 Megabyte (1000us/900us)
    //const unsigned int X=100000000; //100 Megabyte (10000us/10000us)
    const unsigned int X=1000000000; //1000 Megabyte (106000us/103000us)
    //const unsigned int X=256000000; //256 Megabyte (27000us/26000us)
    //const unsigned int X=120*120*3; // 120x120 RGB image (43200 bytes) (7us/6us)
    const unsigned int bytes = X*sizeof(char);
    // Allocate memory on CPU
    char *hostArray= (char*)malloc(bytes);
    char *deviceArray;

    // Allocate memory on GPU
    cudaMalloc((void**)&deviceArray, bytes);
    memset(hostArray,0,bytes);

    // Transfer hostArray from CPU to GPU
    cudaMemcpy(deviceArray,hostArray,bytes,cudaMemcpyHostToDevice);
    // Get hostArray from GPU to CPU
    cudaMemcpy(hostArray,deviceArray,bytes,cudaMemcpyDeviceToHost);

    // Release memory from GPU and CPU
    cudaFree(deviceArray);
    free(hostArray);
    return 0;
}

Now let's use the console profiler, nvprof (I'm assuming that you already have the CUDA toolkit installed):

# Compiling
nvcc checkTransfer.cu -o checkTransfer

# Profiling
laraujo@lindev:~/work/learningOpenCl/cudaSamples$ nvprof ./checkTransfer
==8921== NVPROF is profiling process 8921, command: ./checkTransfer
==8921== Profiling application: ./checkTransfer
==8921== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 51.36%  109.67ms         1  109.67ms  109.67ms  109.67ms  [CUDA memcpy DtoH]
 48.64%  103.87ms         1  103.87ms  103.87ms  103.87ms  [CUDA memcpy HtoD]

==8921== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 71.69%  213.87ms         2  106.93ms  103.95ms  109.92ms  cudaMemcpy
 28.01%  83.557ms         1  83.557ms  83.557ms  83.557ms  cudaMalloc
  0.19%  580.47us        83  6.9930us      84ns  323.09us  cuDeviceGetAttribute
  0.09%  270.79us         1  270.79us  270.79us  270.79us  cudaFree
  0.01%  31.607us         1  31.607us  31.607us  31.607us  cuDeviceTotalMem
  0.01%  22.521us         1  22.521us  22.521us  22.521us  cuDeviceGetName
  0.00%     919ns         2     459ns     122ns     797ns  cuDeviceGetCount
  0.00%     235ns         2     117ns      84ns     151ns  cuDeviceGet
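
Note that each direction takes roughly 104-110 ms for 1 GB, i.e. about 9-10 GB/s, which is the order of magnitude you would expect from a PCIe link. If you prefer to measure transfers from inside the program instead of with nvprof, you can time them with CUDA events. A minimal sketch (same 1 GB copy as above, host-to-device only):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const unsigned int bytes = 1000000000; // 1000 Megabyte, as above
    char *hostArray = (char*)malloc(bytes);
    char *deviceArray;
    cudaMalloc((void**)&deviceArray, bytes);

    // Events are recorded on the GPU timeline, so they bracket the copy itself
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(deviceArray, hostArray, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host to device: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(deviceArray);
    free(hostArray);
    return 0;
}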

Optionally, we can use the NVIDIA Visual Profiler:

nvvp ./checkTransfer
