As a rule of thumb, make your work-group size the same as your wavefront (AMD) or warp (NVIDIA) size.
Global Memory
What you may think...
In reality it's different: global memory is accessed through channels, and depending on the size of the chunk you read/write, and on its alignment to channel boundaries, performance may suffer. You also want to avoid having multiple compute units use the same memory channel, because that will serialize your accesses.
To minimize this problem, we should make work-items access adjacent memory locations, so that their accesses can be coalesced.
GPU/CPU Transfer
Consider the following CUDA program. It just allocates and transfers 1 GB of memory:
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    //const unsigned int X=1;          // 1 Byte (2us/1us)
    //const unsigned int X=10;         // 10 Bytes (2us/1us)
    //const unsigned int X=100;        // 100 Bytes (2us/1us)
    //const unsigned int X=1000;       // 1k Bytes (2us/1us)
    //const unsigned int X=10000;      // 10k Bytes (2.7us/2us)
    //const unsigned int X=100000;     // 100k Bytes (10us/10us)
    //const unsigned int X=1000000;    // 1 Megabyte (80us/79us)
    //const unsigned int X=10000000;   // 10 Megabyte (1000us/900us)
    //const unsigned int X=100000000;  // 100 Megabyte (10000us/10000us)
    const unsigned int X=1000000000;   // 1000 Megabyte (106000us/103000us)
    //const unsigned int X=256000000;  // 256 Megabyte (27000us/26000us)
    //const unsigned int X=120*120*3;  // 120x120 RGB image (43200 bytes) (7us/6us)
    const unsigned int bytes = X*sizeof(char);

    // Allocate memory on CPU
    char *hostArray = (char*)malloc(bytes);
    char *deviceArray;

    // Allocate memory on GPU
    cudaMalloc((void**)&deviceArray, bytes);
    memset(hostArray, 0, bytes);

    // Transfer hostArray from CPU to GPU
    cudaMemcpy(deviceArray, hostArray, bytes, cudaMemcpyHostToDevice);

    // Get hostArray from GPU to CPU
    cudaMemcpy(hostArray, deviceArray, bytes, cudaMemcpyDeviceToHost);

    // Release memory from GPU and CPU
    cudaFree(deviceArray);
    free(hostArray);
}
Now, if we use the console profiler (I'm assuming you already have the CUDA Toolkit installed):
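For reference, one way to compile and profile the program above (a sketch; it assumes `nvcc` and `nvprof` from the CUDA Toolkit are on your PATH, and `memcpy_test.cu` is a file name I made up for the code above):

```shell
# Compile the transfer benchmark (hypothetical file name)
nvcc memcpy_test.cu -o memcpy_test
# Run it under the command-line profiler to see per-call timings
nvprof ./memcpy_test
```

The profiler output will list each `cudaMemcpy` call with its duration, which is where the timings in the comments above come from.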