Parallel Computing
In this chapter we will review some concepts of parallel computing, with more emphasis on GPUs.
Before dealing with performance, let's review some concepts.
Throughput: the number of computing tasks completed per unit of time, e.g. 1000 credit card payments in a minute.
Latency: the delay between invoking an operation and getting its response, e.g. the maximum time taken to process a credit card transaction is 25 ms.
When optimizing performance, an improvement in one factor (such as throughput) may lead to a worsening of another factor (such as latency).
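As a quick worked example with made-up numbers: if a single worker processes one credit card transaction in 25 ms, then

$$\text{throughput}_{serial} \le \frac{1}{25\,\text{ms}} = 40\ \text{transactions/s}, \qquad \text{throughput}_{32\ workers} \le \frac{32}{25\,\text{ms}} = 1280\ \text{transactions/s}$$

while the latency of each individual transaction is still 25 ms (or slightly worse once queuing is involved). Parallelism improved throughput without improving latency.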
This is the old (serial) way: we have a problem, we break it into small pieces, and we solve them one after the other.
In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem.
Following Flynn's taxonomy, there are 4 different ways to classify parallel computers:
SISD: Really old computers (e.g. the PDP-1)
MIMD: Supercomputers
SIMD: Intel processors, Nvidia GPUs
MISD: Really rare.
GPUs are normally SIMD-type processors, where different processing units execute the same instruction but on different parts of the shared memory.
Below we have a 4-wide SIMD unit. All processors here are executing the "add" instruction at the same time.
Don't be fooled when you hear that a GPU has 5000 cores; that probably just means it has 5000 ALUs (Arithmetic Logic Units). The maximum number of things a GPU can really do at the same time is normally called the "warp size" on Nvidia or the "wavefront" size on AMD, and is normally a 32-wide SIMD unit, organized into blocks/grids.
An interesting problem can occur when you have a branch (if) instruction and each processing element decides differently. When this happens you pay a processing penalty; this effect is called divergence. To remedy it, try to minimize the use of branching instructions within a wavefront (warp in CUDA).
If you need this kind of conditional assignment you can use the "select" built-in in OpenCL, which compiles to a single instruction, so the divergence problem does not happen.
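A sketch of the same idea in CUDA (the kernels and names are my own, not from the text): the first kernel diverges because threads of the same warp may take different sides of the if; the second computes both candidate values and picks one with a conditional expression, which the compiler can usually turn into predicated instructions, similar in spirit to OpenCL's select.

```cuda
// divergence_demo.cu -- illustrative sketch, not a benchmark.

// Divergent version: threads of the same warp may take different branches,
// so the warp executes both paths one after the other (divergence penalty).
__global__ void divergentKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)
        out[i] = in[i] * 2.0f;   // path A
    else
        out[i] = in[i] * -1.0f;  // path B
}

// Branchless version: both values are computed and one is selected, which the
// compiler can typically lower to predicated instructions (no divergent paths).
__global__ void selectKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = in[i] * 2.0f;
    float b = in[i] * -1.0f;
    out[i] = (in[i] > 0.0f) ? a : b;  // single select-style assignment
}
```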
What can be inferred from Amdahl's law:
Speedup is limited by the fraction of the work that is not parallelizable; even using an infinite number of processors the speed will not improve beyond that limit, because the serial part is the bottleneck.
The total execution time T of a program falls into two categories:
Time spent doing non-parallelizable serial work
Time spent doing parallelizable work
Something important is also missing here: Amdahl's law does not take into account other factors, such as memory latency.
In this model, a shared memory is visible to all nodes, but each node deals with a part of this shared memory. This is what we normally do on GPUs.
The main characteristics of the data-parallel method are that programming is relatively simple, since multiple processors all run the same program, and that all processors finish their task at around the same time. This method is efficient when the dependency between the data being processed by each processor is minimal. For example, vector addition can benefit greatly from this method.
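A minimal data-parallel sketch in CUDA (sizes and names are arbitrary): every thread runs the same kernel, each on its own element of the vectors.

```cuda
// vector_add.cu -- data parallelism: every thread executes the same
// instruction stream on a different element of the input vectors.
#include <cstdio>
#include <vector>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // same operation, different data
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n, 0.0f);

    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, ha.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (n + block - 1) / block;   // enough blocks to cover all elements
    vecAdd<<<grid, block>>>(da, db, dc, n);
    cudaMemcpy(hc.data(), dc, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("c[0] = %f (expected 3.0)\n", hc[0]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```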
The main characteristic of the task-parallel method is that each processor executes different commands. This increases the programming difficulty compared to the data-parallel method. Since the processing time may vary depending on how the task is split up, some synchronization will be needed. If the tasks are completely uncorrelated, the problem becomes much easier.
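One way to sketch task parallelism on a GPU (the kernels and sizes here are made up for illustration) is to put independent tasks into different CUDA streams so they can overlap, and then synchronize before combining the results.

```cuda
// task_parallel_streams.cu -- two unrelated "tasks" run as different kernels
// in different streams; the host synchronizes before using the results.
#include <cstdio>

__global__ void taskA(float* x, int n) {            // task 1: scale a float array
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}

__global__ void taskB(int* y, int n) {              // task 2: a different command stream
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = i % 7;
}

int main() {
    const int n = 1 << 16;
    float* dx; int* dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(int));
    cudaMemset(dx, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Different commands on different data, possibly overlapping on the device.
    taskA<<<(n + 255) / 256, 256, 0, s1>>>(dx, n);
    taskB<<<(n + 255) / 256, 256, 0, s2>>>(dy, n);

    // Synchronization point: wait for both tasks before combining their results.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    printf("both tasks finished\n");

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(dx); cudaFree(dy);
    return 0;
}
```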
One of the first steps in designing a parallel program is to break the problem into discrete "chunks" of work that can be distributed to multiple tasks. This is known as decomposition or partitioning. There are two basic ways to partition computational work among parallel tasks:
Domain decomposition
Functional decomposition
In this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data.
In this approach, the focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.
Often parallel problems need the nodes (tasks) to communicate with each other. Again, this is a problem-dependent issue. Some points to consider:
Communication always implies overhead
Communication frequently requires synchronization between nodes (tasks), which adds even more overhead
When you need to send data to your GPU to perform some computation and then bring the results back to your CPU, communication is implied.
Some types of problems can be decomposed and executed in parallel with virtually no need for tasks to share data. For example, imagine an image processing operation where every pixel in a black and white image needs to have its color reversed. The image data can easily be distributed to multiple tasks that then act independently of each other to do their portion of the work. These types of problems are often called embarrassingly parallel because they are so straightforward. Very little inter-task communication is required.
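A sketch of that image example in CUDA (assuming an 8-bit grayscale image stored as a flat array): each thread inverts one pixel and never needs data from any other thread.

```cuda
// invert_image.cu -- embarrassingly parallel: each thread handles one pixel
// and needs no data from any other thread.
#include <cstdio>
#include <vector>

__global__ void invertPixels(unsigned char* img, int numPixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels)
        img[i] = 255 - img[i];   // reverse the gray value, fully independent work
}

int main() {
    const int w = 1024, h = 768, numPixels = w * h;
    std::vector<unsigned char> host(numPixels, 100);   // dummy gray image

    unsigned char* dImg;
    cudaMalloc(&dImg, numPixels);
    cudaMemcpy(dImg, host.data(), numPixels, cudaMemcpyHostToDevice);

    invertPixels<<<(numPixels + 255) / 256, 256>>>(dImg, numPixels);

    cudaMemcpy(host.data(), dImg, numPixels, cudaMemcpyDeviceToHost);
    printf("pixel 0: %d (expected 155)\n", host[0]);
    cudaFree(dImg);
    return 0;
}
```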
Most parallel applications are not quite so simple, and do require tasks to share data with each other. For example, a 3-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data. Changes to neighboring data have a direct effect on that task's data.
Managing the sequence of work and the tasks performing it is a critical design consideration for most parallel programs. Synchronization always impacts performance, but it is always needed when tasks have to communicate.
Types of Synchronization
Barrier (used in OpenCL)
Lock / semaphore
Synchronous communication operations
It's a synchronization mechanism where each task performs its work until it reaches the barrier. It then stops, or "blocks" until all tasks reach that same point. When the last task reaches the barrier, all tasks are synchronized.
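A minimal CUDA sketch of a barrier (the kernel and names are illustrative): __syncthreads() blocks every thread of a block until all of them reach it, after which it is safe to read shared-memory values written by neighboring threads.

```cuda
// barrier_demo.cu -- __syncthreads() as a block-level barrier: no thread passes
// it until every thread of the block has reached it.
#include <cstdio>

__global__ void shiftLeft(const int* in, int* out, int n) {
    __shared__ int tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];   // each thread stages one element

    __syncthreads();   // barrier: all shared-memory writes are now visible

    // Safe to read a value written by a *different* thread of the same block.
    if (i < n) {
        int next = (threadIdx.x + 1 < blockDim.x && i + 1 < n)
                       ? tile[threadIdx.x + 1]
                       : in[i];             // edge of the block: fall back to global
        out[i] = next;
    }
}

int main() {
    const int n = 1024;
    int *din, *dout;
    cudaMalloc(&din, n * sizeof(int));
    cudaMalloc(&dout, n * sizeof(int));

    int host[1024];
    for (int i = 0; i < n; ++i) host[i] = i;
    cudaMemcpy(din, host, n * sizeof(int), cudaMemcpyHostToDevice);

    shiftLeft<<<n / 256, 256>>>(din, dout, n);
    cudaMemcpy(host, dout, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[0] = %d (expected 1)\n", host[0]);
    cudaFree(din); cudaFree(dout);
    return 0;
}
```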
Granularity is about the ratio between computation and communication. There are 2 types of granularity:
Fine-grain: more communication than computation
Coarse-grain: more computation than communication
The most efficient granularity depends on the algorithm and the hardware environment in which it runs. But... communication often has a much bigger latency than computation, for example copying data to/from the GPU. So we prefer coarse-grain, which means lots of computation and little GPU/CPU communication.
Just as a mental experiment, imagine a processing element (node/task) that executes statements in 1 second (e.g. V := 1+2+3/4), but that takes much longer whenever it needs to read/write GPU global memory. Consider the table below.
By the way, we're assuming that the data is already on the GPU; sending data to the GPU is another problem.
In this table we have different memory types: global is the GPU memory, private and local are memories located inside each compute core, and constant is also global memory but specialized for faster reads. Now check the following examples.
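As a rough CUDA sketch of the idea (the kernel and the variable names x and y are my own, not the original examples): global memory is the large, slow device memory; x and y sit in registers (the "private" memory of each work-item); __shared__ corresponds to OpenCL's "local" memory inside each compute unit; and __constant__ memory is global but cached for fast reads.

```cuda
// memory_spaces.cu -- illustrative sketch of the memory spaces mentioned above
// (CUDA naming: OpenCL "private" ~ registers, OpenCL "local" ~ __shared__).
#include <cstdio>

__constant__ int scaleFactor = 2;         // constant memory: read-only, cached

__global__ void memorySpaces(const int* in, int* out, int n) {
    __shared__ int tile[256];             // "local"/shared memory: on-chip, per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int x = in[i];                        // x comes from slow global memory
    int y = x * scaleFactor;              // x and y live in fast registers ("private")

    tile[threadIdx.x] = y;                // staged in fast on-chip shared memory
    out[i] = tile[threadIdx.x];           // one write back to global memory
}

int main() {
    const int n = 1024;
    int *din, *dout;
    cudaMalloc(&din, n * sizeof(int));
    cudaMalloc(&dout, n * sizeof(int));
    cudaMemset(din, 0, n * sizeof(int));

    memorySpaces<<<n / 256, 256>>>(din, dout, n);
    cudaDeviceSynchronize();
    printf("done\n");

    cudaFree(din); cudaFree(dout);
    return 0;
}
```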
Now imagine that we need to deal with even more data: instead of int (4 bytes), x and y will be long (8 bytes).
This means 2 things if we want to improve performance compared to some original sequential algorithm:
The original sequential algorithm must be much slower than this memory I/O latency.
You need many more operations inside the GPU so that this time gets diluted.
The first thing we might think of is to add much more processing, so that the computation takes much more time than this memory latency. Again, if the processing time plus the memory latency is smaller than the original sequential CPU version, you get your speedup.
In this case you now have 100% ALU efficiency, but in real life this only happens if you are dealing with embarrassingly parallel problems, for example big matrix multiplications, cryptanalysis, etc.
A better technique is to use the GPU's context-switching mechanism to hide this latency. The parallel code signals to the GPU that it is waiting for a memory request to complete; when this happens, the group of processing elements waiting for that memory goes into a pool, and meanwhile the GPU can launch another work-group to execute (which will itself eventually stall). The idea is that, while this is going on, some work-items will have their memory requests served, which minimizes the overall latency.
In this way we hide the long latency of kernel accesses to global memory, because while the GPU is assigning work-groups to execute, some of them will already have their data available. By the way, this works when your work-group fits inside a wavefront (warp in CUDA).
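A rough host-side sketch of how this is exploited in practice (the kernel, names, and sizes are arbitrary): launch far more blocks than the device has multiprocessors, so the scheduler always has ready work-groups to swap in while others wait on global memory.

```cuda
// latency_hiding.cu -- oversubscribe the GPU with many blocks so the scheduler
// can swap in ready work while other warps wait on global memory.
#include <cstdio>

__global__ void axpy(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // each thread waits on two global loads
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d, warp size: %d\n", prop.multiProcessorCount, prop.warpSize);

    const int n = 1 << 24;               // far more elements than ALUs
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemset(dx, 0, n * sizeof(float));
    cudaMemset(dy, 0, n * sizeof(float));

    int block = 256;
    int grid = (n + block - 1) / block;  // thousands of blocks >> number of SMs
    printf("launching %d blocks on %d SMs\n", grid, prop.multiProcessorCount);
    axpy<<<grid, block>>>(dx, dy, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(dx); cudaFree(dy);
    return 0;
}
```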
After the host sends data to the GPU, it resides in global memory, where each thread (compute unit) accesses it. We have already discussed that this is slow, but sometimes you have to do it. Each time a kernel reads/writes global memory it actually accesses a chunk of memory. Coalesced access means accessing data at adjacent addresses.
So it is faster to have fewer threads consuming a block of adjacent memory than lots of threads reading from random addresses.
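A sketch of the two access patterns (the stride of 32 is arbitrary): in the coalesced kernel, consecutive threads of a warp read consecutive addresses, which the hardware can serve with few memory transactions; in the strided kernel, consecutive threads read addresses far apart, which needs many more transactions.

```cuda
// coalescing.cu -- coalesced vs. strided global-memory access (sketch).
#include <cstdio>

// Coalesced: thread i reads element i, so a warp touches 32 adjacent words
// that the hardware can service with very few memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i*stride, so adjacent threads hit addresses
// far apart and each warp needs many separate memory transactions.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long j = (long)i * stride;
    if (j < n) out[i] = in[j];
}

int main() {
    const int n = 1 << 24, stride = 32;
    float *din, *dout;
    cudaMalloc(&din, n * sizeof(float));
    cudaMalloc(&dout, n * sizeof(float));
    cudaMemset(din, 0, n * sizeof(float));

    int block = 256, grid = (n + block - 1) / block;
    copyCoalesced<<<grid, block>>>(din, dout, n);
    copyStrided<<<grid, block>>>(din, dout, n, stride); // same work, worse access pattern
    cudaDeviceSynchronize();                            // profile both to compare

    cudaFree(din); cudaFree(dout);
    return 0;
}
```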
So far we've only considered performance when the data is already on the GPU (global memory). This neglects the slowest part of GPU programming, which is getting data on and off the GPU.
We should not use only the GPU execution time of a kernel relative to the execution time of its CPU implementation to decide whether to run the GPU or CPU version. We also need to consider the cost of moving data across the PCI-e bus, especially when we are initially porting code to the GPU.
Because the GPU is plugged into the PCI-e bus, this largely depends on how fast the PCI bus is and how many other things are using it.
The host/device transfer latency is the major difficulty when trying to accelerate algorithms on the GPU.
This happens because, if your sequential algorithm already computes in less time than this host/device transfer takes, there is not much to be done. But there are some options...
This is the most obvious one, but you need at least one transfer, right? So prefer one big transfer over several small ones, especially inside your program loop.
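A sketch of the difference (sizes are made up): every cudaMemcpy call pays a fixed setup/latency cost over PCI-e, so one big transfer is normally much cheaper than the same bytes split across many small copies inside a loop.

```cuda
// transfers.cu -- many small host->device copies vs. one batched copy (sketch).
#include <cstdio>
#include <vector>

int main() {
    const int chunks = 1024, chunkElems = 256;
    const int n = chunks * chunkElems;
    std::vector<float> host(n, 1.0f);

    float* dev;
    cudaMalloc(&dev, n * sizeof(float));

    // Bad: one cudaMemcpy per chunk -- pays the per-call latency 1024 times.
    for (int c = 0; c < chunks; ++c) {
        cudaMemcpy(dev + c * chunkElems,
                   host.data() + c * chunkElems,
                   chunkElems * sizeof(float),
                   cudaMemcpyHostToDevice);
    }

    // Better: the same data in a single large transfer.
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    printf("copied %d floats (twice, for comparison)\n", n);
    cudaFree(dev);
    return 0;
}
```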
Host (CPU) data allocations are pageable by default, and the GPU cannot access data directly from pageable host memory, so something extra happens whenever a data transfer from pageable host memory to device memory is invoked. This is because the OS hands out virtual addresses for its allocations, and the driver needs real, page-locked addresses to do the transfer. The GPU driver must therefore first allocate a temporary page-locked, or "pinned", host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory, as illustrated below.
As you can see in the figure, pinned memory is used as a staging area for transfers from the device to the host. We can avoid the cost of the transfer between pageable and pinned host arrays by directly allocating our host arrays in pinned memory.
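A minimal sketch with the CUDA runtime: cudaMallocHost allocates page-locked (pinned) host memory, so the transfer can skip the extra staging copy from pageable to pinned memory described above (and can also be done asynchronously with cudaMemcpyAsync).

```cuda
// pinned_memory.cu -- pageable vs. pinned host allocations for transfers (sketch).
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);

    float* dev;
    cudaMalloc(&dev, bytes);

    // Pageable host memory: the driver must stage it through an internal
    // pinned buffer before the DMA transfer can happen.
    std::vector<float> pageable(n, 1.0f);
    cudaMemcpy(dev, pageable.data(), bytes, cudaMemcpyHostToDevice);

    // Pinned (page-locked) host memory: the DMA engine can read it directly,
    // avoiding the extra host-side copy.
    float* pinned = nullptr;
    cudaMallocHost(&pinned, bytes);
    for (int i = 0; i < n; ++i) pinned[i] = 1.0f;
    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(pinned);
    cudaFree(dev);
    printf("done\n");
    return 0;
}
```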
You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system and other programs. How much is too much is difficult to tell in advance, so as with all optimizations, test your applications and the systems they run on for optimal performance parameters.
Here are some classical examples.
Amdahl's Law states that the potential program speedup (theoretical speedup in latency) is defined by the fraction of code that can be parallelized:

$$S_{latency}(N) = \frac{1}{(1 - p) + \frac{p}{N}}$$

$S_{latency}$: theoretical speedup in latency of the execution of the whole task
$p$: fraction of the code that can be parallelized
$N$: number of processors

If no part is parallel ($p = 0$), the speedup will be 1 (no speedup).
If all parts are parallel ($p = 1$), the speedup would scale to infinity as $N$ grows.
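As a quick worked example with made-up numbers, suppose 90% of the program is parallelizable ($p = 0.9$):

$$S_{latency}(8) = \frac{1}{(1 - 0.9) + \frac{0.9}{8}} = \frac{1}{0.2125} \approx 4.7, \qquad \lim_{N \to \infty} S_{latency}(N) = \frac{1}{1 - 0.9} = 10$$

So even with infinitely many processors the speedup can never exceed 10x, which is exactly the limit imposed by the 10% serial part.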
In that case our computation is only a tiny fraction of the whole time. This is bad: it means that our ALUs are working at a very low utilization.
Now the situation got even worse: we spend an even smaller fraction of the time on computation, and our ALU efficiency dropped even further.
So we overflow our compute units with work-groups.