The pooling layer is used to reduce the spatial dimensions, but not the depth, of the feature maps in a convolutional neural network. Basically, this is what you gain:
With less spatial information you gain computational performance
Less spatial information also means fewer parameters in the layers that follow, so less chance of overfitting
You get some translation invariance
Some projects don't use pooling, especially when they want to "learn" the specific position of an object, for example when learning how to play Atari games.
The diagram below shows the most common type of pooling, the max-pooling layer, which slides a window over the input like a normal convolution and outputs the largest value inside each window.
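As a small worked example of the idea, here is a sketch that max-pools a single 4 x 4 feature map with a 2 x 2 window and a stride of 2. The input values are made up for illustration, and the reshape trick used here only works when the windows do not overlap (window size equal to the stride) and the input size is divisible by the window size.

```python
import numpy as np

# Illustrative 4 x 4 feature map (values are arbitrary).
x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]])

# Split rows and columns into 2 x 2 blocks:
# axes are (block_row, row_in_block, block_col, col_in_block).
blocks = x.reshape(2, 2, 2, 2)

# Take the largest value inside each block.
pooled = blocks.max(axis=(1, 3))

print(pooled)
# [[6 5]
#  [7 9]]
```

Each output value is the maximum of one 2 x 2 window, so the 4 x 4 input becomes a 2 x 2 output.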
The most important parameters to play with:
Input: H1 x W1 x Depth_In x N
Stride: Scalar that controls the number of pixels the window slides at each step.
K: Kernel size
Regarding its output, H2 x W2 x Depth_Out x N (assuming no padding):
H2 = (H1 - K) / Stride + 1
W2 = (W1 - K) / Stride + 1
Depth_Out = Depth_In
It is also worth pointing out that there are no learnable parameters in the pooling layer, so its backpropagation is simpler.
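As a quick check of the output size, here is a minimal sketch, assuming no padding and the usual relation H2 = (H1 - K) / Stride + 1 (and the same for the width), with the depth and batch size left unchanged. The function name `pool_output_shape` is hypothetical and only used for illustration.

```python
def pool_output_shape(h1, w1, depth_in, n, k, stride):
    """Output shape (H2, W2, Depth_Out, N) of a pooling layer without padding."""
    h2 = (h1 - k) // stride + 1
    w2 = (w1 - k) // stride + 1
    # Pooling works on each channel independently, so the depth is unchanged.
    return h2, w2, depth_in, n


print(pool_output_shape(224, 224, 64, 32, k=2, stride=2))
# (112, 112, 64, 32)
```

With a 2 x 2 window and a stride of 2, the height and width are halved while Depth_Out stays equal to Depth_In.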
The window movement mechanism in pooling layers is the same as in a convolution layer; the only change is that we select the largest value in the window instead of computing a weighted sum.
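The forward pass described above can be sketched with two plain loops over the output positions. This is a naive version under a few assumptions: the input uses the H1 x W1 x Depth_In x N layout listed earlier, there is no padding, and the function name `max_pool_forward_naive` is made up for this example.

```python
import numpy as np

def max_pool_forward_naive(x, k, stride):
    """Naive max pooling. x has shape (H1, W1, Depth, N)."""
    h1, w1, depth, n = x.shape
    h2 = (h1 - k) // stride + 1
    w2 = (w1 - k) // stride + 1
    out = np.zeros((h2, w2, depth, n), dtype=x.dtype)

    for i in range(h2):
        for j in range(w2):
            hs, ws = i * stride, j * stride
            # Window of shape (k, k, Depth, N) at this output position.
            window = x[hs:hs + k, ws:ws + k, :, :]
            # Largest value per channel and per sample.
            out[i, j] = window.max(axis=(0, 1))
    return out


x = np.arange(16, dtype=np.float64).reshape(4, 4, 1, 1)
print(max_pool_forward_naive(x, k=2, stride=2)[:, :, 0, 0])
# [[ 5.  7.]
#  [13. 15.]]
```

The loops only move the window; the maximum is taken for every channel and every sample at once.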
From the backpropagation chapter we learned that the max node simply acts as a router, passing the incoming gradient "dout" to the input that had the largest value during the forward pass; all other inputs receive a gradient of zero.
You can think of max pooling as using a series of max nodes in its computation graph. So consider the backward propagation of the max pooling layer as an element-wise product between dout and a mask marking the elements that were selected during the forward propagation.
In other words, the gradient with respect to the input of the max pooling layer will be a tensor made of zeros, except at the positions that were selected during the forward propagation.
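The mask idea can be sketched as follows. This is a naive version, assuming the same H x W x Depth x N layout and no padding; the function name `max_pool_backward_naive` is hypothetical. Note one simplification: if several values in a window tie for the maximum, this sketch sends the gradient to all of them.

```python
import numpy as np

def max_pool_backward_naive(dout, x, k, stride):
    """Gradient of max pooling w.r.t. its input x of shape (H1, W1, Depth, N).

    dout has the shape of the pooling output, (H2, W2, Depth, N).
    """
    h2, w2, _, _ = dout.shape
    dx = np.zeros_like(x, dtype=dout.dtype)

    for i in range(h2):
        for j in range(w2):
            hs, ws = i * stride, j * stride
            window = x[hs:hs + k, ws:ws + k, :, :]
            # Mask is True where the window holds its maximum value.
            mask = window == window.max(axis=(0, 1), keepdims=True)
            # Route dout to the selected positions; += handles overlapping windows.
            dx[hs:hs + k, ws:ws + k, :, :] += mask * dout[i, j]
    return dx


x = np.arange(16, dtype=np.float64).reshape(4, 4, 1, 1)
dout = np.array([[1.0, 2.0], [3.0, 4.0]]).reshape(2, 2, 1, 1)
print(max_pool_backward_naive(dout, x, k=2, stride=2)[:, :, 0, 0])
# [[0. 0. 0. 0.]
#  [0. 1. 0. 2.]
#  [0. 0. 0. 0.]
#  [0. 3. 0. 4.]]
```

The result is zero everywhere except at the positions that held the maximum of each window, which receive the corresponding value of dout.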
In a future chapter we will learn a technique that improves convolution performance; until then we will stick with the naive implementation.
In the next chapter we will learn about the Batch Norm layer.