The previous methods of object detection all share one thing in common: they have one part of their network dedicated to providing region proposals followed by a high quality classifier to classify these proposals. These methods are very accurate but come at a big computational cost (low frame-rate), in other words they are not fit to be used on embedded devices.
Another way of doing object detection is by combining these two tasks into one network. We can do this by instead of having a network produce proposals we instead have a set of pre-defined boxes to look for objects.
Using convolutional features maps from later layers of a network we run small CONV filters over these features maps to predict class scores and bounding box offsets.
One way to reuse the computation that is already made during classification to localize objects is to grab activations from the final conv layers. At this point we still have spatial information but represented on a smaller version. For example an input image of size 640x480x3 passing into an inception model will have it's spatial information compressed into a 13x18x2048 size on it's final layers.
What happens is that on the final layers each "pixel" represent a larger area of the input image so we can use those cells to infer the object position. One thing to pay attention is that even though we are squeezing the image to a lower spatial dimension, the tensor is quite deep, so not much information is lost. (This is not entirely true when using pooling layers).
At this point imagine that you could use a 1x1 CONV layer to classify each cell as a class (ex: Pedestrian/Background), also from the same layer you could attach another CONV or FC layer to predict 4 numbers (Bounding box). In this way you get both class scores and location from one.
One common mistake is to think that we're actually dividing the input image into a grid, this does not happen! What actually happens is that each layer represent the input image with few spatial data but with bigger depth. On training time we will do some sort of matching between our ground truth and virtual cells. Also those cells will actually overlap they are not perfectly tiled.
Also regarding the number of detection, each one of those cells could detect an object. So the output of this model could be 13x18 detections.
One of the things that may be difficult to understand at first is how the detection system will convert the cells to an actual bounding box that fit's above the object.
Here is the family of object detectors that follow this strategy:
SSD: Uses different activation maps (multiple-scales) for prediction of classes and bounding boxes
YOLO: Uses a single activation map for prediction of classes and bounding boxes
R-FCN(Region based Fully-Convolution Neural Networks): Like Faster Rcnn (400ms), but faster (170ms) due to less computation per box also it's Fully Convolutional (No FC layer)
Using multiple scales helps to achieve a higher mAP(mean average precision) by being able to detect objects with different sizes on the image better.
Summarising the strategy of these methods
Train a CNN with regression(bounding box) and classification objective (loss function).
Normally their loss functions are more complex because it has to manage multiple objectives (classification, regression, check if there is an object or not)
Gather Activation from a particular layer (or layers) to infer classification and location with a FC layer or another CONV layer that works like a FC layer.
During prediction use algorithms like non-maxima suppression to filter multiple boxes around same object.
During training time use algorithms like IoU to relate the predictions during training the the ground truth.
On this kind of detector it is typical to have a collection of boxes overlaid on the image at different spatial locations, scales and aspect ratios that act as “anchors” (sometimes called “priors” or “default boxes”).