Yolo
This detector is a little less precise (this was improved in v2), but it is really fast. This chapter explains how it works and gives reference working code in TensorFlow.
Main idea
The idea of this detector is that you run the image through a CNN model and get the detections in a single pass. First the image is resized to 448x448, then it is fed to the network, and finally the output is filtered by a non-max suppression algorithm.

Model Yolo:
The tiny version is composed of 9 convolution layers with leaky ReLU activations. Observe that after maxpool6 the 448x448 input image becomes a 7x7 feature map (448 / 2^6 = 7).
The output of this model is a tensor of shape [batch size, 7, 7, 30]. In this tensor the following information is encoded for each cell:
2 box definitions (each consisting of: x, y, width, height, "is object" confidence)
20 class probabilities (only considered if the "is object" confidence is high)
In general the output has shape S x S x (B*5 + C), where:
S: Tensor spatial dimension (7 in this case)
B: Number of bounding boxes per cell (each with x, y, w, h, confidence)
C: Number of classes
Here "is object" is the probability that a box contains any object (rather than background). During training, if a particular cell is not over some object, we set its "is object" target to zero.
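The encoding described above can be unpacked with a few slices. The following is a minimal sketch, assuming a per-cell channel layout of B boxes (x, y, w, h, confidence) followed by C class probabilities; that ordering and the name `split_output` are assumptions made for illustration, and a real implementation may order the channels differently.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

def split_output(output):
    """Split one (S, S, B*5 + C) prediction into its three parts.

    Assumed (hypothetical) layout per cell:
    [x, y, w, h, conf] repeated B times, followed by C class probabilities.
    """
    assert output.shape == (S, S, B * 5 + C)
    boxes_conf = output[..., :B * 5].reshape(S, S, B, 5)
    boxes = boxes_conf[..., :4]         # (S, S, B, 4): x, y, w, h
    confidence = boxes_conf[..., 4]     # (S, S, B): "is object" confidence
    class_probs = output[..., B * 5:]   # (S, S, C): class probabilities
    return boxes, confidence, class_probs

# Random values stand in for a real network output.
output = np.random.rand(S, S, B * 5 + C).astype(np.float32)
boxes, confidence, class_probs = split_output(output)
print(boxes.shape, confidence.shape, class_probs.shape)
```

With S=7, B=2 and C=20 the last dimension is 2*5 + 20 = 30, which matches the 7x7x30 output described above.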
What this 7x7 tensor represents
This 7x7 tensor can be considered as a 7x7 grid representing the input image, where each cell of this tensor will hold the 2 box definitions and 20 class probabilities.

It's also useful to note that each cell predicts a probability for each of the 20 classes (and has 2 bounding boxes).
Combining these class probabilities with the "is object" confidence of each bounding box tells us which class, if any, was detected.
The logic is that if there is an object on that cell, we decide which object it is by taking the largest class probability of that cell.
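As a rough illustration of this logic, the sketch below multiplies each box confidence by the class probabilities of its cell (the class-specific score used in the YOLO paper) and picks the most likely class per cell. The function name `class_scores` and the toy values are assumptions made for the example.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

def class_scores(confidence, class_probs):
    """Combine box confidences (S, S, B) with class probabilities (S, S, C).

    Returns scores of shape (S, S, B, C): one score per box and class.
    """
    return confidence[..., :, np.newaxis] * class_probs[..., np.newaxis, :]

# Toy prediction: one confident box in cell (3, 4) whose cell prefers class 11.
confidence = np.zeros((S, S, B))
class_probs = np.zeros((S, S, C))
confidence[3, 4, 1] = 0.9
class_probs[3, 4, 11] = 0.8

scores = class_scores(confidence, class_probs)
best_class = class_probs.argmax(axis=-1)  # (S, S): most likely class per cell
print(scores.shape, best_class[3, 4], scores[3, 4, 1, 11])
```

Because the class probabilities belong to the cell and not to a box, both boxes of a cell share the same best class and differ only in their confidence.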

Filtering results
At prediction time the model outputs 2 candidate boxes for each of the 7x7 cells (98 boxes in total), most of them with a low confidence.
Finally, by using thresholding and non-maxima suppression, we can filter out the boxes that are not valid detections.
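The thresholding step can be sketched as follows. This is only an illustration: the function name `filter_by_score` and the 0.2 threshold are assumptions, not values taken from the paper, and the surviving boxes would still go through the non-maxima suppression described later in this chapter.

```python
import numpy as np

def filter_by_score(boxes, scores, threshold=0.2):
    """Keep boxes whose best class score reaches the threshold.

    boxes:  (N, 4) array, one row per candidate box
    scores: (N, C) array of per-class scores for each box
    threshold: hypothetical cut-off, tune it for your data
    """
    class_ids = scores.argmax(axis=1)      # best class of each box
    best_scores = scores.max(axis=1)       # score of that class
    keep = best_scores >= threshold        # boolean mask of surviving boxes
    return boxes[keep], class_ids[keep], best_scores[keep]

# Three candidate boxes and two classes; the second box is too weak to keep.
boxes = np.array([[10, 10, 50, 50], [12, 11, 52, 49], [200, 200, 260, 300]], dtype=float)
scores = np.array([[0.05, 0.70], [0.01, 0.10], [0.60, 0.02]])
kept_boxes, kept_classes, kept_scores = filter_by_score(boxes, scores)
print(kept_boxes, kept_classes, kept_scores)
```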
Training phase
Steps:
Look for the cell that contains the center of the ground truth bounding box. (Matching phase)
For that cell, check which of its bounding boxes overlaps most with the ground truth (IoU). That box becomes responsible for the prediction, and the confidence of the box that overlaps less is decreased. (Each bounding box has its own confidence)
Decrease the confidence of all bounding boxes of every cell that has no object. Don't adjust the box coordinates or class probabilities of those cells.
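The matching steps above can be sketched in plain Python. The helper names (`responsible_cell`, `responsible_box`, `iou`) and the corner-based box format (x_min, y_min, x_max, y_max) in pixels are assumptions made for this example.

```python
S = 7             # grid size
IMAGE_SIZE = 448  # network input size in pixels

def responsible_cell(box, image_size=IMAGE_SIZE, grid=S):
    """Return (row, col) of the grid cell that contains the box center."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    cell = image_size / grid
    col = min(int(cx // cell), grid - 1)  # clamp centers on the right/bottom edge
    row = min(int(cy // cell), grid - 1)
    return row, col

def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def responsible_box(pred_boxes, gt_box):
    """Index of the predicted box that overlaps most with the ground truth."""
    return max(range(len(pred_boxes)), key=lambda j: iou(pred_boxes[j], gt_box))

gt = (100, 150, 300, 400)                         # ground truth box
preds = [(90, 140, 310, 390), (0, 0, 100, 100)]   # the 2 boxes of the matched cell
cell = responsible_cell(gt)
best = responsible_box(preds, gt)
print(cell, best)
```

Only the box returned by `responsible_box` would receive coordinate updates for this ground truth; the other box of the cell is treated like a box without an object.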
Pre-train
The paper mentions that before training for object detection, they modified the network (adding average pooling, FC and softmax layers) and trained it for classification on the ImageNet dataset for one week (until they got a good top-5 error). Later they added more conv layers and the FC layers responsible for detection.
Other details
Pre-trained on Imagenet
Use lots of augmentation
Use SGD to train
Evaluated on Pascal VOC
135 Epochs, batch size: 64
Momentum 0.9
Random scaling and translations of up to 20% of the original image size
Color exposure/saturation augmentation
Loss Function
The multi-part loss function that we want to optimize takes into account the following objectives:
Classification (20 classes)
Object/No object classification
Bounding box coordinates (x,y,height,width) regression (4 scalars)
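Each of these sub-objectives uses a sum-squared error. In addition, two factors, λ_coord and λ_noobj, are used to re-balance the objectives: λ_coord (5 in the paper) increases the weight of the box coordinate loss and λ_noobj (0.5 in the paper) decreases the weight of the confidence loss for boxes that contain no object.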
Some other points to observe:
The classification loss is not backpropagated if the cell has no object
Only the bounding box with the highest IoU (Intersection over Union) with the ground truth contributes to the box coordinate loss
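For reference, the loss as given in the original YOLO paper can be written as below. Symbols with and without a hat are the prediction and ground-truth counterparts of the same quantity, and $\mathbb{1}_{ij}^{noobj}$ is the complement of $\mathbb{1}_{ij}^{obj}$.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on width and height make the same absolute error count more for small boxes than for large ones.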
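where
B: Number of bounding boxes per cell (2)
$x, y, w, h, C$: Box definition (center, width, height and confidence)
$p_i(c)$: Probability of some particular class $c$ in cell $i$
S: Grid size (7)
$\mathbb{1}_{i}^{obj}$: 1 if an object appears in cell $i$, otherwise 0
$\mathbb{1}_{ij}^{obj}$: 1 if bounding box $j$ of cell $i$ is responsible for the prediction, otherwise 0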
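The terms of this loss can be sketched in NumPy for a single image. This is an illustrative sketch, not the reference implementation: the function name `yolo_loss`, the argument layout and the idea of passing the responsibility mask in as an input are assumptions, while the weights 5 and 0.5 are the values reported in the YOLO paper.

```python
import numpy as np

LAMBDA_COORD = 5.0  # weight of the box coordinate terms (value from the paper)
LAMBDA_NOOBJ = 0.5  # weight of the confidence term for boxes without object

def yolo_loss(pred_boxes, pred_conf, pred_cls,
              true_boxes, true_conf, true_cls, resp_mask):
    """Sum-squared multi-part loss for one image.

    pred_boxes, true_boxes: (S, S, B, 4) holding x, y, w, h (w, h >= 0)
    pred_conf, true_conf:   (S, S, B) "is object" confidences
    pred_cls, true_cls:     (S, S, C) class probabilities
    resp_mask:              (S, S, B), 1 where box j of cell i is responsible
    """
    obj_cell = resp_mask.max(axis=-1)  # (S, S): 1 if the cell contains an object
    noobj_mask = 1.0 - resp_mask       # boxes not responsible for any object

    xy_err = np.sum((pred_boxes[..., :2] - true_boxes[..., :2]) ** 2, axis=-1)
    wh_err = np.sum((np.sqrt(pred_boxes[..., 2:]) - np.sqrt(true_boxes[..., 2:])) ** 2, axis=-1)
    conf_err = (pred_conf - true_conf) ** 2
    cls_err = np.sum((pred_cls - true_cls) ** 2, axis=-1)

    coord_loss = LAMBDA_COORD * np.sum(resp_mask * (xy_err + wh_err))
    obj_loss = np.sum(resp_mask * conf_err)
    noobj_loss = LAMBDA_NOOBJ * np.sum(noobj_mask * conf_err)
    class_loss = np.sum(obj_cell * cls_err)  # only cells that contain an object
    return coord_loss + obj_loss + noobj_loss + class_loss

S, B, C = 7, 2, 20
true_boxes = np.zeros((S, S, B, 4)); true_conf = np.zeros((S, S, B))
true_cls = np.zeros((S, S, C)); resp_mask = np.zeros((S, S, B))
# One object in cell (3, 4), assigned to box 0, class 11.
resp_mask[3, 4, 0] = 1.0
true_boxes[3, 4, 0] = [0.5, 0.5, 0.25, 0.25]
true_conf[3, 4, 0] = 1.0
true_cls[3, 4, 11] = 1.0

perfect = yolo_loss(true_boxes, true_conf, true_cls, true_boxes, true_conf, true_cls, resp_mask)
empty = yolo_loss(np.zeros_like(true_boxes), np.zeros_like(true_conf), np.zeros_like(true_cls),
                  true_boxes, true_conf, true_cls, resp_mask)
print(perfect, empty)
```

A prediction equal to the target gives a loss of zero, while an all-zero prediction is penalized through the coordinate, confidence and class terms of the single cell that contains the object.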
Intersect over Union (IoU)
IoU is a method used to evaluate how well an object detection output matches some ground truth. It is normally used during training and testing, by comparing how much the bounding box given during prediction overlaps with the ground truth (training/test data) bounding box.

Calculating the IoU is simple: we divide the area of overlap between the two boxes by the area of their union.
# Calculate the Intersection over Union between boxes b1 and b2. Here each box is defined by 2 points:
# box(startX, startY, endX, endY). There are other definitions, e.g. box(x, y, width, height).
def calc_iou(b1, b2):
    # determine the (x, y)-coordinates of the intersection rectangle
    xA = max(b1[0], b2[0])
    yA = max(b1[1], b2[1])
    xB = min(b1[2], b2[2])
    yB = min(b1[3], b2[3])
    # compute the area of the intersection rectangle; clamp each side to zero
    # so that boxes that do not overlap get an intersection of 0
    area_intersect = max(0, xB - xA + 1) * max(0, yB - yA + 1)
    # calculate the area of each box (the +1 treats coordinates as inclusive pixel indices)
    area_b1 = (b1[2] - b1[0] + 1) * (b1[3] - b1[1] + 1)
    area_b2 = (b2[2] - b2[0] + 1) * (b2[3] - b2[1] + 1)
    # compute the intersection over union by taking the intersection
    # area and dividing it by the sum of prediction + ground-truth
    # areas - the intersection area
    iou = area_intersect / float(area_b1 + area_b2 - area_intersect)
    # return the intersection over union value
    return iou
Another way to calculate the IoU, with numpy:
import numpy as np

def calc_iou(xy_min1, xy_max1, xy_min2, xy_max2):
    # Each box is given by two numpy arrays: xy_min = (startX, startY), xy_max = (endX, endY)
    # Get areas
    areas_1 = np.multiply.reduce(xy_max1 - xy_min1)
    areas_2 = np.multiply.reduce(xy_max2 - xy_min2)
    # determine the (x, y)-coordinates of the intersection rectangle
    _xy_min = np.maximum(xy_min1, xy_min2)
    _xy_max = np.minimum(xy_max1, xy_max2)
    # width and height of the intersection, clamped to zero when the boxes do not overlap
    _wh = np.maximum(_xy_max - _xy_min, 0)
    # compute the area of intersection rectangle
    _areas = np.multiply.reduce(_wh)
    # return the intersection over union value (the 1e-10 avoids a division by zero)
    return _areas / np.maximum(areas_1 + areas_2 - _areas, 1e-10)
Non-Maxima Suppression (nms)
At prediction time (after training) you may have lots of box predictions around a single object. The NMS algorithm filters out the boxes that overlap each other by more than some IoU threshold, keeping the one with the highest confidence.

Here we have an example with numpy and python (it uses the numpy version of calc_iou defined above):
def non_max_suppress(conf, xy_min, xy_max, threshold=.4):
    # conf is a 3D array whose last dimension holds the per-class confidences;
    # xy_min and xy_max hold the matching box corners
    _, _, classes = conf.shape
    # List Comprehension
    # https://www.youtube.com/watch?v=HobjHIpLhZk
    # https://www.youtube.com/watch?v=Q7EYKuZJfdA
    boxes = [(_conf, _xy_min, _xy_max) for _conf, _xy_min, _xy_max in zip(conf.reshape(-1, classes), xy_min.reshape(-1, 2), xy_max.reshape(-1, 2))]
    # Iterate each class
    for c in range(classes):
        # Sort boxes by the confidence of class c, highest first
        boxes.sort(key=lambda box: box[0][c], reverse=True)
        # Iterate each box
        for i in range(len(boxes) - 1):
            box = boxes[i]
            # Skip boxes that were already suppressed for this class
            if box[0][c] == 0:
                continue
            for _box in boxes[i + 1:]:
                # Take iou threshold into account: suppress lower-scoring boxes that overlap too much
                if calc_iou(box[1], box[2], _box[1], _box[2]) >= threshold:
                    _box[0][c] = 0
    return boxes
Yolo v2
The Yolo detector has since been improved. The main improvements are:
Faster
More accurate (73.4 mAP (mean average precision over all classes) on the Pascal dataset)
Can detect up to 9000 classes (before it was 20)
What they did to improve:
Added Batchnorm
Pre-train on ImageNet at two resolutions, first 224x224 and then 448x448, and only afterwards train for detection.
They now use anchor boxes like Faster-RCNN; the classification is done per anchor box instead of per grid cell.
Instead of manually choosing the box shapes, they use K-means to get box shapes based on the data.
Train the network at multiple scales. As the network is now fully convolutional (no FC layer), this is easy to do.
They train on both ImageNet and MS-COCO.
They create a new mechanism to train on datasets that don't have detection data, by selecting which parts of the multi-part loss function to backpropagate.
They use WordTree to combine data from various sources, and a joint optimization technique to train simultaneously on ImageNet and COCO.
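The K-means step for choosing box shapes can be sketched as follows. As in the YOLO v2 paper, boxes are compared by IoU rather than by Euclidean distance (distance = 1 - IoU), using only their width and height. The function names, the toy data, the seed and the iteration count are assumptions made for this example.

```python
import numpy as np

def wh_iou(wh, centroids):
    """IoU between (N, 2) box sizes and (K, 2) centroids, all sharing one corner."""
    inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0]) *
             np.minimum(wh[:, None, 1], centroids[None, :, 1]))
    area_wh = (wh[:, 0] * wh[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_wh + area_c - inter)  # (N, K)

def kmeans_anchors(wh, k, iterations=100, seed=0):
    """Cluster (width, height) pairs into k anchor shapes using 1 - IoU as distance."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iterations):
        # The nearest centroid is the one with the highest IoU.
        assign = np.argmax(wh_iou(wh, centroids), axis=1)
        new_centroids = centroids.copy()
        for c in range(k):
            members = wh[assign == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids

# Toy ground truth sizes: three small boxes and three tall, large boxes.
wh = np.array([[10, 10], [12, 12], [11, 9],
               [100, 200], [110, 190], [90, 210]], dtype=float)
anchors = kmeans_anchors(wh, k=2)
print(anchors)
```

Using IoU as the distance keeps large boxes from dominating the clustering, which would happen with a Euclidean distance on raw widths and heights.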