Yolo
This detector is a little less precise than other detectors (this improved in v2), but it is very fast. This chapter explains how it works and gives reference working code in TensorFlow.
The idea of this detector is that you run the image through a CNN model and get the detections in a single pass. First the image is resized to 448x448, then it is fed to the network, and finally the output is filtered by a non-max suppression algorithm.
The tiny version is composed of 9 convolution layers with leaky ReLU activations. Observe that after maxpool6 the 448x448 input image becomes a 7x7 feature map, since each of the 6 max-pooling layers halves the spatial size (448 / 2^6 = 7).
2 box definitions (each consisting of: x, y, width, height, "is object" confidence)
20 class probabilities (only considered if the "is object" confidence is high)
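The output tensor has shape $S \times S \times (B \cdot 5 + C)$, which here is 7x7x(2*5 + 20) = 7x7x30, where:
S: Tensor spatial dimension (7 in this case)
B: Number of bounding boxes per cell (each with x, y, w, h, confidence)
C: Number of classes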
This 7x7 tensor can be considered as a 7x7 grid representing the input image, where each cell of this tensor will hold the 2 box definitions and 20 class probabilities.
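As a minimal sketch of this layout, the tensor can be split with numpy. The per-cell ordering used here (two boxes of x, y, w, h, confidence, followed by the 20 class probabilities) and the helper name `split_yolo_output` are assumptions for illustration; the real ordering depends on the implementation.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes


def split_yolo_output(output):
    """Split one S x S x (B*5 + C) prediction into boxes and class probabilities.

    Assumed (hypothetical) per-cell layout: [box_1 (5), box_2 (5), classes (C)].
    """
    assert output.shape == (S, S, B * 5 + C)
    boxes = output[..., :B * 5].reshape(S, S, B, 5)  # x, y, w, h, confidence
    class_probs = output[..., B * 5:]                # shape (S, S, C)
    return boxes, class_probs


# Random values stand in for a real network output.
prediction = np.random.rand(S, S, B * 5 + C).astype(np.float32)
boxes, class_probs = split_yolo_output(prediction)
print(boxes.shape, class_probs.shape)  # (7, 7, 2, 5) (7, 7, 20)
```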
It is also useful to note that each cell holds the probability of belonging to each of the 20 classes (and each cell has 2 bounding boxes).
Combined with the fact that each bounding box carries an "is object" confidence telling whether it lies over an object or not, this information helps to detect the class of the object.
The logic is that if there is an object in that cell, we decide which object it is by taking the largest class probability of that cell.
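A minimal sketch of this logic for a single cell is shown below. The function name `cell_detection` and the `threshold` value are hypothetical; the score is the box confidence multiplied by the class probability, which is how the YOLO paper defines the class-specific confidence.

```python
import numpy as np


def cell_detection(box_confidences, class_probs, threshold=0.2):
    """Return (class_index, score) for one cell, or None if the score is too low.

    box_confidences: "is object" confidence of each box in the cell.
    class_probs: class probabilities of the cell.
    threshold: hypothetical cut-off on the final score.
    """
    best_box = int(np.argmax(box_confidences))
    best_class = int(np.argmax(class_probs))
    score = float(box_confidences[best_box] * class_probs[best_class])
    if score < threshold:
        return None
    return best_class, score


# Example cell: the second box is confident and class 7 dominates.
probs = np.full(20, 0.1 / 19)
probs[7] = 0.9
print(cell_detection(np.array([0.1, 0.8]), probs))
```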
At the end of the model at prediction time you will have something like this:
Training steps:
Find which cell contains the center of the ground-truth bounding box. (Matching phase)
For that cell, check which of its bounding boxes overlaps most with the ground truth (IoU), then decrease the confidence of the bounding box that overlaps less. (Each bounding box has its own confidence.)
Decrease the confidence of all bounding boxes in every cell that contains no object. Do not adjust the box coordinates or class probabilities of those cells.
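The matching phase above can be sketched as follows, assuming ground-truth boxes are given in pixels as (x_min, y_min, x_max, y_max) on the resized 448x448 image. The function name `responsible_cell` and the box format are assumptions for illustration.

```python
def responsible_cell(box, image_size=448, grid_size=7):
    """Return (row, col) of the grid cell that contains the box center.

    box: ground-truth box as (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = box
    center_x = (x_min + x_max) / 2.0
    center_y = (y_min + y_max) / 2.0
    cell_size = image_size / grid_size  # 64 pixels for 448 / 7
    # Clamp so a center exactly on the right/bottom edge stays inside the grid.
    col = min(int(center_x // cell_size), grid_size - 1)
    row = min(int(center_y // cell_size), grid_size - 1)
    return row, col


print(responsible_cell((100, 200, 300, 400)))  # center (200, 300) -> (4, 3)
```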
The paper mentions that, before training for object detection, the authors modified the network (adding average pooling, FC and softmax layers) and trained it for classification on the ImageNet dataset for one week (until they got a good top-5 error). Later they added more conv layers and the FC layer responsible for detection.
Pre-trained on Imagenet
Use lots of augmentation
Use SGD to train
Evaluated on Pascal VOC
135 Epochs, batch size: 64
Momentum 0.9
Random scaling and translations of up to 20% of the original image size
Color exposure/saturation augmentation
Here is the multi-part loss function that we want to optimize. This loss function takes into account the following objectives:
Classification (20 classes)
Object/No object classification
Bounding box coordinates (x,y,height,width) regression (4 scalars)
Some other points to observe:
The classification loss is not backpropagated if the cell has no object
Only the loss of the bounding box with the highest IoU (intersection over union) with the ground truth is backpropagated
B: Number of bounding boxes (2)
S: Grid size (7)
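For reference, the multi-part loss described above is written in the original YOLO paper (Redmon et al.) in the following form. The weights $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$ are the values reported in that paper; hatted symbols denote predictions.

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2 \\
& + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
$$

The first two terms are the bounding box regression, the third and fourth are the object/no object classification, and the last one is the class probability term.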
Intersection over Union (IoU) is a method used to evaluate how well an object detection output matches some ground truth. The IoU is normally used during training and testing, by comparing how the bounding box given during prediction overlaps with the ground-truth (training/test data) bounding box.
Calculating the IoU is simple: we divide the overlap area between the boxes by the area of their union.
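A minimal sketch of this calculation in plain Python, assuming boxes are given as (x_min, y_min, x_max, y_max); the function name `iou` and the box format are assumptions for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlap rectangle.
    inter_x_min = max(box_a[0], box_b[0])
    inter_y_min = max(box_a[1], box_b[1])
    inter_x_max = min(box_a[2], box_b[2])
    inter_y_max = min(box_a[3], box_b[3])
    # Width/height are clamped to zero when the boxes do not overlap.
    inter_w = max(0.0, inter_x_max - inter_x_min)
    inter_h = max(0.0, inter_y_max - inter_y_min)
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7 -> about 0.143
```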
Another way to calculate the IoU with numpy
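The following is one possible vectorized sketch, comparing a single box against many boxes at once. It assumes the (x_min, y_min, x_max, y_max) format and boxes with positive area; the function name `iou_one_to_many` is hypothetical.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box (4,) and N boxes (N, 4), as (x_min, y_min, x_max, y_max)."""
    box = np.asarray(box, dtype=np.float64)
    boxes = np.asarray(boxes, dtype=np.float64)
    # Overlap rectangle corners, computed for all N boxes at once.
    top_left = np.maximum(box[:2], boxes[:, :2])
    bottom_right = np.minimum(box[2:], boxes[:, 2:])
    inter_wh = np.clip(bottom_right - top_left, 0.0, None)
    intersection = inter_wh[:, 0] * inter_wh[:, 1]
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area_box + area_boxes - intersection
    return intersection / union


candidates = [[1, 1, 3, 3], [0, 0, 2, 2], [5, 5, 6, 6]]
print(iou_one_to_many((0, 0, 2, 2), candidates))
```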
At prediction time (after training) you may have lots of box predictions around a single object. The non-max suppression (NMS) algorithm filters out boxes whose score is below some threshold, as well as boxes that overlap too much with a higher-scoring box.
Here we have an example with numpy and python
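The block below is a minimal sketch of greedy NMS, not necessarily the code originally used in this chapter. The function name, the (x_min, y_min, x_max, y_max) box format and both threshold values are assumptions for illustration.

```python
import numpy as np


def non_max_suppression(boxes, scores, iou_threshold=0.5, score_threshold=0.2):
    """Greedy NMS. boxes: (N, 4) as (x_min, y_min, x_max, y_max); scores: (N,).

    Returns the indices (into the input arrays) of the boxes that are kept.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    # Drop low-score boxes, then sort the rest by descending score.
    candidates = np.where(scores >= score_threshold)[0]
    order = candidates[np.argsort(-scores[candidates])]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # IoU between the best box and all remaining boxes.
        top_left = np.maximum(boxes[best, :2], boxes[rest, :2])
        bottom_right = np.minimum(boxes[best, 2:], boxes[rest, 2:])
        inter_wh = np.clip(bottom_right - top_left, 0.0, None)
        intersection = inter_wh[:, 0] * inter_wh[:, 1]
        iou = intersection / (areas[best] + areas[rest] - intersection)
        # Keep only boxes that do not overlap too much with the best one.
        order = rest[iou <= iou_threshold]
    return keep


boxes = [[10, 10, 110, 110], [12, 12, 112, 112], [200, 200, 300, 300], [0, 0, 5, 5]]
scores = [0.9, 0.8, 0.7, 0.1]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

In the example, box 1 is suppressed because it overlaps heavily with the higher-scoring box 0, and box 3 is dropped by the score threshold.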
The Yolo detector has been improved recently. Its main improvements are:
Faster
More accurate (73.4 mAP, mean average precision over all classes, on the Pascal dataset)
Can detect up to 9000 classes (Before was 20)
What they did to improve:
Added Batchnorm
Pre-train on ImageNet at multiple scales, first (224x224) then (448x448), and only afterwards train for detection
Now they use anchor boxes like Faster R-CNN; the classification is done per box shape instead of per grid cell
Instead of manually choosing the box shapes, they use K-means to derive box shapes from the data
Train the network at multiple scales; since the network is now fully convolutional (no FC layer), this is easy to do
They train on both ImageNet and MS-COCO
They created a new mechanism to train on datasets that don't have detection data, by selecting which parts of the multi-part loss function to backpropagate
They use WordTree to combine data from various sources, and a joint optimization technique to train simultaneously on ImageNet and COCO
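To illustrate the K-means step mentioned in the list above, here is a minimal sketch that clusters box (width, height) pairs using the distance d = 1 - IoU, which is the distance described in the YOLOv2 paper. The function names, the random initialization and the iteration count are assumptions for illustration, not the authors' code.

```python
import numpy as np


def wh_iou(boxes_wh, centroids_wh):
    """IoU between (N, 2) boxes and (K, 2) centroids, all aligned at the same corner."""
    inter_w = np.minimum(boxes_wh[:, None, 0], centroids_wh[None, :, 0])
    inter_h = np.minimum(boxes_wh[:, None, 1], centroids_wh[None, :, 1])
    intersection = inter_w * inter_h
    box_area = (boxes_wh[:, 0] * boxes_wh[:, 1])[:, None]
    centroid_area = (centroids_wh[:, 0] * centroids_wh[:, 1])[None, :]
    return intersection / (box_area + centroid_area - intersection)


def kmeans_anchors(boxes_wh, k, iterations=100, seed=0):
    """Cluster (width, height) pairs into k anchor shapes with distance 1 - IoU."""
    boxes_wh = np.asarray(boxes_wh, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), size=k, replace=False)]
    for _ in range(iterations):
        # Smallest distance (1 - IoU) is the same as largest IoU.
        assignment = np.argmax(wh_iou(boxes_wh, centroids), axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = boxes_wh[assignment == j]
            if len(members) > 0:  # leave empty clusters where they are
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids


# Toy data: two small boxes and two large ones.
anchors = kmeans_anchors([[10, 10], [12, 12], [100, 200], [110, 190]], k=2)
print(anchors[np.argsort(anchors.prod(axis=1))])
```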
The output of this model is a tensor of shape (batch size)x7x7x30, which encodes the 2 box definitions and the 20 class probabilities of every cell.
Here "is object" is the probability that a box contains any object (as opposed to background). If during training a particular cell is not over any object, we set "is object" to zero.
Finally, by using thresholding and non-maxima suppression, we can filter out boxes that are not valid detections.
Each of these sub-objectives uses a sum-squared error. In addition, the factors $\lambda_{coord}$ and $\lambda_{noobj}$ are used to weight the box coordinates and the object/no object classification objectives differently.
where
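$x_i, y_i, w_i, h_i, C_i$: Box definition
$p_i(c)$: Probability of some particular class $c$ in cell $i$
$\mathbb{1}_{i}^{obj}$: 1 if an object appears in cell $i$; if it does not appear it is zero
$\mathbb{1}_{ij}^{obj}$: 1 if bounding box $j$ of cell $i$ is responsible for the prediction, otherwise zero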