The SSD detector differs from others single shot detectors due to the usage of multiple layers that provide a finer accuracy on objects with different scales. (Each deeper layer will see bigger objects).
The SSD normally start with a VGG on Resnet pre-trained model that is converted to a fully convolution neural network. Then we attach some extra conv layers, which will actually help to handle bigger objects. The SSD architecture can in principle be used with any deep network base model.
One important point to notice is that after the image is passed on the VGG network, some conv layers are added producing feature maps of sizes 19x19, 10x10, 5x5, 3x3, 1x1. These, together with the 38x38 feature map produced by VGG’s conv4_3, are the feature maps which will be used to predict bounding boxes.
There the conv4_3 is responsible to detect the smallest objects while the conv11_2 is responsible for the biggest objects.
As shown on the diagram some activations are "grabbed" from the network and passed to a specialized sub-network that should work as a classifier and localizer. During prediction we use a Non-maxima suppression algorithm to filter the multiple boxes per object that may appear.

Anchor(Priors or Default boxes) concept

Anchors are a collection of boxes overlaid on the image at different spatial locations, scales and aspect ratios that act as reference points on the ground truth images. It's like the Yolo idea where each cell on the activation map has multiple boxes.
A model is then trained to make two predictions for each anchor:
  1. 1.
    A discrete class prediction for each anchor
  2. 2.
    A continuous prediction of an offset by which the anchor needs to be shifted to fit the ground-truth bounding box
During training SSD matches objects with _default boxes _of different aspects. Each element of the feature map (cell) has a number of default boxes associated with it. Any default box with an IoU (Jaccard index) greater than 0.5 is considered a match.
Consider the image above, observe that the cat is has 2 boxes that match on the 8x8 feature map, but none on the dog. Now on the 4x4 feature map there is one box that matches the dog.
It is important to note that the boxes in the 8x8 feature map are smaller than those in the 4x4 feature map: SSD grab some feature maps, each responsible for a different scale of objects, allowing it to identify objects across a large range of scales.
For each default box on each cell the network output the following:
  • A probability vector of length c, where c are the number of classes plus the background class that indicates no object.
  • A vector with 4 elements (x,y,width,height) representing the offset required move the default box position to the real object.

Multibox Loss Function

During training we minimize a combined classification and regression loss.
As Yolo the SSD loss balance the classification objective and the localization objective.

Localization Loss

Classification Loss

Bellow we have the forward propagation of this loss using PyTorch
def forward(self, predictions, targets):
"""Multibox Loss
predictions (tuple): A tuple containing loc preds, conf preds,
and prior boxes from SSD net.
conf shape: torch.size(batch_size,num_priors,num_classes)
loc shape: torch.size(batch_size,num_priors,4)
priors shape: torch.size(num_priors,4)
ground_truth (tensor): Ground truth boxes and labels for a batch,
shape: [batch_size,num_objs,5] (last idx is the label).
loc_data, conf_data, priors = predictions
num = loc_data.size(0)
num_priors = (priors.size(0))
num_classes = self.num_classes
# match priors (default boxes) and ground truth boxes
loc_t = torch.Tensor(num, num_priors, 4)
conf_t = torch.LongTensor(num, num_priors)
for idx in range(num):
truths = targets[idx][:,:-1].data
labels = targets[idx][:,-1].data
defaults =
# Send localization and confidence to GPU if available
if GPU:
loc_t = loc_t.cuda()
conf_t = conf_t.cuda()
# wrap targets as Variables (Include on graph, to use autograd)
loc_t = Variable(loc_t, requires_grad=False)
conf_t = Variable(conf_t,requires_grad=False)
pos = conf_t > 0
num_pos = pos.sum()
# Localization Loss (Smooth L1)
# Shape: [batch,num_priors,4]
pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
loc_p = loc_data[pos_idx].view(-1,4)
loc_t = loc_t[pos_idx].view(-1,4)
loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)
# Compute max conf across batch for hard negative mining
batch_conf = conf_data.view(-1,self.num_classes)
loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1,1))
# Hard Negative Mining
loss_c[pos] = 0 # filter out pos boxes for now
loss_c = loss_c.view(num, -1)
_,loss_idx = loss_c.sort(1, descending=True)
_,idx_rank = loss_idx.sort(1)
num_pos = pos.long().sum(1)
num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)
neg = idx_rank < num_neg.expand_as(idx_rank)
# Confidence Loss Including Positive and Negative Examples
pos_idx = pos.unsqueeze(2).expand_as(conf_data)
neg_idx = neg.unsqueeze(2).expand_as(conf_data)
conf_p = conf_data[(pos_idx+neg_idx).gt(0)].view(-1,self.num_classes)
targets_weighted = conf_t[(pos+neg).gt(0)]
loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)
# Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
N =
return loss_l,loss_c
Bellow there are the same loss implemented on Tensorflow (Notice that the code is a bit more complicated)
def ssd_losses(logits, localisations,
gclasses, glocalisations, gscores,
with tf.name_scope(scope, 'ssd_losses'):
lshape = tfe.get_shape(logits[0], 5)
num_classes = lshape[-1]
batch_size = lshape[0]
# Flatten out all vectors!
flogits = []
fgclasses = []
fgscores = []
flocalisations = []
fglocalisations = []
for i in range(len(logits)):
flogits.append(tf.reshape(logits[i], [-1, num_classes]))
fgclasses.append(tf.reshape(gclasses[i], [-1]))
fgscores.append(tf.reshape(gscores[i], [-1]))
flocalisations.append(tf.reshape(localisations[i], [-1, 4]))
fglocalisations.append(tf.reshape(glocalisations[i], [-1, 4]))
# And concat the crap!
logits = tf.concat(flogits, axis=0)
gclasses = tf.concat(fgclasses, axis=0)
gscores = tf.concat(fgscores, axis=0)
localisations = tf.concat(flocalisations, axis=0)
glocalisations = tf.concat(fglocalisations, axis=0)
dtype = logits.dtype
# Compute positive matching mask...
pmask = gscores > match_threshold
fpmask = tf.cast(pmask, dtype)
n_positives = tf.reduce_sum(fpmask)
# Hard negative mining...
no_classes = tf.cast(pmask, tf.int32)
predictions = slim.softmax(logits)
nmask = tf.logical_and(tf.logical_not(pmask),
gscores > -0.5)
fnmask = tf.cast(nmask, dtype)
nvalues = tf.where(nmask,
predictions[:, 0],
1. - fnmask)
nvalues_flat = tf.reshape(nvalues, [-1])
# Number of negative entries to select.
max_neg_entries = tf.cast(tf.reduce_sum(fnmask), tf.int32)
n_neg = tf.cast(negative_ratio * n_positives, tf.int32) + batch_size
n_neg = tf.minimum(n_neg, max_neg_entries)
val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
max_hard_pred = -val[-1]
# Final negative mask.
nmask = tf.logical_and(nmask, nvalues < max_hard_pred)
fnmask = tf.cast(nmask, dtype)
# Add cross-entropy loss.
with tf.name_scope('cross_entropy_pos'):
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
loss = tf.div(tf.reduce_sum(loss * fpmask), batch_size, name='value')
with tf.name_scope('cross_entropy_neg'):
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
loss = tf.div(tf.reduce_sum(loss * fnmask), batch_size, name='value')
# Add localization loss: smooth L1, L2, ...
with tf.name_scope('localization'):
# Weights Tensor: positive mask + random negative.
weights = tf.expand_dims(alpha * fpmask, axis=-1)
loss = custom_layers.abs_smooth(localisations - glocalisations)
loss = tf.div(tf.reduce_sum(loss * weights), batch_size, name='value')