This is the paper proposed RetinaNet and also the focal loss function to better train object detection models. Object detection models are in two camps, there are two-stage proposal-driven models such as R-CNN, and one-stage detector such that YOLO and SSD. The paper claimed that the prior result on one-stage model is not as good as two-stage because of class imbalance during training.
In two-stage models, the first stage generates a sparse set of candidate proposals and the second stage classifies the proposals into classes. One-stage detectors, however, need to evaluate 10K to 100K candidate locations per image but only a few locations contain objects, whereas two-stage models would reduce this to 1K to 2K with 1:3 ratio of positive to negative proposals. The easy negatives can overwhelm training and lead to degenerate models. Therefore the solution is to down-weight easy examples (a.k.a. inliers) such that their contribution to total loss is small even if their number is large.
The loss function proposed is as follows:
\[\begin{aligned} p_t &= \begin{cases} p & \text{if }y=1\\ 1-p & \text{otherwise} \end{cases}\\ \text{FL}(p_t) &= -(1-p_t)^\gamma \log(p_t) \end{aligned}\]Which \(p\) is the predicted probability in classification and \(y\) is the one-hot target (0 or 1). Therefore \(p_t\) is the probability of truth, which ideally should be 1 if \(y=1\) and 0 otherwise. The cross entropy for a sample is therefore \(-\log(p_t)\), but the log function decays too gently as \(p_t\) approaches 1. The modulating factor \((1-p_t)^\gamma\) is to make the loss function decays faster. It degenerates into the simple cross entropy at \(\gamma=0\) and the paper suggested \(\gamma=2\) while it is found relatively robust in \(\gamma\in[0.5,5]\). This helps to make loss closer to zero once \(p_t\) is large enough (i.e. ,\(p_t>0.5\)).
To test this proposal, the paper suggested a RetinaNet model. It combines a backbone network with two task-specific subnetworks, such that:
- The backbone, such as ResNet, connects to feature pyramid network (FPN). The
ResNet backbone provides a bottom-up pathway to produce feature maps, and FPN
is the top-down pathway to combine feature maps of different resolution into a
multiscale feature pyramid
- RetinaNet use ResNet feature maps \(C_3\) to \(C_5\) with FPN to produce pyramid levels \(P_3\) to \(P_7\), which \(P_k\) has resolution \(1/2^k\) of the input
- ResNet residual stage \(C_3\) to \(C_5\) contributes to \(P_3\) to \(P_5\) using FPN top-down pathway; but \(P_6\) is obtained via a 3×3 stride-2 conv on \(P_5\), and \(P_7\) is from ReLU then another 3×3 stride-2 conv on \(P_6\)
- All pyramid levels have \(C=256\) channels
- Anchor boxes similar to that of RPN are produced from the FPN, with areas of
32² to 512² on \(P_3\) to \(P_7\) and aspect ratios 1:2, 1:1, 2:1. Optionally
at each layer, anchor boxes are scaled to the ratio of
\(2^0, 2^{1/3}, 2^{2/3}\) of its original size. In total \(A=3\times 3=9\)
anchors are produced at each spatial location.
- Each anchor is assigned a length \(K\) one-hot vector as classification target and length \(4K\) vector as bbox regression target
- anchor label: compare bbox to ground-truth using IoU, positive samples for \(>0.5\) and background for \(\in[0,0.4)\)
- Classification subnet: Take the input feature map of \(C=256\) channels at
each pyramid level \(P_k\), then
- pass through 4× of: 3×3 conv with \(C\) channels, then ReLU
- then 3×3 conv with \(KA\) channels, then sigmoid activation to produce binary classification of each anchor
- Regression subnet: Parallel to and same design as classification subnet,
except the output is length \(4A\) and with linear activation
- output length is \(4A\) instead of \(4KA\): the bounding box regression is class agnostic but found to be equally effective
- Model output \(A\) anchors per pyramid level and spatial location, but to
trade-off for speed, only decode the bbox of the top-scoring 1K predictions
- at training, focal loss applied to all 100K anchors per image, normalized by the number of anchors assigned to a ground-truth box
- detector confidence threshold: 0.05
- then run non-maximum suppression, with threshold of 0.5
- New layers in the model are initialized with:
- bias are zeroed, except the final conv layer for classification to use \(b=-\log\frac{1-\pi}{\pi}\) with \(\pi=0.01\) as the initial probability of finding a foreground class
- weights are Gaussian with \(\sigma=0.01\)
The network is trained with a minibatch size of 16 images, 90K steps, weight decay of 10e-4, momentum of 0.9, and initial learning rate of \(10^{-2}\) but tuned down to \(10^{-3}\) at step 60K and further to \(10^{-4}\) at step 80K. The overall loss function is the sum of focal loss for classification and smoothed L1 loss for regression.
Bibliographic data
@inproceedings{
title = "Focal Loss for Dense Object Detection",
author = "Tsung-Yi Lin and Priya Goyal and Ross Girshick and Kaiming He and Piotr Dollar",
booktitle = "Proc CVPR",
pages = "2980--2988",
year = "2017",
note = "arXiv:1708.02002",
}