This paper proposed RetinaNet together with the focal loss function to better train object detection models. Object detection models fall into two camps: two-stage, proposal-driven models such as R-CNN, and one-stage detectors such as YOLO and SSD. The paper argued that prior one-stage models were not as accurate as two-stage ones because of extreme class imbalance during training.

In two-stage models, the first stage generates a sparse set of candidate proposals and the second stage classifies the proposals into classes. One-stage detectors, however, must evaluate 10K to 100K candidate locations per image while only a few locations contain objects, whereas two-stage models reduce this to 1K to 2K proposals and sample minibatches at a fixed 1:3 ratio of positive to negative proposals. The easy negatives can overwhelm training and lead to degenerate models. The proposed remedy is to down-weight easy examples (the inliers) so that their contribution to the total loss stays small even when their number is large.

The loss function proposed is as follows:

\[\begin{aligned} p_t &= \begin{cases} p & \text{if }y=1\\ 1-p & \text{otherwise} \end{cases}\\ \text{FL}(p_t) &= -(1-p_t)^\gamma \log(p_t) \end{aligned}\]

Here \(p\) is the predicted probability in classification and \(y\) is the binary target (0 or 1). Therefore \(p_t\) is the probability assigned to the ground-truth class, which ideally should approach 1 whether \(y=1\) or \(y=0\). The cross entropy for a sample is then \(-\log(p_t)\), but this loss decays too gently as \(p_t\) approaches 1. The modulating factor \((1-p_t)^\gamma\) makes the loss decay faster. It degenerates into plain cross entropy at \(\gamma=0\); the paper suggested \(\gamma=2\) and found the results relatively robust for \(\gamma\in[0.5,5]\). This pushes the loss toward zero once \(p_t\) is large enough (i.e., \(p_t>0.5\), a well-classified example).
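As a concrete illustration, below is a minimal PyTorch sketch of the loss as defined above; the function name and the numerical clamp are my own, not from the paper.

```python
import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), computed elementwise.

    p: predicted probabilities (after sigmoid), in [0, 1]
    y: binary targets (0 or 1), same shape as p
    """
    # p_t is the probability the model assigns to the ground-truth class
    p_t = torch.where(y == 1, p, 1.0 - p)
    # clamp to avoid log(0); the epsilon is an implementation detail
    p_t = p_t.clamp(min=1e-7)
    # (1 - p_t)^gamma down-weights easy examples; gamma=0 recovers cross entropy
    return -((1.0 - p_t) ** gamma) * torch.log(p_t)
```

For a well-classified example with \(p_t=0.9\) and \(\gamma=2\), the modulating factor is \(0.1^2=0.01\), so the sample contributes 100× less loss than under plain cross entropy.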

To test this proposal, the paper designed the RetinaNet model. It combines a backbone network with two task-specific subnetworks, as follows:

  • The backbone, such as ResNet, connects to a feature pyramid network (FPN). The ResNet backbone provides the bottom-up pathway producing feature maps, and the FPN is the top-down pathway that combines feature maps of different resolutions into a multiscale feature pyramid (sketched in code after this list)
    • RetinaNet uses ResNet feature maps \(C_3\) to \(C_5\) with FPN to produce pyramid levels \(P_3\) to \(P_7\), where \(P_k\) has resolution \(1/2^k\) of the input
    • ResNet residual stages \(C_3\) to \(C_5\) contribute to \(P_3\) to \(P_5\) via the FPN top-down pathway; \(P_6\) is obtained via a 3×3 stride-2 conv on \(C_5\), and \(P_7\) by applying ReLU followed by another 3×3 stride-2 conv on \(P_6\)
    • All pyramid levels have \(C=256\) channels
  • Anchor boxes similar to those of the RPN are produced from the FPN, with base areas of 32² to 512² on \(P_3\) to \(P_7\) and aspect ratios 1:2, 1:1, 2:1. For denser scale coverage, the anchors at each level are additionally scaled by factors of \(2^0, 2^{1/3}, 2^{2/3}\). In total \(A=3\times 3=9\) anchors are produced at each spatial location.
    • Each anchor is assigned a length-\(K\) one-hot vector as classification target and a 4-vector as bbox regression target
    • anchor labels: compare each anchor to the ground-truth boxes using IoU; anchors are positive for IoU \(\geq 0.5\), background for IoU \(\in[0,0.4)\), and ignored during training when the best IoU falls in \([0.4,0.5)\) (see the labeling sketch after this list)
  • Classification subnet: takes the input feature map of \(C=256\) channels at each pyramid level \(P_k\) (both subnets and their initialization are sketched after this list), then
    • pass through 4× of: 3×3 conv with \(C\) channels, then ReLU
    • then a 3×3 conv with \(KA\) channels, followed by sigmoid activation to produce a per-class binary classification of each anchor
  • Regression subnet: parallel to, and of the same design as, the classification subnet, except that the output has length \(4A\) with linear activation
    • the output length is \(4A\) instead of \(4KA\): the bounding box regression is class-agnostic, which the paper found to be equally effective
  • The model outputs \(A\) anchors per pyramid level and spatial location, but to trade accuracy for speed, only the boxes of the top-scoring predictions (at most 1K per pyramid level) are decoded
    • during training, the focal loss is applied to all ~100K anchors per image, normalized by the number of anchors assigned to a ground-truth box
    • detector confidence threshold: 0.05, applied before selecting the top-scoring predictions
    • overlapping detections are then merged by non-maximum suppression with an IoU threshold of 0.5
  • New layers in the model are initialized with:
    • biases are zeroed, except the final conv layer of the classification subnet, whose bias is set to \(b=-\log\frac{1-\pi}{\pi}\) with \(\pi=0.01\), so the initial probability of predicting foreground is about 0.01; this keeps the overwhelming number of background anchors from destabilizing the loss early in training
    • weights are Gaussian with \(\sigma=0.01\)
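The pyramid construction above can be sketched as follows, assuming PyTorch and a backbone that already exposes the \(C_3\) to \(C_5\) feature maps; module and argument names are mine, not the paper's.

```python
import torch.nn as nn
import torch.nn.functional as F

class RetinaFPN(nn.Module):
    """Lateral 1x1 convs on C3-C5, top-down nearest-neighbor upsampling with
    element-wise addition, 3x3 output convs, plus the extra P6/P7 levels."""

    def __init__(self, c3_ch: int, c4_ch: int, c5_ch: int, out_ch: int = 256):
        super().__init__()
        self.lat3 = nn.Conv2d(c3_ch, out_ch, 1)
        self.lat4 = nn.Conv2d(c4_ch, out_ch, 1)
        self.lat5 = nn.Conv2d(c5_ch, out_ch, 1)
        self.out3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.out4 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.out5 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.p6 = nn.Conv2d(c5_ch, out_ch, 3, stride=2, padding=1)   # 3x3 stride-2 conv on C5
        self.p7 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)  # after ReLU on P6

    def forward(self, c3, c4, c5):
        m5 = self.lat5(c5)
        m4 = self.lat4(c4) + F.interpolate(m5, scale_factor=2, mode="nearest")
        m3 = self.lat3(c3) + F.interpolate(m4, scale_factor=2, mode="nearest")
        p6 = self.p6(c5)
        p7 = self.p7(F.relu(p6))
        # P3..P7, each with 256 channels; P_k has 1/2^k the input resolution
        return self.out3(m3), self.out4(m4), self.out5(m5), p6, p7
```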
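The anchor labeling rule can be sketched like this; `iou` is assumed to be a precomputed (num_anchors × num_gt) overlap matrix, and the function name and sentinel values are my own.

```python
import torch

def label_anchors(iou: torch.Tensor, pos_thr: float = 0.5, neg_thr: float = 0.4) -> torch.Tensor:
    """Per anchor: index of the matched ground-truth box, -1 for background,
    or -2 for anchors ignored during training (best IoU in [0.4, 0.5))."""
    max_iou, gt_idx = iou.max(dim=1)      # best ground-truth match for each anchor
    labels = torch.full_like(gt_idx, -2)  # default: ignored
    labels[max_iou < neg_thr] = -1        # background
    pos = max_iou >= pos_thr
    labels[pos] = gt_idx[pos]             # foreground: remember which box to regress to
    return labels
```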
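Finally, a sketch of the two heads and the initialization scheme. The helper name and the choice of \(K=80\) (COCO) are assumptions; in the paper, each subnet's parameters are shared across all pyramid levels.

```python
import math
import torch.nn as nn

def subnet(c: int, out_ch: int, num_convs: int = 4) -> nn.Sequential:
    """4x (3x3 conv with C channels + ReLU), then a final 3x3 conv."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(c, c, 3, padding=1), nn.ReLU()]
    layers.append(nn.Conv2d(c, out_ch, 3, padding=1))
    return nn.Sequential(*layers)

K, A, C = 80, 9, 256         # K classes, A anchors per location, C channels
cls_head = subnet(C, K * A)  # sigmoid applied afterwards, per anchor and class
box_head = subnet(C, 4 * A)  # class-agnostic box regression, linear output

# Gaussian weights with sigma=0.01 and zero biases for all new conv layers...
for m in list(cls_head.modules()) + list(box_head.modules()):
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, std=0.01)
        nn.init.zeros_(m.bias)
# ...except the final classification conv, whose bias encodes the prior pi=0.01
pi = 0.01
nn.init.constant_(cls_head[-1].bias, -math.log((1 - pi) / pi))
```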

The network is trained with SGD: a minibatch size of 16 images, 90K iterations, weight decay of \(10^{-4}\), momentum of 0.9, and an initial learning rate of \(10^{-2}\), reduced to \(10^{-3}\) at iteration 60K and to \(10^{-4}\) at iteration 80K. The overall loss function is the sum of the focal loss for classification and smooth L1 loss for box regression.
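These hyperparameters map directly onto a standard optimizer configuration; here is a hypothetical PyTorch skeleton (the placeholder module stands in for the full RetinaNet).

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3)  # placeholder module; the real model is RetinaNet
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=1e-4)
# 10x learning-rate drops at iterations 60K and 80K of a 90K-iteration run
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60_000, 80_000], gamma=0.1)

# each iteration (one minibatch of 16 images) would then do:
#   loss = focal_classification_loss + smooth_l1_regression_loss
#   optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```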

Bibliographic data

@inproceedings{lin2017focal,
   title = "Focal Loss for Dense Object Detection",
   author = "Tsung-Yi Lin and Priya Goyal and Ross Girshick and Kaiming He and Piotr Dollár",
   booktitle = "Proc. ICCV",
   pages = "2980--2988",
   year = "2017",
   note = "arXiv:1708.02002",
}