Object detection is to predict the bounding boxes and category labels for each object of interest. This paper proposed DETR (Detection Transformer) to predict all objects at once, trained endtoend with a set loss function to perform bipartite matching between the predicted and groundtruth. It is found to perform better on large objects due to the nonlocal computations of the transformer but sacrificed the performance on small objects.
Object detection is challenging because predicting sets using deep learning model is challenging. The set of bounding boxes predicted has structural relationships (i.e., overlapping). It is not always an exact match to the ground truth. Therefore, postprocessing techniques such as nonmaximal suppression are used. In DETR, set matching is to use Hungarian algorithm to find the bestfit bipartite matching, and avoided the nonmaximal suppression.
The existing object detectors are either twostage (i.e., predict bounding box w.r.t. proposals) or singlestage (i.e., predict w.r.t. a grid of anchors as object centers). The accuracy depends on how exactly the initial guesses are set. DETR predicts the box w.r.t. image directly without anchors.
The DETR model
The architecture is in Fig.10 in appendix A.3 of the paper. Its goal is to infer a set of $N$ object predictions from the input image.
It uses CNN to generate a feature representation of the image. For an image of spatial size $(H_0,W_0)$ the feature is $(\frac{H_0}{32}, \frac{W_0}{32}, C)$ with $C=2048$. Then this feature is input to the encoder of a transformer architecture. Firstly, a $1\times1$ convolution is applied to reduce the channel dimension of the activation map from $C$ to $d$. Then flatten the spatial dimension, as the transformer encoder expects sequence as input.
The DETR architecture is an encoderdecoder transformer connected to a feedforward network for final detection prediction. In the encoder, input to each encoder layer is added with a fixed positional encoding (only the Q and K vectors, not V). The encoder layer is a multihead selfattention module and an FNN. Precisely, each block is: selfattention → add & norm → feedforward network → add & norm → output.
In the decoder, $N$ objects are decoded in parallel, not using the autoregressive model of sequence prediction. The output from the encoder becomes the K and V vectors to the crossattention layer in the decoder blocks. The decoder block receives queries, initially zero, and output positional encoding, which is served as queries to the subsequent decoder block.
The feedforward network is using a 3layer perceptron model, with hidden dimension $d$, and ReLU activation. It is to predict the normalized center coordinates and the width and height of the bounding box (all w.r.t. input image width and height). Parallel to it is a linear layer to predict the class label (of $N$ classes, some prediction can be $\varnothing$ for background class) using softmax function.
The difficulties of training the DETR model is about scoring the predicted objects (class, position, and size) w.r.t. the ground truth. The loss function is set up to optimize objectspecific bbox losses, as follows:
Let $y={y_1,\dots,y_N}$ be the groundtruth and $\hat{y}={\hat{y}_1,\dots,\hat{y}_N}$ be a set of $N$ predictions, some groundtruth can be “no object” $\varnothing$. Each $y_i=(c_i,b_i)$ for target class label $c_i$ and bbox $b_i\in [0,1]^4$ of (cx,cy,h,w) relative to image size. First, we have an optimal bipartite matching $\hat{\sigma}$ (in the form of a permutation, obtained, e.g., from the Hungarian algorithm) between $y$ and $\hat{y}$ that minimized the matching cost:
\[\hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^N L(y_i, \hat{y}_{\sigma(i)})\]where
\[L(y_i,\hat{y}_{\sigma(i)})=\mathbb{I}[c_i\neq\varnothing](\hat{p}_{\sigma(i)}(c_i)+L_{\text{box}}(b_i,\hat{b}_{\sigma(i)}))\]is the pairwise matching cost, with $\hat{p}_{\sigma(i)}(c_i)$ the predicted probability of class $c_i$ under the permutation $\sigma(i)$. The bbox loss is defined as
\[L_{\text{box}}(b_i,\hat{b}_i) = \lambda_{\text{IoU}}L_{\text{IoU}}(b_i,\hat{b}_{\sigma(i)})+\lambda_{L1} \Vert b_i  \hat{b}_{\sigma(i)}\Vert_1\]where $L_{\text{IoU}}$ is a generalized IoU and it should be scaleinvariant (see Rezatofighi et al. (2019) “Generalized intersection over union”):
\[L_{\text{IoU}}(b_i,\hat{b}_{\sigma(i)}) = 1\Bigg( \frac{\vert b_i \cap\hat{b}_{\sigma(i)}\vert}{\vert b_i \cup\hat{b}_{\sigma(i)}\vert}  \frac{\vert B(b_i, \hat{b}_{\sigma(i)})\setminus b_i \cup\hat{b}_{\sigma(i)}\vert}{\vert B(b_i,\hat{b}_{\sigma(i)})\vert} \Bigg)\]where:
 area is denoted with $\vert\cdot\vert$
 union and intersection is on box geometry
 \(B(b_i,\hat{b}_{\sigma(i)})\) is the bounding box enclosing both \(b_i\) and \(\hat{b}_{\sigma(i)}\)

The formula of $L_{\text{IoU}}$ is from DICE/F1 loss, which considered the logit prediction $\hat{m}$ vs binary target $m$, defined as
\[L_{\text{DICE}}(m,\hat{m}) = 1\frac{2m\sigma(\hat{m}+1)}{\sigma(\hat{m})+m+1}\]
With the optimal matching $\hat{\sigma}$ is found between $y$ and $\hat{y}$, the training loss is defined as a linear combination of class prediction crossentropy and objectspecific box loss:
\[L_H(y,\hat{y}) = \sum_{i=1}^N \Big(\log \hat{p}_{\hat{\sigma}(i)}(c_i)+\mathbb{I}[c_i\neq\varnothing]L_{\text{box}}(b_i,\hat{b}_{\hat{\sigma}(i)})\Big)\]which in practice, the log probability term for $c_i=\varnothing$ is downweighted by a factor of 10 to correct class imbalance.
Details
More implementation details are in the appendix of the paper.
The loss metrics are normalized by the number of objects in the batch. Care must be taken in distributed training since only a subbatch is provided to each computing node.
Hyperparameters for training:
 Optimizer: AdamW with weight decay $10^{4}$
 Gradient clipped at maximal gradient norm of 0.1
 Backbone (CNN) is ResNet50 (from Torchvision) discarded the last
classification layer, with batch normalization weights and statistics frozen
during training, all else finetuned with a learning rate of $10^{5}$
 the backbone learning rate is an order of magnitude smaller than the rest of the network to stabilize the training
 Transformer is trained with a learning rate $10^{4}$, additive dropout of 0.1 at every multihead attention and feedforward layer before normalization
 Transformer weights are initialized with Xavier initialization
 Loss hyperparameter: $\lambda_{\text{IoU}}=2$, $\lambda_{L1}=5$
 Decoder query slot $N=100$
 Baseline compared with FasterRCNN, trained for 109 epochs, using settings as
in Detectron2 model zoo Spatial positional encoding: Fixed absolute encoding
 both spatial coordinates of each embeddings use $d/2$ sine and cosine functions with different frequencies, then concatenate them to get $d$ channel
 2D positional encoding see Parmer et al (2018)
Bibliographic data
@inproceedings{
title = "EndtoEnd Object Detection with Transformers",
author = "Nicolas Carion and Alexander Kirillov and Francisco Massa and Gabriel Synnaeve and Nicolas Usunier and Sergey Zagoruyko",
booktitle = "Proc. ECCV",
year = "2020",
arXiv = "2005.12872",
github = "https://github.com/facebookresearch/detr",
}