Jekyll 2023-07-22T12:00:34-04:00 https://www.adrian.idv.hk/feed.xml
∫ntegrabℓε ∂ifferentiαℓs: unorganised memo, notes, code, data, and writings of random topics
Adrian S. Tam (righthandabacus@users.github.com)

Self-hosted Copilot for Your VSCode
2023-07-17T00:00:00-04:00
https://www.adrian.idv.hk/copilot

<p>GitHub has its Copilot service that we can pay a subscription for. It is a
coding assistant in your IDE, which requires a plugin in your editor and performs
<em>autocomplete</em> for the code you typed. There are off-the-shelf language models
that can generate code, just like GitHub’s Copilot. But by itself, such a model can
only emit code, such as a function.</p>
<p>To make it an assistant, you need to integrate it with your code editor.
Neovim, unfortunately, has a smaller user base and <a href="https://github.com/fauxpilot/fauxpilot/issues/21">not much
progress</a> yet. But there is a VSCode plugin that can use
<a href="https://github.com/fauxpilot/fauxpilot/">FauxPilot</a>. FauxPilot is a project
that lets you self-host a server compatible with GitHub Copilot. Theoretically,
you could use the GitHub Copilot plugin in your editor (Neovim has one, for
instance) but point it at the FauxPilot backend. But since GitHub hardcoded the
server address in the plugin, you would need to hack the plugin somehow to make
it work. The dedicated FauxPilot plugin allows you to configure a different
hostname and port number for the server, which is more convenient. Of course,
how to communicate with the editor so that you can extract the context and
provide suggestions seamlessly, from the UX perspective, is another story. But the
point is, there is a solution for the client (i.e., an editor such as VSCode or
Neovim), and there is a model (e.g., CodeGen2). FauxPilot is hardcoded to use
Salesforce’s CodeGen model.</p>
<p>Indeed, I believe FauxPilot made things too complicated. Of course, its merit
is a professional deployment of the self-hosted Copilot clone using
Docker. But if the goal is to try out models <em>with your IDE</em>, that’s too heavy.
Therefore, I trimmed FauxPilot down to take only the web interface part:
endpoints are implemented as a REST API using FastAPI and uvicorn (hence the
server code can be asynchronous). From the web request, we get the code that the
user typed as a string (together with some parameters such as the model
temperature) and we can invoke the model to produce output. The interaction
with the model should be a black box to the REST API, and it is designed as such.</p>
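As a sketch of that black-box separation, a handler can treat the model as a plain callable; all names, the request fields, and the stub model below are hypothetical illustrations, not the actual fauxpilot_lite code:

```python
# Sketch of a Copilot-style completion handler with the model as a black box.
# The field names, response shape, and stub model are all hypothetical.

def stub_model(prompt: str, temperature: float = 1.0) -> str:
    """Stand-in for a real code model such as CodeGen."""
    return prompt + "  # ...model completion goes here"

def complete(request: dict, model=stub_model) -> dict:
    """Handle one completion request, mimicking a REST endpoint body.

    The REST layer (FastAPI + uvicorn in FauxPilot) would parse the JSON
    body into such a dict and serialize the returned dict back to JSON;
    the model is only seen as a callable from here.
    """
    prompt = request["prompt"]
    temperature = float(request.get("temperature", 1.0))
    text = model(prompt, temperature=temperature)
    return {"choices": [{"text": text}]}

result = complete({"prompt": "def add(a, b):", "temperature": 0.2})
```

The REST layer stays oblivious to which model runs behind `complete()`, which is the separation described above.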
<p>The code is here: <a href="https://github.com/righthandabacus/fauxpilot_lite">https://github.com/righthandabacus/fauxpilot_lite</a></p>
<p>See the readme for more details.</p>

Conda and CUDA
2023-07-12T00:00:00-04:00
https://www.adrian.idv.hk/conda

<p>If we want to run TensorFlow or PyTorch with CUDA on Linux, for example, we can
install CUDA as a system library first and then install the Python package with
pip (or via apt-get, in the rare case). This way, the package will find the
CUDA library at the system locations. The cases of pyenv and virtualenv are
similar: the Python packages are installed via pip and the CUDA library is
expected at the system path.</p>
<p>The other way to run this would be using conda. It is special because conda is
not a Python virtualenv. An environment in conda can come with other binaries,
such as the CUDA library. Hence you can <code>conda install cudatoolkit</code> and then
<code>conda install pytorch</code>. These are conda-specific builds that assume the
libraries are installed in non-standard locations.</p>
<p>At the time of writing, we have Python 3.11.4, PyTorch 2.0.1, and TensorFlow
2.13.0. Luckily, PyTorch 2.0 and TensorFlow 2.13 both depend on CUDA 11.8
(CUDA 12 is not supported yet). But, unfortunately, conda does not have
TensorFlow 2.13 yet. To get everything into the same conda environment, this
seems to be what should be done:</p>
<pre><code class="language-sh">sudo apt-get install cuda-11-8
mamba create -n <name> python=3.11.4
mamba activate <name>
mamba install -c pytorch -c nvidia pytorch torchvision torchaudio pytorch-cuda=11.8
pip install tensorflow  # 2.13.0 using system CUDA
</code></pre>
<p>If TensorFlow 2.12 is acceptable, we can indeed do (specifying the build string is necessary)</p>
<pre><code>mamba install tensorflow=2.12.0=gpu_py311h65739b5_0
</code></pre>
<p>But in the case above, we have TWO copies of CUDA installed: one at the system
level and another inside the conda environment. The pip-installed TensorFlow uses
the former, and the conda-installed PyTorch uses the latter.</p>

SSH as VPN
2023-06-12T00:00:00-04:00
https://www.adrian.idv.hk/ssh

<p>To use SSH as a VPN, the man page provides the following instructions:</p>
<pre><code>SSH-BASED VIRTUAL PRIVATE NETWORKS
     ssh contains support for Virtual Private Network (VPN) tunnelling using
     the tun(4) network pseudo-device, allowing two networks to be joined
     securely.  The sshd_config(5) configuration option PermitTunnel controls
     whether the server supports this, and at what level (layer 2 or 3
     traffic).

     The following example would connect client network 10.0.50.0/24 with
     remote network 10.0.99.0/24 using a point-to-point connection from
     10.1.1.1 to 10.1.1.2, provided that the SSH server running on the gateway
     to the remote network, at 192.168.1.15, allows it.

     On the client:

           # ssh -f -w 0:1 192.168.1.15 true
           # ifconfig tun0 10.1.1.1 10.1.1.2 netmask 255.255.255.252
           # route add 10.0.99.0/24 10.1.1.2

     On the server:

           # ifconfig tun1 10.1.1.2 10.1.1.1 netmask 255.255.255.252
           # route add 10.0.50.0/24 10.1.1.1

     Client access may be more finely tuned via the /root/.ssh/authorized_keys
     file (see below) and the PermitRootLogin server option.  The following
     entry would permit connections on tun(4) device 1 from user “jane” and on
     tun device 2 from user “john”, if PermitRootLogin is set to
     “forced-commands-only”:

       tunnel="1",command="sh /etc/netstart tun1" ssh-rsa ... jane
       tunnel="2",command="sh /etc/netstart tun2" ssh-rsa ... john

     Since an SSH-based setup entails a fair amount of overhead, it may be more
     suited to temporary setups, such as for wireless VPNs.  More permanent
     VPNs are better provided by tools such as ipsecctl(8) and isakmpd(8).
</code></pre>
<p>The command to launch the VPN is as follows (routing is still needed):</p>
<pre><code>ssh \
    -o PermitLocalCommand=yes \
    -o LocalCommand="sudo ifconfig tun5 192.168.244.2 pointopoint 192.168.244.1 netmask 255.255.255.0" \
    -o ServerAliveInterval=60 \
    -w 5:5 vpn@example.com \
    'sudo ifconfig tun5 192.168.244.1 pointopoint 192.168.244.2 netmask 255.255.255.0; echo tun0 ready'
</code></pre>

Wang et al (2021) Real-ESRGAN
2023-06-10T00:00:00-04:00
https://www.adrian.idv.hk/wxds21esrgan

<p>This is to extend SRGAN and ESRGAN to do blind super-resolution. The
problem statement is to reconstruct the high-resolution image from the
low-resolution one (a.k.a. super-resolution), but without knowing how the
low-resolution image is derived from the original high-resolution image, i.e.,
<em>blind super-resolution</em>.</p>
<p>The contributions of this paper: (1) a process to create a high-quality
synthetic dataset for super-resolution, and (2) a network for SR, notably using
a U-Net discriminator with spectral normalization to increase the
discriminator’s capability.</p>
<p>The classical degradation model includes blur, downsampling, noise, and JPEG
compression:</p>
\[\mathbf{x} = D(\mathbf{y}) = \big[(\mathbf{y} \circledast \mathbf{k})\downarrow_s + \mathbf{n}\big]_{\mathrm{JPEG}}\]
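The blur, downsample, and noise steps of this classical pipeline can be sketched with NumPy; the JPEG compression step is omitted, and the kernel, scale factor, and noise level below are arbitrary choices for illustration:

```python
import numpy as np

def degrade(y, kernel, scale, sigma, rng):
    """Sketch of first-order degradation: blur, downsample, add noise.

    JPEG compression is omitted; parameters are arbitrary illustrations,
    not the sampled degradation parameters used by the paper.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(y, ((ph, ph), (pw, pw)), mode="edge")
    blurred = np.zeros_like(y)
    for i in range(kh):  # direct 2D convolution with the blur kernel
        for j in range(kw):
            blurred += kernel[i, j] * padded[i:i + y.shape[0], j:j + y.shape[1]]
    down = blurred[::scale, ::scale]                   # naive downsampling
    noisy = down + rng.normal(0.0, sigma, down.shape)  # additive Gaussian noise
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
y = rng.random((64, 64))        # toy single-channel HR image in [0,1]
k = np.full((5, 5), 1 / 25)     # box blur kernel
x = degrade(y, k, scale=4, sigma=0.01, rng=rng)   # LR image, shape (16, 16)
```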
<hr />
<ul>
<li>first-order vs high-order degradation modeling for real-world degradation</li>
<li>sinc filter for ringing and overshoot artifacts</li>
<li>discriminator of more powerful capability</li>
<li>gradient feedback from the discriminator needs to be more accurate for local detail enhancement</li>
<li>U-Net design with spectral normalization (SN) regularization</li>
</ul>
<h2 id="furtherreading">Further Reading</h2>
<ul>
<li>First SR network: SRCNN, 9, 10</li>
<li>Blind SR survey: 28</li>
</ul>

Ledig et al (2017) Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
2023-05-30T00:00:00-04:00
https://www.adrian.idv.hk/lthccaattws17srgan

<p>This is the “SRGAN” paper. The problem of upscaling a photo with details is called “SISR” as in the title. This paper takes 4× upscaling as an example problem and builds a GAN model for it.</p>
<p>The optimization target for a super-resolution algorithm is usually the MSE between the pixels of the high-resolution image (HR) and those of the super-resolution output (SR). Minimizing the MSE effectively maximizes the PSNR, or SSIM. But MSE cannot capture <em>perceptual</em> differences, such as texture details. The major contribution of this paper is to introduce a <em>perceptual loss</em> function using the feature maps of the VGG network.</p>
<p>The SRGAN is using a CNN architecture to process images. Some key design components in the CNN:</p>
<ul>
<li>Deep CNN is difficult to train but has a potential for better accuracy. Batch normalization can help counteract the covariate shift inside the model</li>
<li>Skip-connections relieve the network of modeling the identity mapping, which is non-trivial for a convolutional kernel</li>
</ul>
<h2 id="ganmodel">GAN Model</h2>
<p>Adversarial min-max problem:</p>
<p>\(\min_{\theta_G} \max_{\theta_D}
\Big\{
\mathbb{E}_{I^{\text{HR}}\in T^{\text{HR}}}\Big[\log D_{\theta_D}(I^{\text{HR}})\Big]
+
\mathbb{E}_{I^{\text{LR}}\in T^{\text{LR}}}\Big[\log\Big(1-D_{\theta_D}\big(G_{\theta_G}(I^{\text{LR}})\big)\Big)\Big]
\Big\}\)</p>
<ul>
<li>$\theta_D$ and $\theta_G$ are the model parameters of the discriminator and the generator respectively; to be learned</li>
<li>$T^{\text{HR}}$ and $T^{\text{LR}}$ are the set of hires and lores images</li>
<li>the above is the same as the binary cross-entropy loss from the discriminator</li>
</ul>
<p>Model design: (Fig.4 in the paper on page 5)</p>
<ul>
<li>Parametric ReLU activation is used on the generator so the activation can be learned</li>
<li>Subpixel convolution is used for upscaling</li>
<li>LeakyReLU (with $\alpha=0.2$) is used throughout the discriminator and max-pooling is avoided (Radford et al, 2016)</li>
<li>At the discriminator, stride-2 convolution is used to halve the image resolution, and whenever it is applied, the number of features is doubled</li>
</ul>
<p>The key to GAN training is the <em>perceptual loss</em> function, defined as equation (3) in the paper:
\(l^{\text{SR}} = l_X^{\text{SR}} + 10^{-3}\, l_{\text{Gen}}^{\text{SR}}\)
where $l_X^{\text{SR}}$ is the content loss (based on VGG19 network features) and $l_{\text{Gen}}^{\text{SR}}$ is the adversarial loss (based on the discriminator).</p>
<p>The content loss $l_X^{\text{SR}}$ is modeled after the pixel-wise MSE loss, which is shown to positively correlate with PSNR. However, MSE loss on pixels tends to overly smooth textures as it fails to account for high-frequency content. The paper proposes to compute the MSE loss on the feature output from VGG19 at the layer $\phi_{5,4}$, i.e., the 4th conv layer in the block preceding the 5th pooling layer (note: not the 4th conv layer from the beginning, but the 4th in that block, counting from an activation layer). There are $C=512$ feature channels. The MSE is computed element-wise.</p>
<p>The adversarial loss or the generative loss $l_{\text{Gen}}^{\text{SR}}$ is the cross-entropy over all training samples:
\[l_{\text{Gen}}^{\text{SR}} = -\sum_{n=1}^N \log D_{\theta_D}\big(G_{\theta_G}(I^{\text{LR}})\big)\]
where we expect the discriminator $D_{\theta_D}(\cdot)$ to produce sigmoidal output.</p>
<h2 id="training">Training</h2>
<p>The model is trained as follows: the BSD300 dataset is used as the test set. The model is designed for a scale factor of $4\times$, i.e., $16\times$ the pixel count. PSNR (in dB) and SSIM are used as the evaluation metrics. The model is compared to the upscaling algorithms nearest neighbor, bicubic, SRCNN (Dong et al, 2014), and SelfExSR (Huang et al, 2015).</p>
<p>The training set is a random sample of 350K images from the ImageNet database, where the LR images are obtained by bicubic $4\times$ downsampling of the HR images (RGB). Then 16 random $96\times 96$ sub-images cropped from distinct image samples form a mini-batch.</p>
<p>The LR images (input to the generator) are in the pixel range $[0,1]$ and the HR images (output from the generator) are in the range $[-1,1]$. The VGG output features are scaled by a factor of $1/12.75$ to make the MSE on VGG features comparable to the pixel MSE loss.</p>
<p>Training uses the Adam optimizer ($\beta_1=0.9$) with a learning rate of $10^{-4}$ for the first 100K update steps and $10^{-5}$ for another 100K update steps.</p>
<h2 id="implementation">Implementation</h2>
<p>There are quite a number of implementations on the web. Below is what I polished from various sources:</p>
<pre><code class="language-python">#!/usr/bin/env python
# coding: utf-8
"""
Based on the paper
"""
import os
import cv2
import numpy as np
import tensorflow as tf
import tqdm
from tensorflow.keras.layers import \
Input, Conv2D, LeakyReLU, BatchNormalization, Flatten, Dense, PReLU, Add, UpSampling2D
from tensorflow.keras.models import Model
from tensorflow.keras.losses import BinaryCrossentropy, binary_crossentropy, mean_squared_error
from tensorflow.keras.optimizers import Adam, SGD
#
# Data generator
#
def make_dataset(image_dir, hires_size=(256,256), lores_size=(64,64), batch_size=8):
"""Tensorflow dataset of batches of (lores,hires) images"""
hires = tf.keras.utils.image_dataset_from_directory(image_dir, labels=None,
color_mode="rgb",
image_size=hires_size,
batch_size=None)
hires = hires.batch(batch_size, drop_remainder=True)
lores = hires.map(lambda nhwc: tf.image.resize(nhwc, lores_size))
dataset = tf.data.Dataset.zip((hires, lores))
return dataset
#
# Discriminator
#
def discriminator_block(input, n_filters, strides=1, bn=True, name_prefix=""):
"""Repeated discriminator block. Batch normalization is not used on the first block"""
y = Conv2D(n_filters, (3, 3), strides, padding="same", name=name_prefix+"_conv")(input)
if bn:
y = BatchNormalization(momentum=0.8, name=name_prefix+"_bn")(y)
    y = LeakyReLU(alpha=0.2, name=name_prefix+"_lrelu")(y)
return y
def discriminator_model(input, name="discriminator"):
"""The complete discriminator that takes an input image and output a logit value"""
n_filters = 64
# k3n64s1 and k3n64s2
y = discriminator_block(input, n_filters, bn=False, name_prefix="block1")
y = discriminator_block(y, n_filters, strides=2, name_prefix="block2")
# k3n128s1 and k3n128s2
y = discriminator_block(y, n_filters*2, name_prefix="block3")
y = discriminator_block(y, n_filters*2, strides=2, name_prefix="block4")
# k3n256s1 and k3n256s2
y = discriminator_block(y, n_filters*4, name_prefix="block5")
y = discriminator_block(y, n_filters*4, strides=2, name_prefix="block6")
# k3n512s1 and k3n512s2
y = discriminator_block(y, n_filters*8, name_prefix="block7")
y = discriminator_block(y, n_filters*8, strides=2, name_prefix="block8")
# Dense layers and logit output
y = Flatten(name="flatten")(y)
y = Dense(n_filters*16, name="fc1")(y)
y = LeakyReLU(alpha=0.2, name="lrelu")(y)
output = Dense(1, name="fc2")(y) # no sigmoid act, to make logit output
return Model(inputs=input, outputs=output, name=name)
#
# Generator
#
def residual_block(input, name_prefix=""):
"""Residual block in generator"""
# two layers of k3n64s1
y = Conv2D(64, (3, 3), padding="same", name=name_prefix+"_conv1")(input)
y = BatchNormalization(momentum=0.5, name=name_prefix+"_bn1")(y)
y = PReLU(shared_axes=[1, 2], name=name_prefix+"_prelu")(y)
y = Conv2D(64, (3, 3), padding="same", name=name_prefix+"_conv2")(y)
y = BatchNormalization(momentum=0.5, name=name_prefix+"_bn2")(y)
y = Add(name=name_prefix+"_add")([input, y]) # skip connection
return y
def upscale_block(input, name_prefix=""):
"""Upscale the image 2x, used at the end of the generator network
"""
# k3n256s1
y = Conv2D(256, (3, 3), padding="same", name=name_prefix+"_conv")(input)
y = tf.nn.depth_to_space(y, 2) # 2x upsampling
y = PReLU(shared_axes=[1, 2], name=name_prefix+"_prelu")(y)
return y
def generator_model(input, num_res_blocks=16, name="generator"):
"""Create the generator model of SRGAN for 4x superresolution"""
# k9n64s1 and PReLU layer before the residual block
y = Conv2D(64, (9, 9), padding="same", name="entry_conv")(input)
y = PReLU(shared_axes=[1, 2], name="entry_prelu")(y)
# B times the residual blocks
res_input = y
for n in range(num_res_blocks):
y = residual_block(y, name_prefix=f"residual{n}")
# k3n64s1 Conv+BN block
y = Conv2D(64, (3, 3), padding="same", name="mid_conv")(y)
y = BatchNormalization(momentum=0.5, name="mid_bn")(y)
y = Add(name="mid_add")([y, res_input])
# two upscale blocks
y = upscale_block(y, name_prefix="up1")
y = upscale_block(y, name_prefix="up2")
# k9n3s1 conv at output
output = Conv2D(3, (9, 9), padding="same", name="out_conv")(y)
return Model(inputs=input, outputs=output, name=name)
#
# VGG model for content loss
#
def vgg_model(output_layer=20):
"""Create VGG19 model for measuring the perceptual loss
"""
# take VGG model from Keras, output at layer "block5_conv4" (20),
# paper referred this layer as \phi_{5,4}
vgg = tf.keras.applications.VGG19(input_shape=(None, None, 3), weights="imagenet", include_top=False)
model = Model(inputs=vgg.input, outputs=vgg.layers[output_layer].output, name="VGG19")
model.trainable = False # need model.compile()
for layer in model.layers:
layer.trainable = False # no need model.compile()
return model
#
# Training
#
def save_weights(generator, discriminator, epoch, basedir="checkpoint"):
"""Syntax sugar for saving the generator and discriminator models"""
os.makedirs(basedir, exist_ok=True)
gen_path = os.path.join(basedir, f"generator_{epoch}.h5")
disc_path = os.path.join(basedir, f"discriminator_{epoch}.h5")
generator.save(gen_path)
discriminator.save(disc_path)
def main():
image_dir = "dataset_images"
batch_size = 8
n_epochs = 100
# try to build and print the discriminator
hr_input = Input(shape=(256, 256, 3))
discriminator = discriminator_model(hr_input)
discriminator.summary(line_length=120, expand_nested=True, show_trainable=True)
# try to build and print the generator (1/4 size of the discriminator input)
lr_input = Input(shape=(64, 64, 3))
generator = generator_model(lr_input)
generator.summary(line_length=120, expand_nested=True, show_trainable=True)
# VGG model to reuse for feature extraction during loss calculation
vgg = vgg_model()
vgg.summary(line_length=120, expand_nested=True, show_trainable=True)
# The loss metrics
    ones = tf.ones(batch_size)
    zeros = tf.zeros(batch_size)
def content_loss(hires, supres):
"""Use VGG model to compare features extracted from hires and supreres images.
Keras VGG model expects "caffe" image format (BGR, meanshifted), hence
preprocess_input() is required. This function is for use with model.compile()
Args:
hires: Hires image, pixels in [0,255]
        supres: Generator output, pixels in [-1,1] supposedly
Returns:
tf.Tensor of a scalar value
"""
supres = tf.keras.applications.vgg19.preprocess_input(tf.clip_by_value((supres+1)*127.5, 0, 255))
hires = tf.keras.applications.vgg19.preprocess_input(hires)
hires_feat = vgg(hires, training=False) / 12.75
supres_feat = vgg(supres, training=False) / 12.75
return tf.math.reduce_mean(tf.math.squared_difference(hires_feat, supres_feat))
disc_loss = BinaryCrossentropy(from_logits=True)
def gan_loss(hires, supres):
"""Generator perceptual loss = content loss + 1e3 * adversarial loss"""
disc_output = discriminator(supres, training=False)
content = content_loss(hires, supres)
adversarial = disc_loss(ones, disc_output)
        return content + 1e-3 * adversarial
    # Optimizers for use in training: separate because these optimizers are stateful
    gen_opt = Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
    disc_opt = Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
# compile models
generator.compile(loss=gan_loss, optimizer=gen_opt)
discriminator.compile(loss=disc_loss, optimizer=disc_opt)
# training loop
dataset = make_dataset(image_dir, batch_size=batch_size).prefetch(tf.data.AUTOTUNE)
p_mean = tf.keras.metrics.Mean() # to average perceptual loss
d_mean = tf.keras.metrics.Mean() # to average discriminator loss
for epoch in range(n_epochs):
with tqdm.tqdm(dataset, unit="step", desc=f"Epoch {epoch}") as tqdmbar:
for hires_batch, lores_batch in tqdmbar:
                # train the discriminator; generator input is [0,1], output is [-1,1]
                lores_batch /= 255.0
                supres_batch = generator(lores_batch, training=False)  # output pixel [-1,1]
                disc_loss0 = discriminator.train_on_batch(supres_batch, zeros)
                disc_loss1 = discriminator.train_on_batch(hires_batch/127.5 - 1, ones)  # convert [0,255] -> [-1,1]
# train the generator
percep_loss = generator.train_on_batch(lores_batch, hires_batch)
p_mean.update_state(percep_loss)
d_mean.update_state(disc_loss0+disc_loss1)
tqdmbar.set_postfix(percep=f"{p_mean.result():.3f}",
disc=f"{d_mean.result():.3f}")
# save model at end of each epoch
save_weights(generator, discriminator, epoch+1)
p_mean.reset_states()
d_mean.reset_states()
main()
</code></pre>

Explaining Attention Mechanism
2023-05-21T00:00:00-04:00
https://www.adrian.idv.hk/attention

<p>The attention mechanism was first mentioned in the Bahdanau et al (2015) paper titled
“Neural Machine Translation by Jointly Learning to Align and Translate”, and
Luong et al (2015) improved it with the paper “Effective Approaches to
Attention-based Neural Machine Translation”. The key is to find the <em>attention
score</em> $a_{ij}$ between two state vectors, $h_i$ and $s_j$. Should there be
many $h_i$ and $s_j$, the attention score can tell which pair is most relevant.</p>
<p>The steps in producing the attention score are as follows:</p>
<ol>
<li>Take $h_i$ (e.g., from encoder output) and $s_j$ (e.g., from decoder output)</li>
<li>With a function $a(\cdot,\cdot)$, compute $e_{ij} = a(h_i, s_j)$; this function can be implemented as a neural network, e.g., $a(h_i,s_j) = v^\top \tanh(W[h_i;s_j])$</li>
<li>Compute $\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_k\exp(e_{ik})}$</li>
</ol>
<p>If we are to compute the context vector of a sequence of vectors $h_i$, it can be the weighted sum</p>
\[c_j = \sum_{i=1}^T \alpha_{ij}h_i\]
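The steps above can be sketched in NumPy; the dimensions are arbitrary, $a(\cdot,\cdot)$ is the single-layer network given in step 2, and the softmax here is normalized over the $h_i$ index so that the context weights sum to one:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 3                      # number of h_i and s_j vectors (arbitrary)
d_h, d_s, d_a = 4, 6, 8          # state and attention dimensions (arbitrary)
h = rng.normal(size=(M, d_h))    # e.g., encoder states
s = rng.normal(size=(N, d_s))    # e.g., decoder states
W = rng.normal(size=(d_a, d_h + d_s))
v = rng.normal(size=(d_a,))

# Step 2: e_ij = v^T tanh(W [h_i; s_j]), one call per (i, j) pair
e = np.empty((M, N))
for i in range(M):
    for j in range(N):
        e[i, j] = v @ np.tanh(W @ np.concatenate([h[i], s[j]]))

# Step 3: softmax over the h_i index for each j
alpha = np.exp(e) / np.exp(e).sum(axis=0, keepdims=True)

# Context vector c_j = sum_i alpha_ij h_i
c = alpha.T @ h                  # shape (N, d_h)
```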
<p>But computing $\alpha_{ij}$ from the function $a(\cdot,\cdot)$ means the
function is called $M\times N$ times, for $M,N$ the cardinality of $h_i$ and
$s_j$ respectively. We can reduce this complexity to $M+N$ by first mapping
$h_i$ and $s_j$ to a common vector space:</p>
\[\begin{aligned}
q_i &= f(h_i) \\
k_j &= g(s_j) \\
e_{ij} &= q_i k_j^\top
\end{aligned}\]
<p>The nonlinearity is moved to $f(\cdot)$ and $g(\cdot)$, and $e_{ij}$ can be
computed all at once using matrix multiplication.</p>
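In NumPy, the $M+N$ projections followed by a single matrix multiplication can be sketched as follows (shapes arbitrary; $f$ and $g$ are taken as single tanh layers just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, d = 5, 3, 8
h = rng.normal(size=(M, 4))    # M state vectors h_i
s = rng.normal(size=(N, 6))    # N state vectors s_j
Wf = rng.normal(size=(4, d))   # parameters of f (assumed single tanh layer)
Wg = rng.normal(size=(6, d))   # parameters of g (assumed single tanh layer)

q = np.tanh(h @ Wf)            # f(h_i): M projections in one matmul
k = np.tanh(s @ Wg)            # g(s_j): N projections in one matmul
e = q @ k.T                    # all M*N scores e_ij = q_i k_j^T at once
```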
<h2 id="selfattention">Selfattention</h2>
<p>In the Vaswani et al (2017) paper “Attention is All You Need”, the single-head
attention mechanism is an extension of the above that introduces the terms query,
key, and value. At a high level, query sequence $q$ and key sequence $k$ are
used to compute attention score matrix $\alpha$, which is then
matrixmultiplied with value sequence $v$ to produce the output $\alpha v$.
The sequences $q,k,v$ are transformed sequences from the original $Q,K,V$.</p>
<p>Precisely, $Q$ and $K$ are sequences of vectors stacked into matrix form. Within
the attention module, learnable matrices are multiplied with them, and the
attention output $O$ is computed from the attention score and the value $V$:</p>
\[\begin{aligned}
q &= QW^Q \\
k &= KW^K \\
v &= VW^V \\
\alpha &= \text{softmax}\Big(\frac{q k^\top}{\sqrt{d_k}}\Big) \\
O &= \alpha v = \text{attn}(Q, K, V)
\end{aligned}\]
<p>where the softmax function computes
$\sigma(z_i) = \exp(z_i) / \sum_j \exp(z_j)$; each row of keys shares the same
softmax and each query is independent. Note that usually $Q,K,V$ have the same
dimension size at each sequence step. The transformation matrices $W^Q,W^K$ should
be of the same shape to make $qk^\top$ possible. But $W^V$ can have a different
dimension (the output dimension size).</p>
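A NumPy sketch of single-head attention following the equations above (all shapes are arbitrary choices for illustration):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # for numerical stability
    ez = np.exp(z)
    return ez / ez.sum(axis=axis, keepdims=True)

def attn(Q, K, V, WQ, WK, WV):
    """Single-head scaled dot-product attention."""
    q, k, v = Q @ WQ, K @ WK, V @ WV
    d_k = q.shape[-1]
    alpha = softmax(q @ k.T / np.sqrt(d_k), axis=-1)  # one softmax per query row
    return alpha @ v

rng = np.random.default_rng(0)
d_model, d_k, d_v = 16, 8, 10
Q = rng.normal(size=(5, d_model))   # query sequence, length 5
K = rng.normal(size=(7, d_model))   # key sequence, length 7
V = K                               # K = V in the common case
WQ, WK = rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k))
WV = rng.normal(size=(d_model, d_v))
O = attn(Q, K, V, WQ, WK, WV)       # shape (5, 10): query length, value dim
```

Note how the output length follows $q$ and the output dimension follows $v$, as discussed below.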
<p>How should we understand this mechanism? In the book
<em>Natural Language Processing with Transformers</em> (Tunstall et al, 2022), an
example sentence “time flies like an arrow” is used to explain.
Each word is first converted into an embedding vector. Hence we have a sequence
of 5 vectors ($Q=K=V$). Then the vectors are transformed into $q,k,v$ by
multiplying with matrices, usually resulting in a lower dimension. Afterwards,
the attention score matrix $\alpha$ (of shape $5\times 5$) is derived, in which
each element $\alpha_{ij}$ is the similarity score of $q_i$ to $k_j$. Then the
output $O$ (also a sequence of 5 vectors) is the matrix multiplication $\alpha
v$, in which each element is therefore a weighted sum of the elements of $v$
according to the attention scores on the corresponding row of $\alpha$.</p>
<p>The projection of $Q,K,V$ into $q,k,v$ is to transform the vector
representation. Since the dot product is used to calculate the attention score,
this transformation adjusts what the score means, e.g., relating subject to
verb for agreement.</p>
<p>The sequence lengths of $q$ and $k$ can be different (since they are only used
to find the attention score) but the sequence lengths of $k$ and $v$ have to
agree so that the multiplication $\alpha v$ is possible. Also, the embedding
dimensions of $q$ and $k$ should agree to make the dot product possible, but
those of $k$ and $v$ can be different. The output of the self-attention
mechanism has the same length as $q$ but the same embedding dimension as $v$.</p>
<p>In the case of self-attention, of course, $q,k,v$ have the same length. And
to make stacking multiple attention layers easy, the transformed embedding
vector spaces are also of the same dimension.</p>
<h2 id="multiheadattention">Multi-head attention</h2>
<p>Multi-head attention is a stack of $h$ single-head attentions, each with
independent weights $W^Q, W^K, W^V$. The $h$ outputs are concatenated, then
transformed with a matrix:</p>
\[\text{MultiHead}(Q,K,V) = [O_1;\dots;O_h] W^O\]
<p>where the shape of transformation matrices are:</p>
\[\begin{aligned}
W_i^Q &\in \mathbb{R}^{d_{\text{model}}\times d_k} \\
W_i^K &\in \mathbb{R}^{d_{\text{model}}\times d_k} \\
W_i^V &\in \mathbb{R}^{d_{\text{model}}\times d_v} \\
W^O &\in \mathbb{R}^{hd_v\times d_{\text{model}}}
\end{aligned}\]
<p>The reason multi-head attention is used is to capture different kinds of
attention: in the case of a sentence, one head may find the gender agreement
while another finds the subject-verb relationship, for example. Each <em>head</em>
is responsible for one theme. The outputs are then concatenated. Of course,
concatenation means the output from each head is clustered, and the output
dimension is a multiple of the number of heads. Therefore, there is a
transformation matrix $W^O$ to realign the output to the correct dimension.</p>
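A NumPy sketch of multi-head self-attention with $h=4$ heads (dimensions arbitrary): each head runs single-head attention with its own weights, then the concatenated outputs are realigned by $W^O$.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    ez = np.exp(z)
    return ez / ez.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k, d_v, n_heads, T = 16, 4, 4, 4, 5
X = rng.normal(size=(T, d_model))               # self-attention: Q = K = V = X

heads = []
for _ in range(n_heads):                        # independent W^Q, W^K, W^V per head
    WQ, WK = rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k))
    WV = rng.normal(size=(d_model, d_v))
    q, k, v = X @ WQ, X @ WK, X @ WV
    alpha = softmax(q @ k.T / np.sqrt(d_k))
    heads.append(alpha @ v)                     # each head output: (T, d_v)

WO = rng.normal(size=(n_heads * d_v, d_model))  # realign concatenated heads
out = np.concatenate(heads, axis=-1) @ WO       # shape (T, d_model)
```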
<p>TensorFlow’s implementation of the <code>MultiHeadAttention</code> layer takes
the call arguments query, value, and key. The key is optional.
In many cases, $K,V$ are the same (the same sequence). For example, in
translation, $Q$ is the target sequence and $K,V$ are both the source sequence.
In a recommendation system, $Q$ is the target items and $K,V$ are the user
profile. In language models, self-attention is used and $Q,K,V$ are all the
same. We often see $K=V$ when we need to relate different positions of the same
sequence to one another (e.g., what “it” refers to in the sentence).</p>
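A minimal usage example of that Keras layer, with arbitrary toy shapes:

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=8)
source = tf.random.normal((1, 7, 32))   # e.g., source sentence embeddings
target = tf.random.normal((1, 5, 32))   # e.g., target sentence embeddings

# Cross-attention: query = target, value = source; key defaults to value
out = mha(query=target, value=source)

# Self-attention: all three are the same sequence
self_out = mha(query=source, value=source, key=source)
```

The output length follows the query sequence, hence `out` has 5 steps and `self_out` has 7.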
<p>Usually, a feed-forward network follows the attention mechanism to introduce
non-linearity. This feed-forward network replaces the <em>hidden state</em> of an
RNN to extract hierarchical features.</p>

Carion et al (2020) End-to-End Object Detection with Transformers
2023-05-20T00:00:00-04:00
https://www.adrian.idv.hk/ckmsuz20detr

<p>Object detection is to predict the bounding boxes and category labels for each
object of interest. This paper proposed DETR (Detection Transformer) to predict
all objects at once, trained end-to-end with a set loss function that performs
bipartite matching between the predictions and the ground truth. It is found to
perform better on large objects, thanks to the non-local computations of the
transformer, but sacrifices performance on small objects.</p>
<p>Object detection is challenging because predicting sets using a deep learning
model is challenging. The set of predicted bounding boxes has structural
relationships (i.e., overlapping), and it is not always an exact match to the
ground truth. Therefore, post-processing techniques such as non-maximal
suppression are used. In DETR, set matching uses the Hungarian algorithm to find
the best-fit bipartite matching, which avoids non-maximal suppression.</p>
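The bipartite matching step can be illustrated with SciPy's implementation of the Hungarian algorithm; the cost matrix below is a toy stand-in for the pairwise matching costs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = matching cost of ground-truth i against prediction j (toy values)
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.9],
                 [0.8, 0.9, 0.3]])

# Hungarian algorithm: minimum-cost one-to-one assignment
row, col = linear_sum_assignment(cost)
# ground truth i is matched to prediction col[i];
# the total cost of the matching is cost[row, col].sum()
```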
<p>The existing object detectors are either two-stage (i.e., predict bounding
boxes w.r.t. proposals) or single-stage (i.e., predict w.r.t. a grid of anchors
as object centers). The accuracy depends on how exactly the initial guesses are
set. DETR predicts the box w.r.t. the image directly, without anchors.</p>
<h2 id="thedetrmodel">The DETR model</h2>
<p>The architecture is in Fig.10 in appendix A.3 of the paper. Its goal is to
infer a set of $N$ object predictions from the input image.</p>
<p>It uses CNN to generate a feature representation of the image. For an image of
spatial size $(H_0,W_0)$ the feature is $(\frac{H_0}{32}, \frac{W_0}{32}, C)$
with $C=2048$. Then this feature is input to the encoder of a transformer
architecture. First, a $1\times1$ convolution is applied to reduce the
channel dimension of the activation map from $C$ to $d$. Then the spatial
dimensions are flattened, as the transformer encoder expects a sequence as input.</p>
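In terms of shapes, the channel reduction and flattening look like this (toy image size; zeros stand in for real features and weights, and the $1\times1$ convolution acts as a per-position linear map over channels):

```python
import numpy as np

H0, W0, C, d = 640, 480, 2048, 256
feat = np.zeros((H0 // 32, W0 // 32, C))  # CNN backbone output: (20, 15, 2048)

W1x1 = np.zeros((C, d))       # a 1x1 conv is a per-position linear map over channels
reduced = feat @ W1x1         # channel reduction: (20, 15, 256)

seq = reduced.reshape(-1, d)  # flatten spatial dims into a sequence: (300, 256)
```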
<p>The DETR architecture is an encoder-decoder transformer connected to a
feed-forward network for the final detection prediction. In the encoder, the
input to each encoder layer is added with a fixed positional encoding (only to
the Q and K vectors, not V). Each encoder layer is a multi-head self-attention
module and an FFN. Precisely, each block is:
self-attention → add & norm → feed-forward network → add & norm → output.</p>
<p>In the decoder, $N$ objects are decoded in parallel, not using the
autoregressive model of sequence prediction. The output from the encoder
becomes the K and V vectors of the cross-attention layer in the decoder blocks.
The decoder block receives queries, initially zero, plus learned output
positional encodings (object queries), which serve as the queries to the
subsequent decoder block.</p>
<p>The feed-forward network is a 3-layer perceptron with hidden
dimension $d$ and ReLU activation. It predicts the normalized center
coordinates and the width and height of the bounding box (all w.r.t. input
image width and height). Parallel to it is a linear layer to predict the class
label (including a special $\varnothing$ class for "no object", playing a role
similar to a background class) using the softmax function.</p>
<p>The difficulty of training the DETR model is in scoring the predicted
objects (class, position, and size) w.r.t. the ground truth. The loss function
is set up to optimize object-specific bbox losses, as follows:</p>
<p>Let $y=\{y_1,\dots,y_N\}$ be the ground truth and
$\hat{y}=\{\hat{y}_1,\dots,\hat{y}_N\}$ be a set of $N$ predictions; some
ground-truth entries can be “no object” $\varnothing$. Each $y_i=(c_i,b_i)$ for target
class label $c_i$ and bbox $b_i\in [0,1]^4$ of (cx,cy,h,w) relative to image
size. First, we have an optimal bipartite matching $\hat{\sigma}$ (in the form
of a permutation, obtained, e.g., from the Hungarian algorithm) between $y$ and
$\hat{y}$ that minimizes the matching cost:</p>
\[\hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^N L(y_i, \hat{y}_{\sigma(i)})\]
<p>where</p>
\[L(y_i,\hat{y}_{\sigma(i)})=\mathbb{I}[c_i\neq\varnothing]\left(-\hat{p}_{\sigma(i)}(c_i)+L_{\text{box}}(b_i,\hat{b}_{\sigma(i)})\right)\]
<p>is the pairwise matching cost, with $\hat{p}_{\sigma(i)}(c_i)$ the predicted
probability of class $c_i$ at the prediction indexed by $\sigma(i)$. The bbox loss is
defined as</p>
\[L_{\text{box}}(b_i,\hat{b}_{\sigma(i)}) = \lambda_{\text{IoU}}L_{\text{IoU}}(b_i,\hat{b}_{\sigma(i)})+\lambda_{L1} \Vert b_i - \hat{b}_{\sigma(i)}\Vert_1\]
<p>where $L_{\text{IoU}}$ is a generalized IoU and it should be scaleinvariant
(see Rezatofighi et al. (2019) “Generalized intersection over union”):</p>
\[L_{\text{IoU}}(b_i,\hat{b}_{\sigma(i)}) =
1-\Bigg(
\frac{\vert b_i \cap\hat{b}_{\sigma(i)}\vert}{\vert b_i \cup\hat{b}_{\sigma(i)}\vert}
-
\frac{\vert B(b_i, \hat{b}_{\sigma(i)})\setminus (b_i \cup\hat{b}_{\sigma(i)})\vert}{\vert B(b_i,\hat{b}_{\sigma(i)})\vert}
\Bigg)\]
<p>where:</p>
<ul>
<li>area is denoted with $\vert\cdot\vert$</li>
<li>union and intersection are on the box geometry</li>
<li>\(B(b_i,\hat{b}_{\sigma(i)})\) is the bounding box enclosing both
\(b_i\) and \(\hat{b}_{\sigma(i)}\)</li>
<li>
<p>The formula of $L_{\text{IoU}}$ is analogous to the DICE/F1 loss, which considers the
logit prediction $\hat{m}$ vs the binary target $m$, defined as</p>
\[L_{\text{DICE}}(m,\hat{m}) = 1-\frac{2m\sigma(\hat{m})+1}{\sigma(\hat{m})+m+1}\]
</li>
</ul>
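As a concrete illustration of the definition above, a minimal generalized IoU for axis-aligned boxes in corner form $(x_1,y_1,x_2,y_2)$ might look like this (a sketch, not the paper's implementation; the loss $L_{\text{IoU}}$ would be one minus this value):

```python
def giou(b1, b2):
    """Generalized IoU for axis-aligned boxes in corner form (x1, y1, x2, y2)."""
    # intersection area
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = a1 + a2 - inter
    # smallest enclosing box B(b1, b2)
    ew = max(b1[2], b2[2]) - min(b1[0], b2[0])
    eh = max(b1[3], b2[3]) - min(b1[1], b2[1])
    enclose = ew * eh
    # plain IoU minus the penalty for empty space in the enclosing box
    return inter / union - (enclose - union) / enclose
```

Unlike plain IoU, this stays informative for disjoint boxes: two non-overlapping boxes give a negative value instead of a flat zero, so the gradient still indicates how far apart they are.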
<p>Once the optimal matching $\hat{\sigma}$ is found between $y$ and $\hat{y}$,
the training loss is defined as a linear combination of class-prediction
cross-entropy and object-specific box loss:</p>
\[L_H(y,\hat{y}) = \sum_{i=1}^N \Big(-\log \hat{p}_{\hat{\sigma}(i)}(c_i)+\mathbb{I}[c_i\neq\varnothing]L_{\text{box}}(b_i,\hat{b}_{\hat{\sigma}(i)})\Big)\]
<p>In practice, the log-probability term for $c_i=\varnothing$ is
down-weighted by a factor of 10 to correct the class imbalance.</p>
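The bipartite matching step can be illustrated with SciPy's Hungarian-style solver on a made-up cost matrix (in DETR each entry would combine the class-probability and box terms above; the values here are purely for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy pairwise matching cost: rows = ground-truth objects, cols = predicted slots.
cost = np.array([
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.6, 0.4, 0.3],
])

# Optimal assignment minimizing the total cost (Hungarian algorithm)
row_ind, col_ind = linear_sum_assignment(cost)
total = cost[row_ind, col_ind].sum()
```

Here ground-truth 0 is matched to prediction 1, ground-truth 1 to prediction 0, and ground-truth 2 to prediction 2, for a total cost of 0.6; the permutation `col_ind` plays the role of $\hat{\sigma}$.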
<h2 id="details">Details</h2>
<p>More implementation details are in the appendix of the paper.</p>
<p>The loss metrics are normalized by the number of objects in the batch. Care
must be taken in distributed training since only a sub-batch is provided to
each computing node.</p>
<p>Hyperparameters for training:</p>
<ul>
<li>Optimizer: AdamW with weight decay $10^{-4}$</li>
<li>Gradient clipped at maximal gradient norm of 0.1</li>
<li>Backbone (CNN) is ResNet-50 (from Torchvision) with the last
classification layer discarded, batch normalization weights and statistics frozen
during training, and all else fine-tuned with a learning rate of $10^{-5}$
<ul>
<li>the backbone learning rate is an order of magnitude smaller than the rest
of the network to stabilize the training</li>
</ul>
</li>
<li>Transformer is trained with a learning rate of $10^{-4}$, additive dropout of
0.1 at every multi-head attention and feed-forward layer before normalization</li>
<li>Transformer weights are initialized with Xavier initialization</li>
<li>Loss hyperparameter: $\lambda_{\text{IoU}}=2$, $\lambda_{L1}=5$</li>
<li>Decoder query slot $N=100$</li>
<li>Baseline compared with Faster R-CNN, trained for 109 epochs, using settings as
in the Detectron2 model zoo</li>
<li>Spatial positional encoding: fixed absolute encoding
<ul>
<li>each of the two spatial coordinates of each embedding uses $d/2$ sine and cosine
functions with different frequencies; the two are concatenated to get $d$
channels</li>
<li>for 2D positional encoding, see Parmar et al (2018)</li>
</ul>
</li>
</ul>

<h1>Prokhorenkova et al (2018) CatBoost: Unbiased boosting with categorical features (2023-05-19, https://www.adrian.idv.hk/pgvdg18catboost)</h1>
<p>CatBoost is a library for gradient boosting on decision trees. This paper describes the key feature behind it.</p>
<h2 id="sec2">Sec 2</h2>
<p>Symbols and key concepts are provided.</p>
<p>Dataset $D=\{(x_i, y_i)\}_{i=1,\dots,n}$ with $x_i = (x^i_1,\dots,x^i_m)$ a
vector of $m$ features and $y_i\in\mathbb{R}$ a target value. Gradient boosting learns
a model $F:\mathbb{R}^m\mapsto\mathbb{R}$ to minimize the expected loss $\mathbb{E}L(y, F(x))$.</p>
<p>Gradient boosting: iteratively builds a sequence of approximations $F^t$ in a greedy fashion</p>
<ul>
<li>additive: $F^t = F^{t-1} + \alpha h^t$ with step size $\alpha$</li>
<li>
<p>$h^t:\mathbb{R}^m\mapsto\mathbb{R}$ is the <em>base predictor</em> chosen from a family of functions $H$, such that</p>
\[h^t = \arg\min_{h\in H}\mathbb{E}L(y,F^{t-1}(x)+h(x))\]
</li>
<li>optimization using Newton’s method, in particular, a least-squares approximation</li>
</ul>
\[\begin{aligned}
h^t &= \arg\min_{h\in H}\mathbb{E}(-g^t(x,y)-h(x))^2 \\
\text{with}\quad
g^t(x,y) &= \left.\frac{\partial L(y,s)}{\partial s}\right\vert_{s=F^{t-1}(x)}
\end{aligned}\]
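The least-squares step above can be illustrated with decision stumps as the base family $H$ and squared loss (a toy sketch; the target function, thresholds, step size, and round count are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, (200, 1))   # one feature
y = np.sin(3 * x[:, 0])            # toy regression target

def fit_stump(x, r):
    """Least-squares stump: pick the best threshold split on the single feature."""
    best = None
    for t in np.linspace(-1, 1, 21):
        left = x[:, 0] < t
        if left.sum() == 0 or left.sum() == len(r):
            continue
        pred = np.where(left, r[left].mean(), r[~left].mean())
        sse = ((r - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, r[left].mean(), r[~left].mean())
    _, t, lv, rv = best
    return lambda x: np.where(x[:, 0] < t, lv, rv)

F = np.zeros(len(y))        # F^0 = 0
alpha = 0.5                 # step size
for _ in range(100):
    g = F - y               # gradient of squared loss (y - F)^2 / 2 w.r.t. F
    h = fit_stump(x, -g)    # base predictor fit to the negative gradient
    F = F + alpha * h(x)    # additive update F^t = F^{t-1} + alpha h^t

mse = ((y - F) ** 2).mean()
```

For squared loss the negative gradient is just the residual $y - F^{t-1}(x)$, so each round fits a stump to what the current ensemble still gets wrong.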
<h2 id="sec3">Sec 3</h2>
<p>The problem with the gradient boosting tree arises when we encounter categorical features.</p>
<p>Handling categorical features in a boosted tree usually means one-hot
encoding. But if the feature has high cardinality, e.g., user ID, this would
lead to an infeasibly large number of new features.</p>
<p>Other method: Group categories by the target statistics (TS)</p>
<ul>
<li>TS = expected value of the target within that group of categories</li>
<li>LightGBM approach: categorical features are converted to gradient statistics at each step of gradient boosting
<ul>
<li>providing information for the tree</li>
<li>high computation cost to calculate statistics for each categorical value at each step</li>
<li>high memory cost to store the categories of each node for each split</li>
</ul>
</li>
<li>using TS as a new numerical feature is cost-efficient with minimal information loss</li>
</ul>
<p>How to substitute category $x^i_k$ of sample $k$ with a numeric feature $\hat{x}^i_k$ of TS?</p>
<ul>
<li>make $\hat{x}^i_k = \mathbb{E}[y\mid x^i = x^i_k]$, i.e., the expectation of target of the same category over the entire population</li>
<li>
<p>Greedy approach to estimate $\mathbb{E}[y\mid x^i = x^i_k]$ on low-frequency categories:</p>
\[\hat{x}^i_k = \frac{\sum_{j=1}^n \mathbb{1}[x^i_j=x^i_k]y_j + ap}{\sum_{j=1}^n \mathbb{1}[x^i_j=x^i_k] + a}\]
<ul>
<li>$p$ the prior estimate, usually the average target value of the dataset</li>
<li>$a>0$ the smoothing parameter</li>
<li>subject to target leaking, i.e., there is a conditional shift from the desired expectation to the estimated expectation
\(\mathbb{E}[\hat{x}^i\mid y=v] \ne \mathbb{E}[\hat{x}^i_k\mid y_k = v]\)</li>
</ul>
</li>
</ul>
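The greedy (full-dataset) TS above can be computed directly; a small sketch with made-up data and $a=1$ (categories and targets are illustrative):

```python
import numpy as np

cats = np.array(["a", "a", "b", "b", "b"])   # a high-cardinality feature in practice
y    = np.array([1.0, 0.0, 1.0, 1.0, 0.0])  # targets
a    = 1.0                                   # smoothing parameter
p    = y.mean()                              # prior: average target over the dataset

# smoothed target statistic per sample, computed over the full dataset (greedy)
ts = np.array([(y[cats == c].sum() + a * p) / ((cats == c).sum() + a) for c in cats])
```

Note that each sample's own target enters its own statistic, which is exactly the target leakage the paper warns about.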
<p>Improvement to mitigate conditional shift: Compute target statistics from a special subset</p>
\[\hat{x}^i_k = \frac{\sum_{j\in J} \mathbb{1}[x^i_j=x^i_k]y_j + ap}{\sum_{j\in J} \mathbb{1}[x^i_j=x^i_k] + a}\]
<ul>
<li>Hold-out TS: Partition the training dataset into two parts, use one part to calculate the TS and the other for actual training. But this makes the training set smaller</li>
<li>Leave-one-out TS: Use $D_k = D\setminus\{x_k\}$ for TS</li>
<li>Ordered TS (CatBoost):
<ul>
<li>inspired by online learning algorithms (getting samples sequentially in time)</li>
<li>in the offline setting, randomly permute the samples with a permutation $\sigma$ as artificial “time”</li>
<li>to compute the TS at $x_k$, consider the subset $D_k = \{x_j: \sigma(j)<\sigma(k)\}$ of samples seen before in the permuted sequence</li>
<li>CatBoost uses a different permutation at different gradient boosting steps (i.e., ordered boosting) to reduce the variance</li>
</ul>
</li>
</ul>
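Ordered TS can be sketched with a single random permutation (toy data; CatBoost itself uses different permutations across boosting steps):

```python
import numpy as np

rng = np.random.default_rng(0)
cats = np.array(["a", "b", "a", "b", "a", "b"])   # categorical feature
y    = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])  # targets
a, p = 1.0, y.mean()                              # smoothing and prior

perm = rng.permutation(len(cats))   # artificial "time" order
rank = np.empty(len(cats), int)
rank[perm] = np.arange(len(cats))   # rank[k] = position of sample k in that order

ts = np.empty(len(cats))
for k in range(len(cats)):
    hist = (rank < rank[k]) & (cats == cats[k])   # same category, seen earlier only
    ts[k] = (y[hist].sum() + a * p) / (hist.sum() + a)
```

The first sample in the artificial time order has an empty history, so its statistic falls back to the prior $p$; no sample's own target ever enters its own statistic.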
<h2 id="sec4">Sec 4</h2>
<p>$h^t$ is usually approximated from the dataset,
\(h^t = \arg\min_{h\in H}\frac{1}{n}\sum_{k=1}^n(-g^t(x_k,y_k)-h(x_k))^2\)</p>
<ul>
<li>base predictor $h^t$ is biased since the distribution of $g^t(x_k,y_k)\mid x_k$ differs from that of $g^t(x,y)\mid x$</li>
<li>causing target leakage: gradients used at each step are estimated using the target values of the same data points the current model $F^{t1}$ was built on</li>
</ul>
<p>Prediction shift: Theorem 1 in the paper says an unbiased estimate can be
achieved if independent datasets are used at each gradient step. Otherwise
there is a bias of $-\frac{1}{n-1}c_2(x^2-\frac12)$, for $n$ the dataset size.</p>

<h1>Redmon et al (2016) You Only Look Once: Unified, Real-time Object Detection (2023-04-02, https://www.adrian.idv.hk/rdgf16yolo)</h1>
<p>This is the paper that proposed the YOLOv1 network, which reframed object detection as a
regression problem. It is a single convolutional network that simultaneously predicts
multiple bounding boxes and class probabilities for those boxes. It compares against
R-CNN but is faster and can see the entire image at once.</p>
<h2 id="networkdesign">Network Design</h2>
<p>The input image is divided into an $S\times S$ grid; if the center of an object falls in a grid
cell, that grid cell should detect it. Each grid cell detects $B$ bboxes with confidence
scores, defined as the probability of an object times the IoU (i.e., the score should be
zero for no object). A bbox prediction is a 5-tuple $(x,y,w,h,o)$, in which $(x,y)$ is the box
center relative to the bounds of the grid cell, $w,h$ are relative to the whole image, and
$o$ the IoU between the predicted box and the ground truth (class-agnostic).</p>
<p>Each grid cell predicts $C$ classes simultaneously, regardless of the number of bboxes
predicted. The class-specific confidence score for a box is therefore
$P(C_i)\times \text{IoU}$.</p>
<p>In summary, the $S\times S$ grid produces a tensor of $S\times S\times (5B+C)$. For PASCAL
VOC, the paper suggested $S=7$, $B=2$ with $C=20$.</p>
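A quick arithmetic check of the output tensor size for the suggested settings:

```python
S, B, C = 7, 2, 20              # grid size, boxes per cell, classes (PASCAL VOC)
outputs = S * S * (5 * B + C)   # each box carries a 5-tuple (x, y, w, h, o)
print(outputs)                  # 1470 = 7 x 7 x 30
```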
<p>The network architecture is depicted in Fig.3 in the paper, with 24 conv layers followed
by 2 fully connected layers. A fast YOLO would use 9 conv layers and fewer filters in
them. The 24-conv-layer version for 20-class PASCAL VOC is as follows, all sequential:</p>
<ul>
<li>Conv 7×7×64 stride 2, for input of 448×448</li>
<li>Max pooling 2×2 stride 2, outputs 112×112</li>
<li>Conv 3×3×192</li>
<li>Max pooling 2×2 stride 2, outputs 56×56</li>
<li>Conv 1×1×128</li>
<li>Conv 3×3×256</li>
<li>Conv 1×1×256</li>
<li>Conv 3×3×512</li>
<li>Max pooling 2×2 stride 2, outputs 28×28</li>
<li>Conv 1×1×256 (repetition 1 of 4)</li>
<li>Conv 3×3×512</li>
<li>Conv 1×1×256 (repetition 2 of 4)</li>
<li>Conv 3×3×512</li>
<li>Conv 1×1×256 (repetition 3 of 4)</li>
<li>Conv 3×3×512</li>
<li>Conv 1×1×256 (repetition 4 of 4)</li>
<li>Conv 3×3×512</li>
<li>Conv 1×1×512</li>
<li>Conv 3×3×1024</li>
<li>Max pooling 2×2 stride 2, outputs 14×14</li>
<li>Conv 1×1×512 (repetition 1 of 2)</li>
<li>Conv 3×3×1024</li>
<li>Conv 1×1×512 (repetition 2 of 2)</li>
<li>Conv 3×3×1024</li>
<li>Conv 3×3×1024</li>
<li>Conv 3×3×1024 stride 2, outputs 7×7</li>
<li>Conv 3×3×1024</li>
<li>Conv 3×3×1024</li>
<li>Fully connected layer, outputs 4096 units</li>
<li>Dropout at rate 0.5</li>
<li>Fully connected layer, outputs 7×7×30 units after reshape</li>
</ul>
<p>The final output of 7×7×30 is for 7×7 grids, each grid cell has 20 class probabilities and
2 bounding boxes, each box represented by a 5tuple of $(x,y,w,h,o)$.</p>
<p>According to sec 2.2, the final layer uses linear activation, and the rest are using leaky
ReLU,</p>
\[\phi(x) = \begin{cases}
x & \text{if }x>0 \\
0.1x & \text{otherwise}
\end{cases}\]
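A minimal numpy sketch of this activation:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    # identity for positive inputs, scaled by the 0.1 slope otherwise
    return np.where(x > 0, x, slope * x)
```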
<p>For PASCAL VOC, the network outputs 7×7×2=98 bounding boxes. At inference, they run
through non-maximal suppression, which produces 2-3% higher mAP.</p>
<h2 id="training">Training</h2>
<p>The YOLO network is pretrained with the ImageNet dataset by taking the first 20 conv layers
followed by an average pooling layer and a fully connected layer. The paper reported that
this classification network can achieve 88% top-5 accuracy.</p>
<p>After the pretraining, the remaining 4 conv layers and 2 FC layers are added, with random
weights, and the input is changed from 224×224 to 448×448 to provide more fine-grained visual
information.</p>
<p>The network output is interpreted as follows:</p>
<ul>
<li>width and height are normalized by the image dimensions to between 0 and 1, and the
network outputs the square root of these values</li>
<li>bounding box coordinates $(x,y)$ are offset to grid cell location, which are also
between 0 and 1</li>
</ul>
<p>The network is trained using sum-squared error as the loss function, but weights the loss
between positive and negative boxes with $\lambda_\text{pos} = 5$ and
$\lambda_\text{neg} = 0.5$ to mitigate the imbalanced sample sizes. The overall loss
function:</p>
\[\begin{aligned}
L &= \lambda_\text{pos} \sum_{i=0}^{S^2} \sum_{j=0}^B \mathbb{I}_{ij} [(x_i-\hat{x}_i)^2 +
(y_i-\hat{y}_i)^2 + (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2]
\\
& \quad + \sum_{i=0}^{S^2} \sum_{j=0}^B \mathbb{I}_{ij} (C_i - \hat{C}_i)^2 \\
& \quad + \lambda_\text{neg} \sum_{i=0}^{S^2} \sum_{j=0}^B (1-\mathbb{I}_{ij}) (C_i - \hat{C}_i)^2 \\
& \quad + \sum_{i=0}^{S^2} \mathbb{I}_i \sum_c (P_i(c) - \hat{P}_i(c))^2
\end{aligned}\]
<p>where</p>
<ul>
<li>Index $i$ iterates over all grid cells, index $j$ iterates over all bounding boxes in a grid cell</li>
<li>Index $c$ iterates over all classes</li>
<li>$\mathbb{I}_{i}$ is the indicator function for cell $i$ containing an object</li>
<li>$\mathbb{I}_{ij}$ is the indicator function for cell $i$ containing an object in bounding box $j$</li>
<li>$P_i(c)$ is the classification probability of class $c$ (ground truth is either 0 or 1)</li>
</ul>
<p>At training, only the predictor with the highest IoU with the ground truth is considered
positive. The paper claimed this can lead to better overall recall as the predictors are
each trained for certain sizes and aspect ratios.</p>
<p>The training runs for 135 epochs on datasets of PASCAL VOC 2007 and 2012. Batch size is
64, with momentum 0.9 and weight decay 5e-4. The learning rate starts at 1e-3 and is
gradually increased to 1e-2 in the first epoch. Then the network is trained at 1e-2 for 75 epochs,
followed by 1e-3 for 30 epochs, and finally 1e-4 for 30 epochs. Images are augmented with
random scaling and translation of up to 20% of the original image size, then the exposure and
saturation of the image are randomly adjusted by up to a factor of 1.5 in HSV space.</p>

<h1>Liu et al (2016) SSD: Single Shot MultiBox Detector (2023-04-01, https://www.adrian.idv.hk/laesrfb16ssd)</h1>
<p>This paper is distinct from previous work in the sense that the older approach of object
detection first hypothesizes bounding boxes, resamples features for each box, then applies a
classifier. This paper proposed a network that does not resample for bounding box
hypotheses but is equally accurate. It can do high-speed detection, at 59 fps with mAP 74.3%
on the VOC2007 test set, while Faster R-CNN can do only 7 fps with mAP 73.2%. The reasons for
the speed-up and accuracy improvement are (1) eliminating the bbox proposal and feature
resampling stage, (2) small conv filters to predict object categories and offsets, (3)
separate predictors for different aspect ratio detection.</p>
<h2 id="model">Model</h2>
<p>SSD is a feed-forward conv net that produces a fixed-size collection of bboxes and scores.
Then non-maximum suppression produces the final detection. The paper suggested a
backbone network of VGG16 truncated before the classifier, which takes a 300×300 input
image and outputs a 38×38 feature map of channel depth 512 at its “conv5_3” layer.
Then multiple conv layers are appended to it, with progressively decreasing sizes to
generate feature maps of different scales.</p>
<p>A feature map of size $m\times n$ produces several detections at each position using a
small 3×3 kernel. The output at each convolution position is $(c+4)k$ values, for $k$
“default boxes” of different sizes and aspect ratios; each box has $c$ classification
scores and 4 regressed bounding box parameters (the bounding box is agnostic to the
classification result). Thus, there are $(c+4)kmn$ outputs from a feature map, regardless
of the channel depth $p$.</p>
<p>The network proposed in the paper (Fig.2) is as follows:</p>
<ul>
<li>Backbone: VGG16, through the “Conv5_3” layer, outputs a feature map of 38×38×512 to a
classifier of 4×(c+4) output values (4 default boxes per location)</li>
<li>the previous feature map filtered with Conv 6 (3×3×1024 with 6×6 dilation) then Conv 7
(1×1×1024), producing a feature map of 19×19×1024 to a classifier of 6 default boxes,
output 6×(c+4) values</li>
<li>then filter with Conv 8 (1×1×256 then 3×3×512 stride 2) producing a feature map of
10×10×512 to a classifier of 6 default boxes (output 6×(c+4) values)</li>
<li>then filter with Conv 9 (1×1×128 then 3×3×256 stride 2) producing a feature map of
5×5×256 to a classifier of 6 default boxes (output 6×(c+4) values)</li>
<li>then filter with Conv 10 (1×1×128 then 3×3×256 stride 1) producing a feature map of
3×3×256 to a classifier of 4 default boxes (output 4×(c+4) values)</li>
<li>then filter with Conv 11 (1×1×128 then 3×3×256 stride 1) producing a feature map of
1×1×256 to a classifier of 4 default boxes (output 4×(c+4) values)</li>
</ul>
<p>The total number of output is \(38^2 \times 4 + 19^2 \times 6 + 10^2 \times 6 + 5^2 \times 6 + 3^2 \times 4 + 1^2 \times 4 = 8732\)
boxes, each to classify to \(c\) classes (PASCAL VOC 20 classes + 1 background).</p>
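The 8732 figure can be double-checked from the feature-map sizes and default-box counts listed above:

```python
# (feature map size, default boxes per location) for each SSD300 prediction head
maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total = sum(f * f * k for f, k in maps)
print(total)  # 8732
```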
<p>This is commonly called SSD300 model. Another model is SSD512 which takes 512×512 image as
input.</p>
<h2 id="traininglossfunction">Training: Loss function</h2>
<p>In training, SSD only needs an input image and ground-truth boxes for each object. As the
outputs at various resolutions and for different default boxes are known, the ground-truth box is
matched to any default box with >0.5 IoU (a.k.a. Jaccard overlap). The ground truth is not
associated with only the single best default box, so the network is allowed to predict high
scores for multiple boxes rather than being required to pick only one.</p>
<p>The loss function at each location is the weighted sum of localization loss $L_\text{loc}$
and confidence loss $L_\text{conf}$:</p>
\[L(x,c,l,g) = \frac{1}{N} (L_\text{conf}(x,c) + \alpha L_\text{loc}(x,l,g))\]
<p>where $N$ is the number of matched default boxes, $L_\text{loc}$ is the smooth L1 loss
between the predicted box $l$ and the ground-truth box $g$, $L_\text{conf}$ the softmax
loss over multiple class confidences $c$, and $x\in\{0,1\}$ the indicator of whether the
predicted box matches the ground-truth box. Precisely,</p>
\[L_\text{loc}(x,l,g) = \sum_{i=1}^N \sum_{m\in\{cx,cy,w,h\}} x_{ij}^k \,\mathrm{smooth}_{L1}(l_i^m - g_j^m)\]
<p>where $m$ ranges over the 4 parameters of a bounding box, (cx, cy, w, h), and $i$ iterates over the $N$
default boxes matched with ground truth $j$ (such that $x_{ij}=1$). The ground-truth
bbox is encoded in the same form as in R-CNN.</p>
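The smooth L1 here is presumably the Huber-style function used in Fast R-CNN; a minimal sketch:

```python
import numpy as np

def smooth_l1(x):
    # quadratic near zero, linear in the tails (Huber-style)
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)
```

The quadratic region gives small, stable gradients near zero error, while the linear tails keep outlier boxes from dominating the loss.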
<p>The confidence loss is</p>
\[L_\text{conf}(x,c) = -\sum_i \log c_i^p\]
<p>where $c_i^p$ is the softmax output for the correct class $p$ (background class if not
matched with any groundtruth bbox) and $i$ iterates over all default boxes.</p>
<h2 id="trainingdefaultboxes">Training: Default boxes</h2>
<p>The SSD network produced feature maps of different resolutions. The lower layer in higher
resolution is believed to have better semantic segmentation quality, and pooling over a
global context can also improve the result.</p>
<p>For an SSD network with $m$ feature maps, the scale of the default boxes at feature map
$k\in\{1,\dots,m\}$ is computed as</p>
\[s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m-1}(k-1)\]
<p>with $s_{\min}=0.2$ meaning the lowest layer has a scale of 0.2 and $s_{\max}=0.9$ meaning the
highest layer has a scale of 0.9, and all layers in between are evenly scaled. The default
boxes have aspect ratios $a_r\in\{\frac13,\frac12,1,2,3\}$, for which the width and height
are $w_k=s_k\sqrt{a_r}, h_k=s_k/\sqrt{a_r}$. An additional square default box of scale
$s_k' = \sqrt{s_k s_{k+1}}$ is added to make up 6 boxes per location.</p>
<p>The center of each box is set to $(\frac{i+0.5}{f_k}, \frac{j+0.5}{f_k})$ (center of a
pixel) where $f_k$ is the dimension of the square feature map $k$, and $i,j\in[0,f_k)$.</p>
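The scales and default-box shapes can be enumerated directly (a sketch for the settings quoted above):

```python
import math

m, s_min, s_max = 6, 0.2, 0.9   # number of feature maps and scale range
scales = [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]
# box width/height at the lowest scale, one pair per aspect ratio
ratios = [1 / 3, 1 / 2, 1, 2, 3]
boxes = [(scales[0] * math.sqrt(a), scales[0] / math.sqrt(a)) for a in ratios]
```

This yields scales 0.2, 0.34, 0.48, 0.62, 0.76, 0.9 for the six feature maps; each `(w, h)` pair preserves the area $s_k^2$ while varying the aspect ratio.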
<p>This way, there is a large number of negative default boxes. In training, the negative
boxes with the highest confidence loss are picked such that the ratio between positive and
negative boxes is maintained at 1:3 or less.</p>
<h2 id="trainingaugmentation">Training: Augmentation</h2>
<p>The input data is randomly augmented by extracting a patch of size 0.1 to 1 of the
original image, with aspect ratio between 0.5 and 2. The sampled patch is then resized and
randomly flipped horizontally before other photometric distortions. A patch is selected
only if the IoU with the target object is high enough (e.g., 0.9) to provide enough
positive samples.</p>