Convolutional networks such as AlexNet demonstrated high accuracy in image recognition. However, latency and model size (i.e., memory) can be a concern. MobileNet, proposed in this paper, makes both adjustable.

## Depthwise Separable Convolution

The key component of MobileNet is the *depthwise separable convolution* layer.
It factorizes a standard convolution into a depthwise convolution followed by a
1×1 pointwise convolution. A standard convolution filters and combines inputs
into outputs in one step. For example, a Conv2D with a kernel of size
\(D_K\times D_K\) applied to an \(M\)-channel input tensor of shape
\(H\times W\times M\) is parameterized by a kernel tensor of shape
\(D_K\times D_K\times M\times N\), where \(N\) is the number of output
channels. Assuming stride 1 and appropriate padding, the input tensor \(F\) and
output tensor \(G\) are related by

\[G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \times F_{k+i, l+j, m}\]

In other words, the input tensor is 3D and the kernel tensor is 4D. For each output channel, the kernel slides over the width and height but always spans all input channels. At each position, the sum of the elementwise (Hadamard) product is assigned to one element in that channel of the output.

Each Hadamard product costs \(D_K^2M\) multiply-adds, so computing all \(N\) output channels at one position costs \(D_K^2MN\). The output feature map has \(H\times W\) positions, so the total cost per convolution layer is \(D_K^2MNHW\).
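The cost formula can be checked by counting multiplications in a direct implementation. The following is a minimal pure-Python sketch, not an efficient implementation. It assumes stride 1, an odd kernel size, and zero padding that centers the kernel so the output stays \(H\times W\) (a shifted version of the index \(k+i\) used above). The function name `conv2d_standard` and the toy sizes are illustrative choices.

```python
def conv2d_standard(F, K):
    """Standard convolution, stride 1, zero padding so the output is H x W.

    F is indexed F[k][l][m] (H x W x M) and K is indexed K[i][j][m][n]
    (D_K x D_K x M x N). Returns (G, mult_adds) with G indexed G[k][l][n].
    """
    H, W, M = len(F), len(F[0]), len(F[0][0])
    D, N = len(K), len(K[0][0][0])
    pad = D // 2  # assumes an odd kernel size
    G = [[[0.0] * N for _ in range(W)] for _ in range(H)]
    mult_adds = 0
    for k in range(H):
        for l in range(W):
            for n in range(N):
                for i in range(D):
                    for j in range(D):
                        y, x = k + i - pad, l + j - pad
                        inside = 0 <= y < H and 0 <= x < W
                        for m in range(M):
                            value = F[y][x][m] if inside else 0.0  # zero padding
                            G[k][l][n] += K[i][j][m][n] * value
                            mult_adds += 1
    return G, mult_adds


# Toy example: 4x5 input with 2 channels, 3x3 kernel, 3 output channels.
H, W, M, N, D = 4, 5, 2, 3, 3
F = [[[float(k + l + m) for m in range(M)] for l in range(W)] for k in range(H)]
K = [[[[1.0] * N for _ in range(M)] for _ in range(D)] for _ in range(D)]
G, mult_adds = conv2d_standard(F, K)
print(mult_adds, D * D * M * N * H * W)  # the two counts should agree
```

The counter includes multiplications by padded zeros, which is the convention behind the \(D_K^2MNHW\) figure.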

The depthwise convolution applies a single \(D_K\times D_K\) filter to each input channel, hence its kernel tensor \(\hat{K}\) has size \(D_K\times D_K\times M\) only. Each filter is applied to its own channel separately, producing an output tensor of \(H\times W\times M\), i.e., the same number of channels as the input:

\[G'_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \times F_{k+i, l+j, m}\]

The pointwise convolution is just a convolution layer with a 1×1 kernel, hence its kernel tensor has size \(1\times 1\times M\times N\). This is like a fully-connected layer applied across channels, transforming the \(M\) channels of the input into \(N\) channels in the output:

\[G_{k,l,n} = \sum_{m} W_{m,n} \times G'_{k,l,m}\]

The total computation cost is \(D_K^2MWH + MNWH\). Relative to the \(D_K^2MNHW\) of a standard convolution, this is a ratio of \(1/N+1/D_K^2\). For example, with a 3×3 kernel and a large \(N\), the cost drops to roughly 1/9 of the original, while the accuracy is only slightly lower.
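The two steps and their combined cost can be sketched in pure Python as follows. This is an illustrative implementation under stated assumptions: stride 1, an odd kernel size, centered zero padding, and one \(D_K\times D_K\) depthwise filter per input channel. The names `depthwise_conv`, `pointwise_conv`, `Kd`, and `Wp` are hypothetical and chosen for this example only.

```python
def depthwise_conv(F, Kd):
    """Depthwise step: Kd[i][j][m] filters channel m only (stride 1, zero padding)."""
    H, W, M = len(F), len(F[0]), len(F[0][0])
    D = len(Kd)
    pad = D // 2  # assumes an odd kernel size
    out = [[[0.0] * M for _ in range(W)] for _ in range(H)]
    count = 0
    for k in range(H):
        for l in range(W):
            for m in range(M):
                for i in range(D):
                    for j in range(D):
                        y, x = k + i - pad, l + j - pad
                        value = F[y][x][m] if 0 <= y < H and 0 <= x < W else 0.0
                        out[k][l][m] += Kd[i][j][m] * value
                        count += 1
    return out, count


def pointwise_conv(G1, Wp):
    """Pointwise step: Wp[m][n] mixes the M channels into N channels per pixel."""
    H, W, M = len(G1), len(G1[0]), len(G1[0][0])
    N = len(Wp[0])
    out = [[[0.0] * N for _ in range(W)] for _ in range(H)]
    count = 0
    for k in range(H):
        for l in range(W):
            for n in range(N):
                for m in range(M):
                    out[k][l][n] += Wp[m][n] * G1[k][l][m]
                    count += 1
    return out, count


# Toy example: 4x5 input with 2 channels, 3x3 depthwise kernel, 3 output channels.
H, W, M, N, D = 4, 5, 2, 3, 3
F = [[[float(k + l + m) for m in range(M)] for l in range(W)] for k in range(H)]
Kd = [[[1.0 + m for m in range(M)] for _ in range(D)] for _ in range(D)]
Wp = [[float(n + 1) for n in range(N)] for _ in range(M)]

G1, dw_cost = depthwise_conv(F, Kd)
G, pw_cost = pointwise_conv(G1, Wp)
separable = dw_cost + pw_cost     # D^2*M*H*W + M*N*H*W
standard = D * D * M * N * H * W  # cost of the unfactorized layer
print(separable, standard, separable / standard, 1 / N + 1 / D**2)
```

With such a small \(N\) the saving is modest; the ratio approaches \(1/D_K^2\) only when \(N\) is large, as in the 512-channel layers of the network.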

## Architecture

Typically a convolutional network has a building block of “3×3 Conv-BN-ReLU”. In MobileNet, this is replaced by:

```
3×3 Depthwise Conv - BN - ReLU - 1×1 Conv - BN - ReLU
```

The paper reports that 94.86% of the Mult-Add operations are in the 1×1 convolutions, and 74.59% of the parameters are 1×1 convolution weights. (Compare: the 3×3 depthwise convolutions contribute 3.06% of Mult-Adds and 1.06% of parameters; the only full 3×3 Conv layer, at the input, accounts for 1.19% of Mult-Adds; and the fully connected layer at the output accounts for 0.18% of Mult-Adds but 24.33% of parameters.)

The paper suggests training the model with less regularization and data augmentation, because the model is smaller (e.g., than Inception) and therefore less prone to overfitting. The model architecture is given in Table 1 of the paper:

Type | Stride | Filter Shape | Input Size | Note |
---|---|---|---|---|
Conv | 2 | 3×3×3×32 | 224×224×3 | Input |
Conv DW | 1 | 3×3×32 DW | 112×112×32 | |
Conv | 1 | 1×1×32×64 | 112×112×32 | |
Conv DW | 2 | 3×3×64 DW | 112×112×64 | |
Conv | 1 | 1×1×64×128 | 56×56×64 | |
Conv DW | 1 | 3×3×128 DW | 56×56×128 | |
Conv | 1 | 1×1×128×128 | 56×56×128 | |
Conv DW | 2 | 3×3×128 DW | 56×56×128 | |
Conv | 1 | 1×1×128×256 | 28×28×128 | |
Conv DW | 1 | 3×3×256 DW | 28×28×256 | |
Conv | 1 | 1×1×256×256 | 28×28×256 | |
Conv DW | 2 | 3×3×256 DW | 28×28×256 | |
Conv | 1 | 1×1×256×512 | 14×14×256 | |
Conv DW | 1 | 3×3×512 DW | 14×14×512 | repeat 1 |
Conv | 1 | 1×1×512×512 | 14×14×512 | |
Conv DW | 1 | 3×3×512 DW | 14×14×512 | repeat 2 |
Conv | 1 | 1×1×512×512 | 14×14×512 | |
Conv DW | 1 | 3×3×512 DW | 14×14×512 | repeat 3 |
Conv | 1 | 1×1×512×512 | 14×14×512 | |
Conv DW | 1 | 3×3×512 DW | 14×14×512 | repeat 4 |
Conv | 1 | 1×1×512×512 | 14×14×512 | |
Conv DW | 1 | 3×3×512 DW | 14×14×512 | repeat 5 |
Conv | 1 | 1×1×512×512 | 14×14×512 | |
Conv DW | 2 | 3×3×512 DW | 14×14×512 | repeat end |
Conv | 1 | 1×1×512×1024 | 7×7×512 | |
Conv DW | 2 | 3×3×1024 DW | 7×7×1024 | |
Conv | 1 | 1×1×1024×1024 | 7×7×1024 | |
Avg pool | 1 | 7×7 Pool | 7×7×1024 | |
FC | 1 | 1024×1000 | 1×1×1024 | |
Softmax | 1 | Classifier | 1×1×1000 | |
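The table above can be turned into a quick tally of Mult-Adds and weights per layer type. The sketch below is a back-of-the-envelope calculation, not the paper's own accounting. It assumes "same" padding, ignores biases and batch normalization, and treats the last depthwise layer as stride 1 because its input and output are both 7×7. The names `BLOCKS` and `tally` are illustrative.

```python
import math

# (depthwise stride, pointwise output channels) for the 13 separable blocks.
# The last block is taken as stride 1, since its input and output are both 7x7.
BLOCKS = ([(1, 64), (2, 128), (1, 128), (2, 256), (1, 256), (2, 512)]
          + [(1, 512)] * 5 + [(2, 1024), (1, 1024)])


def tally(resolution=224, classes=1000):
    """Return {layer type: [mult_adds, weights]}, ignoring biases and BN."""
    size = math.ceil(resolution / 2)  # the first 3x3 conv has stride 2
    channels = 32
    stats = {
        "conv 3x3": [9 * 3 * channels * size * size, 9 * 3 * channels],
        "conv dw 3x3": [0, 0],
        "conv 1x1": [0, 0],
    }
    for stride, out_channels in BLOCKS:
        size = math.ceil(size / stride)  # "same" padding
        stats["conv dw 3x3"][0] += 9 * channels * size * size
        stats["conv dw 3x3"][1] += 9 * channels
        stats["conv 1x1"][0] += channels * out_channels * size * size
        stats["conv 1x1"][1] += channels * out_channels
        channels = out_channels
    stats["fc"] = [channels * classes, channels * classes]
    return stats


stats = tally()
total_ops = sum(ops for ops, _ in stats.values())
total_weights = sum(weights for _, weights in stats.values())
for name, (ops, weights) in stats.items():
    print(f"{name:12s} {100 * ops / total_ops:6.2f}% Mult-Adds "
          f"{100 * weights / total_weights:6.2f}% parameters")
```

Under these assumptions the 1×1 convolutions account for roughly 95% of the Mult-Adds and 75% of the weights. The first 3×3 layer comes out near 1.9% of Mult-Adds in this tally rather than the 1.19% quoted above, so that figure may be worth checking against the paper.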

In the paper, the model is evaluated on ImageNet and achieves 70.6% accuracy, about 1 percentage point lower than the same design built with full convolutions instead of depthwise separable convolutions (i.e., 3×3×64×128 Conv vs. 3×3×64 DW + 1×1×64×128 Conv).

The pooling layer before the fully-connected layer is a
*2D global average pooling* layer. It takes an input tensor of shape
(batch, height, width, channel) and outputs a tensor of shape (batch, channel),
in which each element is the average over all spatial positions in the same
channel.
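As a small illustration of this reduction, here is a pure-Python sketch on nested lists. The function name `global_average_pooling_2d` is chosen for this example and is not a library call.

```python
def global_average_pooling_2d(x):
    """x is indexed x[batch][height][width][channel]; returns out[batch][channel]."""
    out = []
    for image in x:
        height, width, channels = len(image), len(image[0]), len(image[0][0])
        sums = [0.0] * channels
        for row in image:
            for pixel in row:
                for c, value in enumerate(pixel):
                    sums[c] += value
        out.append([s / (height * width) for s in sums])
    return out


# One 2x2 image with 3 channels.
batch = [[[[1.0, 10.0, 0.0], [2.0, 20.0, 0.0]],
          [[3.0, 30.0, 0.0], [4.0, 40.0, 4.0]]]]
pooled = global_average_pooling_2d(batch)
print(pooled)  # [[2.5, 25.0, 1.0]]
```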

## Scaling Parameters

The width multiplier \(\alpha\) adjusts the number of channels. For the baseline design above, a layer with \(M\) input channels and \(N\) output channels becomes one with \(\alpha M\) input channels and \(\alpha N\) output channels (rounded), with typical values \(\alpha\in \{1, 0.75, 0.5, 0.25\}\). Setting \(\alpha\) scales both the total number of parameters and the number of Mult-Adds by roughly \(\alpha^2\).

There is also a resolution multiplier \(\rho\), which scales the input resolution (hence it is normally not part of the model itself). Typically, MobileNet takes square inputs of side 224, 192, 160, or 128. Setting \(\rho\) scales the number of Mult-Adds by \(\rho^2\); the number of parameters is unchanged, because the kernel shapes do not depend on the spatial resolution.

The paper's evaluation found that accuracy drops smoothly as \(\alpha\) or \(\rho\) decreases.
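The effect of the two multipliers on a single depthwise separable layer can be checked numerically. This is a rough sketch using the cost formula \(D_K^2MHW + MNHW\) from earlier with scaled channels and feature-map side. The function `separable_cost` and the chosen layer (3×3, 512 to 512 channels, 14×14) are illustrative assumptions.

```python
def separable_cost(d_k, m, n, size, alpha=1.0, rho=1.0):
    """Mult-Adds and weights of one depthwise separable layer.

    m and n are the baseline input/output channels and size is the baseline
    feature-map side; alpha thins the channels and rho shrinks the side.
    """
    m, n, size = round(alpha * m), round(alpha * n), round(rho * size)
    mult_adds = d_k * d_k * m * size * size + m * n * size * size
    weights = d_k * d_k * m + m * n
    return mult_adds, weights


base_ops, base_weights = separable_cost(3, 512, 512, 14)
thin_ops, thin_weights = separable_cost(3, 512, 512, 14, alpha=0.5)
small_ops, small_weights = separable_cost(3, 512, 512, 14, rho=160 / 224)

print(thin_ops / base_ops, thin_weights / base_weights)    # both close to 0.5**2
print(small_ops / base_ops, small_weights / base_weights)  # about (160/224)**2, then 1.0
```

The \(\alpha^2\) scaling is only approximate because the depthwise term scales linearly in \(\alpha\). In this sketch the weight count does not depend on `rho`, since kernel shapes are independent of the feature-map size.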

## Keras Implementation

In TensorFlow, MobileNet is available as a built-in model. This is how you can print its summary and plot it:

```
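import tensorflow as tf

# build MobileNet with the default arguments (ImageNet weights, 224×224 input)
model = tf.keras.applications.MobileNet()
# print summary
model.summary(line_length=120)
# save a diagram of the model to `model.png` (needs pydot and graphviz)
tf.keras.utils.plot_model(model, to_file="model.png", show_shapes=True,
                          show_dtype=True, show_layer_names=True,
                          expand_nested=True, show_layer_activations=True)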
```

This generates `model.png`. The MobileNet function signature (with parameter defaults) is:

```
tf.keras.applications.MobileNet(
    input_shape=None,
    alpha=1.0,
    depth_multiplier=1,
    dropout=0.001,
    include_top=True,
    weights="imagenet",
    input_tensor=None,
    pooling=None,
    classes=1000,
    classifier_activation="softmax",
    **kwargs
)
```

And its model summary is:

```
Model: "mobilenet_1.00_224"
________________________________________________________________________________________________________________________
Layer (type) Output Shape Param #
========================================================================================================================
input_1 (InputLayer) [(None, 224, 224, 3)] 0
conv1 (Conv2D) (None, 112, 112, 32) 864
conv1_bn (BatchNormalization) (None, 112, 112, 32) 128
conv1_relu (ReLU) (None, 112, 112, 32) 0
conv_dw_1 (DepthwiseConv2D) (None, 112, 112, 32) 288
conv_dw_1_bn (BatchNormalization) (None, 112, 112, 32) 128
conv_dw_1_relu (ReLU) (None, 112, 112, 32) 0
conv_pw_1 (Conv2D) (None, 112, 112, 64) 2048
conv_pw_1_bn (BatchNormalization) (None, 112, 112, 64) 256
conv_pw_1_relu (ReLU) (None, 112, 112, 64) 0
conv_pad_2 (ZeroPadding2D) (None, 113, 113, 64) 0
conv_dw_2 (DepthwiseConv2D) (None, 56, 56, 64) 576
conv_dw_2_bn (BatchNormalization) (None, 56, 56, 64) 256
conv_dw_2_relu (ReLU) (None, 56, 56, 64) 0
conv_pw_2 (Conv2D) (None, 56, 56, 128) 8192
conv_pw_2_bn (BatchNormalization) (None, 56, 56, 128) 512
conv_pw_2_relu (ReLU) (None, 56, 56, 128) 0
conv_dw_3 (DepthwiseConv2D) (None, 56, 56, 128) 1152
conv_dw_3_bn (BatchNormalization) (None, 56, 56, 128) 512
conv_dw_3_relu (ReLU) (None, 56, 56, 128) 0
conv_pw_3 (Conv2D) (None, 56, 56, 128) 16384
conv_pw_3_bn (BatchNormalization) (None, 56, 56, 128) 512
conv_pw_3_relu (ReLU) (None, 56, 56, 128) 0
conv_pad_4 (ZeroPadding2D) (None, 57, 57, 128) 0
conv_dw_4 (DepthwiseConv2D) (None, 28, 28, 128) 1152
conv_dw_4_bn (BatchNormalization) (None, 28, 28, 128) 512
conv_dw_4_relu (ReLU) (None, 28, 28, 128) 0
conv_pw_4 (Conv2D) (None, 28, 28, 256) 32768
conv_pw_4_bn (BatchNormalization) (None, 28, 28, 256) 1024
conv_pw_4_relu (ReLU) (None, 28, 28, 256) 0
conv_dw_5 (DepthwiseConv2D) (None, 28, 28, 256) 2304
conv_dw_5_bn (BatchNormalization) (None, 28, 28, 256) 1024
conv_dw_5_relu (ReLU) (None, 28, 28, 256) 0
conv_pw_5 (Conv2D) (None, 28, 28, 256) 65536
conv_pw_5_bn (BatchNormalization) (None, 28, 28, 256) 1024
conv_pw_5_relu (ReLU) (None, 28, 28, 256) 0
conv_pad_6 (ZeroPadding2D) (None, 29, 29, 256) 0
conv_dw_6 (DepthwiseConv2D) (None, 14, 14, 256) 2304
conv_dw_6_bn (BatchNormalization) (None, 14, 14, 256) 1024
conv_dw_6_relu (ReLU) (None, 14, 14, 256) 0
conv_pw_6 (Conv2D) (None, 14, 14, 512) 131072
conv_pw_6_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_6_relu (ReLU) (None, 14, 14, 512) 0
conv_dw_7 (DepthwiseConv2D) (None, 14, 14, 512) 4608
conv_dw_7_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_dw_7_relu (ReLU) (None, 14, 14, 512) 0
conv_pw_7 (Conv2D) (None, 14, 14, 512) 262144
conv_pw_7_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_7_relu (ReLU) (None, 14, 14, 512) 0
conv_dw_8 (DepthwiseConv2D) (None, 14, 14, 512) 4608
conv_dw_8_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_dw_8_relu (ReLU) (None, 14, 14, 512) 0
conv_pw_8 (Conv2D) (None, 14, 14, 512) 262144
conv_pw_8_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_8_relu (ReLU) (None, 14, 14, 512) 0
conv_dw_9 (DepthwiseConv2D) (None, 14, 14, 512) 4608
conv_dw_9_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_dw_9_relu (ReLU) (None, 14, 14, 512) 0
conv_pw_9 (Conv2D) (None, 14, 14, 512) 262144
conv_pw_9_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_9_relu (ReLU) (None, 14, 14, 512) 0
conv_dw_10 (DepthwiseConv2D) (None, 14, 14, 512) 4608
conv_dw_10_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_dw_10_relu (ReLU) (None, 14, 14, 512) 0
conv_pw_10 (Conv2D) (None, 14, 14, 512) 262144
conv_pw_10_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_10_relu (ReLU) (None, 14, 14, 512) 0
conv_dw_11 (DepthwiseConv2D) (None, 14, 14, 512) 4608
conv_dw_11_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_dw_11_relu (ReLU) (None, 14, 14, 512) 0
conv_pw_11 (Conv2D) (None, 14, 14, 512) 262144
conv_pw_11_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_11_relu (ReLU) (None, 14, 14, 512) 0
conv_pad_12 (ZeroPadding2D) (None, 15, 15, 512) 0
conv_dw_12 (DepthwiseConv2D) (None, 7, 7, 512) 4608
conv_dw_12_bn (BatchNormalization) (None, 7, 7, 512) 2048
conv_dw_12_relu (ReLU) (None, 7, 7, 512) 0
conv_pw_12 (Conv2D) (None, 7, 7, 1024) 524288
conv_pw_12_bn (BatchNormalization) (None, 7, 7, 1024) 4096
conv_pw_12_relu (ReLU) (None, 7, 7, 1024) 0
conv_dw_13 (DepthwiseConv2D) (None, 7, 7, 1024) 9216
conv_dw_13_bn (BatchNormalization) (None, 7, 7, 1024) 4096
conv_dw_13_relu (ReLU) (None, 7, 7, 1024) 0
conv_pw_13 (Conv2D) (None, 7, 7, 1024) 1048576
conv_pw_13_bn (BatchNormalization) (None, 7, 7, 1024) 4096
conv_pw_13_relu (ReLU) (None, 7, 7, 1024) 0
global_average_pooling2d (GlobalAveragePooling2D) (None, 1, 1, 1024) 0
dropout (Dropout) (None, 1, 1, 1024) 0
conv_preds (Conv2D) (None, 1, 1, 1000) 1025000
reshape_2 (Reshape) (None, 1000) 0
predictions (Activation) (None, 1000) 0
========================================================================================================================
Total params: 4,253,864
Trainable params: 4,231,976
Non-trainable params: 21,888
________________________________________________________________________________________________________________________
```
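The totals at the bottom of the summary can be reproduced by hand from the layer shapes. The sketch below is plain arithmetic under these assumptions, which match the per-layer counts listed above: the convolutions before `conv_preds` have no bias, and each batch normalization layer has four values per channel, two trainable (scale and offset) and two non-trainable (moving mean and variance). The variable names are illustrative.

```python
# Output channels of the 13 pointwise convolutions (conv_pw_1 ... conv_pw_13).
POINTWISE_CHANNELS = [64, 128, 128, 256, 256, 512] + [512] * 5 + [1024, 1024]

weights = 3 * 3 * 3 * 32  # conv1, no bias
bn_channels = 32          # conv1_bn
channels = 32
for out_channels in POINTWISE_CHANNELS:
    weights += 3 * 3 * channels             # conv_dw_*: one 3x3 filter per channel
    weights += channels * out_channels      # conv_pw_*: 1x1 convolution
    bn_channels += channels + out_channels  # one BN after each of the two convs
    channels = out_channels
weights += channels * 1000 + 1000           # conv_preds: weights plus biases

bn_trainable = 2 * bn_channels      # scale and offset
bn_non_trainable = 2 * bn_channels  # moving mean and variance
total = weights + bn_trainable + bn_non_trainable
print(total, total - bn_non_trainable, bn_non_trainable)
```

This prints `4253864 4231976 21888`, matching the total, trainable, and non-trainable counts in the summary.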

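The factory function `MobileNet()` accepts the parameter `alpha` as the width
multiplier \(\alpha\). Although the Keras documentation describes
`depth_multiplier` as the resolution multiplier, it is passed on to the
`DepthwiseConv2D` layers, where it sets the number of depthwise output channels
per input channel; the resolution \(\rho\) is instead chosen implicitly through
`input_shape`. By default, the created model is loaded with weights pretrained
on ImageNet, hence the default input shape is (224, 224, 3).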

## Bibliographic data

```
@misc{howard2017mobilenets,
    title = "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications",
    author = "Andrew G. Howard and Menglong Zhu and Bo Chen and Dmitry Kalenichenko and Weijun Wang and Tobias Weyand and Marco Andreetto and Hartwig Adam",
    howpublished = "arXiv:1704.04861",
    month = "Apr",
    year = "2017",
}
```