Convolutional networks such as AlexNet demonstrated the accuracy achievable in image recognition. However, latency as well as model size (i.e., memory) can be a concern. The MobileNet architecture proposed in this paper makes these trade-offs adjustable.
Depthwise Separable Convolution
The key component of MobileNet is the depthwise separable convolution layer, which breaks a standard convolution into a depthwise convolution followed by a 1×1 pointwise convolution. A standard convolution filters and combines inputs into outputs in one step. For example, a Conv2D with kernel of size \(D_K\times D_K\) applied to an \(M\)-channel input tensor of \(H\times W\times M\) is parameterized by a kernel tensor of \(D_K\times D_K\times M\times N\), where \(N\) is the number of output channels. Assuming stride 1 and appropriate padding, the input tensor \(F\) and output tensor \(G\) are related as
\[G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \times F_{k+i, l+j, m}\]In other words, the input tensor is 3D and the kernel parameter is 4D. For each output channel, the kernel is applied by scanning the width and height but fitted completely on all input channels. The sum of the Hadamard product is assigned to one element in that channel of the output.
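The equation can be spelled out directly in NumPy. This is a naive sketch for illustration (a "valid" convolution with no padding, nothing like how frameworks actually implement it); the tensor sizes below are made-up examples:

```python
import numpy as np

def conv2d_standard(F, K):
    """Standard convolution per the equation above.
    F is (H, W, M); K is (D_K, D_K, M, N). No padding is applied here,
    so the output spatial size shrinks by D_K - 1 in each dimension."""
    DK, _, M, N = K.shape
    H_out, W_out = F.shape[0] - DK + 1, F.shape[1] - DK + 1
    G = np.zeros((H_out, W_out, N))
    for k in range(H_out):
        for l in range(W_out):
            patch = F[k:k + DK, l:l + DK, :]  # (D_K, D_K, M) input window
            # sum over i, j, m: the sum of the Hadamard product, for all N channels at once
            G[k, l, :] = np.tensordot(patch, K, axes=([0, 1, 2], [0, 1, 2]))
    return G

F = np.random.rand(7, 7, 3)     # H = W = 7, M = 3 input channels
K = np.random.rand(3, 3, 3, 8)  # D_K = 3, N = 8 output channels
G = conv2d_standard(F, K)
print(G.shape)  # (5, 5, 8)
```

Each output element is exactly one sum of a Hadamard product between the kernel and a \(D_K\times D_K\times M\) input window.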
Each Hadamard product costs \(D_K^2M\) multiplications, and one is computed per output channel and per spatial position. With \(N\) output channels and an \(H\times W\) output feature map, the total cost of the convolution layer is \(D_K^2MNHW\).
The depthwise convolution uses a single \(D_K\times D_K\) filter per input channel, hence its kernel tensor has size \(D_K\times D_K\times M\) only. Each filter is applied to its own channel separately, producing an output tensor of \(H\times W\times M\), i.e., the same number of channels as the input:
\[G'_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \times F_{k+i, l+j, m}\]The pointwise convolution is just a convolution layer with a 1×1 kernel, hence its kernel tensor has size \(1\times 1\times M\times N\). This is like a fully-connected layer applied across channels, transforming the \(M\) channels of the input into \(N\) channels in the output:
\[G_{k,l,n} = \sum_{m} W_{m,n} \times G'_{k,l,m}\]The total computation cost is \(D_K^2MHW + MNHW\), a reduction by a factor of \(1/N+1/D_K^2\) relative to the standard convolution. For example, with a 3×3 kernel the reduction is roughly 1/9 (i.e., 8 to 9 times less computation), while the accuracy is only slightly lower.
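The arithmetic is easy to check. Plugging hypothetical layer sizes (taken from one of the middle layers of the architecture below) into the two cost formulas:

```python
# Mult-Add counts per the formulas above, for one example layer:
# D_K = 3, M = 64 input channels, N = 128 output channels, 56x56 feature map
D_K, M, N, H, W = 3, 64, 128, 56, 56

standard  = D_K**2 * M * N * H * W               # standard convolution
separable = D_K**2 * M * H * W + M * N * H * W   # depthwise + pointwise

print(standard)               # 231211008
print(separable)              # 27496448
print(separable / standard)   # ~0.119, i.e. about 8.4x fewer Mult-Adds
print(1/N + 1/D_K**2)         # the closed-form ratio gives the same number
```

Note that for a 3×3 kernel the \(1/D_K^2 = 1/9\) term dominates, which is why the reduction is "roughly 1/9" regardless of \(N\).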
Architecture
Typically a convolutional network has a building block of “3×3 Conv - BN - ReLU”. In MobileNet, this is replaced with:
3×3 Depthwise Conv - BN - ReLU - 1×1 Conv - BN - ReLU
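This block is straightforward to express with the Keras functional API. A minimal sketch (the function name and the example input size are mine; layer options follow common defaults rather than the paper's training setup):

```python
import tensorflow as tf
from tensorflow.keras import layers

def mobilenet_block(x, filters, stride=1):
    """One MobileNet block: 3x3 depthwise conv - BN - ReLU - 1x1 conv - BN - ReLU."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, use_bias=False)(x)  # 1x1 pointwise convolution
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x

inputs = tf.keras.Input((112, 112, 32))
outputs = mobilenet_block(inputs, filters=64)
print(outputs.shape)  # (None, 112, 112, 64)
```

The depthwise step never mixes channels; all cross-channel mixing happens in the 1×1 convolution, which is where most of the computation ends up.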
It is found that 94.76% of the Mult-Add operations are in the 1×1 convolutions, and 74.59% of the parameters are 1×1 convolution weights. (Compare: the 3×3 depthwise convolutions contribute 3.06% of Mult-Add operations and 1.06% of parameters, the only full 3×3 Conv layer at the input is 1.19% of Mult-Adds, and the fully-connected layer at the output is 0.18% of Mult-Adds but 24.33% of parameters.)
The paper suggests training the model with less regularization and data augmentation, since this model is smaller (e.g., than Inception) and thus less prone to overfitting. The model architecture is given by Table 1 in the paper:
Type | Stride | Filter Shape | Input Size | Note |
---|---|---|---|---|
Conv | 2 | 3×3×3×32 | 224×224×3 | Input |
Conv DW | 1 | 3×3×32 DW | 112×112×32 | |
Conv | 1 | 1×1×32×64 | 112×112×32 | |
Conv DW | 2 | 3×3×64 DW | 112×112×64 | |
Conv | 1 | 1×1×64×128 | 56×56×64 | |
Conv DW | 1 | 3×3×128 DW | 56×56×128 | |
Conv | 1 | 1×1×128×128 | 56×56×128 | |
Conv DW | 2 | 3×3×128 DW | 56×56×128 | |
Conv | 1 | 1×1×128×256 | 28×28×128 | |
Conv DW | 1 | 3×3×256 DW | 28×28×256 | |
Conv | 1 | 1×1×256×256 | 28×28×256 | |
Conv DW | 2 | 3×3×256 DW | 28×28×256 | |
Conv | 1 | 1×1×256×512 | 14×14×256 | |
Conv DW | 1 | 3×3×512 DW | 14×14×512 | repeat 1 |
Conv | 1 | 1×1×512×512 | 14×14×512 | |
Conv DW | 1 | 3×3×512 DW | 14×14×512 | repeat 2 |
Conv | 1 | 1×1×512×512 | 14×14×512 | |
Conv DW | 1 | 3×3×512 DW | 14×14×512 | repeat 3 |
Conv | 1 | 1×1×512×512 | 14×14×512 | |
Conv DW | 1 | 3×3×512 DW | 14×14×512 | repeat 4 |
Conv | 1 | 1×1×512×512 | 14×14×512 | |
Conv DW | 1 | 3×3×512 DW | 14×14×512 | repeat 5 |
Conv | 1 | 1×1×512×512 | 14×14×512 | |
Conv DW | 2 | 3×3×512 DW | 14×14×512 | repeat end |
Conv | 1 | 1×1×512×1024 | 7×7×512 | |
Conv DW | 2 | 3×3×1024 DW | 7×7×1024 | |
Conv | 1 | 1×1×1024×1024 | 7×7×1024 | |
Avg pool | 1 | 7×7 Pool | 7×7×1024 | |
FC | 1024×1000 | 1024 | ||
Softma× | Classifier | 1000 |
In the paper, the model is tested on ImageNet and found to achieve 70.6% accuracy, about 1% lower than the same architecture built with full convolutions instead of depthwise separable convolutions (i.e., 3×3×64×128 Conv vs 3×3×64 DW + 1×1×64×128 Conv).
The pooling layer before the fully-connected layer is a 2D global average pooling layer. It takes an input tensor of shape (batch, height, width, channel) and outputs a tensor of shape (batch, channel), in which each element is the average over all spatial positions of the corresponding channel.
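A quick check of this behavior in Keras (the input values are arbitrary):

```python
import numpy as np
import tensorflow as tf

x = np.arange(2 * 3 * 3 * 4, dtype="float32").reshape(2, 3, 3, 4)  # (batch, H, W, C)
y = tf.keras.layers.GlobalAveragePooling2D()(x)
print(y.shape)  # (2, 4): one average per channel
# each output element is the mean over the 3x3 spatial grid of that channel
print(float(y[0, 0]) == float(x[0, :, :, 0].mean()))  # True
```

(Passing keepdims=True instead retains the pooled dimensions as 1×1, which is why the Keras model summary later in this post shows the pooling output as (None, 1, 1, 1024).)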
Scaling Parameters
The width multiplier \(\alpha\) allows adjusting the number of channels. For the baseline design above, where each layer has \(M\) input channels and \(N\) output channels, these become \(\alpha M\) input channels and \(\alpha N\) output channels (rounded), typically with \(\alpha\in \{1, 0.75, 0.5, 0.25\}\). Setting \(\alpha\) scales both the total number of parameters and the number of Mult-Adds by roughly \(\alpha^2\).
There is also a resolution multiplier \(\rho\), which scales the input resolution (hence it is normally not part of the model). Typically, MobileNet takes square inputs of size 224, 192, 160, or 128. Setting \(\rho\) scales the number of Mult-Adds by \(\rho^2\), while the number of parameters is unaffected.
The paper's evaluation found that accuracy drops off smoothly as \(\alpha\) or \(\rho\) decreases.
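Both effects can be observed from the Keras factory function. A sketch (weights=None avoids downloading pretrained weights; the exact half-width parameter count is not quoted here because the classifier head scales only linearly with \(\alpha\)):

```python
import tensorflow as tf

# Width multiplier: parameter count shrinks roughly with alpha^2
full = tf.keras.applications.MobileNet(alpha=1.0, weights=None)
half = tf.keras.applications.MobileNet(alpha=0.5, weights=None)
print(full.count_params())  # 4,253,864 - matching the summary below
print(half.count_params())  # far fewer parameters

# Resolution: a smaller input leaves the parameter count untouched,
# only the number of Mult-Adds per forward pass changes
small = tf.keras.applications.MobileNet(input_shape=(160, 160, 3), weights=None)
print(small.count_params() == full.count_params())  # True
```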
Keras Implementation
In TensorFlow, the MobileNet model is built in. This is how you can print its summary and plot the model:
import tensorflow as tf
model = tf.keras.applications.MobileNet()
# print summary
model.summary(line_length=120)
# save model to `model.png`
tf.keras.utils.plot_model(model, show_shapes=True, show_dtype=True,
show_layer_names=True, expand_nested=True,
show_layer_activations=True)
This generates model.png. The MobileNet function signature (and parameter defaults) are:
tf.keras.applications.MobileNet(
input_shape=None,
alpha=1.0,
depth_multiplier=1,
dropout=0.001,
include_top=True,
weights="imagenet",
input_tensor=None,
pooling=None,
classes=1000,
classifier_activation="softmax",
**kwargs
)
And its model summary is:
Model: "mobilenet_1.00_224"
________________________________________________________________________________________________________________________
Layer (type) Output Shape Param #
========================================================================================================================
input_1 (InputLayer) [(None, 224, 224, 3)] 0
conv1 (Conv2D) (None, 112, 112, 32) 864
conv1_bn (BatchNormalization) (None, 112, 112, 32) 128
conv1_relu (ReLU) (None, 112, 112, 32) 0
conv_dw_1 (DepthwiseConv2D) (None, 112, 112, 32) 288
conv_dw_1_bn (BatchNormalization) (None, 112, 112, 32) 128
conv_dw_1_relu (ReLU) (None, 112, 112, 32) 0
conv_pw_1 (Conv2D) (None, 112, 112, 64) 2048
conv_pw_1_bn (BatchNormalization) (None, 112, 112, 64) 256
conv_pw_1_relu (ReLU) (None, 112, 112, 64) 0
conv_pad_2 (ZeroPadding2D) (None, 113, 113, 64) 0
conv_dw_2 (DepthwiseConv2D) (None, 56, 56, 64) 576
conv_dw_2_bn (BatchNormalization) (None, 56, 56, 64) 256
conv_dw_2_relu (ReLU) (None, 56, 56, 64) 0
conv_pw_2 (Conv2D) (None, 56, 56, 128) 8192
conv_pw_2_bn (BatchNormalization) (None, 56, 56, 128) 512
conv_pw_2_relu (ReLU) (None, 56, 56, 128) 0
conv_dw_3 (DepthwiseConv2D) (None, 56, 56, 128) 1152
conv_dw_3_bn (BatchNormalization) (None, 56, 56, 128) 512
conv_dw_3_relu (ReLU) (None, 56, 56, 128) 0
conv_pw_3 (Conv2D) (None, 56, 56, 128) 16384
conv_pw_3_bn (BatchNormalization) (None, 56, 56, 128) 512
conv_pw_3_relu (ReLU) (None, 56, 56, 128) 0
conv_pad_4 (ZeroPadding2D) (None, 57, 57, 128) 0
conv_dw_4 (DepthwiseConv2D) (None, 28, 28, 128) 1152
conv_dw_4_bn (BatchNormalization) (None, 28, 28, 128) 512
conv_dw_4_relu (ReLU) (None, 28, 28, 128) 0
conv_pw_4 (Conv2D) (None, 28, 28, 256) 32768
conv_pw_4_bn (BatchNormalization) (None, 28, 28, 256) 1024
conv_pw_4_relu (ReLU) (None, 28, 28, 256) 0
conv_dw_5 (DepthwiseConv2D) (None, 28, 28, 256) 2304
conv_dw_5_bn (BatchNormalization) (None, 28, 28, 256) 1024
conv_dw_5_relu (ReLU) (None, 28, 28, 256) 0
conv_pw_5 (Conv2D) (None, 28, 28, 256) 65536
conv_pw_5_bn (BatchNormalization) (None, 28, 28, 256) 1024
conv_pw_5_relu (ReLU) (None, 28, 28, 256) 0
conv_pad_6 (ZeroPadding2D) (None, 29, 29, 256) 0
conv_dw_6 (DepthwiseConv2D) (None, 14, 14, 256) 2304
conv_dw_6_bn (BatchNormalization) (None, 14, 14, 256) 1024
conv_dw_6_relu (ReLU) (None, 14, 14, 256) 0
conv_pw_6 (Conv2D) (None, 14, 14, 512) 131072
conv_pw_6_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_6_relu (ReLU) (None, 14, 14, 512) 0
conv_dw_7 (DepthwiseConv2D) (None, 14, 14, 512) 4608
conv_dw_7_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_dw_7_relu (ReLU) (None, 14, 14, 512) 0
conv_pw_7 (Conv2D) (None, 14, 14, 512) 262144
conv_pw_7_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_7_relu (ReLU) (None, 14, 14, 512) 0
conv_dw_8 (DepthwiseConv2D) (None, 14, 14, 512) 4608
conv_dw_8_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_dw_8_relu (ReLU) (None, 14, 14, 512) 0
conv_pw_8 (Conv2D) (None, 14, 14, 512) 262144
conv_pw_8_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_8_relu (ReLU) (None, 14, 14, 512) 0
conv_dw_9 (DepthwiseConv2D) (None, 14, 14, 512) 4608
conv_dw_9_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_dw_9_relu (ReLU) (None, 14, 14, 512) 0
conv_pw_9 (Conv2D) (None, 14, 14, 512) 262144
conv_pw_9_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_9_relu (ReLU) (None, 14, 14, 512) 0
conv_dw_10 (DepthwiseConv2D) (None, 14, 14, 512) 4608
conv_dw_10_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_dw_10_relu (ReLU) (None, 14, 14, 512) 0
conv_pw_10 (Conv2D) (None, 14, 14, 512) 262144
conv_pw_10_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_10_relu (ReLU) (None, 14, 14, 512) 0
conv_dw_11 (DepthwiseConv2D) (None, 14, 14, 512) 4608
conv_dw_11_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_dw_11_relu (ReLU) (None, 14, 14, 512) 0
conv_pw_11 (Conv2D) (None, 14, 14, 512) 262144
conv_pw_11_bn (BatchNormalization) (None, 14, 14, 512) 2048
conv_pw_11_relu (ReLU) (None, 14, 14, 512) 0
conv_pad_12 (ZeroPadding2D) (None, 15, 15, 512) 0
conv_dw_12 (DepthwiseConv2D) (None, 7, 7, 512) 4608
conv_dw_12_bn (BatchNormalization) (None, 7, 7, 512) 2048
conv_dw_12_relu (ReLU) (None, 7, 7, 512) 0
conv_pw_12 (Conv2D) (None, 7, 7, 1024) 524288
conv_pw_12_bn (BatchNormalization) (None, 7, 7, 1024) 4096
conv_pw_12_relu (ReLU) (None, 7, 7, 1024) 0
conv_dw_13 (DepthwiseConv2D) (None, 7, 7, 1024) 9216
conv_dw_13_bn (BatchNormalization) (None, 7, 7, 1024) 4096
conv_dw_13_relu (ReLU) (None, 7, 7, 1024) 0
conv_pw_13 (Conv2D) (None, 7, 7, 1024) 1048576
conv_pw_13_bn (BatchNormalization) (None, 7, 7, 1024) 4096
conv_pw_13_relu (ReLU) (None, 7, 7, 1024) 0
global_average_pooling2d (GlobalAveragePooling2D) (None, 1, 1, 1024) 0
dropout (Dropout) (None, 1, 1, 1024) 0
conv_preds (Conv2D) (None, 1, 1, 1000) 1025000
reshape_2 (Reshape) (None, 1000) 0
predictions (Activation) (None, 1000) 0
========================================================================================================================
Total params: 4,253,864
Trainable params: 4,231,976
Non-trainable params: 21,888
________________________________________________________________________________________________________________________
The factory function MobileNet() accepts the parameter alpha as the width multiplier \(\alpha\). Note that depth_multiplier is not the resolution multiplier \(\rho\): it sets the number of depthwise filters applied per input channel in the depthwise convolutions. The resolution is instead controlled through input_shape. By default, the model is created with weights pretrained on ImageNet, hence the default input shape is (224, 224, 3).
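The depth_multiplier behavior is easiest to see on a single DepthwiseConv2D layer; the input size below is an arbitrary example:

```python
import tensorflow as tf

# depth_multiplier sets how many depthwise filters are applied per input
# channel - it changes channel counts, not the input resolution
x = tf.keras.Input((112, 112, 32))
y = tf.keras.layers.DepthwiseConv2D(3, padding="same", depth_multiplier=2)(x)
print(y.shape)  # (None, 112, 112, 64): two filters per input channel
```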
Bibliographic data
@misc{howard2017mobilenets,
    title = "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications",
    author = "Andrew G. Howard and Menglong Zhu and Bo Chen and Dmitry Kalenichenko and Weijun Wang and Tobias Weyand and Marco Andreetto and Hartwig Adam",
    howpublished = "arXiv:1704.04861",
    month = "Apr",
    year = "2017",
}