The paper points out in sec 1 that, in contrast to the good old-fashioned AI (GOFAI) algorithms, artificial neural networks are understood only at a heuristic level. The authors try to provide analytic insights on why deep learning and general ANN works.

There are three kinds of machine learning applications:

- unsupervised learning:
- classification:
- generation or prediction:

the above, we use for data (e.g. the million greyscale pixels of an image), for model parameters (e.g. the classification of an image) and the problem is to find the probability distribution over a set of possibilities. The quality of a neural network is to consider:

- expressibility: what function can a neural network express?
- efficiency: how many resources (neurons, parameters) does it require to approximate a given function?
- learnability: how rapidly can the neural network learn good parameters for approximating a function?

The paper points out that the set of possible functions is exponentially larger than the set of functions can be represented by a neural network. Therefore, it try to answer why a neural network can still approximate functions well. (For example, there are possible images of 1 megapixel in 256-greyscale, but a neural network for image classification has merely thousands or millions of parameters)

The key idea of the paper is to point out the combinatorial swindle nature of neural networks, which for inputs taking values each, it cuts the number of parameters from to . Swindle works because the data sets (problems) we are interested are drawn from an exponentially tiny faction of all imaginable data sets.

The paper then move on to sec 2 to model a neural network. Firstly, the probability distribution has many simplifying features enabling accurate approximation, such as symmetry, locality, etc. Therefore the harder case would be modeling , which can be expressed with simpler function using Bayes’ theorem:

If we introduce the Hamiltonian and self-information , we can rewrite in Boltzmann form:

Usually takes discrete values and we usually find over all values of , so we can express as vectors and write the above as

The vector is approximated by a neural network. A standard -layer feedforward network implements a function of the form

Here the is an affine transformation function of the form for some matrices and bias vector . And is a nonlinear operator on vectors, which softmax function often used:

[Note: is applying exponential to individual elements of . Softmax function is to normalize a vector such that sum of all elements become 1 and each element becomes positive due to exponentiation. Such normalization is not linear.]

Comparing this to the notation of above, we can see that we can rewrite it as:

which means a neural network can *approximate this function with a softmax
output layer*, if we can compute in hidden
layers. The way to compute the Hamiltonians is as follows: First we expand the
Hamiltonian as power series (of infinite terms):

This is a multivariate polynomial. If vector has components, there are terms of degree lower than or equal to in the polynomial (to see this, consider the terms in polynomial are combinations of terms of and count of 1).

If we introduce the popular choice of logistic sigmoid activation function to act elementwise on neuron output, we can see that a two-layer neural network of the form can approximate multiplication (paper fig 2):

which we cal the multiplication approximator and it is accurate when are small in absolute values – which can be achieved by scaling. Here we established the first theorem in the paper, we can achieve arbitrarily accurate multiplication using 4 neurons.

In the alternative case that is binary (i.e., ), we can make things simpler because we always have and hence the power series of Hamiltonian has finite terms:

And the product of bits can be trivially determined from their sum, for example, iff . Therefore, with the logistic sigmoid, we can write the product

for some set of subscript indices . A moderately large in the equation above is suffice to approximate the product, as the function approaches 0 or 1 exponentially. Fig 2 of the paper also shows a network with a single neuron to find the multiplication of bits.

At this point, we can see that, as long as we can express the Hamiltonian in the form of a power series, we can approximate it in a neural network. However, the Hamiltonian that we will approximate in machine learning problems are usually simple, typically a polynomial of degree 2 to 4, and the number of coefficients in , for the following reasons:

- Taylor’s theorem suggests that we can discard the higher order terms (coefficient tends to zero in higher degrees)
*Renomalization*: Higher order terms in the Hamiltonian are negligible if we only observe macroscopic variables-
many probability distributions can be accurately approximated by multivariate Gaussians, due to Central Limit Theorem:

so the Hamiltonian is a quadratic polynomial.

These reasons are valid only if we observe the random variable directly, instead of observing some generic functions of the random variables. But a Hamiltonian in low-order polynomial does not necessarily mean a neural network can reproduce it exactly due to other requirements such as accuracy, stuck at local minimum, or quality of data.

Besides the order of the polynomial, the principle of locality, which means the range of effect of a variable is only at its vicinity, also causes a lot of coefficients in the Hamiltonian power series vanish and the resulting polynomial has only terms. Moreover, the nature of the problem which usually is invariant under translation and rotation limits the number of parameters further.

Discussion thus far is about neural networks in general and the further question is why deep learning works well in some problems while shallow neural networks do not. The reason this paper proposed is the hierarchical structure of physical world, and the dynamic in physics is fundamentally Markovian: data of hierarchy depends on data of hierarchy only and the probability vector of data from each hierarchy is therefore in the form of

for some Markov transition matrix . A deep neural network
stacking a lot of simpler networks would then implement such *generative
process* efficiently. Using such deep neural network is to reverse the
hierarchical generative process to find the primitive input from observed
output .

Then we can proceed to describe these analytically, to show that deep neural network is more efficient in certain cases:

First we consider , if then is a sufficient statistic. If is a minimal sufficient statistic, it means for all sufficient statistic there is a function such that . The physical meaning of is an information distiller, which retains only the information that is relevant and discarding all other. Consider the Markov property , if we have as the minimal sufficient statistic of , we have

which means it depends on only through , i.e. is a sufficient statistic for , so we established theorem 2 that there exists a function such that , and hence by induction,

The above expression states that, the structure of the inference problem (from image to classification ) reflects the structure of the generative process (from classification to image ).

The minimal sufficient statistics can also be put into information theory context: The data processing inequality states that, for any function (the information distillation function) and random variables , the mutual information measure satisfies

and a sufficient statistic is a function such that which in other words, retained all information in . But if is not strictly sufficient, it is still useful as long as it retained most of the relevant information and make the Hamitonian easier to implement than

Finally, we are about to establish the no-flattening theorems, which states that for some problems, deep neural network is more efficient than shallow one (which only has only hidden layers and the error compare to is bounded by ) in terms of the number of parameters involved (i.e., the sum of dimensions of its hidden layers, or number of neurons, and te number of non-zero entries in weight matrices, or the number of synapses, ).

This result starts with defining the flattening cost functions:

and if or for a problem, then flattening is costly and inefficient.

The paper qualitatively provide a few examples. Some numerical linear algebra problems that allows divide-and-conquer approach (such as FFT) is in complexity, which is suitable for deep neural network and it will become if hidden layers are eliminated to become a fully flattened network. Hence in this example.

Another example is the matrix multiplication. For of rank , is matrix and is matrix, then implemented in a flattened neural network will have synapses in a network with layers and synapses in a network with layers. So the flattening cost is , which is greater than 1 for .

Yet another example is the polynomial. It is vigorously proved in the appendix of the paper that they are exponentially expensive to flatten. One result is that, a monomial for needs neurons to evaluate if the multiplication are arranged in a -layer binary tree, which each two-element multiplication is performed by 4 neurons. But if this is to be done by a flattened network, it will require neurons.

Summary of the paper:

- Machine learning in practice inclined to certain class of problems that involves exceptionally simply probability distributions
- Shallow network hinges on symmetry, locality, and polynomial log-probability in data from natural world, which has sparse low-order Hamiltonians
- A neural network with a finite size and a generic smooth activation function can approximate a multivariate polynomial well
- Success of deep learning depends on the hierarchical and compositional generative process in physics; such compositional functions can be efficiently implemented by a deep neural network
- Those compositional functions generally not possible to retain the efficiency if the neural network is flattened

## Bibliographic data

```
@article{
title = "Why does deep and cheap learning work so well?",
author = "Henry W. Lin and Max Tegmark and David Rolnick",
journal = "Journal of Statistical Physics",
month = "September",
year = "2017",
volume = "168",
number = "6",
pages = "1223-1247",
url = "https://link.springer.com/article/10.1007/s10955-017-1836-5",
}
```