One of the main difficulties in analyzing neural networks is the non-convexity of the loss function, which may have many bad local minima. In this paper, they study the loss landscape of neural networks for binary classification tasks.
Under mild assumptions, they prove that after adding one special neuron with a skip connection to the output, or one special neuron per layer, every local minimum is a global minimum.
Deep neural networks have recently achieved huge success in various machine learning tasks. However, a theoretical understanding of neural networks is largely lacking. One of the difficulties in analyzing neural networks is the non-convexity of the loss function, which can admit many local minima with large losses. This was one of the reasons why convex formulations such as support vector machines were often preferred previously.
Given the recent empirical success of deep neural networks, an interesting question is whether the non-convexity of neural network losses is really an issue.
In this context, their main result is quite surprising: for binary classification, with a small modification of the neural architecture, every local minimum is a global minimum of the loss function. The result requires no assumption on the network size or on the specific type of the original neural network, yet it applies to every local minimum. The key trick is adding one special neuron (with a skip connection) together with an associated regularizer on this neuron.
Their major results and their implications are as follows:
They focus on the binary classification problem with a smooth hinge loss function. They prove the following result: for any neural network, by adding a special neuron (e.g., an exponential neuron) to the network and adding a quadratic regularizer on this neuron, the new loss function has no bad local minimum. In addition, every local minimum achieves the minimum misclassification error.
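To make the construction concrete, the augmented objective can be sketched in code. The particular smooth-hinge surrogate, the parameterization of the exponential neuron, and the regularizer weight `lam` below are illustrative assumptions for this sketch, not the paper's exact definitions.

```python
import numpy as np

def smooth_hinge(z):
    # A common smooth surrogate for the hinge loss max(0, 1 - z):
    # quadratic near the hinge point, linear for z <= 0 (an illustrative choice).
    return np.where(z >= 1, 0.0,
           np.where(z <= 0, 0.5 - z, 0.5 * (1.0 - z) ** 2))

def augmented_loss(f_out, X, y, a, w, b, lam=0.1):
    """f_out: outputs f(x_i) of the original network; y in {-1, +1}.
    The output is augmented with one exponential neuron reading the raw
    input (a skip connection), plus a quadratic regularizer on the
    neuron's coefficient `a`."""
    exp_neuron = a * np.exp(X @ w + b)    # the special neuron on the raw input
    margin = y * (f_out + exp_neuron)     # augmented prediction margin
    return smooth_hinge(margin).mean() + lam * a ** 2

# Tiny usage example with random data and a linear "network" as a stand-in.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = np.sign(rng.normal(size=8))
theta = rng.normal(size=3)
loss = augmented_loss(X @ theta, X, y, a=0.0, w=np.zeros(3), b=0.0)
```

Note that when the coefficient `a` is zero the neuron contributes nothing and the objective reduces to the plain smooth hinge loss of the original network, which is consistent with the claim below that the special neuron is inactive at every local minimum.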
In the main result, the augmented neuron can be viewed as a skip connection from the input to the output layer. However, this skip connection is not critical, as the same result also holds if we add one special neuron to each layer of a fully-connected feed-forward neural network.
To their knowledge, this is the first result showing that no spurious local minimum exists for a wide class of deep nonlinear networks. Their result indicates that the class of “good neural networks” (neural networks for which there is an associated loss function with no spurious local minima) contains any network with one special neuron; this class is therefore rather “dense” in the class of all neural networks: any neural network is just one neuron away from a good one.
One of the difficulties in analyzing neural networks is the non-convexity of the loss function, which allows the existence of many spurious local minima with large loss values. In this paper, they prove that for any neural network, by adding a special neuron and an associated regularizer, the new loss function has no spurious local minimum.
In addition, they prove that at every local minimum of this new loss function the exponential neuron is inactive, meaning that the augmented neuron and regularizer improve the loss landscape without affecting the representational power of the original neural network.
They also extend the main result in a few ways.
First, while adding a special neuron makes the network different from a classical neural network architecture, the same result also holds for a standard fully connected network with one special neuron added to each layer.
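A minimal sketch of what "one special neuron per layer" could look like in a fully connected network follows. The layer widths, the ReLU activation, and the way the exponential unit reads the layer input are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def forward_with_special_neurons(x, weights, biases, exp_params):
    """Fully connected forward pass where each layer's output is
    concatenated with one exponential neuron applied to that layer's
    input. `exp_params` holds an illustrative (w_e, b_e) per layer;
    downstream weight matrices must account for the extra unit."""
    h = x
    for (W, b), (w_e, b_e) in zip(zip(weights, biases), exp_params):
        standard = np.maximum(0.0, h @ W + b)      # ordinary ReLU units
        special = np.exp(h @ w_e + b_e)            # one exponential unit
        h = np.concatenate([standard, [special]])  # widen the layer by one
    return h

# Usage example: a 2-layer network; each hidden layer gains one unit.
rng = np.random.default_rng(1)
x = rng.normal(size=4)
weights = [rng.normal(size=(4, 5)), rng.normal(size=(6, 3))]  # 6 = 5 + 1 special
biases = [rng.normal(size=5), rng.normal(size=3)]
exp_params = [(rng.normal(size=4), 0.0), (rng.normal(size=6), 0.0)]
out = forward_with_special_neurons(x, weights, biases, exp_params)
```

The design point is that the special unit stays inside the standard feed-forward structure, so no skip connection to the output is needed.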
Second, the same result holds if we change the exponential neuron to a polynomial neuron with a degree dependent on the data.
Third, the same result holds even if one feature vector corresponds to both labels.