Neural Networks

Recall: for linear regression, we can represent the predictor as a network:

And logistic regression can be represented by:

Neural networks generalize this idea. The first step in their definition requires the definition of a neuron. We will see that a neural network is constructed from a collection of neurons. This is a neuron:

It is standard to use to represent the weights in a neural network. The bias plays a role analogous to the constant term in a regression model. It is an important component of a neuron, but we will generally omit it from the diagrams. The type of a neuron is determined by the form of the function . Some examples include:

  1. : equivalent to linear regression. This is a linear neuron.
  2. : similar to logistic regression. This is a sigmoid neuron.
  3. This is a perceptron.

Example 1: A perceptron with two inputs will divide the plane into two parts with a straight line (and, in general, a hyperplane in higher dimensions):

Example 2: There are some training sets that cannot be separated by a perceptron:

We will often apply a simplified visual notation to represent neurons, by omitting the

summation node and the function , and simply display a neuron as:

The field of Neural Networks is a large one, so we will restrict our attention to a particular class of NN’s, namely

Feedforward Neural Networks

In many situations, we will view a neural network as a black box, having a set of inputs and a single output (actually, NN’s are not restricted to a single output. In a classification setting, for instance, we might have one output for each class). So the simple picture looks like

In a multilayer feedforward NN, the details inside the black box look like this:

In this layout, all neurons are either hidden neurons or output neurons. Note that there are many choices available here:

How many hidden layers are there?

What type are the neurons (perceptrons, sigmoid, etc.)?

How many neurons are there per layer?

Our goal is to find a way to determine the optimal weights to be assigned to the branches. To get a sense of what’s involved, consider the simpler case of a multilayer feedforward NN with a single hidden layer:

In this diagram, note that (i) the terms represent the intermediate outputs coming from the hidden layer, and (ii) the weights in the input layer are labeled with the output index first (e.g., is the weight on the branch fromto the k-th neuron in the hidden layer); this looks backward but turns out to be convenient. It is clear that this NN represents a function that maps the inputs to the output and which is dependent on the values of the weights. In other words, neural networks are a parametric function of inputs. In this case, if , then we can state:

Goal: find optimal weights that minimize

For simplicity, suppose that the transfer function in all neurons is the same function g. Then

To minimize MSE, we can apply the gradient descent algorithm. In the course of this algorithm, it will be necessary to compute terms like:

It’s clear that these terms will get complicated, and that we need to figure out how to solve for the weights efficiently. We’ll see the solution shortly in the backpropagation algorithm.

Simple Method for Neural Network Training

Before explaining the backpropagation, we illustrate the simple numerical method for finding partial derivatives that are necessary for the gradient descent updates of neural network weights, . It is based on the definition of the first derivative:

Therefore, for some small value , the approximation will be very accurate. To estimate the derivative numerically, we need to calculate MSE twice, which results in O(WN) time complexity, where W is the total number of weights and N is the number of training data points. To run one round of gradient descent, we need to calculate the partial derivative for all W weights, which results in total cost of O(W2N). Assuming that E rounds (or epochs in neural network lingo) are needed for the gradient descent to converge, the total cost of training would be O(W2NE). The cost of the backpropagation is O(WNE). which is a significant improvement.

Mathematical Background

We can motivate the backpropogation algorithm on neural networks by first considering a simpler setting. Consider a function defined by , which can be represented graphically as:

And suppose we want to compute . Intuitively, if x changes a small amount, we see that each of the intermediate outputs changes accordingly, and that these changes all contribute to a change in z. This is made precise in the chain rule:

This pattern is similar when computing partial derivatives of with respect to various weights, though the details are slightly more involved.

The Backpropagation Algorithm

Consider this small piece of a NN:

To perform the update of weights using the gradient descent algorithm, we need to calculate all partial derivatives of MSE with respect to the weights. Let us simplify our notation a bit by denoting

where

In this case

Let us concentrate on finding , and remove index i for notational convenience. One application of the chain rule gives:

To compute , we observe that , so .

To compute , use the chain rule to yield , where the summation is over all neurons . If we define , then this result translates to

But since , we see that (assuming all neurons are sigmoid: ):

What we’ve shown is that the computation of can be performed by knowing the values of for all neurons l downstream from k. The backpropagation algorithm consists of computing the values starting from the output layer, and proceeding backward through the network. In this way, all desired partial derivatives can be computed efficiently (and in a manner that lends itself well to a matrix implementation).

The only thing left to specify is how to start off the algorithm (i.e., what to do at the output neuron):

Since , we have

And the algorithm (minus a few computational details) is complete.

Computational cost. To calculate  of the output neuron, we need to know its output. In a process of figuring this out, we will also learn the outputs of all other neurons. This whole “forward step” takes O(W) time. Using formula for , we see that to know  for allK1neurons in the previous layer we need to make O(K1) calculations. To know  for all K2 neurons in the layer before it we need to make O(K1K2) calculations, and so on. The total number of calculations to calculate all  is O(W). Knowing all , it takes O(W) additional time to calculate all the partial derivatives . Overall, all weight updates take O(W) time. One epoch therefore takes O(WN) time, and the whole training O(WNE).

Main Properties of NN’s

Given a large enough NN, with sigmoid neurons:

  1. It can approximate an arbitrary continuous function (this can actually be done with a NN with two hidden layers).
  2. It can learn an arbitrary Boolean function (and one hidden layer is sufficient).

Reading assignment:

Sections 5.3.4, 5.4