An artificial neural network (ANN), usually called a neural network (NN), is a mathematical model or computational model that is inspired by the structure and/or functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs or to find patterns in data.
Learning
What has attracted the most interest in neural networks is the possibility of learning. Given a specific task to solve, and a class of functions $F$, learning means using a set of observations to find $f^* \in F$ which solves the task in some optimal sense.
This entails defining a cost function $C : F \to \mathbb{R}$ such that, for the optimal solution $f^*$, $C(f^*) \leq C(f)$ for all $f \in F$ (i.e., no solution has a cost less than the cost of the optimal solution).
The cost function $C$ is an important concept in learning, as it is a measure of how far away a particular solution is from an optimal solution to the problem to be solved. Learning algorithms search through the solution space to find a function that has the smallest possible cost.
For applications where the solution is dependent on some data, the cost must necessarily be a function of the observations, otherwise we would not be modelling anything related to the data. It is frequently defined as a statistic to which only approximations can be made. As a simple example, consider the problem of finding the model $f$ which minimizes $C = E\left[(f(x) - y)^2\right]$, for data pairs $(x, y)$ drawn from some distribution $\mathcal{D}$. In practical situations we would only have $N$ samples from $\mathcal{D}$ and thus, for the above example, we would only minimize $\hat{C} = \frac{1}{N}\sum_{i=1}^{N}(f(x_i) - y_i)^2$. Thus, the cost is minimized over a sample of the data rather than the entire data set.
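To illustrate the distinction between the true cost and its sample approximation, the sketch below minimizes the empirical squared-error cost of a simple model over a finite sample. The data distribution, the sample size $N$, the model family, and the grid search are all invented for illustration, not drawn from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data distribution: y = 2x + noise, x uniform on [0, 1].
N = 100
x = rng.uniform(0.0, 1.0, size=N)
y = 2.0 * x + rng.normal(0.0, 0.1, size=N)

def empirical_cost(f, x, y):
    """Sample approximation of C = E[(f(x) - y)^2]."""
    return np.mean((f(x) - y) ** 2)

# A simple linear model f(x) = a*x; minimize the empirical cost over a.
candidates = np.linspace(0.0, 4.0, 401)
costs = [empirical_cost(lambda x, a=a: a * x, x, y) for a in candidates]
best_a = candidates[int(np.argmin(costs))]
print(f"a minimizing the empirical cost on {N} samples: {best_a:.2f}")
```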
When $N \to \infty$, some form of online machine learning must be used, where the cost is partially minimized as each new example is seen. While online machine learning is often used when $\mathcal{D}$ is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online machine learning is frequently used for finite datasets.
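One common realization of this idea is stochastic gradient descent, where each incoming example contributes a small update to the parameters. The sketch below applies it to the squared-error example above; the learning rate, the number of steps, and the data stream are made-up illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
a = 0.0             # parameter of the model f(x) = a*x
learning_rate = 0.1

# Online learning: update the parameter as each new example arrives,
# partially minimizing the cost instead of waiting for a full dataset.
for step in range(1000):
    x = rng.uniform(0.0, 1.0)
    y = 2.0 * x + rng.normal(0.0, 0.1)  # next example from the stream
    error = a * x - y                   # d/da of (a*x - y)^2 / 2 is error * x
    a -= learning_rate * error * x
print(f"parameter after 1000 online updates: {a:.2f}")  # close to 2
```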
See also: Optimization (mathematics), Estimation theory, and Machine learning
Choosing a cost function
While it is possible to define some arbitrary, ad hoc cost function, frequently a particular cost will be used, either because it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem (e.g., in a probabilistic formulation the posterior probability of the model can be used as an inverse cost). Ultimately, the cost function will depend on the desired task. An overview of the three main categories of learning tasks is provided below.
Learning paradigms
There are three major learning paradigms, each corresponding to a particular abstract learning task. These are supervised learning, unsupervised learning and reinforcement learning.
Supervised learning
In supervised learning, we are given a set of example pairs $(x, y)$, $x \in X$, $y \in Y$, and the aim is to find a function $f : X \to Y$ in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data, and it implicitly contains prior knowledge about the problem domain.
A commonly used cost is the mean-squared error, which tries to minimize the average squared error between the network's output, $f(x)$, and the target value $y$ over all the example pairs. When one tries to minimize this cost using gradient descent for the class of neural networks called multilayer perceptrons, one obtains the common and well-known backpropagation algorithm for training neural networks.
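For concreteness, here is a minimal sketch of gradient descent with backpropagation on a one-hidden-layer perceptron under the mean-squared-error cost. The layer sizes, logistic activations, learning rate, and toy XOR dataset are illustrative assumptions, not details from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset (XOR), chosen only for illustration.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 4 logistic units, one logistic output unit.
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)
lr = 1.0

for epoch in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: error signals of the mean-squared-error cost.
    d_out = (out - Y) * out * (1 - out)   # error at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)    # backpropagated to the hidden layer
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0)

print(np.round(out.ravel(), 2))  # should approach [0, 1, 1, 0]
```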
Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition). This can be thought of as learning with a "teacher," in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.
Unsupervised learning
In unsupervised learning, some data $x$ is given together with a cost function to be minimized, which can be any function of the data $x$ and the network's output, $f$.
The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables).
As a trivial example, consider the model $f(x) = a$, where $a$ is a constant, and the cost $C = E\left[(x - f(x))^2\right]$. Minimizing this cost will give us a value of $a$ that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between $x$ and $y$, whereas in statistical modeling, it could be related to the posterior probability of the model given the data. (Note that in both of those examples those quantities would be maximized rather than minimized.)
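That the mean is the minimizer follows from a one-line calculation: the cost is quadratic in $a$, and its derivative vanishes exactly at the expected value of the data.

```latex
\frac{d}{da}\, E\left[(x - a)^2\right]
  = -2\left(E[x] - a\right) = 0
  \quad\Longrightarrow\quad a = E[x].
```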
Tasks that fall within the paradigm of unsupervised learning are in generalestimationproblems; the applications includeclustering, the estimation ofstatistical distributions,compressionandfiltering.
Reinforcement learning
In reinforcement learning, data $x$ are usually not given, but generated by an agent's interactions with the environment. At each point in time $t$, the agent performs an action $y_t$ and the environment generates an observation $x_t$ and an instantaneous cost $c_t$, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost; i.e., the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.
More formally, the environment is modeled as a Markov decision process (MDP) with states $s_1, \ldots, s_n \in S$ and actions $a_1, \ldots, a_m \in A$ with the following probability distributions: the instantaneous cost distribution $P(c_t \mid s_t)$, the observation distribution $P(x_t \mid s_t)$ and the transition $P(s_{t+1} \mid s_t, a_t)$, while a policy is defined as the conditional distribution over actions given the observations. Taken together, the two define a Markov chain (MC). The aim is to discover the policy that minimizes the cost; i.e., the MC for which the cost is minimal.
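To make the "policy plus dynamics define a Markov chain" point concrete, the sketch below evaluates the expected cumulative discounted cost of a fixed policy on a tiny, made-up MDP; the two-state dynamics, discount factor, and all numbers are invented for illustration.

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP, for illustration only.
# P[a][s, s'] = transition probability; cost[s, a] = expected instantaneous cost.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.6, 0.4]])}
cost = np.array([[1.0, 2.0],
                 [0.5, 0.1]])
policy = np.array([[0.7, 0.3],   # P(action | state 0)
                   [0.1, 0.9]])  # P(action | state 1)
gamma = 0.9  # discount factor defining the long-term cost

# The policy and the dynamics together define a Markov chain over states:
P_pi = sum(policy[:, a][:, None] * P[a] for a in (0, 1))
c_pi = (policy * cost).sum(axis=1)   # expected one-step cost under the policy

# Expected cumulative discounted cost solves v = c_pi + gamma * P_pi v.
v = np.linalg.solve(np.eye(2) - gamma * P_pi, c_pi)
print("long-term cost from each state:", np.round(v, 3))
```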
ANNs are frequently used in reinforcement learning as part of the overall algorithm.
Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks.
See also: dynamic programming and stochastic control
A Boltzmann machine is a type of stochastic recurrent neural network invented by Geoffrey Hinton and Terry Sejnowski. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They were one of the first examples of a neural network capable of learning internal representations, and are able to represent and (given sufficient time) solve difficult combinatorial problems. However, due to a number of issues discussed below, Boltzmann machines with unconstrained connectivity have not proven useful for practical problems in machine learning or inference. They are still theoretically intriguing, however, due to the locality and Hebbian nature of their training algorithm, as well as their parallelism and the resemblance of their dynamics to simple physical processes. If the connectivity is constrained, the learning can be made efficient enough to be useful for practical problems.
They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function.
Structure
A Boltzmann machine, like a Hopfield network, is a network of units with an "energy" defined for the network. It also has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy, $E$, in a Boltzmann machine is identical in form to that of a Hopfield network:

$$E = -\left(\sum_{i<j} w_{ij}\, s_i\, s_j + \sum_i \theta_i\, s_i\right)$$
Where:
§ $w_{ij}$ is the connection strength between unit $j$ and unit $i$.
§ $s_i$ is the state, $s_i \in \{0, 1\}$, of unit $i$.
§ $\theta_i$ is the threshold of unit $i$.
The connections in a Boltzmann machine have two restrictions:
§ No unit has a connection with itself.
§ $w_{ij} = w_{ji}$. (All connections are symmetric.)
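The energy above is straightforward to compute directly. As a small sanity-check sketch, assuming the sign and indexing conventions given above and an invented three-unit network:

```python
import numpy as np

def global_energy(W, theta, s):
    """E = -(sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i)."""
    # np.triu(..., k=1) keeps only the i < j terms, which avoids
    # double-counting and the (forbidden) self-connections on the diagonal.
    return -(np.sum(np.triu(np.outer(s, s) * W, k=1)) + theta @ s)

# A made-up symmetric 3-unit machine with zero diagonal.
W = np.array([[ 0.0, 1.0, -2.0],
              [ 1.0, 0.0,  0.5],
              [-2.0, 0.5,  0.0]])
theta = np.array([0.1, -0.3, 0.2])
s = np.array([1, 0, 1])
print(global_energy(W, theta, s))  # -(-2.0) - (0.1 + 0.2) = 1.7
```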
Probability of a unit's state
The difference in the global energy that results from a single unit $i$ being 0 (off) versus 1 (on), written $\Delta E_i$, is given by:

$$\Delta E_i = \sum_j w_{ij}\, s_j + \theta_i$$
This can be expressed as the difference of energies of two states:
$$\Delta E_i = E_{i=\text{off}} - E_{i=\text{on}}$$
We then substitute the energy of each state with its relative probability according to the Boltzmann factor (the property of a Boltzmann distribution that the energy of a state is proportional to the negative log probability of that state):

$$\Delta E_i = -k_B T \ln(p_{i=\text{off}}) - \left(-k_B T \ln(p_{i=\text{on}})\right)$$
where $k_B$ is Boltzmann's constant and becomes lumped into the artificial notion of temperature $T$. We then rearrange terms and use the fact that the probabilities of the unit being on and off must sum to one:

$$\frac{\Delta E_i}{T} = \ln(p_{i=\text{on}}) - \ln(p_{i=\text{off}}) = \ln(p_{i=\text{on}}) - \ln(1 - p_{i=\text{on}}) = \ln\!\left(\frac{p_{i=\text{on}}}{1 - p_{i=\text{on}}}\right)$$
We can now finally solve for $p_{i=\text{on}}$, the probability that the $i$-th unit is on:

$$p_{i=\text{on}} = \frac{1}{1 + \exp\!\left(-\frac{\Delta E_i}{T}\right)}$$
where the scalar $T$ is referred to as the temperature of the system.
This relation is the source of the logistic function found in probability expressions in variants of the Boltzmann machine.
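Putting the pieces together, a single stochastic update of one unit looks like the sketch below; the three-unit network, its parameters, and the helper name `update_unit` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def update_unit(i, s, W, theta, T):
    """Set unit i on with probability 1 / (1 + exp(-dE_i / T))."""
    dE = W[i] @ s + theta[i]              # energy gap Delta E_i
    p_on = 1.0 / (1.0 + np.exp(-dE / T))  # logistic function of dE_i / T
    s[i] = 1 if rng.random() < p_on else 0
    return s

# A made-up symmetric 3-unit machine (zero diagonal, w_ij = w_ji).
W = np.array([[ 0.0, 1.0, -2.0],
              [ 1.0, 0.0,  0.5],
              [-2.0, 0.5,  0.0]])
theta = np.array([0.1, -0.3, 0.2])
s = np.array([1, 0, 1])
update_unit(0, s, W, theta, T=1.0)
print(s)
```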
Equilibrium state
The network is run by repeatedly choosing a unit and setting its state according to the above formula. After running for long enough at a certain temperature, the probability of a global state of the network will depend only upon that global state's energy, according to a Boltzmann distribution. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability distribution of global states has converged. If we start running the network from a high temperature, and gradually decrease it until we reach a thermal equilibrium at a low temperature, we are guaranteed to converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.
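A simulated-annealing run is then just the single-unit update in a loop with a decreasing temperature; the geometric cooling schedule, sweep counts, and starting temperature below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# The same made-up 3-unit machine as above.
W = np.array([[ 0.0, 1.0, -2.0],
              [ 1.0, 0.0,  0.5],
              [-2.0, 0.5,  0.0]])
theta = np.array([0.1, -0.3, 0.2])
s = rng.integers(0, 2, size=3)

T = 10.0
while T > 0.05:
    for _ in range(20):                     # several updates per temperature
        i = rng.integers(0, 3)              # repeatedly choose a unit...
        dE = W[i] @ s + theta[i]
        s[i] = 1 if rng.random() < 1 / (1 + np.exp(-dE / T)) else 0
    T *= 0.9                                # ...and gradually cool

energy = -(np.sum(np.triu(np.outer(s, s) * W, 1)) + theta @ s)
print(s, energy)  # a low-energy state, near the global minimum
```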
If we want to train the network so that it converges to global states according to an external distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities get the lowest energies. This is done by the following training procedure.
Training
The units in the Boltzmann machine are divided into "visible" units, $V$, and "hidden" units, $H$. The visible units are those which receive information from the "environment", i.e. our training set is a set of binary vectors over the set $V$. The distribution over the training set is denoted $P^+(V)$.
On the Boltzmann machine side, as noted above, the distribution over the global states converges as we reach thermal equilibrium. We denote the converged distribution, after marginalizing out the hidden units, as $P^-(V)$.
Our goal is to approximate the "real" distribution $P^+(V)$ using the $P^-(V)$ which will be produced (eventually) by the machine. To measure how similar the two distributions are, we use the Kullback-Leibler divergence, $G$:

$$G = \sum_v P^+(v) \ln\!\left(\frac{P^+(v)}{P^-(v)}\right)$$
where the sum is over all the possible states of $V$. $G$ is a function of the weights, since they determine the energy of a state, and the energy determines $P^-(v)$, as promised by the Boltzmann distribution. Hence, we can use a gradient descent algorithm over $G$, so a given weight, $w_{ij}$, is changed by subtracting the partial derivative of $G$ with respect to the weight.
There are two phases to Boltzmann machine training, and we switch iteratively between them. One is the "positive" phase, where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to $P^+$). The other is the "negative" phase, where the network is allowed to run freely, i.e. no units have their state determined by external data. Surprisingly enough, the gradient with respect to a given weight, $w_{ij}$, is given by the very simple equation (proved in Ackley et al.):

$$\frac{\partial G}{\partial w_{ij}} = -\frac{1}{R}\left[p^+_{ij} - p^-_{ij}\right]$$
Where:
§ $p^+_{ij}$ is the probability of units $i$ and $j$ both being on when the machine is at equilibrium in the positive phase.
§ $p^-_{ij}$ is the probability of units $i$ and $j$ both being on when the machine is at equilibrium in the negative phase.
§ $R$ denotes the learning rate
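This rule translates directly into a two-phase update loop. The sketch below shows its shape for a fully visible machine (no hidden units), estimating $p^+_{ij}$ from the clamped training data and $p^-_{ij}$ from free-running samples; the data, learning rate, and sampling lengths are invented, bias updates are omitted, and a practical implementation would need far longer equilibration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4                                    # units (all visible, for simplicity)
data = rng.integers(0, 2, size=(50, n))  # hypothetical binary training vectors
W = np.zeros((n, n))
theta = np.zeros(n)
R, T = 0.1, 1.0

def gibbs_sweep(s):
    """One pass of the stochastic unit update over all units."""
    for i in range(n):
        dE = W[i] @ s + theta[i]
        s[i] = 1 if rng.random() < 1 / (1 + np.exp(-dE / T)) else 0
    return s

for epoch in range(100):
    # Positive phase: visible units clamped to the training vectors.
    p_plus = (data.T @ data) / len(data)     # co-activation statistics
    # Negative phase: the network runs freely from a random state.
    s = rng.integers(0, 2, size=n)
    samples = [gibbs_sweep(s).copy() for _ in range(100)]
    S = np.array(samples[50:])               # discard burn-in samples
    p_minus = (S.T @ S) / len(S)
    # Hebbian-style update: strengthen weights where the data
    # co-activates units more often than the free-running model does.
    grad = p_plus - p_minus
    np.fill_diagonal(grad, 0.0)              # no self-connections
    W += R * grad
```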
This result follows from the fact that at thermal equilibrium the probability $P^-(s)$ of any global state $s$ when the network is free-running is given by the Boltzmann distribution (hence the name "Boltzmann machine").
Remarkably, this learning rule is fairly biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (or synapse, biologically speaking) does not need information about anything other than the two neurons it connects. This is far more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.
The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. Minimizing the KL-divergence is equivalent to maximizing the log-likelihood of the data. Therefore, the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete data likelihood during the M-step.
Training the biases is similar, but uses only single-node activity:

$$\frac{\partial G}{\partial \theta_i} = -\frac{1}{R}\left[p^+_i - p^-_i\right]$$
Problems
The Boltzmann machine would theoretically be a rather general computational medium. For instance, if trained on photographs, the machine would theoretically model the distribution of photographs, and could use that model to, for example, complete a partial photograph.
Unfortunately, there is a serious practical problem with the Boltzmann machine, namely that the learning seems to stop working correctly when the machine is scaled up to anything larger than a trivial machine. This is due to a number of effects, the most important of which are:
§ the time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths
§ connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to follow a random walk until the activities saturate.