Optimum Scaling for Sigmoid Output Functions
Alan Parkinson
School of Information Systems
Curtin University of Technology
GPO Box U1987, Perth
Australia
Abstract: - This paper reports the results of an experimental investigation of the effects of different data scaling ranges used with a feed-forward neural network with sigmoid output functions. Although ranges of 0.1-0.9 are commonly used, an argument is presented suggesting that a narrower range may be better. Experiments with real-world data indicate that a range of 0.3-0.7 would be optimal in many cases, significantly narrower than the range used by most workers.
Key-Words: - Pre-processing, data scaling, neural networks, time series prediction, sigmoid functions
1 Introduction
In practical applications, data is rarely presented to neural networks in raw form. Some form of pre-processing invariably precedes its use, and the selection of an appropriate pre-processing method is acknowledged to be both a significant factor in the effectiveness of the resulting networks [1, 2] and a major consumer of project resources [3]. Despite this, there are few published studies on the effects of particular pre-processing methods [4].
The use of some sigmoid output function is almost ubiquitous in the ANN literature, particularly for feed-forward networks, although there are many alternatives [5]. Not only does this function serve to keep internal values in the network within range, but it is also the source of non-linearity and hence responsible for one of the most prized capabilities of the neural network paradigm. When used on the output neurons of a network, however, there are implications for the pre-processing of the data, since it is futile to ask the network to produce a value outside the range of the output function. Although the functions used have a theoretical range of 0-1, producing values towards the extremes of this range would require activations of very large magnitude, in turn implying numerical precision far beyond that of the commonly used double precision floating point format. The accepted procedure is therefore to pre-scale the data so that it lies within a somewhat restricted range. Commonly used ranges are 0.1-0.9 [6] or 0.2-0.8 [7].
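In code, this pre-scaling step amounts to a simple min-max transformation. The following Python sketch is an illustration only (the function names are hypothetical, not part of any particular toolkit); it scales a series into an arbitrary target range and inverts the transformation afterwards, the latter being needed to report errors in original units:

```python
import numpy as np

def scale_to_range(x, lo=0.1, hi=0.9):
    """Min-max scale a series into [lo, hi]; also return the
    parameters needed to invert the transformation later."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    scaled = lo + (hi - lo) * (x - x_min) / (x_max - x_min)
    return scaled, (x_min, x_max, lo, hi)

def unscale(scaled, params):
    """Invert scale_to_range, recovering the original units."""
    x_min, x_max, lo, hi = params
    return x_min + (np.asarray(scaled) - lo) * (x_max - x_min) / (hi - lo)
```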
This paper first presents a theoretical argument from first principles suggesting that the use of a rather narrower range (such as 0.3-0.7) might be beneficial. This proposal is then tested using a real-life data set (consisting of currency exchange rates) scaled to nine different ranges. The results offer broad support for the proposition that a narrower range is beneficial, although the optimal range does not exactly correspond to the theoretical prediction.
2 Theoretical Argument
For autoregressive time series prediction the input and output variables are, of course, the same, so any scaling will affect both the inputs and the targets. The non-linear nature of the network renders evaluation of the effects of input scaling somewhat intractable, but we can derive some insight by considering the output node only.
Consider a feed forward network with m hidden nodes and 1 output node.
Let
z_i = actual output of hidden node i
w_i = weight of the connection from hidden node i to the output node
x = total input to the output node
y = actual output of the output node
d = desired output of the output node
The above variables are related by the formulae:

$$x = \sum_{i=1}^{m} w_i z_i, \qquad y = \varphi(x) \tag{1}$$

where $\varphi$ represents the neuron output function.
Using the logistic sigmoid as the output function we have:

$$y = \varphi(x) = \frac{1}{1 + e^{-x}} \tag{2}$$
Define an error measure E:

$$E = \frac{1}{2}(d - y)^2 \tag{3}$$
We are interested in the way in which this error depends upon the weights into the output node. By the chain rule, and using the standard result that $dy/dx = y(1-y)$ for the logistic sigmoid:

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial x}\,\frac{\partial x}{\partial w_i} = -(d - y)\,y(1 - y)\,z_i \tag{4}$$
Implicit here is the understanding that $(1 - y)$ is strictly positive, since it would be futile to ask the network to produce an output beyond the range of the output neuron transfer function. However, the actual range of y could be much smaller than that. Let us investigate the consequences of a simple scaling transformation on d (and therefore, implicitly, on y):
$$d \rightarrow kd \tag{5}$$
where k is some constant, subject to the constraints that $(1 - ky)$ and $(1 - kd)$ remain positive. The error gradient now becomes:
$$\frac{\partial E}{\partial w_i} = -(kd - ky)\,ky(1 - ky)\,z_i = -(d - y)\,y(1 - y)\,z_i \cdot \frac{k^2(1 - ky)}{1 - y} \tag{6}$$
Note that since we are considering only the effects of scaling on the output node, we are neglecting at this point any changes to $z_i$ as a result of the changed scaling (such effects may well have been ameliorated by the hidden layer in any case). The effect of the transformation, then, is to modify the gradient vector by the factor appearing in the second term of (6), $F(k) = k^2(1 - ky)/(1 - y)$. When this factor is greater than 1, the gradient has become steeper. Since a steeper gradient should, in theory, result in faster learning, we regard this as a desirable outcome of the scaling process. It remains to see for what values of y and k this improvement is manifested. Using standard calculus to determine the stationary point of F we get:
$$\frac{dF}{dk} = \frac{1}{1 - y}\left(2k - 3k^2 y\right) \tag{7}$$
The first factor cannot be zero since y < 1, and the root k = 0 corresponds to the trivial case of no scaling at all. Setting the remaining factor to zero therefore gives:
$$2k - 3k^2 y = 0 \quad\Rightarrow\quad ky = \frac{2}{3} \approx 0.67 \tag{8}$$
We verify that this stationary point is in fact a maximum by taking the second derivative:
$$\frac{d^2F}{dk^2} = \frac{1}{1 - y}\left(2 - 6ky\right) \tag{9}$$
Once again, the first factor is positive since y < 1. Substituting the value of 0.67 for ky makes the second factor negative ($2 - 6 \times 0.67 \approx -2$), so the overall value of the second derivative must also be negative, verifying that the stationary point is indeed a maximum.
This would seem to imply that optimal scaling (by this criterion) would limit the range of y to less than 0.67, significantly more restrictive than the often-used upper limits of 0.8 or 0.9. There is, therefore, some theoretical basis for using a narrower scaling range for this type of network. By the argument above, we might expect faster learning with a reduced scaling range. The extent to which this translates into practical benefit can only be established by experimentation, as is done in the next section.
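The location of this maximum is easy to confirm numerically. The short Python sketch below (an illustration added here, not part of the original experiments) evaluates the gradient-modification factor of equation (6) over a grid of k values and shows that its maximum falls at ky = 2/3 for any fixed y:

```python
import numpy as np

# Gradient-modification factor from equation (6): k^2 (1 - ky) / (1 - y).
y = 0.5                                        # any fixed output, 0 < y < 1
k = np.linspace(0.01, 1.0 / y - 1e-6, 100000)  # keep (1 - ky) positive
factor = k**2 * (1.0 - k * y) / (1.0 - y)

k_best = k[np.argmax(factor)]
print(k_best * y)                              # ~0.6667, i.e. ky = 2/3
```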
3 Experimental Procedure
The following experiment was conducted using the Stuttgart Neural Network Simulator version 4.1 (obtainable from ftp.informatik.uni-stuttgart.de). Nine different input data files were prepared with different scaling ranges. To allow for the effects of different initial weight values, each file was used to train 10 networks (with different random weight initialisations, always in the range -1 to +1).
The raw data is the same as that used by [8] and consists of daily exchange rates of five currencies (British pound, Canadian dollar, Swiss franc, German mark and Japanese yen) against the US dollar over the period 1/6/73 to 21/5/87. Before use, the data was cleaned by dropping missing data points and de-trended using linear regression. It was then divided into a training set (3000 points) and a validation set (505 points) on a chronological basis. The task chosen was to predict the exchange rate of the Swiss franc for the next period given exchange rate data for all five currencies up to the current time; the networks were presented with exchange rates for all five currencies for the five days previous to the prediction day.
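As a sketch of the cleaning step, the following Python fragment (illustrative only; the function name is hypothetical) removes a linear trend by ordinary least squares and splits the residual series chronologically:

```python
import numpy as np

def detrend_and_split(series, n_train=3000):
    """Fit a linear trend by least squares, subtract it, and split
    the residual series chronologically into train/validation sets."""
    series = np.asarray(series, dtype=float)
    t = np.arange(len(series), dtype=float)
    slope, intercept = np.polyfit(t, series, 1)   # linear regression on time
    residual = series - (slope * t + intercept)   # de-trended series
    return residual[:n_train], residual[n_train:]
```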
A feed-forward network with 25 input units, 50 hidden units and one output unit was used. The architecture was deliberately given such a large number of hidden units in order to induce overtraining. This permitted the measurement of training time by monitoring validation error during training and recording the minimum point. Every 100 epochs (one epoch being the presentation of each training data point exactly once), the validation data set was presented to the network (without weight adjustment) and the error measured. A total of 20000 epochs was trained in all cases. The network weights at the minimum point were also saved to permit later investigation.
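The monitoring scheme can be expressed compactly as follows. In this Python sketch the three callables are hypothetical stand-ins for the corresponding SNNS operations; the loop records the error, epoch and weights at the minimum validation error:

```python
def train_with_monitoring(train_epoch, validate, get_weights,
                          n_epochs=20000, check_every=100):
    """Train for a fixed number of epochs, probing the validation
    error every `check_every` epochs (without weight adjustment)
    and recording the minimum point reached."""
    best_err, best_epoch, best_weights = float("inf"), 0, None
    for epoch in range(1, n_epochs + 1):
        train_epoch()                         # one pass over the training set
        if epoch % check_every == 0:
            err = validate()                  # validation error, no learning
            if err < best_err:
                best_err, best_epoch = err, epoch
                best_weights = get_weights()  # saved for later inspection
    return best_err, best_epoch, best_weights
```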
3.1 Results
The results are reported in Table 1, which shows, for each scaling range, the squared error (averaged over the 10 repeated runs and multiplied by 1000 for readability) together with the number of epochs to the minimum validation error.
Range / Mean SE × 1000 / Time
0.45 – 0.55 / 0.001341 / 193700
0.4 – 0.6 / 0.001062 / 175900
0.35 – 0.65 / 0.00162 / 160400
0.3 – 0.7 / 0.003227 / 74200
0.25 – 0.75 / 0.004165 / 95600
0.2 – 0.8 / 0.012769 / 71100
0.15 – 0.85 / 0.025915 / 34900
0.1 – 0.9 / 0.029964 / 75900
0.05 – 0.95 / 0.030776 / 94800
Table 1 Errors and training times. Time shows number of epochs to minimum validation error. Comparison between ranges may be made on the basis of Mean SE, since these are reported in the original units (i.e. after reversal of the pre-processing).
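For completeness, "reporting in the original units" amounts to inverting the min-max scaling before computing the squared error. A minimal Python sketch, reusing the hypothetical unscale helper from the sketch in the Introduction:

```python
import numpy as np

def mean_se_original_units(pred_scaled, tgt_scaled, params):
    """Invert the pre-processing (see the unscale helper in the
    Introduction sketch) and compute the mean squared error in the
    original units, so that different ranges compare directly."""
    pred = unscale(pred_scaled, params)
    tgt = unscale(tgt_scaled, params)
    return float(np.mean((pred - tgt) ** 2))
```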
It should be noted that the figures represent means of 10 separate runs and that there was considerable variability between runs (which differ only in initial weights). Generally speaking, the variability was more pronounced for the wider scaling ranges, to the extent that at the widest ranges a number of runs displayed no apparent learning. Use of a narrower scaling range therefore resulted in much more consistent performance. It is also possible that a more extended training session would have reduced some of the errors; this was most apparent for the very narrow ranges, where the measured minimum appeared very close to the training cut-off. The data is presented graphically in Figure 1.
The graph shows a clear inter-dependency between scaling range and the two measures taken, with training time generally increasing for narrower scaling ranges but error showing the opposing trend. The plot of training time versus scaling range shows a minimum at a range of 0.15-0.85, rather wider than the theoretical prediction. The origin of the reduced error with narrower scaling is not clear, but may be related to weight initialisation, which was the same for all conditions (arguably a narrower initialisation range should have been used for the wider data scaling ranges). Using the minimum error rather than the mean of 10 runs reduces the effect considerably, although it does not eliminate it completely.
Figure 1 - Errors and training times for different scaling ranges. The errors shown are the Mean SE figures from Table 1. The scaling ranges are coded: 1 0.45-0.55, 2 0.4-0.6, 3 0.35-0.65, 4 0.3-0.7, 5 0.25-0.75, 6 0.2-0.8, 7 0.15-0.85, 8 0.1-0.9, 9 0.05-0.95
4 Discussion
It may be objected that the effect of a very narrow scaling range is tantamount to using a linear output function (indeed, for a range of 0.4-0.6 the sigmoid function can be approximated by a linear one with an error of less than 0.1%). Although a linear output function is not an uncommon choice, it is not hard to construct artificial examples where the use of a logistic output function results in a much more parsimonious model (see, for example, [9] p29), although these usually require the use of the full scaling range. The data here may therefore alternatively be interpreted as an argument for the use of linear output functions.
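The near-linearity claim can be checked directly. This Python sketch (illustrative, not from the original study) fits a least-squares line to the sigmoid over the activation interval whose outputs span 0.4-0.6 and reports the worst-case deviation, which comes out at a few parts in ten thousand:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Activations whose sigmoid outputs span 0.4-0.6:
# sigmoid(x) = 0.4 at x = ln(0.4/0.6), and symmetrically for 0.6.
x = np.linspace(np.log(0.4 / 0.6), np.log(0.6 / 0.4), 10001)
slope, intercept = np.polyfit(x, sigmoid(x), 1)   # best linear fit
max_dev = np.max(np.abs(sigmoid(x) - (slope * x + intercept)))
print(max_dev)    # worst-case absolute deviation, roughly 5e-4
```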
Given the use of a sigmoid function, it is clear from the data above that the scaling range affects both the time required for training and the resulting error. This result is consistent with those in [10], who likewise observed a trade-off between accuracy and training time for different input standardisation procedures. Choosing an optimal scaling range would first require some determination of the relative importance of these two factors. By plotting the data on a scatter graph as in Figure 2, however, we may investigate the effect of different criteria using techniques borrowed from the field of linear programming [11]. A decision criterion requiring the minimisation of a (linear) combination of error and training time would appear on this graph as a straight line, with the optimal choice determined by its intersection with the lowermost, leftmost data point.
For example, with equal weighting (on these scales) the decision line would lie at an angle of 45° and would favour the scaling range of 0.3-0.7 (labelled d in the figure). Only in extreme cases where the decision line was nearly vertical (minimise training time at all costs) or nearly horizontal (minimise error at all costs) would one of the other scaling ranges be preferred. Under a wide range of criteria, then, we would be directed to this choice.
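The same selection can be carried out numerically from Table 1. In the Python sketch below, each column is first min-max normalised to [0, 1], which is one plausible reading of "equal weighting on these scales"; that normalisation choice, and the helper name, are assumptions of this sketch rather than details from the study:

```python
import numpy as np

# Mean SE (x 1000) and epochs-to-minimum, copied from Table 1.
ranges = ["0.45-0.55", "0.4-0.6", "0.35-0.65", "0.3-0.7", "0.25-0.75",
          "0.2-0.8", "0.15-0.85", "0.1-0.9", "0.05-0.95"]
err = np.array([0.001341, 0.001062, 0.00162, 0.003227, 0.004165,
                0.012769, 0.025915, 0.029964, 0.030776])
time = np.array([193700, 175900, 160400, 74200, 95600,
                 71100, 34900, 75900, 94800])

def best_range(w_err, w_time):
    """Minimise a linear weighting of the error and training-time
    columns, each min-max normalised to [0, 1] first."""
    e = (err - err.min()) / (err.max() - err.min())
    t = (time - time.min()) / (time.max() - time.min())
    return ranges[np.argmin(w_err * e + w_time * t)]

print(best_range(1.0, 1.0))   # equal weighting selects 0.3-0.7
```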
Figure 2 - Decision plot. The scaling ranges are coded: a 0.45-0.55, b 0.4-0.6, c 0.35-0.65, d 0.3-0.7, e 0.25-0.75, f 0.2-0.8, g 0.15-0.85, h 0.1-0.9, i 0.05-0.95. A decision criterion made up of a linear weighting of error and time appears as a straight line of negative slope, with the best condition under that criterion identified by its intersection (point d in most cases).
These results do not exactly match the specific values estimated in the theoretical analysis (which would have predicted point c as the optimum) but are supportive of a general strategy of scaling data into narrower ranges. One must always be cautious, of course, in extrapolating from results obtained on a single data set, and it is intended to validate the results against other data sets. Preliminary studies using a time series consisting of the opening price of ANZ shares on the Australian stock market indicate that scaling to a range of 0.3-0.7 reduces training time by 29% and generalisation error by 17.5% (as compared to a scaling range of 0.1-0.9), which is at least an indication of the generality of this effect.
On a purely pragmatic basis, the use of narrower scaling ranges can also reduce the problems often encountered with severely trending data series, where there is a danger that new data will lie outside the range of the sigmoid function. Combined with the increased consistency noted above, these results should perhaps encourage practitioners to err on the side of caution (i.e. scale to narrower ranges).
References:
1. Deboeck, G.J. and M. Cader, Pre- and Postprocessing of Financial Data, in Trading on the Edge, G.J. Deboeck, Editor. 1994, Wiley. p. 45-.
2. Cherkassky, V. and F. Mulier, Learning from Data. 1998: Wiley.
3. Hall, J.W., Adaptive Selection of U.S. Stocks with Neural Nets, in Trading on the Edge, G.J. Deboeck, Editor. 1994, Wiley. p. 45-.
4. Parkinson, A., Financial Time Series Prediction using Neural Networks: Approaches to Data Pre-Processing, in Advanced Investment Technology 1999. 1999. Gold Coast, Australia.
5. Duch, W. and N. Jankowski, Transfer functions: hidden possibilities for better neural networks, in Proceedings of ESANN 2001. 2001. Belgium.
6. Angstenberger, J., Prediction of the S&P 500 Index with Neural Networks, in Neural Networks and their Applications, J. Taylor, Editor. 1996, Wiley. p. 143-152.
7. Lisi, F. and R. Schiavo, A comparison between neural networks and chaotic models for exchange rate prediction. Computational Statistics & Data Analysis, 1999. 30: p. 87-102.
8. Weigend, A.S., B.A. Huberman, and D.E. Rumelhart, Predicting sunspots and exchange rates with connectionist networks, in Nonlinear Modeling and Forecasting, SFI Studies in the Sciences of Complexity, Proceedings Vol. XII, M. Casdagli and S. Eubank, Editors. 1992, Addison-Wesley. p. 395-432.
9. Sarle, W., 2002, Neural Network FAQ, ftp://ftp.sas.com/pub/neural/FAQ3.html
10. Shanker, M., M.Y. Hu, and M.S. Hung, Effect of Data Standardization on Neural Network Training. Omega, 1996. 24(4): p. 385-397.
11. Turban, E. and J. Meredith, Fundamentals of Management Science. 1994: McGraw-Hill.