Front End Interface Of The WOLS Symbolic Regression System
The symbolic regression genetic programming (SRGP) component of the WOLS system, whose overall architecture was depicted in Figure 3.5 of Chapter 3, includes an interactive front end interface that allows a user to easily generate a genetic program that employs symbolic regression and is tailor-made for the problem to be solved. A system run starts by interactively asking the user for the parameters needed to run the symbolic regression system (e.g., the variable names that should be used, the name of the file or files containing the training and testing data sets, the population size, and the desired fitness function). The information the user provides is then used to generate the functions for the specific problem. For example, each input data set is transformed into an evaluation function with respect to the fitness function the user selected. These problem-specific functions are then merged with a set of general genetic programming functions, located in the gp-template file, to form a complete genetic program customized to the user's problem, which is placed in a file specified by the user. The genetic program is then compiled automatically, so all that is left for the user to do is load and execute it. In short, the SRGP component eliminates the need for the user to edit the genetic program's source code directly each time the problem changes. The architecture of the front end interface of the SRGP component is depicted in Figure 4.1.
Figure 4.1: Architecture of the Front End of the SRGP Component
The entire program is contained in the following four files:
make-gp.o - front end interface
gp-template - general genetic programming functions
reg-test.l - testing functions for regression analysis problems
class-test.l - testing functions for classification problems
Input File Format Requirements
During the execution of the SRGP, the user will be asked to provide the system with at least two files. The first is a file containing the names of the independent variables the user wishes the symbolic regression system to use when solving the problem. The second contains the data collection that the symbolic regression system will use as the training data set, from which a solution to the problem at hand (i.e., the approximation function) will be found. Optionally, the user may provide the system with additional files containing further data sets for purposes such as testing the quality of the solution obtained by the system.
There are two requirements that all input files provided by the user must meet. First, the contents of every input file must begin with an open parenthesis and end with a closing parenthesis (i.e., the contents of a file must be enclosed in parentheses). In other words, the contents of any user-provided input file must be in the form of a list. Second, the names of all input files must be in all capital letters.
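The two requirements above can be checked before a file is handed to the front end. The following is a minimal sketch in Python, not part of the WOLS system; the function name `check_input_file` is hypothetical, and ignoring leading and trailing whitespace around the parentheses is an assumption of this sketch.

```python
def check_input_file(name: str, contents: str) -> list[str]:
    """Return a list of problems found; an empty list means both requirements hold."""
    problems = []
    # Requirement 1: the contents must form a list, i.e. be wrapped in parentheses.
    # Assumption: surrounding whitespace (such as a trailing newline) is ignored.
    stripped = contents.strip()
    if not (stripped.startswith("(") and stripped.endswith(")")):
        problems.append("contents must begin with '(' and end with ')'")
    # Requirement 2: the file name must be in all capital letters.
    if name != name.upper():
        problems.append("file name must be in all capital letters")
    return problems


print(check_input_file("NAMES", "(ALPHA BETA GAMMA)"))
print(check_input_file("names", "ALPHA BETA GAMMA"))
```

The first call reports no problems, while the second reports both a missing pair of parentheses and a lowercase file name.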
SRGP Operation
In this section we perform a complete walkthrough of the interactive front end interface of the SRGP component. The result of this walkthrough is a symbolic regression system capable of solving a multivariable regression analysis problem with three independent variables. In the example below, the greater-than symbol '>' represents the LISP prompt.
After loading the front end of the SRGP component contained in the make-gp.o file, the first thing you will be prompted for is the VARIABLE-FILE file name.
>Enter VARIABLE-FILE file name:
This is the name of the file that contains the list of variables (variable names) you wish the genetic program to use. For example, if your VARIABLE-FILE were named NAMES and you wanted to use the three variable names ALPHA, BETA, and GAMMA to represent the three independent variables used by the genetic program, then the contents of the NAMES file would look as follows:
(ALPHA BETA GAMMA)
Notice that the three independent variable names are enclosed in parentheses, which constitutes a list in LISP. The first character in the VARIABLE-FILE must always be an open parenthesis and the last character must always be a closed parenthesis. Variable names should be separated by one or more white spaces (blanks). The three variable names ALPHA, BETA, and GAMMA were chosen arbitrarily; normally you would choose names that are more representative of the problem. For example, if you were trying to solve a multivariable regression problem in which you want to obtain an approximation function that predicts the average rainfall in an area (the dependent variable) as a function of two independent variables, humidity and temperature, you would probably use more meaningful variable names, such as TEMP and HUMIDITY.
The next thing you will be asked for is the DATA-SET file name.
>Enter DATA-SET file name:
This is the name of the file that contains your training or testing data collection. The data should be arranged so that there is one column of data for each independent or dependent variable. Columns should be separated by one or more white spaces (blanks). So for a three variable regression analysis problem (i.e., 3 independent variables and 1 dependent variable) you should have a total of 4 columns of data in your DATA-SET file. For example, if you were working on a three variable problem and the DATA-SET file name was simply DATA, the contents of this file would look as follows:
(0.00 0.00 0.00 1.00
1.00 1.00 0.31 0.00
1.00 0.00 0.81 0.88
0.2 0.2 0.2 0.2      <-- Dependent Variable Data (rightmost column)
0.38 0.38 0.38 0.38
0.4 0.4 0.4 0.4
0.6 0.6 0.6 0.6
0.8 0.8 0.8 0.8)
If your three independent variables were named ALPHA, BETA, and GAMMA, then reading from left to right, the first column of data would hold the values for ALPHA, the second column the values for BETA, the third column the values for GAMMA, and the fourth column the values for the dependent variable. The SRGP component assumes that the values for the dependent variable are in the rightmost column of the data collection. Also notice once again that the entire data set is enclosed in parentheses, making it a list. As with the VARIABLE-FILE, the first character of the DATA-SET file must be an open parenthesis and the last character must be a closed parenthesis.
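The column layout described above can be illustrated with a short Python sketch that reads the contents of a VARIABLE-FILE and a DATA-SET file and pairs each row's independent values with its dependent value. This is only an illustration of the file format, not the SRGP's own reader; the function names `parse_lisp_list` and `parse_data_set` are hypothetical.

```python
def parse_lisp_list(text: str) -> list[str]:
    """Split a parenthesized, whitespace-separated list into its tokens."""
    stripped = text.strip()
    if not (stripped.startswith("(") and stripped.endswith(")")):
        raise ValueError("input must be enclosed in parentheses")
    return stripped[1:-1].split()


def parse_data_set(text: str, variable_names: list[str]):
    """Group the flat list of numbers into rows of len(variable_names) + 1 columns."""
    values = [float(token) for token in parse_lisp_list(text)]
    width = len(variable_names) + 1  # independent variables plus one dependent variable
    if len(values) % width != 0:
        raise ValueError("number of values is not a multiple of the column count")
    rows = []
    for start in range(0, len(values), width):
        row = values[start:start + width]
        inputs = dict(zip(variable_names, row[:-1]))
        rows.append((inputs, row[-1]))  # the rightmost column is the dependent variable
    return rows


names = parse_lisp_list("(ALPHA BETA GAMMA)")
rows = parse_data_set("(0.00 0.00 0.00 1.00\n 1.00 1.00 0.31 0.00)", names)
print(rows[0])
```

Because the file is a flat list, line breaks carry no meaning; the number of variable names alone determines where one row ends and the next begins.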
Next you will be asked to provide a name for an output file.
>Enter desired OUTPUT-FILE file name:
This will be the name of the file that the genetic program will be written to. There is only one minor restriction here: the last two characters of the file name must be .l. Otherwise, the SRGP will not be able to compile the file and you will have to go back, rename the file, and compile it manually (i.e., the file must have the extension .l to be compilable). Additionally, if you are planning on running this program more than once, you may want to avoid giving your OUTPUT-FILE the same name as any of the four files used by this program, which are listed in Section 4.2.
Next you will be asked if you wish to use the default population size of 400 and the default number of cross-over points of 70.
>Would you like to use the default population size of 400 and 70 cross-over points?
Enter yes or no:
If you enter no, you will be prompted to enter integer values for the population size and the number of cross-over points you wish to use.
>Enter population size:
>Enter the maximum number of cross-over points:
The population size refers to the number of solutions that are kept when moving from generation to generation. Bear in mind that for each new generation three times as many solutions as the population size are generated and the best 1/3 of these solutions are kept. So if a population size of 400 is used, 1200 solutions will be generated for each new generation and the 400 best solutions will be kept. Therefore, the larger the population size, the more computation must be performed at each generation and the slower the genetic program will run. The number of cross-over points limits the tree size of each solution in your population. The higher the number of cross-over points, the larger the solution trees will end up being. Very large solution trees greatly reduce the speed of the genetic program while providing little or no real increase in the quality of the solutions found. Using the default value of 70 cross-over points gives preference to generating trees with fewer than 70 cross-over points, which results in a fairly reasonable program speed and solution quality. On the other hand, better program speed (i.e., a shorter runtime) can be achieved by setting this parameter to a lower value such as 40; however, a loss in the quality of the solutions found will most likely occur.
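The generate-three-times-then-keep-the-best-third rule described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the WOLS implementation: the names `next_generation`, `make_offspring`, and `error` are hypothetical, and how the SRGP actually produces new solutions is not specified here.

```python
import random


def next_generation(population, make_offspring, error, rng=random):
    """One generation step: create 3 * len(population) candidates and keep the best third."""
    size = len(population)
    candidates = [make_offspring(population, rng) for _ in range(3 * size)]
    candidates.sort(key=error)  # lower error means a better solution
    return candidates[:size]


# Toy usage: "solutions" are plain numbers and the target value is 10.0.
def perturb(population, rng):
    return rng.choice(population) + rng.uniform(-1.0, 1.0)


population = [float(i) for i in range(400)]
survivors = next_generation(population, perturb, lambda x: abs(x - 10.0), random.Random(0))
print(len(survivors))  # 400 survivors, selected from 1200 generated candidates
```

The cost of the sort and of evaluating `error` grows with the number of candidates, which is why a larger population size slows each generation.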
Next you choose which fitness function you would like your genetic program to use.
>Enter the number of the fitness function you would like to use:
0 = Manhattan error function
1 = Least Square error function
2 = Classification function
Choice:
Choices 0 and 1 are for regression analysis problems, where you are essentially trying to find the function that best fits the data by minimizing the error in the output variable. Choice 2 is for classification problems, where the values of the dependent variable are class numbers and the goal is to minimize the number of misclassifications. In a classification problem, an object in a universe is assigned to a particular class on the basis of its attributes (the independent variables). Here the approximation function found by the WOLS system corresponds to a computer program consisting of functions that test the attributes of the object (i.e., the approximation function behaves like a decision tree). The input to this computer program consists of the values of the attributes associated with a given data point, and the output is the class into which that data point is classified. In this way the WOLS system can be used to perform a task similar to the one performed by a decision tree analysis tool such as C4.5 [26]. Besides the fitness function, the only other difference between solving classification and regression analysis problems with the WOLS system is the contents of the data collections used. For example, for a four variable classification problem a typical data collection will look as follows:
(1.52 14.36 3.85 0.89 0
1.52 12.82 3.55 1.49 0
1.52 13.89 3.53 1.32 0
1.53 10.73 0.00 2.10 1
1.53 12.30 0.00 1.00 1
1.52 14.43 0.00 1.00 1
1.52 13.33 3.53 1.34 2      <-- Dependent Variable Data (Class Numbers)
1.52 14.32 3.90 0.83 2
1.52 13.64 3.65 0.65 2
1.52 14.85 0.00 2.38 3
1.52 14.20 0.00 2.79 3
1.52 14.75 0.00 2.00 3)
Again, the four leftmost columns are the values for the four independent variables and the rightmost column contains the values for the dependent variable. However, this time the values of the dependent variable are the class numbers of the individual test cases. In the example data collection above, there are a total of four possible classes an object may belong to (i.e., class 0, 1, 2, or 3). Your lowest class number should always be 0.
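The three fitness function choices can be sketched in Python as shown below. This is a hedged illustration only: the function names are hypothetical, each row is assumed to be a pair of (independent values, dependent value), and whether the WOLS system sums or averages the per-row errors during a run is an assumption not settled by the description above (sums are used here).

```python
def manhattan_error(predict, rows):
    """Choice 0: sum of absolute differences between predicted and recorded outputs."""
    return sum(abs(predict(inputs) - target) for inputs, target in rows)


def least_squares_error(predict, rows):
    """Choice 1: sum of squared differences between predicted and recorded outputs."""
    return sum((predict(inputs) - target) ** 2 for inputs, target in rows)


def misclassifications(classify, rows):
    """Choice 2: number of rows whose predicted class differs from the recorded class."""
    return sum(1 for inputs, target in rows if classify(inputs) != target)


# Regression example: the candidate adds its two inputs.
regression_rows = [((1.0, 2.0), 3.0), ((2.0, 2.0), 6.0)]
add = lambda x: x[0] + x[1]
print(manhattan_error(add, regression_rows))      # 2.0
print(least_squares_error(add, regression_rows))  # 4.0

# Classification example: class 1 when the second attribute is zero, else class 0.
class_rows = [((1.53, 0.00), 1), ((1.52, 3.85), 0)]
rule = lambda x: 1 if x[1] == 0.0 else 0
print(misclassifications(rule, class_rows))       # 0
```

In all three cases a lower value indicates a better candidate, so the same selection step can be used regardless of which choice is entered at the prompt.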
If the input data in your DATA-SET file is not already in a normalized form, the SRGP component will give you a chance to normalize it using MIN-MAX normalization (equation 4.3):
$$X_{norm} = \frac{X_i - X_{min}}{X_{max} - X_{min}} \qquad (4.3)$$
where:
$X_{norm}$ = The normalized value of $X_i$.
$X_i$ = The column value currently being normalized.
$X_{max}$ = The largest value in the current column of values being normalized.
$X_{min}$ = The smallest value in the current column of values being normalized.
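As an illustration, MIN-MAX normalization in its standard form, (Xi - Xmin) / (Xmax - Xmin) applied column by column, can be sketched in Python as follows. The function name `min_max_normalize` is hypothetical, and mapping a constant column (where Xmax equals Xmin) to 0.0 is an assumption of this sketch, since that case is not discussed here.

```python
def min_max_normalize(rows):
    """Scale every column of a rectangular data set into the range [0, 1]."""
    columns = list(zip(*rows))  # transpose rows into columns
    normalized_columns = []
    for column in columns:
        lowest, highest = min(column), max(column)
        span = highest - lowest
        # Assumption: a constant column has no spread, so it is mapped to 0.0.
        normalized_columns.append(
            [(value - lowest) / span if span else 0.0 for value in column]
        )
    return [list(row) for row in zip(*normalized_columns)]  # transpose back


data = [[10.0, 1.0], [20.0, 3.0], [30.0, 2.0]]
print(min_max_normalize(data))  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```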
During MIN-MAX normalization, equation 4.3 is applied to each value in each column of the data collection, scaling all of the values into the range between zero and one. Normalization of the input data is necessary if functions with domains limited to values between zero and one are present in the function set of the symbolic regression system (e.g., Bayesian tools require all input values to be between 0 and 1). Normalization can also be used to limit the ranges of the functions in the function set, since the minimum and maximum input values are known in a normalized data set. Normalization of the input data is purely optional: if the default function set is used, the symbolic regression system is capable of working with data in both raw and normalized form. In any case you will be prompted as follows:
>Would you like the data in your DATA-FILE normalized?
Enter yes or no:
If you enter yes, MIN-MAX normalization will be applied to the data in your DATA-SET file. Another option you will have is to enter a building block that will be used by the symbolic regression system.
>Would you like to use a specific building block?
Enter yes or no:
Specifying a building block to be used in your genetic program is entirely optional. The selection of a good building block may result in better solutions being found in far fewer generations. If you choose to enter a building block, it must be in prefix form and enclosed in parentheses, which is the way expressions are written in LISP [30][13]. For example, to use an expression such as A * B as a building block, you would have to rewrite and enter it as (* A B).
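For longer building blocks, the rewriting into parenthesized prefix form can be mechanized. The Python sketch below is an illustration only; the function name `to_prefix` and the representation of an expression as a nested tuple of (operator, operand, ...) are assumptions of this sketch, not part of the SRGP.

```python
def to_prefix(expr) -> str:
    """Render a nested (operator, operand, ...) tuple as a LISP-style prefix expression."""
    if isinstance(expr, tuple):
        return "(" + " ".join(to_prefix(part) for part in expr) + ")"
    return str(expr)


print(to_prefix(("*", "A", "B")))              # (* A B)
print(to_prefix(("+", ("*", "A", "B"), "C")))  # (+ (* A B) C)
```

The second call corresponds to the infix expression A * B + C, where the product becomes a nested list inside the sum.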
If you chose the classification function as your fitness function (choice 2) earlier, then you will be prompted with one or both of the following additional questions, which deal solely with classification problems.
>Enter the total number of classes:
>Enter the highest class number:
For the first of these two questions you simply enter the total number of different classes. For instance, if the only classes that exist are numbered 0, 1, and 2, then you would enter 3 at the prompt, since in this case there are only 3 different classes a data point can belong to. If you chose not to normalize the data in the DATA-FILE earlier, you will now also be asked for the highest class number, which would be 2 in the previous example with the three classes 0, 1, and 2. The SRGP assumes that the lowest class number is always 0 and that there always is a class 0. If your data set does not contain a class 0, then you should choose yes when asked if you would like the data in your DATA-FILE normalized.
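The answers to these two prompts can be read off the rightmost column of a classification data set. The Python sketch below shows one way to do so; the function name `class_summary` is hypothetical, and the sketch assumes each row is a list of numbers with the class number last.

```python
def class_summary(rows):
    """Return (total number of classes, highest class number, whether class 0 occurs)."""
    labels = [int(row[-1]) for row in rows]  # class numbers are in the rightmost column
    total_classes = len(set(labels))
    highest_class = max(labels)
    has_class_zero = 0 in labels
    return total_classes, highest_class, has_class_zero


rows = [
    [1.52, 14.36, 3.85, 0.89, 0],
    [1.53, 10.73, 0.00, 2.10, 1],
    [1.52, 13.33, 3.53, 1.34, 2],
]
print(class_summary(rows))  # (3, 2, True)
```

If the third element of the result is False, the data set has no class 0, which is the case in which normalizing the data is recommended above.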
Finally, you will be asked if there are any additional data sets you would like to add.
>Are there any additional DATA sets that you would like to add (such as a testing
data set)?
Enter yes or no:
If you have other data sets stored in files for the purpose of testing the final solution found by the genetic program, or if you wish to run the genetic program with all of the same parameters but with different training data sets, then you may add these data sets at this time. Every data set that is input is transformed into an evaluation function, and each evaluation function is automatically numbered according to the order in which it was added. The very first data set you input is always given the name eval1, and any subsequent data sets are named incrementally eval2, eval3, ..., evaln.
> Enter the number of additional DATA sets you would like to add:
Enter DATA-SET file name:
You may add as many additional DATA sets as you wish. If you chose to normalize the very first data set, all subsequent data sets you add will be normalized automatically.
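The idea of turning each data set into a numbered evaluation function can be sketched in Python with closures. This is a conceptual illustration, not the code the SRGP generates; the names `make_evaluators` and `total_abs_error` are hypothetical, and each data set is assumed to be a list of (input, output) pairs.

```python
def make_evaluators(data_sets, fitness):
    """Wrap each data set in an evaluation function named eval1, eval2, ... in input order."""
    evaluators = {}
    for index, rows in enumerate(data_sets, start=1):
        # Bind rows as a default argument so each function keeps its own data set.
        evaluators[f"eval{index}"] = lambda candidate, rows=rows: fitness(candidate, rows)
    return evaluators


def total_abs_error(candidate, rows):
    """Sum of absolute differences between the candidate's output and the recorded output."""
    return sum(abs(candidate(x) - y) for x, y in rows)


training = [(1.0, 2.0), (2.0, 4.0)]
testing = [(3.0, 7.0)]
evals = make_evaluators([training, testing], total_abs_error)

double = lambda x: 2 * x
print(evals["eval1"](double))  # 0.0 on the training data
print(evals["eval2"](double))  # 1.0 on the testing data
```

Because the fitness function is fixed when the evaluation functions are built, every data set added later is scored in the same way as the first one.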
After answering this last question the complete genetic program will automatically be compiled, so now all that remains for you to do is load and run it. For example, if the name you chose for your OUTPUT-FILE was genetic.l, then the compiled version of the genetic program can be found in a file called GENETIC.o. You would then load it by typing:
>(load "GENETIC.o")
at the prompt as shown and run it by typing:
>(run-gp 100 #'eval1)
Above, run-gp is the top level function that runs the entire genetic program. It requires two input parameters: the maximum number of generations (iterations) the program should run and the name of the evaluation function (transformed data collection) that should be used as the training data set (i.e., the data set for which an approximation function will be found). In the example above, 100 indicates that you wish to run the genetic program for 100 generations and #'eval1 indicates that you wish the evaluation function eval1 to be used for training. However, you can choose to run the program for any number of generations and use any valid evaluation function (e.g., eval1, eval2, eval3, ..., evaln) for training. At the very end of the run, the average Manhattan distance (see equation 6.1 in Chapter 6) is calculated for the best solution found if you chose either of the first two fitness functions (i.e., the Manhattan fitness function (choice 0) or the least squares fitness function (choice 1)). Otherwise, the total number of test cases classified correctly and incorrectly by the best solution found is shown. If you have added additional DATA sets, you can test the best solution with these data sets, which are now in the form of evaluation functions, by simply typing test followed by the evaluation function you want to test the best solution with. For example, if you wanted to test the best solution found with the evaluation function eval5 (i.e., the fifth data set you had added), you would type: