Step 2: Selected one data set, examined it in Excel, and decided on the variable to be predicted.
Step 3: Created a neural network which predicts the output variable. Separated the data into a
Training Set and a Test Set.
Step 4: Analyzed the results and wrote up a report on the outcome.
Introduction
Rainfall prediction depends on several variables. It is a complex function which gives only an approximate indication of the next day’s rainfall. Historical data on these variables is often available. In this study our aim is to construct models, and ideally the best ones, that help to predict the response variable “RainTomorrow” from the given variables.
Software such as Palisade helps with data mining on these variables, and its analytics provide criteria for making a suitable prediction and decision. Such versatile packages can import raw data from multiple sources, such as Excel or other repositories, and assign each column a variable name over which a conditional dataset can be developed. In this study, all the requisite inputs were taken from the Excel file to construct the dataset, and software packages were then used to build the neural network that generates the criteria and helps predict rainfall.
Palisade is an extremely versatile tool for data analytics and business intelligence, offering a wide range of data mining functionality. It accepts inputs from disparate sources such as CSV files, Excel and flat files, and assigns them variable names within a dataset. This dataset can then be further processed or manipulated as a pre-analysis stage. Once the data is staged properly, several standard statistical tools are available to extract a wealth of information from it.
Palisade is a suitable tool for conducting controlled data mining that can help predict whether an event will occur. In our context, we have a raw climate dataset with 24 variables, and from it we need to predict tomorrow’s rainfall. Predicting climatic conditions is never fully accurate, as the outcome is a very complex function of several independent and dependent variables and conditions. The prediction is nevertheless of great importance, as it affects living conditions, and fortunately several techniques are available to help with it.
A decision tree builds a tree of criteria or observations with several nodes that finally lead to a conclusion; each leaf value represents the conclusion, or target value, on which we base our prediction. A neural network is a technique that predicts a response variable from a flexible selection of input variables. In this assignment we study the prediction of the response variable “RainTomorrow” using varying combinations of input variables, such as humidity, sunshine, wind and many others, which can be responsible for causing rainfall.
Method
Pre-processing of the data
The raw data was consolidated on 14 variables and covers a long period of more than 14 months. There were some problems in the data, as some fields were missing or simply contained ‘NA’. Non-availability of the data for a particular day is the likely cause of an ‘NA’ value for that day. However, leaving NA (not available) as the value of a field usually causes errors or unexpected results when calculating an aggregate or statistical function over the underlying variable.
Thus, pre-processing of the data was required. All the “NA” values were substituted with 0 or blanks, depending on whether the variable is numeric or string/alphanumeric. The data was then sorted by date to provide an ordered input to the tool, as this often helps the performance of the search criteria.
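The substitution itself was done in Excel; for completeness, a minimal R sketch of the same cleaning, assuming the raw data has been exported to a file named climateAu.csv with a Date column (both names are assumptions), would be:

# Minimal sketch of the cleaning step (file and column names are assumed)
climateAu <- read.csv("climateAu.csv", stringsAsFactors = FALSE)

# Replace missing values: 0 for numeric columns, blanks for text columns
for (col in names(climateAu)) {
  if (is.numeric(climateAu[[col]])) {
    climateAu[[col]][is.na(climateAu[[col]])] <- 0
  } else {
    climateAu[[col]][is.na(climateAu[[col]])] <- ""
  }
}

# Sort by date so the tool receives an ordered input
climateAu <- climateAu[order(as.Date(climateAu$Date)), ]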
The data was imported into an Excel sheet:
We can install the Palisade software, which then appears as a tab in Excel, as shown in the screenshot below:
A dataset was prepared and, from this, a decision tree was built using the “Decision tree” functionality. One function prepares the decision tree, while another method, predict, makes predictions for new data.
This package helps to create a decision tree model which establishes the criteria used for prediction.
The training data set is used to build the tree, and the test data set is used to evaluate it on unseen data.
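The report does not show the split command itself; a minimal sketch that produces the trainData and testData sets used below, assuming a roughly 70/30 random split, is:

# RainTomorrow must be a factor for classification
climateAu$RainTomorrow <- as.factor(climateAu$RainTomorrow)

# Random 70/30 split into training and test sets (the proportion is an assumption)
set.seed(1234)
idx <- sample(2, nrow(climateAu), replace = TRUE, prob = c(0.7, 0.3))
trainData <- climateAu[idx == 1, ]
testData  <- climateAu[idx == 2, ]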
We set the criteria by specifying the parameters when defining the formula:
myFormula <- RainTomorrow ~ MaxTemp + Rainfall + Sunshine + Evaporation
For better accuracy, we can easily change the criteria by extending the formula:
myFormula <- RainTomorrow ~ MaxTemp + Rainfall + Sunshine + Evaporation + WindGustSpeed + WindSpeed3pm + Humidity3pm + Cloud3pm
Since there is a lot of flexibility in building the decision tree, we can easily define the criteria and generate the tree; better criteria generate a better tree. These results are shown in the ‘Results’ section.
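The function calls tabulated in the Results section follow the ctree() workflow; a minimal sketch that builds the climateAu_ctree model and the testPred predictions, assuming ctree() comes from the party package, is:

library(party)   # assumed source of ctree()

# Build the tree on the training set using the chosen criteria
myFormula <- RainTomorrow ~ MaxTemp + Rainfall + Sunshine + Evaporation
climateAu_ctree <- ctree(myFormula, data = trainData)

# Confusion matrices for the training and test sets
table(predict(climateAu_ctree), trainData$RainTomorrow)
testPred <- predict(climateAu_ctree, newdata = testData)
table(testPred, testData$RainTomorrow)

# Draw the tree
plot(climateAu_ctree)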
Finally, let us take all the parameters to generate the formula, which should give more accurate results since every variable is taken into account. All the results are compiled in the Results section, and the discussion and comparison of the results follow in the section after that.
The data is successfully loaded as shown above. Next, we can create the neural network model using the parameters loaded above.
First, we launch the Neural platform and then select the response variable (dependent variable) as “RainTomorrow” and the x-factors (independent variables) as
“MaxTemp + Rainfall + Sunshine + Evaporation”
We model in a similar manner to the decision tree.
The results from this execution have been recorded in the next section.
Next, we generate another model with a larger number of parameters, as we did for the decision tree. Here, we take the following parameters:
“MaxTemp + Rainfall + Sunshine + Evaporation + WindGustSpeed + WindSpeed3pm + Humidity3pm + Cloud3pm”
The results are recorded in the next section. The explanation follows in the discussion section.
Finally, we generate the neural network with all the parameters, from which we expect the maximum accuracy:
“MaxTemp + Rainfall + Sunshine + Evaporation + WindGustSpeed + WindSpeed3pm + Humidity3pm + Cloud3pm + MinTemp + WindGustDir + WindDir9am + WindDir3pm + WindSpeed9am + Humidity9am + Pressure9am + Pressure3pm + Cloud9am + Temp9am + Temp3pm + RainToday + RISK_MM”
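The networks above are built through the software’s Neural interface rather than in code; purely as an illustration, a comparable single-hidden-layer model for the first specification could be fitted with the R nnet package (the package choice, hidden-layer size and iteration limit are assumptions, not the method used in this report):

library(nnet)   # assumed package; the report itself uses the GUI

# Single-hidden-layer network for the first set of x-factors
nnFormula <- RainTomorrow ~ MaxTemp + Rainfall + Sunshine + Evaporation
climateAu_nnet <- nnet(nnFormula, data = trainData,
                       size = 5,      # hidden nodes (assumed)
                       maxit = 200,   # training iterations (assumed)
                       trace = FALSE)

# Confusion matrix on the test set, analogous to the decision-tree tables
nnPred <- predict(climateAu_nnet, newdata = testData, type = "class")
table(nnPred, testData$RainTomorrow)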
Results
This is for the initial formula:
> table(predict(climateAu_ctree), trainData$RainTomorrow)

        0  No Yes
  0   225   1  66
  No    0   0   0
  Yes   0   0   0
> table(testPred, testData$RainTomorrow)

        0  No Yes
  0    95   0  28
  No    0   0   0
  Yes   0   0   0
The plot is shown below:
> table(predict(climateAu_ctree), trainData$RainTomorrow)

        0  No Yes
  0   225   1  66
  No    0   0   0
  Yes   0   0   0
A simple plot for this table is shown below:
Now we take the second set of criteria, and we get the following table:
> table(predict(climateAu_ctree), trainData$RainTomorrow)

        0  No Yes
  0   209   1  26
  No    0   0   0
  Yes  16   0  40
We can plot this as:
Finally, we take all the parameters and get the following table:
> table(predict(climateAu_ctree), trainData$RainTomorrow)

        0  No Yes
  0   225   1   0
  No    0   0   0
  Yes   0   0  66
Number of observations: 292
And plot the results as shown:
plot(climateAu_ctree)
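These confusion matrices can be summarised as a single accuracy or misclassification figure by comparing the diagonal with the total; a small sketch (the object names here are illustrative) is:

# Training accuracy and misclassification rate from the confusion matrix
confTrain <- table(predict(climateAu_ctree), trainData$RainTomorrow)
accuracy  <- sum(diag(confTrain)) / sum(confTrain)
misclass  <- 1 - accuracy
c(accuracy = accuracy, misclassification = misclass)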
With the Palisade software we give the first set of criteria:
We obtain the following measures:
RainTomorrow Measures
  Generalized RSquare       0.418536
  Entropy RSquare           0.29656
  RMSE                      0.355886
  Mean Abs Dev              0.255185
  Misclassification Rate    0.195652
  LogLikelihood             107.9029
  Sum Freq                  276
The Generalized RSquare shows that this is not a close fit; for a close fit, the Generalized RSquare should be close to 1.
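For reference, this measure is commonly computed as Nagelkerke’s generalized R-square (we assume the software follows this standard definition), which rescales a likelihood-ratio R-square so that its maximum is 1:

R^2_{\mathrm{gen}} = \frac{1 - \left( L_0 / L_M \right)^{2/n}}{1 - L_0^{2/n}}

where L_0 is the likelihood of an intercept-only model, L_M the likelihood of the fitted model, and n the number of observations (here Sum Freq = 276).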
The confusion matrix is shown below; it is quite close to that of the decision tree with 4 parameters.
RainTomorrow
        0  No Yes
  0   190   0  23
  No    1   0   0
  Yes  30   0  32
Dashboard
The plot for the first set of criteria, shown below, is not a close-fitting curve.
With the second set of criteria, the plots below show some improvement in the prediction, with Generalized RSquare = 0.482712.
The curve shows some improvement, with a slightly better Yes ratio.
Now we use all the columns as x-factors, that is, all 23 variables:
We obtain the following measures:
RainTomorrow Measures
  Generalized RSquare       0.91508
  Entropy RSquare           0.856329
  RMSE                      0.149569
  Mean Abs Dev              0.04593
  Misclassification Rate    0.028986
  LogLikelihood             22.0382
  Sum Freq                  276
With all 23 variables included as x-factors, the model for the response variable (“RainTomorrow”) has Generalized RSquare = 0.91508.
This means that the prediction is very close, as the value is near 1.
The plot shown below is also a close fit:
The curve is close to 1, as shown above.
Report
- The decision tree made using the criteria
myFormula <- RainTomorrow ~ MaxTemp + Rainfall + Sunshine + Evaporation + WindGustSpeed + WindSpeed3pm + Humidity3pm + Cloud3pm
provides a better mapping compared with the results from the formula:
myFormula <- RainTomorrow ~ MaxTemp + Rainfall + Sunshine + Evaporation
Obviously, the more relevant parameters we choose in the formula, the better the mapping we get for the response (dependent) variable. In the third matrix, we have a better Yes-Yes mapping compared with the previous matrix.
Neural networks and decision trees present the results in an easy-to-understand manner. The tree also shows the data in weighted form: in the first case it starts with Rainfall and then divides the tree into two parts. The right-hand side of the tree contains the criteria that make the outcome more probable. The criteria start with the parameter that has the maximum weight, so naturally, if the previous day had rain, the following day may well experience rain, given the other parameters.
The left side of the tree is the less probable side, where the left-most leaf brings us to the least probable criteria. Each node has two child nodes or ends in leaf nodes. By traversing the tree, we can identify which variables are most important, that is, which have a high probability of producing the outcome. While all the parameters matter for predicting tomorrow’s rainfall, if certain parameters meet the sequence of criteria shown in the decision tree, the probability of the outcome is very high.
The results of the neural network resemble those of the decision tree. Since generating a neural network is quite easy with a good GUI such as Palisade’s, the attempt was to try and test many options: different selections of input parameters were made and checked against the results. The prediction when taking all 23 factors is much closer than with a smaller number of inputs, so all the inputs are important for the prediction.
If a parameter is not available for some day, the prediction of the next day’s rainfall is affected. With a neural network, just by looking at the analysis it is a bit difficult to settle on a specific outcome when a value is missing. Here we substituted the missing items, so generating the diagrams and results was still possible, but in the real scenario of day-to-day prediction, when some statistic is missed out, the prediction can suffer.
With a decision tree, we can select the best path or tree available for a given situation even when the parameters are few. We generated the tree with a smaller number of parameters, yet we can still expect a near-certain outcome when the criteria shown in the display are met; the right-most part of the tree carries more than 99% of the probability.
Similarly, the other part of the tree, the left part, predicts “No Rainfall” when its criteria are met. So both ways we can take a decision: rainfall or no rainfall. As noted, the response variable y can take the value 0 = no rainfall predicted and 1 = rainfall predicted; a third outcome had to be encoded as 0 = NA. If a null value or “NA” appears where a numeric field is required, the execution of the packages reports an error.
Conclusion
Neural networks and decision trees are quite important for predicting the outcome. In our case we began by modelling a decision tree using the Palisade software. We generated training and test sets and tried different combinations of parameters to generate the tree. We were able to generate the tree and the neural network in different scenarios and recorded our results.
The neural network facility of this software allows us to take the inputs from the CSV file and generate the network once we have identified the response variable and the different sets of x-factors (independent variables). We repeated the modelling with the same criteria as for the decision tree. With all the parameters, the prediction appears to be more than 95% accurate, which is quite good for prediction purposes. These methods are therefore a great way to predict the outcome, and we have good reason to say that we have arrived at adequate and appropriate models which help in predicting “RainTomorrow” from the given variables.