Technical Details

Overview

This chapter explains how the data mining algorithms are implemented in the Faculty Support System model to address the problems discussed in Chapter 3.4. To address the problem of final exam grade prediction, a Naïve Bayes classifier is implemented and used to predict the grades. To address the problem of classifying students into different groups based on the final exam grade, several algorithms, namely Naïve Bayes, J48 decision trees, Random Forests and multilayer neural networks, were implemented using the data mining tool Weka. Weka allows each of these algorithms to be configured with different parameters, and a study of how the best parameters were chosen is also presented in this chapter. Finally, the algorithms are evaluated using different evaluation methods and performance metrics.

The rest of the chapter is organized as follows. First, the Naïve Bayes implementation and the process of final exam grade prediction are explained, followed by the evaluation methods and metrics used to assess the performance of the classifiers. A brief overview of the Weka tool is then given. Finally, the parameter selection for the different algorithms is explained.

Naïve Bayes approach for future grade prediction

Naïve Bayes Classification

As seen in Chapter 2, Bayesian classification is based on Bayes' theorem. Let X be a data tuple; in Bayesian terms, X is considered the evidence. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H | X), the probability that the hypothesis H holds given the evidence or observed data tuple X. Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H | X), as given by Equation 1:

    P(H | X) = P(X | H) P(H) / P(X)                                   (Equation 1)

where
  • P(H | X) is the posterior probability of H conditioned on X,
  • P(H) is the prior probability of H,
  • P(X | H) is the posterior probability of X conditioned on H, and
  • P(X) is the prior probability of X.

During the training phase, we need to learn the posterior probabilities P(Y | X) for every combination of X (the evidence, or attribute value set) and Y (the class variable), based on information gathered from the training data. Knowing these probabilities, a test record Xi can be classified by finding the class Y that maximizes the posterior probability P(Y | Xi).

Let D be a training set of tuples and their associated class labels. Each tuple is represented by an n-dimensional attribute vector, X = {X1, X2, …, Xn}, depicting n measurements made on the tuple from the n attributes A1, A2, …, An, respectively. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

    P(Ci | X) > P(Cj | X)   for 1 ≤ j ≤ m, j ≠ i.

Thus we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is called the maximum a posteriori hypothesis. By Bayes' theorem,

    P(Ci | X) = P(X | Ci) P(Ci) / P(X).

Given data sets with many attributes, it would be extremely computationally expensive to compute P(X | Ci). In order to reduce the computation involved in evaluating P(X | Ci), the naïve assumption of class-conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple. Thus,

    P(X | Ci) = P(X1 | Ci) × P(X2 | Ci) × … × P(Xn | Ci).

With the class-conditional independence assumption, instead of computing the class-conditional probability for every combination of X, we only have to estimate the conditional probability of each Xi given C. This approach is more practical because it does not require a very large training set to obtain a good estimate of the probabilities. To classify a test record, the naïve Bayes classifier computes the posterior probability for each class C:

    P(C | X) = P(C) × P(X1 | C) × P(X2 | C) × … × P(Xn | C) / P(X).

Since P(X) is fixed for every class, it is sufficient to choose the class that maximizes the numerator term, P(C) × P(X1 | C) × … × P(Xn | C).
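Putting these pieces together, the resulting classification rule can be stated compactly as a maximum a posteriori decision. The display below merely restates, in LaTeX notation, the rule derived above; no new quantities are introduced.

    \[
      \hat{C} \;=\; \arg\max_{C \in \{C_1,\dots,C_m\}} \; P(C) \prod_{i=1}^{n} P(X_i \mid C)
    \]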

Prediction of final grade using Naïve Bayes:

Training phase:

The Naïve Bayes classifier is trained with all the training data. In this research, we used 241 instances of data for training. In the training phase we need to calculate the posterior probabilities P(Y | X) for every combination of X and Y based on information gathered from the training data, where X = attribute value set and Y = class label.

To calculate the posterior probabilities, the prior and conditional probabilities must be calculated first. The prior and conditional probabilities are obtained by constructing frequency tables for each attribute. A frequency table records, for each value of the attribute, the number of instances with that value for each class variable. A frequency table is built for every attribute, and the conditional probabilities are then derived from these frequency tables. The frequency table and the corresponding conditional probabilities for each attribute are given in the following tables.

From Table 1, the conditional probability of the attribute value ATT = Good given the class First, P(ATT = Good | class = First), is equal to the number of students with ATT = Good and class = First divided by the total number of students with class = First, i.e. 75/92. All other conditional probabilities were computed in the same way.

Attendance (ATT)
First / Second / Third / Fail
Good / 75 / 66 / 21 / 10
Average / 12 / 17 / 8 / 5
Poor / 5 / 4 / 7 / 11
Total / 92 / 87 / 36 / 26
Conditional Probability
Good / 75/92 / 66/87 / 21/36 / 10/26
Average / 12/92 / 17/87 / 8/36 / 5/26
Poor / 5/92 / 4/87 / 7/36 / 11/26

Table 1: Frequencies with Conditional probabilities for attribute Attendance

Quizzes (QZ)
First / Second / Third / Fail
Good / 13 / 2 / 1 / 0
Average / 59 / 55 / 5 / 2
Poor / 20 / 30 / 30 / 24
Total / 92 / 87 / 36 / 26
Conditional Probability
Good / 13/92 / 2/87 / 1/36 / 0/26
Average / 59/92 / 55/87 / 5/36 / 2/26
Poor / 20/92 / 30/87 / 30/36 / 24/26

Table 2: Frequencies with Conditional probabilities for attribute Quizzes

Assignments (ASS)
First / Second / Third / Fail
Good / 70 / 49 / 9 / 3
Average / 18 / 21 / 8 / 1
Poor / 4 / 17 / 19 / 22
Total / 92 / 87 / 36 / 26
Conditional Probability
Good / 70/92 / 49/87 / 9/36 / 3/26
Average / 18/92 / 21/87 / 8/36 / 1/26
Poor / 4/92 / 17/87 / 19/36 / 22/26

Table 3: Frequencies with Conditional probabilities for attribute Assignments

Class Projects (CP)
First / Second / Third / Fail
Good / 57 / 42 / 7 / 3
Average / 23 / 24 / 8 / 8
Poor / 12 / 21 / 21 / 15
Total / 92 / 87 / 36 / 26
Conditional Probability
Good / 57/92 / 42/87 / 7/36 / 3/26
Average / 23/92 / 24/87 / 8/36 / 8/26
Poor / 12/92 / 21/87 / 21/36 / 15/26

Table 4: Frequencies with Conditional probabilities for attribute Class Projects

Exams (EX)
First / Second / Third / Fail
Good / 38 / 6 / 3 / 1
Average / 52 / 53 / 7 / 3
Poor / 2 / 28 / 26 / 22
Total / 92 / 87 / 36 / 26
Conditional Probability
Good / 38/92 / 6/87 / 3/36 / 1/26
Average / 52/92 / 53/87 / 7/36 / 3/26
Poor / 2/92 / 28/87 / 26/36 / 22/26

Table 5: Frequencies with Conditional probabilities for attribute Exams
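To make the construction of Tables 1–5 concrete, the following is a minimal Java sketch of how a frequency table and the corresponding conditional probabilities can be accumulated for a single attribute. The class and method names (AttributeFrequencyTable, addObservation, conditionalProbability) are illustrative only and are not part of the program described later in this chapter.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: frequency table for one attribute (e.g. Attendance),
    // keyed by "attributeValue|classLabel", plus per-class totals.
    public class AttributeFrequencyTable {
        private final Map<String, Integer> jointCounts = new HashMap<>();
        private final Map<String, Integer> classTotals = new HashMap<>();

        // Record one training instance, e.g. addObservation("Good", "First")
        public void addObservation(String attributeValue, String classLabel) {
            jointCounts.merge(attributeValue + "|" + classLabel, 1, Integer::sum);
            classTotals.merge(classLabel, 1, Integer::sum);
        }

        // P(attribute = value | class = label), e.g. 75/92 for (Good, First) in Table 1
        public double conditionalProbability(String attributeValue, String classLabel) {
            int joint = jointCounts.getOrDefault(attributeValue + "|" + classLabel, 0);
            int total = classTotals.getOrDefault(classLabel, 0);
            return total == 0 ? 0.0 : (double) joint / total;
        }
    }

Feeding the 241 training instances through addObservation for the Attendance attribute and then calling conditionalProbability("Good", "First") would reproduce the value 75/92 shown in Table 1.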

Finding Prior probabilities of Response Class:

The prior probabilities for the response classes are found from the training set by a simple calculation, namely counting the number of instances belonging to each class. The prior probabilities are given in Table 6.

Prior Probability of Response Variables
First / Second / Third / Fail
No of Instances / 92 / 87 / 36 / 26
Probability / 92/241 / 87/241 / 36/241 / 26/241

Table 6: Prior Probability of Response Variables

Testing phase:

Once the training phase is done, the classifier has been learned from all the training data and can predict the class labels of the test data. Consider a test instance with no class label:

ATT / QZ / ASS / CP / EX / Grade
Good / Average / Average / Poor / Poor / ????

Table 7: Test Instance with no class label

Using the prior and conditional probabilities calculated above, we need to find the class label for this instance.

From Equation 1,

    P(Y | X) = P(Y) × P(X1 | Y) × P(X2 | Y) × … × P(Xn | Y) / P(X).

To find the class of the new instance, since P(X) is fixed for every Y, it is sufficient to choose the class that maximizes the numerator term, P(Y) × P(X1 | Y) × … × P(Xn | Y), where P(Xi | Y) is the conditional probability of each Xi given Y.

So, the grade of the new instance can be calculated with the equation below:

    P(Grade | X) ∝ P(Grade) × P(ATT | Grade) × P(QZ | Grade) × P(ASS | Grade) × P(CP | Grade) × P(EX | Grade)

Here, X is the evidence and Grade ∈ {First, Second, Third, Fail}.

So, we evaluate this expression for each grade.

    P(Grade = First | X) = P(Grade = First)
                           × P(ATT = Good | Grade = First)
                           × P(QZ = Average | Grade = First)
                           × P(ASS = Average | Grade = First)
                           × P(CP = Poor | Grade = First)
                           × P(EX = Poor | Grade = First)
                         = 0.198

Similarly, we calculate for Grade = Second, Third and Fail and obtain the results as

P(Grade = Second | X ) = 0.266

P(Grade = Third | X ) = 0.344

P(Grade = Fail | X ) = 0.202

Since the posterior probability for Third Grade is higher than those of the other grades, the new test instance is classified as Third Grade.

The test data consist of two data sets, with 111 and 130 instances respectively, which are described in detail in Chapter 5.2. Finding the class label of each instance by hand is a time-consuming and error-prone task, as it involves calculating the probabilities for every instance in the testing data set.

To address this problem, we automated the process by implementing the naïve Bayes classifier as a Java program that calculates all the probabilities and computes the class labels of the test instances in very little time, using the HashMap data structures provided by the Java API. The structure of the program is shown in Figure 4.1.

Figure 4.1: Data flow of the Naïve Bayes algorithm with input and output data

Explanation:

  1. There are two kinds of input to the program (see Figure 4.1):
     • a training data set with the explanatory variables and class labels, and
     • a testing data set with only the explanatory variables.
  2. The classifier is learned using the training data. In the process of learning:
     • the likelihood of each attribute value is computed and stored in a separate HashMap (since there are five attributes, five HashMaps are used), and
     • the prior probabilities of the class variable are computed and stored in another HashMap.
  3. After all the probabilities have been calculated, the testing set is read line by line; for each attribute, its likelihoods and the prior probability are retrieved from the stored HashMaps and the posterior probability is computed.
  4. Step 3 is repeated once for each class (four classes in this research).
  5. The computed values are compared against each other, and the class with the highest value is chosen as the class of that instance in the test data.
  6. Steps 3 to 5 are repeated for each instance of the testing data set.

The implementation of the Naïve Bayes algorithm is presented in Section 4.6.
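As a preview of that implementation, the following is a condensed, hedged sketch of the HashMap-based approach described in steps 1–6 above. The file names, the comma-separated layout (five attribute columns followed by the grade), and the helper names NaiveBayesSketch, train and classify are assumptions made for illustration; the actual program in Section 4.6 may differ in its details.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Condensed sketch of the HashMap-based Naive Bayes classifier described above.
    // Assumes comma-separated rows: ATT,QZ,ASS,CP,EX[,Grade]
    public class NaiveBayesSketch {
        private static final int NUM_ATTRIBUTES = 5;
        // One likelihood map per attribute, keyed by "value|class" -> P(value | class)
        private final List<Map<String, Double>> likelihoods = new ArrayList<>();
        private final Map<String, Double> priors = new HashMap<>();

        public void train(List<String[]> trainingRows) {
            List<Map<String, Integer>> jointCounts = new ArrayList<>();
            for (int i = 0; i < NUM_ATTRIBUTES; i++) {
                jointCounts.add(new HashMap<>());
                likelihoods.add(new HashMap<>());
            }
            Map<String, Integer> classCounts = new HashMap<>();
            for (String[] row : trainingRows) {
                String grade = row[NUM_ATTRIBUTES];
                classCounts.merge(grade, 1, Integer::sum);
                for (int i = 0; i < NUM_ATTRIBUTES; i++) {
                    jointCounts.get(i).merge(row[i] + "|" + grade, 1, Integer::sum);
                }
            }
            int total = trainingRows.size();
            for (Map.Entry<String, Integer> e : classCounts.entrySet()) {
                priors.put(e.getKey(), (double) e.getValue() / total);   // prior probabilities
            }
            for (int i = 0; i < NUM_ATTRIBUTES; i++) {
                for (Map.Entry<String, Integer> e : jointCounts.get(i).entrySet()) {
                    String grade = e.getKey().substring(e.getKey().indexOf('|') + 1);
                    likelihoods.get(i).put(e.getKey(),
                            (double) e.getValue() / classCounts.get(grade));  // conditional probabilities
                }
            }
        }

        // Steps 3-5: compute the (unnormalized) posterior for each class and return the arg max.
        public String classify(String[] testRow) {
            String best = null;
            double bestScore = -1.0;
            for (Map.Entry<String, Double> prior : priors.entrySet()) {
                double score = prior.getValue();
                for (int i = 0; i < NUM_ATTRIBUTES; i++) {
                    score *= likelihoods.get(i).getOrDefault(testRow[i] + "|" + prior.getKey(), 0.0);
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = prior.getKey();
                }
            }
            return best;
        }

        public static void main(String[] args) throws IOException {
            List<String[]> train = readCsv("training.csv");              // hypothetical file name
            NaiveBayesSketch nb = new NaiveBayesSketch();
            nb.train(train);
            for (String line : Files.readAllLines(Paths.get("testing.csv"))) {  // hypothetical file name
                String[] row = line.split(",");
                System.out.println(line + " -> " + nb.classify(row));
            }
        }

        private static List<String[]> readCsv(String path) throws IOException {
            List<String[]> rows = new ArrayList<>();
            for (String line : Files.readAllLines(Paths.get(path))) {
                rows.add(line.split(","));
            }
            return rows;
        }
    }

Note that unseen attribute/class combinations are given probability zero in this sketch; the actual implementation could instead apply Laplace smoothing.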

Evaluation Methods and Metrics:

This section explains the different evaluation methods and metrics used in the Faculty Support System model. The most important part of data mining is to understand the data present in the records, analyze it, determine what can be done and achieved with the data, and finally draw conclusions from the results of the data analysis. In a given context, data mining metrics are quantitative measures used for the comparison or evaluation of different data mining algorithms. Generally, data mining metrics are divided into three categories: accuracy, robustness and usefulness.

Accuracy is a measure that tells us how well a model associates an outcome with the given attributes in the supplied data sets. Accuracy can be measured in multiple ways, but all accuracy measures depend on the data that has been used. Robustness judges how a data mining model performs on different kinds of data sets; a model is robust only if it generates the same type of predictions for the same kind of patterns irrespective of the supplied test data. Usefulness is a combination of several metrics that indicates whether the model generates useful information or not.

Evaluation Methods

The performance of the classifiers is evaluated using different evaluation methods. Evaluation is important for understanding the quality of a model, for refining its parameters in the iterative process of learning, and for selecting the most appropriate model from a given set of models. There are several criteria for evaluating a model. As far as classification models are concerned, the performance of a classifier is measured in terms of its error rate: if the classifier predicts the class of an instance correctly, it is counted as a success; otherwise, as an error. In this research, to choose the best-performing algorithm in the post-data-mining phase, we used several evaluation methods and compared the algorithms based on different evaluation metrics.

Hold out Method:

The hold-out method involves a single split of the data. The data are split into two separate data sets, one used for training and the other for testing. The model is learned from the training data and is then asked to predict the output values for the testing data.
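As an illustration, a hold-out evaluation can be run programmatically with the Weka API along the following lines; the file name students.arff and the 70/30 split ratio are assumptions for this sketch, not values taken from this research.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class HoldOutExample {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("students.arff").getDataSet();  // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));                                  // shuffle before splitting

            int trainSize = (int) Math.round(data.numInstances() * 0.7);    // assumed 70/30 split
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

            NaiveBayes classifier = new NaiveBayes();
            classifier.buildClassifier(train);                              // learn on the training split

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(classifier, test);                           // predict on the held-out split
            System.out.println(eval.toSummaryString());
        }
    }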

Random Sub Sampling:

The random sub-sampling method is an extension of the hold-out method: the hold-out split is repeated several times with different random splits, and the results are averaged to improve the estimate of the classifier's performance.
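Building on the hold-out sketch above (and reusing its imports and the loaded data set), random sub-sampling simply repeats the split with different random seeds and averages the accuracy; the number of repetitions and the split ratio below are again assumptions.

    // Random sub-sampling: repeat the hold-out split with different seeds and average the accuracy.
    double sumAccuracy = 0;
    int runs = 10;                                           // assumed number of repetitions
    for (int seed = 0; seed < runs; seed++) {
        Instances copy = new Instances(data);                // fresh copy of the full data set
        copy.randomize(new Random(seed));
        int trainSize = (int) Math.round(copy.numInstances() * 0.7);
        Instances train = new Instances(copy, 0, trainSize);
        Instances test = new Instances(copy, trainSize, copy.numInstances() - trainSize);

        NaiveBayes classifier = new NaiveBayes();
        classifier.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(classifier, test);
        sumAccuracy += eval.pctCorrect();                    // percentage of correctly classified instances
    }
    System.out.println("Mean accuracy over " + runs + " runs: " + (sumAccuracy / runs) + " %");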

K-fold Cross Validation Method:

Cross validation is a popular technique for evaluating the generalization performance of a data mining model. The basic idea behind cross validation is to split the data, once or several times, in order to estimate the risk of the data mining algorithm. In each split, one part of the data, called the training sample, is used for training the algorithm, and the remaining part, called the validation sample, is used to estimate its risk. Cross validation finally selects the algorithm with the smallest estimated risk.

The k-fold cross validation method makes better use of the data than the hold-out method. In this method, the data set is divided into k subsets (folds) and the hold-out method is repeated k times. Each time, one of the k subsets is used as the testing set and the union of the other folds is used as the training set. The k results from the folds are then averaged and the error is computed. The advantage of this method is that all observations are used for both training and testing, and each observation is used for testing exactly once.
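Weka exposes k-fold cross validation directly through its Evaluation class; the sketch below uses k = 10 and again assumes the hypothetical students.arff file.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidationExample {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("students.arff").getDataSet();  // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            NaiveBayes classifier = new NaiveBayes();
            Evaluation eval = new Evaluation(data);
            // 10-fold cross validation: each instance is used for testing exactly once
            eval.crossValidateModel(classifier, data, 10, new Random(1));

            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());        // confusion matrix
            System.out.println(eval.toClassDetailsString());  // per-class TPR, precision, F-measure, AUC
        }
    }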

Evaluation Metrics:

Metrics summarize the performance of a model and give a simplified view of its behavior; using several performance metrics and checking whether they agree therefore helps in understanding the model's behavior by quantifying its performance. Data mining algorithms can be compared according to a number of measures. To compare the predictive ability of different data mining algorithms, quantities that characterize the goodness of fit of a model, as well as error measurements, must be considered. Although empirical studies have found it difficult to decide which metric to use for a given problem, each metric has specific features that measure different aspects of the algorithms being evaluated. It is often difficult to state which metric is the most suitable for evaluating algorithms on educational data, because of the large discrepancies that can arise between predicted and actual values. Combining different metrics may therefore give a more accurate picture. For instance, some metrics, such as the true positive rate (TPR), take higher values when the algorithm performs better, whereas error metrics take lower values.

The metrics can be divided into three families: those based on a probabilistic understanding of errors, those based on a qualitative understanding of errors, and visual metrics.

Probabilistic understanding of errors:

These metrics are based on a probabilistic understanding of the predictions pi and of the errors (pi − oi), where pi is the predicted outcome and oi is the actual outcome. This type of metric is natural mainly for predictions of performance, i.e., the correctness of answers. The most commonly used metrics of this kind are the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE). Typically, the lower the values of MAE and RMSE, the higher the performance of the classifier.

The mean absolute error considers the absolute differences between predictions and answers. It is not always a suitable performance metric, because it favors models that are biased towards the majority result. Despite this disadvantage, it is sometimes used for the evaluation of student models.

The root mean square error is obtained by using squared differences instead of absolute differences. In the particular context of student modelling and the evaluation of probabilities this is not particularly useful, since the resulting numbers are hard to interpret; however, in EDM the use of the RMSE metric is very common, particularly for the evaluation of skill models.
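For reference, the two metrics can be written out as follows, using the notation pi for the predicted outcome and oi for the actual outcome introduced above, with n the number of predictions:

    \[
      \mathrm{MAE}  = \frac{1}{n}\sum_{i=1}^{n} \lvert p_i - o_i \rvert,
      \qquad
      \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_i - o_i)^2}
    \]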

Qualitative understanding of errors:

These metrics are based on a qualitative understanding of errors, i.e., whether a prediction is correct or incorrect. In student modeling this approach is suitable mainly for predictions of the student's state. Under a qualitative understanding of errors, predictions have to be assigned to classes; this classification can be done by choosing a threshold and comparing each prediction to it. Once predictions are divided into classes, they can be classified as true/false positives/negatives using a confusion matrix. The confusion matrix juxtaposes the observed classifications of a phenomenon (columns) with the predicted classifications of a model (rows). The classifications that lie along the major diagonal of the table are the correct classifications, that is, the true positives and the true negatives. The other fields signify model errors. The most common qualitative performance metrics calculated from the matrix are accuracy, sensitivity, specificity, precision, and F-measure. These statistical measures are commonly used to describe the data set and to estimate how good and consistent the classifier is.

Accuracy: It compares how close a new test value is to a value predicted by if ... then rules

Sensitivity: It measures the ability of a test to be positive when the condition is actually present. It is often referred to as recall.

Specificity: It measures the ability of a test to be negative when the condition is actually not present.

Precision: It measures the positive predictive value

F-Measure: A measure that combines precision and recall; it is the harmonic mean of precision and recall, as given below.
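In terms of the confusion-matrix counts, where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives respectively, these measures are computed as:

    \[
      \mathrm{Accuracy}    = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
      \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
      \mathrm{Specificity} = \frac{TN}{TN + FP},
    \]
    \[
      \mathrm{Precision}   = \frac{TP}{TP + FP}, \qquad
      \mathrm{F\text{-}Measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
                                       {\mathrm{Precision} + \mathrm{Recall}}
    \]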

Visual metrics (ROC)

Receiver Operating Characteristic (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. This approach to the evaluation of predictions takes into account the ranking of the predictions, i.e., the values of pi are considered relative to each other. The ROC curve summarizes the qualitative error of the prediction model over all possible thresholds. The curve has the false positive rate, FPR = FP / (FP + TN), on the x-axis and the true positive rate, TPR = TP / (TP + FN), on the y-axis; each point of the curve corresponds to a choice of threshold. The area under the ROC curve (AUC) provides a summary performance measure across all possible thresholds. It is equal to the probability that a randomly selected positive observation has a higher predicted score than a randomly selected negative observation.
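When the evaluation is run through Weka, as in the cross-validation sketch earlier, the AUC for each class can be read directly from the Evaluation object; the class index 0 below is only an example.

    // Area under the ROC curve for one class (here class index 0), taken from an
    // existing weka.classifiers.Evaluation object such as the one built earlier.
    double auc = eval.areaUnderROC(0);
    System.out.println("AUC for class 0: " + auc);

    // Weighted average over all classes is also available:
    System.out.println("Weighted AUC: " + eval.weightedAreaUnderROC());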