Ronald H. Heck and Lynn N. Tabata Techniques for Examining Categorical Outcomes 2
EDEP 606: Multivariate Methods (2013) March 31, 2013

Techniques for Examining Categorical Outcomes: Discriminant Analysis

For the last part of the course, we will focus on a couple of analytic techniques that can be used to explain group membership (i.e., where outcomes are categorical). Discriminant Analysis (the first of these techniques) is useful when the outcome represents two or more groups. For example, we may wish to use a set of variables to determine whether someone is likely to be admitted to a graduate program or not. We might use variables such as GPA, GRE scores, written essay, and other types of information to develop a descriptive model. Descriptive discriminant analysis involves describing differences between members in different groups based on a set of observed variables. In contrast, predictive discriminant analysis involves our ability to assign individuals to two or more groups based on a set of information about them (e.g., applicants applying for a loan). Descriptive discriminant analysis is often used as a follow-up to MANOVA—that is, it makes use of the dependent variables in MANOVA to classify individuals into groups. A linear composite of the dependent variables may be used to explain differences between the groups. A linear composite of the variables can be defined as follows:

Y = a1X1 + a2X2 +… + apXp

where there are p variables used to separate the groups (Marcoulides & Hershberger, 1997). The coefficients are chosen to maximize the group differences.
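As a quick numeric sketch, the composite can be computed for each case as a weighted sum of that case's scores. The weights and scores below are hypothetical illustrations, not values from the handout's example:

```python
# Illustrative sketch: computing the linear composite
# Y = a1*X1 + a2*X2 + ... + ap*Xp for each case.
import numpy as np

# One row per case, one column per variable (p = 3); values are hypothetical.
X = np.array([[52.0, 61.0, 55.0],
              [48.0, 50.0, 47.0]])

# Hypothetical discriminant weights a1, a2, a3.
a = np.array([0.1, 0.6, 0.4])

# Matrix-vector product gives the composite score Y for each case.
Y = X @ a
print(Y)  # [63.8 53.6]
```

In an actual discriminant analysis, the weights are not chosen by hand; they are estimated so that the resulting composite scores maximally separate the groups.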

We can extract one fewer weighted combination of variables, each referred to as a discriminant function, than the number of groups (k – 1). The first function will account for the most variance, the second will account for the second most variance, and so on. We can consider these functions as similar to underlying latent dimensions (i.e., which are measured by a set of observed variables). We can “name” the underlying dimensions by considering which observed variables are most strongly associated with the function (i.e., much like naming a latent factor by the items that load most strongly on the factor).

To determine which variables are most strongly associated with the underlying dimension, we can use the standardized discriminant function coefficients (which are based on z-scores and are therefore similar to standardized betas in a multiple regression model) or the structure coefficients (which provide the correlation between each variable and the underlying dimension). The two types of coefficients can give different impressions of which variables are most useful in defining the underlying function, so it is important to consider how each type of coefficient may affect the interpretation. As Huberty (1989) notes, structure coefficients generally have greater stability in medium to small samples, especially when there are high correlations among the predictors. Correlations provide a direct indication of which variables are most related to the underlying variable associated with each discriminant function. The standardized coefficients provide an indication of which variable might “dominate” in explaining group membership (i.e., they remove the effects of other predictors). We can see, therefore, that depending on one’s purpose, one or the other type of coefficient might be favored (Marcoulides & Hershberger, 1997). In either case, variables that have coefficients near zero have little impact on the discriminant function.
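The distinction between the two types of coefficients can be illustrated outside SPSS. The sketch below uses simulated data (not the handout's data set) and a plain-NumPy version of Fisher's two-group discriminant rule to produce discriminant scores, then obtains structure coefficients as the correlations between each predictor and those scores:

```python
# Hedged sketch with simulated data: obtaining structure coefficients as
# correlations between each predictor and the discriminant scores.
import numpy as np

rng = np.random.default_rng(0)
n = 50
X0 = rng.normal(size=(n, 3))                  # group 0 scores
X1 = rng.normal(size=(n, 3)) + [0.2, 1.0, 0.6]  # group 1, shifted means

# Fisher's two-group discriminant weights: w = S_pooled^{-1} (mean1 - mean0).
S_pooled = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2
w = np.linalg.solve(S_pooled, X1.mean(axis=0) - X0.mean(axis=0))

X = np.vstack([X0, X1])
scores = X @ w                                 # discriminant scores per case

# Structure coefficients: correlation of each predictor with the scores.
structure = [np.corrcoef(X[:, j], scores)[0, 1] for j in range(3)]
print(np.round(structure, 3))
```

Here the second predictor (given the largest simulated group separation) should show the strongest correlation with the scores, just as writing does in the example analyzed below.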

As always, the technique is best illustrated with a simple example. Let’s make use of the example in the book on page 132. We have two groups of prospective teaching candidates (not employed = 0 and employed = 1) who are measured on three tests. We may have conducted a one-way MANOVA to see whether the two groups of teachers differed on the reading, math, and writing tests required for certification. As a follow-up to the MANOVA, we can run a discriminant analysis to see which dependent variable is most responsible for separating the groups of employed and unemployed teachers. (Instructions for performing the MANOVA and discriminant analysis are provided at the end of this handout. Download the accompanying DiscriminantData.sav data set from the class web page.)

MANOVA Results

First, let’s examine the MANOVA. Box’s M is not statistically significant (p = .103), suggesting the covariance matrices are equal across groups.

Table 1. Box's Test of Equality of Covariance Matrices
Box's M / 12.952
F / 1.763
df1 / 6
df2 / 2347.472
Sig. / .103

The MANOVA results suggest employment status is significantly related to the set of outcomes (Wilks’ Lambda = 0.54, p = .017).

Table 2. Multivariate Testsa
Effect / Value / F / Hypothesis df / Error df / Sig.
Intercept / Pillai's Trace / .967 / 158.522b / 3.000 / 16.000 / .000
Wilks' Lambda / .033 / 158.522b / 3.000 / 16.000 / .000
Hotelling's Trace / 29.723 / 158.522b / 3.000 / 16.000 / .000
Roy's Largest Root / 29.723 / 158.522b / 3.000 / 16.000 / .000
employed / Pillai's Trace / .460 / 4.549b / 3.000 / 16.000 / .017
Wilks' Lambda / .540 / 4.549b / 3.000 / 16.000 / .017
Hotelling's Trace / .853 / 4.549b / 3.000 / 16.000 / .017
Roy's Largest Root / .853 / 4.549b / 3.000 / 16.000 / .017
a. Design: Intercept + employed
b. Exact statistic

Discriminant Analysis

We can use discriminant analysis to investigate which of the dependent variables are most responsible for the significant group effect. Using discriminant analysis, we can develop a weighted combination of the dependent variables that best separates individuals into not employed (0) and employed (1) groups. We find the technique in ANALYZE > Classify > Discriminant. For the grouping variable, we must specify the codes of the lowest and highest groups in the analysis. We will assume equal prior probabilities with these data, since there are an equal number of employed and unemployed individuals in the sample (Note: we can alter the prior probabilities if we know, for example, that there are twice as many individuals in one category as the other in the population). It should be noted also that discriminant analysis (like MANOVA) depends on the assumption of normality of the data.

Table 3. Eigenvalues
Function / Eigenvalue / % of Variance / Cumulative % / Canonical Correlation
1 / .853a / 100.0 / 100.0 / .678
a. First 1 canonical discriminant functions were used in the analysis.

First, we note that we can only calculate one discriminant function in this example since we have only two groups. Since there is only one function, it accounts for all of the observed variance in group membership. We can see in the table above that the canonical correlation (which provides the strength of relationship between the set of variables and group membership) is moderate (0.678).
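For a single discriminant function, the eigenvalue, Wilks’ Lambda, and the canonical correlation are linked by simple formulas (Λ = 1/(1 + λ) and Rc = √(λ/(1 + λ))), which we can verify against the values in Table 3:

```python
# Numeric check of the relationships among the eigenvalue, Wilks' Lambda,
# and the canonical correlation when there is a single function.
import math

eigenvalue = 0.853  # from Table 3

# Wilks' Lambda = 1 / (1 + eigenvalue) for one function.
wilks_lambda = 1.0 / (1.0 + eigenvalue)

# Canonical correlation = sqrt(eigenvalue / (1 + eigenvalue)).
canonical_r = math.sqrt(eigenvalue / (1.0 + eigenvalue))

print(f"{wilks_lambda:.3f}")  # 0.540 (matches Table 4)
print(f"{canonical_r:.3f}")   # 0.678 (matches Table 3)
```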

Second, we can see that the function is significant in separating the groups (Wilks’ Lambda = .540, p = .017).

Table 4. Wilks' Lambda
Test of Function(s) / Wilks' Lambda / Chi-square / df / Sig.
1 / .540 / 10.176 / 3 / .017
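The chi-square in Table 4 follows from Bartlett’s approximation, χ² ≈ −[N − 1 − (p + k)/2] ln Λ with p(k − 1) degrees of freedom, which we can check numerically:

```python
# Numeric check of Bartlett's chi-square approximation for Wilks' Lambda.
import math

N, p, k = 20, 3, 2          # cases, predictors, groups (from the example)
wilks_lambda = 1.0 / 1.853  # exact value implied by the eigenvalue 0.853

# chi2 = -(N - 1 - (p + k)/2) * ln(Lambda), df = p * (k - 1).
chi_square = -(N - 1 - (p + k) / 2) * math.log(wilks_lambda)
df = p * (k - 1)

print(f"{chi_square:.3f}, df = {df}")  # 10.177, df = 3 (Table 4: 10.176)
```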

We can next turn our attention to which of the dependent variables is most likely responsible for the observed differences between the groups. In the following table, we can examine the standardized discriminant function coefficients. We can see that writing dominates in separating the two groups of teachers in this sample (standardized coefficient = 0.624). Reading also contributes to defining the groups (0.392); however, math does not contribute much to group separation (0.079) after the effects of the other variables are removed.

Table 5. Standardized Coefficients
Function
1
Math / .079
Writing / .624
Reading / .392

We also can examine the structure coefficients. They also suggest that writing is most strongly correlated with the underlying function separating the groups (0.956). Reading has the next strongest correlation at 0.888, and math is also strongly associated with the underlying dimension (0.708). Here the standardized coefficients and the structure coefficients provide similar evidence of which tests were most responsible for separating the two groups of teachers.

Table 6. Structure Matrix
Function
1
writing / .956
reading / .888
Math / .708

Determining the Fit of the Model

A measure of the usefulness of the model is how well it classifies individuals into their correct groups. This information is summarized in Table 7. The first part of the table provides the percentage of individuals who were correctly classified. We can see that 16/20 individuals (or 80%) were correctly classified. We can compare this result against chance (which would be 50%, since there were equal numbers in each group). The “leave-one-out” method (or cross-validated sample) provides results where each case is classified based on information from all other cases except itself. We can see the classification rate is a bit lower at 75% correctly classified.

Table 7. Classification Resultsa,c
employed / Predicted Group Membership / Total
0 / 1
Original / Count / 0 / 8 / 2 / 10
1 / 2 / 8 / 10
% / 0 / 80.0 / 20.0 / 100.0
1 / 20.0 / 80.0 / 100.0
Cross-validatedb / Count / 0 / 8 / 2 / 10
1 / 3 / 7 / 10
% / 0 / 80.0 / 20.0 / 100.0
1 / 30.0 / 70.0 / 100.0
a. 80.0% of original grouped cases correctly classified.
b. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
c. 75.0% of cross-validated grouped cases correctly classified.
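A similar original-versus-cross-validated comparison can be sketched with scikit-learn. The data below are simulated (not the handout’s data file), and LinearDiscriminantAnalysis with LeaveOneOut cross-validation stands in for SPSS’s classification output:

```python
# Hedged sketch with simulated data: resubstitution vs. leave-one-out
# classification rates for a two-group discriminant analysis.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(1)
# 10 cases per group on 3 predictors; group 1 shifted upward by 1 SD.
X = np.vstack([rng.normal(0, 1, (10, 3)), rng.normal(1, 1, (10, 3))])
y = np.repeat([0, 1], 10)  # 0 = not employed, 1 = employed

lda = LinearDiscriminantAnalysis()

# "Original" rate: classify the same cases used to build the rule.
original_rate = lda.fit(X, y).score(X, y)

# Leave-one-out rate: each case classified by a rule built from all others.
loo_pred = cross_val_predict(lda, X, y, cv=LeaveOneOut())
loo_rate = (loo_pred == y).mean()

print(original_rate, loo_rate)
```

As in Table 7, the leave-one-out rate is usually somewhat lower than the original rate, because resubstitution classification is optimistically biased.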

We can conclude from this analysis that writing seems to have the strongest relationship to separating the groups. We can also note that our model is quite useful in classifying individuals into their employment categories.


References

Huberty, C. (1989). Problems with step-wise methods: Better alternatives. In B. Thompson (Ed.), Advances in social science methodology (Vol. 1, pp. 43-70). Greenwich, CT: JAI Press.

Marcoulides, G. A., & Hershberger, S. L. (1997). Multivariate statistical methods: A first course. Mahwah, NJ: Lawrence Erlbaum Associates.


Defining MANOVA Model (Tables 1, 2) with IBM SPSS Menu Commands

IBM SPSS syntax: / GLM math writing reading BY employed
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/EMMEANS=TABLES(OVERALL)
/EMMEANS=TABLES(employed)
/PRINT=HOMOGENEITY
/CRITERIA=ALPHA(.05)
/DESIGN= employed.
(Launch the IBM SPSS application program and select the DiscriminantData.sav data file.)
1. Go to the toolbar, select ANALYZE, GENERAL LINEAR MODEL, MULTIVARIATE.
This command opens the General Linear Model: Multivariate main dialog box.
2. In the Multivariate main dialog box we will select three variables (math, writing, and reading) as dependent variables and employed as the independent variable.
a. Click to select math, writing, and reading, then click the right arrow button to move them into the Dependent Variables box.
Note: An alternative method to move variables is to “drag” them to the Dependent Variables box.
b. Now select employed and click the right arrow button (or drag the variable) into the Fixed Factor(s) box.
c. Click the OPTIONS button to access the Multivariate: Options dialog box for designating various statistical output.
3. The Multivariate: Options dialog box provides various statistical options.
a. For this example we want the means displayed in the output. Click to select (OVERALL) and employed, then click the right arrow button (or drag the factors) to the Display Means for box.
b. We want to include the Box’s M test of homogeneity in the output, so click to select Homogeneity tests.
Click the CONTINUE button to return to the Multivariate main dialog box.
4. From the Multivariate main dialog box, click the OK button to generate the output results.


Defining Discriminant Analysis Model (Tables 3 to 7) with IBM SPSS Menu Commands

IBM SPSS syntax: / DISCRIMINANT
/GROUPS=employed(0 1)
/VARIABLES=math writing reading
/ANALYSIS ALL
/PRIORS EQUAL
/STATISTICS=MEAN STDDEV TABLE CROSSVALID
/CLASSIFY=NONMISSING POOLED.
(Continue using the DiscriminantData.sav data file.)
1. Go to the toolbar, select ANALYZE, CLASSIFY, DISCRIMINANT.
This command opens the Discriminant Analysis main dialog box.
2. In the Discriminant Analysis main dialog box we will specify the grouping variable and the independent variables in the model.
a. We will specify employed as the grouping variable. Click to select employed then click the right arrow button (or drag the variable) into the Grouping Variable box. Note that the variable appears as employed(??) – the question marks represent the minimum and maximum values of employed.
b. Specifying the minimum and maximum values of the grouping variable (employed) is performed by first clicking the DEFINE RANGE button to access the dialog box.

c. In the Discriminant Analysis Define Range dialog box, enter the minimum and maximum values of employed (0,1). Then click the CONTINUE button to close the dialog box.

d. Now designate the predictors for the model. Click to select math, writing, and reading then click the right arrow button (or drag the variables) into the Independents box.
Note: The default is Enter independents together which we’ll retain for this analysis. A stepwise analysis would require using the stepwise method instead.

e. Click the STATISTICS button to access the dialog box for specifying assorted options for the output.

3a. From the Discriminant Analysis main dialog box, click the STATISTICS button to access the dialog box.
b. For this example we want the means displayed in the output so click to select: Means.
Click the CONTINUE button to return to the Discriminant Analysis main dialog box.