Chapter 6: LOG LINEAR ANALYSIS
Purpose: Log linear analysis is an advanced technique that can be used instead of tests of independence. There are two main advantages to using the log linear technique: 1) you can test more than two variables at a time, each with as many levels as you like, and 2) you can also test for main effects. A main effect occurs when the frequencies for a variable differ among the levels of that variable. For example, let’s assume that one of your variables is blue-eye color with two levels, blue and not-blue. Let’s also assume that the percentages from your sample are 44% blue and 56% not-blue. Remember that a sample provides estimates, not the true population values. The log linear test of the main effect “blue-eye color” asks whether the true proportions of blue and not-blue could really be 50-50.
Background:
Main effects in a 2X2 table
Let’s illustrate the main effect in a simple problem. You have noticed that there are two variations of a shrub species, one with smooth leaves and one with hairy leaves. You think that the presence of smooth leaves might be related to the presence of serpentine in the soil. You sample 120 random locations in Santa Clara county where the shrubs are found and you record the leaf type (hairy or smooth) and whether Serpentine is present in the soil. You obtained the following data:
Table 6-1: Serpentine soil versus leaf type data for example of log-linear analysis
Serpentine Soil / Leaf type: Hairy / Leaf type: Smooth / Serp. Soil Total
Yes / 12 / 22 / 34
No / 36 / 50 / 86
Leaf Type Total / 48 / 72 / 120
Analyzed as a 2X2 Test of Independence
Ho: Leaf type is independent of the presence or absence of Serpentine soil.
Alpha (α) = 0.025 (Why?)
Results: Because the p-value (Prob=0.506) > α (0.025), you would accept Ho and conclude that leaf type is independent of the presence or absence of serpentine soil (Table 6-2).
Table 6-2: Results from Two-Way Crosstabs (Systat™ 10.0) for data in Table 6-1.
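The test of independence on Table 6-1 can be checked by hand. The sketch below is a minimal illustration, not the Systat procedure: it assumes a G (log-likelihood ratio) statistic with 1 degree of freedom and no continuity correction, and the function name `g_test_independence` is made up for this example.

```python
import math

# Observed counts from Table 6-1: rows are serpentine soil (Yes, No),
# columns are leaf type (Hairy, Smooth).
observed = [[12, 22],
            [36, 50]]

def g_test_independence(table):
    """Return (G, p) for a 2x2 table of non-zero counts (1 degree of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if the two variables are independent.
            expected = row_totals[i] * col_totals[j] / n
            g += 2.0 * obs * math.log(obs / expected)
    # Upper-tail chi-square probability for 1 degree of freedom.
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p

g, p = g_test_independence(observed)
print(f"G = {g:.3f}, p = {p:.3f}")
```

Under these assumptions the p-value comes out close to 0.506, which is far above α = 0.025, so Ho is accepted.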
Analyzed as a 2x2 Log linear
For this log linear example, there are three Ho’s instead of one:
- Ho #1: There is no Leaf type*Serpentine soil interaction. This is the same thing as Leaf type is independent of the presence or absence of Serpentine soil.
In addition, the following hypotheses can be tested if Ho #1 is accepted:
- Ho #2: The proportion of sites with serpentine soil is equal to the proportion of sites without serpentine soil. Remember we are dealing with a sample here.
- Ho #3: The proportion of sites with hairy leaves is equal to the proportion of sites with smooth leaves.
The Log linear test results:
- Accept Ho for Ho #1: There is no interaction (p=0.506) between leaf type and the presence or absence of Serpentine soil (Figure 6-1). This means the leaf type is independent of the presence or absence of Serpentine soil, which is the same conclusion as for the 2x2 Test of Independence.
- Because we accepted Ho for the interaction, we go on to test the other Hos.
- Reject Ho for Ho #2: The proportion of sites with serpentine soil is significantly different (p<0.001) from the proportion of sites without serpentine soil (Figure 6-2). From the table percentages (not shown here), it can be determined that the proportion of sites with serpentine soil (28.3%) is significantly (p<0.001) less than the proportion of sites without serpentine soil (71.7%).
- Accept Ho for Ho #3: The proportion of sites with hairy leaves is not significantly different (p=0.0279 with α=0.025) from the proportion of sites with smooth leaves (Figure 6-2). This means that you have no evidence that one leaf type is more abundant than the other (i.e., the difference in frequencies could have been a function of chance given your sample size).
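The two main-effect tests can be sketched in the same way, using the marginal totals from Table 6-1. This is an illustration under stated assumptions: each main effect is treated as a G goodness-of-fit test against a 50-50 split with 1 degree of freedom, and the function name `g_test_equal_proportions` is hypothetical.

```python
import math

def g_test_equal_proportions(count_a, count_b):
    """G goodness-of-fit test of two non-zero counts against a 50-50 split."""
    n = count_a + count_b
    expected = n / 2.0
    g = 2.0 * (count_a * math.log(count_a / expected)
               + count_b * math.log(count_b / expected))
    # Upper-tail chi-square probability for 1 degree of freedom.
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p

# Marginal totals from Table 6-1.
g_soil, p_soil = g_test_equal_proportions(34, 86)   # serpentine: Yes vs No
g_leaf, p_leaf = g_test_equal_proportions(48, 72)   # leaf type: Hairy vs Smooth

print(f"Serpentine main effect: G = {g_soil:.2f}, p = {p_soil:.2g}")
print(f"Leaf type main effect:  G = {g_leaf:.2f}, p = {p_leaf:.4f}")
```

With α = 0.025, the serpentine main effect is significant (p < 0.001) and the leaf type main effect is not (p is about 0.028), in line with the results for Ho #2 and Ho #3.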
How does Log Linear analysis work?
Log linear analysis involves testing a series of models. Log linear models are equations that contain terms made up of combinations of the variables used in the analysis plus a constant. The most complex model (i.e. the one that contains the most terms) is called the Fully Saturated Model. The terms are arranged in a hierarchy starting with all variables singly, then all combinations of 2 variables at a time, then all combinations of three variables at a time etc. until all combinations have been exhausted. In the preceding example there were two variables, Serpentine Soil (Yes or No) and Leaf Type (Hairy or Smooth). In this example, we will use “S” for the Serpentine Soil variable and “L” for the Leaf Type variable. The fully saturated model for the problem above is:
CONSTANT+S+L+(S*L).
The models consist of terms with each term representing a particular effect. In the model above there are three effects (terms) and a constant (included in all models for statistical reasons). The three effects are the Serpentine main effect (S), the Leaf type main effect (L) and the interaction between Serpentine and Leaf type (S*L). With the exception of CONSTANT, each effect or term in the model refers to a specific Ho.
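The hierarchy of terms can be generated mechanically. The sketch below is only an illustration; the function name `saturated_terms` is made up here, and it builds the terms in the order described above (single variables first, then every combination of two, and so on).

```python
from itertools import combinations

def saturated_terms(variables):
    """List the terms of the fully saturated model for the given variables."""
    terms = ["CONSTANT"]
    # Order 1 gives the main effects, order 2 the two-way interactions, etc.
    for order in range(1, len(variables) + 1):
        for combo in combinations(variables, order):
            terms.append("*".join(combo))
    return terms

two_variable_model = saturated_terms(["S", "L"])
print("+".join(two_variable_model))
# prints: CONSTANT+S+L+S*L
```

With k variables the list always holds 2^k entries, counting the constant, so a third variable doubles the number of terms from 4 to 8.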
The fully saturated model is used as a standard of comparison for all tests of that system because it explains all of the possible variation in the data. The idea behind Log linear analysis is to find the simplest model that does the same job as the fully saturated model. Simplified models are those with fewer effects or terms.
To find the most simplified model, you do a series of Goodness of Fit type comparisons with the fully saturated model.
- If you throw out a term and the two models match, the term or effect was not important.
- If you throw out a term and the two models do not match, that term is important and should be included in the final simplified model.
The order in which you throw out terms is important because, if an interaction is important, the main effects that make up the interaction cannot be interpreted on their own. This is because a significant interaction implies that the main effects affect each other, so you can’t make a simple statement about one main effect without dealing with the others. Therefore, you start the process by first throwing out the interactions.
In our simple example, we would first throw out the S*L interaction. This would create the simplified model CONSTANT+S+L. We would then see if the simplified model does the same job as the fully saturated model. If it does, the term we threw out wasn’t important and we would accept Ho for that term. If it doesn’t, we would reject Ho for that term and keep it in the model.
IMPORTANT: If you do end up rejecting Ho for an interaction, when you put the term back in, you also must get rid of all lower order terms that could be made from the terms in the interaction. In this example, if we reject Ho for S*L, that term would be put back in the model but S and L would be taken out. When lower order terms are taken out in this way, you are NOT testing them; they are simply irrelevant. Why?
When there are no more terms to test, the remaining model is called the FINAL MODEL and is the simplest model that will do the same job as the fully saturated model.
For our specific example, when we threw out S*L and tested the simplified model CONSTANT+S+L, the results of the G-test indicate that we accept Ho. Therefore the interaction was not important (i.e. not significant).
The next step would be to throw out one of the single (main) effects; it doesn’t matter which. Let’s start with the S term. If we throw out the S effect and compare the model CONSTANT+L to the fully saturated model, we will find that the two models do not match (we reject Ho). Therefore, the S effect is important and needs to be included in the final model.
Next, we would put the S effect back in and take out the L effect. We find that the model CONSTANT+S does NOT differ significantly from the fully saturated model. Therefore, the L effect is not important and we accept Ho for that term. If you had thrown out the L term first, you would have found a match with the fully saturated model and concluded that Leaf type was not important.
Since there are no more terms to throw out, we have the final model:
CONSTANT+S. We can then interpret the meaning of the final terms by looking at the totals for the levels. We would then reach the conclusions listed in the Analyzed as a 2x2 log linear section.
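The sequence of comparisons for Table 6-1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of any statistical package: each simplified model is compared with the fully saturated model by a G statistic, the degrees of freedom equal the number of terms left out, every variable has two levels, and all function names are hypothetical.

```python
import math

# Table 6-1 counts keyed by (serpentine soil, leaf type).
observed = {("yes", "hairy"): 12, ("yes", "smooth"): 22,
            ("no", "hairy"): 36, ("no", "smooth"): 50}

def margin(counts, axis):
    """Total count for each level of one variable."""
    totals = {}
    for cell, count in counts.items():
        totals[cell[axis]] = totals.get(cell[axis], 0) + count
    return totals

def expected_counts(counts, main_effects):
    """Expected counts for a model with the given main effects and no interaction."""
    n = sum(counts.values())
    soil, leaf = margin(counts, 0), margin(counts, 1)
    expected = {}
    for s, l in counts:
        # A main effect left out of the model means equal shares among its levels.
        s_share = soil[s] / n if "S" in main_effects else 1 / len(soil)
        l_share = leaf[l] / n if "L" in main_effects else 1 / len(leaf)
        expected[(s, l)] = n * s_share * l_share
    return expected

def chi2_upper_tail(g, df):
    """Upper-tail chi-square probability (closed forms for df = 1 and df = 2)."""
    if df == 1:
        return math.erfc(math.sqrt(g / 2))
    if df == 2:
        return math.exp(-g / 2)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")

def compare_to_saturated(counts, main_effects):
    expected = expected_counts(counts, main_effects)
    g = 2 * sum(o * math.log(o / expected[cell]) for cell, o in counts.items())
    # Saturated model has one parameter per cell; this model has 1 + len(main_effects).
    df = len(counts) - 1 - len(main_effects)
    return g, df, chi2_upper_tail(g, df)

alpha = 0.025
results = {}
for label, effects in [("CONSTANT+S+L", {"S", "L"}),
                       ("CONSTANT+L", {"L"}),
                       ("CONSTANT+S", {"S"})]:
    g, df, p = compare_to_saturated(observed, effects)
    results[label] = p
    verdict = "matches" if p > alpha else "does not match"
    print(f"{label}: G = {g:.2f}, df = {df}, p = {p:.4f} -> {verdict} the saturated model")
```

Under these assumptions CONSTANT+S+L and CONSTANT+S match the saturated model while CONSTANT+L does not, which leads to the final model CONSTANT+S. The p-values printed for the two-term-removed models need not equal those reported by a statistical package, which may test each term on its own single degree of freedom.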
Computing the test – Basic Steps
1)Determine what you are going to test.
2)Design the experiment.
- What are the variables?
- What are the levels for the variables?
- What analysis should you use?
- What planned comparisons do I want to make among the levels?
- What terms are there in the fully saturated model?
- What are all of the Hos and Has?
- What would it mean if you accept Ho?
- What would it mean if you reject Ho?
- How would you conduct the experiment?
- What statistical error should you avoid?
3)Collect data.
4)In sequence, test most complex (more terms) model to least complex (simplified, with fewer terms) model for goodness of fit to the fully saturated model. If a simplified model fits the fully saturated model, it is doing the same job as the fully saturated model and all terms not included in the simplified model are not statistically significant.
5)For any significant effects, plot the percentages.
6)Conduct planned comparisons if you reject Ho for step 4. See pages 5-12 to 5-14 RxC Test of Independence.
7)Conduct any unplanned comparisons if you reject Ho for step 4. See pages 5-14 to 5-15 RxC Test of Independence.
8)Draw conclusion.
EXAMPLE 1: 2x2 Log Linear analysis
We will use the same experiment and data as for the Banded and Unbanded snake patterns versus the presence or absence of brush (see EXAMPLE 1: 2x2 Test of Independence on Page 6-3).
1)Determine what you are going to test.
We want to determine if the snake pattern is related to the presence or absence of brush.
2)Design the experiment.
- What are the variables? Snake and Brush
- What are the levels for the variables? Snake: Banded or Unbanded
Brush: Present or Absent
- What analysis should you use? You are going to use a stepwise backward hierarchical 2x2 Log linear analysis.
- What planned comparisons do I want to make among the levels? See page 5-10 RxC Test of Independence. Because none of the variables has more than 2 levels, planned comparisons cannot be done.
- What terms are there in the fully saturated model?
Constant + Snake + Brush + Snake*Brush
- What are all of the Hos and Has?
- Ho #1: whether a snake is banded or unbanded is independent of the presence or absence of brush (Snake*Brush interaction). Ha #1 is that whether a snake is banded or unbanded depends on the presence or absence of brush.
- Ho #2: The proportion of sites with banded snakes is equal to the proportion of sites with unbanded snakes. Ha #2 is that the proportion of sites with banded snakes is NOT equal to the proportion of sites with unbanded snakes.
- Ho #3: The proportion of sites with brush is equal to the proportion of sites without brush. Ha #3 is that the proportion of sites with brush is NOT equal to the proportion of sites without brush.
- What would it mean if you accept Ho?
- Accept Ho #1 would mean that there is no relationship or interaction between the presence or absence of brush and whether snakes are banded or unbanded. Also it is ok to test Hypotheses 2 and 3.
- Accept Ho #2 would mean that the proportion of banded snakes is not different from the proportion of unbanded snakes.
- Accept Ho #3 would mean that the proportion of sites with brush is not different from the proportion of sites without brush.
- What would it mean if you reject Ho?
- Reject Ho #1 would mean that the presence or absence of brush does have some relationship to the presence of banded or unbanded snakes. Also, do NOT test hypotheses 2 and 3.
- Reject Ho #2 would mean that the proportion of banded snakes is different from the proportion of unbanded snakes.
- Reject Ho #3 would mean that the proportion of sites with brush is different from the proportion of sites without brush.
- How would you conduct the experiment? You will randomly sample sites until you find 180 sites with snakes. For each site, you will record whether or not brush was present and whether the snake was banded or unbanded.
- What statistical error should you avoid? Assume that you concluded that the worse error is Type I, so alpha will equal 0.025.
3)Collect data
Table 6-3: Frequency of banded/unbanded snakes and presence/absence of brush for 180 sites with snakes.
SNAKE / BRUSH Absent / BRUSH Present
Banded / 32 / 46
Unbanded / 43 / 59
4)In sequence, test most complex (more terms) model to least complex (simplified, with fewer terms) model for goodness of fit to the fully saturated model. If a simplified model fits the fully saturated model, it is doing the same job as the fully saturated model and all terms not included in the simplified model are not statistically significant.
- Use SPSS™ 10.0 to compute a stepwise backward elimination hierarchical log linear analysis (see page 7-11 for SPSS instructions).
- Accept Ho #1. There is no relationship or interaction (p=0.879) between the presence or absence of brush and whether snakes are banded or unbanded. Also it is ok to test Hypotheses 2 and 3.
- Accept Ho #2. The proportion of banded snakes is not different (p=0.0732) from the proportion of unbanded snakes.
- Accept Ho #3. The proportion of sites with brush is not different (p=0.025 with α=0.025) from the proportion of sites without brush.
5)For any significant effects, plot the percentages. No graphs needed.
6)Conduct planned comparisons if you reject Ho for step 4. None of the variables has more than 2 levels so there can be no planned comparisons.
7)Conduct any unplanned comparisons if you reject Ho for step 4. None of the variables has more than 2 levels so there can be no unplanned comparisons.
8)Draw conclusion
The banding pattern doesn’t appear to have anything to do with the presence of brush in the environment.
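The three tests from step 4 can be recomputed from Table 6-3 as a check on the package output. The sketch below is an illustration, not the SPSS procedure: it assumes G statistics with 1 degree of freedom each, with the interaction tested against the independence expectation and each main effect tested against a 50-50 split. The helper names are hypothetical.

```python
import math

# Table 6-3: rows are snake pattern (Banded, Unbanded),
# columns are brush (Absent, Present).
table = [[32, 46],
         [43, 59]]

def g_statistic(observed, expected):
    """G = 2 * sum(O * ln(O / E)) over matching lists of non-zero counts."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

def p_one_df(g):
    """Upper-tail chi-square probability for 1 degree of freedom."""
    return math.erfc(math.sqrt(g / 2.0))

snake_totals = [sum(row) for row in table]         # banded, unbanded
brush_totals = [sum(col) for col in zip(*table)]   # absent, present
n = sum(snake_totals)

# Ho #1: Snake*Brush interaction (independence expectation in each cell).
cells = [table[i][j] for i in range(2) for j in range(2)]
expected_cells = [snake_totals[i] * brush_totals[j] / n
                  for i in range(2) for j in range(2)]
p_interaction = p_one_df(g_statistic(cells, expected_cells))

# Ho #2 and Ho #3: each main effect against equal proportions.
p_snake = p_one_df(g_statistic(snake_totals, [n / 2, n / 2]))
p_brush = p_one_df(g_statistic(brush_totals, [n / 2, n / 2]))

print(f"Snake*Brush interaction: p = {p_interaction:.3f}")
print(f"Snake main effect:       p = {p_snake:.4f}")
print(f"Brush main effect:       p = {p_brush:.4f}")
```

The brush main effect lands almost exactly on α = 0.025, so the decision for Ho #3 depends on rounding and should be read with care.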
EXAMPLE 2: 2x2x2 Log Linear analysis
We will now learn how to do a Log linear analysis with 3 variables. You are exploring the relationship between a color morph of lizard (light and dark), the type of ground (sand or dirt) and the presence or absence of shade.
1)Determine what you are going to test.
We want to determine if the color morph of the lizard is related to the type of ground and/or the presence or absence of shade.
2)Design the experiment.
- What are the variables? Morph, Ground and Shade
- What are the levels for the variables? Morph: Light or Dark
Ground: Sand or Dirt
Shade: Present or Absent
- What analysis should you use? You are going to use a 2x2x2 stepwise backward hierarchical log linear analysis.
- What terms are there in the fully saturated model?
Constant + Morph + Ground + Shade + Morph*Ground + Morph*Shade + Ground*Shade + Morph*Ground*Shade
- What are all of the Hos and Has (the Has are not included here)?
- Ho #1: there is no interaction between lizard morph, ground and shade.
- Ho #2: there is no interaction between lizard morph and ground.
- Ho #3: there is no interaction between lizard morph and shade.
- Ho #4: there is no interaction between ground and shade.
- Ho #5: the proportion of dark lizard morphs is equal to the proportion of light lizard morphs.
- Ho #6: the proportion of dirt sites is equal to the proportion of sand sites.
- Ho #7: the proportion of shaded sites is equal to the proportion of unshaded sites.
- What would it mean if you accept Ho?
- Accept Ho #1 would mean that there is no relationship or interaction between the three variables. Also, you can test the two-way interactions.
- Accept Ho #2 would mean that there is no relationship between lizard morph and ground. Also, you can test the lizard morph and ground main effects.
- Accept Ho #3 would mean there is no relationship between lizard morph and shade. Also, you can test the lizard morph and shade main effects.
- Accept Ho #4 would mean there is no relationship between ground and shade. Also, you can test the ground and shade main effects.
- Accept Ho #5 would mean that the proportions of dark and light lizard morphs are equal.
- Accept Ho #6 would mean that the proportions of dirt and sand sites are equal.
- Accept Ho #7 would mean that the proportions of shaded and unshaded sites are equal.
- What would it mean if you reject Ho? (not included here).
- How would you conduct the experiment? You will randomly sample sites until you find 885 sites with lizards. For each site, you will record whether the lizard was dark or light, whether the ground was dirt or sand and whether or not shade was present.
- What statistical error should you avoid? Assume that you concluded that the worse error is Type II so alpha will equal 0.050.
3)Collect data
Table 6-4: Frequency of lizard morphs, ground types and shade for 885 sites with lizards. The numeric code for each level is given in parentheses.
LIZARD COLOR / GROUND / SHADE Absent (0) / SHADE Present (1)
Light (0) / Sand (0) / 231 / 136
Light (0) / Dirt (1) / 81 / 37
Dark (1) / Sand (0) / 57 / 47
Dark (1) / Dirt (1) / 177 / 119
4)In sequence, test most complex (more terms) model to least complex (simplified, with fewer terms) model for goodness of fit to the fully saturated model. If a simplified model fits the fully saturated model, it is doing the same job as the fully saturated model and all terms not included in the simplified model are not statistically significant.
- Use SPSS 10.0 to compute a stepwise backward elimination hierarchical log linear analysis (see page 7-11 for instructions).
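With three variables, the model without the three-way term has no simple formula for its expected counts, so they have to be found iteratively. The sketch below illustrates the first step of the backward elimination (testing Morph*Ground*Shade) using iterative proportional fitting. It is an assumption-laden illustration, not the SPSS algorithm: the function names are hypothetical, the test uses a G statistic with 1 degree of freedom, and the printed values are not results reported in this chapter.

```python
import math

# Table 6-4 counts keyed by (morph, ground, shade) using the numeric codes.
observed = {
    (0, 0, 0): 231, (0, 0, 1): 136,
    (0, 1, 0): 81,  (0, 1, 1): 37,
    (1, 0, 0): 57,  (1, 0, 1): 47,
    (1, 1, 0): 177, (1, 1, 1): 119,
}

def margin(counts, axes):
    """Sum the counts over every variable not listed in axes."""
    totals = {}
    for cell, value in counts.items():
        key = tuple(cell[a] for a in axes)
        totals[key] = totals.get(key, 0.0) + value
    return totals

def fit_ipf(counts, margins_to_fit, sweeps=200, tol=1e-10):
    """Rescale a table of ones until it reproduces each listed observed margin."""
    fitted = {cell: 1.0 for cell in counts}
    for _ in range(sweeps):
        largest_change = 0.0
        for axes in margins_to_fit:
            target = margin(counts, axes)
            current = margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                new_value = fitted[cell] * target[key] / current[key]
                largest_change = max(largest_change, abs(new_value - fitted[cell]))
                fitted[cell] = new_value
        if largest_change < tol:
            break
    return fitted

# Model without the three-way term: it keeps Morph*Ground, Morph*Shade and
# Ground*Shade, which also fixes the main effects and the constant.
two_way_model = [(0, 1), (0, 2), (1, 2)]
fitted = fit_ipf(observed, two_way_model)

g = 2.0 * sum(o * math.log(o / fitted[cell]) for cell, o in observed.items())
# Only one term was removed from a 2x2x2 table, so df = 1.
p = math.erfc(math.sqrt(g / 2.0))
print(f"Morph*Ground*Shade: G = {g:.3f}, df = 1, p = {p:.3f}")
```

If this p-value is above alpha, Ho #1 is accepted and the two-way interactions are tested next by dropping one pair at a time from `two_way_model` and refitting.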