The Performance of Fit Indices in Small Samples

Simulation Studies

Simulation studies were performed to assess the performance of the different fit indices across a range of sample sizes. In each run, 12 dyadic measurements were generated from a multivariate normal distribution. To obtain realistic family data, the default SRM was fitted to the four-person anxiety dataset of Kenny, Kashy and Cook (2006) with a single indicator, and the model-implied variance-covariance matrix was used for data generation. For each sample size, 200 repetitions were made. The default SRM (cf. section “The Standard Four-Person Model with one Indicator and one Group” in the main paper) was then fitted to each of the generated datasets, and the different fit indices were calculated in every run. Table 1 shows the percentage of runs in which the fit was deemed good according to these indices.

Table 1

Percentage of acceptable fits per sample size with a single indicator (200 simulations)

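Sample size / χ² (p > .05) / CFI (> .90) / CFI (> .95) / RMSEA (< .08) / RMSEA (< .04)
n = 50 / 86.5% / 100% / 95.5% / 83% / 49%
n = 75 / 89% / 100% / 100% / 96% / 60%
n = 100 / 92% / 100% / 100% / 98% / 73%
n = 125 / 92.5% / 100% / 100% / 100% / 78%
n = 150 / 94.5% / 100% / 100% / 100% / 86%
n = 1000 / 95.5% / 100% / 100% / 100% / 100%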

As the data were generated under the SRM, one would expect the fit to be acceptable in most runs. The chi-square test performs well with larger sample sizes but rejects the null hypothesis too often in smaller samples. The simulation studies further suggest that a cut-off of .08 for the RMSEA is advisable in this setting, as a cut-off of .04 seems too strict. Similar trends were observed when family data were generated based on data from other published studies. Finally, note that the sensitivity of each of the fit indices to misspecification was not explored here.
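The cut-offs in Table 1 can be made concrete with a minimal sketch of how the RMSEA and CFI are commonly computed from chi-square statistics. The formulas are the standard textbook definitions (some software uses N rather than N − 1 in the RMSEA); the function names and the numerical inputs are illustrative assumptions and are not taken from the simulations reported here.

```python
import math

def rmsea(chisq, df, n):
    """RMSEA point estimate from the model chi-square, its df and sample size n."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

def cfi(chisq_model, df_model, chisq_base, df_base):
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chisq_model - df_model, 0.0)
    d_base = max(chisq_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

def fit_flags(chisq, df, n, chisq_base, df_base):
    """Apply the four descriptive cut-offs used in Tables 1 and 2."""
    r = rmsea(chisq, df, n)
    c = cfi(chisq, df, chisq_base, df_base)
    return {
        "CFI > .90": c > 0.90,
        "CFI > .95": c > 0.95,
        "RMSEA < .08": r < 0.08,
        "RMSEA < .04": r < 0.04,
    }

# Hypothetical values for one simulated dataset (not from the study).
flags = fit_flags(chisq=60.0, df=50, n=101, chisq_base=1050.0, df_base=66)
print(flags)
```

Averaging such flags over the 200 simulated datasets per sample size would give percentages of the kind reported in Table 1.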

When one is interested in separating the relationship variance from the error variance, the use of multiple indicators is necessary. We therefore replicated the setting described above, but now generated two indicators for each dyadic measure (based on the same four-person data, but this time with two indicators). The simulation study, again based on 200 simulations for different sample sizes (see Table 2), suggests that the CFI may be the most suitable fit index. Of all commonly used fit indices, the chi-square test seems to be the least reliable with relatively small sample sizes.

Table 2

Percentage of acceptable fits per sample size with two indicators (200 simulations)

Sample size / χ² (p > .05) / CFI (> .90) / CFI (> .95) / RMSEA (< .08) / RMSEA (< .04)
n = 50 / 39% / 100% / 80% / 69% / 14%
n = 75 / 69% / 100% / 98% / 98% / 43%
n = 100 / 78% / 100% / 100% / 100% / 77%
n = 125 / 86% / 100% / 100% / 100% / 85%
n = 150 / 85% / 100% / 100% / 100% / 88%
n = 1000 / 93.5% / 100% / 100% / 100% / 100%

Simulation Studies for the Test of Equality of Mean Actor/Partner Effects

For single-group analyses, it can be interesting to formally test whether the mean of each SRM component differs significantly by role. For example, is the mean actor effect of the father equal in size to those of the mother and the children, or not? If these effects are equal, they will all be zero (with the effect coding that is used in the fSRM package). In previous studies, such a test was performed by comparing the chi-square of a constrained model (assuming all actor means to be zero, for example) with that of an unconstrained model (Kenny et al., 2006; Eichelsheim, Buist, Deković, Cook, Manders, & Branje, 2011; Rasbash, Jenkins, O’Connor, Tackett, & Reiss, 2011), but this approach has the disadvantage of being computationally slow, since it requires fitting a different model for each type of component. In the fSRM package, the test for equality of mean effects is based on a multivariate Wald test. It simply tests whether, for example, all actor effects are simultaneously equal to zero. Since in a four-person round-robin design the sum of the four actor (partner) effects is constrained to zero, the Wald statistic only needs to test whether three out of the four effects equal zero. This multivariate Wald method was evaluated and compared with the chi-square difference test by means of simulation studies. Simulated data are again based on the estimated SRM parameters from the default SRM fitted to the anxiety dataset of Kenny et al. (2006) with four-person families.
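As an illustration of the multivariate Wald test described above, the following minimal sketch computes W = m' V⁻¹ m for three retained mean effects and refers it to a chi-square distribution with 3 degrees of freedom. It is not the fSRM implementation: the function names, the input values and the use of a closed-form tail probability for 3 df are assumptions made for this example.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def chi2_sf_3df(w):
    """Upper-tail probability of a chi-square variable with 3 df (closed form)."""
    return math.erfc(math.sqrt(w / 2)) + math.sqrt(2 * w / math.pi) * math.exp(-w / 2)

def wald_test(means, vcov):
    """Wald statistic and p-value for H0: all three retained means are zero."""
    assert len(means) == 3, "one of the four effects is dropped (sum-to-zero)"
    w = sum(mi * xi for mi, xi in zip(means, solve(vcov, means)))
    return w, chi2_sf_3df(w)

# Hypothetical estimates and covariance matrix for three actor means.
means = [0.20, -0.10, 0.05]
vcov = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
w, p = wald_test(means, vcov)
print(w, p)
```

Because the four effects sum to zero, any three of them determine the fourth, so the hypothesis being tested is the same whichever effect is left out.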

First, the Type I error rate was examined. A Type I error occurs when a true null hypothesis is rejected. The means of all SRM components of interest were therefore fixed to zero, and the number of rejected null hypotheses was counted. To test the influence of which actor (partner) mean was left out of the test, all four scenarios were tested for the four-person families. The Type I error rate was calculated based on 500 simulations for sample sizes of 25, 50 and 100. The results can be found in Table 3. The choice of the excluded SRM component clearly had no effect on the Type I error rate; only the sample size did. The Wald approach seems liberal, especially at small sample sizes. At the largest sample size (n = 100), however, the observed Type I error rate equals the nominal level (α = 5%).

Table 3

Wald approach. Type I error rate (%) when excluding one actor effect (first panel) or one partner effect (second panel), when testing for the equality of mean effects.

Actor effects
n / Mother / Father / Child1 / Child2
25 / 9.2 / 9.2 / 9.2 / 9.2
50 / 7.2 / 7.2 / 7.2 / 7.2
100 / 5.0 / 5.0 / 5.0 / 5.0

Partner effects
n / Mother / Father / Child1 / Child2
25 / 6.6 / 6.6 / 6.6 / 6.6
50 / 6.4 / 6.4 / 6.4 / 6.4
100 / 5.0 / 5.0 / 5.0 / 5.0

The traditional approach (i.e., the chi-square difference test) was also examined by means of 500 simulations. The results can be found in Table 4. Comparing the two methods, it can be concluded that they perform very similarly in terms of Type I error.

Table 4

Chi-square difference test. Type I error rate (%) when testing for the equality of mean effects.

n / Actor effects / Partner effects
25 / 5.6 / 8.0
50 / 4.6 / 6.0
100 / 4.6 / 4.4

In order to test the power of both methods, the means of all SRM components were fixed to the estimated parameters from the default SRM fitted to the anxiety dataset of Kenny et al. (2006) and the percentage of correctly rejected null hypotheses was calculated. Table 5 shows that both approaches produce very similar results in terms of power.

Table 5

Power calculation for equality of means with the Wald formula (left panel) and the chi-square difference test (right panel) with different sample sizes

N / Actor effects / Partner effects
25 / 46.8 / 48.6
50 / 83.2 / 86.2
100 / 99.2 / 99.2

Since the chi-square difference test and the Wald test are comparable in terms of both power and Type I error rate, and the latter is computationally much faster (it requires fitting only one model), the Wald approach is included in fSRM.

Simulation Studies for Differences between Groups

In a multiple-group setting, it is often interesting to investigate whether SRM components differ between groups. Typically, these comparisons are also based on a chi-square difference test (Eichelsheim et al., 2011). However, this approach requires a large number of constrained models (one for each comparison) to be compared with the unconstrained model, which allows all means and variances to differ between groups. The Wald test, with the standard error of the difference calculated following the delta method, requires only one model to be fitted. A simulation study, based on 200 simulations with different sample sizes, was performed to compare these methods. In particular, the social relations model for two groups was fitted to the negativity dataset of Eichelsheim et al. (2011)[1]. The goal was to investigate both the Type I error rate and the power of the Wald and chi² difference tests in a two-group setting. To assess the Type I error, we assumed a common variance between groups for each SRM component. To inspect the power, we used the estimated variances in both groups. The Type I error and power for the test of difference between groups were investigated for every SRM component separately. Tables 6 and 7 present the average power and Type I error, respectively, over all four actor and all four partner means. Similar results were obtained for other SRM means and for the SRM variances.
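The source does not spell out the formula behind the Wald test for a between-group difference. A standard form, given here as a sketch under the assumption that the delta method is applied to the difference g(θ) = θ₁ − θ₂ of one SRM parameter estimated in each of the two groups, is:

```latex
% Delta method for g(\theta) = \theta_1 - \theta_2, with gradient (1, -1)
\widehat{\operatorname{Var}}(\hat\theta_1 - \hat\theta_2)
  = \widehat{\operatorname{Var}}(\hat\theta_1)
  + \widehat{\operatorname{Var}}(\hat\theta_2)
  - 2\,\widehat{\operatorname{Cov}}(\hat\theta_1, \hat\theta_2),
\qquad
W = \frac{(\hat\theta_1 - \hat\theta_2)^2}
         {\widehat{\operatorname{Var}}(\hat\theta_1 - \hat\theta_2)}
  \;\sim\; \chi^2_1 \quad \text{under } H_0\colon \theta_1 = \theta_2 .
```

When the two groups are independent and no parameters are constrained across groups, the covariance term is zero and the standard error reduces to the square root of the sum of the two squared standard errors.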

When comparing the problematic and nonproblematic families, these methods produced similar results in terms of power when testing whether the mean actor (partner) components differed significantly between the two groups (see Table 6). The Type I error rate was also comparable between the two approaches with group sizes of 50 observations. With larger sample sizes (i.e., n1 = n2 = 100), the results differed slightly between the methods. Note, however, that with only 200 simulations the standard error of the empirical Type I error rate equals 1.5 percentage points, so values between 2 and 8 lie within an acceptable range. It can therefore be concluded that tests based on both methods produce similar results in a multiple-group setting, and since the Wald test is computationally much faster, it is the one included in fSRM.
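The standard error quoted above follows from the binomial variance of a rejection rate. The short check below assumes a true rate equal to the nominal 5% level and an interval of two standard errors around it; the function name is illustrative.

```python
import math

def mc_standard_error(rate, n_sim):
    """Standard error, in percentage points, of an empirical rejection rate."""
    return 100 * math.sqrt(rate * (1 - rate) / n_sim)

se = mc_standard_error(0.05, 200)      # nominal level 5%, 200 simulations
band = (5 - 2 * se, 5 + 2 * se)        # nominal level plus/minus two standard errors
print(round(se, 2), [round(b, 1) for b in band])  # prints: 1.54 [1.9, 8.1]
```

With the 500 simulations used for Tables 3 and 4, the same calculation gives a standard error just under 1 percentage point.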

Table 6

Power of Wald- and chi²-difference test when comparing the actor and partner variances between two groups

n1=n2 / Wald test: Actor / Wald test: Partner / Chi² difference test: Actor / Chi² difference test: Partner
50 / 40 / 33.25 / 39.63 / 33.38
100 / 51.13 / 47.63 / 50.5 / 48.38

Table 7

Type I error rate of Wald- and chi²-difference test when comparing the actor and partner variances between two groups

n1=n2 / Wald test: Actor / Wald test: Partner / Chi² difference test: Actor / Chi² difference test: Partner
50 / 5.25 / 5.38 / 5 / 5.13
100 / 5.5 / 6.25 / 4.75 / 4.13

References

Eichelsheim, V. I., Buist, K. L., Deković, M., Cook, W. L., Manders, W., & Branje, S. J. T. (2011). Negativity in problematic and nonproblematic families: A multigroup social relations analysis with structured means. Journal of Family Psychology, 25(1), 152–156. doi:10.1037/a0022450

Kenny, D. A., Kashy, D. A., & Cook, W. L. (2006). Dyadic Data Analysis. New York: The Guilford Press.

Rasbash, J., Jenkins, J., O’Connor, T. G., Tackett, J., & Reiss, D. (2011). A Social Relations Model of Observed Family Negativity and Positivity Using a Genetically Informative Sample. Journal of Personality and Social Psychology, 100(3), 474–491. doi:10.1037/a0020931

[1] For a detailed description of this dataset, please consult the section “Motivating Example on Negativity in Families: Data” in the main paper.