Multiple Regression: Example Questions and Outline Model Answers
Question 1
A researcher investigating depression amongst school children rated the level of depression by counting the number of negative self-statements in essays the children wrote (variable name = numnegs). She then asked two parents/guardians/relatives and the teacher of each child to rate the child’s ability on a 10-point scale (1=high, 10=low). These variables were recorded as parent1, parent2, and teacher. She also recorded which of the three local schools each child attended, using a dummy coding scheme (i.e., the variable school1 has values of 0=doesn’t attend school 1, 1=does attend school 1, and school2 uses the same coding with respect to school 2). Her conclusion was that “the multiple regression model was highly significant (F=8.914, p<0.0001) showing that parents, teachers and school all have an impact. However, the only group that has a significant impact are the teachers (t=3.598, p<0.0001) and we can therefore conclude that it is their negative expectations of the child that cause the depression.”
(i) What data-screening steps should the researcher have taken before carrying out the analysis reported above? (35% of marks)
(ii) Discuss the appropriateness of the researcher’s conclusions in detail, focusing in particular on her use of the statistics shown in the printout. (65% of marks)
Printout for Question 1
Regression
Question 1: MODEL ANSWER
(Underlined points should be mentioned, others are desirable or impressive.)
(i) Data Screening Steps
- Note that the sample size is adequate: neither too large nor too small (according to e.g. Green, 1991)
- Data is broadly adequate for MR – all ordinal/linear (scale) data or dummy variables.
- Will want to screen the data, checking for normality of distribution, univariate and bivariate outliers (frequency plots and scatterplots), multivariate outliers (Mahalanobis distance) and illegal values.
- Collinearity between pairs of IVs can be checked by their bivariate correlations (should be below 0.9) and multicollinearity within the set of IVs used (explain what this is) should be assessed either by calculating tolerances (explain what these are and what are likely to be unsafe values) or by using collinearity diagnostics.
- Multivariate normality can be assessed by scatterplots on selected pairs of variables (checking for linearity, normality and homoscedasticity); variables with very different skews may be useful to plot in this connection. Alternatively, violations of multivariate normality can be revealed by examining plots of residuals against predicted DV values.
- Answer should show a brief understanding of dummy coding of the school variable (marks will be lost if answer states that the coding doesn’t account for school three)
- Good answer may briefly note that the data suggests that the vast majority of children were from school 3 (using the means) – and comment on the implications of this for the analysis
- Good answer may briefly question whether a dummy coding scheme is the best way to have coded the school variable as it operates to compare the 1-coded schools with a comparison school (the one zero-coded in all variables).
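The tolerance, Mahalanobis distance and dummy-coding checks above can be sketched in a few lines. This is only an illustration on simulated data: the variable names follow the question, but the sample size, the school proportions and all values are assumptions made up for the sketch and are not taken from the printout.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # hypothetical sample size, not the one in the printout

# Simulated ratings (assumption): parent1 and parent2 share a common
# component, so they are built to be collinear; teacher is independent.
shared = rng.normal(size=n)
parent1 = shared + 0.3 * rng.normal(size=n)
parent2 = shared + 0.3 * rng.normal(size=n)
teacher = rng.normal(size=n)
X = np.column_stack([parent1, parent2, teacher])
names = ["parent1", "parent2", "teacher"]


def tolerance(X, j):
    """Tolerance of column j: 1 - R^2 from regressing it on the other IVs."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1 - r2


def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each case from the centroid."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)


tolerances = {name: tolerance(X, j) for j, name in enumerate(names)}
d2 = mahalanobis_sq(X)

# Dummy coding for three schools (hypothetical proportions): school 3 is
# the comparison school, coded 0 on both dummy variables.
school = rng.choice([1, 2, 3], size=n, p=[0.1, 0.1, 0.8])
school1 = (school == 1).astype(float)
school2 = (school == 2).astype(float)
# The mean of a 0/1 dummy is the proportion of cases coded 1, so the
# share attending school 3 can be recovered from the two means.
prop_school3 = 1 - school1.mean() - school2.mean()

for name, tol in tolerances.items():
    print(f"{name}: tolerance = {tol:.3f}")
print("largest squared Mahalanobis distance:", round(d2.max(), 2))
print("proportion attending school 3:", round(prop_school3, 2))
```

In this sketch the two parent ratings get low tolerances because each is largely predictable from the other, while the teacher rating stays close to 1. The last line shows why the means of school1 and school2 are enough to see how many children attended school 3.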
(ii) Discussion of Statistics Used and the Researcher’s Interpretation
- Answer should give correct interpretations of the statistics used: F, R, R square and Adj R square; it should note the relationship between the number of variables and Adj R square (particularly with reference to the school variables).
- May give correct interpretation of Std Error and how it can be usefully applied (possibly by comparison to SD).
- May give correct interpretation of the B and Beta weights and how they can be usefully applied (to compute predicted DV values).
- Answer must note the main errors made by the researcher:-
- Error 1: The researcher notes that the model is significant therefore teacher/school/parent all have an impact. This conclusion is wrong assuming the stats were properly carried out: school does not appear to have an impact (no relationship in the correlation or MR output); the school variables just happen to be included in a model that is significant. That is, a significant F does not mean that all variables in the model contribute a significant independent portion of the DV variance.
- Error 2: The low beta/b/part/partial correlations and non-sig t values for parent ratings are spurious due to collinearity between parent ratings. Answer should comment that the bivariate correlation between parent ratings shows this collinearity and should note that the distorting effects are revealed by the discrepancy between the significant parent-DV bivariate correlations and the non-sig MR outputs for the parental variables.
- A good answer will note that the teacher-DV bivariate correlation is lower than either of the parent-DV correlations and explain how the relative contributions to the MR model arise (where has the parent-DV relationship “gone” in the MR model?)
- A good answer will note possible solutions to the collinearity between the parental ratings, e.g. combine them for a composite score or remove one.
- Error 3: Answer must stress the theoretical limitations of regression methods in establishing causality (thus the researcher is wrong to conclude from an MR analysis that the teachers’ negative expectations CAUSE depression).
- Good answer may offer alternative causal explanations of the relationships
- Answer may point out the relatively low levels of agreement between teachers/parents as an ‘interesting point’
- Answer may comment on the restriction of MR to linear relationships (thus there could still be relationships even if MR doesn’t find them).
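The pattern described under Error 2 can be reproduced with a small simulation. The sketch below is illustrative only: the data are generated under assumed effect sizes and an assumed sample size, not taken from the printout, and the unique contribution of each IV is measured as the drop in R square when that IV is removed (the squared part correlation).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # hypothetical sample size

# Assumed data-generating model: both parents rate the same underlying
# quality (shared), and the DV depends on it more than on the teacher.
shared = rng.normal(size=n)
parent1 = shared + 0.3 * rng.normal(size=n)
parent2 = shared + 0.3 * rng.normal(size=n)
teacher = rng.normal(size=n)
numnegs = shared + 0.5 * teacher + rng.normal(size=n)


def r_squared(y, *predictors):
    """R square from an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)


def adjusted_r_squared(r2, n, k):
    """Adjusted R square for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


bivariate = {
    "parent1": np.corrcoef(parent1, numnegs)[0, 1],
    "parent2": np.corrcoef(parent2, numnegs)[0, 1],
    "teacher": np.corrcoef(teacher, numnegs)[0, 1],
}

full = r_squared(numnegs, parent1, parent2, teacher)
# Unique contribution: how much R square falls when one IV is dropped.
unique = {
    "parent1": full - r_squared(numnegs, parent2, teacher),
    "parent2": full - r_squared(numnegs, parent1, teacher),
    "teacher": full - r_squared(numnegs, parent1, parent2),
}

# One possible remedy: replace the two parent ratings by their mean.
composite = (parent1 + parent2) / 2
with_composite = r_squared(numnegs, composite, teacher)
unique_composite = with_composite - r_squared(numnegs, teacher)

for name in bivariate:
    print(f"{name}: r with DV = {bivariate[name]:.2f}, "
          f"unique R square = {unique[name]:.3f}")
print(f"R square = {full:.3f}, "
      f"Adj R square = {adjusted_r_squared(full, n, 3):.3f}")
print(f"composite parent rating: unique R square = {unique_composite:.3f}")
```

Under these assumptions each parent rating correlates more strongly with the DV than the teacher rating does, yet each contributes very little unique variance because the other parent rating already carries the same information. Once the two are combined, the parental contribution reappears. None of this says anything about causal direction.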