NCITE: National Center to Improve the Tools of Educators
Funded by the US Office of Special Education Programs
Evaluating Trustworthiness in Educational Research That Is Used to Support a Change in Teaching or School Practice
A. Find a research review of the practice.
Guidelines for Evaluating the Trustworthiness of a Research Review (a research review is more trustworthy than an individual study)
A reformer’s claim that one teaching practice should replace another requires research that compares the recommended practice with the one it is to replace.
Are the comparative studies reviewed, with each individual study described in sufficient detail to clarify the specific nature of the comparison and the results? / If no, disregard the study
If “effect sizes” are compared, were the effect sizes calculated using the same procedures across all studies? (Effect sizes copied from other meta-analyses can be misleading because different researchers may use different calculation rules; see the sketch at the end of this section.) / If no, disregard
Does the review seem complete? If not all studies were reviewed, are the criteria for the inclusion and the exclusion of studies listed and reasonable? / If no, disregard
Is there clear evidence that the things compared were the same across all studies? (e.g., “nongraded primary” can refer to either mixed-ability or homogeneous grouping) / If no, disregard
Are the researchers independent, receiving no financial benefit from the sale of the program or services being evaluated? / If no, be suspicious
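To illustrate the effect-size concern above, here is a minimal sketch, assuming hypothetical summary statistics, that recomputes a standardized mean difference (Cohen's d with a pooled standard deviation, one common convention) the same way for every study instead of copying effect sizes reported by other meta-analyses:

```python
# Illustrative sketch (hypothetical data): recompute effect sizes with one
# procedure across all studies rather than mixing values from other reviews.
from math import sqrt

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical summaries: (treatment mean, SD, n, control mean, SD, n)
studies = {
    "Study A": (52.0, 10.0, 30, 48.0, 11.0, 30),
    "Study B": (75.0, 8.0, 45, 73.5, 7.5, 44),
}
for name, stats in studies.items():
    print(f"{name}: d = {cohens_d(*stats):.2f}")
```

Any single consistent convention would serve; the point is that effect sizes computed under different rules are not directly comparable.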
B. If no research reviews are available, refer to individual studies:
Essential Factors in Evaluating a Single Study
Was there a comparison? (Gains in percentile points qualify as a comparison) / If no, disregard
Do the measures evaluate performance that is important to your school? (e.g., academic or self-esteem goals) / If no, disregard
Are the measures fair to both treatments? (e.g., if one approach is used to teach arithmetic and the other geometry, a measure of arithmetic would bias the findings) / If no, disregard
The description of comparisons and results should be accurate
Is there evidence that the description of the treatment actually matched what happened in the classroom? (e.g., a specific program named in the literature review and description may not be the same one used in the treatment) / If not, rename the treatments to match what was actually compared
Do the actual measures align with what the researchers said was measured on the posttests? (e.g., sometimes “math problem solving” involves problems with no numbers in them) / If not, rename the measures to match what was actually measured
Statistical analyses are necessary to rule out chance as an explanation for any findings
Is there evidence that treatment groups were equivalent on relevant factors just before the study began?[1] (Were the groups equivalent on pretests? Were subjects randomly assigned?) / If no, disregard
Is the reliability of the measures satisfactory? (For subjective measures, inter-rater reliability should be at least .90. Other reliabilities should be at least .70) / If no, disregard
Were treatment differences statistically significant? Or was the percentile gain greater than the normal unexplained differences that occur from year to year or from class to class? / If no, differences might be due to chance
If many outcomes were evaluated, were most of them statistically significant? If not, were the few significant differences predicted before the study began (i.e., were they planned comparisons)? / If no, disregard
Is the difference in the means of the two groups at least one quarter of the average standard deviation for the two groups? (Large samples often produce statistically significant differences that are not educationally significant, i.e., there was not much difference in learning; see the sketch following this checklist.) / If no, disregard
Policy changes should never be based on only one study, especially if the conclusions are inconsistent with other research investigating a similar question. Check the literature review of the study to see if it includes all other comparative studies. / Find other relevant studies
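To make the reliability, significance, and one-quarter-standard-deviation checks concrete, here is a minimal sketch; the scores, ratings, and sample sizes are hypothetical, and the use of SciPy's two-sample t-test and a Pearson correlation for inter-rater reliability are illustrative conventions, not requirements of the checklist:

```python
# Illustrative sketch (hypothetical scores and ratings): three of the
# statistical checks above. statistics.correlation requires Python 3.10+.
from statistics import correlation, mean, stdev
from scipy.stats import ttest_ind  # standard two-sample t-test

treatment = [78, 85, 81, 90, 74, 88, 83, 79, 86, 82]
control = [72, 80, 75, 84, 70, 78, 77, 73, 81, 76]

# Reliability: inter-rater agreement on a subjective measure should be >= .90.
rater_1 = [4, 3, 5, 2, 4, 5, 3, 4]  # hypothetical ratings of 8 work samples
rater_2 = [4, 3, 5, 3, 4, 5, 3, 4]
r = correlation(rater_1, rater_2)
print(f"inter-rater r = {r:.2f} -> {'satisfactory' if r >= .90 else 'disregard'}")

# Statistical significance: can chance be ruled out as the explanation?
t_stat, p_value = ttest_ind(treatment, control, equal_var=False)
print(f"p = {p_value:.3f} -> {'significant' if p_value < .05 else 'may be chance'}")

# Educational significance: mean difference of at least 1/4 the average SD.
avg_sd = (stdev(treatment) + stdev(control)) / 2
diff = mean(treatment) - mean(control)
print(f"difference = {diff:.1f} vs 1/4 avg SD = {avg_sd / 4:.1f} -> "
      f"{'educationally significant' if diff >= avg_sd / 4 else 'disregard'}")
```

The .05 significance level shown is a common convention; a study passing the significance check can still fail the one-quarter-standard-deviation check, which is exactly the large-sample caution raised above.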
TRUSTWORTHINESS QUOTIENT / Write the number of comparative studies with a consistent conclusion
Write the total number of studies
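As a worked example with hypothetical counts: if 7 of 9 comparative studies reach a consistent conclusion, the quotient is 7/9, or about 0.78. A minimal sketch of the same arithmetic:

```python
# Illustrative sketch (hypothetical counts): the trustworthiness quotient is
# the fraction of comparative studies that reach a consistent conclusion.
consistent_studies = 7  # hypothetical count
total_studies = 9       # hypothetical count
print(f"Trustworthiness quotient: {consistent_studies}/{total_studies} "
      f"= {consistent_studies / total_studies:.2f}")
```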
[1] “Analysis of covariance” (ANCOVA) can be used to adjust for pretest differences. When the groups differ on the pretest, the instruction may have been more appropriate for one group than for the other; ANCOVA protects against this possibility statistically, but equivalent groups are more desirable. “Analysis of variance” (ANOVA) is used when the groups are equivalent on pretests.
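As a sketch of the footnote's distinction, assuming hypothetical pretest/posttest scores and using statsmodels' ordinary least squares and ANOVA-table functions (an illustrative choice of tooling, not part of the original document):

```python
# Illustrative sketch (hypothetical data): ANCOVA adjusts the posttest
# comparison for pretest differences; a plain ANOVA drops the pretest
# covariate and is appropriate when groups start out equivalent.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "group":    ["treatment"] * 6 + ["control"] * 6,
    "pretest":  [50, 55, 48, 60, 52, 58, 45, 50, 42, 54, 47, 51],
    "posttest": [68, 74, 65, 80, 70, 77, 55, 61, 52, 66, 58, 63],
})

ancova = smf.ols("posttest ~ pretest + C(group)", data=df).fit()  # ANCOVA
anova = smf.ols("posttest ~ C(group)", data=df).fit()             # plain ANOVA
print(anova_lm(ancova, typ=2))  # group effect adjusted for pretest
print(anova_lm(anova, typ=2))   # unadjusted group effect
```

Even with the adjustment available, the footnote's caution stands: equivalent groups are more desirable than statistical correction after the fact.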