3 Validity
In Chapter 2, validity was referred to as "the notion of how we establish the veracity of our findings" and was linked to the task of determining "what counts as evidence" in the evaluation of language education programs. In this chapter, the concept of validity is developed along both theoretical (What is it?) and practical (How do I know when I have it?) lines. As mentioned previously, this will also involve an elaboration of aspects of the paradigm dialog. That is, to the extent that a determination of what counts as evidence differs from one research paradigm to another, the concept of validity is formulated differently, depending on the paradigmatic perspective chosen.
The first section of this chapter presents validity from the positivistic perspective. I first use Cook and Campbell's (1979) typology and discuss the classic threats to validity. Then I present other conceptualizations of validity from an essentially positivistic perspective. After that, I consider validity from the naturalistic perspective, offering various criteria and techniques for enhancing and assessing those criteria.
Validity from the positivistic perspective
It should be reiterated that the use of the term positivistic is meant as a general label for a variety of current stances in relation to the philosophy of science. It should also be pointed out that although there are few, if any, who would refer to themselves as positivists or logical positivists these days, the philosophical assumptions of logical positivism continue to provide the rational basis for what counts as evidence in most scientific inquiry (Bechtel 1988). As an example, Cook and Campbell (1979) admit to certain beliefs held in common with classical positivism, and refer to their philosophical position as evolutionary critical-realism. This perspective
...enables us to recognize causal perceptions as "subjective" or "constructed by the mind"; but at the same time it stresses that many causal perceptions constitute assertions about the nature of the world which go beyond the immediate experience of perceivers and so have objective contents which can be right or wrong (albeit not always testable). The perspective is realist because it assumes that causal relationships exist outside of the human mind, and it is critical-realist because it assumes that these valid causal relationships cannot be perceived with total accuracy by our imperfect sensory and intellective capacities. And the perspective is evolutionary because it assumes a special survival value to knowing about causes and, in particular, about manipulable causes. (Cook and Campbell 1979:29)
For the purposes of the present discussion, however, this position will be thought of as a variant of a general paradigm that I continue to label as positivistic. A hallmark principle of this paradigm is the notion that the reality that we seek to know is objective, existing independently of our minds. Although few researchers working within this perspective still cling to the notions that complete objectivity can be achieved and that reality can be perfectly perceived, objectivity as a "regulatory ideal" and the existence of an independently existing reality remain as key philosophical assumptions. Another key notion is that of causal relationships. Cordray (1986:10) summarizes the requirements for these relationships as: "First, the purported cause (X) precedes the effect (Y); second, X covaries with Y; third, all other rival explanations are implausible." The other defining principle of this paradigm is that we arrive at an approximation of reality through the test of falsification. This means that "at best, one can know what has not yet been ruled out as false" (Cook and Campbell 1979:37).[1]
The traditional validity typology
This view of what there is to know and how we can go about knowing it leads to a particular way of defining and assessing validity. The classic typology, formulated by Campbell and Stanley (1966), divided validity into internal and external. I mention it here in part to pay tribute to its ongoing influence - positivistic and naturalistic researchers alike discuss validity in reference to this original typology - and in part to provide the necessary historical background for understanding more recent formulations of validity. My discussion of the threats to validity within this typology will attempt to contextualize the concepts of internal and external within the field of applied linguistics. However, those readers who are already familiar with Campbell and Stanley (1966) and with Cook and Campbell's (1979) reformulation may wish to skip over this section.
The Campbell and Stanley (1966) conceptualization of internal validity had to do with the degree to which observed relationships between variables could be inferred to be causal. In the case of program evaluation this amounts to asking the question: How strongly can we infer that the program being evaluated caused the observed effects (e.g., significant increases in achievement test scores)? External validity referred to the degree to which the inferred causal relationship could be generalized to other persons, settings, and times. For program evaluation, the question is: To what extent can the effects caused by the observed program be expected to occur in other program contexts?
Cook and Campbell (1979) later revised this typology by differentiating internal and external validity into four dimensions. What had previously been labeled internal validity was now divided into statistical conclusion validity and internal validity. Statistical conclusion validity refers to the degree of certainty with which we can infer that the variables being observed are related. For program evaluation this can be thought of as asking: How strongly are the program (the independent variable) and the measure of program effect (the dependent variable) related? This relationship is expressed statistically as covariation, or the degree to which variation in the dependent variable goes along with variation in the independent variable. For example, does the variation in achievement test scores go along with whether the students are in the program or not? Cook and Campbell's revised conceptualization of internal validity has to do specifically with whether or not the observed relationship between the dependent and independent variables can be inferred to be causal. For program evaluation, this is the question of whether the program caused the effects that were measured in the evaluation.
Under the revised typology, Campbell and Stanley's external validity is divided into construct validity and external validity. Construct validity concerns the degree to which we can use our research findings to make generalizations about the constructs, or traits, that underlie the variables we are investigating. These constructs, for the purposes of this discussion, can be considered as labels that assign meaning to the things we are measuring. The labels, and their ascribed meanings, are derived from current theory. In the case of language program evaluation, this attempt at generalization amounts to asking: To what extent can the observed effects of the language program being evaluated generalize to theoretical constructs such as communicative language ability (Bachman 1990), or to more practical, methodological constructs such as content-based instruction (Brinton et al. 1989)? The revised conceptualization of external validity refers to the degree to which we can draw conclusions about the generalizability of a causal relationship to other persons, settings, and times. For program evaluation, this can be formulated as follows: To what extent can the conclusions reached in this evaluation be generalized to other program contexts? Another way to frame this question is: To what degree can we claim that this program will work equally well for another group of students, in another location, at another time?
Assuming, then, that these definitions of internal (and statistical conclusion) and external (and construct) validity represent the "What is it?", how do we know whether we have it or not? Traditionally, this question has been addressed with reference to the threats to validity. Each type of validity has specific threats associated with it, and these threats must be ruled out in order to make claims for the veracity of research findings - that is, in order for us to "know that we have it."
Threats to validity
Following Cook and Campbell's (1979) typology, the specific threats to statistical conclusion validity are as follows:
1. Low statistical power. If a small sample size is used, if the alpha level for statistical inferences is set too low, or if the statistical tests being used are not powerful enough, we increase the chances of type II error. This means that we may incorrectly conclude that there are no significant differences when they do, in fact, exist.
2. Violated assumptions of statistical tests. If particular statistical procedures with restrictive assumptions are used, we may be left with research findings that are uninterpretable. For example, the use of analysis of covariance (ANCOVA) assumes that the covariate is measured without error. Because most measures that are used in applied linguistics research are less than perfectly reliable, this assumption is likely to be violated and therefore limits our claims to validity when using a procedure such as ANCOVA.
3. Making multiple comparisons. When we use several different measures in a research study such as a program evaluation, a certain proportion of the comparisons we make (for example, comparing students in the program of interest with those in a rival program) will turn up statistically significant differences that are due purely to chance. This is referred to as type I error, or the rejection of a true null hypothesis (no significant differences between groups).
4. Unreliable measurement. In addition to being a specific assumption of certain statistical tests, the lack of reliability in the measures we use is a general problem for the validity of our research findings.
For certain of these threats, addressing them is as simple as showing that they do not exist. For example, we can show that we have selected a large sample, that we have set our alpha at a reasonable level, that we have chosen statistical tests with sufficient power, and that we have not violated the assumptions of those tests. For other threats, we may need to do more. For example, if we have made multiple comparisons, we will need to adopt an experiment-wide error rate or use procedures that control for this threat (e.g., multivariate analysis of variance and conservative multiple comparison tests). If we have unreliable measures, we may need to consider correcting for attenuation the relationships estimated from them. In all cases, these are statistical problems with statistical solutions.
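Two of the statistical remedies mentioned above can be sketched numerically. The block below is a minimal illustration rather than a prescribed procedure. It assumes that the comparisons are independent when computing the chance of at least one type I error, it uses the Bonferroni adjustment as one example of a conservative correction, and it applies the classical correction for attenuation to an observed correlation. All function names and input values are hypothetical.

```python
import math


def familywise_error_rate(alpha, n_comparisons):
    """Probability of at least one type I error across n comparisons,
    each tested at `alpha`, assuming the comparisons are independent."""
    return 1 - (1 - alpha) ** n_comparisons


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison alpha that keeps the experiment-wide error rate
    at or below `alpha` (Bonferroni adjustment)."""
    return alpha / n_comparisons


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimate the correlation between two measures as it would be if
    both were perfectly reliable, given their reliability coefficients."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical evaluation: ten comparisons, each tested at alpha = .05.
print(round(familywise_error_rate(0.05, 10), 3))
print(bonferroni_alpha(0.05, 10))

# Hypothetical observed correlation of .40 between two measures whose
# reliabilities are .80 and .70.
print(round(correct_for_attenuation(0.40, 0.80, 0.70), 3))
```

With ten comparisons at an alpha of .05, the chance of at least one spurious significant difference is roughly 40 percent under the independence assumption, which is why an experiment-wide error rate is needed. The corrected correlation is always at least as large in magnitude as the observed one, since the reliabilities cannot exceed 1.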
Again following Cook and Campbell's (1979) typology, the specific threats to internal validity are as follows:
5. History. Factors external to the research setting can affect the research results. For example, events in the community in which the program exists may affect the students in a way that makes it impossible to assess how much the program itself was responsible for the measured effects (e.g., end-of-program test scores).
6. Maturation. Normal growth or change in the research subjects may account for the research findings. For example, end-of-program test scores may be shown to have improved significantly, but this may be due solely to the fact that the students have become older and more experienced, independently of the program.
7. Testing. The measurement conducted as a part of the research can have its own effect on the results. For example, if a great deal of testing is done over the course of a program evaluation, the results may be due to increased familiarity on the part of the students with the tests and testing procedures.
8. Instrumentation. When the testing instruments used have particular properties that may affect the test results, the validity of the conclusions drawn from these measures is threatened. For example, the test being used as the end-of-program measure may behave differently at different points along the test scale - it may fail to measure accurately those students who are at the high or low end of the scale.
9. Statistical regression. Test scores tend to regress toward the group mean over time when the tests are less than perfectly reliable; that is, high scorers on a pretest tend to score lower on the posttest, low scorers tend to score higher, as a group phenomenon. The result is that real differences between groups may be obscured.
10. Mortality. This rather grim-sounding threat refers to the fact that research subjects sometimes leave the research setting before the study is concluded. When the characteristics of those who remain in the research setting versus those who leave are systematically different in some way, the research conclusions become less than straightforward. For example, a majority of the students who left the program before the evaluation was finished may have been from the lower ability levels. A higher score for the program group at posttest time might thus be due primarily to the absence of these lower ability students.
11. Selection. The way in which subjects are selected for a research study may have a significant effect on the results. For example, if the students in the program to be evaluated were selected from a particular socioeconomic group, this might explain the evaluation results independently of the program, especially if the students are being compared with another group who were selected based on some other criteria.
12. Interaction with selection. Selection may interact with other threats to produce problems in interpreting the research findings. For example, maturation may combine with selection in that the group of students selected for the program of interest may be growing (cognitively) at a different rate than the group of students to which the program group is being compared.
13. Diffusion or imitation of treatments. When the research is comparing two or more groups, the treatment (the independent variable, or the experience that differentiates the groups) intended for one group can reach the other groups to some degree. For example, an evaluation of an innovative language teaching program may be conducted by comparing it to a more traditional curriculum. If, during the course of the evaluation, the students in the traditional curriculum are given elements of the innovative program (because, for example, their teacher is aware of the new program's methods and wants her students to be able to benefit from them), then any claims about the success or failure of the program in comparison to the traditional one will be difficult to make.