H 615 Week 7 Comments/Critiques
Randomized experiments are the gold standard for establishing causal inference, but they are limited with respect to generalizability. Shadish et al. (2002) argue that random sampling is inadequate to solve the problem of generalized causal inference from experiments. Five principles are proposed to improve generalizations of construct and external validity: surface similarity (similarities between study operations and characteristics of the target of generalization), ruling out irrelevancies (attributes that do not change a generalization), making discriminations (features of persons/settings/treatments/outcomes that limit generalizability), interpolation-extrapolation (interpolating to unsampled values within the sampled range; extrapolating beyond the sampled range), and causal explanation (developing/testing explanatory theories about the generalization target). Flay et al. (2005) propose standards for prevention interventions that are efficacious, effective, and ready for dissemination. Effective interventions must meet the efficacy standards plus four additional standards, and interventions ready for dissemination must meet another three. Several “desirable standards” are also proposed, including measurement of proximal outcomes (also suggested by Shadish as an approach for studying causal explanation) and measurement of potential negative effects (question: why is there a required standard that efficacy claims must show no serious negative effects, yet measurement of such effects is not required?). I would like to learn more about how the standards have advanced prevention science since this publication.
This week’s readings suggest that we begin to think about intervention development as a program of research rather than a series of independent studies. SCC and F/B recognize the limitations of the single study and, in response, created principles and processes that shift the field’s (and potentially funders’) perspective of what it means to create an efficacious, effective, and scalable intervention. SCC’s review of the principles of surface similarity, ruling out irrelevancies, making discriminations, and interpolation and extrapolation in the face of purposive sampling demands greater thought and consideration than might be devoted if one had access to a truly random sample and random assignment. SCC fleshes out how real-world barriers can be addressed via sampling methods and statistical techniques while still enlightening our understanding of causal relationships. The SCC reading dovetails with the standards of evidence outlined by F/B, who speak to researcher accountability and responsibility for creating research that meets professional expectations around the creation, translation, and implementation of interventions that impact human lives. These readings force the researcher to look beyond the immediate issues and develop a long-term vision of a product that is of value to science and the lives it will impact.
The readings this week explore challenges in establishing generalized causal inference while specifying standards to which experiments should adhere so that only the best interventions are disseminated. With increasing variability in the range of UTOS studied, is the extent to which findings in RCTs can be generalized diminishing? The proposed solution, formal sampling, appears costly (e.g., cluster sampling when full enumeration is not possible) and beyond the reach of many investigators (e.g., sampling different settings). Does a movement toward community-based participatory research by some emphasize specificity over generalizability? Sampling theory can guide researchers to make decisions intentionally. Measuring many constructs, along with mediators and moderators, can increase the chance of respondent fatigue, but is still preferred over limiting measurement tools to a few items of high validity. The quest for psychometrically sound instruments, which is also one of Flay et al.’s criteria in the Standards of Evidence, can continue indefinitely if the purpose is to find the right balance between the number of items and the degree of validity, as could the use of SEM to test assumed causal inferences when a variety of variables and mediators (and the different ways they are measured) are involved.
This week’s readings provide methods and considerations for improving the generalizability of program results through design choices connected to the five principles of generalized causal inference. Making an argument for large-scale program generalizability requires evidence in support of all five principles. However, it is likely impractical or impossible to address all principles in one or even a few studies; thus a line of research is the best way to establish generalizability. As Flay et al. describe, a line of research is necessary to establish efficacy, effectiveness, and readiness for dissemination. Research must progress through these phases, as the evidence provided by each phase is required to establish the knowledge needed for the next. For example, a study must meet all standards of efficacy before effectiveness can be established. Because there are multiple standards for each research phase, and replication of results is recommended to provide the necessary support, programs of research rather than single studies should be used for program development, implementation, evaluation, and dissemination. Any single study’s worth lies not simply in its outcomes but in the ability of other researchers to evaluate its strengths/weaknesses, determine which questions about generalizability were answered and which remain, and build future studies accordingly.
The readings really drove home the point that a program of research is needed for a program to get to the dissemination phase. They highlighted the benefit of using both qualitative and quantitative methods throughout one’s research program. For example, Flay and colleagues note that it is important for the design to establish causal effects and that we must be confident the program/treatment was responsible for the effect. Chapter 12 also highlighted path models that can statistically pinpoint associations between variables. However, without a program of research that teases out other possible causes, and without using multiple measures at times to ensure that what we are looking at does correlate with an outcome, we cannot be confident in our conclusions. Qualitative methods can be brought in to see whether A is really causing B or whether an unmeasured variable is at work. It was especially beneficial to read about path models, because without sound constructs and theoretical rationale, significant results may be practically meaningless. This, to me, stresses the importance of collaboration to better grasp how to measure variables, how to sample, and how to discuss issues theoretically and conceptually in order to get research to the dissemination phase.
The readings this week took us deeper into generalized causal inference from experiments, particularly sampling procedures that strengthen claims of causal inference. Research design is an important contributor to our ability to generalize results and to demonstrate the effectiveness of a program. If a program is shown to be effective, the authors/developers should make the intervention available to others. In my experience, the physical activity literature has lacked thorough explanation of interventions, and even fewer studies provide information on how to obtain more detail about the intervention. Are authors not required to make tested/evaluated interventions publicly available? From the information presented in chapter 12, and in previous chapters on the use of statistics in developing causal inference, I am still caught up in the idea that an investigator can generate a number of desired outcomes purely by implementing the appropriate statistical methods. If this were the case, is it common to observe “slacking” on the design of a study? As with the Duke incident, investigators can easily manipulate data to cover for poor study design or for errors on the part of the researcher in delivering a study.
No matter how good one study or experiment is, alone it is not enough for making causal inferences. Single studies are unlikely to have enough power in terms of samples, settings, treatments, and outcomes to answer our questions in a definitive manner. As Valentine et al. (2011) state, replication is an essential feature of science, and without it we cannot make good decisions about public health interventions. Shadish et al. (2002) identify several ways to get around that issue, including multistudy programs, narrative reviews of existing research, and meta-analysis; all of these have their flaws, but when used in conjunction they can strengthen causal inference. Shadish et al. (2002) also call for re-evaluating our assumptions as social scientists and being as critical of our own work as we are of others’. As the field of experimentation evolves and becomes more specialized, both in terms of knowledge and opportunity, critical evaluation of research becomes even more essential. To that end, Valentine et al. (2011) suggest that more funds and incentives be devoted to replicating studies and experiments as a way to improve the field of public health.
As is evident from SCC, in the progression of an intervention from possible to plausible to proven, and from proven within a limited set of circumstances to widely accepted and effectively implemented (disseminated), a great number of potential pitfalls and detours exist, for which Flay, Biglan, and colleagues offer a set of standards to guide the researcher. One standard in particular struck me, first for its impracticality and then for its efficiency. Under Standard 2.b.ii, the first Desirable Standard (measurement of proximal outcomes and mediators) initially struck me as an onerous and expensive addition. In the behavioral sciences, such outcomes may be difficult to measure, may require the use of proxy measures (a violation of another standard), and may not be initially intuitive. However, given the likelihood that changes will need to be made to the original intervention protocol as the study is conducted on broader demographics, it will prove useful to determine whether an intervention still exerts its effect through the same mechanisms and, if not, whether alternate outcomes are being affected. This can also provide critical guidance when an intervention fails to have the intended effect in a new population, indicating whether the intervention is completely ineffective or whether impact on alternative outcomes may be an acceptable indication of effectiveness. Despite the utility of such data, each research endeavor is limited by cost, and measuring everything is difficult and expensive.
These readings built on the foundation laid by previous readings and class discussions regarding sampling strategies (and how researchers may correctly phrase results based on their sampling), as well as how to discover causal paths in quantitative and qualitative ways. SCC also ties in how theory drives statistical models in different ways. Flay et al.’s article emphasized that it takes more than one study design to evaluate the efficacy of an intervention and detailed each standard that must be met before an intervention is ready to be disseminated, while echoing SCC’s cautions about specificity regarding not only outcomes but also by whom, where, and how those outcomes were experienced. While it was satisfying to read a synthesis of previous material, from a practical standpoint, do policymakers appropriately understand the necessity and utility of multiple studies with designs that guard against different threats, which together create a comprehensive picture of both the efficacy and effectiveness of a given intervention? How many interventions are long-lived enough for sufficient evaluations to take place before a researcher or policymaker can confidently promote them?
Shadish, Cook, and Campbell (2002) describe many issues surrounding generalized causal inference and strategies for studying causal explanation. Of particular interest is their extensive discussion of qualitative methods and structural equation modeling. I am beginning to learn more about qualitative methods by analyzing transcripts that my advisor, Carolyn Mendez-Luck, has collected regarding Latino caregivers, and I am also taking a qualitative methods course and learning ethnographic field methods. While I most likely will not be conducting ethnographies as part of my Public Health doctoral program, it is an intriguing method nonetheless. I also plan to enroll in Structural Equation Modeling next term to learn more about causal pathways and covariance. According to our text, this is an appropriate class for a student in Health Behavior and Promotion, and I am looking forward to learning these methods. Flay et al. (2005) make the case for having interventions meet certain criteria, or Standards of Evidence, before they are disseminated. An important aspect of these standards is having intervention creators provide a detailed manual to communities who wish to implement health behavior interventions. This is especially important when laypersons will be implementing interventions.
Being able to generalize a causal inference ultimately makes it more broadly relevant and useful. As noted in the book, researchers make generalizations intuitively, alongside more rigorous measures such as obtaining heterogeneous samples and using qualitative methods and statistical models. Despite this, however, I wonder about the implications of “generalizability” in the professional sphere. As we’ve discussed in class, for instance, translation and adaptation contribute to the ability to generalize causal inference. However, researchers may avoid these practices as they aim for innovation. Although solid translation/adaptation practices require adherence to stringent standards of evidence as outlined in the Flay et al. piece, coming up with a totally novel program or an entirely new causal inference may be viewed as more “groundbreaking” than implementing evidence-based interventions or adapting them to different settings, potentially appealing more to funders and publishers. However, these “less exciting” practices can ultimately move the field forward while potentially minimizing costs. So what is the position of generalizability efforts (specifically systematic retrospective efforts and translation/adaptation) in an academic culture that places tremendous value on novelty? Is our field convinced that generalizability efforts are innovative? How do we bridge the gaps between research and practice, and between generalizability and innovation?
The research field has changed considerably as the demand for application has grown. Having evidence of the effectiveness of a prevention program is not enough, as there is also a need for applicability to a general audience. Unfortunately, this is challenging because communities vary widely across the world, and dwindling resources limit researchers’ ability to adapt programs. Flay mentioned the need for more replication studies and better dissemination. Replication reduces the potential for chance findings and helps increase confidence in program effectiveness. In addition, effective dissemination is more likely to occur when delivery, fidelity, and proximal goals are part of the intervention process. SCC adds that generalization is a product of multiple attempts at replication, which helps explain why generalization efforts either succeed or fail. To improve, SCC suggests sampling people in the community, along with the treatments and observations. In an ideal situation, researchers are able to generalize and provide tailored information on the likely rate of success for people with similar traits. However, funding restricts researchers and constrains how they advance their program interventions. Approaches to effectiveness and dissemination should be used with the confidence that they provide a foundation for improved studies in the future.
Randomized controlled trials have long been criticized for their lack of generalizability, calling into question their usefulness in policy and community intervention development. Shadish, Cook, & Campbell (2002) suggest five principles of generalized causal inference (surface similarity, ruling out irrelevancies, making discriminations, interpolation and extrapolation, and causal explanation) and their unique applications to both construct and external validity. Ruling out irrelevancies seems particularly pertinent when considering the cost of program implementation within a given community, especially for an RCT. Complications arise when a researcher, looking to rule out irrelevancies, excludes selected independent variables over others and misses a key moderating/mediating effect, possibly resulting in more detrimental analyses/conclusions and greater cost in the long term. As an additional consideration, Flay et al. (2005) call attention to evidence that even the best-supported intervention is unlikely to be effective in every implementation due to diversity of settings, places, populations, etc. Although problematic, this effectiveness quandary creates an opportunity to harken back to the program theory to explore answers regarding variable irrelevancies. Depending on the selected theory (an integrative one such as the TTI would be most comprehensive), it can help researchers understand the diffusion of their intervention within a population.
While I am struck by the level of detail and careful deliberation that have obviously gone into constructing the SPR list of standards (Flay et al., 2005), I cannot help but wonder about the capacity of potential end users (in particular communities, much more than administrators and policymakers) to utilize such a seemingly complex list in choosing which prevention program to employ. Unless the field is prepared to offer such stakeholders a quick, easy-to-assimilate version of this list, we might run the risk of losing them entirely by overwhelming them with (undeniably) sound information. On the other hand, for researchers versed in the lingo and theory of behavioral science, this is certainly an invaluable tool. I was also taken aback to realize that, according to SCC, what I have always considered a scientific process for establishing external validity is in fact “superficial” (to use their own words). As they point out, this would be generalizing purely on the basis of prototypical properties (“proximal similarity”). I do see how this process is a subjective one, depending more on value judgment than on formal sampling.
SCC present their grounded theory of generalized causal inference, which includes five principles for establishing generalizability. The five principles outline an approach for establishing generalizability for research studies, including those that employ purposive sampling. This is an important approach, as much of the quantitative and qualitative work in the social sciences utilizes purposive sampling. In addition, SCC discuss specific ways of conducting purposive sampling, such as sampling typical or heterogeneous instances, in order to extend the ability to generalize from a study or research program. Flay et al. (2005) outline Standards of Evidence in the realm of prevention research. Several of the Standards refer to the generalizability of causal inferences from efficacy and effectiveness trials that may or may not be ready for dissemination. In a similar manner to SCC, the authors posit that generalizability is limited to persons and settings similar to those in the research study. To that end, the Standards dictate that the research sample be well characterized and examined in subgroup analyses, and that appropriate statistical tests be used and the practical value of the reported numbers be conveyed.