Philosophy of Science, 69 (September 2002) pp. S185-S196. 0031-8248/2002/69supp-0017
Copyright 2002 by The Philosophy of Science Association. All rights reserved.

Using Meta-Scientific Studies to Clarify or Resolve Questions in the Philosophy and History of Science

David Faust
University of Rhode Island
Paul E. Meehl
University of Minnesota

More powerful methods for studying and integrating the historical track record of scientific episodes and scientific judgment, or what Faust and Meehl describe as a program of meta-science and meta-scientific studies, can supplement and extend more commonly used case study methods. We describe the basic premises of meta-science, provide an overview of methodological considerations, and give examples of meta-scientific studies. Meta-science can help to clarify or resolve long-standing questions in the history and philosophy of science and provide practical help to the working scientist.

Send requests for reprints to the authors. David Faust: Department of Psychology, 10 Chaffe Rd., Suite 8, Kingston, RI 02881. Paul Meehl: Department of Psychology, 75 E. River Rd., Elliot Hall, Minneapolis, MN 55455.

1. Introduction.

As a graduate student in psychology about 25 years ago, one of the authors (DF) was sent by his major professor to the head of the philosophy department to discuss certain technical issues. At that time, this author presented the basic features of a still somewhat tentative proposal for meta-science, which the other author (PEM) has co-developed and now refers to as the "Faust-Meehl Thesis." As we will describe, the Faust-Meehl Thesis involves a theoretical rationale for, and the design of, more rigorous methods for studying scientific episodes in order to assist in the understanding and integration of the massive historical track record. The program has both descriptive and prescriptive aims. Upon hearing the proposal, this philosopher simply stated, "If you are correct, then my life work has been a waste and I am out of business."

It was, and remains, the conviction of both authors that this pessimistic pronouncement was wrong on both scores and that the meta-scientific approach or program that we will describe should have just the opposite effect, that is, that it will sharpen traditional problems and create new ones involving issues that are often central to historians and philosophers of science, leading to many productive undertakings. These problems and questions involve such matters as: What features of theories predict their long-term survival? To what extent are these features similar across disciplines and domains? Stated differently, meta-science should provide rich and hardy grist for the mill of historians, logicians, and philosophers of science.

In the article that follows we will discuss the potential benefits of applying more rigorous methods to the analysis of the historical track record, present certain basic premises of our meta-science program, discuss its rationale and aims, and present some examples of potential applications. Space limitations necessitate a dense presentation that might sometimes seem inadequately attentive to methodological obstacles and objections; various sources provide more detailed descriptions of the premises, aims, and potential methods of meta-science, as well as our thoughts about certain objections and practical problems (Faust 1984; Faust and Meehl 1992; Meehl 1983, 1992a, 1992b, 1999).

2. Methodology for the Study of Science.

The major current approach to the study of science is the case method, which has yielded many insights and is seemingly irreplaceable for certain purposes. However, there are two fundamental reasons why this approach may not be the method of choice for certain types of problems or questions, at least when used predominantly or in isolation.

First, the database of scientific episodes or occurrences is massive and growing rapidly. Science is BIG, and it is nearly impossible for anyone using the case study method to master and continuously track more than a relatively small proportion of this database.

Second, relations between the methods that scientists employ and the outcomes of their efforts are largely probabilistic, not deterministic. Much of the methodology that scientists use is not, strictly speaking, rule bound, but rather follows from rules of thumb, principles, or guides, many of which can lead to inconsistent or even opposing actions (e.g., start by simplifying versus start holistically). Good or even excellent methods do not guarantee success, nor do bad or poor methods always lead to failure. One might crudely classify methodology and outcome into a two-by-two table, with one dimension representing method (good versus bad) and the other representing outcome (good versus bad). Given the massive database of scientific episodes, and because the relations between method and outcome are inherently probabilistic or statistical, we could evidently fill all four cells of the table with many entries, even if good method were much more likely to lead to a positive outcome than poor method. Consequently, for nearly any descriptive or normative program, no matter how sound, the proponent can find many supportive instances (although one might have to search much harder and more selectively in the case of some of these programs than others).
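To make the point concrete, consider the following small simulation (a sketch of our own, with invented rates rather than historical estimates): even when a good method is three times as likely as a poor one to produce success, a sufficiently large pool of episodes leaves every cell of the two-by-two table with thousands of entries.

```python
# Illustrative sketch only: the probabilities are invented, not historical estimates.
import random

random.seed(1)
episodes = 50_000                          # stand-in for a large historical record
p_success = {"good": 0.6, "poor": 0.2}     # assumed success rates, for illustration

counts = {(m, o): 0 for m in ("good", "poor") for o in ("success", "failure")}
for _ in range(episodes):
    method = random.choice(["good", "poor"])
    outcome = "success" if random.random() < p_success[method] else "failure"
    counts[(method, outcome)] += 1

print(counts)
# Every cell ends up with thousands of entries, so a partisan of nearly any
# methodological program can harvest many "supporting" cases.
```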

Additionally, the same scientific procedure or methodology can produce inconsistent or varying levels of success; the relationship here is one to many. Also, different procedures can lead to the same outcome, the relation here being many to one. Again, this speaks to the statistical nature of the relations between scientific procedures and outcomes.

Consider also the features of theories that are deemed desirable. Among the lists of such features that are commonly put forth, there is much, but certainly not complete, overlap. Nor is there agreement about which features should be assigned the greatest importance or weight, or about which should countervail one or more of the other features when they are inconsistent or when different features favor competing theories.

Take the following abbreviated list of desirable features of theories. The list might include parsimony, which itself can be divided into a number of characteristics, such as simplicity of explanation or the fewest postulates per observation statement. The list might also include novelty in relation to numerical precision, that is, some variation of Popperian risk or Salmonian "damn strange coincidence." To these we could add rigor, qualitative diversity or breadth, reducibility upward or downward, and elegance or mathematical beauty.

No credible philosopher of science has claimed that any one of these features is a sure-fire guarantee of truth, or even of a high level of verisimilitude. Nor has any credible philosopher claimed, despite a strong emphasis on one or two features, that any one always trumps all the others. Thus, anyone who relies on any one of these features to appraise a theory's status must be claiming statistical relations between the presence of, or standing on, that feature and the success of the theory or its verisimilitude.

The only essentially unambiguous case is the trivially simple one in which Theory A beats Theory B on all features. Commonly, however, the features themselves are inconsistent within and across theories, creating a potential judgmental dilemma. For example, Theory A may have excellent parsimony but modest rigor; or Theory A may surpass Theory B on some features, but for other features the opposite might hold.

Again, given the massiveness and probabilistic nature of the historical track record, it is possible to identify many positive or negative instances for nearly any set of preferences proposed. In this context, case study becomes a method for refuting extreme claims of the type that almost nobody makes. For example, in Realism and the Aim of Science (1983), Popper cites multiple examples of theories that were abandoned quickly due to clear falsifiers. What does this refute? Has anyone claimed something like: "No scientific theory has ever been quickly abandoned because of what appeared to be clear falsifiers"?

If the claim instead is that scientific episodes should conform to certain characteristics, or that a certain approach will often or will tend to yield a certain outcome, then selective illustrations are not helpful and different methods are needed. Given the size and heterogeneity of the historical database, it is possible to pile up examples for nearly any program, even if the description is far from typical or the normative suggestions are less than optimal, if not relatively poor. If there are tens of thousands of episodes from which to collect examples, then even an approach that occurs or works 1% of the time will yield hundreds of conforming instances.

Most importantly, methods for studying the historical track record need to incorporate some form of representative sampling of scientific episodes. Obtaining representativeness will generally require random sampling of a sufficient number of episodes (although this number may not need to be nearly as large as one might suppose). If we want to know what occurs and how often it occurs, representative sampling is often, far and away, the most powerful method.

Many claims about science contain frequency statements or assertions that are fundamentally statistical. It is informative, for example, to review Laudan et al.'s (1986) list of contrasting assumptions about scientific change. Of the 15 assumptions or hypotheses listed under the category for successor theories, every one of them contains such terminology as "seldom," "randomly," or "always."

Why not just collect and combine episodes in the history of science on the basis of trained, expert judgment? First, although not literally true, the quality of conclusions is constrained by the quality of the data upon which they are based. Absent representative sampling, one lacks the database needed to best answer or resolve these types of inherently statistical questions. The typical case study method does not capitalize on the far more powerful methodology that is available for obtaining representative samples and is unlikely to produce the needed representativeness. Further, in some instances, the case study method is directed toward identifying or accruing instances that illustrate or support a position, and therefore is likely to produce skewed, or grossly skewed, samples.

Second, optimal or improved integration of large and complex databases is likely to be facilitated by decision aids that supplement the power of the unassisted human mind. As a large body of research shows (e.g., see Faust 1984), the capacity of the unaided mind is greatly strained, if not far overburdened, when asked to optimally combine multiple variables with probabilistic relations to outcomes. The unaided human mind simply does not perform these types of operations or computations well. As Meehl (1986, 372) has stated:

Surely we all know that the human mind is poor at weighting and computing. When you check out at the supermarket, you don't eyeball the heap of purchases and say to the clerk, "Well it looks to me as if it's about $17.00 worth; what do you think?" The clerk adds it up.

Although it might be argued that the case study method will usually be effective in identifying major differences in rates of success, matters become far more difficult when one wants to know just how often an approach succeeds across applications; or whether one method beats another by a margin of, say, 25%, 10%, or 5%; or whether one approach works somewhat better than another in some situations but not in others. The problem of subjective discernment can become especially difficult because, among other things, the less successful method may have been used far more often than the more successful method, leading to an absolute number (versus proportion) of positive outcomes that exceeds that of the more effective approach. Even relatively small differences in success rates can be of great importance to working scientists, especially when these probabilities are joined across scientific undertakings. For example, when the probabilities are multiplicative, five attempts with a 5% versus a 2% rate of success have a many-fold greater chance of achieving a positive outcome.
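The arithmetic behind the closing example can be made explicit. Assuming independence, and taking "multiplicative" to mean that all five undertakings must succeed, a short calculation (ours, for illustration only) shows the compounding advantage, with the more modest at-least-one-success reading included for comparison:

```python
# Assumed, for illustration: five independent undertakings at a 5% vs. a 2% success rate.
p_all_high = 0.05 ** 5            # chance that all five succeed, ~3.1e-07
p_all_low = 0.02 ** 5             # chance that all five succeed, ~3.2e-09
print(p_all_high / p_all_low)     # ~98-fold advantage under the multiplicative reading

p_any_high = 1 - (1 - 0.05) ** 5  # chance of at least one success, ~0.23
p_any_low = 1 - (1 - 0.02) ** 5   # chance of at least one success, ~0.10
print(p_any_high / p_any_low)     # ~2.4-fold advantage under the weaker reading
```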

The problem of integrating episodes in the history of science and determining probabilistic associations between procedure or theory features and long-term outcome is worse than this, however, because one may well have to assign weights to the variables and also examine interrelations or configural patterns among the variables. For example, although success with novel prediction may generally be a more powerful indicator of a theory's fate than parsimony, this may not hold true when the range of phenomena for which accurate prediction is achieved is very narrow and the alternative theory shows not only greater parsimony but also much greater breadth. Alternatively, the relative weight that should be assigned to one or another variable may depend on the standing of other variables; that is, it may depend on patterns or configural relationships. To give what might be an overly simplified example for purposes of clarity, parsimony might count for nothing if novel prediction is nil, might count more if a theory also shows good rigor, and perhaps should be weighted heavily if the theory shows good standing on breadth. A quote from Dawes, Faust, and Meehl (1989), in follow-up to Meehl's statement quoted above, illustrates the difficulties encountered when attempting to perform these types of mental operations subjectively:

It might be objected that this analogy, offered not probatively but pedagogically, presupposes an additive model that a proponent of [subjectively accomplished] configural judgment will not accept. Suppose instead that the supermarket pricing rule were, "Whenever both beef and fresh vegetables are involved, multiply the logarithm of 0.78 of the meat price by the square root of twice the vegetable price"; would the clerk and customer eyeball that any better? Worse, almost certainly. When human judges perform poorly at estimating and applying the parameters of a simple or component mathematical function, they should not be expected to do better when required to weight a complex composite of these variables. (1672)

We, of course, do not mean to compare the evaluation of theories to supermarket pricing. Our example is intended to illustrate the difficulties encountered when one attempts to subjectively integrate multiple variables with probabilistic relations to outcome, variables which may act differently when combined and weighted in different ways or in different configurations. Thus, in addition to representative sampling, methodology designed to assist in the analysis and integration of such databases (e.g., statistical methods such as multiple regression) can greatly bolster our judgmental accuracy and understanding.
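To make the contrast between additive and configural combination concrete, the following sketch, entirely hypothetical in its features, weights, and rules, scores two invented theories both ways; it is offered only to illustrate the kind of interaction at issue, not as a proposal for how theory features should actually be weighted.

```python
# Hypothetical illustration: invented features, weights, and rules; not a real appraisal scheme.
theory_a = {"parsimony": 0.9, "novel_prediction": 0.0, "rigor": 0.9, "breadth": 0.9}
theory_b = {"parsimony": 0.3, "novel_prediction": 0.8, "rigor": 0.6, "breadth": 0.6}

def additive_score(theory):
    # Equal weights, simply summed.
    return sum(theory.values())

def configural_score(theory):
    # Parsimony counts for nothing when novel prediction is nil,
    # and is weighted more heavily when breadth is also strong.
    if theory["novel_prediction"] == 0:
        parsimony_weight = 0.0
    elif theory["breadth"] > 0.5:
        parsimony_weight = 2.0
    else:
        parsimony_weight = 1.0
    return (parsimony_weight * theory["parsimony"]
            + 2.0 * theory["novel_prediction"]
            + 1.0 * theory["rigor"]
            + 1.0 * theory["breadth"])

for name, theory in (("Theory A", theory_a), ("Theory B", theory_b)):
    print(name, round(additive_score(theory), 2), round(configural_score(theory), 2))
# With these invented numbers the additive rule favors Theory A (2.7 vs. 2.3) while the
# configural rule favors Theory B (1.8 vs. 3.4): the kind of interaction an unaided
# judge is being asked to track.
```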

3. Description and Prescription.

More powerful methods for studying and integrating the historical track record can help clarify or resolve long-standing questions in the history and philosophy of science and provide practical help to the working scientist. Perhaps the most fundamental reason why better description helps the practicing scientist is that what was or is most successful in the past has value for predicting what will succeed in the future. If the past were entirely non-predictive on these matters, we could junk the scientific method completely. Imagine if we believed that a statement like the following were justified: "Just because control groups have helped us in thousands of past experiments, and just because this situation closely resembles the types of problems for which control groups have worked before, there is no basis to assume that a control group will help in this instance." Or, more broadly: "The past usefulness of control groups for decades and across thousands of studies and broad domains does not allow us to predict that control groups will assist us in future studies." Scientists, of course, consider the past track record of methods and approaches all of the time when planning or conducting new work. However, greater precision and accuracy, especially around matters that require complex data integration (e.g., which factors in which combination best predict the long-term fate of theories), should provide improved guidance.

4. Two Illustrations.

A variety of problems might capture the attention of the meta-scientist, especially problems in the history and philosophy of science that require the integration of complex data. For example, representative sampling and statistical analysis might be applied to the study of scientific change, or to the association between scientists' methodological preferences and the success of their efforts. Given space limitations, we will limit ourselves to a discussion of two possible areas of study.

4.1. Grant Evaluation.

Grant evaluation involves prediction under conditions of uncertainty; that is, reviewers attempt to predict the outcome or utility of proposed, but yet to be conducted, research studies or programs. Presently, grant evaluation is almost always conducted through some form of data integration that rests substantially or mainly on subjective judgment. This is the case even when these evaluations involve assigning ratings to various dimensions and then adding up scores on those dimensions, or using another means of formulating some type of global rating, because the selection of the dimensions and the scheme for combining them are, themselves, subjectively derived. How would the meta-scientist proceed in this domain?

One might initially identify a range of variables that seem relevant in judging the quality of grant proposals or in predicting their success. It would be sensible to start this process by eliciting the beliefs and impressions of qualified scientists, particularly those considered expert in grant evaluation. We would begin by generating a list of evaluative features that, if anything, is overly inclusive. Mistakenly including variables on the candidate list should not be too serious an error, because proper analysis will help us identify those that do not work or are unnecessary (i.e., that are non-predictors, weak predictors, or redundant predictors). In contrast, the failure to include potential predictors may represent missed opportunities.

The various grant proposals are rated along these dimensions, taking steps to ensure that the ratings are reliable or consistent across evaluators. Classical psychometrics provides formulas for such questions as how many judges must be pooled to achieve a desired level of reliability, the constraints that level of reliability sets on validity, and the like. We then examine, through the proper mathematical procedures (e.g., multiple regression), the relations between standing on these background variables and outcome, that is, the fate of the executed research project. At this stage, we will probably prefer to work with archival data. With archival data, we need not await outcomes, can examine a long enough time period after completion of the research to make more accurate and trustworthy judgments of success, and can avoid cases with more ambiguous outcomes, or for which success is particularly difficult to rate.
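To illustrate the kind of formula at issue (this sketch is ours, not part of the original passage), the Spearman-Brown prophecy formula estimates the reliability of a composite of $k$ judges from the average single-judge reliability $\bar{r}$, and a standard result of classical test theory bounds a rating's validity by its reliability:

\[
r_{kk} \;=\; \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}}, \qquad r_{xy} \;\le\; \sqrt{r_{kk}}
\]

For example, if individual reviewers correlate about .40 with one another, pooling five reviewers yields a composite reliability of roughly $5(.40)/[1 + 4(.40)] \approx .77$, and the validity of that pooled rating against any outcome criterion cannot exceed about $\sqrt{.77} \approx .88$.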

The mathematical analyses will tell us which variables are and are not associated with outcome, how strongly they are associated, and what variables in what combination or weighting scheme maximize predictive accuracy. For example, it may turn out that a researcher's past success is a far more powerful predictor than institutional affiliation or the thoroughness of the literature review. We might find that a substantial number of variables all contribute independently to prediction (an outcome that, for technical reasons we cannot enter into here, we consider unlikely); that many of the variables are redundant and that only a relatively small subset is needed to maximize prediction; or that some variables generally believed to be good predictors are not, and that other variables often considered to be of secondary importance are among the best predictors. It might be that the useful variables can simply be added up and weighted similarly to maximize predictive accuracy, that differential weighting is needed, or that combinations, or complex combinations, of these variables must be utilized. Of course, we do not know what we might find; we may just end up "confirming" what was assumed all along, but this is the point of doing such studies. Of interest, a large body of research shows the feasibility of conducting these types of analyses of human or expert judgment, although this work has not yet been applied to the study of higher-level scientific judgments. Further, this research on judgmental processes often reveals substantive discrepancies between subjective appraisal, weighting, and integration of variables in comparison to what statistical and mathematical analyses show is optimal (Faust 1984; Meehl 1954; Grove and Meehl 1996).
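A minimal sketch of the kind of analysis described above, using invented variable names and simulated data rather than anything from an actual grant archive, might look as follows; a real study would substitute archival ratings and outcome judgments and would cross-validate to avoid capitalizing on chance.

```python
# Hypothetical sketch: variable names and data are invented for illustration and carry
# no empirical weight; a real analysis would use archival ratings and outcomes.
import numpy as np

rng = np.random.default_rng(0)
n = 200  # archival grant proposals with known outcomes

# Pooled reviewer ratings on three candidate predictors (standardized scales).
past_success = rng.normal(size=n)   # applicant's prior track record
affiliation = rng.normal(size=n)    # institutional prestige
lit_review = rng.normal(size=n)     # thoroughness of the literature review

# Simulated outcome in which past success carries most of the signal.
outcome = (0.6 * past_success + 0.1 * affiliation + 0.05 * lit_review
           + rng.normal(scale=0.8, size=n))

# Ordinary least squares: estimated weights and variance explained.
X = np.column_stack([np.ones(n), past_success, affiliation, lit_review])
beta, _, _, _ = np.linalg.lstsq(X, outcome, rcond=None)
predicted = X @ beta
r_squared = 1 - ((outcome - predicted) ** 2).sum() / ((outcome - outcome.mean()) ** 2).sum()

print("weights (intercept, past_success, affiliation, lit_review):", beta.round(2))
print("proportion of outcome variance explained:", round(r_squared, 2))
```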