Collaborative Project: Minorities at Risk Data Base and Explaining Ethnic Violence

NSF Grant Proposal

James D. Fearon -- Stanford University

David D. Laitin -- University of Chicago

The "Minorities at Risk" (hereafter MAR) data base produced by Ted R. Gurr and associates has been widely used in the scientific community (give cites). It has great potential to serve as the evidentiary arbiter of competing theories of ethnic violence. Large-scale ethnic violence is an interesting and important topic both because of the enormous human suffering it causes and because it could be an important piece of evidence in the larger puzzle of how world politics and polities are now evolving. Furthermore, civil and especially ethnic violence is certainly much more common now than is interstate violence of the classical sort, and it tends to be more protracted than interstate wars as well. (See Licklider 1995 and Walter 1997 for evidence on the intractability of civil and ethnic conflicts.) Because ethnic violence appears to be on the rise, and because of its importance for policy and for theories of world politics, the MAR data base will play an increasingly important role in the search for explanations for this violence.

Despite its wide use and great potential, however, the data base suffers from some fundamental flaws. The first purpose of this project is to work with the Gurr team to improve substantially the scientific quality of the MAR data base, such that the social science community will have a much better resource for the study of ethnic violence than is currently available. The second purpose of this project is to exploit the newly created data base in order to make cross-sectional and time-series comparisons of violent and non-violent cases of relations between ethnic groups and states, in order to answer two questions. First, what forms or types of ethnic violence have been the most lethal in the period since 1945? Second, using the MAR case list, are there any obvious features that distinguish the ethnic groups that have been involved in large-scale violence against other groups from those that have not?

I. The Promise and the Flaws of the MAR Data Base

The MAR data set developed by Gurr and his associates contains information on some 268 culturally defined minority groups in 115 countries, with 449 variables coded concerning the social, cultural, political and military situation of these groups vis-a-vis other groups and the state since the end of World War II. The data have been systematically updated to take into account the creation of new states (and new minorities) in the wake of the collapse of the Soviet Union and Yugoslavia (Minorities at Risk Phase III Dataset: Users' Manual, August 1996, University of Maryland; see also Gurr 1993a, 1993b, 1994).

The literature on ethnic conflict/violence before the development of the MAR data base employed two types of research designs. Scholars have undertaken (1) relatively large-N, cross-sectional comparisons among cases selected because they are marked by significant ethnic conflict or violence, and (2) smaller-N studies that consider the evolution of violence in particular cases over time (or which may compare two or three of these). The failure to systematically sample cases of low conflict or violence tends to undermine the first approach's ability to generate insight into what factors differentiate high and low violence cases. And while the small-N literature has produced a wealth of insights into particular dynamics and mechanisms at work in particular cases, this approach is inherently incapable of providing a "big picture" view of the empirical contours of ethnic violence, and thus an understanding of how particular mechanisms fit within the larger picture. Ultimately, we are interested in developing theories that describe the mechanisms that, under certain conditions, give rise to large-scale ethnic violence. But rather than move immediately to the "micro-level" of specific mechanisms, we want first to establish the larger empirical context, in part so that we can make better initial guesses about what sorts of mechanisms are most common and important empirically.

The MAR data set provides a useful first step in establishing the larger empirical context, in large part because the selected cases vary greatly in their levels of violent conflict. The data set also has a reasonably good proxy for our dependent variable. Finally, it has useful codings for many independent and control variables. Although we will have a good deal to say on case selection when we identify problems with the data, it should be noted up front that coverage is reasonably good, considering that this is a large-N data set that codes quite subjective entities. If we randomly choose countries and then look to see what groups appear in the list, most of the time we find a good correspondence between the groups included and our own sense of how people in the country code them (keeping in mind the "at risk" criteria of selection).

On the dependent variable side, we are most interested in the number of deaths due to ethnic violence per year and per capita for each minority group. These data are not available; the MAR data set contains two variables, however, that are imperfect measures of levels of group violence: REBEL (for "rebellion") and COMCON (for "communal conflict"), each of them coded for every five-year period from 1945 through 1994. Informal validity checks for REBEL give us confidence in relying upon it as a proxy for our dependent variable. We listed all cases of high REBEL scores and for all but one case, our independent reading of the sources showed very high numbers of deaths directly attributable to the ethnic conflict. Moreover, the Gurr measure "gets right" a comparison that one might get wrong if one looked only at column-inches in the U.S. press: The worst cases of ethnic violence in Western Europe, such as Northern Ireland and Basque separatism in Spain, rate only low scores on the REBEL scale, as examples of "campaigns of terrorism." We believe this is a reasonable reflection of relative fatalities, since in both these cases between 1,000 and 3,200 have been killed over almost 30 years, a not atypical number for a five- or even one-year period in many of the more serious "third world" cases. Finally, a partial check is afforded by using the rough estimates of fatalities in 50 of the "most serious ethnopolitical conflicts" that Gurr studies in his 1994 article. We constructed a variable, DEATHS, from the estimates for these 50 cases, assigning a value of 0 to all other cases except those that have had some experience of large-scale guerilla activity or protracted civil war since 1945, which were treated as missing data. (For most of these cases, we know that fatalities are quite high, easily above the 1,000 total threshold, such as Ethiopia, Lebanon and Chechnya.) (When Gurr's estimates give a range, we used the midpoint of the range, and we gave the same estimate to each group involved in the conflict. The results do not differ much if the deaths estimate is divided by number of groups involved or by group populations.) The bivariate correlation of the log of this variable (LNDETH) with the maximum rebellion score since 1945 (our recode, MAXREB45) is .73. (To avoid log(0), we add 1 to DEATHS before taking the log.) Even more impressively, we find that if LNDETH is used as the dependent variable in our regressions, the results are again substantially unchanged (in many cases they are even stronger). This too gives us some confidence that REBEL is a reasonable proxy for levels of ethnic violence.
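
The construction of LNDETH described above can be sketched in a few lines. This is only an illustration under stated assumptions: the rows are invented, the group labels are hypothetical, and the helper names (midpoint, log_deaths, pearson) are ours rather than anything in the MAR data set. It shows the midpoint rule for ranges, the log(DEATHS + 1) transform, the exclusion of missing cases, and a plain bivariate correlation with MAXREB45.

```python
import math

def midpoint(low, high):
    """Midpoint of a fatality range, used when an estimate is given as a range."""
    return (low + high) / 2.0

def log_deaths(deaths):
    """log(DEATHS + 1); a missing value (None) stays missing."""
    return None if deaths is None else math.log(deaths + 1)

def pearson(xs, ys):
    """Plain bivariate (Pearson) correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented rows for illustration only: (label, DEATHS, MAXREB45).
# DEATHS is None where the case would be treated as missing data.
rows = [
    ("group_a", midpoint(1000, 3000), 4),
    ("group_b", 0, 0),
    ("group_c", None, 6),
    ("group_d", 50000, 7),
    ("group_e", 0, 1),
]

# Keep only complete cases, then transform and correlate.
complete = [(log_deaths(d), reb) for _, d, reb in rows if d is not None]
lndeth = [ld for ld, _ in complete]
maxreb45 = [reb for _, reb in complete]
r = pearson(lndeth, maxreb45)
print(f"cases used: {len(complete)}, correlation: {r:.2f}")
```

Adding 1 before taking the log keeps the zero-fatality cases in the sample with a value of 0 rather than dropping them, which matters because most groups fall in that category.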

On the independent variable side of the equation, Gurr and associates have constructed scales for group history and status, opportunities for group political action, global processes shaping the context of political action, and international factors facilitating political action. (For the flow diagram of Gurr's theoretical take on the data, see Gurr 1993, 125.) We shall comment critically on some of these scales, but as explained below, several of them are quite worthwhile and revealing for the general picture that we hope to draw.

As for controls, the MAR data base codes cases based on region, and some of the best work with the data finds distinctive regional patterns of ethnic conflict (e.g. Scarritt and McMillan 1995). The MAR data base also allows for controls based upon types of politicized communal groups (ethnonationalists, indigenous peoples, ethnoclasses, militant sects, and both advantaged and disadvantaged communal contenders). This allows analysts to check theories across different types of communal group, or to control for communal group type in regression equations.

The promise of this data set is good. But there are some difficult problems, as yet unresolved, that undermine the validity of the results for all members of the scientific community who rely on these data. These problems include ones of selection of cases, inaccurate coding of existing variables, omission of important variables that speak to standard theories of ethnic conflict, and inclusion of variables that are endogenous to ethnic conflict itself.

Selection Problems

To be included, a group had to reside in a country with population greater than one million in 1990, had itself to have a population greater than 100,000 or 1 percent of country population, and had to meet at least one of the four criteria Gurr et al. used to decide if the group was "at risk." For the "at risk" criteria, Gurr et al. asked whether (1) the group suffers "discrimination" relative to other groups in the country, (2) the group is "disadvantaged from past discrimination," (3) the group is "an advantaged minority being challenged," or (4) the group is "mobilized," meaning that "the group (in whole or part) supports one or more political organizations that advocates greater group rights, privileges, or autonomy" (Manual, pp. 7 and 65).
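
The inclusion rule just described can be written out as a small predicate. This is a sketch of our reading of the rule, not the MAR team's coding procedure: the Group record and its field names are hypothetical, and we assume the size test means "more than 100,000 persons, or more than 1 percent of country population," since the text does not say whether the thresholds are strict.

```python
from dataclasses import dataclass

@dataclass
class Group:
    # Hypothetical record; field names are ours, not MAR variable names.
    name: str
    country_pop_1990: int
    group_pop_1990: int
    discriminated: bool = False            # criterion (1)
    disadvantaged_from_past: bool = False  # criterion (2)
    advantaged_challenged: bool = False    # criterion (3)
    mobilized: bool = False                # criterion (4)

def meets_size_rule(g: Group) -> bool:
    """Country over one million, and group over 100,000 or over 1 percent."""
    big_country = g.country_pop_1990 > 1_000_000
    big_group = (g.group_pop_1990 > 100_000
                 or g.group_pop_1990 > 0.01 * g.country_pop_1990)
    return big_country and big_group

def at_risk(g: Group) -> bool:
    """At least one of the four 'at risk' criteria holds."""
    return any([g.discriminated, g.disadvantaged_from_past,
                g.advantaged_challenged, g.mobilized])

def included(g: Group) -> bool:
    return meets_size_rule(g) and at_risk(g)

# Invented examples for illustration only.
examples = [
    Group("large mobilized group", 5_000_000, 200_000, mobilized=True),
    Group("group in a small country", 900_000, 200_000, mobilized=True),
    Group("small group above 1 percent", 2_000_000, 30_000, discriminated=True),
    Group("large group, no criterion met", 5_000_000, 200_000),
]
for g in examples:
    print(g.name, included(g))
```

Written this way, the rule makes plain that the size thresholds are mechanical while the four boolean criteria carry all of the coders' judgment.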

Obviously, then, the criteria for inclusion are subjective and may be contestable for specific cases. There is also the problem of how to decide what the "group" is in cases where group boundaries and self/other descriptions are contested or unclear. For instance, MAR codes as single groups "Hispanics" in the U.S. and "Pashtuns" in Afghanistan, when one could argue for greater disaggregation in each case; certainly "Southerners" in Chad and Sudan could be greatly disaggregated. "Russians" are coded as a minority in Ukraine, though group boundaries in this case are at best "in formation" (Laitin 1998a). The coding of a number of the African groups in the sample could be criticized on similar grounds, and our impression is that quite a few African groups are omitted altogether that arguably might satisfy the "at risk" criteria. The African cases present special problems. The data base does not include ethnic groups in Malawi, the Central African Republic, Gabon, Liberia, and Tanzania. For Somalia it includes the Isaaq, but excludes other clans equally at risk. Nor are the ambiguities confined to Africa. The lumping together of almost all minority ethnic groups in Latin American countries under the heading "indigenous peoples" is problematic. And ignoring the Flemish in Belgium is surprising.

Interestingly, Gurr et al. never address the problem of defining what bases and indicators of groupness potentially qualify a group for inclusion in the list. The implicit criterion seems to be that group membership must be mainly reckoned by descent by people in the country. The vast majority of the groups in the data set could be referred to as "ethnic" in ordinary language, and the vast majority in fact are. But if descent is crucial, why are Indian castes omitted, and why are the Baha'is of Iran, a religious sect where being a believer is probably close to necessary and sufficient to be a member of the group, included?

One other important feature of the MAR case selection should be stressed: The most politically dominant ethnic group in a country is not included, unless (this is our best guess) the group is a "minority" in the numerical sense of having less than 50 percent of country population. Thus, "whites" are not listed in the U.S., "French" in France, "Germans" in Germany, "Malays" in Malaysia, "Russians" in Russia, "Estonians" in Estonia, and so on, because these groups constitute numerical majorities -- they are not "minorities at risk." But if a politically dominant ethnic group has less than 50 percent of country population, it may be included, even if it is the largest group in the country. Thus, Pashtuns in Afghanistan (38 percent of population) and Sunnis in Lebanon (30 percent) are included as "advantaged minorities being challenged," though each forms a "majority" in the sense of a plurality. The few other politically dominant groups in the sample are both numerical minorities and smaller in population than some other group in the country, such as Tutsis in Burundi and Rwanda, Alawi in Syria, Kalenjins in Kenya, and Ngbandi in Zaire. (Most of the 48 groups coded as "advantaged minorities being challenged" in the data set are not politically dominant. Twelve of these are "Russians" in the former Soviet Socialist Republics (counting also "Slavs" in Moldova). The list also includes a number of "trader" or "middleman" minorities, such as Chinese in several Southeast Asian countries, and a few cases of Europeans in southern African countries.) The data set also includes some numerical minorities that form a plurality but are not coded as "advantaged" (such as Kikuyu in Kenya, Oromo in Ethiopia, and Bosnian Muslims in Bosnia). Finally, to add to the confusion, five "disadvantaged" groups that form absolute majorities are included (Hutus in Burundi and Rwanda, "Highland Indigenous Peoples" in Bolivia, Shi'is in Iraq, and Taiwanese in Taiwan).

While many of these inconsistencies are the result of judgment calls (and the judgment of the MAR coding team is rather good), one issue concerning selection is of overriding importance. As identified in our first cut through these data (Fearon and Laitin, 1998), and re-emphasized by Hug (1998), if REBEL is the dependent variable, there is a built-in bias in the criteria of selection of cases. To be included in the MAR sample at all, a group must be larger than 100,000 persons in 1990, and be "at risk," which for Gurr et al. essentially means that the group is either "mobilized," subject to discrimination, or at a major economic disadvantage. Thus, the sample does not include the large number of ethnically defined groups that are small or that are not already marked by factors that might increase their odds of being engaged in violent conflict. Another way to put this is that virtually all cases in which there would be a high score for REBEL are included in the data base; yet not all cases are included where REBEL is nil, because under conditions of peace, the group may be insufficiently mobilized to catch the attention of coders. Therefore, the selection of cases induces an overprediction of rebellion.

To be sure, the MAR data show that large-scale ethnic violence is relatively rare. Only 15 percent of the cases reach the highest level of REBEL at some time in the period 1945-1994. Only 35 percent reach the level of "small scale guerilla war" or greater at some time in this period. And fully 45 percent never rate above 0 on the REBEL scale in the whole post-war period. Even so, given the selection bias, the data overpredict ethnic violence, and this problem requires a solution.
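
The direction of the selection bias discussed above can be seen in a toy calculation. The ten groups below are entirely invented and stand in for no real cases; the only assumption built in is the one argued in the text, namely that peaceful groups are less likely to be flagged as "at risk" and so less likely to enter the sample.

```python
# Invented toy population: (at_risk, max_rebel_score) for ten hypothetical groups.
# Every group with a positive rebellion score is flagged as at risk, but many
# peaceful groups are not flagged and therefore never enter the sample.
population = [
    (True, 5), (True, 3), (True, 0), (True, 0),
    (False, 0), (False, 0), (False, 0),
    (False, 0), (False, 0), (False, 0),
]

def share_rebelling(groups):
    """Share of groups whose maximum rebellion score is above zero."""
    return sum(1 for _, rebel in groups if rebel > 0) / len(groups)

# The "sample" keeps only the groups that satisfy the at-risk criterion.
sample = [g for g in population if g[0]]

full_share = share_rebelling(population)
sample_share = share_rebelling(sample)
print(full_share, sample_share)
```

In this toy case the share of groups with any rebellion rises from 0.2 in the full population to 0.5 in the selected sample, even though no group's behavior has changed. An analysis run on the selected sample alone would overstate the baseline likelihood of rebellion in the same way.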

Inaccurate Coding

The Phase III MAR data set with which we worked, despite careful coding procedures, still has substantial coding errors. The University of Maryland team has been proactive in correcting many of the errors, and continues to do so. We feel, however, that some of the important variables require major surgery. Rather than provide a catalogue of small problems, we present one important variable that we have found to be flawed.

The MAR variable "culdifx2" measures linguistic difference between the minority and the dominant group. The values of this variable go from "0" (No Difference) through "2" (Extreme Difference). Culdifx2 is an element of the index variable for cultural difference (culdifx), which plays a role in Gurr's own explanation for grievances in regard to group autonomy in the Middle East and Latin America (Gurr 1993, 80-81).

Despite the apparent ease of coding such a variable, the concept of linguistic difference is a tricky one to nail down. Gurr's scale is not specified in either the book or the users' manual. One might legitimately ask, then, how one codes the language of either of the groups. For example, what is the language of the dominant group in Kenya? At the time of writing, the President's ancestral language is Kalenjin. The Kalenjin have garnered considerable resources due to President Daniel arap Moi's power to influence the distribution of resources. Yet he might speak Swahili -- a lingua franca throughout much of East Africa -- more often than he speaks Kalenjin; but on most official matters, concerning business and high government affairs, he is more likely to speak in English.